LegUp: Open-Source High-Level Synthesis Research Framework

by

Andrew Christopher Canis

A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2015 by Andrew Christopher Canis

Abstract

LegUp: Open-Source High-Level Synthesis Research Framework

Andrew Christopher Canis
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2015

The rate of increase in computing performance has been slowing due to the end of processor frequency scaling and diminishing returns from multiple cores. We believe the industry is heading towards heterogeneous computing, an accelerator era, where specialized hardware is harnessed for better power efficiency and compute performance. A natural platform for these accelerators is the field-programmable gate array (FPGA), an integrated circuit that can implement large custom digital circuits, including complete systems-on-chip. However, programming an FPGA can be an arduous undertaking even for experienced hardware engineers. We propose raising the abstraction level by allowing a designer to incrementally move their design from a processor to a set of hardware accelerators, each automatically synthesized from a software implementation. This dissertation describes LegUp, an open-source high-level synthesis (HLS) framework that enables this new design methodology. We further present novel improvements to the quality of the synthesized circuits when targeting FPGAs.

First, we present the LegUp high-level synthesis framework with an overview of our design flow. The software is unique among academic tools in offering wide support for the ANSI C language, targeting a hybrid processor/accelerator architecture, and being open-source. We also show that the quality of results produced by LegUp is competitive with a commercial HLS tool.

Next, we present an FPGA architecture-specific HLS resource sharing approach. Our technique multi-pumps high-speed DSP blocks on modern FPGAs by clocking them at twice the system clock frequency. We show that multi-pumping can reduce circuit area without impacting performance.

Following this, we describe a novel loop pipeline scheduling algorithm. Our approach handles complex constraints by using a backtracking method to discover better scheduling possibilities. This scheduling algorithm improves throughput for complex loop pipelines compared to prior work and a commercial tool.

Finally, we examine LegUp's target memory architecture and describe how to partition memory within the circuit hierarchy using information from compiler alias analysis. We also present a method to efficiently use the block RAMs present in modern FPGAs by grouping memories together. These techniques decrease memory usage and improve performance for our HLS-generated circuits.

Acknowledgements

There have been many people involved in the LegUp project with whom I was immensely lucky and grateful to work over the years. This dissertation would not have been possible without my two incredible supervisors and mentors. I would like to thank my co-advisor, Jason Anderson, for his guidance and mentorship throughout my studies. Jason dedicated significant time to the LegUp project, spending many hours in meetings, recruiting students, organizing tutorials, and spreading the word about LegUp. I admire your work ethic and I have vastly improved my ability to write and conduct research by learning from your example. Also thanks to my co-advisor, Stephen Brown, for your high-level vision and candid advice, and for giving me the flexibility to follow my own research path. Thanks to the members of my committee, Vaughn Betz, Jianwen Zhu, and Andreas Koch, for their edits and feedback on this work.

I would like to thank all the other graduate students involved with the LegUp project. I was lucky to work with such a smart team: Blair Fort, Ruo Long (Lanny) Lian, Nazanin Calagar, Li Liu, Marcel Gort, Bain Syrowik, Joy (Yu Ting) Chen, and Julie Hsiao. In particular, I wanted to thank Jongsok (James) Choi, with whom I spent many long nights debugging signal waveforms and improving LegUp. Also Mark Aldham, for working on the initial version of LegUp and running power simulations. Thanks to all the LegUp summer undergraduate students: Victor Zhang, Ahmed Kammoona, Stefan Hadjis, Kevin Nam, Qijing (Jenny) Huang, Ryan Xi, Emily Miao, Yolanda Wang, Yvonne Zhang, William Cai, and Mathew Hall, who were all a joy to work with and pushed the LegUp project further. Thanks to all the other graduate students from Pratt 392, especially Mehmet Avci, Jason Luu, and Braiden Brousseau, for your many entertaining discussions over the years.

Thanks to Altera employees Tomasz Czajkowski and Deshanand Singh, who gave feedback and some initial guidance for this research direction, and to Altera for funding the project. I would also like to thank Philippe Coussy, Daniel Gajski, and Jason Cong for organizing a fascinating tutorial that I attended at DAC in 2009, which influenced the work here. I am also grateful to CMC for providing us with Modelsim licenses. Special thanks to the dependable administrative support from Kelly, Judith, and Darlene. I also appreciated the inspiring entrepreneurship talks and dinners organized by Professor Jonathan Rose.

I am grateful to the Canadian government for their generous scholarships through the Natural Sciences and Engineering Research Council and the Ontario Graduate Scholarship. I thank the Rogers family for their generous scholarships and for supporting the ECE faculty.

Thanks to my friends and roommates for all the fun outside of school over the past six years, especially Adam, Michael, Paul, Mark, and Alex. I am truly grateful for the loving support of my parents, Anne and Frank, and my brothers: Lloyd, Stephen, and Ian. Thanks for believing in me, supporting my education, and teaching me to always try my best. Finally, thanks to Sabrina for all the love, support, and constant thoughtfulness!

Our grand business undoubtedly is, not to see what lies dimly at a distance, but to do what lies clearly at hand.

— Thomas Carlyle

Contents

1 Introduction
  1.1 Research Motivation
  1.2 Research Contributions
  1.3 Organization

2 Background and Related Work
  2.1 Introduction
  2.2 Modern Computation Platforms
  2.3 High-Level Synthesis Flow
  2.4 C Compiler: Low-Level Virtual Machine (LLVM)
  2.5 Allocation
  2.6 Scheduling
    2.6.1 SDC Scheduling
    2.6.2 Extracting Parallelism
  2.7 Binding
  2.8 FPGA Architecture

3 LegUp: Open-Source High-Level Synthesis Research Framework
  3.1 Introduction
  3.2 Background
    3.2.1 Prior HLS Tools
    3.2.2 Application-Specific Instruction-Set Processors (ASIPs)
  3.3 LegUp Overview
    3.3.1 Design Methodology
    3.3.2 Target System Architecture
  3.4 LegUp Design and Implementation
    3.4.1 Hardware Modules
    3.4.2 Device Characterization
    3.4.3 Hardware Profiling
    3.4.4 Hybrid Processor/Accelerator System
    3.4.5 Language Support and Benchmarks
    3.4.6 Circuit Correctness
    3.4.7 Extensibility of LegUp to Other FPGA Devices
  3.5 Experimental Study
    3.5.1 Experimental Results
    3.5.2 Comparison to Current LegUp Release
  3.6 Research using LegUp
  3.7 Summary

4 Multi-Pumping for Resource Reduction in FPGA High-Level Synthesis
  4.1 Introduction
  4.2 Background
  4.3 Multi-Pumped Multiplier Units: Concept and Characterization
    4.3.1 Multi-Pumped Multiplier Characterization
    4.3.2 Multi-Pumping vs. Resource Sharing
  4.4 Multi-Pumping DSPs in High-Level Synthesis
    4.4.1 DSP Inference Prediction
  4.5 Experimental Study
  4.6 Summary

5 Modulo SDC Scheduling with Recurrence Minimization in HLS
  5.1 Introduction
  5.2 Preliminaries
    5.2.1 Related Work
    5.2.2 Background: Loop Pipeline Modulo Scheduling
    5.2.3 Background: Loop Pipeline Hardware Generation
  5.3 Motivation
    5.3.1 Greedy Modulo Scheduling Example
  5.4 Modulo SDC Scheduler
    5.4.1 Detailed Scheduling Example
    5.4.2 Complexity Analysis
  5.5 Loop Recurrence Optimization
  5.6 Experimental Study and Results
    5.6.1 Runtime Analysis
  5.7 Summary

6 LegUp: Memory Architecture
  6.1 Introduction
  6.2 Background
    6.2.1 Related Work
    6.2.2 Alias and Points-to Analysis
  6.3 LegUp Memory Architecture
    6.3.1 Overview
    6.3.2 Global Memory Blocks
  6.4 Local Memory Blocks
  6.5 Grouped Memories
    6.5.1 Grouped Memory Allocation
  6.6 Experimental Study
  6.7 Summary

7 Case Study: LegUp vs Hardware Designed by Hand
  7.1 Introduction
  7.2 Background
    7.2.1 HLS vs Hand RTL
    7.2.2 Sobel Filter
  7.3 Custom Hardware Implementation
  7.4 LegUp Implementation
  7.5 Experimental Study
  7.6 Summary

8 Conclusions
  8.1 Summary and Contributions
  8.2 Future Work
    8.2.1 Extensions of this Research Work
    8.2.2 Improvements to LegUp
    8.2.3 Additional High-Level Synthesis Research Directions
  8.3 Closing Remarks

References

A LegUp Source Code Overview
  A.1 LLVM Backend Pass
  A.2 LLVM Frontend Passes

List of Tables

3.1 Release status of recent non-commercial HLS tools
3.2 LegUp memory signals
3.3 LegUp C language support
3.4 Core benchmark programs included with LegUp
3.5 Speed performance results
3.6 Area results
3.7 Power and energy results [Aldha 11b]
3.8 LegUp 1.0 vs. current LegUp version (hardware-only implementation)

4.1 Area results (TRS: Traditional Resource Sharing, MP: Multi-Pumping)
4.2 Speed performance results (TRS: Traditional Resource Sharing, MP: Multi-Pumping)

5.1 Algorithm Example (II=3)
5.2 Minimum initiation interval of benchmarks for balanced vs. proposed restructuring
5.3 Operation and dependency characteristics of each benchmark
5.4 Speed performance results
5.5 Speed performance results
5.6 Area comparison experimental results
5.7 Tool runtime (s) comparison

6.1 Naive grouped RAM memory allocation
6.2 Grouped RAM memory allocation with reduced fragmentation
6.3 Memory architecture performance results
6.4 Memory architecture area results

7.1 Sobel Gradient Masks
7.2 Experimental Results

List of Figures

1.1 Clock frequency scaling trends [Stan 14]
1.2 Cost per gate scaling trends [Inte 13]. 90nm–20nm costs assume two years of high-volume production; 16/14nm costs are estimated for FinFET in 2016.

2.1 Spectrum of computation platforms
2.2 High-Level Synthesis flow
2.3 Control flow graph (CFG) and data flow graph (DFG) of Figure 2.4
2.4 C code for FIR filter
2.5 LLVM IR for FIR filter
2.6 Scheduling the DFG of a basic block
2.7 Basic block ASAP scheduling ignoring resource constraints
2.8 Scheduled FIR filter LLVM instructions with data dependencies
2.9 System of difference constraints graph
2.10 Circuit datapath after binding for the given schedule (given one adder)
2.11 Bipartite Graph
2.12 Cyclone II and Stratix IV logic element architectures

3.1 LegUp Design Methodology
3.2 LegUp target hybrid processor/accelerator architecture
3.3 LegUp hardware module interface
3.4 Initial state of a LegUp hardware module's finite state machine
3.5 Vector addition C function targeted for hardware
3.6 Modified C function to call hardware accelerator for function in Figure 3.5
3.7 Summary of geomean experimental results across the benchmark suite

4.1 Multi-pumped multiplier (MPM) unit architecture
4.2 Clock follower circuit from [Tidwe 05]
4.3 Multi-pumped multiplier unit FMax characterization
4.4 Multi-pumped multiplier unit register characterization
4.5 Loop schedule: multiplier sharing vs. multi-pumping
4.6 Loop hardware: original vs. resource sharing
4.7 Image after Sobel edge detection and Gaussian blur

5.1 Time sequence of a loop pipeline with II=2 and five loop iterations (i = 0 to 4)
5.2 Loop pipelining with a recurrence
5.3 C code for loop
5.4 Loop pipelining Figure 5.3 with II=2
5.5 SDC Modulo Scheduling for II=3
5.6 Restructured loop dependency graph achieves II=1
5.7 Dependency graph restructuring
5.8 Incremental Associativity Transformation
5.9 Backtracking SDC modulo scheduling experimental results
5.10 Runtime Characterization For Loop Pipelining Scheduling Algorithms
5.11 Initiation Interval for Loop Pipelining Scheduled in Figure 5.10

6.1 C snippet showing an example of global and function-scoped memory variables
6.2 LLVM intermediate representation example showing global and stack memory
6.3 HLS memory binding and memory interconnection network
6.4 LegUp 32-bit pointer address encoding
6.5 LegUp memory controller block diagram
6.6 LegUp shared memory controller when loading array element output[13]
6.7 Relationship between program call graph and hardware module instantiations
6.8 Multiplexing required for the memory address at each level of the module hierarchy
6.9 Local and global memory addressing logic within the hardware module datapath
6.10 LegUp allocating one physical RAM for each array
6.11 Grouping arrays into physical RAMs in LegUp's shared memory controller
6.12 Grouped memory array address offsets

7.1 Sobel stencil sliding over input image
7.2 C code for Sobel Filter
7.3 Sobel hardware line buffers and stencil shift registers
7.4 Calculating the Sobel edge weight using the stencil window
7.5 C code for the stencil buffer and line buffers synthesized with LegUp
7.6 Optimized C code for synthesized Sobel Filter with LegUp

Chapter 1

Introduction

Over the past four decades we have seen a tremendous increase in computing performance. This computing progress has been driven by Moore's law, an observation that the number of transistors on the latest integrated circuit doubles every 18 months [Moore 65]. Since the early 1970s, this trend has been accomplished by scaling silicon transistors to smaller dimensions using Dennard scaling [Denna 74]. Dennard showed that by scaling the transistor dimensions by 1/√2 (70%), the transistor count doubles, frequency increases by 40%, and the total power remains constant (a short derivation is sketched below). However, Dennard scaling ended in the early 2000s due to increased transistor leakage current, which prevented a reduction in the transistor threshold voltage VT and thereby limited the scaling of the power supply voltage [Denna 07]. As frequency continued to scale, chip power densities increased exponentially, eventually producing a thermal gradient at the limit of what could be cooled using reasonable technology [Ellsw 04]. Consequently, computer processor clock frequencies plateaued in 2004, as shown in Figure 1.1, which implies that single-thread processor performance will eventually stagnate. Chip manufacturers reacted by increasing the number of processing cores available to achieve performance gains, with four cores typical today [Lempe 11].

Moore's law has continued, with transistors doubling every generation but with significantly less improvement to transistor performance and energy efficiency. Given a fixed chip power budget, as we increase the number of transistors without improving the per-transistor energy efficiency, we must power off a portion of the computer chip, a trend called dark silicon. Projections show that 50% of the chip could be "dark" within three process generations [Esmae 13]. Even Moore's law may soon end due to economic considerations. Silicon foundries such as Taiwan Semiconductor Manufacturing Company (TSMC) are finding that for the newer processes, the cost per gate is no longer decreasing, as shown in Figure 1.2. The latest 16/14nm FinFET [Hisam 00] transistors are estimated to cost $0.0162 per million gates in 2016, which is 14% more than today's 20nm transistors at $0.0142. If Moore's law does end, this may lead to commoditization of the semiconductor industry, with correspondingly lower profit margins and a shift to maintaining pre-existing products with less new development.

These recent trends motivate chip designers to use silicon area more power-efficiently due to chip power constraints. Furthermore, we cannot rely on Moore's law to continue to increase computational performance; instead we need to squeeze more performance out of the available transistors. Going forward, these constraints will increasingly be met by heterogeneous computing: combining traditional multi-cores with customized hardware accelerators that offer better energy efficiency and performance for specific applications [Brodt 10].
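The constant-field arithmetic behind Dennard's observation works out as follows (a textbook derivation, not reproduced from the thesis). Scaling all transistor dimensions and voltages by $k = 1/\sqrt{2} \approx 0.7$ shrinks each transistor to $k^2 = 0.5$ of its former area (doubling the count in fixed silicon), speeds up switching by $1/k \approx 1.4\times$, and scales capacitance $C$ and supply voltage $V$ by $k$. Per-transistor dynamic power then halves:

$$P = C V^2 f \;\to\; (kC)(kV)^2 \frac{f}{k} = k^2 \, C V^2 f = 0.5\,P$$

Twice the transistors at half the power each leaves total chip power and power density constant. Once leakage prevented $V$ (and the threshold voltage $V_T$) from scaling further, the $V^2$ factor stopped shrinking, and continued frequency scaling drove power density up instead.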


Figure 1.1: Clock frequency scaling trends [Stan 14]

Figure 1.2: Cost per gate scaling trends [Inte 13]. 90nm–20nm costs assume two years of high-volume production; 16/14nm costs are estimated for FinFET in 2016.

As evidence of the shift to heterogeneous computing, we observe that 30 of the top 100 supercomputers were using accelerator/co-processor technology as of November 2014 [TOP5 14]. Seventeen supercomputers are using Nvidia graphics processing units (GPUs), eleven are using Intel Xeon Phi co-processors, one uses AMD GPUs, and one uses IBM PowerXCells. The Intel Xeon Phi is a multi-core, 512-bit SIMD, x86-compatible co-processor platform that plugs into a standard PCIe slot and has a peak performance of 1 TFLOPS in double precision (2 TFLOPS in single precision) [Heine 13]. In the rapidly growing mobile space, mobile systems-on-chip (SoCs) now contain specialized hardware cores to save power, including a digital signal co-processor (DSP), GPU, sensor core, GPS, modem, and multimedia cores [Yang 14]. We are entering an accelerator era, where hardware accelerators are common in heterogeneous many-core systems [Borka 11].

1.1 Research Motivation

There are a few ways of developing a custom hardware implementation of a set of computations. Application-specific integrated circuits (ASICs) offer the highest-performance and lowest-power custom accelerators, but are uneconomical for most applications — ASIC chip design requires over $100M in non-recurring engineering costs at the 28nm process node [Quinn 15]. Another way to realize custom hardware accelerators is to use field-programmable gate arrays (FPGAs), which are integrated circuits that can be programmed to implement arbitrary digital logic. FPGAs have the advantage of being reprogrammable, so they can offer some of the advantages of custom hardware without requiring the user to fabricate a custom computer chip. Additionally, FPGA devices have grown larger in recent years and can now accommodate a complete system-on-chip, including an embedded ARM hard processor like you would find in a smartphone [DE1 13]. Therefore, this dissertation focuses on FPGAs as a target platform for developing hardware accelerators.

Research has shown that implementing a design on an FPGA can offer orders-of-magnitude improvement over a processor in terms of energy efficiency and performance for some applications [Cong 09, Luu 09]. However, custom hardware on FPGAs has not yet been widely adopted for general purpose compute acceleration. Adoption has been limited by two factors. First, FPGAs have historically had poor floating point performance compared to GPUs and CPUs, and therefore a low cost-effectiveness for high performance computing [Crave 07]. However, this may soon change with the new Altera Stratix 10 FPGA [Stra 14], which claims 10 TFLOPS of single-precision floating point performance using hardened floating point cores. Second, we believe that a major impediment to FPGA adoption is that the cost and difficulty of hardware design is often prohibitive; consequently, a software approach is used for most applications. A typical high performance computing user is a scientist or researcher looking to accelerate a scientific application, and typically they have no hardware design knowledge. Design effort for an FPGA implementation is typically an order of magnitude greater than for software development, due to the lower level of abstraction [Rupno 11]. A hardware engineer must choose a suitable circuit datapath architecture down to the bit level, implement control logic, verify the circuit functionality with a cycle-accurate simulator, and finally use a static timing analysis tool to ensure timing constraints are met. The market for FPGAs could grow tremendously if this programmability hurdle were lowered, especially considering that software developers outnumber hardware designers 10 to 1 [Occu 10].

The overarching aim of my PhD research is to offer a new programming paradigm for FPGAs that simplifies the design process for engineers familiar with software development. We propose the following incremental design methodology. First, the designer implements their application in software using C, targeting a processor running on the FPGA device. As the application executes, a built-in profiler identifies critical sections of the code that would benefit from a hardware implementation. These segments are then automatically synthesized into hardware accelerators, which the processor uses to improve performance.
In this self-accelerating adaptive system, the designer can harness the performance and energy benefits of an FPGA using an incremental design methodology. Alternatively, we can synthesize the entire program into hardware. By designing at a higher level of abstraction, the circuit designer can work more productively and achieve faster time-to-market than using hand-coded register transfer level (RTL) designs.

We have implemented our described approach in an open-source research framework called LegUp. LegUp allows designers to compile C code directly into a functionally equivalent hardware implementation that can be programmed onto an FPGA/processor embedded system. This compilation process, referred to as high-level synthesis (HLS) in the literature, involves automatically generating a cycle-accurate RTL circuit description from a high-level untimed C software specification. High-level synthesis has been studied in academia since the 1980s [McFar 88, Pauli 89, Gajsk 92] to address the issue of hardware design complexity by allowing engineers to use software to describe hardware. In recent years, high-level synthesis has gained traction as a viable approach for designing hardware, as evidenced by new commercial offerings from the two largest FPGA vendors: OpenCL from Altera [Open] and Vivado HLS from Xilinx [Xili]. However, HLS is still primarily used by hardware designers at companies like Samsung, Qualcomm, Sony, and Toshiba [Cooleb].

Despite high-level synthesis being a well-studied research area, in 2011 there were no robust open-source platforms for performing HLS research, forcing academics to build up their own infrastructure from scratch. The infrastructure required for high-level synthesis is quite extensive. LegUp is built within the open-source C compiler LLVM [Lattn 04], which includes modern compiler optimizations. We provide support for synthesizing all the various language constructs of ANSI C into hardware, except function pointers and recursion. LegUp-generated RTL is synthesizable on Altera FPGAs and utilizes block RAMs, multipliers, dividers, and floating point units. We also support hardware/software partitioning using various processors: 1) a soft MIPS core, 2) a hard ARM core, or 3) an x86 processor connected via PCI Express. LegUp automatically generates the interconnection logic between the HLS-generated accelerators and the processor. We also generate software running on the processor to marshal data to and from the hardware accelerators and to control their execution. The high-level synthesis research community was lacking a robust, well-tested, open-source academic infrastructure that could lower the barrier to entry for new researchers—LegUp fills this gap.

Since our first release of LegUp in March 2011, the project has been well received in the academic community. Our original conference paper [Canis 11] has 148 citations and we have had two invited papers for the LegUp project [Fort 14, Canis 13b]. LegUp is open-source and freely available (http://www.legup.org), and the source code has been downloaded by over 1200 unique researchers from outside the University of Toronto since our first release. LegUp enables future high-level synthesis research projects in the spirit of the Verilog-to-Routing (VTR) system for FPGA CAD research [Luu 14a]. In the long term, we hope our research will lead to wider adoption of FPGAs by software engineers, allowing them to implement fast and energy-efficient FPGA applications in areas such as cancer treatment, gene sequencing, finance, and oil exploration.

1.2 Research Contributions

The aim of my PhD research is to achieve three broad goals:

1. Make FPGAs easier to program.

2. Provide an open-source framework to enable further research in high-level synthesis.

3. Improve high-level synthesis quality of results towards generating FPGA designs comparable to hand-written RTL implementations.

In order to achieve these goals, we make several contributions, as summarized below:

Chapter 3 presents LegUp, an open-source high-level synthesis implementation and an associated set of benchmarks. We show that with LegUp, a hardware designer can program an FPGA using only C, without writing a single line of RTL. Here we give an overview of the LegUp design flow, describing the high-level synthesis algorithms performed at each step of the process, and we provide a description of the final circuit architecture generated by LegUp. Furthermore, we quantitatively assess LegUp's quality of results by comparing LegUp to a commercial HLS tool (eXcite) using the largest HLS benchmark suite available in the literature (CHStone). These comparisons show that circuits synthesized by LegUp are comparable to those from the commercial tool eXcite, with a geomean benchmark execution time that is 18% faster than eXcite, while having 16% higher geomean area. This work has been published in [Canis 11, Canis 12, Canis 13b]. In later chapters, we show that LegUp enables us to conduct research and evaluate new high-level synthesis algorithms.

Chapter 4 uses LegUp to investigate novel FPGA architecture-specific enhancements to high-level synthesis. We present a new approach to resource sharing that allows multiple operations to be performed by a single functional unit in one clock cycle. Our approach is based on multi-pumping, which operates functional units at a higher frequency than the surrounding system logic, typically 2×, allowing multiple computations to complete in a single system cycle. Our method is particularly effective for the DSP blocks on modern FPGAs. We show that multi-pumping is a viable approach to achieve the area reductions of resource sharing, with considerably less negative impact on circuit performance. This work has been published in [Canis 13a].

Chapter 5 describes an improved high-level synthesis scheduling algorithm. In many C applications, the majority of run time is spent executing critical loops. The high-level synthesis scheduling technique called loop pipelining exploits parallelism across loop iterations to generate hardware pipelines. Loop pipelining increases parallelism and hardware utilization, creating circuits similar to hand-coded hardware architectures. However, industrial designs often have resource constraints and constraints imposed by loops with cross-iteration dependencies. The interaction between multiple constraints can pose a challenge for HLS scheduling algorithms and, if not handled properly, can lead to suboptimal loop pipeline schedules. We present a novel scheduler based on the SDC scheduling formulation [Cong 06b] that includes a backtracking mechanism to properly handle conflicting scheduling constraints and achieve better pipeline performance. The SDC formulation has the advantage of being a mathematical framework that supports flexible constraints that are useful for more complex loop pipelines. Furthermore, we describe how to apply associative expression transformations during scheduling to restructure recurrences in complex

loops to enable better scheduling. We compared our techniques to existing prior work on loop pipeline scheduling in HLS [Zhang 13] and also against a state-of-the-art commercial tool. Over a suite of benchmarks, we show that our approach can result in a geomean wall-clock time reduction of 32% versus prior work and 29% versus a commercial HLS tool. This work has been published in [Canis 14].

Chapter 6 presents the on-chip memory architecture synthesized by LegUp. We partition application memory into local and global physical memory regions in the final circuit by using pointer analysis performed by LegUp at compile time. We also group physical memories to match the underlying hardware characteristics of the FPGA chip, which typically stores on-chip memory in dedicated memory blocks. Our architecture is generally applicable to a wide range of input C applications, even when pointer analysis cannot statically determine where each C pointer can reference in memory. We measure the impact of local memories and grouped global memories, compared to only having global memory, across the CHStone benchmark suite targeting the Stratix IV FPGA family. We observe a 37% improvement in geomean memory implementation bits and a 12% reduction in geomean wall-clock time. This work has been published in [Fort 14].

Chapter 7 explores a case study of a kernel from a common edge detection algorithm: the Sobel filter. We synthesize the circuit in LegUp from a C description and then compare the design to an equivalent hand-written RTL implementation. We show that performance is comparable, with the wall-clock time of the LegUp-generated circuit only 2% higher than that of the hand-designed circuit. However, the synthesized circuit used more FPGA device resources, requiring 64% more ALUTs and 66% more registers. We motivate topics for future work, particularly the coding-style changes to the input C code required to achieve comparable performance. We need a way of either detecting these optimizations automatically or developing a style guide, with associated intuition, for software developers targeting an efficient hardware architecture.

1.3 Organization

The remainder of this PhD dissertation is organized as follows: Chapter 2 reviews the background material relevant to the research and presents related work. The research contributions are presented in Chapters 3, 4, 5, 6, and 7. Chapter 8 summarizes conclusions and gives suggestions for future work. Appendix A provides a brief summary of the LegUp open-source codebase.

Chapter 2

Background and Related Work

2.1 Introduction

This chapter presents background material and related work that will provide the reader with enough knowledge to understand the research contributions of this dissertation. Section 2.2 gives an overview of available devices for computation and where our research lies within the computational landscape. Section 2.3 reviews prior work in high-level synthesis, discussing the algorithms required to turn a high-level C description into hardware. Section 2.4 discusses compiler terminology and describes LLVM, the open-source compiler framework. Sections 2.5–2.7 describe the high-level synthesis subproblems: allocation, scheduling, and binding. Section 2.8 gives an overview of our target FPGA architectures.

2.2 Modern Computation Platforms

A spectrum of computation platforms is currently available; we highlight a few popular platforms in Figure 2.1. These platforms offer a trade-off between ease of programming and performance, in terms of higher computation throughput, lower power, or both. On the left, we have general purpose processors from Intel or AMD. The vast bulk of computation is performed on this adaptable platform. In the mobile space, ARM processors dominate due to their lower power and customizability. Moving from left to right, we have specialized hardware cards such as graphics processing units (GPUs) from Nvidia or ATI. GPUs are programmed with languages like CUDA [CUDA 07] and OpenCL [Openc 09], which allow programmers to harness the single-instruction multiple-data (SIMD) architecture of GPUs for general purpose


Figure 2.1: Spectrum of computation platforms.

computation. GPU devices are particularly effective for floating point intensive computation. Next, we have digital signal processing (DSP) processors from Qualcomm [Codre 14] or Texas Instruments; these support parallel multiply-accumulate operations in a SIMD architecture, as required by signal processing (cell baseband towers, TV signal decoding). Programming DSP processors typically involves a mixture of C programming and assembly hand-tuning. Finally, we have more custom hardware solutions such as FPGAs, which have traditionally been used in low-volume applications, mostly for telecom/switching/networking [Xili 14]. FPGA vendor Altera receives 16% of its revenue from Huawei [Alte 13], who use FPGAs in cell tower baseband stations to route packets received at high bandwidth (>100 Gbps) between the many DSP-specific processors that handle signal processing. As telecommunication companies upgrade to 4G data networks, this is a growing market area, with global mobile data traffic expected to grow 11-fold by 2018 [Cisc 14]. Lastly, we have custom application-specific integrated circuits (ASICs), which are typically designed using a standard cell library specific to the silicon fabrication process node, for instance 28nm. Standard cells are selected, placed, and routed using electronic design automation (EDA) tools, after which lithography masks are made for the final layout and the ASIC is fabricated in silicon. ASICs can be extremely costly to fabricate: Intel's latest 14nm fabrication plant was estimated to cost $5B [List 14]. Companies with lower volumes can save costs by using shared fabrication facilities at a foundry like Taiwan Semiconductor Manufacturing Company (TSMC). ASICs fabricated in older, mature process nodes can be significantly cheaper due to higher yields and sunk capital costs. For example, fabricating an ASIC at 65nm (from 2006) costs an estimated $500,000, and at 130nm (from 2001), $150,000 [Taylo 13].

We face a discontinuity in this spectrum, in terms of design complexity, when comparing software implementations running on a processor to hardware designs targeting an FPGA/ASIC. In the latter, the designer implements a custom circuit in a hardware description language and must synchronize all computation across a massively parallel digital circuit down to the level of individual clock cycles. Hardware design is error prone and can be notoriously difficult to debug, requiring cycle-accurate circuit simulations. Software design is comparatively straightforward, as we typically describe software sequentially or with limited parallelism, and we use mature, freely accessible compilers and debugging tools. However, a custom hardware implementation can provide a significant improvement in speed and energy efficiency versus a software implementation (e.g., [Cong 09, Luu 09, Zhang 12]). Despite the apparent energy and performance benefits, hardware design is usually too difficult and costly for most applications, and a software approach is preferred. If we could allow software to incrementally and automatically compile into a hardware solution, we could lower the barrier to entry of hardware design.

2.3 High-Level Synthesis Flow

High-Level Synthesis (HLS) is the compilation process of turning an untimed high-level algorithm, typically described in C, into a cycle-accurate hardware description language specification of a digital circuit. We begin with a general overview of the high-level synthesis flow, and then discuss each step in detail along with relevant prior academic work. HLS is an NP-hard combinatorial problem, so academics have traditionally used divide-and-conquer to break the task into distinct subproblems [Couss 09], the most important being scheduling and binding. Figure 2.2 shows the typical HLS flow. First, the user specifies a program in a high-level language, which in this dissertation we will assume to be ANSI C.


Figure 2.2: High-Level Synthesis flow.

The program is compiled, in our case by LLVM, to optimize the code and produce an intermediate representation (IR). The LLVM compiler and intermediate representation will be described shortly. Next, we perform the allocation step, which reads user-provided constraints to determine the amount of hardware available for use (e.g., the number of multiplier functional units), and also manages other hardware constraints (e.g., speed, area, and power) or constraints imposed by the target hardware architecture. After the hardware is allocated, we solve the most important HLS subproblem: scheduling. Scheduling assigns each operation in the input program to a control step (state) that occurs during a particular clock cycle. Scheduling decisions can have significant performance implications, and we face challenges extracting parallelism from the untimed program description. After scheduling, we perform binding, which assigns each of the program's operations to a specific functional unit in the hardware, sharing functional units where possible to save area. Binding also shares registers/memories between variables and assigns memory ports to particular load/store operations. Finally, we generate a suitable finite state machine and datapath based on the results of scheduling and binding, while meeting the user constraints from the allocation step, and we output the corresponding RTL description. The final circuit description is synthesizable on a target FPGA using standard RTL synthesis tools [Quar 14].
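As a concrete thumbnail of how these steps compose, consider walking a single multiply-accumulate statement through the flow. The states and bindings in the comments below are illustrative assumptions of ours, not actual LegUp output:

/* Input C: one multiply-accumulate operation. */
int mac(int a, int b, int acc) {
    return acc + a * b;
}

/*
 * Allocation:  suppose one multiplier and one adder are available,
 *              with a clock period too short to chain them.
 * Scheduling:  mul -> state S0
 *              add -> state S1  (depends on the mul result)
 * Binding:     mul -> multiplier unit 0; add -> adder unit 0;
 *              the mul result is registered between S0 and S1.
 * RTL gen:     a two-state FSM drives the datapath and asserts
 *              a "done" signal at the end of S1.
 */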

2.4 C Compiler: Low-Level Virtual Machine (LLVM)

High-level synthesis is typically implemented as a series of backend compiler passes in an existing compiler, so we will briefly review prerequisite compiler terminology [Lam 06]. A control flow graph (CFG) of a program is a directed graph, where vertices map to basic blocks, which represent computation, and edges map to branches, which represent control flow. For example, given two basic blocks b1 and b2, if b1 can branch to b2 then b1 has an edge to b2 in the CFG. A basic block is a contiguous set of non-branching instructions with a single entry (at its beginning) and exit point (at its end). Within a basic block, the flow data dependencies between instructions form an acyclic directed graph, called a data flow graph (DFG).


Figure 2.3: Control flow graph (CFG) and data flow graph (DFG) of Figure 2.4.

y[n] = 0;
for (i = 0; i < 8; i++) {
    y[n] += coeff[i] * x[n - i];
}

Figure 2.4: C code for FIR filter.

Consider an 8-tap finite impulse response (FIR) filter whose output, y[n], is a weighted sum of the current input sample, x[n], and seven previous input samples. The C code for calculating the FIR response is given in Figure 2.4. Figure 2.3 shows the corresponding CFG of the FIR filter, where the loop is indicated by the back edge of the second basic block. We also provide the data flow graph of the FIR loop body, where we multiply two values from memory and store the sum. In this dissertation, we leverage the popular open-source low-level virtual machine (LLVM) compiler framework [Lattn 04] – the same framework used by Apple for iPhone/iPad application development. At the core of LLVM is an intermediate representation (IR), which is essentially machine-independent assembly language. C code is translated into LLVM's IR, then analyzed and modified by a series of compiler optimization passes. Current results show that LLVM produces code of comparable quality to gcc for x86-based processor architectures.

Figure 2.5 gives the unoptimized LLVM IR corresponding to the FIR filter C code we gave in Figure 2.4. Register names in the IR are prefixed by "%" and there is no restriction on the number of registers. The LLVM IR is in static single assignment (SSA) form, which ensures that each register is only assigned once, guaranteeing a 1-to-1 correspondence between an instruction and its destination register. Types are explicit in the IR. For example, i32 specifies a 32-bit integer type and i32* specifies a pointer to a 32-bit integer. In the example IR for the FIR filter in Figure 2.5, line 1 marks the beginning of a basic block called entry. Lines 2 and 3 initialize y[n] to 0. Line 4 is an unconditional branch to a basic block called bb1 that begins on line 5, corresponding to the C loop body. phi instructions are needed to handle control-flow-dependent variables in SSA form. For example, the phi instruction on line 6 assigns loop index register %i to 0 if the previous basic block was entry; otherwise, %i is assigned to register %i.new, which contains the incremented %i from the previous loop iteration.

1:  entry:
2:  %y.addr = getelementptr i32* %y, i32 %n
3:  store i32 0, i32* %y.addr
4:  br label %bb1
5:  bb1:
6:  %i = phi i32 [ 0, %entry ], [ %i.new, %bb1 ]
7:  %coeff.addr = getelementptr [8 x i32]* %coeff, i32 0, i32 %i
8:  %x.ind = sub i32 %n, %i
9:  %x.addr = getelementptr i32* %x, i32 %x.ind
10: %0 = load i32* %y.addr
11: %1 = load i32* %coeff.addr
12: %2 = load i32* %x.addr
13: %3 = mul i32 %1, %2
14: %4 = add i32 %0, %3
15: store i32 %4, i32* %y.addr
16: %i.new = add i32 %i, 1
17: %exitcond = icmp eq i32 %i.new, 8
18: br i1 %exitcond, label %return, label %bb1
19: return:

Figure 2.5: LLVM IR for FIR filter.

The getelementptr instruction on line 7 performs address computation to initialize a pointer %coeff.addr to the address of coeff[i]. The getelementptr instruction has three operands: a pointer %coeff to the coefficient array, an offset from that pointer (0), and an offset from the start of the coefficient array, %i. Lines 8 and 9 initialize a pointer to the input sample array, x[n-i]. Lines 10–12 load the sum y[n], the input sample, and the coefficient into registers. Lines 13 and 14 perform the multiply-accumulate: y[n] + coeff[i] * x[n-i]. The result is stored in y[n] on line 15. Line 16 increments the loop index %i by one. Lines 17 and 18 compare %i with the loop limit (8) and branch accordingly. Observe that LLVM instructions are simple enough to directly correspond to hardware operations (e.g., a load from memory, or an arithmetic computation). Our HLS tool operates directly on the LLVM IR, scheduling the instructions into specific clock cycles.

Scheduling operations in hardware requires knowing the data dependencies between operations. Fortunately, the SSA form of the LLVM IR makes this easy. For example, the multiply instruction (mul) on line 13 depends on the results of the two load instructions on lines 11 and 12. Memory data dependencies are more problematic to discern; however, LLVM includes alias analysis – a compiler technique for determining which memory locations a pointer can reference. In Figure 2.5, the store on line 15 has a write-after-read dependency with the load on line 10, but has no memory dependencies with the loads on lines 11 and 12. Alias analysis can determine that these instructions are independent and can therefore be performed in parallel.
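To make the role of alias analysis concrete, consider the following C fragment (an illustrative example of ours, not from the thesis):

int x[8], y[8];

/* x and y are distinct global arrays: alias analysis proves the two
 * read-modify-write sequences touch different memories, so an HLS
 * scheduler may issue them in the same cycle (e.g., using the two
 * ports of a dual-ported RAM). */
void scale_distinct(void) {
    x[0] = x[0] * 2;
    y[0] = y[0] * 2;
}

/* p and q may point to the same location: unless points-to analysis
 * can prove otherwise, the store through p must complete before the
 * load through q, serializing the two statements. */
void scale_unknown(int *p, int *q) {
    *p = 1;
    *q = *q * 2;
}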

2.5 Allocation

We will now describe each step of high-level synthesis in detail, beginning with allocation. We will emphasize features from LegUp, our high-level synthesis tool discussed in Chapter 3, and we will assume an FPGA target device without loss of generality.

Allocation sets up the constraints for the high-level synthesis problem by specifying target hardware properties and any user-given parameters. LegUp reads allocation information from a configuration Tcl file, which specifies:

• The target board and FPGA device family.

• The required circuit clock period.

• The limit (if any) on functional units available for each operator type.

• Which functional units should be shared.

• The number of pipeline stages in each functional unit.

• The number of memory ports and memory latency.

• Specific HLS optimizations: minimize bitwidth, loop pipelining, etc.

• The estimated delay of each functional unit using FPGA device characterization.

All of these allocation parameters have sensible default values. The user typically only manually specifies the target board and, if necessary, specific HLS optimizations. Based on the target FPGA device family, LegUp will automatically select a default clock period constraint that we have previously found to achieve the highest performance across our benchmarks. We have selected the default number of pipeline stages of each functional unit to minimize the impact on the overall circuit clock frequency. For instance, a floating point adder has 14 pipeline stages by default.

A functional unit is an instantiated module in the hardware, for instance a multiplier. An operation is synonymous with an LLVM instruction in the program. Multiple operations can share a compatible functional unit by adding multiplexers to the input ports of the shared functional unit. By default, LegUp does not limit the number of available functional units for integer add/subtract, bitwise, shift, and comparator operations because multiplexers are costly to implement in FPGAs. For example, a 32-bit adder can be implemented using 32 4-input LUTs (and associated carry logic), but a 32-bit 2-to-1 multiplexer also requires 32 4-input LUTs – the same number of LUTs as the adder itself. Since we cannot save area by restricting these functional units, LegUp generates wide datapaths that can benefit from instruction-level parallelism in the input program. For multiplier functional units, we can use hard multiplier blocks in the FPGA fabric. LegUp will share multipliers if the synthesized program uses more multiply operations than there are hard blocks available in the FPGA. Other functional units, such as divide/modulus or floating point units, are implemented with LUTs and consume significant area. Therefore, by default LegUp limits the number of divide and remainder units to one, and allows only one of each type of floating point unit. We allow the user to override these defaults in the configuration file to achieve higher parallelism and performance at the cost of area.

Like other HLS tools [Xili], LegUp does not support constraining the overall circuit area (e.g., use fewer than 1000 logic elements) subject to a timing constraint. Such an area constraint would be redundant, as by default LegUp attempts to reduce area as much as possible while still satisfying the timing constraint. A user can use these allocation settings to easily perform design space exploration and gain greater control over the final LegUp-generated datapath.

Figure 2.6: Scheduling the DFG of a basic block.

1  For each Instr in BasicBlock
2      state = 0
3      For each Operand of Instr
4          continue if outsideBasicBlock(BasicBlock, Operand)
5          operandState = getState(Operand)
6          if latency(Operand) > 0
7              state = max(state, operandState + latency(Operand))
8          else if delay(Operand) + delay(Instr) > maxClockPeriod
9              state = max(state, operandState + 1)
10         else
11             state = max(state, operandState)
12         end if
13     End For
14     assignState(Instr, state)
15 End For

Figure 2.7: Basic block ASAP scheduling ignoring resource constraints.

2.6 Scheduling

In high-level synthesis, scheduling is the task of assigning operations to execute during specific clock cycles, or control steps, such that all program data dependencies and resource constraints are satisfied. The goal of scheduling is to minimize the total time needed to complete the program while satisfying all constraints. We can think of the program's CFG as a coarse representation of the finite state machine (FSM) needed to control the hardware being synthesized – the nodes and edges are analogous to those of a state diagram. Each branch condition in the CFG will become a state transition in the final FSM. What is not represented in this coarse FSM are the data dependencies between operations within a basic block and the latencies of operations (e.g., a memory access may take more than a single cycle).

After constructing a coarse FSM from the CFG, we schedule each basic block individually. Figure 2.6 gives the schedule and corresponding FSM for the basic block DFG we saw previously in Figure 2.3, where each operation has been scheduled to occur in a particular FSM state (clock cycle). Given this schedule, the basic block will take four cycles to complete in hardware. Our memory controller is dual-ported, taking advantage of FPGA on-chip RAM and allowing two loads/stores to be performed every cycle. To satisfy this resource constraint, we scheduled only two loads in the first state and pushed the third load to the next state. Alternatively, we could have scheduled the third load in the first state; however, this would have pushed one of the first two loads to the next state, lengthening the overall schedule by one cycle.


Figure 2.8: Scheduled FIR filter LLVM instructions with data dependencies.

The simplest scheduling approach, which ignores resource constraints, is as-soon-as-possible (ASAP) scheduling [Gajsk 92]. ASAP scheduling assigns an instruction to the first state after all of its dependencies have been computed, guaranteeing the shortest schedule. We provide pseudocode for ASAP scheduling in Figure 2.7, which assigns a state number, starting from zero, to each instruction. Here, we visit the instructions within each basic block in topological order (line 1) and loop over each instruction's operands (line 3). The operands for each instruction are either: 1) from this basic block, and therefore guaranteed to have already been assigned a state, or 2) from outside this basic block, in which case we can safely assume they will be available before control reaches this basic block (line 4). For operands with multi-cycle latencies, such as pipelined divides or memory accesses, we schedule the instruction after the instruction producing the operand has completed (line 7). Usually an instruction will be scheduled one cycle after all of its operands have completed (line 9). In some cases, we can schedule an instruction into the same state as one of its operands, which is called operation chaining. We perform chaining in cases where the estimated delay of the chained operations (from allocation) does not exceed the estimated clock period for the design (line 11). Chaining can reduce hardware latency (the number of cycles for execution) and save registers without impacting the final clock period.

Figure 2.8 is a Gantt chart showing the ASAP schedule of the FIR filter LLVM instructions from Figure 2.5. The chart shows the same LLVM instructions, now scheduled into nine states. Data dependencies between operations are shown; in this case we do not allow operation chaining (for clarity). We assume that load instructions have a two-cycle latency. Once a load has been issued, a new load can be issued on the next cycle.

In the presence of resource constraints, HLS scheduling is an NP-hard problem that can be solved with integer linear programming [Hwang 91] or approximately solved using various heuristics. There are two conventional categories of scheduling heuristics: resource-constrained, where the number of functional units is specified, and time-constrained, where the maximum cycle length of the schedule is specified. Resource-constrained HLS scheduling is typically performed using the list scheduling technique [Adam 74]. A list scheduler keeps track of a list of candidate operations to schedule at the current time step. An operation is a candidate if all of its data dependencies are met and if there are still compatible resources available. The choice among candidate operations is based on a priority, the simplest being either as-soon-as-possible (ASAP), which schedules operations as soon as their data dependencies are met, or as-late-as-possible (ALAP), which schedules operations at the latest time while still maintaining the overall schedule length achieved using ASAP scheduling. Operations are taken from the candidate list in order of priority and committed to the current time step, then the candidate list is updated to reflect operations that can now be scheduled. A common priority function used in list scheduling is the mobility [Pangr 87] of each operation, which is the difference between the ASAP scheduled time and the ALAP scheduled time.
The mobility gives a measure of the scheduling flexibility of an operation, with zero indicating that delaying the operation will lengthen the overall schedule.

Force-directed scheduling [Pauli 89] is an example of time-constrained scheduling. This approach uses the mobility range of each operation to estimate the distribution of resource requirements. For each resource type, this distribution gives a measure of how many operations could be scheduled at a particular time step. Operations are then selected to balance this distribution and minimize resource usage at each time step while still meeting the schedule length constraint.

For control-intensive programs, we typically have many smaller basic blocks and a complex control flow graph. This can lead to many cycles being spent on the control flow of the program when operations from different basic blocks could have been overlapped. SPARK [Gupta 03] proposed extending the candidate operations during list scheduling to include operations outside the current basic block, thereby speculating by executing an operation early in the hope that its result will be needed. Another approach is path-based scheduling [Campo 91], which schedules all possible execution paths through the CFG independently and then combines them into a final schedule.
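The core loop of a resource-constrained list scheduler is compact enough to sketch directly. The following minimal C program is our illustration: the six-operation DFG and one-multiplier budget are assumptions, and plain index order stands in for a real priority function such as mobility. It schedules four multiplies feeding two adds:

#include <stdio.h>

#define NOPS 6
#define NMULT 1   /* resource budget: one multiplier per cycle */

typedef struct {
    int npred;      /* number of predecessors in the DFG */
    int pred[4];    /* predecessor operation indices */
    int uses_mult;  /* 1 if the op needs the constrained multiplier */
    int cycle;      /* assigned cycle, or -1 if unscheduled */
} Op;

int main(void) {
    /* Hypothetical DFG: ops 0-3 are independent multiplies; op 4
     * adds the results of 0 and 1; op 5 adds the results of 2 and 3. */
    Op ops[NOPS] = {
        {0, {0}, 1, -1}, {0, {0}, 1, -1}, {0, {0}, 1, -1},
        {0, {0}, 1, -1}, {2, {0, 1}, 0, -1}, {2, {2, 3}, 0, -1},
    };
    int done = 0, cycle = 0;

    while (done < NOPS) {
        int mults = 0;
        for (int i = 0; i < NOPS; i++) {   /* index order = priority */
            if (ops[i].cycle >= 0) continue;
            /* Candidate check: every predecessor finished in an
             * earlier cycle (unit latency, no chaining). */
            int ready = 1;
            for (int p = 0; p < ops[i].npred; p++)
                if (ops[ops[i].pred[p]].cycle < 0 ||
                    ops[ops[i].pred[p]].cycle >= cycle)
                    ready = 0;
            if (!ready) continue;
            /* Resource check: respect the multiplier budget. */
            if (ops[i].uses_mult && mults == NMULT) continue;
            ops[i].cycle = cycle;
            mults += ops[i].uses_mult;
            done++;
            printf("op %d -> cycle %d\n", i, cycle);
        }
        cycle++;
    }
    return 0;
}

Running the sketch issues one multiply per cycle, with the adds slotted in as their operands retire; a real HLS list scheduler replaces the index-order scan with a priority function such as the mobility described above.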

2.6.1 SDC Scheduling

The previously discussed scheduling heuristics suffer from making local optimization choices, while optimal branch-and-bound scheduling approaches are too slow for large programs. Alternatively, state-of-the-art HLS scheduling uses a mathematical framework, called a system of difference constraints (SDC), to describe the constraints related to scheduling [Cong 06b]. The SDC framework is flexible and allows the specification of a wide range of constraints such as data and control dependencies, resource constraints, relative timing constraints for I/O protocols, and clock period constraints. A system of difference constraints formulation is a set of difference constraints of the form:

$$v_i - u_i \le C \qquad (2.1)$$

Figure 2.9: System of difference constraints graph.

where $v_i$ and $u_i$ are variables to be solved for and $C$ is a constant real number. Here is an example of a system of difference constraints:

$$
\begin{aligned}
c_0 - c_1 &\le 5 \\
c_1 - c_2 &\le -8 \\
c_2 - c_3 &\le 4 \qquad (2.2) \\
c_3 - c_4 &\le -3 \\
c_4 - c_0 &\le 1
\end{aligned}
$$

By limiting the constant $C$ values to integers, the constraint matrix formed by a system of difference constraints has the property of being totally unimodular. A totally unimodular matrix is a matrix whose every square submatrix has a determinant of 0, $-1$, or $+1$. Due to this property, the solution to the corresponding linear programming (LP) problem is guaranteed to be integral, which avoids the expensive branch-and-bound search required to solve general integer linear programming problems. We can therefore solve an SDC problem using a standard LP solver in polynomial time.

A system of difference constraints can also be represented as a constraint graph, with a vertex corresponding to each variable in the system and an edge for each difference constraint: $u_i \rightarrow v_i$ with an edge weight equal to $C$. The constraint graph for Equation (2.2) is shown in Figure 2.9. The SDC is feasible iff there are no negative cycles in the graph [Ramal 99], where a negative cycle is a cycle whose edge weights sum to a negative value. For example, summing all of the constraints in Equation (2.2) gives $c_0 - c_0 \le -1$, which is clearly infeasible; correspondingly, the graph contains a cycle with a path length of $-1$.

In SDC scheduling, each operation is assigned a variable that, after solving, will hold the clock cycle in which the operation is scheduled. Consider two operations, $op_1$ and $op_2$, and let the variable $c_{op1}$ represent the cycle in which $op_1$ is to be scheduled, and $c_{op2}$ the cycle in which $op_2$ is to be scheduled. If $op_1$ depends on $op_2$, then we must schedule $op_1$ after $op_2$, and we add the following difference constraint to the SDC formulation: $c_{op2} - c_{op1} \le 0$ (or equivalently: $c_{op2} \le c_{op1}$).

We can also incorporate clock period constraints into SDC scheduling. Let $P$ be the target clock period and let $C$ represent a chain of any $N$ dependent combinational operations in the dataflow graph: $C = op_1 \rightarrow op_2 \rightarrow \dots \rightarrow op_N$. Assume that $T$ represents the total estimated combinational delay of the chain of $N$ operations, computed by summing the delays of each operator. We can add the following timing constraint to the SDC formulation: $\lceil T/P \rceil - 1 \le c_{opN} - c_{op1}$. This difference constraint requires that the cycle assignment for $op_N$ be at least $\lceil T/P \rceil - 1$ cycles later than the cycle in which $op_1$ is scheduled. For instance, if $T = 12$ ns and $P = 5$ ns, then $op_N$ must be scheduled at least $\lceil 12/5 \rceil - 1 = 2$ cycles after $op_1$. Such constraints control the extent to which operations can be chained together in a clock cycle: chaining is permitted as long as the target clock period $P$ is met.
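Because feasibility reduces to negative-cycle detection in the constraint graph, an SDC can be checked, and a feasible assignment recovered, with the classic Bellman-Ford algorithm. The sketch below is one way to do this; the Constraint encoding is illustrative. The resulting values may be negative, but adding a common offset to every variable preserves all difference constraints, so the assignment can be shifted to satisfy $c_i \ge 0$.

#include <stdbool.h>

typedef struct { int u, v, c; } Constraint;   /* encodes x[v] - x[u] <= c */

/* Relax every constraint n times, as if from a virtual source with a
   zero-weight edge to each variable. Returns false on a negative cycle
   (infeasible SDC); otherwise x[] holds a feasible assignment. */
bool sdc_solve(int n, const Constraint *e, int m, int *x)
{
    for (int i = 0; i < n; i++)
        x[i] = 0;                              /* virtual-source distances */
    for (int iter = 0; iter < n; iter++)
        for (int j = 0; j < m; j++)
            if (x[e[j].u] + e[j].c < x[e[j].v])
                x[e[j].v] = x[e[j].u] + e[j].c;
    for (int j = 0; j < m; j++)                /* still relaxable => negative cycle */
        if (x[e[j].u] + e[j].c < x[e[j].v])
            return false;
    return true;   /* every x[v] - x[u] <= c now holds */
}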

Figure 2.10: Circuit datapath after binding for the given schedule (given one adder).

A property of a system of difference constraints is that its solutions are not unique. For example, consider:

$$
\begin{aligned}
c_0 - c_1 &\le 5 \qquad (2.3) \\
c_1 - c_2 &\le -8
\end{aligned}
$$

There are many feasible solutions $(c_0, c_1, c_2)$ to this linear system, for example $(1, -4, 10)$ or $(8, 4, 15)$. However, because all SDC variables correspond to operation schedule times, we can add an additional constraint on each variable: $c_i \ge 0$. Furthermore, SDC scheduling allows a linear objective function, for example to minimize the start time of each operation and thereby achieve an as-soon-as-possible (ASAP) schedule: $\min \sum_i c_i$. Objective functions that minimize circuit power have also been proposed [Jiang 08]. SDC scheduling does not solve the scheduling problem optimally, as a linear ordering heuristic is required when we have resource constraints. We refer the reader to [Cong 06b] for complete details of the formulation and how other types of constraints can be included.
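As a concrete illustration, the snippet below feeds the two constraints of Equation (2.3) and the ASAP objective to the open-source lp_solve library (the same LP solver later adopted for LegUp's SDC scheduler); the setup shown here is a minimal sketch, not LegUp's actual formulation code. lp_solve's default lower bound of zero conveniently supplies the $c_i \ge 0$ constraints.

#include <stdio.h>
#include "lp_lib.h"

int main(void)
{
    lprec *lp = make_lp(0, 3);            /* 3 variables: c0, c1, c2 */
    set_add_rowmode(lp, TRUE);

    int    cols[2];
    double vals[2] = { 1.0, -1.0 };

    cols[0] = 1; cols[1] = 2;             /* c0 - c1 <= 5  */
    add_constraintex(lp, 2, vals, cols, LE, 5.0);
    cols[0] = 2; cols[1] = 3;             /* c1 - c2 <= -8 */
    add_constraintex(lp, 2, vals, cols, LE, -8.0);

    set_add_rowmode(lp, FALSE);

    double obj[4] = { 0.0, 1.0, 1.0, 1.0 };  /* minimize c0 + c1 + c2 */
    set_obj_fn(lp, obj);
    set_minim(lp);

    if (solve(lp) == OPTIMAL) {
        double c[3];
        get_variables(lp, c);
        /* total unimodularity guarantees an integral LP optimum,
           here c0 = 0, c1 = 0, c2 = 8 */
        printf("c0=%g c1=%g c2=%g\n", c[0], c[1], c[2]);
    }
    delete_lp(lp);
    return 0;
}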

2.6.2 Extracting Parallelism

Until now we have assumed that the hardware generated by HLS is fairly sequential in nature, with only one state active during any clock cycle. However, hardware designs typically exploit parallelism. Parallel computation can be specified explicitly by the user with a C library like Pthreads [Buttl 96], with compiler pragmas such as OpenMP [Buttl 96], or with language extensions like OpenCL [Openc 09]. Parallelism can also be inferred by the HLS tool, which attempts to extract the parallelism automatically. In HLS, a common optimization is loop pipelining, which infers parallelism across loop iterations to generate hardware pipelines, as we discuss further in Chapter 5.

2.7 Binding

Binding comprises two tasks: operation binding assigns operations from the program to specific hardware units, while variable binding assigns program variables to registers. When multiple operations are assigned to the same hardware unit, or when multiple variables are bound to the same register, multiplexers are required to facilitate the sharing. The binding step is typically performed after scheduling; however, binding is interdependent with scheduling. After binding, we may wish to revisit the schedule to allow more sharing to occur and save area, and there have been proposals to perform scheduling and binding simultaneously [Resha 05].

Figure 2.10 shows an example of a schedule and the corresponding circuit datapath after binding, ignoring memory addressing for clarity. We assume that allocation has given us one adder functional unit and a shared two-port memory. The two addition operations have been scheduled in distinct cycles, so we can assign both to the same functional unit. Normally, we would require two multiplexers, one on each input of the adder, but since one input always arrives from the memory output, a multiplexer is unnecessary on that input.

We have three goals when binding operations to shared functional units. First, we want to balance the sizes of the multiplexers across functional units to keep circuit performance high. Multiplexers with more inputs have higher delay, so we wish to avoid having a functional unit with a disproportionately large multiplexer on its input. Second, we want to recognize cases where operations have shared inputs, letting us save a multiplexer if those operations are assigned to the same functional unit. Lastly, if during binding we can assign two operations that have non-overlapping lifetime intervals to the same functional unit, we can use a single output register for both operations. In this case we save a register without needing a multiplexer. We use the LLVM live variable analysis pass to compute the lifetime intervals. To account for these goals, we use the following cost function to measure the benefit of assigning operation op to functional unit fu:

$$\mathit{Cost}(op, fu) = \phi \cdot \mathit{existingMuxInputs}(fu) + \beta \cdot \mathit{newMuxInputs}(op, fu) - \theta \cdot \mathit{outputRegisterSharable}(op, fu) \qquad (2.4)$$

where $\phi = 0.1$, $\beta = 1$, and $\theta = 0.5$ to give priority to saving new multiplexer inputs, then output registers, and finally balancing the multiplexers. Here, existingMuxInputs(fu) returns the number of multiplexer inputs already required by the functional unit fu. The function newMuxInputs(op, fu) returns the number of new multiplexer inputs required if we assign operation op to functional unit fu. For example, if op = a ∗ b and we have already assigned op2 = c ∗ b to the functional unit fu, then we will only need one additional multiplexer input, for operand a, since operand b is shared. outputRegisterSharable(op, fu) returns one if the operation op has a lifetime interval that does not overlap with any operation already assigned to the functional unit fu. Notice that sharing the output register reduces the cost, while the other factors increase it.

In general, binding has been shown to be NP-hard [Pangr 91], but various heuristics have been proposed. A common heuristic for solving the binding problem is weighted bipartite matching [Huang 90]. The binding problem is represented using a bipartite graph with two vertex sets. The first vertex set corresponds to the operations being bound (i.e., LLVM instructions) during a particular control step. The second vertex set corresponds to the available functional units. A weighted edge is introduced from a vertex in the first set to a vertex in the second set if the corresponding operation is compatible with the corresponding functional unit. The cost, given in Equation (2.4), of assigning an operation to a given functional unit is assigned to the edge weight connecting the corresponding vertices. After constructing the weighted bipartite graph, we wish to match each vertex from the first vertex set (operations) to exactly one of the connected vertices from the second set (compatible functional units) such that the overall cost (sum of edge weights) is minimized.
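Expressed as code, Equation (2.4) amounts to the following; the Op and FuncUnit types and the three helper functions are placeholders for whatever binder state an implementation keeps, not LegUp's actual identifiers.

typedef struct Op Op;
typedef struct FuncUnit FuncUnit;

/* Placeholders for the binder's bookkeeping:                           */
int existing_mux_inputs(FuncUnit *fu);        /* mux inputs fu already has  */
int new_mux_inputs(Op *op, FuncUnit *fu);     /* extra inputs op would add  */
int output_register_sharable(Op *op, FuncUnit *fu); /* 1 if lifetimes disjoint */

/* Equation (2.4): lower cost means a better (op, fu) assignment. */
double binding_cost(Op *op, FuncUnit *fu)
{
    const double phi = 0.1, beta = 1.0, theta = 0.5;
    return phi   * existing_mux_inputs(fu)
         + beta  * new_mux_inputs(op, fu)
         - theta * output_register_sharable(op, fu);
}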

Figure 2.11: Bipartite Graph.

The weighted bipartite matching problem can be solved optimally in O(n³) time using the Hungarian method [Kuhn 10]. We formulate and solve the matching problem one clock cycle at a time until the operations in all clock cycles (states) have been bound to an available functional unit. An example is shown in Figure 2.11; in this case we would match operation addition1 to functional unit adderFuncUnit2 and addition2 to adderFuncUnit1, for a minimum total edge weight of 5. The weighted bipartite matching approach can also be used for variable binding, where one vertex set corresponds to program variables and the other set to registers.

Binding can also be cast as a clique partitioning problem [Tseng 86]. In this formulation, graph vertices represent operations and an edge between two vertices indicates that the associated operations can share a hardware unit. By partitioning the graph into the minimal number of cliques, we can minimize the hardware. Clique partitioning is NP-hard in general, but heuristics can be used. To apply this approach to variable binding, each vertex represents a program variable and an edge between two vertices indicates that the variables have non-overlapping lifetimes. Graph colouring approaches can also be used during variable binding for register allocation [Chait 81, Beida 05].
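The compatibility test underlying both clique partitioning for variable binding and the output-register sharing described earlier is a simple interval check, sketched below under the assumption that a lifetime is a half-open interval of clock cycles.

#include <stdbool.h>

typedef struct { int def, last_use; } Interval;  /* [def, last_use) in cycles */

/* Two variables (or operation results) may share a register iff their
   lifetime intervals do not overlap. */
bool can_share_register(Interval a, Interval b)
{
    return a.last_use <= b.def || b.last_use <= a.def;
}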

2.8 FPGA Architecture

In this section, we describe modern FPGA architecture, focusing specifically on two commercial FPGAs from Altera: the Cyclone II [Cycl 04] FPGA (90nm) and the Stratix IV [Stra 10] FPGA (40nm). We target these two FPGAs exclusively in the experimental results provided in this dissertation, due to the widespread availability of Altera's DE2 [DE2 10b] and DE4 [DE4 10] development and education boards.

A modern FPGA consists of a two-dimensional array of logic array blocks, each consisting of lookup tables (LUTs), registers, and some additional circuitry [Betz 99]. A k-input LUT can implement any k-input logic function by using a programmable SRAM containing a $2^k$-bit truth table and a $2^k$-input multiplexer to select the correct output. Stratix IV has a considerably different logic array block architecture than Cyclone II, as illustrated in Figure 2.12. Cyclone II uses logic elements (LEs) containing 4-input LUTs to implement combinational logic functions, whereas Stratix IV uses adaptive logic modules (ALMs). An ALM is a two-output 6-LUT that receives eight inputs from the FPGA interconnection fabric. The ALM can implement an arbitrary 6-input function, or two 4-input functions, or a 3-input and a 5-input function, or several other combinations. In both FPGA architectures, the LUT output can either be used combinationally or registered.
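The truth-table view of a LUT can be captured in a few lines of C; the sketch below models a 4-input LUT, with the 16-bit table standing in for the programmable SRAM and the input pattern selecting one bit.

#include <stdint.h>

/* A 4-input LUT: the 2^4 = 16 truth-table bits are the SRAM contents,
   and the 4-bit input pattern selects which bit drives the output. */
static inline int lut4(uint16_t truth_table, unsigned inputs)
{
    return (truth_table >> (inputs & 0xFu)) & 1u;
}

/* Example: 0x8000 places a 1 only at row 15 (binary 1111), so
   lut4(0x8000, in) implements a 4-input AND gate. */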

Figure 2.12: Cyclone II and Stratix IV logic element architectures: (a) Cyclone II Logic Element (LE); (b) Stratix IV Adaptive Logic Module (ALM).

Much of the early work on HLS focused on targeting ASICs, but there are key differences between targeting ASICs and FPGA devices. On an FPGA, multiplexing requires more device area than on an ASIC, because multiplexers are implemented in LUTs. A 32-bit wide 2-to-1 multiplexer implemented in 4-LUTs is the same size as a 32-bit adder, so if we decide to share an adder, we may need a multiplexer on each input, making the shared version 50% larger than simply using two adders. In contrast, a 2-to-1 multiplexer on an ASIC is implemented cheaply using two AND gates connected to an OR gate, or with transmission gates. Another difference is that an FPGA has hardened ASIC-like blocks that implement multipliers (DSP blocks) and memory (block RAMs); where possible, DSP blocks should be used instead of implementing multipliers in LUTs. Likewise, the FPGA fabric is register rich: each logic element in the fabric has a LUT and a register, so sharing registers is rarely justified. Consequently, in HLS we must account for the underlying FPGA architecture to synthesize the best circuit.

Chapter 3

LegUp: Open-Source High-Level Synthesis Research Framework

3.1 Introduction

In this chapter, we describe an open-source high-level synthesis (HLS) framework called LegUp. LegUp enables designers to use a simpler hardware design methodology, in which a software application implemented in C can be incrementally synthesized to target a hybrid FPGA processor/accelerator system-on-chip. This target architecture harnesses the energy and performance benefits of hardware, while the LegUp HLS framework raises the design-entry abstraction up to the software level. During the design process, the input C program is partitioned into functions executing on the processor and other functions that are synthesized automatically into digital logic, or hardware accelerators, implemented on the FPGA. During program execution, the processor automatically offloads computation to these hardware accelerators, resulting in better performance than the original software implementation.

We have observed that robust open-source academic electronic design automation (EDA) tools can lead to significant new research progress by lowering the barrier to entry for new researchers. For example, the Versatile Place and Route (VPR) tool has been used by hundreds of FPGA researchers to perform studies on FPGA architecture and to develop new place and route algorithms [Betz 97, Luu 14b]. Another example is the open-source ABC logic synthesis system, which has renewed academic interest in logic synthesis research [Mishc 06]. High-level synthesis and application-specific processor design can also benefit from the availability of a robust open-source framework such as LegUp, which has been missing in the research community.

Currently, LegUp is intended to target FPGA-based embedded systems, which are application-specific systems typically implemented on a single board. Embedded systems that utilize FPGAs often include a soft processor, which is a processor implemented in lookup tables within the FPGA fabric [Nios 09b]. In embedded systems, particularly those using a soft processor, LegUp can significantly increase performance and energy efficiency by performing computations in custom hardware instead of running them on the processor. Alternatively, LegUp could target a high-performance computing platform, where a commodity general-purpose processor [Lempe 11] is connected to an FPGA board over the PCIe bus. In this scenario, we must be cognisant of off-chip bandwidth limitations when passing data to and from the FPGA, which can reduce or eliminate the performance gains from hardware accelerators. We focus primarily on the embedded system target architecture in this dissertation.

In this chapter, we present the LegUp design methodology and target architecture, as well as other implementation details. We also present an experimental evaluation comparing LegUp to a commercial HLS tool. In this study, we measure LegUp's ability to effectively explore the hardware/software design space of a given program. The remainder of this chapter is organized as follows: Section 3.2 describes other HLS tools related to this work. Section 3.3 introduces the target hardware architecture and outlines the high-level design flow. The details of the high-level synthesis tool and software/hardware partitioning are described in Section 3.4. An experimental study appears in Section 3.5. Section 3.6 discusses recent research enabled by the LegUp framework. A summary is given in Section 3.7.

3.2 Background

3.2.1 Prior HLS Tools

As we discussed in Chapter 2, high-level synthesis, also known as behavioural synthesis or electronic system level (ESL) design, has been studied for over 30 years [McFar 88, Cong 11]. In this section, we survey various HLS tools, both academic and commercial, that have been developed recently.

Several HLS tools have been developed to target digital signal processing (DSP) applications. The Riverside Optimizing Compiler for Configurable Circuits (ROCCC) [Villa 10] from UC Riverside is an open-source high-level synthesis tool. ROCCC is designed to accelerate critical kernels that perform repeated computation on streams of data. These kernels are typical in DSP applications such as FIR filters or fast Fourier transforms. ROCCC is not designed for compiling entire C programs into hardware, and many C features are unsupported, such as pointers, shifting by a variable amount, non-for loops, and the ternary operator. ROCCC has a bottom-up development process that involves partitioning one's application into modules and systems. Modules are C functions that are synthesized by ROCCC into a circuit datapath without any control logic. ROCCC fully unrolls any C loops within a module at compile time. These modules cannot access memory but instead have data streamed into them and output scalar values. Systems are C functions that instantiate modules to repeat computation on a stream of data or a window of memory, and usually consist of a loop nest with special function parameters for streams. ROCCC supports advanced optimizations such as systolic array generation, smart buffers, and temporal common subexpression elimination. ROCCC can also generate Xilinx PCore modules to be used with a Xilinx MicroBlaze processor [Micro 14]. However, ROCCC's strict subset of C is insufficient for compiling any of the CHStone benchmarks (described in Section 3.4.5), and ROCCC does not support any resource sharing.

Another high-level synthesis tool designed for DSP applications is GAUT [Couss 10] from the University of South Brittany. GAUT synthesizes a single C function into a pipelined hardware architecture consisting of a processing unit, a memory unit, and a communication unit, described in VHDL. The tool also includes a graphical viewer to analyze the program data flow graph and the final HLS schedule. The user must specify the circuit throughput, expressed as a pipeline initiation interval (see Chapter 5), and the clock period constraint.

We now discuss a few recent HLS tools that target more general applications. xPilot [Cong 06a] is a state-of-the-art academic HLS tool, developed at UCLA, that has been used for numerous HLS studies (e.g., [Chen 04, Cong 06c, Jiang 08, Cong 09, Wang 13]) and is now commercially released as Xilinx's Vivado HLS tool [Xili].

The CHiMPS [Putna 08] project, developed by Xilinx and the University of Washington, synthesizes a C program into an FPGA circuit with many distributed caches, utilizing the available FPGA block RAMs and supporting latency hiding of off-chip memory accesses. Trident, formerly called Sea Cucumber, is an HLS compiler developed at Los Alamos National Labs that targeted floating-point scientific applications [Tripp 07]. Bambu [Pilat 12] and DWARV [Nane 12] are two other recent academic HLS tools, with Bambu offering support for custom floating-point unit generation using FloPoCo [De Di 11]. Of all the academic tools, the recently released Shang [Shan 13] is the most comparable to LegUp, supporting a hybrid flow where software is executed on an Altera Nios II soft processor. Shang is built on LLVM (like LegUp), but instead of working with LLVM IR instructions, Shang works on the LLVM machine code layer, allowing further area optimizations. Shang also uses multi-cycle path analysis [Zheng 13] to efficiently chain operations, allowing for better performance and area.

Although we focus on C as the high-level input language in this dissertation, other proposed languages can offer better hardware expressibility. The most popular alternative is SystemC [Syste 02], used by Forte [Fort] among others. SystemC is a C++-based library offering a modeling platform that supports flexible input ranging from untimed C to a cycle-accurate RTL-like description using the SystemC class library. SystemC ships with a built-in simulation kernel that can perform cycle-accurate simulations orders of magnitude (1000×) faster than an equivalent RTL-based simulation [Coolea]. Proposed HLS input languages have also included C-based language extensions. For example, SpecC [Gajsk 00], developed at UC Irvine, added support for state machines and hardware pipelines. Other examples of C-based languages include HardwareC [Ku 88] from Stanford University and Handel-C [Aubur 96] from Oxford University. IBM Research has developed an HLS compiler called LiquidMetal that uses an object-oriented Java-like language, LIME, which includes hardware-specific extensions such as bitwidth-specific integers [Huang 08]. Other high-level synthesis languages offer user-explicit parallelism, such as the general-purpose GPU language OpenCL [Openc 09] used by Altera's OpenCL HLS tool [Open]. BlueSpec [Blue] uses the Haskell functional language to specify circuit behaviour using the guarded atomic action model, which consists of guard/action pairs. Each guard is a boolean function that triggers an atomic action, which modifies the circuit state.

Our described design methodology bears some similarity to the FPGA-based Warp Processor developed at UC Riverside [Vahid 08]. Their approach starts by profiling the software binary executing on the Warp Processor. Using this profiling data, they automatically select critical regions of the software binary to synthesize into a custom digital circuit on the FPGA. Before synthesizing the circuit, they disassemble the region of the software binary into a higher-level representation suitable for HLS [Stitt 07]. They take the circuit description produced by HLS and run FPGA CAD tools to synthesize the circuit into a programmable bitstream for the FPGA target device, then reprogram the FPGA using this bitstream.
Next, the Warp Processor transparently modifies the software binary to call the hardware during the appropriate software region, dynamically improving the speed and energy consumption of the program. LegUp's design methodology is similar, but we synthesize our custom hardware accelerators directly from the software source code instead of from the disassembled binary, enabling us to generate a better final hardware circuit using HLS. The Warp Processor was never publicly released or developed commercially.

We show a summary of the release status of the surveyed tools in Table 3.1.

Table 3.1: Release status of recent non-commercial HLS tools.

Open-source   Binary-only   No source or binary
Trident       xPilot        Warp Processor
ROCCC         GAUT          LiquidMetal
Shang         SPARK         CHiMPS
Bambu         DWARV

Tools fall into three categories: 1) the source code is available, 2) only a binary is available, or 3) neither source nor binary is available. Binary-only tools are only useful for benchmarking and cannot be modified by researchers trying to investigate new HLS algorithms. Tools without a binary release cannot have their published results independently verified.

We now discuss a few shortcomings of the currently available open-source HLS tools that motivate the creation of our new open-source HLS framework. The Trident tool is implemented using an older LLVM version and has not been actively maintained for several years. Trident also only synthesizes pure hardware designs and cannot support a hybrid hardware/processor architecture. ROCCC is under active development, but lacks the C language support required for our desired benchmark programs. Bambu is built using the GCC compiler [Stall 99] and is still under active development; the tool supports the CHStone benchmark suite but only targets a pure hardware flow (no processor). Shang is the most comparable open-source HLS tool to LegUp, with source code released in the middle of 2013. However, development appears to have stopped at the end of 2012, based on their source control system. We found the code less well-tested (many segfaults) and harder to install compared to LegUp, but the project looks promising overall. When LegUp 1.0 was released in 2011, it was the only open-source HLS tool compiling a complete C program to a hybrid processor/accelerator system architecture, where the synthesized hardware comprised a general datapath/state machine model. We have found that LegUp provides researchers with a robust infrastructure supporting larger and more general C programs than those handled by prior open-source tools.

Commercial HLS tools have been gaining traction in recent years, with both start-ups and major EDA vendors offering HLS tools. From the FPGA vendors, Xilinx offers Vivado HLS [Xili], formerly AutoPilot, while Altera has released the OpenCL compiler [Open]. From the major EDA vendors, there is Catapult C from Mentor Graphics [Caly], C-to-Silicon [Cade] and Forte [Fort] (recently acquired) from Cadence, and Synphony HLS, formerly Synfora PICO, from Synopsys [Synp 15]. From smaller HLS companies, there is eXCite from Y Explorations [eXCi 10], CoDeveloper from Impulse Accelerated Technologies [Impu], C2R from CebaTech [Ceba], and CyberWorkBench [Wakab 06] from NEC.

Altera's commercial C2H tool [Nios 09a] (deprecated in Quartus 9.1) targeted a system architecture similar to LegUp's. C2H required the user to categorize a C program's functions as either hardware or software. After C2H generated the system, the software functions would execute on a Nios II soft processor [Nios 09b], while the hardware functions would be synthesized into custom hardware accelerators. These hardware accelerators were connected to the Nios II processor using an Avalon interface (Altera's on-chip interconnection standard). However, C2H lacked sufficient coverage of the C language to compile our benchmark suite.

3.2.2 Application-Specific Instruction-Set Processors (ASIPs)

Application-specific instruction-set processors (ASIPs) [Pothi 10, Pozzi 06, Sun 04] are embedded processors that support adding new custom instructions to augment their existing instruction set architecture. If we reconfigure the processor datapath to include useful application-specific instructions, we can improve program performance and energy consumption compared to a general-purpose processor. We perform a profiling step to analyze the application for critical regions before deciding which custom instructions to implement in the ASIP. We can use pattern matching techniques on the profiling data to recognize commonly executed sequences of program instructions, which we can group together to form a new custom instruction. At this point, HLS is used to synthesize the custom datapath required for the custom instruction and then to resynthesize the ASIP. Finally, the software code is updated to utilize the new custom instruction.

LegUp's hybrid processor/accelerator target architecture has two main differences compared to an ASIP. First, custom instructions require the ASIP to stall while the hardware performs computation, whereas LegUp's loosely-coupled processor/accelerator architecture permits the hardware accelerators and the processor to run in tandem. Second, LegUp can synthesize large portions of a C program into hardware and is not limited to synthesizing small groups of instructions like an ASIP. ASIPs have an advantage for very small hardware accelerators because custom instructions can access the processor's register file.

3.3 LegUp Overview

In this section, we describe LegUp’s design methodology and the target architecture. Implementation details will follow in Section 3.4.

3.3.1 Design Methodology

In LegUp’s design methodology, the user begins with a C program which they compile and run on a processor to gather profiling information. Using this data, they partition the program into either software or hardware regions and recompile. Hardware regions are automatically synthesized by LegUp into digital logic on the FPGA and software regions execute in tandem on the processor. We provide a detailed flow chart for this methodology in Figure 3.1. In the first step, the user compiles a standard C software input program using the LLVM C compiler. The resulting binary is executed on an FPGA- based processor such as the Tiger MIPS soft processor [Tige 10] or a hardened on-chip ARM Cortex-A9 processor [ARM 11]. In the case of the Tiger MIPS processor, we have added custom profiling logic to the processor datapath to measure cycle counts accurately [Aldha 11a]. We choose hardware profiling to avoid the need to add instrumentation to the software program for profiling which can slow down program execution. Consequently, we can transparently obtain very accurate profiling measurements including exact cycle measurements of off-chip memory accesses and processor cache misses. We currently profile at the granularity of individual functions in the program. In the next step of our flow, hardware/software partitioning, the user analyzes the profiling data to identify critical functions in the program that could benefit from hardware acceleration. These are functions that could be synthesized into hardware with greater performance or improved energy consumption. The partitioning process is currently manual and requires the user to mark each function that should be implemented in hardware by LegUp in a Tcl configuration file. The remaining functions execute on the processor. The final step of the flow is labeled “LegUp” in Figure 3.1 because this step is what we typically refer to as LegUp in this dissertation. LegUp requires a C compiler and an HLS tool, both of which are built within the LLVM compiler framework. At this stage, the user has chosen functions to synthesize into Chapter 3. LegUp: Open-Source High-Level Synthesis Research Framework 26

Figure 3.1: LegUp Design Methodology.

Each hardware C function is synthesized into a separate hardware accelerator. Furthermore, if a function calls another function, then the called functions are also synthesized into hardware. Currently, only software functions executing on the processor can call hardware accelerators; the reverse is not supported. In the final stage, the LegUp C compiler modifies the software to use the hardware accelerators. For each hardware-partitioned function, LegUp adds specific code that will start the corresponding hardware accelerator and pass data between the processor and accelerator. We then execute this modified software on the FPGA-based processor. At the bottom of Figure 3.1, we show the target system architecture, consisting of a processor connected to multiple hardware accelerators on an FPGA device.

In this design methodology, the user can harness the performance and energy benefits of an FPGA using an incremental methodology while limiting time spent on hardware design. The LegUp programming flow bears some similarity to general-purpose GPU programming using the languages CUDA [CUDA 07] and OpenCL [Openc 09], in the sense that we allow the programmer to iteratively and incrementally work toward a speedup, with the whole program working at all times.

3.3.2 Target System Architecture

LegUp’s target FPGA-based system architecture is shown in Figure 3.2. We included a processor in our target system to support C program code that is inappropriate for hardware implementation. For example, searching a linked list in software is inherently sequential and will achieve limited speedup in hardware. However, highly parallel code, such as vector addition (Figure 3.5), can achieve much Chapter 3. LegUp: Open-Source High-Level Synthesis Research Framework 27

Figure 3.2: LegUp target hybrid processor/accelerator architecture.

Furthermore, offering the user a choice of running portions of the program on a processor increases the range of allowable input programs. Functions that require language features unsupported by hardware accelerators, for example recursion or dynamic memory, can be executed on the processor.

LegUp targets the Tiger MIPS soft processor from the University of Cambridge [Tige 10], which supports the full MIPS instruction set, has a mature tool flow, and is described in well-documented, modular Verilog. Mark Aldham evaluated two other FPGA-based soft processors to target with LegUp [Aldha 11b]: YACC [YACC 05] and SPREE [Yiann 07]. Alternatively, Blair Fort is finalizing LegUp support for targeting the on-chip ARM Cortex-A9 processor [ARM 11] available on the Altera DE1-SoC board [DE1 13], which includes a Cyclone V SoC FPGA device (for details see [Fort 14]). The Cortex-A9 processor operates at an 800 MHz clock frequency, significantly faster than the Tiger MIPS soft processor running at 74 MHz. Bain Syrowik benchmarked the Cortex-A9 and measured a geomean wall-clock time speedup of 9.4× over the Tiger MIPS for software execution across the CHStone benchmarks [ARM 14]. Furthermore, the Cortex-A9 includes 128-bit SIMD extensions (NEON) [ARM 11], which are very applicable to the benchmarks targeted by LegUp. However, our preliminary experiments have found that using NEON only offers a 20% geomean wall-clock time improvement across the CHStone benchmarks [ARM 14]. We could improve this performance by using hand-written NEON assembly instructions instead of relying on the LLVM compiler backend. In this chapter, we only target the Tiger MIPS soft processor for software execution.

The processor connects to one or more custom hardware accelerators through a standard on-chip interconnect. Presently, we use the Altera Avalon interconnect [Aval 10] for communication between the processor and the hardware accelerators, with LegUp automatically generating the Avalon interface using Altera's SOPC Builder tool [Docu 11] (now deprecated). We are also adding support for generating the interconnect with Altera's new interface generator, Qsys [Qsys 14]. For greater performance, the Avalon interconnect is implemented as point-to-point connections between communicating hardware modules instead of as a shared bus. In this system, hardware accelerators do not communicate with other hardware accelerators, only with the processor. The interconnect allows the processor and accelerators to communicate through a memory-mapped interface.

Our target architecture has a shared memory system, with all memory accesses from either the processor or the hardware accelerators going through an on-chip memory cache. The on-chip memory cache is based on the Tiger MIPS data cache but has been heavily modified by Jongsok Choi. The memory within the cache is implemented using fast FPGA block RAMs. If the requested memory is not contained within the cache, the cache controller requests the data from off-chip main memory. Having a single-level shared cache simplifies our architecture by not requiring cache coherency between multiple caches. Further detail on the cache architecture can be found in [Choi 12a]. The Tiger MIPS processor also contains a separate instruction cache.

We call the memory accessed through the on-chip cache processor memory, because this memory is shared between the processor and the hardware accelerators. We distinguish processor memory from the memory architecture within each hardware accelerator, which stores any memory (constants and local variables) that is not shared with the processor. This memory is stored in FPGA block RAMs and allows the hardware accelerator to use memory locally, without possible contention when accessing the Avalon interconnect, enabling greater parallelism and performance. Local memory also avoids cache misses and the associated latency of fetching off-chip memory. Memory within a hardware accelerator is handled by a separate memory controller, as described in Chapter 6. We concede that our performance may be limited by the shared memory cache if we instantiate many hardware accelerators that share memory with the processor. Fixing this bottleneck is outside the scope of this dissertation and we leave improvements to the processor/accelerator memory architecture as future work.

We support various target FPGA devices that are available on Altera development and education boards: the DE2 [DE2 10b] (Cyclone II FPGA), the DE2-115 [DE2 10a] (Cyclone IV FPGA), the DE4 [DE4 10] (Stratix IV FPGA), the DE1-SoC [DE1 13] (Cyclone V SoC FPGA), and the DE5-Net [DE5 13] (Stratix V GX FPGA). We note that most prior work on high-level hardware synthesis has focused on pure hardware implementations of C programs, not the hybrid software/hardware system we target in LegUp.

3.4 LegUp Design and Implementation

Before implementing LegUp, the author investigated two open-source compiler frameworks to leverage for our work: GCC [Stall 99] and LLVM [Lattn 04]. The GNU Compiler Collection (GCC) is a robust open-source compiler that is ubiquitous in the Linux community. GCC also compiles code that executes 5–10% faster than code compiled by LLVM. However, the GCC compiler has a steep learning curve due to a large, complex C codebase with heavy use of C global variables and macros. Furthermore, GCC provides no static single assignment (SSA) intermediate representation in its backend passes. In contrast, the low-level virtual machine (LLVM) compiler has great documentation and a modular, understandable C++ design. Adding new compiler passes and targets in LLVM is easy with a standard class API. LLVM also offers access to a consistent SSA intermediate representation at every stage of the compiler. The LLVM open-source license was also favourable, with an unrestricted BSD-style license [Rosen 05]. For these reasons, the author built LegUp within the LLVM compiler framework.

We programmed LegUp using modular C++, with HLS algorithms implemented as backend compiler passes that fit into the existing LLVM compiler framework. We have logically divided LegUp's C++ classes into the HLS steps previously discussed in Figure 2.2. Researchers can implement their own HLS algorithms as drop-in replacements for the existing algorithms in LegUp. As we discuss later in Section 3.4.6, users can easily verify circuit functionality and measure the quality of results across a suite of benchmarks after making modifications to LegUp.

The author implemented a data structure to represent the RTL description of the final circuit. After scheduling and binding, a hardware generation pass converts the LLVM instructions into this final RTL data structure. Then, a final pass writes out the RTL data structure as a synthesizable Verilog circuit description file. In the original implementation of LegUp (described in this chapter), the author implemented a list scheduler using as-soon-as-possible (ASAP) ordering as the priority. The author also implemented the bipartite weighted matching [Huang 90] binding algorithm within LegUp. Improvements to allow operator chaining in the scheduler were implemented by Victor Zhang. The hybrid flow, including the interconnect generation and the communication between the processor and hardware accelerators, was implemented by Jongsok Choi. The hardware profiler within Tiger MIPS was implemented by Mark Aldham. We focus on LegUp's first release for the experimental results in this chapter. Since then, Jason Anderson has implemented a new scheduler based on SDC scheduling [Cong 06b] (see Chapter 2), using the open-source lpsolve linear programming library [lpso 14]. Later, Stefan Hadjis implemented an area-saving algorithm for sharing groups of smaller functional units that appear in a particular configuration, or pattern, more than once in the program, as described in the co-authored publication [Hadji 12a]. Many other changes have been implemented since then by: Ruo Long (Lanny) Lian, Nazanin Calagar, Li Liu, Marcel Gort, Blair Fort, Bain Syrowik, Joy (Yu Ting) Chen, Julie Hsiao, Victor Zhang, Ahmed Kammoona, Kevin Nam, Qijing (Jenny) Huang, Ryan Xi, Emily Miao, Yolanda Wang, Yvonne Zhang, William Cai, and Mathew Hall. The author has acted in a mentorship role, advising these contributors on the best approach to extending LegUp. Throughout this dissertation, the author will make an effort to attribute any discussed LegUp functionality to its implementation author.

3.4.1 Hardware Modules

In the hardware generated by LegUp, each function from the input software program results in a distinct hardware module, except for small functions that are automatically inlined by the LLVM compiler. LegUp avoids inlining every function by default because we can save area when functions are called more than once in the program. If we inline a function in two places, then the final circuit will have duplicated hardware. Since LegUp performs resource sharing at the individual operator level, we can have difficulty sharing this large duplicated hardware region automatically. We also implement each function as a separate hardware module to simplify the hybrid system generation, where the user can specify an individual function to accelerate, as discussed in Section 3.4.4. We found that hardware simulations are clearer when debugging functions that are implemented in separate modules. LegUp also supports manually changing the function inline threshold to inline larger functions. Increasing the inline threshold can achieve higher performance by allowing further optimizations across function boundaries, and it can reduce hardware cycles spent communicating between hardware modules, at the cost of area. We found empirically that forcing LegUp to inline all functions in the CHStone benchmark suite reduced geomean wall-clock time by 9% but increased circuit area by 15%.

Given the following C function prototype in the input software program:

int function(int a, int *b);

LegUp would generate a hardware module with the interface given in Figure 3.3.

1  module function (...)
2      input clk;
3      input reset;
4
5      input [31:0] a;
6      input [31:0] b;
7
8      input start;
9      output reg finish;
10
11     output reg [31:0] memory_controller_address;
12     output reg memory_controller_enable;
13     output reg memory_controller_write_enable;
14     output reg [31:0] memory_controller_in;
15     input [31:0] memory_controller_out;
16     input memory_controller_waitrequest;
17
18     output reg [31:0] return_val;
19 endmodule

Figure 3.3: LegUp hardware module interface.


Figure 3.4: Initial state of a LegUp hardware module’s finite state machine.

The first two inputs are the clk and reset signals. The function parameters are on input ports a (a 32-bit integer) and b (a 32-bit memory address). The two ports start and finish are the module's control flow signals. The protocol for starting a module is to place valid inputs on the function parameter ports and then assert the start input. Each hardware module contains a finite state machine that controls the datapath. The start input is monitored by the first state of the module's finite state machine, as shown in Figure 3.4. We remain in the first state until the start input is asserted. The finish output is kept low until the last state of the state machine, when finish is asserted to indicate to the caller module that this hardware module has finished. Line 18 provides a registered output port, return_val, which is set to the function's integer return value. When finish is high, the return_val output port is driven with valid data. Lines 11–16 contain the interface to the global memory shared with the rest of the system. Table 3.2 provides a description of these memory signals. The memory architecture and instantiation hierarchy are explained further in Chapter 6.

3.4.2 Device Characterization

Each target FPGA device offers different speed, area, and power properties for generated hardware functional units, such as adders or multipliers. Consequently, LegUp includes a Perl script, written by Ahmed Kammoona, that characterizes each hardware operation, including all supported bitwidths (8, 16, 32, 64), for a given Altera FPGA family. The script synthesizes each operation in isolation for the target Altera FPGA family using Quartus II. Then, the script parses the operator propagation delay, the number of logic elements and registers, and the DSP block usage from the timing analysis report.

Table 3.2: LegUp memory signals.

Memory Signal                    Description
memory_controller_address        Memory address (32-bit)
memory_controller_enable         Memory clock enable
memory_controller_write_enable   If one, write to memory; otherwise read
memory_controller_in             Data to be written into memory
memory_controller_out            Data to be read out of memory
memory_controller_waitrequest    If one, hold the module's current state constant

The script also supports running simulations to measure the estimated operator power consumption by randomly toggling the input ports. The characteristics associated with each LLVM operator are generated by the script and stored in a Tcl configuration file. LegUp reads this characterization file during the allocation stage of HLS and uses the operator characteristics during scheduling and binding. We can also use this data to make early estimates of circuit speed and area for the hardware accelerators. These scripts have been improved by Ryan Xi, who added floating-point operations, and Joy (Yu Ting) Chen, who characterized the Cyclone V and Stratix V FPGAs.

3.4.3 Hardware Profiling

As discussed in Section 3.3.1, the Tiger MIPS soft processor has been modified to include a hardware profiler designed by Mark Aldham. The hardware profiler transparently monitors the processor during program execution to gather performance characteristics and identify critical functions. By default, the profiler tracks the number of clock cycles spent in each function of the running program, but the architecture is extensible to allow measurement of other metrics (e.g., energy). A set of performance counters, one per function, is maintained. To keep track of the currently executing function, the profiler monitors the processor bus for function call and return instructions and maintains a function call stack. The profiler is compatible with any input program and therefore does not require any changes to the underlying hardware for different programs. Mark has shown that the hardware profiler adds a 6.7% area overhead to the Tiger MIPS processor when configured to support 32 functions using 32-bit performance counters. Complete details on the profiler, including a description of profiling the estimated energy consumption of a program, can be found in [Aldha 11a]. The ARM Cortex-A9 processor also supports profiling, as investigated by Bain Syrowik [ARM 14]. Bain's approach uses ARM event counters to track function calls and to sample the number of clock cycles spent in each function.
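For intuition, the following is a small software model of the bookkeeping the hardware profiler performs with its counters and call stack; the on_call/on_return entry points stand in for the profiler observing call and return instructions on the processor bus, and all names here are illustrative.

#define MAX_FUNCS 32
#define MAX_DEPTH 64

static unsigned long cycles[MAX_FUNCS];  /* one cycle counter per function    */
static int call_stack[MAX_DEPTH];        /* indices of active functions       */
static int top;                          /* call_stack[top] is executing now  */
static unsigned long last_event;         /* cycle count at previous call/return */

static void attribute(unsigned long now)
{
    cycles[call_stack[top]] += now - last_event;  /* charge elapsed cycles */
    last_event = now;
}

void on_call(int callee, unsigned long now)   /* call instruction observed */
{
    attribute(now);
    call_stack[++top] = callee;
}

void on_return(unsigned long now)             /* return instruction observed */
{
    attribute(now);
    --top;
}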

3.4.4 Hybrid Processor/Accelerator System

As discussed in Section 3.3.2, our proposed target architecture is a hybrid system with the processor communicating with hardware accelerators. In this section, we discuss the behaviour of the processor and the modifications to the software required to support offloading computation to the hardware accelerators chosen by the user. Jongsok Choi implemented the LegUp hybrid flow. In the hybrid flow, the user selects the functions that they wish to synthesize into hardware accelerators. To utilize each accelerator, LegUp must modify the software program to perform these steps: 1) pass the function arguments from the processor to the hardware accelerator, 2) start the accelerator, 3) wait for the hardware computation to finish, and 4) retrieve any resultant data from the hardware.

void vector_add(int *A, int *B, int *C, int N) {
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

Figure 3.5: Vector addition C function targeted for hardware.

In step three, after activating a hardware accelerator, the processor must wait for the computation to complete. During this time, we can continue program execution on the processor if we do not immediately need results from the hardware accelerator. Alternatively, we can halt processor execution until the accelerator is finished. The first approach is implemented by polling a memory-mapped status register residing on the hardware accelerator that indicates when computation is finished. Polling can have a performance advantage by allowing parallel computation on the processor while we wait for the accelerator to finish and before we begin the polling loop. The second approach of stalling the processor is simpler to implement, and we can save energy by idling the processor. LegUp supports both behaviours, but we use the second approach (stalling) in the experimental study presented in this chapter.

For example, assume our program contains the C function shown in Figure 3.5, which we wish to implement in hardware. The function performs a vector addition over two N-element arrays, which are passed as the first two function parameters, and stores the output vector in the third parameter. These arrays are stored in processor memory, which is shared between the processor and accelerators. To transparently accelerate this function in hardware without changing the rest of the program, we replace the original vector_add function with the new function shown in Figure 3.6. The hardware accelerator memory-mapped address space in LegUp is 0xF0000000–0xFFFFFFFF, with hardware accelerators placed immediately after each other in the address space. In our address space, any address above 0x80000000, with the most significant bit equal to one, does not go to the on-chip memory cache. Rather, it is reserved for external I/O, which consists of the hardware accelerators (for details see [Choi 12b]). The Avalon interconnect and the logic implemented in each accelerator's Avalon slave component handle the address decoding for each accelerator's address range. In Figure 3.6, we first perform memory-mapped stores for each argument of the function on lines 10–13, which store the arguments in dedicated registers on the hardware accelerator. We then start the accelerator by writing to the START address on line 15, which also immediately stalls the processor. When the accelerator finishes, the Avalon waitrequest signal is de-asserted, allowing the processor to resume execution. Although not shown here, functions with a return value have an additional memory-mapped load from the accelerator at the end of the function to retrieve the return value.
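For contrast with the stalling wrapper of Figure 3.6, a polling variant might look like the sketch below. The VECTOR_ADD_STATUS address and the non-stalling start behaviour are assumptions for illustration only; the actual status-register layout is not shown in this chapter.

volatile int *VECTOR_ADD_STATUS = (volatile int *)0xF0000014; /* hypothetical */

void vector_add_polling(int *A, int *B, int *C, int N)
{
    *VECTOR_ADD_ARG_A = (int)A;   /* same argument stores as Figure 3.6 */
    *VECTOR_ADD_ARG_B = (int)B;
    *VECTOR_ADD_ARG_C = (int)C;
    *VECTOR_ADD_ARG_N = N;
    *VECTOR_ADD_START = 1;        /* assumed not to stall in polling mode */
    /* ...independent processor-side work can overlap with the accelerator... */
    while (*VECTOR_ADD_STATUS == 0)
        ;                         /* spin until the accelerator reports done */
}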

3.4.5 Language Support and Benchmarks

LegUp has extensive coverage of the ANSI C language features amenable to hardware synthesis, as summarized in Table 3.3. LegUp supports integer and floating-point arithmetic and all logical, comparison, ternary, and bitwise operators. We support arbitrary control flow, including any type of loop, switch statements, if statements, goto statements, and function calls. We support memory, including global variables and constants, multi-dimensional arrays, arbitrary pointers, and pointer arithmetic. Generally, we support a wider range of language constructs than other academic HLS tools. Victor Zhang added support for structs to LegUp, including structs with arrays, arrays of structs, and structs containing pointers.

1  // hardware accelerator memory-mapped address space starts at 0xF0000000
2  volatile int *VECTOR_ADD_START = (volatile int *)0xF0000000;
3  volatile int *VECTOR_ADD_ARG_A = (volatile int *)0xF0000004;
4  volatile int *VECTOR_ADD_ARG_B = (volatile int *)0xF0000008;
5  volatile int *VECTOR_ADD_ARG_C = (volatile int *)0xF000000C;
6  volatile int *VECTOR_ADD_ARG_N = (volatile int *)0xF0000010;
7
8  void vector_add(int *A, int *B, int *C, int N) {
9      // pass arguments to hardware accelerator using memory-mapped stores
10     *VECTOR_ADD_ARG_A = (int)A;
11     *VECTOR_ADD_ARG_B = (int)B;
12     *VECTOR_ADD_ARG_C = (int)C;
13     *VECTOR_ADD_ARG_N = N;
14     // start the hardware accelerator and stall the processor until finished
15     *VECTOR_ADD_START = 1;
16 }

Figure 3.6: Modified C function to call hardware accelerator for function in Figure 3.5.

Table 3.3: LegUp C language support.

Supported                    Unsupported
Functions                    Dynamic Memory
Arrays, Structs              Recursion
Global Variables             Function Pointers
Pointer Arithmetic           Unaligned Memory Accesses
Floating-point Arithmetic

LegUp stores structs in memory using the ANSI C alignment standards to ensure that any hardware function can access struct elements allocated in the processor's memory without requiring extra changes. Structs are stored in 64-bit wide block RAMs to ensure that elements up to 64 bits in size can be accessed in a single memory operation. Language features unsupported by LegUp include: dynamic memory allocation, recursion, function pointers, and functions that return a struct. Another limitation is that all memory accesses must be word aligned; for example, we cannot load only the upper byte of an integer. The user could hypothetically compile their own custom memory allocator library to support dynamic memory. We provide an example of such a library, implemented by Victor Zhang, in the included LegUp benchmarks, where we store all dynamic memory in a predefined, statically-sized 1024-byte memory heap. Recursion could be supported in the future by adding a stack controller, as described in [Jasro 04]. Any regions of the program using unsupported features should remain in software running on the processor.
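A bump-pointer allocator is the simplest form such a custom library could take; the sketch below is illustrative only and is simpler than the bundled example (it never reclaims memory).

#define HEAP_SIZE 1024           /* predefined statically-sized heap */
static char heap[HEAP_SIZE];
static unsigned heap_top;

void *my_malloc(unsigned bytes)
{
    bytes = (bytes + 7u) & ~7u;              /* keep 8-byte alignment */
    if (heap_top + bytes > HEAP_SIZE)
        return 0;                            /* static heap exhausted */
    void *p = &heap[heap_top];
    heap_top += bytes;
    return p;
}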

Table 3.4: Core benchmark programs included with LegUp.

Category         Benchmarks                            Lines of C
Arithmetic       Double-precision floating-point       363–789
                 Add, Mult, Div, Sin
Encryption       AES, Blowfish, SHA                    723–1,413
Processor        MIPS processor                        232
Media            JPEG decoder,                         441–1,692
                 Motion vector decoding
Communications   GSM, ADPCM                            380–547
Synthetic        Dhrystone                             491

LegUp includes a suite of benchmark C programs that the user can use to evaluate the HLS quality of results. We show a list of the 13 benchmarks in Table 3.4, which includes all 12 benchmarks from the CHStone high-level synthesis benchmark suite [Hara 09] and Dhrystone [Weick 84], a popular synthetic benchmark. We chose the benchmarks to be representative of the types of programs synthesized by an HLS tool. We include programs from the following categories: floating-point arithmetic, encryption and hashing, processor emulation, media, communications, and synthetic. The benchmarks range in size from 232–1,692 lines of C code. The arithmetic benchmarks implement various double-precision floating-point operations using mainly bitwise operations on 64-bit wide integer types. The processor benchmark emulates a basic MIPS processor for a predefined program implemented in MIPS machine code. These benchmarks require substantial C language coverage to be synthesized by an HLS tool; for instance, the Dhrystone benchmark contains structs that are used as a linked list. In fact, before LegUp was released, no academic tool was robust enough to support the entire CHStone suite — even commercial tools generated functionally incorrect circuits for some of these benchmarks. Furthermore, these benchmarks are larger than prior benchmarks used in academic publications. For example, the HLS area minimization work by Cong [Cong 10] and Zhang [Zhang 10] uses benchmarks that range from 600–4,500 slices on a Xilinx Virtex-4 FPGA (90nm). Each Virtex-4 slice contains two 4-input LUTs [Virt 10], while on a Cyclone II FPGA (90nm) each logic element (LE) contains a single 4-input LUT [Cycl 04]. As a first-order approximation, we can assume that each Virtex-4 slice corresponds to roughly two Cyclone II LEs. Therefore, their biggest benchmark is 4,500 slices, which is under 10,000 LEs. The geomean circuit area for the LegUp benchmark suite is 50% larger (over 15,000 LEs), with the jpeg benchmark consuming over 46,000 LEs, an order of magnitude larger. Typically, 5–6 benchmarks are used to perform HLS experimental studies; LegUp offers researchers double this number, all of which are full programs instead of individual functions or kernels. At the time of our first publication [Canis 11], to our knowledge, these were the largest HLS benchmarks ever synthesized by an academic tool in the literature. Until LegUp was released, the CHStone benchmarks could not be studied in depth simply because no academic tool could support them. Therefore, a key differentiator of LegUp relative to prior work is that we allow researchers to study HLS when applied to larger and more complex C programs than before.

3.4.6 Circuit Correctness

Circuit correctness is the most important requirement of LegUp, which is why we distribute LegUp with a rigorous automated test suite including hundreds of tests. We must verify that the generated RTL design simulates correctly and produces a functionally correct circuit under a wide range of input programs so that academics can spend time on research instead of debugging infrastructure. In other CAD areas, such as place and route, a bad placement will still be functionally correct and we can easily verify that our final placement matches the original netlist. In contrast, verifying that the C input matches the circuit output of high-level synthesis is non-trivial. Consequently, high-level synthesis research and development is inherently prone to introducing bugs or regressions in the final circuit functionality. Even a single misplaced register or an operation scheduled one cycle too soon can break the functionality of the final circuit! Furthermore, manually debugging the RTL code auto-generated during HLS can be challenging and tedious. Our test suite helps give academic end-users confidence in the core LegUp HLS algorithms, and they can use these tests to verify circuit correctness after implementing novel algorithms in LegUp.

The CHStone [Hara 09] benchmarks we described earlier each contain built-in input vectors that exercise the program execution and golden output vectors to verify that the program generated the correct output. Consequently, when the programs are synthesized into hardware, we can simulate or run the circuit on the FPGA device to verify the correct functionality using the golden input and output test vectors. This is analogous to built-in self-test techniques [McClu 85] used for verifying chip functionality, with no user input required. These test vectors are also marked by the volatile keyword to prevent the LLVM compiler from performing constant propagation and aggressive optimizations. For example, the mips CHStone benchmark contains a predefined unordered 8-element array as input, and the corresponding 8-element sorted array as the golden output. The MIPS processor emulates a 44-instruction MIPS program that sorts the array, after which the program verifies that the sorted array matches the expected golden output. For tracking down circuit correctness issues, LegUp also provides a pass that annotates each basic block with a print statement that dumps out the value of every register assigned in the basic block. We can then synthesize this new program to hardware and simulate the circuit to generate a log of all register changes. Next, we compare the simulation log to the output from running the annotated program in software, which prints out the correct register values. We can quickly discover the exact point during the hardware simulation when the register values become incorrect. In practice, we found this procedure helpful for debugging incorrect circuits synthesized by LegUp. Nazanin Calagar recently proposed a debugger for LegUp [Calag 14] that has a similar feature, which compares the execution state between software and an equivalent synthesized circuit running on the target hardware device.
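As a sketch of the self-checking style described at the start of this subsection, the following hypothetical C program follows the CHStone pattern of volatile input and golden output vectors around a computation under test; the real CHStone sources are larger and differ in detail.

#include <stdio.h>

/* Input and golden output vectors marked volatile so the compiler cannot
 * constant-fold the computation away (in the style of the mips CHStone
 * benchmark, which sorts an 8-element array). */
volatile int input[8]  = {7, 3, 5, 1, 8, 2, 6, 4};
volatile int golden[8] = {1, 2, 3, 4, 5, 6, 7, 8};

int main(void)
{
    int a[8], i, j, t, fail = 0;

    for (i = 0; i < 8; i++)
        a[i] = input[i];

    /* Computation under test: a simple bubble sort. */
    for (i = 0; i < 8; i++)
        for (j = 0; j < 7 - i; j++)
            if (a[j] > a[j + 1]) { t = a[j]; a[j] = a[j + 1]; a[j + 1] = t; }

    /* Built-in self-check against the golden output vector. */
    for (i = 0; i < 8; i++)
        if (a[i] != golden[i])
            fail = 1;

    printf(fail ? "FAIL\n" : "PASS\n");
    return fail;
}

The same pass/fail result can be observed in RTL simulation or on the FPGA, so no user input is required to verify the synthesized circuit.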

3.4.7 Extensibility of LegUp to Other FPGA Devices

We now describe the extensibility of the LegUp framework to target other FPGA devices, in particular Xilinx FPGAs, which may be the only FPGAs available to some researchers. If we wish to target a new FPGA device with LegUp, then our first step is to re-characterize all the hardware operation propagation delays and area metrics using the script described in Section 3.4.2. LegUp requires these updated metrics to accurately schedule and bind the operators (e.g. shift, add) in the program to meet user timing and area constraints for the target device. Our characterization script assumes an Altera FPGA target; therefore, the script would need to be rewritten to target a Xilinx FPGA. We designed the format of the Tcl configuration file containing the operator characteristics to be FPGA agnostic. Second, the Verilog hardware description generated by LegUp requires vendor-specific hardware modules for: floating point operations, integer divide, and integer multiply. These operations are implemented by instantiating Altera-specific megafunctions. In LegUp, memories are inferred by Verilog statements, which should be vendor-agnostic. However, if structs are used in the C program then block RAMs are instances of Altera's ALTSYNCRAM megafunction, because byte-enables are not supported by RAMs inferred in Verilog. To target a new FPGA, we would need to replace all of these Altera megafunctions with either the equivalent vendor-specific hardware modules for the target FPGA or with generic vendor-agnostic hardware modules. The advantage of using generic hardware modules is that we could support an academic tool such as the open-source VTR (Verilog-to-Routing) FPGA CAD flow being developed here at the University of Toronto [Luu 14a]. An option for generating generic floating-point functional units is to use FloPoCo [De Di 11], or alternatively to use LegUp to synthesize floating-point cores from a C implementation.

So far, we have only discussed changes required when targeting a new FPGA with the hardware-only LegUp flow. However, supporting another FPGA target for the hybrid processor/accelerator architecture would require a few additional changes. First, the Tiger MIPS processor contains Altera megafunctions for memory, multiplication, and division. These would have to be changed to equivalent vendor-specific modules on the new FPGA. Next, the interconnect between the processor and the hardware accelerators would have to be changed from Altera-specific Avalon to another bus protocol, such as the Advanced Microcontroller Bus Architecture [AMBA 03] defined by ARM and used by Xilinx FPGAs. We should ensure that this new interconnect is in the same memory-mapped point-to-point configuration as the current Avalon bus.

3.5 Experimental Study

In this section, we present an experimental study performed using the first version of LegUp in 2010. This study has three objectives. First, we would like to compare the quality of results in terms of speed, area, and energy of the circuits produced by LegUp compared to those synthesized by a commercial HLS tool. We chose the commercial HLS tool eXCite [eXCi 10] because it was the only HLS tool we were given access to that could compile our benchmark programs. eXCite has been actively developed since 1995 after being spun out of HLS research at UC Irvine and it is representative of commercially available HLS tools. Second, we want to investigate hardware/software partitioning choices for our benchmarks to explore the available design space. Third, we want to quantify the improvement of a LegUp hardware implementation compared to executing our benchmarks on a processor. To achieve these objectives, we ran five different experiments across the benchmark suite, starting from a software-only implementation and then successively increasing the amount of computation performed in synthesized hardware. The experiments are as follows (with labels appearing in parentheses):

1. A software-only implementation executing on the MIPS soft processor (MIPS-SW).

2. A hybrid software/hardware implementation where the second most compute-intensive function¹ and its descendant functions are implemented as a hardware accelerator, with the rest of the benchmark running in software on the MIPS processor (LegUp-Hybrid2).

3. A hybrid software/hardware implementation where the most compute-intensive function and all its descendants are implemented as a hardware accelerator, with the rest executing in software (LegUp-Hybrid1).

4. A hardware-only implementation synthesized by LegUp, with no processor (LegUp-HW).

5. A hardware-only implementation synthesized by eXCite (eXCite-HW)².

The two hybrid flows target a system with the Tiger MIPS processor and a single hardware accelerator synthesized from one C function and all of its descendant functions. We used Quartus II 9.1 SP2 to target the Cyclone II FPGA on the DE2 board [DE2 10b], in timing-driven mode with all physical synthesis optimizations turned on³. We verified the circuit correctness of all implementations using post-routed ModelSim simulations and we also verified the designs in hardware using the Altera DE2 board. The experimental data presented here for the hybrid implementations were collected by Jongsok Choi. The experimental data for the eXCite commercial tool were collected by Jason Anderson. Mark Aldham also helped to gather these experimental results and he performed the energy and power analysis presented later in this section.

In this study, we measure the circuit speed, area, and energy consumption to assess the quality of results. Circuit speed consists of the wall-clock execution time of the circuit, the post-routed maximum clock frequency reported by Quartus, and the number of clock cycles required to complete execution. We calculate the wall-clock time by multiplying the number of clock cycles by the reciprocal of the maximum clock frequency. Circuit area consists of the number of Cyclone II logic elements (LEs), the number of memory bits, and the number of 9x9 DSP multiplier blocks. The energy consumption of an embedded system is typically a major design constraint, especially for battery-powered mobile devices. To measure circuit energy, Mark Aldham used Altera's PowerPlay power analyzer tool on the final post-routed circuit. He performed a post-route netlist simulation using Mentor Graphics' ModelSim to gather circuit switching activity data for each benchmark. ModelSim generates a VCD (value change dump) file containing the switching activity for each design signal. PowerPlay reads this VCD file and calculates a power estimate for the design. Finally, we compute the total energy consumption of each benchmark by multiplying the average core dynamic power by the benchmark's total execution time.

¹Not considering the main() function.
²The eXCite implementations were produced by running the tool with the default options.
³The eXCite implementation of the jpeg benchmark was synthesized without physical synthesis optimizations turned on in Quartus II, as with such optimizations, the benchmark could not fit into the largest Cyclone II device.
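The speed and energy metrics above combine through two simple formulas, wall-clock time = cycles / FMax and energy = power × time, as in the following small C example; the input values are illustrative placeholders, not measured results from our tables.

#include <stdio.h>

int main(void)
{
    double cycles   = 1.0e6;  /* clock cycles to complete the benchmark */
    double fmax_mhz = 74.0;   /* post-routed maximum clock frequency (MHz) */
    double power_mw = 200.0;  /* average core dynamic power (mW) */

    double time_us   = cycles / fmax_mhz;  /* in µs, since 1 MHz = 1 cycle/µs */
    double energy_nj = power_mw * time_us; /* mW × µs = nJ */

    printf("time = %.1f us, energy = %.1f nJ\n", time_us, energy_nj);
    return 0;
}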

3.5.1 Experimental Results

We present the circuit speed measurements across the benchmark suite for all our experiments in Table 3.5. The experiments are presented from left to right in the order specified previously, with software-only on the left and hardware-only on the right. Three speed metrics are shown in columns for each experimental flow: Cycles gives the required number of clock cycles, Freq shows the post-routed maximum clock frequency (MHz), and Time lists the total wall-clock execution time (µs). The second last row of the table shows the geometric mean results for each column. We excluded dhrystone from the geomean calculation because eXCite could not synthesize this benchmark. The last row of the table gives a ratio of the geomean of each column relative to the corresponding metric of the software-only flow (MIPS-SW).

Table 3.5: Speed performance results. (For each flow, MIPS-SW, LegUp-Hybrid2, LegUp-Hybrid1, LegUp-HW, and eXCite-HW: Cycles, Freq. (MHz), and Time (µs) per benchmark, with geomean and ratio rows.)

In the MIPS-SW flow shown in Table 3.5, we measured the clock frequency of the processor at 74 MHz, with the benchmarks completing within 6.8K–30M clock cycles. The wall-clock time for the benchmarks ranged from 92–401K µs. For comparison, we executed these benchmarks on an Altera NIOS II/f (fast) soft processor and found that the performance was twice as fast as the Tiger MIPS processor. However, the NIOS II is not open-source and has a 6-stage pipeline, while the open-source Tiger MIPS has a 5-stage pipeline and is not tuned for Altera FPGAs. In the LegUp-Hybrid2 flow, we implemented the second most compute-intensive function and its descendants in hardware. We observe that the geomean number of clock cycles during execution is cut in half compared to a software implementation. However, the Hybrid2 benchmarks have a 10% lower geomean clock frequency than the processor, resulting in an overall geomean wall-clock time improvement of 45%, or a 1.8× speed-up, compared to MIPS-SW. The next LegUp-Hybrid1 flow implements additional computation in hardware. We show in Table 3.5 that the number of cycles is 75% better in LegUp-Hybrid1 than software-only. The geomean clock frequency is again lower than the processor by 12%, resulting in an overall geomean wall-clock time 72% faster, or a 3.6× speed-up over MIPS-SW. We observe the following trend: as we synthesize a greater proportion of the program into hardware, we measure an increase in performance. We note that these clock frequencies are for targeting a Cyclone II FPGA and would be higher if we were targeting a 40nm Stratix IV FPGA.

The last two experiments shown on the right of Table 3.5 demonstrate a hardware-only flow with either LegUp or the commercial HLS tool eXCite. We observe that the benchmarks synthesized to hardware with LegUp (LegUp-HW) have a geomean cycle execution time 88% faster than the software-only implementation and have approximately the same geomean clock frequency. When we synthesize the benchmarks with eXCite, the geomean number of cycles is even lower, with only 8% of that required by the software-only flow. However, the geomean clock frequency of the eXCite-generated circuits is 45% worse than the MIPS processor. We observed that eXCite tends to perform more operator chaining than LegUp during scheduling, which can hurt the clock frequency but improve the number of cycles required by the benchmark. The overall geomean wall-clock time improvement over MIPS-SW was comparable for both LegUp and eXCite, with LegUp-HW providing an 88% improvement, or an 8× speed-up, and eXCite-HW providing an 85% improvement, or a 6.7× speed-up. These are significant performance improvements over running these benchmarks on the processor. We saw the greatest speed-up on the dfsin benchmark, with LegUp improving wall-clock time by over 34×. Across these benchmarks, LegUp-generated circuits achieved an average wall-clock execution time that was 20% faster than equivalent circuits synthesized by eXCite. Therefore, we can infer that our HLS implementation is of reasonable quality.

We observe no performance benefit from running a portion of these benchmarks in software, as explored in the hybrid scenarios. Furthermore, these benchmarks contain no unsupported C language constructs that would be forced to run in software. However, exploring this software/hardware design space is useful to exercise LegUp functionality and to verify the correctness of these generated hybrid systems. We now discuss a few outliers in the results shown in Table 3.5. For the aes benchmark, the LegUp-HW implementation has nearly 5× faster wall-clock time execution than the eXCite-HW implementation. Conversely, for the motion benchmark, LegUp's implementation is nearly 4× slower in terms of clock cycles than eXCite's implementation. We attribute these differences to the greater amount of functional unit pipelining performed by LegUp, particularly for division operations. This pipelining causes higher cycle latencies for LegUp-synthesized circuits compared to those produced by eXCite but improves the overall clock frequency. For the jpeg benchmark, the LegUp-Hybrid1 implementation has a higher wall-clock time than the LegUp-Hybrid2 implementation despite offloading a greater proportion of the program to hardware. This was caused by an increase in the number of memories, and consequently multiplexing, in the memory controller, which decreased the clock frequency. We show area results across the benchmarks for each flow in Table 3.6.
The area metrics are shown in groups of three columns: the number of Cyclone II logic elements (LEs), the number of memory bits used (# bits), and the number of 9x9 DSP block multipliers (Mults). Like the performance table presented earlier, we provide the geometric mean and the ratio of columns relative to MIPS-SW in the last two rows of Table 3.6. We calculated the geomean for columns containing zeros by replacing the zeros with ones⁴.

⁴This convention is used in life sciences studies.

Table 3.6: Area results. (For each flow, MIPS-SW, LegUp-Hybrid2, LegUp-Hybrid1, LegUp-HW, and eXCite-HW: LEs, # bits, and Mults per benchmark, with geomean and ratio rows.)

We show in Table 3.6 that the MIPS processor requires 12.2K LEs, 226K memory bits, and 16 multipliers. The hybrid system consists of both the MIPS processor and a custom hardware accelerator, therefore consuming more area. We observed that the LegUp-Hybrid2 flow increases the number of LEs, memory bits, and multipliers by 2.23×, 1.14×, and 2.68×, respectively, compared to MIPS-SW. The LegUp-Hybrid1 flow generates a larger hardware accelerator requiring 2.75× LEs, 1.16× memory bits, and 3.18× multipliers compared to MIPS-SW. We disabled link time optimizations in LLVM during the hybrid flows. Link time optimizations are late-stage compiler optimizations performed after linking object files, which we found were inlining the function we were trying to accelerate in the hybrid flow. However, we enabled link time compiler optimizations for the MIPS-SW and LegUp-HW flows because these optimizations can significantly improve circuit speed and area. For example, for the jpeg benchmark the LegUp-Hybrid1 implementation has a larger circuit area than the combined area of the MIPS-SW and LegUp-HW implementations due to disabling link time optimizations in LegUp-Hybrid1. For the hardware-only implementations shown in Table 3.6, the LegUp-HW flow requires 28% more LEs than the MIPS processor on average, while the eXCite-HW implementations require 7% more geomean LEs than MIPS-SW. We observe that both the LegUp-HW and the eXCite-HW flows require far fewer memory bits than the MIPS processor alone. We found that LegUp-HW implementations required more 9x9 multipliers than the corresponding benchmarks synthesized by eXCite. We believe this is due to more aggressive multiplier sharing performed by eXCite during binding. Focusing on Cyclone II logic elements, the LegUp hardware-only implementations require on average 19% more LEs than circuits produced by eXCite. We can also multiply the wall-clock time and logic elements to calculate an area-delay product metric, to account for the inherent trade-off between area and delay. We find that LegUp-HW and eXCite-HW have nearly identical area-delay products: ∼4.6M µs-LEs vs. ∼4.7M µs-LEs, with LegUp requiring more LEs on average but achieving better wall-clock time. We consider these results encouraging, given that this study used the first version of LegUp.

We show the power and energy consumption across the benchmarks for each flow in Table 3.7. We measured the average dynamic power consumption (mW) and the total energy consumption (nJ) for each circuit. We observe in Table 3.7 that the dynamic power of the processor is about the same as the geomean power of the LegUp-Hybrid1 flow, with the dynamic power increasing by 12% on average in LegUp-Hybrid2. The hardware-only implementations consume significantly less geomean dynamic power than the processor, with the LegUp-HW flow requiring 55% less power, and the eXCite-HW flow requiring 70% less power. The geomean energy consumption of each flow shows an even greater improvement than dynamic power, which can be explained through the equation: Energy = Power × Time. We observe better energy consumption from lower power, but also from the improvement in the wall-clock time of the faster hybrid and hardware-only flows. Consequently, energy consumption improves dramatically as we synthesize increasing amounts of computation into hardware. The LegUp-Hybrid2 flow uses 47% less energy and the LegUp-Hybrid1 flow uses 76% less energy on average than the processor, or a 1.9× and 4.2× reduction in energy consumption. The hardware-only implementations consume even less energy, with the LegUp-HW flow consuming 94% less energy on average than the MIPS-SW flow, an 18× energy reduction. The eXCite-HW flow uses over 95% less energy than the processor, a 22× energy reduction.

Table 3.7: Power and energy results [Aldha 11b]. (For each flow, MIPS-SW, LegUp-Hybrid2, LegUp-Hybrid1, LegUp-HW, and eXCite-HW: Power (mW) and Energy (nJ) per benchmark, with geomean and ratio rows.)

Figure 3.7 summarizes all the geomean wall-clock times, cycle counts, clock frequencies, logic elements, power, and energy results across the benchmark suite for the five flows we considered here. The horizontal axis labels the flow for each group of six metrics. The vertical axis provides the ratio of each measurement to the corresponding metric in the MIPS-SW flow. Figure 3.7 shows that the hardware-only implementations offer the best performance compared to software-only or hybrid implementations. The plot demonstrates LegUp's usefulness for exploring the hardware/software design space.

Figure 3.7: Summary of geomean experimental results across the benchmark suite.

3.5.2 Comparison to Current LegUp Release

We have made considerable improvements to the HLS algorithms since the first release of LegUp was used to perform this study. The current version of LegUp is almost ready for our fourth release. To illustrate this improvement, we compare the quality of results produced by the current version of LegUp against the original LegUp release we used in this chapter. We use the same benchmarks as presented earlier, and we target the Altera Cyclone II FPGA (this was the only device supported by LegUp 1.0). We use the pure hardware LegUp flow without a processor and specify a difficult-to-meet timing constraint. Table 3.8 shows the comparison. Column 1 gives the name of each benchmark. The next columns give the number of Cyclone II LEs, memory bits, and 9x9 multipliers, execution cycles, FMax (MHz), and wall-clock time (µs), respectively. For readability, we repeat the same results presented earlier in this section in the "1.0" columns, which we compare side-by-side to the current version of LegUp in the "Cur" columns. The second last row presents geometric mean data for each column, while the last row presents the ratio of the current LegUp geometric mean vs. LegUp 1.0. On almost all metrics, we see significant quality-of-results improvements vs. the first LegUp release. On average, wall-clock time improved by 48%, cycle count by 38%, FMax by 19%, LEs by 41%, and multipliers by 8%. The only metric where we perform worse is memory bit usage, which increases by 17% on average in the current version of LegUp.

The majority of this improvement in circuit quality can be traced back to the following changes. First, the author fixed the scheduling of phi and branch instructions, which could be chained with other operations to decrease the number of clock cycles. The author also removed combinational loops that could occur in the binding step of LegUp, which were reducing the circuit clock frequency. Qijing (Jenny) Huang improved LegUp to use dual-port memories instead of single-port memories, which allowed greater instruction-level parallelism. In 2010, Yuko Hara updated the jpeg benchmark in the CHStone benchmark suite to contain an approximately 50% smaller image, which sped up the benchmark. Jason Anderson experimented with different clock period constraints and achieved better geomean performance across the CHStone benchmarks. Finally, the memory architecture described in Chapter 6 also improved performance and area.

Table 3.8: LegUp 1.0 vs. current LegUp version (hardware-only implementation). (Columns: LEs, # bits, Mults, Cycles, Freq. (MHz), and Time (µs) for LegUp 1.0 ("1.0") and the current version ("Cur") across the CHStone benchmarks and dhrystone, with geomean and ratio rows.)

3.6 Research using LegUp

In this section, we give examples of how LegUp has enabled further high-level synthesis research by highlighting recent publications that have used LegUp. The co-authored work by Huang [Huang 13] was the first to study the impact of standard software compiler optimizations on high-level synthesis. Huang proposed methods of formulating a specific sequence of compiler passes tailored for our benchmarks, giving a 16% faster geomean wall-clock time compared to the default LegUp −O3 optimization passes. The LegUp framework has also allowed architecture studies on processor/parallel-accelerator systems, like the co-authored study by Choi on the impact of cache architecture on the performance and area of our system [Choi 12a].

Two recent works have focused on debugging in HLS, using LegUp as the backend. Calagar [Calag 14] proposed a source-level debugging framework that offers gdb-like step, break, and data inspection functionality for an HLS-generated hardware circuit. With the proposed framework, the user can inspect the values of logic signals in the hardware from the C source code perspective. The logic signal values come from one of two sources: 1) a logic simulation of the RTL, or 2) an actual execution of the hardware on an FPGA. Goeders [Goede 14] proposed inserting debug instrumentation into the LegUp-generated circuit, which allows a debugger application to start and stop the circuit, monitor variables, and set breakpoints. The instrumentation contains trace buffers to record the control and data flow in real time, allowing the debugger to retrieve this data and replay the execution in a GUI.

HLS area optimizations have also been studied using LegUp. Gort [Gort 13] presented an algorithm for reducing area by minimizing signal bitwidths in LegUp. Gort proposed inferring bitmasks and ranges for variables using constant propagation at compile-time. For programs with predictable inputs, he used run-time profiling data to determine variable ranges and optimize further. Klimovic [Klimo 13] proposed using LegUp to optimize hardware accelerators for common-case inputs, as opposed to worst-case inputs, allowing accelerator area to be reduced by 28%. When inputs exceed the range that the hardware accelerators can handle, a software fallback function is automatically triggered. The co-authored work by Hadjis [Hadji 12a] used LegUp to investigate the impact of FPGA architecture on resource sharing patterns of interconnected operators. Hadjis found that the type of operations that are beneficial for high-level synthesis resource sharing varies depending on whether we target Cyclone II (4-LUT) or Stratix IV (6-LUT) Altera FPGA architectures.

We have made some impressive strides towards making FPGAs easier to program with LegUp. This was evidenced by a group of undergraduates, Victor Zhang, Ahmed Kammoona, and Bryce Long, who extended LegUp to support PCIe communication between a Stratix IV FPGA and a host PC. They displayed a Mandelbrot animation on the PC's monitor where the computation was offloaded to 128 accelerators running on the FPGA, which executed 5.5× faster and with 5.0× less energy than the same program executing dual-threaded on an Intel Core 2 Duo processor. Most importantly, the Mandelbrot kernels were entirely synthesized by LegUp, with no hand-coded RTL required. Ruo Long (Lanny) Lian and William Cai also used LegUp to generate a working hardware implementation of an artificial intelligence that could play the two-player abstract puzzle game Blokus Duo. They entered the synthesized hardware design into the FPT 2013 design competition [Cai 13] to compete against other hardware implementations.

3.7 Summary

In this chapter, we introduced an open-source HLS research framework called LegUp. LegUp can synthesize a standard C program into a hybrid FPGA-based processor/accelerator architecture consisting of a processor communicating with custom hardware accelerators. With LegUp, researchers can explore the hardware/software design space, in which a program is partitioned into functions that are synthesized automatically into custom hardware circuits, while the remaining functions execute in software on an FPGA-based processor. Our experimental results have shown that a suite of benchmarks synthesized into hardware-only implementations by the first release of LegUp execute 8× faster and consume 18× less energy than when executed in software on a MIPS soft processor. We also show that LegUp's synthesized hardware circuits are comparable to those generated by a commercial HLS tool, eXCite, both in terms of circuit wall-clock time and in area-delay product. Our overarching goal with LegUp is to make programming FPGA devices easier and more accessible to software engineers. We hope to expand the number of users who can leverage FPGA devices to speed up specific applications, particularly in the embedded systems community. LegUp, along with its suite of benchmark C programs, is a robust, well-tested open-source platform for HLS research that has enabled many research studies over the past few years. We expect LegUp will continue to support a variety of research advances in hardware synthesis, as well as in hardware/software co-design. LegUp is available for download at: http://legup.eecg.utoronto.ca (or http://www.legup.org).

Chapter 4

Multi-Pumping for Resource Reduction in FPGA High-Level Synthesis

4.1 Introduction

LegUp enables us to study high-level synthesis techniques aimed at exploiting specific characteristics of an FPGA architecture. In this chapter, we will present a high-level synthesis area optimization that is particularly suitable for modern FPGA architectures. In high-level synthesis, we often must meet user-defined resource constraints by minimizing the area of the synthesized circuit. Area reduction is traditionally accomplished by resource sharing: using the same functional unit to perform two or more operations. A limitation of resource sharing is that operations sharing the same functional unit must be scheduled into mutually exclusive clock cycles. Consequently, resource sharing can lengthen the overall schedule, hurting circuit performance. In this chapter, we present a new approach to resource sharing by applying the technique of multi-pumping, which can overcome this limitation. Multi-pumping refers to the existing circuit technique of operating a hardware block at a higher clock frequency than its surrounding system. Typically, the multi-pumped unit is clocked at twice the system frequency, or double-data-rate (DDR). We can share a single DDR multi-pumped functional unit between two operations that are scheduled during the same system clock cycle. Multi-pumping is an area-reduction technique that does considerably less harm to speed performance in comparison to traditional resource sharing (provided that the functional units can indeed be clocked at 2× the system clock frequency).

Modern FPGA architectures contain prefabricated special purpose "hard" blocks such as block RAMs, DSP blocks, and even entire processors. These blocks are implemented as ASIC-like hard IP blocks, and are distinct from the reconfigurable "soft" logic, comprised of lookup tables (LUTs), registers, and other programmable circuitry. This is quite different from ASIC design, where standard cells have similar timing characteristics varying only modestly with cell size. DSP blocks can perform the types of operations essential to digital signal processing, multiply and multiply-accumulate operations, with very high speed and low power. In modern FPGAs, such as Stratix IV [Stra 10], DSP blocks can operate at speeds above 500MHz, whereas typical FPGA designs operate at considerably lower speeds (in the 100–300MHz range). Consequently, DSP blocks are particularly suitable for our multi-pumping sharing technique. Therefore, in this work, we focus on multi-pumping DSP blocks; however, the ideas proposed are applicable to other types of blocks. To the best of our knowledge, this is the first work to apply multi-pumping for resource reduction automatically in a high-level synthesis context.

We evaluate the multi-pumping approach compared to traditional resource sharing and target the Altera Stratix IV 40-nm commercial FPGA [Stra 10]. Our results show that resource sharing using multi-pumping is an effective approach for saving circuit area. Furthermore, compared to traditional HLS resource sharing, multi-pumping achieves the same area reduction but with significant performance advantages. Specifically, to achieve a 50% reduction in DSP blocks, traditional resource sharing decreases circuit speed performance by 80%, on average, whereas multi-pumping decreases circuit speed by just 5%. Multi-pumping is a viable approach to achieve the area reductions of resource sharing, with considerably less negative impact to circuit performance.

The remainder of this chapter is organized as follows: Section 4.2 presents related work. Section 4.3 introduces the concept of multi-pumping, provides a characterization of the multi-pumped multiplier unit, and compares multi-pumping to traditional resource sharing. Section 4.4 describes the high-level synthesis algorithms necessary for resource reduction using multi-pumping. Section 4.5 presents an experimental study. Section 4.6 offers a summary.

4.2 Background

Resource sharing has been studied extensively in the high-level synthesis literature over the past two decades [Cong 11, Gajsk 92]. Two recent studies by Gort [Gort 13] and Hadjis [Hadji 12a] were already discussed in Section 3.6. Cong and Wei [Cong 08] presented a method for sharing patterns of operators by analyzing their graph edit distance. Hara-Azumi and Tomiyama [Hara 12] formulated a simultaneous binding and allocation integer linear programming problem to minimize multiplexer area under a clock constraint. It is worth mentioning that multi-pumping, as a concept, is not new. Multi-pumping has been applied in computer memory for over a decade; however, multi-pumping is typically used to improve performance rather than to save area as we propose. Commodity DDR3 SDRAM allows transfers on both the positive and negative edges of the memory bus clock, doubling the effective memory bandwidth [DDR3 08]. For example, the Altera DE4 board has two SO-DIMM sockets, each of which contains a DDR2 SDRAM internally clocked at 800MHz, with a combined 128-bit wide data I/O bus to the Stratix IV FPGA. The DDR memory is multi-pumped, and we can transfer on the rising and falling clock edges, so the FMax requirement is 400MHz for the DDR2 memory controller. We use the Altera DDR2 memory controller in half-rate transfer mode, which doubles the data width to 256 bits and halves the FMax requirement to 200 MHz. We have found that on a Stratix IV FPGA, meeting a timing constraint of 200 MHz is difficult but feasible for a high-speed circuit. For a point of reference, the maximum possible FMax of a circuit synthesized to a Stratix IV is about 550MHz due to minimum clock pulse constraints of the DSP block internal registers. Multi-pumping is also widely used in memories to "mimic" the availability of extra memory ports. Choi et al.'s work in [Choi 12a] found that multi-pumped caches had the best performance and area for FPGA processor/parallel-accelerator systems. A Xilinx white paper [Tidwe 05] describes how multi-pumping can improve the throughput of a DSP block in isolation, outside of the HLS context.

Figure 4.1: Multi-pumped multiplier (MPM) unit architecture.

4.3 Multi-Pumped Multiplier Units: Concept and Characterization

We exploit the high operating frequency of DSP blocks relative to the surrounding FPGA soft logic to multi-pump DSP multipliers at double-data-rate. Figure 4.1 shows the circuit architecture of our multi-pumped multiplier (MPM) unit. The MPM consists of a multiplier implemented by DSP blocks, with multiplexers on the inputs to steer incoming data. The number of DSP blocks required to implement 8, 16, and 32-bit multipliers is 1, 2, and 4, respectively, for either signed or unsigned numbers. Unsigned and signed 64-bit multipliers require 16 and 32 DSPs, respectively. The 2× clock frequency must be exactly twice that of the system 1× clock to ensure correct multi-pumping behaviour. Assuming the multiplier has no pipeline registers, the operation of Figure 4.1 proceeds as follows: the positive edge of the 1× clock occurs, causing inputs A, B, C, and D to transition. The 1× Clock Follower signal (discussed below) matches the 1× clock and is high. For the next half of the 1× clock period, A is multiplied by B. At the half-way point of the 1× clock period, the rising edge of the 2× clock triggers the alignment register to store the product of A and B. For the second half of the 1× clock cycle, the 1× clock follower is low, and C is multiplied by D. At the rising edge of the 1× clock both A × B and C × D are available and are stored at the MPM outputs by registers in the 1× clock domain (not shown in the figure). We derived the 1× clock and 2× clock from the same PLL to match their clock phases and to avoid a synchronizer on the 1×-to-2× clock-domain crossings. We can use optional pipeline registers inside the DSP blocks to improve the FMax of the 2× clock and reduce the setup time on the 1×-to-2× clock-boundary crossings. An Altera Stratix IV [Stra 10] DSP block has up to 3 optional pipeline stages: one at the inputs and two at the outputs after the internal multiplier, as shown in Figure 4.1. The actual FMax of the 1× clock when using multi-pumping is given by:

FMax = min(FMax1x, FMax2x / 2)    (4.1)

where FMax1x is the maximum operating frequency of the circuits in the 1× clock domain, and FMax2x is the maximum operating frequency of the circuits in the 2× clock domain. For instance, if the circuits in the 1× clock domain could operate at FMax1x = 300 MHz, but the DSP 2× clock has a maximum frequency of 400MHz, then the 1× clock must be reduced to 200MHz. Consequently, adding a multi-pumped multiplier can potentially reduce the FMax of high-speed circuits by putting a ceiling on the system clock frequency. We mitigate this problem by using DSP pipeline registers (discussed below).
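To make the double-data-rate operation described above concrete, the following behavioural C sketch models one 1× clock cycle of the MPM: the same physical multiplier is reused in both 2× clock phases. This models only the two-phase time-multiplexing, not the pipelining or the actual Verilog implementation.

#include <stdint.h>
#include <stdio.h>

/* One 1x system-clock cycle of the MPM in Figure 4.1: the first 2x phase
 * computes A*B into the alignment register, the second computes C*D, and
 * both products appear at the outputs on the next rising 1x edge. */
void mpm_cycle(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
               uint64_t *ab, uint64_t *cd)
{
    uint64_t alignment_reg = (uint64_t)a * b; /* phase 1: follower high */
    *cd = (uint64_t)c * d;                    /* phase 2: follower low */
    *ab = alignment_reg;
}

int main(void)
{
    uint64_t ab, cd;
    mpm_cycle(3, 5, 7, 11, &ab, &cd);
    printf("A*B = %llu, C*D = %llu\n",
           (unsigned long long)ab, (unsigned long long)cd);
    return 0;
}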

Figure 4.2: Clock follower circuit from [Tidwe 05].

Figure 4.3: Multi-pumped multiplier unit FMax characterization.

In Figure 4.1, the 1× Clock Follower has an identical waveform to the 1× clock signal, but the clock follower is driven by a 2× clock register. We cannot drive the select lines directly with the 1× clock signal because that could cause a hold-time violation. For instance, when the 2× clock has a positive edge, the DSP input pipeline registers are receiving data, but at the exact same time the 1× clock could transition from 0 to 1. If the 1× clock is driving the select line of the multiplexer, this could change the multiplexer output too quickly, violating the hold-time requirement of the destination DSP input 2× clock registers. Consequently, we need a signal that is identical to the 1× clock but slightly delayed: the 1× clock follower. Figure 4.2 gives the clock follower circuit from [Tidwe 05]. On device startup, a synchronous reset sets all three registers to logic-0. At this point, A=0 and B=0, therefore on the rising edge of the 1× and 2× clocks, the 1× clock follower signal transitions from 0 to 1 and A transitions from 0 to 1. On the next rising edge of the 2× clock, B transitions from 0 to 1, and the 1× clock follower from 1 to 0. On the rising edge of the 1× and 2× clocks, A transitions from 1 to 0, and the 1× clock follower transitions from 0 to 1. This pattern continues with the 1× clock follower transitioning on every positive edge of the 2× clock, matching the 1× clock signal. The 1× clock follower is delayed by the clock-to-Q time of the 2× register.

4.3.1 Multi-Pumped Multiplier Characterization

We characterized the multi-pumped multiplier (MPM) unit in Figure 4.1, for an Altera Stratix IV FPGA [Stra 10], using three parameters: the number of pipeline stages (P), also called the latency; the width of inputs (W); and the type of multiplier, either signed (S) or unsigned (U). Figure 4.3 shows how the FMax of the MPM in Stratix IV is impacted by P, the number of pipeline stages. If P is greater than three, we will implement additional pipeline registers outside of the DSP blocks.

Figure 4.4: Multi-pumped multiplier unit register characterization.

Each curve in Figure 4.3 represents one choice of input width (W) and whether the data is unsigned (U) or signed (S). As expected, there is a tradeoff between pipeline stages and the MPM FMax: increasing the latency allows the FMax to increase. Observe in Figure 4.3 that setting P greater than 3 is only beneficial to the FMax of 64-bit multipliers; FMax is unchanged for W=32 or lower. We expected multipliers with smaller bit widths to have higher FMax; however, this was not always the case. For P=3, the FMax improves from 483MHz for W=8, to 550MHz for W=16. We found that this improvement was caused solely by a change in cell delay within the DSP block where the critical path is located. At higher clock frequencies (above 450MHz), the MPM is restricted by the minimum clock pulse width requirements of the registers inside the DSP blocks, which is a property of the DSP blocks and dependent on W. For instance, we found that the clock frequency could have been 655MHz for W=16 and 533MHz for W=8, but was restricted by minimum clock pulse width, as shown in Figure 4.3.

The cycle latency of the MPM, in terms of the 1× system clock, is ⌈P/2⌉. For instance, if the MPM has a system cycle latency of two, from the 1× system clock perspective, then we have four 2× clock cycles to multiply the inputs. Because of the double-data-rate operation of the MPM, there is a wasted 2× clock cycle whenever the MPM has an odd number of pipeline stages. For instance, an MPM with a one 1× clock cycle latency constraint can have P=1 or P=2. Given that the FMax of the 2× clock in the MPM increases as P increases, we should always choose P=2 over P=1: there is no additional cost in registers because the optional pipeline registers are internal to the DSP blocks.

Figure 4.4 shows how the number of registers, outside of the DSP blocks, varies with W, P, and signedness. Register counts are indicative of silicon area cost, and those in Figure 4.4 include input and output registers in the 1× clock domain that are not shown in Figure 4.1 (i.e. registers to hold input operands and products). Observe that register count is unaffected when sweeping P from zero to three. This is expected, as the DSP blocks have 3 internal pipeline stages, meaning registers in general FPGA logic blocks are not needed. An exception to this is the 64-bit multiplier, which requires additional registers for latencies of two and above.

While multi-pumping can save DSP blocks, it can also be applied to raise computational throughput. For instance, if the throughput of a circuit pipeline is limited by multipliers, we can multi-pump every multiplier. This can double the original pipeline's throughput, without requiring any additional DSPs, if the application has enough data bandwidth to saturate the new pipeline.

Figure 4.5: Loop schedule: multiplier sharing vs. multi-pumping.

Figure 4.6: Loop hardware: original vs. resource sharing.

4.3.2 Multi-Pumping vs. Resource Sharing

Multi-pumping can be seen as an alternative to traditional resource sharing in high-level synthesis, with two differences. First, multi-pumping can share two multipliers that are scheduled in the same state, while resource sharing can only share multipliers from different states. Second, multi-pumping requires that the system clock be half the 2× clock rate. Therefore, when multi-pumping, we should pipeline multipliers to a greater extent than when resource sharing. Pipelining will increase the 2× clock rate and avoid constraining the system clock. We illustrate the difference between resource sharing and multi-pumping by considering a loop that performs two independent multiplies every iteration and finishes in 100 cycles, taking one cycle per iteration, as shown in Figure 4.5. Assume that we wish to reduce the number of DSPs required for the loop. We must reschedule the multipliers into distinct states to apply traditional resource sharing to the loop, saving one multiplier. Figure 4.5 (middle) shows the new schedule and Figure 4.6 shows the hardware before and after resource sharing. Assuming a single-cycle multiplier, the loop (with resource sharing applied) now takes 200 cycles, twice as long as the original. We can apply multi-pumping to the same loop and achieve the same reduction in multipliers without the increase in cycles. Although we must now pipeline the multi-pumped unit to achieve the same FMax as the original, we can still start one new multiply every cycle with loop pipelining [Ramak 96], assuming there are no loop-carried dependencies between iterations. If we pipeline the multi-pumped unit with three stages, the loop will complete in 102 cycles: 2 cycles to fill the pipeline, then one loop iteration finishes every clock cycle for the subsequent 100 cycles. For multi-pumping to provide a performance and area benefit over resource sharing, a few conditions are necessary. First, two or more multipliers need to be scheduled into the same state. Next, these multipliers should occur in a section of the code that is executed multiple times, otherwise the impact on overall circuit performance will be minor. Lastly, there needs to be limited multiplier mobility, meaning that there is little flexibility to change the scheduled state of a multiplication operation without impacting the schedule of its successor operations. So, if we reschedule these multiplies into separate states, then the circuit's performance will decrease (due to a longer schedule). To calculate the savings from multi-pumping, we can find the maximum number of multipliers used in any state, which we designate as X. X is the minimum number of multipliers that can be achieved using traditional resource sharing without modifying the schedule. However, using multi-pumping, the number of multipliers can be reduced to ⌈X/2⌉.
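The following C sketch works through this savings calculation on a hypothetical schedule; mults_per_state is an assumed input rather than a LegUp data structure.

#include <stdio.h>

int main(void)
{
    /* Multiply operations scheduled in each state (hypothetical). */
    int mults_per_state[] = {2, 4, 1, 3};
    int num_states = 4, X = 0;

    /* X = maximum number of multiplies in any one state. */
    for (int s = 0; s < num_states; s++)
        if (mults_per_state[s] > X)
            X = mults_per_state[s];

    printf("traditional resource sharing: %d multipliers\n", X);
    printf("multi-pumping: %d MPM units\n", (X + 1) / 2); /* ceil(X/2) */
    return 0;
}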

4.4 Multi-Pumping DSPs in High-Level Synthesis

As discussed in Chapter 3, there are three main steps to high-level synthesis: allocation, scheduling, and binding. Binding is performed after scheduling and solves the problem of assigning the operations in the program to hardware functional units. In other words, binding is the key step where traditional resource sharing is implemented, and is also where we may choose to assign two operations to a multi-pumped multiplier unit. We implemented our multi-pumping approach in LegUp by modifying the constraints we use during scheduling and binding to handle the MPM functional units. We can think of the multi-pumped unit as having two "ports" corresponding to the two inputs/outputs that occur each system clock cycle. We can then bind multiply operations to an MPM functional unit in an analogous manner to binding loads/stores to a dual-port RAM. Given a user resource constraint of M multipliers, we can instantiate up to M MPM functional units in hardware. Furthermore, during scheduling we must ensure there are no more than 2M multiply operations per cycle. After scheduling, we bind each multiply operation to one of the 2M available ports on the MPM units using weighted bipartite matching [Huang 90]. We can still use an MPM unit for a single multiply, if we only utilize the DSP blocks for half of the 1× system clock cycle. Hence, multi-pump sharing is a superset of resource sharing: we can share multipliers using multi-pumping in all cases where we could perform resource sharing, but in addition, we can share when two multipliers are scheduled in the same state.
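As an illustration of this port-based view, the following C sketch greedily assigns the multiplies in one state to MPM ports after checking the 2M scheduling constraint. LegUp itself uses weighted bipartite matching across all states, so this greedy assignment is only a simplified stand-in.

#include <stdio.h>

#define M 2 /* user resource constraint: number of MPM units */

int main(void)
{
    int mults_in_state = 3; /* multiply operations scheduled in this state */

    /* Scheduling must guarantee at most 2M multiplies per cycle. */
    if (mults_in_state > 2 * M) {
        printf("illegal schedule: more than %d multiplies in one state\n",
               2 * M);
        return 1;
    }

    /* Greedy binding: operation i uses unit i/2, port i%2; an MPM with an
     * unused second port simply performs a single multiply that cycle. */
    for (int i = 0; i < mults_in_state; i++)
        printf("multiply %d -> MPM unit %d, port %d\n", i, i / 2, i % 2);
    return 0;
}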

4.4.1 DSP Inference Prediction

In our original implementation of multi-pumping, we saw an increase in the number of DSPs compared to the original circuit, rather than a reduction! We found that Altera's Quartus II synthesis tool incorporates optimizations to avoid inferring DSPs in certain scenarios. Specifically, multiplies by a power of 2 will be replaced with a shift. Additionally, if one input to a multiply is a constant (c) and the multiply, x × c, can be implemented as (x << a) plus or minus (x << b), where a and b are constants, then Quartus will not infer a DSP block, instead preferring the shifts by constants, followed by addition. For example: x × 22 will infer a DSP, while x × 14 = (x × 16) − (x × 2) can be implemented as (x << 4) − (x << 1). This optimization is common for constants under 100, but becomes rare for larger constants. We avoid multi-pumping multiply operations that will not result in DSP-block inference. Another optimization we made is an artifact of compiling software to hardware. Namely, in the high-level synthesis of C code, multiplying two 32-bit integers does not require a 64-bit result: the product is usually truncated to a 32-bit integer. If all 64 bits are required, then high-level synthesis will be forced to use a 64-bit multiply instruction and sign extend the 32-bit operands to 64 bits. However, in hardware, we can implement a 32-bit multiplier with a 64-bit product using half as many DSP blocks as a 64-bit multiplier with the output truncated to 64 bits. By detecting these 64-bit multipliers and replacing them with 32-bit multipliers, we saw a significant reduction in DSP block usage.
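The following C predicate is our reconstruction of this inference heuristic, of the sort used to decide which multiplies are candidates for multi-pumping; it approximates the Quartus behaviour described above rather than reproducing its exact rule.

#include <stdbool.h>
#include <stdint.h>

static bool is_pow2(uint64_t x) { return x != 0 && (x & (x - 1)) == 0; }

/* Returns true if multiplying by the constant c is expected to infer DSP
 * blocks; returns false when c is a power of two (a single shift) or is
 * expressible as (1 << a) plus or minus (1 << b). */
bool multiply_infers_dsp(uint64_t c)
{
    if (is_pow2(c))
        return false; /* becomes a shift, e.g. x * 16 = x << 4 */
    for (int a = 1; a < 64; a++) {
        uint64_t p = (uint64_t)1 << a;
        if (p > c && is_pow2(p - c)) /* c = (1 << a) - (1 << b), e.g. 14 */
            return false;
        if (p < c && is_pow2(c - p)) /* c = (1 << a) + (1 << b), e.g. 18 */
            return false;
    }
    return true; /* e.g. c = 22 needs three shift terms, so a DSP is used */
}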

Figure 4.7: Image after Sobel edge detection and Gaussian blur.

Lastly, two multiply operations are only paired together in an MPM if they have the same bit width. We used a bit width minimization pass [Mahlk 01] to statically calculate the required bit width of each multiply operation.

4.5 Experimental Study

Table 4.1: Area results (TRS: Traditional Resource Sharing, MP: Multi-Pumping)

                    DSPs                Registers                    ALUTs
Benchmark     Orig  TRS   MP     Orig     TRS     MP        Orig     TRS     MP
alphablend      8     4    4     7,799   10,599   7,965     4,756    5,786   4,821
sobel           8     4    4    22,861   22,775  22,959    25,396   25,493  25,348
4matrixmult    16     8    8    10,677   25,478  11,068    10,578   13,189  10,722
gaussblur      24    12   12    10,659   10,615  10,861    10,458   10,655  10,493
idct           40    20   20    33,977   33,289  34,925    43,361   42,204  43,440
mandelbrot    144    72   72    33,729   34,449  34,548    31,112   31,291  30,702
Geomean        23    11   11    16,895   20,530  17,269    16,193   17,360  16,239
Ratio           1   0.5  0.5         1     1.22    1.02         1     1.07       1

Table 4.2: Speed performance results (TRS: Traditional Resource Sharing, MP: Multi-Pumping)

                      Cycles               FMax (MHz)            Time (µs)
Benchmark     Orig     TRS      MP      Orig  TRS   MP     Orig    TRS     MP
alphablend    1,131    2,131   1,151    219   203   223      5.2   10.5    5.2
sobel        45,685   66,357  46,229    163   166   166    280.8  399.8  279.3
4matrixmult   8,551   19,851   8,651    157   155   158     54.5  127.8   54.8
gaussblur    26,575   45,615  27,119    176   167   176    151.2  273.8  154.0
idct          7,336   11,436   7,336    170   158   155     43.2   72.5   47.4
mandelbrot    1,899    3,307   1,963    143   150   125     13.3   22.1   15.7
Geomean       7,395   13,007   7,513    170   166   165     43.6   78.6   45.7
Ratio             1     1.76    1.02      1  0.98  0.97        1    1.8   1.05

We used six benchmarks to evaluate our multi-pumping approach. Alphablend blends two image streams. 4matmult performs four matrix multiply operations in parallel for 20×20 matrices stored in independent block RAMs. Sobel is a Sobel edge detection algorithm from computer vision, applied to an image striped over three block RAMs, shown in Figure 4.7. Gaussblur applies a Gaussian low-pass filter to blur the same image. IDCT performs 200 inverse discrete cosine transforms used in JPEG image decompression. Mandelbrot generates a 32×32 fractal image. All of the benchmarks require multipliers operating in parallel and are representative of the data parallel digital signal processing applications that DSP blocks were designed for. The benchmarks also include input data, allowing us to execute them in hardware and gather wall-clock time (execution time) results. Loop unrolling was applied to the benchmarks to increase multiplier density. We constrained the number of multipliers in each benchmark to balance multiply operations evenly across all pipeline stages to maximize multiplier utilization. We compare multi-pumping to traditional resource sharing targeting the Stratix IV [Stra 10] EP4SGX530KH40C2 on Altera's DE4 board [DE4 10] using Quartus 11.1sp2. All benchmarks were synthesized with a 500MHz timing constraint for the 1× clock and a 1GHz constraint for the 2× clock.

Table 4.1 gives the area results for three scenarios: “Original” (the baseline with no resource reductions), “TRS” (traditional resource sharing), and “MP” (multi-pumping). The “DSPs” column gives the number of DSP blocks required, which is reduced by 50% by both resource sharing and multi-pumping. Mandelbrot was the only benchmark that used exclusively 64-bit multiplication; all other benchmarks used only 32-bit multipliers. The “Registers” column gives the total number of registers required. “ALUTs” gives the number of Stratix IV combinational ALUTs. Ratios in the table compare the geometric mean (geomean) of each column to the respective geomean of the original. Table 4.2 gives speed performance results. The “Cycles” column is the total number of cycles required to complete the benchmark. The “FMax” column provides the FMax of the circuit given by the equation in Section 4.3. The “Time” column gives the circuit wall-clock time: Cycles · (1/FMax).

In the baseline and traditional resource sharing scenarios, we allocated multiplier functional units with two pipeline stages, meaning that after two inputs are passed into a multiplier functional unit, we must wait two system (1×) clock cycles before the multiplier output is valid. For multi-pumping, we increased the pipeline depth of the MPM units to three stages in the 1× clock domain to minimize the impact on FMax. We chose one 1× clock stage at the MPM inputs, to minimize the delay across the 1×-to-2× clock-boundary crossing, and four 2× clock stages within the MPM unit. Recall that for designs that operate at a high FMax, the 2× clock FMax affects the system clock because the system clock must be exactly half the 2× clock. By increasing the pipeline depth of the MPM units, we can increase the 2× FMax and mitigate this effect on the system clock. The disadvantage of increasing pipeline stages is that the cycle latency required to complete a multiply also increases. However, this is hidden by having several multiply operations “in flight” within a single pipelined MPM unit at once.

The results show that both multi-pumping and resource sharing can be applied to reduce DSP usage by 50%, though multi-pumping is able to do so with less impact on circuit speed, and also with less area cost. With multi-pumping, the DSP reduction comes at a cost of 5% higher wall-clock time and 2% more registers. In contrast, traditional resource sharing increased circuit wall-clock time by 80%, ALUTs by 7%, and registers by 22% to achieve the same DSP reduction. However, the increase in registers and ALUTs under resource sharing was primarily an artifact of loop unrolling, which we used to emulate loop pipelining (due to the lack of loop pipelining support in LegUp at the time). Loop unrolling with longer schedule lengths caused excessive registers to be created. We predict that with loop pipelining instead of loop unrolling, the register improvement would disappear but all other results would remain the same. Geomean execution cycles are significantly increased (76%) by the scheduling constraints imposed by resource sharing. Multi-pumping can achieve the same DSP savings with only a 2% increase in execution cycles, caused by the extra multiplier pipeline stage.

Overall, multi-pumping appears to be a viable way to reduce resource usage in HLS, while incurring significantly less speed/area cost than traditional resource sharing.

4.6 Summary

This chapter presented multi-pumping as an alternative to traditional resource sharing in high-level synthesis when targeting an FPGA device. For a given constraint on the number of FPGA DSP blocks, multi-pumping can deliver considerably higher performance than resource sharing. Empirical results over digital signal processing benchmarks show that multi-pumping achieves the same DSP reduction as resource sharing, but with a lower impact on circuit performance: decreasing circuit speed by only 5% instead of 80%.

Chapter 5

Modulo SDC Scheduling with Recurrence Minimization in HLS

5.1 Introduction

In this chapter, we investigate improvements to loop pipelining scheduling algorithms [Ramak 96]. Loop pipelining is a high-level synthesis scheduling technique that overlaps the execution of loop iterations to achieve higher performance. We use this schedule to generate a pipelined datapath in hardware for operations within the loop, increasing parallelism and hardware utilization. In many C applications, the majority of run time is spent executing critical loops. Consequently, loop pipelining is crucial for generating a hardware architecture with performance comparable to hand-designed RTL. Furthermore, complex loops usually have resource constraints, typically caused by limited memory ports, in combination with constraints imposed by cross-iteration dependencies. The interaction between multiple constraints can pose a challenge for loop pipelining scheduling algorithms, which, if not handled properly, can lead to a loop pipeline schedule that fails to achieve the best performance. The goal of this work is to focus on loops with complex resource and dependency constraints and improve the high-level synthesis quality of results for these cases. Loop parallelism is limited by cross-iteration dependencies between operations in the loop called recurrences. Recurrences can prevent the next iteration of a loop from starting in parallel until data from a prior iteration has been computed, for instance an accumulation across iterations. The second limitation is due to user-imposed resource constraints, e.g. only allowing one floating point adder in the design. These constraints can significantly impact the final loop pipeline throughput. As discussed in Chapter 2, state-of-the-art HLS scheduling uses a mathematical framework, called a System of Difference Constraints (SDC), to describe constraints related to scheduling. The SDC framework is flexible and allows the specification of a wide range of constraints such as data and control dependencies, relative timing constraints for I/O protocols, and clock period constraints. Although loop pipelining has been well studied in HLS, until recently, the SDC approach had not been applied to scheduling loop pipelines due to non-linearities caused by describing the resource constraints in modulo scheduling. Recent work in [Zhang 13] has extended the SDC framework to handle loop pipelining scheduling by using step-wise legalization to handle resource constraints. This new SDC approach offers compelling advantages over prior methods of modulo scheduling by providing the same mathematical framework for a wide range of scheduling constraints.

However, there are issues applying this approach to more complex loops, particularly the class of loops that contain a combination of recurrences and resource constraints. We propose a new modulo scheduling algorithm that uses backtracking to handle complex loops with competing resource and dependency constraints, as can be expected in commercial hardware designs. This new scheduling approach significantly improves the performance of loop pipelines compared to prior work by scheduling pipelines with better throughput when the loops have complex constraints. Furthermore, our scheduler is based on the SDC formulation, allowing for a flexible range of user constraints. We also describe how to apply well-known algebraic transformations to the loop's data dependency graph using operator associativity to reduce the length of recurrences. These associative transformations have already been widely applied for balancing the height of expression trees in HLS [Nicol 91a]. However, a loop containing recurrences must be restructured differently to minimize the length of the loop recurrences. This idea has been previously studied in the DSP domain [Iqbal 93] but, to our knowledge, has not yet been widely applied in HLS. We compared our techniques to existing prior work in HLS loop pipelining and also compared against a state-of-the-art commercial HLS tool. Over a suite of benchmarks, we show that our scheduler and proposed optimizations can result in a geomean wall-clock time reduction of 32% versus prior work and 29% versus a commercial tool. The remainder of this chapter is organized as follows: Section 5.2 presents related work and relevant background. Section 5.3 gives an overview of loop pipelining and introduces a motivating example. Section 5.4 describes our modulo SDC scheduling algorithm. Section 5.5 discusses our data dependency restructuring transformations to reduce loop recurrence cycles. Section 5.6 presents an experimental study and Section 5.7 draws conclusions.

5.2 Preliminaries

5.2.1 Related Work

Loop pipelining can be performed using software pipelining, which is a compiler technique traditionally aimed at Very Long Instruction Word (VLIW) processors [Lam 88]. VLIW processors [McNai 03] can execute multiple instructions in the same clock cycle, allowing them to exploit instruction-level parallelism. Software pipelining uncovers instruction-level parallelism between successive iterations of a loop, and reschedules the instructions to exploit these opportunities. Iterations of a loop are initiated at constant time intervals, before the previous iterations are complete. Software pipelining is performed using modulo scheduling [Rau 81], which we will discuss in more detail in Section 5.2.2. One common software pipelining heuristic is called Iterative Modulo Scheduling (IMS) [Ramak 96], which has been adapted for loop pipelining in high-level synthesis by PICO [Schre 02]. Iterative modulo scheduling combines list scheduling, backtracking, and a modulo reservation table to reorder instructions from multiple loop iterations into a pipelined schedule. IMS, in its original form [Ramak 96], did not consider HLS operator chaining, as chaining is not applicable to VLIW architectures. The authors of the HLS tool PICO [Sivar 02] studied the impact of adding chaining capability to IMS, which is non-trivial and requires adding an approximate static timing analysis to the inner loop of the algorithm. However, they focused on area improvements assuming a fixed pipeline throughput and did not consider the impact of chaining on loop recurrences. Another software pipelining heuristic, used by GCC, is swing modulo scheduling, which tries to reduce register pressure [Hagog 04]. Register pressure at any point in a program is equal to the number of live variables that must be stored in machine registers [Chait 81]. If register pressure exceeds the number of available machine registers then we must “spill” variables into main memory. Spilling to memory slows down program execution. During modulo scheduling we have some flexibility as to when instructions are scheduled. Swing modulo scheduling tries to schedule dependent instructions as close together as possible to shorten variable lifetimes, reducing register pressure and avoiding spilling to memory. Earlier approaches to loop pipelining in high-level synthesis determined the pipeline datapath through the following process [Potas 90]. First, they unroll the loop by one iteration. This entails duplicating all basic blocks in the loop body and then connecting the last basic block of the loop body to the start of the duplicated basic blocks. They compact this new loop body by applying code motions that move operations upwards across basic block boundaries in the new loop body. Operations migrate towards the earlier loop iteration, and upwards motion is only limited by dependencies between operations. Unrolling is continued until the (provable) emergence of a repeating pattern of code, which will contain all loop recurrences. This pattern then becomes the new compacted loop body, which exposes all of the available loop parallelism to standard HLS scheduling. The code motions used to compact the loop body are described in a compiler technique called percolation scheduling [Nicol 85]. They describe local transformations that move one operation from a basic block to the immediately preceding basic block(s) if no dependencies are broken.
These local transformations iteratively “percolate” operations upwards in the control flow graph towards the start of the program. Modulo scheduling has been shown to perform better than these iterative unrolling loop pipelining techniques for single basic block loops [Jones 91]. Loops with control flow in the loop body will have multiple basic blocks, which must be merged together into one hyperblock [Mahlk 92] using if-conversion before modulo scheduling [Warte 92]. Recently, a heuristic using SDC-based scheduling to perform modulo scheduling was proposed [Zhang 13]. This work used an SDC-based scheduling formulation with an objective function to minimize register pressure and compared the register usage to swing modulo scheduling. Their scheduling algorithm is similar to the one proposed in this chapter but uses a greedy heuristic to choose operations to be scheduled, prioritizing operations that minimize the impact on operations still to be scheduled. We take an alternative approach by abandoning any infeasible partial schedules and then backtracking by attempting other possible scheduling combinations. Backtracking can lead to better schedules than the greedy approach in cases where the priority ordering prevents the discovery of a valid schedule in a single pass. The work in [Nicol 91a] presents a method for incrementally reducing the height of an expression tree from O(n) to O(log n) in high-level synthesis by using associative and distributive transformations. They find trees of dependent arithmetic operations within the program data flow graph and then apply these transformations to balance the height of each operation in the tree. The goal is to minimize the longest dependency chain of operations in the data flow graph, which limits the total HLS schedule length. They did not consider applying their transformations to recurrences during loop pipelining as we describe in this chapter. Tree height restructuring has also been investigated for software pipelining in the Cydra compiler [Schla 94] when targeting loops with recurrences. However, they focused on VLIW processors with limited instruction-level parallelism instead of the flexible high-level synthesis architecture we study here. The work in [Iqbal 93] presents an approach of using algebraic transformations and register retiming to restructure a pipelined data flow graph. They apply these transformations to minimize the longest chain of dependent operations that limits the HLS schedule length.

Figure 5.1: Time sequence of a loop pipeline with II=2 and five loop iterations (i = 0 to 4).

Their algorithm allows an arbitrary timing constraint on each operation, for instance an input arrival time for the first operation and an output ready time for the last operation in a streaming application. Their work is the most applicable to the transformations we describe in this chapter, but we focus on the specifics of how to apply these transformations to modulo scheduling in HLS.

5.2.2 Background: Loop Pipeline Modulo Scheduling

LegUp performs loop pipelining in two steps, which we will discuss in this section. First, we schedule the operations in the loop using modulo scheduling. Second, we generate the pipeline datapath and control signals in hardware corresponding to this schedule. The modulo scheduling algorithm assumes that the loop has exactly one basic block. If the loop body has multiple basic blocks then we must perform if-conversion to remove control flow and leave us with one basic block [Warte 92]. LegUp's if-conversion pass, implemented by Joy (Yu Ting) Chen, is currently limited to simple control flow. Modulo scheduling rearranges the operations from one iteration of the loop into a schedule that can be repeated at a fixed interval without violating any data dependencies or resource constraints. This fixed interval between starting successive iterations of the loop is called the initiation interval (II) of the loop pipeline. The best pipeline performance and hardware utilization is achieved with an II of one, meaning that successive iterations of the loop begin every cycle, analogous to a MIPS processor pipeline. If the first iteration of the pipelined loop takes T cycles to complete, then the total number of cycles required to complete a loop with N iterations is T + (N − 1) × II ≈ N × II, for N ≫ T. Consequently, we can significantly improve pipeline throughput by minimizing the initiation interval. If we are pipelining a loop that contains neither resource constraints nor cross-iteration dependencies then the initiation interval will be one. Furthermore, in this case we can use a standard scheduling approach as described in Section 2.6, which will correctly schedule the loop into a feed-forward pipeline. However, when the loop does contain constraints then the initiation interval may have to be greater than one. For instance, if two memory operations are required in the loop body but only a single memory port is available then the initiation interval must be two. In this case, modulo scheduling will be required because standard scheduling has no concept of an initiation interval. Standard scheduling assumes that operations from separate control steps do not execute in parallel when satisfying resource constraints, which is no longer true in a loop pipeline. For instance, the standard approach may schedule the first memory operation in the first time step and the second memory operation in the third time step, but if new data is entering the pipeline every two cycles then these memory operations will occur in parallel and conflict with the single memory port. To illustrate a loop pipeline, we consider a loop with five iterations pipelined with an initiation interval of two cycles and with three pipeline stages. Figure 5.1 shows the time sequence of the pipeline with time increasing from left to right. During the prologue the hardware pipeline “fills up”, while during the epilogue the pipeline “flushes out” as no further loop iterations remain. Each box in Figure 5.1 is labeled with the loop iteration that is occupying the pipeline stage at that moment in time. A loop pipeline stage is analogous to a stage in a processor pipeline, where each stage of the pipeline executes in parallel. Each pipeline stage takes two cycles to complete, corresponding to the initiation interval, with the first loop iteration completing after six cycles.
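As a small worked sketch of the cycle-count formula above (our own helper, not LegUp code), the Figure 5.1 pipeline with T = 6, II = 2, and N = 5 completes in 6 + 4 × 2 = 14 cycles:

    /* Total cycles for a pipelined loop: the first iteration finishes
     * after T cycles and each later iteration completes II cycles after
     * the previous one; approximately N * II when N >> T. */
    long pipeline_cycles(long T, long N, long II) {
        return T + (N - 1) * II;    /* e.g., T=6, N=5, II=2 gives 14 */
    }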
At any time step in the steady-state operation of the pipeline, we are executing operations from three consecutive iterations of the loop, each iteration activating a different pipeline stage. For instance in Figure 5.1, when the pipeline first reaches steady state, loop iterations i = 0, i = 1, and i = 2 are all executing, and iteration i = 0 is finishing. The number of pipeline stages depends on when the last scheduled operation finishes and is independent of the initiation interval of the pipeline. For instance, assume we are using a pipelined divider functional unit, where new inputs can be passed in every cycle and the output is valid after 32 clock cycles. If we use this divider in the loop pipeline, then we would require at least 32 pipeline stages but the pipeline could still have an initiation interval of one. More stages will result in a longer overall latency by increasing time spent in the prologue and epilogue of the pipeline, but this is typically a small fraction of the time spent in steady state with many iterations. We can compare the pipeline in Figure 5.1 to the sequential schedule of the same loop. Sequentially, the loop body would have been scheduled with up to six cycles, which are now split into three pipeline stages. If we assume the original loop body required all six cycles, then five iterations would complete in 30 cycles, compared to the pipelined case in Figure 5.1 which completes after 14 cycles including the prologue and epilogue. As the number of loop iterations increases, the gap in cycles needed to complete the loop widens between the sequential (6N) and pipelined (about 2N) implementations, with the pipelined version eventually becoming three times faster. If we compare the final circuits, the number of functional units in the datapath of the pipelined loop is equal to the number in the sequential loop, assuming no resource sharing, although for the pipelined loop the datapath may have additional registers between pipeline stages. The main difference is in the control logic. The sequential loop is controlled by a finite state machine, while the pipelined loop is controlled by a shift register and a counter (see Section 5.2.3). The performance gain of pipelining is due to activating multiple hardware functional units in parallel, which increases hardware utilization compared to the sequentially scheduled loop. Loop recurrences can increase the initiation interval required for a feasible pipeline schedule. Figure 5.2(a) illustrates the data dependency graph of a loop performing an accumulation across iterations: sum = sum + a[i] + i. The directed edges in the graph represent data dependencies between operations and the edge labels indicate the required clock cycle latency between operations. We assume that both memory loads and addition operations have a latency of one cycle. In this case, sum in the current iteration has a loop-carried dependency on the sum calculated in the previous iteration, therefore the loop contains a recurrence, indicated by the cycle in the data flow graph. The back edge has a dependency distance of one (next iteration), labeled in square brackets. Consequently, when we perform loop pipelining, the best schedule has an initiation interval of two, as shown in Figure 5.2(b).

Figure 5.2: Loop pipelining with a recurrence. (a) Loop dependency graph; (b) loop pipeline schedule for the first three loop iterations (II=2).

for (i = 0; i < N; i++) {
    sum = sum + a[i] + i;
}

Figure 5.3: C code for loop.

The recurrence prevents us from the ideal case of scheduling a loop iteration every clock cycle. However, if we could have chained the two additions into a single cycle then we could have achieved an II of one.

In general, we can calculate the minimum recurrence-constrained initiation interval (recMII) in the following manner: for every loop recurrence i, or cycle in the data dependency graph, we take the sum of operator clock cycle latencies along the entire path of the recurrence cycle, delay_i, divide by the dependency distance of the recurrence, distance_i, and round up. The dependency distance is the number of iterations separating the destination operation from the source operation of the recurrence back edge. The recMII is calculated by taking the maximum over all recurrences in the dependency graph: recMII = max_i ⌈delay_i / distance_i⌉. We can intuitively think of delay_i as the number of cycles needed after the previous iteration to calculate a result needed in the next iteration; we cannot start the next iteration for a minimum of delay_i cycles. When distance_i > 1, the result is instead needed distance_i iterations later, and each iteration takes II cycles to complete.

Resource constraints can also limit the minimum initiation interval. For instance, if we schedule a loop pipeline with three multiply operations but with only one multiplier unit in the datapath, we must wait three cycles before starting each new loop iteration. In general, we calculate the resource-constrained minimum II (resMII) by taking every resource type i, calculating the number of operations in a loop iteration using that resource, #ops_i, divided by the number of functional units available, #FU_i, and rounding that up to the nearest integer. We take the maximum over all resource types to give us: resMII = max_i ⌈#ops_i / #FU_i⌉. Many resources are typically unconstrained in HLS, for instance adders, in contrast to general purpose processors, which have a fixed number of functional units.

The modulo scheduling algorithm begins by calculating a lower bound on the initiation interval called the minimum II (MII). Any legal schedule must have an II greater than or equal to the MII, but the MII is optimistic and may not be feasible. We calculate the MII by taking the maximum of both the resource-constrained MII (resMII) and the recurrence-constrained MII (recMII): MII = max(resMII, recMII).
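The MII calculation above can be summarized in a short sketch (hypothetical helper names, not LegUp's API), where each recurrence contributes ⌈delay/distance⌉ and each resource type contributes ⌈#ops/#FU⌉:

    /* Minimum initiation interval as defined above. Recurrence i has total
     * latency delay[i] along its cycle and dependency distance dist[i];
     * resource type j has ops[j] operations and fu[j] available units. */
    static int ceil_div(int a, int b) { return (a + b - 1) / b; }

    int min_ii(int nrec, const int delay[], const int dist[],
               int nres, const int ops[], const int fu[]) {
        int mii = 1;
        for (int i = 0; i < nrec; i++) {            /* recMII */
            int r = ceil_div(delay[i], dist[i]);
            if (r > mii) mii = r;
        }
        for (int j = 0; j < nres; j++) {            /* resMII */
            int r = ceil_div(ops[j], fu[j]);
            if (r > mii) mii = r;
        }
        return mii;                  /* MII = max(recMII, resMII) */
    }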

Figure 5.4: Loop pipelining the code of Figure 5.3 with II=2. (a) Data dependency graph; (b) loop pipeline schedule; (c) loop pipeline datapath.

5.2.3 Background: Loop Pipeline Hardware Generation

We will briefly describe the step after modulo scheduling, which generates the loop pipeline hardware. Scheduling determines the initiation interval of the pipeline and the start and finish time for each operation. To illustrate, we will use the code shown in Figure 5.3 as an example, where we assume memory loads have a latency of two cycles and addition operations have a latency of one cycle. We show the modulo schedule in Figure 5.4(b). The numbers along the top of Figure 5.4(b) show the cycle count when an operation is scheduled. Each operation repeats every two cycles after the first iteration because the initiation interval is two. During the first iteration, we have scheduled the load with startTime = 1 and finishTime = 3, the adder A1 at startTime = finishTime = 3, and adder A2 at startTime = finishTime = 4. The number of pipeline stages is determined by the operation that is scheduled last: pipelineStages = ⌈(lastTime + 1)/II⌉. Here A2 is scheduled last at lastTime = 4, therefore we have three pipeline stages (⌈(4 + 1)/2⌉). By inspection, the load is in the first pipeline stage, adder A1 is in the second stage, and adder A2 is in the third stage. At the start of the fourth cycle, the prologue of the pipeline is done, and we are now in steady state with all pipeline stages active. After scheduling, we generate the datapath and control logic for the loop pipeline hardware. The pipeline datapath is almost identical to the sequential non-pipelined datapath generated by LegUp, but with two differences. First, we need to keep track of the loop induction (index) variable for each pipeline stage, because each stage will have a different iteration executing. We store the induction variable in a three-stage shift register, shown at the bottom of Figure 5.4(c). Second, we add a shift register with N stages on the output of any operation that is used N pipeline stages later. For example, if the input to an operation is scheduled to finish two pipeline stages earlier, then two registers are needed. If the operation inputs are scheduled in the same pipeline stage then no registers are needed. LegUp generates the minimum number of registers required based on the pipeline stage in which each operation is scheduled and the pipeline stages in which each operation is used.
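The stage bookkeeping just described can be sketched as follows; the stage_of mapping is our own assumption, inferred from the worked example rather than taken from LegUp's source:

    /* pipelineStages = ceil((lastTime + 1) / II); an operation starting
     * at cycle t falls in stage t / II + 1 (1-based); a value produced in
     * stage p and consumed in stage q needs q - p shift-register stages. */
    int pipeline_stages(int last_time, int ii) {
        return (last_time + 1 + ii - 1) / ii;       /* ceiling division */
    }
    int stage_of(int start_time, int ii) { return start_time / ii + 1; }
    int regs_needed(int producer_stage, int consumer_stage) {
        return consumer_stage - producer_stage;
    }

For the example above, pipeline_stages(4, 2) = 3; the load (start time 1) falls in stage 1, A1 (start time 3) in stage 2, and A2 (start time 4) in stage 3.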

Figure 5.5: SDC modulo scheduling for II=3. (a) Loop dependency graph; (b) greedy modulo scheduling, which hits a memory port conflict; (c) optimal modulo schedule.
In Figure 5.4(c), the result of the load is used exactly two cycles later (which matches the memory latency); therefore, adder A1 can be connected directly to the memory output. The induction variable is an input to the A2 adder, and since A2 is scheduled in pipeline stage three, we connect adder A2 to the third register in the induction variable shift register. The pipeline control logic determines when each functional unit in the datapath should be active. There are two main control signals required for each pipeline: valid and ii_count. The one-bit valid shift register has lastTime registers, which is four in this case. If the pipeline has valid input data and the loop still has more iterations then we shift a one into the valid shift register. The ii_count is a counter that repeatedly counts from 0 to II − 1 and is only needed if the initiation interval is greater than one. In this case, ii_count is a one-bit counter alternating between zero and one. A datapath functional unit should be active if the valid shift register corresponding to the scheduled time (T) is high, and ii_count is equal to T mod II. The valid shift register ensures that the inputs are valid for each time slot, and the ii_count counter ensures that each operation is performed only once per pipeline stage. For example, in Figure 5.4(c) the register driven by A1 will be enabled when valid register three is high and ii_count is equal to one (3 mod 2). The register driven by A2 will only be enabled when valid register four is high and ii_count is equal to zero (4 mod 2). Finally, the induction variable shift register only shifts every two cycles, when ii_count is equal to zero.
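The control scheme just described can be captured in a small behavioural sketch (a C simulation of the hardware, with our own naming; the caller asserts new_iteration_starting once per II cycles while loop iterations remain):

    #define MAX_STAGES 64

    /* Pipeline control state: a one-bit valid shift register with
     * last_time stages (indexed 1..last_time, as in the text, and
     * assumed <= MAX_STAGES) plus a counter cycling 0..II-1. */
    typedef struct {
        int valid[MAX_STAGES + 1];
        int ii_count, ii, last_time;
    } PipeCtrl;

    /* Advance one system clock cycle. */
    void ctrl_step(PipeCtrl *c, int new_iteration_starting) {
        for (int t = c->last_time; t >= 2; t--)
            c->valid[t] = c->valid[t - 1];
        c->valid[1] = new_iteration_starting;
        c->ii_count = (c->ii_count + 1) % c->ii;
    }

    /* An operation scheduled at time T is enabled when valid register T
     * is high and ii_count equals T mod II, exactly as for A1 and A2. */
    int op_enabled(const PipeCtrl *c, int T) {
        return c->valid[T] && c->ii_count == T % c->ii;
    }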

5.3 Motivation

5.3.1 Greedy Modulo Scheduling Example

In this section, we will illustrate the prior work, greedy modulo scheduling [Zhang 13], for a loop containing both cross-iteration dependencies that cause recurrences in the loop data flow graph and resource constraints. A greedy modulo scheduling algorithm will not achieve an optimal schedule with the minimum possible initiation interval if we schedule an operation to a particular time step that later turns out to be wrong. Therefore, greedy scheduling is highly dependent on our chosen priority ordering function. We present the loop data dependency graph given in Figure 5.5(a). We have labeled the operations A, B, C, and D for convenience. The directed edges in the graph represent data dependencies between operations and the edge labels indicate the required clock cycle latency between operations. We assume memory latencies of two cycles for a load and one cycle for a store, and we allow the adder to be chained with zero latency. The back edge from node D to node A represents a cross-iteration data dependency with a dependency distance of one (next iteration), labeled in square brackets. The total delay along the recurrence is three cycles, therefore the recMII is three (⌈delay/distance⌉ = ⌈3/1⌉ = 3). We assume one memory port, giving a resMII of three (⌈#ops/#FU⌉ = ⌈3/1⌉ = 3). Modulo scheduling specifies that an operation scheduled at time t will be repeated every II clock cycles. Given resource constraints, we keep track of available resources using a table, where each row tracks a resource and each column is an available time slot. When we schedule an operation at time t, we reserve a single time slot in column t mod II of the table and in the appropriate resource row. Consequently, the table is called the modulo reservation table (MRT) and has II time slot columns. Returning to the example, the minimum II is three and the MRT has three time slots available for the memory in Figure 5.5(a). First, we will attempt to greedily modulo schedule the loop. We will schedule operations prioritized in order of perturbation, a typical priority function [Zhang 13], which gives precedence to operations that will most impact the schedule when moved. Therefore, the order of precedence is B (affecting C, D, and A), followed by A, and then D. First, we schedule B into time step zero, and reserve the single memory port for that time. Next, we attempt to schedule A into time step zero but the memory port is occupied by B, so we schedule A into the next time step at time one. After scheduling both loads we attempt to schedule the store operation in cycle three. But a load (B) is already scheduled in cycle three, causing a memory conflict, as shown in Figure 5.5(b). This schedule is not possible due to our single-ported memory. In this case, the greedy approach fails to achieve the minimum initiation interval of three. There is now no feasible place to schedule the store operation due to the recurrence constraint and the previously scheduled loads. At this point, we must give up and increase the initiation interval to four and try again. However, we can avoid this suboptimal greedy solution by unscheduling one of the load operations and backtracking to find the schedule shown in Figure 5.5(c). This schedule is optimal and achieves the minimum initiation interval of three.
Generally, a greedy modulo scheduler is only guaranteed to yield an optimal schedule with an II equal to the minimum II if the loop has only (1) simple recurrence circuits involving a single operation, or (2) operations that are all pipelined with II=1 (no multi-cycle operations). In all other cases, greedy scheduling may fail to find the optimal solution [Ramak 96].

5.4 Modulo SDC Scheduler

In this section, we describe our novel Modulo SDC Scheduler. We begin with a candidate II based on the pre-calculated minimum II and increment the II when we fail to find a feasible schedule. Given an II, we can use SDC-based scheduling (described in Chapter 2) to quickly give us the control step for every operation in the loop. An advantage we gain from the SDC formulation is the support for operator chaining and frequency constraints. To support modulo scheduling, we modify the SDC constraints that specify dependencies between operations by adding an additional term to account for loop recurrences. For two dependent operations i → j, the constraint becomes:

end_i − start_j ≤ II × distance(i, j)    (5.1)

Here start_j is the starting cycle time of operation j, and end_i is the cycle time when the output of operation i is available. The dependency distance, which is the number of loop iterations separating the dependency, is given by distance(i, j). If there is no loop-carried dependency then the distance will be zero and this constraint reduces to a standard SDC data dependency constraint. The loop initiation interval, II, is fixed for each iteration of the algorithm. We also add SDC timing constraints between operations to enforce a frequency constraint during scheduling and to prevent excessive chaining from lowering the desired clock period. Unfortunately, resource constraints during modulo scheduling cannot be modeled using the SDC-based linear programming formulation due to the non-linearity of the modulo reservation table. Therefore, we apply an iterative backtracking approach to legalize the SDC modulo schedule. First, we ignore all resource constraints and then we incrementally assign each resource-constrained operation to a particular control step in the schedule, depending on availability in the modulo reservation table (MRT). In some cases, after fixing one or more resource-constrained operations, the schedule will no longer be feasible, in which case we backtrack by unscheduling the tentatively scheduled operations and resuming our attempts.
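A minimal sketch of the modulo reservation table for a single resource type follows (our own data structure; LegUp's implementation differs in detail). The t mod II slot selection is precisely the non-linearity that keeps resource constraints out of the linear program:

    #include <stdbool.h>

    #define MAX_II 64

    typedef struct {
        int ii;                    /* current candidate initiation interval */
        int free_units[MAX_II];    /* free functional units per slot */
    } MRT;

    void mrt_init(MRT *m, int ii, int num_units) {
        m->ii = ii;
        for (int s = 0; s < ii; s++) m->free_units[s] = num_units;
    }

    /* Tentatively schedule an operation at cycle t: reserve slot t mod II. */
    bool mrt_reserve(MRT *m, int t) {
        int slot = t % m->ii;
        if (m->free_units[slot] == 0) return false;
        m->free_units[slot]--;
        return true;
    }

    /* Unschedule an operation during backtracking. */
    void mrt_release(MRT *m, int t) { m->free_units[t % m->ii]++; }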

Algorithm 1 Modulo SDC Sched(II, budget)
 1: Schedule without resource constraints to get ASAP times
 2: schedQueue ← all resource constrained instructions
 3: while schedQueue not empty and budget ≥ 0 do
 4:     I ← pop schedQueue
 5:     time ← scheduled time of I from SDC schedule
 6:     if scheduling I at time has no resource conflicts then
 7:         Add SDC constraint: tI = time
 8:         Update modulo reservation table and prevSched for I
 9:     else
10:         Constrain SDC with GE constraint: tI ≥ time + 1
11:         Attempt to solve SDC scheduling problem
12:         if LP solver finds feasible schedule then
13:             Add I to schedQueue
14:         else
15:             Delete new GE constraint
16:             Backtracking(I, time)
17:             Solve the SDC scheduling problem
18:         end if
19:     end if
20:     budget ← budget − 1
21: end while
22: return success if schedQueue is empty otherwise fail

Algorithm 1 gives the pseudocode for our iterative algorithm. The input to this function is the initiation interval and a budget, which will be described shortly. First, we schedule the loop without resource constraints and we save the ASAP time for each operation. Next we initialize a queue of all resource-constrained operations. We take the first operation out of the queue, which could be a priority queue based on height [Ramak 96] or perturbation [Zhang 13], but neither is required due to backtracking. However, having a good priority function will reduce the execution time of the algorithm. Next, we check the MRT for resource conflicts at the time step given by the SDC scheduler. In the first iteration, the SDC time step will be identical to the ASAP time calculated earlier. However, as we add constraints to the SDC formulation, the SDC time steps may begin to diverge from the ASAP times. If there are no MRT resource conflicts then we tentatively assign the operation to that time step by adding an equality constraint to the SDC formulation and we update the MRT and the previous scheduled time for I (lines 7–8). Otherwise, we try to reschedule with that operation constrained to a greater time step (lines 10–11). If we find a feasible schedule then we add this instruction back into the queue for later scheduling (lines 12–13). If we cannot find a feasible schedule (lines 15–17), then we backtrack by unscheduling one or more already scheduled resource-constrained instructions and then schedule the current instruction. This process continues until either a legal schedule is discovered, with all resource-constrained instructions fixed to a specific time slot, or a budgeted number of while-loop iterations have occurred, upon which we consider the current fixed II to be infeasible and increment the II. The budget parameter is equal to budgetRatio × numInstructions, where we have observed empirically that budgetRatio = 6 (as was also found by [Ramak 96]) works well to avoid excessive backtracking. If budgetRatio = ∞ then we will backtrack through all possible schedules, guaranteeing that we find the optimal schedule that meets the II constraint. However, if the schedule is infeasible for the II constraint then we will have an infinite loop.
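The outer driver implied by this description can be sketched as follows (a hedged sketch: sched stands in for Algorithm 1, and budgetRatio = 6 is the empirical value quoted above):

    #include <stdbool.h>

    /* Begin at the minimum II and retry with II + 1 whenever Modulo SDC
     * scheduling fails within its backtracking budget. Returns the first
     * feasible initiation interval. */
    int find_feasible_ii(bool (*sched)(int ii, int budget),
                         int mii, int num_instructions) {
        const int budget_ratio = 6;
        int ii = mii;
        while (!sched(ii, budget_ratio * num_instructions))
            ii++;
        return ii;
    }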

Algorithm 2 Backtracking(I, time)
 1: for minTime = ASAP time of I to time do
 2:     SDC schedule with I at minTime ignoring resources
 3:     break if LP solver finds feasible schedule
 4: end for
 5: prevSched ← previous scheduled time for I
 6: if no prevSched or minTime ≥ prevSched then
 7:     evictTime ← minTime
 8: else
 9:     evictTime ← prevSched + 1
10: end if
11: if resource conflict scheduling I at evictTime then
12:     evictInst ← instr. at evictTime mod II in MRT
13:     Remove all SDC constraints for evictInst
14:     Remove evictInst from modulo reservation table
15:     Add evictInst to schedQueue
16: end if
17: if dependency conflict scheduling I at evictTime then
18:     for all S in already scheduled instructions do
19:         Remove all SDC constraints for S
20:         Remove S from modulo reservation table
21:         Add S to schedQueue
22:     end for
23: end if
24: Add SDC constraint: tI = evictTime
25: Update modulo reservation table and prevSched for I

Table 5.1: Algorithm example (II=3). MRT slots are 0/1/2; SDC and scheduled times are listed for B/A/D (“–” denotes an empty slot or an unscheduled operation).

Iter  MRT Slot 0/1/2   SDC Time B/A/D   Sch. Time B/A/D   I   Description
1     B/–/–            0/0/2            0/–/–             B   Sched. tB = 0
2     B/–/–            0/1/3            0/–/–             A   Conflict. tA ≥ 1
3     B/A/–            0/1/3            0/1/–             A   Sched. tA = 1
4     D/A/–            0/1/3            –/1/3             D   Evict B. tD = 3
5     D/A/–            1/1/3            –/1/3             B   Conflict. tB ≥ 1
6     D/B/–            1/1/3            1/–/3             B   Evict A. tB = 1
7     –/–/A            0/2/4            –/2/–             A   Evict all. tA = 2
8     B/–/A            0/2/4            0/2/–             B   Sched. tB = 0
9     B/D/A            0/2/4            0/2/4             D   Sched. tD = 4

Algorithm 2 gives the pseudocode for our backtracking stage, which takes as input an operation I to be scheduled at control step time. First, we find a valid time slot while ignoring resource constraints but considering data dependencies (lines 1–4). Because we ignore the resource constraints of the partial SDC schedule, we will always find a minimum time slot and break out of the loop on line 3. In lines 5–10, we ensure forward progress by storing the previous scheduled time (updated on line 25, or on line 8 of Algorithm 1) of each operation to prevent attempting a time step before that point. This prevents two operations from displacing each other back and forth during backtracking. We remove any resource conflicts at the candidate scheduling time by unscheduling the tentatively scheduled operations found in the MRT at that slot (lines 11–15). In some cases, the previous scheduling time pushes forward the schedule time of an operation such that there is also a data dependency conflict at the candidate time. In this case, we unschedule all other operations to ensure forward progress and add these operations back into the queue to be rescheduled (lines 17–22). Finally, we schedule the operation at the new time step by updating the MRT and previous scheduled time for I, then we add an equality constraint to the SDC formulation.

5.4.1 Detailed Scheduling Example

In this section, we walk through the exact steps of our scheduling algorithm for the loop data flow graph previously provided in Figure 5.5(a). We begin Algorithm 1 by performing SDC scheduling without resource constraints, giving us the ASAP times: tA = 0, tB = 0, tC = 2, tD = 2. Here, we assume schedQueue is prioritized by perturbation [Zhang 13], giving precedence to operations that will most impact the schedule when moved, although this is not required. The queue contains B (affecting C, D, and A), followed by A, and then D. We skip C because adders are not resource constrained. Table 5.1 provides record keeping for the end of each iteration (first column) of the algorithm. The “MRT Slot” column lists the operations reserved in each time slot of the memory MRT, the “SDC Time” column gives the operation control steps under the current SDC constraints, “Sch. Time” gives the tentatively scheduled time of each operation (blank if not scheduled), “I” gives the current instruction I, and “Description” summarizes what occurred during the iteration. In the first iteration, we pop B off the queue and find no resource conflicts at time 0, so we add the SDC constraint tB = 0 and reserve MRT slot 0.

In the next iteration, we try to schedule A but find a resource conflict with B, so we update the SDC with tA ≥ 1 and re-solve the linear program (LP). Next, we schedule A at time 1 and reserve MRT slot 1. In iteration 4, we try to schedule D at time 3 but MRT slot 0 (3 mod 3 = 0) is unavailable.

We constrain tD ≥ 4 and re-solve, but the SDC constraints are infeasible due to the recurrence with A. At this point a greedy algorithm would give up and increment the II, as shown in Figure 5.5(b). Instead, we call backtracking(D, 3), where we calculate D's minTime to be 3. Therefore, we evict B from the MRT at slot 0, and we can now schedule D at time 3. In the next iteration, we find a resource conflict scheduling B at time 0, so we add the constraint tB ≥ 1. In iteration 6, we try B at time 1 but there is still a resource conflict, and tB ≥ 2 is not feasible due to the recurrence. We call backtracking(B, 1) and get minTime = 0, but B has already been previously scheduled at time 0, so we schedule B at time 1 and evict A from the MRT. In iteration 7, we have a resource conflict scheduling A at time 1, and tA ≥ 2 is infeasible, so we call backtracking(A, 1). A has been previously scheduled at time 1, so we schedule A at time 2, which conflicts with the recurrence, so we evict all other operations. The algorithm continues as shown in Table 5.1 until we find a valid modulo schedule for II = 3 with tA = 2, tB = 0, tD = 4.

At this point the SDC scheduled time for operations without resource constraints is also valid, in this case tC = 4 (the addition). We now have a final schedule with the optimal II of three, as shown in Figure 5.5(c). As an observation for future work: our algorithm only seeks to minimize the initiation interval, but not necessarily the number of stages in the pipeline. We have observed that in some cases the scheduled pipeline will have one extra cycle of latency due to the ordering of the operations. We consider this a minor concern compared to the throughput of the pipeline.

5.4.2 Complexity Analysis

Assuming there are n operations in the loop with m SDC constraints (operation dependencies), solving the SDC scheduling problem incrementally has a worst-case time complexity of O(m + n log n) [Ramal 99]. The budget parameter limits the amount of backtracking in the algorithm and has O(n) complexity. Each resource-constrained operation has II possible slots in the reservation table, therefore it can be rescheduled up to II times. The overall time complexity of this algorithm is therefore O(n · II · (m + n log n)). In practice, only a few of the n operations have resource constraints and the amount of backtracking rarely reaches the budget parameter limit.

5.5 Loop Recurrence Optimization

Data flow graph transformations have been well studied in prior work [Iqbal 93, Schla 94, Nicol 91a]. We propose a targeted manner of applying these transformations specific to HLS modulo scheduling. The goal of these transformations is to reduce the length of loop recurrence cycles in the loop data dependency graph and improve the achievable initiation interval. We will first illustrate this concept by describing the impact of an associative transformation on a loop with a cross-iteration dependency, with the C code given by Figure 5.3. In the loop, the sum variable in the current loop iteration depends on the sum calculated in the previous iteration. The loop has two equivalent data dependency graphs due to the associative property of addition: i + (load + sum) and (i + load) + sum. Figure 5.4(a) gives the former data dependency graph, i + (load + sum), which is the default graph generated by LLVM. For this example we assume that operator chaining is not allowed, that is, every addition takes one cycle to complete. The edge labels in the dependency graph indicate the number of cycles required between operations. Loop recurrences are indicated by back edges in the graph, where the dependency distance is given in square brackets; in this case the distance is one (the previous iteration).
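In C, the two associations correspond to the following loop bodies (a sketch; per the text, the first form is what LLVM emits by default for the code in Figure 5.3):

    /* Default association: the carried value sum feeds the inner add, so
     * the recurrence spans two one-cycle adds and the best II is 2. */
    int accumulate_default(const int *a, int N) {
        int sum = 0;
        for (int i = 0; i < N; i++)
            sum = i + (a[i] + sum);
        return sum;
    }

    /* Restructured by associativity: sum is added last, so the recurrence
     * spans a single add and the loop can achieve II = 1. */
    int accumulate_restructured(const int *a, int N) {
        int sum = 0;
        for (int i = 0; i < N; i++)
            sum = (i + a[i]) + sum;
        return sum;
    }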

Figure 5.6: Restructured loop dependency graph achieves II=1. (a) Data dependency graph; (b) loop pipeline schedule; (c) loop pipeline datapath.

Due to the recurrence spanning two addition operations, both of which take one cycle to complete, the minimum initiation interval of this loop pipeline is two cycles. Figure 5.4(b) shows the loop pipeline after scheduling and the corresponding datapath is shown in Figure 5.4(c). Alternatively, we can restructure the data dependency graph using the associativity property of addition: (i + load) + sum. In this new data dependency graph, shown in Figure 5.6(a), the length of the recurrence has been reduced and the loop can now be scheduled with an initiation interval of one. Figure 5.6(b) shows the new schedule for the loop pipeline and the corresponding datapath is shown in Figure 5.6(c). Due to the improvement in the initiation interval, this new pipeline will have approximately twice the throughput of the original pipeline in Figure 5.4. Based on this example, we can conclude that the structure of the data dependency graph is critical for obtaining high-performance loop pipelines. In this research, we propose restructuring the data dependency graph by applying associativity and distributivity rules to improve the minimum initiation interval. There is related work in [Nicol 91b], which presents a method for incrementally reducing the height of an expression tree from O(n) to O(log n). We propose extending this algorithm to consider restructuring an expression tree for the benefit of one particular critical path, which we would prefer to incur the least latency. To illustrate, we will consider a loop that accumulates the sum of seven arrays over all array indices: sum = sum + a[i] + b[i] + c[i] + d[i] + e[i] + f[i] + g[i]. Figure 5.7(a) shows the default data dependency graph assuming left-to-right associativity. The dotted lines in the figure indicate control steps after scheduling, where arrows that cross the dotted line require registers. For this example, we assume that operator chaining is not allowed, that is, every addition takes one clock cycle to complete.

Figure 5.7: Dependency graph restructuring. (a) Original, 7 cycles/iter.; (b) tree height reduction, 3 cycles/iter.; (c) restructured, 1 cycle/iter.

Figure 5.8: Incremental associativity transformation. (a) (sum + a[i]) + b[i]; (b) sum + (a[i] + b[i]).

In this case, sum in the current iteration has a loop-carried dependency on the sum calculated in the previous iteration (a dependency distance of one). The loop recurrence spans seven addition operations, having a path delay of seven clock cycles. Therefore the minimum initiation interval of this loop pipeline is seven cycles (recMII = ⌈7/1⌉ = 7). The typical approach in HLS is to balance the expression tree. For instance, we could use the tree height reduction algorithm from [Nicol 91a] to obtain the height-balanced tree in Figure 5.7(b). We have now reduced the path length of all inputs to three cycles, improving the minimum initiation interval to three. While this loop pipeline is more than twice as fast as Figure 5.7(a), the minimum initiation interval is still constrained by the loop recurrence. In our proposed approach, we restructure the expression tree to incur the least latency along the loop recurrence. By targeting loop recurrences, we can focus on improving the minimum initiation interval and consequently the loop pipeline performance. First, we find all operations in the graph that are contained within a loop recurrence. To determine all recurrences in a loop's data dependency graph, we solve the equivalent problem of finding all elementary cycles in the graph. An elementary cycle is a path through a graph where the first and last vertices are identical and no other vertex appears twice. All elementary cycles in a graph can be found in polynomial time [Hawic 08], and each cycle corresponds to a loop recurrence. If the graph contains multiple recurrences, we rank the recurrences by their respective impact on the initiation interval. The rank of each recurrence is found by calculating the recMII of the recurrence in isolation and then ranking the recMII values from high (most critical) to low. Each operation in a recurrence inherits this ranking.

Table 5.2: Minimum initiation interval of benchmarks for balanced vs. proposed restructuring

               Balanced Restructuring         Restructuring
Benchmark      recMII  resMII  MII           recMII  resMII  MII
faddtree         26      23     26             13      23     23
adderchain        2       2      2              2       2      2
multipliers       2       2      2              2       2      2
dividers          2       2      2              2       2      2
complex           3       3      3              3       3      3

Table 5.3: Operation and dependency characteristics of each benchmark

               Operations         Constraints
Benchmark      +/fadd/*/%/[]      +/fadd/*/%/[]     Distance   Total Instr
faddtree       0/21/0/0/22        X/1/X/X/2         1          80
adderchain     40/0/0/0/26        X/X/X/X/2         1          92
multipliers    6/0/2/0/10         X/X/2/X/2         1          30
dividers       11/0/0/4/13        X/X/X/X/2         1          72
complex        16/0/7/2/27        X/X/3/X/2         9          98

Next, we apply transformations incrementally to the graph to reduce the path length of recurrences. For example, Figure 5.8(a) shows the first two addition operations from the original data dependency graph in Figure 5.7(a), corresponding to the expression: (sum + a[i]) + b[i]. The left operand of the first addition is part of the loop recurrence that we wish to improve. We use associativity to restructure these two operations into an algebraically equivalent expression, sum + (a[i] + b[i]), as shown in Figure 5.8(b). This transformation has reduced the length of the recurrence by one cycle. In general, if we consider additions, an associative transformation involves two two-operand operations that form a recurrence: late = lateParent + earlyParent, and curOp = late + early. Here lateParent and late are the critical edges along which the recurrence occurs. In this case, we use the associative property of addition to transform this into: curOp = lateParent + (earlyParent + early). In this new expression, we have removed one addition operation from the recurrence, leaving only lateParent. Repeating this associative transformation incrementally, we eventually obtain the restructured data dependency graph in Figure 5.7(c). In this new graph, instead of balancing the height of the expression tree, our transformations have actually lengthened some paths in the data dependency graph in order to shorten the recurrence path. The loop recurrence now consists of only one addition, therefore the new loop pipeline has an initiation interval of one (assuming no resource constraints). Due to the improvement in the initiation interval, this new pipeline will have approximately seven times the throughput of the original loop in Figure 5.7(a) (II reduced from 7 to 1). These transformations are particularly effective for recurrences containing multi-cycle operations, for example floating point operations, which can cause long recurrence lengths and are unaffected by operator chaining. We only performed transformations on expressions consisting of operations of all the same type (i.e., all integer or all floating point).
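Applying the transformation repeatedly to the seven-array example yields the following sketch (our own rendering of Figure 5.7(c) in C): the balanced tree t sits off the recurrence, and the carried sum is added last.

    /* After incremental restructuring: the recurrence spans only the
     * final add, so recMII = 1, assuming one-cycle adds and no resource
     * constraints. */
    int sum_seven(const int *a, const int *b, const int *c, const int *d,
                  const int *e, const int *f, const int *g, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            int t = ((a[i] + b[i]) + (c[i] + d[i]))
                  + ((e[i] + f[i]) + g[i]);   /* off the recurrence path */
            sum = t + sum;                    /* the only add on the recurrence */
        }
        return sum;
    }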

5.6 Experimental Study and Results

We experimentally evaluated our approach using five C benchmarks, each containing a loop whose initiation interval is limited by both loop recurrences and resource constraints; all of the benchmarks are synthesizable by LegUp and the commercial tool. All benchmarks contain a tree of operations with a recurrence.

Table 5.4: Speed performance results

                  Initiation Interval          Cycles
Benchmark         Comm  Zha   Back  Back+R    Comm  Zha   Back  Back+R
faddtree          36    34    26    23        1539  1439  1128  1045
adderchain        4     3     2     2         372   297   209   209
multipliers       3     3     2     2         292   294   206   207
dividers          4     3     2     2         261   230   152   152
complex           4     5     3     3         454   550   382   382
Geomean           5.9   5.4   3.6   3.5       456   437   309   305
Ratio             1     0.92  0.62  0.6       1     0.96  0.68  0.67

Table 5.2 shows the impact of restructuring on the loop recurrence. The "Balanced Restructuring" column gives the recurrence minimum II (recMII), the resource MII (resMII), and the combined minimum II (MII) for the default case of balanced expression tree restructuring. The "Restructuring" column gives the same metrics after restructuring, as described in Section 5.5. In Table 5.2, we can see that restructuring improved the recMII of faddtree from 26 to 13. This occurred because we restructured a floating point addition, with a latency of 13 cycles, away from the loop recurrence, as shown in Figure 5.8.

Table 5.3 provides a summary of the properties of each benchmark. The "Operations" column gives the number of additions, floating point additions, multiplications, divisions, and memory operations in the loop body. The "Constraints" column gives the constraint on the number of functional units for adders, floating point adders, multipliers, and memory ports, with an X indicating no constraint. Although we restricted memories to two ports in these benchmarks, memory is spread across multiple independent block RAMs that can be accessed in parallel. The "Distance" column gives the dependency distance of the cross-iteration dependency in the loop. The "Total Instr" column gives the total number of LLVM instructions in the loop, most of which represent binary operations, as a measure of the scheduling complexity of each benchmark.

The benchmarks include golden input and output test vectors, allowing us to synthesize the circuits with a built-in self-test. We used these test vectors to simulate the circuits in ModelSim and verify correctness. We targeted the Stratix IV [Stra 10] FPGA (EP4SGX530KH40C2) on Altera's DE4 board [DE4 10] using Quartus II 11.1SP2 to obtain area and FMax metrics. Quartus timing constraints were configured to optimize for the highest achievable clock frequency. We benchmarked against a state-of-the-art commercial HLS tool configured to target a commercial FPGA similar to Stratix IV. We used the default commercial tool options, which include standard expression tree balancing. We configured LegUp to use functional units with identical latency to those of the commercial tool. We imposed a target clock period constraint of 3ns (333MHz) on both HLS schedulers.

In our study, we consider four scenarios for comparison: (1) a commercial HLS tool (Comm), (2) Zhang's recently published greedy modulo SDC scheduler [Zhang 13] (Zha) implemented in LegUp, (3) our proposed backtracking SDC modulo scheduler (Back), and (4) our scheduler combined with data dependency graph associative restructuring (Back+R). Table 5.4 and Table 5.5 give speed performance results for these four scenarios. The "Initiation Interval" column is the scheduled II of the loop pipeline. The "Cycles" column is the total number of cycles required to complete the benchmark. The "FMax" column provides the FMax of the circuit as reported by Quartus. The "Time" column gives the circuit wall-clock time: Cycles · (1/FMax). Ratios in the table compare the geometric mean (geomean) of the column to the respective geomean of the commercial tool. We also summarize the results in Figure 5.9.

Table 5.5: Speed performance results

                  FMax (MHz)                  Time (µs)
Benchmark         Comm  Zha   Back  Back+R    Comm  Zha   Back  Back+R
faddtree          257   229   248   233       5.99  6.28  4.55  4.48
adderchain        270   239   194   234       1.38  1.24  1.08  0.89
multipliers       540   485   545   511       0.54  0.61  0.38  0.41
dividers          236   261   270   270       1.11  0.88  0.56  0.56
complex           269   211   232   232       1.69  2.61  1.65  1.65
Geomean           299   271   277   281       1.53  1.61  1.11  1.09
Ratio             1     0.91  0.93  0.94      1     1.05  0.73  0.71

Figure 5.9: Backtracking SDC modulo scheduling experimental results. (a) LegUp versus prior work [Zhang 13]; (b) LegUp versus commercial HLS tool.

Table 5.6: Area comparison experimental results

                  ALUTs                              Registers
Benchmark         Comm   Zha    Back   Back+R       Comm    Zha     Back    Back+R
faddtree          1,266  1,676  1,629  1,638        2,305   2,175   2,240   2,374
adderchain        857    1,190  1,110  1,108        929     1,857   2,178   2,114
multipliers       77     122    124    108          68      173     193     110
dividers          5,395  8,488  5,495  5,495        9,072   13,771  9,732   9,732
complex           6,551  4,166  4,223  4,223        11,571  14,854  17,732  17,732
Geomean           1,242  1,538  1,391  1,354        1,725   2,698   2,768   2,488
Ratio             1.00   1.24   1.12   1.09         1.00    1.56    1.60    1.44

Table 5.7: Tool runtime (s) comparison

Benchmark         Comm  Zha   Back  Back+R
faddtree          16    34    9     59
adderchain        6     6     4     10
multipliers       0.2   0.4   0.2   0.2
dividers          6     3     2     6
complex           7     4     4     5
Geomean           3.8   4.0   2.2   5.1
Ratio             1.00  1.04  0.59  1.34

The results show that our backtracking modulo SDC scheduling approach can have a significant impact on loop pipelines with resource constraints combined with recurrences. Based on our experiments, the commercial tool appears to use a greedy modulo scheduler for loop pipelining, because its schedules cannot achieve the minimum II for these benchmarks. Consequently, our approach achieved a 38% geomean reduction in II versus the commercial tool. Furthermore, with restructuring we were able to improve this to a 40% geomean reduction in II. We also see that our backtracking approach achieves a 33% geomean II improvement over Zhang's greedy approach.

Backtracking SDC modulo scheduling reduced the geomean cycle count by 32% versus the commercial tool and by 29% versus Zhang. The cycle count improvement was less than the reduction in II due to time spent outside the loop pipeline in these benchmarks. The geomean FMax decreased by 7% versus the commercial tool when applying our approach, due to better balanced expression restructuring by the commercial tool. When restructuring along recurrences, we chain fewer operations, causing the geomean FMax to increase by 2%. Overall, the geomean wall-clock time for these benchmarks was reduced by 32% using backtracking and restructuring versus greedy SDC modulo scheduling. When compared to the commercial tool, our backtracking and restructuring approach improves geomean wall-clock time by 29%. This improvement is mainly due to a reduction in II, caused by the better scheduling decisions of our SDC modulo scheduler when compared to greedy scheduling.

Table 5.6 gives the area results for the four scenarios. The "ALUTs" and "Registers" columns give the number of Stratix IV combinational ALUTs and dedicated registers required. Table 5.7 compares the tool runtime in seconds for each algorithm; this includes the entire flow from C to Verilog. Ratios in the tables compare the geometric mean (geomean) of the column to the respective geomean of the commercial tool. Comparing our approach to the commercial tool in terms of area, the geomean combinational ALUTs increased by 9% and the geomean registers increased by 60%. The registers increased because the lower II of pipelines generated by our approach allows less register sharing. Comparing our approach to Zhang's in terms of area, the geomean combinational ALUTs decreased by 12% and the geomean registers decreased by 8%.

5.6.1 Runtime Analysis

Now we present a runtime characterization of our new backtracking approach versus the other scheduling algorithms. First, the runtime results in Table 5.7 show that our backtracking scheduler had 41% less geomean runtime than the commercial tool, but we incurred a 34% increase in runtime over the commercial tool when restructuring was also applied. Our scheduling algorithm's runtime is influenced by the number of invocations of the linear program solver, which is proportional to the number of instructions in the loop being scheduled.

We would like to know the typical range of instructions found in a loop to be pipelined by HLS. The MediaBench II Video benchmark suite [Fritt] is representative of modern and emerging multimedia DSP applications (MPEG-4, JPEG-2000, H.264), with applications that typically have extensive instruction-level parallelism. A study of workload characteristics [Fritt] observed that the average number of instructions per basic block in MediaBench was 9.4, with a maximum of 61 instructions in a single basic block. We performed a characterization of the CHStone [Hara 09] benchmarks and observed that the median number of instructions per basic block was 4, while the median number of instructions per loop was 24. Across the CHStone suite, a single basic block contained a maximum of 378 instructions and a single loop contained a maximum of 805 instructions. Therefore, we will perform runtime analysis of our algorithm for basic blocks of up to 1000 instructions.

For this experiment, we used the adderchain benchmark and duplicated the body of the loop N times, where N ranged from 1 to 12, and added a final summation after the loop. Each additional duplication introduced another recurrence into the loop pipeline and increased the loop size by 79 instructions. Figure 5.10 shows the runtime in seconds for each algorithm as the number of instructions in the loop increases. By default, we solved the SDC problem using a linear programming solver [lpso 14]. We also analyzed the runtime taken when efficiently solving the SDC problem incrementally after modifying the constraints, as described in [Ramal 99]. The lines marked with "(incremental SDC)" show these results. However, we observed only a minor runtime difference, leading us to believe that the LP solver is quite optimized. Here we see that our backtracking algorithm's runtime increases substantially compared to the commercial tool as the number of instructions grows, but the absolute runtime is still only about 1 minute, even for 1000 instructions. Although the backtracking runtime compares poorly to the commercial tool, Zhang's greedy approach is actually no better, because if we fail to schedule for a given II we must iteratively increment the candidate II and attempt to reschedule. This iterative process can be costly in terms of runtime, as seen in Figure 5.10. Figure 5.11 provides the final pipeline II achieved in each case. We observe that the greedy algorithms, both the commercial tool and Zhang, achieve inconsistent pipeline initiation intervals due to the resource constraints and cross-iteration dependencies.

5.7 Summary

This chapter demonstrated that resource constraints and loop recurrences can have a considerable impact on loop pipelining in HLS by increasing the initiation interval of synthesized pipelines. We proposed a novel backtracking SDC-based modulo scheduling algorithm and a graph restructuring technique for expression height reduction to reduce loop recurrence lengths. Our empirical study, on a set of benchmarks containing loop pipelines constrained by resources and limited by recurrences, shows that our approach achieves a 32% improvement in geomean wall-clock time versus prior work and 29% versus a commercial tool.

Figure 5.10: Runtime Characterization For Loop Pipelining Scheduling Algorithms. (Runtime in seconds, on a logarithmic scale, versus the number of instructions; series: Zhang, Zhang (incremental SDC), Backtracking, Backtracking (incremental SDC), and Commercial.)

Figure 5.11: Initiation Interval for Loop Pipelining Scheduled in Figure 5.10. (Final pipeline II versus the number of instructions; series: Zhang, Commercial, and Backtracking.)

Chapter 6

LegUp: Memory Architecture

6.1 Introduction

In this chapter, we describe LegUp's target memory architecture. Our goal for the memory architecture is to achieve high circuit performance and minimize FPGA memory block usage while supporting C input programs that use memory pointers. The C language targets a computing architecture that has a single address space. In modern computer architectures, the memory is a large off-chip RAM supported by on-chip caches. But in high-level synthesis, we have more flexibility and can generate a custom memory architecture for our particular target application based on its memory use. We will explore various HLS memory architectures in this chapter.

A common approach in high-level synthesis is to limit the types of memory pointers allowed in the input program, for instance, by requiring that each C pointer point to only one array during program execution. HLS users are expected to rewrite their programs to conform to these limitations. In LegUp, we support generic pointers, meaning that a C pointer can point to any location in memory. This increases LegUp's utility by allowing a wider range of input programs. We will discuss later how LegUp handles pointers that cannot be resolved to a particular memory location at compile time.

In LegUp's hardware-only (no processor) flow, we partition program memory into distributed on-chip FPGA block memories that are placed either globally or locally. Each local memory block is directly connected to the generated circuit datapath at any location where we access that particular memory. The global memory blocks are connected to the datapath through a single shared dual-port memory controller. During a load, the memory controller steers the incoming memory address to the appropriate global memory block and then returns the data from the memory. Local memory blocks can only be accessed within the hardware module in which they are instantiated, while global memory blocks can be accessed from any hardware module in the final design. We will describe the design of LegUp's shared memory controller and the global memory addressing scheme in Section 6.3.

We will show in Section 6.4 that the shared memory controller can limit performance in certain ways. By partitioning the memory space into local and global memory, we can increase circuit throughput. We discuss the compile-time analysis performed by LegUp to partition C arrays from the input program into either global or local memory blocks in hardware. We show empirically that using local memory blocks in LegUp improves the geomean wall-clock time by 8% compared to only using global memory,

when averaged across the CHStone benchmarks.

In Section 6.5, we investigate grouping arrays from the C program into shared memory blocks instead of storing each array in a separate memory block. We show in that section that this technique reduces the number of FPGA memory blocks required in the final design. We also discuss how to reduce the addressing logic required in the circuit datapath by allocating each array to an appropriate place in the shared memory block. We applied our array grouping approach in an experimental study on the CHStone benchmark suite and found that the geomean memory implementation bit usage decreased by 27%.

The remainder of this chapter is organized as follows: Section 6.2 presents related work. Section 6.3 gives an overview of LegUp's memory architecture and describes the shared memory controller. Section 6.4 describes local memory blocks and the algorithm we use to partition program memory between local and global memory. Section 6.5 discusses how we can group arrays into shared RAMs to save FPGA memory blocks. An experimental study is presented in Section 6.6. Section 6.7 offers a summary.

6.2 Background

6.2.1 Related Work

In HLS, we can often improve the performance of a pipelined loop by partitioning the input or output arrays into distinct RAMs, or memory banks, allowing for increased memory bandwidth. The Vivado HLS tool [Xili] includes compiler pragmas supporting various forms of memory partitioning. The user can manually combine smaller arrays into a single RAM or partition arrays into many RAMs. They can also reshape an array of many elements into a RAM with a larger bitwidth, allowing multiple elements to be accessed in parallel. The HLS study by Cong [Cong 12] optimizes C loop nests using on-chip memory reuse buffers, which reduce off-chip memory accesses by buffering array elements accessed in prior loop iterations. They describe loop transformations that reduce the size of these buffers while maintaining the same circuit performance, reducing on-chip FPGA memory bits by 40% on average.

Many academic HLS tools, such as SPARK [Gupta 03], GAUT [Couss 10], and ROCCC [Villa 10], only support programs containing C arrays (i.e., int a[]) and disallow all C pointers (i.e., int *a). In GAUT, pointers are allowed as function parameters, which synthesize into single integer output ports (they do not point to memory). Other HLS tools like CoDeveloper [Impu] and CatapultC [Caly] only allow C pointers that point to a single array during program execution, which can be resolved at compile time. In CoDeveloper, only arrays can be passed as arguments to functions (not pointers). By limiting the types of pointers allowable in the input program, these tools can simplify the target memory architecture; block RAMs are simply connected directly to the final circuit datapath.

Semeria and De Micheli [Semer 98] from Stanford demonstrated an HLS approach that supports generic C pointers to statically defined memory (no dynamic memory). They implemented their approach in the SUIF C compiler framework [Wilso 94] and use a points-to analysis at compile time to determine the memory locations each pointer can access. Memory instructions that access multiple memory locations are implemented using a multiplexer that selects between each possible memory using a minimally sized pointer (i.e., for two memories use a one-bit pointer). The work was extended to support dynamic memory [Semer 01], where they proposed a 32-bit pointer address encoding scheme consisting of a 16-bit tag, representing the memory location, concatenated with a 16-bit offset, representing the byte offset into the location. We describe a similar pointer encoding for LegUp in Section 6.3.2. Semeria only considered synthesizing a single C function and did not discuss how to handle larger programs that pass pointers between functions. LegUp does not inline all functions by default for the reasons explained in Section 3.4.1. Furthermore, they did not target a hybrid architecture where the hardware can access memory shared with the processor. We have extended their work in this chapter to handle the LegUp hardware architecture.

Modern FPGAs have dedicated block RAMs distributed in columns across the chip, resulting in high on-chip memory bandwidth [Under 04]. On-chip memory bandwidth is a key advantage of FPGAs in comparison to other compute platforms such as GPUs and CPUs [Fu 11]. On Cyclone II FPGAs [Cycl 04], these dedicated RAMs consist of 4-Kb memory blocks (M4Ks). Stratix IV FPGAs [Stra 10] contain dedicated 9-Kb memory blocks (M9Ks) and 144-Kb memory blocks (M144Ks).
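As a concrete illustration of the pragma-driven partitioning mentioned above, the following sketch uses Vivado HLS pragma syntax (the function, array names, and factors are our own illustration, not taken from the cited work):

    // Split "in" into four interleaved banks so a pipelined loop can
    // read four elements per cycle; pack "out" into wider words.
    void kernel(int in[256], int out[256]) {
    #pragma HLS array_partition variable=in cyclic factor=4
    #pragma HLS array_reshape variable=out cyclic factor=4
        for (int i = 0; i < 256; i++)
            out[i] = in[i] + 1;
    }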
The work by Cong [Cong 06c] proposed an HLS target memory architecture that uses FPGA block RAMs to implement a distributed set of register files as an alternative to discrete registers. This approach has the advantage of reducing register usage, minimizing multiplexing, and improving clock frequency. They reported a 2X logic area reduction on average and a clock period improvement of 7.8%. The work in [Zhu 01] proposed an HLS temporal memory optimization to detect arrays within a procedure with non-overlapping lifetimes, which can then be stored in the same memory block. The approach uses a flow-sensitive pointer analysis combined with memory lifetime analysis and selects memories to share using a graph coloring algorithm.

The work presented in [Pilat 11] is based on the PandA HLS framework and bears similarity to the local memory scheme we describe in Section 6.4. They propose a distributed memory architecture where each array is stored in a RAM local to the function where the array is defined. Instead of a shared memory controller, all distributed RAMs from other functions lower in the program call graph are connected together using a daisy-chain network. During a load/store from a function, they use this network to access memory stored in other functions by checking every distributed memory on the chain, one at a time, every cycle. The function stalls while waiting for the memory request. A possible drawback of this approach is that the daisy-chain network can require a long clock cycle latency for accessing memories from other functions, depending on the number of distributed memories.

The C2H compiler targets a hybrid processor/accelerator architecture with a shared memory between the processor and accelerators [Santa 07]. They implemented memory accesses from accelerators to the processor memory as distinct master ports in the Avalon interconnect, allowing multiple memory accesses to be performed in parallel.

6.2.2 Alias and Points-to Analysis

A key challenge when we try to generate a suitable memory hierarchy in high-level synthesis is to reason about C memory pointers using only static compiler analysis. Alias analysis, or memory disambiguation, is the problem of determining when two pointers refer to overlapping memory locations. An alias occurs during program execution when two or more pointers refer to the same memory location. Points-to analysis, a closely related problem, determines which memory locations a pointer can reference. In this chapter, we will be more concerned with points-to analysis. Solving the alias and points-to analysis problems requires us to know the values of all pointers at any state in the program, which makes this an undecidable problem in general [Landi 92].

Points-to analysis algorithms are categorized by flow-sensitivity and context-sensitivity. An approach is flow-sensitive if the control flow within the given procedure is used during analysis, while a flow-insensitive approach ignores instruction execution order.

const int imem[44] = { 0x8fa40000, 0x27a50004, ...
char output[2048] = { 0 };

int main() {
    int reg[32], dmem[64];

Figure 6.1: C snippet showing an example of global and function-scoped memory variables.

Context-sensitive analysis considers the possible calling contexts of a procedure during analysis. Points-to analysis can either be confined to a single function, called intraprocedural analysis, or applied to the whole program, called interprocedural analysis. A survey of popular points-to analysis techniques is given in [Hind 00, Hind 01]. Points-to analysis algorithms have varying levels of accuracy and may be overly conservative, but for programs without dynamic memory, recursion, and function pointers, most pointers are resolvable at compile time [Semer 98]. The compiler community has developed fast interprocedural flow-insensitive and context-insensitive algorithms. Andersen [Ander 94] described the most accurate of these approaches, which formulates the points-to analysis problem as a set of inclusion constraints for each program variable that are then solved iteratively. Steensgaard [Steen 96] presented a less accurate points-to analysis, which uses a set of type constraints modeling program memory locations that can be solved in linear time. In this chapter, we use the points-to analysis described by Hardekopf [Harde 07], which speeds up Andersen's approach by detecting and removing cycles that can occur in the inclusion constraints graph. We could improve the accuracy of our points-to analysis by using a context-sensitive, flow-sensitive algorithm such as the symbolic pointer analysis described by Zhu [Zhu 02], which was shown to be scalable to larger programs by using binary decision diagrams [Zhu 04]. To aid pointer analysis, the C language now includes a pointer type qualifier keyword, restrict, allowing the user to assert that memory accesses through the pointer do not alias with memory accesses through other pointers.
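A minimal C99 example of the restrict qualifier mentioned above (the function and names are our own illustration): the programmer asserts that a and b never overlap, so a points-to analysis may treat their accesses as independent when scheduling memory operations.

    void vadd(int *restrict a, const int *restrict b, int n) {
        for (int i = 0; i < n; i++)
            a[i] += b[i];   /* no aliasing: loads of b and stores to a
                               may be freely overlapped or pipelined */
    }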

6.3 LegUp Memory Architecture

This section gives an overview of HLS memory challenges and describes the LegUp memory architecture. We focus here on targeting a pure hardware flow, without a processor. We also assume that LegUp does not allow dynamic memory (malloc/free), recursion, or function pointers. Throughout this chapter, we consider only the on-chip memory architecture.

6.3.1 Overview

A traditional C compiler models memory as a single contiguous block of byte-addressable memory that is shared across all functions in the program. Global static variables are placed in a region of this shared memory, and constants in a read-only section of memory. Function-scoped memory is stored in a region of memory called the stack, which grows after each function invocation and shrinks after each function return. Thus, stack memory is re-used dynamically by different functions.

As discussed in Chapter 3, LegUp is built on the LLVM compiler and operates on the LLVM intermediate representation. The C code shown in Figure 6.1 gives an example of global memory, constants, and stack memory, with the corresponding representation in LLVM given by Figure 6.2.

@imem = constant [44 x i32] [i32 -1885077504, i32 665124868, ...]
@output = global [2048 x i8] zeroinitializer

define i32 @main() {
  %reg = alloca [32 x i32]
  %dmem = alloca [64 x i32]

Figure 6.2: LLVM intermediate representation example showing global and stack memory.

Figure 6.3: HLS memory binding and memory interconnection network.

We have two global variables: @imem is a constant integer array with 44 elements, and @output is a global byte array holding 2048 elements. Local to the main function, we have two variables allocated on the stack: %reg is a 32-element integer array and %dmem is a 64-element integer array.

Unlike a C compiler with a set target architecture, in HLS when we compile a C program into a digital circuit we are free to determine the best memory architecture for our specific application. Conceptually, generating a memory architecture in HLS requires us to determine the number of memory blocks and then connect them to the rest of the circuit datapath, as shown in Figure 6.3. Here we distinguish between programmer-defined C variables (reg, dmem, imem, output), called program memory, and the physical memory consisting of block RAMs on the FPGA in the final synthesized circuit. First, we bind program memory to physical memory, using one of several approaches:

1. One global shared physical RAM that stores all program memory.

2. One-to-one mapping: each program variable has a physical RAM.

3. A compromise, where multiple program memories can be assigned to multiple available physical memories.

The first binding approach is analogous to traditional computer architecture, with a single large contiguous shared memory. In LegUp, we use the second binding approach, where each C array is stored in a separate FPGA on-chip dual-port block RAM, with a data width that matches the data width of the array. Constant arrays are specifically instantiated as read-only memories to enable later FPGA synthesis optimizations. We show the third binding approach in Figure 6.3, where two program variables reg and dmem are placed in the same RAM, in non-overlapping regions. We explore this approach in Section 6.5.

Figure 6.4: LegUp 32-bit pointer address encoding (bits 31–23: 9-bit tag; bits 22–0: 23-bit address).

In LegUp, we use a one-to-one mapping because: 1) the hardware implementation is simpler, 2) the circuit is easier to understand and debug with each array from the C program in a separate RAM rather than buried in a large RAM, and 3) in the future we could allow many distributed RAMs to be accessed in parallel, increasing memory bandwidth.

As Figure 6.3 shows, we also must connect the physical memory to the circuit datapath, where store and load operations occur, using a particular interconnection network. Ideally, we would simply use point-to-point wires to connect each memory operation in the circuit datapath to an exact physical RAM. But if the C input program contains pointers that can point to multiple arrays, then we will need multiplexing logic between the circuit datapath and the memory blocks. Furthermore, we will require accurate points-to analysis to determine exactly which arrays each load or store instruction can access at compile time.

In LegUp, we partition all program memory into the following categories: global memory blocks, accessed through a shared memory controller, or local memory blocks, instantiated in a particular hardware module. We use local memory when we can statically determine that two conditions are met: 1) the array is only accessed in one function, and 2) each pointer only points to a single local array (no multiplexing is required). All other program memory is stored in global memory blocks. LegUp has no semi-local memory blocks for pointers that can point to multiple arrays; these arrays are all placed in global memory (this is future work). In the case of the hybrid flow, we have a third memory category: processor memory, allocated by the processor and accessed by the hardware accelerators using the on-chip memory cache in Figure 3.2. We will discuss global memory in Section 6.3.2 and local memory in Section 6.4. The details of accessing processor memory are beyond the scope of this dissertation.

6.3.2 Global Memory Blocks

All global memory blocks in LegUp's target memory architecture are accessed through a shared memory controller. The shared memory controller makes global memory accesses easy to understand and reason about, and reduces the memory signals passed between hardware modules. We assign a unique number, called a tag, to each program variable and its associated physical memory, which we use for the steering logic in the memory controller. All LegUp addresses are 32 bits wide and are composed of the array tag and the array address, as shown in Figure 6.4. The upper 9 bits of the memory address are reserved for the tag, allowing 255 distinct C arrays. A tag value of zero is reserved for NULL pointers and a tag value of one is reserved for the processor memory address space. The 23-bit address allows up to an 8MB byte-addressable memory for each array. Because the lower bits are used for the array address, this scheme allows pointer arithmetic: incrementing the address will not affect the tag bits. We could increase the pointer size to 64 bits in the future if we need more addressable memory space.

Continuing the example C code shown in Figure 6.1, we assume that LegUp places each of these arrays in global memory: imem, output, reg, and dmem. LegUp will instantiate one 44-word 32-bit ROM for the constant imem, a 2048-word 8-bit RAM for output, a 32-word 32-bit RAM for reg, and a 64-word 32-bit RAM for dmem.

Figure 6.5: LegUp memory controller block diagram.

As an example, we could assign the unique tags 2, 3, 4, 5 to imem, output, reg, and dmem respectively. Given these 9-bit tag assignments, the (byte) address of imem[10] would be (01000028)16, the address of output[15] would be (0180000F)16, and the address of dmem[5] would be (02800014)16.

Figure 6.5 shows the LegUp memory controller block diagram for two of these arrays: imem and output. Extending this controller to access all four arrays is straightforward. In the figure, we show the FPGA block ROM for the constant 32-bit imem array and the block RAM for the 8-bit output array, which are both instances of Altera's ALTSYNCRAM memory megafunction (inferred from Verilog). LegUp automatically generates a memory initialization file for each memory block. Here, the memory controller checks the 9-bit tag (mem_tag) of the incoming memory address to determine which memory block to enable and disables all the other RAMs. All addresses accessing the integer array imem will be aligned to 4-byte word boundaries, so the bottom two address bits will be zero. We account for this by right-shifting the incoming 23-bit array address by two and passing the resulting 21-bit array index into the address port of the imem ROM block. The output array is 8 bits wide, so no address shifting is needed. We assume that the latency of the FPGA block RAMs is one cycle; therefore, we must use the previous tag to select the memory block that is outputting the data requested in the previous cycle. We register the output of the memory controller to improve the circuit clock frequency, as the steering output multiplexer can become large. Consequently, all loads/stores in LegUp have a two-cycle latency by default. If the tag equals zero (NULL), then we ignore the memory access. If the tag equals one (processor memory), then we redirect the memory request to the on-chip memory cache over the Avalon interconnect (Figure 3.2), which we do not show here. The memory controller output width is equal to the maximum data width of any global memory block.

LegUp can support generic pointers, or pointers that can point to any array, by using the pointer tag bits in the memory controller to resolve pointer ambiguity at circuit runtime. This allows a wider range of input C programs, particularly those which are not amenable to points-to analysis at compile time. For the hybrid flow, we can use the tag bits to determine when a pointer should access the processor memory.

For simplicity, we have been assuming that the memory controller is single-ported, allowing one memory load or store per cycle. To add another memory port to the controller, we duplicate all the previously described input and output signals for the second port. We instantiate dual-ported RAMs that are available in the FPGA fabric for each array and connect each port of the memory controller to the corresponding port on the RAMs. We also duplicate the output multiplexer and previous tag register for the second output port of the memory controller. With these modifications to the memory controller, we can support two global memory accesses from the circuit datapath every cycle. During HLS scheduling, we enforce this constraint by only scheduling up to two load or store instructions in any cycle. During HLS binding, we connect memory accesses in the circuit datapath to one of the memory controller ports available during that cycle.
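The tag-plus-address encoding is simple enough to capture in a few lines of C. The helper below is our own sketch (not LegUp code) that reproduces the worked addresses above under the Figure 6.4 layout:

    #include <assert.h>
    #include <stdint.h>

    /* Form a 32-bit LegUp global memory address: 9-bit tag in bits
       31..23, byte offset within the array in bits 22..0. */
    static inline uint32_t legup_addr(uint32_t tag, uint32_t byte_offset) {
        return (tag << 23) | byte_offset;
    }

    int main(void) {
        assert(legup_addr(2, 10 * 4) == 0x01000028); /* &imem[10], 4B ints  */
        assert(legup_addr(3, 15)     == 0x0180000F); /* &output[15], bytes  */
        assert(legup_addr(5, 5 * 4)  == 0x02800014); /* &dmem[5], 4B ints   */
        return 0;
    }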
A limitation of LegUp's memory controller is that we can only access array elements; the memory contents are not completely byte-addressable. For instance, we could not directly load the most significant byte of an integer stored in an array. We could allow this in the future by adding additional input and output multiplexing to steer individual bytes. Multi-dimensional arrays are handled like single-dimensional arrays, with elements stored in row-major order, the same convention used by C.

LegUp uses a different memory controller to handle C structs (implemented by Victor Zhang). In a struct, the individual elements can have non-uniform size. We handle this with an additional 2-bit mem_size input port to the memory controller, which indicates the size of the struct element we are accessing: 0 for an 8-bit element, 1 for 16 bits, 2 for 32 bits, or 3 for 64 bits. We instantiate a 64-bit wide block RAM for each struct. When writing to a struct element that is smaller than 64 bits, we must use mem_addr and mem_size to activate the appropriate input byte enables of the RAM. When reading a struct element that is smaller than 64 bits, we must use mem_addr and mem_size to steer the correct bits of the 64-bit struct memory output to the lowermost bits of mem_data_out using a multiplexer. The full details of this memory controller are outside the scope of this dissertation.

In Figure 6.6, we continue our previous example from Figure 6.1 and show the memory controller steering logic when loading array element output[13]. At the top of the figure, we show the pointer to output[13] with a tag equal to three and an array address of 13. The address 13 is fed into the address port of the output RAM and the imem ROM (after right shifting by two). The memory controller write enable is equal to zero (load) and the enable bit is true. The tag is checked to enable the output RAM and disable the ROM holding imem. Using the memory controller output multiplexer, we select the output of the output RAM.

There are a few performance issues with the described memory controller. First, it contains a wide multiplexer, which grows linearly with the number of arrays declared in the C code, assuming that all arrays are placed in global memory. We will mitigate this issue by placing arrays in local memory blocks, as described in Section 6.4. Second, the memory controller allows only two memory accesses per clock cycle, which can be a performance bottleneck if we have significant parallelism in the program. In particular, loop pipelining is severely impacted by this memory constraint, so distinct local RAMs should be used if possible. Instruction-level parallelism is also limited by this memory constraint. Hypothetically, since program variables are stored in separate physical memory blocks, we could allow more than two memory accesses per cycle. But adding more ports to the global memory controller would require additional multiplexing. Furthermore, during HLS scheduling we would need to ensure that during any particular cycle, we do not access a program variable more than twice (each RAM is only dual-ported).
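Returning to the struct memory controller described above, the byte-enable selection it performs can be sketched in C as follows (a simplified model under our own naming; the actual controller is Verilog):

    /* mem_size encodes the element width: 0 -> 1B, 1 -> 2B, 2 -> 4B, 3 -> 8B.
       For a write, assert one byte-enable bit per byte lane of the 64-bit
       RAM word, starting at the element's byte offset within the word. */
    static unsigned char byte_enables(unsigned mem_addr, unsigned mem_size) {
        unsigned bytes = 1u << mem_size;           /* element size in bytes */
        return (unsigned char)(((1u << bytes) - 1u) << (mem_addr & 7u));
    }
    /* e.g., byte_enables(4, 2) == 0xF0: a 32-bit element in the upper
       half of the 64-bit word. */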

Figure 6.6: LegUp shared memory controller when loading array element output[13].

Figure 6.7: Relationship between program call graph and hardware module instantiations. (a) Program call graph; (b) hardware module instantiation hierarchy.

This would require either accurate points-to analysis or stalling logic in the memory controller to detect and handle conflicts. Also, the benefit would primarily depend on the increase in instruction-level parallelism enabled by allowing more memory accesses per cycle. We leave this for future work.

Another performance limitation of global memory relates to the program call graph. In LegUp, each C function corresponds to a hardware module in the final synthesized circuit. The hardware module instantiation hierarchy is dependent on the program call graph, with modules corresponding to functions lower in the call graph of the program being instantiated deeper in the module hierarchy of the circuit. For instance, assume we have a program with the call graph shown by Figure 6.7(a), where the main function calls two other functions: a and b. In LegUp, the hardware modules will be instantiated according to the hierarchy shown in Figure 6.7(b), where the modules corresponding to a and b have been instantiated inside the main module. The module c is instantiated twice, because the corresponding function is called from two distinct functions, a and b, in the program. Since we do not allow recursion, the program call graph will always be a tree.

Figure 6.8: Multiplexing required for the memory address at each level of the module hierarchy.

In the pure hardware flow, the shared memory controller is always instantiated in the main hardware module. We further assume that while the circuit is operating, only one hardware module is active at any one time. Figure 6.8 shows the multiplexing required for the memory address at each level of hierarchy for the example already described in Figure 6.7. For example, if we read from an array in the datapath of hardware module d then we must pass the array address up to module a, which then must pass this address up to the main module, where the memory controller is instantiated. Module main contains a 32-bit wide 3-to-1 memory address multiplexer, to handle the three possibilities: either the datapath in main, a, or b is active. Modules a and b contain a 3-to-1 and 2-to-1 32-bit multiplexer respectively. As we get further down in the module hierarchy, there is more multiplexing to get to the memory controller. For a program with a deep call graph, this multiplexing can be detrimental to circuit clock frequency.

6.4 Local Memory Blocks

In this section, we describe how LegUp stores some program arrays in distinct RAMs, called local memories, that are instantiated locally within a particular hardware module. These local memories can alleviate some of the performance drawbacks of global memories that we discussed in the previous section. For the local memories of a function, we instantiate a physical RAM within the module corresponding to the function, and connect any memory accesses that refer to the memory location directly to the corresponding local RAM. Global-scoped variables in the C program may still be categorized as local memory if they are only used in one function. All constant arrays are categorized as local memory, which we handle in LegUp by duplicating a local ROM inside all functions that access the constant.

We continue our example from Figure 6.1 and assume that the arrays reg and dmem are partitioned into local memory and output is stored in global memory (we ignore the array imem). We show the synthesized datapath of a function that accesses these arrays in Figure 6.9. In the figure, we only show the memory address wires for simplicity. On the right side of the block diagram, we have shown two local FPGA block RAMs containing the reg and dmem arrays.

Figure 6.9: Local and global memory addressing logic within the hardware module datapath.

Blocks in the leftmost column of the figure denote locations in the datapath that either load or store array elements. For simplicity, we assume that the exact memory accesses are known at compile time; for instance, the first load accesses reg[2]. In general, we may not know the element until runtime. For example, if the first load had accessed the element reg[i], then the datapath would be the same but with 2 replaced by the wire i. The first three accesses in the figure load and store from the reg array. Therefore, we need a 3-to-1 multiplexer to determine the memory address of the local reg block RAM. The select line of the multiplexer is controlled by the circuit's finite state machine, which compares the current state to the scheduled state of the memory access. Arrays in local memory do not need tags; the address of reg[2] is simply two. There are only two accesses to the dmem array in the datapath, so we only need a 2-to-1 multiplexer in front of the dmem RAM address port. We also note that the datapath could access both the dmem and the reg block RAMs in parallel. The output array was assigned to global memory. Therefore, we have a tag assigned to output, which we assume is equal to three. Given this tag assignment, the addresses of output[8] and output[9] would be (01800008)16 and (01800009)16 respectively.

The size of the memory address port multiplexer scales linearly with the number of memory accesses to that array in the function. For global memories, the multiplexer scales with the number of accesses to any global array, which may include several arrays in the function. If we have many memory accesses in a function, then these multiplexers can be on the critical path. In the future, we could explore pipelining these multiplexers.

Local memory reduces the number of physical memories accessed by the shared memory controller, which shrinks the size of the output multiplexer in Figure 6.5. Accesses to local memory require an address bitwidth that depends on the size of the local RAM, in contrast to global memories, which all require 32-bit addresses passed to the memory controller. Local RAMs have no output multiplexing or output register; therefore, they have only one cycle of latency, compared to the two-cycle latency of the shared memory controller. This lower latency reduces the number of cycles spent on local memory accesses. During FPGA placement, these local RAMs can be placed physically closer to the datapath operations that access them, which can lower delays due to closer proximity on the FPGA device.

A key motivation for local memories is that they can improve performance by allowing arrays to be accessed in parallel, which increases memory bandwidth. We already showed some of the benefits of local memories in Chapter 5, where loop pipelining required local memories to achieve good performance. The loop pipelining experimental study presented in Section 5.6 used local memories within the benchmarks to allow independent memory blocks to be accessed in parallel, achieving higher pipeline throughput.

Algorithm 3 MemoryPartition()
 1: pointsToSet ← points-to set found by points-to analysis of the program
 2: for each load/store instruction loadstore in the program do
 3:    address ← array address accessed by loadstore
 4:    function ← function containing loadstore
 5:    memories ← set of arrays pointed-to by address in pointsToSet
 6:    for each array in memories do
 7:       continue if array is already marked global
 8:       if array is constant then
 9:          Mark array as local to function
10:       else if number of arrays in memories > 1 then
11:          Mark array as global
12:       else if array is already marked local to any function ≠ function then
13:          Mark array as global
14:       else
15:          Mark array as local to function
16:       end if
17:    end for
18: end for

We now describe LegUp's partitioning algorithm that decides whether each program array is placed in local or global memory. We show the pseudocode in Algorithm 3. On line 1, we use a points-to analysis [Harde 07] implemented in LLVM by Silva [Silva 13], which we enhanced to perform accurately for LegUp's benchmarks. This returns a points-to set that contains the set of all memory locations pointed to by each address variable in the program. We treat a C array as one memory location, assuming that individual array elements are not distinguished. On lines 2–18, we loop over all LLVM load and store instructions in the program and retrieve the memory address accessed by each instruction (line 3). We retrieve the set of all arrays that could be accessed by the current instruction by looking up the address in the points-to set (line 5). We then loop over every array that this instruction can point to, categorizing each as local or global (lines 6–17). During this algorithm, an array can be marked as either global or local memory. Marking an array as global memory overrides any prior assignment to local memory (line 7). Each local array must be used exclusively in one function, unless the array is a constant, in which case we mark the constant array local to all functions where the array is accessed (lines 8–9). We only allow pointers to access local arrays if they point to a single local array. Therefore, we detect when pointers can point to multiple arrays, and mark all of these arrays global (lines 10–11). If the array is used in more than one function, we mark the array as global memory (lines 12–13). Finally, in all other cases, we mark the array as local memory to this particular function (line 15).
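The classification rules of Algorithm 3 can be illustrated with a small C program (our own hypothetical example; the variable names are not from LegUp):

    const int coeff[4] = {1, 2, 3, 4};  /* constant: duplicated as a local
                                           ROM in every accessing function */
    int shared[16];                     /* used by f() and g(): global     */
    int a[8], b[8];                     /* aliased by pointer p: global    */

    void f(int sel) {
        int scratch[8] = {0};           /* used only in f(): local RAM     */
        int *p = sel ? a : b;           /* p may point to a or b, so the
                                           rule on lines 10-11 marks both
                                           arrays as global                */
        p[0] = coeff[0] + shared[0] + scratch[0];
    }

    void g(void) { shared[1] = coeff[1]; }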

6.5 Grouped Memories

LegUp’s existing memory architecture suffers from poor utilization of Stratix IV M9K memory blocks by global memory. This is because an Altera synchronous on-chip RAM instantiated with only a few words Chapter 6. LegUp: Memory Architecture 92

Figure 6.10: LegUp allocating one physical RAM for each array.

Figure 6.11: Grouping arrays into physical RAMs in LegUp’s shared memory controller.

For example, Figure 6.10 shows the LegUp memory controller for a program with three 32-bit arrays: A, B, and C. In the figure, we leave out irrelevant details of the memory controller shown in Figure 6.5. Each array has been allocated to a separate physical block RAM, which will require three distinct M9K memory blocks on the FPGA device. However, the arrays use only 192 bits in total, which should require only one M9K memory block (an M9K holds 9Kb, or 9,216 bits).

To better utilize FPGA block memory, LegUp can group arrays by bitwidth and store them in one large physical RAM for each bitwidth size. We can achieve significant M9K savings by packing many memories into the same M9K block using this grouped memories approach, as shown in Figure 6.11. Here, we have grouped all three 32-bit arrays from Figure 6.10 together inside a single RAM. We also show that an array with a different bitwidth, such as the 16-bit array Z, is stored in a different RAM. By grouping the 32-bit arrays, we have shrunk the number of M9K blocks from three to one. Furthermore, the multiplexer on the output has fewer inputs, which could improve the circuit clock frequency. The number of memory bits required remains unchanged, but the number of memory implementation bits, corresponding to M9K block usage, has improved substantially.

We group all arrays in the memory controller into up to four RAMs, one for each possible array bitwidth: 8, 16, 32, and 64. LegUp only groups global memory blocks, not local memories. We group constant memories and non-constant memories into ROMs and RAMs respectively.

Figure 6.12: Grouped memory array address offsets. (a) Naive array offsets with minimal wasted space; (b) the RAM padded to make both arrays' offsets divisible by four.

6.5.1 Grouped Memory Allocation

All program memories grouped within a physical RAM share the same tag. But now the address of each grouped array must include a byte offset, to account for the location of the array within the larger RAM. Therefore, a pointer address now consists of: Tag + Offset + Index, where Tag and Offset are usually known at compile time but Index typically changes at runtime. Unfortunately, this offset will require us to use more addition operations during address calculations in hardware, compared to the non-grouped approach. As we saw previously, by default an array's address is calculated by Tag + Index. We know that the constant Tag has no overlapping bits with the Index, so we can concatenate the two values without any addition. When grouping memory, if we pack each array directly after the previous array in the grouped RAM, the Offset will typically have overlapping bits with the Index, so an addition must be used. For example, Figure 6.12(a) shows the default grouping of two arrays, A and B, in a single RAM. The array B has an Offset of three, therefore a pointer to B[1] would involve the address calculation: Offset + Index = (11)2 + (01)2, requiring an addition. These extra additions after grouping add delay to the datapath, increasing the circuit clock period and circuit area, effectively negating the improvement in M9K blocks.

Given an array A, to access the element A[i] from the circuit datapath we would perform the address calculation: Tag_A + Offset_A + i. However, if we can align the array offset such that Offset_A does not overlap with the range of i indices for array A, then we can transform this address calculation to use OR gates instead of addition: Tag_A OR Offset_A OR i. Furthermore, if Tag_A and Offset_A are known at compile time, then we can calculate the address using simple concatenation, without any hardware logic. If we can ensure that all arrays follow this alignment property, then we can always perform address calculation without the need for addition. We note that this technique only works in LegUp's pure hardware flow, because in the hybrid flow we have no control over the alignment of arrays placed in processor memory.

We need to ensure that, for each grouped array, the array address Offset is aligned so that none of its bits overlap with any possible value of the array's Index. The bitwidth N of the Index is determined by the number of elements in the array. We increase the Offset until it is a multiple of 2^N (i.e., the bottom N bits equal zero). For example, in Figure 6.12(a) the possible values of Index for array B range from zero to two, requiring two bits. Therefore, we should align the address of the array such that the Offset is a multiple of four (2^2), as shown in Figure 6.12(b). A pointer to B[1] now involves the address calculation: Offset OR Index = (100)2 OR (001)2 = (101)2, which can be performed as a concatenation.
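In C, this OR-based address formation can be sketched as follows (our own illustration of the scheme just described; in hardware, the OR of a compile-time constant with disjoint index bits reduces to simple wire concatenation):

    #include <stdint.h>

    /* Valid only when offset is aligned past the array's index bits,
       i.e., offset is a multiple of 2^N where N is the Index bitwidth. */
    static inline uint32_t grouped_addr(uint32_t tag, uint32_t offset,
                                        uint32_t index) {
        return (tag << 23) | offset | index;
    }
    /* Figure 6.12(b): array B padded to offset 4; B[1] maps to
       offset OR index = (100)2 | (001)2 = (101)2 within the RAM. */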

Table 6.1: Naive grouped RAM memory allocation

Array   Start Address (B)    End Address (B)      Memory Size (B)    Alignment (B)
a       0      (0)16         2      (2)16         3     (3)16        4     (4)16
        3      (3)16         3      (3)16         1     (1)16
b       4      (4)16         6      (6)16         3     (3)16        4     (4)16
        7      (7)16         511    (1FF)16       505   (1F9)16
c       512    (200)16       799    (31F)16       288   (120)16      512   (200)16
        800    (320)16       4095   (FFF)16       3296  (CE0)16
d       4096   (1000)16      6151   (1807)16      2056  (808)16      4096  (1000)16
        6152   (1808)16      6655   (19FF)16      504   (1F8)16
e       6656   (1A00)16      6943   (1B1F)16      288   (120)16      512   (200)16
        6944   (1B20)16      8191   (1FFF)16      1248  (4E0)16
f       8192   (2000)16      10247  (2807)16      2056  (808)16      4096  (1000)16
        10248  (2808)16      11263  (2BFF)16      1016  (3F8)16
g       11264  (2C00)16      12287  (2FFF)16      1024  (400)16      1024  (400)16
h       12288  (3000)16      12291  (3003)16      4     (4)16        4     (4)16

(Rows without an array name are unused "holes" in the RAM; their Memory Size column gives the amount of unused space.)

Modifying the array offsets using this technique saves area (fewer adders) and improves circuit clock frequency. Of course, we waste memory bits; for instance, in Figure 6.12(b) the fourth word is empty. However, we waste fewer memory bits than when we do not group global memory.

To provide a concrete example of grouping arrays into a single RAM, consider eight 32-bit arrays (a–h) from the jpeg CHStone benchmark, varying from 3–2056B in size. We first use a naive memory allocation for placing each array into the same 32-bit RAM. We add each array to the RAM in program order (a–h) and we place each array at the next available spot in memory that satisfies the array's alignment constraint. We show the final memory allocation in Table 6.1. The first column gives the name of each array; the next two columns give the start address of each array in decimal and in hexadecimal as calculated by our naive approach. The following two columns give the ending address of each array, then the next two give the array size in bytes. The final two columns show the required address alignment of the array, as explained previously. Rows that represent unused memory "holes" in the RAM are marked as such, with the size column showing the amount of unused space. For example, array c is offset to byte-address (200)16 in the RAM, which has no bits overlapping with the possible array indices: (0)16–(11F)16. After this memory allocation, the total required size of the RAM is 12,292B with 6,570B of unused space, resulting in a fragmentation ratio (unused/total memory) of 0.53.

Minimizing memory fragmentation requires us to develop a static memory allocator. Our algorithm differs from conventional memory allocation techniques [Wilso 95] because each memory must be allocated at a specific address boundary. Algorithm 4 shows the pseudocode for our memory allocation. Our approach reduces fragmentation by reordering the arrays in the RAM and by keeping track of unused holes of memory. We observe that arrays with larger address alignment requirements have fewer choices of valid offsets in the RAM. Therefore, on line 1 we sort the arrays by descending alignment size first, and by descending array size to break ties. We keep a list of unused holes in memory (line 2). Initially the RAM is empty; recall from Figure 6.4 that we have 23 address bits for each RAM, therefore the first hole spans the addresses from 0 to 2^23 − 1 (line 3). On lines 4–30, we loop over the sorted arrays and greedily place each array in the first available space (hole) in the RAM. The boolean variable placed on line 5 will be true when we have found a spot in the RAM for the array. We keep track of the candidate starting address for the array on line 7.

Algorithm 4 MemoryAllocation(arrays)
 1: Sort arrays in descending order by address alignment, then descending by array size
 2: holes ← empty list
 3: Insert hole from addresses 0 to 2^23 − 1 into holes list
 4: for each array in sorted order do
 5:   placed ← false
 6:   hole ← first hole from holes list
 7:   arrayStart ← 0
 8:   while not placed do
 9:     while hole start address > arrayStart do
10:       arrayStart ← arrayStart + array.alignment
11:     end while
12:     arrayEnd ← arrayStart + array.size − 1
13:     if arrayEnd ≤ hole end address then
14:       allocate array to start at address arrayStart in memory
15:       placed ← true
16:       start1 ← hole start address
17:       end1 ← arrayStart − 1
18:       start2 ← end1 + 1 + array.size
19:       end2 ← hole end address
20:       if start1 ≤ end1 then
21:         Insert new hole from addresses start1 to end1 in holes list before hole
22:       end if
23:       if start2 ≤ end2 then
24:         Insert new hole from addresses start2 to end2 in holes list after hole
25:       end if
26:       Remove hole from holes list
27:     end if
28:     hole ← next hole in holes list
29:   end while
30: end for

Table 6.2: Grouped RAM memory allocation with reduced fragmentation (hexadecimal in parentheses; rows marked (hole) are unused memory)

Array     Start Address (B)    End Address (B)     Memory Size (B)    Alignment (B)
d               0 (0)              2055 (807)         2056 (808)        4096 (1000)
h            2056 (808)            2059 (80B)            4 (4)             4 (4)
(hole)       2060 (80C)            2559 (9FF)          500 (1F4)
c            2560 (A00)            2847 (B1F)          288 (120)         512 (200)
a            2848 (B20)            2850 (B22)            3 (3)             4 (4)
(hole)       2851 (B23)            2851 (B23)            1 (1)
b            2852 (B24)            2854 (B26)            3 (3)             4 (4)
(hole)       2855 (B27)            3071 (BFF)          217 (D9)
g            3072 (C00)            4095 (FFF)         1024 (400)        1024 (400)
f            4096 (1000)           6151 (1807)        2056 (808)        4096 (1000)
(hole)       6152 (1808)           6655 (19FF)         504 (1F8)
e            6656 (1A00)           6943 (1B1F)         288 (120)         512 (200)

We keep track of the candidate starting address for the array on line 7. We then loop over the available holes on lines 8–29, starting from the first available hole in the RAM (line 6). Recall that if an array has an alignment of 4096, then the array's start address must be a multiple of 4096 (i.e., 0, 4096, 8192, 12288). In the loop on lines 9–11, we increase the candidate array start address by multiples of the alignment until it is at or beyond the start of the current hole. The array end address is the array start address plus the array size in bytes, minus one (line 12). On line 13, we check whether the array fits into the current hole. If the array fits, we have successfully allocated the array at this start address (line 14) and we update the placed variable (line 15). We must then update the current hole in memory, which has possibly become two new holes on either side of the array we just placed (lines 16–26). We calculate the start and end addresses of the new holes (lines 16–19), and we add the two new holes on lines 21 and 24, after checking that there was unused space to form a hole (lines 20, 23). We remove the original, now outdated, hole from the holes list on line 26. If there was not enough room for the array in the current hole, we move on to the next available hole (line 28) and iterate; otherwise, we move on to the next array. Our algorithm is O(n²), where n is the number of arrays to be allocated in the RAM.

Using this algorithm on the arrays we saw in Table 6.1, we obtain the new memory allocation shown in Table 6.2. In this example, we first ordered the arrays by descending alignment and size: d, f, g, c, e, h, a, b. We start by allocating the first array, d, into the first available location at address zero. Then we allocate array f, which is placed at the next available unused address that is a multiple of the required 4096 alignment: 4096. The g array could be placed at addresses 0, 1024, 2048, or 3072, but the first three are already taken up by array d, so we place g at the first available memory slot, at address 3072. We continue this process for the remaining arrays. After memory allocation, the total size of the RAM is now 6,944B (45% less than before) with 1,222B of unused space, leading to a significantly better fragmentation ratio of 0.18.
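To make the hole bookkeeping concrete, the following is a compact C sketch of Algorithm 4, assuming a singly linked list of holes. The type names and structure are illustrative rather than LegUp's actual implementation, and because of tie-breaking and hole-ordering details the resulting layout can differ slightly from Table 6.2.

    #include <stdlib.h>

    /* Illustrative data structures (not LegUp's actual ones). */
    typedef struct {
        unsigned size;      /* array size in bytes        */
        unsigned align;     /* required address alignment */
        unsigned start;     /* output: allocated address  */
    } Array;

    typedef struct Hole {
        unsigned start, end;
        struct Hole *next;
    } Hole;

    /* Line 1 of Algorithm 4: descending alignment, ties by descending size. */
    static int cmp(const void *p, const void *q)
    {
        const Array *x = p, *y = q;
        if (x->align != y->align) return x->align < y->align ? 1 : -1;
        if (x->size  != y->size)  return x->size  < y->size  ? 1 : -1;
        return 0;
    }

    void memory_allocation(Array *arrays, int n)
    {
        qsort(arrays, n, sizeof *arrays, cmp);

        /* Lines 2-3: one initial hole covering the whole 2^23-byte space. */
        Hole *holes = malloc(sizeof *holes);
        holes->start = 0;
        holes->end = (1u << 23) - 1;
        holes->next = NULL;

        for (int i = 0; i < n; i++) {                /* lines 4-30 */
            Array *a = &arrays[i];
            for (Hole **link = &holes; *link; link = &(*link)->next) {
                Hole *h = *link;
                /* Lines 9-12: first aligned start address inside this hole. */
                unsigned s = (h->start + a->align - 1) / a->align * a->align;
                unsigned e = s + a->size - 1;
                if (e > h->end)
                    continue;                        /* no room: try next hole */
                a->start = s;                        /* line 14 */
                /* Lines 16-26: split the hole around the placed array. */
                if (e < h->end) {                    /* leftover space on the right */
                    Hole *r = malloc(sizeof *r);
                    r->start = e + 1;
                    r->end = h->end;
                    r->next = h->next;
                    h->next = r;
                }
                if (s > h->start) {                  /* leftover space on the left */
                    h->end = s - 1;
                } else {                             /* array consumed the hole start */
                    *link = h->next;
                    free(h);
                }
                break;
                /* A full implementation would report failure if no hole fits. */
            }
        }
    }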

6.6 Experimental Study

We studied the impact of our proposed memory optimizations using the CHStone benchmarks. We targeted the Stratix IV [Stra 10] FPGA (EP4SGX530KH40C2) on Altera's DE4 board [DE4 10], using Quartus II 13.1 to obtain area and FMax metrics. Quartus timing constraints were configured to optimize for the highest achievable clock frequency. Our experimental results were gathered across the

CHStone benchmark suite. We considered four scenarios for comparison: 1) placing all program arrays in global memory (Global); 2) grouping multiple arrays into the same RAM in global memory (Group); 3) partitioning arrays into local memory and global memory (Local); and 4) combining grouped global memory with local memories (Both).

Table 6.3 gives speed performance results for these four scenarios. The “Cycles” column is the total number of cycles required to complete the benchmark. The “FMax” column provides the FMax of the circuit reported by Quartus. The “Time” column gives the circuit wall-clock time: Cycles · (1/FMax). Ratios in the table compare the geometric mean (geomean) of each column to the respective geomean of the default LegUp flow (Global).

For local memories, we used static points-to analysis [Harde 07] across the CHStone benchmarks to identify the set of possible memory locations accessed by each pointer during program execution. For these benchmarks, all pointers could be resolved to one or more memory locations. Of all the pointers, 94% pointed to a single memory location, 3% pointed to two memory locations, and 2% pointed to four memory locations. Of the 140 arrays across the benchmarks, 16% were referenced by a pointer that could point to more than one memory location, forcing these arrays into global memory. When we partitioned the arrays into local and global memory across these benchmarks, we were able to place 57% of the arrays in local memory and 43% in global memory. A further 27% of the arrays could not be placed in local memory because they are accessed by multiple functions. For this analysis, we ignored unsynthesizable memory, such as the constant character arrays used as arguments to the printf function.

We had a choice of either one or two cycles of latency for the local memories. We chose single-cycle latency for local memories, which resulted in a geomean clock cycle reduction of 9% compared to the two-cycle latency of global memory. However, using single-cycle local memories did not improve FMax, because the local block RAMs have only an input register. Without a RAM output register, our datapath contains more combinational delay when loading from a local memory than when using the shared memory controller for global memory. We also ran an experiment with two-cycle latency local memories, which resulted in a 10% higher geomean FMax and a comparable wall-clock time improvement, but with a geomean cycle count within 1% of using only global memory. This result implies that the HLS schedule (instruction-level parallelism) of these benchmarks was not significantly constrained by having only two global shared memory ports. Overall, geomean wall-clock time improved by 12% when combining local memories with grouped global memories, with a portion of the overall improvement coming from the local memories and a portion from grouping global memories.

Table 6.4 gives the area results for the four scenarios. The “Memory Implementation Bits” column gives the number of Stratix IV M9K blocks required multiplied by 9Kb, added to the number of M144K blocks multiplied by 144Kb. We use “implementation” to differentiate this metric from the number of memory bits required by the circuit without consideration for actual FPGA memory block usage. The “Logic Utilization” column provides the logic utilization reported by Quartus II, a metric that estimates device area as the number of half-ALMs used by the circuit.
The “Registers” column gives the number of Stratix IV dedicated registers required.

We found a significant improvement in the memory implementation bits required during synthesis, which decreased by 27% when grouping global memories. Furthermore, using local RAMs in isolation also improved the memory implementation bits, by 22%, because the FPGA synthesis tool is able to optimize the smaller RAMs away (implementing them in LUT RAM). When we combined local memories with grouped global memories, we found that the reduction in memory implementation bits was 37% on average. By combining local memories with grouping, geomean logic utilization decreased by 20% and geomean registers decreased by 13%. This was due to less multiplexing in the shared memory controller, and to datapath registers that Quartus can pack into unused registers inside the connected local block RAMs.

[Table 6.3: Memory architecture performance results. Cycles, FMax (MHz), and wall-clock time (µs) per CHStone benchmark for the Global, Group, Local, and Both scenarios. Geomean ratios relative to Global: Cycles 0.99 (Group), 0.91 (Local), 0.90 (Both); FMax 1.00, 0.98, 1.03; Time 0.99, 0.92, 0.88.]

[Table 6.4: Memory architecture area results. Memory implementation bits, logic utilization (half-ALMs), and registers per CHStone benchmark for the same four scenarios. Geomean ratios relative to Global: memory implementation bits 0.73 (Group), 0.78 (Local), 0.63 (Both); logic utilization 0.96, 0.82, 0.80; registers 0.99, 0.88, 0.87.]

6.7 Summary

This chapter presented the memory architecture generated by LegUp. We discussed how global memories are accessed through a shared memory controller. We also described how LegUp uses separate local memories for arrays that are only accessed in one particular function. These local memories can offer greater performance than global memories alone. We also described how to group arrays together in global memories instead of storing them in separate RAMs, which reduces FPGA block RAM usage. We found the geomean memory implementation bits required during synthesis decreased by 37% using local memories and grouped memories when compared to using only global memories. We also found that combining the two memory approaches improved geomean wall-clock time performance by 12% over the CHStone benchmark suite.

Chapter 8

Case Study: LegUp vs Hardware Designed by Hand

7.1 Introduction

A major goal of this dissertation is to improve the quality of hardware that can be synthesized automatically from software. In this chapter, we attempt to answer the question: how close is LegUp-generated hardware to hand-designed hardware? Or, perhaps more importantly, can our proposed methodology generate circuits that meet realistic FPGA design constraints? As a first step to answering this question, we present a case study of a Sobel image filter. This filter is typically used in edge detection, which is important for computer vision applications. We first describe the Sobel algorithm and present a straightforward C implementation. We then describe the hand-written hardware implementation of the filter. Next, we provide an implementation using LegUp, showing the transformations we made to the C code in order to match the performance of the RTL implementation. We show that LegUp can produce a filter with a wall-clock time within 2% of the custom implementation, but with about 65% more circuit area. This case study illustrates the types of code transformations we must currently apply for high-level synthesis to create a circuit of peak performance. Some of these transformations are non-obvious to a software engineer, which we hope will motivate future work.

The remainder of this chapter is organized as follows: Section 7.2 presents related work and provides a description of the Sobel filter. Section 7.3 describes the hand-written RTL implementation of the filter. We present an equivalent LegUp implementation in Section 7.4. An experimental study comparing the two approaches is presented in Section 7.5. Section 7.6 offers a summary.

7.2 Background

7.2.1 HLS vs Hand RTL

There have been a few published studies measuring the gap between HLS and custom hand-written RTL designs. An independent study by BDTI [BDTI] found that AutoESL [Auto] (now called Vivado [Xili]) produced a design that met the throughput requirements for a DQPSK receiver and had a level of resource utilization comparable to hand-coded RTL on a Spartan-3A Xilinx FPGA device. Similar work in [Nogue 11] implemented a DSP wireless receiver sphere decoder channel preprocessor using AutoESL. They found that the high-level synthesis implementation was competitive with the reference RTL implementation in terms of throughput and resource cost. The high-level synthesis Blue Book [Finge 10] discusses the style of C coding required to achieve acceptable performance when synthesizing hardware. Fingeroff shows examples for the Catapult C [Caly] HLS tool, but they are equally applicable to other HLS tools, including LegUp. He stresses that a poor C-level description can lead to a sub-optimal final circuit.

[Figure 7.1: Sobel stencil sliding over input image]

Table 7.1: Sobel Gradient Masks
        Gx                Gy
   -1   0   1        1   2   1
   -2   0   2        0   0   0
   -1   0   1       -1  -2  -1

7.2.2 Sobel Filter

The Canny edge detector [Canny 86] is an algorithm for detecting edges in an image, which is important for computer vision applications. In an efficient hardware implementation, we operate on a “stream” of incoming data, where one new pixel from the image arrives every clock cycle. The edge detection algorithm consists of five stages, each of which can run in parallel, with the output pixel from one stage feeding into the next stage at every clock cycle. The first stage of Canny edge detection applies a low-pass image filter, using a Gaussian convolution, to blur the image and remove noise that could cause superfluous edges. Next, we use an edge detection operator to compute the edge direction. Edge detection operators work by approximating the horizontal and vertical first derivatives of the intensity of the image in a particular window of the image. These derivatives indicate the direction of the edge (horizontal, vertical, or one of two diagonals). Here we chose the Sobel edge detection operator, although there are other operator choices [Abdou 79]. After this step, we apply another filter to thin the edges. Next, we perform a hysteresis thresholding step to avoid breaking up edges where the operator output swings slightly above and below the edge threshold. Finally, we remove lone spurious edges caused by image noise.

1  #define HEIGHT 512
2  #define WIDTH 512
3  for (y = 0; y < HEIGHT; y++) {
4      for (x = 0; x < WIDTH; x++) {
5          if (not_in_bounds(x, y)) continue;
6          x_dir = 0; y_dir = 0;
7          for (xOffset = -1; xOffset <= 1; xOffset++) {
8              for (yOffset = -1; yOffset <= 1; yOffset++) {
9                  pixel = input_image[y+yOffset][x+xOffset];
10                 x_dir += pixel * Gx[1+xOffset][1+yOffset];
11                 y_dir += pixel * Gy[1+xOffset][1+yOffset];
12             }
13         }
14         edge_weight = bound(x_dir) + bound(y_dir);
15         output_image[y][x] = 255 - edge_weight;
16     }
17 }

Figure 7.2: C code for Sobel Filter.
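Figure 7.2 calls two helper functions, bound and not_in_bounds, whose bodies are not shown. The following is a minimal sketch of plausible implementations, assuming 8-bit pixels and the 512-by-512 image dimensions; the bodies below are our assumption, not the benchmark's actual code.

    /* Hypothetical helper bodies; Figure 7.2 shows only the call sites. */

    /* Clamp a signed gradient sum into the 0-255 range of an 8-bit pixel
       (taking the magnitude first, a common Sobel convention). */
    int bound(int dir) {
        if (dir < 0) dir = -dir;
        return (dir > 255) ? 255 : dir;
    }

    /* True on the one-pixel image border, where the 3x3 stencil would
       read outside the input image. */
    int not_in_bounds(int x, int y) {
        return x < 1 || x > WIDTH - 2 || y < 1 || y > HEIGHT - 2;
    }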

[Figure 7.3: Sobel hardware line buffers and stencil shift registers. Two 512-element shift registers (line buffers) feed the 3×3 stencil registers a–i as each 8-bit pixel_in arrives.]

In this chapter, we focus on the second stage of the edge detector: the Sobel filter. The Sobel filter is performed by convolution, using a three-pixel by three-pixel stencil window. This window shifts one pixel at a time from left to right across the input image, then shifts one pixel down and continues from the far left of the image, as shown in Figure 7.1. At every position of the stencil, we calculate the edge value of the middle pixel e, using the adjacent pixels labeled a to i. Table 7.1 gives the three-by-three Gx and Gy gradient masks. These constants are used to approximate the gradient in both the x and y directions at each pixel of the image, using the eight neighbouring pixels. The C source code for the Sobel filter is provided in Figure 7.2. The outer two loops ensure that we visit every pixel in the image, while ignoring image borders (line 5). The stencil gradient calculation is performed on lines 6–13. The edge weight is calculated on line 14, where we bound the x and y directions to the range 0 to 255. Finally, we store the edge value in the output image on line 15. We assume that the image is 512 by 512 pixels.
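For concreteness, aligning the mask rows of Table 7.1 with the stencil rows of Figure 7.1 and dropping the zero mask entries gives the expansion (our own rewriting, shown for illustration):

    x_dir = (c + 2f + i) - (a + 2d + g)
    y_dir = (a + 2b + c) - (g + 2h + i)

Only six of the nine stencil pixels contribute to each sum, and the centre pixel e is never needed at all; Section 7.3 shows how the hand-written hardware exploits this.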

7.3 Custom Hardware Implementation

An experienced hardware designer, Blair Fort, provided a hand-coded RTL implementation of the Sobel filter. The hardware design assumes that a “stream” of image pixels is being fed into the hardware module at the rate of one pixel every cycle. The hardware implementation stores the previous two rows of the image in two shift registers, or line buffers, as shown in Figure 7.3. These line buffers can be

1  inline unsigned char calculate_edge() {
2      int x_dir = 0, y_dir = 0;
3      int xOffset, yOffset;
4      unsigned char edge_weight;
5
6      for (yOffset = -1; yOffset <= 1; yOffset++) {
7          for (xOffset = -1; xOffset <= 1; xOffset++) {
8              x_dir += stencil[1+yOffset][1+xOffset] * Gx[1+xOffset][1+yOffset];
9              y_dir += stencil[1+yOffset][1+xOffset] * Gy[1+xOffset][1+yOffset];
10         }
11     }
12
13     edge_weight = bound(x_dir) + bound(y_dir);
14     edge_weight = 255 - edge_weight;
15     return edge_weight;
16 }

Figure 7.4: Calculating the Sobel edge weight using the stencil window.

efficiently implemented in FPGA block RAMs. The incoming image pixel is labeled pixel_in, while the elements of the stencil are labeled a to i, corresponding to the labels of the stencil in Figure 7.1. Using the two 512-pixel line buffers, the hardware can retain the necessary neighbouring pixels for the stencil to operate, and it updates this window of pixels as each new input pixel arrives every cycle. The edge calculation can now be changed to use the stencil buffer directly, as shown in the updated C code in Figure 7.4.

After sufficient cycles have passed for the stencil to hold valid data, we feed the current nine values of the stencil into a three-stage pipeline that calculates the x and y directions, as described in lines 6–11 of Figure 7.4. The calculation was optimized to use wire-based shifting for the multiply-by-two, two's complement for negation, and to leave out stencil values that were not needed (like the middle pixel e). This is followed by a pipeline stage to perform the bounding on line 13, another stage to perform the addition on line 13, and a final stage for the subtraction on line 14. After this pipeline, the edge weight has been calculated and can be output from the module. The hardware has some additional control to wait until the line buffers are full before the output edge data is marked as valid, and additional checks that set the output to zero if we are on the border of the image. This hardware implementation does not need an explicit FSM.

To summarize, in steady state this hardware pipeline receives a new pixel every cycle and outputs an edge value every cycle, along with an output valid bit. The first valid output pixel appears after 521 clock cycles, after which an edge value is output on every cycle for the next 262,144 cycles (512 × 512). The first 513 valid edge output values are suppressed to zero because we are still on the border of the image. Therefore, the custom RTL circuit has a total cycle count of 521 + 262,144 = 262,665, which is 0.2% worse than optimal, where optimal means finishing right after the last pixel (after 262,144 cycles). The latency of 521 cycles is the time spent filling the line buffers before computation can begin.

7.4 LegUp Implementation

We now describe the LegUp synthesized circuit, starting from the original C code in Figure 7.2. By default, compiler optimizations built into LLVM will automatically unroll the innermost 3×3 loop (lines 7–13) and constant-propagate the gradient values from Table 7.1 (lines 10–11).

1  // line buffer shift registers
2  unsigned char prev_row[WIDTH] = { 0 };
3  int prev_row_index = 0;
4  unsigned char prev_prev_row[WIDTH] = { 0 };
5  int prev_prev_row_index = 0;
6
7  // stencil buffer:
8  //   stencil[0][0], stencil[0][1], stencil[0][2],
9  //   stencil[1][0], stencil[1][1], stencil[1][2],
10 //   stencil[2][0], stencil[2][1], stencil[2][2]
11 unsigned char stencil[3][3] = { 0 };
12
13 inline void receive_new_pixel(unsigned char pixel) {
14
15     // shift existing stencil to the left by one
16     stencil[0][0] = stencil[0][1]; stencil[0][1] = stencil[0][2];
17     stencil[1][0] = stencil[1][1]; stencil[1][1] = stencil[1][2];
18     stencil[2][0] = stencil[2][1]; stencil[2][1] = stencil[2][2];
19
20     int prev_row_elem = prev_row[prev_row_index];
21
22     // grab next column (the rightmost column of the sliding stencil)
23     stencil[0][2] = prev_prev_row[prev_prev_row_index];
24     stencil[1][2] = prev_row_elem;
25     stencil[2][2] = pixel;
26
27     // shift in new pixel
28     prev_prev_row[prev_prev_row_index] = prev_row_elem;
29     prev_row[prev_row_index] = pixel;
30
31     // adjust shift register indices
32     prev_row_index++;
33     prev_prev_row_index++;
34
35     prev_row_index = (prev_row_index == WIDTH) ? 0 : prev_row_index;
36     prev_prev_row_index = (prev_prev_row_index == WIDTH) ? 0 : prev_prev_row_index;
37 }

Figure 7.5: C code for the stencil buffer and line buffers synthesized with LegUp.

During constant propagation, the LLVM optimizations can detect the zero in the middle of each gradient mask, allowing us to ignore the middle pixel during the iteration. Consequently, there are eight loads from the input image during each outer loop iteration (lines 5–15), one for each pixel adjacent to the current pixel (line 9). The outer loop will iterate 262,144 (512 × 512) times. We have nine total memory operations in the loop: eight loads (line 9) and one store (line 15). We found that LegUp schedules the unmodified code into nine clock cycles per iteration, mainly because the shared memory has only two ports and a latency of two cycles. This circuit takes 2,866,207 cycles to complete.

The first transformation we can make is to use a stencil and two line buffers holding the previous two rows. The C code for this is given in Figure 7.5, with the stencil stored in a nine-element two-dimensional array on line 11. We shift the stencil after each new pixel arrives on lines 16–18, and shift new data into the stencil on lines 23–25. The two line buffers are implemented on lines 28–36 using the arrays prev_prev_row and prev_row. We have to manually keep track of an index indicating where to shift data into and out of each array, with the index rolling over to zero upon reaching the end of the array (lines 35–36). We can now calculate an edge value using only the stencil buffer, as shown in Figure 7.4, without reading from memory eight times every loop iteration. We also enable local memories, so that we are not constrained by the global memory controller ports.

1  int sobel_opt(
2      unsigned char input_image[HEIGHT][WIDTH],
3      unsigned volatile char output_image[HEIGHT][WIDTH])
4  {
5      int i, errors = 0, x_offset = -1, y_offset = -1, start = 0;
6      unsigned char pixel, edge_weight;
7      unsigned char *input_image_ptr = (unsigned char *)input_image;
8      unsigned char *output_image_ptr = (unsigned char *)output_image;
9
10     loop: for (i = 0; i < (HEIGHT)*(WIDTH); i++) {
11         pixel = *input_image_ptr++;
12
13         receive_new_pixel(pixel);
14
15         edge_weight = calculate_edge();
16
17         // we only want to start calculating the value when
18         // the shift registers are full and the window is valid
19         int check = (i == 512*2+2);
20         x_offset = (check) ? 1 : x_offset;
21         y_offset = (check) ? 1 : y_offset;
22         start = (!start) ? check : start;
23         int border = not_in_bounds(x_offset, y_offset) + !start;
24
25         output_image[y_offset][x_offset] = (border) ? 0 : edge_weight;
26
27         // error checking
28         int incorrect = errors + (edge_weight != golden[y_offset][x_offset]);
29         errors = (border) ? errors : incorrect;
30
31         x_offset++;
32         y_offset = (x_offset == WIDTH-1) ? (y_offset + 1) : y_offset;
33         x_offset = (x_offset == WIDTH-1) ? -1 : x_offset;
34     }
35
36     return errors;
37 }

Figure 7.6: Optimized C code for the Sobel filter synthesized with LegUp.

Table 7.2: Experimental Results

Metric           Hand-RTL     LegUp        LegUp/Hand-RTL
FMax (MHz)       191.46       187.13       0.98
Cycles           262,665      262,156      1.00
Time (ms)        1.37         1.40         1.02
ALUTs            495          813          1.64
Registers        382          635          1.66
Memory (bits)    6,299,616    6,299,616    1.00

Next, we need to enable loop pipelining to overlap iterations of the outermost loop of the algorithm. First, we must merge the two outer loops into a single loop and add the label “loop”. We also change the array accesses to use pointer dereferencing, avoiding unnecessary index calculations. Second, we must manually remove any control flow in the loop body to allow loop pipelining, because automatic if-conversion is not yet implemented in LegUp; we do this by replacing if statements with the ternary operator, “? :” (illustrated in the sketch below). We show the new C code in Figure 7.6, where the incoming pixel is read on line 11, the stencil and line buffers are shifted on line 13, and the edge weight is calculated on line 15. We have added some new control variables: a check for when the stencil has been filled (line 19), a calculation of the output image x and y indices (lines 20–21 and lines 31–33), a flag for whether the output is now valid (line 22), and an additional check for whether we are on the image border (line 23). If we are on the border, we output a zero; otherwise, we output the edge weight (line 25). We moved error checking into the loop body to avoid an additional verification loop afterwards, which would take another 262,144 cycles (512 × 512); LegUp does not have any facility for allowing one loop to “stream” into a successive loop, so we instead manually fuse the loops together. In general, we would remove the error-checking logic after the circuit is verified, to avoid wasting silicon area.

There is only one load (line 11) and one store (line 25) in the loop body that go to the dual-ported shared global memory controller. Therefore, we can pipeline the transformed loop with an initiation interval of one. The circuit now finishes after 262,156 cycles, only 12 cycles worse than optimal, although we do assume that the output image is already initialized to zero for the very last row of the image (the bottom border). We also set the LegUp clock period scheduling constraint as low as possible, to ensure a better final circuit FMax.

Some of the C transformations we have just described would be unintuitive to software developers, particularly using the line buffers and stencil to reduce memory operations in the loop. The user would have to be familiar with the concept of a pipeline initiation interval and the strategy of reducing memory contention in the loop body. They must also rewrite all control flow in the loop to use the ternary operator.
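As a minimal illustration of the if-conversion rewrite, using the variable names from Figure 7.6 (the “before” form is our reconstruction, not code taken from the benchmark):

    /* Before: a branch in the loop body prevents LegUp from pipelining the loop. */
    if (border)
        output_image[y_offset][x_offset] = 0;
    else
        output_image[y_offset][x_offset] = edge_weight;

    /* After: the branch becomes a single predicated select (line 25 of Figure 7.6). */
    output_image[y_offset][x_offset] = (border) ? 0 : edge_weight;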

7.5 Experimental Study

We measured the results of the LegUp vs hand-RTL case study by targeting the Stratix IV [Stra 10] FPGA (EP4SGX530KH40C2) on Altera's DE4 board [DE4 10], using Quartus II 13.1SP2 to obtain area and FMax metrics. Quartus timing constraints were configured to optimize for the highest achievable clock frequency. The results are summarized in Table 7.2, with the custom hardware implementation shown side-by-side with the LegUp synthesized results; the final column gives the ratio LegUp/Hand-RTL.

We found that after performing manual code transformations, LegUp produced a circuit with a wall-clock time within 2% of the hand-written hardware implementation (as a check on the Time column, 262,665 cycles / 191.46 MHz ≈ 1.37 ms). However, the synthesized circuit area was larger, consuming 64% more ALUTs and 66% more registers.

We observed a few reasons for this increase in area. First, we use many unnecessary additional registers because the low LegUp clock period constraint causes scheduling to not chain any operations. We needed the clock period constraint to achieve an acceptable FMax, but this indicates that our timing analysis and estimation need improvement. Also, the pipeline produced by LegUp is needlessly complex and includes a lot of additional array indexing that did not exist in the custom implementation. LegUp also generates a standard FSM even though this application does not need one. Generally, the custom hardware implementation is very minimalistic and fits in a few short pages of Verilog (296 lines without comments). In contrast, LegUp's implementation is 2,238 lines and includes many unnecessary operations from the LLVM intermediate representation, such as sign extensions and memory indexing.

We expect that LegUp will perform well for hardware modules that are control-heavy and fairly sequential. LegUp will also perform well for pipelined hardware modules, as long as the user can express the pipelining in a single C loop; these include video, media, and networking applications. LegUp, and all HLS tools, will struggle with highly optimized hardware designs that have a known structure, such as a fast Fourier transform butterfly architecture. LegUp also cannot generate circuits that require exact cycle-accurate timing behaviour, such as a bus controller.

7.6 Summary

This chapter provided a case study comparing a LegUp-synthesized Sobel image filter to an equivalent hand-written version. We described the transformations that were performed on the input C code to synthesize a better final design in LegUp. We showed that LegUp synthesizes a circuit with a wall-clock time within 2% of the hand-designed circuit, but with about 65% more area. We hope this case study emphasizes the importance of coding style during HLS and motivates the need for better support of streaming applications in LegUp.

Chapter 8

Conclusions

8.1 Summary and Contributions

With the end of processor frequency scaling, and as Moore's law continues, we need a better approach to harness the increased number of transistors on a silicon chip. The industry has been moving towards heterogeneous computing, with custom hardware accelerators used to achieve higher performance. Field-programmable gate arrays (FPGAs) are a way to realize these accelerators, especially as FPGAs continue to grow in size, now including complete system-on-chips with hardened on-chip ARM processors. However, hardware design remains a difficult process, especially for software developers. We present a new design entry methodology that offers a higher level of abstraction, where the user incrementally moves their design from a processor to custom hardware. The hardware is automatically synthesized from software using high-level synthesis. By allowing designers to program in software, they escape the need to perform tedious cycle-accurate hardware design, and they can re-synthesize their design for future FPGA chips without redesigning the circuit datapath and control. This dissertation has contributed a robust, open-source high-level synthesis tool, LegUp, to the research community. Furthermore, we described FPGA-specific high-level synthesis optimizations and an HLS loop pipelining algorithm that improves on the state of the art:

Chapter 3 discussed our LegUp open-source HLS framework and our proposed design methodology. We consider the LegUp infrastructure itself to be a major contribution of this dissertation. LegUp is implemented using state-of-the-art HLS algorithms, with a large test suite that ensures correctness over a greater variety of C programs than in previous academic tools. We compared LegUp's performance to a commercial HLS tool across a suite of benchmark circuits and found that geomean wall-clock time was 18% faster, while having 16% higher geomean area. The LegUp software has been downloaded by 1200 unique researchers from all over the world and is available online at: legup.eecg.utoronto.ca. To the author's knowledge, this represents the first open-source HLS tool ever published targeting FPGAs with comprehensive coverage of the C language and a hybrid processor/accelerator architecture. This work has been published in [Canis 11, Canis 12, Canis 13b]. LegUp has been used for recent HLS research contributions in debugging [Calag 14, Goede 14], circuit area minimization [Gort 13, Klimo 13, Hadji 12a], compiler optimizations [Huang 13, Huang 14], performance optimizations [Hadji 12b], parallel programming [Choi 13], cache architecture [Choi 12a], and even for hardware design contests [Cai 13]. The LegUp project received the Community Award at FPL 2014 for contributions to open-source high-level synthesis.



Chapter 4 presented a new FPGA architecture-specific enhancement to high-level synthesis, where we multi-pump functional units that can run at higher clock speeds than the surrounding logic to facilitate additional resource sharing. Our method was shown to be particularly effective for the ASIC-like DSP blocks on modern FPGAs. We showed that multi-pumping achieves the same DSP reduction as previous resource sharing approaches, but with better circuit performance: decreasing circuit speed by only 5% instead of 80%, across a suite of digital signal processing benchmarks. This work has been published in [Canis 13a].

Chapter 5 described a novel HLS loop pipelining scheduling algorithm using the SDC mathematical framework. Our approach improves upon prior work by providing better handling of scheduling constraints using a backtracking mechanism, which can achieve better pipeline throughput. We also described a method for restructuring associative expressions within loops to reduce recurrence constraints that can hurt pipeline throughput. We compared our approach to prior work and a commercial HLS tool and we found a geomean wall-clock time improvement of 32% and 29% respectively, across a suite of benchmark circuits. This work has been published in [Canis 14].

Chapter 6 discussed LegUp's synthesized on-chip memory architecture. We described global memory, which is accessed using a shared memory controller to support arbitrary pointers in the C input program. We discussed how LegUp partitions program memory into local and global physical on-chip RAMs by using static points-to analysis techniques. These local memories improve circuit performance compared to using only global memories. We also explored grouping program memories with compatible bitwidths into a shared physical on-chip RAM to better utilize the dedicated RAM blocks available on the FPGA device (M9K blocks on Stratix IV). We applied these approaches and showed a reduction in the geomean memory implementation bits by 37%, and a decrease in geomean wall-clock time by 12%, across the CHStone benchmark suite. This work has been published in [Fort 14].

Chapter 7 presented a case study to measure the gap between HLS-generated and hand-written circuit implementations using a Sobel image filter kernel from the computer vision domain. We compared a hand-written streaming hardware implementation to a circuit synthesized with LegUp. We found that after performing manual code transformations, LegUp produced a circuit with a wall-clock execution time within 2% of the hand-coded RTL. However, the synthesized circuit area was larger, consuming 64% more ALUTs and 66% more registers.

8.2 Future Work

Ever since our first release in 2011, LegUp has been designed to enable other academics to explore future HLS research directions. There remain several active research areas that merit further exploration. In this section, we discuss extensions to the work presented in this dissertation and suggest improvements specifically for the LegUp framework. We then suggest other promising areas that were not covered in the preceding chapters.

8.2.1 Extensions of this Research Work

Our experimental study in Chapter 4 focused on area savings using multi-pumping. We could also investigate the impact on circuit power and energy, which will depend on the energy consumption of the DSP blocks operating at a higher clock speed. Future work could also investigate multi-pumping as a general sharing technique for other types of FPGA functional units. For example, multi-pumping FPGA block RAMs would offer us more memory ports, as described in [Choi 12a]. We could also extend this work to multi-pump the new hardened floating point units in Stratix 10 [Stra 14]. Another idea is to use multi-pumping to improve circuit performance and throughput, particularly for loop pipelining, instead of focusing on resource sharing. Finally, applying our approach to slower circuits would allow quad-pumping the DSPs (with a 4× clock) to achieve even more area savings.

As an extension to the loop pipelining scheduler in Chapter 5, we could study the impact of loop unrolling when combined with loop pipelining. Loop unrolling has the effect of duplicating the pipeline, increasing circuit area while also improving throughput, which would be an interesting trade-off to explore. Additionally, cross-iteration dependencies involving floating point operations can greatly increase the pipeline initiation interval, because these operations are typically heavily pipelined to achieve acceptable clock frequencies. In LegUp, floating point addition/subtraction functional units are pipelined to 14 cycles by default. Currently, the functional unit latencies are hard-coded in the allocation step of LegUp; however, the work in [Ben A 08] investigated a variable pipeline scheduler that determines the appropriate number of pipeline stages for each functional unit during scheduling. Future work could involve extending our algorithm to detect critical recurrences in loop pipelines and attempt to lower the latency of functional units along the recurrence. We could likewise modify the latency of global memory load operations, which take two cycles in LegUp.

The memory optimizations in Chapter 6 could be extended to include better context- and flow-sensitive points-to analysis techniques [Zhu 04] to identify local memories more accurately in the program. We could also add support for semi-local arrays by instantiating distributed memory controllers throughout the hardware hierarchy. Semi-local arrays occur whenever pointers point to multiple arrays, or when arrays are shared between multiple functions. We could also handle the connections between each memory access and the associated physical RAM using an interconnect generator like Altera's Qsys [Qsys 14]. We should also investigate whether LegUp is better off with a flat module hierarchy instead of the tree hierarchy shown in Figure 6.7. In a flat hierarchy, we can instantiate every module once regardless of the program call graph, which saves area. Modules would be connected together at the top level, with two modules connected only if the corresponding C functions call each other.

Another avenue for future work is memory partitioning, which involves splitting arrays into registers or smaller block RAMs. We could use memory access patterns to automatically split arrays into distinct physical RAMs, which is particularly effective for achieving greater parallelism during loop pipelining. Furthermore, we could explore the reverse: storing registers in RAMs. Currently, the shared global memory controller is dual ported,
but we could investigate increasing the number of ports to allow greater memory bandwidth. The HLS scheduler would be responsible for ensuring that we never perform more memory accesses to the same global memory block than it has ports. Another idea is to assign a global memory block to only one port of the memory controller if the memory is only ever accessed once per cycle, which would reduce output multiplexing in the controller.

For grouping memory, we could investigate sharing arrays with different bitwidths in the same RAM. This would require additional steering logic in the memory controller. Also, the arrays with smaller bitwidths would waste space when placed in the wider shared RAM. We could also try matching the size of our physical RAMs to the size of the available FPGA block RAMs. Alternatively, for a small array we may forgo memory entirely and store the array elements in separate registers. We could also explore storing variables from mutually exclusive program scopes or mutually exclusive program execution in the same physical memory using lifetime analysis as discussed in [Zhu 01]. Our case study comparing HLS to hand-written RTL in Chapter 7 could be extended to include other benchmarks. In fact, we believe the research community would find value in having a new benchmark suite that contains two sets of equivalent designs: a reference in C code and a hand-written hardware implementation. This could help researchers focus on closing the gap between HLS and custom design. Another area for future work is investigating if a LegUp pass could detect these types of image filter memory dependencies and instantiate the line buffers (Figure 7.3) automatically. Wang [Wang 14] has investigated this memory partitioning problem by using the polyhedral model to represent the iteration space of a loop nest and the associated memory accesses and then inferring memory banks for parallel memory accesses. For more complex loop nests we may require the user to profile the memory access patterns in advance. We should also add support in LegUp for streaming-style C code using Pthreads to allow us to rewrite this code in a form more understandable to software developers.

8.2.2 Improvements to LegUp

The long-term vision for LegUp is to fully automate the flow in Figure 3.1, thereby creating a self-accelerating adaptive processor that profiles running applications and automatically synthesizes critical code regions into hardware, improving performance without user intervention. Self-acceleration would require on-the-fly FPGA synthesis, place, and route of the generated hardware accelerators, which can take minutes or hours for larger circuits; this flow would therefore be most suitable for long-running applications.

In the hardware synthesized by LegUp, we still rely on Altera-specific hardware primitives, such as floating point cores, dividers, and the Avalon bus. We should move towards using the popular AMBA AXI (Advanced eXtensible Interface) bus defined by ARM [AMBA 03]. We could also add support for custom floating point units generated using FloPoCo [De Di 11]. Making these changes would enable us to support Xilinx FPGAs, which is by far the most requested LegUp feature.

In the future, LegUp should support a hybrid flow that includes an x86 processor connected over the PCIe bus to LegUp-synthesized hardware accelerators implemented on an FPGA. We have a prototype of this flow working, but we need more testing to make it robust. This could allow experiments comparing FPGA hardware accelerators with the performance of commodity GPU cards (typically programmed with CUDA) for high-performance computing workloads. We could also investigate supporting other language extensions, such as OpenCL [Openc 09], to allow the user to express further parallelism.

LegUp still needs better support for off-chip memory. Currently, we support off-chip memory only in the hybrid flow, through the shared processor cache. In the pure hardware flow, we should instead be able to read and write directly to off-chip DDR3 RAM, possibly buffering the results in a FIFO. We could also investigate how to better support streaming applications in LegUp, for instance by inferring common hardware idioms like line buffers. A limitation of the CHStone benchmarks is that they do not offer much opportunity for parallelism. We could create a new benchmark suite that focuses on designs that can be parallelized, with streaming examples that can be pipelined, and applications that use explicit parallelism with Pthreads and OpenMP.

There are many other smaller improvements that could be made to LegUp, such as: support for fixed-point integer arithmetic, user-specified variable bitwidths, and better timing and area estimation.

8.2.3 Additional High-Level Synthesis Research Directions

We believe that debugging in HLS is an important topic that remains an open research question. We still do not know the best way for a user to debug a synthesized hardware design, especially an SoC with parts of the code running on the processor. Visualizing the hardware circuit in an intuitive way is especially important for winning over software developers.

Another active area of research is loop transformations in HLS, using the polyhedral model [Basto 04]. We could start by using the Polly polyhedral LLVM framework [Gross 11] to provide more detailed information about cross-iteration loop dependencies, such as dependence distances based on array indices. Using this framework, we could investigate more complex loop dependencies and perform loop transformations, such as loop fusion, loop interchange, and loop skewing, that can expose further loop parallelism. These transformations can be applied during loop pipelining to improve parallelism and memory access patterns [Pouch 13], which can greatly increase circuit performance. We believe this is fertile ground for new research.

Another relevant area to explore is power and energy optimization in HLS. We could explore energy-driven scheduling and FSM generation to minimize toggle rates, or we could work on power-aware binding approaches that consider operator power characteristics instead of area.

The processor and accelerators in our target architecture currently share a single clock signal. We could also investigate the benefits of using multiple clock domains, where each processor and accelerator can operate at its maximum speed and communication between modules crosses clock domains.

8.3 Closing Remarks

In summary, we believe that LegUp offers a platform for researchers to continue pressing forward with high-level synthesis progress. High-level synthesis targeting FPGAs will continue to be an active research area in the years to come. The optimizations described in this dissertation improve the target circuit architecture synthesized by HLS tools. However, further improvements, particularly in the areas of pipelining and automatic parallelization, may be needed for widespread adoption by hardware designers targeting FPGAs. We are optimistic that the advantages of HLS will continue to win over hardware designers and raise the productivity of our industry. We hope that researchers will continue to improve HLS until we reach the holy grail: synthesizing a circuit from software that is just as good as (or better than) a hand-designed implementation.

References

[Abdou 79] I. E. Abdou and W. Pratt. “Quantitative design and evaluation of enhancement/thresholding edge detectors”. Proceedings of the IEEE, Vol. 67, No. 5, pp. 753–763, 1979.

[Adam 74] T. L. Adam, K. M. Chandy, and J. Dickson. “A comparison of list schedules for parallel processing systems”. Communications of the ACM, Vol. 17, No. 12, pp. 685–690, 1974.

[Aldha 11a] M. Aldham, J. Anderson, S. Brown, and A. Canis. “Low-Cost Hardware Profiling of Run-Time and Energy in FPGA Embedded Processors”. In: IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), Santa Monica, CA, 2011.

[Aldha 11b] M. Aldham. Low-Cost Hardware Profiling of Run-Time and Energy in FPGA Soft Processors. PhD thesis, 2011.

[Alte 13] Altera 2012 Annual Report (Form 10-K). http://www.altera.com, 2013.

[AMBA 03] ARM. “AMBA Protocol Specification”. June 2003.

[Ander 94] L. O. Andersen. Program analysis and specialization for the C programming language. PhD thesis, University of Copenhagen, 1994.

[ARM 14] ARM Benchmark Results. http://legup.eecg.utoronto.ca/wiki/doku.php?id=arm_chstone_benchmark_results, 2014.

[ARM 11] ARM. “Cortex-A9 Processor”. http://www.arm.com/products/processors/cortex-a/cortex-a9.php, 2011.

[Aubur 96] M. Aubury, I. Page, G. Randall, J. Saul, and R. Watts. “Handel-C language reference guide”. Computing Laboratory. Oxford University, UK, 1996.

[Auto] AutoESL Design Technologies, Inc. http://www.autoesl.com.

[Aval 10] Avalon Interface Specification. Altera, Corp., San Jose, CA, 2010.

[Basto 04] C. Bastoul, A. Cohen, S. Girbal, S. Sharma, and O. Temam. “Putting polyhedral loop transformations to work”. In: Languages and Compilers for Parallel Computing, pp. 209–225, Springer, 2004.

[BDTI] BDTI Certified Results for the AutoESL AutoPilot High-Level Synthesis Tool. http://www.bdti.com/Resources/BenchmarkResults/HLSTCP/AutoPilot.


[Beida 05] R. Beidas and J. Zhu. “Scalable Interprocedural Register Allocation for High Level Synthesis”. In: Proceedings of the 2005 Asia and South Pacific Design Automation Conference, pp. 511–516, ACM, New York, NY, USA, 2005.

[Ben A 08] Y. Ben-Asher and N. Rotem. “Synthesis for Variable Pipelined Function Units”. In: IEEE International Symposium on System-on-Chip, 2008.

[Betz 97] V. Betz and J. Rose. “VPR: A New Packing, Placement and Routing Tool for FPGA Research”. In: Int’l Workshop on Field Programmable Logic and Applications, pp. 213–222, 1997.

[Betz 99] V. Betz, J. Rose, and A. Marquardt. Architecture and CAD for deep-submicron FPGAs. Kluwer Academic Publishers, 1999.

[Blue] Bluespec: The Synthesizable Modeling Company. http://www.bluespec.com.

[Borka 11] S. Borkar and A. A. Chien. “The Future of Microprocessors”. Commun. ACM, Vol. 54, No. 5, pp. 67–77, May 2011.

[Brodt 10] A. R. Brodtkorb, C. Dyken, T. R. Hagen, J. M. Hjelmervik, and O. O. Storaasli. “State-of-the-art in heterogeneous computing”. Scientific Programming, Vol. 18, No. 1, pp. 1–33, 2010.

[Buttl 96] D. Buttlar and J. Farrell. Pthreads programming: A POSIX standard for better multiprocessing. O’Reilly Media, Inc., 1996.

[Cade] Cadence C-to-Silicon Compiler. http://www.cadence.com/products/sd/silicon_compiler.

[Cai 13] J. C. Cai, R. Lian, M. Wang, A. Canis, J. Choi, B. Fort, E. Hart, E. Miao, Y. Zhang, N. Calagar, et al. “From C to Blokus Duo with LegUp high-level synthesis”. In: Field-Programmable Technology (FPT), 2013 International Conference on, pp. 486–489, IEEE, 2013.

[Calag 14] N. Calagar, S. D. Brown, and J. H. Anderson. “Source-level debugging for FPGA high-level synthesis”. In: Field Programmable Logic and Applications (FPL), 2014 24th International Conference on, pp. 1–8, IEEE, 2014.

[Caly] Calypto Catapult. http://calypto.com/en/products/catapult/overview.

[Campo 91] R. Camposano. “Path-based scheduling for synthesis”. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 10, No. 1, pp. 85–93, Jan 1991.

[Canis 11] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. Anderson, S. Brown, and T. Czajkowski. “LegUp: high-level synthesis for FPGA-based processor/accelerator systems”. In: ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 33–36, 2011.

[Canis 12] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, T. Czajkowski, S. D. Brown, and J. H. Anderson. “LegUp: An Open Source High-Level Synthesis Tool for FPGA-Based Processor/Accelerator Systems”. ACM Transactions on Embedded Computing Systems (TECS), 2012.

[Canis 13a] A. Canis, J. H. Anderson, and S. D. Brown. “Multi-Pumping for Resource Reduction in FPGA High-Level Synthesis”. In: IEEE Design Automation and Test in Europe Conference (DATE), Grenoble, France, 2013.

[Canis 13b] A. Canis, J. Choi, B. Fort, R. Lian, Q. Huang, N. Calagar, M. Gort, J. J. Qin, M. Aldham, T. Czajkowski, S. Brown, and J. Anderson. “From Software to Accelerators with LegUp High-level Synthesis”. In: Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), pp. 18:1–18:9, IEEE Press, Piscataway, NJ, USA, 2013.

[Canis 14] A. Canis, S. Brown, and J. Anderson. “Modulo SDC Scheduling with Recurrence Minimization in High-Level Synthesis”. In: International Conference on Field-Programmable Logic and Applications, 2014.

[Canny 86] J. Canny. “A computational approach to edge detection”. Pattern Analysis and Machine Intelligence, IEEE Transactions on, No. 6, pp. 679–698, 1986.

[Ceba] CebaTech: The software to silicon company. http://www.cebatech.com.

[Chait 81] G. J. Chaitin, M. A. Auslander, A. K. Chandra, J. Cocke, M. E. Hopkins, and P. W. Markstein. “Register allocation via coloring”. Computer languages, Vol. 6, No. 1, pp. 47–57, 1981.

[Chen 04] D. Chen and J. Cong. “Register Binding and Port Assignment for Multiplexer Optimization”. In: IEEE/ACM Asia and South Pacific Design Automation Conference, pp. 68–73, 2004.

[Choi 12a] J. Choi, K. Nam, A. Canis, J. Anderson, S. Brown, and T. Czajkowski. “Impact of Cache Architecture and Interface on Performance and Area of FPGA-Based Processor/Parallel-Accelerator Systems”. In: IEEE Symposium on Field-Programmable Custom Computing Machines, 2012.

[Choi 12b] J. Choi. Enabling Hardware/Software Co-design in High-level Synthesis. PhD thesis, University of Toronto, 2012.

[Choi 13] J. Choi, S. Brown, and J. Anderson. “From software threads to parallel hardware in high-level synthesis for FPGAs”. In: Field-Programmable Technology (FPT), 2013 International Conference on, pp. 270–277, IEEE, 2013.

[Cisc 14] Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2013–2018. http://www.cisco.com, Feb. 2014.

[Codre 14] L. Codrescu, W. Anderson, S. Venkumanhanti, M. Zeng, E. Plondke, C. Koob, A. Ingle, C. Tabony, and R. Maule. “Hexagon DSP: An Architecture Optimized for Mobile Multimedia and Communications”. Micro, IEEE, Vol. 34, No. 2, pp. 34–43, Mar 2014.

[Cong 06a] J. Cong, Y. Fan, G. Han, W. Jiang, and Z. Zhang. “Platform-Based Behavior-Level and System-Level Synthesis”. In: IEEE Int’l System-on-Chip Conference, pp. 199–202, 2006.

[Cong 06b] J. Cong and Z. Zhang. “An efficient and versatile scheduling algorithm based on SDC formulation”. In: IEEE/ACM Design Automation Conference, pp. 433–438, 2006.

[Cong 06c] J. Cong, Y. Fan, and W. Jiang. “Platform-Based Resource Binding Using a Distributed Register-File Microarchitecture”. San Jose, CA, 2006.

[Cong 08] J. Cong and J. Wei. “Pattern-based behavior synthesis for FPGA resource reduction”. In: Int’l ACM/SIGDA symposium on Field programmable gate arrays, pp. 107–16, 2008.

[Cong 09] J. Cong and Y. Zou. “FPGA-Based Hardware Acceleration of Lithographic Aerial Image Simulation”. ACM Trans. Reconfigurable Technol. Syst., Vol. 2, No. 3, pp. 1–29, 2009.

[Cong 10] J. Cong, B. Liu, and J. Xu. “Coordinated Resource Optimization in Behavioral Synthesis”. In: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 1267– 1272, 2010.

[Cong 11] J. Cong, B. L., S. Neuendorffer, J. Noguera, K. Vissers, and Z. Z. “High-Level Synthesis for FPGAs: From Prototyping to Deployment”. IEEE Tran. on Computer-Aided Design of Integrated Circuits and Systems, Vol. 30, No. 4, pp. 473–491, April 2011.

[Cong 12] J. Cong, P. Zhang, and Y. Zou. “Optimizing memory hierarchy allocation with loop trans- formations for high-level synthesis”. In: Proceedings of the 49th Annual Design Automation Conference, pp. 1233–1238, ACM, 2012.

[Coolea] J. Cooley. “750 engineer survey on HLS verification issues & power reduction”.

[Cooleb] J. Cooley. “My Cheesy Must See List for DAC 2014”.

[Couss 09] P. Coussy, D. Gajski, M. Meredith, and A. Takach. “An Introduction to High-Level Syn- thesis”. IEEE Design Test of Computers, Vol. 26, No. 4, pp. 8 – 17, jul. 2009.

[Couss 10] P. Coussy, G. Lhairech-Lebreton, D. Heller, and E. Martin. “GAUT A Free and Open Source High-Level Synthesis Tool”. In: IEEE Design Automation and Test in Europe – University Booth, 2010.

[Crave 07] S. Craven and P. Athanas. “Examining the viability of FPGA supercomputing”. EURASIP Journal on Embedded systems, Vol. 2007, No. 1, pp. 13–13, 2007.

[CUDA 07] CUDA: Compute Unified Device Architecture Programming Guide. NVIDIA CORPORA- TION, 2007.

[Cycl 04] Cyclone-II Data Sheet. Altera, Corp., San Jose, CA, 2004.

[DDR3 08] DDR3 SDRAM Standard (JESD 79-3B). JEDEC Solid State Technology Assoc., 2008.

[De Di 11] F. De Dinechin and B. Pasca. “Designing custom arithmetic data paths with FloPoCo”. IEEE Design & Test of Computers, Vol. 28, No. 4, pp. 0018–27, 2011.

[DE1 13] DE1-SoC Development and Education Board. Altera, Corp., San Jose, CA, 2013.

[DE2 10a] DE2-115 Development Board. Altera, Corp., San Jose, CA, 2010. References 118

[DE2 10b] DE2 Development and Education Board. Altera, Corp., San Jose, CA, 2010.

[DE4 10] DE4 Development Board. Altera, Corp., San Jose, CA, 2010.

[DE5 13] DE5-Net Development Board. Terasic, 2013.

[Denna 07] R. H. Dennard, J. Cai, and A. Kumar. “A perspective on todays scaling challenges and possible future directions”. Solid-State Electronics, Vol. 51, No. 4, pp. 518 – 525, 2007.

[Denna 74] R. Dennard, F. Gaensslen, V. Rideout, E. Bassous, and A. LeBlanc. “Design of ion- implanted MOSFET’s with very small physical dimensions”. Solid-State Circuits, IEEE Journal of, Vol. 9, No. 5, pp. 256–268, Oct 1974.

[Docu 11] Documentation: SOPC Builder. http://www.altera.com/literature/lit-sop.jsp, 2011.

[Ellsw 04] M. Ellsworth. “Chip power density and module cooling technology projections for the current decade”. In: Thermal and Thermomechanical Phenomena in Electronic Systems, 2004. ITHERM ’04. The Ninth Intersociety Conference on, pp. 707–708 Vol.2, June 2004.

[Esmae 13] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. “Power Chal- lenges May End the Multicore Era”. Commun. ACM, Vol. 56, No. 2, pp. 93–102, Feb. 2013.

[eXCi 10] eXCite C to RTL Behavioral Synthesis 4.1(a). Y Explorations (XYI), San Jose, CA, 2010.

[Finge 10] M. Fingeroff. High-level synthesis blue book. Xlibris Corporation, 2010.

[Fort] Forte Design Systems The high-level design company. http://www.forteds.com/products/cynthesizer.asp.

[Fort 14] B. Fort, A. Canis, J. Choi, N. Calagar, R. Lian, S. Hadjis, Y. Chen, M. Hall, B. Syrowik, C. T., S. Brown, and J. Anderson. “Automating the Design of Processor/Accelerator Embedded Systems with LegUp High-Level Synthesis”. In: IEEE Int’l Conference on Embedded and Ubiquitous Computing (EUC), Milan, Italy, August 2014.

[Fritt] J. E. Fritts, F. W. Steiling, and J. A. Tucek. “MediaBench II Video: Expediting the next generation of video systems research”. In: Electronic Imaging 2005, Int’l Society for Optics and Photonics.

[Fu 11] H. Fu and R. G. Clapp. “Eliminating the Memory Bottleneck: An FPGA-based Solution for 3D Reverse Time Migration”. In: Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 65–74, ACM, New York, NY, USA, 2011.

[Gajsk 00] D. D. Gajski, J. Zhu, R. Domer, A. Gerstlauer, and S. Zhao. “SPECC: Specification Language and Methodology”. 2000.

[Gajsk 92] D. Gajski and et. al. Editors. High-Level Synthesis - Introduction to Chip and System Design. Kulwer Academic Publishers, 1992. References 119

[Goede 14] J. Goeders and S. J. Wilton. “Effective FPGA debug for high-level synthesis generated circuits”. In: Field Programmable Logic and Applications (FPL), 2014 24th International Conference on, pp. 1–8, IEEE, 2014.

[Gort 13] M. Gort and J. H. Anderson. “Range and Bitmask Analysis for Hardware Optimization in High-Level Synthesis”. In: Asia and South Pacific Design Automation Conference (ASP-DAC), Yokohama, Japan, 2013.

[Gross 11] T. Grosser, H. Zheng, R. A, A. Simburger, A. Grosslinger, and L.-N. Pouchet. “Polly - Polyhedral optimization in LLVM”. In: First International Workshop on Polyhedral Compilation Techniques (IMPACT’11), Chamonix, France, Apr. 2011.

[Gupta 03] S. Gupta, N. Dutt, R. Gupta, and A. Nicolau. “SPARK: A High-Level Synthesis Framework For Applying Parallelizing Compiler Transformations”. In: Proc. Int. Conf. on VLSI Design, 2003.

[Hadji 12a] S. Hadjis, A. Canis, J. Anderson, J. Choi, K. Nam, S. Brown, and T. Czajkowski. “Impact of FPGA Architecture on Resource Sharing in High-Level Synthesis”. In: ACM/SIGDA Int’l Symp. on Field Programmable Gate Arrays, pp. 111–114, 2012.

[Hadji 12b] S. Hadjis, A. Canis, R. Sobue, Y. Hara-Azumi, H. Tomiyama, and J. Anderson. “Profiling- driven multi-cycling in FPGA high-level synthesis”. ACM/IEEE Design Automation and Test in Europe Conference (DATE), 2012.

[Hagog 04] M. Hagog and A. Zaks. “Swing Modulo Scheduling for GCC”. In: Proc. GCC Developers Summit, pp. 55–64, 2004.

[Hara 12] Y. Hara-Azumi and H. Tomiyama. “Clock-constrained simultaneous allocation and binding for multiplexer optimization in high-level synthesis”. In: Asia and South Pacific Design Automation Conference, pp. 251–256, 2012.

[Hara 09] Y. Hara, H. Tomiyama, S. Honda, and H. Takada. “Proposal and Quantitative Analysis of the CHStone Benchmark Program Suite for Practical C-based High-level Synthesis”. Journal of Information Processing, Vol. 17, No. , pp. 242–254, 2009.

[Harde 07] B. Hardekopf and C. Lin. “The ant and the grasshopper: fast and accurate pointer analysis for millions of lines of code”. In: ACM SIGPLAN Notices, pp. 290–299, ACM, 2007.

[Hawic 08] K. A. Hawick and H. A. James. “Enumerating Circuits and Loops in Graphs with Self-Arcs and Multiple-Arcs.”. In: FCS, pp. 14–20, 2008.

[Heine 13] A. Heinecke, K. Vaidyanathan, M. Smelyanskiy, A. Kobotov, R. Dubtsov, G. Henry, A. G. Shet, G. Chrysos, and P. Dubey. “Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel R Xeon Phi Coprocessor”. In: Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pp. 126– 137, IEEE, 2013.

[Hind 00] M. Hind and A. Pioli. “Which Pointer Analysis Should I Use?”. SIGSOFT Softw. Eng. Notes, Vol. 25, No. 5, pp. 113–123, Aug. 2000. References 120

[Hind 01] M. Hind. “Pointer Analysis: Haven’t We Solved This Problem Yet?”. In: Proceedings of the 2001 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, pp. 54–61, ACM, New York, NY, USA, 2001.

[Hisam 00] D. Hisamoto, W.-C. Lee, J. Kedzierski, H. Takeuchi, K. Asano, C. Kuo, E. Anderson, T.-J. King, J. Bokor, and C. Hu. “FinFET-a self-aligned double-gate MOSFET scalable to 20 nm”. Electron Devices, IEEE Transactions on, Vol. 47, No. 12, pp. 2320–2325, 2000.

[Huang 08] S. Huang, A. Hormati, D. Bacon, and R. Rabbah. “Liquid Metal: Object-Oriented Pro- gramming Across the Hardware/Software Boundary”. In: 22nd European conference on Object-Oriented Programming, pp. 76–103, 2008.

[Huang 13] Q. Huang, R. Lian, A. Canis, J. Choi, R. Xi, S. Brown, and J. Anderson. “The effect of compiler optimizations on high-level synthesis for FPGAs”. In: IEEE Int’l Symposium on Field-Programmable Custom Computing Machines (FCCM), Seattle, WA, 2013.

[Huang 14] Q. Huang, R. Lian, A. Canis, J. Choi, R. Xi, S. Brown, and J. Anderson. “The effect of compiler optimizations on high-level synthesis-generated hardware”. ACM Transactions on Recongurable Technology and Systems (TRETS), Apr. 2014.

[Huang 90] C. Huang, Y. Che, Y. Lin, and Y. Hsu. “Data Path Allocation Based on Bipartite Weighted Matching”. In: Design Automation Conference, pp. 499–504, 1990.

[Hwang 91] C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu. “A formal approach to the scheduling problem in high level synthesis”. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 10, No. 4, pp. 464–475, 1991.

[Impu] Impulse CoDeveloper – Impulse accelerated technologies. http://www.impulseaccelerated.com.

[Inte 13] Interactive Presentation on Key Trend For Advanced Technologies and Role of SOI. Inter- national Business Strategies, Inc., october 2013.

[Iqbal 93] Z. Iqbal, M. Potkonjak, S. Dey, and A. Parker. “Critical path minimization using retiming and algebraic speed-up”. In: DAC, 1993.

[Jasro 04] K. Jasrotia and J. Zhu. “Stacked FSMD: a power efficient micro-architecture for high level synthesis”. In: Quality Electronic Design, 2004. Proceedings. 5th International Symposium on, pp. 425–430, 2004.

[Jiang 08] W. Jiang, Z. Zhang, M. Potkonjak, and J. Cong. “Scheduling with Integer Time Budgeting for Low-power Optimization”. In: Proceedings of the 2008 Asia and South Pacific Design Automation Conference, pp. 22–27, IEEE Computer Society Press, Los Alamitos, CA, USA, 2008.

[Jones 91] R. B. Jones and V. H. Allan. “Software Pipelining: An Evaluation of Enhanced Pipelining”. In: Proceedings of the 24th Annual International Symposium on Microarchitecture, pp. 82– 92, ACM, New York, NY, USA, 1991. References 121

[Klimo 13] A. Klimovic and J. H. Anderson. “Bitwidth-optimized hardware accelerators with software fallback”. In: Field-Programmable Technology (FPT), 2013 International Conference on, pp. 136–143, IEEE, 2013.

[Ku 88] D. C. Ku and G. De Micheli. “Hardware C-a language for hardware design”. Tech. Rep., DTIC Document, 1988.

[Kuhn 10] H. Kuhn. “The Hungarian Method for the Assignment Problem”. In: 50 Years of Integer Programming 1958-2008, pp. 29–47, Springer, 2010.

[Lam 06] M. Lam, R. Sethi, J. Ullman, and A. Aho. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 2006.

[Lam 88] M. Lam. “Software pipelining: An effective scheduling technique for VLIW machines”. In: ACM Sigplan Notices, pp. 318–328, ACM, 1988.

[Landi 92] W. Landi. “Undecidability of Static Analysis”. ACM Lett. Program. Lang. Syst., Vol. 1, No. 4, pp. 323–337, Dec. 1992.

[Lattn 04] C. Lattner and V. Adve. “LLVM: A compilation framework for lifelong program analysis & transformation”. In: IEEE CGO, pp. 75–86, http://www.llvm.org, 2004.

[Lempe 11] O. Lempel. “2nd generation intel core processor family: Intel core i7, i5 and i3”. In: Hot Chips, 2011.

[List 14] List of semiconductor fabrication plants. http://en.wikipedia.org/wiki/ List of semiconductor fabrication plants, 2014.

[lpso 14] “lp solve LP Solver”. http://lpsolve.sourceforge.net/5.5/, 2014.

[Luu 09] J. Luu, K. Redmond, W. Lo, P. Chow, L. Lilge, and J. Rose. “FPGA-based Monte Carlo Computation of Light Absorption for Photodynamic Cancer Therapy”. In: IEEE Sympo- sium on Field-Programmable Custom Computing Machines, pp. 157–164, 2009.

[Luu 14a] J. Luu, J. Goeders, M. Wainberg, A. Somerville, T. Yu, K. Nasartschuk, M. Nasr, S. Wang, T. Liu, N. Ahmed, et al. “VTR 7.0: next generation architecture and CAD system for FPGAs”. ACM Transactions on Reconfigurable Technology and Systems (TRETS), Vol. 7, No. 2, p. 6, 2014.

[Luu 14b] J. Luu, J. Goeders, M. Wainberg, A. Somerville, T. Yu, K. Nasartschuk, M. Nasr, S. Wang, T. Liu, N. Ahmed, et al. “VTR 7.0: next generation architecture and CAD system for FPGAs”. ACM Transactions on Reconfigurable Technology and Systems (TRETS), Vol. 7, No. 2, p. 6, 2014.

[Mahlk 01] S. Mahlke, R. Ravindran, M. Schlansker, and R. Schreiber. “Bitwidth Cognizant Architec- ture Synthesis of Custom Hardware Accelerators”. In: IEEE Trans. on Comput. Embed. Syst., 2001.

[Mahlk 92] S. A. Mahlke, D. C. Lin, and e. Chen. “Effective compiler support for predicated execution using the hyperblock”. In: ACM SIGMICRO, pp. 45–54, IEEE Computer Society Press, 1992. References 122

[McClu 85] E. J. McCluskey. “Built-in self-test techniques”. Design & Test of Computers, IEEE, Vol. 2, No. 2, pp. 21–28, 1985.

[McFar 88] M. C. McFarland, A. C. Parker, and R. Camposano. “Tutorial on High-level Synthesis”. In: Proceedings of the 25th ACM/IEEE Design Automation Conference, pp. 330–336, IEEE Computer Society Press, Los Alamitos, CA, USA, 1988.

[McNai 03] C. McNairy and D. Soltis. “Itanium 2 processor microarchitecture”. Micro, IEEE, Vol. 23, No. 2, pp. 44–55, 2003.

[Micro 14] MicroBlaze. “MicroBlaze Soft Processor Core”. 2014.

[Mishc 06] A. Mishchenko, S. Chatterjee, and R. Brayton. “DAG-aware AIG rewriting: A fresh look at combinational logic synthesis”. In: ACM/IEEE Design Automation Conf., pp. 532–536, 2006.

[Moore 65] G. E. Moore et al. “Cramming more components onto integrated circuits”. 1965.

[Nane 12] R. Nane, V. Sima, B. Olivier, R. Meeuws, Y. Yankova, and K. Bertels. “DWARV 2.0: A CoSy-based C-to-VHDL hardware compiler”. In: Field Programmable Logic and Applica- tions (FPL), 2012 22nd International Conference on, pp. 619–622, IEEE, 2012.

[Nicol 85] A. Nicolau. “Percolation scheduling: A parallel compilation technique”. Tech. Rep., Cornell University, 1985.

[Nicol 91a] A. Nicolau and R. Potasmann. “Incremental tree height reduction for high level synthesis”. In: DAC, 1991.

[Nicol 91b] A. Nicolau and R. Potasmann. “Incremental tree height reduction for high-level synthesis”. In: ACM/IEEE Design Automation Conference, pp. 770–774, ACM, 1991.

[Nios 09a] Nios II C2H Compiler User Guide. Altera, Corp., San Jose, CA, 2009.

[Nios 09b] I. Nios. “Processor Reference Handbook”. 2009.

[Nogue 11] J. Noguera, S. Neuendorffer, S. Haastregt, J. Barba, K. Vissers, and C. Dick. “Implemen- tation of sphere decoder for MIMO-OFDM on FPGAs using high-level synthesis tools”. Analog Integrated Circuits and Signal Processing, Vol. 69, No. 2-3, pp. 119–129, 2011.

[Occu 10] Occupational Outlook Handbook 2010-2011 Edition. United States Bureau of Labor Statis- tics, 2010.

[Open] OpenCL for Altera FPGAs. http://www.altera.com/products/software/opencl/ opencl- index.html.

[Openc 09] K. Opencl and A. Munshi. “The OpenCL Specification Version: 1.0 Document Revision: 48”. 2009.

[Pangr 87] B. M. Pangrle and D. D. Gajski. “Design tools for intelligent silicon compilation”. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 6, No. 6, pp. 1098–1112, 1987. References 123

[Pangr 91] B. Pangrle. “On the Complexity of Connectivity Binding”. IEEE Tran. on Computer-Aided Design, Vol. 10, No. 11, November 1991.

[Pauli 89] P. Paulin and J. Knight. “Force-directed scheduling for the behavioral synthesis of ASICs”. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 8, No. 6, pp. 661–679, Jun 1989.

[Pilat 11] C. Pilato, F. Ferrandi, and D. Sciuto. “A design methodology to implement memory ac- cesses in High-Level Synthesis”. In: Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2011 Proceedings of the 9th International Conference on, pp. 49–58, Oct 2011.

[Pilat 12] C. Pilato and F. Ferrandi. “Bambu: A Free Framework for the High-Level Synthesis of Complex Applications”. DATE, 2012.

[Potas 90] R. Potasman, J. Lis, A. Nicolau, and D. Gajski. “Percolation based synthesis”. In: Design Automation Conference, 1990. Proceedings., 27th ACM/IEEE, pp. 444–449, IEEE, 1990.

[Pothi 10] N. Pothineni, P. Brisk, P. Ienne, A. Kumar, and K. Paul. “A high-level synthesis flow for custom instruction set extensions for application-specific processors”. In: ACM/IEEE Asia and South Pacific Design Automation Conference, pp. 707–712, 2010.

[Pouch 13] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong. “Polyhedral-based Data Reuse Opti- mization for Configurable Computing”. In: Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 29–38, ACM, New York, NY, USA, 2013.

[Pozzi 06] L. Pozzi, K. Atasu, and P. Ienne. “Exact and approximate algorithms for the extension of embedded processor instruction sets”. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 25, No. 7, pp. 1209–1229, 2006.

[Putna 08] A. Putnam, D. Bennett, E. Dellinger, J. Mason, P. Sundararajan, and S. Eggers. “CHiMPS: A C-level compilation flow for hybrid CPU-FPGA architectures”. In: IEEE Int’l Conf. on Field Programmable Logic and Applications, pp. 173–178, 2008.

[Qsys 14] Qsys interconnect. http://www.altera.com/products/ip/qsys/, 2014.

[Quar 14] Quartus II. Altera, Corp., San Jose, CA, 2014.

[Quinn 15] P. J. Quinn. “Silicon Innovation Exploiting Moore Scaling and More than Moore Technol- ogy”. In: High-Performance AD and DA Converters, IC Design in Scaled Technologies, and Time-Domain Signal Processing, pp. 213–232, Springer, 2015.

[Ramak 96] B. Ramakrishna Rau. “Iterative Modulo Scheduling”. The International Journal of Parallel Processing, Vol. 24, No. 1, Feb 1996.

[Ramal 99] G. Ramalingam, J. Song, L. Joskowicz, and R. E. Miller. “Solving systems of difference constraints incrementally”. Algorithmica, 1999. References 124

[Rau 81] B. R. Rau and C. D. Glaeser. “Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing”. SIGMICRO Newsl., Vol. 12, No. 4, pp. 183–198, Dec. 1981.

[Resha 05] M. Reshadi and D. Gajski. “A cycle-accurate compilation algorithm for custom pipelined datapaths”. In: Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, pp. 21–26, ACM, 2005.

[Rosen 05] L. Rosen. Open source licensing. Prentice Hall, 2005.

[Rupno 11] K. Rupnow, Y. Liang, Y. Li, D. Min, M. Do, and D. Chen. “High level synthesis of stereo matching: Productivity, performance, and software constraints”. In: Field-Programmable Technology (FPT), 2011 International Conference on, pp. 1–8, Dec 2011.

[Santa 07] A. Santa Cruz. “Automated Generation of Hardware Accelerators From Standard C”. 2007.

[Schla 94] M. S. Schlansker and V. Kathail. “Acceleration of First and Higher Order Recurrences on Processors with ILP”. In: Work. on Lang. & Comp. for Par. Comp., 1994.

[Schre 02] R. Schreiber, S. Aditya, S. Mahlke, V. Kathail, B. R. Rau, D. Cronquist, and M. Sivaraman. “PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators”. Journal of VLSI signal processing systems for signal, image and video technology, Vol. 31, No. 2, pp. 127–142, 2002.

[Semer 01] L. Semeria, K. Sato, and G. De Micheli. “Synthesis of hardware models in C with point- ers and complex data structures”. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, Vol. 9, No. 6, pp. 743–756, Dec 2001.

[Semer 98] L. Semeria and G. De Micheli. “SpC: synthesis of pointers in C application of pointer analysis to the behavioral synthesis from C”. In: Computer-Aided Design, 1998. ICCAD 98. Digest of Technical Papers. 1998 IEEE/ACM International Conference on, pp. 340–346, Nov 1998.

[Shan 13] Shang High-level Synthesis Framework. https://github.com/OpenEDA/Shang/, 2013.

[Silva 13] G. Q. Silva. Static Detection of Address Leaks. https://code.google.com/p/addr-leaks/, 2013.

[Sivar 02] M. Sivaraman and S. Aditya. “Cycle-time aware architecture synthesis of custom hardware accelerators”. In: Int’l conf. on Compilers, architecture, and synthesis for embedded systems, pp. 35–42, 2002.

[Stall 99] R. M. Stallman et al. Using and porting the GNU compiler collection. Free Software Foundation, 1999.

[Stan 14] Stanford CPU DB: Clock Frequency Scaling. http://cpudb.stanford.edu/visualize/ clock frequency, 2014.

[Steen 96] B. Steensgaard. “Points-to analysis in almost linear time”. In: Proceedings of the 23rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pp. 32–41, ACM, 1996. References 125

[Stitt 07] G. Stitt and F. Vahid. “Binary synthesis”. ACM Transactions on Design Automation of Electronic Systems, Vol. 12, No. 3, 2007.

[Stra 10] Stratix-IV Data Sheet. Altera, Corp., San Jose, CA, 2010.

[Stra 14] Stratix-10. Altera, Corp., San Jose, CA, 2014.

[Sun 04] F. Sun, A. Raghunathan, S. Ravi, and N. Jha. “Custom-Instruction Synthesis for Extensible-Processor Platforms”. IEEE Transactions on Computer-Aided Design of In- tegrated Circuits and Systems, Vol. 23, No. 7, pp. 216–228, 2004.

[Synp 15] Synphony Model Compiler. http://www.synopsys.com/systems/blockDesign/HLS/ Pages/default.aspx, 2015.

[Syste 02] SystemC. “SystemC 2.0 User’s Guide”. Open SystemC Initiative, 2002.

[Taylo 13] M. B. Taylor. “Bitcoin and the Age of Bespoke Silicon”. In: Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pp. 16:1–16:10, Piscataway, NJ, USA, 2013.

[Tidwe 05] R. Tidwell. XAPP706: Alpha Blending Two Data Streams Using a DSP48 DDR Technique. Xilinx, Inc., 2005.

[Tige 10] Tiger ”MIPS” processor. University of Cambridge, http://www.cl.cam.ac.uk/teaching/ 0910/ECAD+Arch/mips.html, 2010.

[TOP5 14] TOP500: TOP 500 Supercomputer Sites. http://www.top500.org, Nov. 2014.

[Tripp 07] J. Tripp, M. Gokhale, and K. Peterson. “Trident: From High-Level Language to Hardware Circuitry”. Computer, Vol. 40, No. 3, pp. 28–37, 2007.

[Tseng 86] C.-J. Tseng and D. P. Siewiorek. “Automated synthesis of data paths in digital systems”. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 5, No. 3, pp. 379–395, 1986.

[Under 04] K. D. Underwood and K. S. Hemmert. “Closing the gap: CPU and FPGA trends in sus- tainable floating-point BLAS performance”. In: Field-Programmable Custom Computing Machines, 2004. FCCM 2004. 12th Annual IEEE Symposium on, pp. 219–228, IEEE, 2004.

[Vahid 08] F. Vahid, G. Stitt, and L. R. “Warp Processing: Dynamic Translation of Binaries to FPGA Circuits”. IEEE Computer, Vol. 41, No. 7, pp. 40–46, 2008.

[Villa 10] J. Villarreal, A. Park, W. Najjar, and R. Halstead. “Designing Modular Hardware Accel- erators in C with ROCCC 2.0”. In: IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 127–134, 2010.

[Virt 10] Virtex-4 Family Overview. Xilinx, Inc., San Jose, CA, 2010.

[Wakab 06] K. Wakabayashi and T. Okamoto. “C-based SoC design flow and EDA tools: an ASIC and system vendor perspective”. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 19, No. 12, pp. 1507–1522, 2006. References 126

[Wang 13] Y. Wang, P. Li, P. Zhang, C. Zhang, and J. Cong. “Memory Partitioning for Multidi- mensional Arrays in High-level Synthesis”. In: Proceedings of the 50th Annual Design Automation Conference, pp. 12:1–12:8, ACM, New York, NY, USA, 2013.

[Wang 14] Y. Wang, P. Li, and J. Cong. “Theory and Algorithm for Generalized Memory Parti- tioning in High-level Synthesis”. In: Proceedings of the 2014 ACM/SIGDA International Symposium on Field-programmable Gate Arrays, pp. 199–208, ACM, New York, NY, USA, 2014.

[Warte 92] N. J. Warter, G. E. Haab, K. Subramanian, and J. W. Bockhaus. “Enhanced Modulo Scheduling for Loops with Conditional Branches”. In: Proceedings of the 25th Annual International Symposium on Microarchitecture, pp. 170–179, IEEE Computer Society Press, Los Alamitos, CA, USA, 1992.

[Weick 84] R. P. Weicker. “Dhrystone: a synthetic systems programming benchmark”. Communica- tions of the ACM, Vol. 27, No. 10, pp. 1013–1030, 1984.

[Wilso 94] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, et al. “SUIF: An infrastructure for research on parallelizing and optimizing compilers”. ACM Sigplan Notices, Vol. 29, No. 12, pp. 31–37, 1994.

[Wilso 95] P. R. Wilson, M. S. Johnstone, M. Neely, and D. Boles. “Dynamic storage allocation: A survey and critical review”. In: Memory Management, pp. 1–116, Springer, 1995.

[Xili] Xilinx: Vivado Design Suite. http://www.xilinx.com/products/design tools/ vivado/vivado-webpack.htm.

[Xili 14] Xilinx 2014 Annual Report (Form 10-K). http://www.xilinx.com, 2014.

[YACC 05] YACC-Yet Another CPU CPU. http://opencores.org/project,yacc,overview, 2005.

[Yang 14] D. Yang, C. Gan, P. R. Chidambaram, G. Nallapadi, J. Zhu, S. C. Song, J. Xu, and G. Yeap. “Technology-design-manufacturing co-optimization for advanced mobile SoCs”. March 28 2014.

[Yiann 07] P. Yiannacouras, J. Steffan, and J. Rose. “Exploration and Customization of FPGA-Based Soft Processors”. Computer-Aided Design of Integrated Circuits and Systems, IEEE Trans- actions on, Vol. 26, No. 2, pp. 266–277, Feb 2007.

[Zhang 10] J. Zhang, Z. Zhang, S. Zhou, M. Tan, X. Liu, X. Cheng, and J. Cong. “Bit-level opti- mization for high-level synthesis and FPGA-based acceleration”. In: Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays, pp. 59–68, ACM, 2010.

[Zhang 12] W. Zhang, V. Betz, and J. Rose. “Portable and Scalable FPGA-based Acceleration of a Direct Linear System Solver”. ACM Trans. Reconfigurable Technol. Syst., Vol. 5, No. 1, pp. 6:1–6:26, March 2012. References 127

[Zhang 13] Z. Zhang and B. Liu. “SDC-Based Modulo Scheduling for Pipeline Synthesis”. In: ICCAD, San Jose, CA, 2013.

[Zheng 13] H. Zheng, S. T. Gurumani, L. Yang, D. Chen, and K. Rupnow. “High-level synthesis with behavioral level multi-cycle path analysis”. In: Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on, pp. 1–8, IEEE, 2013.

[Zhu 01] J. Zhu. “Static Memory Allocation by Pointer Analysis and Coloring”. In: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 785–790, IEEE Press, Piscataway, NJ, USA, 2001.

[Zhu 02] J. Zhu. “Symbolic Pointer Analysis”. In: Proceedings of the 2002 IEEE/ACM International Conference on Computer-aided Design, pp. 150–157, ACM, New York, NY, USA, 2002.

[Zhu 04] J. Zhu and S. Calman. “Symbolic Pointer Analysis Revisited”. SIGPLAN Not., Vol. 39, No. 6, pp. 145–157, June 2004. Appendix A

LegUp Source Code Overview

In this appendix, we give an overview of the LegUp source code. In Section A.1, we discuss the LegUp compiler backend pass, which receives the final optimized LLVM intermediate representation (IR) as input and produces Verilog as output. In Section A.2, we discuss the LegUp frontend compiler passes, which receive LLVM IR as input and produce modified LLVM IR as output.

A.1 LLVM Backend Pass

Most of the LegUp code is implemented as a target backend pass in the LLVM compiler framework. The top-level class is called LegupPass. This class is run by the LLVM pass manager, which calls the method runOnModule(), passing in the LLVM IR for the entire program and expecting the final Verilog code as output. The LegUp code is logically structured according to the flow chart we gave in Figure 2.2. There are five major logical steps performed in order: allocation, scheduling, binding, RTL generation, and Verilog output.

First, we have an Allocation class that reads in a user Tcl configuration script specifying the target device, timing constraints, and HLS options. The class reads another Tcl script that contains the operation delay and area characteristics of the target FPGA device. These Tcl configuration settings are stored in a global LegupConfig object accessible throughout the code. We pass the Allocation object to all later stages of LegUp. This object also maps LLVM instructions to unique signal names in Verilog and ensures these names do not collide with reserved Verilog keywords. Global settings that should be readable from all other stages of LegUp should be stored in the Allocation class.

The next step loops over each function in the program and performs HLS scheduling. The default scheduler uses the SDC approach and is implemented in the SDCScheduler class. The scheduler uses the SchedulerDAG class, which holds all the dependencies between the instructions of a function. The final function schedule is stored in a FiniteStateMachine object that specifies the start and end state of each LLVM instruction.

Next, we perform binding in the BipartiteWeightedMatchingBinding class, which performs bipartite weighted matching. We store the binding results in a data structure that maps each LLVM instruction to the name of the hardware functional unit on which that instruction should be implemented.
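To make this structure concrete, the following is a minimal sketch of how a backend pass of this kind plugs into LLVM's legacy pass manager. Only runOnModule() and the surrounding pass boilerplate reflect the real LLVM API; the numbered stage comments and the pass name are illustrative placeholders, not LegUp's actual interfaces.

    // Sketch of a LegUp-style backend module pass (illustrative only; the
    // stage comments stand in for the five steps and are not LegUp's API).
    #include "llvm/IR/Module.h"
    #include "llvm/Pass.h"
    #include "llvm/Support/raw_ostream.h"
    using namespace llvm;

    namespace {
    struct LegupPassSketch : public ModulePass {
      static char ID;
      LegupPassSketch() : ModulePass(ID) {}

      // The pass manager hands us the IR for the entire program at once.
      bool runOnModule(Module &M) override {
        // 1. Allocation: read the Tcl configuration and device data.
        // 2. Scheduling: assign each instruction to FSM states (SDC).
        // 3. Binding: map instructions onto shared functional units.
        // 4. RTL generation: build RTLModule objects from the schedule.
        // 5. Verilog output: print each RTLModule as a Verilog module.
        for (Function &F : M) {
          if (F.isDeclaration())
            continue; // nothing to synthesize for external functions
          errs() << "synthesizing function: " << F.getName() << "\n";
        }
        return false; // the IR is left unmodified; we only emit Verilog
      }
    };
    } // end anonymous namespace

    char LegupPassSketch::ID = 0;
    static RegisterPass<LegupPassSketch>
        X("legup-sketch", "Illustrative LegUp backend pass skeleton");

The key design point is that runOnModule() sees the entire program at once, which is what allows whole-program allocation to happen before per-function scheduling and binding.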

In the next step, the GenerateRTL class loops over every LLVM instruction in the program and, using the schedule and binding information, creates an RTLModule object that represents the final hardware circuit. The data structure that we use to represent an arbitrary circuit uses the following classes (a toy sketch of how they compose follows the list):

• RTLModule describes a hardware module.

• RTLSignal represents a register or wire signal in the circuit. The signal can be driven by multiple RTLSignals, each predicated on an RTLSignal, to form a multiplexer.

• RTLConst represents a constant value.

• RTLOp represents a functional unit with one, two or three operands.

• RTLWidth represents the bit width of an RTLSignal (e.g., [31:0]).
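To make these roles concrete, the following toy re-creation shows how a register driven by multiple predicated sources, as described for RTLSignal above, might be assembled. All method names and fields beyond the class names listed above are illustrative assumptions, not LegUp's actual API.

    // Toy re-creation of the RTL data structure described above. Method
    // names and fields beyond the listed class names are illustrative.
    #include <iostream>
    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    struct RTLSignal {
      std::string name;
      bool isRegister = false;
      // Each (predicate, source) pair is one multiplexer input: when the
      // predicate signal is true, the source signal drives this signal.
      std::vector<std::pair<RTLSignal *, RTLSignal *>> drivers;

      void addDriver(RTLSignal *predicate, RTLSignal *source) {
        drivers.emplace_back(predicate, source);
      }
    };

    struct RTLModule {
      std::string name;
      std::vector<std::unique_ptr<RTLSignal>> signals;

      RTLSignal *addSignal(const std::string &n, bool reg = false) {
        signals.push_back(std::make_unique<RTLSignal>());
        RTLSignal *s = signals.back().get();
        s->name = n;
        s->isRegister = reg;
        return s;
      }
    };

    int main() {
      RTLModule m;
      m.name = "sketch";
      RTLSignal *inState1 = m.addSignal("cur_state_is_1");
      RTLSignal *inState2 = m.addSignal("cur_state_is_2");
      RTLSignal *zero = m.addSignal("const_zero");
      RTLSignal *sumWire = m.addSignal("sum_wire");
      RTLSignal *sumReg = m.addSignal("sum_reg", /*reg=*/true);
      sumReg->addDriver(inState1, zero);    // reset the register in state 1
      sumReg->addDriver(inState2, sumWire); // capture the sum in state 2
      std::cout << sumReg->name << " has " << sumReg->drivers.size()
                << " predicated drivers\n";
    }

The (predicate, source) pairs on a signal correspond directly to a multiplexer in the generated hardware, with the predicates typically derived from the FSM state.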

Finally, the VerilogWriter class loops over each RTLModule object and prints out the corresponding Verilog for the hardware module. We also print out any hard-coded testbenches and top-level modules.
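As a rough illustration of this last step, a VerilogWriter-style emitter could render such predicated drivers as an always-block multiplexer. The function below is a self-contained toy under that assumption; it does not reproduce LegUp's actual output format.

    // Toy emitter in the spirit of VerilogWriter: renders a register with
    // predicated drivers as an always-block multiplexer. The output format
    // is illustrative, not LegUp's actual generated Verilog.
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    void emitPredicatedReg(
        const std::string &reg,
        const std::vector<std::pair<std::string, std::string>> &drivers) {
      std::cout << "always @(posedge clk) begin\n";
      for (const auto &d : drivers)
        std::cout << "  if (" << d.first << ") " << reg << " <= "
                  << d.second << ";\n";
      std::cout << "end\n";
    }

    int main() {
      emitPredicatedReg("sum_reg", {{"cur_state_is_1", "32'd0"},
                                    {"cur_state_is_2", "sum_wire"}});
    }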

A.2 LLVM Frontend Passes

In this section, we discuss portions of the LegUp code that are implemented as frontend LLVM passes. These passes receive LLVM IR as input, return modified LLVM IR as output, and are run individually using the LLVM opt command. For each pass, the LLVM pass manager calls the method runOnFunction(), providing the LLVM IR for the function and expecting the modified LLVM IR as output.

For the hybrid flow, the HwOnly class removes from the IR all functions that should be implemented in software, while the SwOnly class removes all functions that should be implemented in hardware. We run these two passes on the original IR of the program to generate two new versions of the IR: we pass the HwOnly IR to the LegUp HLS backend and the SwOnly IR to the MIPS/ARM compiler backend.

Loop pipelining is performed by the SDCModuloScheduler class, which implements the algorithm we described in Chapter 5. This pass determines the pipeline initiation interval and the scheduled start and end time of each instruction in the pipeline. This data is stored in LLVM IR metadata, which the LegUp backend reads later.

If-conversion is performed by the LegUpCombineBB class, which removes simple control flow and combines basic blocks. In the PreLTO class, we detect LLVM built-in functions that can appear in the IR (e.g., memset and memcpy) and replace them with equivalent LLVM IR instructions that we can synthesize in the LegUp backend.
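The sketch below shows the shape of such a frontend pass using LLVM's legacy pass interface, including how per-instruction scheduling results could be recorded as IR metadata for a backend to read back with getMetadata(). The pass name, the metadata kind string, and the trivial one-instruction-per-state "schedule" are illustrative assumptions; only the LLVM calls themselves are real API (assuming a recent LLVM with the Metadata class hierarchy).

    // Sketch of a LegUp-style frontend pass (illustrative; not the actual
    // LegUp source). Uses the LLVM legacy pass manager interface.
    #include "llvm/IR/Constants.h"
    #include "llvm/IR/Function.h"
    #include "llvm/IR/Instructions.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Metadata.h"
    #include "llvm/Pass.h"
    using namespace llvm;

    namespace {
    struct PipelineAnnotator : public FunctionPass {
      static char ID;
      PipelineAnnotator() : FunctionPass(ID) {}

      // The pass manager calls this once per function with its IR.
      bool runOnFunction(Function &F) override {
        LLVMContext &Ctx = F.getContext();
        unsigned State = 0;
        for (BasicBlock &BB : F) {
          for (Instruction &I : BB) {
            // Record a (trivial) scheduled state for each instruction as
            // metadata; a backend could later read it with getMetadata().
            Metadata *MD = ConstantAsMetadata::get(
                ConstantInt::get(Type::getInt32Ty(Ctx), State++));
            I.setMetadata("legup.pipeline.state", MDNode::get(Ctx, MD));
          }
        }
        return true; // we modified the IR by attaching metadata
      }
    };
    } // end anonymous namespace

    char PipelineAnnotator::ID = 0;
    static RegisterPass<PipelineAnnotator>
        X("pipeline-annotate", "Attach illustrative schedule metadata");

A pass like this would be run individually with the opt command, for example opt -load LegUpPasses.so -pipeline-annotate in.bc -o out.bc, where the shared-library and flag names here are hypothetical.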