Acceleration of Applications with FPGA-Based Computing Machines: Code Restructuring

FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO Acceleration of Applications with FPGA-Based Computing Machines: Code Restructuring Tiago Lascasas dos Santos Mestrado Integrado em Engenharia Informática e Computação Supervisor: João M. P. Cardoso, PhD Co-supervisor: João Bispo, PhD July 31, 2020 c Tiago Santos, 2020 Acceleration of Applications with FPGA-Based Computing Machines: Code Restructuring Tiago Lascasas dos Santos Mestrado Integrado em Engenharia Informática e Computação Approved in oral examination by the committee: Chair: Prof. João Paulo de Castro Canas Ferreira, PhD External Examiner: Dr. José Gabriel de Figueiredo Coutinho, PhD Supervisor: Prof. João Manuel Paiva Cardoso, PhD July 31, 2020 Abstract Field-programmable gate arrays (FPGAs) can be used to accelerate performance-critical programs from a wide range of fields and still providing energy-efficient solutions. Programs written in high level languages, such as C and C++, can be compiled to FPGAs through High-level Synthesis (HLS). Although FPGAs benefit the most from parallel and data-streaming applications, efficient compilation to FPGAs is a problem for both tools and developers. Most applications do not follow these patterns, and extensive code restructuring and the use of HLS directives need to be applied to a program in order to take advantage of FPGAs. Code restructuring and the use of HLS directives often needs to be manually performed by an experienced developer, and as such there is a need to automate this process. This dissertation proposes a framework that automatically optimizes C code via directives, using a source-to-source compiler on a stage prior to HLS. This optimization is primarily applied by strategies that select, configure and insert directives on the code to be input to an HLS tool, e.g., Vivado HLS, in order to synthesize more efficient hardware accelerators. Those strategies rely on very simple but effective heuristics, which use a small set of properties extracted from the control/dataflow graphs generated from the computations being compiled. The framework is evaluated using a wide variety of input source code, and the results show that the framework manages to achieve efficient speedups across all benchmarks when compared to their unoptimized versions, while maintaining a low resource usage in most cases. The framework is also compared to code optimized manually with directives, and the experiments show that it achieves similar results. Keywords: FPGAs, Code Optimization, High-level Synthesis, Source-to-Source Compiler i ii Resumo Os dispositivos lógicos programáveis conhecidos por FPGAs podem ser usados para acelerar programas com requisitos de performance exigentes e provenientes de uma vasta gama de domínios, com a vantagem adicional de manter uma alta eficiência energética. Programas escritos em lin- guagens de alto nível, como C e C++, podem ser compilados para FPGAs através da Síntese de Alto Nível (HLS). Apesar de os FPGAs tirarem melhor partido de aplicações com paralelismo e de fluxo de dados, a compilação eficiente para estas plataformas é um problema tanto para as ferramentas como para os programadores. Visto que maior parte das aplicações não seguem estes padrões, uma reestruturação de código extensiva e a aplicação de diretivas de HLS têm de ser aplicadas a um programa de modo a se poder tirar partido do FPGA. A reestruturação de código e a inserção de diretivas, normalmente, têm de ser efetuadas manualmente por um programador com experiência e, portanto, existe a necessidade de se automatizar este processo. Esta dissertação propõe uma ferramenta que otimiza código C automaticamente num compilador source-to-source numa fase anterior à HLS. Esta otimização é, maioritariamente, aplicada através de estratégias que selecionam, configuram e inserem diretivas no código que serve de input a uma ferramenta de HLS, p. ex. Vivado HLS, de modo a sintetizar aceleradores mais eficientes. A ferramenta é avaliada usando uma grande variedade de códigos fonte, e os resultados mostram que a ferramenta consegue atingir ganhos eficientes de latência quando comparados com as suas versões não otimizadas, mantendo uma utilização de recursos baixa na maior parte dos casos. A ferramenta também é comparada com código otimizado manualmente com diretivas, e as experiências mostram que os resultados são semelhantes aos obtidos com a inserção manual de diretivas. Palavras-chave: FPGAs, Otimização de Código, Síntese de Alto Nível, Compiladores fonte-para- fonte iii iv Acknowledgments I would, first and foremost, like to express my absolute gratitude to my supervisor, João Cardoso, for all the constant guidance, availability, corrections and feedback provided during this dissertation. It was an honor to be able to work with you. A very special thanks, too, to my co-supervisor, João Bispo, for always being there to receive the oddest of questions at the oddest of hours, and for always going the extra mile in his answers. My thanks, as well, to the other folks in the SPeCS lab: to Renato Campos, for the constant exchange of ideas and for always being there to help me out; to Luís Noites, for those discussions during the early days of this dissertation; to Pedro Pinto, for laying the groundwork that I would then build upon; and to the rest of the lab members, for creating a stimulating and pleasant environment to work in. This whole pandemic ordeal made it quite difficult to feel immersed in a laboratorial environment, but regardless, I am very glad to have been able to work in this organization. A special thanks, too, to Afonso Ferreira, whose previous advancements on this topic are incredibly relevant to my work, and needless to say, inspired it. I also acknowledge the academic program of Xilinx, Inc. for providing the lab with FPGA boards and licenses of the tools. As for my family, I would like to thank my parents, Francisco and Diamantina, for the endless patience they’ve shown to me during this endeavor. I know that hearing me type on the keyboard deep into the night was a real nuisance. I also want to thank my brother, Gonçalo, who was always there to cheer me up. I look forward to read your own dissertation in a few years’ time. Finally, I would like to dedicate this dissertation to my grandmothers, Delfina and Emília. The other day I found my kindergarten yearbook, in which you both left me messages for the future. Your eagerness for me to find success radiated in them, and so I’d like to acknowledge that by dedicating this work to you. Tiago Lascasas dos Santos v vi ”Always an even trade” Steven Erikson vii viii Contents 1 Introduction1 1.1 Objectives . .2 1.1.1 Problem Description . .2 1.1.2 Proposed Solution . .3 1.2 Structure of the Dissertation . .4 2 Related Work5 2.1 High-Level Synthesis Tools . .5 2.1.1 Overview . .5 2.1.2 Vivado HLS . .7 2.1.3 LegUp . .7 2.2 State-of-the-art Code Restructuring Strategies . .7 2.2.1 Design Space Exploration (DSE) . .7 2.2.2 Ordered Loop Transformations . .9 2.2.3 Generating Optimized Code From Tracing-based Data Flow Graphs . 11 2.2.4 Conversion of Code Into a Data Flow Engine . 12 2.2.5 Abstractions for Targeting Different HLS Tools . 13 2.2.6 Balancing the Latency, Area and Accuracy Trade-off . 14 2.2.7 Use of Aspect-Oriented Programming . 15 2.3 Summary . 16 3 The Code Analysis Framework 19 3.1 The Clava Source-to-Source Compiler . 19 3.2 Source Code Preprocessing . 22 3.2.1 Constant Folding . 22 3.2.2 Unwrapping Statements With Multiple Delarations . 23 3.3 Tracing-based Data Flow Graph Through Instrumentation . 24 3.4 Data Flow Graph Through Static Code Analysis . 30 3.4.1 Control Flow Graph . 30 3.4.2 Data Flow Graph . 33 3.4.3 Building Process . 38 3.4.4 Loop Iteration Estimation . 42 3.4.5 Simplifying the Data Flow Graph . 43 3.4.6 Extracting Features . 44 3.5 Summary . 48 ix x CONTENTS 4 Code Optimization Strategies 49 4.1 Restructuring Using Vivado HLS Directives . 50 4.1.1 Function Inlining . 50 4.1.2 Loop Unrolling . 51 4.1.3 Pipelining Code Regions . 52 4.1.4 Unrolling and Pipelining Single Loops . 54 4.1.5 Declaring Array Parameters as Streams . 55 4.1.6 Support For Other Directives . 57 4.2 Detecting Instrumentable Code . 60 4.3 Mathematical Functions Replacement . 61 4.3.1 Statistical Analysis of Benchmarks . 62 4.3.2 Float Versions of Mathematical Functions . 63 4.3.3 Optimizing Individual Functions . 64 4.4 Summary . 65 5 Experimental Results 67 5.1 Environment Setup . 67 5.2 Global Framework Evaluation . 67 5.2.1 Machine Learning Benchmarks . 70 5.2.2 Matrix Manipulation Benchmarks . 71 5.2.3 Filter-based Benchmarks . 73 5.2.4 Digital Signal Processing Benchmarks . 74 5.3 User-specified Load/Stores Parameter for Single Loops . 75 5.4 Comparison to Manual Code Restructuring . 79 5.5 Instrumentable Code Classification Accuracy . 82 5.6 Execution Time Evaluation . 83 5.7 Summary . 83 6 Conclusions 85 6.1 Concluding Remarks . 85 6.2 Future Work . 86 A Framework API 87 B Benchmark Code 91 B.1 Machine Learning Benchmarks . 91 B.2 Matrix Manipulation Benchmarks . 95 B.3 Filter-based Benchmarks . 97 B.4 Digital Signal Processing Benchmarks . 99 B.5 Load/Stores Evaluation . 101 B.6 Comparison to Manual Code Restructuring . 102 References 103 List of Figures 1.1 Example of a possible compilation pipeline for a C program targeting an FPGA .2 1.2 Approach adopted by the proposed framework . .3 2.1 Different LARA use cases: A) one aspect applied to one app produces one design; B) multiple aspects applied to one app produce multiple apps; C) one aspect applied to multiple apps produces multiple apps; D) one aspect, configured with user parameters and applied to one app, produces multiple apps ........

Acceleration of Applications with FPGA-Based Computing Machines: Code Restructuring

Linux DVB API Version 4

Windows Internals, Sixth Edition, Part 2

Implementing IBM Flashsystem 900 Model AE3

Diploma Thesis RTAI Networking Mailboxes

Locks and Raspberries: a Comparative Study of Single-Board Computers for Access Control

Introduction to 64 Bit Intel Assembly Language Programming for Linux

The Potential of Dynamic Binary Modification and CPU- FPGA Socs for Simulation

Build Your Own Ambilight - 11

Reactos Suspends Development for Source Code Review - Linux.Com Reactos Suspends Development for Source Code Review by - February 1, 2006

A Crack on the Glass Evaluation of Consumer Windows OS Security Architecture and Its Feasibility As a Potential Isolation Provider for "Qubes Windows Lite"

Free and Open Source Software

Partial Array Self-Refresh in Linux