
POLITECNICO DI MILANO
DEPARTMENT OF INFORMATION TECHNOLOGY
DOCTORAL PROGRAMME IN COMPUTER SCIENCE AND ENGINEERING

ON THE ROLE OF POLYHEDRAL ANALYSIS IN HIGH PERFORMANCE RECONFIGURABLE HARDWARE BASED COMPUTING SYSTEMS

Doctoral Dissertation of: Riccardo Cattaneo

Advisor: Prof. Marco D. Santambrogio

Tutor: Prof. Donatella Sciuto

The Chair of the Doctoral Program: Prof. Carlo Fiorini


2015 – XVIII


Acknowledgements


Summary

Recent years have seen dramatic improvements in High Level Synthesis (HLS) tools: from the efficient translation of numeric algorithms to image processing algorithms, from stencil computations to neural networks, relevant application domains and industries are benefiting from research on the synthesis of compute elements, memory subsystems, and communication networks.

In order to systematically synthesize better circuits for specific programs and kernels, the last decades' studies focused on the development of sound, formal approaches; one notable such framework is the Polyhedral Model (PM) and the associated code analysis technique, collectively called Polyhedral Analysis (PA). Under this representation it is possible to compute dependencies, find loop bounds, and reorder instructions in a completely automated manner, relying on the same set of sound and comprehensive assumptions of the PM. We reconsider this rich state of the art, elaborate on different methodologies, and implement the related toolchains to generate highly parallel and power efficient kernels running on reconfigurable hardware, with the aim of distributing the workload over multiple computational blocks while maximizing the overall power efficiency of the resulting heterogeneous system. We approach this problem in three different, interrelated ways.

First of all, we dedicated our early efforts to the development of an accelerator-rich platform whose focus is the coordination of multiple custom and software processors via a novel Design Space Exploration (DSE) phase. Specifically, the work elaborates on two relevant aspects: the effectiveness of Partial Reconfiguration (PR) in attaining improved energy-delay and throughput metrics, and the effectiveness of the heuristics chosen to realize the DSE step, which feature both low complexity and good exploration times. To this matter we devote Chapter 2. We also extended the scope of this work by assuming that multiple computing elements are in place and that an adequate communication architecture is required to coordinate those accelerators. This is discussed in Chapter 3.

Secondly, we focus on amply data-parallel codes and develop a novel HLS approach that uses the PM as a means to explicitly extract and isolate data and computation from affine codes, in order to efficiently divide the workload among an arbitrary number of nodes, in light of the current and foreseeable trend of adoption of reconfigurable hardware in the datacenter. Towards energy proportional computing, we improve the current state of the art in single core acceleration, as our methodology obtains near-linear speedup with the area available to accelerate the given workload. To this subject we devote Chapters 4 and 5.

Lastly, we focus on a specific, more restricted class of data parallel codes, namely Iterative Stencil Loops (ISLs), as they play a crucial role in a variety of application fields. The computationally intensive nature of those algorithms created the need for solutions to implement them efficiently, in order to save both execution time and energy. This, in combination with their regular structure, has justified their widespread study and the proposal of largely different approaches to their optimization. However, most of these works focus on aggressive compile time optimization, cache locality optimization, and parallelism extraction for the multicore/multiprocessor domain, while fewer works focus on the exploitation of custom architectures to further exploit the regular structure of ISLs, specifically with the goal of improving power efficiency. This work introduces a methodology to systematically design power efficient hardware accelerators for the optimal execution of ISL algorithms on Field Programmable Gate Arrays (FPGAs). To this extensive methodology we devote Chapter 6.

As part of the methodology, we introduce the notion of Streaming Stencil Time-step (SST), a streaming-based architecture capable of achieving both low resource usage and efficient data reuse thanks to an optimal data buffering strategy; and we introduce a technique, called SSTs queuing, capable of delivering a quasi-linear execution time speedup with constant bandwidth. This is the core subject of Chapter 7.

We validate all methodologies, approaches, and toolchains on significant benchmarks on either a Zynq-7000 or a Virtex-7 FPGA using the Xilinx Vivado suite. The results, reported in the chapter devoted to each subject, demonstrate how we are able to improve the efficiency of all the baselines we compare against; specifically, we improve the energy-delay and throughput metrics when using our accelerator-rich platform, and dramatically improve the on-chip memory resource usage using the PA-based methodologies, allowing us to treat problem sizes whose implementation would otherwise not be possible via direct synthesis of the original, unmanipulated HLS code. We finally conclude the dissertation in Chapter 8 with a number of potential future works.


Sommario

In recent years, High Level Synthesis (HLS) tools have seen remarkable improvements: from the efficient translation of numeric algorithms to that of image processing algorithms, from stencil computations to neural networks, diverse application domains and industries are benefiting from research on the synthesis of compute elements, memory subsystems, and communication networks.

Research in the last decades has focused on the development of sound, formal approaches aimed at the systematic synthesis of better-performing circuits for specific programs and complex computational kernels. In this respect, the Polyhedral Model (PM) and the associated code analysis technique, collectively referred to as Polyhedral Analysis (PA) and described in detail in Chapter 4, are of particular interest. Through this representation and manipulation of the computation it is possible to compute dependencies and loop bounds, to reorder instructions in a completely automated manner, and more. In this thesis we reconsider the state of the art and the works related to this research area, elaborating novel methodologies and implementing the related toolchains to generate reconfigurable hardware, with the goal of distributing the workload over multiple compute blocks while maximizing the overall energy efficiency of the resulting heterogeneous system. We approach the problem in three different ways.

First of all, we developed an accelerator-rich platform whose focus is the coordination of multiple custom and software processors, through a Design Space Exploration (DSE) phase. The work elaborates on two relevant aspects: first, the effectiveness of Partial Reconfiguration (PR) in improving the energy-delay product and the throughput; second, the effectiveness of the heuristics chosen to realize the DSE step, which have low computational complexity and very good execution times. We devote Chapter 2 to this work. We also extended its scope by assuming that multiple computing elements are present in the system and that an adequate communication architecture is required in order to coordinate those accelerators. This is discussed in Chapter 3.

Next, we concentrate on amply parallel affine codes, and we develop a new HLS approach that uses Polyhedral Analysis as a means to extract and isolate data flow and computation, so as to efficiently divide the workload among an arbitrary number of nodes, in light of the current and future trend of adoption of reconfigurable hardware in datacenters; in the perspective of energy proportional computing, we improve the current state of the art in single core acceleration, as demonstrated by the methodology described in Chapter 5.

Finally, we concentrate on a specific, restricted class of parallel codes, Iterative Stencil Loops (ISLs), whose efficient computation is crucial in a variety of application fields. The computationally intensive nature of these algorithms has created the need for efficient methods for their computation, in order to save both execution time and energy. This, in combination with their regular structure, has justified their extensive study and the proposal of many different approaches to their optimization. However, most of these works focus on aggressive static optimization, data locality optimization, and parallelism extraction in the multicore domain, while fewer works focus on the exploitation of reconfigurable architectures with the goal of improving the energy efficiency of the computation. As part of the methodology, we introduce the concept of Streaming Stencil Time-step (SST), a dataflow-based architecture able to combine low resource usage with efficient data reuse thanks to an optimal data buffering strategy; and we introduce a technique, called SST queuing, able to deliver a reduction of the execution time proportional to the number of compute elements employed, with constant off-chip bandwidth and an increase in energy efficiency up to the saturation of the computing system. Chapters 6 and 7 introduce this methodology for the systematic design of efficient hardware accelerators for the optimal execution of ISL algorithms on Field Programmable Gate Arrays (FPGAs) by means of SSTs.

We validate every method, approach, and toolchain on reference applications implemented on either a Zynq-7000 or a Virtex-7 FPGA using the Xilinx Vivado suite. The results, reported in the chapter devoted to each subject, demonstrate how we are able to improve the efficiency of all the baselines we compare against; in particular, we improve the energy-delay product and the throughput when using our accelerator-rich platform; we considerably improve the usage of on-chip memory resources using the PA-based methodologies, which allows us to treat problem sizes whose implementation would otherwise not be possible via direct synthesis of the original, unmanipulated HLS code; and we are able to scale the SST-based computation according to the theoretical model, which maximizes the energy efficiency of the resulting computing system. We conclude the thesis with Chapter 8, in which we provide suggestions for possible future developments of these research lines.


Contents

Summary

Sommario

1 Introduction
  1.1 Context
  1.2 The Challenges of High Performance Computing
      1.2.1 The Energy and Power Challenge
      1.2.2 The Memory Challenge
      1.2.3 The Concurrency and Scalability Challenge
      1.2.4 The Resiliency Challenge
      1.2.5 The Software Challenge
      1.2.6 Trend Analysis in High Performance Computing
  1.3 Heterogeneous Systems
      1.3.1 Metrics
  1.4 Hardware Acceleration
      1.4.1 What is Hardware Acceleration
      1.4.2 Why to employ Hardware Acceleration
      1.4.3 High Level Synthesis
      1.4.4 Optimized High Level Synthesis
      1.4.5 Input languages to High Level Synthesis (HLS) tools
  1.5 Thesis Contributions and Outline


2 Design Space Exploration of Partially Reconfigurable Hardware Accelerators
  2.1 Introduction
  2.2 State of Art
  2.3 Proposed Methodology
  2.4 Mapping and Scheduling Exploration
      2.4.1 Detailed overview of the algorithm
      2.4.2 Reconfiguration-aware Heuristics
      2.4.3 Reconfiguration nodes
      2.4.4 Communication nodes
  2.5 Experimental Results
  2.6 Conclusions and Future Works

3 On the Partitioning of a Graph of Accelerators
  3.1 Introduction
  3.2 Related Work
      3.2.1 Local Search Methods
      3.2.2 Global Search Methods
  3.3 Multi-Level, K-way Partitioning
  3.4 Algorithm's Internals
      3.4.1 Coarsening Phase
      3.4.2 Initial Partitioning Phase
      3.4.3 Un-Coarsening Phase
  3.5 Experimental Results
      3.5.1 Summary
  3.6 Conclusions

4 A Review of the Polyhedral Analysis Framework
  4.1 Introduction to the Polyhedral Framework
      4.1.1 Motivating Examples
      4.1.2 Polyhedral Model
      4.1.3 Polyhedral Transformations
  4.2 Streaming Systems in FPGAs
      4.2.1 Streaming Architectures
  4.3 High Level Synthesis
      4.3.1 What is HLS?
      4.3.2 Advantages
      4.3.3 Evolution
  4.4 Application Domain
      4.4.1 Staticness


      4.4.2 Affinity
      4.4.3 Pureness
  4.5 Iterative Stencil Loops
      4.5.1 Definition
  4.6 Long Term Vision
      4.6.1 Main Characteristics and Implementation Challenges
      4.6.2 State of the Art

5 On How to Explicitly Isolate Data and Computation
  5.1 Introduction
  5.2 Related Work
  5.3 Data And Computation Isolation
      5.3.1 Limitations
  5.4 Experimental Results
      5.4.1 Experimental Data
  5.5 Conclusions and Future Work

6 Towards the Optimal Iterative Stencil Loops Implementation
  6.1 Introduction
      6.1.1 Contributions
  6.2 State-of-the-Art
      6.2.1 Tiling Based Optimizations
      6.2.2 DSLs Based Optimizations
      6.2.3 Custom architectures
  6.3 This Work's Contributions
  6.4 A Scalable Streaming-based Microarchitecture for the Automatic Implementation of ISLs
      6.4.1 Fundamental Principles
  6.5 A General Overview of the Proposed Microarchitecture
      6.5.1 Memory System
      6.5.2 Computation System
  6.6 Some Considerations on the Input Code
  6.7 A Comparison with Existing Works
      6.7.1 Polyhedral Model (PM)-based C-to-FPGA flow
      6.7.2 SST microarchitecture
      6.7.3 SSTs queuing
  6.8 Overview of the Methodology
  6.9 Proposed Architecture: the Streaming Stencil Time-step
      6.9.1 Streaming-oriented Graph Construction


      6.9.2 Computing System Extraction
      6.9.3 Memory System Derivation
      6.9.4 SST IR and Code Generation
      6.9.5 Pipelining the SST
      6.9.6 Scaling the Problem Size

7 Scaling Up: the SSTs Queuing Technique
  7.1 Introduction
  7.2 SST Queuing
  7.3 Experimental Results
      7.3.1 Test Cases
      7.3.2 Experimental Settings and Goals
      7.3.3 Resource Usage
      7.3.4 Performance
      7.3.5 Power Efficiency
  7.4 SST Generator
  7.5 Dual FPGA Results
      7.5.1 Experimental Settings
      7.5.2 Test Cases
      7.5.3 Experimental Results

8 Conclusions
  8.1 Future works

Bibliography


CHAPTER 1

Introduction

This Chapter introduces the context that motivates the work done in this thesis. Section 1.1 describes this context, namely the role of reconfigurable hardware in the advancement of High Performance Computing (HPC) systems. Section 1.2 presents the main challenges that arise when designing next-generation systems, while Section 1.3 provides a brief description of heterogeneous systems, with particular attention to Field Programmable Gate Arrays (FPGAs). The role of these devices in computing is described in Section 1.4. Finally, Section 1.5 provides a high level overview of the proposed work within the context of building more power efficient computing systems, and delineates the contributions and the structure of the subsequent chapters of this thesis.

1.1 Context

The next generation of supercomputing platforms will deliver unprecedented performance for the benefit of all scientific disciplines and industries.

In HPC, the important milestones are considered to be the emergence of systems whose overall performance, expressed as the number of Floating Point Operations Per Second (FLOPS) a given system is able to perform, crosses the threshold of $10^{3k}$ for some $k \in \mathbb{N}$ ($k = 3, 4, 5, 6$ for Giga-, Tera-, Peta-, and Exascale, respectively). A first important achievement came in 1985, when Gigascale ($10^9$) was reached with the Cray-2. In 1997 Terascale ($10^{12}$) was delivered by Intel's ASCI Red, and in 2008 Petascale ($10^{15}$) was achieved by IBM's Roadrunner. It is believed that in the near future, approximately in 2020, top tier systems will perform at least at the Exascale ($10^{18}$) level.

The need for such technological advancements is easily justified: some of the key computational challenges faced not only by industry and science, but by our civilization as a whole, can only be addressed using ever more powerful computing systems. Many practical problems can benefit substantially from it: in climate modeling, it could help us adapt faster to climate change and sea level rise thanks to much more accurate forecasting; in medical systems it could allow a dramatic advancement in the research on preventing and curing cancer as well as the other challenging diseases of our age; in astrophysics it could finally lay bare the secrets of the formation of the universe; in the energy field the impact would be even stronger, as it could allow better control of fusion and also effectively reduce pollution by helping to design innovative, cost-effective renewable energy plants. Last but not least, it is believed that Exascale (the next generation of computing systems, as of the day of writing) is of the order of the processing power of the human brain at the neural level; because of that, an Exascale system could allow the reverse engineering of a human brain and also – more interestingly – the possibility to emulate it [14].

Current trends, however, suggest that there is the need to explore new paradigms in designing future systems; indeed:

• Moore's Law, if interpreted incorrectly as the doubling of performance every 18-24 months, has hit a power wall, as clock rates have been essentially flat since the beginning of the 2000s.

• Moore's Law, if interpreted correctly as the doubling of the number of transistors on a chip every 18-24 months, is still valid. However, it is impossible to reach Exascale just by doing more of the same but bigger and faster. Indeed, current technology cannot be used to build an Exascale system, as it would likely cost more than 100 billion dollars, and require its own dedicated power plant and over 1 billion dollars per year to be powered [191].

• The attempt to hide the ever increasing memory latency wall by designing larger and more complex cache hierarchies has definitely hit its limit in terms of effectiveness on real applications.

• New parallelization strategies are needed. It is increasingly complex to extract parallelism from sequentially designed programs automatically, and the distribution of the load onto an enormous number of Processing Elements (PEs) requires a radically different approach.

• The traditional single-domain research activities, where hardware and software are explored in isolation, can no longer sustain the growing demand for efficient solutions.

These and other claims make up some of the relevant challenges facing the designers of next generation high performance computing platforms, a matter to which the next Section is dedicated.

1.2 The Challenges of High Performance Computing

While designing a new system that can be competitive with a modern HPC system requires a non-negligible effort, managing to make a performance leap of orders of magnitude is a far more complex task. There are in fact some important challenges within the HPC field that must be addressed in order to attain such an accomplishment [40]. The focus of this section is to clearly define what they are, and how they impact the design of next-generation systems.

1.2.1 The Energy and Power Challenge

Power consumption is the most compelling concern, as it is critical to improve overall power efficiency by two orders of magnitude in future computing systems. This is because one of the main costs of operating an HPC system, i.e. of its operating expenditure (OpEx), is precisely power consumption. Indeed, assuming a linear scaling of today's best-of-breed system in terms of performance, the power requirements for an equivalent Exascale system would still be of the order of gigawatts, with an energy cost of more than 2 billion dollars per year. Therefore serious new research challenges arise to achieve better power efficiency, and it is believed that this will be the area in which significant improvement will be the most difficult to achieve. Additionally, the majority of the power consumed by supercomputers today is not used to perform computations, but to move data around the system [183]. In fact:

• Power consumption increases proportionally to the bit-rate, so as we move to ultrahigh-bandwidth links, it becomes a dominant factor;

• Power consumption is highly distance-dependent, as it grows quadratically with the wire length.

Therefore the emerging constraints on energy consumption will effectively influence the way an HPC system is designed, for example leading to an increased usage of optical technologies to perform data movements, and adding power constraints, as well as an effective reduction of data movements, to the goals of algorithm design. Designing an HPC system with lower power requirements leads to various advantages, first of all the scaling down of the cooling system, which in turn cuts the overall costs of the HPC system. Even if there have been substantial improvements in energy efficiency during the last years, HPC continues to be criticized for its extraordinarily high energy demand, leaving a strong need for accelerated progress. The U.S. Department of Energy has set the goal of 20 MW as the limit of power consumption for an Exascale system to keep its operational costs in a feasible range, whereas modern data centers typically provide that amount of power. However, the most efficient large-scale HPC system, the German L-CSC [3], makes us understand how distant the actual technology is from the desired goal, since it is capable of achieving only 5 GFLOPS/W. An Exascale system would need an improvement of 10× with respect to the L-CSC's power efficiency in order to stay under the limit of 20 MW.
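The 10× figure follows directly from the two numbers above; as a quick check, using only the 20 MW budget and the 5 GFLOPS/W efficiency reported here:

\[
\frac{10^{18}\ \mathrm{FLOPS}}{20 \times 10^{6}\ \mathrm{W}} = 5 \times 10^{10}\ \mathrm{FLOPS/W} = 50\ \mathrm{GFLOPS/W} = 10 \times 5\ \mathrm{GFLOPS/W}.
\]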


1.2.2 The Memory Challenge

The second major challenge is related to storing information, and is due to the lack of currently available technology to retain data at high enough capacities, but also to access it at high enough rates and low enough latencies, while still remaining within an acceptable power demand [105, 109] (see Figure 1.1). Memory capacity using traditional Dynamic RAM (DRAM) technology turns out to be a matter of cost. Current trends show that although the number of cores per processor is increasing, the ratio of memory to the available computational capacity is decreasing. This is essentially due to the fact that the cost of memory has not been decreasing as rapidly as the cost of floating point performance, simply because the rate of increase of memory density has never been as rapid as that of Moore's law for the number of transistors on a processor. Even though progress has been made in the past, as the cost of memory has decreased by a factor of over $10^{10}$ in less than sixty years, current costs are still prohibitive when a very large amount of it is needed. Memory bandwidth is instead a structural issue rather than a cost issue. In fact, while for processors the demand has always been for more rapid instruction execution, memory evolution has been guided by the demand for increased density to maximize the amount of data available to the processors, resulting in the adoption of production technologies that allowed building large capacity memories whose latency, however, was relatively high. This in turn resulted in an ever growing gap between the number of instructions a processor is able to execute and the number of memory transfers that can be done within the same amount of time. Processor designers addressed this issue by designing hierarchical memories to mask the memory latency. Modern processors are in fact equipped with different levels of memory caches that can store data from DRAM so that future requests for that data are readily available. Cache memories located on-chip are typically built out of Static RAMs (SRAMs), a type of memory built using transistors only. While SRAMs feature vastly lower latencies with respect to DRAMs, they have relatively low data densities and are also more susceptible to errors, so that designers are required to equip them with error correction logic. This translates into overall higher costs, the main reason why they are small and still need to be backed by traditional DRAMs.


Figure 1.1: An illustration of the von Neumann Bottleneck. The graph refers to the evolution of canonical CPUs. Performance is measured as the reciprocal of latency. Source: [105]

Due to the hierarchical nature of memory, it is preferable for a processor to hit a required data block in cache, as it can be accessed more rapidly. Intuitively, when the cache miss ratio is high, the positive impact of the hierarchical memory system can be completely voided. Most frequently, high performance programs are indeed coded with a specific architecture in mind, in order to properly exploit memory bandwidth, latency, and cache hierarchies. To summarize, the memory challenge must be addressed in two different but complementary ways:

• providing as much capacity as possible at each level of the hierarchy, at an acceptable cost;

• providing the most effective methods for moving data among the levels as dictated by the needs of the various applications. This is crucial also because memory latency heavily impacts the performance of parallel cores, essentially due to the inherent need for synchronization.
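The sensitivity to the cache miss ratio mentioned above can be quantified with the standard average memory access time model (a textbook relation, reported here only as an illustration):

\[
\mathrm{AMAT} = t_{\mathrm{hit}} + m \cdot t_{\mathrm{miss}},
\]

where $m$ is the miss ratio and $t_{\mathrm{miss}}$ the miss penalty. Assuming, for instance, $t_{\mathrm{hit}} = 1$ ns and $t_{\mathrm{miss}} = 100$ ns, raising $m$ from 1% to 10% raises the average access time from 2 ns to 11 ns, largely voiding the benefit of the hierarchy.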

1.2.3 The Concurrency and Scalability Challenge

The end of the increase in single compute node performance through higher instruction level parallelism and higher clock rates has left explicit parallelism as the only mechanism to increase the overall performance of a system. Mathematical models, numerical methods, and software implementations will all need new conceptual and programming paradigms to make effective use of extreme levels of concurrency. With clock rates flat at several gigahertz, systems will require more than one billion concurrent operations to achieve Exascale levels of performance, and most of this increase in concurrency will be within the single compute node. Concurrency can be measured in three ways:

• The total number of operations that are instantiated in each cycle to run the applications.

• The minimum number of threads that run concurrently to provide enough instructions to generate the desired operation-level concurrency.

• The overall thread-level concurrency that is needed to allow some percentage of threads to stall while performing high-latency operations, and still keep the desired dynamic thread concurrency.

A clear medium-term priority is the definition and implementation of algorithms that are scalable at very large levels of parallelism and that remain sufficiently fast under varying latency and bandwidth availability; scalability should be modeled and analyzed mathematically, using abstractions that represent key architectural features. The increased level of concurrency in a system greatly increases the number of times that different kinds of independent activity must come together at some sort of synchronization point, increasing the potential for races, metastable states, and other hard-to-detect timing problems. It will be necessary to maintain something like a billion threads of control, subdivided over millions of processor cores, to achieve an exaflop. A directly related problem will be the need to make sure that the required data is readily accessible to the computational units. Thus the data must be staged appropriately and the locality of the data must be maintained. Performance scalability of computing systems has been, and will continue to be, increasingly constrained by both the power required and the speed available to enable data communication between memory and processor, but also by the phenomenon known as dark silicon [77], caused by the failure of Dennard scaling [71], i.e. the fact that transistor scaling and voltage scaling are no longer in line with each other. The mere increase of the number of cores cannot be carried out without exceeding the power density budget, which in turn can make it impossible to keep the chip temperature in the safe operating range. This limitation forces us to systematically power up only a fraction of the entire die, causing large idle or heavily underclocked portions of silicon area, hence the term dark silicon. This phenomenon inevitably restricts the number of cores a chip can accommodate. These considerations are already giving rise to richer parallelism paradigms, as pure many-core/thread level parallelism is being integrated – at the cluster level – by forms of process-level parallelism, an example of which is MapReduce [70].
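The link between the end of Dennard scaling and dark silicon can be seen from the first-order model of dynamic power (a standard approximation, not specific to any of the cited works):

\[
P_{\mathrm{dyn}} \approx \alpha \, C \, V^{2} f,
\]

where $\alpha$ is the activity factor, $C$ the switched capacitance, $V$ the supply voltage, and $f$ the clock frequency. As long as $V$ could be lowered at each technology node, doubling the transistor count left the power density roughly constant; once $V$ stops scaling, a fully active die with twice the transistors at the same frequency draws roughly twice the power, so only a fraction of it can be switched on at any given time.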

1.2.4 The Resiliency Challenge

Resiliency is the property of a system to continue effective operation even in the presence of faults, either in hardware or in software. The vast majority of today's applications assume that the system will always operate correctly. However, an HPC system must use so many components that it is unlikely that the whole system will ever be operating normally: the larger the system, the shorter its mean time between failures (MTBF). The common approach to resilience, which relies on automatic or application-level checkpoint and restart, is not suitable for very large systems, as the time for checkpointing and restarting could even exceed the mean time to failure (MTTF), resulting in an irreversible deterioration of the integrity of the system. Also, the problem intensifies when considering the need to handle the lack of resilience not only of computation, but also of communication and storage [106].
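The observation that a larger machine has a shorter MTBF admits a simple first-order estimate if one assumes independent, identically behaving components (the node count and node MTBF below are purely illustrative assumptions):

\[
\mathrm{MTBF}_{\mathrm{system}} \approx \frac{\mathrm{MTBF}_{\mathrm{node}}}{N}.
\]

For example, with $N = 100{,}000$ nodes and an optimistic node MTBF of 10 years ($\approx 87{,}600$ hours), the system would experience a failure roughly every 50 minutes, which is the regime in which a full checkpoint/restart cycle may no longer complete between failures.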

1.2.5 The Software Challenge

While large scale parallel processors have greatly increased the performance potential for HPC, they have also introduced substantial new software development problems. There are basically two schools of thought regarding how to properly adapt software development to the context of HPC. In the first case, the belief is that it is feasible to extract parallelism opportunities from current software, as well as to enhance the available paradigms so that they can deal with the enormous amount of concurrency needed. In the second case, the belief is that a radical rethink is required, and that new methods, algorithms, and tools are needed to enable the performance leap. The reality, however, is that both philosophies must coexist, and that the actual need is to figure out how to integrate and support existing computation paradigms while enabling new, revolutionary ones. At the same time, it is also crucial to provide software developers with the right skills, since up to now there has been a serious lack of parallel programming skills across all degrees of experience, from entry level to very high end. An effort must also be made to raise awareness among HPC users, scientists in the first place, so that they understand the software challenges, and to train them to deal with the ever increasing complexity of these systems [106].

1.2.6 Trend Analysis in High Performance Computing

Current technologies cannot deliver increasing processing power simply on the strength of Moore's law on the number of transistors: such a high number of transistors at such a high power density generates more heat than can be dissipated in the limited space of a regular die. Modern datacenters require large investments in electricity supplies, not only to power the electronic equipment but also to cool it down. Additionally, the electrical power drawn by datacenters is hitting limits imposed by utility companies in most places as well. As power efficiency not only reduces the cost of the electronics but allows bigger savings in the conditioning systems, research is focusing on new techniques to reduce the energy consumed by those systems [6]. The most relevant trend addressing this problem is heterogeneous computing. Heterogeneous computing refers to systems that use more than one kind of processing unit. These are systems that gain performance not just by assembling more components of the same type, but by adding customized processing units, usually incorporating specialized processing capabilities to handle particular tasks (and not others). As these components are suited to a specific set of tasks only, a lower number of transistors is usually required to implement them. Specialized components usually work at lower frequency, too, reducing the overall power consumption of the system – mathematical co-processors were introduced in the late 80's for similar reasons, and were the first example of a heterogeneous system. A more relevant and modern example in this direction is the introduction of Graphics Processing Units (GPUs), which were initially used to accelerate the compute-intensive work of texture mapping and polygon rendering. Afterwards, units were added to accelerate geometric calculations such as the rotation and translation of vertices into different coordinate systems.

This is due to the nature of the computation a Central Processing Unit (CPU) is built and optimized for (see Figure 1.2): irregular computation. Irregular computation – in this context – means that instruction flows are hardly (if at all) predictable at compile time, and even when they are, the data access pattern might not be regular at all. Since a CPU's internal structure has a limited amount of logic dedicated to actual computation, only a relatively low number of numeric operations can be performed at a time. On the other hand, GPUs, due to their simpler and parallel internal structure (see Figure 1.3), are better suited to scientific computation, as they provide multiple identical components that can simultaneously execute the same instruction. Even if the power consumption of GPUs is very high, given a highly (data) parallel workload they are capable of delivering many more FLOPS per Watt, due to the intrinsic parallelism of their architecture and the amount of logic actually devoted to computation.

Another reason for the difference in power efficiency between CPUs and GPUs lies in the abstraction layer that implements their software programmability. For this reason, GPUs have been extended in the last decade to support generic computation, and are the current heterogeneous component of choice (at least in the high performance computing sector).

Another important direction in heterogeneity is the introduction of physics chips, which offload physics calculations from the CPU onto dedicated hardware circuits (for example PhysX [11], a proprietary realtime middleware SDK, born from a hardware solution by Ageia, which called it a Physics Processing Unit (PPU)).


Figure 1.2: A representation of a CPU – the simplistic model underlines how only a relatively small portion of the device is dedicated to actual computation.

Figure 1.3: A representation of a GPU – the simplistic model underlines how a relatively large portion of the device is dedicated to actual computation.


Phi cores [5] are based on the same idea as the multicore architecture, but rely on more, less complex micro processors. The basic idea that led to the design of such a component is that these cores can retain many of the existing programming models (and some tools) that most developers are familiar with. Trends show that we need to find a different approach that reduces power consumption while increasing power efficiency and parallelization. Elaborating on these and other trends, we look forward to a component that transcends this abstraction layer and uses all the power it drains for effective computation. Among the other heterogeneous computing technologies, FPGAs show an interesting set of features suggesting that these devices will become a relevant computing fabric in the near future. FPGAs, differently from CPUs or GPUs, do not feature specific low level circuitry to be programmed using a given language; instead, they feature hardware blocks implementing specific functions, such as logic function emulators, digital signal processing blocks (featuring plenty of multiply-and-accumulate – MAC – stages), and so-called block RAMs. Depending on how these blocks are connected, computation takes place according to the designer's will. This technology allows the designer to implement highly customized logic circuits in order to accelerate the "hot portion" of target applications (or entire applications, depending on the context). This customization process reduces, on average, the number of transistors required to implement a specific functionality (compare this against the complexity of a deeply pipelined, out-of-order, superscalar processor and its billions of transistors) – this way, it is possible to achieve increased power efficiency figures. However, FPGAs must be properly configured in order to obtain power efficient processing units out of them. This process is very complex (with respect to software-only designs), as it involves both careful analysis of the target application and multiple hardware design steps. Additionally, it must be repeated for every problem at hand, resulting in a very time consuming process. However, the design of custom architectures tailored around a specific algorithm results in large energy savings, increasing the power efficiency of the system. Not all problems can take advantage of this approach, but many can be efficiently implemented. Programmability is an issue, as FPGAs are programmed in a very different manner than CPUs and GPUs.


Figure 1.4: A representation of an FPGA – the simplistic model underlines how most of the device’s area is dedicated to actual computation.

Current research – both industrial and academic – is focusing on improving the experience of software developers: they should concentrate on software algorithms only, leaving to a sophisticated toolchain the burden of implementing them as a dedicated set of circuits (for example, Xilinx [213] with the SDAccel SDK [13]).

1.3 Heterogeneous Systems

The majority of existing supercomputers generally achieve only a fraction of their peak performance on certain portions of some application tasks. This is because different subtasks of an application can have very different computational requirements, resulting in different needs for processing capabilities. A homogeneous architecture cannot satisfy all the computational requirements of certain applications equally well. Thus, the construction of a heterogeneous computing environment (like that in Figure 1.5) is more appropriate. Employing a heterogeneous system can be the solution to properly meet all the presented challenges, as it offers the opportunity to increase the computational performance while keeping the energy requirements low. Heterogeneous computing [185] refers to systems that use more than one kind of PE, each of which is particularly efficient within a specific application domain. These PEs communicate through a system of high-performance interconnections. To take advantage of such a system, a given task is decomposed into subtasks, where each subtask is computationally homogeneous and is assigned to the PE whose characteristics are the most appropriate for its execution. One or more PEs, being canonical CPUs, are in charge of managing the offloading to the other PEs, as well as the execution of general purpose components of the computation, such as operating system services.

Figure 1.5: A simple scheme of a heterogeneous system.

The rationale behind the employment of a heterogeneous system is that CPUs are designed to handle complex control flows, but their general purpose nature makes them unfit to sustain a high and cost effective throughput with respect to other available solutions. CPUs are therefore coupled with other coprocessors, namely General Purpose Graphic Processing Units (GPGPUs) and FPGAs, both of which have specific characteristics that make them suitable to perform certain kinds of computation. GPGPUs, being Single Instruction Multiple Data (SIMD) processors, perform very well on highly data parallel tasks. They have a massively parallel hardware architecture, are capable of achieving high floating point performance, and have large off-chip memory bandwidth; however, it is usually very difficult to make GPGPUs work at their full capacity and use all the available bandwidth. Also, for high end chips the power demand can be huge, even though they are still capable of delivering high power efficiency – at least with respect to conventional CPUs – which justifies their employment as co-processors in heterogeneous systems. FPGAs offer very high I/O bandwidth and fine-grained, custom and flexible parallelism. They are mainly composed of three building blocks [108] (see Figure 1.6): the Configurable Logic Block (CLB) is the main component; it can implement one or more function generators using Look-Up Tables (LUTs), which in turn implement an arbitrary logic function by storing the result of the function for every possible combination of the inputs. The Input-Output Blocks (IOBs) are in charge of connecting the signals of the internal logic to the output pins of the FPGA package. The interconnection resources allow the connection of CLBs and IOBs. An FPGA can have additional resources embedded on the die, such as Random Access Memory (RAM) cells (also called Block RAM (BRAM)) that can be used to store data on-chip during the computation, Digital Signal Processors (DSPs), and other specific processors.
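To make the LUT description above concrete, the following small C sketch (purely illustrative; the 4-input width, the names, and the bit packing are assumptions, not tied to any specific device) models a LUT as a truth table indexed by its input bits, which is essentially what configuring a CLB amounts to:

    #include <stdint.h>

    /* A 4-input LUT stores one output bit for each of the 2^4 = 16
     * possible input combinations; "config" is the truth table written
     * at configuration time. */
    typedef struct {
        uint16_t config; /* bit i = output when the inputs encode value i */
    } lut4_t;

    /* Evaluate the LUT: pack the four input bits into an index and
     * select the corresponding bit of the truth table. */
    static inline int lut4_eval(lut4_t lut, int a, int b, int c, int d) {
        unsigned idx = (unsigned)(a & 1) | ((b & 1) << 1)
                     | ((c & 1) << 2) | ((d & 1) << 3);
        return (lut.config >> idx) & 1;
    }

    /* Example: the same LUT configured as a 4-input AND gate; only the
     * input combination 1111 (index 15) yields 1. */
    int main(void) {
        lut4_t and4 = { .config = (uint16_t)(1u << 15) };
        return lut4_eval(and4, 1, 1, 1, 1); /* returns 1 */
    }

Any other 4-input Boolean function is obtained by changing only the 16-bit configuration word, never the surrounding structure – which is why the same fabric can implement arbitrary circuits.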

Figure 1.6: The general architecture of an FPGA.

The structure of an FPGA enables task-tailored logic to be created on the fly, which, considering the ever-increasing computational needs coupled with the frequency/power wall, is the perfect solution to have both performance and low power consumption. Indeed, the employment of custom logic, shaped around the specific type of computation, allows only the part of the FPGA fabric needed to implement the circuit to be powered on. Therefore, efficiently designed custom logic can lead to both sustained performance and low power consumption, as previously stated, and thus to high power efficiency. High Performance Reconfigurable Computers (HPRCs), based on conventional CPUs and FPGAs as co-processors, have indeed been gaining the attention of the HPC community in the past few years. In these systems, the main application executes on the CPUs, while the FPGAs handle kernels that have a long execution time and are suitable for hardware implementation. Such kernels are typically data-parallel, overlapped computations that can be efficiently implemented as fine-grained architectures. Optimization techniques such as overlapping data transfers between the CPUs and FPGAs with computation are useful for data-intensive, memory bound applications. However, there is an underlying complexity in heterogeneous systems that simply cannot be handled with modern software solutions, as different architectures must be programmed in different ways. Therefore, there is the need to provide automatic solutions capable of hiding the complexity of these systems, as well as to promote the adoption of techniques that bring together software and hardware design, the so-called co-design, which is believed to be a promising solution to make Exascale computing a reality [32].

1.3.1 Metrics

Different HPC systems have, in general, very different architectures, employ a variety of computing devices, and handle data movement with different approaches. Hence, there is the need to define some standard metrics that can be used as fair terms of comparison. A fair comparison of computing systems is essential, particularly within the HPC field; for this reason, the research community standardized a number of significant metrics (and, depending on the system, also well-defined measurement procedures [4]):

• Throughput, measured in FLOPS, Floating Point Operations per Second;

• Energy-Delay Product, measured in J·s;

• Total Aggregated Bandwidth, measured in bit/s;

• Total Aggregated Absorbed Power, measured in watts (W); and

• Power Efficiency, measured in FLOPS/W.

In this thesis we design prototypal systems while focusing on the optimization of our designs in terms of energy-delay product, throughput, and/or power efficiency.
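For later reference, the two derived metrics above reduce to simple combinations of directly measurable quantities (standard definitions, stated here only for convenience):

\[
\mathrm{Power\ Efficiency} = \frac{\mathrm{FLOP}}{\mathrm{J}} = \frac{\mathrm{FLOPS}}{\mathrm{W}},
\qquad
\mathrm{EDP} = E \cdot t_{\mathrm{exec}} = P_{\mathrm{avg}} \cdot t_{\mathrm{exec}}^{2},
\]

so halving the execution time at constant average power improves the energy-delay product by a factor of four, while halving it at constant energy improves it by a factor of two.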

1.4 Hardware Acceleration

In this section I describe what hardware acceleration is and why it is employed to achieve a power efficiency higher than that of today's high performance computing systems.

1.4.1 What is Hardware Acceleration

Hardware acceleration is a technique that consists of implementing some, or all, parts of an algorithm via dedicated hardware circuits (as, for example, in Figure 1.7). Said circuits produce the same results as their software counterparts [20, 58, 147, 164]. Traditionally, the hardware designer was in charge of creating the circuits by hand, and required a complete understanding of hardware components and of how they could be connected in order to achieve the corresponding algorithmic operation. The entire workflow is very time consuming, involved, and error prone, but nonetheless required when the goal is to achieve the best performance available. This workflow makes extensive use of Hardware Description Languages (HDLs), which are difficult to understand and manage for most software designers, who usually are the ones in charge of coding algorithms. To make a comparison between hardware and software development, HDL based development resembles the use of Assembly to optimize custom routines in C/C++ development. Since HDLs were developed to describe hardware circuits, they are characterized by a low level of abstraction. Thus, hardware designers must take into account every single detail, such as signals, state machines, and their behavior over time. Also, debugging at this level is very complex, and hardware and electronics knowledge is required to understand waveforms, timing constraints, and their impact on the final design.

Figure 1.7: Possible hardware implementation of an IF statement
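As a concrete counterpart to Figure 1.7 (a sketch of the general idea, not of this thesis' own toolflow), an HLS tool typically maps a data-dependent IF onto a multiplexer: both branches are synthesized as parallel combinational logic, and the condition drives the selection, as in the C fragment below.

    /* Software view: control flow.  Hardware view: two parallel datapaths
     * (an adder and a subtractor) feeding a 2-to-1 multiplexer driven by
     * the condition "sel". */
    int select_sum_or_diff(int a, int b, int sel) {
        int sum  = a + b;          /* adder       */
        int diff = a - b;          /* subtractor  */
        return sel ? sum : diff;   /* multiplexer */
    }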

1.4.2 Why to employ Hardware Acceleration

The implementation of a dedicated hardware component has the major benefit of speeding up portions of an application. In fact, it is usually true that no CPU program can run as fast as a dedicated circuit, given that the latter is comparable in terms of technology and frequency to the former. This is due to the overhead needed to keep the CPU as general a processor as possible (i.e. able to compute anything the software developer can think of). This is also the reason why GPUs were introduced: dedicated hardware capable of running specialized instructions to compute graphics-like processing (i.e. data parallel codes) very fast. This kind of device features a lot of dedicated circuitry for specific tasks, such as transforming geometric primitives or triangle setup/clipping; moreover, a lot of small and simple processing units running in parallel achieve better performance than a CPU in graphics computation. Additionally, even if GPUs were invented for graphics computation, in recent years it has become more and more common to exploit their intrinsically parallel architecture to achieve better performance on specific workloads, like scientific computation. These devices are more difficult to program than CPUs (mainly due to the heterogeneous nature of the resulting system), but can be programmed in a similar fashion. However, a few drawbacks affect GPUs, stemming from the need to keep their processing as general as possible. For example, although GPUs usually feature high throughput and very high internal memory bandwidth, it is usually very difficult to make GPUs work at their full capacity, and they rarely saturate the internal bandwidth.


As programming GPUs is a very complex task, major vendors put a lot of effort into introducing Application Programming Interfaces (APIs) and libraries to make the process easier. Notable examples are the CUDA [2] and AMD Mantle [8] frameworks. Moreover, GPUs show relatively low power efficiency with respect to FPGAs [53, 63, 197] on most workloads, for the aforementioned reasons. We pay this overhead when there is a lot of data dependent behavior inside the application. If the algorithm is static and every implementation detail can be known at compile time (apart from the actual values of the data to be processed), then we can create a very small circuit that operates very fast multiple times, requiring less time and far less power. To summarize, GPUs can be considered suboptimal because of their high consumption and relatively low power efficiency (at least compared to FPGAs). This is where FPGAs play an important role. As we can tailor the processing system around the application, stripping away all the intermediate steps, we achieve higher power efficiency. Unfortunately, the development of even a small component is a very complex process. In recent years, in fact, a lot of effort was put into automating the creation of such systems by means of High Level Synthesis (HLS). HLS tools synthesize circuits from languages such as C or C++ instead of the less handy VHDL or Verilog, enormously speeding up the development of hardware based systems.

1.4.3 High Level Synthesis

While HLS tools have been heavily studied in the past, only in recent years have effective industrial tools become available on the market [88]. Current research focuses on efficiently converting numeric or image processing algorithms written in behavioral languages directly into hardware implementations, in order to achieve better performance and lower consumption while raising the level of abstraction, so as to gain in designer programmability [26, 27]. This has become possible in recent years because High Level Synthesis (HLS) tools have become powerful and flexible enough to allow relatively easy and fast synthesis of hardware circuits. Previously, hardware development required plenty of specific knowledge in order to develop a fully working accelerator. HLS tools impose fewer requirements on designers and dramatically speed up the development of a working system. However, without proper care, this comes at the cost of introducing large overheads and slow-downs compared to manually designed implementations. This is due to the lack of knowledge that HLS tools have when optimizing the resulting components. In order to cope with these limitations, HLS tools offer special directives that can be used to optimize the resulting components, with the only downside that these directives need to be specified by the designer and are not derived automatically.

1.4.4 Optimized High Level Synthesis

HLS tools usually feature many directives employed by designers to signal to the synthesizer how to generate different components [64, 88], each with its own specific performance profile. Some of them are useful to increase the throughput, others to minimize the area, and others again are explicitly used to lower the power consumption. For example, the dataflow directive can be used to parallelize function calls and/or nested loops, creating different blocks of circuits inside a single core, each capable of running concurrently with the others. This directive also preserves the data dependences of the input code to maintain the correctness of the output. Another useful directive is the pipeline directive. This directive tells the HLS tool to use more resources in order to create a pipeline inside the core or, when it is used together with the dataflow directive, to create a pipelined block inside the core. Directives such as array map, array reshape, or array partition serve the purpose of optimizing the number of BRAMs used inside the FPGA. Finally, the unroll directive can partially or completely unroll a loop in order to run all the iterations of a loop body in parallel (see [88]).
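As an illustration of how such directives appear in practice, the fragment below decorates a simple element-wise kernel with Vivado HLS-style pragmas (the spelling follows the Xilinx tools used later in the thesis, but the kernel, the factors, and the partitioning choices are illustrative assumptions, not a prescription):

    #define N 1024

    /* Element-wise kernel: the outer loop is pipelined with an initiation
     * interval of 1, the short inner loop is fully unrolled, and the arrays
     * are cyclically partitioned so that the four parallel accesses per
     * cycle do not contend for a single BRAM port. */
    void vec_scale_add(const int a[N], const int coeff[N], int out[N]) {
    #pragma HLS ARRAY_PARTITION variable=a     cyclic factor=4 dim=1
    #pragma HLS ARRAY_PARTITION variable=coeff cyclic factor=4 dim=1
    #pragma HLS ARRAY_PARTITION variable=out   cyclic factor=4 dim=1
        for (int i = 0; i < N; i += 4) {
    #pragma HLS PIPELINE II=1
            for (int j = 0; j < 4; j++) {
    #pragma HLS UNROLL
                out[i + j] = a[i + j] * coeff[i + j] + 1;
            }
        }
    }

The same source, with different directive choices, yields designs at very different points of the latency/area/power trade-off space, which is precisely the exploration discussed above.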

1.4.5 Input languages to HLS tools Current hardware circuits can be generated in very different ways. First of all, we can generate a Register-Transfer Level (RTL) description of the circuit from manually written VHDL or Verilog, each describing the hardware behavior. This is the standard, yet inefficient, workflow in hardware design.

As previously stated in 1.4.2 and 1.4.4, HLS tools are getting more and more powerful, closing the gap between automatic and manual implementation; moreover, they allow the creation of RTL from high level languages such as C/C++, or, with the newer OpenCL C compilers from both Altera and Xilinx, from OpenCL. The reason why HLS vendors choose C/C++ is a threefold argument:

• The vast majority of legacy code for numeric computation is written in C/C++
• Designers are already productive and familiar with imperative/procedural languages such as C/C++
• Designers can rapidly explore the impact of standard directives (i.e., design modes) to find better trade-offs between latency, area used, power consumption and throughput

While these are industrial considerations we cannot overlook, there are other reasons to choose C/C++, namely:

• Most syntax analyzers and compilers are written for C/C++, so it is easy to obtain robust tools to further enhance code derived from them
• It is easy to port algorithms from one platform to another, and to hardware too, as C is well defined and standardized
• They support a familiar "hardware level of abstraction", providing a link between high-level source code and low-level implementation [96]

On the other hand, other languages can be better suited, as they can leverage different, more hardware-friendly formal semantics to produce code that parallelizes better. Those features are, among others:

• No aliasing (e.g., Fortran)
• All parameter passing is done by value (this solves some synchronization issues at the language level, e.g., Haskell, but it does not resolve communication issues)

• Passing arguments by value, however, wastes memory very quickly, so the algorithm needs to be rethought in a more efficient way
• Atomic guards (e.g., BSV)
• Explicit memory hierarchy stores (partially, OpenCL)

Note that other languages can also use other means to obtain parallel/optimized computation: for example, Haskell offers for free fast lightweight threads, parallel sparks and futures, software transactional memory, core affinity control and so on. However, such features mostly cannot be ported to HDL (even if there are projects like [26, 27, 76, 120, 121, 127, 128, 194] that aim for it). Since the leading industry focuses on subsets of C-like languages, for the rest of the thesis I will consider HLS tools targeting C/C++.

1.5 Thesis Contributions and Outline

Within this context, the work proposed in this thesis embraces the principles of heterogeneous computing to make a small step towards the achievement of the Exascale milestone by following three different but related research lines. First of all, we develop an accelerator-rich platform where the focus is on the coordination of multiple custom and software processors, via a novel Domain Space Exploration (DSE) phase. Specifically, the work elaborates on two relevant aspects: the effectiveness of Partial Reconfiguration (PR) to attain improved energy delay and throughput metrics, and the effectiveness of the heuristics chosen to realize the DSE step, which feature both low complexity and good exploration times. To this matter, we devote Chapter 2. We also extended the scope of this work by assuming that multiple computing elements are in place, and an adequate communication architecture is required to coordinate those accelerators. This is discussed in Chapter 3.

Secondly, we focus on amply data-parallel codes, and develop a novel HLS approach to using PM as a means to explicitly extract and isolate data and computation from affine codes in order to efficiently divide the workload among an arbitrary number of nodes, in the light of the current and foreseeable trend of adoption of reconfigurable hardware in the datacenter; towards energy proportional computing, we improve the current state of the art in single core acceleration, as our methodology obtains near-linear speedup with the area at disposition to accelerate the given workload. To this subject, we devote Chapter 4 to introduce the background notions related to Polyhedral Analysis (PA), and Chapter 5 to delineate the methodology and the results. Lastly, we focus on a specific, and more restricted, class of data parallel codes, namely Iterative Stencil Loops (ISLs), as they play a crucial role in a variety of different fields of application. The computationally intensive nature of those algorithms created the need for solutions to efficiently implement them in order to save both execution time and energy. We introduce the notion of Streaming Stencil Time-step (SST), a streaming-based architecture capable of achieving both low resource usage and efficient data reuse thanks to a demonstrably optimal data buffering strategy; and we introduce a technique, called SSTs queuing, capable of delivering a quasi-linear execution time speedup with constant bandwidth. To these extensive methodologies, we devote Chapter 6 and Chapter 7.

Sources This thesis refers to the following materials:

• SMASH: A Heuristic Methodology for Designing Partially Reconfigurable MPSoCs, by R. CATTANEO, C. Pilato, G. Durelli, M. D. Santambrogio, and D. Sciuto. Appeared in the proceedings of the IEEE International Symposium on Rapid System Prototyping (RSP), 2013.

• K-Ways Partitioning of Polyhedral Process Networks: a Multi-Level Approach, by R. CATTANEO, M. Moradmand, D. Sciuto, M. D. Santambrogio. Appeared in the proceedings of the Reconfigurable Architecture Workshop (RAW), 2015.

• Explicitly Isolating Data and Computation in High Level Synthesis: the Role of the Polyhedral Framework, by R. CATTANEO, G. Pallotta, D. Sciuto, M. D. Santambrogio. Appeared in the proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig), 2015.

• On How to Accelerate Iterative Stencil Loops: A Scalable Streaming-based Approach, by R. CATTANEO, G. Natale, C. Sicignano, D. Sciuto, and M. D. Santambrogio. Transactions on Architecture and Code Optimization (TACO).

CHAPTER 2

Design Space Exploration of Partially Reconfigurable Hardware Accelerators

In this Chapter we analyze the impact of Partial Reconfiguration (PR) on the overall performance of the computing system. More specifically, we define an architectural template that we iteratively customize and objectively evaluate as part of a novel Domain Space Exploration (DSE) phase. As part of this design space exploration, PR is modeled and introduced as a possible degree of freedom. After describing the state of the art and the overall methodology, we show how the introduction of PR is an effective technique to improve the overall performance of the resulting reconfigurable hardware based accelerators.

2.1 Introduction

Nowadays, the design of efficient embedded systems relies on heterogeneous Multi-Processor Systems-on-Chip (MPSoCs) [142] that combine general purpose processors with dedicated hardware accelerators.

Indeed, efficient tools (e.g., Xilinx Vivado HLS, Synopsys C Compiler, Cadence C-to-Silicon, Calypto CatapultC) are becoming very popular for the automatic generation of hardware implementations from the corresponding behavioral specifications, also allowing the possibility to explore different implementations (e.g., trade-offs between performance and resource requirements). As a result, high-performance and low-power architectures can be obtained through the hardware acceleration of different parts of the application [74], even if their design still requires high expertise. In particular, Field Programmable Gate Arrays (FPGAs) are very attractive solutions that allow implementing large parts of the application in hardware at low cost. Moreover, exploiting Partial Dynamic Reconfiguration (PDR) [178] offers the possibility of reusing some parts of the logic across different tasks, at the price of an overhead in the execution time required to reconfigure the corresponding logic cells. For this reason, this technique introduces several challenges that have to be properly taken into account during the design of such systems. In particular, one of the main issues in designing heterogeneous systems, especially when PDR is taken into account, is the customization of the architecture [174] in terms of hardware accelerators (either static or reconfigurable), which are usually defined in advance, potentially leading to sub-optimal solutions. Then, the designer has to determine which tasks have to be hardware accelerated and the level of reconfiguration for each of them (if any), also with respect to the number of available resources. Moreover, since a task reconfiguration requires loading the new configuration bitstream for the corresponding region, this can introduce a penalty in the execution time if it is not properly taken into account in the design of the application [29]. Finally, the effects of the data transfers between the tasks are crucial aspects [85] and thus the impact of the interconnection infrastructure (e.g., bus, NoC, FIFO) has to be necessarily taken into account. In conclusion, novel and efficient methodologies are definitely required to take the reconfiguration aspects into account in the early stages of the design process. In this chapter, we propose SMASH (Simultaneous Mapping and Scheduling with Heuristics), a design methodology that aims at addressing the limitations cited above. It combines different heuristics for both customizing the architecture and implementing the application to generate reconfigurable systems tailored for the input partitioned specification.

Indeed, it determines which implementation has to be adopted for each task and the level of reconfiguration of the corresponding hardware modules, also potentially taking into account different topologies that can be adopted for interconnecting the processing elements. In fact, SMASH includes an exploration phase that is able to determine the proper mapping and scheduling for the different tasks of the application, determining the tasks' implementations and the processing elements where they have to be executed. During this exploration, it also determines which hardware modules have to be included in the reconfigurable logic (either to be used as static or reconfigurable regions), and the resulting reconfigurations are taken into account during the evaluation of the solution, along with the required communications. Moreover, taking into account the resulting resource requirements of the different regions during the exploration allows limiting the generation of unfeasible solutions. Simulations by means of virtual platforms have been adopted to validate the proposed approach. The rest of the chapter continues as follows. Section 2.2 overviews existing approaches that aim at addressing similar problems, highlighting the contributions of the proposed solution, whose overall organization is presented in Section 2.3. Then, Section 2.4 presents the heuristics at the basis of this work and the solution evaluation method, respectively. Section 2.5 presents the experimental evaluation of the proposed approach, while Section 2.6 concludes the chapter and outlines future directions of work.

2.2 State of Art

The synthesis of heterogeneous MPSoCs usually requires an efficient exploration of the design space. Daedalus [195] is an interesting and integrated framework for starting from a sequential application and then generating the corresponding parallel implementation, along with the corresponding system. However, it only focuses on streaming applications, and reconfiguration aspects are not taken into account. It is worth noting that these aspects are usually not considered for such applications due to synchronization issues in the FIFOs when reconfiguring the blocks. Moreover, single stages of this computation are usually quite simple and, thus, reconfiguration is usually not attractive in this scenario. On the other hand, task-based applications are usually characterized by time-consuming computational blocks interleaved by data transfers.

Figure 2.1: Example of application DAG (six tasks, T0–T5).

They are usually represented as Directed Acyclic Graphs (DAGs), as in the example shown in Figure 2.1. As opposed to creating full-custom architectures, platform-based design [174] is a viable solution for reducing the complexity of designing such systems through the progressive refinement of an architectural template. Based on this idea, different approaches [56, 85, 119] have been proposed for optimizing partitioned applications, especially in the case of hardware acceleration. In particular, [119] uses an approach based on task clustering, while [85] explores different mapping and scheduling alternatives with a constructive approach for limiting unfeasible solutions. However, in both cases, PDR is not addressed and thus tasks can be executed in hardware only as long as they fit in the available area. On the other hand, [56] proposes a method for mapping and scheduling of reconfigurable systems, but the target architecture (e.g., the number of reconfigurable regions) has to be defined in advance, potentially leading to sub-optimal solutions. In [100], the authors propose a set of techniques focusing on the partitioning of the code and the generation of the corresponding adaptive system. However, they mainly focus on dynamic aspects and the support for the Operating System (OS), while we are more interested in design-time decisions, in order to have a very lightweight OS or even a bare-metal synchronization of the application.

On the other hand, it is worth noting that, for creating a feasible implementation of the system, another step is usually required: the definition of the physical constraints within the FPGA to satisfy the resource requirements (e.g., LUTs, BRAMs, DSPs) of the hardware modules and to avoid their overlapping. The authors in [29] propose a methodology for mapping the tasks to processing elements taking into account reconfiguration aspects and also placement issues. However, their assumptions are quite simplistic (e.g., homogeneous resources for the regions) and difficult to apply to recent FPGA devices, where the designer can define very different, two-dimensional reconfigurable regions. In this chapter, we separate the problems: we identify the number of hardware modules while exploring the mapping of tasks onto them. Then, we only verify that the total amount of resources required by the regions can be effectively satisfied by the target FPGA. Extending the proposed approach to integrate the verification of the physical constraints is straightforward: a floorplanning algorithm (e.g., the one proposed in [45]) can be integrated in the evaluation of the solution and report to the exploration algorithm whether the assignment results in a feasible allocation of the resulting regions or not. However, this is out of the scope of this work and has been left as future work. In conclusion, the main contributions of the proposed approach can be summarized as follows:

• it optimizes the execution of the given task graphs with respect to an architectural template, determining the level of reconfiguration for each of the tasks that are to be executed in hardware and the nature of the hardware modules to be introduced in the final platform;

• it defines an exploration framework that can easily accommodate different algorithms, metrics and evaluation methods to design a reconfigurable system;

• it generates the specification of the virtual platform corresponding to the identified solution.

This approach has been validated by means of synthetic applications that are representative of real-life applications.

Figure 2.2: Example of target architecture (an FPGA hosting CPUs, a static IP core, reconfigurable regions RR0–RR2 and the ICAP, connected through an interconnection channel to a shared memory and an I/O interface).

The corresponding solutions have then been evaluated with high-level simulations of the generated virtual platforms through Synopsys Platform Architect [192].

2.3 Proposed Methodology

The proposed methodology, namely SMASH, starts from the description of the architectural template to be customized (such as the one shown in Figure 2.2) and of one or more partitioned applications to be concurrently executed. Each of these applications can be represented as a DAG (such as the one shown in Figure 2.1), and a unique representation can always be obtained by combining them. A list of admissible implementations (e.g., combinations of performance and resource requirements) has to be provided for each task. For software tasks, they can be computed by profiling or estimating the execution. For hardware tasks, it is possible to obtain both execution time and required resources by estimation or by actual synthesis through an HLS tool such as Vivado HLS [15]. SMASH is then applied to the resulting DFG and is mainly composed of two parts, as shown in Figure 2.3 (a sketch of the corresponding data structures is given after the list below):

1. an exploration of the mapping and scheduling for the input DFG to statically determine which implementation has to be adopted for each of the tasks and where they have to be executed (e.g., processors or reconfigurable logic).

2. a final customization of the architecture, where it is possible to identify static IP cores (i.e., modules with only one task associated with them), as well as reconfigurable regions.
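As a purely illustrative sketch of the entities manipulated by SMASH (the struct names and fields below are ours and not taken from the actual implementation), the inputs can be thought of as one implementation record per admissible task/target pairing, while a mapping choice corresponds to the 3-tuples used by the exploration described in Section 2.4:

#include <string>
#include <vector>

// One admissible way of executing a task, in software or in hardware.
struct Implementation {
    std::string name;        // e.g. "A", "B", "C"
    bool is_hardware;        // true if it targets the reconfigurable logic
    double exec_time_us;     // profiled or estimated execution time
    int resources;           // e.g. LUT/BRAM/DSP budget (hardware only)
};

// A node of the application DAG.
struct Task {
    int id;
    std::vector<int> predecessors;      // dependencies in the DAG
    std::vector<Implementation> impls;  // admissible implementations
};

// One decision taken by an ant: <task, implementation, processing element>.
struct MappingChoice {
    int task_id;
    int impl_index;          // index into Task::impls
    int processing_element;  // CPU, static IP core or reconfigurable region
};

// A complete solution is an ordered trace of choices; the order encodes
// the scheduling priority, as in Table 2.1.
using MappingTrace = std::vector<MappingChoice>;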

As output, it produces a description of the system to be implemented, including the specification of the customized architecture and the mapping and scheduling of the tasks (including the reconfiguration ones) with respect to the generated architectural solution.

In detail, the exploration heuristic (further detailed in Section 2.4.2) aims at evaluating different solutions in terms of mapping and scheduling to determine the best implementation for the given DFG with respect to the target architecture enhanced with hardware modules. In particular, when assigning multiple tasks to the same region, the heuristic is able to automatically compute its overall resource requirements and, taking into account the requirements of all the blocks, whether the solution is feasible or not. Then, the solution evaluation determines the reconfiguration tasks that have to be introduced (i.e., when consecutive tasks executing different functions are assigned to the same module) and evaluates the performance of the solution, taking into account also the reconfiguration overhead. Note that this term is computed on the basis of the size of the reconfigurable region (i.e., its resource requirements) as it results from the generated mapping. The last phase of the methodology is a post-processing step that analyzes the mapping solution and the architectural instance generated after the first phase. In such a situation, each hardware module that has been introduced by the exploration algorithm can have one or more tasks assigned. If the module has only one task assigned, it can be converted into a static IP block, since no reconfiguration is required, thus reducing its area consumption. Otherwise, modules with more than one task assigned are represented as actual reconfigurable regions in the final architecture. In conclusion, the methodology produces the description of the resulting system. In particular, we assume that the generated solution is functionally correct, and we are therefore only interested in evaluating non-functional properties of the system (e.g., performance). For this reason, in this work, we generated a virtual platform for the high-level evaluation of the solutions, where each processing element is represented as a Virtual Processing Unit (VPU) and all the VPUs are interconnected as specified by the input architectural template. Then, the initial DAG is mapped onto these VPUs as specified by the generated mapping solution, and the reconfiguration tasks are assigned to the VPU in charge of performing the actual reconfiguration (e.g., a dedicated processor or one of the available GPPs) to correctly model its execution overhead.
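The architecture refinement rule just described (one task assigned: static IP; more than one: reconfigurable region) can be summarized with a minimal sketch; the helper names are hypothetical and build on the illustrative structures introduced above:

#include <map>
#include <vector>

enum class ModuleKind { StaticIP, ReconfigurableRegion };

// Given the final mapping trace, decide for every hardware module whether
// it becomes a static IP core or a reconfigurable region.
std::map<int, ModuleKind> refineArchitecture(const MappingTrace& trace) {
    std::map<int, int> tasks_per_module;
    for (const MappingChoice& c : trace)
        if (isHardwareModule(c.processing_element))  // hypothetical predicate
            ++tasks_per_module[c.processing_element];

    std::map<int, ModuleKind> kinds;
    for (const auto& [module, count] : tasks_per_module)
        kinds[module] = (count == 1) ? ModuleKind::StaticIP
                                     : ModuleKind::ReconfigurableRegion;
    return kinds;
}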

Figure 2.3: Overview of the proposed methodology (the DFG, the admissible implementations and the architecture template feed the mapping and scheduling heuristic; a fast solution evaluation closes the design space exploration loop, followed by an architecture refinement step that produces the final architecture and solution).

2.4 Mapping and Scheduling Exploration

Our mapping and scheduling exploration is based on Ant Colony Optimization (ACO) [66]. This meta-heuristic relies on the abstraction of an agent (i.e., the "ant") stochastically exploring the design space, described as a sequence of choices. At each step, the agent computes all the available choices and ranks them according to a rule which helps the algorithm find the best solutions, iteration by iteration. The rule for ranking choices relies on two heuristics: a local one and a global one. The former is a function that assigns a score to a choice given the current state of the exploration, and integrates knowledge about the specific problem.

It allows the agent to make an informed (yet local) decision about the next step of the exploration. The latter is a model for the abstraction of ants' pheromones, a substance ants naturally release on the path they traverse and to which they are naturally attracted. Being a volatile substance, the pheromone trail eventually disappears, at a rate inversely proportional to the number of ants traversing (i.e., reinforcing) it. On a heavily traversed path, instead, the deposited pheromone trail is reinforced, and more ants are attracted to it. If a path to a target is optimal, ants reach the target in less time than along other paths; thus, the pheromone on that trail evaporates relatively more slowly, because it takes less time to traverse it (in other words, it is reinforced more frequently). Since the intensity on that trail is relatively higher, relatively more ants are attracted to it, releasing more pheromone themselves and thus reinforcing the trail again. In a sense, the amount of pheromone on a path globally keeps track of how good that path is for reaching the goal. An ACO algorithm implements the exploration in an iterative way by using "generations of ants". A generation is composed of N ants performing design space exploration (DSE), all using the same values of the global pheromone matrix. At the end of each generation, only some ants (i.e., a parameter K defined by the user) are considered for global heuristic reinforcement, proportionally to the quality of the corresponding solution. After a convergence criterion is met, the best solutions are kept and the others are discarded.
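As a minimal, purely illustrative sketch of the global reinforcement step, the fragment below uses the textbook evaporation-plus-deposit scheme; the exact update formula and parameter values used by SMASH are not spelled out here, so names and constants are only indicative:

#include <vector>

// One of the K best ants of a generation together with its solution quality.
struct RankedSolution {
    std::vector<int> choices;   // indices of the (discretized) choices taken
    double quality;             // e.g. 1 / makespan, higher is better
};

// pheromone[c] is the global desirability of choice c; rho in (0, 1) is
// the evaporation rate.
void updateGlobalPheromones(std::vector<double>& pheromone,
                            const std::vector<RankedSolution>& best,
                            double rho) {
    // Evaporation: every trail fades a little at each generation.
    for (double& tau : pheromone)
        tau *= (1.0 - rho);

    // Reinforcement: only the selected ants deposit pheromone,
    // proportionally to the quality of their solution.
    for (const RankedSolution& ant : best)
        for (int c : ant.choices)
            pheromone[c] += rho * ant.quality;
}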

2.4.1 Detailed overview of the algorithm Algorithm 1 describes the steps required to map an application represented as a DAG onto a reconfigurable architecture composed of different processing elements (i.e., software processors, static IP cores and reconfigurable regions) using the ACO technique, with the goal of minimizing the total execution time. The algorithm evolves a certain number of generations of ants (lines 1-22) until the termination criterion is met. In our case, the termination criterion is the number of generations to evolve. Each ant builds a solution as a sequence of choices (lines 2-22): in our specific case, they are mapping choices, i.e., 3-tuples of the form <task, implementation, processing_element> representing how a task should be implemented and where. In this context, an implementation is one of the available ways to execute a task, either in software or in hardware.

Algorithm 1: Overview of exploration algorithm
input : A task graph and a reconfigurable architecture
output: A mapping trace

forall the numGenerations do
    forall the antsPerGeneration do
        readySet ← tasksWithoutPreds()
        scheduledSet ← ∅
        while readySet ≠ ∅ do
            forall the Ti ∈ readySet do
                pTi ← localHeuristic(Ti)
            end
            chosenT ← roulette(pT)
            iSet ← implsOfTask(chosenT)
            forall the Ii ∈ iSet do
                pSet ← processorsPerImpl(Ii)
                forall the Pi ∈ pSet do
                    pIi,Pi ← localHeuristic(Ii, Pi)
                end
            end
            chosenI, chosenP ← roulette(pI,P)
            mappingTrace.add(chosenT, chosenI, chosenP)
            readySet ← resolveDependencies()
        end
        ant.Metrics ← computeMetrics(ant)
        ant.Objective ← computeObjective(ant.Metrics)
        thisGenerationSolutions.add(ant)
    end
    bestAnts.add(selectBest(thisGenerationSolutions))
    updateGlobalPheromones(bestAnts)
end
bestAnt ← selectSingleBestAnt(bestAnts)
return bestAnt.trace

An ant computes a complete trace, i.e., a list of mapping choices in which an explicit scheduling priority is expressed by the ordering of the choices in the list itself. In order to iteratively build a solution, the ants generate all feasible choices at each step of the algorithm, ranking each of them according to a rule based on two heuristic functions, one for the choice of the task to map (lines 6-8) and one to choose both the implementation and the processing element to execute this task on (lines 10-14). Note that the exploration starts with a minimum number (minHW) of hardware modules to employ in the final architecture.

Then, at each step, the choice may reuse the allocated modules or may instantiate an additional hardware module in order to better exploit hardware resources, provided that the area constraint is never violated and the total number of hardware modules stays below a maximum value (maxHW) or the number of tasks. After assigning a score to each mapping choice (line 13), the ant selects one of those choices using a roulette wheel selection scheme (line 14), where the probability of each choice is proportional to its heuristic value. After a decision for this iteration is made, the ant computes which tasks may be executed afterwards (line 17), and the process continues until no more tasks need to be scheduled (lines 5-17). At this point, the solution is evaluated according to a reconfiguration-aware scheduler, and an objective function value is assigned to it (line 19). When the last ant of the current generation ends, the best solutions are used to reinforce the global pheromone matrices (line 22). The algorithm is known to efficiently solve large instances of task mapping problems [85]. However, we further improved it to devise efficient schedules by exploiting knowledge about the problem and the reconfigurable architecture onto which tasks are mapped.
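The roulette wheel selection mentioned above is the standard fitness-proportionate rule; a small illustrative implementation (not the exact code of SMASH) is the following:

#include <numeric>
#include <random>
#include <vector>

// Pick the index of one choice with probability proportional to its
// heuristic score (fitness-proportionate / roulette wheel selection).
int roulette(const std::vector<double>& scores, std::mt19937& rng) {
    double total = std::accumulate(scores.begin(), scores.end(), 0.0);
    std::uniform_real_distribution<double> dist(0.0, total);
    double threshold = dist(rng);

    double cumulative = 0.0;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        cumulative += scores[i];
        if (threshold <= cumulative)
            return static_cast<int>(i);
    }
    return static_cast<int>(scores.size()) - 1;  // guard against rounding
}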

2.4.2 Reconfiguration-aware Heuristics As described in 2.4.1, a 2-step decision process is adopted to reduce the complexity of the exploration, as in [85]. The first one is the task local heuristic, which selects, at each decision point, the current "best" task to schedule among the set of ready tasks. The second one is the mapping local heuristic, which selects, at each decision point, given a task, the current "best" choice of processing element and implementation to execute the chosen task with. To balance the relative weight of all the choices, all heuristic values (local and global) are scaled to the (0, 1) interval, where higher values mean better choices. The task local heuristic yields high values for tasks characterized by low mobility values and relatively fast execution times (averaged over all the task's possible implementations). This rule particularly favors an early execution of ready tasks on the critical path that might otherwise unnecessarily increase the overall execution time. It also slightly increases the heuristic value of tasks that execute relatively faster than the others. The mapping local heuristic, instead, is computed in different ways depending on the implementation.

Moreover, since the architecture might feature static, reconfigurable and software processors, we have to rank the choices involving any of them in a way that allows a fair comparison. The first step, then, is to understand whether the implementation is hardware or software. If software, we compute the heuristic as the product of two terms in the (0, 1) interval: H^1_SW and H^2_SW. H^1_SW is the likelihood of a software implementation of the task with respect to any hardware one (if available), to take into account how a software implementation might be better or worse than a hardware one. H^2_SW computes the average mobility of all software processors, as the sum of the mobility values of all the tasks previously assigned to each software processor, divided by the number of those tasks. Software processors that have not been assigned any task yet are automatically ranked better than any other software processor choice. This choice privileges software processors that have never been assigned tasks before and those that have previously been mapped with tasks with relatively lower average mobility. If hardware, instead, we consider the set of static IP processors and reconfigurable regions available in the system. For the computation of the heuristic of a reconfigurable region, we compute the product of three terms in the (0, 1) interval: H^1_HW, H^2_HW and H^3_HW. H^1_HW is a general likelihood factor for a hardware implementation with respect to any software one (if available). H^2_HW is a factor that relates to the area used when instantiating the implementation on this reconfigurable region. This term takes into account the number of tasks that are yet to be scheduled in order to assign each task a (potentially) fair share of the resources of the FPGA. We penalize the heuristic values of those choices which would consume more resources (in percent terms) than the percent advancement of the mapping algorithm. For example, if the ant is mapping the 5th task out of 10 available tasks, a good choice must not consume more than 50% of the available FPGA area after mapping this task to the choice's reconfigurable region. If this limit is violated, but the overall constraint on total FPGA area is not violated, the choice is viable but is consuming more area than it should, and so its heuristic value is halved. If instead the violation leads to a global area violation, the solution is discarded (i.e., heuristic value = 0, preventing this decision from being taken). This allows for an early discovery of unfeasible solutions and faster convergence.

The H^3_HW term is similar to H^2_SW: the lower the average mobility of the tasks already assigned to a reconfigurable region, the better this term. At each iteration, it is required to evaluate the current solution and provide feedback to the exploration heuristic. In this work, we only considered the overall execution time of the application as the metric to be optimized. However, the framework has been designed to accommodate the usage of different metrics (e.g., execution time, power consumption and area occupation) or any combination of them. Furthermore, it is possible to integrate different methods to compute the metrics, ranging from mathematical models to actual simulations, to trade off elaboration time and accuracy of the evaluation. As an example, the designer can adopt cycle-accurate simulations when analyzing small applications and simpler heuristics when considering large ones. Also the method adopted for the final validation, i.e., high-level simulation with virtual platforms, can be integrated. In all cases, the metrics are intended to work on the scheduling trace of the DAG produced by the exploration heuristic previously described. It is worth noting that the scheduling trace only contains the mapping decisions for the tasks on the processing elements; however, given the possibility of exploiting PDR, it is clear that this trace is not enough to determine the application execution time, since it does not include the reconfiguration tasks. For this reason, it is then necessary to construct a more detailed representation of the application to be evaluated, starting from this scheduling trace. Note that the transformations that we applied to construct this enhanced representation can be used to compute any of the metrics mentioned above; the designer can however also apply other transformations to support even more detailed descriptions of the system. As an example, we consider the trace shown in Table 2.1: it reports an admissible mapping and scheduling for the DAG of Figure 2.1 with respect to the architecture shown in Figure 2.2. It is worth noting that reconfigurations are extracted only when two consecutive tasks are assigned to the same region, but with different implementations. This allows supporting hardware reuse when multiple tasks are assigned to the same region with the same implementation; in fact, in this case, reconfigurations are not required to switch the functionality. Starting from this trace, we introduce two more entities in the graph to be scheduled, which are the reconfiguration nodes and the communication nodes.

Table 2.1: Example of scheduling trace for the TG in Fig. 2.1

Task Name   Implementation   Processing Element
T0          A                CPU0
T1          A                RR0
T2          B                RR0
T3          A                RR1
T4          A                RR0
T5          C                CPU0

The former represents a reconfiguration that has to be performed to change the functionality of a hardware module, and it is mapped onto a dedicated component (i.e., the ICAP on Xilinx devices). The latter, instead, is used to represent the data exchanged directly between tasks or through the memories, according to the communication infrastructure. Given this enhanced representation, we adopted a simple yet effective list-based scheduling algorithm [16] to compute the overall execution time (i.e., the make-span) of the application. The execution time of each task is the one of the implementation determined by the mapping, while reconfiguration and communication overheads are computed as described below.
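As an illustration of the area-fairness term H^2_HW discussed earlier in this section, the following sketch applies the penalty rule described in the text; the structure of the function, the names and the scaling constant are ours and only indicative:

// Score the area usage of mapping the i-th task (1-based) out of n tasks
// onto a reconfigurable region. 'used_after' is the fraction of the total
// FPGA area consumed if this choice is taken, in [0, 1].
double areaFairnessTerm(int i, int n, double used_after, double base_score) {
    double progress = static_cast<double>(i) / n;  // fair share so far

    if (used_after > 1.0)          // global area constraint violated:
        return 0.0;                //   discard the choice entirely
    if (used_after > progress)     // viable but greedier than its fair share:
        return 0.5 * base_score;   //   halve the heuristic value
    return base_score;             // within its fair share: keep the score
}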

2.4.3 Reconfiguration nodes The heuristic described in Section 2.4.2 allows hardware modules to have multiple tasks assigned to them. Adding reconfiguration nodes consists in identifying where the reconfigurations take place and estimating their execution time. Starting from the information in Table 2.1, it is possible to determine where reconfigurations occur by identifying which hardware modules host more than one different implementation. In this example, it is possible to identify reconfigurations only for the module RR0, since RR1 is set to execute only T3. Once the reconfigurations have been identified, they have to be introduced accordingly in the task graph. In particular, each reconfiguration is inserted between the task implemented on the hardware core and the one that needs the core to be reconfigured for its execution on the same module. Furthermore, since the scheduling trace also defines the priorities between the tasks, the reconfigurations must respect this order and can be ordered accordingly. Given the trace in Table 2.1, the reconfiguration nodes identified are reported in Table 2.2 and the resulting task graph is reported in Figure 2.4 (left).

Table 2.2: Reconfiguration nodes identified starting from the trace reported in Table 2.1.

Rec. Node   Proc. Elem.   Function   Prev. Task   Next Task   Prev. Rec.
REC0        RR0           B          T1           T2           -
REC1        RR0           A          T2           T4           REC0

Figure 2.4: Task Graph of Figure 2.1 extended with reconfiguration nodes (on the left) and with both reconfiguration and communication nodes (on the right).

Concerning the execution time of a reconfiguration task, we assume that it is proportional to the size of the largest bitstream among the implementations assigned to the corresponding hardware module. Then, we adopt the same approach proposed in [10] to estimate the actual reconfiguration time required for each region. In our algorithm, all the reconfiguration nodes are scheduled to be executed in sequence by a dedicated processor.
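The detection of reconfiguration nodes from a scheduling trace, as exemplified by Tables 2.1 and 2.2, can be sketched as follows; this is illustrative code reusing the hypothetical structures introduced in Section 2.3, and the region identifiers and field names are ours:

#include <map>
#include <vector>

struct ReconfigurationNode {
    int region;      // reconfigurable region affected (e.g. RR0)
    int function;    // implementation to be loaded
    int prev_task;   // last task executed on the region
    int next_task;   // task that requires the new configuration
};

// Walk the trace in scheduling order; whenever a region is reused with a
// different implementation, a reconfiguration node is emitted (the same
// implementation on the same region means hardware reuse, no node).
std::vector<ReconfigurationNode> extractReconfigurations(const MappingTrace& trace) {
    std::map<int, MappingChoice> last_on_region;   // region -> last choice
    std::vector<ReconfigurationNode> recs;

    for (const MappingChoice& c : trace) {
        if (!isHardwareModule(c.processing_element))  // hypothetical predicate
            continue;
        auto it = last_on_region.find(c.processing_element);
        if (it != last_on_region.end() && it->second.impl_index != c.impl_index)
            recs.push_back({c.processing_element, c.impl_index,
                            it->second.task_id, c.task_id});
        last_on_region[c.processing_element] = c;
    }
    return recs;
}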

2.4.4 Communication nodes After the reconfiguration nodes, also the communication nodes can be introduced, based on the topology of the interconnection between the modules where the tasks have been assigned. Representing communications as explicit nodes and keeping them separated from the actual computation of the tasks allows providing, if needed, a more accurate simulation of the overall system behavior. For example, it is possible to include bus congestion metrics in the simulations and to envision exploration phases that aim at customizing also the communication architecture for the given application. Considering a bus-based architecture with shared memory, before executing a task it is necessary to read data from the memory and, after its termination, to send the results to the memory to be used by subsequent tasks. Thus, communication nodes are added before (read) and after (write) each task, and their execution time is proportional to the amount of data to be transferred. The result of integrating communication nodes into the task graph of Figure 2.1 is reported in Figure 2.4 (right). Regarding the communication time, this is estimated based on the given amount of data to be transferred, as specified in the DAG representation. Note that exploration of data transfers can also be integrated, as in [85], but this is out of the scope of this work.

2.5 Experimental Results

We implemented SMASH in C++ and then applied the resulting framework to a set of benchmarks generated with TGFF, as in [85]. We then generated the Virtual Platforms (VPs) corresponding to the resulting solutions and simulated them with Synopsys Platform Architect [192] for rapid prototyping of system-level integration.

This allows us to validate the proposed approach in a wide range of case studies and also to analyze the scalability of the approach. In particular, the generated task graphs are converted to VP models by using the Generic Task Library, where their processing time is based on the details of the implementations onto which the corresponding tasks have been mapped. Each processing element model is typically available in the Platform Architect library as a Virtual Processing Unit (VPU). VPUs can thus represent all programmable, configurable, or fixed logic processing elements, based on the specified VPU configurations. Also the ICAP controller of the reconfiguration process is modeled as a VPU, to execute the reconfiguration tasks. Then, modules of other regular IP blocks (e.g., interconnect, memories and DMAs) are instantiated to allow a complete simulation. The generated task graphs range from 10 to 100 nodes, where each task has at least one SW implementation and multiple HW ones, representing realistic trade-offs between execution time and resource requirements. Large task graphs can also represent multiple applications to be simultaneously executed on the target platform. The adopted architectural templates (similar to the ones in Figure 2.2) can easily represent embedded systems featuring either soft or hard processors (e.g., a Xilinx XUPV5 with MicroBlazes or the Zedboard's ARM Dual Core Cortex-A9 [25]), augmented with a set of hardware modules. We adopted three architectural templates as starting points for our experiments: static, mixed and reconfigurable. Static identifies architectures where the FPGA area is divided into a set of up to k_S static IP cores. Mixed identifies architectures where both IP cores and reconfigurable regions are employed to devise a solution, but no more than k_M^IP IPs and k_M^R reconfigurable regions may be used at once. Reconfigurable represents an architecture with no more than k_R regions. Note that in the second and the last cases, the reconfigurable regions can also be deployed as static cores in the final architecture, in case they are assigned only one task. We generated two sets of architectures from these templates, based on commercially available Xilinx Zynq-7000 FPGA devices: an Artix-7 and a Kintex-7, with 28,000 and 125,000 logic cells, respectively. For lack of space, we report only the results related to the Kintex-7, given that the other ones show a similar behavior. Figure 2.5 reports the results obtained when executing SMASH for 75 generations with 10 ants each.

Figure 2.5: Speedups of mixed and reconfigurable architectures with respect to the static one, as a function of the number of tasks (12 to 100).

Indeed, this combination of parameters is an empirically good compromise between greedy search and evolution towards a global optimum based on the pheromone matrices. Note that, in the graph, we report the execution times normalized with respect to the fully static execution. Table 2.3 reports instead information about the resulting architectures, along with information about hardware accelerated tasks and reconfigurations. Results show that SMASH is always able to map a large number of tasks in hardware, generally obtaining relevant speedups when starting from the mixed and fully reconfigurable templates. In particular, between 40 and 80 nodes, SMASH is able to obtain speedups up to 3x with respect to static architectures, which limit the number of tasks that can be ported to hardware without exploiting PDR. It is worth noting that, for small instances, reconfiguration can also introduce a slow-down in the execution time since, in this case, its overhead has a proportionally larger impact. On the other hand, larger instances show limited speedups because fewer tasks can fit in the available area and both software tasks and data transfers can affect the execution time. The results in Table 2.3 also show that SMASH is effectively able to identify the best combination of static IP cores and reconfigurable regions based on the problem structure. Indeed, when starting from both mixed and fully reconfigurable architectures, it is able to devise which hardware modules need to be reconfigured and, at the same time, determine how to assign and schedule the tasks to mask the reconfiguration overhead.

Table 2.3: Results in terms of architectures (number of static IPs and reconfigurable regions - IPs and RRs), along with number of hardware tasks (HW tasks) and required reconfigurations (#Reconf).

         |            Static            |            Mixed             |        Reconfigurable
#Tasks   | IPs  RRs  HW tasks  #Reconf  | IPs  RRs  HW tasks  #Reconf  | IPs  RRs  HW tasks  #Reconf
12       |  7    0      7        0      |  7    0      7        0      |  6    0      6        0
20       | 20    0     20        0      | 18    1     20        1      | 17    1     19        1
31       | 30    0     30        0      | 20    4     31        7      | 16    7     30        7
41       | 30    0     30        0      | 18    8     40       14      | 12   12     40       16
52       | 30    0     30        0      | 17    9     51       25      |  8   17     51       26
60       | 30    0     30        0      | 15   10     53       28      | 10   14     51       27
70       | 30    0     30        0      | 17    9     55       28      |  9   16     58       33
83       | 30    0     30        0      | 15   11     80       54      |  6   19     81       56
90       | 30    0     30        0      | 23    3     31        5      |  9   12     39       18
100      | 30    0     30        0      | 16    7     46       23      |  3   17     53       33

It is also worth noting that SMASH is able to efficiently exploit the available hardware resources, always occupying more than 90% of them and never violating the total area constraint. In conclusion, the results show that the proposed methodology is effectively able to support the designer in the development of reconfigurable architectures, limiting the impact of the decisions taken by the designer about the configuration of the initial architectural template in terms of hardware modules.

2.6 Conclusions and Future Works

In this chapter I presented SMASH, a heuristic and iterative methodology for supporting the design of reconfigurable embedded systems. The methodology has been tested using a set of synthetic task graphs of different sizes with respect to different architectural templates. It proved to be effective, overcoming the classical limitation in the design of such systems, where the designer is faced with the problem of manually deciding the structure of the architecture.

Future works will focus on two aspects: the first is the integration of multiple metrics to optimize multiple objectives, for instance performance and power consumption; the second is the integration of a floorplanning phase to also identify the physical constraints of the resulting modules.

CHAPTER 3

On the Partitioning of a Graph of Accelerators

In this Chapter, we further elaborate on the work presented so far, hypothesizing that a more general structure for our Streaming Stencil Time-steps (SSTs) could be that of a graph. Were that the case, we would need a mechanism to partition such a network onto multiple, interconnected computing elements, i.e., Field Programmable Gate Arrays (FPGAs). Being similar in nature, we assume that our graph of accelerators can be ideally represented as the result of the manipulation of the code mediated by a process network generator – specifically, a polyhedral process network generator. For this, we assume that our code can be analyzed by PNGen [139, 143–145]. However extensively the problem has been explored in the state of the art, we could not find a single work that partitions a network using an efficient and effective heuristic partitioning scheme, guaranteeing a fast execution time and the feasibility of the overall partitions. We solve the issue by delineating one such scheme, which is exhaustively reported in this chapter.

3.1 Introduction

Process Network (PN) based models of computation have proven to be a successful framework for describing multiple kinds of applications in the Reconfigurable Hardware (RH) domain. Due to their intrinsically parallel and reactive behavior, and to the well-known techniques to automatically manipulate some of their instances, they are very amenable to Field Programmable Gate Arrays (FPGAs). One problem associated with PNs is that the number of nodes is usually proportional to the parallel portions of the computation, and a tool to automatically map tasks to FPGAs is required when multiple FPGAs are employed to improve performance (via increased parallelism). While it is possible to solve this problem in an exact manner via dynamic programming approaches, this is not the case when practical graphs are under examination, i.e., graphs with potentially thousands of nodes. In this work we extend a well-known graph partitioning technique, namely the Multi-Level K-ways partitioning algorithm, in order to cope with such a scenario. The Graph Partitioning Problem (GPP) plays a major role in data analysis, machine learning, computer science, engineering, and related fields. Most graph partitioning algorithms optimize a ratio between the cut and the size of the partitions, leading to an NP-Complete problem [135]. This makes it impractical to partition large networks, which is the reason why an entire field arose to cope with this problem, namely Approximated Graph Partitioning. Given an unweighted graph G with V nodes and E edges and given a number K, the Graph Partitioning Problem is to divide the V nodes into K parts such that the number of edges connecting nodes in different parts is minimized, under the condition that each part contains roughly the same number of nodes. If the graph is weighted, i.e., the nodes and edges have weights associated with them, the problem considers the sum of the weights of the edges connecting nodes in different parts, while keeping the sums of the node weights in each partition roughly the same. The problem can be reduced to one where groups of nodes are merged to build a smaller graph with fewer nodes, intrinsically easier to partition in the so-called initial partitioning phase. In the approximated version of this problem, adequate (possibly random) heuristics are employed to do so [113, 148].

Among the many successful heuristics for partitioning large, highly interconnected graphs, the Multi-Level Graph Partitioning approach stands apart for both the average quality of the result (i.e., the difference between the resulting final partitions and cut sizes and those generated via the solution of an equivalent optimal problem) and the execution time, usually confined to a few minutes on large instances (millions of nodes and arcs) on commodity-level machines. In this approach the graph is recursively contracted to create smaller and smaller graphs which should reflect the same basic structure as the input graph [173]. After that, an initial partitioning algorithm is applied to the smallest graph, in order to obtain a seeding partitioning. Then, this initial partitioning is progressively de-contracted (un-coarsening) and, at each level, a local search method is used to improve the partitioning (de-contraction/un-coarsening step) induced by the coarser level. The Fiduccia-Mattheyses heuristic for refining the partition after the initial partitioning step is employed in this (and other) work to improve the edge cut [113]. Although several successful Multi-Level partitioners have been developed in the last two decades, to the best of our knowledge, none copes with the following specific scenario. Suppose we have a graph G = (V, E) representing an application. Each node (which we will call process) represents a potentially recurrent, potentially periodic task, while edges (which we will call channels) represent FIFOs between processes. In this scenario, each process p is further characterized by the amount of resources R_p required to implement it on an FPGA, and channels are characterized by an amount of sustained data transferred. Additionally, we want to fully exploit this model to compute (i.e., execute processes and data transfers) in parallel, on a multi-FPGA system. In this case, between each pair of FPGAs involved in the system, only B_max data can be transferred per unit of time, and each FPGA has an amount of resources R_max. This is a basic yet accurate representation of the common scenario in which a multi-FPGA system is designed. In this case, the partitioning of the network (for mapping purposes) must take into account how many processes can run on a single FPGA, and which nodes to map onto which FPGA, in order to cope with the given resource constraints. The first constraint is related to the cut size between each pair of final partitions. In order to meet this constraint we must consider the cut size not only in the original graph but also between each pair of final partitions, so that the cut size between each pair of partitions is less than or equal to B_max.

The second constraint is related to the resources consumed by each node (and, eventually, by each partition). These two constraints, along with the problem formulation, make up the novel contribution of this work. In this work we present an algorithm that seeks and finds a solution of the Approximated Graph Partitioning Problem satisfying the two major constraints that arise when mapping process networks (like Polyhedral Process Networks or Kahn Process Networks, to name just a few) onto FPGAs. We use a classical approach to ratio problems, where we repeatedly ask whether the solution is greater than or less than some constant related to our constraints, based on the Multi-Level approach. The rest of the chapter proceeds as follows. As the state of the art of GPP is vast, Section 3.2 presents a thorough review of it in order to understand where this work fits and what problem we addressed. Section 3.3 reviews, in particular, the basics of Multi-Level, K-ways partitioning. Section 3.4 describes how we extended previous work in order to cope with the mapping problem at hand. Section 3.5 presents experimental results, and the chapter is ended by Section 3.6 with comments and future work.
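To state the two constraints informally described above in a compact form (the notation is ours; the thesis text does not fix a specific formulation at this point), the K-way mapping problem can be seen as finding a partition V = V_1 ∪ ... ∪ V_K of the process network such that, in LaTeX notation:

\begin{align}
\sum_{p \in V_i} R_p &\leq R_{\max} && \forall\, i \in \{1,\dots,K\} \quad \text{(per-FPGA resource budget)}\\
\sum_{\substack{(p,q) \in E \\ p \in V_i,\; q \in V_j}} w(p,q) &\leq B_{\max} && \forall\, i \neq j \quad \text{(per-pair sustained bandwidth)}
\end{align}

where $R_p$ is the resource requirement of process $p$ and $w(p,q)$ is the sustained data rate of the channel between processes $p$ and $q$.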

3.2 Related Work

There has been a large amount of research on the GPP, so we refer the reader to [42, 91, 181] for most of the material. Since finding an optimal partitioning is NP-Complete (a well-known result [97]), one is forced to settle for approximation algorithms in order to find a solution (even though non-optimal, in the general case) in a practical amount of time. Part of the investigation in this area concentrates on approaches to solving the Two-Ways Partitioning Problem (TWPP) for bisecting the graph (partitioning the graph into exactly two parts), which is also an NP-Complete problem. One of the earliest and maybe the most well-known heuristic algorithms for partitioning graphs was described in [112]; it takes two separate sets as an initial solution of the problem, and trades pairs of nodes between them in order to obtain a candidate solution. Branch and Bound (B&B) strategies that solve the partitioning problem in the case of K = 2 for general weighted graphs have also been presented in [171]. Yan and Hsiao have presented a fuzzy clustering algorithm to solve the GBP and apply it to Circuit Partitioning [215].

Other authors have presented methods based on Genetic Algorithms [48], Divide & Conquer approximation algorithms [95] and even Ant Colony Optimization [49]. Linear programming (LP) methods became more popular after it was shown that they were able to find better cuts than KL. Spectral methods additionally became widely used, since they were fast and produced good results; these focus on the computation of eigenvectors of the adjacency matrix, and several works have used such techniques, e.g., [67, 103, 166]. As an alternative, Multi-Level algorithms for partitioning graphs were initially presented by [31] and [104]. Typically, such Multi-Level systems match pairs of adjacent nodes to define new merged graphs and recursively iterate this procedure in order to obtain a graph with an arbitrary number of nodes. The coarsest graph is then partitioned and the partition is refined on all the graphs back to the original graph. Besides heuristics and approximate algorithms for solving the GPP, many authors have analyzed the lower bounds of known algorithms and special cases of graphs (e.g., [198] and [75]). Since GP is a hard problem, practical solutions are focused on heuristics. There are two broad categories of methods, Local and Global, which we consider here in greater detail.

3.2.1 Local Search Methods

Partitioning can be described as breaking a graph into sub-graphs, and recursively doing so under some constraints until each sub-graph contains an arbitrary (or smaller than a given marginal) number of nodes. Here we describe this problem through a method known as iterative improvement. The idea behind iterative improvement is to begin with an initial solution, and to iteratively produce new solutions until one is found that is "good enough". Optimality is measured with respect to a given goodness criterion. Most iterative improvement techniques are greedy: in a greedy algorithm, the new solution is accepted only if it is better than the old one. Non-greedy methods (like hill-climbing algorithms) sometimes accept a solution that is worse than the existing one; the reason hill-climbing algorithms do so is to avoid getting trapped in local minima. A hill-climbing algorithm can sometimes climb out of a local minimum and find a better solution by temporarily accepting a solution that is worse than the existing one.


Two well-known local methods in the context of iterative algorithms for GPP are the Kernighan-Lin and Fiduccia-Mattheyses algorithms, which were the first two-way cut heuristics adopted by local search strategies. Their significant disadvantage is the arbitrary initial partitioning of the node set, which might have a negative effect on the final solution quality. Broadly speaking, given a partition of a graph, a local search algorithm tries to improve an objective function by moving nodes between partitions. These algorithms let a node move at most once during one iteration of the algorithm. More costly local search algorithms, such as Tabu Search, eliminate this restriction as far as possible, i.e. a node can be moved several times during one iteration. However, today the majority of the methods for improving a given partition are variations of the FM algorithm.

Kernighan-Lin Algorithm

The Kernighan-Lin (KL) algorithm is one of the most popular algorithms for the TWPP. It works as follows:

1. The initial partition is generated randomly, creating two sub-graphs G1 and G2. If the graph has N nodes, the first N/2 are assigned to G1, and the rest are assigned to G2.

2. A solution is acceptable only if both sub-graphs contain more or less the same number of nodes.

3. The goodness of a solution is the number of graph edges that are cut between partitions.

4. New solutions are generated from old ones by selecting a subset of nodes from G1 and a subset of nodes from G2 and swapping them. To maintain acceptability, the two subsets always have the same size.
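To make the swap criterion of step 4 concrete, the following fragment is an illustrative sketch (not the implementation referenced in this chapter; the adjacency matrix adj and the membership array in_g1 are hypothetical) of the classical KL gain of exchanging a node a in G1 with a node b in G2, i.e. the reduction in cut size the swap would produce:

#include <stdbool.h>

#define MAX_N 1024

/* adj[u][v] holds the weight of edge (u, v), 0 if absent; in_g1[u] tells
 * whether node u currently belongs to sub-graph G1 (hypothetical layout). */
static int  adj[MAX_N][MAX_N];
static bool in_g1[MAX_N];

/* External cost: total weight of edges from u to the other sub-graph. */
static int external_cost(int u, int n) {
    int c = 0;
    for (int v = 0; v < n; v++)
        if (in_g1[u] != in_g1[v]) c += adj[u][v];
    return c;
}

/* Internal cost: total weight of edges from u to nodes of its own sub-graph. */
static int internal_cost(int u, int n) {
    int c = 0;
    for (int v = 0; v < n; v++)
        if (v != u && in_g1[u] == in_g1[v]) c += adj[u][v];
    return c;
}

/* KL gain of swapping a (in G1) with b (in G2): positive means the cut shrinks. */
int swap_gain(int a, int b, int n) {
    int d_a = external_cost(a, n) - internal_cost(a, n);
    int d_b = external_cost(b, n) - internal_cost(b, n);
    return d_a + d_b - 2 * adj[a][b];
}

A KL pass repeatedly selects the pair with the largest gain, tentatively swaps and locks it, and finally keeps the prefix of swaps with the best cumulative gain.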

KL's drawbacks are:

1. it handles unit node weights only,

2. it handles exact bi-sections only,

3. the time complexity of a pass is high, O(n^3).


Fiduccia-Mattheyses Algorithm

Several improvements have been made to the KL algorithm. The most important one is a slight adjustment of the algorithm, together with a decrease in running time, provided by Fiduccia and Mattheyses (FM) [87]. Fiduccia and Mattheyses suggested the following modifications:

1. Only one node is moved at a time,

2. Consecutive moves are made in opposite directions,

3. The algorithm maintains a sorted list of candidate interior nodes for moving to the other sub-graph, and updates it after each move.

They succeeded in decreasing the complexity of a single pass to O(n) by using appropriate data structures. Like the KL strategy, the FM strategy performs passes in which each node is moved at most once, and the best bi-section observed during an iteration is used as input for the next iteration. However, instead of selecting pairs of nodes, the FM method moves single nodes. Fiduccia and Mattheyses balanced the algorithm and adopted adequate data structures such that the asymptotic running time of their local search algorithm is reduced to linear time, O(n).
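The quantity driving an FM pass is the gain of moving a single node to the other sub-graph, which can be updated incrementally after each move; the sketch below (hypothetical data layout, shown only to illustrate the idea and not taken from the cited implementations) makes this explicit:

#define MAX_N 1024

/* Hypothetical layout: adj[u][v] is the weight of edge (u, v), 0 if absent;
 * side[u] is the sub-graph (0 or 1) that node u currently belongs to. */
static int adj[MAX_N][MAX_N];
static int side[MAX_N];
static int gain[MAX_N];   /* cut-size reduction obtained by moving u to the other side */

/* Initialise gains: external minus internal connectivity of every node. */
void init_gains(int n) {
    for (int u = 0; u < n; u++) {
        gain[u] = 0;
        for (int v = 0; v < n; v++)
            if (v != u && adj[u][v])
                gain[u] += (side[u] != side[v]) ? adj[u][v] : -adj[u][v];
    }
}

/* Move node u to the other sub-graph and update the gains of its neighbours;
 * with an adjacency-list representation this update costs O(degree(u)). */
void move_node(int u, int n) {
    side[u] = 1 - side[u];
    gain[u] = -gain[u];
    for (int v = 0; v < n; v++) {
        if (v == u || adj[u][v] == 0) continue;
        /* An edge that was cut is now internal (or vice versa): adjust v's gain. */
        gain[v] += (side[u] == side[v]) ? -2 * adj[u][v] : 2 * adj[u][v];
    }
}

A complete FM pass additionally keeps the nodes ordered by gain (typically in bucket lists), locks each node after its first move, and rejects moves that would break the balance constraint.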

3.2.2 Global Search Methods

Global search methods rely on the properties of the entire graph and do not depend on an arbitrary initial partition. One such technique (specifically aimed at solving the TWPP) is to formulate it as a quadratic optimization problem. However, due to the nature of the optimization problem, realistic graphs remain unmanageable. For this reason, a class of graph partitioning methods called Spectral Methods – the most common example of which is Spectral Partitioning, where a partition is derived from the analysis of the spectrum of the adjacency matrix – relax this optimization. Spectral techniques have been enhanced by several works, like [47, 151]. Unfortunately, to the best of our knowledge, none of the previous methods contemplate the partitioning of applications in the presence of simultaneous resource and bandwidth constraints (or the equivalent in the respective formulations).


Figure 3.1: Multi-Level Scheme [24]: coarsening phase, initial partitioning phase, and un-coarsening phase.

Other methods contemplate the Multi-Way Spectral Bisection Algorithm and Parallel Graph Partitioning [30, 98], as well as Multi-Level, K-Ways Partitioning. As this last technique is the basis for this work, we detail its inner workings in the following Section. Previous work – as presented in this brief recall of the state of the art – focuses on heuristically minimizing the cut size associated with the partitionings found. However, as those techniques focus on such minimization, to the best of our knowledge none address the problem that we approach in this work: a cut size minimization algorithm with novel constraints tightly related to the reconfigurable hardware domain.

3.3 Multi-Level, K-ways Partitioning

[104] formulated this strategy as it is known now. The Multi-Level approach to GP comprises three main phases, which are reported in Figure 3.1. In the contraction (coarsening) phase, a hierarchy of graphs is created. The most common way to build this hierarchy is to iteratively identify a matching M ⊆ E and contract the edges in M.


Contraction should rapidly reduce the size of the input, and each computed level should reflect the global structure of the input network.

Contraction is halted when the graph is small enough to be directly partitioned using some other, more costly algorithm, like the ones described in the previous Section (KL, FM, or spectral partitioning).

In the un-coarsening phase, the matched nodes and arcs – which had previously been merged together in the coarsening phase – are iteratively expanded back.

During un-contraction of the matched graphs, a local improvement algorithm moves nodes between partitions to improve the cut size or the balancing constraint; generally, variants of the FM algorithm are used. The idea behind this technique is that a good partition at one level will also be a good partition on the next finer level, so that local search rapidly finds a good solution. Moving a node at a coarse level of the hierarchy typically corresponds to moving a whole set of nodes at the finest level of the hierarchy. Intuitively, the Multi-Level scheme has a global view of the optimization problem on the coarse levels of the hierarchy and a very local view on the finest ones.

[103] is the first work to report a linear time O(n) implementation of this scheme to obtain K partitions (using recursive Multi-Level bi-section only on the coarsest level and a direct K-Way local search algorithm). A variant of the Multi-Level algorithm has been proposed in [148]. Their n-level approach is based on the extreme idea of contracting only one single edge between two consecutive levels of the Multi-Level hierarchy. During un-coarsening, local search is performed in a highly localized fashion around the un-contracted edge. Using sophisticated data structures, their algorithm requires sub-linear time on real graphs.

Compared with Multi-Level Spectral Bisection, Multi-Level K-Way partitioning is usually two orders of magnitude faster, and produces partitionings with generally smaller edge-cuts. This is why we employed this basic scheme for the implementation of our partitioning algorithm, which is described in Section 3.4.


3.4 Algorithm’s Internals

The proposed method is based on a variant of the aforementioned Multi-Level, K-Ways Partitioning (MLKWP) scheme. In the proposed algorithm, the input graph is coarsened down to a parametrized size (the default is 100 nodes). However, it is not un-coarsened and refined back to the original graph in just one step. Rather, it is un-coarsened up to a certain intermediate level and then coarsened back to the lowest level if needed. This process of un-coarsening and refining up to an intermediate level and coarsening again to the lowest level is repeated a parametrized number of times, depending on whether we are already meeting the constraints or not. At each iteration, we generate different intermediate clusterings, which are compared a posteriori using a goodness function; the best one (i.e. the one that is nearest to meeting the constraints) is chosen as the "correct" intermediate un-coarsening candidate. This step incentivizes rapid convergence while accounting for broad exploration of different clusterings. After the coarsening phase, we try to obtain the K different partitions with the help of the initial partitioning phase.

3.4.1 Coarsening Phase

In the coarsening phase we use three different types of matchings, in order to better explore different results given multiple search strategies:

• Random Maximal Matching,
• Heavy Edge Matching,
• K-Means Matching.

Random Maximal Matching Nodes of the graph are visited in random order. If a node u is not matched yet, one of its un-matched neighboring nodes is randomly selected (two nodes are adjacent if there exists an edge incident to both of them). If such a node v exists, the edge (u, v) is included in the matching and the nodes u and v are marked as matched; u remains un-matched in the random matching if there is no un-matched adjacent node v.


The goal in GP is to minimize the sum of the weights of the edges between nodes lying on the boundary of the parts of the graph; a randomized algorithm finds a maximal matching without considering edge weights, so it may not always produce satisfactory results for every graph. In order to decrease the edge cut value, heavy edge matching [113] can be used.

Heavy Edge Matching As the name suggests, the edges are sorted according to their weights and matching begins by selecting the heaviest edge. All edges are visited in descending order of weight, and edges with un-matched end points are selected. This heuristic is used when the graph size has been reduced substantially, so that not much work is spent sorting the edges.

K-Means Matching Clusters are formed on the basis of their weight, and a subset of nearby nodes is chosen accordingly. The main idea of this method is to divide the graph into smaller partitions: it first divides the problem into multiple sub-partitions, by dividing the total number of nodes by the number of sub-problems desired, and then assigns each node to the partition nearest to the corresponding cluster [114].

We use all three heuristics (Random, HEM, K-Means) to compute the matching. These heuristics are employed at different times, multiple times, in order to find the best matching for the given graph; each time, we compare the results of the three heuristics with each other and choose the best one.

Once we obtain the matching of nodes, we create a map from the nodes in the un-coarsened graph to those in the coarsened graph. Then, using the matches and the map, the coarser graph is built, ready for the next iteration of the coarsening step. The adjacency matrix of the coarsened graph is adjusted according to the new incidence between coarser nodes. Edge weights, in particular, are copied over, except when the matched nodes have a common neighbor: in this case the edges are merged into one, and the new edge has a weight equal to the sum of the weights of the merged edges. Similarly, the new node gets the sum of the weights of the merged nodes, and any duplicate edge resulting from the process is merged, with the weights added together.

The coarsening phase of our algorithm continues until few nodes remain (for example 100 nodes – this is a parameter in our implementation). The resulting, most coarsened graph is then handed to the initial partitioning phase.
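As an illustration of the heavy edge matching step just described, the following sketch (a hypothetical edge-list layout, not the code of our tool) selects the matching and records, for every node, the coarse node it is merged into:

#include <stdlib.h>
#include <stdbool.h>

/* Hypothetical edge-list representation of the graph being coarsened. */
typedef struct { int u, v, w; } Edge;

static int cmp_weight_desc(const void *a, const void *b) {
    return ((const Edge *)b)->w - ((const Edge *)a)->w;   /* heaviest edge first */
}

/* Heavy edge matching: visit edges by descending weight and match free endpoints.
 * map[v] receives the index of the coarse node v is merged into; the function
 * returns the number of coarse nodes. */
int heavy_edge_matching(Edge *edges, int m, int n, int *map) {
    bool *matched = calloc(n, sizeof(bool));
    int coarse = 0;

    qsort(edges, m, sizeof(Edge), cmp_weight_desc);
    for (int e = 0; e < m; e++) {
        int u = edges[e].u, v = edges[e].v;
        if (!matched[u] && !matched[v]) {       /* both endpoints still free */
            matched[u] = matched[v] = true;
            map[u] = map[v] = coarse++;         /* u and v collapse into one node */
        }
    }
    for (int v = 0; v < n; v++)                 /* unmatched nodes survive unchanged */
        if (!matched[v]) map[v] = coarse++;

    free(matched);
    return coarse;
}

The coarse adjacency matrix is then obtained by accumulating the weight of every original edge (u, v) with map[u] != map[v] onto the entry (map[u], map[v]), which also realizes the merging of parallel edges described above.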


3.4.2 Initial Partitioning Phase

After reducing the original graph into multiple sub-partitions, we produce an initial partitioning of it, with a number of partitions much smaller than the required one. We adopt a greedy approach, as it is a heuristic that usually yields good results. Specifically, we partition the graph in such a way that we have a balanced amount of resources in each part (note how balancing resources is not a priority in our case, while meeting the resource and bandwidth constraints is). After that, we check the bandwidth between each pair of partitions and use the FM algorithm to meet the bandwidth constraint.

We start from the heaviest nodes. After finding the heaviest one, we place it in the first of the K available partitions and add its neighbors (the nodes connected to it via edges) as long as the total amount of resources assignable to each partition (Rmax) is not violated. We then apply the same procedure to the other partitions, until all nodes are assigned to exactly one partition. Since this method is sensitive to the initial node selection, the whole process is repeated with a parametrized number of randomly chosen initial nodes (10 by default). Since the coarsest graph has no more than a few hundred nodes (100 is our default), running this algorithm K times does not add much to the total partitioning cost. The partitioning that gives the best cut size is returned.

After this allocation we take care of the remaining nodes (if any) which are not assigned to any partition. First, we try to place each remaining node, according to its resources, in the partition with the largest free space for that node, and we do this for all remaining nodes. If after this step there are still nodes to assign, we assign each node to the partition with the largest free space even though this implies violating the Rmax constraint. After this step we check the bandwidth between each pair of partitions; if it does not meet the constraints, we use an FM-based algorithm to minimize it as far as possible: partitions are changed and nodes are moved between partitions until the constraints are met. We then un-coarsen as necessary, as described in the next Section, in order to obtain the right number of partitions, each meeting the constraints.
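The seed-and-grow step can be sketched as follows (a schematic rendering of the description above with hypothetical data structures; only the Rmax budget is enforced here, while bandwidth refinement and the fallback rules for leftover nodes are omitted):

#define MAX_N 512

/* Hypothetical coarse-graph layout: res[v] is the resource demand of node v,
 * adj[u][v] != 0 means u and v are connected, part[v] is the assigned partition
 * (-1 while unassigned). */
static int res[MAX_N];
static int adj[MAX_N][MAX_N];
static int part[MAX_N];

/* Grow partition p from the heaviest unassigned node, absorbing its neighbors
 * while the per-partition resource budget rmax is respected. */
void grow_partition(int p, int n, int rmax) {
    int seed = -1;
    for (int v = 0; v < n; v++)            /* pick the heaviest unassigned node */
        if (part[v] < 0 && (seed < 0 || res[v] > res[seed])) seed = v;
    if (seed < 0 || res[seed] > rmax) return;

    int used = res[seed];
    part[seed] = p;
    for (int v = 0; v < n; v++)            /* absorb neighbors of the seed */
        if (part[v] < 0 && adj[seed][v] && used + res[v] <= rmax) {
            part[v] = p;
            used += res[v];
        }
}

/* Build K partitions; remaining nodes are then handled by the fallback rules
 * described in the text, and bandwidth is refined with an FM-based pass. */
void initial_partitioning(int k, int n, int rmax) {
    for (int v = 0; v < n; v++) part[v] = -1;
    for (int p = 0; p < k; p++) grow_partition(p, n, rmax);
}

In the actual procedure the seed choice is also randomized and the whole process is repeated several times, keeping the partitioning with the best cut size, as described above.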


3.4.3 Un-Coarsening Phase

During the un-coarsening (refining) phase, the initial partition of the coarsest graph is projected onto the lower-level, finer graph. This procedure is repeated until a partition is projected onto the top-level graph and is refined to obtain the final partition, cut size, and resource allocation for the graph. The mapping vector is used to project the coarse graph partition onto the finer graph. If the constraints are not met, we go back to the coarsening phase and then to the partitioning phase (randomly), cyclically. If after a predetermined number of iterations a feasible partitioning is still missing, a message signals that partitioning with these constraints is either impossible, or that the tool must be given more time (i.e. iterations) to compute such a solution.
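Projection through the mapping vector is the simple operation below (hypothetical arrays, consistent with the coarsening sketch of Section 3.4.1): each fine node inherits the partition of the coarse node it was contracted into, after which local refinement runs on the finer graph.

/* Project a partition from the coarse graph onto the finer one: map[v] is the
 * coarse node that fine node v was contracted into during coarsening. */
void project_partition(const int *coarse_part, const int *map,
                       int *fine_part, int n_fine) {
    for (int v = 0; v < n_fine; v++)
        fine_part[v] = coarse_part[map[v]];
}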

3.5 Experimental Results

We compare METIS and GP using randomly generated graphs. We employ particularly small instances in the remainder of this Section in order to visualize the different behavior of the two tools when partitioning the given networks. In all cases, these graphs represent Process Networks generated via suitable tools. Each process (i.e. node) is characterized by an amount of resources required to implement that process on an FPGA (only one resource is considered at this time, for example LUTs), and each channel (i.e. edge) is characterized by an amount of bandwidth consumed. Only channels outgoing from and incoming to different partitions consume bandwidth – we assume that there is enough bandwidth on the FPGA to sustain the computation between nodes belonging to the same partition (i.e. the same FPGA). We synthetically generated a few graphs with the following goal in mind: to demonstrate that GP can always partition the given network while respecting resource and bandwidth constraints (or fail while trying to do so), while METIS always partitions, regardless of said constraints. Graphs are represented as incidence matrices, and are given as inputs to MATLAB. Tables 3.1 to 3.3 compare the results obtained running both METIS and GP. Various GP parameters are used across all experiments. For METIS, we used the default parameter values and decoded the results in MATLAB in order to compare them with GP.


We compare:

1. Local Edge Cut (i.e. the bandwidth insisting between each pair of partitions),
2. Maximum Resource Allocation (i.e. the maximum amount of resources consumed by any partition),
3. Algorithm Execution Runtime,
4. Global Edge Cut Sum.

The machine we employed features a 2.53 GHz Intel(R) Core(TM) i5-M 460 CPU with 8 GB of RAM running Ubuntu 14.04 64-bit. The code runs under MATLAB R2013a, and METIS 5.1.0 is used for the comparisons. We refer to the Graph Partitioner developed in this work as GP.

Experiment 1 We consider a graph with 12 nodes and 33 edges for the first experiment. The maximum bandwidth constraint is 16 units; the maximum resources constraint is 165 units. As Table 3.1 shows, METIS violates both constraints while GP meets both of them. However, the size of the cut is slightly bigger for GP, which is a consistent result: METIS tries to minimize the overall cut, but generally violates the bandwidth and/or resource constraints. GP does not violate any partition-to-partition bandwidth constraint; it does not globally minimize the edge cut in absolute terms, but it does minimize it under the bandwidth constraint. Figure 3.2 reports the unpartitioned graph (with the radius of each node proportional to its weight), Figure 3.3 the same graph with weights and edge allocations, Figure 3.4 the partitioning obtained with GP, and Figure 3.5 the partitioning obtained with METIS.

Experiment 2 We consider a graph with 12 nodes and 30 edges for the second experiment, and apply the following constraints: 25 for bandwidth and 130 for resources. METIS violates the resource constraint while (incidentally) meeting bandwidth, while, again, GP meets both of them. Incidentally, the local refinement strategy employed translates, for this graph, into a better overall global cut, as reported in Table 3.2. Figure 3.6 reports the unpartitioned graph (radius of nodes proportional to weight), Figure 3.7 the same graph with weights and edge allocations, Figure 3.8 the partitioning obtained with GP, and Figure 3.9 the partitioning obtained with METIS.


Figure 3.2: Un-partitioned sample graph 1 before weighting and resource allocation.

Experiment 3 In the last experiment, whose data are shown in Table 3.3, we consider a graph with 12 nodes and 32 edges. We apply the following constraints: 20 for bandwidth and 78 for resources. METIS violates bandwidth while (incidentally) meeting resources, while, again, GP meets both of them. Figure 3.10 reports the unpartitioned graph (radius of nodes proportional to weight), Figure 3.11 the same graph with weights and edge allocations, Figure 3.12 the partitioning obtained with GP, and Figure 3.13 the partitioning obtained with METIS.

3.5.1 Summary

As the experimental figures and tables show, GP can always (on the test cases) partition without violating the given constraints, which is not guaranteed with METIS. Additionally, in our test cases the increase in cut size is close to negligible; however, this might not be the case under stricter constraints.



Figure 3.3: Un-partitioned sample graph 1 after weighting and resource allocation.



Figure 3.4: Partitioning of sample graph 1 with the GP algorithm; both constraints are met. Constraints: bandwidth = 16 and resources = 163.

3.6 Conclusions

We presented a novel approach to partitioning a process network in the presence of simultaneous bandwidth and resource constraints, based on the Multi-Level, K-Ways approach already known in the literature. We developed a tool that extends METIS in that it copes with situations where partitioning must happen within precise bandwidth and resource constraints. Future work contemplates testing this system on actual multi-FPGA based systems, where the mapping of potentially large application graphs (process networks) is a difficult task to do by hand.



Figure 3.5: Partitioning of sample graph 1 with the METIS algorithm; both constraints are violated. Constraints: bandwidth = 16 and resources = 163.


Figure 3.6: Un-partitioned sample graph 2 before weighting and resource allocation.



Figure 3.7: Un-partitioned sample graph 2 after weighting and resource allocation.



Figure 3.8: Partitioning of sample graph 2 with the GP algorithm; both constraints are met. Constraints: bandwidth = 25 and resources = 130.


Figure 3.9: Partitioning of sample graph 2 with the METIS algorithm; the resource constraint is violated while bandwidth is met. Constraints: bandwidth = 25 and resources = 130.


Figure 3.10: Un-partitioned sample graph 3 before weighting and resource allocation.



Figure 3.11: Un-partitioned sample graph 3 after weighting and resource allocation.



Figure 3.12: Partitioning of sample graph 3 with the GP algorithm; both constraints are met. Constraints: bandwidth = 25 and resources = 130.


Figure 3.13: Partitioning of sample graph 3 with the METIS algorithm; the resource constraint is violated while bandwidth is met. Constraints: bandwidth = 25 and resources = 130.


K = 4        Total        Total      Maximum Resource   Maximum Local
Algorithm    Edge-Cuts    Time (s)   Allocation         Bandwidth
METIS        58           0.02       172                20
GP           70           0.33       163                16

Table 3.1: Number of Nodes = 12, Number of Edges = 33; METIS violates both constraints, while GP meets both.

K = 4        Total        Total      Maximum Resource   Maximum Local
Algorithm    Edge-Cuts    Time (s)   Allocation         Bandwidth
METIS        77           0.02       137                25
GP           62           0.25       127                18

Table 3.2: Number of Nodes = 12, Number of Edges = 30; METIS violates the resource constraint but meets bandwidth, while GP meets both.

K = 4        Total        Total      Maximum Resource   Maximum Local
Algorithm    Edge-Cuts    Time (s)   Allocation         Bandwidth
METIS        90           0.02       78                 38
GP           96           7.76       76                 6

Table 3.3: Number of Nodes = 12, Number of Edges = 32; METIS violates the bandwidth constraint but meets resources, while GP meets both.


CHAPTER 4

A Review of the Polyhedral Analysis Framework

In order to more easily read through the rest of this thesis, a thorough understanding of the background theoretical concepts is required. In this chapter I reconsider the technologies, methodologies and concepts that are relevant to this work and that qualify as state of the art. In Section 4.1 the Polyhedral Model (PM) is presented, along with some use cases, such as how it is employed to automatically extract information from the input program. In Section 4.4 I show the set of limitations imposed by Polyhedral Analysis (PA) on the code to be analyzed, while in Section 4.2 streaming-based systems on Field Programmable Gate Arrays (FPGAs) are introduced, since the proposed architecture is streaming-based as well. Section 4.3 is instead dedicated to High Level Synthesis (HLS), because it is the technology that, in this work, connects the PM with the generation of the architecture, and it also substantially eases the hardware design, enabling automation. Finally, Section 4.5 focuses on Iterative Stencil Loops (ISLs), which are


indeed the target of this entire work.

4.1 Introduction to Polyhedral Framework

In scientific and engineering applications, and in general in most computationally intensive programs, most of the execution time is spent in nested loops. This implies that the ability to restructure loop nests for optimization and parallelization is mandatory, although undoubtedly non-trivial. Standard compilers in fact use Intermediate Representations (IRs) such as syntax trees, call trees, and control-flow graphs, which are simply not appropriate to perform such a task, as the kind of abstraction of those techniques inevitably hides certain properties and features of programs, making it impossible to perform complex code transformations. These limitations have created the need to develop techniques specifically aimed at optimizing loop nests, to be used in place of or in combination with standard compilers. A first attempt in this direction was made in the eighties [162, 214], motivated by the need to map parallel computations onto systolic arrays [130]. It was based on the work of Karp et al. [111], who proposed a mathematical model built on uniform recurrence equations, which also inspired a series of fundamental works by Feautrier [79–83] in the late eighties, quickly followed by other works on the same topic, such as [22, 210, 211]. Those works provided a robust mathematical framework for regular imperative programs, and laid the basis of what is now known as the PM (sometimes called the Polytope Model). The proposed model rapidly evolved and gained importance, as it allowed programs to be mapped onto a mathematical representation, creating a solid link with algebra as well as Operations Research (OR), thus making it possible to extend their applicability also to the field of program optimization. With the aid of the Polyhedral Model, loop optimization has reached the point where a finely calibrated transformation can condense in a single step the equivalent of a significant number of textbook loop transformations [19, 99]. In the following sections, a detailed overview of the polyhedral framework is provided, starting from the model, described in Section 4.1.2, and explaining what can be accomplished with such a model, which is the topic of Section 4.1.3.


4.1.1 Motivating Examples

With the purpose of showing the usefulness of the PM, let us consider this matrix multiplication algorithm as a motivating example:

for (i = 0; i < ni; i++)
  for (j = 0; j < nj; j++)
    for (k = 0; k < nk; ++k)
      C[i][j] += alpha * A[i][k] * B[k][j];

This loop nest, as simple as it might seem, is indeed not optimized at all. In fact, just by considering a simple memory system, featuring only a single level of cache, it is obvious that the execution suffers a slowdown due to cache misses, since every instance of the statement requires two concurrent reads, from both A and B, resulting in no utilization of the cache. However, a simple restructuring completely changes the situation:

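One simple restructuring that addresses the locality issue (shown here only as an illustrative possibility, not necessarily the exact transformation intended by the original listing) is a loop interchange that brings k above j, so that A[i][k] stays fixed in the innermost loop and both B and C are traversed row by row:

for (i = 0; i < ni; i++)
  for (k = 0; k < nk; ++k)
    for (j = 0; j < nj; j++)
      C[i][j] += alpha * A[i][k] * B[k][j];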

Consider now the following loop nest:

for (i = 1; i < ni; i++)
  for (j = 0; j < nj; j++)
    A[i][j] = A[i-1][j] + 1;

As it is, this loop cannot be parallelized along the outer loop. The reason is the dependency between the point (i, j) of the loop instance and the previously computed point (i - 1, j): due to the way in which the array is traversed, it enforces a dependency between instances along the outer loop. However, there is a simple solution to this problem: by changing the way in which the loop nest is executed, the dependency scheme can be exploited to enable the parallelization of the outer loop.


Table 4.1: On the left, the numbers show the traversing order of the original code. The traversing order enforced after the transformation is instead shown on the right. The array dependencies are represented with an arrow.

for (j = 0; j < nj; j++)
  for (i = 1; i < ni; i++)
    A[i][j] = A[i-1][j] + 1;

The loop nest can now be partitioned among different, parallel computing units. As a last example, consider the following loop nest:

for (i = 1; i < ni; i++) {
  A[i] += A[i-1];
  B[i] += B[i-1];
}

The two statements are completely independent; hence, they could in principle be computed separately. The code is, however, written in a way that does not allow such an optimization. Splitting the loop in two solves the problem,

for (i = 1; i < ni; i++)
  A[i] += A[i-1];

for (i = 1; i < ni; i++)
  B[i] += B[i-1];

allowing the two loops to be executed in parallel.


These are indeed exactly the kinds of transformations that are enabled by the polyhedral framework.

4.1.2 Polyhedral Model

The PM has proved to be a powerful tool for automatic optimization and parallelization. In fact, at the price of certain regularity conditions, this model can deliver very high standards in terms of execution time, throughput, number of processors and communication channels, memory requirements, and so on. It is based on an algebraic representation of programs, whose manipulation allows complex sequences of optimizations to be constructed and searched for. This section precisely describes this model, giving a comprehensive overview of all the building blocks.

Mathematical Background

In order to understand the following concepts, this section provides the key definitions of polyhedral theory, the mathematical background on which the PM rests its foundations [154].

Definition 1. Convex Set. Given $S$ a subset of $\mathbb{R}^n$, $S$ is convex iff, $\forall \mu, \lambda \in S$ and given $c \in [0, 1]$:
$$(1 - c)\mu + c\lambda \in S$$
A set is convex if, for every pair of points within the object, each point on the line segment that joins the pair of points is also in the set.

Definition 2. Affine Function. A function $f: K^m \rightarrow K^n$ is affine if there exist a vector $\vec{b} \in K^n$ and a matrix $A \in K^{n \times m}$ such that:
$$\forall \vec{x} \in K^m,\ f(\vec{x}) = A\vec{x} + \vec{b}$$

Definition 3. Affine Space. A set of vectors is an affine space iff it is closed under affine combinations. A line in a vector space of any dimensionality is a one-dimensional affine space.

Definition 4. Affine Half-Space. An affine half-space of $K^m$ (affine constraint) is defined as the set of points:
$$\{\vec{x} \in K^m \mid \vec{a} \cdot \vec{x} \leq b\}$$

Definition 5. Affine Hyperplane. An affine hyperplane is an $(m-1)$-dimensional affine sub-space of an $m$-dimensional space.


A hyperplane divides the space into two half-spaces, the positive and the negative half-space. Each half-space can be represented by an affine inequality.

Definition 6. Polyhedron. A set $P \subseteq K^m$ is a polyhedron if there exists a system of a finite number of inequalities $A\vec{x} \leq \vec{b}$ such that:
$$P = \{\vec{x} \in K^m \mid A\vec{x} \leq \vec{b}\}$$
Equivalently, it can be defined as the intersection of finitely many half-spaces; hence the representation above, where each inequality corresponds to a face of the polyhedron.

Definition 7. Parametric Polyhedron. Given $\vec{n}$ the vector of symbolic parameters, $P$ is a parametric polyhedron if it is defined by:
$$P = \{\vec{x} \in K^m \mid A\vec{x} \leq B\vec{n} + \vec{b}\}$$

Definition 8. Polytope. A polytope is a bounded polyhedron.

Definition 9. Integer Hull. The integer hull of a rational polyhedron P is the largest set of integer points such that each of these points is in P.

Definition 10. Lattice. A subset $L$ of $\mathbb{Q}^n$ is a lattice if it is generated by integral combinations of finitely many vectors $a_1, a_2, \ldots, a_n$ ($a_i \in \mathbb{Q}^n$):
$$L = L(a_1, \ldots, a_n) = \{\lambda_1 a_1 + \ldots + \lambda_n a_n \mid \lambda_i \in \mathbb{Z}\}$$
If the $a_i$ vectors have integer coordinates, $L$ is an integer lattice.

Definition 11. Z-polyhedron. A Z-polyhedron is the intersection of a polyhedron and an affine integral full-dimensional lattice:
$$P' = \mathbb{Z}^n \cap P$$

Static Affine Nested Loop Program

Let us start with the most generic definition for the PM, as it provides the conditions under which a given program can be described in the PM.

Definition 12. Static Affine Nested Loop Program (SANLP) [138]. A SANLP consists of a set of statements and function calls, each possibly enclosed in loops and/or guarded by conditions. The loops do not have to be perfectly nested. All lower and upper bounds of the loops, as well as all expressions in conditions and array accesses, have to be affine functions of the enclosing loop iterators and static parameters. The parameters are symbolic constants, so their values cannot change during the execution of the program. Data communication between function calls must be explicit.


Static Control Parts

The next definition after SANLP, moving to a finer granularity, is that of Static Control Parts (SCoPs). A SCoP is a subclass of general loop nests that can be represented in the polyhedral model [38].

Definition 13. Static Control Part. A SCoP is a maximal set of consecutive instructions such that:

• the control structures are only for loops or if conditionals

• loop bounds and conditionals are affine functions of the surrounding loop iterators and the global parameters (values unknown at compilation time, but constant).

Even if the definition of SCoP may seem restrictive, a pre-processing stage can extend its applicability. As said, SCoPs are sets of statements; a polyhedral statement is the atomic building block of the polyhedral representation, and can be defined as:

Definition 14. Polyhedral Statement. A polyhedral statement is a program instruction that:

• is not an if conditional statement with an affine condition

• is not a for loop statement with affine loop bounds

• has only affine subscript expressions for array accesses

• does not generate control-flow effects

The resulting statements in the polyhedral representation may differ from those in the input source code, because the compiler may change the internal representation.

Iteration Domain

Iteration Domains capture the dynamic instances of all statements – i.e. all possible values of the surrounding loop iterators – through a set of affine inequalities. In order to get to the definition in a rigorous manner, let us first of all define what an iteration vector is:


Definition 15. Iteration Vector. For a polyhedral statement, the iteration vector of a multi-level loop nest over an $m$-dimensional grid is a vector of iteration variables, $\vec{i} = (i_0, i_1, \ldots, i_{m-1})^T$, where $i_0, \ldots, i_{m-1}$ are the iteration variables from the outermost to the innermost loop.

Starting from the iteration vector, the Iteration Domain can be defined as:

Definition 16. Iteration Domain [81]. The Iteration Domain (ID) $D \subseteq \mathbb{Z}^m$ is the set of iteration vectors of the loop nest, and is expressed by a set of linear inequalities $D = \{\vec{i} \mid P\vec{i} \geq \vec{b}\}$.

Each integral point inside this polyhedron corresponds to exactly one execution of a statement, and its coordinates in the domain match the values of the loop iterators at the execution of this instance. This model lets the compiler manipulate statement execution and iteration ordering at the most precise level.

Table 4.2: ID Example.
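As a concrete illustration of the kind of content Table 4.2 refers to (the loop nest below is an assumed, representative example rather than the original one), consider:

for (i = 0; i < N; i++)
  for (j = 0; j <= i; j++)
    S(i, j);

The iteration domain of S is the parametric polytope $D_S = \{(i, j) \in \mathbb{Z}^2 \mid 0 \leq i \leq N - 1,\ 0 \leq j \leq i\}$, and every integral point of $D_S$ corresponds to exactly one dynamic instance of S.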

Notice that, to model IDs whose sizes are known only symbolically at compile time, parametric polyhedra are used. Since the definitions of iteration vector and ID have just been introduced, the notion of lexicographic order can now be provided, as it will be useful to effectively model both data dependencies and schedules.

Definition 17. Lexicographic Order [81]. The lexicographic order relation $\succ$ between two iteration vectors $\vec{i}$ and $\vec{j}$ is defined as:
$$\vec{i} \succ \vec{j} \Leftrightarrow (i_0 > j_0) \vee (i_0 = j_0 \wedge i_1 > j_1) \vee \ldots \vee (i_0 = j_0 \wedge \ldots \wedge i_{m-2} = j_{m-2} \wedge i_{m-1} > j_{m-1})$$


Data Dependencies

The modeling of data dependencies is crucial for the effectiveness of the PM, since not all program transformations preserve the semantics, and the semantics are automatically preserved if the dependencies are preserved. Here, some important definitions for data dependency analysis and representation are given. First, an essential notion for modeling dependencies in the PM is the subscript function, together with the notions of image and preimage.

Definition 18. Subscript Function [36]. Given the set of arrays $A_P$ of a program $P$, a reference to an array $B \in A_P$ in a statement $S \in S_P$ is written $\langle B, f \rangle$, where $f$ is the subscript function. If $f$ is affine, it can be written as $f(\vec{x}) = F\vec{x} + \vec{a}$, where $F$ is the subscript matrix and $\vec{a}$ is a constant vector.

Table 4.3: Subscript Function Example. The three subscript functions are relative to the three array accesses for s, a and x.
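To make the notion concrete, assume (as an illustrative guess at the statement behind Table 4.3, whose content is not reproduced here) the statement s[i] = s[i] + a[i][j] * x[j] inside an (i, j) loop nest. Its three subscript functions are then
$$f_s(i, j) = \begin{pmatrix}1 & 0\end{pmatrix}\begin{pmatrix} i \\ j \end{pmatrix}, \quad f_a(i, j) = \begin{pmatrix}1 & 0 \\ 0 & 1\end{pmatrix}\begin{pmatrix} i \\ j \end{pmatrix}, \quad f_x(i, j) = \begin{pmatrix}0 & 1\end{pmatrix}\begin{pmatrix} i \\ j \end{pmatrix},$$
with the constant vector $\vec{a} = \vec{0}$ in all three cases.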

Definition 19. Image. The image of a polyhedron $P \subseteq \mathbb{Z}^n$ by an affine function $f: \mathbb{Z}^n \rightarrow \mathbb{Z}^m$ is the Z-polyhedron $P'$:
$$P' = \{ f(\vec{x}) \in \mathbb{Z}^m \mid \vec{x} \in P \}$$

Definition 20. Preimage. The preimage of a polyhedron $P \subseteq \mathbb{Z}^n$ by an affine function $f: \mathbb{Z}^m \rightarrow \mathbb{Z}^n$ is the Z-polyhedron $P'$:
$$P' = \{ \vec{x} \in \mathbb{Z}^m \mid f(\vec{x}) \in P \}$$

The image of a polyhedron by an affine invertible function is a Z-polyhedron. The image of a polyhedron by a subscript function $f_A$ over an ID $D_S$ is the set of cells of $A$ accessed from the statement $S$. Thanks to these notions, the data domain (or data space) of a given array reference can be easily modeled: it is enough to compute the image of the ID of the statement by the reference's subscript function.


Within the context of PM dependency analysis, there is another important definition that must be provided, as it can be useful to check the legality of a given transformation, but it can be employed for many other purposes as well. This is the so-called data distance vector, which comes together with the definitions of lexicographically non-negative distance vector and, as an extension, the legality condition for a given distance vector.

Definition 21. Data Distance Vector. Consider two subscript functions $f_A^R$ and $f_A^S$ to the same array $A$ of dimension $n$. Let $\nu$ and $\sigma$ be two iterations of the innermost loop. The data distance vector is defined as the $n$-dimensional vector:
$$\delta_{f_A^R f_A^S}(\nu, \sigma) = f_A^R(\nu) - f_A^S(\sigma)$$

Definition 22. Lexicographically Non-Negative Distance Vector. A distance vector $v$ is lexicographically non-negative when the left-most non-zero entry of $v$ is positive, or all elements of $v$ are zero.

Definition 23. Legal Distance Vector. A distance vector is legal when it is lexicographically non-negative (assuming that indices increase).

In order to easily define the notion of polyhedral dependency, a few introductory definitions are first needed, the first being the Bernstein conditions.

Definition 24. Bernstein Conditions [41]. Given two references, there exists a dependency between them if the three following conditions hold:

• they reference the same memory location;
• one of the accesses is a write;
• the two associated statements are executed.

Let us consider two statement instances, S0 and S1, with S0 occurring before S1. There are three categories of dependencies that can be identified [105]:

• Read After Write (RAW), S1 reads what is written by S0. If the dependency is not respected, S1 incorrectly gets the old value.

• Write After Read (WAR), S1 writes a destination after S0 reads it. If the dependency is not respected, S0 incorrectly gets the new value.


• Write After Write (WAW), S1 writes to a memory location after S0. If the dependency is not respected, the writes end up being performed in the wrong order, leaving in the destination the value written by S0 rather than the value written by S1.

As already stated, to preserve the semantics of the program, instances containing dependent references should not be executed in a different order. [154] classifies the dependency relation into three kinds:

• Uniform dependencies: the distance between dependent iterations remains constant;
• Non-uniform dependencies: the distance between dependent iterations varies during execution;
• Parametric dependencies: the distance between two dependent relations is expressed with respect to at least one parameter.

Finally, leveraging the previously given definitions, let us define when two statements are said to be in dependence in the PM:

Definition 25. Dependency of Statement Instances. A statement S depends on a statement R (R → S) if there exist operations $S(\vec{x}_S)$ and $R(\vec{x}_R)$ and a memory location m such that:

• $S(\vec{x}_S)$ and $R(\vec{x}_R)$ refer to the same memory location m, and at least one of them writes to that location;

• $\vec{x}_R$ and $\vec{x}_S$ belong to the IDs of R and S, respectively;

• in the original sequential order, $S(\vec{x}_S)$ is executed after $R(\vec{x}_R)$.

To effectively model dependencies between statements, a Data Dependency Graph can be employed.

Definition 26. Data Dependency Graph. A Data Dependency Graph (DDG) $G = (V, E)$ is a directed multi-graph in which each vertex represents a statement. An edge $e \in E$ from R to S represents a dependency between the source and the target, due to a conflicting access in R and S.

Another useful representation in polyhedral theory is the dependence polyhedron, used in combination with the DDG. The dependence polyhedron provides the relation between the instances of the statements S and R. It is possible to obtain this kind of information


because there exists an affine relation between the iterations and the accessed data for regular programs, which can be obtained thanks to the previously defined subscript function. Before providing the definition of the dependence polyhedron, the involved elements must first be introduced. First of all, the IDs (being sets of affine inequalities) of S and R can be described as $A_S\vec{x}_S + c_S \geq 0$ and $A_R\vec{x}_R + c_R \geq 0$, where $\vec{x}_S$ and $\vec{x}_R$ are the iteration vectors of S and R. A dependence between S and R means that they refer to the same memory location, which implies that the two subscript functions are equal, hence $F_S\vec{x}_S + a_S = F_R\vec{x}_R + a_R$ (both expressed as in Definition 18). There is also a precedence order between S and R at the given dependence level, i.e. the common loop depth $l$ at which the dependency takes place. For each dependence level $l$, the precedence constraints are:

• the equality of the loop index variables at any depth less than $l$: $x_{R,i} = x_{S,i}\ \forall i < l$;
• S is executed after R at the common depth $l$: $x_{R,l} < x_{S,l}$.

If S and R do not share any loop, there is no additional constraint and the dependence only exists if S is syntactically after R. These constraints can be expressed using linear inequalities, i.e. $P_{l,S}\vec{x}_S - P_{l,R}\vec{x}_R + b \geq 0$.

Definition 27. Dependence Polyhedron. The dependence polyhedron

$D_{R,S,f_R,f_S,l}$ for R → S, at a given level $l$ and for a given pair of references $f_R$, $f_S$, is described as:
$$
D_{R,S,f_R,f_S,l}:\quad
D_{R,S}\begin{pmatrix}\vec{x}_R\\ \vec{x}_S\end{pmatrix} + \vec{d}_{R,S} =
\begin{pmatrix}
F_R & -F_S\\
A_R & 0\\
0 & A_S\\
P_R & -P_S
\end{pmatrix}
\begin{pmatrix}\vec{x}_R\\ \vec{x}_S\end{pmatrix}
+
\begin{pmatrix}
a_R - a_S\\
c_R\\
c_S\\
b
\end{pmatrix}
\;\begin{smallmatrix}=\\ \geq\end{smallmatrix}\; \vec{0}
$$
where the first block of rows (the subscript equality) holds with equality and the remaining blocks hold with $\geq$.

Given all the definitions above, the Polyhedral Model can finally be defined:

Definition 28. Polyhedral Model [202]. The polyhedral model of a sequential program consists of a list of statements, each represented by:

• an identifier;


Table 4.4: An example of a dependence polyhedron. In this example the polyhedron over the iteration vectors (one for the first statement, two for the second) and the scalar part are condensed in a single matrix (notice the 1 after the three iteration vectors). On the right there is a visual representation of the dependencies among the instances of the two statements.

• a dimension $d_i$;
• an ID;
• a list of accesses;
• a location.

A subscript function and a type (read or write) are associated with each array access.

Schedules

The ID does not describe the order in which each statement instance has to be executed with respect to the other instances. A scheduling function specifies a virtual timestamp for each instance of the corresponding statement, providing an order relation between statement instances. Hence, statement instances are executed according to the increasing order of their timestamps; if two instances have the same timestamp, they can run in parallel.

Table 4.5: A simple schedule example. In this picture, the statement on the left has an identity schedule, as the statement instances are trivially the points (i, j) within the statement ID.
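As an assumed illustration of the identity schedule mentioned in Table 4.5 (the original picture is not reproduced here), for a statement S enclosed in an (i, j) loop nest the identity schedule is
$$\Theta_S(\vec{x}_S) = \begin{pmatrix} i \\ j \end{pmatrix},$$
so each instance (i, j) receives the multidimensional timestamp (i, j) and the instances execute in the original lexicographic order; a schedule such as $\Theta_S(\vec{x}_S) = (j, i)^T$ would instead express a loop interchange.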


Definition 29. Affine Schedule [153]. Given a statement S, a p-dimensional affine schedule $\Theta_S$ is an affine form of the outer loop iterators $\vec{x}_S$ and the global parameters $\vec{n}$:
$$\Theta_S(\vec{x}_S) = T_S \begin{pmatrix}\vec{x}_S\\ \vec{n}\\ 1\end{pmatrix}, \quad T_S \in K^{p \times (\dim(\vec{x}_S) + \dim(\vec{n}) + 1)}$$

A schedule assigns a timestamp to each executed instance of a statement. A schedule can be:

• one-dimensional, if $T$ is a vector;
• multidimensional, if $T$ is a matrix.

A one-dimensional schedule expresses the program as a single sequential loop, while a multidimensional schedule expresses the program as one or more nested sequential loops [157]. There are, however, schedules which by construction are not legal, i.e. they enforce an execution order which violates the dependencies. The following definitions are essential to model this condition in the PM.

Definition 30. Precedence Condition. Given $\Theta_R$ a schedule for the instances of R and $\Theta_S$ a schedule for the instances of S, $\Theta_R$ and $\Theta_S$ are legal schedules if $\forall \langle \vec{x}_R, \vec{x}_S \rangle \in D_{R,S,f_R,f_S,l}$ (i.e. for each pair of instances of R and S in dependence, as specified in the corresponding dependence polyhedron $D_{R,S,f_R,f_S,l}$):
$$\Theta_R(\vec{x}_R) \prec \Theta_S(\vec{x}_S)$$

Definition 31. Legal Schedule. A schedule $\Theta$ is legal if the precedence condition holds.

Lemma 1. Affine Form of the Farkas Lemma. Let $D$ be a nonempty polyhedron defined by $A\vec{x} + \vec{b} \geq \vec{0}$. Then any affine function $f(\vec{x})$ is non-negative everywhere in $D$ iff it is a positive affine combination:
$$f(\vec{x}) = \lambda_0 + \vec{\lambda}^T (A\vec{x} + \vec{b}), \quad \text{with } \lambda_0 \geq 0 \text{ and } \vec{\lambda} \geq \vec{0}$$
$\lambda_0$ and $\vec{\lambda}^T$ are called the Farkas multipliers.

The Farkas lemma allows the precedence constraints to be translated into an affine equivalent, i.e. an affine function. In order to satisfy the dependency R → S (Definition 25), a schedule must satisfy [157]:
$$\Theta_R(\vec{x}_R) < \Theta_S(\vec{x}_S)$$


for each point of the dependence polyhedron $D_{R,S,f_R,f_S,l}$. Hence:
$$\Delta_{R,S} = \Theta_S(\vec{x}_S) - \Theta_R(\vec{x}_R) - 1$$
must be non-negative everywhere in $D_{R,S,f_R,f_S,l}$:
$$\Delta_{R,S} \geq 0$$
The set of legal schedules satisfying the dependency R → S is thus given by the relation:
$$\Delta_{R,S} = \lambda_0 + \vec{\lambda}^T \left( D_{R,S} \begin{pmatrix}\vec{x}_R\\ \vec{x}_S\end{pmatrix} + \vec{d}_{R,S} \right) \geq 0$$
where $D_{R,S}$ is the constraint matrix representing the dependence polyhedron $D_{R,S,f_R,f_S,l}$ over $\vec{x}_R$ and $\vec{x}_S$, and $\vec{d}_{R,S}$ is the scalar part of these constraints, as described in Definition 27.

4.1.3 Polyhedral Transformations

So far, the PM has been described, providing the mathematical toolset which allows sophisticated optimization heuristics to be designed by combining analysis power, transformation expressiveness and flexibility. In this section, instead, the framework built on top of the PM is illustrated in all of its phases: analysis/representation, transformations and, as the last step, code generation.

Static Control Parts Extraction

The first task is obviously SCoP extraction, as it enables the subsequent manipulations done in the successive phases. Briefly, it can be summarized by the following steps [37]:

1. Information Gathering: it consists of traversing the syntax tree of a given function, storing during this sweep loop counters, bounds and strides, conditionals, array references, and parameters.

2. Affine Loops Recognition: once the collecting phase is done, the identified loops are inspected in order to select the static control ones. First of all, bound expressions are checked in order to extract only those with affine conditions. Then, conditionals are also checked to further refine the extraction, since they must be affine expressions of parameters and loop counters. Finally, only array references whose subscript function is also an affine expression of parameters and loop counters are selected.

3. SCoPs Building: in this phase the syntax tree is traversed once again, but this time aided by the previously extracted


information, and only for the parts containing the loops that remain after the aforementioned refinements, in order to build the set of SCoPs. First, a new SCoP is created; then, for each static operational or control node in the loop body:

• if it is a loop, it is added to the SCoP;
• if it is a conditional, it is added to the SCoP together with its branches;
• if it is neither a conditional nor a loop node, it is added to the SCoP;
• otherwise, the current SCoP is closed and a new one is created;
• the current SCoP is dropped if it eventually does not contain any loop.

4. Global Parameters Identification: finally, for each identified SCoP, loop bounds, conditionals and array references are iterated over to collect the global parameters.

Data Dependency Analysis

Once a function has been translated into the corresponding set of SCoPs, data dependency analysis takes place. The objective of this phase is to compute the set of statement instances which are in a dependence relation. Even though different approaches to this task have been proposed through the years, such as, for instance, the Omega Test [160], the widely accepted technique is the Data Flow Analysis proposed by Feautrier in [80]. Starting from this work, the state-of-the-art technique aims at building a Polyhedral Dependency Graph (PDG), consisting of a DDG in which, according to Definition 26, nodes are the statements and edges are the dependencies between them, together with, for each edge, a corresponding dependence polyhedron, as described in Definition 27. The procedure to build the DDG is characterized by Algorithm 2. Then, by traversing the obtained DDG, the dependence polyhedra are built, as shown in Algorithm 3. Note that the two operations can also be done concurrently, since the dependence polyhedra can be constructed right after each discovery of a new dependency (edge), resulting in a single algorithm. It must also be noticed that, whenever some types of dependencies are not needed, those dependencies can simply not be checked.


Algorithm 2: DDG Construction

Create a graph in which every node is a statement
for all pairs of nodes R, S do
    for all array references fR, fS do
        if fR and fS are on the same array then
            Compute the set Z of RAW, W of WAR, X of WAW dependencies
            if Z ≠ ∅ or W ≠ ∅ or X ≠ ∅ then
                Add an edge between R and S
                Mark the edge with the array reference
                Mark the edge with the corresponding dependency type
            end if
        end if
    end for
end for

Algorithm 3: Dependence Polyhedra Construction

for all pairs of nodes R, S do
    for all edges eR,S between those nodes do
        if R and S do not share any loop then
            min_depth ← 0
        else
            min_depth ← 1
        end if
        for all levels l from min_depth to number_of_common_loops do
            Build the dependence polyhedron DR,S,fR,fS,l
        end for
    end for
end for


In the case in which data reuse is the major concern, Read After Read (RAR) dependencies can also be checked [46], although they do not actually belong to the canonical data dependency categorization. Furthermore, redundant edges between nodes can be condensed, obtaining what is called a Polyhedral Reduced Dependency Graph (PRDG).

Program Transformations In a nutshell, the goal of a transformation is to modify the original execution order of the operations, i.e. the original schedule. At this point OR comes into play, since transformations always target a specific optimization objective (or, in some cases, more than one) such as latency, parallelism, data reuse, and so on. Obviously, in order not to alter the program's semantics, i.e. not to impair correctness, a legal schedule must be found: the schedule which optimizes the given objective function must be selected within the legal transformation space [156, 157]. Hence, finding a good scheduling algorithm is basically a two-step approach [154]: the first step consists of finding the solution set of all legal affine schedules, the second of finding an Integer Linear Programming (ILP) formulation for the objective function. After those two steps, an ILP solver can be used to find the optimal legal schedule. Quite a few loop transformations are achievable thanks to the PM. Below, an overview of them is provided and, for some of them, the description comes along with a simple example.

Loop Reversal It reverses the order in which values are assigned to the index variable, changing the direction in which the loop traverses its iteration range. This kind of transformation can enable further optimizations that were previously not applicable.

Listing 4.1: Before
for (i = 1; i < ni; i++)
    for (j = 1; j < nj; j++)
        A[i][j] += A[i-1][j] + 1;

Listing 4.2: After
for (i = 1; i < ni; i++)
    for (j = nj - 1; j >= 1; j--)
        A[i][j] += A[i-1][j] + 1;

Loop Interchange Also known as loop permutation, it consists of exchanging the position of two loops in a loop nest. It is mainly used to improve cache effectiveness, by modifying the order in which arrays are accessed. It can also be used to control the granularity of the work in nested loops, for instance by interchanging a parallel loop with a non-parallel one, thus modifying the amount of work per parallel instance.

Listing 4.3: Before
for (i = 1; i < ni; i++)
    for (j = 1; j < nj; j++)
        B[i] += A[i][j];

Listing 4.4: After
for (j = 1; j < nj; j++)
    for (i = 1; i < ni; i++)
        B[i] += A[i][j];

This technique is however only legal if the distance vectors of the loop nest remain lexicographically positive after the interchange.

Loop Shifting It is a technique where operations inside a loop body are reordered. Obviously, it cannot be applied whenever this reordering alters the dependencies. This transformation is sometimes also referred to as loop restructuring.

Loop Fusion It consists of combining two loop bodies, and is also known as jamming. The application of this transformation is safe only if no forward dependency between the two fused loops becomes a backward loop-carried dependency. It is used to enhance data reuse, reduce loop overhead, or eliminate synchronization between parallel loops.

Listing 4.5: Before
for (i = 1; i < ni; i++)
    A[i] = B[i];
for (i = 1; i < ni; i++)
    C[i] = B[i] * A[i];

Listing 4.6: After
for (i = 1; i < ni; i++) {
    A[i] = B[i];
    C[i] = B[i] * A[i];
}

Loop Distribution Also called fission, this transformation is basically the inverse of loop fusion. It breaks a single loop into multiple loops iterating over the same index range. It can be done only if splitting the loop body does not alter the dependencies between iteration instances. Its application can enable other transformations, reduce resource requirements, and allow partial parallelization.


Listing 4.7: Before
for (i = 1; i < ni; i++) {
    A[i] = B[i];
    C[i] = B[i] * A[i];
}

Listing 4.8: After
for (i = 1; i < ni; i++)
    A[i] = B[i];
for (i = 1; i < ni; i++)
    C[i] = B[i] * A[i];

Loop Peeling This transformation consists of extracting one iteration of a given loop. It is done essentially to enable other kinds of optimizations.
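As an illustration (a sketch, not one of the thesis's numbered listings), peeling the first iteration of a simple loop can be written as two equivalent functions:

void peel_before(int ni, double A[], const double B[])
{
    for (int i = 1; i < ni; i++)
        A[i] = B[i] + A[i-1];
}

void peel_after(int ni, double A[], const double B[])
{
    if (ni > 1)
        A[1] = B[1] + A[0];          /* peeled iteration i = 1 */
    for (int i = 2; i < ni; i++)     /* remaining iterations   */
        A[i] = B[i] + A[i-1];
}

The peeled copy exposes the special case of the first iteration (here, the read of A[0]) to subsequent transformations.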

Index-set Splitting Similar to peeling, but in this case the index set of the loop is split: instead of extracting a single iteration, the iteration space is divided among different loop instances.

Listing 4.9: Before
for (i = 1; i < ni; i++)
    C[i] = B[i] * A[i];

Listing 4.10: After
for (i = 1; i < ni/2; i++)
    C[i] = B[i] * A[i];
for (i = ni/2; i < ni; i++)
    C[i] = B[i] * A[i];

Loop Skewing It takes a nested loop iterating over a multidimensional array, in which each iteration instance of the inner loop depends on previous iterations, and rearranges its array accesses so that the only dependencies are between iterations of the outer loop. Technically speaking, the transformation makes the bounds of the inner loop depend on the outer loop counter, enabling inner loop parallelization.

Listing 4.11: Before
for (i = 1; i < ni; i++)
    for (j = 2; j < nj; j++)
        A[i][j] = A[i-1][j] + A[i][j-1];

Listing 4.12: After
for (i = 1; i < ni; i++)
    for (j = i + 2; j < i + nj; j++)
        A[i][j-i] = A[i-1][j-i] + A[i][j-i-1];

Table 4.6: Loop Skewing Example — (a) original schedule, (b) after skewing. The ID is "skewed" to allow inner loop parallelization.

Tiling Sometimes known as strip mine and interchange or loop blocking, this transformation is used to enable coarse-grain parallelism or to enhance locality by forming blocks whose data is sized to fit in the cache. It partitions the iteration space into tiles, whose size can be fixed or parametric [216]. A tile can be of three types:

• Full Tile: all points in the tile are valid iterations;

• Partial Tile: only a subset of the points are valid iterations;

• Empty Tile: no points in the tile are valid iterations.

Obviously, an important task when doing code generation is to ensure that empty tiles are not actually visited, effectively reducing control overhead.

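For concreteness, a plausible tiled version of a simple 2-D loop nest is sketched below (illustrative names; a fixed tile size T that evenly divides the trip counts is assumed for brevity):

#define T 32   /* assumed tile size */

void tiled_example(int ni, int nj, double A[ni][nj], const double B[ni][nj])
{
    for (int ii = 0; ii < ni; ii += T)          /* inter-tile loops  */
        for (int jj = 0; jj < nj; jj += T)
            for (int i = ii; i < ii + T; i++)   /* intra-tile loops  */
                for (int j = jj; j < jj + T; j++)
                    A[i][j] += B[i][j];
}

The two outer loops enumerate the tiles, while the two inner loops scan the points of one tile; a parametric tile size simply replaces T with a variable.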


Table 4.7: Loop Tiling Example. The ID is partitioned into the so called “tiles”.

Tiling Hyperplane Method Implemented in the state-of-the-art framework known as PLuTo, the tiling hyperplane method [46] aims at making the loop nest tilable (i.e. making tiling applicable) by computing a set of transformations, driven by an integer linear optimization formulation, in order to minimize synchronizations and maximize locality. The computed transformations must satisfy the following condition to be legal:

Lemma 2. Legality of tiling multiple domains with affine dependencies. Let φ_{s_i} be a one-dimensional affine transform (i.e. schedule) for statement S_i. For {φ_{s_1}, φ_{s_2}, ..., φ_{s_k}} to be a legal (statement-wise) tiling hyperplane, the following should hold for each edge e ∈ E of the PDG:

    φ_{s_j}(t⃗) − φ_{s_i}(s⃗) ≥ 0,   ⟨s⃗, t⃗⟩ ∈ P_e

where P_e is the dependence polyhedron associated with e.

Code Generation Code generation is the last phase of program optimization through the PM. It is indeed a critical step in the polyhedral framework, simply because the effectiveness of the optimization ultimately depends on the quality of the target code. As the name suggests, it consists of regenerating code in a given target language from the polyhedral representation obtained after the transformation step. This stage basically generates a scanning code [161] of the IDs of each statement, following the lexicographic order imposed by the current schedule. This scanning code is an AST-based IR which is then quite easily translated into a target language, typically an imperative one such as C. In the early years of the PM, code generation was considered the bottleneck of the entire framework, due to the lack of scalability of the generation algorithms [153], mainly because of bad control management, which produced redundant conditions or complex loop bounds, as well as rapid code size explosion. This problem has been overcome only recently, thanks to the work of Bastoul [34, 35], who proposed an extended version of the algorithm developed by Quilleré et al. [161]. The technique proposed by Quilleré et al., whose essential part was a recursive generation of the scanning code (the Abstract Syntax Tree (AST)), was the first algorithm able to eliminate redundant control in the target code, but it was not able to deal with predicates and their impact on the control flow, easily resulting in unacceptable code size. The later version from Bastoul was instead able to effectively reduce code size and processing time. Lately, Bastoul's work has been further improved [200], reaching the ability to scale up to thousands of statements.

4.2 Streaming Systems in FPGAs

In the early years, FPGAs were primarily used to implement a small amount of glue logic between other chips, simply because they were not mature enough to handle complex computations and large problem sizes. However, recent trends show that FPGAs are becoming increasingly powerful, more and more aligned with Application-Specific Integrated Circuit (ASIC) performance, but also comparable to other computing devices, thanks to an improved production process, reduced power consumption, increased speed, and a larger amount of resources, in addition to the increasing possibility of on-the-fly reconfiguration. This proves that FPGAs can now be considered a computing platform on their own, able to deliver very high performance even for complex problems [186]. It is however obvious that, due to their completely different architecture, FPGAs cannot be used as a drop-in replacement for the other available computing devices. Instead, ad-hoc solutions must be found in order to effectively exploit their potential, while abstracting implementation details to facilitate scaling. Streaming-based systems are a perfect example in this sense, as they embody precisely the distributed nature of FPGAs. The flexible granularity of those devices, in combination with memory elements distributed through the entire fabric, can easily deliver high-quality results when used for such a purpose, granting high internal communication bandwidth while minimizing contention between elements. Streaming-based architectures found their first applications in media processing [204], a type of computation well suited to a streaming implementation, for the following reasons:
• The information, at least in an uncompressed form, is stored in multidimensional arrays;
• There is an enormous amount of information involved;
• Many of these algorithms do not need simultaneous access to the entire data array, as processing usually operates on bounded regions (a few frames, a single frame, or even a portion of a frame);
• The data access pattern is typically fixed.
However, due to the nature of certain regular computations, which enjoy the above properties as well, streaming-based systems have lately been employed for a whole range of other purposes, especially in the High Performance Computing (HPC) field. Below, the working principles of a generic FPGA-based streaming architecture are explained.

4.2.1 Streaming Architectures A stream-based processing system can be viewed as a Multiple Instruction Single Data (MISD) architecture [204], although the individual processing elements may themselves be SISD, SIMD, or MIMD in nature. It tends to be organized in a systolic structure, in which neighbours communicate directly through dedicated channels, implemented as FIFOs, and the computation is performed as the data streams flow through the corresponding units. However, since the storage capacity of FPGAs is relatively low with respect to the problem size of real applications, those architectures usually rely on external memory systems, employing dedicated logic responsible for communication with those systems, such as Direct Memory Access (DMA) engines.


Table 4.8: Streaming Computing: A General Picture.

Table 4.9: Generic Streaming Architecture.


For these architectures, the memory interface is the key part of the entire system. In fact, in order to provide data at a sufficient rate, the input arrays are linearized into a mono-dimensional stream and partitioned into smaller sub-blocks, following the array access patterns. This kind of explicit management of the memory, although it requires an additional effort with respect to traditional memory systems, completely avoids resource contention, allowing multiple concurrent accesses. Such an arrangement of the memory interface is able to deliver very large bandwidth towards the computational units, at the cost of increased design complexity. In summary, when translating a problem specification into the corresponding streaming architecture, there are two major steps:

• For the computational part, instructions are mapped onto processing units;

• Regarding memory, it requires explicit management: it is first split following the data access patterns, and then organized as a chain of FIFO buffers, in order to break the stream and allow multiple concurrent accesses.

This can easily be represented as a graph, with computational nodes and memory blocks linked together by streams, implemented as dedicated channels (i.e. FIFOs), as previously stated.
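As a minimal sketch of one such node (illustrative names; plain C pointers stand in for the FIFO channels of an actual design), a stage consumes a linearized 2-D array element by element and pushes its results downstream:

#include <stddef.h>

/* One streaming stage: reads a linearized rows x cols array from the input
 * channel and writes one output element per input element. */
void scale_stage(const float *in_stream, float *out_stream,
                 size_t rows, size_t cols, float gain)
{
    for (size_t idx = 0; idx < rows * cols; ++idx) {
        size_t r = idx / cols;                 /* 2-D position recovered    */
        size_t c = idx % cols;                 /* from the linearized index */
        float v = in_stream[idx];              /* FIFO pop                  */
        out_stream[idx] = (r == 0 || c == 0) ? v : gain * v; /* FIFO push   */
    }
}

Chaining several such stages back to back, with a FIFO between each pair, yields the systolic pipeline described above.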

4.3 High Level Synthesis

A higher level of abstraction, beyond Register-Transfer Level (RTL), is increasingly important and unavoidable due to the growing complexity of System-on-Chip (SoC) designs. The latest generation of HLS tools offers coverage of different languages, platform-based modeling, and a domain-specific approach [64]. The abstraction level used by the early generation of commercial HLS systems was partially timed, and because of that they were not widely adopted, since neither the languages nor the partially timed abstraction were well suited to model behaviour at a high level. The following generation provided synthesis of circuits starting from high-level languages, e.g. C-code specifications. This, together with other technical advances, enabled their industrial usage. Nowadays there is a growing demand for high-quality HLS solutions: more and more functionality can be integrated on a single chip, but this increases the number of design teams involved and the design time. HLS tools are constantly improving, and industry is now starting to adopt them in its design flows [1, 50, 60, 88, 132].

4.3.1 What is HLS? High Level Synthesis is an automated design process that interprets an algorithmic description of a desired behavior and creates digital hardware that implements that behavior [65]. The synthesis starts from a high-level specification of the problem, where behaviour is decoupled from timing. The input specification language is analyzed; first, Resource Allocation is performed, that is, the specification of how many and which types of operators and memory elements are required. Then Scheduling assigns each operation to a time slot (clock cycle). During Resource Binding, operations and data elements are bound to specific operators and memory elements. The interfaces between the periphery and the circuit, consisting of data and control signals, are also generated. The result is an RTL design, which is in turn synthesized to the gate level by a logic synthesis tool; a small directive-annotated C example of this flow is sketched after the list below. An HLS tool is characterized according to different criteria:

• Input language: a designer should have the possibility to specify the algorithm in a high-level language rather than in a hardware-oriented language. It is obvious that some restrictions must be applied to the high-level language, but they should not cause excessive difficulty in expressing a certain behaviour.

• Ease of use: clear and complete documentation must be provided to flatten the learning curve. A well-designed graphical user interface (GUI) can also simplify the design.

• Data Type: In hardware the primitive data type is a single bit. Support for complex data types is usually limited to integers, so additional data types ease the transition from algorithm to RTL.

• Design Exploration: the tools evaluate different architectures and choose the one that fits the design specifications.

• Verification: this phase can be sped up if the tool generates a testbench together with the design, integrating the source code (the reference) and the generated design into a single testbench.

• Metrics: the generated RTL design must come with information about latency, estimated clock rate, and resource usage. An HLS tool can produce different RTL designs by exploiting Domain Space Exploration (DSE).
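As an illustration of how a designer steers this flow, the following is a minimal sketch assuming a Vivado HLS-style tool (the kernel and the exact pragma spelling are illustrative and vary between vendors):

#define N 1024

/* Dot-product kernel: the tool allocates adders/multipliers, schedules one
 * loop iteration per clock cycle after pipelining, and binds array accesses
 * to memory ports. */
int dot_product(const int a[N], const int b[N])
{
    int acc = 0;
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1   /* directive: request an initiation interval of 1 */
        acc += a[i] * b[i];
    }
    return acc;
}

The same C source remains compilable by a standard compiler, which simply ignores the pragmas.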

4.3.2 Advantages The synthesis can be optimized taking into account the performance, power, and cost requirements of a particular system. Design abstraction is one of the most effective methods for controlling complexity and improving design productivity. Adopting an HLS flow, fewer lines of code are written, which reduces mistakes and saves time. An RTL implementation has a fixed microarchitecture and protocol, while HLS code can be retargeted to different technologies and requirements, so it can be reused in other designs. More and more accelerators are included in a System-on-Chip, and HLS is particularly appropriate for building the architecture that supports these accelerators. When targeting FPGAs, designers have even more advantages in adopting HLS:
• Modern FPGAs have many pre-fabricated Intellectual Property (IP) components embedded; HLS tools can apply a platform-based design methodology, taking these components into account.
• HLS significantly reduces the design time, while achieving quality of results comparable to hand-written RTL, putting the performance-power trade-off in the hands of the designers.
• Thanks to recent advances in FPGAs, many HPC applications can be accelerated on a reconfigurable computing platform. Software developers do not write RTL, so a highly automated synthesis flow from C/C++ to FPGA is mandatory.

4.3.3 Evolution As the design complexity of integrated circuits grows, the need arises to generate circuit implementations from high-level behavioral specifications. The first HLS tools targeted ASIC design; one of the earliest is CMU-DA [205], developed at Carnegie Mellon University in the 1970s, where the design is specified using the Instruction Set Processor Specification (ISPS) language and then translated into an intermediate data-flow representation before producing RTL. The tool included code-transformation techniques and hierarchical design, and included a simulator of the original ISPS language. During the subsequent years other tools were developed, most of them academic projects. These tools typically decompose the synthesis process into steps, such as register binding, scheduling, and datapath allocation, and different algorithms were developed to solve each phase. Until 2000 the tools often used custom languages for design specification, and because the RTL synthesis tools were not mature, the HLS tools were not widely accepted. Different reasons have influenced the adoption and guided the evolution of early HLS tools:
• They utilized an intermediate language as input, instead of a high-level language; this implied a learning curve for software/hardware developers. Even when the tools started to include the C language, they did not accept more than one language, complicating software/hardware co-design and co-simulation.
• The specification was tool-dependent, so the produced implementation was unlikely to be portable.
• The HLS tools were not able to meet timing/power requirements in real-life designs, because the algorithms focused on reducing the number of functional units and did not take into account the IP blocks available on a specific platform, such as DSP slices and Block RAM (BRAM).
• The tools were born when the design complexity could still be handled without HLS, so there was no need to spend time learning a new, unproven design methodology.
A breakthrough was made when the tools focused on C-like languages to capture design intent. In this way the tools are more accessible to system designers, and they facilitate software/hardware co-design and co-verification. However, C-based languages are criticized as being suitable only for describing sequential software that runs on a CPU. In particular, C/C++ has the following limitations from the hardware point of view:


• it does not include constructs to specify accuracy, timing, concurrency, synchronization, etc.;
• it has complex language constructs, such as recursion, that lead to difficulties in synthesis.
To fill the gap between C/C++ and HDLs, the tools have included hardware-oriented language extensions, libraries (SystemC [17]), compiler directives, and restrictions on (or the interdiction of) dynamic constructs. Hardware and software co-simulation can be done without rewriting the code if pragmas and directives are used: a standard C/C++ compiler can compile the code bypassing the pragmas. Many HLS tools nowadays target FPGA platforms; the improvements made on these platforms make them attractive for many applications. Some of the tools focus on a specific application domain, such as Digital Signal Processing (DSP) or floating-point scientific computing applications.

4.4 Application Domain

The class of problems I am targeting is that of scientific workloads. In fact, these algorithms can easily be written as:
• Static
• Pure
• Affine
imperative codes. I focus on this kind of workload because all of the information needed is known at compile time. All the transformations on the source code can thus be done by analyzing the code statically. As most scientific workloads share these characteristics, we are able to analyze them more efficiently and, as we will see later, effectively and automatically parallelize the computation. I now describe how and when a code is static, pure and affine.

4.4.1 Staticness Given a C code, it is possible to define it as static if:

• All loop bounds are known at compile time


• There are no data dependent conditional statements

Algorithm 4: Example of a static code

1: define M 10
2: define N 10
3: for i=1 to N do
4:     for j=i to M do
5:         if j <= 2 then
6:             b[j] = Func()
7:         end if
8:     end for
9: end for

4.4.2 Affinity

Given a code, we define it as affine if accesses to arrays happen using indices, constants, or linear combinations of the indices of the enclosing loops. For example, Code 4 is also affine, since data are accessed linearly using j alone. An example of an affine but not static code is:

Algorithm 5: Example of an affine non static code

1: define M 10
2: define N 10
3: for i=1 to N do
4:     for j=i to M do
5:         if j <= a[i] then
6:             b[i*2][j+3*i] = Func()
7:         end if
8:     end for
9: end for

Note: in line 5 the if-statement depends on a data value, breaking the second condition for staticness. Each index in Code 5 is a linear combination of enclosing indices and constants.


4.4.3 Pureness Before specifying a condition for a pure code, it is useful to define what a pure function is.

Pure functions A function is pure if:

• No read and write happens without the compiler knowing about it

• Result must not depend on hidden values (to the compiler) or any global state information

• It must not alter any input mutable parameter

• No global (i.e. shared) data

Pureness restricts code by not allowing values to be passed by reference, in order not to share a global state. Thus, code is pure when all function calls are pure functions. The following pseudocode shows an example of a pure code.

Algorithm 6: Example of a pure code

1: define M 10
2: define N 10
3: func foo()
4:     a[]
5:     for i=1 to N do
6:         for j=i to M do
7:             if j <= 2 then
8:                 b[j] = Func(a[i])
9:             end if
10:        end for
11:    end for
12: endfunc

The following code is not pure, since it accesses a global variable via reference.


Algorithm 7: Example of a non pure code

1: define M 10
2: define N 10
3: a[]
4: func foo()
5:     for i=1 to N do
6:         for j=i to M do
7:             if j <= 2 then
8:                 Func(&a[i])
9:             end if
10:        end for
11:    end for
12: endfunc

Beauty of pure functions Pure functions map well onto parallel hardware, as they do not require the implementation of a global memory to exchange data, a potentially critical bottleneck in most systems. Indeed, implementing global state in hardware requires the presence of some kind of hardware synchronization mechanism, wasting resources and introducing wait states for all the components that rely on that information. Pure code prevents these kinds of side effects by design.

4.5 Iterative Stencil Loops

Appropriate exploitation of HPC is nowadays of paramount importance for many scientific and engineering applications, as the increasing computational power has made it possible to push the limits of what can be modeled and simulated, dramatically widening the range of problems that can be addressed. However, architectural trends show that there is a growing gap between the time processors take to perform arithmetic operations and the time they take to communicate [110], a limit which is unacceptable for memory-bound computations such as ISLs, an important part of the solvers in this field. In this section the focus is on what ISLs are, their properties and characteristics, and how they are currently treated in the state of the art.


4.5.1 Definition The so-called Iterative Stencil Loops are basically a class of iterative algorithms, whose features make them belong to the class of SANLPs (see section 4.1.2), and which consist of the repeated updating of values associated with the points of a multi-dimensional grid, usually 2- or 3-dimensional, modeled as an array, using weighted contributions from a subset of its neighbors in both time and space. The fixed pattern of neighbors is called the stencil, and the function that uses those elements to update an array cell is called the transition function. An ISL can be generically represented by the pseudocode of Algorithm 8.

Algorithm 8: Generic ISL Algorithm
for t ≤ TimeSteps do
    for all points p in matrix P do
        p ← f_transition(stencil(p))
    end for
end for

As previously stated, the number of algorithms that fall into this category is quite large, which is why it is of great concern to implement them efficiently, even if this is not an easy task at all. Indeed, a lot of algorithms for scientific computing, such as [39, 172, 187], as well as image and video processing, such as [52, 90], belong to this class and can be generalized in the form of Algorithm 8.

Table 4.10: An illustration of a generic 5-point 2-Dimensional ISL.
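A concrete C instance of Algorithm 8, matching the 5-point 2-dimensional stencil illustrated above, is sketched below (an illustrative Jacobi-style kernel, not code from the thesis; both grid copies are assumed to be initialized with the same boundary values, which are never updated):

void jacobi_5pt(int t_max, int n, float grid[2][n][n])
{
    for (int t = 0; t < t_max; t++) {
        int src = t % 2, dst = 1 - src;          /* ping-pong between copies */
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++)
                grid[dst][i][j] = 0.2f * (grid[src][i][j]
                                        + grid[src][i-1][j]
                                        + grid[src][i+1][j]
                                        + grid[src][i][j-1]
                                        + grid[src][i][j+1]);
    }
}

Here the stencil is the 5-point cross, the transition function is the weighted sum with constant coefficient 0.2, and each time-step reads only values produced in the previous one.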


Formal Model Formally speaking, an Iterative Stencil Loop can be defined as a 5-tuple (I, S, S0, s, T) [86] in which:

• I = ∏_{i=1}^{k} [0, ..., n_i] is the index set. It defines the topology of the array.

• S is the set of states, one of which each cell may take on any given time-step.

• S0 : Z^k → S defines the initial state of the system at time 0.

• s ∈ ∏_{i=1}^{l} Z^k is the stencil itself and describes the actual shape of the neighborhood. There are l elements in the stencil.

• T : S^l → S is the transition function, which is used to determine a cell's new state depending on its neighbors.

Coefficient Types As stated when ISLs were defined, the contribution of the points of the stencil is usually weighted by some coefficients. The type of coefficients by which the neighbours are weighted leads to two scenarios (a short sketch contrasting them follows this list):

• Constant coefficients: when the coefficient values are constant scalars, there is no need to read them repeatedly. They can instead be hard-coded into the stencil loop, resulting in a reduction of storage requirements and memory traffic. As intuition suggests, the case in which coefficients are constant is the ideal scenario, since stencil-related optimizations fully impact the resulting implementation.

• Variable coefficients: in this case the stencil weights can change during the execution, being different between time-steps or from one grid point to another. These weights are stored in separate grids streamed during the computation, which obviously causes extra memory traffic. This requires special care, as stencils are already memory-bound by themselves.
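The sketch below contrasts the two scenarios for one time-step of a 5-point stencil (illustrative names; the 0.2f weight and the layout of the weight grids are assumptions, not taken from the thesis):

/* Constant coefficients: the weight is hard-coded, no extra reads. */
void step_const(int n, const float in[n][n], float out[n][n])
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            out[i][j] = 0.2f * (in[i][j] + in[i-1][j] + in[i+1][j]
                                         + in[i][j-1] + in[i][j+1]);
}

/* Variable coefficients: one weight grid per stencil point, hence five
 * extra reads per updated point. */
void step_var(int n, const float in[n][n], float out[n][n],
              const float w[5][n][n])
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            out[i][j] = w[0][i][j] * in[i][j]
                      + w[1][i][j] * in[i-1][j]
                      + w[2][i][j] * in[i+1][j]
                      + w[3][i][j] * in[i][j-1]
                      + w[4][i][j] * in[i][j+1];
}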


Boundary Conditions Depending on the nature of the computation, two basic types of boundary conditions for ISLs can be identified:
• Constant Boundaries: this is the scenario in which the boundaries are constant during the computation (figure 4.11a). This is the general case, in which they can simply be represented as a ghost zone of the stencil array, i.e. the one updated during the ISL computation. Furthermore, if these cells all have the same value, or at least they can be clustered into smaller sets than the entire number of ghost cells, they can be stored in fewer registers and referenced multiple times. This is obviously a matter of implementation choices, as it depends on the underlying computing architecture.
• Periodic Boundaries: in this case, the grid wraps around all its dimensions, an operation called compactification in mathematics. In the case of a two-dimensional grid, for instance, this means that the left boundary is adjacent to the right boundary, and the top boundary is adjacent to the bottom boundary. This kind of boundary type is often chosen to approximate large – or even infinite – systems. Obviously, this means that in this case the boundary also changes over time, as it is indeed updated during the computation, thus not allowing the optimizations available when dealing with constant boundaries (figure 4.11b).

Table 4.11: ISLs boundary types — (a) an ISL with constant boundaries, (b) an ISL with periodic boundaries.

4.6 Long Term Vision

Trends described in Section 1.2.6 strongly hint at a future where heterogeneous systems are the norm. More and more datacenters and supercomputers are relying on heterogeneity to achieve ever faster computing speed while keeping power consumption as low as possible. Historically, FPGAs have been slower, less energy efficient, and generally less capable than their fixed ASIC counterparts, but they allowed designers to quickly prototype components or build circuits when ASIC production would have been too expensive. Nowadays, thanks to technology advancements, FPGAs can realistically be seen as the next core heterogeneous components of (near) future supercomputing. Right now, researchers are porting algorithms to this platform to achieve better throughput at lower power consumption than their GPU counterparts [84, 89, 133, 139, 189, 199]. In a world where energy is an ever scarcer resource, we will rely more and more on this technology to achieve better power efficiency. What is restraining the use of FPGAs is the steeper learning curve and the very complex design tools, compared to CPUs and GPUs. But, as green scientific computing becomes the hot topic, and given the trends, FPGAs will implement more and more scientific algorithms, for improved power efficiency.

4.6.1 Main Characteristics and Implementation Challenges When it comes to implementing stencil computations, there are at least two important characteristics of these algorithms that must be taken into account, since they cause some cumbersome implementation challenges.

Memory Boundedness The main difficulties that arise when implementing ISLs are due to the fact that performance is bound by memory transfers, mainly because of architectural limitations – memory is intrinsically slower than the computational units – but also due to the nature of these algorithms, as they require multiple, continual accesses to the stencil array. On CPU-based platforms, for example, the matrices on which the computation is performed are much larger than the capacity of the available data cache [69, 110], causing continuous misses and resulting in penalties which inevitably slow down the execution. Regarding FPGAs, the limited amount of memory resources can even lead to infeasibility for problems on large grids, not to mention port contention on BRAM [59], which is by the way a major concern not only for ISLs, but for basically every FPGA implementation. Furthermore, for ISLs in general there is always a bandwidth problem: memory slowness can cause the computation to stall if there is not enough data ready for the arithmetic units [207], lowering performance with respect to the theoretical peak on every device, including the aforementioned ones as well as General Purpose Graphics Processing Units (GPGPUs) [180].

Spatial dependence between grid points Another important aspect to deal with is the possible presence of true data dependencies between updated points of the grid in the same time frame. Trivially, this yields the following two distinct cases.
• Dependency-free points This first scenario is the one in which there is absolutely no dependency between points of the grid, which implies that every point can be computed independently of the others. It basically means that the updating of points is trivially parallelizable, opening the way to a whole range of optimizations; but, due to the nature of stencil computations, this also comes with an important drawback, caused by the temporal dependency between different time-steps. In fact, when parallelizing, these dependencies require communication and synchronization, which may incur non-negligible overhead, yielding significantly lower performance than in theory. The Jacobi iterative method [172, 187] is an example of this type of algorithm.
• Dependent points When the neighboring elements used in the stencil also come from the same time frame, i.e. the data used for an update comes from a computation made within the same time-step, this can also lead to spatial dependencies between points, enforcing an order of execution even within the same time-step. This sequential ordering implies that no – or at least no trivial – parallelization optimization can be made. The result is that, in this case, improvements are even harder to achieve than in the first scenario. The Gauss-Seidel method [172] is a perfect example of this category. A parallel version of this method has however been developed, namely red-black Gauss-Seidel [118], but it requires a specific traversal of the grid which renders any kind of cache optimization useless, as switching from one set of points (i.e. color) to another causes cache misses that, especially for large problems, are the dominating factor negatively impacting performance [207].

4.6.2 State of the Art The implementation challenges discussed in section 4.6.1, which arise when dealing with ISLs, have created an entire research branch focused solely on optimizing stencil computations. The resulting extensive study has led to a wide range of different optimizations. Here, an overview of them is provided.

Tiling Based Optimizations The first category is the one in which tiling, also known as blocking (see section 4.1.3), is employed to effectively improve performance by enhancing data locality and exposing parallelism. This technique has been exploited in a number of different ways, and performed in both the spatial – when possible – and temporal dimension, resulting in a variety of classes [163], which are shown below.

Table 4.12: Single Iteration Tiling.

Single iteration tiling This first type of tiling is the most trivial one, as it consists of applying conventional loop blocking to improve cache reuse. In this case, a single time frame (i.e. a single iteration) is partitioned into smaller blocks, allowing points that are close in space to remain in cache when used, thus allowing them to be updated together, improving locality [110]. This technique has also been exploited to distribute the computation to multiple Processing Elements (PEs), in order to parallelize point computation within a single iteration [92], also leveraging specific Application Programming Interfaces (APIs) such as OpenMP [68]. However, tiling across multiple PEs reaches far from optimal results, since stencils along the boundary of a tile require values that were previously computed by other PEs, resulting in an increasing need for communication and synchronization between tiles, proportional to their number. An effective technique to overcome this issue is the one known as ghost zone optimization or overlapped tiling [107, 117, 134], which basically consists in enlarging the tiles with ghost zones, i.e. the overlapping regions between tiles, replicating some computations but nevertheless reducing communication and synchronization. Although it may seem that applying this technique can always lead to better performance despite the replication, it must be noticed that an improper selection of the ghost zone size may result in even worse performance than with no optimization at all. As a last consideration, when dealing with ISLs which also have spatial dependencies between grid points, this type of optimization is not applicable, at least not for parallelization purposes, and performance is usually not satisfying with regard to cache optimization either [207].

Table 4.13: Time Skewing.


Time skewing In this tiling scheme, multiple iterations are collectively partitioned into blocks, so the essential difference between this strategy and single-iteration tiling is that multiple iterations are included as part of each tile. The reason behind the application of such a strategy is to also exploit temporal locality, and thus to increase the overall data reuse. However, in order to make tiling legal, loop skewing (see section 4.1.3) along the time dimension is required. In fact, since points are updated in both the spatial and temporal dimension in each block, the blocks must shift their collection of points backward along the time dimension to respect the temporal dependencies induced by the ISL, i.e. transform dependency distances into non-negative values [168], resulting in a loss of inter-tile concurrency, because the skewing introduces inter-tile dependencies in the spatial direction. While it may seem that this variant could always deliver better performance than simple single-iteration tiling, it actually depends on a careful selection of the skewing factor [163], as well as on the form of the tile [168, 188], which can be a major concern especially on FPGAs [219]. With respect to the previously mentioned strategy, time skewing can provide better cache hit rates and effectively reduce the processor idle time caused by the ISL's memory boundedness [212]. As for the previous tiling strategy, block distribution among different PEs is possible in this case too [21], but, like single-iteration tiling, it requires explicit synchronization between them, since a block must wait for its neighbors to complete in order to have enough data to start. As a consequence of this synchronization scheme, rather than a purely parallel execution, blocks are in this case executed in a pipelined fashion. A possible solution to the necessity of time skewing when tiling along multiple iterations is proposed in [137], where code transformation is performed with the aim of fusing the stencil loops, in order to reduce the number of reads and writes and increase instead the computational intensity. This is an effective solution to overcome the memory boundedness of ISLs, since usually the computational part of those algorithms is a fraction of the entire execution time, and enlarging it with a corresponding reduction of memory traffic can exploit the computational power of modern architectures. A very similar technique has been developed in [54], which proposes a domain-specific compiler, namely Caracal, able to unroll the time loop and fuse the stencils accordingly, with the same effects as [137].

Wavefront parallelization This strategy is somewhat similar to the previous one, but instead of pipelining the execution of time-skewed blocks, these blocks are scheduled collectively in a wavefront fashion [180, 196, 208]. In this case, then, instead of requiring explicit synchronization, blocks are arranged in such a way that along the time dimension the computation blocks are independent of each other, thus not requiring synchronization.

Table 4.14: Wavefront Parallelization.

Although in [163] this scheme is explicitly defined as the one in which multiple blocks are scheduled together, this class can easily be extended to the case in which a single block implements a single iteration. In fact, these collections are executed in a pipelined fashion along the time dimension, with no need for synchronization. Indeed, this is exactly the kind of behavior exhibited when tiling is applied only to the time dimension [146, 175, 176], and by far this approach is the most promising one with respect to the previous two, as it has been proven to be scalable [146, 175, 176] and comes with no communication overhead. In a sense, the work proposed in this thesis can be at least partially included in this class.


Domain Specific Language (DSL) Based Optimizations

Another important approach towards optimization, extensively used in the literature and becoming increasingly popular, consists of the exploitation of Domain Specific Languages (DSLs) and ad-hoc frameworks. As General Purpose Languages (GPLs) are the dominating software development tools in HPC, the lack of specialized features for narrow domains such as ISLs is a great limitation, since most of the time it does not allow a problem to be expressed in a way which is easy to manipulate, making an optimal implementation a hard task. In this sense, Domain Specific Languages are certainly more powerful, as they provide the ability to define a problem within the specific application domain at the cost of losing broad applicability (although in some cases it could still technically be possible). In such a way, some features and some specific semantics are explicitly expressed, which enables whole classes of transformations, manipulations and optimizations that are simply not possible – or hardly achievable – using general purpose languages. Currently, the ISL domain can count on a number of available DSLs, each with its own peculiarities. For instance, PATUS [55] is able to achieve high performance by auto-tuning, targeting different hardware architectures, while Pochoir [193] provides a C++ template library based on a divide-and-conquer skeleton which is then translated into Cilk [44], a C/C++ extension designed for multithreaded parallel computing. ExaStencils [122] employs a direct mathematical formulation (ExaSlang) of the problem and, through a series of transformation steps, including a wide range of PM-based optimizations, generates target code in a specific language, which at present is C/C++ but in the future could be extended to other languages. Delite [190] abstracts from Scala with the aim of making stencil programming easier, and uses metaprogramming to construct an IR of the problem and compile to a large number of languages, so that it can easily target heterogeneous hardware. In [217], a single mathematical formula is used to implement 3-D stencil codes on GPGPUs, via auto-tuning and automatic target code generation; GPGPUs are also the target device of [107], in which low-level code is generated, starting from an abstract representation, by trading an increase in the computational workload for a decrease in the required global memory bandwidth. In [209], a single higher-order function specified in Haskell, and specifically in CλaSH [26], a functional HDL able to translate plain Haskell (with some restrictions) into synthesizable VHSIC Hardware Description Language (VHDL), is used in combination with a series of transformations to generate hardware accelerators. As a final consideration, although using DSLs can lead to good performance, as previously stated this is not the common practice in HPC at all, as GPLs are preferred due to their versatility and ease of use. This trend is not going to change, at least in the near future, and because of that, trying to achieve the best from General Purpose Languages remains an important but nevertheless challenging task.

Custom architectures When designing custom hardware, FPGAs offer high flexibility and can nevertheless deliver sustained performance with high energy efficiency, often orders of magnitude better than other hardware platforms. Due to these interesting characteristics, FPGAs have been extensively used as the target device for the aforementioned optimizations, but they express their real potential especially when designing custom microarchitectures. In fact, an increasing number of works are focusing on exploiting FPGAs to implement ISLs with the production of ad-hoc hardware, finely designed to efficiently leverage the regular structure of this class of algorithms, which allows complete compile-time analysis. In particular, this kind of approach has proven especially useful to overcome the memory boundedness issue of stencil computations. In [169], for instance, a generic tunable VHDL template has been proposed to parallelize 3-D stencil computations. Instead of Partial Buffering (PB), where only the data needed by the current computation is stored in order to minimize memory consumption, it adopts so-called Full Buffering (FB) [126], a technique in which data is kept in on-chip memory until all the computation relying on it has completed, showing that the increasing number of resources available in modern FPGAs has made the time ripe to push the limits of what can be achieved on such a device. In [59] the PM is employed to take advantage of the stencil access pattern and perform non-uniform memory partitioning in order to generate a custom, streaming-oriented microarchitecture, which is proven to be optimal with respect to memory usage, since it allows FB with the minimum number of reuse buffer banks and the minimum buffer size. Although this architecture has never really been tested to prove its validity – it has actually only been simulated – and the case in which the computation takes as input matrices other than the stencil one is not covered, the ideas behind this work are still of great value, so much so that they have been used in this thesis as the basis for the development of the proposed custom memory microarchitecture. In [116], 2-D stencils are addressed using ScalableCore, a system composed of multiple small-capacity FPGAs connected in a 2-D mesh. To efficiently exploit such an architecture, the stencil computation is tiled and each computational block is assigned to an FPGA; in order to overcome the communication overhead introduced by tiling, the execution order is customized in each FPGA. The work demonstrates that a custom FPGA-based architecture can deliver power efficiency much higher than traditional computing devices. In [182] a memory architecture is developed to implement symmetric 3-D stencils, i.e. of the form n × (n + 1) × n, which uses First In First Out (FIFO) queues for both the input and the output stream, one for each dimension, a data engine (also called the front-end) which prefetches data, a compute engine (the back-end), which consists of multiple instances of the computation unit, and a control engine responsible for synchronizing the data flow in the whole architecture.


CHAPTER 5

On how to Explicitly Isolate Data and Computation

In this chapter we introduce a novel technique to accelerate algorithms and codes that expose plenty of data-level parallelism, a typical situation when analyzing High Performance Computing (HPC) codes. We assume that the codes can be modeled as Static Affine Nested Loop Programs (SANLPs) and can be analyzed by means of Polyhedral Analysis (PA). When these assumptions hold, we demonstrate, via the methodology described below, how generating multiple tiled hardware accelerators benefits the overall performance of the resulting system, both in terms of power efficiency and throughput.

5.1 Introduction

In the race to exascale computing, academic and industrial research is focused on dramatically improving the power efficiency of today's computing systems. In order to attain this goal, different kinds of heterogeneous systems are being employed, an approach that recent trends show to be successful. Indeed, modern power-efficient supercomputers employ Graphics Processing Units (GPUs) as privileged components to attain dramatically better overall power efficiency than Central Processing Unit (CPU)-only systems on specific workloads, notably those exposing plenty of data-level parallelism, the key to extracting performance from spatial architectures. While there is still plenty of room for further improvements, the nature of the underlying architectures of GPUs will eventually plague power efficiency in a fashion similar to what is happening to CPUs on said workloads. Developing on these considerations, a consistent body of research elaborates on the benefits deriving from the employment of Reconfigurable Hardware (RH) as the core computational element of future heterogeneous systems to attain the best power efficiency. However, the difficulty traditionally associated with the development of hardware-based reconfigurable accelerators, usually implemented on Field Programmable Gate Arrays (FPGAs), makes the transition to, and the further development of, state-of-the-art, power-efficient RH-based systems very tough. High Level Synthesis (HLS) plays a central role in the further development of RH as the core computational element of future heterogeneous systems, as it both dramatically reduces the burden on the application designer and drastically speeds up the exploration of different hardware design trade-offs. However, a major limitation of modern HLS tools is that, while it is possible to quickly explore simple, well-known optimizations such as loop unrolling and pipelining, obtaining high-quality results, this is not true for most of the complex and powerful generic polyhedral affine transformations. This limits the ability to explore area/power/communication trade-offs, specifically with respect to data-parallel workload distribution. Additionally, while the last decade saw an increasing effort put into the development of techniques aimed at improving the quality of the resulting hardware components, relatively less effort has been spent on System Level Design (SLD) aspects, leaving application designers of RH-based heterogeneous systems with the entire effort of planning how to distribute the workload among heterogeneous computational elements.


It is difficult to design a hardware accelerator with the right trade-off between computation, communication, parallelism, and memory efficiency. In this work we address both problems with a focus on amply data-parallel codes, and provide the following contributions:
• first, we develop a novel HLS approach that uses the Polyhedral Model (PM) as a means to explicitly extract and isolate data and computation from affine codes in order to efficiently divide the workload among an arbitrary number of nodes, in the light of the current and foreseeable trend of adoption of reconfigurable hardware in the datacenter;
• second, and towards energy-proportional computing, we improve the current state of the art in single-core acceleration, as our methodology obtains near-linear speedup with the area at disposition to accelerate the given workload.
The rest of the chapter is organized as follows: in Section 5.2 we give a thorough review of the state of the art and related works. In Section 5.3 we detail the contribution of this work and the limits of its applicability. The experimental results are provided in Section 5.4. The chapter is concluded in Section 5.5, along with hints about future work.

5.2 Related Work

Recent years have seen dramatic improvements in High Level Synthesis (HLS) tools for the efficient translation of numeric [206], image processing [150], and stencil computation [219] algorithms, where the relevant application domains benefit from research on both compute and memory/communication network aspects. Due to the well-known drawbacks HLS has (such as area overhead and slowdowns with respect to manually crafted designs), the last decades' studies focused on more formal approaches to systematically synthesize better circuits; one notable framework in this context is the PM and the associated code analysis technique, collectively called Polyhedral Analysis [12, 35, 36, 38, 125, 149, 153, 154, 158, 165, 201–203, 218, 219]. Under this representation it is possible to compute dependencies, find bounds, and reorder instructions in a completely automated manner, relying on the same set of sound and comprehensive assumptions of the PM. The most relevant tools for doing PA are Polly [101], the Polyhedral Compiler Collection (PoCC) [18] and PLuTo [46]. The first is a popular LLVM plugin capable of applying PA to its intermediate representation right before invoking the backend (which could, for example, be LegUp [7, 50]). The second is a popular compilation suite supporting polyhedral analysis, with an HLS-oriented version called PolyOpt/HLS [123]. The last is a compilation suite capable of auto-tuning the affine transformations applied to the original code in order to achieve quasi-linear speedup with the number of cores, by simultaneously applying code restructuring, improved data locality, and automatic OpenMP parallelization. Variants of the PA-based techniques already implemented in production compilers can be applied to HLS as well, with adequate measures to cope with the specific requirements of hardware generation [155, 218, 219] and memory architecture considerations [59, 62, 131, 159, 206]. As a hot topic, the PM is the kernel of many current research directions. We cite two of them, relevant to this work. The first one is that of the Daedalus framework, proposed in [143–145], where the authors employ a tool – PNGen [203] – to derive a specialization of a Kahn process network called a Polyhedral Process Network (PPN). The PM is employed to model the target code as a parallel network of (pure) processes exchanging data through bounded First In First Outs (FIFOs), where reads and writes occur in a (guaranteed) deadlock-free order inferred at compile time from the C code. In the Daedalus framework the nodes are general-purpose processors of a Network on Chip (NoC). Broadly speaking, the Daedalus framework maps a PPN onto a Multi Processor System on Chip (MPSoC). It is worth noting that not every node of the network can run concurrently: this contrasts with our approach, in which we synthesize entire portions of parallel sections into isolated portions of code and corresponding data, while guaranteeing that those synthesized accelerators will actually run in parallel. The other research direction was pointed out in [21, 72, 136, 159, 218, 219]. In these works the authors propose an automated framework capable of extracting the polyhedral model and restructuring the code to achieve better data access and reduce area utilization. Memory architectures for affine computation have been heavily explored in [59, 62, 131, 159, 206], and are orthogonal to this work. Indeed, they achieved better performance than simple HLS, but only limited to a single core, where the inner HLS process prevents further SLD steps.

of the PM to explicitly isolate data and corresponding code in order to separate this "unit of computation" from the rest of the accelerator. A new SLD technique is needed to allow large data-parallel portions of a computation to be arbitrarily split and isolated in order to distribute the resulting workload among multiple FPGAs, which is precisely the scope and the novelty of this work. Lastly, it is worth noting that most works related to memory architectures tailored around code modeled through the PM are orthogonal to this work, and are expected to be integrated as future work.

5.3 Data And Computation Isolation

As reported in [124, 159], an effective way of improving throughput with loop pipelining is to eliminate loop-carried dependences. Indeed, affine programs usually expose an inner, embarrassingly data-parallel loop, possibly after adequate loop transformations that can be effectively represented using the PM [218, 219]. Even rich Domain Space Exploration (DSE) steps could easily be added by invoking tools like those presented in [12, 158]. While the authors of [124, 159] focus on the acceleration of such loops by means of HLS pipelining directives, we further extend this concept by:
• identifying not only the innermost loops, but also higher-level loops that do not introduce loop-carried dependences, and
• isolating data and code, by means of the PM, in tiles that can be scheduled either on the same core or on (possibly remote) loosely-dependent accelerators.
We generate N sub-kernels, each capable of processing an Nth of the data produced by the statements of the analyzed loop. Each sub-kernel is made of two parts. The first is the Processing Unit (PU), i.e. the place where the actual computation occurs; its Hardware Description Language (HDL) description is generated via HLS, which is fed the code extracted, after PA, from the corresponding tile. The second is enough local memory to contain the data required to produce the aforementioned Nth of the data.
In order to isolate the instructions and the data they process, we rely on the tiling transformation [167]. We separate the workload by cutting the data domain into multiple splits, each of which is characterized by an amount of data to process and the corresponding
instructions operating on them. We tile by invoking parts of the PoCC toolchain [33, 78]. However, we extended this transformation with data locality in mind.
In order to do so, we analyze the tiled code and the corresponding access relations Γ_{s,A}(ξ⃗_s). Then, we compute the amount of local memory required for all the reads and writes to be local (in FPGA terms, localized in Memory LUTs or BRAMs). We transform the HLS code so that the data required by the cores is sent to them through streaming interfaces, and so that the memory references occur – in HLS – in the local store. The local store of the generated cores is employed as a cache: glue logic is responsible for correctly handling the reads and writes.
Moreover, this restructuring of code and local stores enables, by design, the activation of some specific HLS optimizations, like dataflow and pipelining, directives that are sometimes not available at all without proper code restructuring. As these optimizations are always available after memory reshaping, we enable them during core generation as part of the methodology.
We refer to each of these sub-kernels as a computational unit. For each isolated computational unit, we emit a function in the HLS source code, together with the required glue logic and the streaming interfaces that connect the computational unit to the rest of the user logic. Part of that user logic requires that, before the next iteration starts, all the data processed by the computational units be read back, as a means of synchronization. The process and the resulting architecture are depicted in Figure 5.1. This separation into multiple computational units introduces some communication costs, but, as we show in the experimental results, these are abundantly justified by large throughput gains.
Algorithm 9 describes the methodology in more detail. First, we model the Static Control Part (SCoP) in the PM. Then, given a splitting factor, we tile the target codes; this is done through the invocation of the Chunky Loop Analyzer (CLAN) [33] toolchain. Then, we regenerate each tiled block and manually add the required glue logic. Finally, each derived component is passed to an HLS tool, and the components are assembled.
Note how all of the optimizations that can be introduced with the help of PA are completely independent from HLS directives, which are therefore orthogonal to our methodology.

[Diagram: IPIF, user and glue logic, and a row of processing units (PU), each with its own BRAM local store]

Figure 5.1: Methodology overview. The original accelerator (left) is separated into multiple computational kernels (right), each featuring its local store and related processing unit.

Additionally, due to the way the separation of logic and data is carried out, this methodology is orthogonal to other techniques, specifically those related to memory architectures for stencil computation such as [218], whose addition to our architecture is left as future work. As the computational units are self-contained components, we can use them regardless of where they are physically located. This enables us to make initial considerations on an FPGA-based distributed computing system. Following this technique we are able to separate the computation onto different processing units, while HLS and SLD directives can further improve the throughput of our solution.
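To make the structure of a generated computational unit concrete, the following sketch shows how such a unit could look in Vivado HLS-style C++; the function name, tile size, stream types and the placeholder statement are illustrative assumptions, not the code actually emitted by the toolchain.

    #include <hls_stream.h>

    #define TILE 75   /* hypothetical tile height: 300 rows, splitting factor 4 */
    #define N    300  /* hypothetical problem width */

    /* One computational unit: streaming glue logic, a local store expected
     * to map onto BRAM, and the processing loop extracted from one tile. */
    void cu_0(hls::stream<float> &in, hls::stream<float> &out) {
        float local[TILE][N];                 /* local store of this unit */

        /* Glue logic: fill the local store from the input stream. */
        for (int i = 0; i < TILE; i++)
            for (int j = 0; j < N; j++)
                local[i][j] = in.read();

        /* Processing unit: the tile of the original statement, now reading
         * only from the local store, so pipelining is always applicable. */
        for (int i = 0; i < TILE; i++)
            for (int j = 0; j < N; j++) {
    #pragma HLS PIPELINE
                out.write(2.0f * local[i][j]);  /* placeholder computation */
            }
    }

In the full design, one such function is instantiated per split, and the surrounding user logic reads back all the output streams before the next iteration starts, as described above.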

5.3.1 Limitations

As with most state-of-the-art techniques relying on PA, we can only analyze SCoPs. While this limits the generality of the approach, most scientific computational kernels can be, and in practice are, described as SCoPs. Moreover, as they generally expose plenty of data-level parallelism, they are also a perfect fit for our approach.

Algorithm 9: Methodology Pseudocode
    pmRepr ← CLAN(InputFile)
    if pmRepr != regular then
        exit(−1)
    end if
    deps ← CANDL(pmRepr)
    deep ← findParallelDeep(pmRepr, deps)
    inputTransf ← tiling with current splitting factor
    transformedPM ← CLAY(pmRepr, deps, inputTransf)
    cFileNames, tclFileNames, tclArch ← []
    for all Scop in transformedPM do
        listBlocks ← getBlocksAtDeep(pmRepr, deep)
        for all block in listBlocks do
            block ← glue & synchronization logic
            cFileNames.append(writeCFile(block))
            tclFileNames.append(writeTclFile(block))
        end for
        tclArch ← writeArchTcl(listBlocks)
    end for
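As an illustration of what the tiling step produces, consider the following minimal example (our own sketch, not the output of the actual CLAN/CLAY invocation): an embarrassingly parallel loop over a 300-element array is tiled with a splitting factor of 4, so that each resulting block, together with the array elements it touches, can be extracted into its own computational unit.

    #define N 300
    #define SPLITS 4
    #define TILE ((N + SPLITS - 1) / SPLITS)

    /* Original SCoP: no loop-carried dependences across iterations of i. */
    void kernel(const float A[N], float B[N]) {
        for (int i = 0; i < N; i++)
            B[i] = 0.5f * (A[i] + 1.0f);
    }

    /* After tiling: the outer loop enumerates splits; each inner loop,
     * together with the portions of A and B it accesses, is what the
     * methodology isolates into a separate computational unit. */
    void kernel_tiled(const float A[N], float B[N]) {
        for (int t = 0; t < SPLITS; t++)
            for (int i = t * TILE; i < (t + 1) * TILE && i < N; i++)
                B[i] = 0.5f * (A[i] + 1.0f);
    }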

Another limitation in the current process is that, while most tools are available as open source and support most operations in the PM domain, the choice of the splitting factor, the additional affine transformations, and the generation of the glue logic are still manual processes. However, as already argued, the affine transformations and the splitting factor can effectively and automatically be explored via tools like LetSee [155]. The generation of the glue logic, on the other hand, would require a specific tool for the manipulation of the polyhedral domain and of the polyhedral access relations. However, this process can be automated for SCoPs, as the order of operations is known in advance.

5.4 Experimental Results

We tested our methodology on multiple kernels derived from the Polybench benchmark suite [152]. Before running them on the FPGA we rewrote the kernels in a more HLS-friendly way (e.g., adoption of adequate bit-widths for the involved types, addition of streaming interfaces and, where useful, introduction of HLS directives). We tested the methodology on the following kernels:

jacobi2d: Jacobi 2-D stencil computation. The simplest of the kernels. It exposes plenty of data-level parallelism in the central loop, as the core dependencies arise from inter-loop dependencies.

2mm: two-matrix multiplication kernel. This kernel has a highly data-parallel internal loop. However, it is difficult to extract further parallelism due to Read-After-Write (RAW) dependencies.

3mm: three-matrix multiplication kernel. A more complex kernel, analogous to 2mm.

2dconv: 2-D convolution kernel. This kernel features a nested loop with high data-level parallelism. It is particularly interesting, as FPGAs are becoming the platform of choice for implementing generic filters and convolutional networks (as in [61]).

bicg: BiConjugate Gradient computation, another popular, data-parallel affine kernel.

We tested this methodology by synthesizing and running the resulting designs on a ZedBoard equipped with a Zynq-7000 device. While not exactly a high-end device, we nevertheless observed large relative speedups, which demonstrate the validity of our approach. We employed the Vivado 2014.3 toolchain and Vivado HLS to generate the HDL components.
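For reference, a single time-step of the jacobi2d kernel, in the Polybench style we start from, looks roughly as follows (a sketch with illustrative sizes and types, not the exact benchmark source):

    #define N 300

    /* One Jacobi 2-D sweep: every interior point becomes a weighted sum
     * of its 5-point neighbourhood, read from A and written to B, so no
     * loop-carried dependences exist within a single time-step. */
    void jacobi2d_step(const float A[N][N], float B[N][N]) {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                B[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i][j+1]
                                  + A[i-1][j] + A[i+1][j]);
    }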

5.4.1 Experimental Data

We implemented each previously described kernel in three different ways:

Simple HLS: The designs derived using this strategy are our baseline. We took the kernel code and added the communication interfaces needed to realize a channel between the main memory and the FPGA.

Split-Down: Designs obtained this way have been parallelized by means of tiling only, without relying on any hardware optimization.

Computational Units: These designs were generated using our methodology. The base splitting factor is 4. The amount of localized memory depends on the kernel.

Theoretical best: This is a theoretical extrapolation obtained by assuming that all the resources of one ZedBoard are available, increasing the number of parallel cores as much as possible.

Kernel             Dims                 BRAM (%)   DSP (%)   Watt    Time (ms)
Jacobi 2-D         300x300              93         4         1.558   117
2mm                200x200              93         2         1.547   800
3mm                140x140              92         4         1.572   1120
2-D Convolution    300x300, 9x9 conv.   97         4         1.568   45
BiCG               300x300              94         5         1.554   11.75

Table 5.1: Simple HLS results. Note the heavy pressure on BRAM usage, with poor usage of other resources.

Kernel             Dims                 BRAM (%)   DSP (%)   Watt    Time (ms)
Jacobi 2-D         300x300              97         9         1.66    52
2mm                200x200              N.A.       N.A.      N.A.    N.A.
3mm                140x140              N.A.       N.A.      N.A.    N.A.
2-D Convolution    300x300, 9x9 conv.   97         15        1.707   16
BiCG               300x300              100        18        1.691   3.64

Table 5.2: Split code. While tiling generally improves the overall execution time, this usually comes at the expense of resource usage: indeed, 2mm and 3mm could not be synthesized using this approach.

Kernel             Dims                 BRAM (%)   DSP (%)   Watt    Time (ms)
Jacobi 2-D         300x300              54         78        1.738   6
2mm                200x200              72         72        1.760   21
3mm                140x140              95         74        1.746   42
2-D Convolution    300x300, 9x9 conv.   45         90        1.790   1.6
BiCG               300x300              96         72        1.856   0.725

Table 5.4: Theoretical best. Maximal number of splits. This data shows how the synthesized Computational Units approach would perform assuming maximal utilization and bandwidth is not a bottleneck.

We dimensioned the problem so as to fill, in the baseline version, our Zynq-7000 device; the most constrained resource is BRAM, as the original code allocates all the data in the local store.

Kernel             Dims                 BRAM (%)   DSP (%)   Watt    Time (ms)
Jacobi 2-D         300x300              9          13        1.618   30
2mm                200x200              9          9         1.6     152
3mm                140x140              19         15        1.640   209
2-D Convolution    300x300, 9x9 conv.   9          18        1.682   7.8
BiCG               300x300              24         18        1.742   1.7

Table 5.3: Computational Units approach. 4 splits. Note how parallelizing (due to tiled components) and localizing data keeps resource usage under control.

While the BRAMs are almost completely used, all the other resources – like LUTs or DSPs – are almost unused. Note how this is an inefficient employment of resources, as most logic is wasted (see Table 5.1); although perfectly working, the naïve approach can only be considered the baseline design.
We then created the parallel tiled version, with split factor 4. As we need to move data back and forth to and from the hardware accelerator, we opted to synthesize one DMA controller per core in order to send and receive data in a completely asynchronous way. While other implementations are feasible, we chose this solution as it delivers the best throughput, since each DMA controller can transfer data independently from the others. As expected, the generation of parallelized hardware produces better results than the plain conversion of sequential code; however, in some cases this is not possible, as data dependencies induce the HLS tools to instantiate too much local memory. Where synthesizable, these designs run faster than the baseline, with good speedups.
Then we designed one accelerator per computational kernel, using the methodology presented in this work. We obtain an overall design that is better balanced than all the other approaches. The source of this drastic improvement is the addition of both the glue logic and the computational units, a mechanism that allows us to better exploit the local stores via data (and BRAM) reuse. Obviously, large speedups are obtained against the baseline approach; however, even against the tile-only parallel version we obtain dramatically better results.
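As a quick back-of-the-envelope reading of Tables 5.1 and 5.3 (our own arithmetic on the reported figures, not an additional measurement): the Jacobi 2-D kernel goes from 117 ms in the baseline to 30 ms with 4 computational units, a speedup of 117/30 ≈ 3.9x, i.e. roughly 97% parallel efficiency at a splitting factor of 4; 2-D convolution improves from 45 ms to 7.8 ms (≈ 5.8x), where the gain beyond the splitting factor comes from the pipelining and dataflow optimizations enabled by the localized stores.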

Figure 5.2: BRAM usage [%]. Note how BRAM usage is low in the third approach, even though performance is dramatically better than in the other approaches.

5.5 Conclusions and Future Work

In this work we have shown a novel technique that explicitly isolates data and computation, meant to pave the way for further SLD methodologies that distribute "units of computation" among an arbitrarily large number of Field Programmable Gate Arrays (FPGAs). We achieve large, near-linear speedups, demonstrating how the PM can be employed to distribute a data-parallel workload in a future multi-FPGA based system.
As already stated in Section 6.2, this work is orthogonal to other HLS techniques: given our results and the experience gathered, we believe that future work would benefit from the introduction of a memory architecture similar to that of [206]. Additionally, further work is to be conducted on the choice of the specific affine transformations that might improve the overall process, as well as on the architectures for effective multi-FPGA designs.

Figure 5.3: DSP usage [%]. Note how DSP usage is more efficient in the third approach, as we can better parallelize the computationally heavy part of the kernels.

Figure 5.4: Power consumption [W]. Note how our approach draws slightly more power; however, as a much larger throughput is obtained, the overall power efficiency is improved.

Figure 5.5: Execution time [s]. Note how the execution time is drastically reduced in the third approach. Black columns correspond to designs that could not be synthesized due to excessive resource usage.

CHAPTER 6

Towards the Optimal Iterative Stencil Loops Implementation

In this chapter we introduce a novel technique to perform an imperative-to-streaming translation of SANLP codes. The technique realizes a highly effective HLS code manipulation featuring the generation of dataflow-by-construction hardware accelerators and full on-chip buffering of reusable data.

6.1 Introduction

Numerical methods for the solution of partial differential equations employed in weather and ocean modeling [43, 129], fluid dynamics [73, 179], quantum dynamics simulations [140], heat diffusion [93], geometric modeling [115] and non-equilibrium statistical mechanics [51], but also seismic simulations [170] and cellular automata [141], as well as multimedia/image-processing applications such as [102], Gaussian smoothing [184] and Sobel edge detection [23], represent only a fraction of a wide class of applications sharing the same

computational nature, in which a series of sweeps are performed over a regular grid, whose points are updated using a fixed nearest-neighbor pattern. Algorithms and codes structured this way are called Iterative Stencil Loops (ISLs). These algorithms update values associated with points on a multi-dimensional grid, using weighted contributions from a subset of their neighbors in both time and space (see Figure 6.1). The fixed pattern of neighbors is called stencil, and the function that uses those elements to update an array cell is called transition function. An ISL can be generically represented by the pseudocode reported in Figure 6.1 on the left. As already mentioned, the number of algorithms whose computational kernels fall into this category is large and relevant to both industry and science, which is why efficiently implementing them is of great concern.

Algorithm: Generic ISL
    foreach t ∈ TimeSteps do
        foreach point p in matrix P do
            p ← f_transition(stencil(p))
        end
    end

Figure 6.1: On the left, a generic ISL pseudocode. On the right, an illustration of a generic 5-point 2-Dimensional ISL.

As they are characterized by a regular computation structure, ISLs are ideal candidates for automatic compile-time analysis and transformation aimed at improving their run time performance. While many state-of-the-art works precisely explore this opportunity (particularly in the multicore/multiprocessor domains), ISLs' memory boundedness and spatial dependencies limit the scope of current transformation, analysis, or optimization techniques [55, 125, 134].
With ISLs' memory boundedness we refer to ISLs' requirement for potentially large transfers of data blocks in order to compute the next state of each element of the problem grid. Additionally, depending on the size of the grid and of the stencil, and on the degree of parallelization, processor-to-processor bandwidth might also be an issue, as halos (i.e.: portions of data processed by a processing unit and required by another to progress the computation) must be transferred, too.
ISLs' spatial dependencies, on the other hand, refer to the presence of true
data dependencies, in ISLs' codes, between updated points of the grid across different time frames. Depending on whether further true data dependencies are present inside each time frame or not, automatic parallelization and further aggressive optimizations might or might not be applicable, making the overall automatic optimization process heuristic and more complex.
Some relevant consequences induced by these properties are:
• The performance of parallelization-oriented techniques is bounded by the available off-chip and on-chip bandwidth, due to the un/loading of the working sets into caches, the synchronization of the parallel units, and the halo exchanges. Hence, in most cases the achieved performance is far below the predicted one;
• Techniques designed to take advantage of data locality are not really effective when designing custom accelerators, mainly due to the nature of the computing architectures on which they are applied; a well designed pipelined architecture, given the regular structure of ISL codes, is comparable in structure and performance to the tiled one;
• Although the employment of custom logic explicitly designed to accelerate a specific ISL is promising, it is in general a hard task.

6.1.1 Contributions

To the best of our knowledge, no work addresses all the presented issues at once, as each either focuses on a subset of them or fails to properly consider bandwidth and memory as the limiting factors of a system designed to scale. The aim of our work is to address all these issues at once with the definition of a proper ISL-centric methodology. Therefore, our contribution is:
1. A streaming architecture implementing a single stencil time-step, able to realize full data reuse with the minimum on-chip memory requirements while allowing for multiple concurrent accesses, the Streaming Stencil Time-step (SST); we realize it as a distributed architecture exploiting the inherent parallelism of an FPGA;

2. A scalability-oriented technique able to deliver quasi-linear speed-up thanks to a constant bandwidth requirement, namely SSTs queuing, our source of parallelism;
3. An architecture and a methodology to exploit it – a design automation flow – to automatically implement ISLs with the proposed hardware accelerator.
The remainder of this chapter is organized as follows. Section 2 thoroughly reviews related works and their limitations. Section 3 precisely delineates the contribution of this work to the state of the art. Section 4 is an introduction to the overall methodology. The SST, the core computational element of the proposed system, is thoroughly described in Section 5. Details on how the system can be effectively scaled using the SST queuing technique are presented in Section 6. Section 7 provides experimental results, both for the single SST component and for the queued ones, with comparisons against CPUs. Finally, Section 8 concludes the chapter and elaborates on future work.

6.2 State-of-the-Art

ISLs have been extensively studied in state-of-the-art works. We divide this section into three parts to introduce the most relevant techniques and compare them against our solution.

6.2.1 Tiling Based Optimizations

Tiling, also known as blocking, is a code optimization technique employed to both enhance data locality and expose parallelism. This technique has been exploited in a number of different ways, and performed in both the spatial – when possible – and the temporal dimension [163], as we show below.
Single iteration tiling. The earliest and simplest tiling-based technique, usually employed in both CPU and GPGPU code optimization, consists of applying conventional loop blocking to improve cache reuse (a minimal sketch is given at the end of this subsection). In this case, a single time frame (i.e. a single iteration) is partitioned into smaller blocks, allowing points that are close in space to remain in the cache when used, thus allowing them to be updated together, improving locality [110]. This technique has also been exploited to distribute the computation to multiple processing elements,
in order to parallelize the computation of next-state points within a single iteration [92], also leveraging specific APIs such as OpenMP. However, tiling across multiple processing elements potentially incurs heavy off-chip and on-chip bandwidth requirements, as stencils along a tile's boundary require values that were previously computed by other processing elements, increasing communication and synchronization between them. An effective technique to overcome this issue is the one known as ghost zone optimization or overlapped tiling [117, 134], which consists in the enlargement of the tiles with ghost zones, i.e. the overlapping regions between tiles, replicating some computations but nevertheless reducing communication and synchronization. Although this technique might mitigate communication issues, an improper selection of the ghost zone size may result in even worse performance than no optimization at all [107]. Additionally, generic processors deliver poor power efficiency figures compared to custom logic accelerators, as most of their area is dedicated to coping with irregular code computation. As a final remark, dealing with ISLs with spatial dependencies between grid points forbids the application of this optimization scheme for parallelization purposes, and performance is usually degraded even with cache optimizations, as outlined in [207].
Time skewing. In this tiling scheme, multiple iterations are collectively partitioned into blocks: with respect to single-iteration tiling, multiple iterations are computed as part of each tile. The reason behind the application of such a strategy (specifically on CPUs and GPGPUs) is to also exploit temporal locality, thus increasing the overall data reuse factor. However, in order to make tiling legal, loop skewing [210] along the time dimension is required. In fact, as point updates occur in both the spatial and temporal dimensions in each block, blocks must shift their collection of points backward along the time dimension to respect the temporal dependencies induced by the ISL, i.e. to transform dependency distances into non-negative values [168], resulting in a loss of inter-tile concurrency (skewing introduces inter-tile dependencies in the spatial direction). While it might seem that this scheme always delivers better performance than the simple single-iteration version, it really depends on a careful selection of the skewing factor [163], as well as on the form of the tile [168, 188], which can be a major concern especially on FPGAs [219]. With respect to the previously mentioned strategy, time skewing can provide better cache hit rates and effectively reduce
processor idle time caused by the ISLs' memory boundedness [212] on CPUs and GPGPUs.
As for the previous tiling strategy, even in this case the distribution of blocks among different processing elements is possible [21]; but, as with single-iteration tiling, it requires explicit synchronization between them, since a block must wait for its neighbors to complete in order to have enough data to start. As a consequence, rather than in a purely parallel fashion, in time skewing blocks are executed in a pipelined fashion.
A possible solution to mitigate time skewing's adverse effects when tiling along multiple iterations is proposed in [137], where code transformation is performed to fuse the stencil loops together in order to reduce the number of reads and writes, and to increase both the computational intensity and the data reuse via local buffers. While this is an effective approach to cope with ISLs' memory boundedness, scaling their approach is not straightforward, as they focus only on data reuse efficiency rather than on scaling out the system. A very similar technique has been developed in [54], in which a domain-specific compiler, namely Caracal, is proposed, able to unroll the time loop and fuse the stencils accordingly, with effects comparable to those of [137].
Wavefront parallelization. Instead of pipelining the execution of time-skewed blocks, wavefront parallelization dictates that blocks must be scheduled collectively in a wavefront fashion [180, 196, 208]. Blocks are arranged in such a way that, along the time dimension, the computation blocks are independent from each other, thus not requiring synchronization. Although in [163] this scheme has been explicitly defined as the one in which multiple blocks are scheduled together, this class can be easily extended to the case in which a single block implements a single iteration. Indeed, this is exactly the behavior exhibited when tiling is only applied on the time dimension [146, 175, 176]; this approach is promising as it has been proven to scale [146, 175, 176] with low communication overhead. We draw inspiration from this work and largely extend it to be able to build a dataflow architecture customized around the specific ISL workload.
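To make the single-iteration tiling scheme described at the beginning of this subsection concrete, the following sketch (our own illustrative example, with hypothetical sizes) applies conventional loop blocking to one sweep of a 5-point stencil; each BIxBJ block fits comfortably in cache and could be assigned to a different processing element, at the cost of exchanging halo rows and columns between neighboring blocks.

    #define N  1024   /* hypothetical grid size */
    #define BI 64     /* tile height */
    #define BJ 64     /* tile width */

    /* Single-iteration (spatial) tiling: the grid is cut into BIxBJ
     * blocks; within a block, accesses stay cache resident. */
    void sweep_tiled(const float A[N][N], float B[N][N]) {
        for (int ii = 1; ii < N - 1; ii += BI)
            for (int jj = 1; jj < N - 1; jj += BJ)
                for (int i = ii; i < ii + BI && i < N - 1; i++)
                    for (int j = jj; j < jj + BJ && j < N - 1; j++)
                        B[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i][j+1]
                                          + A[i-1][j] + A[i+1][j]);
    }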

6.2.2 DSLs Based Optimizations

The exploitation of Domain Specific Languages (DSLs) and ad-hoc frameworks has been explored as an interesting way to drive
compile-time optimizations by exposing specific, custom problem semantics to compilers. While General Purpose Languages (GPLs) are the dominant software development tools in HPC, the lack of specialized language constructs and semantics to serve domains such as ISLs is a limitation, since most of the time they do not allow a problem to be expressed in a manner that lets compilers explicitly and directly manipulate the code.
DSLs are certainly interesting, as they allow designers to define a problem in a way that exposes some of its features explicitly to the compiler, allowing for whole classes of aggressive manipulations, optimizations, and specific code generation techniques. However, this comes at the cost of losing broad applicability, as they usually allow the expression of a fairly limited set of algorithms.
Among ISL-oriented DSLs, PATUS [55] is able to achieve high performance by means of auto-tuning, targeting different hardware architectures, while Pochoir [193] provides a C++ template library based on a divide-and-conquer skeleton which is then translated into Cilk [44], a C/C++ extension designed for multithreaded parallel computing. ExaStencils [122] employs a direct mathematical formulation (ExaSlang) of the problem and, through a series of transformation steps, including a wide range of polyhedral model-based optimizations, generates target code in a specific language, which currently is C/C++ but in the future could be extended to other languages. Delite [190] abstracts from Scala with the aim of making stencil programming easier, and uses meta-programming to construct an Intermediate Representation (IR) of the problem and compile it to a large number of languages, so that it can easily target heterogeneous hardware. In [217], a single mathematical formula is used to implement 3-D stencil codes on GPGPUs, via auto-tuning and automatic target code generation, and GPGPUs are also the target device of [107], in which low-level code is generated, starting from an abstract representation, by trading an increase in the computational workload for a decrease in the required global memory bandwidth. In [209] a single high-order function specified in Haskell, and specifically in CλaSH [26], a functional HDL able to translate plain Haskell (with some restrictions) into synthesizable VHDL, is used in combination with a series of transformations to generate hardware accelerators.
As a final consideration, although using DSLs can lead to good performance, in HPC and industrial-grade high-performance
embedded systems this is not common at all, as GPLs are preferred due to their versatility, ease of use, and the vast availability of optimized libraries and components. Additionally, designers are usually familiar with imperative languages like C or C++, which is a major limitation towards the adoption of DSLs, which employ custom (and not necessarily imperative) semantics; indeed, most available HLS tools ship with support for C and C++. For these reasons, we adopt a subset of C as our input language, and specifically a fragment of it analyzable by means of the polyhedral model, in order to both support an industrial-grade language and to be able to heavily analyze, restructure, and manipulate input codes so as to optimize the resulting implementations, as we thoroughly describe in Section 7.4.

6.2.3 Custom architectures

When designing custom hardware, FPGAs can offer both high flexibility and sustained performance with high energy efficiency, often orders of magnitude better than other hardware platforms, depending on the target workload. In fact, an increasing number of works focus on exploiting FPGAs to implement ISLs through the development of custom hardware, finely designed to efficiently leverage the regular structure of this class of algorithms. Specifically, an adequate analysis and design of custom hardware has proven useful in mitigating the memory boundedness issue of stencil computations.
In [169], for instance, a generic tunable VHDL template has been proposed to parallelize 3-D stencil computations. Their work uses so-called Full Buffering [126] – a technique in which data is stored in on-chip memory until all the computations depending on it have completed – in favor of Partial Buffering (a strategy where only the data needed by the current computation is stored, to minimize memory consumption), showing that the increasing amount of resources available in modern FPGAs allows very good performance to be obtained. However, this work was not conceived as streaming in nature and, most importantly, it was not conceived as a scalable solution (i.e.: runnable on multiple processing elements with adequate memory and bandwidth considerations, which we do in our work).
In [59] the polyhedral model is employed to take advantage of
the stencil access pattern and perform non-uniform memory partitioning in order to generate a custom, streaming-oriented microarchitecture, which is proven to be optimal with respect to memory usage, since it allows Full Buffering with the minimum number of reuse buffer banks and the minimum buffer size. This architecture has only been simulated, without actual bandwidth and memory consumption figures, which are the main bottleneck when scaling an architecture to multiple computational elements. Additionally, the case in which the computation has more than one input channel (a common situation in ISL computation) is not covered at all. There are, however, two further considerations with respect to [59] that the proposed work instead properly addresses: (1) their architecture is not able to deal with ISLs with spatial dependencies among point updates, and (2) they do not validate the proposed architecture in real test benches. Consequently, they do not provide any insight on how well the solution performs, also considering that estimated results do not effectively take into account constraints such as the available bandwidth.
The work of [177] consists of replicating the architecture in charge of performing one time-step a number of times and putting the replicas in cascade, i.e. the output of one architecture is the input of the next. The hardware accelerator is a composition of soft processors that must be explicitly programmed; hence it is a totally different approach with respect to the one proposed in this work, where an SST is an automatically derived architecture. This work is the first to propose a methodology to construct the queue automatically, along with some concepts, such as the queue looping, which are completely novel.
In [116] 2-D stencils are addressed using ScalableCore, a system composed of multiple low-end FPGAs connected in a 2-D mesh. To efficiently exploit such an architecture, the stencil computation is tiled and each computational block is assigned to an FPGA. Their work proves how an FPGA custom architecture delivers improved power efficiency compared to traditional computing devices. However, the isolation of the computation on multiple computing elements effectively reduces data reuse by a factor proportional to the number of elements, which is a disappointing property when the goal is to scale out the system. Moreover, ghost zones must be properly synchronized among processing elements, leading to an unnecessary (as we show in Section 7.4) increase in on-chip bandwidth consumption.
In [182] a memory architecture is developed to implement symmetric 3-D stencils, i.e. of the form n × (n + 1) × n, which features
FIFO queues for the input and output streams (one for each dimension), a data engine (the front-end) which prefetches data, a compute engine (the back-end), which consists of multiple instances of the computation unit, and a control engine responsible for synchronizing the flow of data in the whole architecture. While this work is interesting for its customized approach to 3-D stencil computation, its applicability is limited to this class only; additionally, no explicit scaling mechanism is reported. In our work, we extend the applicability of our methodology to N-dimensional stencils and explicitly define a scaling methodology.

6.3 This Work’s Contributions

From the context described in the previous section it is evident that:

• The obtainable performance of parallelization-oriented techniques can be bounded by the available bandwidth, due to the increase in bandwidth demand that accompanies the increase in parallelism, but also due to the consequent need for synchronization of the parallel units. Hence, in most cases the achieved performance can be far below the theoretical peak.

• The techniques designed to take advantage of the data locality are not effective, mainly due to the inadequacy of the computing architectures on which they are applied.

• The scalability of ISLs in large-scale clusters is hard to achieve and, even then, performance does not increase linearly with the scaling.

• Although the employment of custom logic explicitly designed to target ISLs could be promising, it is in general a hard task, and it definitely needs automation in order to ease the process and make it accessible to a broad user base.

To the best of our knowledge, the existing works do not address all the presented issues, as they instead focus only on subsets of them, resulting in suboptimal solutions that are not able to efficiently cope with all the challenges posed by ISL implementation. The aim of this work is instead to address all the presented issues at once
with the proposal of a hardware accelerator specifically designed to target ISLs. Indeed:
• We realized a distributed microarchitecture which exploits the inherent parallelism of the distributed nature of an FPGA. Our source of parallelism also comes from the employment of a technique which enables a pipelined execution of multiple time-steps within the accelerator, allowing multiple time-steps to be performed concurrently in one pass. The robustness of this technique comes from the fact that the increase in performance is achieved without an increase in bandwidth demand; therefore it is always possible to increase the throughput, even when the available bandwidth is very limited.
• The proposed memory system is designed to allow multiple concurrent accesses – which is exactly what is needed in ISLs, as they compute using a nearest-neighbour pattern – and to avoid resource contention, a practical issue on FPGAs. Another important peculiarity of the memory system is that it is able to deliver full data reuse, thus reducing to the minimum the amount of required communication with the off-chip memory, realized with the minimum achievable on-chip memory requirements.
• The previously cited technique that enables the execution of multiple time-steps in one pass ensures linear scalability with constant bandwidth requirements. This allows the design to scale easily without incurring performance degradation, and can also enable scaling over multiple FPGA nodes, effectively solving the problem of scaling in large clusters.
• The proposed hardware accelerator can be directly derived from an imperative specification of the ISL, e.g. an algorithm written in C/C++. We indeed propose a design automation flow, which employs the PM to achieve this goal.
A detailed overview of the proposed solution will be supplied in the next section. Let us however briefly summarize the thesis contributions. In practice, we provide:
1. A streaming-based microarchitecture that implements a single stencil time-step, able to realize full data reuse with the minimum on-chip memory requirements, the SST;

2. A scalability-oriented technique able to deliver pseudo-linear speed-up, namely SSTs queuing;
3. A methodology – a design automation flow – to automatically implement ISLs with the proposed hardware accelerator.

6.4 A Scalable Streaming-based Microarchitecture for the Automatic Implementation of ISLs

The work proposed in this thesis targets ISL implementation employing a custom hardware accelerator. For this purpose, we developed a streaming microarchitecture aimed at performing a single ISL time-step, which we called SST. The entire accelerator is the composition of multiple SSTs arranged in a queue fashion. We also propose a design automation flow to automate the SST derivation and the queuing process. In the rest of this section we provide the key points, a description of the accelerator in all of its components, a set of constraints for the input code of the proposed solution, and a comparison with existing works.

6.4.1 Fundamental Principles

Let us first present the fundamental principles on which the proposed work rests.

Streaming Computation

Within the context of regular computations, such as ISLs, a streaming paradigm is undoubtedly a well suited choice. The ability to perform complete compile-time analysis allows the data flow to be determined precisely and, as a consequence, the computation to be arranged in the most effective way. A streaming computation model is indeed a data-centric model, where the focus is on a constant data flow, granting high throughput while nevertheless keeping the amount of needed resources low, especially when the underlying architecture enables this kind of optimization. In the case of ISLs, where the update of each grid point requires a number of concurrent reads, this approach can avoid – or at least limit – the problem of memory boundedness, effectively reducing resource contention. Also, the distributed nature of a streaming model fits perfectly the distributed
nature of a configurable architecture such as FPGAs, enabling the exploitation of the inherent parallelism of those devices. The reader may refer to Section 4.2 for the technical details about streaming systems on FPGAs.

Scalability

In assessing the quality of a system, scalability is absolutely an essential parameter. The scalability issue is actually a hot topic: indeed, a large number of works, while theoretically valid, actually suffer from limited scalability or, in the worst cases, do not scale at all. Our work is focused on scalability, which is in fact addressed in two different and complementary ways:

• The accelerator itself is based on a scalable architecture. SSTs are connected in a linear array that forms a deep pipeline. Because the depth of a pipeline does not influence the bandwidth requirements, we can increase the computing performance at constant memory bandwidth by connecting more SSTs into a longer queue, consequently increasing the throughput (a worked example follows this list).

• For large problem sizes, whenever the available on-chip memory resources are not enough, the communication channels can be removed and substituted with an off-chip memory interface, thus increasing the bandwidth consumption while reducing the on-chip buffering requirements.
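As a worked example of the first point (illustrative figures of our own, not measurements from this chapter): for an ISL requiring T time-steps over a grid of G points, a queue of Q SSTs performs Q time-steps per pass over the data, so the number of off-chip passes drops from T to ⌈T/Q⌉ and, assuming one input and one output stream of G words per pass, the total off-chip traffic is roughly 2·G·⌈T/Q⌉ words, while the bandwidth needed during each pass stays that of a single stream pair. With T = 100 and Q = 10, off-chip traffic shrinks by about 10x and throughput grows by the same factor at constant bandwidth.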

Optimal Full Buffering

When memory resources were so limited that memory systems on FPGAs allowed the storage of only a very small amount of data, Partial Buffering (PB) was the only way to go. The principle behind PB is that data is fetched from external memory only when it is needed, which means that, if needed multiple times, the same data is transferred more than once. This technique keeps resource usage low, but also limits the overall performance, as it implies repeated reads of the same data from off-chip memory, consequently resulting in wasted clock cycles.
Modern FPGAs now have enough resources to allow, when the computation is performed on reasonable problem sizes, the employment of Full Buffering (FB), a technique in which data is read only
once and stored in the on-chip memory until all the computation relying on it has completed. The advantage of an FB scheme is that, at the cost of an increase in scratch-pad memory requirements, the off-chip traffic is reduced to the minimum.
An SST is able to perform FB in an optimal way, employing the PM to perform non-uniform memory partitioning of the input stream. The compile-time analysis allows the computation of the minimum size of the reuse buffer for a data array, which is indeed equal to the maximum lifetime of any element in the array itself. In this way an SST can deliver FB with the minimum number of buffer banks (represented in the architecture as communication channels), where each of them also has the minimum possible size.
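As a concrete illustration of this sizing rule (a worked example under the usual row-major streaming assumption, not a figure produced by the toolchain): for a 5-point 2-D stencil over a grid with N columns, the first use of an input element A[i][j] is for the update of point (i-1, j) and its last use is for the update of point (i+1, j); in stream order these two uses are 2N elements apart, so the maximum lifetime – and hence the total reuse buffer capacity – is 2N + 1 elements, i.e. two grid rows plus one element, independently of the number of rows in the grid.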

Wide Applicability

A lot of existing works focus only on ISLs without spatial dependencies between grid points within the same time-step (see Section 4.6.1), mainly because it is difficult to extract parallelism from those algorithms. Hence, they intentionally limit their applicability, as their solutions are not suitable for this kind of ISLs. We instead treat both ISL types indiscriminately, as our methodology leverages a streaming-based computation and the performance gain is given by the pipelining of multiple SSTs. Our source of parallelism is indeed implicitly given by the distributed organization of the microarchitecture, which in turn takes advantage of the distributed nature of FPGAs.
As a last remark, even though a proper restructuring of the input stream could remove this limitation, it must be said that our methodology does not target ISLs with periodic boundary conditions. This is indeed a limitation shared by a lot of the available works, since those kinds of ISLs are not as common as the ones with constant boundaries.
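To clarify the distinction between the two ISL types (a minimal example of our own): a Jacobi-style update reads only values produced by the previous time-step, so no spatial dependencies exist within a time-step, whereas a Gauss-Seidel-style update reads values already written during the current sweep, which introduces such dependencies.

    #define N 300

    /* No spatial dependencies within the time-step: reads A, writes B. */
    void jacobi_step(const float A[N], float B[N]) {
        for (int i = 1; i < N - 1; i++)
            B[i] = 0.33f * (A[i-1] + A[i] + A[i+1]);
    }

    /* Spatial dependencies within the time-step: A[i-1] has already been
     * updated by the current sweep when A[i] is computed. */
    void seidel_step(float A[N]) {
        for (int i = 1; i < N - 1; i++)
            A[i] = 0.33f * (A[i-1] + A[i] + A[i+1]);
    }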

Automatic Process

As stated in the previous section, hardware design is a hard task and, if done by hand, also error prone. For this reason we propose an automated design flow, able to generate the accelerator directly from the input source code. This flow will be detailed in the next chapter.

6.5 A General Overview of the Proposed Microarchitecture

Hardware acceleration is one of the techniques used to improve the performance of a computing system. It consists of offloading the computationally intensive part of a given algorithm from the general purpose processor onto computer hardware specifically designed to perform those computations. The proposed microarchitecture precisely embodies this logic, as it is indeed a hardware accelerator. This means that it requires a host processor to drive the execution and to control the inbound and outbound data flow.
From a general perspective, the proposed microarchitecture can be viewed, at every level of granularity, as a composition of independent modules that communicate over FIFO channels and employ blocking reads and writes to manage the data flow and ensure its correctness.
At the top level, the accelerator consists of a series of blocks arranged in a queue fashion, each of which is responsible for the execution of a single ISL time-step. These blocks are called Streaming Stencil Time-steps (SSTs), a name we chose as it recalls exactly their functionality. Since the microarchitecture is streaming-based, data flows from one SST to the next as soon as it is produced, resulting in a pipelined execution of the entire computation. In some occasions the SSTs' data flow can be managed by an additional module, which is always aware of the progress of the computation as well as of the total number of iterations to be performed, which we call the mux. This happens in two cases (which can also occur together; a worked example follows this list):
• the number of SSTs – and hence the corresponding number of time-steps – of the hardware accelerator is not an exact divisor of the total number of iterations. In this case the mux is responsible for breaking the computational flow when the total number of time-steps of the ISL has been performed, and for redirecting the output to the off-chip memory.
• the queue length is large enough to allow cycling the data flow, redirecting the output stream of the queue back to the queue itself instead of transferring it back to the off-chip memory. We call this condition queue looping, and it will be further detailed in the next section. In this case the mux is responsible for breaking this loop when the total number of time-steps has been executed.
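For instance (an illustrative example with made-up numbers): if the ISL requires 100 time-steps and the accelerator hosts a queue of 8 SSTs, 12 complete passes through the queue cover 96 time-steps; during the 13th pass the mux must break the flow after the 4th SST, since 100 = 12 × 8 + 4, and redirect that SST's output stream to the off-chip memory.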

Figure 6.2: The high level scheme of the proposed hardware accelerator. The three different versions represent the three distinct described cases: the first (a) is the standard case, the second (b) is the case in which the queue length is not an exact divisor of the total number of ISL time-steps, the third (c) is the case in which there are enough available resources to enable queue looping.

Now that we have described the accelerator from a high level, the only thing that remains to detail is how an SST is actually implemented. We have already stated that an SST executes a single ISL time-step; let us now describe its internal components.
First of all, an SST in general has one input stream and one output stream. In the case in which the ISL updates grid points employing constants or other arrays, the input streams are obviously more than one. The components within an SST can be divided into two main categories, the first being the memory system, the second being the computation system.

Figure 6.3: A general scheme of an SST.

6.5.1 Memory System

This part of the SST consists of a series of chains of modules connected by FIFO channels (or of a single chain, when the only input is the stencil array itself, i.e. the one updated by the ISL), one chain for every distinct input array, all responsible for feeding the computation system with the needed data. Each chain receives a single data stream, which is indeed the array itself, and the modules within the chain represent the different read array references. Trivially, if the array reference is unique, the chain is made up of a single element. These modules are the ones actually responsible for sending the data: in fact they read every data element from their preceding FIFO and send it to the successive FIFO as well as to the computation system. From a high level perspective, this arrangement can still be viewed as a single stream, from which each module filters data only when needed, which is why we called them
filters. The chain-like organization of the filters ensures that the data is read only once and at the same time allows more concurrent accesses, also realizing the optimal FB.
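A minimal sketch of such a filter module in HLS-style C++ (module name, predicate and stream types are illustrative assumptions, not the code generated by our flow) could look as follows: every element is forwarded to the next filter in the chain, so it enters the chip exactly once, and is passed to the computation system only when the array reference implemented by this filter needs it.

    #include <hls_stream.h>

    #define GRID_SIZE (300 * 300)   /* illustrative problem size */

    /* Hypothetical predicate: true when element k is used by the array
     * reference this filter implements (derived from the PM analysis
     * in the real flow). */
    static inline bool needed_by_my_reference(int k) { return k >= 1; }

    /* One filter of the memory-system chain. */
    void filter(hls::stream<float> &from_prev,
                hls::stream<float> &to_next,
                hls::stream<float> &to_compute) {
        for (int k = 0; k < GRID_SIZE; k++) {
    #pragma HLS PIPELINE
            float v = from_prev.read();        /* blocking read */
            to_next.write(v);                  /* single-read property */
            if (needed_by_my_reference(k))
                to_compute.write(v);           /* feed the computation */
        }
    }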

6.5.2 Computation System

The computation system is composed of a series of modules that perform the actual computation, taking data from the memory system. However, there are some further considerations to make to better understand how they are arranged.
First of all, given that ISLs update grid points using a nearest-neighbour pattern, it is evident that the boundary must always be taken into account. This condition can cause performance loss when hardware accelerators are employed, as the host processor could be forced to waste time reconstructing the array from the output. For this reason, our SST considers the boundary in an explicit way. This is also a prerequisite for SSTs queuing. In fact, since the accelerator consists of a chain of replicas of a single SST, within an SST the output and the input must obviously have the same form. To accomplish that, the last computing module is always decomposed into two parts: one in charge of computing the ISL output, which we may refer to as the computation part, and one which transfers the boundary of the grid from the memory system (without applying any computation, as there is none insisting on that portion of the grid). To ensure that the output stream is rearranged in exactly the same form as the input, we inserted an additional module, called demux, whose function is precisely the one just stated.
There is also another possibility that can lead to a further decomposition of a computation part, namely the presence of spatial dependencies between grid points. This situation will be addressed in detail in the next chapter; however, we anticipate that the computation part will be decomposed into more than one equivalence class, which realize the computation considering the presence of these dependencies. A demux will be added in this case too, for the same reasons as for the boundary.
We previously claimed that an SST has in general a single output stream. This is true in most cases, but not when performing queuing with an ISL that takes multiple arrays as input, i.e. which has variable coefficients.

148

i i

i i i “phdthesis” — 2015/12/14 — 9:35 — page 149 — #165 i i i

6.6. Some Considerations on the Input Code

Table 6.4: An SST for ISLs with spatial dependencies.

ditional streams must be added, in order to transfer the input data within the queue. In this case then, the filters chains refer to arrays which are not the output one are equipped with an additional com- munication channel, used to drive those data to the next SST in the queue. This is obviously not true for the last SST.

Figure 6.5: An example of the accelerator for an ISL with multiple inputs. The green arrows represent the additional streams. As described in the text, the last SST has only the actual output stream.
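A behavioural sketch of the demux described above is given below; the square N x N grid, the single equivalence group updating the interior, and the hls::stream ports are illustrative choices rather than the generated module.

```cpp
#include <hls_stream.h>

// Hedged sketch of the demux: it rebuilds the SST output stream in the exact
// same (row-major) order as the input, taking interior points from the
// computation part and boundary points routed by the memory system.
template<int N>
void demux_sketch(hls::stream<float> &from_compute,   // updated interior points
                  hls::stream<float> &from_boundary,  // boundary routed by the memory system
                  hls::stream<float> &sst_output)     // full array, same form as the SST input
{
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            bool interior = (i > 0 && i < N - 1 && j > 0 && j < N - 1);
            float v = interior ? from_compute.read() : from_boundary.read();
            sst_output.write(v);
        }
    }
}
```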

6.6 Some Considerations on the Input Code

We already stated in section 6.4.1 that our solution targets ISLs both with and without spatial dependencies. There are, however, some considerations to make about the input code so that the proposed design automation flow works properly:

1. The algorithm must be specified in an imperative form, e.g. C/C++.

Also, it must fall into the category of SANLPs, which is indeed the case for nearly every ISL (a minimal conforming input is sketched right after this list);

2. There is virtually no limit on the input problem size. Even if the available resources are not enough to handle large arrays, there is always a bandwidth/buffers trade-off that can be made to solve the issue. This case will be further inspected in the next chapter.

3. Even though multiple inputs are allowed, the actual output must be the only stencil array, i.e. the one updated by the ISL. In a nutshell, this means that whenever the ISL contains more than one statement, they must assign to the same array (the stencil one). The only case in which statements assigning different arrays can occur is when those statements are in a dependence relation, i.e. the array updates of one statement Si are subsequently read by another statement Sj. Even in this case the actual output is only one, namely the array updated by the statement Sj. Hence, array updates of Si - which can be thought of as "intermediate" results - that are not used by Sj will still be present in the internal data flow of an SST, but not forwarded. Such a restriction is necessary to derive the SST effectively. However, it is important to notice that this condition is not at all restrictive, as loop nests that do not have this single-output feature are not proper ISLs.

4. Conditionals which are affine functions of the time dimension indices are admitted, but require a pre-processing phase, as will be described in the next section. Conditionals on the array dimension indices, instead, are not allowed, as a code with such a structure would not be a proper ISL.

5. Array sizes are inferred by the polyhedral analysis. This is perfectly possible, given that the stencil array accesses map their data space also onto the boundary. If the array is bigger than the computed dimension - loop nest and boundary conditions - the design automation flow should be assisted with additional information, e.g. specific pragmas. The situation in which arrays are bigger than the computed size is however very unlikely, as boundaries are present only to ensure the algorithm's correctness. In the general case, there is no need to
have boundaries bigger than the ones employed in the computation, as they would carry unnecessary information and would also waste memory space.
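As a concrete illustration of these requirements, the following C/C++ fragment - written in the spirit of the jacobi2D benchmark used later, with purely illustrative sizes and coefficients - is a valid input: it is an imperative SANLP, its only actual output is the stencil array, and it contains no conditionals on the array dimension indices.

```cpp
// Minimal sketch of a conforming input. Two statements assign different
// arrays, but they are in a dependence relation (B is read back into A),
// so the actual output is the single stencil array A.
#define N 1024
#define T 100

void jacobi2d_like(float A[N][N], float B[N][N]) {
    for (int t = 0; t < T; ++t) {               // time dimension (outermost loop)
        for (int i = 1; i < N - 1; ++i)
            for (int j = 1; j < N - 1; ++j)
                B[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i][j+1]
                                  + A[i-1][j] + A[i+1][j]);
        for (int i = 1; i < N - 1; ++i)         // intermediate result feeds the stencil array
            for (int j = 1; j < N - 1; ++j)
                A[i][j] = B[i][j];
    }
}
```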

6.7 A Comparison with Existing Works

The work proposed in this thesis, from a high-level perspective, can be analyzed from three different points of view: the automated PM-based C-to-FPGA flow, the streaming-based SST microarchitecture that targets ISLs with a memory system able to achieve full data reuse, and the exploitation of the time-iterative nature of ISLs via SSTs queuing to overcome memory boundedness. The goal of this section is to compare the proposed work with the works that, to the best of our knowledge, are leading in each of these three aspects.

6.7.1 PM-based C-to-FPGA flow

Although HLS has seen an intense evolution, as already described in section 4.3, such that today's HLS tools are capable of generating high-quality Register-Transfer Level (RTL) code for a wide range of input programs, they still lack the ability to exploit all the available performance enhancement opportunities, especially for SANLPs. In particular, the essential limitation is the absence of a structured approach to efficiently manage data movements from off-chip to on-chip memories, which by default are completely left in the hands of the software designer. The PM can be effectively exploited to overcome this issue, and in fact a number of C-to-FPGA frameworks have been proposed, in particular [124, 159] and [219], which employ the PM as optimizing engine. The aim of all these works is to mask the off-chip transfer latency by intrinsically overlapping communication and computation. However, they are not always able to achieve the desired results, mainly because of the way they use HLS. As a matter of fact, they use HLS only as a back-end for their optimizations, instead of focusing on the real issue, namely the production of an efficient accelerator which leverages the real capabilities of the underlying hardware.
The work of [159] uses tiling hyperplane transformations to expose data locality as much as possible, and then carefully manages on-chip buffers to enable data reuse and pre-fetching.
The generated code is then further optimized for HLS. This work has been implemented in a toolchain named PolyOpt/HLS. PolyOpt/HLS is able to realize data reuse only among subsequent iterations of a loop, although for a given loop nest the depth to which data reuse is exploited - i.e. which two successive iterations are used - can vary. This is indeed a limitation, since the framework does not necessarily capture all the reuse potential in a loop nest. In particular, reuse between two non-consecutive iterations is not exploited at all. Even though they claim that this should not be a huge limitation, this is not true: when access patterns have the form of ISLs, their technique can completely fail to alleviate the memory boundedness issue. Also, when reuse opportunities exist only between non-consecutive iterations, the quality of their results can be unsatisfactory with respect to the goal of achieving efficient data reuse. Although in the later work of [124] PolyOpt/HLS has been extended with optimizations tailored to solve the resource contention on memory bank ports and to achieve an initiation interval of 1 clock cycle on pipelined kernels - two points of failure of the initial work - the authors admit that with certain kinds of data access patterns they still fail to achieve optimal results.
In [219] the authors extend the state-of-the-art framework PoCC [12] (a framework for polyhedral optimizations which wraps the most relevant state-of-the-art tools and libraries) in order to:

• Use their PM-based methodology to extract inter-block and intra-block parallelism and pipelining,

• Produce HLS ready code with all the needed directives,

• Generate the communication interfaces (generally FIFOs) between computation and communication blocks.

Their methodology consists of a set of loop transformations to obtain the desired data dependencies between iterations (unimodular transformations); they then utilize a cost model to estimate which transformation is best in the application context - FPGA resources, type of dependencies and communication costs - and produce the corresponding scheduling. This work is however restricted to the case in which the loop nest dimensionality and the array dimensionality are equal for all sets of blocks in the program. This is actually a restriction that dramatically limits the applicability of the proposed
methodology: for instance, we verified that none of the benchmarks of PolyBench/C [152], the benchmarking suite for PM-based optimizations, can be treated with the proposed framework. The only case in which it could be applied is a subset of the ISL benchmarks, namely adi, jacobi-1D, jacobi-2D and seidel, and only if the outermost loop - i.e. the time dimension - is removed from the original code. They essentially claim that their methodology is in general applicable to SANLPs, but in reality it can only be applied to a very small subset of them, not even all ISLs.
Even though our application domain is smaller - but not by much - with respect to the entire class of SANLPs, an aspect that must be considered for the comparison, there is an essential difference between the two aforementioned C-to-FPGA flows and the one proposed in this thesis. We employ the PM not to transform the input source code to be "HLS-friendly", but instead to realize a hardware accelerator able to efficiently exploit the available hardware resources and perform optimal FB. In our case, HLS is not our target; it is instead a link connecting the polyhedral framework and the hardware design.

6.7.2 SST microarchitecture

There are essentially two architectures that can be compared with the one realized from the SST. The first comparison can be made with the work in [59], which was a starting point for the one proposed in this thesis. Indeed, from a functional point of view the chains within the SST's memory system share some similarities with the working principles of [59], even if technically they are implemented differently. There are however two observations to be made, since the work in [59] is lacking in two aspects that the proposed work instead properly addresses:

• the architecture proposed in [59] is not able to deal with ISLs with spatial dependencies among point updates, e.g. the Gauss-Seidel method, whose PolyBench/C version has been employed as a benchmark in our experimental section;

• they never validate the proposed microarchitecture in real test cases. Consequently, they do not provide any insight on how well it performs, considering also that estimated results do not
take into account practical constraints such as the available bandwidth.

The second comparison can be made with Maxeler [9]. Maxeler is an FPGA-based heterogeneous system where the accelerator has to be implemented with a dataflow specification, i.e. a Dataflow Engine (DFE). Maxeler's computing system includes CPUs and DFEs, and DFE configurations are created using Maxeler's MaxCompiler. To create applications exploiting DFE configurations, an application must be explicitly split into three parts:

• Kernel, which implements the computational components of the application in hardware.

• Manager configuration, which connects Kernels to the CPU, engine Random Access Memory (RAM), other Kernels and other Dataflow Engines via a custom interconnection (MaxRing).

• CPU application, which interacts with the dataflow engines to read and write data to the Kernels and engine RAM.

From an architectural point of view, the structure of our accelerator is similar to the one obtainable with Maxeler - a Maxeler DFE - on the specific application domain of ISLs. There is however an essential difference between Maxeler and the work of this thesis: in the case of Maxeler, the software designer must have a deep knowledge of the Maxeler system, of the Maxeler language - an extended version of Java called MaxJ - as well as a deep knowledge of the general structure of a dataflow architecture, as the accelerator behaviour must be specified explicitly. Hence, there is a learning curve and a required expertise that is anything but easy to attain: implementing complex programs can be a hard and time-consuming task. On the other hand, our methodology is able to extract the accelerator automatically from plain C/C++, with the restrictions specified in section 6.6.

6.7.3 SSTs queuing

The exploitation of the time dimension in order to increase performance is not a new idea; indeed, there are a few works in which this is done effectively. The key idea is to exploit the iterative nature of ISLs and their temporal locality in order to reduce the
amount of communication with the memory, thereby alleviating the memory bandwidth issue. There are two techniques which employ this idea in two different ways, which can be thought of as "software" and "hardware".
In the software version the original ISL is rewritten to merge two or more time-steps into a single update by expanding the stencil formula along the time dimension. This is done both in [137], where the target is hardware design, and in [54], where the target is canonical CPU-based architectures. In [137], the code restructuring can however lead to port contention on memory banks, due to the enlargement of the stencil window and the resulting increase of required concurrent accesses on the memory banks, a problem that cannot occur in the case of an SST, where the memory system is designed precisely to allow multiple concurrent memory accesses. In [54], the lack of awareness of the memory subsystem in the transformation process limits the applicability to x86 CPUs only. As an important remark, it must be noticed that both works propose and implement an automatic flow to perform this software restructuring.
The hardware version, which as the name suggests is related to hardware accelerator design, consists of replicating the architecture in charge of performing one time-step a number of times, putting the replicas in cascade, i.e. the output of one architecture is the input of the next. This is the idea employed in this thesis; it is also proposed in [175], where the hardware accelerator is a composition of soft-processors that must be explicitly programmed - hence a totally different approach with respect to the one proposed in this thesis, where an SST is an automatically derived microarchitecture - and in [94], where it is analyzed only from a theoretical point of view. Although the idea is already present in the state of the art, the work of this thesis is the first that proposes a methodology to perform it automatically, along with some concepts, such as the already cited queue looping, which are completely novel.

6.8 Overview of the Methodology

We propose a methodology to derive a hardware accelerator designed to target a specific ISL. The hardware accelerator can be viewed, at every level of granularity, as a composition of independent modules that communicate over FIFO channels and employ blocking reads and writes to manage the data flow and ensure its correctness.
In particular, we developed a streaming architecture aimed at performing a single ISL time-step, which we called SST; the entire accelerator is represented by the composition of multiple SSTs in a queue fashion. We then also proposed a design automation flow to automate the SST derivation and the queuing process. This design automation flow is basically a 2-step process. The first step consists in deriving the microarchitecture in charge of implementing a single iteration - i.e. a time-step - namely the SST. This can be viewed as the basic building block of the accelerator. The second part addresses how to actually build the complete accelerator, in which SSTs are arranged in a queue fashion, realizing an effective, explicitly streaming and scalable mechanism. The flow takes as input the ISL's Static Control Part, written in an imperative form (e.g. in C/C++), and produces the corresponding accelerator. The proposed flow prepends to the aforementioned steps a pre-processing phase, in which the so-called Reduced Static Control Parts (rSCoPs) are extracted. Both the pre-processing phase and the first macroblock heavily rely on the polyhedral framework, as most ISLs, due to their static nature and regular structure, can be viewed as a subset of the class of programs known as SANLPs. We restrict ourselves to treating only the ISLs enjoying this property. An overview of the proposed flow can be seen in Figure 6.6.
Let us briefly describe the two macroblocks of the proposed design automation flow. The first macroblock is the SST microarchitecture derivation, and is composed of the following parts:

• The first part performs the polyhedral analysis in order to extract a polyhedral IR of the input code, along with the corresponding Data Dependency Graph, which is crucial for the entire SST derivation process;

• The second part consists of an ad hoc manipulation of the obtained Data Dependency Graph in order to obtain the skeleton of the SST, which we call the streaming-oriented graph;

• After that, two concurrent phases take place. Both phases employ the polyhedral IR along with the streaming-oriented graph, and their function is to further characterize, respectively, the memory system and the computing system of the SST. The result of this process is an IR of the derived SST, which is used to generate the code of the modules that will be synthesized via HLS.


Figure 6.6: The Proposed Design Automation Flow. Each step is described in Sections 4 and 5.


The second macroblock is the SSTs queuing, which employs the estimated resource usage of an SST and the total amount of available resources in order to derive the maximum achievable queue length and generate the final RTL of the resulting hardware accelerator.
As a final remark, it can happen that the ISL's SCoP contains conditionals which are affine functions of the outermost loop, that is, the time dimension. In this case, the code must be transformed beforehand, as a conditional on the time dimension means that only certain code parts are executed within a time-step, i.e. code parts execute in a mutually exclusive manner, and this situation is unacceptable when deriving the SST. A solution to deal with this case is to apply the index-set splitting transformation along the outermost loop only. The affine conditionals can be used to effectively drive the splitting of the original loop nest, and after the transformation is performed the conditionals can be safely removed from the obtained code. The result of this process is a series of loop nests, each one iterating over a subset of the original iteration vector of the time dimension. We call them rSCoPs, since they actually still belong to a single SCoP, but despite this we treat each of them individually.
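The following fragment illustrates the index-set splitting just described on a toy kernel; the conditional threshold T1, the update formulae and the sizes are purely illustrative, and the two loop nests produced by the split correspond to the rSCoPs that the rest of the flow treats individually.

```cpp
// Before: an affine conditional on the time index t selects the formula.
#define N 1024
#define T 100
#define T1 40

void before(float A[N][N]) {
    for (int t = 0; t < T; ++t)
        for (int i = 1; i < N - 1; ++i)
            for (int j = 1; j < N - 1; ++j)
                if (t < T1)                      // affine conditional on the time dimension
                    A[i][j] = 0.5f * (A[i][j-1] + A[i][j+1]);
                else
                    A[i][j] = 0.5f * (A[i-1][j] + A[i+1][j]);
}

// After index-set splitting along the outermost loop: two rSCoPs,
// each iterating over a subset of the original time iteration vector.
void after(float A[N][N]) {
    for (int t = 0; t < T1; ++t)                 // first rSCoP: t in [0, T1)
        for (int i = 1; i < N - 1; ++i)
            for (int j = 1; j < N - 1; ++j)
                A[i][j] = 0.5f * (A[i][j-1] + A[i][j+1]);
    for (int t = T1; t < T; ++t)                 // second rSCoP: t in [T1, T)
        for (int i = 1; i < N - 1; ++i)
            for (int j = 1; j < N - 1; ++j)
                A[i][j] = 0.5f * (A[i-1][j] + A[i+1][j]);
}
```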

6.9 Proposed Architecture: the Streaming Stencil Time-step

An SST is in charge of executing a single ISL time-step and has in general one input stream and one output stream. When the ISL updates grid points employing constants or other arrays, the input streams are more than one. The components within an SST can be divided into two main categories, the first being the memory system, the second being the computing system.
The Memory System consists of a series of chains of modules connected via FIFO channels, one chain for every distinct input array, all responsible for feeding the computing architecture with the needed data. Each chain receives a single data stream - which is the array itself - and the modules within the chain represent the different read array references. These modules are the ones actually responsible for sending the data: they read every data element from their preceding FIFO and send it to the successive FIFO as well as to the computational system. From a high-level perspective, this arrangement can still be viewed as a single stream from which each module filters data only when needed, which is why we call them filters.


Figure 6.7: In (a): an example of SST resulting from the sample code on the left. Each array belongs to a chain and each array reference corresponds to a new module inside this chain, as shown by the red arrow in the figure. Each SST computes the code associated to a single, specific time-step. In (b): the architectural template of the accelerator for an ISL with multiple inputs. The complete hardware accelerator is the composition of multiple SSTs as of (a), arranged in a queue fashion. The green arrows represent the additional streams. As described in the text, the last SST has only the actual output stream.


The chain-like organization of filters ensures that the data is read only once and, at the same time, allows multiple concurrent accesses, also realizing the optimal Full Buffering.
The Computing System is composed of a series of modules that perform the actual computation, taking data from the memory system. Let us now provide some further details to better understand how they are arranged. Given that in the case of ISLs there is always a boundary to take into account, there is the possibility of a performance loss when hardware accelerators are employed, as the host processor could be forced to waste time reconstructing the array from the output. For this reason, our SST considers the boundary in an explicit way. This is also a prerequisite for the SSTs queuing: since the accelerator consists of a chain of replicas of a single SST, within an SST the output and the input must have the same form. To ensure that the output stream is rearranged in the exact same form as the input, we inserted a module - in addition to the modules that actually perform the computation - called demux, whose function is precisely the one just stated.
An SST has in general a single output stream. This is however not true when performing queuing with an ISL that takes multiple arrays as input (Figure 6.7). In fact, to provide the required data to all the SSTs within the queue, the SSTs are equipped with additional output streams, i.e. additional communication channels from the involved filter chains towards the next SST in the queue, besides the one that drives the proper accelerator output. This is not true for the last SST, which does not have the added output streams as they are not needed.
As shown in Figure 6.6, the first part of the proposed design automation flow is the derivation of the SST from the input rSCoP. Let us now describe the different phases that take place within this macroblock.

6.9.1 Streaming-oriented Graph Construction

The purpose of this first phase is to construct a graph, namely the streaming-oriented graph, which will be used as a skeleton for the SST. We use this approach as it can be automatically built on top of the already available tools of the polyhedral framework. Also, streaming architectures can be effectively represented as graphs, therefore a graph-like representation of the SST microarchitecture fits perfectly with its streaming nature - and eases the process of
information manipulation during the SST derivation.
The first task that must be performed is the data dependency analysis, in order to produce, for a given rSCoP, the corresponding Data Dependency Graph. The only dependencies that must be taken into account are the RAW dependencies, as an SST can be viewed as an in-order microarchitecture. After that, the Data Dependency Graph must be further pruned: the dependencies carried by the time dimension - i.e. the outermost loop - must be discarded since, as stated before, an SST implements the execution of a single time-step. There is however a case in which an edge on the time dimension can remain after this pruning task. This situation can occur whenever the rSCoP loop nest is imperfect, because flow dependencies within the same time-step could be carried exactly along the time dimension. As a result, we can define the following conditions for an edge to be removed during the pruning task:

Definition 32. Data Dependency Graph Edge Removal Conditions. Let us consider a Data Dependency Graph G = (N, E), with only RAW dependencies, in which each node N is marked with a growing number given by the execution order of each statement which, in the case of an rSCoP, also corresponds to the syntactic order. For the process of the streaming-oriented graph construction, an edge E must be removed if it represents a dependence carried along the outermost loop (the time dimension), and E is a self-loop or is directed from Ni to Nj with i > j.

After that, each node (i.e., a statement) of the Data Dependency Graph is expanded in the following way:

1. Each array reference becomes a new node. Since for each array assignment - i.e. statement - the polyhedral analysis (indeed, the parsing step taking place at the beginning) is able to identify the read and write operations, we employ this information to connect these nodes. Specifically, each read node of the given assignment will have an outgoing edge connected to the corresponding write node, as the write operation trivially depends on these data. Furthermore, read nodes with the same array reference are merged. Note that the write nodes symbolically represent the execution statement; hence, they are associated with the statement's assignment - i.e. formula - and its iteration domain.

2. The original edges of the Data Dependency Graph are now
connected, rather than to the entire statement, to the specific node - i.e. array reference - involved in the corresponding dependence.

The last step towards the construction of a streaming-oriented graph consists in an iterative removal of the "copy" dependencies, i.e. an outgoing edge from the write node W of a statement entering another node N, such that:

• N has no outgoing edges which enter back into W, causing a cycle;

• if N is a read node, the corresponding write node must not refer to the same array as W;

• if N is a read node, its data domain - i.e. the image of the corresponding statement iteration domain under the reference subscript function - matches the iteration domain of W;

• if N is a write node, its iteration domain is the same as that of W.

Notice that this reduction can only be made if the aforementioned write node W does not have other outgoing edges. This operation is described by the pseudocode of Algorithm 10.

Algorithm 10: Iterative Reduction of the Streaming-oriented Graph
  Input: the streaming-oriented graph G = (N, E)
  Output: the reduced version of G
  R ← ∅;
  foreach write node N ∈ G do
    if N has only one outgoing edge E ∧ E is directed to a node N' in a "copy" dependence relation then
      R ← R ∪ {(N, E, N')};
    end
  end
  while R ≠ ∅ do
    remove r = (N, E, N') from R;
    substitute the reference to N in N' with N's assignment formula;
    create a new write node W with the same formula as N';
    substitute r in G with the single node W;
    if W has only one outgoing edge E ∧ E is directed to a node W' in a "copy" dependence relation then
      R ← R ∪ {(W, E, W')};
    end
  end

The obtained graph is the skeleton of the SST microarchitecture, and we call it the streaming-oriented graph. In order to obtain the IR of an SST, both the computing system and the memory system must be properly characterized starting from this graph.


In the following two sections we explain the procedures to achieve this goal.
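As an illustration of the pruning of Definition 32, the following is a minimal C++ sketch over a toy graph representation; the struct fields and the boolean flag marking time-carried dependences are assumptions made for the example, since the actual flow works directly on the polyhedral IR produced by the analysis tools.

```cpp
#include <vector>

// Toy representation of a RAW-only Data Dependency Graph; statement indices
// reflect the execution (and, for an rSCoP, syntactic) order.
struct Edge {
    int src, dst;            // statement indices of the dependence endpoints
    bool carriedByTime;      // true if the dependence is carried by the outermost (time) loop
};

struct DDG {
    int numStatements;
    std::vector<Edge> rawEdges;
};

// Definition 32: remove an edge if it is carried along the time dimension and
// it is either a self-loop or goes from a later statement back to an earlier one.
DDG pruneForStreamingGraph(const DDG &g) {
    DDG pruned{g.numStatements, {}};
    for (const Edge &e : g.rawEdges) {
        bool removable = e.carriedByTime && (e.src == e.dst || e.src > e.dst);
        if (!removable)
            pruned.rawEdges.push_back(e);   // keep only the edges that survive pruning
    }
    return pruned;
}
```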

6.9.2 Computing System Extraction

The streaming-oriented graph provides us with the write nodes, which effectively become the computation modules of the computing system. Since the streams entering the filter chains of the memory system must be in the form of the entire array, the boundaries must be explicitly managed. To do so:

1. Firstly, we make an important remark: the array updated within the statement with the highest index - i.e. the last in the syntactic order - must be the array updated by the ISL. We employ this information to explicitly identify the output array (stream) of an SST as the one updated by the last write node;

2. A demux D is instantiated and associated to the last write node L, such that L → D. How the boundary is actually managed is explained in Section 6.9.3;

3. The streaming-oriented graph is traversed, and each node N whose statement updates the same array as L is also associated to D, i.e. N → D. If more than one write node is associated to the same demux, this means that there are different portions of the array which will be updated with different formulae. We call these portions equivalence groups. Since the write nodes are responsible for the update of these regions of the array space, we will refer interchangeably to both the regions and the corresponding write nodes as equivalence groups. Let us now introduce the definition of an equivalence group.

Definition 33. Equivalence Group. An equivalence group is the maximal set P of points of a given array, updated within the ISL computation, such that each point of P has the same update formula and is dependent on the same set of filters.

4. The process is repeated with all the remaining write nodes, whose array updates will not be forwarded as output of the SST. They are only needed to ensure the correctness of the computation of an SST.


After the pruning task of the Data Dependency Graph previously described, the only kind of cyclic dependencies that can be present are statements' self-dependencies. When deriving the streaming-oriented graph, this translates into cycles between a given write node and some of its input read nodes, indicating the presence of spatial dependencies between grid points. In this case, the write nodes involved in these cyclic dependencies require further manipulation. From now on we will refer to the write nodes with this characteristic as cyclic-write nodes, and to the read nodes which take part in the cyclic dependency as cyclic-read nodes. First of all, we claim that these cyclic dependencies can only involve read nodes whose subscript function is not of the form f(x) = Ix (with f(x) = Fx + a, where the subscript matrix F is the identity matrix I and a = 0). This condition is trivially enforced by the fact that these dependencies are between subsequent integral points of the iteration domain. Therefore the data space of a cyclic-read node (also called data domain) - i.e. the image of the cyclic-write node iteration domain under the reference subscript function - will always partially overlap with the boundary. Hence, since part of the data space of the cyclic-read node will come from the boundary and part from the cyclic-write node's output - i.e. two distinct data streams - the cyclic-read node will actually be implemented as two distinct filters, each belonging to a different chain. The result is that the iteration domain of each cyclic-write node is partitioned into subsets which depend on different sets of filters, even if the formula is actually the same. Hence, they are indeed different equivalence groups, as previously defined, even though we will refer to them as sd-equivalence groups (the prepended "sd" stands for spatial dependence) to differentiate them from the previously derived ones. To partition the original cyclic-write node iteration domain, we need two pieces of information, namely the iteration domain of the sd-equivalence groups - each of them being a partition of the original iteration domain - and the set of input filters of each sd-equivalence group. The extraction of this information requires a specific algorithm, whose pseudocode is presented in Algorithm 11. An important precondition for the applicability of the algorithm is that the operations of intersection and difference between polyhedra can be performed by employing the state-of-the-art library isl [201], a library for manipulating sets and relations of integer points bounded by linear constraints. The result is the set of sd-equivalence groups with the associated input filters and iteration
domain. Note that the read nodes which are not cyclic-read nodes are implicitly inputs of each sd-equivalence group, as they are implemented as a single filter.

Algorithm 11: sd-Equivalence Groups Extraction
  Input: I, the iteration domain of the cyclic-write node w;
         A, the set of cyclic-read nodes a_i = (f_w, f_nw), each with subscript function f_ai;
         f_w represents the part of the data domain which overlaps with I, while f_nw represents
         the part which overlaps with the boundary.
  Output: E, the set of equivalence groups e = (i_e, r_e), where i_e is the iteration domain
          and r_e the set of input filters.
  E ← ∅; P ← ∅;   {P is the set of the preimage portions of each a_i that overlap with I}
  foreach a_i ∈ A do
    D_ai ← data domain of a_i;
    S_ai ← D_ai ∩ I;
    p_ai ← (f_ai^(-1)(S_ai), a_i);   {the first element of the tuple is the preimage, the second the identifier}
    P ← P ∪ {p_ai};
  end
  i_0 ← ∩_i preImage(p_ai), p_ai ∈ P;
  E ← E ∪ {e_0 = (i_0, F = {f_w(a_i) | ∀ a_i ∈ A})};
  i_1 ← I − (∪_i p_i, p_i ∈ P);
  if i_1 ≠ ∅ then E ← E ∪ {e_1 = (i_1, ∅)}; end
  while P ≠ ∅ do
    p_aj ← firstElement(P); Temp ← P − p_aj; i_new ← preImage(p_aj);
    r_new ← ∅; r_new ← r_new ∪ {f_w(identifier(p_aj))};
    while Temp ≠ ∅ do
      t_ak ← firstElement(Temp); Temp ← Temp − t_ak; i_old ← i_new;
      i_new ← i_new ∩ preImage(t_ak);
      if i_new = ∅ then
        i_new ← i_old;
      else
        r_new ← r_new ∪ {f_w(identifier(t_ak))};
      end
    end
    foreach p_ai ∈ P do
      preImage(p_ai) ← preImage(p_ai) − i_new;
      if preImage(p_ai) = ∅ then P ← P − p_ai; end
    end
    E ← E ∪ {e = (i_new, r_new)};
  end

The complexity of the proposed algorithm is O(n^2) in the worst case, where n is the number of cyclic-read nodes. However, in real-world cases, n is always small enough for the algorithm to terminate in a reasonable amount of time (e.g. in our benchmarks n has never exceeded 4). It is also important to point out that this is, to the best of our knowledge, the first algorithm that
enables the automatic implementation, as a hardware accelerator, of ISLs with spatial dependencies between grid points.

6.9.3 Memory System Derivation

In order to characterize the memory system, the following steps have to be performed for each write node:

1. The cyclic-read nodes - whenever present - are implemented as two different filters, for the reasons previously described in Section 6.9.2. The remaining read nodes are implemented as a single filter. Filters are then clustered according to both the corresponding array name and the input stream - e.g. whether it is the output of the write node or not - to obtain the so-called chains.

2. Whenever a chain contains more than one filter, those filters are ordered from the lexicographic maximum to the lexicographic minimum. The input stream enters the maximum and flows, in reverse lexicographic order, down to the minimum, which means that those filters are linked together by the input stream. The sizes of the communication channels are computed as the modulus of the data distance vector, at a fixed and common - but otherwise arbitrary - iteration, between the subscript functions of the two filters. Following the principles of [59], we can define the conditions for the proper structuring of a chain as:

Definition 34. Chain Structuring Conditions. A chain of ordered filters {f1 → f2 → ... → fn} related to an array A must comply with the following two rules in order to achieve Full Buffering while also being deadlock-free:

• for every couple fi and fj such that i < j, fi must come before fj in the chain ordering, i.e. fi is the lexicographically greater of the two;

• the size W of the communication channel between a filter fi, with subscript function f_A^i, and a filter fj, with subscript function f_A^j, must satisfy W ≥ |δ_{f_A^i, f_A^j}(ν)|, i.e. W must be at least the modulus of the data distance vector between the two subscript functions evaluated at a common iteration ν. If W is minimal (equality holds), the Full Buffering is also optimal.


3. As stated before, there is also the boundary of the output array to consider. Within the chain referring to the same array updated by the write node (when the ISL has spatial dependencies, the chain considered is the one that does not take the write node's output as input stream), the filter - whenever present - whose data domain perfectly overlaps with the iteration domain, i.e. for which the subscript function is f(x) = Ix (intuitively, this is the "central" node of the chain), is the one in charge of routing the boundary towards the demux. If this node is not present (i.e. the update of a grid point is performed without reading its previous value), it is added and its only functionality is to route the boundary.

4. Each remaining communication channel, i.e. every channel within an SST except the ones inside the chains, will be of size 1.
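To make the channel sizing of Definition 34 concrete, the following sketch computes the minimal FIFO depths for a chain of filters of a 5-point, jacobi2D-like stencil streamed in row-major order; the reference offsets and the array width N are illustrative assumptions, not values produced by the flow.

```cpp
#include <cstdlib>
#include <cstdio>

// For a row-major array with N columns, the reference A[i+di][j+dj] touches
// the linearized offset (i+di)*N + (j+dj); the FIFO between two consecutive
// filters must hold at least the modulus of the distance between their
// offsets at a common iteration (Definition 34, minimal case).
struct Ref { int di, dj; };

int channelSize(const Ref &a, const Ref &b, int N) {
    long da = (long)a.di * N + a.dj;
    long db = (long)b.di * N + b.dj;
    return (int)std::labs(da - db);     // minimal (optimal) FIFO depth
}

int main() {
    const int N = 1024;
    // Filters ordered from the lexicographic maximum to the minimum reference.
    Ref chain[] = {{+1, 0}, {0, +1}, {0, 0}, {0, -1}, {-1, 0}};
    for (int k = 0; k + 1 < 5; ++k)
        std::printf("FIFO %d->%d : %d elements\n", k, k + 1,
                    channelSize(chain[k], chain[k + 1], N));
    // The depths sum to roughly two rows (2N), the classic buffering needed
    // for full data reuse on a 5-point stencil.
    return 0;
}
```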

6.9.4 SST IR and Code Generation

The purpose of the previous phases was to extract from the input rSCoP all the information needed to enable the actual implementation of an SST as a hardware microarchitecture. At the end of these phases, the information is encoded in the form of an IR. The IR contains:

• for each filter: an identifier; its data domain, which is the filtering condition; the input and output streams, i.e. the input and output communication channels;

• for each equivalence group: an identifier; its iteration domain; the array update formula; the input communication channels, as well as the output one;

• for each communication channel: an identifier; its minimum size.

Information about the demux is not needed, as its structure is inferred from the iteration domain of the associated equivalence groups. From the SST's IR the hardware equivalent is generated employing HLS; hence, it is required to generate the code for each module of the microarchitecture: demuxes, equivalence groups and filters. The communication channels will be implemented as FIFO queues.


The modules' code can be generated using state-of-the-art polyhedral model tools, integrated with the additional information needed in our case, such as read and write instructions on the input and output ports of the communication channels, respectively. Notice that the boundary transfer is actually performed with a single channel from the involved filter to the demux; no additional modules are inserted in between, as they would be unnecessary.
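A possible, purely illustrative encoding of such an IR is sketched below; the field names and the use of plain strings for domains and formulae are assumptions of this example (the actual flow keeps them as polyhedral objects).

```cpp
#include <string>
#include <vector>

// Hedged sketch of an in-memory encoding of the SST IR described above.
struct ChannelIR {
    std::string id;
    int minSize;                        // minimum FIFO depth (Definition 34)
};

struct FilterIR {
    std::string id;
    std::string dataDomain;             // filtering condition
    std::vector<std::string> inputs;    // input channel ids
    std::vector<std::string> outputs;   // output channel ids
};

struct EquivalenceGroupIR {
    std::string id;
    std::string iterationDomain;
    std::string updateFormula;
    std::vector<std::string> inputs;    // channels from the memory system
    std::string output;                 // channel towards the demux
};

struct SstIR {
    std::vector<FilterIR> filters;
    std::vector<EquivalenceGroupIR> groups;
    std::vector<ChannelIR> channels;    // demux structure is inferred from the groups
};
```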

6.9.5 Pipelining the SST

Pipelining the equivalence groups can be an effective optimization to increase the overall throughput. However, since an SST is a composition of independent modules, this optimization could lead to deadlock situations. To completely avoid them, we propose two solutions:

• The communication channels between the memory system and any equivalence group could be oversized, thus allowing the memory system to proceed even if the equivalence groups are stalled. However, to compute the communication channel sizes, the pipeline depth of each equivalence group must be known, something that in general is not easy to do. Also, this solution causes the SST to no longer enjoy the optimal Full Buffering property.

• Only the innermost loop of each equivalence group is pipelined, which results in the pipeline being flushed right when the memory system starts to send data to another equivalence group, completely avoiding the possibility of a deadlock within an SST.

One last, important remark: in the presence of spatial dependencies, the related equivalence groups cannot be pipelined.
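In Vivado HLS terms, the second solution roughly corresponds to placing the pipeline directive on the innermost loop only, as in the following hedged sketch of an equivalence group without spatial dependencies; the kernel, the stream ports and the grid size N are illustrative assumptions.

```cpp
#include <hls_stream.h>

// Hedged sketch: only the innermost loop is pipelined, so the pipeline is
// flushed before the memory system starts feeding another equivalence group.
template<int N>
void pipelined_group_sketch(hls::stream<float> &c, hls::stream<float> &w,
                            hls::stream<float> &e, hls::stream<float> &n,
                            hls::stream<float> &s, hls::stream<float> &out)
{
    for (int i = 1; i < N - 1; ++i) {
        for (int j = 1; j < N - 1; ++j) {
#pragma HLS PIPELINE II=1
            // Illustrative 5-point update taking its inputs from the filters.
            out.write(0.2f * (c.read() + w.read() + e.read()
                              + n.read() + s.read()));
        }
    }
}
```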

6.9.6 Scaling the Problem Size

Whenever the available on-chip memory is not large enough to allow the instantiation of all the communication channels, there is always the possibility to tackle the problem by trading bandwidth requirements for on-chip memory usage [59].


In practice, this means that there is always the possibility to remove the largest communication channel and replace it with an additional input data stream from the off-chip memory. The process can be repeated iteratively until the overall memory requirements are compatible with the available resources. Notice that this trade-off is however limited by the available bandwidth.


CHAPTER 7

Scaling Up: the SSTs Queuing Technique

IN this Chapter, we build on the findings of the previous one in order to devise a mechanism to generate chains of hardware accelerators. By adding Streaming Stencil Time-steps (SSTs) to the chain, we linearly increase the overall throughput of the system while maximizing its overall power efficiency. Being streaming in nature, it is easy to delineate a mechanism to chain multiple accelerators together, forming what is subsequently called an SST-queue. The methodology allows to effectively scale up the resulting computing system, as we show in the experimental part of this Chapter.

7.1 Introduction

The functionality of an SST is to perform the stencil computation associated to a single time-step. Hence, having a hardware accelerator built up of a lone SST would mean that, in order to perform more time-steps, the same SST should be employed over and over again, transferring data back and forth between the off-chip memory and the accelerator itself. These frequent off-chip memory transfers can
effectively bound the achievable performance, as an off-chip memory access is much more expensive in terms of latency than data transfers within the accelerator. This is an issue already known in the Iterative Stencil Loops (ISLs) literature, as their inherent memory boundedness is a major reason why obtaining high-performance codes and accelerators is in general an involved task. A possible solution to this problem is to have a technique that limits the off-chip memory transfers as much as possible, exploiting the available hardware resources to offload not only the computation within a single time-step, but also the data transfers across time-steps. Our SST queuing technique goes exactly in this direction.

7.2 SST Queuing

The key point of the queuing technique is that multiple SSTs are arranged in a queue fashion, which means that, within the queue, the output of one SST is the input of the next. Off-chip memory transfers occur only at the beginning and at the end of the queue. This implies that a queue of arbitrary length and a single SST involve the same volume of off-chip/on-chip memory transfers, which in turn means that the bandwidth requirements remain constant or, more precisely, are independent of the queue length. Therefore, the off-chip memory latency bottleneck is progressively alleviated as the queue length increases, since a greater volume of computation is performed with the same off-chip bandwidth requirements.
On some occasions the SSTs' data flow can be managed by an additional module, which is always aware of the progress of the computation as well as of the total number of time-steps to be performed, and which we called mux. This happens in two cases (which can also occur together):

• the number of SSTs - and hence the corresponding number of time-steps - of the hardware accelerator is not an exact divisor of the total number of iterations. In this case the mux is responsible for breaking the computational flow when the total number of time-steps of the ISL has been performed, and for redirecting the output to the off-chip memory;

• the queue length is large enough to be able to cycle the data flow, redirecting the output stream of the queue back to the
queue itself, instead of transferring it back to the off-chip memory. We call this condition queue looping, and it will be further detailed in section 7.4. In this case the mux is responsible for breaking this loop when the total number of time-steps has been executed.

It is important to note that the streaming nature of a single SST allows, at a certain point, all the SSTs to be concurrently processing a "portion" of the stream, therefore yielding a pipelined computation within the queue. Also, the system is fully scalable, as the only constraint is the total amount of available resources. The fact that the achievable queue length is completely independent of the available bandwidth allows linear scalability of the size of the accelerator: the queue length can be enlarged until the available resources are saturated.
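The composition of the queue can be sketched, in HLS-style C++, as a cascade of SST calls connected by on-chip streams, with off-chip transfers only at the two ends; the pass-through sst() placeholder, the fixed queue length and the DATAFLOW directive are illustrative assumptions rather than the generated top level.

```cpp
#include <hls_stream.h>

// Placeholder for the single time-step accelerator of Chapter 6; a real SST
// performs one ISL time-step, here it is a pass-through just to keep the
// sketch self-contained. N is the number of elements of the streamed array.
template<int N>
void sst(hls::stream<float> &in, hls::stream<float> &out) {
    for (long k = 0; k < (long)N; ++k)
        out.write(in.read());
}

// Hedged sketch of a queue of four SSTs (Q = 4): off-chip transfers happen
// only at the two ends, the intermediate FIFOs are on-chip.
template<int N>
void sst_queue4(hls::stream<float> &from_offchip, hls::stream<float> &to_offchip) {
#pragma HLS DATAFLOW
    hls::stream<float> s01, s12, s23;
    sst<N>(from_offchip, s01);
    sst<N>(s01, s12);
    sst<N>(s12, s23);
    sst<N>(s23, to_offchip);
}
```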

Figure 7.1: A visualization of the pipelined execution within the queue.

Furthermore, we claim that employing the SSTs queuing speeds up the Reduced Static Control Part (rSCoP) computation, and hence the throughput, by a pseudo-linear factor, dependent on the number of SSTs instantiated within the hardware accelerator, i.e. the queue length. We provide a simple proof of this claim.

Proof. Let us model the completion time C of a given ISL when using a single SST as hardware accelerator. We take as reference the state of a single grid point between two subsequent time-steps, thus:

C = T * N * (m_in + sst + m_out)

where T ∈ ℕ is the total number of time-steps, N ∈ ℕ is the total number of points to be updated, m_in ∈ ℕ is the number of clock cycles a given point takes to be transferred from the off-chip memory to the hardware accelerator, sst ∈ ℕ is the number of clock cycles an SST spends to actually update it, and m_out ∈ ℕ is the number of clock cycles it takes to be transferred back to the off-chip memory.


Then, if we employ queuing with a queue of length q ∈ ℕ, the completion time C_q is:

C_q = (T / q) * N * (m_in + q * sst + m_out)

Whenever T or N are large numbers, that is

T * N ≫ q    and    T * N ≫ m_in + sst + m_out,

it follows that

T * N * (m_in + sst + m_out) ≈ T * N * (m_in + q * sst + m_out)

hence:

C_q ≈ C / q

Remark 1. The speedup is pseudo-linear because of the approximation T * N * (m_in + sst + m_out) ≈ T * N * (m_in + q * sst + m_out), which is however a reliable approximation given the fact that N and T are very large in real ISLs.

Queue Length Estimation

Within the proposed design automation flow, the queue length estimation is a process that takes as input:

• the estimated resource usage of an SST given by the High Level Synthesis (HLS), as well as the resource usage of the communication channels, both platform dependent;

• the resource vector Rmax, which represents all the available resources;

• the total number t of time-steps of the rSCoP.

By employing this information, the estimation process is a simple division between Rmax and the sum of the resources needed for both the communication channels and the SST, limited however in that the queue length q must satisfy q ≤ t. This limitation is nevertheless virtually nonexistent, as ISLs are in general characterized by a very large number of time-steps. Interestingly, it should be noticed that there is an analytical bound to the queue length, and therefore a maximum number Qmax of iterations to be queued. When this analytical bound is reached, the stream can flow back into the queue instead of being transferred back to the off-chip memory, thus reaching the maximum achievable speed-up.


We call queue looping the condition in which the hardware accelerator is able to perform all the iterations of the ISL.

Definition 35. Qmax estimation. An SST holds a fraction f of the sum of all the arrays involved in the computation, whose total size is S_A, hence:

f = S_A / k

Therefore, the number of SSTs to be queued in order to perform queue looping is:

Qmax = min{ q | q * f ≥ S_A, q ∈ ℕ }

Lastly, we recall, as shown in Figure 6.6, that the actual implementation of the hardware accelerator could be an iterative process, since the estimated queue length may be too high to instantiate the accelerator whenever the available resources are not enough. This is indeed a platform-related situation, as it depends on the accuracy of the resource estimation provided by the specific HLS tool. A simple solution could be to iteratively decrement the queue length (Figure 6.6) until the accelerator fits onto the available resources.
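A back-of-the-envelope sketch of the queue length estimation and of the iterative decrement just mentioned is shown below; the resource categories, the per-SST figures and the fitsOnDevice() oracle are placeholders, not the actual estimation code of the flow.

```cpp
#include <algorithm>

// Hedged sketch: divide the available resources by the per-SST usage (which
// should already include the communication channels), cap by the number of
// time-steps t, then iteratively decrement until the design actually fits.
struct Resources { long bram, dsp, ff, lut; };

int estimateQueueLength(Resources rmax, Resources perSst, int t) {
    long q = std::min({rmax.bram / perSst.bram, rmax.dsp / perSst.dsp,
                       rmax.ff   / perSst.ff,   rmax.lut / perSst.lut});
    return (int)std::min<long>(q, t);       // the queue cannot exceed the time-steps
}

// Placeholder standing in for place-and-route feedback on the target device.
bool fitsOnDevice(int q) { return q <= 8; }

int finalQueueLength(Resources rmax, Resources perSst, int t) {
    int q = estimateQueueLength(rmax, perSst, t);
    while (q > 1 && !fitsOnDevice(q))        // iterative decrement until the design fits
        --q;
    return q;
}
```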

Handling More than One Input

The solution to handle more than one input array when performing the SSTs queuing has already been described in the previous chapter. The only detail that must be added is that, in such a case, the HLS must produce two different versions of the SST: one with the added streams, and one with only the actual output stream. The filter within a chain in charge of forwarding the data will be the first one. The way in which they are arranged is determined within the module integration phase.

7.3 Experimental Results

In order to test the proposed methodology and the resulting hardware accelerator, we select a number of significant ISL benchmarks, manipulate the code as described in Section 7.4, and generate the resulting system using the Xilinx Vivado Design Suite. The SST architecture derivation has been aided by state-of-the-art polyhedral analysis tools. Specifically, we employ the following components:

• Clan (Chunky Loop ANalyzer), to extract a polyhedral Intermediate Representation (IR) from the source code;


• Candl (Chunky ANalyzer for Dependencies in Loops), to compute polyhedral dependencies, and thus the corresponding Data Dependency Graph, from the polyhedral IR.

The SST's modules have been implemented using Vivado HLS (v2014.3.1). Both the SST modules' integration and the queuing have been performed using the Vivado (v2014.3.1) toolchain, which is also employed to synthesize and implement the resulting RTL. Synthesis and implementation have been performed on a machine with an Intel Core i7-3630QM and 8GB of DDR3 RAM. These specifications allowed us to push queuing only to a fraction of the total available resources, as we systematically ran out of memory during place and route with larger designs (i.e., with more SSTs enqueued). All the tests have been performed on a single VC707 board, which comes with a Virtex-7 XC7VX485T Xilinx Field Programmable Gate Array (FPGA) chip. Along with other resources, the board features 1GB of DDR3 DRAM, which we also employed in our tests as the reference off-chip memory.

7.3.1 Test Cases

Let us now focus on the benchmarks selected to validate both the methodology and the accelerator.

• jacobi2D: the fundamental computational kernel of 2D PDE solvers; due to its nature it is well suited to SST queuing. As previously stated, our limited computational resources forced us to limit the queue length in order to successfully complete the accelerator synthesis and implementation. However, with jacobi2D we were still able to push queuing up to a considerable number of SSTs without running out of memory during the synthesis process.

• jacobi3D: a 3-dimensional version of the above benchmark.

• seidel2D: this ISL contains spatial dependencies between grid point updates, hence it has no trivial and/or explicit parallelization opportunities. This is usually the most complex kind of kernel for automatic tools to analyze, and it is relevant to test how well our methodology performs in the worst case (from the standpoint of data dependencies).

• 3D31pt: a 31-point 3D, compute- and memory-intensive ISL with variable coefficients, and thus multiple input arrays. This
kernel is employed in different applications, and relates to 3D field solvers.

• heat3D: a discretized 3D heat equation stencil with non-periodic boundary conditions.

• 3D7pt: a 3D 7-point stencil from the Berkeley auto-tuner framework.

All the benchmarks have been implemented using single-precision floating point data types, both in hardware and on the Central Processing Unit (CPU).

7.3.2 Experimental Settings and Goals

We explored multiple operating frequencies during synthesis of the hardware accelerator. We could systematically synthesize without running into timing-closure issues and/or out-of-memory exceptions at 200MHz, which is our target frequency for the subsequent experiments. The datapath towards the off-chip memory is 32 bits wide and runs at 200MHz, so that the available bandwidth is 800 MB/s. The experiments on the CPU side were conducted on an Intel Xeon E5-1410, a quad-core processor running at 2.8 GHz, with a peak performance of 179.2 GFLOPS. The benchmarks were compiled using Pluto [46] with diamond tiling activated, a state-of-the-art technique for stencil optimization. For each of them we compiled both the original version and a diamond-tiled version running on 8 threads, where the optimal tile sizes have been determined empirically with a limited amount of search. Notice that for seidel2D, Pluto was not able to compile a diamond-tiled version, as its inherent sequentiality makes it unsuitable for parallelization-oriented optimizations. Our goal is to demonstrate different aspects of our design:

• the efficient usage of the on-chip memory resources realized by an SST allows us to treat problem sizes whose implementation would otherwise not be possible via direct synthesis or via trivial manipulation of the original source code through HLS,

• the scalability given by the SSTs queuing ensures a pseudo-linear increase in throughput, while keeping the off-chip bandwidth constant,


• improved power efficiency with respect to a general purpose processor, specifically when scaling out the design by queuing multiple SSTs.

7.3.3 Resource Usage

In this section we show the resource usage for all the benchmarks, expressed as a percentage of the total available resources (Figure 7.2). First of all, note how none of the benchmarks could be directly synthesized via HLS, nor synthesized without heavy hardware-oriented code restructuring (also applying non-trivial co-design considerations).
We have been able to successfully synthesize - i.e. without running out of memory during synthesis - 48 SSTs for jacobi2D. For 3D31pt we could not synthesize more than 4 SSTs without running out of memory during the synthesis process on our machine, while for 3D7pt, jacobi3D and heat3D we stopped at 8 SSTs for analogous reasons. Seidel2D has spatial dependencies within point updates, thus a pipelined version cannot be obtained; hence, only a non-pipelined version has been tested. We remark that in this case we were able to achieve, without running out of memory during synthesis, a queue length of 10 SSTs. For 3D31pt we provide the floorplans of the 1-SST (Figure 7.3) and 4-SST (Figure 7.4) designs to visualize resource consumption and energy proportionality.
As we show in Figure 7.2, our methodology consumes an amount of resources proportional to the number of SSTs enqueued, with a correspondingly proportional increase in throughput. Thus, the data clearly show that our methodology is energy-proportional, a very desirable feature of systems in general, and of heterogeneous systems in particular.

7.3.4 Performance

The performance of all the benchmarks is reported in Figure 7.5. Notice how, in all cases, the enlargement of the queue of SSTs corresponds to a pseudo-linear throughput speedup (as in Figure 7.6). Hence, the data show how our approach scales by design. It is also important to remark that the performance comparison between the proposed solution and the CPU is only intended to provide, at a glance, an insight into the potential of our accelerator.


Figure 7.2: Resource usage of the accelerator for all the benchmarks (one plot per benchmark: heat3D, jacobi2D, 3D31pt, 3D7pt, seidel2D, jacobi3D; resource usage in % versus number of enqueued SSTs). Legend: Flip Flops (FFs) (circles), Look Up Tables (LUTs) (triangles), Block RAMs (BRAMs) (squares), Digital Signal Processing Blocks (DSPs) (diamonds). Linear fits exhibit an R2 coefficient always greater than 0.99. We recall that R2 is the coefficient of determination, a coefficient ranging from 0 to 1 that indicates how well data fit the prediction model, which in this case is linear.
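For reference, the coefficient of determination quoted in the caption is the standard one:

    R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

where the y_i are the measured resource percentages, the \hat{y}_i the values predicted by the linear fit, and \bar{y} their mean; a value close to 1 therefore indicates an almost perfectly linear growth of resource usage with the number of SSTs.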

As previously stated, the embedded system employed for validation was able to deliver a maximum bandwidth of only 800 MB/s, which is much less than the bandwidth available to the CPU. Future work will address this issue, focusing on the design of a much more efficient underlying system, able to deliver sustained bandwidth and to interface our hardware accelerator with a proper host processor – probably employing PCI-Express as the communication bus.

7.3.5 Power Efficiency

Figure 7.7 shows the power efficiency normalized to the CPU. We measure power consumption at the wall, using a wattmeter. We claim that this methodology is sufficiently accurate for this experiment, as the CPU spends most of the power on a compute-intensive workload – CPU and RAM accesses, specifically. For all benchmarks, our final accelerator has a better power efficiency than the CPU. This is in line with state-of-the-art works; however, our methodology allows us to push the utilization of the target device, and thus to push the power efficiency to the limit imposed by the algorithm, device, and synthesis process (which all contribute to setting a hard limit on the power efficiency of a device).


Figure 7.3: Floorplan of the 1 SST design implementing 3D31pt. In blue, the MicroBlaze used as host CPU to validate the accelerator. In green, the DDR3 controller. In red, the DMA. In ocher, the AXI interconnect of the design. In lily, violet, purple and pink, the various SSTs.


Figure 7.4: Floorplan of the 4 SSTs queued design implementing 3D31pt. In blue, the MicroBlaze used as host CPU to validate the accelerator. In green, the DDR3 controller. In red, the DMA. In ocher, the AXI interconnect of the design. In lily, violet, purple and pink, the various SSTs. Observe how the relative compactness of each accelerator allows the design to scale linearly in resources with each added IP.


Figure 7.5: Throughput, in GFLOPS (one plot per benchmark: heat3D, jacobi2D, 3D31pt, 3D7pt, seidel2D, jacobi3D). CPU refers to the Intel Xeon E5-1410. "diamond" still refers to the throughput obtained on the Xeon, but running an 8-threaded version of the benchmarks, compiled using Pluto with diamond tiling activated. Finally, the other categories indicate the number of enqueued SSTs.

The data show that our methodology not only is energy-proportional, but also allows the system to approach its power efficiency limit – again a desirable feature when thinking about larger, scaled-out systems.


Figure 7.6: Throughput, in GFLOPS; hardware accelerators only (one plot per benchmark: heat3D, jacobi2D, 3D31pt, 3D7pt, seidel2D, jacobi3D). The number on the x-axis represents the number of enqueued SSTs. Also in this case, linear fits exhibit an R2 coefficient always greater than 0.99.

Figure 7.7: Power efficiency ratio, normalized to the CPU (one plot per benchmark: heat3D, jacobi2D, 3D31pt, 3D7pt, seidel2D, jacobi3D). CPU refers to the Intel Xeon E5-1410. "diamond" still refers to the Xeon, but running an 8-threaded version of the benchmarks, compiled using Pluto with diamond tiling activated. Finally, the other categories indicate the number of enqueued SSTs.


7.4 SST Generator

We developed a compiler, SST Generator, that automates the generation of an SST-based architecture.

The methodology used to obtain the Streaming Stencil Time-step (SST) hardware accelerator is divided into three steps: extraction, intermediate representation, and synthesis. During the extraction step, the pieces of code to accelerate in hardware are extracted from a source file written in C. This is done using the clan library, which implements the parser and produces a polyhedral representation of the relevant code (in our case, the code of the Iterative Stencil Loop (ISL)). After the extraction phase, the code passes through two manipulation steps, namely Streaming Oriented Graph (SOG) and Streaming Oriented Architecture Graph (SOAG) generation. Finally, we translate the resulting data structure into hardware.

The first IR is obtained from the exact Data Dependency Graph (DDG) containing the Read After Write (RAW) dependencies of the previously extracted code. The DDG is computed using the candl library, which analyzes the output produced by the clan library. The first intermediate representation is the result of manipulations performed on the DDG. The dependence arcs that represent a dependence over the temporal dimension of the ISL are removed: the dependencies that matter here are those involved in the flow of data within a single time-step, so arcs along the time dimension are superfluous.

After the removal of the superfluous arcs, each node of the DDG is substituted with the matrix access that the respective statement performs. We call this new graph SOG. In a SOG, each node represents an access to a matrix performed by a statement. The access can either be a read or a write access, and each arc shows the flow of data between such accesses. All the read nodes are connected to their respective write node. The SOG is further simplified by removing paths that do not alter data, whose only effect is delaying the propagation of that information (copy dependencies).

Afterwards, the SOG is further translated into the last intermediate representation, the SOAG. The SOAG stores all of the information required to build the actual hardware architecture. Each node in a SOAG models a hardware component that will be part of the final architecture, and each arc between two nodes models what will be a real wiring between hardware modules.
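As a concrete picture of what the extraction step operates on, the following C fragment is a minimal sketch of an input source file, assuming clan's pragma delimiters are used to mark the SCoP; only the region between the pragmas is turned into a polyhedral representation, while the rest of the file is left untouched. Names, sizes, and the in-place Gauss-Seidel-style update are illustrative assumptions.

    /* Minimal sketch of an extraction input: a Gauss-Seidel-style in-place
       2D update; the region between the pragmas is the SCoP handed to clan. */
    #define N 1000

    void seidel2d_like(int T, float A[N][N])
    {
        int t, i, j;
    #pragma scop
        for (t = 0; t < T; t++)
            for (i = 1; i < N - 1; i++)
                for (j = 1; j < N - 1; j++)
                    A[i][j] = 0.2f * (A[i][j] + A[i - 1][j] + A[i + 1][j]
                                      + A[i][j - 1] + A[i][j + 1]);
    #pragma endscop
    }

From such an input, the RAW dependencies computed by candl give the DDG that the SOG and SOAG passes described above manipulate.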


                       Throughput   Power Efficiency   FF     LUT    BRAM   DSP
                       (GFLOPS)     (GFLOPS/W)         (%)    (%)    (%)    (%)
CPU + diamond tiling   20.458       0.455              -      -      -      -
8 SSTs                 3.674        0.636              4.70   8.31   5.44   2.71
32 SSTs                14.345       2.299              9.81   15.48  10.34  10.86
48 SSTs                21.343       3.319              13.22  20.29  13.69  16.29
72 SSTs                31.204       4.649              18.33  27.48  18.72  24.43

Table 7.1: jacobi2D.

This graph contains three kinds of nodes. First, there are the filters, which memorize data from the data stream and feed the component performing the kernel's actual computation; FIFOs accumulate the temporary data that will be used in future iterations/time-steps. Second, there is the computational kernel, which is the part that performs the actual computation. Finally, there is the demux, a component in charge of reconstructing the output stream. During the construction of the SOAG, the required information is obtained using Polyhedral Analysis (PA). PA allows us to decide the order the filters need to have to avoid deadlock in the final architecture, to compute the minimum size of each FIFO, and to tailor all the modules of the SST to the particular ISL taken into account.

There are a few steps to perform in order to obtain a SOAG from a SOG. For each different matrix accessed in the ISL, an input channel needs to be instantiated; it will contain the modules filtering the streamed matrix to the kernel. Within each channel, the filters need to be ordered in inverse lexicographical order to avoid deadlocks that would halt the architecture. Between each pair of communicating filters we instantiate a correctly sized FIFO; the minimum size of each FIFO is computed through the analysis of the iteration domain associated with the ISL. Lastly, the demux is connected to the kernel and the input channel, allowing the correct reconstruction of the output matrix. The synthesis step, using the information coming from the SOAG, produces a bitstream that can be used to program a Field Programmable Gate Array (FPGA) board. The SST itself is stored in a separate data structure in order to be able to build arbitrarily deep SST queues.
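To make the filter/FIFO/kernel/demux arrangement more concrete, the following C sketch emulates in software the data flow of a single SST for a 5-point 2D stencil: a sliding window of 2·N+1 stream elements plays the role of the FIFO storage between filters (two matrix rows plus one element), and one output element is produced per input element once the window is primed. All names, sizes, and the 5-point update are illustrative assumptions; this is a functional model, not the generated RTL.

    #include <stddef.h>

    #define N   1000              /* row length (illustrative)               */
    #define WIN (2 * N + 1)       /* stream window of a 5-point stencil:     */
                                  /* two full rows plus one element          */

    /* Software emulation of one SST: the input stream is buffered in a
     * sliding window (the role of the filter FIFOs), the computational
     * kernel fires once the window is full, and the result is emitted to
     * the output stream (the role of the demux). Boundary rows/columns are
     * ignored for brevity; real SSTs use FIFOs, not an explicit shift loop. */
    static void sst_timestep(const float *in, float *out, size_t len)
    {
        float buf[WIN];           /* models the chained FIFO storage          */
        size_t filled = 0;

        for (size_t k = 0; k < len; k++) {
            if (filled == WIN) {                      /* filters: shift in    */
                for (size_t s = 1; s < WIN; s++)      /* the newest sample    */
                    buf[s - 1] = buf[s];
                buf[WIN - 1] = in[k];
            } else {
                buf[filled++] = in[k];
            }

            if (filled == WIN) {
                /* computational kernel: 5-point update on the buffered taps */
                float north  = buf[0],     south = buf[2 * N];
                float west   = buf[N - 1], east  = buf[N + 1];
                float center = buf[N];
                /* demux: emit the updated value for stream position k - N   */
                out[k - N] = 0.2f * (center + north + south + west + east);
            }
        }
    }

Queuing SSTs then amounts to feeding the output stream of one such stage directly into the input of the next, which is why the off-chip bandwidth requirement stays constant while throughput grows with the queue length.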


Figure 7.8: Experimental setup. Power absorption is measured at AC level, which is common to both the host system, FPGA 1, and FPGA 2.

                       Throughput   Power Efficiency   FF     LUT    BRAM   DSP
                       (GFLOPS)     (GFLOPS/W)         (%)    (%)    (%)    (%)
CPU                    0.910        0.020              -      -      -      -
8 SSTs                 0.275        0.045              11.17  22.46  6.22   6.86
12 SSTs                0.411        0.065              15.25  30.71  7.38   10.29
16 SSTs                0.546        0.083              19.34  38.97  8.67   13.71
24 SSTs                0.813        0.117              27.52  55.47  11.24  20.57

Table 7.2: seidel2D.

7.5 Dual FPGA Results

In the experiments we connect two FPGAs using a serial, point-to-point communication IP called Aurora, and instantiate on them an increasingly deep SST queue; we repeat this procedure for each benchmark. We collect data about throughput, logic occupancy, and power efficiency; the data are compared to state-of-the-art multicore CPU implementations in terms of throughput and power efficiency. We finally elaborate on the scalability of the approach.

7.5.1 Experimental Settings

The proposed hardware implementation has been generated using Vivado Design Suite 2015.2. We employ SST Generator to generate the Register-Transfer Level (RTL) description and the IP cores for each SST module, starting from its C language specification.


Figure 7.9: Experimental setup and floorplan of the jacobi3D architecture on the multi-FPGA system with an SST queue of 16 SST instances (FPGA chip is XC7VX485T-2FFG1761C). FPGA 1 exchanges data with host memory via PCI Express (QuickPCIe) and with FPGA 2 via Aurora over full-duplex RX/TX SMA cables; the SSTs (SST0–SST15) are distributed across the two devices.

Figure 7.10: Power efficiency and throughput comparison between the CPU and our implementation (log scale). One plot per benchmark (jacobi2D, seidel2D, jacobi3D, heat3D): throughput in GFLOPS (left axis) and power efficiency in GFLOPS/W (right axis) versus the number of SSTs.

Synthesis and implementation have been executed on an AMD Athlon II X4 640 machine, featuring 8GB of DDR3 RAM and running Ubuntu 14.04 x64. Due to the tight timing requirements for synthesis and implementation, and the limited resources of the synthesis machine, we eventually ran out of synthesis machine RAM and could exploit slightly more than half of the available resources on each FPGA.


Figure 7.11: Resource consumption of increasingly lengthy SST-queues. One plot per benchmark (jacobi2D, seidel2D, jacobi3D, heat3D): percentage of BRAMs, DSPs, FFs and LUTs used versus the number of SSTs.

Figure 7.12: Power efficiency (GFLOPS/W) comparison between the state-of-the-art CPU implementation (in blue) and our implementation (in red), for jacobi2D, seidel2D, jacobi3D and heat3D, as a function of the number of SSTs.

All the tests have been performed with two VC707 boards and with a host PC featuring an Intel Core i7 2675QM, 4GB of DDR3 RAM, and a 256GB SSD, running Ubuntu 14.04 x64. One of the two boards has been connected to the host PC through the PCI Express interface; the other was connected to the first through four coaxial cables, two for the TX channel and two for the RX channel, as per the Aurora IP requirements (see Figures 7.8 and 7.9). In all tests we stream a fixed amount of data to the system; we measure the system when in "steady state" (i.e., all SSTs are effectively computing on the data stream). Data are received by the first FPGA, processed by the SSTs therein, sent to the second FPGA through Aurora, further processed by the SSTs therein, and finally returned to the first FPGA, where the exchange with host memory happens, again via PCI Express. Tests have been executed with an increasing number of SSTs. Both the transfer of the data and the measurement of the performance have been managed by a custom application based on the provided QuickPCIe Intellectual Property (IP) core.


                       Throughput   Power Efficiency   FF     LUT    BRAM   DSP
                       (GFLOPS)     (GFLOPS/W)         (%)    (%)    (%)    (%)
CPU + diamond tiling   18.982       0.422              -      -      -      -
4 SSTs                 0.679        0.119              4.87   8.12   9.91   0.85
6 SSTs                 0.998        0.173              5.81   9.22   12.92  1.28
8 SSTs                 1.318        0.226              6.74   10.31  15.93  1.71
16 SSTs                2.429        0.400              10.48  14.69  27.96  3.43
32 SSTs                4.260        0.654              17.95  23.44  52.04  6.86
48 SSTs                5.695        0.823              25.42  32.18  76.12  10.29

Table 7.3: jacobi3D.

                       Throughput   Power Efficiency   FF     LUT    BRAM   DSP
                       (GFLOPS)     (GFLOPS/W)         (%)    (%)    (%)    (%)
CPU + diamond tiling   22.988       0.511              -      -      -      -
4 SSTs                 0.869        0.146              6.97   10.06  9.91   1.72
8 SSTs                 1.665        0.268              10.93  14.39  15.93  3.43
16 SSTs                3.094        0.458              18.85  23.02  27.96  6.86
32 SSTs                5.421        0.691              34.71  40.32  52.28  13.72

Table 7.4: heat3D.

We measure power consumption at the common AC power outlet.

7.5.2 Test Cases

The architecture has been validated against four benchmarks comprising relevant computational kernels in the linear algebra domain; their representativeness justifies their wide adoption in the literature [28, 125]. These benchmarks are suitable for polyhedral-based optimisations, and their structure permits performing both SST generation and queuing. Additionally, their popularity allows us to fairly compare with other works, specifically with [46]. We run the benchmarks activating diamond tiling on a more recent platform, which further improves the results obtained by the authors, as per the data shown in Table 1 and Table 2. All the benchmarks have been tested with single precision floating point data types; the problem dimension for 2D benchmarks is 1000x1000, and 100x100x100 for the 3D ones. All matrices are dense. The total number of floating point operations is computed by inspection of the statically defined loops.
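As an illustration of how the operation count can be read off the statically defined loops, consider a 5-point, Jacobi-style 2D update on the 1000x1000 grid used here; assuming, purely as an example, 4 additions and 1 multiplication per updated point (boundaries excluded) and T time-steps, the count is:

    \mathrm{FLOP} \approx T \cdot (N-2)^2 \cdot 5 = T \cdot 998^2 \cdot 5 \approx 4.98 \times 10^{6} \cdot T

The actual per-point operation count depends on the specific benchmark kernel.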


7.5.3 Experimental Results

Referring to Figure 7.10, we observe how throughput increases proportionally with the length of the SST queue, which is consistent with the rationale behind the SST approach. Power efficiency tends to improve and then saturate with an increasing number of SSTs. This is a desirable feature, as maximum FPGA utilization directly translates into maximum power efficiency. After a given SST-queue length, the system performs at full power efficiency while – theoretically speaking – arbitrarily increasing the throughput (by increasing the length of the queue). Additionally, we observe a linear increase in resource consumption, as reported in Figure 7.11; this is expected, as the computing system is the composition, in a chain fashion, of multiple copies of the same IP/SST. Observe how higher-dimensional ISLs consume more local memory, and that seidel2D, which is more control-intensive than the other kernels, consumes more LUTs to implement the custom control logic. These observations lead us to claim that our system is energy-proportional: resources, throughput, and power consumption grow proportionally together.

Let us now focus on the comparison with the optimal, state-of-the-art multicore implementations. We regard these CPU implementations as optimal because they are automatically parallelized using a locality-aware, parallelizing compiler adopting the diamond-tiling optimization strategy [46] (except for seidel2D, which is a benchmark not amenable to diamond tiling). As far as we know, this is the best approach to the simultaneous extraction of parallelism and locality for our workloads. As the data show, after a given SST queue length, the power efficiency of the proposed solution becomes better than the CPU's; up to 10.11x in the case of seidel2D. We expect longer SST queues to translate into further improvements, as Figure 7.10 shows that we have not yet reached the power efficiency limit of our approach with this number of SSTs. However, our synthesis machine cannot synthesize larger designs due to lack of main memory. Nonetheless, there is still margin for further improvements.


CHAPTER 8

Conclusions

In this thesis we introduced, explored, and partly addressed the problem of how to improve the power efficiency of reconfigurable hardware based computing systems, with a specific focus on high performance computing, polyhedral analysis, high level synthesis, and their interrelationships.

In the first part, we elaborated on the advantages of using Partial Reconfiguration (PR) to generate high performance reconfigurable hardware based accelerators. After demonstrating gains in terms of energy delay product and execution time, we introduced a mechanism to partition a set of hardware accelerators in order to distribute the workload among multiple computing elements, in the light of future multi-FPGA platforms.

Afterwards, we delineated how the Polyhedral Model (PM) can be used to restructure the code to achieve better parallelization, obtaining as a final result an increase in the efficiency of the hardware circuits. The proposed methodology faces the problem of creating an architecture suitable for accelerating the execution of data parallel codes when tiles are synthesized out of them.


In the experimental tests we show a slight increase in power consumption for the more parallel architecture with respect to the basic High Level Synthesis (HLS) implementation, which is however largely compensated by speedups ranging from 3x to 7x in overall execution time. In fact, the worst-case power consumption increase is in the order of 6% (about 100mW on our platform), a small term compared to the large gains in throughput, which directly translate into better power efficiency. As explained in the previous chapters, this methodology is effective only for intrinsically parallel algorithms, with the additional limitation induced by the purity of the input codes. As already explained, implementing synchronized access to shared data not only results in bottlenecks, but also requires considerable design effort. However, as the focus is set on scientific algorithms, this is a limited (if not entirely irrelevant) issue, as most of those algorithms are easily expressible in this form.

Subsequently, we proposed a design automation flow to accelerate a target ISL on FPGA, consisting of a queue of architectures demanded to perform a single ISL time-step, the SST. We efficiently exploit the available resources realizing an optimal Full Buffering, and ensure a quasi-linear throughput speedup when enqueuing multiple SSTs, taking into account both memory and bandwidth considerations. We automatically derive the accelerator from the original source code, employing the polyhedral model in combination with HLS. Experimental results show an efficient usage of the on-chip memory resources realized by an SST, allowing us to deal with problem sizes that would otherwise be untreatable with a direct synthesis of the original code via HLS. We also show that the SSTs queuing technique ensures a pseudo-linear increase in throughput obtained with constant bandwidth requirements. Also, the comparison shows that the proposed accelerator has the potential to outperform all comparable solutions thanks to its inherent scalability and its specialization with respect to ISLs; the overall power efficiency rivals the currently available top power efficient systems, and is expected to grow with the increase of the number of SSTs within the queue (i.e., with the increase of logic utilization).

8.1 Future works

The first goal is to implement the final automatic toolchain. The second goal is to tighten the integration between the tasks carried out by


different researchers, as for example in [57, 60, 62]. The third future goal will be to prove the feasibility of the multi-FPGA solution in order to implement memory-intensive algorithms. Extensive tests on other computational kernels split across multiple FPGAs will be required to demonstrate how performance scales with the number of computing elements. In the Low-Level Virtual Machine (LLVM) related area, we already use Clang (via the Chunky Loop ANalyzer (CLAN)) to translate code into the PM. Until now we have been using source-to-source transformation to create a C file to feed the Xilinx tools. Following the typical LLVM schema (front-end, IR, back-end), we could implement a different tool that exploits the intermediate representation directly to generate the RTL of the circuits: in other words, we could implement a typical LLVM back-end. The first step in this direction will be to analyze and profile what has already been developed by the LegUp project. Finally, the integration of this approach with large-scale cluster software systems (like Hadoop) will prove beneficial in order to demonstrate the power efficiency figures in a more realistic computing system.


Bibliography

[1] A Free Framework for the High-Level Synthesis of Complex Applications. http://panda.dei.polimi.it/. [2] Cuda home @ www.nvidia.com. [3] Green500 List, November 2014. http://www.green500.org/lists/green201411. [4] Green500 Power Efficiency Measurement Procedure, available at http://bit.ly/1MnTMEb. [5] high-performance-xeon-phi-coprocessor-brief @ www.intel.com. [6] index @ www.exascale-computing.eu. [7] LegUp. http://legup.eecg.utoronto.ca/. [8] mantle @ www.amd.com. [9] Maxeler MaxCompiler. http://www.maxeler.com/products/software/maxcompiler/. [10] Partial Reconfiguration Cost Calculator (PRCC), available at http://users.isc.tuc.gr/ kpapadimitriou/prcc.html . [11] @ www..com. [12] PoCC (Polyhedral Compiler Collection). http://www.cs.ucla.edu/~pouchet/software/pocc/. [13] sdaccel @ www.xilinx.com. [14] The Human Brain Project. https://www.humanbrainproject.eu/.


[15] Xilinx Vivado Design Suite, available at http://www.xilinx.com. [16] List scheduling with and without communication delays. Parallel Comput- ing, 19(12):1321 – 1344, 1993. [17] IEEE Standard for Standard SystemC Language Reference Manual. IEEE Std 1666-2011 (Revision of IEEE Std 1666-2005), pages 1–638, Jan 2012. [18] The polyhedral compiler collection, 2013. [19] A.V. Aho. Compilers: Principles, Techniques, & Tools. Addison-Wesley series in computer science. Pearson/Addison Wesley, 2007. [20] Mohammad H Al-towaiq. Parallel Implementation of the Gauss-Seidel Al- gorithm on k -Ary n -Cube Machine. 2013(January):177–182, 2013. [21] Christophe Alias, Bogdan Pasca, and Alexandru Plesco. FPGA-specific Syn- thesis of Loop-nests with Pipelined Computational Cores. Microprocess. Mi- crosyst., 36(8):606–619, November 2012. [22] Corinne Ancourt and François Irigoin. Scanning Polyhedra with DO Loops. SIGPLAN Not., 26(7):39–50, April 1991. [23] F. Arandiga, A. Cohen, R. Donat, and B. Matei. Edge detection insensitive to changes of illumination in the image. Image and Vision Computing, 28(4):553 – 562, 2010. [24] Pawan Kumar Aurora. Mmulti-level graph partitioning, 2007. [25] Avnet Design Services ZedBoard. http://www.zedboard.org. [26] C. Baaij, Matthijs Kooijman, J. Kuper, A. Boeijink, and Marco Gerards. CλaSH: Structural Descriptions of Synchronous Hardware Using Haskell. In Digital System Design: Architectures, Methods and Tools (DSD), 2010 13th Euromicro Conference on, pages 714–721, Sept 2010. [27] J. Bachrach, Huy Vo, B. Richards, Yunsup Lee, A. Waterman, R. Avizienis, J. Wawrzynek, and K. Asanovic. CHISEL: Constructing Hardware in a Scala Embedded Language. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, pages 1212–1221, June 2012. [28] Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. Tiling Sten- cil Computations to Maximize Parallelism. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pages 40:1–40:11, Los Alamitos, CA, USA, 2012. IEEE Computer So- ciety Press. [29] S. Banerjee, E. Bozorgzadeh, and N.D. Dutt. Integrating Physical Con- straints in HW-SW Partitioning for Architectures With Partial Dynamic Re- configuration. IEEE Transactions on Very Large Scale Integration (VLSI) Sys- tems, 14(11):1189–1202, 2006. [30] S. Barnard. Pmrsb: Parallel multilevel recursive spectral bisection. In Proc. Supercomputing, 1995.


[31] S. T. Barnard and H. D. Simon. A fast multilevel implementation of recur- sive spectral bisection for partitioning unstructured problems. In Proceed- ings of the 6th SIAM Conference on Parallel Processing for Scientific Computing, pages 711–718, 1993.

[32] Richard F Barrett, Shekhar Borkar, Sudip S Dosanjh, Simon D Hammond, Michael A Heroux, X Sharon Hu, Justin Luitjens, Steven G Parker, John Shalf, and Li Tang. On the Role of Co-Design in High Performance Com- puting. vol, 24:141–155, 2013.

[33] C. Bastoul. Extracting polyhedral representation from high level languages. Technical report, LRI, Paris-Sud University, 2008. Related to the Clan tool.

[34] Cédric Bastoul. Efficient Code Generation for Automatic Parallelization and Optimization. In Proceedings of the Second International Conference on Parallel and Distributed Computing, ISPDC’03, pages 23–30, Washington, DC, USA, 2003. IEEE Computer Society.

[35] Cédric Bastoul. Code Generation in the Polyhedral Model Is Easier Than You Think. In Proceedings of the 13th International Conference on Parallel Archi- tectures and Compilation Techniques, PACT ’04, pages 7–16, Washington, DC, USA, 2004. IEEE Computer Society.

[36] Cédric Bastoul. Improving Data Locality in Static Control Programs. PhD the- sis, Université Pierre e Marie Curie, December 2004.

[37] Cédric Bastoul, Albert Cohen, Sylvain Girbal, Saurabh Sharma, Olivier Temam, A Group, and Inria Rocquencourt. Putting Polyhedral Loop Trans- formations to Work. In In Workshop on Languages and Compilers for Parallel Computing (LCPC’03), LNCS, pages 209–225, 2003.

[38] Mohamed-walid Benabderrahmane and Albert Cohen. The Polyhedral Model Is More Widely Applicable Than You Think. CC’10/ETAPS’10 Pro- ceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Constructio, 2010.

[39] Marsha J. Berger and Joseph E. Oliger. Adaptive Mesh Refinement for Hy- perbolic Partial Differential Equations. Technical report, Stanford, CA, USA, 1983.

[40] Keren Bergman, Shekhar Borkar, Dan Campbell, William Carlson, William Dally, Monty Denneau, Paul Franzon, William Harrod, Kerry Hill, Jon Hiller, et al. Exascale computing study: Technology challenges in achiev- ing exascale systems. Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep, 15, 2008.

[41] A.J. Bernstein. Analysis of Programs for Parallel Processing. Electronic Com- puters, IEEE Transactions on, EC-15(5):757–763, Oct 1966.

[42] C. Bichot and P. Siarry. Graph Partitioning. 2011.


[43] Rainer Bleck, Claes Rooth, Dingming Hu, and Linda T. Smith. Salinity- driven Thermocline Transients in a Wind- and Thermohaline-forced Isopy- cnic Coordinate Model of the North Atlantic. J. Phys. Oceanogr., 22(12):1486– 1505, December 1992.

[44] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An Efficient Multi- threaded Runtime System. SIGPLAN Not., 30(8):207–216, August 1995.

[45] Cristiana Bolchini, Antonio Miele, and Chiara Sandionigi. Auto- mated Resource-Aware Floorplanning of Reconfigurable Areas in Partially- Reconfigurable FPGA Systems. In In Proceedings of FPL ’11, pages 532–538, 2011.

[46] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. Proceed- ings of the 2008 ACM SIGPLAN conference on Programming language design and implementation - PLDI ’08, page 101, 2008.

[47] R. B. Boppana. Eigenvalues and graph bisection , an average-case analysis (extended abstract). In Proceedings of the 28th Symposium on Foundations of Computer Science, pages 280–285, 1987.

[48] T. N. Bui and B. R. Moon. Genetic algorithm and graph partitioning. IEEE Trans-actions on Computers, 45(7), July 1996.

[49] T. N. Bui and L. C. Strite. An ant system algorithm for graph bisection. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO ’02), W. B. Langdon et al. (Eds.), pages 43–51, 2002.

[50] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kam- moona, Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. LegUp: High-level Synthesis for FPGA-based Processor/Accelerator Sys- tems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’11, pages 33–36, New York, NY, USA, 2011. ACM.

[51] Bing-Yang Cao and Ruo-Yu Dong. Nonequilibrium Molecular Dynamics Simulation of Shear Viscosity by a Uniform Momentum Source-and-sink Scheme. J. Comput. Phys., 231(16):5306–5316, June 2012.

[52] A. Chambolle. An algorithm for total variation minimization and applica- tions. Journal of Mathematical Imaging and Vision, 20:89–97, 2004.

[53] Shuai Che, Jie Li, Jeremy W. Sheaffer, Kevin Skadron, and John Lach. Ac- celerating Compute-Intensive Applications with GPUs and FPGAs. 2008 Symposium on Application Specific Processors, pages 101–107, June 2008.

[54] Tatenda M. Chipeperekwa. Caracal: unrolling memory bound stencils. Technical report, - San Diego, La Jolla CA, USA, 2013.


[55] Matthias Christen, Olaf Schenk, and Helmar Burkhart. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Com- putations on Modern Microarchitectures. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, IPDPS ’11, pages 676–687, Washington, DC, USA, 2011. IEEE Computer Society. [56] J.A. Clemente, V. Rana, D. Sciuto, I. Beretta, and D. Atienza. A hybrid mapping-scheduling technique for dynamically reconfigurable hardware. In in Proceedings of FPL ’11, pages 177–180, 2011. [57] J. Cong. Behavior and communication co-optimization for systems with sequential communication media. 2006 43rd ACM/IEEE Design Automation Conference, pages 675–678, 2006. [58] Jason Cong, Muhuan Huang, and Yi Zou. Accelerating Fluid Registration Algorithm on Multi-FPGA Platforms. 2011 21st International Conference on Field Programmable Logic and Applications, pages 50–57, September 2011. [59] Jason Cong, Peng Li, Bingjun Xiao, and Peng Zhang. An Optimal Microar- chitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers. In Proceedings of the 51st Annual Design Automation Conference, DAC ’14, pages 77:1–77:6, New York, NY, USA, 2014. ACM. [60] Jason Cong, Stephen Neuendorffer, Juanjo Noguera, and Kees Vissers. High-Level Synthesis for FPGAs: From Prototyping to Deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(4):473–491, April 2011. [61] Jason Cong and Bingjun Xiao. Minimizing computation in convolutional neural networks. In Artificial Neural Networks and Machine Learning, ICANN 2014, Lecture Notes in Computer Science. 2014. [62] Jason Cong, Peng Zhang, and Yi Zou. Optimizing memory hierarchy allo- cation with loop transformations for high-level synthesis. Proceedings of the 49th Annual Design Automation Conference on - DAC ’12, page 1233, 2012. [63] Altera Corporation. Radar Processing : FPGAs or GPUs ? (May), 2013. [64] P. Coussy, D.D. Gajski, M. Meredith, and a. Takach. An Introduction to High-Level Synthesis. IEEE Design & Test of Computers, 26(4):8–17, July 2009. [65] Philippe Coussy and Adam Morawiec. High-Level Synthesis: From Algorithm to Digital Circuit. Springer Publishing Company, Incorporated, 1st edition, 2008. [66] Merkle D., Middendorf, M., and H. Schmeck. Ant colony optimization for resource-constrained project scheduling. In IEEE Transactions on Evolution- ary Computation, volume 6, pages 333–346, 2002. [67] S. Teng D. A. Spielman. Nearly-linear time algorithms for graph partition- ing, graph sparsification, and solving linear systems. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pages 81–90, 2004.


[68] Leonardo Dagum and Ramesh Menon. OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Comput. Sci. Eng., 5(1):46–55, January 1998. [69] Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. Stencil Computation Optimization and Auto-tuning on State-of-the-art Multicore Architectures. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC ’08, pages 4:1–4:12, Piscataway, NJ, USA, 2008. IEEE Press. [70] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data process- ing on large clusters. Communications of the ACM, 51(1):107–113, 2008. [71] Robert H Dennard, Fritz H Gaensslen, V Leo Rideout, Ernest Bassous, and Andre R LeBlanc. Design of ion-implanted MOSFET’s with very small physical dimensions. Solid-State Circuits, IEEE Journal of, 9(5):256–268, 1974. [72] Steven Derrien, Sanjay Rajopadhye, Patrice Quinton, and Tanguy Risset. High-Level Synthesis of Loops Using the Polyhedral Model The MMAlpha Software. [73] H. Ding and C. Shu. A Stencil Adaptive Algorithm for Finite Difference Solution of Incompressible Viscous Flows. J. Comput. Phys., 214(1):397–420, May 2006. [74] Marc Duranton, Daniel Black-Schaffer, Samia Yehia, and Koen De Boss- chere. The HiPEAC Vision. 2011. [75] R. Hassin E. M. Arkin. Graph partitions with minimum degree constraints. Journal of Discrete Mathematics, pages 55–65, August 1998. [76] Johan Eker. Specification of the C AL actor language. 2003. [77] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankar- alingam, and Doug Burger. Dark Silicon and the End of Multicore Scaling. In Proceedings of the 38th Annual International Symposium on Computer Archi- tecture, ISCA ’11, pages 365–376, New York, NY, USA, 2011. ACM. [78] H. Munk et al. Acotes project: Advanced compiler technologies for embed- ded streaming. International Journal of Parallel Programming, 2010. Special issue on European HiPEAC network of excellence members projects. [79] Paul Feautrier. Parametric Integer Programming. RAIRO Recherche Opéra- tionnelle, 22:243–268, September 1988. [80] Paul Feautrier. Dataflow Analysis of Scalar and Array References. Int. J. of Parallel Programming, 20(1):23–53, February 1991. [81] Paul Feautrier. Some Efficient Solutions to the Affine Scheduling Problem: Part I. One-dimensional Time. Int. J. Parallel Program., 21(5):313–348, Octo- ber 1992.


[82] Paul Feautrier. Some efficient solutions to the affine scheduling problem. part ii. multidimensional time. International Journal of Parallel Programming, 21(6):389–420, 1992. [83] Paul Feautrier. The Polytope Model: Past, Present, Future. 2009. [84] Christian Feichtinger, Johannes Habich, Harald Köstler, Georg Hager, Ul- rich Rüde, and Gerhard Wellein. A flexible Patch-based lattice Boltzmann parallelization approach for heterogeneous GPU–CPU clusters. Parallel Computing, 37(9):536–549, September 2011. [85] F. Ferrandi, P.L. Lanzi, C. Pilato, D. Sciuto, and A. Tumeo. Ant colony heuristic for mapping and scheduling tasks and communications on het- erogeneous embedded systems. IEEE Trans. on CAD of Integrated Circuits and Systems, 29(6):911–924, june 2010. [86] Dietmar Fey. Grid-Computing. Springer, 2010. [87] C. M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improving network partition. in Proc. 19th Design Automation Conf., pages 175–181, 1982. [88] Michael Fingeroff. High-Level Synthesis Blue Book. Xlibris Corporation, 2010. [89] Alexandru Fiodorov. Improving Energy Efficiency with Special-Purpose Accel- erators. PhD thesis, Norwegian University of Science and Technology, 2013. [90] B. Fischer and J. Modersitzki. Fast Inversion of Matrices Arising in Image Processing. Numerical Algorithms, 22:1–11, 1999. [91] P. O. Fjällström. Algorithms for graph partitioning : A survey. Linköping Electronic Articles in Computer and Information Science, 3, 1998. [92] Matteo Frigo and Volker Strumpen. Cache Oblivious Stencil Computations. In Proceedings of the 19th Annual International Conference on Supercomputing, ICS ’05, pages 361–366, New York, NY, USA, 2005. ACM. [93] Matteo Frigo and Volker Strumpen. The Memory Behavior of Cache Oblivi- ous Stencil Computations. The Journal of Supercomputing, 39(2):93–112, 2007. [94] Haohuan Fu, Robert G Clapp, Oskar Mencer, and Oliver Pell. Accelerating 3D convolution using streaming architectures on FPGAs. In SEG Expanded Abstracts, volume 28, pages 3035–3039, 2009. [95] S. Rao G. Even, J. Naor and B. Schieber. Fast approximate graph partitioning algorithms. In SIAM Journal on Computing, 28:639–648, July 1997. [96] Dan Gajski, U C Irvine, and Irvine Ca. What Input-Language is the Best Choice for High Level Synthesis ( HLS )? pages 857–858, 2010. [97] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979.


[98] J. Gilbert and E. Zmijewski. A parallel graph partitioning algorithm for a message-passing multiprocessor. International Journal of Parallel Program- ming, pages 498–513, 1987. [99] Sylvain Girbal, Nicolas Vasilache, Cédric Bastoul, Albert Cohen, David Par- ello, Marc Sigler, and Olivier Temam. Semi-automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies. Int. J. Par- allel Program., 34(3):261–317, June 2006. [100] D. Gohringer, M. Hubner, M. Benz, and J. Becker. A Design Methodology for Application Partitioning and Architecture Development of Reconfig- urable Multiprocessor Systems-on-Chip. In Proceedings of FCCM ’10, pages 259 –262, may 2010. [101] T Grosser, A Größlinger, and C Lengauer. Polly - performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 2012. [102] Robert M. Haralick and Linda G. Shapiro. Computer and Robot Vision. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edi- tion, 1992. [103] B. Hendrickson and R. Leland. An improved spectral graph partitioning algorithm for mapping parallel computations. SIAM Journal on Scientific Computing, 95:452–469, 1995. [104] B. Hendrickson and R. Leland. A multilevel algorithm for partitioning graphs. In Proceedings of the ACM/IEEE Conference on Supercomputing, page 626ñ657, December 1995. [105] John L. Hennessy and David A. Patterson. Computer Architecture, Fourth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006. [106] Urs Hoelzle and Luiz Andre Barroso. The Datacenter As a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool Publishers, 1st edition, 2009. [107] Justin Holewinski, Louis-Noël Pouchet, and P. Sadayappan. High- performance Code Generation for Stencil Computations on GPU Architec- tures. In Proceedings of the 26th ACM International Conference on Supercomput- ing, ICS ’12, pages 311–320, New York, NY, USA, 2012. ACM. [108] Pao-Ann Hsiung, Marco D Santambrogio, and Chun-Hsian Huang. Recon- figurable System Design and Verification. CRC Press, 2009. [109] Bruce L. Jacob. The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2009. [110] Shoaib Kamil, Parry Husbands, Leonid Oliker, John Shalf, and Katherine Yelick. Impact of Modern Memory Subsystems on Cache Optimizations for Stencil Computations. In Proceedings of the 2005 Workshop on Memory System Performance, MSP ’05, pages 36–43, New York, NY, USA, 2005. ACM.


[111] Richard M. Karp, Raymond E. Miller, and Shmuel Winograd. The Organiza- tion of Computations for Uniform Recurrence Equations. J. ACM, 14(3):563– 590, July 1967. [112] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. Multilevel hypergraph partitioning: application in vlsi domain. Proceedings of the 34th annual Design Automation Conference, page 526ñ529, 1997. [113] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. Society for Industrial and Applied Mathematics Journal of Scientific Computing, 20:359–392, 1998. [114] Muhammad Umair Khan. Use multilevel graph partitioning scheme to solve traveling salesman problem. Master’s thesis, Department Of Com- puter Engineering, Dalarna University, Sweden, 2010. [115] M.S. Kim and K. Shimada. Geometric Modeling and Processing - GMP 2006: 4th International Conference, GMP 2006, Pittsburgh, PA, USA, July 26-28, 2006, Proceedings. Comuter-aided design. Springer, 2006. [116] Ryohei Kobayashi. The 100-FPGA Stencil Computation Accelerator. Mas- ter’s thesis, Tokyo Institute of Technology, Department of Computer Sci- ence, Graduate School of Information Science and Engineering, January 2013. [117] Sriram Krishnamoorthy, Muthu Baskaran, Uday Bondhugula, J. Ramanu- jam, Atanas Rountev, and P Sadayappan. Effective Automatic Paralleliza- tion of Stencil Computations. In Proceedings of the 2007 ACM SIGPLAN Con- ference on Programming Language Design and Implementation, PLDI ’07, pages 235–244, New York, NY, USA, 2007. ACM. [118] C. C. Jay Kuo and Bernard C. Levy. Two-color Fourier Analysis of the Multi- grid Method with Red-black Gauss-Seidel Smoothing. Appl. Math. Comput., 29(1):69–87, February 1989. [119] Y. M. Lam, J. Coutinho, W. Luk, and P.H.-W. Leong. Mapping and schedul- ing with task clustering for heterogeneous computing systems. In Proceed- ings of FPL ’08, pages 275–280, 2008. [120] Edward A Lee. Heterogeneous Concurrent Modeling and Design in Java ( Volume 1 : Introduction to Ptolemy II ). 1, 2008. [121] Edward A Lee and Stephen Neuendorffer. Heterogeneous Concurrent Mod- eling and Design in Java ( Volume 2 : Ptolemy II Software Architecture ). 2, 2008. [122] C. Lengauer, S. Apel, M. Bolten, A. Groblinger, F. Hanning, H. Kolster, U. Rude, J. Teich, A. Grebhahn, S. Kronawitter, S. Kuckuk, H. Rittich, and C. Schmitt. ExaStencils: Advanced Stencil-Code Engineering. In Euro-Par 2014: Parallel Processing Workshops, pages 553–564, 2014. [123] P Li, L-N Pouchet, D Chen, and J Cong. Transformations for throughput optimization in high-level synthesis. In Proceedings of the 2014 ACM/SIGDA International Symposium on Field-programmable Gate Arrays, FPGA ’14, 2014.


[124] Peng Li, Louis-Noël Pouchet, and Jason Cong. Throughput optimization for high-level synthesis using resource constraints. In IMPACT, 2014. [125] Zhiyuan Li and Yonghong Song. Automatic tiling of iterative stencil loops. ACM Trans. Program. Lang. Syst., 26(6):975–1028, November 2004. [126] Xuejun Liang, Jack Jean, and Karen Tomko. Data Buffering and Allocation in Mapping Generalized Template Matching on Reconfigurable Systems. J. Supercomput., 19(1):77–91, May 2001. [127] Christophe Lucarz and Marco Mattavelli. Dataflow / Actor-Oriented lan- guage for the design of complex signal processing systems. (Dasip), 2008. [128] Christophe Lucarz, Marco Mattavelli, and Julien Dubois. A Platform for the Development and the Validation of HW IP Components Starting from Reference Software Specifications. EURASIP Journal on Embedded Systems, 2008(1):685139, 2008. [129] J Marshall, A Adcroft, C Hill, L Perelman, and C Heisey. A finite-volume, in- compressible Navier-Stokes model for studies of the ocean on parallel com- puters. J. Geophys. Res, (102):5733–5752, 1997. [130] Carver Mead and Lynn Conway. Introduction to VLSI Systems. Addison- Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1979. [131] Wim Meeus and Dirk Stroobandt. Automating data reuse in high-level syn- thesis. In Proceedings of the Conference on Design, Automation & Test in Europe, DATE ’14, pages 298:1–298:4, 3001 Leuven, Belgium, Belgium, 2014. Euro- pean Design and Automation Association. [132] Wim Meeus, Kristof Van Beeck, Toon Goedemé, Jan Meel, and Dirk Stroobandt. An Overview of Today’s High-Level Synthesis Tools. Design Automation for Embedded Systems, 16(3):31–51, 2012. [133] Richard Membarth, Frank Hannig, Jurgen Teich, and Harald Kostler. To- wards Domain-Specific Computing for Stencil Codes in HPC. 2012 SC Com- panion: High Performance Computing, Networking Storage and Analysis, pages 1133–1138, November 2012. [134] Jiayuan Meng and Kevin Skadron. Performance Modeling and Automatic Ghost Zone Optimization for Iterative Stencil Loops on GPUs. In Proceedings of the 23rd International Conference on Supercomputing, ICS ’09, pages 256–265, New York, NY, USA, 2009. ACM. [135] Elad Mezuman and Yair Weiss. Globally optimizing graph partitioning problems using message passing. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, 2012. [136] A. Morvan, S. Derrien, and P. Quinton. Efficient Nested Loop Pipelin- ing in High Level Synthesis Using Polyhedral Bubble Insertion. In Field- Programmable Technology, FTP ’11, pages 1–10. IEEE, 2011.


[137] Alessandro Antonio Nacci, Vincenzo Rana, Francesco Bruschi, Donatella Sciuto, Ivan Beretta, and David Atienza. A High-level Synthesis Flow for the Implementation of Iterative Stencil Loop Algorithms on FPGA Devices. In Proceedings of the 50th Annual Design Automation Conference, DAC ’13, pages 52:1–52:6, New York, NY, USA, 2013. ACM. [138] D. Nadezhkin, H. Nikolov, and T. Stefanov. Translating affine nested-loop programs with dynamic loop bounds into Polyhedral Process Networks. In Embedded Systems for Real-Time Multimedia (ESTIMedia), 2010 8th IEEE Work- shop on, pages 21–30, Oct 2010. [139] Dmitry Nadezhkin. Parallelizing Dynamic Sequential Programs using Polyhe- dral Process Networks. [140] Aiichiro Nakano, Rajiv K Kalia, and Priya Vashishta. Multiresolution Molecular Dynamics Algorithm for Realistic Materials Modeling on Parallel Computers. Computer Physics Communications, 83(2):197–214, 1994. [141] John Von Neumann. Theory of Self-Reproducing Automata. University of Illi- nois Press, Champaign, IL, USA, 1966. [142] G. Nicolescu, Ian O’Connor, and Christian Piguet. Design technology for het- erogeneous embedded systems. Springer, 2012. [143] H Nikolov, M Thompson, T Stefanov, A Pimentel, S Polstra, R Bose, C Zis- sulescu, and E Deprettere. Daedalus: Toward composable multimedia mp- soc design. In Design Automation Conference, DAC ’08, 2008. [144] Hristo Nikolov, Student Member, Todor Stefanov, and Ed Deprettere. Sys- tematic and Automated Multiprocessor System. 27(3):542–555, 2008. [145] Hristo Nikolov, Todor Stefanov, and Ed Deprettere. Multi-processor system design with ESPAM. Proceedings of the 4th international conference on Hard- ware/software codesign and system synthesis - CODES+ISSS ’06, page 211, 2006. [146] Xinyu Niu, J.G.F. Coutinho, and W. Luk. A scalable design approach for stencil computation on reconfigurable clusters. In Field Programmable Logic and Applications, FPL ’13, pages 1–4, 2013. [147] Juanjo Noguera and Fernando Martinez Vallina. Zynq-7000 All Pro- grammable SoC Accelerator for Floating-Point Matrix Multiplication using Vivado HLS. 1170, 2013. [148] V. Osipov and P. Sanders. n-level graph partitioning. In Proceedings of the 18th European Conference on Algorithms: Part I, 6346:278–289, 2010. [149] Eunjung Park, Louis-Noel Pouche, John Cavazos, Albert Cohen, and P. Sa- dayappan. Predictive modeling in a polyhedral optimization space. Inter- national Symposium on Code Generation and Optimization, 2011. [150] M. Peemen, A.A.A. Setio, B. Mesman, and H. Corporaal. Memory-centric accelerator design for convolutional neural networks. In International Con- ference on Computer Design (ICCD), 2013.


[151] A. Pothen, H. D. Simon, and K. P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11:430–452, 1990.

[152] Louis-Noël Pouchet. PolyBench/C the Polyhedral Benchmark suite.

[153] Louis-Noël Pouchet. Interative Optimization in the Polyhedral Model. PhD thesis, University of Paris-Sud 11, Orsay, France, January 2010.

[154] Louis-Noël Pouchet. Polyhedral Compilation Foundations. Ohio State Uni- versity. Course Lecture, 888.11, 2010.

[155] Louis-noël Pouchet, Cédric Bastoul, and Albert Cohen. LetSee : the LEgal Transformation SpacE Explorator.

[156] Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen, and John Cavazos. It- erative Optimization in the Polyhedral Model: Part II, Multidimensional Time. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’08, pages 90–100, Tucson, Arizona, June 2008. ACM Press.

[157] Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen, and Nicolas Vasilache. Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time. In IEEE/ACM Fifth International Symposium on Code Generation and Op- timization, CGO ’07, pages 144–156, San Jose, California, March 2007. IEEE Computer Society press.

[158] Louis-Noël Pouchet, Uday Bondhugula, Cédric Bastoul, Albert Cohen, J. Ramanujam, and P. Sadayappan. Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework . In Conference on Supercomputing (SC’10), New Orleans, LA, November 2010. IEEE Computer Society Press.

[159] Louis-Noel Pouchet, Peng Zhang, P. Sadayappan, and Jason Cong. Polyhedral-based Data Reuse Optimization for Configurable Computing. In Proceedings of the ACM/SIGDA International Symposium on Field Pro- grammable Gate Arrays, FPGA ’13, pages 29–38, New York, NY, USA, 2013. ACM.

[160] William Pugh. The Omega Test: A Fast and Practical Integer Programming Algorithm for Dependence Analysis. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, Supercomputing ’91, pages 4–13, New York, NY, USA, 1991. ACM.

[161] Fabien Quilleré, Sanjay Rajopadhye, and Doran Wilde. Generation of Effi- cient Nested Loops from Polyhedra. Int. J. Parallel Program., 28(5):469–498, October 2000.

[162] Patrice Quinton and Vincent Van Dongen. The mapping of linear recurrence equations on regular arrays. Journal of VLSI Signal Processing, 1:95–113, 1989.


[163] Shah M. Faizur Rahman, Qing Yi, and Apan Qasem. Understanding Sten- cil Code Performance on Multicore Architectures. In Proceedings of the 8th ACM International Conference on Computing Frontiers, CF ’11, pages 30:1– 30:10, New York, NY, USA, 2011. ACM.

[164] Vincenzo Rana, Alessandro A Nacci, Ivan Beretta, Marco D Santambrogio, David Atienza, and Donatella Sciuto. Design Methods for Parallel Hard- ware Implementation of Multimedia Iterative Algorithms. (c):1–6, 2011.

[165] Mahesh Ravishankar, John Eisenlohr, Louis-Noël Pouchet, J. Ramanujam, Atanas Rountev, and P. Sadayappan. Code Generation for Parallel Execu- tion of a Class of Irregular Loops on Distributed Memory Systems. In Pro- ceedings of the International Conference on High Performance Computing, Net- working, Storage and Analysis, SC ’12, pages 72:1–72:11, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.

[166] F. Rendl and H. Wolkowicz. A projection technique for partitioning the nodes of a graph. Annals of Operations Research, 58:155–179, 1995.

[167] L. Renganarayana, M. Harthikote-Matha, R. Dewri, and S. Rajopadhye. To- wards optimal multi-level tiling for stencil computations. In IEEE Internal Symposium Parallel and Distributed Processing, 2007.

[168] Lakshminarayanan Renganarayana, Manjukumar Harthikote-matha, Rinku Dewri, and Sanjay Rajopadhye. Towards optimal multi-level tiling for stencil computations. In 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS, 2007.

[169] F. Ritcher, M. Schmidt, and D. Fey. A Configurable VHDL Template for Parallelization of 3D Stencil Codes on FPGAs. In European Regional Science Association Conference 2012, ERSA ’12, 2012.

[170] J.O.A. Robertsson. Numerical Modeling of Seismic Wave Propagation: Gridded Two-way Wave-equation Methods. SEG geophysics reprint series. Society of Exploration Geophysicists, the international society of applied geophysics, 2012.

[171] C. Roucairol and P. Hansen. Cut cost minimization in graph partitioning. Numerical and Applied Mathematics, pages 585–587, June 1989.

[172] Davod Khojasteh Salkuyeh. Generalized Jacobi and Gauss-Seidel Methods for Solving Linear System of Equations. Numerical Mathematics, A Journal of Chinese Universities, 16(2):164–170, 2007.

[173] Peter Sanders and Christian Schulz. Distributed evolutionary graph parti- tioning. SIAM.

[174] A. Sangiovanni-Vincentelli, L. Carloni, F. De Bernardinis, and M. Sgroi. Ben- efits and challenges for platform-based design. In Proceedings of DAC ’04, pages 409–414, 2004.

[175] K. Sano, Y. Hatsuda, and S. Yamamoto. Scalable Streaming-Array of Simple Soft-Processors for Stencil Computations with Constant Memory-Bandwidth. In Field-Programmable Custom Computing Machines, FCCM '11, pages 234–241, 2011.

[176] K. Sano, Y. Hatsuda, and S. Yamamoto. Multi-FPGA Accelerator for Scalable Stencil Computation with Constant Memory Bandwidth. Parallel and Distributed Systems, IEEE Transactions on, 25(3):695–705, March 2014.

[177] Kentaro Sano, Satoru Yamamoto, and Yoshiaki Hatsuda. Domain-specific Programmable Design of Scalable Streaming-array for Power-efficient Stencil Computation. SIGARCH Comput. Archit. News, 39(4):44–49, December 2011.

[178] M.D. Santambrogio and D. Sciuto. Design methodology for partial dynamic reconfiguration: a new degree of freedom in the HW/SW codesign. In Proceedings of IPDPS '08, pages 1–8, 2008.

[179] N. Satofuka. Computational Fluid Dynamics 2000: Proceedings of the First International Conference on Computational Fluid Dynamics, ICCFD, Kyoto, Japan, 10–14 July 2000. Edited by Nobuyuki Satofuka. Physics and astronomy online library. Springer Berlin Heidelberg, 2001.

[180] A. Schafer and D. Fey. High Performance Stencil Code Algorithms for GPGPUs. In International Conference on Computational Science, ICCS '11, pages 1–10, 2011.

[181] K. Schloegel, G. Karypis, and V. Kumar. Graph partitioning for high performance scientific simulations. In The Sourcebook of Parallel Computing, pages 491–541, 2003.

[182] M. Shafiq, M. Pericàs, R. de la Cruz, M. Araya-Polo, N. Navarro, and E. Ayguadè. Exploiting Memory Customization in FPGA for 3D Stencil Computations. In Field-Programmable Technology, FPT '09, pages 38–45, 2009.

[183] John Shalf, Sudip Dosanjh, and John Morrison. Exascale computing technology challenges. In High Performance Computing for Computational Science – VECPAR 2010, pages 1–25. Springer, 2011.

[184] L.G. Shapiro and G.C. Stockman. Computer Vision. Prentice Hall, 2001.

[185] H.J. Siegel, Lee Wang, V.P. Roychowdhury, and Min Tan. Computing with heterogeneous parallel machines: advantages and challenges. In Parallel Architectures, Algorithms, and Networks, 1996. Proceedings., Second International Symposium on, pages 368–374, Jun 1996.

[186] Scott Sirowy and Alessandro Forin. Where's the Beef? Why FPGAs Are So Fast. Technical Report MSR-TR-2008-130, Microsoft Research, September 2008.

[187] Gerard L. G. Sleijpen and Henk A. van der Vorst. A Jacobi–Davidson Iteration Method for Linear Eigenvalue Problems. SIAM Rev., 42(2):267–293, June 2000.

[188] Robert Strzodka, Mohammed Shaheen, Dawid Pajak, and Hans-Peter Seidel. Cache Accurate Time Skewing in Iterative Stencil Computations. In Proceedings of the International Conference on Parallel Processing (ICPP), pages 571–581. IEEE Computer Society, September 2011.

[189] Alejandro Fernández Suárez. Domain Specific Languages for High Performance Computing: A Framework for Heterogeneous Architectures. pages 2012–2013, 2013.

[190] Arvind K. Sujeeth, Kevin J. Brown, Hyoukjoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, and Kunle Olukotun. Delite: A Compiler Architecture for Performance-Oriented Embedded Domain-Specific Languages. ACM Trans. Embed. Comput. Syst., 13(4s):134:1–134:25, April 2014.

[191] Summary Report of the Advanced Scientific Computing Advisory Committee Subcommittee. The Opportunities and Challenges of Exascale Computing. Technical report, U.S. Department of Energy, 2010.

[192] Synopsys, Inc. Platform Architect. http://www.synopsys.com/Systems/ArchitectureDesign.

[193] Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. The Pochoir Stencil Compiler. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '11, pages 117–128, New York, NY, USA, 2011. ACM.

[194] Richard Thavot, Romuald Mosqueron, Julien Dubois, and Marco Mattavelli. Hardware synthesis of complex standard interfaces using CAL dataflow descriptions.

[195] Mark Thompson, Hristo Nikolov, Todor Stefanov, Andy D. Pimentel, Cagkan Erbas, Simon Polstra, and Ed F. Deprettere. A framework for rapid system-level exploration, synthesis, and programming of multimedia MPSoCs. In Proceedings of CODES+ISSS '07, pages 9–14, 2007.

[196] Jan Treibig, Gerhard Wellein, and Georg Hager. Efficient Multicore-Aware Parallelization Strategies for Iterative Stencil Computations. CoRR, abs/1004.1741, 2010.

[197] Kuen Hung Tsoi and Wayne Luk. Axel: A Heterogeneous Cluster with FPGAs and GPUs. 2010.

[198] U. Feige, R. Krauthgamer, and K. Nissim. Approximating the minimum bisection size (extended abstract). In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing, May 21–23 2000.

[199] S. van Haastregt and B. Kienhuis. Automated synthesis of streaming C applications to process networks in hardware. 2009 Design, Automation & Test in Europe Conference & Exhibition, pages 890–893, April 2009.

[200] Nicolas Vasilache, Cédric Bastoul, and Albert Cohen. Polyhedral Code Generation in the Real World. In Proceedings of the 15th International Conference on Compiler Construction, CC'06, pages 185–201, Berlin, Heidelberg, 2006. Springer-Verlag.

[201] Sven Verdoolaege. isl: An Integer Set Library for the Polyhedral Model. In Proceedings of the Third International Congress Conference on Mathematical Software, ICMS'10, pages 299–302, Berlin, Heidelberg, 2010. Springer-Verlag.

[202] Sven Verdoolaege. Polyhedral Process Networks. In Handbook of Signal Processing Systems, pages 1335–1375. 2013.

[203] Sven Verdoolaege, Hristo Nikolov, and Todor Stefanov. pn: A Tool for Improved Derivation of Process Networks. EURASIP Journal on Embedded Systems, 2007:1–13, 2007.

[204] John A. Watlington and V. Michael Bove, Jr. Stream-Based Computing and Future Television. In 137th Society of Motion Picture & Television Engineers Technical Conference, SMPTE '95, pages 69–79, 1995.

[205] Robert A. Walker and Raul Camposano. Carnegie Mellon's (Second) CMU-DA System. In A Survey of High-Level Synthesis Systems, volume 135 of The Springer International Series in Engineering and Computer Science, pages 60–67. Springer US, 1991.

[206] Y. Wang, P. Li, P. Zhang, C. Zhang, and J. Cong. Memory partitioning for multidimensional arrays in high-level synthesis. In Design Automation Conference, 2013.

[207] Christian Weiß, Wolfgang Karl, Markus Kowarschik, and Ulrich Rüde. Memory Characteristics of Iterative Methods. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, SC '99, New York, NY, USA, 1999. ACM.

[208] G. Wellein. Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization. In Computer Software and Applications Conference, COMPSAC '09, pages 579–586, 2009.

[209] R. Wester and J. Kuper. Deriving Stencil Hardware Accelerators from a Single Higher-Order Function. In Communicating Process Architectures 2014, CPA '14, 2014.

[210] M. E. Wolf and M. S. Lam. A Loop Transformation Theory and an Algorithm to Maximize Parallelism.

[211] Michael E. Wolf and Monica S. Lam. A Data Locality Optimizing Algorithm. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation, PLDI '91, pages 30–44, New York, NY, USA, 1991. ACM.

[212] D. Wonnacott. Using Time Skewing to Eliminate Idle Time Due to Memory Bandwidth and Network Limitations. In Proceedings of the 14th International Symposium on Parallel and Distributed Processing, IPDPS '00, pages 171–, Washington, DC, USA, 2000. IEEE Computer Society.

[213] Xilinx Inc. MicroBlaze Processor Reference Guide, 2012.

[214] Y. Yaacoby and P.R. Cappello. Scheduling a System of Affine Recurrence Equations onto a Systolic Array. Proceedings of the International Conference on Systolic Arrays, pages 373–382, 1988.

[215] J. Yan and P. Hsiao. A fuzzy clustering algorithm for graph bisection. Information Processing Letters, 52, December 1994.

[216] Tomofumi Yuki and Sanjay Rajopadhye. Parametrically Tiled Distributed Memory Parallelization of Polyhedral Programs. Technical report, Department of Computer Science, Colorado State University, Fort Collins, CO, June 2013.

[217] Yongpeng Zhang and Frank Mueller. Auto-Generation and Auto-tuning of 3D Stencil Codes on GPU Clusters. In Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO '12, pages 155–164, New York, NY, USA, 2012. ACM.

[218] Wei Zuo, Peng Li, Deming Chen, Louis-Noël Pouchet, Shunan Zhong, and Jason Cong. Improving Polyhedral Code Generation for High-level Synthesis. In Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS '13, pages 15:1–15:10, Piscataway, NJ, USA, 2013. IEEE Press.

[219] Wei Zuo, Yun Liang, Peng Li, Kyle Rupnow, Deming Chen, and Jason Cong. Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA '13, pages 9–18, New York, NY, USA, 2013. ACM.
