FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Generation of Custom Run-time Reconfigurable Hardware for Transparent Binary Acceleration

Nuno Miguel Cardanha Paulino

Programa Doutoral em Engenharia Electrotécnica e de Computadores (PDEEC)

Supervisor: João Canas Ferreira (Assistant Professor)
Co-supervisor: João M. P. Cardoso (Associate Professor)

June 2016

© Nuno Miguel Cardanha Paulino, 2016

Abstract

With the increase of application complexity and amount of data, the required computational power increases in tandem. Technology improvements have allowed for the increase in clock frequencies of all kinds of processing architectures. But the exploration of new architectures and computing paradigms beyond the simple single-issue in-order processor is equally important for increasing performance, by properly exploiting the data parallelism of demanding tasks. For instance: superscalar processors, which discover instruction parallelism at runtime; Very Long Instruction Word processors, which rely on compile-time parallelism discovery and multiple issue units; and multi-core approaches, which are based on thread-level parallelism exploited at the software level. For embedded applications, depending on the performance requirements or resource constraints, and if the application is composed of well-defined tasks, a more application-specific system may be appropriate, i.e., developing an Application Specific Integrated Circuit (ASIC). Custom logic delivers the best power/performance ratio, but this solution requires advanced hardware expertise, implies long development times, and, especially, very high manufacturing costs.

This work designed and evaluated a transparent binary acceleration approach, targeting Field Programmable Gate Array (FPGA) devices, which relies on instruction traces to automatically generate specialized accelerator instances. A custom accelerator, capable of executing a set of previously detected loop traces, is coupled to a host MicroBlaze processor. The traces are detected via simulation of the target binary. The approach does not require the application source code to be modified, which ensures transparency for the application developer. No custom compilers are necessary, and the binary code does not need to be altered either offline or at runtime.

The accelerators contain per-instance reconfiguration capabilities, which allow for the reuse of computing units between accelerated loops without sacrificing the benefits of circuit specialization. To increase the achievable performance, the accelerator is capable of performing two concurrent accesses to the MicroBlaze's data memory. The repetitive nature of the loop traces is exploited via loop pipelining, which maximizes the achievable acceleration. By supporting single-precision floating-point operations via fully-pipelined units, the accelerator is capable of executing realistic data-oriented loops. Finally, the use of Dynamic Partial Reconfiguration (DPR) allows for significant area savings when instantiating accelerators with numerous configurations, and also ensures circuit specialization per configuration.

Several fully functional systems were implemented, using commercial FPGAs, to validate the design iterations of the accelerator. An initial design relied on translating Control and Dataflow Graph representations of the traces into a multi-row array of Functional Units. For 15 benchmarks, the geometric mean speedup was 2.08×. A second implementation augmented the accelerator with access to the entire local data memory of the MicroBlaze. Arbitrary addresses can be accessed without the need for address generation hardware. Exploiting data parallelism allows for the targeting of larger, more realistic traces. The geometric mean speedup for 37 benchmarks was 2.35×. The most efficient implementation supports floating-point operations and relies on loop pipelining.
The developed tools generate an accelerator instance by modulo-scheduling each trace at the minimum possible Initiation Interval. The geometric mean speedup for a set of 24 benchmarks is 5.61×, and the accelerator requires only 1.12× the FPGA slices required by the MicroBlaze. Finally, resorting to DPR, an accelerator with 10 configurations requires only a third of the Lookup Tables relative to an equivalent accelerator without this capability.

To summarize, the approach is capable of expediently generating accelerator-augmented embedded systems which achieve considerable performance increases whilst incurring a low resource cost, and without requiring manual hardware design.

Sumário

As application complexity increases, so does the required computational capacity. Technological improvements have allowed clock frequencies to increase for all kinds of computing architectures. However, the exploration of new architectures is equally important for performance improvements, by efficiently exploiting the data parallelism of demanding tasks. For instance: superscalar processors, which discover instruction-level parallelism during execution; VLIW (Very Long Instruction Word) processors, which depend on parallelism discovered at compile time and on several parallel issue units; and multi-core technologies, which are based on exploiting thread-level parallelism through software. For embedded applications, depending on the performance requirements or resource constraints, and if the application is composed of well-defined tasks, a more specific system may be more appropriate, i.e., designing an ASIC (Application Specific Integrated Circuit). Dedicated logic benefits from the best performance per watt, but this solution requires hardware design experience and suffers from long development times and high manufacturing costs.

This work developed and evaluated a transparent binary acceleration approach, targeted specifically at FPGAs. The approach relies on frequently executed instruction sequences (i.e., traces) to automatically generate specialized accelerators. An accelerator, capable of executing a set of previously detected traces, complements a MicroBlaze, which acts as the main processor. The traces are detected by simulating the application, and represent execution loops. No modifications to the source code are necessary, which increases the transparency of the approach for the programmer. No specialized compiler is required, and the binary code undergoes no post-compilation modifications.

The accelerator contains specialized reconfiguration logic, which allows computing units to be reused across the accelerated loops without sacrificing specialization. To maximize performance, the accelerator can perform two parallel accesses to the processor's data memory. The use of loop pipelining maximizes the acceleration, and the support for floating-point operations enables the execution of realistic embedded tasks. Finally, the use of Dynamic Partial Reconfiguration (DPR) significantly reduces the area required to support several configurations, and ensures the specialization of the hardware associated with each configuration.

Several fully functional systems were implemented to validate the accelerators, using commercial FPGAs. A first implementation translates dataflow graph representations (i.e., Control and Dataflow Graphs) of the loops into several rows of interconnected functional units. For 15 benchmarks, the geometric mean speedup was 2.08×. A second implementation adds memory access support to the accelerator, which can directly access the entire local data memory of the MicroBlaze, with accesses to arbitrary addresses supported. Exploiting data parallelism allows more realistic loops to be accelerated. For 37 benchmarks, the geometric mean speedup was 2.35×. The most efficient implementation supports floating-point operations and uses loop pipelining.
A scheduler generates an accelerator instance by modulo-scheduling each trace at the minimum Initiation Interval. The geometric mean speedup is 5.61× for 24 applications, and the accelerators require 1.12× the number of FPGA slices of a MicroBlaze. Finally, with the use of DPR, an accelerator with 10 configurations requires only a third of the Lookup Tables of an equivalent accelerator without this capability.

In conclusion, the approach makes it possible to quickly generate systems with specialized accelerators that considerably increase performance at a reduced resource cost, while also avoiding the need for manual hardware design.

Acknowledgments

This thesis is the result of four years of work that I was fortunate enough to be able to carry out with focus and nearly undivided attention due mostly, if not totally, to the support of my parents, who helped keep my mind off other, more time-consuming, and infinitely less productive matters. The individual moments where someone lent me their support, and the particular people who took the time to care and ask about my work, are too numerous to list. A thank you to the friends that shared a seat in the lab where I sit as I type this. It would go without saying, but I’d also like to thank my supervisor, João Canas Ferreira, for his guidance and insight, which helped make sure I didn’t stray off into the distance in a random direction. Also, a thank you to my co-supervisor, João Manuel Paiva Cardoso, for many fruitful discussions, and another to João Bispo, both for suggestions and for providing his own set of tools, which were the starting point for what I have developed. I would also like to thank Michael Hübner for his interest in my work, as well as Max Ferger for some helpful suggestions. Finally, a special thanks goes out to my friend Henrique Martins, with the hope that he comes to his senses and comes back home.

On a completely different note, I would also like to acknowledge the support through PhD grant SFRH/BD/80225/2011, provided by FCT (Fundação para a Ciência e a Tecnologia - Portuguese Foundation for Science and Technology). Finally, thank you to Stephen Wong from the Delft University of Technology in the Netherlands, for the ρ-VEX processor release and tools.

Nuno Paulino

“Terry took Death’s arm and followed him through the doors and on to the black desert under the endless night.”

Terry Pratchett

Contents

1 Introduction 1
1.1 FPGAs as a Platform for HW/SW Partitioning Design ...... 3
1.2 Automated HW/SW Partitioning ...... 4
1.2.1 High-Level Synthesis ...... 4
1.2.2 Binary-level HW/SW Partitioning ...... 4
1.3 Motivation and Problem Statement ...... 5
1.4 Objectives ...... 7
1.5 Approach ...... 8
1.5.1 Megablock Trace ...... 9
1.5.2 Generating a Reconfigurable Customized Accelerator Instance ...... 11
1.6 Contributions ...... 13
1.7 Summary of Published Work ...... 14
1.7.1 International Journals ...... 14
1.7.2 International Conferences ...... 15
1.7.3 National Conferences ...... 16
1.8 Structure of this document ...... 16

2 Revision of Related Work 19
2.1 Overview ...... 19
2.1.1 Partitioning ...... 20
2.1.2 Accelerator Structure ...... 21
2.1.3 Accelerator Functional Units ...... 22
2.1.4 Accelerator Memory Access ...... 25
2.1.5 Accelerator Execution Model ...... 26
2.1.6 Accelerator Programmability and Compilation ...... 29
2.2 Representative Approaches ...... 30
2.2.1 Warp Processor ...... 30
2.2.2 ADEXOR ...... 31
2.2.3 Configurable Compute Accelerator ...... 32
2.2.4 Dynamic Instruction Merging ...... 33
2.2.5 ASTRO ...... 34
2.2.6 Work of Ferreira et al. ...... 35
2.2.7 Morphosys ...... 35
2.2.8 Additional Related Works ...... 37
2.3 Dynamic Partial Reconfiguration in FPGAs ...... 38
2.3.1 Examples of Partial Reconfiguration Applications ...... 38
2.3.2 Design Considerations for Partial Reconfiguration Based Systems ...... 39
2.4 Concluding Remarks ...... 39


3 Overview of Implementations and General Tool Flow 41
3.1 System Level Architecture ...... 41
3.2 General Execution Model ...... 42
3.3 General Tool Flow ...... 44
3.3.1 Megablock Extraction ...... 45
3.3.2 Generation of the Accelerator HDL Description ...... 47
3.3.3 Generation of Communication Routine ...... 49
3.4 The Injector Module ...... 51
3.5 Summary of Accelerator Implementations ...... 52

4 Customized Multi-Row Accelerators 55
4.1 Accelerator Architecture ...... 55
4.1.1 Structure ...... 56
4.1.2 Interface ...... 57
4.1.3 Execution Model ...... 59
4.2 Architecture Specific Tool Flow ...... 59
4.3 Experimental Evaluation ...... 62
4.3.1 Hardware Setup ...... 62
4.3.2 Software Setup ...... 63
4.3.3 Characteristics of the Generated Accelerators ...... 64
4.3.4 Performance vs. MicroBlaze Processor ...... 67
4.3.5 Resource Requirements and Operating Frequency ...... 71
4.4 Concluding Remarks ...... 73

5 Accelerators with Memory Access Support 75
5.1 Accelerator Architecture ...... 76
5.1.1 Structure of Functional Unit Array ...... 77
5.1.2 Memory Access Support ...... 78
5.1.3 Execution Model ...... 79
5.2 Accelerator Generation and Loop Translation ...... 80
5.2.1 List Scheduling ...... 81
5.2.2 Memory Access Scheduling ...... 83
5.2.3 Multiplexer Specification ...... 85
5.3 Experimental Evaluation ...... 85
5.3.1 Hardware Setup ...... 85
5.3.2 Software Setup ...... 86
5.3.3 General Aspects ...... 87
5.3.4 Performance vs. MicroBlaze Processor ...... 90
5.3.5 Effects of Memory Access Optimizations ...... 95
5.3.6 Effects of List Scheduling on Functional Unit Reuse ...... 97
5.3.7 Resource Requirements and Operating Frequency ...... 99
5.3.8 Power and Energy Consumption ...... 101
5.4 Concluding Remarks ...... 103

6 Modulo Scheduling onto Customized Single-Row Accelerators 105
6.1 Accelerator Architecture ...... 107
6.1.1 Execution Model ...... 108
6.2 Architecture Specific Tool Flow ...... 110
6.3 Accelerator Generation and Loop Scheduling ...... 110

6.3.1 Scheduling Example ...... 111
6.4 Experimental Evaluation ...... 115
6.4.1 Hardware Setup ...... 115
6.4.2 Software Setup ...... 116
6.4.3 Performance vs. MicroBlaze Processor ...... 118
6.4.4 Resource Requirements & Operating Frequency ...... 121
6.4.5 Power and Energy Consumption ...... 122
6.4.6 Performance and Cost of Multi-loop Support ...... 123
6.5 Performance Comparison with ALU Based Accelerators ...... 124
6.6 Performance Comparison with VLIW Architectures ...... 128
6.6.1 Performance Comparison ...... 130
6.6.2 Resource Usage Comparison ...... 132
6.7 Concluding Remarks ...... 133

7 Dynamic Partial Reconfiguration of Customized Single-Row Accelerators 135
7.1 Accelerator Architecture ...... 136
7.1.1 Static Partition ...... 137
7.1.2 Reconfigurable Partition ...... 137
7.2 Tool Flow for Dynamic Partial Reconfiguration ...... 138
7.3 Experimental Evaluation ...... 140
7.3.1 Hardware Setup ...... 140
7.3.2 Software Setup ...... 141
7.3.3 Resource Requirements of Static and Reconfigurable Regions ...... 141
7.3.4 Resource Requirements of DPR Accelerator vs. Non-DPR Accelerator ...... 142
7.3.5 Synthesis Time of DPR-Capable Accelerator vs. Non-DPR Accelerator ...... 145
7.3.6 Effect of Partial Reconfiguration Overhead on Performance ...... 145
7.4 Concluding Remarks ...... 147

8 Conclusion and Future Work 149
8.1 Characteristics of the Developed Approach ...... 149
8.2 Future Work ...... 151
8.2.1 Potential Improvements to the Developed Approach ...... 151
8.2.2 Support for Multi-Path Traces ...... 153
8.2.3 Runtime HW/SW Partitioning via DPR ...... 153
8.3 Concluding Remarks ...... 155

A External Memory Access for Loop Pipelined Multi-Row Accelerators 157
A.1 Accelerator Architecture ...... 158
A.1.1 Structure ...... 158
A.1.2 Execution ...... 159
A.1.3 Memory Access ...... 160
A.2 Configurable Dual-Port Cache ...... 161
A.3 Experimental Evaluation ...... 161
A.3.1 General Aspects ...... 162
A.3.2 Performance ...... 164
A.3.3 Communication and Cache Invalidation Overhead ...... 166
A.3.4 Resource Requirements and Operating Frequency ...... 167
A.4 Concluding Remarks ...... 168

References ...... 171

List of Figures

1.1 Proposed transparent binary acceleration approach ...... 9
1.2 Megablock trace and CDFG example ...... 10
1.3 DPR Oriented system design ...... 12

2.1 Two arrangements for Functional Units in reconfigurable arrays ...... 21
2.2 Types of host-processor/co-processor interfaces ...... 22
2.3 Modulo-scheduling of loops on mesh based arrays ...... 28
2.4 The Warp processor approach ...... 31
2.5 A tightly coupled heterogeneous array of Functional Units in the AMBER approach ...... 32
2.6 A tightly coupled heterogeneous array of Functional Units in the CCA approach ...... 33
2.7 The DIM binary translation mechanism ...... 34

3.1 General overview of developed system architecture ...... 42
3.2 Temporal diagram of migration and instruction level behaviour due to migration ...... 43
3.3 Generic tool flow of developed approach ...... 44
3.4 Example of extracted loop trace and resulting CDFG ...... 46
3.5 Example instantiation of multi-row array ...... 48
3.6 Example of tool-generated Communication Routine ...... 49
3.7 Architectural variants of the injector module ...... 51

4.1 Synthetic example of 2D accelerator instance ...... 56
4.2 Bus-type interface for the accelerator ...... 58
4.3 Architecture-specific flow for 2D accelerator design and supporting hardware ...... 60
4.4 System level variants used for evaluation of 2D accelerator design ...... 62
4.5 Speedups for several types of system architectures ...... 67
4.6 Synthesis frequency and resource requirements of the generated accelerators ...... 72

5.1 2D Accelerator with memory access logic ...... 76
5.2 Local Memory Bus Multiplexer module ...... 79
5.3 Architecture-specific flow for this 2D accelerator design ...... 80
5.4 List scheduling example for a Functional Unit with available slack ...... 82
5.5 Assignment of load/store units to ports and cycles, after Functional Unit placement ...... 83
5.6 System architecture for validation of accelerator local memory access ...... 86
5.7 Speedups for the three benchmark sets ...... 90
5.8 Effects of list scheduling on instantiation of passthrough units ...... 97
5.9 Resource requirements and synthesis frequency of the generated accelerators ...... 100

6.1 Architecture template for a single-row accelerator for modulo scheduling ...... 106
6.2 Configuration word structure for single-row accelerator ...... 109


6.3 Architecture-specific flow for the single-row accelerator ...... 110
6.4 Execution flow of modulo scheduling for the single-row accelerator ...... 111
6.5 Example CDFG ...... 112
6.6 Modulo schedule for the example CDFG ...... 113
6.7 Example single-row accelerator instance and hardware structure ...... 114
6.8 System architecture for validation of the single-row modulo scheduled accelerator ...... 115
6.9 Compilation flow of the test harness ...... 117
6.10 Speedup as a function of input/output data array sizes ...... 120
6.11 Resource requirements and operating frequency for single-row accelerator ...... 122
6.12 Resource requirements for multi-loop accelerator vs. single-loop accelerators ...... 125
6.13 Speedups for several types of accelerators vs. a single MicroBlaze processor ...... 127
6.14 Resource requirements for several types of accelerators ...... 128
6.15 Simulation flow for ρ-VEX processor and other VEX architecture models ...... 129
6.16 Speedups for different VLIW models versus single-row accelerator ...... 131

7.1 Single-row accelerator architecture partitioned for Dynamic Partial Reconfiguration ...... 136
7.2 Complete tool flow for partially reconfigurable accelerator ...... 138
7.3 System architecture for validation of DPR capable accelerator ...... 141
7.4 Resource requirements of DPR-based accelerator discriminated by static and reconfigurable regions ...... 142
7.5 Resource requirements for DPR and non-DPR accelerators ...... 143
7.6 Synthesis times for DPR and non-DPR capable accelerators ...... 145

8.1 Concept for self-adaptive system based on Dynamic Partial Reconfiguration . . . 155

A.1 Pipelined 2D accelerator adapted for higher memory access latencies ...... 158
A.2 Memory access logic of the accelerator ...... 160
A.3 System architecture for validation of accelerator external memory access ...... 162
A.4 Resource requirements and synthesis frequency of the generated accelerators ...... 167

Listings

3.1 Accelerator HDL specification excerpt ...... 48
3.2 Fast Simplex Link based Communication Routine in C container ...... 50
3.3 Linker Script excerpt to place Communication Routines at known position ...... 50
4.1 Reconfiguration information placed in C containers ...... 61
4.2 Reconfiguration information placed into a read-only memory module ...... 61
4.3 Simplified code for even ones benchmark ...... 64
5.1 Multiplexer HDL specification excerpt ...... 85
5.2 Code excerpt for crc32 kernel ...... 87
5.3 Code excerpt for max kernel, without if-conversion ...... 92
5.4 Code excerpt for max kernel, with if-conversion ...... 92
6.1 Inner product kernel adapted for test harness integration ...... 117
7.1 Communication Routine with call to partial reconfiguration function ...... 139
7.2 Generating a flash programming file from a file system with all partial bitstreams ...... 140

List of Tables

2.1 Characteristics of related Transparent Binary Acceleration approaches ...... 36

3.1 Brief comparison of implemented accelerator architectures and results ...... 53

4.1 Extracted Megablock and generated accelerator characteristics ...... 65
4.2 Communication Routine characteristics and overheads ...... 69

5.1 Extracted Megablock and generated accelerator characteristics ...... 89
5.2 Executed instructions per clock cycle, for greedy and optimized scheduling ...... 95
5.3 Power consumption for software-only and accelerated runs ...... 102
5.4 Energy consumption for software-only and accelerated runs ...... 102

6.1 Generated accelerator characteristics and achieved speedups ...... 119
6.2 Power and energy consumption for software-only and accelerated runs ...... 123
6.3 Generated multi-loop accelerator characteristics and speedups ...... 124
6.4 Accelerator generation scenarios ...... 125
6.5 Average cost of accelerators per scenario, normalized by a single MicroBlaze ...... 128
6.6 VEX simulator and accelerator models comparison ...... 130

7.1 Resource requirements for several multi-configuration accelerators ...... 144
7.2 Partial reconfiguration overhead and speedups ...... 146

A.1 Megablock and characteristics of generated accelerators ...... 163
A.2 Performance metrics and speedups for the tested benchmarks ...... 164

Acronyms

ALU    Arithmetic and Logical Unit
ASIC   Application Specific Integrated Circuit
BRAM   Block RAM
CCA    Configurable Compute Accelerator
CCS    Compiled Code Simulator
CDFG   Control and Dataflow Graph
CGRA   Coarse Grained Reconfigurable Array
CPL    Critical Path Length
CR     Communication Routine
DIM    Dynamic Instruction Merging
DMA    Direct Memory Access
DPR    Dynamic Partial Reconfiguration
DSP    Digital Signal Processor
FF     Flip Flop
FPGA   Field Programmable Gate Array
FPU    Floating Point Unit
FSL    Fast Simplex Link
FU     Functional Unit
GPP    General Purpose Processor
GPU    Graphics Processing Unit
HDL    Hardware Description Language
HLS    High Level Synthesis
HW/SW  Hardware/Software
ICAP   Internal Configuration Access Port
IC     Integrated Circuit
II     Initiation Interval
ILP    Instruction Level Parallelism
IPC    Instructions per Clock Cycle
IP     Intellectual Property
LMB    Local Memory Bus
LUT    Lookup Table
MAC    Multiply Accumulate
MAM    Memory Access Manager
MIMD   Multiple Instruction Multiple Data
MRT    Modulo Reservation Table
PLB    Processor Local Bus
SIMD   Single Instruction Multiple Data
SISD   Single Instruction Single Data
SoC    System-on-a-chip
VEX    VLIW Example
VLIW   Very Long Instruction Word

Chapter 1

Introduction

With the constant increase of application complexity and data volume, the required computational power increases in tandem. This is especially true for embedded systems, where the performance must be maximized while also incurring the least cost, both in terms of chip area and power consumption. To meet these requirements, a great development effort has always been placed on designing faster, smaller and more power-efficient circuits to tackle the increasing demands of applications that range from industrial scenarios to consumer electronics. Even more importantly, the capabilities of the computing architectures, especially from an application developer point of view, are essential to allow efficient implementations of applications. Specifically, it is generally possible to improve application performance by either 1) enhancing General Purpose Processor (GPP) architectures, or otherwise deploying processor designs more sophisticated than simple single-issue in-order processors, or 2) relying on heterogeneous architectures, where the demanding portions of an application execute in dedicated computing elements, whilst the remainder of the application executes in a main GPP. The following paragraphs briefly contrast these paradigms. Afterwards, the reason why this work adopted the latter type of approach is explained, specifically focusing on the importance of heterogeneous system architectures to efficiently implement embedded applications.

Improving Processor Performance and Architecture In order to improve GPP performance, the most straightforward method is to increase the clock frequency, and thereby the throughput. This clearly has technological limitations, as well as power consumption implications; it is not an architectural improvement. In contrast, other approaches alter the single-instruction sequential pipeline architecture, such as vector or superscalar processors, first introduced in the 1960s [BDM+72, Tho80]. The former are capable of working on several data elements at once, while the latter employ hardware detection of data dependencies to execute more than one instruction per clock cycle. Very Long Instruction Word (VLIW) processors are another approach, in which parallel instructions are detected at compile time and merged into a single instruction. Yet another approach is based on assigning tasks to multiple processors or cores (in multi-core architectures). This allows for thread-level parallelism, managed by application programmers at the software level. Naturally,

each of the approaches has limitations. For instance, runtime exploitation of Instruction Level Parallelism (ILP) with superscalar processors is limited to a small execution scope and requires additional power-hungry hardware to determine the instruction dependencies. As designs incorporate more resources at increasingly smaller scales, the device transistor density increases, making it difficult to dissipate heat and meet power constraints. These technological limitations mean that these architectures may eventually reach a performance wall [EBSA+12, Nat11].

Heterogeneous Architectures An alternative approach is based on implementing applications on systems containing one (or more) GPPs alongside dedicated Intellectual Property (IP) blocks (if considering a single chip) or Integrated Circuits (ICs), specialized for certain tasks. These specialized units execute the portions of the application(s) which represent the bulk of computation, and have potential parallelism to exploit. This follows Amdahl's law, which states that an application's potential speedup is limited by the amount of its computation that can be made parallel. The remainder of the application can be executed in its shortest possible time in a sequential fashion. Dividing the application into tasks to assign to the computing devices on the target system, or designing the heterogeneous system itself based on the application requirements, is typically referred to as Hardware/Software (HW/SW) partitioning [Wol03, Tei12] (or HW/SW co-design). The notion of heterogeneous computing is far from new, and is present at many scales, from High-Performance Computing machines to embedded devices. An example is the use of Graphics Processing Units (GPUs) for personal computing. Another is the use of Digital Signal Processors (DSPs) for embedded applications such as audio/video or encryption. Another commonplace example is the System-on-a-chip (SoC) found in mobile devices, containing different types of computing elements. When targeting these devices, the developer must partition the application: determine from the application specifications which portions are adequate for each device on the system, (re-)write the software with the envisioned partition in mind, and (obligatorily) rely on software libraries/frameworks targeting the dedicated devices in the system, such as OpenCL or CUDA. These types of solutions are very programmable, and cover a large range of applications within their own domains. The use of OpenCL especially increases the applicability of devices such as GPUs [MV15]. However, it may be the case that a more specific and application-dependent system is desired. This is true for cases where the application(s) contain more specific tasks or processes. That is, yet another alternative is to develop the specialized hardware itself, according to a given application partition. In other words, some applications justify the design of ASICs, i.e., an application-specific heterogeneous platform. The flexibility of the system is assured by the GPP, and the performance is assured by the custom circuit(s). This solution however requires very advanced hardware expertise, and the development is lengthy and error-prone. Also, this type of design is only economically viable for large production volumes, as the fabrication costs of ASICs are very high.
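
As a point of reference, Amdahl's law can be written as below, where f is the fraction of the execution time that can be accelerated and s is the speedup obtained on that fraction; this is the standard statement of the law, with notation chosen here for illustration rather than taken from this document.

\[ \mathrm{Speedup}(f, s) = \frac{1}{(1 - f) + \frac{f}{s}} \]

For example, accelerating 80% of the execution time by a factor of 10 yields an overall speedup of 1 / (0.2 + 0.08), or roughly 3.6×.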

To summarize, designing per-application circuits to target demanding data-oriented tasks is usually not economically viable. Instead, solutions such as GPUs or DSPs are satisfactory in terms of balancing cost and performance, and are also more approachable by developers. However, it is also true that these solutions can be either excessive for the performance requirements of an application, or too expensive. Also, designing an ASIC (even for small, well-defined tasks in embedded applications) implies a manual HW/SW partitioning effort which must be very well guided in order to avoid several design iterations, which require lengthy design and validation time. These aspects lead to the notion of not only automated circuit specification, but also of automated HW/SW partitioning. The purpose is to achieve the desired performance increase without requiring a manual partitioning effort, hardware design expertise, or suffering lengthy development time. That is, applications could benefit from improved performance without resorting to expensive off-the-shelf chips or laborious hardware design. The following section explains how FPGAs are ideal platforms for development flows of this nature, especially when iterative design is required. Section 1.2 summarizes existing automated HW/SW partitioning approaches.

1.1 FPGAs as a Platform for HW/SW Partitioning Design

There are several systems which can be classified as heterogeneous. For example, an SoC is essentially a heterogeneous system, since it contains many different types of computing elements, and a recent family of devices from AMD integrates a conventional CPU and GPU on a single chip, supported by a heterogeneous programming paradigm [AMD]. This work however focuses on embedded applications for which solutions such as SoCs, DSPs or manual hardware design are excessive. Instead, FPGAs are the target devices utilized to implement the heterogeneous systems. When SRAM-based FPGAs first appeared, their primary purpose was fast prototyping of hardware circuits. That alone makes them attractive for hardware design, as the circuits may easily go through several revisions. They were not at first seen as deployment devices, due to the small number of logic gates, high static power consumption, and the unfamiliar tools and languages which supported them [Tri15]. With technology improvements, FPGAs can now contain up to 50 million ASIC-equivalent logic gates [Xild] and operate at frequencies around 700 MHz [Xilb]. Also, some FPGA families are geared towards low power consumption [Xil15b]. So, an FPGA is essentially a heterogeneous device in itself, given that it is composed of several different types of hardware components. Due to its programmability, it is possible to implement a SoC on an FPGA, by instantiating any logic the target application requires: essentially a soft SoC, in the same way that Xilinx's MicroBlaze is a soft-core processor. The recent Xilinx Zynq devices are an example of the opposite: hardcore SoC logic coupled to configurable FPGA fabric [Xile]. So both types of chip, processors/SoCs and FPGAs, are converging towards a single type of device. Even outside the embedded domain, FPGAs are presenting themselves as interesting devices for the future of heterogeneous computing. Notable examples include big data and data-center applications [ORK+15, PCC+14]. This demonstrates that FPGAs are not only the prototyping device of choice, but are increasingly interesting as deployment devices, accompanying the trend towards new heterogeneous programming and design paradigms.

1.2 Automated HW/SW Partitioning

Given the effort associated with designing custom co-processor circuits when relying on manual application partitioning, the concept of automated HW/SW partitioning has been the subject of research for several decades [Tei12]. Two types of approaches which generally rely on autonomously migrating execution of software to dedicated hardware can be outlined: those that rely on high-level source code, and those that rely on binary-level information.

1.2.1 High-Level Synthesis

As early as the 1980s, there has been work on specialized compilers which generate custom circuits from high-level code [MS09], complementing traditional software compilers. Examples of current sophisticated vendor tools include Calypto's Catapult C [Cal], Xilinx's Vivado HLS [Xilc], Synopsys' Synphony Model Compiler [Syn], and, more recently, Xilinx's SDSoC [Xila]. In relying on high-level code, typically C, these approaches fall into the category of High-Level Synthesis (HLS). They rely on a designer-guided step to identify candidate functions to translate into circuits, typically those which implement demanding tasks and which are amenable to acceleration. The HLS tools then generate custom circuits by analysis of the source code. However, they impose syntactic and/or functional restrictions on the source code, such as lack of support for floating-point arithmetic, not supporting the complete source language syntax, and dealing poorly with control-oriented functions and loops [MS09]. More importantly, from an ease-of-adoption perspective, they still require considerable designer effort. Specifically, although the hardware is automatically generated, there is still a need for back-and-forth iterative design between hardware and software, just as a manual HW/SW partitioning effort at the source code level would require. The developer must write code while considering the effects of the HLS tool, e.g., how the computation is implemented, and how data is accessed and organized. Also, the partitioned software needs to be re-written to interface with the hardware, and the system must be re-designed to integrate it. This typically entails developing interfaces or equipping the function accelerator with local memories or memory access capability via Direct Memory Access (DMA). Finally, the iterative nature of the flow also comes from the need to ensure that it is the most demanding portions of the application that are targeted for partitioning. A solution is to profile each resulting attempt at a partition, as is done by Xilinx's SDSoC. To avoid these issues, approaches can instead rely on compiler-driven partitioning steps or on an analysis of post-compilation information, i.e., binary code.

1.2.2 Binary-level HW/SW Partitioning

In contrast to the high-level approaches to partitioning, binary-based approaches rely only on post-compilation information. That is, they rely on the application binaries resulting from a standard compilation flow. This is often referred to as binary-level HW/SW partitioning, or dynamic HW/SW partitioning if the approach relies on runtime binary information. There are a number of consequences to this type of approach: the partitioning step is no longer (usually) developer-guided, the application is not necessarily partitioned along function boundaries, any syntactical complexity (or otherwise unsupported syntactical constructs) of the source code is no longer an issue, and the source code does not need to be modified to interface with the generated accelerator hardware. Since it is up to either a compiler or a post-compilation step to manipulate or analyse the binary for partitioning, there is less interference with the traditional software development flow. The portions of the application which are automatically detected for acceleration may be short acyclic sequences of instructions [CBC+05, NMIM12], instruction sequences delimited by backwards branches (i.e., basic blocks) [LV09], or even sequences of basic blocks [BRGC08, LCDW15]. Consequently, the partitioned code may represent only a portion of a function body, or may include an inlined call to a nested function. In order to correctly target the most demanding portions of the application, the partitioning may rely on execution traces to determine frequently executing instruction sequences, as opposed to selecting sequences solely by a static binary analysis. This type of approach has some disadvantages. Firstly, the binary lacks information which is present in the high-level source code. For instance, these approaches suffer when the compiler introduces built-in subroutines for data-type casting or floating-point emulation. The context of the operation being performed is lost at the binary level. Secondly, a binary-level analysis cannot usually rely on optimizations such as loop fission or fusion. Some approaches attempt to recover some information via decompilation [SV05]. Thirdly, the performance improvements depend on how the computations are expressed by the target GPP's instruction set. This type of approach has been applied to single-issue [LV09, BRGC08] and VLIW instruction sets alike [FDP+14]. Finally, these approaches typically do not rely on a complete circuit synthesis as such, as HLS approaches do. Instead, they are based on customization or reconfiguration of a pre-designed accelerator peripheral, such as a Coarse Grained Reconfigurable Array (CGRA) or another array of processing elements. On one hand, the developer does not need to manually integrate the custom hardware into the remainder of the system. On the other hand, the use of a modified processor and/or binary modification may be required, and the hardware adaptability is limited by the capabilities of the configurable accelerator(s).

1.3 Motivation and Problem Statement

As the previous paragraphs have shown, there are many ways to devise, and support development for, heterogeneous computing systems for acceleration. Depending on the application, powerful devices such as GPUs or DSPs might be the appropriate solution. However, this work focused on embedded applications struggling to meet performance requirements or under resource or cost constraints. That is, the target scenarios are those where one (or a few) custom circuit(s), coupled to a host processor, would suffice to achieve the desired performance increase. However, the manual design of these systems is a difficult task for non-experienced software designers, and is also lengthy even for a joint effort by a hardware team and a software team. Given this, and considering the type of target application, some form of HW/SW partitioning approach seems to be the ideal solution, but despite the existing approaches there are several issues with HW/SW partitioning that hinder its widespread adoption. Specifically, consider that: 1) software developers generally have no hardware design experience, even those familiar with embedded systems, so any partitioning flow must be as non-intrusive as possible and require the least amount of hardware expertise possible; and 2) the automatically generated hardware must be as specialized as possible, in order to improve performance while also being resource-efficient. Some tools have contributed significantly towards these aspects, creating high-level IDE-based partitioning flows [Cal, Xilc, Syn]. These tools rely on powerful optimization steps to generate very specialized circuits, maximizing the resulting performance. Other approaches rely on binary code, and target a pre-designed configurable accelerator. In other words, the first type of approach is geared towards circuit specialization, relying on generating hardware descriptions which are later processed by vendor synthesis tools, while the second relies on circuit reconfiguration, by mapping partitions onto a configurable peripheral. This has implications for a particular aspect: accelerating multiple partitions (either complete function bodies or instruction sequences, depending on the approach). The HLS approaches typically generate one custom circuit per function when targeting multiple functions; there is no re-utilization of resources between custom circuits. This might imply a considerable resource cost, which is especially wasteful if the custom circuits do not operate in parallel. Binary-level approaches have disadvantages as well: the co-processor hardware might be excessive for the partition(s) being accelerated; conversely, it might be impossible to accelerate a partition if the hardware resources are insufficient; the operating frequency might decrease with circuit complexity and reconfiguration capabilities, while the amount of required runtime reconfiguration information increases; finally, the performance of the accelerated partition(s) might be sub-par, since the selected partition(s) have to adapt to the existing accelerator hardware, as opposed to generating custom hardware from the selected partition(s). There is another aspect at play when relying on a reconfigurable circuit to execute multiple software partitions: the resource re-utilization, at a processing element level, is largely dependent on the similarity of the partitions. That is, a custom circuit with reconfiguration capabilities does not necessarily imply a good re-utilization of all processing elements it contains.
A good analogy is the under-utilization of resources in VLIW processors [Liu08]. So the issue is not only one of resource re-utilization, at the processing element level, but one of total area savings. That is, the trade-off in question is how to provide resource re-utilization and area savings via reconfiguration, without compromising circuit specialization.

To summarize, this work considered: 1) automatic accelerator generation for (relatively) simple embedded applications for which one (or a few) relatively small custom circuit(s), coupled to a host processor, are enough to achieve the desired performance increase; 2) doing so without programmer intervention or tool chain interference; and 3) maximizing both the resource utilization and area savings of said accelerators by exploring reconfiguration mechanisms. Or, in a single sentence, this work developed a HW/SW partitioning approach whose central theme was:

How to automate the transparent generation of area-efficient specialized reconfigurable accelerators for embedded applications?

To address the issue, this work relied on an existing methodology to extract repetitive instruction traces [Bis12], and implemented the automated generation of accelerator hardware, its integration with the host system, and transparent migration of execution from processor to accelerator. Additionally, in order to make better use of hardware, the system has the capacity to reconfigure the accelerator resources at runtime in several ways. The following sections detail the specific objectives and the approach leading to this end result.

1.4 Objectives

The problem statement in the previous section already summarizes the major objective of this work, where two aspects are essential to the motivation. One is the transparency of the partitioning to developers, but the most important is the generation of tailored reconfigurable circuits which efficiently accelerate critical portions of an application. The following features contribute to the first aspect: avoiding modification of source or binary code (manual or automated), no manual hardware design, and no modifications to the host processor or otherwise manual integration of the accelerator hardware. Relative to the second aspect, to accelerate the demanding portions of the application, a selection must be performed, but in a way that does not compromise the first requirement. The need for transparency led to relying only on binary information, specifically, binary instruction traces. Additionally, a requirement was set that the binary was to remain unmodified, both pre- and post-deployment onto the target. It is obvious how avoiding the modification of source code contributes to the transparency, but modifying the binary either during or after compilation would not interfere much from the point of view of the programmer. However, in order to capture the critical portions of an application it is important to rely on traces rather than static binary analysis. That is, the application must be monitored while it executes on the host processor, meaning the binary must remain unmodified. Also, not interfering with the binary avoids modified compilers or post-compilation tools. Finally, a modified binary would be bound to the accelerator-augmented system, making it impossible to deploy the same application to a software-only platform. As for the generation of efficient accelerators, this relates to the trade-off explained earlier between specialization and reconfiguration. The reconfiguration capabilities typically explored by binary approaches rely on configuration registers or similar mechanisms to determine the functions performed by the accelerator circuit. Instead, efficient reconfiguration would be possible if the same hardware components, and therefore chip area, could be harnessed to execute multiple partitions, without requiring additional control logic which is detrimental to specialization. That is, the issue would be resolved if the reconfiguration mechanisms (or at least some portion thereof) were not themselves components of the accelerator circuit.

In short, the objectives for the developed work were to:

1. Design an accelerator architecture capable of:

(a) Accelerating execution of traces by exploiting ILP and loop pipelining
(b) Exploiting data parallelism by supporting concurrent accesses to data memory

2. Transform sets of frequent instruction traces into accelerator instances/configurations.

3. Design mechanisms capable of:

(a) Transparently migrating execution from software to a custom accelerator instance
(b) Reconfiguring the accelerator
(c) Handling the data transfers between GPP and accelerator

4. Allow for accelerator reconfiguration, without compromising area savings or circuit specialization, by enhancing it with fine- and coarse-grain reconfiguration mechanisms.

In other words: the main objective was to accelerate the execution of one or more frequent binary instruction traces, using dedicated circuits with efficient reconfiguration capabilities. To do this, a tool flow was developed which generates a tailored instance of a reconfigurable accelerator, without manual hardware design effort. The accelerator is coupled in a non-intrusive fashion to a host processor. According to the execution flow of the application at runtime, the accelerator is invoked to accelerate any one of the supported instruction traces. By relying on a runtime mechanism, the execution is transparently migrated from the processor to the accelerator, without modifying the executing binary. Finally, several fully-functional prototype systems were validated, using commercial FPGAs. Several sets of embedded benchmarks were used to determine the acceleration gains achieved via the automatically generated accelerators.

1.5 Approach

The developed approach to HW/SW partitioning and binary acceleration relies on four steps: 1) identification of candidate instruction traces, 2) translation of these traces into a custom accelerator instance, 3) detection of imminent execution of translated traces at runtime and reconfiguration of the accelerator, and 4) transparent migration of execution to the accelerator. The first two steps are performed by an offline tool chain, and the last two occur at runtime, using auxiliary hardware. Figure 1.1 summarizes this concept. The identification and translation steps are based on a simulated execution step which detects frequently executing instruction traces called Megablocks

Figure 1.1: Proposed transparent binary acceleration approach

[Bis12], using an existing tool [Bis15]. The developed set of tools generates a specific tailored instance of the proposed accelerator architectures. Further details on the tool flow and detection steps can be found in Section 3.3, while the following section briefly explains the type of instruction trace this approach detects, translates and accelerates. The right-hand side of Fig. 1.1 shows a generic architectural view of the accelerator-augmented system. The main idea is that a lightweight mechanism is capable of migrating execution of the translated instruction traces to the custom accelerator instance. The accelerator and GPP are capable of exchanging data and, to increase applicability, the accelerator supports memory accesses. Support for memory access by the accelerator and its advantages are discussed in Chapter 5. The detection step refers to determining when the processor is attempting to execute any one of the translated instruction traces. This is done by a monitoring module which observes the instruction address. The migration stage is the process of preventing the execution of the trace from occurring via software, and instead invoking the accelerator, as well as handling the return to software execution afterwards. The migration step also involves reconfiguring the accelerator to execute the migrated trace. This is done in several ways, depending on the accelerator design. As per the objectives, this fulfils the task of transparently migrating execution between processor and accelerator, in a fashion that is explained in detail in Section 3.4.
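
As a rough illustration of the end-point of this migration, a tool-generated Communication Routine (detailed in Chapter 3) conceptually behaves like the C sketch below. This is a simplified, hypothetical example: it assumes the putfsl/getfsl Fast Simplex Link macros from the Xilinx MicroBlaze BSP header mb_interface.h, and the operand counts, FSL channel and array-based argument passing are illustrative only; the actual generated routines handle processor registers directly.

#include <mb_interface.h>       /* putfsl()/getfsl() FSL macros from the Xilinx BSP */

#define N_OPERANDS 3            /* hypothetical: number of live-in values of the trace  */
#define N_RESULTS  2            /* hypothetical: number of live-out values of the trace */

/* Conceptual Communication Routine for one accelerated Megablock: send the
 * live-in values to the accelerator, block until it finishes, read back the
 * live-out values, and then return so that software execution resumes. */
void cr_megablock0(const unsigned int *live_in, unsigned int *live_out)
{
    int i;

    for (i = 0; i < N_OPERANDS; i++)
        putfsl(live_in[i], 0);  /* blocking write to FSL channel 0 */

    for (i = 0; i < N_RESULTS; i++)
        getfsl(live_out[i], 0); /* blocking read; stalls until the accelerator produces results */
}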

1.5.1 Megablock Trace

As the previous section explained, an important characteristic of the developed approach is that it relies on binary traces to capture the actual application workload. One type of trace that fulfils this is the Megablock [BC10, Bis12]. Megablocks are binary instruction traces which have one entry point and several exit points. Whereas basic blocks are delimited by control instructions (e.g., backward branches), Megablocks may incorporate multiple branch instructions, including backwards branches, and sub-routine return instructions. A formal definition of the Megablock can be found in [Bis12]:

“Let P be a static program formed by a sequence of instructions [i1, i2, ..., im]; a trace

Figure 1.2: Representative synthetic example of a Megablock trace, and resulting CDFG

T is generated by executing the program, and is composed of possibly repeating instructions from P; let S be a sequence of m instructions where m > 1; a Megablock is a contiguous subsequence of T formed by repeated occurrences of S, and is represented as S{n} where n is the number of times S repeats. E.g. let [i4, i5, i6, i4, i5, i6, i4, i5, i6] be a contiguous subsequence of T; S = [i4, i5, i6] and S{3} is the corresponding Megablock.”

In short, a Megablock is a sequence of trace elements of smaller granularities (e.g., basic blocks). It may cross control flow boundaries, and represents a single repetitive path of execution, i.e., a loop path. Figure 1.2 shows a synthetic example of this, along with a Control and Dataflow Graph (CDFG) representation of the trace instructions. According to branch instructions, execution follows a particular path. Execution of this path repeats any given number of times, due to a backwards branch that leads execution back to the start of the trace. Execution of this particular path terminates when any branch executes in such a way as to follow a different path. This type of trace provides information not available to a static binary analysis, or to a higher-level execution profiler such as gprof. Also, in tracing the execution across several basic blocks, the size of the detected traces may increase, which means that a greater workload stands to be migrated to custom hardware. The focus of this work is on the automated generation of accelerators, and on exploring architectures to support the execution of these traces. So, although the trace profiling and CDFG extraction and optimization steps are vital, this work will not focus on runtime implementation of these features. To develop and validate the accelerator generation process, Megablock detection is performed offline, as the previous section stated. Section 3.3 explains the tool used to retrieve Megablocks from target applications, how it integrates with the developed accelerator generation tool flow, as well as the advantages and limitations of this type of trace.
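
To make the definition concrete, the following C sketch (not the actual Megablock extractor, which operates inside an instruction-level simulator) counts how many times a candidate pattern of length len repeats contiguously in an executed-address trace, i.e., it finds the n of S{n}. The function name and the use of raw address arrays are illustrative assumptions.

#include <stddef.h>
#include <stdint.h>

/* Count contiguous repetitions of the pattern trace[start .. start+len-1]
 * inside the trace, starting at position 'start'. A result n > 1 means the
 * pattern S repeats back-to-back, forming a Megablock-style S{n}. The real
 * detector additionally tracks exit points and enforces a minimum size. */
static size_t count_repetitions(const uint32_t *trace, size_t trace_len,
                                size_t start, size_t len)
{
    size_t reps = 1;

    while (start + (reps + 1) * len <= trace_len) {
        size_t i;
        for (i = 0; i < len; i++) {
            if (trace[start + reps * len + i] != trace[start + i])
                return reps;    /* the pattern stopped repeating */
        }
        reps++;                 /* one more full occurrence of S */
    }
    return reps;
}

For the trace [i4, i5, i6, i4, i5, i6, i4, i5, i6] used in the definition above, calling count_repetitions with start = 0 and len = 3 returns 3, matching S{3}.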

1.5.2 Generating a Reconfigurable Customized Accelerator Instance

The generation of a specialized accelerator comprises the first two steps of the approach. Several accelerator architectures are presented in this document, but the overall methodology is the same: the accelerator is a hardware template whose features, such as Functional Units (FUs) and interconnections, are automatically customized by tools which translate one or more CDFGs into a set of Verilog parameters. Each generated accelerator instance is thus tailored for a specific set of CDFGs, and is capable of accelerating each one in a time-multiplexed fashion. The details of the translation of Megablock CDFGs into accelerator instances vary with the accelerator architecture, but the general flow is discussed in Chapter 3. In keeping with the design rationale of the approach, the generation of custom accelerator instances happens without developer effort, and at a post-compilation stage. As the objectives propose, the balance between specialization and reconfiguration hinges on the ability to control the accelerator circuitry at several levels. So this work was based on the notion of an accelerator architecture which could be reconfigured at two different granularities.

1.5.2.1 Reconfiguration at the Functional Unit Level

As mentioned before, the kind of accelerators targeted by binary acceleration approaches usually contain visible configuration registers to control the interconnections between computing elements. Instead of targeting a pre-designed accelerator whose reconfiguration capabilities are fixed, the developed approach generates the entire accelerator structure, including the reconfiguration logic. The reconfiguration of the accelerators is essentially the control of the flow of data between FUs via specialized multiplexers. That is, the accelerators are configurable datapaths. The implementations in Chapters 4 and 5 are reconfigured prior to trace execution. The implementation in Chapter 6 is more complex, and requires per-cycle control of its FUs and storage elements. Generating custom interconnectivity is already a solution towards not compromising specialization. However, as the number of CDFGs to support increases, the amount of accelerator resources required increases, as well as the FU interconnection complexity. This leads to an increase in required area and potentially a decrease in operating frequency. Also, the efficient re-utilization of computing resources is not assured, despite the tailored interconnect, since it depends on the similarity between CDFGs. The second reconfiguration mechanism addresses these issues.
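
Purely as an illustrative model (the actual configurations are Verilog parameters and registers emitted by the generation tools, as discussed in Chapter 3), this fine-grained reconfiguration can be pictured as a small table of multiplexer selections, one entry per supported CDFG, applied before the loop starts executing. All names, counts and field widths below are hypothetical.

#include <stdint.h>

#define NUM_FUS       4         /* hypothetical number of Functional Units */
#define INPUTS_PER_FU 2         /* two operand inputs per FU               */

/* Per-CDFG configuration: each FU input selects which producer output
 * (another FU or a live-in register) drives it. */
typedef struct {
    uint8_t mux_sel[NUM_FUS][INPUTS_PER_FU];
    uint8_t exit_branch;        /* which branch unit terminates the loop   */
} accel_cfg_t;

/* One entry per accelerated Megablock; applied to the accelerator's
 * configuration registers (or fixed at synthesis) before execution starts. */
static const accel_cfg_t accel_cfgs[] = {
    { .mux_sel = { {0, 1}, {2, 0}, {1, 3}, {2, 2} }, .exit_branch = 3 },
    { .mux_sel = { {1, 0}, {0, 2}, {3, 1}, {0, 3} }, .exit_branch = 2 },
};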

1.5.2.2 Module-Level Reconfiguration

The main issue with supporting too many configurations, even in a specialized accelerator instance, is that the reconfiguration logic becomes excessive. So the solution is to reconfigure the accelerator without using in-module logic to do so. This work relied on the use of Dynamic Partial Reconfiguration (DPR) to implement this functionality [Xil12b]. The capacity for DPR is an architectural feature of some SRAM-based FPGAs which allows for the modification of a predefined area of the chip's circuitry without powering down the device.

Figure 1.3: DPR Oriented system design

Essentially, given a user-defined area of the device, it is possible to define several circuit configurations for it, and to switch between them at runtime according to design or application criteria. Although relatively old (early examples include the Xilinx XC6200 and the Virtex-II Pro), DPR is still an under-explored feature of FPGAs. However, some academic implementations exist which rely on DPR, targeting very distinct application domains, for instance: self-configuring filters [LPV10], fault tolerance via DPR [ESSA00], image applications [LP13, SKK15], or communications [LFy09, Dun13]. In order to modify the circuit connections at runtime, Xilinx FPGAs rely on an Internal Configuration Access Port (ICAP), which can be accessed through software or from user-level logic. Each configuration of a dynamically reconfigurable area is stored in a partial bitstream file which is written to the FPGA's configuration memory via the ICAP.

Figure 1.3 shows the system architecture again, this time detailing the two reconfiguration mechanisms. The accelerator is segmented into instance-independent static logic and a reconfigurable portion. The former includes memory ports and interfaces with the processor. The latter includes all FUs and the configuration registers which control the interconnection logic. The circuitry in this region is switched in and out as a whole via DPR.

Two use cases can be met by relying on this feature. A possible approach is to take a set of candidate Megablock traces and generate an accelerator instance for each one. Each instance is stored in a partial bitstream file. Fully dedicated instances like this could potentially reduce interconnectivity complexity relative to a multi-configuration accelerator. However, this represents an additional cost in terms of memory to hold the several bitstream files and, more importantly, an additional ICAP reconfiguration overhead for every loop call to accelerate. So, to alleviate this time and storage overhead, several CDFGs can be used to generate a single accelerator instance which is controllable via context registers and multiplexers.

The more appropriate strategy depends on the workload of the target application, and to some extent on the target device. In a device with fewer resources, partial configurations supporting only one CDFG could be more appropriate. If the device is capable, then a single larger instance capable of executing several Megablocks could be used. Alternatively, several multi-loop accelerators could be switched according to the application. Essentially, the same application can be accelerated in different ways without the need for recompiling.
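As a structural illustration (not the exact modules of the implemented system), the sketch below shows how a DPR-based accelerator is typically organized at the HDL level: the static logic instantiates the reconfigurable region as an ordinary module with a fixed port list, and the partial bitstreams generated for that region are what get swapped through the ICAP at runtime. Floorplanning constraints, decoupling logic during reconfiguration, and the actual port widths are omitted, and all names are hypothetical.

// Hypothetical static wrapper for a DPR-based accelerator. The instance u_rp
// is mapped to a reconfigurable partition: writing a partial bitstream through
// the ICAP replaces its contents (FUs and configuration registers) while the
// memory ports and processor interface in this wrapper remain untouched.
module accel_static_wrapper (
    input  wire        clk,
    input  wire        rst,
    input  wire [31:0] gpp_data_in,   // operands/configuration from the MicroBlaze
    input  wire        gpp_wr_en,
    output wire [31:0] gpp_data_out,  // results back to the processor
    output wire [31:0] mem_addr,      // port into the local data memory (static)
    output wire [31:0] mem_wdata,
    input  wire [31:0] mem_rdata,
    output wire        mem_we
);
    accel_reconf_region u_rp (        // contents differ per partial bitstream
        .clk(clk), .rst(rst),
        .data_in(gpp_data_in), .wr_en(gpp_wr_en), .data_out(gpp_data_out),
        .mem_addr(mem_addr), .mem_wdata(mem_wdata),
        .mem_rdata(mem_rdata), .mem_we(mem_we)
    );
endmodule

// One possible reconfigurable module for the partition above; other partial
// bitstreams implement different FU mixes behind exactly the same port list.
module accel_reconf_region (
    input  wire        clk,
    input  wire        rst,
    input  wire [31:0] data_in,
    input  wire        wr_en,
    output reg  [31:0] data_out,
    output wire [31:0] mem_addr,
    output wire [31:0] mem_wdata,
    input  wire [31:0] mem_rdata,
    output wire        mem_we
);
    // Placeholder datapath: a single accumulating FU, no memory accesses.
    assign mem_addr  = 32'd0;
    assign mem_wdata = 32'd0;
    assign mem_we    = 1'b0;
    always @(posedge clk)
        if (rst)        data_out <= 32'd0;
        else if (wr_en) data_out <= data_out + data_in;
endmodule

Under the per-loop strategy, each partial bitstream would contain a fully dedicated datapath; under the multi-loop strategy, the swapped region still keeps its own context registers and multiplexers, trading some interconnect complexity for fewer ICAP reconfigurations.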

1.6 Contributions

Basing the automated generation of accelerator hardware/configurations on binary information allows embedded applications to be accelerated without the need for manual hardware design, and without the need to re-target the application to execute on the resulting heterogeneous system. This facilitates future revisions to the application, since accelerator hardware can easily be regenerated to target new versions of the binary, which increases adaptability and applicability. Because execution times are shortened through hardware execution, and because of the specialization of the accelerator hardware, power savings may also be achieved. However, as is later summarized in Chapter 2, existing binary acceleration approaches suffer from some limitations. Many approaches accelerate only small portions of code, detected by either a static or runtime analysis. Binary modification is in some cases used to aid either the runtime migration of execution, or the supporting tools. The target accelerators are usually fixed arrays of FUs, so the resulting performance depends on computing capabilities which are not precisely tailored for the detected partitions. Support for memory access is especially important to exploit data parallelism, but is usually non-existent or limited. Finally, accelerators are either intrusively integrated with the host or suffer from large communication overheads if loosely coupled.

This work relies on instruction traces, rather than static binary analysis, to target the portions of the application which represent the bulk of the workload. Megablock traces are not delimited by a single branch instruction, and necessarily represent frequent loop paths. To properly exploit potential parallelism in loops, loop pipelining is implemented. Support for concurrent memory accesses is also addressed, being necessary in itself to support data parallelism within one iteration, but becoming more important when access contention increases due to the pipelining of iterations. Also, the developed approach allows the binary to remain unmodified, both during partitioning and during runtime, by relying on a transparent migration mechanism. Generating an accelerator with dedicated reconfiguration capabilities allows for the acceleration of multiple partitions without completely sacrificing circuit specialization. Tailoring the accelerator hardware to the detected trace helps to optimize the trade-off between resource requirements and performance, and potentially power consumption as well. Also, since the accelerator is customized, a candidate trace will not have to suffer a performance decrease or be discarded due to lack of FUs.

In summary, by completing the objectives through implementation of the proposed methodology and architecture, this work extends the state of the art in the following aspects:

1. Transparent acceleration of unmodified binary

2. Generation of specialized multi-loop accelerators

3. Joint exploitation of data-parallelism and loop-pipelining

4. Accelerator reconfiguration mechanisms for resource re-utilization and area savings

This work is a continuation of preliminary research done for an M.Sc. thesis [Pau11]. In the referenced publication, work was conducted regarding the transparent runtime migration of execution between software and hardware. Preliminary results were attained by using custom tools to translate a set of CDFGs into Verilog specifications of a runtime reconfigurable accelerator. The system was capable of transparently migrating execution between software and hardware, utilized external memories, and did not support accelerator memory access. The target platform was a commercial FPGA, and the system was tested with 7 benchmarks, 2 of which generated accelerators with 6 runtime configurations. Achieved speedups ranged from 0.2× to 65×.

The work presented here expanded the existing implementation regarding support for memory access, more efficient accelerator architectures, exploitation of loop pipelining, and support for floating-point operations. Extensive on-chip validations of the systems were performed, using vendor FPGA boards and several sets of benchmarks. Chapter 3 summarizes the work by providing an overview of the accelerator designs and supporting tool flows. The following section summarizes the publications regarding this work and focuses on their individual contributions.

1.7 Summary of Published Work

In the context of this work the following publications, by type and order of submission, have been produced:

1.7.1 International Journals

1. Transparent Trace-Based Binary Acceleration for Reconfigurable HW/SW Systems [BPFC13] - This paper, published in the IEEE Transactions on Industrial Informatics, further expands on the system by presenting results for a set of 16 benchmarks, as well as speedup estimations for a point-to-point connection-based architecture over a set of 62 benchmarks which include memory operations. Speedup estimations average 2.7×.

2. Transparent Runtime Migration of Loop-Based Traces of Processor Instructions to Reconfigurable Processing Units [BPCF13] - An extension of [BPCF11], published in the International Journal of Reconfigurable Computing, which compiles results of 17 benchmarks for three different system architectures that utilize the same accelerator architecture and tool chain. The three architectures combine two interfaces between GPP and main memory (local or external memories) and between GPP and accelerator (bus or point-to-point). The architecture using local memory and a point-to-point connection is an implementation of the architecture conceptualized in [BPFC13]. In this architecture, the accelerator is reconfigured in parallel with data communication. The accelerator was optimized to handle values produced up to one iteration earlier, and the tool chain received numerous optimizations and new features. A method to detect CDFGs online via dedicated pattern detection hardware is also proposed. Speedups for the architecture suffering from the least overhead range from 1.26× to 3.69×, which comes close to the best possible achievable speedup for the implemented parallelism. Speedup estimations using the previously mentioned formula incur an average error of only 1.75 %.

3. A Reconfigurable Architecture for Binary Acceleration of Loops with Memory Accesses [PFC14a] - Published in the ACM Transactions on Reconfigurable Technology and Systems as an extension of the work published in [PFC13]. A more efficient interface between accelerator and GPP is employed. The placement of load/store FUs is optimized in order to decrease the number of cycles during which execution is waiting for the completion of memory accesses. A static memory access scheduling mechanism further decreases the waiting time. A geometric mean speedup of 1.71× is achieved for 37 integer benchmarks.

Additionally, a paper entitled Loop Pipelining in Customized Accelerators for Transparent Binary Acceleration has been submitted to the IEEE Transactions on Very Large Scale Integration (VLSI) Systems. This work vastly expands the results attained with a single-row modulo-scheduled accelerator first introduced in [PFBC15]. It is evaluated with a larger set of benchmarks, including floating-point kernels, using a more sophisticated software test harness. Additionally, the fully customized accelerator instances are compared with ALU-based accelerators in terms of the performance/resource trade-off. Finally, several VLIW models are simulated and compared against the proposed accelerator design, including the ρ-VEX reconfigurable VLIW. Versus a single MicroBlaze, the geometric mean speedup for 13 floating-point kernels is 6.61×, and for 11 integer kernels it is 1.78×. Compared to a 4-issue VLIW, this value is 1.78×. It is also shown that this accelerator design is more efficient in terms of resources than previous implementations.

1.7.2 International Conferences

1. From Instruction Traces to Specialized Reconfigurable Arrays [BPCF11] - Published in the Proceedings of the 2011 International Conference on Reconfigurable Computing and FPGAs. This work builds on the work presented in [BPFC13] by presenting estimations of speedups attained by using the same tool chain but with an improved hardware architecture, using a local memory interface for the GPP's program code. Speedup estimations range from 1.0× to 2.0×. This publication includes work relative to the Megablock trace, detailing how a Megablock is composed and detected.

2. Architecture for Transparent Binary Acceleration of Loops with Memory Accesses [PFC13] - Published in the Proceedings of the 9th International Symposium on Applied Reconfigurable Computing. This work explores the issue of supporting an accelerator with arbitrary memory accesses. The memory-access-enabled system is built on the local memory variant of the hardware/software partitioning system in previous work. The Megablock detection and accelerator generation tool chain was reworked to produce a tailored accelerator that supports up to two concurrent memory accesses to the local dual-port data memory. By using a memory sharing mechanism, both GPP and accelerator can access the program data, without incurring any data transfer overhead or introducing any synchronization issues. For 17 benchmarks, a maximum potential speedup of 2.04× is possible.

3. Trace-Based Reconfigurable Acceleration with Data Cache and External Memory Support [PFC14b] - Published in the Proceedings of the 12th IEEE International Conference on Embedded and Ubiquitous Computing. In this paper, the accelerator architecture is re-designed to allow for the simultaneous activation of multiple computing stages, thereby implementing loop pipelining. It was also adapted to deal more efficiently with higher memory access latencies. The design was evaluated using a system in which data resides completely in external memory, so as to test the applicability of the approach to larger applications. To cope with this, a configurable dual-port cache was designed to augment the accelerator. For 12 benchmarks, a geometric mean speedup of 1.91× was achieved.

4. Transparent Acceleration of Program Execution Using Reconfigurable Hardware [PFBC15] - Published in the Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. This publication summarizes results obtained with the multi-row accelerator architectures, and presents early results using a single-row modulo-scheduled accelerator design. Additionally, the proposed approach is compared to the performance of Xilinx's Vivado HLS flow. For 12 benchmarks, the geometric mean speedup for the most efficient proposed architecture was 5.44×.

1.7.3 National Conferences

1. Generation of Coarse-Grained Reconfigurable Processing Units for Binary Acceleration [BPCF12] - Published in the Proceedings of the 8th Portuguese Meeting on Reconfigurable Systems (VIII Jornadas sobre Sistemas Reconfiguráveis), this paper compares the speedup results of 17 benchmarks for the architecture developed in [Pau11] and for an actual implementation of the proposed architecture introduced in [BPCF11]. The tool chain is explained in detail and an accurate speedup estimation formula is presented. Speedups attained for the local memory based architecture deviate slightly from the estimations presented in [BPCF11], but are within expected values.

2. Transparent Binary Acceleration via Automatically Generated Reconfigurable Processing Units [PCF15] - Published in the Proceedings of the 11th Portuguese Meeting on Reconfigurable Systems (XI Jornadas sobre Sistemas Reconfiguráveis). This paper goes into further detail on the 1D accelerator design and the modulo scheduling of Megablock traces. The comparison with previous approaches is this time more thorough, as up to 37 benchmarks are used to evaluate the systems.

1.8 Structure of this document

The remainder of this document is organized as follows: Chapter 2 details the general characteristics of co-processors, and summarizes related binary-level partitioning approaches as well as other similar works; Chapter 3 gives an overview of the developed approach, detailing the overall partitioning approach in terms of tool flow and the architecture of all implemented systems; it also contains more detailed summaries of the following three chapters. Chapters 4 to 6 each present a different accelerator design iteration, as well as the experimental evaluation performed. Chapter 7 explains how DPR augments the accelerator architecture of Chapter 6 with reconfiguration capabilities which help to maintain circuit specialization in multi-configuration accelerators without greatly increasing resource requirements. Each chapter also presents the translation tools targeting the particular accelerator design, as well as the system-level architecture. Finally, Chapter 8 presents the conclusions, future work and possible approaches to address it. Additionally, in Appendix A, a proof-of-concept for external memory access support by the accelerator is presented.

Chapter 2

Revision of Related Work

Research regarding runtime reconfigurable systems and hardware/software co-design spans more than two decades [CH00, Nag01, Har01, Wol03, SML09, SBFC10]. Although the first notion of a runtime adaptable machine has existed since the 1960s [Est02], an architecture capable of autonomously generating or modifying hardware at runtime to suit execution needs has yet to be developed. The difficulty rests on finding a consistent, scalable and flexible methodology or runtime algorithm that could, potentially, generate hardware as efficient as a custom design.

To provide for transparent runtime acceleration, approaches have been proposed that avoid the need for per-application hardware accelerator design by automatically generating hardware specifications of, or configurations for, dedicated accelerator hardware. Such peripherals are used to execute demanding portions of applications, typically data-oriented loops. Speedups are achieved by exploiting the Instruction Level Parallelism which is expressed by CDFG representations of said application partitions.

2.1 Overview

Existing approaches employ accelerators that are capable of varying degrees of reconfigurability and target the execution of different types of CDFGs. Usually, an accelerator is an array of FUs, either in a mesh type or row-based arrangement, with a configurable interconnection structure. The following are the main characteristics of such co-processors:

1. Type of accelerated application partition and method for its detection and translation;

2. Interface between accelerator and host system;

3. Granularity, arrangement, and connections between the Functional Units (FUs) of the accelerator;

4. Execution model of the accelerator;

5. Support for accelerator memory access;

6. Runtime control of the accelerator.


The following sections detail these properties and refer to implementations which exemplify different design choices. Most approaches implement these systems in FPGAs, although ASIC approaches have also been proposed [SLL+00, But07, LV04]. Approaches usually assume that the accelerator and the General Purpose Processor (GPP) reside on the same chip. FPGAs, in any case, seem to be the ideal target platform: they allow for extensive design space exploration, both for purposes of research into such systems, and for flexibility in the later deployment of efficient HW/SW partitioning systems [Wol03].

2.1.1 Partitioning

Performance improvement of an application through accelerators is achieved by first choosing or detecting demanding computation to be accelerated. Some approaches rely on forms of High Level Synthesis (HLS), i.e., offline stages involving manual source code analysis or use of tools to identify candidate functions [HW97, HFHK04, WH95, BFM+07]. A translation of high-level code is performed, to automatically generate Hardware Description Language (HDL) code, followed by standard logic synthesis. This may require modifying source code to allow compatibility with HLS tools, or later modification to allow for integration with the accelerator hardware.

In contrast, the approaches addressed throughout this chapter rely on binary information. Approaches of this kind analyse the compiled application, and provide transparent use of custom hardware for the application programmer. Binary acceleration focuses on exploiting Instruction Level Parallelism (ILP), by representing the target binary instruction sequences as CDFGs. Binary analysis can be performed over the static binary or over the executing instruction trace. The former makes more sense as an offline strategy: the binary can be analysed with custom compilers and modified either during or after compilation [CBC+05]. The latter is appropriate for runtime implementations, as the instruction trace can be directly observed [LV09, BRGC08].

Observing the executing instruction stream provides additional information which is not found by static binary analysis. Due to this, approaches which perform CDFG detection offline also resort to simulated execution to extract trace information [NMIM12, BPFC13], but it is not guaranteed that simulated execution matches the behaviour of the application in-system (e.g., different volumes of incoming data, or event-based execution). This is also true for systems which perform discovery at runtime for only a limited time after program startup. If the application behaves sporadically, it is incorrect to assume that a predetermined amount of profiling after the start of execution will be representative of the execution demands of later periods. A better approach is to employ periodic or constant profiling [LV09].

Runtime binary translation provides even greater transparency, because the use of any custom tools prior to deployment is avoided. Unlike HLS, binary translation is better suited for runtime implementations because binary code is more easily parsed, making it more tractable for embedded tools. Despite the greater transparency, runtime implementations imply additional overheads. Hardware is required to retrieve traces and perform analysis, and depending on the architecture, temporal overheads may be introduced during the execution of the application [LV09, RBM+11]. Because accelerator configurations are generated in-system, the embedded translation steps must be robust in order to ensure functional correctness of the accelerated regions.


(a) Mesh arrangement for Functional Units used by the MorphoSys array (adapted from [SLL+00]); (b) Row-based Functional Unit topology employed in the CCA approach [CKP+04]

Figure 2.1: Two different Functional Unit arrangements for reconfigurable arrays

Also, the tasks of binary analysis, CDFG detection and translation are themselves computationally demanding. As such, runtime binary translation will be less powerful, detecting smaller kernels and performing fewer optimizations. Even the minimalist implementation of runtime synthesis used in [LV09] requires over 5 MB of data memory.

Trace elements targeted for acceleration include basic blocks, sequences of forward-jumping basic blocks [CKP+04], traces which support backwards jumps and multiple exits [BPFC13, BC10], or traces with multiple paths [NMIM12]. These elements represent frequent sequences of instructions (which some approaches compile into custom instructions), iterations of frequently executing loops, or common paths of execution throughout the iteration of a loop.

2.1.2 Accelerator Structure

An accelerator is typically composed of interfaces to the host system, and a region in which operations are actually performed. Operations can be implemented through fine-grained bit-level logic, or via coarse-grained FUs such as adders or multipliers, or even more specific functions such as square root modules or Multiply Accumulate (MAC) blocks [LV04, ECF96, GSB+99]. Commonly, accelerators contain an array of such FUs. In the latter case, the term Coarse Grained Reconfigurable Array (CGRA) is often used. Three major features characterize an array: the number and type of FUs, their arrangement, and their interconnection capability. Figure 2.1 provides two examples of FUs arranged into different array topologies. A comprehensive look at array architectures is given in [Har01].

Figure 2.2: Different types of interfaces between a host processor and co-processor [CH02]

Using a co-processor implies an exchange of information between it, the GPP, and in some cases a shared or main memory. If an inefficient interface is used, the introduced overhead may negate the speedups obtained by hardware acceleration. Figure 2.2 shows common possible methods of attaching a co-processor to a host system.

Some approaches couple the accelerator directly into a processor's pipeline, which makes it easier to operate on values directly in the processor's register file [BRGC08, NMM+06, CKP+04]. This allows for a very low-overhead interface between the accelerator and the host processor. However, such approaches are limited by reduced portability and potentially require a synchronized co-design effort as GPP architectures themselves evolve. Placing new logic in the processor's pipeline could also introduce critical path delays. Additionally, memory accesses by the co-processor become more difficult to implement, because it is embedded into the pipeline, providing no obvious manner to interface with data memories or allow concurrent accesses.

Alternatively, co-processors may be loosely coupled as peripherals [SLL+00, BPCF13, LV04], communicating configuration information, input data, and results using interfaces such as buses, dedicated links or shared memory schemes [PPM09]. Although loose coupling may introduce larger overheads in the communication with the GPP, development of the accelerator does not require intrusive modifications to the host processor. Because this type of accelerator is an external peripheral, design becomes less constricted. Shared memories or DMA can be used to allow for more sophisticated memory access by the accelerator.
Shared memories or DMA This type of coupling allows the reconfig- hardware does allow for a great deal of urablecan be logic used to to allow operate for more for sophisticateda large num- memorycomputation access by the accelerator. independence, by shifting ber of cycles without intervention from large chunks of a computation over to the the host processor, and generally permits reconfigurable hardware. the2.1.3 host processor Accelerator and Functional the reconfigurable Units Finally, the most loosely coupled form logic to execute simultaneously. This re- of reconfigurable hardware is that of ducesFU theCapabilities overheadThe incurred operations by supported the use by thean FUs external depend stand-alone on the targeted processing CDFGs. For unit of theinstance, reconfigurable the approach logic, shown compared in [NMIM12] to uses a an[Quickturn array with no 1999a, support for 1999b]. multiplication, This type pro- of reconfigurable functional unit that must reconfigurable hardware communicates communicatehibiting the acceleration with the host of CDFGs processor containing each eveninfrequently one multiplication. with Tightly a host coupled processor arrays are (if timeusually a reconfigurable smaller, containing “instruction” simpler FUs is used. supportingpresent). integer arithmetics This model and logic, is similar but usually to not that Onememory idea that operations. is somewhat Larger arrays of a hybrid are typically be- looselyof networked coupled, employing workstations, more sophisticated where pro- tween the first and second coupling meth- cessing may occur for very long periods FUs, which support more complex operations, potentially allowing for a wider scope of accelera- ods, is the use of programmable hardware of time without a great deal of commu- withintion. Thisa configurable however, also dependscache [Kim on the typeet al. and sizenication. of region In of code the to case accelerate of the [CH00]. Quickturn 2000]. In this situation, the reconfigurable systems, however, this hardware is geared


Although arrays containing larger numbers of FUs with greater capabilities are more flexible, significant hardware overhead could be introduced depending on the operations to support. Some arrays are composed of numerous Arithmetic and Logical Units (ALUs) with dedicated register files, but even these cases do not include support for division operations and floating-point arithmetic [RBM+11]. Some implementations find a compromise with heterogeneous arrays, using larger numbers of simple FUs and a small number of more dedicated resources, arguing that complex operations are less frequent [NMIM12].

FU Arrangement The manner in which data flows to/from the array is a function of the layout of its FUs. The layout is based on how we wish to exploit the latent parallelism of the accelerated code. The most representative approaches relative to the proposed work employ either row- or mesh-based topologies of word-level FUs.

A mesh-type arrangement is shown in Fig. 2.1a. This type of arrangement is characterized by typically homogeneous FUs laid out in richly interconnected matrices. FUs can send and receive operands and results from the four nearest neighbours, and some FUs might contain their own register files. This arrangement and connectivity does not enforce any data directionality; data can usually travel in any direction in the array. Connectivity can be increased by also connecting FUs diagonally or with long lines for distant connections. In [BGDN03], a performance analysis of mesh arrays is done as a function of connectivity and FU capabilities.

Figure 2.1b shows a row arrangement for two types of FUs. FUs in one row execute concurrently and propagate data downwards to adjacent rows. This type of arrangement closely mimics the layout of CDFGs themselves, enforcing a feed-forward connectivity. Data can then be fed back to the first row, if the array is meant to be used as a loop accelerator [CHM08, BPCF13], or the connections may be strictly feed-forward. Between rows, there can be a crossbar (all-to-all) connection, or simpler multiplexers. Some arrays also allow connections spanning multiple rows [BPCF13, CBC+05, NMM+08]. This allows for greater applicability, as typical CDFGs are not composed solely of nodes whose connecting edges have a length of 1, i.e., connect to adjacent nodes. There are also row arrangements with a single row, whose configuration is changed every cycle, thus repeating a sequence of configurations composing an iteration [GSB+99, CFF+99].

Mesh arrangements have an easily scalable structure and, because of the homogeneous structure and capabilities of the FUs, are more generic than row arrangements. The latter tend to be more heterogeneous, distributing processing capabilities through stages where the designers determine them to be necessary. Meshes are usually employed for loosely coupled arrays, while the data directionality of row-based arrays seems to be more appropriate for a tight integration into processor pipelines, although they can also be used as loosely coupled loop accelerators [CHM08, BPFC13].

One could argue that mesh arrays appear more flexible than row topologies. But better scalability and larger FU complexity do not ensure that more code will be successfully mapped to the hardware. Speedups depend on the maximum ILP that can be found in the traces, on the type of accelerated trace, the trace detection method, and on the efficiency of the translation of traces into array configurations. Smaller row-based architectures geared towards streaming execution can be sufficient to successfully accelerate the target CDFGs, depending on the number and type of operations, ILP, graph depth and especially the complexity of the connections.

There are also arrays that employ a mesh arrangement of much more sophisticated processing units, like small processors, which is a different category of heterogeneous systems. Acceleration can be achieved by having each core execute independent iterations of a loop, or multiple unrelated loops can be executed in tandem.
For these cases, the notion of interconnection is intrinsically different, as data is exchanged at a higher level. In [Ima12], many processors communicate by local handshaking, and each is controlled by a dedicated instruction stream.

FU Interconnections As mentioned, it is the possibility of controlling data exchange that grants great flexibility to CGRAs, more so than the total number of resources or their complexity. Interconnection capability is especially influential for the performance of an array.

A richer interconnection scheme typically allows for a better use of the available computing resources. However, the richer the interconnect, the more configuration information will have to be generated and, depending on the approach, stored in memory to be loaded prior to execution, leading to increased memory requirements. Also, complex connection grids like crossbars and multiple buses can introduce critical path delays and require considerable resources. Meshes are more efficient if they are precisely tailored for the required operations and connectivity of the CDFGs to execute. Quantitative analysis of a reference set of CDFGs can be employed to determine these requirements [SNS+13, YM04, NMM+06, NMM+08], but might lead to an over-fit, making it hard to predict performance for other CDFGs.

Existing approaches experiment with different interconnections, often introducing small design changes whose effect can be difficult to quantify, because the mapping success rate or achieved speedup also depends on the type of CDFG being mapped, and on other aspects of the array. Different interconnection topologies are evaluated in [WKMV04, VEWC+09, MSB+07].

Because mesh arrangements are homogeneous, the same interconnection capabilities will be found throughout the mesh. This may prove to be excessive, because not all nodes of a CDFG require the same amount of connectivity. As such, designing a mesh interconnect to satisfy the maximum possible connectivity can lead to under-utilization of the available connections. Also, node connectivity in a CDFG tends to diminish with depth, i.e., operands are consumed as the graph executes, meaning there is a larger number of nodes and connections in the earlier levels. Row topologies mitigate this due to the directionality of the connections. Consider a square mesh array and a square row-based array. FUs of the mesh array are connected to their 8 neighbours bidirectionally, while FUs of the row array are connected through inter-row crossbars. It is evident that as the arrays scale, all things being equal, the amount of connections increases more rapidly for the mesh case. Also, for a row-based topology, connection capabilities can be tailored on a per-row basis, further decreasing resource usage at the cost of decreased applicability.

2.1.4 Accelerator Memory Access

Support for memory access is a requirement to achieve large speedups. Potential speedups are higher for frequently executing kernels that contain numerous memory accesses, since there might be large amounts of latent data parallelism. Ideally, the array should be able to access all the application data and perform as many concurrent accesses as possible. However, support for memory accesses in these scenarios is usually an issue. The data needs to be shared efficiently between host processor and accelerator, which might imply the use of shared caches or data-transfer steps to synchronize data. Also, the array needs mechanisms to perform memory accesses, ideally several in parallel, which means a sophisticated memory layout is required.

If local memories are used for the array [KLSP11, PPM09, SLL+00], the issue of synchronization with the GPP arises, as large amounts of data may need to be moved between the local memories and the main data memory, and the overhead would negate parallelism gains. To avoid this, DMA access can also be used, keeping in mind coherency between accelerator memory, main memory and the processor data cache. Despite the overhead, using local memories remains the most straightforward method. It is employed, for instance, by commercial HLS tools [Cal], which instantiate local memories for the automatically generated hardware modules. The size of the memories is determined by source code analysis of explicitly declared static array dimensions.

Shared memories can be accessed by multiple master devices. One approach is to populate the memory at runtime prior to array execution. It thus behaves as a local memory which the processor accesses directly [SLL+00], unlike local memories built into the accelerator, which would instead fetch data through DMA, for instance. This avoids the need to include more complicated memory access logic within the array so that it can populate its own local memories. An alternative is to create a system which maps accesses to a given address range onto a shared memory, avoiding lengthy transfer steps. The shared memory can be used to accommodate the entire address space of the application [PFC13, KHC11, LV09] or just a defined range [KHC11]. Data produced by the array is placed back into the shared space and then accessed by the processor. However, this involves determining appropriate memory ranges. Also, if the data accesses have very little locality, the chosen range will be insufficient, as the array will frequently try to access data outside the range. It is not straightforward to provide support for a shared memory onto which several non-sequential ranges are mapped, as this might imply compiler or linker modifications.

In [KHC11] a shared cache is used, supporting any one runtime-defined range. The shared memory is placed at cache level and shadows the processor cache or main memory. The processor transparently accesses either the cache or the shadow memory, depending on which has the up-to-date data. Writing produced data back to main memory is not required. However, if data required by the array prior to execution is not in the shadow memory, a transfer step is needed. Also, this approach employs a custom cache architecture for the processor, and does not support a closed third-party processor with integrated caches.

Regardless of memory layout, concurrent accesses are required to fully realize the parallelism potential. If not, waiting for sequential accesses will introduce additional delays that negate speedups.
To avoid this, the chosen memory architecture can be further specialized. For instance, in [LV09], a dual-port memory is used. One of the ports is reserved for the processor, while the second is used by the array, allowing for at most one access per cycle when executing on the accelerator. In [KLSP11, DGG05] a modulo scheduling approach is optimized so that conflicting memory accesses are not issued simultaneously. Additionally, the work in [KLSP11] studies the distribution of data through several local single-ported memory banks accessed by a generic mesh array. Data access is first profiled and the arrays are interleaved throughout several local memory banks. Similarly to the shared memory scenario, this approach requires a compile-time analysis to determine an adequate mapping of arrays onto the several memory banks, based on observing array sizes and access frequency. Despite this, it is an efficient way to provide multiple concurrent accesses to data, as more memory banks can be added. Alternatively, several single-port memories can be used, where each corresponds to a sub-range of addresses. However, this means that concurrent accesses to addresses in the same range still collide. Data could be copied into several memories, but this would introduce synchronization issues. By distributing data into different banks, greater concurrency is possible, but this does not guarantee that conflicting accesses will not occur, i.e., the array may have to access several data items in the same memory.

Another issue is the supported access pattern. The simplest approach is the use of address generator hardware which outputs addresses with a specific stride [LV09, AD11]. Also, the range of addresses is typically determined at compile time. This precludes accessing heap-allocated data or random access patterns. Some HLS approaches only support analysis of memory accesses using explicit array-based syntax with a stride that can be determined statically. In [BRGC08] access to random addresses is supported by giving the array's load/store FUs the ability to receive results from any other FUs to use as addresses. A source-to-source transformation is presented in [AMD13] to disambiguate memory accesses for HLS approaches, thus better predicting accesses.

Approaches with local memories, shared memories, or other such schemes are appropriate for loosely coupled arrays. It is more difficult to support memory access in tightly coupled arrays. However, in [NMIM12], up to 1 store operation is supported by using the writeback stage of the processor pipeline. In [BRGC08], concurrent load accesses are supported, but implementation details are not given.
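As a concrete example of one of these mechanisms, the sketch below shows an inferred true dual-port memory in the general style of [LV09] and of dual-port local-memory sharing schemes: one port remains connected to the processor's local memory bus, while the other is handed to the accelerator, so both masters see the same data without transfer or synchronization steps. The module and port names are illustrative, and byte enables, arbitration and bus adapters are omitted.

// Sketch of a shared true dual-port memory: port A stays connected to the
// GPP's local memory bus, port B is given to the accelerator, so both can
// access the same program data without copy or synchronization steps.
// Hypothetical module; a real system adds byte enables and bus adapters.
module shared_dp_mem #(
    parameter ADDR_W = 12,                 // 4K words of 32 bits
    parameter DATA_W = 32
) (
    input  wire              clk,
    // Port A: processor side
    input  wire [ADDR_W-1:0] a_addr,
    input  wire [DATA_W-1:0] a_wdata,
    input  wire              a_we,
    output reg  [DATA_W-1:0] a_rdata,
    // Port B: accelerator side
    input  wire [ADDR_W-1:0] b_addr,
    input  wire [DATA_W-1:0] b_wdata,
    input  wire              b_we,
    output reg  [DATA_W-1:0] b_rdata
);
    reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

    always @(posedge clk) begin            // port A: one access per cycle
        if (a_we) mem[a_addr] <= a_wdata;
        a_rdata <= mem[a_addr];
    end

    always @(posedge clk) begin            // port B: one access per cycle
        if (b_we) mem[b_addr] <= b_wdata;
        b_rdata <= mem[b_addr];
    end
endmodule

If both ports are instead granted to the accelerator while the processor is stalled during accelerated execution, two concurrent accesses per cycle become possible, which is the kind of concurrency required once loop iterations are pipelined.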

2.1.5 Accelerator Execution Model

How accelerator execution is controlled depends on the arrangement of FUs and how they are interconnected. This determines how data is fed to and moved between FUs. A typical processor follows a Single Instruction Single Data (SISD) model, while these types of accelerators exploit the Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD) paradigms. The distinction is blurred, as a single stream of instruction words which configure several or all FUs in the array is in fact providing multiple instructions, one per FU, as in a VLIW processor. But, from the point of view of the GPP, which uses the array as a peripheral or custom pipeline FU, a single instruction stream is issued. If the FUs themselves are active components, autonomously fetching instructions from memory, this can be considered a MIMD implementation.

Accelerators containing ALUs can be considered MIMD, since the ALUs in the array will perform different functions. Even if the entire array is configured by a single word prior to execution, or even if a command is given per cycle to configure the ALUs, unless all ALUs are performing the same operation over multiple data streams, it cannot be considered a single instruction stream. We can instead look at how CDFG operations and their data flow are implemented in the arrays.

Loop Acceleration vs. Sub-graph Acceleration Array execution can target cyclical or acyclical CDFGs. Cyclical CDFGs refer to graphs with backwards edges to earlier nodes. That is, a graph represents a complete loop iteration, which loosely coupled arrays, mesh- or row-based, usually target. The backwards edges determine the Initiation Interval (II) of the graph. An array can execute an iteration of a cyclical graph in a number of cycles equal to the II. That is, one graph iteration is completed every II cycles. Arrangements with multiple rows rely on mimicking the data directionality of the CDFGs, and mesh arrangements on scheduling loop iterations, under the existing resource and connectivity restrictions.

Acyclical execution means that data is not sent back to previous nodes, which applies when considering smaller sub-graphs which represent a sequence of instructions in the binary code (or trace). They may be part of a frequent loop, but do not constitute the entire loop body. A tightly coupled accelerator serving as a custom GPP pipeline unit is the most appropriate for these cases.
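The role of the II in the achievable acceleration can be made explicit with a standard calculation, given here as general background rather than as the model of any specific surveyed design. For N iterations, a schedule depth of D cycles and an initiation interval of II cycles, the time spent executing on the array is approximately
\[ T_{\mathrm{array}}(N) \approx D + (N - 1)\cdot II, \]
so that, for large N, the cost per iteration tends to II cycles. The II itself is bounded from below by the recurrences (the backwards edges) and by resource conflicts,
\[ II \geq \max\left( \max_{c\,\in\,\mathrm{cycles}} \left\lceil \frac{\mathrm{delay}(c)}{\mathrm{distance}(c)} \right\rceil,\ \max_{r\,\in\,\mathrm{resources}} \left\lceil \frac{\mathrm{uses}(r)}{\mathrm{count}(r)} \right\rceil \right), \]
where delay(c) and distance(c) are the total latency and the iteration distance of a dependence cycle c, and uses(r) and count(r) are the number of operations requiring resource r per iteration and the number of available instances of r. This is the usual minimum-II bound employed by modulo scheduling.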

Iterator Control For loosely coupled loop accelerators, the number of iterations to perform on the array can be a compile-time constant, or can be determined at runtime by the input data. Typically, HLS methods are incapable of inferring complex dependencies between data and control, i.e., dynamic data-control dependencies. As such, they usually support the translation of loops with statically defined iterators. Binary translation methods suffer from the same problem: static binary analysis contains even less information about loop iterations, and trace profiling does not ensure that the loop behaviour will remain constant indefinitely. Arrays with more sophisticated control keep track of iterator values or evaluate termination conditions for iterative execution. Tightly coupled arrays which are activated explicitly by the processor on a per-iteration basis do not suffer from this issue. Since they are not explicitly loop accelerators, some of the sequential code responsible for iteration control will still be executed on the processor. This is partially related to support for conditional execution on the array. CDFGs with conditional paths have to be broken down, leading to smaller accelerated CDFGs. In [NMIM12], however, the array has support for conditional execution, accelerating frequent sequences of basic blocks.

Execution in Mesh Based Arrays In a mesh arrangement with homogeneous FUs, a single configuration command could suffice to select an operation to be performed per FU as well as to define the interconnects of the entire array. Execution then starts, and the same calculations are repeated, which corresponds to iterative execution of a CDFG. This assumes the mesh has sufficient resources onto which every CDFG operation can be mapped. In contrast, ALUs in a smaller mesh can perform a different operation every cycle. A CDFG is split into stages which execute cyclically on the array. This requires instructions, and possibly data, to be fed to the array at every stage. In [SLL+00] it is the host processor that performs this task, while in [PPM09, But07] there are local instruction memories residing in the array. By folding the schedule of operations on mesh arrays, execution of parts of the CDFG belonging to different iterations is overlapped, as shown in Figure 2.3. That is, modulo scheduling is commonly employed on mesh arrays to achieve greater speedups and better resource utilization per cycle.
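The per-cycle control that such time-multiplexed execution requires can be sketched as follows: a small context memory holds one configuration word per schedule slot, and a counter that wraps at the II replays those words while the loop runs. This is a generic illustration with hypothetical names and widths, not the control scheme of any specific design discussed in this chapter.

// Sketch of per-cycle configuration for a time-multiplexed (single-row or
// modulo-scheduled) array: a context memory is indexed by a slot counter
// that wraps at the initiation interval, replaying the same sequence of
// configuration words for every loop iteration. Names are illustrative.
module context_sequencer #(
    parameter II    = 3,                // initiation interval in cycles
    parameter CTX_W = 16                // width of one configuration word
) (
    input  wire             clk,
    input  wire             rst,
    input  wire             run,        // asserted while the loop executes
    // context memory written by the host/tools before execution starts
    input  wire [1:0]       ctx_waddr,
    input  wire [CTX_W-1:0] ctx_wdata,
    input  wire             ctx_we,
    output reg  [CTX_W-1:0] ctx_out     // drives FU opcodes and mux selects
);
    reg [CTX_W-1:0] ctx_mem [0:II-1];   // one configuration word per slot
    reg [1:0]       slot;               // current schedule slot (0 .. II-1)

    always @(posedge clk)
        if (ctx_we) ctx_mem[ctx_waddr] <= ctx_wdata;

    always @(posedge clk) begin
        if (rst || !run) begin
            slot    <= 2'd0;
            ctx_out <= {CTX_W{1'b0}};
        end else begin
            ctx_out <= ctx_mem[slot];
            slot    <= (slot == II - 1) ? 2'd0 : slot + 2'd1;
        end
    end
endmodule

The same mechanism underlies single-row arrangements whose configuration changes every cycle: the context memory effectively replaces the per-cycle instruction stream that a host processor would otherwise have to supply.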


Figure 2.3: In [KLSP11], a memory-aware modulo scheduling is performed on mesh arrays. On the left, a partially complete mapping for a graph of 8 nodes onto 4 FUs. On the right, several superpositions of the complete schedule.

The systems proposed in [PPM09, Ima12] are examples of mesh-based MIMD approaches. In [PPM09], the array contains FUs with dedicated register files which share data amongst themselves. Multiples of 4 FUs can be grouped into a sub-array controlled by a single instruction stream. Sub-arrays communicate amongst themselves at known intervals, and columns of the array can access individual scratch-pad memories, allowing each sub-array to access multiple data. The multi-processor array of [Ima12] contains small processors, each with an instruction memory, accessing shared data memories. In [SLL+00], entire rows or columns of FUs in a mesh are configured by the same configuration word, but each FU operates on a different input datum. In [MLM+05], a mesh array contains four enhanced units which together form a VLIW. Sequential portions of code are executed in the VLIW core, which can feed data directly to the rest of the mesh.
Execution in Row Based Arrays In row arrays, data produced by the last row can be fed back to the first row, as mentioned previously. Execution thus repeats until a termination condition is met, and the array functions as a loop accelerator. In general, loop accelerators tend to be loosely coupled, as they execute several iterations without intervention of the host processor.
approach, Multiples we compare of 4 FUs can be grouped into a our conflict-avoidance mapping with the hardware approach that uses DMQ to reduce hand,sub-array since row-based controlled arrays by tend a to single be more instruction heterogeneous stream. and imply Sub-arrays data directionality, communi- they basically behave as a custom ALU. As such, smaller arrays can be tightly coupled to augment ACM Transactions on Design Automation of Electronica processor Systems, pipeline, Vol. 16, No. and 4, Article are then 42, Pub. controlled date: October by instructions 2011. fetched by the processor. Note that 16 2.1 Overview 29 both can accelerate the execution of loops. If the tightly coupled array is fed the same instruction numerous times in a row, it is essentially quickly executing iterations of a loop whose instructions were transformed into a single-instruction configuration for the array. If the rows buffer their output stages, this creates a multi-cycle array. If the interconnection scheme is sophisticated enough so that outputs of any row can be fed back to any row, this arrange- ment can easily be used to implement pipelined execution. The amount of required backwards connectivity depends on the IIs of the loops. In contrast, the array can be completely combina- tional, creating a single-cycle array. If the design is efficient so that no critical path delays are introduced, large speedups can be attained by executing up to tens of instructions per cycle. In [BPCF13, PFC13], a loosely coupled row-based array contains single operation FUs. A configuration word is given prior to execution, and the array then executes iteratively for a runtime defined number of cycles. Some cases of tightly coupled row-based arrays are also an example of SIMD execution: the array is used a a custom FU in the processor pipeline, as such, a single in- struction in the data stream configures the array which fetches multiple operands from the register file [BRGC08, CBC+05].
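To make the benefit of pipelined execution concrete, the small C fragment below applies the usual cycle-count model for a loop-pipelined array: the first iteration takes the full pipeline depth, and every additional iteration completes after one initiation interval. The specific numbers (trace length, depth, II, iteration count) are illustrative assumptions and do not describe any of the cited designs.

#include <stdio.h>

/* Rough cycle-count model for a loop-pipelined array: the first iteration
 * takes 'depth' cycles to fill the pipeline, and each further iteration
 * completes every II (initiation interval) cycles. */
static unsigned pipelined_cycles(unsigned iterations, unsigned depth, unsigned ii)
{
    if (iterations == 0)
        return 0;
    return depth + (iterations - 1) * ii;
}

int main(void)
{
    /* Hypothetical numbers: a 20-instruction trace scheduled onto an array
     * with a pipeline depth of 6 stages and an II of 2, executed for 1000
     * iterations, versus single-issue software at ~1 cycle per instruction. */
    unsigned iters = 1000, trace_len = 20, depth = 6, ii = 2;
    unsigned hw = pipelined_cycles(iters, depth, ii);
    unsigned sw = iters * trace_len;
    printf("hw=%u cycles, sw=%u cycles, speedup=%.2fx\n", hw, sw, (double)sw / hw);
    return 0;
}

With these assumed values the model gives roughly a 10x reduction in cycles, which is why low IIs and deep iteration counts are repeatedly emphasized throughout this chapter.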

2.1.6 Accelerator Programmability and Compilation

While the execution model of the array relates to the method of data flow, programmability is related to the manner in which control is provided to the array. This includes control signals of the array, and how this control is generated, i.e., how application information is compiled into array configurations. The type of control that needs to be provided is directly related to the arrangement of FUs and interconnections, the type of interface to the GPP and their execution model. Considering the structure of CGRAs, there are two types of information required to control execution: defining the operations of the FUs and establishing interconnections between FUs. The exact format of control information depends on the capabilities of the FUs and the interconnections, as well as the execution model of the design. Controlling accelerators which are tightly coupled is done in some approaches by embedding custom control instructions in the application binary, i.e., through instruction set extension. The accelerator is thus controlled like any other FU in the GPP's pipeline [BRGC08]. Custom instructions can result from either offline [CBC+05] or online binary translation [RBM+11, CHM08]. Mesh-type arrays require several configuration stages per iteration, since each FU performs different operations throughout an iteration. Datapath-type arrays for pipelined execution require very little control. Entire CDFGs are mapped to the array and produce data at given intervals after activation. For instance, in [PFC13], additional code which the processor executes to communicate with the array is automatically generated offline. The array contains single-operation FUs and only the interconnects are configured prior to execution. In [LV09], the binary is modified offline so that the processor initiates communication with the reconfigurable area. The accelerator may also be controlled directly by the GPP by issuing control instructions [SLL+00], or the accelerator itself may actively fetch instructions from local memories [MO98].
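The two kinds of control information mentioned above (per-FU operations and inter-FU routing) can be pictured as a per-configuration record like the C sketch below. The field names and sizes are purely illustrative assumptions, not the format of any of the cited designs.

#include <stdint.h>

/* Generic sketch of an array configuration: one opcode per FU and one
 * multiplexer select per FU input. Sizes are illustrative only. */
#define MAX_FUS        32
#define INPUTS_PER_FU   2

struct fu_control {
    uint8_t opcode;                 /* operation performed by this FU        */
    uint8_t mux_sel[INPUTS_PER_FU]; /* which FU output (or input register)
                                       feeds each of this FU's operands      */
};

struct array_config {
    uint8_t num_fus;                /* FUs actually used by this CDFG        */
    uint8_t num_inputs;             /* operands fetched from the GPP         */
    uint8_t num_outputs;            /* results written back to the GPP       */
    struct fu_control fu[MAX_FUS];  /* per-FU operation and routing          */
};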

Generating control information is unlike compiling a program for a sequential processor. Lack of a straightforward compilation methodology for these heterogeneous architectures greatly hin- ders their applicability and deployment. Existing binary-acceleration-based architectures still re- quire the use of custom tools, after or during compilation [NMIM12]. Alternatively, on-chip translation methods are employed [LV09, RBM+11], which hide this process from the developer. Translating CDFGs into these arrays entails finding a free and compatible FU for each oper- ation, such that all operations are mapped without conflict [APTD11, ATPD12]. Pre-processing steps can be performed over the CDFG, such as constant propagation, analysis of WAR and RAW dependencies and elimination of store-load operations over data that is only used locally within the array. In [SGNV05], decompilation techniques are used to recover memory and loop related information from application binaries, providing more efficient binary translation.
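The core of the mapping step described above can be summarized, under heavy simplification, as the greedy placement routine sketched below in C: each CDFG operation is assigned to a free FU that supports it, and mapping fails if no conflict-free placement exists. Real mappers such as those in [APTD11, ATPD12] additionally handle routing, slack and modulo constraints; the enum values, structures and function names here are illustrative assumptions only.

#include <stdbool.h>
#include <stddef.h>

enum op_kind { OP_ADD, OP_MUL, OP_LOAD, OP_CMP };

struct fu {
    enum op_kind supported; /* single-operation FU, as in several designs */
    bool busy;              /* already holds a CDFG node                  */
};

struct node {
    enum op_kind kind;
    int placed_fu;          /* index of the FU chosen, or -1              */
};

/* Greedy, conflict-free placement of CDFG nodes onto FUs. */
static bool map_cdfg(struct node *nodes, size_t n_nodes,
                     struct fu *fus, size_t n_fus)
{
    for (size_t i = 0; i < n_nodes; i++) {
        nodes[i].placed_fu = -1;
        for (size_t j = 0; j < n_fus; j++) {
            if (!fus[j].busy && fus[j].supported == nodes[i].kind) {
                fus[j].busy = true;          /* claim the FU              */
                nodes[i].placed_fu = (int)j;
                break;
            }
        }
        if (nodes[i].placed_fu < 0)
            return false;    /* no compatible free FU: mapping conflict   */
    }
    return true;
}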

2.2 Representative Approaches

The following sections summarize representative state-of-the-art approaches to constructing systems capable of autonomous runtime acceleration. The most relevant approaches are the Warp processor [LV09], the AMBER approach [NMIM12], the CCA [CBC+05] and the DIM array [BRGC08], as these consist of complete binary acceleration systems. Later sections of this chapter also summarize some additional works. In [Wol03], a general view of hardware/software co-design is presented, a comprehensive summary of CGRA architectures can be found in [Har01], an extensive look at several aspects of reconfigurable architectures is shown in [CH02], and [Cho11] is a survey of CGRA architectures, some of which are also presented here. The approaches presented here stand out as the most recent publications similar to the proposed approach. Table 2.1 summarizes architectural and methodological information for the following approaches, which more closely resemble this work.

2.2.1 Warp Processor

The Warp processor shown in Fig. 2.4 is an FPGA-based run-time reconfigurable system based on binary decompilation [LV09]. Frequently executing loops are first detected during execution [GRV05]. Once profiled, the running binary is decompiled into high-level structures which are mapped into a custom FPGA fabric by custom tools running on an additional processor. The FPGA is modeled with a simpler interconnect structure and resource layout, to facilitate runtime P&R. Once the automatically generated hardware is ready, software execution is migrated for the identified sections by modification of the program binary, and operands are fetched from memory. The system is fine-grained and loosely-coupled. Only small loops in the running program are detected. Targeting more complex loops would greatly increase the effort of on-chip CAD and mapping time. Floating-point operations, pointer operations or dynamic memory allocation are not supported.


Figure 2.4: The MicroBlaze-based Warp processor system. An additional processor performs runtime synthesis of the accelerator hardware through profiling [LV09].

2.2.2 ADEXOR

In [NMM+06, NMM+08], a profiler is used together with a sequencer that stores microcode for the developed accelerator, which is coupled to a MIPS processor pipeline. Its execution is initiated by comparing the current PC with stored information. The accelerator consists of a reconfigurable unit, controlled by configuration bits to perform a given operation. This unit, shown in Fig. 2.5, contains a fixed number of FUs which support integer arithmetic. Up to 1 store operation can be performed. Loads, division and multiplication are not supported. It is configured whenever a basic block, detected by the runtime profiler, executes over a given number of times. Further work produced a heterogeneous reconfigurable unit [MGZ+07]. The flexible interconnection scheme introduced a large multiplexer delay, but many of the configurations were similar. For this reason, the architecture evolved to a more coarse-grained alternative with less configuration overhead. In [NMIM12] support for conditional execution was implemented. Sequences of forward-jumping hot basic blocks are used to compose single-entry multiple-exit traces. Basic blocks cannot be linked across function returns, indirect branches or branch and link operations. When constructing a trace, if execution is equally frequent through both the taken and non-taken directions of a branch delimiting a basic block, both paths are included into the trace. Only short forward jumps are considered, as long jumps would imply support for executing a large number of instructions on the accelerator. If one of the branch directions is heavily biased, only it is included in the trace. The traces are then transformed into configurations for the accelerator. One configuration requires an average of 615 bits.
The total amount of configuration memory required is reduced by performing similarity analysis on the generated configurations. During an offline phase, the application binary is profiled through execution in a SimpleScalar simulator and then modified with custom instructions which load accelerator configurations at runtime. An average speedup of 1.87× was achieved for the MiBench suite, versus a single-issue MIPS processor.

Figure 2.5: An array of FUs which is coupled to the processor pipeline in [NMM+08]. In [NMIM12] it is enhanced with conditional execution.

2.2.3 Configurable Compute Accelerator (CCA)

In [CKP+04, CBC+05] a method for transparent instruction set extension is presented. An ARM pipeline is augmented with a CCA as shown in Fig. 2.6. The CCA is composed of an array containing two types of ALUs. One is capable of arithmetic and logical operations, the other can only execute logical operations. The array is composed of alternating rows of each type of ALU. The number of rows and their width was determined by profiling data extracted by simulation of 29 applications. Candidate traces were identified and frequently executing ones had more weight in deciding the CCA structure. The CCA has 4 inputs and 2 outputs, and several combinations of depths and widths were tested. The depth determines the maximum supported critical path, and the row widths the maximum ILP. The CCA does not support memory, barrel-shift, multiplication, division or branch operations. Crossbars connect adjacent rows. Configuration is controlled by setting ALU connections and 4-bit opcodes. Despite having rows of ALUs, execution on the CCA is not pipelined, and interrupts are not supported. The presented approach does not require modification of source code or executing binaries. CDFG discovery can be performed online or offline. Online discovery is based on the runtime trace. This method allows for detection of traces crossing control flow boundaries (as performed by the trace cache units of superscalar processors). Offline discovery is performed at compile time, and marks the binary with special instructions delimiting the region of code to translate into CCA instructions. At runtime, the processor's instruction stream is altered to use the CCA. This can be done in the trace cache, if one is being used, or in the pipeline's decode stage. Results for 3 variants on these design options are presented. Evaluation was performed using the SimpleScalar simulator with 29 benchmarks. A CCA of depth 4 was shown to be able to execute 82% of the candidate graphs, and achieved an average speedup of 1.26× for 11 benchmarks versus an ARM 4-issue processor [CKP+04]. In [CHM08], the same authors address binary portability for hardware accelerated systems. Applications developed to utilize a given accelerator might not be compatible with future hardware implementations. The authors propose a virtualization module that monitors the instruction stream, and generates configurations and control for a given accelerator to address this issue. This way, standard binaries can transparently utilize any hardware accelerator. To validate this, a loop accelerator architecture which executes modulo-scheduled loops is employed.

Figure 2.6: The CCA array is coupled into the processor pipeline. Execution is shifted towards the CCA after generating configurations for offline delimited regions of code [CBC+05].

The accelerator integrates the CCA as a functional unit. It was designed based on profiling data, by determining the required resources to achieve the best possible speedup for the candidate loops. The accelerator supports integer, single- and double-precision floating-point operations, as well as 16 address generators for load operations, and is composed of reconfigurable heterogeneous FUs. Configurations for the CCA hardware are generated by binary translation of the loops to accelerate. A thorough explanation of the steps of the translation mechanism is given. Scenarios in which one or more steps are either performed online (by the virtual machine) or offline (through static compilation steps) are analyzed. A mean speedup of 2.66× is achieved for 38 benchmarks versus a single-issue ARM, using the chosen hybrid combination of online/offline binary translation without compromising binary portability.

2.2.4 Dynamic Instruction Merging (DIM)

The DIM Reconfigurable System [BRGC08] proposes a reconfigurable array of FUs in a row-based topology supported by a dynamic binary translation mechanism. The array is tightly coupled to the processor, with direct access to the processor's register file, as shown in Fig. 2.7. The DIM array is composed of equal rows of heterogeneous FUs, some of which are ALUs supporting arithmetic and logic operations. There can be as many concurrent memory operations as the number of available memory ports. Floating-point and division operations are not supported. Sequences of instructions are transparently mapped from a MIPS processor to the array, which executes the custom instructions in a single cycle. A speculation mechanism enables the mapping of units composed of up to 3 basic blocks. The instruction stream is monitored concurrently with


Figure 2.7: Like the CCA, the DIM array is placed in the processor pipeline. A separate pipeline performs binary translation during execution (adapted from [RBM+11]). execution for candidate basic blocks to be translated into array configurations. Configurations are stored for later use. An average speedup of 2.5× is achieved for 18 benchmarks of the MiBench suite. In [RBM+11], DIM arrays were coupled to SparcV8 processors; using multiple DIM-enhanced processors, a mixed approach of acceleration through ILP and thread-level parallelism was evaluated through simulation.

2.2.5 ASTRO

The ASTRO approach is presented in [LCDW15]. It is based on detection of MicroBlaze instruction sequences by execution profiling on a simulator, followed by synthesis of one loop accelerator per candidate instruction sequence. The accelerators are one-to-one translations of CDFGs into pipelined data-paths which execute pipelined loop iterations. They are coupled to the MicroBlaze via a peripheral bus, and migration of execution is performed by monitoring the MicroBlaze's instruction address at runtime. ASTRO focuses on maximizing memory access parallelism. Dynamic memory access analysis is performed to determine disjoint regions of access. Memory accesses within the hot regions are grouped into partitions based on access dependencies. Each partition is assigned a customized cache. The analysis also deals with data hazards. With this information, a tailored Block RAM (BRAM) based multi-cache system is created per-accelerator, allowing for efficient exploitation of data parallelism. The cache system for the accelerators may require, at least, a partial cache invalidation after accelerator execution, leading to overheads. An average ASTRO accelerator with a multi-cache network requires 20400 Lookup Tables (LUTs) and 5900 Flip Flops (FFs) on a Virtex-5 device. For 10 benchmarks from MiBench and SPEC2006, the geometric mean speedup achieved was 7.06× versus software-only execution. The authors also conclude that maximizing concurrent accesses leads to approximately a speedup of 2× over single-access accelerators.

2.2.6 Work of Ferreira et al.

In [FDP+14] an approach is presented for acceleration of inner loops detected by runtime binary profiling. Auxiliary hardware monitors the execution stream for frequent backwards branches. De- tected loops are modulo scheduled by on-chip translation tools onto the target accelerator. Either RISC or VLIW binaries can be translated into modulo schedules for the CGRA, with support for RAW/WAR dependency detection. The accelerated traces contain 40 up to 120 instructions. The target accelerator is a CGRA architecture with a programmable interconnect, containing 16 FUs (ALUs, multipliers, and up to two memory units). Floating point operations are not supported but there is support for conditional value assignments. A crossbar connects the FUs and each FU has its own register file. The accelerator is tightly coupled to the main processor and fetches operands from its register file. The proposed approach is compared to several VLIW models. For evaluation, the approach and accelerator targeted a Virtex-6 device. A speedup of 2.0× over a 8-issue VLIW is achieved for 5 benchmarks. The proposed accelerator requires 13000 LUTs and 23 BRAMs.

2.2.7 Morphosys

MorphoSys [SLL+00] consists of an array of Reconfigurable Cells (RCs), loosely coupled to a custom RISC processor, a memory to hold array configurations, a shared memory to exchange data with the array and external memories with program code. The array can access the external memory via DMA, and has direct access to the configuration and shared memories. The layout is a 8x8 mesh, and each row or column is configured by the same 32-bit word. The array is divided intro 4x4 quadrants which can exchange data. Data can be routed to all RCs within the same quadrant. The RCs are homogeneous coarse-grained ALUs augmented with a multiplier, a shift unit and 4 registers, and they operate on 8 or 16 bit values. The processor controls array execution and configuration by issuing special control instruc- tions added to its instruction set. Use of a custom compiler and modification to source code are necessary. These instructions can command the array to execute a row/column or move data between the processor, the array and shared memories. A DMA controller is also used by the processor to load new configurations onto the array. Since each row/column performs the same instruction for different input data, this implementation is an example of the SIMD paradigm. Computations along a row/column may continue while another row/column is being reconfigured. 36 Revision of Related Work × × × × × × 3.20 (geometric) 2.17 (arithmetic) 1.87 (arithmetic) 2.21 (arithmetic) 7.06 (geometric) 2.0 (arithmetic) Methodology Speedup Runtime trace profiling, binary disassembly and circuit synthesis Runtime binary profiling and generation of accelerator configurations Offline profiling, binary modification and configuration generation Compile time subgraph detection; Runtime configuration generation Offline trace detection via simulation; offline synthesis accelerators Modulo- scheduling of binary traces Accelerated Trace Most frequent innermost loops Sequences of basic blocks Single-entry multiple-exit trace with multi-path Sequences forming graphs with 4 inputs and 2 outputs max. Atomic sequences of basic blocks (w/ hazard analysis) Loop traces Memory Access One store operation Tailored cache for multiple parallel accesses 2 Memory units One port for regular patterns Concurrent accesses to random addresses Not supported Supported Operations Logic and fixed point arithmetic (no division or multiply) Integer arithmetic and logic Integer arithmetic and logic; conditional assignments Fixed-point arithmetic and logic Logic and arithmetic (no division or floating point) Logic and integer arithmetic excluding multiply and divide Table 2.1: Characteristics of related approaches Interconnection Scheme Crossbars between neighbour rows Additional lines connect distant FUs Dedicated inter-row connections Global crossbar FPGA-like switch-boxes connect logic blocks FU outputs drive inputs of FUs in any row after their own Crossbar connections for neighbour rows FU Arrangement Rows of heterogeneous FUs in an inverted pyramid shape Rows of single-function units 16 FUs w/local register files Fine grained custom reconfigurable fabric with bit-level logic Matrix of homogeneous rows of heterogeneous FUs Two types of FUs compose homogeneous alternating rows Base Processor RISC/ VLIW MIPS MicroB- laze MicroB- laze MIPS ARM et 05] 14] + + al. DIM CCA Warp [LV09] ASTRO Approach [FDP Ferreira ADEXOR [CBC [BRGC08] [NMIM12] [LCDW15] 2.2 Representative Approaches 37

2.2.8 Additional Related Works

This section briefly presents approaches regarding other aspects of binary acceleration and accel- erator architectures, such as memory access optimizations, and especially loop scheduling. Kim et al. [KLSP11] address the issue of handling memory accesses in mesh type accelerator architectures. They present a scheduling algorithm and framework to map operations, temporally and spatially, to a generic type of mesh, such that memory access conflicts are reduced and IIs are minimized. A CDFG optimization algorithm is also used to reduce redundant memory accesses. A technique for distributing data through several single-ported local memories allows for parallel access to elements of the same data array without conflict. To determine the optimal distribution a static code analysis step is required, which considers factors such as access rates and array sizes. Inner loops are detected and mapped onto a heterogeneous CGRA at runtime via binary trans- lation in [FDP+14]. Auxiliary hardware monitors the execution stream and the translation of loops is performed by on-chip software. Either RISC or VLIW binaries can be translated into modulo schedules for the CGRA, with support for RAW/WAR dependency detection. The CGRA has 16 FUs of three different types: load/store units, ALUs or multipliers. A crossbar connects the FUs and each FU has its own register file. The CGRA is tightly coupled to the main processor and fetches operands from its register file. For the five reported benchmarks, the mean speedup is 2× versus a 4-issue VLIW. The proposed CGRA requires 13,000 LUTs and 23 BRAMs. In [OEPM09], a heuristic is applied to improve solutions provided by edge-centric modulo scheduling [PFM+08], in which speculative scheduling is required when CDFG nodes form a closed circuit. To address this, closed circuits are clustered into a single node, eliminating back- wards edges from the CDFG and therefore speculative scheduling. The approach is evaluated with 390 CDFGs, extracted from 12 applications, onto a CGRA with 16 FUs in a 4x4 arrangement. Scheduling is up to 170× faster, relative to state-of-the-art approaches. A loop accelerator whose runtime programmability is the result of a modulo-scheduling based tool chain is presented in [FPKM08]. The accelerator template is a VLIW type architecture with schedule-time specified FUs and connections. A baseline loop is first used to create the initial ar- chitecture: a direct realization of that loop with no programmability. Architecture generalizations are analyzed to increase the runtime flexibility of the accelerator. Additional loops are then sched- uled using a constraint driven modulo scheduler. The issue is modelled as a Satisfiability Modulo Theory problem and given to a solver. The approach is evaluated by creating base accelerators and then scheduling randomized variations of the base loop or loops from other applications. A method to support nested-loop pipelining is presented in [KLMP12]. Loops with up to one inner loop are modulo-scheduled onto a 2D CGRA. In [CM14] the problem of modulo-scheduling CDFGs onto 2D CGRAs is modeled as a graph minor problem, where a modulo routing resource graph is used to minimize routing costs by route sharing. In [ATPD12], a slack-aware modulo scheduler targets a highly heterogeneous CGRA. Kernels are partitioned according to CGRA limitations in order to make it possible to modulo-schedule them. A valid schedule is found via simulated annealing by starting from an optimal yet possibly invalid solution. 
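Since several of the scheduling works above revolve around minimizing the initiation interval, the short C sketch below computes the standard lower bound used as a starting point by modulo schedulers: the II can be no smaller than the resource-constrained bound (operations competing for the same FU type) nor the recurrence-constrained bound (inter-iteration dependency cycles). The helper names and example figures are illustrative assumptions, not values from any cited work.

/* Lower bound on the initiation interval: MII = max(ResMII, RecMII). */
static unsigned ceil_div(unsigned a, unsigned b)
{
    return (a + b - 1) / b;
}

/* Resource-constrained minimum II for one FU type. */
static unsigned res_mii(unsigned ops_of_type, unsigned fus_of_type)
{
    return ceil_div(ops_of_type, fus_of_type);
}

/* Recurrence-constrained minimum II for one dependency cycle. */
static unsigned rec_mii(unsigned cycle_latency, unsigned cycle_distance)
{
    return ceil_div(cycle_latency, cycle_distance);
}

static unsigned min_ii_example(void)
{
    /* Example: 6 memory operations competing for 2 load/store units, and a
     * recurrence of latency 3 spanning 1 iteration. */
    unsigned res = res_mii(6, 2);   /* = 3 */
    unsigned rec = rec_mii(3, 1);   /* = 3 */
    return res > rec ? res : rec;   /* MII = 3 */
}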

2.3 Dynamic Partial Reconfiguration in FPGAs

The previous section introduced the notion of reconfigurable systems in the sense of reconfigurable processing units being used as accelerators. In this scenario, reconfiguration generally occurs at a coarse scale, and the target technology for implementation varies between ASICs and FPGAs. In contrast to the designer-specified reconfiguration capabilities of some coarse-grain accelerators, typically implemented as controllable interconnects or configuration words, some FPGA devices support Dynamic Partial Reconfiguration (DPR) [Xil12b]. The logic circuitry in an FPGA depends on the contents of the underlying configuration memory, which establishes fine-grained component connections and the contents of logic cells. In traditional flows, an entire bitstream file is generated to write to the configuration memory. Instead, DPR allows for only specific regions of the configuration memory to be modified during runtime. The logic circuitry of the equivalent area can be altered without disturbing operation of the remaining circuit. This feature of FPGA devices allows for the development of systems which rely on reconfiguration in a number of ways. This feature has existed for some time (as early as Virtex-II Pro devices), but has seen slow adoption, in part due to limited tool support from vendors. This work focuses specifically on Xilinx devices, and on the flow they provide to dynamically reconfigure the logic circuitry. They contain an Internal Configuration Access Port (ICAP) which can be accessed via software to write partial bitstream files to the FPGA's configuration memory. Partial bitstream files can represent minute modifications to the implemented logic, or an entire hardware module which occupies a predefined region on the device [Xil12a]. The latter type of approach defines one or more reconfigurable regions, where each region serves as a host for a set of designer-specified modules. Creating this kind of design is equivalent to creating several static designs, which share a portion of their logic. A system may use several reconfigurable regions to instantiate different hardware tasks according to system demands. Each module is stored in a partial bitstream file, which is written to the device's configuration memory at runtime. Essentially, the same area of the device is used to implement several different functions by time-multiplexing hardware resources.

2.3.1 Examples of Partial Reconfiguration Applications

Despite the underutilization of DPR, several works, focused on markedly different application domains, have demonstrated the design potential of dynamically controlling circuit logic. In [LPV10], a partial bitstream based FIR filter system is presented. Two reconfiguration variants are presented. One alters only filter coefficients, the other the entire filter datapath. An output of 10 Mega-samples per second is achieved whilst reconfiguring the filter 70× per second. In [ESSA00], dynamic reconfiguration is used to create a self-healing fault tolerant system. In [GAA+07], faults are purposely introduced for self testing by modifying the circuit under test via runtime reconfiguration. In [SRK11], partial bitstreams are used to implement some of the protocol layers of IPSec in a lightweight fashion. Three cryptographic functions are switched in and out of a slot-based dynamic area (as per vendor flows), resulting in an implementation with significant area savings and performance comparable to full hardware implementations. The authors state that an equivalent non-DPR design on the same target device would not meet timing requirements. An implementation of a DVB-T2 coder/decoder is presented in [FISS12]. An FPGA with a reconfigurable area is used to host one of two developed modules in a time-multiplexed fashion. One module performs demodulation of the received signal and the second forward error correction. The FPGA communicates with an external central processor through USB. In [LP13], a system which performs single pixel operations is presented. Partial reconfiguration is used to instantiate more pixel processor cores, change the pixel operations performed, or alter other parameters of the cores to meet power and performance goals. Finally, in [KTB+12] a summary regarding DPR based architectures, application specific requirements and application examples is given. Tools geared towards DPR are also presented. In [PDH11] the factors introduced by DPR that influence performance are surveyed and experiments are conducted with a DPR based architecture to validate a cost model of DPR.

2.3.2 Design Considerations for Partial Reconfiguration Based Systems

The examples of applications briefly presented showcase the applicability of DPR based design. Also, several academic tools [SAFW11, BKT12, SF12] have been developed to explore DPR in ways not supported by vendor tools, which demonstrates the increasing interest in research regarding this technological feature. However, some issues are introduced when relying on DPR. Firstly, storage space is required for partial bitstreams. Some works have addressed this by proposing more efficient bitstream compression/decompression methods [GC08, QMM11]. Sec- ondly, reconfiguration times via ICAP are in the order of the millisecond, when relying on current vendor flows. This is the most important aspect when considering the use of DPR to support the reconfiguration capabilities of an accelerator. To reduce this overhead, some approaches have de- veloped more efficient ICAP controllers. In [CZS+08], a controller with DMA access achieves speedups from 20× to 50×, depending on the device, relative to Xilinx’s own controller. In [VF12] the proposed hardware solution to drive the ICAP decreases the reconfiguration time by an order of magnitude relative to existing vendor solutions. The work in [EBIH12] replaces the vendor ICAP controller itself with an alternative which the authors claim to be 25× faster, while also being capable of recovering from faults, and requiring half the device area to implement. In [HGNB10], the throughput of Xilinx’s own ICAP controller is roughly doubled by modifying it so it interfaces it directly to the MicroBlaze processor, via a point-to-point connection. Finally, some approaches address this as a task scheduling problem, proposing configuration pre-fetching and/or scheduling, which sometimes considers pre-emptive configuration switching [RSS08, LEP12].
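To put the millisecond-order figure above in perspective, the small C fragment below estimates reconfiguration time as bitstream size divided by effective ICAP throughput. Both numbers are illustrative assumptions; actual values depend on the device, the size of the reconfigurable region and the ICAP controller used.

#include <stdio.h>

/* Back-of-the-envelope reconfiguration overhead estimate. */
int main(void)
{
    double bitstream_bytes = 300e3;   /* assumed partial bitstream: 300 KB      */
    double icap_bytes_per_s = 10e6;   /* assumed effective throughput of a slow,
                                         software-driven ICAP controller        */
    double t = bitstream_bytes / icap_bytes_per_s;
    printf("estimated reconfiguration time: %.1f ms\n", t * 1e3); /* ~30 ms */
    return 0;
}

Under these assumptions, a 20× to 50× faster DMA-based controller, as reported by some of the works above, would bring the same transfer down to roughly a millisecond or less, which explains the attention given to custom ICAP controllers.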

2.4 Concluding Remarks

This chapter presented a brief overview of existing approaches which focus on augmenting a main processor with an accelerator module. The proposed accelerators either function as processor pipeline units providing special instructions or as larger units, apart from the processor, serving as loop accelerators. An accelerator design entails several aspects: the operations it supports, the granularity of its computational units, their number, layout and interconnections, how the execution is controlled, and what exactly is accelerated. Most approaches presented rely on binary-level information and quantitatively designed accelerators, which are targeted by a custom tool chain. Consequently, the accelerators contain coarse-grain units (i.e., word-level operators), essentially behaving as an array of limited ALUs. Reconfiguration entails controlling these units and the flow of data between them via the existing interconnect structure. This connectivity is what varies most noticeably per approach, and is a significant aspect in determining what the accelerator is capable of supporting. For the existing implementations, the accelerator hardware is generated prior to determining what will be accelerated. Adding reconfiguration capabilities in the form of multi-operation units and rich interconnections is an attempt at increasing the successful mapping of target loops onto the accelerator. However, this is not always successful, and the instantiated hardware may even be excessive for the loops to accelerate, meaning the benefits of circuit specialization are lost. This work addresses this problem by generating the accelerator hardware after the detection of the application hot spots, thereby ensuring circuit customization, instantiating only the minimum required hardware to accelerate the target loops. Additionally, the capability for DPR is unique to FPGA devices and is a promising match for the notion of runtime reconfigurable heterogeneous systems, that is, systems capable of modifying the available hardware based on the required workload and one or more target metrics. So, DPR is exploited to further ensure circuit customization, by simplifying the interconnection logic that the FPGA circuitry implements at any one time, and also to provide higher resource savings. Finally, to make the approach easier to adopt by developers, the generation of accelerator hardware, its integration into the system, and the migration of execution from software to hardware occur transparently, without need for software modification or manual hardware design effort. Chapter 3

Overview of Implementations and General Tool Flow

As per the general approach explained in Chapter 1, the objective of this work is to generate and integrate specific customized instances of accelerators into a host system as co-processor units. Towards that end, several accelerator architectures and tools, as well as supporting hardware and system level architectures, were devised. A general system level and tool flow view are shown in this chapter, to provide a detailed overview of the approach and to establish the context for later content. Each accelerator implementation and its evaluation is presented in succeeding chapters. The general approach relies on an automated step to identify frequently executing instruction traces, representing the portions of computation the accelerators will execute. This work relies on a specific type of loop trace called the Megablock. The following sections show a high-level view of the system level architectures used to evaluate the accelerator implementations. Also shown in Section 3.3 are the tools required to detect frequent Megablocks, transform them into accelerator instances, create the necessary communication infrastructure between host processor and accelerator, and configure the runtime migration mechanism. Section 3.5 concludes this chapter with a brief summary of the specific accelerator architectures and experimental results presented in later chapters.

3.1 System Level Architecture

Figure 3.1 shows a generic view of the type of system level architectures into which the several accelerator designs were integrated. These systems contain: the host processor, which for all cases is a MicroBlaze soft-core processor; either a local or external memory to hold code and/or data; the accelerator instance, connected to the host processor either via a Processor Local Bus (PLB) connection (Chapter 4) or a point-to-point Fast Simplex Link (FSL) connection (Chapters 5 to 7); and an auxiliary module called injector (of which there are several variants), whose purpose is to interface with the MicroBlaze’ instruction bus and migrate execution to the accelerator by modifying the contents of the instruction bus.


Figure 3.1: General overview of developed system architecture

The code memory is local, and holds automatically generated Communication Routines (CRs) which implement the communication between the processor and the accelerator. If the data mem- ory is local, the MicroBlaze uses the Local Memory Bus (LMB) interface to access it. Two LMB multiplexer modules (shown in Chapter 5) are used so the BRAM’s two ports are shared between it and the accelerator (not shown in Fig. 3.1). If the data memory is external, the MicroBlaze uses a memory controller to which the accelerator also has access, via a dual-port cache (Appendix A). The systems also contain custom counter/timer modules to retrieve execution statistics and UART modules. The accelerator and injector modules control when the several timers and coun- ters start/stop. An early implementation (presented in Chapter 4) also required an auxiliary Mi- croBlaze for reconfiguration tasks, which were later implemented via the injector or CRs. As the next section explains, the system architecture variants used follow the same execution model. Architectural differences between them stem from the different capabilities of the accel- erators under test (i.e., support for memory access and interfaces), or due to the need to test the accelerators under different conditions (i.e., use of external data memory to support larger appli- cations). The experimental sections of this thesis describe which system architecture was used to validate the several accelerator design iterations.

3.2 General Execution Model

As was explained before, the present approach achieves transparent binary acceleration by relying on four general steps. The detection of Megablocks and their translation is performed offline, via custom tools. The transparent migration and acceleration take place at runtime. Figure 3.2a summarizes the migration-enabled execution flow. The MicroBlaze executes the unmodified binary, and the injector monitors its instruction address. It intervenes when the address on the bus matches any one of the addresses it holds in an internal memory. These are the start addresses of the translated Megablocks. The injector begins the transparent migration step by replacing the fetched instruction with one or more instructions which result in an unconditional branch to a known memory position. Each start address corresponds to one such memory location. The right-hand side of Fig. 3.2b shows how the MicroBlaze behaves due to the migration mechanism.

(a) Migration of execution to accelerator via injector intervention

(b) Software only execution of loop kernel (left-hand side) and accelerator enabled execution (right-hand side)

Figure 3.2: Temporal diagram of migration and instruction level behaviour due to migration

The target location of this branch contains a CR which the MicroBlaze executes to communicate with the accelerator. Depending on the type of processor/accelerator interface, the exact contents of the CR vary. By executing the CR the MicroBlaze sends specific input operands to the accelerator from its register file. Depending on the system variant, the CR may also include configuration values to write to the accelerator. For other variants this process is performed by other system modules (the auxiliary MicroBlaze or the injector itself). The generation of CRs and their integration into the application code is explained in Section 3.3.3. After receiving all operands, the accelerator executes the translated loop, exploiting ILP and loop pipelining. All instructions of a Megablock are implemented on the accelerator, including one or more branch instructions, which terminate the execution of that particular loop path. These conditions are evaluated using live input data, meaning the accelerator is capable of performing an arbitrary number of iterations per call. However, it also means that the last iteration must be executed through software. This allows for the software execution to follow the control flow which triggered the end of the accelerator call. In earlier implementations, the MicroBlaze polled accelerator status registers to determine when to retrieve results. For later implementations, the MicroBlaze idles while the accelerator is executing. Either the CR itself or a hardware feature implements a blocking wait. In either case, the processor fetches results back into its register file by executing the remainder of the CR. The last step of the routine is a branch back to the instruction address where the injector intervened. Execution continues normally and the accelerator may be called again. If the injector is disabled, execution can proceed fully through software, as the binary is not modified by the offline tools.
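The communication pattern implemented by an FSL-based CR can be sketched at the C level as shown below. The actual CRs are generated directly in MicroBlaze assembly; the putfsl/getfsl names are assumed to be the blocking FSL access macros from the Xilinx MicroBlaze headers, and the operand count, channel number and meaning of the status word are illustrative only.

/* C-level sketch of an FSL Communication Routine (illustrative only). */
#include <fsl.h>   /* assumed Xilinx BSP header providing putfsl/getfsl */

static int call_accelerator(int op0, int op1, int op2, int *res0, int *res1)
{
    int status;

    /* Send the register-file operands in the sequence expected by this
     * accelerator configuration. */
    putfsl(op0, 0);
    putfsl(op1, 0);
    putfsl(op2, 0);

    /* Blocking read: the processor idles until the accelerator signals
     * completion with a status word. */
    getfsl(status, 0);
    if (status == 0)
        return 0;   /* fewer than one full iteration: resume in software */

    /* Retrieve the output operands back into the caller's variables. */
    getfsl(*res0, 0);
    getfsl(*res1, 0);
    return 1;
}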

Figure 3.3: Generic tool flow for the several accelerator and system level architectures. Grey area denotes slight implementation dependent variations.

Utilizing the injector module to monitor and modify the instruction bus fulfils one of the ob- jectives of the work: transparently migrating execution from processor to accelerator. No binary or processor modifications are necessary to integrate the accelerator. Designing the injector re- quired taking into account the characteristics of the instruction bus, both in terms of interface and sequences of exchanged control and data. This in turn depends on how the MicroBlaze issues requests onto its instruction bus based on its internal pipeline. Section 3.4 explains the injector and provides a migration example. The following section explains both the process of configuring the injector and generating an accelerator instance tailored for a target application.
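As a behavioural illustration of the injector's decision logic (the real module is hardware sitting on the MicroBlaze instruction bus), the C sketch below compares the fetched address against a table of trace start addresses and substitutes a branch into the corresponding CR. The table size, structure fields and the branch-encoding helper are assumptions for illustration; the actual encoding is MicroBlaze-specific.

#include <stdint.h>

#define NUM_TRACES 8

struct injector {
    uint32_t start_addr[NUM_TRACES];  /* first instruction of each trace   */
    uint32_t cr_addr[NUM_TRACES];     /* Communication Routine location    */
};

/* Stub: the actual substitution issues one or more MicroBlaze branch
 * instructions; returning the target here only keeps the sketch
 * self-contained. */
static uint32_t encode_branch(uint32_t from, uint32_t to)
{
    (void)from;
    return to;
}

/* Conceptually invoked for every instruction fetch: either pass the
 * fetched word through, or substitute a branch into the CR. */
static uint32_t on_fetch(const struct injector *inj,
                         uint32_t pc, uint32_t fetched_word)
{
    for (int i = 0; i < NUM_TRACES; i++) {
        if (pc == inj->start_addr[i])
            return encode_branch(pc, inj->cr_addr[i]);
    }
    return fetched_word;   /* no match: execution proceeds unmodified */
}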

3.3 General Tool Flow

The generation of custom accelerators is supported by an offline tool chain. Throughout the several accelerator and system implementations, the supporting tool chain underwent some modifications, but it can be summarized by the diagram in Fig. 3.3. The non-shaded area represents third-party or vendor tools. These segments of the implementation flow are independent of accelerator or system level architectures. The shaded area represents the bulk of the developed custom tools. Based on the specific accelerator architecture, the HDL generation tool(s) differ. The generation of CRs is considerably more independent of that aspect, being only a function of the used interfaces and other minor features.

3.3.1 Megablock Extraction

The flow begins with the application code. The frequent loop traces of the target application are produced by generating a MicroBlaze executable and processing it through the Megablock extractor [Bis15]. The extraction process outputs any loop path which obeys the detection criteria that can be provided to the tool. These include: the maximum size of the loop pattern (i.e., number of instructions in one loop trace iteration); the minimum number of executed instructions (i.e., size of pattern multiplied by number of iterations); unrolling of inner loops (which produces large unrolled traces); and several minor optimizations such as constant propagation. The loops detected by the extractor contain a single entry point, which is the first instruction of the trace, and multiple exit points. These are the control instructions (i.e., data-dependent branches) which cause execution to break away from the pattern and follow a different path (e.g., an if-else within a loop produces two possible loop paths). Unconditional branches are removed by optimization, including subroutine return instructions. This means the detected loop traces may cross function call boundaries. For example, a function call could be encompassed within a for loop, and the detected trace could include the function call (depending on the contents of the function itself). This is a striking contrast to, for instance, HLS approaches which target the function body, or even other binary approaches which target basic blocks. In terms of applicability, it means the source code does not need to be made more tool-friendly, as is required by some HLS approaches (e.g., by using pragmas). The performance achieved by the accelerator is due to exploiting these detection features. But performance is also a function of the detection limitations. For this particular tool, they include: no indirect memory access analysis (i.e., no hazard detection/resolution for data dependencies through memory); loop unrolling is always performed up to the outermost loop (e.g., it cannot unroll only up to a specified nesting level); since the loop trace extraction is performed by simulation, it cannot be assured that coverage is fully representative (i.e., functions may remain uncalled, or loop iteration counts may depend on data); the extractor is limited to the MicroBlaze instruction set; and loop traces are detected for regions of code which are not candidates for acceleration. Regarding this last point, examples would be the loop traces that occur within the printf function or any built-in routines for data-type casting. The former does not constitute any relevant data processing, since its execution time is mostly due to the lengthy polling times of accessing a stdout device. As for built-in routines (e.g., type casting, software emulation of division or floating-point, etc.), the respective loops are small, despite a large iteration count, and most of the instructions are branches. That is, there are many possible paths, where none is dominant or data-oriented. For the work presented here, no limitation was imposed on the trace size. Supporting large traces, however, does require an efficient accelerator design, as the following chapters will show. Loop unrolling was performed on a per-case basis, avoiding cases where the number of inner loop iterations resulted in a cumbersomely large unrolled trace. Constant propagation was also enabled.
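The pattern detection underlying the extraction can be illustrated, in a heavily simplified form, by the C routine below: given a simulated instruction-address trace, it checks whether the most recent addresses repeat with some period, which is the basic signal that a loop path is executing. The real tool works on decoded instructions and applies the criteria listed above; the function name and interface are illustrative assumptions.

#include <stddef.h>
#include <stdint.h>

/* Return the smallest period p such that the last p addresses of the trace
 * repeat the preceding p addresses, or 0 if no such period exists. */
static size_t repeating_period(const uint32_t *trace, size_t len, size_t max_period)
{
    for (size_t p = 1; p <= max_period && 2 * p <= len; p++) {
        size_t i;
        for (i = 0; i < p; i++) {
            if (trace[len - 1 - i] != trace[len - 1 - i - p])
                break;
        }
        if (i == p)
            return p;   /* the last p addresses repeat: candidate loop trace */
    }
    return 0;           /* no repeating pattern ending at the trace tail     */
}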

Figure 3.4: Example of extracted loop trace and resulting CDFG

The extractor produces a CDFG for any Megablock trace which repeats more than twice. This results in numerous traces per application, many of which are not very representative in terms of execution time. An automated selection step could filter out Megablocks which correspond to small portions of the execution. However, since uninteresting portions (e.g., printf) which are also extracted may represent large amounts of execution time, there is no way to disambiguate these cases. Thus, selection of CDFGs to process further is currently done manually by inspection of the ELF file and the Megablocks. The criteria used for selection include: the size of the trace, the number of iterations performed and, based on the capabilities of the accelerator and system, the type of instructions in the trace (e.g., floating-point or memory access instructions). An additional criterion is the number of exit points of the trace. Traces with fewer exit points are generally better candidates for acceleration, since this makes it more likely that more iterations will be performed uninterruptedly. For instance, consider a C-level loop construct containing a number of if-else clauses. This will result in multiple loop paths being extracted, each with a multitude of exit points. As additional loop paths need to be followed, it may become more unlikely that any single loop path will iterate a large number of times. This is related to a final limitation of this migration approach: if several loop paths start at the same address, only one can be accelerated. It is impossible to disambiguate which loop path should be accelerated relying only on the start address. In these cases, the most frequent path is chosen: when a start address is detected, the accelerator is called to execute that (presumed) loop path; once an exit is triggered, execution returns to software and one iteration is performed through another path; the accelerator is then called again. This is tolerable when one loop path is much more frequent than the remaining path(s). However, all loop paths may be equally frequent. The issue is exacerbated when two frequent paths execute alternately: every time the accelerator is called it performs no useful work, as more than one iteration must be executed per call for acceleration to be useful. These issues stem from a Megablock detection limitation: loop paths are not combined into multi-path traces. It would not be inconceivable to transform such multiple path representations into parallel conditional execution. As is, the present work targets large, frequent single path traces which, as later work will show, may contain any instruction of the MicroBlaze instruction set. Regarding specific file outputs, the Megablock extractor produces text representations of the CDFGs constructed from the Megablock traces. The information in these files includes which registers are inputs and/or outputs to the trace, what CDFG node feeds them, connections between nodes and topological information. The CDFGs represent a single trace iteration, and are therefore acyclical. Inter-iteration data dependencies (i.e., backwards edges between CDFG nodes) are determined by the translation tools.
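An in-memory view of the information just described could take a form similar to the C structures below: one node per trace instruction, explicit producer links per operand, and the register-file liveness of the trace. The field names are assumptions for illustration and do not reproduce the extractor's actual file format.

#include <stdint.h>

#define MAX_OPERANDS 2

struct cdfg_node {
    int id;
    int opcode;                  /* MicroBlaze operation                   */
    int producer[MAX_OPERANDS];  /* node id feeding each operand, or -1 if
                                    fed directly by an input register      */
    int input_reg[MAX_OPERANDS]; /* register number when producer is -1    */
    int is_exit;                 /* data-dependent branch (exit point)     */
};

struct cdfg {
    int               num_nodes;
    struct cdfg_node *nodes;         /* acyclic: one trace iteration       */
    uint32_t          live_in_regs;  /* bitmap of register-file inputs     */
    uint32_t          live_out_regs; /* bitmap of register-file outputs    */
    /* Backward (inter-iteration) edges are not stored here; they are
     * recovered later by the translation tools. */
};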

3.3.2 Generation of the accelerator HDL Description

The accelerator architectures employed in the present approach are essentially heavily parametriz- able templates. They are co-processors which allow for the specification of the number and type of computational resources (i.e., FUs), and the interconnections between them, among other minor aspects. The purpose of the developed tools is to generate a small number of Verilog include files containing parameters which are read during synthesis. This creates a specific instance which is capable of executing the desired CDFGs. The HDL generation tools underwent continuous development according to the accelerator architectures. As such, the tool features and generated outputs varied in a manner difficult to summarize into a generic representation. The functionality of these tools can be summarized as: allocation of FUs, creation of interconnect structure, generation of configuration information for runtime use. As the following chapters show, the designed accelerator architectures include: a multi-row FU array with data directionality; a second implementation of this design with memory access capability; and a single-row accelerator, with units controlled per-cycle in order to execute pipelined loop iterations. Each architecture is supported by its own version of the translation tools. The translation/scheduling process accepts multiple CDFGs. Each graph is processed indi- vidually. The generated accelerator is stored in a data structure and re-utilized while translat- ing/scheduling the next CDFG. The final result is an accelerator instance able to execute multiple CDFGs. For each graph, a number of pre-processing steps are first performed. These include: determining backward edges, establishing control dependencies, establishing the topological de- pendencies between nodes, and their slack. Control dependencies ensure that all branch operations in an iteration must execute before: 1) a new iteration is initiated, and 2) any store operation be- longing to the same iteration is performed (to avoid writing erroneous data to memory). After pre-processing a CDFG, its nodes are translated into FU placements and interconnec- tions. Each CDFG node is processed individually. The translation attempts to find an existing FU in the current accelerator structure (if any), to re-utilize for the execution of that node. For the multi-row architectures, this entails searching through the accelerator’s rows of FUs within the node’s permitted slack. For the single-row architecture, the process involves a modulo schedul- ing step explained in Chapter 6. If an available FU cannot be found, a new one is instantiated. According to the node’s required inputs, the connectivity is updated. This involves specifying a multiplexer per FU input. For the multi-row designs, the edges between nodes are transformed into wiring between FUs which remains static throughout the execution of the respective CDFG. 48 Overview of Implementations and General Tool Flow

Listing 3.1: Excerpt from HDL generation tool output for the accelerator architecture presented in Chapter 5

// Verilog accelerator parameters:
parameter NUM_CFGS  = 32'd1;
parameter NUM_IREGS = 32'd4;
parameter NUM_OREGS = 32'd5;
parameter NUM_COLS  = 32'd5;
parameter NUM_ROWS  = 32'd4;
parameter NUM_STS   = 32'd0;
parameter NUM_LDS   = 32'd2;
// FUS
parameter TOTAL_FUS   = 32'd17;
parameter TOTAL_EXITS = 32'd1;

parameter [0 : (32*NUM_ROWS*NUM_COLS) - 1] FU_ARRAY = {
    // top row
    `A_ADD, `A_ADD, `A_ADD, `NULL,  `NULL,
    `M_LD,  `M_LD,  `L_XOR, `PASS,  `NULL,
    `A_MUL, `B_EQU, `PASS,  `PASS,  `PASS,
    `A_ADD, `PASS,  `PASS,  `PASS,  `PASS
    // bottom row
};

parameter [0 : (32 * NUM_CFGS) - 1] NR_INPUTS  = { 32'd4 };
parameter [0 : (32 * NUM_CFGS) - 1] NR_OUTPUTS = { 32'd5 };

Figure 3.5: Example instantiation of Functional Unit array for the parameters in Listing 3.1

For the single-row design, establishing the connections between FUs requires per-cycle temporal awareness. In either case, the FU allocation process tracks the existing connectivity, attempting to place nodes onto existing FUs such that the final connectivity is reduced.

Listing 3.1 shows an excerpt of the translation tool's Verilog output for the multi-row accelerator architecture presented in Chapter 5. This file specifies some top-level aspects of the accelerator instance as well as the array of FUs. The connectivity specification is too verbose to present here. Instead, a graphical representation of the resources and interconnections, output by the translation tool, is shown in Fig. 3.5. For this architecture, each supported CDFG corresponds to one interconnect configuration. If several graphs had been used for this example, each would implement its own connectivity by setting the generated multiplexers appropriately prior to execution. What is shown in Fig. 3.5 is essentially a CDFG directly translated into hardware modules, with the required connectivity to implement the data flow. Inter-iteration data flow is implemented by having the first row of FUs read the output registers at the start of every new iteration.

The translation tools for the multi-row architectures (Chapter 4, Chapter 5, and Appendix A) are written mostly in C, along with supporting bash scripts. The modulo scheduler for the single-row accelerator architecture (Chapter 6) is currently implemented in MATLAB. Its latest implementation targets the DPR-oriented variant of the single-row accelerator design (Chapter 7).

Figure 3.6: Example of tool-generated FSL Communication Routine

3.3.3 Generation of Communication Routine

An additional tool generates the Communication Routines (CRs) which implement the communication between the accelerator and the MicroBlaze. One of the objectives of this work was to achieve a migration process transparent to both the hardware infrastructure (i.e., no processor modifications) and to the application programmer (i.e., no source code modification). To do this, the CRs are generated directly in MicroBlaze assembly and, via a second compilation step, added to the application. In order to generate a CR, the tool is provided with: a list of register file registers which are CDFG inputs, a list of registers which are outputs, and the start address of the Megablock trace. Depending on the target accelerator, the tool outputs a sequence of MicroBlaze instructions which either read/write to the accelerator's bus address, or send operands and read results via an FSL.

Figure 3.6 shows an example FSL-based CR. Single-cycle put instructions send operands in a known sequence. Since the get instruction is blocking, the MicroBlaze idles waiting for the accelerator to output a single result, indicating completion status. The number of iterations to perform on the accelerator is variable per call, so it may happen that only one (or zero) iterations are performed, depending on the inputs fed to the accelerator. In these cases the contents of the scratch-pad register used to check completion status are recovered (in this case r5) and software execution resumes. For bus-based communication, the CRs implement a polling wait.

Results are retrieved one at a time at the end of execution (if more than one iteration is completed). If the implemented loop trace contains operations which alter the carry bit, then the CR either clears or sets it. If multiple output registers are driven by the same CDFG node, then a minor optimization is performed: instead of requiring additional output registers on the accelerator, the CR performs value assignments between the MicroBlaze registers. It also assigns constant values to output registers if required. Execution then branches back to the Megablock start address.

The CR generation tool is also capable of generating code to invalidate a predefined region of the MicroBlaze's data cache after accelerator execution (for scenarios where data resides in external memory). This invalidation is only performed when the executed Megablock performs store operations, as there is no loss of coherency otherwise. Early designs required configurations (e.g., interconnection settings) to be sent to the accelerator by an external mechanism. For these cases the CRs also contained instructions to write a sequence of 32-bit values to the accelerator. In order to be transparently added to the application, these generated instruction sequences are output as data contained within C arrays. By setting a section attribute and using a custom linker script, the CR can be placed at the expected location in the binary:

Listing 3.2: FSL-based CR in C container

uint32 CR_0[18]
__attribute__((section(".CR_seg"))) = {
    0x6c04c000,   // Sending operands
    0x6c08c000,
    0x6c0ac000,
    0x6c09c000,
    0x6c07c000,
    0x6c03c000,
    0x6c600000,   // Reading status
    0xbc03000c,
    0x6c600000,
    0xb6401180,
    0x20000000,
    0x6c600000,   // Retrieving results
    0x6c800000,
    0x6ca00000,
    0x6cc00000,
    0x6e400000,
    0xb6401180,   // Return
    0x20000000
};

Listing 3.3: Linker Script excerpt to place Communication Routines at a known position

// Define Memories in the system
MEMORY
{
    bram: ORIGIN = 0x00000050, LENGTH = 0x0001FCAF
    seg1: ORIGIN = 0x0001F9FF, LENGTH = 0x00000600
}

(...)

.CR_seg : {
    KEEP( *(.CR_seg) )
} > seg1

Listing 3.2 shows a smaller CR in a C array which, if the linker script definition in Listing 3.3 is used, will be placed at address 0x1FA00 (the first word-aligned address inside the seg1 memory defined at 0x0001F9FF). A custom linker script is required for two reasons: the injector must know the location of the CRs in memory, and the re-compilation stage must not change the location (i.e., addresses) of the trace loops, the registers used, or the locations of static data. A re-compilation stage is not strictly necessary, as the premise of this approach is that the source code should not be modified and/or is not accessible. The re-compilation is employed by the tool chain for convenience; this step could be replaced with only a linking step, or with the use of the mb-objcopy tool, in order to append the CRs to the end of the ELF file. An alternative to solve the first issue would be to inspect the ELF after the re-compilation stage, determine the location at which the linker placed the CRs, and generate the injector address table accordingly. However, this would not ensure that the remainder of the ELF would remain unchanged. This is why the CRs are placed near the end of the available memory. An alternative to this approach would be to integrate the entire trace detection and CR generation into the compilation flow, but this is complex and against the proposed methodology of this work.

(a) PLB version of injector module (b) LMB version of injector module

Figure 3.7: Architectural variants of the injector module

This is especially true given that a future fully-runtime design is envisioned (including runtime detection and translation), so it is not a viable option to interfere with the compilation tools. This tool also produces information for the auxiliary hardware elements of the system (most frequently this is only the injector). For the injector, a small Verilog include file is produced with the start address of each Megablock trace and the location of each respective CR. For one system implementation (shown in Chapter 4), the injector was also responsible for sending configuration information to the accelerator. For this purpose, this tool also produced a read-only memory with this content. For an early implementation (also shown in Chapter 4), the configuration information was contained within the secondary MicroBlaze; a C file was generated for the application running on this processor. The following section discusses the injector in more detail, to provide a better understanding of its design and the functioning of the migration mechanism.

3.4 The Injector Module

The variants of the injector module, shown in Fig. 3.7, serve the same primary purpose: to monitor the execution of the MicroBlaze processor and modify the contents of the instruction bus. By doing this, the injector controls, in a limited fashion, the execution of the MicroBlaze. Doing so allows for migrating the execution to the accelerator when the start address of a translated Megablock is detected. Fig. 3.7a shows the PLB version of the injector, and Fig. 3.7b the LMB version. This last version has two variants: one which includes accelerator configuration data in an internal memory, and another which does not.

The injector operates in the following way: 1) monitor for an address table match; 2) when a match is found, replace the fetched instruction with an unconditional jump to the same address; 3) when the same matched address is seen again, inject into the bus one (in some cases two) instruction(s) which result in an unconditional jump to the respective CR address; 4) idle during accelerator execution; 5) ignore the next occurrence of the matched address, as it will be due to the software execution of the last Megablock iteration; 6) return to the monitoring state.

Steps 2 and 3 are verification steps due to the behaviour of the instruction address bus imposed by the MicroBlaze. The MicroBlaze instances used contain a 5-stage pipeline. As such, the fetch request is issued before the instruction is actually executed. This effect, combined with aspects such as the delay of branch instructions and instructions with delay slots, means the MicroBlaze will frequently fetch an instruction it will soon after discard. For example, consider a backwards branch located at address X and a Megablock trace starting at address X + 8 (instruction addresses advance by 4 bytes per instruction). Due to the fetch behaviour of the MicroBlaze, the Megablock start address would appear on the bus regardless of whether the backwards branch is taken or not. However, execution should only be migrated if the branch is not taken, i.e., if the Megablock trace execution is to proceed.

Injecting a branch to the same address as a verification step works by exploiting this behaviour: when the instruction at the Megablock start address is fetched but not executed, it is effectively ignored. Likewise, the branch to the same address which replaces it will be ignored if the Megablock is not to be executed, or taken if the Megablock is to be executed. In the latter case, the Megablock start address will be observed twice. This allows the injector to discard false positives.

In order to implement this, the injector must identify every branch instruction in the MicroBlaze instruction set, including determining whether or not the instruction has a delay slot, as this increases the window of uncertainty during verification. This verification step introduces a small amount of overhead (typically 3 to 4 clock cycles), which is negligible in nearly all cases. This also means that adapting this migration approach to another processor would require adapting the injector to the behaviour of the new instruction set. Also, as mentioned previously, it is impossible to disambiguate multiple trace loops starting at the same address. Finally, the injector does not support a MicroBlaze with an instruction cache. Since the cache is internal to the processor, the interface between them cannot be probed.
This means the processor would directly fetch trace instructions from the cache and the migration mechanism, in its present state, would not function. The injector has been thoroughly validated and is a very low overhead modification, both in terms of time and resources, which allows for transparent migration without modification of the application binary, either offline or at runtime. Preserving the unmodified binary would also allow a single accelerator to be used in a time-multiplexed manner by two processors: while one processor utilizes the accelerator, the second would continue execution by falling back to software.

For purposes of measuring the execution time, the injector activates one of the existing counters after introducing the branch into the instruction bus, and disables it when returning to the monitoring state. When the injection of instructions is disabled, the same counter is used during execution of the trace iterations. In this way it is possible to compare accelerated and software-only execution. A simplified software model of the injector's behaviour is sketched below.
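The following C fragment models, purely for illustration, the monitoring and verification state machine described above. All names are hypothetical; the actual injector is a small Verilog module that observes and replaces instructions on the MicroBlaze instruction bus.

/* Illustrative software model of the injector state machine (hypothetical names). */
#include <stdint.h>

enum state  { MONITOR, VERIFY, ACCEL_RUNNING, SKIP_LAST_ITER };
enum action { PASS_THROUGH, INJECT_BRANCH_TO_SELF, INJECT_BRANCH_TO_CR };

typedef struct { uint32_t trace_addr; uint32_t cr_addr; } addr_entry;

enum action injector_step(enum state *s, const addr_entry **match,
                          uint32_t fetch_addr, int accel_done,
                          const addr_entry *table, int table_size)
{
    switch (*s) {
    case MONITOR:                        /* 1) look for an address table match   */
        for (int i = 0; i < table_size; i++)
            if (table[i].trace_addr == fetch_addr) {
                *match = &table[i];
                *s = VERIFY;
                return INJECT_BRANCH_TO_SELF;   /* 2) jump to the same address   */
            }
        return PASS_THROUGH;

    case VERIFY:                         /* 3) address seen twice: migrate        */
        if (fetch_addr == (*match)->trace_addr) {
            *s = ACCEL_RUNNING;
            return INJECT_BRANCH_TO_CR;          /* jump to the CR address        */
        }
        *s = MONITOR;                    /* false positive: fetch was discarded   */
        return PASS_THROUGH;

    case ACCEL_RUNNING:                  /* 4) idle while the accelerator runs    */
        if (accel_done) *s = SKIP_LAST_ITER;
        return PASS_THROUGH;

    case SKIP_LAST_ITER:                 /* 5) last iteration executes in software */
        if (fetch_addr == (*match)->trace_addr) *s = MONITOR;  /* 6) monitor again */
        return PASS_THROUGH;
    }
    return PASS_THROUGH;
}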

3.5 Summary of Accelerator Implementations

This section presents a short summary of the following chapters, briefly describing the accelerator architectures implemented, their features, the experimental evaluation and results. Table 3.1 contains a comparison of these implementations.

Table 3.1: Brief comparison of implemented accelerator architectures and results

Chap. 4 — Architecture: automatically generated, multi-configuration, multiple-row accelerator; rows with forward data directionality, composed of single-function integer units; per-configuration inter-row connectivity, implementing the equivalent of a CDFG structure. Description: non-pipelined execution of iterations with exploitation of intra-iteration ILP. Benchmarks: 15 integer. Speedup: 1.68×.

Chap. 5 — Architecture: same multiple-row accelerator as Chapter 4. Description: augmentation of the previous architecture with memory access support; two parallel accesses supported to arbitrary addresses. Benchmarks: 37 integer. Speedup: 1.60×.

Chap. 6 — Architecture: automatically generated, multi-configuration, single-row accelerator; single row of FUs with custom register pool, FU input connectivity, and configuration memory. Description: full floating-point support; sequences of per-cycle configuration words exploit ILP and loop pipelining. Benchmarks: 13 floating-point and 11 integer. Speedup: 5.60×.

Chap. 7 — Architecture: same single-row accelerator as Chapter 6. Description: same accelerator architecture, partitioned for DPR for resource savings. Benchmarks: 9 floating-point and 11 integer. Speedup: N/A.

App. A — Architecture: automatically generated, multi-configuration, multiple-row accelerator; rows with forward and backwards data directionality. Description: connections between non-adjacent rows; loop pipelining via dynamic inter-row dependency resolution. Benchmarks: 12 integer. Speedup: 1.91×.

In Chapter 4 an initial accelerator design is presented, along with the supporting translation tools. This design takes full advantage of the ILP found in the CDFGs, but does not support pipelined execution nor memory accesses. The design is essentially a template grid onto which the translation tools instantiate any number and type of FUs, to implement the operation parallelism. Three different system-level architectures with different interfaces were used. The implementation platform was a Digilent Atlys board containing a Spartan-6 FPGA. A total of 15 synthetic benchmarks are used, mainly to validate the transparent migration process and tools. A geometric mean speedup of 1.69× over software-only execution was achieved for the best case scenario. Performance measurements were taken from the implemented hardware running on-chip.

In Chapter 5 the accelerator is augmented to support acceleration of traces containing memory access operations, and also to support multi-cycle FUs. Data parallelism is exploited by supporting up to two concurrent accesses. The entire data memory of the MicroBlaze is shared with the accelerator via a latency- and overhead-free mechanism. That is, the accelerator does not contain its own data memories, instead directly accessing the only data memory present. This avoids additional overhead due to lengthy data transfer and/or synchronization steps. Furthermore, the translation process is enhanced with a list scheduling algorithm which aims to increase FU re-utilization across configurations. Along with this, heuristics for the placement and access scheduling of memory operations are applied to reduce the total latency introduced due to memory access. For a total of 37 benchmarks, the geometric mean speedup is 2.35×. Also, the average power consumption of the entire system decreases by 2.10 % for a measured subset of 7 cases.

In Chapter 6 the latest accelerator design is presented. Unlike previous designs, it relies on a single-row topology to increase utilization of instantiated FUs and decrease overall resource requirements. The accelerator instances are generated via a modulo scheduling step [Rau94], which is capable of efficient loop pipelining by being aware of the memory access restrictions of the system. The second difference is the support for all single-precision floating-point operations. This implementation is much more efficient in terms of the resource/performance trade-off, relative to the previous ones. For 24 benchmarks the tool chain can generate accelerator instances which achieve a geometric mean speedup of 5.61×, requiring an average of 960 slices on a Virtex-7.

Chapter 7 presents the realization of the proposed DPR reconfiguration mechanism for the accelerator. Instead of instantiating the accelerator's reconfiguration capabilities as multiplexing logic, the accelerator is partially reconfigured as needed to switch between supported configurations. The accelerator is partitioned into a static region and a reconfigurable region, which contains FUs, interconnections and configuration words. In the evaluation performed, each Megablock is used to generate a variant for this reconfigurable region. For accelerators with larger numbers of configurations the resource savings due to DPR are very significant. Most noticeably, the accelerator requires 2.76× fewer LUTs than a non-DPR equivalent instance.
Appendix A details a variant of the multi-row accelerator design shown in Chapter 5, augmented to enable pipelined execution. For each supported graph, the accelerator instance contains logic that determines the data dependencies each row of FUs has relative to the others. No tool-level generation of control logic is required, only a bit-encoded parameter specifying from which row each row receives data. During execution, a row of FUs activates once all of its inputs hold valid data. This way, rows activate as quickly as possible, as a function of how the FUs are connected. Implementing the row activation logic in this fashion makes it easier to deal with variable memory access latencies in the scenario adopted for this implementation: shared external memory holding data, and a dual-port cache for the accelerator. Also, an improved node and load/store scheduling method is used to increase resource re-utilization and decrease memory access contention. For 12 benchmarks, this design achieves a geometric mean speedup of 1.91×. A sketch of the row activation rule follows.
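As an illustration only, the per-row activation rule of this variant can be expressed as follows in C; the dependency encoding and names are hypothetical, and the real implementation is combinational logic generated from the bit-encoded row-dependency parameter.

/* Illustrative model of the Appendix A row activation rule (hypothetical names).
 * dep_mask[r] has bit p set if row r consumes data produced by row p. */
#include <stdint.h>

#define MAX_ROWS 32

/* A row may activate once every row it depends on holds valid output data. */
int row_can_activate(int r, const uint32_t dep_mask[MAX_ROWS],
                     uint32_t valid_rows /* bit p set if row p's outputs are valid */)
{
    return (dep_mask[r] & ~valid_rows) == 0;
}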

Chapter 4

Customized Multi-Row Accelerators

This chapter presents the first accelerator design. The main concern when developing this first iteration was the integration of the accelerator itself with the remaining system. The capabilities of the accelerator were designed according to what was minimally required to successfully accelerate the Megablocks. Structurally, it was based on a relatively straightforward translation of CDFGs into a hardware structure, so as to simplify the translation tools. Rows of FUs are organized according to the topological levels of the CDFGs, and data flows with downward directionality. Along with simplifying the tools, this type of resource layout was also deemed appropriate for a future pipeline-capable version which would adequately use all instantiated resources. The architecture lacks major features such as memory access, but nevertheless constitutes a proof-of-concept of the migration system and is functionally correct in terms of the data results it produces when accelerating Megablocks.

Alongside the accelerator design, the first versions of the two supporting tools were capable of: generating an accelerator specification which could execute multiple loops, re-utilizing resources if possible; and generating the CRs for transparent communication between the accelerator and the GPP, which includes operand/result transfer and accelerator reconfiguration.

This chapter contains a description of the accelerator architecture in Section 4.1 and an explanation of the supporting tools in Section 4.2. The experimental setup and results are explained in Section 4.3 for three system-level architectures employing different interfaces and accelerator reconfiguration methods. Section 4.4 closes this chapter with remarks explaining the limitations of this initial design and the rationale that led to more sophisticated subsequent designs.

4.1 Accelerator Architecture

Like all accelerator architectures presented throughout this work, this first architecture relies on synthesis-time parameters, generated by the developed translation tools, which customize an architecture template. This customization varies the number, layout and type of FUs and their interconnections. Aspects such as interfaces and control are equal for all instantiations, whilst the array of FUs is tailored for a set of Megablock CDFGs. At runtime, the migration and accelerator reconfiguration mechanisms utilize the accelerator to execute the target CDFGs. This section presents the architecture template and overall execution model.

Figure 4.1: Synthetic example of 2D accelerator instance

4.1.1 Structure

Figure 4.1 shows a synthetic example of the first accelerator architecture, with interface details omitted. This design is essentially composed of rows of FUs of several types, which propagate data downwards via full crossbar interconnects. Each row may contain any number and type of unit. The depth of the array, i.e., the number of rows, is also unbounded. Both aspects vary with the CDFGs used to generate the accelerator instance. Data is exchanged only between neighbouring rows. To transport data between distant rows, passthrough units are used. All FUs register their outputs, meaning data are propagated row-to-row synchronously, as a group. Data feedback, in order to provide operands to the following iterations, is only performed at the last row of the array. All accumulated data is aggregated, routed backwards and, together with the input registers, constitutes the data available at the start of an iteration.

Typically, the array contains more passthroughs in the bottom rows, taking on a triangular shape. This is due to redirecting all of the produced data back into the first row for the following iteration. The number of passthroughs typically increases with each row as data accumulates.

The FUs are single-operation and single-cycle. Each type of FU corresponds to one CDFG node type (e.g., a single MicroBlaze instruction). This accelerator implementation relied on a set of FUs implementing all 32-bit integer arithmetic (save for division) and comparison operations. The supported arithmetic includes operations involving carry, as it is also possible to feed the value of the GPP's carry bit into the accelerator and to retrieve it. Also supported are pattern comparison and exit (i.e., branch) operations. This last class of FUs implements the equivalent of the branch instructions of the Megablocks, and is used to signal the end of execution. Unsupported operations in this first design include all floating-point arithmetic and memory accesses.

The accelerator template supports FUs with any number of inputs or outputs (e.g., the 3-input adder with carry). Each input is fed by a multiplexer (part of the crossbars in Fig. 4.1) which fetches all outputs of the preceding row. A possible multiplexer configuration is shown for Crossbar 2 and Crossbar 3. Some trace instructions receive constant input operands, and the Megablock extraction tools also perform constant propagation. As a result, some FU input multiplexers are removed (e.g., the bra FU in Fig. 4.1). The multiplexers within the array are runtime-controllable via writeable configuration registers. The per-CDFG configuration values to write to these registers are also computed by the offline translation tools.

The register file values received from the MicroBlaze when the accelerator is invoked are represented at the top as input registers, which remain read-only throughout execution. Likewise, the values to be fed back are stored in a final set of output registers. The number of output register values that are fed back is equal to 2M: one set of M values produced in the current iteration, and another produced in the previous one. Values in the output register set can also be re-assigned amongst themselves. This emulates the behaviour of re-assigning values between registers in the GPP's register file. The Iteration control module is responsible for controlling the input multiplexer.
After the first iteration, the respective enable bit is set so that new values can be fetched from the feedback wires according to the input switching register, instead of the input registers. By counting clock cycles the control module determines when an iteration is completed and controls the write-enable of the output registers. Finally, it sets status bits when execution is over.

4.1.2 Interface

Two types of accelerator interfaces are explored. Figure 4.2 shows the interface for the version based on a Processor Local Bus (PLB). The Fast Simplex Link (FSL) version contains the same internal registers, but the interface is a low-overhead point-to-point connection. Two types of interfaces allow for observing the impact of communication overhead on performance. The interface-level registers of the accelerator include an instance-dependent number of input, routing and output registers (N, M and L). The input and output registers contain data inputs/outputs, and the routing registers control the inter-row connectivity. The input multiplexer and output multiplexer are each controlled by a separate register. The masks register controls which exit FUs

in the array are enabled (for reasons explained below), and the status register indicates whether the accelerator is busy and how execution terminated. The start register is used to begin execution, and the two context registers are used as scratch-pad memory by the MicroBlaze while it executes the CR.

Figure 4.2: PLB type interface for the accelerator

The number of input and output registers depends on the Megablock traces. Given a set of CDFGs used to generate an accelerator, N and L will be equal to the maximum number of inputs and outputs throughout all graphs, respectively. The number of routing registers depends on the array itself. Each row requires a different amount of routing bits as a function of the number of outputs of the previous row. Per row, each output is given a numeric identifier. Each FU input multiplexer of the following row uses a binary number to select any of those values. For simplicity of implementation, the bit-width of these selection fields is the same for all multiplexers, and is determined by the maximum number of outputs throughout all rows. In the case of Fig. 4.1, this would be the first row, with 5 outputs, which leads to 3 bits. Thus, to drive all 12 FU inputs a total of 34 bits are needed, i.e., two 32-bit registers. Additional bits are needed for the last multiplexer, which needs to drive the M output registers. In other words, the overhead of invoking the accelerator scales with its width and depth, due to the configuration values that need to be written to it. The generate constructs the accelerator relies on connect the specific bits of the registers to the multiplexers, so a single register may hold configurations relative to several rows.

For simplicity, the input and output multiplexers are each controlled by a single 32-bit register for every instance. This imposes a limitation on the joint number of inputs and outputs. For instance, consider that M = 1 (as per Fig. 4.1). For the input multiplexer, the number of possible choices per output it drives is 3: the current M = 1 output result, plus the same output from the previous iteration, plus one starting value originating from the N input registers. This selection range requires 2 bits, which means the total number of inputs supported is N = 16.

The start register is written while executing the CR, after the operands are sent. For the FSL interface version, this is replaced with a signal sent by the injector, which is capable of determining when the MicroBlaze has sent all operands by detecting an FSL get instruction on the bus, i.e., the MicroBlaze stalls waiting for accelerator results. The status register indicates if more than one iteration was performed on the accelerator. If this is not the case, then the context registers are used to recover values into the MicroBlaze registers which were used during CR execution.
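As a worked restatement of the limitation just described (for the M = 1 case in the text), the number of selection bits per driven input and the resulting maximum number of inputs supported by a single 32-bit input multiplexer register are:

\[
  \lceil \log_2 3 \rceil = 2 \ \text{bits per driven input},
  \qquad
  N_{\max} = \left\lfloor \frac{32}{2} \right\rfloor = 16 .
\]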

4.1.3 Execution Model

Execution on the accelerator begins after all configuration information and inputs have been received. Inputs are sent by the MicroBlaze, and configuration values are sent via different mechanisms, depending on the accelerator interface (this is further explained in Section 4.3). A configuration sets all the multiplexer selections, which are constant throughout execution. That is, each loop that the accelerator is capable of executing corresponds to one global multiplexer context. After execution begins there is very little control beyond counting the number of clock cycles required to complete an iteration, and writing output values to the output registers at that point. One iteration is complete after a number of clock cycles equal to the depth of the array. A single row of FUs activates per cycle, that is, execution is not pipelined in this initial design. During the first iteration, the input multiplexer receives N inputs from the input registers. After this, the input multiplexer instead fetches some of these values from the feedback lines. The remaining values remain constant throughout all iterations, and are always fetched from the read-only input registers.

The array can execute an arbitrary number of iterations. If the number of iterations is determined by a constant in the Megablock trace, this propagates into the accelerator as a constant value operator. In order to terminate execution, the accelerator always has at least one exit condition. The control module receives single-bit outputs from all exit FUs and, when any is true, execution ends. As the previous chapter explained, Megablock execution on the accelerator is atomic: an iteration either fully executes or is discarded. Support for non-atomic iterations would require discriminating which exit condition triggered, recovering the correct set of outputs and returning to a particular software address in the middle of the accelerated trace.

For an accelerator which supports multiple CDFGs, data is still propagated through FUs which may not be in use for a particular configuration. For data FUs this is not an issue, since these results are simply not routed to the following rows or registered at the outputs. But the configuration must ensure that only the exit FUs relevant to a particular configuration are active. This is done by the 32-bit mask register, which disables or enables each such FU. This limits the number of exits allowed on the accelerator to 32. However, no observed combination of Megablocks in the utilized benchmarks exceeded this value. Once execution completes, the results are read from the output registers by the CR and the control module resets the inter-row registers. The accelerator can then be invoked again. If the loop to accelerate is the same as the last one, the configuration process is skipped to reduce overhead. The control behaviour just described is sketched below.
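The fragment below is a minimal C model of this non-pipelined execution control: one row activates per clock cycle, an iteration completes after a number of cycles equal to the array depth, and execution stops when any enabled exit FU triggers. All names are hypothetical; the actual control module is part of the Verilog template.

/* Illustrative model of the non-pipelined execution control (hypothetical names). */
#include <stdint.h>

typedef struct {
    int      num_rows;         /* array depth: clock cycles per iteration          */
    uint32_t exit_mask;        /* enables the exit FUs of the active configuration */
} accel_cfg;

/* Stand-ins for the hardware datapath: one row of FUs computes per call,
 * and the exit FUs expose single-bit outputs. */
extern void     advance_row(int row);
extern uint32_t exit_outputs(void);

/* Returns the number of completed iterations. */
uint32_t run_accelerator(const accel_cfg *cfg)
{
    uint32_t iterations = 0;

    for (;;) {
        /* one iteration: activate one row per clock cycle, top to bottom */
        for (int row = 0; row < cfg->num_rows; row++)
            advance_row(row);

        /* iterations are atomic: an iteration that triggers an enabled exit
         * is discarded and later re-executed in software */
        if (exit_outputs() & cfg->exit_mask)
            return iterations;

        iterations++;          /* fed-back values feed the next iteration */
    }
}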

4.2 Architecture Specific Tool Flow

Figure 4.3 shows the tool flow for generation of the custom accelerators, support architecture and CRs. As the previous chapter introduced, the accelerator generation flow is supported by CDFGs which are produced by the Megablock extractor tool. The tools presented here receive as inputs the Megablocks that are manually selected as acceleration candidates. The outputs of the tools are given to vendor synthesis tools and compilers.

Figure 4.3: Architecture-specific flow for 2D accelerator design and supporting hardware

This version of the translation step processes each CDFG file individually, and produces a binary file with the accelerator specification. For each subsequent CDFG to translate, this file is read and the existing structure is updated. The final run outputs a Verilog include file with parameters for the accelerator HDL template, and the routing register values per configuration.

The overall execution flow of this tool is as follows. The CDFG file is parsed and, according to the depth and maximum width (i.e., ILP), a number of data structures are pre-allocated. The CDFG nodes are then translated into FUs by assigning them positions on a two-dimensional grid. Since unlimited connectivity is assumed, placement is unrestricted. In this implementation nodes are placed in the earliest possible row, and rows are filled from left to right. Each node results in one FU, as in this architecture re-utilization of FUs only applies across different configurations, i.e., CDFGs. In other words, re-utilization of FUs across configurations is possible if two nodes of the same type in two different CDFGs occur in the same topological level. A single type of FU may support different types of CDFG operations. The most common case is the implementation of the MicroBlaze add and addi (addition with an immediate constant value) instructions via the same FU, since the distinction only exists at the level of the MicroBlaze ISA.

After the placement of CDFG nodes, the auxiliary passthrough FUs are placed; if a connection is required between FUs on non-adjacent rows, then passthroughs are added to all in-between rows (see the sketch below). The tool performs passthrough re-utilization at this point; for instance, if two FUs require the same output from a FU several rows above, only one chain of passthroughs is introduced. This process is implemented by checking the array from bottom to top: as passthroughs are added to row N, they are themselves checked for the need of another passthrough when row N − 1 is processed. Due to the nature of the CDFGs, passthroughs tend to be created in an inverted pyramid shape. This leads to frequent re-utilization of passthroughs between configurations.

At this point, the position of every CDFG node is known, as well as their connections and the bit-widths of each row's crossbar, by analysis of the number of inputs and outputs of neighbouring rows. With this, the number of required 32-bit configuration registers is calculated so that enough bits are available to control all multiplexers. The output Verilog file produced specifies only the coordinates of each FU to instantiate and interface-level aspects such as the number of input, output and routing registers.
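The passthrough insertion just described can be sketched in C as follows; this is a simplified per-connection view (the actual tool performs an equivalent bottom-to-top pass over the whole array), and the grid representation and helper names are hypothetical.

/* Hypothetical sketch of passthrough chain insertion with re-utilization. */
#define OP_NONE 0
#define OP_PASS 1

typedef struct { int op; int src_row, src_col; } cell;  /* src_*: the value a PASS forwards */

extern cell grid[32][32];
extern int  row_width[32];

/* Make the value produced at (src_row, src_col) available to a consumer in
 * 'dst_row'; returns the column, in row dst_row - 1, that feeds the consumer. */
int route_through(int src_row, int src_col, int dst_row)
{
    int cur_row = src_row, cur_col = src_col;

    for (int r = src_row + 1; r < dst_row; r++) {
        int found = -1;
        /* reuse an existing passthrough forwarding the same value, if any */
        for (int c = 0; c < row_width[r]; c++)
            if (grid[r][c].op == OP_PASS &&
                grid[r][c].src_row == cur_row && grid[r][c].src_col == cur_col) {
                found = c;
                break;
            }
        if (found < 0) {                  /* otherwise append a new passthrough */
            found = row_width[r]++;
            grid[r][found].op = OP_PASS;
            grid[r][found].src_row = cur_row;
            grid[r][found].src_col = cur_col;
        }
        cur_row = r;                      /* the chain continues from this row */
        cur_col = found;
    }
    return cur_col;
}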

The generate-based Verilog template instantiates as many FU input multiplexers as required, fetching the appropriate control bits from the routing registers. An auxiliary file is also produced and given to the CR generation tool: the number of accelerator interface registers needs to be known in order to compute each register's address for the PLB CR.

Listing 4.1: Reconfiguration information placed in C containers

#include "graphroutes.h"

// Routing registers for megablock0
int graph0routeregs[NUMROUTEREGS + NUMFEEDBACKREGS]
    = { 0x1110040, 0x58c24c8, 0x8d1, 0xa5 };

// Routing regs array
int *graphroutings[1] = { graph0routeregs };

// Masks for exit FUs
int branchmasks[1] = { 0x1 };

Listing 4.2: Reconfiguration information placed into a read-only memory module

module config_bram(clk, rst, addr, dataout);
    parameter N_REGS = 7;

    input clk, rst;
    input [clog2b(N_REGS) - 1 : 0] addr;
    output reg [31 : 0] dataout;

    reg [31 : 0] cfgmem[N_REGS - 1 : 0];

    initial begin
        cfgmem[0] = 32'h1150040;
        cfgmem[1] = 32'h2c0d2644;
        cfgmem[2] = 32'h163446;
        cfgmem[3] = 32'h356;
        cfgmem[4] = 32'h0;
        cfgmem[5] = 32'h1;
        cfgmem[6] = 32'h6;
    end

    always @(posedge clk)
        dataout <= (rst) ? 0 : cfgmem[addr];
endmodule

As Section 4.3 will show, this accelerator architecture was integrated into three different system architectures. Based on which system module reconfigures the accelerator, the routing information may be output in several formats. Listing 4.1 shows how this information is produced in order for it to be used by the auxiliary MicroBlaze. The per-configuration routing register values are held in C structures which are written via the bus to the accelerator. It is also possible to have the CR generation tool include this process in each CR itself (not shown). Finally, Listing 4.2 shows a read-only memory module used to include this information into the injector.

Two additional steps are required when translating multiple CDFGs. Firstly, before the placement of new nodes, the existing accelerator's depth (stored in the binary file) is compared to the depth of the new CDFG; if the former value is lower than the latter, the existing configurations have to be updated by inserting passthrough chains which transport data from the previous maximum depth to the new one. This is due to the architectural limitation of this accelerator design, which only allows for feedback of values from the very last row of the array. Secondly, previously generated routing register values need to be re-generated if the width of any row changes (as the number of inputs and outputs varies), or if new passthroughs are inserted as explained, since routing is necessary through the new rows.

The CR generation process explained in Section 3.3 applies in this implementation: either PLB- or FSL-based CRs are generated along with the injector address table. The only other additional purpose of this tool is to generate the read-only memory modules for injector-based reconfiguration of the accelerator (as per Section 3.4) or to include this information into the CRs.

(a) System1 - External code/data system memory and PLB based accelerator

(b) System2 - Local code/data system memory and PLB based accelerator
(c) System3 - Local code/data system memory and FSL based accelerator

Figure 4.4: System level variants used for evaluation, with minor accelerator and auxiliary hardware differences

These tools execute fully in the order of seconds, and their runtime scales in proportion to the number of CDFGs to translate. Most of the runtime is due to file handling. For this implementation, CDFGs with memory access operations or floating-point operations are not supported, and the lack of a more sophisticated node scheduling leaves resource re-utilization under-exploited. In later design iterations, the translation tool is extensively overhauled to address these limitations and to support new accelerator architectures.

4.3 Experimental Evaluation

4.3.1 Hardware Setup

The accelerator underwent minor design modifications through the three experimental validations performed, each relying on a different system architecture. Regardless, all three accelerator implementations are comparable, as the major execution model and design is unchanged. Figure 4.4 shows all three systems. They contain the custom accelerator instance, a MicroBlaze processor, and the injector. Also, vendor timer modules (not shown) attached to a PLB measure execution times. For these implementations, the timers were enabled, disabled, and read explicitly by the code running on the MicroBlaze.

Figure 4.4a shows the earliest system architecture. For this implementation, the accelerator and MicroBlaze communicate via a PLB. The MicroBlaze accesses external memory to retrieve code and data, and does not use data or instruction caches. Supporting instruction caches would require injector-level support for the cache interface behaviour and would also introduce issues with the migration mechanism. Since the caches are internal to the MicroBlaze, the migration could fail if the instruction to be replaced already resided in the cache.

The injector module used here communicates the address of the detected Megablock to an additional MicroBlaze processor. This auxiliary processor has three functions. First, it copies the CRs to the external memory at boot, so that the main MicroBlaze can later read them. Secondly, it writes configuration information via the PLB to the accelerator; during this time the main MicroBlaze is idle. Thirdly, it sends to the injector the CR address corresponding to a detected Megablock start address. This second processor was used to simplify accelerator reconfiguration and communication, since it is simpler to develop these tasks through software on an early design. Its program code, as well as the accelerator reconfiguration data, are held in local memories (not shown).

The second system design iteration, shown in Fig. 4.4b, removed the auxiliary processor and instead augmented the CR to include configuration information. If required, it is written to the accelerator during execution of the CR. Also, local memories are used for the MicroBlaze's code and data. The modification was made because: 1) the kernels supported by the current accelerator were too small to justify the overhead of external memory, and 2) since the reconfiguration information now resides in the CRs, using a shared external memory between the auxiliary and main MicroBlaze is no longer required. The injector is now capable of generating the instructions to place on the bus autonomously, and was modified to interface with the LMB. In certain ways, this simplified the design due to the simpler and deterministic nature of this bus.

Finally, Fig. 4.4c shows a final attempt to reduce the overhead of invoking the accelerator by using an FSL between it and the MicroBlaze, and by performing the transmission of operands and configuration values in parallel. The injector now contains a local memory with the configurations, which it sends via a separate FSL to the accelerator.

The development board used for this evaluation was a Digilent Atlys, which contains a Spartan-6 LX45 FPGA and an external 128 MB DDR2 memory. The MicroBlaze version used was 8.00a, set for Performance and with the integer multiplication, barrel-shifter and pattern-comparison units enabled. Xilinx's EDK 12.3 was used for system synthesis and bitstream generation. All generated systems were fed with a 66 MHz clock. The same clock signal is used for all system modules, accelerator included, in nearly all cases. Exceptions are explained in the following sections.

4.3.2 Software Setup

A total of 15 code kernels were used for this evaluation. The kernels employed are simple single-file applications which call a kernel function a given number of times, N. For these experiments, N = 500 for all cases. The kernels used are synthetic examples of simple data-processing loops. They contain no floating-point operations or memory accesses, and operate on 32-bit integer values. The input data given to the kernels are statically stored as global arrays. Since the applications are simple, only one candidate Megablock, corresponding to the kernel function, was extracted from each.

In order to evaluate accelerators with several configurations, two additional applications which call 6 code kernels each were used (merge1 and merge2). In other words, two of the evaluated accelerators support 6 configurations. For both cases an evaluation was made of the overhead of calling each kernel N times in sequence (m1 and m2), and of calling each kernel a total of N times alternately (m3 and m4).

In order to measure execution times a single timer module was enabled through software before the kernel calls, and read afterwards. The output data is also read so as to compare the execution of the software and accelerated runs. By calling each kernel N = 500 times, the small C level overheads of enabling the timer and of function calls are amortized. Listing 4.3 shows the (simplified) code for one of the used benchmarks. The for loop within the function call is the accelerated portion in this case.

Listing 4.3: Simplified code for even ones benchmark

int evenOnes(int temp, int Num) {

    int cnt = 0;
    for (int i = 0; i < Num; i++) {
        cnt ^= (temp & 1);
        temp >>= 1;
    }
    return cnt;
}

int main() {

    int i, result = 0;

    tmrStart(&XPS_Timer);

    for (i = 0; i < 500; i++)
        result += evenOnes(i, 32);

    tmrStop(&XPS_Timer);

    int cycles = tmrRead(&XPS_Timer, 0);
    printf("time:%d\r\n", cycles);
    printf("res:0x%x\r\n", result);
    return 0;
}

This method of performance measurement is intrusive and rudimentary but not a critical issue given the mostly synthetic nature of the benchmark code. The implementations in the following chapters employ more sophisticated and application-transparent methods to retrieve these metrics. The applications were compiled with mb-gcc 4.1.2 using the -O2 flag and additional flags which enable the use of barrel shifter, integer multiplication and pattern comparison instructions. The CDFG extraction process did not employ loop unrolling for most cases.

4.3.3 Characteristics of the Generated Accelerators

Table 4.1 lists all kernels, along with a shorthand alias. From each benchmark a single Megablock trace was accelerated. The third column shows the number of MicroBlaze instructions in the trace, and the next column shows the average number of iterations performed per occurrence of the trace. Given the simple nature of the target loops, the resulting traces contain few instructions. The highest number of instructions occurs for i13, since the detected trace is of an unrolled inner loop. Despite the small number of instructions, the CDFGs for these traces have an average ILP of up to 2.9. Most loops execute a constant number of times (32 or 16). Kernels i3, i7, i9, i10, i11, and i12 iterate a number of times dependent on the input values. For example, for i3 the number of iterations is an arithmetic progression of an input value.

Table 4.1: Extracted Megablock and generated accelerator characteristics

ID     Kernel       # Trace Insts   Avg # Iterations   IPC_SW   # FUs   # Passes.   # Rows   IPC_HW
i1     count             6.0              32.0           0.86     6.0       6.0        3       2.00
i2     evenones          6.0              32.0           0.86     5.0       4.0        3       2.00
i3     fibonacci         6.0             249.5           0.86     4.0       6.0        3       2.00
i4     hamdist           6.0              32.0           0.86     6.0      11.0        3       2.00
i5     popcnt            8.0              32.0           0.89     8.0       7.0        3       2.67
i6     reverse           7.0              32.0           0.88     7.0       9.0        3       2.33
i7     compress          8.0              17.3           0.89     8.0      21.0        4       2.00
i8     divlu             5.0              31.0           0.83     5.0       4.0        3       1.67
i9     expand            8.0              17.3           0.89     8.0      21.0        4       2.00
i10    gcd               8.0              41.3           0.89     8.0      17.0        6       1.33
i11    isqrt             6.0              16.0           0.86     6.0       9.0        3       2.00
i12    maxstr            4.0              30.0           0.80     4.0       6.0        3       1.33
i13    popcount3        31.0             500.0           0.91    18.0      33.0        9       3.44
i14    mpegcrc          15.0              31.0           0.94    14.0      32.0        7       2.14
i15    usqrt            18.0              16.0           0.95    17.0      42.0        8       2.25
mean                     9.5              74.0           0.88     8.3      15.2        4.3     2.08

m1     merge1            6.3             159.5           0.86    16.0      12.0        3       2.10
m2     merge2            6.8              29.0           0.87    24.0      35.0        6       1.64

The next column, IPC_SW, represents how many Instructions Per Clock cycle (IPC) the MicroBlaze is capable of executing. Most MicroBlaze instructions are single-cycle, but branch instructions have either 2 or 3 cycles of latency. Each trace has at least one backwards branch instruction. This means less than one instruction is executed per clock cycle throughout a complete execution of a single trace iteration. The values shown for IPC_SW hold only for System 2 and System 3, where there is a 1 clock cycle latency for instruction fetch.

The following columns contain the characteristics of the generated accelerators. The #FUs column shows the total number of units in the array, including passthrough units, while the next column accounts only for the passthroughs. Due to the array structure and to the amount of data to transport across iterations, passthrough units frequently outnumber other FUs. On average, there are 1.67 passthroughs per other FU. However, they are frequently re-utilized in multi-configuration cases, since the same behaviour occurs while translating any CDFG.

The #Rows column shows the number of rows in the array. Since the CDFG nodes are implemented by single-cycle FUs, the Critical Path Length (CPL) of the CDFG determines the array's depth. Rows execute one at a time, so the IPC on the accelerator, IPC_HW, is simply the number of trace instructions over the array's depth. Scheduling operations onto different rows (e.g., via list scheduling) would have no impact on performance, but there is a potential impact on resource usage, as the next chapters will show. Ignoring any overheads, speedups are obtained when IPC_HW is larger than IPC_SW. As a general rule, this architecture performs best when executing CDFGs whose number of instructions is high and whose CPL is low.
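Stated formally, using the quantities in Table 4.1 and ignoring all communication overheads, the per-iteration throughput of this non-pipelined design and the resulting bound on speedup are:

\[
  \mathrm{IPC_{HW}} = \frac{\#\,\text{trace instructions}}{\#\,\text{rows}},
  \qquad
  \text{Speedup} \approx \frac{\mathrm{IPC_{HW}}}{\mathrm{IPC_{SW}}} \quad \text{(overheads ignored)}.
\]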

The value of IPC_HW is also indicative of how many units are not idle per cycle. Inversely, it is possible to calculate how many are inactive per cycle. Considering the number of FUs in each array, the number of rows it has, and its IPC_HW, this results in an average of 6.4 units being idle per cycle (not including passthroughs). Improving the per-cycle usage of resources for 2D accelerators was addressed in later implementations.

The number of FUs in a single-loop array is, at most, equal to the number of MicroBlaze operations in the originating trace. Due to extraction- and translation-level optimizations, the number of FUs, excluding passthroughs, is usually lower. For example, the 31 instructions of the loop trace from i13 are implemented by 18 FUs. Constant propagation allows for the removal of some operations, especially when the MicroBlaze's imm instruction is used. This instruction loads a special register with the upper 16 bits of a 32-bit immediate operand. The behaviour of these operations is implemented as constant inputs when specifying the FU input multiplexers.

The mean row of Table 4.1 contains arithmetic averages for these parameters, and only includes benchmarks i1 to i15. For m1 and m2, several kernel functions are called. Benchmark m1 calls the kernel functions i1 to i6, and m2 calls kernels i7 to i12. For these two cases, the values shown for the number of trace instructions, number of iterations and IPC_HW are weighted averages which consider the individual kernels composing each case. Aspects such as the resulting number of FUs and depth of the array are dictated by the maximum of each individual case. So, for instance, an array supporting both i7 and i8 will have a depth of 4. This causes the total number of required passthrough units to increase for these cases. Also, the current execution model enforces that all rows are activated before an iteration is completed. This means that this accelerator architecture causes some CDFGs to under-perform due to artificially increasing their CPL. This was taken into account when grouping kernels for m1 and m2. In m1 all the kernels have a depth of 3, so the resulting resource cost of the accelerator could be evaluated in a scenario where the execution of the individual CDFGs would not suffer a performance impact. Benchmarks i13, i14 and i15 are left out of these groups since their CDFGs are considerably larger than all other cases. Benchmark m2 groups all remaining cases. The general rationale for these groups was that similar CDFGs should be grouped together, since this would promote more resource re-utilization. Finally, a specific strategy for choosing kernels to group was not wholly important, as the main objective at this point in development was to verify the functional correctness of a multi-loop accelerator, both at an architecture and translation tool level.

One of the objectives of creating multi-configuration accelerators is to reuse units between configurations, ideally resulting in a resource usage lower than the sum of resources that would be required by deploying several single-loop accelerators. For m1 and m2, the sum of FUs (excluding passthroughs) of their respective individual cases is 36 and 39. The two combined accelerator instances instead contain 16 and 24 FUs, even though the re-utilization is limited to nodes within the same topological level.
Interpreted another way: the number of FUs required for m1 and m2 is 2.0× and 3.0× higher than the maximum number of FUs required amongst the respective individual benchmarks.


Figure 4.5: Speedups for several types of system architectures vs. a single MicroBlaze processor. Labels shown for System 3.

The passthrough units are more frequently shared between configurations. Using the same metric, the number of passthroughs required for m1 and m2 is 1.1× and 1.7× higher than the maximum for the respective single-configuration accelerators.

The average IPC_HW of each loop supported in m1, when each executes in its respective single-configuration accelerator, is 2.17. This does not change when executing the same loops in the multi-configuration accelerator of m1. In contrast, for m2, the average IPC_HW of the individual cases, when each executes in a dedicated accelerator, is 1.72. The IPC_HW for the same loops on the accelerator of m2 is 1.08, since some configurations are penalized by the higher CPL.

4.3.4 Performance vs. MicroBlaze Processor

Fig. 4.5 shows the measured speedups for all three system architectures employing the same accelerators. The speedups for each case scale depending on the overhead imposed by each system's interfaces. The speedup mean shown is geometric and includes benchmarks i1 to i15. Labels are shown for System 3 (the best results for a realistic baseline).

Results for System 1 are shown on the right axis for readability. Speedups are much higher for this system since the software baseline is very pessimistic: code and data reside in external memories and the MicroBlaze does not have either type of cache. It was measured that an average of 23 clock cycles is required to retrieve an instruction via the PLB. The version of the bus used did not support burst mode. Executing the several operations found in a single accelerator row requires only 1 clock cycle, hence the marked speedups despite the CR overhead still incurred due to PLB-based communication with the accelerator for reconfiguration and data transport. Given that software execution is so hindered in this scenario, the acceleration results should not be considered realistic. Regardless, the fully functioning system provides the proof-of-concept for the migration approach, architecture and tools.

The remaining two systems can be considered realistic scenarios. For System 2, the geometric mean speedup is 1.03×, and slowdowns occur for 9 out of the 15 cases. For the cases where a speedup is possible, the geometric mean is 1.48×. Slowdowns occur if the time spent on the accelerator plus the CR execution time exceeds the software-only execution time of the loop in question. Given the small number of iterations performed per accelerator call for these test cases, the overhead becomes very significant. Consider only kernels i1 to i6. They all perform the same average number of iterations (save for i3), but two other factors cause minor performance variations. For instance, between i5 and i6, it is the higher IPC_HW possible for i5 which justifies the performance difference. This in turn is due to the trace for i5 containing a single additional instruction relative to i6, coupled with the fact that the number of accelerator rows is equal. For these two cases, the communication overhead introduced is equal: the CRs contain the same number of instructions.
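The slowdown condition described above can be summarized as follows, where T_SW is the software-only execution time of the loop, T_HW the time spent executing iterations on the accelerator, and T_CR the execution time of the Communication Routine (these symbols are introduced here only for this summary):

\[
  \text{Speedup} = \frac{T_{SW}}{T_{CR} + T_{HW}},
  \qquad
  \text{slowdown} \iff T_{CR} + T_{HW} > T_{SW}.
\]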

When comparing i1 and i2, we find that the IPCHW is the same, as well as the average number of iterations. The CR for the former requires 108 clock cycles to execute (excluding the time required to reconfigure the accelerator, which in this implementation is embedded in the routines) and the latter 139. This results in a speedup for the former and a slowdown for the latter. The difference between CR execution times is small and only relevant because of the small number of iterations, but this comparison helps to demonstrate the effect of overhead on speedup.

For System 3, the use of dedicated links for communication reduced both the data transfer latency and the number of instructions in the CRs themselves, due to the different types of instructions required. The resulting geometric mean speedup is 1.69× and no slowdowns occur. As the impact of overhead diminishes, the achievable speedup approaches the IPCHW value.

Finally, consider the benchmarks m1 and m2. The weighted geometric average of the speedups for i1 up to i6 is 2.04× for System 3. The speedup for m1 for the same system is 2.02×. That is, the acceleration achieved for each individual loop does not decrease relative to their single-loop accelerators, since each loop's IPCHW does not decrease. The marginal performance decrease is due to additional reconfiguration overhead. Performing the same comparison for m2: the weighted geometric average for i7 up to i12 is 1.34× for System 3, but the measured speedup for m2 is 0.99×. In other words, the individual speedups of each accelerated loop possible in System 3 are lost when using this multi-loop accelerator. This was already associated with the decrease in IPCHW, but this accelerator (for m2 and m4) also happens to suffer from higher reconfiguration overhead as well. The impact of overhead on performance, especially for m3 and m4, is discussed shortly.

In short, the attainable speedup is influenced by two major parameters: the IPCHW achievable on the accelerators and the communication overhead. The speedup is higher with a higher IPCHW, and a low overhead prevents degradation of the performance gains of the accelerators. The IPCHW is a function of the CDFGs and mostly of the accelerator architecture. For this evaluation, the variable parameter is the overhead of the several systems. Table 4.2 contains the per-benchmark overhead for System 2 and System 3, and also the number of instructions and number of clock cycles required to execute the CRs based on the interface type. The second column shows the number of MicroBlaze instructions in each PLB-based CR. On the left-hand side is the number of instructions which execute if reconfiguration of the accelerator is not necessary, and inside the parentheses is the additional number of instructions needed otherwise. Likewise, the next column shows the respective number of clock cycles required, considering local code and data memory only.

Table 4.2: Communication Routine characteristics and Overheads for Systems 2 and 3

ID     # PLB CR Inst.   # Cycles         Overhead (System 2)   # FSL CR Inst.   # Cycles       Overhead (System 3)
i1     27 (+24)         108 (+91)        53%                   12               12 (+4)        11%
i2     34 (+25)         139 (+92)        59%                   18               18 (+3)        16%
i3     27 (+21)         124 (+80)        14%                   14               14 (+2)        2%
i4     35 (+25)         132 (+92)        58%                   17               17 (+4)        15%
i5     35 (+25)         132 (+92)        58%                   17               17 (+4)        15%
i6     35 (+25)         132 (+92)        58%                   17               17 (+4)        15%
i7     35 (+33)         148 (+139)       68%                   19               19 (+6)        22%
i8     25 (+17)         90 (+68)         49%                   10               10 (+3)        10%
i9     35 (+32)         148 (+115)       68%                   19               19 (+5)        22%
i10    32 (+29)         113 (+104)       31%                   15               15 (+6)        6%
i11    34 (+24)         123 (+91)        72%                   16               16 (+4)        25%
i12    25 (+21)         90 (+80)         50%                   10               10 (+4)        10%
i13    37 (+39)         142 (+161)       3%                    18               18 (+9)        0%
i14    36 (+42)         157 (+172)       42%                   20               20 (+8)        8%
i15    31 (+49)         160 (+195)       56%                   18               18 (+11)       12%
mean   32 (+29)         129 (+111)       49%                   16               16 (+5)        13%

m1     32.2 (+32.8)     127.8 (+138.8)   38%                   15.83            15.8 (+6.7)    7%
m3     —                —                57%                   —                —              10%
m2     31.0 (+53.2)     118.7 (+223.2)   44%                   14.83            14.8 (+15.2)   9%
m4     —                —                69%                   —                —              16%

For System 3, operands are sent by executing the CR, and reconfiguration is performed in parallel by the injector. The left-hand value in the sixth column is the number of clock cycles required to execute the CR alone, and the value in parentheses is the number of cycles the injector requires to reconfigure the accelerator.

For nearly all cases, each CR executes 500 times. The only exceptions are i13 and m1 to m4. For the former, the CR executes only once, and for the latter cases a total of 500 × 6 = 3000 CR executions are required. The resulting overhead is shown for both cases as a percentage of clock cycles spent from the moment the injector intervenes until the return to software. The lowest overhead for all cases occurs for i13, for a particular reason. While all other kernels are called N = 500 times, this function was inlined into the calling location. Since the kernel loop only iterated three times, it was unrolled. This resulted in a larger Megablock trace which comprised all 500 × 3 = 1500 iterations of the inner loop. In this case, a single iteration on the accelerator corresponds to the three unrolled iterations of the inner loop. Since the accelerator is only invoked once, the CR execution penalty is only incurred once. For this case it is observable how a low overhead increases the speedup to a value close to the IPCHW.

As was already explained, System 1 (not shown in Table 4.2) suffers the most overhead. Considering the number of cycles to return to software after the injector intervenes, the average communication overhead for this system (excluding m1–m4) accounts for 84 % of the time. Since CRs are in external memory, the communication time is high relative to accelerated time. For System 2, all code and data reside in local memories. The average communication overhead for this case is 53 %. For this case, slowdowns occur frequently since the low number of iterations per accelerator call does not amortize the communication overhead now that MicroBlaze execution is realistically efficient. The average number of iterations performed on the accelerator per call is 74.0, as shown in Table 4.1. Given the average number of rows, this means the accelerators execute for approximately 320 consecutive clock cycles. In comparison, if reconfiguration is not required, a PLB-based CR contains an average of 32.2 instructions, which require 129 clock cycles to execute. Finally, for System 3, the average communication overhead is 26.5 %. Writing configuration values to the accelerator in parallel with CR execution allows for mitigation of some overhead, but most of the reduction in these experiments is due to the use of the FSL-based CR.

Consider again cases m1 (m3) and m2 (m4). For m1 and m2, the 6 called loops are executed sequentially. The accelerators for m3 and m4 are equal to m1 and m2, respectively, but the supported kernel functions are called differently. The former case is the one for which overhead is lowest: all calls of each supported loop occur before the following (i.e., the loop for i1 is called 500 times, followed by i2, etc.). This means that the accelerator only needs to be reconfigured 6 times. For the latter case, the first supported loop is called once, followed by the second, etc.
That is, the accelerator needs to be reconfigured per call, introducing the most overhead.

Also, the CRs are affected by accelerator size and number of configurations, since this affects the amount of reconfiguration values. For instance, execution of the kernel from i1 on the accelerator for m1 may incur more overhead when executing the respective CR, relative to the accelerator for i1. The accelerator for m1 requires 1.35× the MicroBlaze instructions for PLB-based reconfiguration alone (i.e., excluding transfer of operands/results), relative to the average of the respective individual cases. For m2, the increase is 2.04×. This corresponds to an increase of 1.22× and 1.56× in the number of clock cycles required to execute a CR. For the FSL-based equivalents, the increase in number of clock cycles is similar: 1.15× and 1.55× for m1 and m2, respectively.

Given these two factors (i.e., how often reconfiguration is required and how lengthy the reconfiguration process is), it is observable when comparing m1 and m2 to m3 and m4, respectively, that despite additional reconfigurations the speedup is not affected when FSL-based CRs are used (i.e., System 3). For instance, for m1 and m3, the PLB-based communication corresponds to 38 % of total execution time for the former case and 57 % for the latter. For the FSL case the increase is much lower, from 7 % to 9 % of clock cycles. This holds for m2 and m4, but the additional reconfiguration overhead is higher since the respective accelerator requires more configuration information.

Considering all these factors, a speedup estimation on a per-loop basis can be made using Eq. (4.1).

The number of accelerated MicroBlaze instructions is represented by NrInsts, and the number of required FUs (passthroughs excluded) by NrFUs. The NrHWCycles and NrSWCycles parameters are the number of cycles required to execute the Megablock fully in hardware and software, respectively. The number of iterations performed on the accelerator is given by Nit and the number of overhead clock cycles by OHc. This factor must include the CR execution, the number of cycles required to execute the last iteration of the loop through software, and the small overhead introduced by the injector-driven migration. The overhead lowers the attainable speedup by increasing the denominator in the ratio between the software and accelerator IPC.

\[
\text{Speedup} \simeq \frac{\mathit{NrInsts}}{\mathit{NrFUs}} \times \frac{\mathit{NrSWCycles}/\mathit{NrInsts}}{\mathit{NrHWCycles}/\mathit{NrFUs} + \mathit{OHc}/(\mathit{Nit} \times \mathit{NrFUs})} \tag{4.1}
\]

Alternatively, Eq. (4.2) directly uses the IPC values computed as presented in Table 4.1.

\[
\text{Speedup} \simeq \frac{1/\mathit{IPC_{SW}}}{1/\mathit{IPC_{HW}} + \mathit{OHc}/(\mathit{Nit} \times \mathit{NrFUs})} \tag{4.2}
\]

This formula has an average estimation error of 2.6 % and 2.2 % for System 2 and System 3, respectively. Using this formula, the maximum potential speedup can also be estimated by considering that OHc is zero: the maximum geometric mean speedup for these benchmarks would be 2.51×. Note that to account for the external memory effect in System 1 the IPCSW would have to be computed accordingly. Also, this formula is valid for this accelerator execution model, where the total number of FUs on the array (used for a given configuration) corresponds to the total number of operations to execute.
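To make the estimate concrete, the following C sketch evaluates Eq. (4.2) for a single loop. The function and parameter names (estimate_speedup, ipc_sw, ipc_hw, oh_cycles, nr_fus) are assumptions made for this illustration, and the input values are hypothetical rather than measurements from Table 4.1.

#include <stdio.h>

/* Eq. (4.2): software CPI over accelerator CPI plus the share of the
 * overhead cycles OHc amortized over the Nit iterations and the NrFUs
 * operations executed per iteration. */
static double estimate_speedup(double ipc_sw, double ipc_hw,
                               double oh_cycles, double iterations,
                               double nr_fus)
{
    double sw_cpi = 1.0 / ipc_sw;
    double hw_cpi = 1.0 / ipc_hw + oh_cycles / (iterations * nr_fus);
    return sw_cpi / hw_cpi;
}

int main(void)
{
    /* Hypothetical loop: the values below are illustrative only. */
    double ipc_sw = 0.8, ipc_hw = 2.2, nr_fus = 20.0, iterations = 100.0;

    for (double oh = 0.0; oh <= 200.0; oh += 50.0)
        printf("OHc = %5.1f cycles -> estimated speedup = %.2f\n",
               oh, estimate_speedup(ipc_sw, ipc_hw, oh, iterations, nr_fus));
    return 0;
}

Setting oh_cycles to zero reduces the expression to IPC_HW/IPC_SW, the per-loop bound behind the 2.51× estimate mentioned above.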

4.3.5 Resource Requirements and Operating Frequency

The MicroBlaze processor used for these systems (no floating-point unit and no caches) requires 1359 and 1068 LUTs and FFs, respectively, according to the synthesis reports. Using this as a metric, it can be found that the average accelerator requires 2.66× the LUTs and 1.36× the FFs of a MicroBlaze. The number of required LUTs and FFs for all cases, normalized to the cost of a MicroBlaze, is shown in Fig. 4.6, along with the reported synthesis frequency. The entire system (accelerator, MicroBlaze, buses, etc.) requires an average of 5375 LUTs and 2777 FFs, according to post place and route reports. In other words, the accelerator corresponds approximately to 62 % and 51 % of the system's total resources on average. The accelerators do not require any BRAMs. The resource-related averages presented do not include m1 and m2.

The number of required LUTs and FFs scales in a reasonably linear fashion with the number of FUs. Each FU incurs a LUT cost and requires adding a new O-to-I multiplexer, where O is the number of outputs in the previous row and I the number of FU inputs. Its output(s) must also be stored in 32-bit registers. The passthrough units only represent additional FF cost, which also scales linearly with the number of these units. The correlation coefficient between LUTs and FUs is 0.91, and 0.97 between FFs and FUs. The accelerators for benchmarks i13, i4 and i15 require the most resources, given that they contain the most FUs and also have the highest number of rows.

For nearly all cases, the critical path usually involves an adder carry chain. Variations mostly involve the multiplexer complexity. For instance, for i2, i3, i4 and i5 the critical path is determined by an adder carry chain, but the multiplexer feeding one of the inputs is more complex in i3 relative to the other cases. Four cases have their maximum frequency determined by other elements.


Figure 4.6: Synthesis frequency and resource requirements of the generated accelerators, normalized to the resource requirements of a single MicroBlaze

For i1, this is due to the complex barrel shifter unit; for i7, the critical path is related to a much simpler constant shift unit; and for i14 and i15, the frequency is determined by a multiplication unit. It is difficult to determine the reason for the difference between i14 and i15, since the synthesis reports only state that most of the delay in the critical path is due to the DSP unit (which implements the multiplication FU). The combinatorial delay through the DSP is different for both cases, and the registers which feed the DSP have a much greater fanout in i15 relative to i14. Only the accelerator for i15 had a synthesis frequency below the target frequency of the system (66 MHz), so this case was implemented at 33 MHz, for purposes of functional validation. The systems for m1 and m2 were also implemented at 33 MHz despite the higher synthesis frequency of the accelerators. This has an obvious implication on speedups, since the actual time required to execute the loops increases relative to a non-accelerated MicroBlaze running at 66 MHz. However, the conclusions drawn previously regarding execution performance hold, as this is a technological and not an approach-related issue. Later chapters improve the accelerator architecture regarding both resource usage and frequency. Also, this validation used an FPGA geared towards low-power applications, with a small amount of resources, which hinders synthesis of large systems.

It is expected that supporting multiple loops increases resource consumption and may decrease synthesis frequency, as more FUs incur a cost by themselves and, especially, by increasing the width and depth of the array, may introduce an even higher cost associated with connectivity. For m1, the number of LUTs used by the accelerator is 2.15× higher than the maximum amount of LUTs used by the accelerators of cases i1 to i6. Likewise, the number of FFs is 1.34× higher. However, the amounts of LUTs and FFs are only 0.46× and 0.26× as much as is required by the total sum of the individual accelerators, respectively. This highlights the savings versus deploying multiple single-loop accelerators. For m2, the accelerator requires 0.65× and 0.39× the amount of LUTs and FFs relative to the sums used by i7 up to i12.

Relative to frequency, the critical path for m1 is found between the input multiplexer of the array, the first row's multiplexer and a barrel shifter unit in the first row. It is essentially the same critical path as i1. However, the synthesis frequency of 92.7 MHz is actually marginally higher than that case (the minimum of the respective individual cases). For m2, a path through the input multiplexer and an adder carry chain leads to a frequency of 109 MHz, which is 0.89× the minimum of the individual cases. Given that this is still above the system clock frequency, and that both the MicroBlaze and accelerators operate at the same frequency, this decrease is not significant. In a system where the overall operating frequency was regulated based on the maximum accelerator frequency, eventual decreases due to supporting multiple loops in a single accelerator would have consequences on performance.

The previous section discussed how the CRs may increase in size due to additional interconnects to reconfigure as more resources are added, which happens in these two cases supporting multiple loops.
As a comparison, the total number of multiplexer configuration bits required by m1 and m2 is 208 and 436, respectively. This is an amount 2.08× and 2.42× higher than the maximum number of bits required amongst the respective individual accelerators.

As a final note, consider the cost of the injector per implemented system. For System 1, the PLB-type interface introduces a total cost of 239 LUTs and 78 FFs. For System 2, the different interface requires less logic, but now the injector is more complex internally, resulting in 164 LUTs and 45 FFs. For System 3, the internal memory holding accelerator configurations requires 1 BRAM, and the remaining injector logic has a cost of 209 LUTs and 72 FFs.

4.4 Concluding Remarks

This chapter presented initial design iterations on fully functional transparent binary acceleration systems relying on automated hardware/software partitioning. The general approach was validated together with the capabilities of the translation tools. Experimental evaluations were conducted using on-chip implementations for execution time measurements.

Three system types were studied in the process of determining the more appropriate environment for further accelerator design iterations. The design for System 1 seems excessive, but note that it was initially conceived with the objective of eventually porting most of the hardware/software partitioning workload to runtime. Hence the auxiliary MicroBlaze, which would be responsible for Megablock processing and configuration generation for a runtime reconfigurable accelerator architecture. Initially relying on external memory was also a forward-looking choice towards the support of larger applications in future experimental evaluations. For the same reason, the accelerator architecture was oriented towards a 2D grid to simplify the task of future runtime embedded translation tools. The use of crossbars provides a simpler first approach in this scenario, as it would be more difficult to generate an interconnect structure at runtime versus generating configuration information for a rich static interconnect.

As the next chapters show, the development steers away from this direction, and continues to rely on offline generated CDFG information to focus on accelerator architecture improvements. The next two chapters particularly address limitations and issues identified during this initial experimentation: 1) re-utilization of resources between configurations required improvement, 2) idle time of FUs within a single configuration had to be diminished, and 3) memory support was critically required to increase applicability and performance. The next chapter deals essentially with this last point.

Chapter 5

Accelerators with Memory Access Support

This chapter augments both the accelerator and system level architectures to enable the accelerator to access the MicroBlaze's data memory. It was observed, during validation of the first accelerator design, that lack of support for direct memory access by the accelerator greatly limits applicability of the approach, since it prevents acceleration of realistic applications. Namely, the accelerator was incapable of treating data in a stream-like fashion, processing successive array elements per iteration. Instead, in order to process data, one or more data array elements were passed as register file operands per accelerator call. This meant that to iterate through an array with N elements, N accelerator calls were (generally) required. The inner loop iterations of each call processed a single datum per data array involved in the kernel. This is unlike realistic streaming applications, where the inner loop iterates through the arrays via sets of paired memory read and subsequent memory write operations.

This chapter retains the overall 2D accelerator design presented and introduces the use of load and store FUs which, via a bus sharing mechanism, can read/write the entire range of the MicroBlaze's local data memory. The outer accelerator layer is equipped with only two memory ports, which exploit access parallelism by using dual-ported BRAMs. Since there can be an arbitrary number of load/store FUs, two methods to arbitrate access to the ports are used: a runtime heuristic, and a tool-generated static schedule. The accelerator also underwent a number of optimizations aiming to reduce resource consumption: the use of crossbars is replaced with tailored multiplexers with minimal connectivity, and the use of passthrough units is optimized.
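The limitation can be illustrated with a hypothetical kernel in C; the names and the array size below are illustrative only and do not correspond to any of the benchmarks. Without load/store support, only the arithmetic inside the callee can be targeted, so the accelerator is invoked once per element; with memory access support, the whole streaming loop becomes a single candidate trace.

#define N 500

/* Hypothetical per-element kernel: in the first design, only this arithmetic
 * could execute on the accelerator, with the operand passed via the register
 * file on every call. */
static int process_one(int x)
{
    return (x * 3) >> 1;
}

/* Without memory support: N accelerator invocations, one datum per call. */
void run_per_element(const int *in, int *out)
{
    for (int i = 0; i < N; i++)
        out[i] = process_one(in[i]);
}

/* With load and store FUs: the reads, the arithmetic and the write belong to
 * one loop trace, so a single accelerator call covers all N iterations. */
void run_streaming(const int *in, int *out)
{
    for (int i = 0; i < N; i++)
        out[i] = (in[i] * 3) >> 1;
}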

The accelerator architecture is described in Section 5.1. Section 5.1.2 specifically explains the handling of memory access units, and the system-level memory sharing mechanism. The redesigned translation tools are explained in Section 5.2. An evaluation is presented in Section 5.3 using a set of 37 benchmarks, which includes: a performance comparison with software-only execution, an analysis of the effects of memory access optimization and list scheduling, comments on the resource requirements and operating frequency, and finally a short power consumption analysis. Section 5.4 presents final remarks on the achieved improvements and still existing limitations.


Figure 5.1: 2D Accelerator with memory access logic

5.1 Accelerator Architecture

Figure 5.1 shows the re-designed accelerator architecture. Much of the existing structure is re-utilized relative to Chapter 4: the accelerator template relies heavily on synthesis-time parameters and Verilog generate statements to create a specialized instance. As before, the customization varies the type and number of units in the 2D array, which determine its width and depth. Instead of relying on full crossbars between rows, this design also customizes the connectivity. Only the minimally required set of FU connections exists, so that the required data flow for all supported CDFGs can be implemented. As Section 5.3 will show, this implementation relies on the dual-port BRAMs present in the FPGA architecture. The Memory Access Manager (MAM) shown on the left-hand side is coupled to the array and contains LMB master logic to drive the buses connected to the data memory. It also contains control logic to arbitrate access of the load and store units on the array to the two ports.

The accelerator is re-configured at runtime by the same method presented in the previous chapter: the injector detects the imminent execution of a translated trace by observing the instruction address bus. The transparent migration mechanism chooses one out of a set of configurations, which determines which FUs are active and all multiplexer selections and, in this architecture, also configures the arbitration logic. This section explains the structure of the accelerator, the support for memory access, and the execution model.

5.1.1 Structure of Functional Unit Array

The accelerator contains a 2D array of FUs, connected by runtime configurable multiplexers, tailored for a set of CDFGs. There is no limitation on the number of any type of FUs that can be instantiated, including load and store FUs, each of which is a separate type of unit. Also, integer division by a constant is now supported, via a reciprocal multiplication method [War02]. All other integer 32-bit arithmetic, logic operations, comparisons, and exit operations are still supported.

All FUs still execute in a single cycle, except for the load, store and integer division units. The load units have a latency of two clock cycles, which may increase due to access contention for the two memory ports. The store units add no latency to the execution. The data to write to memory is buffered if no ports are immediately available. However, idle cycles may still be introduced if the buffering capacity is exceeded. This is further explained in Section 5.2.2. The division FU has a constant latency of 2 clock cycles.

The operands are received into the set of input registers as the MicroBlaze executes a CR. The output registers are driven only by the last row and are read back at the end of execution. Data is registered at each FU output. These registers then feed the inputs of the FUs in the following row and, unlike the previous design, may also be directly connected to more distant, non-adjacent rows. Passthroughs are instantiated conditionally based on a manually adjustable parameter that either instantiates passthroughs in every row, instantiates no passthroughs at all, or instantiates passthroughs only at specified rows. The option to completely remove passthroughs was introduced in order to reduce the number of registers required for large accelerators. The capability to still introduce passthroughs in certain rows was kept to avoid long connections which lead to critical paths. Removing some, or all, such units has no implications on execution, because rows do not activate before their previous results have been consumed.

In the previous design, every FU input was driven by an M-to-1 multiplexer, where M was the number of outputs of the previous row. For this design, each multiplexer is optimized to fetch only the outputs required throughout all supported CDFGs. In order to do this, the translation tools output an additional set of parameters which specify which FU outputs drive each multiplexer. Section 5.2.3 contains an excerpt of this HDL specification. The generate directives in the HDL template instantiate a dedicated multiplexer per input for all FUs. The multiplexer is optimized away for cases where the connectivity is 1-to-1. Also, it is still possible to feed the FU inputs with synthesis-time constants. This implementation requires significantly fewer resources relative to the crossbar implementation in the previous chapter.

A configuration of the accelerator corresponds to one global multiplexer context, which is set at the start of execution. Fig. 5.1 shows this for the add FU in the first row, which for two configurations is fed by values retrieved from the feedback lines, and which is fed by a constant value for a third. Similarly, each configuration also enables or disables each FU using small single-bit multiplexers, as shown for the second load FU in the first row. This is only relevant for exit FUs, so that control progresses correctly, and for the store FUs, so that no erroneous accesses are issued.

The interface with the processor remains unchanged relative to the previous design. The exception is that only one configuration register is required to set the connection context. These architecture modifications do not imply a change in the migration mechanism or the CRs.

5.1.2 Memory Access Support

The MAM shown in Fig. 5.1 is responsible for granting the load and store units in the array access to either of the two existing memory ports. The translation process shown in Section 5.2 does not take the MAM into account when generating the array. It simply instantiates as many load and store units as required, as with any other type of unit. The MAM receives all signals from the load and store units, including addresses, data-out and data-in signals, along with enable and haveData control signals, which indicate outstanding data to be read or written. In Fig. 5.1, there are three units, resulting in a haveData bus of 3 bits. The MAM has static synthesis-time knowledge about which type of access each bit corresponds to. When enabled, a load or store unit asserts an access request and waits for a reply. A row containing loads only finishes executing once all such units complete their access. Store units may not cause the same effect, as they may buffer up to one datum. This means that simultaneous requests to the MAM may originate from different rows due to these outstanding stores.

There are two supported arbitration methods: 1) runtime selection logic which assigns a port to existing access requests based on the topological order of the units, or 2) a static access ordering created by the translation tools which aims to minimize access latency. In the first case, round-robin logic sequentially grants each unit access to the ports. There is an optional parameter to determine whether both ports handle all units and make mutually exclusive choices at runtime, or whether each port handles only half of the existing units on the array. For the implementation presented in this chapter, local data memory was used. Since the latency of the BRAMs is constant at one clock cycle, this leads to an eventual steady state in terms of which units access the bus at which time. The static access scheduling reproduces this by deterministically assigning two units to the two ports per cycle, thus removing the runtime arbitration logic. Since this is done by the translation tools, it is explained in detail in Section 5.2.2.

The Megablock instructions which compute the runtime addresses for memory access are implemented in the array as well. That is, no dedicated address generation logic is used, as this process is itself part of the accelerated trace. This means that arbitrary memory accesses are supported by the accelerator. The accelerator has no capability to issue exceptions when an access is attempted at a non-valid location (e.g., outside the existing memory range, or reads/writes into code memory). This is not expected to occur, however, since addresses are either specified as synthesis-time parameters which originate from compile-time constants in the trace instructions, or are computed by the MicroBlaze itself prior to accelerator invocation. Regardless, behaviour after such an event is undefined, although falling back to software after an access exception (e.g., address out of bounds) by the accelerator would be a trivial feature to implement.
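A minimal C model of the runtime selection logic is sketched below, assuming a plain round-robin over the haveData bits and two ports; the unit count, bit ordering and function name are assumptions made for the sketch and do not reflect the actual MAM implementation.

#include <stdint.h>

#define NUM_MEM_UNITS 3   /* e.g., the three load/store FUs of Fig. 5.1 */
#define NUM_PORTS     2

/* One arbitration step: scan the pending requests in topological order,
 * starting after the unit served last, and grant at most two ports.
 * Returns a bit mask with one bit set per unit granted this cycle. */
uint32_t arbitrate(uint32_t have_data, unsigned *last_served)
{
    uint32_t grants = 0;
    unsigned granted = 0;

    for (unsigned n = 1; n <= NUM_MEM_UNITS && granted < NUM_PORTS; n++) {
        unsigned unit = (*last_served + n) % NUM_MEM_UNITS;
        if (have_data & (1u << unit)) {
            grants |= 1u << unit;   /* this unit drives one LMB port now */
            *last_served = unit;
            granted++;
        }
    }
    return grants;
}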

In order to support memory access at the system level, an additional type of module is required: Fig. 5.2 shows the LMB multiplexer.

Figure 5.2: Local Memory Bus Multiplexer module

It drives a single BRAM port by selecting between the signals originating from the MicroBlaze and the signals originating from one of the accelerator's memory ports. The multiplexer routes memory replies back into the selected master. Prior to execution, the accelerator controls the switching signal, which causes the multiplexer to buffer up to one MicroBlaze access and reply with a wait signal. The LMB bus does not issue a time-out, so the MicroBlaze waits indefinitely. Also, the LMB multiplexer does not add any signals to the bus or latency to the bus transactions. When the switch signal changes, the downstream (e.g., master to bus) signals are immediately connected to the memory. The upstream signals are connected in the following clock cycle, so that the response to the previous access is directed to the correct master.

Sharing the data memory in this fashion is only possible through the use of local BRAMs. Despite this, some higher-end FPGA devices still contain a considerable amount of memory (e.g., Xilinx's Virtex UltraScale family [Xil15a], with upwards of 25 Mbit), which may be enough for some applications. The main advantage of this architectural design is that the entire data memory range is shared without any additional data transfer overheads. This would be the case if, for instance, the accelerator contained its own local memories. Also, by taking advantage of the readily available dual-port capability, it is possible to straightforwardly exploit data parallelism by using two LMB multiplexer modules. Section 5.3 shows this experimental setup. The LMB multiplexer was designed prior to the release of versions of Xilinx's LMB which support multiple masters. Using those modules for BRAM sharing would result in the same functionality.

5.1.3 Execution Model

Execution on the array begins as soon as the expected number of operands is received. The number of operands required varies per selected configuration, which is set by the injector after detection of the trace start address. All FUs in one row activate concurrently when that row's enable signal is set (and also according to each FU's configuration-dependent enable). In this implementation, each row may take more than one clock cycle to execute, for one of two reasons: there is a multi-cycle FU in the row (i.e., a load or a division), or stalls occur due to memory access contention. To support multi-cycle rows, the control logic now relies on handshaking signals.

Figure 5.3: Architecture-specific flow for this 2D accelerator design

A row asserts a done signal once every FU on the row issues its own completion signal. Only then is the enable signal of the following row asserted. This way, the control of the array is separated from its per-instance structure, and especially from the memory access behaviour.

After the first iteration, the iteration control module controls the input multiplexer so that data feedback is established. Like the previous design, the feedback of data is performed only from the last row to the first, and it is possible to feed back results from the newly completed iteration as well as the immediately previous one. Unlike the previous design, the possibility of connecting distant rows without intermediate passthroughs means that configurations whose CPL is smaller than the number of rows of the array can execute without additional penalty. That is, not all rows need to activate in order to complete an iteration. By implementing the Megablock’s branch operations in the array, the accelerator is capable of executing an arbitrary number of iterations determined by live input data. An iteration either executes completely or, if any exit condition triggers, it is discarded. The active configuration also determines which exit FUs are enabled. The results of the latest completed iteration are sent back to the MicroBlaze, which resumes execution at the start address of the Megablock trace, executing the last loop trace iteration in software.

5.2 Accelerator Generation and Loop Translation

Fig. 5.3 shows the architecture-specific tool flow for this implementation of the accelerator. The translation tool now generates two main additional files: one specifying each multiplexer's connectivity and a configuration file for the MAM. As per the general approach, the CDFGs are produced by a simulated execution stage in the Megablock extractor. Most of the description of the CDFG translation tool given in Chapter 4 applies here. The notable differences are the fact that list scheduling is now performed when assigning CDFG nodes to a position on the array and that, after the translation step which assumes unlimited connectivity, the multiplexers are specified.

5.2.1 List Scheduling

For each run of the translation tool, one CDFG is processed. After N runs, the generated accelerator supports execution of the N used CDFGs, by having N configuration contexts. After being parsed, the CDFG is processed to determine initial values for each node's earliest and latest possible placement, based on their connecting edges. An additional placement condition is set for store nodes: they cannot execute before all exit operations execute. This ensures that the last iteration on the accelerator, which is discarded, does not write any data to memory. As before, translating a CDFG involves assigning nodes to a 2D grid. Consider an already existing grid array, due to a previous translation. First, the grid size is adjusted for the new CDFG if necessary, then each node is placed individually in topological order. When placing a node, its placement range is re-computed to take into account the location of other nodes, from the same CDFG, which are already placed.

Let le and ll be the earliest and latest rows. The rows are searched, starting from le, for an existing FU which can execute the node's operation. If one exists, then the node is assigned to it. Otherwise, a new FU is placed at the leftmost free position in the earliest possible row, le. The placed node is annotated with a value, lr, indicating the chosen row.

The values le and ll are both computed iteratively, prior to placement of each node, by traversing the CDFG, starting with the node to place. Algorithm 1 shows the pseudo-code for the calculation of le. The value for le is first set to zero. Incoming edges (i.e., inputs) are then followed, starting from the node being placed, until an already placed node is found.

Algorithm 1 Iterative calculation of earliest possible node placement

function CALCASAP(graph, node)
    le ← 0                                              ▷ Initial value
    if node is placed then
        return node.lr                                  ▷ node is locked to its assigned position
    end if
    for every input of node do
        le_source ← CALCASAP(graph, input.sourceNode)   ▷ Compute asap for source node
        if le_source + 1 > le then
            le ← le_source + 1                          ▷ le is computed based on already placed nodes
        end if
    end for
    return le
end function

To avoid traversing the graph circularly, input edges which originate from nodes in a topological level further down (i.e., backwards edges) are not followed. Nodes are placed from top to bottom, so a placed node which defines the earliest possible placement of the following nodes is usually found. A similar process is applied to determine ll, by following outgoing edges. Backwards outgoing edges are ignored during this. An already placed node is sometimes found during this search, which constrains ll. Since this is not usually the case, the value for ll is bounded by setting it to the maximum of the CPL of the graph and the number of existing rows in the array.
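For reference, a compact C rendering of Algorithm 1 is given below; the Node structure, the level field used to detect backwards edges, and the function name are assumptions made for this sketch rather than the tool's actual data structures.

#include <stdbool.h>

#define MAX_INPUTS 4

typedef struct Node {
    bool placed;                     /* already assigned to a row?         */
    int  lr;                         /* row chosen for a placed node       */
    int  level;                      /* topological level, used to detect  */
                                     /* backwards edges                    */
    int  num_inputs;
    struct Node *inputs[MAX_INPUTS]; /* source nodes of the incoming edges */
} Node;

/* Earliest possible row (ASAP) for 'node': incoming edges are followed until
 * already placed nodes are found; edges coming from deeper topological levels
 * (backwards edges) are skipped to avoid traversing the graph circularly. */
int calc_asap(const Node *node)
{
    if (node->placed)
        return node->lr;             /* locked to its assigned position */

    int le = 0;
    for (int i = 0; i < node->num_inputs; i++) {
        const Node *src = node->inputs[i];
        if (src->level >= node->level)
            continue;                /* backwards edge: not followed */
        int le_source = calc_asap(src);
        if (le_source + 1 > le)
            le = le_source + 1;      /* one row below the deepest source */
    }
    return le;
}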

Figure 5.4: List scheduling example for a Functional Unit with available slack: (a) existing configuration; (b) new configuration; (c) resulting FUs

Figure 5.4 shows: a CDFG that has already been translated (rectangular tags indicate the associated FU); a new CDFG to translate; and a simple representation of the array of FUs. The CDFG in Fig. 5.4a generated the FUs A1, B1 and B2, as well as the three additional unlabelled FUs in the leftmost column of Fig. 5.4c. When translating the CDFG in Fig. 5.4b, these three FUs are re-utilized, and two new FUs are added to the first row. Node A can be placed in either row 2 or row 3. The latter row is picked to re-utilize the existing FU, A1, instead of instantiating a new FU in row 2. Node B is assigned to FU B1. The B2 FU remains idle in this second configuration.

The load and store nodes are placed with some awareness of their effects on execution. The placement tries to minimize the number of load and store FUs in a single row, to decrease the execution latency from port access contention. To do this, each row is given a placement cost when placing a new load or store FU, based on how many FUs of the same type already exist on that row. This process is essentially a heuristic which attempts to reduce the total amount of idle time introduced due to memory access, based on the execution model and memory access limitations. It is independent of the arbitration method the accelerator uses during execution. Placement of a new load unit considers only existing load units. This is a simplification adopted by relying on the ability to buffer stores for later cycles.

Figure 5.5: Assignment of load/store units to ports and cycles, after Functional Unit placement

Likewise, placement of a store unit considers only the existence of other store units. This naive strategy relies on the tendency for store units to be placed on lower rows of the array, which happens because: 1) they must not execute before any exit operations, and 2) stores usually operate on data retrieved by earlier loads. Given this, consider that a row without memory operations completes in a single clock cycle. If a load operation is added, an additional cycle is required. So, placing a load in a row with no such FUs will incur a cost of 2. If a row already contains one load FU, then placing an additional load incurs no additional cost: the second access can be performed in parallel via the remaining free memory port. Placing a load in a row with two existing loads only incurs a cost of 1, since the arbitration logic pipelines accesses to the memory. That is, a memory access strobe is issued while the datum from a previous access is being read. The former rationale holds for any odd number of loads, and the latter for any even number. The same cost metric is used while placing stores. The scheduling of accesses is performed post-placement, as explained in the following subsection. After placing all nodes, the same passthrough insertion step is performed: chains are inserted by analysing the array bottom to top, so that connections only span 1 row.
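The row cost just described can be written compactly. The sketch below assumes the model stated above (the first load or store in a row adds one access cycle, the second is absorbed by the free port, and further pairs are pipelined); the function names and the per-row bookkeeping are illustrative, not the tool's actual code.

/* Cost of adding one more load (or store) FU to a row that already holds
 * 'existing' units of the same type, under the two-port, pipelined model:
 *   0 existing -> cost 2 (the row now needs an extra access cycle)
 *   odd count  -> cost 0 (the new unit pairs with the free port)
 *   even count -> cost 1 (the strobe is pipelined with the previous read) */
int memory_placement_cost(int existing)
{
    if (existing % 2 == 0)
        return (existing == 0) ? 2 : 1;
    return 0;
}

/* Pick the cheapest row within the node's placement range [le, ll]. */
int pick_row(const int *units_of_same_type, int le, int ll)
{
    int best_row = le;
    int best_cost = memory_placement_cost(units_of_same_type[le]);
    for (int r = le + 1; r <= ll; r++) {
        int cost = memory_placement_cost(units_of_same_type[r]);
        if (cost < best_cost) {
            best_cost = cost;
            best_row = r;
        }
    }
    return best_row;
}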

5.2.2 Memory Access Scheduling

As Section 5.1.2 presented, the MAM either uses runtime arbitration which selects operations topologically, or it contains static access schedules. The static schedules are generated during translation and rely on the load/store placement heuristic to achieve more efficient results. The static schedule is held in distributed memory in the MAM, which is initialized at synthesis time with one of the translation tool's outputs. Each schedule step specifies which memory FUs are granted access during each cycle. A single step may grant access to units from different rows, and the number of schedule steps need not be the same as the number of rows.

Figure 5.5 shows an example of assigning load/store units to execution cycles. Each row represents a row of the array (i.e., a total of three). Only load and store units are shown, for clarity. The location of the units is the result of the placement process, which allocated all stores onto the last row, and was unable to pair off the four load units due to the available slack. On the left-hand side, the values represent clock cycles. Each is associated to a corresponding row and iteration. Each unit is annotated with the clock cycle during which it is granted access.

To arrive at this schedule, the translation tool performs a simple form of simulated execution by iterating through the rows post-placement. First, loads and stores are placed onto two separate topologically sorted lists. Per row, the tool first assigns as many unscheduled load units as possible to available ports (at most two, given the limitation). If any ports remain free, existing stores are assigned. The number of remaining free ports per step is tracked during this process.

The simulated execution then does one of two things: 1) advances the schedule step and remains on the same row if necessary (e.g., unscheduled loads prevent execution from continuing), or 2) advances to the following row, which means the step must also advance if any operation was scheduled during that cycle, even though a port may still be free. That port may be assigned to an outstanding store in a future pass through that row. There are two ways to determine when to advance to the next row: execution advances despite unscheduled stores, to prioritize loads in the following rows, or the schedule ensures that the access order is preserved relative to the original trace (i.e., all units must be scheduled before execution advances). The former strategy permits advancing to the next row even if unscheduled stores exist. The latter strategy is adopted to prevent violation of RAW dependencies. Advancing to the next row despite outstanding stores also allows for the scheduling of said stores during execution of a row without load/store units.

For the example in Fig. 5.5 this process occurs as follows: execution starts in row 1 and the schedule is at step 1; L1 is assigned to one port during this cycle, and one port remains free; execution advances to row 2, and the step must advance also; L2 and L3 are assigned to the ports, execution cannot continue yet due to L4; an additional cycle (i.e., step) is required; as a result this row requires 3 clock cycles to execute (cycles 3, 4 and 5); execution advances to row 3 and the schedule to step 4; S1 and S2 are assigned to the ports during this step; execution returns to row 1; the maximum number of steps in the schedule is now determined, since it must repeat from this point on; store S3 is assigned to step 1 (occurring in parallel with the second execution of L1); finally, S4 is assigned to step 3. All pending stores are scheduled before their units are activated again, so they introduce no additional cycles. This scheduling results in both memory ports being in use during every cycle. The total number of clock cycles required to complete an iteration with this access scheduling is 6. If a conservative schedule was used, then 7 clock cycles would be required, since the last row would not allow execution to continue.
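The construction of such a schedule can be approximated in a few lines of C. The sketch below reproduces the example of Fig. 5.5 under the relaxed strategy (rows may advance with outstanding stores); the array contents, the simplification that any free slot can hold a deferred store, and the absence of the RAW-preserving variant are assumptions of the sketch, not the actual tool behaviour.

#include <stdio.h>

#define NUM_PORTS 2
#define MAX_STEPS 64
#define NUM_ROWS  3

/* Per-row memory operations for the example of Fig. 5.5:
 * row 1 holds L1; row 2 holds L2, L3 and L4; row 3 holds S1 to S4. */
static const int loads_in_row[NUM_ROWS]  = {1, 3, 0};
static const int stores_in_row[NUM_ROWS] = {0, 0, 4};

int main(void)
{
    int free_ports[MAX_STEPS];
    int step = 0;

    /* Pass 1: all loads of a row must be granted a port before the row may
     * complete, so a row with more than two loads consumes extra steps. */
    for (int row = 0; row < NUM_ROWS; row++) {
        int pending = loads_in_row[row];
        do {
            int granted = pending < NUM_PORTS ? pending : NUM_PORTS;
            free_ports[step++] = NUM_PORTS - granted;
            pending -= granted;
        } while (pending > 0);
    }
    int num_steps = step;            /* the schedule repeats with this period */

    /* Pass 2: buffered stores take the earliest free port slots, wrapping
     * around into the next traversal of the cyclic schedule. */
    int pending_stores = 0, capacity = 0;
    for (int row = 0; row < NUM_ROWS; row++)
        pending_stores += stores_in_row[row];
    for (int s = 0; s < num_steps; s++)
        capacity += free_ports[s];
    if (pending_stores > capacity)   /* the real tool would add steps here */
        pending_stores = capacity;

    for (int s = 0; pending_stores > 0; s = (s + 1) % num_steps) {
        if (free_ports[s] > 0) {
            free_ports[s]--;
            pending_stores--;
        }
    }

    printf("schedule steps per iteration: %d\n", num_steps);
    return 0;
}

For the unit placement of Fig. 5.5 this yields a 4-step schedule with every port slot occupied, matching the walkthrough above.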

For this particular example, if the runtime selection logic had been used, the number of clock cycles required would be equal. However, this would imply an additional resource cost and most likely a decrease in operating frequency. As was mentioned before, when using runtime selection logic, each port may be fed only with half the access request signals on the array, to reduce complexity and often the critical path. The disadvantage is that a sub-optimal result may be obtained, since a single port may end up connected to a group of loads on the same row, for instance. Using static schedules simplifies the selection logic in the same manner without this issue.

5.2.3 Multiplexer Specification

The post-placement step which involved computing configuration register values is replaced by a similar HDL parameter generation step. Per row, each FU output is given a numerical identifier starting from zero. During synthesis, the generate loops instantiate multiplexers with one 32-bit output and a variable number of 32-bit inputs. The following is a synthetic example of the multiplexer specification file for the array in Fig. 5.1:

Listing 5.1: Excerpt from HDL specification of multiplexers for row 2 of the accelerator in Fig. 5.1

parameter [ 0 : (33 * 6 * NUM_CONFIGS) - 1 ] ROW2_CONFIGS = {
    {32'h0,  1'b0}, // load1 output1 -> anl input1
    {32'h1,  1'b0}, // load2 output1 -> anl input2
    {32'h2,  1'b1}, // add1 output1 -> pass input1
    {32'h3,  1'b0}, // add2 output1 -> pass input1
    {32'h3,  1'b0}, // add2 output1 -> xor input1
    {32'h50, 1'b1}  // constant, "0x50" -> xor input2
};

The row in question has 4 FUs and a total of 6 inputs. Each line of the parameter shown represents one input multiplexer. Each multiplexer input is specified by two fields: a 32-bit value and a single-bit field which determines whether that value is a constant or a numerical reference to the outputs of row 1. For this example, there is only one configuration, so each line shown (i.e., each multiplexer specification) contains only one 33-bit parameter.

5.3 Experimental Evaluation

The experimental evaluation presented throughout this section aimed to validate the accelerator's capability for memory access. As such, benchmarks which contained inner loops operating on data arrays were selected from several benchmark suites. This section contains comments on the general characteristics of the generated accelerators (Section 5.3.3), an evaluation of accelerator-enabled execution versus a single MicroBlaze processor (Section 5.3.4), determines how advantageous the list scheduling and memory scheduling efforts are (Section 5.3.5), reports on the resource cost and operating frequency of the accelerators (Section 5.3.7), and finally presents a short analysis on power and energy consumption (Section 5.3.8). The following are the hardware and software setups for these experiments.

5.3.1 Hardware Setup

Figure 5.6 shows the system architecture used to validate the accelerator's capacity for memory access. The program code and data of the benchmarks reside in local BRAM memories. Two LMB multiplexers are used to drive its ports. Each multiplexer receives signals from one of the accelerator's ports and one of the MicroBlaze's LMB ports, of which there are two (one for data accesses and another for instruction accesses). The injector is coupled to the instruction bus before the multiplexer. Both modules are transparent to the MicroBlaze, and to each other. The accelerator and MicroBlaze communicate via FSL.

Figure 5.6: System architecture for validation of accelerator local memory access

The system was implemented on a Digilent Atlys board, with a Spartan-6 LX45 FPGA. The frequency of the system clock was 66 MHz for most cases. The MicroBlaze version used was 8.00.a, and it was configured for performance. The integer multiplication, pattern comparison and barrel-shifter units were enabled. For two benchmarks, the integer division unit was also used. The implementations in Chapter 4 relied only on explicit calls to vendor timer modules to measure execution time. This system employs custom timer modules attached to the PLB (not shown) to retrieve several measurements. The timers are controlled by the injector, accelerator and MicroBlaze. Using the injector, the execution time of only the accelerated regions of code (as opposed to the entire application) can be measured. The short migration overhead time is also measured. The accelerator controls a timer to measure execution on the array, and a timer to measure idle time due to memory access. To measure the total execution time, the benchmarks are minimally modified so that the first and last lines start and read a global timer which measures the execution time of the entire application.

5.3.2 Software Setup

A total of 37 benchmarks were used to test the system. The benchmarks originate from Texas Instruments' IMGLIB library [Tex], the SNU-RT [Seo], WCET [GBEL10], and PowerStone [MMC00] benchmark suites, and from other sources [Int96, War02, Jen97]. The benchmarks used were those which resulted in Megablock traces containing load and/or store instructions. The benchmarks are separated into three sets: (A) benchmarks whose source code has been manually if-converted to remove short if-else statements; (B) simple benchmarks with a single candidate Megablock resulting from an inner loop with no if-else statements; (C) larger benchmarks from which several Megablocks were extracted and accelerated. Xilinx EDK 12.3 was used for RTL synthesis and bitstream generation. Benchmarks were compiled with mb-gcc 4.1.2 using the -O2 flag.

Regarding set A, note that the rationale underlying the proposed approach of binary acceleration still relies on not modifying the source code. This was done for validation purposes. The code was minimally modified via if-conversion [AKPW83]. This is a technique which converts high-level control structures into arithmetic statements. Using this modification, it is possible to extract a single Megablock trace from the loop, as opposed to several traces starting at the same address.
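A hypothetical example of the kind of transformation applied to set A is shown below; the function names are illustrative and do not correspond to any benchmark. The short if-else is replaced by arithmetic on the comparison result, leaving a single path, and therefore a single candidate trace, through the loop body.

/* Original form: two paths through the body, so two Megablock traces start
 * at the same address depending on the branch outcome. */
int saturate_branch(int x, int max)
{
    if (x > max)
        x = max;
    return x;
}

/* If-converted form: the comparison result (0 or 1) selects the value
 * arithmetically, so no control flow remains inside the body. */
int saturate_converted(int x, int max)
{
    int gt = (x > max);
    return gt * max + (1 - gt) * x;
}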

As a reminder: the present migration mechanism cannot disambiguate between two traces starting at the same address, and neither translation nor execution of multiple-path traces is supported. The benchmarks used in the experiments in Chapter 4 were simple calls to an inner loop function; all benchmarks were structurally very similar in terms of C code. The benchmarks for sets A and B in this evaluation are similar to those. The benchmarks for set C are relatively more complex. For instance, the pocsag benchmark from the PowerStone suite is a single file with 500 lines of code. The code is structured such that a complex call graph results from the 8 integer functions, which operate on several global data arrays.

5.3.3 General Aspects

Table 5.1 shows the characteristics of the detected Megablock traces and of the resulting accelerators. The table is split into three sections, according to the benchmark sets. The means shown at the bottom row of each section are arithmetic, and the total mean row accounts for all benchmarks. The third, fourth and fifth columns show the number of accelerated traces, average number of trace instructions, and average executed iterations per trace occurrence. These averages account for all the accelerated traces of each respective benchmark, and are weighted based on the execution frequency of each trace. In total, 73 traces were targeted for acceleration.

The average number of instructions for the accelerated traces of this set is much higher than for the implementation shown in Chapter 4. The average number of instructions in the latter case was 9.5, whereas the average for this evaluation is 40.5. For all the accelerated Megablocks, an average of 13 % of all trace instructions are loads, and an average of 9 % are stores. That is, there are on average twice as many loads as stores, which is to be expected, because typically several input data produce one output datum. Also, a load or store instruction actually corresponds to two accelerator operations: the memory access is separated into an addition which generates the address, and the actual access.

The size of the traces increases for two reasons. Firstly, and mostly, inner loop code without memory handling operations is unrealistically small. The benchmarks for this evaluation contain inner loops which are larger simply because they implement more realistic processes handling large amounts of data residing in memory (e.g., two inner loops in b17 with approximately 200 lines of code each). Secondly, the support for load and store instructions allows for a larger execution scope to be targeted. For example, the benchmarks in set B are (for the most part) simple code kernels called within a loop construct, as is the case with a6:

1  int crc32(int data){
2
3      //(...)
4      for(int i=0;i < 8;i++) {
5          //(...)
6      }
7      return ans;
8  }
9
10 //(...)
11 for(int i = 0; i < N; i++)
12     //(...) read element, call crc32, write result

This is identical to the test cases in Chapter 4: an inner loop kernel called a predictable number of times, where each call corresponded to a small number of iterations operating over a single set of data. Without memory support, the read and write operations present in the outer (calling) loop at line 12 could not be accelerated. The trace for the innermost loop would be targeted, and the accelerator would be called a total of N times (in this case). Instead, since the nested loop is small enough to be unrolled, the Megablock extractor provides a trace capturing the outer loop code as well. As a result, the trace is larger and the accelerator is only called once. This single call performs N iterations of the outer loop, corresponding to the total N × 8 iterations of the inner loop. This is also an example of how the Megablock trace detection crosses function call boundaries.

In short, there is a relationship between the scope of code viable for acceleration, the number of trace instructions, the number of iterations and the number of accelerator calls. These last two aspects are especially correlated. For the 37 benchmarks in this evaluation, the number of accelerator calls is sometimes data-dependent, and in some cases the accelerator is only called once (for 7 cases of set A, 9 cases of set B and 2 cases of set C). The average for the entire benchmark execution is 475 calls. The average number of iterations per call is 523.

Supporting larger CDFGs also results in larger accelerators. Given the translation method, the array scales in a linear fashion with the CDFG size. The sixth column shows the number of FUs in the array, excluding passthroughs and including load and store units. The number of instantiated units of these types is shown in the following two columns. Relative to the implementation in Chapter 4, the average accelerator contains 5.8× more FUs (excluding passthroughs), 8.2× more passthroughs and 3.2× more rows. Section 5.3.7 discusses the effect this has on required FPGA resources. There are 2.66 passthroughs for every other type of FU. Similar to the number of memory operations in each trace, the average number of load units corresponds to 11 % of all FUs (excluding passthroughs), and the average number of store units to 6 %.

Given that the accelerators support multiple loops, not all instantiated FUs are active for all configurations. Consider then only the cases where the accelerator supports multiple loops. For each accelerator, the weighted average number of active FUs is computed based on the number of iterations performed per configuration. These per-accelerator values in turn result in an average of 28.3 FUs being used per configuration, out of a total of 48.3. Likewise, for an average total of 5.1 and 2.6 load and store units on each array, respectively, 3.2 and 1.7 are activated per configuration. Also, within a single configuration, not all FUs are active every clock cycle, given the execution model and memory access latency. When considering all implemented traces, the average number of idle FUs per clock cycle is 28.3 (arithmetic average for all configurations). We can also compute how many FUs are idle per clock cycle per accelerator instead: 37.6. This is a non-weighted average, for all accelerators, of the individual weighted averages of each accelerator's number of idle FUs (according to the execution frequency of each supported configuration). By analysing the resulting accelerators, we also find that the average number of rows is 13.6.
The accelerators have two access ports and, for this implementation, a memory access latency of 1 clock cycle. Given this, a rough estimate can be made: the 3.6 + 2.1 memory operations can be completed in (5.7/2 + 1) = 3.85 clock cycles. Since 13.3 > 3.85, this seems to indicate that two ports are, in general, enough for memory access latency not to nullify the performance gains derived from ILP. Naturally, this varies on a per-case basis with each CDFG but, in general, two low-latency ports seem adequate, given the ratio of the number of memory accesses to the number of arithmetic operations. Note that this analysis only holds for non-pipelined execution.
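Written as a formula, the estimate above is (a simplification that assumes perfectly paired accesses on the two ports and ignores per-row scheduling constraints; the notation is mine):

    \text{cycles}_{\text{mem}} \approx \frac{n_{ld} + n_{st}}{2\,\text{ports}} + \text{latency} = \frac{3.6 + 2.1}{2} + 1 = 3.85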

Table 5.1: Extracted Megablock and generated accelerator characteristics

id    kernel          # Cfgs.  Avg # Insts.  Avg # Iter.  # FUs  # Passes  # ld/st  # rows  IPC HW
a1    boundary              1         25.0         68.0      24        45      3/2       6     3.13
a2    bubble_sort           1         19.0         62.0      21        71      2/2      10     1.58
a3    changebright          1         20.0         99.0      20        94      1/1      12     1.54
a4    compositing           1         24.0        199.0      24       112      2/1      15     1.50
a5    conv_3x3              1         81.0        149.0      77       233     18/1      19     2.89
a6    crc32                 1        131.0        999.0      80       237      1/1      50     2.57
a7    mad_16x16             1         15.0         15.0      16        47      2/0       9     1.36
a8    max                   1         13.0       2047.0      14        38      1/0      10     1.18
a9    millerRabin16         1         59.0          8.4      52        52      6/6       6     5.90
a10   perimeter             1         24.0        479.0      28        86      5/1      10     1.85
a11   pix_sat               1         19.0       1999.0      21        81      1/1      14     1.27
a12   rng                   1         84.0        499.0      74       154      6/6      18     3.65
a13   sobel                 1         52.0        957.0      56       223      8/1      20     2.08
mean                        1         43.5        583.1     39.0     113.3  4.3/1.8    15.3    2.35

b1    blit                  2         11.0        999.0      13        26      1/2       4     2.20
b2    bobhash               1         10.0        999.0      11        24      1/0       8     1.11
b3    checkbits             1         63.0        499.0      64       169      2/2      16     3.71
b4    checksum              1        149.0        124.0     155       482      8/0      52     2.66
b5    fft                   2         28.9         28.9      64       126     10/8      10     2.73
b6    gouraud               1         17.0       1999.0      15        37      0/1       6     2.83
b7    lookup2               1         46.0        999.0      48       105      3/0      22     1.92
b8    motionEst             1         13.0         15.0      13        48      2/1       7     1.63
b9    perlins               1        123.0       1023.0     123       575      4/1      29     3.97
b10   popArray1             1         27.0        100.0      22        94      1/0      15     1.69
b11   popArray2             1         56.0         32.0      46       200      3/0      20     2.55
b12   quantize              1         11.0        199.0      11        35      1/1       6     1.57
b13   sad16x16              1         12.0         15.0      14        39      2/0       8     1.33
b14   ycdemuxbe16           1         14.0        999.0      20        18      4/4       3     2.80
b15   bcnt                  1         72.0         15.0      98       155     25/1      13     2.77
b16   Menotti_fdct          2        103.5         53.5     136       291     8/15      17     4.93
b17   jfdctint              1         78.0          7.0      95       171      9/8      11     4.88
b18   fir                   1          9.0         16.0      10        26      2/0       5     1.29
mean                      1.1         46.9        451.2     53.2     145.6  4.8/2.4    14.0    2.59

c1    pstn_adpcm            7          9.5          6.9      31        40      8/6       6     1.43
c2    pocsag                2         45.4         20.5      67       121     12/0      11     2.84
c3    wcet_adpcm           13          4.3       1386.1      38        50      8/6       7     1.33
c4    edn                   6         12.7         51.5      50        79      7/4       9     1.58
c5    jpeg                  8         13.0       1320.8     117       193    10/11      14     1.48
c6    g3fax                 3          5.9        858.4      18        33      2/1       6     1.26
mean                      4.8         15.1        607.4     53.5      86.0  7.8/4.7     8.8    1.65

total mean               1.97         40.5        522.9     48.3     124.6  5.1/2.6    13.6    2.35

Figure 5.7: Speedups for the three benchmark sets (benchmarks for which the execution frequency was below 66 MHz are denoted with ∗).

5.3.4 Performance vs. MicroBlaze Processor

The last column of Table 5.1 shows the number of executed Instructions per Clock Cycle (IPC) on the accelerator, IPCHW. It is computed as the number of accelerated trace instructions over the number of clock cycles required to complete an iteration. Unlike the previous chapter, this value is not the same for all the supported configurations of an accelerator. Firstly, not all rows may need to activate; for each supported configuration, only as many rows as the respective CDFG's CPL need to activate (at best). Secondly, there is the memory access latency. Given this, the IPCHW shown for each benchmark is a weighted average of these per-configuration IPCHW values.
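In equation form (the notation is mine, not the thesis's), for a configuration c with N_c trace instructions and an iteration taking cycles_c clock cycles:

    \mathrm{IPC_{HW}}^{(c)} = \frac{N_c}{\text{cycles}_c}, \qquad \mathrm{IPC_{HW}} = \sum_{c} w_c \, \mathrm{IPC_{HW}}^{(c)}, \quad \sum_{c} w_c = 1

where the weights w_c reflect the execution frequency of each configuration.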

Likewise, the number of instructions the MicroBlaze executes per clock cycle, IPCSW, can be computed. This is omitted from Table 5.1, since its value, 0.85, is nearly identical for all cases. The only exception is a12, which contains 4 integer division instructions, each of which requires

32 clock cycles to execute, resulting in an IPCSW of 0.40.

Figure 5.7 shows the measured speedups for all the sets. These speedups account for the execution of the entire application, not just the accelerated regions. Considering all benchmarks, the accelerated traces correspond to an average of 89 % of the benchmark's total execution time. Sets A and B skew this average since, being simpler kernel function calls, the accelerated trace represents most of the benchmark. For set C alone, the accelerated regions represent 71 % of the application. Geometric mean speedups are shown for each set, and the overall geometric mean is 1.60×. For some cases, the system frequency was below the 66 MHz baseline due to the accelerator. These cases are marked with an asterisk (*) in Fig. 5.7. The calculation of speedup takes the different clock frequencies into account.

IPC, Overhead, and Speedup The relationship between IPCHW and speedup is the same as presented in the previous chapter (as shown in Eq. (4.1), page 71); the speedup increases with the accelerator IPC. The IPCHW is higher if the number of CDFG instructions increases and its CPL decreases (note that this is valid for non-pipelined execution). The effect of memory access support is accounted for in the IPCHW, as additional clock cycles are required to complete an iteration relative to the accelerator's number of rows alone. However, the overhead (due to accelerator-MicroBlaze communication, exacerbated by low iteration count accelerator calls) and how representative the accelerated traces are both weigh on the final performance. In general, a speedup is achieved for all benchmarks. Exceptions include a2 and a9, for reasons explained shortly, and b5. For this latter case, both of the accelerated traces are frequent, but for one of them the iteration count per call is very low. This negates any speedup, despite an IPCHW of 2.60 and 3.46 for the two traces. For set C, half of the benchmarks experience a slowdown. This is due to the reduction in operating frequency relative to the MicroBlaze-only baseline, the reasons for which are later explained in Section 5.3.7. The accelerated traces are less representative than the ones for sets A and B, and fewer iterations are performed per call, increasing the overhead. The average overhead suffered by sets A, B and C, respectively, is 7 %, 5 % and 13 %.

As was stated, the accelerated traces represent only a portion of the executed code. The speedup of a benchmark is influenced by how much execution time each accelerated trace represents, but the accelerator performance is more evident when considering only the trace execution time (i.e., the kernel speedup). For all benchmarks, the geometric mean kernel speedup is 1.95×. For sets A and B this value is 2.16×, and for set C it is 1.13×. The previous chapter employed a set of 15 integer benchmarks, which is comparable to set B. The geometric mean speedup of the latter case is 1.19× higher than that of the former. Similarly, the average IPC for the same set is 1.25× higher.

Effects of Memory Access Latency If memory accesses did not incur any additional latency, the average IPC throughout all accelerators would be 3.00, versus the achieved 2.35. Memory access operations add an average of 2.5 clock cycles to each loop iteration. The two extremes are b15 and b6. In the former case, the 25 load operations add 13 clock cycles to the execution of 1 iteration (over the number of rows alone); in the latter case, there is only 1 store operation, which does not introduce any additional latency. According to the retrieved measurements, the array's stall time corresponds to 17.5 % of all accelerator execution time on average. Stall time accounts for any additional cycle (spent executing memory accesses) beyond the cycle which activates a row. For instance, a row with a single load represents one clock cycle of stall time.

The worst three cases are a9, b14 and b15. Although these cases do not contain the most memory instructions per arithmetic operation, the resulting accelerators do contain the most load FUs per row (considering the average number of active units of this type per configuration, and the average number of activated rows per configuration). For b15 especially, most of the execution time is spent accessing memory. Although the two ports allow for roughly twice the access bandwidth relative to the MicroBlaze (demonstrated by the respective speedup of 2.21×), this approach does not primarily target this type of workload.

Performance on if-converted Benchmarks Set A contains 13 benchmarks whose code was modified via if-conversion. If-conversion typically increases the code size due to the additional arithmetic implementing what otherwise would be control statements. For example, consider the following loop with a conditional statement:

    for (i = 0; i < SIZE; i++){
      if(v[i] > max){
        max = v[i];
      }
    }

The following if-converted code is functionally equivalent, but there is only one loop path:

    for (i = 0; i < SIZE; i++){
      int condition;
      //(...)
      condition = (v[i] - max) > 0;
      max = v[i]*condition | max*!condition;
    }

Execution of the converted code is less efficient for software-only execution. Due to this, the speedup for these cases is measured versus the original, non-converted code. On the other hand, the IPCHW is computed using the number of instructions in the accelerated trace (extracted from the if-converted code), since it is difficult to determine the number of instructions being accelerated relative to the non-converted code, due to multiple execution paths and differing binaries. The if-conversion is responsible for the two slowdowns in set A: a2 and a9. When comparing accelerator execution to the software execution of the if-converted code, the accelerator achieves speedups of 1.80× and 0.79×, respectively. The latter case still experiences a slowdown due to the decreased clock frequency of the accelerator-augmented system. Even when comparing the number of clock cycles alone, this case still suffers a slowdown since each accelerator call executes only an average of 8.4 iterations.

Accelerating High Latency Instructions The speedup achieved for a12 is the highest of all benchmarks. Despite the IPCHW for this case not being the highest, the accelerated division instructions heavily penalize software execution. The four division instructions in the trace require a total of 128 clock cycles to execute on the MicroBlaze. On the accelerator, three of these operations execute in parallel in only 2 cycles. Chapter 6 shows how acceleration of floating-point instructions achieves the same effect via low-latency floating-point FUs which are also fully pipelined. The trace for this case also contained a total of 12 memory operations. The placement process minimizes the latency of the loads by scheduling most to the same row, pipelining their execution. An entire iteration of this large loop requires only 23 clock cycles.

This benchmark and c3 are the only two where a division operation appears in the source code. Two conditions are required for traces with division operations to be candidates for acceleration: 1) the division unit must be enabled on the MicroBlaze, so that the operation is implemented with an explicit div instruction, instead of a soft-div implementation; 2) the divisor must be constant. This does not happen for c3, since its division operations are contained in loops which compute a new divisor per iteration. Regardless, the division unit was enabled so MicroBlaze execution was not penalized. Support for non-constant division is addressed in Chapter 6. If a software subroutine is used to compute the division iteratively, it may still be possible to accelerate that routine. This depends greatly on how many execution paths the routine has, and how frequently the execution alternates between paths. The frequently occurring modulus operation subroutine displays this behaviour: it has a very irregular structure, contains a large number of branch operations, and execution through its loop paths is very irregular. Although this routine is very frequent in some benchmarks, it is not a good candidate for acceleration. This is also an example of how high-level context is lost at the binary level: the trace extraction would be incapable of inferring that a division operation was being performed by simple observation of the soft-div routine. In contrast, a high-level synthesis approach would retain this information, regardless of the processor having, or not, a dedicated divider.

Regarding the constant divisor, there are two ways the trace may contain this constant operand: 1) it is explicitly specified in the source code, or 2) the Megablock extraction process performs constant propagation. The former case is true for rng. The latter case demonstrates the kind of runtime information that is accessible with this type of binary acceleration approach. Consider a kernel function containing a non-constant division that is called from several locations in the code, with different sets of arguments per call. Although executing a non-constant division is not supported on the accelerator, it might be possible to arrive at a constant divisor based on propagating the calling arguments. This type of contextual operator specialization is only possible using runtime data, and the concept extends beyond the present work. For instance, generating different hardware modules based on a single fir filter function, by knowing the different calling contexts (i.e., possible sets of coefficient values and/or filter size).
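As a concrete (hypothetical) illustration of such contextual specialization, consider the sketch below. The function and variable names are illustrative and not taken from the benchmarks; the point is that each call site fixes the coefficients and filter size, so propagating the calling context yields constant-operand loop bodies that could each be turned into a different specialized accelerator configuration.

    /* Hypothetical example: the same fir() kernel is called with different
     * constant coefficient sets and sizes. Propagating each calling context
     * produces two distinct constant-operand loops. */
    static int fir(const int *x, const int *coef, int taps) {
        int acc = 0;
        for (int i = 0; i < taps; i++)
            acc += x[i] * coef[i];
        return acc;
    }

    static const int lowpass[4]  = {1, 3, 3, 1};
    static const int highpass[2] = {1, -1};

    int call_sites(const int *x) {
        int a = fir(x, lowpass, 4);   /* call site 1: constant coefficients, taps = 4 */
        int b = fir(x, highpass, 2);  /* call site 2: constant coefficients, taps = 2 */
        return a + b;
    }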

Performance with Multiple Traces The benchmarks of set C resulted in more than one candidate trace for acceleration. The traces for these benchmarks contain fewer instructions on average relative to the other sets. Table 5.1 shows the average number of trace instructions per benchmark in the fourth column. Note that this value, per benchmark, is an average over all accelerated traces, weighted by the number of iterations executed by each translated trace. When considering the pool of all traces for this set (a total of 39), the average number of instructions is 16.1.

These benchmarks are more complex in terms of high-level control structures, and are structured around sparse conditional calls to short functions which contain small loops with low iteration counts. For all cases, except c2, most extracted traces originate from such small loops. The rationale for evaluating the approach with these benchmarks was that the retrieval of binary-level information would expose large frequent paths which would otherwise not be obvious via high-level analysis of the application. The average number of trace instructions for these cases alone (excluding c2 and two large traces for c5, for reasons discussed shortly) is only 9.3. In general, the benchmarks in this set are too control oriented for large, very frequent, traces to be detected.

The two exceptions are c2, and two traces for c5. For c2, there are two detected traces, which are paths through a while loop containing several conditional statements and inlined functions. It was possible to simultaneously translate both Megablocks since each starts at a different address. For c5, a total of eight traces were accelerated. Unlike other cases for this set, whose several traces were of similar sizes, two of the traces for c5 both contain 106 instructions. The average number of instructions for the remaining six traces is 7.8. Although execution of the smaller traces on the resulting accelerator is not penalized due to supporting the two larger traces, the resources required to execute the latter two are underutilized for nearly half the time. Considering all accelerated instructions for all eight traces (i.e., the product of the number of instructions and iterations performed, for each trace), the two large traces represent 47.8 % of the execution. Supporting only the two large traces would not affect the amount of time during which the required FUs are utilized, as they would still represent the same amount of computation relative to the entire benchmark. However, the resulting two-configuration accelerator would be less complex in terms of FU connectivity. The logic relative to the MAM would also be simpler, since fewer load/store units would be instantiated.

For this accelerator design, supporting multiple traces must take into account their respective resource utilization, especially if the CDFGs are very distinct topologically. Even for a set of very similar CDFGs, the overall accelerator performance (in terms of frequency and increased resources) might be penalized by supporting an additional configuration which is infrequent relative to the rest. For a large set of traces, a more adequate design choice might be to deploy several accelerators, each targeting a subset of similar CDFGs. Benchmark c5 is in fact one of the cases for which the accelerator complexity required lowering the system operating frequency. Details on accelerator operating frequency are presented in Section 5.3.7.

Potential Speedups and Other Comparisons As with the IPCHW, we may also compute the optimal speedup for a system free of memory access latency. This can be done easily by taking the number of clock cycles spent on accelerator execution and subtracting that count from the total. Since the number of iterations performed is known, an estimated execution time is added back using the ideal IPCHW values. For all benchmarks, the geometric mean speedup becomes 1.82×. The three benchmarks which would benefit the most from a potential decrease in access latency are a5, b14 and b15. Although a9, along with b14 and b15, suffers the most memory access latency, as shown previously, the speedup does not increase significantly for a9. This is because the accelerated trace represents a small portion of the entire benchmark. Specifically, the time spent on the accelerator is 2.3 % of the total execution time.
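The ideal-latency estimate just described can be written as (my notation; a restatement of the procedure in the text, not an additional result):

    \text{cycles}_{\text{ideal}} \approx \text{cycles}_{\text{total}} - \text{cycles}_{\text{acc}} + \sum_{\text{traces}} \frac{N_{\text{iter}} \cdot N_{\text{instr}}}{\mathrm{IPC_{HW}^{ideal}}}

where cycles_acc is the measured accelerator cycle count and the sum re-estimates it using the ideal per-trace IPC.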

Table 5.2: Number of cycles required to complete an iteration and respective IPC, for greedy and optimization-enabled allocation

ID    Megablock   IPC (greedy)   CPIter (greedy)   IPC (optim.)   CPIter (optim.)
a5         1           2.79            29.0             2.89            28.0
b4         1           2.61            57.0             2.66            56.0
b5         2           3.21            14.0             3.46            13.0
b9         1           3.84            32.0             3.97            31.0
b14        1           2.00             7.0             2.80             5.0
b16        1           4.41            22.0             4.62            21.0
b16        2           4.86            22.0             5.10            21.0
c4         1           1.25             8.0             1.43             7.0
c5         2           5.58            19.0             5.89            18.0
c5         7           5.58            19.0             5.89            18.0
mean                   3.61            22.9             3.87            21.8

The accelerated kernel function corresponds to the deepest layer of a function call chain whose outer-layer functions contain loops with modulus operations. As was previously explained, the built-in routine for calculating the modulus is not viable for acceleration, despite representing most of the computation time for a9.

Regarding potential speedup, recall that some of the tested cases suffered from decreased operating frequency due to accelerator complexity, leading to sub-optimal speedups. We can instead compute the speedups based on the number of clock cycles alone, which reflects the potential acceleration from exploiting instruction and data parallelism. The resulting geometric mean speedups for sets A, B and C are 1.54×, 2.11× and 1.25×, respectively. The accelerator design presented in the next chapter is more efficient, avoiding decreases in clock frequency even for accelerator instances with numerous configurations.

Finally, a small evaluation of the PLB-interface based accelerator was also performed using 7 of these benchmarks. For that evaluation, the translation process did not perform list scheduling, did not attempt to optimize memory port usage via intelligent placement of loads and stores, and employed the runtime arbitration for the MAM. The resulting geometric mean speedup was 1.44×, versus a speedup of 1.85× in this evaluation, for the same subset. The following section discusses the effects of memory access optimizations further.

5.3.5 Effects of Memory Access Optimizations

The performance results shown so far relied on translation runs which placed load/store FUs according to the cost metric explained in Section 5.2.1 (by making use of list scheduling), and which generated the MAM static schedules by allowing stores to be postponed (as per the example in Fig. 5.5). An initial implementation of the translation tools instead performed a greedy placement and scheduling. Units were placed as early as possible, without regard to any units of the same type already present, and the static access schedule did not allow stores to be postponed.

All 73 detected traces were processed with this latter approach to arrive at a comparison between the two cases. For 63 of the 73 traces, no improvement is found in terms of IPC. Table 5.2 shows the 10 traces for which the achievable IPC increased. The increase in IPC is small in all cases, 9 % on average. Any improvement in IPC is a result both of the FU placement and of the static access schedule. We can observe that the former has little impact, and that most of the optimization is due to the latter.

Pair-Aware Placement In Section 5.2.2, the placement cost metric used by the translation tool for these units was explained: a cost of 2 when placing a unit on a row with no units, 1 if the row has an even number of units, and 0 for an odd number. In truth, placing a new unit in a row with an even number of units incurs the same cost as placing it in a row which contains no units. As an example, consider an accelerator with two rows, and a total of 3 load nodes to place. Placing all three loads on the first row would lead to an execution latency of 3 clock cycles for that row: the activation cycle, during which other FUs execute and the access strobes for the first pair of loads are concurrently issued; a second cycle to read the data for the first pair of loads and issue the access strobe for the third load; and a third cycle to read the third datum. These three cycles would add to the one clock cycle required by the second row (without loads), totalling four. If two loads were placed on the first row and another on the second, the first row would require 2 cycles, and the second 2 cycles, hence the result would be equal. The same applies to the placement of four loads into two rows: the total number of cycles is minimized by minimizing the number of rows with an odd number of units.

This is not completely unexpected by mere inspection of the accelerator's execution model. Consider a total number of N memory operations to execute. Placing each on a different accelerator row means a penalty of N additional cycles per iteration. At best, this is reduced to N/2, if all accesses are paired. However, even for this subset of 10 cases, an as-soon-as-possible placement for these units leads to a near-optimal placement, as only one or two units are left unpaired at most. In short, only one clock cycle is typically gained by pair-aware placement.
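A minimal sketch of this latency model (an assumed simplification, not the actual translation tool): with two memory ports, a row with n accesses costs its activation cycle plus ceil(n/2) additional cycles, so the totals for the two placements in the example above come out the same.

    /* Assumed model: each row costs 1 activation cycle plus ceil(n/2)
     * additional cycles for its n memory accesses, given two memory ports. */
    #include <stdio.h>

    static int total_cycles(const int *accesses_per_row, int rows) {
        int total = 0;
        for (int r = 0; r < rows; r++) {
            int n = accesses_per_row[r];
            total += 1 + (n + 1) / 2;   /* activation cycle + ceil(n/2) access cycles */
        }
        return total;
    }

    int main(void) {
        int all_on_first[2] = {3, 0};   /* three loads on the first row                  */
        int paired[2]       = {2, 1};   /* two loads on the first row, one on the second */
        printf("%d %d\n", total_cycles(all_on_first, 2), total_cycles(paired, 2)); /* 4 4 */
        return 0;
    }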

Store Postponing Postponing stores to later cycles combines two effects, depending on whether they are postponed to: 1) execution cycles pertaining to the same iteration or 2) cycles belonging to the next iteration (at most). The former case is essentially equivalent to placing the store at a lower (i.e., later) topological level. The latter is a limited form of loop pipelining, where only store operations of up to two consecutive iterations are partially overlapped. This is evident for b14, and to a smaller degree for b5. These cases contain the most stores relative to other types of operations, meaning that optimizing their execution has a greater impact. This is especially true for b14, whose number of accelerator rows is only 3. The number of stores per arithmetic instruction correlates with the increase in IPCHW with a coefficient of 0.96, whereas the correlation with the number of loads per arithmetic instruction is only 0.57.


Figure 5.8: Effects of list scheduling on instantiation of passthrough units

In general, the inherent limitations of this execution model hinder these optimization efforts. These findings lead to a re-design of how operations are executed on the accelerator, especially memory operations, in order to decrease their impact on performance, as shown in Chapter 6.

5.3.6 Effects of List Scheduling on Functional Unit Reuse

List scheduling has an effect on the total load/store latency within the context of one configuration, but mostly it affects FU re-utilization between configurations. To evaluate this, the translation tool was run with list scheduling disabled for the 9 benchmarks with multiple configurations. The average decrease in the number of required FUs (including passthroughs) due to list scheduling is only 1.7 (2 %). When discriminated, the average decrease in the number of required FUs, excluding passthroughs, is 1.8, and the number of required passthroughs actually increases by 0.1.

The list scheduling process has direct and indirect effects on several aspects: the total number of arithmetic (i.e., non-passthrough) FUs, the total number of passthroughs, and the number of FUs that are saved (i.e., not instantiated, due to the re-utilization of existing available FUs found by the list scheduling search). This is exemplified in Fig. 5.8. The effects of list scheduling on the final result end up being very unpredictable, since the process is essentially a local, per-node search. Thus it is detached from the overall context of the CDFG, and from the order in which CDFGs are translated.
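A simplified sketch of this per-node search (assumed behaviour for illustration, not the actual translation tool): each node may be placed on any row within its mobility range, reusing a compatible FU already instantiated there if one is free, and only instantiating a new FU otherwise. The search order shown is illustrative.

    /* Illustrative per-node placement with FU reuse. A node of operation
     * type 'op' may land on any row in [earliest, latest]; a compatible FU
     * that is still unused in one of those rows is reused, otherwise a new
     * FU is instantiated (here, at the earliest row). */
    typedef struct { int row; int op; int used; } FU;

    static int place_node(int op, int earliest, int latest, FU *fus, int *n_fus) {
        for (int row = earliest; row <= latest; row++) {
            for (int i = 0; i < *n_fus; i++) {
                if (fus[i].row == row && fus[i].op == op && !fus[i].used) {
                    fus[i].used = 1;          /* reuse an existing FU */
                    return i;
                }
            }
        }
        fus[*n_fus] = (FU){ .row = earliest, .op = op, .used = 1 };  /* new FU */
        return (*n_fus)++;
    }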

Effect on Number of FUs Due to this, re-utilizing a given FU might actually result in an increase in the total number of FUs. For instance, suppose a node a feeds nodes b and c. When placing an FU for node a, its latest possible position is used to re-utilize an existing compatible FU, which avoids the instantiation of one new FU at any earlier row. However, when placing nodes b and c, any two existing compatible FUs that remained in rows preceding the row chosen for node a are now unreachable. Two new FUs may have to be instantiated (Fig. 5.8a).

Effect on Number of Passthroughs The number of passthroughs is especially unpredictable, since the insertion of passthroughs is performed post-placement, and the list scheduling search is unaware of (i.e., does not predict) the effects of assigning a node to a particular row. Moving a node through its range of rows has a number of effects on passthroughs: 1) if the node is between two other arithmetic nodes, which are themselves bound to a single position, the number of passthroughs required to complete the connections between all three will be the same regardless of the row chosen (Fig. 5.8b); 2) moving either an exit type or store operation downwards relative to its earliest possible position to re-utilize an existing FU may imply an increase in the number of passthroughs, since these two types of nodes have no outputs themselves (Fig. 5.8c); 3) nodes typically have two inputs and one output, so moving a node downwards to re-utilize a unit decreases the number of passthroughs between that node and any downstream consumer nodes, but may lead to a greater increase of passthroughs preceding the node in order to feed its inputs (Fig. 5.8d).

Effect of Order of Translation on Placement The list scheduling of each individual node is itself a local search, but the order in which a set of CDFGs is translated also influences the final number of FUs. The translation of each CDFG starts from initial conditions determined by the translation of the previous CDFGs. Consider the following examples; henceforth, passthroughs are not included when numbers of FUs are mentioned.

As a first example, consider the two similarly sized CDFGs of b16. Both CDFGs contain nearly the same number of nodes (115 and 105), and depths of 17 and 16 levels. The total number of FUs and passthroughs when translating the CDFGs without list scheduling is independent of the order of translation. For this case, there are 129 FUs. A total of 91 FUs are reused due to the similarity of the CDFGs, regardless of translation order. Enabling list scheduling, and translating the CDFGs starting with the smallest, the total number of FUs is 136. This minor increase occurred because only 84 FUs were reused when translating the second CDFG. Reversing the translation order yields a total of 132 FUs. In this case, 88 FUs are reused when translating the second CDFG.

As a second example, consider two of the CDFGs for c6, one composed of 11 nodes and the second of 106 nodes. Without list scheduling, the total number of FUs is 109; 8 are re-utilized when translating the second CDFG, regardless of translation order. With list scheduling, the number of FUs is nearly the same for both translation orders (109 and 108). But translating the smaller CDFG first does lead to 19 fewer passthroughs being required.

Even for a single CDFG to translate, the different positions of nodes with available slack would affect the total number of passthroughs, since an FU's position may influence the number of upstream and downstream passthroughs, as previously explained. However, the exploration of list scheduling in the translation tool was geared towards reusing existing non-passthrough FUs (reuse of these units is performed during a later translation stage). This is why some CDFGs, when translated after previous translations have taken place, end up requiring fewer passthroughs compared to a run where they are processed first. The existence of previous FUs when translating a new CDFG causes different node positions to be explored, leading to this side-effect.

For this accelerator architecture, this has an influence that is hard to quantify in terms of resources. Additional passthroughs which appear due to node movements do not represent additional FF cost, but they may increase the complexity of the interconnections. However, the number of passthroughs would become significant for a multi-row architecture capable of pipelining (i.e., activation of multiple rows concurrently). In this scenario, data must be properly synchronized between rows (as per a typical fully pipelined datapath). Additional register stages which appear as a consequence of the list scheduling process could imply significant additional resources.

To conclude, the list scheduling efforts for this particular architecture do not yield any significant improvements. Additionally, even if a decrease in the number of arithmetic FUs had taken place, it is difficult to relate these high-level design aspects to low-level resources (i.e., LUTs and FFs), both because other accelerator architectural aspects vary as a function of this and because of the effects of the synthesis tool's optimizations.

5.3.7 Resource Requirements and Operating Frequency

Figure 5.9 shows the resource requirements of the accelerators, and the reported synthesis frequency. The resources (left axis) are normalized to the resource requirements of a single MicroBlaze: 1361 LUTs and 1028 FFs (according to synthesis reports). Considering all cases, the average accelerator requires 2.8× the LUTs and 2.6× the FFs of a MicroBlaze. The average accelerator corresponds to approximately 50 % of all system resources (excluding BRAMs). The average number of slices for the entire system is 2027. According to post-map reports, a MicroBlaze requires 640 slices. As a side note, each LMB multiplexer requires 308 LUTs and 147 FFs.

For this accelerator architecture, the number of FUs (including passthroughs) does not correlate strongly with the number of LUTs. The correlation coefficient is 0.50. Relative to FFs, the coefficient is 0.72. By excluding passthroughs, these two values become 0.72 and 0.89. This is expected since: 1) the specialized connectivity represents LUT costs which do not necessarily scale linearly with the number of FUs and are also affected by multiple configurations; and 2) the passthroughs are no longer registered in most cases.

The accelerator for c5 requires the most LUTs. Although it does not contain the most FUs, nor the highest number of rows or configurations, the joint complexity of these factors results in the highest resource requirement in terms of LUTs. The largest accelerators in terms of number of FUs are actually those for b4 and b9, which are also two of the five cases with the highest number of rows. Despite this, they both require approximately half the LUTs of c5, and of the comparable case of b16, due to supporting only one configuration. The cases for which resource requirements were lower are, predictably, those with fewer FUs and supporting only a single configuration.

The connectivity specialization gains can be measured against the resource requirements of the accelerators in Chapter 4. Considering all cases, the accelerators for this evaluation require 0.90× the LUTs and 1.69× the FFs relative to the average accelerator in Chapter 4 (including the two multi-configuration cases).

Figure 5.9: Resource requirements and synthesis frequency of the generated accelerators.

The decrease in LUTs is most likely related to the connectivity specialization. The increase in FFs is related to the increase in the number of FUs, as each non-passthrough FU registers its outputs. For a better comparison, consider a subset composed of the 15 benchmarks containing the least amount of FUs (excluding passthroughs), taken from all three sets of this evaluation. The average number of FUs in this subset is 16, which is still 1.92× the average for the 15 single-configuration cases in Chapter 4 (i.e., excluding m1 and m2). Still, the average number of required LUTs and FFs is 0.38× and 0.89× relative to that case, respectively. As a second comparison, consider the multi-configuration cases of Chapter 4 (m1 and m2), and cases c1, c3, and c6, which have a similar number of FUs, passthroughs and configurations. These three cases required an average of 0.63× the LUTs and 0.89× the FFs relative to m1 and m2.

The average synthesis frequency is 97 MHz when considering all benchmarks, but there is a distinct behaviour between sets in terms of operating frequency. For set A, the if-conversion process adds at least one multiplication instruction to the trace. As a result, the frequency for all cases is identical (save one) because the critical path is the same in all instances: it occurs due to the multiplication FU. For a12, the division FU contains the same multiplication logic and requires additional arithmetic, leading to a slightly worse synthesis frequency. The critical path is due to an inefficient multiplication FU. The first accelerator design required single-cycle FUs, so the multiplication was implemented as a combinatorial operation, which was reused in this design.

For set B, there is greater variability in synthesis frequency. For b5, b9, b12, and b16 through b18, an instantiated multiplication unit is again the cause of the critical path. For the remaining cases, there is no single recurring critical path throughout all accelerators, but the critical path for each typically involves signals from the memory port logic. The frequency for these cases varies relatively little (a standard deviation of 18.4 MHz around a mean of 140.2 MHz). Consider again the subset of 8 kernels tested with the PLB based accelerator, which instead of the static memory access schedules employs runtime arbitration. The average synthesis frequency is 107.7 MHz, versus 122.2 MHz for the same subset according to the results shown in Fig. 5.9.

For set C, every critical path includes a multiplexer followed by an FU (either a multiplier or the carry chain of an adder), except for c3 and c6, where memory access logic introduces the critical path. The average number of LUTs for the accelerators in set C is 2.19× that of set B, and the frequency is 0.69× that of set B. This is a considerable downturn in cost per performance, especially since the geometric mean speedup is 0.96× (only three cases benefit from a modest speedup).

The synthesis frequency of the accelerator is lower than the baseline 66 MHz only for a12 and c3, but when accounting for the entire system, 9 systems operate below the baseline frequency. These cases are marked with an asterisk (*) in Fig. 5.9: c5 operates at 33 MHz and the remaining cases at 50 MHz. For c5, the passthroughs in the first and third rows were registered, to attempt a reduction in critical path length.
For all other cases, the passthroughs were not registered. Decreases in frequency occur mostly for set C, for which the accelerators are much larger than the average. For instance, the number of LUTs required by the accelerator for c5 corresponds to 96 % of all system LUTs. This corresponds to over half the LUTs on the device. This in itself justifies the lower synthesis frequency as the designs become more constrained, although it is merely a consequence of employing a low-end Spartan-6 FPGA. The largest system for these experiments, c5, requires 5929 slices. This is not a large design relative to high-end devices (e.g., the Virtex-7 family, with up to 74650 slices), but it represents 87 % of the Spartan-6 FPGA used. Regardless, this type of multi-row design and translation quickly becomes less viable for large traces, especially if a rich connectivity to support multiple configurations is desired. This, in part, led to the final accelerator design presented in the next chapter, which deals efficiently with multi-loop support, incurring very little cost per additional supported loop, registering nearly no decrease in operating frequency, and maximizing the performance of each CDFG.

5.3.8 Power and Energy Consumption

Augmenting a MicroBlaze with an accelerator has consequences both for performance and for power and energy consumption. The additional hardware implies additional power consumption, and by migrating some of the application execution to it, the power consumption of the MicroBlaze and of the remaining system elements varies as well. The application domain targeted by this approach consists of embedded tasks.

Table 5.3: Power consumption for software-only and accelerated runs

              Software-only Power           Accelerator-enabled Power
              Consumption (mW)              Consumption (mW)
ID        System     MB    BRAM       System          MB            BRAM        Acc.
b15       145.78   15.46   8.76    141.79 (-3%)   11.58 (-25%)   6.26 (-29%)    0.38
b17       149.84   19.87   7.72    144.74 (-3%)   15.16 (-24%)   5.76 (-25%)    0.14
c1        148.69   18.67   8.80    139.64 (-6%)    8.45 (-55%)   5.28 (-40%)    3.35
c2        147.24   17.25   8.45    141.38 (-4%)   12.11 (-30%)   5.91 (-30%)    0.87
c3        127.06    8.32   1.90    123.34 (-3%)    4.63 (-44%)   1.10 (-42%)    0.70
c4        149.04   19.38   8.93    127.18 (-15%)   2.09 (-89%)   1.75 (-80%)    5.89
c6        150.39   18.97   8.39    132.88 (-12%)   5.71 (-70%)   3.47 (-59%)    4.47
mean      145.43   16.85   7.56    135.85 (-6%)    8.53 (-48%)   4.22 (-44%)    2.26

Table 5.4: Energy consumption for software-only and accelerated runs

              Software-only Energy          Accelerator-enabled Energy
              Consumption (uJ)              Consumption (uJ)
ID        System      MB     BRAM       System            MB            BRAM        Acc.
b15         2.61     0.28    0.16      1.15 (-56%)     0.09 (-66%)    0.05 (-68%)   0.00
b17         2.89     0.38    0.15      2.66 (-8%)      0.28 (-27%)    0.11 (-29%)   0.00
c1        108.32    13.60    6.41    130.56 (+21%)     7.90 (-42%)    4.94 (-23%)   3.13
c2         74.50     8.73    4.28     62.13 (-17%)     5.32 (-39%)    2.60 (-39%)   0.38
c3         38.37     2.51    0.57     34.00 (-11%)     1.28 (-49%)    0.30 (-47%)   0.19
c4        130.29    16.94    7.81    115.79 (-11%)     1.90 (-89%)    1.59 (-80%)   5.36
c6       2285.95   288.35  127.53   1834.21 (-20%)    78.82 (-73%)   47.90 (-62%)  61.70
mean      377.56    47.26   20.99    311.50 (-15%)    13.66 (-55%)    8.21 (-50%)  10.11

Depending on the specific application, the relevant metric might be either energy (for energy-constrained systems) or power (for applications which execute continuously for an indeterminate period of time). To estimate the power consumption of both software-only and accelerator-enabled runs, a subset of the implemented systems was put through ModelSim simulation. For each benchmark, the respective post-place-and-route system was simulated to application completion. The testbench generated a node activity file which was used by Xilinx's XPower Analyzer to estimate a hierarchical breakdown of the average power consumption. The total energy consumption was then computed from the execution times.

Table 5.3 and Table 5.4 show the power and energy consumption for 7 benchmarks, mostly from set C. The first three columns show the hierarchical power consumption for software-only execution, and the following four for accelerator-enabled execution. The System category accounts for all components (MicroBlaze, BRAMs and accelerator included). The energy changes, shown as percentages, take into account the execution time of each case. The execution times are computed given the operating frequency of each case and the number of clock cycles required to execute the benchmark, as measured by timer/counter modules.

In general, the energy consumption decreases for almost all cases, despite the fact that the operating frequency is 50 MHz for b17, c1, c3 and c4. Directly comparing the energy consumption of these benchmarks is not very conclusive, due to the different running times of each case. There is a general marginal decrease in average power consumption for the entire system. Regardless of the total execution time and total energy consumption, the average power consumption of the MicroBlaze and BRAMs decreases for all cases, the averages being 48 % and 44 %, respectively. The greatest decreases in power consumption by both the MicroBlaze and BRAMs happen for c1, c4 and c6. This is due to the number of iterations executed on the accelerator for these cases being the highest for this subset, i.e., more work is offloaded from the MicroBlaze to the accelerator. The decrease in BRAM power consumption is related, since the accelerator does not have to access it to retrieve instructions; the total number of data accesses, however, remains the same.

For c1 and c4, the operating frequency of the system, when augmented by the accelerator, was 50 MHz. As a result, there were marginal slowdowns. For c1, the relationship between decreased power and increased runtime actually resulted in an increase in total energy consumption. For c4, the trade-off was beneficial energy-wise, leading to a small decrease in energy consumption despite the increase in total execution time. Note, however, that the decrease in operating frequency penalizes the portions of the application which do not execute on the accelerator.

Executing the most code on the accelerator for these cases also means its power consumption is higher, but for all three cases the accelerator consumes less power than what was saved on the remaining system. For the entire subset, the average increase in power consumption due to the accelerator is 2.26 mW, whilst the difference in total power consumption is 9.58 mW, in favour of the accelerator-enabled run. The power analysis also breaks down consumption into total and dynamic.
In this evaluation, the dynamic power consumption is of particular relevance, since any differences are due to the different switching activities of the MicroBlaze and the accelerator. The average decrease in dynamic power is 4.29 mW, which corresponds to a decrease of 3.58 % in dynamic power alone, and 2.10 % of the total power consumption.
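In the terms used above, the energy figures follow directly from the average power estimates and the measured cycle counts (a restatement of the procedure, in my notation):

    E = \bar{P} \cdot t_{\text{exec}}, \qquad t_{\text{exec}} = \frac{N_{\text{cycles}}}{f_{\text{clk}}}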

5.4 Concluding Remarks

The main contribution presented in this chapter is the flexible support for accelerator memory access. By allowing the accelerator to directly access the entire data memory of the processor, without incurring data transfer overhead, it is possible to accelerate a wider range of trace loops. This is very significant, as data-oriented loops operate on one or more input data arrays or streams, typically producing a set of data per iteration. The support for two concurrent accesses to arbitrary addresses efficiently exploits data parallelism, as the experimental results demonstrated. Additionally, the translation tools were modified in an attempt to better utilize FUs by employing list scheduling. Although a minor improvement was observed, it only occurred for a small subset of all targeted traces. The re-utilization of FUs in this manner is limited by the node mobility, and is also a function of the similarity between the CDFGs being translated.

The greatest improvement in resource requirements, relative to the previous accelerator implementation, was the specialization of the inter-row connectivity. Since less reconfiguration information is required, the migration overhead is decreased. The resource requirements also decrease, with the average accelerator instance requiring approximately as many resources as a MicroBlaze processor. Abandoning the crossbar-based design is a valid design decision, since the connectivity is effectively known a priori. General interconnects such as crossbars are more appropriate for scenarios where the configuration information is generated during runtime (i.e., on-chip).

A significant limitation of the architecture presented in this chapter is the lack of exploitation of inter-iteration ILP. That is, the accelerator is only capable of concurrently executing operations belonging to the same trace iteration. As a result, many FUs remain idle during accelerator execution at any given time. Finally, the lack of support for floating-point operations also hinders the applicability of the approach. The following chapter deals with these two issues by relying on loop-pipelining and introducing fully-pipelined FUs for floating-point operations.

Chapter 6

Modulo Scheduling onto Customized Single-Row Accelerators

In the previous chapters it was shown that accelerating the targeted loop traces, i.e., Megablocks, results in noticeable performance improvements over a software baseline. The multi-row accelerator architectures and tool chain proved sufficient as proofs of concept for the transparent binary acceleration approach. Augmenting the first accelerator design with memory access support further demonstrated the acceleration possible with straightforward and automated accelerator generation. However, the presented multi-row architectures still suffer from some drawbacks.

Firstly, there are resource considerations. Since the multi-row array is a one-to-one translation of CDFGs, the amount of required resources becomes prohibitive for large CDFGs. Large CDFGs are typically extracted from large loops, which means that being unable to accelerate these regions decreases the applicability of the approach. Also, as the number of Megablocks used to generate a multi-row accelerator increases, the resource requirements due to multiplexers also increase considerably.

Secondly, directly translating CDFGs into a multi-row array is inefficient in terms of resources, since the execution model is intrinsically defined by the structure of the graphs. The re-utilization of each FU is limited, since only local searches within the list scheduling range of each node are performed. For instance, an add FU in the first row cannot be reused by a node which must be scheduled between the third and fourth rows. A new FU is required, and the add FU in the first row may not even be used at all during that time. Also, for the presented architectures, only one row is active at any single time, which leaves all other resources idle. Each row of the array represents one topological level of the CDFG, so loop pipelining could be achieved by backwards connections, increasing the per-cycle utilization of resources. However, this still would not be an optimal solution. For example, if the Initiation Interval (II) is 2, then each row is idle for one out of every two clock cycles; if the II is 3, then each row is idle for two clock cycles out of every three. The II can also increase considerably due to access contention on the limited number of memory ports. Adding pipelining capabilities to the accelerator would only worsen this, since more memory access operations would activate per cycle. Appendix A demonstrates these effects with an implementation of the multi-row architecture, augmented with loop-pipelining.


Figure 6.1: Accelerator template, containing one row of Functional Units and a possible register pool structure, feeding back into multiplexers

This chapter presents an accelerator architecture which addresses these issues in a different manner. Like previous implementations, the accelerator is a heavily parametrizable module, customizable at synthesis time. Figure 6.1 shows the simplified architecture template. Unlike previous designs, this architecture is based on a single customizable row of FUs, similar to a VLIW layout. Previous implementations relied on directly translating the structure of CDFGs into hardware elements. Instead, this implementation relies on a scheduler which efficiently generates compact accelerators by instantiating FUs while modulo scheduling operations.

Adopting a template based on a single row comes from the realization that multi-row designs are adequate when the required inter-FU connectivity is unknown. Namely, inter-FU connectivity must be rich enough to increase the likelihood that future graphs/loops can be successfully mapped. Simultaneously, it must not be too complex due to resource costs, e.g., full crossbars. A mesh design addresses the connectivity issue by transporting data between nearest neighbours and using the FUs themselves to transport data. A row-based array relies on a single direction for data flow and instead increases the number of FUs to ensure that graphs can be mapped. For both cases, the translation/scheduling tools must be sophisticated enough to keep track of data as it travels through the array, especially when loop pipelining via modulo scheduling. For a row-based design, scheduling requires spatial awareness along the width of the array and temporal awareness along its height. In either case, it is more difficult to find valid schedules as the interconnect, FU and storage availability decreases.

In summary, data transport and storage dictate much of the design aspects of an accelerator, more so than the availability of FUs. By generating fully custom accelerators, and not only configuring an existing accelerator structure, a simpler single-row model for the allocation of units can instead be used. The connectivity and registers for storage can be specified after node scheduling and FU instantiation, implementing only the connections required to fulfil the data flow of the translated graphs. The consequence of spatially compacting the accelerator architecture into a single row is that FUs are better utilized. All topological levels of the CDFG are executed by the same resources. The previous multi-row designs remain (reasonably) adequate for loops with an II of 1, since resource utilization is increased. However, the single-row architecture copes more efficiently with graphs containing more load/store operations. Also, large CDFGs are supported more efficiently, since the accelerator size does not scale linearly with the size of the CDFGs.

Throughout this chapter the single-row modulo scheduled accelerator is presented and evaluated. Section 6.1 details its architectural aspects. The main differences relative to previous implementations are the full support for single-precision floating-point operations and the execution model, which efficiently implements loop pipelining. Supporting floating-point operations allows for targeting realistic embedded workloads. These types of operations require several arithmetic steps which could lead to critical paths if implemented in a single clock cycle, so some of the developed floating-point FUs are multi-cycle. This meant adapting the accelerator accordingly.
Unlike a VLIW, however, these units are fully pipelined, meaning that a floating-point operation can be completed every cycle on the same FU. Also, the scheduler is sophisticated enough that multiple other single-cycle operations continue to be issued on other FUs while a previously activated multi-cycle unit operates. The supporting scheduler is shown in Section 6.3, along with a complete scheduling example. Section 6.4 contains an extensive experimental evaluation of the fully implemented accelerator for a total of 25 benchmarks (13 floating-point kernels and 12 integer kernels). This architecture is also compared with several fixed-resource accelerators in Section 6.5 (e.g., instances with a fixed, manually specified, number and type of units) to determine the advantages and disadvantages of employing the proposed full customization approach. In Section 6.6, the modulo-scheduled accelerator is also compared with several VLIW processors, via Hewlett Packard's VLIW Example (VEX) simulator [FFY05], and the open-source ρ-VEX processor [SABW12].

6.1 Accelerator Architecture

Figure 6.1 shows the simplified template for the single-row accelerator. It includes: input and output registers; a set of 32-bit FUs; two load/store ports present in every accelerator instance; multiplexers to route FU inputs; a register pool of 32-bit registers to hold operands/results; and a configuration memory. This memory is read-only and implemented with LUTs (i.e., distributed memory). Specialization of this template involves determining the number of input and output registers, the number and type of FUs, and the number and layout of registers in the pool. Also, the multiplexer connectivity is specified and a set of configuration words is generated per supported loop trace.

The input and output registers hold data exchanged with the MicroBlaze via an FSL interface (from GPP and to GPP). Operands are shifted into the input registers, which are fed into the FUs. These registers are read-only; they hold values that are valid for the first iteration (e.g., the initial value of an iterator variable) and values that remain constant throughout all iterations (e.g., an iteration bound or a base address for memory accesses). The output registers are fed by the FU outputs. Since modulo scheduling overlaps loop iterations, multiple values for the same variable may be produced before a single iteration (i.e., a full set of output values) is complete.

To address this, each FU drives a chain of registers which hold the output values of the CDFG operations it executes. Each chain’s length is determined during scheduling, by computing how long each value needs to live until it is consumed by all downstream operations. This is exemplified by FU1 in Fig. 6.1, where results for two CDFG nodes, a and b, for iterations i and i+1, are stored. The multiplexers are specified based on the knowledge of where every computed result resides at every time step, which is known since the schedule is static. The required values per FU input are fetched from the appropriate register, based on the CDFG operations the respective FU executes

(shown in grey in Fig. 6.1). For example, the value ai might be produced during time step 1, and be consumed during time steps 2 and 3. Value ai can be read from the first register in the chain during time step 2, but if bi is produced in the same time step, then ai will be read from the second register during time step 3. This information is kept for all generated values during scheduling and is used to generate the required connectivity and configuration words.

An accelerator instantiation will contain as many FUs as necessary to achieve a minimum II for all the scheduled loops. Integer and single-precision floating-point arithmetic and comparison operations are supported, as well as bitwise logical operations. Other operations include conversion from floating-point to integer and vice-versa, and a set of units to evaluate termination conditions. Each FU implements a single operation type, except for floating-point addition and subtraction, which are implemented in a single FU. All FUs are pipelined, except for the non-constant integer division and floating-point division, which are multi-cycle units. Implementing pipelined FUs avoids the need to increase the II while scheduling, by effectively increasing resource availability. Most integer units have a latency of 1 clock cycle, except for the division FU (35 clock cycles). A specialized integer division module provides division by a constant with a latency of 3 clock cycles (via reciprocal multiplication). Like the MicroBlaze processor, the floating-point units do not support denormalized operands or issue denormalized results. The floating-point addition, multiplication and division units have latencies of 4, 3 and 32 clock cycles, respectively.

Memory access patterns can be arbitrary, since the loop operations which generate access addresses are also executed on the accelerator. A constant memory access latency is assumed when generating schedules, but the accelerator is also capable of stalling when an access takes longer than the expected time. Since the implementation presented here relies on on-chip FPGA memories, the load/store units have a latency of 2 clock cycles.
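A rough sketch of how the register-chain depth described above could be derived from the static schedule (illustrative only; it assumes each FU drives a simple shift chain that advances whenever the FU produces a new result, as suggested by the ai/bi example, and the data structures are not those of the actual scheduler):

    /* For each value produced by one FU we know, from the static schedule,
     * the time step at which it is produced and the last time step at which
     * it is consumed. The chain must be deep enough for every value to stay
     * reachable until its last use, i.e., deep enough to hold it after all
     * newer results produced before that last use have been pushed in. */
    typedef struct { int produced; int last_use; } Lifetime;

    static int chain_depth(const Lifetime *v, int n) {
        int depth = 0;
        for (int i = 0; i < n; i++) {
            int pos = 1;                       /* value enters the first register   */
            for (int j = 0; j < n; j++)
                if (v[j].produced > v[i].produced && v[j].produced <= v[i].last_use)
                    pos++;                     /* pushed one register further down  */
            if (pos > depth)
                depth = pos;
        }
        return depth;
    }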

6.1.1 Execution Model

The accelerator is idle until it receives a single 32-bit word via a separate FSL (not shown). This word determines which configuration words to read from the internal configuration memory, which FU will feed each output FIFO, and how many operands to expect from the GPP before executing. Operands are received via FSL in a specific order, as the accelerator expects each particular processor register value to reside in a given input register. After the expected number of operands is received, the accelerator sets the starting address into its configuration memory according to the loop to execute. One configuration word is then read per cycle. The accelerator executes the steps of a loop schedule by cycling through a number of configuration states.

Figure 6.2: Representation of one configuration word; width varies per accelerator instance.

A generic representation of a configuration word is shown in Fig. 6.2. Configuration words are single-cycle, and their bit-width varies per instance, depending on the number of FUs, the multiplexer widths and the number of registers in the pool. A single word defines the complete state of the accelerator, that is: which FUs are active, the multiplexer controls and which registers in the pool will be written. It also signals when to commit values to the output FIFOs and when execution is concluded. Unlike the input multiplexers, which can be controlled every cycle by the configuration words, the output FIFOs are driven by a single FU output throughout the execution of a loop. The jmp field denotes how to update the configuration word access address.
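As a rough illustration of the information such a word carries, the following C sketch lays out the kind of fields involved; the field names and widths are purely illustrative, since each real instance packs exactly as many bits as its FU, multiplexer and register counts require.

#include <stdint.h>

/* Illustrative layout of one configuration word, assuming an instance with
 * up to 16 FUs, up to 32 FU-input multiplexers and up to 32 pool registers.
 * Actual instances pack only the bits they need, so the width varies. */
typedef struct {
    uint16_t fu_enable;   /* one enable bit per instantiated FU            */
    uint8_t  mux_sel[32]; /* one select field per FU input multiplexer     */
    uint32_t reg_we;      /* write-enable bit per register in the pool     */
    uint8_t  commit;      /* commit a full set of values to the output FIFOs */
    uint8_t  done;        /* signals that execution has concluded          */
    int8_t   jmp;         /* how to update the configuration access address */
} config_word_t;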

Configuration words are read by a 2-stage pipeline, similar to fetch and decode/issue. The two stages are required to update the configuration access address in a timely fashion, without introducing idle time when cycling through the words of the steady state. One word corresponds to one time step of the respective modulo schedule. A schedule has as many time steps, t, as required so that all the N operations of the loop are executed on the F instantiated FUs, where F ≤ N.

Executing operations in multi-cycle FUs, such as the division unit, does not halt execution. Configuration words continue to be read while the operation completes, as the static schedule takes the FU latency into account. If the FU is also pipelined (e.g., floating-point addition), both aspects are taken into account while scheduling: the FU can be enabled every cycle and it is known at which time each result will be produced. In general, operations without control or data dependencies execute in the same time step. By scheduling, in a single time step, operations from two or more successive iterations (loop pipelining), iterations are initiated and completed in less than t time steps, depending on the data and control dependencies between successive iterations.
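Assuming a single-iteration schedule of t time steps and an initiation interval II, the cycle count for k iterations under this loop-pipelining scheme can be approximated as

cycles(k) ≈ t + (k − 1) · II

so that, for a large number of iterations, the average cost per iteration approaches II rather than t.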

The number of iterations does not need to be known either at synthesis time or prior to the start of execution. The termination conditions (i.e., loop exits) of the executing loop can be evaluated on every iteration by the FUs. When an iteration triggers an exit, the jmp field is ignored from that moment on. Execution continues through the configuration words of the epilogue. At this stage, the most recent (still incomplete) iteration is discarded, no new iterations are initiated and ongoing iterations are completed. The top of the output FIFOs will then contain a full set of output values which correspond to GPP register file contents. The GPP retrieves these values in a known order, by executing the remainder of the CR, and branches back to the start address of the Megablock, completing the last loop iteration in software.

Figure 6.3: Architecture-specific flow for the single-row accelerator

6.2 Architecture Specific Tool Flow

Figure 6.3 shows the segment of the full tool flow specific to this accelerator design. This new loop accelerator architecture required considerable re-design of parts of the tool flow. The accelerator interface is unmodified relative to Chapter 5, so CR generation is unchanged: the CRs are placed in C containers in a single source code file. The application is recompiled using a custom linker script which places them at a user-defined location. The injector's address table is a Verilog include file containing only the start addresses of the Megablocks and the location of each respective CR. The scheduler replaces the CDFG translation step of the previous flows. Some of the preprocessing steps of said flows are preserved, and additional minor steps transform the data structure for compliance with the scheduler. The preprocessing steps are performed by a stripped-down build of the translation tool written in C. The configuration words are output in text format as a synthesis-tool-compliant MEM file, which is read during accelerator synthesis. The accelerator HDL is a single Verilog include file (unlike previous flows, which produced multiple output files). It is similar to the output shown in Listing 3.1, and defines every aspect of the instantiation. The scheduler is currently fully implemented in MATLAB. The following section explains how the scheduler generates an accelerator instance and provides a complete scheduling example.

6.3 Accelerator Generation and Loop Scheduling

When modulo scheduling [Rau94] for fixed architectures, there are resource and temporal restrictions to take into account: 1) given all types of operations in a loop, the target architecture must contain at least one FU capable of implementing each one; 2) operation parallelism and II still depend on the spatial and temporal availability of FUs, which determine performance. In these situations, when a modulo scheduler cannot schedule a loop with a given (minimum) II, the II is increased until scheduling becomes feasible. Increasing the II makes it possible to schedule a loop onto the minimum set of resources, but leads to decreased performance, especially when the minimum possible II is low (e.g., increasing the II from 1 clock cycle to 2 approximately halves the performance).

Figure 6.4: Execution flow of modulo scheduling for the single-row accelerator

The CDFGs derived from the extracted Megablock traces typically have low IIs. Given this, the present approach adds the required FUs to each accelerator instance during scheduling, to avoid degrading performance.

Figure 6.4 summarizes the stages of the modulo scheduler. The scheduler receives each CDFG as a data structure specifying each node's type, inputs and topological level. Also specified are the MicroBlaze registers which contain inputs and hold outputs. The scheduler first determines the II of each CDFG. Each graph is then scheduled independently. The instantiated FUs are fed back into the scheduling process for every subsequent graph. This first stage results in a single row of FUs, and annotates each CDFG node with its assigned FU and execution time. The next step takes each annotated graph and: 1) determines how long each computed value needs to be stored, depending on when nodes are scheduled and on the producer/consumer relationship between them, and 2) determines the length of each FU output register chain. That is, each graph imposes different storage requirements per FU. The final register pool is composed by taking the maximum length of each FU output chain throughout all scheduled graphs. This step also determines in which register each computed value resides at a given time. The scheduler aggregates this information for all graphs to determine the connectivity of the input multiplexers, by fetching values from the fully specified register pool. A sequence of configuration words per schedule is then generated. Finally, the accelerator specification is produced in HDL, along with a memory initialization file containing the configuration words. The following section details this process with an example, starting from an extracted CDFG and resulting in a customized accelerator instance.

6.3.1 Scheduling Example

FU Allocation Figure 6.5 shows an example of the type of CDFG the scheduler accepts. The nodes represent GPP instructions, edges represent the data flow between nodes, and the inputs (top) and outputs (bottom) represent GPP registers. Input registers shown in dotted lines are only read during the first iteration. The graph itself represents one iteration of the original Megablock loop. First, the scheduler computes the II, which can be determined by: backwards data edges (i.e., data flow across iterations), resource restrictions (in this case, the two memory ports) or control edges.

Figure 6.5: Example CDFG, showing operations and a cyclical control dependency which determines the Initiation Interval (II)

It is necessary to consider control edges because the accelerator has no knowledge regarding the number of iterations to execute when invoked. That is, a new iteration can only begin after all exit conditions of the current one are evaluated as false. For this example, the bge (branch if greater-than) node sets an II of 3 clock cycles, as all nodes are executed by single-cycle FUs. At the start of the process, no accelerator architectural aspects are defined besides the two memory ports. As nodes are scheduled, FUs are added to the array when necessary; connectivity is assumed to be unlimited. Figure 6.6 shows the complete schedule for the nodes of Fig. 6.5, placed temporally along the vertical axis and spatially along the horizontal axis. A schedule is composed of a prologue, a steady state and an epilogue.
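As a rough sketch of how the minimum II could be derived from these constraints (recurrences formed by backwards data or control edges, and the two fixed memory ports), consider the following helper; function and parameter names are illustrative:

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

int minimum_ii(int longest_recurrence_cycle,  /* cycles around the longest backwards/control recurrence */
               int num_memory_ops)            /* loads + stores per iteration */
{
    const int MEM_PORTS = 2;                  /* only fixed resource in this design */
    int rec_mii = longest_recurrence_cycle;
    int res_mii = ceil_div(num_memory_ops, MEM_PORTS);
    return rec_mii > res_mii ? rec_mii : res_mii;
}

For the example of Fig. 6.5, the 3-cycle control recurrence through the bge node dominates the memory-port constraint, so an II of 3 results.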

Nodes are list scheduled individually in topological order. For every node, the earliest (et) and latest (lt) possible schedule times are computed. The et of a node is calculated based on its upstream nodes (or input registers), while the lt is typically unbounded. The lt is only relevant for nodes which originate a backwards edge, since scheduling them too late relative to the other nodes in the closed circuit would violate the II. In this example, nodes 3, 7 and 8 have no slack, nodes 9 and 5 can be delayed up to t = 3, and the remaining nodes could be delayed to any later time without consequence. The scheduler attempts to schedule all nodes as early as possible. In Fig. 6.6, node 6 was scheduled first at the earliest possible time, t = 1. Since the II was computed to be 3, all time slots i for which i mod 3 = 1 are occupied by future executions of node 6 (shown in grey). When scheduling node 3, an add FU is added, and node 3 is scheduled at a time which does not violate the II, t = 1. Nodes 5 and 9 can also execute on this FU, since both can be delayed until t = 3 at most. If another add node existed, and if it could not be delayed to a later time, a new FU would be added and the node scheduled at its et.
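A condensed sketch of this placement policy is given below; the data structures are illustrative (the actual scheduler is implemented in MATLAB), the MRT is assumed to hold at most 64 modulo slots, and the caller is assumed to pass arrays with enough spare capacity for a newly added FU.

typedef struct { int type; int et; int lt; int fu; int time; } node_t;

/* Place one node: try every time from et to lt on an existing FU of the
 * right type whose modulo slot (t mod II) is free; if none fits, add a new
 * FU and schedule the node at its earliest time.  Returns the FU count. */
int place_node(node_t *n, int ii, int fu_count,
               int fu_type[], char mrt[][64] /* mrt[fu][slot] */)
{
    for (int t = n->et; t <= n->lt; t++) {
        for (int f = 0; f < fu_count; f++) {
            if (fu_type[f] == n->type && !mrt[f][t % ii]) {
                mrt[f][t % ii] = 1;       /* reserve the modulo slot */
                n->fu = f; n->time = t;
                return fu_count;          /* no new FU needed */
            }
        }
    }
    fu_type[fu_count] = n->type;          /* allocate a new FU */
    mrt[fu_count][n->et % ii] = 1;
    n->fu = fu_count; n->time = n->et;
    return fu_count + 1;
}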

The steady state of the schedule is a time frame with II time steps. It starts at a time ts where ts mod II = 1, and contains the time step in which the iteration started in the prologue completes.

Figure 6.6: Modulo schedule for the example CDFG

In this case, this happens after only one repetition, at t = 5. This sequence of time steps can also be referred to as the Modulo Reservation Table (MRT), and it contains all the nodes of the scheduled CDFG. All time steps prior to t = 5 belong to the prologue, and all after t = 7 to the epilogue. More CDFGs could now be scheduled onto the existing array. As more loops are scheduled, fewer FUs are added to the accelerator, since FU availability increases. However, there is an increase in input multiplexer complexity, which depends on the connections between operations. When there is more than one available FU on which to place a new node, the scheduler attempts to reduce resource consumption by choosing the FU which will result in the least added inter-FU connectivity (based on nodes of previously scheduled graphs).

Architecture Specialization Everything after the scheduling stage is related to generation of the register pool, input multiplexers, output FIFOs and configuration words. First, the scheduler computes when each value is produced and for how long it must be stored before it is consumed by all downstream nodes which require it. With this information, a chain of registers is built for each FU output. A register is added to the chain whenever a value is produced before the existing value is consumed. When a new value is produced, the existing values are shifted down. However, a value in the middle of a chain may be consumed before values produced in previous time steps. The chain length is shortened as much as possible by having an individual write-enable per register and shifting data only up to the point of the first expired datum.

(a) Operation assignments for one loop schedule

(b) Generated accelerator hardware structure

Figure 6.7: Operation assignments and register connections for one loop schedule and actual hardware structure of the single-row accelerator instance

The (simplified) accelerator generated for the graph in Fig. 6.5 is shown in Fig. 6.7a. The iadd FU requires four registers to hold results, since most will only be read by the FU in the next iteration. The bll FU requires two registers, because the or node will consume a value from the input register during the first iteration, and only afterwards will it fetch values produced by the bll; two registers are required to synchronize the data across iterations. After generating the chains of registers, it is known where each value will reside at every time step. The input multiplexers are then generated based on the location of the desired values in the registers. If a given input receives data from only one pool register, the multiplexer is optimized away. For simplicity, the connectivity is represented by the numbered boxes in Fig. 6.7a. Some FUs receive values from registers that remain constant throughout execution. Literal constants fed into the FUs are omitted in the figure.
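One possible way to derive the chain depth of each FU from the producer/consumer times recorded during scheduling is sketched below; it ignores the chain-shortening optimization based on individual write-enables described above, and the types and names are illustrative only.

typedef struct { int fu; int produced_at; int last_read_at; } value_t;

/* For each value produced by the given FU, count how many later productions
 * by the same FU occur before the value's last read: each one pushes the
 * value one register further down the chain.  The chain must be as deep as
 * the worst case over all values. */
int chain_length_for_fu(const value_t *vals, int n_vals, int fu,
                        const int *fu_prod_times, int n_prod)
{
    int depth_needed = 1;
    for (int v = 0; v < n_vals; v++) {
        if (vals[v].fu != fu) continue;
        int pushes = 0;
        for (int p = 0; p < n_prod; p++)
            if (fu_prod_times[p] > vals[v].produced_at &&
                fu_prod_times[p] <= vals[v].last_read_at)
                pushes++;
        if (pushes + 1 > depth_needed)
            depth_needed = pushes + 1;
    }
    return depth_needed;
}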

Generating Configuration Words The last step is the generation of the configuration words. A configuration word contains FU enable bits, one-hot encoded multiplexer control bits and register pool write-enable bits. One configuration word is generated for each schedule time step, covering the prologue, steady state and epilogue. To generate this complete sequence it is only necessary to consider the first iteration (i.e., the solid black nodes in Fig. 6.6). The scheduler iterates through every FU, starting from time step t = 1 (i.e., the prologue) up until the time step where the initial iteration ends, t = 5. If there is a scheduled node, the enable bit of the FU and the input multiplexer bits are set. Based on the register pool generation step, the write-enable bits for the required registers in the chain of the respective FU are also set.

Figure 6.8: System architecture for validation of the single-row modulo scheduled accelerator

The resulting set of words is used to create the full configuration word sequence, from prologue to epilogue, by repeating it every II time steps up until the time step where the steady state starts. The configuration memory contains one sequence of configuration words (i.e., epilogue, steady state and prologue) per scheduled loop. Despite the apparently horizontal layout, the dedicated, and in some cases minimal, connectivity between FUs effectively implements circuits with distinct structures. The example in Fig. 6.7 was generated from two graphs: the graph shown in Fig. 6.5 and a second, very similar graph. In Fig. 6.7b, most of the structure is used by either loop, but some instantiated components are only required for one of them (shown in grey). Registers with a bold outline hold values that will be sent back to the main processor.
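A minimal sketch of the word-generation loop just described is shown below; only the FU enable bits are filled in, the multiplexer and write-enable fields are omitted, and the types are simplified stand-ins for the per-instance word layout (no more than 32 FUs are assumed).

#include <stdint.h>

typedef struct { uint32_t fu_enable, reg_we; /* mux fields omitted */ } cfg_word_t;
typedef struct { int fu, time; } sched_node_t;

/* Walk the first iteration's schedule (time steps 1..t_end) and, for every
 * FU with a node scheduled at that step, set its enable bit in the word. */
void emit_words(const sched_node_t *nodes, int n_nodes, int t_end,
                cfg_word_t *words /* one per time step, zero-initialised */)
{
    for (int t = 1; t <= t_end; t++)
        for (int i = 0; i < n_nodes; i++)
            if (nodes[i].time == t)
                words[t - 1].fu_enable |= 1u << nodes[i].fu;
}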

6.4 Experimental Evaluation

The performance of this architecture and scheduling approach was evaluated with 13 single-precision floating-point benchmarks from the Livermore Loops [Tim92] and 11 integer benchmarks from the Texas Instruments IMGLIB function library [Tex]. The chosen benchmarks were those with little to no control flow in the innermost loop and few or no operations in the outer loops; that is, perfectly nested loops are preferable.

6.4.1 Hardware Setup

The accelerator was coupled to a local memory system as shown in Fig. 6.8 and similar to the implementation presented in Chapter 5: a single local memory holds all the code and data, a single MicroBlaze processor executes the application from this memory and is augmented with the modulo scheduled accelerator via an FSL. The necessary injector and LMB multiplexer modules are present to allow for transparent migration of execution and shared memory access. As before, the injector modifies the instruction bus when a Megablock start address is detected and also sends a single configuration word to the accelerator, to select which loop to execute.

The target device for this implementation was a Virtex-7 xc7vx485. For bitstream generation, Xilinx's ISE Design Suite 14.7 was used; the synthesis optimization goal for the accelerator was set to speed, and the placement and routing effort was set to high.

6.4.2 Software Setup

For previous implementations, each individual benchmark required modifications to insert startup code, define static input data arrays, retrieve execution times and output results for functional verification. Instead, for the evaluation of this architecture, a software harness was developed to streamline the process of retrieving performance results and to increase test flexibility. To clarify the execution environment of the tested kernels, the harness is explained in this section.

The harness can be compiled for a desktop machine or for the MicroBlaze processor. It also accepts other compile-time parameters defining whether it uses heap allocation or emulates a heap via a static array. The harness' structure is a significantly modified version of CoreMark's benchmark for the Livermore Loops [EEM15]. The harness expects a single source code file containing all kernels, where each kernel is contained in a single function. This compilation flow is summarized in Fig. 6.9. The harness can also be compiled for compatibility with a VLIW simulation tool suite (detailed in Section 6.6) and for ModelSim simulation.

The desktop version accepts call-time parameters that determine: which kernels to call, the amount of data to process (N) and the number of times to repeat the kernel code (L). The input data to be processed is generated at runtime by a pseudo-random generator and placed into arrays allocated on the heap. The generator seed is also a variable parameter. The purpose of executing the harness on a desktop machine is to generate reference data. This reference data is compiled into the harness when targeting the MicroBlaze, so that the functional correctness of the generated accelerators can be verified when executing on the target board. The reference data also includes the call parameters used, so that the embedded version calls the same kernels under the same conditions. Keeping the kernel functions in a single separate file ensures that no optimizations are made across function boundaries based on the given N and L values (which are statically defined when compiling for an embedded environment).

At startup, the harness creates a bank of random integers and a bank of random single-precision numbers according to the seed. It then calls one or more kernels based on either the call parameters or static compile-time values. As an example, Listing 6.1 shows the inner_product kernel from the Livermore Loops adapted to the harness (lines 20 and 21). For each kernel to execute, the harness first calls an initialization function to allocate input data. In this case, two arrays are allocated and initialized with the pseudo-random data generated during startup. The execution time of each kernel is then measured, by enabling the custom timer/counters prior to the function call. To minimize the effect of the call itself on the measurement accuracy, the outermost loop is repeated L times. To prevent the compiler from optimizing the outer loop (at line 19) away and preserving only one iteration of the inner loop, array z maintains a dependency with the outer loop iterator. This same approach is used throughout the release of the Livermore Loops. Some cases are slightly modified according to need, and the same technique is applied when adapting the integer kernels to the harness.

Figure 6.9: Compilation flow of the test harness

The use of -O0 or -O1 could also be enforced to prevent the optimization. However, this would result in sub-par code for software execution, thereby producing an overly advantageous comparison scenario for the accelerators.

Listing 6.1: Inner product kernel adapted for test harness integration

 1 void inner_product_init(test_params *p){
 2
 3   // allocate 2 vectors in vector array "iv"
 4   p->iv[0] = (int *) calloc((p->N), sizeof(int));
 5   p->iv[1] = (int *) calloc((p->N+p->L), sizeof(int));
 6
 7   // fill vector v[0] and v[1] with random data
 8   reinit_ivec(p, p->iv[0], p->N, 0xffffffff);
 9   reinit_ivec(p, p->iv[1], (p->N+p->L), 0xffffffff);
10   return;
11 }
12
13 float inner_product(test_params *p){
14
15   int q = 0, *x = p->iv[0], *z = p->iv[1];
16   int l, k, n = p->N, loop = p->L;
17
18   // Kernel A -- inner product (integer)
19   for(l = 1; l <= loop; l++)
20     for(k = 0; k < n; k++)
21       q += z[k+l]*x[k]; // z indexed with l: dependency on the outer iterator (inner-loop body reconstructed)
22
23   return (float) q;
24 }
25
26 // cleanup function (body partially reconstructed): frees the vectors allocated at init
27 void inner_product_free(test_params *p){
28   int i;
29   for(i = 0; i < 2; i++)
30     free(p->iv[i]);
31   return;
32 }

Finally, the harness reads the execution time (and other information) from the loop accelerator after the call returns. The allocated arrays are then freed and the next kernel is called. The desktop version of the harness outputs one reference value per kernel (by generating a checksum from all the produced data). The embedded version outputs the measured execution times of the kernels (running on the accelerator or on the MicroBlaze), and a comparison with the reference data.

In summary, the harness allows for: easily testing multi-loop accelerators, by choosing which combination of kernels to call; testing the effects of overhead, by varying the amount of data to process; and testing the functional correctness of the accelerators, by varying the input seed and therefore the input data. However, it is not geared towards testing kernels which require structured data.

For this evaluation, a single C file contains all the kernel functions used, from the Livermore and IMGLIB sets, each enclosed within a function in the explained manner. The loop traces from each kernel were extracted by running MicroBlaze versions of the entire harness through the Megablock extractor. The compiler used was mb-gcc 4.6.4, and the compilation flags enabled the use of floating-point, integer multiplication and division, barrel-shift, and comparison operations.
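The per-kernel measurement procedure described in this section can be summarized by the following sketch; the timer and reporting hooks are placeholders and do not correspond to the actual harness API.

typedef struct test_params test_params;   /* opaque; defined by the harness */

/* placeholder hooks; names are illustrative, not the real harness functions */
extern void     timer_start(void);
extern unsigned timer_read(void);
extern void     report(test_params *p, float result, unsigned cycles);

void run_kernel(test_params *p,
                void  (*init)(test_params *),
                float (*kernel)(test_params *),
                void  (*cleanup)(test_params *))
{
    init(p);                       /* allocate and fill the input arrays      */
    timer_start();                 /* enable the custom timer/counters        */
    float result = kernel(p);      /* kernel repeats its outer loop L times   */
    unsigned cycles = timer_read();
    report(p, result, cycles);     /* check against the compiled-in reference */
    cleanup(p);                    /* free the arrays before the next kernel  */
}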

6.4.3 Performance vs. MicroBlaze Processor

Table 6.1 summarizes the characteristics of the accelerators and the corresponding kernel speedups: the results for the floating-point set are on the top half, and those for the integer set on the bottom half. The last two integer kernels (innerprod and matmul) are not from IMGLIB, but are direct conversions of the corresponding floating-point versions. The averages shown are geometric for the speedups and arithmetic for the remaining metrics. The third column shows the II of the schedules; the average II is given for the cases where multiple traces were accelerated. The fourth column shows the average executed IPC, which measures the parallelism exploited by the accelerator. The next two columns show the number of FUs and the number of registers in the register pool; they are indicative of the complexity of the accelerator. The last two columns show the speedups obtained when executing each kernel L = 1000 times for two different amounts of data (specified by the parameter N). All speedup results include the overhead of passing data from the MicroBlaze processor to the accelerator and retrieving the results. The benchmarks f5, f6 and f11 were run only for N = 1024, because the system does not have sufficient on-chip memory to support the N = 4096 case; for instance, f5 requires 15 arrays of N floating-point numbers. All systems ran at 110 MHz, both for software-only and accelerator-enabled runs.

As noted in Table 6.1, only one candidate Megablock was extracted for each kernel, save for the cases of f5, f9 and f11. The average number of instructions in each trace was 34 when considering all kernels. Considering the sets separately, the floating-point set has on average fewer instructions than the integer set: 26 versus 45. However, the integer average is inflated by the cases of i8 and i9, whose accelerated Megablocks contain the most instructions out of all those implemented, 142 for both. For the traces in the floating-point set, an average of 7.2 instructions are floating-point instructions, which represents 27.6 % of total trace instructions.

Taking into account the number of instructions in each implemented loop over the number of clock cycles that the MicroBlaze requires to execute it, we find that the MicroBlaze executes the equivalent of 0.38 IPC for the floating-point set and 0.67 for the integer set. As for the IPC achieved on the accelerator, IPC_HW, it is computed as the number of instructions in a Megablock over the II of the respective schedule.

Table 6.1: Generated accelerator characteristics and achieved speedups

ID  | Kernel          | II   | IPC_HW | # FUs | # RP Regs. | Speedup (N = 1024) | Speedup (N = 4096)
f1  | cholesky        | 3.0  | 6.67   | 11    | 30         | 7.25               | 9.80
f2  | diffpredict     | 10.0 | 4.40   | 10    | 52         | 6.78               | 7.86
f3  | glinrecurrence  | 3.0  | 4.33   | 9     | 20         | 3.65               | 4.58
f4  | hydro           | 3.0  | 5.67   | 9     | 28         | 12.47              | 12.84
f5  | hydro2d¹        | 16.3 | 3.68   | 12    | 71         | 6.60               | (6.60)*
f6  | hydro2dimp      | 6.0  | 6.17   | 9     | 38         | 11.18              | (11.18)*
f7  | innerprod (fp)  | 4.0  | 2.50   | 6     | 9          | 4.43               | 4.48
f8  | intpredict      | 10.0 | 4.10   | 9     | 54         | 7.99               | 9.46
f9  | linrec²         | 11.0 | 0.95   | 9     | 12         | 2.30               | 2.31
f10 | matmul (fp)     | 3.0  | 5.00   | 10    | 25         | 4.84               | 7.06
f11 | pic1d¹          | 5.5  | 3.73   | 12    | 30         | 1.47               | (1.47)*
f12 | statefrag       | 5.0  | 8.40   | 13    | 49         | 18.47              | 18.98
f13 | tridiag         | 3.0  | 4.00   | 8     | 21         | 6.81               | 6.93
mean|                 | 6.4  | 4.58   | 9.8   | 33.8       | 5.96               | 6.60

i1  | quantize        | 3.0  | 3.33   | 7     | 15         | 3.22               | 3.99
i2  | conv3x3         | 10.0 | 6.40   | 12    | 39         | 6.75               | 7.01
i3  | perimeter       | 3.0  | 7.00   | 13    | 34         | 7.35               | 7.81
i4  | boundary        | 4.0  | 6.00   | 12    | 25         | 2.92               | 3.59
i5  | sad16x16        | 2.0  | 6.50   | 9     | 13         | 2.31               | 2.31
i6  | mad16x16        | 2.0  | 6.50   | 9     | 13         | 2.30               | 2.30
i7  | sobel           | 5.0  | 8.00   | 14    | 46         | 7.99               | 8.14
i8  | dilate          | 20.0 | 7.10   | 13    | 77         | 5.31               | 5.23
i9  | erode           | 22.0 | 6.45   | 14    | 72         | 5.43               | 5.24
i10 | innerprod (int) | 3.0  | 3.33   | 6     | 8          | 3.93               | 3.98
i11 | matmul (int)    | 3.0  | 5.00   | 8     | 23         | 3.82               | 5.44
mean|                 | 7.0  | 5.97   | 10.6  | 33.2       | 4.27               | 4.61

total mean |          | 6.7  | 5.22   | 10.2  | 33.5       | 6.07               | 5.60

¹ Three loops accelerated; ² two loops accelerated. For all other cases, one loop was accelerated.
* Values taken from the run with N = 1024.

Table 6.1 shows that, despite the similar average IPC_HW values for both sets, the average speedup for the integer set is lower than that of the floating-point set. This is expected, since the IPC_SW for loops with floating-point instructions is lower. As an example, consider the matmul kernel, which contains 15 binary instructions for both the integer and floating-point cases. One iteration of the integer kernel requires 25 clock cycles on the MicroBlaze, but the floating-point version requires 33 cycles. However, both benchmarks execute on accelerators with the same II, resulting in a better speedup for the floating-point version. Execution on the accelerator effectively mitigates the latency of floating-point operations due to the availability of fully pipelined floating-point units and the use of an efficient static schedule.


Figure 6.10: Speedup as a function of input/output data array sizes for integer and floating-point versions of innerprod and matmul

Consider as another example i7 and f12, whose Megablocks contain 40 and 42 instructions, respectively. They were scheduled at equal IIs and therefore their IPC_HW is very similar. The difference in speedup is due to the 16 floating-point operations in the accelerated Megablock for f12. As a result, one iteration in software requires 139 clock cycles, versus the 54 cycles for i7.

As with all other implementations, the speedup is also affected by the communication overhead between the MicroBlaze and the accelerator. This impact can be seen when comparing the geometric mean speedup for the two values of N. For some benchmarks (e.g., i7 and i4), the speedup increases with N, indicating that the constant invocation overhead is amortized over a larger number of iterations. For other benchmarks (e.g., f9, f12, f13, i5, i6) the effect of increasing the value of N is negligible, meaning that the communication time was already small compared to the processing time (f9, f12, f13) or that the communication time also scales with N (i5, i6).

To illustrate the first case, consider the two versions of innerprod and matmul. The corresponding speedups for several values of N are shown in Fig. 6.10. The speedup for the two versions of innerprod increases slowly and then stabilizes, showing that the communication overhead is negligible for N > 256. The speedup for the two versions of matmul increases over the entire range of N; in this case the impact of accelerator invocation is still noticeable for large N. The speedup gap between the two versions also increases, again showing that, in comparison with the MicroBlaze processor, the accelerator supports floating-point operations more efficiently.

The i5 and i6 benchmarks demonstrate that, in some cases, there is overhead which cannot be amortized. For both cases, the accelerated Megablock corresponds to an inner loop which always iterates 16 times per call; adjusting the N parameter only determines the outer loop bounds. One way to address this would have been to unroll the inner loop, which, for such a small number of iterations (and especially given the resource efficiency of this accelerator design), would have been a viable option. Otherwise, the overhead scales alongside the total number of iterations to perform.

The highest overheads occur for f3, i5 and i6. The overheads are 17 % for the former case and 51 % for the latter two (for N = 4096).

For f3, each accelerator call performs an average of 63 iterations, and for i5 and i6 only 16 iterations are executed. In contrast, the average overhead over all benchmarks is 8 %, and the average number of iterations per call is approximately 1433.
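This amortization behaviour can be captured by a simple model, where T_ovh denotes the roughly constant per-call cost of sending operands and invoking the accelerator:

S(N) ≈ T_SW(N) / (T_ovh + T_ACC(N))

When the number of iterations per call grows with N, T_ACC(N) dominates and the speedup approaches its asymptotic value; when the trip count per call is fixed, as for i5 and i6, T_ovh is paid on every call and cannot be amortized.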

Dual-Clock Domain Estimation The accelerator synthesis frequency is often higher than the MicroBlaze operating frequency, so it would be possible to further increase speedups by relying on two clock domains. To estimate speedups for such a scenario, the systems for all benchmarks were re-generated, now instructing the synthesis tools to maximize frequency. According to the timing reports, all the critical paths were due to internal MicroBlaze logic, or due to paths between the MicroBlaze and the memory controllers. Note that, although the MicroBlaze instances used in these experiments target speed, the extended floating-point unit has been enabled, as well as the barrel shifter, comparison and integer multiplier units. As a result, the average operating frequency given by the timing reports is approximately 150 MHz, which is below the average synthesis frequency of the accelerators, 221 MHz. Given this, a speedup can be estimated, on a per-case basis, for a scenario where the MicroBlaze operates at 150 MHz and the accelerator operates at its reported synthesis frequency. The resulting geometric mean speedup is 6.98× for the floating-point set, and 7.72× for the integer set. The increase is marginal for the floating-point set, given the small difference between the average synthesis frequency of those accelerators and 150 MHz. Note that the dual clock domain scenario would require another design iteration on the BRAM sharing mechanism, to allow for the memory ports to be fed by two different clocks. If the clocks are chosen to be integer multiples, then a clock gating method could be used.
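One plausible way to form this per-case estimate, assuming the cycle counts of both components remain unchanged, is:

S_dual ≈ (C_SW / f_MB) / (C_GPP / f_MB + C_ACC / f_ACC), with f_MB = 150 MHz

where C_SW is the software-only cycle count, C_GPP and C_ACC are the cycles spent on the MicroBlaze (including communication) and on the accelerator in the accelerated run, and f_ACC is the per-instance synthesis frequency; this is only an estimation model and was not measured on hardware.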

6.4.4 Resource Requirements & Operating Frequency

The approach discussed here favours performance, because it does not impose resource limitations on the architecture beyond the restriction to two memory ports. This may come at the cost of increased resource requirements, due to the boundless instantiation of FUs and the associated increase in the size of the configuration words required to control the accelerator. Figure 6.11 shows the FPGA resources used by each system (after placement and routing). The stacked bars represent the resources used by each accelerator (upper half) and by the remaining system components (lower half). The line represents the clock frequency of the accelerator as given by the synthesis reports.

To put the accelerator resource requirements into perspective, consider a MicroBlaze processor (without caches and with an FPU), which requires 2291 LUTs and 1503 FFs. The average accelerator requires 1.13× more LUTs and 1.83× more FFs. Additionally, the number of slices required by the accelerator can serve as a measurement of required area: it requires 1.12× the number of slices that the MicroBlaze requires, which is 860. As a general rule, the accelerators for the integer loops require fewer resources and also achieve higher operating frequencies than the floating-point cases. The average number of FUs required for the floating-point and integer sets is 9.8 and 10.6, respectively. The average size of the register pool is also similar, 33.5 registers.


Figure 6.11: Resource requirements for the entire generated systems and accelerators, and accelerator clock frequency

The increased complexity of the floating-point units justifies the lower synthesis clock frequency for the floating-point set, whose average is 164 MHz, significantly lower than the average clock frequency of 290 MHz for the integer-only accelerators. In all cases, the critical path in the accelerators with floating-point operations includes the fadd FU; the sole exception is benchmark f3, where the critical path is determined by the fmul unit (since there are no floating-point additions). Floating-point FUs make up 24 % of all FUs per accelerator. For the integer set, the drops in frequency seen in i2, i4 and i11 are due to critical paths between the instruction memory, the multiplexers and the integer multiplication FU. System i10 is the only other one with an integer multiplier, but it does not suffer from the same drop in frequency because each input of the multiplication FU receives operands from only one other FU and therefore has no input multiplexer. There is no clear relation between accelerator size and maximum operating frequency for the implementations shown in Fig. 6.11. Because loops are scheduled for their minimum IIs, more FUs are instantiated, each of which has fewer operations scheduled to it. This leads to more customized connectivity. Increasing the II and thereby instantiating fewer FUs would require more complex input multiplexers. This is shown in the next section, where accelerators for multiple loops of similar size suffer small decreases in frequency due to this effect.

On a final note, the synthesis frequency for every accelerator is above the 110 MHz clock frequency used for the implementations. In many cases, using a higher frequency might be possible: this would reduce the absolute time taken by the benchmarks, but would not change the speedup values for a system with a single clock domain. The speedups presented in this section are due to the combined effects of the scheduling method and the accelerator architecture, which minimize the II and increase the IPC_HW.

6.4.5 Power and Energy Consumption

The power and energy consumption for the single-row accelerator architecture were also determined for a subset of the benchmarks, via actual on-board power measurements using Texas Instruments' Fusion Digital Power Designer [Ins16].

Table 6.2: Power and energy consumption for software-only and accelerated runs

     | Accelerator-enabled Execution        | Software-only Execution
ID   | Power (W) | Runtime (s) | Energy (J) | Power (W) | Runtime (s) | Energy (J)
f1   | 0.595     | 2.02        | 1.20       | 0.452     | 17.96       | 8.11
f2   | 0.592     | 8.00        | 4.74       | 0.452     | 58.19       | 26.29
f3   | 0.514     | 9.58        | 4.93       | 0.452     | 35.15       | 15.88
f4   | 0.561     | 2.50        | 1.40       | 0.452     | 34.03       | 15.37
f5   | 0.562     | 35.47       | 19.93      | 0.452     | 235.47      | 106.37
f6   | 0.627     | 5.19        | 3.26       | 0.452     | 62.11       | 28.06
f8   | 0.635     | 8.78        | 5.58       | 0.452     | 74.46       | 33.64
f9   | 0.500     | 51.32       | 25.63      | 0.452     | 77.10       | 34.83
f12  | 0.653     | 4.61        | 3.01       | 0.452     | 90.95       | 41.09
f13  | 0.545     | 2.68        | 1.46       | 0.452     | 21.12       | 9.54
mean | 0.578     | 13.02       | 7.11       | 0.452     | 70.65       | 31.92

Each kernel was executed with N = 1024 and L = 10000. Using a high value of L increases the runtime, thereby decreasing the measurement noise, since more data points are captured. The power and energy measurements are shown in Table 6.2, and are the result of monitoring the 1 V rail, which powers the FPGA. On the left-hand side, the table contains measurements relative to a system containing the respective accelerator instance; on the right-hand side, the measurements are relative to a system containing only local memories, a MicroBlaze instance, and the other required modules. For the former case, the average power consumption was 0.57 W, and for the latter the average was 0.45 W. However, due to the reduced runtimes, the total energy consumed is lower for the accelerator-based systems, which require an average of 7.11 J versus 31.92 J for the software-only system.

6.4.6 Performance and Cost of Multi-loop Support

Multi-loop accelerators can enhance the execution of several loops (from the same application or from a set of related applications) and also result in resource savings when compared to a group of individual single-loop accelerators. In the results analyzed in the previous sections, there are already three systems whose accelerators support more than one loop (f5, f9 and f11). To further test multi-loop support, seven groups of kernels were selected and an accelerator was generated to target each group. Groups of kernels were chosen based on the type of calculations performed as well as the number of instructions (e.g., generating a single accelerator for dilate and erode). Table 6.3 presents speedups and accelerator resource requirements for these cases. In this table, the speedups shown are not the average of the speedups obtained when executing each individual kernel. Since the II of each scheduled CDFG remains the same as in the single-loop case, so does the individual speedup. Instead, a single speedup was computed using the total execution time of all loop kernel calls. That is, the values shown are equivalent to a weighted average of the individual speedups: of all the kernels accelerated per system, the one with the greatest execution time will have the greatest impact on the resulting overall speedup, as expressed below.
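With T_SW,i and T_ACC,i denoting the software-only and accelerated execution times of each kernel i in a group, the reported value corresponds to

S_group = (Σ_i T_SW,i) / (Σ_i T_ACC,i)

so the kernel with the largest execution time dominates the result.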

Table 6.3: Generated multi-loop accelerator characteristics and speedups

ID      | # Megablocks | # FUs | # RP Regs | MRT Occ. (%) | Speedup (N = 1024) | Speedup (N = 4096)
f_2_8   | 2            | 11    | 96        | 48           | 7.42               | 8.70
f_3_9   | 3            | 11    | 25        | 21           | 2.73               | 2.94
f_4_5_6 | 5            | 13    | 119       | 41           | 7.54               | 7.54
f_7_10  | 2            | 10    | 28        | 40           | 4.83               | 6.96
i_5_6   | 2            | 8     | 24        | 68           | 1.65               | 2.31
i_8_9   | 2            | 18    | 99        | 35           | 5.37               | 5.82
i_10_11 | 2            | 9     | 13        | 53           | 3.82               | 5.40
mean    | 2.58         | 11.4  | 57.7      | 44           | 4.25               | 5.16

This table refers to systems with accelerators for loops from several kernels. The benchmark names indicate which kernels are supported.

The resource requirements for the multi-loop accelerators are compared to the sum of resources of individual accelerators in Fig. 6.12. Each multi-loop accelerator requires, on average, only 72 % of the LUTs and 59 % of the FFs of the sum of the individual accelerators. If single-loop accelerators were used, the accelerator with the worst critical path would determine the system's operating frequency. For the multi-loop cases, it is expected that the frequency would decrease due to increased accelerator size and complexity. Despite this, the average synthesis frequency of the multi-loop cases is only 2.5 % lower than the worst case of the respective single-loop accelerators.

As an example of the frequency decrease due to increased connectivity, consider system i_8_9, which supports the same loops as i8 and i9. The synthesis frequency is similar for both of these cases, and is determined by a critical path between the instruction memory, a barrel-shift FU and the pool registers. However, for i_8_9, the input multiplexers of two add FUs are considerably more complex than those of the equivalent units in the individual cases. For instance, the input multiplexer for the first operand of one of the units in i_8_9 has twice as many possible inputs. As a result, i_8_9 exhibits the highest frequency decrease among the accelerators in Table 6.3.

Another metric of how well the resources are shared is the occupation of the MRT, which measures how efficiently the schedule uses the available resources. For all the cases in Table 6.1, the average occupation of the MRT is 53 %; for the systems in Table 6.3 the average is 44 %. This is expected, because accelerators with multiple configurations may have FUs that are used only by a subset of them, which decreases the occupation. The MRT occupancy for a set of loops could be a target metric while scheduling, to balance the number of FUs and the II.

6.5 Performance Comparison with ALU Based Accelerators

Many existing approaches make use of architectures which allow for some programmability: for instance, mesh arrays of multiple-function FUs with rich interconnects (such as nearest-neighbour connections aided by less numerous, longer connections to distant units). Architectures such as these are usually designed once, possibly by a quantitative approach, to be flexible enough so that future sub-graphs can be successfully mapped and executed. That is, these designs must be rich enough in terms of resources, and especially interconnection capability, in order to increase applicability. On the other hand, this may incur considerable resource costs.

The results presented so far are relative to fully customized accelerators, that is, instances with any number of operation-specific FUs. The objective of generating fully customized designs is to maximize performance and to decrease resource usage. In this section, the advantages of this customization are evaluated by comparing fully customized accelerators with instances containing a fixed number of ALUs, to establish a comparison with existing static-resource accelerators.

Table 6.4 summarizes the five types of accelerators generated for this experiment using the developed scheduler. Scenario a1 was presented in the previous section: fully customized generation of accelerators with boundless resource allocation. Scenarios b1, b2 and b3 contain a fixed number of resources. To achieve this, the scheduler was tuned so that scheduling starts with the given units and no additional FUs are allocated. Instead, as per typical modulo-scheduling approaches, the II is increased until a valid schedule is possible (see the sketch below). An additional scenario, a2, uses boundless allocation of ALUs.
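A minimal sketch of this II-relaxation loop, under the assumption that a schedule_feasible() routine stands in for the modulo-scheduling attempt itself, is:

/* placeholder: attempts to modulo-schedule the CDFG on the fixed unit mix */
extern int schedule_feasible(const void *cdfg, const void *fixed_units, int ii);

int schedule_fixed(const void *cdfg, const void *fixed_units, int min_ii)
{
    int ii = min_ii;
    while (!schedule_feasible(cdfg, fixed_units, ii))
        ii++;               /* each increment trades performance for fit */
    return ii;
}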


Figure 6.12: Resource requirements for multi-loop accelerators vs. sum of resources for individual loop accelerators

Table 6.4: Accelerator generation scenarios

Scenario | Description
a1       | Allocation of any number of resources of any type
a2       | Allocation of any number of ALUs (+ other units)
b1       | 2 ALUs + 1 Multiplier + 1 Branch Unit (+ 1 FPU)
b2       | 4 ALUs + 1 Multiplier + 1 Branch Unit (+ 1 FPU)
b3       | 8 ALUs + 1 Multiplier + 1 Branch Unit (+ 1 FPU)

Comparison with fixed resource scenarios The employed ALU essentially contains one instance of each type of individual FU, plus the required additional control logic. Supported operations include all integer arithmetic (except division and multiplication), logic, and comparison operations. In order for the synthesis tools not to optimize the ALU logic on a per-instance basis (due to constant propagation of the instructions feeding each ALU), a black-box instance was synthesised. This means that each ALU represents a fixed cost of approximately 640 LUTs and no FFs. In addition to the ALUs, these cases also use a single branch unit (which evaluates all types of exit conditions) and a single integer multiplier, also instantiated as black boxes. For the benchmarks of the Livermore set, a single Floating Point Unit (FPU) is also added, which includes all floating-point arithmetic, comparison and float/integer conversion operations. The ALU has a latency of 1 clock cycle; the latency of the FPU varies according to the issued operation, but it is still possible to pipeline operations. Like the ALU, the FPU is constructed from one instance of each type of floating-point unit, and a black box was used to instantiate it. The cost of the FPU is 1460 LUTs and 525 FFs. Note that although the number of units is fixed, the interconnections between them are still specialized by the scheduler based on the dataflow between scheduled operations.

Figure 6.13 shows the speedups for these cases. The speedup in scenario b3 equals that of scenario a1 for all integer cases. In other words, the Megablocks that can be detected for these benchmarks can be executed at the minimum possible II with 8 ALUs, and in most cases only 4 ALUs suffice. Conversely, we can state that fully customized accelerators perform equivalently to generalized accelerators with 4 or 8 ALUs. There is only a noticeable difference between b1 and b2, where the average IIs are 19.6 and 10.3, respectively. Regardless, for the integer cases, an accelerator with only 2 ALUs still achieves a geometric mean speedup of 2.08×.

However, since the accelerators in these scenarios contain only one FPU, the speedup decreases for loops with floating-point operations. For 6 out of the 13 floating-point benchmarks, the speedup decreases to approximately half on average, whilst the remaining ones decrease marginally. The average II of the former 6 cases increases to 17.2 (for b1, b2 and b3), versus the average II of 7.2 for scenario a1. The highest decrease in speedup occurs for f12, whose accelerated loop contains 16 floating-point operations, scheduled onto 4 floating-point units for the accelerator in scenario a1. With only one FPU, the speedup decreases by approximately four times, from 18.9× to 4.8× (regardless of the number of ALUs). Although the accelerator for f5 in scenario a1 contains 5 floating-point units, the speedup only decreases by half in the remaining scenarios. This is because three loops are accelerated, only one of which uses all 5 floating-point units frequently.

This can also be analysed in terms of resources: in order to schedule the benchmark loops with minimum II, an average of 2.3 floating-point FUs are instantiated for the accelerators of scenario a1. This means that, for a fixed-resource accelerator, at least 2 FPUs would be required to prevent increasing the II. In other words, a fixed-resource accelerator with 4 ALUs and 2 FPUs would incur a cost of 7707 LUTs and 3697 FFs.
In comparison, the accelerators for the floating-point set in scenario a1 require an average of 2829 LUTs and 2863 FFs. As a summary, Fig. 6.14 shows the average required resources for all scenarios, distinguishing between the floating-point and integer sets. The values are normalized to the resource requirements of one MicroBlaze. Considering that the number of slices represents area on the device, the accelerators in scenario a1 occupy roughly 0.62×, 0.47× and 0.32× the area of those of cases b1, b2 and b3, respectively.


Figure 6.13: Speedups for several types of accelerators vs. a single MicroBlaze processor

The number of required FFs varies little, since the same amount of data needs to be transported between FUs regardless of their type. For the floating-point cases, the reduction in the number of LUTs is higher due to the higher cost of each FPU. In short, a specialized accelerator performs on par with a fixed-resource accelerator; when floating-point support is required, it allows for better exploitation of ILP than deploying a single fully fledged FPU, and is less costly in terms of resources than employing two FPUs.

The configuration word memory size also varies between scenarios, due to the dependence of the word width on each particular instance. For the sake of brevity, consider only that the average memory size is 8.73 kB, 15.88 kB, 12.48 kB, and 17.30 kB for scenarios a1, b1, b2 and b3, respectively. The size of the configuration word memory of the accelerators depends on three factors. The first is the number and type of computational resources: specialized FUs receive at most one or two configuration bits per schedule step, whereas the ALUs, FPUs and branch unit require 18, 13 and 8 bits, respectively. Secondly, the connections between units affect the multiplexer complexity, and therefore the number of bits in a configuration word required to control them. Lastly, a larger number of loop operations (i.e., processor instructions) to schedule generally means longer schedule lengths. Consider then that scenario a1 instantiates more units, but each with fewer control bits, and that more units allow for shorter schedule lengths and therefore fewer total configuration words (approximately half relative to the remaining three cases). For the fixed-resource cases, neither the number of configuration words nor the word width varies in a predictable fashion with the number of ALUs.


Figure 6.14: Average accelerator resource requirements for the several scenarios

Table 6.5: Average cost of accelerators per scenario, normalized by the cost of a single MicroBlaze

         | floating-point             | integer
Scenario | LUTs | FFs  | Slices       | LUTs | FFs  | Slices
a1       | 1.23 | 1.90 | 1.28         | 1.01 | 1.74 | 0.93
b1       | 2.12 | 1.95 | 2.12         | 1.48 | 1.64 | 1.42
b2       | 2.74 | 2.06 | 2.70         | 2.09 | 1.76 | 2.01
b3       | 3.87 | 2.08 | 3.79         | 3.24 | 1.85 | 3.26

Determining the reasons for these particular widths and lengths requires analysing the individual assignment of operations to ALUs and the resulting multiplexer widths. This analysis has not been performed, but note that the average configuration memory size for scenario a1 is 70 % of the smallest memory size among the remaining cases (b2).

Comparison with ALU allocation In an additional scenario, a2, accelerators were generated by instantiating any number of ALUs required to schedule the loops at minimum IIs. In other words, this scenario is equivalent to the fully customized scenario, but resorting to ALUs instead of custom FUs. This determines how many ALUs, and therefore how many resources, a generalized loop accelerator would require to achieve the same performance as a specialized instance. The resulting average number of instantiated ALUs is 3.7 for the floating-point set and 5.0 for the integer set. For this scenario, an accelerator requires an average of 2165 slices and 5766 LUTs, which is an additional average of 1210 slices and 3189 LUTs relative to scenario a1. In other words, these accelerators require approximately 2.49× the slices, 2.55× the LUTs and 1.10× the FFs of a specialized instance (while still benefiting from customized connectivity) in order to achieve the same execution performance.

6.6 Performance Comparison with VLIW Architectures

The MicroBlaze processor is a single-issue processor, and therefore it is not surprising that considerable speedups can be attained by exploring, even if only partially, any latent ILP or other hardware specializations (e.g., specialized units).

Figure 6.15: Simulation flow for the ρ-VEX processor and other VEX architecture models

Therefore, to provide a comparison with VLIW architectures, this section contains performance results obtained by using Hewlett-Packard's VEX toolsuite [FFY05] as well as the ρ-VEX processor implementation [SABW12]. The VEX toolsuite is a compilation chain which targets a user-specified VLIW architecture. Architecture parameters include the issue width, and the number, type and latency of units. These parameters are given at compile-time to generate a Compiled Code Simulator (CCS) which runs on a desktop host machine. The ρ-VEX architecture is a synthesis-time parametrizable VLIW processor which implements the VEX instruction set. The ρ-VEX release used was version 3.3, which includes the open-source ρ-VEX processor itself (written in VHDL) and also a build of the gcc tool chain targeting the architecture. Additionally, it is also possible to use HP's VEX compiler for the compilation step, and ρ-VEX's tool chain for the assembly and linking steps. Using these tools, the following experiments were conducted: 1) execution on three VEX CCSs, each using a different architecture model, and 2) simulated execution of a 4-issue version of the ρ-VEX processor in ModelSim. Since the VEX architecture does not support floating-point natively, all comparisons in this section are limited to the integer benchmarks.

Figure 6.15 shows the compilation flow used to generate the mentioned test cases. The test harness was compiled with HP's VEX compiler, targeting three different architecture models (right-hand side). The software harness was compiled once per model, generating three VEX CCSs with all the capabilities of the harness (as explained in Section 6.4.2). Each kernel was executed on each CCS, to determine the execution time of the entire kernel function call. For simulation of the ρ-VEX processor, one ELF is compiled per kernel, and then loaded into the processor and executed (left-hand side). As explained previously, the harness can be compiled to receive no call-time parameters and instead read statically defined parameters and reference data. Since the standard C library would have to be compiled for the ρ-VEX target in order to use dynamic memory allocation, the harness was adapted so that only statically allocated memory is used. This does not impact the execution time of the kernels, since their source code is not modified and only the kernel function call is measured. When measuring execution time on the ρ-VEX, the cache access stall cycles were not counted (emulating local memory performance for fairness of comparison).

Table 6.6: VEX simulator and accelerator models comparison

Parameter      | r1 / r2 / r3             | a1 / b2
Issue width    | 2 / 4 / 8                | N/A
#ALUs          | 2 / 4 / 8                | Variable (a1), 4 (b2)
#MULs          | 2 (16x32)                | Variable (a1), 1 (32x32) (b2)
#Registers     | 64                       | Variable
Memory access  | 1 Load/Store Unit        | 2 Memory Ports
ALU Latency    | 1                        | 1
MUL Latency    | 2                        | 2
Load Latency   | 3                        | 2
Store Latency  | 1                        | 1
Memory/Cache   | Simulated without cache  | Local BRAM memory

Table 6.6 shows the model parameters used to compile each VEX CCS. In order to ease the comparison between these cases, the parameters for r1, r2 and r3 are equal to the parameters used when compiling for the ρ-VEX processor (in terms of number of registers, delays of operations, etc.). That is, the CCSs approximately simulate three variants of the ρ-VEX with different issue widths and numbers of ALUs. These cases are compared with two of the implementation scenarios presented in the previous section: a1 (customized accelerators) and b2 (4-ALU accelerators).

As a side note, consider that, although presented here as a comparison, using VLIWs for acceleration and deploying binary accelerators are not mutually exclusive approaches to increasing performance in embedded systems. The augmentation of a smaller 2-issue VLIW with a customized loop accelerator is not infeasible. In such a scenario, the VLIW would provide a moderate performance increase throughout most of the code, exploiting the relatively low but relevant ILP, whilst the accelerator would target highly pipelinable loops.

6.6.1 Performance Comparison

Deploying VLIWs to explore ILP involves two steps: the design of a multiple-issue processor and the design of a sophisticated compiler. Exploring ILP is therefore a compile-time task, performed statically on high-level source code. The processor can be simpler, since it expects statically scheduled code, and using a compiler allows for the exploration of powerful optimization steps. However, not all the code in an application contains ILP to exploit. This leads to underutilization of the VLIW hardware and to instruction words padded with nops, which still represent a cost in terms of memory size. Even if a portion of code is highly optimized for the available issue width, there is no guarantee it will execute frequently, which exacerbates the resource cost of deploying a VLIW.

[Bar chart: per-kernel speedup (vertical axis) for r1 (2-issue VLIW), r2 (4-issue VLIW), r3 (8-issue VLIW), ρ-VEX (4-issue VLIW), b2 (accelerator w/ 4 ALUs) and a1 (specialized accelerator); kernels i1 to i11 on the horizontal axis.]

Figure 6.16: Speedups for different VLIW models and two of the presented accelerator scenarios

In contrast, the approach presented in this work relies on post-compilation information. Only frequently executing code is migrated to hardware, and acceleration is achieved by exploiting both intra-iteration and inter-iteration ILP of loop path traces. Although new tools are required to support the present approach, none are as complex as an architecture-specific compiler, and they also allow the application binary to remain compatible with a non-accelerated system.

The speedups for the tested integer kernels are shown in Fig. 6.16. Speedups were calculated considering the same operating frequency for all cases and for the baseline. The geometric mean speedups over a MicroBlaze processor for r1, r2, r3 and the ρ-VEX execution are 2.22×, 2.58×, 2.61×, and 2.63×, respectively. Considering the geometric mean speedup for a1, 4.61×, this means that the fully customized accelerators outperform a 4-issue VLIW processor by a factor of 1.79×. Figure 6.16 shows the speedups sorted according to the speedups for a1. Three groups of kernels are noticeable: 1) for i4, i1, i11, i2, i3 and i7, the accelerators outperform VLIW execution; 2) for i10, i8 and i9, the performance is roughly equivalent; 3) for i5 and i6, the VLIW approaches are more efficient.

The accelerators effectively execute MicroBlaze instructions. This means there is a direct correlation between the achieved IPC and the speedup. However, several factors are relevant when comparing the VLIW cases to the accelerators and to the MicroBlaze baseline.

Firstly, due to the different instruction sets, the number of instructions in the VLIW kernel loops differs from the MicroBlaze equivalents, which also means the IPC values for each case cannot be directly used as a comparative measure. As an example, consider i8 and i9. For r2, the loops for these cases contained 38 VEX instructions. Given the issue width, these instructions contain more than one elementary operation. The CCS for this model reports an IPC of 2.4, which means a total of 97 VEX operations implement the computations of each loop. The respective MicroBlaze loop traces contained 148 instructions, whose execution on the accelerators achieved a performance similar to r2, but required an IPC of approximately 6.5.

Secondly, the VLIW cases may be able to partially overlap loop iterations. If the compiler employs loop unrolling, operations from different iterations may be bundled into the same single-cycle instruction word targeting the entire VLIW width. It is difficult to determine how much of the ILP achieved by the VLIWs is due to this, and how much is due to latent intra-iteration parallelism. It is noticeable, for the cases where VLIW execution is outperformed, that an increase in VLIW issue width does not lead to an increase in performance. This means either that there is no more ILP to explore, or that the compiler is unable to do so. In contrast, the accelerators rely heavily on exploiting loop pipelining. For the mentioned kernels, the developed modulo scheduler reports that the average number of instructions which belong to the same iteration and are executed in parallel is 1.42. This is comparable to the average IPC of the VLIWs for the same kernels, 1.50. By exploiting the minimum II, the IPC while executing on the accelerators is increased to 5.97.

Thirdly, the accelerators perform better when the number of iterations performed per accelerator call is high, since this mitigates overhead. The present approach targets innermost loops.
This means that for a nested loop which represents a large portion of computation, it is desirable that the inner loop represent most or all of it. Alternatively, the inner loop must iterate such a small number of times that unrolling it would not significantly increase the size of the outer loop trace, thus making it a viable candidate for acceleration. This has been previously shown for i5 and i6, where VLIW execution outperforms the accelerators, as the accelerated inner loops iterate only 16 times. Execution on a VLIW suffers no such effects and can apply loop optimization techniques. In fact, for the two mentioned cases the VEX compiler partially unrolls the inner loop.
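To make the effect of the per-call overhead concrete, the following sketch models the attainable speedup as a function of the number of iterations executed per accelerator call. All cycle counts are hypothetical placeholders, assuming a fixed invocation overhead and fixed per-iteration costs for software and accelerated execution; it is not a model of any specific kernel.

#include <stdio.h>

/* Illustrative model of effective speedup versus iterations per accelerator
 * call. All constants are assumed values, not measured data. */
int main(void) {
    const double overhead_cycles = 200.0;  /* migration + CR cost per call (assumed) */
    const double sw_iter_cycles  = 40.0;   /* software cycles per loop iteration (assumed) */
    const double hw_iter_cycles  = 8.0;    /* accelerator cycles per iteration (assumed) */

    for (int iters = 4; iters <= 4096; iters *= 4) {
        double t_sw = sw_iter_cycles * iters;                       /* software-only time */
        double t_hw = overhead_cycles + hw_iter_cycles * iters;     /* accelerated time    */
        printf("%5d iterations/call -> speedup %.2f\n", iters, t_sw / t_hw);
    }
    return 0;
}

With few iterations per call the fixed overhead dominates and the speedup collapses; with many iterations it approaches the ratio of per-iteration costs, which is the behaviour observed for i5 and i6 above.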

Dual-Clock Domain Estimation   This evaluation considered that both the VLIW processors and the MicroBlaze operated at the same frequency. In Section 6.4.4, a short analysis presented the possible speedups if the accelerators were to operate at their maximum frequencies, while the MicroBlaze operated at its own maximum frequency of 150 MHz. For this evaluation, and for the ρ-VEX processor specifically, this comparison is moot, since its maximum operating frequency is also 150 MHz [SABW12]. In fact, the MicroBlaze is penalized, since the maximum of 150 MHz is due to the FPU. Without it, the MicroBlaze can operate at 200 MHz, as the comparison in [SABW12] demonstrates. Considering this frequency for the MicroBlaze, the maximum frequency of 150 MHz for the ρ-VEX, and the per-case maximum frequency of each accelerator in scenario a1, if each component operates at its respective maximum, the geometric mean speedup for the integer set would be 1.97× for the ρ-VEX, and 6.18× for the fully customized accelerators.
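The dual-clock estimate follows from scaling the equal-frequency speedup by the ratio of operating frequencies; as a check, the relation reproduces the ρ-VEX figure quoted above (for a1 the per-case maximum accelerator frequencies apply, so only the general relation is shown):

S_{\mathit{dual}} \;=\; S_{\mathit{equal}} \times \frac{f_{\mathit{device}}}{f_{\mathit{MicroBlaze}}}
\qquad\Rightarrow\qquad
S_{\rho\text{-VEX}} \;=\; 2.63 \times \frac{150\ \mathrm{MHz}}{200\ \mathrm{MHz}} \;\approx\; 1.97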

6.6.2 Resource Usage Comparison

In terms of resources, the developed accelerators are compared to the resource requirements reported for the ρ-VEX processor in [SABW12]. When targeting a Virtex-6 XC6VLX240T, the 4-issue version of the ρ-VEX requires 1046 FFs, 12899 LUTs and 16 BRAMs. A fully customized accelerator instance (i.e., a1) requires approximately half as many LUTs and 5× as many FFs. The fact that BRAMs are not used for data storage justifies this higher register requirement.

Finally, consider the memory size required by these approaches. By inspecting the ρ-VEX processor assembly, it was determined that the kernel functions required 1.07 kB on average. The average configuration memory size for an accelerator in scenario a1 was 7.47 kB. Note that the scheduler currently generates configuration words which are unoptimized for size, but they are stored in read-only distributed memory, which allows for synthesis-time optimizations.

6.7 Concluding Remarks

This chapter presented a design iteration of an automatically generated accelerator for the purpose of transparent binary acceleration. A single-row model of customizable FUs and connectivity simplifies both the resulting architecture instances and the supporting tools, but requires a sophisticated scheduler. The modulo scheduler creates specific accelerator instances based on a set of CDFGs to execute, and ensures that the speedup attained for each CDFG is maximized. The applicability of the accelerator vastly increases due to floating-point support. The only resource restriction is the two memory ports, which still prove adequate to efficiently exploit any existing data parallelism within the target CDFGs. Relative to the multi-row implementations, this design is also more resource efficient. The proposed architecture achieves a geometric mean speedup of 5.60× over a single MicroBlaze for a set of 24 kernels. Relative to the accelerator implementation in the previous chapter, the average number of LUTs and FFs required by the accelerator, according to synthesis reports, is approximately 0.78× and 0.83×, respectively.

The additional cost of supporting floating-point operations on the accelerator is not very significant, since single-operation FUs are employed instead of fully-fledged FPUs. This was demonstrated by the resource requirement comparison between the fully customized accelerators and fixed-resource models based on ALUs and FPUs. The performance obtained versus state-of-the-art VLIW models and generic designs representative of existing accelerator architectures further highlights the efficiency of the design, especially regarding the cost/performance trade-off.

Since the accelerator model is relatively simple, a potential future development direction is the porting of the scheduling tools for on-chip execution, further increasing the self-adaptability of heterogeneous computing systems. As is, the accelerator design is capable of accelerating large loop traces of MicroBlaze executables, supporting nearly all instructions of its instruction set. As such, the single-row accelerator is a fully-fledged co-processor with a complete supporting tool chain, which was implemented and thoroughly validated with a commercial FPGA board.

Chapter 7

Dynamic Partial Reconfiguration of Customized Single-Row Accelerators

The previous chapter presented and evaluated an accelerator architecture which proved to be very efficient in terms of performance and resources. Generating customized accelerators via modulo scheduling for the minimum possible Initiation Interval of all chosen Megablocks allows for performance maximization, and customizing the connectivity reduces the required resources.

However, despite this specialization, the resource requirements and synthesis time start to become considerable when the number of supported configurations increases. The amount of required connectivity increases, which leads to increased circuit complexity and therefore decreased operating frequency. Additionally, the size of the configuration memory increases as well. That is, to implement the reconfiguration capabilities required to support a large number of configurations, the circuit becomes increasingly less customized.

The introductory chapters noted that balancing these two aspects was a motivation for this work. This chapter addresses this by presenting a proof-of-concept for a system which relies on the same single-row accelerator design, this time augmented with reconfiguration capabilities based on Dynamic Partial Reconfiguration (DPR). The accelerator is re-designed slightly by partitioning it into a static region and a single reconfigurable region. The static region contains interface logic, the register pool, and both load/store units. The reconfigurable partition includes all other Functional Units (FUs), as well as their input multiplexers and the configuration memory. Each Megablock to accelerate is used to generate a single partial bitstream file. The accelerator's reconfigurable region is reconfigured via the Internal Configuration Access Port (ICAP) as part of the migration mechanism which all implementations presented so far relied on.

As the experimental evaluation in this chapter demonstrates, this approach has two main advantages relative to the implementation in the previous chapter: (i) the total area required by the accelerator is smaller, by relying on DPR to switch between accelerator configurations instead of multiplexer-based configuration logic, and (ii) the synthesis time of the accelerator decreases considerably. Additionally, the motivation to implement this system also came from the notion that an accelerator architecture with DPR-based reconfiguration capabilities is an important stepping stone in developing a system capable of runtime self-reconfiguration.


Figure 7.1: Single-row accelerator architecture partitioned for DPR. The reconfigurable partition includes all FUs, input multiplexers, and the configuration memory.

This chapter presents the partitioned accelerator architecture in Section 7.1, and the tool flow modified to support the generation of partial bitstreams in Section 7.2. The experimental evaluation in Section 7.3 discusses the overheads introduced by the runtime partial reconfiguration, as well as the resource savings achieved. Finally, Section 7.4 concludes this chapter.

7.1 Accelerator Architecture

Figure 7.1 shows a simplified view of the accelerator architecture. Some details are omitted, as it is largely similar to the customizable architecture template shown in Chapter 6 (page 106). There is only one row of FUs, customized by the modulo scheduler based on the set of Megablocks to accelerate. The accelerator supports all integer and single-precision floating-point operations, including divisions by non-constant divisors. All FUs are fully pipelined, with the exception of the non-constant integer division unit. The load/store units are capable of performing byte-addressed operations to arbitrary memory locations, since the accelerated traces also implement the address generation operations. When accelerating a Megablock, the accelerator executes an arbitrary number of iterations each time it is called. The execution returns to software as a result of evaluating the respective termination conditions (i.e., branch instructions).

The modulo scheduler generates accelerator instances capable of executing the target Megablock CDFGs at their respective minimum Initiation Interval. If this is not possible, the II is gradually increased until the CDFG is scheduled. In practice this virtually never occurs, as the only resource limitation in this approach is the two memory ports (all other FUs are instantiated as needed).

To summarize, every functional and architectural aspect is identical to what is presented in Chapter 6, but in this implementation the accelerator is partitioned into a static region and a partially reconfigurable region, which is the portion contained within the dash-lined box of Fig. 7.1.
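As an aside, the II search described above (start at the minimum II and relax it only when a schedule cannot be found) can be sketched as follows. This is an illustrative sketch, not the actual scheduler: try_schedule stands in for the real modulo-scheduling attempt, and the recurrence-constrained bound is assumed to be given.

/* Hypothetical stand-in for one modulo-scheduling attempt at a given II;
 * the real scheduler places CDFG nodes onto FUs and connects the registers. */
extern int try_schedule(const void *cdfg, int ii);

/* Resource-constrained minimum II: only the two memory ports limit scheduling,
 * since all other FUs are instantiated as needed. */
static int res_min_ii(int num_mem_ops, int num_mem_ports) {
    return (num_mem_ops + num_mem_ports - 1) / num_mem_ports;  /* ceiling division */
}

/* Start at the minimum II and increase it until the CDFG schedules. */
int schedule_cdfg(const void *cdfg, int num_mem_ops, int rec_min_ii) {
    int ii = res_min_ii(num_mem_ops, 2);
    if (rec_min_ii > ii)
        ii = rec_min_ii;
    while (!try_schedule(cdfg, ii))
        ii++;
    return ii;
}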

7.1.1 Static Partition

The static partition includes the input registers and output registers as well as the multiplexers which drive them with outputs from the FUs (omitted from Fig. 7.1), the two load/store units and their input multiplexers, as well as all other interface and control logic. Given a set of Megablocks, the resulting structure of the static region is very similar to what would be instantiated for the non-DPR version of the accelerator. The length of each register chain in the pool is determined by the maximum required by that chain, throughout all configurations of the partition. Likewise, the input multiplexers that drive the load/store units (omitted from Fig. 7.1) are part of the static region, and thus implement all required connectivity throughout all configurations. The same is true for the multiplexers which drive the output registers. These components were placed in the static region since their resource requirements do not scale noticeably (e.g., output register multiplexers or register pool), or do not increase at all (e.g., memory port logic and control logic) with the number of Megablocks to support.
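A minimal sketch, under the assumptions stated in this subsection, of how the static-region register chains could be dimensioned: each chain is made as long as the longest version required by any supported configuration. The array layout and names are hypothetical, introduced only to make the rule explicit.

#define NUM_CHAINS 8   /* hypothetical number of register chains in the pool */

/* chain_len[c][r] is the length of chain r needed by configuration c
 * (these values would come from the scheduler's per-configuration output). */
void size_static_region(int num_configs,
                        const int chain_len[][NUM_CHAINS],
                        int static_len[NUM_CHAINS]) {
    for (int r = 0; r < NUM_CHAINS; r++) {
        static_len[r] = 0;
        for (int c = 0; c < num_configs; c++) {
            if (chain_len[c][r] > static_len[r])
                static_len[r] = chain_len[c][r];  /* keep the maximum over all configurations */
        }
    }
}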

7.1.2 Reconfigurable Partition

The reconfigurable partition contains all FUs, their input multiplexers, and the configuration memory. Apart from some Flip-Flops (FFs) needed by the memory module, the reconfigurable region logic is implemented nearly entirely in Lookup Tables (LUTs).

The modulo scheduler shown in Chapter 6 efficiently re-utilizes FUs between configurations. However, in some cases one configuration requires a large number of FUs which remain unused by the remaining Megablocks. This has no effect on performance, but leads to FUs being underutilized and to a larger accelerator area requirement. Also, since more units exist, the width of the configuration word also increases, leading to a larger configuration memory. Finally, as more operations are scheduled onto an FU, the complexity of its input multiplexer increases. These two aspects contributed especially to higher resource requirements and longer synthesis times.

By placing these components in the reconfigurable region, the input multiplexer connectivity is only as complex as required by the chosen Megablock subset, which for one or a few Megablocks represents an acceptable resource cost. Regarding the FUs, the total area required when implementing a set of Megablocks will always have to be as large as the largest Megablock demands. However, it need not be as large as the area which would be required by a multi-configuration instance supporting all Megablocks. Also, by implementing each Megablock (or small subsets of Megablocks) as a single configuration, the same resources can be used to implement different FUs between configurations.

A different solution would be to increase the II of the largest Megablock, thereby decreasing the row width and avoiding the instantiation of FUs which would remain idle for other, smaller configurations. However, this would increase the number of configuration words required to execute the Megablock, negating the resource savings achieved by instantiating fewer FUs.

The configuration memory was placed into this region because it was observed that, for a large number of configurations, the memory size increased considerably.

Figure 7.2: Complete tool flow for partially reconfigurable accelerator.

Even for a few configurations, the instruction word width (487 bits on average for the 24 benchmarks in Chapter 6) led to a large number of LUTs being required to implement the memory as distributed RAM. By placing the configuration memory in the reconfigurable partition, it only needs to contain the words which implement the Megablocks of the respective configuration.

This is an efficient re-utilization of LUTs, since the same FPGA resources will implement the memory, which will hold different configuration words depending on which partial bitstream has been written to the reconfigurable area. As Section 7.3 shows, it is for this reason that generating one partial bitstream per Megablock leads to a very noticeable decrease in the number of required LUTs relative to a non-DPR implementation.

7.2 Tool Flow for Dynamic Partial Reconfiguration

The tool flow for this implementation was extensively re-worked. Figure 7.2 shows the complete flow, from Megablock extraction to implementation. The initial extraction steps are identical to all other implementations, as is the CDFG preprocessing. The generation of the Communication Routines (CRs) and their integration with the application follows the same procedure, but each routine now includes a call to a reconfiguration function which is part of the harness code. This function checks whether the accelerator's reconfigurable area is already configured with the desired partial bitstream. If this is the case, the CR resumes normally. Otherwise, the reconfiguration routine writes the partial bitstream to the reconfigurable area of the accelerator through the ICAP peripheral. The C function used to control the ICAP is custom written, to attempt to reduce overhead. By having the partial reconfiguration take place as part of the CRs, the transparency provided by the approach is not compromised.

In previous implementations of the CR, nearly all instructions were direct reads and writes from and to the accelerator. If accelerator execution failed, the contents of the MicroBlaze's register file and stack would not suffer any modification upon return to software. If any iterations were performed on the accelerator, the contents of the register file would match what would be expected from software-only execution. The stack would not be altered in any way.

However, the routines in this implementation include the unconventional call to the reconfiguration function, and to all child functions thereof. As a result, the contents of the MicroBlaze's register file and stack do suffer unwanted modifications, whether the accelerator executes or not. As a first approach, this implementation resorts to saving the entire register file to memory, and recovering it after the partial reconfiguration function executes. The following code excerpt exemplifies this. The _wrapCaller auxiliary function is placed at a specific address, according to the linker script and via the respective section attribute.

Listing 7.1: Communication Routine with call to partial reconfiguration function

unsigned int seg_0_opcodes[87] __attribute__((section(".CR_seg"))) = {
    0xf821fffc,  // save r2
    // (omitted)
    // save register file (32 instructions total)

    0x3021ff80,
    0xb0000001,
    0xb9fcf3b0,
    0x20a00000,
    // call "_wrapCaller" function

    0x30210080,
    0xe821fffc,  // recover r2
    // (omitted)
    // recover register file (32 instructions total)

    // Remainder of CR:
    // (omitted)
};

void __attribute__((section(".pr_seg"))) _wrapCaller(unsigned int i) {
    doPR(i);       // reconfigure if needed
    putfsl(1, 0);  // soft reset after partial reconfiguration
    putfsl(1 <     // (omitted)
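For reference, a minimal sketch of what the doPR routine invoked above might contain, assuming a DMA transfer of the partial bitstream to the ICAP. The configuration-tracking variable and the helper functions (mfs_lookup_bitstream, dma_to_icap_start, dma_to_icap_wait) are hypothetical names introduced for illustration; they are not the actual implementation nor a vendor driver API.

/* Hypothetical helpers (assumed, not the actual implementation): */
extern const unsigned int *mfs_lookup_bitstream(unsigned int config, unsigned int *len);
extern void dma_to_icap_start(const unsigned int *data, unsigned int len);
extern void dma_to_icap_wait(void);

/* Index of the partial bitstream currently loaded into the reconfigurable region. */
static int current_config = -1;

void doPR(unsigned int config) {
    if ((int)config == current_config)
        return;  /* reconfigurable region already holds the desired circuit */

    /* Locate the partial bitstream in the MFS file system. */
    unsigned int len = 0;
    const unsigned int *bitstream = mfs_lookup_bitstream(config, &len);

    /* Stream the bitstream to the ICAP through the DMA engine and wait
     * for the transfer to complete. */
    dma_to_icap_start(bitstream, len);
    dma_to_icap_wait();

    current_config = (int)config;
}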

The scheduler now generates one accelerator configuration per CDFG, along with a single configuration file which determines aspects relative to the static region (e.g., the size of the register pool and the connections between its registers). Each configuration specification (i.e., the Configuration N HDL and Configuration Words files in Fig. 7.2) determines the instantiated FUs, their interconnections and the contents of the configuration memory.

In this version of the tool flow, the scheduler does not yet generate a configuration which supports multiple Megablocks. In that scenario, the memory would hold the configuration words for the execution of multiple traces, just as in the implementation in Chapter 6.

The bitstream for the base system (i.e., all modules and peripherals, plus the static region of the accelerator) is generated first. Afterwards, each synthesized version of the reconfigurable region imports the base system to generate all partial bitstreams. There are as many partial bitstreams as chosen CDFGs. All bitstreams are placed into an MFS file system, which is a lightweight Xilinx proprietary file system for which a C library is provided. Using promgen, a file which can be used to program the non-volatile flash on the VC707 test board is generated. This file contains only the partial bitstream files. The FPGA itself is configured in a separate step (i.e., it is not programmed from the flash at boot). The exact commands for these two steps are the following:

Listing 7.2: Generating a flash programming file from a file system with all partial bitstreams

# generate the file system
mfsgen -vcf pr.mfs

# generate the file for flash programming;
# options are specific to the flash memory on the VC707
promgen -b -w -p mcs -c FF -s 4096 -bpi_dc parallel \
    -data_width 16 -data_file up 0 pr.mfs -o pr.mcs

The specific area of the FPGA which was used as the reconfigurable region of the accelerator was manually specified. The area encompassed approximately two thirds of one of the lowermost clock regions in the FPGA. It contains a total of 12800 LUTs and 25600 FFs, and corresponds to approximately one fourteenth of the device. The current implementation of the flow does not automatically allocate a reconfigurable region based on the largest resource requirements observed in the synthesis reports of each possible configuration (the reconfigurable area must be specified before the translate and map stages).

7.3 Experimental Evaluation

This implementation was evaluated with the same set of benchmarks as those used in Chapter 6: 13 single-precision floating-point kernels from the Livermore Loops [Tim92] and 11 integer benchmarks from the TEXAS IMGLIB function library [Tex]. The purpose of using the same benchmarks was to determine the resource savings due to DPR for multi-configuration accelerators supporting the same Megablocks.

7.3.1 Hardware Setup

Figure 7.3 shows the system architecture used to evaluate this accelerator implementation. All program code (including the application, harness and partial reconfiguration functions) and data reside in local memories. At boot time, the external memory is initialized with the file system containing all partial bitstreams, by copying it from the non-volatile Flash memory.

Figure 7.3: System architecture for validation of DPR capable accelerator

The accelerator is partially reconfigured during the injector-driven migration process. The CRs now contain a call to a function (written in C) which, according to the calling CR, reconfigures the accelerator with the appropriate bitstream. In order to reduce the reconfiguration overhead, a DMA module is used to send the partial bitstream to the ICAP. The target platform was the VC707 evaluation board, containing a Virtex-7 xc7vx485 FPGA. The tools used for synthesis and bitstream generation are from release 14.7 of Xilinx's ISE Design Suite.

7.3.2 Software Setup

The benchmark setup is identical to that of Chapter 6. All kernels are placed into a single file and compiled along with the test harness, which retrieves execution times, compares results with reference data generated on a desktop machine, and allows for specifying parameters such as the combination of kernels to execute and the amount of data to process. Data is allocated at runtime on the heap, which the accelerator is capable of accessing by receiving live addresses as operands from the MicroBlaze at the start of execution. The memory initialization function is called explicitly at boot time as part of the harness, but the partial reconfiguration function is called only by the CRs. The harness requires 3.3 kB, and the partial reconfiguration functions require 0.55 kB.

7.3.3 Resource Requirements of Static and Reconfigurable Regions

To determine how the resource requirements of the reconfigurable region scale with the number of supported configurations, a total of 11 accelerators were generated, each supporting one Megablock more than the last. The Megablocks used were those from the integer benchmarks. The accelerator with one configuration supports i1, the next supports i1 and i2, and so on.

[Line chart: FFs and LUTs, in thousands, for the static region and for the largest reconfigurable module, versus the number of supported configurations (1 to 11).]

Figure 7.4: Resource requirements of DPR-based accelerator discriminated by static and reconfigurable regions

Figure 7.4 shows the resource requirements of the accelerator with DPR, for the static region and for the largest module in the reconfigurable region. These resource requirements are as indicated by the synthesis reports. The maximum number of FFs and LUTs required by the reconfigurable region only increases when a new, larger configuration is added. In contrast, the resource requirements of the static region increase slightly with each configuration, since it includes the input multiplexers of the load/store units and the output registers. The total number of input and output registers increases as a function of the largest number of inputs and outputs amongst all the supported Megablocks. The largest instance (i.e., for 11 Megablocks) requires 6555 FFs and 7474 LUTs. This is equivalent to 2.93× the FFs and 2.82× the LUTs required by a MicroBlaze with an FPU and without a data cache (i.e., the same configuration as in the previous chapter). The requirement is considerable, but note that for the implementation in Chapter 6, an accelerator which supported i8 and i9 alone required 5405 FFs and 8137 LUTs, according to the synthesis report.

7.3.4 Resource Requirements of DPR Accelerator vs. Non-DPR Accelerator

Figure 7.5 compares the resource requirements of the DPR and non-DPR accelerators. The same sequence of supported configurations as in Fig. 7.4 is used. The requirements shown were taken from the synthesis reports of both accelerator architectures. For the DPR accelerator, the resources shown are the sum of the resources of the static region plus those of the largest variant of the reconfigurable region. Note that the curves in Fig. 7.5 depend on which configurations each accelerator, at each index, supports. Despite this, it is possible to compare how each architecture reacts, in terms of resource requirement increase, when a new configuration is added. Also, the order in which configurations are added is irrelevant when comparing the two architectures for the case with all configurations.

Given this, the number of required FFs is nearly equal for both architectures. It is for the number of LUTs that the use of DPR demonstrates its advantages. For the 10-configuration case, the number of required LUTs is 20273, which is 2.76× the number required by the equivalent DPR-based version. The case with 11 configurations did not finish for the non-DPR accelerator, due to long synthesis times.

[Line chart: FFs and LUTs, in thousands, for the DPR-based and non-DPR accelerators, versus the number of supported configurations (1 to 11).]

Figure 7.5: Resource requirement comparison between DPR and non-DPR capable accelerators. The number of FFs required by the DPR and non-DPR versions are nearly identical for all cases.

Additionally, an accelerator supporting 15 floating-point Megablocks was implemented and compared to an equivalent non-DPR accelerator. The former case requires 7187 FFs and 8100 LUTs, while the latter requires 8326 FFs and 25566 LUTs. The main reason why many LUTs are required by accelerators with many configurations is the size of the configuration word memory, since it is implemented as distributed RAM. Instantiating a single memory, with all configuration words for all Megablocks, increases both the total number of words in the memory and the word width.

The effect of the word width increase can be observed in Fig. 7.5. The marked increase in LUTs for the non-DPR accelerator, which happens for indices 7, 8 and 9, corresponds to adding the Megablocks for i7, i8 and i9 as configurations of the accelerator. The Megablock for i7 contained 40 instructions, and the ones for i8 and i9 are the largest for the integer set, with 142 instructions each. This leads to a configuration word width of 1436 bits, whereas the width was 784 for the accelerators between indices 1 and 7 (which was imposed by the then largest Megablock, extracted from i2). As a result, the size of the configuration memory for the non-DPR accelerator increases from 26 kB to 114 kB, from index 7 to index 9. For 10 configurations, the memory size is 231 kB.
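To illustrate how the distributed-RAM configuration memory grows, the following back-of-the-envelope sketch computes the memory footprint from the word width and the number of configuration words. Only the widths (784 and 1436 bits) come from the text; the word counts used here are hypothetical placeholders.

#include <stdio.h>

/* Size in kB of a configuration memory holding `words` configuration words
 * of `width_bits` bits each, implemented as distributed RAM. */
static double config_mem_kb(int words, int width_bits) {
    return (double)words * width_bits / 8.0 / 1024.0;
}

int main(void) {
    /* Word counts below are assumed values chosen only to show the mechanism. */
    printf("784-bit words:  %.1f kB\n", config_mem_kb(270, 784));
    printf("1436-bit words: %.1f kB\n", config_mem_kb(650, 1436));
    return 0;
}

Widening the word affects every stored configuration at once, which is why adding a single large Megablock inflates the memory of a non-DPR accelerator, whereas a per-Megablock partial bitstream only pays for its own words.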

For the DPR-based accelerator the configuration memory is part of the reconfigurable region. Therefore it is only as large as the largest set of configuration words (in terms of total number of bits required) amongst the configurations. As a result, the DPR based accelerator suffers only a minor increase in number of required LUTs at this point. The size of the configuration memory is determined only by the largest configuration, and is 33 kB for the accelerator of index 9.

For an additional comparison, consider the same groups of kernels used in Chapter 6 (Table 6.3, page 124) to evaluate the multi-configuration performance and resource requirements. Table 7.1 shows the resource requirements, according to the synthesis reports, for the accelerator design of the previous chapter on the right-hand side, and for the present design on the left-hand side. For the latter case, the resource requirements were calculated as the sum of the resources reported for the static region plus the resources required by the largest variant of the reconfigurable partition. The reduction in both LUTs and FFs from the non-DPR case to the DPR case is consistent with the evaluation presented in Fig. 7.5. That is, for up to 3 configurations, the number of LUTs decreases to 0.77× of the non-DPR value on average, while the number of FFs is nearly the same.

Table 7.1: Resource requirements for several multi-configuration accelerators, normalized to the requirements of a single MicroBlaze, according to synthesis reports

                          DPR-Based Accelerator
                          (Static Region + Largest Module)    Non-DPR Accelerator
ID        # Megablocks    Flip-Flops    Lookup Tables         Flip-Flops    Lookup Tables
f_2_8     2               2.22          1.97                  2.34          2.35
f_3_9     3               0.98          0.91                  0.95          1.24
f_4_5_6   5               2.43          2.23                  2.77          4.32
f_7_10    2               1.13          0.95                  1.05          1.09
i_5_6     2               0.78          0.61                  0.72          0.76
i_8_9     2               2.74          2.25                  2.70          3.38
i_10_11   2               0.97          0.72                  0.88          0.76
mean                      1.61          1.38                  1.63          1.99

MicroBlaze (w/ FPU; no caches): 2236 Flip-Flops, 2654 Lookup Tables

The resource utilization of the DPR-based accelerator was estimated as described above because synthesizing the accelerator as a whole (i.e., replacing the black-box reconfigurable region with its largest variant) would cause the synthesis tools to perform optimizations across the boundaries of both regions. On the other hand, this prevents the calculation of the synthesis frequency, since the tools cannot analyse signal delays across the region boundaries. So, as an estimate, single-configuration instances of the DPR-capable version were synthesised (where the reconfigurable region black box is replaced with an actual instance). Considering the worst synthesis frequency for all cases (e.g., the lowest frequency between the two Megablocks for f_2_8), the average synthesis frequency is 209.9 MHz, versus 205.2 MHz for the non-DPR accelerator.

Note that although the results presented thus far are good indications of resource consumption, the synthesis reports do not take into account the placement constraint which defines the reconfigurable area. So, the multi-configuration cases in Table 7.1 were fully implemented to retrieve more accurate results from the post-map reports. The post-map reports indicate that the average accelerator requires 4074 LUTs, 3220 FFs, and 2051 slices. This corresponds to 1.79× the slices of the MicroBlaze instance used in the same system (i.e., with an FPU and a data cache). But when compared to the MicroBlaze instance used in the previous chapter (i.e., with an FPU but without a data cache), this corresponds to an increase of 2.39×. This is considerably more than the average size of the non-DPR accelerator instances, which required 1.86× the slices the MicroBlaze required. This occurs because it is not possible to place logic of both the static and reconfigurable regions into the same slices. This means a larger number of slices (and, to a lesser extent, LUTs) is required if the number of supported loops is too small to reach an advantageous trade-off. This is true for nearly all cases of this set, save for f_4_5_6, which supports 5 configurations.

[Combined chart: synthesis time in seconds versus the number of configurations (1 to 11), for the DPR static region, the total of all DPR reconfigurable-region variants, and the non-DPR accelerator (logarithmic right-hand axis).]

Figure 7.6: Synthesis times for DPR and non-DPR capable accelerators

In this case, the number of LUTs required by the accelerator is 0.51× the number required by a non-DPR accelerator, and the number of slices is 0.88×. The same reduction in the number of LUTs is indicated by the respective synthesis report, since the amount of resources in the reconfigurable area outweighs the overhead of partitioning the accelerator. This suggests, especially for numerous configurations, that the synthesis reports, and therefore the comparison in Fig. 7.5, provide good estimates of LUT savings, and also of area savings, since the logic in the reconfigurable area outweighs the static region. More precisely, since the DPR-capable accelerator can support a (theoretically) unlimited number of configurations in the same area, the actual resource and area requirements it represents are in fact determined by the area reserved for the reconfigurable region. For this evaluation, that area corresponded to 12800 LUTs and 25600 FFs, which was much larger than what was required.

7.3.5 Synthesis Time of DPR-Capable Accelerator vs. Non-DPR Accelerator

Resorting to DPR leads to an additional and very significant advantage: the synthesis time of the accelerator drastically decreases when targeting multiple loops. The synthesis time required for the DPR-capable accelerator is equal to the synthesis time of the static region, plus the synthesis times of all variants of the reconfigurable region. Figure 7.6 shows this, and the synthesis times of the non-DPR version of the accelerator. For the latter case, the synthesis times become prohibitively large for more than 8 configurations. For 10 configurations, the time is close to 11 hours. For the case with 15 floating-point configurations, the non-DPR instance required 59 hours for synthesis, while the DPR design only required 14 minutes.

7.3.6 Effect of Partial Reconfiguration Overhead on Performance

The purpose of this chapter was to demonstrate the resource savings on the accelerator when resorting to DPR. However, the partial reconfiguration process introduces overhead when using the accelerator. For completeness, this section also presents the measured speedups and DPR overheads for the multi-configuration cases shown in Table 7.1. Only these cases were used, as they were readily available. Note that each supported Megablock is still modulo-scheduled at its minimum possible II, so any decreases in performance are due to DPR overhead.

Table 7.2: Partial reconfiguration overhead and speedups, for N = 1024 and three values of L

                             Overhead from Partial Reconf.
                             (% of accelerator execution time)     Overall Speedup
ID    Benchmark              L = 500   L = 1000   L = 2000         L = 500   L = 1000   L = 2000
f2    diffpredict            42        26         15               3.50      4.42       5.09
f8    intpredict             39        25         14               4.35      5.44       6.23
f3    glinrecurrence         4         2          1                2.05      2.10       2.12
f4    hydro                  19        11         6                9.65      10.7       11.3
f5    hydro2d                75        75         75               0.26      0.26       0.26
f6    hydro2dimp             10        5          3                9.18      9.66       9.91
f7    innerprod (fp)         14        8          4                3.69      3.97       4.13
f9    lin_rec                97        97         97               0.08      0.08       0.08
f10   matmul (fp)            <1        <1         <1               3.07      3.08       3.08
i5    sad16x16               <1        <1         <1               1.06      1.06       1.06
i6    mad16x16               <1        <1         <1               1.06      1.06       1.06
i8    dilate                 17        9          5                3.84      4.20       4.40
i9    erode                  17        9          5                3.84      4.20       4.40
i10   innerprod (int)        18        10         5                3.10      3.41       3.58
i11   matmul (int)           <1        <1         <1               2.41      2.41       2.41
mean                         6         4          2                2.12      2.27       2.35

Table 7.2 shows how much of the accelerator execution time corresponds to partial reconfiguration time, for several consecutive calls of the same kernel, as determined by the parameter L. Since these results were retrieved from complete on-board implementations, this demonstrates the correct functioning of the transparent DPR-based reconfiguration. The averages shown are arithmetic for the overhead and geometric for the speedups.

The time required to reconfigure the accelerator (i.e., to write a partial bitstream to the reconfigurable area) was estimated by considering the time measured by the injector. For most cases in Table 7.2, the respective accelerator supports a single Megablock. This means that the DPR overhead is incurred once. For later invocations of the accelerator, the partial reconfiguration is skipped. Measuring the execution time for two values of L is enough to determine which portion of the measured time corresponds to the partial reconfiguration. For all cases the time is nearly identical, 3.5 ms, considering the system clock frequency of 100 MHz.

This leads to a low overhead in most cases, although for f2 and f8 the overhead is still significant, since the value of N used, 1024, results in a lower number of iterations in comparison to the remaining cases. However, the effect of increasing L is observed for all cases, as the impact of the overhead diminishes. For a large enough value of L the DPR overhead would become insignificant, since it only occurs once. In a realistic scenario, however, the accelerator would be reconfigured often if it supported several Megablocks.
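The extraction of the reconfiguration time from two measurements, mentioned above, amounts to fitting a simple linear model of the total accelerator time (assuming, as is the case for the single-Megablock accelerators, that the partial reconfiguration occurs exactly once per run):

T(L) = t_{PR} + L \, t_{kernel}
\quad\Rightarrow\quad
t_{kernel} = \frac{T(L_2) - T(L_1)}{L_2 - L_1},
\qquad
t_{PR} = T(L_1) - L_1 \, t_{kernel}.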

This frequent reconfiguration can be observed for f9, where two Megablocks were accelerated. Each Megablock represents a small loop which iterates N times. Since the harness repeats the execution of both loops L times, a partial reconfiguration of the accelerator takes place a total of 2 × L times. As a result, the execution time increases drastically, since each kernel repetition requires two partial reconfigurations. The same is true for f5, which supports 3 Megablocks.

The reconfiguration overhead could be reduced in several ways. Firstly, the size of each partial bitstream file is 592 kB. Reducing the size of the file would shorten the time required for the DMA to transfer it to the ICAP. Compressing the bitstream files was attempted, but the compression ratio was too small to result in any significant difference. Instead, reducing the reconfigurable area to the minimum required would be the most straightforward way to reduce reconfiguration overhead for the current system implementation. All of the used partial bitstream files have the same size, since this aspect is not determined by the logic they contain, but by the size of the reconfigurable area. Secondly, existing approaches have shown that DPR overhead can be reduced by, for example, greatly increasing the ICAP operating frequency [HKT11].

However, the overhead of utilizing a DPR-capable accelerator in this approach can also be reduced due to the capacity for mixed-grain reconfiguration. That is, the overhead can be reduced if a single circuit configuration (i.e., partial bitstream) is used to support multiple Megablocks using multiplexing logic, as the previous approaches have shown, instead of implementing each in a separate partial bitstream. Determining which Megablocks to support in a single circuit configuration for the reconfigurable area allows for balancing overhead against circuit complexity.

7.4 Concluding Remarks

This chapter presented a proof-of-concept system relying on a single-row accelerator design, augmented with Dynamic Partial Reconfiguration (DPR) capabilities, to implement the transparent acceleration of binary traces. The approach relies on generating customized instances of the accelerator, and on using DPR for the purpose of switching between supported configurations.

The DPR-based reconfiguration, as opposed to implementing all reconfiguration capabilities via in-module logic, allows for very significant resource savings, as well as reduced synthesis times for the accelerator instances. The approach is advantageous when wishing to support a large number of Megablocks. For instance, for a set of 7 accelerator instances, each of which supported 2 to 4 Megablocks, the number of LUTs required by the DPR-capable accelerator, according to post-map reports, is 1.13× that of the non-DPR version, and the area (indicated by the number of slices) is also larger. However, for one accelerator in this set which supported five Megablocks, the DPR-based accelerator requires 0.51× the LUTs and 0.88× the slices relative to an equivalent non-DPR instance. As another example, the number of LUTs required by an accelerator supporting 10 Megablocks can be reduced by 2.76× by relying on DPR.

As expected, the DPR overhead has a noticeable effect on speedups. However, the size of the reconfigurable area was larger than what was required, meaning the reconfiguration time could be reduced. Additionally, other works have proven that the DPR time can be greatly reduced relative to the vendor-provided flow (as per Section 2.3.2), either by adopting similar DMA-based solutions or entirely custom ICAP controllers capable of higher throughput [VF12, EBIH12]. Finally, the ICAP module could also be driven at a higher clock frequency [HKT11] than the 100 MHz used for this evaluation.

Also, the current version of the scheduler does not make any kind of intelligent decision regarding which Megablocks should be grouped to generate a partial bitstream. This evaluation relied on generating one configuration per Megablock. Instead, the most efficient approach would be to generate partial bitstreams which support multiple Megablocks. In this scenario, some of the reconfiguration capabilities come from completely modifying the accelerator circuit via DPR, and some from relying on multiplexing logic, as the previous approaches have shown. This has the potential to greatly reduce the required partial bitstream storage capacity and especially the reconfiguration overhead, since DPR would not be required as frequently.

Regardless, the fully functional implementation of an accelerator with DPR capabilities as presented in this chapter is a very significant step towards self-adaptive systems. The framework of the accelerator is a viable starting point for the development of an embedded tool for automatic acceleration of frequent traces by on-chip generation of configurations for the reconfigurable region [SF12, SPA08, BFD08]. On its own, this approach provides a high degree of transparency for the application programmer, allows for fast generation of accelerator hardware, achieves considerable speedups, and exploits the yet underutilized technological feature of DPR.
This accelerator design and evaluation demonstrated, by relying on existing vendor tool chains alone, that this capability, unique to FPGAs, allows for significant improvements in terms of resource efficiency. In the context of this work specifically, it allows for the generation of compact accelerators supporting large numbers of configurations, which do not sacrifice performance for the sake of flexibility.

Chapter 8

Conclusion and Future Work

This thesis presented a transparent binary acceleration approach, based on creating a customized accelerator for the execution of hot spots. The hot spots are represented by Megablocks, which for the purposes of this work were obtained offline via simulation. The main focus was to devise a methodology for embedded application acceleration which was lightweight, fast, and efficient.

The motivation was two-fold. Firstly, to further advance the already developing field of automated HW/SW partitioning approaches. Existing approaches are useful for developers comfortable with embedded development and tools. The tools and methodology developed in this work can be adopted by this type of user, but an effort was also made to make them more approachable by developers less familiar with these environments, by keeping the tools on the back-end. Secondly, the approach was meant to efficiently exploit any inter- and intra-iteration instruction parallelism, to maximize performance. An accelerator architecture template is automatically customized to execute a given set of Megablock traces, focusing on a trade-off between specialized single-configuration circuits and reconfigurable circuits supporting multiple loop traces.

8.1 Characteristics of the Developed Approach

The developed approach relies only on binary trace information that is produced after compilation by a simulation step. No source or binary code modifications are required, either manually or during compilation via a custom compiler. An accelerator specification is quickly generated, usually in the order of seconds, from a set of chosen instruction traces to accelerate. The tools also generate the necessary communication primitives between the host processor and the accelerator, and the transparent migration makes it unnecessary to modify the application post-partitioning to implement the communication between components.

In Chapter 4, the first implementation of the accelerator architecture and tools was presented. This first design served mostly to validate the approach at a system level, demonstrating that execution remained functionally correct while being transparently migrated to the accelerator by the injector and CR method. Execution of Megablocks on the accelerator did not support memory accesses nor exploit inter-iteration parallelism, but multiple configurations were supported via interconnection configuration at the register level.

The applicability of the approach was vastly improved in Chapter 5 with the addition of up to two concurrent memory accesses. Supporting the acceleration of traces with memory access operations enables the approach to target realistic workloads. In Chapter 6, the area/performance trade-off is vastly improved with a compact architecture that also exploits inter-iteration parallelism. Loops are modulo-scheduled, taking into consideration the two-memory-port limitation and its latency. When relying on local memories with a constant, known latency, execution on the accelerator never idles waiting for a memory access. Additionally, the applicability was considerably increased by supporting floating-point operations. Finally, in Chapter 7 the accelerator architecture of the previous chapter is modified to support Dynamic Partial Reconfiguration (DPR). The accelerator is partitioned into a static and a reconfigurable region. Instead of supporting multiple configurations by increasing interconnection complexity, one partial bitstream is generated per Megablock to be supported. This way, multiple configurations are supported without increasing circuit complexity, while also saving resources and area.

To summarize, the developed tools and accelerator designs benefit from the following features and characteristics:

1. No source or binary code modification is required, before or after partitioning

2. Automated generation of specialized accelerators capable of executing multiple-loops

3. Support for floating-point operations on the accelerator, via fully pipelined units

4. Support for concurrent data memory accesses (including heap allocated memory)

5. Transparent migration of execution from software to hardware

6. No costly data transfers between processor and accelerator when relying on local memories

7. Exploitation of intra-iteration ILP

8. Exploitation of inter-iteration ILP via modulo-scheduling at the minimum possible IIs

9. Runtime reconfiguration of the accelerator via DPR

The experimental validations presented for the several accelerator designs showcased fully functional transparent binary acceleration systems. With each accelerator implementation, the performance improvements and area efficiency increased considerably. The earlier implementation in Chapter 4 was only capable of targeting integer kernels, and the geometric mean speedup resulting from exploiting intra-iteration ILP was 2.08×, for a mean resource cost of approximately 2.5× that of a single MicroBlaze. The single-row implementation in Chapter 6 decreases this cost to 1.12×, while increasing the performance to a geometric mean speedup of 5.61×. In Chapter 7, the use of DPR to reconfigure the accelerator lowers its resource requirements and synthesis time. This is especially noticeable for large numbers of configurations, allowing for the acceleration of numbers of Megablocks which, for a non-DPR accelerator, would entail a very high resource cost.

The vision of a fully autonomous self-adaptive system motivated some aspects of this work, but the presented approach, which relies on the described offline/online flow, is useful in its own right. It may be employed by designers even when only binary executables are available, for a quick and transparent detection of the critical application portions. The fast generation of custom accelerator circuits delivers a significant performance increase with very little effort. This is aided by the fact that the accelerator relies on standard Xilinx interfaces. For instance, the use of FSL connections to communicate with the host processor is compatible with Xilinx's intended use of the interface regarding co-processors. Also, relying on unmodified binaries means the vendor compilation flow is undisturbed. An entire accelerator-augmented system can be instantiated using Xilinx's embedded development environments, as was the case for all experimental scenarios in this work. The user only needs to generate the accelerator specification prior to bitstream generation. So, this would be the primary use case for the developed approach and architectures. Integration of the developed tools into the vendor tool flow would also be possible.

The described scenario relies on having one accelerator connected to one FSL port (or PLB in earlier implementations). But two additional use cases are possible with very minor modifications to the developed flow. The first would be the generation of several accelerators (either single- or multi-configuration), each coupled to its own FSL interface. This would reduce the circuit complexity that a single, larger accelerator would suffer from, at the cost of additional circuit area. A second case would be the use of one accelerator by two or more MicroBlaze processors in a time-multiplexed fashion. The accelerator architecture is abstracted from the processor execution context, so supporting multiple CDFGs originating from different binaries would also be possible. For any use case, using DPR for module-level reconfiguration would be an optional feature.

8.2 Future Work

There are three major aspects to highlight regarding future work. Firstly, some potential improvements to the primary use case of the developed approach, mostly related to tool flow and integration, are listed in the next section. Secondly, the achievable performance would increase if conditional execution were supported on the accelerator. This is explained in Section 8.2.2. Finally, this work was motivated by the concept of a fully autonomous self-adaptive system to be developed in the future. With this in mind, the accelerator progressed towards an architecture which was envisioned to be a suitable target for on-chip configuration generation and reconfiguration. Section 8.2.3 discusses the developments and potential solutions necessary to realize this concept.

8.2.1 Potential Improvements to the Developed Approach

Although the approach is fully functional, there are some aspects that could be addressed to considerably increase its applicability. The following limitations can be identified, and the succeeding paragraphs briefly explain how they could be addressed.

Only the MicroBlaze Instruction Set is Supported   Supporting other instruction sets implies the need for a different simulation platform to retrieve traces and possibly the design of additional FUs, such that each instruction in the chosen host processor's instruction set maps to an FU type (save for special register handling instructions). Also, the developed scheduling and translation tools can generate an accelerator from any CDFG, as long as that graph represents a single iteration of a loop path, the nodes are host processor instructions, and the inputs/outputs of the graph represent registers from the processor's register file. That is, support for a different RISC instruction set would not require significant architecture or approach modifications beyond a simulator/platform to extract execution traces.

The Accelerator and Tools are Geared Towards Xilinx Platforms   Likewise, the accelerator and injector rely on the interfaces employed by the MicroBlaze and LMBs. Supporting another processor, or another type of memory interface, would require either new interfaces or bridges. The injector especially would require modification, as its operation is dependent on the instruction bus behaviour as driven by the MicroBlaze.

Interference by Adding CRs to the Application via Re-Compilation   The use of a re-compilation step to add the CRs to the binary could be avoided by instead directly modifying the binary using objcopy, which is capable of adding sections to the ELF file.

Manual Selection of Extracted Megablocks Automating the selection of Megablocks to accelerate would fully automate the translation flow. It would be relatively straightforward to parse the Megablock information and select those with the highest execution time, or highest number of instructions. However, it would not be as easy to exclude traces which represent processing by functions such as printf, or other peripheral communication, without a more in-depth processing of the executable. Accelerating this type of Megablock is not currently possible, since the accelerator would require access to the MicroBlaze's memory-mapped peripherals.
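A minimal sketch of such a selection heuristic is given below in C; the statistics structure, field names and the peripheral-exclusion flag are hypothetical, and a real implementation would additionally have to consult the ELF symbol table to mark traces belonging to routines such as printf.

#include <stdlib.h>

/* Hypothetical per-Megablock statistics, as parsed from the extractor output. */
typedef struct {
    unsigned start_addr;      /* address of the first trace instruction            */
    unsigned num_inst;        /* instructions per iteration                        */
    unsigned exec_cycles;     /* cycles spent in this trace during simulation      */
    int      touches_periph;  /* set if the trace accesses peripheral code/regions */
} mb_stats_t;

static int by_exec_time(const void *a, const void *b)
{
    const mb_stats_t *x = a, *y = b;          /* sort by decreasing execution time */
    return (x->exec_cycles < y->exec_cycles) - (x->exec_cycles > y->exec_cycles);
}

/* Keep the top 'max_cfgs' traces by execution time, skipping unsupported ones. */
unsigned select_megablocks(mb_stats_t *mb, unsigned n, unsigned max_cfgs,
                           unsigned selected[])
{
    unsigned count = 0;
    qsort(mb, n, sizeof(mb[0]), by_exec_time);
    for (unsigned i = 0; i < n && count < max_cfgs; i++)
        if (!mb[i].touches_periph)
            selected[count++] = mb[i].start_addr;
    return count;
}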

Accelerating Software Emulation Sub-Routines Another considerably more complex aspect is the identification of operations such as floating-point addition or integer division when they are performed via software emulation. This would allow for executing these operations on the accelerator without enforcing the presence of an FPU or integer divider on the MicroBlaze, which is otherwise necessary to ensure that the respective trace instructions appear. Again, this would require processing the application's ELF file to determine symbol names. Afterwards, calls to such sub-routines could be replaced in the detected traces with equivalent accelerator-supported instructions, if any.

Integration with Vendor Flows Currently, the tools lack integration with the vendor development flows in such a way that the partitioning process, from Megablock detection to bitstream generation, is a seamless GUI-based process. This aspect and the automated instantiation of the accelerator module and supporting hardware would be the final integration steps.

Implementing Efficient External Memory Access Supporting external memory access allows for the acceleration of larger applications, especially those which deal with large volumes of data. Also, a realistic system contains multiple levels of memory, meaning the host processor resorts to caches. Sharing, or synchronizing, the accelerator and processor data caches efficiently, and ensuring that the data parallelism to exploit is not hindered by the memory latency, are the two main aspects of efficient external memory support. This is explored by the proof-of-concept in Appendix A.

8.2.2 Support for Multi-Path Traces

One significant limitation of the developed approach is that the accelerator cannot execute an entire loop body which contains multiple paths. This is due to the nature of the Megablock trace, which by definition captures neither the multiple execution paths through a loop nor the conditional operations which control the assignment of values to registers. One possible solution is to process two or more traces starting at the same address, post-extraction, in order to generate a single CDFG with control nodes representing conditional value assignments. The accelerator would also have to be augmented, by adding dynamic control of register write-enables based on the results of previous FUs. Support for multi-path traces would significantly improve the accelerated code coverage and would further abstract the software developer from partition limitations.
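As a simple illustration of the intended transformation (a form of if-conversion), consider a two-path loop body and its merged single-path equivalent; the variable and function names below are illustrative only, and the select expression corresponds to a control node whose comparison result would drive a data-dependent register write-enable on the accelerator.

/* Loop body with two control paths, as found in the original binary. */
static int step_two_paths(int acc, int xi, int t)
{
    if (xi > t) acc += xi;
    else        acc -= xi;
    return acc;
}

/* Merged single-path form: both candidate values are computed and a select,
 * driven by the comparison, decides which one is committed to the register. */
static int step_merged(int acc, int xi, int t)
{
    int add = acc + xi;
    int sub = acc - xi;
    return (xi > t) ? add : sub;
}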

8.2.3 Runtime HW/SW Partitioning via DPR

The previous section outlined some potential improvements to the developed work. Addressing those issues, especially the automated selection of traces and the exclusion of certain code regions, would contribute to further automating the approach. However, two approach-level aspects stand out as potential future work towards a self-adaptive system. The first concerns the runtime detection of traces to accelerate and their translation into a CDFG format (which involves determining data dependencies), whilst the second is the generation of accelerator configurations.

Runtime Trace Detection and CDFG Generation The first problem is essentially a pattern detection issue. Unlike an offline simulation step, this task is more difficult if performed on-chip. Firstly, a monitoring mechanism is needed to observe both the instruction address and the instruction itself. The injector is a step towards this function. However, in order to detect patterns, the sequence of observed instructions must be stored and, at every new observed instruction, a number of comparisons must be performed in order to detect a repeating pattern. The work in [BPCF13] proposes some hardware solutions for pattern detection. Implementing a pattern detector in hardware implies a high resource cost, and potentially limits the detectable pattern size. Although detection of traces could be performed by a dedicated hardware module despite the cost, it would be very difficult to extract the needed CDFG representations from the patterns. So, an alternative would be to have software-driven pattern detection and CDFG extraction. The work in [LV09] relies on a second MicroBlaze to perform this type of on-chip CAD tasks. But a different solution can be outlined: use the main processor itself to perform trace pattern detection, by periodically processing a trace (of a maximum specified size) stored in an auxiliary memory. The left-hand side of Fig. 8.1 demonstrates this concept. An external interrupt driven by the monitoring module sends the host processor to an embedded partitioning routine. The interrupt could potentially be triggered based on counters which register how often a backwards branch address is observed, since this is an indicator of loops. Up to N instructions executed prior to the interrupt would be stored in the auxiliary memory. The cost of doing this is relatively minor. The Megablock traces accelerated throughout the experimental evaluations in this work contained approximately up to 120 instructions, so a memory size of 3 kbit would suffice. To mitigate the overhead of halting the application to perform partitioning, the interruptions could be triggered only when the accelerator is executing (after at least one configuration is already generated). The embedded tools would also need to generate the CRs, place them into memory, and configure the injector. As with the developed approach, the binary would remain unmodified, acting as a software fallback. By relying on interrupts, the embedded partitioning tools would not have to be manually integrated into any target application by the designer. The library would simply be added to the compilation chain, and interrupts would have to be enabled. A large part of the effort associated with this solution would be the porting of the existing Megablock extraction and developed translation tools to versions suited for an embedded system.
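A rough sketch of such an embedded partitioning entry point is shown below; it assumes a hypothetical memory-mapped trace buffer filled by the monitoring module, and all addresses, sizes and names are assumptions made only for this illustration, not part of the implemented system.

#include <stdint.h>

#define TRACE_DEPTH 128                                 /* >= ~120 instructions per trace */
#define TRACE_OPS  ((volatile uint32_t *)0xC0000000u)   /* hypothetical auxiliary memory  */
#define TRACE_PCS  ((volatile uint32_t *)0xC0001000u)   /* matching instruction addresses */

/* Interrupt handler invoked when a backward branch address has been observed
 * often enough to suggest a loop; it looks for a repeating address pattern. */
void partitioning_isr(void)
{
    uint32_t pc[TRACE_DEPTH], op[TRACE_DEPTH];

    for (int i = 0; i < TRACE_DEPTH; i++) {    /* snapshot the last N instructions */
        pc[i] = TRACE_PCS[i];
        op[i] = TRACE_OPS[i];
    }

    /* Accept a period p when the last 2*p sampled addresses repeat exactly. */
    for (int p = 2; p <= TRACE_DEPTH / 2; p++) {
        int repeats = 1;
        for (int i = 0; i < p; i++)
            if (pc[TRACE_DEPTH - 1 - i] != pc[TRACE_DEPTH - 1 - p - i])
                repeats = 0;
        if (repeats) {
            /* Candidate Megablock of p instructions found: the opcodes in op[]
             * would now be used to build the CDFG, generate the CRs and
             * configure the injector (not shown here). */
            (void)op;
            break;
        }
    }
}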

Runtime Accelerator Generation via Partial Bitstream Manipulation The second issue lies with the generation of accelerator configurations. The solution proposed for the extraction of CDFGs also works for this task: the host processor could be used to generate accelerator configurations. So the specific problem is the choice of accelerator reconfiguration/customization method. The approaches presented in this work relied on offline HDL specifications, which is not an option in an on-chip scenario. So two possible alternatives exist: 1) rely on a static, ALU-based accelerator with configurable interconnects, or 2) rely on DPR to generate custom circuits on-chip by combination of coarse-grained components, stored as partial sub-modules. Keep in mind that the solutions proposed here consider the accelerator architecture shown in Chapter 6 as the target. The first alternative is the approach typically adopted by existing works. It was also proven to be efficient for the accelerator architecture developed in Chapter 6 by the evaluation of ALU-based instances. In this case, the accelerator would also need crossbar-type connectivity, since the configurations would be generated online. The register pool would need to be generic as well. The second approach would rely on having an accelerator with a static portion and several partially reconfigurable areas. Each supported FU would be stored as a single partial bitstream file, so each area would be as large as the largest FU. Some high-cost units could be made static as well, like the FPU. The right-hand side of Fig. 8.1 illustrates this. Generating a configuration would entail determining the FUs to instantiate and their connectivity, similar to the developed work, but relying on DPR. The overhead of generating a configuration would only be incurred once. The information as to which combination of coarse-grained modules to place on the single row, as well as the connectivity, could be stored for later accelerator calls.

Figure 8.1: Concept for a fully autonomous self-adaptive system based on Dynamic Partial Reconfiguration

The other source of overhead would be the DPR reconfiguration itself, as the accelerator is called to accelerate different translated traces. By reconfiguring the accelerator FU by FU, some of this overhead could be eliminated, since a given needed unit may already be instantiated when reconfiguring. The translation process could be made intelligent so as to maximize the number of matching units per type and position, decreasing the DPR overhead. Generating the connectivity and register pool dynamically is more difficult. A possible first solution would be to rely on crossbar connectivity and a generic register pool, both residing on the static area of the accelerator.

8.3 Concluding Remarks

This work developed a fully functional transparent binary acceleration approach. Specifically, tools relying on binary trace information alone automate the generation of customized reconfigurable accelerators. The most efficient accelerator architecture is, on average, the size of a single MicroBlaze processor, and provides significant performance increases, especially given the small effort required to generate an accelerator instance. For software developers without hardware expertise, the automated hardware generation makes heterogeneous systems more approachable targets. Even for experienced developers, the time saved in hardware design is a considerable advantage. The approach relied on FPGAs as the target devices, as they have evolved to be viable as deployment platforms, especially given that the current trend seems to be towards integrating FPGA fabric as part of a System-on-a-Chip []. To conclude, the developed approach is useful for an embedded development scenario where the performance needs of the target application can be met with dedicated circuits which balance specialization and reconfiguration. Finally, it is also a significant stepping stone for future self-adaptive systems.

Appendix A

External Memory Access for Loop Pipelined Multi-Row Accelerators

The implementations presented in the main body of this thesis relied on system architectures containing only local data memories. This simplified the design of the memory sharing mechanism that allows the accelerator to access data memory. This local-memory-based scenario is fully functional and useful for embedded applications which do not need to rely on external data memory. However, some applications may require data to reside in external memory, either due to the volume of the data to be handled or, for instance, to make it possible for other peripherals to easily access that data as well. The main objective of this work was to focus on transparent binary acceleration and its supporting mechanisms. However, it is important to demonstrate that the approach is viable for such a scenario. Therefore, this appendix presents the design and a performance analysis of an accelerator coupled to a customizable dual-port cache, which interfaces with an external memory controller to access data stored off-chip.

The accelerator is an extension of the design shown in Chapter 5. To cope with the variable memory access latency caused by the external memory, the load/store Functional Units (FUs) have been adapted into pipelined units, and the rows of FUs now support multi-cycle execution. To reduce the impact of longer memory access latencies, this implementation also exploits loop pipelining by concurrently enabling multiple rows. Finally, an accelerator instance may also contain synchronization logic which allows it to operate at a higher frequency than that of the remaining system. Likewise, if the accelerator complexity lowers its maximum operating frequency relative to the baseline system, this allows for the remaining software execution to be unaffected.

The accelerator architecture is presented in Section A.1, and the developed configurable dual-port cache in Section A.2. Section A.3 presents the experimental results, and Section A.4 concludes this appendix.


Figure A.1: Pipelined 2D accelerator adapted for higher memory access latencies

A.1 Accelerator Architecture

Figure A.1 shows a synthetic example of the accelerator architecture, which is similar to the multi-row design from Chapters 4 and 5. Like all implementations, each accelerator instance is based on customizing the number, type and location of FUs and their interconnections. The two distinguishing features are a more sophisticated backwards interconnection capability between FUs, which allows for direct connections from any row to any other preceding it, and the ability to simultaneously activate multiple rows of FUs.

A.1.1 Structure

The accelerator is composed of rows of FUs which execute in parallel. Independent operations are translated into FUs on the same row, and generally each row corresponds to a CDFG level. Each FU input is fed by a customized multiplexer. This multi-row accelerator design has the capability to support backward connections from any row to any preceding row. For example, the first row can feed data back into itself. In the previous multi-row design, the data had to be propagated to the bottom of the array, and only then fed back. This allows the implementation of the exact, and shortest possible, backwards node connectivity present in the source CDFGs. Together with the capability of simultaneously activating multiple rows, this allows for loop pipelining. This is implemented via the per-row control modules shown on the left-hand side. Memory accesses are handled by a Memory Access Manager (MAM) unit (not shown in Fig. A.1), which interfaces with all rows containing memory operations, as shown on the right-hand side.

The array supports all integer operations (including division by a constant), comparison and logic operations, and also load/store operations. Floating-point operations and integer division by a non-constant divisor are not supported. All units are single-cycle, except the constant integer division (2 clock cycles), the multiplication (3 clock cycles), and the load operations (at least 4 clock cycles). The latency of load/store operations depends on the memory access contention (which is a function of the number of memory units on the array and how frequently they are activated). The behaviour of the cache also influences the execution of these units. The load FUs were re-designed into a four-cycle operation, based on the number of cycles required to fetch a datum from the cache when there is a tag hit. These FUs are pipelined, so registers are added at the outputs of other FUs on the same row as a load unit, to synchronize data. Likewise, when a row contains a three-cycle multiplication unit, registers are added to the outputs of other FUs. The multiplication unit is fully pipelined, so the resulting inter-row pipelining allows for producing data every cycle. The store operations may buffer up to one datum if no memory ports are available. These units only introduce latency if they are activated again before their buffered datum is written to memory. The synchronization logic, which is automatically instantiated if the system and accelerator clocks differ, is placed at the input/output register interfaces and also at the memory interface of the cache. That is, the cache and the accelerator may operate at a frequency different from that of the remaining system, as this facilitates the interface of the cache to the external memory controller port.

A.1.2 Execution

The accelerator is invoked by the same migration mechanism used in all other implementations. When a Megablock start address is detected, the injector sends a single configuration word to the accelerator, which sets the number of expected operands and the global multiplexer context. If the accelerated loop contains one or more store operations, the respective Communication Routine (CR) includes a loop to invalidate the MicroBlaze's data cache (a sketch of such a loop is given after this paragraph). The accelerator's own data cache is invalidated in a single cycle prior to execution. Any number of exit FUs on the array (which represent the branch operations of the MicroBlaze instruction set) keep track of the conditions which end the execution. For multi-configuration accelerators, the number of rows which need to be activated to complete an iteration is equal to each accelerated CDFG's Critical Path Length (CPL), i.e., its depth. The output registers can be fed by any FU in any row, just as any FU in any row can be fed by any other. The per-row control modules on the left-hand side of Fig. A.1 enable their respective rows based on the valid signals they receive from all rows. For example, the control for the first row issues an enable once a valid signal is issued by that same row. Likewise, the second row is enabled once a valid signal is issued by the first row. However, a row may produce data consumed by multiple rows. For instance, the first row in Fig. A.1 produces data consumed by itself and by the second row. This means that the first row cannot produce new data before the existing data is read by all consumer rows. Thus, a consumed signal is asserted by every row after it activates.
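The fragment below is only a sketch of how such a CR-side invalidation loop could be written, assuming the 256-byte, 8-words-per-line data cache configuration used in this appendix; the use of the MicroBlaze wdc instruction through inline assembly is an assumption about one possible implementation, not a listing of the actual routine.

#define DCACHE_BYTES      256u   /* MicroBlaze data cache size used here */
#define DCACHE_LINE_BYTES  32u   /* 8 words of 4 bytes per cache line    */

/* Invalidate the MicroBlaze data cache line by line; wdc marks the line that
 * the given address maps to as invalid. */
static void cr_invalidate_dcache(void)
{
    for (unsigned a = 0; a < DCACHE_BYTES; a += DCACHE_LINE_BYTES)
        __asm__ volatile ("wdc %0, r0" :: "r"(a) : "memory");
}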

Figure A.2: Memory Access Manager (MAM) for this implementation. Each memory port (right-hand side) manages half of the load/store units (m is the total number of units on the array).

This kind of control is easy to generate on a per-accelerator basis, implementing each CDFG's different inter-row data dependencies, which sometimes include several backwards connections. Also, this control allows the accelerator to cope with the variable latency of the load/store operations. For clarity, the memory port logic is omitted in Fig. A.1, but it is nearly identical to the previous multi-row design. The loop pipelining implemented by concurrent row enabling increases performance by further exploiting instruction parallelism, and also by mitigating the impact of external memory access latency, as operations are executed while a given row idles waiting for load/store operations to complete.

A.1.3 Memory Access

Like previous implementations, load/store units receive operands from any other FUs. This means accesses can be made to arbitrary addresses. The access requests are handled by a Memory Access Manager (MAM) unit, shown in Fig. A.2, which is similar to the one presented in Chapter 5. As before, memory operations are byte-enabled. The load/store units assert request signals and stall execution if their operation does not complete within a given number of cycles. This implementation relies on single-cycle arbitration logic, and each LMB port has its own arbiter. Both ports receive all signals from all load/store units, and the arbiters make mutually exclusive selections. After granting access to a given FU placed on row n, an arbiter will only grant it access again after all load/store units located on all rows after row n are handled as well, to preserve memory access order. To reduce the complexity of the arbitration logic, each of the memory ports may handle only half of the load/store FUs in the array.
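The following C fragment is a simplified software model of one such arbiter, intended only to illustrate the in-order granting policy; the data structure, the assumption that the load/store FUs are indexed in ascending row order, and the rotation scheme are simplifications that omit the single-cycle timing of the real logic.

#define N_UNITS 8   /* hypothetical number of load/store FUs handled by this port */

typedef struct {
    int row[N_UNITS];   /* row index of each load/store FU, in ascending order */
    int next;           /* first unit allowed to be granted on the next cycle  */
} arbiter_t;

/* req[i] != 0 when FU i asserts its request line; returns the granted FU or -1.
 * Scanning from 'next' means a unit on row n is only granted again after the
 * pending requests of units on later rows have been serviced. */
int arbiter_grant(arbiter_t *a, const int req[N_UNITS])
{
    for (int k = 0; k < N_UNITS; k++) {
        int i = (a->next + k) % N_UNITS;
        if (req[i]) {
            a->next = (i + 1) % N_UNITS;
            return i;
        }
    }
    return -1;  /* no pending requests this cycle */
}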

A.2 Configurable Dual-Port Cache

The dual-port cache used is a custom module which can be configured in terms of line size, block size, and number of blocks. It implements a direct-mapped, write-through, no-allocate policy. The cache has the following features: data already present in the cache can be accessed by one port while another port waits on a block load; reads are answered as early as possible when loading a block (i.e., when the requested datum is fetched, it is sent back to the accelerator before the entire block is read); both ports can read data from the same block (even while the block is being loaded); and data can be written by the accelerator into a block being loaded without loss of coherency. The accelerator can either connect to the local data BRAMs or to the cache, since the cache relies on the same interface signals. To access the Multi-Port Memory Controller (MPMC) used, the cache has two ports, equal to those of the MicroBlaze. One is exclusively for reading complete cache lines, and the other exclusively for single datum writes (i.e., non-burst). The MPMC uses round-robin arbitration by default to manage its multiple master ports, which caused some accelerator-issued accesses to be performed out of order. This is why the cache is write-through and also why the port used for writes is given priority. The cache operates on the same clock domain as the accelerator, so no synchronization logic exists between the two modules. An additional register stage may be introduced (manually), however, if the joint accelerator and cache complexity introduces long critical paths. Introducing this stage increases the memory access latency, which does have an impact on accelerator performance. If the cache operates at a different frequency than the system, the synchronization logic is placed at the interface with the MPMC. There are, however, some issues, which are not necessarily limitations of either the proposed acceleration approach or of the accelerator design: the cache has a considerable resource cost; it was designed to interface specifically with the MPMC present in the utilized Spartan-6 FPGA; since only one cache link is used for reads, only one block can be loaded into the cache at a time; and it is overall too complex relative to the performance it delivers. Keep in mind, however, that the cache was developed simply to provide functional external memory access for the accelerator, for validation purposes.
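The behavioural C model below illustrates only the addressing scheme of such a direct-mapped cache (tag, index and word offset for the 256-byte, 8-word-line configuration used in the evaluation); it is a sketch for clarity, not a description of the actual HDL module, and the write-through, no-allocate write path is only noted in the comments.

#include <stdint.h>
#include <string.h>

#define LINE_WORDS   8                     /* words per block                 */
#define LINE_BYTES   (LINE_WORDS * 4)
#define NUM_LINES    8                     /* 8 lines x 32 bytes = 256 bytes  */

typedef struct {
    uint32_t tag[NUM_LINES];
    int      valid[NUM_LINES];
    uint32_t data[NUM_LINES][LINE_WORDS];
} cache_t;

/* Read one word; on a tag miss the whole block is fetched from 'mem'. Under
 * the write-through, no-allocate policy, writes would go straight to memory
 * and only update a line that already holds the matching tag. */
uint32_t cache_read(cache_t *c, const uint32_t *mem, uint32_t addr)
{
    uint32_t word  = (addr / 4) % LINE_WORDS;
    uint32_t index = (addr / LINE_BYTES) % NUM_LINES;
    uint32_t tag   = addr / (LINE_BYTES * NUM_LINES);

    if (!c->valid[index] || c->tag[index] != tag) {   /* tag miss: load block */
        memcpy(c->data[index], &mem[(addr / LINE_BYTES) * LINE_WORDS], LINE_BYTES);
        c->tag[index]   = tag;
        c->valid[index] = 1;
    }
    return c->data[index][word];
}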

A.3 Experimental Evaluation

Figure A.3 shows the system architecture used to evaluate this accelerator design, and to demonstrate that the migration and acceleration approach functions even under an external data memory scenario. The accelerator retains its LMB-based interfaces, which couple to the dual-port cache. The cache has two Xilinx Cache Link (XCL) interfaces (equal to those of the MicroBlaze) to the multi-ported memory controller in order to access external DDR memory, which holds only data arrays. The MicroBlaze uses local memories for program code, and is configured with a 256-byte write-through data cache, with a line size of 8 words, and no instruction cache (as all instructions reside in local memory). The accelerator's cache is instantiated with the same total and line sizes.

Figure A.3: System architecture for validation of external data memory access by the accelerator

The buffer stage is placed between the cache and the accelerator to remove long critical paths. The system, i.e., the MicroBlaze, operates at 83 MHz for all cases, and the accelerator's operating frequency varies per case. The implementation platform was a Digilent Atlys board, with a Spartan-6 LX45 FPGA and a 128 MB DDR2 memory. The static data to process is placed on the external memory by copying it from on-board non-volatile flash memory at boot. A custom section in the linker script and a section attribute are used when declaring the C arrays which contain this data, so the compiler knows their address. The accelerator's data cache is reset before the accelerator executes, since its contents are most likely invalid due to MicroBlaze execution prior to each accelerator call. Likewise, the MicroBlaze's own data cache is invalidated while the accelerator executes, as part of the CR. This invalidation is only performed if the loop executed on the accelerator contains any store operations. A total of 12 integer kernels were used to validate the correct functioning of the system. The kernels originate from the PowerStone [MMC00], Texas Instruments' IMGLIB [Tex], and WCET [GBEL10] benchmark suites, and from other sources [War02, Jen97]. The dotprod benchmark is a simple synthetic kernel. A large 3D path planning application, griditerate, was also used [CDP+11]. For bitstream generation and synthesis, Xilinx's EDK 14.6 was used. For compilation, version 4.6.4 of mb-gcc was used, along with the -O2 flag and additional flags to enable the barrel-shift, integer multiplication, and comparison operations. The following section discusses general performance aspects, and comments on the effects of memory access contention and loop pipelining on performance. Section A.3.4 presents resource requirements and operating frequencies for the generated accelerators.
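As an illustration of that mechanism, the fragment below shows how an array can be assigned to a dedicated section with mb-gcc; the section name .ddr_data and the corresponding linker-script entry that maps it onto the DDR address range are assumptions made for this example, not the names actually used.

/* Place selected arrays in external DDR via a custom section; the linker
 * script must contain a matching section placed in the DDR memory region. */
#define DDR_DATA __attribute__((section(".ddr_data")))

DDR_DATA static int input_image[256 * 256];   /* resides in external memory  */
static int coeffs[64];                        /* stays in local BRAM memory  */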

A.3.1 General Aspects

Table A.1 shows the characteristics of the extracted Megablocks and the generated accelerators. The third and fourth columns show the average number of instructions in the traces, and the minimum possible Initiation Interval (II) of the respective CDFGs, MinII.

Table A.1: Megablock and generated accelerator characteristics

ID    Benchmark     Avg. #Inst.   Min. II   Max. IPC   # FUs   # Loads   # Stores   # Rows
i1    blit          10.0          1         10.0       18      1         2          4
i2    bobhash       10.0          5         2.0        11      1         0          8
i3    checkbits     63.0          1         31.5       59      1         1          19
i4    dotprod       12.0          2         12.0       10      2         0          5
i5    fft           39.0          1         19.5       48      6         4          10
i6    gouraud       19.0          1         19.0       16      0         1          6
i7    perimeter     22.0          1         22.0       28      5         1          10
i8    poparray1     27.0          1         27.0       22      1         0          18
i9    griditerate   120.0         5         24.0       121     22        11         16
i10   g3fax         5.4           2         2.7        22      2         0          7
i11   edn           15.3          2         7.6        33      4         0          9
i12   fir           9.0           1         9.0        11      2         0          4
mean                29.3          2         15.5       33.3    3.9       1.7        9.7

Completing an iteration in MinII clock cycles leads to the highest number of executed Instructions per Clock Cycle (IPC), Max.IPC, as shown in the fifth column. One Megablock was accelerated for all cases, except for i1, i10, and i11 (2, 4, and 2 Megablocks, respectively). The mean is arithmetic for all cases. Most of the kernels used in this evaluation were also employed in Chapter 5; i4 and i9 are the only exceptions. The latter was included here as it contains a deeply nested loop (four levels of nesting) which was considered to be a good target for aggressive loop pipelining, due to the number of instructions in the resulting Megablock. Since it also contained a large number of memory operations, it was also a useful stress test for the accelerator's execution model, and especially to evaluate the impact of the external memory access latency. Despite the large number of instructions for i9, the average number of instructions per Megablock in this set is approximately 0.70× the average number of instructions for the comparable implementation in Chapter 5 (i.e., the memory-access-capable multi-row accelerator). The traces have fewer operations, so there is less parallelism to exploit per iteration. But since loop pipelining is enabled, additional inter-iteration parallelism is also exploited. That is, it might not pay off to migrate small loops to the accelerator when exploiting intra-iteration ILP alone, which is addressed by also resorting to loop pipelining. For example, consider that the benchmarks used here are a subset of the cases evaluated in Chapter 5 (except for i4 and i9). For these cases, the maximum IPC resulting from intra-iteration parallelism alone is, on average, only 2.03. Considering both intra- and inter-iteration parallelism, the IPC becomes 14.8. However, loop pipelining also means that more memory operations are issued per cycle, meaning more contention, despite the fact that the Megablocks in this evaluation have fewer memory operations per iteration than the entire set in Chapter 5. As a consequence of memory access contention and latency, the IPC actually achieved on the accelerator, IPCHW, is lower than the ideal Max.IPC. For execution on the accelerator to achieve this maximum, it would have to be capable of executing all memory operations in an iteration within MinII clock cycles (assuming a scenario where the cache always responds in one clock cycle). That is, the accelerator would need to be able to execute an average of 2.3 memory operations in parallel, each with a latency of one clock cycle. The deviation from this average is high, however, with i5, i7 and i9 requiring 5.0, 6.0 and 6.6 parallel accesses per cycle, respectively, to achieve execution at the minimum II.
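In other words, following the definitions above and taking i3 from Table A.1 as an example:

\[
  \mathit{Max.IPC} \;=\; \frac{\#\mathit{Inst}}{\mathit{MinII}},
  \qquad \text{e.g., for i3:}\quad \frac{63.0}{1} = 31.5 .
\]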

Table A.2: Performance metrics and speedups for the tested benchmarks

ID     IISW     IPCSW   IIHW     IPCHW   SKernel   SOverall   SNoOverhead   SUpperBound
i1     15.0     0.7     7.7      1.3     1.94      1.91       1.93          8.39
i2     15.8     0.6     7.3      1.4     2.14      2.10       2.11          2.04
i3     68.0     0.9     8.2      7.7     8.06      7.58       7.83          18.31
i4     16.6     0.7     9.1      1.3     1.76      1.65       1.72          5.30
i5     121.2    0.3     100.1    0.4     1.19      0.83       1.04          0.95
i6     21.0     0.9     3.1      6.1     6.65      6.39       6.45          14.12
i7     27.5     0.8     14.2     1.7     1.91      1.87       1.91          12.64
i8     35.7     0.7     6.0      4.5     5.21      3.86       4.26          7.52
i9     278.9    0.4     245.7    0.5     1.16      1.15       1.24          7.94
i10    10.0     0.5     7.1      0.8     1.02      0.93       1.13          1.24
i11    23.3     0.6     16.5     0.9     1.22      1.17       1.35          2.45
i12    17.8     0.5     17.6     0.5     0.93      0.96       1.03          2.21
mean   54.0     0.7     37.0     2.2     2.07      1.91       2.06          4.69

A.3.2 Performance

Table A.2 compares the performance of software and accelerator-enabled execution. The first two columns show how many clock cycles are required to complete an iteration of the accelerated Megablock in software (IISW), and the resulting instructions executed per clock cycle (IPCSW), which are computed given the number of accelerated instructions and the II. The next two columns show the same values for accelerator-enabled execution. The last four columns show the kernel speedup (SKernel), i.e., the speedup of the accelerated portion, the speedup of the overall application (SOverall), the speedup which would be achieved if the CR overhead were eliminated (SNoOverhead), and finally a speedup upper bound (SUpperBound), computed considering a scenario where the memory access latency was minimal (i.e., one clock cycle) and the memory bandwidth was unlimited. That is, this upper bound is the potential speedup if iterations could be completed at a rate of one per MinII clock cycles (shown in the fourth column of Table A.1). The mean is geometric for speedups and arithmetic for all other columns.

A.3.2.1 Instructions per Clock Cycle and the Effects of Memory Latency

In the experimental evaluations in the main document, the equivalent IPCSW (i.e., the number of instructions the MicroBlaze executes per cycle) was computed by taking into account the number of instructions in the accelerated trace and the latency of each instruction. But since the data now reside in external memory, the IPCSW for this evaluation was computed using the measured number of clock cycles required to execute all Megablock iterations, the number of executed iterations, and the number of instructions in the Megablock. The IPCHW is computed in the same manner. Note that although the accelerator operates at a different clock frequency than the MicroBlaze for some cases, all IPC values are computed using clock cycle counts of the system clock. The accelerator achieves a higher IPC for all cases, although for some, like i12, the increase is minimal. The IPCHW value is lower than the ideal IPC almost exclusively due to the memory access latency, but it also decreases (very minimally) because it was computed taking into account the number of processor-equivalent cycles during which the accelerator executes (which also includes a small overhead and synchronization cycles). The Megablocks which contain the smallest number of memory operations are those for which the achieved IPC is closest to the maximum. This happens for i2, i3, and i6, for which the average accelerator IPC is 0.41× the respective average ideal IPC. For the entire set, the achieved IPC is, on average, 0.18× the optimal. Although i12 also contains few memory operations, it suffers from cache latency due to the access pattern of its operations, which constantly forces the cache to discard freshly loaded blocks. As a result, the resulting IPC is very low relative to the optimal, and there is no gain relative to software-only execution. From the measurements it is possible to compute how many clock cycles are added to the execution of a single iteration, relative to a scenario with minimum memory latency. This was computed for both the MicroBlaze and the accelerator. For the MicroBlaze, accessing data residing on external memory adds an average of 24 clock cycles relative to a local memory scenario, where the average number of clock cycles required to complete an iteration would be 30 (this was estimated considering the number of instructions in each trace and their latencies). For the accelerator, 35 clock cycles are added per trace iteration due to memory access, over the ideal MinII, whose average is 2, as shown in Table A.1. Memory accesses performed by the accelerator are more costly since: (i) even though there are two ports between the cache and the accelerator, there is only one port through which data is fetched into the cache from external memory; (ii) the 2-clock-cycle cache latency makes memory accesses more costly for the accelerator. However, the accelerator does not necessarily suffer from twice the memory access latency, because (i) the dual-port cache allows for concurrent accesses, and (ii) the accelerator operates at higher clock frequencies relative to the system clock for all cases. The purpose of exploiting data parallelism via two memory ports was also to mitigate the access latency. This evaluation also counted how many accesses were performed per accelerator port, since the handling of load and store FUs relies on a runtime arbitration heuristic. Considering all accelerated Megablocks, one port performs 502 accesses per accelerator call, and the other 329 accesses per call. This demonstrates that considerable data access parallelism was exploited.
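Stated as a formula, with the symbols N_inst, N_iter and C_sys introduced here only for this explanation (instructions per Megablock iteration, iterations executed, and measured system-clock cycles, respectively), the same expression is applied to both the software and the accelerator measurements:

\[
  \mathit{IPC} \;=\; \frac{N_{\mathit{inst}} \times N_{\mathit{iter}}}{C_{\mathit{sys}}}
\]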

A.3.2.2 Speedups

Despite the penalty of external memory access, the geometric mean speedup on the accelerator is 1.91×. However, for 3 cases the use of the accelerator resulted in a slowdown. For i5, this is due to the low number of iterations performed for some of the accelerator calls. The accelerated Megablock for this case also contains the largest number of memory operations (apart from i9). But the most influential factor for the slowdown in this case is the number of accelerator calls that take place without any useful work being completed, i.e., without a single iteration being performed on the accelerator. For i5, 127 accelerator calls out of a total of 128 terminate without having executed a single iteration. For i10 and i12, the number of instructions in the accelerated traces is small, and the number of iterations per call is only 57 and 32, respectively. For these cases, the overhead has a large impact on the total migration time, negating any small parallelism gains.

There are two factors which influence the application speedup, SOverall, shown in Table A.2: the achieved IPC, due to parallelism exploitation and the accelerator execution model, and the fact that the accelerator operates at higher clock frequencies (the average being 128 MHz). To disambiguate the effect of these two aspects, we can compute the time the accelerator would require to execute the Megablocks if it operated at the system frequency, 83 MHz. For this scenario, the geometric mean speedup is 1.13×, with slowdowns occurring for 5 benchmarks (i5 and i9 to i12). The minimum latency of two clock cycles for cache access now becomes very detrimental, as it is no longer mitigated by a higher clock frequency. The accelerator's two parallel memory ports effectively operate as a single port with a latency of one clock cycle. The benchmarks used here are a subset of the cases evaluated in Chapter 5 (except for i4 and i9). The implementation in Chapter 5 relied on local memories for the data, and the accelerator did not exploit loop pipelining. As a result, the geometric mean speedup was 1.62×, versus the 2.03× achieved in this implementation, despite the external memory access latency.

Finally, the geometric mean for SOverall is only 41 % of the geometric mean of SUpperBound. Using instead the values of SNoOverhead (which discard the CR overhead), the geometric mean is 53 % of the geometric mean of the upper bound. Since SNoOverhead discards the CR overhead relative to SOverall, and SUpperBound is the estimation of the speedup without memory access latency but with CR overhead, this shows that it is the limited memory bandwidth and the high access latency that greatly decrease performance with this accelerator.

A.3.3 Communication and Cache Invalidation Overhead

The overhead introduced for this implementation includes the transfer of operands and results between the accelerator and the MicroBlaze during Communication Routine (CR) execution, and the additional overhead due to the invalidation of the MicroBlaze's cache during accelerator execution. For all benchmarks, the overhead represents 6.02 % of the migration time (i.e., CR and accelerator execution, plus a final iteration in software). The cache is invalidated by a small loop which is part of the CR. While the accelerator's cache can be invalidated in a single cycle, the MicroBlaze only allows for selective invalidation of a specific cache line. To invalidate the 256-byte cache used in this implementation, a total of 192 clock cycles are required. This is considerable given that a CR without this cache invalidation step contains an average of 22 instructions and requires approximately the same number of cycles to execute (since the instructions reside in local memory and nearly all operations have a single-cycle latency). However, since the cache is invalidated during accelerator execution, the invalidation time is hidden if the accelerator executes enough Megablock iterations. This is not the case for one of the Megablocks implemented for i10, for instance, where an average of 2.2 iterations executed per call means that the accelerator finishes executing before the MicroBlaze's data cache has been completely invalidated.

Figure A.4: Resource requirements and synthesis frequency of the generated accelerators with a multiple-row architecture capable of loop pipelining

A.3.4 Resource Requirements and Operating Frequency

Figure A.4 shows the resource requirements of the generated accelerators, as well as the synthesis frequency. For these benchmarks, a single MicroBlaze requires 1615 LUTs and 1213 FFs. This means the average accelerator requires 2.21× and 4.35× the LUTs and FFs of a MicroBlaze, respectively. Ten out of the 12 benchmarks used in this implementation were also used in the evaluation in Chapter 5. For this subset, the average number of required LUTs is very similar, but the number of required FFs now increases by 2.21×. This increase can be attributed to the additional per-row control logic, and especially to the need to register passthrough units to synchronize data for loop pipelining. The entire system requires an average of 8696 LUTs and 9555 FFs, meaning that the average accelerator represents 37 % of the LUTs and 51 % of the FFs. The average synthesis frequency of the accelerator is 128 MHz, and it always operates at a frequency higher than the 83 MHz baseline. This evaluation attempted to increase the operating frequency of the system (and therefore the MicroBlaze), to achieve a more competitive baseline. However, given the size of the device and all the peripherals needed (i.e., the occupied area of the device, accelerator included), this was not possible. The average operating frequency of the accelerator was 110 MHz, which was made possible by adding the additional register stage between the accelerator and the dual-port cache. Otherwise, the joint complexity would result in longer critical paths, which in some instances lowered the accelerator's operating frequency below that of the baseline. The lowest synthesis and operating frequencies occur for i8, due to the control logic which dynamically enables stages based on inter-row data dependencies. Nearly all critical paths are related to the signals between the MAM and the stage control modules. Finally, regarding the cache, it requires 1595 LUTs and 1103 FFs when set for 256 bytes and a block size of 8 words, and its synthesis frequency is 139 MHz. The frequency is very similar for all sizes up to 4096 bytes, and for all block sizes up to 16 words. The only observable decrease, to 75 MHz, is for a block size of 2 words and a total size of 4096 bytes, due to the larger tag memory and number of comparators required. The resulting resource requirements are also the highest for all combinations of cache parameter values: 15536 LUTs and 9061 FFs. The cache parameters that ensure optimal performance largely depend on the pattern of memory accesses performed by the accelerator. Although this is an analysis that could be performed based on the simulation step that extracts the Megablocks, it is not within the scope of this work. However, the resource requirements of even the smaller cache instances are still considerable relative to those of the average accelerator instance. As a whole, the accelerator and cache combined require on average 3.1× the LUTs and 5.3× the FFs of a MicroBlaze containing its own 256-byte data cache.

A.4 Concluding Remarks

This appendix presented a brief proof-of-concept for an accelerator-augmented system capable of external memory access. It was never expected that this implementation would outperform the cache-enabled MicroBlaze; the objective was to demonstrate that the Megablock translation and execution migration mechanisms are also applicable in this scenario. To allow for the data memory to be shared between the MicroBlaze and the accelerator, each of these modules resorts to its own data cache. The MicroBlaze's data cache is instantiated conditionally based on a configurable parameter, and the accelerator cache is a custom-designed module. The accelerator's capability to perform two concurrent memory accesses is preserved, since the cache is also dual-ported. This system architecture introduces the need to occasionally invalidate either the MicroBlaze's or the accelerator's data cache, depending on which module has executed and which has expired data. Additionally, the multi-row accelerator design is capable of exploiting loop pipelining by simultaneously enabling multiple rows. However, the evaluation found that the memory access latency, and especially the access contention, have a great impact on performance. This is because the execution model of the accelerator is a direct translation of CDFGs. For CDFGs without memory operations, this translation method works efficiently: each type of FU is self-contained, with a low (usually one clock cycle) and constant latency. Supporting backward inter-row connectivity allows for exploiting loop pipelining by implementing the recursion in the CDFGs, but the translation process does not take into account that the memory ports are a limited resource, both in terms of the number of ports and in terms of the number of accesses per cycle. As a result, the execution stalls frequently while memory accesses are handled. The fact that iterations are overlapped by loop pipelining increases the number of memory accesses per cycle. This causes an under-utilization of the instantiated resources due to the stall time while accessing the cache. Also, the cache architecture used in this evaluation does not benefit from multi-way associativity or pre-fetching. The memory access pattern of the accelerated traces would need to be analysed to implement any kind of optimization pertaining to this point, but this falls out of the scope of this work.

As a result, accelerator-enabled execution achieves a geometric mean speedup of 1.91× over MicroBlaze-only execution, whereas the potential upper bound possible with loop pipelining is 4.69×. Additionally, the average accelerator is costly in terms of resources; the average number of required LUTs and FFs is 2.11× and 4.35× the number required by a MicroBlaze, which is a considerable cost, and also higher than the average for the non-pipelined multi-row architecture. It was this insight regarding the impact of memory access latency, and especially the contention which occurs when loop pipelining is exploited, that guided the accelerator design towards the architecture presented in Chapters 6 and 7, which implements loop pipelining more efficiently and requires fewer resources.

References

[AD11] José Carlos Alves and Pedro C. Diniz. Custom FPGA-Based Micro-Architecture for Streaming Computing. In 2011 VII Southern Conference on Programmable Logic (SPL), pages 51–56, April 2011.

[AKPW83] J. R. Allen, Ken Kennedy, Carrie Porterfield, and Joe Warren. Conversion of Control Dependence to Data Dependence. In Proceedings of the 10th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (POPL), pages 177–189, New York, NY, USA, 1983. ACM.

[AMD] AMD. APU 101: All about AMD Fusion Accelerated Processing Units. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/apu101.pdf. Accessed on 4th January 2016.

[AMD13] Mythri Alle, Antoine Morvan, and Steven Derrien. Runtime Dependency Analysis for Loop Pipelining in High-Level Synthesis. In Proceedings of the 50th Annual Design Automation Conference (DAC), pages 51:1–51:10, New York, NY, USA, 2013. ACM.

[APTD11] Giovanni Ansaloni, Laura Pozzi, Kazuyuki Tanimura, and Nikil Dutt. Slack-aware scheduling on Coarse Grained Reconfigurable Arrays. In Design, Automation Test in Europe Conference Exhibition (DATE), pages 1–4, March 2011.

[ATPD12] Giovanni Ansaloni, Kazuyuki Tanimura, Laura Pozzi, and Nikil Dutt. Integrated Kernel Partitioning and Scheduling for Coarse-Grained Reconfigurable Arrays. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 31(12):1803–1816, December 2012.

[BC10] João Bispo and João M.P. Cardoso. On Identifying Segments of Traces for Dynamic Compilation. In 2010 International Conference on Field Programmable Logic and Applications (FPL), pages 263–266, August 2010.

[BDM+72] W.J. Bouknight, S.A. Denenberg, D.E. McIntyre, J.M. Randall, A.H. Sameh, and D.L. Slotnick. The Illiac IV system. Proceedings of the IEEE, 60(4):369–388, April 1972.

[BFD08] Etienne Bergeron, Marc Feeley, and Jean Pierre David. Hardware JIT compilation for off-the-shelf dynamically reconfigurable FPGAs. In 17th International Conference on Compiler Construction, Held as Part of the Joint European Conferences on Theory and Practice of Software, pages 178–192, Berlin, Heidelberg, 2008. Springer-Verlag.


[BFM+07] M. Boden, T. Fiebig, T. Meissner, S. Rulke, and J. Becker. High-Level Synthesis of HW Tasks Targeting Run-Time Reconfigurable FPGAs. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1–8, March 2007.

[BGDN03] Nikhil Bansal, Sumit Gupta, Nikil Dutt, and Alexandru Nicolau. Analysis of the Performance of Coarse-Grain Reconfigurable Architectures with Different Processing Element Configurations. In Workshop on Application Specific Processors, held in conjunction with the International Symposium on Microarchitecture (MICRO), 2003.

[Bis12] João Carlos Viegas Martins Bispo. Mapping Runtime-Detected Loops from Microprocessors to Reconfigurable Processing Units. PhD thesis, Instituto Superior Técnico, 2012.

[Bis15] João Bispo. Megablock Extractor for MicroBlaze v0.7.14. https://sites.google.com/site/specsfeup/, February 2015. Accessed 21st December 2015.

[BKT12] Christian Beckhoff, Dirk Koch, and Jim Torresen. Go Ahead: A partial reconfiguration framework. In IEEE 20th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 37–44, 2012.

[BPCF11] João Bispo, Nuno Paulino, João M. P. Cardoso, and João Canas Ferreira. From Instruction Traces to Specialized Reconfigurable Arrays. In Peter M. Athanas, Jürgen Becker, and René Cumplido, editors, ReConFig, pages 386–391. IEEE Computer Society, 2011.

[BPCF12] João Bispo, Nuno Paulino, João M.P. Cardoso, and João Canas Ferreira. Generation of Coarse-Grained Reconfigurable Processing Units for Binary Acceleration. VII Jornadas sobre Sistemas Reconfiguráveis, pages 11–19, February 2012.

[BPCF13] João Bispo, Nuno Paulino, João M. P. Cardoso, and João Canas Ferreira. Transparent Runtime Migration of Loop-Based Traces of Processor Instructions to Reconfigurable Processing Units. International Journal of Reconfigurable Computing, 2013.

[BPFC13] João Bispo, Nuno Paulino, João Canas Ferreira, and João M. P. Cardoso. Transparent Trace-Based Binary Acceleration for Reconfigurable HW/SW Systems. IEEE Transactions on Industrial Informatics, 9(3):1625–1634, August 2013.

[BRGC08] A.C.S. Beck, M.B. Rutzig, G. Gaydadjiev, and L. Carro. Transparent Reconfigurable Acceleration for Heterogeneous Embedded Applications. In Design, Automation and Test in Europe (DATE), pages 1208–1213, March 2008.

[But07] M. Butts. Synchronization Through Communication in a Processor Array. IEEE Micro, 27(5):32–40, September 2007.

[Cal] Calypto. Catapult Overview. http://calypto.com/en/products/catapult/overview/. Accessed on 29th December 2015.

[CBC+05] N. Clark, J. Blome, M. Chu, S. Mahlke, S. Biles, and K. Flautner. An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors. In Proceedings of the 32nd International Symposium on Computer Architecture (ISCA), pages 272–283, June 2005.

[CDP+11] João M. P. Cardoso, Pedro C. Diniz, Zlatko Petrov, Koen Bertels, Michael Hübner, Hans Someren, Fernando Gonçalves, José Gabriel F. Coutinho, George A. Constantinides, Bryan Olivier, Wayne Luk, Juergen Becker, Georgi Kuzmanov, Florian Thoma, Lars Braun, Matthias Kühnle, Razvan Nane, Vlad Mihai Sima, Kamil Krátký, José Carlos Alves, and João Canas Ferreira. Reconfigurable Computing: From FPGAs to Hardware/Software Codesign. Springer, New York, NY, 2011.

[CFF+99] D.C. Cronquist, C. Fisher, M. Figueroa, P. Franklin, and C. Ebeling. Architecture Design of Reconfigurable Pipelined Datapaths. In Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI, pages 23–40, 1999.

[CH00] Katherine Compton and Scott Hauck. An Introduction to Reconfigurable Computing. IEEE Computer, 2000.

[CH02] Katherine Compton and Scott Hauck. Reconfigurable Computing: A Survey of Systems and Software. ACM Computing Surveys (CSUR), 34(2):171–210, June 2002.

[CHM08] N. Clark, A. Hormati, and S. Mahlke. VEAL: Virtualized Execution Accelerator for Loops. In 35th International Symposium on Computer Architecture (ISCA), pages 389–400, June 2008.

[Cho11] Kiyoung Choi. Coarse-Grained Reconfigurable Array: Architecture and Application Mapping. IPSJ Transactions on System LSI Design Methodology, 4:31–46, 2011.

[CKP+04] N. Clark, M. Kudlur, Hyunchul Park, S. Mahlke, and K. Flautner. Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization. In 37th International Symposium on Microarchitecture (MICRO), pages 30–40, Washington, DC, USA, December 2004. IEEE Computer Society.

[CM14] Liang Chen and Tulika Mitra. Graph Minor Approach for Application Mapping on CGRAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 7(3):21:1–21:25, September 2014.

[CZS+08] C. Claus, B. Zhang, W. Stechele, L. Braun, M. Hubner, and J. Becker. A multi-platform controller allowing for maximum Dynamic Partial Reconfiguration throughput. In International Conference on Field Programmable Logic and Applications (FPL), pages 535–538, 2008.

[DGG05] G. Dimitroulakos, M.D. Galanis, and C.E. Goutis. Alleviating the Data Memory Bandwidth Bottleneck in Coarse-Grained Reconfigurable Arrays. In 16th IEEE International Conference on Application-Specific Systems, Architecture Processors (ASAP), pages 161–168, July 2005.

[Dun13] R. Dunkley. Supporting a Wide Variety of Communication Protocols Using Dynamic Partial Reconfiguration. IEEE Instrumentation Measurement Magazine, 16(4):26–32, August 2013.

[EBIH12] A. Ebrahim, K. Benkrid, X. Iturbe, and C. Hong. A novel high-performance fault-tolerant ICAP controller. In NASA/ESA Conference on Adaptive Hardware and Systems (AHS), pages 259–263, June 2012.

[EBSA+12] Hadi Esmaeilzadeh, Emily Blem, Renée St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Power Limitations and Dark Silicon Challenge the Future of Multicore. ACM Transactions on Computer Systems (TOCS), 30(3):11:1–11:27, August 2012.

[ECF96] Carl Ebeling, Darren C. Cronquist, and Paul Franklin. RaPiD - Reconfigurable Pipelined Datapath. In Proceedings of the 6th International Workshop on Field-Programmable Logic, Smart Applications, New Paradigms and Compilers (FPL), pages 126–135, London, UK, UK, 1996. Springer-Verlag.

[EEM15] EEMBC - The Embedded Microprocessor Benchmark Consortium. CoreMark-Pro. http://www.eembc.org/coremark/index.php?b=pro, 2015. Accessed 1 January 2016.

[ESSA00] J. Emmert, C. Stroud, B. Skaggs, and M. Abramovici. Dynamic Fault Tolerance in FPGAs via Partial Reconfiguration. In 2000 IEEE Symposium on Field-Programmable Custom Computing Machines, pages 165–174, 2000.

[Est02] G. Estrin. Reconfigurable computer origins: the UCLA fixed-plus-variable (F+V) structure computer. IEEE Annals of the History of Computing, 24(4):3–9, October 2002.

[FDP+14] R. Ferreira, W. Denver, M. Pereira, J. Quadros, L. Carro, and S. Wong. A Run-Time Modulo Scheduling by using a Binary Translation Mechanism. In 2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), pages 75–82, July 2014.

[FFY05] Joseph A. Fisher, Paolo Faraboschi, and Cliff Young. Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.

[FISS12] M. Feilen, M. Ihmig, C. Schwarzbauer, and W. Stechele. Efficient DVB-T2 decoding accelerator design by time-multiplexing FPGA resources. In 22nd International Conference on Field Programmable Logic and Applications (FPL), pages 75–82, August 2012.

[FPKM08] Kevin Fan, Hyunchul Park, Manjunath Kudlur, and Scott Mahlke. Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability. In Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 124–133, New York, NY, USA, 2008. ACM.

[GAA+07] V. Groza, R. Abielmona, M.H. Assaf, M. Elbadri, M. El-Kadri, and A. Khalaf. A Self-Reconfigurable Platform for Built-In Self-Test Applications. IEEE Transactions on Instrumentation and Measurement, 56(4):1307–1315, August 2007.

[GBEL10] Jan Gustafsson, Adam Betts, Andreas Ermedahl, and Björn Lisper. The Mälardalen WCET Benchmarks - Past, Present and Future. In Proceedings of the 10th International Workshop on Worst-Case Execution Time Analysis, July 2010.

[GC08] H. Gu and S. Chen. Partial Reconfiguration Bitstream Compression for Virtex FPGAs. In Congress on Image and Signal Processing (CISP), volume 5, pages 183–185, May 2008.

[GRV05] Ann Gordon-Ross and Frank Vahid. Frequent Loop Detection Using Efficient Non-intrusive On-Chip Hardware. IEEE Transactions on Computers, 54(10):1203–1215, October 2005.

[GSB+99] Seth Copen Goldstein, Herman Schmit, Mihai Budiu, Srihari Cadambi, Matt Moe, and R. Reed Taylor. PipeRench: A Coprocessor for Streaming Multimedia Acceleration. In Proceedings of the 26th International Symposium on Computer Architecture, pages 28–39, 1999.

[Har01] Reiner Hartenstein. Coarse Grain Reconfigurable Architecture (embedded tutorial). In Proceedings of the 2001 Asia and South Pacific Design Automation Conference (ASP-DAC), pages 564–570, 2001.

[HFHK04] Scott Hauck, Thomas W. Fry, Matthew M. Hosler, and Jeffrey P. Kao. The Chimaera Reconfigurable Functional Unit. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 12(2):206–217, February 2004.

[HGNB10] M. Hubner, D. Gohringer, J. Noguera, and J. Becker. Fast dynamic and partial reconfiguration data path with low hardware overhead on Xilinx FPGAs. In IEEE International Symposium on Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), pages 1–8, 2010.

[HKT11] S. G. Hansen, D. Koch, and J. Torresen. High Speed Partial Run-Time Reconfiguration Using Enhanced ICAP Hard Macro. In 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), pages 174–180, May 2011.

[HW97] John R. Hauser and John Wawrzynek. Garp: a MIPS Processor with a Reconfigurable Coprocessor. In Proceedings of the 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 12–21, 1997.

[Ima12] Nethra Imaging. Am2045 product overview. http://www.ambric.com/products_am2045_overview.php, 2012.

[Ins16] Texas Instruments. Fusion Digital Power Designer. http://www.ti.com/tool/fusion_digital_power_designer, February 2016. Accessed 14 April 2016.

[Int96] Intel. Using MMX Instructions for Procedural Texture Mapping - Based on Perlin's Noise Function. https://software.intel.com/sites/landingpage/legacy/mmx/MMX_App_Procedural_Texturing.pdf, 1996. Accessed on 10th March 2016.

[Jen97] Bob Jenkins. A Hash Function for Hash Table Lookup. http://www.burtleburtle.net/bob/hash/doobs.html, December 1997. Accessed on 10th March 2016.

[KHC11] Yangsu Kim, Kyuseung Han, and Kiyoung Choi. A Host-Accelerator Communication Architecture Design for Efficient Binary Acceleration. In 2011 International SoC Design Conference (ISOCC), pages 361–364, November 2011.

[KLMP12] Yongjoo Kim, Jongeun Lee, Toan X. Mai, and Yunheung Paek. Improving Performance of Nested Loops on Reconfigurable Array Processors. ACM Transactions on Architecture and Code Optimization (TACO), 8(4):32:1–32:23, January 2012.

[KLSP11] Yongjoo Kim, Jongeun Lee, Aviral Shrivastava, and Yunheung Paek. Memory Access Optimization in Compilation for Coarse-Grained Reconfigurable Architectures. ACM Transactions on Design Automation of Electronic Systems (TODAES), 16(4):42:1–42:27, October 2011.

[KTB+12] D. Koch, J. Torresen, C. Beckhoff, D. Ziener, C. Dennl, V. Breuer, J. Teich, M. Feilen, and W. Stechele. Partial Reconfiguration on FPGAs in Practice - Tools and Applications. In ARCS Workshops (ARCS), pages 1–12, February 2012.

[LCDW15] Mingjie Lin, Shaoyi Chen, Ronald F. DeMara, and John Wawrzynek. ASTRO: Synthesizing application-specific reconfigurable hardware traces to exploit memory-level parallelism. Microprocessors and Microsystems, 39(7):553–564, 2015.

[LEP12] A. Lifa, P. Eles, and Z. Peng. Minimization of average execution time based on speculative FPGA configuration prefetch. In International Conference on Reconfigurable Computing and FPGAs (ReConFig), pages 1–8, December 2012.

[LFy09] Wang Lie and Wu Feng-yan. Dynamic partial reconfiguration on cognitive radio platform. In IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS), volume 4, pages 381–384, November 2009.

[Liu08] Dake Liu. Chapter 3 - DSP Architectures. In Dake Liu, editor, Embedded DSP Processor Design, volume 2 of Systems on Silicon, pages 87–158. Morgan Kaufmann, Burlington, 2008.

[LP13] Daniel Llamocca and Marios Pattichis. A Dynamically Reconfigurable Pixel Processor System Based on Power/Energy-Performance-Accuracy Optimization. IEEE Transactions on Circuits and Systems for Video Technology, 23(3):488–502, March 2013.

[LPV10] Daniel Llamocca, Marios Pattichis, and G. Alonzo Vera. Partial Reconfigurable FIR Filtering System Using Distributed Arithmetic. International Journal of Reconfigurable Computing, 2010:4:1–4:14, February 2010.

[LV04] Roman Lysecky and Frank Vahid. A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), volume 1, pages 480–485, February 2004.

[LV09] Roman Lysecky and Frank Vahid. Design and Implementation of a MicroBlaze-Based Warp Processor. ACM Transactions on Embedded Computing Systems (TECS), 8(3):22:1–22:22, April 2009.

[MGZ+07] Arash Mehdizadeh, Behnam Ghavami, Morteza Saheb Zamani, Hossein Pedram, and Farhad Mehdipour. An Efficient Heterogeneous Reconfigurable Functional Unit for an Adaptive Dynamic Extensible Processor. In International Conference on Very Large Scale Integration (VLSI-SoC), pages 151–156, October 2007.

[MLM+05] Bingfeng Mei, A. Lambrechts, J.Y. Mignolet, D. Verkest, and R. Lauwereins. Architecture Exploration for a Reconfigurable Architecture Template. IEEE Design & Test of Computers, 22(2):90–101, March 2005.

[MMC00] A. Malik, B. Moyer, and D. Cermak. A Lower Power Unified Cache Architecture Providing Power and Performance Flexibility. In Proceedings of the 2000 International Symposium on Low Power Electronics and Design (ISLPED), pages 241–243, June 2000.

[MO98] Takashi Miyamori and Kunle Olukotun. REMARC: Reconfigurable Multimedia Array Coprocessor. In IEICE Transactions on Information and Systems E82-D, pages 389–397, 1998.

[MS09] G. Martin and G. Smith. High-Level Synthesis: Past, Present, and Future. IEEE Design & Test of Computers, 26(4):18–25, July 2009.

[MSB+07] Gayatri Mehta, Justin Stander, Mustafa Baz, Brady Hunsaker, and A.K. Jones. Interconnect Customization for a Coarse-grained Reconfigurable Fabric. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1–8, 2007.

[MV15] Sparsh Mittal and Jeffrey S. Vetter. A Survey of CPU-GPU Heterogeneous Computing Techniques. ACM Computing Surveys (CSUR), 47(4):69:1–69:35, July 2015.

[Nag01] Ulrich Nageldinger. Coarse-Grained Reconfigurable Architecture Design Space Exploration. PhD thesis, Universität Kaiserslautern, Gottlieb-Daimler-Strasse, 67663 Kaiserslautern, Germany, 2001.

[Nat11] National Research Council. The Future of Computing Performance: Game Over or Next Level? The National Academies Press, Washington, D.C., USA, 2011.

[NMIM12] Hamid Noori, Farhad Mehdipour, Koji Inoue, and Kazuaki Murakami. Improving performance and energy efficiency of embedded processors via post-fabrication instruction set customization. The Journal of Supercomputing, 60(2):196–222, May 2012.

[NMM+06] H. Noori, F. Mehdipour, K. Murakami, K. Inoue, and M. SahebZamani. A Reconfigurable Functional Unit for an Adaptive Dynamic Extensible Processor. In International Conference on Field Programmable Logic and Applications (FPL), pages 1–4, August 2006.

[NMM+08] Hamid Noori, Farhad Mehdipour, Kazuaki Murakami, Koji Inoue, and Morteza Saheb Zamani. An Architecture Framework for an Adaptive Extensible Processor. The Journal of Supercomputing, 45(3):313–340, September 2008.

[OEPM09] Taewook Oh, Bernhard Egger, Hyunchul Park, and Scott Mahlke. Recurrence Cycle Aware Modulo Scheduling for Coarse-grained Reconfigurable Architectures. In Proceedings of the 2009 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 21–30, 2009.

[ORK+15] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric S. Chung. Accelerating Deep Convolutional Neural Networks Using Specialized Hardware, February 2015.

[Pau11] Nuno Paulino. Generation of Reconfigurable Circuits from Machine Code. Master's thesis, Universidade do Porto - Faculdade de Engenharia, Rua Dr. Roberto Frias, s/n 4200-465 Porto, Portugal, 2011.

[PCC+14] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J. Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. In ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), pages 13–24, June 2014.

[PCF15] Nuno Paulino, João M.P. Cardoso, and João Canas Ferreira. Transparent Binary Acceleration via Automatically Generated Reconfigurable Processing Units. XI Jornadas sobre Sistemas Reconfiguráveis, February 2015.

[PDH11] Kyprianos Papadimitriou, Apostolos Dollas, and Scott Hauck. Performance of Partial Reconfiguration in FPGA Systems: A Survey and a Cost Model. ACM Transactions on Reconfigurable Technology and Systems, 4(4):36:1–36:24, December 2011.

[PFBC15] Nuno Paulino, João Canas Ferreira, João Bispo, and João M. P. Cardoso. Transparent Acceleration of Program Execution Using Reconfigurable Hardware. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1066–1071, San Jose, CA, USA, 2015. EDA Consortium.

[PFC13] Nuno Paulino, João Canas Ferreira, and João M. P. Cardoso. Architecture for Transparent Binary Acceleration of Loops with Memory Accesses. In Proceedings of the 9th International Conference on Reconfigurable Computing: Architectures, Tools, and Applications (ARC), pages 122–133, Berlin, Heidelberg, 2013. Springer-Verlag.

[PFC14a] Nuno Paulino, João Canas Ferreira, and João M. P. Cardoso. A Reconfigurable Architecture for Binary Acceleration of Loops with Memory Accesses. ACM Transactions on Reconfigurable Technology and Systems, 7(4):29:1–29:20, December 2014.

[PFC14b] Nuno Paulino, João Canas Ferreira, and João M. P. Cardoso. Trace-Based Reconfigurable Acceleration with Data Cache and External Memory Support. In IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), pages 158–165, August 2014.

[PFM+08] Hyunchul Park, Kevin Fan, Scott A. Mahlke, Taewook Oh, Heeseok Kim, and Hong-seok Kim. Edge-centric Modulo Scheduling for Coarse-grained Reconfigurable Architectures. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 166–176, New York, NY, USA, 2008. ACM.

[PPM09] Hyunchul Park, Yongjun Park, and Scott Mahlke. Polymorphic Pipeline Array: A Flexible Multicore Accelerator with Virtualized Execution for Mobile Multimedia Applications. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2009.

[QMM11] X. Qin, C. Murthy, and P. Mishra. Decoding-aware compression of FPGA bitstreams. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 19(3):411–419, 2011.

[Rau94] B. Ramakrishna Rau. Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO), pages 63–74, New York, NY, USA, 1994. ACM.

[RBM+11] Mateus B. Rutzig, Antonio C. S. Beck, Felipe Madruga, Marco A. Alves, Henrique C. Freitas, Nicolas Maillard, Philippe O. A. Navaux, and Luigi Carro. Boosting Parallel Applications Performance on Applying DIM Technique in a Multiprocessing Environment. International Journal of Reconfigurable Computing, 2011:4:1–4:13, January 2011.

[RSS08] F. Redaelli, M. D. Santambrogio, and D. Sciuto. Task Scheduling with Configuration Prefetching and Anti-Fragmentation Techniques on Dynamically Reconfigurable Systems. In Design, Automation and Test in Europe (DATE), pages 519–522, March 2008.

[SABW12] R.A.E. Seedorf, F. Anjam, A.A.C. Brandon, and S. Wong. Design of a Pipelined and Parameterized VLIW Processor: ρ-vex v2.0. In Proceedings of the 6th HiPEAC Workshops on Reconfigurable Computing, page 12, Paris, France, January 2012.

[SAFW11] Ali Asgar Sohanghpurwala, Peter Athanas, Tannous Frangieh, and Aaron Wood. OpenPR: An open-source partial-reconfiguration toolkit for Xilinx FPGAs. In 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), pages 228–235, May 2011.

[SBFC10] Antonio Carlos Schneider Beck Fl. and Luigi Carro. Dynamic Reconfigurable Architectures and Transparent Optimization Techniques: Automatic Acceleration of Software Execution. Springer Publishing Company, Incorporated, 1st edition, 2010.

[Seo] Seoul National University. SNU Real-Time Benchmarks. http://www.cprover.org/goto-cc/examples/snu.html. Accessed 23 Dec 2012.

[SF12] Miguel Lino Silva and João Canas Ferreira. Run-time generation of partial FPGA configurations. Journal of Systems Architecture, 58(1):24–37, January 2012.

[SGNV05] Greg Stitt, Zhi Guo, Walid A. Najjar, and Frank Vahid. Techniques for Synthesizing Binaries to an Advanced Register/Memory Structure. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-programmable Gate Arrays (FPGA), pages 118–124, 2005.

[SKK15] J. Sarkhawas, P. Khandekar, and A. Kulkarni. Variable Quality Factor JPEG Image Compression Using Dynamic Partial Reconfiguration and MicroBlaze. In International Conference on Computing Communication Control and Automation (ICCUBEA), pages 620–624, February 2015.

[SLL+00] H. Singh, Ming-Hau Lee, Guangming Lu, F.J. Kurdahi, N. Bagherzadeh, and E.M. Chaves Filho. MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications. IEEE Transactions on Computers, 49(5):465–481, May 2000.

[SML09] M. Sima, M. McGuire, and J. Lamoureux. Coarse-Grain Reconfigurable Architectures - Taxonomy -. In IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PacRim), pages 975–978, August 2009.

[SNS+13] M. Stojilovic, D. Novo, L. Saranovac, P. Brisk, and P. Ienne. Selective Flexibility: Creating Domain-Specific Reconfigurable Arrays. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 32(5):681–694, May 2013.

[SPA08] J. Suris, C. Patterson, and P. Athanas. An efficient run-time router for connecting modules in FPGAs. In International Conference on Field Programmable Logic and Applications (FPL), pages 125–130, September 2008.

[SRK11] A. Salman, M. Rogawski, and J. P. Kaps. Efficient Hardware Accelerator for IPSec Based on Partial Reconfiguration on Xilinx FPGAs. In International Conference on Reconfigurable Computing and FPGAs (ReConFig), pages 242–248, November 2011.

[SV05] G. Stitt and F. Vahid. New Decompilation Techniques for Binary-Level Co-Processor Generation. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 547–554, November 2005.

[Syn] Synopsys. Synphony Model Compiler. http://www.synopsys.com/Tools/Implementation/FPGAImplementation/Pages/synphony-model-compiler.aspx. Accessed on 4th January 2016.

[Tei12] J. Teich. Hardware/Software Codesign: The Past, the Present, and Predicting the Future. Proceedings of the IEEE, 100(Special Centennial Issue):1411–1430, May 2012.

[Tex] Texas Instruments. TMS320C6000 Image Library (IMGLIB) - SPRC264. http://www.ti.com/tool/sprc264. Accessed 23 Dec 2012.

[Tho80] James E. Thornton. The CDC 6600 Project. Annals of the History of Computing, 2(4):338–348, October 1980.

[Tim92] Tim Peters. Livermore Loops coded in C. http://www.netlib.org/benchmark/livermorec, 1992. Accessed 3 April 2015.

[Tri15] S.M. Trimberger. Three Ages of FPGAs: A Retrospective on the First Thirty Years of FPGA Technology. Proceedings of the IEEE, 103(3):318–331, March 2015.

[VEWC+09] B. Van Essen, A. Wood, A. Carroll, S. Friedman, R. Panda, B. Ylvisaker, C. Ebeling, and S. Hauck. Static versus Scheduled Interconnect in Coarse-Grained Reconfigurable Arrays. In International Conference on Field Programmable Logic and Applications (FPL), pages 268–275, September 2009.

[VF12] K. Vipin and S. A. Fahmy. A high speed open source controller for FPGA Partial Reconfiguration. In International Conference on Field-Programmable Technology (FPT), pages 61–66, December 2012.

[War02] Henry S. Warren. Hacker’s Delight. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2002.

[WH95] M.J. Wirthlin and B.L. Hutchings. A dynamic instruction set computer. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), pages 99–107, Washington, DC, USA, April 1995. IEEE Computer Society.

[WKMV04] S.J.E. Wilton, N. Kafafi, Bingfeng Mei, and S. Vernalde. Interconnect Architectures for Modulo-Scheduled Coarse-Grained Reconfigurable Arrays. In Proceedings of the 2004 IEEE International Conference on Field-Programmable Technology (FPT), pages 33–40, December 2004.

[Wol03] Wayne Wolf. A Decade of Hardware/Software Codesign. Computer, 36(4):38–43, 2003.

[Xila] Xilinx. SDSoC Development Environment. http://www.xilinx.com/products/design-tools/software-zone/sdsoc.html. Accessed on 29th December 2015.

[Xilb] Xilinx. Virtex UltraScale FPGAs Data Sheet: DC and AC Switching Characteristics. http://www.xilinx.com/support/documentation/data_sheets/ds893-virtex-ultrascale-data-sheet.pdf. Accessed on 6th January 2016.

[Xilc] Xilinx. Vivado High-Level Synthesis. http://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html. Accessed on 29th December 2015.

[Xild] Xilinx. Xilinx UltraScale Architecture for High-Performance, Smarter Systems. http://www.xilinx.com/support/documentation/white_papers/wp434-ultrascale-smarter-systems.pdf. Accessed on 6th January 2016.

[Xile] Xilinx. Zynq-7000 All Programmable SoC. http://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html. Accessed on 6th January 2016.

[Xil12a] Xilinx. Partial Reconfiguration of a Processor Peripheral Tutorial. http://www.xilinx.com/support/documentation/sw_manuals/xilinx13_4/PlanAhead_Tutorial_Reconfigurable_Processor.pdf, October 2012. Accessed on 9th March 2016.

[Xil12b] Xilinx. UG702: Partial Reconfiguration User Guide. http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/ug702.pdf, October 2012. Accessed on 13th January 2016.

[Xil15a] Xilinx. UltraScale Architecture and Product Overview. http://www.xilinx.com/support/documentation/data_sheets/ds890-ultrascale-overview.pdf, December 2015. Accessed on 13th January 2016.

[Xil15b] Xilinx. Xilinx Artix-7 FPGAs: A New Performance Standard for Power-Limited, Cost-Sensitive Markets. http://www.xilinx.com/support/documentation/product-briefs/artix7-product-brief.pdf, 2015. Accessed on 13th January 2016.

[YM04] Pan Yu and Tulika Mitra. Characterizing Embedded Applications for Instruction-Set Extensible Processors. In Proceedings of the 41st annual Design Automation Conference (DAC), pages 723–728, New York, NY, USA, 2004. ACM.