
Reconfigurable Accelerators in the World of General-Purpose Computing

Dissertation

A thesis submitted to the Faculty of Electrical Engineering, Computer Science and Mathematics of Paderborn University in partial fulfillment of the requirements for the degree of Dr. rer. nat.

by

Tobias Kenter

Paderborn, Germany, August 26, 2016

Acknowledgments

First and foremost, I would like to thank Prof. Dr. Christian Plessl for his advice and support during my research. I found particularly helpful his ability to communicate suggestions depending on the situation, either through open questions that give room to explore and learn, or through concrete recommendations that help to achieve results more directly. Special thanks go also to Prof. Dr. Marco Platzner for his advice and support. I profited especially from his experience and his ability to systematically identify the essence of challenges and solutions. Furthermore, I would like to thank:

• Prof. Dr. João M. P. Cardoso, for serving as external reviewer for my dissertation.

• Prof. Dr. Friedhelm Meyer auf der Heide and Dr. Matthias Fischer for serving on my oral examination committee.

• All colleagues with whom I had the pleasure to work at the PC2 and the research group, researchers, technical and administrative staff. In a variation of one of our coffee kitchen puns, I’d like to state that research without colleagues is possible, but pointless. However, I’m not sure about the first part.

• My long-time office mates Lars Schäfers and Alexander Boschmann for particularly extensive discussions on our research and far beyond.

• Gavin Vaz, Heinrich Riebler and Achim Lösch for intensive and productive collaboration on joint research interests.

• The Bachelor and Master students and student assistants I have supervised, for their enthusiasm and for helping me to understand the potential and limitations of the different approaches in their projects. The technical contributions of Henning Schmitz were particularly valuable for my research.

• The Deutsche Forschungsgemeinschaft (DFG) and the Labs Braunschweig for funding different phases of my research in the Collaborative Research Centre (CRC) 901 On-The-Fly Computing and the Project Multimodal Reconfigurable Processing Unit (MM-RPU).

• The colleagues in the CRC 901 for insightful discussions.

• Michael Kauschke and Matthias Gries for contributing an industry perspective on my research.

Finally, I would like to thank my family for their encouragement and moral support. Special thanks go to Gaby Nordendorf. If anyone was ever more convinced than me that I would eventually complete this thesis, it was her.

Abstract

Growing and newly emerging computing workloads and markets keep pushing the demand for compute performance forward. Power limitations and diminishing returns from wider and more parallel processors are driving the need for architectural innovations. By reducing overheads and customizing parallelism, specialized accelerators help to increase the performance of specific workloads efficiently. However, without programmability, they lack the flexibility for general-purpose computing and thus can’t profit from shared costs among different workloads, users and markets. The architecture of field programmable gate arrays (FPGAs) combines programmability with a high potential for specialization for different workloads. The main obstacle for FPGA adoption in general-purpose computing is the lack of productive methods and tools for designers and maintainers of implementations running entirely or partially on FPGAs. In this work, by analyzing current approaches to tackle this productivity challenge along with their conceptual and practical trade-offs, we identify three pillars that can complement each other to jointly drive general-purpose adoption of FPGAs. These pillars combine, firstly, synthesis from parallel OpenCL designs, secondly, fast and automatic compilation targeting overlay architectures on FPGAs, and thirdly, the encapsulation of hand-optimized FPGA designs into application- or domain-specific libraries.

In this thesis, we focus on overlay architectures as one of these paths towards productive FPGA design processes. A large variety of such architectures has been presented over the last years, but for most overlays it was poorly understood which overheads they involve compared to custom designs implemented directly on FPGA resources. Our work quantifies such overheads for an instruction-programmable overlay with a diverse set of program loops from a state-of-the-art stereo-matching application. It demonstrates that the architecture can, despite overheads, serve as a practically usable accelerator, and identifies specific differences to fully customized FPGA designs that may help to reduce overheads through overlay customization. Even though one motivating aspect for related research on overlay architectures has been their potential for fast compilation or synthesis and for quick reconfiguration, we had to go one step further with regard to the design entry in order to demonstrate their practical usefulness for raising productivity when aiming at FPGA acceleration. To this end, we present a fast tool flow that automatically extracts suitable loops from high-level source or binary code and offloads them to a vector processor realized as an FPGA overlay.

Besides the focus on productivity to foster FPGA adoption, FPGAs need to architecturally coexist and cooperate with general-purpose processors in order to fit into established computing systems and markets. With a high-level performance estimation model that takes into account the interdependency of architectures and program designs, we explore the design space for such integrated systems. We highlight that the integration of the memory hierarchy is not only helpful to reduce application design efforts, but also has considerable influence on the acceleration potential of the platform. Recent, high-profile trends in industry show interesting correspondences to our analysis. When the hardware integration of FPGA accelerators proceeds on this path, it will be foremost the interplay of design productivity and performance potential that governs the success of FPGAs in general-purpose computing. With our analysis and technical contributions in that field, we may help to shape the outcome of this development.

Zusammenfassung

Der Bedarf an immer höherer Rechenleistung für wachsende und neu aufkommende Rechenlasten und Märkte ist eine Herausforderung für Rechnerarchitektur. Grenzen bei der Leistungsaufnahme und sinkende Erträge durch größere und zunehmend parallele Prozessoren machen Architekturinnovationen notwendig. Durch Effizienzsteigerungen und durch individuell angepasste Parallelität können spezialisierte Beschleuniger dazu beitragen, die Rechenleistung für bestimmte Rechenlasten zu erhöhen. Ohne Programmierbarkeit fehlt ihnen jedoch die Flexibilität für allgemeine Rechenaufgaben und sie können somit nicht davon profitieren, Kosten auf verschiedene Nutzungsszenarien und Märkte zu verteilen. Die Architektur von FPGAs, eine bestimmte Variante programmierbarer Logikbausteine, kombiniert Programmierbarkeit mit einem hohen Potenzial zur Spezialisierung für verschiedene Rechenlasten. Das größte Hindernis auf dem Weg zu einem verbreiteten Einsatz von FPGAs für allgemeine Rechenaufgaben ist der Mangel an produktiven Methoden und Werkzeugen zur Entwicklung und Wartung von Implementierungen, die ganz oder teilweise auf FPGAs ausgeführt werden. Wir analysieren in dieser Arbeit aktuelle Ansätze, die Produktivität bei der Entwicklung von FPGA-Anwendungen zu steigern, und arbeiten konzeptuelle und praktische Vor- und Nachteile heraus. Darauf aufbauend identifizieren wir drei Säulen, die einander dabei ergänzen können, die Verwendung von FPGAs für allgemeine Rechenaufgaben voranzutreiben. Diese Säulen kombinieren erstens eine Konfigurationsgenerierung ausgehend von parallelen OpenCL-Implementierungen, zweitens eine schnelle und automatisierte Übersetzung für übergelagerte Architekturen auf FPGAs, und drittens die Zusammenfassung von manuell optimierten FPGA-Konfigurationen zu anwendungs- oder bereichsspezifischen Bibliotheken.

In dieser Ausarbeitung konzentrieren wir uns auf übergelagerte Architekturen als Ansatz für produktive Entwicklungsprozesse für FPGAs. Im Laufe der letzten Jahre wurde eine Vielzahl solcher Architekturen vorgestellt, die durch eine Zwischenschicht das Abstraktionsniveau von FPGAs erhöhen. Allerdings existierte für die meisten dieser übergelagerten Architekturen nur ein unzureichendes Verständnis von Flächenmehrverbrauch oder reduzierten Rechenleistungen im Vergleich zu spezialisierten Konfigurationen, die den FPGA ohne Zwischenschicht nutzen. Unsere Arbeit quantifiziert für eine instruktionsbasierte übergelagerte Architektur solche Nachteile anhand einer Auswahl unterschiedlicher Programmschleifen, die zu einer modernen Anwendung für stereoskopischen Bildabgleich gehören.

Wir zeigen, dass diese Architektur trotz dieser Nachteile als praktisch nutzbarer Beschleuniger dienen kann und identifizieren verschiedene Unterschiede im Vergleich zu vollständig spezialisierten Konfigurationen. Dies könnte zukünftig helfen, durch Spezialisierung der übergelagerten Architekturen deren Nachteile weiter zu reduzieren. Das Potenzial zur schnellen Übersetzung oder Konfigurationsgenerierung für übergelagerte Architekturen, sowie die Möglichkeit, schnell Konfigurationen auszutauschen, war bereits ein Anreiz für die bestehende Forschung in diesem Bereich. Um allerdings zu demonstrieren, dass sie tatsächlich dazu beitragen, eine erhöhte Produktivität bei der Beschleunigung von Anwendungen mit FPGAs zu erreichen, mussten wir bezüglich der Ausgangsdarstellung einen Schritt weitergehen. Dazu stellen wir Werkzeuge vor, die automatisch geeignete Schleifen aus Hochsprachenquelltexten oder Binärcode extrahieren und auf einem Vektorprozessor zur Ausführung bringen, der als übergelagerte Architektur auf FPGAs umgesetzt ist.

Neben dem Schwerpunkt auf Produktivität zur weiteren Verbreitung von FPGAs müssen diese auch in etablierte Rechnersysteme und Märkte integriert werden, und dazu architektonisch mit allgemeinen Prozessoren zusammenpassen und zusammenarbeiten. Mit einem abstrakten Modell zur Abschätzung von Rechenleistung, das die gegenseitige Abhängigkeit zwischen Architekturen und Programmierentscheidungen berücksichtigt, erkunden wir systematisch Alternativen für derartig integrierte Rechnersysteme. Wir arbeiten heraus, dass die Integration der Speicherhierarchie nicht nur dazu beiträgt, Anwendungsentwicklung zu erleichtern, sondern auch einen erheblichen Einfluss auf das Beschleunigungspotenzial des Rechnersystems hat. Aktuelle, viel beachtete Entwicklungen bei kommerziellen Rechnersystemen zeigen inzwischen interessante Übereinstimmungen zu unserer Analyse. Wenn die Integration von FPGA-Beschleunigern in Systeme derartig weitergeht, wird vor allem das Zusammenspiel von Produktivität und erzielbarer Rechenleistung über den Erfolg von FPGAs für allgemeine Rechenaufgaben entscheiden. Mit unserer Analyse und den technischen Beiträgen in diesem Bereich könnten wir den Ausgang dieser Entwicklung entscheidend mitgestalten.

Contents

Abstract

Zusammenfassung

Contents

List of Tables

List of Listings

List of Figures

1 Introduction
   1.1 Motivation
   1.2 Contributions of this Thesis
   1.3 Thesis Structure

2 Computing Concepts, Trends and Domains
   2.1 Compute Devices: Basic Terms and Concepts
      2.1.1 Target Metrics
      2.1.2 Instruction-Programmable Processors
      2.1.3 Computing in Circuits
      2.1.4 Field-Programmable Gate Arrays
   2.2 Trends in Technology, Architectures and Devices
      2.2.1 Scaling in Process Technology: Continuity and Changes
      2.2.2 Impact on Architecture
      2.2.3 Accelerators
      2.2.4 FPGA Accelerators
   2.3 Computing domains and markets
      2.3.1 General-Purpose and Special-Purpose Computing
         Complex Embedded Systems in Automotive
         Modern …
      2.3.2 FPGAs in Special-Purpose Computing and Beyond


      2.3.3 Markets and Workloads of General-Purpose Computing
         Personal and Mobile Computers
         Server Class Computers
         Opportunity Matrix
   2.4 Chapter Conclusion

3 The Case for general-purpose adoption of FPGAs with the help of Overlay Architectures
   3.1 Between Performance and Productivity Walls
      3.1.1 Productivity of General-Purpose Architectures
      3.1.2 Impact of System Architectures
      3.1.3 FPGA Productivity
      3.1.4 Three Pillars for General-Purpose FPGAs
   3.2 FPGA Overlays
      3.2.1 Instruction-Programmable Overlays
      3.2.2 Reconfigurable Hardware beyond FPGAs
      3.2.3 Structurally Programmable Overlays
   3.3 Chapter Conclusion

4 Stereo-Matching Kernels on Overlay and Custom Designs
   4.1 Introduction to the Stereo-Matching Problem
   4.2 Stereo-Matching Algorithm with Inherent Parallelism
      4.2.1 Cost Initialization
      4.2.2 Cost Aggregation
      4.2.3 Scanline Optimization
      4.2.4 Disparity Refinement
      4.2.5 Implementation
   4.3 Utilized FPGA Platforms and Programming Models
      4.3.1 Maxeler Platform and Programming Paradigm
      4.3.2 Convey HC-1 Platform with Overlay
      4.3.3 Comparison of FPGA Platforms
   4.4 Kernel-Centric Acceleration
   4.5 Kernel-Designs for two FPGA platforms
      4.5.1 Aggregation Kernels
      4.5.2 Scanline Kernels
      4.5.3 Synthesis and Integration
      4.5.4 Kernel Summary
   4.6 Experimental Setup
      4.6.1 Evaluated Systems
      4.6.2 …
   4.7 Evaluation and Comparisons
      4.7.1 Stereo-Matching System Performance
      4.7.2 Platform Overheads


      4.7.3 Kernel Performance
      4.7.4 Quantifying Overlay Overheads through Hardware-Normalization
      4.7.5 Overheads by Kernel Groups
      4.7.6 Estimates on Design Efforts
      4.7.7 Limitations of the Comparison
   4.8 Related Work
   4.9 Chapter Conclusion

5 Compilation and Runtime Techniques for FPGA Accelerators
   5.1 Motivation
   5.2 Approach
      5.2.1 Toolflow for Heterogeneous Binaries
      5.2.2 Code Extraction
      5.2.3 Vectorization
      5.2.4 Runtime Decisions
   5.3 Experimental Setup
   5.4 Evaluation
      5.4.1 Comparison to Hand-Written Kernels of Chapter 4
      5.4.2 Further Experiments
   5.5 Related Work
   5.6 Excursion to Offloading Decisions at Runtime
   5.7 System-Level and Task Migration
   5.8 Chapter Conclusion

6 CPU-accelerator System Integration
   6.1 Motivation
   6.2 Proposed Architecture
      6.2.1 Relation to Existing Architectures
   6.3 Method and Framework
      6.3.1 Estimation Model
      6.3.2 Partitioning Approach
   6.4 Design Space Exploration
      6.4.1 Speedups per Benchmark
      6.4.2 Memory integration
      6.4.3 Accelerator Size
      6.4.4 Efficiency
      6.4.5 Interface Latency
   6.5 Related Work
   6.6 Chapter Conclusion

7 Conclusion
   7.1 Summary
   7.2 Outlook
      7.2.1 Towards a Library of Overlay Architectures and Tools


      7.2.2 And Beyond

Acronyms

Author’s publications

Bibliography

List of Tables

2.1 Symbols for instruction set example.
2.2 Selection of instructions for a RISC-like instruction set.
2.3 Execution of assembler loop in five-stage pipeline.
2.4 Truth table for 1-bit full adder.
2.5 Truth table for 2-input …
2.6 Truth table for equality check with fixed input.
2.7 Power projections for the example loop.
2.8 Summary of opportunities for FPGAs in general-purpose markets.

4.1 Hardware resources of two FPGA platforms in our experiments.
4.2 Unrolling factors of synthesized kernels.
4.3 Resource utilization of implemented kernels.
4.4 Dimensions of low-disparity image series.
4.5 Dimensions of high-disparity image series.
4.6 Speedups over CPU expressed as Kernel-ratios.
4.7 Hardware-Normalized Kernel-Ratios between FPGA platforms.
4.8 Hardware-Normalized Kernel-Ratios by kernel types.

5.1 Performance evaluation of vectorized loops.

6.1 Basic symbols of the model.
6.2 Data obtained by profiling and static code analysis.
6.3 Architecture parameters of the model.
6.4 Solution-specific data depending on architecture.
6.5 Default model parameters for design space exploration.
6.6 Alternative parameters used for three level hierarchy.
6.7 Investigated cache configurations.


Algorithms and Listings

2.1 Listing: Code snippet computing running sums of an input array.
2.2 Listing: Assembly code program for the code snippet.

4.1 Algorithm: Horizontal aggregation step
4.2 Algorithm: Scanline optimization step in left to right orientation
4.3 Listing: Memory manager example
4.4 Listing: Continued memory manager example for Maxeler
4.5 Listing: Continued memory manager example for Convey
4.6 Listing: Horizontal integral sums

5.1 Listing: Code example for coprocessor extraction.
5.2 Listing: Vectorized loop nest from horizontal integral sums.
5.3 Listing: Pattern with simple loop carried dependency.
5.4 Listing: Extended pattern with loop carried dependency.
5.5 Listing: Pattern with irregular index offsets.

6.1 Algorithm: multi-level partitioning
6.2 Function: getBestPartitionObject


List of Figures

2.1 Illustration of a circuit for previous code snippet.
2.2 Qualitative illustration of architecture efficiencies.

3.1 Illustration of three pillars for FPGAs as general-purpose accelerators.
3.2 The challenge to quantify overheads of overlay architectures.

4.1 High-level overview of the stereo-matching algorithm following [182].
4.2 Illustration of cross-based cost aggregation regions.
4.3 Illustration of the Maxeler MPC-X platform with only one of four MAX3 Vectis accelerator cards shown.
4.4 Illustration of the Convey HC-1 platform.
4.5 CPU profiling on first system.
4.6 CPU profiling on second system.
4.7 Illustration of compute order in hardware kernel.
4.8 Illustration of memory access pattern in hardware kernel.
4.9 Illustration of compute order in vector processor kernel.
4.10 Illustration of arm selection in hardware kernel.
4.11 Illustration of scanline hardware kernel.
4.12 Stereo-matching speedups in low-disparity image series vs. first CPU.
4.13 Stereo-matching speedups in high-disparity image series vs. first CPU.
4.14 Stereo-matching speedups in low-disparity image series vs. second CPU.
4.15 Execution time components of aggregation phase, scanline phase and rest.
4.16 Breakdown of individual runtime components inside aggregation phase
4.17 Breakdown of individual runtime components inside scanline phase
4.18 Execution times of individual aggregation kernels.
4.19 Execution times of individual scanline kernels.

5.1 Toolflow for generating heterogeneous binaries.

6.1 Dual interface in an architecture with shared L2 and two private L1 caches.
6.2 Possible integration of the proposed architecture into a multicore CPU.
6.3 Illustration of Xilinx Zynq memory hierarchy.


6.4 Illustration of POWER8 memory hierarchy with Coherent Accelerator Processor Interface (CAPI).
6.5 Speedups for 13 benchmarks with two different accelerator sizes.
6.6 Speedups for different memory integrations into a two level cache hierarchy.
6.7 Speedups for different memory integrations into a three level cache hierarchy.
6.8 Speedups for different accelerator sizes.
6.9 Speedups for different accelerator execution efficiencies.
6.10 Speedups for different interface latencies.

CHAPTER 1

Introduction

In more than two decades of research on reconfigurable computing, academia and industry have shown that field programmable gate arrays (FPGAs) permit huge gains in performance and efficiency for a wide range of suitable applications through customization and exploitation of parallelism on different levels. However, current usage of FPGAs is mostly confined to individual and specific tasks, for which the FPGAs were selected as the most suitable or most economical architecture. Even the first large scale deployment of FPGAs in a data center was driven by a single application scenario [213]. In this thesis, we argue that in the light of current architectural challenges in general-purpose computing, FPGAs can play a much larger role in this domain. We identify current limitations and opportunities on this path, primarily in the area of productivity, but also with regard to system integration. We present original contributions in these two areas. In order to tackle productivity, we mainly propose and evaluate overlay architectures on FPGAs. With regard to system integration, we analyze design space options and compare our results to current trends. In the remainder of this chapter, we first outline the motivation for this work in some more detail in Section 1.1. In Section 1.2, we lay out the contributions that this thesis makes towards general-purpose adoption of FPGAs and to specific technical challenges. In Section 1.3, we outline how the structure of the remaining thesis organizes these contributions.

1.1 Motivation

Computer architecture struggles to keep delivering ever more compute performance for growing and newly emerging workloads and markets. Well ahead of the eventual end of Moore’s law [188], power limitations and diminishing returns from wider and more parallel processors are driving the need for architecture innovations. By reducing overheads and customizing parallelism, specialized accelerators help to efficiently increase the performance of specific workloads.

However, without programmability, they lack the flexibility for general-purpose computing and thus can’t profit from shared costs among different workloads, users and markets. The architecture of FPGAs combines programmability with a high potential for specialization for different workloads. The main obstacle for FPGA adoption in general-purpose computing is the lack of productive formalisms and tools for designers and maintainers of implementations running entirely or partially on FPGAs. For specific workloads, where performance or efficiency targets could otherwise not be met, sophisticated designs for FPGAs have been developed with the corresponding effort [121]. In general-purpose computing, such effort typically cannot be spent, particularly not for a target architecture that is not yet established in a wide range of systems. FPGA manufacturers and the research community have invested substantial effort in improving the accessibility and productivity of programming models, languages and tools. In this thesis, we discuss three essential pillars that can trigger a breakthrough in FPGA adoption, focus on overlay architectures as an insufficiently understood paradigm, and investigate which overheads and trade-offs they involve. In order to fit into established computing systems and markets, FPGAs need to coexist and cooperate with general-purpose processors (GPPs). The architectural integration of central processing units (CPUs) and accelerators is not only a central factor for performance and efficiency, but also has a huge impact on the design productivity for such systems. Industry is actively pursuing the integration of specialized accelerators, graphics-processing units (GPUs) and recently also of FPGAs with GPPs. In this thesis, we analyze the potential of such integration and discuss how we can predict it, even though architectures and workload implementations closely depend on each other.

1.2 Contributions of this Thesis

In pursuit of the main focus of this thesis, we present the following high-level contributions to promote the adoption of reconfigurable accelerators, in particular FPGAs, in the domain of general-purpose computing.

1. Based on an analysis of architectures, markets and workloads, we identify opportunities for FPGAs in data center and cloud environments, in high-performance computing (HPC) and mobile computers.

2. We present our vision of a three-pillar approach to FPGA usage with, firstly, accelerator-friendly Open Computing Language (OpenCL) specifications, secondly, overlay architectures to flexibly handle the trade-offs between productivity and performance of reconfigurable computing, and thirdly, libraries of reusable customized designs.

3. Further focusing on insufficiently understood characteristics of overlay architectures, we present the first broad quantification of the overheads of an instruction-programmable vector processor overlay architecture over fully customized kernel designs on FPGAs. For ten diverse kernels from a stereo-matching application, these overheads are on average in the range of 2.5x to 3x, which our experiments demonstrate to be small enough to still enable acceleration over GPPs.

4. We demonstrate that such an overlay architecture can serve as a target for the desired fast and fully automated acceleration that existing tools didn’t deliver.

5. In order to analyze the architectural integration of reconfigurable accelerators with CPUs, we present an estimation method that overcomes the interdependency of architectures and concrete implementations of workloads.

6. Based on this method, we give an overview of the design space of CPU-accelerator architectures that was insufficiently understood at the time of our publications.

Besides these high-level contributions, our work advances or complements the state-of-the-art in several technical aspects.

1. By avoiding optimization-induced reductions of algorithmic quality, we present the most accurate FPGA-accelerated stereo-matching implementation published to date.

2. Through fully compatible scalable kernel implementations and a memory management wrapper library, it runs on two very different hardware platforms with different programming models.

3. With the systematic variation of kernel patterns and data layouts, we isolate the effects of outer-loop vectorization for the vector processor overlay architecture and find measurable, but small effects.

4. Integrating our automation methods into the LLVM infrastructure (http://llvm.org/) allows us not only to target different variants of compiled and source code, but also to reuse components for analysis and offloading.

5. For dynamic code analysis, we found an interesting opportunity space between offline profiling and deterministic offloading decisions at program runtime.

We have presented and published peer-reviewed results of this work in nine conference contributions, one workshop contribution, and two journal articles. Tobias Kenter is the main contributor and first author of six of these twelve publications, including the article in one of the two premier journals on reconfigurable computing [6]. The author’s publications are summarized starting on page 160, in front of the main bibliography.

1.3 Thesis Structure

The remainder of this thesis broadly follows the sequence of high-level contributions summarized in the previous section. Chapter 2 starts with a general background on computing terms, concepts and architecture trends. It leads to an analysis of market opportunities of FPGAs in general-purpose computing. Chapter 3 complements its predecessor by comparing the productivity perspective of FPGAs with established general-purpose computing architectures.


After introducing our three-pillar approach for FPGA usage in general-purpose computing, we review existing research on overlay architectures, which is the specific pillar we focus on in this thesis. This concludes the focus on backgrounds and general analysis. The following three chapters present original research on offloading to overlay architectures and to customized kernels and on system integration of reconfigurable accelerators. In Chapter 4, we present the overhead analysis of an instruction-programmable vector processor overlay architecture. For this purpose, we offload ten different kernels from a general-purpose stereo-matching application to the overlay architecture and alternatively to fully customized kernel-specific FPGA designs. In the following Chapter 5, we focus on the productivity aspect when targeting this overlay architecture. While existing tools required considerable developer efforts and yet had limited coverage of supported kernel patterns, we demonstrate that the specific features of the overlay architecture can be used to guide a fully automated offloading process with wide applicability. In Chapter 6, we focus on the system integration of reconfigurable accelerators with CPUs. We present firstly an architecture model that allows reconfigurable acceleration at different granularities, secondly the method of our performance estimation that avoids mapping-specific predictions and thirdly the design space exploration results for CPU-accelerator architectures obtained with this method. Chapter 7 concludes this thesis and points out directions for future research.

CHAPTER 2

Computing Concepts, Trends and Domains

In this chapter, we present the required background for the vision of FPGAs as accelerators for general-purpose computing. Firstly, in Section 2.1 we introduce basic terminology of compute devices and outline basic concepts of such devices, including FPGAs, on the architectural level. So, from a high-level perspective, this section is centered around architectural foundations of how computation can be done. The deeper roots of how computation is done, that is the fundamentals of semiconductor devices and process technology, are out of the scope of this work. However, their changes over time are driving factors for past and current trends in architectures and devices, which are discussed, secondly, in Section 2.2. Thirdly, Section 2.3 extends this background by discussing what kind of computation is done for different purposes and markets, along with some of their requirements. We focus on the domain of general-purpose computing, designated in the title of this thesis, and in this context introduce the paradigm of On-The-Fly (OTF) Computing, which motivated large parts of this thesis. Besides data centers and cloud computing, HPC and mobile computers, we identify OTF computing as a promising, but particularly challenging target for FPGA acceleration.

2.1 Compute Devices: Basic Terms and Concepts

In this section, we introduce basic concepts of computing and compute devices and along the way lay out the terminology used throughout this thesis. In a general sense, a computer is a device that transforms information in a way that can be described in an algorithmic form. Computation is the direct action of transforming information performed by the computer, whereas the field of computing summarizes any “goal-oriented activity requiring, benefiting from, or creating computers” [252]. However, the term computer is often associated with a narrower meaning [271] that implies the presence of an instruction-programmable processor, also denoted as central processing unit (CPU) (see Subsection 2.1.2), and some general-purpose computing capabilities (see Subsection 2.3.1).


In order to differentiate the more general computers that we discuss in this work from these narrower attributions, we denote them as compute devices or compute systems. The distinction of general-purpose devices from special-purpose or embedded systems is discussed in Subsection 2.3.1. On the other hand, in the widest sense, computers encompass digital, analogue and even mechanical devices, whereas in this thesis, we restrict ourselves to synchronous digital compute devices implemented as integrated circuits. The basic elements of these digital circuits are transistors, which, arranged into elementary gates, implement basic functions on individual binary digits (bits). Inside a circuit, these bits are also denoted as signals, which after going through a sequence of combinatorial logic are stored in registers that keep their state until a clock triggers the register to assume the value of the input signal. In the following Subsection 2.1.1, we continue the introduction of basic terms, here regarding target metrics of compute systems. In Subsection 2.1.2, we outline the basic concepts of computing with instruction-programmable processors and in Subsection 2.1.3 contrast them to concepts of computing in circuits. With FPGAs, a programmable architecture for computing in circuits is introduced in Subsection 2.1.4.

2.1.1 Target Metrics

Compute systems execute tasks that can be specified in the form of software or as circuits, or as a combination of both. In this subsection, we outline target metrics of such systems as used throughout this work and point to relations between these metrics. The general goal for compute systems is to achieve high performance at low costs, with low power and energy consumption and with a low design effort. Those goals often compete with each other, and several of the metrics are interdependent or form combined metrics like efficiency. There are three typical metrics to characterize performance perf: First, the execution time t_exe is the time to execute one specific task or benchmark. Second, the latency λ is the response time between a specific input and its resulting output. Third, the throughput R = transactions/s is the rate at which similar tasks, in this context often denoted as transactions, are completed. Performance is proportional to the inverse of execution time, perf ∝ 1/t_exe, and proportional to the inverse of latency, perf ∝ 1/λ, but directly proportional to throughput, perf ∝ R. In synchronous devices, the performance depends on the clock frequency and the amount of work done per clock cycle. When only a single task is executed at a time, the execution time is sufficient to characterize the performance of a system, whereas for systems which execute more than one task in any form of concurrency, for example by pipelining, latency and throughput complement each other to characterize system performance. When comparing two different systems, a typical metric is the speedup, which is the ratio of performance or execution times of the two systems. The speedup SU of a system S_new with regard to a reference system S_old is defined as SU = perf(S_new)/perf(S_old) = t_exe(S_old)/t_exe(S_new). In the conceptual parts of this thesis, when discussing performance we will have all of these metrics in mind unless any specific characteristic is highlighted.

In our own experiments in Chapters 4 and 5, the tasks of our workload are executed sequentially, and therefore we evaluate performance solely by means of total execution times.

The manufacturing cost of a compute system is a monetary metric that reflects a complex interplay of aspects like chip area, fabrication process, yield rates during fabrication, assembly costs of components into a system and all steps of the design process. For comparisons, it is often desirable to express cost per device, but this depends a lot on the volume of devices produced, since some fraction of costs occurs per device, whereas fixed costs like those for design and mask generation, often denoted as non-recurring engineering (NRE) cost, can be split among all devices produced. Additionally, the variable costs per device change through scale effects and the maturity of the production process. In this thesis, we consider general cost aspects within the remainder of this chapter, but don’t report the concrete costs of the systems we used for the experiments in the following chapters, because the costs of individual low-volume systems sold for academic use hardly reflect the actual costs that such systems would have in any volume market.

The power of a compute system while performing a workload determines both the amount of electrical power that must be supplied to the system and the amount of thermal power that must be dissipated, in many systems through active cooling. Both cause additional costs beyond the acquisition costs of a device. Together, all of those costs are often summarized as total cost of ownership (TCO) over a specified amount of time. Energy is computed by the integration of power over time, in our context the time to complete a specific task or set of tasks. Thus, energy puts power in relation to performance. When comparing two systems, the system with higher power consumption may still use less energy to complete a task if it has a sufficiently higher performance. Similarly to the costs, power and energy are a concern in the conceptual parts of this thesis, but due to the different nature of our hardware platforms are not evaluated in our experiments in Chapters 4 and 5.

Although area contributes to the cost metric as chip surface area, it is often considered individually, frequently expressed through proxies like the number of transistors in a chip or the fraction of utilized FPGA resources (see Subsection 2.1.4). In the general case, this allows abstracting away the concrete decision for a specific manufacturing process and enables cost estimations prior to actual chip production. Similarly, in the FPGA case, numbers obtained for one specific FPGA model allow estimating whether a design may fit onto another FPGA or how much headroom there is to add additional functionality to the same FPGA. We use such estimations in Chapter 4 of this thesis.

The term efficiency puts one of the presented performance metrics of a system in relation to its cost or one of the cost-related metrics. Therefore, typical efficiency metrics are energy efficiency as perf/energy, cost efficiency as perf/cost, and area efficiency as perf/area; when discussing efficiency throughout this work, we have all of those in mind, unless specified otherwise. When we further talk about the productivity of a design approach or programming model for a system, we relate either some achieved performance or achieved efficiency to the design effort, typically expressed in time spent. Since for a fair assessment of productivity the required amount of knowledge and experience needs to be taken into account, it is harder to quantify than the previous metrics. We discuss productivity issues in Section 3.1.

Based on our comparison of two approaches in Section 4.7, we report on the productivity we experienced during those experiments.

Finally, the flexibility to execute different tasks is an important characterization of a compute device. This aspect is introduced in more depth in Subsection 2.3.1. Even though for each of the presented metrics a clear optimization goal exists, various trade-offs between different target metrics and different emphasis on either of the metrics are one contributing factor to the diversity of compute devices over time and across markets.
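To make these definitions concrete, consider a purely illustrative example with invented numbers that do not refer to any system evaluated in this thesis:

    % Invented numbers, only to illustrate speedup and energy efficiency.
    t_{exe}(S_{old}) = 10\,\mathrm{s}, \qquad t_{exe}(S_{new}) = 2\,\mathrm{s}
    \quad\Rightarrow\quad
    SU = \frac{t_{exe}(S_{old})}{t_{exe}(S_{new})} = 5

    % Assume S_new draws 100 W and S_old draws 50 W for this task:
    E(S_{new}) = 100\,\mathrm{W} \cdot 2\,\mathrm{s} = 200\,\mathrm{J}, \qquad
    E(S_{old}) = 50\,\mathrm{W} \cdot 10\,\mathrm{s} = 500\,\mathrm{J}

Despite its higher power draw, the faster system consumes less energy for the task, so it is the more energy-efficient one, which is exactly the distinction between power and energy made above.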

2.1.2 Instruction-Programmable Processors

The concept of instruction-programmable computers, as the first class of compute devices presented here, goes back to the Von Neumann architecture [263], but has long evolved since the first draft from 1945. At its core, a CPU contains an arithmetic logic unit (ALU) that performs the actual computation, registers to hold operands for the ALU, and a control unit that manages the operation performed by the ALU as well as the operands to be used. The CPU is connected to a memory, where both instructions forming a program and data for the computation are stored. Additionally, an interface for external input and output is part of the concept of any computer. Instructions can encode the operations to be performed by the ALU and specify the registers that hold the operands. Other instructions specify data movements between registers and memory. The control unit decodes an instruction and identifies the operands, triggers instruction execution for example by the ALU, and determines the next instruction by updating a program counter. In the simplest and typically most common case, the program counter is just incremented; otherwise a specific control flow instruction needs to specify how the program counter is updated. The next instruction specified by the program counter is then fetched from memory.

The basic concept of instruction-programmable computers turned out to be extremely powerful because of the flexibility to express any kind of computing task with programs made from instructions. However, it also imposes two major drawbacks: firstly, besides the actual computation performed, there is an overhead, caused for example by the instruction fetch and decode steps, by instructions that don’t contribute to the actual computations, or by the repeated movement of operands from and to registers and memory. This overhead limits both performance and efficiency of computing with processors. Secondly, the sequentiality of instruction execution facilitates reasoning about program execution, but also limits performance when each instruction can only be executed after the previous one is finished. Within the remainder of this subsection, we present some mitigation strategies in the areas of instruction set architecture (ISA) design, memory hierarchies and parallelism that are employed in current processors, mostly with regard to performance bottlenecks. The alternatives presented in Subsections 2.1.3 and 2.1.4 on the conceptual level and with more concrete architectures in Subsection 2.2.3 often closely combine efficiency and performance aspects.
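To make the fetch-decode-execute cycle and its per-instruction overhead more tangible, the following minimal C sketch models a processor interpreting a tiny, hypothetical instruction set; the encoding, register count and program are invented for illustration only and are unrelated to the instruction set introduced later in Table 2.2.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical, minimal instruction format for illustration only. */
    enum opcode { OP_LOAD, OP_ADD, OP_STORE, OP_HALT };

    struct insn {
        enum opcode op;
        int dst;      /* destination register (memory address for OP_STORE) */
        int src_a;    /* first source register (memory address for OP_LOAD) */
        int src_b;    /* second source register                             */
    };

    int main(void) {
        int32_t mem[16] = {20, 22};  /* unified data memory, Von Neumann style */
        int32_t reg[4]  = {0};       /* register file                          */

        /* Program: r0 = mem[0]; r1 = mem[1]; r2 = r0 + r1; mem[2] = r2; halt. */
        const struct insn prog[] = {
            {OP_LOAD,  0, 0, 0},
            {OP_LOAD,  1, 1, 0},
            {OP_ADD,   2, 0, 1},
            {OP_STORE, 2, 2, 0},
            {OP_HALT,  0, 0, 0},
        };

        int pc = 0;  /* program counter */
        for (;;) {
            const struct insn i = prog[pc++];  /* fetch and default pc increment */
            switch (i.op) {                    /* decode, then execute           */
            case OP_LOAD:  reg[i.dst] = mem[i.src_a];                 break;
            case OP_ADD:   reg[i.dst] = reg[i.src_a] + reg[i.src_b];  break;
            case OP_STORE: mem[i.dst] = reg[i.src_a];                 break;
            case OP_HALT:  printf("mem[2] = %d\n", (int)mem[2]);      return 0;
            }
        }
    }

Only the case bodies perform useful work; the fetch, the program counter update and the decode in every iteration are pure overhead, which corresponds to the first drawback discussed above.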

The area of ISA design investigates which types of instructions a processor should be able to execute. In early times of computer architecture, there was a trend to design a high number of expressive instructions. Such expressive instructions can provide different ways to directly or indirectly encode operands in registers or memory, or designate special operations or groups of operations. With this approach, by getting more work done per instruction, the overhead of fetching and decoding each instruction can be mitigated and a compact program representation is achieved. In retrospect, this strategy was denoted as complex instruction set computer (CISC), in contrast to the reduced instruction set computer (RISC) architecture, which was proposed in the 1980s [207, 208]. The latter concept proposes radically smaller and simpler instruction sets that allow memory operations only through explicit load and store instructions and have a fixed instruction size. This enables much simpler and smaller processor designs, which in turn allows reducing the latency for executing each individual instruction. Another benefit, outside the processor itself, is that compilers can generate efficient code for this type of architecture much more easily. In many contemporary GPPs, elements of both competing design approaches, CISC and RISC, have been combined. Beyond this general distinction in ISA design approaches, a number of characteristic instruction set features have been developed to adapt processors to the needs of specific classes of tasks. For example, digital signal processors (DSPs) are optimized for signal processing applications and focus on data throughput in regular code. Microcontrollers on the other hand have very simple designs, optimized for control-dominant code and supporting only few and often slow arithmetic operations. These observations also point to the interplay of intended workloads and specialization that is an essential theme of this thesis and is approached more systematically in Section 2.3.1.

In order to supply a processor with instructions and to hold and supply most input, intermediate and output data, a word-addressable memory is used. The fundamental challenge posed by such memories is that small amounts of memory can be made fast at the expense of area and power, whereas larger memory is slower. Therefore, a fundamental concept for memory design is that of a hierarchy, where the so-called main memory is typically designed to contain the entire program and data, and two to four increasingly smaller and faster caches dynamically and transparently replicate selected parts of the address space, so-called cache lines, for faster access. The processor registers form the fastest and smallest level of memory, but are typically not addressed like main memory and caches. At the other end of the hierarchy, even larger and slower non-volatile storage like hard disks or flash memory is up to now not organized in data words, but in blocks and with file systems. The Von Neumann architecture’s specific memory bottleneck is that both instructions and data are provided by the same memory and interface, which can lead to bandwidth limitations or contribute latency to the execution flow. This has been addressed by the Harvard architecture and the modified Harvard architecture, which provide separate memories and interfaces for instructions and data. Most modern GPPs implement a modified Harvard architecture, where the main memory holds both instructions and data in a common address space, but separate first-level caches provide the processor with a distinct interface to each of them.
Overall, modern memory hierarchies can provide good performance for many application scenarios, but contribute a significant fraction to the overall power consumption of the processors that use them.
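As an illustration of the cache idea described above, the following sketch shows how one level of a hypothetical direct-mapped cache maps an address to a set index and a tag; the cache size, line size and write policy are arbitrary assumptions of this sketch and not parameters of any platform used in this thesis.

    #include <stdbool.h>
    #include <stdint.h>

    /* One level of a direct-mapped cache with 64-byte lines (16 KiB total). */
    #define LINE_BYTES 64
    #define NUM_SETS   256

    struct cache_line {
        bool     valid;
        uint64_t tag;
        uint8_t  data[LINE_BYTES];
    };

    static struct cache_line cache[NUM_SETS];

    /* Returns true on a hit; on a miss the line would have to be fetched from
     * the next, larger and slower level of the hierarchy (not modeled here). */
    bool cache_lookup(uint64_t addr) {
        uint64_t line_addr = addr / LINE_BYTES;    /* drop the offset within the line */
        uint64_t set       = line_addr % NUM_SETS; /* index bits select the set       */
        uint64_t tag       = line_addr / NUM_SETS; /* remaining bits form the tag     */

        if (cache[set].valid && cache[set].tag == tag)
            return true;                           /* hit: data available quickly     */

        cache[set].valid = true;                   /* miss: allocate the line         */
        cache[set].tag   = tag;
        return false;
    }

A hit returns data after a short access time, while a miss has to go to the next, larger and slower level, which is what makes the behavior of the hierarchy so dependent on the workload's access pattern.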


After introducing ISA design and memory hierarchies, we now present three concepts of parallelism inside the area of instruction-programmable computers that helped to mitigate the fundamental issue of program sequentiality over the course of several decades, roughly following the structure and terminology of Hennessy and Patterson [118]. In the following Subsection 2.1.3, we outline how compute circuits that don’t follow the basic principles of instruction-programmable processors may improve some of the aspects presented in this list.

• Instruction-level parallelism (ILP) occurs when the correct execution of one specific instruction does not depend on the completed execution of all previous instructions from the instruction sequence of a program execution. This can be used to increase performance by executing several instructions in parallel or quasi-parallel. A compiler can increase the amount of exploitable ILP through instruction scheduling or loop unrolling (the code sketch after this list shows a manually unrolled loop). Scheduling can also be performed in hardware, then denoted as dynamic scheduling and allowing for out-of-order execution. The programmer is typically not involved in providing or exploiting ILP. In order to execute independent instructions in parallel, which is also denoted as horizontal parallelism [216], a processor needs to contain several independent execution units, either identical or different. When the processor front-end is able to decode and issue multiple instructions per cycle, the architecture is denoted as superscalar. Superscalar architectures aim at getting more work done per cycle, expressed in terms of instructions per cycle (IPC). There are two main variants of superscalar architectures. In very long instruction word (VLIW) architectures, multiple instructions are scheduled together into a single instruction word by the compiler, whereas in dynamic superscalar architectures, the scheduling of independent instructions to parallel execution units is performed at runtime in hardware. In order to increase the usable ILP, speculation is frequently employed, that is, instructions after a branch can be completely executed before the outcome of the branch is fully validated. If it turns out that the speculatively executed instructions should not have been executed, their outcomes are rolled back. Also, many dynamic superscalar architectures can reorder the sequence of instructions, delaying those instructions that need to wait for dependencies and forwarding those instructions that can already start independently of all currently executing ones. In order to maintain the correct program behavior in such out-of-order architectures, the results of all instructions go through reorder buffers in order to be committed in the correct order. An orthogonal way to execute instructions partially in parallel is pipelining inside the processor. With pipelining, the process of executing a single instruction is split into several stages which overlap; for example, while one operation is performed in the ALU, the next instruction is decoded and a third instruction can already be fetched from memory. Such pipelining is also denoted as a form of vertical parallelism. Since each stage can be completed faster than the entire sequence of stages required to execute an instruction, clock cycles are shorter, which increases throughput. However, pipelining does not increase the amount of work done per cycle, expressed as IPC.


On the contrary, in order to retain the same IPC as without pipelining, firstly, the sequence of instructions needs to be known soon enough to fetch and decode the right next instructions. In practice, branch prediction is used to make educated guesses about the outcome of control flow instructions, and can achieve prediction accuracies between 80% and more than 99% depending on its design and the benchmark characteristics. Secondly, when the pipelined instructions are not independent, the pipeline has to deal with data dependencies, which can, however, for example be tackled by data forwarding from one pipeline stage, where an operand is computed, to another stage, where the same operand is used by a subsequent instruction. Thus, pipelining does not necessarily depend on the availability of true ILP, but is easier with it.

• Data-level parallelism (DLP) denotes a form of parallelism where inside a program the same operation can be applied independently to different data elements. The primary source of such DLP are loops without dependencies between subsequent iterations, so that each instruction of the loop can be performed simultaneously on the data of several different loop iterations (the loop in the code sketch after this list has this property). Parallelism between different loop iterations is also denoted as loop-level parallelism (LLP). As indicated earlier, through loop unrolling LLP can also be used to generate ILP and in that case is not dependent on fully independent loop iterations. The general concept of exploiting DLP is denoted as single instruction, multiple data (SIMD) [82]. In contrast to the exploitation of ILP, SIMD concepts not only promise higher performance, but also better efficiency, because only a single instruction needs to be fetched and decoded in order to trigger several operations. Conceptually, DLP can often be automatically inferred; however, efficient programs making use of DLP may still require considerable programmer interaction. This issue is also a subject of Chapter 5. Vector computers are the classical computer architecture to exploit DLP. Vector computers hold many data elements that can be computed in parallel together in vector registers. A single vector instruction causes the same computation to be performed on all elements of those vectors, or on a subset of the elements specified through a bitmask. Computation on the different elements is performed in a pipelined manner, here employing pipelining to hide the latency of an individual operation. In addition, several vector lanes can work in parallel on different elements of the same vector. Important features of those vector computers’ instruction set architectures are different methods for data transfers between memory and registers and the mentioned concept of masking operations for some vector elements. Such a vector architecture is used in Chapter 4 and targeted in Chapter 5. A more recent way to exploit DLP is presented by SIMD instruction set extensions, sometimes marketed as multimedia instructions, that are integrated into many modern general-purpose (see Subsection 2.3.1) CPUs. In contrast to classical vector processors, their vector registers typically contain fewer elements and their early incarnations contain a much less expressive instruction set.


On the other hand, a typical feature of SIMD instruction set extensions is the ability to treat the data in SIMD registers as differently sized types, that is, either as a few elements of a large data type, e.g. 64-bit integer, or as more elements of a smaller type, e.g. 16-bit integer. GPUs form a further group of instruction-programmable parallel architectures that makes heavy use of DLP. Instead of using large vector registers, GPUs are predominantly organized into groups of processing cores, which all have a distinct register file, but execute the same instruction, thus sharing the involved fetch and decode overheads. This principle is also referred to as single instruction, multiple threads (SIMT). GPUs have their roots in fixed-function graphics accelerators, which over time first gained programmability for a more flexible graphics pipeline, and ended up being increasingly used for and designed for many other tasks with similar parallelism. In this thesis, GPUs are not investigated in the practical parts, but are considered in Sections 2.2.3, 2.3.1 and 3.1 as a reference for a commercially successful accelerator architecture with a mix of programmability and specialization.

• Thread-level parallelism (TLP) is a form of parallelism where multiple instructions from different program contexts can be executed in parallel, an execution mode denoted as multiple instruction, multiple data (MIMD) [82]. Following [118], we distinguish two different types of TLP: in request-level parallelism, different threads can work relatively independently on separate requests. In parallel processing, several threads closely collaborate on a single computational task. This is often achieved by transforming DLP into TLP and distributing parallel work items to different threads. Considering only single-chip processors, two architectural approaches to exploit TLP can be distinguished. In multicore processors, the entire structure of the CPU core, including control unit, registers and execution units like the ALU, is replicated once or more times, so each core can work on a different thread independently. Often, parts of the cache hierarchy are shared among several cores in order to make collaboration between threads on different cores more efficient and in order to save some area. Systems where several identical processors or processor cores are connected through a common interface to a shared main memory are also denoted as symmetric multiprocessor (SMP) systems. Other designs don’t replicate entire CPU cores, but rather just some parts. In a widespread approach, the central register file and parts of the control unit are replicated to support the parallel or quasi-parallel execution of different threads, whereas the execution units are shared among these threads. We denote this approach as hardware multithreading. We choose this terminology to differentiate it from the multithreading term in the software context, where multithreading can be realized by context switches on a single processing core. Comparing hardware multithreading to multicore processors, the shared resources are a way to save area; however, they may also turn out to be a performance bottleneck, so the overall impact on area efficiency depends on architectural details and workloads.


Table 2.1: Symbols used by the instruction set for our example.

Symbol        Meaning
Ra, Rb, ...   General-purpose registers.
R0            Register with fixed value 0.
<var>         Location of a variable in memory.
$<imm>        Immediate operand that is encoded directly into an instruction.

In hardware multithreading, the shared resources can be assigned to a different thread each clock cycle in a round-robin way, skipping stalled threads (fine-grained multithreading), or only when the current thread incurs a costly stall like a cache miss (coarse-grained multithreading), or even dynamically within the same cycle (simultaneous multithreading (SMT)). In current processors, either fine-grained multithreading or SMT can be observed, whereas GPU architectures rely only on fine-grained multithreading [118].
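The following C sketch ties the ILP and DLP items together; the function name, the unroll factor and the operation are arbitrary choices for illustration. The unrolled body exposes independent instructions to a superscalar core, and because no iteration depends on a previous one, the same loop is also a direct candidate for SIMD or vector execution.

    /* Illustrative sketch only: names and the unroll factor are arbitrary. */
    void scale_add(float *dst, const float *a, const float *b, int n) {
        int i = 0;
        /* Unrolled by a factor of 4: the four statements in the body are
         * independent of each other, so a superscalar core can overlap them (ILP). */
        for (; i + 3 < n; i += 4) {
            dst[i]     = a[i]     * 2.0f + b[i];
            dst[i + 1] = a[i + 1] * 2.0f + b[i + 1];
            dst[i + 2] = a[i + 2] * 2.0f + b[i + 2];
            dst[i + 3] = a[i + 3] * 2.0f + b[i + 3];
        }
        /* Remainder loop: also free of loop-carried dependencies, so each
         * iteration could be mapped to one SIMD or vector lane (DLP). */
        for (; i < n; i++)
            dst[i] = a[i] * 2.0f + b[i];
    }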

The effectiveness of all these concepts of parallelism depends on the fraction p of a task that can actually make use of this parallelism and on the remaining fraction 1 − p that for some reason remains sequential. Following Amdahl's law, when the parallel fraction p of the task achieves a parallel speedup SUmax, the speedup SU of the entire task can be computed as SU = 1 / ((1 − p) + p / SUmax). This means that in practice, both the maximal available parallelism SUmax in the parallel parts of the task and the fraction of sequential parts of a task put hard limits on the achievable speedups.

In this subsection, fundamental concepts of instruction-programmable processors have been introduced from a performance-centric perspective. In this spirit, current GPPs combine essentially all these concepts and more, in order to deliver good performance for any type of application. However, the means to mitigate performance issues of the Von-Neumann-based processor architecture don't necessarily improve efficiency; most of them leave the general overhead of instruction fetch and decode unaffected and add additional control logic that requires power and space. In this sense, the customizations of the briefly mentioned DSP and microcontroller classes are to a large degree a reduction of features in order to improve efficiency. Microcontrollers are very simple and efficient designs, because they intentionally neglect arithmetic performance and data throughput, whereas DSPs with their focus on regular control flow can, for example, with zero-overhead loops reduce the number of executed instructions and at the same time reduce the need for sophisticated branch prediction. We illustrate computing with instruction-programmable processors with an example, before discussing, in the following subsection, a computing approach that takes this customization much further.
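As a brief numerical illustration of Amdahl's law as stated above (the numbers are chosen purely for illustration), consider a task with a parallel fraction p = 0.9 that achieves a parallel speedup of SUmax = 8:

\[ SU = \frac{1}{(1 - 0.9) + \frac{0.9}{8}} = \frac{1}{0.2125} \approx 4.7 \]

Even for SUmax approaching infinity, the overall speedup of this task cannot exceed 1 / (1 − p) = 10.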


Table 2.2: Selection of instructions for a RISC-like instruction set for our example.

  Instruction          Meaning
  LD (var)(Ra), Rb     Load into Rb a value from the memory location of variable var,
                       modified by an offset value in Ra.
  ST Ra, (var)(Rb)     Store the value of Ra into the memory location of variable var,
                       modified by an offset value in Rb.
  ADD Ra, $imm, Rb     Add imm to the value from Ra and put the result into Rb.
  ADD Ra, Rb, Rc       Add the value from Rb to the value from Ra and put the result into Rc.
  BLE Ra, Rb, label    Branch to label if the value in Ra is less than or equal to the value in Rb.
  BLT Ra, Rb, label    Branch to label if the value in Ra is less than the value in Rb.

Introducing an Example

With an example, we want to illustrate some aspects of computing with instruction-programmable processors. We focus on some simple instruction set properties and on what we call instruction overheads. We extend this example in the following subsections to alternative approaches. To outline a processor architecture for this example, we first introduce, based on a few basic symbols (Table 2.1), some instructions for a rudimentary instruction set in Table 2.2. The separation of memory instructions (load and store) and arithmetic instructions (and logic instructions not included here) is typical for a RISC architecture. The format of the memory instructions with a variable name and an offset register follows one of the addressing modes from the simple target model introduced in [15] and also shows up in the base instruction set of the instruction-programmable coprocessor that we target in Chapters 4 and 5. Eventually, during code generation, the compiler needs to replace the variable names by actual memory addresses, either in the static data area or in the stack data area. As branch instructions, we chose a pair of instructions that allows us to write the following example in the most comprehensible way. With this instruction set, it is possible to execute a brief code snippet that is a simplified part of the real code used later in Chapters 4 and 5. Listing 2.1 shows the code snippet that sums up an input array in and stores the running sums into an output array sum. Listing 2.2 shows an instruction sequence that can execute this code snippet with the instruction set from Table 2.2. In order to keep the example simple, we refrain from specifying different data word formats and assume a single, 32-bit (4-byte) integer data type. An important optimization that is included in this instruction sequence is the reuse of the sum[x] value from the previous iteration as the new value sum[x-1] in R4. By transforming the loop bounds, it would also be possible to remove one of the increment instructions (Lines 7, 8) inside the loop body, but we omitted this for better readability of the example.


1 for(int x = 1; x < width; x++)
2     sum[x] = sum[x-1] + in[x];

Listing 2.1: Code snippet that computes running sums over the input array in and stores them into the array sum.

1  ADD R0, $1, R1       ## init loop counter x to 1
2  LD (width)(R0), R2   ## load loop limit
3  BLE R2, R1, exit     ## branch to exit if width <= 1
4  ADD R0, $0, R3       ## init array offset to 0
5  LD (sum)(R3), R4     ## load sum[0] into R4
6  loop:
7  ADD R3, $4, R3       ## increment array offset (4 byte word)
8  ADD R1, $1, R1       ## increment loop counter x
9  LD (in)(R3), R5      ## load in[x]
10 ADD R4, R5, R4       ## compute new sum[x] in R4
11 ST R4, (sum)(R3)     ## store R4 to sum[x]
12 BLT R1, R2, loop     ## branch to loop entry if x < width
13 exit:

Listing 2.2: Assembly code program for the code snippet in Listing 2.1. The actual work of the code snippet is performed in Lines 9-11.

When investigating this code sequence, one can argue that the instructions in Lines 9 to 11 perform the actual work of loading a new input value, adding it to the running sum, and storing it to the result location, whereas the other instructions are a form of overhead that is required to achieve this functionality. Other instruction sets aim to reduce such overheads. For example, with a typical DSP ISA, one would replace the instructions in Lines 8 and 12 by instructions before the loop body that set up a zero-overhead loop. The ISA would also support special addressing modes making the instruction in Line 7 superfluous. A typical CISC-like feature, present for example in the x86 ISA, are arithmetic instructions with memory operands. Thus, it could for example replace the two instructions from Lines 9 and 10 by a single one. Like DSP ISAs, the x86 ISA also has addressing modes that allow removing the instruction in Line 7. With such approaches, fewer instructions need to be fetched from program memory. However, inside x86 processors, such expressive instructions often get split again into several so-called micro-operations and thus internally require similar resources and time as the illustrated instruction sequence.

As a platform to further illustrate the execution of the loop body from this instruction sequence on an instruction-programmable processor, we use a classic five-stage RISC pipeline as outlined in [118], with the following pipeline stages.

Instruction Fetch (IF) This stage fetches the instruction that is currently indicated by the program counter from memory. It also increments the program counter.

Instruction Decode (ID) This stage decodes the fetched instruction from the previous stage, that is it identifies the operation and the operands with their respective registers or immediate values. Additionally, it fetches the input operands from registers. When operands are not valid, it either sets up data forwarding where possible or introduces pipeline stalls. Also, the comparison operation for branch instructions takes place in this stage, and if the branch is taken, the program counter is updated.


Table 2.3: Execution of two loop iterations from Listing 2.2 in a five-stage RISC pipeline. This schedule contains three data hazards (marked with an asterisk). With data-forwarding, the value for R3 computed in cycle 3 is already available in the EX stage in cycle 5 (it could even be made available for cycle 4), even though it has not been written back to the register file before. On the other hand, the result from the load instruction (#9) only arrives during cycle 6 and cannot be made available to the EX stage of the following instruction in the same cycle, therefore requiring a pipeline stall. The dependency of instruction #11 can again be resolved by data-forwarding.

                                          Clock cycle
 #  Instruction         1      2      3      4      5      6      7      8      9      10     11
 7  ADD R3, $4, R3      IF     ID     EX     MEM    WB
 8  ADD R1, $1, R1             IF     ID     EX     MEM    WB
 9  LD (in)(R3), R5                   IF     ID     EX*    MEM    WB
10  ADD R4, R5, R4                           IF     ID     stall* EX     MEM    WB
11  ST R4, (sum)(R3)                                IF     stall  ID     EX     MEM*   WB
12  BLT R1, R2, loop                                       stall  IF     ID     EX     MEM    WB
 7  ADD R3, $4, R3                                                IF     ID     EX     MEM
 8  ADD R1, $1, R1                                                       IF     ID     EX
 9  LD (in)(R3), R5                                                             IF     ID
10  ADD R4, R5, R4                                                                     IF


Execute (EX) In this stage, an ALU performs either the arithmetic operation specified by the instruction (in our case only ADD), or it computes the effective address for a memory instruction by adding the offset to a base address.

Memory access (MEM) This stage uses the effective address to either read a word from memory, or to write the value from the input register to memory.

Write-back (WB) In this stage, results from an arithmetic instruction or from a load instruction are written to the specified target register.

Table 2.3 illustrates how the loop body from Listing 2.2 can be executed in this five-stage pipeline. Due to a so-called data hazard, this schedule requires data-forwarding of the value for R3 from instruction #7 to instruction #9 and of the value for R4 from instruction #10 to instruction #11. The dependency between instructions #9 and #10 with regard to R5 cannot be fully resolved by data-forwarding and therefore requires a pipeline stall. We assume no resource constraints between memory operations and instruction fetches, which would otherwise lead to structural hazards, and furthermore assume a correct branch prediction to resolve the control hazard between instructions #12 and #7 in the second iteration. In this example, due to the stall, the second loop iteration starts seven cycles after the first iteration, leading to an IPC value of 6/7. In practice, further stalls would likely occur when load operations don't complete in a single cycle of the memory access stage.

This schedule illustrates how the pipeline of an instruction-programmable processor works on several instructions during the same cycle. However, the pipeline does not help to start or complete more than one instruction per cycle, but rather struggles to maintain this rate. Out-of-order execution can greatly help to keep pipeline throughput high, but requires additional area and energy for components that don't participate in the actual computation specified by the program code and thus are a form of overhead. When differentiating within the depicted five-stage pipeline between the actual computation specified by the code snippet (Listing 2.1) and overheads, one can argue that only the execute and memory access stages perform the actual work and the other stages are overhead.
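The stall visible in Table 2.3 follows a simple condition. The following sketch is our own simplified model (not taken from the referenced literature) of the classic load-use hazard check: forwarding covers dependencies between arithmetic instructions, but a load directly followed by a consumer of its destination register forces one stall cycle.

#include <stdio.h>

typedef struct { int is_load; int dest; int src1; int src2; } instr_t;

/* Returns 1 if the instruction currently in ID must stall: the instruction in
   EX is a load whose result is needed, and that result only becomes available
   after the MEM stage, too late for forwarding into the next EX cycle. */
static int needs_stall(instr_t in_ex, instr_t in_id) {
    return in_ex.is_load && in_ex.dest != 0 &&
           (in_ex.dest == in_id.src1 || in_ex.dest == in_id.src2);
}

int main(void) {
    instr_t ld_in_x = { 1, 5, 3, 0 };   /* #9:  LD (in)(R3), R5  (writes R5) */
    instr_t add_sum = { 0, 4, 4, 5 };   /* #10: ADD R4, R5, R4   (reads R5)  */
    printf("stall between #9 and #10: %d\n", needs_stall(ld_in_x, add_sum));
    return 0;
}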

2.1.3 Computing in Circuits

The discussed instruction-programmable processors are all implemented with digital integrated circuits. However, when the task of a compute system is fixed at design time, it is possible to design a digital circuit that directly executes this task, without instruction-programmability and with a different structure than that of a processor. Such specialized compute circuits are often denoted as hardwired logic, or as application specific integrated circuits (ASICs). In the circuit design community, the term ASIC is associated with a specific computer-aided design method for such a circuit, which involves the use of standard cells, and is contrasted to custom circuit design, which relies more on manual layout and optimization [55, 54, 94]. Therefore we stick to the terms compute circuit and hardwired logic here. Throughout this section, we present compute circuits directly in contrast to the previously discussed concepts of instruction-programmable processors, in particular how they can save area or improve performance at much reduced or even eliminated flexibility.

Considering first the functional units that actually perform the computations, for hardwired logic, the departure from instructions removes the fixation to the specific operand sizes and to the specific set of operations specified by the ISA. A functional unit may thus be optimized exactly for the operand sizes required by the compute task, and the supported operations can be just a single one or a subset of the operations provided by general ALUs. A functional unit in hardwired logic can also execute required operations that are not included in a standardized ISA and therefore would, in a CPU, have to be performed through a series of other instructions. Contrasting hardwired logic to instruction-programmable processors, these techniques can be summarized as functional unit customization and can both save area and improve performance. Note that a similar technique is used inside the realm of instruction-programmable processors to add specialized functional units to so-called application-specific instruction set processors (ASIPs) [96, 143].

The control information that otherwise would have been encoded into instructions is represented by several different parts of an application specific compute circuit. Fixed wiring between one functional unit that produces a data element and another functional unit that uses this element in the next step can replace the operand encoding and at the same time obliterate the need for a central register file. Registers can instead be distributed throughout the circuit, for example as output registers after each functional unit. Wires can also be used to transmit synchronization signals like valid, acknowledge or reset. This way, direct wiring can also replace move instructions between different registers. Also arithmetic operations like shifts can be entirely replaced by wiring, by connecting specific bits of one source to different, shifted bits of the destination. Considering the operators encoded in instructions, when a functional unit only has to perform a single type of operation, the corresponding operation encoding is not needed at all. However, some control for selecting between different operations or different inputs is still required in many circuits. This control is typically provided by state machines (more precisely, but in this context less commonly, denoted as finite-state machines (FSMs)), described by a finite set of states and transitions between them. The compute circuit can for example select one of several inputs to a functional unit with a multiplexer whose control input depends on the current state. To summarize these two paragraphs, state machines, multiplexers, direct wiring and control signals enable computing without instructions. Some quantitative effects on efficiency will be presented in Subsection 2.2.3.

Without instructions, most typical stages of a processor pipeline, like instruction fetch, decode, operand fetch and write back, are not relevant for compute circuits. For operations with high latencies, internal pipelining of functional units can still be valuable. Beyond that, pipelining is still a tremendously important concept for hardwired logic as a means to exploit additional vertical parallelism. The first target here is loop-level parallelism (LLP), by providing for each operation of a loop body a separate functional unit. These functional units are connected through wires representing the operand flow, but decoupled by registers into pipeline stages. For example, while the first stage performs the first operation of the n-th iteration of the loop, the second functional unit can work on the second operation of the (n − 1)-th loop iteration. Additional instruction-level parallelism (ILP) can be exploited by different parallel functional units within one stage, whereas additional data-level parallelism (DLP) often is addressed with entire parallel compute pipelines. In between the two differently coupled forms of thread-level parallelism (TLP) introduced in Subsection 2.1.2 following [118], we denote as task-level parallelism the case where a sequence of different tasks needs to be performed on several subsequent data elements or requests, a denotation used e.g. in [211]. In circuits, this type of parallelism can again be tackled with pipelining, here with a pipeline stage performing an entire task rather than an individual operation. Task pipelining is conceptually not limited to circuits [97, 27], but with customized FIFO buffers, multiplexing and demultiplexing, data width conversion or deterministic reordering, circuits can support many patterns much more efficiently than, for example, software realizations.
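To illustrate the state-machine-driven input selection described above in software terms, the following small C model (our own sketch, not part of the thesis' implementations) uses a two-state FSM to drive the select input of a multiplexer: in an initial state a register is loaded with a start value, afterwards it is fed from an adder's output.

#include <stdio.h>

/* Two control states: load the initial value once, then keep accumulating. */
typedef enum { S_INIT, S_RUN } state_t;

static int mux2(int sel, int in0, int in1) { return sel ? in1 : in0; }

int main(void) {
    int init_value = 10;          /* value selected in state S_INIT          */
    int inputs[]   = {1, 2, 3, 4};
    state_t state  = S_INIT;
    int reg = 0;                  /* register behind the multiplexer         */

    for (int cycle = 0; cycle < 5; cycle++) {
        /* output of a functional unit (an adder) fed back into the mux */
        int adder_out = reg + (cycle > 0 ? inputs[cycle - 1] : 0);
        /* the current state drives the multiplexer's select input */
        reg = mux2(state == S_RUN, init_value, adder_out);
        state = S_RUN;            /* state transition after the first cycle  */
        printf("cycle %d: reg = %d\n", cycle, reg);
    }
    return 0;
}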
Unlike the centralized control unit of a CPU, control for all these forms of parallelism can naturally be distributed over different pipelines and pipeline stages of a compute circuit, reducing complexity and overheads. The pipelining concepts in circuits are denoted as streaming, emphasizing the flow of data through the functional units. A stream of data typically originates from memory or a device input, passes through several compute stages and is directed again to memory or to a device output. Thus, with intermediate buffers, streaming also is a natural form to decouple memory accesses from computation.

Beyond decoupled access to main memory, hardwired logic also has full control to explicitly manage any local memory. The registers and streaming buffers mentioned in the previous paragraphs are part of this local memory, but also addressable random access memory can be integrated into or attached to compute circuits. Optimizations over transparent caches can be achieved for example by adaptation to known element sizes and data set sizes. Also, whenever data usage patterns are known, explicit data movement to and from local memory can be used to prefetch data ahead of its usage and avoid the overheads of the speculative prefetching that some transparent caches perform.

To summarize, compute circuits can achieve more performance and more efficiency than CPUs, through specialization and by removing the overheads involved with the execution of instructions. However, their fixed functionality makes them unsuitable for usage scenarios where flexibility to execute different tasks is required. Also, since their fixed costs can only be distributed among devices for the same task, economically they require sufficient market sizes to pay off. Finally, time-to-market of specialized compute circuits is high, since it involves a lengthy process including hardware design and mask generation. In the following subsection, we present a technology that addresses these challenges by realizing compute circuits in a programmable way. Before that, we extend the example from the previous subsection to a circuit design.

Extending the Example to Circuits

In order to highlight some of the differences between computing in circuits and computing with processors, we extend the example from Subsection 2.1.2 to a custom circuit with the functionality defined in Listing 2.1. The circuit is fixed to this specific functionality, but illustrates the performance and efficiency potential of computing in circuits. Figure 2.1 illustrates a possible design for such a circuit, with gray boxes summarizing the functionality of different parts of the circuit.

In the upper left corner of the figure, a counter is incremented in every cycle and can be reset (RST) to the initial value 1. In the lower half of the figure, memory units illustrate load and store operations from a common random access memory as specified by the listing. They compute memory addresses based on the counter and on the base address of their respective arrays and feed them into ports to the actual memory. With FIFO buffers at the memory ports we illustrate that memory accesses may involve some latency. The values read by the load unit are fed to the upper right half of the figure, where an adder accumulates the inputs into a running sum that in turn is forwarded to the store unit. In the upper center of the figure, an enable signal (EN) is generated that ensures that the memory controllers only access data within the loop bounds. Beyond the scope of the figure, the base addresses for in and sum, as well as the values for sum[0] and width, need to be loaded prior to the actual loop execution in the circuit. For this purpose, additional control logic, most likely in the form of a state machine, and additional wiring is required.


Figure 2.1: Illustration of a circuit that executes the code snippet from Listing 2.1. Blue arrows indicate wires for single bits, black arrows wires for entire data words. Circles indicate operators. Rectangular boxes contain input words, where sum and in are the base addresses of the respective arrays, whereas width and sum[0] contain the actual variable values. Trapezoid boxes indicate multiplexers to select one of two possible inputs. Diamond shaped patterns depict delay elements that may be realized by FIFO buffers. Access to a common memory is illustrated by a load port that contains an input FIFO for addresses and an output FIFO for loaded values, and a store port that has two input FIFOs, for addresses and corresponding values. The enable signal EN ensures that no memory access takes place outside the loop bounds. In the figure, the signal is split for clear arrangement.

This example gives an impression how pipelining in circuits goes beyond that of a processor pipeline and can exploit loop-level parallelism (LLP). For each individual operation of the loop, the depicted circuit contains dedicated hardware components that each can process one element per cycle. Thus, the circuit can achieve a throughput of an entire loop iteration per cycle. To this end, different components may need to be decoupled by buffers to compensate for latencies in other parts of the pipeline. In our example, the load unit may have a considerable latency. In the example with the processor pipeline introduced in Subsection 2.1.2, such a latency would cause additional pipeline stalls and significantly reduce performance. In the circuit pipeline in Figure 2.1, the delay elements denoted with T can fully compensate for this latency, as long as both memory ports can sustain a throughput of one element per cycle.

The performance potential through exploitation of LLP in this circuit comes at a certain area cost. For example, the circuit contains four distinct adders. In contrast, in the processor pipeline, the functionality of all these adders is performed by a single adder (or more generally ALU) in different subsequent cycles. On the other hand, as a driver for area and energy efficiency, the circuit requires no components to fetch or decode instructions and does not use a central register file, but rather forwards data directly to the components where it is used. The further advantage of customizing functional units to specific operators and data types is not well visible in our example, since we earlier decided that all adders work on the same 32-bit integer data type. However, the shift unit in our circuit illustration, which transforms counter values to array offsets in bytes, was just included for comprehensibility. In practice, it could be made obsolete by clever wiring.

Depending on the system in which this circuit is to execute the given functionality, the input and output data could be handled very differently from the presented approach with a common random-access memory. For example, input data might be received directly from a sensor, and results might be forwarded to the next circuit or to an actuator of a mechanical device. However, within the practical parts of this thesis, computing is always performed on data in a memory, just like in this example.
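As a rough software illustration of the throughput argument above, the following cycle-based C sketch models the circuit from Figure 2.1 on an abstract level (our own simplification; the latency value, array size and initialization are made up for the example): one load is issued per cycle, returns after a fixed latency through a delay line, and is then consumed by the running-sum adder and the store unit, so that after the initial latency one loop iteration completes per cycle.

#include <stdio.h>

#define WIDTH        8   /* number of array elements (made up for the sketch)   */
#define LOAD_LATENCY 3   /* cycles modelled by the delay elements T in Fig. 2.1  */

int main(void) {
    int in[WIDTH]  = {0, 1, 2, 3, 4, 5, 6, 7};
    int sum[WIDTH] = {10, 0, 0, 0, 0, 0, 0, 0};   /* sum[0] preloaded            */

    /* delay line modelling the load port: issued requests return after
       LOAD_LATENCY cycles, in order                                             */
    int pending_val[LOAD_LATENCY], pending_idx[LOAD_LATENCY];
    int valid[LOAD_LATENCY] = {0};

    int x = 1;               /* counter, reset value 1                           */
    int running = sum[0];    /* running-sum register                             */
    int cycle = 0, stored = 0;

    while (stored < WIDTH - 1) {
        int slot = cycle % LOAD_LATENCY;

        /* a load issued LOAD_LATENCY cycles ago returns in this cycle and is
           consumed by the adder and the store unit                              */
        if (valid[slot]) {
            running += pending_val[slot];
            sum[pending_idx[slot]] = running;
            valid[slot] = 0;
            stored++;
        }
        /* the counter/load unit issues one new load per cycle while the
           enable condition (x < WIDTH) holds: one loop iteration per cycle      */
        if (x < WIDTH) {
            pending_val[slot] = in[x];
            pending_idx[slot] = x;
            valid[slot] = 1;
            x++;
        }
        cycle++;
    }
    printf("finished after %d cycles\n", cycle);
    for (int i = 0; i < WIDTH; i++)
        printf("sum[%d] = %d\n", i, sum[i]);
    return 0;
}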

2.1.4 Field-Programmable Gate Arrays

Field programmable gate arrays (FPGAs) are devices that implement compute circuits, but have no fixed functionality. Instead, they change their circuit behavior according to a number of configuration bits, which can be programmed and reprogrammed, as the name implies, in the field, that is by the user and after fabrication. The structure is denominated as a gate array and describes an array of logic elements with a configurable interconnect. FPGAs are one architecture in a group of several different programmable logic devices (PLDs). Architecturally, FPGAs differ from other PLDs by the utilization of lookup tables (LUTs) and are conceptually more suitable for complex and large designs than other PLDs. Currently, all of the largest PLDs are FPGAs.

The basic logic functionality in FPGAs is provided by LUTs, which can reproduce the output of any boolean logic function with n inputs by configuring a small random access memory (RAM) with the desired result bit for each combination of input bits. Typical input widths of FPGA LUTs are 4 to 6. In order to allow for sequential logic, the output of a LUT can go to a flip-flop (FF); a configurable multiplexer selects either the buffered or the unbuffered output. Together, these three components form the basic logic elements (BLEs) of an FPGA. In the academic view of FPGAs, a number of such BLEs can be grouped into clusters, internally connected by wires and with additional multiplexers to configure the BLEs' inputs as a combination of a cluster's external inputs and the outputs of other BLEs inside the cluster, and to configure the cluster's outputs [14].


The clusters are interconnected through routing resources in which configurable switch boxes determine the connections between different wires. The clusters form a first hierarchical structure designed to optimize area and delay of the routing network, because signals that remain inside a cluster don't need to go through the more costly routing network. The routing architecture can be further enhanced by providing wires of different length, with some spanning more than one cluster between two switch boxes [33]. Nevertheless, just like in ASICs [94], routing consumes most of the area of an FPGA chip, and for a number of circuits implemented on FPGAs it still turns out to be the critical resource.

In commercial FPGAs, both design and naming of basic logic elements and clusters diverge a bit. The two FPGA vendors with the highest market share are Xilinx and Altera. In Xilinx Virtex-6 [275] and Virtex-7 [277] FPGAs, each LUT can be configured as one 6-input LUT or two 5-input LUTs. Four of those LUTs are combined with eight flip-flops, the required multiplexers and additional arithmetic logic into a slice, and two of those slices form a configurable logic block (CLB) as the equivalent of a cluster. In Altera's Stratix V [20] FPGAs, an adaptive LUT with 8 inputs can implement all 6-input logic functions, selected 7-input logic functions, or be fractured into two smaller independent LUTs. One such adaptive LUT is combined with two embedded adders and four registers into an adaptive logic module (ALM), of which 10 are clustered into one logic array block (LAB).

Beyond the basic logic functionality through LUTs, several more specialized elements have been added to FPGAs over time. Parts of the first extra design element have been mentioned along with the vendor specific logic elements in the previous paragraph: In order to enable the generation of efficient adders on FPGAs, carry-chains and supporting logic have been added, which enable direct propagation of carry signals to neighboring logic elements, bypassing the global routing network with its comparably high area usage and delay. Additionally, dedicated DSP blocks have been introduced, which provide fixed function logic, typically for multiplication or fused multiply-accumulate (MAC) operations, often with configurable bitwidth. The functionality they provide can also be implemented with basic logic elements, but with much lower clock frequency and higher area consumption. Another important resource of all current FPGAs is local memory, which is distributed within the array structure as block RAMs (BRAMs), each of which can typically store several kBits of data with configurable address depths and data widths. Together with some logic resources, several BRAMs can be combined to form larger memory blocks, supporting both deeper and wider memory. For local RAMs requiring even fewer memory bits or less address depth, often the configuration storage of LUTs can also be used as local memory, denoted as LUT RAM. Last but not least, FPGAs need to communicate with the outside world or at least other chips in a system. For this purpose, they all have a number of IO pins, which are connected to the routing network as IO pads and often denoted as general-purpose input-output pins (GPIO). Often a significant number of IO pads is available to support parallel interfaces like for external DDR memory. Many modern FPGAs also contain a smaller number of serial transceivers, internally integrated as parallel interfaces but physically transferring data serially at much higher frequencies. Those transceivers are used to communicate with serial interfaces like USB, PCIe or high-speed network links [223].

With this set of logic, interconnect and memory resources, and given a sufficient amount of resources, FPGAs can implement the functionality of arbitrary compute circuits, from fully customized ones to instruction-programmable processors. The structure and reconfigurability of FPGA resources form a level of indirection and an overhead compared to circuits directly implemented in hardware, but are a prerequisite for their programmability. The process of programming an FPGA has conceptually more resemblance to circuit design than to software programming.

The classical design approach to implement a desired functionality on an FPGA is to specify the desired circuit design in a hardware description language (HDL) like VHDL or Verilog, which is then typically translated by a vendor-specific tool flow to a bitstream. HDL designs for FPGAs are typically specified on the register-transfer level (RTL), that is they specify how digital signals buffered in registers change their state between two clock cycles. The RTL design combines structural information, about which architecture components are there and how they are connected by signals, with behavioral information, about how and when the components produce specific output signals. A bitstream, which is generated from an HDL design, contains all configuration bits for a specific FPGA, including configurations for LUTs, switch boxes and multiplexers, that configure the FPGA to behave like the specified design. A bitstream may also include initial values for BRAMs.

The translation process from HDL to bitstream can be subdivided into several steps, which we only outline on an abstract level here. In the first step, the HDL design with its behavioral descriptions is synthesized into a purely structural representation denoted as a netlist. The netlist describes the circuit structure in terms of gates, which do not yet need to correspond to the resource types that are actually available on an FPGA. In the technology mapping step, the appropriate resource types, excluding routing resources, are selected. In the placement phase, concrete instances of each resource are selected for each occurrence of a resource type in the technology mapped netlist. In the routing phase, the placed resources are connected through the wires and switch boxes of the routing network. Finding the best placement and routing are hard problems with huge search spaces, so heuristics, often with randomization, are employed, and runtimes of these steps are high. Afterwards, the timing analysis determines the longest delay, through logic resources and routing, on any connection between a pair of flip-flops. If the timing analysis meets all previously provided constraints, it is followed by the final bitstream generation. Otherwise, one or more of the previous steps can be repeated with different parameters. Even though strictly speaking synthesis denotes only the netlist generation [86], the term is frequently used for the entire translation process from HDL specification to bitstream generation. We follow the latter, broader designation throughout this thesis.
For the FPGA manufacturers, their closed source synthesis flows are an important part of their platform ecosystems; in the academic world, however, an open source synthesis flow [168] was developed to enable research on aspects of synthesis as well as of FPGA design. On top of this basic HDL-based synthesis flow, different high-level synthesis (HLS) approaches are building up to offer more abstract, productive and software-like design flows. Such concepts are presented in more detail in Subsection 3.1.3.

To summarize, FPGAs allow for a circuit-like computing paradigm instead of the instruction-based computing of processors. Programmability allows FPGAs to mimic very different circuits, but this comes at a hardware overhead, which we detail further in Subsection 2.2.4, and with a programming model that requires structural information and an extensive synthesis flow.

Mapping Elements from the Example to FPGA Logic

There are several ways to achieve the functionality of the example code snippet from Listing 2.1 with an FPGA. Following the motivation of this subsection, one approach is to map the circuit from Figure 2.1 onto the FPGA and thus profit from the performance potential and efficiency of computing in circuits. A more indirect alternative approach is to first implement the five-stage processor pipeline introduced in the example of Subsection 2.1.2 on the FPGA and subsequently execute the assembly code on this processor. Advanced variants of both approaches are actually used and analyzed in Chapter 4. In the example at this point, we illustrate the flexibility of FPGAs by mapping three components of the circuit in Figure 2.1 to FPGA components.

In Subsection 2.1.3, we outlined that four distinct instances of an adder are used in the circuit in order to achieve a throughput of an entire loop iteration per cycle. On an FPGA, these adders can be realized by DSP blocks, which can not only directly compute the sum or product of two input words consisting of several bits, but also typically contain functionality as required by the counter and running sum units from Figure 2.1. However, the adders can also be built from LUTs as the most fundamental resource of FPGAs. In Table 2.4, we illustrate the truth table for a 1-bit full adder that can be used as a building block for the 32-bit adders [111] required in our example. This truth table can directly be used to program two 3-input LUTs, one for each output bit. The special carry chains introduced earlier can be used to connect the carry bits to neighboring LUTs with low overhead. However, this first approach is not efficient, as current FPGAs have larger LUTs. In order to adapt to this, the building block for the adder can for example be modified to a 2-bit full adder, which can exactly be implemented with two 5-input LUTs, one of the configuration variants of current Xilinx LUTs. For more details on adders implemented in FPGA logic, we refer to [67].

Another component of the circuit in Figure 2.1 are the multiplexers. Some small multiplexers are available as dedicated resources within the logic slices of most FPGAs. However, particularly to generate larger multiplexers, LUTs are also frequently employed. We illustrate the principle in Table 2.5 with the truth table for a multiplexer that selects one of only two input bits and thus requires only a single select bit Sm. In Section 4.5.1, we encounter a design that uses multiplexers with up to 35 inputs. A tree structure of smaller multiplexers can be used to build those. Chapman [51] introduces different design alternatives for large multiplexers on Xilinx FPGAs.


Table 2.4: Truth table for a 1-bit full adder with Cin as carry-in bit from a previous full adder. This truth table can directly be used to program two 3-input LUTs, one for each output bit.

       Inputs      |   Outputs
  A    B    Cin    |   S    Cout
  0    0    0      |   0    0
  0    0    1      |   1    0
  0    1    0      |   1    0
  0    1    1      |   0    1
  1    0    0      |   1    0
  1    0    1      |   0    1
  1    1    0      |   0    1
  1    1    1      |   1    1

Table 2.5: Truth table for a 2-input multiplexer with select bit Sm. This functionality can be achieved with a single 3-input LUT.

       Inputs      |   Output
  I0   I1   Sm     |
  0    0    0      |   0
  0    0    1      |   0
  0    1    0      |   0
  0    1    1      |   1
  1    0    0      |   1
  1    0    1      |   0
  1    1    0      |   1
  1    1    1      |   1

As the third component from Figure 2.1, we want to map the equality check in the control unit to FPGA logic. Notably, one of its two inputs is the constant value 1. Hence, no generic comparison between two inputs is required, but a LUT can directly be programmed to indicate whether the input variable is equal to 1. In order to keep the table size in check, we illustrate again a small truth table (Table 2.6) for a 3-input LUT, which covers the three least significant bits of the input value. The comparison with an entire 32-bit input value can again be performed with a tree of smaller LUTs.

These three components mapped to the same type of FPGA logic resources may give an idea of the versatility of FPGAs and a glimpse of the amount of implementation decisions that need to be taken when mapping a circuit design to FPGA resources; a small software sketch after Table 2.6 additionally illustrates how such truth tables translate into LUT configuration bits. However, the most time-consuming phases of most synthesis tool flows, component placement and routing, have not even shown up in this example.


Table 2.6: Truth table for the equality check with the fixed value 1. The three least significant input bits from I2 down to I0 are covered in this table that can be mapped to a 3-input LUT. Only one of the input combinations corresponds to the binary value 1.

       Inputs      |   Output
  I2   I1   I0     |
  0    0    0      |   0
  0    0    1      |   1
  0    1    0      |   0
  0    1    1      |   0
  1    0    0      |   0
  1    0    1      |   0
  1    1    0      |   0
  1    1    1      |   0
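The following small C sketch (our own illustration, not an FPGA tool flow) shows how the truth tables from Tables 2.4 to 2.6 translate into configuration bits of 3-input LUTs: each LUT is modeled as an 8-bit value holding one result bit per input combination, and evaluating the LUT is a simple bit lookup.

#include <stdint.h>
#include <stdio.h>

/* evaluate a 3-input LUT: 'config' holds one result bit per input combination */
static unsigned lut3(uint8_t config, unsigned i2, unsigned i1, unsigned i0) {
    return (config >> ((i2 << 2) | (i1 << 1) | i0)) & 1u;
}

int main(void) {
    uint8_t lut_sum = 0, lut_cout = 0, lut_mux = 0, lut_eq1 = 0;

    /* derive the configuration bits from the truth tables */
    for (unsigned idx = 0; idx < 8; idx++) {
        unsigned a = (idx >> 2) & 1, b = (idx >> 1) & 1, c = idx & 1;
        if (a ^ b ^ c)                lut_sum  |= 1u << idx;  /* Table 2.4, S: inputs A, B, Cin */
        if ((a & b) | (c & (a ^ b)))  lut_cout |= 1u << idx;  /* Table 2.4, Cout                */
        if (c ? b : a)                lut_mux  |= 1u << idx;  /* Table 2.5: I0=a, I1=b, Sm=c    */
        if (idx == 1)                 lut_eq1  |= 1u << idx;  /* Table 2.6: only I2 I1 I0 = 001 */
    }

    /* one bit position of an adder built from the two full-adder LUTs */
    unsigned a = 1, b = 1, cin = 1;
    printf("full adder 1+1+1: S=%u Cout=%u\n",
           lut3(lut_sum, a, b, cin), lut3(lut_cout, a, b, cin));
    printf("mux(I0=0, I1=1, Sm=1) = %u\n", lut3(lut_mux, 0, 1, 1));
    printf("eq1(001) = %u, eq1(101) = %u\n",
           lut3(lut_eq1, 0, 0, 1), lut3(lut_eq1, 1, 0, 1));
    return 0;
}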

This concludes our section about basic concepts of compute devices, where we have outlined the fundamental differences between computing with processors and in circuits as background to understand FPGAs. In the following Section 2.2, we present how the underlying process technology was and still is a major driver for how these concepts are combined into actual computer architectures, and we again point out the special role of FPGAs in that context.

2.2 Trends in Technology, Architectures and Devices

In this section, we present how the trends in process technology summarized in Subsection 2.2.1 are a major driver for changes in processor architecture as outlined in Subsection 2.2.2. Accelerators are presented as one consequence of the ongoing efficiency challenge in Subsection 2.2.3, before we discuss in Subsection 2.2.4 how FPGAs architecturally fit into this trajectory.

2.2.1 Scaling in Process Technology: Continuity and Changes

The development of hardware has been governed for more than five decades by Moore's law, which predicts exponential growth of transistor counts per chip. From around 50 components in one integrated circuit in 1965, when Moore published his first projection for the next decade [188], to around 8 billion transistors on a single compute chip in 2015 [239], the number of components has doubled more than 27 times in 50 years. Research and industry continue to work on new process technology to further increase transistor density [281, 23, 193]. The recently released International Technology Roadmap for Semiconductors (ITRS) 2.0 report, 2015 edition [139], projects a possible technology scaling path up to the year 2030, but also lists technical challenges and preconditions that must be met for that roadmap.


Over time, the characteristics of this device scaling have significantly changed. The era up to the early 2000s is denoted as classical, geometrically driven scaling [138], characterized by Dennard's scaling rule [71]. In this era, with decreasing physical dimensions of a transistor, supply voltage could also be linearly reduced. With this type of scaling, switching delay can linearly scale down, that is clock frequency can linearly scale up, and at the same time, power density, that is power per area, remains constant. This key insight was already qualitatively proposed by Moore [188], but Dennard became known for backing it up with a model and also quantifying the delay time with regard to the scaling factor.

The era of classical scaling ended when supply voltage could no longer be linearly reduced with feature sizes, which is attributed to device characteristics depending on the bandgap potential of silicon [83] and to thermal noise [152]. This caused a power density problem and put an abrupt end to frequency scaling. Also, static leakage currents through insulating transistor components started to have an impact on the overall power consumption in addition to the traditional dynamic power for switching transistors [83]. With the introduction of new materials and transistor geometries [238, 131], in the era of so-called equivalent scaling [138], these challenges are kept under control sufficiently to retain the exponential growth of transistor counts per chip. However, these techniques mitigate, but don't remove, the tendency of increasing power per area and absolute power per chip, which hit practical limitations and became known as the power wall. 3D scaling or stacking of several chip layers, as the next trend in manufacturing that can increase device density, poses even more power challenges, because the thermal power of several layers has to be dissipated from the same area. Therefore, additional measures on the device and architectural level need to be employed to limit power consumption. Important concepts in this regard are to operate devices with dynamic voltage and frequency scaling [43, 225, 120] and with power gating [130, 13] of idle components. Complementary architectural concepts like SIMD instructions and multicore architectures are introduced in Subsection 2.1.2 and are put into the context of equivalent scaling in the next subsection. Further steps ahead are proposed as near-threshold voltage computing [75] and approximate computing [260, 105, 109], which encompass more radical ideas like lowering the supply voltage into regions where absolute performance or correctness start to degrade.

Beyond the technical concepts for scaling process technology, costs play a major role for the actual progress of technology. The ongoing trend to follow Moore's law depends on the fact that not only the area but also the cost per transistor decreases exponentially over time. This development is still largely intact, but it requires ever-increasing production volumes, as one-time costs are growing with every new process generation. This affects the costs to develop a new process generation up to production level and the costs to equip each individual foundry with the required tools and environment for this process. Both costs have gone up to a point where researchers speculate that not technical feasibility, but rather costs might put an end to Moore's law [219, 171].
Along with those process costs, also the costs for each individual chip design, with an increasing share of costs required for mask production, have grown considerably [267]. This requires higher volumes for each design for amortization, or forces designers that expect lower volumes to resort to older process technology [171].

2.2.2 Impact on Processor Architecture

In this subsection, we put many of the concepts introduced in Subsection 2.1.2 into the context of the technology scaling just introduced in Subsection 2.2.1. Technology scaling is the enabling factor to, among others, build devices with more components. One goal of computer architects and application designers alike is to transform this potential into performance scaling, that is the ability to execute given workloads faster or larger workloads in the same time. In this section, we focus on the computer architecture perspective, whereas in Section 3.1, the perspective of application developers is taken into account.

Parts of the increased transistor counts enabled by Moore's law have always been used to add new features, rather than to boost performance for existing workloads. Such primarily feature-oriented architecture innovations are increased bitwidths of data words and memory addresses, up to 64-bit addresses starting in the early 2000s [147], hardware support for floating point numbers, or more recently the introduction of cryptographic instructions [253, 103] and further hardware-supported features [264, 283, 44].

Performance gains in the era of geometric scaling and Dennard's scaling rule were mostly achieved by the pair of frequency scaling and of exploiting the additional transistors available per area and per chip for increased ILP (see Subsection 2.1.2). These two performance drivers were perceived as a form of "free lunch" [248], because any program can profit from them without any effort, just by waiting for the next processor generation. The techniques of superscalar execution, be it dynamic or explicit, and of out-of-order execution contributed significantly to the increases in overall compute performance. Further gains are limited by the existing instruction level parallelism and by efficiency issues when aiming for wider superscalar execution. Explicit superscalar architectures with VLIW or Explicitly Parallel Instruction Computing (EPIC) [232] ISAs are particularly limited by the ability of compilers to fill their functional units with independent instructions based on static analysis. Dynamic superscalar architectures mitigate this difficulty by employing dynamic analysis and speculation at runtime in hardware, but may execute speculative instructions in vain and use much area and power for the required instruction issue logic [118]. To illustrate the slow pace of ILP scaling, in the era of equivalent scaling, it took 14 years to go from the AMD K7 architecture [92, 129], introduced in 1999 with 3 address generation units (AGUs) and 3 integer ALUs, to 4 AGUs and 4 integer ALUs in a comparable processor class with the Intel Haswell architecture introduced in 2013 [136]. Admittedly, this comparison disregards significant changes in many other aspects of the architecture. In a different perspective, based on simulations of instruction traces conducted by Intel, the eight subsequent processor generations up to the Broadwell architecture released in 2014 achieved a cumulative increase in IPC of close to 1.8x over the Dothan architecture introduced in 2004 [89, 193].


Starting during the era of Dennard scaling and continuing during equivalent scaling, a part of the additional transistors of GPPs was also dedicated to the exploitation of DLP through SIMD instruction set extensions, e.g. Intel MMX [209]. This trend is still ongoing, with SIMD units exhibiting ever more parallelism, since this technique can improve efficiency by getting more computations done per instruction. However, progress in this domain is also limited by the available DLP in applications and by the ability of developers and tools to exploit it through SIMD instructions. From MMX introduced in 1996 to AVX-512, which debuts in 2016, the bitwidth of SIMD registers has increased from 64 to 512 and thus doubled 3 times in 20 years, while transistor counts in the same time span doubled around 11 times. In Chapter 4, we demonstrate that some application patterns can profit from much wider vector instructions. This functionality is provided by FPGA accelerators, since application coverage is not sufficient to include such very wide SIMD units into GPPs. Chapter 5 covers aspects of the automatic code vectorization for this target and of the partitioning process required for targeting an external accelerator rather than a tightly integrated SIMD unit.

With the limitations and efficiency problems of frequency scaling and exploitation of ILP at the transition from Dennard scaling to equivalent scaling, multicore architectures became popular [54]. In these architectures, additional transistors are spent to replicate entire processor cores to make use of TLP, thus avoiding the power limitations of frequency scaling and the diminishing returns of increasing ILP. Consequently, the usefulness of this approach is limited by the amount of TLP in the executed programs. In contrast to ILP, which is exploited by processor hardware or compilers and thus presented itself to applications, just like frequency scaling, as a "free lunch" [248], most TLP has to be explicitly specified by the programmer. Also, performance scaling of multithreaded programs is limited according to Amdahl's law by serial code sections. The vast majority of multicore processors are built with a single, shared main memory and a hierarchy of shared and private caches. In order to let the cores communicate and synchronize through memory accesses to this shared memory, cache coherency protocols are implemented in hardware to make sure all cores see the same data at identical memory locations, also through private caches. Scaling the required memory buses and interconnects for coherency traffic poses an ongoing challenge for hardware designers, but appears to be manageable as of now [173, 193].

Even though the multicore approach avoids the particular efficiency problems of single-core performance scaling, it is still subject to power and power density limitations [127]. For the transistors that don't contribute to overall performance, both because of limited TLP and because of power limits, Esmaeilzadeh et al. [78] have coined the term Dark Silicon and conclude that "adding more cores will not provide sufficient benefit to justify continued process scaling" [78]. Heterogeneity in multicore architectures [159], both with regard to clock frequencies [52] and core architectures [197], helps to mitigate this issue for workloads with variable TLP during different phases, but ultimately remains limited with regard to the performance of the fastest core.
Such concerns are even more pressing in the huge emerging field of mobile computing, where both power and energy are much more constrained than for typical desktop or compute scenarios.


2.2.3 Accelerators

The difficulty of transforming more transistors into more usable computing performance has renewed interest in more specialized architectures to efficiently accelerate specific tasks. Most prominently, GPUs have evolved from fixed function logic for graphics rendering to fully programmable graphics pipelines that can also accelerate various other tasks [205, 166] with massive DLP, predominantly with floating point computations. From a different starting point, but with a similar design philosophy, so-called manycore processors emerged, which in comparison to multicore processors offer more cores by reducing frequencies and ILP performance per core, and tackle similar problem classes to GPUs through their strong emphasis on SIMD execution units [224]. Even more specialized to their specific tasks are video decoding and encoding blocks, which have been integrated into various CPU and GPU platforms [233, 135]. As such accelerator blocks, not only fully customized circuits, but also processors with a specifically specialized ISA, so-called application-specific instruction set processors (ASIPs), can be used, as demonstrated for video encoding in [74]. In the POWER8 architecture, IBM integrates a true random number generator, a cryptography accelerator and a compression accelerator on the same chip with a multicore CPU [242]. Also the functional units that execute the cryptographic instruction set extensions [253, 103], introduced as a feature extension in the previous subsection, can be regarded as specialized accelerators, just more tightly integrated than the previous examples. Particularly for mobile computing platforms with their tight power budgets, ever more special function units are integrated into Systems-on-Chip (SoCs). For example, according to an analysis of Shao et al. [231], in several recent Apple SoCs, more than half of the die area is spent on up to 29 specialized IP blocks. Typical candidates for such special function blocks are in the domains of image, video, audio and speech processing and of the wireless network functionality.

Through different degrees and forms of specialization, all these accelerators try to reduce the overheads of instruction-based computing and of data accesses through register files and generic memory hierarchies. Dally et al. [66] quantify these overheads over the actual arithmetic operations performed to amount to around 94% for embedded processors. For GPPs, Horowitz [127] illustrates that instruction cache access, register file access and instruction control together use between one and three orders of magnitude more energy than individual arithmetic operations of various complexity. Hameed et al. [108] observe a 500x efficiency gap between a fully customized ASIC and a general-purpose processor. In order to reduce the relative impact of instruction overheads or the "cost of programmability", Horowitz [127] points out that more operations per instruction can be helpful when the costs for instruction-cache access and control remain relatively constant, but also that the energy to access data in the memory hierarchy is important. Dally et al. [66] focus on the energy needed to move around instructions and data and propose to reduce this energy mostly through the use of more and better partitioned register files, whereas Hameed et al. [108] seek to get a higher number of basic operations done per instruction.
For their video processing case study, they show that SIMD and VLIW enhancements can increase efficiency by around one order of magnitude and algorithm-specific instructions can yield more than an order of magnitude on top of that, closing most of the gap towards hardwired logic. We observe that current GPU and manycore architectures mostly revolve around such SIMD enhancements, whereas more specialized accelerators go more into the direction of algorithm-specific instructions and circuits.

We briefly illustrate the argument of Horowitz and Hameed et al. with our running example, based on numbers for a simple in-order processor in 45nm technology from [127]. Based on these numbers, we project different components of the power consumption per iteration of the loop body from Listing 2.1 in the first data row of Table 2.7. Each of the six instructions uses 70pJ, for a total of 420pJ. Each loop iteration requires two 32-bit data accesses. We approximate these with the values for two 64-bit accesses to an 8KB cache, which would require 10pJ each. Whereas the smaller data size in our example would likely save some energy, larger caches or an access to main memory would require much more energy. In the last column, the energy for the core arithmetic instruction (a 32-bit integer add) of the loop is outlined. Its contribution to the total energy of the loop iteration is minimal. Even a hypothetical floating point multiplication at 3.7pJ would be negligible compared to the involved overheads.

Table 2.7: Power projections for the example loop from Listing 2.1 based on numbers from [127]. Six instructions use 70pJ each, including 6pJ for register file access. This part increases for SIMD instructions, which we omitted due to a lack of data. A single 64-bit access to an 8KB cache uses 10pJ. Our loop example uses two 32-bit accesses. With the estimated 20pJ, we might be able to use a slightly larger cache instead. A single 32-bit add uses only 0.1pJ.

                             Energy in [pJ] per Iteration
  Design          Instructions   Data Cache Access   Arithmetic
  Unaltered       420            20                  0.1
  8x SIMD         420            160                 0.8
  32x SIMD        420            640                 3.2
  1024x SIMD      420            20480               102.4

In the next rows of the table, we outline alternative SIMD variants that would, with the same number of instructions, work on more data elements per iteration. Note that for the simple loop from our running example, this is not possible due to the loop-carried dependency. However, in the larger variants of this loop pattern that we target in Chapters 4 and 5, outer loops enable vectorized execution. In the example here, we see that even with a huge amount of SIMD parallelism, the energy for arithmetic remains dominated by instruction overheads. However, unless data can be reused within registers, accesses to the memory hierarchy start to dominate the overall energy consumption already at a medium amount of SIMD parallelism. Note, however, that due to a lack of data, our example neglects the energy scaling for accesses to ever larger vector registers.
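The arithmetic behind Table 2.7 can be reproduced with a few lines of code. The following sketch simply scales the per-operation energies quoted from [127] with an assumed SIMD width; like the table, it deliberately ignores the growing register file cost mentioned in the caption.

#include <stdio.h>

int main(void) {
    const double e_insn   = 70.0;  /* pJ per instruction, incl. register file access */
    const double e_access = 10.0;  /* pJ per 64-bit access to an 8KB cache           */
    const double e_add32  = 0.1;   /* pJ per 32-bit integer add                      */
    const int    n_insn   = 6;     /* instructions in the loop body of Listing 2.2   */

    int widths[] = {1, 8, 32, 1024};          /* data elements processed per iteration */
    for (int i = 0; i < 4; i++) {
        int w = widths[i];
        double instructions = n_insn * e_insn;   /* unchanged by the SIMD width        */
        double data_access  = 2 * w * e_access;  /* two accesses per data element      */
        double arithmetic   = w * e_add32;       /* one add per data element           */
        printf("%5dx: %6.1f pJ instructions, %8.1f pJ data cache, %6.1f pJ arithmetic\n",
               w, instructions, data_access, arithmetic);
    }
    return 0;
}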

As a complementary efficiency driver, most accelerators first exploit abundant parallelism suitable for their specific task and hence can often be operated in efficient, relatively low voltage and frequency ranges and yet provide good performance. Along with the concept of simple and power-efficient processing pipelines, this was the guiding idea for manycore architectures, but also most other accelerators are run at much lower clock speeds than GPPs. An extreme example of this paradigm is the IBM TrueNorth [91] chip designed to execute brain-inspired neural networks. With 5.4 billion transistors, it exceeds contemporary GPPs and GPUs in size and yet operates at three orders of magnitude lower power. Note, however, that this is not just achieved by conventional voltage and frequency scaling; the chip's computing cores rather operate in an event-driven way.

The trend to integrate different accelerators with processors into SoCs is jointly driven by efficiency, performance and cost considerations. On-chip communication can at the same time provide much lower latencies and higher bandwidths and save energy by avoiding physically long communication paths, transitions between different chips and circuit boards, and separate voltage regulation for distinct components. In many usage scenarios, heterogeneous SoCs with specialized accelerators are a more profound answer to the challenge of dark silicon than mere heterogeneous multicores. When all transistors cannot or even must not be productive at the same time, specialized accelerators can reasonably exist to only execute some very specific parts of the entire workload of the chip especially fast or efficiently and remain unused at other times.

From the economic perspective, different aspects of on-chip integration of accelerators overlap. Given sufficient volume, a SoC is much cheaper than assembling a comparable system out of discrete components. However, considering the increasing design and mask costs, the potential to customize a SoC's composition for different customer demands becomes very limited. On the other hand, in a power-constrained scenario with dark silicon characteristics, it can be viable to ship the same SoC with a wide selection of accelerators to different customers, who each just benefit from a subset of the accelerators and leave another subset unused as a form of deliberate dark silicon. Finally, while the chance to differentiate a product with specialized IP blocks from the competitors has contributed to a vibrant SoC ecosystem, the pressure to reduce time-to-market favors more programmable accelerators.

2.2.4 FPGA Accelerators

Considering the observations of the previous three subsections, we think that FPGAs can play an increasing role in the future computing landscape. In particular, the interplay of two effects leads to this conclusion.

1. The challenge of transforming the transistor budgets into more usable compute performance in the era of dark silicon is ongoing. Specialized accelerators successfully meet this challenge for specific tasks. Architecturally, though with a programmability overhead, FPGAs support all the features of specialized accelerators that account for their efficiency compared to GPPs, that is, elimination of instruction overheads, task-specific flow of data through the circuit and customized parallelism that in turn allows efficient voltage/frequency design points.


2. When the most frequently executed tasks are covered with specific accelerators, adding more accelerators suffers from diminishing returns in terms of tasks profiting from these new accelerators. At the same time, increasing design and mask costs limit the potential to customize the mix of accelerators in SoCs and similar products. Also, the process from identifying an acceleration demand to providing an accordingly specialized accelerator requires considerable time before new compute demands can be met. With their programmability, FPGAs can be specialized for a large variety of tasks, either over time in the same device or with different utilization of different devices using the same FPGA chip, even for tasks that are not yet identified at design time. In Section 2.3.1, we outline how these are exactly the characteristics of general-purpose computing with which FPGAs shine.

Compared to circuits directly implemented in ASIC technology, FPGAs suffer from an efficiency disadvantage of around one order of magnitude. Comparing ASICs to FPGAs with DSP blocks and BRAM manufactured with the same 90nm process technology, Kuon and Rose quantify the dynamic power consumption ratio as around 12x, the area gap as around 21x and the ratio of critical path delay as 3-4x [161].

However, compared to instruction-programmable processors, FPGAs still retain a considerable advantage. DeHon [70] quantifies the density advantage of FPGAs over processors in terms of ALU bit operations per area and time for different generations of FPGAs and processors as one to two orders of magnitude. Sirowy and Forin [237] compare FPGAs to a simple RISC processor and investigate with three case-studies how different circuit variants implemented on FPGAs need several orders of magnitude fewer cycles than the baseline processor and even idealized variants of this processor. They conclude that beyond eliminating instruction overheads, aligning memory accesses to computations and forwarding of data throughout compute pipelines have a major impact.

Overall, with an efficiency between that of GPPs and custom logic and the programmability as core advantage over more specialized accelerators, the role of FPGAs somewhat resembles that of GPUs or manycores, yet with a very different architectural approach. While GPUs just avoid parts of the inefficiency sources of GPPs, FPGAs avoid them much more radically when implementing compute circuits, but introduce programmability overheads compared to custom logic. We illustrate this relation qualitatively in Figure 2.2.

Chung et al. [58] study the design space of GPPs, custom logic, GPUs and FPGAs integrated into heterogeneous single-chip architectures qualitatively for three different workload types. Their results support the joint role of GPUs and FPGAs between GPPs and custom logic both in terms of efficiency and performance. They also note that in several bandwidth-limited scenarios both GPUs and FPGAs are performance-wise almost on par with custom logic. Even though all applications in this study use floating point computations, which conceptually favor GPUs, the performance and efficiency relation between GPUs and FPGAs is a mixed picture. Considering this, there should be room for FPGAs to join GPUs and possibly match them in their actual market role for acceleration.


[Figure: efficiency axis ranging from GPPs over GPUs (more operations per instruction, simpler pipeline) and FPGAs (programmable LUTs and fabric) to custom hardware accelerators (no instruction overheads, customized data forwarding and memory accesses).]

Figure 2.2: Illustration of efficiency drivers for different architectures. The relative positions of GPUs and FPGAs are neither a quantitative nor a qualitative statement and in practice vary depending on workloads and concrete architectures.

Indicators for such a trend are in fact visible. Considering on-chip integration with GPPs, both Xilinx with the Zynq series [214] and Altera with their SoC variants [19] are shipping products which combine ARM processors and FPGA fabric in order to profit from the presented SoC integration advantages: communication latency, power efficiency and possibly costs. The amount of integrated special function blocks of these products is not on par with non-reconfigurable SoCs, and it remains to be seen whether this is a significant disadvantage on the market, or whether the versatility of the FPGA can compensate for this. In the datacenter, Microsoft received considerable attention for a large-scale deployment of FPGAs to boost performance and efficiency of its Bing search engine and further upcoming workloads [213, 191]. Also, Intel, the world market leader for GPPs, acquired with Altera one of the two largest FPGA producers and announced first server products combining GPPs and FPGAs.

For some workloads, a particular challenge for FPGA computing platforms is the off-chip memory subsystem, which trails GPUs in terms of latency, bandwidth, and transparent integration. Recent developments may help to overcome this issue, like products with Hybrid Memory Cube (HMC) attached via high-speed serial transceivers, products with coherent memory interfaces via the CAPI interface developed around IBM, and announced Intel products that will connect an FPGA to the processor. As the much earlier Convey HC-1, which is introduced in more detail in Subsection 4.3.2, illustrates, memory performance of FPGA computing products is primarily a matter of cost, volume and added value considerations and therefore can presumably adapt quickly to market demands.

Up to now, the market share of FPGA compute products in the general-purpose domain is much lower than the technical arguments in this subsection may suggest. In particular, there are no popular desktop or mobile computers with FPGAs, only first, small-scale public cloud computing products with explicit FPGA resources1,2 and only few HPC installations with FPGAs. In contrast, in all these domains, GPUs are well established, and manycores, despite sluggish availability, receive significant attention in the HPC and datacenter domains. In the following section, we have a closer look at computing markets and domains along with their requirements for compute devices, before discussing reasons for this low adoption in Chapter 3.

2.3 Computing domains and markets

Computing has always been driven jointly by two factors, on the one hand by the devices themselves, which opened up new usage opportunities, and on the other hand by applications and markets, which demanded and paid for new capabilities and features. Important conceptual and architectural aspects of the first driver, the device capabilities, have been mentioned in the two previous Sections 2.1 and 2.2, in particular with regard to parallelism. In this section, we focus on the second driver, the markets that govern the economic development of computing, and on application types that represent these markets. We start with the general distinction of the special-purpose and general-purpose computing domains (Subsection 2.3.1). We then discuss in Subsection 2.3.2 how current FPGA usage is mostly focused on the special-purpose domain. In Subsection 2.3.3, we subdivide the general-purpose domain into several markets where we see further potential for FPGA technology.

2.3.1 General-Purpose and Special-Purpose Computing

Computing devices can be grouped into the domains of special-purpose computing and general-purpose computing, depending on whether their functionality is fixed at fabrication time or not. Most special-purpose computing devices are more commonly denoted as embedded systems or embedded computers, utilized in cars, household machines, photo and video cameras, many industrial machines and technical infrastructure like network switches. Typical examples for general-purpose computers are devices like personal computers (PCs), servers and mainframes. The markets for general-purpose computers will be elaborated in more detail in Section 2.3.3.

There are two fundamental advantages of general-purpose compute systems. The economic one is that costs can be shared, be it NRE costs that are shared among different customer groups with different demands, or be it the individual costs of a specific system that fulfills different demands of one or several users. The second advantage is a functional one, the ability to perform computations that are not planned or possibly not even known at the time when a general-purpose system is designed or deployed. The competing advantages of special-purpose systems, in contrast, are, as indicated in the context of accelerators and specialized processor architectures, efficiency, higher or more predictable performance, and lower costs when the systems are produced in sufficient volume.

1http://www.xilinx.com/products/design-tools/software-zone/sdaccel/supervessel.html 2https://xilinx-cloud.jarvice.com


The distinction between general-purpose and special-purpose is not always as clear as it may appear at first glance. We first present three criteria for distinction, then relate them to other definitions, before discussing their limitations by means of examples and drawing conclusions for the way these terms are used within this thesis.

1. Starting with the terms themselves, the purpose, functionality, workload or computational task itself serves as the first criterion. Special-purpose computing here refers to a single specific task or a narrowly limited set of related tasks, whereas general-purpose computing implies a wider and more diverse task set.

2. The second aspect is the point in time when the computational task is defined. If this fixation takes place before or at fabrication time, this is characteristic of special-purpose devices, otherwise of general-purpose devices. Note that the fabrication time of an individual chip is often different from that of the compute system it is used in, so the two can have different general- or special-purpose characteristics.

3. As the third criterion, we summarize aspects of the design process of the devices. Any form of specialization, e.g. with customized functional units, local memory for specific access patterns, or interconnects for specific communication patterns, hints towards special-purpose devices. On the other hand, efforts to reuse components of a circuit for different functionality hint towards general-purpose devices. We here explicitly refer to the intent to and effort towards reuse, because any processor, even after heavy customization, exhibits some temporal reuse of functional components. As another aspect of the design process, for special-purpose devices, along with the workload, often a fixed performance goal is set, leaving cost, power and energy as typical minimization objectives, even though often subject to additional constraints. In contrast, in general-purpose domains, with explicit or implicit cost and power constraints given, the typical design objective is to maximize performance.

Looking at distinctions presented in the literature, Hennessy and Patterson [118] use the ability of a device to execute third-party software to separate personal mobile devices like current smartphones, as part of the general-purpose domain, from embedded systems. This combines the first two of the above criteria: third-party software is deployed after fabrication, but it is also a means to provide a wide and diverse functionality. DeHon [69], who focused on the architectural perspective of reconfigurable devices for general-purpose computing, outlines two characteristics for general-purpose computing: to “Defer binding of functionality until device is employed - i.e. after fabrication” directly corresponds to our second criterion. His second key characteristic, to “Exploit temporal reuse of limited functional capacity”, is one aspect that we included in our third criterion. In the Berkeley position paper on the landscape of parallel computing research [25], the authors present the separation between fixed performance goals in embedded systems and the need for ever higher performance in HPC. They also note that the separation is currently weakening, on the one hand because in both types of systems power limitations have already become a major design concern, and on the other hand because media-oriented server workloads may adopt more real-time, fixed performance goals from the embedded field.


In a much earlier consideration of parallel computing in general-purpose domains, Hey [122] makes an interesting distinction. He asserts that the first criterion for general-purpose computing from our list, in his words the ability “to deliver high-performance on many different types of problems”, is only the view of vendors and enthusiasts. In contrast, the user of contemporary general-purpose devices expects to have “a very wide range of standard languages, libraries, applications packages, operating systems and tools at his or her disposal”. We pick up this aspect later in Chapter 3.

We next look at concrete examples for compute devices, or rather classes of such devices. We intentionally choose border cases, where the computing domains overlap and one criterion hints at one classification and another criterion at the other. We start with automotive systems and smartphones to illustrate this tension before proceeding with a first assessment of the current position of FPGAs in Subsection 2.3.2.

Complex Embedded Systems in Automotive

As first device class, we consider the computer systems used in cars. In 2007, Broy et al. [41] characterize the hardware and software systems of premium cars built around that time with some numbers: around 70 electronic control units (ECUs), processors or microcontrollers of varying complexity, are connected with each other and with a multitude of sensors and actuators by several different bus systems. This high number of components is to a large degree caused by the component-based design of cars, where many subsystems, mechanical components, sensors or actuators are developed along with their own control and compute systems. Together, these connected systems execute tens of millions of lines of code with up to 2000 distinct software-based functions, which are combined into around 270 functions that the user interacts with. Thus, each ECU executes on average around 30 different tasks.

Beyond the tendency to bring the functionality of those systems from the premium segment into cheaper cars, there are two ongoing trends. On the one hand, on the path to so-called smart cars and autonomous cars, ever more software-based functionality is added. On the other hand, for efficiency, reliability and cost reasons, system integration needs to be increased, that is, ever more functionality is combined into individual components in order to keep the number of ECUs and the physical wires of the interconnect in check.

So, considering size and diversity of the workload, one could argue that such automotive compute systems as a whole already belong to the general-purpose domain. On the other hand, as of now, the end-users generally can’t run third-party software on these systems, and software updates from the manufacturers are common in the form of bug-fixes, but rarely add new features, so functionality is mostly fixed after fabrication time. Also, most tasks are characterized by hard or soft real-time requirements, exhibiting the fixed performance targets typical for embedded systems. Therefore, overall we see automotive computing systems still in the domain of embedded systems; however, individual parts, like subsystems for autonomous driving and infotainment, currently seem to be in the process of adopting more general-purpose characteristics.


Modern Smartphones

As second device class, we examine current smartphones, which were already introduced in Subsection 2.2.3 because of their many accelerator components. In contrast to the automotive compute systems with their many individual components that are physically distributed all over the car, for smartphones a very high system integration is a governing design principle. This is caused not only by the form factor requirements, but also by the energy savings that are possible by the integration of many different compute components on a single chip into entire SoCs. According to a model in the system drivers chapter of the 2011 ITRS roadmap [137], as of 2011, such mobile SoCs combine, along with memory, 2 to 4 main processors and on average 129 additional processing engines, grouped and customized for a smaller number of specific functions.

As mentioned earlier in this subsection, one defining feature of smartphones is the ability to run third-party applications on an underlying operating system (OS). The extremely versatile workload enabled by this ability is not only a driving factor for the success of smartphones, but also a strong indicator for a classification as general-purpose devices. The design principles alone would not allow such a clear classification. The main processing cores are dedicated to be reused by the multitude of different applications and adopt cache hierarchies, superscalar out-of-order execution and SIMD instructions from conventional GPPs, often clocked slower and scaled down for energy efficiency. Their accelerators contain highly customized special-purpose processing elements to enable high efficiency. It is interesting to note that parts of the functionality of these special-purpose processing elements are exposed via application programming interfaces (APIs) to be used by the third-party applications. Considering the optimization targets, the existence of benchmarking tools for both performance and battery runtime illustrates that these SoCs combine characteristics of general-purpose and embedded systems.

2.3.2 FPGAs in Special-Purpose Computing and Beyond

In this subsection, we approach the role of FPGAs as central components of special-purpose devices and beyond, primarily based on the revenue reports of the two most important FPGA vendors, Xilinx and Altera [276, 17]. These reports break down revenues into the markets where FPGAs are currently employed. Looking at FPGA-based devices from these markets, we can classify the majority of markets into the special-purpose domain, but also see an emerging general-purpose use-case. Competitors like Lattice are even more focused on the embedded market [178]. As a side-note, even though the majority of FPGAs goes into the special-purpose and embedded domain, the majority of those embedded systems builds upon instruction-programmable processors like microcontrollers, DSPs and ASIPs, and on customized hardware rather than on FPGAs.

FPGA revenue by end markets has traditionally been dominated by the communications sector, comprising wired and wireless communication infrastructure. In 2012, both Xilinx and Altera report [276, 17] more than 40% of their net revenue to be from this domain. Typical systems containing FPGAs are high-end routers and switches as well as cellular base stations.

In these systems, FPGAs typically execute the same very specific tasks over the entire lifetime of a device. They are valued for the capability to represent routing patterns in hardware and to process packets in a pipelined and highly customized way. Considering this, the role of FPGAs here is clearly that of special-purpose devices. Their direct competitors are ASICs, which promise lower costs at high volumes and higher energy efficiency, but incur higher NRE costs and longer times-to-market. However, in this comparison, FPGAs also excel with the capability to fix bugs or even add new features after fabrication time. In computer networking, the approach of software-defined networking (SDN), which allows network decisions to be defined or modified by rules written in software, may lead to a stronger dependence on the programmability of FPGAs in this domain.

Other end markets for FPGAs are industrial automation, the automotive domain, particularly with entertainment and driver assistance subsystems, military applications, e.g. for secure communication, medical imaging, and consumer devices like television set-top boxes. Overall the importance of these sectors increases, but no single domain matches the communications sector yet [278, 18]. For the vast majority of applications in these domains, the characterization is similar to that in the communications segment. FPGA-based systems are marketed as appliances with a fixed and narrow functionality. Their FPGAs are special-purpose components configured with a customized design and executing a task that is fixed before system fabrication. The FPGA’s programmability, however, is again valuable to reduce time-to-market and to allow for updates and bug-fixes. Considering these diverse markets, in contrast to the FPGA-based embedded systems, the FPGAs themselves, as fabricated chips from Xilinx or Altera, are general-purpose products, which considerably profit from the opportunity to share NRE costs among different customer needs. In particular, due to the high volumes of the combined FPGA markets, FPGAs can be produced with more advanced manufacturing technology than most ASICs and thus overcome or reduce efficiency and volume cost advantages of custom hardware.

In contrast to the above markets, there are two subdomains showing up in both FPGA manufacturers’ reports [278, 18] where programmability of FPGAs is not just a bonus, but a prerequisite for the respective usage scenario. The first market is summarized as testing, the other one is high-performance computing (HPC). In the testing domain, FPGAs are used to generate test inputs for other circuits and analyze their outputs, but they are also employed to emulate or prototype new ASIC products. For these applications, the possibility to program and change the FPGA’s functionality after fabrication time is crucial. Hence, if the testing domain can be at all characterized as one of the computing domains introduced above, it falls into the general-purpose realm.

For HPC, FPGA-based compute products have been introduced as accelerator components, attached to secondary CPU sockets of multi-socket mainboards [273], as PCIe accelerator cards in typical external GPU form factors or as additions to customized mainboards [213]. Two of these FPGA-accelerated compute systems, a Convey HC-1 [40] with in-socket FPGA accelerator and a Maxeler MPC-X system [175] with PCIe-attached FPGAs, are the targets of the practical evaluations presented in this thesis and will be presented in more detail in Section 4.3.
Architecturally, such systems are capable of performing a wide range of general-purpose tasks, which is reflected by actual usage in academia. Just in our research group, the two systems have been utilized for diverse tasks like image processing [4, 5, 6, 8], motion estimation for video encoding, finite difference time domain (FDTD) simulation [184, 90] for nanophotonics, short read mapping in bioinformatics, heat transfer simulation [8], Markov chain steady state computation [8], correlation matrix calculations [8], and cryptographic key search [10] and reconstruction [9].

However, it seems that, up to now, the commercial success of many FPGA compute products is closely linked to focused support for a few application classes, and production systems often get deployed exclusively for one application. For example, in the financial sector, low-latency trading [266, 192] and asset pricing using Monte-Carlo methods [270] are applications where FPGAs are successfully utilized. Several standard methods from bioinformatics, such as Smith-Waterman, BLAST and hidden Markov models, are, apart from academic research [287, 141], often marketed as integrated software and hardware solutions using FPGAs [115, 132, 28]. Stencil computations from seismic analysis for oil and gas exploration are implemented on FPGA products [84]. Academia has also presented various cryptographic applications for FPGAs [160, 104] and [10, 9]. We can assume that professional users in this domain have large FPGA installations for such applications, but prefer to remain unknown. To summarize the commercial usage of FPGA-based compute products, many systems are used like special-purpose products. They are purchased for a single task or a narrowly defined task set, sometimes even designed for that purpose with specific performance targets. Third-party software often can be executed, but actually is not. Nevertheless, architecturally, most of these FPGA-based systems are general-purpose capable, as the academic use-cases above underline.

In contrast to the previous examples for commercial use of FPGA compute products, the recently emerging Microsoft Catapult system, first presented to accelerate the Bing search engine [213], is explicitly aimed at a broader range of workloads in the datacenter [191]. In other words, a key factor for the Catapult system architecture is to leverage the general-purpose computing capabilities of FPGAs. Research on new application fields for these systems, for example deep learning, is no longer purely driven by the desire to use the best available hardware design for an application, which might be a larger GPU installation as of now. Instead, in order to harvest consolidation benefits in its datacenters, the company also considers it beneficial to have further useful applications that can efficiently make use of the existing FPGA accelerators.

To summarize this subsection, we presented our view on the current state of FPGA markets. In terms of revenue, embedded devices still clearly dominate, but there are success stories of FPGA accelerators in HPC. Yet, the trend to fully embrace FPGAs as general-purpose compute products is still in its infancy. In the next subsection, we further structure the general-purpose markets, before we discuss in Chapter 3 reasons for the current low adoption as well as requirements and our ideas for further adoption of FPGAs in general-purpose computing.

2.3.3 Markets and Workloads of General-Purpose Computing

In Subsection 2.3.1, we have defined general-purpose computing in contrast to special-purpose computing. In this section, we now give an overview of the markets for general-purpose computing, characterize the workloads encountered there and assess the potential role of FPGAs. General-purpose computing can on the highest abstraction level be subdivided into personal and server usage.

Personal and Mobile Computers

Besides the proverbial PCs, which fall into the categories of desktop and laptop computers, devices for personal use encompass tablet computers and smartphones. Used both for professional and consumer purposes, the common characteristic of these devices is that they are used by one person at a time and typically by one or sometimes few persons over time. A usage study on systems with Intel CPUs, thus covering only PCs, classifies application domains as “Web, Communication, Office, Media Consumption, Media Editing, Game, Utility, Network Apps and IT” [215], with usage categories ranked by user interaction time. The analysis of CPU usage roughly follows this sequence, with gaming causing disproportionally more and office applications causing disproportionally less CPU usage than indicated by the interaction time. Additional load is caused by system processes and anti-virus software, not detailed further. For the groups of applications the users deliberately interact with, further descriptions and examples are given; for example, “Communication” includes “VOIP, instant messengers [and] email”, “Utility” includes “Backup, archiving, tuning [and] print”, “Network Apps” include “Peer-to-Peer, remote desktop [and] FTP” and “IT” includes “Software development [and] …” [215]. The diversity of these groups, which also tend to have very different characteristics in terms of available parallelism and computational requirements relative to memory and IO behavior, underlines the challenge of general-purpose platforms to provide good performance for any given workload.

The study also identifies distinct clusters of users. The usage patterns of the four largest groups put special emphasis on web, office, gaming and media consumption, respectively [215]. Next to the challenge of versatility, this also underlines the value of sharing NRE among different user groups. CPU vendors are successful in this regard and differentiate between individual user demands and price points by matching performance classes and features, which are often not inherent to the sold products, but artificially imposed. The development of GPUs has been fueled by demand from users with gaming emphasis, but now also serves other user interests, both within the same device class, such as media consumption and editing, and beyond. Upcoming FPGA adoption in the PC market is unclear and may depend on a single, sufficiently important usage scenario where they provide sufficient added value alone, before being also used for other purposes. In 2011 and 2012, a small wave of FPGA computing devices was integrated in or attached to PCs for Bitcoin mining [251]; however, they were soon surpassed by fully custom ASICs and never formed a critical mass.

Within the market of personal general-purpose computing devices, there is a strong trend towards higher mobility of devices, with sales of desktop computers peaking in or before 2010, laptop computers in 2011, tablet computers in 2014 according to forecasts, and smartphone sales still growing [134, 38].

With mobility, the requirements for energy efficiency are growing and have contributed to the trend towards specialized accelerators in smartphones as discussed in Subsection 2.2.3. For the market of personal mobile devices, say smartphones and tablets, we have no detailed application domain data as for the PC market, but we can note that known accelerators can mostly be attributed to the media consumption and editing and gaming domains, or fall into the system domain. ARM, licensor for most processors in the mobile domain, predicts a trend towards new functionality enabled by more and more accurate sensor data, such as noise cancellation, location awareness, recognition of gestures and handwriting [24]. Related tasks involve signal processing workloads, which are well suited for acceleration with custom circuits or FPGAs. NRE costs and time-to-market considerations can ignite FPGA utilization in this domain. For tasks that don’t require acceleration at the same time, temporal reuse of the same FPGA accelerator may provide an additional cost advantage; however, in the era of dark silicon, this needs to take energy efficiency into account. Whereas FPGA energy efficiency cannot compete with custom accelerators implemented in the same process technology, it can conceptually do well compared to the general-purpose components of mobile SoCs. In practice, FPGAs may need to embrace energy saving techniques like power gating more resolutely in order to compete in this market. However, once FPGAs enter this market as accelerators with limited scope, there are incentives to use them for further, more general-purpose computing tasks.

Server Class Computers

In contrast to personal devices, server class computers typically handle requests or tasks from many users, often concurrently, but also over time. Hennessy and Patterson [118] distinguish between servers and warehouse-scale-computers, which differ not only in size, but also in workloads, system architectures, and approaches to failure handling. In the revenue breakdown of Intel’s data center group, we find as rough correspondences to servers and warehouse-scale-computers the categories enterprise and cloud, and a third category combining HPC, communications, networking and storage [190]. According to Intel’s projection for 2016, each of the three categories will make roughly equal contributions to overall revenue.

Typical workloads for servers in the enterprise domain contain transactions on databases, along with business logic and websites. These workloads are dominated by integer operations and mostly contain thread-level parallelism through different requests; however, the transactional character [107] of operations on the databases and business logic requires careful synchronization of threads. The three tasks of databases, business logic and web frontends are often distributed to distinct computers, such that each individual system often performs just a single task type over time. This offers room for customization [24], which started with configuring core counts and frequencies, memory hierarchy and storage, but can also include customization of processor ISAs to ASIPs. However, so far the tasks for which the systems are specialized seem to be too complex to be entirely performed on specialized accelerators.


Warehouse-scale-computers [118], or systems of hyperscalers as they are called by Morgan [190], combine many commodity or server class computers into huge clusters and achieve availability of services through redundancy of running nodes and management layers in software. They execute as one part of their workload tasks similar to those of enterprise servers. However, their typical search and social media services use relaxed transactional requirements, denoted as eventual consistency [262], and thus can more easily exploit the thread-level parallelism of different requests. Another typical application class for warehouse-scale-computers is map-reduce [68] algorithms through frameworks like Hadoop3 and Spark4, now both maintained by the Apache Software Foundation. The map-reduce pattern is specifically designed to distribute the processing of large datasets (“big data”) among many server nodes.

Through the paradigms of software as a service (SaaS) and cloud computing, manifold tasks from personal and mobile computers, as well as from classical servers, have been partially or entirely moved to warehouse-scale-computers. For the hyperscalers, the central value proposition is consolidation. With resource demands that can be partially steered by pricing, the systems can run most of the time at the desired load levels that allow for maximal efficiency. For users, services running externally, “in the cloud”, open up the chance to obtain task-specific performance levels without up-front hardware investments and with little impact on the local power budget. Through SaaS and cloud computing, the workload of warehouse-scale-computers is very diverse in terms of available parallelism and predominant operations, from integer-dominated tasks with high requirements for single-thread performance over throughput-oriented tasks to SIMD-favorable media processing.

The striving for efficiency has led to the customization of warehouse-scale-computers or parts of them [213, 116, 190], which however somewhat conflicts with the consolidation paradigm. While GPUs are successful in the media processing domain and recently particularly shine for neural network processing and deep learning [116, 235], they are hardly suitable for other cloud workloads with predominantly request-level parallelism [213]. Hence, in retrospect, it comes as no surprise that such workloads inspired the first large-scale deployment of FPGAs in the data center [213], as indicated in Subsection 2.3.2. Once such FPGA installations are available, further workloads can use them, whenever they overcome absolute performance limitations as in [213], or are more cost effective as in [191]. Accordingly adapted applications can also migrate back to enterprise servers that for other reasons like privacy are not replaced by cloud solutions.

The paradigm of On-The-Fly (OTF) computing that is proposed and researched in the collaborative research centre (CRC) 901 “On-the-fly Computing” at Paderborn University goes beyond the established SaaS and cloud computing paradigms. OTF computing revolves around the idea of individually and automatically configured information technology (IT) services, whose components are traded on markets, can be flexibly combined and executed on-demand [185, 186]. The research of CRC 901 encompasses techniques for configuring and executing these services [110] as well as the required market mechanisms and infrastructure.

3https://hadoop.apache.org/ 4https://spark.apache.org/

Many results presented in this thesis are partially driven by the demand for efficient execution of services in this context. We expect the workloads of OTF computing to be at least as diverse as those of warehouse-scale-computers, however with much more versatility and variation over time, because services are not executed as-is, but configured and adapted based on individual demands. In order to be commercially successful, the execution of such OTF services should not fall short of the efficiency obtained by established hyperscalers for their less configurable services and therefore needs to make use of any accelerator technology that is established there. Hence, as we see FPGAs gaining traction in the cloud, we want to make sure that the methods to use them can cope with the dynamics and versatility of the OTF paradigm through fast and automated compilation and offloading techniques. Due to the approach of decoupling software catalogs, service composition and execution in compute centers through markets, there is also an increased demand to enable target-specific optimizations directly prior to execution, at a compute center that may just have compiled binaries at its disposal.

Finally, HPC is yet a different segment in the market of server class computers. In this segment, compute nodes with higher individual performance than in warehouse-scale-computers are typically interconnected with higher bandwidths and lower latencies. The design goal for these systems is to achieve the highest possible performance that allows solving computational problems that cannot otherwise be solved within acceptable time frames. The workloads are mostly numerical simulations of phenomena from the physical world. These simulations tend to use much more floating point arithmetic than other workloads and often can and need to exploit DLP through SIMD instructions and through parallel processing on many nodes to achieve the required performance. Consequently, many applications from this domain are well suited for GPU architectures and many early success stories of GPUs as general-purpose devices [201] cover HPC applications.

When data-parallel floating point performance along with massive off-chip memory bandwidth is required, FPGA computing products can currently not match GPUs. Similarly, GPPs can be superior when HPC applications map well to their cache hierarchies and profit from their SIMD units. In terms of energy efficiency, FPGAs can already be the best platform even in such a domain [149]. As long as it is possible to reach the performance goals, users of HPC systems in academia, who often don’t pay for power and cooling, are relatively insensitive to efficiency considerations. However, when HPC systems hit practical power limitations, more power efficient systems are mandatory and are likely to make use of FPGAs for their efficiency and versatility [229]. Also, whenever FPGAs can profit from customization of operations, data types or local memory reuse, they can already outperform GPUs and GPPs in raw performance [247]. Additionally, recent Altera FPGAs reduce the gap in floating point performance [191]. Even though sometimes bound to existing codes for a long time, computational scientists are willing to put extensive effort into overcoming performance boundaries, for example through higher-order methods that reduce communication overheads, as exemplarily illustrated by [156], so the low adoption of FPGAs in the HPC domain is particularly surprising.
Beyond the general considerations around productivity that we introduce in Section 3.1.3 and tackle within this thesis, an HPC-specific issue may be the focus of the FPGA community on strong scaling, that is, acceleration of a workload with fixed problem size. Weak scaling, that is, the ability to solve larger problem sizes on larger HPC systems, is more important for many computational scientists, but may have been neglected by FPGA researchers because of the tight links to the embedded community.

Table 2.8: Summary of opportunities for FPGAs in general-purpose markets.

Market             Drivers for FPGA usage                               Alternative
…                  NRE costs, time-to-market, reuse                     Custom accelerators
Mobile computers   Energy efficiency, performance                       Mobile GPPs, GPUs
…                  Energy efficiency, performance                       GPPs
Datacenter, OTF    Energy efficiency, versatility, performance          GPUs
HPC                Practical power limits, performance, customization   GPP, GPU, Manycore

Opportunity Matrix

After this brief excursion into the markets of the general-purpose computing domain, we summarize the architectural and usage potential for FPGAs in this domain with Table 2.8. For mobile computers, we see them enter as an alternative to a further growing collection of custom accelerators. In large-scale datacenters, they have recently entered for performance and energy reasons and seem to fit well to workloads with request-level parallelism. In the HPC domain, they will be needed under practical power limits and can also increase absolute performance.

To summarize this section, we have first characterized the distinction between general-purpose computing and special-purpose, embedded computing. We have outlined how FPGAs up to now are mostly used in the latter domain, but are architecturally, and according to individual success stories, interesting candidates for general-purpose computing. Investigating the markets for general-purpose computing in some more detail, we see indicators that FPGAs may be even more promising as accelerators than GPUs.

2.4 Chapter Conclusion

In this chapter, we presented different paradigms of computing and how FPGAs combine the programmability of CPUs with the design concepts of custom circuits. We have discussed how the technology and architecture trends, with increasing pressure towards efficiency and difficulties to translate more transistors into higher performance, point towards programmable accelerators such as GPUs and FPGAs. Investigating computing domains and markets, we have presented how FPGAs are up to now mostly used for special-purpose computing, but how various general-purpose computing markets could profit from their flexibility as accelerators. We have introduced individual success stories of FPGAs in this domain, but stated that overall adoption trails that of GPUs and manycore architectures. We have intentionally not discussed productivity as a central reason for this discrepancy between perceived potential and actual adoption within this chapter. In the following Chapter 3, we discuss this gap and present our three-pillar approach to overcome it.

CHAPTER 3

The Case for general-purpose adoption of FPGAs with the help of Overlay-Architectures

After focusing in the previous chapter on architectures and their performance drivers for different usage scenarios, in this chapter we discuss how well-performing implementations of general-purpose workloads need to find their way to FPGAs to reach this market. In Section 3.1, we observe how performance and productivity considerations go hand in hand for general-purpose architectures and investigate how FPGAs currently fit into this frame. Based on this analysis, we propose a three-pillar approach to general-purpose adoption of FPGAs. Besides libraries and accelerator-friendly OpenCL specifications, overlay architectures with accompanying productive design flows can play an interesting, but insufficiently understood role in this approach. Section 3.2 structures related work on overlays under this premise. Thus, this chapter combines on the one hand an analysis with our own original positions and on the other hand a broad coverage of related work.

3.1 Between Performance and Productivity Walls

In this section, we first review how productivity, binary or code compatibility and performance scalability affected the adoption of new architectural paradigms in general-purpose computing. We briefly touch on the effect of system architectures on productivity, before discussing how FPGAs are only slowly catching up with regard to performance productivity. Based on this observation, we introduce a three-pillar model for FPGAs in general-purpose computing, of which our specific focus is on overlay architectures.

3.1.1 Productivity of General-Purpose Architectures

The historic success of general-purpose CPUs largely built upon two pillars: firstly, on the straightforward programming model with, from a programmer’s perspective, sequential execution of a single instruction stream, and secondly on the “free lunch” [248] of performance scaling. In the era of geometric scaling, this “free” performance scaling made it possible to get higher performance for the same applications from every new processor generation and, to a lesser degree, from more expensive processor models of the same generation. As discussed in Subsection 2.2.2, besides frequency scaling, this was achieved through the exploitation of ILP. Dynamic superscalar architectures achieve performance scaling through ILP in hardware, thus allowing for binary compatibility, whereas most VLIW architectures require recompilation to make use of increased ILP, but offer performance scaling with code compatibility.

The comfortable situation of perceived free performance scaling allowed many software developers in general-purpose computing to focus much more on design productivity and on introducing new application features than on low-level performance engineering. Naturally, the high-level runtime complexity of utilized algorithms could hardly be neglected for most programs running on all but trivial data sizes, but optimizations for specific features of processor architectures were often left aside, or not even exposed by high-level programming languages.

Multicore processors as the major architectural paradigm in the era of equivalent scaling changed this situation. This paradigm essentially offers the choice between either binary compatibility at roughly stalling performance levels for sequential applications, or performance scaling at the expense of redesigning many applications around concurrency. This was perceived as a shock by the software development community [248, 249] and was dealt with in very different ways in the various market segments introduced in Subsection 2.3.3. In the domain of personal computers, acceptance for dual core processors was fueled by perceived performance gains, when a single interactive application can use all the single-threaded performance of one processor core while background and system processes run on the second core. However, individual applications only slowly moved towards exploitation of several cores, led by media processing tasks with much DLP. In contrast, many server workloads contain sufficient request-level parallelism. However, particularly for transaction-dominated workloads, programming for concurrency required considerably more effort. Virtualization and cloud computing facilitated the path to make use of multicore processors by deploying several service instances per server [193].

The HPC community overall was probably best prepared for multicore architectures, with established programming patterns for parallel processing. In this domain, shared-memory systems were already known before they became the dominating architecture for single compute nodes with multicore processors. For such shared-memory systems, the OpenMP1 API is widely used to handle parallelism. In the most straightforward way, it distributes the iterations of parallel for-loops to different threads on different cores and thus primarily exploits DLP. More recent incarnations of the OpenMP standard focus on increased support for TLP through models for independent [196, 203] and dependent [204, 195] tasks.
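As a hedged illustration of the OpenMP usage pattern described above, the following minimal sketch distributes the iterations of a data-parallel loop across the available cores; the loop body and array names are invented for this example, and the additional simd clause merely hints at the vectorization support of OpenMP 4.0 and later.

/* Sketch of the basic OpenMP pattern: iterations of a parallel for-loop are
 * distributed over the cores available at runtime (compile e.g. with -fopenmp). */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f; }

    /* distribute iterations across threads; 'simd' additionally asks the
     * compiler to vectorize each thread's chunk of iterations */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];            /* independent iterations: pure DLP */

    printf("c[42] = %.1f (max threads: %d)\n", c[42], omp_get_max_threads());
    return 0;
}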
A particular benefit of the OpenMP runtime system is the ability to specify the number of threads for each execution; thus, for applications with the right form and amount of parallelism, it combines performance scaling with binary compatibility on multicore architectures.

1http://openmp.org/wp/


Intel’s Cilk Plus2 language extensions and compiler together follow a similar approach to OpenMP by exposing parallelism through pragmas and scheduling tasks to different CPU cores at runtime. Suitable parallelism from for-loops is also translated into SIMD instructions. SIMD units are, orthogonally to multicores, another architectural feature for parallelism that often requires choosing between binary compatibility and performance scaling. Between the fully automated generation of SIMD instructions by compilers, which retains code compatibility but often yields limited performance gains [172], and the manual specification of SIMD instructions in assembly, different approaches with intrinsics [288], pragmas [157][4] and code generation [102, 150] have been explored, and generally involve trade-offs between productivity and resulting performance (a small sketch of the intrinsics-based approach follows after the list below). Along with partitioning and offloading aspects, SIMD code generation for an FPGA-based platform is a central theme in Chapter 5. Hennessy and Patterson [118] note that SIMD architectures are generally easier to program than MIMD architectures like multicores. This is mainly because they don’t exhibit possible race conditions between different threads, but are limited to DLP. Also, many measures to use the SIMD units of general-purpose CPUs can remain local due to fine granularity and low overheads when switching between SIMD and sequential instructions.

In contrast, GPUs target DLP with SIMD and SIMT paradigms at a much larger scale. With the emergence of these architectures, binary compatibility and code compatibility to CPUs have been given up, with CUDA as the first successful specification language for GPUs as general-purpose computing devices. Soon after, with OpenCL another specification language was developed that promises portability between multicores, manycores and GPUs for kernels specified with it. Through online compilation, the actually code-compatible kernels can be perceived as binary compatible between different architectures; however, performance portability and scaling between different architectures fluctuate strongly. Later, with OpenACC, another programming standard has been presented that portably targets GPUs and manycore architectures through compiler directives. For an intermediate representation generated from OpenACC code, Sabne et al. [220] evaluated the performance portability and show that code variants optimized for one GPU model achieve, on a GPU from a competing manufacturer, 85% to 91% of the peak performance attainable on that platform. Between a manycore platform and the GPU platforms, performance portability is more limited, leading to between 65% and 74% of peak performance.

We see four essential factors that contributed to the success of GPU accelerators in general-purpose computing despite the need to write new code with a new programming language or API when starting from a CPU implementation.

1. The promise of significant performance gains of one or two orders of magnitude [201] using widely available and competitively priced hardware, and compelling roadmaps towards even faster upcoming GPU generations, justified the development effort. As a competitor, it is exactly the magnitude of these performance gains that Intel challenged in [166].

2https://www.cilkplus.org/


2. An educated guess about the applicability of GPUs for an application type is relatively easy [269], for example based on available DLP, simplicity of control flow, arithmetic intensity and off-chip bandwidth requirements.

3. GPU manufacturers, particularly Nvidia, invested heavily into compilers, performance analysis tools, documentation and examples, and made most of these available free of charge [80, 162, 256]. This allowed for a steep productivity curve and low entry hurdles for the new architectures.

4. The SIMT programming pattern of CUDA and OpenCL makes it possible to effectively cover parallelism both for SIMD and multicore execution, whereas with many other formalisms established in the CPU domain, these two need to be handled separately.
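To make the intrinsics-based approach referenced before the list concrete, the following hedged sketch contrasts a scalar loop with the same element-wise addition written with x86 SSE intrinsics; the function and array names are invented for this illustration, and an SSE-capable x86 processor is assumed.

/* Scalar loop versus manual SIMD with SSE intrinsics (four floats per instruction). */
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, ... */
#include <stdio.h>

#define N 1024           /* multiple of 4 to keep the example simple */

static void add_scalar(const float *a, const float *b, float *c) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

static void add_sse(const float *a, const float *b, float *c) {
    for (int i = 0; i < N; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);             /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));    /* 4 additions in one instruction */
    }
}

int main(void) {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 1.0f; }
    add_scalar(a, b, c);
    add_sse(a, b, c);
    printf("c[7] = %.1f\n", c[7]);
    return 0;
}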

For many markets, another important consideration was the commercial availability of general-purpose GPUs from two independent manufacturers and the code compatibility between the two within the OpenCL language [80] and later on via OpenACC. In the segment of manycore architectures, the first commercial product generation from Intel, based on the Knights Corner architecture, offers limited binary compatibility to the multicore CPU product line. Since the SIMD execution units have different instruction formats, code compatibility for performance scalability depends on the effectiveness of the compiler support for specific codes. The code compatibility is however perceived as an important factor in why manycore accelerators were able to catch up to GPUs in the HPC domain quickly [256, 162]. With the following Knights Landing architecture, Intel aims at full binary compatibility to the multicore products, with performance scaling that just depends on the degree of exposed SIMD and multicore parallelism.

Beyond this rough overview of compatibility and specification methods between and for different architectural approaches, it has to be noted that for some applications, efficient utilization of the respective architectures is achieved through underlying libraries with highly optimized implementations. For example, for the Basic Linear Algebra Subprograms (BLAS)3 library for linear algebra, among other development communities, several hardware manufacturers like Intel, AMD and Nvidia maintain implementations that are tuned specifically to the respective features of their compute devices. Using libraries to achieve hardware-specific optimized performance is compelling from a productivity perspective, but limited in scope. Libraries with very fundamental functionality are applicable to many applications, but often just cover small parts of the overall program runtime. More specialized libraries have a higher chance to cover more of the performance relevant parts of an application, but are only applicable within narrow problem domains.
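As a hedged illustration of the library approach, the following sketch delegates a small matrix multiplication to a BLAS implementation through the standardized CBLAS interface; the header name and link flags vary between vendor implementations (for example OpenBLAS or Intel MKL), so the build details should be treated as assumptions.

/* C = alpha * A * B + beta * C, delegated to an optimized BLAS implementation. */
#include <cblas.h>
#include <stdio.h>

int main(void) {
    double A[2 * 3] = {1, 2, 3,
                       4, 5, 6};        /* 2x3 matrix, row-major */
    double B[3 * 2] = {1, 0,
                       0, 1,
                       1, 1};           /* 3x2 matrix, row-major */
    double C[2 * 2] = {0, 0,
                       0, 0};           /* 2x2 result */

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 3,        /* M, N, K        */
                1.0, A, 3,      /* alpha, A, lda  */
                B, 2,           /* B, ldb         */
                0.0, C, 2);     /* beta, C, ldc   */

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}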

3http://www.netlib.org/blas/

new programming paradigms for parallelism offer not just one-time performance gains, but after the transition also open up a new path to future performance scaling with code or binary compatibility.
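To make the library route tangible, the following minimal C++ sketch calls the standard CBLAS interface for a double-precision matrix multiplication; the same source builds against the reference BLAS or against a vendor-tuned implementation, so the hardware-specific optimization is obtained purely by linking a different library. The matrix sizes and the chosen implementation are illustrative assumptions.

```cpp
// Minimal sketch, assuming a CBLAS-compatible BLAS implementation (e.g. the
// Netlib reference BLAS, OpenBLAS or Intel MKL) is installed and linked.
// Matrix sizes and contents are placeholders.
#include <cblas.h>
#include <vector>

int main() {
    const int m = 512, n = 512, k = 512;
    std::vector<double> a(m * k, 1.0), b(k * n, 2.0), c(m * n, 0.0);

    // C = 1.0 * A * B + 0.0 * C in row-major layout; whichever BLAS library is
    // linked provides the implementation tuned for the respective hardware.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, a.data(), k,
                b.data(), n,
                0.0, c.data(), n);
    return 0;
}
```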

3.1.2 Impact of System Architectures

Within libraries and beyond, the granularity of individual code segments that can profit from specific hardware features and architectures tends to increase from SIMD instructions, which only use different registers within the same CPU core, over homogeneous multi- and manycore architectures, to heterogeneous architectures, where GPUs, manycores or FPGAs are used as external accelerators to a general-purpose CPU. The interplay of homogeneous or heterogeneous resources, and functional and efficient memory management between those, can be an additional hurdle for developers.

For homogeneous multicore architectures, the challenge for programmers to ensure correct functionality with concurrency and synchronization has already been mentioned in previous sections. In order to optimize performance, tasks and subtasks also have to be distributed to different cores with patterns that minimize communication and maximize data locality in memory blocks that are tuned to the specific cache hierarchy of a multicore processor. Depending on the application, this can require significant efforts by the programmer, or can be automated, for example with polyhedral optimization methods [36].

For many current systems with accelerators, memory placement and synchronization are not just a performance challenge: separate memory spaces and the process of offloading tasks outside of the scope of the OS require appropriate solutions already for mere functionality. Around this challenge, significant activity from hardware manufacturers can be observed. Nvidia in two steps introduced a runtime system that allows programmers to use a programming model based on shared memory, even though the underlying hardware still needs to move data between two distinct memory locations [109]. The Heterogeneous System Architecture (HSA) foundation4 around AMD and ARM promotes shared memory architectures for heterogeneous on-chip integrated systems around CPUs, GPUs and DSPs, and introduced a common intermediate language to program such systems [128]. Similarly, under the roof of the OpenPOWER foundation5, shared memory protocols for external accelerators are specified, which let IBM POWER CPUs interface with Nvidia GPUs via NVLink and FPGAs via the Coherent Accelerator Processor Interface (CAPI). With the upcoming manycore products based on the Knights Landing architecture, Intel moves the manycore processors back from accelerator boards with an offloading model to homogeneous system architectures. The announced processors with FPGAs will be heterogeneous, but are expected to share main memory between CPU and FPGA.

In Chapters 4 and 5, our presented solutions deal as one aspect with the challenges of offloading and shared memory with non-uniform memory access (NUMA) characteristics, with library and compilation approaches respectively. Looking further into the future, in Chapter 6,

4http://www.hsafoundation.com/ 5http://openpowerfoundation.org/

we analyze the potential of the current trend towards integrating reconfigurable accelerators into a shared memory hierarchy.

Overall, heterogeneity and separate memory spaces are a challenge for FPGA acceleration, and currently the integration of FPGAs in heterogeneous systems is trailing that of GPUs. However, when general-purpose computing with GPUs gained traction, early GPU accelerators were no more closely integrated into systems than FPGAs are now. Therefore, the challenge of heterogeneous systems may have slowed, but obviously has not prevented the success of GPU accelerators in several general-purpose computing markets. In the PC domain, GPUs may have profited from a dual usage scenario of computing and graphics rendering, but in the HPC and cloud domains, the impact of graphics rendering capabilities on GPU adoption is minimal. Therefore, we further need to investigate the programming models for FPGA accelerators themselves as a major factor that currently limits their wide-spread adoption.

3.1.3 FPGA Productivity

On an abstract level, until recently, application developers needed to adopt new programming models to offload code from general-purpose CPUs to FPGA accelerators, just like for early GPU accelerators. However, the traditional design route via HDLs like VHDL and Verilog, as outlined in Subsection 2.1.4, requires not only learning new programming languages; the underlying programming model is also very different from software programming models. The designers not only need to combine structural and behavioral descriptions, the behavioral description also gives up the sequentiality of most software programming models and instead requires the designer to specify everything that can happen in each cycle [113]. Thus, the move to FPGAs as accelerators is much more time-consuming and difficult than to multicore and GPU architectures. Even for experienced developers, design productivity is arguably lower for circuits and programmable hardware than for software. In particular, the compilation and synthesis flows take much longer than software compilation, and simulation and testing are slower because the simulated target systems are structurally different from the host systems on which they are designed.

Code compatibility of HDL designs between different FPGA models, also from different manufacturers, is conceptually given, but requires the new target FPGA to be sufficiently large. In practice, many more problems can arise, ranging from the availability of compatible interfaces in hardware and in configurable fabric, over incompatible low-level optimizations for specific hardware components like LUTs and DSP blocks, to timing issues during the synthesis phase. Even performance scaling to larger or newer FPGAs can be problematic, as many design decisions in HDL tend to depend on specific performance targets or resource constraints and may not be easily parametrized for flexible trade-offs between area and performance.

FPGA manufacturers, tool designers and academia have long worked on improving design productivity under the label of high-level synthesis (HLS). The idea of HLS is to generate FPGA or hardware designs from more abstract and more familiar, software-like programming models and language constructs. A main challenge for HLS is that sequential, C-like programming models and languages seem to be most appealing to application programmers who want to accelerate their software with reconfigurable or customized hardware, whereas in order to generate efficient hardware, the synthesis tools need information about concurrency, timing and customized memories or buffers [76].

However, recently, increasing numbers of developers from the FPGA domain consider tools like Vivado HLS6 (formerly AutoESL [177]) as a viable design path that is able to produce functional FPGA designs from many C and C++ sources out-of-the-box and provides the designer with effective ways to influence the generated design for performance- or area-critical parts of the code through pragmas or directives, as illustrated by the sketch below. Thus, both for experienced and new FPGA developers, HLS significantly facilitates the steps to a first working design and opens up a path to incremental improvements. For such optimization steps, a deeper understanding of the target architecture is still important. The HLS tools typically perform source-to-source transformations from C-like languages to HDL and thus depend upon the HDL synthesis flows. Therefore, some of their problems, like less-than-perfect code portability and long tool runtimes, can also be observed with HLS. However, HLS tools like Vivado HLS also support faster simulation on a higher abstraction level and thus require fewer of the very time-consuming simulation and synthesis steps of HDL. Compared to bottom-up HDL designs, the approach of incremental design optimizations also greatly increases productivity when design decisions like parallelism or data buffering need to be changed late in a design. This also brings HLS approaches closer to performance scalability at the source code level, because in some designs, parallelism can easily be adapted to the available resources.
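To give a flavor of this pragma-based refinement, the following hedged C++ sketch shows a loop kernel in the style accepted by HLS tools such as Vivado HLS; the function and array size are hypothetical, and the exact set and syntax of supported directives depend on the tool and its version.

```cpp
// Hypothetical HLS kernel: plain sequential C++ semantics, annotated with a
// directive that asks the tool to pipeline the loop so that a new iteration
// can start every clock cycle.
const int N = 1024;

void vector_scale_add(const int a[N], const int b[N], int c[N]) {
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        // Without further directives the tool decides how to map the arrays to
        // memory ports; additional pragmas (e.g. for loop unrolling or array
        // partitioning) can then trade FPGA area for throughput.
        c[i] = 2 * a[i] + b[i];
    }
}
```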

6http://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html 7http://www.openspl.org/

One HLS variant, for which both Altera and Xilinx recently pushed forward their respective tool chains, is based on OpenCL as specification language [65]. This approach is attractive because the language is already designed around parallelism, offloading and different memory spaces and thus requires fewer target-specific pragmas to generate corresponding FPGA designs. The OpenCL synthesis flows offer, within their respectively supported versions, full code compatibility to other accelerators and thus give access to existing code bases and to a pool of additional potential FPGA developers. However, performance portability from other architectures or between different FPGAs, as well as performance scaling, remain more ambiguous properties. Sometimes, the parallelism and memory concepts from OpenCL can with little or no effort lead to efficient designs on FPGAs, for example when local and private memory regions are a useful way to use BRAM resources, or when the expressed parallelism can effectively be used for latency hiding or unrolling. In other situations, the OpenCL code must be considerably refactored to transform it from a GPU-oriented design into an efficient FPGA design [257, 148]. Other customizations that can contribute to the efficiency of FPGA designs, for example of data types and operations, cannot be expressed with OpenCL at all.

Orthogonally to HLS, a spatial programming model around the language MaxJ, or its open source variant OpenSPL7, retains much of the structural design aspects of HDL, but significantly raises the abstraction level. It recently emerged as a more productive alternative to HDL for dataflow-centric computations and is used and evaluated as one design path in Section 4.7. The core of MaxJ is an explicitly structural design approach with basic elements like directed streams of data, arithmetic or logic operators on these streams, and constructs to modify the streams, for example delay elements or multiplexers. However, the synthesis toolchain for MaxJ is currently limited to target FPGA boards from Maxeler. In our work, we present how parameterizable designs in MaxJ can provide some performance scalability, but are also constrained by other design considerations.

Despite this manifold progress to increase productivity with FPGA design flows, we must note that the value proposition of performance and productivity for FPGAs does not appear as convincing as for GPUs.

1. Even though the general performance potential of FPGAs has been demonstrated extensively, the potential for specific applications is often unclear because of the multitude of optimization strategies for FPGA designs. FPGA optimization cannot start from a competitive baseline performance point, but rather first needs to overcome a large clock speed penalty. Leaving the actual computations aside, the memory interfaces of many FPGA accelerators neither offer the raw bandwidth of many GPU accelerators, nor the convenience of extensive cache hierarchies that transparently offer reasonably good performance for many different memory access patterns, and therefore need to be included in the analysis.

2. The programming models for FPGAs involve trade-offs between difficulty, productivity and performance potential that are hard to grasp for software programmers. The most accessible ones, HLS and OpenCL, are at best not more difficult than software programming models. However, in order to come close to the available performance potential, often either sophisticated optimizations for HLS are required, or only low-level HDL designs are suitable.

3. Once a design is successfully accelerated on an FPGA, portability to other FPGAs is more restricted than for other architectures. A somewhat limited code compatibility to other FPGAs exists, but even in the best case it requires a new synthesis process that is orders of magnitude slower than the recompilation of OpenCL kernels for a different GPU. Also, performance scaling to newer or larger FPGAs can be difficult.

For the HPC domain, Laakso [162, 256] compares the adoption of manycores, Nvidia and AMD GPUs and FPGAs as accelerators after their first availability. He postulates that the pace of adoption of these architectures, ranging from fast for manycores, over slower for Nvidia and AMD GPUs respectively, to non-existent for FPGAs, is closely related to the programming paradigms and development tools. The recent progress in synthesis from more abstract programming models may enable a first entry of FPGAs into this segment, but is it sufficient to pave the way for considerable adoption in the HPC and other markets?


3.1.4 Three Pillars for General-Purpose FPGAs

Based on the observations in the previous subsections, we argue that three pillars can complement each other to make FPGA programming flows versatile enough to drive general-purpose adoption of FPGAs, similar to the general-purpose adoption of GPUs within the last ten years. Most practical contributions of this thesis pertain to the second pillar.

Progress in HLS and particularly OpenCL synthesis flows puts FPGAs closer to other accelerators in terms of accessibility than they have ever been. Judging by the visible activity of tool development and documentation, OpenCL seems to be the tool of choice for Altera to further the general-purpose computing adoption of FPGAs, and Xilinx seems to increasingly gear their HLS effort in this direction. Early case studies [53, 189] indicate that FPGAs might be able to play the role of a slower, but more energy efficient complement to GPUs with this approach. However, the OpenCL programming model's close ties to a GPU-like execution model limit or obfuscate the further customization and optimization potential of FPGAs. Also, in direct comparison to other target architectures, tool runtimes are still a huge disadvantage and are an additional factor that limits performance scaling. Therefore, we consider OpenCL not as a comprehensive path to broad general-purpose adoption of FPGAs, but rather as the first of three pillars, upon which this outcome can be built.

Hence, as second pillar, we see the need for an approach that allows faster synthesis or compilation and can exploit other, orthogonal features of computing on FPGAs. For this approach, we propose to use overlays, that is, computing architectures that are implemented as FPGA designs, but retain some programmability or configurability that enables them to perform different tasks. The indirection of an overlay limits the performance potential, but the abstractions and fixed design aspects of each specific overlay can enable faster and more automated compilation or synthesis than for the full configuration space of the underlying FPGA fabric. Such overlays have been researched in academia from various perspectives, which we discuss in more detail in the upcoming Section 3.2. The focus of overlay architectures thus far has mostly not been on general-purpose computing scenarios, and their potential seems not yet sufficiently understood to play a major role in corporate roadmaps for FPGA acceleration. In Chapter 4, we show that the overheads of such an overlay can be acceptable in a practical application, and in Chapter 5 we present contributions to a compilation flow for an overlay.

As third pillar, we summarize all efforts to encapsulate FPGA designs into reusable libraries. As outlined in Subsection 3.1.1 for established general-purpose architectures, this approach allows easy and fast acceleration for applications that have core functionality covered by libraries, and also allows highly optimized designs, since design efforts can be shared within developer communities around such libraries, or taken over by hardware manufacturers. On the other hand, library coverage of applications and within individual applications is limited. When individual libraries and FPGA accelerator boards are closely tied together, like for some products in the bioinformatics domain [115, 132], the role of these boards remains limited to special-purpose acceleration, as discussed in Subsection 2.3.2.


[Figure 3.1 content: "FPGAs as general-purpose accelerators" — three pillars labeled OpenCL, Overlays and Libraries, rated in the categories tool runtime, required (user) effort and knowledge, performance potential, and application coverage.]

Figure 3.1: Illustration of three pillars for FPGAs as general-purpose accelerators. The width of each pillar segment indicates the suitability of each pillar per category. Note that high application coverage and performance potential, but low required knowledge and effort and low tool runtime are the target properties.

The application galleries from Maxeler8 and IBM9, with computational kernels among others from the domains of bioinformatics, big data analysis, compression, cryptography and computational sciences, in contrast point towards a strategy of diversifying the use of accelerator hardware through a library approach. Intel is also rumored to work on application libraries for their upcoming CPU-FPGA products. A similar strategy by Nvidia earlier on played an important role in paving the way for GPUs in general-purpose computing.

We summarize key properties of these pillars in Figure 3.1. Together, we consider the three pillars of synthesis of FPGA accelerator designs from OpenCL, overlays with software-like compilation, and libraries with hand-optimized accelerator designs, as suitable to carry adoption of FPGA accelerators in general-purpose computing markets. In contrast to conventional HDL design, none of them improves the performance side per se, but they offer easier and faster entries to explore FPGAs as a target platform and can together cover many optimization approaches. In case of a successful, wide-spread market penetration of FPGAs in general-purpose computing, further approaches, possibly more custom-tailored to the demands of specific application domains, may come up and gain traction. All three presented pillars can deliver portability and improved scalability to new FPGAs, the latter two without lengthy tool runtimes. This property is particularly important in dynamic OTF computing scenarios. Since OpenCL and libraries already seem to have strong industry backing, we see a particular opportunity for academic research in the overlay pillar.

8http://appgallery.maxeler.com/#/ 9https://cognitive.ptopenlab.com/accelerator

We consider the combination of low effort and tool runtimes with the potential to cover many different applications through a few reusable designs, which the overlay pillar contributes, as particularly important for the initial adoption phase. As long as not many optimized library components or OpenCL-based designs exist, broadly applicable overlay libraries can deliver at least some acceleration or efficiency with FPGAs and also offer a path to further performance scaling, either through better FPGA hardware, or through more specialized overlays or even hand-optimized designs. In Chapter 4, we quantify the optimization headroom that a general overlay architecture leaves open in our case-study, and show that the overlay can still contribute valuable acceleration.

3.2 FPGA Overlays

As FPGA overlay, we consider any configurable or programmable architecture implemented on top of FPGA fabric. In terms of raw performance or energy efficiency, overlay architectures typically involve overheads compared to direct low-level implementations on FPGA fabric. In this thesis, we propose to accept some overheads in order to increase the potential to program FPGAs much faster and in a more automated way. As discussed in Section 3.1, portability and performance scalability to other underlying FPGAs are further important concerns in this regard. Many authors of related work also pursue some of these goals.

In this section, we present related work on FPGA overlays mainly in two regards, firstly the architectural approach and secondly the area of programmability, productivity and performance scaling of these architectures. Existing overlays can be subdivided into two categories, firstly processor-like instruction-programmable overlays, which we discuss in Subsection 3.2.1, and secondly structurally programmable or configurable overlays, including coarse-grained reconfigurable arrays (CGRAs) and virtual FPGAs, which we discuss in Subsection 3.2.3. The intermediate Subsection 3.2.2 covers configurable and reconfigurable processor architectures that are not or only prototypically implemented on FPGAs, but impact the overlay architectures in both domains.

3.2.1 Instruction-Programmable Overlays

As introduced in Subsection 2.1.4, FPGAs can be configured into arbitrary computing circuits, including instruction-programmable processors. Processors implemented on FPGA fabric are also denoted as soft or softcore processors, and can be considered as a form of overlay that transforms the structurally programmable FPGA into an instruction-programmable architecture, which can then be targeted by conventional software development methods and tools.

Softcore processors have been designed with various goals. Commercial designs from FPGA manufacturers, like Microblaze10 and Nios II11, have been introduced as host processor or controller for customized accelerator logic on FPGAs [218] or as alternative to

10http://www.xilinx.com/products/design-tools/microblaze.html 11https://www.altera.com/products/processors/overview.highResolutionDisplay.html

embedded processors in order to increase the longevity of processor-specific software [206]. The LEON series, with for example the LEON312 design, has been developed as a processing platform for space missions [77] that can either be implemented as custom hardware or synthesized to FPGAs. Open-source soft processors like OpenRISC13 are used as research platforms, for example with regard to fault-tolerance [198, 47] or security [50, 42, 29].

Many soft processor projects consider performance of the instruction-programmable overlay as a relevant target metric, but don't try to surpass the performance of conventional processors, focusing instead on adding other benefits. For example, when the soft processor is primarily used as control instance for custom logic, the performance focus is on the latter part, with all the involved performance potential and design challenges. Many performance optimizations presented for soft processors are only suitable to reduce the performance gap to hard embedded or general-purpose processors, where the respective techniques are already integrated. This includes pipelining to enable relatively high clock frequencies, and superscalar architectures with out-of-order execution [183, 217] or VLIW instructions [140, 57, 272].

However, there are also attempts to include features that general-purpose processors didn't adopt, because only narrowly focused application classes profit from them. For example, with MARC, Lebedev et al. [164] focus on a manycore approach, demonstrated with the, at that time, high number of 48 processing cores on a single FPGA in 2010. With customized cores, they achieved one third of the performance of a fully customized FPGA design. The MARC design also makes use of up to 4-way hardware multithreading, which is common for many GPU designs. Kingyens and Steffan [151] put a particular emphasis on this design aspect in their GPU-inspired FPGA overlay by supporting up to 256 concurrent thread contexts through 64-way hardware multithreading and 4 parallel SIMT execution units. They show that this approach allows them to keep an ALU with an extremely deep pipeline of 53 clock cycles almost fully utilized.

Much attention has been given to instruction-programmable overlays that gain their performance from vector execution units, inspired both by classical vector processors and by modern SIMD instruction set extensions. The VESPA architecture [282] includes vector units with up to 256 vector elements and up to 32 lanes that support vector chaining. In VIPERS, Yu et al. [284, 285] add, in addition to a distributed register file, a local memory to each vector lane in order to support efficient table lookups, for example needed by Advanced Encryption Standard (AES) encryption. In the VEGAS architecture, Chou et al. [56] let the vector units read and write "directly [from and] to a scratchpad memory instead of a vector register file" and also support variable operator sizes like the SIMD units of modern general-purpose CPUs. With VENICE, Severance and Lemieux [228] introduce operations on 3D vectors and on unaligned vectors. They achieve good area efficiency also for applications with limited DLP and hence were able to implement a multicore design to exploit additional TLP. All these vector architectures have been demonstrated in stand-alone embedded systems and mostly evaluated in comparison to a soft Nios II processor without vector units. For

12http://www.gaisler.com/index.php/products/processors/leon3 13http://braap.org/or1200/lo3/spec.html


VIPERS, Yu et al. [284, 285] also include a comparison to custom designs created with Altera's discontinued HLS tool C2H and found the vector processor to scale to much higher performance points, whereas HLS achieved better performance per area for small designs. A single, highly regular matrix multiplication kernel on a VEGAS design [56] was also compared to a contemporary general-purpose processor and achieved a speedup of around 1.5x when the general-purpose processor's SIMD units were not used. On the Convey HC-1, which will be introduced in more detail in Chapter 4, a similar instruction-programmable vector overlay, denoted as Vector Personality, is shipped as part of a high-performance general-purpose and HPC computing system. The vector units are split between four FPGAs and support double precision floating point operations and vectors of up to 1024 elements. A comprehensive performance comparison to custom designs and general-purpose processors is the subject of Chapter 4.

The academic vector processor overlays have been programmed with hand-written inline assembly code [282, 284, 285] or C macros [56, 228], which are formalisms that are more accessible to software developers than hardware design. However, they lack portability to other architectures and generally allow much less productivity than high-level language constructs. For the Vector Personality of the Convey HC-1, a C/C++ compiler is shipped that compiles suitable pragma-annotated code segments to the vector coprocessor. In Chapter 5, we present shortcomings of this compiler and our extended toolchain that covers more suitable code segments and is independent of pragmas.

Beyond the presented parallelization aspects, instruction-programmable overlays can profit from the versatility of FPGAs through customization to specific applications or application domains. For the VESPA architecture, Yiannacouras et al. [282] vary different design parameters of the vector processor, including support for heterogeneous vector lanes, and show that choosing the right processor variant per application increases peak performance per area on average by 13%. Lebedev et al. [164] customize not only the multicore configuration, but also the datapaths of their MARC cores for their Bayesian network inference to achieve around one third of the performance of a manually designed fully custom circuit on the same FPGA. With the VectorBlox MXP design, Severance et al. [227] have extended their VENICE approach to support customized vector instructions that can internally pipeline several operations. With custom instructions for an N-body simulation, they achieve more than 100x speedup over the 32-lane base design. The VectorBlox MXP design can be commercially licensed14.

The customization of datapaths along with the integration of custom instructions into the ISA has also been investigated independently from specific parallelization approaches, both for hard [96, 26] and soft [59] ASIPs. For example, specific instructions for AES encryption have been integrated into a tiny soft processor to achieve 3x more performance on this application than a larger non-customized soft processor [95]. As mentioned in Section 2.2.2, custom instructions for cryptography have also been included into general-purpose processor designs due to their high performance potential with moderate hardware investments. However, a soft processor can be customized for each application individually

14http://vectorblox.com/

with a different instruction set for each application [59], thus exhibiting the reuse of the same configurable resources for different tasks that is one characteristic feature of general-purpose computing.

3.2.2 Reconfigurable Hardware beyond FPGAs

Further research in the domain of instruction set extension (ISE) has been conducted targeting an intermediate architectural approach, where a reconfigurable functional unit (RFU) for custom instructions is integrated with a conventional general-purpose processor. The Chimaera architecture features an RFU with FPGA fabric, with an array of logic blocks and routing resources constrained to a downward data flow [114]. Automatically [194] or by hand [114], the data flow graphs (DFGs) from basic blocks without memory operations are mapped to this unit. Evans et al. [79] investigate graph coverage for different coarse-grained RFU designs without memory operations. In the OneChip architecture [48], the RFU additionally has a memory interface that allows it to perform load and store operations independently from the host processor.

Other projects like PipeRench [93], MorphoSys [236], GARP [45], MOLEN [258] or Zippy [210] assume more autonomous reconfigurable units that can execute entire loop nests instead of mere DFGs. They demonstrate a wide range of speedups over simple RISC processors, mostly for signal processing and media compression benchmarks. Even though some prototypes of such architectures were evaluated on FPGAs, the target platforms are coarse-grained reconfigurable arrays (CGRAs) implemented in hardware. These architectures contain configurable elements (denoted as configurable function blocks (CFBs), processing elements (PEs) or functional units (FUs)) and an interconnect fabric that both operate on data words instead of the individual bits of FPGAs. This higher granularity reduces the overheads associated with FPGAs, as well as the configuration space, both in terms of required configuration bits and in terms of design space for placement and routing. For a survey of early coarse-grained reconfigurable architectures, we refer to Hartenstein [112].

A number of commercial architectures with configurable or reconfigurable arrays of coarse-grained processing elements as accelerator units for embedded processors followed [30, 181, 250]. Amano [21] surveys more of these architectures. They had limited success, mainly in signal and media processing domains, where they were mostly presented as solutions with fast time-to-market for special-purpose embedded systems [180, 179, 261, 37]. In such scenarios, CGRAs compete with fully specialized accelerators and the benefit of reconfigurability is limited to updates and bug-fixes. When the CGRAs' structure and FUs need to be customized for specific tasks in order to achieve competitive performance and efficiency, this customization needs to happen at the design time of a specific product [180, 37, 234], with the involved impact on NRE cost, time-to-market and required product volumes. This is one motivating factor why CGRAs have also been implemented as a form of structurally programmable overlay on FPGAs, with the option for specialization after device fabrication time.


3.2.3 Structurally Programmable Overlays

FPGA fabric attached in some form to a processor can be used to implement arbitrarily specialized accelerators. Yet, as discussed, the design of such accelerators is difficult and hard to automate, requires lengthy synthesis processes, and also the actual reconfiguration times can turn out to be a performance issue. Overlay architectures that present a more restricted structure to designers and tools can facilitate the design and speed up the synthesis and reconfiguration times. As outlined in the previous subsection, a commonly investigated structure is that of CGRAs, which were earlier envisioned as dedicated hardware and sometimes prototyped on FPGAs, but more recently are explicitly designed for realization on FPGAs. A wide range of architecture variants for CGRAs on FPGAs has been explored [153, 234, 63, 81, 167, 46, 142, 170].

The predominantly considered execution model for CGRAs is to configure them to repeatedly execute a specific DFG, be it in a loop or with subsequent invocations. Considering the mapping of a DFG's nodes and edges that specify operations and data flows to a CGRA helps us to discuss many architectural aspects of CGRAs on FPGAs. PEs can cover either just a single node of the DFG [98, 46, 142], or different nodes according to a global schedule [210, 234, 63, 81, 167] or to a local control program [153]. The operations of PEs can be homogeneous [167, 142], heterogeneous [98, 46] or specialized when the overlay is generated [153, 234, 62]. The interconnects between PEs can be local or global [81]. They can have configurable buffering capabilities to compensate for non-balanced datapaths [46].

CGRAs on FPGAs are open to specialization for specific applications or application groups. Lin and So [167] show that customization of interconnect topologies improves the energy-delay product by 9% - 28% for a set of synthetic and real world DFGs with several thousand nodes, mapped on an array with 16 processing elements. Building mainly upon DSP blocks as FUs, the intermediate fabrics from Coole and Stitt [63] use on average only 18% more resources than comparable circuits directly mapped to the underlying FPGA. After additional customization of the interconnect, this overhead is reduced to around 10%. For applications with several kernel loops, an overlay even uses on average 60% less area than a direct parallel FPGA implementation of circuits for the respective three to seven kernels [62]. This is achieved by using common subnets among different kernels. Ma et al. [170] fix in their design only the routing overlay and advocate the use of application-specific PEs through partial reconfiguration. The PEs are synthesized according to computation patterns expressed in a domain-specific language (DSL) and are mapped to available slots dynamically at runtime.

For several variants of CGRAs on FPGAs, tool support has been presented for automated mapping of DFGs to the overlay, including placement, routing and, where applicable, scheduling. The tool runtimes for these steps are several orders of magnitude lower than when directly targeting FPGA fabric, in the order of milliseconds to a few seconds for the investigated designs [63, 81, 46]. With little or no manual intervention, these tools also enable scaling to overlay designs of different sizes. When the tools are extended to support compatible high-level source formats, this lays out a path to convenient code portability

between different overlay approaches and different underlying FPGAs. In [61], Coole and Stitt present a synthesis flow starting from OpenCL specifications, which can both generate suitable application-specific overlays and generate configurations for existing overlays. The tool runtimes for configuring overlays are short enough to match the just-in-time compilation paradigm for OpenCL kernels.

Other projects have gone even further, toward transparent just-in-time mapping of runtime-detected DFGs or loops in binary format. For example, Beck et al. [31] and Bispo et al. [34] map instruction sequences to CGRA architectures, going beyond individual DFGs by speculation [31] or by identification of so-called megablocks that cover typical control flow inside a loop nest [34]. Grad and Plessl [99, 100] generate for individual DFGs custom datapaths that are configured into regions for partial reconfiguration. For custom instructions through partial reconfiguration at runtime, Koch et al. [155] investigate methods to minimize the hardware overhead of reconfiguration modules.

Partial reconfiguration regions for customized datapaths of custom instructions are just one example of structurally programmable overlay architectures that do not belong to the category of CGRAs. We present a short selection of such approaches and their design goals. With SCORE [49], Caspi et al. investigate partial reconfiguration regions with a particular focus on scaling with different hardware sizes. Research on overlay architectures for finite-state machines [243, 60] aims at complementary control regions for the dataflow-centric CGRA architectures, but may also be useful for independent computing tasks like complex pattern matching for security applications [73]. Virtual FPGAs have, beyond fundamental research questions about architectures and tools, been investigated as a portable intermediate layer for custom instructions [154] or more generic FPGA circuits [169, 101, 39]. The area and frequency overheads of fine-grained virtual FPGAs seem prohibitive for general-purpose applicability [39], but customizations like coarse-grained virtual resources and time-multiplexing of resources may open up a path to lower overheads, area scalability and fast synthesis times [101]. Hung and Wilton [133] virtualize only the interconnect network of an FPGA to dynamically insert trace buffers for debugging into synthesized FPGA designs.

To summarize this section, we conclude that structurally programmable overlays can have manifold incarnations that exhibit different patterns of computing with FPGAs. Design-time specialization for different applications can reduce the performance and efficiency gap to fully custom designs. Mapping applications to overlay architectures is generally much faster than synthesizing to the underlying FPGA fabric, and some projects have demonstrated favorable scaling properties. Building upon these properties, a library of overlays with suitable binary or OpenCL synthesis flows [31, 35, 61] is a promising way to overcome FPGA productivity issues for general-purpose computing. Instruction-programmable overlays contribute to this mix and are equally dependent on compilation tools that start from common software or intermediate representations. However, in contrast to the structurally programmable overlays, where particularly Coole and Stitt closely investigated the overlay overheads compared to fully customized designs, these trade-offs are insufficiently understood for instruction-programmable overlays.
Lebedev et al. [164] present a single comparison of their manycore overlay MARC with a custom

design and observe a 3x performance overhead. Yu et al. [285] compare their vector overlay with results of an early HLS tool and observe up to 8x more performance with their overlay, which mainly demonstrates that HLS on unaltered source code was not competitive in 2008.

3.3 Chapter Conclusion

In this chapter, we have analyzed how trade-offs between productivity and performance potential have mostly prevented FPGAs from entering the general-purpose computing domain. The integration into a common memory space and hierarchy can increase productivity, but has not been a prerequisite for GPU adoption, where software solutions have masked this challenge. Instead, the accessibility of the programming models and tools, along with performance potential, portability and scalability, is a major hurdle for FPGAs. Considering recent progress in FPGA tools and system architectures, we propose three pillars that jointly can provide beneficial characteristics for different developer requirements.

While with OpenCL and application libraries, two of these pillars already have strong industry backing, we focus on overlays and reviewed the mostly academic research in this area in the previous section. The combination of instruction-programmable and structurally programmable overlays provides a diverse space of architectural patterns, tooling approaches and specialization opportunities. Based on this analysis, we identify four research questions around this overlay concept.

1. What amount of overheads do overlays on FPGAs involve when compared to the implementation of fully customized designs on the same FPGAs? As discussed in the previous section, this question is not sufficiently investigated, in particular not for instruction-programmable overlays. Furthermore, can these overheads be attributed to specific design differences between the computation patterns within the overlay and without it? On a broader scope, where does the usage of overlay architectures put the performance and efficiency of FPGAs compared to other architectures? Figure 3.2 takes up the qualitative illustration from Figure 2.2 in Subsection 2.2.4 and illustrates the context of this research question.

2. Can overlays sufficiently increase the productivity when targeting FPGAs with applications that originate from the domain of general-purpose computing and are designed with standard, software-based principles? For many overlays, fast back-end design tools have been presented that allow fast code generation or synthesis compared to the tools targeting FPGAs directly. However, these tools don't start from software implementations on abstraction levels typical for general-purpose computing, but rather require much more target-specific application descriptions.

3. How many of the manifold application patterns that are generally suitable for FPGA acceleration can be covered with overlay architectures and corresponding tools? The presented overlay architectures are very diverse, but several of the kernels and applications that demonstrate their usage either overlap or originate from similar domains.


[Figure 3.2 content: qualitative efficiency scale with GPPs, GPUs (more operations per instruction, simpler pipeline), overlays (marked with a question mark), FPGAs (LUTs and programmable fabric) and custom hardware accelerators (no instruction overheads, customized datapath, data forwarding, memory accesses), plotted along an efficiency axis.]

Figure 3.2: The usage of overlay architectures on FPGAs involves overheads that need to be quantified and analyzed. It introduces a new area of design points that need to be assessed in comparison to the existing alternatives, here illustrated with regard to efficiency.

4. How can the customization of overlay architectures, which has been shown to have great potential for increasing performance and efficiency, be systematically applied to the diverse set of overlay architectures and be completely or partially automated?

In the next two chapters, we present several contributions to these research questions. In Chapter 4, we primarily address aspects of the first research question by quantifying performance differences between fully customized FPGA designs and an instruction-programmable overlay in an extensive case-study. We also discuss design differences that contribute to these performance gaps. The comparison to other hardware architectures is not the focus of our practical work. In Chapter 5, we demonstrate for the instruction-programmable vector overlay already targeted in Chapter 4 that highly productive compilation flows for such a target architecture are actually possible, even though existing tools for overlay architectures typically lack a high-level design entry.

With regard to the third research question, we consider our contributions as one component of a required large-scale research effort to better understand the interplay of application domains and overlay architectures. In order to broaden the scope of this research, future studies can on the one hand systematically cover more application patterns, for example by using OpenDwarfs [158] implementations as starting point to target overlays, or on the other hand evaluate a few application patterns with a larger set of overlay architectures and corresponding tool flows, as we propose in our thesis outlook (Section 7.2). The fourth research question is not practically addressed within this thesis, except for the analysis of overheads that may point towards profitable customization directions. However, the related work discussed in the previous section outlines the considerable potential of overlay customization. Between performance and productivity, the actual relevance of overlay architectures as a path towards FPGA adoption in general-purpose computing depends on the answers to all four presented challenges. Beyond the scope of these overlay-related research questions, we finally pick up the issue of system integration of FPGAs in Chapter 6.

CHAPTER 4

Stereo-Matching Kernels on Overlay and Custom Designs

In this chapter, we present two approaches for accelerating a stereo-matching application with general-purpose characteristics on FPGA platforms. Within this application, we identify and offload ten kernels, all with abundant DLP, but with different dependency patterns and computational intensity. As first solution, we implement individually specialized dataflow kernels in a spatial programming language for a Maxeler FPGA platform. As second approach, we target an instruction-programmable overlay on the application FPGAs of a Convey HC-1, which implements a vector coprocessor with large vector lengths.

The high-level contribution of this chapter is the demonstration that, with average overheads of 3x in raw performance over fully customized FPGA designs after compensating for the different platform characteristics, the overlay architecture can still contribute to speedups over general-purpose CPUs in a general-purpose computing scenario. When practical reconfiguration overheads limit the parallelism of customized kernels, the overheads are reduced to 2.5x. The broad quantification of this overhead over custom designs is a novel contribution for instruction-programmable overlays. A distinction between different kernel patterns also allows us to point to possible reasons for these overheads.

This chapter also contains several technical contributions towards this goal. Besides the actual working kernel designs with state-of-the-art optimizations for two different FPGA platforms, a memory management wrapper library is introduced that greatly facilitated the practical implementation and evaluation with the different memory models of the platforms. Also, scalable designs for the customized dataflow kernels along with accompanying performance models allowed the guided synthesis of standalone kernels with maximal parallelism and a fused design with wider functionality. The presented stereo-matching implementation is to date the most accurate one with FPGA acceleration, though admittedly also slower than the ones running purely on FPGAs.

Most of this chapter was published in [6], supplementing two previous publications that cover the acceleration of this stereo-matching application on the two individual target platforms [4, 5]. Beside Tobias Kenter as lead researcher and main contributor to the

presented implementations and measurements, Henning Schmitz contributed during his Bachelor's project and as student assistant to the first implementation of the software application used in [4] and to parallelization ideas and implementations on both FPGA targets.

In Section 4.1, we first outline the general stereo-matching task, before presenting in Section 4.2 the concrete algorithm we accelerate. We then introduce in Section 4.3 the two accelerator platforms and how they are programmed in this work. Before the concrete kernel implementations are described side-by-side for both platforms in Section 4.5, we outline the common acceleration principles and memory management concepts in Section 4.4. Section 4.6 documents the setup of our experiments. Comparing both systems in Section 4.7 requires some normalization to account for the different hardware platforms, but gives us insights about the trade-offs in runtime, design effort and tool runtimes. Section 4.8 discusses related stereo-matching implementations on FPGA platforms, before concluding in Section 4.9.

4.1 Introduction to the Stereo-Matching Problem

Stereo-Matching is the computation of a disparity map from a pair of stereo images. The disparity specifies at every position in the image how far the displayed object or feature appears displaced between the two images due to the different positions of the two camera lenses. In earlier work, the term Stereo Correspondence has been used for the same problem [259, 222]. By inverting the disparity information and scaling it according to the geometry of the camera system, actual depth information about the scene is obtained, which is denoted as Stereo Vision and is probably the most important method for computer vision. Applications for computer vision in general and stereo-matching in particular range from classical embedded use cases in automotive and industrial contexts or robot navigation [254] to personal or professional usage on general-purpose hardware for 3D media generation or 3D data acquisition.

Common design goals for all types of applications are high matching quality and high processing speed, yet with varying priorities and additional constraints, e.g. on image resolution, on latency or throughput, or on power and resource limitations. Up to now, the most accurate methods, like the one employed here, do not match the typical real-time and power constraints of embedded systems. Of the markets outlined in Subsection 2.3.3, the task fits personal mobile devices, which may also be used for image acquisition, but also PCs or cloud datacenters, to which the stereo-matching task may be offloaded from a camera or mobile device to avoid draining the battery and to achieve higher total performance.

Most stereo-matching algorithms perform so-called dense stereo-matching, that is, they compute a disparity map containing a disparity value for each pixel. This value represents how far the object that this pixel belongs to appears shifted between the left and right stereo image. More formally, a disparity value d = d_left(x, y) for a pixel in the left stereo image at position (x, y) signifies that the physical feature displayed by this pixel is believed to be found in the right stereo image at position (x − d, y). If a corresponding right

disparity image is computed, then to be consistent, the corresponding disparity in the right image, d_right(x − d, y), should contain the same disparity value d, pointing back to position ([x − d] + d, y) = (x, y). For this definition of disparity and consistency to be precise, the two images need to be perfectly horizontally aligned.

As auxiliary metric to compute disparities, many algorithms use a cost value, also denoted as matching cost, for each pixel at each possible disparity, C(x, y, d), thus forming a three-dimensional cost volume, where a low cost signifies that it is plausible that this pixel should have the corresponding disparity. Throughout this chapter, we use the term cost purely for this matching cost and not for system cost as introduced in Subsection 2.1.1.

The general sequence of modern stereo-matching approaches comprises three steps [268]: first, computation of a matching cost volume, second, an optimization method which computes a disparity map from the cost volume, and third, post-processing of the disparity map. For computing the initial cost volume, metrics for local color similarity and for local structural similarity are commonly employed [182, 123]. To smoothen the cost volume, aggregation techniques can be employed [85, 259]. Optimization in the simplest form, often called winner takes all (WTA) [123], just selects the disparity with the lowest cost for each pixel: d(x, y) = arg min_d C(x, y, d). Other approaches like belief-propagation (BP) and graph-cuts (GC) seek to combine low matching costs with properties like low energy of the resulting disparity map. Beyond generic image augmentation approaches, post-processing often involves consistency checks between the disparity maps generated for left and right image and handling of the identified inconsistencies [124, 182].

For many stereo methods, there exist variants which incorporate more than two input images, typically but not necessarily captured from a set of cameras placed along one horizontal line. The additional viewpoints open up additional opportunities for consistency checks among derived disparity maps and can particularly help to fill occluded areas when they are visible in one of the additional input images. However, these variants still rely on a good underlying stereo-matching algorithm, like the one utilized as case study here.
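As an illustration of these two concepts, the following C++ sketch performs WTA disparity selection over a cost volume and the left-right consistency check just defined; the flat memory layout and helper names are illustrative assumptions, not the data structures used later in this chapter.

```cpp
#include <cstdlib>
#include <vector>

// Winner-takes-all: for every pixel, pick the disparity with the lowest cost.
// The cost volume is assumed to be stored as cost[(d * height + y) * width + x]
// (index arithmetic in int for brevity).
std::vector<int> winnerTakesAll(const std::vector<float>& cost,
                                int width, int height, int numDisparities) {
    std::vector<int> disparity(width * height, 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int bestD = 0;
            float bestCost = cost[(0 * height + y) * width + x];
            for (int d = 1; d < numDisparities; ++d) {
                float c = cost[(d * height + y) * width + x];
                if (c < bestCost) { bestCost = c; bestD = d; }
            }
            disparity[y * width + x] = bestD;
        }
    }
    return disparity;
}

// Left-right consistency: d_left(x, y) should point to a pixel in the right
// disparity map that carries the same disparity value (within a tolerance).
bool isConsistent(const std::vector<int>& dLeft, const std::vector<int>& dRight,
                  int width, int x, int y, int tolerance = 0) {
    int d = dLeft[y * width + x];
    if (x - d < 0) return false;                  // correspondence outside the image
    return std::abs(dRight[y * width + (x - d)] - d) <= tolerance;
}
```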

4.2 Stereo-Matching Algorithm with Inherent Parallelism

In our work, we algorithmically follow the stereo-matching implementation published by Mei et al. [182]. It follows the three basic steps outlined in Section 4.1, but splitting the first step of cost computation into two separate phases, we subdivide it here into a total of four phases. Figure 4.1 gives a high-level overview of the stereo-matching sequence. In the first phase, cost initialization, two similarity metrics are applied on the input images to compute for each pixel and each possible disparity a local cost value, thus forming the first cost volumes. In the second phase, cost aggregation, the costs of neighboring pixels of the same disparity are aggregated in adaptive support regions, which are determined by color differences and absolute distances. This smooths the original cost volumes. In the third phase, scanline optimization, an energy minimization approach is mimicked by dynamic programming along 1-dimensional scanlines. This produces a first pair of disparity maps, but also another pair of cost volumes that are used in the fourth phase, disparity refinement.


[Figure 4.1 content: processing pipeline with the four phases Cost Initialization → Cost Aggregation → Scanline Optimization → Disparity Refinement.]

Figure 4.1: High-level overview of the stereo-matching algorithm following [182]. From a pair of stereo images, intermediate cost volumes are computed, which are used to generate disparity maps as final result.

This fourth phase performs a consistency check between the left and right disparity maps and applies several local optimizations for pixels that are not classified consistently. As the most time-consuming parts, and the parts where the accelerated kernel functions are located, we present some details about the mechanisms of cost aggregation and scanline optimization, and briefly outline the two less time-consuming steps, cost initialization and disparity refinement, which are executed on the CPU in our work.

4.2.1 Cost Initialization

The cost initialization following Mei et al. [182] provides the first cost metric C_i(x, y, d) for each position and disparity based on two individual components. The first component is called the absolute difference cost C_AD for a pair of left- and right-image pixels in RGB format. This cost is defined as the difference of pixel intensities I, averaged over the three color channels:

\[ C_{AD}(x, y, d) = \frac{1}{3} \sum_{i = R,G,B} \left| I_{\mathrm{left},i}(x, y) - I_{\mathrm{right},i}(x - d, y) \right| \]

The second component is the census cost C_census, computed as the Hamming distance of the census transforms of a left and corresponding right pixel. This census transform captures the local structure in a 9 × 7 window around each pixel. Structural information is less sensitive to variations in lighting between the left and right image. These two cost components are individually scaled by an exponential function that also enables the weighting of outliers and then added up to form the initial cost.
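To make the two components concrete, the following C++ sketch computes the AD cost, a 9 × 7 census transform with Hamming distance, and their weighted combination for a single pixel and disparity; border handling is omitted, the image layout is an assumption, and the weighting function and lambda values are merely plausible placeholders for the exponential scaling described above.

```cpp
#include <bitset>
#include <cmath>
#include <cstdint>
#include <cstdlib>

// Sketch of the initial cost for one pixel pair (x, y) and (x - d, y). The image
// layout, the omitted border handling and the parameters lambdaAD and lambdaCensus
// are illustrative assumptions; rho(c, lambda) = 1 - exp(-c / lambda) is one
// plausible form of the exponential scaling mentioned in the text.
struct RGBImage {
    int width, height;
    const uint8_t* data;                            // interleaved R, G, B values
    int at(int x, int y, int ch) const { return data[(y * width + x) * 3 + ch]; }
    float gray(int x, int y) const {                // intensity for the census window
        return (at(x, y, 0) + at(x, y, 1) + at(x, y, 2)) / 3.0f;
    }
};

// Absolute difference cost, averaged over the three color channels.
float adCost(const RGBImage& left, const RGBImage& right, int x, int y, int d) {
    float sum = 0.0f;
    for (int ch = 0; ch < 3; ++ch)
        sum += std::abs(left.at(x, y, ch) - right.at(x - d, y, ch));
    return sum / 3.0f;
}

// Census transform over a 9x7 window: 62 comparisons against the center pixel,
// packed into a bit mask (no border handling in this sketch).
uint64_t censusTransform(const RGBImage& img, int x, int y) {
    uint64_t bits = 0;
    float center = img.gray(x, y);
    for (int dy = -3; dy <= 3; ++dy)
        for (int dx = -4; dx <= 4; ++dx) {
            if (dx == 0 && dy == 0) continue;
            bits = (bits << 1) | (img.gray(x + dx, y + dy) < center ? 1u : 0u);
        }
    return bits;
}

// Combined initial cost: exponentially scaled AD and census (Hamming) costs.
float initialCost(const RGBImage& left, const RGBImage& right, int x, int y, int d,
                  float lambdaAD = 10.0f, float lambdaCensus = 30.0f) {
    float cAD = adCost(left, right, x, y, d);
    uint64_t diff = censusTransform(left, x, y) ^ censusTransform(right, x - d, y);
    float cCensus = float(std::bitset<64>(diff).count());   // Hamming distance
    auto rho = [](float c, float lambda) { return 1.0f - std::exp(-c / lambda); };
    return rho(cAD, lambdaAD) + rho(cCensus, lambdaCensus);
}
```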


[Figure 4.2 content: left panel – horizontal first aggregation region; right panel – vertical first aggregation region; with the up, down, left and right arms and the integral sums C_s, row segment costs C_h and aggregated cost C_A marked at the respective positions.]

Figure 4.2: Illustration of cross-based cost aggregation regions. Left side: projection of horizontal arms on the vertical arms of a pixel. Right side: projection of vertical arms on the horizontal arms of a pixel.

4.2.2 Cost Aggregation

The idea of cost aggregation is to reduce the large amount of noise contained in the local cost metrics. Instead of simple smoothing, the costs for each possible disparity are aggregated over a limited area around each pixel, which likely belongs to the same object of the image and therefore should have similar disparity values. Therefore, aggregation areas should track object boundaries in shape and size as well as possible. However, computing individual aggregation areas for each pixel and summing up the costs inside them can be very compute-intensive. The cross-based aggregation method utilized here was first proposed by Zhang et al. [286]. The areas are defined by the lengths of four arms for each pixel, two extending to the left and right, two up and down. The arm lengths are computed prior to the actual aggregation step, depending on color differences and parametrized thresholds. Two possible aggregation areas are now formed by all vertical arms that belong to pixels on the horizontal arms of each pixel and, respectively, the other way round, as illustrated in Figure 4.2. Horizontal first aggregation areas can cover vertical object boundaries better, while vertical first aggregation is more precise for horizontal object boundaries.

Figure 4.2: Illustration of cross-based cost aggregation regions. Left side: projection of horizontal arms onto the vertical arms of a pixel. Right side: projection of vertical arms onto the horizontal arms of a pixel.


Algorithm 4.1 Horizontal aggregation step. Note: for-all-loops are independent, for-loops are ordered.

Input: ∀(x, y, d): Ci(x, y, d) = input cost
Input: ∀(x, y, d): Aleft/right(x, y, d) = precomputed horizontal arm lengths
Output: ∀(x, y, d): Ch(x, y, d) = aggregated costs of row segment around position (x, y) in disparity d
1: for all d ∈ disparities do
2:   for all y ∈ rows do  {compute integral sums}
3:     Cs(0, y, d) ← Ci(0, y, d)
4:     for x = 1 to #columns do
5:       Cs(x, y, d) ← Cs(x − 1, y, d) + Ci(x, y, d)
6:     end for
7:   end for
8:   for all y ∈ rows do  {compute row segment costs}
9:     for all x ∈ columns do
10:      al ← Aleft(x, y, d)
11:      ar ← Aright(x, y, d)
12:      Ch(x, y, d) ← Cs(x + ar, y, d) − Cs(x − al − 1, y, d)
13:    end for
14:  end for
15: end for

For both aggregation areas, the actual aggregation can be performed in linear time with the help of integral sums. Pseudocode for the horizontal aggregation step is given in Algorithm 4.1. As the outer loop indicates, the step is performed independently on each disparity d. The first loop nest computes for each row the integral sum of costs from the row's first element to the current element. In the second nested loop, for each position (x, y), the difference between two elements of the integral sum is taken, with the element positions defined by the arm lengths at (x, y). This difference is exactly the sum of costs in the horizontal segment around (x, y) that is specified by the two arms. In Figure 4.2, aggregation for one disparity is illustrated for the topmost and bottommost rows of the horizontal first aggregation region, where the horizontally aggregated costs Ch depend on two elements of the integral sums Cs. Afterwards, in the vertical aggregation step, vertical integral sums (not illustrated in the figure) are computed and the aggregated cost CA is computed again as the difference between the running costs at two positions, here at the two positions that are marked with Ch in the example. Note that the left and upper positions, from which the respective integral sums are taken, are not part of the aggregation area itself. A pair of horizontal and vertical aggregation steps forms one aggregation iteration with the illustrated horizontal first aggregation region. Following Mei et al. [182], we execute a total of four such aggregation iterations, the first and third one using horizontal first regions and the second and fourth one using vertical first regions. Not mentioned by Mei et al. [182] is a normalization step after each aggregation iteration, in which the aggregated cost is divided by the size of the respective aggregation area. This normalization was already proposed by Zhang et al. [286] and also utilized by Shan et al. [230], and we found it to be important for the result quality of our implementation.
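Expressed in plain C++ for a single row of one disparity slice, the horizontal aggregation of Algorithm 4.1 reduces to a prefix sum followed by one subtraction per pixel; array names and the clamping at the row borders are choices made for this sketch only.

#include <vector>
#include <algorithm>

// Horizontal aggregation for one row of one disparity slice:
// 1) integral (prefix) sums Cs over the input costs Ci,
// 2) row-segment cost Ch(x) = Cs(x + ar) - Cs(x - al - 1),
//    where al/ar are the left/right arm lengths at x. The element at
//    position x - al - 1 is not part of the segment itself.
void horizontalAggregateRow(const std::vector<double>& ci,
                            const std::vector<int>& armLeft,
                            const std::vector<int>& armRight,
                            std::vector<double>& ch) {
    const int width = static_cast<int>(ci.size());
    std::vector<double> cs(width);
    cs[0] = ci[0];
    for (int x = 1; x < width; ++x)          // integral sums
        cs[x] = cs[x - 1] + ci[x];
    for (int x = 0; x < width; ++x) {        // segment sums via differences
        int right = std::min(x + armRight[x], width - 1);
        int left  = x - armLeft[x] - 1;      // element left of the segment
        ch[x] = cs[right] - (left >= 0 ? cs[left] : 0.0);
    }
}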

4.2.3 Scanline Optimization

The scanline optimization follows Hirschmüller's [124] semiglobal matching strategy. Global matching would perform a 2-dimensional energy minimization for the entire image, minimizing the weighted sum of the energy in the final disparity image and of the involved matching costs for this disparity image. The scanline optimization mimics this idea along 1-dimensional lines, but avoids costly minimization steps and instead uses a dynamic programming approach, where the previous disparity decisions along the scanline are fixed and only the energy trade-off for the current step is considered. Equation 4.1 outlines the basic recursion equation and Algorithm 4.2 illustrates pseudocode for one scanline direction.

Cr(x, y, d) = CA(x, y, d) + min[ Cr(x − rx, y − ry, d),
                                 Cr(x − rx, y − ry, d ± 1) + P1(x, y),
                                 min_k Cr(x − rx, y − ry, k) + P2(x, y) ]
              − min_k Cr(x − rx, y − ry, k)                                      (4.1)

with r ∈ {right, left, down, up} and (rx, ry) ∈ {(1, 0), (−1, 0), (0, 1), (0, −1)}.

The scanline cost Cr in the equation is computed along a scanline path that depends on the direction r = (rx, ry), which in the pseudocode example is right = (1, 0) to define a scanline to the right, with accordingly denoted scanline cost Cright. The scanline cost depends on the aggregation cost CA and a term requiring all scanline costs at the previous pixel position along the scanline path. This previous pixel position is given by (x − rx, y − ry) in the equation and by (x − 1, y) in the pseudocode example. This term depending on the previous position reflects the energy minimization concept, selecting either the scanline cost from the previous position at the same disparity, or the scanline cost from the previous position at a neighboring disparity plus a small penalty P1(x, y), or the minimal scanline cost of all disparities at the previous position plus a larger penalty P2(x, y). These paths trade off energy components added by the matching costs against energy components from the disparity profiles represented by the penalties P1(x, y) and P2(x, y). Both penalty values are chosen for each specific position based on the color differences of the original images. Finally, for normalization, the minimal scanline cost at the previous position is subtracted. Figure 4.11, shown later in Section 4.5.2, serves us mainly to illustrate the compute and parallelization pattern of our implementations, but also contains a numeric example of a scanline computation, here of a downward scanline. For simplicity, costs are represented as integer values, with an aggregation cost of 0 for the second line.


Algorithm 4.2 Scanline optimization step in left-to-right orientation (denoted as ScanRight). Note: for-all-loops are independent, for-loops are ordered.

Input: ∀(x, y, d): CA(x, y, d) = aggregated cost
Output: ∀(x, y, d): Cright(x, y, d) = right scanline cost
1: for all y ∈ rows do
2:   for all d ∈ disparities do
3:     Cright(0, y, d) ← CA(0, y, d)
4:   end for
5:   Cmin(0, y) ← min_k Cright(0, y, k)
6:   for x = 1 to #columns do
7:     for all d ∈ disparities do
8:       c0 ← Cright(x − 1, y, d)
9:       c−1 ← Cright(x − 1, y, d − 1) + P1(x, y)
10:      c1 ← Cright(x − 1, y, d + 1) + P1(x, y)
11:      ck ← Cmin(x − 1, y) + P2(x, y)
12:      Cpath ← min(c0, c−1, c1, ck)
13:      Cright(x, y, d) ← CA(x, y, d) + Cpath − Cmin(x − 1, y)
14:    end for
15:    Cmin(x, y) ← min_k Cright(x, y, k)
16:  end for
17: end for

Green arrows indicate the minimization paths taken to compute the scanline costs in the second row depending on the previous row and the input aggregation costs. These green arrows reflect the best trade-off between minimization of the input costs and the scanline energy for any given position.

In the abstract description of stereo-matching approaches in Section 4.1, the optimization step was described as yielding a disparity map. In the more elaborate approach we use, the scanline equation for each direction produces a new cost volume, now incorporating a trade-off between raw matching costs and the energy of the disparity map. This is convenient, as the results of scanline optimization steps along different directions can now simply be combined into an average cost volume, instead of having to combine different discrete disparity maps. On the combined scanline costs, a WTA optimization then selects the actual disparity for each pixel. We use four directions, right, left, down and up, as proposed by Mei et al. [182]. Each scanline by itself produces some streaking artifacts in the direction of the scanline, because the penalty values only favor persistence of previously optimal disparities along the scanline, but not in the reverse direction. Therefore it is important not only to utilize several different scanlines as in [265], but also to have pairs of reverse scanlines to symmetrically offset the streaking.
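A direct C++ transcription of Algorithm 4.2 for one image row may help to make the dependency structure explicit; the row-wise data layout and the handling of the disparity borders (d − 1 and d + 1 outside the valid range) are assumptions of this sketch.

#include <vector>
#include <algorithm>
#include <limits>

// Left-to-right scanline optimization for a single row y.
// ca and cr hold one row of the cost volumes, indexed as [x * numD + d].
// p1 and p2 hold the per-pixel penalties for this row.
void scanRightRow(const std::vector<double>& ca, std::vector<double>& cr,
                  const std::vector<double>& p1, const std::vector<double>& p2,
                  int width, int numD) {
    const double inf = std::numeric_limits<double>::max();
    // x = 0: the scanline cost equals the aggregated cost.
    double prevMin = inf;
    for (int d = 0; d < numD; ++d) {
        cr[d] = ca[d];
        prevMin = std::min(prevMin, cr[d]);
    }
    for (int x = 1; x < width; ++x) {
        double curMin = inf;
        for (int d = 0; d < numD; ++d) {
            double c0 = cr[(x - 1) * numD + d];
            double cm = (d > 0)        ? cr[(x - 1) * numD + d - 1] + p1[x] : inf;
            double cp = (d < numD - 1) ? cr[(x - 1) * numD + d + 1] + p1[x] : inf;
            double ck = prevMin + p2[x];
            double cpath = std::min(std::min(c0, cm), std::min(cp, ck));
            double value = ca[x * numD + d] + cpath - prevMin;
            cr[x * numD + d] = value;
            curMin = std::min(curMin, value);
        }
        prevMin = curMin;  // Cmin(x, y) for the next column
    }
}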


4.2.4 Disparity Refinement

The previous three phases are executed for both the left and right image, producing one disparity image for each side. As indicated earlier, their computed disparity values should match: dleft(x, y) = dright(x − dleft(x, y), y). Pixels for which this is not the case are classified as outliers and are treated with the refinement steps Iterative Region Voting and Proper Interpolation from Mei et al. [182]. Due to the insufficient details given, we skip their Depth Discontinuity Adjustment step, but again perform the subsequent Sub-pixel Enhancement step, which aims to reduce errors caused by the discrete disparity levels.
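The consistency check itself can be written in a few lines; the representation of outliers as a separate mask is a choice of this sketch.

#include <vector>

// Mark pixels whose left and right disparities disagree as outliers.
// dLeft and dRight are dense disparity maps of size width * height;
// outlierMask receives 1 for outliers and 0 for consistent pixels.
void markInconsistentPixels(const std::vector<int>& dLeft,
                            const std::vector<int>& dRight,
                            std::vector<int>& outlierMask,
                            int width, int height) {
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int d = dLeft[y * width + x];
            bool consistent = (x - d >= 0) &&
                              (dRight[y * width + (x - d)] == d);
            outlierMask[y * width + x] = consistent ? 0 : 1;
        }
    }
}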

4.2.5 Software Implementation

As starting point for our acceleration, we use our own software implementation of stereo-matching, which follows these concepts but offers additional features, such as different, parametrizable cost initialization metrics (for more metrics see e.g. [125]), an adjustable sequence of aggregation steps, and an optional OpenGL visualization of aggregation areas, cost volumes and cost metric profiles. The precision of intermediate cost values required for stable results depends highly on the actual images processed. In general, quality degradation with reduced precision is graceful, but in some cases with single-precision floating point, costs after computing differences in the aggregation step can falsely become 0, leading to artifacts. Therefore, we use double precision in our software implementation and also require it from the FPGA acceleration. The extensible, feature-rich software implementation, with an algorithmic structure driven by runtime complexity and result quality rather than by low-level performance optimizations, is characteristic of general-purpose applications as outlined in Subsection 3.1.1. With the settings of Mei et al. [182], our implementation reaches an accuracy in the Middlebury benchmark [222] of on average 5.73% bad pixels, and we make sure during our acceleration process to still produce the same results.

4.3 Utilized FPGA Platforms and Programming Models

In this section, we introduce the two hardware platforms we target and outline how they are programmed in this work. We conclude the section with a brief comparison of the accelerator resources as used in our experiments.

4.3.1 Maxeler Platform and Programming Paradigm

The Maxeler MPC-X platform we use [175] is illustrated in Figure 4.3. It comprises two 6-core (12-thread) Intel Xeon X5650 (Westmere microarchitecture) CPUs, running at 2.66 GHz, as host platform and is equipped with four MAX3424A Vectis PCIe accelerator cards, of which only one is used in this work. Each card contains a large Xilinx Virtex-6 SX475T [275] FPGA for user logic, a smaller, non-user-programmable FPGA for the PCIe interface, and 24 GB of local SDRAM memory. This local memory is called LMem and has to be read or written in bursts of 384 adjacent bytes. However, in order to come close to the possible bandwidth of around 30 GB/s (with memory controllers synthesized at 300 MHz; up to 400 MHz are supported by the DDR3 DIMMs), several bursts, either adjacent or with a fixed stride, should be accessed with a single memory command. For example, commands with only 1 burst each lead to an efficiency of only 11%, whereas with 8 consecutive bursts, an efficiency of 80% is reached. The PCIe interface, on the other hand, can be used to stream data from or to host memory and reaches a bandwidth of 2 GB/s. Note that the memory controller is synthesized by the Maxeler tools onto the user FPGA alongside the custom logic.

Figure 4.3: Illustration of the Maxeler MPC-X platform with only one of the four MAX3424A Vectis accelerator cards shown.

The distinctive feature of the Maxeler systems is their development environment [176], which allows programming the FPGAs in a spatial programming language, denoted as MaxJ and realized as a Java extension. The kernel functionality implemented on the FPGA is integrated with the host (CPU) part of an application through calls to an API automatically generated for the specified functionality. The MaxJ language offers a much higher abstraction than HDL languages like VHDL and Verilog, but much finer control over the design than generating hardware via HLS. Conceptually, MaxJ is built around streams of data, where typically one data element per cycle is processed in a so-called hardware kernel. A sequence of operations on one or several streams is automatically translated into a corresponding compute pipeline, where pipelining may also happen inside individual operations, in particular when they utilize DSP blocks. The streams can be connected to other kernels, to LMem, or via PCIe to host memory, and the Maxeler toolflow automatically generates the required buffers and interfaces.



Figure 4.4: Illustration of the Convey HC-1 platform.

4.3.2 Convey HC-1 Platform with Vector Processor Overlay

The Convey HC-1 [40], illustrated in Figure 4.4, is a dual-socket server system, where one socket is populated with a dual-core Intel Xeon 5138 (Core microarchitecture) CPU, running at 2.13 GHz, while the other socket is connected to a stacked coprocessor board. The two boards communicate using the Intel Front-Side Bus (FSB) protocol. Both processing units have their own dedicated physical memory, which can be transparently accessed by the other unit through a common cache-coherent virtual address space, which distinguishes this platform from the Maxeler system. The coprocessor consists of multiple, individually programmable FPGAs. One FPGA implements the infrastructure that is shared by all coprocessor configurations. These functions include the physical FSB interface and cache coherency protocol as well as configuration and execution management for the user-programmable FPGAs. For implementing the application-specific functionality, four high-density Xilinx Virtex-5 LX330 [274] FPGAs are available. Eight memory controllers are each implemented on a distinct Virtex-5 LX150 [274] FPGA. Each of them accesses two DIMMs, which leads to an aggregated bandwidth of close to 80 GB/s with 16 memory modules. In our system configuration, custom-made scatter-gather DIMMs are installed, which allow accessing memory efficiently in 8-byte data blocks, while standard modules are designed for 64-byte block access. The memory controllers implemented on dedicated FPGAs and the custom-made DIMMs of this platform are, with reference to Subsection 2.2.4, indicators of a platform with high performance along with low volume and high unit-cost. For a high-volume product with similar performance characteristics, lower unit-cost and higher efficiency could be achieved with fixed-function memory controllers as present in every commodity GPP and GPU.

The user FPGAs can be configured with different fixed-function designs or programmable overlays. Fixed-function, problem-specific designs can be fully customized by developers and integrated into the rest of the system through interface libraries written in Verilog. Convey also offers a number of synthesized, ready-to-use designs, so-called Personalities, that provide a fixed functionality with few configuration options, for example for graph traversal or local string alignment. On the other hand, the so-called Vector Personality, which we use in this work, is an instruction-programmable overlay. It represents a highly optimized design that contains the type of abstraction benefits and overheads that we want to quantify in this work. The Vector Personality provides the functionality of a vector coprocessor that executes programs targeting its vector instruction set. It comes in two variants, optimized for single- or double-precision floating-point operations; both also support integer operations, e.g. for vectorized address calculations. Matching our application, we use the double-precision Vector Personality. The vector instructions are implemented for up to 1024 elements. A total of 64 vector registers are available and each can store such a set of 1024 elements. Besides the usual element-wise arithmetic vector operations, the vector instruction set contains memory instructions that distinguish it from typical SIMD vector instruction set extensions of general-purpose CPUs: it can load and store vectors whose elements are individually indexed and do not need to be aligned in contiguous memory locations. Convey includes a compiler to target this Vector Personality by annotating source code with pragmas; however, we found it to be limited to simple array data structures and simple loop nesting patterns, which often requires significant code adaptations besides adding the vectorization pragmas. We fixed many of these shortcomings with the toolflow proposed in [7] and discussed in Chapter 5; however, for the comparison of architectural overheads of the overlay, we wanted to achieve the best possible performance. Therefore, for the work in this chapter, we designed all kernels by hand in assembly code, exploiting, on top of the capabilities of the automated toolflow, additional opportunities such as vector partitioning, vector register rotation and enhanced reuse of partially computed addresses.
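To illustrate what the indexed memory instructions provide beyond contiguous SIMD accesses, the following scalar loop models the gather/scatter semantics of a single vector operation; the element count of 1024 is taken from the description above, everything else is a simplified model rather than the actual instruction behavior.

#include <cstddef>

// Scalar model of one indexed vector load followed by an indexed vector
// store: each of the (up to) 1024 elements has its own, independently
// computed address, so the accessed locations need not be contiguous.
constexpr int kVectorLength = 1024;

void gatherScatterModel(const double* base_in, double* base_out,
                        const long* in_index, const long* out_index,
                        int elements /* <= kVectorLength */) {
    for (int i = 0; i < elements; ++i) {
        double v = base_in[in_index[i]];   // indexed (gather) load
        base_out[out_index[i]] = v;        // indexed (scatter) store
    }
}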

4.3.3 Comparison of FPGA Platforms

Comparing the two hardware platforms, the Convey HC-1 is a few years older, with the utilized FPGAs being one generation behind and the CPUs being two process shrinks (Intel Tick) and one microarchitectural change (Intel Tock) behind. On the other hand, when we compare a single Maxeler MAX3424A Vectis accelerator card to the coprocessor of the Convey HC-1, the latter incorporates considerably more hardware resources. Table 4.1 gives an overview of the accelerator hardware as used in our experiments. Together, the four FPGAs for the HC-1's application logic contain almost 3x more LUTs and somewhat more BRAM resources than the single application FPGA of the MAX3424A. Similarly, the peak memory bandwidth of the Convey HC-1's coprocessor is around 2.5x higher than that of the Maxeler MAX3424A accelerator. This is essentially achieved by using more memory modules. Additionally, the Convey HC-1's memory controllers are implemented on dedicated FPGAs, in contrast to the Maxeler MAX3424A platform, where the memory controller is synthesized along with the application logic onto the same FPGA. For the Convey platform, this saves space on the application FPGAs and avoids timing issues when synthesizing new user designs. Finally, even though both platforms come closest to their peak bandwidth with linear access patterns, physically a much smaller access granularity is supported in the Convey HC-1 configuration we utilize. In Section 4.7, where we assess the effects of the two different approaches to kernel design, we need to compensate for the outlined differences of the hardware platforms.

Table 4.1: Hardware resources of the two FPGA platforms as used in our experiments. *Maxeler MAX3424A memory clock and bandwidth depend on the user design.

Platform             Maxeler MAX3424A      Convey HC-1
Application FPGAs    1× Virtex-6 SX475T    4× Virtex-5 LX330
#6-input LUTs        297600                4× 207360 = 829440
#36Kb BRAMs          1064                  4× 288 = 1152
#DIMMs               6                     16
Memory controllers   on User FPGA          8 dedicated FPGAs
Memory Clock         300 MHz*, variable    300 MHz, fixed
Peak Bandwidth       28.8 GB/s*            74.4 GB/s
Min. Access Size     384 bytes             8 bytes

4.4 Kernel-Centric Acceleration

The general idea of kernel-centric acceleration as followed here is to identify runtime-intensive kernels with acceleration potential and execute them on the FPGA, and to keep other, possibly complex parts of the application with small contributions to the overall runtime on the CPU. This is a typical approach for accelerating existing software solutions on general-purpose accelerators like GPUs.

In order to identify the candidate kernels, we first performed profiling on the CPU. The runtimes of all kernel functions with non-negligible runtimes, aggregated over all their invocations when they are executed more than once, are illustrated in Figures 4.5 and 4.6 for a FullHD input image pair on both CPU platforms. The kernels are sorted by the time of their first invocation, which reflects the overall sequence of cost initialization, aggregation, scanline optimization and disparity refinement; however, there are repetition patterns spanning several of those kernels. Based on this result, we selected the 5 aggregation kernels from horSum to scale and the 5 scanline kernels from scanUp to sumScanlines. They cover 87% of the total program runtime, which by Amdahl's law permits a speedup of at most 7.8x.

Since both platforms investigated in this work have physically distinct accelerator memory, whenever possible, we want to leave data in this accelerator memory when it is read or modified by several different kernels or several invocations of the same kernel. Therefore, beyond the raw execution times, possible data reuse between the kernels was considered. In case of our stereo-matching implementation, the selected kernels cover all cost-volume related compute steps of aggregation and scanline optimization, thus maximizing the reuse potential of data in accelerator-local memory. Based on pure profiling runtimes, the final step of the scanline optimization, sumScanlines, would be a less worthwhile acceleration target than, e.g., the computation of census costs, but it significantly reduces the amount of data to be transferred from accelerator memory to host memory, from four cost volume instances to a single one.

Figure 4.5: Runtimes that different kernels contribute to the overall runtime on the Maxeler platform's host CPU for a FullHD image pair (1920x1080) with a maximal disparity of 80. Yellow and red bars indicate kernels in the aggregation and scanline phase, respectively.

Both utilized target platforms require data to be moved between CPU and accelerator memory, but in different ways. The Maxeler platform [175] requires explicit data movement functionality added to each design by the designer, and the accelerator memory space is entirely managed by the developer [176]. The Convey platform [40] provides a shared memory space and different API functions for allocation on and movement between physical memory locations. In order to abstract these differences away from the application side, we modified and extended the memory manager presented in [5] for the Maxeler platform. An important feature of the memory manager, particularly useful during accelerator kernel development, is to support easy switching between CPU and accelerator execution of individual kernels with all required, but no unnecessary, data movements. Our means to achieve this was to express at the beginning of every kernel which data structures it uses, whether it uses them on the host CPU or the accelerator, and whether it reads or writes to these data structures. With this information, the memory manager keeps track of all data locations and initiates all required transfers prior to actual data accesses.


Figure 4.6: Runtimes that different kernels contribute to the overall runtime on the Convey platform's host CPU for a FullHD image pair (1920x1080) with a maximal disparity of 80. Yellow and red bars indicate kernels in the aggregation and scanline phase, respectively.

In our new memory manager concept, we applied these kernel annotations to both the kernels remaining on the CPU and the wrappers for kernels executing on FPGAs. This goes beyond the modifications required for the methods presented in [5], where only data usage on FPGAs was indicated. The extension is, however, advantageous for the kernel-centric acceleration concept, because it removes the only high-level application knowledge required in the previous version, where transfers from accelerator memory back to the CPU had to be initiated manually, requiring changes for each accelerator kernel that is enabled or disabled during development. Listing 4.3 illustrates some kernel functions using the memory manager interface. Before they actually use data, they indicate by calls to the memory manager API how (mm.reads, mm.writes) and where (locations CPU, ACC) they are going to use it. When a kernel both reads and writes data, or when it does not completely overwrite a structure, so that previous data may still exist after writing, this has to be stated explicitly, as in this example for the first function, which uses b both as input and output. The accelerator kernels (starting with cny for Convey, max for Maxeler) are mere wrappers and subsequently invoke execution on the respective accelerator. Due to the shared address space, the Convey kernel uses the original addresses, whereas the locations in Maxeler local memory are provided by the memory manager (mm.getLMem). Just like in [5], a memory region in Maxeler local memory is allocated lazily before the first usage of a data structure in this memory.


void cpuABtoB(double* a, double* b){
    mm.reads(CPU, a);
    mm.reads(CPU, b);
    mm.writes(CPU, b);
    // CPU kernel code here
}
void cnyAtoB(double* a, double* b){
    mm.reads(ACC, a);
    mm.writes(ACC, b);
    callCnyKernel(a, b);
}
void maxAtoB(double* a, double* b){
    mm.reads(ACC, a);
    mm.writes(ACC, b);
    callMaxKernel(mm.getLMem(a), mm.getLMem(b));
}

Listing 4.3: Different kernel functions using Memory Manager.

Listing 4.4 now illustrates the usage of two of those kernels. First, dynamic arrays are allocated through the memory manager, by default in host CPU memory. Then, for the first kernel call on the CPU in Line 4, the memory manager determines at runtime that both arrays are already in the right location and no movement is required. In this example, the second kernel, Line 5, is executed on the Maxeler accelerator. For the data it reads, c_init, accelerator memory is lazily allocated and the data is moved there from the host. c_agg, on the other hand, is only written to, so it gets allocated in accelerator memory, but no data is actually moved. Line 6 then performs another kernel call on the host CPU. c_init was not modified in accelerator memory, so the memory manager internally still has it in a shared state and no data needs to be moved. c_agg, however, was modified in accelerator memory, and on the CPU it will now be read before it is possibly overwritten, so its data is transferred back by the memory manager.

1 double* c_ad = (double*) mm.alloc(size);
2 double* c_init = (double*) mm.alloc(size);
3 double* c_agg = (double*) mm.alloc(size);
4 cpuABtoB(c_ad, c_init);
5 maxAtoB(c_init, c_agg);
6 cpuABtoB(c_init, c_agg);

Listing 4.4: Code sequence using Memory Manager with Maxeler kernel.

Listing 4.5 repeats the same kernel pattern, just with the accelerated kernel being executed on the Convey platform instead of Maxeler. This time, at the coprocessor kernel call in Line 5, no additional memory is allocated, since host CPU and accelerator share the same memory space. For the input data c_init, a similar data transfer is initiated as on the Maxeler platform, just using a different API with different arguments internally. For the output data c_agg, again no physical data transfer is required. For this purpose, the Convey API contains a migrate_virtual function, which does not actually move any data, but merely lets the affected shared memory area point to the physical accelerator memory. This function comes in two flavors, one that touches all affected memory pages to update internal state such as the TLB (Translation Lookaside Buffer), and one without this touching. The version with page touching guarantees the fastest raw execution time of subsequently executed accelerator kernels and is therefore important for the later evaluation of kernel acceleration. On the other hand, we found the no-touch version in combination with allocation on the host to yield the fastest overall matching performance, because it partially overlaps the page touching effort with actual computation. It is even slightly faster than the alternative of directly allocating in accelerator memory, even though the latter would require additional a-priori knowledge about the first usage location of a data structure. Therefore, we measure and evaluate both versions in our experimental section.

1 double* c_ad = (double*) mm.alloc(size);
2 double* c_init = (double*) mm.alloc(size);
3 double* c_agg = (double*) mm.alloc(size);
4 cpuABtoB(c_ad, c_init);
5 cnyAtoB(c_init, c_agg);
6 cpuABtoB(c_init, c_agg);

Listing 4.5: Code sequence using Memory Manager with Convey kernel.
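For illustration only, the following sketch shows one possible way to realize the location tracking behind calls like mm.reads and mm.writes; the class, its state model and the transfer callbacks are hypothetical and deliberately simplified compared to the memory manager actually used in this work.

#include <functional>
#include <map>
#include <utility>

// Hypothetical sketch of location tracking: every registered pointer is
// either valid on the CPU, on the accelerator, or on both (SHARED).
// reads() triggers a transfer if the requested side is stale; writes()
// invalidates the copy on the other side.
enum class Loc { CPU, ACC };
enum class State { CPU_ONLY, ACC_ONLY, SHARED };

class MemoryManagerSketch {
 public:
  using Transfer = std::function<void(void*)>;  // copies one buffer
  MemoryManagerSketch(Transfer toAcc, Transfer toCpu)
      : toAcc_(std::move(toAcc)), toCpu_(std::move(toCpu)) {}

  void registerBuffer(void* p) { state_[p] = State::CPU_ONLY; }

  void reads(Loc where, void* p) {
    State& s = state_.at(p);
    if (where == Loc::ACC && s == State::CPU_ONLY) { toAcc_(p); s = State::SHARED; }
    if (where == Loc::CPU && s == State::ACC_ONLY) { toCpu_(p); s = State::SHARED; }
  }

  void writes(Loc where, void* p) {
    // After a write, only the writing side holds valid data.
    state_.at(p) = (where == Loc::ACC) ? State::ACC_ONLY : State::CPU_ONLY;
  }

 private:
  Transfer toAcc_, toCpu_;
  std::map<void*, State> state_;
};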

These examples conclude this section on the selection of kernels for acceleration and the concepts and infrastructure to support memory management for both platforms through a common interface.

4.5 Kernel Designs for Two FPGA Platforms

In this section, we present the compute and data-access patterns of the identified time-consuming kernels and outline their parallelization opportunities, taking dependencies and data locality into account. Subsequently, we discuss the compute, memory access and data reuse patterns we implemented on the two accelerator platforms. The kernels for the Maxeler platform [176] are designed with a flexible amount of parallelism, which is specified by an unrolling factor fu prior to synthesis. The actually utilized amount of parallelism, typically a low two-digit number, is limited either by resource or timing constraints during synthesis (HorDiff and Scanline kernels) or by the known limits of the memory interface to feed the compute pipeline (all other kernels). In order to hide feedback latencies in some kernels, in addition to this explicitly utilized parallelism, we also loop through different groups of work items in different clock cycles. For the Convey vector coprocessor [40], the desired amount of parallelism to be expressed by our kernel implementations is given by the size of the vector registers with up to 1024 elements. It internally contains 32 parallel function pipes and additionally makes use of further elements for latency hiding. We present for each kernel the designs for both platforms side by side to emphasize similarities and differences. We outline the design of the first kernel in some detail, whereas for the other kernels we restrict ourselves to noteworthy aspects.

4.5.1 Aggregation Kernels

The cost aggregation involves five different kernels: horizontal integral sums and differences, vertical integral sums and differences, and scaling. All aggregation steps are independent for each different disparity value and also for at least one of the image dimensions. For the Maxeler platform, the independent image dimension suffices to support the required parallelism and latency hiding, so we restrict ourselves to unrolling in this dimension. The work of Shan et al. [230] suggests that utilizing disparity-level parallelism in addition to image-dimension parallelism might allow saving BRAM resources at the expense of additional logic utilization, which we did not investigate further for our kernels. On the Convey platform, the image dimensions of small images do not suffice to make best use of the available vector width. With the vector partitioning feature, loops can be partially unrolled in a second dimension to increase the exploitation of DLP. For vector partitioning, addressing requires, besides the address stride of the primary vector dimension, so-called partition offsets for the second dimension. For the aggregation kernels, we use this feature to exploit both parallelism in image dimensions inside each partition and parallelism in the disparity dimension through multiple vector partitions.

Horizontal Integral Sums

After Section 4.2 already presented simplified pseudocode for the horizontal aggregation step, Listing 4.6 presents the corresponding function with the actual indexing used in our software implementation. There are dependencies along the rows, but we can parallelize the computation by vertical unrolling, that is, computing several rows in parallel, and additionally work on independent disparity dimensions for Convey vector partitions.

void horSum(double *in, double *out){
    long slice = height * width;
    for(int d=0; d<=maxD; d++){
        for(int y=0; y<height; y++){
            long row = d*slice + y*width;
            out[row] = in[row];
            for(int x=1; x<width; x++){
                out[row+x] = out[row+x-1] + in[row+x];
            }
        }
    }
}

Listing 4.6: Horizontal integral sum function (horSum) of the software implementation.



Figure 4.7: Illustration of compute order for horizontal sums on Maxeler. Here, an unrolling factor fu = 2 and a feedback latency lf = 4 are illustrated.

Figure 4.7 illustrates the computation pattern implemented on the Maxeler platform. The product of the unrolling factor fu and the feedback latency lf determines the number of rows that are in flight at the same time as one common block. The latency is given by estimates from the Maxeler tools, whereas fu is limited either by bandwidth estimations or by synthesis results. More than fu ∗ lf rows in the same block are possible, but require larger buffers and provide no further advantages. After an entire block of rows is computed, the next block of rows, not shown in the illustration, is started. Finally, also not illustrated, after one entire image (a slice of the cost volume) is finished, computation continues with the next disparity. In this description, the presented compute pattern governs the required memory access pattern; however, in practice both are closely co-designed. In memory, elements are arranged in row-major order, which means that entire rows are stored in contiguous memory locations one after the other, because LMem uses the same data layout as the host application to allow for the memory management outlined in Section 4.4. Thus, each burst of 384 bytes reads 48 subsequent double values from one row. Figure 4.8 illustrates the way data is read from LMem with an appropriate memory command generator. We see that inside each block, the data needs to be reordered from horizontal to vertical order between memory accesses and compute step. This is relatively easy to do with the MaxJ concept of stream offsets; however, the actual design may need considerable amounts of BRAM. Additional simpler, non-reordering buffers are required to keep the memory and compute pipelines fed.

The implementation for the Convey vector coprocessor, as indicated in the introduction to this section, not only uses the same unrolling into the independent image dimension, here vertically, but can additionally work on more than one disparity dimension in different vector partitions. Figure 4.9 illustrates this pattern with 4 partitions and 256 rows covered by each partition. The innermost loop runs horizontally inside the rows to reuse the vector register containing the previous integral sum as one of the two inputs for the next step. Before entering this innermost loop, for each group of rows, the number of partitions and size of the partitions are computed based on the remaining dimensions, and two offsets are written into configuration registers. One is the row offset between two consecutive vector elements inside each partition, the other is the image slice offset between two consecutive disparity levels in the cost volume. Vector load and store instructions use these offsets to determine the memory address of each vector element and, in this loop profiting from the small access granularity of the scatter-gather RAM, access only the specified vector elements in memory.

Figure 4.8: Illustration of the memory access pattern for horizontal sums on Maxeler. Each burst spans 48 elements, and every command generates accesses for an entire block of data. Inside each block, data needs to be reordered for the compute pattern.

Figure 4.9: Illustration of the compute order for horizontal sums on the Convey vector coprocessor. Here, 4 vector partitions with 256 elements each are illustrated; parallelism is exposed inside each vector, while latency is hidden by pipelining inside each vector.
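The address calculation implied by the partition offsets can be written down as a scalar model: element i of partition p accesses base + p · sliceOffset + i · rowStride plus the current column, all in elements. All identifiers below are illustrative; the real vector instructions take these offsets from configuration registers.

#include <cstddef>

// Scalar model of a partitioned vector load for the horizontal sum kernel:
// each partition covers one disparity slice, each element within a partition
// covers one row; x selects the current column of the integral sum.
void partitionedLoadModel(const double* cost, double* vectorReg,
                          std::size_t sliceOffset,  // elements between disparities
                          std::size_t rowStride,    // elements between rows (= width)
                          int partitions, int elementsPerPartition, int x) {
    for (int p = 0; p < partitions; ++p)
        for (int i = 0; i < elementsPerPartition; ++i)
            vectorReg[p * elementsPerPartition + i] =
                cost[p * sliceOffset + i * rowStride + x];
}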

Vertical Integral Sums

The vertical integral sum kernel (VerSum) is orthogonal to the HorSum kernel and contains vertical dependencies. Consequently, we now unroll the computation horizontally for the implementations on both platforms. Our Maxeler kernel uses the same combination of unrolling and feedback latency hiding as the HorSum kernel illustrated in Figure 4.7, just horizontally. When we buffer entire rows instead of blocks inside each row, the compute pattern exactly fits the data layout in memory, so we can use a linear memory access pattern instead of a customized memory command generator. Similarly, the Convey VerSum vector kernel contains the same features as the HorSum kernel, namely vector partitioning and data reuse in the innermost loop, but with vectors along the horizontal image dimension. Now the memory accesses inside each vector partition are contiguous, which is beneficial for effective memory performance. In the vector processor instruction set, the only difference is that the element stride is now set to the element size of 8 bytes.

Horizontal Differences

The computation of the horizontal integral sums (HorSum) is followed by the step of computing horizontal differences (HorDiff). For each pixel, a left and a right arm length are required, which define the two positions in the integral cost rows to access, before the corresponding cost values are subtracted from each other. So in this kernel we have data-dependent memory accesses, however only with bounded offsets from a given position, which are limited by the maximal arm lengths. There are no dependencies in this kernel, so both horizontal and vertical unrolling are possible. Since this kernel does not contain feedback, latency hiding as used on the Maxeler platform for the integral sum kernels is not needed here. With the burst-oriented Maxeler memory interface, we need to have the window of possibly required integral cost data available in local buffers. This seems straightforward when unrolling horizontally, since neighboring pixels in one row require largely overlapping areas of possible input values defined by the arms. Figure 4.10 illustrates the use of multiplexers for the selection of the position specified by right arms for an unrolling factor fu of 4 and with possible values for the arm length of 0 to 4 (in practice we use a length of up to 34 as proposed in [182]). Figure 4.10 also illustrates that the overlapping of the possible access windows makes the buffer very space efficient, in the example actually using only 8 registers to buffer the possible inputs for 4 parallel access operations with 5 different input options each. However, even though resource utilization would permit it, the synthesis tools were not able to route such a design with more than fu = 8 anywhere near 100 MHz. The illustration in Figure 4.10 may give an intuitive idea that the high number of overlapping signal routes to the different multiplexers causes this problem.



Figure 4.10: Illustration of the selection of the integral sum element specified by a right arm. Together with the left arm counterpart and the computation of the difference of their values, this forms a HorDiff kernel. From the left side of the illustration, integral sum values and arm lengths are streamed in, in groups of 4 according to the unrolling factor fu, and per cycle 4 selected values are output (to the right). In this example we limited the arm length to between 0 and 4, such that a multiplexer of size 5 suffices to select an integral sum element from the group streamed in during the same cycle or from the group of the previous cycle.

As an alternative, we tried vertical unrolling like in the previous kernel. Here, in addition to the resource consumption of reordering between row-oriented memory access and column-oriented compute step, for each parallel row a buffer for the possibly accessed input elements needs to be instantiated. With this approach, unrolling was limited by resource consumption after synthesis. Therefore, specifically for this kernel, in our final Maxeler design we combined horizontal and vertical unrolling, achieving the largest synthesizable design with an overall parallelism of 16 through horizontal and vertical unrolling factors fuh = 4 and fuv = 4.

On the Convey vector processor platform, conceptually the vector registers might provide a similar line buffer for input cost values selected by arms. However, the instruction set does not support any form of parallel access to specific indexed elements of a vector, so this is not possible. Instead, we resort to computing the address of each element of the horizontal integral sums that needs to be accessed by adding the arm length value to each respective base address. These addresses are then used for indexed vector load operations, which are however less efficient than the accesses with regular strides as outlined for the previous kernel. We again use multiple vector partitions covering several disparity values in each computation step for efficient utilization of the vector size with small images. Since there is no dependency between the two inner loops, the parallelism in each partition can be provided either by horizontal or by vertical vectors. After implementing and measuring both alternatives, we use vertical unrolling to form the vectors. When forming horizontal vectors, several loads of input cost values inside one vector load may point to the same location. This works functionally correctly, but apparently causes additional latencies in the memory interface.

Vertical Differences

Similarly to the HorDiff kernel, the computation of vertical differences (VerDiff) does not contain any dependencies. On the Maxeler platform, horizontal unrolling does not suffer from the routing and timing difficulties of the HorDiff kernel, because now the selection of arm positions is realized independently for each column. Thus, we can restrict unrolling to the horizontal dimension here and still reach unrolling factors of up to fu = 24. On the Convey platform, we again use vector partitioning and this time unroll the vectors horizontally, following the data access pattern of the vertical summation and again avoiding indexed vector loads that contain several instances of the same address.

Scaling

Finally, in the scaling kernel (Scale), each aggregated value is scaled, that is, divided by the size of its specific aggregation region. It is a straightforward streaming kernel without dependencies and can be readily unrolled horizontally on both platforms, following a linear memory access pattern. On the other hand, the division of double-precision floating-point values is neither easy nor efficient to implement on the Maxeler platform and not supported in the vector instruction set of the Convey coprocessor. Fortunately, there is only a fixed number of discrete sizes Aa that any aggregation region can have, so we can precompute the inverse values 1/Aa and replace division operations by multiplications with the inverse values. On the Maxeler platform, those precomputed factors are stored in BRAM and looked up locally. For each parallel function pipe, a separate block of BRAM is instantiated. Due to the indexed access pattern, the Convey vector coprocessor again cannot use the vector registers to hold those values, but instead reads them with indexed vector loads from memory. Again, lookups to the same address impair performance, so on this platform we replicate the block of lookup values in memory and use the vector indices to distribute lookups to different blocks.
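A sketch of the division-free scaling: the reciprocals of all possible aggregation-area sizes are precomputed once, and the per-element scaling becomes a table lookup and a multiplication; indexing the table directly by the area size is an assumption of this sketch.

#include <vector>
#include <cstddef>

// Precompute 1/A for every possible aggregation-area size (sizes are small,
// discrete integers), then scale each aggregated cost by a table lookup and
// a multiplication instead of a floating-point division.
void scaleByAreaSize(std::vector<double>& aggregatedCost,
                     const std::vector<int>& areaSize, int maxAreaSize) {
    std::vector<double> inverse(maxAreaSize + 1, 0.0);
    for (int a = 1; a <= maxAreaSize; ++a)
        inverse[a] = 1.0 / static_cast<double>(a);
    for (std::size_t i = 0; i < aggregatedCost.size(); ++i)
        aggregatedCost[i] *= inverse[areaSize[i]];
}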

4.5.2 Scanline Kernels

In contrast to the aggregation, the scanline optimization is not independent for different disparity values. On the contrary, for the computation of the scanline costs of a new pixel, the minimal scanline cost of the previous pixel over all disparities needs to be known. There is also a dependency along the scanline direction, such that unrolling can only be performed orthogonally to the scanline direction.


Vertical Scanlines

On the Maxeler platform, we implemented a common vertical scanline compute kernel (ScanVer), suitable both for the ScanUp and ScanDown kernels of the host application, switching between both modes by configuring the accompanying memory command generator for different access directions. Figure 4.11 illustrates the dependency pattern for a downward scanline computation and how it can be unrolled horizontally, here with boxes of size 4. All yellow boxes are required as inputs to compute the red boxes. The aggregation costs are read in the required pattern as inputs, as well as the color difference information (not illustrated in the figure) needed to determine the penalty values for each row (boxes P1 and P2). The resulting scanline costs are written out to LMem, but for an entire disparity range also buffered locally in BRAM for reuse in the next row. Therefore, computation is performed in blocks and not in entire rows, as this would require excessive buffer space or additional read-backs. In our actual Maxeler implementation, due to the burst size of the LMem interface, data blocks of 48 horizontal elements are loaded from memory and computed in 48/fu cycles before proceeding to the next line. Since the previous minimum from step 0 is required to update the minimum for step 1, we reordered the datapath for the recursion of Equation 4.1 to have a deeper pipeline for the computation of the individual scanline costs and a simple comparison for the selection of the current minimal scanline value. Nevertheless, similar to the integral sum kernels, we incur a feedback latency lf of four cycles, which for the block size of 48 limits the possible unrolling in space with unrolling factor fu to 12 (lf ∗ fu = 48). We also tried out larger block sizes to obtain more possible compute throughput, but the resulting large designs failed to meet timing.

On the Convey vector processor platform, we implemented two individual assembly kernels for ScanUp and ScanDown to save unnecessary selection instructions for the direction, but both implementations have identical structures. According to the dependence pattern, vectors cover entire rows or parts of rows, depending on image sizes. In contrast to the aggregation kernels, vector partitioning into different disparity dimensions is not possible. The compute order looks very similar to the one illustrated for Maxeler in Figure 4.11, just with much larger blocks formed by the vectors. On this platform, the scanline costs of the previous line cannot all be buffered for the next line, so they are read back at every iteration of the vertical loop. However, only one disparity block needs to be read for every newly computed block; the other two are reused from the previous iteration of the innermost loop, the disparity loop. E.g., in Figure 4.11, the yellow block with scanline cost 3 was newly read for the current compute step, while the other two yellow scanline costs are reused, which is done efficiently by using the vector register rotation feature of the vector instruction set. Also, our code makes heavy use of vector mask generation and vector element selection to find the different possible scanline paths inside one single vector, which can be exploited by the vector-internal streaming of the coprocessor to skip masked-out elements.



Figure 4.11: Illustration of the downward scanline compute pattern. Arrows indicate dependencies. In the x-dimension there are no dependencies; blocks of width 4 indicate unrolling in this dimension. For the computation of the red boxes (the scanline cost is computed and the minimum is updated after every disparity step), data from all yellow boxes is required. The penalty values P1 and P2 additionally depend on pixel color differences not shown here. Green arrows indicate which paths are chosen to compute the illustrated scanline costs for y = 1.
• Red box (value 0): previous scanline cost (value 0) at the same disparity without any penalty.
• Neighboring white boxes (value 1): previous scanline cost (value 0) at a neighboring disparity plus penalty P1 (value 1).
• Outer white boxes (value 3): minimum over all disparities of the previous scanline costs (value 0) plus penalty P2 (value 3).

Horizontal Scanlines

On Maxeler, for the horizontal scanlines (ScanLeft and ScanRight), the orthogonal unrolling concept from the vertical scanlines with buffering of the entire previous scanline costs was not applicable due to prohibitive BRAM requirements. This is because bursts were still aligned horizontally, but unrolling would have to be done vertically, and the buffers would additionally have to cover all disparity dimensions. Therefore, we decided to implement an auxiliary turn kernel (Turn) that reads cost arrays in row-major data layout and writes them back to LMem in column-major data layout, or vice versa. Now we can execute horizontal scanlines by a sequence of turning the input aggregation data, applying vertical scanlines and turning the scanline result data back. The overhead of these turning steps is mitigated because both horizontal scan kernels use the same turned input data by utilizing the ScanUp and ScanDown variants of the vertical scan kernels.


The Turn kernel uses 48 BRAM blocks in which data is written to and read from with a diagonally shifted addressing scheme, which provides the flexibility that either an entire row or an entire column of 48 values can be accessed. The size of the blocks to be turned has to match at least the 48 elements per burst from the LMem interface, so in contrast to all other kernels we implemented this kernel with a fixed unrolling factor fu of 48. On the Convey coprocessor platform, the finer-grained access capabilities of the memory interface allow a direct implementation of the horizontal scanline kernels without prior turning. The kernel structure is very similar to the vertical one, just using row strides between the vertically unrolled vector elements as for the HorSum and HorDiff kernels.
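One common way to realize such a diagonally shifted scheme, sketched here over a 48 × 48 tile, is to skew the bank index by the row index so that the 48 elements of any row and of any column fall into 48 distinct banks; the concrete mapping used in the Turn kernel may differ.

#include <array>

// Skewed storage of a 48x48 tile across 48 independent banks (BRAMs):
// element (row, col) is stored in bank (row + col) % 48 at address row.
// With this mapping, all elements of one row and all elements of one column
// reside in 48 distinct banks, so either can be accessed in a single cycle.
constexpr int kTile = 48;

struct SkewedTile {
    std::array<std::array<double, kTile>, kTile> bank{};  // [bank][address]

    void write(int row, int col, double v) { bank[(row + col) % kTile][row] = v; }
    double readElement(int row, int col) const { return bank[(row + col) % kTile][row]; }

    // Reading column 'col': address r in bank (r + col) % kTile for r = 0..47;
    // each access targets a different bank, mirroring the parallel BRAM reads.
    void readColumn(int col, std::array<double, kTile>& out) const {
        for (int r = 0; r < kTile; ++r)
            out[r] = bank[(r + col) % kTile][r];
    }
};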

Average over Scanline Directions

After computing the costs along all scanline directions, the final scanline cost for each position and disparity is computed by averaging the values of all directions. On both platforms, the resulting ScanAvg kernel is a straightforward streaming kernel with four linear input streams and one linear output stream. As outlined in Section 4.4, its particular value for the overall implementation lies in the reduction of the output data size that has to be transferred back from accelerator memory to host memory. This concludes the part of this section covering the kernel designs for both platforms. On the Convey platform, the kernels were directly integrated into a heterogeneous executable by filling empty proxy kernels with the proper signature, compiled by the Convey compiler, with the actual assembly code providing the described functionality. On the Maxeler platform, the kernels defined in the MaxJ language are synthesized to kernel-specific FPGA dataflow designs. The results of the synthesis process are summarized in the following subsection.

4.5.3 Synthesis and Integration

As indicated, most Maxeler dataflow kernel designs are parametrizable at synthesis time with an unrolling factor fu, which is often constrained by several rules: it must be an integer divisor of the burst size, and the product of fu and the feedback latency lf must not exceed burst or block sizes. The Turn kernel has a fixed size for the diagonal buffer addressing scheme. The practically possible unrolling factors are further constrained by resource utilization and our decision to aim for a clock frequency of at least 100 MHz for the datapath. We furthermore analyzed the bandwidth requirements and did not investigate unrolling factors which would exceed those by much. The first data column of Table 4.2 summarizes the final unrolling factors utilized for the individual kernels. Out of eight individual kernels, six were able to reach or exceed the bandwidth limit. Details of this analysis can be found in [5]. Anticipating some performance results from Section 4.7, in the aggregation phase there is a high overhead for reconfiguring the FPGAs for the sequence of kernels. Hence, for the five aggregation kernels that are repeated in different cycles during the application, we created a common design (denoted as MaxFused) implementing all of their functionality in the same FPGA configuration and thus saving the reconfiguration overhead. We had some headroom for this integration, because not all of the individual kernels hit resource limitations, but still we had to decrease the unrolling factors. The final integrated aggregation design was chosen as the best trade-off between unrolling and achievable timing and runs at 130 MHz. The second data column of Table 4.2 summarizes the decreased unrolling factors for MaxFused. For the scanline phase, no integrated design was found that increased overall performance, not even for small images. In this phase, fewer reconfigurations are required, so the overhead that can be saved is much smaller. On the other hand, severe reductions of the unrolling factors were required to get routable designs.

In Table 4.3, we finally summarize the resource usage of all used dataflow kernel designs. The table highlights that the individual kernels do not hit hard limits in resource consumption; however, for HorDiff and Scan, no larger design with a valid unrolling factor could successfully be synthesized. The critical resources of all kernels are either logic slices or BRAMs.

Table 4.2: Unrolling factors fu of synthesized kernels. *Asterisks mark unrolling that suffices to reach bandwidth limits. For the aggregation phase, an alternative multi-kernel design (MaxFused) with reduced fu was synthesized.

Kernel     fu    Reduced fu
HorSum     24*   12
VerSum     24*   12
Scale      24*   12
HorDiff    16    4
VerDiff    24*   12
Turn       48*   —
Scan       12    —
SAvg       12*   —

Table 4.3: Resource utilization of the implemented kernels. Suffixes denote unrolling factors of individual kernels. Critical resources are highlighted.

            Logic    LUTs     Primary FFs   DSP    BRAMs
Available   297600   297600   297600        2016   2128
HorSum24    27%      19%      23%           1%     39%
VerSum24    30%      21%      25%           0%     18%
Scale24     23%      15%      19%           6%     22%
HorDiff16   38%      27%      33%           1%     48%
VerDiff24   47%      40%      42%           1%     23%
MaxFused    63%      50%      57%           8%     77%
Turn48      32%      23%      27%           3%     29%
Scan12      53%      42%      46%           1%     31%
SAvg12      35%      25%      30%           0%     23%


4.5.4 Kernel Summary

In Subsections 4.5.1 and 4.5.2, the kernels and their implementations are introduced in the order they are used throughout the stereo-matching application. In this subsection, we summarize the kernels based on the encountered computation and data access patterns and identify three groups based on common features. In Subsection 4.7.4, we refer to these groups when analyzing the overheads of the overlay architecture and the advantages of customized kernels.

In the first group, we combine the kernels with a Regular Streaming pattern: HorSum, VerSum, Scale and SumScanlines. In each of these kernels, there is a common streaming pattern for input and output data, either governed by dependencies (HorSum, VerSum) or fully unrestricted (Scale, SumScanlines). Also, the kernels share a low arithmetic intensity and offer no specific data reuse opportunities beyond the simple reuse of the sum value from the previous iteration, which was already introduced in the example in Subsection 2.1.2 and is used in both applicable kernels on both platforms. In the individual kernel designs on the Maxeler platform, all kernels of this group are sufficiently unrolled to reach the platform's bandwidth limits. Within this group, the Scale kernel is an outlier, because it multiplies by inverse values of the region size that are looked up from different memory locations on the two platforms; because of the other correspondences, we nevertheless decided to group it here.

In contrast, a second group, denoted as Complex Streaming, is formed by the four directed scanline kernels (ScanUp, ScanDown, ScanLeft, ScanRight). They need to simultaneously stream data from different iteration spaces and have a higher arithmetic intensity on both platforms, because different alternative scanline paths are computed from reused data. With customized block buffers, the customized kernels for the Maxeler platform achieve additional data reuse over those for the instruction-programmable overlay. On the vector overlay, the higher number of operations per volume element compared to the first group of kernels is executed sequentially, whereas the customized kernels pipeline these instructions as introduced in Section 2.1.3 and compute alternative scanline paths in parallel. As a disadvantage, however, the customized kernels also require an auxiliary Turn kernel, and we were not able to unroll them sufficiently to fully saturate the memory bandwidth.

The particular feature of the third group, Irregular Index Offsets, is shared by the HorDiff and VerDiff kernels. By exploiting the known limits of the offsets, we were able to transform this data access pattern into streaming kernels with customized buffers on the Maxeler platform. Due to a combination of resource consumption and routing problems around this buffer, one of the customized kernel designs could not be fully unrolled to saturate the platform's bandwidth limits. In contrast, the instruction set of the vector coprocessor requires and allows us to use indexed vector load instructions to fetch data for each specified offset location from main memory.

4.6 Experimental Setup

In this section, we first present the setup and notation for the evaluated systems and their configurations. We then discuss the generation and selection of our input data.

4.6.1 Evaluated Systems

After implementing and testing all described kernels on both accelerator platforms introduced in Section 4.3, we integrated them into our stereo-matching application and tested it in a total of six different configurations. All host code was generated with the latest GCC versions available as packages for the respective CentOS installations of the two systems, GCC 4.4.6 on the Convey platform and GCC 4.4.7 on the Maxeler platform. Both versions vectorize some simple loops to generate SIMD instructions for the Xeon CPUs, but due to limitations of the instruction set and of the analysis, they do not come close to the vectorization potential manually exploited on the instruction-programmable vector coprocessor of the Convey platform. Our stereo-matching implementation does not support multithreading; earlier experiments with OpenMP showed a mix of speedups and slowdowns, depending on kernel patterns and utilized CPUs. The coprocessor of the Convey HC-1 is configured with the double precision Vector Personality in version 1.1.1.3, and the Maxeler kernels were designed with MaxCompiler 2013.3 and executed running MaxelerOS 2014.2.

Host CPU of Maxeler platform (CPU1)

The entire execution is performed on the 6-core Intel Xeon X5650 CPU with Westmere microarchitecture, running at 2.66 GHz, which is the host CPU of the Maxeler platform [175].

Host CPU of Convey platform (CPU2)

The entire execution is performed on the dual-core Intel Xeon 5138 CPU with the older Core microarchitecture, running at only 2.13 GHz, which is the host CPU of the Convey platform [40].

Kernels with Maximal Parallelism on Maxeler Platform (MaxKern)

The first accelerated configuration executes the individual, maximally unrolled dataflow kernels on the Maxeler accelerator card. This design point guarantees the highest raw kernel performance, but induces considerable reconfiguration overheads, in particular during the aggregation phase. The parts of the application that are not accelerated are executed on CPU1, and the memory manager presented in Section 4.4 handles transfers between host and accelerator memory.

Multi-Kernel Design on Maxeler Platform (MaxFused)

The second configuration using the Maxeler accelerator card uses the integrated aggregation design, containing five kernels with less unrolling and thus reduced parallelism. The remainder of the execution is identical to MaxKern, including the utilization of individual kernels

for the scanline phase. This configuration saves a lot of reconfiguration overhead during the aggregation phase in exchange for reduced raw kernel performance.

Coprocessor on Convey platform (CnyVecTouch)

On the Convey platform, the host parts of the application are executed on the slower CPU2 and the accelerated kernels are executed on the vector processor overlay on the FPGA accelerator. Thus, no bitstream reconfigurations are required during application runtime; only the much smaller kernel code executed on the coprocessor is changed in the different matching phases. As coprocessor memory interleaving mode, we use the 31-31 interleaving, which maps memory addresses to the individual memory banks in a way that allows near-peak throughput for most possible stride patterns. For our tests, we set up the 24 GB of physical host memory and 16 GB of physical accelerator memory in a windowed memory mode with a 12 GB window of mapped coprocessor memory, 12 GB of pure host memory and 4 GB of pure coprocessor memory. As suggested in Section 4.4, when no actual data has to be transferred, we use two different strategies to migrate allocated memory areas between the distinct physical memory locations. With the first one, all involved pages are touched at the new location to guarantee the best raw kernel performance.
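As an illustration of this touch strategy, the following hedged C++ sketch walks an allocation at an assumed page granularity and touches one word per page; the actual migration mechanism on the Convey platform is handled by its runtime and our memory manager, so the constant and the access pattern here are assumptions, not code from our implementation.

```cpp
// Hedged sketch of the "touch" migration strategy: read and write one byte
// per page so that the underlying runtime places each page in the memory of
// the touching side. PAGE_SIZE and the access pattern are assumptions.
#include <cstddef>
#include <cstdint>

constexpr std::size_t PAGE_SIZE = 4096;  // assumed page granularity in bytes

void touch_pages(void *base, std::size_t bytes) {
    volatile std::uint8_t *p = static_cast<volatile std::uint8_t *>(base);
    for (std::size_t off = 0; off < bytes; off += PAGE_SIZE) {
        std::uint8_t value = p[off];  // read touch
        p[off] = value;               // write back to trigger page migration
    }
}
```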

Alternative Coprocessor Usage on Convey platform (CnyVecNt)

With the second strategy, no-touch, the migration step is much faster and the overall matching performance is a bit higher, at the expense of somewhat increased kernel runtimes. All other settings are identical to CnyVecTouch.

4.6.2 Input Data

Conceptually, all accelerated configurations profit from larger image sizes and higher maximal disparity values, as parallelism and pipelining can be exploited better and overheads are amortized over longer computation times, whereas smaller sizes may help the pure host execution through better caching opportunities. Beyond this general rule of thumb, there are different characteristics specific to either the Maxeler or the Convey accelerator platform. On Maxeler, all LMem accesses need to be 384-byte aligned, so in practice we pad all data structures and memory accesses to fit these requirements. This padding is an overhead that does not occur for multiple-of-384 dimensions. On the Convey platform, for best performance it is important to fill the 1024 vector elements. This is trivially the case for multiple-of-1024 dimensions, but in the aggregation phase it can also be achieved by nicely fitting vector partitions, e.g. for horizontal sums with height 256 and multiple-of-4 disparities as in our earlier illustration in Figure 4.9. Additionally, more subtle effects occur when the image sizes interfere with the memory interleaving mode, which defines the distribution of memory space to the different memory banks. However, the mentioned 31-31 interleaving mode makes our experiments relatively robust in this regard.
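As a small illustration of these padding and alignment rules, the following hedged C++ helpers round a row length up to the 384-byte LMem alignment and test whether a dimension fills the 1024 vector elements; the element size argument and the helper names are illustrative assumptions, not code from our implementation.

```cpp
// Hedged helpers, not taken from our implementation: round a row up to the
// 384-byte LMem alignment of the Maxeler platform and test whether a
// dimension is a multiple of the 1024 vector elements of the Convey overlay.
#include <cstddef>

constexpr std::size_t LMEM_ALIGN = 384;   // bytes, Maxeler LMem alignment
constexpr std::size_t VECTOR_LEN = 1024;  // elements, Convey vector registers

std::size_t padded_row_bytes(std::size_t width, std::size_t elem_bytes) {
    std::size_t row = width * elem_bytes;
    return ((row + LMEM_ALIGN - 1) / LMEM_ALIGN) * LMEM_ALIGN;
}

bool fills_vector_registers(std::size_t dimension) {
    return dimension % VECTOR_LEN == 0;
}
```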


Table 4.4: Dimensions of low-disparity image series generated from Tsukuba image pair.

Name      Width   Height   Disparity   Pixels [×10⁶]   Volume Elements [×10⁶]
Tsukuba   384     288      16          0.11            1.77
HVGA      480     320      20          0.15            3.07
LC        512     384      22          0.20            4.33
EGA       640     350      27          0.22            6.05
VGA       640     480      27          0.31            8.29
WVGA      768     480      32          0.37            11.80
SVGA      800     600      34          0.48            16.32
DVGA      960     640      40          0.61            24.58
XGA       1024    768      43          0.79            33.82
XGA+      1152    864      48          1.00            47.78
SXGA      1280    1024     54          1.31            70.78
UXGA      1600    1200     67          1.92            128.64
FullHD    1920    1080     80          2.07            165.89
TXGA      1920    1400     80          2.69            215.04

To summarize, absolute and relative performance significantly depend on the input dimensions for our stereo-matching systems. We therefore decided to perform our measurements with a series of different input dimensions and to use standardized real-world image sizes or screen resolutions, regardless of their suitability for either architecture. In order to generate the input images, we scaled image pairs from the Middlebury benchmark set [222] to the desired resolution with cubic scaling in Gimp. The number of disparity steps to investigate is scaled according to the scaling factor of the image width. This is important, because with a too low limit on the possible disparities, matching artifacts occur, which lead to disproportionately longer runtimes of the disparity refinement step. We created two test series, one starting from the Tsukuba image pair, which has a low ratio of maximal disparity to image width, and one starting from the Cones image pair, which has a high ratio of maximal disparity to image width. Tables 4.4 and 4.5 show the two series of input dimensions we investigated. We scaled the two image pairs to different commonly used sizes with pixel ratios between 5:4 (SXGA) and 64:35 (EGA), most of them at 4:3 like the original Tsukuba pair. We selected the set of sizes such that the number of pixels between two consecutive sizes increases by factors between 1.08 (from UXGA to FullHD) and 1.46 (from SXGA to UXGA) and the number of elements in a cost volume increases by factors between 1.29 (from UXGA to FullHD) and 1.82 (from SXGA to UXGA). In the remainder of this chapter, we reference inputs as width x height x disparity, for example 1920x1080x80 for FullHD images from the low-disparity image series. These series are currently limited by two aspects. Firstly, a maximal line width of 1920 is synthesized into one of our Maxeler kernels. Secondly, for larger input dimensions, total memory usage starts to become an issue. On the Maxeler platform, with our current implementation of the MemoryManager, the 24 GB of accelerator memory put a hard limit on the maximal input dimensions.


Table 4.5: Dimensions of high-disparity image series generated from Cones image pair.

Name           Width   Height   Disparity   Pixels [×10⁶]   Volume Elements [×10⁶]
Cones          450     375      60          0.17            10.13
Macintosh LC   512     384      68          0.20            13.37
EGA            640     350      85          0.22            19.04
VGA            640     480      85          0.31            26.11
WVGA           768     480      102         0.37            37.60
SVGA           800     600      107         0.48            51.36
DVGA           960     640      128         0.61            78.64
XGA            1024    768      137         0.79            107.74
XGA+           1152    864      154         1.00            153.28
SXGA           1280    1024     171         1.31            224.13

On the Convey platform, we were able to execute tests with larger input dimensions than in Tables 4.4 and 4.5, but performance was impaired by the operating system starting to swap data between main memory and hard disk.

4.7 Evaluation and Comparisons

We first present overall system performance for both platforms. For the main comparison between the approaches of specialized kernels and the reusable vector processor overlay, we focus on the raw kernel performance with both methods and abstract the underlying hardware away as far as possible. Finally, we give some estimates of the design efforts for both approaches.

4.7.1 Stereo-Matching System Performance

Our first charts present speedups for the execution of the entire stereo-matching process for different image sizes compared to pure host execution. For the two respective image series, Figures 4.12 and 4.13 show the speedups of the four configurations with accelerators, MaxKern, MaxFused, CnyVecTouch and CnyVecNt, compared to the host execution on the faster CPU1. Since the host components of the two CnyVec versions are executed on CPU2, we also exemplarily include Figure 4.14, where CPU2 is used as baseline for the speedups of the low-disparity test series. Speedups of a hypothetical platform with the Convey HC-1's coprocessor and the faster Xeon X5650 host CPU from the Maxeler platform should lie somewhere in between the values from Figures 4.12 and 4.14. For both test series, we see that both CnyVec configurations on the Convey HC-1 [40] can achieve speedups already at small image sizes that do not fully fill the vector registers. In both series, 512x384 is the first image size where CnyVec achieves small speedups over CPU1.


Figure 4.12: Low-disparity image series: Speedups of accelerated configurations compared to faster CPU1. CnyVec versions are at a disadvantage because their host parts run on slower CPU2.

Figure 4.13: High-disparity image series: Speedups of accelerated configurations compared to faster CPU1. CnyVec versions are at a disadvantage because their host parts run on slower CPU2.


Figure 4.14: Low-disparity image series: Speedups of accelerated configurations compared to slower CPU2. Max versions have an additional advantage because their host parts run on the faster CPU1.

The speedups increase slightly with increasing image sizes, but show some variations for specific sizes that fit the vector register sizes or the memory interface a bit better or worse. At 1280x1024x171, CnyVecNt reaches a peak speedup of 1.9x over CPU1. With the slower CPU2 as baseline, the speedups are around 3x. On the Maxeler platform [175], the MaxFused configuration with a common design for all aggregation kernels consistently outperforms the MaxKern configuration with its individual, maximally parallel aggregation kernels. However, for small image sizes, MaxFused is still slower than CPU1 and both CnyVec configurations. In the low-disparity test series, it takes the lead over all other designs for the first time at 960x640x40. In this test series, its speedup peaks at 1920x1400x80 with 2.4x compared to CPU1. MaxFused profits from the higher disparities of the second test series, achieving a first speedup over CPU1 already at 512x384x68 and a peak speedup of 2.8x at 1280x1024x171. Figure 4.15 displays some additional details of the underlying data for CPU1 and the respective faster versions for both accelerators, now showing absolute execution times and subdividing them into the aggregation phase, the scanline phase and all remaining parts of the application. We see that in the aggregation phase, only MaxFused outperforms CPU1, by a small margin. However, execution of this phase on the accelerator has the additional benefit that afterwards, intermediate results are already in accelerator memory for the following scanline phase. During the scanline phase, both accelerators then achieve significant speedups compared to CPU1. During the execution phases remaining on the respective host, CnyVecNt notably loses some of its earlier speedups compared to CPU1, because its host code is executed on the slower CPU2.


Figure 4.15: Low-disparity image series. Execution times with three different platform configurations (CPU1, CnyVecNt, MaxFused) next to each other for each input size. Stacked bars are divided into aggregation phase (yellow), scanline phase (red) and all remaining parts on host (grey).

4.7.2 Platform Overheads

All further comparisons are only performed with regard to the faster CPU1. We proceed with the analysis of the two accelerated phases, in this subsection on the basis of results for 1920x1400x80, and compare CPU1 to all accelerated designs. Figure 4.16 breaks down the total execution time of the aggregation phase, splitting the entire yellow blocks from Figure 4.15 into individual components. The first component is the raw execution time of the five described aggregation kernels, still summed up together. We see that this raw kernel execution time is significantly reduced on all accelerator platforms and configurations compared to CPU1, down from 47s to between 15s and 24s on the accelerators, with the design with the highest parallelism, MaxKern, executing fastest. The next component is the total time of all data transfers between host and accelerator memory that are initiated through our memory manager. For pure CPU execution, naturally no such transfers are needed. Here we see that part of the lower execution times observed on the Maxeler platform in comparison to the Convey platform is caused by lower transfer times, either because of the faster physical interconnect or because of the overhead incurred for the realization of the shared memory space on Convey. Masking some of this overhead in the CnyVecNt configuration overcompensates for the increased raw kernel runtimes compared to CnyVecTouch. As third component, we summarized the time spent outside of the five kernels selected for acceleration. In the aggregation phase, this Host Setup time includes, for example, the time to initialize the aggregation regions needed for scaling.


Figure 4.16: Breakdown of aggregation phase into raw kernel execution times and overhead components for different platforms for the 1920x1400x80 input image pair.

Figure 4.17: Breakdown of scanline phase into raw kernel execution times and overhead components for different platforms for the 1920x1400x80 input image pair.

This host setup is notably slower on the Convey platform, again because of the slower host CPU2. The fourth component, reconfiguration time, occurs only on the Maxeler platform. We see that for the MaxFused design, this overhead is negligible, as only one reconfiguration is performed, whereas for the individual aggregation kernels in MaxKern, it more than eats up the additional speedups achieved in raw kernel execution times. As final component, we measured the time spent in platform-specific allocation and free API calls on Convey, which turns out to be relatively minor in the two configurations observed.


Figure 4.18: Execution times of individual aggregation kernels.

Figure 4.19: Execution times of individual scanline kernels.

Figure 4.17 displays the same components for the scanline phase. In this phase, both Maxeler configurations execute the identical designs and thus perform identically. Due to higher computational intensity and higher data reuse, all accelerator platforms in all configurations reduce the raw kernel execution times much more than in the aggregation phase. Compared to its kernel execution times, CnyVecTouch incurs a huge overhead for data transfers, which CnyVecNt can again partially mask during kernel execution. Compared to the CPU execution times, these overheads are smaller than during the aggregation phase, thus allowing considerable overall speedups.

4.7.3 Kernel Performance

In order to compare the kernel-specific dataflow design approach with the vector processor overlay with regard to their suitability for kernel-centric acceleration, we now look at individual kernel execution times and disregard the platform overheads discussed in the previous subsection.

Figures 4.18 and 4.19 show the raw kernel execution times of the aggregation and scanline phases, again for the largest image pair, now separated into individual kernels, but summing up the execution times of all invocations of the same kernel to be comparable to the previous plots.

As in the previous subsection, we compare the faster CPU1 with all accelerated versions. Since the goal in this step is to assess the raw performance potential of the instruction-programmable overlay and of the specialized dataflow designs without platform overheads, on the Convey platform we focus on the faster CnyVecTouch variant, since with CnyVecNt, some of the data transfer overheads are contained within the kernel execution times. These transfer overheads are not related to the overlay approach as such. For the Maxeler platform with specialized dataflow designs, on the other hand, we consider both design points. The MaxKern implementation is designed to maximize raw performance, whereas the MaxFused implementation is designed to mitigate a particular overhead, the reconfiguration step, that the alternative overlay approach deliberately avoids.

For the aggregation kernels in Figure 4.18, we see quite diverse results, with either MaxKern or CnyVecTouch achieving the best kernel runtimes. Furthermore, we see an unexpected artifact for HorSum, where MaxFused, in spite of less compute parallelism, is faster than MaxKern. Presumably both kernels are limited by effective memory bandwidth, with MaxFused generating a slightly more favorable memory access pattern. The HorDiff kernel, in turn, was already projected to be compute bound for the MaxKern design. The MaxFused design with four times less compute parallelism is around 4x slower, supporting this assumption. The VerSum and Scale kernels seem to have become compute bound in the MaxFused design, showing smaller slowdowns compared to MaxKern.

A detailed discussion of the underlying effects for the comparison of dataflow and vectorized kernels needs to take into account the achieved compute parallelism, the memory reuse properties as outlined in Section 4.5, the bandwidth requirements and the impact of memory access patterns on the achieved bandwidths. Also, during our optimizations of the vector overlay kernels, we saw that performance cannot be easily modeled as a function of compute throughput or of effective memory bandwidth, as it also depends on latencies and the sequence of dependent instructions. Therefore, a detailed attribution of certain results to possibly dominating performance factors would be mostly speculation without additional measurements. However, the effects of sensitivity to latencies and those of different arithmetic intensities caused by data reuse and the design of operations are attributable to the kernel design paradigm and thus form the actual subject of our comparison. On the other hand, effects of different amounts of available compute resources and memory bandwidths distort this comparison. Hence, in the following subsection, we try to extract the design effects of instruction-programmable overlays versus customized dataflow kernels by compensating for the performance effects purely attributable to the underlying hardware platforms. However, first we proceed with the comparison of kernel performance.
Comparing the runtimes of the scanline kernels in Figure 4.19, we see a more homogeneous result pattern; the most notable observations are the difference in CPU performance between the horizontal and vertical directions and that the specialized kernels dominate for the actual scanline computation, whereas the vector overlay takes the lead for the subsequent summation step.


Table 4.6: Kernel-Ratios (geometric mean of all kernel speedups) relative to CPU1 for the SXGA image with high disparity (1920x1400x80) and for the geometric mean over all sizes.

Architecture    Speedup over CPU1
                All Sizes   SXGA-high
MaxKern         6.59        9.86
MaxFused        5.20        7.76
CnyVecTouch     5.80        8.37
CPU2            0.40        0.38

In our concrete stereo-matching application, the various kernels do have their individual, well-defined contributions to the overall runtime. However, when comparing the two design methods with regard to their general suitability for kernel-centric acceleration, we want to abstract from these individual kernel weights and just profit from the variety of compute and data-usage patterns represented by the different kernels. Therefore, we consolidate these results into a single metric, the geometric mean of the individual kernel speedup factors that each approach achieves over the reference CPU execution. We denote this metric as Kernel-Ratio, analogous to the similarly computed SPECRatio1. This metric has the nice property that the choice of reference platform does not impact the relative ratio between two other platforms.

Table 4.6 summarizes these Kernel-Ratios for the three accelerated designs with reference to CPU1, for the geometric mean of all input sizes and individually for the largest problem size tested, SXGA with high disparity (1920x1400x80). The reference invariance of the Kernel-Ratio metric allows us to directly derive additional ratios between the platforms in the list. So, considering the comparison of the two kernel design approaches, for all image sizes the Kernel-Ratio of MaxKern with reference to the slower CPU2 is computed as MaxKern/CPU2 = 9.53/0.38 = 24.84. Similarly, the kernel acceleration methods can be compared directly to each other without any CPU implementation as reference, e.g. over all sizes the Kernel-Ratio of MaxKern with reference to CnyVecTouch is computed as MaxKern/CnyVecTouch = 6.59/5.80 = 1.14. Similarly, for the SXGA high-disparity test, it is computed as MaxKern/CnyVecTouch = 9.86/8.37 = 1.18.

The results, when comparing the Kernel-Ratios of the two specialized dataflow kernel approaches on Maxeler with the vector overlay on Convey, are surprising. In the geometric mean, the specialized kernels are just marginally faster than the vector overlay. When trading off parallelism for the integration of several specialized kernels in MaxFused, the specialized kernels are even slightly slower than the overlay, for all image dimensions with MaxFused/CnyVecTouch = 5.20/5.80 = 0.90 and for high-disparity SXGA with MaxFused/CnyVecTouch = 7.76/8.37 = 0.93. However, as indicated above, these numbers do abstract away the data transfer and reconfiguration overheads, but still contain the mismatch in available compute resources and memory bandwidths.
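The Kernel-Ratio metric itself is simple to reproduce; the following hedged C++ sketch computes it as the geometric mean of per-kernel speedups over a reference and illustrates the reference invariance with made-up timings that are not our measured data.

```cpp
// Hedged sketch of the Kernel-Ratio metric: geometric mean of per-kernel
// speedups over a reference platform. All timing values are made up and
// only serve to illustrate the reference invariance of the metric.
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

double kernel_ratio(const std::vector<double> &t_reference,
                    const std::vector<double> &t_platform) {
    double log_sum = 0.0;
    for (std::size_t i = 0; i < t_reference.size(); ++i)
        log_sum += std::log(t_reference[i] / t_platform[i]);  // per-kernel speedup
    return std::exp(log_sum / t_reference.size());            // geometric mean
}

int main() {
    std::vector<double> cpu = {10.0, 8.0, 12.0};  // reference kernel times
    std::vector<double> a   = { 2.0, 1.5,  3.0};  // platform A kernel times
    std::vector<double> b   = { 2.5, 2.0,  2.5};  // platform B kernel times
    // The ratio between A and B does not depend on the chosen reference.
    std::printf("A vs B: %.3f\n", kernel_ratio(cpu, a) / kernel_ratio(cpu, b));
    return 0;
}
```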

1https://www.spec.org/spec/glossary/#specratio


Table 4.7: Hardware-Normalized Kernel-Ratios (geometric mean of all kernel speedups, normalized with regard to compute resources and bandwidths) relative to CnyVecTouch for the SXGA image with high disparity (1920x1400x80) and for the geometric mean over all sizes.

Architecture    Hardware-Normalized Speedup over CnyVecTouch
                All Sizes   SXGA-high
MaxKern         3.04        3.16
MaxFused        2.41        2.48

4.7.4 Quantifying Overlay Overheads through Hardware-Normalization

We try to extract the effects of the different design approaches, that is, the overhead of the vector overlay over custom kernel designs, on the two platforms by compensating for the effects of the underlying hardware. For this, we need metrics to compare the hardware platforms and approach this by looking at compute resources and memory bandwidth. When we compare basic compute resources in terms of 6-input LUTs, which are common to Virtex-5 and Virtex-6 FPGAs, we observe a ratio of Maxeler to Convey hardware of 297600/829440 ≈ 0.359. Similarly, the ratio of theoretical peak memory bandwidths can be computed as 28.8 GB/s / 74.4 GB/s ≈ 0.387. For a more sophisticated compensation of the hardware configurations, we would like to offset each observed kernel speedup with one of those factors, depending on whether the kernel is compute or bandwidth bound. However, since the factors are similar, we simply average the two ratios to 0.373. We multiply Maxeler-to-Convey Kernel-Ratios by the inverse of this combined hardware ratio in order to normalize performance to comparable hardware platform characteristics. This leads to a metric we denote as Hardware-Normalized Kernel-Ratio and present in Table 4.7. We can summarize these results as the central contribution of this chapter as follows:

In a diverse set of compute kernels with data parallelism, specialized dataflow ker- nel implementations on FPGAs are on average around 3x more efficient in terms of performance than a reusable vector processor overlay implemented on compa- rable hardware. In a concrete scenario, due to trade-offs between reconfiguration overheads and exposed parallelism, this advantage shrinks to around 2.5x.
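The normalization behind this summary reduces to a single scaling factor; as a hedged illustration, the following C++ sketch reproduces the arithmetic with the numbers given in this and the previous subsection. Due to rounding of the intermediate Kernel-Ratio, the printed result deviates slightly from the values in Table 4.7.

```cpp
// Hedged sketch of the hardware-normalization arithmetic described above:
// average the LUT and peak-bandwidth ratios of the two platforms and divide
// a Maxeler-to-Convey Kernel-Ratio by the combined factor.
#include <cstdio>

int main() {
    double lut_ratio = 297600.0 / 829440.0;  // Maxeler LUTs vs. four Virtex-5 LX330 FPGAs
    double bw_ratio  = 28.8 / 74.4;          // peak bandwidth in GB/s, Maxeler vs. Convey
    double hw_factor = 0.5 * (lut_ratio + bw_ratio);  // ~0.373

    double kernel_ratio_all_sizes = 1.14;    // MaxKern vs. CnyVecTouch, all sizes (Table 4.6)
    std::printf("hardware factor: %.3f\n", hw_factor);
    std::printf("normalized ratio: %.2f\n", kernel_ratio_all_sizes / hw_factor);
    return 0;
}
```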

The 3x performance overhead notably corresponds to the comparison of the manycore overlay MARC to a single customized FPGA design in [164]; however, this agreement may well be coincidental. Here, we need to discuss the circumstances and limitations that govern the general applicability of our results. First of all, the utilized method of normalizing for different hardware platforms by a single compensation factor depends on the similar ratios of compute resources and bandwidths.

Once those differ considerably, such scaling needs to be done on a per-kernel basis after an analysis of whether compute or bandwidth would be the limiting factor. For the dataflow kernels, the foundations for such work are present in [5], but for the vector overlay, the performance bounds are hard to quantify since all our kernels are actually constrained by a combination of computation, latencies and bandwidth. Also, after migration from one hardware platform to the other, the performance bounds can be different, requiring a more elaborate compensation step.

Secondly, we need to discuss aspects of memory bandwidth. The peak bandwidth data we utilized for our normalization already incorporates two aspects from our practical results. On the Maxeler platform, the memory controller is part of the synthesized FPGA design. The theoretical bandwidth maximum can be achieved with the memory controller clocked at 400 MHz. Due to difficulties in meeting the timing of this controller after synthesis, we targeted 300 MHz in our experiments, and the peak bandwidth value used in our calculation reflects this. On the Convey platform, the memory controllers are implemented in separate FPGAs and their design is fixed, running at 300 MHz. As reported, we utilize the 31-31 interleaving scheme, which maximizes actual performance in our measurements, but technically reduces the peak bandwidth to 31/32 of the physical interface capabilities, which we also included in our numbers. The practically realizable bandwidths of both memory interfaces depend, beyond those peak numbers, on additional influence factors like burst sizes, strides and granularity, which are hard to quantify without extensive tests on both platforms. However, we can qualitatively state that the efficient support for element-wise vector memory operations, in particular indexed ones, of the vector overlay depends on the capability to access individual 8-byte blocks enabled by the scatter-gather RAM modules of the Convey platform we use. So we need to constrain our Hardware-Normalized Kernel-Ratio results for this design approach with:

The vector processor overlay requires a memory interface with sufficiently fine access granularity in order to achieve the indicated performance efficiency.

Thirdly, we want to discuss the compute resources. Our scaling method depends on the implicit assumption that performance scales linearly with the available hardware. When it comes to parallel execution units that operate on unrolled data and are implemented primarily with LUTs, this assumption makes sense. However, other aspects of resource usage often do not scale linearly with compute throughput. On the one hand, some parts of the designs remain constant, e.g. in our experiments the control logic of the dataflow kernels and the scalar processing units of the vector overlay. On the other hand, the resource demands of some components grow more than linearly, e.g. those of some data reordering buffers or input selection multiplexers. Also, the FPGAs of the two utilized platforms have different ratios of additional resources such as BRAMs and DSP blocks to LUTs, which the scaling in terms of raw logic resources neglects. In particular, as seen in Table 4.3, the current designs of several of our dataflow kernels rely on the high ratio of BRAMs to LUTs on the Maxeler platform's Virtex-6 SX475T FPGA, which is #36Kb-blocks/#LUTs = 1064/297600 ≈ 1/280.


On the Virtex-5 LX330 FPGAs of the Convey platform, this ratio is lower: #36Kb-blocks/#LUTs = 288/207360 = 1/720. However, again as a qualitative statement from our design experience with the dataflow kernels, we note that many of the BRAM resources are directly dedicated to buffering or reordering kernel inputs, outputs and intermediate results in order to properly utilize the burst-oriented memory interface of the Maxeler platform. So, the second addendum to our Hardware-Normalized Kernel-Ratio result now states more precisely for the other design approach:

Dataflow kernels can achieve the indicated performance efficiency even with a burst- oriented memory interface, but require FPGAs with a sufficiently high ratio of BRAMs to LUTs for this.

Finally, we need to discuss clock frequencies and low-level optimization. Our dataflow kernels are generated with the Maxeler design flow, which enhances design productivity by transparently applying a number of best-practice decisions, e.g. regarding pipelining or the organization of buffers. Many of these can be modified manually, but in our designs such optimizations were mostly performed demand-driven, in response to specific timing or resource problems. In order to relax the need for deep pipelining, and along with it the need to very carefully optimize the balancing of pipeline stages and their physical layout, most compute paths of our dataflow kernels run at modest 100-130 MHz. For the much more widely distributed and reused vector overlay on the Convey platform, on the other hand, common sense and anecdotal evidence suggest that a huge amount of effort and expertise was invested into low-level optimizations. Consequently, this design runs at 300 MHz, which has a large impact on the performance we measured and compared in this work. We do consider this difference as characteristic for the relationship between reusable and problem-specific designs and as such not as a weakness of the comparison, but nevertheless want to state this in a third addendum to our overall findings:

Our comparison premises that much more manual low-level optimization effort is put into a reusable overlay design than into problem-specific dataflow kernels.

4.7.5 Overheads by Kernel Groups

In order to proceed from the quantification of overlay overheads in a practical application with diverse kernel patterns towards insights into the nature of these overheads, we apply the hardware-normalization step from the previous subsection to the three distinct groups of kernel types identified in Subsection 4.5.4. Table 4.8 presents the results for both design alternatives of customized kernels.

When analyzing the custom kernel designs with the highest parallelism, as represented by MaxKern, we observe a bipartition of the results. For the Regular Streaming kernels, the customized kernel designs are on average just slightly faster than the kernels running on the vector coprocessor overlay. As the custom variants of these kernels are sufficiently unrolled to reach the bandwidth limits, we may attribute the small performance difference to improved overlapping of memory access and computation compared to the instruction-programmable overlay and to the more local table lookup for the Scale kernel.

For both other kernel groups, we observe similar speedups slightly above 5x. This illustrates that different customization aspects can increase performance over a reusable overlay, or, to put it the other way round, that the fixed structure of the instruction-programmable overlay incurs different forms of overheads. In the Complex Streaming group, the customization allows for internal pipelining of operations, the parallel execution of different branches and improved data reuse through a blocking-optimized compute order. In the group with Irregular Index Offsets, the central customization advantage is the data buffering and reordering that on the one hand improves effective memory bandwidth through long bursts instead of individual word reads, and on the other hand avoids latency-induced limitations by decoupling memory accesses from computations.

For some of these differences, there are approaches to reduce vector overlay overheads through architectural enhancements or customization. With scratchpad memory instead of vector registers, as included in VEGAS [56] and VectorBlox MXP [226], the improved data reuse of our Complex Streaming kernels would likely be possible, but might involve compromises on the size of vector instructions that could mitigate their benefit for the instruction overhead. The Irregular Index Offsets of the other kernel group, in contrast, could likely not be supported efficiently by banked scratchpad memories as proposed in the VEGAS and MXP designs. The custom vector instructions introduced for the VectorBlox MXP design [227] additionally aim at transferring the advantages of data pipelining, as exhibited in the Complex Streaming kernel group, from custom designs to vector processor overlays. The performance results of their research are very promising; however, it remains unclear how this approach can be combined with toolflows for high productivity.

4.7.6 Estimates on Design Efforts

As the final step of our comparison of the two design approaches, we want to present some empirical data about our experienced productivity when performing kernel-centric acceleration with the two different design philosophies and targets. This subsection contains a subjective assessment from the perspective of developers with a special interest in reconfigurable architectures and designs and does not correspond to the perspective of software developers from the general-purpose domain, as discussed in Section 3.1. Also, as we did not systematically track the design process, and many factors that are hard to quantify impact the perceived productivity, these results need to be taken with at least a grain of salt.

The design and implementation results presented in this work were produced in several disjoint phases and with different levels of experience gained from other projects. Overall, we would describe the dataflow kernel design process as two phases, the first starting with some limited amount of experience in dataflow kernel design with the Maxeler toolflow and spanning the equivalent of 8-10 full-time developer weeks for the conceptualization of kernels and their unrolling patterns, the implementation and many stand-alone tests in simulation, along with early synthesis results to get a feeling for the resource usage characteristics.


Table 4.8: Grouped Hardware-Normalized Kernel-Ratios (geometric mean of all kernel speedups, normalized with regard to compute resources and bandwidths) relative to CnyVecTouch for the SXGA image with high disparity (1920x1400x80) and for the geometric mean over all sizes. The results from Table 4.7 are now split into the three groups of kernel types identified in Subsection 4.5.4. The Regular Streaming group contains the kernels HorSum, VerSum, Scale and SumScanlines. The group Complex Streaming consists of the four directed scanline kernels that combine different streaming patterns with data pipelining opportunities. The group with Irregular Index Offsets combines HorDiff and VerDiff with their irregular, but bounded index offsets. *The Complex Streaming kernels are identical in the MaxKern and the MaxFused configurations.

Architecture   Kernel Pattern             Hardware-Normalized Speedup over CnyVecTouch
                                          All Sizes   SXGA-high
MaxKern        Regular Streaming          1.28        1.36
               Complex Streaming*         5.50        5.84
               Irregular Index Offsets    5.27        4.99
MaxFused       Regular Streaming          0.98        1.04
               Complex Streaming*         5.50        5.84
               Irregular Index Offsets    2.79        2.60

The second phase, conducted with much additional background on the Maxeler platform, took another 6-8 weeks with a focus on integration, synthesis and optimization. This phase was in practice prolonged by the process of waiting for synthesis results, which we tried to exclude from the reported time span, because it to some degree depends on the amount of parallel synthesis resources and to some degree can be covered organizationally, e.g. by running synthesis overnight. As an illustrative number: the total tool runtime for the final design of MaxFused was reported as 22 hours, 5 minutes and 11 seconds. Within this time, for the place-and-route step, a total of 11 different guiding parameter sets (denoted as cost tables by the toolflow) were explored, with four parallel instances running concurrently. Another special challenge was posed by one kernel instance, where the Maxeler simulation tool was not able to reproduce a memory-interface-related error actually encountered in hardware.

The design of the vector coprocessor kernels was also performed in two major phases. An equivalent of 4-6 full-time developer weeks was spent on first concepts and prototypical implementations with no preliminary knowledge of the concrete vector ISA, but with some general background in assembly programming. With a lot more experience with the architecture, another 6-8 weeks was spent on the final kernel designs and optimizations, including a considerable fraction of time spent on exploring the performance impact of memory settings and data transfer patterns triggered through our memory manager. On this platform, assembly of a kernel design and integration into an executable was completed within seconds, allowing for much faster optimization iterations.

A special challenge was posed by repeated crashes of the accelerator hardware that occurred when using the for the vector coprocessor.

To summarize our concrete experience with both design approaches: designing specialized dataflow kernels with Maxeler's spatial programming language and design flow requires somewhat more time and expertise than developing assembly code for a vector coprocessor, but not a whole lot more. However, the time-consuming synthesis adds some tedious waiting to the process. From the perspective of suitability for general-purpose computing, as contemplated in Section 3.1, the assessment is very different. Designing dataflow kernels is hard with a pure software background, whereas a number of software developers are to some degree familiar with assembly code. However, the design of low-level assembly code is not a promising option with regard to productivity and portability. Hence, in order to serve as a path towards general-purpose computing, the speed of the compilation flow for the instruction-programmable overlay needs to be combined with a high-level design entry and a high degree of automation, as we present in Chapter 5.

4.7.7 Limitations of the Comparison

The presented comparison leaves out two important aspects of the broader computing landscape as outlined in Chapters 2 and 3: a comparison to GPUs and an evaluation of energy efficiency.

GPUs are left out here to restrict the scope of the comparison and because of limited development resources. However, it is clear from the characteristics of the workload, with abundant DLP and bandwidth sensitivity, that GPUs with sufficient double precision floating point units fit the application demands very well. The orthogonal data access patterns in successive aggregation steps are likely to also impact the memory and cache efficiency of GPUs, but the raw bandwidth of high-end GPU computing products is still useful. The well-performing single precision GPU implementation for small images in [182] underlines this. Without further customization of algorithms, data types or operators, FPGA accelerators are generally not likely to surpass GPUs in terms of performance for this type of application.

Considering a GPU comparison, but also from a general perspective, an evaluation of the energy efficiency of stereo-matching on the two FPGA platforms would be extremely interesting. The accelerator boards of the Maxeler platform, with a peripheral component interconnect express (PCIe) form factor that appears close to commodity hardware, show a peak power consumption of around 60W when running our kernels, which is much lower than that of high-end GPU computing products. On the other hand, they are much less optimized for idle power, drawing around 50W when a kernel design is loaded and around 25W when an explicit idle bitstream is loaded. For the Convey platform, we have not performed corresponding measurements. However, the concrete platform characteristics as laid out in Section 4.3 would heavily restrict the validity of any energy measurements. Besides the generational difference of the utilized FPGAs, the higher number of individual FPGAs and DIMMs, as well as the customized DIMM modules themselves, of the Convey platform cause overheads for off-chip communication that are hard to quantify.

These may or may not overshadow any energy differences caused by the different design approaches, including clock frequencies, instruction overheads and activity rates of the functional units.

4.8 Related Work

In this section we present related work on stereo-matching systems on FPGAs. Related work on systems and overlays is discussed in Chapters 2 and 3, and compilation for vector architectures is the topic of Chapter 5.

Apart from our own previous work in [5] and [4], stereo-matching on FPGAs has been tackled with co-design of algorithm and hardware, typically implementing the entire processing pipeline without off-chip memory accesses. Different algorithmic approaches have been explored with different design goals in mind. For example, Tippetts et al. [254] present a complete stereo-matching system with less than 10,000 LUTs and 30 BRAMs, at much lower result quality, but robust with respect to uncalibrated and unrectified images. Apart from simple pre- and post-processing steps, their approach employs an intensity profile shape matching algorithm that directly works on row-local intensity data.

The FPGA implementations with the highest matching accuracy reflect more of the matching patterns utilized in this work. Shan et al. [230] implemented a slightly modified variant of the presented cost aggregation for adaptive support regions on FPGAs. By aggregating only once and in a fixed order, first vertically and then horizontally, they are able to stream the required data only through on-chip buffers. Wang et al. [265] try to follow the algorithm of Mei et al. [182] more closely in their FPGA implementation. In addition to the aggregation technique of Shan et al. [230], they propose a reduced scanline optimization which runs in three downward directions, following the order in which the data is generated in the previous aggregation stage. Both implementations try to exploit parallelism both in the spatial domain of the images, working on several rows at once, and in the disparity domain of the cost volume, working on several disparity images at once. Jin et al. [144, 145] use a similar single-pass aggregation phase and winner-takes-all disparity selection and combine it with a voting scheme, denoted fast locally consistent (FLC) [174], which is more sophisticated than the one utilized in the post-processing step we employ. Between these two phases, intermediate disparity results are actually buffered off-chip, but require much less bandwidth than in our approach, since no volume data is stored.

These implementations come quite close to our results in terms of matching quality, with Wang et al. [265] reaching an average of 6.17% bad pixels and Jin et al. [145] of only 5.86% bad pixels, compared to 5.73% for our implementation. They are more limited in problem dimensions than our approach that works on blocks of memory, with Jin et al. [144, 145] projecting that a design supporting our largest test inputs would exceed the LUT and BRAM resources of their and our current hardware platform, but might be suitable for large Virtex-7 FPGAs. In terms of performance, these co-designed implementations are orders of magnitude faster than our implementation, by executing fewer computation steps on volume data and by integrating the compute pipelines more tightly. Therefore, these approaches are superior when algorithmic trade-offs can be made, whereas our approach is justified when exact reproduction of results or a simpler, structured design process is required.

4.9 Chapter Conclusion

In this chapter, we have presented and compared two design approaches for kernel-centric acceleration of a general-purpose application on FPGAs: specialized dataflow kernels versus an instruction-programmable vector processor. We have shown that the fixed vector overlay can be used to achieve actual speedups over GPPs, but compared to specialized dataflow kernels incurs around 3x slower raw kernel execution times. By distinguishing different groups of kernels, we were also able to point towards the design aspects that cause this performance gap. Due to trade-offs between reconfiguration overheads and exposed parallelism, in our concrete scenario the advantage of specialized dataflow kernels is reduced to around 2.5x, which may be an interesting indicator for particularly dynamic workload characteristics like in OTF computing.

Although not explicitly designed for it, this work may also serve as a starting point for a stereo-matching library on various interface levels. In compiled binary form, for the Maxeler platform along with synthesized designs, it can be exposed as an entire application that gets fed a pair of input images and outputs a resulting disparity map, or as the two sets of kernel implementations that are encapsulated behind common interfaces. For different Maxeler platforms, a library maintainer might be requested to provide synthesized designs to ensure sufficient resources and a working placement, routing and timing. For different ISA- and management-interface-compatible vector overlays (none is known to us), the assembly code would be sufficient to ensure quick portability. The vector assembly code uses a built-in constant register that specifies the maximal vector length and can thus transparently scale to designs with different vector lengths; however, some design decisions may then be non-optimal, for example with regard to vector partitioning. As discussed, the spatial dataflow kernels also have parameterizable parallelism, but require lengthy re-synthesis, possibly several runs, to find the best fitting resource consumption and timing.

We have furthermore reported on the design efforts for these solutions. While the runtimes for compilation for the overlay versus synthesis of the fully custom designs are vastly different and did have a practical impact on overall productivity, the assembly-language implementations for the vector overlay were chosen to come as close as possible to the maximal performance of each platform and do not represent a high-productivity approach as proposed in Chapter 3 as the overlay pillar. Towards such productivity, compilation from high-level source code is required, which, as we show in Chapter 5, is possible, but was lacking heavily in features and performance in the tool chain shipped with the Convey HC-1. The spatial programming model for the Maxeler platform, on the other hand, allowed us to generate kernel-specific FPGA designs at a reasonably high abstraction level. However, with a design paradigm that is unfamiliar to software developers and as of now not compatible with targets outside the Maxeler ecosystem, we see it suitable for a number of special-purpose computing scenarios [10, 9] and selected HPC scenarios, but not as a major driver of FPGA adoption in general-purpose computing, where we see OpenCL as a more familiar and more portable alternative.

With a focus on structured and productive offloading, this project intentionally left out manual customizations on the algorithmic level or of data types and operations. Irrespective of these considerations, a very attractive subject of future work is the customization of a vector overlay to execute the offloaded kernels more efficiently. For example, when not all vector registers are used, their number could be reduced, freeing resources towards larger vector registers. Also, after an analysis of the used vector operators, the functional units of the vector lanes can be customized, possibly even as heterogeneous vector lanes as proposed in [282]. Other customizations require a more active codesign of kernel and overlay architecture. For example, a local memory for each vector lane, as included in VIPERS [284, 285], may significantly boost the performance of the lookup operation of the scaling kernel described in Subsection 4.5.1. Similarly, the vector registers could be used as local buffers for the two difference kernels from the same subsection, requiring a non-straightforward ISA extension before being targeted by manual assembly code or compilation.

CHAPTER 5

Compilation and Runtime Techniques for FPGA Accelerators

At the core of this chapter, we address the challenges of productive tool flows targeting instruction-programmable overlay architectures, using the example of the vector coprocessor implemented on the FPGAs of a Convey HC-1. In [4], we encountered shortcomings of the compiler shipped with the Vector Personality in terms of the supported code constructs and in terms of the performance of the generated coprocessor code. We have developed a compilation flow that addresses these shortcomings and additionally provides fully automatic offloading without being guided by pragma annotations.

Thus, as a high-level contribution, this work shows that highly productive compilation is possible for instruction-programmable overlay architectures, and that the encountered shortcomings of existing solutions are not inherent to the approach, but a matter of engineering resources. Through fully automatic detection of suitable code segments, along with highly productive offloading decisions at runtime, the initial effort to create an FPGA-accelerated variant of some application is minimal, similar to using accelerated variants of commonly used libraries. Since our tool flow works on compiled code in LLVM1 intermediate representation (IR) format, it is particularly suited for OTF computing scenarios as introduced in Subsection 2.3.3, where source code may not be available at the data center that tries to execute the workload as efficiently as possible.

As technical contributions of this chapter, we focused on code generation for outer-loop vectorization and on support for pointer-based addressing of data structures, which were both not possible with the previously existing toolflow. With a set of benchmark kernels that systematically varies these properties, we quantify, beyond the value of the added functionality, the performance impact of these features.

Most of this chapter has been presented and published at ARC 2014 [7]. Beside Tobias Kenter as lead researcher, Gavin Vaz greatly contributed to the implementation and many design choices. Joint follow-up research with Gavin Vaz and Heinrich Riebler around improved solutions for and broader potential of deferring offloading decisions to application runtime has been presented and published at ReConfig 2014 [11] (Received Best Paper Award) and in a much extended version in [12].

1http://llvm.org/

In the remainder of this chapter, we first present in Section 5.1 the motivation and goals of the presented work in more detail. Section 5.2 presents the approach and implementation ideas of our solution. In Section 5.3, we discuss the design of our test suite that we use for the evaluation in Section 5.4. Section 5.5 presents related work specific to this chapter. In Sections 5.6 and 5.7, we give a summary of the joint research that builds upon or complements the ideas from this chapter, before we conclude in Section 5.8.

5.1 Motivation

The results from Chapter 4 have not only demonstrated that instruction-programmable overlay architectures with wide vector instruction sets can be used to achieve speedups over CPU implementations, but also for the first time quantified the involved overheads in a general-purpose computing scenario. In Chapter 3, we have argued, based on observations from other target architectures, that such overheads can be acceptable if they go along with sufficient gains in productivity. In our approach from Chapter 4, one aspect of productivity, the execution times of design tools, was clearly superior to the alternative approach that involves lengthy synthesis processes. However, with regard to required developer expertise and effort, the value proposition of the vector coprocessor design was much weaker due to the chosen low-level assembly implementation. As discussed in Subsection 3.2.1, academic projects on vector processor overlays on FPGAs are either programmed on the same abstraction level [282, 284, 285], or slightly above with C macros that generate assembly instructions [56, 228]. The vectorizing Convey Compiler shipped along with the Vector Personality promised a much more productive design entry with compilation from C/C++ source code that only needs to be annotated with pragmas to guide the offloading and vectorization process. We built upon this compiler in our first work on stereo-matching [4], but our expectations were deeply disappointed. In order to vectorize promising loop nests, significant refactoring and annotation efforts were required. Some transformation steps, for example the consolidation of multi-dimensional data structures into flat arrays, were familiar from other acceleration and offloading paradigms and reproducibly improved the chances for successful vectorization. For other design decisions, like the order of loops in loop nests or the computation of array indexes, the impact on vectorization was more erratic and required trying out many different variants, often along with significant code bloat, before finding a code representation suitable for compiler vectorization. Finally, for the generated vector code, performance was far below expectations, often exhibiting slowdowns compared to pure host execution. A recurring pattern we observed during our analysis was that the compiler only performed inner-loop vectorization, which often led to data accesses with unfavorable stride patterns and also missed data reuse opportunities.

Given this experience and considering the very limited tool support for other academic vector overlays on FPGAs, there are two possible explanations. Either automatic vectorization is inherently too hard to cover more than just a few basic computation patterns in canonical representation, or the available compilers are just not suitably engineered to handle the more complex computation patterns with dependencies, indirect indexing and multi-dimensional data structures beyond flat arrays. Depending on the correct answer to this choice, instruction-programmable overlays with vector execution units are either disqualified as a productive path towards FPGA computing in general-purpose scenarios, or can still fill this role. Based on the observations during the manual design of stereo-matching kernels and considering that vector supercomputers were fairly successful for a decade, we assumed the latter and for this chapter set out to demonstrate this. We aimed to design a compilation flow that

1. functionally supports a wide range of dependency and indexing patterns and flexible multi-dimensional data structures.

2. generates well-performing vector code that is not limited to inner-loop vectorization and smartly reuses data in its vector registers.

3. requires no refactoring or user guidance through pragma annotations.

5.2 Approach

We decided to build our tool flow upon the LLVM compiler infrastructure. It starts with binary applications in the form of LLVM IR. LLVM provides front-ends to generate IR from several source languages, which gives access to many active developers and legacy applications in the general-purpose domain and beyond. Even though IR is not machine code, it is a binary format in which applications can be distributed, and it can also be generated from machine code [22]. Thus, the toolflow can even be applied in OTF scenarios where source code is not available [110]. Beyond this very wide base of design entries, the LLVM infrastructure provides us with an extensive set of code analysis methods, encapsulated in so-called passes, on which we build our methods for identifying hotspots to move to the coprocessor and for performing suitable and valid vectorization. In the remainder of this section, we first present our overall toolflow to generate heterogeneous executables for CPU and coprocessor. After this general overview, we discuss in more detail the extraction of code parts for execution on the coprocessor, the actual vectorization and our approach to runtime checks to guide the execution and data movement.
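As an illustration of how such analysis passes can be used, the following standalone sketch loads a module from an LLVM IR file and enumerates its natural loops with LLVM's DominatorTree and LoopInfo analyses. It is written against a recent LLVM C++ API, which differs in detail from the LLVM version used for this work, and only shows the flavor of the infrastructure rather than our actual PartitionPass.

// Minimal sketch: load an LLVM IR module and list its natural loops.
// Written against a recent LLVM C++ API; details differ from the LLVM
// version used for this work.
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>

int main(int argc, char **argv) {
  if (argc < 2) return 1;
  llvm::LLVMContext Ctx;
  llvm::SMDiagnostic Err;
  // Parse a .bc or .ll file, e.g. produced by "clang -O2 -emit-llvm -c".
  std::unique_ptr<llvm::Module> M = llvm::parseIRFile(argv[1], Err, Ctx);
  if (!M) { Err.print(argv[0], llvm::errs()); return 1; }

  for (llvm::Function &F : *M) {
    if (F.isDeclaration()) continue;
    llvm::DominatorTree DT(F);   // dominance information for F
    llvm::LoopInfo LI(DT);       // natural loops derived from the dominator tree
    for (llvm::Loop *L : LI)     // iterates over the top-level loops
      llvm::outs() << F.getName() << ": loop at depth " << L->getLoopDepth()
                   << " with " << L->getBlocks().size() << " basic blocks\n";
  }
  return 0;
}

Conceptually, the same analyses are available inside a compiler pass, which is where our toolflow applies them.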

5.2.1 Toolflow for Heterogeneous Executables

Since we use and extend the LLVM compiler infrastructure in this project, we use its terminology. In particular, a module denotes the top-level compilation unit, e.g. an entire program or a library that will be linked with the main executable later. A module contains a set of functions, which consist of basic blocks. Control flow between basic blocks is denoted by edges.


Figure 5.1: Toolflow for generating heterogeneous binaries for execution on Convey HC-1 coprocessor and host. Blue: our implementation; yellow: Convey Compiler infrastructure; green: LLVM infrastructure.

Figure 5.1 depicts our overall toolflow for generating heterogeneous binaries for execution on the host CPU and coprocessor. The toolflow integrates some existing LLVM and Convey Compiler tools, but most of the partitioning-related aspects, as well as the vectorization and coprocessor code generation, are new contributions of this work. We start with LLVM IR code, which we generated for our tests in Section 5.4 with the Clang compiler frontend. In our PartitionPass, we then split the module into code that is to remain on the host and code that is to be executed on the coprocessor. The details of this phase will be presented in Subsection 5.2.2. The PartitionPass also includes the planning of a vectorization strategy described in Subsection 5.2.3 and the inclusion of runtime decisions as detailed in Subsection 5.2.4. The modified host code is then translated by the LLVM backend to x86 assembly code. Note that, where applicable, this will generate short vector instructions for the host CPU's SIMD units.

For generating the interface between the host and coprocessor code, we use the Convey Compiler to match Convey's calling conventions and to avoid reimplementing that functionality. For that purpose, the PartitionPass additionally emits a .cpp file containing stubs of all the functions we want to implement on the coprocessor along with their argument signatures. We also generate pragmas indicating to the Convey Compiler that those functions are to be executed on the coprocessor. The Convey Compiler then generates an x86 function entry, which contains a runtime check for the availability of the coprocessor and puts all arguments properly on the coprocessor stack. Then control is handed over to the coprocessor entry of this function, where arguments are loaded from the stack into coprocessor registers.

From the code extracted for the coprocessor, we generate vectorized coprocessor assembly code in our CodeGenPass, following the vectorization strategy determined by the PartitionPass. With the help of a Python script, we then merge this code with the headers including function arguments of the function stubs compiled by the Convey Compiler. Finally we assemble and link the generated assembly and object files, again using the Convey Compiler tools.
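To make the interface-stub generation more concrete, the following is a hand-written sketch of the kind of stub file the PartitionPass emits; the function name, argument list and especially the pragma spelling are placeholders chosen for illustration and do not reproduce the actual Convey Compiler directives.

// Illustrative stub file, one entry per extracted coprocessor function.
// "cny_coproc" is a placeholder pragma name, not the real Convey directive.
#pragma cny_coproc
extern "C" void loop0_coproc(long *sum, const long *in,
                             long height, long width) {
  // Intentionally empty: the body is later replaced by the vectorized
  // coprocessor assembly from our CodeGenPass; the stub only provides the
  // signature so that the Convey Compiler can create the x86 entry, the
  // stack setup and the hand-over to the coprocessor.
  (void)sum; (void)in; (void)height; (void)width;
}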


5.2.2 Code Extraction

We want to identify parts of the code that can be executed on the coprocessor and are likely to yield a speedup. This subsection focuses on the feasibility, whereas the performance depends on the outcome of the subsequently described steps. On our platform, the control flow between host CPU and coprocessor is based on function calls. The only way to transfer control from the CPU to the coprocessor is to call a function that is compiled for the latter; the only regular way to transfer control back is to return from the called function. The coprocessor may call other coprocessor functions, but it cannot call functions on the host. The following process identifies coprocessor-suitable code regions in two phases, before the actual extraction starts. In the first phase, we identify all function calls to libraries on the host CPU, e.g. I/O, as direct incompatibilities. The basic blocks containing these calls cannot be moved to the coprocessor, except for a few selected functions for which we can generate coprocessor code directly, e.g. a std::min() with appropriate data types can be directly translated into assembly later. In the second phase, we search for indirect incompatibilities, where function calls inside the compilation module point to functions that need to be at least partially executed on the host. We repeat this second phase until no new incompatibilities are detected. The outcome of these two phases are functions that can be entirely moved to the coprocessor and functions that are only partially coprocessor feasible.

//Integral horizontal sums
for(int y=0; y<HEIGHT; y++)
    for(int x=1; x<WIDTH; x++)
        sum[y][x] = sum[y][x] + sum[y][x-1];

//Intermediate result written to a file on the host
fwrite(sum, sizeof(sum[0][0]), HEIGHT*WIDTH, file);

//Differences over the integral sums
for(int y=0; y<HEIGHT; y++)
    for(int x=0; x<WIDTH; x++) {
        diff[y][x] = sum[y][x];
        if(x-left > 0)
            diff[y][x] = diff[y][x] - sum[y][x-left-1];
    }

Listing 5.1: Example for loops that can be extracted for coprocessor execution. Note: for a more visible representation of the dependencies, we chose a two-dimensional array indexing in contrast to Listing 4.6.

For those latter functions, we want to extract the basic blocks that can be executed on the coprocessor into new functions. For this extraction to be possible, a set of blocks must have a single entry edge and a single exit edge [146]. Some sets of basic blocks may not have this property, but can be transformed to satisfy it. In particular, this is the case if they have a single basic block as target for all entry edges and a single basic block as source for all exiting edges.

As such, all loops at any nesting level have this property. As speedups are only expected from vectorizing loops, we restrict our toolflow to extract only sets of basic blocks that form a loop. We proceed from outer to inner nested loops, so if all basic blocks of an outer loop are marked as coprocessor feasible, the outer loop gets extracted; otherwise inner loops are tested in the order of their nesting level until we reach the innermost loop. We use LLVM's refactoring capabilities to perform this extraction after a suitable loop is detected. The implementation variant of vertical differences in Listing 5.1 shows a simple code example where this function splitting is required. After the first vectorizable loop nest, some intermediate result is written to a file, before a second vectorizable loop nest follows. Our toolflow will extract the two loop nests into two new coprocessor functions, leaving the calls to those functions along with the other call inside the original function on the host. Note that we chose the source code listing just for illustration purposes, whereas internally our tools operate on LLVM IR.
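The extraction itself builds on LLVM's existing utilities. The following pass-style fragment sketches how a loop's basic blocks can be handed to LLVM's CodeExtractor once the loop has been found feasible; it is written against a recent LLVM C++ API (the exact CodeExtractor interface has changed across versions) and stands in for, rather than reproduces, our implementation.

// Sketch: outline all basic blocks of a feasible loop into a new function
// with LLVM's CodeExtractor (recent C++ API; signatures vary across versions).
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/Function.h"
#include "llvm/Transforms/Utils/CodeExtractor.h"

llvm::Function *extractLoop(llvm::Loop &L, llvm::DominatorTree &DT,
                            llvm::Function &F) {
  // CodeExtractor requires a single-entry region; loops fulfill this property.
  llvm::CodeExtractor CE(L.getBlocks(), &DT);
  if (!CE.isEligible())
    return nullptr;                         // region cannot be outlined safely
  llvm::CodeExtractorAnalysisCache CEAC(F);
  // Creates the new function and replaces the region by a call to it.
  return CE.extractCodeRegion(CEAC);
}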

5.2.3 Vectorization

The vectorization phase checks two important conditions on each loop nest level. Firstly, dependencies between loop iterations are detected, which would prevent vectorization of this loop. The example from Listing 5.1 computes integral horizontal sums and differences respectively; thus the inner loop of the first loop nest has a dependency, leaving only the outer loop for vectorization. Secondly, the dimensions of the loops are checked for whether they permit any speedup. As a heuristic, we use an iteration count of 100 when plain array data structures are detected, and of 500 when following pointers to an inner dimension, as the threshold below which vectorization often isn't sufficient to allow speedups on the coprocessor. When the iteration count of loops is constant, as indicated by capitalized loop bounds in our example, this decision can be made at compile time. However, often those counts can only be determined at runtime, which will be covered in Subsection 5.2.4. When the conditions for vectorization specify that only an outer loop can be vectorized, many compilers, including the Convey Compiler, will try to interchange the loop nests and afterwards vectorize the inner loop. Depending on the compute and data access pattern, this may be inefficient or infeasible, e.g. if the loops are not perfectly nested. Therefore we prefer to vectorize the outer loop directly.

1 for(int y=0; y<HEIGHT; y+=VL_max) {
2     int VL = (HEIGHT-y < VL_max) ? HEIGHT-y : VL_max;
3     for(int x=1; x<WIDTH; x++)
4         out[y:y+VL][x] = out[y:y+VL][x] + out[y:y+VL][x-1];
5 }

Listing 5.2: Outer-loop vectorization of the first loop nest from Listing 5.1, strip-mined and expressed in Cilk Plus array notation.


Listing 5.2 illustrates the outer-loop vectorization of the first loop nest from Listing 5.1, using the array index notation from Cilk Plus (https://www.cilkplus.org/), where a [a:b] statement indicates that the elements from a to b will be processed in parallel. The outer loop is strip-mined: it now increments by the size of the vector registers, VL_max. The actually used vector size VL is computed in line 2, because in the last iteration there will often be fewer than VL_max elements left. Our toolflow does not actually produce source code as shown in this listing; rather the PartitionPass plans vectorization and marks the identified loops for vectorization. When mapping the LLVM IR instructions to Convey coprocessor assembly code, the CodeGenPass then replaces all instructions involving the induction variable of the vectorized loop, in this case y, by corresponding vector instructions. This can in turn require vectorizing further instructions and variables, even if they are scalars independent of y. We support this scalar expansion, but no vector code generation for reduction operations. In this simple example, loop interchange with inner-loop vectorization would easily be possible as well. However, we can note that the computation of VL with outer-loop vectorization is executed only HEIGHT/VL_max times, whereas after loop interchange and inner-loop vectorization it would take place in the inner loop, (WIDTH * HEIGHT)/VL_max times. Additional benefits can be exploited when loop-invariant instructions from the inner loop can be moved to the outer loop. For example, an address calculation or pointer load for an outer dimension of an array, like out[y:y+VL] in Listing 5.2, can be moved to the outer loop, which may not be possible after a loop interchange. Note that the pattern of vectorized memory operations is independent of the decision between outer-loop vectorization and inner-loop vectorization after loop interchange. In this example, assuming C-like row-major order, vectorization requires strided or indexed loads, which impose an overhead compared to continuous loads. When manually optimizing an application for vectorization, adapting the data layout, also combined with tiling, can be a major source of speedups. However, for our automated acceleration approach, we leave the data layout unchanged.
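For contrast, the following sketch (an illustration in the same Cilk Plus-style notation, not output of our toolflow) shows the interchanged loop nest with inner-loop vectorization; here VL is recomputed inside the x loop and thus (WIDTH * HEIGHT)/VL_max times in total.

// Interchanged loop nest with inner-loop vectorization (illustration only).
for(int x=1; x<WIDTH; x++)
    for(int y=0; y<HEIGHT; y+=VL_max) {
        int VL = (HEIGHT-y < VL_max) ? HEIGHT-y : VL_max;   // recomputed per strip, inside the x loop
        out[y:y+VL][x] = out[y:y+VL][x] + out[y:y+VL][x-1];
    }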

5.2.4 Runtime Decisions

When the iteration space of a loop nest is known at compile time and promises speedups according to our heuristic threshold, we statically replace the execution on the host with execution on the coprocessor. This method is sufficient for the experiments presented in Section 5.4. However, often the iteration space depends on concrete input data to an application or on unknown function arguments when accelerating a library. In many of those cases, the iteration space can be determined at runtime of the program at the entry of the actual loop. In this case, we generate code to compute the iteration space before executing the actual loop, using LLVM's ScalarEvolution analysis. Then we add a comparison instruction to compare this value to the threshold for coprocessor execution. If the threshold is not met, a branch instruction will point to the original entry of the loop on the host, otherwise to a new basic block where we generate data movement statements and a call to the corresponding coprocessor function.


If the iteration space cannot be computed at this point, e.g. when following a linked list, execution will remain on the host.

For achieving the best performance on the NUMA architecture of our platform, data should be migrated to the physical memory location where it is most frequently accessed. Therefore we insert calls to Convey's data movement API to transfer data to coprocessor memory before transferring control to the coprocessor. For these data movement statements, we need the data space of the accessed data structures. Similar to the iteration space, it can either be statically computed at compile time, dynamically at runtime before execution of the loop, or it is uncomputable at this point. If it is computable, we add the according data movement statements, either with static size arguments or with runtime-computed size arguments. After the coprocessor function execution, similar statements could move the data back to host memory. However, we would need to analyze the further control flow of the application to determine whether the next intensive data access will happen on the host or the coprocessor. Currently we don't support this, so we optimistically assume that typically the runtime-relevant code sections will be executed on the coprocessor and leave the migrated data in coprocessor memory. Thus, subsequent coprocessor loops working on the same data will still have calls to migrate data to coprocessor memory, but these will need very little time because no data actually needs to be moved.
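To illustrate the shape of the inserted runtime decision, the following self-contained C-style sketch mirrors what our pass emits in LLVM IR; the threshold value, the migrate_to_coproc() helper and the kernel names are placeholders and do not reproduce Convey's actual API.

// Hand-written illustration of the guard our pass inserts in LLVM IR; all
// names (threshold, data movement helper, kernels) are placeholders.
#include <cstddef>

enum { COPROC_TRIP_THRESHOLD = 100 };   // heuristic threshold for flat arrays

static void migrate_to_coproc(void *, std::size_t) { /* stands in for Convey's data movement API */ }
static void kernel_coproc(long *, long, long)      { /* extracted coprocessor function */ }

static void kernel_host(long *a, long rows, long cols) {
  for (long y = 0; y < rows; y++)       // original loop kept on the host
    for (long x = 1; x < cols; x++)
      a[y * cols + x] += a[y * cols + x - 1];
}

void run_kernel(long *a, long rows, long cols) {
  long trip = rows;                     // iteration space, known only at runtime
  if (trip >= COPROC_TRIP_THRESHOLD) {
    migrate_to_coproc(a, sizeof(long) * (std::size_t)rows * (std::size_t)cols);
    kernel_coproc(a, rows, cols);       // offloaded execution
  } else {
    kernel_host(a, rows, cols);         // original entry of the loop
  }
}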

5.3 Experimental Setup

In order to systematically assess the impact of different dependency patterns, vectorization strategies and data layouts, we designed a synthetic loop test suite. As a basis, we used the fundamental compute patterns that we observed during our practical work with the vector coprocessor overlay of the Convey HC-1 in [4] and Chapter 4. We started off with three base patterns of nested loops.

1. A simple loop carried dependency in one loop nest, realized through an array offset of −1. Variants of this basic pattern, as presented in Listing 5.3, show up in the horizontal and vertical sum kernels and are one incarnation of the Regular Streaming kernels identified in Section 4.5.4.

for(int i=0; i<N; i++)
    for(int j=1; j<M; j++)
        a[i][j] = a[i][j] + a[i][j-1];

Listing 5.3: Base pattern with a simple loop carried dependency.

2. A simple loop carried dependency in one loop nest, realized through an array offset of −1, combined with access to another array with a different iteration space. This base pattern, as presented in Listing 5.4, exhibits simplified features of the scanline kernels from Subsection 4.5.

for(int i=0; i<N; i++)
    for(int j=1; j<M; j++)
        a[i][j] = a[i][j-1] + b[i][j+K];

Listing 5.4: Extended pattern with loop carried dependency.

3. No loop carried dependencies, but indirect indexing in one loop nest, as presented in Listing 5.5. This corresponds to the group of Irregular Index Offsets with the difference kernels of the aggregation phase.

for(int i=0; i<N; i++)
    for(int j=0; j<M; j++)
        a[i][j] = b[i][j + idx[i][j]];

Listing 5.5: Base pattern with irregular index offsets.

Building upon these three base patterns, we systematically generated additional variants.

1. We varied the nesting level of the loop nest between two and three. Two nested loops correspond to kernels that work on individual images or slices of a 3D cost volume, whereas three nested loops work on an entire 3D cost volume.

2. We varied the layout of the multi-dimensional data structures. They are either put into a flat Array or are accessed by following a Pointer for every dimension to a dynamically allocated memory location. Note that for the implementation in Chapter 4, only flat arrays are used, following a general design principle for external acceleration: to offload computations and data in chunks as large as possible.

3. We further varied on the induction variable of which loop in the nest the loop carried dependency or the indirect indexing depends. Thus, for the loops with dependencies, this variation enforces either direct inner-loop vectorization (denoted as Inner) or direct outer-loop vectorization (denoted as Outer), or requires active loop interchange. Straightforward implementations of the summation and scanline kernels from Subsection 4.5 exhibit such pairs of opposite dependencies between the respective horizontal and vertical variants; however, it is at the programmer's discretion to interchange the loop nests. For the loops with indirect indexing without dependencies, we created a third variant with offsets in both loop dimensions, as motivated for example by a census transformation that remained on the host in Chapter 4 as a result of the profiling phase.

4. Finally, for each of these loops, we generate one data layout that is Favorable for vectorization by enabling continuous vector loads and one Transposed layout, which requires strided or indexed vector loads. In the stereo-matching algorithm of Chapter 4, the same data structures are accessed in two orthogonal directions and thus, independent of the loop orders decided by designers or compiler optimizations, for the pair of horizontal and vertical steps with dependencies, necessarily one step accesses a favorable layout and the other a transposed one.


For the loops without dependencies, we didn't generate transposed layouts, since the indexing variations from the previous step already cover the risk of working on a transposed layout because of limited knowledge. In the absence of dependencies, there is no further reason beyond limited knowledge to iterate over data in a transposed layout.

For the two base patterns with loop carried dependencies, these four variations result in a total of 2 × 2 × 2 × 2 × 2 = 32 generated loop nests. For the base pattern without dependencies, only the first three variations are applied, with three variants for the dimension in which the indirect indexing points. This results in 1 × 2 × 2 × 3 = 12 variants. Additionally, we evaluate two simple test patterns for the generation of mask instructions, which run on linear loops without dependencies. We integrate all these loop patterns into a single test application, where all required data structures are allocated and filled with reproducible sequences of random data. Subsequently, all loop patterns are executed with 5000 iterations for each of the two inner loops and five repetitions through an additional outer loop for the 3D patterns. In the presented evaluation, all trip counts can be statically determined by the compiler. Thus, each non-vectorized loop executes for 5000 sequential iterations, whereas vectorized loops on the coprocessor need only five iterations, where the final iteration uses less than the maximum possible width of 1024 vector elements in order to demonstrate the correct functionality of this feature. For each loop, the execution time is determined by calling a timer function before loop invocation and after loop termination. These timing measurements serve as a natural border for the offloading mechanism. After the time measurement, a checksum is computed from the output arrays of each loop.
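The following sketch shows the structure we use around each loop pattern in the test application: timing around the loop invocation as the natural offloading border, followed by a checksum over the output array. The sizes and the timing helper are illustrative placeholders, not the exact generated test code.

// Illustrative structure of one entry in the synthetic loop test suite.
#include <chrono>
#include <cstdio>
#include <vector>

enum { N = 5000, M = 5000 };

static double run_pattern(std::vector<long> &a) {
  auto t0 = std::chrono::steady_clock::now();       // timer before loop invocation
  for (int i = 0; i < N; i++)                        // one of the generated loop patterns
    for (int j = 1; j < M; j++)
      a[i * M + j] = a[i * M + j] + a[i * M + j - 1];
  auto t1 = std::chrono::steady_clock::now();        // timer after loop termination
  return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
  std::vector<long> a(static_cast<std::size_t>(N) * M, 1);  // reproducible input data
  double seconds = run_pattern(a);
  long checksum = 0;
  for (long v : a) checksum += v;                    // checksum over the output array
  std::printf("time %.3f s, checksum %ld\n", seconds, checksum);
  return 0;
}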

5.4 Evaluation

We evaluate our approach by comparing, for each loop pattern, the performance achieved after applying our toolflow and running the heterogeneous executable on host CPU and coprocessor together, to the baseline performance when compiling to pure host code with the Clang backend and executing only on the host CPU. In terms of tool productivity, our entire toolflow just adds a few seconds to the default Clang compilation time. A comparison of the checksums demonstrates the correct functionality of the entire test setup. In our first experiment, all measurements are performed with data already present in local memory of the target platform, in host memory for the baseline measurements and in coprocessor memory for the offloaded loops. This setup is close to the practical performance if our optimistic data movement strategy works out and corresponds to the raw kernel performance as analyzed in Subsection 4.7.3. Table 5.1 summarizes the geometric mean speedups of our toolflow compared to pure host execution, for all loop nests together and in individual groups. According to the systematic variation from Section 5.3, we group our total of 46 benchmark loops into five times two groups in Table 5.1, with the size of each group given in parentheses.

Table 5.1: Observed speedup as geometric means for different groups of loops. The size of each group is given in parentheses. Note that identical loops show up in the Inner/Outer and Favorable/Transposed column pairs.

            Inner      Outer      Favorable   Transposed  Independent  Geomean
Array       6.51 (8)   6.67 (8)   13.10 (8)   3.31 (8)    2.05 (8)     4.46 (24)
Pointer     2.00 (8)   2.01 (8)   2.62 (8)    1.52 (8)    0.88 (6)     1.52 (22)
Geomean     3.60 (16)  3.66 (16)  5.86 (16)   2.25 (16)   1.35 (14)    2.61 (46)

The 32 loops with dependencies can be found in their corresponding row of either column Inner or column Outer, and in order to enable a different point of view they are contained again in either column Favorable or column Transposed. Additionally, there are eight Independent Array loops (including the two vector mask patterns) and six Independent Pointer loops, where the vectorization target and the utilization of the data layout solely depend on compiler decisions. Like in Chapter 4, we use geometric means within each group and beyond because of the scale invariance of this metric. Since geometric means weight positive outliers less, the values are a bit lower than those published in [7], where arithmetic means are used. We summarize the geometric means of groups by columns and rows, with the overall mean in the lower right corner.

We see that the automatic compilation flow is not only functional, but is also able to achieve speedups for all but one of the investigated groups of loop patterns. The geometric mean of the speedups of all 46 kernels is 2.6x, and the eight kernels in the best performing group of flat arrays with a favorable data layout achieve a speedup ratio of more than one order of magnitude. Investigating the influencing factors individually, we firstly see that Array data structures allow much higher speedups than Pointer data structures, around 3x higher in many groups. To ensure correct functionality, we make no assumptions about the actual data layout of dynamically allocated partial arrays and thus need to load many different base addresses that can be computed as offsets in the loops with flat arrays. There may be some improvement potential through more clever reuse of once-loaded base addresses, but a gap to the Array groups will definitely remain unless additional knowledge about the allocation process can be gathered by the compiler, from the developer, or dynamically at runtime. Outer-loop vectorization (Outer), which we implemented as an alternative to loop interchange with the goal of inner-loop vectorization (Inner), yields slightly higher speedups for Array data structures. Based on earlier experiments with hand-written code, we had hoped for a higher impact, but were not able to separate out the effects of the data layout. The systematic variation of the data layout between Favorable for vectorization and Transposed to it reveals that the impact of this factor by far dominates that of the vectorization strategy, particularly in Array loops, where a favorable layout enables around 4x higher speedups than a transposed one. Nevertheless, even all groups with transposed layouts were able to gain some speedups. Only in the group of Pointer data structures with indirect indexing does the high number of irregular memory operations cause a slight slowdown compared to the host CPU.

In these cases, the vector overlay can no longer compensate for its clock frequency disadvantage and its lack of a generic cache hierarchy with increased parallelism.

5.4.1 Comparison to Hand-Written Kernels of Chapter 4

When comparing these results with the results in Table 4.6, where kernels with hand-written assembly code run actual stereo-matching tasks on the same platform, we first need to summarize common features and differences. Both speedup numbers refer to raw kernel execution times with data present at the coprocessor. The input sizes of the synthetic tests are larger in the two inner dimensions, but much smaller in the third dimension of the 3D kernels. This has an impact on the comparison, but reduces the effect of vector partitioning, which is used in the hand-written kernels but not generated by our compilation flow. All hand-written stereo-matching kernels use flat arrays and fall into the first row of Table 5.1. The summation and directed scanline kernels fall into the Outer group, with a forced variation between Favorable and Transposed data layouts. The other kernels fall into the Independent group, even though the sumScanlines kernel requires neither indirect indexing nor select operations and hence would performance-wise be best represented by the Favorable group.

The kernel-ratio of 5.80 in Table 4.6 is computed with reference to the host CPU of the Maxeler platform, there denoted as CPU1, and therefore needs to be divided by the kernel-ratio of CPU2 to obtain results comparable to those in Table 5.1. Thus, a kernel-ratio of 14.50 would represent the hand-written kernels for all input sizes from Chapter 4 as the most comparable number to the results from this chapter. It is much higher than the speedups of the group with Array and Outer kernel variants, to which six of its ten kernels belong. We attribute this difference mostly to three reasons. Firstly, knowledge of the application allowed higher data reuse in the hand-written kernels than the compiler could prove. Secondly, the actual computational intensity of the scanline kernels is higher than covered by the basic data access pattern in the synthetic test suite. Thirdly, manual, measurement-driven optimization of the instruction sequences of some hand-written kernels yielded additional speedups for which we were not able to find any design rules applicable to automatic code generation.

5.4.2 Further Experiments

As shortcomings of the shipped Convey Compiler motivated this work, we compared our results to the ones generated with the latest version of this compiler. When testing it on the same benchmark in its fully unguided mode, it produces wrong results, probably due to some unsafe optimizations. However, according to the documentation, the unguided mode is mainly intended for finding possible vectorization candidates. Hence, we next added the applicable pragmas about data layouts, vectorization goals and hints about dependencies or their absence in each loop nest. After experimenting with the most suitable compiler options, we were finally able to vectorize and offload 20 of the 24 loops with array data structures, but none of the pointer variants. This outcome is notable progress over the older internal vectorizer of the Convey Compiler that motivated our project when we used it in [4], which can only vectorize four of the array loop patterns without further manual extraction of index computations and manual loop interchanges.

However, our contribution both significantly increases the functionality and reduces the effort for targeting a vector coprocessor overlay compared to all tested variants of the Convey Compiler. When comparing the runtimes after applying our toolflow to those of the loops successfully vectorized with the Convey Compiler, in nine of those 20 loops our generated kernels are faster by factors between 2.88x and 10.05x, mostly because our more direct vectorization approach allows higher data reuse. For seven loops, the runtimes are almost identical (speedups of 1.00x to 1.06x). In four examples, we have slowdowns of 0.44x to 0.49x, because our compiler misses a data reuse opportunity enabled by a loop interchange that the Convey Compiler performs. Thus, our contribution reveals considerable additional speedup potential for automatically generated code for vector coprocessor overlays, but is not optimal for all investigated patterns.

In practical applications as in Chapter 4, there is a mix of required data transfers between host and coprocessor and reuse of data residing in coprocessor memory. As a complementary experiment to our optimistic scenario from Table 5.1, we measured the performance impact when all used data needs to be migrated to the coprocessor prior to each actual invocation of a loop vectorized with our toolflow. The geometric mean of all speedups in this scenario is 0.85, which corresponds to a slight slowdown. Looking at individual loops, we found 21 loops with speedups and 25 loops with slowdowns, in both categories between a few percent and around one order of magnitude. Reusing the same data in coprocessor memory for further loop invocations reduces these overheads until the speedups asymptotically approach the results in Table 5.1.

5.5 Related Work

In this section, we give an overview of related work on compilation for vector architectures. This work was motivated by serious gaps between what can be achieved using the Convey vectorizing compilers and what is possible by making best use of the vector instruction set. Maleki et al. [172] have investigated similar problems for compilers targeting vector extensions of current general-purpose CPUs. Our work underlines that overall such shortcomings can be addressed, but the details can be challenging. The foundations of automated loop vectorization driven by data dependency analysis were established during the era of vector supercomputers in the '80s, first by Allen and Kennedy [16]. In their source-to-source compilation system, they apply loop interchange and then vectorize inner loops or entire loop nests that are fully vectorizable from the innermost loop on. Later on, the vectorizing Fortran compiler by Scarborough et al. [221] also featured direct outer-loop vectorization, like Ngo's compiler framework [200], which was integrated into a Fortran-90 compiling system. More recently, outer-loop vectorization has also gained interest for SIMD instruction sets with short vector units [202] as present in modern GPPs.


Other test suites for automatic vectorization, like the one from the GCC auto-vectorization project (http://gcc.gnu.org/projects/tree-ssa/vectorization.html) and PolyBench [212], contain some additional code patterns, but lack the systematic variations of our test suite and thus don't allow isolating the effects of the vectorization strategy from those of the data layouts.

5.6 Excursion to Offloading Decisions at Runtime

In this work, we have introduced, but not evaluated, the concept of embedding offloading decisions into application code, where they often have access to the runtime variables that determine the iteration counts of loops and loop nests. The joint follow-up research in [11] and [12] improves the implementation of offloading decisions at runtime and demonstrates their effectiveness in scenarios where an actual mix of input data sizes requires dynamic decisions for or against offloading at runtime. For this evaluation, we furthermore improve the selection of data to be transferred between host and coprocessor memory and depart from the optimistic approach of lazily leaving data in coprocessor memory.

In [11, 12], we also investigate the potential of offloading decisions at runtime beyond the compilation for this concrete FPGA overlay target. Firstly, we analyze all loops from the common benchmarks SPEC CPU2006 [119], MiBench [106] and MediaBench [165] to find out whether the trip count to guide an offloading decision can be determined at compile time, or later at runtime when the loop is entered, or cannot be determined with the means of the LLVM compiler framework at all. It turns out that the trip counts of 40% of all loops identified in the code, or 26% of all loops that are executed at least once, can be determined at runtime, but not before. These loops account for 53% of all cumulative loop iterations in all benchmarks. These numbers show, together with the more detailed per-benchmark analysis in [12], that offloading decisions that are automatically inserted into program code and take deterministic decisions based on actual runtime data can have a great impact beyond the case study within this chapter. Compared to decisions based on expert knowledge from the application developer, or based on profiling results, runtime decisions require much less effort at compile time and thus are much more productive, as postulated in Section 3.1.

Conceptually, however, they incur a performance overhead compared to statically taken decisions at compile time. In order to assess this overhead, as a second experiment we insert such runtime decisions at the entry blocks of every applicable loop of the SPEC CPU2006 benchmarks, without actually generating corresponding accelerator code and ensuring with an according threshold parameter that all execution actually remains on the host CPU. With an impact on overall benchmark execution time of less than 1%, we found the performance trade-off of such runtime decisions to be negligible. As a third new contribution beyond [5], we introduce and analyze in [12] the decision slack as the distance between the earliest program position or point in time where a runtime decision can be computed based on all required input data, and the program position or point in time when the potentially to-be-offloaded loop starts. We found that in the SPEC CPU2006 benchmarks, 88% of all loops applicable for runtime decisions exhibit such a decision slack.


Evaluating in a conservative analysis only the shortest possible control flow path, we found the average decision slack in SPEC CPU2006 to span 80 instructions within 10 basic blocks. The resulting time span may be used for small configuration or data transfer tasks of an accelerator, or for taking scheduling decisions on the system level like the ones outlined in the following section.
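To make the notion of decision slack concrete, the following hand-made C++ fragment (not taken from the analyzed benchmarks) shows a case where the trip count n is known at the top of the function, while the potentially offloaded loop only starts after a preprocessing step; the offloading decision can therefore be computed several basic blocks before the loop entry.

// Illustration of decision slack: the offloading decision for the final loop
// depends only on n, which is known well before that loop starts.
#include <cmath>
#include <vector>

double process(const std::vector<double> &input) {
  std::size_t n = input.size();        // earliest point: trip count known here
  // A runtime offloading decision based on n could already be taken here,
  // overlapping e.g. accelerator configuration with the preprocessing below.
  std::vector<double> scaled(n);
  for (std::size_t i = 0; i < n; i++)  // preprocessing kept on the host
    scaled[i] = 0.5 * input[i];

  double acc = 0.0;
  for (std::size_t i = 0; i < n; i++)  // potentially offloaded loop starts here
    acc += std::sqrt(scaled[i]);
  return acc;
}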

5.7 System-Level Scheduling and Task Migration

Up to now, within this chapter and in Chapter 4 we have only investigated how to optimize the performance of a single application at a time with exclusive usage of a heterogeneous system with host CPU(s) and an FPGA accelerator. This is similar to the scenario analyzed in [215] for the PC market, where emphasis was given to one application running interactively in the foreground. However, particularly for the consolidated execution of cloud and OTF workloads in datacenters, the performance of many tasks may need to be optimized at the same time. These tasks then compete for host and accelerator resources, which is the perspective we jointly investigate in [8]. Instead of individual offloading decisions embedded into application code, here a system scheduler controls the assignment of tasks to host and accelerator resources.

In order to demonstrate the importance of this perspective, we present three technical contributions in [8], extending earlier work of Beisel et al. [32]. Firstly, we introduce a programming pattern around checkpoints within the kernels of our target applications. This allows us to interface different kernel implementations for multicore CPUs, GPUs and Maxeler FPGA accelerators with common functions to set up and retrieve intermediate application status and to trigger the next computation step. In contrast to the automated toolflow presented in this chapter, the checkpoint-centric programming pattern allows us to deliberately generate repeated decision points at which computation can be migrated between arbitrary resources. Secondly, we characterize the performance of different kernels with a range of different input dimensions on all three resources and transform the concrete measurement results into a more abstract affinity metric. This measurement-based affinity metric can be considered the counterpart of the target-architecture-specific threshold used throughout this chapter. Thirdly, we present a heterogeneous scheduler that assigns tasks to their most suitable resource whenever possible, but also tries to fully utilize all resources in order to maximize the total task progress of the system. This can involve executing tasks on a non-optimal resource that would otherwise fall idle. When a new, better suited task for this resource arrives, or when the optimal resource of the running task becomes available, the scheduler can update an assignment decision and migrate a task at any checkpoint.

For the evaluation in [8], we generate different task sets with 32 to 72 tasks from the applications Gaussian blur, heat transfer simulation, Markov chain and correlation matrix with different input sizes. The sets contain for each resource some tasks that are most affine to it, and the distribution of task execution times mimics large-scale datacenter workloads [187, 72]. For these task sets, we compare the total completion time achieved by our heterogeneous scheduler with task migration to two alternatives without migration.

The first alternative, denoted as affinity-conserving, executes all tasks on the most suitable resource and thus may leave other resources idle, whereas the second alternative, denoted as workload-preserving, also assigns non-optimal tasks to otherwise idle resources and thus corresponds to the heterogeneous scheduler without migration capabilities.

It turns out that the second alternative on average does not deliver competitive performance because it sometimes causes long-running tasks to execute on a non-optimal resource with large slowdowns. The heterogeneous scheduler can recover from such situations and, compared to the first alternative, provides an average speedup of 7% through better resource utilization. This shows that offloading decisions at the system level can be superior to decisions on the application level when the system has sufficient capabilities to correct non-beneficial decisions. In the current implementation of offloading decisions embedded into application code as presented throughout this chapter, execution falls back to the host when the coprocessor is not available at the decision point. In a multi-tasking scenario, this is conceptually similar to the workload-preserving alternative from the previous paragraph, but with fewer degrees of freedom. Considering the above results, it might be better to retry the coprocessor allocation after a brief waiting period, in particular when iteration counts well above the threshold hint towards high speedup potential.

5.8 Chapter Conclusion

In this chapter, we have presented an automated, unguided acceleration process for binary applications targeting a heterogeneous platform with a coprocessor realized with an instruction-programmable FPGA overlay. Our toolflow introduces decisions made at application runtime and beats existing pragma-based tools in versatility and, in many cases, in performance. This shows that with the help of overlays, acceleration with FPGAs can be achieved not only without lengthy synthesis processes, but also without difficult and time-intensive application refactoring. Even though the presented design flow seems to involve additional performance overheads on top of the overlay-inherent ones analyzed in Chapter 4, with its extremely high productivity it provides an interesting entry point for FPGA acceleration. The automation and the independence from source code availability can be particularly important in OTF computing scenarios. A possible next step is to exploit the fast acceleration process by moving it from compile time, as presented here, to the actual runtime of the program, e.g. by running the program in LLVM's just-in-time execution engine and then accelerating applications fully transparently to the user.

The presented thresholds for our offloading decisions work in our scenario, but are conceptually a rather rough heuristic. An interesting starting point for future work can be to either refine these thresholds through a learned model based on code features, as presented in [269] for OpenCL kernels, or to use auto-tuning techniques [255] to simultaneously find well-performing kernel variants and characterize their performance for different input dimensions. Similar to the hand-written assembly code from Chapter 4, the generated code is scalable to overlay architectures with different vector lengths. Either of the two techniques to automatically find more accurate thresholds would be particularly important for scaling overlay variants.

As discussed in Sections 3.2 and 4.9, customization of the overlay architecture can be an approach to reduce the involved overheads. Some customizations proposed in Section 4.9 can easily be applied after kernel code generation, whereas others are challenging because of the interdependence of the compilation flow and the overlay architecture.


CHAPTER 6

CPU-accelerator System Integration

In this chapter, we broaden the view from concrete systems and solutions to design options for future system architectures combining GPPs and reconfigurable accelerators including FPGAs. For this purpose, we not only depart from the constraints imposed by existing hardware platforms, like the Convey and Maxeler systems targeted in the two previous chapters, but also from the design methodology aiming for concrete solutions. Instead, we use an abstract model that is necessarily less accurate, but that overcomes the limitation to concrete architecture features, parallelism types and sources of efficiency for FPGAs. The high-level contributions of this chapter are, on the one hand, a new approach for fast and fully automated performance estimation of CPU-accelerator architectures and, on the other hand, high-level design insights for such architectures based on the estimation method. The technical contributions around an LLVM-based analysis tool flow combining static and dynamic code analysis have influenced the solutions presented in Chapter 5 and the joint work in [1].

The core of this chapter has been published in [3], extending the design space exploration and improving the partitioning method introduced in [2]. The presented work is based on the Intel-funded project Multimodal Reconfigurable Processing Unit (MM-RPU) and, besides the main research conducted by Tobias Kenter, contains ideas from Michael Kauschke, Marco Platzner and Christian Plessl, mainly on the architecture model.

In Section 6.1, we present the challenge of design space exploration for the integration of heterogeneous architectures that motivates our work in this chapter. In Section 6.2 we introduce the type of architectures we investigate. Section 6.3 contains our new performance estimation method and the partitioning approach used in our framework. In Section 6.4 we discuss selected results of our design space exploration. In Section 6.5 we compare our framework to related work, before giving a conclusion and an outlook to future work in Section 6.6.


6.1 Motivation

The integration of accelerators with CPUs into heterogeneous systems is not only a technical challenge, but it also offers many design alternatives with regard to data transfers, communication and synchronization. In particular, the characteristics of a given interface have a major impact on the granularity of functions that can be offloaded to the accelerator, on feasible execution models, and on the achievable performance or efficiency benefits. As outlined in Sections 2.2.3 and 2.2.4, both GPU and FPGA accelerators have entered the general-purpose computing domain as loosely coupled architectures with separate physical memory and memory space, but are more recently also integrated with CPUs into common memory spaces, cache hierarchies and SoC products (see Subsection 3.1.2). This tighter coupling has been driven by performance and efficiency considerations, but also by the goal to increase the productivity of application designers targeting accelerators.

However, as illustrated in the domain of GPU accelerators, where the HSA foundation tries to shape the integration process [240], the progress towards tightly coupled heterogeneous systems is slow. Additionally, the programming models and tools need time to follow architecture innovations before their benefits can be harvested beyond case studies. Of course, all previously possible coarse-grained offloading concepts still work with and often profit from tighter integration, but it takes time to make best use of new, more fine-grained offloading opportunities. This observation is even more applicable for heterogeneous systems with reconfigurable accelerators, due to the more numerous ways they can increase performance through different forms of parallelism and customization. In this chapter, we attempt to gain insights about the potential integration of reconfigurable accelerators independent of these limitations.

In this work we focus on a subclass of CPU-accelerator architectures, where the accelerator has both a direct low latency interface to the CPU and independent access to the memory hierarchy. We discuss this architecture and its motivation in Section 6.2. Performance estimation and design space exploration for this and other classes of CPU-accelerator architectures are challenging problems. Simulation is the most common approach to evaluate the architectural integration of reconfigurable accelerators before prototyping. For example, Garcia et al. [87, 88] rely on co-simulation to evaluate an architecture where CPU and accelerator work on the same memory hierarchy. While such a pure co-simulation approach provides some insight, its time-consuming design process often limits it to assuming a specific interface and a hardware/software partitioning that is custom-tailored to the characteristics of this interface. The challenge for an automated design space exploration is that the specification of the interface affects what parts of the application can be mapped to the accelerator during hardware/software partitioning. We consider this interdependency between interface and partitioning the reason why the systematic exploration of the design space for the architectural integration has so far not received significant attention in research.

In this chapter, we present a new approach for fast and fully automated performance estimation of CPU-accelerator architectures. By combining high-level analytical performance modeling, code analysis and profiling, and automated hardware/software partitioning, we can estimate the achievable speedup for arbitrary applications executed on a wide range of CPU-accelerator architectures. The intended use of our method is to quickly identify the most promising areas of the large CPU-accelerator design space for subsequent in-depth analysis and design studies. Consequently, we emphasize modeling flexibility and speed of exploration rather than a high accuracy of the estimation method. The main benefit of our method is that it needs only the application source code or LLVM binary and does not require the user to extract any application-specific performance parameters by hand.

Figure 6.1: Reconfigurable accelerator with dual interface in an architecture with shared L2 and two private L1 caches. The two interface components (to the adjacent general-purpose CPU and to the memory hierarchy) are highlighted in yellow.
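To convey the flavor of such a high-level analytical estimate, the following is a deliberately simplistic, Amdahl-style sketch of a speedup estimate for one candidate partitioning; it is not the model used by our framework, whose estimation method is described in Section 6.3, and all parameter names and numbers are purely illustrative.

// Generic illustration of a high-level speedup estimate for one partitioning;
// not the framework's actual model.
#include <cstdio>

struct Partition {
  double t_total;        // profiled software execution time of the application
  double t_offloaded;    // software time of the parts mapped to the accelerator
  double accel_gain;     // assumed raw gain of the accelerator on those parts
  double calls;          // number of host/accelerator control transfers
  double if_latency;     // modeled latency per transfer for a given interface
};

static double estimated_speedup(const Partition &p) {
  double t_remaining = p.t_total - p.t_offloaded;   // stays on the CPU
  double t_accel = p.t_offloaded / p.accel_gain;    // runs on the accelerator
  double t_interface = p.calls * p.if_latency;      // interface overhead
  return p.t_total / (t_remaining + t_accel + t_interface);
}

int main() {
  Partition p{10.0, 6.0, 8.0, 1e5, 2e-6};           // made-up numbers
  std::printf("estimated speedup: %.2f\n", estimated_speedup(p));
  return 0;
}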

6.2 Proposed Architecture

In this section we present the general model for an integrated CPU-accelerator architecture that we investigate with our estimation and partitioning framework in this chapter. The focus of this work is on the interface of the architecture and its impact on the hardware/software partitioning. Therefore we intentionally leave the internal architecture of the reconfigurable accelerator relatively open to allow for flexible execution models adapted to the needs of different applications. The model conceptually covers FPGAs as fine-grained reconfigurable accelerators, as well as coarse-grained architectures, either implemented directly in hardware as introduced in Section 3.2.2 or as structurally programmable overlays on FPGAs as introduced in Section 3.2.3. We abstract the internal details of these architecture variations away in this work by using a general efficiency metric as presented in Section 6.3.1. With an explicitly spatial interpretation of compute resources, this estimation method does not cover instruction-programmable overlays. Instead, our model allows mapping the customized datapaths of soft processors onto the reconfigurable overlay, whereas the remainder of the execution would remain on the GPP.

Our interface template is specifically designed to enable both such tight coupling of computations and more independent execution of entire loops or kernels on the accelerator. Subsection 3.2.2 introduces some proposed architectures that focus on only one of these two options. In our architecture concept, in order to achieve higher flexibility, the accelerator is embedded into the compute system through two interfaces, illustrated in yellow in Figure 6.1. First, a direct low latency interface to an adjacent general-purpose CPU allows fine-grained interaction mainly on the control level, like activating a particular configuration of the accelerator, triggering its execution and synchronizing with its results. This may also include the exchange of small data entities like predicates, flags and small scalars from and to CPU registers.


The second interface gives the accelerator access to the memory hierarchy independently from the CPU and can have different contact points to this hierarchy. In order to support virtual memory, the accelerator needs either a replicated memory management unit (MMU) or access to the CPU's MMU, possibly through the first interface element. The dual interface simplifies the overall design, as the accelerator's memory interface may be tailored to the needs of the accelerator without compromising the CPU architecture. It comes at the possible cost of integrating shared caches, which we consider in our design space exploration in Section 6.4.2. The distinctive feature of this coupling with a dual interface is the support of a wide range of granularities of accelerated code parts, reaching from custom instructions, across kernel loops, up to functions or threads. Although the proposed interface does not require a complete CPU redesign, it still allows for the acceleration of fine-grained tasks and thus can also increase single-thread performance, an area where current CPU designs face diminishing returns.

The configuration in Figure 6.1 with private L1 caches for both CPU and FPGA and a shared L2 cache is one possible configuration of the memory hierarchy integration. It requires a coherency mechanism between the private caches. In our general interface template, the private caches on the accelerator side are optional and can either be omitted, which we investigate in this chapter, or replaced by other forms of local memory like scratchpad memory or buffers, which are not evaluated here. As a further degree of freedom, we vary the point where private and shared caches are split, up to the extreme points where either no private or no shared caches exist.

There are various options to integrate our proposed architecture into a multicore system. Due to the proposed direct coupling of a CPU core and an accelerator, the most straightforward integration into a multicore system would be to pair individual CPU cores, either all or only a subset of the multicore CPU, each with a private accelerator. A possible configuration of such a multicore CPU with accelerators is illustrated in Figure 6.2. Depending on the accelerator's internal architecture, resource sharing between several accelerators on the same chip is an additional option. In our design space exploration in Section 6.4, we only evaluated single-threaded workloads and hence just investigate a single CPU-accelerator pair.

Figure 6.2: Possible integration of the proposed architecture into a multicore CPU.
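To summarize the degrees of freedom spanned by this interface template, the following struct is a shorthand of the parameters varied in the design space exploration of Section 6.4; the field names are purely illustrative and do not correspond to identifiers in the actual framework.

// Shorthand for the explored interface design space; names are illustrative.
struct InterfaceDesignPoint {
  bool   accel_private_l1;    // private L1 cache on the accelerator side?
  int    shared_cache_level;  // first cache level shared by CPU and accelerator
                              // (0 = no shared cache, share only main memory)
  double cpu_link_latency;    // latency of the direct CPU <-> accelerator interface
  double mem_access_latency;  // latency of the accelerator's memory-hierarchy access
  double accel_efficiency;    // general efficiency metric of the accelerator fabric
};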


Figure 6.3: Illustration of the Xilinx Zynq memory hierarchy. [Figure: two CPU cores with private L1 caches and the FPGA logic share an L2 cache, which is backed by main memory.]

6.2.1 Relation to Existing Architectures

At the time of our original publications in [2, 3], the most closely related available systems had integrated FPGA accelerators in CPU sockets on dual- or multi-socket mainboards. Examples of such systems are the XtremeData In-Socket Accelerators [280], [279] with Altera FPGAs, which communicate with the CPU via the Intel Front Side Bus or via HyperTransport, respectively. Similarly, the Nallatech FPGA Socket Fillers [199] use the Intel Front Side Bus and contain Xilinx FPGAs. The Convey HC-1 targeted in Chapters 4 and 5 and illustrated in Figure 4.4 is another representative of this system class and has the distinctive feature of a coherent memory space, realized with a sophisticated mix of software on the CPU side and hardware support on the host interface FPGA. Meanwhile, first available products more directly exhibit individual aspects of our proposed architecture pattern. The Xilinx Zynq product line [214] and the Altera SoC product line [19] integrate a dual-core ARM Cortex-A9 processor with FPGA fabric into a single SoC. These processors use out-of-order execution, speculation and superscalar execution and thus belong to the class of mobile processors that brought general-purpose computing characteristics into mobile devices, as outlined in Section 2.3.1. In Figure 6.3, we give a simplified illustration of the Zynq architecture in comparison to our architectural template. The FPGA logic has direct access to off-chip main memory through so-called high performance ports and to the shared L2 cache through the so-called Accelerator Coherency Port (ACP), and thus is similar to our hierarchy template without private L1 on the accelerator side. Additionally, not depicted in the figure, the Zynq architecture contains a scratchpad memory that, like the L2 cache, is accessible both by the CPU cores and the FPGA fabric. The FPGA of the Zynq has no fine-grained interface to a CPU core, but through its general purpose ports it can, for example, trigger interrupts on the processor. With the CAPI interface, more recently also the POWER8 architecture for high-end server processors provides hardware support for cache-coherent accelerators [242, 246], which are, however, not integrated on the same chip. One supported accelerator class are PCIe-attached FPGA boards. Figure 6.4 illustrates a simplified view of the architecture, where directly adjacent boxes are integrated on one chip and gray arrows indicate off-chip communication.


[Figure: the FPGA logic with a PSL cache connects through a cache agent to the on-chip interconnect of the POWER8 CPU, whose cores have private L1 and L2 caches and share a banked L3 cache, with an L4 cache located on the memory chips in front of main memory.]

Figure 6.4: Illustration of the POWER8 memory hierarchy with the Coherent Accelerator Processor Interface (CAPI).

On behalf of the external accelerator, a cache agent is attached to the common on-chip coherence and data interconnect. The memory hierarchy of this architecture is much more complex than that of the Zynq and in particular also includes a private cache implemented in the FPGA's BRAMs. While these architecture innovations focus on the integration of accelerators into memory hierarchies and do not exhibit the fine-grained synchronization and register exchange of our first proposed interface component, there is also a project that can illustrate this aspect of our concept. The proposed and successfully prototyped VISC (no acronym) architecture [64] is purely a multicore processor architecture without reconfigurable accelerators. However, it proposes the close collaboration of several (in the first design two) CPU cores on a single task and software thread, just like in our architecture approach CPU and accelerator can jointly work on the same task and thread. Consequently, the designers of VISC focused on the tight coupling of a pair of cores and implemented a dual interface that enables the direct exchange of register values in a single cycle and also features a partitioned shared L1 cache. With a working prototype aiming at a 2 GHz clock frequency, this illustrates that the tight coupling proposed by our architecture is possible on the CPU side.

6.3 Method and Framework

In this section, we lay out the details of our performance estimation method and present our partitioning method, which is inspired by the work of Spacey et al. [241] (see also Section 6.5).

6.3.1 Estimation Model

In this section, we present the basic terms and definitions of our model as summarized in Table 6.1. The architecture-independent characterization of each benchmark depends on these symbols and gathers the values summarized in Table 6.2 by a combination of static code analysis and profiling with code instrumentation.


Table 6.1: Basic symbols of the model.

Instruction                             I_k
Basic block                             B_k
CPU, accelerator                        CPU, ACC
Memory hierarchy                        L1, L2, L3, MEM
j-th execution of instruction I_k       I_k^j

Table 6.2: Data obtained by profiling and static code analysis.

Execution count of instruction I_k, basic block B_k        n(I_k), n(B_k)
Count of control flow from B_l to B_m                      n(B_l, B_m)
Register values used or produced by B_l                    R_u(B_l), R_w(B_l)
Virtual cache level that executes a load/store             V(I_k^j)
Dependency pointer for a load/store                        P(I_k^j)

A concrete incarnation of our architecture template is characterized by a set of architecture parameters, which are summarized in Table 6.3. The default parameters of our design space exploration are presented in Section 6.4 in Table 6.5. Based on the benchmark characterization and the architecture parameters, our framework computes a specific execution profile that adds up the values summarized in Table 6.4. Like the tool flow presented in Chapter 5, our estimation framework is based on the LLVM compiler infrastructure [163]. The investigated software is compiled into LLVM assembly language, which is the intermediate code representation on which LLVM's analysis and optimization passes work. We model a program as a set of instructions I = {I_1, I_2, ..., I_{n_ins}} and classify the instructions into memory-dependent operations (e.g., load/store) and independent operations: I_k ∈ {ld/st, op}. The program structure groups the instructions into a set of basic blocks B = {B_1, B_2, ..., B_{n_blks}}. The basic blocks form a control flow graph where an edge B_l → B_m denotes that block B_m might be executed directly after block B_l. The instructions of B_l use a set of register values R_u(B_l) and produce a set of register values R_w(B_l).
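As an illustration of this program model, the profiled quantities of Tables 6.1 and 6.2 could be held in data structures along the following lines; the class and field names are our own simplification, not the actual framework interfaces.

    from dataclasses import dataclass, field
    from typing import Dict, List, Optional, Set, Tuple

    @dataclass
    class Instruction:
        kind: str                                   # "op" or "ld/st"
        exec_count: int = 0                         # n(I_k)
        virt_levels: List[int] = field(default_factory=list)             # V(I_k^j) per execution j
        dep_pointers: List[Optional[int]] = field(default_factory=list)  # P(I_k^j) per execution j

    @dataclass
    class BasicBlock:
        instructions: List[int]                     # indices of the instructions in this block
        exec_count: int = 0                         # n(B_l)
        regs_used: Set[str] = field(default_factory=set)      # R_u(B_l)
        regs_written: Set[str] = field(default_factory=set)   # R_w(B_l)

    @dataclass
    class Program:
        instructions: List[Instruction]
        blocks: List[BasicBlock]
        edge_counts: Dict[Tuple[int, int], int]     # n(B_l, B_m) for CFG edges B_l -> B_m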

Table 6.3: Architecture parameters of the model.

Execution efficiencies       ε(CPU) and ε(ACC)
Communication latencies      λ_c, λ_{r,push}, λ_{r,pull}
Cache latencies              λ_m(L1), λ_m(L2), λ_m(L3), λ_m(MEM)
Accelerator size             A


Table 6.4: Solution-specific data depending on architecture.

Mapping of basic block or instruction                      p(B_l), p(I_k)
Architectural cache level that executes a load/store       v(I_k^j)

The architecture comprises two processing units, the CPU and the accelerator (ACC), and a memory hierarchy consisting of L1, L2 and optional L3 caches, and main memory (MEM). The caches can be private for each processing unit, like the L1 cache in Figure 6.1, or shared between both units, like the L2 cache in the same figure. We model the execution efficiencies ε(CPU) and ε(ACC) of the processing units through the average number of clock cycles spent per instruction (CPI), thus as the inverse of the IPC metric mentioned in Chapter 2. The efficiency reflects on the one hand raw execution times of instructions and on the other hand parallel execution units and pipelining effects, which increase the throughput. We see this as a means to cover the various architectural opportunities discussed in Section 6.2 without making strict assumptions about the accelerator's internal details. Thus, in contrast to the efficiency term introduced in Section 2.1.1, the execution efficiency ε in this model has no common denominator and is rather a consolidated performance metric. The model efficiencies can vary for different instruction classes, but lacking more accurate data for both processing units, in this work we use identical efficiency for all instructions except for two instruction classes. First, we assume that LLVM typecast instructions are implemented on the accelerator through wiring and thus are executed in zero execution time. Second, load/store instructions are also considered in a different way, because their execution time depends on the level in the memory hierarchy that they access. We describe the corresponding access latencies with λ_m(L1), λ_m(L2), λ_m(L3) and λ_m(MEM), expressed in clock cycles. When a processing unit requires data resident in the private cache of the other processing unit, we include a latency penalty for writing back the data to the shared cache. For communication between the CPU and the accelerator, we define λ_c as the latency for transferring control between the processing units. This control latency also covers the efficiency losses that may occur when the pipelining of instructions is reduced by control changes. Furthermore, we denote λ_r as the latency for transferring a register value. Refining the register value transfer model, we foresee a push method with a low latency of λ_{r,push} for actively sending a register value from one location to the other and a somewhat slower pull method with latency λ_{r,pull} for requesting a register value from the other location and receiving it. Since the analysis of register dependencies is based on the code in LLVM intermediate representation, no register allocation has taken place, so all values are treated as available in an infinite register file after their first occurrence. The partitioning process maps each basic block to either the CPU or the accelerator. We denote the mapping of block B_l as p(B_l) with the two possible values p(B_l) = CPU or p(B_l) = ACC. Obviously, the partitioning of basic blocks also implies a partitioning p(I_k) of instructions I_k into p(I_k) = CPU or p(I_k) = ACC, since ∀k, l : I_k ∈ B_l ⇒ p(I_k) = p(B_l). At this time we do not assume concurrent execution on both CPU and accelerator, so the execution order of the program remains unchanged regardless of the mapping.

One way to establish concurrent execution could be through thread-level parallelism by adding multithreading support to CPU and accelerator, which does not impact the analysis presented here. Our model limits the total number of instructions mapped to the accelerator and accounts for a unit area for each such instruction, thus modeling a structurally programmable accelerator without temporal reuse of functional units through time-multiplexing. The size A of the accelerator, denoted in area units, is part of the architecture parameters. An alternative model is to limit the size of each basic block mapped to the accelerator and to utilize dynamic reconfiguration between consecutive executions of different basic blocks on the accelerator. While such an approach would allow us to map an arbitrary number of instructions to the accelerator, it would also require a more involved modeling of the configuration process, including, for example, the number of configuration bits per instruction, the size and latency of the configuration memory and the required reconfiguration time. It would thus either significantly increase the design space to explore, or limit the model to a specific setup of an accelerator architecture. Using the LLVM compiler infrastructure to instrument the benchmark codes and profile program executions, we determine the execution count of an instruction I_k as n(I_k), of a basic block B_l as n(B_l), and the number of control flows over an edge B_l → B_m of the control flow graph as n(B_l, B_m). Furthermore, we denote the j-th execution of instruction I_k as I_k^j and use this separation to determine the level accessed in the memory hierarchy for each load/store instruction as v(I_k^j), with the possible values L1, L2, L3 and MEM. These values v(I_k^j) are determined in two steps. First, during the benchmark characterization phase, we add a memory profiling pass to LLVM and perform a functional simulation of a deep virtual cache hierarchy consisting of many levels of inclusive, direct-mapped caches with sizes increasing by a factor of two for every level. This simulation maps each memory access to a virtual cache level V(I_k^j) and additionally annotates it with a dependency pointer P(I_k^j), referring to the previous memory instruction that accessed the same cache line. In the performance estimation phase, the virtual cache hierarchies are collapsed to the physically existing caches, which hence need to have power-of-two sizes. Given a concrete mapping of each load/store instruction to CPU or accelerator, the dependency pointer allows us to determine in which private or shared cache the first valid cache line for this memory access is located. Thus, we can compute the memory access time for each partitioning step without a repeated cache simulation, which would otherwise slow down the partitioning process significantly. At this point we also profit from the missing register allocation in that no register spills occur, which would change the memory access patterns for different mappings.
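To make the two-step procedure more concrete, the following sketch mimics the functional simulation of the deep virtual cache hierarchy with inclusive, direct-mapped levels of doubling size, recording V(I_k^j) and P(I_k^j) per access. Line size, number of levels and base size are illustrative assumptions, and this is a simplified reimplementation of the idea, not the actual LLVM pass.

    LINE_SIZE = 64        # bytes per cache line (assumption)
    NUM_LEVELS = 16       # depth of the virtual hierarchy (assumption)
    BASE_LINES = 8        # lines in the smallest virtual cache (assumption)

    class VirtualCacheSim:
        """Functional simulation of a deep hierarchy of inclusive, direct-mapped
        virtual caches whose sizes double with every level."""

        def __init__(self):
            self.tags = [dict() for _ in range(NUM_LEVELS)]  # tags[level][set] = cached line
            self.last_access = {}                            # line -> id of previous access

        def access(self, access_id, address):
            line = address // LINE_SIZE
            level = NUM_LEVELS                               # NUM_LEVELS stands for MEM
            for lvl in range(NUM_LEVELS):                    # smallest cache holding the line
                num_sets = BASE_LINES << lvl
                if self.tags[lvl].get(line % num_sets) == line:
                    level = lvl
                    break
            dep = self.last_access.get(line)                 # dependency pointer P(I_k^j)
            self.last_access[line] = access_id
            for lvl in range(NUM_LEVELS):                    # install the line on all levels
                num_sets = BASE_LINES << lvl
                self.tags[lvl][line % num_sets] = line
            return level, dep                                # V(I_k^j), P(I_k^j)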
Finally, for each architecture-specific execution profile that includes a complete mapping of each basic block to a processing unit, we estimate the total program runtime as the sum of four components: the execution time t_e of instructions with only register operands, the memory access time t_m for load and store instructions, the time t_c for transferring control between successive basic blocks mapped to different processing units, and the time t_r for

exchanging register values between CPU and accelerator:

    t = t_e + t_m + t_c + t_r

Execution time:

    t_e = Σ_{k: I_k = op} n(I_k) · ε(p(I_k))

Memory access time:

    t_m = Σ_{k: I_k = ld/st} Σ_{j=1}^{n(I_k)} λ_m(v(I_k^j))

Control transfer time:

    t_c = Σ_{(l,m): (B_l → B_m) ∧ p(B_l) ≠ p(B_m)} n(B_l, B_m) · λ_c

Register value transfer time:

    t_r = Σ_{R: R ∈ R_w(B_l) ∧ R ∈ R_u(B_m) ∧ p(B_l) ≠ p(B_m)} min(n(B_l) · λ_{r,push}, n(B_m) · λ_{r,pull})
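As an illustration, the four terms can be transcribed into code along the following lines. The sketch assumes the data structures from the earlier sketch, architecture parameters passed as plain dictionaries, and a callable that collapses the virtual cache level of an access to a physical level under the given mapping; all names are ours.

    def estimate_runtime(prog, mapping, eps, lam_m, lam_c, lam_push, lam_pull, phys_level):
        """Estimate t = t_e + t_m + t_c + t_r for one mapping p(B_l) in {"CPU", "ACC"}.
        eps:        dict unit -> cycles per instruction (execution efficiency)
        lam_m:      dict physical level -> access latency in cycles
        phys_level: callable (instruction, j, mapping) -> physical level, collapsing
                    the virtual level V(I_k^j) under the given mapping."""
        t_e = t_m = t_c = t_r = 0.0
        for bl, block in enumerate(prog.blocks):
            unit = mapping[bl]
            for k in block.instructions:
                ins = prog.instructions[k]
                if ins.kind == "op":
                    t_e += ins.exec_count * eps[unit]                 # execution time t_e
                else:
                    for j in range(ins.exec_count):                   # memory access time t_m
                        t_m += lam_m[phys_level(ins, j, mapping)]
        for (bl, bm), count in prog.edge_counts.items():              # control transfer time t_c
            if mapping[bl] != mapping[bm]:
                t_c += count * lam_c
        for bl, src in enumerate(prog.blocks):                        # register transfer time t_r
            for bm, dst in enumerate(prog.blocks):
                if mapping[bl] == mapping[bm]:
                    continue
                for _reg in src.regs_written & dst.regs_used:
                    t_r += min(src.exec_count * lam_push, dst.exec_count * lam_pull)
        return t_e + t_m + t_c + t_r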

6.3.2 Partitioning Approach

We utilize a greedy partitioning algorithm that starts with all blocks on the CPU and iteratively moves partitioning objects to the accelerator, as long as the cumulative area of all moved blocks fits the size of the accelerator. At each step, the partitioning object with the highest attractiveness, which is the ratio of estimated speedup to area requirements, is chosen. An overview of the partitioner is given in Algorithm 6.1. As presented in the previous section, we assume that each instruction in the mapped basic blocks requires one area unit on the accelerator.

Algorithm 6.1 multi-level partitioning
Input: BasicBlocks initialized with mapping to CPU
Input: A = size of total accelerator resources
Output: partitioned BasicBlocks
 1: Next ← getBestPartitionObject(A)
 2: while Next ≠ null do
 3:     moveToAccelerator(Next)
 4:     A ← A − computeArea(Next)
 5:     Next ← getBestPartitionObject(A)
 6: end while

Function 6.2 getBestPartitionObject
Input: A = size of currently available accelerator resources
 1: Objects ← getMultiLevelPartitioningObjects()
 2: BestObject ← null
 3: BestAttractiveness ← 0
 4: for all Object ∈ Objects do
 5:     Attractiveness ← computeSpeedup(Object) / computeArea(Object)
 6:     if Attractiveness > BestAttractiveness ∧ A > computeArea(Object) then
 7:         BestAttractiveness ← Attractiveness
 8:         BestObject ← Object
 9:     end if
10: end for
11: return BestObject
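Algorithm 6.1 and Function 6.2 translate almost directly into executable form. The following sketch assumes that the candidate objects and the speedup and area of an object are provided by callables similar to getMultiLevelPartitioningObjects, computeSpeedup and computeArea; all names are ours, not the framework's actual interfaces.

    def greedy_partition(get_objects, total_area, speedup_of, area_of, move_to_acc):
        """Greedy multi-level partitioning following Algorithm 6.1 / Function 6.2:
        repeatedly move the object with the highest attractiveness that still fits."""
        remaining = total_area
        while True:
            best, best_attr = None, 0.0
            for obj in get_objects():                  # blocks, loops, functions, remainders
                area = area_of(obj)
                if area <= 0 or area >= remaining:     # must fit the remaining resources
                    continue
                attr = speedup_of(obj) / area          # attractiveness = speedup / area
                if attr > best_attr:
                    best, best_attr = obj, attr
            if best is None:                           # nothing attractive fits any more
                return
            move_to_acc(best)                          # update the block mapping
            remaining -= area_of(best)                 # one area unit per mapped instruction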

As partitioning objects, we use not only all single basic blocks but additionally all loops (inner as well as nested loops) and all functions. Furthermore, whenever the partitioner moves part of a loop or function to the accelerator, the remaining basic blocks of the loop or function form another new partitioning object. In contrast to the partitioning method from Chapter 5 that focused solely on loops because of the accelerator architecture, we now include smaller and larger partitioning objects to cover different accelerator execution models. Hence, we denote this heuristic as multi-level partitioning. Beyond individual basic blocks, loops and functions, other beneficial partitioning objects may exist, like subsets of loops or functions, or like a pair of basic blocks with dependent memory accesses. However, we restricted the search space to the described categories, because the total number of all possible partitioning objects grows exponentially with the number of basic blocks. We have exemplarily validated the quality of this partitioning method with exact solutions obtained through integer linear programming. The results show that most solutions generated by the heuristic partitioner are identical or very close to the optimal solutions. Due to lengthy execution times of the exact method, the design space exploration in the following section builds upon results from the heuristic method.
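For illustration, the enumeration of such multi-level partitioning objects could be sketched as follows; the loop and function block sets are assumed to come from LLVM's loop and call graph analyses, and all names are ours.

    def multi_level_objects(blocks, loops, functions, mapping):
        """Enumerate partitioning objects: single basic blocks, whole loops (inner and
        nested) and whole functions, plus the CPU-resident remainder of any loop or
        function that has already been moved partially."""
        objects = [frozenset([b]) for b in blocks]                  # single basic blocks
        for region in list(loops) + list(functions):                # loops and functions
            region = frozenset(region)
            objects.append(region)
            remainder = frozenset(b for b in region if mapping[b] == "CPU")
            if remainder and remainder != region:                   # partially moved region
                objects.append(remainder)
        return objects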

6.4 Design Space Exploration

Based on this estimation method and framework, we now explore the potential and design space of our architecture model for the integration of CPUs with reconfigurable accelerators. All presented experiments are obtained with the partitioning approach introduced in the previous section and use the default parameters listed in Table 6.5, unless stated otherwise. The default memory hierarchy is a configuration with a shared L2 and private L1 caches as depicted in Figure 6.1, and its size and latency properties mimic those of early dual-core GPPs. The accelerator size is expressed by the number of LLVM instructions I_k that can be mapped to the reconfigurable hardware processing unit. The chosen accelerator efficiency models a rather modest speedup potential of up to 2x, because we don't limit the accelerator model to extreme parallelism or customization that would justify much higher speedup potentials.


Table 6.5: Default model parameters for design space exploration.

Description               Symbol        Value
Execution efficiencies    ε(CPU)        1.0 cycles
                          ε(ACC)        0.5 cycles
Communication latencies   λ_c           2 cycles
                          λ_{r,push}    1 cycle
                          λ_{r,pull}    3 cycles
Cache latencies           λ(L1)         3 cycles
                          λ(L2)         15 cycles
                          λ(MEM)        200 cycles
Cache sizes               L1            32 KB
                          L2            4 MB
Accelerator size          A             128

We investigate a set of 13 benchmarks that represent compute-intensive kernels from various application domains. Most of the benchmarks are taken from the MiBench suite [106], which contains six categories. The automotive, consumer, network and office domains are covered by the Susan, JPEG, Dijkstra and Stringsearch applications, respectively. From the telecommunications domain, we have included ADPCM, CRC32 and a fast Fourier transform (FFT). From the security area, we use Blowfish and SHA from MiBench and add MD5 and AES. Furthermore, we use the Whetstone benchmark and an SOR implementation representing the field of numerical analysis.
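Tying the estimation and partitioning sketches above together, the following minimal driver illustrates how the per-benchmark speedups of this exploration could be computed, assuming the speedup is measured against an all-CPU mapping; the benchmark identifiers and the three callables are placeholders.

    # Default architecture parameters of Table 6.5 (keys are our own naming).
    DEFAULT_PARAMS = {
        "eps": {"CPU": 1.0, "ACC": 0.5},             # execution efficiencies in CPI
        "lam_c": 2, "lam_push": 1, "lam_pull": 3,    # communication latencies in cycles
        "lam_m": {"L1": 3, "L2": 15, "MEM": 200},    # cache/memory latencies in cycles
        "acc_size": 128,                             # accelerator size in LLVM instructions
    }

    # The 13 benchmarks; the identifiers are only illustrative labels.
    BENCHMARKS = ["susan", "jpeg", "dijkstra", "stringsearch", "adpcm", "crc32", "fft",
                  "blowfish", "sha", "md5", "aes", "whetstone", "sor"]

    def explore(load_profile, partition, estimate, params=DEFAULT_PARAMS):
        """Compute per-benchmark speedups against an all-CPU mapping; the three
        callables wrap the characterization, partitioning and estimation steps."""
        speedups = {}
        for name in BENCHMARKS:
            prog = load_profile(name)
            all_cpu = {bl: "CPU" for bl in range(len(prog.blocks))}
            mapping = partition(prog, params)
            speedups[name] = estimate(prog, all_cpu, params) / estimate(prog, mapping, params)
        return speedups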

6.4.1 Speedups per benchmark

Figure 6.5 shows the speedup potential that our estimation method predicts for our architecture model for the presented benchmarks with an accelerator size of 128 as specified in Table 6.5 and with a large accelerator size of 2048. The speedups range from 1x to 1.78x at a size of 128 and from 1.28x to 1.91x at a size of 2048. As mentioned, the accelerator efficiency of the baseline model limits speedups to at most 2x. Code remaining on the CPU, communication overheads and memory limitations contribute to the observed speedups below this limit. We note that for a number of benchmarks the default accelerator size of 128 already suffices, whereas the cryptographic benchmarks require a larger size to show notable speedups at all. A more detailed analysis of accelerator sizes follows in Subsection 6.4.3. Aside from the size requirements of the cryptographic benchmarks, their speedups are notably lower than those of fully custom cryptographic implementations on FPGAs. Both effects can be attributed to the fact that our estimation method does not capture the bit-level parallelism that is typically exploited to accelerate cryptographic functions on reconfigurable hardware.


Figure 6.5: Speedups for 13 benchmarks with two different accelerator sizes. Further architecture parameters as specified in Table 6.5.

Table 6.6: Alternative parameters used for the three-level cache hierarchy.

Cache latencies     λ(L1)    4 cycles
                    λ(L2)    11 cycles
                    λ(L3)    39 cycles
                    λ(MEM)   150 cycles
Cache sizes         L1       32 KB
                    L2       256 KB
                    L3       8 MB

6.4.2 Memory integration

We investigate the impact of different models for integrating the accelerator into the memory hierarchy. As starting points, we use the two-level cache configuration specified in Table 6.5 and a three-level cache configuration that mimics more recent general-purpose CPUs with a smaller and faster L2 cache and an additional L3 cache, as specified in Table 6.6. In both hierarchies we integrate the accelerator with five different design points: a shared L1 data cache (SL1), a shared L2 cache where both CPU and accelerator have private L1 caches (SL2+) (shown in Figure 6.1), a shared L2 cache where only the CPU has a private L1 cache and the accelerator has no local memory at all (SL2-), shared main memory or a shared L3 cache with private caches for both components (SMM+ or SL3+), and shared main memory or a shared L3 cache with private caches only for the CPU (SMM- or SL3-).


Table 6.7: Investigated cache configurations. Cache sizes follow Tables 6.5 and 6.6, respectively.

        two-level cache hierarchy            three-level cache hierarchy
        shared    CPU private  ACC private   shared    CPU private  ACC private
SL1     L1 – MM   —            —             L1 – MM   —            —
SL2+    L2, MM    L1           L1            L2 – MM   L1           L1
SL2-    L2, MM    L1           —             L2 – MM   L1           —
SMM+    MM        L1, L2       L1, L2
SMM-    MM        L1, L2       —
SL3+                                         L3        L1, L2       L1, L2
SL3-                                         L3        L1, L2       —

We summarize these configurations in Table 6.7. They cover different aspects of available and announced architectures mentioned in Subsection 6.2.1. The SL1 design points are very similar to the closely coupled VISC processor cores. Zynq and Altera SoC devices resemble an SL2- configuration, and POWER8 with a CAPI accelerator represents something between SL3+ and SL3-, with a private cache on the FPGA side, but not symmetric to the private CPU caches. For each of these configurations, we evaluate two parameter variants, one with the unaltered parameters from Tables 6.5 and 6.6, and the other one with the latency of the first shared cache increased by one cycle. This penalty is to reflect the increased complexity of a shared cache over a private one. Overheads for coherency that occur at private caches are not separately modeled. Figures 6.6 and 6.7 show the results averaged over all 13 benchmarks for the two-level and three-level cache hierarchies, respectively. It turns out that even though the results of the different cache hierarchies show some differences, the big picture remains the same for both. We note that without penalty, a shared L1 cache (SL1) delivers the best performance. However, this architectural design point turns out to be highly sensitive to latency penalties. Overall, the shared L2 cache with private L1 caches for both CPU and accelerator (SL2+) is a well-performing and also robust design point, since it retains most of its speedup potential when applying the latency penalty to the shared L2 cache. Design points with private caches on both sides and shared main memory or shared L3 caches (SMM+ and SL3+) show lower but still notable speedups and a similar robustness. When the proposed architecture gets integrated into a multicore CPU, SL2+ might also be the best design for architectural integration, sharing memory between an accelerator and its controlling CPU on L2 and between multiple CPU-accelerator pairs on L3, as illustrated in Figure 6.2. The memory hierarchies where the accelerator does not have a private cache exhibit a much lower performance, with similar values for the two- and three-level hierarchies and for SL2-, SMM- and SL3-. While this experiment underlines the necessity for the accelerator to have access to low-latency memory, it does not prove that this memory actually needs to be a cache; explicitly managed scratchpad memory or BRAM resources might be a viable alternative.
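For the automated exploration, the design points of Table 6.7 lend themselves to a plain data description; the following sketch (with our own field layout) encodes the two-level variants and the one-cycle penalty on the first shared cache.

    # Design points of Table 6.7 for the two-level hierarchy (three-level analogous):
    # (levels shared by CPU and ACC, CPU-private levels, ACC-private levels).
    TWO_LEVEL_CONFIGS = {
        "SL1":  (["L1", "MEM"], [],           []),
        "SL2+": (["L2", "MEM"], ["L1"],       ["L1"]),
        "SL2-": (["L2", "MEM"], ["L1"],       []),
        "SMM+": (["MEM"],       ["L1", "L2"], ["L1", "L2"]),
        "SMM-": (["MEM"],       ["L1", "L2"], []),
    }

    def with_shared_penalty(lam_m, shared_levels, penalty=1):
        """Return latencies with the first shared cache slowed down by one cycle;
        no penalty is applied if the first shared level is main memory."""
        lam = dict(lam_m)
        first = shared_levels[0]
        if first != "MEM":
            lam[first] = lam.get(first, 0) + penalty
        return lam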


Figure 6.6: Speedups for different memory integrations into a two-level cache hierarchy, with and without a 1-cycle latency penalty for shared caches, averaged over 13 benchmarks.


Figure 6.7: Speedups for different memory integrations into a three-level cache hierarchy, with and without a 1-cycle latency penalty for shared caches, averaged over 13 benchmarks.

However, targeting explicitly managed local memory requires more development effort from application developers, so private caches on the accelerator side seem much more attractive for general-purpose adoption of reconfigurable accelerators. Our practical experience with the Convey HC-1 platform, which lacks such a private accelerator cache, underlines this, as the kernels turned out to be highly sensitive to data reuse in registers. Thus, from this perspective, the POWER8 with CAPI architecture is a promising step in the right direction compared to the Zynq and Altera SoC architectures.

6.4.3 Accelerator Size

After our first overview of speedup numbers in Figure 6.5 already gave a glimpse of the impact of accelerator sizes, we now vary this parameter systematically. Figure 6.8 shows that increasing the size of the accelerator causes the average speedups to grow with steadily diminishing returns. Note that the x-axis is already exponentially scaled and the y-axis is linearly scaled, and still the graph of the average speedup becomes flatter and almost saturates at a size of 1024.



Figure 6.8: Speedups for different accelerator sizes, averaged over all 13 benchmarks and with selected single benchmarks.

A more detailed investigation of selected individual benchmarks shows that the speedup does not grow smoothly with increasing accelerator size. Instead, most benchmarks show a phase transition behavior, that is, the speedup increases significantly when the accelerator size exceeds a certain threshold that allows mapping a beneficial selection of basic blocks to the accelerator. When further increasing the accelerator size, the speedups flatten out. The location of this phase transition threshold depends strongly on the application. We have selected some benchmarks to clearly illustrate this effect in Figure 6.8. While the Stringsearch benchmark already shows a phase transition between a size of 4 and 8, Whetstone profits in a range of 4 to 64 and FFT from 32 to 128. The cryptographic benchmarks SHA, Blowfish and MD5 show a very pronounced phase transition behavior when reaching an accelerator size of 128, 256, and 512, respectively. This behavior is a problem when determining the size of an accelerator for general-purpose computing. On the one hand, it reflects a limitation of our estimation method, which does not assess the potential for temporal reuse of accelerator resources. However, given the wide range of transition points observed, we don't expect this effect to entirely disappear with temporal reuse. Thus, on the other hand, it underlines the general importance of scalable methods to program reconfigurable accelerators, either with configuration parameters as demonstrated in Chapter 4, or by automated methods that work for different accelerator sizes.



Figure 6.9: Speedups for different accelerator execution efficiencies, averaged over all 13 benchmarks and with selected single benchmarks.

6.4.4 Execution Efficiency

Next, we investigate the effect of the execution efficiency of the accelerator. We vary this value between 0.9 and 0.1 cycles per instruction (CPI), while the efficiency of the CPU remains constant. For the majority of benchmarks, in Figure 6.9 exemplarily represented by Whetstone and JPEG, a partitioning that generates speedups is found already for minor differences between the execution efficiencies of accelerator and CPU. For Whetstone and several other benchmarks, this partitioning remains to a large degree unchanged (not visible in the figure) while the model parameters are changed towards a more efficient accelerator, which yields almost linear savings in total execution time. Since the computation of speedups puts these savings in relation to the final execution time, the graphs grow more than linearly the larger the speedups become. For other benchmarks, most pronouncedly JPEG, the resulting partitioning differs significantly for different accelerator efficiencies; nevertheless, the resulting speedups, after considering computational gains and communication overheads, grow quite regularly. The FFT benchmark shows an exception: here a certain difference in execution efficiencies is required before the potential savings in computation time overcome the communication penalties and enable speedups at all. Therefore the speedup graph for FFT grows irregularly. We conclude that for the majority of investigated benchmarks, already moderate differences in computation times overcome the communication latencies for the selected architecture parameters, and we continue to investigate this observation from another point of view in the next subsection.

6.4.5 Interface Latency

Figure 6.10 depicts the relative performance over the latency of the interface between CPU and accelerator. We apply one scaling factor to all three parameters of the interface latency, λ_c, λ_{r,pull} and λ_{r,push}, and investigate both lower and higher latencies than the default settings of Table 6.5. Even though it might be architecturally infeasible to reach the lowest studied latencies, the experiment provides insights into the nature of the design space.
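The sweep itself amounts to scaling the three interface latencies by a common factor; a minimal helper, reusing the parameter names from the earlier sketches, could look as follows.

    def scaled_interface(params, factor):
        """Scale the three interface latencies by a common factor (parameter names
        follow the earlier sketches)."""
        scaled = dict(params)
        for key in ("lam_c", "lam_push", "lam_pull"):
            scaled[key] = params[key] * factor
        return scaled

    # Latency factors investigated relative to the default setting.
    FACTORS = [0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512]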



Figure 6.10: Speedups for different interface latencies, averaged over all 13 benchmarks and with selected single benchmarks.

A number of benchmarks, represented by Whetstone, have very stable partitioning results not only for varying execution efficiencies, as discussed in the previous section, but also for varying interface latencies. This is enabled by program parts that can be moved to the accelerator with minimal communication requirements. Other benchmarks, again represented by JPEG, change their partitioning results with increasing latency factors, but still show relatively stable speedups. Again, FFT is most sensitive to increased latency factors and loses a significant portion of its speedups when the factor grows from 8 to 16. Yet overall, the results are remarkably stable for long communication latencies. We conclude that our current selection of benchmarks does not significantly profit from fine-grained acceleration. This coincides with the discussed architecture trends, with the exception of the VISC processors, which however promote concurrent execution on collaborating cores, whereas our model dedicatedly passes control between CPU and coprocessor. In Figure 6.10 we also observe two slightly anomalous results, which can be attributed to our greedy partitioning approach. For the extremely low latencies obtained with a factor of 0.25, the results for FFT are much worse than for the following latency factors. Also, the JPEG result for factor 64 is slightly worse than that for 128. In both cases, the partitioning results require a lot more communication than the ones for larger latencies. Clearly, the latter results would also have been superior for lower latencies, but they are not found due to the greedy nature of our partitioning algorithm.

6.5 Related Work

Among known high-level estimation methods, the approach of Spacey et al. [241] is most closely related to our work. There are, however, three important differences.

First, in contrast to the estimation method of Spacey et al., which models the memory subsystem with a single bandwidth parameter, our framework includes cache models and thus more realistically mimics relevant architectures. Second, their partitioning approach is limited to the basic block level, whereas for our experiments the multi-level partitioning technique was required to obtain good partitioning results. Finally, whereas the system of Spacey et al. is based on x86 assembly and can therefore be applied to x86 binary code, our framework leverages the LLVM infrastructure [163], which allows us to extend the framework towards automated code generation for various targets. A first step in this direction is presented in [1], whereas the work in Chapter 5 represents a more advanced evolution of this concept, yet with practical restrictions to a specific acceleration target. The field of binary-level partitioning has been investigated by Stitt et al. [244], who initially rely on the analysis of block sizes and iteration counts of small kernel loops. In [245], they incorporate alias information to identify and group together regions of code that access the same memory locations, but without using a model for data flow and communication latencies. The work of Holland et al. [126] provides a method for analyzing algorithms on a more abstract level in order to estimate how they can perform on CPU-accelerator architectures. In contrast to our work, their method requires user interaction, for example to identify data elements and their communication patterns. It incorporates measured throughput values of existing architectures, which can increase accuracy compared to our latency-based model, but limits the applicability for design space exploration of new architectures. Henkel et al. [117] present a partitioning approach with partitioning objects of a flexible granularity, ranging from one to many basic blocks. This approach is potentially more powerful than ours, but also requires more in-depth analysis, both for the creation and for the selection of appropriate partitioning objects.

6.6 Chapter Conclusion

This chapter presents our high-level performance estimation method and framework for static and dynamic code analysis and multi-level hardware/software partitioning. The capability of our framework to provide fully automated estimation and partitioning results can be used for a systematic design space exploration that was previously hard to undertake for new architectures. We demonstrate this by applying our framework to a proposed class of CPU-accelerator architectures with a dual interface that aims to overcome the limitations of previously proposed architectures that were tailored to specific granularities of acceleration targets. Our results indicate that the memory integration of CPU and accelerator is a very important design aspect. Within our model, either private caches on the accelerator side or a technically challenging shared L1 cache are required to achieve reasonable speedups. Various case studies have demonstrated that the alternative of explicitly managing local memory, which is not covered by our model, can yield competitive performance, but it is much less productive for application designers. Compared to the high importance of memory integration, our results are much less sensitive to the latency of the direct communication interface.

In contrast to our earlier assumptions, the distinctive potential of fine-granular acceleration did not show up within our performance model for the set of analyzed benchmarks. The recent industry trends for the integration of FPGAs with general-purpose processor systems show a remarkable correspondence. While earlier systems like the Convey HC-1 focused on the functional integration of memory spaces, the Xilinx Zynq and Altera SoC brought a much closer coupling with shared access to an L2 cache. More recently, POWER8 and CAPI take up the idea of a private cache on the FPGA side. On the other hand, none of these architectures took up the idea of a low-latency direct interface, due to technical infeasibility for Convey and POWER8, but for Xilinx Zynq and Altera SoC maybe also because of the limited potential indicated by our experiments. With their acquisition of Altera in 2015, Intel placed a 17 billion dollar bet on FPGA acceleration. Although first server processor products with CPU and FPGA chips in the same package have been announced, it remains to be seen which path towards further integration Intel will pursue. Overall, our performance model makes many abstractions in order to provide the flexibility and speed that we aimed for. We have not been able to quantify the accuracy of the model or to demonstrate that the abstraction-induced inaccuracies somehow even out on average. However, qualitatively, concrete architecture trends seem to vindicate our results, which were unseen in this generality before. The LLVM-based analysis and partitioning infrastructure used for the experiments in this chapter has evolved, to different extents, into further offloading and analysis projects [1, 7, 11, 12] presented within this thesis (mostly Chapter 5). As a curious observation, while we found the mix of instrumentation and profiling necessary to extend static code analysis for application characterization in [11, 12], we demonstrated that by deferring concrete offloading decisions to application runtime, the same dynamics can often be covered more precisely and with minimal overhead, without profiling [7, 11, 12].

CHAPTER 7

Conclusion

In this chapter, we briefly summarize the results of this thesis, before discussing next steps, future trends and research opportunities, first concretely around overlays and then with a broader scope.

7.1 Summary

In this thesis, we have investigated if and how FPGAs can play a bigger role in general-purpose computing. We have started with an overview of fundamental approaches to and trends in computing. From an architectural perspective, we see FPGAs with their ability to customize computations and parallelism for different workloads as promising candidates for several general-purpose computing markets. We identified design productivity, including portability and scalability, as the central challenge for a more wide-spread adoption of FPGAs in these markets. We propose to supplement current industry efforts on OpenCL and application libraries with overlay architectures on FPGAs as targets for fast and automated compilation. However, for instruction-programmable overlays, neither the involved overheads nor the practicability as compilation target were sufficiently investigated. In order to quantify such overheads, we extracted ten kernels from a stereo-matching application with high accuracy and general-purpose characteristics and ported them both to a vector coprocessor as overlay architecture and to fully customized FPGA implementations. We demonstrate that both targets enable speedups for the overall stereo-matching task, and compensate for hardware-platform-specific effects to quantify the overlay overhead as around 3x for a diverse set of kernels. On a more detailed look, this number is much smaller for regular streaming kernels, and higher for kernels with different forms of customization potential. The productivity of available compilation tools targeting this overlay hardly justified such overheads. However, with our work we demonstrate that knowledge of the overlay architecture can indeed enable fully automatic offloading processes with versatile and efficient code generation from software source code or binaries.

In two excursions, we discuss that deferring offloading decisions to program runtime can often enable fast and precise decisions from the application perspective, whereas decisions on the system level can lead to an overall better utilization of heterogeneous resources. Furthermore, we have investigated the system integration of reconfigurable accelerators with GPPs. Based on an estimation method that overcomes the interdependency between architectures and application design, we explored the design space for such heterogeneous systems. Our proposed tighter integration into a common memory hierarchy is now reflected by actual products that have recently emerged.

7.2 Outlook

Our outlook is subdivided into next steps and future directions in the area of overlay architectures on the one hand, and a broader view on future computing systems and their execution models on the other hand. In our ongoing research in the area of OTF computing, we continue to focus on the conceptual questions around overlay architectures, without losing touch with the dynamic evolution of industry-developed hardware architectures and tools.

7.2.1 Towards a Library of Overlay Architectures and Tools

Our concrete work has demonstrated the feasibility of one type of instruction-programmable overlay architecture as a means of highly productive FPGA acceleration and complements similar related work for structurally programmable overlays, in particular in the form of intermediate fabrics [63]. More overlays, like instruction-programmable manycore or GPU-like architectures, or structurally programmable architectures with decoupled functional units and local control, need to be understood on a comparable level with regard to overheads and compilation support. Also, it is not sufficiently clear to which degree different overlays complement each other for different workloads, or compete for a few particularly well-suited application patterns. For a larger practical impact, such overlay research needs to be consolidated in two regards. Firstly, we need a library of different overlays, synthesized for many FPGA types and ready to use with the according software interfaces in the relevant current systems with FPGA accelerators. Secondly, compilation tools targeting these overlays need to be systematically brought to production quality. Not all overlays and tools need to support the same set of source languages and program patterns, but for an application developer it needs to be either very clear which one to use, or they need a common design entry point behind which the further selection of overlays and tools can be hidden. It remains an open research question how many of the manifold application patterns that are generally suitable for FPGA acceleration can be efficiently covered with overlay architectures and corresponding tools.


When a first small library of overlays and compilation tools is established, further research needs to focus on the scalability of overlay architectures and execution models, and on increasing efficiency through customization. While examples for successful customization have been demonstrated both for instruction-programmable and for structurally programmable overlays, it is unclear how customization can be applied systematically, both with regard to the extraction of workload characteristics and with regard to common refinement methods for different overlay architectures. The generation of workload-customized overlay variants poses further questions concerning the prediction of customization effects and practical strategies for the local or centralized synthesis and deployment of designs. In the longer run, an ecosystem around overlay libraries on FPGAs may not only complement OpenCL-based synthesis and application libraries, but provide further synergies. For example, when a specific kernel or application is running frequently on some overlay architecture, the next step after specialization of this architecture can be a fully custom design. Thus, the overlay would help to find the most promising workloads for application libraries on FPGAs and to characterize the required degrees of flexibility that a custom design may need to exhibit. With regard to the other pillar, instruction-programmable overlays may serve as an intermediate compilation target for OpenCL code, whenever the lengthy synthesis process is not acceptable, or when quickly changing kernels favor a reusable configuration of the FPGA, like in our case study in Chapter 6.

7.2.2 And Beyond

Beyond the productivity-driven research on overlay architectures on FPGAs, further research on architectures and execution models can contribute to reaching new dimensions in absolute performance in HPC, in energy-constrained performance of mobile devices, and in highest efficiency for scaling workloads in cloud and OTF data centers. Particularly for the consolidated execution of diverse workloads in data centers, the interaction between heterogeneous scheduling decisions from a system perspective and application-internal knowledge needs to be further investigated. Our work and most related work on offloading to accelerators leaves the high-level program structure intact. In order to harvest additional parallelism on thread and task levels, programming models for asynchronous task execution are actively researched in several computing domains. Heterogeneous architectures with reconfigurable accelerators can profit from such approaches on different granularity levels. Concerning heterogeneity, we focused in this thesis on the system integration of FPGAs with CPUs. The most prominent drivers for integrated heterogeneous systems have been GPUs. From a performance and efficiency perspective, the trends go towards ever more heterogeneity between general-purpose and differently specialized components. We believe that the wide success of systems with extreme heterogeneity depends on productive programming models and tools.


Acronyms

ACP Accelerator Coherency Port. 135

ADPCM Adaptive Differential Pulse Code Modulation. 142

AES Advanced Encryption Standard. 58, 59, 142

AGU address generation unit. 28

ALM adaptive logic module. 22

ALU arithmetic logic unit. 8, 12, 16, 17, 21, 28, 33, 58

API application programming interface. 38, 48, 49

ASIC application specific integrated circuit. 17, 22, 30, 33, 39, 41

ASIP application-specific instruction set processor. 17, 30, 38, 42, 59

BLAS Basic Linear Algebra Subprograms. 50

BLE basic logic element. 21

BRAM block RAM. 22, 23, 33, 53, 91, 106, 136, 145

CAPI Coherent Accelerator Processor Interface. xviii, 51, 135, 136, 145, 150

CFB configurable function block. 60

CGRA coarse-grained reconfigurable array. 57, 60–62

CISC complex instruction set computer. 9, 15

CLB configurable logic block. 22

CPI cycles per instruction. 138, 147

CPU central processing unit. xiii, xvii, 2–4, 8, 11, 12, 17, 18, 30, 39, 41, 45, 47, 49–52, 56, 58, 65, 93, 96, 103, 114, 124, 126, 127, 131–136, 138–142, 144, 147–150, 153


CRC collaborative research centre. 43

CRC32 cyclic redundancy check, 32-bit. 142

CUDA former acronym: Compute Unified Device Architecture. 49, 50

DDR double data rate. 22

DFG data flow graph. 60–62

DIMM dual inline memory module. 109

DLP data-level parallelism. 11, 12, 18, 29, 30, 44, 48–50, 58, 65, 82, 109

DSL domain-specific language. 61

DSP digital signal processor. 9, 13, 15, 22, 24, 33, 38, 51, 52, 61

ECU electronic control unit. 37

EPIC Explicitly Parallel Instruction Computing. 28

FDTD finite difference time domain. 40

FF flip-flop. 21, 91

FFT fast Fourier transform. 142, 146–148

FIFO first in, first out. 18–20

FPGA field programmable gate array. v, vi, xiii, 1–5, 7, 21–26, 32–35, 38–47, 49, 51–66, 73, 75, 77, 90, 104, 109–115, 126–128, 131–136, 142, 144, 150–153

FSM finite-state machine. 18

FU functional unit. 60, 61

GPIO general-purpose input-output pins. 22

GPP general-purpose processor. 2, 9, 13, 29, 30, 32–34, 38, 44, 45, 75, 111, 125, 131, 133, 141, 152

GPU graphics-processing unit. 2, 12, 13, 30–35, 39–41, 43–46, 49–56, 58, 63, 75, 77, 109, 127, 132, 152, 153

HDL hardware description language. 23, 52–54, 56

HLS high-level synthesis. 23, 52–55, 59, 63


HMC Hybrid Memory Cube. 34

HPC high-performance computing. 2, 5, 35, 36, 39, 40, 42, 44, 45, 48, 50, 52, 54, 59, 112, 153

HSA Heterogeneous System Architecture. 51, 132

ILP instruction-level parallelism. 10, 11, 18, 28–30, 48

IO input output. 22

IPC instructions per cycle. 10, 11, 17, 28, 138

IR intermediate representation. 113, 115

ISA instruction set architecture. 8–10, 15, 17, 28, 30, 42, 59, 111, 112

ISE instruction set extension. 60

IT information technology. 43

ITRS International Technology Roadmap for Semiconductors. 26

JPEG format defined by Joint Photographic Experts Group. 142, 147, 148

LAB logic array block. 22

LLP loop-level parallelism. 11, 18, 20, 21

LLVM compiler infrastructure; former acronym: Low Level Virtual Machine. 3, 113, 115, 131, 139, 149, 150

LUT lookup table. 21, 22, 24–26, 52, 91, 105

MAC multiply-accumulate. 22

MD5 Message-Digest Algorithm 5. 142, 146

MIMD multiple instruction, multiple data. 12, 49

MM-RPU Multimodal Reconfigurable Processing Unit. iii, 131

MMU memory management unit. 134

NRE non-recurring engineering. 7, 35, 39, 41, 42, 45, 60

NUMA non-uniform memory access. 51

OpenCL Open Compute Language. 2, 47, 49, 50, 53–57, 62, 63, 112, 128, 151, 153


OS operating system. 38, 51

OTF On-The-Fly. iii, 5, 43–45, 56, 111, 113, 115, 127, 128, 152, 153

PC personal computer. 35, 41, 42, 52, 66, 127

PCIe peripheral component interconnect express. 23, 109, 135

PE processing element. 60, 61

PLD programmable logic device. 21

RAM random access memory. 21, 22

RFU reconfigurable functional unit. 60

RISC reduced instruction set computer. xiii, 9, 14–16, 33, 60

RTL register-transfer level. 23

SaaS software as a service. 43

SDN software-defined networking. 39

SHA Secure Hash Algorithm. 142, 146

SIMD single instruction, multiple data. 11, 12, 27, 29–31, 38, 43, 44, 49–51, 58, 59, 93, 125

SIMT single instruction, multiple thread. 12, 49, 50, 58

SMP symmetric multiprocessor. 12

SMT simultaneous multithreading. 13

SoC System-on-Chip. 30, 32–34, 38, 42, 132, 135

SOR Successive Over-Relaxation. 142

TLP thread-level parallelism. 12, 18, 29, 48, 58

USB Universal Serial Bus. 23

VHDL Very high speed integrated circuit Hardware Description Language. 23, 52

VLIW very long instruction word. 10, 28, 30, 48, 58

WTA winner takes all. 67, 72

158 Author’s publications

[1] Pablo Barrio, Carlos Carreras, Roberto Sierra, Tobias Kenter, and Christian Plessl. Turning control flow graphs into function calls: Code generation for heterogeneous architectures. In Proc. Int. Conf. on High Performance Computing and Simulation (HPCS), pages 559 –565, July 2012.

[2] Tobias Kenter, Marco Platzner, Christian Plessl, and Michael Kauschke. Performance estimation for the exploration of CPU-accelerator architectures. In Omar Hammami and Sandra Larrabee, editors, Proc. Workshop on Architectural Research Prototyping (WARP), June 2010.

[3] Tobias Kenter, Christian Plessl, Marco Platzner, and Michael Kauschke. Performance estimation framework for automated exploration of CPU-accelerator architectures. In Proc. Int. Symp. on Field-Programmable Gate Arrays (FPGA), pages 177–180. ACM, 2011.

[4] Tobias Kenter, Henning Schmitz, and Christian Plessl. Pragma based parallelization – trading hardware efficiency for ease of use? In Proc. Int. Conf. on ReConFigurable Computing and FPGAs (ReConFig), pages 1–6. IEEE Computer Society, December 2012.

[5] Tobias Kenter, Henning Schmitz, and Christian Plessl. Kernel-centric acceleration of high accuracy stereo-matching. In Proc. Int. Conf. on ReConFigurable Computing and FPGAs (ReConFig), pages 1–8. IEEE Computer Society, December 2014.

[6] Tobias Kenter, Henning Schmitz, and Christian Plessl. Exploring trade-offs between specialized dataflow kernels and a reusable overlay in a stereo matching case study. Int. Journal of Reconfigurable Computing (IJRC), page 24, 2015.

[7] Tobias Kenter, Gavin Vaz, and Christian Plessl. Partitioning and vectorizing binary applications for a reconfigurable vector computer. In Proc. Int. Symp. on Reconfigurable Computing: Architectures, Tools, and Applications (ARC). Springer, April 2014.

159 Author’s publications

[8] Achim Lösch, Tobias Beisel, Tobias Kenter, Christian Plessl, and Marco Platzner. Performance-centric scheduling with task migration for a heterogeneous compute node in the data center. In Proc. Design, Automation and Test in Europe Conf. (DATE), 2016.

[9] Heinrich Riebler, Tobias Kenter, Christian Plessl, and Christoph Sorge. Reconstructing AES key schedules from decayed memory with FPGAs. In Proc. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), pages 222–229. IEEE Computer Society, April 2014.

[10] Heinrich Riebler, Tobias Kenter, Christoph Sorge, and Christian Plessl. FPGA- accelerated key search for cold-boot attacks against AES. In Proc. Int. Conf. on Field Programmable Technology (ICFPT). IEEE Computer Society, December 2013.

[11] Gavin Vaz, Heinrich Riebler, Tobias Kenter, and Christian Plessl. Deferring accelerator offloading decisions to application runtime. In Proc. Int. Conf. on ReConFigurable Computing and FPGAs (ReConFig). IEEE Computer Society, December 2014. Received Best Paper Award.

[12] Gavin Vaz, Heinrich Riebler, Tobias Kenter, and Christian Plessl. Potential and methods for embedding dynamic offloading decisions into application code. Computers & Electrical Engineering, 2016.

Bibliography

[13] Kanak Agarwal, Kevin Nowka, Harmander Deogun, and Dennis Sylvester. Power gating with multiple sleep modes. In Proc. Int. Symp. on Quality Electronic Design (ISQED), pages 633–637. IEEE Computer Society, 2006.

[14] E. Ahmed and J. Rose. The effect of LUT and cluster size on deep-submicron FPGA performance and density. VLSI Design, (3):288–298, March 2004.

[15] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, second edition, 2006.

[16] John R. Allen and Ken Kennedy. Automatic loop interchange. In Proc. ACM SIGPLAN Symp. on Compiler Construction, pages 233–246, 1984.

[17] Altera. Altera Inc., form 10-k annual report 2012, February 2012.

[18] Altera. Altera Inc., form 10-k annual report 2015, February 2015.

[19] Altera. Product brochure, Altera's user customizable ARM-based SoC. https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/br/br-soc-fpga.pdf, September 2015.

[20] Altera. Stratix V device handbook, volume 1: Device interfaces and integra- tion. https://www.altera.com/en_US/pdfs/literature/hb/stratix-v/stx5_core.pdf, January 2015.

[21] Hideharu Amano. A survey on dynamically reconfigurable processors. IEICE Transactions on Communications, (12):3179–3187, December 2006.

[22] Kapil Anand, Matthew Smithson, Khaled Elwazeer, Aparna Kotha, Jim Gruen, Nathan Giles, and Rajeev Barua. A compiler-level intermediate representation based binary analysis and rewriting system. In Proc. ACM European Conference on Computer Systems (EuroSys), pages 295–308, 2013.

[23] Sebastian Anthony. Beyond silicon: IBM unveils world’s first 7nm chip. http://arstechnica.co.uk/gadgets/2015/07/ -unveils-industrys-first-7nm-chip-moving-beyond-silicon/, July 2015.


[24] ARM Holdings plc. Annual report 2015: Strategic re- port. http://phx.corporate-ir.net/External.File?item= UGFyZW50SUQ9MzIzNjAzfENoaWxkSUQ9LTF8VHlwZT0z&t=1&cb= 635913243929679950, March 2016.

[25] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical report, University of California at Berkeley, 2006.

[26] Kubilay Atasu, Laura Pozzi, and Paolo Ienne. Automatic application-specific instruction-set extensions under microarchitectural constraints. Int. Journal of Parallel Programming (IJPP), (6):411–428, 2003.

[27] Ali Azarian and João M.P. Cardoso. Pipelining data-dependent tasks in FPGA-based multicore architectures. Microprocessors and Microsystems Journal, pages 165–179, 2016.

[28] Jason D. Bakos. High-performance heterogeneous computing with the Convey HC-1. Computing in Science and Engineering, (6):80–87, November 2010.

[29] L. Barthe, L. V. Cargnini, P. Benoit, and L. Torres. The secretblaze: A configurable and cost-effective open-source soft-core processor. In Proc. Int. Symp. on Parallel and Distributed Processing Workshops (IPDPSW), pages 310–313, May 2011.

[30] V. Baumgarte, G. Ehlers, F. May, A. Nückel, M. Vorbach, and M. Weinhardt. PACT XPP – a self-reconfigurable data processing architecture. Journal of Supercomputing, (2):167–184, September 2003.

[31] Antonio Carlos S. Beck, Mateus B. Rutzig, Georgi Gaydadjiev, and Luigi Carro. Transparent reconfigurable acceleration for heterogeneous embedded applications. In Proc. Design, Automation and Test in Europe Conf. (DATE), pages 1208–1213. European Design and Automation Association, 2008.

[32] Tobias Beisel, Tobias Wiersema, Christian Plessl, and André Brinkmann. Programming and scheduling model for supporting heterogeneous accelerators in Linux. In Proc. Workshop on Computer Architecture and Operating System Co-design (CAOS), January 2012.

[33] Vaughn Betz and Jonathan Rose. FPGA routing architecture: Segmentation and buffering to optimize speed and density. In Proc. Int. Symp. on Field-Programmable Gate Arrays (FPGA), pages 59–68, 1999.

[34] J. Bispo, J.M.P. Cardoso, and J. Monteiro. Hardware pipelining of runtime-detected loops. In Int. Symp. on Integrated Circuits and Systems Design (SBCCI), pages 1–6, 2012.


[35] João Bispo, Nuno Paulino, João M. P. Cardoso, and João Canas Ferreira. Transparent runtime migration of loop-based traces of processor instructions to reconfigurable processing units. Int. Journal of Reconfigurable Computing (IJRC), 2013.

[36] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), pages 101– 113. ACM, 2008.

[37] Bruno Bougard, Bjorn De Sutter, Sebastien Rabou, David Novo, Osman Allam, Steven Dupont, and Liesbet Van der Perre. A coarse-grained array based baseband processor for 100Mbps+ software defined radio. In Proc. Design, Automation and Test in Europe Conf. (DATE), pages 716–721. ACM, 2008.

[38] Mathias Brandt. PC-Markt fällt auf 8-Jahres-Tief [PC market falls to an eight-year low]. https://de.statista.com/infografik/4238/weltweiter-pc-absatz/, January 2016.

[39] Alexander Brant and Guy G. F. Lemieux. ZUMA: An open FPGA overlay architecture. In Proc. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), pages 93–96. IEEE Computer Society, 2012.

[40] T.M. Brewer. Instruction set innovations for the Convey HC-1 computer. IEEE Micro, (2):70 –79, March-April 2010.

[41] M. Broy, I.H. Kruger, A. Pretschner, and C. Salzmann. Engineering automotive software. Proceedings of the IEEE, (2):356–373, February 2007.

[42] F. Bruguier, P. Benoit, L. Torres, L. Barthe, M. Bourree, and V. Lomne. Cost-effective design strategies for securing embedded processors. IEEE Transactions on Emerging Topics in Computing (TETC), (1):60–72, Jan 2016.

[43] T. D. Burd, T. A. Pering, A. J. Stratakos, and R. W. Brodersen. A dynamic voltage scaled microprocessor system. IEEE Journal of Solid-State Circuits, (11):1571–1580, Nov 2000.

[44] Irina Calciu, Justin Gottschlich, Tatiana Shpeisman, Gilles Pokam, and Maurice Herlihy. Invyswell: A hybrid transactional memory for Haswell’s restricted transactional memory. In Proc. Int. Conf. on Parallel Architecture and Compilation Techniques (PACT), pages 187–200. ACM, 2014.

[45] Timothy J. Callahan, John R. Hauser, and John Wawrzynek. The Garp architecture and C compiler. IEEE Computer, (4):62–69, April 2000.

[46] D. Capalija and T.S. Abdelrahman. A high-performance overlay architecture for pipelined execution of data flow graphs. In Proc. Int. Conf. on Field Programmable Logic and Applications (FPL), pages 1–8. IEEE Computer Society, September 2013.


[47] S. Di Carlo, A. Miele, P. Prinetto, and A. Trapanese. Microprocessor fault-tolerance via on-the-fly partial reconfiguration. In Proc. IEEE European Test Symposium (ETS), pages 201–206, May 2010.

[48] J. E. Carrillo Esparza and P. Chow. The effect of reconfigurable units in superscalar processors. In Proc. 9th ACM Int. Symp. on Field-Programmable Gate Arrays (FPGA), pages 141–150. IEEE Computer Society, 2001.

[49] Eylon Caspi, Michael Chu, Randy Huang, Joseph Yeh, John Wawrzynek, and André DeHon. Stream computations organized for reconfigurable execution (SCORE). In Reiner W. Hartenstein and Herbert Grünbacher, editors, Proc. Int. Conf. on Field Programmable Logic and Applications (FPL), pages 605–614. Springer Berlin Heidelberg, 2000.

[50] J. Castillo, P. Huerta, V. Lopez, and J. I. Martinez. A secure self-reconfiguring architecture based on open-source hardware. In Proc. Int. Conf. on ReConFigurable Computing and FPGAs (ReConFig), pages 7–10, 2005.

[51] Ken Chapman. White paper: Xilinx FPGA families, multiplexer selection. http: //www.xilinx.com/support/documentation/white_papers/wp274.pdf, February 2008.

[52] J. Charles, P. Jassi, N. S. Ananth, A. Sadat, and A. Fedorova. Evaluation of the Intel Core i7 Turbo Boost feature. In Int. Symp. on Workload Characterization (IISWC), pages 188–197, October 2009.

[53] D. Chen and D. Singh. Invited paper: Using OpenCL to evaluate the efficiency of CPUs, GPUs and FPGAs for information filtering. In Proc. Int. Conf. on Field Programmable Logic and Applications (FPL), pages 5–12, Aug 2012.

[54] David Chinnery and Kurt Keutzer. Closing the Power Gap Between ASIC & Custom: Tools and Techniques for Low Power Design. Springer, 2005.

[55] David Chinnery and Kurt Keutzer. Closing the power gap between ASIC and custom: An ASIC perspective. In Proc. Design Automation Conference (DAC), pages 275– 280, 2005.

[56] Christopher H. Chou, Aaron Severance, Alex D. Brant, Zhiduo Liu, Saurabh Sant, and Guy G.F. Lemieux. VEGAS: soft vector processor with scratchpad memory. In Proc. Int. Symp. on Field-Programmable Gate Arrays (FPGA), pages 15–24. ACM, 2011.

[57] W. W. S. Chu, R. G. Dimond, S. Perrott, S. P. Seng, and W. Luk. Customisable EPIC processor: Architecture and tools. In Proc. Design, Automation and Test in Europe Conf. (DATE), pages 1–6. IEEE Computer Society, 2004.


[58] E.S. Chung, P.A. Milder, J.C. Hoe, and Ken Mai. Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPGPUs? In Proc. Int. Symp. on Microarchitecture (MICRO), pages 225–236, 2010.

[59] Jason Cong, Yiping Fan, Guoling Han, and Zhiru Zhang. Application-specific instruction generation for configurable processor architectures. In Proc. Int. Symp. on Field-Programmable Gate Arrays (FPGA), pages 183–189. ACM, 2004.

[60] Patrick Cooke, Lu Hao, and Greg Stitt. Finite-state-machine overlay architectures for fast FPGA compilation and application portability. ACM Transactions on Embedded Computing Systems (TECS), (3):54:1–54:25, April 2015.

[61] J. Coole and G. Stitt. Fast, flexible high-level synthesis from OpenCL using reconfiguration contexts. IEEE Micro, (1):42–53, January 2014.

[62] J. Coole and G. Stitt. Adjustable-cost overlays for runtime compilation. In Proc. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), pages 21–24. IEEE Computer Society, May 2015.

[63] James Coole and Greg Stitt. Intermediate fabrics: virtual architectures for circuit portability and fast placement and routing. In Proc. Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), pages 13–22. ACM, 2010.

[64] Ian Cutress. Examining soft machines’ architecture: An element of VISC to improving IPC. http://www.anandtech.com/show/10025/ examining-soft-machines-architecture-visc-ipc/2, February 2016.

[65] T.S. Czajkowski, U. Aydonat, D. Denisenko, J. Freeman, M. Kinsner, D. Neto, J. Wong, P. Yiannacouras, and D.P. Singh. From OpenCL to high-performance hardware on FPGAs. In Proc. Int. Conf. on Field Programmable Logic and Applications (FPL), pages 531–534. IEEE Computer Society, August 2012.

[66] William J. Dally, James Balfour, David Black-Shaffer, James Chen, R. Curtis Harting, Vishal Parikh, Jongsoo Park, and David Sheffield. Efficient embedded computing. IEEE Computer, (7):27–32, July 2008.

[67] F. de Dinechin, H. D. Nguyen, and B. Pasca. Pipelined FPGA adders. In Proc. Int. Conf. on Field Programmable Logic and Applications (FPL), pages 422–427, Aug 2010.

[68] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, (1):107–113, January 2008.

[69] André DeHon. Reconfigurable architectures for general-purpose computing. Technical report, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, 1996.


[70] André DeHon. The density advantage of configurable computing. IEEE Computer, (4):41–49, April 2000.

[71] Robert H. Dennard, Fritz H. Gaensslen, Hwa-Nien Yu, Leo Rideout, Ernest Bassous, and Andre R. Leblanc. Design of ion-implanted MOSFET’s with very small physical dimensions. IEEE Solid-State Circuits Society Newsletter, (1):38–50, 2007. Reprinted from the IEEE Journal of Solid-State Circuits, Vol. SC-9, October 1974, pp. 256–268.

[72] Sheng Di, Derrick Kondo, and Walfredo Cirne. Characterization and comparison of cloud versus grid workloads. In Proc. IEEE Int. Conf. on Cluster Computing (CLUSTER), pages 230–238. IEEE Computer Society, 2012.

[73] P. Dlugosch, D. Brown, P. Glendenning, M. Leventhal, and H. Noyes. An efficient and scalable semiconductor architecture for parallel automata processing. IEEE Transactions on Parallel and Distributed Systems (TPDS), (12):3088–3098, Dec 2014.

[74] H. C. Doan, H. Javaid, and S. Parameswaran. Flexible and scalable implementation of H.264/AVC encoder for multiple resolutions using ASIPs. In Proc. Design, Automation and Test in Europe Conf. (DATE), pages 1–6, March 2014.

[75] R.G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge. Near-threshold computing: Reclaiming Moore’s law through energy efficient integrated circuits. Proceedings of the IEEE, (2):253–266, February 2010.

[76] Steven A. Edwards. The challenges of synthesizing hardware from C-like languages. IEEE Design & Test of Computers, (5):375–386, May 2006.

[77] ESA. LEON: the space chip that Europe built. http://www.spacedaily.com/reports/LEON_the_space_chip_that_Europe_built_999.html, January 2013.

[78] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark silicon and the end of multicore scaling. In Proc. Int. Symp. on Computer Architecture (ISCA), pages 365–376. ACM, 2011.

[79] Jonathon Evans, Kyle Rupnow, and Katherine Compton. Reconfigurable functional units for scientific superscalar processors. In Proc. Int. Conf. on Field Programmable Technology (ICFPT), pages 73–80. IEEE Computer Society, December 2007.

[80] Michael Feldman. OpenCL gains ground on CUDA. http://www.hpcwire.com/ 2012/02/28/opencl_gains_ground_on_cuda/, February 2012.

[81] R. Ferreira, J.G. Vendramini, L. Mucida, M.M. Pereira, and L. Carro. An FPGA-based heterogeneous coarse-grained dynamically reconfigurable architecture. In Proc. Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pages 195–204, October 2011.

[82] Michael J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers (TC), (9):948–960, September 1972.


[83] D.J. Frank, R.H. Dennard, E. Nowak, P.M. Solomon, Yuan Taur, and Hon-Sum Philip Wong. Device scaling limits of Si MOSFETs and their application dependencies. Proceedings of the IEEE, (3):259–288, Mar 2001.

[84] Haohuan Fu, William Osborne, Robert G. Clapp, Oskar Mencer, and Wayne Luk. Accelerating seismic computations using customized number representations on FPGAs. EURASIP Journal on Embedded Systems, pages 3:1–3:13, January 2009.

[85] Andrea Fusiello, Vito Roberto, and Emanuele Trucco. Efficient stereo with multiple windowing. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 858–863. IEEE Computer Society, 1997.

[86] Daniel D. Gajski, Nikil D. Dutt, Allen C-H Wu, and Steve Y-L Lin. High-Level Synthesis: Introduction to Chip and System Design. Springer Science & Business Media, December 1992.

[87] Philip Garcia and Katherine Compton. A reconfigurable hardware interface for a modern computing system. In Proc. 15th IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), pages 73–84. IEEE Computer Society, April 2007.

[88] Philip Garcia and Katherine Compton. Shared memory cache organizations for reconfigurable computing systems. In Proc. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), pages 239–242. IEEE Computer Society, 2009.

[89] Johan De Gelas. The Intel Xeon E5 v4 review: Testing Broadwell-EP with demanding server workloads. http://anandtech.com/show/10158/the-intel-xeon-e5-v4-review, March 2016.

[90] Heiner Giefers, Christian Plessl, and Jens Förstner. Accelerating finite difference time domain simulations with reconfigurable dataflow computers. SIGARCH Computer Architecture News, (5):65–70, June 2014.

[91] Ivan Godard. The mill: Split-stream encoding. SIGARCH Computer Architecture News, (5):1–5, June 2014.

[92] M. Golden, S. Hesley, A. Scherer, M. Crowley, S. C. Johnson, S. Meier, D. Meyer, J. D. Moench, S. Oberman, H. Partovi, F. Weber, S. White, T. Wood, and J. Yong. A seventh-generation x86 microprocessor. IEEE Journal of Solid-State Circuits, (11):1466–1477, November 1999.

[93] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. R. Taylor. PipeRench: a reconfigurable architecture and compiler. IEEE Computer, (4):70–77, Apr 2000.

[94] Khosrow Golshan. Physical Design Essentials: An ASIC Design Implementation Perspective. Springer-Verlag New York, Inc., 2007.


[95] T. Good and M. Benaissa. Very small FPGA application-specific instruction processor for AES. IEEE Transactions on Circuits and Systems I: Regular Papers, (7):1477–1486, July 2006.

[96] David Goodwin and Darin Petkov. Automatic generation of application specific processors. In Proc. Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pages 137–147. ACM, 2003.

[97] Michael I. Gordon, William Thies, and Saman Amarasinghe. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In Proc. Int. Conf. on Architectural support for programming languages and operating systems (ASPLOS), pages 151–162. ACM, 2006.

[98] V. Govindaraju, C. H. Ho, T. Nowatzki, J. Chhugani, N. Satish, K. Sankaralingam, and C. Kim. DySER: unifying functionality and parallelism specialization for energy-efficient computing. IEEE Micro, (5):38–51, Sept 2012.

[99] Mariusz Grad and Christian Plessl. Woolcano: An architecture and tool flow for dynamic instruction set extension on Xilinx Virtex-4 FX. In Proc. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM). IEEE Computer Society, 2009.

[100] Mariusz Grad and Christian Plessl. On the feasibility and limitations of just-in-time instruction set extension for FPGA-based reconfigurable processors. Int. Journal of Reconfigurable Computing (IJRC), 2012.

[101] David Grant, Chris Wang, and Guy G.F. Lemieux. A CAD framework for Malibu: An FPGA with time-multiplexed coarse-grained elements. In Proc. Int. Symp. on Field-Programmable Gate Arrays (FPGA), pages 123–132. ACM, 2011.

[102] Tobias Grosser, Hongbin Zheng, Raghesh Aloor, Andreas Simbürger, Armin Größlinger, and Louis-Noël Pouchet. Polly-polyhedral optimization in LLVM. In Proc. Int. Workshop on Polyhedral Compilation Techniques (IMPACT), 2011.

[103] Shay Gueron. Advanced encryption standard (AES) instructions set. http: //softwarecommunity.intel.com/articles/eng/3788.htm, 2008.

[104] Tim Güneysu and Christof Paar. Ultra high performance ECC over NIST primes on commercial FPGAs. In Proc. Int. Conf. on Cryptographic Hardware and Embedded Systems (CHES), pages 62–78. Springer, 2008.

[105] Vaibhav Gupta, Debabrata Mohapatra, Sang Phill Park, Anand Raghunathan, and Kaushik Roy. IMPACT: Imprecise adders for low-power approximate computing. In Proc. Int. Symp. on Low Power Electronics and Design (ISLPED), pages 409–414. IEEE, 2011.


[106] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A free, commercially representative embedded benchmark suite. In Int. Symp. on Workload Characterization (IISWC), pages 3–14, Dec 2001.

[107] Theo Haerder and Andreas Reuter. Principles of transaction-oriented database recovery. ACM Computing Surveys, (4):287–317, December 1983.

[108] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. In Proc. Int. Symp. on Computer Architecture (ISCA), pages 37–47. ACM, 2010.

[109] J. Han and M. Orshansky. Approximate computing: An emerging paradigm for energy-efficient design. In Proc. IEEE European Test Symposium (ETS), pages 1–6, May 2013.

[110] Markus Happe, Friedhelm Meyer auf der Heide, Peter Kling, Marco Platzner, and Christian Plessl. On-the-fly computing: A novel paradigm for individualized IT services. In Proc. Workshop on Software Technologies for Future Embedded and Ubiquitous Systems (SEUS). IEEE Computer Society, June 2013.

[111] David Harris and Sarah Harris. Digital Design and Computer Architecture. Morgan Kaufmann Publishers, 2nd edition, 2012.

[112] R. Hartenstein. A decade of reconfigurable computing: A visionary retrospective. In Proc. Design, Automation and Test in Europe Conf. (DATE), pages 642–649. IEEE, 2001.

[113] Reiner Hartenstein. The digital divide of computing. In Proc. Int. Conf. on Computing Frontiers, pages 357–362, 2004.

[114] S. Hauck, T. W. Fry, M. M. Hosler, and J. P. Kao. The Chimaera reconfigurable functional unit. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, (2):206–217, Feb 2004.

[115] Nicole Hemsoth. Genomics marks the next sequence for FPGAs. http://www.nextplatform.com/2015/10/29/the-next-sequence-for-fpgas-is-in-genomics/, October 2015.

[116] Nicole Hemsoth. Inside the GPU clusters that power Baidu’s neural networks. http://www.nextplatform.com/2015/12/11/inside-the-gpu-clusters-that-power-baidus-neural-networks/, December 2015.

[117] Jörg Henkel and Rolf Ernst. An approach to automated hardware/software partitioning using a flexible granularity that is driven by high-level estimation techniques. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, (2):273–290, 2001.


[118] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, Fifth Edition. Morgan Kaufmann Publishers Inc., 5th edition, 2011.

[119] John L. Henning. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, (4):1–17, September 2006.

[120] S. Herbert and D. Marculescu. Analysis of dynamic voltage/frequency scaling in chip-multiprocessors. In Proc. Int. Symp. on Low Power Electronics and Design (ISLPED), pages 38–43, August 2007.

[121] M.C. Herbordt, T. Van Court, Yongfeng Gu, B. Sukhwani, A. Conti, J. Model, and D. DiSabello. Achieving high performance with FPGA-based computing. IEEE Computer, (3):50–57, March 2007.

[122] A.J.G. Hey. Experience with MIMD message-passing systems: Towards general purpose parallel computing. In Pierre America, editor, Parallel Database Systems, pages 99–111. Springer, 1991.

[123] H. Hirschmüller and D. Scharstein. Evaluation of cost functions for stereo matching. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE Computer Society, June 2007.

[124] Heiko Hirschmüller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), (2):328–341, February 2008.

[125] Heiko Hirschmüller and Daniel Scharstein. Evaluation of stereo matching costs on images with radiometric differences. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), (9):1582–1599, September 2009.

[126] Brian Holland, Karthik Nagarajan, and Alan D. George. RAT: RC amenability test for rapid performance prediction. ACM Transactions on Reconfigurable Technology and Systems (TRETS), (4):1–31, 2009.

[127] M. Horowitz. Computing’s energy problem (and what we can do about it). In IEEE Int. Solid-State Circuits Conf. Digest of Tech. Papers (ISSCC), pages 10–14, February 2014.

[128] HSA Foundation. HSA programmer’s reference manual: HSAIL virtual ISA and programming model, compiler writer, and object format (BRIG). http://www.hsafoundation.com/?ddownload=4945, July 2015.

[129] Paul Hsieh. 7th generation CPU comparisons. http://www.azillionmonkeys.com/qed/cpujihad.shtml, August 1999.

[130] Zhigang Hu, Alper Buyuktosunoglu, Viji Srinivasan, Victor Zyuban, Hans Jacobson, and Pradip Bose. Microarchitectural techniques for power gating of execution units. In Proc. Int. Symp. on Low Power Electronics and Design (ISLPED), pages 32–37, 2004.


[131] Xuejue Huang, Wen-Chin Lee, C. Kuo, D. Hisamoto, Leland Chang, J. Kedzierski, E. Anderson, H. Takeuchi, Yang-Kyu Choi, K. Asano, V. Subramanian, Tsu-Jae King, J. Bokor, and Chenming Hu. Sub-50 nm p-channel FinFET. IEEE Transactions on Electron Devices, (5):880–886, May 2001.

[132] Richard Hughey. Parallel hardware for sequence comparison and alignment. Bioinformatics, (6):473–479, September 1996.

[133] Eddie Hung and Steven J.E. Wilton. Towards simulator-like observability for FPGAs: A virtual overlay network for trace-buffers. In Proc. Int. Symp. on Field-Programmable Gate Arrays (FPGA), pages 19–28. ACM, 2013.

[134] IDC. Prognose zum Absatz von Tablets, PCs und Smartphones weltweit von 2010 bis 2020 (in Millionen Stück) [Forecast of worldwide tablet, PC, and smartphone unit sales from 2010 to 2020, in millions]. http://de.statista.com/statistik/daten/studie/256337/umfrage/prognose-zum-weltweiten-absatz-von-tablets-pcs-und-smartphones/, 2016.

[135] Amer Ihab. Introducing the video coding engine (VCE). http://developer.amd. com/community/blog/2014/02/19/introducing-video-coding-engine-vce/, February 2014.

[136] Intel. Intel 64 and IA-32 architectures optimization reference manual. http://www.intel.com/content/dam/www/public/us/en/documents/manuals/ 64-ia-32-architectures-optimization-manual.pdf, January 2016.

[137] International Roadmap Committee (IRC). The international technology roadmap for semiconductors (ITRS), 2011 edition. http://www.semiconductors.org/main/ 2011_international_technology_roadmap_for_semiconductors_itrs/, 2011.

[138] International Roadmap Committee (IRC). The international technology roadmap for semiconductors (ITRS), 2013 edition. http://www.semiconductors.org/main/ 2013_international_technology_roadmap_for_semiconductors_itrs/, 2013.

[139] International Roadmap Committee (IRC). The international technology roadmap for semiconductors (ITRS) 2.0, 2015 edition. http://www.semiconductors.org/main/ 2015_international_technology_roadmap_for_semiconductors_itrs/, 2016.

[140] C. Iseli and E. Sanchez. Spyder: a reconfigurable VLIW processor using FPGAs. In Proc. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), pages 17–24, Apr 1993.

[141] Arpith Jacob, Joseph Lancaster, Jeremy Buhler, Brandon Harris, and Roger D. Chamberlain. Mercury BLASTP: Accelerating protein sequence alignment. ACM Transactions on Reconfigurable Technology and Systems (TRETS), (2):9:1–9:44, June 2008.


[142] A.K. Jain, S.A. Fahmy, and D.L. Maskell. Efficient overlay architecture based on DSP blocks. In Proc. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), pages 25–28. IEEE Computer Society, May 2015.

[143] M. K. Jain, M. Balakrishnan, and A. Kumar. ASIP design methodologies: survey and issues. In Proc. Int. Conf. on VLSI Design, pages 76–81, 2001.

[144] Minxi Jin and T. Maruyama. A fast and high quality stereo matching algorithm on FPGA. In Proc. Int. Conf. on Field Programmable Logic and Applications (FPL), pages 507–510. IEEE Computer Society, August 2012.

[145] Minxi Jin and Tsutomu Maruyama. Fast and accurate stereo vision system on FPGA. ACM Transactions on Reconfigurable Technology and Systems (TRETS), (1):3:1– 3:24, February 2014.

[146] Richard Johnson, David Pearson, and Keshav Pingali. The program structure tree: computing control regions in linear time. In Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), pages 171–185, 1994.

[147] Chetana N Keltcher, Kevin J McGrath, Ardsher Ahmed, and Pat Conway. The AMD Opteron processor for multiprocessor servers. IEEE Micro, (2):66–76, 2003.

[148] K. Kępa, R. Soni, and P. Athanas. Inferring custom architectures from OpenCL. In Proc. Int. Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), pages 9–16, Sept 2015.

[149] S. Kestur, J. D. Davis, and O. Williams. BLAS comparison on FPGA, CPU and GPU. In IEEE Computer Society Annual Symp. on VLSI (ISVLSI), pages 288–293, July 2010.

[150] Seonggun Kim and Hwansoo Han. Efficient SIMD code generation for irregular kernels. In Proc. ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pages 55–64. ACM, 2012.

[151] Jeffrey Kingyens and J. Gregory Steffan. The potential for a GPU-like overlay architecture for FPGAs. Int. Journal of Reconfigurable Computing (IJRC), 2011.

[152] Laszlo B Kish. End of Moore’s law: thermal (noise) death of integration in micro and nano electronics. Physics Letters A, (3–4):144 – 149, 2002.

[153] Dmitrij Kissler, Frank Hannig, Alexey Kupriyanov, and Jürgen Teich. A dynamically reconfigurable weakly programmable processor array architecture template. In Proc. Int. Workshop on Reconfigurable Communication Centric System-on-Chips (ReCoSoC), pages 31–37, 2006.

[154] Dirk Koch, Christian Beckhoff, and Guy G. F. Lemieux. An efficient FPGA overlay for portable custom instruction set extensions. In Proc. Int. Conf. on Field Programmable Logic and Applications (FPL), pages 1–8. IEEE Computer Society, 2013.


[155] Dirk Koch, Christian Beckhoff, and Jim Torresen. Zero logic overhead integration of partially reconfigurable modules. In Int. Symp. on Integrated Circuits and Systems Design (SBCCI), pages 103–108. ACM, 2010.

[156] Dimitri Komatitsch, Gordon Erlebacher, Dominik Göddeke, and David Michéa. High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster. Journal of Computational Physics, (20):7692–7714, 2010.

[157] P. Kristof, H. Yu, Z. Li, and X. Tian. Performance study of SIMD programming models on Intel multicore processors. In Proc. Int. Symp. on Parallel and Distributed Processing Workshops (IPDPSW), pages 2423–2432, May 2012.

[158] Konstantinos Krommydas, Wu-chun Feng, Christos D. Antonopoulos, and Nikolaos Bellas. OpenDwarfs: Characterization of dwarf-based benchmarks on fixed and reconfigurable architectures. Journal of Signal Processing Systems, pages 1–20, 2015.

[159] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction. In Proc. Int. Symp. on Microarchitecture (MICRO), pages 81–92, December 2003.

[160] Sandeep Kumar, Christof Paar, Jan Pelzl, Gerd Pfeiffer, and Manfred Schimmler. Breaking ciphers with COPACOBANA – a cost-optimized parallel code breaker. In Proc. Int. Conf. on Cryptographic Hardware and Embedded Systems (CHES), pages 101–118. Springer, 2006.

[161] Ian Kuon and Jonathan Rose. Measuring the gap between FPGAs and ASICs. In Proc. Int. Symp. on Field-Programmable Gate Arrays (FPGA), pages 21–30. ACM, 2006.

[162] Petteri Laakso. Importance of programmability. https://webcache. googleusercontent.com/search?q=cache:WCB_luCXgGoJ:https://www. vectorfabrics.com/de/blog/item/importance_of_programmability+&cd=1& hl=en&ct=clnk&gl=de, June 2014.

[163] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proc. Int. Symp. on Code Generation and Optimization (CGO), pages 75–86. IEEE Computer Society, March 2004.

[164] I. Lebedev, Shaoyi Cheng, A. Doupnik, J. Martin, C. Fletcher, D. Burke, Mingjie Lin, and J. Wawrzynek. MARC: a many-core approach to reconfigurable computing. In Proc. Int. Conf. on ReConFigurable Computing and FPGAs (ReConFig), pages 7–12, December 2010.

[165] Chunho Lee, Miodrag Potkonjak, and William H. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In Proc. Int. Symp. on Microarchitecture (MICRO), pages 330–335. IEEE Computer Society, 1997.


[166] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. Debunking the 100x GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU. In Proc. Int. Symp. on Computer Architecture (ISCA), pages 451–460, 2010.

[167] Colin Yu Lin and Hayden Kwok-Hay So. Energy-efficient dataflow computations on FPGAs using application-specific coarse-grain architecture synthesis. ACM SIGARCH Computer Architecture News, (5):58–63, March 2012.

[168] Jason Luu, Jeff Goeders, Michael Wainberg, Andrew Somerville, Thien Yu, Konstantin Nasartschuk, Miad Nasr, Sen Wang, Tim Liu, Norrudin Ahmed, Kenneth B. Kent, Jason Anderson, Jonathan Rose, and Vaughn Betz. VTR 7.0: Next Generation Architecture and CAD System for FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), (2):6:1–6:30, June 2014.

[169] Roman Lysecky, Kris Miller, Frank Vahid, and Kees Vissers. Firm-core virtual FPGA for just-in-time FPGA compilation. In Proc. Int. Symp. on Field-Programmable Gate Arrays (FPGA), 2005.

[170] S. Ma, Z. Aklah, and D. Andrews. Run time interpretation for creating custom accelerators. In Proc. Design, Automation and Test in Europe Conf. (DATE), pages 900–905, March 2016.

[171] C. Mack. The multiple lives of Moore’s law. IEEE Spectrum, (4):31–31, April 2015.

[172] Saeed Maleki, Yaoqing Gao, Maria J. Garzarán, Tommy Wong, and David A. Padua. An evaluation of vectorizing compilers. In Proc. Int. Conf. on Parallel Architecture and Compilation Techniques (PACT), pages 372–382. IEEE Computer Society, 2011.

[173] Milo M. K. Martin, Mark D. Hill, and Daniel J. Sorin. Why on-chip cache coherence is here to stay. Communications of the ACM, (7):78–89, July 2012.

[174] S. Mattoccia. Fast locally consistent dense stereo on multicore. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 69–76. IEEE Computer Society, June 2010.

[175] Maxeler Technologies. MPC-C series. https://www.maxeler.com/products/ mpc-cseries/.

[176] Maxeler Technologies. Programming MPC systems. Whitepaper, https://www. maxeler.com/media/documents/MaxelerWhitePaperProgramming.pdf, June 2013.

[177] Clive Maxfield. AutoESL acquisition a great move for Xilinx. http://www.eetimes. com/author.asp?section_id=36&doc_id=1284904, January 2011.

[178] Max Maxfield. With over 1 billion FPGAs sold, Lattice introduces MachXO3 family. http://www.eetimes.com/document.asp?doc_id=1319597, March 2013.


[179] Bingfeng Mei, F. J. Veredas, and B. Masschelein. Mapping an H.264/AVC decoder onto the ADRES reconfigurable architecture. In Proc. Int. Conf. on Field Programmable Logic and Applications (FPL), pages 622–625, Aug 2005.

[180] Bingfeng Mei, Serge Vernalde, Diederik Verkest, and Rudy Lauwereins. Design methodology for a tightly coupled VLIW/reconfigurable matrix architecture: A case study. In Proc. Design, Automation and Test in Europe Conf. (DATE), pages 1–6. IEEE Computer Society, 2004.

[181] Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, and Rudy Lauwereins. ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix. In Proc. Int. Conf. on Field Programmable Logic and Applications (FPL), pages 61–70. Springer, 2003.

[182] X. Mei, X. Sun, M. Zhou, S. Jiao, H. Wang, and X. Zhang. On building an accurate stereo matching system on graphics hardware. In Proc. ICCV Workshop on GPU in Computer Vision Applications (GPUCV). IEEE, 2011.

[183] Francisco J Mesa-Martinez et al. SCOORE Santa Cruz out-of-order RISC engine, FPGA design issues. In Proc. Workshop on Architectural Research Prototyping (WARP), pages 61–70, 2006.

[184] Björn Meyer, Jörn Schumacher, Christian Plessl, and Jens Förstner. Convey vector personalities – FPGA acceleration with an OpenMP-like programming effort? In Proc. Int. Conf. on Field Programmable Logic and Applications (FPL). IEEE Computer Society, August 2012.

[185] Friedhelm Meyer auf der Heide et al. On-The-Fly Computing, Individualisierte IT-Dienstleistungen in dynamischen Märkten [individualized IT services in dynamic markets], funding proposal (Finanzierungsantrag), December 2010.

[186] Friedhelm Meyer auf der Heide et al. On-The-Fly Computing, Individualisierte IT-Dienstleistungen in dynamischen Märkten [individualized IT services in dynamic markets], funding proposal (Finanzierungsantrag), December 2014.

[187] Asit K. Mishra, Joseph L. Hellerstein, Walfredo Cirne, and Chita R. Das. Towards characterizing cloud backend workloads: Insights from Google compute clusters. ACM SIGMETRICS Performance Evaluation Review, (4):34–41, March 2010.

[188] G.E. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, (1):82–85, Jan 1998. Reprinted from Gordon E. Moore, "Cramming More Components onto Integrated Circuits," Electronics, pp. 114–117, April 19, 1965.

[189] Valentin Mena Morales, Pierre-Henri Horrein, Amer Baghdadi, Erik Hochapfel, and Sandrine Vaton. Energy-efficient FPGA implementation for binomial option pricing using OpenCL. In Proc. Design, Automation and Test in Europe Conf. (DATE), pages 208:1–208:6. European Design and Automation Association, 2014.


[190] Timothy Prickett Morgan. Looking ahead in the datacenter with Intel. http://www.nextplatform.com/2015/12/10/looking-ahead-in-the-datacenter-with-intel/, December 2015.

[191] Timothy Prickett Morgan. Microsoft extends FPGA reach from Bing to deep learning. http://www.nextplatform.com/2015/08/27/microsoft-extends-fpga-reach-from-bing-to-deep-learning/, August 2015.

[192] Timothy Prickett Morgan. Traders bank on server switch hybrids for performance boost. http://www.nextplatform.com/2015/04/06/traders-bank-on-server-switch-hybrids-for-performance-boost/, April 2015.

[193] Timothy Prickett Morgan. The long future ahead for Intel Xeon processors. http://www.nextplatform.com/2016/05/05/long-future-ahead-intel-xeon-processors/, May 2016.

[194] Andreas Moshovos, Prithviraj Banerjee, Scott Hauck, and Zhi Alex Ye. CHIMAERA: A high-performance architecture with a tightly-coupled reconfigurable functional unit. In Proc. Int. Symp. on Computer Architecture (ISCA), pages 225–235. IEEE Computer Society, 2000.

[195] Ananya Muddukrishna, Peter A. Jonsson, Vladimir Vlassov, and Mats Brorsson. Locality-aware task scheduling and data distribution on NUMA systems. In Alistair P. Rendell, Barbara M. Chapman, and Matthias S. Müller, editors, Proc. Int. Workshop on OpenMP (IWOMP), pages 156–170. Springer Berlin Heidelberg, 2013.

[196] Aaftab Munshi. The OpenCL Specification. Khronos OpenCL Working Group, December 2008.

[197] Thannirmalai Somu Muthukaruppan, Mihai Pricopi, Vanchinathan Venkataramani, Tulika Mitra, and Sanjay Vishin. Hierarchical power management for asymmetric multi-core in dark silicon era. In Proc. Design Automation Conference (DAC), pages 174:1–174:9, 2013.

[198] Y. Nakamura and K. Hiraki. Highly fault-tolerant FPGA processor by degrading strategy. In Proc. Pacific Rim Int. Symp. on Dependable Computing (PRDC), pages 75–78, Dec 2002.

[199] Nallatech. Intel Xeon FSB FPGA Socket Fillers, September 2010. http://www.nallatech.com/intel-xeon-fsb-fpga-socket-fillers.html.

[200] Viet Nhu Ngo. Parallel loop transformation techniques for vector-based multiprocessor systems. PhD thesis, 1995. UMI Order No. GAX94-33091.

[201] J. Nickolls and W.J. Dally. The GPU computing era. IEEE Micro, (2):56–69, March 2010.


[202] Dorit Nuzman and Ayal Zaks. Outer-loop vectorization: revisited for short SIMD architectures. In Proc. Int. Conf. on Parallel Architecture and Compilation Techniques (PACT), pages 2–11. ACM, 2008.

[203] Stephen L. Olivier and Jan F. Prins. Evaluating OpenMP 3.0 run time systems on unbalanced task graphs. In Matthias S. Müller, Bronis R. de Supinski, and Barbara M. Chapman, editors, Proc. Int. Workshop on OpenMP (IWOMP), pages 63–78. Springer Berlin Heidelberg, 2009.

[204] OpenMP Architecture Review Board. OpenMP application program interface, version 4.0. http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf, July 2013.

[205] John D Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E Lefohn, and Timothy J Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, (1):80–113, 2007.

[206] Karen Parnell. Could microprocessor obsolescence be history? 2003.

[207] David A. Patterson and David R. Ditzel. The case for the reduced instruction set computer. ACM SIGARCH Computer Architecture News, (6):25–33, October 1980.

[208] David A. Patterson and Carlo H. Sequin. RISC I: A reduced instruction set VLSI computer. In Proc. Int. Symp. on Computer Architecture (ISCA), pages 443–457. IEEE Computer Society Press, 1981.

[209] A. Peleg and U. Weiser. MMX technology extension to the Intel architecture. IEEE Micro, (4):42–50, August 1996.

[210] Christian Plessl and Marco Platzner. Zippy – a coarse-grained reconfigurable array with support for hardware virtualization. In Proc. IEEE Int. Conf. on Application-Specific Systems, Architectures, and Processors (ASAP), pages 213–218. IEEE Computer Society, July 2005.

[211] Peter Poplavko, Twan Basten, and Jef van Meerbergen. Execution-time prediction for dynamic streaming applications with task-level parallelism. In Proc. Euromicro Conf. on Digital System Design Architectures, Methods and Tools (DSD), pages 228– 235. IEEE Computer Society, 2007.

[212] Louis-Noël Pouchet. Polybench: The polyhedral benchmark suite. http://web.cse. ohio-state.edu/~pouchet/software/polybench/, 2012.

[213] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J. Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger. A reconfigurable fabric for accelerating large-scale datacenter services. IEEE Micro, (3):10–22, May 2015.


[214] Vidya Rajagopalan, V Boppana, S Dutta, B Taylor, and R Wittig. Xilinx Zynq-7000 EPP – an extensible processing platform family. In Hot Chips Symposium, pages 1352–1357, 2011.

[215] A. M. Rashid, B. Kuhn, B. Arbab, and D. Kuck. PC design, use, and purchase relations. In Int. Symp. on Workload Characterization (IISWC), pages 140–149, Oct 2015.

[216] M. Reshadi, B. Gorjiara, and D. Gajski. Utilizing horizontal and vertical parallelism with a no-instruction-set compiler for custom datapaths. In Proc. Int. Conf. on Computer Design (ICCD), pages 69–74, October 2005.

[217] Mathieu Rosière, Jean-lou Desbarbieux, Nathalie Drach, and Franck Wajsbürt. An out-of-order superscalar processor on FPGA: The reorder buffer design. In Proc. Design, Automation and Test in Europe Conf. (DATE), pages 1549–1554. EDA Consortium, 2012.

[218] Hans-Peter Rosinger. Xilinx whitepaper: Connecting customized IP to the MicroBlaze soft processor using the fast simplex link (FSL) channel. http://tec.icbuy.com/uploads/2010/8/3/xapp529.pdf, May 2004.

[219] K. Rupp and S. Selberherr. The economic limit to Moore’s law [point of view]. Proceedings of the IEEE, (3):351–353, March 2010.

[220] A. Sabne, P. Sakdhnagool, S. Lee, and J. S. Vetter. Understanding portability of a high-level programming model on contemporary heterogeneous architectures. IEEE Micro, (4):48–58, July 2015.

[221] Randolf G Scarborough and Harwood G Kolsky. A vectorizing Fortran compiler. IBM Journal of Research and Development, (2):163–171, March 1986.

[222] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. Journal on Computer Vision, (1-3):7–42, April 2002.

[223] Tamara Schmitz. The rise of serial memory and the future of DDR. Xilinx Whitepaper: UltraScale Devices, http://www.xilinx.com/support/documentation/white_papers/wp456-DDR-serial-mem.pdf, March 2015.

[224] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan. Larrabee: A many-core x86 architecture for visual computing. ACM Transactions on Graphics (TOG), (3):18:1–18:15, August 2008.


[225] G. Semeraro, G. Magklis, R. Balasubramonian, D. H. Albonesi, S. Dwarkadas, and M. L. Scott. Energy-efficient processor design using multiple clock domains with dynamic voltage and frequency scaling. In Proc. Int. Symp. on High-Performance Computer Architecture (HPCA), pages 29–40. IEEE, February 2002.

[226] A. Severance and G. G. F. Lemieux. Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor. In Proc. Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), pages 1–10, Sept 2013.

[227] Aaron Severance, Joe Edwards, Hossein Omidian, and Guy Lemieux. Soft vector processors with streaming pipelines. In Proc. Int. Symp. on Field-Programmable Gate Arrays (FPGA), pages 117–126. ACM, 2014.

[228] Aaron Severance and Guy Lemieux. VENICE: a compact vector processor for FPGA applications. In Proc. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), pages 245–252. IEEE Computer Society, 2012.

[229] John Shalf, Sudip Dosanjh, and John Morrison. Exascale computing technology challenges. In José M. Laginha M. Palma, Michel Daydé, Osni Marques, and João Correia Lopes, editors, Proc. Int. Meeting on High Performance Computing for Computational Science (VECPAR), pages 1–25. Springer Berlin Heidelberg, 2011.

[230] Yi Shan, Yuchen Hao, Wenqiang Wang, Yu Wang, Xu Chen, Huazhong Yang, and Wayne Luk. Hardware acceleration for an accurate stereo vision system using mini-census adaptive support region. ACM Transactions on Embedded Computing Systems (TECS), (4s):132:1–132:24, April 2014.

[231] Y. S. Shao, B. Reagen, Gu-Yeon Wei, and D. Brooks. The Aladdin approach to accelerator design and modeling. IEEE Micro, (3):58–70, May 2015.

[232] H. Sharangpani and H. Arora. Itanium processor microarchitecture. IEEE Micro, (5):24–43, September 2000.

[233] Anand Lal Shimpi. The Sandy Bridge review: Intel Core i7-2600K, i5-2500K and Core i3-2100 tested. http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i7-2600k-i5-2500k-core-i3-2100-tested, January 2011.

[234] Sunil Shukla, Neil W. Bergmann, and Jürgen Becker. QUKU: A two-level reconfigurable architecture. In Proc. IEEE Symp. on Emerging VLSI Technologies and Architectures (ISVLSI), pages 1–6. IEEE Computer Society, 2006.

[235] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, (7587):484–489, January 2016.


[236] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho. MorphoSys: An integrated reconfigurable system for data-parallel and computation-intensive applications. IEEE Transactions on Computers (TC), (5):465–481, May 2000.

[237] Scott Sirowy and Alessandro Forin. Where’s the beef? Why FPGAs are so fast. Technical report, Microsoft Research, September 2008.

[238] T. Skotnicki, J.A. Hutchby, Tsu-Jae King, H.-S.P. Wong, and F. Boeuf. The end of CMOS scaling: toward the introduction of new materials and structural changes to improve MOSFET performance. IEEE Circuits and Devices Mag., (1):16–26, Jan 2005.

[239] Ryan Smith. The NVIDIA GeForce GTX Titan X review. http://www.anandtech.com/show/9059/the-nvidia-geforce-gtx-titan-x-review, March 2015.

[240] Ryan Smith. HSA 1.1 specification launched: Extending HSA to more vendors & processor types. http://www.anandtech.com/show/10387/ hsa-11-specification-launched-multi-vendor-support, May 2016.

[241] Simon A. Spacey, Wayne Luk, Paul H. J. Kelly, and Daniel Kuhn. Rapid design space visualization through hardware/software partitioning. In Proc. Southern Programmable Logic Conference (SPL), pages 159–164. IEEE, April 2009.

[242] W. J. Starke, J. Stuecheli, D. M. Daly, J. S. Dodson, F. Auernhammer, P. M. Sagmeister, G. L. Guthrie, C. F. Marino, M. Siegel, and B. Blaner. The cache and memory subsystems of the IBM POWER8 processor. IBM Journal of Research and Development, (1):3:1–3:13, Jan 2015.

[243] Greg Stitt and Lu Hao. Virtual finite-state-machine architectures for fast compilation and portability. In Proc. IEEE Int. Conf. on Application-Specific Systems, Architectures, and Processors (ASAP), pages 91–94. IEEE Computer Society, 2013.

[244] Greg Stitt and Frank Vahid. Hardware/software partitioning of software binaries. In Proc. IEEE/ACM Int. Conf. on Computer Aided Design (ICCAD), pages 164–170. IEEE, November 2002.

[245] Greg Stitt and Frank Vahid. A decompilation approach to partitioning software for microprocessor/FPGA platforms. In Proc. Design, Automation and Test in Europe Conf. (DATE), pages 396–397. IEEE Computer Society, 2005.

[246] J. Stuecheli, B. Blaner, C. R. Johns, and M. S. Siegel. CAPI: a coherent accelerator processor interface. IBM Journal of Research and Development, (1):7:1–7:7, Jan 2015.

[247] B. Sukhwani and M. C. Herbordt. FPGA acceleration of rigid-molecule docking codes. IET Computers & Digital Techniques, (3):184–195, May 2010.


[248] Herb Sutter. The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb’s Journal, (3), 2005.

[249] Herb Sutter and James Larus. Software and the concurrency revolution. Queue, (7):54–62, September 2005.

[250] M. Suzuki, Y. Hasegawa, Y. Yamada, N. Kaneko, K. Deguchi, H. Amano, K. Anjo, M. Motomura, K. Wakabayashi, T. Toi, and T. Awashima. Stream applications on the dynamically reconfigurable processor. In Proc. Int. Conf. on Field Programmable Logic and Applications (FPL), pages 137–144, Dec 2004.

[251] Michael Bedford Taylor. Bitcoin and the age of bespoke silicon. In Proc. Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pages 16:1–16:10. IEEE Press, 2013.

[252] The Joint Task Force for Computing Curricula 2005. Computing Curricula 2005, The Overview Report. ACM, 2005.

[253] Stefan Tillich and Johann Großschädl. Instruction set extensions for efficient AES implementation on 32-bit processors. In Proc. Int. Conf. on Cryptographic Hardware and Embedded Systems (CHES), pages 270–284. Springer, 2006.

[254] Beau Tippetts, Dah Jye Lee, Kirt Lillywhite, and James K. Archibald. Hardware-efficient design of real-time profile shape matching stereo vision algorithm on FPGA. Int. Journal of Reconfigurable Computing (IJRC), 2014.

[255] A. Tiwari, C. Chen, J. Chame, M. Hall, and J. K. Hollingsworth. A scalable auto-tuning framework for compiler optimization. In Proc. Int. Symp. on Parallel and Distributed Processing (IPDPS), pages 1–12, May 2009.

[256] Tiffany Trader. Programmability matters. http://www.hpcwire.com/2014/06/30/ programmability-matters/, June 2014.

[257] J. Vasiljevic, R. Wittig, P. Schumacher, J. Fifield, F. M. Vallina, H. Styles, and P. Chow. OpenCL library of stream memory components targeting FPGAs. In Proc. Int. Conf. on Field Programmable Logic and Applications (FPL), pages 104–111, December 2015.

[258] S. Vassiliadis, S. Wong, G. Gaydadjiev, K. Bertels, G. Kuzmanov, and E. M. Panainte. The MOLEN polymorphic processor. IEEE Transactions on Computers (TC), (11):1363–1375, Nov 2004.

[259] Olga Veksler. Fast variable window for stereo correspondence using integral images. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 556–561. IEEE Computer Society, 2003.


[260] Rangharajan Venkatesan, Amit Agarwal, Kaushik Roy, and Anand Raghunathan. MACACO: Modeling and analysis of circuits for approximate computing. In Proc. IEEE/ACM Int. Conf. on Computer Aided Design (ICCAD), pages 667–673. IEEE, 2011.

[261] Francisco-Javier Veredas, M. Scheppler, W. Moffat, and Bingfeng Mei. Custom implementation of the coarse-grained reconfigurable ADRES architecture for multimedia purposes. In Proc. Int. Conf. on Field Programmable Logic and Applications (FPL), pages 106–111, Aug 2005.

[262] Werner Vogels. Eventually consistent. Communications of the ACM, (1):40–44, January 2009.

[263] John von Neumann. First draft of a report on the EDVAC. IEEE Annals of the History of Computing, (4):27–75, October 1993. Reprinted from John von Neumann, "First Draft of a Report on the EDVAC", Moore School of Electrical Engineering, University of Pennsylvania, June 1945.

[264] Amy Wang, Matthew Gaudet, Peng Wu, José Nelson Amaral, Martin Ohmacht, Christopher Barton, Raul Silvera, and Maged Michael. Evaluation of Blue Gene/Q hardware support for transactional memories. In Proc. Int. Conf. on Parallel Architecture and Compilation Techniques (PACT), pages 127–136, 2012.

[265] Wenqiang Wang, Jing Yan, Ningyi Xu, Yu Wang, and Feng-Hsiung Hsu. Real-time high-quality stereo vision system in FPGA. In Proc. Int. Conf. on Field Programmable Technology (ICFPT), pages 358–361. IEEE Computer Society, December 2013.

[266] Matthew A. Watkins and David H. Albonesi. ReMAP: a reconfigurable heterogeneous multicore architecture. In Proc. Int. Symp. on Microarchitecture (MICRO), pages 497–508. IEEE Computer Society, 2010.

[267] C. M. Weber, C. N. Berglund, and P. Gabella. Mask cost and profitability in photomask manufacturing: An empirical analysis. IEEE Transactions on Semiconductor Manufacturing, (4):465–474, November 2006.

[268] K. Wegner and O. Stankiewicz. Similarity measures for depth estimation. In Proc. 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-Con), pages 1–4. IEEE, May 2009.

[269] Yuan Wen, Zheng Wang, and Michael F. P. O’Boyle. Smart multi-task scheduling for OpenCL programs on CPU/GPU heterogeneous platforms. In Proc. Int. Conf. on High Performance Computing (HIPC). IEEE, December 2014.

[270] Stephen Weston, James Spooner, Sébastien Racanière, and Oskar Mencer. Rapid computation of value and risk for derivatives portfolios. Concurrency and Computation: Practice and Experience, (8):880–894, June 2011.


[271] Wikipedia. Computer. https://en.wikipedia.org/wiki/Computer, 2015.

[272] S. Wong, T. van As, and G. Brown. ρ-VEX: A reconfigurable and extensible softcore VLIW processor. In Proc. Int. Conf. on Field Programmable Technology (ICFPT), pages 369–372, Dec 2008.

[273] Nathan Woods. Integrating FPGAs in high-performance computing: The architecture and implementation perspective. In Proc. Int. Symp. on Field-Programmable Gate Arrays (FPGA), pages 132–132, 2007.

[274] Xilinx. Virtex-5 family overview, DS100 (v5.0). http://www.xilinx.com/support/ documentation/data_sheets/ds100.pdf, February 2009.

[275] Xilinx. Virtex-6 family overview, DS150 (v2.4). http://www.xilinx.com/support/ documentation/data_sheets/ds150.pdf, January 2012.

[276] Xilinx. Xilinx Inc., form 10-k annual report 2012, May 2012.

[277] Xilinx. 7 Series FPGAs overview, DS180 (v1.16.1). http://www.xilinx.com/ support/documentation/data_sheets/ds180_7Series_Overview.pdf, December 2014.

[278] Xilinx. Xilinx Inc., form 10-k annual report 2015, May 2015.

[279] XtremeData. XD2000F FPGA In-socket Accelerator for AMD Socket F, September 2010. http://www.xtremedata.com/images/pdf/xd2000f_product_flyer.pdf.

[280] XtremeData. XD2000i FPGA In-Socket Accelerator for Intel FSB, September 2010. http://www.xtremedata.com/images/pdf/xd2000i_product_flyer.pdf.

[281] L. Yang, R. T. P. Lee, S. S. P. Rao, W. Tsai, and P. D. Ye. 10 nm nominal channel length MoS2 FETs with EOT 2.5 nm and 0.52 mA/µm drain current. In Device Research Conf. (DRC), pages 237–238, June 2015.

[282] Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose. Fine-grain performance scaling of soft vector processors. In Proc. Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pages 97–106, 2009.

[283] Richard M. Yoo, Christopher J. Hughes, Konrad Lai, and Ravi Rajwar. Performance evaluation of Intel transactional synchronization extensions for high-performance computing. In Proc. Int. Conf. on High Performance Computing, Networking, Storage and Analysis (SC), pages 1–11. ACM, 2013.

[284] Jason Yu, Christopher Eagleston, Christopher Han-Yu Chou, Maxime Perreault, and Guy Lemieux. Vector processing as a soft processor accelerator. ACM Transactions on Reconfigurable Technology and Systems (TRETS), (2):12:1–12:34, June 2009.


[285] Jason Yu, Guy Lemieux, and Christopher Eagleston. Vector processing as a soft-core CPU accelerator. In Proc. Int. Symp. on Field-Programmable Gate Arrays (FPGA), pages 222–232. ACM, 2008.

[286] Ke Zhang, Jiangbo Lu, and Gauthier Lafruit. Cross-based local stereo matching using orthogonal integral images. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), (7):1073–1079, July 2009.

[287] Peiheng Zhang, Guangming Tan, and Guang R. Gao. Implementation of the Smith-Waterman algorithm on a reconfigurable supercomputing platform. In Proc. Int. Workshop on High-Performance Reconfigurable Computing Technology and Applications (HPRCTA), pages 39–48, 2007.

[288] Jingren Zhou and Kenneth A. Ross. Implementing database operations using SIMD instructions. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 145–156. ACM, 2002.
