Research Collection

Doctoral Thesis

Shared Virtual Memory for Heterogeneous Embedded Systems on Chip

Author(s): Vogel, Pirmin

Publication Date: 2018

Permanent Link: https://doi.org/10.3929/ethz-b-000292606

Rights / License: In Copyright - Non-Commercial Use Permitted


Diss. ETH No. 25085

Shared Virtual Memory for Heterogeneous Embedded Systems on Chip

A thesis submitted to attain the degree of DOCTOR OF SCIENCES of ETH ZURICH (Dr. sc. ETH Zurich)

presented by
PIRMIN ROBERT VOGEL
MSc ETH EEIT
born on 10.06.1986
citizen of Entlebuch LU, Switzerland

accepted on the recommendation of
Prof. Dr. Luca Benini, examiner
Prof. Dr. Marko Bertogna, co-examiner

2018

Acknowledgments

Obtaining the doctorate marks a major milestone in my career. I would like to express my gratitude to a number of people whose support stands at the basis of this achievement. First of all, I would like to thank my supervisor Prof. Luca Benini for giving me the opportunity to pursue a PhD at IIS, for guiding me throughout this challenging time but still giving me the required freedom, and for his interest in my research. Second, I would like to thank my co-examiner Prof. Marko Bertogna for reviewing my thesis and for the exciting discussion during the examination. Special thanks go to Andrea Marongiu for his guidance during the last four years, for pushing and motivating me, and for his collaboration and the many fruitful discussions. I would like to thank the glorious J69.2 office crew for lifting the morale in many mysterious ways. More precisely, I thank David Bellasi for the joint, DDPA-enhanced, late-night speech and/or noise imitation sessions, Michael Gautschi for his comprehensive study on the interplay of paper yellowing and reading for interesting research articles during the soccer season, and Pascal Hager—holder of the J69.2 peace medal—for his politically (in)correct analyses comprising all aspects of life and which are always to the point. Moreover, I thank Andreas Kurth for never complaining but always staying positive and productive no matter the circumstances, and for his inspiring attitude, Michael Schaffner for mastering any problem of any complexity anytime and anywhere (AAAA rating), and for raising the Käfeli interrupt frequency to the next level, and finally Florian Glaser for his ingenious pranks and the life-sustaining external Käfeli event triggers.


Doing research at the system level requires mature subsystems, strong infrastructure and technical support. I would like to thank all the members of the PULP team for their great work. Special thanks go to Germain Haugou and Alessandro Capotondi, as well as Igor Loi for providing me with customized soft- and hardware modules and supporting me with the adaption of existing components. At IIS, I could always count on stable and reliable infrastructure and working conditions thanks to many staff members doing good work in the background. In particular, I thank Frank Gürkaynak, Beat Muheim, Christoph Wicki, Adam Feigin, Hansjörg Gisler and Christine Haller. I also thank all my colleagues and companions at IIS, particularly Philipp Schönle, Noé Brun, Pascale Meier, Jonathan Bösser, Lukas Cavigelli, Benjamin Weber, Michael Mühlberghuber, Anne Ziegler, Andreas Traber, Mauro Salomon, Björn Forsberg, Daniele Palossi, Andrea Bartolini, Davide Schiavone, Stefan Mach, Fabian Schuiki, Fredi Kunz and Hubert Kaeslin. Furthermore, I would like to thank all the people who—in some way—prepared me for and actually made me start this undertaking. This includes my advisors during previous projects at IIS, namely Harald Kröll, Christian Benkeser, Sandro Belfanti, Stefan Altorfer, David Tschopp and Prof. Qiuting Huang, as well as Thomas Dellsperger, Norbert Felber and Frank Gürkaynak who counseled me and encouraged me to start a PhD. Moreover, I would like to express my gratitude to the numerous teachers, coaches and supporters in my early days for everything they have taught me. Finally, I thank my family and friends for admiring what I do and for supporting me at all times. I am very grateful to my parents for encouraging me and giving me the opportunity to go to university, and my brother Matthias for leading me the way during high school and the first years of my studies. Also, I thank Nathanael for sharing the same sense of humor, for the many, truly inspiring discussions about life and for our long-term friendship. Ultimately, I thank Flurina for all her love, for supporting me throughout so many years, and for her patience and understanding during the final phase of my PhD.

Entlebuch, September 2018
Pirmin Vogel

Abstract

Modern embedded systems on chip (SoCs) are heavily based on heterogeneous architectures that combine feature-rich, general-purpose, multi-core host processors with massively-parallel and programmable many-core accelerators (PMCAs) to achieve high flexibility, energy efficiency and peak performance. While these heterogeneous embedded SoCs (HESoCs) are nominally capable of achieving tremendous performance/Watt, effectively using them is a cumbersome task that is nowadays left entirely in the hands of the application programmers.

The main challenges in traditional accelerator programming originate from the partitioned memory models between host and accelerators. Thanks to hardware-managed virtual memory (VM) and a multi-level cache hierarchy that together abstract away the low-level details of the memory system, the host sees memory as a flat resource. In contrast, PMCAs typically rely on physically-addressed scratchpad memories (SPMs) managed in software via explicit direct memory access (DMA) transfers for maximum energy efficiency. Sharing data between host and accelerators in a heterogeneous application thus requires the programmer to manually orchestrate data copies between virtually and physically addressed main memory sections, as well as to translate and modify virtual address pointers inside the shared data to point to the proper copy. While this may be acceptable for regular memory access patterns, it quickly becomes a performance bottleneck and is completely prohibitive for the heterogeneous implementation of applications operating on complex, pointer-rich data structures.

This thesis focuses on the design of transparent, lightweight, zero-copy shared virtual memory (SVM) frameworks for HESoCs that allow application programmers to simply pass virtual address pointers between host and PMCA, and thereby improve both programmability and performance. As opposed to the full-fledged hardware designs for SVM found in high-performance computing (HPC) and high-end desktop systems, we strive for mixed hardware-software solutions that are i) better suited for area- and power-constrained embedded systems, ii) less intrusive to the hardware architecture of both host and PMCA, and iii) more flexible. We start with a first lightweight SVM system suitable for regular memory access patterns typical for today's data-parallel accelerator models. To support applications with irregular memory access patterns, we propose a lightweight, non-intrusive hardware extension combined with a compiler extension for the PMCA, which allows for zero-copy sharing of pointer-rich data structures. Relying on PMCA-side helper threads for managing the address-translation hardware, the performance of this system lies within 50% of an idealized SVM system for purely memory-bound application kernels and within 5% for real applications. We further investigate the design of address-translation caches tailored to the needs of PMCAs and come up with a hybrid architecture that is configurable, scalable, and maps well to field-programmable gate arrays (FPGAs), making the design also suitable for custom accelerators in FPGA-enabled HESoCs. Compared to related works, this design increases the caching capacity by factors of 16x and more while achieving lower overall resource utilization and higher or comparable clock frequencies. Finally, we adapt our framework to leverage the dedicated address-translation hardware featured by some next-generation, high-end HESoCs for enabling SVM, far beyond what is achievable with standard development frameworks and software distributions.

Zusammenfassung

Moderne eingebettete Ein-Chip-Systeme (SoCs) basieren mehr und mehr auf heterogenen Architekturen. Diese kombinieren funktionsreiche, universell einsetzbare, mehrkernige Hauptprozessoren mit massiv parallelen, programmierbaren, vielkernigen Rechenbeschleunigern (PMCAs), um sowohl ein hohes Mass an Flexibilität, wie auch eine hohe Energieeffizienz und Rechenleistung zu erreichen. Während solch heterogene, eingebettete SoCs (HESoCs) eine nominell enorm hohe Rechenleistung bei geringem Energieverbrauch bieten, stellt deren Nutzung und das Erreichen dieser Rechenleistung eine grosse Herausforderung dar, welche heutzutage komplett den Applikationsprogrammierern überlassen ist.

Die Hauptschwierigkeit in der traditionellen Programmierung von Rechenbeschleunigern rührt von den grundverschiedenen Speichermodellen der verschiedenen Prozessortypen her. Durch den Einsatz von dedizierter Hardware zur Speichervirtualisierung und Verwaltung von mehrstufigen Zwischenspeicher-Hierarchien können die maschinennahen Details des Speichersystems abstrahiert werden: Der Hauptprozessor sieht Speicher als eine flache, uniforme Ressource. Im Gegensatz dazu verwenden PMCAs typischerweise physikalisch adressierte Zwischenspeicher, die explizit in Software und mittels Hilfskontrollern für direkten Speicherzugriff (DMA) verwaltet werden, um eine maximale Energieeffizienz zu erreichen. Um in einer heterogenen Applikation Daten zwischen Hauptprozessor und Beschleunigern teilen zu können, muss daher der Programmierer diese Daten zum einen manuell zwischen virtuell und physikalisch adressierten Bereichen im Hauptspeicher hin und her kopieren und zum anderen jegliche Zeiger innerhalb der geteilten Daten so modifizieren, dass sie auf die richtige Kopie verweisen.


Während diese aufwändige Prozedur für reguläre Speicherzugriffsmuster akzeptierbar sein mag, wird sie schnell zum Flaschenhals und verhindert die Implementierung von Applikationen mit irregulären Zugriffsmustern und komplexen, zeigerbasierten Datenstrukturen auf heterogenen Systemen.

Diese Arbeit fokussiert auf den Entwurf von transparenten, leichtgewichtigen Systemen für geteilten virtualisierten Speicher, auch bekannt als Shared Virtual Memory (SVM), in HESoCs, die es Applikationsprogrammierern erlauben, virtuelle Zeiger ohne das Anlegen von Datenkopien zwischen Hauptprozessor und PMCA zu übergeben, und dadurch sowohl die Programmierbarkeit verbessern, wie auch die effektiv nutzbare Rechenleistung erhöhen. Im Gegensatz zu den rein hardwarebasierten Lösungen für SVM in Hochleistungsrechensystemen und Workstations erforscht diese Arbeit gemischte Hardware/Software-Lösungen, welche i) besser geeignet sind für eingebettete Systeme mit eingeschränkten Budgets für Chipfläche und Leistungsaufnahme, ii) weniger invasiv sind in Bezug auf die Hardware-Architektur sowohl des Hauptprozessors wie auch des PMCA und iii) eine grössere Flexibilität bieten.

Als erstes entwickeln wir ein schlankes SVM-System geeignet für reguläre Zugriffsmuster, die typisch sind für heutige datenparallele Rechenbeschleuniger. Um Applikationen mit irregulären Zugriffsmustern zu unterstützen, schlagen wir eine leichtgewichtige, nichtinvasive Hardware-Erweiterung in Kombination mit einer Compiler-Erweiterung für PMCAs vor, welche das einfache Teilen von komplexen, zeigerbasierten Datenstrukturen ermöglichen, ohne Datenkopien anzulegen. Durch die Verwendung von PMCA-seitigen Hilfs-Threads zur Verwaltung der Adressübersetzungs-Hardware ist die Leistung des resultierenden Systems innerhalb von 50% der Leistung eines idealen SVM-Systems für rein speicherlimitierte Applikationskernel und innerhalb von 5% für reale Applikationen.

Im Weiteren untersuchen wir den Entwurf von Zwischenspeichern zur Adressübersetzung, spezifisch zugeschnitten auf die Anforderungen von PMCAs. Die von uns vorgeschlagene hybride Architektur ist sowohl konfigurier- wie auch skalierbar und lässt sich effizient auf Field-Programmable Gate Arrays (FPGAs) abbilden, weshalb sie auch geeignet ist für spezialisierte Rechenbeschleuniger in FPGA-basierten HESoCs. Verglichen mit anderen gängigen Implementierungen erlaubt diese Architektur eine Steigerung der Speicherkapazität um Faktoren von 16x und mehr bei gleichzeitig kleinerem Ressourcenbedarf und höherer oder vergleichbarer Taktfrequenz. Schliesslich adaptieren wir unser System, um dedizierte, vollwertige Adressübersetzungs-Hardware, wie sie gewisse Spitzen-HESoCs der nächsten Generation aufweisen, für SVM zu verwenden, weit darüber hinaus, was mit aktuellen Entwicklungsumgebungen und Software-Distributionen möglich ist.

Contents

Acknowledgments v

Abstract vii

Zusammenfassung ix

1 Introduction 1
  1.1 Target Heterogeneous Architecture 4
  1.2 The Origin of the IOMMU 6
  1.3 IOMMUs for Shared Virtual Memory 7
  1.4 Shared Memory without IOMMU 9
  1.5 Contributions and Publications 12
  1.6 Outline 15

2 Profiling Shared Memory Performance 17
  2.1 Related Work 18
  2.2 Evaluation Platform 19
    2.2.1 FPGA Models 19
    2.2.2 Software Support 22
  2.3 Experimental Results 22
    2.3.1 Synthetic Workload 23
    2.3.2 MiBench 25
    2.3.3 ALPBench 27
    2.3.4 Collaborative Workloads 27
  2.4 Summary 31


3 Lightweight Shared Virtual Memory 33
  3.1 Motivating Example 34
  3.2 Related Work 36
  3.3 Infrastructure 38
    3.3.1 Host 38
    3.3.2 PMCA 39
    3.3.3 Remapping Address Block 40
    3.3.4 Software 41
  3.4 Experimental Results 45
    3.4.1 Experimental Setup 46
    3.4.2 Synthetic Workload 47
    3.4.3 Collaborative Workloads 53
  3.5 Summary 54

4 Sharing Pointer-Rich Data Structures 57
  4.1 Related Work 58
  4.2 Infrastructure 60
    4.2.1 Remapping Address Block 60
    4.2.2 The tryx() Operation 61
    4.2.3 Compiler and Runtime Support 63
    4.2.4 Comparison with Existing Solutions 65
  4.3 Experimental Results 67
    4.3.1 Evaluation Platform 68
    4.3.2 Shared Virtual Memory Cost 70
    4.3.3 Pointer-Chasing Applications 73
    4.3.4 Synthetic Model 78
  4.4 Summary 80

5 On-Accelerator Virtual Memory Management 83
  5.1 Related Work 85
  5.2 Background 86
    5.2.1 Linux Page Table 86
    5.2.2 Page Table Coherence 88
    5.2.3 Remapping Address Block 89
    5.2.4 Host-Based RAB Miss Handling 90
  5.3 Infrastructure 91
    5.3.1 Page Table Coherence 92
    5.3.2 Securely Configuring the RAB on the PMCA 94

    5.3.3 VMM Library for the PMCA 95
  5.4 Experimental Results 98
    5.4.1 Evaluation Platform 98
    5.4.2 Synthetic Benchmark Descriptions 99
    5.4.3 Miss-Handling Cost 101
    5.4.4 Synthetic Benchmark Results 102
    5.4.5 Real Application Benchmark Results 107
  5.5 Summary 108

6 SVM for FPGA Accelerators 111
  6.1 Related Work 113
  6.2 Infrastructure 117
    6.2.1 IOTLB Design for FPGA Accelerators 118
    6.2.2 SVM Management 123
  6.3 Experimental Results 125
    6.3.1 Evaluation Platform 125
    6.3.2 FPGA Resource Utilization 128
    6.3.3 Microbenchmarking and Profiling 130
    6.3.4 Real Traffic Patterns 133
    6.3.5 Comparison with Related Works 140
  6.4 Summary 146

7 Full-Fledged IOMMU Hardware for SVM 149
  7.1 Related Work 151
  7.2 Infrastructure 152
    7.2.1 SMMU Architecture and Operation 152
    7.2.2 SMMU Software Stack 155
  7.3 Experimental Results 156
    7.3.1 Evaluation Platform 156
    7.3.2 Shared Virtual Memory Cost 158
    7.3.3 Application Benchmark Results 160
  7.4 Summary 167

8 Conclusions 169
  8.1 Overview of the Main Results 170
  8.2 Outlook 173

A Pointer-Chasing Application Descriptions 175

Acronyms 179

Bibliography 183

Curriculum Vitae 197

Chapter 1

Introduction

Fueled by the ever increasing need for better performance/Watt, modern computing systems are heavily based on heterogeneous architectures. These architectures combine general-purpose, multi-core host processors with massively parallel accelerators, such as specialized hardware accelerators, field-programmable gate arrays (FPGAs), digital signal processors (DSPs), general-purpose graphics processing units (GPGPUs) and programmable many-core accelerators (PMCAs), and are increasingly adopted across all domains ranging from high-performance computing (HPC) down to low-power, embedded systems. While such systems are nominally capable of achieving extremely high GOPS/Watt, the burden of effectively using them is a major challenge to be addressed by the application programmers.

The most difficult obstacles in traditional accelerator programming arguably stem from the complex memory systems adopted by heterogeneous systems, which result in partitioned memory models between host and accelerators. A regular application launched on the host sees memory as a flat resource. The host processor relies on a hardware-managed, multi-level cache hierarchy to keep the most frequently accessed data in fast, local memories close to the processing pipelines. It runs a virtual-memory-enabled operating system (OS) and features specialized hardware to abstract away the details of the memory system from the application programmer. Besides hiding physical memory fragmentation, virtual memory (VM) provides each process running on the host with its own address space and the illusion of owning the whole system, thereby effectively isolating and protecting the different processes from each other. In addition, VM paves the way for demand paging, i.e., the OS can use cheaper storage to provide the illusion of more main memory than is physically present in the system. In contrast, accelerators feature local, private memories which are, for reasons of energy efficiency, physically addressed and explicitly managed in software via high-bandwidth direct memory access (DMA) transfers.

Writing a program for heterogeneous systems thus implies orchestrating offload sequences, which move application kernels suitable for acceleration from the host processor to the accelerator and require explicit data management. This includes programming DMA engines to copy data to/from the accelerator and manually maintaining data consistency with explicit coherency operations such as flushing the data caches of the host. If partitioning program data for DMA transfers already represents a significant effort for applications with regular memory access patterns, for which it is relatively simple to identify workload partitioning strategies amenable to DMA transfers, it literally becomes a nightmare for irregular programs based on complex data structures, such as pointer chasing, which imply data-dependent memory access patterns often impossible to predict statically. Offloading pointer-intensive computations to accelerators is incredibly burdensome, as data structures spanning several virtual memory pages must be moved to contiguous, nonpaged and uncached memory areas. On top of that, any virtual address stored inside the data structures (typically initialized on the host side) needs to be adjusted to point to the copy in physically contiguous memory. Practically, this requires traversing the entire data structure at run time, which not only hampers programmability but also kills performance.

In an effort to simplify the programmability of heterogeneous systems, initiatives such as the Heterogeneous System Architecture (HSA) foundation [1] are pushing for an architectural model where the host processor and the accelerator(s) communicate via coherent, shared virtual memory (SVM). The HSA memory architecture moves the management of host and accelerator memory coherency from the developer's hands down to the hardware. This enables direct and transparent access to system memory from both sides, eliminating the need for explicit management of different memories. In this scenario, an offload sequence simply consists of passing virtual memory pointers to shared data from the host to the accelerator, in the same way that shared-memory parallel programs pass pointers between threads running on a central processing unit (CPU). Undoubtedly, this breaks the accelerator's reliance on the host for data management and thereby eases both program writing and compiler implementations of offload techniques. Over the past decade, shared virtual memory (SVM) has thus been adopted not only in the HPC domain, where FPGA- and GPGPU-based heterogeneous systems efficiently accelerate big data workloads and applications running in the cloud [2–4], but also in high-end desktop processors equipped with embedded GPGPUs [5, 6].
To enable the accelerators in such SVM-enabled systems to handle addresses in paged virtual memory, these systems rely on specialized software stacks combined with dedicated hardware blocks such as input/output memory management units (IOMMUs) [7–9] that perform the virtual-to-physical address translation for the accelerators. As opposed to the address translation hardware found in CPU cores, these IOMMUs must sustain the significantly higher degrees of parallelism exhibited by modern accelerators, which leads to substantially higher design complexity as well as heavy modifications to the accelerator architectures themselves [10–12]. While heterogeneous, coherent SVM can be justified in the context of high-end systems, it is probably not affordable for accelerators in low-power, embedded systems on chip (SoCs) with much tighter constraints, just like data caches and associated coherency protocols, which are typically replaced by software-managed memories for increased scalability and maximum energy efficiency. Indeed, except for a few high-end, GPGPU-based systems with completely closed hard- and software stacks [13, 14], SVM is still not widely adopted in the embedded systems domain. Most systems stick to less sophisticated accelerator interaction models imposing severe limits on efficiency and programmability. The focus of this thesis lies on the development of a mixed hardware-software framework for enabling lightweight SVM and thereby improving both performance and programmability of power- and area-constrained heterogeneous embedded systems on chip (HESoCs).

Figure 1.1: Target platform high-level view.

1.1 Target Heterogeneous Architecture

The heterogeneous template targeted in this work combines two architecturally different processing units on a single chip. As shown in Fig. 1.1, the central component of this HESoC is a powerful general-purpose multi-core CPU (the host), which is equipped with a multi-level cache-coherent memory hierarchy and runs a full-fledged OS. To improve overall performance/Watt, the host can offload critical computation kernels to a PMCA that consists of several tens of simple processing elements (PEs) [15–17]. The type of PMCA that we consider leverages a multi-cluster design to overcome scalability limitations [15, 18, 19]. Per cluster, multiple PEs share a level-one (L1) instruction cache and an L1 data scratchpad memory (SPM), both multi-banked. The shared level-two (L2) SPM as well as the L1 SPMs of the other clusters can be accessed through the global interconnect, albeit at a higher latency. The main, off-chip dynamic random-access memory (DRAM) is physically shared among the host and the PMCA [1], meaning that they both have a physical communication channel to the main DRAM, as opposed to a more traditional accelerator model where the latter uses a private DRAM.

To constantly feed the various processing engines with data and exploit the available computing resources, both the host and the PMCA leverage their internal memory hierarchy to keep the most frequently accessed data in fast, local storage. The host does so using its hardware-managed cache hierarchy. For reasons of efficiency, the PMCA instead relies on multi-channel, high-bandwidth DMA engines and double-buffering schemes. While such schemes allow data movement to be efficiently overlapped with the actual computation performed on data available in the cluster-internal L1 SPMs, they usually require heavy application refactoring. To address this issue, suitable frameworks and application programming interfaces (APIs) for image processing applications have been demonstrated to effectively automate DMA and SPM management in PMCAs, thereby also optimizing overall performance [20]. Alternatively, a software cache [21] may be used to ease the implementation of more irregular tasks, at the cost of sacrificing some of the efficiency of the PMCA.

A second considerable difficulty in programming such HESoCs stems from the different memory abstraction and management techniques employed in the two processing components. The host runs a full-fledged OS with support for VM, which brings several benefits indispensable for modern computing systems. In contrast, the memory inside the PMCA is all physically addressed and explicitly managed by the PEs, which have no inherent VM support for efficiency reasons. Despite direct access to the physically shared main memory, the PMCA cannot access a shared data element through the virtual address obtained from the host. Instead, the programmer must copy the shared data to a physically contiguous, nonpaged and uncached memory region and pass the corresponding physical address to the PMCA. On top of that, any virtual address stored inside the shared data (typically initialized on the host side) needs to be adjusted to point to the copy in contiguous memory. For applications based on complex data structures, e.g., pointer chasing, this requires traversing the entire data structure at run time, which not only hampers programmability but also kills performance [22].
To simplify programmability and allow the sharing of virtual address pointers between the host processor and the PMCA, an input/output memory management unit (IOMMU) [7–9] may be placed in front of the PMCA. This unit translates the virtual addresses as seen by the application offloaded to the PMCA to their physical counterparts in main memory, similarly to what the memory management unit (MMU) does for the host processor cores. As such, it provides the PMCA with the same view of the memory system as the accelerated process running on the host. Virtual address pointers have the same meaning on both the host and the PMCA and can simply be exchanged, thereby allowing for fine-grained, zero-copy task offloading.
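
As a concrete illustration of the double-buffering scheme mentioned above, the following minimal C sketch shows how a PMCA kernel can overlap DMA transfers with computation on the L1 SPM. The functions dma_memcpy_async(), dma_wait() and process_tile(), as well as the tile size, are hypothetical placeholders and do not correspond to a specific PMCA runtime API.

```c
/* Minimal double-buffering sketch (hypothetical PMCA runtime API):
 * while tile i is processed out of one L1 SPM buffer, the DMA engine
 * already fetches tile i+1 from main memory into the other buffer. */
#include <stdint.h>

#define TILE_SIZE 4096  /* bytes per tile, assumed to fit in the L1 SPM */

extern int  dma_memcpy_async(void *dst, const void *src, uint32_t size); /* returns a transfer ID */
extern void dma_wait(int id);                                            /* blocks until the transfer is done */
extern void process_tile(uint8_t *buf, uint32_t size);                   /* the compute kernel */

void process_stream(const uint8_t *ext_data, uint32_t num_tiles,
                    uint8_t spm_buf[2][TILE_SIZE])
{
    int id[2];

    /* Prologue: fetch the first tile into buffer 0. */
    id[0] = dma_memcpy_async(spm_buf[0], ext_data, TILE_SIZE);

    for (uint32_t i = 0; i < num_tiles; i++) {
        uint32_t cur = i & 1, nxt = (i + 1) & 1;

        /* Prefetch the next tile into the other buffer. */
        if (i + 1 < num_tiles)
            id[nxt] = dma_memcpy_async(spm_buf[nxt],
                                       ext_data + (uint32_t)(i + 1) * TILE_SIZE,
                                       TILE_SIZE);

        /* Wait for the current tile, then compute on it while the
         * prefetch proceeds in the background. */
        dma_wait(id[cur]);
        process_tile(spm_buf[cur], TILE_SIZE);
    }
}
```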

1.2 The Origin of the IOMMU

Input/output memory management units (IOMMUs) potentially allow for zero-copy SVM between host and PMCA. However, heterogeneous SVM is just one of many use cases of IOMMUs [23]. Originally, IOMMUs were introduced in server-grade systems to protect the OS from malicious or faulty DMA devices and drivers employed in high-throughput input/output (I/O) devices with direct memory access [24]. Typically, high-throughput I/O devices such as multi-Gb/s network interfaces or disk controllers directly interface with the OS kernel for higher performance. To this end, the kernel allocates data buffers in kernel-space memory which are then asynchronously accessed by the device through DMA. Without an IOMMU, giving the device access to main memory not only allows it to read or write the dedicated buffer, but exposes the whole system. With an IOMMU, the memory accesses of the device can be confined to the intended data buffers.

Fig. 1.2 visualizes this primary use case in a Linux system. The kernel allocates and manages a dedicated data buffer in kernel-space memory through the Linux DMA API (1). Instead of sharing the physical address of the data buffer with the device, the kernel uses the Linux IOMMU API to create a dedicated input/output virtual address (IOVA) space (2), and maps the physical data buffer to this new address space (3). Then, it configures the IOMMU hardware and assigns it to the IOVA space just created (4), again through the IOMMU API. Instead of the physical address of the data buffer, the kernel just shares the IOVAs with the device. The IOMMU then translates IOVAs of incoming memory requests from the I/O device to the corresponding physical addresses according to the associated IOVA space. Transactions for which the IOVA cannot be resolved to a physical address, i.e., which are not mapped to the IOVA space, are aborted and generate a fault. This way, accesses of I/O devices to physical memory belonging, for example, to user-space processes or different kernel subsystems do not get past the IOMMU, causing at most the corresponding I/O device instead of the entire system to fail.

Figure 1.2: Memory protection using a dedicated IOVA address space combined with an IOMMU.

The same concept can be extended to support secure and efficient I/O device operation for guest OSs in virtualization environments. Besides protection, the IOMMU can also make device buffers appear physically contiguous, which allows for improved DMA performance. Similarly, the IOMMU can be used to remap entire memory sections, e.g., to allow for legacy 32-bit DMA devices in 64-bit host systems [23]. The costs of setting up and removing individual IOVA mappings are high. These operations quickly become the main bottlenecks in high-throughput I/O scenarios and are thus addressed in literature [24, 25]. Making IOMMUs also suitable for fine-grained SVM in heterogeneous computing, however, is a different story and requires major modifications.
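
To make the four steps above more concrete, the following kernel-side C sketch shows how a driver could allocate a DMA buffer and confine a device to it using the Linux DMA and IOMMU APIs. This is a simplified sketch under assumptions: error handling and unmapping are omitted, the IOVA base is an arbitrary example value, and the exact signatures of the IOMMU API functions vary across kernel versions.

```c
#include <linux/dma-mapping.h>
#include <linux/iommu.h>
#include <linux/io.h>

/* Sketch: confine a DMA-capable I/O device to a single kernel buffer.
 * 'dev' is the device's struct device; all error handling is omitted.
 * Note that when the device's DMA ops are already routed through the
 * IOMMU, the DMA API performs this mapping automatically. */
static int setup_protected_dma(struct device *dev, size_t size)
{
    dma_addr_t dma_handle;
    struct iommu_domain *dom;
    void *buf;
    unsigned long iova = 0x10000000; /* example IOVA chosen by the driver */

    /* (1) Allocate the kernel-space data buffer through the DMA API. */
    buf = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);

    /* (2) Create a dedicated IOVA space (an IOMMU domain). */
    dom = iommu_domain_alloc(dev->bus);

    /* (4) Assign the IOVA space to the IOMMU context used for this device. */
    iommu_attach_device(dom, dev);

    /* (3) Map the buffer into the IOVA space; only this window is
     *     reachable by the device, everything else faults. */
    iommu_map(dom, iova, virt_to_phys(buf), size,
              IOMMU_READ | IOMMU_WRITE);

    /* The driver now hands 'iova', not a physical address, to the device. */
    return 0;
}
```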

1.3 IOMMUs for Shared Virtual Memory

Over the past decade, IOMMU technology has become more and more adopted in consumer systems. For example, many modern high-end desktop processors with support for virtualization also feature an IOMMU [7, 8], previously targeted primarily at secure and efficient I/O device operation for guest OSs (refer to Sec. 1.2). Making the technology also usable for sharing virtual user-space memory with accelerators, such as the powerful GPGPUs embedded in some high-end processors [5, 6], not only required adaptations to the IOMMU as well as to the accelerator hardware [10–12], but also major modifications to OS kernels and APIs, hardware drivers, runtime systems and programming models. Finally, these efforts led to the formal specification of SVM in heterogeneous programming frameworks such as OpenCL [26]. Similarly, several FPGA- and GPGPU-powered heterogeneous systems from the HPC domain support SVM [2–4] through dedicated, IOMMU-like hardware and suitable software stacks.

In the embedded systems world, the situation is however different. Some high-end, ARM-based HESoCs [13, 14] indeed support SVM, e.g., according to the OpenCL 2.0 specifications [26], but the corresponding software stacks are completely closed and the details not known to the public. At the time of writing, fully-programmable HESoCs with IOMMU support [9] are gradually becoming available. For example, the next-generation Xilinx Zynq UltraScale+ MPSoC [27] features a full-fledged hardware IOMMU [28]. However, studying the corresponding implementations for ARM-based systems in the Linux kernel still reveals severe limitations regarding the sharing of user-space virtual memory with I/O devices or accelerators. Instead of directly associating the user-space process with the IOMMU and letting it simply translate accesses of the accelerator according to this virtual address (VA) space, the IOMMU API and the hardware drivers always generate an empty IOVA space when initializing the IOMMU. Using a custom kernel driver module, user-space process memory must then explicitly be pinned and mapped to this IOVA space to make it accessible through the IOMMU. This makes the API better suited to the IOMMU's original purpose of isolating the host system from malicious or faulty DMA devices and drivers [24] than to implementing SVM. Indeed, the tools coming with the UltraScale+ MPSoC [29] only contemplate IOMMU operation in combination with DMA transfers initiated by the host.

Fig. 1.3 illustrates this scenario. To initiate the transfer of a shared data element between main memory and the SPM inside the accelerator, the user-space application communicates with a custom kernel-level driver module. This module maps the memory pages of interest from user space to kernel space and pins them in memory (1) such that they cannot be moved by the kernel (for example, by using the get_user_pages() kernel API function in Linux). Then, it initializes the IOMMU hardware through the corresponding API and driver (2). During this process, the IOVA space is generated and the memory pages of interest are mapped to it. The module configures the DMA engine through the Linux DMA API and the hardware driver (3). Finally, the DMA engine performs the data transfer between the SPM and main memory (4). The DMA engine uses IOVAs only, which are translated by the IOMMU according to the associated IOVA space.

Figure 1.3: IOMMU operation in combination with DMA transfers initiated by the host: a) interaction of hardware and software layers and b) different address spaces involved.

While this scheme is suitable for more traditional, host-centric accelerator models, i.e., where the accelerator behaves like a passive slave unit and relies on the host processor for data and memory management, it is clearly not sufficient to enable efficient data sharing in modern, accelerator-centric systems.
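
For illustration, a custom driver implementing steps (1) and (2) above might look roughly like the following sketch. The function name my_accel_share_buffer() is invented, the get_user_pages() signature shown is that of recent kernels and differs in older ones, and locking, error handling, unpinning and unmapping are omitted.

```c
#include <linux/iommu.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Sketch: pin a user-space buffer and expose it to the accelerator through
 * an IOVA space managed by the IOMMU (hypothetical driver code). In most
 * kernel versions, get_user_pages() must be called with the mm lock held
 * (not shown here). */
static int my_accel_share_buffer(struct iommu_domain *dom,
                                 unsigned long user_vaddr,
                                 unsigned long npages)
{
    struct page **pages = kcalloc(npages, sizeof(*pages), GFP_KERNEL);
    unsigned long iova = 0x40000000; /* example IOVA base chosen by the driver */
    long pinned, i;

    /* (1) Pin the user pages so the kernel cannot move or swap them out. */
    pinned = get_user_pages(user_vaddr, npages, FOLL_WRITE, pages);

    /* (2) Map each pinned page into the IOVA space served by the IOMMU. */
    for (i = 0; i < pinned; i++)
        iommu_map(dom, iova + i * PAGE_SIZE, page_to_phys(pages[i]),
                  PAGE_SIZE, IOMMU_READ | IOMMU_WRITE);

    /* (3)/(4) The DMA engine is then programmed with IOVAs only (not shown). */
    return 0;
}
```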

1.4 Shared Memory without IOMMU

While IOMMUs are slowly being adopted in some high-end HESoCs [27], it is doubtful whether the costly and complex hardware is also suitable for embedded systems with tighter area and power constraints. In fact, most of today's embedded systems still stick to less sophisticated accelerator interaction and memory sharing models. Typically, embedded systems without IOMMU support rely on contiguous memory allocation for giving I/O devices access to shared memory. For example, graphics processing units (GPUs) of many middle- and low-end HESoCs as well as camera sensors attached to such systems use the Linux contiguous memory allocator (CMA) [30].

How this mechanism can be used for giving an accelerator access to user-space memory is illustrated in Fig. 1.4. At system startup, the CMA pre-allocates a physically contiguous memory section (1). When an application wants to share data with an accelerator, a custom driver module is required to request memory from this pre-allocated section and allocate a kernel-space buffer (in kernel virtual memory) (2). Using the driver module, this contiguous kernel-space buffer can be exposed to the user-space application through an mmap() system call (3). The user-space application must then allocate all shared data in the physically contiguous buffer using a custom malloc() function. After passing the address of the contiguous buffer to the accelerator, the accelerator can autonomously access the shared data and initiate DMA transfers between its SPMs and the contiguous buffer (4). Since the shared memory pages are contiguous both in virtual (kernel- and user-space) and in physical memory, virtual-to-physical address translation simplifies to applying a constant offset.

Figure 1.4: CMA-based memory sharing in IOMMU-less HESoCs with autonomous accelerators: a) interaction of hardware and software layers and b) different address spaces involved.
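
From the application's perspective, using such a CMA-backed buffer could look like the user-space sketch below. The device node /dev/myaccel, the ioctl command MYACCEL_GET_PHYS and the buffer size are hypothetical stand-ins for whatever a concrete driver exposes; only the mmap() call and the constant-offset address translation reflect the mechanism described above, and a real application would additionally use a custom allocator to place all shared data inside the buffer.

```c
/* User-space sketch of CMA-based memory sharing (hypothetical driver interface). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE (4 * 1024 * 1024)                 /* size of the contiguous buffer */
#define MYACCEL_GET_PHYS _IOR('a', 1, uint64_t)    /* hypothetical ioctl command    */

int main(void)
{
    int fd = open("/dev/myaccel", O_RDWR);         /* hypothetical device node */

    /* (3) Map the contiguous kernel-space buffer into user space. */
    uint8_t *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* Ask the driver for the physical base address of the buffer. */
    uint64_t phys_base = 0;
    ioctl(fd, MYACCEL_GET_PHYS, &phys_base);

    /* All shared data must live inside 'buf'; a custom allocator would
     * carve it out of this region (not shown). */
    uint8_t *shared = buf;

    /* (4) Virtual-to-physical translation is a constant offset; the physical
     *     address is what gets passed to the accelerator. */
    uint64_t shared_phys = phys_base + (uint64_t)(shared - buf);
    printf("physical address to pass to the accelerator: 0x%llx\n",
           (unsigned long long)shared_phys);

    munmap(buf, BUF_SIZE);
    close(fd);
    return 0;
}
```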

Figure 1.5: Copy-based shared memory in IOMMU-less HESoCs.

This approach allows memory sharing with modern, flexible accelerators with self-managed memories such as caches and SPMs, without relying on the host for data management. However, it implicates several substantial drawbacks (besides the lack of protection from malicious or faulty DMA). First of all, it requires a custom kernel-level driver module and a custom user-space memory allocator, which makes it difficult to use. Since the physical memory belonging to the pre-allocated, contiguous section is also available to other processes, the kernel might first need to copy data out of the pre-allocated section before the shared buffer can be allocated, which causes long and unpredictable delays [31] (this typically happens, for example, when using the camera on some low-end smartphones). In fact, there is no guarantee that the pre-allocated section can be made available at all [32]. Moreover, the size of the contiguous section pre-allocated at startup poses a hard limit on the size of the shareable data set. Finally, CMA returns uncached memory on processor architectures widely adopted in the embedded domain such as ARMv7. Letting the host operate on such memory is clearly very inefficient.

In practice, many HESoCs that allow accelerators to proactively fetch data from main into local memories thus still rely on copy-based shared memory [33, 34]. This scheme is visualized in Fig. 1.5. The physically shared main memory is statically split into two sections: one being exclusively accessed by the host via cached, paged virtual addressing, and a second one that is accessed by both the host and the accelerator via uncached, contiguous physical addressing (1). Before and after every computation offload to the accelerator, the host must copy the relevant data between the two sections (2). This approach requires minimal hardware and OS or driver support, but comes at the expense of several substantial drawbacks similar to those of CMA-based solutions. First, besides the cost of the continuous data copies, the static splitting of the main memory is far from optimal, particularly considering the limited memory capacity of embedded systems and the constraints it poses on the size of the shareable data set. Second, copy-based data sharing is not suitable for the acceleration of pointer-rich applications, as any virtual-address pointers inside the shared data need to be modified to point to the copy in the physically contiguous section, requiring a complete re-write of the accelerated code and application-specific offload sequences. Third, the performance of the host code is also reduced, as mutually exclusive access to the shared data is usually employed to avoid coherency problems.

Figure 1.6: The SVM system developed in this thesis.

1.5 Contributions and Publications

The focus of this thesis is on the development of a mixed hardware-software framework for enabling transparent, lightweight, zero-copy SVM in power- and area-constrained HESoCs. With this framework, sharing data between host and accelerator becomes as simple as passing a virtual address pointer to the accelerator. As shown in Fig. 1.6, the accelerator itself can then fetch the data from user-space SVM without the need for remapping the shared data to a different address space, or for specialized memory allocators and the like. As such, the framework allows the accelerator to directly operate on pointer-rich data structures without relying on the host for data management, and greatly simplifies programmability and improves performance of HESoCs. The key contributions of this thesis can be summarized as follows.

1. A first, lightweight SVM system has been designed that is suitable for a wide range of applications with regular memory access patterns typical for today's data-parallel accelerator models.

2. To support applications featuring irregular memory access patterns and zero-copy sharing of pointer-rich data structures, an SVM framework has been designed that relies on lightweight hardware extensions, which are not intrusive to the accelerator cores and the host processor, and a compiler extension that automatically protects the accelerator's accesses to SVM.

3. To perform the virtual-to-physical address translation for the accelerator's accesses to SVM, the design relies on a hybrid translation lookaside buffer (TLB) architecture designed in this thesis. This TLB architecture is configurable, scalable, and maps well to FPGAs (making it also suitable for custom FPGA accelerators in HESoCs).

4. This TLB is completely managed in software, either by a kernel-level driver module running on the host, or by accelerator-side helper threads using a virtual-memory management (VMM) software library developed in this thesis.

5. Finally, this SVM framework is adapted to enable the use of full-fledged, hard-macro IOMMUs found in next-generation, high-end HESoCs for SVM. The resulting system is then compared with the lightweight SVM system developed in this thesis.

The content of this thesis and the main contributions except for Chap. 7 have been published to a large extent in the following publications:

[35] P. Vogel, A. Marongiu, and L. Benini, "An evaluation of memory sharing performance for heterogeneous embedded SoCs with many-core accelerators," in Proc. Int. Workshop on Code Optimisation for Multi and Many Cores (COSMIC), 2015, pp. 6:1–6:9

[36] P. Vogel, A. Marongiu, and L. Benini, “Lightweight virtual memory support for many-core accelerators in heterogeneous embedded SoCs,” in Proc. IEEE/ACM Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015, pp. 45–54

[37] P. Vogel, A. Marongiu, and L. Benini, “Lightweight virtual memory support for zero-copy sharing of pointer-rich data structures in heterogeneous embedded SoCs,” IEEE Trans. on Parallel and Distributed Systems, vol. 28, no. 7, pp. 1947–1959, 2017

[38] P. Vogel, A. Kurth, J. Weinbuch, A. Marongiu, and L. Benini, "Efficient virtual memory sharing via on-accelerator page table walking in heterogeneous embedded SoCs," ACM Trans. on Embedded Computing Systems, vol. 16, no. 5s, pp. 154:1–154:19, 2017

[39] P. Vogel, A. Marongiu, and L. Benini, "Exploring shared virtual memory for FPGA accelerators with a configurable IOMMU," IEEE Trans. on Computers, submitted for publication

The designed SVM framework is an integral part of HERO, the Heterogeneous Embedded Research Platform presented in the following paper and available to the public as open source:

[40] A. Kurth, P. Vogel, A. Capotondi, A. Marongiu, and L. Benini, "HERO: Heterogeneous embedded research platform for exploring RISC-V manycore accelerators on FPGA," 2017, https://arxiv.org/abs/1712.06497

Further advancements to the SVM system presented in this thesis, such as accurate and effective TLB prefetching, and a VM-aware DMA engine, are presented and evaluated in this follow-up paper:

[41] A. Kurth, P. Vogel, A. Marongiu, and L. Benini, "Scalable and efficient virtual memory sharing in heterogeneous SoCs with TLB prefetching and MMU-aware DMA engine," 2018, https://arxiv.org/abs/1808.09751

The following two conference papers on ultrasound imaging comprise the author's research carried out for a different research project and are not covered in this thesis:

[42] P. Vogel, A. Bartolini, and L. Benini, "Efficient parallel beamforming for 3D ultrasound imaging," in Proc. ACM/IEEE Great Lakes Symposium on VLSI (GLSVLSI), 2014, pp. 175–180

[43] P. Hager, P. Vogel, A. Bartolini, and L. Benini, “Assessing the area/power/performance tradeoffs for an integrated fully-digital, large-scale 3D-ultrasound beamformer,” in Proc. IEEE Biomedical Circuits and Systems Conf. (BioCAS), 2014, pp. 228–231

1.6 Outline

The remainder of this thesis is organized as follows; Fig. 1.7 gives an overview of the structure. Chap. 2 discusses the effect of shared memory interference on performance in HESoCs without support for SVM. The design of a mixed hardware-software solution for lightweight SVM for accelerators targeting application kernels with regular, streaming-type access patterns to shared main memory is presented in Chap. 3. To support the heterogeneous execution of applications with irregular access patterns and/or operating on pointer-rich data structures, we have designed a suitable SVM system that is discussed in Chap. 4. The performance of this SVM scheme can be substantially improved at low hardware cost by using the available compute resources of the PMCA and a VMM library to directly manage the VM hardware on the PMCA, as discussed in Chap. 5. To reduce the overall management overhead and further improve performance, the employed VM hardware can be extended and optimized at low hardware cost, as proposed in Chap. 6, to better match the requirements of modern accelerator architectures and improve the suitability for FPGA accelerators. Adapting the software of the proposed SVM scheme also enables hard-macro IOMMUs found in next-generation HESoCs [27] to be used for SVM. Chap. 7 discusses these steps and provides a comparison of such a design with the SVM scheme proposed in this work. Chap. 8 concludes the thesis.

Figure 1.7: Thesis structure.

Chapter 2

Profiling Shared Memory Performance

Physical main memory sharing between host processor and accelerators in heterogeneous embedded systems has essential advantages compared to more traditional architectural models relying on private component memories. First of all, it paves the way for improved system programmability through shared memory communication, as endorsed by initiatives such as the HSA Foundation [1]. Second, physical main memory sharing helps to reduce component count and power consumption by replacing multiple, private off-chip memories with a single memory component. Third, giving accelerators access to main memory reduces the amount of silicon area spent on private on-chip memory. Altogether, it reduces both system and application development cost, which is key in the embedded systems domain.

However, allowing direct access to main memory from both the host and its accelerators increases the potential for interference on the memory system and can have a negative impact on performance. In traditional accelerator-based architectures, such as GPGPUs equipped with private DRAM, an offload sequence would consist of a large data transfer from the main memory into the device DRAM followed by a coarse-grained computation stage where only the host accesses the main memory. In contrast, accelerators in HESoCs leverage fine-grained data transfers including DMA during kernel execution to repeatedly access the shared main memory in parallel to the host.

In this chapter, we analyze memory interference effects in shared memory HESoCs for both host and accelerator kernels under different load conditions. We first provide an overview of related works on evaluating the performance of shared memory systems in Sec. 2.1. Next, we present our evaluation platform in Sec. 2.2. It is based on the Xilinx Zynq-7000 All-Programmable SoC with multiple DMA engines instantiated in the programmable logic (PL) that are programmed according to the shared memory access pattern of a real, off-chip accelerator. Sec. 2.3 presents experimental results when the host and the accelerator execute independently as well as for collaborative scenarios, i.e., where they form a functional processing pipeline. These results demonstrate that the effective memory bandwidth and performance can be reduced substantially due to interference effects, especially if the system is heavily loaded.

2.1 Related Work

Main memory bandwidth performance in shared memory systems has been thoroughly studied for chip multiprocessors (CMPs) [44–47]. In the context of heterogeneous high-end SoCs, which combine multi-core host CPUs with GPGPUs, it has been demonstrated that, while such systems have the potential to outperform machines equipped with more powerful discrete GPGPUs thanks to the improved offload performance, the performance of both the host and the GPGPU can be notably affected by contention on the shared memory [48, 49] and even more substantially by GPGPU-induced last-level cache (LLC) spills [50]. Regarding PMCA-based HESoCs, similar effects have not yet been studied. Existing PMCAs targeting the embedded domain are not yet integrated with general-purpose host CPUs and require FPGA bridges to connect to the shared memory system of the host, which severely limits the memory bandwidth of the PMCA [15, 16]. Simulating an entire HESoC means trading complexity for accuracy: accurately modeling the execution of complex applications on top of a full-fledged OS at the architecture level is not practicable.

In this chapter, we explore the impact of shared memory interference on both host and PMCA performance in a HESoC using representative workloads as well as synthetic traffic for corner cases. Specifically, we use the Xilinx Zynq-7000 SoC to model a real, PMCA-based HESoC. This chip features a dual-core host processor that accesses the main memory through a two-level, coherent cache system, which is also accessible by a PMCA block placed in the PL via two DMA engines. The traffic patterns of these DMA engines can be extracted from real, embedded PMCAs [15, 16] to accurately model the fundamental architecture for main memory sharing between host and PMCA. Previous studies using the same SoC focused on evaluating the maximum memory bandwidth as seen by accelerators when connecting to different levels in the memory hierarchy of the host [51] or on implementing hardware accelerators that operate in tandem with the host [52, 53]. Interference effects occurring during concurrent host and accelerator operation have not yet been studied.

2.2 Evaluation Platform

To explore the impact of shared memory interference in PMCA-based HESoCs, we used the Digilent ZedBoard [54] to implement the evaluation platform depicted in Fig. 2.1. The key architectural parameters are summarized in Tbl. 2.1.

2.2.1 FPGA Models

The ZedBoard features a Xilinx Zynq-7020 All-Programmable SoC [55], which can be divided into two main parts. The PL consists of a Xilinx Artix-7 FPGA. The processing system (PS) features interconnects, a DRAM controller, and the application processing unit (APU) with a dual-core ARM Cortex-A9 CPU at its heart. Each core has separate L1 instruction and data caches with a size of 32 KiB each. The APU further offers a 512 KiB L2 cache that connects to the high-priority port of the DRAM controller. The DRAM controller has three more slave ports: two of them connect to high-performance (HP) Advanced eXtensible Interface Bus (AXI) slave ports of the PS, and the third one is shared among all others including the general-purpose AXI ports. The host is implemented in the PS and runs Xilinx Linux 3.13.

Figure 2.1: Schematic of the heterogeneous system model including peak bandwidths for the interfaces of interest.

Table 2.1: Architectural key parameters.

              Characteristics                Clock Frequency    Peak Bandwidth
Host          Dual-Core ARM A9,              300 MHz            2.4 GiB/s
              512 KiB L2 cache
PMCA          2 AXI Central DMAs,            150 MHz            2.4 GiB/s
              MicroBlaze controller          120 MHz
Main Memory   512 MiB DDR3-600 DRAM          300 MHz            2.4 GiB/s

The PMCA is modeled using Xilinx intellectual property (IP) cores implemented in the PL. With the aim of evaluating main (DRAM) memory sharing performance at the system level, we are only interested in the memory behavior of the host and the PMCA at their boundaries. Indeed, the only transactions that affect memory sharing performance are those that miss in the internal memory hierarchies (L2 cache misses on the host side and DMA transfers of the PMCA). For this reason, we model the PMCA as DMA engines plus a controlling core (a MicroBlaze processor [56]). The MicroBlaze is responsible for injecting DMA transactions into main memory following realistic patterns generated by programs running on a real PMCA and captured into traces (see Sec. 2.3 for more details).

To inject traffic into the shared memory system, two AXI Central DMAs are used. Using separate AXI4 interconnects, they are both connected to separate HP AXI slave ports on the PS side and to separate AXI block random-access memory (BRAM) controllers in the PL. The two BRAM controllers are connected to a dual-port BRAM with a size of 256 KiB. The interfaces are 64-bit wide and clocked at 150 MHz, which leads to a maximum bandwidth of 1.2 GiB/s per DMA engine.

DMA transfers can be set up by writing the proper configuration registers through the AXI4-Lite interfaces of the DMAs (dashed lines in Fig. 2.1). This is the task of the upper right part in Fig. 2.1. The MicroBlaze processor is used to program the DMAs independently of the host, as it happens in real PMCAs. It accesses its BRAM (128 KiB) through local memory bus (LMB) interfaces, which feature a memory access latency of only 1 clock cycle [57]. The second port of the MicroBlaze processor's BRAM is accessible from the host via AXI-Lite. This allows for synchronization and data exchange between host and PMCA during an offload sequence.

The maximum achievable clock frequencies for the IP cores used to model the PMCA on the target FPGA are reported to be 120 MHz and 150 MHz for the AXI-Lite and AXI4 interfaces, respectively. Based on these values, we scaled down the clock frequency of the host and the main memory to 300 MHz such that the PMCA bandwidth to main memory roughly matches the host bandwidth to main memory (now reduced to 2.4 GiB/s), similar to state-of-the-art heterogeneous platforms like the Tegra mobile processors from NVIDIA [58].

2.2.2 Software Support

The host runs Xilinx Linux 3.13. The PMCA can be controlled by the host using a custom kernel-level driver. This driver allows user-space applications to memory-map all the devices connected to the AXI-Lite interconnect using an mmap() system call. The host can then set up DMA transfers by writing the source address, the destination address, and the transfer length to the corresponding configuration registers. To let the MicroBlaze set up DMA transfers autonomously, the host first puts the MicroBlaze into reset using the corresponding PL general-purpose user reset (controllable through the system-level control register). Then, the host loads the executable into the BRAM of the MicroBlaze. When the reset signal is released, the MicroBlaze starts to execute the program and sets up DMA transfers as specified by the loaded binary. During operation, the host and the MicroBlaze can communicate through the BRAM of the MicroBlaze. This allows, e.g., the host to change the frequency at which the DMA transfers are issued by writing a single value to a specific address in the BRAM of the MicroBlaze.
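As a rough illustration of this register-level control path, a user-space program could map the DMA configuration space exposed by the driver and program a single transfer as sketched below. The device node name and the register offsets are hypothetical placeholders and do not reflect the actual register map of the AXI Central DMA or of the custom driver.

/* Illustrative sketch only: device node and register offsets are assumed. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define DMA_SRC  0x00   /* source address register      (assumed offset) */
#define DMA_DST  0x04   /* destination address register (assumed offset) */
#define DMA_LEN  0x08   /* transfer length register     (assumed offset) */
#define DMA_CTRL 0x0C   /* control/start register       (assumed offset) */

int issue_dma(uint32_t src, uint32_t dst, uint32_t len)
{
    int fd = open("/dev/pmca", O_RDWR | O_SYNC);   /* hypothetical node */
    if (fd < 0)
        return -1;

    /* The kernel-level driver exposes the AXI4-Lite register space of
     * the DMA engine to user space via mmap(). */
    volatile uint32_t *regs = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) {
        close(fd);
        return -1;
    }

    regs[DMA_SRC  / 4] = src;   /* program the transfer ... */
    regs[DMA_DST  / 4] = dst;
    regs[DMA_LEN  / 4] = len;
    regs[DMA_CTRL / 4] = 1;     /* ... and kick it off      */

    munmap((void *)regs, 0x1000);
    close(fd);
    return 0;
}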

2.3 Experimental Results

To characterize the memory sharing performance in PMCA-based HESoCs under various corner cases and realistic load conditions, we set up different experiments. First, in Sec. 2.3.1, we use a parametric synthetic benchmark that we designed to control the L2 cache miss rate of the host. On the PMCA side, traffic information profiled on a real many-core accelerator (the STHorm platform from STMicroelectronics [15]) is used to generate DMA requests representative of seven highly parallel application kernels from the computer-vision domain. In Sec. 2.3.2 and Sec. 2.3.3, we use two representative benchmark suites for embedded systems, MiBench [59] and ALPBench [60], respectively, and evaluate the impact that varying amounts of interfering DMA traffic from the PMCA have on their execution time on the host. This is done for different deployment schemes of the benchmarks over the two cores of the host processor. In Sec. 2.3.4, we then consider the effect of memory sharing interference in a collaborative scenario, i.e., where the host and the PMCA form a functional pipeline by concurrently executing different kernels of the same application on subsequent input data.

2.3.1 Synthetic Workload

It is intuitive that the susceptibility of the host to memory interference generated by the PMCA heavily depends on the cache miss rate. We designed a synthetic benchmark that allows us to enforce a given amount of L2 cache misses on the host and a given DMA utilization rate on the PMCA. In this section, we provide an assessment of the performance degradation under different parameter configurations for this benchmark. The synthetic benchmark allocates an array with a size of 1 MiB and sequentially reads the whole array in chunks of 64 cache lines (2 KiB). Using single-instruction multiple-data (SIMD) instructions and post-indexed addressing, a single instruction is sufficient to read a full cache line and increment the read address. After every 64 cache lines, the read address is decremented so as to achieve the desired cache miss rate. Using this benchmark, the host bandwidth to main DRAM (shown in Fig. 2.2 a) ) can be evaluated as a function of the cache miss rate (on the x-axis) and the amount of traffic injected by the PMCA (different curves). The curves report the amount of DRAM traffic in bytes divided by the execution time of the benchmark. For zero cache misses, the DRAM bandwidth seen by the host is zero as all the requests are filtered by the caches. The DRAM bandwidth rapidly grows for increasing cache miss rates. For cache miss rates above 25%, the host DRAM bandwidth can decrease by 30 to 40% due to the traffic injected by the PMCA. To put these numbers into the perspective of realistic usage of the PMCA DMAs, we have profiled the execution of seven kernels (extracted from two representative applications) running on a real many-core accelerator, the STMicroelectronics STHorm [15]. The profiles for these kernels in terms of DRAM bandwidth usage are reported in Tbl. 2.2. Fig. 2.2 also reports markers on the right y-axis that indicate the bandwidth request generated by those kernels, to give an idea of the impact this would have on the host DRAM bandwidth for varying cache miss rates.
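The read-and-rewind pattern of this benchmark can be sketched as follows. This is a minimal C approximation of the idea, assuming 32 B cache lines and plain scalar loads; the actual benchmark uses SIMD loads with post-indexed addressing and is tuned to the Cortex-A9 cache hierarchy.

/* Sketch: read 64 cache lines sequentially, then step the pointer back
 * so that a fraction of the next chunk hits in the cache. */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE  32                  /* bytes per cache line (assumed) */
#define CHUNK_LINES 64                  /* lines per chunk (2 KiB)        */
#define ARRAY_SIZE  (1024 * 1024)       /* 1 MiB working set              */

volatile uint64_t sink;                 /* keeps the loads from being optimized away */

void stress_l2(const uint8_t *array, double miss_rate)
{
    /* Lines of every chunk that are re-read and therefore hit in the cache. */
    size_t hit_lines = (size_t)((1.0 - miss_rate) * CHUNK_LINES);
    if (hit_lines >= CHUNK_LINES)       /* ensure forward progress */
        hit_lines = CHUNK_LINES - 1;

    const uint8_t *p   = array;
    const uint8_t *end = array + ARRAY_SIZE;

    while (p + CHUNK_LINES * CACHE_LINE <= end) {
        for (int i = 0; i < CHUNK_LINES; i++) {
            sink += *(const uint64_t *)p;   /* one load per cache line */
            p += CACHE_LINE;
        }
        p -= hit_lines * CACHE_LINE;        /* rewind to control the miss rate */
    }
}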

Figure 2.2: Effect of different levels of DMA traffic injection on a) host DRAM bandwidth and b) normalized host memory bandwidth.

Table 2.2: PMCA application kernels.

Kernel Name         Application Name       Bandwidth Utilization
Compute             Face Detection         90.11%
Detect              Face Detection          5.43%
Gaussian Blur       Object Recognition     97.90%
FAST                Object Recognition     99.31%
Compute Keypoints   Object Recognition      1.63%
RBRIEF              Object Recognition     45.15%
Distance            Object Recognition     55.31%

The total memory bandwidth of the host for different levels of PMCA DMA traffic injection was also measured, and is shown in Fig. 2.2 b). Here, the host memory bandwidth was measured as the total data read/written (not just the transactions resulting from L2 cache misses) in bytes, divided by the benchmark execution time. The numbers have been normalized to the case where the PMCA is not injecting any traffic. For cache miss rates below 5%, the total host memory bandwidth is reduced by less than 5%. Similar to the DRAM bandwidth, the total memory bandwidth of the host can be reduced by up to 40% due to the traffic injected by the PMCA.

2.3.2 MiBench

The MiBench embedded benchmark suite [59] includes several programs from different application domains such as automotive, consumer, network, office, security and telecommunications. The benchmarks were run on the host processor with different levels of memory traffic injected by the DMAs. The program execution times for the various experiments were measured, averaged and normalized to the baseline (DMAs off). Fig. 2.3 shows three bar groups for every benchmark. The left bar group corresponds to the case where the host executes the corresponding benchmark only. In this case, the average program execution time is affected by less than 5% for most of the benchmarks, even if the DMAs are injecting traffic into the DRAM at the maximum rate. There are two main reasons for this. First, those benchmarks generate only a small number of cache misses [59]. Second, memory transactions issued by the host are commonly handled with higher priority by the DRAM controller than transactions issued by the accelerator [55, 58]. However, it has to be noted that, even when executing in isolation, some benchmarks are significantly impacted by the background DMA traffic. The most notable ones are qsort, typeset and stringsearch, which exhibit slowdowns of up to 50%. In a more realistic scenario, the multi-core host processor runs multiple applications in parallel. The caches are then shared between the individual applications, and the number of cache misses (and hence the DRAM traffic of each individual application) increases. To quantify the impact of the PMCA's traffic on shared memory in such a scenario, the benchmarks were run on the first core while, at the same time, we enforced a specific L2 cache miss rate by running the synthetic benchmark (see Sec. 2.3.1) on the second core. The results for this scenario with an L2 cache miss rate of 5% and 11% are visualized by the middle and right bar group for every benchmark in Fig. 2.3, respectively. It is evident that even with a relatively low L2 cache miss rate, the traffic injected by the PMCA DMAs has a much higher impact on the execution time of the benchmarks. On average, the execution time of the MiBench benchmarks increases by 32% if the PMCA DMAs inject maximum traffic into the loaded system, with peaks of up to 208% (for qsort).

Figure 2.3: Normalized average execution times for MiBench programs for different levels of DMA traffic injected by the PMCA.

2.3.3 ALPBench

While the MiBench benchmarks represent a heterogeneous set of applications specifically targeted at embedded systems, they also have a number of limitations, which make them not entirely representative of the complex applications running on modern embedded host processors. First, they are sequential (single-threaded) programs. Second, they are quite small, both in terms of code and data footprint, which overall leads to very small cache miss rates. In contrast, the ALPBench benchmark suite [60] consists of a set of parallelized, complex media applications gathered from various sources and modified to expose thread-level and data-level parallelism. These benchmarks are representative of computation-intensive, multi-threaded workloads for high-end symmetric multiprocessing (SMP) hosts in HESoCs. The considered benchmarks are:

• FaceRec: Face recognition, derived from CSU Face Recognizer;

• SpeechRec: Speech recognition, derived from CMU Sphinx 3.3;

• MPGdec: MPEG-2 decode and

• MPGenc: MPEG-2 encode, both derived from MSSG MPEG-2 decoder/encoder.

A setup similar to the previous section was considered here. The main difference is that we did not have to consider "dummy" or synthetic interfering traffic on the second core, as the parallelization automatically distributes threads over both processor cores. Fig. 2.4 shows the results for this experiment. Even in the absence of interfering host-generated background traffic, for such large workloads, the DMA operations issued by the PMCA on the main DRAM can lead to up to 45% longer execution times (25% on average).

Figure 2.4: Normalized execution time for ALPBench parallelized over two host cores for different levels of DMA traffic injection.

2.3.4 Collaborative Workloads

In modern heterogeneous SoCs, workloads are typically executed in a collaborative manner between host and accelerator. The application is started on the host processor, and those parts that are sequential in nature (or have limited parallelism) are executed there.
When a point in the program is reached where high degrees of parallelism are available, an offload operation transfers the execution control to the accelerator [61]. The offload operation can be synchronous, i.e., the host processor is stalled while waiting for the PMCA to complete its part of the execution, but asynchronous schemes are also possible, where the host is allowed to execute code independent of the results of the offload. The latter case is clearly the most efficient from the point of view of resource exploitation, but it is not always possible to structure the application code so as to hide the PMCA latency on the host side. For applications based on continuous repetitions of the same operations over a stream of data, e.g., image and audio stream coding/decoding and filtering, it is always possible to structure the code as a functional processing pipeline. Fig. 2.5 illustrates this concept. Part of the stages of this pipeline, i.e., filters to be applied to the current frame or image, can be executed on the host, and the remaining stages are executed on the PMCA. It is therefore possible to implement a sort of buffering technique in which, while the PMCA is busy processing Frame N, the host can make progress applying a different processing kernel to Frame N+1, rather than waiting for the PMCA to complete.

An offload operation typically consists of a set of steps [61], each of which is subject to memory sharing-induced interference. First, information about code (binary) and data for the PMCA is gathered from the executing context of the host. The code itself may be compiled on-the-fly for the target PMCA, or simply linked and loaded as required.


Figure 2.5: Pipelined execution of collaborative workloads.
Then, depending on the underlying memory architecture, the host communicates code and data to the PMCA. In distributed memory systems, this communication translates into a real copy from one memory space to another. In shared memory systems, it is possible to avoid the copies and pass only pointers. However, if the PMCA does not support SVM, the host is still required to copy data from its original place in paged virtual memory into a physically contiguous, non-paged memory section. In both cases, the offload operation implies read/write operations on the shared main memory, and thus it contributes to overall DRAM bandwidth sharing. As such, it is subject to performance degradation when many sharers are present in the system. The corresponding steps are denoted by Offload - Out and Offload - In in Fig. 2.5. The second operation that is subject to interference from main memory sharing is the part of the application that is executed on the host (denoted by Host Kernels in Fig. 2.5). Its sensitivity to main memory interference clearly depends on the communication-to-computation ratio (CCR) exhibited (an inherent property of the application code) and on the L2 cache miss rate (a property of the application and of the system features and global workload). The third operation that contributes to main memory sharing is the DMA transfers issued by the offloaded PMCA kernel. Local memory references inside the PMCA kernel are satisfied from the private L1 SPMs and do not have an impact on DRAM bandwidth sharing. Fig. 2.5 also shows that this phase (denoted by PMCA Kernels) is typically organized to operate on a sequence of image stripes. This allows implementing double-buffering schemes for the DMA transfers to hide their latency: the first stripe is copied into the private L1 SPM, then the PMCA PEs can start computation. At the same time, the DMA is used to prefetch the second stripe, which allows execution to continue without stalls.

Figure 2.6: Schematic pipelines of the collaborative workloads.

We explore the effect of memory sharing interference for three collaborative workloads, parallelized with the above-described pipelining scheme as visualized in Fig. 2.6:

• Removed object detection (ROD): A normalized cross-correlation (NCC)-based algorithm used to detect abandoned and/or removed objects in security-critical areas. After offloading NCC to the PMCA, the host executes the erode and label stages, as well as the computation and classification of regions of interest (ROIs).

• Color-based tracking (CT): A color structure code (CSC)-based algorithm for color segmentation and tracking in image streams. The host offloads the CSC stage to the PMCA before executing the line and merge stage.

• Motion JPEG (MJPEG): Video decoding. The Huffman decoding is executed on the host while the PMCA performs the dequantization and inverse discrete cosine transform (IDCT).

If multiple applications are running on the host at the same time (a common situation in modern high-end embedded systems), the L2 cache miss rate is bound to increase. We explore this effect by running the synthetic benchmark described in Sec. 2.3.1 as interfering traffic in parallel with the collaborative applications on the host. Fig. 2.7 shows the average execution times for Offload, Host Kernels and PMCA Kernels for increasing L2 cache miss rates, normalized to the synchronous, un-pipelined case without any memory sharing-induced interference. Note that, although the individual components can slow down with respect to the baseline, the total execution time for pipelined execution is still up to 28%, 7%, and 2% faster for ROD, CT and MJPEG, respectively.

Figure 2.7: Normalized average execution time of collaborative workload components for different levels of host-side interference.
The results show that:

1. When the target collaborative workload is the only application running, main memory sharing does not lead to significant delays for Offload or Host Kernels. PMCA Kernels experiences a 5% delay for ROD and MJPEG (dark bars in Fig. 2.7 a) and c)).

2. Additional traffic on the host leads to an increased L2 cache miss rate and causes all the components to experience significant delay (up to 20%).

3. Host Kernels for MJPEG does not exhibit any delay because this phase is compute bound. Most of the interference comes from the DMA-intensive PMCA Kernels and the memory-intensive Offload (Fig. 2.7 c)).

2.4 Summary

In this chapter, we have evaluated the impact of memory sharing interference between a multi-core host processor and a PMCA within a HESoC. Our results show that the effective memory bandwidth and performance of both the host and the PMCA can be reduced significantly due to interference on the shared memory system, especially if the system is heavily loaded.

The extent of the performance degradation depends on the level of traffic injected by the PMCA DMAs as well as on the L2 cache miss rate of the host. For a set of real applications from the MiBench embedded benchmark suite [59], the performance degrades by only around 2.5% when running in isolation on a single host core, with peaks of up to 50%. In more realistic scenarios, where the multi-core host processor executes additional applications in parallel, the system is more susceptible to the memory interference due to the PMCA. In this case, the execution time of some benchmarks increases by up to 208% due to memory interference. Multi-threaded applications, such as the benchmarks from the ALPBench benchmark suite [60], slow down by up to 45% due to the interference generated by the PMCA, even without additional background traffic on the host side. For collaborative workloads executed in a functional software pipeline, our results show that additional host-side interference can increase the execution time of the individual pipeline stages by up to 20%. At the interconnect level, the problem could be mitigated by adjusting the priority in the DMA versus host L2 cache refill bandwidth share or by using bandwidth allocation mechanisms as supported by modern cache-coherent, system-level interconnect IP cores [62]. Another solution could be that of adding dynamic bandwidth readjustment capabilities linked to the application behavior, e.g., by up-regulating the bandwidth for the PMCA when executing critical kernels [63] or for the host when many L2 cache misses are seen. However, such techniques can only mitigate interference effects through arbitration. To effectively reduce the required memory bandwidth, unnecessary memory accesses and costly memory-to-memory copies, e.g., due to the host preparing data for the PMCA during an offload phase, need to be avoided. To this end, the PMCA must have support for SVM. SVM not only helps to reduce shared memory interference by avoiding the need for costly memory-to-memory copies, but also by paving the way for letting the PMCA access the most recent data copies directly from the data caches of the host instead of the main DRAM. In addition, SVM enables sharing virtual address pointers and thus allows for higher offload performance and, most importantly, substantially improved programmability.

Chapter 3

Lightweight Shared Virtual Memory

To effectively reduce the memory bandwidth requirements in shared memory systems, unnecessary memory accesses and costly memory-to-memory copies, for example during preparation phases for offloads to accelerators, need to be avoided. To this end, accelerators must have support for shared virtual memory (SVM), i.e., they must be able to access data buffers in shared memory based on virtual address pointers passed by the host. Besides reducing memory bandwidth requirements, SVM allows for reduced offload latency and paves the way for letting accelerators access the most recent data copies directly from the data caches of the host, which can further reduce offload latency by avoiding the need for cache flushes. But most importantly, SVM greatly simplifies the programming of heterogeneous systems, which is why it is endorsed by initiatives such as the HSA Foundation [1]. In this chapter, we present a lightweight, mixed hardware-software scheme enabling zero-copy SVM for HESoCs. It is suitable for a wide range of applications enabled by today's state-of-the-art heterogeneous systems and the associated programming paradigms, which focus on data-parallel accelerator models. Under these paradigms, offload-based applications are typically structured as a set of regular loops with predictable trip counts and regular, streaming-type access

patterns to main memory, and run for a very large number of fully independent iterations. These properties can be exploited to enable efficient SVM at low hardware complexity. Our design is based on a kernel-level driver module running on the host which, at offload time, pins the shared pages in memory and generates a table containing all the virtual-to-physical address mappings. This table is then used by the same driver to set up the input/output translation lookaside buffer (IOTLB) hardware according to update requests sent by the accelerator via interrupt. Using an IOTLB double-buffering scheme allows hiding the interrupt latency. In addition, the IOTLB can be set up in advance, i.e., without IOTLB misses ever happening and, in the ideal case, without the accelerator ever being stalled. For highly memory-bound kernels with very low operational intensities, the granularity of the updating mechanism can be reduced to trade off hardware resources (IOTLB size) against performance. We start with a motivating example for the presented scheme in Sec. 3.1. Sec. 3.2 discusses related work. The proposed technique and its implementation are described in detail in Sec. 3.3. Experimental results, including an evaluation of the minimal operational intensity required to still guarantee maximum PMCA performance, are presented in Sec. 3.4.

3.1 Motivating Example

Today's state-of-the-art heterogeneous systems and the associated programming paradigms focus on a data-parallel accelerator model. Offloaded application kernels are usually structured as a set of regular loops with predictable trip counts and memory access patterns, and run for a very large number of fully independent iterations. A typical example of such a kernel is presented in Fig. 3.1. This kernel generates an output image of half the width of the input image by averaging adjacent columns. For improved data and computation locality, such kernels can be split into multiple image stripes, which allows the accelerator to exploit its internal memory hierarchy by only keeping the currently processed image stripes in its fast L1 SPMs. Using double-buffered DMA transfers, the memory copies between the L1 SPMs and shared main memory can further be overlapped with the actual computation to hide the high latency of main memory accesses, which makes such processing schemes highly efficient. Moreover, the access patterns to shared main memory are flexible, controllable and known at offload time, which can be exploited for efficient SVM.

Figure 3.1: Example application kernel to be offloaded to a PMCA.
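For reference, the body of the example kernel in Fig. 3.1 corresponds to a loop of the following form; this is a minimal C sketch of the column-averaging operation (the image dimensions and the 8-bit pixel type are assumptions for illustration), not the exact code shown in the figure.

#include <stdint.h>

/* Halve the image width by averaging each pair of adjacent columns.
 * WIDTH and HEIGHT are placeholders; in the offloaded version the loop
 * over rows would be split into stripes processed out of the L1 SPM. */
#define WIDTH  640
#define HEIGHT 480

void halve_width(const uint8_t in[HEIGHT][WIDTH],
                 uint8_t out[HEIGHT][WIDTH / 2])
{
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH / 2; x++) {
            /* Every output pixel is the average of two adjacent input
             * pixels; all iterations are fully independent. */
            out[y][x] = (uint8_t)((in[y][2 * x] + in[y][2 * x + 1]) / 2);
        }
    }
}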

To reduce the design complexity, area and power consumption compared to a full-fledged hardware IOMMU as found in high-end and HPC systems, we propose a lightweight, mixed hardware-software solution for SVM support that is based on an IOTLB managed by a kernel-level driver running on the host processor. Knowing the access pattern to shared main memory allows 1) preparing the shared memory pages at offload time, i.e., when preparing the offload to the accelerator, 2) updating the IOTLB ahead of time based on the progression of the accelerator, and 3) overlapping the reconfiguration of the IOTLB entries with actual computation and DMA transfers between shared main memory and the L1 SPMs without stalling kernel execution on the accelerator. Fig. 3.2 shows how the computation of such a kernel can be mapped to a PMCA, including the IOTLB configuration required at the beginning of each execution phase.

Figure 3.2: Computation, memory access pattern and IOTLB configuration of a typical data-parallel accelerator kernel split into stripes.
At the beginning of Phase N, the PMCA sets up the DMA transfers to copy out the data just produced (Stripe N-1 Out) and to copy in the input data for Phase N+1 (Stripe N+1 In). Thus, it must only be ensured that the IOTLB contains the virtual-to-physical address translations for Stripe N-1 Out and Stripe N+1 In while the PMCA is in Phase N and performs the computations for Stripe N. By sending an interrupt to the host at the beginning of every execution phase, the state of the PMCA execution can be tracked. The host then starts to reconfigure the IOTLB entries used in the previous Phase N-1 for the next Phase N+1 while the PMCA is in Phase N. The reconfiguration needs to be done at the latest at the beginning of Phase N+1. Otherwise, the PMCA must wait for the IOTLB to be updated before it can set up the next DMA transfers, Stripe N Out and Stripe N+2 In. This double-buffering scheme for the IOTLB allows overlapping the interrupt latency and the IOTLB reconfiguration with actual DMA transfers to and from SVM, similar to the double-buffering scheme for the L1 SPMs, which allows overlapping DMA transfers with computations performed on the L1 SPMs. Also, the granularity of the IOTLB double buffering can be varied to trade off synchronization overhead against hardware resources. The host can set up the IOTLB for multiple stripes per phase to reduce the number of interrupts and the total interrupt latency at the cost of a larger IOTLB with more entries. Alternatively, for highly compute-bound PMCA kernels, the execution phases can be chosen such that every DMA transfer just touches a single memory page, which allows reducing the IOTLB size to just two entries per kernel.
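The resulting accelerator-side control flow can be summarized by the sketch below. All function names are hypothetical placeholders for the corresponding runtime primitives; the sketch only illustrates how DMA setup, the IOTLB update request and the computation of a stripe are interleaved per phase.

/* Hypothetical runtime primitives (placeholders, not a real API). */
void dma_copy_in(int stripe);
void dma_copy_out(int stripe);
void send_iotlb_update_request(int phase);
void wait_iotlb_update_done(int phase);
void compute_stripe(int stripe);

/* Sketch of the per-phase loop executed on the accelerator. Assumes the
 * host has already set up the IOTLB entries for phase 0. */
void process_stripes(int num_stripes)
{
    for (int n = 0; n < num_stripes; n++) {
        /* Entries for Stripe n-1 Out and Stripe n+1 In are already
         * valid, so the DMA transfers can be issued immediately. */
        if (n > 0)
            dma_copy_out(n - 1);            /* write back previous result */
        if (n + 1 < num_stripes)
            dma_copy_in(n + 1);             /* prefetch next input        */

        /* Ask the host to reconfigure the entries used in phase n-1 for
         * phase n+1; the update overlaps with the computation below.    */
        send_iotlb_update_request(n);

        compute_stripe(n);                  /* work on data in the L1 SPM */

        /* Only now does the accelerator need the new entries; in the
         * ideal case the host has long finished the update.             */
        wait_iotlb_update_done(n);
    }
}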

3.2 Related Work

Undoubtedly, SVM eases program writing and compiler implementations for offload techniques. Today's implementations found in GPGPU-based, heterogeneous, high-end SoCs [6, 7] and HPC systems using discrete GPGPUs [10, 11] or FPGAs [2–4] rely on dedicated hardware blocks such as IOMMUs tightly integrated into the accelerator architecture and on suitable software frameworks to allow the accelerator to handle addresses in paged virtual memory. The situation is different in the context of embedded systems. On one hand, it is doubtful whether the costly hardware solutions from the HPC domain are also suitable for embedded systems with much tighter area and power constraints. On the other hand, while next-generation, high-end HESoCs [27] indeed feature full-fledged IOMMU hardware [28], the available software stacks limit the use of the hardware to protecting the host from malicious or faulty DMA devices and drivers (see Sec. 1.3). The hardware is configured by the host on a per-DMA-transfer basis alongside setting up the actual transfer through the DMA API of Linux [29]. This means that the accelerator must strictly rely on the host for data management, which is not sufficient to enable efficient data sharing between host and accelerator. Besides designing full-featured hardware IOMMUs targeting full accelerator virtualization with fast context switching in embedded, high-performance SoCs [64], the research community has also come up with different and more lightweight proposals for SVM in HESoCs that allow the accelerator itself to initiate the DMA transfers [65–67]. However, the proposed designs still rely on the host to prepare a set of DMA descriptors and/or dedicated page tables at offload time. Moreover, to avoid excessive overheads at accelerator run time, e.g., due to translation lookaside buffer (TLB) reconfiguration, additional, dedicated hardware is introduced. In addition, specialized memory allocators and/or physically contiguous memory must be used, and the range of addresses accessed by the accelerator must be known to the host at offload time. Nevertheless, these designs are suitable for a wide range of applications enabled by today's state-of-the-art heterogeneous systems and the associated programming paradigms focusing on a data-parallel accelerator model. Under these paradigms, offload-based applications are typically structured as a set of regular loops with predictable trip counts and memory access patterns, and run for a very large number of fully independent iterations. The same characteristic is already exploited to implement efficient data transfers into the L1 SPMs of the accelerator's computing clusters using double-buffered DMA transfers. Several compiler-level techniques have been proposed to automatically infer an optimal tiling of the parallel workload and the data accessed therein [20, 68]. Similar to other proposals by the research community [65–67], our design presented in this chapter relies on the host to pin the shared memory pages and to create an address remapping table at offload time. However, our design does not rely on physically contiguous memory or specialized memory allocators, which eases programmability. Instead of adding dedicated hardware to reconfigure the IOTLB upon misses based on the table generated by the host, our design uses an IOTLB double-buffering scheme to let the host efficiently perform the management without IOTLB misses ever happening, which helps to reduce design complexity.
Efficient TLB management is also key to circumventing the TLB bottleneck faced by the MMUs of modern multi-core processors. In contrast to the IOTLB double-buffering scheme presented in this chapter, the TLB prefetching mechanisms typically employed in such designs [69, 70] need to be of low complexity to be efficiently implemented in hardware. They cannot extract information about the memory access patterns at the application level ahead of time, which is key for our design.

3.3 Infrastructure

This section describes our evaluation platform, i.e., our embodiment of the target hardware architecture template presented in Sec. 1.1, and the software we developed to enable lightweight virtual memory support for the PMCA. Our evaluation platform is based on the Xilinx Zynq-7000 All-Programmable SoC [55]. Fig. 3.3 gives an overview of the platform. The Zynq can be divided into two main parts used to implement the HESoC.

3.3.1 Host

The PS features interconnects, peripherals, a DRAM controller, and the APU with a dual-core ARM Cortex-A9 CPU at its heart. Each core has separate L1 instruction and data caches with a size of 32 KiB each. Further, the APU offers 512 KiB of shared L2 instruction and data cache that connects to the high-priority port of the DRAM controller. Our evaluation platform uses the PS to implement the host of the HESoC and is running Xilinx Linux 3.13. 3.3. INFRASTRUCTURE 39

Host PMCA

Cluster L1 SPM L1 SPM L1 SPM A9 Core 0 A9 Core 1 L2 DMA Bank 0 Bank 1 Bank M-1 SPM MMU MMU X-Bar Interconnect L1 I$ L1 D$ L1 I$ L1 D$

Mailbox Cluster Bus

Coherent Interconnect SoC Bus PE PE PE #N-1

#0 L1 I$ #1 L1 I$ L1 I$ System Periph Interc Periph DMA L2 $ RAB Periphs Instruction Bus IRQs

32-bit AXI GP0 64-bit AXI HP0 DDR DRAM Controller Remapping Address Block (RAB) managed by kernel-level driver module on host for lightweight virtual memory support.

DDR DRAM Xilinx Zynq-7000 All-Programmable SoC

Figure 3.3: The evaluation platform with host and PMCA implemented in the PS and PL of the Xilinx Zynq SoC, respectively.

3.3.2 PMCA

The PL consists of a Xilinx Kintex-7 FPGA, which is used to implement PULP: a PMCA developed as an application-specific integrated circuit (ASIC) for Parallel Ultra-Low Power Processing [71]. To overcome scalability limitations, PULP leverages a multi-cluster design. The PEs within a cluster feature 8 KiB of private L1 instruction cache and share 40 KiB of L1 data SPM. The L1 SPM banks are connected to the PEs through a low-latency crossbar interconnect with a word-level interleaving scheme to minimize access contention. Ideally, every PE can access one word in the L1 SPM per cycle. The L1 SPMs of all clusters as well as the globally shared L2 SPM with a size of 64 KiB are mapped into a global, physical address space, meaning that the PEs can also access data in other L1 SPMs, albeit with a higher latency. Every cluster features a lightweight, low-programming-latency, multi-channel DMA engine, which allows for fast and flexible movement of data between the L1 and L2 SPMs or shared DRAM [72]. PULP is attached to the host as a memory-mapped device and controlled by a kernel-level driver module and a user-space runtime as described in Sec. 3.3.4.

3.3.3 Remapping Address Block

The host and the PMCA support interrupt-driven communication using a mailbox as well as communication through SVM. To this end, the PMCA features a multi-ported remapping address block (RAB), which connects the SoC bus of PULP with the DRAM controller of the host as shown in Fig. 3.3. The RAB is basically a software-managed IOTLB, which is used by the PMCA to translate the virtual host addresses of outgoing memory transactions to physical addresses. As opposed to a full-fledged IOMMU [28, 64], the RAB does not feature dedicated prefetching and page table walk (PTW) hardware to guess the virtual addresses of upcoming transactions and to do the corresponding virtual-to-physical address translation ahead of time or in case of an IOTLB miss, respectively. Doing without prefetching and PTW hardware and the associated caches often used to speed up these operations allows us to substantially reduce the design complexity of the RAB, which is key for embedded systems. The setup of the IOTLB is done completely in software by the kernel-level driver module and a user-space runtime running on the host as described in Sec. 3.3.4. Software-based IOTLB management can be substantially slower compared to a hardware solution. However, many application kernels offloaded to PMCAs offer highly flexible and controllable access patterns to shared main memory. In fact, the addresses in main memory as well as the order in which these are accessed are known at offload time in many cases. Exploiting this information can make software-managed IOTLBs highly attractive solutions to enable lightweight SVM for many-core accelerators in HESoCs. Fig. 3.4 shows a schematic of the RAB. It supports multiple ports, each of which comprises private AXI4 master and slave interfaces, and a private IOTLB with a parameterizable size implemented using a fully-associative content-addressable memory (CAM). The selected RAB design is optimized for low latency and high flexibility. As opposed to a conventional IOTLB, each entry of the RAB, which we define as a RAB slice, can hold one arbitrarily-sized mapping, independent of the page size of the host. To this end, besides the virtual and the physical start address and the protection flags, also the virtual end address needs to be stored, which increases the size of the CAM.

Figure 3.4: Schematic of the RAB with two ports.

To set up the slices of the individual RAB ports, the host uses a shared AXI4-Lite configuration interface. If a new memory transaction arrives at an AXI4 slave interface, the requested virtual address as well as the transaction size are fed to the corresponding IOTLB to check whether one of the slices holds a valid configuration for the transaction. If this is the case, the IOTLB returns the corresponding physical address, which is then used to issue the translated memory transaction at the master interface. If no valid configuration is found or if the protection flags do not allow the requested transaction, the transaction is simply dropped and an interrupt is sent to the host. The CAM itself has a look-up latency of one clock cycle and the RAB can process one request per cycle per port.
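As a purely behavioral illustration of this lookup (the real RAB checks all slices in parallel in a single-cycle CAM, whereas the sketch iterates sequentially), a RAB slice and its match-and-translate step can be modeled as follows; the field names and flag encoding are assumptions, not the actual hardware description.

#include <stdbool.h>
#include <stdint.h>

/* Behavioral model of a RAB slice: one arbitrarily-sized mapping with
 * virtual start/end, physical start and protection flags. */
typedef struct {
    uint32_t va_start;   /* first virtual address covered  */
    uint32_t va_end;     /* last virtual address covered   */
    uint32_t pa_start;   /* corresponding physical address */
    bool     valid;
    bool     read_ok;
    bool     write_ok;
} rab_slice_t;

/* Check all slices of a port; return true and the translated address on
 * a hit, false (transaction dropped + miss interrupt in hardware) otherwise. */
bool rab_translate(const rab_slice_t *slices, int num_slices,
                   uint32_t va, uint32_t size, bool is_write,
                   uint32_t *pa)
{
    for (int i = 0; i < num_slices; i++) {
        const rab_slice_t *s = &slices[i];
        bool in_range = s->valid &&
                        va >= s->va_start &&
                        va + size - 1 <= s->va_end;
        bool allowed  = is_write ? s->write_ok : s->read_ok;
        if (in_range && allowed) {
            *pa = s->pa_start + (va - s->va_start);
            return true;   /* translated transaction issued at master */
        }
    }
    return false;          /* miss: drop and raise IRQ to the host    */
}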

3.3.4 Software

The RAB is managed completely in software by a kernel-level driver module interfaced by the offload runtime running on the host. Moving the management of host and accelerator memory coherency from the application developer's hands down to the low-level accelerator runtime and kernel-level driver allows for improved programmability. The application developer can then focus on writing the heterogeneous application and specify the parts to be offloaded to the PMCA using, e.g., OpenMP offload directives [61]. How the different software layers interact with each other is shown in the following. The labels refer to different time instants marked in Fig. 3.5.

1 Initialization: When the heterogeneous application is started on the host, it must first reserve those addresses in its own virtual address space that overlap with the physical address space of the PMCA. This step is required to make sure that the host never passes a virtual address pointer to the PMCA that overlaps with its own address space, since the PMCA would not route an access to such a virtual address through the RAB but instead route it internally, e.g., to its internal memories or memory-mapped configuration registers. To this end, the runtime uses an mmap() system call with the MAP_FIXED, MAP_ANONYMOUS and PROT_NONE flags to get exactly the specified address segment, to not back the mapping with any physical memory, and to not contribute to the overcommit limit of the kernel. Further, the application calls the kernel-level driver to map control registers and the mailbox to user-space addresses using mmap() system calls, which allows for efficient host runtime-to-PMCA communication without context switches.

2 PMCA Start: The runtime opens the PMCA binary and passes its address to the kernel-level driver using an ioctl() system call. The driver then locks the pages containing the PMCA binary in memory using the get_user_pages() function, does the virtual-to-physical address translation, flushes the pages from the caches, sets up the DMA transaction descriptors and passes them together with callback parameters to the Linux DMA API. Based on the transaction descriptors, the Linux DMA API then interfaces the hardware driver of the system DMA engine of the Zynq to start copying the binary to the L2 SPM of the PMCA. The callback parameters are used to unlock the user-space pages after the DMA transfers have finished, which allows the driver to return control to the runtime right after calling the Linux DMA API.

3 Offload - Runtime: When scheduling an offload, the runtime calls the driver to do the virtual-to-physical address translation and to set up the RAB, since user-space applications have no knowledge of physical memory addresses at all. Based on the size of the shared data elements and the PMCA resources, a different RAB setup mechanism is used. For data elements fitting into the L1 SPM of the PMCA, the runtime tries to order the elements according to their virtual addresses to minimize the number of context switches and ioctl() calls to the driver for the RAB setup requests. For data elements that do not fit into the L1 SPM, and therefore need to be accessed in a striped manner, the runtime extracts the virtual start and end addresses of the stripes, which are then passed to the driver when requesting the striped remappings using a single ioctl() system call for all striped data elements.
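A minimal sketch of the address-space reservation performed in step 1 could look as follows; the PMCA address range is a placeholder, and MAP_PRIVATE is added only because mmap() requires a sharing flag.

/* Sketch of step 1: reserve the virtual address range that aliases the
 * PMCA's physical address space so the host never hands out pointers
 * from this range. PMCA_BASE and PMCA_SIZE are illustrative values. */
#include <stdio.h>
#include <sys/mman.h>

#define PMCA_BASE ((void *)0x40000000)   /* assumed PMCA address range */
#define PMCA_SIZE (256UL * 1024 * 1024)

int reserve_pmca_range(void)
{
    /* PROT_NONE: no access; MAP_ANONYMOUS: not file-backed;
     * MAP_FIXED: exactly this address segment. */
    void *res = mmap(PMCA_BASE, PMCA_SIZE, PROT_NONE,
                     MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (res == MAP_FAILED) {
        perror("mmap");
        return -1;
    }
    return 0;
}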

4 Offload - Driver: When a new remapping is requested, the kernel-level driver first locks the corresponding user-space pages in memory using get_user_pages(). Second, it does the virtual-to-physical address translation and infers the number of RAB slices required for the remapping by grouping the locked pages into physically contiguous segments. Then, it checks for the availability of RAB slices and, if necessary, frees slices and unlocks the corresponding pages based on time codes provided by the runtime. Next, the remapped pages are flushed from the data caches of the host back to DRAM. Finally, the driver sets up the RAB slices through the configuration port of the RAB, which is remapped to kernel space using ioremap_nocache(). The setup for striped data elements follows a similar pattern, except that the individual steps have to be done separately for every data element. Moreover, after extracting the physical addresses of the locked pages, the driver needs to set up a data structure containing the virtual and physical start addresses of every physically contiguous segment as well as the end addresses of all stripes. For every striped data element, the driver requests twice the maximum number of slices required to remap a stripe for the double buffering of the RAB.

Kernel Execution: After setting up the RAB, the runtime starts the kernel execution on the PMCA by passing the virtual address pointers of the shared data elements as well as configuration parameters to the PMCA using the mailbox (5). The PMCA can then take over control and set up the DMA transfers to load the shared data elements or their first stripes into its L1 SPM. As shown in Fig. 3.5 a), whenever the PMCA starts a new compute phase, it issues a RAB update request to the host using the mailbox (6). This triggers an interrupt in the host. Since the slices required for the DMA transfers of the current phase are already set up, the PMCA does not have to wait for a response from the host. It can set up the DMA transfers and continue kernel execution.

7 RAB Updating: The OS kernel running on the host registers the interrupt, preempts the currently running task and schedules the interrupt handler of the kernel-level driver module to run and handle the interrupt. Based on the data structures previously set up when handling the remapping request and on the requested update type, the interrupt handler can then set up the slices for the next phase.

Figure 3.5: Memory access pattern and RAB configuration for different granularities of the RAB double-buffering scheme: one stripe per phase a), two stripes per phase for higher performance b), and one DMA transfer per phase for a smaller RAB configuration c).

Besides simply proceeding to the next phase, the PMCA can also request other RAB update types, e.g., to start all over with one shared data element, which allows for more complex memory access patterns. For every shared data element, the interrupt handler first deactivates the slices remapping the input buffers of the current processing Phase N (the data has already been copied to the L1 SPM of the PMCA), and sets up the slices which are going to be used by the PMCA DMA during the next Phase N+1, i.e., to copy in the data for Phase N+2. This double-buffering scheme allows overlapping the interrupt latency and the reconfiguration of the RAB with actual DMA transfers to and from SVM. Its granularity can be varied to trade off synchronization overhead against hardware resources. For example, the host can set up the RAB for multiple stripes per phase to reduce the total interrupt latency at the cost of a larger RAB with more slices. This can be beneficial for PMCA kernels that are highly memory bound; this case is shown in Fig. 3.5 b). Similarly, for highly compute-bound PMCA kernels, the execution phases can be chosen such that every DMA transfer just touches a single memory page. This allows reducing the number of RAB slices to just two slices per kernel at the cost of more update requests, as visualized in Fig. 3.5 c). When the RAB has been updated, a signal is sent to the PMCA using the mailbox, which is also remapped to kernel space using ioremap_nocache(). Before the PMCA proceeds to the next phase, it waits until the mailbox contains the RAB update confirmation sent by the driver.

8 Cleanup: When the PMCA has finished computation, it informs the runtime running on the host using the mailbox. The runtime then again uses an ioctl() system call to let the driver free the RAB by deactivating the slices, marking modified pages as dirty, and releasing the pages from the page cache using page_cache_release(). Finally, the driver can clean up the data structures created to store the information about the striped data elements.

It is worth noting that, while it could make sense to lock and flush the shared user-space pages to DRAM only at the time they are needed, i.e., when a RAB update is requested to set up a slice remapping a particular page, such solutions are inexpedient. Although this could indeed speed up the offload operation, it heavily affects the RAB update mechanism. More precisely, some of the kernel functions required for locking the user-space pages can sleep and may therefore not be executed in interrupt context. This would require the interrupt handler to schedule the RAB update at a later, safer time, which introduces an additional, unpredictable delay. Instead, the RAB update mechanism needs to be fast, efficient and predictable to guarantee that the PMCA does not need to be stalled.
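To make the double-buffered slice update of step 7 more concrete, the following sketch outlines the work done per update request; the data structures and helper functions are hypothetical placeholders for the driver internals, and the handling of multiple data elements per phase is omitted.

#include <stdint.h>

/* Hypothetical per-stripe mapping prepared at offload time (step 4). */
typedef struct {
    uint32_t va_start, va_end, pa_start;
} stripe_map_t;

/* Placeholder driver internals. */
void rab_slice_write(int slice, uint32_t va_start, uint32_t va_end,
                     uint32_t pa_start, int valid);
void mailbox_send_update_done(void);

/* Called from the interrupt handler on an update request: free the
 * slices of Phase N-1 and program them for Phase N+1 while the PMCA is
 * still computing Phase N. */
void rab_update_phase(const stripe_map_t *stripes, int num_stripes,
                      int phase, int slices_per_phase, int first_slice)
{
    /* The two slice groups are used alternately (double buffering). */
    int group = (phase + 1) & 1;
    int base  = first_slice + group * slices_per_phase;
    int next  = phase + 1;

    for (int i = 0; i < slices_per_phase; i++) {
        int slice = base + i;
        if (next < num_stripes) {
            const stripe_map_t *m = &stripes[next];
            rab_slice_write(slice, m->va_start, m->va_end, m->pa_start, 1);
        } else {
            rab_slice_write(slice, 0, 0, 0, 0);   /* no more stripes */
        }
    }
    mailbox_send_update_done();   /* PMCA may enter Phase N+1 now */
}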

3.4 Experimental Results

Sec. 3.4.1 gives some details of our evaluation platform, including information about the hardware complexity of the RAB. We use a synthetic workload to characterize the performance of our RAB updating mechanism and study the effectiveness of our solution in Sec. 3.4.2. In Sec. 3.4.3, we compare the offload performance of three collaborative workloads, i.e., applications that are started on the host and feature highly parallelizable parts which can be executed on the PMCA, with and without support for SVM, and quantify the performance benefit of SVM in HESoCs.

3.4.1 Experimental Setup

Our evaluation platform is based on the Digilent Mini-ITX development board [73] featuring a Xilinx Zynq-7045 SoC [55]. The host and the PMCA are clocked at 667 MHz and 75 MHz, respectively. The main goal of this platform is to study and evaluate the system-level integration of a PMCA into an HESoC. Hence, we did not optimize the PMCA for the implementation in FPGA logic. In fact, the FPGA implementation should rather be seen as an emulator than as a fully featured accelerator. The PMCA has a single cluster comprising four PEs, which share 40 KiB of L1 SPM. Together with the host, it shares 1 GiB of DDR3-1066 DRAM with a peak transfer rate of 8.5 GB/s. The RAB uses a single port equipped with 32 slices for PMCA-to-host communication.

Tbl. 3.1 gives an overview of the resource utilization of the evaluation platform. In terms of look-up tables (LUTs) and flip-flops (FFs), the resource utilization of the RAB is about 15% of that of a PULP cluster with four PEs. Further, the table lists the resources of a full-featured IOMMU [64]. This IOMMU features a 64-entry TLB and has been implemented on a XC6VLX760 device running at 225 MHz. Note that, while the RAB consumes 52% of the logic resources (LUTs) of the IOMMU, it requires less than 1% of its memory resources and copes without BRAMs. The total resource utilization of the RAB is thus a fraction of the resources used by the full-featured IOMMU. It is worth noting that the CAM of the IOMMU has an access latency of 6 cycles, which allows reducing its complexity. In contrast, the CAM of the RAB has a look-up latency of 1 cycle. Moreover, the RAB allows for mappings of arbitrary size, independent of the page size. Constraining the RAB to page-sized mappings only and relaxing the access latency of the CAM would allow further reducing the area and increasing the maximum clock frequency of the implementation.

Table 3.1: FPGA resource utilization.

Block          LUTs [k]   FFs [k]   BRAM [kbit]
PULP Cluster      41.0      21.0           590
RAB                5.8       3.5             0
IOMMU [64]        11.2        -            408

3.4.2 Synthetic Workload

To verify the feasibility of our lightweight SVM solution, we measured the execution time of the RAB management mechanisms running on the host. This information can be used to derive a minimum operational intensity for offloaded kernels to cope with a given number of RAB slices without stalling the PMCA. To make sure that the PMCA never has to wait for a RAB update, it must be guaranteed that the execution time of the PMCA between requesting two RAB updates is greater than or equal to the time it takes the host to do the RAB update:

\[ t_c \cdot n_{s,u} \geq t_r + t_u \cdot n_{s,u} \tag{3.1} \]

where $t_c$ is the computation time of the PMCA per slice, $n_{s,u}$ is the number of slices to be updated per request, $t_r$ is the response time of the host, i.e., the time it takes until the interrupt handler in the kernel-level driver starts to handle the update request, and $t_u$ is the time it takes the driver to perform the update of one slice. (3.1) can be solved for the number of slices to be updated per request:

\[ n_{s,u} \geq \frac{t_r}{t_c - t_u} . \tag{3.2} \]

In the case of ideal parallel processing, the computation time per slice can be given as

\[ t_c = \frac{n_{cycles,s}}{n_{cores} \cdot f_{clk}} \tag{3.3} \]

where the number of cycles per slice $n_{cycles,s}$ is a measure of the operational intensity of the application of interest, and the number of cores $n_{cores}$ and the clock frequency $f_{clk}$ characterize the PMCA at hand.

Figure 3.6: Average execution time of the RAB management functions for different host load.

Finally, the actual number of required RAB slices $n_s$ is twice the number of slices to be updated per request $n_{s,u}$ due to the double buffering, and can be given as

\[ n_s \geq \frac{2\, t_r}{\frac{n_{cycles,s}}{n_{cores} \cdot f_{clk}} - t_u} . \tag{3.4} \]
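As a quick numerical illustration of (3.4), the snippet below plugs in the worst-case figures measured later in this section (t_r = 21 µs, t_u = 9.3 µs) for a PMCA with 16 PEs running at 450 MHz and prints the resulting slice bound for a few operational intensities; the loop bounds are arbitrary example values.

#include <math.h>
#include <stdio.h>

/* Evaluate the slice bound of (3.4): n_s >= 2*t_r / (t_c - t_u) with
 * t_c = cycles_per_slice / (n_cores * f_clk) as in (3.3). */
static int min_slices(double cycles_per_slice, int n_cores, double f_clk,
                      double t_r, double t_u)
{
    double t_c = cycles_per_slice / (n_cores * f_clk);
    if (t_c <= t_u)
        return -1;              /* kernel too memory-bound: bound undefined */
    int slices = (int)ceil(2.0 * t_r / (t_c - t_u));
    if (slices < 2)
        slices = 2;             /* double buffering needs at least 2 slices */
    return slices;
}

int main(void)
{
    const double t_r = 21e-6;   /* worst-case host response time    */
    const double t_u = 9.3e-6;  /* worst-case time per slice update */
    const int    n_cores = 16;  /* STHorm-like PMCA ...             */
    const double f_clk = 450e6; /* ... running at 450 MHz           */

    for (double c = 1e5; c <= 1e7; c *= 10.0)
        printf("%9.0f cycles/slice -> at least %d slices\n",
               c, min_slices(c, n_cores, f_clk, t_r, t_u));
    return 0;
}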

The response time of the host $t_r$ and the time per slice update $t_u$ mainly depend on the kernel-level driver and the RAB. We measured them using a parameterized synthetic benchmark which sets up the sharing of some data elements. The PMCA, after waiting for some time, resets and starts its clock cycle counter and requests an update of the RAB. Once the driver is scheduled to handle the interrupt, it stops and reads the clock cycle counter inside the PMCA, updates the RAB as requested, and informs the PMCA that the RAB is ready. To measure the performance of the driver under different load and bus utilization conditions of the host, the synthetic benchmark described in Sec. 2.3.1 was used to force a fixed cache miss rate on both host processor cores. The results are shown in Fig. 3.6. The execution time of the functions executed in interrupt context, i.e., the response time $t_r$ and the slice update time $t_u$, is stable even under heavy load and high bus utilization on the host side.

The response time $t_r$ is lowest for a 0% cache miss rate. The fact that it is even lower than when the host is idle (see marker) may be a result of the CPU frequency driver, which reduces the clock frequency of the host to save power under low load. For cache miss rates greater than zero, $t_r$ rapidly increases to its maximum value of around 21 µs. This is probably due to the fact that the host processor is stalled on the cache miss and must wait for the response before the context switch to the interrupt handler can happen. The functions called at offload time to lock the user-space pages containing the shared data elements and to flush the corresponding cache lines back to DRAM heavily depend on the bus utilization and the available DRAM bandwidth. The time to unlock the pages after the PMCA has completed execution as well as the time to do the virtual-to-physical address translation is negligible. In fact, the actual PTW is done when locking the user-space pages.

The data obtained from these measurements can now be used to compute the minimum number of RAB slices for maximum performance on a given PMCA as a function of the operational intensity using (3.4). To study the feasibility of our solution, we evaluated the operational intensity of the following real application kernels:

• Strassen: Multiplication of two 30 × 30 matrices;

• Support vector machines (SVM): a classification algorithm, e.g., used for gesture recognition in wearable electronics [74];

• Histogram of oriented gradients (HOG): a feature descriptor used for detecting humans in computer vision applications [75];

• Normalized cross-correlation (NCC): a kernel used to detect abandoned and/or removed objects in security-critical areas [76];

• Color structure code (CSC): a kernel used for color segmentation and tracking in image streams (based on the standard OpenCV library implementation [77]);

• Dequantization (DEQ) and Inverse discrete cosine transform (IDCT): two kernels used for MJPEG decoding (based on the implementation for MiBench [59]).

Tbl. 3.2 gives the total number of cycles executed by all the PEs of the PMCA together, the amount of data transferred by the DMA engine, and the number of RAB slices used.

Table 3.2: PMCA application kernels.

Kernel       Cycles [M]   Transfer Size [KiB]   Slices   Min. Slices
Strassen        0.6            10.6                12         8
SVM             7.5             2.2                12         2
HOG           146.0            11.0                12         2
NCC            20.8            10.5                12         2
CSC             3.4            12.0                 8         2
DEQ, IDCT       0.9             8.0                 6         2
(All numbers are given on a per-stripe basis.)

The number of slices depends on the host page size, the amount of data to transfer per stripe, and the number of data elements. For example, NCC needs to fetch 4 KiB from two input data elements to produce 2.5 KiB of output data. In most cases, the required or produced data of each element is not page aligned and touches two 4 KiB pages in main memory, which requires up to 6 slices to let the PMCA access 6 memory pages during the computation of one stripe and request a single RAB update per stripe. Because of the double-buffering scheme, this number needs to be doubled to give a total of 12 slices. If the kernel is sufficiently compute-bound, the number of slices can be reduced by requesting multiple RAB updates per stripe and splitting the DMA transfers into segments accessing a single slice only. The minimum number of slices required when employing this technique is given in the last column of the table. While this technique reduces the number of slices for most of the kernels to the absolute minimum of 2, it must always be guaranteed that the time to do all the updates is smaller than the computation time. We estimated the minimum operational intensity for a full-fledged hardware IOMMU as found in some high-end systems (without prefetching) as well as for a software-managed IOMMU without IOTLB double buffering and setup of the shared pages at offload time. To avoid stalling the PMCA in these cases,

t_c ≥ t_dma + t_mh   (3.5)

must be fulfilled. The time t_dma it takes the DMA engine to transfer the data mapped by a slice can be approximated by the time it takes to transfer one 4 KiB page. We approximated the time t_mh it takes to handle a miss in the IOTLB as follows. The hardware IOMMU must first do a PTW to perform the virtual-to-physical address translation. Second, it must make sure that the memory page is flushed out of the caches of the host. To estimate the time a full-fledged IOMMU requires to do the PTW, we used Calibrator [78] to measure the time it takes the host processor to handle a TLB miss (≈ 19 ns). Compared to the time required to flush a memory page out of the caches (≈ 14 µs best case), the time to do the PTW is negligible. The software-managed IOMMU cannot do the PTW itself and instead needs to call the kernel-level driver using an interrupt. The miss-handling time is therefore

t_mh = t_r + t_lock + t_flush + t_u   (3.6)

where t_lock and t_flush are the times required to lock the page and to flush the page out of the caches, respectively. Note that some functions required to lock the pages in memory and to do the cache flushes may sleep and therefore cannot be called from an interrupt context. Instead, the interrupt handler must schedule the real work to be done at a later time, which adds another load-dependent delay to (3.6). As opposed to our evaluation platform, full-featured many-core accelerators offer more PEs that can run at higher clock frequencies. For example, STHorm [15] features 16 PEs per cluster running at up to 450 MHz. Fig. 3.7 shows the minimum number of RAB slices required by our solution (stepped curve) depending on the number of computation cycles per slice, i.e., the operational intensity, for a PMCA with 16 PEs running at 450 MHz with t_r = 21 µs and t_u = 9.3 µs (worst case) according to (3.4). The minimum number of cycles per core per slice to avoid stalling the PMCA without prefetching is denoted by the dashed lines for the hardware IOMMU (HW IOMMU) and the software-managed IOMMU (SW IOMMU) according to (3.5). The figure also shows the application kernels from Tbl. 3.2. The values for the x-coordinate are obtained by dividing the numbers of cycles from the table by the number of PEs (16) and by half the numbers of slices from the table, i.e., the numbers of slices used to transfer the data for one stripe.
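As a cross-check, the x-coordinates plotted in Fig. 3.7 can be recomputed from Tbl. 3.2 as sketched below; the data structure and names are illustrative only.

    #include <stdio.h>

    /* Cycles per core per slice = total cycles / (n_PEs * slices per stripe),
     * where the slices per stripe are half the double-buffered slice count. */
    int main(void)
    {
        const struct { const char *name; double cycles; int slices; } k[] = {
            { "Strassen",  0.6e6, 12 }, { "SVM",       7.5e6, 12 },
            { "HOG",     146.0e6, 12 }, { "NCC",      20.8e6, 12 },
            { "CSC",       3.4e6,  8 }, { "DEQ,IDCT",  0.9e6,  6 },
        };
        const int n_pes = 16;

        for (unsigned i = 0; i < sizeof(k) / sizeof(k[0]); i++)
            printf("%-8s %8.0f cycles per core per slice\n", k[i].name,
                   k[i].cycles / (n_pes * (k[i].slices / 2)));
        return 0;
    }

The resulting values (roughly 6, 19, 53, 78, 216 and 1520 thousand cycles per core per slice) are the coordinates at which the kernels appear in Fig. 3.7.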

[Plot of Fig. 3.7: #RAB slices for maximum performance (y-axis, 0 to 32) vs. operational intensity in 1,000 cycles per core per slice (x-axis, 0 to 25). The stepped curve corresponds to the RAB double buffering (3.4), the dashed lines to the HW IOMMU and SW IOMMU limits (3.5); annotations mark where the granularity can be increased, where it must be decreased to avoid PMCA stalls, and where operational intensity and number of slices are too low (PMCA stalls). Kernel data points: Strassen (6, 12), DEQ/IDCT (19, 6), CSC (53, 8), SVM (78, 12), NCC (216, 12), HOG (1520, 12).]

Figure 3.7: Minimum number of RAB slices for maximum performance, i.e., no accelerator stalls, as a function of the operational intensity for 16 PEs running at 450 MHz.

Kernels located to the left of either of the two dashed lines or below the stepped curve do not offer a sufficiently high operational intensity to allow for execution at full speed with the corresponding SVM solution. It can be seen that for kernels requiring 14 k cycles per core per slice and more, our scheme allows using a minimum of 2 slices by increasing the granularity of the RAB double buffering. This can be achieved by splitting the DMA transfers into segments accessing the same slice and requesting a RAB update after every segment. The last column in Tbl. 3.2 gives the minimum number of slices for each of the kernels in this case. For kernels requiring 23 k cycles per core per slice and more, even a software-managed IOMMU would be fast enough to not stall the PMCA, provided the host is not heavily loaded. However, using a faster PMCA with more cores will bring the applications closer to the critical range where either a full-featured hardware IOMMU or efficient software management as enabled by our design is required to avoid PMCA stalls. For highly memory-bound application kernels like Strassen, our mixed hardware-software solution is even able to outperform full-fledged hardware IOMMUs. For kernels located to the left of or below the stepped curve, our solution can decrease the granularity of the RAB double-buffering scheme. More precisely, by using additional slices to set up several stripes per update request, the interaction and synchronization overheads between host and PMCA can be reduced. For example, a kernel offering 5 k cycles per core per slice and requiring at least 16 slices cannot be executed without stalling the PMCA for updating the RAB. But by simply updating the RAB configuration for two subsequent stripes per update request, which requires doubling the number of slices to 32, it can be guaranteed that the PMCA never has to wait for a RAB reconfiguration to finish and can execute the kernel at maximum speed. With 32 slices, our solution can update the RAB in time for application kernels featuring down to 4,800 cycles per core per slice, which corresponds to an operational intensity of approximately 1.2 operations per core per byte.
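The 5 k-cycles example can be verified directly against the measured t_r and t_u: with 8 slices per stripe, updating one stripe per request is too slow, whereas grouping two stripes per request satisfies the no-stall condition. The following sketch performs that check (the function name is illustrative; the 8-slices-per-stripe figure is taken from the example above).

    #include <stdio.h>

    /* No-stall condition when n slices are updated per request: the compute
     * time covered by those n slices must be at least t_r + n * t_u. */
    static int no_stall(int n_slices, double cyc_per_core_per_slice,
                        double f_clk_hz, double t_r, double t_u)
    {
        double t_compute = n_slices * cyc_per_core_per_slice / f_clk_hz;
        return t_compute >= t_r + n_slices * t_u;
    }

    int main(void)
    {
        /* 5,000 cycles per core per slice, 450 MHz,
         * t_r = 21 us, t_u = 9.3 us (worst case). */
        printf("1 stripe  (8 slices)  per request: %s\n",
               no_stall(8, 5000, 450e6, 21e-6, 9.3e-6) ? "no stalls" : "stalls");
        printf("2 stripes (16 slices) per request: %s\n",
               no_stall(16, 5000, 450e6, 21e-6, 9.3e-6) ? "no stalls" : "stalls");
        return 0;
    }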

3.4.3 Collaborative Workloads

In modern HESoCs, workloads are typically executed in a collaborative manner between the host and the accelerator. In this configuration, the application is started on the host, which mainly executes sequential parts or parts that offer limited parallelism only. When a point in the program is reached where high degrees of parallelism are available, an offload operation transfers the execution control to the accelerator [61]. If the PMCA has no support for SVM, the host cannot simply pass virtual address pointers to the PMCA. Instead, it needs to copy the shared data elements from their original place in paged virtual memory into a physically contiguous, nonpaged, and uncached address range. After the PMCA has finished execution, the results of the computation need to be copied back into the cached virtual memory. To quantify the benefits of virtual-memory-enabled, zero-copy offloading for PMCAs in HESoCs, we have implemented the three collaborative workloads presented in Sec. 2.3.4 on our evaluation platform. We measured the execution time of the offload operation on our evaluation platform both with and without SVM support. Fig. 3.8 shows the average offload time per frame for the three applications. It can be seen that our mixed hardware-software solution for lightweight SVM speeds up the offload operation by a factor of 2.1, 2.7 and 3.4 for ROD, CT and MJPEG, respectively.

[Bar chart of Fig. 3.8: per-frame average offload time in ms (0 to 8) for ROD, CT and MJPEG, comparing copy-based shared memory with shared virtual memory; speedups of 2.1x, 2.7x and 3.4x, respectively.]

Figure 3.8: Average offload time per frame with and without SVM support. Refer to Sec. 2.3.4 for a description of the applications.

3.5 Summary

In this chapter, we have presented a mixed hardware-software solution to enable lightweight SVM support for simplified programmability and improved offload performance in HESoCs. It speeds up the offload operation for real heterogeneous applications by a factor of up to 3.4. Using an IOTLB with 32 entries, our solution allows a PMCA with 16 PEs running at 450 MHz to execute application kernels with operational intensities down to 1.2 operations per core per byte at maximum speed, i.e., without stalling the PMCA. Our solution is based on a simple hardware module, the RAB, efficiently managed by a kernel-level driver module and a user-space runtime running on the host. Thanks to the presented IOTLB double-buffering scheme, our solution overlaps the host interrupt latency and IOTLB reconfiguration with actual data transfers and accelerator execution. By adjusting the granularity of the double-buffering scheme, our solution can be tuned to the operational intensity of the targeted application kernels and to trade off hardware resources and synchronization overhead. For example, it can be made to cope with a minimum of two IOTLB entries for compute-bound kernels. For memory-bound kernels, a larger IOTLB can be combined with a reduced granularity of the double-buffering scheme to avoid stalling the accelerator, which even makes it possible to outperform full-fledged hardware IOMMUs at a fraction of their resource cost.

To further improve performance, the design can be extended to support coherent interconnects such as the Accelerator Coherency Port (ACP) of modern HESoCs, which allows the PMCA to access the most recent data copies directly from the caches of the host. On the host side, this eliminates the need for cache flushes, which are the dominant part of the offload operation. In addition, after locking the shared memory pages and generating the update table, the host could move the management of the IOTLB to the PMCA itself to reduce the synchronization overhead and further reduce the minimum operational intensity required for maximum performance. The presented design is suitable for regular, streaming-type access patterns to shared main memory as exhibited by many of today's application kernels typically offloaded to data-parallel accelerators. However, for applications operating on pointer-rich data structures, for which SVM would tremendously ease the implementation on HESoCs, the design is not applicable. On one hand, the access patterns of such applications to shared main memory are typically irregular, (input-)data dependent, and thus not known at offload time. On the other hand, the shared data structures are typically large. Extensively locking all the shared memory pages when performing an offload not only leads to a considerable offload-time overhead, but is also potentially wasteful, e.g., if just a small part of the shared data structure is actually accessed by the accelerator. The design of an SVM system for accelerators suitable for this type of workload is addressed in the remaining chapters of this thesis.

Chapter 4

Sharing Pointer-Rich Data Structures

The lightweight SVM solution presented in Chap. 3 as well as other designs proposed by the research community [65–67] are suitable for many classes of applications exhibiting regular memory access patterns, for which workload partitioning strategies that are amenable to DMA transfers can be identified relatively easily. However, other types of applications adopt completely irregular memory access patterns based on the traversal of complex data structures such as trees and linked lists, which imply data-dependent accesses to shared memory that are often impossible to predict statically. Still, such applications can offer high degrees of parallelism. In fact, they represent typical workloads from the "big data" application space, including graph processing, that is prominent in data centers, and where heterogeneous system architectures with configurable accelerator frameworks [2–4] continue gaining traction because of their energy efficiency. While these frameworks, available only for HPC systems, drastically ease the job of the application programmer and allow for substantial offload speedups [79], the lack of suitable SVM infrastructure impedes the implementation of such applications on HESoCs. Without support for SVM, the only option for the programmer is to resort to physically contiguous memory. The host processor must be instructed to copy

the entire data structures into a physically contiguous, nonpaged, uncached memory section at offload time. On top of that, any virtual address pointer inside the data structure must be adjusted to point to the data copy in physically contiguous memory. This cannot be done using a DMA engine but requires the host to traverse the entire data structure at offload time. This not only hampers programmability, but also kills performance. In this chapter, we present a mixed hardware-software SVM framework enabling zero-copy sharing of pointer-rich data structures between host processors and PMCAs in HESoCs. The design is not intrusive to the architecture of the PMCA PEs and is thus suitable for sharing virtual address pointers between the host and PMCAs without inherent support for SVM. It consists of a hardware IOTLB managed by an OS kernel-level driver module running on the host and a compiler extension that automatically protects the PMCA's accesses to shared data elements in SVM with calls to low-overhead tryread() and trywrite() functions. Using these functions, the PEs inside the PMCA can validate the response from accesses to SVM and protect themselves from using invalid data returned upon a miss in the IOTLB. In this case, the corresponding PE is put to sleep and the miss is handled by the driver running on the host. After handling the miss, the host wakes up the PE, which can then safely repeat the transaction that previously missed in the IOTLB. Using real-life applications relying on pointer-rich data structures, we verified the applicability of our design. Our results show that for non-strictly memory-bound applications, the overhead introduced by our design is negligible compared to an ideal solution for SVM. The remainder of this chapter is organized as follows. Sec. 4.1 presents related work. Our mixed hardware-software SVM solution supporting zero-copy sharing of pointer-rich data structures is presented in Sec. 4.2. Finally, Sec. 4.3 discusses experimental results obtained from real-life applications relying on pointer-rich data structures.

4.1 Related Work

While some of the SVM designs proposed for HESoCs can deal with virtual address pointers [66, 67], they are still not suitable for applications where complex and large data structures must be traversed based on other input data as well as intermediate results. First of all, these schemes rely on special memory allocators, which complicates the implementation of the host-side application (which builds up the complex data structure). But more importantly, they require the range of addresses accessible by the accelerator to be known at offload time to pin the corresponding memory pages and generate the DMA descriptors and/or dedicated page tables required by the accelerator. On one hand, this extensive page pinning and DMA descriptor/page table generation is wasteful as, depending on input data and run-time parameters, the accelerator might access just a small subset of the pinned pages. On the other hand, it is impractical, especially for pointer-rich data structures such as linked lists, as the host must traverse the entire data structure at offload time, which requires the programmer to write an application-specific routine similar to the routines used for copy-based shared memory. In short, heterogeneous SVM with proper support for on-demand IOTLB miss handling at accelerator run time is absolutely required to enable the efficient implementation of applications operating on pointer-rich data structures.

The challenges in designing a suitable SVM system for modern accelerators are not only related to the target workloads but are also of architectural nature. Differences in both accelerator and system architecture require major changes if not a complete redesign of existing hard- and software infrastructure for address translation. For example, while the adaption of the MMU of an embedded, single-threaded CPU for a multi-core accelerator [80] can lead to a functional SVM design, such a design does not cope with the parallel nature of the accelerator. Whenever a miss in the IOTLB occurs, any traffic of the accelerator to SVM must be stalled, even if the corresponding address mapping is present in the IOTLB. Also, the availability of a full-fledged IOMMU itself does not automatically guarantee the best performance. Hardware IOMMU designs for high-bandwidth network and storage devices rely on kernel-level drivers to pin the shared pages in memory and to build up a dedicated I/O page table exclusively accessed by the IOMMU. Recent works [24, 25] target mitigating the two main bottlenecks of such systems, i.e., the scalability of IOVA (de-)allocation and IOTLB invalidation. Such systems are optimized for streaming large chunks of data in high-throughput, multi-Gb/s scenarios and less suitable for fine-grained data sharing between a host processor and a PMCA: the cost of setting up and removing a single mapping [25] is comparable to what we achieve with purely software-based IOTLB management, despite dedicated hardware IOMMUs including PTW engines and prefetching logic. To fully exploit hardware IOMMUs in heterogeneous computing systems and turn their potential into a performance boost, the interaction between hard- and software needs to be reduced to a minimum.
Instead of relying on a driver to pin shared pages and build up a dedicated page table for the IOMMU, the IOMMU itself must coherently operate on the page table of the user-space process and participate in TLB invalidations and shootdowns, just like a regular CPU core of the host. Such schemes have been studied in the context of high-performance systems featuring discrete FPGA-based accelerators [2] or GPGPUs [10, 11] and require tight integration into the host processor or accelerator architecture, respectively. As opposed to discrete high-performance accelerators connected to high-end host systems through PCIe I/O links, the PMCAs addressed in this thesis have a tight area and power budget and are highly integrated into HESoCs. Thus, we propose an SVM framework that is of low hardware complexity, not intrusive to the architecture of the host CPU and the accelerator PEs, suitable for parallel accelerator architectures like PMCAs, and that supports zero-copy sharing of pointer-rich data structures without relying on specialized memory allocators.

4.2 Infrastructure

In this section, we describe the hardware-software infrastructure enabling SVM between the host processor and the PMCA in a HESoC matching the template described in Sec. 1.1.

4.2.1 Remapping Address Block

To enable the host and the PMCA to communicate through SVM, the PMCA is connected to the main memory interconnect of the host through a multi-ported remapping address block (RAB), similar to the IOMMU shown in Fig. 1.1.

[Block diagram of Fig. 4.1: the PMCA's AXI4 masters connect through the RAB (IOTLB, ID and Addr FIFOs for missing transactions, read/write response logic, a miss IRQ to the host, and an AXI4-Lite configuration slave) to the host's AXI4 slaves and the DRAM.]

Figure 4.1: Schematic of the RAB with two ports.

The RAB is basically a software-managed IOTLB, which is used by the PMCA to translate the virtual host addresses of outgoing memory transactions to physical addresses. The setup of the IOTLB is done completely in software by the kernel-level driver module running on the host, as described in Sec. 4.2.2. Fig. 4.1 shows a schematic of the RAB. Compared to the RAB design presented in Sec. 3.3.3, the hardware has been extended to support the handling of IOTLB misses. In case the IOTLB does not hold a valid slice configuration for the requested transaction, the transaction is dropped and an interrupt is sent to the host to signal an IOTLB miss. To the issuing master inside the PE, the RAB signals back a slave error in the AXI read/write response signal. The missing address and the transaction ID are written to two separate first-in, first-out buffers (FIFOs) inside the RAB, which can be read by the miss-handling driver routine through the AXI4-Lite configuration interface.

4.2.2 The tryx() Operation

The RAB is managed completely in software by a kernel-level driver module running on the host. The interaction of the different abstraction layers is visualized in Fig. 4.2. When the heterogeneous application is started, it must first reserve those addresses in its own virtual address space that overlap with the physical address space of the PMCA1 1 .

1This can be achieved with an mmap() system call with the flags MAP_FIXED, MAP_ANONYMOUS and PROT_NONE to get exactly the requested address segment, to not back the mapping with any physical memory, and to not contribute to the kernel's overcommit limit.
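A minimal user-space sketch of this reservation is shown below. The base address and size of the PMCA address window are placeholders for the actual SoC memory map, and MAP_PRIVATE is added because mmap() requires a sharing flag.

    #include <stdio.h>
    #include <sys/mman.h>

    /* Hypothetical PMCA physical address window -- the real values are given
     * by the SoC memory map. */
    #define PMCA_BASE ((void *)0x40000000UL)
    #define PMCA_SIZE 0x10000000UL          /* 256 MiB */

    static int reserve_pmca_range(void)
    {
        /* PROT_NONE: never accessed, not backed by memory, no overcommit cost.
         * MAP_FIXED places the reservation exactly at PMCA_BASE (newer kernels
         * also offer MAP_FIXED_NOREPLACE to avoid clobbering existing mappings). */
        void *p = mmap(PMCA_BASE, PMCA_SIZE, PROT_NONE,
                       MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return -1;
        }
        return 0;
    }

    int main(void) { return reserve_pmca_range(); }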

[Timing diagram of Fig. 4.2 across host user space, host kernel space (RAB-miss interrupt, scheduling of the worker thread in process context) and an accelerator PE (load/store, RAB look-up, validation via the TRYX register, going to sleep, wake-up, repeated load/store), covering steps 1 to 9 referenced in the text. Handling a RAB miss takes 5,400 PMCA cycles per miss on average: roughly 20% until the interrupt is handled, 50% until the worker thread is scheduled, and 30% for the miss handling itself.]

Figure 4.2: Interaction of different abstraction layers and timing diagram for tryx() operations leading to a RAB miss and a RAB hit.

This step is required to make sure that the host never passes a virtual address pointer to the PMCA that overlaps with the PMCA's own address space, since the PMCA would not route an access to such a virtual address through the RAB but instead route it internally, e.g., to its internal memories or memory-mapped registers. After copying the PMCA executable to the internal L2 SPMs, the driver can start the PMCA and pass the virtual addresses of the shared data elements to the PMCA using a memory-mapped mailbox 2 . On the PMCA side, accesses to data elements shared with the host (i.e., residing in main memory) need to be protected with calls to the low-overhead tryread() and trywrite() functions. In the following, the term tryx() is used to refer to both these functions. First of all, they simply issue the load or store to the address of interest 3 . In case of a RAB miss, i.e., if no valid mapping for this address is set up in the RAB, the RAB i) stores the missing address and the transaction ID, ii) sends an interrupt to the host, and iii) returns 0 to the issuing PE in the read data signal in case of a load, and signals a slave error in the read/write response signal 4 . This response is then forwarded to a private, special-purpose, low-latency access TRYX control register placed close to the data interface of the PE. To validate the response returned by the RAB and to check whether a miss happened, the tryx() functions issue a read to this register 5 . In case of a miss, the PE goes to sleep and waits for the host to handle the miss 6 .

To handle the RAB miss on the host, the kernel-level driver uses the Concurrency Managed Workqueue (CMW) API of Linux. Upon receiving a RAB miss interrupt, the interrupt handler simply enqueues the miss-handling routine to a dedicated workqueue 7 . The worker thread executing this routine is then scheduled in normal process context 8 . This is required because some kernel functions executed by the routine may sleep, and therefore cannot be executed in interrupt context. Once started, the routine reads the missing address and the transaction ID from the corresponding FIFOs in the RAB. After locking the requested user-space page in memory using get_user_pages(), the virtual-to-physical address translation is performed, and a new RAB slice for the missing mapping can be configured. In case all RAB slices are in use, the routine simply removes the oldest mapping and unlocks the corresponding user-space page before setting up the new remapping (FIFO replacement strategy). The remapped page is flushed from the data caches of the host back to DRAM, and the sleeping PE can be woken up. The miss-handling routine then continues to handle misses until the FIFOs inside the RAB are empty. After waking up, the PE can safely repeat the load or store to the address of interest 9 and then continue normal program execution. In the case of a RAB hit, the memory request of the PE 3 is simply forwarded to shared memory 4 . The PE validates the response returned by the RAB by reading the TRYX control register 5 and continues program execution.
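Conceptually, the PE-side half of this protocol can be pictured as follows. This is only a sketch: the register address, the miss flag encoding and the sleep primitive stand in for PULP runtime internals that are not spelled out here.

    #include <stdint.h>

    /* Placeholder address and encoding of the per-PE TRYX control register. */
    #define TRYX_CTRL ((volatile uint32_t *)0x10200400u)
    #define TRYX_MISS (1u << 0)        /* slave error seen on the last access */

    extern void wait_for_event(void);  /* event unit: put the calling PE to sleep */

    /* Issue a load to SVM and validate it via the TRYX control register.
     * On a RAB miss, go to sleep until the host has set up the mapping,
     * then repeat the access. */
    static inline uint32_t tryread(volatile uint32_t *svm_addr)
    {
        for (;;) {
            uint32_t data = *svm_addr;          /* returns 0 on a RAB miss */
            if (!(*TRYX_CTRL & TRYX_MISS))
                return data;
            wait_for_event();                   /* woken up by the miss handler */
        }
    }

    static inline void trywrite(volatile uint32_t *svm_addr, uint32_t value)
    {
        for (;;) {
            *svm_addr = value;
            if (!(*TRYX_CTRL & TRYX_MISS))
                return;
            wait_for_event();
        }
    }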

4.2.3 Compiler and Runtime Support

Modern programming models are evolving to simplify the development of applications for heterogeneous systems. The application developer can focus on partitioning the computation and specifying the parts to be offloaded to the PMCA—using, e.g., OpenMP offload directives [81] or the lower-level programming style of OpenCL [26]—while the underlying compiler and runtime system take care of implementing the desired heterogeneous and parallel semantics. The proposed framework relies on a program transformation implemented in the compiler that transparently protects (via a call to the tryx() functions) the accesses to those data items that are accessed from the host (main) memory. The key idea leveraged to build this compiler pass is that computation offloading in any programming model is specified through some sort of language construct to specify which data originated in the host execution environment is later communicated to the PMCA. As an example, consider the following code snippet written using the OpenMP 4.0 specification [81].

    #pragma omp target map(vertices) map(to:n_vertices)
    {
        /* offloaded code */
    }

The target directive is used to syntactically highlight which parts of the host program are to be compiled and executed on the accelerator. The map clause allows listing a set of host variables (scalars, arrays or parts of arrays) that the PMCA has access to in read-only (map(to:)), write-only (map(from:)) or read-write mode. Starting from this information, the proposed compiler pass annotates all the uses of such variables within the target block (i.e., the code that will execute on the PMCA, and that thus needs protection of accesses going to shared memory). Fig. 4.3 shows the transformation process on a simplified excerpt of PageRank [82]. Two host variables are annotated for PMCA access: vertices (Variable 1) and n_vertices (Variable 2). Variable 2 is of type int, thus the compiler just emits a tryread() function call on its address. The analysis is more complex for Variable 1, which is of a pointer type and points to a struct containing several fields (many of which are pointers), as is usually the case for pointer-chasing access patterns. Here, not only do we need to follow the use-definition chains of the map variables, but we also need to apply escape analysis to determine when dereferencing a pointer is interpreted as an address whose value is further propagated through the program. A statement where such an address is read into temporary storage is thus marked as an escape point, and the analysis is recursively applied. Concretely, Variable 1 in Fig. 4.3 is subject to several such points. First, its address is taken to compute the offset for the element i (vertices[i] or vertices + i), labeled as 1 in the figure. Then, this new base address is further dereferenced to access the fields pagerank 1.1 , pagerank_next 1.2 and n_successors 1.3 . The transformation pass protects all accesses to these data items with calls to tryx(). DMA transfers are protected inside the DMA transfer setup routine of the PMCA runtime. If the cluster-external address is virtual, the routine inserts a tryread() call for every memory page touched by the DMA before starting the transfer, as sketched below. If the cluster-external address is physical, the transfer can be set up right away. PMCA-internal DMA transfers need no protection.
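The page-touch protection for DMA transfers mentioned above can be sketched as follows, reusing the tryread() function from the sketch in Sec. 4.2.2. The 4 KiB page size matches the host configuration, while the function name is purely illustrative.

    #include <stddef.h>
    #include <stdint.h>

    #define HOST_PAGE_SIZE 4096u

    uint32_t tryread(volatile uint32_t *svm_addr);  /* from the tryx() sketch */

    /* Touch every host page of a virtual buffer with tryread() before the DMA
     * transfer is programmed, so that all required RAB mappings are in place
     * and the transfer itself cannot miss. */
    void tryx_protect_range(uintptr_t virt_addr, size_t size)
    {
        uintptr_t first = virt_addr & ~(uintptr_t)(HOST_PAGE_SIZE - 1);
        uintptr_t last  = (virt_addr + size - 1) & ~(uintptr_t)(HOST_PAGE_SIZE - 1);

        for (uintptr_t page = first; page <= last; page += HOST_PAGE_SIZE)
            (void)tryread((volatile uint32_t *)page);
    }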

Original program:

    struct vertex {
        unsigned int vertex_id, n_successors;
        float pagerank, pagerank_next;
        vertex **successors;
    } *vertices;

    #pragma omp target map(vertices) map(to:n_vertices)
    {
        for (i = 0; i < n_vertices; i++) {                     /*  2  */
            vertices[i].pagerank = compute(...);               /*  1 , 1.1 */
            vertices[i].pagerank_next = compute_next(...);     /* 1.2 */
            pr_sum += (vertices + i)->pagerank;
            if ((vertices + i)->n_successors == 0)             /* 1.3 */
                pr_sum_dangling += (vertices + i)->pagerank;
        }
    }

Transformed program:

    for (i = 0; i < tryread(&n_vertices); i++) {               /*  2  */
        vertex_i = tryread(&vertices) + i*20;                  /* &vertices[i],  1  */
        p_rank = compute(...);
        trywrite(&vertex_i->pagerank, p_rank);                 /* 1.1 */
        trywrite(&vertex_i->pagerank_next, compute_next(...)); /* 1.2 */
        pr_sum = p_rank + pr_sum;
        if (tryread(&vertex_i->n_successors) == 0)             /* 1.3 */
            pr_sum_dangling = p_rank + pr_sum_dangling;
        ...
    }

Figure 4.3: Accesses to host data within offloaded code blocks are transparently instrumented by the compiler.

4.2.4 Comparison with Existing Solutions

The mixed hardware-software solution presented in this work implements all the functionality required for SVM and is fully functional. How it compares with hardware IOMMUs is shown in Tbl. 4.1.

[Table 4.1: Comparison of SVM solutions. The table contrasts the IOMMUs of AMD [7] and Intel [8], Malka [25], Power [11], Pichai [10], ARM's system-level MMU [28], Kornaros [64] and this work with respect to target application domain, architecture, number of local and (optional) remote IOTLB entries, look-up latency in cycles, replacement policy, prefetching, page-table walker, number of outstanding misses, number of device contexts, and interrupt remapping.]

The IOMMUs found in today's high-end desktop CPUs [7, 8] and embedded systems [28] are fully hardware managed. In case of an IOTLB miss, a dedicated hardware block walks the page table to do the virtual-to-physical address translation. Typically, some sort of least recently used (LRU) or pseudo-LRU replacement algorithm is used to decide which IOTLB entry to replace. Such replacement algorithms reduce the IOTLB miss rate, but they require hardware support to monitor the usage of the individual IOTLB entries. In contrast, the software routine handling the misses in our design has no knowledge of the usage of the IOTLB entries besides the setup time. The only viable option is thus a simple first-in, first-out (FIFO) algorithm. The advantage of our design is that, irrespective of the number of currently outstanding misses, the IOTLB can continue to serve translation requests without blocking the traffic of the PMCA to shared memory. In hardware-managed IOMMUs, this is mitigated using complex hardware designs including multi-threaded PTW engines [11, 28] and hierarchical IOTLB architectures. For example, IOMMUs targeting high-throughput I/O devices as well as embedded GPGPUs [7, 8] support both IOTLBs local to the IOMMU and remote IOTLBs inside the device. ARM's system-level MMU [28] uses per-port L1 IOTLBs and a shared L2 IOTLB. Per IOTLB, up to 8 or 16 concurrently outstanding misses are supported. In GPGPUs, IOTLB misses usually happen in big bursts. To allow other warps to access the shared memory while the misses of stalled warps are handled, every compute unit (CU) consisting of many PEs features a private IOTLB including multiple miss status holding registers. The PTW engine is either shared but massively multi-threaded [11] or private per IOTLB [10]. It is important to note that a hardware IOMMU by itself does not automatically guarantee the best performance.
While some IOMMUs [7,8] enabling SVM for embedded GPGPUs in modern high-end heterogeneous SoCs [5, 6] as well as ARM’s system-level MMU are fully hardware managed, today’s OS frameworks such as the Linux IOMMU API do not necessarily allow the hardware to operate on the page table of the user-space host process. Instead, software drivers may be required to build up a dedicated I/O page table and explicitly map shared memory pages to it. Originally, this concept was introduced to give high-throughput I/O devices direct access to kernel-space memory and to protect the host system from malicious or faulty DMA devices and drivers (refer to Sec. 1.2). However, the explicit mapping can be a major bottleneck and has been addressed by recent works in the context of high-throughput I/O devices [24, 25]. The support for interrupt remapping and multiple device contexts, i.e., multiple host applications concurrently offloading kernels to the PMCA, is currently not implemented in our design.

4.3 Experimental Results

We first present our evaluation platform, i.e., our embodiment of the target architecture template presented in Sec. 1.1. The cost of the proposed SVM solution is compared to alternative designs in Sec. 4.3.2.

Sec. 4.3.3 gives the results for three real heterogeneous applications based on pointer-rich data structures that were run on the evaluation platform, compared to an ideal SVM implementation. In Sec. 4.3.4, we introduce a synthetic model to predict the performance of our solution and apply it to the MiBench [59] embedded benchmark suite.

4.3.1 Evaluation Platform

Our evaluation platform is based on the Xilinx Zynq-7045 SoC [55]. Fig. 4.4 gives an overview. The PS of the Zynq SoC features interconnects, peripherals, a DRAM controller, and the APU with a dual-core ARM Cortex-A9 CPU at its heart. Each core has separate L1 instruction and data caches with a size of 32 KiB each. Further, the APU offers 512 KiB of shared L2 instruction and data cache that connects to the high-priority port of the DRAM controller. The processing system is used to implement the host of the HESoC. It runs Xilinx Linux 3.18. The PL of the Zynq SoC consists of a Xilinx Kintex-7 FPGA which is used to implement PULP: a PMCA developed as an ASIC for Parallel Ultra-Low Power Processing [71]. To overcome scalability limitations, PULP leverages a multi-cluster design. The PEs within a cluster feature 4 KiB of shared L1 instruction cache, and share a 72 KiB multi-banked tightly-coupled data memory as L1 SPM. The memory banks are connected to the PEs through a low-latency crossbar interconnect with a word-level interleaving scheme to minimize access contention. Ideally, every PE can access one word in the L1 SPM per cycle. The L1 SPMs of all clusters as well as the 256 KiB globally shared L2 SPM are mapped into a global, physical address space, meaning that the PEs can also access data in remote SPMs, albeit with a higher latency. Every cluster features a lightweight, low-programming-latency, multi-channel DMA engine which allows for fast and flexible movement of data between L1 and L2 memory or shared main memory [72]. The event units inside the cluster peripherals are used for both intra- and inter-cluster synchronization, and to put PEs to sleep and to wake them up. The PMCA is attached to the host as a memory-mapped device and controlled by a kernel-level driver module and a user-space runtime.

[Block diagram of Fig. 4.4: the host (two Cortex-A9 cores with MMUs and L1 caches, shared L2 cache, coherent interconnect, DDR DRAM controller) in the PS and the PULP PMCA (clusters with PEs, per-PE TRYX control registers and demuxes, multi-banked L1 SPM, shared L1 I$, DMA engine, event unit and timer, plus a shared L2 SPM and mailbox) in the PL of the Xilinx Zynq-7000 All-Programmable SoC. The Per2AXI protocol converter forwards AXI response signals to the TRYX control registers, which buffer the AXI response of the last transaction; the event unit wakes up sleeping cores; the RAB, managed by the kernel-level driver module on the host, connects the PMCA to the DRAM controller.]

Figure 4.4: The evaluation platform with host and PMCA implemented in the PS and the PL of the Xilinx Zynq SoC.

The host and the PMCA share 1 GiB of DDR3 DRAM. The RAB uses one port with 32 slices for PMCA-to-host communication and connects the PMCA to the HP AXI slave port of the DRAM controller. In case of a miss, the RAB signals a slave error back to the cluster-internal Per2AXI protocol converter using the AXI read/write response signal. The protocol converter propagates the response signal back to the per-PE private, low-latency access, special-purpose TRYX control registers placed close to the demux, which connect to the data interfaces of the PEs. The main goal of this platform is to study and evaluate the system-level integration of a PMCA like PULP into a HESoC. We did not optimize the PMCA for the implementation on the FPGA. The FPGA implementation should be seen more as an emulator than as a fully featured accelerator. The PMCA has a single cluster comprising four PEs. We adjusted the clock frequencies of the different components to obtain ratios similar to a real heterogeneous SoC, with host and PMCA running at 2133 MHz and 500 MHz, respectively. The speed of the DDR3 DRAM is selected so as to model a shared LLC with a total access latency of 15 and 14 clock cycles for read and write accesses issued by the PMCA, respectively.

Table 4.2: FPGA resource utilization.

Block          LUTs [k]   FFs [k]   BRAM [kbit]
PULP Cluster     48         31         623
RAB               3.90       3.60        0.5
TRYX Control      0.14       0.04        0
IOMMU [64]       11.2        –         408

4.3.2 Shared Virtual Memory Cost

Tbl. 4.2 gives an overview of the hardware cost of our solution. In terms of LUTs and FFs, the FPGA resource utilization of the RAB is about 8% and 12% that of a PULP cluster with four PEs, respectively. The TRYX control register accounts for 0.3% and 0.1% of the LUTs and FFs of a cluster, respectively. Further, the table lists the resources of a full-featured hardware IOMMU [64]. This IOMMU features a 64-entry IOTLB and has been implemented on a XC6VLX760 device. As expected, the total resource utilization of our solution is much lower than that of the full-featured IOMMU. It is worth noting that the CAM of the IOMMU has a maximum access latency of 6 cycles, which allows for an implementation using BRAM instead of LUT and register slices. In contrast, the CAM of the RAB has a look-up latency of 1 cycle. Moreover, the RAB allows for mappings of arbitrary size, independent of the page size. Constraining the RAB to page-sized mappings only, and relaxing the access latency of the CAM, would allow for a further reduction of the hardware cost. Finally, the IOMMU supports only one outstanding IOTLB miss. If another miss happens while the PTW engine is busy, any traffic to shared memory is blocked. In contrast, the RAB simply enqueues the missing address for the software routine on the host and continues to service requests from other, unblocked cores and DMA engines.

We used a synthetic benchmark application to profile the primitives of our SVM solution. The host allocates a large array and passes a pointer to this array to the PMCA using the mailbox. The PEs of the PMCA then issue read and write accesses to this array. To measure the tryx() hit and miss times, the application uses the performance counters inside the PEs and the timers inside the cluster peripherals, respectively. The latter are also accessible from outside the cluster and can be used by the kernel-level driver running on the host to profile the RAB miss-handling routine.

On average, handling a RAB miss takes 5,400 PMCA clock cycles (n_c,miss). This includes the time required by the event unit to put the PE to sleep and to wake it up, which is 5 and at most 3 clock cycles, respectively. Roughly 20% of n_c,miss passes until the host handles the interrupt, and another 50% until the worker thread that actually handles the miss is scheduled in process context. Only 30% of the time is spent walking the page table using get_user_pages() and updating the configuration of the RAB. Running the miss-handling routine in process context is required because get_user_pages() needs to acquire the semaphore protecting the OS's memory management structures of the user-space process and may therefore sleep. Implementing a custom routine to walk the page table and lock the user-space pages could speed up the RAB miss-handling routine by roughly 70%. However, this would come at the expense of portability. Our RAB miss handler, using Linux kernel APIs only, is fully independent of the host hardware. Instead, we optimized our solution for a fast common case, i.e., the tryx() operations hitting in the RAB. The overhead n_c,tryx of the tryx() is 8 clock cycles. This is the time a PE requires to read the TRYX control register and to decide whether to go to sleep in case of a miss or to continue program execution. How the proposed solution compares to alternative shared memory designs is visualized in Fig. 4.5.
Ideally, the host performs its computation 1.1 and simply passes virtual address pointers to the PMCA 2.1 . With an optimal virtual memory subsystem, the PMCA can access the shared data directly from main memory through the virtual address pointers, and execute the highly parallel application parts without SVM-related overheads 3.1 . Such an ideal SVM subsystem would be a zero-latency IOTLB that always contains the required mapping. In contrast, real SVM solutions such as the IOMMUs found in high-end heterogeneous SoCs [5, 6, 13, 28] are not ideal. Real IOTLBs can have look-up latencies of multiple cycles [64], and handling misses introduces additional latency. For example, letting a hardware IOMMU operating on an optimized, dedicated I/O page table (created by a driver module at offload time) handle an IOTLB miss takes around 1,500 host or 350 PMCA cycles [25]. In case too many misses are outstanding, any accelerator-to-host communication is blocked.

[Timeline sketch of Fig. 4.5 (host execution vs. PMCA execution over time): a) ideal SVM: 1.1 , 2.1 , 3.1 ; b) real SVM: 1.1 , 2.1 , 3.2 with a PMCA execution-time overhead; c) copy-based shared memory: 1.1 , 2.2 with an offload overhead, 3.1 ; d) CMA-based SVM: 1.2 with a host execution-time overhead, 2.1 , 3.1 .]

Figure 4.5: Different shared memory designs and associated execution-time overheads.

Similarly, the proposed SVM solution visualized in Fig. 4.5 b) allows the host to simply pass virtual address pointers to the PMCA 2.1 . The latencies associated with the tryx() functions and RAB miss handling can cause an overhead in PMCA execution time compared to the ideal SVM solution 3.2 (shaded area). The experiments presented in the following sections aim at quantifying this PMCA execution-time overhead. Without SVM, the host must copy shared data between virtual memory and the physically contiguous, uncached memory accessible by the PMCA, and adjust any virtual address pointers in the shared data. This causes substantial overheads at offload time as indicated in Fig. 4.5 c) 2.2 . Copying a single 4 KiB memory page of data to physically contiguous memory without even modifying pointers takes at least 10,200 PMCA clock cycles. Copying back the page once the PMCA is done takes another 20,500 cycles. Even in embedded systems, the data shared between host and PMCA quickly exceeds the size of a single page. For example, already a low-resolution video graphics array (VGA) image in red, green and blue (RGB) format occupies 225 4-KiB memory pages. Doing data copies quickly becomes a bottleneck in terms of energy and performance. Compared to the copy-based approach typically used by low-power embedded systems, our lightweight SVM solution requires on average 5,400 cycles to handle a RAB miss and is thus at least 5.7x faster.

An alternative design based on the Linux CMA [30] is visualized in Fig. 4.5 d). A kernel-level driver requests virtual memory from a large, physically contiguous section allocated at boot time and exposes it to the user-space application through mmap(). The host simply passes the virtual addresses of the shared data allocated in the CMA region to the PMCA 2.1 . The PMCA directly accesses the shared data from main memory through these pointers without overhead 3.1 . The virtual-to-physical address translation simplifies to applying a constant offset. However, there are considerable drawbacks associated with this technique besides the reduced programmability. First, there is no guarantee that the CMA region can be made available at run time [32]. Second, CMA returns uncached memory on ARM systems which can imply substantial host execution-time overheads 1.2 . Depending on the amount of processing done by the host, relying on the copy-based approach may be faster.

4.3.3 Pointer-Chasing Applications

The three real-life applications studied operate on pointer-rich data structures and exhibit irregular memory access patterns that are data dependent, not known in advance, and thus not amenable to DMA tiling. Offering high degrees of parallelism but possibly only little computation, they can be communication intensive. Moreover, they represent typical example workloads of the "big data" application space, including graph processing, that is prominent in data centers, and where heterogeneous architectures with configurable accelerator infrastructure [2–4] are gaining traction because of their energy efficiency. In contrast, stream-oriented, data-parallel application kernels traditionally offloaded to GPGPUs can be more memory bound but feature highly regular and predictable access patterns known at offload time. Using software-managed SVM together with an IOTLB double-buffering scheme as presented in Chap. 3, such kernels—even with very low operational intensities—can be efficiently offloaded to PMCAs.

Application Description

Before discussing the results, we first give a brief description of each application below.2

PageRank (PR): PR is a typical example of a pointer-chasing application. Originally, it was used by Google to rank web sites [82]. Every web site is represented by a vertex, and a link from one site to another is represented by an arc between the two corresponding vertices. The whole graph is initialized by equally ranking all vertices. The algorithm then iteratively processes the graph. The rank of every vertex is divided by the number of successor vertices and added to their ranks. At the end of every iteration, the rank of dangling vertices is equally distributed to all vertices, and all ranks are normalized. The procedure is repeated until the ranks converge. Since the number of computations performed per vertex is very low, basically a single division and one addition per successor vertex, PR is highly communication intensive. Furthermore, it features low locality of reference and therefore represents a worst-case scenario. Parallelization of PR using OpenMP is achieved on a vertex level.

Random Hough forests (RHFs): The second benchmark application is the classification stage of an object detector using random Hough forests [83], i.e., a set of binary decision trees. To detect the bounding boxes of instances of a class in an image, the application computes the corresponding Hough image using RHFs. Once all patches have been classified, a Gaussian filter is applied to the Hough image. The detection hypotheses are found at the maxima locations, and the values at these locations serve as confidence measures.

Face detection (FD): The third application uses the well-known Viola-Jones algorithm [84]. To detect a face at a particular location in an input image, the corresponding image patch is fed to a degenerate decision tree, a so-called cascade. Per node, one or multiple weak classifiers (weaks) are computed. Every weak specifies a simple test to perform on the patch and a threshold. If the weighted sum of the outputs of the weaks is below a node-specific threshold, it is very likely that the patch does not contain a face and the detection is aborted. Otherwise, the patch is fed to the next node, where the same procedure is repeated with different weaks. The cascades are designed to reject negative patches with as little computation as possible, i.e., as early as possible in the cascade. The individual classifiers are trained to have a detection rate close to 100%, while the false positive rate can be fairly high. The overall high detection accuracy is reached by cascading multiple weaks and cascades.

2A more detailed description of the applications and the adopted parallelization schemes can be found in App. A.

Table 4.3: Pointer-chasing applications.

Application       Graph Size   m_ca [%]   CCR [%]   m_TLB [%]
PR 1k vertices    36 KiB       2.2        29        0.0020
PR 4k vertices    144 KiB      7.0        30        0.0021
PR 10k vertices   264 KiB      8.4        31        0.0066
RHFs 2 trees      3.5 MiB      1.4        18        8.5
RHFs 4 trees      7.0 MiB      1.4        18        7.9
RHFs 10 trees     17.5 MiB     1.5        18        7.1
FD Cascade 1      25.75 KiB    1.5        33        0.0044
FD Cascade 2      36.08 KiB    1.6        32        0.0058

Measurement Results

The three benchmarks were profiled with different sets of input data. Tbl. 4.3 lists the details of the different configurations. It gives the size of the pointer-rich data structures in shared memory, the L1 data cache miss rate m_ca (measured when executing the graph processing stage on the host), the communication-to-computation ratio (CCR), i.e., the ratio of load/store instructions to the total number of instructions of the offloaded application part, and the RAB miss rate m_TLB. Fig. 4.6 shows the execution time of the tested configurations normalized to the execution time achievable with an ideal virtual memory solution, i.e., a system without tryx() overhead and with an IOTLB that always contains the required mapping. This can be achieved by copying the pointer-rich data structures to a physically contiguous, nonpaged, uncached part of the shared memory at offload time, adjusting any pointers in the data structures, and by disabling the compiler extensions inserting the tryx(). As visualized in Fig. 4.5 c), this implies considerable overheads on the host side. For example, for PR we measured execution time and code size overheads of 78% and 90%, respectively.

[Bar chart of Fig. 4.6: average execution time normalized to ideal SVM (1.0 to 1.8) for PR with 1k/4k/10k vertices, RHFs with 2/4/10 trees, and FD with cascades 1 and 2; bars for No Misses, With Misses (FIFO), With Misses (LRU), and With Misses (FIFO, with cache), with the miss-related part split into compulsory and capacity misses.]

Figure 4.6: Slowdown compared to ideal SVM for the different configurations of the pointer-chasing benchmarks.

Fig. 4.6 also shows the performance when using a software cache [21]. Besides avoiding the need for hand-optimized DMA programming, this reduces the pressure on the virtual memory system. Only misses in the software cache need address translation and thus protection with tryx() operations. The software cache has an access latency n_c,ca of 8 cycles, and the cache miss rate was modeled as m_ca = 10% for all configurations.

PageRank (PR): Even for the very memory-intensive PR, the slowdown due to the tryx() is below 15% (No Misses in Fig. 4.6). Considering also misses (With Misses, FIFO), even a marginal increase of the RAB miss rate m_TLB leads to an additional slowdown of 65% (PR 10k vertices). The reason for this high impact is twofold. First, PR is memory bound (CCR = 31%). Even a very low miss ratio means many misses. Second, PR has low locality of reference (m_ca = 8.4%). The graph with 10k vertices is not sufficiently small to be remapped with the available 32 RAB slices. Since the RAB slices are configured on demand, the very first access to every memory page of the pointer-rich data structure causes a compulsory miss in the RAB. Once all 32 RAB slices are in use, every RAB miss causes the oldest remapping to be replaced by the missing remapping. The next access to the replaced remapping then causes a capacity miss. Due to the low locality of reference of PR, the percentage of capacity misses is very high (≈ 95%) and handling these dominates the overall execution time as indicated in Fig. 4.6. Using an LRU replacement strategy helps to reduce the number of RAB misses by up to 11%, which translates into a speedup of up to 4.5% (With Misses, LRU). A more effective technique to decrease the impact of RAB misses is the software cache (33% speedup). It reduces the pressure on the virtual memory system such that even a very simple FIFO replacement strategy and a non-optimized, but portable miss handler allow for adequate performance (With Misses, FIFO, With Cache).

Random Hough forests (RHFs): The impact of both the misses and the tryx() is much less pronounced, despite relatively high RAB miss rates. There are two main reasons for this lower susceptibility. First, the offloaded kernel features more computation (lower CCR) compared to PR, which allows the cost of the tryx() and the misses to be amortized. Second, the last step of the feature extraction operates on data previously copied to the L1 SPM using DMA. The feature extraction does not need to access the shared memory and needs no protection. Moreover, the kernel has a higher overall locality of reference, which is also indicated by the relatively low host cache miss rate of 1.5%. Performing LRU replacement reduces the number of RAB misses by at most 2%. The performance increases by less than 1%. The reasons for this minor speedup are the large trees and the low temporal locality of the classification phase. Between two accesses to the same tree node, many other nodes are accessed, which likely causes all mappings to be replaced in the RAB. What can help here is least frequently used (LFU) replacement, explicit locking of RAB entries remapping the most frequently accessed parts of the trees such as the root nodes, or the use of a software cache.

Face detection (FD): FD combines a fairly high CCR with reasonably good locality of reference (m_ca ≤ 1.6%).
Consequently, the cost of the tryx() cannot be amortized as well as for RHFs, but it is not as high as for PR. This results in a slowdown of 6% (No Misses). Since the cascades are sufficiently small to be remapped with the available RAB slices, only compulsory misses occur and the RAB miss rate m_TLB ≈ 0.005% is substantially lower than for RHFs. Neither the LRU replacement strategy nor the software cache can significantly increase performance.

4.3.4 Synthetic Model

The cost of the primitives presented in Sec. 4.3.2 can be used to predict the execution-time overhead of our solution compared to an ideal SVM solution for arbitrary applications. Assuming that any computation instruction takes 1 clock cycle, the total execution time in cycles n_c,exe of an application with n_inst instructions and a given CCR is

n_{c,exe} = n_{inst} (1 - CCR + CCR \cdot n_{c,mem})    (4.1)

with n_{c,mem} being the average number of cycles per memory access. Ideally, the PMCA uses double-buffered DMA transfers between shared memory and its internal L1 SPMs to hide the access latency to shared memory. The memory operations of the PEs then all go to the cluster-internal L1 SPM and take a single cycle only. The programming effort required to implement such a shared memory access scheme can be very high. To ease programming and still let the PEs access the data from the fast L1 SPMs, a software cache [21] can be used. Given the number of cycles n_{c,ca} to access a word in the software cache, the miss rate m_{ca} of this software cache, the number of cycles n_{c,LLC} to access a word in the shared LLC, the overhead n_{c,tryx} of the tryx(), the miss rate of the TLB m_{TLB}, and the number of cycles n_{c,miss} to handle a TLB miss, the number of cycles per memory access can be given as

n_{c,mem} = n_{c,ca} + m_{ca} (n_{c,LLC} + n_{c,tryx}) \cdot (1 + m_{TLB} \cdot n_{c,miss})    (4.2)

In the following, we focus on the analysis of the overhead of the tryx() and therefore assume m_{TLB} = 0. The slowdown s compared to an ideal SVM can then be formulated using (4.1) and (4.2):

s = \frac{1 - CCR + CCR (n_{c,ca} + m_{ca} (n_{c,LLC} + n_{c,tryx}))}{1 - CCR + CCR (n_{c,ca} + m_{ca} \cdot n_{c,LLC})}    (4.3)

Fig. 4.7 shows the slowdown s due to the tryx() operations computed by this formula for different cache miss rates m_{ca} and CCRs. The overhead n_{c,tryx} of the tryx() is 8 clock cycles in the case of a RAB hit, and n_{c,LLC} = 15. We selected a software cache access time n_{c,ca} of 8 cycles (= n_{c,tryx}).
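For reference, the following self-contained C sketch is a direct transcription of (4.1)-(4.3) with the parameter values just stated; the sweep ranges are illustrative assumptions, and the absolute numbers naturally depend on the chosen parameters.

    #include <stdio.h>

    /* Model parameters as stated in the text (illustrative). */
    #define N_C_CA    8.0   /* software-cache access latency [cycles]         */
    #define N_C_TRYX  8.0   /* tryx() overhead in case of a RAB hit [cycles]  */
    #define N_C_LLC  15.0   /* shared-LLC access latency [cycles]             */

    /* Slowdown s over ideal SVM according to (4.3), assuming m_TLB = 0. */
    static double slowdown(double ccr, double m_ca)
    {
        double with_tryx = 1.0 - ccr + ccr * (N_C_CA + m_ca * (N_C_LLC + N_C_TRYX));
        double ideal     = 1.0 - ccr + ccr * (N_C_CA + m_ca * N_C_LLC);
        return with_tryx / ideal;
    }

    int main(void)
    {
        const double m_ca[] = { 0.015, 0.05, 0.10, 0.15 };
        for (int i = 0; i < 4; i++)
            for (int c = 1; c <= 10; c++)   /* CCR from 10% to 100% */
                printf("m_ca=%.3f CCR=%.1f s=%.3f\n",
                       m_ca[i], c / 10.0, slowdown(c / 10.0, m_ca[i]));
        return 0;
    }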


Figure 4.7: Slowdown compared to ideal SVM predicted by the synthetic model.

As one can see, for applications offering a suitable amount of computations (CCR < 30%), the slowdown is below 22% even for relatively high data cache miss rates of 15%. The synthetic model is also in line with the results measured for the heterogeneous applications shown in Fig. 4.6. For example, the measured slowdown for PR (m_{ca} ≤ 10%, CCR ≈ 30%) is below 15% (No Misses), while the slowdown predicted by the model is below 17%. For FD (m_{ca} ≈ 1.5%, CCR ≈ 30%), the measured and the predicted slowdown are 6% and 7%, respectively.

Finally, this model can be applied to benchmark applications. The MiBench [59] embedded benchmark suite contains several relatively small kernels from different application domains such as automotive, consumer, network, office, security and telecommunications, which are good candidates for acceleration by a PMCA. We compiled and ran a majority of the benchmarks on the ARM host processor. The CCR can be extracted from the instruction distributions provided by the original MiBench publication [59]; it lies between 11% (bitcount) and 52% (typeset), with an average of 33%. For the miss rate m_{ca} of the software cache in the model, the actual L1 data cache miss rate of the applications can be used. The miss rates were measured using a custom kernel module accessing the performance monitor units of the ARM; they vary between 0.2% (PGP) and 2.2% (lame, typeset). The slowdown for the various applications is shown in Fig. 4.8. It can be seen that for many applications, the expected slowdown is below 2%.


Figure 4.8: Estimated slowdown compared to ideal SVM for MiBench. The benchmarks span the automotive, consumer, security, network, office and telecommunications categories (basicmath, bitcount, qsort, susan (edges, corners, smoothing), jpeg encode/decode, lame, typeset, blowfish encode/decode, rijndael encode/decode, pgp encode/decode, sha, dijkstra, patricia, CRC32, stringsearch, adpcm encode/decode, gsm encode/decode, FFT/IFFT, plus the average).

A notable slowdown can only be seen for applications featuring both a CCR above 40% and a cache miss rate greater than 2%. If the CCR is reasonably low, i.e., if the application offers enough computations to amortize the cost of the tryx(), or if the cache can effectively filter the traffic to the shared memory, the slowdown of our SVM solution is negligible.

4.4 Summary

In this chapter, we have presented a mixed hardware-software framework enabling SVM for PMCAs in HESoCs. We verified the proposed design using real-life applications relying on pointer-rich data structures. Our results show that for non-strictly memory-bound applications, the overhead introduced by our design is negligible compared to an ideal solution for SVM.

On the hardware side, our design uses a simple remapping address block (RAB) to translate virtual addresses (as seen by the application running on the accelerator) to their physical counterparts in main memory. The IOTLB inside the RAB is completely software-managed by a kernel-level driver module running on the host processor. To support zero-copy offloading of application kernels operating on pointer-rich data structures, for which the access pattern of the accelerator to shared main memory is data dependent and not known in advance, our solution uses a compiler extension to automatically protect the accesses of the accelerator to shared data elements with calls to low-overhead tryx() functions. These functions validate the response of the shared memory accesses using a special-purpose register with low access latency. In case of an IOTLB miss, the calling PE is put to sleep and woken up after the miss has been handled by the host.

The presented SVM design is fully functional. As such, it substantially eases the implementation and also improves the offload performance of applications relying on pointer-rich data structures on HESoCs. In contrast to full-fledged hardware solutions prevalent in the HPC domain and tightly integrated into the accelerator architecture, our design is of low hardware complexity and non-intrusive to the architecture of the accelerator PEs, and is thus suitable for adoption in various embedded accelerators without inherent hardware support for SVM.

To improve the performance for more memory-bound applications, mainly two components of the framework can be optimized to reduce the IOTLB miss-handling overhead. First, the IOTLB can be managed directly on the PMCA to avoid the high interrupt latency incurred when involving the host in case of IOTLB misses. This option is studied in Chap. 5. Second, using an alternative TLB architecture, the capacity of the IOTLB can be increased at low hardware cost to reduce the number of capacity misses happening in the IOTLB. This option is addressed in Chap. 6.

Chapter 5

On-Accelerator Virtual Memory Management

True SVM with on-demand IOTLB miss handling at accelerator run time is key for efficient implementations of pointer-chasing applications on heterogeneous systems with reasonable design-time effort. In contrast to the full-blown, hardware-only solutions predominant in modern HPC systems [2–4], we have come up with a mixed hardware-software SVM design better suited to the context of constrained HESoCs (see Chap. 4). Such solutions can reduce area and energy overheads compared to full-featured hardware IOMMUs. Moreover, they offer much greater flexibility than possible with the reactive miss handling (MH) performed by purely hardware-managed designs: software-managed virtual memory can potentially exploit higher-level information of the application to anticipate the access pattern to shared memory and thus hide latency and increase the predictability of the system (see for example Chap. 3).

However, these advantages come at the price of possibly considerable run-time overheads for highly memory-bound workloads. One of the key sources of overhead of such designs is the remote IOTLB miss handling to be performed by the host [37,85,86]. If a PE inside the PMCA causes an IOTLB miss, this particular PE is put to sleep and the host is notified to handle the miss. The inability to handle misses

locally on the accelerator is the main limitation of such solutions, causing delays of thousands of clock cycles due to interrupt-based accelerator-to-host communication.

In this chapter, we present an integrated hardware-software solution that enables on-PMCA page table walking to efficiently implement IOTLB miss handling on HESoCs. Operating cache-coherently on the page table of the offloaded user-space process managed by the host, our design allows the PMCA to autonomously manage its virtual memory (VM) hardware without host intervention. This greatly reduces overhead with respect to host-side virtual-memory management (VMM) solutions while retaining flexibility. The design offers the possibility for collaborative IOTLB management, e.g., to exploit application-level knowledge available at offload time on the host side. Our solution exploits an accelerator-side helper thread concept. It is based on a miss-handling thread (MHT) that manages the VM hardware using an underlying VMM software library, which encapsulates all the functionality for tightly-coupled interaction with the hardware and requires only minimal hardware modifications to the design presented in Chap. 4.

We have validated our design with a set of parameterizable benchmarks and real-world applications capturing the most relevant memory access patterns from regular and irregular parallel workloads. Compared with host-side IOTLB management, the PMCA performance improves substantially. Even for purely memory-bound kernels, the performance lies within 50% of an ideal design, i.e., a TLB with zero look-up and miss latency, which is out of reach even for today's hardware IOMMUs [87].

This chapter is organized as follows. Sec. 5.1 discusses related work. Sec. 5.2 gives the necessary background information and analyzes the limitations of host-side IOTLB management. We present our solution to this problem in Sec. 5.3. Sec. 5.4 gives experimental results.

5.1 Related Work

To reduce the potentially high overheads associated with host-based IOTLB management [37,85,86], the latencies associated with handling interrupts on the host OS must be avoided and IOTLB misses must be handled closer to the place where they occur, i.e., in the IOMMU. One possibility to achieve this is to use a dedicated processor core inside the IOMMU [88] or dedicated hardware page table walk (PTW) engines. Similar to the fully hardware-managed IOMMUs found in today's high-end SoCs [7,8,28], the research community has also come up with proposals relying on hardware PTW engines [10,11,64,80,89], possibly combined with specialized memory allocators [65–67], for managing the IOTLB without frequent host intervention. Intuitively, hardware-based IOTLB miss handling allows for the lowest miss-handling latency and highest performance. However, designing suitable hardware that copes with the high degrees of parallelism in modern accelerators is not a trivial task and typically results in heavy overprovisioning and resource-costly, complex, multi-threaded PTW engines combined with hardware caches [11, 28, 64].

In addition, the availability of full-fledged hardware IOMMUs does not automatically guarantee the best performance. Today's OS frameworks such as the Linux IOMMU API do not necessarily allow the hardware to operate on the page table of the user-space host process. Instead, kernel-level drivers may be required to build up a dedicated I/O page table and explicitly map shared memory pages after pinning them. While this concept aims at protecting the host from faulty DMA devices and drivers operating directly on kernel-space memory, it can also be a major performance bottleneck in the context of high-bandwidth network and storage devices. Recent works [24, 25] aim at mitigating the two main bottlenecks of the corresponding OS frameworks, i.e., the scalability of I/O virtual address (de-)allocation and IOTLB invalidation. To overcome the inefficiencies of the IOMMUs found in today's processors, a new IOMMU architecture for accelerator-centric systems has been proposed that uses local TLBs but offloads the PTW to the hardware MMUs of the host CPU, which requires modifications to the host hardware [87].

The design presented in this chapter exploits a novel accelerator-side helper thread concept to leverage available compute resources for managing the VM hardware. This helps to reduce the performance gap between hardware IOMMUs and software-managed SVM designs targeting HESoCs. Our design neither requires modifications to the host architecture nor restricts the programmer to dedicated memory allocators. Low-level details are taken care of by the infrastructure without additional overheads at offload time.

5.2 Background

In this section, we discuss the specifics and challenges of managing VM on a PMCA in a HESoC on the basis of a system with a 32-bit ARM host processor running Linux. We first discuss the structure and entries of the page table (PT) and how the PT is used for virtual-to-physical address translation. As the PT can be cached at different locations, we then discuss the related coherence issues. Next, we discuss the RAB, a lightweight hardware module that enables VM on PMCAs. We conclude this section by examining the limitations of the state-of-the-art solution and derive the challenges our implementation (in Sec. 5.3) has to solve.

5.2.1 Linux Page Table

PTs store the virtual-to-physical address translations of all processes. There is one PT per process. All PTs are managed by the OS, which creates, modifies, and removes PT entries. In addition to the OS, the PT is read by the MMU of the CPU. The layout of the PT is defined by the CPU architecture. The Linux kernel generally uses the four-level PT shown in Fig. 5.1. The individual tables are called Page Global Directory (PGD), Page Upper Directory (PUD), Page Middle Directory (PMD), and Page Table Entry (PTE). The host processor we use, a 32-bit ARMv7 CPU without Large Physical Address Extension (LPAE), however, uses a two-level PT. To match this architecture, Linux uses only the lower two of its four tables (i.e., the PMD and the PTE) and sets the base address of the PT to the base address of the PMD.


Figure 5.1: The Linux page table. Elements that are grayed out are unused by the host architecture used in this chapter.

Figure 5.2: Differences between the Linux and the processor page table. First level (PMD): 2048 Linux entries = 16 KiB; second level (PTE): 512 Linux entries = 4 KiB. One Linux entry (8 B) corresponds to two hardware PTEs (4 B each, accessed by the MMU) at an offset of 2 KiB.

Another difference between the processor and the kernel is the number of entries per level: the Linux kernel uses 2048 and 512 entries in the first and the second level, respectively. The processor, on the other hand, uses 4096 and 256 entries in the two levels. As visualized in Fig. 5.2, the kernel compensates for this difference by duplicating the information in the second level (PTE) and by defining additional entries suitable for the processor in the first level (PMD). We do not discuss this mechanism in more detail because we work with the Linux representation of the PT. As a consequence, our implementation can be adapted to all host architectures that are supported by the Linux kernel.

Entries in the upper-level table contain the physical base address of a lower-level table. Entries in the lower-level table contain the physical base address of a page and multiple flag bits, of which the read-only and the user-mode bit are relevant for us: if the user-mode bit is zero, unprivileged processes (e.g., all user-space applications) may not access the page. If the read-only bit is set, processes may not write the page. If the physical address of an entry is null, the given virtual address (VA) does not translate to a physical address.

The translation from a virtual to a physical address for a given process is done in a page table walk (PTW): first, the 11 most-significant bits of the VA are used to index the PMD, which is located at the process-specific PT base address, to get the physical base address of the PTE. Second, the next 9 bits of the VA are used to index the PTE to get the physical base address of the page. Finally, the 12 least-significant bits of the VA are added to the page base address to obtain the physical address.
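As a minimal illustration of this index split, the following C snippet extracts the three fields from a 32-bit VA (a sketch, not the actual library code):

    #include <stdint.h>

    /* Index split of a 32-bit VA for the two-level Linux PT described above:
     * VA[31:21] -> PMD index (2048 entries), VA[20:12] -> PTE index
     * (512 entries), VA[11:0] -> offset within a 4 KiB page. */
    static inline uint32_t pmd_index(uint32_t va)   { return (va >> 21) & 0x7FF; }
    static inline uint32_t pte_index(uint32_t va)   { return (va >> 12) & 0x1FF; }
    static inline uint32_t page_offset(uint32_t va) { return  va        & 0xFFF; }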

5.2.2 Page Table Coherence

The PTs of all processes are stored in main memory due to their size. However, the virtual-to-physical address translations they contain are used very often, and accessing main memory on every translation would cause huge run-time overheads. To avoid this, the most frequently used translations are cached. There are two types of caches: (IO)TLBs store individual address translations and are a core component of every (IO)MMU. They allow the direct translation of a set of recently used addresses without having to walk the PT. In addition to TLBs, parts of tables may be present in caches in the memory hierarchy after a PTW. Similarly, some high-performance (IO)MMUs even include private caches to buffer entire tables [28]. These caches speed up (a part of) the PTW, because some tables are available in local low-latency memory.

Regardless of their type, caches introduce the problem of coherence: when the OS modifies an entry in the PT in main memory, all cached versions of this entry are invalid and may no longer be used. To ensure this, an invalidation is broadcast whenever a PT entry is changed [90]. Caches are obliged to listen to these broadcasts and may not use invalidated entries any longer. There is a subtle difference in how the two cache types deal with invalidations: caches in the memory hierarchy are subject to a cache-coherency protocol. TLBs, on the other hand, are not part of the memory hierarchy and thus cannot use cache invalidation broadcasts. To invalidate a TLB entry, an additional TLB invalidation that contains the modified virtual address is broadcast. On the ARMv7 host processor used in this work, TLB invalidations are propagated through distributed virtual memory (DVM) transactions [91].

5.2.3 Remapping Address Block

The remapping address block (RAB) connects the PMCA to the shared main memory interconnect and enables the host and the PMCA to communicate through SVM as discussed in Sec. 4.2. Similar to the IOMMUs found in HPC systems, the RAB performs the virtual-to-physical address translation for the accesses of the accelerator to SVM. Unlike a full-fledged hardware IOMMU, however, the RAB is completely managed in software by a kernel-level driver module running on the host.

The top-level architecture of the RAB is shown in Fig. 5.5 (uncolored elements). At its heart is an IOTLB of configurable size. This IOTLB is fully associative and looks up translations in a single clock cycle. An IOTLB entry (called RAB slice) contains a virtual address range of arbitrary length, a physical address offset, and two flag bits: slice enable and write enable. The widely adopted AXI4 [91] protocol is used on all interfaces. When a transaction from the accelerator to shared memory arrives on the slave port, the metadata of the transaction (i.e., virtual address, length, ID, and read/write) are fed to the control block that interfaces the IOTLB. In the case of a RAB miss, i.e., if the virtual address does not match any enabled RAB slice or if a write has been requested but is not permitted, the RAB drops the transaction, responds with a slave error, stores the metadata inside the miss FIFO, raises a miss interrupt, and processes the next transaction. In the case of a RAB hit, the RAB translates the virtual address of the transaction to the physical address, forwards the transaction through the master port, and processes the next transaction (returning the response of the downstream slave as it becomes available).

Supporting RAB misses requires a non-intrusive extension of the PMCA cluster: the RAB response is buffered in a low-latency, PE-private TRYX register. Each PE reads this register after every virtual memory access to check for a RAB miss. In the case of a miss, the corresponding PE goes to sleep. It is woken up as soon as the miss has been handled and then retries the memory access. The necessary instructions are automatically inserted by a compiler extension. Other PEs and DMA engines can continue accessing SVM while the miss is being handled.
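The behavior of a RAB slice and of the fully-associative look-up can be summarized by the following software model; the field names, the offset-based translation, and the fixed slice count are illustrative simplifications of the actual hardware.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SLICES 32   /* configuration used on the evaluation platform */

    typedef struct {
        uint32_t va_start;   /* first virtual address covered by the slice   */
        uint32_t va_end;     /* last virtual address covered (arbitrary len) */
        uint64_t pa_offset;  /* physical address offset applied on a hit     */
        bool     enabled;    /* slice-enable flag                            */
        bool     wr_en;      /* write-enable flag                            */
    } rab_slice_t;

    /* In hardware, all slices are compared in parallel (single-cycle
     * look-up); this sequential loop only models the outcome.
     * Returns true on a hit and writes the translated address to *pa. */
    bool rab_lookup(const rab_slice_t s[NUM_SLICES],
                    uint32_t va, bool is_write, uint64_t *pa)
    {
        for (int i = 0; i < NUM_SLICES; i++) {
            if (s[i].enabled && va >= s[i].va_start && va <= s[i].va_end) {
                if (is_write && !s[i].wr_en)
                    return false;          /* write not permitted -> miss     */
                *pa = s[i].pa_offset + (va - s[i].va_start);
                return true;
            }
        }
        return false;  /* miss: transaction dropped, metadata pushed to FIFO */
    }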


Figure 5.3: Interactions (top) and timeline (bottom) of handling RAB misses on the host.

In our first implementation presented in Chap. 4, RAB misses are handled by the host CPU. For this, the miss interrupt and the configuration interface of the RAB are connected to the host, and a kernel driver is registered to handle that interrupt. When an interrupt occurs, the driver schedules the miss-handler routine and clears the interrupt. When invoked, the handler then reads the miss FIFO in the RAB, walks the PT of the offloaded process, sets up a new RAB slice, and wakes up the PE that caused the miss.

5.2.4 Host-Based RAB Miss Handling

While handling RAB misses on the host comes at minimal hardware cost and is simple from the point of view of the PMCA, it adds an extensive run-time overhead to every miss. On our evaluation platform, handling a single RAB miss on the host takes an average of 5,400 PMCA clock cycles. Fig. 5.3 shows the interactions (top) and the timeline (bottom) of host-based miss handling.

After a RAB miss, an interrupt from the RAB arrives at the host and is handled by one of the CPU cores (dark gray); this takes ca. 15% of the total time. Then, the kernel thread executing the RAB miss handler in process context is scheduled (light gray), which takes another 35% of the time. This step is required because some functions used in the miss-handling routine may sleep and thus cannot be executed in interrupt context. The miss-handling routine first reads the RAB miss FIFO (yellow, ca. 4% of the time) and then translates the virtual to a physical address using get_user_pages() (blue, ca. 30% of the time). Next, the routine configures a RAB slice in a series of transactions (orange, ca. 12% of the time) and finally wakes up the sleeping PE (red, ca. 4% of the time).

In conclusion, there are two main contributors to the prohibitively large run-time overhead: multiple interactions between loosely-coupled components and the long scheduling delay to execute the miss handler on the host. Our solution, described in the next section, reduces this extensive run-time overhead while maintaining the standard compliance and flexibility of the host-based solution. This makes VM viable also for applications that use VM extensively and potentially cause many RAB misses.
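The host-side flow described above corresponds roughly to the following sketch; the helper functions are hypothetical placeholders for the actual kernel-driver internals (e.g., the translation helper stands in for the get_user_pages()-based code) and are not the real API.

    /* Schematic host-side RAB miss handling (not the actual driver code).
     * All helpers below are hypothetical placeholders:
     *  - rab_read_miss_fifo(): read VA and metadata of the faulting access
     *  - translate_and_pin():  resolve and pin the user page
     *  - rab_config_slice():   write a new slice via the config interface
     *  - wake_pe():            wake the sleeping PE
     */
    struct rab_miss { unsigned long va; unsigned pe_id; int is_write; };

    extern int  rab_read_miss_fifo(struct rab_miss *m);
    extern int  translate_and_pin(unsigned long va, int is_write,
                                  unsigned long *pa);
    extern void rab_config_slice(unsigned long va, unsigned long pa, int wr_en);
    extern void wake_pe(unsigned pe_id);

    /* Invoked from a kernel worker after the miss interrupt has been cleared. */
    void rab_miss_worker(void)
    {
        struct rab_miss m;
        while (rab_read_miss_fifo(&m) == 0) {       /* drain all pending misses */
            unsigned long pa;
            if (translate_and_pin(m.va, m.is_write, &pa) == 0)
                rab_config_slice(m.va & ~0xFFFUL, pa, m.is_write);
            wake_pe(m.pe_id);                       /* PE retries the access    */
        }
    }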

5.3 Infrastructure

The PMCA uses a multiple-instruction multiple-data (MIMD) execution model as visualized in Fig. 5.4. Multiple parallel worker threads mainly operate on data in the shared L1 SPM. Accesses to SVM are protected by load and store macros (1) automatically inserted by a compiler extension as discussed in Sec. 4.2. These macros protect a PE from using an invalid response returned by the hardware upon a RAB miss by putting the PE to sleep. To reduce the problematic overhead caused by handling RAB misses on the host, we must instead handle misses right where they occur: on the PMCA. For this, we add a miss-handling thread (MHT) to the execution model. Upon a RAB miss (2), the runtime starts an MHT that handles all outstanding RAB misses through our VMM library (3). After every handled miss, the MHT sends a signal to the runtime that wakes up the sleeping PE (4).

Figure 5.4: PMCA execution model with miss-handling thread (MHT).

In addition, worker threads can explicitly call the VMM library to map and unmap pages, e.g., before setting up DMA transfers (5).

To let the PMCA operate its VM hardware autonomously through the VMM software library (Sec. 5.3.3), two small modifications to the RAB hardware presented in Sec. 4.2.1 are required. First, the PMCA must be able to read the PT of the user-space process in main memory coherently with the caches of the host (Sec. 5.3.1). Second, the PMCA must have access to the configuration interface of the RAB (Sec. 5.3.2). Finally, handling RAB misses on the PMCA instead of on the host requires two small changes to the driver on the host: during the offload, the driver must set up a RAB slice that enables the PMCA to access the PT. For this, the driver gets the physical address of the initial level of the PT (i.e., in our case, the PMD) and sets up a cache-coherent, read-only slice for it. Additionally, the RAB miss interrupt handler on the host must be deactivated.

5.3.1 Page Table Coherence

There are two issues related to PT coherence (Sec. 5.2.2): all accesses to the PT must be coherent with the caches of the host, and all TLBs must respect TLB invalidations. We solved these issues as follows.

Cache-Coherent Memory Accesses

The PMCA accesses the PT coherently with the caches of the host for two reasons. First, this removes the need to flush caches every time the host changes the PT while the accelerator is running and on every offload. Second, this allows the memory hierarchy of the host to be exploited to cache parts of the PT, reducing the average access time to the PT without the expensive, dedicated hardware caches used, e.g., in some IOMMUs. Many modern SoCs offer an Accelerator Coherency Port (ACP) that enables accelerators to access main memory coherently with the caches of the host [55, 92]. To access the PT coherently, we could route all memory accesses through the RAB to the ACP. However, application data may be known to be uncached by the host at accelerator run time. In this case, accessing memory through the caches of the host unnecessarily increases the access latency and causes interference. Thus, we added a multiplexer that routes cache-coherent accesses through the ACP and non-coherent accesses directly to the DRAM interface (drawn yellow in Fig. 5.5). To control this multiplexer, we added one bit that defines cache coherence to each RAB slice.

Figure 5.5: The RAB with hardware extensions for on-accelerator VM management highlighted in color.

TLB Invalidations

Ensuring coherence of the IOTLB in the RAB is more involved, because the RAB does not support TLB invalidations through DVM transactions. We solved this issue with another low-overhead hardware-software solution: we added a register to the RAB (drawn orange in

Fig. 5.5) and amended the TLB invalidation routine in the Linux kernel to write the invalidated virtual address to that register. When an address is entered in the register, the RAB disables the corresponding slice within the next clock cycle and before processing any outstanding translations. A single register is sufficient because the RAB processes invalidations at a much higher rate than the OS generates them. This solution requires only little extra hardware and, since TLB invalidations happen rarely, the associated run-time overhead is acceptable: on the evaluation platform (described in Sec. 5.4.1), writing the invalidation register in the PMCA takes 200 host clock cycles on average. This is comparable to the invalidation of a host TLB entry, which takes 130 cycles on average.
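Conceptually, the kernel-side change boils down to one additional register write in the TLB-invalidation path; the register address and the hook below are assumptions made purely for illustration.

    #include <stdint.h>

    /* Hypothetical (memory-mapped) address of the RAB invalidation register. */
    #define RAB_INVAL_REG ((volatile uint32_t *)0x51000010u)

    /* Called from the amended TLB-invalidation routine, in addition to the
     * regular DVM-based invalidation of the host TLBs. The RAB then disables
     * the slice covering this address before processing further translations. */
    static inline void rab_notify_tlb_inval(uint32_t invalidated_va)
    {
        *RAB_INVAL_REG = invalidated_va;
    }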

5.3.2 Securely Configuring the RAB on the PMCA

The PMCA requires hardware access to the RAB configuration to be able to set up slices. To share the configuration port of the RAB between host and PMCA, we added an interconnect in front of the port and attached the PMCA SoC bus and the existing host master (light blue in Fig. 5.5). While this solution is simple and functionally sufficient, it would undermine the isolation aspect of VM: if the PMCA were free to write any configuration to RAB slices, any application running on the PMCA could grant itself unrestricted access to any memory region, including regions that belong to the OS kernel.

We prevent illegal RAB configurations with a hardware filter on the configuration port (dark blue in Fig. 5.5). This filter ensures two properties based on the address and content of incoming writes. First, only the host is allowed to write the one slice that maps the initial level of the PT. Second, the accelerator may only configure slices according to the response of a PT look-up. To ensure the second point, the filter snoops the read request and response channels between the RAB and the clusters of the PMCA (triangular filter port in Fig. 5.5) for look-ups in the PT. The physical address, the access permissions, and the address range returned by a look-up are stored. On the next write to the RAB configuration by the accelerator, the values to be written are compared with the stored values of the last look-up: if one of them does not match or if the page is not accessible from user space, the write request is dropped and an error is returned. Applications could still escape this protection by faking an extra level in the PTW with one of their data pages. To prevent this, the filter keeps track of the number of indirections that occur during the PTW, which is fixed for any given host architecture. This lightweight hardware solution enables untrusted accelerator software to configure the IOTLB while maintaining the isolation guarantee of VM. It does not have any impact on run time and comes at negligible hardware cost.
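The check performed by the filter can be modeled in software as follows; the structure abstracts from the AXI-level snooping, and all names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define PT_LEVELS 2    /* fixed for the 32-bit ARMv7 host used here */

    /* State captured by the filter while snooping the PT look-ups of the PMCA. */
    typedef struct {
        uint64_t phys_addr;   /* physical address returned by the last look-up */
        bool     user_ok;     /* page accessible from user space               */
        bool     write_ok;    /* page writable                                 */
        int      levels_seen; /* number of PT indirections observed            */
    } ptw_snoop_t;

    /* Returns true if a slice-configuration write from the PMCA is accepted. */
    bool filter_accept(const ptw_snoop_t *snoop,
                       uint64_t cfg_phys, bool cfg_write_en)
    {
        if (snoop->levels_seen != PT_LEVELS)  return false; /* faked PT level   */
        if (!snoop->user_ok)                  return false; /* not a user page  */
        if (cfg_write_en && !snoop->write_ok) return false; /* illegal write    */
        return cfg_phys == snoop->phys_addr;  /* must match the last look-up    */
    }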

5.3.3 VMM Library for the PMCA

There are three basic requirements on this library, which sits between the application programmer on one side and the PMCA hardware as well as host-dependent data structures on the other side. First, the library must expose a simple API that remains stable regardless of variations between different host architectures or evolution in PMCA hardware. Second, the implementation of the library must be flexible and extensible to accommodate different host architectures and evolving PMCA hardware. Third, the library is highly performance-critical and must guarantee correctness properties such as strict ordering of memory transactions. To meet the first two requirements, the library (shown in Fig. 5.6) is organized in layers and composed of modules. To meet the third requirement, we implement these abstractions with no overhead: the compiler expands all function calls inline and reduces all data structures to collections of primitive types. In the remainder of this section, we discuss the modules and their interaction from top to bottom.

Top-Level VMM Module

The VMM library provides two different sets of functionalities in its interface: one to handle all outstanding RAB misses and automatically map the required pages (V1), and one to manually map and unmap pages (V2). In the most common use case of this library, an MHT calls the function to handle RAB misses (V1) upon a RAB miss interrupt. The library will then dequeue the first miss from the RAB miss queue and translate its virtual address to a physical address. Next, it will select a RAB slice to replace and configure that slice accordingly.

Figure 5.6: Overview of the VMM library on the PMCA. Arrows indicate usage of functions and data structures in the interface. From top to bottom, the layers are: the application (MHT and worker threads); the top-level VMM module with handle_misses() (V1) and {un,}map_page{,s}() (V2); the host HAL with virt_addr_to_page_phys_addr() (H1); the PMCA HAL with wake_core() (C1), get_miss() (R1), slice_...() (R2), and config_slice() (R3); and the hardware (cluster, RAB, RAB configuration port, main memory).

Finally, it will wake up the PE that caused the miss. The other use case is the explicit mapping of pages (V2), e.g., in preparation of a DMA transfer. This works the same as (V1) but without the functions related to miss handling. An advantage of this software implementation is that the algorithm to replace slices can be modified easily. We focused on a simple FIFO policy due to its low time and space complexity, but certain applications could benefit from a more complex algorithm such as LRU.
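In code, the top-level flow corresponds roughly to the sketch below; the function names mirror the labels of Fig. 5.6, but their signatures are simplified assumptions rather than the actual interfaces.

    /* Simplified top-level miss handling (V1). All signatures are illustrative. */
    extern int      get_miss(unsigned long *va, unsigned *pe_id);        /* R1 */
    extern int      virt_addr_to_page_phys_addr(unsigned long va,
                                                unsigned long *pa,
                                                int *write_ok);          /* H1 */
    extern unsigned select_victim_slice_fifo(void);  /* simple FIFO policy     */
    extern void     config_slice(unsigned idx, unsigned long va,
                                 unsigned long pa, int write_ok);        /* R3 */
    extern void     wake_core(unsigned pe_id);                           /* C1 */

    int handle_misses(void)
    {
        unsigned long va, pa;
        unsigned pe_id;
        int write_ok;

        while (get_miss(&va, &pe_id) == 0) {          /* drain the miss FIFO    */
            if (virt_addr_to_page_phys_addr(va, &pa, &write_ok) != 0)
                return -1;                            /* VA not mapped by host  */
            config_slice(select_victim_slice_fifo(),
                         va & ~0xFFFUL, pa, write_ok);
            wake_core(pe_id);                         /* PE retries its access  */
        }
        return 0;
    }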

Host Hardware Abstraction Layer

The host hardware abstraction layer (HAL) provides a normalized interface to translate addresses on different host architectures. Its implementation structure allows support for every architecture supported by the Linux kernel to be added with little extra development effort. This module translates virtual to physical addresses (H1) by walking the PT of the host. The translation starts at the first-level table, which has been mapped by the host at offload time. At each level, the function first calculates the table index from the virtual address and reads the corresponding table entry. It then checks if this entry contains the physical address of a table in the next-lower level; if so, it sets up a new RAB slice for cache-coherent, read-only access to that table and proceeds with the next level. At the last level, the table entry contains the physical address of a page along with access permissions. Thus, the module finally checks if the page is accessible from user space and maps it with the appropriate read/write permissions.

At all levels but the last, a RAB slice mapping a table is set up. When the translation of a virtual address (at least partially) requires the same tables as the previous address, however, setting up these tables again is unnecessary. Our implementation avoids such redundancy by caching the addresses of the currently mapped tables and starting the PTW at the first level that cannot be reused. This is especially beneficial for higher-level tables, which are reused often because they cover a large range of addresses. Our results show that, already for a two-level PT, this optimization reduces the run time of a PTW by up to 65% on the evaluation platform.
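The reuse optimization for the two-level PT of this host can be sketched as follows; the cached-state layout and the helper functions are assumptions, not the real implementation.

    #include <stdint.h>

    /* Cached HAL state: which second-level (PTE) table is currently reachable
     * through a RAB slice. Names and layout are illustrative. */
    static uint32_t cached_pmd_idx = (uint32_t)-1;
    static uint64_t cached_pte_phys;

    extern uint64_t read_pmd_entry(uint32_t pmd_idx);      /* via the PT slice  */
    extern void     map_table_slice(uint64_t table_phys);  /* coherent, RO      */
    extern uint64_t read_pte_entry(uint64_t pte_phys, uint32_t pte_idx);

    /* Translate 'va', reusing the currently mapped PTE table when possible. */
    uint64_t hal_translate(uint32_t va)
    {
        uint32_t pmd_idx = (va >> 21) & 0x7FF;
        if (pmd_idx != cached_pmd_idx) {          /* PTE table not reusable      */
            uint64_t pte_phys = read_pmd_entry(pmd_idx);
            if (pte_phys == 0)
                return 0;                         /* VA not mapped               */
            map_table_slice(pte_phys);            /* remap the RAB table slice   */
            cached_pmd_idx  = pmd_idx;
            cached_pte_phys = pte_phys;
        }
        uint64_t page_phys = read_pte_entry(cached_pte_phys, (va >> 12) & 0x1FF);
        return page_phys ? page_phys + (va & 0xFFFu) : 0;
    }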

PMCA Hardware Abstraction Layer

The PMCA HAL implements the uniform access principle for memory-mapped hardware components of the PMCA by providing a consistent interface. This allows aggressive compiler optimization throughout this performance-critical library while still guaranteeing correctness at the instruction level, e.g., with respect to memory access ordering: constructs such as volatile variables and fixed-width types are used consistently where required, and only where required.

The RAB module controls RAB slices (R3) and dequeues misses from the hardware FIFO (R1). Parameters for a RAB slice are: the index of the slice, the virtual address range, the base physical address, access permissions, and cache coherency. If the parameters are valid, the implementation then disables the target slice, writes the new configuration in a sequence of memory transactions, and re-enables the slice. Other operations on slices (R2) are: disabling a slice, determining whether it maps a given virtual address, and determining whether it is enabled. The implementation is thread-safe: those operations that require more than one transaction lock the slice on which they operate by means of synchronization hardware in the PMCA. From the cluster module, the VMM library currently only uses the function (C1) to wake up a given PE that has gone to sleep after accessing a virtual address that misses in the RAB. This function triggers an interrupt to the PE in the event hardware unit of the PMCA.

5.4 Experimental Results

Before discussing the results, we first give an overview of the evaluation platform and the benchmark applications visualized in Fig. 5.7 and Fig. 5.8, respectively.

5.4.1 Evaluation Platform

Our evaluation platform is based on the Xilinx Zynq-7045 SoC [55]. It features a dual-core ARM Cortex-A9 CPU that is used to implement the host of the HESoC. The cores have separate L1 instruction and data caches with a size of 32 KiB each. Further, they share 512 KiB of unified L2 instruction and data cache that connects to the high-priority port of the DRAM controller. The host runs Xilinx Linux 3.18 with swapping and the LPAE disabled.

The programmable logic of the Zynq is used to implement a cluster-based PMCA architecture [71]. The PEs within a cluster feature 8 KiB of shared L1 instruction cache and share 256 KiB of multi-banked, tightly-coupled data memory as L1 SPM. Ideally, every PE can access one word in the L1 SPM per cycle. Every cluster features a multi-channel DMA engine that can be quickly programmed by the PEs by writing just 4 low-latency, PE-private registers and thus enables fast and flexible movement of data between L1 and L2 memory or shared DRAM. The event unit in the peripherals of each cluster is used for intra- and inter-cluster synchronization, to put PEs to sleep, and to wake them up. The PMCA is attached to the host as a memory-mapped device and is controlled by a kernel-level driver module and a user-space runtime.

The host and the PMCA share 1 GiB of DDR3 DRAM. The RAB uses one port with 32 slices for PMCA-to-host communication. It connects the PMCA to the HP AXI slave port of the DRAM controller and to the ACP of the Zynq. The latter allows the PMCA to access the shared main memory coherently with the data caches of the host. The configuration interface of the RAB is accessible to both PMCA and host through the general-purpose AXI master port. In case a RAB miss happens, this is signaled back to the PE using the per-core, low-latency TRYX register as described in Sec. 4.2.2.

This platform enables us to study and evaluate the system-level integration of a PMCA into a HESoC.


Figure 5.7: Architecture of the evaluation platform.

Thus, we did not optimize the PMCA for the implementation on the FPGA; the FPGA implementation should be seen as an emulator rather than a full-featured accelerator. We adjusted the clock frequencies of the different components to obtain ratios similar to a real HESoC, with host and PMCA running at 2133 MHz and 500 MHz, respectively. The DDR3 DRAM is clocked at 533 MHz. The hardware modifications to the RAB add only minor contributions to the overall hardware complexity: the modified RAB accounts for less than 6% of the LUT and less than 10% of the FF resources of the entire PMCA featuring a single cluster with eight PEs. For future evolutions of our SVM system, we have migrated the entire system to a bigger evaluation platform combining a more recent, multi-cluster, ARMv8 host processor with an FPGA fabric capable of implementing multiple PMCA clusters [40]. Since the VMM library presented in this work operates on the Linux PT, it is fully compliant with the new host architecture.

5.4.2 Synthetic Benchmark Descriptions

To evaluate the performance of our SVM system under various conditions and to identify its limits, we have used three configurable benchmark applications. They were obtained by extracting critical phases from real-world applications suitable for implementation on a HESoC, and by parameterizing them to cover a large parameter space. They exhibit main-memory access patterns representative of various application domains. Since they operate on pointer-rich data structures, support for SVM is badly needed. Otherwise, developers need to completely rethink the application, which prevents efficient implementations on heterogeneous platforms at reasonable design-time effort. Moreover, the access patterns exhibited by such operations are highly irregular and thus represent a worst-case scenario for the SVM subsystem.


Figure 5.8: The synthetic benchmark applications: a) pointer chasing (PC) and b) random forest traversal (RFT).

Pointer chasing (PC): This benchmark is a representative example for a wide variety of pointer-chasing applications from the graph processing domain [82, 93]. Prominent examples include PageRank (PR), breadth-first search, shortest path search, clustering, and nearest neighbor search. Due to the irregular and data-dependent access pattern to shared memory and the low locality of reference, they represent worst-case scenarios for a virtual memory subsystem. The principle of the benchmark is visualized in Fig. 5.8 a). The host builds up the graph and stores the vertex information in a single array in main memory. Every array element stores the number of successors of the corresponding vertex, a pointer to an array of successor vertex pointers, and a configurable amount of payload data. For every vertex, the PMCA reads the number of successors and copies the payload data and successor pointers to the L1 SPM using DMA. Then, it performs a configurable number of computation cycles on the payload data and writes the payload data to the successors in shared main memory.
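A sketch of the per-vertex layout and of the traversal loop is given below; the field names, the DMA helpers, and the compute placeholder are illustrative and not the actual benchmark code.

    #include <stdint.h>

    #define PAYLOAD_WORDS 8            /* configurable payload size (assumed)  */

    /* One vertex of the graph as laid out by the host in shared memory. */
    typedef struct vertex {
        uint32_t       n_suc;          /* number of successors                 */
        struct vertex **suc_ptr;       /* array of successor vertex pointers   */
        uint32_t       payload[PAYLOAD_WORDS];
    } vertex_t;

    extern void dma_to_spm(void *dst, const void *src, uint32_t size);
    extern void dma_to_shared(void *dst, const void *src, uint32_t size);
    extern void compute_on_payload(uint32_t *payload, int n_cycles);

    /* Linear traversal as performed by the PMCA (simplified, single PE). */
    void pointer_chase(vertex_t *graph, uint32_t n_vertices, int n_cycles)
    {
        uint32_t payload[PAYLOAD_WORDS];
        for (uint32_t v = 0; v < n_vertices; v++) {
            vertex_t *cur = &graph[v];                 /* SVM access (tryx)     */
            dma_to_spm(payload, cur->payload, sizeof(payload));
            compute_on_payload(payload, n_cycles);
            for (uint32_t s = 0; s < cur->n_suc; s++)  /* write to successors   */
                dma_to_shared(cur->suc_ptr[s]->payload, payload, sizeof(payload));
        }
    }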

Random forest traversal (RFT): This benchmark operates on regular, binary decision trees of configurable size typically used for regression and classification [83]. The host generates the decision trees in virtual memory and passes virtual address pointers to both the root vertices and the samples to be classified to the accelerator. The trees themselves are stored in a large array as shown in Fig. 5.8 b). Every vertex or array element stores a configurable amount of payload data and a limit for the binary decision. When a sample arrives at a non-leaf vertex, the accelerator loads the corresponding payload data from shared memory using DMA, performs a configurable number of computation cycles using the sample and the payload data, and selects the next vertex based on the outcome of the computation. Upon arriving at a leaf vertex, the index of the leaf vertex is stored.

Memory copy (MC): This application features a highly regular access pattern to shared memory and is a representative example for streaming applications. It simply lets the host create a buffer of configurable size in virtual memory and then passes a pointer to the buffer to the PMCA. The PMCA uses DMA transfers to copy the buffer from shared memory into the L1 SPM at maximum bandwidth.
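A corresponding sketch for RFT is shown below; the vertex layout follows Fig. 5.8 b), while the implicit array indexing of the children and all helper names are illustrative assumptions.

    #include <stdint.h>

    #define PAYLOAD_WORDS 4            /* configurable payload size (assumed)  */

    /* One tree vertex as stored in the per-tree array in shared memory. */
    typedef struct {
        uint32_t leaf;                 /* 1 if this vertex is a leaf           */
        int32_t  limit;                /* threshold of the binary decision     */
        uint32_t payload[PAYLOAD_WORDS];
    } tree_vertex_t;

    extern void    dma_to_spm(void *dst, const void *src, uint32_t size);
    extern int32_t compute_decision(int32_t sample, const uint32_t *payload,
                                    int n_cycles);

    /* Classify one sample on one regular binary tree; children of vertex i
     * are assumed to be stored at indices 2i+1 and 2i+2. Returns the index
     * of the leaf vertex that is reached. */
    uint32_t rft_classify(const tree_vertex_t *tree, int32_t sample, int n_cycles)
    {
        uint32_t payload[PAYLOAD_WORDS];
        uint32_t idx = 0;
        while (!tree[idx].leaf) {                      /* SVM access (tryx)     */
            dma_to_spm(payload, tree[idx].payload, sizeof(payload));
            int32_t d = compute_decision(sample, payload, n_cycles);
            idx = 2 * idx + (d < tree[idx].limit ? 1u : 2u);
        }
        return idx;
    }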

5.4.3 Miss-Handling Cost

We used a synthetic benchmark application to profile the VMM library and compare it with host-based RAB management. The host CPU allocates a large array and passes a pointer to this array to the PMCA. The PEs of the PMCA then issue read and write accesses to this array. To measure the execution time of the different sub-functions of the RAB miss-handling routines, the application uses the internal performance counters of the PEs and the peripheral timers of the cluster.

On average, the handling of a RAB miss using the VMM library takes 450 PMCA clock cycles. Around 50% of this time is spent accessing the PT in memory. The reconfiguration of RAB slices (to access currently unmapped PT sections and to map the requested user-space page) accounts for 20% of the execution time. The remaining time is spent on various sub-tasks of the VMM library that can be executed internally to the cluster. These include computing the addresses of the next PMD and PTE entries, checking access privileges, handling the RAB slice replacement strategy, and waking up the sleeping PE after handling the miss. Overall, handling a RAB miss directly on the PMCA using the proposed VMM library is 11 times faster than handling misses on the host.

Table 5.1: Benchmark parameters.

Pointer chasing:          #Vertices 10 k; Vertex Size [B] 44 - 2,060; #Cycles per Vertex 10 - 1,000
Random forest traversal:  #Tree Levels 5 - 16; Vertex Size [B] 28 - 1,036; #Cycles per Vertex 10 - 1,000
Memory copy:              #Iterations 1 - 64; Data Size [KiB] 64 - 1,024

5.4.4 Synthetic Benchmark Results

The three benchmarks were run using the parameters summarized in Tbl. 5.1 to test the presented SVM subsystem under various workload conditions. The selected parameter sets represent a mix of real use cases [83,93], problem sizes still bearable by embedded systems, and performance transition points of the evaluation platform.¹ We selected a randomly distributed graph of the Erdős–Rényi type for PC and configured RFT to sort random input numbers. As a consequence, the access patterns to shared memory are highly irregular and randomized. We measured the execution time on the PMCA, both when using the PTW design presented in this work and when letting the host handle the RAB misses. We compare these two designs with an ideal SVM subsystem, i.e., a TLB with a single-cycle look-up latency that always contains the requested mapping and thus never misses. While ideal SVM would be desirable, even full-fledged hardware IOMMUs today are far from reaching this performance: a recent study has shown that their average performance on a mixed set of accelerator

workloads is less than 13% of ideal SVM [87]. We modeled the ideal SVM subsystem by letting the host copy the shared data to a reserved, physically contiguous section and translate any virtual address pointers in the shared data. While this allows for zero overhead at PMCA run time, it leads to considerable design-time overheads and very high offload cost, which are to be avoided by using SVM.

¹For example, we found that graphs with a larger vertex count do not notably change the results for PC as long as the total graph size is larger than the amount of memory that can be remapped with the available 32 TLB entries (128 KiB).

Pointer chasing (PC): Fig. 5.9 shows the performance normalized to ideal SVM for different numbers of computation cycles per vertex and increasing vertex sizes, i.e., the total amount of data in the linear array per vertex, in a) – d). Given a specific vertex size, varying the number of computation cycles per vertex changes the operational intensity, defined as the number of operations performed per byte of data transferred between shared main memory and the local SPM of the PMCA. The operational intensity is a characteristic property of a specific algorithm implementation. It is denoted on the x-axis of the plots. Since the graph cannot be remapped by the available TLB entries at once and since the access pattern is random, PC is dominated by RAB miss handling. Thus, letting the PMCA manage the RAB is always substantially better than involving the host. The proposed SVM system achieves between 60 and 88% of the performance of ideal SVM for operational intensities between 0.28 and 28 cycles per byte. For example, a single-precision floating-point implementation of PR performs 1 division and 1 multiplication per vertex and 1 addition per successor, and accesses 20 B of data for these operations. On the ARM Cortex-A9 host CPU, this leads to an intensity of 1.2 cycles/B [94].

Random forest traversal (RFT): The results for RFT are shown in Fig. 5.10. The x-axis of the plots denotes the number of tree levels. Every plot shows two pairs of curves corresponding to different operational intensities. It can be seen that changing the vertex size from 28 to 268 B (Fig. 5.10 a) to b)) mainly shifts the transition point beyond which letting the PMCA manage the RAB becomes increasingly beneficial. This transition point corresponds to where the total tree size becomes greater than the capacity of the TLB. For smaller tree sizes, the application causes only compulsory TLB misses, i.e., TLB misses happening on the very first access to every memory page. Above the transition point, more and more TLB misses to previously mapped but at some point discarded pages (capacity misses) are generated until the performance saturates.

Figure 5.9: PC results for different vertex/graph sizes and the resulting operational intensities. The labels PMCA and Host denote the unit that manages the RAB in the corresponding curve. Panels: a) vertex size 44 B, graph size 0.44 MiB; b) vertex size 140 B, graph size 1.35 MiB; c) vertex size 524 B, graph size 5.02 MiB; d) vertex size 2060 B, graph size 19.67 MiB.

Figure 5.10: RFT results for different vertex and tree sizes, and the resulting operational intensities. The labels PMCA and Host denote the unit that manages the RAB in the corresponding curve. Panels: a) vertex size 28 B; b) vertex size 268 B; c) vertex size 1036 B; curves are shown for operational intensities I between 0.05 and 18 cycles/B, and markers indicate where the tree size equals the TLB capacity.

In this regime, the proposed design is up to 2.4 times faster than when the host manages the RAB. Also, it can be seen that increasing the vertex size (Fig. 5.10 a) to c)) does not notably affect performance in the regime dominated by capacity misses. The distance between a vertex and its successors increases with the depth in the tree. As the size of all vertices in a tree level reaches the size of a memory page, accessing the successors of a vertex on the next level always generates a TLB miss. A larger vertex size means that more data on the remapped page is accessed, which helps to amortize the cost of the miss despite a potentially very low operational intensity.

Memory copy (MC): Fig. 5.11 shows the performance for MC normalized to the ideal SVM system for different data sizes. In practice, the size of the L1 SPM might not suffice to hold the entire input data at once, and multiple iterations over the same data might be required to apply a specific application kernel. The number of iterations is denoted on the x-axis. For data sizes smaller than the TLB capacity, performing multiple iterations helps to amortize the compulsory TLB misses occurring during the first iteration. This effect is much more pronounced when the host is managing the TLB because misses are more expensive in this case. As the data size becomes greater than the TLB capacity, this effect diminishes. Also, the relative performance of the SVM schemes under comparison decreases with increasing data size.

Figure 5.11: MC results for different data sizes and a variable number of iterations. The labels PMCA and Host denote the unit that manages the RAB in the corresponding curve. Panels: a) data size 64 KiB; b) data size 192 KiB; c) data size 1024 KiB.

For larger data sizes, the number of DMA transfers increases while the size of the local SPM stays constant. Thus, the total SVM overhead associated with the DMA transfers also increases. Overall, letting the PMCA handle the TLB improves performance by factors of 3.8 to 4 compared with host-based management. MC offers a highly regular SVM access pattern known in advance. Our virtual memory management approach offers the flexibility to exploit this information on the host side at offload time for prefetching (Chap. 3). Fig. 5.12 compares the achievable performance of such a scheme when copying a buffer with a size of 1 MiB. Clearly, this improves performance beyond what is possible with purely reactive MH. The achievable performance is very close to that of ideal SVM.

Figure 5.12: MC effective bandwidth normalized to ideal SVM for host-based MH, PMCA-based MH, and host-side prefetching.

5.4.5 Real Application Benchmark Results

To evaluate the performance of the proposed SVM design for real-world applications operating on irregular, pointer-rich data structures, we have used the three applications described in Sec. 4.3.3 and App. A, which combine the access patterns of the synthetic benchmarks. PR [82] is a typical example of a PC application. Since the PMCA implementation of our evaluation platform does not feature floating-point units, we have used a custom fixed-point software library, which leads to an increased operational intensity. The vertices have a size of 20 B. The RHF application is part of the classification stage of an object detector [83]. Besides the actual classification stage, which exhibits the RFT access pattern on trees with 16 levels, a set of filtering operations is also executed on the PMCA. These filtering operations are well parallelizable, operate on the input images previously copied to the L1 SPM, and are highly compute-intensive. They can help to amortize potentially high overheads during phases operating on SVM. The transfer of the input image features an access pattern to shared memory very similar to MC. The third application performs FD using the well-known Viola-Jones algorithm [84] operating on degenerate decision trees. The access pattern to shared memory is similar to RFT. The intensive computation of the integral and squared integral images fed to the detector is also performed on the PMCA using data in the L1 SPM. Similar to RHF, the transfer of the actual input image features a streaming-type access pattern like that of MC.


Figure 5.13: Performance for real-world applications (PR with 4k and 10k vertices, RHFs with 2, 4, and 10 trees, and FD) normalized to ideal SVM when letting the host or the PMCA manage the RAB.

We have run the benchmarks for different data sets. Fig. 5.13 compares the measured performance normalized to ideal SVM when using the host or the PMCA to manage the RAB. For PR with 4,000 vertices as well as for FD, the pointer-rich data structure can be remapped completely with the available RAB slices. Only a few compulsory misses must be handled, and these are well amortized by the computation, independent of whether the PMCA or the host manages the RAB. As the number of vertices for PR increases to 10,000, the overall execution time starts to be dominated by capacity misses. Letting the PMCA manage the RAB improves the performance by a factor of roughly 1.5. For RHF, the highly compute-intensive filter operations help to partially amortize the many capacity misses during the classification phase (1.75 MiB of data per tree). As the number of trees and thus the number of capacity misses increases, more and more time is spent on handling these capacity misses while the amount of filter operations remains constant. Letting the PMCA manage the RAB using the VMM library presented in this work brings the accelerator performance to within 5% of ideal SVM.

5.5 Summary

In this chapter, we have presented a novel hardware-software solution for accelerator-managed SVM in HESoCs. Our solution is based on an accelerator-side helper thread using a VMM software library that enables the accelerator to autonomously perform the correctness- and performance-critical task of operating its VM hardware, including page table walking and IOTLB management. The library provides a simple API and is compatible with any host CPU architecture supported by the Linux kernel. In addition, the design retains the flexibility of a software-managed solution, which allows, e.g., the exploration of novel ideas such as managing the IOTLB in collaboration between host and accelerator. We have validated the design on our evaluation platform using both parameterizable benchmarks and real-world applications covering various application domains including linked data structures. Letting the accelerator manage its SVM hardware autonomously eliminates frequent interventions of the host CPU and brings substantial performance improvements by factors of up to 4x compared with host-based IOTLB management. Compared with an ideal system for SVM, the performance lies within 50% of ideal for purely memory-bound applications and within 5% for real applications.

To further increase the performance of our SVM system, the proposed design offers several options. First of all, the VMM library can be dynamically parallelized to multiple MHTs. This would allow dynamically allocating more PMCA compute resources to SVM management when many misses occur simultaneously and thus reduce the average miss latency without the costly overprovisioning for worst-case scenarios employed in fully hardware-managed designs. Second, the flexibility of a software-managed solution paves the way for smart IOTLB prefetching, i.e., a prefetching scheme that exploits not only run-time information extracted from the observed SVM access pattern, but also higher-level information available at compile time as well as the state of the worker threads themselves. Compared to the simple prefetching schemes implemented using dedicated hardware in today's systems, such a system allows for effective prefetching for a wide range of applications [41, 95]. Finally, using an alternative, optimized TLB design with substantially higher capacity for the RAB reduces the number of capacity misses. This option is addressed in Chap. 6.

Chapter 6

SVM for FPGA Accelerators

Reducing the TLB-miss latency through on-accelerator VM management effectively improves SVM and accelerator performance. However, the VM management is ultimately bound by the latency to main memory (see Sec. 5.4.3). To overcome this limitation, typical IOMMU designs feature additional address translation caches to locally cache the page table and reduce the latency of the PTW operation [28,64]. Alternatively, the architecture of the TLB itself can be optimized to better match the requirements of parallel accelerator architectures and to enable increased TLB capacity (fewer capacity misses) and thus less management overhead. In this chapter, we explore different design choices for the TLB, i.e., the most critical IOMMU building block. First, we consider a fully-parallel TLB with optimized look-up latency (1 cycle), in accordance with state-of-the-art designs. A fully-associative design enables variable-sized mappings for flexible operation. Similar to any fast memory, the downside of such a solution is poor scalability to large sizes, which ultimately translates into higher TLB miss rates. Motivated by the observation that in many cases the traffic of parallel accelerators is more bandwidth-sensitive than latency-sensitive, we consider a second design that tolerates larger look-up times to enable much larger TLB capacities and lower miss rates.


We call these two designs the level-one (L1) and level-two (L2) TLB, respectively, following the traditional latency/size-based classification in memory hierarchies. In addition, we explore a third, hybrid TLB design combining the best of both schemes.

For this exploration, we shift our focus to a different type of accelerator, namely generic, custom hardware accelerators deployed on FPGA. While the major FPGA vendors have for quite some time had devices on the market that combine multi-core, general-purpose host processors with FPGA fabrics in HESoCs [55,92], SVM is still not widely adopted in such systems. Previous works on SVM for FPGA accelerators mostly aimed at accelerating the TLB management by using either a soft processor [88] or dedicated hardware [64-67,80,89,96]. In contrast, we propose an IOMMU architecture combining a flexible L1 TLB with a new, scalable and configurable L2 TLB optimized for FPGA deployment, and that supports TLB prefetching transactions. Together with the user-space runtime library and kernel-level driver for managing the hardware, the proposed design can serve as a complete, plug-and-play framework suitable for exploring transparent SVM for custom FPGA accelerators.

Using a set of parameterized benchmarks extracted from real-world applications, for which SVM is very beneficial to allow for a heterogeneous implementation at reasonable programming effort, we demonstrate the benefits of the proposed SVM framework. Our results show that, unlike for CPUs where address translation has to be performed for every memory access, TLB look-up latency is not critical for accelerators relying on SPMs. In such architectures, the TLB is not in the critical path of the accelerator, and address translation is primarily needed for latency-insensitive DMA transfers used to copy data between these SPMs and SVM. Relaxing the TLB look-up latency allows for the construction of larger TLBs, thereby lowering the overall miss-handling overhead. Compared to related works, our design increases TLB capacity by factors of 16x and more while achieving lower overall resource cost and higher or comparable clock frequencies. Moreover, we show that queuing multiple outstanding TLB misses and handling them in batch mode is highly beneficial for parallel accelerator architectures and improves performance over previous works optimized for low-latency miss handling using costly, dedicated hardware. Also, we find that letting accelerators coherently access data from the caches of the host avoids the need for costly cache flushes (and thus reduces the TLB miss-handling latency), but that it is not always beneficial, as the achievable bandwidth of coherent accelerator interfaces can be up to 45% lower compared to dedicated DRAM interfaces. Compared to copy-based memory sharing, which is still the state of the art for FPGA-based HESoCs, our SVM framework allows for speedups between 1.5 and 12x for purely memory-bound kernels.

The remainder of this chapter is organized as follows. Sec. 6.1 discusses related work. The implementation of our design is presented in Sec. 6.2. In Sec. 6.3, we discuss our results and compare our design with other SVM designs reported in the literature.

6.1 Related Work

In the HPC domain, SVM for letting FPGA accelerators themselves orchestrate data transfers from and to main memory has proven beneficial both for performance and programmability [2, 86]. While academic works focus on SVM solutions consisting of TLBs managed in software running either on the host [86] or on a dedicated soft processor core [88], the industry's approach is that of full-blown hardware for maximum performance [2-4]. To optimize off-chip bandwidth utilization between host and FPGA, such systems employ FPGA-side data caches, as well as transaction coalescing and reordering [22], which further increases design complexity and leads to considerable resource utilization. For example, IBM's CAPI utilizes around 25% of the resources provided by a medium-sized, high-end FPGA for data centers [97]. The same design would almost completely occupy even the largest FPGAs found in embedded SoCs, leaving only few resources for the actual accelerator. Clearly, the same level of hardware support is not feasible for embedded systems. Some HPC systems also use private accelerator DRAM to avoid the latency and bandwidth bottleneck between host and FPGA, and they rely on special APIs to allocate pinned host memory for maximum accelerator performance. This complicates application development, as pinned memory leads to higher management overheads for the host [22].

In contrast, FPGA fabrics in embedded SoCs are more tightly integrated into the memory system of the host, leading to lower latency and higher bandwidth to main memory. This diminishes the benefits of additional, private DRAM but moves the SVM interface more into focus for accelerator performance, as it is used for all accesses to off-chip memory. Current-generation FPGA-enabled SoCs [55, 92] do not provide SVM support in the form of IP cores and associated driver software.

To enable SVM for FPGA accelerators designed with high-level synthesis (HLS) tools, a recent study [67] proposes an HLS framework extension to provide each shared data element with its private address translation hardware. At offload time, a kernel-level driver on the host locks all memory pages touched by these data elements and creates an optimized translation table for the hardware. While the design allows tailoring the SVM subsystem to the application at hand, it provides little run-time flexibility and it is not usable when operating on pointer-rich data structures. Another approach relies on OS support to enable SVM using per-thread hardware IOMMUs [89]. The focus of this large and versatile framework primarily lies on providing the application developer with a uniform view of software threads executing on the host and hardware threads mapped to FPGA logic. The SVM subsystem is, however, just a small component inside a large framework. As a consequence, compared to our proposal, i) its internals have not been studied in depth and cannot simply be decoupled from the rest of the framework to be used with custom FPGA accelerators; ii) the interaction between hard- and soft-threads is not as streamlined as ours, as it is designed on top of heavier-weight interfaces such as Portable Operating System Interface (POSIX) threads.

Most designs stick to less sophisticated accelerator interaction and memory sharing models [98, 99], where the shared data is placed in contiguous memory using a specific user-space API [66] or by replacing the standard malloc() with a customized implementation [65]. Virtual-to-physical address translation is typically performed explicitly by the host as part of the preparation of a DMA transfer from the contiguous main memory region to the accelerator's local memories [29, 51, 53]. This model, where the host is responsible for the explicit management of the FPGA accelerator memory, is not suitable for applications that operate on pointer-rich data structures or perform fine- to medium-grained offloads [53]. Moreover, contiguous memory allocation has several other drawbacks, as the Linux implementation has shown [31]. For example, if the kernel must first copy other data out of the pre-allocated, contiguous memory region before it can be used for data shared with the FPGA, a long latency results. Moreover, there is no guarantee that the pre-allocated region can be freed and made available at run time. Finally, CMA returns uncached memory: letting the host operate on such memory is clearly very inefficient.

Ideally, a full-fledged hardware IOMMU [28,64], similar to what is nowadays found in high-end SoCs based on PMCAs [100,101], provides the required functionality to enable true SVM for custom FPGA accelerators in the embedded systems domain.
Opting for maximum performance and complete abstraction of the underlying SVM system, these designs include hardware page-table walker engines, coherent translation caches and large data buffers to absorb memory transactions, including DMA transfers, missing in the TLB. With the Xilinx Zynq UltraScale+ MPSoC, the next-generation SoC featuring such a hard-macro IOMMU is becoming available [27]. However, embedded SoCs lack infrastructure software that lets programmers interface their user-space applications with the kernel's IOMMU API and low-level hardware drivers. In addition, modifications to the ARM-specific implementation of this kernel API are required to let the IOMMU directly operate on the process page table instead of creating an empty I/O page table upon IOMMU initialization, similar to some desktop-class systems with IOMMU support.1 Without the latter, the handling of page faults in this I/O page table, which has to be carried out by software running on the host, can quickly become the major bottleneck. It is currently not foreseen in Zynq UltraScale+ to use the IOMMU for giving an FPGA accelerator direct access to user-space memory. The IOMMU serves its original purposes of protecting the host from malicious or faulty DMA devices and drivers [24] and of providing the DMA engine with the illusion of a physically contiguous buffer for higher performance (refer to Sec. 1.3). As such, the host initiates the data movement from and to the FPGA memory through the Linux DMA API [29]: the IOMMU is set up as part of every DMA transfer preparation. This is clearly not sufficient to support SVM between host and FPGA accelerator.

1 ARM has released experimental kernel patches that aim at 1) supporting SVM in ARM-based systems and 2) unifying different AMD-, Intel- and ARM-specific IOMMU API extensions for SVM. However, these only target future ARM IOMMU architecture revisions and are not compatible with current-generation IOMMU devices such as the one in Zynq UltraScale+. See: git://linux-arm.org/linux-jpb.git svm/rfc1

In addition, some recent research papers have highlighted that state-of-the-art IOMMUs might require modifications to address the needs of a particular target platform and/or application domain. Some point to the fact that a naive IOMMU configuration cannot meet the requirements of today's high-performance customized accelerators, as it lacks efficient TLB support (no flexibility whatsoever can be expected from such hard-macro IP cores) [87]. The authors propose to leverage the host's per-core MMU to implement the required additional functionality, as augmenting the hardware design to that aim would increase the complexity beyond what is affordable (this is especially true for low-end FPGA-based SoCs [102]). Other papers consider the problem of hardware complexity a showstopper to implementing SVM in heterogeneous, low-end SoCs, and thus propose less intrusive solutions consisting of a simple hardware IOTLB that is completely software-managed by a kernel-level driver on the host [37, 85, 86]. The obvious downside of this approach is that the frequent interactions with the host and the associated interrupt latency can impose considerable overheads at accelerator run time. Focusing on IOMMU design for FPGA accelerators in embedded systems, a recent paper [80] highlighted that the best performance is achieved by a private, hardware-managed IOTLB. However, the speedup of roughly 12% compared to an IOTLB managed by software running on the host comes at the price of a 106% increase in logic resources with respect to the software-managed IOTLB. While the use of larger TLBs can also help to reduce the TLB service time, especially large and fully-associative TLBs quickly become costly in terms of resources [103, 104]. Alternative options for improving the performance of SVM at lower cost in FPGA-based heterogeneous embedded SoCs, such as modifying the architecture of the TLB itself, have not yet been widely addressed by the research community.

The SVM framework proposed in this chapter enables true SVM between host and FPGA accelerators by relying on a plug-and-play approach that allows the instantiation of a lightweight IOMMU in the FPGA logic. The framework requires neither modifications to the host architecture nor does it restrict the programmer to specialized memory allocators. Low-level details are handled within a kernel-level driver and runtime library running on the host, without additional overheads at offload time. Our design allows coherent FPGA accelerator access to the data caches of the host as well as TLB prefetching transactions for improved performance. Unlike other designs providing similar capabilities, our solution enables fine-grained control of such features for different address ranges.

6.2 Infrastructure

The FPGA accelerator is interfaced to the shared main memory interconnect through the IOMMU, which translates virtual addresses as seen by the user-space application and the FPGA accelerator to the corresponding physical addresses in main memory. At the heart of any IOMMU design sits an IOTLB. Hard-macro IOMMUs currently employed in high-end SoCs [28,100,101] additionally feature circuitry for TLB management such as hardware PTW engines. However, the operation of the PTW itself is bound by the latency to main memory (see Sec. 5.4.3). Moreover, although some hardware IOMMUs are capable of directly and coherently operating on the host page tables, the Linux IOMMU API and the hardware drivers do not support this on ARM-based host systems [27]. Instead, a separate and empty I/O page table is generated at setup time. The first TLB miss to every page then generates a costly page fault that must be handled in software by the host by mapping the corresponding page to the I/O page table. The hardware management only helps for subsequent TLB misses on pages already mapped. Alternatively, all pages holding the shared data must be mapped at offload time, which is impracticable when operating on pointer-rich data structures. Finally, due to the decoupling of the I/O and the process's page table, the only way to ensure that the IOMMU does not use stale page-table data at any time is to prevent the mapped pages from being moved by page pinning, which further aggravates the cost for mapping and page fault handling.

For these reasons, we do not consider hardware PTW and rely on a fully software-managed IOMMU design. Our focus in this chapter is on studying parameterizable TLB designs that allow for larger TLBs (thereby reducing capacity misses) and higher flexibility. The proposed design is highly configurable and efficiently uses the basic building blocks provided by today's FPGA devices. It is complemented by a software component for transparent SVM management to ease interfacing with custom accelerators and instantiation on the FPGA. This enables SVM for FPGA accelerators on IOMMU-less SoCs, but it also proves beneficial when a hardware IOMMU is available:

1. as the address-translation hardware can be flexibly tailored to the needs of the application at hand, and the granularity at which cache coherence can be controlled can be tuned by the application programmer (with a beneficial effect on performance);

2. as the software management component can be used to control hardware IOMMUs (currently limited to simple operation in accordance with DMA transfers) to comply with the SVM semantics.

6.2.1 IOTLB Design for FPGA Accelerators

Fig. 6.1 shows the block diagram of the proposed IOMMU. The FPGA accelerator connects as a master device to the slave port of the IOMMU, using the widely adopted AXI4 protocol on all its interfaces [91]. To enable the accelerator to react to TLB misses (and to repeat the missing transaction once the IOMMU configuration has been updated by the host), we assume an accelerator wrapping methodology similar to [105], where a wrapper core provides the synchronization infrastructure on the accelerator side (see Sec. 6.3.1). The VA is fed to the control block together with meta information (length, transaction type and ID, AXI User signals) to perform a look-up of the VA in the TLB. A transaction that hits in the TLB is forwarded to the corresponding physical address in shared main memory. Based on the master select flag stored in the TLB, the transaction is forwarded to either of the two AXI4 master ports. The double data rate (DDR) port directly connects to the DRAM controller of the host CPU. The ACP offered by FPGA-enabled SoCs [55, 92] allows the FPGA to directly access the most recent data copies from the data caches of the host without the need for the operating system to flush the caches at offload time.2

Figure 6.1: Block diagram of the IOMMU.


Read or write responses are directly forwarded from the downstream interconnect to the FPGA accelerator. If a transaction misses in the TLB, its VA, ID and the AXI User signals are stored inside the miss FIFOs, and an interrupt is sent to the host CPU. The host then uses the AXI-Lite interface to read the miss FIFOs and reconfigure the TLB accordingly (see Sec. 6.2.2). In parallel, the IOMMU drops the transaction and signals a slave error in the AXI Read/Write Response back to the wrapper core inside the FPGA. The IOMMU does not block and can continue to handle address translations for other transactions to shared memory issued by the accelerator.

Using the AXI User signals, the accelerator can also mark transactions as TLB prefetches. In this case, the IOMMU performs the look-up in the TLB and signals back the result to the accelerator using the AXI Read/Write Response signals. However, the transaction is not forwarded to shared main memory, neither in the miss nor in the hit case. This allows the accelerator to request the setup of the TLB ahead of time without paying the latency to shared memory. All AXI interfaces support configurable data and address widths. Also, the number of AXI4 ports is configurable, each of them having private TLBs.
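As an illustration of how an accelerator-side control core might use such prefetch transactions before a multi-page DMA transfer, consider the following hedged C sketch. The helpers issue_prefetch_read() and wait_for_tlb_ready() are hypothetical placeholders for the platform-specific mechanism that sets the prefetch bit in the AXI User signals and for the host-to-accelerator synchronization; they are not part of the actual framework API.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Hypothetical platform hooks (not the real API):
 *  - issue_prefetch_read(): performs a read to 'va' with the prefetch bit
 *    set in the AXI User signals and returns 0 on a TLB hit, -1 on a miss
 *    (derived from the AXI read response).
 *  - wait_for_tlb_ready(): blocks until the host signals that all queued
 *    misses have been handled. */
extern int  issue_prefetch_read(uintptr_t va);
extern void wait_for_tlb_ready(void);

/* Make sure all pages touched by an upcoming DMA transfer are mapped.
 * All misses are queued in the miss FIFOs and handled by the host in one
 * batch, so the control core sleeps at most once. */
static void prefetch_dma_range(uintptr_t va, size_t len)
{
    uintptr_t page = va & ~(PAGE_SIZE - 1);
    uintptr_t end  = va + len;
    int missed = 0;

    for (; page < end; page += PAGE_SIZE)
        missed |= (issue_prefetch_read(page) != 0);

    if (missed)
        wait_for_tlb_ready();   /* woken by the host after batch handling */
}
```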

2 Next-gen SoCs [27] replace the ACP with more advanced ports supporting the ACE-Lite coherency extensions of the AXI4 protocol. Our IOMMU design is also compatible with ACE-Lite-enabled SoCs.

Flexible, Single-Cycle L1 TLB

The first TLB design is optimized for low look-up latency and high flexibility. To this end, it is implemented using LUT and register slices of the FPGA [106] instead of BRAM hard macros [107]. It has a look-up latency of 1 clock cycle, is fully associative and allows for arbitrary-sized mappings, i.e., multiple memory pages can be remapped using a single TLB entry if they are contiguous in virtual as well as in physical memory. This allows the design to efficiently support techniques aimed at reducing the host TLB miss rate, such as transparent huge pages [108] and contiguous physical memory obtained, e.g., through the contiguous memory allocator (CMA) [30]. Low TLB look-up latency and high associativity are important whenever virtual-to-physical address translation is in the critical path, such as for CPUs where every memory transaction has to be translated. However, such designs do not scale well to larger sizes and quickly become very costly in terms of resources [103, 104]. In practice, they are limited to sizes of up to 64 entries.
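To illustrate the behavior (not the actual RTL), the following C model sketches such a fully-associative look-up with arbitrary-sized entries: each entry stores a virtual base, a size and a physical base, and all entries are compared against the input address, conceptually in parallel (a single cycle in hardware). The entry fields and names are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define L1_ENTRIES 32  /* e.g., 32 entries as in the evaluation platform */

struct l1_entry {
    bool     valid;
    bool     writable;
    uint64_t vbase;   /* virtual start address of the mapping    */
    uint64_t size;    /* arbitrary size (multiple of page size)  */
    uint64_t pbase;   /* physical start address of the mapping   */
    bool     use_acp; /* master select: ACP vs. DDR port         */
};

static struct l1_entry l1_tlb[L1_ENTRIES];

/* Behavioral model of the single-cycle, fully-associative look-up: in
 * hardware all comparisons happen in parallel; here we just loop. */
static bool l1_lookup(uint64_t va, bool is_write,
                      uint64_t *pa, bool *use_acp)
{
    for (int i = 0; i < L1_ENTRIES; i++) {
        const struct l1_entry *e = &l1_tlb[i];
        if (e->valid && va >= e->vbase && va < e->vbase + e->size) {
            if (is_write && !e->writable)
                return false;             /* disallowed transaction type */
            *pa      = e->pbase + (va - e->vbase);
            *use_acp = e->use_acp;
            return true;                  /* hit */
        }
    }
    return false;                         /* miss: reported via miss FIFO */
}
```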

Set-Associative, Multi-Cycle L2 TLB

To overcome the limitations of traditional fully-parallel, fully-associative FPGA TLB designs [64, 88, 104], we propose a new and scalable TLB architecture for FPGAs. It relies on sequentially searched BRAM to allow for high-capacity TLBs and thus a reduced overall TLB service time at reasonable FPGA resource cost. To achieve look-up latencies comparable to fully-parallel, pipelined TLB designs, our architecture 1) is n-way set-associative, 2) parallelizes the look-up using multiple dual-port BRAM cells, and 3) starts the look-up at the position of the last TLB hit, which brings the effective look-up latency down to the minimum of 3 cycles for bandwidth-critical workloads. The maximum look-up latency is

$L_{\mathrm{max}} = 2 + \frac{n_{\mathrm{ways}}}{2 \cdot n_{\mathrm{RAMs}}}$,

where $n_{\mathrm{RAMs}}$ is the number of BRAMs searched in parallel and $n_{\mathrm{ways}}$ is the associativity (set size). Typically, $L_{\mathrm{max}}$ is between 4 and 18 clock cycles. How the sets are distributed over the parallel BRAMs is shown in Fig. 6.2 a). The mapping of VAs to sets is based on the least significant bits (LSBs) of the page frame number (PFN).

Figure 6.2: a) Mapping of TLB entries to 4 VA RAMs and b) proposed L2 TLB architecture using dual-port BRAMs.



In contrast to the flexible L1 TLB, this design is restricted to page-sized mappings, but its architecture is highly configurable at compile time. The configuration parameters include the number of sets, the number of ways (set size, associativity), the number of BRAMs searched simultaneously during the look-up procedure, and the page size. The architecture of this TLB is visualized in Fig. 6.2 b). Upon receiving an input VA and a start signal from the top-level control block of the IOMMU, the set index is determined and forwarded to the parallel search units together with the virtual PFN. Every search unit is connected to a dual-port BRAM cell holding the VAs (VA RAMs). Per cycle, every search unit reads two entries of the current set from the VA RAM (using both ports of the BRAM cell) and compares the PFNs stored in these entries with the PFN of the input VA. If an entry is found that allows the requested transaction type, the index of the matching entry is forwarded to the control block together with a hit signal. The control block aborts the search, uses the offset of the matching entry to read the corresponding physical PFN from the single physical address (PA) RAM, and then outputs the PA together with control flags to the top-level control block. The offset of the matching entry is stored in a set-specific register in the control unit. The next search within a set starts at the offset of the last hit, which speeds up the search and exploits possible locality of reference in the access pattern to shared memory.

In case the entire set is searched without finding a valid entry or if a valid entry is found that does not allow the required transaction type, the search is aborted and a corresponding signal is sent back to the top-level control block. The configuration of the VA RAMs is performed through Port 1.
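The following C model sketches the search procedure described above (a behavioral approximation, not the RTL): the set index is taken from the LSBs of the PFN, the parallel search units together examine two entries per BRAM per cycle, and the search starts at the offset of the last hit in that set. The array layout and field names are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12
#define N_SETS     32   /* evaluation configuration: 32 sets              */
#define N_WAYS     32   /* 32 ways -> 1024 entries in total               */
#define N_RAMS     4    /* 4 VA RAMs searched in parallel                 */

struct l2_entry {
    bool     valid;
    bool     writable;
    uint64_t vpfn;   /* virtual page frame number  */
    uint64_t ppfn;   /* physical page frame number */
};

static struct l2_entry l2_tlb[N_SETS][N_WAYS];
static unsigned        last_hit[N_SETS];      /* per-set last-hit offset  */

/* Behavioral model of the set-associative, multi-cycle look-up.
 * Per "cycle", the hardware compares 2 * N_RAMS entries (both ports of
 * each dual-port BRAM); worst case 2 + N_WAYS / (2 * N_RAMS) cycles,
 * i.e., 6 cycles for this configuration. */
static bool l2_lookup(uint64_t va, bool is_write, uint64_t *pa)
{
    uint64_t vpfn = va >> PAGE_SHIFT;
    unsigned set  = vpfn & (N_SETS - 1);      /* set index from PFN LSBs  */
    unsigned base = last_hit[set];

    for (unsigned i = 0; i < N_WAYS; i++) {
        unsigned way = (base + i) % N_WAYS;   /* start at the last hit    */
        const struct l2_entry *e = &l2_tlb[set][way];

        if (e->valid && e->vpfn == vpfn) {
            if (is_write && !e->writable)
                return false;                 /* disallowed transaction   */
            last_hit[set] = way;
            *pa = (e->ppfn << PAGE_SHIFT) | (va & ((1ULL << PAGE_SHIFT) - 1));
            return true;
        }
    }
    return false;                             /* miss */
}
```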

Hybrid TLB

A real system can benefit from a combination of the two TLB designs in a hybrid architecture: a small L1 TLB with 4 to 8 variable-sized entries, e.g., to efficiently support techniques designed to reduce host TLB misses such as CMA, and a large L2 TLB with higher look-up latency and page-sized but much cheaper entries for regularly allocated memory pages. Thus, our design supports the instantiation of both TLB blocks. To allow for parallel look-ups of the two TLBs in the hybrid architecture, our design supports hit under miss (HUM), which can lead to transaction reordering. To ensure correct ordering in both the AXI Write Address and Write Data channels, typical hardware IOMMUs [28] buffer the entire write data burst, which can require a considerable amount of buffer memory (an AXI4 burst can comprise up to 256 data beats). Instead, our IOMMU just uses a FIFO with a size equal to the maximum look-up latency of the L2 TLB (HUM FIFO in Fig. 6.1). As visualized in Fig. 6.3 a), for a missing transaction AW0 with a burst length lower than or equal to the L2 TLB latency, the data W0 is buffered by the HUM FIFO. In the meantime, transaction AW1 hits in the L1 TLB and is fed to the output. Once the associated data W1 arrives at the input, it bypasses the HUM FIFO and is directly fed to the output. When the physical address for AW0 has been found in the L2 TLB, the transaction is fed to the output and data W0 is retrieved from the FIFO. If the burst length is larger than the L2 TLB latency, there is no reordering, as shown in Fig. 6.3 b). The data W0 is fed to the HUM FIFO and transaction AW1 hits in the L1 TLB. Since the data W1 is not yet available at the input, AW1 is not sent to the output. Once the HUM FIFO is full, the write data input is stalled. When the physical address for AW0 has been found in the L2 TLB, it can be forwarded to the output together with W0. The remaining data beats of W0 pass through the HUM FIFO to the output.


Figure 6.3: HUM support in the write channel for burst lengths a) lower and b) greater than the max look-up latency of the L2 TLB.

Transaction AW1 can finally be sent out. Even if the HUM FIFO were sufficiently large to buffer the entire burst W0, AW1 could not be sent out faster, as AXI4 does not allow write-data interleaving and the associated data W1 becomes available only after W0. If a transaction misses in the L1 TLB while the L2 TLB is performing a look-up, the input address channels are stalled until the L2 TLB becomes available again. Note that the proposed SVM framework does not block the accelerator's traffic to shared memory in the case of outstanding TLB misses, independently of the selected TLB architecture and organization. The purpose of the HUM feature is just to allow for parallel look-ups if both TLBs are instantiated inside the IOMMU.

6.2.2 SVM Management

Fig. 6.4 a) visualizes how the hardware and the software layers of the framework interact. Once the application that needs to be accelerated is started, the runtime instructs the driver module to register a kernel worker thread for handling TLB misses with the CMW API of Linux using a system call, as shown in Fig. 6.4 b).

Figure 6.4: a) Hierarchical overview of the SVM framework and b) interaction of the components in operation.


The shared data elements can be allocated in memory by the user-space application as any other variable using the system's standard malloc() function. This allows for simple development and porting of already existing applications. When offloading computation to the FPGA accelerator, the application developer just needs to specify the virtual address of the shared data elements, which is then communicated to the accelerator, e.g., through a memory-mapped mailbox. The accelerator can then access the shared data element using the virtual address pointer obtained from the runtime. In the case of a TLB miss, the interrupt handler inside the driver module simply triggers the execution of a miss-handling thread in normal process context. Once the miss-handling thread gets scheduled, it first reads the address and transaction attributes from the miss-handling FIFOs in the IOMMU. Then, it locks the requested user-space page in memory using get_user_pages(), performs the virtual-to-physical address translation, and sets up a new entry in the TLB. If all TLB entries are in use, the oldest mapping is invalidated and the corresponding user-space page is unlocked (FIFO replacement). Based on the transaction attributes, the miss-handling thread then sends a signal to the accelerator, which can finally repeat the transaction that previously missed in the TLB (see Sec. 4.2.2).

In the case of a TLB miss that resulted from a prefetching transaction (marked in the AXI User signals and in the transaction attributes stored in the miss-handling FIFOs), no signal needs to be sent to the accelerator and the miss-handling thread can directly continue handling misses until the miss-handling FIFOs are empty. The use of prefetching transactions allows the accelerator to request the setup of multiple TLB entries with a single TLB-miss interrupt, for example before setting up a DMA transfer touching multiple memory pages. This can improve performance as the miss-handling thread can handle multiple TLB misses in batch mode and thus needs to be scheduled at most once, similar to the page fault handler of some IOMMUs found in high-end SoCs [109].

In addition, the runtime library allows the application developer to specify virtual address ranges and associate them with a TLB and an IOMMU master port at offload time. The miss-handling thread then sets up new TLB entries within the specified range according to these settings. The library also allows for setting up the TLB statically at offload time and for locking the corresponding TLB entries on a per-data-element basis, if the shared data elements can be completely remapped with the available TLB entries. Concretely, this allows fine-grained control of which address ranges (i.e., program data items) are to be looked up in the host cache (through the ACP) and which in the DRAM. Similarly, it is possible to pin address ranges to either the L1 or the L2 TLB. To this end, the runtime only needs the virtual address, size and access permissions, similar to the computation offloading directives of today's programming models for heterogeneous systems [26, 81].
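A heavily simplified, hedged sketch of such a miss-handling routine is given below. It is not the actual driver code of the framework: the register offsets and names are illustrative assumptions, only a single page is handled per invocation, and the exact signature of the page-pinning call varies across kernel versions (the evaluation platform runs Linux 3.18).

```c
#include <linux/mm.h>
#include <linux/io.h>
#include <linux/workqueue.h>

/* Hypothetical register map of the IOMMU config port (illustrative only). */
#define MISS_FIFO_VA     0x00  /* faulting virtual address               */
#define MISS_FIFO_META   0x04  /* transaction attributes (ID, prefetch)  */
#define TLB_CFG_VA       0x10  /* TLB entry setup: virtual address       */
#define TLB_CFG_PA       0x14  /* TLB entry setup: physical address      */
#define TLB_CFG_CTRL     0x18  /* valid, write-enable, master select     */

extern void __iomem *iommu_regs;   /* mapped AXI-Lite config interface   */

/* Worker scheduled from the TLB-miss interrupt handler; it runs in process
 * context because pinning user pages may sleep. The real driver uses the
 * (kernel-version-dependent) get_user_pages() family for this. */
static void tlb_miss_work(struct work_struct *work)
{
    u32 va   = ioread32(iommu_regs + MISS_FIFO_VA);
    u32 meta = ioread32(iommu_regs + MISS_FIFO_META);
    struct page *page;
    int pinned;

    /* Pin the faulting user-space page (write access requested). */
    pinned = get_user_pages_fast(va & PAGE_MASK, 1,
                                 1 /* write; arg meaning varies by version */,
                                 &page);
    if (pinned != 1)
        return;  /* e.g., an invalid pointer was passed by the application */

    /* Set up a new TLB entry (FIFO replacement is handled by the driver);
     * 32-bit physical addresses on the Zynq-7000 evaluation platform. */
    iowrite32(va & PAGE_MASK,            iommu_regs + TLB_CFG_VA);
    iowrite32((u32)page_to_phys(page),   iommu_regs + TLB_CFG_PA);
    iowrite32(0x1 /* valid, writable */, iommu_regs + TLB_CFG_CTRL);

    /* For regular misses, the driver would now wake the accelerator so it
     * can repeat the transaction; for prefetch misses (flagged in 'meta'),
     * it keeps draining the miss FIFO instead. */
    (void)meta;
}
```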

6.3 Experimental Results

In Sec. 6.3.1, we discuss the platform and the benchmarks used to evaluate the application performance of the proposed SVM framework. We evaluate the FPGA resource utilization and maximum clock speed of the design for different TLB configurations in Sec. 6.3.2. In Sec. 6.3.3, we discuss the cost of various primitives of our framework, profiled using microbenchmarks. The performance of the proposed framework is evaluated using real application kernels, including ones operating on pointer-rich data structures, in Sec. 6.3.4. Finally, we compare our design with other works reported in the literature in Sec. 6.3.5.

6.3.1 Evaluation Platform

To evaluate the performance of our SVM framework and explore the different TLB designs in a real system, we set up an evaluation platform based on the Xilinx Zynq-7000 SoC [55]. An overview of this platform is shown in Fig. 6.5.

Figure 6.5: Evaluation platform with FPGA accelerator featuring eight parallel engines, SPM, DMA engine and soft core for control.


The programmable system of the Zynq SoC is used as the host system, with its dual-core ARM Cortex-A9 CPU running Xilinx Linux 3.18. The coherent interconnect features an additional slave port (i.e., the ACP), which allows hardware blocks implemented in the programmable logic of the SoC to access the data caches inside the CPU. In case the requested data is not held in the cache, the transaction is sent to main memory using the high-priority port of the Zynq's DRAM controller. The system features 1 GiB of DDR3 DRAM clocked at 600 MHz, which is shared between the host and the FPGA. To interface the programmable logic of the Zynq SoC with the host, we instantiate an IOMMU configuration with two ports. The first port is used for host-to-accelerator communication and uses an L1 TLB with 4 entries only. The second port is used for accelerator-to-shared-memory communication. It features two master ports: one connected to the DDR DRAM controller of the host and one connected to the ACP. Also, it is used to instantiate both TLB designs. The L1 TLB is instantiated with 32 entries, i.e., the maximum size for which clock frequencies above 100 MHz are achievable (see Sec. 6.3.2). The L2 TLB is instantiated with 4 parallel VA RAMs, 32 sets and 1024 entries in total, and a maximum look-up latency of 6 cycles. The host runs at 666 MHz and the programmable logic at 100 MHz. Tbl. 6.1 gives the absolute and relative FPGA resource utilization of the IOMMU configuration used for the evaluation.

Table 6.1: FPGA resource utilization of the IOMMU configuration selected for the experimental evaluation.

Sub Block            LUTs             FFs              BRAM
L1 TLBs a            6.63 k   3.03%   4.67 k   1.07%    0 kbit      0.00%
L2 TLB b             0.29 k   0.13%   0.14 k   0.03%   45.05 kbit   0.24%
Buffers & Control    1.71 k   0.78%   2.75 k   0.63%    1.09 kbit   0.01%
Total                8.63 k   3.94%   7.56 k   1.73%   46.14 kbit   0.25%

a 32 entries (accelerator to host), 4 entries (host to accelerator)
b 1024 entries (4 VA RAMs with 32 sets, accelerator to host)

The selected configuration uses less than 4% of the available resources and leaves plenty of space for actual accelerators.3

3 For example, the implementation of an accelerator for sparse matrix-vector multiplication [110] would use roughly 60% of the LUTs and FFs, and 57% of the BRAMs offered by the same FPGA device.

Attached to the IOMMU, there is a programmable FPGA accelerator [105]. Eight accelerator engines can directly access a local L1 SPM (256 KiB) and the main memory via both single-word loads/stores and more efficient DMA burst accesses. A soft core is in charge of higher-level control tasks. These include the configuration of the accelerator engines, the management of DMA transfers between SPM and main memory, and the synchronization with the host. This core can also be used to ensure that the corresponding TLB entries are set up in the IOMMU. This is achieved by issuing prefetching read accesses to the memory pages touched by subsequent DMA transfers and evaluating the read response returned by the IOMMU (see Sec. 4.2). In the case of a miss, it goes to sleep and waits for the host to set up the TLB. After handling the miss, the host wakes up the soft core that generated the miss. To avoid cache pollution by the accelerator when accessing the shared memory through the ACP, DMA transactions are configured not to be allocated in the data caches of the host in the case of cache misses. The maximum DMA burst size is 256 B, and the accelerator's peak bandwidth to main memory is 6.4 Gb/s.

The accelerator engines generate the memory access patterns of four real applications, which have been parameterized to capture the effect of varied input datasets. In particular, we have run the PC, RFT and MC kernels described in Sec. 5.4.2.

Due to their irregular memory access patterns, which are data dependent, and since they feature low locality of reference, PC and RFT represent worst-case scenarios for virtual memory systems. However, such kernels are at the basis of a wide variety of applications effectively accelerated using FPGAs thanks to their high degrees of parallelism [111, 112]. MC is a representative example of streaming applications with low operational intensity. With sparse matrix-vector multiplication (SMVM), we have implemented a fourth benchmark application typically amenable to FPGA acceleration. The host encodes a potentially very large, sparse matrix in Condensed Interleaved Sparse Representation (CISR) [110] and then passes virtual address pointers to the matrix as well as to the input and output vectors to the accelerator. The CISR format statically schedules the computation to 8 parallel accelerator pipelines on a row basis, thereby allowing for load balancing at low communication overhead and resource cost [113]. The accelerator uses DMA transfers to first fetch the dense input vector and then continuously streams in the sparse matrix.
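For reference, the computation performed by the SMVM benchmark is the classic y = A*x kernel over a sparse matrix. The sketch below uses plain CSR (compressed sparse row) storage rather than the CISR encoding and static 8-pipeline scheduling described above; it only illustrates, under those simplifying assumptions, the arithmetic the accelerator performs while dereferencing the pointers passed by the host over SVM.

```c
#include <stddef.h>

/* Plain CSR representation (illustrative; the thesis uses CISR [110],
 * which additionally interleaves rows across 8 parallel pipelines). */
struct csr_matrix {
    size_t        n_rows;
    const size_t *row_ptr;  /* n_rows + 1 entries              */
    const size_t *col_idx;  /* column index per non-zero entry */
    const double *values;   /* non-zero values                 */
};

/* y = A * x over shared virtual memory: the accelerator dereferences the
 * same virtual-address pointers the host passed at offload time. */
static void smvm(const struct csr_matrix *a, const double *x, double *y)
{
    for (size_t row = 0; row < a->n_rows; row++) {
        double acc = 0.0;
        for (size_t k = a->row_ptr[row]; k < a->row_ptr[row + 1]; k++)
            acc += a->values[k] * x[a->col_idx[k]];
        y[row] = acc;
    }
}
```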

6.3.2 FPGA Resource Utilization

To evaluate the resource utilization and maximum clock frequency of the TLB designs, we have implemented various configurations for a XC7Z045FFG900-3 device using Xilinx Vivado 2016.3. The system uses a 32-bit address width, a 64-bit data width and a page size of 4 KiB. Fig. 6.6 compares the resource utilization and the TLB look-up time for different configurations of the two TLB designs when targeting maximum clock speed. For the flexible, single-cycle L1 TLB design, we varied the number of entries. The resource utilization of the L1 TLB increases linearly with the number of entries. The same holds for the longest path delay, as the comparator network for performing the single-cycle TLB look-up does not map well to FPGA logic. The multi-cycle L2 TLB design outperforms the L1 TLB in terms of resource utilization and longest path delay. For example, using the L2 TLB design to build a fully-associative TLB with 32 entries reduces the logic resource utilization by more than 17x compared to the L1 TLB (marker 1 in Fig. 6.6). This comes at the price of a larger and variable TLB look-up latency and the restriction to page-sized mappings.


Figure 6.6: Logic resource utilization vs. look-up time for the two TLB designs when optimizing for maximum speed.

The minimum look-up latency for this TLB is 3 clock cycles, whereas the maximum latency depends on the number of parallel VA RAMs. The longest path runs from the output of the VA RAMs through the search units into the PA RAM. Adding a local pipeline stage would further reduce the longest path delay. The main advantage of this design is that it enables the construction of high-capacity TLBs at low resource cost. As shown in Fig. 6.6, switching from a fully-associative to a 32-way set-associative configuration allows for a 32x increase in TLB capacity at an increase in logic resource utilization and longest path delay of just 20% and 4%, respectively (marker 2 in Fig. 6.6). By varying the number of parallel VA RAMs, the design of the L2 TLB allows for a trade-off between maximum look-up latency and resource utilization while keeping the number of entries constant. This trade-off is visualized in Fig. 6.7 a) for the same configuration with 1024 entries divided into 32 sets. As shown in the figure, the number of slice LUTs increases logarithmically with the number of VA RAMs, whereas the number of slice FFs decreases with more VA RAMs. The reason for this decrease in FFs is that, as the number of parallel VA RAMs increases, the number of bits required to store the offset of the last hit decreases. Using 4 parallel VA RAMs offers a good trade-off between a maximum look-up latency of 6 cycles and logic resource utilization. Note that, as the number of parallel VA RAMs increases from 1 to 4, the utilization of the individual BRAM cells decreases by a factor of 4x from 75% to 19%.4


Figure 6.7: Resource utilization and max look-up latency of the set-associative, multi-cycle L2 TLB design for different a) numbers of parallel VA RAMs and b) sizes and associativity with 4 VA RAMs.

How the logic resource utilization with 4 VA RAMs changes with the number of sets and the number of ways (set size) is shown in Fig. 6.7 b). The configuration using 32 sets and 32 ways (1024 entries) offers a good trade-off. For TLB sizes greater than 1024 entries, either the logic resource utilization or the maximum look-up latency increases sharply.

6.3.3 Microbenchmarking and Profiling

We have evaluated the cost of the software primitives of our SVM system using a synthetic benchmark. The host passes a pointer to an array allocated in virtual memory to the accelerator. The accelerator reads the pointer and performs accesses to the array. Using performance counters both inside the host and the accelerator, the duration of the various phases during miss handling can be profiled. The results are shown in Tbl. 6.2. The average cost for handling a TLB miss is 28,300 cycles when the accelerator accesses the data from DRAM. Using the ACP avoids the need to flush the data caches of the host and allows for 18% faster miss handling. Most of the miss-handling time is spent waiting for the kernel worker thread to get scheduled (51 or 62% when accessing through the DDR port or the ACP, respectively).

4 Xilinx 7 Series devices feature BRAM cells with fixed dimensions of 16 or 32×1024 bit (not including parity bits) [55].

Table 6.2: Average delay in host cycles for sharing a single 4 KiB memory page.

                                        Response   Schedule   Handling     Total
TLB Miss, DDR                              4,400      9,900     14,000    28,300
TLB Miss, ACP                              4,400      9,900      8,900    23,200
Copy Page Out to Shared Memory                                            43,500
Copy Page In to Virtual Memory                                            87,500
Copy Page From SVM to SPM, DMA Read                                        5,100
Copy Page From SPM to SVM, DMA Write                                       4,400

This is required as the routine walking the page table and pinning the pages uses a kernel API which may sleep and, therefore, cannot be executed in interrupt context. These numbers are in line with related work. For example, responding to an interrupt was reported to take 5,300 to 13,300 host cycles on the same SoC [67]. However, switching to a different platform can have a high impact. Handling a page fault and pinning the page in memory on a low-end PowerPC 405 embedded processor takes 391,000 cycles [89]. Handling a miss in a software-managed TLB attached to an Intel Core i5 processor over PCIe takes 124,000 cycles [86].

Copying a single 4 KiB memory page of raw data from virtual memory to the physically contiguous, uncached memory reserved for the accelerator in case the platform does not support SVM takes 43,500 host cycles. Copying back a page that the accelerator has modified costs at least 87,500 cycles. While the implementation of a routine to copy raw data is straightforward, the traversal and translation step in case the shared data contains virtual address pointers is completely application dependent and left to the programmer. The associated design-time overheads are huge and usually require the programmer to completely rethink and restructure the application. Also, the traversal and translation step incurs a big run-time overhead at offload time, which is much larger than that of copying raw data. However, once the costly offload including the data copying has been done, the accelerator's accesses to the shared data come at no overhead. In contrast, the cost of setting up a TLB entry has to be paid on every miss to a particular page. Thus, if data reuse is high, copy-based shared memory can outperform SVM. This will be highlighted by the exploration in Sec. 6.3.4.
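As a rough illustration of this crossover, the following back-of-the-envelope estimate is based on the numbers in Tbl. 6.2 (DDR path), under the simplifying assumption that every renewed access to an evicted page costs one full miss while copy-based sharing pays the raw-data copy once per page; it deliberately ignores the DMA transfers to and from the SPM, which are paid in both schemes.

```latex
% Simplified crossover estimate for a single 4 KiB page that is both
% read and written (values from Tbl. 6.2, host cycles).
C_{\mathrm{copy}}    \approx 43{,}500 + 87{,}500 = 131{,}000
C_{\mathrm{SVM}}(k)  \approx 28{,}300 \cdot k \quad \text{for } k \text{ misses to the same page}
C_{\mathrm{SVM}}(k) > C_{\mathrm{copy}} \iff k \gtrsim 4.6
```

Under these assumptions, copying the raw data only becomes cheaper once a page is evicted and re-missed roughly five times or more, which is consistent with the high-reuse scenarios explored in Sec. 6.3.4.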

[Compared configurations in Fig. 6.8: L1 TLB with 32 entries (1 cycle), L2 TLB with 1024 entries (6 cycles), L2 TLB with 2048 entries (10 cycles); ports: DDR, ACP hit, ACP miss; interconnect peak bandwidth 6.4 Gb/s.]

Figure 6.8: DMA bandwidth to SVM for a) read and b) write transfers when using different TLB configurations.

For maximum performance, the accelerator operates on data residing in its local SPM. This memory is physically addressed and managed by the accelerator itself, which uses high-bandwidth DMA transfers to copy data between SVM and the SPM. The latency of setting up a single DMA transfer is 6 accelerator-clock cycles. Tbl. 6.2 also lists the full cost for transferring a single 4 KiB memory page between SVM and the SPM using the DMA engine.

To evaluate possible effects of the TLB configuration and look-up latency on the maximum DMA bandwidth, we used another synthetic benchmark that lets the accelerator issue DMA transfers with the maximum size of 32 KiB. The TLB is statically set up for this measurement and the host is idle during the measurement. As shown in Fig. 6.8 a), the highest bandwidth for read transfers (90% of the peak bandwidth of the interconnect) is achieved when using the ACP and if the requested data is in the data caches of the host (ACP hit). If the data is not in the caches (ACP miss), the effective bandwidth is substantially reduced, as the data has to be retrieved from DRAM using the same physical interconnect used by the host. In case the host is heavily loaded, the traffic injected by the accelerator can lead to additional contention and performance degradation. Compared to using the DDR port, the achievable bandwidth is roughly 45% lower. Letting the FPGA coherently access data from the caches of the host is thus not always beneficial. For write transfers, the maximum bandwidth is between 85 and 90% of the peak bandwidth, as shown in Fig. 6.8 b). The TLB configuration and maximum look-up latency do not have an impact on the maximum read bandwidth. This is a result of the proposed L2 TLB design starting the look-up at the position of the last TLB hit.

6.3.4 Real Traffic Patterns

In this section, we evaluate the performance of the SVM framework and the various TLB designs (L1, L2, hybrid) using the 8 accelerator engines, under various workload conditions listed in Tbl. 6.3 and Tbl. 6.4. The selected parameters represent a mix of real use cases [83, 93, 114, 115], problem sizes suitable for embedded systems, and performance crossover points of the system. For pointer chasing (PC), we used an Erdős-Rényi graph [116] with 10 k vertices (input data is randomized). During our experiments, we found that graphs with different vertex counts do not lead to big differences in the results as long as the total graph size is larger than the TLB capacity. Random forest traversal (RFT) operates on random input numbers which are sorted by the tree (the number of input samples is typically in the order of thousands to millions). Therefore, the obtained access patterns to shared main memory are highly irregular and randomized, and the benchmarks represent a worst-case scenario for an SVM subsystem. When comparing with copy-based shared memory, data reuse matters. Therefore, we varied the number of input samples fed to RFT as well as the number of iterations performed in PC. The number of iterations performed varies between 1 and 6 for most applications, while some require up to 70 [93]. In many cases, the execution is stopped after 5 iterations, where 95% of the vertices have converged [115]. For memory copy (MC), we consider as main parameters the size of the data chunks (64 KiB and 1024 KiB) and the number of compute iterations executed on the offloaded data (data reuse). For sparse matrix-vector multiplication (SMVM), we have used four matrices with different sizes and numbers of non-zero entries from the University of Florida Sparse Matrix Collection [117], as shown in Tbl. 6.4.

Table 6.3: Parameter Sets for MC, PC and RFT benchmarks.

Memory copy
  #Iterations          1 - 64
  Data Size [KiB]      64 - 1,024

Pointer chasing                          Random forest traversal
  #Vertices            10 k              #Tree Levels         4 - 16
  Vertex Size [B]      44 - 2,060        Vertex Size [B]      28 - 268
  #Cycles per V.       10 - 10,000       #Cycles per V.       10 - 1,000
  #Iterations          1 - 64            #Input Samples       256 - 2,048

Table 6.4: Matrices used for SMVM [117].

Matrix Name    #Rows    #Cols    #Non-Zero Entries    Data Size [KiB]
power           4,941    4,941              13,188                180
ca-HepTh        9,877    9,877              51,996                561
Dubcova1       16,129   16,129             269,138              2,355
olafu          16,146   16,146           1,031,302              8,309

We have measured the accelerator run time (including offloading time) for the different schemes and normalized the results to copy-based memory sharing, the state of the art in embedded heterogeneous systems. The accelerator uses prefetching transactions before setting up DMA transfers touching multiple pages. This improves performance as the host needs to handle at most one TLB-miss interrupt and schedule the miss-handling thread at most once per DMA. With our design, the application developer has fine-grained control of the IOMMU settings on the basis of virtual address ranges and/or data elements. For example, this allows associating a specific address range with the L2 TLB and the DDR port for maximum bandwidth, while the ACP is used for other shared data elements. As MC and PC both share only a single data element, we have evaluated the different settings in isolation, without the hybrid TLB configuration, for these two benchmarks.

In contrast, the data sizes for RFT and SMVM quickly exceed the size of the data cache of the host. This reduces the efficiency of DMA transfers on the ACP. Thus, we used only the DDR port for these two benchmarks.

Pointer Chasing (PC)

Fig. 6.9 shows the performance of PC (different curves represent different TLB schemes), normalized to the performance of the baseline copy-based shared memory. The x-axes of the plots show the total graph size when increasing the vertex size from 44 B to 2 KiB. The two upper plots refer to a configuration where, for each vertex loaded from memory, 10 cycles are spent on computation. For the two lower plots, the computation cycles per vertex are 10 k. Typical PC applications perform 1 to 6 iterations on the graph [93, 115], thus we report results for 1 iteration (plots a and c) and 4 iterations (plots b and d). For a single iteration, using the L2 TLB achieves speedups of up to 2.2x. L2 DDR performs better for graph sizes below 4 MiB, i.e., when the entire graph can be remapped by the L2 TLB and the execution is thus bandwidth limited. As the graph size increases, the number of TLB misses increases. The execution time becomes more dominated by handling capacity misses in the TLB, while the ACP performs better as no cache flushes are required. As the number of iterations on the graph increases, the offload cycles in the case of copy-based shared memory are less predominant. The relative performance of the SVM schemes deteriorates for graph sizes above the TLB capacity. As the number of compute cycles per vertex increases to 10 k, the relative speedup when using the L2 TLB decreases, as shown in Fig. 6.9 c) and d), respectively. Due to the increased operational intensity, the time spent for memory transfers accounts for a smaller portion of the total run time. For larger vertex and graph sizes, more computation cycles help to overlap the miss-handling time with useful computation. In contrast, copy-based shared memory does not allow overlapping the costly offload procedure with accelerator computations. The relative performance saturates at around 60% even when using the much smaller L1 TLB.


Figure 6.9: PC performance for varying vertex/graph size and number of iterations; 10 cycles per vertex in a) and b), 10 k cycles per vertex in c) and d).

It is important to note that, while SVM can perform worse than copy-based memory sharing in some scenarios, its programmability is always much better. This is particularly true for PC, where the copy-based approach required a complete rewrite of the original program.

Random Forest Traversal (RFT)

Fig. 6.10 a) and b) plot the performance versus increasing tree depth for different vertex sizes. Since during the offload of this application only few pointers need to be adjusted, and since the accelerator only performs read accesses to the data structure, already little data reuse amortizes the offload cost. In contrast, the access pattern to the tree is highly irregular. This creates many capacity misses when using the L1 TLB. Due to the irregular pattern and the small vertex size, there is little temporal locality, which unveils the higher maximum look-up latency of the L2 TLB in a). The relative performance increases for growing tree depths. On one hand, the higher look-up latency of the L2 TLB is compensated by the higher overall hit rate. On the other hand, a lot of data is copied that is potentially never accessed by the accelerator when using copy-based shared memory.

[Plot panels: a) Vertex Size = 28 B, #Samples = 256 and b) Vertex Size = 268 B, #Samples = 256 over the number of tree levels; c) Vertex Size = 28 B, #Levels = 16 and d) Vertex Size = 268 B, #Levels = 16 over the number of samples. Y-axis: Performance Normalized to Copy-Based Shared Memory; curves: Hybrid, L2 and L1 at an operational intensity of 0.2 C/B (plus L2 at 20 C/B in a) ) and the copy-based baseline; speedups reach up to 12x in b).]

Figure 6.10: RFT performance for different operational intensities in cycles per Byte (C/B) when varying the tree depth in a) and b), and when varying the number of samples fed to the tree in c) and d).

A larger operational intensity brings the performance closer to that of copy-based memory sharing for two reasons. First, it causes the total run time to be less dominated by memory transfers; second, it allows to overlap useful computation with the handling of TLB misses. Larger vertex sizes let SVM become more beneficial as they amortize the miss-handling cost (more data is accessed per miss). This effect increases with the tree depth as, on the lower levels, almost every access to the tree is a miss. This can lead to very high speedups of up to 12x as shown in b). The speedup decreases as the number of samples fed to the tree increases, as shown in Fig. 6.10 c) and d). The highest performance is achieved when using the hybrid design and allocating the trees in physically contiguous memory. Accesses to the trees then always hit in the L1 TLB. However, the hybrid design is not suitable for on-line learning [114] as the host needs to regularly access the physically contiguous, uncached memory to update the trees. In this case, only the L2 TLB should be used. It performs better for up to 1536 samples (tree updates usually have to be performed much earlier).

Memory Copy (MC)

Fig. 6.11 shows the performance of MC normalized to copy-based shared memory for a data size of 64 KiB and 1024 KiB. The maximum speedup is 2.2x (L2 ACP). As the size of the SPM might not suffice to hold all the required data for a specific accelerator kernel, multiple iterations over the same input data might be required, e.g., to apply a whole set of different filter kernels to a given input image. The x-axes of the plots denote the number of iterations performed on the data. There are multiple reasons for the drop in relative performance with an increasing number of iterations. First, in the case of copy-based shared memory, more accesses to the copied data amortize the initial offload cost. Second, with more iterations, the execution time becomes dominated less by handling TLB misses and more by the effective main memory bandwidth if the data size is below the TLB capacity. Using the ACP should be avoided in this case due to the lower bandwidth resulting from cache misses and contention.

[Plot panels: a) Data Size = 64 KiB; b) Data Size = 1024 KiB. X-axis: #Iterations; y-axis: Performance Normalized to Copy-Based Shared Memory; curves: L2 DDR, L1 DDR, L2 ACP, L1 ACP and the copy-based baseline. Annotations mark the miss-dominated and bandwidth-dominated regimes; L2 DDR is always fastest, L1 does not outperform L2, and the maximum bandwidth of the ACP is lower.]

Figure 6.11: MC performance for different number of iterations and data sizes.

Fig. 6.11 a) shows that, while the ACP gives the best performance for few iterations, the performance using the ACP saturates at 60% of the DDR port in the bandwidth-dominated regime. The L1 TLB performs equally well as the L2 TLB, but it does not outperform it, despite the lower look-up latency. The FPGA accelerator uses latency-insensitive DMA transfers to access the main memory. The actual computations are performed on local SPMs for which address translation is not required. Unlike for CPUs, the TLB is not in the critical path for FPGA accelerators. Instead of optimizing the TLB for low look-up latency, the available FPGA resources are better invested in building a TLB with larger capacity. As the data size and the number of iterations increase (Fig. 6.11 b) ), only the configuration using the large L2 TLB and the DDR port can compete with copy-based shared memory (L2 DDR).

Sparse Matrix-Vector Multiplication (SMVM)

Independent of the problem size, the performance with SVM is at least 1.5x higher compared to copy-based shared memory, as shown in Fig. 6.12. SMVM features a linear access pattern to shared memory.

[Bar chart comparing the L1, L2 and Hybrid configurations for the matrices power, ca-HepTh, Dubcova1 and olafu; y-axis: Performance Normalized to Copy-Based Shared Memory.]

Figure 6.12: SMVM performance for different sparse matrices.

Every input and output data element is read and written exactly once by the accelerator, respectively. Only compulsory TLB misses happen and the speedup is proportional to the cost ratio between handling a TLB miss and copying a memory page between contiguous and virtual memory (see Tbl. 6.2). The only benefit of a larger TLB (L2) is that it allows for larger DMA transfers, which leads to a slight increase in performance. The use of a hybrid configuration leads to a significant increase in performance. In such a configuration, the matrix is allocated in a virtually and physically contiguous memory region remapped using a single but variable-sized entry in the L1 TLB, while the input and output vectors are remapped using the page-sized entries of the L2 TLB. Accesses to the matrix thus never miss in the L1 TLB. Since for many practical applications of SMVM the same matrix is multiplied many times with varying input data and exclusively accessed by the accelerator, it pays off to let the host place the matrix once in a contiguous but uncached memory region, e.g., obtained through CMA. In contrast, the input and output vectors, which are also processed by the host but accessed only once by the accelerator, should be allocated in normal, cached memory remapped using the L2 TLB.
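As an illustration of this placement strategy, the following sketch shows how an application could split its SMVM allocations under the hybrid configuration. The allocation calls svm_alloc_contig() and svm_alloc() are hypothetical placeholders (contiguous, uncached CMA-backed memory vs. regular paged memory); they are not the framework's actual API, and the CSR-like matrix layout is only an example.

/* Hypothetical allocation split for SMVM under the hybrid TLB configuration:
 * the matrix is placed in contiguous, uncached memory covered by a single,
 * variable-sized L1 TLB entry; the vectors stay in normal, cached memory
 * remapped page by page through the L2 TLB. All API names are placeholders. */
#include <stddef.h>

extern void *svm_alloc_contig(size_t size);  /* hypothetical: CMA-backed, uncached */
extern void *svm_alloc(size_t size);         /* hypothetical: regular paged memory */

struct sparse_matrix {
    double *values;    /* non-zero values           */
    int    *col_idx;   /* column index per non-zero */
    int    *row_ptr;   /* row start offsets         */
};

static void setup_smvm(struct sparse_matrix *m, size_t nnz, size_t n,
                       double **x, double **y)
{
    /* Matrix: prepared once by the host, then read many times by the accelerator. */
    m->values  = svm_alloc_contig(nnz * sizeof(double));
    m->col_idx = svm_alloc_contig(nnz * sizeof(int));
    m->row_ptr = svm_alloc_contig((n + 1) * sizeof(int));

    /* Input/output vectors: also processed by the host, so keep them cacheable. */
    *x = svm_alloc(n * sizeof(double));
    *y = svm_alloc(n * sizeof(double));
}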

6.3.5 Comparison with Related Works

We first compare our design with other works reported in the literature in terms of resource utilization, clock speed and TLB configuration. Then, we provide a performance comparison using the four studied application kernels.

[Table 6.5: Comparison of different IOMMU FPGA implementations: this work vs. Estibals [67], Winterstein [96], Shamani [104], Mirian [80], Kornaros [64], Ammendola [88] and Ng [86]. Columns: LUTs [k], FFs [k], BRAM [kbit], #TLB entries, look-up latency [cycles], clock frequency [MHz], look-up time [ns] and FPGA family (technology node). Resource numbers refer to Kintex-7 devices; the scaling factors were obtained by synthesizing an FPGA accelerator for the various device families.]

Hardware Comparison with Related Works

Tbl. 6.5 compares relevant IOMMU FPGA implementations reported in the literature with a hybrid configuration of our design featuring 4 variable-sized L1 TLB entries and an L2 TLB (with 4 parallel VA RAMs and 1024 entries). The maximum clock frequency of this configuration is 185 MHz. To further increase the performance of our design, e.g., when used to interface an FPGA accelerator running at faster clock speeds or one that saturates the available bandwidth, our design can still be adapted by adjusting the data width of the AXI4 interfaces, which is a configurable design parameter. Only three of the considered IOMMU FPGA implementations run faster than our design, whereas in terms of effective look-up time ours is the best. The software-managed MMU for soft processors from Shamani et al. [104] is comparable to our design in terms of FPGA technology, speed and resource utilization. Its two TLB levels are fully-associative and implemented using fully-parallel, pipelined FPGA logic. This enables high clock speeds but makes the TLB resource hungry5 and limits it to substantially smaller capacities. To achieve high clock frequencies, the two fastest designs both implement a (heavily) pipelined TLB using BRAM cells. This leads to a higher resource utilization and a look-up latency which is comparable to [64] or much higher than [88] that of our IOMMU, despite the much smaller TLB capacity. Using private HLS-generated translation hardware for every data element shared with the accelerator [67] might lead to a better resource utilization in some cases, but this clearly depends heavily on the target application. In the worst reported case, the use of resources is higher than ours, for a design that runs significantly slower. The pipelined page-table walker engine from Winterstein et al. is equipped with two intermediate TLBs and a physically-addressed data cache [96]. This module enables SVM for HLS accelerators described in OpenCL. It uses more resources and runs slower than our proposal, despite the use of direct-mapped TLBs of lower capacity. The smallest of all the considered designs employs a software-managed TLB, but its performance is not among the best, as the TLB is searched sequentially [80].

5The L2 TLB alone uses more than 60% of the resources.

Relying on a hardware-managed, virtually-addressed data cache to reduce the pressure on the low-capacity TLB and the memory [86] results in the highest memory consumption. It has to be underlined that all the approaches that we compare to block memory traffic upon encountering the first (or second [64]) TLB miss, and until this miss has been handled. In contrast, our design delivers non-blocking operation. It simply enqueues the missing request, drops the transaction and continues to translate requests.

Performance Comparison with Related Works

The focus of previous work on SVM for FPGA accelerators lies on reducing the TLB service time by using either a soft processor [88] or dedicated hardware [64–67, 80, 89, 96] for managing the TLB with a size of 64 entries at most. As opposed to letting the host manage the TLB, this reduces the latency of TLB misses substantially. However, the host is still required to pin the shared memory pages, either on the first, compulsory TLB miss to a page (causing a page fault) or for all shared memory pages at offload time. Otherwise, the OS might at some point move the shared page in physical memory or to swap space, thereby causing the corresponding TLB entry to become invalid and causing the accelerator to corrupt the memory of other processes or the OS itself. In addition, most proposed designs support no outstanding TLB misses without blocking any traffic [67, 80, 86, 88, 89, 96, 104], or just a single one [64]. Compared to our design, which simply enqueues outstanding TLB misses and in parallel continues to serve hitting memory transactions, this leads to a complete serialization of the traffic. Thus, the full interrupt latency has to be paid for every page fault. To estimate the performance of such designs in real applications and compare them with our design, we have run our benchmarks using an L1 TLB with 64 entries and profiled the number of compulsory misses and capacity misses in the TLB. The total run time is then estimated by summing up the accelerator run time in the copy-based memory sharing configuration (no SVM-related overheads during accelerator execution) and the SVM-related overheads for the accelerator execution and the offloading sequence. The cost of a page fault (compulsory miss, first miss to every page) is equal to the total TLB miss-handling cost in our scheme (Tbl. 6.2).
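Written out, the estimate takes the following form (our notation; a sketch of the cost model just described, not a formula from the original text):

\[
T_\mathrm{est} = T_\mathrm{offload} + T_\mathrm{acc,copy} + N_\mathrm{comp} \cdot C_\mathrm{fault} + N_\mathrm{cap} \cdot C_\mathrm{cap} ,
\]

where $T_\mathrm{acc,copy}$ is the accelerator run time measured with copy-based memory sharing, $T_\mathrm{offload}$ the SVM-related offload overhead (e.g., for pinning pages at offload time), $N_\mathrm{comp}$ and $N_\mathrm{cap}$ the profiled numbers of compulsory and capacity misses, $C_\mathrm{fault}$ the cost of a page fault, and $C_\mathrm{cap}$ the cost of a capacity miss handled on the FPGA, as quantified in the next paragraph.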

Further TLB misses (capacity misses) can be handled directly on the FPGA and take 1350 host clock cycles in the case of a soft processor managing the TLB [88], or 540 host clock cycles when the TLB is managed in hardware [89]. In case the shared memory pages can be pinned at offload time, there are no page faults. Note however that page pinning at offload time is only an option if the application does not make extensive use of pointer-rich data structures. Otherwise, an application-specific offload sequence is required to traverse the data structure (as with copy-based shared memory). The cost of pinning a single memory page at offload time is equal to the cost of handling a TLB miss in our scheme (Tbl. 6.2, without response and scheduling latency, see Sec. 3.4.2).

The accelerator engines have been configured as follows: For PC, we used a vertex size of 44 B, 10 cycles per vertex, and 1 to 4 iterations on the graph, similar to PageRank [82]. We used 3 different graphs with 10 k, 40 k and 100 k vertices to vary the total data size. For RFT, we chose 16 tree levels, 28 B for the vertex size, 10 computation cycles per vertex and 256 or 2048 samples. MC was run with a data size of 64 KiB for 1 iteration to match the loading of the multi-channel image patches prior to an RFT-like classification phase [83]. For SMVM, we used the four matrices listed in Tbl. 6.4.

PC: As shown in Fig. 6.13 a), the use of a soft processor (SW MGMT) or dedicated hardware (HW MGMT) together with a single-cycle, 64-entry TLB does not necessarily lead to higher performance. The reason is mainly that compulsory TLB misses lead to page faults, which need to be handled by the host, even if the TLB is managed by the accelerator. Independent of the SVM scheme, the speedup compared to copy-based shared memory decreases as the number of iterations on the graph increases, which amortizes the initially high offload cost. Only as the graph size increases beyond the capacity of the large TLB (100 k vertices) does managing the TLB on the FPGA start to pay off. This is due to the much smaller latency for handling capacity misses directly on the FPGA. Selecting a larger TLB size (L2, 2048 Entries, 10 Cycles) in our system allows to avoid capacity TLB misses and the associated performance degradation. The higher maximum look-up latency does not affect performance even for smaller graphs in PC.

[Bar charts, higher is better; y-axis: Performance Normalized to Copy-Based Shared Memory. Panels: a) PC (10k/40k/100k vertices, 1 and 4 iterations each), b) RFT (256 and 2048 samples), c) MC, d) SMVM (power, ca-HepTh, Dubcova1, olafu). Compared configurations: L2, 1024 Entries, 6 Cycles; L2, 2048 Entries, 10 Cycles; Hybrid, 1024 Entries, 6 Cycles; Hybrid, 2048 Entries, 10 Cycles; 64 Entries, 1 Cycle, SW MGMT (with and without faults); 64 Entries, 1 Cycle, HW MGMT (with and without faults).]

Figure 6.13: Normalized performance of different SVM designs for a) PC, b) RFT, c) MC and d) SMVM.

RFT: Since RFT does not make heavy use of virtual address pointers, the shared data can be pinned at offload time to avoid page faults. In this case, all TLB misses can be handled on the FPGA (64 Entries, 1 Cycle, SW/HW MGMT, No Faults) as shown in Fig. 6.13 b). The performance is close to that of copy-based shared memory also for larger numbers of samples. The highest performance is achieved by the hybrid design. A larger TLB with 10 instead of 6 cycles maximum look-up latency has only a small impact on performance (2%), and only when the number of samples is small.

MC: MC features a linear access pattern to SVM and thus only produces compulsory TLB misses and page faults but no capacity misses. In our design, the IOMMU does not block on the first TLB miss. Together with the prefetching transactions, this allows to queue multiple TLB misses, which can be handled by the host in one batch. This allows to increase performance by roughly 60% compared to when managing the TLB on the FPGA, as shown in Fig. 6.13 c). Pinning the memory pages at offload time improves the performance of such designs, but at best slightly beyond what is achievable with our framework.

SMVM: For SMVM, the performance is bound by handling compulsory misses/page faults as shown in Fig. 6.13 d). Pinning the shared data at offload time helps to improve performance. However, the performance of the designs managing the TLB on the FPGA is always below that of our design. The reason is twofold: First, these designs support neither queuing of page faults nor multiple outstanding TLB misses. While either of the two is being handled, the IOMMU blocks any traffic to SVM. Second, SMVM uses multiple parallel DMA streams to stream in the sparse matrix in CISR format, which further amplifies the benefits of supporting multiple outstanding TLB misses. The highest performance is achieved when using a hybrid design as enabled by our SVM framework.

6.4 Summary

In this chapter, we have presented a plug-and-play framework for exploring SVM for FPGA accelerators in HESoCs. The design consists of a configurable IOMMU IP core performing virtual-to-physical address translation for the accelerator's accesses to shared memory. This block is managed in software by a kernel-level driver and a user-space runtime library linked to the application. It allows the application programmer to simply share virtual address pointers with the FPGA accelerator without the need to use specialized memory allocators and the like. The low-level details are handled by the framework without incurring any offload-time overheads.

We have evaluated the design using parameterizable benchmarks operating on pointer-rich data structures representative of applications for which SVM is a must to enable reasonable design-time effort, and that heavily stress the SVM subsystem. The proposed design allows for speedups between 1.5x and 12x compared to copy-based shared memory, which is still the state of the art for HESoCs. We have found that, unlike for CPUs, TLB look-up latency is not critical for FPGA accelerators, as the TLB is not in the accelerator's critical path to the SPM, but mainly accessed for latency-insensitive DMA transfers. Relaxing the TLB look-up latency allows for the construction of larger TLBs and thus a lower miss rate and overall miss-handling overhead with fewer FPGA resources. Our design offers multiple tuning knobs to tailor the address-translation hardware to the application at hand. Support for multiple outstanding TLB misses and multiple-miss handling in batch mode are the capabilities of our design that proved the most beneficial for the performance of parallel FPGA accelerator architectures. The presented results further show that, contrary to intuition, coherent access to the host's data caches is not always beneficial. While avoiding the need for costly cache flushes reduces the miss-handling latency, the maximum memory bandwidth is 45% lower compared to accessing the data directly through a dedicated port on the DDR memory controller. For optimal performance and maximum flexibility, our framework allows the application programmer to optionally specify the interface to use on an address range or shared data element basis.

In addition, the runtime library and the kernel-level driver of our framework can be adapted to allow the hardware IOMMUs found in some next-generation HESoCs to be used for true SVM, far beyond what is achievable with standard software distributions as for example provided for the Xilinx Zynq UltraScale+ MPSoC. Also, combining the proposed IOMMU hardware with on-accelerator VM management as discussed in Chap. 5 allows to further improve the efficiency and performance of the design. Both these options are evaluated in Chap. 7.

Chapter 7

Full-Fledged IOMMU Hardware for SVM

In the previous chapters of this thesis, we discussed the design of mixed hardware-software frameworks for enabling lightweight SVM in power- and area-constrained HESoCs, as well as optimizations to the most critical framework components such as the IOTLB and its management. In contrast, some of the next-generation, fully-programmable HESoCs gradually becoming available [27] feature full-fledged, hardware IOMMUs integrated into the system as hard-macro IP cores [9]. In principle, the employed hardware blocks would allow for SVM between the host processor and various (I/O) devices including custom accelerators implemented in FPGA hardware, similar to the IOMMUs found in modern, high-end desktop processors [7, 8]. However, studying the associated IOMMU software framework reveals severe limitations. On one hand, the corresponding Linux IOMMU API implementation does not allow to directly associate a user-space process with an IOMMU address translation context. Instead, the software always generates an empty IOVA space upon hardware initialization. Additional software, which is not provided, is needed to explicitly map user-space memory pages to this IOVA space to make it accessible to the accelerator through the IOMMU hardware. On the other hand, the basic, low-level hardware driver for the IOMMU

integrated into the Linux kernel just serves the purpose of registering the hardware with the system and interfacing it with the Linux IOMMU API. Interfacing the IOMMU hardware also with the actual application to be accelerated requires additional user-space as well as kernel-space software which is not publicly available. As a consequence, the IOMMU cannot be easily used for implementing SVM between the host processor and custom accelerators. In fact, the standard software distributions and programming models currently do not foresee using the IOMMU for giving custom accelerators direct access to user-space memory in these HESoCs, but rather for isolating the host system from malicious or faulty DMA devices and drivers (see Sec. 1.3) [29].

In this chapter, we present the required software stack for enabling SVM using the full-fledged hardware IOMMUs available in next-generation HESoCs. In particular, we use the Xilinx Zynq UltraScale+ MPSoC featuring an ARM system memory management unit (SMMU) and extend the software of our SVM framework to also enable SVM using this hard-macro IOMMU. Using a set of parameterized benchmarks extracted from real-world applications, we then evaluate the performance of the resulting framework. Finally, we provide a performance comparison with our lightweight SVM framework for PMCAs comprising the extensions to the PMCA cluster hardware and compiler presented in Chap. 4, the software library for on-accelerator VM management presented in Chap. 5, and the hybrid, high-capacity IOTLB design presented in Chap. 6.

Our results demonstrate that to fully exploit the potential of modern, full-fledged IOMMU hardware for SVM, modifications to low-level drivers and IOMMU software frameworks are required. Otherwise, the performance is reduced by 40% to 70%. The performance of our lightweight SVM framework lies within 25% of what is achievable with hardware IOMMUs and optimized software stacks in non-strictly memory-bound scenarios. Due to the support for physically contiguous VM, it can even outperform such designs and achieve near-ideal performance.

The rest of this chapter is organized as follows. Related work is discussed in Sec. 7.1. Sec. 7.2 gives an overview of the employed hard-macro IOMMU, the required software stack, and how it can be integrated in our SVM framework. The performance of the resulting SVM framework is evaluated and compared to our lightweight SVM design in Sec. 7.3.

7.1 Related Work

Originally, hardware IOMMUs were introduced to protect the host system from malicious or faulty DMA devices and drivers in high-throughput I/O scenarios (see Sec. 1.2) [24]. Since such devices typically directly operate on kernel memory, sharing the kernel page table (PT) with the IOMMU would still more or less expose the entire system. Only the creation of a separate IOVA space, e.g., by the Linux IOMMU API, and the explicit mapping of the DMA buffer to this address space allows to adequately protect and isolate the host system. However, the cost of mapping and unmapping a memory page to this IOVA space is high, and these operations can quickly become the main bottleneck in such scenarios [24, 25].

Using hardware IOMMUs for efficiently sharing virtual user-space memory with accelerators such as embedded or discrete GPGPUs in today's high-end desktop processors [7, 8] required substantial modifications to various components. For example, the OSs and kernel APIs required adaptations to enable the direct sharing of user-space PTs between host processor and IOMMU. In addition, the IOMMU hardware needs to sustain the substantially higher degrees of parallelism [109]. The accelerator hardware, as well as drivers and runtime systems, must also be adapted to efficiently use the IOMMU hardware [10, 11]. Similar IOMMU hardware combined with custom software stacks also enables SVM in FPGA-accelerated high-performance computing (HPC) systems [2–4].

As for embedded systems, the situation is different. While some high-end HESoCs support SVM between ARM-based CPUs and embedded GPGPUs, the internals of these systems are not known to the public and the software stacks are completely closed [13, 14]. Similarly, some GPUs for embedded SoCs feature dedicated hardware MMUs, but these operate on a special PT format different from that of the CPU, and they can only be used through proprietary and closed software stacks [118]. In contrast, the hardware IOMMUs found in some next-generation, fully-programmable HESoCs such as the Xilinx Zynq UltraScale+ MPSoC come with basic open-source driver software, there exists free documentation, and they are exposed to the developer. But most importantly, these IOMMUs are also in the path from the

FPGA fabric to main memory. As a consequence, they can be used not just for studying SVM, e.g., in the context of GPGPUs and associated programming models, but can also be leveraged to build a custom SVM framework and to study SVM with any type of accelerator, be it a custom hardware accelerator or a PMCA emulated in FPGA logic.

7.2 Infrastructure

For the work presented in this chapter, we used the Xilinx Zynq UltraScale+ MPSoC [27]. This next-generation, high-end HESoC is the first device on the market to feature a programmable FPGA fabric, i.e., the programmable logic (PL), with a full-fledged, hard-macro IOMMU in the path from the PL to main memory. As such, it is perfectly suited to evaluate the potential of full-fledged hardware IOMMUs for enabling SVM in HESoCs, and to compare them with the more lightweight SVM schemes designed in this thesis. In Sec. 7.2.1, we first give an overview of the hardware architecture and operation of the IOMMU employed in the Zynq MPSoC. The software stack is then discussed in Sec. 7.2.2.

7.2.1 SMMU Architecture and Operation

The hardware IOMMU employed in the Xilinx Zynq UltraScale+ MPSoC is an ARM CoreLink MMU-500 system memory management unit, the SMMU [28]. This device is a highly flexible and configurable IP core which supports address translation for multiple parallel, independent devices in ARMv7- and ARMv8-based host systems. Besides virtual-to-physical address translation as required for SVM, the SMMU also supports two-stage address translation for virtualization environments.1 Fig. 7.1 shows how the SMMU is integrated into the Zynq MPSoC. The SMMU features two different core components.

1In such a scenario, the SMMU first performs a Stage-1 translation from the incoming VA to the intermediate PA of the virtual machine, followed by a Stage-2 translation from the intermediate PA to the output PA of the host.

[Block diagram showing the host processing system (four A53 cores with MMUs and L1 instruction/data caches, Snoop Control Unit, L2 cache), the SMMU (TCU, TBU 0 on the HPC0 port, TBU 3 on the HP0 port), the Cache-Coherent Interconnect, the DDR Memory Controller with DDR DRAM, and the Programmable Logic.]

Figure 7.1: Overview of Zynq MPSoC with the integrated SMMU.

Translation buffer unit (TBU): Multiple TBUs interface the PL and other master devices, such as the system-level DMA engine, network and display port controllers, as well as the PCIe root complex, with the downstream system interconnect using different versions of the AXI protocol. In total, the SMMU features 6 TBUs. For simplicity, the figure just shows the TBUs relevant for this work. TBU 0 connects the high-performance coherent (HPC) AXI master port (HPC0) of the PL to an ACE-Lite slave port of the cache-coherent interconnect. TBU 3 connects the non-coherent HP0 port to the DDR DRAM controller (both AXI). Using a private, fully-associative L1 TLB, every TBU performs virtual-to-physical address translation for incoming transactions. Every TBU features private transaction and write data buffers which allow for up to 8 or 16 outstanding misses in the L1 TLB before blocking (hit-under-miss, HUM).

Translation control unit (TCU): In case the L1 TLB inside a TBU does not hold a valid mapping for an incoming address, a translation request is sent to the TCU using a dedicated AXI stream interface. The TCU is shared among all TBUs and is responsible for controlling and managing address translations. It features a shared L2 TLB as well as a multi-threaded, hardware PTW engine that connects to an ACE-Lite slave port of the system interconnect for coherently operating on the PTs in main memory. To speed up the PTW engine, the TCU is equipped with hardware-managed prefetch buffers and PTW caches. In addition, the TCU features a distributed virtual memory interface to participate in TLB invalidations issued by the host processor.

Stream Matching: A TBU can be shared between different master devices and simultaneously perform address translations according to different PTs or address translation contexts. Also, multiple TBUs can simultaneously perform address translation for the same context, e.g., to increase the memory bandwidth of a single accelerator by using multiple interfaces and TBUs in parallel. This flexibility is achieved using a technique called stream matching: Whenever a new transaction arrives at a TBU, its ID, which is a composition of the TBU index, the interconnect master ID and the AXI transaction ID, is compared with the contents stored in the stream-matching registers. In case the ID matches the content of one of the stream-matching registers, the SMMU performs address translation according to the context configured in the associated stream-to-context register. If the SMMU shall not perform address translation for a particular stream, the stream-to-context register can be configured accordingly (bypass mode). The SMMU implemented in the Zynq MPSoC features 48 stream-matching registers and supports 16 different contexts.
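The following behavioral sketch illustrates the matching step. It is a simplified software model of the logic just described, not the SMMU's register interface; the register fields and the layout of the composed stream ID are only illustrative.

/* Simplified behavioral model of SMMU stream matching; register fields and
 * the stream-ID layout are illustrative, not taken from the MMU-500 spec. */
#include <stdint.h>
#include <stdbool.h>

#define NUM_SMR      48   /* stream-matching registers in the Zynq MPSoC SMMU */
#define NUM_CONTEXTS 16   /* supported translation contexts                   */

struct stream_match_reg {
    bool     valid;
    uint32_t id;      /* expected stream ID             */
    uint32_t mask;    /* bits to ignore while comparing */
};

struct stream_to_context_reg {
    bool    bypass;   /* pass the transaction through untranslated */
    uint8_t context;  /* index of the translation context to use   */
};

static struct stream_match_reg      smr[NUM_SMR];
static struct stream_to_context_reg s2cr[NUM_SMR];

/* Compose the stream ID from TBU index, interconnect master ID and AXI ID. */
static uint32_t stream_id(uint32_t tbu, uint32_t master_id, uint32_t axi_id)
{
    return (tbu << 16) | (master_id << 8) | axi_id;   /* illustrative layout */
}

/* Return the context index for an incoming transaction, or -1 for bypass or
 * no match (a real SMMU can also be configured to fault on unmatched streams). */
static int match_stream(uint32_t id)
{
    for (int i = 0; i < NUM_SMR; i++) {
        if (smr[i].valid &&
            (id & ~smr[i].mask) == (smr[i].id & ~smr[i].mask))
            return s2cr[i].bypass ? -1 : s2cr[i].context;
    }
    return -1;
}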

Translation Faults and TLB Misses: If the SMMU cannot translate the address of an incoming transaction, e.g., because the PT does not contain an entry for the virtual address (VA) of interest, it raises a fault interrupt, which needs to be handled in software running on the host. In this case, the SMMU blocks any traffic (also for other contexts and streams in bypass mode). Once the fault has been handled, the SMMU retries the translation of the faulting address and then issues the transaction into the downstream interconnect. Similar to TLB misses, translation faults are handled completely transparently to the master which issued the original transaction. To enable the same transparency for read and write transactions, every TBU features a buffer memory for absorbing the payload data of write transactions. However, if this write buffer in the TBU is full, for example because of a large DMA write transfer generating a TLB miss, any write transfers on this TBU are blocked until the miss is handled.

[Layer diagram: user-level software (user-space application, runtime library), kernel-level software (Linux driver module, kernel IOMMU API, get_user_pages(), CMW, SMMU driver), and hardware (SMMU, host processor).]

Figure 7.2: Software stack for enabling SVM using the SMMU.

7.2.2 SMMU Software Stack

The SMMU is registered with the OS kernel through a low-level hardware driver integrated into the Linux kernel. At startup, this driver pre-allocates stream-matching and context registers according to information found in the device tree, and initializes the hardware for bypassing. Also, this driver provides the SMMU-specific implementations of the functions specified in the Linux IOMMU API. As such, it allows other kernel subsystems, such as the DMA framework, to use this particular SMMU as any other IOMMU through the same interface. However, to actually use the SMMU for enabling SVM between a user-space process and an accelerator, additional software is needed, which is typically not provided. To this end, we extended the software part of our IOMMU framework presented in Chap. 6 to also enable SVM using the SMMU. Fig. 7.2 shows how the different software components interact with each other and the SMMU hardware.

Once the user-space application to be accelerated is started, the runtime library uses a system call to the kernel-level driver module, which then initializes the SMMU for SVM through the IOMMU API. In particular, it first allocates and configures a new IOMMU domain (context), which basically involves setting up different data structures used by the API later on. After that, a call to the IOMMU API function iommu_attach_device() associates the accelerator with the context and configures the actual hardware, i.e., it sets up the SMMU context bank and stream-matching registers. Note that this function always creates an empty IOVA space to which the driver module must explicitly map user-space pages.2 If the shared memory pages are known in advance, the driver module can pin and map them to kernel space at offload time using get_user_pages() and then map them to the IOVA space using iommu_map(). Alternatively, the driver module creates a kernel worker thread using the Concurrency Managed Workqueue (CMW) API to pin and map user-space pages on demand and registers it together with a fault handler for the IOMMU context. This fault handler and the worker thread are then called when the hardware PTW engine inside the TCU cannot find an entry for the incoming VA and thus generates a page fault interrupt to the host. In contrast to the implementation discussed in Sec. 6.2.2, this worker thread cannot handle multiple faults in batch mode but just a single fault at a time, as the SMMU supports just one outstanding fault. Consequently, the full interrupt latency has to be paid for every fault. Mapping the page to the IOVA space at the same address used by the user-space application allows to replicate the relevant segments of the PT and thus the sharing of virtual address pointers between the host processor and the accelerator.
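The offload-time path can be condensed into the following kernel-side sketch using the Linux 4.9 APIs named above (iommu_attach_device(), get_user_pages(), iommu_map()). The helper itself is hypothetical, and error unwinding, mmap_sem locking and the on-demand fault-handler path are omitted.

/* Hypothetical kernel-side helper: pin the user buffer [uva, uva + size) and
 * map it 1:1 into the accelerator's IOVA space so that virtual address
 * pointers can be shared. Sketch only: no error unwinding, no mmap_sem
 * locking around get_user_pages(), no unmap/unpin path. */
#include <linux/iommu.h>
#include <linux/mm.h>
#include <linux/slab.h>

static int svm_map_user_buffer(struct iommu_domain *domain, struct device *dev,
                               unsigned long uva, size_t size)
{
    unsigned long start = uva & PAGE_MASK;
    unsigned long npages = PAGE_ALIGN(uva + size - start) >> PAGE_SHIFT;
    struct page **pages;
    long pinned;
    int ret, i;

    ret = iommu_attach_device(domain, dev);   /* context bank + stream matching */
    if (ret)
        return ret;

    pages = kcalloc(npages, sizeof(*pages), GFP_KERNEL);
    if (!pages)
        return -ENOMEM;

    /* Pin the pages so the OS cannot move or swap them out under the SMMU. */
    pinned = get_user_pages(start, npages, FOLL_WRITE, pages, NULL);
    if (pinned != npages)
        return -EFAULT;

    /* Replicate the mapping at the same virtual addresses in the IOVA space. */
    for (i = 0; i < pinned; i++) {
        ret = iommu_map(domain, start + i * PAGE_SIZE,
                        page_to_phys(pages[i]), PAGE_SIZE,
                        IOMMU_READ | IOMMU_WRITE);
        if (ret)
            return ret;
    }

    kfree(pages);   /* page pointers no longer needed; the pages stay pinned */
    return 0;
}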

7.3 Experimental Results

We first describe the evaluation platform used for the performance evaluation of the SMMU in Sec. 7.3.1. Sec. 7.3.2 presents the cost of different SMMU primitives and architectural parameters profiled and determined using a microbenchmark. The results for different application kernels are discussed in Sec. 7.3.3.

7.3.1 Evaluation Platform

Our evaluation platform is based on the Trenz Electronic UltraSOM+ TE0808-03 module [119] equipped with a Xilinx Zynq UltraScale+ XCZU9EG MPSoC [27]. Fig. 7.3 gives an overview of this HESoC. This chip features a quad-core ARM Cortex-A53 CPU that is running Xilinx Linux 4.9 in 64-bit mode and is used to implement the host of the HESoC.

2To allow the SMMU to operate directly and coherently on the page table of the user-space process as supported by the hardware, both the IOMMU API and the SMMU driver need to be modified substantially.

The cores have separate L1 instruction and data caches with a size of 32 KiB each and share 1 MiB of unified L2 cache. The PL of the Zynq MPSoC is used to implement a cluster-based PMCA architecture [40]. Per cluster, 8 RISC-V PEs share 4 KiB of instruction cache, 256 KiB of multi-banked, tightly-coupled L1 SPM, and a multi-channel DMA engine which allows for fast and flexible movement of data between the L1 and the L2 SPM or main memory at high bandwidth. The PMCA is attached to the host as a memory-mapped device and can be controlled by a kernel-level driver and a user-space runtime linked to the actual application. To guarantee compatibility of data and pointer types between the 64-bit ARMv8 host CPU and the 32-bit RISC-V PMCA, the application is compiled and executed in 32-bit ARMv7 mode. The host and the PMCA share 2 GiB of DDR4 DRAM as main memory. This memory is accessible by the PMCA both through the cache-coherent interconnect of the host, as well as through a dedicated non-coherent port directly connecting to the DDR memory controller.

To enable SVM between the host and the PMCA, the platform can use different IOMMU implementations: It can either use the hard-macro SMMU or the lightweight IOMMU designed in this thesis. At application start, the runtime prepares the selected IOMMU. The IOMMU which is not used is put into bypass mode. In contrast to the full-fledged, hard-macro SMMU, the lightweight IOMMU is implemented in the PL and managed through the VMM software library executed by a helper thread on the PMCA (refer to Chap. 5). The lightweight IOMMU uses a hybrid TLB design with 8 L1 entries that are fully-associative, have a flexible size and a look-up latency of 1 clock cycle, and a 32-way set-associative L2 TLB with 1024 entries and a maximum look-up latency of 6 cycles (see Chap. 6). The internal configuration parameters of the SMMU, such as TLB sizes and look-up latencies, are not specified. The clock frequencies of the different components were tuned to emulate a system with a PMCA and system-level interconnects (including the SMMU) running at 500 MHz and the host CPU running at 2140 MHz. The PMCA features one cluster. The DDR4 DRAM is running at a clock frequency of 200 MHz (DDR4-1600).

[Block diagram of the evaluation platform: the host (four A53 cores with MMUs and L1 caches, Snoop Control Unit, L2 cache, SMMU with TBU 0/HPC0 port and TBU 3/HP0 port, Cache-Coherent Interconnect, DDR Memory Controller and DDR DRAM) and the PMCA (clusters 0 to N-1 with DMA and L1 I$+SPM, shared L2 SPM), connected through the SoC bus and the lightweight IOMMU with its L1 and L2 TLBs.]

Figure 7.3: The evaluation platform with the two IOMMU implementations under comparison.

7.3.2 Shared Virtual Memory Cost

Using a synthetic benchmark application, we profiled the SMMU in order to determine unspecified architectural parameters, such as the latencies and capacities of the L1 and L2 TLBs, the cost for handling a page fault and the PTW latency. This benchmark lets the host allocate a large, shared array of which the VA is passed to the PMCA. A PE inside the PMCA first performs a large number of accesses to main memory with both IOMMUs in bypass mode and determines the main memory access latency using its internal performance counters. Then, the host configures the SMMU for address translation and the PMCA PE accesses all pages of the shared array through the SMMU, which allows to determine the latency for handling a page fault. To determine the look-up latency of the L1 TLB, the PE then issues a large number of accesses to a selected memory page. To determine the capacity, the PE then starts to insert an increasing number of accesses to other memory pages in between two accesses to the selected page. Once the access latency to the selected page increases, the capacity of the L1 TLB has been found: It equals the number of pages accessed since the last access to the selected page. With this information, a similar scheme can be used to determine the latency and size of the L2 TLB. Finally, the PE issues strided accesses to the shared array to measure the PTW latency of the SMMU.
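The capacity-probing loop can be sketched as follows (accelerator-side C). read_cycles() and load_word() are hypothetical stand-ins for the PE's performance counter and an SVM load, and the capacity estimate assumes an LRU-like replacement policy.

/* Sketch of the L1 TLB capacity probe described above, run on a PMCA PE.
 * read_cycles() and load_word() are hypothetical helpers (performance
 * counter read and a load from SVM), not a real API. */
#include <stdint.h>

#define PAGE_SIZE 4096u

extern uint32_t read_cycles(void);                      /* hypothetical */
extern uint32_t load_word(const volatile void *addr);   /* hypothetical */

/* Returns the estimated TLB capacity in pages. `probe` points into one page
 * of the shared array, `others` into a pool of further pages; `hit_latency`
 * is the previously measured latency of a hitting access. */
static unsigned probe_tlb_capacity(const volatile uint8_t *probe,
                                   const volatile uint8_t *others,
                                   uint32_t hit_latency)
{
    for (unsigned k = 1; ; k++) {
        /* Access the probe page, then k other pages in between. */
        load_word(probe);
        for (unsigned i = 0; i < k; i++)
            load_word(others + (uint32_t)i * PAGE_SIZE);

        /* Time the second access to the probe page. */
        uint32_t t0 = read_cycles();
        load_word(probe);
        uint32_t t1 = read_cycles();

        /* Once this access is no longer a TLB hit, the k interleaved pages
         * have evicted the probe entry: capacity is roughly k (LRU assumed). */
        if (t1 - t0 > hit_latency)
            return k;
    }
}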

Table 7.1: SMMU architectural parameters.

              Capacity   Latency [Cycles]
L1 TLB              64   1
L2 TLB             512   39 a
PTW b                -   31 (71 max.)
Page Fault           -   2900 (21100 max.)

a Incl. request latency from TBU to TCU.
b Excl. L2 TLB look-up latency.

The results obtained from this benchmark are summarized in Tbl. 7.1. The 64-entry, fully-associative L1 TLBs inside the TBUs have a look-up latency of only 1 clock cycle. The shared L2 TLB inside the TCU has a capacity of 512 entries. Its latency is substantially higher (39 cycles), as upon a miss in an L1 TLB, the TBU must first send a translation request to the TCU over the AXI stream interface. The maximum latency of a PTW performed in the TCU is 71 clock cycles. Thanks to the prefetch buffers and PTW caches, the average PTW latency is substantially lower (31 cycles). Note that this number includes neither the latency of the AXI stream interface between TBU and TCU, nor the look-up latencies of the TLBs. The average latency for letting the host handle an SMMU page fault on the HPC0 interface (no cache flush needed) is 2900 clock cycles. During that time, the SMMU is blocked for any transactions.

In contrast, the lightweight IOMMU designed in this thesis is managed in software that directly operates on the PT of the user-space process. The host thus does not need to be involved. A software PTW performed by a PMCA PE using the VMM library takes on average 320 clock cycles, and the maximum latency is 630 cycles. This design does not make use of PMCA-internal hardware caches and/or prefetching. However, to reduce the average latency, it can reuse partial results of the previous PTW (see Sec. 5.3.3). Also, it suffers from a higher latency to main memory (72 cycles) than the hardware PTW inside the SMMU.

7.3.3 Application Benchmark Results

In this section, we evaluate the performance of both the SMMU and the lightweight IOMMU designed throughout this thesis. To this end, we use the four benchmarks memory copy (MC), pointer chasing (PC), random forest traversal (RFT) and sparse matrix-vector multiplication (SMVM) described in Sec. 5.4.2 and Sec. 6.3.1, respectively. These benchmarks represent typical applications suitable for implementation on massively-parallel accelerator architectures. MC and SMVM feature rather regular memory access patterns typical for streaming applications of low operational intensity. In contrast, PC and RFT operate on pointer-rich data structures. Support for SVM is absolutely needed to allow for efficient heterogeneous implementations of such application kernels at reasonable design-time effort. Their memory access patterns are highly irregular. The benchmarks were configured with the parameters given in Tbl. 6.3 and Tbl. 6.4 and run using the following SVM configurations based on the two IOMMU implementations under comparison.

• L-IOMMU: This configuration uses the lightweight SVM framework designed throughout this thesis. The VMM library through which the PMCA manages the VM hardware directly operates on the PT of the user-space process.

• SMMU: This configuration leverages the software stack discussed in Sec. 7.2.2 to enable SVM through the full-fledged hardware IOMMU found in next-generation HESoCs (the SMMU). The SMMU operates on a dedicated IOVA space and I/O PT to which the host must explicitly map the memory pages of the accelerated user-space process.

• SMMU, No Faults: This configuration uses the SMMU for enabling SVM. However, the SMMU directly operates on the PT of the user-space process similar to L-IOMMU. There is no need to allocate a separate IOVA space to which user-space memory pages must be mapped after pinning. There are no page faults.

Note that to implement the latter scheme with current versions of the Linux kernel, heavy modifications to the IOMMU API implementation and the low-level SMMU hardware driver are required.3 It can be emulated by mapping all shared data pages to the IOVA space at offload time, i.e., before the PMCA starts to operate. While this allows to avoid any page faults at PMCA run time, it leads to considerable offload-time overheads and requires the implementation of an application-specific routine to traverse and map the shared data. Therefore, the scheme SMMU, No Faults rather demonstrates the capabilities of the SMMU hardware with future software stacks.

In addition, the RFT and SMVM benchmarks both operate on large data sets which may be created once by the host and then used by the PMCA for many iterations with varying input data. These data sets are amenable to placement in physically contiguous memory, e.g., obtained via CMA [30]. For these two benchmarks, we thus use the following two additional SVM configurations.

• L-IOMMU, Hybrid: Similar to L-IOMMU but the large data sets (trees for RFT, matrix for SMVM) are placed in physically contiguous memory which can be statically mapped by a single entry in the L1 TLB of the lightweight IOMMU.

• SMMU, Hybrid: Similar to SMMU but the large data sets are placed in physically contiguous memory. This memory section is mapped to the IOVA space when initializing the SMMU. Thus, there are no page faults to the contiguous memory. Using larger memory pages with sizes between 64 KiB and 1 GiB helps to reduce the VMM overhead.

In the following, all these configurations are compared to an ideal SVM system, i.e., a TLB with single-cycle look-up latency that never misses. Such a system can be emulated by putting both IOMMUs in the design in bypass mode and letting the host copy the shared data to a reserved, physically contiguous memory section and translate any virtual address pointers in the shared data at offload time. This allows for zero overhead at PMCA run time, but obviously, it leads to considerable design-time overheads and high offload cost, which are to be avoided using SVM.

3In addition, a hardware bug in the SoC requires a PT patch to allow the PL to access the main memory coherently with the caches of the host when using the SMMU for address translation. While this patch is suitable for I/O PTs, applying it also to regular PTs used for the kernel and for user-space applications is critical.

[Plot panels: a) Data Size = 64 KiB; b) Data Size = 1024 KiB. X-axis: #Iterations; y-axis: Performance Normalized to Ideal SVM; curves: SMMU, No Faults; L-IOMMU; SMMU.]

Figure 7.4: MC performance for different data sizes and number of iterations.

Memory Copy (MC)

MC features a very regular access pattern to SVM and is representative for streaming-type, memory-bound application kernels. The benchmark uses a single DMA stream to copy a data buffer from SVM to the L1 SPM of the PMCA. Fig. 7.4 shows the performance of the considered SVM schemes normalized to ideal SVM for a data size of 64 KiB and 1024 KiB in a) and b), respectively. The x-axes denote the number of iterations performed on the data, as required when, for example, the size of the L1 SPM does not suffice to hold all data at once. The best performance is achieved using a full-fledged hardware IOMMU directly operating on the PT of the user-space process (SMMU, No Faults). If the entire buffer can be remapped using the L1 TLB, ideal performance can be achieved after a couple of iterations (Fig. 7.4 a) ). If the hardware IOMMU must use a dedicated IOVA space to which the shared pages are mapped by the host upon page faults, the performance drops substantially by up to 60% (SMMU). The lightweight, software-managed SVM design developed in this thesis, i.e., L-IOMMU, allows to achieve 77% or more of the performance of an ideal system. Multiple iterations help to amortize the cost for page faults (SMMU) and compulsory TLB misses (SMMU, No Faults and L-IOMMU). The maximum bandwidth using the L-IOMMU saturates at 98% of ideal SVM, as shown in Fig. 7.4 b).

Sparse Matrix-Vector Multiplication (SMVM)

SMVM uses multiple, parallel DMA streams each of which features a linear access pattern to SVM. Fig. 7.5 shows the PMCA performance of the different SVM configurations normalized to the performance of an ideal SVM system for different matrices (refer to Tbl. 6.4). The lowest performance is achieved by SMMU: Upon a page fault, any accelerator traffic to SVM is blocked, also for other DMA streams. Since every input and output data element is written exactly once by the PMCA, the page faults cannot be amortized. The performance can be notably increased by storing the matrix in contiguous memory statically mapped to the IOVA space (SMMU, Hybrid) or by letting the SMMU directly operate on the PT of the user-space process (SMMU, No Faults), to reduce the number of page faults or completely avoid them, respectively. Independent of the problem size, the highest performance is achieved when using the SVM framework designed throughout this thesis (L-IOMMU, Hybrid). If the matrix is not stored in statically mapped contiguous memory (L-IOMMU), the design is still between 1.3x and 1.5x faster than SMMU and achieves 65% to 75% of the performance of an ideal SVM system.

[Bar chart comparing L-IOMMU; L-IOMMU, Hybrid; SMMU; SMMU, Hybrid; and SMMU, No Faults for the matrices power, ca-HepTh, Dubcova1 and olafu; y-axis: Performance Normalized to Ideal SVM.]

Figure 7.5: SMVM performance for different sparse matrices.

Pointer Chasing (PC)

The PC benchmark traverses a complex, randomized graph structure consisting of 10,000 vertices and represented using linked lists (see Fig. 5.8 a) ). SVM is badly needed to allow for the efficient implementation of such application kernels on heterogeneous platforms at reasonable design-time effort. Its access pattern to SVM is highly irregular, data-dependent and offers only little temporal locality. The benchmark represents a worst-case scenario for VM systems. Fig. 7.6 shows the achievable performance in PC relative to an ideal SVM system as a function of the operational intensity for different graph sizes and numbers of iterations performed on the graph.

[Plot panels: a) Graph Size = 2.5 MiB, 1 Iteration; b) Graph Size = 2.5 MiB, 4 Iterations; c) Graph Size = 10 MiB, 1 Iteration; d) Graph Size = 10 MiB, 4 Iterations. X-axis: Operational Intensity [Cycles/B]; y-axis: Performance Normalized to Ideal SVM; curves: SMMU, No Faults; L-IOMMU; SMMU.]

Figure 7.6: PC performance for different graph sizes (top/bottom) and number of iterations (left/right).

As for the previous benchmarks, the availability of a full-fledged hardware IOMMU by itself does not guarantee high performance. To achieve high performance that is within 8% of an ideal SVM system, the IOMMU must directly operate on the PT of the user-space process, which is not possible with the current, embedded software stacks (SMMU, No Faults). Otherwise, the handling of page faults and mapping the shared user-space memory to the dedicated IOVA space on the host severely limits performance (SMMU). Performing multiple iterations allows to amortize the handling of these page faults and brings the performance closer to SMMU, No Faults, as shown in Fig. 7.6 b) and d). As the operational intensity increases, the various operations for VM management (page fault handling, PTW) can be partially overlapped with actual computation, which improves performance for all SVM designs. The performance of our lightweight, software-managed SVM design L-IOMMU is within 54% of the ideal SVM system even in the highly memory-bound case. For more reasonable operational intensities and/or if the graph is sufficiently small to be remapped with the TLB of our design at once (Fig. 7.6 a) and b) ), the performance is within 25% of the ideal system.4

Random Forest Traversal (RFT)

The RFT benchmark is another typical example of a wide range of applications for which the support for SVM greatly eases the implementation on heterogeneous platforms. It operates on multiple, binary decision trees of variable size typically used for regression and classification. Similar to PC, the memory access pattern is highly irregular and represents a worst-case scenario for VM systems. In addition, it depends on other input data not part of the trees, such as the samples to classify. Fig. 7.7 shows the performance for RFT for the different SVM schemes as a function of the tree depth (left) and operational intensity (right) for different vertex sizes (top/bottom). For a vertex size of 32 B, all SVM schemes achieve a performance above 90% of the ideal design up to a tree depth of 14 levels (Fig. 7.7 a) ).

4For these experiments, an L2 TLB with 1024 entries was used which allows to remap 4 MiB of memory. As shown in Chap. 6, the capacity of our TLB architecture can be increased easily at low hardware cost to match larger problem sizes.

[Plot panels: a) Vertex Size = 32 B, Operational Intensity = 1 Cycles/B (x-axis: # Tree Levels); b) Vertex Size = 32 B, 16 Levels, Tree Size = 2 MiB (x-axis: Operational Intensity [Cycles/B]); c) Vertex Size = 256 B, Operational Intensity = 1 Cycles/B (x-axis: # Tree Levels); d) Vertex Size = 256 B, 16 Levels, Tree Size = 16 MiB (x-axis: Operational Intensity [Cycles/B]). Y-axis: Performance Normalized to Ideal SVM; curves: L-IOMMU, Hybrid; SMMU, Hybrid; SMMU, No Faults; L-IOMMU; SMMU.]

Figure 7.7: RFT performance for different vertex and tree sizes (top/bottom) when varying the number of tree levels (left) and the operational intensity (right).

design up to a tree depth of 14 levels (Fig. 7.7 a)). For larger depths, the capacity of the L1 TLB of the SMMU no longer suffices to remap the entire tree. The performance starts to decrease rapidly, as the distance between subsequently accessed vertices increases with growing tree depth (see Fig. 5.8 b)), leading to a TLB miss on every SVM access. With increasing vertex size, the tree depth at which performance starts to decrease shifts to the left (Fig. 7.7 c)).

Fig. 7.7 b) and d) show how the relative performance increases with varying operational intensity for a fixed tree depth and size. A higher operational intensity allows overlapping useful computation on the accelerator with TLB miss handling. This behavior is more pronounced for larger vertex sizes (Fig. 7.7 d) ). Unless on-line learning is performed, the host accesses the decision trees only once when preparing them for the accelerator. In this case, the trees are best allocated in physically contiguous VM, e.g., obtained through CMA [30]. The trees can then be statically remapped, whereas the input and output samples are allocated using regular VM. For the hardware IOMMU using a dedicated IOVA space, this drastically reduces the number of page faults and thus allows for higher performance (SMMU, Hybrid). However, even though the software in this case makes use of large memory pages in the IOVA space, the performance does not improve beyond SMMU, No Faults (regular host memory uses only 4 KiB pages). This suggests that, while the software and the hardware PTW engine are compatible with larger page sizes, the L1 TLB uses only 4 KiB entries. In contrast, our lightweight SVM design can remap an entire contiguous section of arbitrary size with a single entry in the L1 TLB (L-IOMMU, Hybrid). There is thus never a TLB miss for accesses to the trees, which gives near-ideal performance for RFT.
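The following minimal C sketch illustrates why a flat, contiguous tree table combines so well with a single flexible-size TLB entry: the whole tree lives in one allocation and child indices are computed rather than stored. The node layout and field names are illustrative assumptions, not the actual data structures of the RFT benchmark.

#include <stdint.h>
#include <stdio.h>

/* Flat-table representation of a binary decision tree: node 0 is the
 * root and the children of node i are 2*i+1 and 2*i+2, so one
 * contiguous allocation holds the whole tree and a single, large TLB
 * entry can map it. */
typedef struct {
    uint16_t feature;   /* index into the sample's feature vector   */
    int32_t  threshold; /* split threshold for the binary test      */
    int32_t  value;     /* leaf value, used when the node is a leaf */
} tree_node_t;

/* Traverse 'levels' levels of the tree for one sample. */
static int32_t classify(const tree_node_t *tree, int levels,
                        const int32_t *sample)
{
    size_t i = 0;
    for (int l = 0; l < levels - 1; l++) {
        int go_right = sample[tree[i].feature] > tree[i].threshold;
        i = 2 * i + 1 + go_right;  /* child index is computed, but  */
    }                              /* the child's page may TLB-miss */
    return tree[i].value;
}

int main(void)
{
    enum { LEVELS = 4, NODES = (1 << LEVELS) - 1 };
    tree_node_t tree[NODES];
    for (size_t i = 0; i < NODES; i++)
        tree[i] = (tree_node_t){ .feature = (uint16_t)(i % 4),
                                 .threshold = (int32_t)(i * 3),
                                 .value = (int32_t)i };
    int32_t sample[4] = { 10, 2, 40, 7 };
    printf("leaf value: %d\n", (int)classify(tree, LEVELS, sample));
    return 0;
}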

7.4 Summary

In this chapter, we have analyzed the suitability for SVM of the full-fledged, hard-macro IOMMUs featured by some next-generation, high-end HESoCs [27]. Similar to the IOMMUs found in today's high-end desktop processors [7,8] and used to, e.g., enable SVM for embedded or discrete GPGPUs [5,6,10,11], these hardware blocks are highly flexible and would allow for SVM between host and various (I/O) devices and accelerators. However, studying the associated software stacks integrated into the Linux kernel reveals severe limitations that prevent them from being easily used for efficiently sharing user-space virtual memory between host and accelerators. In fact, the standard software distributions and programming models for these HESoCs currently do not foresee using these IOMMUs for this purpose. Instead, they are rather used according to their original purpose, i.e., the protection of the host system from malicious or faulty DMA devices and drivers. By adapting the software stack of our SVM framework, it can also be used to enable SVM with such full-fledged, hard-macro IOMMUs. We have used a set of parameterized application kernels extracted from real-world applications to evaluate the performance of the resulting SVM framework. Our results demonstrate that, while such hardware IOMMUs can indeed be used to enable SVM, the availability of full-fledged hardware is by itself not sufficient for high performance. Without additional modifications to the IOMMU framework of the OS kernel and the low-level hardware driver, the IOMMU cannot operate on the PT of the accelerated user-space process. Instead, a separate IOVA space and PT are used, to which the host must explicitly map shared memory pages in response to page faults. In many application scenarios, the handling of these page faults severely limits the performance of the PMCA to between 30% and 60% of what is achievable when the IOMMU can directly use the PT of the user-space process. Moreover, we have compared this design for use in a PMCA with the lightweight, mixed hardware-software SVM framework comprising the PMCA cluster hardware and compiler extension presented in Chap. 4, the VMM software library for on-accelerator VM management presented in Chap. 5, and the hybrid, high-capacity, multi-cycle TLB presented in Chap. 6. Compared to the full-fledged hardware IOMMU operating directly on the user-space PT, our lightweight design achieves a relative performance between 75% and 100% in many non-strictly memory-bound scenarios. Moreover, if the application makes use of physically contiguous VM, our design even outperforms the full-fledged hardware and achieves near-ideal SVM performance. This demonstrates that lightweight SVM frameworks as proposed in this thesis are not only suitable to improve the programmability and offload performance in area- and power-constrained HESoCs, but that such designs are also a viable alternative to full-fledged hardware solutions in high-end HESoCs, as they offer competitive performance at low hardware cost.

Chapter 8

Conclusions

Modern HESoCs rely on combinations of feature-rich, general-purpose, multi-core host processors and massively parallel PMCAs to achieve high flexibility, energy efficiency and peak performance. Due to complex memory systems resulting in partitioned memory models between host and accelerators, effectively using these platforms and exploiting their nominally very high performance/Watt is a major challenge that is today left entirely in the hands of the application programmers.

In this thesis, we investigated the design of mixed hardware-software frameworks for enabling transparent, lightweight, zero-copy shared virtual memory (SVM) in power- and area-constrained HESoCs. With such a framework, sharing data between host and PMCA becomes as simple as passing to the PMCA the virtual address pointers seen by the application running on the host, in the same way that shared-memory parallel programs pass pointers between threads running on a CPU. Having the same view of the user-space virtual memory as the host, the PMCA itself can fetch the shared data from main memory without relying on the host for data management or on specialized memory allocators and the like. As such, SVM greatly simplifies programmability and improves the performance of HESoCs.


8.1 Overview of the Main Results

The main results and contributions can be summarized as follows.

Lightweight SVM for Regular Memory Access Patterns Today's state-of-the-art heterogeneous programming models primarily focus on data-parallel accelerator models. Under these paradigms, offload-based applications are typically structured as sets of regular loops with predictable trip counts and regular, streaming-type access patterns to main memory. We designed a first, lightweight SVM system that exploits these properties to enable SVM at low hardware cost and high performance. This design is based on a hardware IOTLB that is efficiently managed in software by a kernel-level driver module and a user-space runtime running on the host. An IOTLB double-buffering scheme allows overlapping the host interrupt latency and IOTLB reconfiguration with actual data transfers and accelerator execution. By adjusting the granularity of this double-buffering scheme, the design allows trading off hardware resources against synchronization overhead. Using an IOTLB with just 32 entries, this solution allows a PMCA to operate at full speed down to operational intensities as low as 19.2 operations per byte.
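The following toy C model sketches the scheduling idea behind the IOTLB double-buffering scheme under simplifying assumptions (fixed stripe size, sequential simulation of host and accelerator); all names and constants are illustrative and do not correspond to the actual driver or runtime interfaces.

#include <stdio.h>

/* The IOTLB entries are split into two halves. While the accelerator
 * streams through the pages mapped by the active half, the host
 * already programs the other half with the mappings for the next
 * stripe of the (regular, predictable) access pattern. */
enum { TLB_ENTRIES = 32, HALF = TLB_ENTRIES / 2, TOTAL_PAGES = 128 };

static void host_program_half(int half, int first_page, int n)
{
    /* On the real system this is done by the kernel-level driver in
     * response to an interrupt raised by the accelerator wrapper. */
    printf("host:  program half %d with pages %d..%d\n",
           half, first_page, first_page + n - 1);
}

static void accel_process_half(int half, int first_page, int n)
{
    printf("accel: stream pages %d..%d via half %d\n",
           first_page, first_page + n - 1, half);
}

int main(void)
{
    int active = 0;
    host_program_half(active, 0, HALF);            /* prologue        */
    for (int page = 0; page < TOTAL_PAGES; page += HALF) {
        int next = page + HALF;
        if (next < TOTAL_PAGES)                    /* overlap: set up */
            host_program_half(1 - active, next, HALF); /* idle half   */
        accel_process_half(active, page, HALF);
        active = 1 - active;                       /* swap halves     */
    }
    return 0;
}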

Sharing Pointer-Rich Data Structures To support the heterogeneous execution of pointer-chasing applications exhibiting irregular memory access patterns, we developed a suitable SVM framework that allows for zero-copy sharing of complex, pointer-rich data structures. This framework is based on lightweight hardware extensions, which are not intrusive to the PMCA cores and the host processor, and a compiler extension that automatically instruments the PMCA's accesses to SVM. To perform the virtual-to-physical address translation for the accesses of the PMCA to SVM, a hardware IOTLB is used that is managed by PMCA-side helper threads through a VMM software library. Operating cache-coherently on the PT of the offloaded user-space process, our design allows the PMCA to autonomously manage its VM hardware without host intervention. This greatly reduces overhead with respect to host-side VMM solutions while retaining flexibility. The framework offers the possibility for collaborative IOTLB management, e.g., to exploit application-level knowledge available at offload time on the host side. Moreover, since the VMM software library for the PMCA operates on the Linux representation of the PT, it can be adapted for all host architectures supported by the Linux kernel. This framework allows for substantially improved performance and programmability compared to copy-based offloading, which is still the state of the art for embedded systems. Compared to an ideal SVM system, which is out of reach even for today's full-fledged hardware designs found in high-end desktop processors and HPC systems, the performance of our framework lies within 50% for purely memory-bound application kernels and within 5% for real applications.
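As a rough illustration of what the compiler instrumentation conceptually does to an SVM access, the following self-contained C sketch models a load that first probes a software-modeled IOTLB and invokes a miss handler before retrying; the functions and data structures are illustrative stand-ins and do not reproduce the actual tryx()/VMM library interfaces.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum { PAGE_SHIFT = 12, TLB_ENTRIES = 4 };

typedef struct { uintptr_t vpn; bool valid; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];
static int tlb_misses;

static bool tlb_hit(uintptr_t va)
{
    uintptr_t vpn = va >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return true;
    return false;
}

static void handle_miss(uintptr_t va)
{
    /* In the real design a PMCA-side helper thread walks the process
     * page table in shared memory; here we simply install the entry. */
    static int victim;
    tlb[victim] = (tlb_entry_t){ va >> PAGE_SHIFT, true };
    victim = (victim + 1) % TLB_ENTRIES;
    tlb_misses++;
}

/* Instrumented load: what a plain "*p" in offloaded code is
 * conceptually turned into for accesses that may target SVM. */
static uint32_t svm_load(const uint32_t *p)
{
    uintptr_t va = (uintptr_t)p;
    if (!tlb_hit(va))
        handle_miss(va);
    return *p;   /* the retried access now hits in the modeled IOTLB */
}

int main(void)
{
    static uint32_t data[4096];
    for (int i = 0; i < 4096; i++) data[i] = (uint32_t)i;

    uint32_t sum = 0;
    for (int i = 0; i < 4096; i += 512)   /* strided, page-crossing */
        sum += svm_load(&data[i]);
    printf("sum = %u, simulated IOTLB misses = %d\n",
           (unsigned)sum, tlb_misses);
    return 0;
}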

IOMMU Design for FPGA Accelerators Motivated by the observation that in many cases the traffic of parallel accelerators is more bandwidth-sensitive than latency-sensitive, we investigated the design of alternative TLB architectures tailored to the needs of PMCAs. In particular, we propose a hybrid TLB design that combines the best of two different TLB architectures. A small, fully-associative L1 TLB with single-cycle look-up latency and entries of flexible size is used to efficiently support techniques aimed at reducing the host TLB miss rate, such as transparent huge pages and CMA. To reduce the overall TLB service time for regular memory pages at low hardware cost, it is combined with a new, set-associative, multi-cycle L2 TLB that is scalable and maps well to FPGAs. Compared to related works addressing SVM for FPGA accelerators, this design allows increasing the TLB capacity by factors of 16x and more while achieving lower overall resource utilization and higher or comparable clock frequencies. An in-depth analysis using parameterized benchmarks extracted from real-world applications showed that the higher TLB look-up latency has a negligible effect on performance, and that cache-coherent accelerator accesses to SVM are not always beneficial. For optimal performance and flexibility, our design allows the application programmer to optionally control cache coherency on an address range or shared data element basis. Unlike other works, our design further supports multiple outstanding TLB misses, the handling of multiple misses in batch mode, and TLB prefetching transactions, which has a beneficial effect on performance for parallel accelerators. In combination with a host user-space runtime library and a kernel-level driver module to manage the hybrid TLB, the proposed design can serve as a configurable, plug-and-play IOMMU framework for exploring transparent SVM for custom FPGA accelerators.
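A behavioral C sketch of the hybrid lookup is shown below: a small, fully associative L1 with flexible-size (base/size) entries is probed first, and only on an L1 miss is the set-associative L2 indexed by the low VPN bits. Entry counts, set sizes and field names are assumptions for illustration, not the parameters of the actual design.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum { L1_ENTRIES = 4, L2_SETS = 64, L2_WAYS = 4, PAGE_SHIFT = 12 };

typedef struct { uintptr_t base, size, pbase; bool valid; } l1_entry_t;
typedef struct { uintptr_t vpn, pfn; bool valid; } l2_entry_t;

static l1_entry_t l1[L1_ENTRIES];
static l2_entry_t l2[L2_SETS][L2_WAYS];

/* Returns true and the physical address on a hit. */
static bool tlb_lookup(uintptr_t va, uintptr_t *pa)
{
    /* L1: single-cycle, fully associative range compare. */
    for (int i = 0; i < L1_ENTRIES; i++)
        if (l1[i].valid && va - l1[i].base < l1[i].size) {
            *pa = l1[i].pbase + (va - l1[i].base);
            return true;
        }
    /* L2: multi-cycle, set indexed by the low VPN bits. */
    uintptr_t vpn = va >> PAGE_SHIFT;
    l2_entry_t *set = l2[vpn % L2_SETS];
    for (int w = 0; w < L2_WAYS; w++)
        if (set[w].valid && set[w].vpn == vpn) {
            *pa = (set[w].pfn << PAGE_SHIFT)
                | (va & ((1u << PAGE_SHIFT) - 1));
            return true;
        }
    return false;   /* miss: trigger page table walk / miss handling */
}

int main(void)
{
    /* One flexible-size L1 entry mapping a 2 MiB contiguous buffer. */
    l1[0] = (l1_entry_t){ .base = 0x40000000u, .size = 2u << 20,
                          .pbase = 0x80000000u, .valid = true };
    /* One regular 4 KiB page in the L2. */
    uintptr_t vpn = 0x7f000u;
    l2[vpn % L2_SETS][0] = (l2_entry_t){ .vpn = vpn, .pfn = 0x12345u,
                                         .valid = true };
    uintptr_t pa;
    printf("L1 hit: %d\n", (int)tlb_lookup(0x40001234u, &pa));
    printf("L2 hit: %d\n", (int)tlb_lookup((vpn << PAGE_SHIFT) | 0x10u, &pa));
    printf("miss:   %d\n", (int)tlb_lookup(0xdead0000u, &pa));
    return 0;
}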

SVM Using Full-Fledged Hardware IOMMUs Finally, we analyzed the suitability of full-fledged, hard-macro IOMMUs featured by some next-generation, high-end HESoCs for SVM. Similar to the IOMMUs found in high-end desktop processors, these hardware blocks are highly flexible and would allow for SVM between host and various (I/O) devices and accelerators. However, limitations in the associated software frameworks prevent them from being easily used for sharing user-space virtual memory between host and accelerators. In fact, the standard software distributions and programming models currently foresee the use of these IOMMUs only according to their original purpose, i.e., the isolation of the host system from malicious or faulty DMA devices and drivers. For this reason, we adapted our SVM framework to also enable SVM using such full-fledged, hard-macro IOMMUs. Using a set of parameterized application kernels extracted from real-world applications, we then compared the SVM framework using such a full-fledged hardware IOMMU with the lightweight, mixed hardware-software SVM framework developed throughout this thesis. Our results show that the inefficiencies imposed by today's software stacks reduce the performance of full-fledged IOMMU hardware for SVM by 40% to 70% in typical application scenarios. In contrast, the performance of our lightweight SVM framework lies within 25% of an ideal system in many non-strictly memory-bound scenarios and can even outperform full-fledged hardware IOMMUs with optimized software stacks. Our results demonstrate that lightweight, zero-copy, mixed hardware-software SVM frameworks as proposed in this thesis are not only suitable to improve the programmability and offload performance in HESoCs, but that such designs are also a viable alternative to full-fledged hardware solutions in high-end HESoCs, as they offer competitive performance at low hardware cost.

8.2 Outlook

Lightweight, zero-copy, mixed hardware-software SVM frameworks as proposed in this thesis substantially improve the programmability and offload performance of area- and power-constrained HESoCs at low hardware cost. This is especially true for applications that feature irregular main memory access patterns and operate on complex, pointer-rich data structures. In the past, the lack of SVM support has proven completely prohibitive for efficient heterogeneous implementations of such applications at reasonable design-time effort. Compared to hardware-only solutions for SVM, mixed hardware-software designs offer much greater flexibility. Using PMCA-side helper threads allows dynamically allocating more resources to VM management during critical application phases, without the costly overprovisioning seen in fully hardware-managed solutions designed for worst-case scenarios. Moreover, software-managed VM opens up the possibility for collaborative management schemes between host and PMCA based on application-level knowledge. Exploiting higher-level information is key to building effective TLB prefetching solutions of superior accuracy that are applicable over a wider application spectrum than what can be achieved with today's simple hardware prefetchers [41, 95, 109]. To further ease programming and improve the performance of PMCA-based HESoCs, lightweight SVM should be combined with frameworks for automated DMA and SPM management [20] and with heterogeneous compile toolchains including offload support [120, 121]. With such a framework, the application programmer can write a single high-level application without the need for special memory allocators and manual DMA programming. This improves overall system programmability beyond what is possible in accelerator programming today, for example using GPGPUs.

Appendix A

Pointer-Chasing Application Descriptions

This appendix provides detailed descriptions of the real-life applications used in Chap. 4 and Chap. 5 and of the adopted parallelization schemes. PageRank (PR): This algorithm was originally used by Google to rank web sites [82]. Every web site is represented by a vertex, and a link from one site to another is represented by an arc between the two corresponding vertices. The graph is initialized by equally ranking all vertices. The algorithm then iteratively processes the graph. The rank of every vertex is divided by its number of successors and added to the rank of each successor. At the end of every iteration, the rank of dangling vertices is equally distributed to all vertices, and all ranks are normalized. The procedure is repeated until the ranks converge. PR is an ideal candidate for representing the highly irregular graph, which is not altered during processing, with an adjacency list. We started from an open-source, floating-point C++ implementation [122], converted it to C using fixed-point arithmetic only, and derived an implementation for PULP parallelized using OpenMP. The application is started on the host, which parses a text file and builds up the graph in virtual memory. The initialization as well as the processing is then done on PULP, which just gets the virtual address pointer to the start of the adjacency list and the number of vertices from the host.
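For reference, the following self-contained C sketch performs the per-iteration update just described on a tiny CSR-style adjacency list (rank divided by the number of successors, dangling mass spread evenly, normalization); it uses floating point and illustrative data, whereas the PULP implementation is fixed-point and parallelized with OpenMP.

#include <stdio.h>

enum { N = 4 };

int main(void)
{
    /* Adjacency list in CSR-like form: succ[off[v] .. off[v+1]-1]
     * are the successors of vertex v. Vertex 3 is dangling. */
    int off[N + 1] = { 0, 2, 3, 5, 5 };
    int succ[]     = { 1, 2, 2, 0, 1 };

    double rank[N], next[N];
    for (int v = 0; v < N; v++) rank[v] = 1.0 / N;

    for (int it = 0; it < 20; it++) {
        double dangling = 0.0;
        for (int v = 0; v < N; v++) next[v] = 0.0;
        for (int v = 0; v < N; v++) {
            int deg = off[v + 1] - off[v];
            if (deg == 0) { dangling += rank[v]; continue; }
            double share = rank[v] / deg;     /* divide rank by the   */
            for (int e = off[v]; e < off[v + 1]; e++)
                next[succ[e]] += share;       /* number of successors */
        }
        double sum = 0.0;
        for (int v = 0; v < N; v++) {
            next[v] += dangling / N;          /* spread dangling mass */
            sum += next[v];
        }
        for (int v = 0; v < N; v++)
            rank[v] = next[v] / sum;          /* normalize            */
    }
    for (int v = 0; v < N; v++)
        printf("rank[%d] = %.4f\n", v, rank[v]);
    return 0;
}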

Due to the size of the graph (hundreds to several millions of vertices1) and the access pattern, which highly depends on the graph itself, it is not feasible to set up the RAB in advance or to apply the RAB double-buffering schemes discussed in Chap. 3. Since the number of computations performed per vertex is low (basically a single division and one addition per successor), PR is highly communication intensive. Furthermore, it features low locality of reference and therefore represents a worst-case scenario. Parallelization of PR using OpenMP is achieved at the vertex level. Random Hough forests (RHFs): The second benchmark application is from the machine learning domain and is visualized in Fig. A.1 a). It is the classification stage of an object detector using random Hough forests [83], i.e., a set of binary decision trees. To detect the bounding boxes of instances of a class in an image, the application computes the corresponding Hough image using RHFs. To this end, image patches of a fixed size (16 × 16 pixels) are extracted from the input image and fed to the root node of every classification tree. Every node contains a descriptor consisting of the coordinates of 2 pixels, a channel index and a threshold used to perform a simple binary test on the patch. If the difference between the intensities of the 2 pixels is smaller than the threshold, the image patch proceeds to the left child node, otherwise to the right child node. Once the patch arrives at a leaf node, the value of that leaf node, which equals the proportion of object patches that arrived at this node in the training phase and that correspond to the specific class, is added to every pixel of the patch in the Hough image. Next, the patch is shifted in the input image and again fed to the classification trees. Once all patches have been classified, a Gaussian filter is applied to the Hough image. The detection hypotheses are found at the maxima locations, and the values at these locations serve as confidence measures. We started from an open-source C++ implementation [123], which we cross-compiled for the host together with OpenCV [77]. This implementation applies several image filtering steps to the RGB input image to extract a total of 32 feature channels 1 and 2 . On PULP, we implemented the last step of the feature extraction as well as

1Due to the conversion to fixed-point arithmetic and the reduced dynamic range, our implementation of PR is limited to graph sizes in the order of 10k to 100k vertices.

Figure A.1: Pointer-chasing applications: a) RHFs and b) FD.
the classification stage, both operating on a patch basis and well parallelizable using OpenMP. From the host, PULP gets pointers to 16 feature channels, to the root nodes of the classification trees, and to an array collecting the results. Since the exact access pattern to the feature channels is already known at offload time, RAB double buffering as discussed in Chap. 3 can be used to copy the 16 feature channels from shared memory into the L1 SPM using DMA transfers. After computing the 32 feature channels 2 , the patch is fed to the classification trees in parallel 3 . The binary classification trees produced by the training stage are highly regular and can be stored in a table, similar to the one shown in Fig. 5.8 b). The address of the two child nodes can always be computed from the current node index. Since the two child nodes are always adjacent in the table, a DMA-assisted table prefetching scheme can be applied: Whenever the patch arrives at a new node, its two child nodes are copied from the shared main memory into the local L1 SPM. Face detection (FD): This application is also from the machine learning domain and is visualized in Fig. A.1 b). It uses the well-known Viola-Jones object detection framework [84]. To detect a face in a particular location in an input image, the corresponding image patch is fed to a degenerate decision tree, a so-called cascade. Per node, one or multiple weak classifiers (weaks) are computed. Every weak specifies a simple test to perform on the patch and a threshold. If the weighted sum of the outputs of the weaks is below a node-specific threshold, it is very likely that the patch does not contain a face and the detection is aborted. Otherwise, the patch is fed to the next node, where the same procedure is repeated with different weaks. The cascades are designed to reject negative patches with as little computation as possible, i.e., as early as possible in the cascade. The individual classifiers are trained to have a detection rate close to 100%, while the false positive rate can be fairly high. The overall high detection accuracy is reached by cascading multiple weaks and cascades. The features themselves are based on simple rectangular features, which involve the computation of the difference of the sums of intensities of 2, 3 or 4 adjacent rectangles of the image patch. To speed up the computation of the features, they are extracted from the integral image, where every pixel contains the sum of all pixels above and to the left. For example, the extraction of a two-rectangle feature simplifies to accessing just 6 pixels and doing 7 additions, irrespective of the size of the rectangles. We started from a C implementation of FD for embedded systems. The application is started on the host, which passes pointers to the main data structure holding the configuration of the detector, including pointers to the cascades, and pointers to the input image to PULP. On PULP, all accesses to shared data structures are effectively protected using calls to the tryx() functions. The input image is fetched from shared main memory in vertical stripes using DMA transfers 1 . Then, the integral image as well as the squared integral image, which is required for normalization, is computed on PULP 2 . Finally, shifted patches are fed to the classification stage in parallel 3 .
The order in which the weaks are accessed in the cascade is the same for all patches. Since most of the patches are rejected at some point, the resulting access pattern to the cascades is very irregular. Due to the degenerate nature of the decision tree, it can be stored efficiently in a table, which can also be prefetched using DMA as for RHFs. The final part of the algorithm is executed on the host; it collects all results from PULP and combines multiple detections if necessary.
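The integral-image property exploited by FD can be made concrete with the following small C example: each rectangle sum needs only its four corners, so a two-rectangle feature reduces to six distinct pixel accesses and seven add/subtract operations; image contents and rectangle coordinates are arbitrary illustrations.

#include <stdio.h>

enum { W = 8, H = 8 };

/* ii(x, y) holds the sum of all pixels above and to the left of
 * (x, y) inclusive; the extra zero row/column simplifies the edges. */
static int ii[H + 1][W + 1];

static void integral(int img[H][W])
{
    for (int y = 1; y <= H; y++)
        for (int x = 1; x <= W; x++)
            ii[y][x] = img[y - 1][x - 1]
                     + ii[y - 1][x] + ii[y][x - 1] - ii[y - 1][x - 1];
}

/* Sum over the rectangle with corners (x0, y0) .. (x1-1, y1-1):
 * four corner lookups and three add/subtract operations. */
static int rect_sum(int x0, int y0, int x1, int y1)
{
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0];
}

int main(void)
{
    int img[H][W];
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            img[y][x] = x + y;          /* arbitrary test pattern */
    integral(img);

    /* Two horizontally adjacent 4x4 rectangles: the shared edge means
     * the 8 corner lookups collapse to 6 distinct pixels, and the
     * final difference brings the operation count to 7. */
    int left  = rect_sum(0, 0, 4, 4);
    int right = rect_sum(4, 0, 8, 4);
    printf("two-rectangle feature = %d\n", left - right);
    return 0;
}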

Acronyms

ACE AXI Coherency Extensions
ACP Accelerator Coherency Port
API application programming interface
APU application processing unit
ASIC application-specific integrated circuit
AXI Advanced eXtensible Interface Bus

BRAM block random-access memory

CAM content-addressable memory
CCR communication-to-computation ratio
CISR Condensed Interleaved Sparse Representation
CMA contiguous memory allocator
CMP chip multiprocessor
CMW Concurrency Managed Workqueue
CPU central processing unit
CSC color structure code
CT color-based tracking
CU compute unit

DDPA doomsday productivity amplifier
DDR double data rate


DEQ dequantization
DMA direct memory access
DRAM dynamic random-access memory
DSP digital signal processor
DVM distributed virtual memory

FD face detection
FF flip-flop
FIFO first-in, first-out
FPGA field-programmable gate array

GPGPU general-purpose GPU
GPU graphics processing unit

HAL hardware abstraction layer
HERO Heterogeneous Embedded Research Platform
HESoC heterogeneous embedded system on chip
HLS high-level synthesis
HOG histogram of oriented gradients
HP high-performance
HPC high-performance computing
HPC high-performance coherent
HSA Heterogeneous System Architecture
HUM hit under miss

IDCT inverse discrete cosine transform
I/O input/output
IOMMU input/output memory management unit
IOTLB input/output translation lookaside buffer
IOVA input/output virtual address
IP intellectual property

L1 level-one
L2 level-two

LFU least frequently used
LLC last-level cache
LMB local memory bus
LPAE Large Physical Address Extension
LRU least recently used
LSB least significant bit
LUT look-up table

MC memory copy
MH miss handling
MHT miss-handling thread
MIMD multiple-instruction multiple-data
MJPEG motion JPEG
MMU memory management unit

NCC normalized cross-correlation

OS operating system

PA physical address
PC pointer chasing
PE processing element
PFN page frame number
PGD Page Global Directory
PL programmable logic
PMCA programmable many-core accelerator
PMD Page Middle Directory
POSIX Portable Operating System Interface
PR PageRank
PS programmable system
PT page table
PTE Page Table Entry
PTW page table walk
PUD Page Upper Directory
PULP Parallel Ultra-Low Power Processing Platform

RAB remapping address block
RAM random-access memory
RFT random forest traversal
RGB red, green and blue
RHF random Hough forest
ROD removed object detection
ROI region of interest

SIMD single-instruction multiple-data
SMMU system memory management unit
SMP symmetric multiprocessing
SMVM sparse matrix-vector multiplication
SoC system on chip
SPM scratchpad memory
SVM shared virtual memory
SVM support vector machines

TBU translation buffer unit
TCU translation control unit
TLB translation lookaside buffer

VA virtual address
VGA video graphics array
VM virtual memory
VMM virtual-memory management

Bibliography

[1] G. Kyriazis, “Heterogeneous system architecture: A technical review,” technical review, 2012.

[2] J. Stuecheli, B. Blaner, C. Johns, and M. Siegel, “CAPI: A coherent accelerator processor interface,” IBM Journal of Research and Development, vol. 59, no. 1, pp. 7:1–7:7, 2015.

[3] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides et al., “A reconfigurable fabric for accelerating large-scale datacenter services,” in Proc. ACM/IEEE Int. Symp. on Computer Architecture (ISCA), 2014, pp. 13–24.

[4] B. Klauer, The Convey Hybrid-Core Architecture. Springer New York, 2013, pp. 431–451.

[5] AMD Inc., “AMD Compute Cores,” white paper, 2014.

[6] Intel Corp., “The compute architecture of Intel Processor Graphics Gen9,” white paper, 2015.

[7] AMD Inc., “AMD I/O virtualization technology (IOMMU) specification,” architecture specification, 2016.

[8] Intel Corp., “Intel virtualization technology for directed I/O,” architecture specification, 2017.

[9] ARM Ltd., “ARM system memory management unit architecture specification,” architecture specification, 2016.


[10] B. Pichai, L. Hsu, and A. Bhattacharjee, “Architectural support for address translation on GPUs,” in Proc. ACM Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014, pp. 743–758.

[11] J. Power, M. D. Hill, and D. A. Wood, “Supporting x86-64 address translation for 100s of GPU lanes,” in Proc. IEEE Int. Symp. on High Perf. Computer Architecture (HPCA), 2014, pp. 568–578.

[12] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann et al., “Heterogeneous system coherence for integrated CPU-GPU systems,” in Proc. Annual IEEE/ACM Int. Symp. on Microarchitecture (MICRO), 2013, pp. 457–467.

[13] Qualcomm Inc., “Qualcomm Snapdragon 820 mobile processor,” product brief, 2016.

[14] NVIDIA Corp., “NVIDIA Tegra X1,” white paper, 2015.

[15] D. Melpignano, L. Benini, E. Flamand, B. Jego, T. Lepley et al., “Platform 2012, a many-core computing accelerator for embedded SoCs: Performance evaluation of visual analytics applications,” in Proc. ACM/EDAC/IEEE Design Automation Conference (DAC), 2012, pp. 1137–1142.

[16] Adapteva Inc., “Parallella reference manual,” technical reference manual, 2014.

[17] Texas Instruments Inc., “Multicore DSP+ARM KeyStone II system-on-chip (SoC),” technical reference manual, 2017.

[18] Kalray S.A., “MPPA MANYCORE,” product flyer, 2014.

[19] Plurality, “The HyperCore architecture,” white paper, 2010.

[20] G. Tagliavini, G. Haugou, and L. Benini, “Optimizing memory bandwidth in OpenVX graph execution on embedded many-core accelerators,” in Proc. Conf. on Design and Architectures for Signal and Image Processing (DASIP), 2014, pp. 1–8.

[21] C. Pinto and L. Benini, “A highly efficient, thread-safe software cache implementation for tightly-coupled multicore clusters,” in Proc. IEEE Int. Conf. on Application-Specific Systems, Architectures and Processors (ASAP), 2013, pp. 281–288.

[22] Y. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman et al., “A quantitative analysis on microarchitectures of modern CPU-FPGA platforms,” in Proc. ACM/EDAC/IEEE Design Automation Conference (DAC), 2016, pp. 109:1–109:6.

[23] A. Kegel, P. Blinzer, A. Basu, and M. Chan, “Virtualizing IO through the IO memory management unit (IOMMU),” tutorial at ASPLOS 2016 Conf., 2016.

[24] O. Peleg, A. Morrison, B. Serebrin, and D. Tsafrir, “Utilizing the IOMMU scalably,” in Proc. USENIX Annual Technical Conf., 2015, pp. 549–562.

[25] M. Malka, N. Amit, M. Ben-Yehuda, and D. Tsafrir, “rIOMMU: Efficient IOMMU for I/O devices that employ ring buffers,” in Proc. ACM Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2015, pp. 355–368.

[26] Khronos OpenCL Working Group. (2015) The OpenCL specification version: 2.0. language specification.

[27] Xilinx Inc., “Zynq UltraScale+ MPSoC data sheet: Overview,” advance product specification, 2017.

[28] ARM Ltd., “ARM CoreLink MMU-500 system memory management unit,” technical reference manual, 2016.

[29] Xilinx Inc., “SDSoC environment user guide,” user guide, 2017.

[30] M. Nazarewicz. (2012) A deep dive into CMA. LWN article. http://lwn.net/Articles/486301/.

[31] S. Park, M. Kim, and H. Y. Yeom, “GCMA: Guaranteed contiguous memory allocator,” ACM SIGBED Review, vol. 13, no. 1, pp. 29–34, 2016.

[32] J. Corbet. (2015) Fixing the contiguous memory allocator. LWN article. http://lwn.net/Articles/636234/.

[33] A. Kurth, A. Tretter, P. A. Hager, S. Sanabria, O. Göksel et al., “Mobile ultrasound imaging on heterogeneous multi-core platforms,” in Proc. IEEE/ACM Symp. on Embedded Systems for Real-Time Multimedia (ESTIMedia), 2016, pp. 9–18.

[34] P. Meloni, G. Deriu, F. Conti, I. Loi, L. Raffo et al., “A high-efficiency runtime reconfigurable IP for CNN acceleration on a mid-range all-programmable SoC,” in Proc. Int. Conf. on ReConFigurable Computing and FPGAs (ReConFig), 2016, pp. 1–8.

[35] P. Vogel, A. Marongiu, and L. Benini, “An evaluation of memory sharing performance for heterogeneous embedded SoCs with many-core accelerators,” in Proc. Int. Workshop on Code Optimisation for Multi and Many Cores (COSMIC), 2015, pp. 6:1–6:9.

[36] P. Vogel, A. Marongiu, and L. Benini, “Lightweight virtual memory support for many-core accelerators in heterogeneous embedded SoCs,” in Proc. IEEE/ACM Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2015, pp. 45–54.

[37] P. Vogel, A. Marongiu, and L. Benini, “Lightweight virtual memory support for zero-copy sharing of pointer-rich data structures in heterogeneous embedded SoCs,” IEEE Trans. on Parallel and Distributed Systems, vol. 28, no. 7, pp. 1947–1959, 2017.

[38] P. Vogel, A. Kurth, J. Weinbuch, A. Marongiu, and L. Benini, “Efficient virtual memory sharing via on-accelerator page table walking in heterogeneous embedded SoCs,” ACM Trans. on Embedded Computer Systems, vol. 16, no. 5s, pp. 154:1–154:19, 2017.

[39] P. Vogel, A. Marongiu, and L. Benini, “Exploring shared virtual memory for FPGA accelerators with a configurable IOMMU,” IEEE Trans. on Computers, submitted for publication.

[40] A. Kurth, P. Vogel, A. Capotondi, A. Marongiu, and L. Benini, “HERO: Heterogeneous embedded research platform for exploring RISC-V manycore accelerators on FPGA,” 2017, https://arxiv.org/abs/1712.06497.

[41] A. Kurth, P. Vogel, A. Marongiu, and L. Benini, “Scalable and efficient virtual memory sharing in heterogeneous SoCs with TLB prefetching and MMU-aware DMA engine,” 2018, https://arxiv.org/abs/1808.09751.

[42] P. Vogel, A. Bartolini, and L. Benini, “Efficient parallel beamforming for 3D ultrasound imaging,” in Proc. ACM/IEEE Great Lakes Symposium on VLSI (GLSVLSI), 2014, pp. 175–180.

[43] P. Hager, P. Vogel, A. Bartolini, and L. Benini, “Assessing the area/power/performance tradeoffs for an integrated fully-digital, large-scale 3D-ultrasound beamformer,” in Proc. IEEE Biomedical Circuits and Systems Conf. (BioCAS), 2014, pp. 228–231.

[44] S. Srikantaiah and M. Kandemir, “SRP: Symbiotic resource partitioning of the memory hierarchy in CMPs,” in High Performance Embedded Architectures and Compilers, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2010, pp. 277–291.

[45] A. Gupta, “Software and hardware techniques for mitigating the multicore interference problem,” Ph.D. dissertation, University of California, San Diego, 2013.

[46] M. Jahre, “Managing shared resources in chip multiprocessor memory systems,” Ph.D. dissertation, Norwegian University of Science and Technology, 2010.

[47] F. Liu and Y. Solihin, “Studying the impact of hardware prefetching and bandwidth partitioning in chip-multiprocessors,” in Proc. ACM Int. Conf. on Measurement and Modeling of Computer Systems (SIGMETRICS), 2011, pp. 37–48.

[48] M. Daga, A. M. Aji, and W. Feng, “On the efficacy of a fused CPU+GPU processor (or APU) for parallel computing,” in Proc. Symp. on Application Accelerators in High-Performance Computing (SAAHPC), 2011, pp. 141–149.

[49] K. L. Spafford, J. S. Meredith, S. Lee, D. Li, P. C. Roth et al., “The tradeoffs of fused memory hierarchies in heterogeneous computing architectures,” in Proc. ACM Int. Conf. on Computing Frontiers (CF), 2012, pp. 103–112.

[50] J. Hestness, S. W. Keckler, and D. A. Wood, “GPU computing pipeline inefficiencies and optimization opportunities in heterogeneous CPU-GPU processors,” in Proc. IEEE Int. Symp. on Workload Characterization (IISWC), 2015, pp. 87–97.

[51] M. Sadri, C. Weis, N. Wehn, and L. Benini, “Energy and performance exploration of accelerator coherency port using Xilinx ZYNQ,” in Proc. FPGAworld Conf., 2013, pp. 5:1–5:8.

[52] V. Sklyarov, A. Skliarova, and A. Sudnitson, “Fast matrix covering in all programmable systems-on-chip,” Elektronika ir Elektrotechnika, vol. 20, no. 5, pp. 150–153, 2014.

[53] E. S. Chung, J. D. Davis, and J. Lee, “LINQits: Big data on little clients,” in Proc. ACM/IEEE Int. Symp. on Computer Architecture (ISCA), 2013, pp. 261–272.

[54] Digilent Inc., “ZedBoard,” hardware user’s guide, 2014.

[55] Xilinx Inc., “Zynq-7000 All Programmable SoC overview,” product specification, 2016.

[56] Xilinx Inc., “MicroBlaze processor,” reference guide, 2013.

[57] Xilinx Inc., “LogiCORE IP LMB BRAM interface controller v4.0,” product guide for Vivado Design Suite, 2013.

[58] NVIDIA Corp., “NVIDIA Tegra K1 mobile processor,” technical reference manual, 2014.

[59] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge et al., “MiBench: A free, commercially representative embedded benchmark suite,” in Proc. IEEE Int. Workshop on Workload Characterization (WWC), 2001, pp. 3–14.

[60] M. Li, R. Sasanka, S. V. Adve, Y. Chen, and E. Debes, “The ALPBench benchmark suite for complex multimedia applications,” in Proc. IEEE Int. Symp. on Workload Characterization (IISWC), 2005, pp. 34–45.

[61] A. Marongiu, A. Capotondi, G. Tagliavini, and L. Benini, “Improving the programmability of STHORM-based heterogeneous systems with offload-enabled OpenMP,” in Proc. ACM Int. Workshop on Manycore Embedded Systems (MES), 2013, pp. 1–8.

[62] ARM Ltd., “ARM CoreLink cache coherent network family,” IP core family overview, 2017.

[63] B. Forsberg, A. Marongiu, and L. Benini, “GPUguard: Towards supporting a predictable execution model for heterogeneous SoC,” in Proc. IEEE/ACM Design, Automation and Test in Europe Conf. and Exhibition (DATE), 2017, pp. 318–321.

[64] G. Kornaros, K. Harteros, I. Christoforakis, and M. Astrinaki, “I/O virtualization utilizing an efficient hardware system-level memory management unit,” in Int. Symp. on System-on-Chip (SoC), 2014, pp. 1–4.

[65] P. Mantovani, E. G. Cota, C. Pilato, G. Di Guglielmo, and L. P. Carloni, “Handling large data sets for high-performance embedded applications in heterogeneous systems-on-chip,” in Proc. IEEE/ACM Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), 2016, pp. 3:1–3:10.

[66] K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand et al., “Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation,” in Proc. IEEE Int. Conf. on Computer Design (ICCD), 2016, pp. 25–32.

[67] N. Estibals, G. Deest, A. H. El Moussawi, and S. Derrien, “System level synthesis for virtual memory enabled hardware threads,” in Proc. IEEE/ACM Design, Automation and Test in Europe Conf. and Exhibition (DATE), 2016, pp. 738–743.

[68] S. Tavarageri, J. Ramanujam, and P. Sadayappan, “Adaptive parallel tiled code generation and accelerated auto-tuning,” Int. Journal of High Performance Computing Applications, vol. 27, no. 4, pp. 412–425, 2013.

[69] G. B. Kandiraju and A. Sivasubramaniam, “Going the distance for TLB prefetching: An application-driven study,” in Proc. ACM/IEEE Int. Symp. on Computer Architecture (ISCA), 2002, pp. 195–206.

[70] A. Saulsbury, F. Dahlgren, and P. Stenström, “Recency-based TLB preloading,” in Proc. ACM/IEEE Int. Symp. on Computer Architecture (ISCA), 2000, pp. 117–127.

[71] D. Rossi, I. Loi, F. Conti, G. Tagliavini, A. Pullini et al., “Energy efficient parallel computing on the PULP platform with support for OpenMP,” in Proc. IEEE Convention of Electrical and Electronics Engineers in Israel (IEEEI), 2014, pp. 1–5.

[72] D. Rossi, I. Loi, G. Haugou, and L. Benini, “Ultra-low-latency lightweight DMA for tightly coupled multi-core clusters,” in Proc. ACM Int. Conf. on Computing Frontiers (CF), 2014, pp. 15:1–15:10.

[73] Digilent Inc., “Xilinx Zynq Mini-ITX development kit,” user guide, 2015.

[74] S. Benatti, B. Milosevic, F. Casamassima, P. Schönle, P. Bunjaku et al., “EMG-based hand gesture recognition with flexible analog front end,” in Proc. IEEE Biomedical Circuits and Systems Conf. (BioCAS), 2014, pp. 57–60.

[75] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886–893.

[76] M. Magno, F. Tombari, D. Brunelli, L. Di Stefano, and L. Benini, “Multimodal abandoned/removed object detection for low power video surveillance systems,” in Proc. IEEE Int. Conf. on Advanced Video and Signal based Surveillance (AVSS), 2009, pp. 188–193.

[77] Willow Garage. (2014) OpenCV: Open source computer vision. open-source software library.

[78] S. Manegold. Calibrator v0.9e. open-source software application. http://homepages.cwi.nl/~manegold/.

[79] B. Wile, “Coherent accelerator processor proxy (CAPI) on POWER8,” presentation at Enterprise 2014 Conf., 2014.

[80] V. Mirian and P. Chow, “Evaluating shared virtual memory in an OpenCL framework for embedded systems on FPGAs,” in Proc. Int. Conf. on ReConFigurable Computing and FPGAs (ReConFig), 2015, pp. 1–8.

[81] OpenMP Architecture Review Board. (2013) OpenMP application program interface version 4.0. API specification.

[82] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine,” in Proc. Int. Conf. on World-Wide Web 7, 1998, pp. 107–117.

[83] J. Gall and V. Lempitsky, “Class-specific Hough forests for object detection,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1022–1029.

[84] P. Viola and M. Jones, “Robust real-time face detection,” Int. Journal on Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[85] M. Lavasani, H. Angepat, and D. Chiou, “An FPGA-based in-line accelerator for memcached,” IEEE Computer Architecture Letters, vol. 13, no. 2, pp. 57–60, 2014.

[86] H. C. Ng, Y. M. Choi, and H. K. H. So, “Direct virtual memory access from FPGA for high-productivity heterogeneous computing,” in Proc. IEEE Int. Conf. on Field-Programmable Technology (FPT), 2013, pp. 458–461.

[87] J. Cong, Z. Fang, Y. Hao, and G. Reinman, “Supporting address translation for accelerator-centric architectures,” in Proc. IEEE Int. Symp. on High Perf. Computer Architecture (HPCA), 2017, pp. 37–48.

[88] R. Ammendola, A. Biagioni, O. Frezza, F. L. Cicero, A. Lonardo et al., “Virtual-to-physical address translation for an FPGA-based interconnect with host and GPU remote DMA capabilities,” in Proc. IEEE Int. Conf. on Field-Programmable Technology (FPT), 2013, pp. 58–65.

[89] A. Agne, M. Platzner, and E. Lübbers, “Memory virtualization for multithreaded reconfigurable hardware,” in Proc. Int. Conf. on Field Programmable Logic and Applications (FPL), 2011, pp. 185–188.

[90] Y. Li, R. Melhem, and A. K. Jones, “PS-TLB: Leveraging page classification information for fast, scalable and efficient translation for future CMPs,” ACM Trans. on Architecture and Code Optimization, vol. 9, no. 4, pp. 28:1–28:21, 2013.

[91] ARM Ltd., “AMBA AXI and ACE protocol specification,” protocol specification, 2013.

[92] Intel Corp., “Arria 10 device overview,” product specification, 2016.

[93] Y. Guo, M. Biczak, A. L. Varbanescu, A. Iosup, C. Martella et al., “How well do graph-processing platforms perform? An empirical performance evaluation and analysis,” in Proc. IEEE Int. Parallel and Distributed Processing Symp. (IPDPS), 2014, pp. 395–404.

[94] ARM Ltd., “Cortex-A9 floating-point unit,” technical reference manual, 2012.

[95] A. Kurth, “Smart virtual memory sharing,” M.Sc. thesis, ETH Zürich, 2017.

[96] F. Winterstein and G. Constantinides, “Pass a pointer: Exploring shared virtual memory abstractions in OpenCL tools for FPGAs,” in Proc. IEEE Int. Conf. on Field-Programmable Technology (FPT), 2017, pp. 1–8.

[97] IBM Corp., “Coherent accelerator processor interface user’s manual,” User’s Manual, 2015, http://www.nallatech.com/wp-content/uploads/CoherentAcceleratorProcessorInterface UG v12 29JAN2015 pub.pdf.

[98] H. Giefers and M. Platzner, “An FPGA-based reconfigurable mesh many-core,” IEEE Trans. on Computers, vol. 63, no. 12, pp. 2919–2932, 2014.

[99] H. Khdr, S. Pagani, É. Sousa, V. Lari, A. Pathania et al., “Power density-aware resource management for heterogeneous tiled multicores,” IEEE Trans. on Computers, vol. 66, no. 3, pp. 488–501, 2017.

[100] AMD Inc., “2nd generation AMD Embedded R-Series APU,” product brief, 2017.

[101] NVIDIA Corp., “NVIDIA Jetson TX2 delivers twice the intelligence to the edge,” Parallel Forall blog, 2017, https://devblogs.nvidia.com/parallelforall/jetson-tx2-delivers- twice-intelligence-edge/.

[102] Microsemi Corp., “SmartFusion2 SoC FPGA,” Product brief, 2017.

[103] D. Nagle, R. Uhlig, T. Stanley, S. Sechrest, T. Mudge et al., “Design tradeoffs for software-managed TLBs,” ACM SIGARCH Computer Architecture News, vol. 21, no. 2, pp. 27–38, 1993.

[104] F. Shamani, V. F. Sevom, J. Nurmi, and T. Ahonen, “Design, implementation and analysis of a run-time configurable memory management unit on FPGA,” in Proc. IEEE Nordic Circuits and Systems Conf. (NORCAS), 2015, pp. 1–8.

[105] M. Dehyadegari, A. Marongiu, M. R. Kakoee, S. Mohammadi, N. Yazdani et al., “Architecture support for tightly-coupled multi-core clusters with shared-memory HW accelerators,” IEEE Trans. on Computers, vol. 64, no. 8, pp. 2132–2144, 2015.

[106] Z. Ullah, “LH-CAM: Logic-based higher performance binary CAM architecture on FPGA,” IEEE Embedded Systems Letters, vol. 9, no. 2, pp. 29–32, 2017.

[107] P. Yiannacouras, “An automatic cache generator for Stratix FPGAs,” B.Sc. thesis, University of Toronto, 2003.

[108] J. Corbet. (2011) Transparent huge pages. LWN article. http://lwn.net/Articles/423584/.

[109] J. Vesely, A. Basu, M. Oskin, G. H. Loh, and A. Bhattacharjee, “Observations and opportunities in architecting shared virtual memory for heterogeneous systems,” in Proc. IEEE Int. Symp. on Performance Analysis of Systems and Software (ISPASS), 2016, pp. 161–171.

[110] J. Fowers, K. Ovtcharov, K. Strauss, E. Chung, and G. Stitt, “A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication,” in Proc. IEEE Annual Int. Symp. on Field-Programmable Custom Computing Machines (FCCM), 2014, pp. 36–43.

[111] F. Saqib, A. Dutta, J. Plusquellic, P. Ortiz, and M. S. Pattichis, “Pipelined decision tree classification accelerator implementation in FPGA (DT-CAIF),” IEEE Trans. on Computers, vol. 64, no. 1, pp. 280–285, 2015.

[112] H. Le and V. K. Prasanna, “A memory-efficient and modular approach for large-scale string pattern matching,” IEEE Trans. on Computers, vol. 62, no. 5, pp. 844–857, 2013.

[113] W. Yang, K. Li, Z. Mo, and K. Li, “Performance optimization using partitioned SpMV on GPUs and multicore CPUs,” IEEE Trans. on Computers, vol. 64, no. 9, pp. 2623–2636, 2015.

[114] S. Schulter, C. Leistner, P. Roth, H. Bischof, and L. Van Gool, “On-line Hough forests,” in Proc. British Machine Vision Conf. (BMVC), 2011, pp. 128.1–128.11.

[115] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical Review E, vol. 76, 2007.

[116] P. Erdős and A. Rényi, “On random graphs I,” Publicationes Mathematicae, pp. 290–297, 1959.

[117] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” ACM Trans. on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1:1–1:25, 2011.

[118] S. Ellis, “Memory management on embedded graphics processors,” ARM Graphics and Multimedia blog, 2013, https://community.arm.com/graphics/b/blog/posts/memory-management-on-embedded-graphics-processors/.

[119] Trenz Electronic GmbH, “TE0808 TRM,” technical reference manual, 2017.

[120] A. Marongiu, A. Capotondi, G. Tagliavini, and L. Benini, “Simplifying many-core-based heterogeneous SoC programming with offload directives,” IEEE Trans. on Industrial Informatics, vol. 11, no. 4, pp. 957–967, 2015.

[121] A. Capotondi and A. Marongiu, “Enabling zero-copy OpenMP offloading on the PULP many-core accelerator,” in Proc. ACM Int. Workshop on Software and Compilers for Embedded Systems (SCOPES), 2017, pp. 68–71.

[122] P. Louridas. (2015) An open source PageRank implementation. open-source software project. https://github.com/louridas/pagerank.

[123] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky, “Hough forests for object detection, tracking, and action recognition,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2188–2202, 2011.

Curriculum Vitae

Pirmin Vogel was born on June 10, 1986 and grew up in Entlebuch, Switzerland. He received both his BSc and MSc degrees in electrical engineering and information technology from ETH Zurich, Switzerland, in 2009 and 2013, respectively. Since then, he has been a research assistant with the Integrated Systems Laboratory of ETH Zurich. In autumn 2013, Pirmin Vogel started his PhD in the Digital Circuits and Systems Group led by Prof. Dr. Luca Benini. His research interests include heterogeneous computing architectures and embedded systems on chip with a focus on operating system, driver, runtime and programming model support for efficient and transparent accelerator programming.
