Shared Virtual Memory for Heterogeneous Embedded Systems on Chip

ETH Zurich Research Collection: Doctoral Thesis
Author(s): Vogel, Pirmin
Publication Date: 2018
Permanent Link: https://doi.org/10.3929/ethz-b-000292606
Rights / License: In Copyright - Non-Commercial Use Permitted

Diss. ETH No. 25085

A thesis submitted to attain the degree of DOCTOR OF SCIENCES of ETH ZURICH (Dr. sc. ETH Zurich)

presented by PIRMIN ROBERT VOGEL, MSc ETH EEIT, born on 10.06.1986, citizen of Entlebuch LU, Switzerland

accepted on the recommendation of Prof. Dr. Luca Benini, examiner, and Prof. Dr. Marko Bertogna, co-examiner

2018

Acknowledgments

Obtaining the doctorate marks a major milestone in my career. I would like to express my gratitude to a number of people whose support forms the basis of this achievement.

First of all, I would like to thank my supervisor Prof. Luca Benini for giving me the opportunity to pursue a PhD at IIS, for guiding me throughout this challenging time while still giving me the required freedom, and for his interest in my research. Second, I would like to thank my co-examiner Prof. Marko Bertogna for reviewing my thesis and for the exciting discussion during the examination.

Special thanks go to Andrea Marongiu for his guidance during the last four years, for pushing and motivating me, and for his collaboration and the many fruitful discussions.

I would like to thank the glorious J69.2 office crew for lifting the morale in many mysterious ways.
More precisely, I thank David Bellasi for the joint, DDPA-enhanced, late-night speech and/or noise imitation sessions, Michael Gautschi for his comprehensive study on the interplay of paper yellowing and reading probability for interesting research articles during the soccer season, and Pascal Hager, holder of the J69.2 peace medal, for his politically (in)correct analyses, which cover all aspects of life and are always to the point. Moreover, I thank Andreas Kurth for never complaining but always staying positive and productive no matter the circumstances, and for his inspiring attitude, Michael Schaffner for mastering any problem of any complexity anytime and anywhere (AAAA rating), and for raising the Käfeliinterrupt frequency to the next level, and finally Florian Glaser for his ingenious pranks and the life-sustaining external Käfelievent triggers.

Doing research at the system level requires mature subsystems, strong infrastructure and technical support. I would like to thank all the members of the PULP team for their great work. Special thanks go to Germain Haugou and Alessandro Capotondi, as well as Igor Loi, for providing me with customized software and hardware modules and for supporting me with the adaptation of existing components.

At IIS, I could always count on stable and reliable infrastructure and working conditions thanks to many staff members doing good work in the background. In particular, I thank Frank Gürkaynak, Beat Muheim, Christoph Wicki, Adam Feigin, Hansjörg Gisler and Christine Haller. I also thank all my colleagues and companions at IIS, particularly Philipp Schönle, Noé Brun, Pascale Meier, Jonathan Bösser, Lukas Cavigelli, Benjamin Weber, Michael Mühlberghuber, Anne Ziegler, Andreas Traber, Mauro Salomon, Björn Forsberg, Daniele Palossi, Andrea Bartolini, Davide Schiavone, Stefan Mach, Fabian Schuiki, Fredi Kunz and Hubert Kaeslin.

Furthermore, I would like to thank all the people who, in some way, prepared me for this undertaking and actually made me start it. This includes my advisors during previous projects at IIS, namely Harald Kröll, Christian Benkeser, Sandro Belfanti, Stefan Altorfer, David Tschopp and Prof. Qiuting Huang, as well as Thomas Dellsperger, Norbert Felber and Frank Gürkaynak, who counseled me and encouraged me to start a PhD. Moreover, I would like to express my gratitude to the numerous teachers, coaches and supporters in my early days for everything they have taught me.

Finally, I thank my family and friends for admiring what I do and for supporting me at all times. I am very grateful to my parents for encouraging me and giving me the opportunity to go to university, and to my brother Matthias for leading the way for me during high school and the first years of my studies. Also, I thank Nathanael for sharing the same sense of humor, for the many truly inspiring discussions about life and for our long-term friendship. Ultimately, I thank Flurina for all her love, for supporting me throughout so many years, and for her patience and understanding during the final phase of my PhD.

Entlebuch, September 2018
Pirmin Vogel

Abstract

Modern embedded systems on chip (SoCs) are heavily based on heterogeneous architectures that combine feature-rich, general-purpose, multi-core host processors with massively parallel, programmable many-core accelerators (PMCAs) to achieve high flexibility, energy efficiency and peak performance. While these heterogeneous embedded SoCs (HESoCs) are nominally capable of tremendous performance/Watt targets, using them effectively is a cumbersome task that is nowadays left entirely in the hands of the application programmers.

The main challenges in traditional accelerator programming originate from partitioned memory models between host and accelerators.
Thanks to hardware-managed virtual memory (VM) and a multi-level cache hierarchy that together abstract away the low-level details of the memory system, the host sees memory as a flat resource. In contrast, PMCAs typically rely on physically addressed scratchpad memories (SPMs) managed in software via explicit direct memory access (DMA) transfers for maximum energy efficiency. Sharing data between host and accelerators in a heterogeneous application thus requires the programmer to manually orchestrate data copies between virtually and physically addressed main memory sections, and to translate and modify virtual address pointers inside the shared data so that they point to the proper copy. While this may be acceptable for regular memory access patterns, it quickly becomes a performance bottleneck and is completely prohibitive for the heterogeneous implementation of applications operating on complex, pointer-rich data structures.

This thesis focuses on the design of transparent, lightweight, zero-copy shared virtual memory (SVM) frameworks for HESoCs that allow application programmers to simply pass virtual address pointers between host and PMCA, and thereby improve both programmability and performance. As opposed to the full-fledged hardware designs for SVM found in high-performance computing (HPC) and high-end desktop systems, we strive for mixed hardware-software solutions that i) are better suited for area- and power-constrained embedded systems, ii) are less intrusive to the hardware architecture of both host and PMCA, and iii) allow for greater flexibility.

We start with a first lightweight SVM system suitable for the regular memory access patterns typical of today's data-parallel accelerator models. To support applications with irregular memory access patterns, we propose a lightweight, non-intrusive hardware extension combined with a compiler extension for the PMCA, which allows for zero-copy sharing of pointer-rich data structures.
Relying on PMCA-side helper threads for managing the address-translation hardware, the performance of this system lies within 50% of an idealized SVM system for purely memory-bound application kernels and within 5% for real applications. We further investigate the design of address-translation caches tailored to the needs of PMCAs and come up with a hybrid architecture that is configurable, scalable, and maps well to field-programmable gate arrays (FPGAs), which makes the design also suitable for custom accelerators in FPGA-enabled HESoCs. Compared to related works, this design increases the caching capacity by a factor of 16 or more while achieving lower overall resource utilization and higher or comparable clock frequencies. Finally, we adapt our framework to leverage the dedicated address-translation hardware featured by some next-generation, high-end HESoCs for enabling SVM, far beyond what is achievable with standard development frameworks and software distributions.

Zusammenfassung (Summary)

Modern embedded systems on chip (SoCs) are increasingly based on heterogeneous architectures. These combine feature-rich, general-purpose, multi-core host processors with massively parallel, programmable many-core accelerators (PMCAs) to achieve a high degree of flexibility as well as high energy efficiency and compute performance. While such heterogeneous embedded SoCs (HESoCs) nominally offer enormous compute performance at low energy consumption, using them and actually reaching this performance is a major challenge that is nowadays left entirely to the application programmers.

The main difficulty in traditional accelerator programming stems from the fundamentally different memory models of the different processor types. Through the use of dedicated hardware for memory virtualization and for managing multi-level cache hierarchies, the low-level details of the memory system can be abstracted
