INSTITUT F¨UR INFORMATIK Frederic Pascal Sautter

INSTITUT FUR¨ INFORMATIK der Ludwig-Maximilians-Universitat¨ Munchen¨ AFRAMEWORK FOR OBSERVING THE PARALLEL EXECUTION OF RUST PROGRAMS Frederic Pascal Sautter Masterarbeit Aufgabensteller Prof. Dr. François Bry Betreuer Prof. Dr. François Bry, Thomas Prokosch Abgabe am 07. Juni 2019 2 Erklarung¨ Hiermit versichere ich, dass ich die vorliegende Arbeit selbstandig¨ verfasst habe und keine anderen als die angegebenen Hilfsmittel verwendet habe. Munchen,¨ den 07. Juni 2019 Frederic Pascal Sautter i ii Abstract With computing hardware evolving, processors incorporate more and more processing units. As the number of simultaneously possible computations steadily increases, developing efficient and reliable parallel applications becomes essential in order to take full advantage of the additional computing power. The young programming language Rust aims to ease the development of such high-performance applications. In contrast to Fortran, C, and C++, languages common in the high-performance computing domain, Rust combines low-level control of resources with strong safety guarantees. However, existing parallel performance analysis tools such as Score-P, TAU, and Vampir lack support for Rust. This work presents the design and implementation of the tRust framework, which allows observing the execution of parallel Rust applications. This framework provides a modified Rust compiler for the automated insertion of probes into the observed program and its dependencies. A run-time library enables the transmission of observation data to a central collector application for persistent storage. This centrally collected data allows for extensive analysis of the run-time behavior of the program. The tRust framework is evaluated using three different benchmark applications, each targeting one of three common parallelism libraries in Rust. The experiments suggest good performance of the tRust framework. More important is the fact that tRust allows program- mers to precisely retrace program execution of concurrent and parallel Rust programs with libraries that implement parallel paradigms not immediately available in the established HPC programming languages Fortran, C, and C++. iii iv Zusammenfassung Durch laufende Weiterentwicklung von Rechnerhardware erhalten Prozessoren immer mehr parallel arbeitende Recheneinheiten. Da dadurch die Zahl der simultan moglichen¨ Berechnungen steigt, wird die Entwicklung von effizienten und zuverlassigen¨ parallelen Programmen unerlasslich,¨ um die zusatzliche¨ Rechenkapazitat¨ voll auszuschopfen.¨ Die junge Programmiersprache Rust versucht die Entwicklung solcher hochperformanter An- wendungen zu erleichtern. Im Gegensatz zu den Sprachen Fortran, C und C++, die bei der Programmierung von Hochleistungsrechnern vorwiegend eingesetzt werden, kombiniert Rust hardwarenahen Zugriff mit starken Sicherheitsgarantien. Jedoch, fehlt bei existieren- den parallelen Leistungsanalysewerkzeugen, Score-P, TAU und Vampir die Unterstutzung¨ fur¨ Rust. Diese Arbeit legt den Entwurf und die Umsetzung des tRust Frameworks dar, mit dem der Ablauf paralleler Rust-Anwendungen uberwacht¨ werden kann. Das Werkzeug stellt einen modifizierten Rust Compiler zur Verfugung,¨ der das automatisierte Einfugen¨ von Analyseinstruktionen in das zu uberwachte¨ Programm und dessen Abhangigkeiten¨ ermoglicht.¨ Eine Laufzeitbibliothek ermoglicht¨ das Ubertragen¨ der Uberwachungsdaten¨ an eine zentrale Sammelanwendung fur¨ die persistente Speicherung. Diese zentral abge- legten Daten konnen¨ in weiterer Folge fur¨ die ausfuhrliche¨ Analyse des Programmablaufs genutzt werden. Das vorgestellte tRust Werkzeug wird anhand von drei unterschiedlichen Testanwen- dungen beurteilt, wobei jede Anwendung auf eine von drei gelaufigen¨ parallelen Pro- grammbibliotheken zugeschnitten ist. Die Experimente legen nahe, dass tRust eine ange- messene Leistung erzielt. Wichtiger ist die Tatsache, dass Entwickler die Moglichkeit¨ erhalten, den Programmablauf von nebenlaufigen¨ und parallelen Rust Programmen, die in den ublicherweise¨ im Hochstleistungsbereich¨ eingesetzten Programmiersprachen Fortran, C und C++ nicht unmittelbar zur Verfugung¨ stehen. v vi Acknowledgments This thesis would not have been possible without the help and support of many others. First, I want to thank Prof. Dr. François Bry for his valuable advice and guidance through- out this project. The suggestions and the feedback he provided often helped to concentrate on what is essential. I want to express my great gratitude to my advisor Thomas Prokosch for his outstanding support and advice at all times. Whenever I got stuck along the way, Thomas encouraged me by providing useful suggestions and ideas. Particularly, his backing during the many meetings helped to focus on and structure my work. Special thanks go to my parents for their moral support especially in challenging times. Furthermore, I express my heartfelt gratitude to Annabel for her backing at all times. Fi- nally, I thank my friends for taking my mind off things from time to time. vii viii Contents 1 Introduction 1 2 The Rust Programming Language 3 2.1 Selected Syntax Elements . .3 2.2 Ownership and Borrowing . .6 2.3 Concurrency and Parallelism in Rust . 10 2.4 Rust Infrastructure and Compiler . 11 3 Theoretical Foundations 15 3.1 Instrumentation . 15 3.1.1 Source-based Instrumentation . 17 3.1.2 Preprocessor-based Instrumentation . 17 3.1.3 Compiler-based Instrumentation . 18 3.1.4 Wrapper Library-based Instrumentation . 18 3.1.5 Binary Instrumentation . 18 3.2 Architecture of Parallel Performance Tools . 18 3.2.1 Observer . 19 3.2.2 Inspector . 20 3.3 Parallel Libraries and Interfaces . 20 3.3.1 MPI . 20 3.3.2 Crossbeam (Rust) . 21 3.3.3 POSIX-Threads . 22 3.3.4 OpenMP . 22 3.3.5 Rayon (Rust) . 23 3.3.6 Timely Dataflow (Rust) . 24 3.4 Existing Tools . 25 3.4.1 TAU Parallel Performance System . 25 3.4.2 Vampir . 27 3.4.3 Score-P performance measurement infrastructure . 28 4 Framework Design 29 4.1 Technical Implementations of Rust Instrumentation . 30 4.1.1 Wrapper Libraries . 31 4.1.2 Compiler Plugin . 32 4.1.3 Drop-In Compiler . 33 4.2 Run-Time Instrumentation Architecture . 35 4.3 Instrumentation of Rust Source Code . 37 ix x CONTENTS 4.3.1 Import Statements . 37 4.3.2 Global Scope Instructions . 37 4.3.3 Local Scope Instructions . 38 4.3.4 Instrumentation Calls around Functions . 40 4.3.5 Instrumentation Calls around Methods . 41 4.4 Persistent Storage of Trace Data . 42 5 Experiments and Results 45 5.1 Evaluation Environment . 45 5.2 Benchmarks . 46 5.2.1 Creating Vanity Keys . 46 5.2.2 Fractal Calculation . 49 5.2.3 PageRank . 51 6 Conclusion and Future Work 55 A Usage of the tRust Framework 57 A.1 Setting up a Rustup Toolchain . 57 A.2 Description of the Configuration File . 58 B Experiments 59 Bibliography 69 CHAPTER 1 Introduction In recent years computing hardware has evolved to a point where almost every computing device contains parallel hardware. From small embedded systems to computer clusters consisting of thousands of nodes, multi-core processors have become the standard in 2019. Chip manufacturers such as Intel and AMD release processors containing up to 56 [6] or 64 [1] cores respectively. Moreover, many systems contain graphics chips, which are made up of an increasing number of compute units. The number of simultaneously possible computations constantly increases. In order to benefit from this additional computing power, appropriate software is required. Software developers need to design applications which effectively leverage parallelism. However, experience has shown that creating reliable and efficient parallel programs is difficult. Parallel computing is significantly more complex, as it introduces new classes of errors such as deadlocks and race-conditions, not possible in serial programs. These errors can be difficult to detect since they often occur under specific circumstances. Developing programming languages designed for the use in concurrent and parallel environments is one approach to enhance reliability and efficiency of parallel programs. Especially since the systems-level languages commonly used for the development of performance critical applications, in particular Fortran, C and C++ provide, little to no safety guarantees at all. For this reason, the Rust programming language was developed. Rust incorporates elements of conventional systems programming language as well as elements of modern high-level languages [7]. It combines low-level control of resources with strong safety guarantees. Rust’s expressive static type system enables the compiler to prevent many memory errors and data races. As a consequence, Rust empowers developers to take advantage of modern parallel hardware [42]. While Rust manages to prevent data races there are other errors it cannot prevent, such as bottlenecks occurring when threads are not properly synchronized. Such errors can only be detected at run-time. In order to analyze and further optimize the complex parallel run- time behavior additional tools are required. In the context of high-performance computing (HPC), there already exist sophisticated tools for the analysis of parallel and distributed execution. Tools like Score-P [37], TAU [56], HPCToolkit [18] and Vampir [48] allow the collection of detailed event data during the programs execution. Based on this data it is possible to gain insight into the programs state during execution, providing in-depth infor- mation of its run-time behavior. Accordingly, these tools assist developers in finding most of the remaining

INSTITUT F¨UR INFORMATIK Frederic Pascal Sautter

D5.1 Report on Performance and Tuning of Runtime Libraries for ARM

Scalasca User Guide

TAU, PAPI, Scalasca and Vampir

Score-P – a Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir

Profiling and Tracing Tools for Performance Analysis of Large Scale Applications

Best Practice Guide Modern Processors

CUBE 4.3.4 – User Guide Generic Display for Application Performance Data

Scalasca 2.3 User Guide Scalable Automatic Performance Analysis

The Scalasca Performance Toolset Architecture

Score-P – a Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir

The Arm Architecture for Exascale HPC

Large-Scale Performance Analysis of PFLOTRAN with Scalasca