Dynamically Adaptable I/O Semantics for High Performance Computing

Dissertation

zur Erlangung des akademischen Grades Dr. rer. nat. an der Fakultät für Mathematik, Informatik und Naturwissenschaften der Universität Hamburg

eingereicht beim Fachbereich Informatik von

Michael Kuhn

aus Esslingen

Hamburg, November 2014

Gutachter: Prof. Dr. Thomas Ludwig, Prof. Dr. Norbert Ritter

Datum der Disputation: 2015-04-08

Abstract

File systems as well as libraries for input/output (I/O) offer interfaces that are used to interact with them, albeit on different levels of abstraction. While an interface's syntax simply describes the available operations, its semantics determines how these operations behave and which assumptions developers can make about them. There are several different interface standards in existence, some of them dating back decades and having been designed for local file systems; one such representative is POSIX.

Many parallel distributed file systems implement a POSIX-compliant interface to improve portability. Its strict semantics is often relaxed to reach maximum performance, which can lead to subtly different behavior on different file systems. This, in turn, can cause application misbehavior that is hard to track down. All currently available interfaces follow a fixed approach regarding semantics, making them only suitable for a subset of use cases and workloads. While the interfaces do not allow application developers to influence the I/O semantics, applications could benefit greatly from the possibility of being able to adapt them to their requirements.

The work presented in this thesis includes the design of a novel I/O interface called JULEA. It offers support for dynamically adaptable semantics and is suited specifically for HPC applications. The introduced concept allows applications to adapt the file system behavior to their exact I/O requirements instead of the other way around. The general goal is an interface that allows developers to specify what operations should do and how they should behave – leaving the actual realization and possible optimizations to the underlying file system. Due to the unique requirements of the proposed interface, a prototypical file system is designed and developed from scratch.

The new I/O interface and file system prototype are evaluated using both synthetic benchmarks and real-world applications. This ensures covering both specific optimizations made possible by the file system's additional knowledge as well as the applicability for existing software. Overall, JULEA provides data and metadata performance comparable to that of other established parallel distributed file systems. However, in contrast to the existing solutions, its flexible semantics allows it to cover a wider range of use cases in an efficient way.

The results demonstrate that there is a need for I/O interfaces that can adapt to the requirements of applications. Even though POSIX facilitates portability, it does not seem to be suited for contemporary HPC demands. JULEA presents a first approach of how application-provided semantical information can be used to dynamically adapt the file system's behavior to the applications' I/O requirements.

Kurzfassung

Dateisysteme und Bibliotheken für Ein-/Ausgabe (E/A) stellen Schnittstellen für den Zugriff auf unterschiedlichen Abstraktionsebenen bereit. Während die Syntax einer Schnittstelle lediglich deren Operationen festlegt, beschreibt ihre Semantik das Verhalten der Operationen. Es existieren mehrere Standards für E/A-Schnittstellen, die teilweise mehrere Jahrzehnte alt sind und für lokale Dateisysteme entwickelt wurden; ein solcher Vertreter ist POSIX.

Viele parallele verteilte Dateisysteme implementieren eine POSIX-konforme Schnittstelle, um die Portabilität zu erhöhen. Ihre strikte Semantik wird oft relaxiert, um die maximal mögliche Leistung erreichen zu können, was aber zu subtil unterschiedlichem Verhalten führen kann. Dies kann wiederum zu schwer nachzuvollziehenden Fehlverhalten der Anwendungen führen. Alle momentan verfügbaren Schnittstellen verfolgen einen statischen Semantikansatz, wodurch sie nur für bestimmte Anwendungsfälle geeignet sind. Während die Schnittstellen keine Möglichkeit für Anwendungsentwickler bieten, die Semantik zu beeinflussen, wäre ein solcher Ansatz hilfreich für Anwendungen, um die Dateisysteme an ihre Anforderungen anpassen zu können.

Die vorliegende Dissertation beschäftigt sich mit dem Entwurf und der Entwicklung einer neuartigen E/A-Schnittstelle namens JULEA. Sie erlaubt es, die Semantik dynamisch anzupassen und ist speziell für HPC-Anwendungen optimiert. Das entwickelte Konzept erlaubt es Anwendungen, das Verhalten des Dateisystems an die eigenen E/A-Bedürfnisse anzupassen. Das Ziel ist eine Schnittstelle, die es Entwicklern gestattet zu spezifizieren, was Operationen tun sollen und wie sie sich verhalten sollen; die eigentliche Umsetzung und mögliche Optimierungen werden dabei dem Dateisystem überlassen. Aufgrund der einzigartigen Anforderungen der Schnittstelle wird außerdem ein prototypisches Dateisystem entworfen und entwickelt.

Das Dateisystem und die Schnittstelle werden mit Hilfe von synthetischen Benchmarks und praxisnahen Anwendungen evaluiert. Dadurch wird sichergestellt, dass sowohl spezifische Optimierungen als auch die Tauglichkeit für existierende Software überprüft werden. JULEA erreicht eine vergleichbare Daten- und Metadatenleistung wie etablierte parallele verteilte Dateisysteme, kann durch seine flexible Architektur aber einen größeren Teil von Anwendungsfällen effizient abdecken.

Die Resultate zeigen, dass E/A-Schnittstellen benötigt werden, die sich an die Anforderungen von Anwendungen anpassen. Obwohl der POSIX-Standard Vorteile bezüglich der Portabilität von Anwendungen bietet, ist seine Semantik nicht mehr für heutige HPC-Anforderungen geeignet. JULEA stellt einen ersten Ansatz dar, der es erlaubt, das Dateisystemverhalten an die Anwendungsanforderungen anzupassen.

Acknowledgments

First of all, I would like to thank my advisor Prof. Dr. Thomas Ludwig for supporting and guiding me in this endeavor. I first came into contact with high performance computing and file systems during his advanced software lab about the evaluation of parallel distributed file systems in the winter semester of 2005/2006 and have been interested in this topic ever since.

I am grateful for the many fruitful discussions, collaborations and fun times with my friends and colleagues from the research group and the DKRZ. In addition to my family, I also want to thank my wife Manuela who readily relocated to Hamburg with me. Special thanks to Konstantinos Chasapis, Manuela Kuhn and Thomas Ludwig for proofreading my thesis and giving me valuable feedback.

Last but not least, I would also like to thank everyone who has contributed to JULEA or this thesis in one way or another: Anna Fuchs for creating JULEA's correctness and performance regression framework and helping with the partdiff benchmarks, Sandra Schröder for building LEXOS and JULEA's corresponding storage backend, and Alexis Engelke for implementing the reordering logic for the ordering semantics.

“There is a theory which states that if ever anyone discovers exactly what the Universe is for and why it is here, it will instantly disappear and be replaced by something even more bizarre and inexplicable. There is another theory which states that this has already happened.”

Douglas Adams – The Restaurant at the End of the Universe

Contents

1. Introduction ...... 13
1.1. High Performance Computing ...... 13
1.2. Parallel Distributed File Systems ...... 15
1.3. Input/Output Interfaces and Semantics ...... 17
1.4. Motivation ...... 18
1.5. Contribution ...... 21
1.6. Structure ...... 21

2. State of the Art and Technical Background ...... 23
2.1. Input/Output Stack ...... 23
2.2. File Systems ...... 26
2.3. Object Stores ...... 29
2.4. Parallel Distributed File Systems ...... 30
2.5. Input/Output Interfaces ...... 35
2.6. Input/Output Semantics ...... 42
2.7. Namespaces ...... 46

3. Interface and File System Design ...... 49
3.1. Architecture ...... 49
3.2. File System Namespace ...... 54
3.3. Interface ...... 55
3.4. Semantics ...... 60
3.5. Data and Metadata ...... 70

4. Related Work ...... 75
4.1. Metadata Management ...... 75
4.2. Semantics Compliance ...... 77
4.3. Adaptability ...... 78
4.4. Semantical Information ...... 79

5. Technical Design ...... 87
5.1. Architecture ...... 89
5.2. Metadata Servers ...... 91
5.3. Data Servers ...... 93
5.4. Client ...... 98
5.5. Miscellaneous ...... 103

6. Performance Evaluation ...... 111
6.1. Hardware and Software Environment ...... 111
6.1.1. Performance Considerations ...... 112
6.2. Data Performance ...... 113
6.2.1. Lustre ...... 114
6.2.2. OrangeFS ...... 119
6.2.3. JULEA ...... 122
6.2.4. Discussion ...... 134
6.3. Metadata Performance ...... 135
6.3.1. Lustre ...... 137
6.3.2. JULEA ...... 138
6.3.3. Discussion ...... 149
6.4. Lustre Observations ...... 149
6.5. Partial Differential Equation Solver ...... 150
6.5.1. Discussion ...... 154

7. Conclusion and Future Work ...... 157
7.1. Future Work ...... 162

Bibliography 167

Appendices 179

A. Additional Evaluation Results 181

B. Usage Instructions 185

C. Code Examples 191

Index 201

List of Acronyms 203

List of Figures 205

List of Listings 207

List of Tables 209

Chapter 1. Introduction

In this chapter, basic background information from the fields of high performance computing and parallel distributed file systems will be introduced. This includes common use cases and key architectural features. Additionally, the concepts of I/O interfaces and semantics will be briefly explained. A special focus lies on the deficiencies of today’s interfaces which do not allow applications to modify the semantics of I/O operations according to their needs.

1.1. High Performance Computing

High performance computing (HPC) is a branch of computer science that is concerned with the use of supercomputers and has become an increasingly important tool for computational science. Supercomputers combine the power of hundreds to thousands of central processing units (CPUs) to provide enough computational power to tackle especially complex scientific problems.1 They are used to conduct large-scale computations and simulations of complex systems from basically all branches of the natural and technical sciences, such as meteorology, climatology, particle physics, biology, medicine and computational fluid dynamics. Recently, other fields such as economics and social sciences have also started to make use of supercomputers.

As these simulations have become more and more accurate and thus realistic over the last years, their demands for computational power have also increased. Because CPU clock rates are no longer increasing [Ros08] and the number of CPUs per computer is limited, it has become necessary to distribute the computational work across multiple CPUs and computers. Therefore, these computations and simulations are usually realized in the form of parallel applications. While all CPUs in the same computer can make use of threads, it is common to employ message passing between different computers. Large-scale applications typically use a combination of both to distribute work across a supercomputer.

Due to the heavy dependency of scientific applications on floating-point arithmetic operations, floating-point operations per second (FLOPS) are used to designate a

1 CPUs typically contain multiple cores and the fastest supercomputers incorporate millions of cores.


supercomputer's computational power and have replaced the simpler instructions per second (IPS) metric. The performance development can be most easily observed using the so-called TOP500 list. It ranks the world's most powerful supercomputers according to their computational performance in terms of FLOPS as measured by the HPL benchmark [DLP03]. The performance of the supercomputers ranked number 1 and 500 as well as the sum of all 500 systems during 1993–2014 is shown in Figure 1.1 using a logarithmic y-axis. As can be seen, throughout the history of the TOP500 list, the computational power of supercomputers has been increasing exponentially, doubling roughly every 14 months. The currently fastest supercomputers reach rates of several PetaFLOPS – that is, more than 10^15 FLOPS.


Figure 1.1.: TOP500 performance development from 1993–2014 [The14c]

While the increasing computational power has allowed more accurate simulations to be performed, this has also caused the simulation results to grow in size. Even though data about the supercomputers' storage systems is often not as readily available, the highest ranking supercomputers currently have storage systems that are around 10–60 petabytes (PB) in size and have throughputs in the range of terabytes (TB)/s. The main memory of such systems usually already exceeds 1 PB. Consequently, simply dumping the supercomputer's complete main memory to the storage system – which is a very common process called checkpointing – can already take several minutes in the best case.2 However, it is also possible for checkpointing to take up to several hours for imbalanced system configurations. Many HPC applications frequently write checkpoints to be able to restart in case of errors due to their long runtimes. Additionally, job execution is typically coordinated by so-called job schedulers; these

2 Storing 1 PB with 1 TB/s takes 1,000 s, which equals 16:40 min.

schedulers generally only allow allocations of up to several hours, that is, long-running applications have to write checkpoints in order to split up their runtime into several chunks.

Due to the large amounts of data produced by parallel applications, the tools to perform analysis and other post-processing often have to be parallel applications themselves. Therefore, high performance input/output (I/O) is an important aspect because storing and retrieving such large amounts of data can greatly affect the overall performance of these applications. For example, a parallel application may perform iterative calculations on a large matrix that is distributed across a number of processes on different computers. To be able to comprehend the application's calculations, it is often necessary to output the matrix every given number of iterations. This obviously influences the application's runtime since the matrix has to be completely written before the program can continue to run. A common access pattern produced by these applications involves many parallel processes, each performing non-overlapping accesses to a shared file. Because each process is responsible exclusively for a part of the matrix, each process only accesses the part of the file related to the data this specific process holds.

However, the I/O requirements of parallel applications can vary widely: While some applications process large amounts of input data and produce relatively small results, others might work using a small set of input data and output large amounts of data; additionally, the aforementioned data can be spread across many small files or be concentrated into few large files. Naturally, any combination thereof is also possible. Additionally, the source of data is diverse: Detectors and sensors might deliver data at high rates or parallel applications may produce huge amounts of in-silico data. As can be seen, the different requirements of parallel applications place high demands on supercomputers' storage systems.
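To make the non-overlapping access pattern described above concrete, the following sketch emulates it with plain POSIX calls from Python; in a real parallel application every process would issue one such positional write itself, and the file name and sizes are purely illustrative.

import os

PROCESSES = 4          # number of parallel processes (illustrative)
CHUNK = 1024 * 1024    # bytes of the matrix owned by each process (illustrative)

fd = os.open("shared-matrix.dat", os.O_CREAT | os.O_WRONLY, 0o644)
for rank in range(PROCESSES):
    data = bytes([rank % 256]) * CHUNK
    # Each process writes exclusively to its own, non-overlapping byte range.
    os.pwrite(fd, data, rank * CHUNK)
os.close(fd)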

1.2. Parallel Distributed File Systems

The storage system used by the parallel applications is usually made available by a parallel distributed file system. File systems provide an abstraction layer between the applications and the actual storage hardware, such that application developers do not have to worry about the organizational layout or technology of the underlying storage hardware. Additionally, file systems usually offer standardized access interfaces to reduce portability issues.

To meet the high demands of current HPC applications, parallel distributed file systems offer efficient parallel access and distribute data across multiple storage devices. On the one hand, parallel access allows multiple clients to cooperatively work with the same data concurrently, which is a key requirement for the parallel applications used today. On the other hand, the distribution of data makes it possible to use both the combined storage capacity and the combined throughput of the underlying storage devices. This is necessary to be able to build the huge storage


systems with capacities of several PB and throughputs in the range of TB/s described above. Figure 1.2 illustrates these two concepts: Multiple clients access a single file concurrently while the file’s data is distributed across several storage devices.


Figure 1.2.: Parallel access from multiple clients and distribution of data

While home directories and smaller amounts of data are sometimes still stored on non-parallel file systems such as NFS3, large-scale I/O is almost always backed by a parallel distributed file system. Two of the most widely used parallel distributed file systems today are Lustre [Clu02] and GPFS [SH02], which power most of the TOP500’s supercomputers [The14c].


Figure 1.3.: Parallel distributed file system

Figure 1.3 shows the general architecture of an exemplary parallel distributed file system. Machines can be divided into two groups: clients and servers. The clients

3 Network File System


have access to the parallel distributed file system and are used to execute the parallel applications. They usually do not have storage attached locally and have to perform all I/O by sending requests to the server machines via the network. The servers are attached to the actual file system storage and process client requests. These servers can be full-fledged computers or simpler storage controllers.

Because all I/O operations have to traverse the network, they can be expensive to perform in such an architecture. This is due to the fact that the network introduces additional latency and throughput constraints. However, newer concepts such as burst buffers are increasingly used to improve this situation [LCC+12]. Details about the architecture of common parallel distributed file systems – including the different kinds of servers and the distribution of data – will be given in Chapter 2.

1.3. Input/Output Interfaces and Semantics

Parallel distributed file systems provide one or more I/O interfaces that can be used to access data within the file system. Usually at least one of them is standardized, while additional proprietary interfaces might offer improved performance at the cost of portability. Additionally, higher-level I/O interfaces are provided by I/O libraries and offer additional features usually not found in file systems. Popular interface choices include POSIX4, MPI-IO, NetCDF5 and HDF6.

Almost all the I/O interfaces found in HPC today offer simple byte- or element-oriented access to data and thus do not have any a priori information about what kind of data the applications access and what the high-level access patterns look like. However, this information can be very beneficial for optimizing the performance of I/O operations. There are notable exceptions, though: For instance, ADIOS7 outsources the I/O configuration into an external XML8 file that can be used to describe which data structures should be accessed and how the data is organized. On the one hand, this additional information enables the I/O library to provide more sophisticated access possibilities for developers and users. On the other hand, the knowledge can be used by the library to handle I/O requests efficiently. However, even these more advanced I/O interfaces do not offer support to specify additional semantical information about the applications' behavior and requirements. Due to this lack of knowledge about application behavior, optimizations are often based on heuristic assumptions which may or may not reflect the actual behavior.

The I/O stack is realized in the form of layers, with the typical view of a developer being shown in Figure 1.4 on the next page. The parallel application uses a high-level

4 Portable Operating System Interface
5 Network Common Data Form
6 Hierarchical Data Format
7 Adaptable IO System
8 Extensible Markup Language


Figure 1.4.: Simplified view of the I/O stack (Parallel Application → NetCDF → Lustre)

I/O interface – in this case, NetCDF – that writes its data to a file system – in this case, Lustre. The underlying idea of this concept is that developers only have to care about the uppermost layer and can safely ignore the other ones; in fact, it should be possible to exchange the lower layers without any behavioral change. However, in reality, the I/O stack is much more complex and features a multitude of intermediate layers that have subtle influences on the I/O system's behavior. Additionally, it is necessary to take all layers into account to obtain optimal performance. This can lead to I/O behavior that is very hard to predict, let alone explain and understand. Consequently, it is a difficult task to address potential performance problems. The whole I/O stack and its problems will be described in more detail in Chapter 2.

While the I/O interface defines which I/O operations are available, the I/O semantics describes and defines the behavior of these operations. Usually each I/O interface is accompanied by a set of I/O semantics, tailored to this specific interface. The POSIX I/O semantics is probably both the oldest and the most widely used semantics, even in HPC. However, due to being designed for traditional local file systems, it imposes unnecessary restrictions on today's parallel distributed file systems. POSIX's very strict consistency requirements are one of these restrictions and can lead to performance bottlenecks in distributed environments.

Parallel distributed file systems often implement the strictest I/O semantics – that is, the POSIX I/O semantics – to accommodate applications that require it or simply expect it to be available for portability reasons. However, this can lead to suboptimal behavior for many use cases because its strictness is often not necessary. Even though application developers usually know their applications' requirements and could easily specify them for improved performance, current I/O interfaces and file systems do not provide appropriate facilities for this task.

1.4. Motivation

Performing I/O efficiently is becoming an increasingly important problem. CPU speed and hard disk drive (HDD) capacity have roughly increased by factors of 500 and 100 every 10 years, respectively [The14c, Wik14e]. The speed of HDDs, however, grows more slowly: Early HDDs in 1989 delivered about 0.5 megabytes (MB)/s, while


current HDDs manage around 150 MB/s [Wik14b]. This corresponds to a 300-fold increase in throughput over the last almost 25 years. Even newer technologies such as SSDs only offer throughputs of around 600 MB/s, resulting in a total speedup of 1,200. For comparison, over the same period of time, the computational power increased by a factor of more than 1,000,000 due to increasing investments. While this problem cannot be easily solved without major breakthroughs in hardware technology, it is necessary to use the storage hardware as efficiently as possible to alleviate its effects.

To make the problem worse, the growth rate of HDD capacity has recently also started to slow down. While the same is true for CPU clock rates, this particular problem is being compensated for by growing numbers of increasingly cheap cores. However, the price of storage has stayed more or less constant for the last several years, requiring additional investment to keep up with the advancing processing power.

(a) HDD capacities from 1980–2014 (b) HDD speeds from 1989–2009

Figure 1.5.: Development of HDD capacities and speeds [Wik14a, Wik14b]

Figures 1.5a and 1.5b show the increase in HDD capacity and speed over roughly the same period of time. As can be seen, HDD capacity is growing much faster than HDD speed, which leads to various problems even outside of HPC. For example, simply rebuilding a replaced HDD in a redundant array of independent disks (RAID) took around 30 minutes in 2004, while the same operation takes more than seven hours today. Although it is theoretically possible to compensate for this fact in the short term by simply buying more storage hardware, the ever increasing gap between the

9 Assuming a 160 gigabytes (GB) HDD with a throughput of 75 MB/s.
10 Assuming a 4 TB HDD with a throughput of 150 MB/s.


exponentially growing processing power on the one hand and the stagnating storage capacity and throughput on the other hand requires new approaches to use the storage infrastructure as efficiently as possible.
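The rebuild times mentioned above follow directly from the assumptions in footnotes 9 and 10, since a rebuild has to write the complete capacity of the replacement disk once; a small sketch of the arithmetic:

def rebuild_hours(capacity_bytes, throughput_bytes_per_second):
    # A rebuild writes the whole replacement disk once at sequential speed.
    return capacity_bytes / throughput_bytes_per_second / 3600

MB = 10 ** 6
GB = 10 ** 9
TB = 10 ** 12

print(rebuild_hours(160 * GB, 75 * MB))   # 2004: roughly 0.6 hours, i.e. about half an hour
print(rebuild_hours(4 * TB, 150 * MB))    # today: roughly 7.4 hours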

Usage           Component     Size             Speed
Desktop         CPU           8 cores          100 GFLOPS
                Main Memory   16 GiB           50 GiB/s
                Storage       4 TB             600 MB/s
TOP500, rank 1  CPU           3,120,000 cores  33.9 PFLOPS
                Main Memory   1.3 PiB          No data
                Storage       12.4 PB          No data
TOP500, rank 2  CPU           560,640 cores    17.6 PFLOPS
                Main Memory   694 TiB          No data
                Storage       40 PB            1.4 TB/s

Table 1.1.: Comparison of important components in different types of computers

To properly assess a system's performance, it is not only necessary to take the absolute sizes and speeds into account, but also to consider their relationship with each other. Interesting quantities include the amount of main memory per core, the proportion of main memory and storage, as well as the main memory and storage throughput per core. Table 1.1 contains typical sizes and speeds of the most important computer components for different usage scenarios.11 Based on the given numbers, typical desktop computers are equipped with 2 gibibytes (GiB) of main memory per core and offer 250 GB of storage per 1 GiB of main memory; additionally, the storage can be accessed with 600 MB/s. Assuming a fair distribution among all cores, this provides per-core throughputs of 6.25 GiB/s to the main memory and 75 MB/s to the storage.

The numbers change drastically when looking at supercomputers: The TOP500 system ranked number 1 is equipped with a very large number of cores, a reasonable amount of main memory and a relatively small storage system. It offers 0.44 GiB of main memory per core and 9.5 GB of storage per 1 GiB of main memory; this equals 22 % and 3.8 % of the desktop computer's main memory per core and storage per main memory, respectively. While the moderate amount of main memory per core is usually sufficient for the very compute-intensive HPC applications, the small amount of storage dramatically limits the amount of data that can be stored. As mentioned earlier, HPC applications often write checkpoints to storage. Using this configuration, it is only possible to dump the main memory contents eight times before the storage

11 The components for the desktop usage represent a reasonably powerful desktop computer in 2014; the data for the TOP500 systems ranked number 1 and 2 can be found in the 2014-06 list [The14c].


is filled up.12 This is a stark contrast to the 232 possible main memory dumps on a typical desktop computer.

The TOP500 system ranked number 2 is equipped with much fewer cores, half the amount of main memory, but a much larger storage system. It offers 1.27 GiB of main memory per core and 57.6 GB of storage per 1 GiB of main memory. This corresponds to 63.5 % and 23 % of the desktop computer's main memory per core and storage per main memory, respectively. While this amount of storage offers more freedom for storing large checkpoints and application output, the actual storage throughput warrants a closer look. Assuming that all cores access the storage in a fair manner, the system offers a throughput of 2.5 MB/s per core; this merely corresponds to 3.3 % of the desktop computer's per-core storage throughput.

Due to the reasons outlined above, it is necessary to use the storage systems of supercomputers as efficiently as possible, both in terms of capacity as well as performance. Many of the parallel distributed file systems in use today do not allow applications to exhaust their potential. This is due to the fact that these file systems are optimized for specific use cases and do not offer enough opportunities for application developers to optimize them according to their applications' needs.
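The ratios used in this section can be recomputed directly from Table 1.1; the sketch below does so for all three systems. As in the text, memory is counted with binary prefixes and storage with decimal prefixes, so the storage-per-memory values come out marginally lower than the rounded figures above.

GiB, TiB, PiB = 2 ** 30, 2 ** 40, 2 ** 50
MB, GB, TB, PB = 10 ** 6, 10 ** 9, 10 ** 12, 10 ** 15

def ratios(name, cores, memory, storage, storage_speed=None):
    print(name)
    print("  main memory per core: %.2f GiB" % (memory / cores / GiB))
    print("  storage per GiB of memory: %.1f GB" % (storage / (memory / GiB) / GB))
    print("  memory dumps until storage is full: %d" % (storage // memory))
    if storage_speed is not None:
        print("  storage throughput per core: %.2f MB/s" % (storage_speed / cores / MB))

ratios("Desktop", 8, 16 * GiB, 4 * TB, 600 * MB)
ratios("TOP500, rank 1", 3_120_000, 1.3 * PiB, 12.4 * PB)
ratios("TOP500, rank 2", 560_640, 694 * TiB, 40 * PB, 1.4 * TB)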

1.5. Contribution

The goal of this thesis is to explore the usefulness of additional semantical information in the I/O interface. The JULEA13 framework introduces a newly designed I/O interface featuring dynamically adaptable semantics that is suited specifically for HPC applications. It allows application developers to specify the semantics of I/O operations at runtime and supports batch operations to increase performance. The overall goal is to allow the application developer to specify the desired behavior and leave the actual realization to the I/O system. This should allow applications to make the most of the available storage hardware and thus increase the overall efficiency of I/O in HPC systems. This approach is expected to improve the current situation because existing solutions simply do not allow such fine-grained control over so many different aspects of file system operations.

1.6. Structure

This thesis is structured as follows: Chapter 2 contains an overview of the current state of the art; all important concepts related to file systems, object stores, I/O interfaces and I/O semantics are introduced and explained. The design of the JULEA I/O

12 The count of eight stems from the fact that main memory and storage are counted using GiB and GB, respectively. While GiB use a base of two, GB use a base of ten.
13 JULEA is not an acronym.

interface is elaborated in Chapter 3, focusing on the differences to traditional I/O interfaces and file systems. Chapter 4 covers related work and compares JULEA's design with existing approaches. Select parts of the implementation are presented in-depth in Chapter 5. Chapter 6 contains an analysis of the behavior of different file systems using both synthetic benchmarks as well as real-world applications. A conclusion and future work are given in Chapter 7.

Summary

This chapter has introduced the I/O problems found in today’s HPC systems that are caused by the ever increasing gap between computational speed on the one hand and storage capacity and speed on the other hand. It has also given an overview of parallel distributed file systems as well as I/O interfaces and semantics, and their impact on overall performance. Because current supercomputers show a trend of neglecting their storage systems in favor of computation, new approaches are necessary to make the most of the available storage hardware.

Chapter 2. State of the Art and Technical Background

In this chapter, an in-depth overview of existing technologies related to I/O interfaces and semantics will be provided. Today’s I/O interfaces and semantics will be analyzed regarding their suitability and adaptability for high performance computing applications. Additionally, different approaches for managing the file system namespace will be compared.

2.1. Input/Output Stack

Input/output (I/O) stacks usually feature a strongly layered architecture. Traditionally, this has been a major advantage because the clear separation between the different layers provides benefits regarding portability and interchangeability of individual layers. Figure 2.1a on the following page shows the relatively simple I/O stack of a traditional application that directly uses the underlying file system's I/O interface. Since all layers interact using standardized interfaces, it is easily possible to exchange the underlying storage device or even file system without adapting the application.

However, the I/O stack used by current high performance computing (HPC) applications is much more complex due to the more advanced requirements. This has led to the situation visualized in Figure 2.1b on the next page, which illustrates all the different layers involved in common scenarios. The different layers will be briefly explained below; more detailed information can be found in the following sections.

Parallel Application This can be an arbitrary parallel program executed on a su- percomputer. For instance, this could be an earth system model using the de-facto standard MPI1 for communication. It uses NetCDF2 to read and write its data, which is a popular choice because it allows easy exchange of data.

1 Message Passing Interface
2 Network Common Data Form


(a) Traditional I/O stack: Application (user space) → File System (kernel space) → Block Storage
(b) HPC I/O stack: Parallel Application → NetCDF → HDF5 → MPI-IO/ADIO (user space) → Lustre → ldiskfs (kernel space) → Block Storage

Figure 2.1.: I/O stacks used in traditional and HPC applications

NetCDF This high-level I/O library provides a convenient interface to interact with self-describing data. This allows storing additional meta information together with the data, which is widely used in the natural sciences. However, it does not define its own file format and thus does not directly store the data itself. Instead, it delegates this task to yet another I/O library called HDF3.
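As a brief sketch of what self-describing data looks like in practice (using the netCDF4 Python bindings; the variable, its values and its attributes are made-up examples), descriptive metadata is stored right next to the values:

from netCDF4 import Dataset

ds = Dataset("temperatures.nc", "w")               # the default NetCDF-4 format stores its data via HDF5
ds.createDimension("time", 4)
temp = ds.createVariable("temperature", "f8", ("time",))
temp.units = "K"                                   # meta information describing the data
temp.long_name = "surface air temperature"
temp[:] = [271.5, 272.0, 273.2, 274.1]
ds.close()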

HDF This high-level I/O library also provides an interface to interact with self- describing data, similar to NetCDF. Additionally, it defines file formats to actually store and access this data. It can use different storage backends such as POSIX4 for serial I/O and MPI-IO for parallel I/O.

MPI-IO The so-called I/O middleware provides a portable interface for data access that abstracts from the underlying file system. It usually includes optimizations for different file systems to enable efficient I/O. Common implementations of MPI include MPICH5 and OpenMPI; both use ROMIO to provide MPI-IO support. ROMIO implements file-system-specific optimizations in the so-called ADIO6 layer [HK04]. This middleware then accesses a parallel distributed file system.
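A minimal sketch of this middleware layer, here through the mpi4py bindings: every process writes its own block of a shared file at a rank-dependent offset, and the MPI library (via ROMIO and its ADIO backends) maps the collective call onto the underlying file system. The file name and block size are illustrative.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

block = np.full(1024, rank, dtype=np.uint8)        # this process's part of the data
fh = MPI.File.Open(comm, "checkpoint.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
# Collective write; every rank targets its own non-overlapping byte range.
fh.Write_at_all(rank * block.nbytes, block)
fh.Close()

Such a program would be launched with, for example, mpiexec -n 4.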

Lustre The parallel distributed file system provides common file system functionalities such as metadata management, path lookup and striping to the upper layers. Parallel distributed file systems often provide a POSIX interface for portable access;

3 Hierarchical Data Format
4 Portable Operating System Interface
5 MPICH acts as the base for many other popular MPI implementations such as MVAPICH, IBM MPI, Intel MPI and Cray MPT.
6 Abstract-Device Interface for I/O

proprietary interfaces are also possible, however. Lustre uses another underlying POSIX file system to store both its data and metadata. In this case, ldiskfs provides a full-featured file system based on ext4.

Block Storage The low-level storage hardware provides storage capacity for the file system. It is often supplied in a block-oriented fashion in the form of locally attached hard disk drives (HDDs) or solid state drives (SSDs). However, it is also possible to use more complex architectures like a storage area network (SAN).

2.1.1. Problems

All I/O initiated by the application has to pass all the different layers and is potentially copied and transformed multiple times along its way. While the clear separation between layers provides advantages in terms of portability and interchangeability, it can prove counterproductive for performance. The different interfaces are often inappropriate to transport information necessary for high performance across layer boundaries. Additionally, no reliable information might be available about the other involved layers, making it hard – if not impossible – to adapt to the software environment at hand. This can have a negative impact on overall I/O performance.

The fact that the layers do not have any information about the other layers often implies that each layer has to perform its own optimizations to be able to use the I/O system's full potential. For instance, almost all layers implement their own caching to reduce the number of I/O accesses that have to be performed. However, these optimizations can also conflict and actually reduce the achieved performance.

While the upper layers usually provide more comfort and abstraction, the performance yield might be lower. They often provide interfaces that are more suited for handling data types actually found in parallel applications and it would therefore be favorable to be able to use them. Because performance is often more important than convenience, the difficult-to-use byte-oriented lower layers are often used directly to harness the I/O system's full potential. This has led to a multitude of different libraries written around the low-level I/O interfaces.

One problem that drives developers to the low-level interfaces is the fact that the high-level I/O libraries often do not offer fine-grained control over the actual I/O and instead hide this complexity from the user. For instance, while it is easily possible to align the I/O operations for optimal performance with POSIX and HDF, NetCDF does not offer such functionality [Bar14]. Additional semantical information could help reduce the need for fine-grained control by providing the I/O system with enough information to make meaningful decisions by itself. Presently, support for modifying the I/O semantics is very limited at best. While some layers provide basic support, it is currently not possible to pass semantical information down through the I/O stack. To ease the development of codes


in need of high performance I/O, it would be very beneficial to provide easy-to-use interfaces that are still able to provide adequate performance.

Abstraction   Interface   Data Types   Control
High          NetCDF      Structures   Coarse-grained
              MPI-IO      Elements
Low           POSIX       Bytes        Fine-grained

Figure 2.2.: Levels of abstraction found in the HPC I/O stack

Figure 2.2 gives an overview of the different levels of abstraction found in the I/O stack. Higher levels of abstraction such as those provided by NetCDF allow convenient access to data structures but only coarse-grained control over the I/O interface’s behavior. The I/O middleware is provided by MPI-IO, which provides an element-based interface and some degree of control over internal functionalities. The lowest level of abstraction is provided by the POSIX interface; access is only possible in the form of a byte stream but I/O can be manually tuned for optimal performance due to the fine-grained control.

The remainder of this chapter will give an in-depth overview of the complete I/O stack with the exception of block storage, which will only be mentioned briefly. To convey how the different components build and improve upon each other, the overview will be given from bottom to top.

2.2. File Systems

File systems store, manage and make data available for later reuse in an organized fashion. Without them, developers and users would have to interface directly with the storage hardware – for example, HDDs and SSDs.

Traditionally, file systems expose two basic data structures called files and directories. While files contain actual data, directories are used for organizational purposes. Directories can contain files as well as other directories, usually providing a hierarchical file system namespace. Virtually all file systems distinguish these two concepts, even if they are not necessarily called the same. Files and directories are usually accessed by their name – called a path; more information about path traversal is available in Section 2.7 on page 46. Examples of traditional file systems include Windows's NTFS7, OS X's HFS+8 or Linux's ext4 [MCB+07].

7 New Technology File System
8 Hierarchical File System Plus


While files and directories represent the bare minimum in terms of file system functionality, many file systems provide additional features such as so-called named pipes and forks: Named pipes provide a first in, first out (FIFO) for inter-process communication (IPC) within the file system. Forks allow storing several different data streams within a single file or directory [Wik14c, Mea03]. For instance, image files could have an additional data stream storing thumbnails of the image to make their on-the-fly generation redundant.

The explanations and observations in this thesis will focus on Linux because it is the de-facto standard operating system used in the HPC field. It is therefore important to note that almost all Linux file systems are POSIX-compliant. Consequently, the POSIX standard plays an important role regarding file systems; more information about it and its implications will be provided in Section 2.5.1 on page 35.

Linux kernel file systems are also generally implemented using the so-called virtual file system (VFS) layer. This layer provides a standardized interface – in this case, as defined by POSIX – that file systems can implement. The VFS then provides uniform access for user space applications: Independent of the actual file system implementation, applications can use the POSIX interface to perform I/O. The VFS layer then takes care of forwarding the I/O operations to the appropriate file system implementation. On the one hand, this provides benefits regarding portability because applications do not have to be aware of the underlying file system. On the other hand, it only allows file systems to provide a POSIX interface, making it more complicated to experiment with alternative interface approaches.

2.2.1. File System Metadata

Files stored within a file system consist of data as well as metadata. While the file's data represents the actual content of the file (for example, an image or a movie), metadata translates to "data about data" and refers to structural information in the context of file systems. This information is required for data management and traditionally stored in so-called inodes – or index nodes. File system metadata should not be confused with metadata in the context of self-describing data formats; the latter refers to additional information about the data and will be explained later.

File data can vary extremely in size, ranging from configuration files occupying only some bytes to videos and simulation results that can easily use several gigabytes (GB) or even terabytes (TB). Metadata usually only occupies several bytes – for example, ext4's default inode size is 256 bytes. Larger sizes of a few kilobytes (KB) are also possible, but relatively uncommon. Inodes usually have a fixed size and a fixed format with fields for permissions, ownership, different timestamps, flags and much more.

Figure 2.3 on the next page shows an excerpt of an inode as found in the ext4 file system. The inode contains fields of fixed size for the different types of metadata; these can be roughly separated into three main areas:


Size       Content
2 bytes    Permissions
2 bytes    User identifier (ID)
4 bytes    Size
4 bytes    Access time
4 bytes    Inode change time
4 bytes    Data modification time
4 bytes    Deletion time
2 bytes    Group ID
...        ...
60 bytes   Block map, extent tree or inline data
...        ...
4 bytes    Version number
100 bytes  Free space

Figure 2.3.: Structure of a 256 bytes inode (struct ext4_inode) [Won14]

1. The first fields are used to store access permissions, user and group ownership, the file’s size, and different timestamps.

2. The block of 60 bytes in the middle of the inode can contain different kinds of data, depending on the type of object an inode describes. ext4 supports a new extent-based allocation scheme that stores the extent map inside this block. However, if the extent-based allocation scheme is disabled9, the block is used to store block mappings to direct, indirect, double indirect and triple indirect blocks. If the file’s size is below 60 bytes, all file data is inlined into the block; this can be beneficial for overall file system performance by reducing additional read operations for the actual file data: In local file systems, additional read operations usually require costly seek operations on the HDDs. In parallel distributed file systems, the necessity to communicate with additional servers via the network implies even more overhead. For example, the file tool has to read the first few bytes of a file to determine its type. Operations such as running file on large numbers of files can be sped up significantly by inlining data, because no additional data blocks or extents have to be read. Obviously, this also applies to all other tools which have to read file headers.

3. The 100 bytes of free space at the end of the inode can be used to store extended attributes such as access control lists (ACLs). Should this space not be sufficient to retain all extended attributes, additional entries can be stored in a data block.

9 This is always the case for ext2 and ext3. ext4 can also be used without extents, but this is uncommon.


Because of their fixed format and size, it is usually not possible to change or even extend the schema of inodes after the file system's creation without breaking compatibility. Examples of this are ext4's repurposing of the block map field for data inlining and the reservation of free space at the end of the inodes for future extensions.
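Because the format is fixed, the leading inode fields can be decoded purely from their byte offsets; the following sketch parses the first fields listed in Figure 2.3 from a raw 256-byte inode image (obtaining such an image, for example with debugfs, is assumed and not shown).

import struct

def decode_inode_head(raw):
    # First 26 bytes of struct ext4_inode as listed in Figure 2.3 (little endian):
    # permissions, user ID, size, access/change/modification/deletion times, group ID.
    fields = struct.unpack("<HHIIIIIH", raw[:26])
    names = ("mode", "uid", "size", "atime", "ctime", "mtime", "dtime", "gid")
    return dict(zip(names, fields))

# Example with a zero-filled inode image; real data would come from the file system.
print(decode_inode_head(bytes(256)))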

2.3. Object Stores

In contrast to full-featured file systems, object stores provide only a very low level of abstraction on top of storage devices. Strictly speaking, object stores simply offer object-oriented access to data, which can be achieved on any abstraction layer. For example, cloud services usually offer object stores for data storage. However, only low-level object stores will be considered here.

Instead of exposing raw block storage to the end user, object stores offer access to so-called objects while handling tasks such as block or extent allocation and management of free space internally [ADD+08]. These objects can optionally be organized in so-called object sets, which can be used to group related objects. While object stores are often discussed as a replacement for the low-level block storage, they can also be used as light-weight and low-overhead replacements for file systems when only basic storage management capabilities are required. File systems such as btrfs10 and ZFS11 actually use object stores internally. This allows separating the functionality for storage management and advanced file system features, leading to cleaner and more maintainable code.

Objects are usually accessed using unique identifiers such as simple integers or hashes. This can allow very fast access to the objects because no path lookup overhead is incurred; more information about this is provided in Section 2.7 on page 46.

While there are a few object stores available, different technical issues prevent their use by external third-party applications. First, the interfaces of the internal object stores used by file systems are usually not exported for consumption by third parties. Second, even exported interfaces may not be easily usable. For example, ZFS's data management unit (DMU) is largely undocumented and not meant to be used from user space. Last, independently usable object stores are often discontinued in favor of off-the-shelf file systems due to development and maintenance overhead. For instance, Ceph developed and used its own EBOFS12 [WBM+06], but dropped it back in 2009; it has since been replaced by btrfs.

10 b-tree file system
11 Zettabyte File System
12 Extent and B-tree-based Object File System
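Conceptually, such a low-level object store exposes little more than create, read, write and delete on flat identifiers, without any path lookup; the toy sketch below (a purely illustrative in-memory stand-in, not a real object store) shows the kind of interface meant here.

import itertools

class ToyObjectStore:
    def __init__(self):
        self._objects = {}                  # identifier -> bytearray
        self._ids = itertools.count(1)      # simple integer identifiers

    def create(self):
        object_id = next(self._ids)
        self._objects[object_id] = bytearray()
        return object_id

    def write(self, object_id, offset, data):
        obj = self._objects[object_id]
        obj.extend(b"\0" * max(0, offset + len(data) - len(obj)))
        obj[offset:offset + len(data)] = data

    def read(self, object_id, offset, length):
        return bytes(self._objects[object_id][offset:offset + length])

    def delete(self, object_id):
        del self._objects[object_id]

store = ToyObjectStore()
oid = store.create()                        # objects are addressed by their identifier only
store.write(oid, 0, b"some data")
print(store.read(oid, 0, 9))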


2.4. Parallel Distributed File Systems

As presented briefly in Chapter 1, parallel distributed file systems consist of clients and servers that are communicating via a network. The servers can be separated into data and metadata servers. Data servers are usually only used to store the actual file data, while metadata servers hold all information regarding the file system's organizational structure, such as file metadata and directories. The workload is distributed across all of them to increase capacity as well as performance. While there are always multiple data servers, some file systems still use centralized metadata management, that is, they support only a single metadata server. However, due to the difficulty of scaling such an approach to higher numbers of clients and files, it is being increasingly replaced by distributed metadata, which uses multiple metadata servers.

Due to this separation, data and metadata servers usually see different access patterns. While file data is meant to be accessed in large chunks, file system metadata is small by design. Consequently, metadata servers are often subject to large numbers of small, random accesses. As HDDs are very bad at handling these kinds of workloads, the problem is often mitigated using HDDs with a high number of revolutions per minute (RPM) or – as is becoming more and more attractive – SSDs, which offer orders of magnitude higher input/output operations per second (IOPS) than HDDs. There have also been approaches using alternative storage technologies such as persistent random access memory (RAM) but these have not been widely adopted [WKRP06].

Technology   Device            IOPS
HDD          7,200 RPM         75–100
             10,000 RPM        125–150
             15,000 RPM        175–210
SSD          Intel X25-M G2    8,600
             OCZ Vertex 4      85,000–90,000

Table 2.1.: IOPS for exemplary HDDs and selected SSDs [Wik14d]

Table 2.1 contains a list of selected storage devices and their respective IOPS for illustrative purposes. As can be seen, a single high-end SSD can easily provide as many IOPS as 450 high-end HDDs.13 On the one hand, SSDs have a much higher price per GB than HDDs – around 0.8 € per GB for SSDs in comparison to around 0.04 € per GB for HDDs in 2014. On the other hand, metadata only occupies a fraction of the space needed for the actual data – estimations for metadata size are usually in the range of 5 % of the data contained within a file system. Overall, SSDs are an appealing alternative for workloads limited by the number of IOPS such as those commonly

13 This example uses a modern SSD with 90,000 IOPS and a modern HDD with 200 IOPS.


seen on metadata servers. However, more research is needed to be able to fully take advantage of the improved capabilities of SSDs; currently, the metadata performance of parallel distributed file systems can only be sped up by a factor of 2–4 by using SSDs [AEHH+11].


Figure 2.4.: Round-robin data distribution

The file system logic is often implemented in the clients that can usually decide autonomously which servers to contact; the servers do not have to communicate with each other and act as simple data stores. The partitioning of data and metadata is handled using so-called distributions. As illustrated in Figure 2.4, round-robin schemes are a common approach for data distribution; they are used to distribute data in equal chunks across multiple servers in circular order. In this example, eight blocks of data (B0–B7) are distributed across five data servers. The servers hold so-called stripes of the data; in this example, the data blocks exactly correspond to the stripes. As can be seen, the round-robin distribution does not necessarily have to start at the first data server; in this case, it starts at the second. The starting server is usually chosen randomly to ensure an even distribution.14 The data blocks are distributed normally until the last data server is reached; afterwards, the distribution restarts with the first data server. These round-robin schemes can lead to unbalanced distributions of data, as demonstrated in this example. However, due to the random starting server and large file sizes this problem can usually be ignored.

Metadata is often distributed by means of hashing; using cryptographic hash functions such as the SHA family has the advantage of providing uniform distributions. In these cases, the file name or full path is hashed to decide which metadata server to target. While metadata is usually distributed across multiple metadata servers, it is

14 Otherwise, multiple clients accessing the beginning of different files would all contact the same data server, which could have negative impacts on performance.


often not striped in any way, that is, the complete metadata for a single file is managed by exactly one metadata server. Clients and servers are usually hosted on different sets of physical machines to provide more predictable performance characteristics as computational load on the clients should not influence the servers’ I/O performance and vice versa [LM10].
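Both placement schemes described above reduce to a few lines of arithmetic: data blocks are assigned to servers round-robin starting from a randomly chosen server, while metadata is placed by hashing the file name or path. A sketch (the server counts and the choice of SHA-256 are illustrative):

import hashlib
import random

def data_server_for_block(block_index, start_server, num_data_servers):
    # Round-robin striping: consecutive blocks land on consecutive servers.
    return (start_server + block_index) % num_data_servers

def metadata_server_for_path(path, num_metadata_servers):
    # Hashing the path yields a uniform distribution across the metadata servers.
    digest = hashlib.sha256(path.encode()).digest()
    return int.from_bytes(digest, "big") % num_metadata_servers

start = random.randrange(5)                        # randomly chosen starting server
print([data_server_for_block(b, start, 5) for b in range(8)])   # blocks B0 to B7 on 5 servers
print(metadata_server_for_path("/home/user/output.nc", 2))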

Based on this generic explanation, a detailed description of Lustre’s architecture and an overview of OrangeFS will be provided in the following sections.

2.4.1. Lustre

Lustre is an open source parallel distributed file system that is widely used on current supercomputers. In contrast to other proprietary solutions such as GPFS15, it is possible to adapt, extend and improve Lustre due to it being licensed under the GNU General Public License (GPL), version 2. Lustre powers around half of the TOP100 supercomputers and almost one third of all TOP500 supercomputers [Fel13].

Lustre was started in 1999 by Peter Braam, who founded his own company called Cluster File Systems in 2001 to continue development. Cluster File Systems was acquired by Sun Microsystems in 2007, which started bundling Lustre with its HPC hardware. Oracle Corporation bought Sun Microsystems in 2010 and soon announced that it would cease Lustre development. Today, Lustre is developed and supported by Intel (formerly Whamcloud), Xyratex, OpenSFS16, EOFS17 and others.

As is common in parallel distributed file systems, Lustre distinguishes between clients and servers. It is possible to run clients and servers on the same nodes for testing purposes but it is common to distribute them to separate nodes in production environments. While all clients are identical, the servers can have different roles:

• Object storage servers (OSSs) manage the file system’s data. They provide an object-based interface that clients can use to access byte ranges within the objects. Each OSS is connected to possibly multiple object storage targets (OSTs) that store the actual file data.

• Meta data servers (MDSs) manage the file system’s metadata, such as directories, file names and permissions. MDSs are not involved in the actual I/O but only contacted once when a file is created or opened. The clients are then able to independently contact the appropriate OSSs. Each MDS is connected to possibly multiple meta data targets (MDTs) that store the actual metadata.

15 General Parallel File System
16 Open Scalable File Systems
17 European Open File Systems


Lustre does not grant the clients direct access to the storage, but instead delegates this responsibility to the servers. Clients send their requests to the appropriate servers that process them and then in turn send a response to the client. Both MDTs and OSTs use an underlying file system to store their data. Traditionally, an improved version of ext4 called ldiskfs has been used. However, support for ZFS's DMU has been added in Lustre's version 2.4.

Lustre has been implemented as a Linux kernel file system. Its client supports standard Linux kernels, though support for newer Linux versions had to be added manually and was thus not available immediately. However, the client has been merged into the Linux kernel as of version 3.12 and is now available without any further actions.18 Lustre's server part is only compatible with special enterprise kernels, such as those found in SUSE Linux Enterprise Server and Red Hat Enterprise Linux.19

Figure 2.5.: Lustre architecture (clients connected via the network to the MDSs and OSSs with their attached MDTs and OSTs)

Figure 2.5 demonstrates the general architecture of Lustre using a simple example with ten clients and eight servers. There are two MDSs handling all metadata accesses and six OSSs processing all data accesses. Each MDS and OSS has two storage devices attached that represent the MDTs and OSTs, respectively. The Lustre file system can be accessed using a mount point on the clients; they handle all accesses and communicate with the appropriate MDSs and OSSs via the network. Traditionally, Lustre has only supported a single MDT and consequently one active MDS, optionally complemented by a second failover MDS. With the growing number of clients in today’s supercomputers, this posed a serious threat to future scalability

18 The Lustre client module included in the Linux kernel may occasionally lag behind upstream Lustre development, however.
19 Free and binary-compatible alternatives such as CentOS and Scientific Linux are also available.

because all metadata was centralized on one server. Lustre 2.4 introduced the so-called distributed namespace (DNE) which allows the system administrators to statically distribute the file system namespace across multiple MDTs. While the current implementation only supports a static partitioning, a feature planned for future Lustre releases will offer true distributed metadata with which the directories themselves can be striped across multiple MDTs. For example, it is common for HPC systems to provide two directories /home and /scratch that are used to house the users’ home directories and scratch data, respectively. DNE can be used to provide appropriate resources depending on the intended usage. /home usually contains many – possibly small – files, resulting in high metadata access rates. Less metadata performance might be sufficient for /scratch because it usually only contains a small number of large files. This could be solved by using a high-end SSD with a large number of IOPS as /home’s MDT and a cheaper but slower solution for /scratch’s MDT. The necessary commands to set up Lustre’s DNE for this use case can be found in Appendix B.3.1 on page 190.

Figure 2.6.: One client accessing a file inside a Lustre file system – (a) Step 1: metadata lookup; (b) Step 2: data access

Figure 2.6 shows how a typical access to a file inside a Lustre file system takes place. The example uses five clients, one MDS with two MDTs and three OSSs with two OSTs each. The highlighted connections indicate active communication between the involved machines. The second client wants to access an arbitrary file. In order to do so, it has to contact the MDS to perform a path lookup (see Figure 2.6a). The MDS will return the file’s metadata, including its distribution information. Using the distribution information, the client can autonomously determine which OSSs


contain parts of the file’s data. Afterwards, the client can contact the appropriate OSSs concurrently (see Figure 2.6b on the preceding page). The fact that the MDS is only involved during the initial opening of the file ensures that possible metadata performance bottlenecks do not influence data performance.

2.4.2. OrangeFS

OrangeFS is another open source parallel distributed file system, mainly developed by Clemson University, Argonne National Laboratory and Omnibond. It is the successor of the PVFS20 project, having started as a development branch of PVFS in 2007. In 2010, OrangeFS became the main branch and replaced PVFS. OrangeFS supports multiple data and metadata servers in its current version 2.8. Even though it supports multiple metadata servers, a single directory cannot be distributed across multiple servers. However, support for distributed directories is scheduled for version 2.9. OrangeFS has excellent MPI-IO support because the widely used MPI-IO implementation ROMIO provides a native backend. Even though OrangeFS is not as commonly used as Lustre, it still provides interesting features. It can be run completely from user space without the need for any kernel modules: the servers run as normal user space processes, an MPI-IO interface is provided through ROMIO and a POSIX interface is available via a FUSE21 file system. An additional, optional kernel module is available that allows mounting OrangeFS like any other Linux file system [VRC+04]. Moreover, OrangeFS’s code base is much smaller than Lustre’s, making it easier to develop modifications and extensions for it.

2.5. Input/Output Interfaces

I/O interfaces define a set of possible operations that can be performed. Additionally, each I/O interface is usually accompanied by its own set of I/O semantics that is tailored specifically to this interface. A description of the most common I/O interfaces and their corresponding semantics follows.

2.5.1. POSIX

The POSIX I/O interface was originally designed for use in local file systems. Its first formal specification dates back to 1988, when it was included in POSIX.1; specifications for asynchronous and synchronous I/O were added in POSIX.1b from 1993 [IG13]. This interface is very widely used, even in parallel distributed file systems, and thus provides excellent portability [VLR+08].

20 Parallel Virtual File System
21 Filesystem in Userspace


1 int open (const char *pathname, int flags, mode_t mode);
2 int close (int fd);
3 ssize_t pread (int fd, void* buf, size_t count, off_t offset);
4 ssize_t pwrite (int fd, const void* buf, size_t count, off_t offset);
5 int fstat (int fd, struct stat *buf);
6 int unlink (const char *pathname);

Listing 2.1: POSIX I/O interface

To give an overview of POSIX’s functionality and usability, Listing 2.1 shows selected functions provided by the POSIX interface:

• The open function can be used to create new files and open existing ones (line 1). It accepts a path pathname and returns a so-called file descriptor that is represented by an integer. Its arguments flags and mode can be used to specify different file flags and permissions for newly created files, respectively. There are actually three versions of the open function: the presented one with three arguments, another version with two arguments and the creat function. The version with two arguments omits the mode argument and can only be used for already existing files. The creat function is equivalent to the open function called with flags set to O_CREAT | O_WRONLY | O_TRUNC, which specifies that the file should be created if it does not exist, opened in write-only mode and truncated to size 0 if it already exists.

• The close function simply closes the open file descriptor fd (line 2).

• The pread function reads data from a file specified by an open file descriptor fd (line 3). It reads count bytes, starting at byte position offset, and stores the read data in the buffer buf.

• The pwrite function performs the opposite operation (line 4). It writes count bytes to the file specified by fd, starting at byte position offset; the to-be-written data is taken from the buffer buf. The traditional read and write operations work the same as their p-prefixed counterparts but do not accept the offset argument. Instead, they operate using a file pointer that is advanced automatically after each operation.

• The fstat function returns metadata about the open file descriptor fd and stores it in buf (line 5). There are two more variants of the fstat function: The stat function works the same but accepts a path instead of an open file descriptor. The lstat function is identical to the stat function, except that it does not dereference symbolic links. buf is a structure containing multiple fields

– 36 – CHAPTER 2. STATE OF THE ART AND TECHNICAL BACKGROUND

for the metadata, such as st_size for the file size and st_mtime for the last modification timestamp.

• The unlink function deletes a file given by the path pathname (line 6). To be precise, the unlink function only removes a link to a file. As files are reference counted objects, they are only deleted if their reference count – that is, their number of links – drops to zero. It is necessary to use the rmdir or remove functions to delete directories. While the former only removes directories, the latter is able to remove both files and directories.

A longer code example using the functions mentioned above can be found in Appendix C.1 on pages 192–194.
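
For a quick impression of how these functions interact, the following minimal sketch (error handling omitted; the file name data.bin is an arbitrary example) creates a file, writes and reads back five bytes at a fixed offset, queries the file size and finally removes the file:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main (void)
{
	char data[6] = "Hello";
	char check[6] = { 0 };
	struct stat sb;

	/* Create the file with read/write permissions for the owner. */
	int fd = open("data.bin", O_CREAT | O_RDWR | O_TRUNC, S_IRUSR | S_IWUSR);

	/* Write five bytes at offset 100 and read them back. */
	pwrite(fd, data, 5, 100);
	pread(fd, check, 5, 100);

	/* The reported size includes the hole before offset 100. */
	fstat(fd, &sb);
	printf("size: %lld\n", (long long)sb.st_size);

	close(fd);
	unlink("data.bin");

	return 0;
}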

2.5.2. MPI-IO

The MPI-IO interface offers support for parallel I/O and was introduced in the MPI standard’s version 2.0 in 1997 [Mes12]. All I/O operations are handled in an analogous fashion to MPI’s normal message passing operations. It provides an I/O middleware that abstracts from the actual underlying file system. The popular ROMIO implementation uses the ADIO layer that includes support and optimizations for POSIX, NFS, OrangeFS and many other file systems [HK04]. In contrast to the byte-oriented POSIX interface, the MPI-IO interface is element-oriented and uses the existing MPI infrastructure of MPI datatypes to access data within files. However, the actual I/O functions look very similar to their POSIX counterparts [Seh10].

1 int MPI_File_open (MPI_Comm comm, char* filename, int amode, MPI_Info info, MPI_File* fh);
2 int MPI_File_close (MPI_File* fh);
3 int MPI_File_read_at (MPI_File fh, MPI_Offset offset, void* buf, int count, MPI_Datatype datatype, MPI_Status* status);
4 int MPI_File_write_at (MPI_File fh, MPI_Offset offset, void* buf, int count, MPI_Datatype datatype, MPI_Status* status);
5 int MPI_File_get_size (MPI_File fh, MPI_Offset* size);
6 int MPI_File_delete (char* filename, MPI_Info info);

Listing 2.2: MPI-IO I/O interface

Listing 2.2 shows selected functions provided by the MPI-IO interface. All functions return an integer that signals whether the operation was successful or not; this can be checked by comparing it with the status code MPI_SUCCESS and several error codes.

• Files are created and opened using the MPI_File_open function (line 1). It accepts a path filename and returns a so-called file handle fh that can be used

– 37 – CHAPTER 2. STATE OF THE ART AND TECHNICAL BACKGROUND

to access the file with the following functions. Different file access modes can be specified using the amode argument. MPI-IO provides access modes such as MPI_MODE_CREATE and MPI_MODE_WRONLY that are identical to their POSIX counterparts. The info argument can be used to provide so-called hints to the MPI-IO implementation; for instance, this allows modifying internal buffer sizes and timeouts. All processes in the MPI communicator comm perform the operation collectively and must provide the same values for the amode and filename arguments. Opening individual files can be accomplished using the MPI_COMM_SELF communicator.

• MPI_File_close closes the file handle fh again (line 2).

• The MPI_File_read_at function reads data from the opened file handle fh (line 3). It reads count elements of type datatype, starting at position offset; the data is stored in buf. In contrast to POSIX’s byte-oriented interface, MPI-IO provides an element-oriented interface. To work with single bytes, it is possible to specify MPI_BYTE as the datatype argument. The operation’s status is stored in status. Checking the number of read elements requires an additional step after the operation has finished: The MPI_Get_count function can be used to extract this information from the MPI_Status object.

• The MPI_File_write_at function performs the opposite operation, writing data into the opened file handle fh (line 4). It writes count elements of type datatype, starting at position offset; the data is taken from buf. Again, the number of written elements can be checked using the MPI_Get_count function.

• The MPI_File_get_size function retrieves the size of the file opened using the file handle fh and stores it in size (line 5). In contrast to the POSIX interface, it is not possible to read additional metadata using the MPI-IO interface. For example, it is not possible to get the last modification time.

• The MPI_File_delete function deletes the file specified by the path filename (line 6). MPI-IO hints can be given using the info argument.

A longer example using the above-mentioned MPI-IO functions can be found in Appendix C.2 on pages 195–197.
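
Analogously, a minimal MPI-IO sketch (error checking omitted; the file name data.bin is an arbitrary example) in which each process writes five bytes to its own non-overlapping byte range and reads them back might look as follows:

#include <mpi.h>
#include <stdio.h>

int main (int argc, char** argv)
{
	int rank;
	char data[6] = "Hello";
	char check[6] = { 0 };
	MPI_File fh;
	MPI_Status status;
	MPI_Offset size;

	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);

	/* Collectively create and open the shared file. */
	MPI_File_open(MPI_COMM_WORLD, "data.bin", MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

	/* Each process accesses its own non-overlapping byte range. */
	MPI_File_write_at(fh, rank * 5, data, 5, MPI_BYTE, &status);
	MPI_File_read_at(fh, rank * 5, check, 5, MPI_BYTE, &status);

	MPI_File_get_size(fh, &size);
	MPI_File_close(&fh);

	MPI_Barrier(MPI_COMM_WORLD);

	if (rank == 0)
	{
		printf("size: %lld\n", (long long)size);
		MPI_File_delete("data.bin", MPI_INFO_NULL);
	}

	MPI_Finalize();

	return 0;
}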

2.5.3. SIONlib

SIONlib provides an I/O interface that allows scalable access to task-local files [FWP09]. It internally maps all accesses to a single physical file or a small number of physical files and aligns accesses to the file system’s block size. Additionally, it strives to minimize the number of changes necessary to use the interface by providing wrappers for the common


fread and fwrite functions. Opening and closing files requires the use of special SIONlib-specific functions, though.

1 int fd;
2 FILE* fp;
3
4 fd = sion_paropen_mpi(..., &fp, ...);
5
6 for (...)
7 {
8 	fwrite(..., fp);
9 }
10
11 sion_parclose_mpi(fd);

Listing 2.3: SIONlib parallel I/O example

An example of parallel access using SIONlib is shown in Listing 2.3. A file is opened in parallel mode with the collective function sion_paropen_mpi that returns both a file descriptor fd as well as a so-called file stream fp (lines 1–4). A non-collective open is available via the sion_open_rank function; serial access is provided by sion_open and sion_close. After opening the file, some data is written using the standard fwrite function (lines 6–9). Finally, the file is closed using the sion_parclose_mpi function (line 11). SIONlib is a good example of a library that exists primarily to overcome shortcomings in current file systems. On the one hand, current file systems often have problems when dealing with large numbers of files. On the other hand, shared file performance often degrades dramatically when the I/O operations are not aligned to the file system’s block size due to locking overhead, which should not be necessary if only non-overlapping accesses occur. SIONlib tries to mitigate these problems by intelligently managing the number of underlying physical files and transparently aligning the data; this is achieved by allocating contiguous chunks of data for each process and remapping accesses to its own internal file layout. In summary, using more intelligent file systems could make many libraries working around file system limitations obsolete. The additional information that is required to enable this kind of intelligence can be provided by semantical approaches such as the one proposed in this thesis.

2.5.4. HDF

HDF comprises a set of file formats and libraries that allow storing and accessing self-describing collections of data, and is widely used in scientific applications [The14a].


While HDF5 is the current version, HDF4 is still actively supported. However, due to its complicated API and several limitations – such as the use of signed 32-bit integers for addressing, limiting HDF4 files to a maximum size of 2 GiB – HDF4 is not a feasible choice for newly developed codes anymore. HDF5 supports two major types of data structures: datasets and groups. These two objects are used analogously to files and directories, that is, datasets are used to store data, while groups are used to structure the namespace. Groups can contain several datasets as well as other groups, leading to a hierarchical layout. Datasets can store multi-dimensional arrays of a given datatype. Objects within an HDF5 file are then accessed using POSIX-like paths such as /path/to/dataset. As can be seen, the dataset name can be used to describe the meaning of the dataset’s values, such as temperature or wind speed in a climate simulation. Additionally, arbitrary metadata – that is, information about the data – can be attached to datasets and groups in the form of user-defined, named attributes. This can be used to store information such as the allowed minimum and maximum values within a dataset together with the actual data. HDF files are self-describing and thus allow accessing them without any prior knowledge about their structure or content. HDF5 supports multiple storage backends, including POSIX and MPI-IO. Using the MPI-IO backend, it is possible to perform parallel I/O from multiple clients into a single HDF5 file.
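
As a brief illustration of these concepts, a minimal sketch using the HDF5 C API might look as follows; the file name simulation.h5 as well as the group, dataset and attribute names are arbitrary examples and error checking is omitted:

#include <hdf5.h>

int main (void)
{
	double temperature[2][3] = { { 1.0, 2.0, 3.0 }, { 4.0, 5.0, 6.0 } };
	double maximum = 6.0;
	hsize_t dims[2] = { 2, 3 };

	/* Create a new HDF5 file and a group to structure the namespace. */
	hid_t file = H5Fcreate("simulation.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
	hid_t group = H5Gcreate2(file, "/climate", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

	/* Create a two-dimensional dataset of doubles and write the data. */
	hid_t space = H5Screate_simple(2, dims, NULL);
	hid_t dataset = H5Dcreate2(group, "temperature", H5T_NATIVE_DOUBLE, space, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
	H5Dwrite(dataset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, temperature);

	/* Attach a named attribute describing the data to the dataset. */
	hid_t attribute_space = H5Screate(H5S_SCALAR);
	hid_t attribute = H5Acreate2(dataset, "maximum", H5T_NATIVE_DOUBLE, attribute_space, H5P_DEFAULT, H5P_DEFAULT);
	H5Awrite(attribute, H5T_NATIVE_DOUBLE, &maximum);

	H5Aclose(attribute);
	H5Sclose(attribute_space);
	H5Dclose(dataset);
	H5Sclose(space);
	H5Gclose(group);
	H5Fclose(file);

	return 0;
}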

2.5.5. NetCDF

NetCDF, like HDF, consists of a set of libraries and self-describing file formats, and is used in scientific applications, especially from the fields of climatology, meteorology and oceanography [RD90]. Three major NetCDF formats are in existence today: the classic format, the 64-bit offset format and the NetCDF-4 format. While the former two are independent data formats, the NetCDF-4 format uses HDF5 underneath. There are several options for performing parallel I/O using NetCDF: Most importantly, NetCDF-4 supports parallel I/O for NetCDF-4 – that is, HDF5 – files. Parallel I/O for classic and 64-bit offset files is possible using either recent versions of the official NetCDF library or the third-party Parallel-NetCDF library that features an incompatible interface.
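
A corresponding minimal sketch of serial NetCDF usage is shown below; the file, dimension and variable names are arbitrary examples and error checking is omitted. It creates a NetCDF-4 file, defines two dimensions and one variable and writes the data:

#include <netcdf.h>

int main (void)
{
	int ncid, lat_dim, lon_dim, temperature_var;
	int dims[2];
	double temperature[2][3] = { { 1.0, 2.0, 3.0 }, { 4.0, 5.0, 6.0 } };

	/* Create a NetCDF-4 (that is, HDF5-based) file. */
	nc_create("climate.nc", NC_CLOBBER | NC_NETCDF4, &ncid);

	/* Define the dimensions and a two-dimensional variable. */
	nc_def_dim(ncid, "lat", 2, &lat_dim);
	nc_def_dim(ncid, "lon", 3, &lon_dim);
	dims[0] = lat_dim;
	dims[1] = lon_dim;
	nc_def_var(ncid, "temperature", NC_DOUBLE, 2, dims, &temperature_var);

	/* Leave define mode and write the data. */
	nc_enddef(ncid);
	nc_put_var_double(ncid, temperature_var, &temperature[0][0]);

	nc_close(ncid);

	return 0;
}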

2.5.6. ADIOS

ADIOS22 provides a high-level I/O interface that abstracts from the usual byte- or element-oriented access as found in POSIX or MPI-IO [LKS+08, KLL+10]. It has been designed to provide high performance especially for scientific applications [PLB+09].

22 Adaptable IO System


ADIOS outsources the actual I/O configuration into an external XML23 file that can be used to describe which data structures should be accessed and to automatically generate C or Fortran code. Due to this, the application developer does not need to directly interact with the underlying I/O middleware or file system. ADIOS can handle elemental data types as well as multi-dimensional arrays.

1 <adios-config host-language="C">
2 	<adios-group name="checkpoint">
3 		<var name="rows" type="integer"/>
4 		<var name="columns" type="integer"/>
5 		<var name="matrix" type="double" dimensions="rows,columns"/>
6 	</adios-group>
7 	<method group="checkpoint" method="MPI"/>
8 	...
9 </adios-config>

Listing 2.4: ADIOS XML configuration

Listing 2.4 shows an example ADIOS XML configuration file that is used to define the data to be read or written. It specifies that C code should be generated by ADIOS’s source code generator (line 1).24 Additionally, it defines a so-called group with the name checkpoint; the group includes the variables rows, columns and matrix (lines 2–6). While rows and columns are integers, matrix is a two-dimensional array consisting of double-precision floating-point numbers. Finally, it specifies that the MPI-IO backend should be used to access this group (line 7).

1 adios_open(&adios_fd, "checkpoint", "checkpoint.bp", "w", MPI_COMM_WORLD);
2 #include "gwrite_checkpoint.ch"
3 adios_close(adios_fd);

Listing 2.5: ADIOS code

Using the XML configuration file, ADIOS can automatically generate C code to read and write the defined variables and store it in the gread_checkpoint.ch and gwrite_checkpoint.ch files, respectively.25 Listing 2.5 demonstrates how the generated code can be used to write data. First, an ADIOS file has to be opened for writing (line 1): The adios_open function takes parameters for a file descriptor (adios_fd), a group name (checkpoint), a file name (checkpoint.bp), an access mode (w for

23 Extensible Markup Language
24 The actual source code can be generated by invoking ADIOS’s gpp.py utility and passing it the XML file’s path as an argument.
25 If Fortran code is requested, ADIOS generates analogous .fh files.


writing) and an MPI communicator (MPI_COMM_WORLD). Writing the variables defined in the configuration file is performed by simply including the generated source code (line 2). Finally, the file has to be closed again (line 3). As can be seen, all logic required to perform the actual I/O operations necessary to store the checkpoint is contained within the automatically generated source code. Therefore, application developers do not have to worry about specifying the correct number of bytes to write or other specifics when using ADIOS.

2.6. Input/Output Semantics

In the following, the most common I/O semantics are presented and potential shortcomings are highlighted. The multitude of existing I/O semantics continues to create problems because different layers within the I/O stack might feature different semantics. Proper HPC-compatible I/O semantics on the upper layers are useless if the semantics on the lower layers ruin any potential performance benefits [HNH09].

2.6.1. POSIX

The POSIX standard features very strict consistency requirements. For example, write operations have to be visible to other clients immediately after the write call returns. While this might be relatively easy to support in local file systems, it can pose a serious bottleneck in parallel distributed file systems, because it effectively prohibits client-side caching from being used and might require additional locking.

“The adjustment of the file offset and the write operation are performed as an atomic step.”

Source: [The14b]

Even though POSIX requires some atomicity as shown in the quote above, it is not specified whether the actual writing of the data has to be atomic. Technically, POSIX only specifies that write operations to pipes and FIFO special files have to be atomic if the size of the write request is not larger than PIPE_BUF.26 Even though the standard intends I/O to be atomic, it does not require it to be so [IG13]. POSIX’s I/O semantics can only be changed in a very limited fashion. For instance, the strictatime, relatime and noatime options change the file system’s behavior regarding the last access timestamp. The traditional strictatime option causes the last access timestamp to be updated on every file access, relatime causes it to be only updated when it is older than the last modification timestamp and noatime disables

26 POSIX requires PIPE_BUF to be at least 512 bytes; on Linux, it is 4,096 bytes.


updates of the last access timestamp completely. Obviously, especially strictatime can have a serious impact on performance, because every read operation results in an additional write operation. While this introduces significant overhead even in local file systems, parallel distributed file systems require network transfers for each write operation, increasing the overhead even further. Additional async and sync options are also available that allow switching between asynchronous and synchronous I/O, respectively. These options can be specified on a per-mount basis to be fixed at mount time or using the O_NOATIME, O_ASYNC and O_SYNC flags of the open and fcntl functions. However, the latter may not be easily possible when using high-level I/O libraries that do not expose the underlying file descriptors. Consequently, these aspects can often not be modified by users under normal circumstances. The original POSIX interface did not offer ways to specify semantical information about the accesses or the data. A feature added in POSIX.1-2001 is called posix_fadvise and allows announcing the pattern that will be used to access the data.

1 int posix_fadvise (int fd, off_t offset, off_t length, int advice);

Listing 2.6: posix_fadvise

Listing 2.6 shows the posix_fadvise function that can be used to advise the file system about future accesses. It provides advice to the file descriptor fd for the file range given by offset and length. However, this does not actually change the semantics of any following I/O operations. It is typically only used to increase the readahead window (POSIX_FADV_SEQUENTIAL), disable readahead (POSIX_FADV_RANDOM), or to populate (POSIX_FADV_WILLNEED) and free (POSIX_FADV_DONTNEED) the file system cache.
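
A small usage sketch (the file name input.dat is an arbitrary example) that announces a sequential scan before reading a whole file could look as follows; passing a length of 0 applies the advice up to the end of the file:

#include <fcntl.h>
#include <unistd.h>

int main (void)
{
	char buffer[4096];
	int fd = open("input.dat", O_RDONLY);

	/* Announce a sequential scan of the whole file so the file system can
	 * increase its readahead window. */
	posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

	while (read(fd, buffer, sizeof(buffer)) > 0)
	{
		/* Process the data. */
	}

	close(fd);

	return 0;
}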

2.6.2. NFS

The NFS27 protocol provides close-to-open cache consistency by default, which implies that changes performed by a client are only written back to the server when the client closes the modified file. However, NFS offers limited support for changing this behavior: By mounting NFS using the cto or nocto options, close-to-open cache coherence semantics can be switched on or off, respectively. Additionally, the async and sync options can be used to modify the behavior of write operations: While async causes writes to only be propagated to the server when necessary28, sync will cause I/O operations to only return when the data has been

27 Network File System
28 Write operations are delayed until either memory pressure forces them to be sent or the file in question is (un)locked, synchronized or closed [Unk12].


flushed to the server. Additional mount options are available to modify the caching behavior of attributes and directory entries. As in the POSIX case, the async and sync behavior can be specified at mount time or using the O_ASYNC and O_SYNC flags of the open and fcntl functions. The cto and nocto options, however, can only be specified at mount time by the administrator.

2.6.3. MPI-IO

MPI-IO’s consistency requirements are less strict than those defined by POSIX [SLG03, CFF+95]. By default, MPI-IO guarantees that non-overlapping or non-concurrent write operations will be handled correctly; changes are immediately visible only to the writing process itself. Other processes first have to synchronize their view of the file to see the changes.

1 MPI_File_sync(fh);
2 MPI_Barrier(MPI_COMM_WORLD);
3 MPI_File_sync(fh);

Listing 2.7: MPI-IO’s sync-barrier-sync construct

Listing 2.7 shows the so-called sync-barrier-sync construct that is necessary to handle concurrent file modifications correctly. The first MPI_File_sync operation makes sure that the changes of all processes are transferred to storage (line 1). The MPI_Barrier provides an explicit synchronization point (line 2): Write operations performed before the barrier will be visible to read operations performed after the barrier. The second MPI_File_sync ensures that all file modifications flushed to storage during the first call are visible to all processes (line 3). For use cases requiring stricter consistency semantics, MPI-IO offers the so-called atomic mode that causes all operations to be performed atomically; it can be enabled and disabled on demand using the MPI_File_set_atomicity function. This special mode allows concurrent and conflicting writes to be handled correctly and also causes changes to be visible to all processes within the same communicator without explicit synchronization. From the implementer’s point of view, this can be difficult to achieve because MPI-IO allows non-contiguous operations and parallel distributed file systems can stripe single write operations over multiple servers [RLG+05, LRT07]. MPI-IO implementations are free to offer so-called hints that are mainly used to control things like buffer sizes and participating processes. Because hints are optional, however, different implementations are free to ignore them [TRL+10]. Additionally, MPI-IO offers several different access modes that can be specified when a file is opened using MPI_File_open. The MPI standard specifies the following access modes:


“The following access modes are supported (specified in amode, a bit vector OR of the following integer constants):
• MPI_MODE_RDONLY — read only,
• MPI_MODE_RDWR — reading and writing,
• MPI_MODE_WRONLY — write only,
• MPI_MODE_CREATE — create the file if it does not exist,
• MPI_MODE_EXCL — error if creating file that already exists,
• MPI_MODE_DELETE_ON_CLOSE — delete file on close,
• MPI_MODE_UNIQUE_OPEN — file will not be concurrently opened elsewhere,
• MPI_MODE_SEQUENTIAL — file will only be accessed sequentially,
• MPI_MODE_APPEND — set initial position of all file pointers to end of file.”

Source: [Mes01]

The access modes MPI_MODE_RDONLY, MPI_MODE_RDWR, MPI_MODE_WRONLY, MPI_MODE_CREATE and MPI_MODE_EXCL have the same meaning as their POSIX counterparts. MPI_MODE_DELETE_ON_CLOSE and MPI_MODE_APPEND provide convenience functionality: The former causes an implicit MPI_File_delete to remove the file when closing it, while the latter causes an implicit MPI_File_seek to set the initial position of the file pointer to the end of the file. The only two access modes which can be considered semantical information are MPI_MODE_UNIQUE_OPEN and MPI_MODE_SEQUENTIAL; these modes provide information about how the file is going to be accessed and allow this information to be exploited for more intelligent access. MPI_MODE_UNIQUE_OPEN specifies that the given file will only be accessed by the current set of processes, which can be used to eliminate locking overhead. MPI_MODE_SEQUENTIAL allows optimizations based on the assumption that the given file will only be accessed sequentially. Even though MPI_MODE_SEQUENTIAL might look similar to POSIX’s POSIX_FADV_SEQUENTIAL mode, there are actually several differences: While POSIX_FADV_SEQUENTIAL simply increases the readahead window, MPI_MODE_SEQUENTIAL actually influences future operations; for instance, it is not allowed to call MPI_File_seek on files opened with MPI_MODE_SEQUENTIAL because seeking can be used to perform random accesses. Additionally, it is not permitted to combine MPI_MODE_SEQUENTIAL with MPI_MODE_RDWR according to the standard.
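
To make these semantical aspects more concrete, the following sketch (the file name checkpoint.bin is an arbitrary example) opens a shared checkpoint file with access modes that carry semantical information and additionally enables the atomic mode described above:

#include <mpi.h>

/* Collectively opens a checkpoint file for writing. MPI_MODE_UNIQUE_OPEN
 * tells the implementation that no other application will open the file
 * concurrently, which may allow it to avoid locking overhead. */
static MPI_File open_checkpoint (MPI_Comm comm)
{
	MPI_File fh;

	MPI_File_open(comm, "checkpoint.bin",
		MPI_MODE_CREATE | MPI_MODE_WRONLY | MPI_MODE_UNIQUE_OPEN,
		MPI_INFO_NULL, &fh);

	/* Enable the atomic mode so that conflicting writes are handled
	 * correctly without the sync-barrier-sync construct from Listing 2.7. */
	MPI_File_set_atomicity(fh, 1);

	return fh;
}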

Discussion

As can be seen, there are numerous I/O interfaces available. This diversity can be confusing for application developers and users, making it unclear which I/O interface

should be used for a given task. Additionally, different I/O libraries typically address different use cases: For instance, while it would be beneficial to use I/O interfaces such as NetCDF that offer access to self-describing data, SIONlib allows optimizing performance when accessing shared files. It is, however, not easily possible to combine the benefits of both approaches because SIONlib is orthogonal to NetCDF and its dependencies. To make matters worse, each I/O interface typically comes with its own set of semantics. This further complicates the use of the available I/O interfaces because each one might behave differently, even for the same use case.

2.7. Namespaces

The file system’s namespace defines how data can be found and organized. File system namespaces are usually organized hierarchically, starting with a so-called root directory that includes further files and directories. However, other organizational approaches are also possible. One popular approach is to add so-called tags to files and provide powerful search capabilities such as full-text indexing [SM09, BVGS06]. This frees users from having to remember where files are stored and instead allows them to access files by content and association.

2.7.1. POSIX

POSIX-compliant file systems provide a standardized way to find and access files and directories within them. The namespace is organized in a hierarchical way, with directories serving as containers for files and other directories. The fully specified name of a file or directory is called a path, consisting of one or more path components that are separated using the delimiter /. For example, given a file bar located inside a directory foo, the file’s path would be foo/bar. This represents a relative path, because the foo directory could be located inside any other directory. An absolute path starts in the file system’s root directory, which can be accessed using the path /. Consequently, if the foo directory was located inside the root directory, the file’s full path would be /foo/bar. As can be seen, paths can become very long because directories can be arbitrarily nested. This, in turn, can impact performance when a large number of files are accessed. To access a file, a path lookup has to be performed, which involves each of the path components. Consequently, this is a relatively expensive operation because several checks and lookup operations have to be performed for each of the path components. The following list gives an overview of the involved operations. The currently active path component is marked in bold and underlined.


1. /foo/bar
   a) The root directory’s inode is read.29
   b) Permission checks are performed.
   c) The root directory is read and searched for foo.

2. /foo/bar
   a) The directory’s inode is read.
   b) Permission checks are performed.
   c) The directory is read and searched for bar.

3. /foo/bar
   a) The file’s inode is read.
   b) Permission checks are performed.
   c) The file is accessed.
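
This component-wise resolution can also be expressed manually at the system call level using the openat function; the following sketch (error handling omitted) opens each directory relative to the previously opened one:

#include <fcntl.h>
#include <unistd.h>

int main (void)
{
	/* Open the root directory, then resolve one path component at a time. */
	int root = open("/", O_RDONLY | O_DIRECTORY);
	int foo = openat(root, "foo", O_RDONLY | O_DIRECTORY);
	int bar = openat(foo, "bar", O_RDONLY);

	/* The file /foo/bar can now be accessed via the descriptor bar. */
	close(root);
	close(foo);
	close(bar);

	return 0;
}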

2.7.2. Cloud

Cloud storage services usually offer only flat namespaces. For example, both Amazon S330 as well as Google Cloud Storage provide a global namespace in which users can create so-called buckets. This namespace is shared between all users, that is, two users cannot create buckets with the same name. Within these buckets, objects can be created. Each object is assigned a unique key that can be used to access it. All accesses are performed using standard HTTP31 requests. See Listing 2.8 for a list of exemplary uniform resource locators (URLs) used by the Amazon and Google storage services; these can be accessed using HTTP methods such as GET, POST, PUT, HEAD and DELETE.

1 http://s3.amazonaws.com/<bucket>/<object>
2 http://<bucket>.s3.amazonaws.com/<object>
3
4 http://storage.googleapis.com/<bucket>/<object>
5 http://<bucket>.storage.googleapis.com/<object>

Listing 2.8: Amazon S3 and Google Cloud Storage URLs

29 As there is no parent directory to search, the root directory’s inode must be known in advance. For example, in ext4’s case the root directory’s inode always has the ID 2.
30 Amazon Simple Storage Service
31 Hypertext Transfer Protocol


The namespaces offered by cloud storage services provide the opportunity to eliminate the path traversal overhead usually found in file system namespaces. However, the actual interfaces are not suitable for use in file systems due to their heavy dependence on HTTP. On the one hand, the overhead of HTTP is non-negligible for small accesses because requests consist solely of strings that have to be parsed. On the other hand, the interfaces themselves do not provide the flexibility required for file systems. For instance, it is often impossible to access only specific byte ranges of objects or even modify them once they have been uploaded completely.

Summary

This chapter has given an in-depth description of the current HPC I/O stack and its components. While kernel file systems are generally forced to offer POSIX interfaces due to their use of the VFS layer, object stores only provide basic storage management functionality and can mitigate metadata overhead. Parallel distributed file systems such as Lustre and OrangeFS typically have support for multiple data and metadata servers to distribute the load; this architecture also allows them to handle the different access patterns more efficiently. Current I/O interfaces only have very limited support for providing semantical information and their semantics have often been designed for serial use cases, making them unsuited for HPC workloads. Whereas traditional file system namespaces require expensive path lookup operations, cloud storage services usually provide flat namespaces that can reduce the associated costs.

Chapter 3. Interface and File System Design

Based on the information gathered in the previous chapter, this chapter will be dedicated to elaborating the design of the proposed I/O interface featuring adaptable semantics. All important aspects of the file system’s design will be illustrated, including the general architecture, the namespace, the data and metadata design, and – most importantly – its interface and semantics. A special focus will lie on the design choices made to avoid the bottlenecks and problems present in other contemporary file systems and interfaces.

As shown in the previous chapters, the interfaces and semantics currently used for parallel distributed file systems are suboptimal because they are either not well-adapted to the requirements and demands found in high performance computing (HPC) today or do not allow fine-grained semantical information to be specified. To further explore the optimization potential of adaptable semantics, a new I/O interface as well as a file system prototype will be designed from scratch, suited specifically for the demands found in HPC. The resulting framework is called JULEA. While the overall design decisions and important key aspects will be explained in this chapter, the technical architecture will be described in more detail in Chapter 5.

3.1. Architecture

JULEA’s general architecture will closely follow that of established parallel distributed file systems such as Lustre and OrangeFS. Machines can have one or several of three different roles: client, data server and metadata server. While it is possible to have a machine perform all three roles simultaneously, it is recommended to separate the clients from the servers to provide stable performance.1 JULEA will support multiple data and metadata servers and allow data and metadata to be distributed among them; it will be possible to influence the actual distribution of data using distributions. A brief general view of JULEA’s different components and their interactions with each other is shown in Figure 3.1 on the following page. Applications will be

1 Depending on the actual access patterns, it might also be sensible to host the data and metadata servers on different machines.


Figure 3.1.: JULEA’s file system components (application processes use the JULEA client library to communicate with the data server and metadata server processes)

able to use JULEA’s input/output (I/O) interface that talks directly to the data and metadata servers; it will abstract all the internal details and provide a convenient interface for developers. The metadata and data servers will run on dedicated machines with attached storage hardware. The remaining part of this chapter is devoted to a more detailed discussion of several architectural design decisions.

3.1.1. Layers

Figures 3.2a and 3.2b on the next page show a comparison of the current HPC I/O stack and the proposed JULEA I/O stack. In addition to the logical layers, the separation between kernel and user space is shown. All kernel space layers are either implemented directly inside the kernel or as kernel modules; the user space layers are either normal applications or libraries. As can be seen, JULEA’s architecture will feature fewer layers, which will make it easier to analyze the actual I/O behavior of applications. It will also allow concentrating all optimizations into a single layer, reducing the implementation and runtime overhead. Specifically, the current I/O stack is built in such a way that multiple different I/O interfaces build upon each other. This results in several transformations of the data as it is being transported through the different layers. The parallel application’s data types are stored in NetCDF2 that in turn stores its data in HDF3’s datasets and groups. This data is then transformed into a byte stream for MPI-IO. It then stores the data in the actual parallel distributed file system that splits up the data and stripes it across its servers, potentially storing it in yet another underlying local file system. For a more in-depth description, refer to Section 2.1 on pages 23–26. All of these layers have additional advanced concepts for optimizing the parallel I/O. For example, NetCDF, HDF and MPI-IO all have the concept of individual and collective I/O. However, all of them perform I/O in a slightly different way with

2 Network Common Data Form
3 Hierarchical Data Format


Figure 3.2.: Current HPC I/O stack and proposed JULEA I/O stack – (a) HPC I/O stack: parallel application, NetCDF, HDF5, MPI-IO and ADIO in user space; Lustre, ldiskfs and block storage in kernel space. (b) JULEA I/O stack: parallel application, ADIOS, JULEA and object store in user space; block storage in kernel space.

different semantics. Several MPI-IO implementations contain optimizations targeted specifically at collective I/O, such as Two-Phase I/O [TGL99, DT98] or Layout-Aware Collective I/O [CST+11]. In addition to generic optimizations for collective I/O, additional file-system-specific optimizations are also possible; for instance, ROMIO’s ADIO4 layer contains a Lustre-specific module that can exploit Lustre’s capabilities to offer improved performance [YVCJ07]. Nevertheless, NetCDF and HDF perform their own optimizations on top of this. Sometimes these optimizations can be even contradictory, resulting in performance degradations instead of improvements. An important design goal of JULEA is to remove the duplication of functionality found in the traditional HPC I/O stack. Because many distributed file systems use an underlying local POSIX5 file system to store the actual data and metadata, a lot of common file system functionality is duplicated. For example, path lookup and permission checking are already performed by the parallel distributed file system and should not be executed again by the underlying local file system. This can be achieved by completely eliminating the underlying POSIX file systems and using suitable object stores. As presented in Section 2.3 on page 29, object stores usually assign each object a unique identifier (ID), removing the need for path lookups on the lower layers. Because it is often unreasonable to port applications to new and experimental I/O interfaces due to their size and complexity, it makes sense to leverage a layer providing compatibility for existing applications. ADIOS6 is an established I/O interface and specifically allows implementing different backends. To minimize the overhead, ADIOS could be used as a relatively thin layer on top of JULEA to provide convenient access for application developers.

4 Abstract-Device Interface for I/O
5 Portable Operating System Interface
6 Adaptable IO System


3.1.2. Protocol

One of the first and most important decisions is the communication scheme between the file system’s clients and servers. In parallel distributed file systems, two basic approaches are possible for client-server communication:

1. The clients do not know which server can answer their current request and thus contact a random server. If the contacted server is not responsible, two reactions are possible:
   a) The server silently forwards the request to the appropriate server and returns the answer back to the client; this process is completely transparent for the client.
   b) The server tells the client which server is responsible; the client communicates with the correct server from this point on.

2. The clients know which server can answer their current request and directly contact the appropriate one.

These approaches necessitate completely different communication schemes and each has its own advantages as well as disadvantages:

1. • Advantages: Clients do not need to have any prior knowledge about the distribution of data and metadata because they can simply contact any server. It is relatively easy to implement load balancing because another server can simply take over an overloaded server’s responsibilities by redirecting the client.
   • Disadvantages: Almost all initial requests suffer from additional network latency because clients will only rarely contact the correct server right away; in case the servers transparently forward messages, this also applies to almost all subsequent requests.

2. • Advantages: The servers do not need to communicate with each other and, in fact, do not even need to know about each other. The communication protocol can be kept simple because there is no inter-server communication that has to be considered.
   • Disadvantages: All communication logic has to be implemented by the clients. Additionally, clients need prior knowledge about the distribution of data and metadata: For data, this usually involves contacting the appropriate metadata server first; for metadata, this implies that clients have to be able to decide autonomously which metadata server to contact.

JULEA will use the second approach: Clients will be able to autonomously decide which servers to contact whenever possible and then talk directly to the appropriate

data and metadata servers. As the servers will not have to communicate with each other, their design can be kept simple: The data servers will act as basic object stores for the clients’ I/O requests. This is similar to Lustre’s design – as shown in Section 2.4.1 on pages 32–35 – and has several advantages:

1. The servers’ behavior is easier to comprehend because only direct interactions between the clients and servers have to be considered; the program flow only includes requests from the clients and the corresponding replies issued by the servers. Additionally, only replies from the contacted server have to be considered because no message forwarding takes place.

2. Problems in the servers are easier to debug because only one kind of communication has to be considered; this makes it much easier to understand the flow of data and narrows the number of possible causes for errors.

3. The performance behavior is easier to comprehend because the servers simply act on the clients’ behalf and do not perform more intelligent actions behind their back.
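
As a purely illustrative sketch – this is a generic hashing scheme and not JULEA’s actual distribution logic – the following function shows how a client could autonomously map an item name to one of the metadata servers:

#include <stdint.h>
#include <string.h>

/* Maps a name to a metadata server index using the FNV-1a hash.
 * All clients computing the same hash will contact the same server,
 * so no inter-server communication or extra lookup round trip is needed. */
static uint32_t metadata_server_for (const char* name, uint32_t server_count)
{
	uint32_t hash = 2166136261u;

	for (size_t i = 0; i < strlen(name); i++)
	{
		hash ^= (unsigned char)name[i];
		hash *= 16777619u;
	}

	return hash % server_count;
}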

3.1.3. Performance Analysis Functionality

Performance analysis of parallel distributed file systems is a complex topic and much research has been done in this regard. It is necessary to have insight into the internals of a file system to be able to understand its performance characteristics [Kun06, Tie09]. In addition to the complicated behavior regarding data performance, metadata performance continues to play an important role; increasing numbers of clients want to access increasing numbers of file system objects, quickly exposing bottlenecks in the metadata design [Bia08].

Another important point is the connection between client operations and the resulting behavior on the servers: Without the possibility to correlate the clients’ activities and the resultant events on the servers, finding and solving performance problems becomes much harder [Kre06].

Consequently, JULEA will have built-in support for tracing client and server activities; it should also be possible to easily correlate them for the reasons mentioned above. This will facilitate easier performance analysis because tracing support does not have to be added retrospectively. Visualization of the resulting traces is also important because the sheer amount of trace data is impossible to analyze manually [MSM+11]. Therefore, it should also be possible to leverage existing measurement tools such as Jumpshot [LKK+07] or Vampir [GWT14] to visualize JULEA’s traces.


3.2. File System Namespace

Traditional file systems allow deeply nested directory structures. To avoid the overhead caused by this, only a restricted and relatively flat hierarchical namespace will be supported. While this approach might be unsuited for a general purpose file system, JULEA is explicitly focused on specific use cases that are commonly found in HPC. Therefore, JULEA is meant to be used in conjunction with traditional file systems like NFS7 to provide other parts of the infrastructure such as the users’ home directories. The file system namespace will be divided into stores, collections, and items. Each store can contain multiple collections that can, in turn, contain multiple items. This structure will be closer to that of popular cloud storage solutions than that of POSIX file systems. The goal of these changes is to minimize the overhead during normal file system operation. In traditional POSIX file systems, each component of the potentially deeply nested path has to be checked for each access. This requires reading its associated metadata, checking permissions and so forth. As this process usually happens sequentially, it can seriously hamper performance. Additionally, in distributed file systems these operations can be very costly because metadata operations are usually small in size; consequently, many small network messages are generated. If absolutely necessary, it would be possible to extend the namespace by allowing collections to include other collections, thus creating a nested namespace. However, for all intents and purposes of the initial prototype, the flat namespace will be enough. This is not expected to have any negative influences on usability because this kind of namespace is already commonly used in cloud-based storage solutions and document database systems.

Figure 3.3.: JULEA namespace example (stores: Project X, Project Y, Project Z; collections inside Project X: GETM Input, Experiment X, Experiment Y, Experiment Z; items inside Experiment Y: Timestep 0, Timestep 1, Timestep 2)

Figure 3.3 shows an exemplary JULEA namespace using an application from the field of earth system science. The first level of the namespace hierarchy are the stores that are used to group similar data. In this example, there are stores for different research projects with the Project X store being expanded to show its collections. This project is concerned with GETM8, an open source ocean model, and includes input

7 Network File System
8 General Estuarine Transport Model


data for said model in the GETM Input collection. During the imaginary research project, several experiments have been conducted and the output of each experiment has been stored in a separate collection. In this example, the Experiment Y collection is expanded to show its items. Models usually perform their calculations in so-called timesteps that define the model’s temporal resolution. For example, if a timestep comprises 30 minutes, it is possible to output the state of the model in intervals of 30 minutes for later analysis; this state is stored in the Timestep i items. Obviously, this example presents only one possible use of JULEA’s namespace. As with any other file system namespace, administrators, developers and users should think about a reasonable structure in advance. To provide a standardized way of addressing JULEA’s file system objects, it makes sense to define paths in JULEA’s file system namespace. Using the information above, paths are defined as follows:

• Each path consists of either one, two or three path components.

• The first path component refers to the store, the second path component refers to the collection and the third path component refers to the item.

• The path components are separated using the / delimiter.

Because JULEA will not have a concept of a current working directory, all JULEA paths are defined to be absolute.9 Using the exemplary namespace organization from Figure 3.3 on the facing page again, the paths to refer to the store, collection and item would look like the following:

• Project X

• Project X/Experiment Y

• Project X/Experiment Y/Timestep 1

3.3. Interface

JULEA’s interface will be designed from scratch to offer simplicity of use while still meeting the requirements of high performance and dynamically adaptable semantics. The functionality offered by the interface can be subdivided into five groups:

1. Batches: Multiple operations can be batched explicitly to improve performance.

9 In traditional POSIX file systems, each process possesses a current working directory that is used when resolving relative paths. For example, assuming a current working directory of /home/foo, the relative path bar would be resolved to /home/foo/bar. The current working directory can be retrieved using the getcwd function or the pwd command line utility. For more information about absolute and relative paths, see Section 2.7.1 on pages 46–47.


2. Distributions: It will be possible to influence the distribution of data directly.

3. Namespace: The file system namespace will be accessible using a convenient abstraction called uniform resource identifiers (URIs).

4. Semantics: JULEA’s semantics will be dynamically adaptable according to the applications’ I/O requirements.

5. Stores, collections and items: It will be possible to create, remove, open and iterate over all of JULEA’s file system objects.

All of the above functionality will be available publicly and directly to developers. While the underlying design principles and ideas for parts of the I/O interface will be illustrated in this chapter, JULEA’s actual application programming interface (API) for use by applications will be presented in detail in Chapter 5. The two most important features will be the ability to specify semantical information and to batch operations. Both approaches will give the file system additional information that can be used to optimize accesses. It will be possible for developers and users to specify additional information equivalent to the coarse-grained statement “this is a checkpoint” or the more fine-grained “this operation requires strict consistency semantics”. This will allow the file system to tune operations for specific applications by itself. Additionally, developers will be able to emulate well-established semantics as well as mix different semantics within one application. Developers will perform all accesses to the file system via so-called batches. Each batch can consist of multiple operations. For example, multiple items can be created or different offsets within an item can be accessed in one batch. It will also be possible to combine different kinds of operations within one batch. For instance, one batch might create a collection and several items within it, and write data to each of the items. Because the file system will have knowledge about all operations within one batch, more elaborate optimizations can be performed. This will also allow reordering the operations to improve network utilization whenever possible. For example, multiple metadata operations can be sent to the metadata servers with a single network message. Since batches will be executed explicitly, they provide a defined point at which all operations will be performed, in contrast to traditional approaches. Traditional POSIX file systems can also try to aggregate multiple operations to improve network utilization. However, this can only be done by caching these operations in the client’s main memory for a given amount of time and then performing these optimizations. Because the POSIX interface does not provide enough information to make reliable decisions for these kinds of optimizations, it is necessary to employ heuristics. However, these heuristics are usually not correct all the time, resulting


in suboptimal behavior for borderline cases. Additionally, it is not possible to do this in all cases because it would violate the POSIX semantics. Therefore, users can never be sure when exactly operations are performed in such a system without calling synchronization functions explicitly, which can be very expensive.10

1 batch = new Batch(POSIX_SEMANTICS);
2
3 store = julea.create("test store", batch);
4 collection = store.create("test collection", batch);
5 item = collection.create("test item", batch);
6 item.write(..., batch);
7
8 batch.execute();

Listing 3.1: Executing multiple operations in one batch

The pseudo code found in Listing 3.1 shows an example of how the interface generally works. First, a new batch using the POSIX semantics is created (line 1). Afterwards, the store, collection and item are created (lines 3–5); the store is created in the root of the file system, the collection is created in the new store and the item is created in the new collection. Additionally, some data is written to the item (line 6). All of these operations are not executed right away but merely added to the batch that is passed to each method as the last argument. Finally, the batch is executed, which in turn executes all four operations with the previously specified semantics (line 8).

1 in_batch = new Batch(DEFAULT_SEMANTICS);
2 out_batch = new Batch(POSIX_SEMANTICS);
3
4 input = collection.get("input item");
5
6 input.read(..., in_batch);
7 in_batch.execute();
8
9 /* Calculation */
10
11 checkpoint = collection.create("checkpoint item", out_batch);
12 checkpoint.write(..., out_batch);
13 out_batch.execute();

Listing 3.2: Using multiple batches with different semantics

10 POSIX’s synchronization functions fsync and fdatasync only allow synchronizing whole files even if this is not necessary.


An example of changing the semantics on a per-batch basis is given in Listing 3.2 on the preceding page. Two batches are created using different semantics (lines 1–2). The existing input item is opened (line 4) and then read (lines 6–7). After some calculations, a new checkpoint item is created (line 11) that is then written to (lines 12–13). Supporting different semantics on a per-batch basis will allow using the optimal semantics for any given task. In the example given above, the semantics could additionally be tuned to instruct the file system that the input item will be accessed in a read-only fashion. Additionally, accesses to the checkpoint item could be optimized for non-overlapping write accesses from multiple clients; a sketch of such tuning is shown below.

JULEA will require all operations to be performed in batches, even if the batch only contains a single operation. This is a conscious design decision to make sure that the file system will always have as much information as possible to make informed optimization decisions. Even though this might appear as an inconvenience from the application developers' point of view, it will be easy for them to specify this information, and doing so will only introduce negligible overhead. However, employing heuristics and guessing appropriate optimizations after the fact is much harder and can result in suboptimal behavior in many cases. For instance, traditional I/O interfaces are unable to know whether a user is going to perform multiple operations in quick succession because each operation is executed individually. Additionally, each batch will require the semantics to be set explicitly. Combined with the fact that all operations have to be performed in batches, this is supposed to force application developers to think about the possible performance implications of the chosen semantics.
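As a rough sketch of such tuning, the two batches from Listing 3.2 could be given adapted semantics. The pseudo code below uses the template-adaptation interface that will be shown in Listing 3.5 and assumes that a batch can also be constructed from such a semantics object; the concrete constant names are illustrative placeholders rather than JULEA's final API.

in_semantics = new Semantics(DEFAULT_SEMANTICS);
in_semantics.set(CONCURRENCY, CONCURRENCY_NONE);    /* input item is only read by this client */
in_semantics.set(CONSISTENCY, CONSISTENCY_NONE);    /* no other client will modify it */

out_semantics = new Semantics(DEFAULT_SEMANTICS);
out_semantics.set(CONCURRENCY, CONCURRENCY_NON_OVERLAPPING);  /* disjoint checkpoint regions */

in_batch = new Batch(in_semantics);
out_batch = new Batch(out_semantics);

The read-only nature of the input item is approximated here via the concurrency and consistency aspects introduced in Section 3.4.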

3.3.1. Asynchronous Batches

To allow application developers to easily overlap calculations and I/O, it will be possible to execute batches asynchronously. This support will be offered natively by the I/O interface without forcing developers to resort to using background threads or similar techniques.

1 batch = new Batch(DEFAULT_SEMANTICS);
2 checkpoint = collection.create("checkpoint 42", batch);
3
4 checkpoint.write(..., buffer, ..., batch);
5 batch.execute_async();
6
7 /* Calculation */
8
9 batch.wait();

Listing 3.3: Executing batches asynchronously


Listing 3.3 on the preceding page shows how the execution of asynchronous batches works. In this example, the writing of a checkpoint should be overlapped with some calculations to achieve optimal performance. First, a batch and an item for writing the checkpoint are created (lines 1–2). Afterwards, the write operation is added to the batch (line 4) and the batch is executed asynchronously (line 5). It is important to note that the data stored in buffer is not allowed to be changed until the batch execution has been completed. This is similar to MPI11's non-blocking (or immediate) operations. The execute_async method returns immediately and allows the application to continue; calculations are then performed while the batch is executed in the background (line 7).12 Last, the asynchronous batch is finalized by waiting for its completion (line 9).

To lower the barrier of entry and encourage application developers to use both concepts whenever appropriate, there are only two differences between the synchronous and asynchronous execution of batches; all other aspects remain exactly the same:

1. How the execution is initiated, that is, whether the execute or execute_async method is used. This also determines whether it is necessary to call the wait method or not.

2. Whether it is possible to reuse the data buffer immediately. Modifying the buffer during the execution of an asynchronous batch leads to undefined behavior.

3.3.2. Information Export

The file system should also export all the information that is necessary to reach optimal performance; this information can then be used by other layers of the I/O stack. One important aspect is the information about alignment of data to the file system's stripe size. When dealing with larger numbers of clients, aligning the accesses to the file system's stripe boundaries becomes especially important [Bar14].

1 batch = new Batch(DEFAULT_SEMANTICS);
2 checkpoint = collection.create("checkpoint 42", batch);
3
4 checkpoint.write(header, header_size, 0, batch);
5
6 data_size = checkpoint.get_optimal_access_size(header_size);
7 checkpoint.write(data, data_size, header_size, batch);
8
9 batch.execute();

Listing 3.4: Determining the optimal access size

11 Message Passing Interface
12 In contrast to MPI, JULEA guarantees that the batch is executed asynchronously; the MPI standard does not mandate that implementations actually have to perform operations asynchronously but only that the operations are non-blocking and return immediately.

Listing 3.4 on the preceding page shows how to extract the optimal access size from the file system. Analogous to the previous examples, a checkpoint is created (lines 1–2). However, the checkpoint contains a header this time; consequently, the actual data starts at a specific offset. First, the header of size header_size is written to the item at offset 0 (line 4). To be able to write the remaining data in a stripe-aligned fashion, get_optimal_access_size is used (line 6); it takes an offset within the item as its only argument and returns the number of bytes remaining for the responsible stripe. This information is then used to fill the current stripe with data of length data_size starting at offset header_size (line 7). Finally, the batch is executed, which causes the write of a full stripe (line 9).

Because the data distribution could vary based on the current item or even server, get_optimal_access_size provides a convenient way for application developers to acquire this type of file system information without resorting to uncertain assumptions. The availability of this information is especially important for higher layers within the I/O stack or applications that want to manually make use of this information to achieve optimal performance.
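For a simple round-robin distribution with a fixed stripe size, the returned value boils down to the following arithmetic; this is only a minimal sketch of the underlying calculation, not JULEA's actual implementation:

#include <stdint.h>

/* Bytes remaining in the stripe that contains the given offset,
 * assuming a fixed stripe size (for example, 4 MiB). */
static uint64_t optimal_access_size (uint64_t offset, uint64_t stripe_size)
{
	return stripe_size - (offset % stripe_size);
}

With a stripe size of 4 MiB and a header of 100 bytes, the call in line 6 of Listing 3.4 would thus return 4,194,204 bytes, causing the write in line 7 to end exactly at the first stripe boundary.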

3.4. Semantics

The JULEA interface will allow many aspects of the file system operations' semantics to be changed at runtime. Several key areas of the semantics have been identified as important to provide opportunities for optimizations: atomicity, concurrency, consistency, ordering, persistency and safety. Even though it will be possible to mix the settings for each of these semantics, not all combinations will produce reasonable results. In the following, detailed explanations and design choices for these key aspects are provided. Additionally, further possible extensions for redundancy, security and transformation semantics are introduced.

The semantics can be categorized into convenience-related and performance-related ones. On the one hand, the performance-related aspects are clearly focused on achieving the maximum possible performance and require in-depth knowledge about the application's I/O behavior. On the other hand, the convenience-related ones are supposed to ease application development by providing comfort features directly within the file system.


3.4.1. Atomicity

The atomicity semantics can be used to specify whether accesses should be executed atomically, that is, whether or not it is possible for clients to see intermediate states of operations. Intermediate states are possible because large operations usually involve several servers. If atomicity is required, some kind of locking has to be performed to prevent other clients from accessing data that is currently being modified. To cater to as many I/O requirements as possible, several levels of atomicity will be provided:

• None: Accesses are not executed atomically. For example, a single write operation that is striped over multiple data servers can be executed as several independent accesses. If not all data servers have already finished the write operation, concurrent read operations accessing the same data are able to return partly written data. No locking is required at all.

• Operation: Single operations are executed atomically. For example, a single write operation that is striped over multiple data servers is guaranteed to be executed atomically. Read operations accessing the same data concurrently are not able to return partly written data, even if not all data servers have finished the write operation. Instead, these operations are blocked until the write operation is finished completely. Locking is only required for pre-determined ranges within objects.

• Batch: Complete batches are executed atomically. Other batches accessing the same data are blocked until the batch finishes. Locking is required for potentially multiple complete objects.

The atomicity semantics is clearly performance-related. It can be used to avoid unnecessary locking overhead by completely avoiding locking whenever possible. Atomic accesses operating on the same data have to be serialized, which implies a performance penalty. If atomicity is not required, all operations can be executed in parallel. Being able to specify the atomicity requirements has obvious advantages in contrast to static approaches such as those dictated by POSIX because lockless access to shared files can improve performance dramatically. For instance, many POSIX-compliant file systems perform atomic write operations even if all clients accessing a shared file never read or write to overlapping regions of the file. Since application developers know the access patterns of their applications, they can easily specify whether atomicity is required or not.

It is important to note that atomicity only applies to the visibility of modifications in this context. That is, operations could still be only partially performed in case of errors. Such guarantees are typically provided by atomicity, consistency, isolation and durability (ACID) transactions as found in database systems and are not part of JULEA's initial design; for a discussion regarding full-featured transactions, see Section 7.1.2 on pages 163–164.

3.4.2. Concurrency

The concurrency semantics can be used to specify whether concurrent accesses will take place and, if so, what the access pattern will look like. This allows the file system to appropriately handle different patterns without the need for heuristics recognizing them. Depending on the level of concurrency, different algorithms might be appropriate for file system operations such as locking or metadata access; additionally, the level of concurrency has an impact on whether locking is necessary at all. To support as many I/O patterns as possible, several configurations will be available:

• None: No concurrent accesses will take place at all. The concerned objects will only be modified by one client at a time and the results of concurrent accesses are unspecified. Efficient centralized algorithms can be used.

• Non-overlapping: Concurrent accesses might take place. However, no two remote clients will modify the same area of an object. The results of modifying the same area concurrently are unspecified. Distributed algorithms have to be used but certain optimizations might be possible because the operations do not access the same data.

• Overlapping: Concurrent accesses might take place and might modify the same area of an object. Distributed algorithms have to be used and no assumptions about access patterns can be made.

Concurrency semantics are performance-related by allowing simpler and faster centralized algorithms to be used when no concurrent access is happening. Additionally, the information about the actual access patterns can be used to make more intelligent decisions. For instance, atomicity is only required for overlapping accesses. In case of strictly serial accesses, even more optimizations are possible because no other clients will be able to observe potential inconsistencies. The use of centralized and distributed algorithms applies to different aspects of the parallel distributed file system. For example, it is advisable to use different metadata management approaches depending on the level of concurrency; this aspect will be elaborated in Section 3.5.2 on pages 71–73.


3.4.3. Consistency

The consistency semantics can be used to specify if and when clients will see modifications performed by other clients and applies to both metadata and data. This information can be used to enable client-side read caching whenever possible. To support different consistency requirements, several levels will be supported:

• None: Clients might never have a consistent view of the file system, that is, modifications performed by other clients might not be visible locally at all. This is similar to NFS’s session semantics. Allows data and metadata to be cached indefinitely.

• Eventual: Clients will eventually have a consistent view of the file system, that is, modifications performed by other clients might not be immediately visible locally. For example, reading an object’s modification time or size can return a cached value. The period during which the view is inconsistent is unspecified. Allows data and metadata to be cached for an unspecified amount of time.

• Immediate: Clients will always have a consistent view of the file system, that is, modifications performed by other clients are immediately visible locally. Data and metadata can not be cached; all data and metadata is retrieved directly from the appropriate servers.

The consistency semantics is performance-related and can allow caching data and metadata locally. It can be used to reduce the network traffic and thus increase performance. This is especially important for metadata because sending and receiving large amounts of small network messages can cause significant overhead.
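In pseudo code, the effect on a client can be sketched as follows; the cache-related helper functions are hypothetical and only illustrate the decision that the consistency level enables:

/* Hypothetical client-side lookup of an item's status metadata */
if (consistency == CONSISTENCY_IMMEDIATE || !status_cache_is_valid(item))
{
	status = fetch_status_from_servers(item);  /* network round trip */
	status_cache_update(item, status);
}
else
{
	status = status_cache_get(item);  /* served locally under none or eventual consistency */
}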

3.4.4. Ordering

The ordering semantics can be used to specify whether operations within a batch are allowed to be reordered. Because batches can potentially contain a large number of operations, the additional information can be exploited to optimize their execution.

• Relaxed: Operations are allowed to be reordered as long as correct execution can be guaranteed, that is, the batch’s result corresponds to that of the original batch. For instance, a write operation can never be reordered to be performed before the corresponding create operation. The order of two write operations can be changed to allow merging them, however. Inefficient operation orderings can be optimized to the best extent possible; results must be identical to the original ordering.


• Semi-relaxed: Operations are allowed to be reordered as long as operations pertaining to the same object are executed in the original order. For example, write operations to several items can be reordered such that each item’s write operations are executed together. Inefficient operation orderings can be optimized to some extent; results must be identical to the original ordering.

• Strict: Operations are not allowed to be reordered. All operations within a batch are executed in exactly the same order as they are added to the batch. Inefficient operation orderings can not be optimized. The overhead of reordering can be avoided, however; this is especially useful if developers already perform operations in the optimal order.

The ordering semantics is performance-related as it allows operations to be reordered for more efficient access. It is especially important to group operations of the same type to reduce the amount of network overhead. Additionally, it is usually beneficial to order read and write operations by their offset because this might allow them to be merged. While these optimizations are mainly aimed at delivering improved I/O performance, they can also help to reduce the load on other involved components such as the central processing unit (CPU) and network interface card (NIC).
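Under the relaxed ordering, the file system could, for instance, sort a batch's write operations by item and offset before issuing them so that adjacent accesses can be merged. The following comparator is a minimal sketch using an assumed operation structure, not JULEA's actual data layout:

#include <stdlib.h>

struct operation { unsigned int item_id; unsigned long offset; };

/* Order operations by item first and by offset within each item. */
static int compare_operations (const void* a, const void* b)
{
	const struct operation* x = a;
	const struct operation* y = b;

	if (x->item_id != y->item_id)
		return (x->item_id < y->item_id) ? -1 : 1;
	if (x->offset != y->offset)
		return (x->offset < y->offset) ? -1 : 1;

	return 0;
}

/* Usage: qsort(operations, count, sizeof(struct operation), compare_operations); */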

3.4.5. Persistency

The persistency semantics can be used to specify if and when data and metadata must be written to persistent storage. This can be used to enable client-side write caching whenever possible. To support different persistency requirements, several levels will be supported:

• None: Data might never be written to persistent storage, that is, the data might reside in a client-side cache forever. This can be useful for local temporary data, for example. Allows modified data and metadata to be cached indefinitely and be discarded when closing the concerned object.

• Eventual: Data will eventually be written to persistent storage, that is, the data might reside in a client-side cache even after the operation finishes. A crash may cause the data to be lost if it has not been transferred to the file system servers. The period until the data is written is unspecified. Allows caching modified data and metadata for an unspecified amount of time.

• Immediate: Data will be written to persistent storage immediately, that is, as soon as the operation finishes the data will not be cached anymore. Data and metadata can not be cached; all data and metadata must be immediately sent to the appropriate servers.

The persistency semantics is performance-related and allows caching modified data and metadata locally. For example, temporary data can be cached more aggressively and does not necessarily need to be written to persistent storage at all. This can be especially advantageous when different levels of storage such as node-local SSDs are available as it allows writing the temporary data to the fast local storage without communicating via the network at all.

3.4.6. Safety

The safety semantics can be used to specify how safely data and metadata should be handled. It provides guarantees about the state of the data and metadata after the execution of a batch has finished.

• None: No safety guarantees are made, that is, data and metadata might be lost due to network or storage errors. Data and metadata are sent to the file system servers but no checking is done on whether the changes have been successful or not.

• Network: It is guaranteed that changes have been transferred to the servers as soon as the operation finishes. Data and metadata are sent to the file system servers and their reply is awaited.

• Storage: It is guaranteed that changes have been stored persistently on the storage devices as soon as the operation finishes. Data and metadata are sent to the file system servers and their reply is awaited. Additionally, the file system servers flush the changes to disk before sending their reply.

The safety semantics is performance-related because it allows adjusting the overhead incurred by data safety measures. For example, on the one hand, disabling data safety can be used to eliminate one of two network messages by not requesting the server's acknowledgment when sending unimportant data; this allows having more operations in-flight because their results do not have to be received and processed before sending the next operation.13 On the other hand, it can be used to make sure that important data will survive a system failure by flushing it to the storage devices immediately.

13 Batches can be used to reduce this problem to a certain extent by also batching replies. The general problem remains the same, however: Waiting for a reply before sending the next operation at least halves the throughput.


3.4.7. Further Ideas

The semantics presented above are going to be implemented and evaluated as part of this thesis. However, even more semantical aspects lend themselves to being configurable and will be briefly presented for completeness.

Redundancy

Redundancy semantics could provide users with a convenient way to store multiple copies of file data or metadata. This could be used to ensure that very important data is safe in case of system failures such as broken hard disk drives (HDDs). Similar options are already being offered by providers of long-term archival services [Ger14]. While this feature has proven its worth in the context of long-term archival, it is not clear whether the decision to store multiple copies of data and metadata should be taken by the users of file systems. Therefore, this option is only mentioned here for reference. Parallel distributed file systems are usually deployed in such a way that the loss of single storage devices does not result in data loss. Consequently, it might make more sense to leave this decision up to the storage system’s administrators. In any case, proper accounting of the used file system resources is necessary, otherwise users could simply force redundant storage of all data without consequences.

Security

Security semantics could be changed depending on the file system environment, enabling or disabling more strict permission checks. JULEA’s current security policy checks the permissions once when opening a collection or an item. That is, even if the ownership of said collection or item is changed, all clients still holding an open handle will continue to have access to it. Other environments might have different requirements regarding the security policy, however. Conducting these checks frequently – for example, for every access – can severely impact performance because the required metadata has to be fetched. Therefore, it would be worthwhile to consider making the security policy dynamic through this extension; for instance, the following configurations are conceivable:

• None: No security policy is enforced, that is, every client can access and modify all data and metadata.

• Open: Permissions are only checked when opening a collection or an item and not rechecked while the client still holds an open handle. This is the current security policy.

• Time-based: Permissions are rechecked periodically but not for every access.


• Strict: Permissions are rechecked for every access.

Obviously, the security semantics would need to be stored together with the relevant file system object and should only be changeable by the object’s owner; otherwise, other users could simply specify different security semantics to circumvent permission checks.

Transformation

Transformation semantics could be useful to allow users to transform the data in some way – for example, by compressing, deduplicating or encrypting it. Moving this functionality into the file system would have the advantage of being completely transparent to users and applications. For instance, application developers usually know whether it makes sense to compress the produced data and could easily use this semantics to handle it appropriately without the need to painstakingly adapt each application or I/O library.

As illustrated previously, today's HPC applications can produce tremendous amounts of data due to the ever increasing computational power of supercomputers. The storage systems, however, usually do not scale as well. One way to alleviate this problem is to compress the data. Previous studies have shown that compression can reduce power consumption as well as increase performance in certain use cases [CDKL14, KKL14]. Other techniques such as deduplication can also help to reduce the amount of stored data [MKB+12]. Nevertheless, due to their associated costs, it makes sense to only apply them when there is a clear benefit.

3.4.8. Interactions

All previously presented semantical aspects can be combined arbitrarily, resulting in a huge number of possible configurations.14 While some combinations of semantical settings do not actually affect each other or might simply be unreasonable, there are some interesting interactions between some of them:

• Concurrency: None
– It is possible to set the atomicity semantics to none because no operations will be executed in parallel. Consequently, it is impossible for concurrent operations to observe partially completed operations.
– The consistency semantics can be set to none because the relevant file system objects will not be modified by other clients concurrently. Consequently, it is possible to aggressively cache data and metadata.

14 To be precise, there are currently six semantical aspects with three different settings each; this results in 3⁶ = 729 possible combinations.


• Concurrency: Non-overlapping
– It is possible to set the atomicity semantics to none if only write operations are performed. Because write operations will only write to non-overlapping regions of items, it is not necessary to lock them if no concurrent read operation could potentially observe partial writes.

• Persistency: None
– It is possible to set the safety semantics to none because data will not be sent to the data servers immediately. Therefore, it is not necessary to enforce strong safety semantics.

For simplicity and performance reasons, the semantics will not be checked for conflicts; application developers are responsible for ensuring that no contradictory semantics will occur. For instance, different clients accessing the same file system object with a mix of non-overlapping and overlapping concurrency semantics at the same time will lead to undefined behavior.

3.4.9. Templates

To provide application developers with a convenient way of using different semantics, predefined semantics templates will be available for specific use cases. The following list provides an overview of the semantics templates that will be available in the prototype; it also lists their concrete settings and the reasoning behind them:

• Default: This template provides JULEA's default semantics. It is optimized for concurrent clients executing non-overlapping operations; this is the kind of access pattern that is often found in contemporary scientific applications.
– Atomicity: None
Atomicity is rarely required in parallel applications because I/O is usually done in separate read and write phases.
– Concurrency: Non-overlapping
Parallel applications commonly write shared files using non-overlapping accesses because each client is responsible exclusively for part of the data.
– Consistency: Eventual
As reading and writing is usually done in separate I/O phases, it is also not necessary to provide immediate consistency.
– Ordering: Semi-relaxed
The actual ordering of I/O is usually not important as long as the result is identical to the one specified by the application developer.
– Persistency: Immediate
Write operations should be synchronous by default to follow the principle of least astonishment.
– Safety: Network
Completed operations should have reached the file system servers as application crashes occur more frequently than file system server crashes.

• POSIX: This template is intended to mimic the current POSIX semantics as closely as possible. It is provided for backwards compatibility with applications that depend on POSIX semantics being available.
– Atomicity: Operation
Even though POSIX does not strictly mandate atomic operations (see Section 2.6.1 on pages 42–43), this is a common expectation.
– Concurrency: Overlapping
To correctly handle arbitrary access patterns, overlapping accesses have to be supported.
– Consistency: Immediate
Changes to file system objects have to be visible immediately to all clients, as specified by POSIX.
– Ordering: Strict
Even though POSIX does not explicitly mention the ordering of operations, it might have an influence on the visibility of changes to other clients.
– Persistency: Immediate
The same reasoning as for the default semantics template applies.
– Safety: Network
The same reasoning as for the default semantics template applies.

• Temporary (local): This template is tuned for process-local temporary data. Its semantics should also allow for transparent use of advanced technologies such as burst buffers.
– Atomicity: None
Atomicity is not required because no concurrent accesses will be performed.
– Concurrency: None
No concurrent accesses will be performed because each process will access its own data.
– Consistency: None
Consistency is not necessary as no concurrent accesses will be performed.
– Ordering: Semi-relaxed
The same reasoning as for the default semantics template applies.
– Persistency: None
As the data is only of a temporary nature, it does not have to be stored persistently within the file system.
– Safety: None
Safety is not required because temporary data can be recreated if necessary.

The predefined semantics templates obviously can not cover all possible use cases. Therefore, they should be viewed as bases upon which application-specific semantics can be built. While it might be desirable to have support for user-definable semantics templates, such functionality will not be included in the initial prototype; it will, however, be possible to easily adapt the templates as shown in the following example.

1 atomic_semantics = new Semantics(DEFAULT_SEMANTICS);
2 atomic_semantics.set(ATOMICITY, ATOMICITY_OPERATION);
3
4 sync_semantics = new Semantics(POSIX_SEMANTICS);
5 sync_semantics.set(SAFETY, SAFETY_STORAGE);

Listing 3.5: Adapting semantics templates

Listing 3.5 shows how to adapt the predefined semantics templates. In the first example, the default semantics are modified to provide atomic access (lines 1–2); this is similar to enabling MPI-IO’s atomic mode. In the second example, the POSIX semantics are adapted to provide synchronous I/O (lines 4–5); this is similar to specifying O_SYNC when opening a file using the POSIX interface. However, JULEA’s concept is more flexible because it allows the semantics to be applied selectively by associating them with batches. In contrast, opening a POSIX file with O_SYNC implies that all I/O operations will be synchronous.
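For comparison, a minimal POSIX counterpart is shown below; the file name is only illustrative. The synchronous behavior requested via O_SYNC applies to every write performed through the returned descriptor, whereas JULEA's safety semantics only apply to the batches they are attached to.

#include <fcntl.h>

/* Every write through the returned descriptor is synchronous for as long as it stays open. */
static int open_checkpoint (const char* path)
{
	return open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
}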

The presented semantics parameters are a first proposal of factors that are important for HPC applications. They have been determined by analyzing the use cases of applications as well as the underlying causes for prevailing performance problems found in contemporary parallel distributed file systems. More analyses and discussions are necessary to come up with a final list that is suitable for widespread adoption in other file systems and I/O interfaces.

3.5. Data and Metadata

3.5.1. Distribution

By default, data will be distributed among all available data servers using a round-robin scheme as commonly found in parallel distributed file systems. However, support for multiple distribution schemes will be provided to allow optimizing I/O performance. The distribution of metadata will also be supported explicitly to avoid performance bottlenecks and scaling problems.

Previous studies have shown that different distributions can be beneficial for certain kinds of files. For instance, distributing small files across many servers often does more harm than good [KKL08, KKL09, CLR+09]. As application developers can most accurately estimate the expected benefits of adapting the distribution, it has to be easy for them to manually adapt the distributions; that is, the I/O interface should have direct and adequate distribution support.

3.5.2. Metadata Management

As shown in Section 2.2.1 on pages 27–29, file systems usually keep a lot of metadata. To reduce JULEA's metadata overhead, collections and items will feature only a reduced set of metadata. The following list gives an overview of the metadata that needs to be stored for the different types of file system objects:

• Name (collection and item)

• Ownership (collection and item)
– User
– Group

• Distribution (item only)
– Varies depending on the chosen distribution

• Status (item only)
– Size
– Modification time

As already mentioned, unnecessary metadata will be omitted. For example, the last access time will not be stored because it would introduce write overhead for each read operation. While this information might be appropriate for general purpose file systems, its usefulness in parallel distributed file systems targeted at HPC workloads is questionable.15

15 In fact, current versions of Linux only update the last access time under certain circumstances even for local file systems due to the implicit overhead [Zak14]. Linux versions 2.6.30 and up default to relatime, which is explained in Section 2.6.1 on pages 42–43.

File system metadata is usually stored in inodes that have a fixed format. Due to JULEA's dynamic nature, its metadata does not fit into such a fixed schema, because different semantics can make it necessary to store different metadata. One obvious example is the distribution information which varies based on the chosen distribution function. While it would be possible to reserve a certain amount of space for distribution information and future extensions, this would introduce the same inflexibilities found in current inode designs. However, other factors can also make it necessary to modify the metadata schema. One of those factors is the rate with which the metadata is accessed and modified. Regarding its access rate, the metadata can be separated into three groups:

1. Write-once
• The data distribution metadata is written once when the item is created and not modified afterwards.

2. Occasionally changing
• The name and ownership metadata is only modified if explicitly requested by the user.

3. Frequently changing
• The status metadata is potentially modified for each access.

While write-once and occasionally changing metadata can easily be kept on the metadata server, also storing frequently changing metadata there can result in a performance bottleneck in specific cases. Fundamentally, there are two possibilities to manage this information:

1. Frequently changing metadata is stored on the metadata servers. Even if metadata is distributed across multiple metadata servers, the metadata of a single item is usually managed by exactly one metadata server. A large number of clients modifying a single item concurrently can cause a storm of updates on this single metadata server, causing the already mentioned performance bottleneck.

2. Frequently changing metadata is not stored explicitly, but rather retrieved and computed on demand. This can be achieved by collecting information about the different data stripes from all data servers. For instance, while the item's size can be summed up over all servers, only the maximum of all servers' last modification times would be used to determine the item's modification time. These can be expensive operations as they involve contacting a potentially large number of data servers.

JULEA’s concurrency semantics provide information about the number of clients accessing an item and can thus be conveniently used to determine the method to use; this will make sure that frequently changing metadata such as the file size and modification time are only stored explicitly for non-parallel workloads.
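The on-demand computation described above amounts to a simple aggregation over the data servers. The sketch below assumes that the per-server sizes and modification times have already been collected; it only illustrates the aggregation step and is not JULEA's actual implementation:

#include <stddef.h>
#include <stdint.h>

/* Derive an item's status from per-server information:
 * the size is the sum, the modification time the maximum. */
static void aggregate_status (const uint64_t sizes[], const uint64_t mtimes[],
                              size_t server_count, uint64_t* size, uint64_t* mtime)
{
	*size = 0;
	*mtime = 0;

	for (size_t i = 0; i < server_count; i++)
	{
		*size += sizes[i];

		if (mtimes[i] > *mtime)
			*mtime = mtimes[i];
	}
}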


Even though parts of the metadata are write-once or occasionally changing, large numbers of concurrently accessing clients can still cause congestion inside the metadata servers due to high rates of metadata operations. Batches provide the means to solve this particular problem by aggregating many metadata operations and thus reducing the metadata overhead.

Summary

This chapter has illustrated the design of JULEA's parallel distributed file system and I/O interface; the design includes the general architecture, the namespace, the interface, the semantics and considerations regarding data and metadata handling. JULEA's possible semantics, their interactions and consequential optimization opportunities have been highlighted specifically. In contrast to traditional I/O interfaces, JULEA allows its semantics to be adapted dynamically; this allows applications to fine-tune the file system's behavior according to their I/O requirements instead of the other way around.


Chapter 4. Related Work

In this chapter, an overview of existing work from the fields of parallel distributed file systems, I/O optimizations, interfaces and semantics will be given. Comparisons with existing approaches will focus on their ability to provide semantical information for optimization and convenience purposes as well as their capabilities regarding dynamic semantics.

4.1. Metadata Management

The traditional approach of metadata management is to have one or more metadata servers and partition the file system namespace statically. In addition to this, more sophisticated techniques for handling the increasing requirements regarding metadata performance have started to emerge. A selection of popular ones will be presented and compared to JULEA’s design.

GIGA+

GIGA+ presents a new file system directory service that is supposed to handle millions of files and has been integrated into OrangeFS [PGLP07, PG11]. It stripes directories over many servers by effectively splitting directories into multiple partitions by hashing the name of directory entries; the appropriate partitions and servers are found using low-overhead bitmaps. It supports traditional POSIX1 semantics and is built for high throughput and scalability by minimizing the necessary amount of shared state. Additionally, it can handle incremental growth of directories as well as provide adequate burst performance. GIGA+'s design is built and improved upon in [XXSM09] to efficiently support a trillion files by employing an adaptive two-level directory partitioning scheme. The presented approach allows scalable access to very large directories and dynamic partitioning of the file system namespace for load balancing purposes.

One of GIGA+'s similarities with JULEA is the fact that metadata is also split into different categories: Infrequently updated metadata such as the owner or creation time are managed at a centralized server; highly dynamic metadata such as access and modification times are allowed to vary across servers. The latter is then dealt with by the clients that have to ensure consistency by themselves.

1 Portable Operating System Interface

Coupled Data and Metadata

Instead of providing dedicated metadata servers, it is also possible to eliminate the metadata servers as shown in [ADD+08]. The authors move as much metadata as possible to the data servers, leaving only a dedicated server handling directory operations. On the one hand, this approach has the advantage that it is not necessary to contact additional metadata servers when the data servers have to be contacted anyway. On the other hand, metadata and data operations influence each other because the hardware resources are shared. Additionally, it makes it harder to handle metadata and data separately: As mentioned previously, it makes sense to use alternative storage technologies such as dedicated solid state drives (SSDs) for metadata because metadata and data servers usually experience completely different access patterns.

hashFS

A new file system approach is presented in [LMB10, LCB13] that eliminates the current need for many small accesses to get the metadata of all path components during path lookup. By using the hashed file path to directly look up the related data and metadata, this can be reduced to only require one read operation per file access. While this can significantly decrease metadata overhead and increase small file performance, the use of the full file path for hashing implies that the renaming of parent directories causes the hashes of all their children to change. There are two approaches to handle this fact:

1. All hashes are recomputed immediately after a rename operation. This approach might lead to a lot of computational overhead, depending on the rate of rename operations in the file system.

2. Rename operations are recorded in a translation table. While this approach avoids costly recomputations, additional translation table lookup operations have to be performed for each metadata access.

As can be seen, both approaches introduce additional management effort. JULEA does not use hashed path lookups for this reason, but implements a flat namespace to keep metadata lookup overhead low.
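To make the rename problem concrete, consider a lookup that hashes the full path into a bucket; the hash function and the bucket scheme below are simple placeholders, not hashFS's actual implementation:

#include <stddef.h>
#include <stdint.h>

/* Toy full-path hash (djb2-style): the bucket depends on every path component. */
static uint64_t path_bucket (const char* path, uint64_t bucket_count)
{
	uint64_t hash = 5381;

	for (size_t i = 0; path[i] != '\0'; i++)
		hash = (hash * 33) + (unsigned char)path[i];

	return hash % bucket_count;
}

/* After renaming /data to /archive, path_bucket("/data/run1/out", n) and
 * path_bucket("/archive/run1/out", n) will almost certainly differ, so the entry
 * either has to be re-hashed or found via a translation table. */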

SmartStore

SmartStore provides a new metadata organization paradigm for file systems [HJZ+09]. The authors have identified the traditional hierarchical file system namespace as an obstacle for future scalability requirements. Instead of providing a hierarchical namespace, SmartStore allows searching for data using database-like queries. To be able to efficiently execute these queries, SmartStore exploits semantical information to group metadata of correlated files.


In contrast to SmartStore’s query-based approach, JULEA provides a traditional namespace but limits its depth to minimize the metadata overhead.

4.2. Semantics Compliance

As already mentioned in Chapter 2, the POSIX input/output (I/O) interface and semantics are a common choice among parallel distributed file systems. The following list gives an overview of the supported I/O semantics of popular parallel distributed file systems and their degree of standards compliance:

• Lustre: Lustre’s goal is to provide a fully POSIX-compliant file system even though its current implementation might not be 100 % compliant. Among other features, it provides POSIX-compliant handling of file sizes even in the context of striping [SKH+08].

• GPFS: GPFS2 has been designed to be fully POSIX-compliant. Like Lustre, GPFS can guarantee POSIX-compliant handling of file sizes and also supports strict POSIX atomicity semantics [SKH+08, JKY00].

• OrangeFS: OrangeFS is not POSIX-compliant but provides support for atomic non-overlapping writes, even if the write operations are non-contiguous [TSP+11, LRT04]. One of OrangeFS’s new goals is to explore configurable semantics.

• CephFS: Ceph’s file system CephFS provides near-POSIX semantics [WBM+06]. One major difference is the fact that write operations are not guaranteed to be atomic if they cross object boundaries. That is, similar to OrangeFS, if two clients write to the same overlapping location, the resulting data might contain partial data from different clients.

• GlusterFS: GlusterFS claims to be fully POSIX-compliant; no further details are provided [Glu11].

As can be seen, even though many parallel distributed file systems provide some kind of POSIX compliance and some are even fully POSIX-compliant, there are subtle differences depending on the used file system. Therefore, application developers still have to make sure that their applications work correctly on different file systems, even though they use a seemingly portable I/O interface. One of the reasons for this state of affairs is the fact that supporting POSIX semantics in a parallel distributed file system is a complex task; striving to do the same while providing high performance only exacerbates the problem.

2 General Parallel File System


Another problem stems from the fact that the POSIX specifications are sometimes not explicit enough and allow for different interpretations of the standard. For instance, the different possible interpretations of POSIX’s atomicity semantics are the subject of an ongoing debate (see Section 2.6.1 on pages 42–43). This ambiguity can also lead to unexpected behavior; the write function’s manual contains the following statement:

“POSIX requires that a read(2) which can be proved to occur after a write() has returned returns the new data. Note that not all filesystems are POSIX conforming.” Source: [The14b]

Even for local file systems, this behavior does not imply that data has been stored on a storage device persistently; an additional call of fsync or fdatasync is required to make it so. In the context of parallel distributed file systems, however, this has additional implications that are illustrated based on the number of client machines accessing a shared file:

• Single client machine: If the accesses to the shared file originate from only a single client machine, the parallel distributed file system does not have to send the data to the data servers for every single write call. Instead, it can aggregate the data in the machine-local cache to increase performance. This behavior is POSIX-compliant because all read calls can be satisfied from the client machine’s cache. This allows for high performance even in the presence of suboptimal I/O patterns because caching can be used to mitigate the problem.

• Multiple client machines: If the accesses to the shared file originate from multiple client machines, the parallel distributed file system has to modify its behavior. It has to send every write call's data to the data servers immediately or employ a locking scheme because clients on different client machines might issue read calls that have to return the newly written data according to POSIX.

Consequently, applications will exhibit different performance characteristics depending on the currently used number of client machines even though the actual I/O pattern does not change. This can be surprising for application developers and is another fact to be taken into account when performing parallel I/O. The effects of this behavior will be examined in more detail in Chapter 6.

4.3. Adaptability

There are a few approaches to provide configurable behavior and semantics in parallel distributed file systems. However, they are usually limited to single aspects of the file system or too static because they do not allow changes at runtime [PGG+09].


MosaStore

MosaStore is a versatile storage system that is configurable at application deployment time and thus allows application-specific optimizations [AKGR10]. This approach is similar to the JULEA approach; however, MosaStore provides a storage system bound to specific applications instead of a globally shared one. Additionally, the storage system can not be reconfigured at runtime and keeps the traditional POSIX I/O interface.

CAPFS

CAPFS introduces a new content-addressable file store that allows users to define data consistency semantics at runtime [VNS05]. While providing a client-side plug-in API allows users to implement their own consistency policies, CAPFS is limited to tuning the consistency of file data and keeps the traditional POSIX interface. Additionally, the consistency semantics can only be changed on a per-file basis.

Configurable Security

In [GAKR08], the authors present a configurable security approach that allows using scavenged storage systems – that is, storage systems consisting of unused workstation hardware – in trusted, partially trusted and untrusted environments in a secure way. While JULEA does not use scavenged storage hardware and currently does not support dynamic security semantics, the cited work shows that configurable security can be achieved with relatively low overhead.

4.4. Semantical Information

The problem that missing semantical information makes heuristics necessary to improve performance is of course not unique to file systems. Many fields of computer science are affected by this and can benefit from additional developer-provided knowledge.

Custom Metadata

In [SNAKA+08], the authors propose to use custom metadata such as extended attributes for cross-layer optimizations in storage systems. This means that applications can provide additional information to the storage system via custom metadata and vice versa. The authors give several examples of how this can be used to improve the storage system's efficiency:

• Files can be annotated as temporary and thus be treated differently: Temporary files can be cached more aggressively or be purged automatically.

• Annotations can be used to specify quality of service requirements such as durability, security and privacy.

• Consistency requirements can be specified to manage performance tradeoffs.


The idea of custom metadata is very similar to JULEA’s semantical information. The main difference between the two approaches is that custom metadata is explicitly stored and interpreted by the storage system, while JULEA’s semantical information is specified for each batch and passed directly to the file system. Additionally, the authors present a generic approach, while JULEA is tailored to high performance computing (HPC) applications.
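For illustration, the temporary-file annotation mentioned above could be expressed with an extended attribute on Linux; the attribute name and path are made up for this example, and a cooperating storage layer would have to interpret them:

#include <sys/xattr.h>

/* Mark a file as temporary so that a cooperating storage layer may cache it
 * aggressively or purge it automatically. */
static void mark_temporary (const char* path)
{
	setxattr(path, "user.storage.temporary", "1", 1, 0);
}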

Amino

Amino's authors have designed and implemented a file system supporting atomicity, consistency, isolation and durability (ACID) semantics [Wri06, WSSZ07]. Amino is a POSIX-compliant user space file system that uses the ptrace tracing framework to intercept POSIX I/O system calls. It is built on top of Berkeley DB (BDB), which provides a well-tested infrastructure for transactions.

1 amino(BEGIN_TXN, "/path/to/file", 0);
2
3 fd = open("/path/to/file", O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
4 pwrite(fd, data, sizeof(data), 0);
5 close(fd);
6
7 amino(COMMIT_TXN, 0);

Listing 4.1: Amino transactions

Listing 4.1 shows pseudo code for Amino transactions. The transaction is started for a given path using the amino function with the BEGIN_TXN parameter (line 1). Afterwards, arbitrary POSIX I/O functions can be executed (lines 3–5). Finally, the transaction is committed by passing the COMMIT_TXN parameter to the amino function. In case of an error, the transaction could be aborted using the ABORT_TXN parameter. As can be seen, the concept of transactions is similar to that of JULEA’s batches even though the latter do not offer ACID support. A downside of Amino’s transactions is that they can not be adapted dynamically when full ACID semantics are not required.

Networking

A feature found in TCP3 is the Nagle algorithm that tries to aggregate small network messages into larger ones to reduce the number of packets sent over the network. For instance, an application sending ten messages containing 1 byte each would generate ten network packets with a size of at least 41 bytes each.4 Consequently, this application would generate ten network packets with a cumulative size of 410 bytes. The Nagle algorithm can aggregate all these small messages into one network packet with a size of 50 bytes, reducing the overhead by more than 85 %. However, the Nagle algorithm uses heuristics to decide which messages to aggregate and when to actually send a network packet. Due to several factors, this can result in delays of up to 500 ms [MSMV00].

While it is possible to disable the Nagle algorithm using setsockopt's TCP_NODELAY option, this undoes all possible optimizations. A better approach is the so-called corking: The TCP_CORK option allows developers to manually control the message aggregation feature [MM01]. This can be used to cork the connection before sending many small messages, which causes them to be queued and aggregated instead of being sent immediately. As soon as the connection is uncorked, the queued messages are flushed and sent using as few network packets as possible.

3 Transmission Control Protocol
4 In addition to the actual data, each packet carries several headers. While TCP adds a header of 20 bytes, the size of the header added by the Internet Protocol (IP) depends on the protocol version: An IPv4 header has a size of 20 bytes and an IPv6 header has a size of 40 bytes. The underlying network technology – such as Ethernet – usually increases the packet size even further.

1 int fd;
2 int flag;
3
4 flag = 1;
5 setsockopt(fd, IPPROTO_TCP, TCP_CORK, &flag, sizeof(flag));
6
7 write(fd, &flag, sizeof(flag));
8 write(fd, &flag, sizeof(flag));
9 write(fd, &flag, sizeof(flag));
10
11 flag = 0;
12 setsockopt(fd, IPPROTO_TCP, TCP_CORK, &flag, sizeof(flag));

Listing 4.2: TCP corking

Listing 4.2 shows code demonstrating the use of TCP corking. The file descriptor fd is assumed to be an open network socket (line 1); the integer variable flag will be used to pass arguments to the setsockopt function and used as dummy data (line 2). Before sending any data, the setsockopt function is used together with the 1 flag to cork the connection (lines 4–5). Afterwards, several small messages are sent (lines 7–9). Finally, the connection is uncorked using the 0 flag (lines 11–12). As can be seen, this is similar to the concept of batch operations in JULEA but on a much lower level. In fact, the additional semantical information provided by JULEA’s batch operations can and will be utilized to make use of this TCP optimization.

Memory Ordering

In parallel programming for shared memory architectures, memory ordering and consistency are important factors for both performance and correctness. Because central processing units (CPUs) usually reorder memory load and store operations to improve performance, it is necessary to take this fact into account when using multiple threads to access shared memory [GLL+90, GGH91].

1 /* Thread 1: */
2 x = 1;
3 r1 = y;
4
5 /* Thread 2: */
6 y = 1;
7 r2 = x;

Listing 4.3: Memory operation reordering

Listing 4.3 shows the memory operations of two concurrent threads (lines 1–3 and 5–7, respectively). x and y are variables in the shared memory that are initialized with 0. Each thread writes to one of the variables (thread 1 to x and thread 2 to y) and then reads the variable written to by the other thread into a register (threads 1 and 2 write into r1 and r2, respectively). The order of operations suggests that at least one of the registers will contain the value 1 after both threads have finished running. However, due to reordering, both registers could actually contain the value 0: The CPU could first execute both load operations into the registers and then store the values into x and y.

1 #include <stdatomic.h>
2
3 atomic_int guide = ATOMIC_VAR_INIT(42);
4 atomic_init(&guide, 42);

Listing 4.4: Atomic variables in C11

Modern concepts such as those supported by C++11 and C11 allow developers to specify different constraints to achieve optimal performance while still maintaining correct execution of their applications [ISO11]. Those features are usually leveraged by making use of atomic variables. Listing 4.4 shows how atomic variables can be declared and defined: First, it is necessary to include the stdatomic.h header providing the atomic functionality (line 1). This makes available new atomic data types that are denoted by the atomic_ prefix.5 Afterwards, an atomic integer is declared and initialized using the ATOMIC_VAR_INIT function (line 3). Alternatively, it is possible to initialize atomic variables using the atomic_init function (line 4).

5 Alternatively, data types can be made atomic using the _Atomic type qualifier.

Depending on the used CPU architecture, memory operations can be reordered differently. Consequently, C++11 and C11 allow providing information that can be used to produce the optimal code for each CPU architecture. The memory_order type defines several possible orderings that can be used to specify the semantics necessary to obtain the correct results:

• memory_order_seq_cst: Guarantees that no reordering is performed and pro- vides sequential consistency.

• memory_order_acquire: Guarantees that no subsequent load operation is moved before the current one.

• memory_order_release: Guarantees that no preceding store operation is moved beyond the current one.

• memory_order_acq_rel: Combines the previous two guarantees.

• memory_order_consume: Provides guarantees similar to memory_order_acquire but only for operations that are dependent on the current load operation.

• memory_order_relaxed: All orderings are allowed.

Using these memory ordering settings, the above example can be rewritten to solve the problem by forcing the CPU to not reorder operations in an incorrect way.

1 /* Thread 1: */
2 atomic_store_explicit(&x, 1, memory_order_seq_cst);
3 r1 = atomic_load_explicit(&y, memory_order_seq_cst);
4
5 /* Thread 2: */
6 atomic_store_explicit(&y, 1, memory_order_seq_cst);
7 r2 = atomic_load_explicit(&x, memory_order_seq_cst);

Listing 4.5: Atomic operations in C11

Listing 4.5 shows the example from Listing 4.3 on the preceding page in a modified form that uses atomic operations and sequential consistency [Lam79]. This guarantees that at least one of the registers contains the value 1 because the requested memory ordering does not allow the operations to be reordered. Being able to specify the memory ordering has several advantages: First, it allows the compiler to produce optimal code for the given CPU architecture. Second, it is not necessary to force sequential consistency at all times to guarantee correctness. JULEA's ordering semantics provide the same kind of benefit by allowing developers to supply additional semantical information that can be used to optimize execution.
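For illustration, the following minimal C11 sketch (with freely chosen variable names) shows a producer/consumer handoff where the weaker acquire/release orderings already suffice: the release store guarantees that the payload is written before the flag becomes visible, and the acquire load guarantees that the payload is read only after the flag has been observed.

#include <stdatomic.h>
#include <stdbool.h>

int payload = 0;
atomic_bool ready = ATOMIC_VAR_INIT(false);

/* Thread 1 (producer): */
void produce (void)
{
    payload = 42;
    /* No preceding memory operation may be moved past this store. */
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Thread 2 (consumer): */
int consume (void)
{
    /* No subsequent memory operation may be moved before this load. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
    {
        /* Spin until the payload has been published. */
    }

    return payload;
}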


ADIOS As shown in Section 2.5.6 on pages 40–42, ADIOS6 offers a novel and developer-friendly I/O interface: It allows specifying the I/O configuration in an XML7 file that can be changed without recompiling the application. Newer releases of ADIOS have added features to provide improved performance and convenience:

1. Read scheduling: Version 1.4 has added support for scheduling read operations. Several read operations can be scheduled using the adios_schedule_read function and then executed using the adios_perform_reads function.

2. Data transformations: Version 1.6 has added support for on-the-fly data transformations. Each variable can be assigned a different transformation using the XML configuration file or the adios_set_transform function.8 Currently, data transformations allow compressing variables using different compression algorithms. However, as the transformations are implemented in the form of plug-ins, additional transformations can be added [BLZ+14].

1 adios_schedule_read(adios_fd, NULL, "var1", 0, 1, &var1);
2 adios_schedule_read(adios_fd, NULL, "var2", 0, 1, &var2);
3 adios_schedule_read(adios_fd, NULL, "var3", 0, 1, &var3);
4 adios_perform_reads(adios_fd, 1);

Listing 4.6: ADIOS read scheduling

Listing 4.6 shows an example of read scheduling. First, three variables var1, var2 and var3 are scheduled for reading using the adios_schedule_read function (lines 1– 3). Afterwards, all scheduled reads are executed using the adios_perform_reads function (line 4). In the best case, this allows reading a contiguous chunk of data instead of many small ones.

1 <var name="matrix" type="double" dimensions="rows,columns" transform="bzip2"/>

Listing 4.7: ADIOS variable transformation (XML)

Data transformations can be assigned to individual variables in the XML file, as shown in Listing 4.7. The matrix variable, which is a matrix of dimensions rows×columns and contains double values, is transformed using the bzip2 data transform.

6 Adaptable IO System 7 Extensible Markup Language 8 The adios_set_transform function has been added in ADIOS version 1.9.


1 int64_t var_id;
2
3 var_id = adios_define_var(...);
4 adios_set_transform(var_id, "bzip2");

Listing 4.8: ADIOS variable transformation

Listing 4.8 shows how data transformations can be used manually. First, a variable has to be defined using the adios_define_var function (line 3); the function returns a 64-bit integer identifier (ID). Afterwards, it is possible to set the data transformation using the adios_set_transform function by passing the variable ID as well as the desired data transformation bzip2 (line 4). Read scheduling is very similar to JULEA’s batches: It allows aggregating multiple operations for improved performance. However, in contrast to JULEA’s batches that can contain arbitrary operations, ADIOS’s read scheduling is limited to read operations. The availability of this information could be exploited when using ADIOS on top of JULEA. ADIOS’s data transformations implement the transformation semantics proposed in Section 3.4.7 on page 67, demonstrating their usefulness. Exposing this functionality in a convenient way lifts the burden of manually implementing data compression from the application developers.

Summary

This chapter has presented related work and compared it with JULEA's approach. The related concepts have been grouped according to their metadata management, the adaptability of semantics and the ability to specify additional semantical information. Additionally, a comparison of the semantics compliance of several parallel distributed file systems has been performed. While other fields of informatics have used semantical information to improve performance, support for similar facilities is very sparse with respect to file systems.


Chapter 5. Technical Design

In this chapter, the technical design of the file system and I/O interface with dynamically adaptable semantics will be presented. Because both have been designed with modularity in mind, extension points will be highlighted specifically. Additionally, important software components that have been used during the development phase will be introduced.

Modifying an existing parallel distributed file system to implement the design presented in Chapter 3 seems like an obvious choice. While it has been considered to use Lustre or OrangeFS, this has been deemed unreasonable because neither Lustre nor OrangeFS are prepared for this kind of functionality. Therefore, large parts of their existing implementations would have had to be changed: On the one hand, their respective input/output (I/O) interfaces would have needed to be adapted to support batch operations and to allow specifying semantical information. On the other hand, modifications to the actual file system code would have been required to take advantage of the information delivered by the interface. Previous experience has shown that it can be difficult to adapt existing architectures for features affecting the fundamental file system functionalities [KKL08, KKL09].

In addition to the actual implementation, this would have meant getting to know the existing large code bases: Lustre and OrangeFS contain more than 550,000 and 250,000 lines of code, respectively. While OrangeFS can be run completely in user space, Lustre's client and server code have been implemented in the form of kernel modules and patches (see Chapter 2). Consequently, even more major modifications would have been required because Lustre's client interface is limited by Linux's virtual file system (VFS) layer as described in Section 2.2 on pages 26–29. Due to its complexity, OrangeFS's native I/O interface is usually not used directly but only through either MPI-IO or POSIX1; that is, changes to the OrangeFS interface would have also required modifications to at least one of those additional I/O interfaces.

1 Portable Operating System Interface

Therefore, it has been decided to implement a prototypical file system with all other necessary components to evaluate the proposed design. The resulting framework is called JULEA. While the file system prototype has been built from scratch to suit the needs of the proposed I/O interface, care has been taken to use existing technology whenever possible. This approach has two major advantages: First, it helps minimize the development overhead while implementing the already complex parallel distributed file system. Second, widely used software components are expected to be well-tested and thus contain fewer bugs than self-developed ones.

Because developing and maintaining a kernel module can present a significant burden, JULEA will run completely in user space. JULEA provides a user space library that can be linked to applications, allowing them to use the JULEA I/O interface. An additional user space daemon handles storing the file data on the data servers. Metadata is stored in a NoSQL database system called MongoDB [Cat10]; it can be scaled horizontally by simply adding more servers and will be explained in more detail in Section 5.2 on pages 91–93. The library communicates via TCP2 with both the JULEA daemons and the MongoDB servers running on the data and metadata servers, respectively. By providing all functionality in user space, JULEA is largely independent of the operating system used and can be easily ported to new software environments. An increasing number of parallel distributed file systems – such as CephFS, GlusterFS and OrangeFS – also prefer this kind of architecture.

Implementing a parallel distributed file system in user space has several advantages: First, user space code is much more portable because – in contrast to the internal kernel interfaces – the user space application programming interface (API) and application binary interface (ABI) are stable. Second, in case of problems, user space applications can simply be restarted; kernel problems might make it necessary to restart the whole machine. Third, support for user space performance analysis and debugging is better than for kernel space due to comprehensive tool support. However, providing file system functionality from user space also has disadvantages: For instance, instead of faster mode switches, more expensive context switches might be necessary because file systems will run as normal user space processes.3 In real-world usage, this problem can typically be neglected due to the high latencies that occur when accessing the storage devices and the network.

Many distributed file systems use underlying local POSIX file systems to store the actual data. For example, Lustre uses ldiskfs, which is based on the ext4 file system; OrangeFS is able to use arbitrary POSIX file systems. Obviously, this introduces additional overhead because a lot of the common file system functionality – such as path lookup or permission checking – is duplicated: While all this functionality is already present and handled in the parallel distributed file system, the underlying local file systems perform the same work redundantly. As presented in Chapter 3, the goal is to use an object store for storing the actual file data. Since an object store only

2 Transmission Control Protocol 3 A mode switch denotes a switch from user mode to kernel mode and vice versa. A context switch denotes a switch from one user space process to another process and requires more state to be saved and restored; this makes them slower by a factor of up to 50.


provides the most essential functions like creating, reading, writing and removing objects, this will help avoid the overhead mentioned before.

In order to provide as much flexibility as possible regarding the underlying storage system, JULEA offers native support for multiple backends in the form of interchangeable modules. These backends implement a common interface that abstracts the actual storage system and provides transparent access to different storage technologies. While this makes it possible to easily support different object stores as intended, it also allows using existing file systems for compatibility reasons.

JULEA's software framework also features auxiliary tools such as command line utilities and a FUSE4 file system for compatibility with POSIX applications. Additionally, unit tests and benchmarks are included to be able to easily find functionality and performance regressions.

5.1. Architecture

Figure 5.1 on the following page illustrates JULEA’s architecture in more detail. As mentioned before, JULEA’s architecture comprises three major component classes: clients, data servers and metadata servers. All communication between instances of these three components is handled using TCP. While it is possible to run all of them on the same physical nodes for testing purposes, production environments usually have dedicated nodes for each component instance. The figure shows the default configuration that groups related services together on the same node; however, specific services such as the router and config servers could also be placed on separate nodes.

Client To make use of JULEA's features, client applications simply have to be linked against JULEA's client library called libjulea.so. The library provides a cleanly separated namespace for JULEA's functionality: All function names begin with j_, all data types are prefixed with J and all preprocessor macros and enum values have a leading J_ in their name. Applications are usually executed on designated compute nodes that are reserved exclusively for computational tasks. There is no limit on the number of clients; depending on the particular use case, one or multiple clients can be run on each node. While the application and the client library communicate via shared memory, the router is accessed via TCP.

• Application: The JULEA client library can be used with any kind of application, including parallel applications using MPI5 and threads.

4 Filesystem in Userspace 5 Message Passing Interface


Figure 5.1.: JULEA's general architecture (client node with application, JULEA client library libjulea.so and MongoDB router mongos; data server with data daemon julea-daemon and storage backend libposix.so; metadata server with MongoDB shard and config server mongod processes)

• Client library: All functionality is contained in JULEA’s libjulea.so. An additional library called libjulea-private.so provides internal functionality for other JULEA components; this separation allows minimizing the publicly available interface. Concentrating as much functionality as possible in these central libraries avoids code duplication and facilitates reuse of existing code. The client library transparently handles all communication with the data and metadata servers by communicating with JULEA’s data daemon and MongoDB’s daemon or router as necessary.

• Router: The MongoDB router is used in case MongoDB’s sharding is activated. Client applications can connect to the router process mongos that behaves like a normal MongoDB server. It retrieves the shard configuration from the MongoDB config server and routes all traffic to the appropriate MongoDB shards.

Metadata Server The metadata servers are executed on a subset of the so-called storage nodes and make use of the MongoDB database system. While each of the nodes runs a shard, only three of them house a config server; both communicate via TCP connections. The metadata servers consist of two user space processes each:

• Shard: A MongoDB shard holds a part of the complete MongoDB database; the actual distribution is performed by the mongos routers. In non-sharded configurations, there is only one MongoDB server that holds the complete database.

• Config server: A MongoDB config server holds sharding metadata. While it is possible to run only a single config server, production sharded clusters are supposed to contain exactly three config servers for redundancy and safety reasons. All sharding metadata is stored in a special config database that can be accessed using the normal MongoDB interface if absolutely necessary.

Data Server The data servers run a user space daemon called julea-daemon that handles all I/O on behalf of JULEA’s clients. This daemon has access to multiple storage backends that are compiled as shared libraries and loads one of them at startup. In this example, the POSIX backend contained in libposix.so is used. The data servers are also executed on a subset of the storage nodes; depending on the use case, it might or might not overlap with the one used by the metadata servers. Each node houses a single data daemon that is linked to a single storage backend; the data daemon and storage backend communicate via shared memory.

• Data daemon: JULEA’s data daemon runs as a normal user space process and waits for TCP connections from JULEA’s client library. Each one is uniquely associated with one client process and handled by its own dedicated thread.

• Storage backend: The storage backend is dynamically loaded by the data dae- mon on startup. Consequently, only one storage backend can be active at the same time. All I/O requests are handled by the data daemon and delegated to the storage backend for the actual processing.

5.2. Metadata Servers

To reduce the implementation overhead, JULEA’s metadata servers are realized using an existing database system. JULEA’s metadata design has two main requirements that must be met by the potential candidates:

1. Scalability: It must be possible to scale the metadata servers horizontally without much effort. Centralized services can quickly become performance bottlenecks with the ever-increasing numbers of accessing clients.

2. Flexibility: Metadata must not be constrained into a fixed format. JULEA's dynamic behavior makes it necessary to store different kinds of metadata depending on the current semantics.

Even though traditional SQL database systems such as MySQL Cluster offer possibilities for horizontal scaling [Ora11], they were not considered because the fixed format of their tables is not suited for JULEA's dynamic metadata format. NoSQL database systems are often designed with horizontal scalability in mind. Additionally, document-oriented NoSQL database systems usually offer the ability to store documents with differing schemas.

5.2.1. MongoDB

MongoDB is such a document-oriented NoSQL database system with support for dynamic document schemas [10g13] that are well-suited for JULEA's non-uniform metadata. A multitude of programming languages are supported through official and third-party client interfaces and libraries. MongoDB supports replication and high availability for use in production environments.

MongoDB's namespace is organized into databases, collections and documents: Multiple related documents can be combined into collections and a database can consist of multiple collections. Collections are referred to by the concatenation of the database name and the collection name with a period. For example, the bar collection in the foo database would be accessed using foo.bar. Documents are sets of key-value pairs. While keys are always strings, values can have arbitrary data types such as strings, integers or even arrays. This makes MongoDB documents the perfect candidate to store JULEA's metadata.

1 {
2     "_id" : ObjectId("51caae667d1a000000000014"),
3     "text" : "Lorem ipsum dolor sit amet, ...",
4     "length" : 42
5 }

Listing 5.1: MongoDB document in JSON format

Listing 5.1 shows an exemplary MongoDB document in JSON6 format. Each document has a unique identifier (ID) called _id (line 2); the ID is automatically generated if it is omitted when creating the document. As can be seen, values can be of arbitrary type: While the value belonging to the key text is a string (line 3), the value associated with the key length is an integer (line 4). By default, the _id key is indexed, allowing fast lookups using this key; additional indexes can be added easily, however.

MongoDB also supports a technique called sharding that enables horizontal scaling. It allows the documents to be distributed across multiple servers and is performed on a per-collection basis. By default, the distribution is handled automatically by MongoDB. However, it is also possible to specify the so-called shard key MongoDB uses to determine the distribution for more fine-grained control; this allows optimizing the way the documents are distributed.

6 JavaScript Object Notation

5.3. Data Servers

The data daemon handles all access to item data on behalf of the clients that do not have direct access to the actual storage hardware. Like the JULEA library, it is implemented as a user space application to minimize portability issues. The data daemon is completely threaded and handles each connection in its own separate thread. This ensures that clients cannot block each other from proceeding and guarantees fast response times. Its source code can be found in the daemon directory and more specifically in daemon/daemon.c.

5.3.1. Storage Backends

The JULEA daemon uses so-called storage backends to abstract the underlying storage technologies. These backends can be easily exchanged and allow using existing technologies as well as fast prototyping of new approaches. For instance, storage backends can be adapted for a given computer system without having to modify the internals of the daemon. Additionally, they can be used to integrate new approaches such as object stores into the system. JULEA already includes numerous storage backends for different use cases that can be found in the daemon/backend directory:

• NULL (daemon/backend/null.c): This storage backend is intended for performance measurements of the overall I/O stack. It excludes the influence of underlying storage hardware by returning dummy information and discarding all incoming data.

• POSIX (daemon/backend/posix.c): This storage backend provides compatibility with existing POSIX file systems. Due to using a full-featured file system as the storage backend, certain functionalities – such as path lookup and permission checking – are duplicated within the I/O stack. It is intended as an interim solution until object stores with sufficient functionality are available.

• GIO (daemon/backend/gio.c): This storage backend uses the GIO library that provides a modern, easy-to-use VFS API supporting multiple backends including POSIX, FTP7 and SSH8. It is mainly intended as a proof of concept and allows experimenting with GIO’s more exotic backends.

7 File Transfer Protocol 8 Secure Shell


• ZFS (daemon/backend/jzfs.c): This storage backend uses ZFS9's data management unit (DMU) to provide a low-overhead data store. Since the underlying object store only provides the most essential I/O operations, no high-level file system functionality is duplicated.

• LEXOS (daemon/backend/lexos.c): This storage backend uses LEXOS10 to provide a light-weight data store [Sch13]. The underlying object store only provides basic I/O operations.

Object Stores

The ZFS storage backend uses the JZFS library that has been developed to provide user space access to the ZFS DMU; its source code can be found in the zfs directory. It provides a convenient object store interface and can handle ZFS pools, object sets and objects. However, ZFS's DMU interface is largely undocumented and is apparently not intended to be used from user space. Several patches are required to make it work from user space; it is therefore considered experimental and unstable.

While initial evaluations have been promising and have demonstrated good performance, more in-depth analysis has revealed problems regarding multi-threading that have not been solved yet. Because JULEA's data daemon uses multi-threading extensively, it has not been possible to use the ZFS storage backend. Consequently, it is mainly intended as a proof of concept and has been deprecated in favor of the LEXOS storage backend. Because LEXOS is still in an earlier stage of development, the POSIX storage backend remains the default, however.

5.3.2. Backend Interface

JULEA uses a modular approach for the storage backends: They are provided as so-called modules in the form of shared libraries that are loaded dynamically by the data daemon at runtime. JULEA defines a common backend interface that is implemented by all storage backends.

1 gboolean backend_init (gchar const* storage_path);
2 void backend_fini (void);
3
4 gpointer backend_thread_init (void);
5 void backend_thread_fini (gpointer data);
6
7 gboolean backend_create (JBackendItem* backend_item, gchar const* store, gchar const* collection, gchar const* item, gpointer data);

9 Zettabyte File System 10 Low-Level Extent-Based Object Store


8 gboolean backend_delete (JBackendItem* backend_item, gpointer data);
9
10 gboolean backend_open (JBackendItem* backend_item, gchar const* store, gchar const* collection, gchar const* item, gpointer data);
11 gboolean backend_close (JBackendItem* backend_item, gpointer data);
12
13 gboolean backend_status (JBackendItem* backend_item, JItemStatusFlags status_flags, gint64* modification_time, guint64* size, gpointer data);
14 gboolean backend_sync (JBackendItem* backend_item, gpointer data);
15
16 gboolean backend_read (JBackendItem* backend_item, gpointer buffer, guint64 length, guint64 offset, guint64* bytes_read, gpointer data);
17 gboolean backend_write (JBackendItem* backend_item, gconstpointer buffer, guint64 length, guint64 offset, guint64* bytes_written, gpointer data);

Listing 5.2: JULEA’s storage backend interface

Listing 5.2 on the facing page shows the generic storage backend interface that all storage backends have to implement; it can be found in daemon/backend/backend-internal.h. The interface is simple by design and only provides support for the most essential operations: creating, deleting, opening and closing an item, getting an item's status, syncing an item's data to the underlying storage, and reading from and writing to an item (lines 7–17). Additionally, there are operations to initialize and finalize the storage backend both globally and per thread (lines 1–5). All operations return error codes and can store additional state and information in the opaque JBackendItem structure. The per-thread initialization function can also return an opaque pointer that is passed to all file operations as their last parameter. It is important to note that these functions are not called directly by clients; instead, everything is handled transparently by the data daemon that receives high-level operations and calls the appropriate low-level storage backend functions. Because of this, storage backends can rely on their functions being called in a specified sequence:

• backend_init (once)
  – backend_thread_init (once per thread)
    ∗ backend_create or backend_open (once per operation)
    ∗ backend_status, backend_sync, backend_read and backend_write (multiple times and in arbitrary order)


    ∗ backend_close or backend_delete (once per operation)
  – backend_thread_fini (once per thread)

• backend_fini (once)

This guaranteed calling sequence makes it easy to build upon and use information from earlier function calls: The status, sync, read, write, close and delete functions can be sure that the item has been successfully opened or created before and do not have to handle different cases. Additionally, more elaborate functionality is relatively easy to implement: For instance, the POSIX storage backend uses both the global and per-thread initialization functions to implement a file descriptor cache in order to avoid opening the underlying files multiple times. The data daemon makes sure to pass the cache returned by the per-thread initialization function to all other functions.
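To make this more tangible, the following minimal sketch outlines a do-nothing backend module that relies on the guaranteed calling sequence. It is not part of JULEA's code base; the stand-in JBackendItem definition merely mirrors the path and user_data members used in Listing 5.3, whose real definition lives in JULEA's internal headers.

#include <glib.h>
#include <gmodule.h>

/* Stand-in for JULEA's internal structure; it only mirrors the members
 * that the listings in this section access (path and user_data). */
typedef struct
{
    gchar* path;
    gpointer user_data;
}
JBackendItem;

G_MODULE_EXPORT
gboolean
backend_init (gchar const* storage_path)
{
    /* Called exactly once when the daemon loads the module;
     * global state such as the storage path would be set up here. */
    (void)storage_path;

    return TRUE;
}

G_MODULE_EXPORT
gpointer
backend_thread_init (void)
{
    /* Called once per daemon thread; the returned pointer is handed
     * to every subsequent operation as its last parameter and could,
     * for example, hold a per-thread file descriptor cache. */
    return NULL;
}

G_MODULE_EXPORT
gboolean
backend_create (JBackendItem* backend_item, gchar const* store, gchar const* collection, gchar const* item, gpointer data)
{
    (void)data;

    /* Create/open is guaranteed to precede all other item operations,
     * so state stored here can be relied upon later. */
    backend_item->path = g_build_filename(store, collection, item, NULL);
    backend_item->user_data = NULL;

    return TRUE;
}

G_MODULE_EXPORT
gboolean
backend_close (JBackendItem* backend_item, gpointer data)
{
    (void)data;

    /* Close is guaranteed to be the last operation for an item. */
    g_free(backend_item->path);

    return TRUE;
}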

1 G_MODULE_EXPORT
2 gboolean
3 backend_status (JBackendItem* bf, JItemStatusFlags flags, gint64* modification_time, guint64* size, gpointer data)
4 {
5     gint fd = GPOINTER_TO_INT(bf->user_data);
6     gint ret = -1;
7
8     (void)data;
9
10     j_trace_enter(G_STRFUNC);
11
12     if (fd != -1)
13     {
14         struct stat buf;
15
16         j_trace_file_begin(bf->path, J_TRACE_FILE_STATUS);
17         ret = fstat(fd, &buf);
18         j_trace_file_end(bf->path, J_TRACE_FILE_STATUS, 0, 0);
19
20         if (flags & J_ITEM_STATUS_MODIFICATION_TIME)
21         {
22             *modification_time = buf.st_mtime * G_USEC_PER_SEC;
23
24 #ifdef HAVE_STMTIM_TVNSEC
25             *modification_time += buf.st_mtim.tv_nsec / 1000;
26 #endif
27         }


28
29         if (flags & J_ITEM_STATUS_SIZE)
30         {
31             *size = buf.st_size;
32         }
33     }
34
35     j_trace_leave(G_STRFUNC);
36
37     return (ret == 0);
38 }

Listing 5.3: JULEA’s POSIX storage backend

To demonstrate the usefulness of the storage backend interface, Listing 5.3 on the preceding page shows the status operation implemented by the POSIX storage backend as found in daemon/backend/posix.c. This function is a good example of how to use information returned by the underlying POSIX file system and fit it into JULEA’s metadata concept. The file descriptor that was previously opened by the open operation has been stored in the JBackendItem’s user_data member and can be used by all other operations (line 5). The per-thread data returned by the thread_init function is ignored (line 8). Before the actual work starts, JULEA’s tracing framework is used to trace the status function’s invocation (line 10); similarly, the function’s completion is traced after all work has been done (line 35). The POSIX storage backend uses the fstat function to obtain the underlying file’s metadata (line 17); fstat’s invocation is traced in more detail (lines 16 and 18). The status operation supports specifying exactly which parts of the metadata should be returned using the flags parameter (lines 20–32). Because POSIX’s stat functions always return all metadata, there is no significant advantage in this case; other backends could use this information to avoid performing unnecessary work, however. Finally, fstat’s return value is used to determine the status function’s return value (line 37).

1 G_MODULE_EXPORT
2 gboolean
3 backend_write (JBackendItem* bf, gconstpointer buffer, guint64 length, guint64 offset, guint64* bytes_written, gpointer data)
4 {
5     (void)buffer;
6     (void)data;
7
8     j_trace_enter(G_STRFUNC);
9


10     j_trace_file_begin(bf->path, J_TRACE_FILE_WRITE);
11     j_trace_file_end(bf->path, J_TRACE_FILE_WRITE, length, offset);
12
13     if (bytes_written != NULL)
14     {
15         *bytes_written = length;
16     }
17
18     j_trace_leave(G_STRFUNC);
19
20     return TRUE;
21 }

Listing 5.4: JULEA’s NULL storage backend

The NULL storage backend is very useful for analyzing the performance of JULEA's general architecture because the I/O operations are not actually performed. However, all operations are still recorded using the tracing framework. Listing 5.4 on the preceding page shows the write operation implemented by the NULL storage backend as found in daemon/backend/null.c. This particular function nicely demonstrates how different use cases can be covered using the storage backend interface. All function call and I/O activity is traced (lines 8, 10–11 and 18) while all other data and information is discarded (lines 5–6). To maintain compatibility with existing applications, the caller is told that all data has been written to storage successfully (lines 13–16 and 20).

5.4. Client Library

The JULEA library allows applications to use the native JULEA interface to perform I/O. All other JULEA components such as the FUSE file system and command line utilities use this library to interact with the servers. Its source code can be found in the lib directory; all headers are located in the include directory. A code example demonstrating the use of JULEA’s client library can be found in Appendix C.3 on pages 198–200.

5.4.1. Data Distributions

To allow analyzing the influence of different data distributions on overall performance and to facilitate future research in this direction, JULEA contains a generic distribution interface that allows implementing different data distributions with a relatively low implementation overhead. JULEA already provides a number of different data distributions that can be found in the lib/distribution directory:


• Round robin (lib/distribution/round-robin.c): The round robin distribution divides the data into equally sized blocks and distributes them in a round-robin fashion across all data servers. The starting server is picked randomly to distribute the load evenly; developers can also specify the starting server manually, however.

• Single server (lib/distribution/single-server.c): The single server distribution stores all data on a single data server that is chosen randomly to distribute the load; as with the round robin distribution, the server can be specified manually by developers if necessary. Data is still divided into equally sized blocks to enable locking on a block level.

• Weighted (lib/distribution/weighted.c): The weighted distribution divides the data into equally sized blocks and applies user-specified per-server weights to determine how much data each data server holds. The distribution always starts at the first data server but single servers can be excluded by setting their weight to 0.

All data distribution functions use a default block size of 4 mebibytes (MiB) that can easily be changed using the j_distribution_set_block_size function.
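As a usage sketch, selecting a distribution and adjusting its block size from client code might look as follows. Apart from j_distribution_set_block_size, which is mentioned above, the header name, constructor and enum value are assumptions that merely follow JULEA's j_/J_ naming conventions.

#include <julea.h> /* Hypothetical umbrella header for JULEA's client library. */

void
configure_distribution (void)
{
    JDistribution* distribution;

    /* Hypothetical constructor; the actual API may differ. */
    distribution = j_distribution_new(J_DISTRIBUTION_ROUND_ROBIN);

    /* Use 1 MiB blocks instead of the default 4 MiB. */
    j_distribution_set_block_size(distribution, 1 * 1024 * 1024);
}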

1 struct JDistributionVTable
2 {
3     gpointer (*distribution_new) (guint server_count);
4     void (*distribution_free) (gpointer distribution);
5
6     void (*distribution_set) (gpointer distribution, gchar const* key, guint64 value);
7     void (*distribution_set2) (gpointer distribution, gchar const* key, guint64 value1, guint64 value2);
8
9     void (*distribution_serialize) (gpointer distribution, bson* bson_object);
10     void (*distribution_deserialize) (gpointer distribution, bson const* bson_object);
11
12     void (*distribution_reset) (gpointer distribution, guint64 length, guint64 offset);
13     gboolean (*distribution_distribute) (gpointer distribution, guint* index, guint64* new_length, guint64* new_offset, guint64* block_id);
14 };
15


16 typedef struct JDistributionVTable JDistributionVTable;

Listing 5.5: Data distribution interface

Listing 5.5 on the previous page shows JULEA's distribution interface that can be found in lib/distribution/distribution.h. It provides functions to instantiate and free distribution objects (lines 3–4): The distribution_new function takes the number of data servers and returns a distribution object; the distribution_free function frees an existing distribution object. The distribution_set and distribution_set2 functions allow setting various distribution attributes such as the block size or the starting server; the only difference between the two functions is the number of arguments they accept. To store and restore the distribution information on JULEA's metadata servers, the distribution_serialize and distribution_deserialize functions serialize and deserialize the distribution information, respectively; the information is returned in the form of BSON11 objects that can be stored directly in MongoDB. The distribution_reset function allows initializing the distribution with a given length and offset. Finally, the distribution_distribute function actually calculates the data distribution by returning the data server's index, the block-local length and offset, and a unique block ID that is used for locking.

1 static
2 gboolean
3 distribution_distribute (gpointer data, guint* index, guint64* new_length, guint64* new_offset, guint64* block_id)
4 {
5     JDistributionRoundRobin* distribution = data;
6
7     gboolean ret = TRUE;
8     guint64 block;
9     guint64 displacement;
10     guint64 round;
11
12     j_trace_enter(G_STRFUNC);
13
14     if (distribution->length == 0)
15     {
16         ret = FALSE;
17         goto end;
18     }
19

11 Binary JavaScript Object Notation


20     block = distribution->offset / distribution->block_size;
21     round = block / distribution->server_count;
22     displacement = distribution->offset % distribution->block_size;
23
24     *index = (distribution->start_index + block) % distribution->server_count;
25     *new_length = MIN(distribution->length, distribution->block_size - displacement);
26     *new_offset = (round * distribution->block_size) + displacement;
27     *block_id = block;
28
29     distribution->length -= *new_length;
30     distribution->offset += *new_length;
31
32 end:
33     j_trace_leave(G_STRFUNC);
34
35     return ret;
36 }

Listing 5.6: Round robin distribution

To demonstrate how the data distribution interface enables easy prototyping of different data distribution functions, Listing 5.6 on the facing page shows the distribution_distribute function as found in lib/distribution/round-robin.c. As can be seen, the data distribution's execution is traced using JULEA's tracing framework (lines 12 and 33). The function splits up the item into equally sized blocks (lines 20–22). Afterwards, it determines which data server handles this particular block and calculates the data-server-local length and offset (lines 24–27). This process is repeated until no data is left to distribute (lines 14–18 and 29–30). The function returns TRUE if there is still data to distribute; otherwise, FALSE is returned (lines 7 and 35).
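For example, assuming the default block size of 4 MiB, three data servers and a start index of 0, an access of 6 MiB at offset 10 MiB is distributed in two steps: The first call computes block 2 (10 MiB divided by 4 MiB), so data server (0 + 2) mod 3 = 2 receives the remaining 2 MiB of that block at the server-local offset of 2 MiB; the second call computes block 3, so data server 0 receives the remaining 4 MiB at the server-local offset of 4 MiB (round 1 times the block size). A third call returns FALSE because no data is left to distribute.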

5.4.2. Metadata Serialization and Deserialization

MongoDB stores its documents in the so-called BSON format that has been designed to be lightweight and efficient. In contrast to JSON, BSON does not require string parsing, allowing it to be processed very quickly. BSON does not force specific schemas to be used and thus fits perfectly with MongoDB's schema-less design. JULEA stores different metadata for each object depending on the current semantics. The schema-less design of BSON allows easy serialization and deserialization of JULEA metadata.


1 {
2     "_id" : ObjectId("51c999896035000000000014"),
3     "Collection" : ObjectId("51c999896035000000000000"),
4     "Name" : "test-19",
5     "Credentials" : {
6         "User" : 1000,
7         "Group" : 1000
8     },
9     "Distribution" : {
10         "Type" : 1,
11         "BlockSize" : NumberLong(4194304),
12         "StartIndex" : 0
13     }
14 }

Listing 5.7: JSON representation of an item’s metadata using default semantics

Listing 5.7 shows the serialized metadata of an item created with the default semantics. Each collection and item is assigned a unique BSON ObjectId (lines 2–3) and name (line 4). Additionally, user and group credentials are stored to enable permission checking (lines 5–8). The item’s data distribution is also stored with the metadata (lines 9–13); each data distribution may have different parameters, however. In this example, the data distribution’s Type specifies that the round robin distribution is used; BlockSize is set to the default of 4 MiB and the StartIndex key indicates the data server that holds the first block of data.

1 {
2     "_id" : ObjectId("51caae667d1a000000000014"),
3     "Collection" : ObjectId("51caae667d1a000000000000"),
4     "Name" : "test-19",
5     "Status" : {
6         "Size" : NumberLong(0),
7         "ModificationTime" : NumberLong("1372237414990586")
8     },
9     "Credentials" : {
10         "User" : 1000,
11         "Group" : 1000
12     },
13     "Distribution" : {
14         "Type" : 2,
15         "BlockSize" : NumberLong(4194304),
16         "Index" : 0


17     }
18 }

Listing 5.8: JSON representation of an item’s metadata using custom semantics

In contrast, Listing 5.8 on the facing page shows the serialized metadata of an item using different concurrency semantics and a different data distribution. As can be seen, the serial concurrency semantics caused the item’s size and modification time to be stored on the metadata server as described in Section 3.5.2 on pages 71–73 (lines 5–8). In contrast to the previous example, the data distribution’s Type shows that the single server distribution is used. This distribution also does not have a StartIndex key but rather an Index key that specifies the server all of the data is stored on. As mentioned before, the item’s metadata is serialized into BSON format by the data distribution’s serialize function. Afterwards, the data distribution’s deserialize function can construct an item out of the BSON data returned by MongoDB. The Type key is handled by JULEA’s data distribution interface and is used to select the appropriate implementation for deserialization.

5.5. Miscellaneous

The JULEA framework contains a multitude of miscellaneous functionality that will be briefly described for completeness. This includes support for POSIX applications, tools for convenient use of the parallel distributed file system as well as tests to track functionality and performance regressions.

5.5.1. Tracing Framework

JULEA includes its own tracing framework to provide as much information as possible to developers and users; its implementation and headers can be found in lib/jtrace.c and include/jtrace-internal.h, respectively.12 This can be used to visualize the inner workings in a graphical way and can be very helpful when debugging errors or searching for performance issues. It supports tracing of functions, file operations and counters with precise timestamps. While function tracing only shows when and which functions have been entered and left, file tracing also records information about the actual file operation and its result: The traces include the operation's type (for example, open, close or delete) as well as the number of accessed bytes and the file offset for read and write operations. Counters allow collecting statistics such as the total amount of accessed data or the number of created files. Additionally, the tracing framework is fully thread-safe and supports multiple backends:

12 Even though the tracing framework is integrated into JULEA’s client library, it does not have any JULEA-specific dependencies and can be easily built as a standalone library for external use.


• Echo: The echo backend simply outputs the trace information to the standard error output stream (stderr). This allows easy debugging without the need for complex graphical tools.

• HDTrace: The hdtrace backend is based on the HDTrace tracing library developed within the research group [MMK+12]. It features a file format based on XML13 and a relatively simple interface that allows storing arbitrary parameters in the resulting XML file. HDTrace trace files can be visualized using the Sunshot visualization tool [LKK+07].

• OTF: The otf backend is based on the widely used OTF14 tracing library [KBB+06]. It makes use of a portable ASCII15 encoding and can merge multiple so-called streams into a single trace. Its interface is complex and does not easily allow storing arbitrary parameters. OTF trace files can be visualized using the Vampir visualization tool [GWT14].
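As a minimal sketch of how code is instrumented, the following purely illustrative helper reuses the tracing calls that also appear in Listings 5.3 and 5.4; the header name follows the file layout mentioned above.

#include <glib.h>

#include <jtrace-internal.h> /* Header name as described above. */

/* Illustrative function: trace its own invocation as well as a write
 * operation including the accessed length and offset. */
static void
traced_write (gchar const* path, gconstpointer buffer, guint64 length, guint64 offset)
{
    (void)buffer;

    j_trace_enter(G_STRFUNC);

    j_trace_file_begin(path, J_TRACE_FILE_WRITE);
    /* ... the actual write would happen here ... */
    j_trace_file_end(path, J_TRACE_FILE_WRITE, length, offset);

    j_trace_leave(G_STRFUNC);
}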

1 $ J_TRACE=echo,hdtrace ./application
2 $ J_TRACE=echo J_TRACE_FUNCTION=j_batch*,j_distribution* ./application

Listing 5.9: JULEA tracing framework

Listing 5.9 shows an example of how to use JULEA's tracing framework. Its behavior can be modified using environment variables: The J_TRACE variable allows enabling one or more tracing backends at the same time by simply giving the appropriate values separated by commas; in this case, the echo and hdtrace backends are activated (line 1). Additionally, it is possible to filter the tracing framework's output in order to reduce the trace's size: The J_TRACE_FUNCTION variable can be used to only include the listed functions in the resulting trace; in this example, all functions pertaining to JULEA's batches (j_batch*) and data distributions (j_distribution*) are traced while all other function calls are discarded (line 2).

Figure 5.2 shows exemplary traces of the client (top) and data daemon (bottom) activities that have been created using the OTF tracing backend and visualized using Vampir [GWT14]. The y-axis shows several so-called timelines containing the activities of separate threads. The timelines themselves are annotated with the performed functions; only long-lasting functions are shown by default, zooming in allows viewing shorter ones. In this example, the client is a benchmark application that first writes data and then reads it back; all I/O is performed using blocks of a size of 4 MiB. The benchmark uses 12 threads, each writing and reading 25 blocks; consequently, each thread accesses a total of 200 MiB comprising 100 MiB of written and 100 MiB of read data. The client library is configured to use a maximum of four connections to connect to the data daemon; consequently, the data daemon uses four threads to service the client's requests.

13 Extensible Markup Language 14 Open Trace Format 15 American Standard Code for Information Interchange

Figure 5.2.: Traces of the client and data daemon's activities

The top of Figure 5.2 shows the client's trace, which contains 13 threads in total. This is due to the fact that JULEA starts an internal thread for background operations that is used when the persistency semantics are modified. Thread 6 simply waits inside the j_operation_cache_thread function all the time because no background operations are taking place. All threads synchronize between the write and read phases; this can be seen at the 2.5 s time mark when the last thread finishes writing and all threads begin reading. All threads take another 0.5 s for reading and are finished after a total runtime of 3 s.

The bottom shows the data daemon's trace with five threads in total: four threads to service client requests (threads 2–5) and the data daemon's idle main thread (thread 1). As can be seen, not all threads finish at the same time: While threads 2 and 4 finish after 3 s when the client finishes reading, threads 3 and 5 take around 2 s longer to delete the files. This slowdown is likely due to the underlying file system.

5.5.2. POSIX Compatibility Layer

Looking at the number of different I/O interfaces in existence today, it is unrealistic to expect all existing applications to be ported to new I/O interfaces. For proprietary software that does not offer source code access and other special cases it might even be impossible to do so. Therefore, to keep compatibility with existing and widely used software, a POSIX compatibility layer is provided.

There are several possibilities to implement such a compatibility layer. For instance, the environment variable LD_PRELOAD instructs the dynamic linker to preload a specified library before all other shared libraries. This allows overwriting existing functions such as open, close, read and write. Using this mechanism, it would be possible to provide wrappers for the POSIX I/O functions that use the JULEA interface to perform the actual I/O. However, there are several problems regarding this approach: One has to be very careful when overwriting low-level I/O functions using the LD_PRELOAD approach because it not only wraps the function calls within the actual application but also all calls within other libraries and low-level functions.

Therefore, another approach has been used to realize JULEA's POSIX compatibility layer. It has been accomplished using the FUSE framework and can be found in the fuse directory. The FUSE framework provides a stable and easy-to-use interface to implement POSIX-compliant file systems in user space. It consists of a user space library, a kernel module and some auxiliary command line utilities. FUSE file systems run as ordinary applications in user space that are linked against the libfuse.so library. This library communicates with the FUSE kernel module that, in turn, relays I/O accesses done via the VFS to the user space file system. While this allows conveniently implementing POSIX-compliant file systems in user space, the additional indirection of I/O accesses has an impact on I/O performance [RG10, IMOT12]. Even though there have been recent improvements to FUSE, kernel file systems still offer higher performance in many cases [Duw14]. However, as the compatibility layer's main objective is to provide backwards compatibility instead of high performance, this is not an obstacle in this case. An important advantage of this approach is that FUSE file systems can be used by ordinary non-root users.

1 $ mkdir /tmp/julea-fuse
2 $ julea-fuse /tmp/julea-fuse
3 $ ls -l /tmp/julea-fuse
4 $ fusermount -u /tmp/julea-fuse
5 $ rmdir /tmp/julea-fuse

Listing 5.10: FUSE file system

Listing 5.10 shows how to use JULEA’s POSIX compatibility layer. First, the FUSE file system’s mount point is created (line 1). All FUSE file systems require an existing directory within the normal file system namespace to be used as a mount point. Afterwards, the actual FUSE file system – which is called julea-fuse – is mounted on top of the given directory (line 2). As soon as the FUSE file system is mounted, all accesses within the mount point /tmp/julea-fuse will be handled by JULEA’s FUSE file system and can be accessed by POSIX-compliant clients (line 3). When POSIX compatibility is no longer required, the FUSE file system has to be unmounted (line 4). This is accomplished using the fusermount command that is part of the FUSE software package. As the last step, the mount point is cleaned up (line 5).
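JULEA's actual FUSE file system is not reproduced here. As a minimal sketch of the mechanism, a FUSE 2.x file system that only implements getattr for its root directory could look as follows.

#define FUSE_USE_VERSION 26

#include <fuse.h>

#include <errno.h>
#include <string.h>
#include <sys/stat.h>

/* Minimal illustration only: report a single empty root directory. */
static int
example_getattr (const char* path, struct stat* stbuf)
{
    memset(stbuf, 0, sizeof(struct stat));

    if (strcmp(path, "/") == 0)
    {
        stbuf->st_mode = S_IFDIR | 0755;
        stbuf->st_nlink = 2;

        return 0;
    }

    return -ENOENT;
}

static struct fuse_operations example_operations = {
    .getattr = example_getattr
};

int
main (int argc, char** argv)
{
    /* fuse_main parses the mount point from argv and enters the FUSE loop. */
    return fuse_main(argc, argv, &example_operations, NULL);
}

Linking such a program against libfuse and starting it with a mount point as its argument is sufficient to obtain a mountable user space file system.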

5.5.3. Command Line Tools

Data management on supercomputers is typically performed using the command line. When using JULEA, it is impossible to use existing command line tools such as cp, mv, stat or even cat because these tools only support the POSIX interface. While it would be possible to use them on top of JULEA's POSIX compatibility layer, native command line tools are preferable for performance and reliability reasons. Therefore, special command line tools are provided to allow easy data management outside of full-blown applications. They have support for all basic operations such as creating, deleting, listing and getting the status of stores, collections and items. Additionally, items can be copied between collections.


1 $ julea-cli create-all julea://foo/bar/baz
2 $ julea-cli list julea://foo/bar
3 $ julea-cli status julea://foo/bar/baz
4 $ julea-cli delete julea://foo/bar/baz

Listing 5.11: JULEA command line tools

Listing 5.11 shows how JULEA’s command line tools can be used: All functionality is available through the julea-cli application that supports several different com- mands. First, the create-all command is used to create the foo store, bar collection and baz item (line 1); in contrast to the create command that only creates the last path component, all missing path components are created when using create-all. Afterwards, it is possible to use the list command to list the contents of stores and collections; in this case, the bar collection’s items are listed (line 2). The status com- mand returns all available metadata for collections and items; in this case, it lists the credentials, modification time and size of the newly created baz item (line 3). Finally, the delete command is used to delete the baz item (line 4); the bar collection and foo store are not deleted.

5.5.4. Correctness and Performance Tests

JULEA includes a wide range of tests and benchmarks that are used to periodically check its correctness and performance. Because providing efficient access to data is one of a file system's main features, it is not only necessary to provide unit and regression tests for correctness but also for performance. In addition to the possibility to execute these checks manually, JULEA includes functionality to trigger them automatically whenever a code change occurs. These automated checks have been realized using so-called hooks provided by the Git version control system (VCS) that is used for JULEA development [Fuc13]. The hooks allow performing fast correctness tests before each commit and more elaborate performance tests after each commit. All performance results are kept in a separate Git repository and linked to the commit that has been used to produce them. This can be used to effectively analyze JULEA's performance history and assess the influence of specific changes as it allows correlating performance changes with individual commits.

Figure 5.3 on the facing page demonstrates how JULEA's performance history can be visualized over time. It has been generated by running a benchmark application for a selected range of commits in JULEA's Git repository; the x-axis contains the times and IDs for all examined commits. Individual commits can be analyzed in more detail using the git show -p command by specifying the commit ID as its argument. Several observations can be made using the available performance data:


Figure 5.3.: Performance history over time (write and read throughput in MiB/s for a range of commits between 2014-01-27 and 2014-03-17, identified by commit time and ID)

1. Commit b67e65e on 2014-02-10 has significantly reduced both read and write performance. Examining the commit reveals that a bug was fixed that caused JULEA to open more TCP connections than the allowed maximum. This bug led to higher performance in this limited local benchmark; too many TCP connections can be detrimental to performance in large-scale scenarios, however.

2. Performance data is missing between commits b8dba7b and 27a2f46 on 2014-03-03. Consulting the raw data reveals that commit 77d5df8 on 2014-03-03 introduced a bug that crashed the benchmark and thus did not deliver performance data.16

3. Commit d6ea55e on 2014-03-04 introduced the use of TCP corking as described in Section 4.4 on pages 80–81. As can be seen, merging multiple TCP packets provides clear performance benefits in this case.

Making this kind of fine-grained information available can help in analyzing performance problems during the file system's development. The above example has only used a single benchmark, but the automatic hooks perform a multitude of benchmarks pertaining to different areas of the file system. Users upgrading from one version to another could use this information to find the reason for changes in performance regarding specific access patterns.

16 It is necessary to look at the raw data because gnuplot does not draw the x-axis entry for missing data.


Summary

This chapter has presented an in-depth description of JULEA’s technical design. The main points have been the general architecture as well as detailed specifications for the client library, the data servers and the metadata servers. Additionally, JULEA’s built-in tracing framework, POSIX compatibility layer, command line tools, and framework for automated correctness and performance tests have been explained. JULEA has been implemented completely in user space to make use of the comprehensive tool support for both analysis and debugging purposes.

Chapter 6. Performance Evaluation

In this chapter, the efficiency of the new I/O interface with dynamically adaptable semantics will be evaluated using synthetic benchmarks as well as real-world applications. While the synthetic benchmarks will be used to analyze the specific optimizations made possible by the file system’s additional knowledge about the applications’ I/O requirements, the real-world applications will be used to demonstrate the applicability for existing software.

Benchmarks will be used to evaluate different performance aspects of JULEA and other selected parallel distributed file systems. Specifically, data and metadata performance will be evaluated independently. Lustre and OrangeFS have been selected as representative parallel distributed file systems: While the former strives to support POSIX1 semantics, the latter is optimized for non-overlapping writes. In addition to comparing JULEA to the other parallel distributed file systems, a number of different semantics will be evaluated. However, due to the sheer number of different semantics combinations, only those expected to have a significant impact on performance will be analyzed in more detail. JULEA's data performance will be evaluated using different atomicity, concurrency and safety semantics; its metadata performance will be benchmarked using different concurrency and safety semantics. Additionally, the usefulness of batches will be analyzed.

6.1. Hardware and Software Environment

All evaluations have been conducted on the cluster of the Scientific Computing research group at the University of Hamburg. The benchmarks have been performed using a total of 20 nodes, with 10 nodes running the file system clients and 10 nodes hosting the file system servers. The nodes' hardware and software setup is as follows: The client nodes each have two Intel Xeon Westmere EP HC X5650 central processing units (CPUs) (2.66 GHz, 12 cores total), 12 gigabytes (GB) DDR3/PC1333 error-correcting code (ECC) random access memory (RAM), a 250 GB SATA2 Seagate Barracuda 7200.12 hard disk drive (HDD) and two Intel 82574L gigabit (Gbit) Ethernet

1 Portable Operating System Interface


network interface cards (NICs). They run Ubuntu 12.04.3 LTS with Linux 3.8.0-33-generic and Lustre 2.5.0 (client); the MPI2 implementation is provided by OpenMPI 1.6.5. The server nodes each have one Intel Xeon Sandy Bridge E-1275 CPU (3.4 GHz, 4 cores total), 16 GB DDR3/PC1333 ECC RAM, three 2 terabytes (TB) SATA2 Western Digital WD20EARS HDDs, one 160 GB SATA2 Intel 320 solid state drive (SSD) and two Intel 82579LM/82574L Gbit Ethernet NICs. They run CentOS 6.5 with Linux 2.6.32-358.18.1.el6_lustre.x86_64 and Lustre 2.5.0 (server).

6.1.1. Performance Considerations

To allow a proper assessment of the results, the following theoretical performance considerations should be kept in mind.

• Even though all client and server nodes are equipped with two NICs each, only one of them is used. OpenMPI transparently uses all found NICs whenever possible; however, since only insignificant amounts of data are transmitted via MPI in the following measurements, this is negligible.

• The theoretical maximum performance of Gbit Ethernet is 125 megabytes (MB)/s.3 However, it is usually not possible to reach more than 117 MB/s due to overhead. Consequently, the maximum achievable performance between the clients and servers is approximately 1,170 MB/s.4

• SATA2 has a transfer rate of 3 Gbit/s which translates to 300 MB/s due to 8b/10b encoding.5 Consequently, storage devices are able to deliver a maximum throughput of 300 MB/s. Because this is much higher than the maximum network transfer rate, this limitation can be ignored for the measurements.

• While the HDDs’ maximum throughput is 117 MiB/s for both reading and writing, the SSDs deliver up to 251 MiB/s when reading and 164 MiB/s when writing. Because these numbers are higher than the network throughput, they can also be ignored when determining the maximum performance.

• The average round-trip time (RTT) between the client and server nodes is 0.228 ms.6 Ignoring actual processing times, it is therefore possible to send and receive 4,386 requests/s.
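As a worked example of two of these figures (using only values stated above), the aggregate network limit over the ten client–server node pairs and the request rate implied by the RTT are:

\[
10 \cdot 117\,\text{MB/s} = 1{,}170\,\text{MB/s} \approx 1{,}115\,\text{MiB/s},
\qquad
\frac{1}{0.228\,\text{ms}} \approx 4{,}386\ \tfrac{\text{requests}}{\text{s}} .
\]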

2 Message Passing Interface 3 8 bits = 1 byte, consequently 1 Gbit = 1,000 megabits (Mbits) = 125 MB. 4 1,170 MB/s correspond to 1,115 mebibytes (MiB)/s, which is the unit that will be used in the following measurements. 5 An 8b/10b encoding requires 10 bits to transfer 8 bits of information and is commonly used for communication technologies. 6 The average RTT has been sampled using the ping command with at least 100 packets; the standard deviation was 0.019 ms.


6.2. Data Performance

The file systems’ data performance will be evaluated using a large number of concurrently accessing clients that first write data and then read it back again; the write and read phases are completely separated and barriers ensure that only one type of operation takes place at any given time. The benchmark uses MPI to start multiple processes accessing the file systems in a coordinated fashion. There are two basic modes of operation:

1. Individual files: Each process only accesses its own file or item.7 Even though all processes access the file system concurrently, each individual file is accessed serially because exactly one process has exclusive access to it.

2. Shared file: All processes access a single shared file. Consequently, the shared file will be accessed concurrently.

All accesses use a variable block size and are non-overlapping, that is, no write conflicts occur. The following block sizes have been used for the evaluation: 4 kibibytes (KiB), 16 KiB, 64 KiB, 256 KiB and 1,024 KiB. The processes repeatedly read or write data using the block size until each process has accessed 2 gibibytes (GiB) per phase; the number of iterations is denoted by m. This allows evaluating the file systems’ behavior with many small accesses as well as fewer large ones.
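To make the access pattern concrete, the following minimal sketch shows how the write phase of such a benchmark could be structured for the individual-file mode using MPI and the POSIX interface; the file names, flags and the omitted read phase are simplifications, and the sketch is not the actual benchmark code.

    /* Minimal sketch of the write phase in individual-file mode; it illustrates
     * the access pattern only and does not reproduce the actual benchmark. */
    #include <mpi.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main (int argc, char** argv)
    {
        unsigned long long const block_size = 4 * 1024;             /* e.g. 4 KiB */
        unsigned long long const total = 2ULL * 1024 * 1024 * 1024; /* 2 GiB per phase */
        unsigned long long const iterations = total / block_size;   /* m */

        char* buffer = malloc(block_size);
        char path[64];
        int fd;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        memset(buffer, 42, block_size);
        snprintf(path, sizeof(path), "benchmark-%d", rank); /* one file per process */
        fd = open(path, O_WRONLY | O_CREAT, 0600);

        for (unsigned long long i = 0; i < iterations; i++)
        {
            /* Sequential, non-overlapping accesses: block i goes to offset i * block_size. */
            pwrite(fd, buffer, block_size, i * block_size);
        }

        close(fd);
        /* A barrier separates the write phase from the subsequent read phase. */
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        free(buffer);

        return 0;
    }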


Figure 6.1.: Access pattern using individual files


Figure 6.2.: Access pattern using a single shared file

Figures 6.1 and 6.2 show the access patterns when using individual files and a shared file, respectively. Each rectangle represents one file and each column inside a rectangle denotes one data block. For each data block, its accessing process and iteration are given. The areas of the file that are accessed concurrently are enclosed in

7 For readability reasons, the rest of the chapter will only mention files when either files or items are considered.


double lines. When using individual files, each process possesses its own file that it accesses exclusively, as can be seen in Figure 6.1 on the preceding page. All accesses are done sequentially from the start of the file to its end. For the shared file, however, the accesses of all processes happen in an interleaved fashion, as can be seen in Figure 6.2 on the previous page.

To evaluate the file systems’ behavior with different numbers of accessing clients, the following n/p configurations (where n stands for the number of client nodes and p stands for the total number of client processes) have been used: 1/1, 1/2, 1/4, 1/8, 1/12, 2/24, 3/36, 4/48, 5/60, 6/72, 7/84, 8/96, 9/108 and 10/120. These numbers have been chosen because each of the client nodes has 12 cores; therefore, real applications would strive to use all of them to reach optimal performance. All parallel distributed file systems have been set up to provide ten data servers and one metadata server.

The benchmark supports several input/output (I/O) interfaces to allow comparing different parallel distributed file systems using their respective interfaces. Currently, POSIX, MPI-IO and JULEA are available. Each benchmark has been repeated at least three times to calculate the arithmetic mean as well as the standard deviation.

To force the clients to read the data from the data servers during the read phase, the clients’ cache was dropped after the write phase.8 The servers’ caches were dropped by completely restarting and remounting the file systems after each configuration; the server caches were not touched between the write and read phases, however. This represents a realistic use case because it is common for applications to write out results that are afterwards post-processed by different applications that do not have access to the cached contents. The servers, however, try their best to keep requested data in their caches.

This benchmark represents a very simple and common I/O pattern because all data is accessed sequentially, that is, lower offsets are accessed before higher ones. Consequently, reading can be sped up using readahead and data can be written in a streaming fashion without the need for random I/O.

6.2.1. Lustre

For the following measurements, Lustre has been set up using its default options except for the stripe count, which has been set to -1 to enable striping over all available object storage targets (OSTs); the stripe size has been set to 1 MiB. While each OST has been provided by one of the servers’ HDDs, the metadata target (MDT) has been provided by one of the SSDs.

8 For Lustre, the /proc/sys/vm/drop_caches file was used; for OrangeFS and JULEA, nothing was done because neither file system caches data on the clients by default.


POSIX

Lustre has been mounted using the client module as a normal POSIX file system with the flock option, which enables support for file locking. The option should not have any influence on the results because the benchmarks do not use file locking.


Figure 6.3.: Lustre: concurrent accesses to individual files via the POSIX interface

Individual Files Figure 6.3 shows Lustre’s read and write performance when using individual files via the POSIX interface. Regarding read performance, it is interesting to note that configurations with a single node exhibit different performance characteristics depending on the number of processes. While the configurations with one, eight and twelve processes all achieve a throughput of roughly 100 MiB/s, the configurations with two and four processes deliver 200–300 MiB/s; while this effect is presumably related to some data being read from the cache of the operating system (OS), the exact reasons for it are unclear. As explained earlier, the benchmark drops all caches between the write and read phases; therefore, this effect should not occur. The remaining configurations gradually deliver more performance as more nodes are added until reaching their maximum performance with ten nodes; the block sizes of 64 KiB, 256 KiB and 1,024 KiB all achieve a maximum of roughly 850 MiB/s. As expected, smaller block sizes result in lower read

performance due to additional overhead. However, it is interesting to note that even with a single process and a block size of 4 KiB, Lustre achieves a read performance of roughly 100 MiB/s. As mentioned in Section 6.1.1 on pages 112–113, the Gbit Ethernet network can transfer at most 4,386 requests/s. Taking this into account, Lustre should only be able to read at a maximum of 17 MiB/s. This discrepancy is due to Lustre performing client-side readahead to increase performance.

When considering write performance, it can be seen that all block sizes deliver the same performance. This is most probably due to Lustre’s use of client-side write caching. Because individual files are used and each file is only accessed by one node, Lustre can utilize caching without sacrificing POSIX compliance. Using the POSIX I/O interface and semantics, each access theoretically needs one network round trip to send the actual data to the data server and return its reply. Consequently, accesses cannot be pipelined because the write operations block until the reply has been received. Lustre seems to use a different approach in this case, which can be demonstrated using the following theoretical performance estimation: When assuming a maximum of 4,386 requests/s and using a block size of 4 KiB, this results in a maximum throughput of roughly 17 MiB/s per process. While the configurations using twelve client processes per node can overlap multiple write operations to achieve higher performance, the configuration using one node and one process should not be able to deliver more than the previously mentioned 17 MiB/s. Due to this and the fact that Lustre manages to deliver the same performance regardless of the chosen block size, it can be concluded that it does not actually send each request to the data servers and instead collects data in the local cache to aggregate accesses. This also implies that the number of bytes that has been written – as returned by the write function – does not originate from the data server but instead from the local cache.
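The estimate used above follows directly from the request rate derived in Section 6.1.1:

\[
4{,}386\ \tfrac{\text{requests}}{\text{s}} \cdot 4\,\text{KiB} \approx 17\,\text{MiB/s} .
\]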

Shared File Figure 6.4 on the next page shows Lustre’s read and write performance when using a single shared file via the POSIX interface. The read performance for the configurations using one node behaves in a similar way to the test case with individual files. When using more than two nodes, however, the results are distinctly different: For block sizes of 4 KiB, 16 KiB and 64 KiB not all results could be collected because Lustre’s performance was too low and the jobs exceeded the job scheduler’s time limit. For 256 KiB and 1,024 KiB, the performance increases until six and seven nodes, respectively. Afterwards, performance drops with each additional node. This result is surprising because only read operations are performed by all accessing clients, that is, no locking should be required. However, it appears that Lustre still introduces some overhead for these accesses, decreasing overall performance significantly. For the write phase, an interesting effect occurs: While using only a single node, performance is stable for all block sizes. As soon as the number of accessing nodes is larger than one, performance drops for all block sizes less than 1,024 KiB. This is likely



Figure 6.4.: Lustre: concurrent accesses to a shared file via the POSIX interface

due to the effect described in Section 4.2 on pages 77–78: As soon as multiple nodes are involved, Lustre has to send all write operations directly to the data server to achieve POSIX compliance. Using the same estimation as before, a block size of 1,024 KiB and 4,386 requests/s results in a theoretical maximum of 4.3 GiB/s which is in stark contrast to the actual maximum of 180 MiB/s when using five nodes. Consequently, additional factors have to be responsible for Lustre’s low performance in this case. One of them could be write locking that needs to be performed due to the concurrently accessing clients.

MPI-IO (Atomic Mode)

The following results demonstrate Lustre’s performance when accessed using the MPI-IO interface. Because it was not possible to compile ADIO9’s native Lustre backend, MPI-IO falls back to its generic POSIX backend. Because the results for both individual and shared files using non-atomic accesses are largely identical to their POSIX counterparts, they have been omitted.

9 Abstract-Device Interface for I/O
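For reference, atomic mode is enabled per file handle in MPI-IO. The following minimal sketch shows the relevant call; the file name and the single 4 KiB access are placeholders, and the sketch is not taken from the benchmark itself.

    #include <mpi.h>

    int main (int argc, char** argv)
    {
        MPI_File fh;
        char buffer[4096] = { 0 };
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Open a shared file and enable MPI-IO's atomic mode for all subsequent accesses. */
        MPI_File_open(MPI_COMM_WORLD, "benchmark-shared",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
        MPI_File_set_atomicity(fh, 1); /* 1 enables atomic mode, 0 disables it */

        /* Each process writes one non-overlapping 4 KiB block. */
        MPI_File_write_at(fh, (MPI_Offset) rank * sizeof(buffer), buffer,
                          sizeof(buffer), MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();

        return 0;
    }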



Figure 6.5.: Lustre: concurrent atomic accesses to individual files via the MPI-IO interface

Individual Files Figure 6.5 shows Lustre’s read and write performance when using individual files via the MPI-IO interface with atomic mode. Regarding read performance, the results are largely identical to the POSIX results for the larger block sizes of 1,024 KiB down to 64 KiB. For block sizes of 16 KiB and 4 KiB, there are significant drops in performance: Using ten nodes, throughput decreases from 800 MiB/s to 700 MiB/s and from 700 MiB/s to 300 MiB/s, respectively. Consequently, the overhead introduced by atomic mode can likely be neglected for block sizes equal to or larger than 64 KiB. One noteworthy exception is the performance when using ten nodes, which is slightly lower than with nine nodes for almost all block sizes. Since measurements have only been performed with a maximum of ten nodes, it is not possible to determine whether performance would continue to drop when using more than ten nodes or if this effect is limited to this specific configuration. Considering write performance, the results look similar to the ones using the POSIX interface: Performance is identical for the block sizes from 1,024 KiB down to 64 KiB and flattens out when using seven nodes or more; as in the read phase, performance actually decreases slightly when using more nodes. For the block sizes of 16 KiB and 4 KiB, performance drops significantly due to the introduced overhead.



Figure 6.6.: Lustre: concurrent atomic accesses to a shared file via the MPI-IO interface

Shared File Figure 6.6 illustrates Lustre’s read and write performance when using a single shared file via the MPI-IO interface with atomic mode. For both the read and write phases, performance is almost identical to that of their POSIX counterparts. Since performance was already poor in the POSIX case due to the overhead introduced by the shared file accesses, the additional overhead caused by atomic mode does not decrease performance further. One noteworthy exception is the read performance using a block size of 256 KiB: While performance in the POSIX case was identical to the one using a block size of 1,024 KiB until six nodes were used and then decreased, the overhead produced by MPI-IO’s atomic mode causes performance to drop already at six nodes.

6.2.2. OrangeFS

OrangeFS has been set up using its default configuration. Its storage space for both data and metadata has been provided by an ext4 file system located on the data servers’ system HDDs. Placing the metadata on an HDD should not have negatively influenced performance because the number of metadata operations can be neglected for the data benchmark.


MPI-IO

All benchmarks have been performed using the MPI-IO interface and ADIO’s native OrangeFS backend; since the backend does not support atomic mode, only non-atomic results are provided.


Figure 6.7.: OrangeFS: concurrent accesses to individual files via the MPI-IO interface

Individual Files Figure 6.7 displays OrangeFS’s read and write performance when using individual files via the MPI-IO interface. When considering read performance, it can be observed that larger block sizes do not necessarily result in higher performance as was the case with Lustre. Instead, performance increases until a block size of 64 KiB is reached and then drops again for larger ones; a block size of 1,024 KiB performs worse than 256 KiB. This is likely due to the fact that OrangeFS’s default stripe size is 64 KiB; it is unclear why larger block sizes are handled in such a suboptimal way, however. Apart from this inconsistency, performance increases steadily up to 600 MiB/s until six nodes are used; as soon as more nodes are used, performance drops to 200–300 MiB/s. As will be explained in more detail later, this is due to the underlying POSIX file system. Regarding write performance, larger block sizes result in higher overall performance as expected. That is, the performance inconsistency caused by the striping seems to be

limited to read operations. Performance improves to a maximum of 600 MiB/s with seven nodes and decreases slowly as more nodes are added. Again, this is due to the underlying POSIX file system.


Figure 6.8.: OrangeFS: concurrent accesses to a shared file via the MPI-IO interface

Shared File Figure 6.8 shows OrangeFS’s read and write performance when using a single shared file via the MPI-IO interface. During the read phase, performance looks largely similar to that of the individual case except for configurations using eight or more nodes. While the performance curve flattened out when using this number of nodes in the individual case, the shared case shows much more erratic performance behavior for all but the largest block size of 1,024 KiB, which remains relatively stable. For instance, when using a block size of 4 KiB, performance increases when going from seven to eight nodes, then decreases for nine nodes and finally increases again for ten nodes. However, overall performance using small block sizes is better than in the individual case: While the block size of 4 KiB achieved a maximum of roughly 280 MiB/s with the configuration using five nodes for individual files, it manages a maximum of 360 MiB/s with six nodes when using a shared file. This abnormal behavior is likely due to scheduling problems inside the underlying file system and will be explained in more detail later.


During the write phase, the performance curve again looks erratic except for the largest block size of 1,024 KiB. For example, for all but the smallest and the largest block sizes, performance abruptly increases for the configuration using five nodes and then decreases again for more nodes; when going from nine to ten nodes, it increases sharply again. Even though the largest block size of 1,024 KiB manages to deliver stable performance for all configurations, its performance is significantly lower than when using individual files: Instead of reaching roughly 600 MiB/s, performance is reduced to a maximum of 350 MiB/s when using a shared file; this corresponds to a performance drop of more than 40 %. Again, the unpredictable performance behavior is likely due to the underlying file system and will be analyzed later.

6.2.3. JULEA

JULEA has been configured to use the data daemon’s POSIX storage backend due to the experimental nature of the object store storage backends. Both the storage backend and MongoDB stored their data within an ext4 file system located on the data servers’ system HDDs. Analogous to OrangeFS, placing the metadata on an HDD should not have influenced performance negatively for the given benchmark. Additionally, JULEA was set to use a maximum of six client connections per node because it was observed that the default of twelve caused severe performance problems due to the large number of TCP10 connections.11

Default Semantics

The following measurements have been performed using JULEA’s default semantics to establish a performance baseline. The default semantics provide support for non-overlapping parallel accesses and do not cache data; for a detailed explanation, see Section 3.4.9 on pages 68–70. Missing values are due to the benchmarks exceeding the job scheduler’s time limit.

Individual Items Figure 6.9 on the facing page shows JULEA’s read and write performance when using individual items via the native JULEA interface. Regarding read performance, it is interesting to note that JULEA’s performance figure looks very similar to the OrangeFS counterpart. In contrast to OrangeFS, however, performance increases with growing block sizes. Using the block size of 1,024 KiB, performance increases until a maximum of approximately 700 MiB/s is reached with six nodes. Afterwards, performance drops drastically to about 250 MiB/s and continues to decrease as more nodes are used. This is likely due to an inefficiency inside the Linux kernel that is exposed when using a large number of parallel I/O streams and will be analyzed in more detail later.

10 Transmission Control Protocol 11 The default value for the maximum number of connections is determined based on the number of cores present in the system.



Figure 6.9.: JULEA: concurrent accesses to individual items

Regarding write performance, JULEA’s performance figure again looks similar to the OrangeFS counterpart except for a higher overall performance. While OrangeFS reaches a maximum performance of roughly 600 MiB/s using a block size of 1,024 KiB, JULEA manages to achieve 700 MiB/s. The performance with a block size of 4 KiB is especially noteworthy because JULEA’s maximum of roughly 400 MiB/s is almost double that of OrangeFS’s 200 MiB/s. Performance begins to decrease as soon as more than eight nodes are used, regardless of the block size. This is due to too many parallel I/O streams that cannot be handled efficiently anymore.

Shared Item Figure 6.10 on the next page shows JULEA’s read and write performance when using a single shared item via the native JULEA interface. During the read phase, the performance curve looks almost identical to its counterpart using individual items up to six nodes. Even though the same performance drop is present when using more than six nodes, its extent is less severe: Instead of dropping from roughly 700 MiB/s to 250 MiB/s, JULEA still manages to deliver 350 MiB/s when using a shared item. Additionally, performance improves slightly again when more than eight or nine nodes are used, depending on the block size.



Figure 6.10.: JULEA: concurrent accesses to a shared item

During the write phase, the performance is even more irregular than when reading. While performance increases until five nodes are used, it fluctuates when more nodes are used. Whereas performance remains relatively constant for five to eight nodes for the block sizes of 256 KiB and 1,024 KiB, there is a performance drop when using nine nodes, followed by a significant performance increase for ten nodes. This effect is similar to the one found when using OrangeFS. Overall, performance is lower than when using individual items, especially for the smaller block sizes. While the block size of 4 KiB reached a maximum of 400 MiB/s using individual items, it only achieves slightly more than 200 MiB/s when using a shared item; this corresponds to a performance drop of roughly 50 %. When comparing the results to their counterparts using individual items, the erratic behavior can only be explained by an inefficient handling of shared files by the Linux kernel. To analyze the performance problems further, additional measurements have been performed using varying numbers of connections per client, a different underlying POSIX file system and the NULL storage backend. Measurements using two and six connections per client have shown that these problems are present regardless of the number of connections; the results using two connections will not be presented because they are almost identical to those when using six connections. Additionally, XFS has been used for comparison purposes using three and six connections per client;

these measurements can be found in Appendix A.1 on pages 181–184 and show that the performance problems are independent of the underlying file system. To check whether the problem lies within JULEA’s implementation, measurements using the NULL storage backend will be presented in the following section.

NULL Storage Backend

The NULL storage backend allows analyzing JULEA’s architecture for performance bottlenecks by excluding the influence of the underlying POSIX file system or object store. JULEA’s behavior is not changed in any way except for the storage backend not actually accessing a storage device.


Figure 6.11.: JULEA: concurrent accesses to individual items using the NULL storage backend

Individual Items Figure 6.11 shows JULEA’s read and write performance when using individual items via the native JULEA interface using the NULL storage backend. During the read phase, performance is improved by larger block sizes, with 256 KiB and 1,024 KiB providing almost identical performance. Both block sizes almost reach the maximum possible performance and end up with 1,000 MiB/s when using ten nodes; the speedup decreases slightly when going from nine to ten nodes.


During the write phase, performance is very similar to the read phase; however, performance for the smallest block size of 4 KiB is higher while performance for block sizes of 16 KiB and 64 KiB is lower. Again, block sizes of 256 KiB and 1,024 KiB achieve almost the same throughput and are also close to the maximum possible performance; the speedup decreases significantly when going from nine to ten nodes.


Figure 6.12.: JULEA: concurrent accesses to a shared item using the NULL storage backend

Shared Item Figure 6.12 shows JULEA’s read and write performance when using a single shared item via the native JULEA interface using the NULL storage backend. During the read phase, performance is improved slightly across the board when compared with its individual counterpart. The only difference between the individual and shared cases is the number of accessed items, which leads to a different communication scheme between the clients and data servers:

• Individual: Because the data distribution’s starting server is chosen randomly, communication happens with all data servers at once in a random fashion as soon as enough clients start accessing them. Therefore, each client node will likely communicate with all data servers at once.


• Shared: Because all clients share the same item and thus the same starting server, communication is more uniform. Due to JULEA’s default stripe size of 4 MiB, consecutive clients are likely to communicate with the same data server. Consequently, each client node will likely communicate only with a small number of data servers at once. For example, when using a block size of 4 KiB, all clients only have to communicate with one or – less likely – two data servers in each iteration because only 480 KiB are accessed per iteration. Using a block size of 1,024 KiB, all twelve clients on a single node only have to communicate with at most four data servers.
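As a rough way to quantify the effect described in the last point: assuming a stripe size S, p client processes and a block size b accessed per process and iteration, the number of data servers contacted in one iteration of the shared case is approximately

\[
n_{\text{servers}} \approx \left\lceil \frac{p \cdot b}{S} \right\rceil ,
\qquad\text{for example}\qquad
\left\lceil \frac{120 \cdot 4\,\text{KiB}}{4\,\text{MiB}} \right\rceil = 1 ,
\]

plus at most one additional server whenever the accessed range crosses a stripe boundary.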

Even though the clients are not synchronized for each iteration, this communication pattern improves overall performance and eliminates the speedup’s slowdown that was present in the individual case when using nine and ten nodes. During the write phase, the same effect has a negative influence on overall performance, especially for block sizes of 16 KiB and 64 KiB. In conclusion, the following observations can be made about the underlying performance problems found using OrangeFS and JULEA:

1. The inefficiency is independent of the number of files because the same behavior occurs regardless of whether individual items or a shared item are used. Using the POSIX storage backend, each item results in one file being created on each data server that holds data of this item.

2. The number of open file descriptors seems to be irrelevant as the POSIX storage backend only keeps one open file descriptor for each individual file to avoid running out of file descriptors.12 Consequently, only one file descriptor is used in the shared case.

3. The problem is also independent of the number of I/O threads as it also occurs with two, three and six connections per client; this number directly translates to two, three and six I/O threads per client node within the data servers.

4. The underlying file system has no effect on this problem as it occurs with at least XFS and ext4. This makes it likely that it is a fundamental problem inside the Linux kernel and not a problem restricted to one specific file system.

Additional specialized analyses are necessary to be able to pinpoint the exact reason for this performance anomaly.

12 This is necessary because the number of open file descriptors is usually limited to 1,024 per process. It is possible for users to raise this soft limit to the hard limit of 4,096 using the ulimit command.


Default Semantics (Reduced Number of Clients)

The only way to mitigate the performance problem found when using both OrangeFS and JULEA as well as a large number of concurrently accessing clients is to reduce the number of clients. Consequently, to make sure that the results are not influenced by this underlying performance problem and to be able to demonstrate JULEA’s different semantics, the remaining performance measurements have been performed with a reduced number of clients.


Figure 6.13.: JULEA: concurrent accesses to individual items

Individual Items Figure 6.13 shows JULEA’s read and write performance when using individual items via the native JULEA interface. Regarding read performance, it can be seen that the scaling is much improved when compared to the measurements using twelve clients per node. Instead of the steep performance drop when using more than six nodes, the configurations provide almost linear scaling until seven to eight nodes are used. Afterwards, the speedup slows down, reaching a maximum of more than 900 MiB/s using a block size of 1,024 KiB. As expected, smaller block sizes provide a lower overall performance with the exception of 16 KiB and 64 KiB, which are reversed. It is also interesting to note that the block size of 4 KiB is

the only one to suffer from the reduction of clients; its performance is roughly halved when compared to twelve clients. Regarding write performance, the same effects as in the read case can be observed. While the reduced number of clients per node provides more stable performance results, it does not actually improve performance in this case. However, it is noteworthy that even though the performance does not increase with more than seven clients, it remains at a stable level in contrast to its counterpart using twelve clients.


Figure 6.14.: JULEA: concurrent accesses to a shared item

Shared Item Figure 6.14 shows JULEA’s read and write performance when using a single shared item via the native JULEA interface. During the read phase, the performance curve looks almost identical to its counterpart using individual items when using large block sizes and less than ten nodes. While the performance speedup slowed slightly when going from nine to ten nodes using individual items, the shared item case is not affected by this drop and reaches a maximum of more than 1,000 MiB/s. Additionally, the block size of 16 KiB provides a more stable performance curve. It is interesting to note that the block size of 16 KiB consistently provides better performance than the block size of 64 KiB; the reason for this is unclear and has to be looked into further.


During the write phase, the performance curve looks less smooth than when using individual items. For instance, using the largest block size of 1,024 KiB, performance drops when increasing the number of nodes from five to six, only to rise again when using seven nodes. Overall, performance is more stable than when using twelve clients per node, however. The fact that overall performance is lower than when using individual items and roughly on the same level as when using twelve clients indicates that the handling of shared files is suboptimal in the Linux kernel. As demonstrated using the NULL storage backend, these performance inconsistencies only occur if the underlying file system is actually accessed using shared files. To reduce the number of results and exclude the influences of the performance inconsistencies when using a single shared file, the following measurements have only been performed using individual items.

Batch Operations

The following measurements have been performed using JULEA’s batch support. To limit the batch size to a reasonable amount, at most 1,000 operations have been grouped together into a batch.
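To illustrate how such a batch might be assembled, the following sketch groups 1,000 write operations into a single batch before executing it. The function and type names (JBatch, j_batch_new, j_item_write, j_batch_execute and so on) as well as the previously created item are illustrative assumptions and do not necessarily match the actual interface.

    /* Hypothetical sketch: group 1,000 write operations into one batch.
     * The names are illustrative; item is assumed to have been created beforehand. */
    JSemantics* semantics = j_semantics_new (J_SEMANTICS_TEMPLATE_DEFAULT);
    JBatch* batch = j_batch_new (semantics);
    char buffer[4096] = { 0 };
    guint64 bytes_written = 0;

    for (guint i = 0; i < 1000; i++)
    {
        /* Queue a non-overlapping 4 KiB write; nothing is sent to the servers yet. */
        j_item_write (item, buffer, sizeof (buffer), i * sizeof (buffer), &bytes_written, batch);
    }

    /* All queued operations are transferred and executed together. */
    j_batch_execute (batch);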


Figure 6.15.: JULEA: concurrent batch accesses to individual items


Individual Items Figure 6.15 on the facing page shows JULEA’s read and write performance when using individual items via the native JULEA interface. Regarding read performance, it can be seen that batches provide improved performance especially for smaller block sizes: While the block size of 4 KiB achieved a maximum of 350 MiB/s using individual operations, batches boost this number to almost 600 MiB/s; this corresponds to an increase of 65 %. For the larger block sizes, this effect is not as pronounced but it is interesting to note that the block sizes of 64 KiB, 256 KiB and 1,024 KiB all reach the same performance. The only exception occurs when using nine or ten nodes, where the speedup for the two largest block sizes starts to slow down. The block size of 64 KiB continues scaling and reaches a maximum of 1,000 MiB/s, however. The results can be explained as follows, based on the used block size:

• 16 KiB: A batch of 1,000 operations bundles read operations of 16,000 KiB, that is, 15.63 MiB. Due to the default stripe size of 4 MiB, each client contacts four servers. This does not introduce enough parallelism to reach maximum performance.

• 64 KiB: The batch reaches a size of 64,000 KiB, that is, 62.5 MiB. This implies that each client reads data from all ten data servers in parallel.

• 256 KiB and 1,024 KiB: The batches are sized 250 MiB and 1,000 MiB, respectively. These huge batches reduce performance because they exclusively lock the connections for too long.

Consequently, the results indicate that it might prove beneficial to limit the size of batches internally to improve parallelism.

Regarding write performance, it can be observed that batches provide significant performance boosts especially for small numbers of client processes: A single process already reaches a performance of more than 90 MiB/s even for the smallest block size of 4 KiB. Overall, batches deliver a mixed picture regarding their impact on performance. On the one hand, they reduce performance for the largest block size of 1,024 KiB: While individual operations achieved roughly 650 MiB/s, batches only deliver 550 MiB/s. On the other hand, batches deliver significant improvements for the smallest block size of 4 KiB: Individual operations delivered a maximum of roughly 290 MiB/s, while batches manage to achieve more than 350 MiB/s. Additionally, the performance maximum is reached using a smaller number of nodes.

Overall, there is still room for improvement. Even though the data server handles batches more efficiently by merging multiple write operations, the clients do not perform such optimizations yet. Batching 1,000 operations with even a small block size of 4 KiB should be able to deliver at least the same performance as individual operations using a block size of 1,024 KiB.


Safety Semantics

The following measurements have used the safety semantics to disable write acknowledgments for all write operations. A detailed description of the safety semantics can be found in Section 3.4.6 on page 65.
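A minimal sketch of how a batch could be given such relaxed safety semantics is shown below; as before, the names (j_semantics_set, J_SEMANTICS_SAFETY and so on) are illustrative assumptions rather than the verbatim interface.

    /* Hypothetical sketch: disable write acknowledgments via the safety semantics. */
    JSemantics* semantics = j_semantics_new (J_SEMANTICS_TEMPLATE_DEFAULT);
    j_semantics_set (semantics, J_SEMANTICS_SAFETY, J_SEMANTICS_SAFETY_NONE);

    /* Writes queued in a batch using these semantics return without awaiting a reply. */
    JBatch* batch = j_batch_new (semantics);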


Figure 6.16.: JULEA: concurrent accesses to individual items using unsafe safety semantics

Individual Items Figure 6.16 shows JULEA’s read and write performance when using individual items via the native JULEA interface. During the read phase, there are only minor differences in performance in comparison to the default semantics (see Section 6.2.3 on pages 128–130). This is to be expected because the read operations are not handled differently depending on the safety semantics. During the write phase, performance is improved across the board for all block sizes. It is especially interesting to note that even a single process achieves the maximum performance of 110 MiB/s using a block size of 4 KiB because the clients do not have to wait for the write acknowledgments from the data servers. Using a block size of 4 KiB, the maximum performance is increased from less than 300 MiB/s to roughly 400 MiB/s when using ten nodes; this corresponds to an improvement of 33 %.


The largest block size of 1,024 KiB manages to achieve a maximum performance of approximately 800 MiB/s, an improvement of 23 % when compared to the maximum of 650 MiB/s delivered by the default semantics.

Atomicity Semantics

The following measurements have used the atomicity semantics to enforce atomic access for each read and write operation. For a detailed explanation of the atomicity semantics, see Section 3.4.1 on pages 61–62. JULEA currently implements atomicity using a centralized locking algorithm. As explained in Section 5.4.1 on pages 98–101, JULEA’s data distributions split up items into blocks of equal size. Locking is then performed on a per-block basis by inserting and removing documents from a MongoDB collection. Each lock operation requires one insert operation and each unlock operation needs one remove operation.
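The following sketch illustrates this kind of locking scheme using the current MongoDB C driver (libmongoc); the collection layout, the field names and the assumed unique index on (item, block) are illustrative and do not necessarily correspond to JULEA’s actual implementation or driver version.

    #include <mongoc/mongoc.h>
    #include <stdbool.h>

    /* Illustrative per-block locking via MongoDB documents; a unique index on
     * (item, block) is assumed so that a second insert for the same block fails. */
    static bool
    lock_block (mongoc_collection_t* locks, const char* item, int64_t block)
    {
        bson_t* document = BCON_NEW ("item", BCON_UTF8 (item), "block", BCON_INT64 (block));
        bson_error_t error;
        /* The insert succeeds only if no other client currently holds the lock. */
        bool acquired = mongoc_collection_insert_one (locks, document, NULL, NULL, &error);

        bson_destroy (document);
        return acquired;
    }

    static void
    unlock_block (mongoc_collection_t* locks, const char* item, int64_t block)
    {
        bson_t* selector = BCON_NEW ("item", BCON_UTF8 (item), "block", BCON_INT64 (block));

        /* Removing the document releases the lock for other clients. */
        mongoc_collection_delete_one (locks, selector, NULL, NULL, NULL);
        bson_destroy (selector);
    }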


Figure 6.17.: JULEA: concurrent accesses to individual items using per-operation atomicity semantics

Individual Items Figure 6.17 shows JULEA’s read and write performance when using individual items via the native JULEA interface.


Regarding read performance, it is interesting to note that different block sizes show different scaling behavior: While the block sizes of 4 KiB and 16 KiB quickly reach a maximum and stay at this level, the remaining block sizes deliver more performance as more nodes are used. This behavior can be explained using a rough performance estimation: As will be presented in Section 6.3, MongoDB manages to deliver roughly 20,000 inserts/s and 6,000 removes/s. Taking into account that each read or write operation requires one insert and one remove operation, a maximum of 13,000 operations/s can be performed.13 This implies a maximum performance of roughly 50 MiB/s for a block size of 4 KiB and 200 MiB/s for a block size of 16 KiB. According to the measurements, 42 MiB/s and 170 MiB/s are reached for block sizes of 4 KiB and 16 KiB, respectively. Because a block size of 64 KiB can already support up to 800 MiB/s according to this approximation, the remaining block sizes’ performance scales with the number of nodes. Interestingly, the largest block size of 1,024 KiB almost reaches the same performance as when using the default semantics: While the default semantics manage to deliver slightly more than 900 MiB/s, the atomicity semantics achieve a maximum of 880 MiB/s. For smaller block sizes, the slowdown is more severe, however. The maximum performance using a block size of 64 KiB drops from roughly 740 MiB/s to 530 MiB/s; this corresponds to a decrease of almost 30 %.

Regarding write performance, the small block sizes manage to deliver almost the same performance as during the read phase. While the block size of 4 KiB reaches a maximum of 40 MiB/s, the block size of 16 KiB is limited to 140 MiB/s. The remaining block sizes perform much worse, however. This is due to the lower write performance that is already present when using the default semantics. Whereas the maximum performance of roughly 650 MiB/s is reached when using seven or more nodes with the default semantics, the atomicity semantics achieve a maximum of slightly less than 400 MiB/s when using six or more nodes. This corresponds to a performance degradation of almost 40 % even when using the largest block size of 1,024 KiB.
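The per-block-size limits used in this estimation simply multiply the assumed 13,000 operations/s with the respective block size:

\[
13{,}000\ \tfrac{\text{ops}}{\text{s}} \cdot 4\,\text{KiB} \approx 50\,\text{MiB/s},
\qquad
13{,}000\ \tfrac{\text{ops}}{\text{s}} \cdot 16\,\text{KiB} \approx 200\,\text{MiB/s},
\qquad
13{,}000\ \tfrac{\text{ops}}{\text{s}} \cdot 64\,\text{KiB} \approx 800\,\text{MiB/s}.
\]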

6.2.4. Discussion

The results demonstrate that the current state of parallel distributed file systems is mixed and that performance can be very hard to predict and understand. Even simple access patterns such as the ones used for the presented benchmarks do not achieve the maximum performance. This is true for all tested file systems but has different reasons for each of them.

Lustre deals well with a large number of concurrent clients. This is most likely because Lustre can easily use the OS’s file system cache due to being implemented in kernel space. This allows Lustre to aggregate accesses and thus reduce the load on the servers. However, Lustre’s performance is abysmal when accessing a single

13 This number is only intended to provide a rough estimate. In practice, the number might be lower due to the high discrepancy between insert and remove performance.

shared file as commonly done in scientific applications: Read performance decreases with more than seven client nodes and write performance does not scale beyond one client node. Consequently, only individual files are efficiently usable because it is not possible to inform Lustre about the application’s I/O requirements to mitigate these performance problems.

OrangeFS handles shared files much better but its overall performance is held back by problems found within the underlying OS and file systems. In contrast to Lustre, it is not possible to use OrangeFS for I/O patterns requiring correct handling of overlapping writes.

While JULEA suffers from the same problems as OrangeFS when using an underlying POSIX file system, its NULL storage backend demonstrates that the overall architecture is able to handle high throughputs. Additionally, its different semantics allow it to adapt to a wide range of I/O requirements:

• Its default semantics enable performance results similar to those of Lustre when using large block sizes. Lustre has advantages for small block sizes due to its client-side caching and readahead functionalities. However, these advantages vanish as soon as shared files are used.

• JULEA’s batches can be used to improve throughput for small block sizes by reducing the number of network messages and round trips. However, there is still potential to improve their use for large block sizes.

• The safety semantics can be used to reduce the network overhead by not awaiting the data servers’ replies. This is similar to Lustre’s default behavior when using individual files.

• Atomic operations can be achieved by using the atomicity semantics. While the performance of large read operations is not reduced significantly, write operations suffer a performance penalty of up to 40 %. However, using JULEA’s fine-grained semantics, it is possible to use atomic operations only when absolutely necessary.

In contrast to Lustre and OrangeFS, JULEA can be adapted to different applications by setting its semantics appropriately. While it is neither possible to improve Lustre’s shared file performance due to its POSIX compliance nor to use OrangeFS for workloads requiring overlapping writes, it is possible for JULEA to support and to be tuned for these specific use cases.

6.3. Metadata Performance

Due to the growing number of clients accessing parallel distributed file systems concurrently, metadata performance plays an increasingly important role for overall


file system performance. Therefore, the following measurements are meant to provide an overview of the current state of metadata performance and to highlight possibilities of exploiting semantical information to improve it. The file systems’ metadata performance will be evaluated using a large number of concurrently accessing clients that perform a variety of metadata operations: First, a number of files is created. Afterwards, the files are opened and their status is retrieved. Finally, all files are deleted again. The benchmark uses MPI to start and coordinate multiple processes accessing the file systems. There are two basic modes of operation:

1. Individual directories: Each process only accesses its own directory or store.14 Even though all processes access the file system concurrently, the individual directories are accessed serially because only one process has exclusive access.

2. Shared directory: All processes access a single shared directory. Consequently, the shared directory will be accessed concurrently.

To evaluate the file systems’ behavior with different numbers of accessing clients, the following n/p configurations (where n stands for the number of client nodes and p stands for the total number of client processes) have been used: 1/1, 1/2, 1/4, 1/8, 1/12, 2/24, 3/36, 4/48, 5/60, 6/72, 7/84, 8/96, 9/108 and 10/120; these are the same configurations as used for the evaluation in Section 6.2. All parallel distributed file systems have been set up to provide ten data servers and one metadata server.

The benchmark supports several I/O interfaces to support the comparison of different parallel distributed file systems using the respective interfaces. Currently, POSIX and JULEA are available. MPI-IO has not been included due to its inability to query more metadata than just the file size. Additionally, OrangeFS has been excluded due to its metadata performance problems: Previous results have shown that OrangeFS has been unable to deliver more than approximately 100 operations/s for all metadata operations that perform write accesses [Kuh13]. Each benchmark has been repeated at least five times to calculate the arithmetic mean as well as the standard deviation.

To force the clients to contact the metadata servers, the clients’ cache was dropped after the write phase.15 The servers’ caches have been dropped by completely restarting and remounting the file systems after each configuration; the server caches have not been touched between the different phases, however.

14 For readability reasons, the rest of the chapter will only mention directories when either directories or stores are considered. 15 For Lustre, the /proc/sys/vm/drop_caches file has been used; for JULEA, nothing has been done because no metadata has been cached on the clients.
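Expressed in terms of the POSIX interface, the measured metadata operations roughly correspond to the calls in the following sketch; the directory name and the file count of 1,000 are made up for illustration and the MPI coordination is omitted.

    /* Minimal sketch of the measured metadata operations in POSIX terms. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main (void)
    {
        char path[64];
        struct stat buf;

        mkdir("benchmark-dir", 0700);

        /* Phase 1: create the files. */
        for (int i = 0; i < 1000; i++)
        {
            snprintf(path, sizeof(path), "benchmark-dir/file-%d", i);
            close(open(path, O_WRONLY | O_CREAT | O_EXCL, 0600));
        }

        /* Phase 2: open the files and retrieve their status. */
        for (int i = 0; i < 1000; i++)
        {
            snprintf(path, sizeof(path), "benchmark-dir/file-%d", i);
            close(open(path, O_RDONLY));
            stat(path, &buf);
        }

        /* Phase 3: delete the files again. */
        for (int i = 0; i < 1000; i++)
        {
            snprintf(path, sizeof(path), "benchmark-dir/file-%d", i);
            unlink(path);
        }

        return 0;
    }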


6.3.1. Lustre

Lustre has been set up using its default options and the ldiskfs backend. The MDT has been provided by one of the SSDs, while each OST has been provided by one of the servers’ HDDs.


Figure 6.18.: Lustre: concurrent metadata operations to individual directories via the POSIX interface

Individual Directories Figure 6.18 shows Lustre’s metadata performance when using individual directories via the POSIX interface. As can be seen, the create performance increases together with the growing number of client nodes until it reaches its maximum of roughly 17,000 operations/s with ten nodes. The delete performance already reaches its maximum of 6,500 operations/s with five nodes and remains relatively constant even with more nodes. The performance of the open operation already reaches its maximum of 3,000 operations/s with three nodes and stays steady until five nodes are used; as soon as more nodes are used, it drops until it reaches a low point of 800 operations/s with ten nodes. It is unclear why the open operation performs so badly; the behavior might be due to atime updates as explained in Section 2.6.1 on pages 42–43 but more investigation regarding the underlying cause is necessary. As can be expected, the stat operation delivers high performance because it does not require any write operations; it reaches its maximum of almost 40,000 operations/s with seven nodes and sharply drops to 24,000 operations/s with nine nodes. When using ten nodes, performance increases again, which may be partly due to measurement inaccuracies because the standard

deviation is extremely large when using more than six nodes. These varying results hint at congestion inside Lustre’s metadata server (MDS).


Figure 6.19.: Lustre: concurrent metadata operations to a shared directory via the POSIX interface

Shared Directory Figure 6.19 shows Lustre’s metadata performance when using a shared directory via the POSIX interface. Regarding create performance, the performance curve looks similar to the individual case except for a lower overall performance; the maximum performance with ten nodes is roughly 13,500 operations/s and thus about 20 % lower than when using individual directories. The delete operation is 30 % slower than its individual counterpart. In contrast to the individual case, the open operation’s performance stays constant at roughly 800 operations/s regardless of the number of nodes. The stat operation’s performance curve shows a similar form as previously with a sharp drop at nine nodes and an increase with ten nodes; however, the maximum performance is reduced by more than 60 % when compared to the individual case. The reduced performance across all metadata operations can be explained by the fact that all clients access the same shared directory which makes additional locking necessary.

6.3.2. JULEA

JULEA has been configured to use a maximum of six connections per node and to utilize the data daemon’s POSIX storage backend. MongoDB has stored its data within an ext4 file system on one of the servers’ SSDs, while the storage backend has used an ext4 file system located on the servers’ system HDDs.


Default Semantics

The following measurements have been performed using JULEA’s default semantics to establish a performance baseline; for a detailed explanation of them, see Section 3.4.9 on pages 68–70.

Shared Collection JULEA uses separate MongoDB databases for each of its stores; all collections and items within a store are saved within the corresponding database. As MongoDB performs locking on a per-database basis, using individual collections within the same store results in the same performance as using a shared one; this is different from traditional file systems where locking is usually performed on a per-directory basis. Due to this, the benchmarks using a shared collection are presented first and used as a baseline. Because developers are unlikely to use multiple stores simultaneously, this represents a more common use case.


Figure 6.20.: JULEA: concurrent metadata operations to a shared collection

Individual Operations Figure 6.20 shows JULEA’s metadata performance when using a shared collection. Regarding create performance, it is interesting to note that performance increases when using more clients even on a single node. Using twelve clients yields the maximum performance of 22,000 operations/s; the performance increase from eight to twelve clients is negligible, which is to be expected since JULEA has been configured to use a maximum of six connections per node. As soon as a second node is used, performance drops to roughly 19,000 operations/s and continues to do so as

more nodes are used, reaching 16,500 operations/s with ten nodes. Delete performance decreases slightly as more nodes are used; while JULEA achieves roughly 7,000 operations/s when using one or two nodes, performance decreases to slightly less than 6,000 operations/s for the ten-node configuration. The low performance is due to a combination of two factors:

1. Write operations in MongoDB are inherently slower than read operations because the indexes have to be updated and the changed data has to be synchronized to stable storage. Even though JULEA does not wait for the synchronization by default, the updates still result in lower performance.

2. Delete operations do not only have to contact the metadata servers but also all data servers that contain data of the item that is to be deleted. Even though the data servers are contacted in parallel, their reply has to be awaited by default, slowing down this part of the operation.

Open performs very well because JULEA sets up MongoDB indexes for fast lookups. The open operation reaches its maximum performance of 90,000 operations/s with six clients and stays constant for more clients. This presents a stark contrast to Lustre, where the open operation is the slowest metadata operation. It is especially important for JULEA to have a high open performance because the other metadata operations (that is, delete and stat) require opening the corresponding item first. The stat operation’s performance curve looks similar to that of the open operation but the overall performance is considerably lower. It reaches its maximum performance of 65,000 operations/s with seven nodes and increases only slightly for more clients. The lower performance is caused by the fact that the item’s metadata has to be fetched from the data servers by default. Again, all data servers are contacted in parallel to maximize throughput but this additional step decreases overall performance.
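A minimal pymongo sketch of such an index-backed lookup is shown below; the field names and the collection layout are assumptions for illustration and not JULEA’s actual schema:

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://metadata-server:27017")  # hypothetical host
items = client["julea_store_example"]["items"]           # hypothetical names

# A unique compound index over the fields identifying an item turns the
# open-time lookup into an index lookup instead of a collection scan.
items.create_index([("collection", ASCENDING), ("name", ASCENDING)], unique=True)

# The lookup performed when opening an item is then served by the index.
document = items.find_one({"collection": "results", "name": "item-42"})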

Batch Operations Figure 6.21 on the next page shows JULEA’s batch metadata performance when using a shared collection. The performance of the stat operation is almost identical to that of its counterpart using individual operations due to the fact that it is currently not handled differently even if executed in batch mode. Even though the delete and open operations also do not perform optimizations when batched, their throughput is improved by the reduced overhead. The delete operation increases its maximum throughput to almost 13,000 operations/s, which equals an improvement of more than 110 % when compared to individual operations. Open reaches its maximum performance of 120,000 operations/s when using six nodes and stays constant when using up to ten nodes. Consequently, the open operation is sped up by 33 % when batching operations. The largest performance gain can be witnessed for the create operation. While its maximum performance was slightly less than 10,000 operations/s when using individual operations, batch operations boost this number to roughly 160,000 operations/s.



Figure 6.21.: JULEA: concurrent batch metadata operations to a shared collection

This huge performance improvement is due to the fact that JULEA makes use of MongoDB’s support for so-called bulk inserts when possible. This allows MongoDB to improve throughput when inserting a large number of documents.16 The create operation achieves its maximum performance when using four to six nodes and drops slightly when using more nodes. Additionally, the performance numbers show much higher deviations due to congestion in the MongoDB servers.
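The difference between individual and bulk inserts can be sketched as follows (Python with pymongo; server address and names are again illustrative assumptions):

from pymongo import MongoClient

items = MongoClient("mongodb://metadata-server:27017")["julea_store_example"]["items"]

documents = [{"collection": "results", "name": "item-%d" % i} for i in range(100000)]

# Individual inserts: one round trip per document, comparable to the
# behavior of individual create operations.
# for document in documents:
#     items.insert_one(document)

# Bulk insert: the driver packs many documents into a single command,
# which is what JULEA's batched create operations map to.
items.insert_many(documents, ordered=False)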

Individual Collections and Stores The following measurements have been performed using individual collections and stores to analyze the scaling behavior with multiple MongoDB databases. To keep the number of MongoDB databases at a reasonable level, one store per client node has been used; all clients located on this node have then created their individual collections within this store.

Individual Operations Due to a bug in MongoDB, it has not been possible to collect reliable measurements when using individual stores combined with individual operations. The high rate of operations consistently triggers the bug when multiple nodes are used. Consequently, only results for batch operations will be presented.

Batch Operations Figure 6.22 on the following page shows JULEA’s batch metadata performance when using individual stores.

16 MongoDB versions 2.6 and later support more generic bulk write operations that extend the bulk concept to all write operations. JULEA does not yet make use of this new feature, however.


Figure 6.22.: JULEA: concurrent batch accesses to individual stores

The open and stat operations’ performance is slightly higher than when using a shared collection. While the open operation reaches a maximum performance of 125,000 operations/s instead of 120,000 operations/s, the stat operation achieves 69,000 operations/s instead of 67,000 operations/s; this corresponds to improvements of 4 % and 3 %, respectively. These results are to be expected because these operations do not modify data within MongoDB, which is where bottlenecks would occur during locking. The delete operation’s performance curve behaves differently from its counterpart using a shared collection: Instead of delivering constant performance regardless of the number of accessing clients, performance increases until reaching its maximum of 43,000 operations/s when using five or more nodes. This corresponds to a performance increase of 230 % when compared to the results using a shared collection. The same applies to the create operation’s performance: Instead of providing roughly the same performance for all configurations, using individual stores enables better scaling. It reaches its maximum performance of 280,000 operations/s when using ten nodes; this equals a performance increase of 75 %.

Concurrency Semantics

The following measurements have used differing concurrency semantics as explained in Section 3.4.2 on page 62.

Shared Collection The following measurements have been performed using a shared collection.



Figure 6.23.: JULEA: concurrent metadata operations to a shared collection using serial concurrency semantics

Individual Operations Figure 6.23 shows JULEA’s metadata performance when using a shared collection and serial concurrency semantics. The performance of the create and open operations is identical to that of their counterparts using the default semantics; the delete operation is slower by about 100 operations/s. There are, however, some subtle differences in behavior that might or might not have an impact on performance:

1. The create operation has to send more metadata to the MongoDB servers because the serial concurrency semantics cause the items’ sizes and modification times to be stored in MongoDB. This does not have a negative effect on performance in this case because the total amount of metadata is still relatively low.17

2. The additional metadata also has to be fetched from the MongoDB servers when opening items because the complete MongoDB document is requested by default. This also does not have any measurable effect in this case.

3. The delete operation has to remove more data from MongoDB due to the increased document size. Because the indexes have to be updated in addition to the actual deletion, this causes a slight performance drop.

Regarding stat performance, it is interesting to note that it is nearly identical to the open operation’s performance, resulting in a speedup of almost 40 % when compared

17 Specifically, 20,000 operations/s are not enough to saturate the network due to the small size of each individual operation.

to the default semantics. This is due to the fact that the items’ sizes and modification times can be fetched from MongoDB. Even though this still involves two lookup operations instead of one (that is, first opening the item and then getting its status), it can be overlapped efficiently and results in higher performance than with the default semantics. In contrast to the default semantics, this reduces the number of internal operations from eleven to two; instead of contacting ten data servers, only one additional MongoDB lookup is required. Additionally, MongoDB lookups are very fast due to the previously mentioned indexes.


Figure 6.24.: JULEA: concurrent batch metadata operations to a shared collection using serial concurrency semantics

Batch Operations Figure 6.24 shows JULEA’s batch metadata performance when using a shared collection and serial concurrency semantics. The delete and open operations’ performance is almost identical to that of their counterparts using the default semantics. The create operation’s performance, however, decreases from a maximum of 163,000 operations/s to 157,000 operations/s. This slight slowdown of approximately 4 % is most likely due to the increased amount of metadata that has to be sent to the metadata server. While this effect was negligible for individual operations due to their low create performance, it becomes noticeable for batch operations. The stat performance drops from a maximum of 65,000 operations/s to 50,000 operations/s. While performance is increased by 40 % when using serial concurrency semantics with individual operations, batch operations are actually 25 % slower. This performance drop is likely due to the fact that batch operations leave less opportunity for overlapping metadata operations because they currently lock a MongoDB connection for their entire duration.


Individual Collections and Stores The following measurements have been performed using individual collections and stores.


Figure 6.25.: JULEA: concurrent batch accesses to individual stores using serial concurrency semantics

Batch Operations Figure 6.25 shows JULEA’s batch metadata performance when using individual stores and serial concurrency semantics. The open operation provides almost identical performance when compared to the default semantics; all other operations are slowed down, however. The create operation reaches a maximum performance of roughly 260,000 operations/s which corresponds to a slowdown of 7 %. While the delete operation’s performance is decreased by approximately 5 % with a maximum throughput of 41,000 operations/s, the stat operation achieves a maximum of 53,000 operations/s which equals a decrease of more than 20 %. As explained earlier, there are two factors that are responsible for the decline in performance:

1. There is less opportunity for overlapping operations when using batch operations because the MongoDB connections are exclusively locked for the batch operation’s entire duration.

2. More information has to be sent to and retrieved from the metadata servers, which is due to the serial concurrency semantics causing additional metadata to be stored in MongoDB.

While the reduced amount of overlapping is responsible for the stat operation’s performance drop, the additional metadata causes slight slowdowns for both the create and delete operations.


Safety Semantics

The following measurements have used differing safety semantics as explained in Section 3.4.6 on page 65.

Shared Collection The following measurements have been performed using a shared collection.


Figure 6.26.: JULEA: concurrent metadata operations to a shared collection using unsafe safety semantics

Individual Operations Figure 6.26 shows JULEA’s metadata performance when using a shared collection and unsafe safety semantics. The open operation’s performance is identical to that of its counterparts using the default semantics. The delete operation’s behavior is slightly modified by not awaiting MongoDB’s reply when removing documents, which improves performance by roughly 200 operations/s. Regarding the create operation, it is interesting to note that there is a huge performance spike of 55,000–65,000 operations/s when using a single node and one or two clients. Afterwards, performance decreases to roughly 25,000 operations/s for twelve clients on one node. As in the previous cases, increasing the number of nodes causes performance to gradually decrease until it reaches 20,000 operations/s when using ten nodes. Overall, performance is increased by 4,000–5,000 operations/s, which corresponds to a 20 % improvement. The open operation’s performance is reduced, however. This is most likely due to the fact that its performance is measured directly after the create operation. Because JULEA does not await


MongoDB’s reply for the create operations in this case, the server might still be busy inserting documents when the benchmark’s open phase begins.
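In MongoDB terms, awaiting or not awaiting the reply corresponds to the write concern used for the metadata operations. A hedged pymongo sketch of the two modes, with assumed server address and collection names, might look like this:

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

db = MongoClient("mongodb://metadata-server:27017")["julea_store_example"]

# Default safety semantics: wait for the server to acknowledge each write.
acknowledged = db.get_collection("items", write_concern=WriteConcern(w=1))

# Unsafe safety semantics: do not wait for an acknowledgment (fire and
# forget), trading safety for a higher operation rate.
unacknowledged = db.get_collection("items", write_concern=WriteConcern(w=0))

unacknowledged.insert_one({"collection": "results", "name": "item-0"})
unacknowledged.delete_one({"collection": "results", "name": "item-0"})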


Figure 6.27.: JULEA: concurrent batch metadata operations to a shared collection using unsafe safety semantics

Batch Operations Figure 6.27 shows JULEA’s batch metadata performance when using a shared collection and unsafe safety semantics. The stat operation’s performance is identical to that of its counterpart using the default semantics; since no optimizations can be performed in this case, this is expected behavior. Regarding the create operation, it can be seen that the performance remains more or less constant when using more than one node; overall, performance is much higher than when using the default semantics. This is due to the fact that no write acknowledgment is requested from MongoDB. In comparison to the default semantics, performance is increased by almost 70 % from 160,000 operations/s to 270,000 operations/s. Due to the decreased overhead facilitated by the batch operations, the delete operation’s maximum performance reaches 25,000 operations/s when using more than three nodes; compared to the default semantics, this equals an increase of roughly 85 %. Again, this is due to the fact that no write acknowledgments are requested from MongoDB. Curiously, the open operation’s performance is reduced to a maximum of approximately 70,000 operations/s, even though it does not behave differently depending on the chosen safety semantics. Compared to the maximum of 120,000 operations/s using the default semantics, this equals a performance drop of more than 40 %. This performance decline can be explained by the fact that the open operation is measured directly after the create operation in the benchmark application. Due to the high create throughput coupled with JULEA not waiting for write

acknowledgments, the MongoDB server is likely still busy inserting documents and updating its indexes when the open phase starts. The high performance deviations are also an indication of this because the open operation provided very stable results in all other benchmarks.

Individual Collections and Stores The following measurements have been performed using individual collections and stores.


Figure 6.28.: JULEA: concurrent batch accesses to individual stores using unsafe safety semantics

Batch Operations Figure 6.28 shows JULEA’s batch metadata performance when using individual stores and unsafe safety semantics. The stat operation’s performance is mostly identical when compared to that of its counterpart using the default semantics; this is not surprising because there are no differences regarding this operation when changing the safety semantics. As expected, the create operation’s performance is greatly increased by almost 70 %; it reaches its maximum of 475,000 operations/s with ten nodes. Again, this improvement is due to not requesting write acknowledgments from MongoDB. Due to the congestion caused by the higher rate of document insertions, the performance deviations are much higher than with the default semantics. For the same reason, the delete operation’s maximum performance increases from 43,000 operations/s with the default semantics to almost 60,000 operations/s with the unsafe safety semantics; this corresponds to an improvement of 40 %. The open operation’s performance drops from 125,000 operations/s to 117,000 operations/s in comparison to the default semantics; this corresponds to a

drop of roughly 6 %. In contrast to the results when using a shared collection – where performance was decreased by 40 % – the effect of MongoDB’s congestion causing a slowdown during the open phase is not as pronounced. This is likely due to the fact that multiple databases allow MongoDB to distribute the load more efficiently.

6.3.3. Discussion

The results demonstrate that Lustre’s metadata performance is relatively low, even though the MDS has been configured to use an SSD as its MDT. It is especially interesting to note that while the create operation’s performance scales with the number of concurrently accessing client nodes, the open operation’s performance is significantly lower and deteriorates with an increasing number of clients. Consequently, this makes it possible to create files at a high rate but impossible to open them again in a reasonable amount of time. Additionally, the stat operation’s performance is very unstable when using more than five nodes and actually decreases with an increasing number of clients. Using a shared directory again degrades the overall performance, though the effect is not as pronounced as when using shared files.

JULEA delivers performance that is capable of competing with Lustre for the create and delete operations. The open and stat operations’ performance, however, is much higher than in Lustre’s case. This demonstrates that the use of optimized database systems such as MongoDB can make sense for metadata servers. Projects such as the Robinhood policy engine also use database systems to speed up common file system operations [CEA14]. The different semantics and batch operations can provide significant benefits regarding metadata performance: While the concurrency and safety semantics can help to improve the performance of the stat and create operations, batch operations reduce the overall overhead caused by many small metadata requests. However, the results indicate that more fine-tuning is required for batches because they can actually reduce performance in some cases, depending on the workload and metadata operation in question.

6.4. Lustre Observations

The following observations have been made while performing the previous data and metadata measurements. It has emerged that Lustre’s behavior is different from that of other file systems in various ways; it is important to keep the following quirks in mind to obtain meaningful results.

Lustre caches data very aggressively. For example, when a write call returns, the data has usually not reached the object storage servers (OSSs) yet but has only been cached in the client’s RAM. Additionally, subsequent read calls do not request any


data from the OSSs if the data is still cached. Consequently, it is necessary to force Lustre to flush the data to the OSSs and also retrieve the data from there. When writing, this can be easily achieved using the fsync function that forces data to be flushed to stable storage, that is, the OSSs.18 When reading, the easiest method to guarantee that data is actually retrieved from the OSSs is to empty the caches. However, there is no single simple method to accomplish this. One could use the previously mentioned posix_fadvise function together with its POSIX_FADV_DONTNEED advice. Because this advice is not well-specified and could have different behavior depending on the software environment, it is not suited for this purpose. Alternatively, Linux offers a mechanism to drop the OS’s page cache: By writing the value 3 to the /proc/sys/vm/drop_caches file, the OS drops all non-dirty cached pages.19 However, Lustre apparently does not mark its cached pages as non-dirty immediately after calling fsync. This makes it necessary to implement workarounds such as sleeping for a certain amount of time or repeatedly using this mechanism while monitoring the number of cached pages to make sure that all cached data has been dropped.

Lustre provides very inconsistent performance results directly after starting the file system servers and mounting the file system. It is therefore necessary to wait for an appropriate amount of time until the file system has settled down.

It is sometimes not possible to unmount the Lustre file system directly after the end of a benchmark because it is still busy. In this case, it is also necessary to wait for a certain amount of time or to repeatedly check whether the file system has become idle.

As with any other kernel module, bugs can make it necessary to reboot the complete machine in order to restore functionality. It is usually not a problem to reboot the Lustre client nodes because job schedulers will take care of restarting applications. However, the load caused by the highly parallel benchmark applications also frequently made it necessary to reboot the Lustre server nodes. This was especially pronounced when performing parallel metadata measurements.
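A sketch of the flushing and cache-dropping procedure described above is given below; it assumes root privileges on the client node as well as arbitrarily chosen retry counts and delays, and uses Python purely for brevity:

import os
import time

def flush_and_drop_caches(path, attempts=3, delay=5):
    # Force the file's data (and metadata) out to the OSSs.
    fd = os.open(path, os.O_WRONLY)
    os.fsync(fd)
    os.close(fd)

    # Lustre may not mark its cached pages as clean right away, so drop
    # the page cache repeatedly after waiting for a while.
    for _ in range(attempts):
        time.sleep(delay)
        with open("/proc/sys/vm/drop_caches", "w") as drop:
            drop.write("3\n")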

6.5. Partial Differential Equation Solver

To evaluate the different I/O interfaces’ behavior with real-world applications, additional benchmarks using the partdiff application have been performed. partdiff solves partial differential equations (PDEs) using the Jacobi and Gauß-Seidel methods and is parallelized using MPI. Its basic memory structure is a matrix that is refined

18 To be precise, fsync flushes data as well as metadata to stable storage. If only data should be flushed, fdatasync can be used.
19 The value to be written is actually a bitmask: A value of 1 causes the page cache to be dropped, while a value of 2 causes cached directory entries and inodes to be dropped. Consequently, a value of 3 drops all of the above.

iteratively, with each MPI process being responsible exclusively for a contiguous part of the matrix. As mentioned previously, a common operation in scientific applications is checkpointing: All information that is necessary to resume the application later is written to storage. partdiff implements checkpointing by writing out the complete matrix to two alternating files; as soon as a checkpoint has been written out successfully, the other file is used for the next one to guarantee that one valid checkpoint is available at all times, even if the application happens to crash during checkpointing. The checkpointing rate can be configured to manage the I/O overhead. Since each process is responsible for a part of the matrix, the processes can write the matrix without any coordination or overlapping. partdiff supports several different I/O interfaces to perform its checkpointing: POSIX, individual and collective MPI-IO, and JULEA. For the following evaluation, the first three I/O interfaces have been used on top of Lustre; both Lustre and JULEA have been configured as in Section 6.2.
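The following sketch illustrates this checkpointing scheme; it uses Python with mpi4py and NumPy purely for illustration, whereas partdiff itself is an MPI application written in C, so the file names and data types are assumptions:

from mpi4py import MPI
import numpy as np

def write_checkpoint(matrix_part, iteration):
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Alternate between two checkpoint files so that one valid checkpoint
    # survives even if the application crashes while writing the other.
    file_name = "checkpoint-%d.dat" % (iteration % 2)

    handle = MPI.File.Open(comm, file_name,
                           MPI.MODE_CREATE | MPI.MODE_WRONLY)
    # Each process owns a contiguous, equally sized part of the matrix, so
    # the processes write to disjoint regions without further coordination.
    handle.Write_at_all(rank * matrix_part.nbytes, matrix_part)
    handle.Close()

# Example: every process contributes a contiguous block of doubles.
write_checkpoint(np.zeros(1000000, dtype=np.float64), iteration=0)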

Nodes    Matrix Size
1        4.89 GiB
2        9.77 GiB
3        14.65 GiB
4        19.54 GiB
5        24.42 GiB
6        29.30 GiB
7        34.18 GiB
8        39.06 GiB
9        43.95 GiB
10       48.83 GiB

Table 6.1.: partdiff matrix size depending on the number of client nodes

To evaluate the scaling behavior of the I/O interfaces and file systems, the measurements have been performed using an increasing number of clients. partdiff allows specifying the matrix’s size, which has been chosen in relation to the number of client nodes. To keep the amount of required computation and I/O per node constant, the matrix size was adjusted in such a way that doubling the number of nodes also doubled the matrix size. The chosen matrix sizes can be found in Table 6.1. Consequently, assuming perfect scaling, the runtime for both computation and I/O should remain constant for all configurations. partdiff has been configured to calculate 100 iterations and write checkpoints for six of these iterations; the checkpoints have been distributed evenly across all iterations. Because each client node writes 4.89 GiB of data per checkpoint, this results in a total of 29.32 GiB per


node and 293.15 GiB when all ten nodes are used. Using the theoretical maximum of 1,115 MiB/s for all ten nodes, partdiff’s I/O is expected to take at least 262.9 s to complete. The writing of the checkpoint is accomplished using a single call to the I/O library’s respective write function, that is, block sizes are not relevant in this case.20


Figure 6.29.: partdiff checkpointing using one process per node

One Process Per Node Figure 6.29 shows partdiff’s runtime and I/O time using different I/O interfaces with one MPI process per node. The time for computation is roughly the same for all I/O interfaces with approximately 255 s when using one node and increases slightly to about 260 s with ten nodes. As can be seen, all I/O times increase as more nodes are used. Additionally, the changes in the total runtime mirror those in the I/O time, that is, the time consumed for computation remains constant as expected. All I/O interfaces achieve an I/O time of 268 s when using a single node. POSIX’s I/O time increases to 467 s; this equals a slowdown of 74 %. The I/O time of MPI-IO’s individual mode lengthens to 459 s, which corresponds to an increase of 71 %. Using MPI-IO’s collective mode, it grows to 428 s, resulting in an increase of 60 %. JULEA’s I/O time increases by 31 % to 350 s. As expected, the behavior of POSIX and individual MPI-IO is largely equivalent. This is due to the fact that ADIO’s POSIX backend is used and individual MPI-IO thus is simply a wrapper around the POSIX interface. Using MPI-IO’s collective mode provides higher performance due to the optimizations enabled by the additional information available to collective calls.

20 In MPI-IO’s case, the checkpoint writing had to be split up into multiple calls because different MPI-IO implementations are still not fully 64-bit-safe, making it impossible to write more than 2 GiB per call.


It is interesting to note that the I/O time of all I/O interfaces except for JULEA sharply increases when going from one to two nodes and then decreases to a normal level again; this is most likely due to Lustre changing its behavior to be POSIX-compliant. Even though all I/O interfaces are only slightly slower than the theoretical maximum when using one node, all of them slow down as more nodes are used. However, this could be due to the relatively low degree of parallelism caused by limiting the number of processes per node to one.


Figure 6.30.: partdiff checkpointing using six processes per node

Six Processes Per Node Figure 6.30 shows partdiff’s runtime and I/O time using different I/O interfaces with six MPI processes per node. The time for computation is roughly the same for all I/O interfaces with approximately 90 s when using one node and increases to about 135 s with ten nodes. As seen before, all I/O interfaces start out with an I/O time of 268 s when using one node. That is, more MPI processes do not influence the throughput when using a single node. However, the picture changes when more nodes access the file system concurrently. Even though the slowdown is less pronounced than in the previous benchmarks in Section 6.2, the I/O time of POSIX and both MPI-IO modes increases rapidly when more than one node is used. This is due to locking overhead introduced by many clients accessing the same shared file even when using a larger access size. POSIX’s I/O time lengthens to 563 s, which equals a slowdown of 110 %. While the I/O time of MPI-IO’s individual mode increases to 565 s (111 %), the collective mode’s time lengthens to 614 s (129 %). Interestingly, the collective mode does not provide performance benefits in this case. While the I/O time is roughly the same as for


MPI-IO’s individual mode until nine nodes are used, the collective mode slows down significantly with ten nodes. Additionally, both modes are slower than when using only a single MPI process per node. Consequently, the increased parallelism actually degrades performance. Even though the computation is sped up by the additional processes, the overall runtime with ten nodes remains the same as when using a single process per node due to the massive I/O slowdown. In contrast to the other I/O interfaces, JULEA’s performance is improved by more concurrent clients: When going from one to ten nodes, JULEA’s I/O time only grows to 318 s, which corresponds to an increase of 19 %.

6.5.1. Discussion

The measurements using partdiff represent a very simple and common use case because checkpointing is frequently used in high performance computing (HPC) applications. Additionally, partdiff’s distribution of data across several MPI processes results in a seemingly uncomplicated I/O pattern of streaming and non-overlapping writes to a shared file.

Lustre shows scaling inefficiencies even when using only one MPI process per node. Its performance decreases by 40–45 % when going from one to ten client nodes. Using more processes per node exacerbates the problem, causing a performance drop of 55–60 %. Increasing the number of processes per node further is expected to make this problem even worse, as demonstrated in the earlier benchmarks. While JULEA’s performance drops by 33 % when using one process per node under the same circumstances, more processes per node improve performance and reduce the performance degradation to 16 %. This is due to the different semantics found in Lustre and JULEA. While it is not possible to modify Lustre’s behavior to support this use case better, JULEA’s semantics handle it well by default. Additionally, JULEA’s semantics can be changed dynamically to support different I/O requirements.

These results make it clear that even uncomplicated use cases require workarounds to achieve maximum performance when using parallel distributed file systems such as Lustre. One such approach could be having dedicated I/O processes per node to reduce the number of concurrent file system clients. That is, the applications have to be adapted to the parallel distributed file system instead of the other way around. This problem is especially severe if applications are supposed to run efficiently on a number of different file systems. It could be mitigated by informing the file system about the applications’ actual I/O requirements, as supported by JULEA.


Summary

This chapter has presented a detailed performance evaluation of multiple parallel distributed file systems and I/O interfaces. Both the data and metadata performance of Lustre, OrangeFS and JULEA have been measured using different use cases and workloads; OrangeFS has been skipped for specific measurements due to its low performance. Additionally, JULEA’s batches and semantics have been thoroughly analyzed; being able to batch operations and dynamically adapt the file system’s semantics depending on the applications’ I/O requirements can have significant benefits. It has been shown that static approaches such as Lustre’s POSIX semantics can degrade performance dramatically even for common use cases.


Chapter 7. Conclusion and Future Work

In this chapter, the thesis will be concluded and its results will be summarized. Additionally, an outlook regarding future work will be presented. This mainly includes additional features and improvements for the proposed JULEA I/O stack that were out of scope for this thesis.

This thesis presents a new approach for handling application-specific input/output (I/O) requirements in high performance computing (HPC). The JULEA framework includes a prototypical implementation of a parallel distributed file system and provides a novel I/O interface featuring dynamically adaptable semantics. It allows applications to specify their I/O requirements using a fine-grained set of semantics. Additionally, batches enable the efficient execution of file system operations.

The results obtained in this thesis demonstrate that there is a need for I/O interfaces that can adapt to the requirements of applications in order to provide adequate performance for a variety of different use cases. While Lustre’s POSIX1 interface has advantages regarding portability, its inflexibility can cause considerable performance degradations: When using shared files, Lustre has to perform locking to remain POSIX-compliant; because there is no way to tell the file system that POSIX semantics are unnecessary or unwanted, it is not possible to avoid this performance penalty. These performance problems are noticeable even for small numbers of client processes and straightforward I/O patterns: For example, checkpointing writes data in large contiguous blocks and results in streaming I/O. However, the overhead incurred by Lustre’s shared file handling still slows down performance significantly. These issues also affect higher levels of the I/O stack because Lustre effectively forces POSIX semantics upon other layers.

Other file systems such as OrangeFS are not affected by this particular issue because they do not aim to be POSIX-compliant; they are, however, also limited to their respective semantics. While this allows them to deliver high performance for shared access, it excludes other I/O patterns such as conflicting and overlapping writes. For instance, this makes it impossible to use these file systems for workloads involving coordinated access to shared data structures such as file headers. Supporting them requires implementing appropriate synchronization schemes outside or on top of

1 Portable Operating System Interface

the file system. Consequently, the prevailing problems cannot be solved by simply relaxing or tightening the semantics. All static approaches have the drawback of being only suitable for a subset of use cases and workloads. The current circumstances effectively leave application developers with two choices to be able to achieve the best possible performance:

1. Make use of different parallel distributed file systems depending on the applica- tions’ specific I/O patterns. That is, applications requiring correct handling of conflicting write operations have to be executed using a POSIX-compliant file system, while applications in need of efficient shared file handling have to use different file systems such as OrangeFS.

2. Adapt applications to work around limitations found in specific file systems. That is, applications utilize a single available file system but have to implement additional measures to make efficient use of it. For instance, writing checkpoints to node-local files could be used to circumvent Lustre’s poor shared file performance. This is sometimes accomplished using specialized high-level I/O libraries such as SIONlib.

Because the first option is generally not feasible due to the given hardware and software environment of the used supercomputers, developers are usually forced to adapt their applications. An indication of this is the wide variety of I/O libraries dealing with particular file system constraints.

Even though developers and users are theoretically able to execute arbitrary user space applications – including user space file systems – access to the supercomputers’ dedicated storage is usually restricted. That is, user space file systems can typically only be set up to use the storage space available on the compute nodes. Since compute nodes are only assigned temporarily and are sometimes not even equipped with user-accessible local storage devices, this solution is not viable. Additionally, HPC applications are often executed on multiple supercomputers that, in turn, might use different parallel distributed file systems. This can significantly increase the development and maintenance overhead because applications have to be optimized for different file systems’ semantics instead of being able to optimize the file systems according to their I/O requirements.

Current file systems and I/O interfaces do not allow semantical information to be specified by the application developers even though this information could be used to optimize the file systems’ behavior and thus enable high performance for a wider range of use cases. Instead, applications have to be adapted to work around the file systems’ specific limitations that are imposed by their respective semantics. JULEA presents a first approach of how application-provided semantical information can be used to adapt the file system’s behavior to the applications’ I/O requirements. Measurements using a wide range of I/O interfaces and workloads show that

the exploitation of this information can significantly improve performance even for common use cases.

The concept introduced by the JULEA framework fills the gap by allowing applications to adapt the file system to their exact I/O requirements instead of the other way around. For instance, this can be used to determine whether atomicity is required to handle overlapping writes correctly. Because JULEA offers a large number of possibilities to influence the file system’s semantics, only certain aspects could be evaluated in detail. Nevertheless, the available results show that the supplementary semantical information can be used to adapt the file system’s behavior in such a way as to optimize performance for specific use cases. A discussion regarding application support and ideas to ease the porting of existing applications to JULEA will be presented later.

Overall, JULEA provides data and metadata performance comparable to that of other established parallel distributed file systems. In contrast to the existing file systems, its flexible semantics allow it to cover a wider range of use cases efficiently. JULEA’s data performance is currently being held back by underlying problems in Linux’s I/O stack: Too many parallel I/O streams significantly reduce performance even for relatively easy access patterns such as streaming I/O. Additionally, more investigation and tweaking of the MongoDB configuration will be required to eliminate the performance drop-off with larger numbers of client processes. Sharded configurations of MongoDB are also expected to increase performance even further.

These underlying problems might make it necessary to take control of the complete I/O stack to deliver high performance. JULEA is already prepared for this with its storage backend interface that makes it possible to easily support custom backends such as user space object stores. Providing all functionality of a parallel distributed file system in user space has several advantages:

1. Kernel space implementations are not as portable as those in user space due to changing kernel interfaces. An example of this is Lustre’s requirement for special enterprise kernels; it is not easily possible to use Lustre’s server components in combination with newer kernel versions.

2. Problems in kernel space code can make it necessary to reboot the complete machine. This is especially true for problems in Linux’s virtual file system (VFS) layer that can render the complete system unusable.

3. Analyzing and debugging user space code is much easier. While a plethora of user space tools – such as GDB, Valgrind or VampirTrace – provide sophisticated and easy-to-use debugging and performance analysis functionality, analyzing and debugging the kernel is usually more tedious.

However, user space file systems also have disadvantages regarding performance: Because the file system is a normal user space process, additional context switches


might be necessary whenever a file system operation is invoked. In contrast to the mode switches that are required for kernel space file systems, context switches are more expensive because more state has to be saved and restored. Due to the high latencies of the involved network and storage operations, these additional costs can often be ignored. Overall, the benefits outweigh the drawbacks in the context of parallel distributed file systems.

JULEA’s convincing metadata performance results also imply that modern database systems such as MongoDB present an interesting alternative to traditional metadata server designs. Database indexes allow fast lookups that are necessary to achieve high performance. This has also been recognized by other projects such as the Robinhood policy engine that exploit the superior performance of database systems to speed up common metadata-intensive file system operations.

Overall, the need for a more dynamic approach for parallel distributed file systems such as the one implemented by JULEA is reinforced by a trend observed in several other data-centric software packages: As already presented in Section 4.4 on pages 84–85, ADIOS2 has recently added support for read scheduling and data transformations. While read scheduling introduces the batching of read operations to improve performance, data transformations allow – among other things – transparently compressing data and thus reducing the amount of required storage and network capacity. Additionally, MongoDB has lately gained support for write concerns and bulk write operations. Write concerns allow specifying the required safety level for data, and bulk write operations can be used to improve throughput. These approaches are very similar to JULEA’s concepts of batches and dynamic semantics.

While there are detached activities to improve I/O interfaces, there is no uniform approach that allows the semantical information to be exploited across the complete I/O stack. This is mainly due to the fact that such activities are usually focused on high-level I/O libraries such as ADIOS. Low-level layers like MPI-IO or the actual file system are not changed. JULEA’s semantics, however, establish a way to hand this information down into the file system and allow adapting it to a wide range of I/O requirements. While the aspects of atomicity, concurrency and safety have been evaluated in detail, more adjustments are possible. Additionally, semantics templates make it easy to use and adapt JULEA’s semantics.

Even though JULEA provides a convenient testbed to experiment with different semantics and prototype new functionality, it is necessary to provide dynamically adaptable semantics for established I/O interfaces and parallel distributed file systems for widespread adoption of these new features. These interfaces have to be standardized and supported by a sufficiently large subset of file systems to provide consistent functionality across different implementations.

2 Adaptable IO System


First of all, it is necessary to agree on default semantics suited for modern HPC applications and a common set of parameters that should be configurable. While POSIX allows portability across a wide range of existing file systems, it does not seem to be suited for contemporary HPC demands, as demonstrated by the results at hand. The semantics presented in this thesis are meant to provide a good starting point for further evaluation. Backwards compatibility for existing applications could also be ensured using a concept akin to JULEA. Although JULEA’s primary motivation is to establish and evaluate dynamically adaptable I/O semantics for HPC, another important goal is providing an environment to foster research. This includes file systems and object stores in general as well as novel approaches regarding I/O interfaces and semantics. It has already proven to be a good testbed for a number of bachelor and master theses that have been conducted in relation to it:

• Different parallel distributed file systems as well as I/O interfaces and semantics have been evaluated in [Jan11].

• JULEA’s automatic correctness and performance regression framework has been developed in [Fuc13].

• The LEXOS3 object store and the related JULEA storage backend have been created in [Sch13].

• A detailed analysis regarding the scalability of different I/O interfaces including HDF4 and NetCDF5 has been conducted in [Bar14].

• The potential performance disadvantages of user space file systems implemented using FUSE6 have been analyzed in [Duw14].

3 Low-Level Extent-Based Object Store
4 Hierarchical Data Format
5 Network Common Data Form
6 Filesystem in Userspace


7.1. Future Work

While the prototypical JULEA framework demonstrates that semantical information can be exploited to adapt the file system’s behavior, not all of its possibilities could be explored within the scope of this thesis. The following sections will give an overview of several ideas for future work.

7.1.1. Application Support

As mentioned previously, it is often unreasonable to port applications to new I/O interfaces due to their size and complexity. Because many applications already use high-level I/O libraries such as ADIOS or NetCDF, JULEA could be integrated into applications by providing backends for these I/O libraries. While ADIOS includes its own backends and could thus be extended to provide a native JULEA backend, NetCDF support could be achieved by adding a JULEA backend to HDF. HDF already includes support for POSIX and MPI-IO, and NetCDF simply delegates all I/O operations to HDF, making this a viable approach. However, ADIOS’s design is closer to JULEA’s due to its support for read scheduling and other advanced I/O features. Providing a backend for ADIOS would enable all ADIOS-aware applications to use JULEA without any further modifications.

ADIOS

ADIOS makes use of XML7-based configuration files to specify the applications’ I/O. This could be easily extended to add more semantical information about the actual data, similar to what has been done in [KMKL11]. A prerequisite for this is a native JULEA backend for ADIOS, as this additional information currently cannot be handed down the storage stack. Otherwise, the optimizations made possible by this information would have to be implemented within ADIOS – or any other high-level I/O library wanting to support such features. This is due to the fact that the lower layers do not support such semantical information or that it is lost through the layers. Therefore, it would be beneficial to be able to pass this information into the file system, thus alleviating the need to implement such optimizations over and over again within the upper layers as well as providing more room for optimizations in general.

[XML configuration excerpt (content not recoverable): a semantics element is attached to each ADIOS group; line 3 sets safety semantics for the checkpoint group, line 4 applies JULEA’s semantics template for temporary data.]

7 Extensible Markup Language


Listing 7.1: ADIOS extensions

Due to ADIOS’s rich XML configuration format, it would be relatively easy to extend it to support the semantical information understood by JULEA as shown in Listing 7.1 on the facing page. Analogous to the current way of being able to select the I/O method per group, the new semantics element would allow defining arbitrary semantics on a per-group basis. In this example, the application is supposed to write a checkpoint and some temporary data: Since the purpose of a checkpoint is to be able to restart the program in the event of a crash, it is important that it is written to persistent storage and does not end up in some kind of cache. Therefore, the safety semantics are used to ensure this property (line 3). Temporary data, however, may not need to be written to persistent storage at all. JULEA provides a semantics template for this use case that can be used (line 4).

7.1.2. Transactions

While databases usually offer atomicity, consistency, isolation and durability (ACID) semantics by means of full-featured transactions, file systems do not provide such guarantees or features. As even standard non-database applications deeply care about at least atomicity, consistency and durability, application developers have to implement appropriate measures themselves to ensure these properties. Consequently, it would be desirable to have support for transactions within file systems in some cases [WSSZ05]. While JULEA supports changing the atomicity semantics, this currently only applies to single operations within a batch and does not provide all the features real transactions provide. The atomicity semantics only apply to the operation’s visibility to other processes; operations can still complete only partially in case of an error. A single failing operation within a batch could leave the data in an unexpected state and thus force the application developer to closely check each operation’s result.

 1 semantics = new Semantics(DEFAULT_SEMANTICS);
 2 semantics.set(TRANSACTION, TRANSACTION_BATCH);
 3
 4 batch = new Batch(semantics);
 5
 6 item.write(..., batch);
 7 item.write(..., batch);
 8 item.write(..., batch);
 9
10 if (!batch.execute())
11 {
12     error();
13 }

Listing 7.2: JULEA transactions

The pseudo code in Listing 7.2 on the previous page shows how transactions could be used in JULEA. First, a new semantics object is created (line 1) and its transaction semantics are set to provide transactions for the complete batch (line 2). Afterwards, a batch is created using these semantics (line 4). Several write operations are performed using this batch (lines 6–8). Should any of these write operations fail, the whole batch will fail and the item’s contents will equal those before the batch’s execution (lines 10–13). Providing this kind of support directly in the file system would free application developers from the burden of complex error handling. Transactions fit naturally into the concept of batches since there are already well-defined start and end points for batches, similar to transactions. Support for full-featured transactions would be more oriented towards programming efficiency rather than increased performance as they mainly offer convenient ways for error handling and cleanup. However, it would allow developers to focus on the actual I/O instead of worrying about correct error handling, which would in turn lead to cleaner and more maintainable code.

7.1.3. Object Store

As explained in Chapter 3, it would be beneficial for JULEA to make use of an object store in order to avoid the overhead of a full-featured POSIX file system. As JULEA already handles most file system operations itself, it is not necessary for an underlying file system to perform redundant operations such as path lookup and permission checking. As the storage backends’ primary purpose is to efficiently handle parallel I/O streams and the actual block allocation, object stores provide a fitting alternative. Another important aspect is the fact that JULEA’s performance is negatively impacted by Linux’s current VFS layer. To take control of the complete I/O stack, it is necessary to eliminate this dependency and provide a storage backend that is tailored to JULEA’s requirements regarding batches and semantics. This can be accomplished most easily and effectively using user space object stores as they are easier to adapt than full-featured kernel space file systems. LEXOS is an initial prototype of such an object store and has been implemented as a shared library in user space. This allows it to be used easily in other projects and would enable JULEA to have complete control over all aspects of the resulting I/O path. It already supports batches and could thus be easily integrated into JULEA’s I/O stack. However, it still requires further evaluations and optimizations for massively parallel workloads as found in parallel distributed file systems.
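To illustrate why such a backend can be much leaner than a POSIX file system, the following sketch shows a minimal object-store-style interface without path lookup or permission checking; the class and method names are assumptions and do not mirror LEXOS’s or JULEA’s actual storage backend interface:

import os

class FlatObjectStore:
    """Minimal object store sketch: flat namespace, no path lookup, no
    permission checks; offsets and lengths are managed by the caller."""

    def __init__(self, base_path):
        self.base_path = base_path
        os.makedirs(base_path, exist_ok=True)

    def _object_path(self, object_id):
        # Objects are addressed by an opaque identifier instead of a path.
        return os.path.join(self.base_path, object_id)

    def write(self, object_id, data, offset):
        fd = os.open(self._object_path(object_id), os.O_CREAT | os.O_WRONLY, 0o600)
        try:
            os.pwrite(fd, data, offset)
        finally:
            os.close(fd)

    def read(self, object_id, length, offset):
        fd = os.open(self._object_path(object_id), os.O_RDONLY)
        try:
            return os.pread(fd, length, offset)
        finally:
            os.close(fd)

    def delete(self, object_id):
        os.unlink(self._object_path(object_id))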


7.1.4. Client Optimizations

The results shown in Chapter 6 demonstrate that there are still a few possibilities for further optimizations within JULEA:

• Merging of write operations currently happens inside the data daemon. However, overall performance could be increased by already merging them in the client library to reduce the overhead of the involved network messages. This is due to the fact that the offsets and lengths of all operations have to be sent to the data daemon only to be merged there. Consequently, merging them in the client library would reduce the amount of data that has to be sent across the network; a minimal sketch of such client-side merging follows this list.

• Even though the use of TCP8 corking reduces the network overhead, it should be investigated whether small write operations can be handled more efficiently by storing the data of all operations in a contiguous buffer to reduce the number of network send operations.

• Merging of operations is currently only performed for write operations. Read operations could benefit from a similar handling, both within the client library and the data daemon.

• Scheduling many small operations within a batch currently involves many memory reallocations. More intelligent algorithms could be used to reduce the number of reallocations and thus speed up the handling of large batches.
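A minimal sketch of such client-side merging of write operations is given below; representing operations as offset and length pairs is an assumption for illustration:

def merge_write_operations(operations):
    """Coalesce contiguous or overlapping (offset, length) write operations
    so that fewer requests have to be sent to the data daemon."""
    merged = []
    for offset, length in sorted(operations):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            previous_offset, previous_length = merged[-1]
            end = max(previous_offset + previous_length, offset + length)
            merged[-1] = (previous_offset, end - previous_offset)
        else:
            merged.append((offset, length))
    return merged

# Three contiguous writes collapse into a single request.
assert merge_write_operations([(0, 4), (4, 4), (8, 4)]) == [(0, 12)]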

Summary

This chapter has summarized the insights gained in this thesis. Due to their static approaches regarding I/O semantics, traditional parallel distributed file systems cannot be suited for all possible use cases and workloads. JULEA’s dynamically adaptable semantics present a first approach for exploiting application-provided semantical information to optimize I/O performance. Additionally, several tasks for future work have been presented to improve JULEA’s coverage of the I/O stack: While support for existing applications can be eased by providing backends for ADIOS or HDF, object stores provide opportunities to become independent of underlying POSIX file systems.

8 Transmission Control Protocol


Bibliography

[10g13] 10gen, Inc. MongoDB. http://www.mongodb.org/, 2013. Last accessed: 2014-11.

[ADD+08] Nawab Ali, Ananth Devulapalli, Dennis Dalessandro, Pete Wyckoff, and P. Sadayappan. Revisiting the Metadata Architecture of Parallel File Systems. Technical Report OSU-CISRC-7/08-TR42, 2008.

[AEHH+11] Sadaf R. Alam, Hussein N. El-Harake, Kristopher Howard, Neil Stringfellow, and Fabio Verzelloni. Parallel I/O and the Metadata Wall. In Proceedings of the sixth workshop on Parallel Data Storage, PDSW ’11, pages 13–18, New York, NY, USA, 2011. ACM.

[AKGR10] Samer Al-Kiswany, Abdullah Gharaibeh, and Matei Ripeanu. The Case for a Versatile Storage System. SIGOPS Oper. Syst. Rev., (1):10–14, 01 2010.

[Bar14] Christopher Bartz. An in-depth analysis of parallel high level I/O interfaces using HDF5 and NetCDF-4. Master’s thesis, University of Hamburg, 04 2014.

[Bia08] Christoph Biardzki. Analyzing Metadata Performance in Distributed File Systems. PhD thesis, Heidelberg University, Germany, 12 2008.

[BLZ+14] D.A. Boyuka, S. Lakshminarasimhan, Xiaocheng Zou, Zhenhuan Gong, J. Jenkins, E.R. Schendel, N. Podhorszki, Qing Liu, S. Klasky, and N.F. Samatova. Transparent in Situ Data Transformations in ADIOS. In Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on, pages 256–266, May 2014.

[BVGS06] Stephan Bloehdorn, Max Völkel, Olaf Görlitz, and Simon Schenk. TagFS — Tag Semantics for Hierarchical File Systems. In Proceedings of the 6th International Conference on Knowledge Management, 2006.

[Cat10] Rick Cattell. Scalable SQL and NoSQL data stores. SIGMOD Rec., (39-4):12–27, 2010.

[CDKL14] Konstantinos Chasapis, Manuel Dolz, Michael Kuhn, and Thomas Ludwig. Evaluating Power-Performance Benefits of Data Compression in HPC Storage Servers. In Steffen Fries and Petre Dini, editors, IARIA Conference, pages 29–34. IARIA XPS Press, 04 2014.

[CEA14] CEA. Robinhood Policy Engine. https://github.com/cea-hpc/robinhood/wiki, 08 2014. Last accessed: 2014-11.

[CFF+95] Peter Corbett, Dror Feitelson, Sam Fineberg, Yarsun Hsu, Bill Nitzberg, Jean-Pierre Prost, Marc Snir, Bernard Traversat, and Parkson Wong. Overview of the MPI-IO Parallel I/O Interface. In IPPS’95 Workshop on Input/Output in Parallel and Distributed Systems, pages 1–15, 1995.

[CLR+09] Philip Carns, Sam Lang, Robert Ross, Murali Vilayannur, Julian Kunkel, and Thomas Ludwig. Small-File Access in Parallel File Systems. In Proceedings of the 2009 IEEE International Symposium on Parallel Distributed Processing, IPDPS ’09, pages 1–11, Washington, DC, USA, 2009. IEEE Computer Society.

[Clu02] Cluster File Systems, Inc. Lustre: A Scalable, High-Performance File System. http://www.cse.buffalo.edu/faculty/tkosar/cse710/papers/lustre-whitepaper.pdf, 11 2002. Last accessed: 2014-11.

[CST+11] Yong Chen, Xian-He Sun, Rajeev Thakur, Philip C. Roth, and William D. Gropp. LACIO: A New Collective I/O Strategy for Parallel I/O Systems. In Proceedings of the 2011 IEEE International Parallel and Distributed Processing Symposium, IPDPS ’11, pages 794–804, Washington, DC, USA, 2011. IEEE Computer Society.

[DLP03] Jack J. Dongarra, Piotr Luszczek, and Antoine Petitet. The LINPACK benchmark: Past, present, and future. Concurrency and Computation: Practice and Experience, 15:2003, 2003.

[DT98] Phillip M. Dickens and Rajeev Thakur. A Performance Study of Two-Phase I/O. In Proceedings of the 4th International Euro-Par Conference. Lecture Notes in Computer Science 1470, pages 959–965. Springer-Verlag, 1998.

[Duw14] Kira Isabel Duwe. Comparison of kernel and user space file systems. Bachelor’s thesis, University of Hamburg, 08 2014.

[Fel13] Dave Fellinger. The State of the Lustre File System and The Lustre Development Ecosystem. http://www.opensfs.org/wp-content/uploads/2013/04/LUG_2013_vFinal.pdf, 04 2013. Last accessed: 2014-11.

[Fuc13] Anna Fuchs. Automated File System Correctness and Performance Regression Tests. Bachelor’s thesis, University of Hamburg, 09 2013.


[FWP09] Wolfgang Frings, Felix Wolf, and Ventsislav Petkov. Scalable massively parallel I/O to task-local files. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, New York, NY, USA, 2009. ACM.

[GAKR08] Abdullah Gharaibeh, Samer Al-Kiswany, and Matei Ripeanu. Configurable security for scavenged storage systems. In Proceedings of the 4th ACM international workshop on Storage security and survivability, Storage ’08, pages 55–62, New York, NY, USA, 2008. ACM.

[Ger14] German Climate Computing Center. Tape Archive – HPSS filesystems. https://www.dkrz.de/Nutzerportal-en/doku/hpss, 11 2014. Last accessed: 2014-11.

[GGH91] Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. Performance Evaluation of Memory Consistency Models for Shared-memory Multiprocessors. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IV, pages 245–257, New York, NY, USA, 1991. ACM.

[GLL+90] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. Memory Consistency and Event Ordering in Scalable Shared-memory Multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, ISCA ’90, pages 15–26, New York, NY, USA, 1990. ACM.

[Glu11] Gluster, Inc. GlusterFS General FAQ. http://gluster.org/community/documentation/index.php/GlusterFS_General_FAQ#What_file_system_semantics_does_GlusterFS_Support.3B_is_it_fully_POSIX_compliant.3F, 05 2011. Last accessed: 2014-11.

[GWT14] GWT-TUD GmbH. Vampir. https://www.vampir.eu/, 07 2014. Last accessed: 2014-11.

[HJZ+09] Yu Hua, Hong Jiang, Yifeng Zhu, Dan Feng, and Lei Tian. SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness for Next-Generation File Systems. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, New York, NY, USA, 2009. ACM.

[HK04] Rainer Hubovsky and Florian Kunz. Dealing with Massive Data: from Parallel I/O to Grid I/O. Master’s thesis, University of Vienna, Department of Data Engineering, 01 2004.


[HNH09] Dean Hildebrand, Arifa Nisar, and Roger Haskin. pNFS, POSIX, and MPI-IO: A Tale of Three Semantics. In Proceedings of the 4th Annual Workshop on Petascale Data Storage, PDSW ’09, pages 32–36, New York, NY, USA, 2009. ACM.

[IG13] The IEEE and The Open Group. Standard for Information Technology – Portable Operating System Interface (POSIX) Base Specifications, Issue 7. IEEE Std 1003.1, 2013 Edition (incorporates IEEE Std 1003.1-2008, and IEEE Std 1003.1-2008/Cor 1-2013), pages 1–3906, April 2013.

[IMOT12] Shun Ishiguro, Jun Murakami, Yoshihiro Oyama, and Osamu Tatebe. Optimizing Local File Accesses for FUSE-Based Distributed Storage. High Performance Computing, Networking Storage and Analysis, SC Companion, 0:760–765, 2012.

[ISO11] ISO/IEC JTC 1/SC 22 – Programming languages, their environments and system software interfaces. ISO/IEC 9899:2011 – Information technology – Programming languages – C. 12 2011.

[Jan11] Christina Janssen. Evaluation of File Systems and I/O Optimization Techniques in High Performance Computing. Bachelor’s thesis, University of Hamburg, 12 2011.

[JKY00] T. Jones, A. Koniges, and R.K. Yates. Performance of the IBM General Parallel File System. In Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International, pages 673–681, 2000.

[KBB+06] Andreas Knüpfer, Ronny Brendel, Holger Brunst, Hartmut Mix, and Wolfgang E. Nagel. Introducing the Open Trace Format (OTF). In Vassil N. Alexandrov, Geert Dick Albada, Peter M.A. Sloot, and Jack Dongarra, editors, Computational Science – ICCS 2006, number 3992 in Lecture Notes in Computer Science, pages 526–533, Berlin / Heidelberg, Germany, 2006. Springer-Verlag GmbH.

[KKL08] Michael Kuhn, Julian Kunkel, and Thomas Ludwig. Directory-Based Metadata Optimizations for Small Files in PVFS. In Euro-Par ’08: Proceedings of the 14th international Euro-Par conference on Parallel Processing, pages 90–99, Berlin, Heidelberg, 2008. University of Las Palmas de Gran Canaria, Springer-Verlag.

[KKL09] Michael Kuhn, Julian Kunkel, and Thomas Ludwig. Dynamic file system semantics to enable metadata optimizations in PVFS. Concurrency and Computation: Practice and Experience, pages 1775–1788, 2009.


[KKL14] Julian Kunkel, Michael Kuhn, and Thomas Ludwig. Exascale Storage Systems – An Analytical Study of Expenses. Supercomputing Frontiers and Innovations, pages 116–134, 06 2014.

[KLL+10] Scott Klasky, Qing Liu, Jay Lofstead, Norbert Podhorszki, Hasan Abbasi, CS Chang, Julian Cummings, Divya Dinakar, Ciprian Docan, Stephanie Ethier, Ray Grout, Todd Kordenbrock, Zhihong Lin, Xiaosong Ma, Ron Oldfield, Manish Parashar, Alexander Romosan, Nagiza Samatova, Karsten Schwan, Arie Shoshani, Yuan Tian, Matthew Wolf, Weikuan Yu, Fan Zhang, and Fang Zheng. ADIOS: powering I/O to extreme scale computing. In SciDAC 2010 Conference Proceedings, pages 342–347, 2010.

[KMKL11] Julian Kunkel, Timo Minartz, Michael Kuhn, and Thomas Ludwig. Towards an Energy-Aware Scientific I/O Interface – Stretching the ADIOS Interface to Foster Performance Analysis and Energy Awareness. Computer Science - Research and Development, 2011.

[Kre06] Stephan Krempel. Tracing the Connections Between MPI-IO Calls and their Corresponding PVFS2 Disk Operations. Bachelor’s thesis, Heidelberg University, 03 2006.

[Kuh13] Michael Kuhn. A Semantics-Aware I/O Interface for High Performance Computing. In Julian Martin Kunkel, Thomas Ludwig, and Hans Werner Meuer, editors, Supercomputing, number 7905 in Lecture Notes in Computer Science, pages 408–421, Berlin, Heidelberg, 06 2013. Springer.

[Kun06] Julian Kunkel. Performance Analysis of the PVFS2 Persistency Layer. Bachelor’s thesis, Heidelberg University, 02 2006.

[Lam79] L. Lamport. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Trans. Comput., 28(9):690–691, September 1979.

[LCB13] Paul Hermann Lensing, Toni Cortes, and André Brinkmann. Direct lookup and hash-based metadata placement for local file systems. In Proceedings of the 6th International Systems and Storage Conference, SYSTOR ’13, pages 5:1–5:11, New York, NY, USA, 2013. ACM.

[LCC+12] N. Liu, J. Cope, Philip H. Carns, C. D. Carothers, Robert B. Ross, G. Grider, A. Crume, and C. Maltzahn. On the Role of Burst Buffers in Leadership-Class Storage Systems. In Proceedings of MSST/SNAPI 2012, Pacific Grove, CA, 04 2012.


[LKK+07] Thomas Ludwig, Stephan Krempel, Michael Kuhn, Julian Kunkel, and Christian Lohse. Analysis of the MPI-IO Optimization Levels with the PIOViz Jumpshot Enhancement. In Franck Cappello, Thomas Hérault, and Jack Dongarra, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, number 4757 in Lecture Notes in Computer Science, pages 213–222, Berlin / Heidelberg, Germany, 2007. Institut national de recherche en informatique et automatique, Springer.

[LKS+08] Jay F. Lofstead, Scott Klasky, Karsten Schwan, Norbert Podhorszki, and Chen Jin. Flexible IO and integration for scientific codes through the adaptable IO system (ADIOS). In Proceedings of the 6th international workshop on Challenges of large applications in distributed environments, CLADE ’08, pages 15–24, New York, NY, USA, 06 2008. ACM.

[LM10] Paulo A. Lopes and Pedro D. Medeiros. pCFS vs. PVFS: Comparing a Highly-Available Symmetrical Parallel Cluster File System with an Asymmetrical Parallel File System. In Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I, EuroPar’10, pages 131–142, Berlin, Heidelberg, 2010. Springer-Verlag.

[LMB10] Paul Lensing, Dirk Meister, and André Brinkmann. hashFS: Applying Hashing to Optimize File Systems for Small File Reads. In Proceedings of the 2010 International Workshop on Storage Network Architecture and Parallel I/Os, SNAPI ’10, pages 33–42, Washington, DC, USA, 2010. IEEE Computer Society.

[LRT04] Robert Latham, Robert B. Ross, and Rajeev Thakur. The Impact of File Systems on MPI-IO Scalability. In Dieter Kranzlmüller, Péter Kacsuk, and Jack Dongarra, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, number 3241 in Lecture Notes in Computer Science, pages 87–96. Springer, 2004.

[LRT07] Robert Latham, Robert Ross, and Rajeev Thakur. Implementing MPI-IO Atomic Mode and Shared File Pointers Using MPI One-Sided Communication. International Journal of High Performance Computing Applications, (21-2):132–143, 2007.

[MCB+07] Avantika Mathur, Mingming Cao, Suparna Bhattacharya, Andreas Dilger, Alex Tomas, and Laurent Vivier. The new ext4 filesystem: current status and future plans. In Proceedings of the Linux Symposium, 2007.

[Mea03] Ryan L. Means. Alternate Data Streams: Out of the Shadows and into the Light. http://www.giac.org/paper/gcwn/230/alternate-data-streams-shadows-light/104234, 2003. Last accessed: 2014-11.


[Mes01] Message Passing Interface Forum. Opening a File. http://www.mpi-forum.org/docs/mpi-2.0/mpi-20-html/node175.htm, 09 2001. Last accessed: 2014-11.

[Mes12] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. Version 3.0. http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf, 09 2012. Last accessed: 2014-11.

[MKB+12] Dirk Meister, Jürgen Kaiser, Andre Brinkmann, Michael Kuhn, Julian Kunkel, and Toni Cortes. A Study on Data Deduplication in HPC Storage Systems. In Proceedings of the ACM/IEEE Conference on High Performance Computing (SC). IEEE Computer Society, 11 2012.

[MM01] J. C. Mogul and G. Minshall. Rethinking the TCP Nagle Algorithm. SIGCOMM Comput. Commun. Rev., 31(1):6–20, January 2001.

[MMK+12] Timo Minartz, Daniel Molka, Julian Kunkel, Michael Knobloch, Michael Kuhn, and Thomas Ludwig. Tool Environments to Measure Power Consumption and Computational Performance, chapter 31, pages 709–743. Chapman and Hall/CRC Press Taylor and Francis Group, 6000 Broken Sound Parkway NW, Boca Raton, FL 33487, 2012.

[MSM+11] Christopher Muelder, Carmen Sigovan, Kwan-Liu Ma, Jason Cope, Sam Lang, Kamil Iskra, Pete Beckman, and Robert Ross. Visual Analysis of I/O System Behavior for High–End Computing. In Proceedings of the third international workshop on Large-scale system and application performance, LSAP ’11, pages 19–26, New York, NY, USA, 2011. ACM.

[MSMV00] Greg Minshall, Yasushi Saito, Jeffrey C. Mogul, and Ben Verghese. Application Performance Pitfalls and TCP’s Nagle Algorithm. SIGMETRICS Perform. Eval. Rev., 27(4):36–44, March 2000.

[Ora11] Oracle. Guide to Scaling Web Databases with MySQL Cluster, 10 2011.

[PG11] Swapnil Patil and Garth Gibson. Scale and Concurrency of GIGA+: File System Directories with Millions of Files. In Proceedings of the 9th USENIX Conference on File and Storage Technologies, FAST’11, pages 13–13, Berkeley, CA, USA, 2011. USENIX Association.

[PGG+09] Swapnil Patil, Garth A. Gibson, Gregory R. Ganger, Julio Lopez, Milo Polte, Wittawat Tantisiroj, and Lin Xiao. In search of an API for scalable file systems: Under the table or above it? In Proceedings of the 2009 conference on Hot topics in cloud computing, HotCloud’09, Berkeley, CA, USA, 2009. USENIX Association.


[PGLP07] Swapnil V. Patil, Garth A. Gibson, Sam Lang, and Milo Polte. GIGA+: Scalable Directories for Shared File Systems. In Proceedings of the 2Nd International Workshop on Petascale Data Storage: Held in Conjunction with Supercomputing ’07, PDSW ’07, pages 26–29, New York, NY, USA, 2007. ACM.

[PLB+09] Milo Polte, Jay Lofstead, John Bent, Garth Gibson, Scott A. Klasky, Qing Liu, Manish Parashar, Norbert Podhorszki, Karsten Schwan, Meghan Wingate, and Matthew Wolf. ...And eat it too: High read performance in write-optimized HPC I/O middleware file formats. In Proceedings of the 4th Annual Workshop on Petascale Data Storage, PDSW ’09, pages 21–25, New York, NY, USA, 2009. ACM.

[RD90] Russ Rew and Glenn Davis. Data Management: NetCDF: an Interface for Scientific Data Access. IEEE Computer Graphics and Applications, (10-4):76–82, 1990.

[RG10] Aditya Rajgarhia and Ashish Gehani. Performance and Extension of User Space File Systems. In Proceedings of the 2010 ACM Symposium on Applied Computing, SAC ’10, New York, NY, USA, 2010. ACM.

[RLG+05] R. Ross, R. Latham, W. Gropp, R. Thakur, and B. Toonen. Implementing MPI-IO atomic mode without file system support. In Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid’05) - Volume 2 - Volume 02, number 2 in CCGRID ’05, pages 1135–1142, Washington, DC, USA, 2005. IEEE Computer Society.

[Ros08] P. E. Ross. Why CPU Frequency Stalled. IEEE Spectr., 45(4):72–72, April 2008.

[Sch13] Sandra Schröder. Design, Implementation, and Evaluation of a Low-Level Extent-Based Object Store. Master’s thesis, University of Hamburg, 12 2013.

[Seh10] Saba Sehrish. Improving Performance and Programmer Productivity for I/O-Intensive High Performance Computing Applications. PhD thesis, School of Electrical Engineering and Computer Science in the College of Engineering and Computer Science at the University of Central Florida, 2010.

[SH02] Frank Schmuck and Roger Haskin. GPFS: A Shared-Disk File System for Large Computing Clusters. In Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST ’02, Berkeley, CA, USA, 2002. USENIX Association.


[SKH+08] Jan Stender, Björn Kolbeck, Felix Hupfeld, Eugenio Cesario, Erich Focht, Matthias Hess, Jesús Malo, and Jonathan Martí. Striping without Sacrifices: Maintaining POSIX Semantics in a Parallel File System. In First USENIX Workshop on Large-Scale Computing, LASCO’08, Berkeley, CA, USA, 2008. USENIX Association.

[SLG03] Thomas Sterling, Ewing Lusk, and William Gropp. Beowulf Cluster Computing with Linux. MIT Press, Cambridge, MA, USA, 2 edition, 2003.

[SM09] Margo Seltzer and Nicholas Murphy. Hierarchical File Systems are Dead. In Proceedings of the 12th conference on Hot topics in operating systems, HotOS’09, pages 1–1, Berkeley, CA, USA, 2009. USENIX Association.

[SNAKA+08] Elizeu Santos-Neto, Samer Al-Kiswany, Nazareno Andrade, Sathish Gopalakrishnan, and Matei Ripeanu. Hot Topic: Enabling Cross-Layer Optimizations in Storage Systems with Custom Metadata. In Proceedings of the 17th international symposium on High performance distributed computing, HPDC ’08, pages 213–216, New York, NY, USA, 2008. ACM.

[TGL99] Rajeev Thakur, William Gropp, and Ewing Lusk. Data Sieving and Collective I/O in ROMIO. In Proceedings of the The 7th Symposium on the Frontiers of Massively Parallel Computation, FRONTIERS ’99, pages 182–, Washington, DC, USA, 1999. IEEE Computer Society.

[The14a] The HDF Group. Hierarchical data format version 5. http://www.hdfgroup.org/HDF5, 07 2014. Last accessed: 2014-11.

[The14b] The Linux man-pages project. write(2). http://man7.org/linux/man-pages/man2/write.2.html, 05 2014. Last accessed: 2014-11.

[The14c] The TOP500 Editors. TOP500. http://www.top500.org/, 06 2014. Last accessed: 2014-11.

[Tie09] Tien Duc Tien. Tracing Internal Behavior in PVFS. Bachelor’s thesis, Heidelberg University, 10 2009.

[TRL+10] Rajeev Thakur, Robert Ross, Ewing Lusk, William Gropp, and Robert Latham. Users Guide for ROMIO: A High-Performance, Portable MPI-IO Implementation. http://www.mcs.anl.gov/research/projects/romio/doc/users-guide.pdf, 04 2010. Last accessed: 2014-11.

[TSP+11] Wittawat Tantisiriroj, Seung Woo Son, Swapnil Patil, Samuel J. Lang, Garth Gibson, and Robert B. Ross. On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pages 67:1–67:12, New York, NY, USA, 2011. ACM.

[Unk12] Unknown. nfs(5). http://linux.die.net/man/5/nfs, 10 2012. Last accessed: 2014-11.

[VLR+08] M. Vilayannur, S. Lang, R. Ross, R. Klundt, and L. Ward. Extending the POSIX I/O Interface: A Parallel File System Perspective. Technical Report ANL/MCS-TM-302, 10 2008.

[VNS05] Murali Vilayannur, Partho Nath, and Anand Sivasubramaniam. Providing Tunable Consistency for a Parallel File Store. In Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4, FAST’05, Berkeley, CA, USA, 2005. USENIX Association.

[VRC+04] Murali Vilayannur, Robert B. Ross, Philip H. Carns, Rajeev Thakur, Anand Sivasubramaniam, and Mahmut Kandemir. On the Performance of the POSIX I/O Interface to PVFS. Parallel, Distributed, and Network-Based Processing, Euromicro Conference on, page 332, 2004.

[WBM+06] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. Ceph: A Scalable, High-performance Distributed File System. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI ’06, pages 307–320, Berkeley, CA, USA, 2006. USENIX Association.

[Wik14a] Wikimedia Commons. File talk:Hard drive capacity over time.svg. http://commons.wikimedia.org/wiki/File_talk:Hard_drive_capacity_over_time.svg, 11 2014. Last accessed: 2014-11.

[Wik14b] Wikipedia. Festplattenlaufwerk – Geschwindigkeit. http://de.wikipedia.org/wiki/Festplattenlaufwerk#Geschwindigkeit, 11 2014. Last accessed: 2014-11.

[Wik14c] Wikipedia. Fork (file system). http://en.wikipedia.org/wiki/Fork_(file_system), 11 2014. Last accessed: 2014-11.

[Wik14d] Wikipedia. IOPS. http://en.wikipedia.org/wiki/IOPS, 11 2014. Last accessed: 2014-11.

[Wik14e] Wikipedia. Mark Kryder – Kryder’s Law. http://en.wikipedia.org/wiki/Mark_Kryder#Kryder.27s_Law, 11 2014. Last accessed: 2014-11.


[WKRP06] An-I Andy Wang, Geoff Kuenning, Peter Reiher, and Gerald Popek. The Conquest File System: Better Performance Through a Disk/Persistent-RAM Hybrid Design. Trans. Storage, (3):309–348, 08 2006.

[Won14] Darrick J. Wong. Ext4 Disk Layout. https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout, 11 2014. Last accessed: 2014-11.

[Wri06] Charles Philip Wright. Extending ACID Semantics to the File System via ptrace. PhD thesis, Stony Brook University, 05 2006.

[WSSZ05] Charles P. Wright, Richard Spillane, Gopalan Sivathanu, and Erez Zadok. Amino: Extending ACID Semantics to the File System. In FAST 2005 2nd Usenix Conference on File and Storage Technologies, Berkeley, CA, USA, 2005. USENIX Association.

[WSSZ07] Charles P. Wright, Richard Spillane, Gopalan Sivathanu, and Erez Zadok. Extending ACID Semantics to the File System. ACM Transactions on Storage (TOS), (2):1–42, 06 2007.

[XXSM09] Jing Xing, Jin Xiong, Ninghui Sun, and Jie Ma. Adaptive and scalable metadata management to support a trillion files. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, New York, NY, USA, 2009. ACM.

[YVCJ07] Weikuan Yu, Jeffrey Vetter, Shane R. Canon, and Song Jiang. Exploiting Lustre File Joining for Effective Collective IO. In Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, CCGRID ’07, pages 267–274, Washington, DC, USA, 2007. IEEE Computer Society.

[Zak14] Karel Zak. mount(8). http://man7.org/linux/man-pages/man8/mount.8.html, 07 2014. Last accessed: 2014-11.


Appendices


Appendix A. Additional Evaluation Results

A.1. JULEA (XFS Storage Backend)

This section contains additional evaluation results regarding JULEA’s data performance. Due to the performance problems that are present when using JULEA’s data daemon with ext4 (see Section 6.2.3 on pages 122–125), additional measurements have been conducted with XFS to assess whether these problems are specific to ext4.

Three Connections

[Plot omitted: read and write throughput in MiB/s over configurations from 1/1 to 10/120 (nodes/processes) for block sizes of 4 KiB, 16 KiB, 64 KiB, 256 KiB and 1,024 KiB]

Figure A.1.: JULEA: concurrent accesses to individual items using XFS and three connections per client


Individual Items

Figure A.1 on the preceding page shows JULEA’s performance when using individual items via the native JULEA interface using XFS and three connections per client.

Regarding read performance, the performance curve looks very similar to its counterpart using ext4 (see Figure 6.9 on page 123). The only significant difference can be observed for the configurations using seven to ten nodes: Whereas ext4’s performance decreased when using more nodes, XFS immediately drops to roughly 160 mebibytes (MiB)/s when going to seven nodes and stays at this level. Additionally, in contrast to ext4, XFS’s results are almost identical regardless of the chosen block size.

Regarding write performance, it can be seen that XFS reaches almost the same throughput for all block sizes larger than 4 kibibytes (KiB); when using ext4, the block sizes of 16 KiB and 64 KiB performed significantly worse than the block sizes of 256 KiB and 1,024 KiB. When using a block size of 4 KiB, performance is almost identical to that of ext4.

[Plot omitted: read and write throughput in MiB/s over configurations from 1/1 to 10/120 (nodes/processes) for block sizes of 4 KiB, 16 KiB, 64 KiB, 256 KiB and 1,024 KiB]

Figure A.2.: JULEA: concurrent accesses to a shared item using XFS and three connec- tions per client

Shared Item

Figure A.2 shows JULEA’s performance when using a single shared item via the native JULEA interface using XFS and three connections per client.

Regarding read performance, the performance curve looks similar to that of its counterpart using ext4 (see Figure 6.10 on page 124). However, the performance drop when using more than six client nodes is less severe. Additionally, the performance is more stable and roughly the same when using seven to ten nodes; in ext4’s case, it decreased from six to eight or nine nodes, and then increased again.

Regarding write performance, XFS remains stable until six nodes are used, similar to the read case. Afterwards, performance becomes more erratic, especially for the larger block sizes of 256 KiB and 1,024 KiB. It is interesting to note that the block size of 4 KiB achieves better performance than the block size of 16 KiB when using more than seven nodes. Overall, XFS seems to handle the shared case better than ext4; it also provides better performance, especially for the smaller block sizes.

Six Connections

[Plot omitted: read and write throughput in MiB/s over configurations from 1/1 to 10/120 (nodes/processes) for block sizes of 4 KiB, 16 KiB, 64 KiB, 256 KiB and 1,024 KiB]

Figure A.3.: JULEA: concurrent accesses to individual items using XFS and six connec- tions per client

Individual Items

Figure A.3 shows JULEA’s performance when using individual items via the native JULEA interface using XFS and six connections per client.

During the read phase, for all block sizes except for 4 KiB, the performance is mostly identical to that of its counterpart using three connections per client. When using a block size of 4 KiB, however, performance is degraded by roughly 33 %; this is likely due to the fact that the increased parallelism causes additional congestion.

During the write phase, as long as fewer than ten client nodes are used, the performance is roughly the same as that of its counterpart using three connections per client for all block sizes except for 4 KiB. While performance is decreased for large numbers of nodes when using three connections per client, it is more stable when using six connections per client. When using a block size of 4 KiB, performance is reduced by approximately 35 %; like in the read case, this is likely due to additional congestion.

[Plot omitted: read and write throughput in MiB/s over configurations from 1/1 to 10/120 (nodes/processes) for block sizes of 4 KiB, 16 KiB, 64 KiB, 256 KiB and 1,024 KiB]

Figure A.4.: JULEA: concurrent accesses to a shared item using XFS and six connec- tions per client

Shared Item

Figure A.4 shows JULEA’s performance when using a single shared item via the native JULEA interface using XFS and six connections per client.

During the read phase, performance is largely identical to that of its counterpart using three connections per client for the larger block sizes of 256 KiB and 1,024 KiB. While performance is only slightly decreased for the block sizes of 16 KiB and 64 KiB, it drops by 20 % when using a block size of 4 KiB. It is also interesting to note that the performance decrease only applies to the configurations using fewer than seven nodes, that is, performance is not degraded further for the congested case.

During the write phase, the performance is more stable than when using three connections per client, especially for the larger block sizes of 256 KiB and 1,024 KiB. While the performance using block sizes of 16 KiB and 64 KiB is more or less the same, it is decreased significantly when using a block size of 4 KiB. XFS’s peculiar performance behavior in this case hints at inefficient handling of very small accesses.

Appendix B. Usage Instructions

B.1. JULEA

The following instructions show how to set up JULEA from scratch.

B.1.1. Downloading Source Code and Dependencies

1 $ git clone https://redmine.wr.informatik.uni-hamburg.de/git/julea

Listing B.1: JULEA download

Listing B.1 shows how to download JULEA’s source code, which is available via Git.

1  # Debian/Ubuntu
2  $ apt-get install libglib2.0-dev
3  $ apt-get install libfuse-dev
4  # Fedora
5  $ yum install glib2-devel
6  $ yum install fuse-devel
7
8  $ cd julea/external
9  $ ./mongodb-client.sh
10 $ ./mongodb-server.sh
11 $ ./hdtrace.sh
12 $ ./otf.sh

Listing B.2: JULEA dependencies

Listing B.2 shows how to install all external dependencies. GLib and FUSE1 have to be installed using the software management provided by the operating system (OS); all other dependencies can be installed using the provided scripts that automatically download and compile the source code locally. Except for GLib and MongoDB, all dependencies are optional and can be omitted.

1 Filesystem in Userspace

B.1.2. Configuring, Compiling and Installing

1 $ cd julea
2 $ ./waf configure --prefix=${PWD}/install --debug

Listing B.3: JULEA configuration

1  Setting top to : ${PWD}
2  Setting out to : ${PWD}/build
3  Checking for ’gcc’ (c compiler) : /usr/bin/gcc
4  Checking for program pkg-config : /usr/bin/pkg-config
5  Checking for ’gio-2.0’ >= 2.32 : yes
6  Checking for ’glib-2.0’ >= 2.32 : yes
7  Checking for ’gmodule-2.0’ >= 2.32 : yes
8  Checking for ’gobject-2.0’ >= 2.32 : yes
9  Checking for ’gthread-2.0’ >= 2.32 : yes
10 Checking for ’fuse’ : yes
11 Checking for header bson.h : yes
12 Checking for header mongo.h : yes
13 Checking for header hdTrace.h : yes
14 Checking for header otf.h : yes
15 Checking for stat.st_mtim.tv_nsec : yes
16 ’configure’ finished successfully (0.543s)

Listing B.4: JULEA configuration output

Listing B.3 shows how to configure JULEA. It is possible to specify a custom installation prefix using the --prefix option. During development, it is strongly recommended to enable debug mode using the --debug option. The output produced by the configuration step should look similar to Listing B.4.

1 $ cd julea
2 $ ./waf
3 $ ./waf install

Listing B.5: JULEA compilation and installation

Listing B.5 shows how to compile and install JULEA once all dependencies have been installed and the project has been configured.


1 $ cd julea
2 $ ./waf environment > env.sh
3 $ . ./env.sh
4 $ julea-config --local --data=$(hostname) --metadata=$(hostname) --storage-backend=posix --storage-path=/tmp/julea-$(id -nu)

Listing B.6: JULEA configuration file

Listing B.6 shows how to create a configuration file for JULEA. The environment command allows exporting all necessary environment variables to be able to use an installation in a custom path.

B.1.3. Tests and Benchmarks

1 $ cd julea
2 $ ./waf test
3 $ ./waf benchmark

Listing B.7: JULEA tests and benchmarks

Listing B.7 shows how to run JULEA’s basic tests and benchmarks; they are integrated into JULEA’s build system and can be called through waf.

B.1.4. Documentation

1 $ cd julea
2 $ doxygen
3 $ xdg-open html/index.html

Listing B.8: JULEA documentation

Listing B.8 shows how to generate and view JULEA’s documentation.


B.2. Benchmarks

The following instructions show how to set up the benchmarks used in Chapter 6.

B.2.1. Downloading Source Code and Dependencies

1 $ git clone \
2     https://redmine.wr.informatik.uni-hamburg.de/git/julea-benchmarks

Listing B.9: Benchmarks download

Listing B.9 shows how to download the benchmarks’ source code, which is also available via Git.

1  # Debian/Ubuntu
2  $ apt-get install libglib2.0-dev
3  $ apt-get install libopenmpi-dev openmpi-bin
4  # Fedora
5  $ yum install glib2-devel
6  $ yum install openmpi
7  $ module add mpi/openmpi-$(arch)
8
9  $ cd julea-benchmarks/external
10 $ ./mongodb-client.sh
11 $ ./orangefs.sh

Listing B.10: Benchmarks dependencies

Listing B.10 shows how to install all external dependencies. While GLib and MPI2 have to be installed using the OS’s software management, all other dependencies can be installed using the provided scripts; again, they automatically download and compile the source code locally. All dependencies except for GLib are optional and can be omitted.

B.2.2. Configuring, Compiling and Installing

1 $ cd julea-benchmarks
2 $ . ${JULEA}/env.sh
3 $ ./waf configure --prefix=${PWD}/install --debug

Listing B.11: Benchmarks configuration

2 Message Passing Interface


1  Setting top to : ${PWD}
2  Setting out to : ${PWD}/build
3  Checking for ’gcc’ (c compiler) : /usr/bin/gcc
4  Checking for program mpicc : /usr/lib64/openmpi/bin/mpicc
5  Checking for program pkg-config : /usr/bin/pkg-config
6  Checking for ’glib-2.0’ >= 2.32 : yes
7  Checking for ’openssl’ : yes
8  Checking for ’julea’ : yes
9  Checking for ’lexos’ : yes
10 Checking for header bson.h : yes
11 Checking for header mongo.h : yes
12 Checking for header math.h : yes
13 Checking for header pthread.h : yes
14 Checking for header pvfs2.h : yes
15 Checking for header mpi.h : yes
16 ’configure’ finished successfully (0.479s)

Listing B.12: Benchmarks configuration output

Listing B.11 on the facing page shows how to configure the benchmarks. A custom installation prefix can be specified using the --prefix option. It is strongly recommended to enable debug mode during development; this can be accomplished using the --debug option. The configuration step should produce output similar to that shown in Listing B.12.

1 $ cd julea-benchmarks
2 $ ./waf
3 $ ./waf install

Listing B.13: Benchmarks compilation and installation

As soon as all dependencies have been installed and the project has been configured, the benchmarks can be compiled and installed as shown in Listing B.13.


B.3. Lustre

The following instructions show how to configure Lustre’s distributed namespace (DNE) as explained in Section 2.4.1 on pages 32–35.

B.3.1. Distributed Namespace

1 $ lfs mkdir --index 0 /lustre/home
2 $ lfs mkdir --index 1 /lustre/scratch

Listing B.14: Setting up Lustre’s Distributed Namespace

Listing B.14 shows the necessary commands to set up DNE in such a way that metadata accesses inside /lustre/home are handled by meta data target (MDT) number 0, while metadata accesses within /lustre/scratch are served by MDT number 1.

Appendix C. Code Examples

To give a rough idea of the structure and usability of the major input/output (I/O) interfaces, the following code examples all implement the same basic application:

1. A new file or item is created.

2. 42 bytes of data are written to the file or item.

3. The file or item’s metadata is queried.

4. 42 bytes of data are read from the file or item.

5. The file or item is closed.

6. The file or item is deleted.

The application has been implemented using the POSIX1, MPI-IO and JULEA interfaces to be able to compare them to each other. The following sections contain code examples using the interfaces’ respective functionality including error handling as well as detailed descriptions of the different implementations.

1 Portable Operating System Interface


C.1. POSIX

1  #define _POSIX_C_SOURCE 200809L
2
3  #include <fcntl.h>
4  #include <inttypes.h>
5  #include <stdio.h>
6  #include <string.h>
7  #include <sys/stat.h>
8  #include <sys/types.h>
9  #include <unistd.h>
10
11 int
12 main (int argc, char const* argv[])
13 {
14     struct stat stat_buf;
15     char data[42];
16     int fd;
17     int ret;
18     ssize_t nbytes;
19
20     memset(data, 42, sizeof(data));
21     fd = open("/tmp/posix", O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
22
23     if (fd == -1)
24     {
25         goto error;
26     }
27
28     nbytes = pwrite(fd, data, sizeof(data), 0);
29
30     if (nbytes != sizeof(data))
31     {
32         goto error;
33     }
34
35     ret = fstat(fd, &stat_buf);
36
37     if (ret == -1)
38     {


39         goto error;
40     }
41
42     nbytes = pread(fd, data, sizeof(data), 0);
43
44     if (nbytes != sizeof(data))
45     {
46         goto error;
47     }
48
49     printf("File size is %" PRIdMAX " bytes.\n", (uintmax_t)stat_buf.st_size);
50     printf("File was last modified at %" PRIdMAX ".\n", (uintmax_t)stat_buf.st_mtime);
51
52     ret = close(fd);
53
54     if (ret == -1)
55     {
56         goto error;
57     }
58
59     ret = unlink("/tmp/posix");
60
61     if (ret == -1)
62     {
63         goto error;
64     }
65
66     return 0;
67
68 error:
69     return 1;
70 }

Listing C.1: POSIX example


Listing C.1 on page 192 shows the application as implemented using the POSIX interface. First, the most recent POSIX standard is enabled by defining the appropriate preprocessor macro (line 1). Afterwards, all necessary headers are included (lines 3–9); as can be seen, the POSIX interface’s functionality is spread over a multitude of different headers.

As the actual application’s first step, the file /tmp/posix is created using the open function (line 21). While specifying the O_CREAT flag causes the file to be created, the O_TRUNC flag indicates that the file should be truncated to size 0 if it already exists. Additionally, the file is made readable and writable only by the current user by specifying the S_IRUSR and S_IWUSR permission bits. The open function’s success is checked by comparing its returned file descriptor to -1 (lines 23–26); a return value of -1 traditionally indicates that an error has happened.

Afterwards, the data is written to the file using the pwrite function at offset 0 (line 28). The function’s success is checked by comparing the returned number of written bytes to the data’s size (lines 30–33). As the next step, the file’s metadata is queried using the fstat function (line 35); it stores the metadata of an opened file into a stat structure. fstat’s success is then checked analogously to open (lines 37–40). Reading the data is accomplished using the pread function (line 42); it accepts the same parameters as the pwrite function. Again, its success is checked by comparing the number of read bytes to the data’s size (lines 44–47).

Afterwards, the file descriptor is closed using the close function (line 52). This operation can potentially fail and will return a value of -1 in that case (lines 54–57). Finally, the file is deleted using the unlink function (line 59). POSIX does not provide a way to delete a file based on an open file descriptor; therefore, the file’s path is passed to the function. Again, its success is checked by comparing its return value to -1 (lines 61–64).


C.2. MPI-IO

1  #include <mpi.h>
2
3  #include <inttypes.h>
4  #include <stdio.h>
5  #include <string.h>
6
7  int
8  main (int argc, char const* argv[])
9  {
10     MPI_File fh;
11     MPI_Offset size;
12     MPI_Status status;
13     char data[42];
14     int ret;
15     int nbytes;
16
17     MPI_Init(&argc, (char***)&argv);
18
19     memset(data, 42, sizeof(data));
20     ret = MPI_File_open(MPI_COMM_WORLD, "/tmp/mpi-io", MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
21
22     if (ret != MPI_SUCCESS)
23     {
24         goto error;
25     }
26
27     MPI_File_write_at(fh, 0, data, sizeof(data), MPI_BYTE, &status);
28     ret = MPI_Get_count(&status, MPI_BYTE, &nbytes);
29
30     if (ret != MPI_SUCCESS || nbytes != sizeof(data))
31     {
32         goto error;
33     }
34
35     ret = MPI_File_get_size(fh, &size);
36
37     if (ret != MPI_SUCCESS)
38     {


39         goto error;
40     }
41
42     MPI_File_read_at(fh, 0, data, sizeof(data), MPI_BYTE, &status);
43     ret = MPI_Get_count(&status, MPI_BYTE, &nbytes);
44
45     if (ret != MPI_SUCCESS || nbytes != sizeof(data))
46     {
47         goto error;
48     }
49
50     printf("File size is %" PRIdMAX " bytes.\n", (uintmax_t)size);
51
52     ret = MPI_File_close(&fh);
53
54     if (ret != MPI_SUCCESS)
55     {
56         goto error;
57     }
58
59     ret = MPI_File_delete("/tmp/mpi-io", MPI_INFO_NULL);
60
61     if (ret != MPI_SUCCESS)
62     {
63         goto error;
64     }
65
66     MPI_Finalize();
67
68     return 0;
69
70 error:
71     MPI_Finalize();
72
73     return 1;
74 }

Listing C.2: MPI-IO example


Listing C.2 on page 195 shows the application as implemented using the MPI-IO interface. First, MPI2’s single header mpi.h is included to make all functionality available (line 1). Before being able to use any MPI functionality, it is necessary to initialize the MPI library using the MPI_Init function (line 17). Applications that use threads have to call the MPI_Init_thread function instead.

Afterwards, the /tmp/mpi-io file is created using the MPI_File_open function (line 20). Its MPI_MODE_RDWR and MPI_MODE_CREATE flags are analogous to POSIX’s O_RDWR and O_CREAT flags, respectively. In contrast to POSIX, MPI-IO does not allow the file to be truncated if it exists already. The function’s success can be checked using its return value (lines 22–25); MPI specifies MPI_SUCCESS as well as several error codes for this purpose.

The data is then written to the file using the MPI_File_write_at function (line 27). It works in a similar fashion as POSIX’s pwrite function and accepts an offset. To signify that the data is an array of bytes, the MPI_BYTE data type is used. The MPI_Get_count function is used to check the number of written bytes (line 28). This information is used to check whether the write operation was successful (lines 30–33).

In contrast to POSIX, MPI only allows querying a limited subset of metadata. For instance, it is not possible to get the last modification time. Therefore, only the file’s size is checked using the MPI_File_get_size function (line 35). Its success can be checked using its return value (lines 37–40). Afterwards, the data is read again using the MPI_File_read_at function (line 42). It takes the same arguments as the MPI_File_write_at function and works like POSIX’s pread function. Again, the function’s success is checked using the number of read bytes as returned by the MPI_Get_count function (lines 43–48).

The file is then closed using the MPI_File_close function (line 52). Similar to POSIX, closing the file can return an error that is checked using the return value (lines 54–57). Finally, the file is deleted using the MPI_File_delete function (line 59). Like POSIX, it is necessary to specify the file name instead of being able to use an existing file handle to delete the file. Before terminating the application, it is necessary to finalize the MPI library using the MPI_Finalize function (line 66).

2 Message Passing Interface


C.3. JULEA

1  #include <julea.h>
2
3  #include <stdio.h>
4  #include <string.h>
5
6  int
7  main (int argc, char const* argv[])
8  {
9      JBatch* batch;
10     JItem* item;
11     JURI* uri;
12     gboolean ret;
13     guint64 nbytes;
14     char data[42];
15
16     j_init();
17
18     memset(data, 42, sizeof(data));
19     batch = j_batch_new_for_template(J_SEMANTICS_TEMPLATE_DEFAULT);
20
21     uri = j_uri_new("julea://tmp/tmp/julea");
22     ret = j_uri_create(uri, TRUE, NULL);
23
24     if (!ret)
25     {
26         goto error;
27     }
28
29     item = j_uri_get_item(uri);
30     j_item_write(item, data, sizeof(data), 0, &nbytes, batch);
31     j_batch_execute(batch);
32
33     if (nbytes != sizeof(data))
34     {
35         goto error;
36     }
37
38     j_item_get_status(item, J_ITEM_STATUS_ALL, batch);
39     ret = j_batch_execute(batch);


40
41     if (!ret)
42     {
43         goto error;
44     }
45
46     j_item_read(item, data, sizeof(data), 0, &nbytes, batch);
47     j_batch_execute(batch);
48
49     if (nbytes != sizeof(data))
50     {
51         goto error;
52     }
53
54     printf("File size is %" G_GUINT64_FORMAT " bytes.\n", j_item_get_size(item));
55     printf("File was last modified at %" G_GINT64_FORMAT ".\n", j_item_get_modification_time(item));
56
57     j_collection_delete_item(j_uri_get_collection(uri), item, batch);
58     ret = j_batch_execute(batch);
59
60     if (!ret)
61     {
62         goto error;
63     }
64
65     j_uri_free(uri);
66     j_batch_unref(batch);
67
68     j_fini();
69
70     return 0;
71
72 error:
73     j_fini();
74
75     return 1;
76 }

Listing C.3: JULEA example


Listing C.3 on page 198 shows the application as implemented using the JULEA interface. First, JULEA’s single header julea.h is included (line 1). This takes care of making available all of JULEA’s functionality. Before any JULEA functionality can be used, the library has to be initialized using the j_init function (line 16). A batch using the default semantics is created to be able to execute any operations (line 19).

Afterwards, a JULEA uniform resource identifier (URI) is created to refer to the tmp store, the tmp collection and the julea item (line 21). URIs provide a convenient way for application developers to use JULEA’s interface without performing too many operations manually. Afterwards, the item and its parent collection and store are created using the j_uri_create function (line 22). Its success is checked by means of the returned boolean value (lines 24–27).

Afterwards, the data is written to the item by scheduling a write operation using the j_item_write function (line 30) and executing the batch (line 31). The write operation’s success can be checked by comparing the number of written bytes to the data’s size (lines 33–36). The item’s metadata is queried by scheduling a get status operation using the j_item_get_status function (line 38) and executing the batch (line 39). Similar to POSIX, JULEA provides a single function to query an item’s metadata; JULEA allows specifying which metadata should be returned, however. By specifying the J_ITEM_STATUS_ALL flag, all metadata is returned. Again, the j_batch_execute function’s boolean return value is used to check the operation’s success (lines 41–44).

Reading the data is performed by scheduling a read operation using the j_item_read function (line 46) and then executing the batch (line 47). Its success can be checked by comparing the number of read bytes to the data’s size (lines 49–52). Finally, the item is deleted by scheduling its deletion using the j_collection_delete_item function (line 57) and executing the batch (line 58). Again, its success is checked using the j_batch_execute function’s boolean return value (lines 60–63).

In contrast to POSIX and MPI-IO, JULEA does not provide a function to explicitly close an item. Instead, the item is closed implicitly by freeing the URI (line 65). Before terminating the application, the JULEA library has to be finalized using the j_fini function (line 68).

Index

A file system namespace ...... 46, 54 adaptable semantics . . 21, 55, 111, 160 cloud ...... 47 ADIOS ...... 17, 40, 84, 160, 162 JULEA ...... 54 Amazon S3 ...... 47 POSIX ...... 46 asynchronous I/O ...... 43, 58 FLOPS ...... 13 FUSE ...... 106 B batch . 56, 58, 84, 85, 130, 140, 141, 144, G 145, 147, 148, 160, 164, 165 Google Cloud Storage ...... 47 block storage ...... 25 H BSON ...... 101 hashing ...... 76 C HDF ...... 17, 24, 39, 162 checkpoint ...... 14, 20, 150 HDTrace ...... 103 collective I/O ...... 50 heuristics ...... 56, 81 command line interface ...... 107 HPC ...... 13, 157 communication protocol ...... 52 HTTP ...... 48 context switch ...... 88, 159 I cork ...... 80 I/O interface ...... 17, 35, 157 correctness ...... 108 ADIOS ...... 40, 84, 162 CPU speed ...... 13, 18 JULEA ...... 55, 122, 138, 198 D MPI-IO ...... 37, 117, 120, 195 data distribution ...... 70, 98 POSIX ...... 35, 106, 115, 137, 192 round robin ...... 31, 98, 101 SIONlib ...... 38 single server ...... 98 I/O requirements ...... 15, 157, 158 weighted ...... 98 I/O semantics . 18, 25, 42, 77, 157, 158, data server ...... 30, 49, 89, 91, 93 162 data transformation ...... 67, 84 JULEA ...... 56, 60, 122, 128, 139 deserialization ...... 101 MPI-IO ...... 44 distributed metadata . . . . 33, 70, 75, 91 NFS ...... 43, 63 POSIX ...... 42, 77, 157 F I/O stack ...... 17, 23, 26, 50 file system ...... 26, 80, 88 inode ...... 27


IOPS ...... 30 S semantics K atomicity 44, 61, 67, 77, 82, 117, 133, kernel space ...... 33, 87, 159 163 L concurrency ...... 62, 67, 142 consistency ...... 63 layers ...... 25, 50 ordering ...... 63, 82 Lustre . . . . . 16, 24, 32, 87, 114, 137, 149 persistency ...... 64, 67 M safety ...... 65, 67, 132, 146 metadata ...... 27, 75 semantics template ...... 68 JULEA ...... 71, 88 serialization ...... 101 metadata server ...... 30, 49, 89–91 SIONlib ...... 38 mode switch ...... 88, 159 storage backend ...... 93 MongoDB ...... 88, 92, 101, 160 GIO ...... 93 MPI ...... 23, 150 LEXOS ...... 93, 164 MPI-IO ...... 24, 37 NULL ...... 93, 98, 125 POSIX ...... 93, 97 N ZFS ...... 93 Nagle ...... 80 storage capacity ...... 18 NetCDF ...... 17, 23, 26, 40, 162 storage speed ...... 18 NFS ...... 16 striping ...... 59, 120, 126, 131 Sunshot ...... 103 O synchronous I/O ...... 43 object store ...... 29, 51, 88, 94, 164 OrangeFS ...... 35, 87, 119 T OTF ...... 103, 104 TCP ...... 80, 122 TOP500 ...... 13, 20 P tracing ...... 103, 104 parallel distributed file system . 15, 30, transaction ...... 80, 163 78 path U cloud ...... 47 user space ...... 35, 87, 98, 159 JULEA ...... 55 POSIX ...... 46 V path delimiter ...... 46, 55 Vampir ...... 103, 104 performance analysis ...... 53 VFS ...... 27, 87, 106, 159, 164 performance assessment . 20, 112, 149 W performance history ...... 108 working directory ...... 55 POSIX ...... 17, 26, 35 wrapper ...... 106 preload ...... 106

R regression ...... 108

List of Acronyms

ABI ...... Application binary interface ACID ...... Atomicity, consistency, isolation and durability ACL ...... Access control list ADIO ...... Abstract-Device Interface for I/O ADIOS ...... Adaptable IO System Amazon S3 ...... Amazon Simple Storage Service API ...... Application programming interface ASCII ...... American Standard Code for Information Interchange BDB ...... Berkeley DB BSON ...... Binary JavaScript Object Notation btrfs ...... B-tree file system CPU ...... Central processing unit DMU ...... Data management unit DNE ...... Distributed namespace EBOFS ...... Extent and B-tree-based Object File System ECC ...... Error-correcting code EOFS ...... European Open File Systems FIFO ...... First in, first out FLOPS ...... Floating-point operations per second FTP ...... File Transfer Protocol FUSE ...... Filesystem in Userspace GB ...... Gigabyte (109 bytes) Gbit ...... Gigabit (109 bits) GETM ...... General Estuarine Transport Model GiB ...... Gibibyte (230 bytes) GPFS ...... General Parallel File System GPL ...... GNU General Public License HDD ...... Hard disk drive HDF ...... Hierarchical Data Format HFS+ ...... Hierarchical File System Plus HPC ...... High performance computing HTTP ...... Hypertext Transfer Protocol I/O ...... Input/output ID ...... Identifier IOPS ...... Input/output operations per second


IP ...... Internet Protocol IPC ...... Inter-process communication IPS ...... Instructions per second JSON ...... JavaScript Object Notation KB ...... Kilobyte (103 bytes) KiB ...... Kibibyte (210 bytes) LEXOS ...... Low-Level Extent-Based Object Store MB ...... Megabyte (106 bytes) Mbit ...... Megabit (106 bits) MDS ...... Meta data server MDT ...... Meta data target MiB ...... Mebibyte (220 bytes) MPI ...... Message Passing Interface NetCDF ...... Network Common Data Form NFS ...... Network File System NIC ...... Network interface card NTFS ...... New Technology File System OpenSFS ...... Open Scalable File Systems OS ...... Operating system OSS ...... Object storage server OST ...... Object storage target OTF ...... Open Trace Format PB ...... Petabyte (1015 bytes) PDE ...... Partial differential equation POSIX ...... Portable Operating System Interface PVFS ...... Parallel Virtual File System RAID ...... Redundant array of independent disks RAM ...... Random access memory RPM ...... Revolutions per minute RTT ...... Round-trip time SAN ...... Storage area network SSD ...... Solid state drive SSH ...... Secure Shell TB ...... Terabyte (1012 bytes) TCP ...... Transmission Control Protocol URI ...... Uniform resource identifier URL ...... Uniform resource locator VCS ...... Version control system VFS ...... Virtual file system (or virtual filesystem switch) XML ...... Extensible Markup Language ZFS ...... Zettabyte File System

List of Figures

1.1. TOP500 performance development from 1993–2014 [The14c] ...... 14 1.2. Parallel access from multiple clients and distribution of data ...... 16 1.3. Parallel distributed file system ...... 16 1.4. Simplified view of the I/O stack ...... 18 1.5. Development of HDD capacities and speeds [Wik14a, Wik14b] . . . . . 19

2.1. I/O stacks used in traditional and HPC applications ...... 24 2.2. Levels of abstraction found in the HPC I/O stack ...... 26 2.3. Structure of a 256 bytes inode (struct ext4_inode) [Won14] . . . . . 28 2.4. Round-robin data distribution ...... 31 2.5. Lustre architecture ...... 33 2.6. One client accessing a file inside a Lustre file system ...... 34

3.1. JULEA’s file system components ...... 50 3.2. Current HPC I/O stack and proposed JULEA I/O stack ...... 51 3.3. JULEA namespace example ...... 54

5.1. JULEA’s general architecture ...... 90 5.2. Traces of the client and data daemon’s activities ...... 105 5.3. Performance history over time ...... 109

6.1. Access pattern using individual files ...... 113 6.2. Access pattern using a single shared file ...... 113 6.3. Lustre: concurrent accesses to individual files via the POSIX interface . 115 6.4. Lustre: concurrent accesses to a shared file via the POSIX interface . . 117 6.5. Lustre: concurrent atomic accesses to individual files via the MPI-IO interface ...... 118 6.6. Lustre: concurrent atomic accesses to a shared file via the MPI-IO interface ...... 119 6.7. OrangeFS: concurrent accesses to individual files via the MPI-IO interface ...... 120 6.8. OrangeFS: concurrent accesses to a shared file via the MPI-IO interface ...... 121 6.9. JULEA: concurrent accesses to individual items ...... 123 6.10. JULEA: concurrent accesses to a shared item ...... 124 6.11. JULEA: concurrent accesses to individual items using the NULL storage backend ...... 125


6.12. JULEA: concurrent accesses to a shared item using the NULL storage backend ...... 126 6.13. JULEA: concurrent accesses to individual items ...... 128 6.14. JULEA: concurrent accesses to a shared item ...... 129 6.15. JULEA: concurrent batch accesses to individual items ...... 130 6.16. JULEA: concurrent accesses to individual items using unsafe safety semantics ...... 132 6.17. JULEA: concurrent accesses to individual items using per-operation atomicity semantics ...... 133 6.18. Lustre: concurrent metadata operations to individual directories via the POSIX interface ...... 137 6.19. Lustre: concurrent metadata operations to a shared directory via the POSIX interface ...... 138 6.20. JULEA: concurrent metadata operations to a shared collection . . . . . 139 6.21. JULEA: concurrent batch metadata operations to a shared collection . . 141 6.22. JULEA: concurrent batch accesses to individual stores ...... 142 6.23. JULEA: concurrent metadata operations to a shared collection using serial concurrency semantics ...... 143 6.24. JULEA: concurrent batch metadata operations to a shared collection using serial concurrency semantics ...... 144 6.25. JULEA: concurrent batch accesses to individual stores using serial concurrency semantics ...... 145 6.26. JULEA: concurrent metadata operations to a shared collection using unsafe safety semantics ...... 146 6.27. JULEA: concurrent batch metadata operations to a shared collection using unsafe safety semantics ...... 147 6.28. JULEA: concurrent batch accesses to individual stores using unsafe safety semantics ...... 148 6.29. partdiff checkpointing using one process per node ...... 152 6.30. partdiff checkpointing using six processes per node ...... 153

A.1. JULEA: concurrent accesses to individual items using XFS and three connections per client ...... 181 A.2. JULEA: concurrent accesses to a shared item using XFS and three connections per client ...... 182 A.3. JULEA: concurrent accesses to individual items using XFS and six connections per client ...... 183 A.4. JULEA: concurrent accesses to a shared item using XFS and six connections per client ...... 184

List of Listings

2.1. POSIX I/O interface ...... 36 2.2. MPI-IO I/O interface ...... 37 2.3. SIONlib parallel I/O example ...... 39 2.4. ADIOS XML configuration ...... 41 2.5. ADIOS code ...... 41 2.6. posix_fadvise ...... 43 2.7. MPI-IO’s sync-barrier-sync construct ...... 44 2.8. Amazon S3 and Google Cloud Storage URLs ...... 47

3.1. Executing multiple operations in one batch ...... 57 3.2. Using multiple batches with different semantics ...... 57 3.3. Executing batches asynchronously ...... 58 3.4. Determining the optimal access size ...... 59 3.5. Adapting semantics templates ...... 70

4.1. Amino transactions ...... 80 4.2. TCP corking ...... 81 4.3. Memory operation reordering ...... 82 4.4. Atomic variables in C11 ...... 82 4.5. Atomic operations in C11 ...... 83 4.6. ADIOS read scheduling ...... 84 4.7. ADIOS variable transformation (XML) ...... 84 4.8. ADIOS variable transformation ...... 85

5.1. MongoDB document in JSON format ...... 92 5.2. JULEA’s storage backend interface ...... 94 5.3. JULEA’s POSIX storage backend ...... 96 5.4. JULEA’s NULL storage backend ...... 97 5.5. Data distribution interface ...... 99 5.6. Round robin distribution ...... 100 5.7. JSON representation of an item’s metadata using default semantics . . 102 5.8. JSON representation of an item’s metadata using custom semantics . . 102 5.9. JULEA tracing framework ...... 104 5.10. FUSE file system ...... 107 5.11. JULEA command line tools ...... 108


7.1. ADIOS extensions ...... 162 7.2. JULEA transactions ...... 163

B.1. JULEA download ...... 185 B.2. JULEA dependencies ...... 185 B.3. JULEA configuration ...... 186 B.4. JULEA configuration output ...... 186 B.5. JULEA compilation and installation ...... 186 B.6. JULEA configuration file ...... 187 B.7. JULEA tests and benchmarks ...... 187 B.8. JULEA documentation ...... 187 B.9. Benchmarks download ...... 188 B.10. Benchmarks dependencies ...... 188 B.11. Benchmarks configuration ...... 188 B.12. Benchmarks configuration output ...... 189 B.13. Benchmarks compilation and installation ...... 189 B.14. Setting up Lustre’s Distributed Namespace ...... 190

C.1. POSIX example ...... 192 C.2. MPI-IO example ...... 195 C.3. JULEA example ...... 198

List of Tables

1.1. Comparison of important components in different types of computers 20

2.1. IOPS for exemplary HDDs and selected SSDs [Wik14d] ...... 30

6.1. partdiff matrix size depending on the number of client nodes . . . . . 151


Statutory Declaration

I hereby declare in lieu of an oath that I have written this dissertation myself and have not used any sources or aids other than those indicated.

Place, date Signature