Efficient Task-Local I/O Operations of Massively Parallel Applications
Total Page:16
File Type:pdf, Size:1020Kb
On current large-scale HPC systems often occur I/O patterns that produce a high load on the file system during access to checkpoint and restart files. Applications running on systems with distributed memory will often perform such I/O individually by creating task-local file objects on the file system. At large scale, these task-local I/O patterns impose substantial stress on the metadata management components of the I/O subsystem. Such metadata contention occurs also at the startup of dynamically linked applications while searching for library files. The reason for these limitations is that the serial I/O components of the operating system do not take advantage of application parallelism. To avoid the above bottlenecks, this work describes two novel approaches which exploit the knowledge of application parallelism, the underlying I/O subsystem structure, the parallel file system configuration, and the network between HPC-system and I/O system to coordinate and optimize access to file-system objects. The underlying methods are implemented in two tools, SIONlib and Spindle, which add layers between the parallel application and the corresponding POSIX-based standard interfaces of the operating system, eliminating the need for modifying the underlying system software. SIONlib is already applied in applications to implement efficient checkpointing and is also integrated in the performance-analysis tools Scalasca and Score-P to efficiently store trace data. Latest benchmarks on the Blue Gene/Q in Jülich demonstrate that SIONlib solves the metadata problem at large scale by running efficiently up to 1.8 million tasks while maintaining high I/O bandwidths of 60-80% of file-system peak with a negligible file-creation time. The scalability of Spindle could be demonstrated by running a benchmark on a cluster of Lawrence Livermore National Laboratory at large scale. The results show that the startup of dynamically linked applications is now feasible on more than 15000 tasks, whereas the overhead of Spindle I/O Operations Task-Local Efficient is nearly constantly low. With SIONlib and Spindle, this work demonstrates how scalability of operating system components can be improved without modifying them and without changing the I/O patterns of applications. In this way, SIONlib and Spindle represent prototype implementations of functionality needed by next-generation runtime systems. Frings Wolfgang This publication was edited at the Jülich Supercomputing Centre (JSC) which is an integral part of the Institute for Advanced Simulation (IAS). The IAS combines the Jülich simulation sciences and the supercomputer facility in one organizational unit. It includes those parts of the scientific institutes at Forschungszentrum Jülich which use simulation on supercomputers as their main research methodology. IAS Series IAS Series Efficient Task-Local I/O Operations of Massively Parallel Applications Wolfgang Frings IAS Series Volume 30 ISBN 978-3-95806-152-1 30 Member of the Helmholtz Association Member of the Schriften des Forschungszentrums Jülich IAS Series Band 30 Forschungszentrum Jülich GmbH Institute for Advanced Simulation (IAS) Jülich Supercomputing Centre (JSC) Efficient Task-Local I/O Operations of Massively Parallel Applications Wolfgang Frings Schriften des Forschungszentrums Jülich IAS Series Band 30 ISSN 1868-8489 ISBN 978-3-95806-152-1 Impressum: Englisches Impressum: „Bibliographic information published by the Deutsche Nationalbibliothek. Die Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie;; detailed bibliographic data are available in the Internet at http://dnb.dBibliografische Information-nb.de .der” Deutschen Nationalbibliothek. Die Deutsche Nationalbibliothek verzeichnet diese Publikation in der Deutschen Nationalbibliografie; detaillierte Bibliografische Daten sind im Internet über http://dnb.d-nb.de abrufbar. Publisher Forschungszentrum Jülich GmbH and Distributer: Zentralbibliothek, Verlag … Cover Design: Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH Printer: …. Copyright: Forschungszentrum Jülich 2016 Herausgeber Forschungszentrum Jülich GmbH und Vertrieb: Zentralbibliothek, Verlag 52425 Jülich Schriften des ForschungszentrumsTel.: +49 2461 61-5368 Jülich IAS Series Fax: Volume +49 2461 30 61-6103 E-Mail: [email protected] www.fz-juelich.de/zb DUmschlaggestaltung: 82 (Diss. RWTH Jülich Supercomputing Aachen Centre, University, Forschungszentrum 2016) Jülich GmbH Druck: Grafische Medien, Forschungszentrum Jülich GmbH ISSNCopyright: 1868 -8489Forschungszentrum Jülich 2016 ISBNSchriften 9 des78 Forschungszentrums-3-95806-152 Jülich-1 IAS Series Volume 30 D 82 (Diss. RWTH Aachen University, 2016) PersistentISSN 1868-8489 Identifier: urn:nbn:de:0001-2016062000 ISBN 978-3-95806-152-1 ThePersistent complete Identifier: urn:nbn:de:0001-2016062000 volume is freely available on the Internet on the Jülicher Open Access Server (JuSER) at www.fz -juelich.de/zb/openaccess The complete volume is freely available on the Internet on the Jülicher Open Access Server (JuSER) at www.fz-juelich.de/zb/openaccess ThisThis is isan Open an Access Open publication Accesspublication distributed distributed under the under terms of the the termsCreative of the CommonsCreative Commons AttributionAttribution License License 4.0, which 4.0, which permits permits unrestricted unrestricted use, distribution, use, and distribution, reproduction and in reproduction in any anymedium, medium, provided provided the the original original work is work properly is cited. properly cited. U4: IAS Series Volume 30 ISBN 978-3-95806-152-1 Titelei als PDF bitte an h.lexis@fz-juelich.de Abstract Applications on current large-scale HPC systems use enormous numbers of processing ele- ments for their computation and have access to large amounts of main memory for their data. Nevertheless, they still need file-system access to maintain program and application data per- sistently. Characteristic I/O patterns that produce a high load on the file system often occur during access to checkpoint and restart files, which have to be frequently stored to allow the application to be restarted after program termination or system failure. On large-scale HPC systems with distributed memory, each application task will often perform such I/O indivi- dually by creating task-local file objects on the file system. At large scale, these I/O patterns impose substantial stress on the metadata management components of the I/O subsystem. For example, the simultaneous creation of thousands of task-local files in the same directory can cause delays of several minutes. Also at the startup of dynamically linked applications, such metadata contention occurs while searching for library files and induces a comparably high metadata load on the file system. Even mid-scale applications cause in such load scenarios startup delays of ten minutes or more. Therefore, dynamic linking and loading is nowadays not applied on large HPC systems, although dynamic linking has many advantages for mana- ging large code bases. The reason for these limitations is that POSIX I/O and the dynamic loader are implemented as serial components of the operating system and do not take advantage of the parallel nature of the I/O operations. To avoid the above bottlenecks, this work describes two novel approa- ches for the integration of locality awareness (e.g., through aggregation or caching) into the serial I/O operations of parallel applications. The underlying methods are implemented in two tools, SIONlib and Spindle, which exploit the knowledge of application parallelism to coordi- nate access to file-system objects. In addition, the applied methods also use knowledge of the underlying I/O subsystem structure, the parallel file system configuration, and the network bet- ween HPC-system and I/O system to optimize application I/O. Both tools add layers between the parallel application and the POSIX-based standard interfaces of the operating system for I/O and dynamic loading, eliminating the need for modifying the underlying system software. SIONlib is already applied in several applications, including PEPC, muphi, and MP2C, to im- plement efficient checkpointing. In addition, SIONlib is integrated in the performance-analysis tools Scalasca and Score-P to efficiently store and read trace data. Latest benchmarks on the Blue Gene/Q in Julich¨ demonstrate that SIONlib solves the metadata problem at large scale by running efficiently up to 1.8 million tasks while maintaining high I/O bandwidths of 60-80% of file-system peak with a negligible file-creation time. The scalability of Spindle could be de- monstrated by running the Pynamic benchmark, a proxy benchmark for a real application, on a cluster of Lawrence Livermore National Laboratory at large scale. The results show that the startup of dynamically linked applications is now feasible on more than 15000 tasks, whereas the overhead of Spindle is nearly constantly low. With SIONlib and Spindle, this work demonstrates how scalability of operating system com- ponents can be improved without modifying them and without changing the I/O patterns of applications. In this way, SIONlib and Spindle represent prototype implementations of func- tionality needed by next-generation runtime systems. Zusammenfassung Auf heutigen Supercomputer-Systemen belasten parallele Anwendungen, welche regelmaßig¨ Checkpoints der im Hauptspeicher befindlichen Simulationsdaten erstellen, das Dateisystem enorm. Zum Beispiel werden auf Supercomputern