Design and Performance Evaluation of a Software Framework for Multi-Physics Simulations on Heterogeneous Supercomputers

Entwurf und Performance-Evaluierung eines Software-Frameworks für Multi-Physik-Simulationen auf heterogenen Supercomputern

Der Technischen Fakultät der Universität Erlangen-Nürnberg zur Erlangung des Grades

Doktor-Ingenieur

vorgelegt von

Dipl.-Inf. Christian Feichtinger

Erlangen, 2012

Als Dissertation genehmigt von der Technischen Fakultät der Universität Erlangen-Nürnberg

Tag der Einreichung: 11. Juni 2012
Tag der Promotion: 24. Juli 2012
Dekan: Prof. Dr. Marion Merklein
Berichterstatter: Prof. Dr. Ulrich Rüde
                  Prof. Dr. Takayuki Aoki
                  Prof. Dr. Gerhard Wellein

Abstract

Despite the experience of several decades, the numerical simulation of computational fluid dynamics is still an enormously challenging and active research field. Most simulation tasks of scientific and industrial relevance require the modeling of multiple physical effects and complex numerical algorithms, and have to be executed on supercomputers due to their high computational demands. Facing these complexities, the reimplementation of the entire functionality for each simulation task, forced by inflexible, non-maintainable, and non-extendable implementations, is not feasible and bound to fail. The requirements to solve the involved research objectives can only be met in an interdisciplinary effort and by a clean and structured software development process leading to usable, maintainable, and efficient software designs on all levels of the resulting software framework.

The major scientific contribution of this thesis is the thorough design and implementation of the software framework WaLBerla, which is suitable for the simulation of multi-physics simulation tasks centered around the lattice Boltzmann method. The design goal of WaLBerla is to be usable, maintainable, and extendable, as well as to enable efficient and scalable implementations on massively parallel supercomputers. In addition, a performance analysis of lattice Boltzmann simulations has been conducted on heterogeneous supercomputers using an MPI, a hybrid, and a heterogeneous parallelization approach on over 1000 GPUs. With the help of a performance model for the communication overhead, the parallel performance has been accurately estimated, understood, and optimized. It is shown that WaLBerla introduces no significant performance overhead and that efficient hardware-aware implementations are possible in WaLBerla.

Furthermore, the applicability and flexibility of WaLBerla is demonstrated in simulations of particulate flows and nanofluids on CPU-GPU clusters. By the successful application of WaLBerla in various simulation tasks, and by the analysis of the performance and the software quality with the help of quality criteria, it is shown that the design goals of WaLBerla have been fulfilled.

Zusammenfassung

Die numerische Simulation von strömungsmechanischen Anwendungen ist trotz jahrzehntelanger Erfahrung immer noch ein aktives Forschungsfeld mit enormen Herausforderungen. Viele der wissenschaftlich und industriell relevanten Anwendungen erfordern die Modellierung einer Vielzahl physikalischer Effekte sowie komplexe numerische Algorithmen und sind aufgrund ihres Rechenaufwandes nur auf Supercomputern ausführbar. Im Angesicht dieser Komplexität ist die zeitaufwendige Reimplementierung der gesamten Funktionalität für jede neue Anwendung, erzwungen durch unflexible, nicht wartbare und nicht erweiterbare Implementierungen, kein adäquater Ansatz und zum Scheitern verurteilt. Die Anforderungen zum Lösen der Problemstellungen können nur in interdisziplinären Forschungsvorhaben mit Hilfe eines systematischen und strukturierten Softwareentwicklungsprozesses erfüllt werden.

Der zentrale wissenschaftliche Beitrag dieser Arbeit ist die nachhaltige Entwicklung des Software-Frameworks WaLBerla zur Simulation von Multi-Physik-Anwendungen mittels der Lattice-Boltzmann-Methode. Das Designziel von WaLBerla ist, sowohl anwendbar, flexibel und erweiterbar zu sein, als auch effiziente und skalierbare Implementierungen für massiv parallele Simulationen auf Supercomputern zu ermöglichen. Weiterhin wurde in dieser Arbeit eine Performance-Studie von Lattice-Boltzmann-Simulationen für Parallelisierungsansätze basierend auf MPI sowie hybriden und heterogenen Parallelisierungen auf heterogenen CPU-GPU-Clustern mit über 1000 GPUs durchgeführt. Mit Hilfe eines Performance-Modells für den Parallelisierungs-Overhead konnte die parallele Performance akkurat vorhergesagt, untersucht und optimiert werden. Es konnte gezeigt werden, dass WaLBerla keinen signifikanten Performance-Overhead aufweist und dass effiziente Implementierungen in WaLBerla möglich sind.

Außerdem wurde mit Hilfe von Fluidisierungs- und Nano-Fluid-Simulationen auf CPU-GPU-Clustern die Anwendbarkeit und Flexibilität von WaLBerla demonstriert. Durch die Analyse der Performance und der Softwarequalität anhand von Qualitätskriterien und den erfolgreichen Einsatz von WaLBerla in unterschiedlichen Simulationsszenarien konnte gezeigt werden, dass die Designziele von WaLBerla erreicht wurden.

Acknowledgements

I would like to express my gratitude to my PhD supervisor Prof. Dr. Ulrich Rüde for giving me the great possibility to work on this project. Thank you very much for the ongoing support of my work, for helpful discussions, and for encouragement throughout the time. This also includes the continuous support of the WaLBerla project as well as providing the environment necessary for its development.

I wish to thank Prof. Dr. Gerhard Wellein not only for taking the time to review my PhD thesis, but also for the thought-provoking impulses on HPC computing and the excellent cooperation within the SKALB project.

I would like to extend my gratitude to Prof. Dr. Takayuki Aoki (Aoki Laboratory, Tokyo Institute of Technology, Japan), who kindly agreed to review my PhD thesis. I would also like to thank him for his scientific support, especially for providing the compute time on the Tsubame 2.0 supercomputer. The dissertation would not have been possible in this way without it.

Many thanks also to the HPC team of the RRZE: Thomas Zeiser, Georg Hager, Markus Wittmann, and Jan Treibig. On the one hand for the great teamwork within the SKALB project - I really enjoyed working with you - and on the other hand for the long-lasting support, the useful advice, teaching me how to program efficiently, and how to "fool the masses".

A big "thanks a lot" to all my colleagues, especially to Frank Deserno, Tobias Preclik, Björn Gmeiner, Iris Weiss, and Gaby Fleig. Thank you for the friendly atmosphere and all the help. It has always been fantastic talking to and laughing with you. This also includes the WaLBerla team: Kristina Pickl, Dominik Bartuschat, Jan Götz, Regina Ammer, Daniela Anderl, Matthias Markl, Christian Godenschwager, Florian Schornbaum, and Martin Bauer. I especially enjoyed your cooperative way of promoting our project.

Special thanks to:

• Klaus Iglberger, who was not only my roommate for many years, but also has been a great friend of mine and my priceless source of C++ knowledge. Thank you so much for the invaluable help with the final steps of this work.
• Johannes Habich for the great time, the awesome discussions, and for always listening to my complaints.
• Harald Köstler, who despite being constantly overloaded always has an open ear and great new ideas.
• Stefan Donath for numerous fruitful discussions, sometimes provocative, but always rewarding. Also thank you for being there for me in times of trouble.
• Felipe Aristizabal for the pleasant collaboration and the enjoyable video conferences.

I would like to give credit to the people who supported me in writing this thesis by proofreading parts of it: Klaus and Stefanie Iglberger (not to forget: thanks for the pile of sugar), Sabine Feichtinger, Johannes Habich, Thomas Zeiser, Georg Hager, Regina Ammer, Felipe Aristizabal, Stefan Donath, Harald Köstler, Philipp Neumann, Martin Bauer, Kristina Pickl, and Dominik Bartuschat.

I owe immense gratitude to my friends for continuous support and necessary distractions: Dirk, Julia, Anke, Alex, and Silvia.

Very special thanks to my wife Sabine, who gave me strength, taught me discipline, and backed me up during the final PhD phase that seemed to last for years. I deeply thank you also for your unquestioning love, even if I'll be abroad for half a year.

My parents and also my parents-in-law receive my immense gratitude for their continuous support, unconditional love, and always believing in me.

Contents

1 Motivation and Outline
   1.1 The Need of HPC Software Design for Complex Flow Simulations

I Methods and Architectures

2 Brief Introduction to the Lattice Boltzmann Method
   2.1 The Lattice Boltzmann Method
      2.1.1 Boundary Conditions
   2.2 The Fluctuating Lattice Boltzmann Method
      2.2.1 The MRT Collision Operator
      2.2.2 The FLBM Collision Operator

3 Current Supercomputer Architectures
   3.1 General Cluster Setup
   3.2 The NVIDIA GF100 Tesla "Fermi" GPU

4 Implementation Details
   4.1 The CUDA Architecture and the CUDA C Language
   4.2 Kernel Implementation
      4.2.1 The LBM Kernel for CPUs
      4.2.2 The LBM Kernel for NVIDIA GPUs
      4.2.3 The Fluctuation LBM Kernel for NVIDIA GPUs

II WaLBerla - An HPC Software Framework for Multi-Physics Simulations

5 Designing the Software Framework WaLBerla
   5.1 Deriving the Design Goals of WaLBerla
   5.2 The WaLBerla Prototypes and their Physical Design

6 Analysis of the Logical Software Design of WaLBerla
   6.1 UIDs as Meta Data
   6.2 Kernel Management
   6.3 Breaking Complex Simulation Tasks Down into Sweeps
   6.4 Domain Decomposition
   6.5 Sequence Control
   6.6 Configuration and Implementation of Simulation Tasks

7 Software Quality
   7.1 Usability
   7.2 Reliability
   7.3 Portability
   7.4 Maintainability and Expandability
   7.5 Efficiency and Scalability

8 Modules and Tools available in WaLBerla
   8.1 The Module for the Simulation Data
   8.2 Boundary Condition Handling for Complex Geometries
   8.3 The WaLBerla Logging Tool
   8.4 Performance and Time Measurements

9 Lattice Boltzmann Software Frameworks

10 Discussion and Conclusion
   10.1 Discussion on the Development and Significance of Maintainable and Usable Software Frameworks in Academia
      10.1.1 Contra
      10.1.2 Pro
      10.1.3 Final Remarks
   10.2 Conclusion

III The WaLBerla Parallelization Design Concept - Efficiency and Flexibility for Hybrid and Heterogeneous Simulations on Current Supercomputers

11 Software Design for Flexible and Efficient Parallelizations
   11.1 Introduction to Parallel Computing
   11.2 A Flexible MPI Parallelization Concept Suitable for Various Simulation Tasks
   11.3 Multi-GPU Parallelization Concept
   11.4 Overlapping Work and Communication
   11.5 A Software Design Concept for a Shared Memory Tasking Model
   11.6 A Design Study for Heterogeneous Simulations on GPU Clusters

12 Modeling the Performance of Parallel Pure-LBM Simulations
   12.1 Performance Model for the Kernel Performance
   12.2 Performance Model for the Communication Overhead
      12.2.1 Bandwidth Measurements
      12.2.2 Intra-Node Communication
      12.2.3 Inter-Node Communication

13 Performance Results for Pure-LBM Fluid Flows
   13.1 Single Node Performance of LBM Simulation on CPU and GPUs
      13.1.1 Simulations Using a Single GPU
      13.1.2 Simulations Using Two GPUs
      13.1.3 Simulations Using Three GPUs
      13.1.4 Simulations Using a Single CPU Compute Node
   13.2 Performance Evaluation of Parallel Simulations
      13.2.1 GPU Performance Results for the Pure-MPI Parallelization Approach
      13.2.2 GPU Performance Results for the Hybrid Parallelization Approach
      13.2.3 GPU Performance Results for the Heterogeneous Parallelization Approach
   13.3 CPU Performance Results
   13.4 Performance Evaluation for Flows through a Porous Medium

14 Conclusion

IV Complex Flow Simulations - Applying WaLBerla

15 Direct Numerical Simulation of Rigid Bodies in Particulate Flows
   15.1 The Particulate Flows Algorithm
   15.2 Mapping the Rigid Bodies
      15.2.1 Algorithmic Details
      15.2.2 GPU Implementation Details
   15.3 Force Computation for the Rigid Bodies
      15.3.1 Algorithmic Details
      15.3.2 GPU Implementation Details

16 Particulate Flow Validation and Showcases
   16.1 Validation
      16.1.1 Validation of the Drag Force
      16.1.2 Validation of the Terminal Velocity
   16.2 Showcase: Fluidization
      16.2.1 Single GPU Performance
      16.2.2 Heterogeneous Performance of Particulate Flow Simulations
      16.2.3 Weak and Strong Scaling Experiments
   16.3 Showcase: Nanofluids
      16.3.1 Validation: Diffusing Sphere
      16.3.2 Performance of the FLBM

17 Simulation Tasks in WaLBerla
   17.1 Free-Surface Simulations
   17.2 Fast Electron Beam Melting
   17.3 Self-Propelled Particle Complexes
   17.4 Electroosmotic Flows

18 Conclusion

V Conclusion and Future Work

19 Conclusion

20 Future Work

List of Tables

List of Figures

Listings

Bibliography

Selected Conference and Journal Publications

1 Motivation and Outline

1.1 The Need of HPC Software Design for Complex Flow Simulations

Figure 1.1: Flow under a common bridge involving turbulence, fluid-structure interaction, and sedimentation processes.

Fluid flows are present everywhere in our daily life, but also in many industrial and scientific processes. Numerical simulations of fluid flows have been carried out since the early 1960s, but the field of computational fluid dynamics (CFD) is still active and challenging. The reason for this can be easily demonstrated with the help of an example, i.e. the flow under a common bridge (Fig. 1.1). In this example several physical phenomena can be observed, e.g. turbulence, fluid-structure interaction, and sedimentation processes, to name just a few. Each of the referred phenomena happens on a different time scale, i.e. turbulence in the order of milliseconds, fluid-structure interaction approximately in the order of seconds, and the depicted sedimentation in the order of months or years. The underlying numerical simulation approach needs to implement a mathematical model for each phenomenon as well as bridge the time and length scales. Computationally, this results in a multi-physics and multi-scale implementation coupling various simulation approaches for massively parallel simulations, which require advanced optimization strategies such as grid refinement, dynamic load balancing, and data reduction for post-processing and restart mechanisms. In the face of this complexity it is impossible to reimplement the complete functionality each time a new simulation task is investigated, a model has evolved, or further physical aspects can be captured due to an increase in the available compute resources. Only by means of a structured software engineering process can the complexity involved in the numerical simulation of complex CFD tasks be mastered.

Contribution of this Thesis The main contribution of this thesis is the thorough design and implementation of the software framework WaLBerla. Its primary design goals are efficient and scalable massively parallel simulations centered around fluid flows modeled by the lattice Boltzmann (LB) method (LBM) and the suitability for various multi-physics simulation tasks offered by a maintainable software design. Moreover, WaLBerla aims at speeding up the development cycle of simulation tasks and at providing a platform for collaboration and knowledge transfer. To demonstrate that the design goals have been met, it is shown that despite the flexibility of WaLBerla efficient implementations are possible. This has been achieved with the help of a systematic analysis of the parallel performance on heterogeneous CPU-GPU clusters using over 1000 GPUs. The suitability of WaLBerla for multi-physics simulations is confirmed by an overview of the integrated simulation tasks. A direct numerical simulation approach for particulate flows optimized for GPUs is described in detail and applied in simulations of fluidizations and for nanofluids. In summary, the contributions of this thesis are

• The development of software design concepts for
   – flexible, maintainable, efficient, and scalable simulations centered around the LBM, and for
   – pure-MPI, hybrid, and heterogeneous simulations on CPUs and GPUs.
• The implementation of
   – major parts of WaLBerla, i.e. sequence control, parallelization, data structures, functionality management, logging, time measurement, and
   – heterogeneous LBM and particulate flow simulations for CPU-GPU clusters.
• The analysis of the performance of LBM simulations with the help of
   – a performance model for parallel LBM simulations,
   – weak and strong scaling experiments of heterogeneous LBM and particulate flow simulations on CPUs and GPUs, and
   – weak and strong scaling experiments of flows through porous media on CPU-GPU clusters.
• The application of WaLBerla
   – in particulate flow simulations including Brownian motion for sub-micron particles.

Structure of this Thesis This thesis is divided into five Parts. An introduction to the applied methods, the utilized architectures and clusters, and the implementation details of the kernels for the LB method is given in Part I. Part II derives the design goals for WaLBerla and analyzes its software design. In order to demonstrate that the design goals of WaLBerla have been met, the parallel performance of WaLBerla is investigated for over 1000 GPUs in Part III, and an overview of the simulation tasks integrated in WaLBerla is given in Part IV. Finally, the thesis is concluded in Part V.


Part I

Methods and Architectures


2 Brief Introduction to the Lattice Boltzmann Method

2.1 The Lattice Boltzmann Method

The LB method (LBM) has evolved over the last two decades and is today widely accepted in academia and industry for solving incompressible flows. Conceptually, the macroscopic flow for the LBM is described by the evolution of PDFs (particle distribution functions) $f(\vec{x}^*, t^*, \vec{\xi})$, which represent the probability to find particles at time $t^*$ in a volume around position $\vec{x}^*$ with a microscopic velocity close to $\vec{\xi}$. Hence, the LBM is a mesoscopic approach based on the averaged behavior of particles, in contrast to the treatment of individual particles as in microscopic models such as molecular dynamics simulations [1], or to descriptions of the flow field based on macroscopic field variables under the influence of forces and stresses as in macroscopic approaches directly solving the Navier-Stokes equations [2]. In detail, the LBM has been designed as a simplified gas kinetic model based on a special velocity-discrete version of the Boltzmann equation [3], which is given by

\frac{\partial f}{\partial t^*} + \vec{\xi} \cdot \frac{\partial f}{\partial \vec{x}^*} + \frac{\vec{F}}{m} \cdot \frac{\partial f}{\partial \vec{\xi}} = Q(f, f).   (2.1)

Figure 2.1: The D3Q19 model.

For a simplification of the non-linear collision operator $Q$, the BGK approximation [4] is applied:

Q_{BGK} = \frac{1}{\lambda} \left( f^{(eq)} - f \right).   (2.2)

This simplification preserves the major properties of the Boltzmann equation, i.e. the second law of thermodynamics (H-theorem). In particular, the simplified collision operator describes the evolution of the PDFs towards the thermodynamic equilibrium, i.e. a local Maxwell distribution $f^{(eq)}(\rho, \vec{u}, T)$ [3] dependent on the macroscopic density $\rho$, the velocity $\vec{u}$, and the temperature $T$. The velocity space of the LBM is discretized to a finite set of discrete microscopic velocities $\vec{e}_\alpha$, leading to a discrete set of PDFs, $f(\vec{x}^*, t^*, \vec{\xi}) \Rightarrow f(\vec{x}, t, \vec{e}_\alpha) = f_\alpha(\vec{x}, t)$. In this thesis, the D3Q19 model, resulting in nineteen PDFs, and the microscopic velocities

\vec{e}_\alpha =
\begin{cases}
(0, 0, 0),         & \alpha = C \\
(\pm c, 0, 0),     & \alpha = E, W \\
(0, \pm c, 0),     & \alpha = N, S \\
(0, 0, \pm c),     & \alpha = T, B \\
(\pm c, \pm c, 0), & \alpha = NE, NW, SE, SW \\
(\pm c, 0, \pm c), & \alpha = TE, TW, BE, BW \\
(0, \pm c, \pm c), & \alpha = TN, TS, BN, BS
\end{cases}   (2.3)

are applied. The resulting stencil is depicted in Fig. 2.1. Further discretization in time and space [3] using finite differences, the single relaxation time (SRT) collision operator, and $\delta\vec{x} = \vec{e}_\alpha \delta t$, $\delta t = 1$, $\delta x = 1$ leads to the lattice BGK model (LBGK) [5] [6]:

f_\alpha(\vec{x} + \vec{e}_\alpha, t+1) - f_\alpha(\vec{x}, t) = -\frac{1}{\tau} \left( f_\alpha(\vec{x}, t) - f_\alpha^{(eq)} \right).   (2.4)

The relaxation time $\tau$ can be determined from the macroscopic viscosity $\nu = (\tau - 0.5\,\delta t)\, c_s^2$, with $\frac{1}{\tau} \in \left]0, \frac{2}{\delta t}\right[$ [7]. The local Maxwell distribution $f^{(eq)}$ is discretized by a Taylor expansion for low Mach numbers, for isothermal flows, and for incompressible fluids to

f_\alpha^{(eq)}(\rho(\vec{x},t), \vec{u}(\vec{x},t)) = w_\alpha \left[ \rho + \rho_0 \left( \frac{\vec{e}_\alpha \cdot \vec{u}}{c_s^2} + \frac{(\vec{e}_\alpha \cdot \vec{u})^2}{2 c_s^4} - \frac{\vec{u}^2}{2 c_s^2} \right) \right],   (2.5)

where the lattice weights $w_\alpha$ for the D3Q19 model [3] are defined to be

w_\alpha =
\begin{cases}
1/3,  & |\vec{e}_\alpha| = 0, \\
1/18, & |\vec{e}_\alpha| = 1, \\
1/36, & |\vec{e}_\alpha| = \sqrt{2}.
\end{cases}   (2.6)

The local equilibrium function $f_\alpha^{(eq)}$ is only dependent on the macroscopic density $\rho(\vec{x},t)$ and the macroscopic velocity $\vec{u}(\vec{x},t)$:

\rho = \rho_0 + \delta\rho = \sum_{\alpha=0}^{18} f_\alpha   (2.7)

\rho_0 \vec{u} = \sum_{\alpha=0}^{18} \vec{e}_\alpha f_\alpha   (2.8)

p = c_s^2 \rho,   (2.9)

where $\rho$ consists of a constant part $\rho_0$ and a perturbation $\delta\rho$. $c_s$ is the speed of sound, which for the D3Q19 system is equal to $c_s = \frac{c}{\sqrt{3}}$, $c = \frac{\delta x}{\delta t} = 1$.
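For illustration, Eq. 2.5 together with the weights of Eq. 2.6 can be written as a small helper function. The sketch below is a minimal C version; the direction ordering of the tables w and e and the function name f_eq are chosen for this example only and are not taken from the WaLBerla code. The incompressible form with $\rho_0 = 1$ and $c_s^2 = 1/3$ is assumed; a function of this kind is called by the unoptimized implementation in List. 2.1 below.

enum { X = 0, Y = 1, Z = 2 };

/* Lattice weights and discrete velocities of the D3Q19 model (Eqs. 2.3, 2.6).
 * The direction ordering is illustrative.                                    */
static const double w[19] = {
    1.0/3.0,
    1.0/18.0, 1.0/18.0, 1.0/18.0, 1.0/18.0, 1.0/18.0, 1.0/18.0,
    1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0,
    1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0 };

static const int e[3][19] = {
    /* x */ { 0,  1,-1, 0, 0, 0, 0,  1,-1, 1,-1,  1,-1, 1,-1,  0, 0, 0, 0 },
    /* y */ { 0,  0, 0, 1,-1, 0, 0,  1, 1,-1,-1,  0, 0, 0, 0,  1,-1, 1,-1 },
    /* z */ { 0,  0, 0, 0, 0, 1,-1,  0, 0, 0, 0,  1, 1,-1,-1,  1, 1,-1,-1 } };

/* Equilibrium distribution f_alpha^(eq) (Eq. 2.5), incompressible form with
 * rho_0 = 1 and c_s^2 = 1/3.                                                 */
double f_eq(int alpha, double rho, double u_x, double u_y, double u_z)
{
    const double cs2  = 1.0/3.0;
    const double eu   = e[X][alpha]*u_x + e[Y][alpha]*u_y + e[Z][alpha]*u_z;
    const double usqr = u_x*u_x + u_y*u_y + u_z*u_z;
    return w[alpha] * ( rho + eu/cs2 + (eu*eu)/(2.0*cs2*cs2) - usqr/(2.0*cs2) );
}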

Asymptotic Analysis and Assumptions It can be shown that the NS (Navier-Stokes) equations can be derived from the LBM by means of an asymptotic analysis [3] [8], and that the LBGK is equivalent to a finite difference scheme solving the NS equations with second order spatial accuracy and first order temporal accuracy [9]. In summary, the following assumptions apply for the LBM:

• The Boltzmann equation is only valid for dilute gases, e.g. ideal gases, as only binary collisions are assumed.
• The BGK approximation for fluids near thermodynamic equilibrium has been applied.
• The discretization of the velocity space is only valid for low Knudsen numbers, $Kn = \frac{\delta l}{L} \ll 1$, and also results in the H-theorem no longer being fulfilled.
• The low Mach number expansion of $f^{(eq)}$ restricts the LBGK model to low Mach numbers, $Ma = \frac{|\vec{u}|}{|\vec{\xi}|} \ll 1$.
• The discretization of the LBM results in uniform grids of cubic cells, $\delta\vec{x} = \vec{e}_\alpha \delta t$, $\delta t = 1$, $\delta x = 1$.

• The LBM is weakly compressible with a speed of sound $c_s$.

Unit-Less Lattice Quantities Usually, fluid dynamics simulations should be based on unit-less quantities. For the LBM the following approach has been chosen to parametrize the physical quantities (denoted by $^+$) into unit-less lattice quantities:

\delta x = \frac{\delta x^+}{\delta x^+} = 1, \quad \delta t = \frac{\delta t^+}{\delta t^+} = 1, \quad \rho_0 = \frac{\rho_0^+}{\rho_0^+} = 1   (2.10)

\vec{u} = \vec{u}^+ \cdot \frac{\delta t^+}{\delta x^+}, \quad \nu = \nu^+ \cdot \frac{\delta t^+}{(\delta x^+)^2}   (2.11)
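As a concrete illustration of this parametrization, the short C++ sketch below converts hypothetical physical values (chosen only for this example, not taken from the thesis) into lattice quantities and derives the relaxation time from the viscosity relation given after Eq. 2.4.

#include <iostream>

// Minimal sketch of the lattice parametrization of Eqs. 2.10/2.11.
// All physical input values (+) below are illustrative only.
int main()
{
    const double dx_phys = 1.0e-3;   // lattice spacing in m
    const double dt_phys = 1.0e-3;   // time step in s
    const double nu_phys = 1.0e-4;   // kinematic viscosity in m^2/s
    const double u_phys  = 0.05;     // flow velocity in m/s

    // Unit-less lattice quantities (Eq. 2.11)
    const double u_lattice  = u_phys  * dt_phys / dx_phys;              // 0.05
    const double nu_lattice = nu_phys * dt_phys / (dx_phys * dx_phys);  // 0.1

    // Relaxation time from nu = (tau - 0.5 dt) c_s^2 with c_s^2 = 1/3, dt = 1
    const double tau = 3.0 * nu_lattice + 0.5;                          // 0.8

    std::cout << "u = " << u_lattice << ", nu = " << nu_lattice
              << ", tau = " << tau << "\n";
}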

Figure 2.2: LBM stream and collide step, illustrating the pull and push schemes (post-stream, equilibrium, and post-collision states).

Computational Aspects Computationally, the LBM is based on a uniform grid of cubic cells that are updated in each time step using an information exchange with nearest neighbor cells only. A basic unoptimized implementation of the LBM is depicted in List. 2.1. Two steps can be identified: the stream and the collide step. To implement the stream step there exist two possibilities, which are depicted in Fig. 2.2. The pull approach has been chosen for this thesis, as it is well suited for the WaLBerla parallelization and boundary condition handling. Additionally, the implementation uses two grids due to the data dependencies of the stream step, but there exist several optimization strategies requiring only one grid [10]. The LBGK method results in a simple numerical implementation, which is well suited for parallelization due to the local update scheme. The disadvantages of the presented LBGK method are the SRT collision operator, the explicit first order discretization in time, the high memory consumption for the 19 PDFs, and the restriction to uniform grids in the case of curved boundaries. However, there exist extensions to partly overcome some of these limitations, e.g. [11] [12] [13] [14] [15].

for(int z=1; z<=endZ; ++z)
 for(int y=1; y<=endY; ++y)
  for(int x=1; x<=endX; ++x){

   ////////////////////////////////////
   // Compute Macroscopic Quantities //
   ////////////////////////////////////

   rho = 0; u_x = 0; u_y = 0; u_z = 0;

   for(alpha=0; alpha<19; ++alpha){

    ////////////
    // Stream //
    ////////////

    // Pull PDF from grid one - the source grid //
    pdf[alpha] = pdf_src(x-e[X][alpha], y-e[Y][alpha], z-e[Z][alpha], alpha);

    rho += pdf[alpha];
    u_x += e[X][alpha] * pdf[alpha];
    u_y += e[Y][alpha] * pdf[alpha];
    u_z += e[Z][alpha] * pdf[alpha];
   }

   /////////////
   // Collide //
   /////////////

   for(alpha=0; alpha<19; ++alpha)
    // Collide the PDF with the help of the equilibrium function f_eq //
    pdf_dst(x, y, z, alpha) = pdf[alpha] - 1/tau * (pdf[alpha] - f_eq(alpha, rho, u_x, u_y, u_z));
  }

// Swap grids for the next time step //
swap(pdf_src, pdf_dst);

Listing 2.1: Unoptimized implementation of the LBM.

2.1.1 Boundary Conditions

Having direct influence on the overall order of accuracy of the entire method, boundary conditions are a crucial part of LBM simulations. Especially due to the restriction to uniform grids, the handling of boundary conditions for curved boundaries is important for the LBM. In the literature, numerous boundary conditions have been proposed, e.g. [11] [16]. The following boundary conditions have been applied in this thesis.

No-Slip Boundary Conditions Solid walls are usually treated by the so-called bounce-back boundary conditions for the LBM. In this thesis the link bounce-back boundary conditions [17] are used, where the wall is placed halfway between the nodes. In particular, the incoming post-collision PDFs $\tilde{f}_\alpha$ in the fluid cell $\vec{x}_f$ are bounced back or inverted at the wall of the boundary cell $\vec{x}_b$ to $f_{inv(\alpha)}$. The scheme is depicted in Fig. 2.3.

f_{inv(\alpha)}(\vec{x}_f, t+1) = \tilde{f}_\alpha(\vec{x}_f, t) + 2 w_\alpha c_s^{-2} \rho_0 \, \vec{e}_{inv(\alpha)} \cdot \vec{u}_w   (2.12)

The velocity of the fluid at the wall $\vec{u}(\vec{x}_w)$ is equal to the velocity of the wall $\vec{u}_w$, so that for $\vec{u}_w = 0$ solid walls at rest can be treated, and for $\vec{u}_w \neq 0$ the boundary condition can be applied for inlets or moving walls. For straight walls located halfway between obstacle and fluid cell, the bounce-back boundary condition has second order accuracy. To obtain the same order of accuracy for curved boundary conditions, interpolation schemes incorporating the actual position of the wall have to be applied [11] [17].
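A minimal sketch of how Eq. 2.12 might be applied to a single boundary link is given below. It reuses the illustrative w and e tables from the sketch after Eq. 2.9; the function name, the passed inverse-direction index inv_a, and the wall velocity components uw_* are assumptions for this example, and the details of where the result is stored depend on the chosen streaming scheme.

/* Velocity bounce-back (Eq. 2.12): given the post-collision PDF of direction
 * alpha in the fluid cell, return the PDF entering the cell from the wall in
 * the opposite direction inv(alpha) at the next time step. In D3Q19,
 * w_alpha = w_inv(alpha), so the weight of the inverse direction is used.   */
double bounce_back(double f_tilde_alpha, int inv_a,
                   double uw_x, double uw_y, double uw_z)
{
    const double cs2  = 1.0/3.0;   /* c_s^2 in lattice units */
    const double rho0 = 1.0;
    const double e_uw = e[X][inv_a]*uw_x + e[Y][inv_a]*uw_y + e[Z][inv_a]*uw_z;

    /* Wall at rest (uw = 0): plain reflection. Moving wall: momentum term. */
    return f_tilde_alpha + 2.0 * w[inv_a] * rho0 * e_uw / cs2;
}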

Pressure Boundary Conditions In order to simulate outlet boundary conditions, pressure boundary conditions are applied in this thesis [11] [16]. Pressure boundary conditions do not influence the mass flux through the boundary, and the pressure at the boundary $p_{w0}$ is predefined.

f_{inv(\alpha)}(\vec{x}_f, t+1) = -\tilde{f}_\alpha(\vec{x}_f, t) + 2 f_\alpha^{(eq)+}(\rho_{w0}, \vec{u}(\vec{x}_b, \hat{t})) + \left(2 - \frac{1}{\tau}\right) \left[ f_\alpha^{+}(\vec{x}_f, t) - f_\alpha^{(eq)+}(\rho(\vec{x}_f, t), \vec{u}(\vec{x}_f, t)) \right]   (2.13)

f_\alpha^{(eq)+}(\rho, \vec{u}) = w_\alpha \left[ \rho + \rho_0 \left( 4.5 (\vec{e}_\alpha \cdot \vec{u})^2 - 1.5 \, \vec{u}^2 \right) \right]   (2.14)

f_\alpha^{+}(\vec{x}, t) = \frac{1}{2} \left( f_\alpha(\vec{x}, t) + f_{inv(\alpha)}(\vec{x}, t) \right).   (2.15)

The desired pressure is inserted into the symmetric equilibrium distribution function $f_\alpha^{(eq)+}$ by $\rho_{w0} = c_s^{-2} p_{w0}$, while the velocity $\vec{u}(\vec{x}_b, \hat{t})$ is extrapolated from the fluid.

2.2 The Fluctuating Lattice Boltzmann Method

The random movement of sub-micron particles suspended in a fluid is called Brownian motion, in honor of the biologist Robert Brown, who was the first to discover the effect. For sufficiently large particles, collisions between fluid and suspended particle can be neglected due to a mass ratio of $\frac{m_{Fluid}}{m_{Particle}} \approx 0$. However, for sub-micron particles the irregular collisions add a significant amount of momentum to the particle, leading to the random movement. Hence, a standard macroscopic description of the interaction of fluid and particle is no longer possible.

Figure 2.3: The bounce-back boundary condition for solid walls ($\vec{x}_f$ = fluid cell, $\vec{x}_b$ = boundary cell, with $\vec{u}_i = \vec{u}_w$ at the wall).

The fluctuating lattice Boltzmann method (FLBM) proposed by Dünweg et al. [18] adds random fluctuations to the LBM collision operator, so that the Brownian motion of suspended sub-micron particles can be covered. It is based on the multi relaxation time (MRT) [13] collision operator, which is superior to the SRT collision operator, for which all moments are relaxed equally. Consequently, the MRT collision operator introduces several relaxation parameters. In the following, a brief overview highlighting the most prominent features of the FLBM is given. A detailed analysis and derivation of the FLBM can be found in [18] [19].

2.2.1 The MRT Collision Operator

The LBM using the MRT collision operator can be formulated as

f~(t+1) f~ = MT DM(f~ f~(eq)). (2.16) − − Here, the orthogonal matrix M transforms the PDFs from the velocity space into the moment space ~m = M(f~), where m1 corresponds to ρ, and m2 4 to ~u. See − Tab. 2.1 for the higher moment. For the relaxation towards the thermodynamic equilibrium the diagonal matrix D = diag(sα) is used resulting in a total of 19 relaxation parameters. However, some of them are equal as depicted in Tab 2.1. In order to transform the moments back into velocity space the transpose of the matrix M is used. Note, that the SRT model is a special case of the MRT model. 1 Using the relaxation parameters sconserved =0 and snon conserved = τ − results in the SRT model. − The structure of M is not unique. For this thesis the matrix proposed by Dünweg et al. [18] is adopted

m_{k\alpha} = \frac{\sqrt{w_\alpha}}{\|\vec{n}_k\|_w} \, n_{k\alpha}, \qquad \|\vec{n}_k\|_w = \sum_{\alpha=0}^{18} n_{k\alpha} \, n_{k\alpha} \, w_\alpha.   (2.17)

 k   n_k                           s_α   Macroscopic Quantity
 1   1                             0     Density
 2   e_α1                          0     Momentum
 3   e_α2                          0     Momentum
 4   e_α3                          0     Momentum
 5   e_α² − 1                      s_b   Bulk stress
 6   3e_α1² − e_α²                 s_s   Shear stress
 7   e_α2² − e_α3²                 s_s   Shear stress
 8   e_α1 e_α2                     s_s   Shear stress
 9   e_α2 e_α3                     s_s   Shear stress
10   e_α1 e_α3                     s_s   Shear stress
11   (3e_α² − 5) e_α1              s_g   Ghost mode
12   (3e_α² − 5) e_α2              s_g   Ghost mode
13   (3e_α² − 5) e_α3              s_g   Ghost mode
14   (e_α2² − e_α3²) e_α1          s_g   Ghost mode
15   (e_α3² − e_α1²) e_α2          s_g   Ghost mode
16   (e_α1² − e_α2²) e_α3          s_g   Ghost mode
17   3e_α⁴ − 6e_α² + 1             s_g   Ghost mode
18   (2e_α² − 3)(3e_α1² − e_α²)    s_g   Ghost mode
19   (2e_α² − 3)(e_α2² − e_α3²)    s_g   Ghost mode

Table 2.1: Entry vectors for the MRT transformation matrix proposed in [18].

The vectors $\vec{n}_k$ are given in Tab. 2.1. For the FLBM, Eq. 2.16 has been reformulated to

\vec{f} = \vec{f}^{(neq)} + \vec{f}^{(eq)}   (2.18)

\vec{f}^{(t+1)} = \vec{f}^{(neq)} + \vec{f}^{(eq)} + M^T D M \vec{f}^{(neq)} = \vec{f}^{(eq)} + M^T D^* M \vec{f}^{(neq)},   (2.19)

where the diagonal relaxation matrix $D$ is modified to $D^* = \mathrm{diag}(1 + s_\alpha) = \mathrm{diag}(\gamma_\alpha)$. The relaxation parameter $\gamma_b = \gamma_5$ is determined by $\gamma_b = \frac{\eta_{bulk} - \frac{1}{9}}{\eta_{bulk} + \frac{1}{9}}$, and the parameters $\gamma_s = \gamma_6 = \gamma_7 = \gamma_8 = \gamma_9 = \gamma_{10}$ by $\gamma_s = \frac{\eta - \frac{1}{6}}{\eta + \frac{1}{6}}$. In this thesis the ghost modes $\gamma_g = \gamma_{11-19}$ have been set to zero in accordance with [20] [21].

2.2.2 The FLBM Collision Operator

For the FLBM, Brownian motion is viewed as a continuous stochastic process. The following assumptions have been made [19] for the process:

• The probability of the thermal fluctuations is independent of space and time.
• The mean of the fluctuations is zero.

In particular, the collision operator of the FLBM is extended by a random term $R_\alpha$ containing random numbers and by normalizing the non-equilibrium distribution functions $f_\alpha^{(neq)}$ by $n_\alpha$ [18] [19]:

g_\alpha = n_\alpha f_\alpha^{(neq)},   (2.20)

\vec{\tilde{g}} = M^T \left( D^* M \vec{g} + \vec{R} \right),   (2.21)

f_\alpha^{(t+1)} = f_\alpha^{(eq)} + n_\alpha^{-1} \tilde{g}_\alpha   (2.22)

R_\alpha = \varphi_\alpha r_\alpha, \quad \mu = 3 k_B T, \quad n_\alpha = (\mu \rho w_\alpha)^{-\frac{1}{2}},   (2.23)

with $r_\alpha$ being random numbers with a Gaussian distribution of zero mean and unit variance. $\varphi_\alpha$ controls the variance of the random fluctuations $R_\alpha$. $\varphi_{0-4}$ is equal to zero, as no fluctuations have to be added to the conserved moments, $\varphi_b = \varphi_5$ is equal to $\varphi_b = \sqrt{1 - \gamma_b^2}$, and $\varphi_s = \varphi_{6-10} = \sqrt{1 - \gamma_s^2}$. The variance for the ghost modes $\varphi_g = \varphi_{11-19}$ is also set to zero, as it can be shown that they do not contribute to the stress tensor and hence have no influence on the hydrodynamic level [19]. Thus, a total of six random numbers have to be generated per lattice cell update. A schematic overview of the FLBM algorithm can be seen in List. 2.2.

// Compute gamma_bulk, gamma_shear, nu_bulk, nu, mu, varphi_k, M, M_T //
for(int z=1; z<=endZ; ++z)
 for(int y=1; y<=endY; ++y)
  for(int x=1; x<=endX; ++x){

   // Reset the moments //
   for(k=0; k<19; ++k) m[k] = 0;

   for(alpha=0; alpha<19; ++alpha){

    // Stream //
    pdf[alpha] = pdf_src(x-e[X][alpha], y-e[Y][alpha], z-e[Z][alpha], alpha);

    // Compute Macroscopic Quantities //
    // Rho //
    m[0] += pdf[alpha];
    // u_i //
    m[1] += e[X][alpha] * pdf[alpha];
    m[2] += e[Y][alpha] * pdf[alpha];
    m[3] += e[Z][alpha] * pdf[alpha];
   }
   rho = m[0];

   // Determine and Scale f_neq //
   for(alpha=0; alpha<19; ++alpha){
    feq[alpha] = f_eq(alpha, m[0], m[1], m[2], m[3]);
    pdf[alpha] = 1/sqrt(mu*rho*w[alpha]) * (pdf[alpha] - feq[alpha]);
   }

   // Determine the Higher Moments, Collide, and Add Random Fluctuations //
   for(k=4; k<19; ++k){
    for(alpha=0; alpha<19; ++alpha)
     m[k] += M[k][alpha] * pdf[alpha];

    // gauss_rand(): Gaussian random number with zero mean and unit variance //
    m[k] = gamma[k]*m[k] + varphi[k]*gauss_rand();
   }

   // Transform Back to Velocity Space and Scale by n_alpha^-1 //
   for(k=0; k<19; ++k){
    g = 0;
    for(alpha=0; alpha<19; ++alpha)
     g += M_T[k][alpha] * m[alpha];

    pdf_dst(x, y, z, k) = feq[k] + g*sqrt(mu*rho*w[k]);
   }
  }

// Swap grids for the next time step //
swap(pdf_src, pdf_dst);

Listing 2.2: Unoptimized implementation of the FLBM.

3 Current Supercomputer Architectures

Today's supercomputers are dominated by clusters built from "commodity" compute nodes that are also available to the end user. The general concept is to connect several of these nodes by a fast interconnect. A peek at the current TOP 10 of the TOP 500 list (Nov. 2011) [22] shows mostly AMD or Intel x86 based clusters, e.g. Jaguar [23], Tianhe-1A [24], or Tsubame 2.0 [25]. Another important fact is that all these three are heterogeneous clusters using NVIDIA GPUs¹. GPUs offer high performance and are energy efficient, e.g. Tsubame 2.0 reaches 0.85 GFLOPS per Watt while Jaguar reaches 0.25 GFLOPS per Watt. However, the Green Top 500 [27] shows that not only the use of accelerators can deliver energy efficient performance, but also the clustering of low power compute nodes, e.g. the Blue Gene/Q Prototype 2 [28]. The first place in the TOP 500 list, the K Computer [29], achieves 0.83 GFLOPS per Watt and a LINPACK [30] performance of 10.5 PFLOPS. One of the first clusters in the TOP 10 to introduce heterogeneous compute nodes has been Roadrunner [31], which was also the first to achieve a peak performance of over 1 PFLOPS. This trend towards heterogeneous compute nodes having specialized and efficient units in parallel to general purpose units is likely to continue on the way to exascale computing, as energy is one of the four major architectural challenges of exascale computing, i.e. energy, memory, interconnects, and resilience [32]. For example, the EU project DEEP [33] (dynamical exascale entry platform) pursues this approach and aims at building an exascale cluster by the end of the decade by coupling an Intel Xeon cluster to an accelerator cluster. Throughout this thesis the LIMA cluster [34] (Intel Xeon) and the Tsubame 2.0 [25] (Intel Xeon + NVIDIA GPU) have been used; they are described in the following.

¹ Jaguar has been updated in 2011 after the release of the TOP 500 list and is now known as Titan [26]. Titan will feature 14,592 compute nodes with NVIDIA "Kepler" GPUs after its second update phase.

Cluster                             LIMA               Tsubame 2.0
Compute Nodes                       500                1442
Processor                           Intel Xeon X5650   Intel Xeon X5670
GPU                                 -                  NVIDIA Tesla M2050
GPUs per Compute Node               -                  3
LINPACK [30] Performance [TFLOPS]   56.7               1192
Power Consumption [kW]              160                1398.61
Flops per Watt [MFLOPS/W]           354.56             852.27
TOP 500 Rank (Nov. 2011)            366                5
Network Type                        Fat Tree           Fat Tree
Interconnect                        QDR InfiniBand     QDR InfiniBand

Table 3.1: Specifications of the LIMA and the Tsubame 2.0 cluster.

3.1 General Cluster Setup

The LIMA and the Tsubame 2.0 cluster (see Tab. 3.1) are both built from dual-socket Intel Xeon "Westmere" compute nodes that are connected via a fully non-blocking InfiniBand interconnect. InfiniBand is a high performance, low latency network architecture based on switches. It is used, amongst others, to connect cluster networks. Each link is bi-directional and can obtain a theoretical peak bandwidth of 5 GB/s for QDR (quad data rate) InfiniBand.

The dual-socket Intel "Westmere" compute node is a shared memory ccNUMA (cache coherent non-uniform memory access) architecture. A schematic overview is depicted in Fig. 3.1. It consists of two UMA (uniform memory access) building blocks, the sockets. Each socket has its own memory interface integrated on the chip and attached to a local main memory. Both sockets share the same address space, so that the total main memory can be accessed. However, accesses to non-local main memory suffer from lower bandwidth and higher latency.

Figure 3.1: Shared memory ccNUMA compute node following [35].

In the past, performance gains were mostly achieved by increasing the clock frequency. This was made possible by ever smaller manufacturing processes. Currently, increasing the clock frequencies is no longer cost-effective due to the high energy consumption and the resulting high waste heat. In today's era of multi- and many-core architectures a single chip is hence composed of several compute cores. Each socket of the used Xeon "Westmere" consists of six cores, each having their own ALU, registers, and L1 and L2 caches. The L3 cache is shared between all cores. The purpose of the caches is to provide low latency and high bandwidth data access. In detail, the memory hierarchy is set up like a pyramid. The base of the pyramid is represented by the main memory and the top by the registers. The higher a memory type is in the pyramid, the higher the bandwidth and the lower the latency, but also the smaller the size of the memory type.

Each core is able to execute two scalar floating point operations per cycle. By using SSE (streaming SIMD (single instruction multiple data) extensions) it is possible to increase the number of floating point operations per cycle. SSE provides 16 byte registers and vector operations. Hence, with SSE vector instructions four DP operations or eight SP operations can be executed per cycle. For simple code fragments compilers are able to automatically perform the vectorization, but for most implementations it has to be applied by hand in an assembler-like instruction set [36]. Further, each core can issue one load and one store. However, prior to a store, a read for ownership resulting in an additional load instruction has to be executed in order to have a valid copy of the stored data in the cache. The read for ownership can be circumvented by NT (non-temporal) stores, which bypass the cache hierarchy and write the data directly to main memory. In addition, each core possesses the ability to execute two threads simultaneously via SMT (simultaneous multi-threading) to optimally utilize the hardware capabilities by overlapping the operations of both threads. The sockets are connected via the full-duplex point-to-point processor interconnect QPI (QuickPath Interconnect). Additional QPI links are used to connect the sockets to the chip set, which is connected to further devices, e.g. the network interface, I/O, or GPUs. The specifications of the compute nodes used in the LIMA and the Tsubame 2.0 cluster can be found in Tab. 3.2. For all measurements conducted on the LIMA and the Tsubame 2.0 cluster, OpenMPI [37], the Intel compiler [38], and CUDA 4.0 [39] have been used.

Dual-Socket Intel "Westmere"                      Xeon X5650   Xeon X5670
Cores                                             12           12
Base Clock Rate [GHz]                             2.66         2.93
Max Clock Rate [GHz]                              3.06         3.33
L1 [kB]                                           32           32
L2 [kB]                                           256          256
L3 [MB]                                           12           12
Theoretical DP Peak Flops @ Base Clock [GFLOPS]   128          140
Theoretical Peak Bandwidth [GB/s]                 64           64

Table 3.2: Specification of the dual-socket Intel “Westmere”. The LIMA cluster is equipped with Xeon X5650 chips, while the Tsubame 2.0 cluster uses Xeon X5670 chips.

3.2 The NVIDIA GF100 Tesla "Fermi" GPU

GPUs are designed as accelerators for real-time rendering of, e.g., computer games or professional CAD (computer aided design) applications, and cannot operate without a CPU. As for visualization purposes the same operations are often executed for all data elements, their design is optimized to solve computationally intensive, data parallel throughput computations. Compared to general purpose CPUs, GPUs offer a higher throughput of floating point operations and a higher memory bandwidth due to this specialization. In particular, the specialization allows a simplification of the control flow as well as a strongly reduced cache hierarchy. The resulting free transistors have been spent on a higher number of compute cores. For example, an NVIDIA Tesla M2050 based on the GF100 chipset has 480 CUDA-cores (stream processors). In Fig. 3.2 a schematic overview of the layout of CPU and GPU is depicted. Hence, GPUs are well suited for data parallel algorithms, e.g. the LBM, but are not well suited for algorithms with inherent data dependencies or the need for a lot of flow control, i.e. if statements.

Figure 3.2: Schematic layout of CPUs and GPUs following [39].

Structurally, the NVIDIA Tesla M2050 is composed of 15 MPs (multi-processors), each having its own shared memory, caches, and registers. The MPs are assembled from 32 CUDA-cores that process their data in a SIMT (single instruction multiple threads) manner. All threads of one SIMT unit simultaneously execute the same operation, but on different memory locations. The order of the instructions is determined by the compiler and cannot be exchanged, as current GPU architectures are in-order architectures. For the exchange of data between threads in the same thread block (see Sec. 4.1 for details) the fast and low latency shared memory can be used. A thread block is always processed on one MP and cannot be migrated to other MPs. Each NVIDIA Tesla M2050 has a main memory, or so-called global memory, of 3 GB, and the NVIDIA Tesla M2070 has a global memory of 6 GB, which is about one order of magnitude less than what current CPUs are equipped with. In Tab. 3.3 the specifications of the NVIDIA Tesla M2050 and M2070 are depicted.

In order to hide the latency of memory accesses and of pipeline stalls, GPUs use an over-subscription of threads. A maximum of 1536 concurrent threads can be scheduled on one MP. In addition, GPUs support low overhead switching between SIMT units. Thus, to achieve good performance GPUs need enough parallelism. A restriction that can prevent good parallelism are the resource constraints of the MPs. For all threads the register and shared memory requirements always have to be fulfilled. In contrast to CPUs, there is no possibility to spill registers. If threads require many registers or much shared memory, this results in a low number of runnable threads per MP, which restricts the hiding of latency and pipeline stall effects. In [40] it has been shown that for the LBM an occupancy of 0.5 has to be reached in order to achieve good performance results. Occupancy is the ratio of the number of scheduled threads to the maximum number of runnable threads.

To copy data between host and device, the data has to be transferred via the PCI-E (peripheral component interconnect express) interconnect. The PCI-E is a bi-directional point-to-point interconnect connecting peripheral devices with the chip set. The GPUs in Tsubame 2.0 compute nodes are connected via PCI-E x16 and can obtain a theoretical peak bandwidth of 8 GB/s.
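To make the occupancy constraint concrete, consider a hypothetical kernel (the numbers are illustrative, not measurements from this thesis) launched with 128 threads per block that requires 32 registers per thread. One block then consumes 128 · 32 = 4096 registers, so with the 32768 registers available per MP at most ⌊32768 / 4096⌋ = 8 blocks, i.e. 8 · 128 = 1024 threads, can be resident on one MP, giving an occupancy of 1024 / 1536 ≈ 0.67. If the same kernel needed 64 registers per thread, only 4 blocks (512 threads) would fit and the occupancy would drop to 512 / 1536 ≈ 0.33, below the threshold of 0.5 reported for the LBM in [40].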

NVIDIA Tesla                        M2050    M2070
Architecture                        GF100    GF100
Compute Capability                  2.0      2.0
Code Name                           Fermi    Fermi
MP                                  15       15
CUDA-cores                          480      480
Main Memory [GB]                    3        6
Memory Bandwidth [GB/s]             140      140
Peak SP FLOPS [TFLOPS]              1        1
Peak DP FLOPS [TFLOPS]              0.5      0.5
Maximum Runnable Threads per MP     1536     1536
Shared Memory per MP [kB]           16/48    16/48
Registers per MP                    32768    32768

Table 3.3: Specifications of the NVIDIA Tesla M2050 and M2070.

4 Implementation Details

4.1 The CUDA Architecture and the CUDA C Language

In 2006 NVIDIA [41] revealed their CUDA architecture. Different from earlier GPU generations, CUDA does not organize the compute resources in vertex and pixel shaders, but includes a unified shader pipeline allowing general-purpose computations. Together with the release of the CUDA architecture, NVIDIA released the first version of the nvcc compiler and the CUDA C language [39], which is based on the C language but adds new keywords for the execution of CUDA kernels on GPUs. CUDA C object files generated by the nvcc compiler can be linked to any object files created by C / C++ compilers. Hence, it is possible to compile only source files containing CUDA C language constructs with the nvcc compiler and everything else with standard C / C++ compilers. In the following, a brief introduction to CUDA C 4.0 is given to the extent required for the understanding of this thesis. A more detailed introduction can be found in [39] [42]. In CUDA C, the host (CPU) and the device (GPU) are differentiated. The host is responsible for the sequence control, i.e. for deciding when which CUDA kernel is executed. A simple example of a CUDA C code can be seen in List. 4.1, which is discussed in the following. Note that error handling has been removed from the code for simplicity reasons, but should be used in actual implementations.

Initialization With cudaSetDevice the current device for the calling thread or process can be specified. All subsequent calls to the CUDA run-time API, e.g. memory allocation or CUDA kernel execution, are associated with this device. Hence, prior to any interaction with a GPU, host threads and processes have to set up a context for a device. Note that the configuration of the Tsubame 2.0 only allows one thread or process at a time to create a context for a specific device.

Memory Allocation In this example two arrays are allocated, one on the host and one on the device. For the host, the memory is page-locked to speed up the PCI-E transfers between host and device. The actual implementation of the host data allocation is depicted in List. 4.2. To register and page-lock the memory by cudaHostRegister, the allocated memory has to be page-aligned. If the page-locked status of the allocated memory shall be visible to other threads, the parameter cudaHostRegisterPortable has to be provided. List. 4.3 shows the allocation of the memory on the device. After the initialization of the data on the host, the data is transferred from host to device by a call to cudaMemcpy.

Threads and Thread Blocks Computationally, the CUDA kernels work on thread blocks. A thread block is a container for threads, which can be organized in one or two dimensions. Thread blocks are always executed on one MP, and a maximum of 8 thread blocks is runnable on an MP for an NVIDIA "Fermi" to reach the maximum amount of 1536 threads. Data between threads in the same thread block can be exchanged via the shared memory. The thread blocks themselves are organized in a block grid, which can be one, two, or three dimensional. In List. 4.1 the thread block is specified by dimThreadBlock, and the block grid by dimBlockGrid. By this method thousands of threads can be organized easily.

Kernel Call For the kernel call, the number of thread blocks and threads is specified between <<< and >>>. Note that CUDA kernel calls are non-blocking. cudaThreadSynchronize can be used to wait for the CUDA kernels to finish.

Kernel Implementation A CUDA kernel is specified by the __global__ keyword. Within the kernel, the thread index in a thread block can be accessed through threadIdx.x/y and the block grid index through blockIdx.x/y/z. An example kernel is depicted in List. 4.4.

CUDA Streams CUDA supports asynchronous execution of kernels and memory transfers by the use of so-called streams. A stream is a kind of queue to which operations such as kernel calls can be added, and which are executed in the order of insertion. Several streams can be created to enable parallel execution. Usually, all kernels are executed in stream 0. Stream 0 is a special case, as no other stream can run asynchronously to stream 0. To add a kernel call or a memory transfer to a stream, an additional argument to the CUDA kernel call must be specified, and for memory transfers cudaMemcpyAsync must be used. Additionally, for asynchronous memory transfers the host buffers have to be page-locked, so that pages cannot be paged out and DMA transfers can be utilized. In this way memory transfers and CUDA kernel calls can be overlapped, but different CUDA kernel calls can only be overlapped in special cases. In particular, if two kernels are scheduled asynchronously, initially all thread blocks of the first kernel are processed. The thread blocks of the second kernel are executed as soon as there are free MPs on the GPU, which are no longer needed for the first kernel as all remaining thread blocks of the first kernel are being processed. List. 4.5 shows a simple example for the overlapping of CUDA kernel execution and PCI-E memory transfers.

#include <cuda_runtime.h>
#include <unistd.h>

int main(int argc, char **argv){

   // Use Device with Number 0 //
   int device = 0;

   // Specify Device //
   cudaSetDevice(device);

   ////////////////
   // Alloc Data //
   ////////////////

   // Size of Data Rounded Up to Pagesize //
   size_t size = (100*sizeof(double)/sysconf(_SC_PAGESIZE)+1)*sysconf(_SC_PAGESIZE);

   // Allocate a Pointer to an Array of 100 Doubles on the Host //
   double *cpu_data;
   allocHost(&cpu_data, size);

   // Allocate Data on the GPU //
   double *gpu_data;
   allocDevice(&gpu_data, size);

   ///////////////
   // Init Data //
   ///////////////

   // Init Host Data //
   ...;

   // Copy Data from Host to Device //
   cudaMemcpy(gpu_data, cpu_data, size, cudaMemcpyHostToDevice);

   /////////////////////
   // Call the Kernel //
   /////////////////////

   // Specify the Layout of a Thread Block //
   dim3 dimThreadBlock(threadsX, 1);
   // Specify how many Thread Blocks there are //
   dim3 dimBlockGrid(blocksX, blocksY);

   // Execute the Kernel //
   doStuffKernel<<<dimBlockGrid, dimThreadBlock>>>(gpu_data);
   cudaThreadSynchronize();

   // Copy the Data Back to the Host //
   cudaMemcpy(cpu_data, gpu_data, size, cudaMemcpyDeviceToHost);

   // Clean Up //
   ...;
}

Listing 4.1: A simple CUDA C program.
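As noted above, error handling has been omitted from the listings for brevity. A minimal sketch of how the return codes of the CUDA run-time API could be checked is shown below; the CHECK_CUDA macro name is illustrative and not part of the thesis code.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap CUDA run-time calls to abort with a readable message on failure //
#define CHECK_CUDA(call)                                            \
   do {                                                             \
      cudaError_t err = (call);                                     \
      if (err != cudaSuccess) {                                     \
         fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                 cudaGetErrorString(err), __FILE__, __LINE__);      \
         exit(EXIT_FAILURE);                                        \
      }                                                             \
   } while (0)

// Example usage (names as in List. 4.1) //
// CHECK_CUDA( cudaSetDevice(0) );
// CHECK_CUDA( cudaMemcpy(gpu_data, cpu_data, size, cudaMemcpyHostToDevice) );
// doStuffKernel<<<dimBlockGrid, dimThreadBlock>>>(gpu_data);
// CHECK_CUDA( cudaGetLastError() );       // check the kernel launch
// CHECK_CUDA( cudaThreadSynchronize() );  // check the kernel execution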

void allocHost(double** cpu_data, size_t size){

   // Allocate Page-Aligned Memory //
   posix_memalign((void**)cpu_data, sysconf(_SC_PAGESIZE), size);

   // Page-Lock Memory //
   cudaHostRegister(*((void**)cpu_data), size, cudaHostRegisterPortable);
}

Listing 4.2: Allocation of host memory in CUDA.

void allocDevice(double** gpu_data, size_t size){

   // Allocate Memory on the Device //
   cudaMalloc((void**)gpu_data, size);
}

Listing 4.3: Allocation of device memory in CUDA.

__global__ void doStuffKernel(double *gpu_data){

   // Map the thread and block indices to coordinates //
   int x = threadIdx.x;
   int y = blockIdx.x;
   int z = blockIdx.y;

   // ... work on gpu_data ... //
}

Listing 4.4: A simple CUDA kernel.

...;

// Create the Streams //
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Do Stuff //
...;

// Overlap a device-to-host copy in stream1 with a kernel in stream2 //
cudaMemcpyAsync(cpu_data, gpu_data, size, cudaMemcpyDeviceToHost, stream1);
doStuffKernel<<<dimBlockGrid, dimThreadBlock, 0, stream2>>>(gpu_data);

// Do Stuff //
...;

Listing 4.5: Overlapping computation with PCI-E transfers.

4.2 Kernel Implementation

The pure-LBM kernels used for this thesis have been developed within the SKALB [43] project by Johannes Habich [44], and have been integrated into WaLBerla in joint work. In this thesis only a brief overview of the features of the kernel implementations is given. For the kernel implementations on CPUs as well as on GPUs, the entire simulation domain is always stored in two separate grids of unknowns. Additionally, all lattice cells are traversed and updated independent of their state, e.g. fluid or obstacle. The BCs are treated prior to the LBM update. This renders the LBM update independent of the BCs and thus enhances the reusability of the implementation for different simulation tasks, if different BCs are required.

4.2.1 The LBM Kernel for CPUs

Detailed investigations on optimization techniques for the serial performance of the LBM can be found in [45] [46] [47]. The following optimizations have been applied for the implementation of the LBM kernel in WaLBerla.

Fused Collide and Stream The stream and collide steps are executed in one pass and not in two separate function calls. Additionally, the pull stream approach is implemented.

Arithmetic Optimizations The overall number of required floating point operations has been minimized. Expressions that can be reused have been precomputed and stored in temporary variables.

SoA (Structure of Arrays) vs. AoS (Array of Structures) For the SoA data layout, all PDFs of one lattice velocity $\vec{e}_\alpha$ are consecutive in memory. In contrast, for the AoS data layout all PDFs of one lattice cell are consecutive in memory. For the WaLBerla CPU implementation the SoA layout with the cell storage order fzyx has been chosen. Here, f indicates the lattice velocity index, and x, y, and z the dimensions in space.
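To make the two layouts concrete, the following sketch shows how the linear array index of a PDF could be computed for both orderings. The function names and signatures are illustrative assumptions; the actual WaLBerla data structures are described in Chap. 8.

// SoA with cell storage order fzyx: all PDFs of one lattice velocity alpha
// form one contiguous block, inside which x is the fastest running index.
inline size_t idx_soa_fzyx(int x, int y, int z, int alpha,
                           int xSize, int ySize, int zSize)
{
    return (((size_t)alpha * zSize + z) * ySize + y) * xSize + x;
}

// AoS: the 19 PDFs of one lattice cell are stored next to each other.
inline size_t idx_aos(int x, int y, int z, int alpha,
                      int xSize, int ySize, int zSize)
{
    (void)zSize; // kept only for a uniform signature
    return (((size_t)z * ySize + y) * xSize + x) * 19 + alpha;
}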

Split Loop Line Kernel For the line kernel, one function call is issued for the update of each row in x-dimension. This function call gets as parameters one array for each lattice velocity containing the PDFs of one row in x-dimension. This is only possible for the SoA data layout, as here the PDFs in x-dimension for one lattice velocity $\vec{e}_\alpha$ are consecutive in memory. Hence, 19 pointers to the first PDF elements of the row can be used to create the one-dimensional arrays. The benefit of this optimization is that fewer index calculations are required during the update, avoiding the integer operations needed for them. Additionally, the loop for the update of the PDFs in the lattice cells is split. First, $\rho$ and $\vec{u}$ are calculated for the complete row in x-direction and stored in a temporary array. Next, each individual lattice direction $\vec{e}_\alpha$ is updated for the complete row.
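A condensed sketch of this line kernel structure is given below. The names src, dst, omega ($= 1/\tau$), and the bound MAX_X are illustrative assumptions, the streaming offsets are omitted for brevity, and e and f_eq refer to the illustrative tables and function from the sketches in Chap. 2.

enum { MAX_X = 1024 };   /* illustrative upper bound on the row length */

// Update one row of length xSize. src[alpha] points to the first (already
// pulled) PDF of direction alpha in this row, dst[alpha] to the destination.
void lbm_line_kernel(double *src[19], double *dst[19], int xSize, double omega)
{
    double rho[MAX_X], ux[MAX_X], uy[MAX_X], uz[MAX_X];

    // First loop: macroscopic quantities for the whole row //
    for (int x = 0; x < xSize; ++x) {
        rho[x] = 0.0; ux[x] = 0.0; uy[x] = 0.0; uz[x] = 0.0;
        for (int alpha = 0; alpha < 19; ++alpha) {
            const double f = src[alpha][x];
            rho[x] += f;
            ux[x]  += e[X][alpha] * f;
            uy[x]  += e[Y][alpha] * f;
            uz[x]  += e[Z][alpha] * f;
        }
    }

    // Second loop: collide each direction for the whole row //
    for (int alpha = 0; alpha < 19; ++alpha)
        for (int x = 0; x < xSize; ++x)
            dst[alpha][x] = src[alpha][x]
                - omega * (src[alpha][x] - f_eq(alpha, rho[x], ux[x], uy[x], uz[x]));
}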

SSE, Padding, and NT Stores In order to use SSE vectorization, the correct alignment of the accessed data has to be ensured. Therefore, elements have been padded in front of and at the end of each row in x-dimension. Additionally, NT stores are used to store the PDFs. All referred optimization techniques are investigated in detail in Habich [44].
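For illustration, the sketch below shows how one row of PDFs might be written with non-temporal SSE2 stores, bypassing the cache hierarchy as described in Sec. 3.1. The function name and the assumption that the padded rows are 16-byte aligned are illustrative; the WaLBerla kernels use a more elaborate scheme.

#include <emmintrin.h>   // SSE2 intrinsics

// Store the collided PDFs of one direction for one row without polluting the
// cache. dst and src must be 16-byte aligned, xSize a multiple of 2 doubles.
void store_row_nt(double *dst, const double *src, int xSize)
{
    for (int x = 0; x < xSize; x += 2) {
        __m128d v = _mm_load_pd(src + x);   // aligned SSE load of two doubles
        _mm_stream_pd(dst + x, v);          // non-temporal (NT) store
    }
    _mm_sfence();                           // make the NT stores globally visible
}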

4.2.2 The LBM Kernel for NVIDIA GPUs

Optimization techniques for the serial LBM performance on GPUs can be found in [44] [48] [49] [50]. General performance optimizations for GPUs can be found in [39]. For the LBM kernel implemented in WaLBerla for GPUs the following optimizations have been applied.

Structure of Arrays For the GPU kernel the SoA layout has also been used to enable coalesced loads and aligned writes, so that optimization techniques using the shared memory can be avoided [48].

Register Optimization In order to achieve a good occupancy, the number of required registers has to be minimized. An initial implementation used up to 64 registers, which could be lowered to 32 registers by optimizing the index calculation for the PDFs. In detail, the starting index of the first PDF is calculated for each lattice cell, and only offsets are used to access the remaining PDFs.

Padding To achieve optimal coalesced loads and stores to the global memory, elements have been padded at the end of each row in x-direction. The rows in x-direction are 128 byte aligned.

Thread Scheduling The thread scheduling has been adopted from [48]. For each lattice cell one thread is scheduled. Each CUDA thread block has the dimensions <<>>, and the number of blocks is <<>>.
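The exact block and grid dimensions are not reproduced here. The sketch below only illustrates the general scheme described above, i.e. one thread per lattice cell and offset-based PDF indexing; the assumption of one thread block per (y,z) row and all names are illustrative and not the thesis' exact configuration.

// One thread per lattice cell: threadIdx.x runs along the padded x-row,
// blockIdx.x/y select the y- and z-coordinate.
__global__ void lbmKernelSketch(const double *src, double *dst,
                                int paddedX, int ySize, int zSize)
{
    const int x = threadIdx.x;
    const int y = blockIdx.x;
    const int z = blockIdx.y;

    // Starting index of direction alpha = 0 for this cell (SoA, fzyx order).
    // All other PDFs are reached via the constant offset of one PDF field,
    // so only one full index calculation is needed per cell.
    const size_t fieldSize = (size_t)paddedX * ySize * zSize;
    const size_t base      = ((size_t)z * ySize + y) * paddedX + x;

    double f[19];
    for (int alpha = 0; alpha < 19; ++alpha)
        f[alpha] = src[base + alpha * fieldSize];

    // ... collide f[0..18] and write back to dst in the same manner ... //
}

// Host-side launch: one thread block per (y,z) pair, one thread per cell in x.
// dim3 dimThreadBlock(paddedX, 1);
// dim3 dimBlockGrid(ySize, zSize);
// lbmKernelSketch<<<dimBlockGrid, dimThreadBlock>>>(src, dst, paddedX, ySize, zSize);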

28 4.2.3 The Fluctuation LBM Kernel for NVIDIA GPUs

The FLBM (fluctuating lattice Boltzmann method) kernel implementation is very similar to the LBM implementation. Both kernels have the same access pattern to the global memory, so that the same optimizations could be applied, i.e. padding and register optimization for the index calculation. However, with the help of List. 2.2 it can be seen that the FLBM implementation requires additional work steps, i.e. the moments and the equilibrium distributions have to be stored in temporary variables. This results in a high number of required registers, in total 64 registers. In future work this will be improved through restructuring of the algorithm. For the creation of random numbers in serial implementations, a random number generator providing a single stream of pseudo-random numbers is sufficient. However, for GPUs multiple independent streams of random numbers are required, as multiple threads access the random number generator concurrently and random number generation should be fast, i.e. without synchronization. The Tiny Encryption Algorithm proposed in [51] [52] provides all the features that are required of a good GPU pseudo-random number generator and has been used for the GPU implementation of the FLBM.
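As an illustration of the underlying idea only: each thread can encrypt a counter derived from its cell index and the time step, which yields an independent, synchronization-free stream of pseudo-random numbers. The round structure and the golden-ratio constant below follow the published TEA algorithm, while the key values, the number of rounds, the Box-Muller conversion, and all names are assumptions for this sketch rather than the implementation used in WaLBerla.

// Reduced-round Tiny Encryption Algorithm (TEA): encrypt the two 32-bit words
// v0/v1 (e.g. cell index and time step) under a fixed key.
__device__ void tea8(unsigned int &v0, unsigned int &v1)
{
    const unsigned int k0 = 0xA341316C, k1 = 0xC8013EA4,   // arbitrary key,
                       k2 = 0xAD90777D, k3 = 0x7E95761E;   // illustrative only
    unsigned int sum = 0;
    for (int i = 0; i < 8; ++i) {               // a few rounds suffice for RNG use
        sum += 0x9E3779B9;                      // TEA's golden-ratio constant
        v0  += ((v1 << 4) + k0) ^ (v1 + sum) ^ ((v1 >> 5) + k1);
        v1  += ((v0 << 4) + k2) ^ (v0 + sum) ^ ((v0 >> 5) + k3);
    }
}

// Two uniform numbers in (0,1] from the encrypted counter, converted to a
// Gaussian sample with zero mean and unit variance via Box-Muller.
__device__ double gaussianRandom(unsigned int cellIdx, unsigned int timeStep)
{
    unsigned int v0 = cellIdx, v1 = timeStep;
    tea8(v0, v1);
    const double u1 = (v0 + 1.0) * (1.0 / 4294967296.0);
    const double u2 = (v1 + 1.0) * (1.0 / 4294967296.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * 3.14159265358979323846 * u2);
}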


Part II

WaLBerla - An HPC Software Framework for Multi-Physics Simulations


The design and implementation of a multi-purpose software framework suitable for large-scale simulation tasks requiring multi-physics, multi-scale, and massive parallelization is a challenging task. During the software design phase, special emphasis has to be placed on efficiency and scalability due to the high compute intensity of most simulation tasks. For the application in research, maintainability and extendability are equally important, since many simulation tasks with different and constantly changing requirements have to be supported.

WaLBerla In this Part the massively parallel software framework WaLBerla is introduced. WaLBerla is a powerful tool for complex fluid flow simulations centered around the LBM. It has been designed to provide a good balance between efficiency and maintainability. This Part focuses on the software design concepts developed for WaLBerla in order to ensure maintainability. For this purpose, the requirements on software frameworks in CSE have been investigated and the resulting design goals have been specified during the software design phase of WaLBerla. In detail, the software design concepts applied for the implementation of WaLBerla minimize the compile and link time dependencies, provide clearly structured interfaces, allow to manage and dynamically select functionality, and offer the flexibility to easily integrate new simulation tasks. Despite this flexibility, it is still possible to realize highly efficient hardware-aware implementations for various architectures. One possibility to measure the resulting software quality of WaLBerla are software quality factors. With the help of these factors, the developers constantly try to improve the software quality of WaLBerla.

Tools For a usable software framework it is important to provide lightweight tools and modules that can be utilized if the need arises. Amongst others, WaLBerla offers tools for reading input files, logging of simulation progress and errors, and writing of visualization and post-processing output. Additionally, modules for the organization of simulation data, boundary conditions, and time measurements are available.

UML Throughout this thesis the graphical and standardized UML (unified modeling language) is applied in examples and figures. Two types of diagrams have been used: class diagrams and activity diagrams. In class diagrams the interfaces of and relations between classes can be depicted, while activity diagrams can be utilized to illustrate complex workflows. The notation used for this thesis has been taken from [53].

Structure of this Part In Chap. 5 the design goals of WaLBerla are derived and the WaLBerla prototypes WaLBerla::V1 and WaLBerla::Core are introduced. This is followed by the analysis of the logical software design of WaLBerla in Chap. 6 and a discussion of the resulting software quality in Chap. 7. The available modules and tools of WaLBerla are described in Chap. 8. Briefly, Chap. 9 surveys LB software frameworks that are developed in the community. Finally, a discussion on the pros and cons of software frameworks for the simulation of large-scale multi-physics in academia concludes the Part.

5 Designing the Software Framework WaLBerla

In this Chapter the general ideas behind the WaLBerla software design are introduced. As WaLBerla is targeted at conducting numerical simulations in the field of CSE, first the software requirements are determined. Further, it is discussed what techniques can be applied to fulfill these requirements and how to evaluate success or failure. Finally, the design goals of WaLBerla are specified.

Team Work A large-scale software project such as WaLBerla cannot be designed and implemented by a single person. WaLBerla has been founded in 2006 by four PhD students: Stefan Donath, Klaus Iglberger, Jan Götz, and the author of this thesis. The main contribution of the author of this thesis to the WaLBerla software framework has been the design and implementation of the basic functionality of WaLBerla, e.g. data structures, sequence control, parallelization, functionality management, logging, and time measurement. Currently, WaLBerla is used, extended, and optimized by the second and third generation of PhD students. In total, around 30 person years have been spent on implementing WaLBerla and its simulation tasks in their current state. The framework is or has also been used by several other research groups [54] [55] [56] [57] [58] for the following research projects: DECODE [59] (Degradation Mechanisms to Improve Components and Design of Polymer Electrolyte Fuel Cells), SKALB [43] (Lattice-Boltzmann-Methoden für skalierbare Multi-Physik-Anwendungen), Fast-EBM [60] (Fast Electron Beam Melting), Froth Foams [61] (Proteinschäume in der Lebensmittelproduktion: Mechanismenaufklärung, Modellierung und Simulation), and KONWIHR [62] (Kompetenznetzwerk für wissenschaftliches Höchstleistungsrechnen in Bayern). Additionally, WaLBerla has been part of the benchmark suite for the SuperMUC [63] supercomputer.

5.1 Deriving the Design Goals of WaLBerla

The acronym WaLBerla stands for Widely Applicable LBM from Erlangen. This characterizes the purpose of the WaLBerla software framework very well: to be a powerful tool for the simulation of various complex engineering tasks centered around the LBM, but not restricted to it. As such, it aims at speeding up the development cycle for simulation tasks and at providing a basis for collaborations and knowledge transfer between research groups in academia working in the multidisciplinary field of CSE.

Requirements on Software Frameworks intended for Research in CSE Prior to the design phase of a software framework, the requirements of the field of application have to be analyzed in order to determine the design goals. WaLBerla is intended to be applied for research in CSE. As research is concerned with problems that have not been addressed before, the utilized algorithms, simulation tasks, and requirements are likely to change. While addressing a particular problem, the first attempt might or might not work out, or it might raise new questions. In the end, the code will have to be extended or modified to fulfill the new requirements. In addition, the field of CSE addresses a wide range of different simulation tasks utilizing different algorithms. Computationally, this requires the support of diverse features: for example, the implementation of parallel free-surface simulations in porous structures poses different requirements than the implementation of a parallel multigrid solver for the electrostatic interaction of charged nano particles immersed in fluid, including visualization and post-processing. Hence, maintainability and extendability of software frameworks for research in CSE are of utmost importance. As the investigation of real-world problems often involves large-scale simulations with large memory consumption as well as high compute intensity, maximizing efficiency and scalability by optimizing the sustained performance on the underlying hardware is also a software requirement for CSE simulation tasks. Knowing these requirements, special care has been taken to achieve good maintainability and extendability as well as efficiency and scalability during the software engineering design phase for WaLBerla.

Software Engineering for Computational Science and Engineering Software engineering is a discipline concerned with the improvement of software quality by the systematic design, implementation, debugging, testing, and maintaining of software solutions. Unfortunately, it is common practice in academia to neglect the systematic part. This might be feasible and even be the preferred approach for small programs, like a script parsing a data set for post-processing that takes a day to write. For large-scale projects involving several persons and several hundred thousand lines of code this is definitely not the case, as early design decisions determine the characteristics of the final software product. To change these features later is very difficult or not possible at all. Hence, the software quality is mainly determined in the initial planning phase before even a line of code has been written. Decisions requiring a major restructuring of the implementation might be: shall GPUs be supported or not? Is there a requirement for shared memory parallelization, distributed memory parallelization, or even massive parallelization? Further, designs relying heavily on polymorphism in performance-critical sections will reduce the efficiency of the code.

How can Software Quality be Measured? The software quality of a software framework can be classified and measured by software quality factors [64]:
• Reliability: Does the software framework do what it is supposed to do? Is it possible to properly test the framework in order to achieve reliability?
• Maintainability: Is it possible to extend, replace, or reuse components of the framework with moderate effort? How difficult is it to track down bugs?
• Usability: Is the framework easy enough to use, or will users simply use another framework, which might be less well suited for their needs but is easier to use?
• Efficiency: How efficiently does the software framework utilize the underlying hardware? Is the framework overhead on the performance too large?
• Portability: Is it possible to use the software framework on modern hardware, on GPUs, or on clusters requiring a massive parallelization? Which operating systems are supported? What external tools or libraries are required? Are those available for the target architectures or operating systems?
• Functionality: Does the software framework meet the requirements of the users?

Optimizing the Software Quality by Minimizing the Physical Dependency The question arises how to ensure good software quality by design. In the literature, many textbooks can be found on this topic, dealing with logical software design [65], design patterns [66], or software engineering in general [67]. An approach optimizing the physical software design can be found in [64] or [68]. The logical design is concerned with language constructs like classes and the logical relations between them. In contrast, the physical design tries to minimize the dependencies between physical entities like files, directories, and libraries. The basic unit of the physical design is a component (see Fig. 5.1 for an example). A component is a subset of the logical design that is represented by a single physical representation, e.g. a single header (.h) and a single source file (.cpp). Components depend on each other if they have either compile or link time dependencies. In large-scale projects often the need to group components arises, which is accomplished by packages. Two example packages, i.e. a post-processing and a visualization package, are depicted in Fig. 5.2. Especially for large-scale software projects having several hundred thousand lines of code, cutting down compile and link time by improving the physical design is of highest importance. Ignoring this can quickly lead to bad or even wrong design decisions. For instance, imagine a programmer not wanting to extend a header file as it would cause him half an hour of compile time, because the complete project has to be rebuilt. Not only the build time can be optimized by the physical design, but physically independent parts of the project can also be cut out, replaced, reused, and tested in isolation. This increases the maintainability, extendability, portability, and reliability of the software framework. Additionally, to modify, test, or use a physically independent part of the code, a programmer only has to understand this part of the code. Hence, the physical design also affects the understandability of the code. Avoiding cyclic link time dependencies (Fig. 5.1) as well as excessive compile (Fig. 5.2) and link time dependencies (Fig. 5.3) is key for a good physical design. For these reasons, the software design concepts applied in WaLBerla are based on minimizing the physical dependency.


Figure 5.1: The Figure on the left shows two example components, UID and Field. Field depends on UID, because it has a compile time dependency on UID as it stores a UID. The Figure on the right depicts a cyclic link time dependency between the components BlockField and Block. Cyclic dependencies tend to tightly couple the components of a software framework. Hence, such systems are hard to maintain, understand, and test, as only the entire dependent part of the software framework can be maintained, understood, or tested.
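A condensed sketch of the left-hand example is given below; the file contents are illustrative and merely show how storing a UID by value turns into a compile time dependency of the component Field on the component UID.

// UID.h -- component UID: one header/source pair forms one component.
class UID {
public:
    unsigned int id() const { return id_; }
private:
    unsigned int id_ = 0;
};

// Field.h -- component Field: because Field stores a UID by value, Field.h
// has to include UID.h. Hence, Field has a compile time dependency on UID,
// and every client of Field.h is recompiled whenever UID.h changes.
class Field {
private:
    UID uid_;   // compile time (and hence link time) dependency on UID
};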

Design Goals of WaLBerla Hence, the following software design goals have been specified for WaLBerla:
• Maintainability and expandability: Ability to integrate various multi-physics simulation tasks without major restructuring of the framework.
• Understandability and usability: Easy integration of new functionality, also for non-programming experts.


Figure 5.2: Using a global enum for the identification of boundary condition types like in the component BC results in excessive compile dependencies, as each component using boundary conditions has to include BC.h. If a new boundary condition is added, all .o files dependent on BC.h have to be recompiled, even if they do not use the new boundary type. These are for example all kernels requiring BC information and also packages dealing with geometry output, like visualization and post-processing packages.

• Efficiency and scalability: Satisfy the need of complex simulation tasks for HPC.
• Portability: Support of various HPC supercomputer architectures and operating system environments.
The software design described below realizes these goals, leading to two prominent features. On the one hand, it is possible to select and manage kernels in WaLBerla. A kernel usually processes one work step of a simulation task and is in most cases implemented in a C-like function. From a programming point of view this is a simple concept, and it is possible to implement highly optimized, hardware-aware kernels for each architecture. Hence, the integration of new work steps is easily possible and the portability of WaLBerla to different architectures is also warranted. On the other hand, WaLBerla features a powerful multi-purpose parallelization, which has already been tested on up to 294912 cores [69]. With this feature, it is possible to considerably cut down the time required for the parallelization of simulation tasks. In the case that the communication pattern as well as the simulation data objects are similar to the already implemented functionality, it even works out of the box.


Figure 5.3: The C++ programming language offers numerous possibilities to add new functionality to classes if the need arises. This can result in huge classes. If only a small portion of the class is used, this can result in unnecessarily large code size and may slow down the link process. In the depicted example, the user wants to store a velocity vector. Therefore, the class Vec3 is used, which is a typedef for a more general class Vector. This class supports sparse and dense vectors through various libraries. Thus, although the user only wants to store a Real vel[3], excessive link time dependencies are introduced.

5.2 The WaLBerla Prototypes and their Physical Design

The general software development process in WaLBerla is a mixture of the spiral model and prototyping [70] [71]. During this process, prototypes are defined with certain specifications. The current prototype of WaLBerla is the WaLBerla::Core, which is the successor of the prototype WaLBerla::V1. All presented results and software designs in this thesis correspond to the WaLBerla::Core prototype, unless specified otherwise.

Prototype WaLBerla::V1 Right from the start a good software design specification for WaLBerla has been very important. Hence, a long design phase preceded the implementation of the framework. For WaLBerla::V1 the aim has been to develop a logical design that provides good sustained performance by minimizing the framework overhead, provides a scalable parallelization supporting various simulation tasks, and allows the integration of different simulation tasks. These aims have been met and led to a series of publications [69] [72] [73] [74]. The downside of this prototype has been that the physical design has been neglected. The result has been a software framework with massive interdependency. Functionality for different simulation tasks was located in the same files, resulting in the disadvantage that the complete framework had to be handed out and compiled, including all dependent libraries, even if only a single simulation task was of interest for a user. This resulted in very long compile times. The interdependency additionally caused excessive compile time dependencies. Further compile time dependencies resulted from the use of global enums to specify new simulation tasks, simulation data, and boundary types, as depicted in Fig. 5.2. Thus, if a new simulation task was added, the user had to know where to extend these enums and had to touch several files all over the software framework. The physical design of WaLBerla::V1 can be roughly summarized as a single interdependent package with additional tool packages to, e.g., read in files or convert input data into unit-less quantities. This is depicted on the left of Fig. 5.4.

Figure 5.4: Physical dependencies of the prototypes WaLBerla::V1 and WaLBerla::Core.

Prototype WaLBerla::Core Amongst others, the prototype WaLBerla::Core improves the physical design of WaLBerla. First, a package Core has been defined. It is responsible for the sequence control and contains common tools for, e.g., logging and time measurements. This package has no physical dependency on the rest of the software framework. Thus, it does not include information on simulation tasks. Additionally, Module packages have been defined. These contain functionality reused by several simulation tasks, such as the parallelization and visualization, but also LBM kernels. These packages may depend on the Core package and on each other. For each simulation task there is also an App package. In these packages the functionality unique to a simulation task is contained. Each App package depends only on the Module packages that it really requires. Further, the physical dependency within the packages has been reduced by the application of software design patterns, e.g. the bridge pattern idiom, the strategy pattern, or the command pattern [66]. In total, these changes result in a tree-like hierarchical physical dependency, which reduces the physical interdependency and results in better software quality [64]. In the following Chapters the underlying logical software design is analyzed.

6 Analysis of the Logical Software Design of WaLBerla

In this Chapter the software design concepts of WaLBerla are introduced and analyzed with respect to the design goals of WaLBerla (Sec. 5.1). These concepts define the basic conceptual structure of WaLBerla and are an abstract description of the logical design of WaLBerla.

Software Design Patterns The basic idea behind the software design concept of WaLBerla is the abstraction from data types, simulation tasks, work steps, or, more generally speaking, from functionality. WaLBerla achieves this by heavily applying the strategy software design pattern for the execution of functionality and by using factory methods for the data allocation. The strategy pattern is defined as “Define a family of algorithms, encapsulate each one, and make them interchangeable. Strategy lets the algorithm vary independently from clients that use it” [66, Page 315], while factory methods “Define an interface for creating an object, but let subclasses decide which class to instantiate. Factory Method lets a class defer instantiation to subclasses” [66, Page 107].

Sequence Control and Functionality Management In detail, the application of these software design patterns results in a sequence control that is oblivious of the executed simulation tasks and works only on callable objects, e.g. C-like functions or objects that can be used like functions. These callable objects represent the "interchangeable encapsulated algorithms" mentioned above and provide an "interface for creating objects". An important feature of WaLBerla is that all these callable objects are assigned further meta data to describe the functionality encapsulated in the object. Thus, the sequence control can select and execute the callable objects based on this meta data. In particular, the meta data is realized by a design based on UIDs (unique identifiers).

Sweeps and Domain Decomposition WaLBerla features a software design for the systematic integration of simulation tasks, which has first been introduced in [75]. In a divide and conquer manner the workflow of complex simulation tasks is broken down into work steps, so-called Sweeps. Further, the design of WaLBerla is focused on massive parallelization and supports a block-based domain decomposition suitable for various simulation tasks. Structurally, the simulation domain is broken down into sub-regions, so-called Blocks, which support the functionality management and the allocation of arbitrary simulation data objects, e.g. grid-based data for the lattice cells.

Figure 6.1: Schematic UML class diagram for UIDs.

Structure of this Chapter First, the UID design and the functionality management for the selection of kernels are introduced in Sec. 6.1 and Sec. 6.2. How simulation tasks can be broken down into Sweeps is discussed in Sec. 6.3, followed by the introduction of the domain decomposition via the Block design in Sec. 6.4. In Sec. 6.5 an overview of the sequence control is provided. Finally, Sec. 6.6 describes how the strategy and the factory method pattern have been applied for the configuration and implementation of simulation tasks.

6.1 UIDs as Meta Data

In WaLBerla many data objects are abstract and cannot be explicitly named. For the identification and the access of such objects a mechanism based on UIDs is utilized. For example, UIDs are used to identify boundary condition types, work steps, and simulation data objects, but also the state of the simulation and the functionality that is encapsulated within a function. A UML class diagram for the UIDs is depicted in Fig. 6.1. Each UID stores a member of type Uint that is unique for each instance of class UID belonging to a certain UID category uidClass, and also a string for human-readable output. The UID categories are specified by the class template parameter and represent what kind of object can be identified or selected. As an example, Fig. 6.1 depicts two categories: block data selectors (class BDS) and boundary condition types (class BC). Class BDS is used for UIDs identifying simulation data that is stored in sub-regions of the simulation domain, the so-called Blocks (Sec. 6.4).

Granularity   Name                     Abbr.   Examples
Simulation    Functionality selector   FS      Gravity on/off, LBM collision model
Process       Hardware selector        HS      CPU, GPU, CPU+GPU
Space         Sub-region selector      BS      Bubbles on/off, particles on/off

Table 6.1: WaLBerla supports three types of granularity for the specification of meta data, ranging from a coarse level (FS) over a finer level (HS) to the finest level (BS). The UID for the coarsest granularity defines characteristics of a kernel that are the same throughout the whole domain, such as whether the kernel includes gravity treatment or which LBM collision model is implemented. The next finer granularity describes features that can change per process. It is commonly used to specify the hardware which the kernel supports. Functionality that changes in space can be characterized with the third UID. Examples are whether bubbles have to be merged or particles have to be treated in a certain sub-region of space.

Basically, each object of type UID<BDS> provides a unique index for an array access to a certain type of simulation data object, e.g. for LBM simulations to the velocity data, the density data, the PDFs, or communication buffers. The boundary condition types are uniquely defined by instances of class UID<BC>. For convenience there exist typedefs for all instantiations of UID that follow the same naming convention: typedef UID<BDS> BDUID and typedef UID<BC> BCUID. The "S" for Selector is always skipped in the typedef names.

Initialization Upon construction the uid_ member is initialized with the help of generator classes. Each UID category can thereby have its own generator. For example, class BDS is a CIndexGenerator and class BC is a BitGenerator. The static function generate of the class CIndexGenerator returns consecutive values ranging from 0 to 2^32 − 1 for 4-byte unsigned ints, whereas the class BitGenerator returns the values 2^n for n = [0, 32). To ensure that each UID category has its own static counter, the generator classes also have a template parameter that is identical to the deriving category. This implements the CRTP (curiously recurring template pattern) [76]. The class UID also provides relational operators for the comparison of two UIDs.
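A condensed sketch of this CRTP-based design is given below; the member and class names follow Fig. 6.1, while the details of the real implementation may differ.

#include <string>

// CRTP generators: every UID category derives from a generator that owns
// its own static counter, so the indices of different categories do not
// interfere with each other.
template <typename Category>
struct CIndexGenerator {
    static unsigned int generate() {           // 0, 1, 2, ...
        static unsigned int counter = 0;
        return counter++;
    }
};

template <typename Category>
struct BitGenerator {
    static unsigned int generate() {           // 1, 2, 4, 8, ...
        static unsigned int bit = 1u;
        unsigned int result = bit;
        bit <<= 1;
        return result;
    }
};

// UID categories: block data selectors and boundary condition types.
struct BDS : CIndexGenerator<BDS> {};
struct BC  : BitGenerator<BC>    {};

template <typename uidClass>
class UID {
public:
    explicit UID(const std::string& name)
        : uid_(uidClass::generate()), dname_(name) {}
    unsigned int uid() const { return uid_; }
    const std::string& name() const { return dname_; }
    bool operator==(const UID& o) const { return uid_ == o.uid_; }
    bool operator< (const UID& o) const { return uid_ <  o.uid_; }
private:
    unsigned int uid_;   // unique within the category uidClass
    std::string  dname_; // human-readable name
};

typedef UID<BDS> BDUID;   // the "S" for Selector is skipped in the typedef
typedef UID<BC>  BCUID;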

UID Sets Further, UIDs can be grouped in sets allowing for set operations like union, intersection and complement of two or more sets. Additionally, UIDSets provide the possibility to access all defined UIDs of a certain UID category, e.g. BDUIDSet::all(). UID Sets are used to simplify the setup process. An example can be found in Sec. 6.6.
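A minimal sketch of such a UID set is shown below; the union and intersection operators follow the description above, whereas the complement and the all() accessor would additionally require a registry of all created UIDs and are therefore omitted. The class is an illustration, not the WaLBerla implementation.

#include <set>

// Minimal UID set: stores UIDs of one category and provides set operations.
template <typename UIDType>
class UIDSet {
public:
    UIDSet() {}
    UIDSet(const UIDType& uid) { uids_.insert(uid); }

    UIDSet operator+(const UIDSet& other) const {   // union
        UIDSet result(*this);
        result.uids_.insert(other.uids_.begin(), other.uids_.end());
        return result;
    }
    UIDSet operator&(const UIDSet& other) const {   // intersection
        UIDSet result;
        for (const auto& u : uids_)
            if (other.uids_.count(u)) result.uids_.insert(u);
        return result;
    }
    bool contains(const UIDType& uid) const { return uids_.count(uid) != 0; }

private:
    std::set<UIDType> uids_;   // requires operator< on UIDType
};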

6.2 Kernel Management

Kernels On the finest granularity of work, e.g. the operations required for each unknown, LBM cell, or particle depending on the simulation task, WaLBerla does not apply any programming language constructs that are expensive in terms of performance. Instead, these operations are grouped together in a single callable object or function, a so-called kernel. Examples for work grouped in kernels are the allocation of data including initialization, or the execution of one LBM time step. This has many advantages: kernels are easy to use also by non-programming experts, and they can be easily exchanged and reused. Hence, several implementations of a certain functionality can be provided, for example one kernel which is easy to understand and extendable and another which is highly optimized and hardware-aware, but difficult to extend and understand. Thus, kernels can be perfectly used to apply optimizations for different architectures or as interfaces for different programming languages, like FORTRAN or CUDA. Additionally, kernels can be shared amongst simulation tasks.

Functionality Management In order to dynamically select and exchange kernels, WaLBerla has a kernel management system, which works with meta data that is added to each kernel. This meta data represents the kernel state that consists of three UIDs, each describing the functionality of a kernel for a different granularity. In particular, the first UID characterizes the kernel for the complete simulation and describes the characteristics of the kernel that do not change in space or between processes. The second and third UID characterize the features of the kernel that change for each process and in space. Examples for the UID granularity can be found in Tab. 6.1. During the initialization, the simulation state (FSUID and HSUID) is usually read in from the input file, whereas the Block state (BSUID) is determined by the sub-region of the simulation domain, which is initialized by the client.

// Define the UIDs required for the selection and characterization of LBM kernels //
FSUID fsTRT("TRT"), fsMRT("MRT"), fsSimple("Simple");
HSUID hsCPU("CPU"), hsGPU("GPU");

// Setup a Function Container for a LBM timestep //
USE_Sweep(){
   // LBM kernel ----------------------------------------------------------//
   swUseFunction("simple but extendable", simpleLBM, fsSimple, hsCPU, BSUIDSet::all());
   swUseFunction("using TRT", trtLBM, fsTRT, hsCPU, BSUIDSet::all());
   swUseFunction("using MRT", mrtLBM, fsMRT, hsCPU, BSUIDSet::all());
   //----------------------------------------------------------------------//
   swUseFunction("using TRT for GPUs", trtLBMonGPUs, fsTRT, hsGPU, BSUIDSet::all());
   swUseFunction("using MRT for GPUs", mrtLBMonGPUs, fsMRT, hsGPU, BSUIDSet::all());
   //----------------------------------------------------------------------//
}

Listing 6.1: Adding LBM kernels to a FunctionContainer.


Figure 6.2: Schematic UML class diagram for function::Containers and function::Functors.

Smart Scopes So far, the functionality of kernels can be characterized. What is still needed is that kernels accomplishing the same task can be grouped together, e.g. for the execution of one LBM time-step. To simplify this procedure for the client and to abstract from the actual implementation, the smart scope design is applied. This software concept has initially been developed for the PE physics engine [77]. In WaLBerla, a smart scope creates an object, which can be modified within the scope and is afterwards stored at a predefined location. In List. 6.1 this process is depicted for kernels executing one LBM time-step, which in WaLBerla corresponds to the execution of a Sweep. Here, the setup of a Sweep is simplified for a better understanding. A detailed description of Sweeps can be found in Sec. 6.3. First, in the macro USE_Sweep an object of type function::Container is created, to which kernels can be added with the help of the function swUseFunction. In addition, this function also connects the meta data, i.e. the UIDs, to the kernels, e.g. the C-function simpleLBM. Structurally, the grouping of the kernels in smart scopes and the connection of kernels and meta data corresponds to the application of the command software design pattern. The command pattern is a behavioral design pattern with the intention to "Encapsulate a request as an object, thereby letting you parametrize clients with different requests, queue or log requests, and support undoable operations." [66, Page 233]. As soon as the smart scope is left, the created function::Container is added to the list of configured Sweeps. The setup of Sweeps is only one example for the use of smart scopes in WaLBerla. Further usages are introduced in Sec. 6.6. An equivalent implementation without smart scopes is depicted in List. 6.2. This alternative requires that the details of the implementation are known to the client. The class function::Container has to be known, as well as where to store the created object. Hence, for this implementation the client needs more knowledge about the framework compared to the smart scope implementation. If the implementation should be changed, the implementation without smart scopes is more likely to require a change of the client code. The software concept behind the kernel management system is depicted in a UML class diagram in Fig. 6.2.

// Define the UIDs required for the selection and characterization of LBM kernels //
FSUID fsTRT("TRT"), fsMRT("MRT"), fsSimple("Simple");
HSUID hsCPU("CPU"), hsGPU("GPU");

// These typedefs are usually predefined elsewhere //
typedef void SwFuncSig(Block&);
typedef SWS UIDType;
typedef function::Container<SwFuncSig, UIDType> SweepFunctionContainer;
typedef shared_ptr< SweepFunctionContainer > SweepFunctionContainerID;
typedef function::Functor<SwFuncSig> SweepFunctor;

// Create the Container that groups the kernels //
SweepFunctionContainerID container(new SweepFunctionContainer());

// Add the kernels to the container //
container->addFunc("simple but extendable", simpleLBM, fsSimple, hsCPU, BSUIDSet::all());
container->addFunc("using TRT", trtLBM, fsTRT, hsCPU, BSUIDSet::all());
container->addFunc("using MRT", mrtLBM, fsMRT, hsCPU, BSUIDSet::all());
container->addFunc("using TRT for GPUs", trtLBMonGPUs, fsTRT, hsGPU, BSUIDSet::all());
container->addFunc("using MRT for GPUs", mrtLBMonGPUs, fsMRT, hsGPU, BSUIDSet::all());

// Store the grouped kernels in the object responsible for the sequence control //
CoreManager::instance().addSweep(container);

Listing 6.2: Adding LBM kernels to a FunctionContainer without smart scopes.

Kernel Execution The resulting FunctionContainer can be used to select and execute the work tasks in the following way:

// Fetch the kernel from the container and execute it //
// fs, hs and bs UIDs are fetched from the sequence control //
container->getFunc(fs,hs,bs)(< Function Parameters >);

Listing 6.3: Execute function.

Here, the FSUID fs and the HSUID hs are known to the sequence control from the input file, and the BSUID bs is determined based on the sub-region, i.e. the Blocks (Sec. 6.4). The implementation of the kernel management corresponds to the application of the strategy design pattern [66], which has already been defined above. In particular, the configured kernels correspond to a family of interchangeable and encapsulated algorithms that are defined by their function signature. They can be interchanged, so that the behavior of the simulation is modified.
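The selection logic behind getFunc can be pictured as in the following sketch, which stores each kernel together with its UID meta data and returns the first matching entry; the class is a simplification (single UIDs instead of UID sets) and not the actual WaLBerla container.

#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Minimal strategy-pattern container: kernels plus their meta data.
template <typename FuncSig>
class FunctionContainer {
public:
    using Function = std::function<FuncSig>;

    void addFunc(const std::string& name, Function f,
                 unsigned int fs, unsigned int hs, unsigned int bs)
    {
        funcs_.push_back({name, f, fs, hs, bs});
    }

    // Return the kernel registered for the requested simulation (fs),
    // hardware (hs), and Block (bs) state.
    Function getFunc(unsigned int fs, unsigned int hs, unsigned int bs) const
    {
        for (const auto& e : funcs_)
            if (e.fs == fs && e.hs == hs && e.bs == bs)
                return e.f;
        throw std::runtime_error("no kernel registered for this state");
    }

private:
    struct Entry { std::string name; Function f;
                   unsigned int fs, hs, bs; };
    std::vector<Entry> funcs_;
};

A call such as container.getFunc(fs, hs, bs)(block) then dispatches to the kernel registered for the current simulation, hardware, and Block state.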

6.3 Breaking Complex Simulation Tasks Down into Sweeps

Usually, complex simulation tasks can be broken down into several work steps. These work steps often have to be processed consecutively as they are connected via data dependencies. Additionally, in parallel simulations the work steps also define an order for the communication of boundary data. Hence, the specification of the work step sequence is a tedious process, as it involves the identification of data dependencies and the data to be communicated for each work step.

Sweeps The Sweep concept has been developed to integrate new simulation tasks as easily and intuitively as possible. It has first been proposed in [75] and has been extended for WaLBerla. A Sweep represents a single work step that traverses and updates each unknown once. It is composed of three parts: a pre-processing part, the actual work step, and a post-processing step. This is also depicted in the upper right chart of Fig. 6.3. The pre-processing and post-processing parts are used to fulfill all requirements for the work step by, e.g., communicating the required boundary data, or by tasks that do not involve the traversal of the unknowns. The actual work step updates the unknowns by executing one kernel for each Block. Implementation-wise, this is very efficient, as discussed in Sec. 6.2. An example for a work step is the computation of one LBM time-step.
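Structurally, a Sweep can be pictured as in the following sketch; the types are illustrative assumptions, and the dynamic kernel selection via UIDs described in Sec. 6.2 is omitted for brevity.

#include <functional>
#include <vector>

struct Block { /* holds the simulation data of one sub-region (omitted) */ };

// A Sweep consists of a pre-processing step (e.g. communication of boundary
// data), one kernel call per process-local Block, and a post-processing step.
struct Sweep {
    std::function<void()>       preProcessing;   // e.g. MPI communication
    std::function<void(Block&)> kernel;          // traverses the unknowns once
    std::function<void()>       postProcessing;  // e.g. write output

    void execute(std::vector<Block>& localBlocks) const {
        if (preProcessing) preProcessing();
        for (Block& b : localBlocks)
            kernel(b);                           // one kernel call per Block
        if (postProcessing) postProcessing();
    }
};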

Sequence of Sweeps Conceptually, the sequence of the Sweeps is equivalent to the time loop: the Sweeps can either be executed consecutively, or a single Sweep or several subsequent Sweeps can be repeated until certain criteria are fulfilled. The latter is required, e.g., for linear solvers that are applied iteratively until a certain convergence criterion is met. To provide a flexible mechanism for the Sweep execution, the pre-processing and post-processing as well as the actual work step can be selected dynamically at run-time. Thus, a Sweep can easily be adapted to the hardware on which it is executed, but also the physical aspects can be adjusted. A UML activity diagram for the Sweep concept is depicted in Fig. 6.3. The configuration mechanism to set up the Sweep sequence is described in Sec. 6.6.

Reusability and Maintainability The activity diagram of the time loop clearly shows the potential of WaLBerla for reusability and maintainability. If a user can reuse the existing parallelization, e.g. for next-neighbor communication of scalar or vectorial data, he only has to implement his kernel function, which is executed in Execute Kernel.


Figure 6.3: UML activity diagram of the Sweep concept.


Everything apart from the kernels, i.e. the sequence control, the iteration over the Blocks, the parallelization, and the functionality selection, is provided and executed by WaLBerla. Additionally, Sweeps can easily be added, be extended by pre- and post-processing functionality, or be adapted by adding new kernels. If a completely different sequence for the time loop is required, the time loop can also be exchanged easily.

Figure 6.4: Domain partitioning using Blocks.

6.4 Domain Decomposition

Parallel simulations using distributed memory require the possibility to distribute sub-regions of the simulation domain. This can either be achieved by a distribution based on individual unknowns or on groups of unknowns, e.g. cuboidal sub-regions. One possibility to distribute on a per-unknown basis are space-filling curves. The software frameworks Peano [78] and ILDBC [79], for example, have successfully implemented this scheme for large-scale parallel simulations. Graph-based approaches based on the distribution of unknowns proved to be difficult to apply for adaptive large-scale simulations involving about 10^9 and more unknowns [80]. The domain decomposition in WaLBerla is based on the distribution of equally sized cuboidal sub-regions, the so-called Blocks. Therefore, the simulation domain is usually divided such that one process allocates one Block. For load balancing reasons it is further possible to allocate several Blocks per process. Fig. 6.4 depicts example domain decompositions for four processes. Currently, the data structures of WaLBerla are being extended to support massively parallel adaptive simulations by replacing the Cartesian alignment of the Blocks. Instead, the Blocks are organized in a forest of octrees [81].
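For the regular case, the Block setup amounts to simple integer arithmetic, as the following sketch indicates; the helper functions are illustrative, assume a domain that is divisible by the Block grid, and do not correspond to the actual WaLBerla setup code.

#include <array>
#include <cstddef>

struct BlockSize { std::size_t x, y, z; };

// Equally sized cuboidal Blocks: nx x ny x nz lattice cells are split onto a
// Cartesian grid of bx x by x bz Blocks, usually one Block per process.
BlockSize blockSize(std::size_t nx, std::size_t ny, std::size_t nz,
                    std::size_t bx, std::size_t by, std::size_t bz)
{
    return { nx / bx, ny / by, nz / bz };
}

// Cartesian coordinates of the Block owned by a given MPI rank for a
// one-Block-per-process distribution (x fastest running).
std::array<std::size_t, 3> blockCoords(int rank, std::size_t bx, std::size_t by)
{
    const std::size_t r = static_cast<std::size_t>(rank);
    std::array<std::size_t, 3> coords = { r % bx, (r / bx) % by, r / (bx * by) };
    return coords;
}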


Figure 6.5: UML diagram of the Block class. A Block stores management information and the simulation data. The simulation data is stored in an array to support different kinds of data, e.g. PDFs, velocity data, or density data. All simulation data types have to inherit from the abstract base class bd::Interface, which itself inherits from the class util::NonCopyable to prevent the copying of the usually very memory-consuming simulation data. Further, bd::Interface stores an object of type BDUID to uniquely identify the simulation data objects.

Block Concept The distribution of Blocks has been chosen for WaLBerla over the distribution of cells, because Blocks can easily be adapted to various kinds of functionality, even at run-time. The data structures as well as the executed kernels of each individual Block can be dynamically adapted to the needs of the architecture or the simulation task. An example for that are heterogeneous simulations using Blocks which are processed on GPUs or CPUs. Hence, a Block is not only a sub-region of the simulation domain, but it also describes the underlying hardware and physics of its sub-region. In Fig. 6.5 the conceptual design of a Block is depicted. It can be seen that a Block stores and provides access to management information, such as the Block state that has been activated for this Block. This state is one part of the meta data UIDs introduced in Sec. 6.2. Further, it stores the MPI rank on which it is allocated, an axis-aligned bounding box (AABB), an ID which is required for the communication, the numbers of unknowns in each dimension, and whether it is allocated on this process. This information is known on all processes.
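A condensed C++ sketch of this design, reduced to a few members of Fig. 6.5, is given below; the names follow the figure, while the UID types are replaced by plain integers for brevity.

#include <memory>
#include <vector>

// Copying of the usually very memory-consuming simulation data is forbidden.
class NonCopyable {
protected:
    NonCopyable() {}
    ~NonCopyable() {}
private:
    NonCopyable(const NonCopyable&);
    NonCopyable& operator=(const NonCopyable&);
};

typedef unsigned int BDUID;   // stands in for UID<BDS>
typedef unsigned int BSUID;   // Block state selector

// Abstract base class of all simulation data objects (bd::Interface).
class BlockDataInterface : private NonCopyable {
public:
    virtual ~BlockDataInterface() {}
    BDUID uid() const { return uid_; }
protected:
    explicit BlockDataInterface(BDUID uid) : uid_(uid) {}
private:
    BDUID uid_;   // uniquely identifies the simulation data object
};

class Block {
public:
    BSUID state()       const { return state_; }      // part of the meta data
    int   rank()        const { return rank_; }        // owning MPI process
    bool  isAllocated() const { return allocated_; }   // data on this process?
private:
    BSUID state_     = 0;
    int   rank_      = 0;
    bool  allocated_ = false;
    std::vector<std::unique_ptr<BlockDataInterface>> simData_;
};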

Simulation Data Additionally, Blocks store the simulation data. Here, simulation data objects can either be field data based on unknowns, or unstructured data like buffers, particle data, or bubble data. The simulation data objects can be of arbitrary data types to allow an easy extension of WaLBerla for new simulation data types. In contrast to the management information, the simulation data of a Block is only allocated on the process responsible for the Block, which is determined by the MPI rank member of the Block. During the initialization phase of a simulation run, the simulation data is allocated depending on the simulation state and the Block state. In the example of Fig. 6.6, different data structures for LBM simulations are allocated depending on the hardware on which the simulation is executed. Details on the configuration of the data allocation are described in Sec. 6.6. As Blocks store a state and also all simulation data required for their sub-region, they can easily be transferred between processes to support load balancing strategies. Additionally, Blocks can be used for optimization strategies. Two possible optimizations are shown in Fig. 6.7.

Figure 6.6: Simulation data allocation process for a Block. First, the meta data describing the current simulation run is fetched. This meta data consists of the three UID types FS, HS, and BS. FS and HS are fetched from the sequence control, whereas BS is stored directly in the allocated Block. Depending on these UIDs the simulation data is allocated: the simulation data objects are traversed and the kernel responsible for allocation and initialization is selected with the help of the UIDs. In the depicted example this leads to the allocation of different simulation data objects depending on whether the simulation is executed on a CPU or a GPU.

6.5 Sequence Control

WaLBerla offers a sequence control that is physically independent of the rest of the framework, i.e. it is abstracted from simulation tasks as well as simulation data. It is located in the Core package (Fig. 5.4). Conceptually, the sequence control generalizes and standardizes the workflow within WaLBerla.


Figure 6.7: If a Block only contains obstacle cells, it can be blocked out. As the Figure on the left indicates, this is often the case in simulations involving blood vessels; only the white Blocks have to be allocated. Furthermore, Blocks could be specialized to the minimum functionality that is required in their sub-region. The Figure on the right depicts a simulation of a free-surface flow including particle agglomerates. As the simulation of free-surface flows is much more compute-intensive than the simulation of plain LBM flows, the time to solution can be reduced if the functionality is activated only in the Blocks where it is really required. The WaLBerla Block software concept supports this kind of specialization, but it has not yet been implemented for dynamic processes.


Figure 6.8: UML activity diagram for a simulation run. A simulation run is split into two phases: the initialization and the time loop. The activity diagram for the time loop is given in Fig. 6.3. Sec. 6.4 describes the Block data structure and Fig. 6.6 the simulation data allocation process in detail.

The activity diagram of the resulting workflow is depicted in Fig. 6.8. First of all, the workflow is split into two phases: the initialization phase and the time loop. The initialization phase can be divided into five major steps. First, there is the possibility to execute initialization functions and allocate global data. Initialization functions are usually simple functions or callable objects with a signature of void (void) and are used to, e.g., initialize libraries like CUDA. Global data is data that does not depend on space, i.e. data that must not be stored in a Block. This can be, for example, parameter structs for physical parameters like viscosity, domain size, or gravity. Next, the Block data structure is allocated and the meta data of the Blocks is initialized. This procedure can also be adapted for each simulation task, but usually this is not necessary. The next initialization step is the allocation of the simulation data in all Blocks. Finally, there is again the possibility to execute initialization functions and allocate global data structures. This second possibility is required for initialization steps that require the simulation data to be allocated, e.g. the initialization of the boundary conditions. The time loop is described in detail in Fig. 6.3. In the following Sec. 6.6 it is described in detail how the sequence control can be configured.
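The orchestration of these phases can be summarized by the following sketch; the names, the reduction of the Block creation step to a comment, and the use of plain std::function callbacks are illustrative simplifications of the workflow in Fig. 6.8.

#include <functional>
#include <vector>

struct Block { /* simulation data and meta data of one sub-region (omitted) */ };

// Two-phase workflow: initialization (global data, Blocks, Block data,
// late initialization) followed by the time loop that executes the Sweeps.
struct SequenceControl {
    std::vector<std::function<void()>>       initFunctions;      // e.g. init CUDA
    std::vector<std::function<void()>>       lateInitFunctions;  // e.g. init BCs
    std::function<void(std::vector<Block>&)> allocateBlockData;
    std::function<void(std::vector<Block>&)> timeLoop;

    void run(std::vector<Block>& blocks) {
        for (auto& f : initFunctions) f();        // phase 1a: global setup
        // phase 1b: create Blocks and initialize their meta data (omitted)
        allocateBlockData(blocks);                // phase 1c: simulation data
        for (auto& f : lateInitFunctions) f();    // phase 1d: needs the data
        timeLoop(blocks);                         // phase 2: execute Sweeps
    }
};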

6.6 Configuration and Implementation of Simulation Tasks

In WaLBerla each simulation task is represented by its own Application. An Application is composed of a class configuring the workflow for the simulation task, and of the functions and callable objects implementing the functionality required for the simulation task. Each Application is implemented in its own package, as can be seen in Fig. 5.4. To ensure that the complete framework does not have to be known for the implementation of a new Application, the configuration of the workflow is specified at exactly one position, and no existing code has to be modified. Instead, a new class has to be implemented. Additionally, the kernel management (Sec. 6.2) abstracts from the framework details, so that only the configuration interface has to be known. In the following, this interface is explained by the configuration of the workflow for the simulation of pure LBM fluid flows. This simulation task is capable of simulating fluid flows in arbitrary geometries, offers the possibility to visualize the simulation data with the interactive visualization software Paraview [82], and supports simulations on GPUs. In List. 6.4 this configuration is implemented in the constructor of an example application class MyApp.

Configuration of the Meta Data First, the meta data that describes the individual functionality of each kernel is specified. As described in Sec. 6.1, the meta data in WaLBerla is composed of three UIDs: FSUIDs, HSUIDs, and BSUIDs. For this simple simulation task, the UIDs are only used to specify whether the simulation should run on a CPU or a GPU. Hence, only one default FSUID and BSUID are specified, fs and bs, but two HSUIDs are allocated: hsCPU and hsGPU. Thus, to indicate that a kernel is suited for CPUs it is given the meta data (fs, hsCPU, bs), and for GPUs (fs, hsGPU, bs). Further, to run a simulation on a CPU the hardware parameter in the input file has to be set to the string "CPU", and for GPUs to "GPU".

MyApp::MyApp()
{
   // FSUID //
   FSUID fs = useFSUID( "Default" );

   // HSUID //
   HSUID hsCPU = useHSUID( "CPU" );
   HSUID hsGPU = useHSUID( "GPU" );
   HSUIDSet hsSet = hsCPU + hsGPU;

   // BSUID //
   BSUID bs = useInitBSUID( "Default", ...);

Listing 6.4: Configuration of UIDs.

Configuration of the Initialization The allocation of the UIDs is followed by the specification of late initialization kernels. In List. 6.5 three initialization kernels are specified: one for the initialization of the boundary conditions, one for the activation of Paraview, and one for the initialization of CUDA. The latter is only executed for simulations on GPUs. The configuration is realized with the smart scope concept described in Sec. 6.2.

USE_Late_InitFunction(gdInitBC) { initUseFunction("initBC" , bc::initBC,fs,hsSet); }

USE_Late_InitFunction( getParaviewManagerUID() ) { initUseFunction("initParaview", paraview::initParaview, fs, hsSet ); }

USE_Late_InitFunction( gdInitCuda ) { initUseFunction("initCUDA" , gpu_util::initCuda, fs, hsGPU ); } Listing 6.5: Configuration of initialization kernels.

Configuration of Global Data Next, a global data parameter struct for LBM parameters is configured. Here, the kernel lbm::gd::initLbmData allocates the parameter struct, initializes it, and returns it, so that it can be stored within the framework. During the simulation, global data can be accessed via the Singleton software design pattern [66] utilizing UIDs of the category GD, called GDUIDs. The GDUID connected to each global data object is specified in the header of the USE_GlobalData smart scope. In this case the GDUID is returned by the function getLbmDataUID, so that the data can be accessed during the simulation.

USE_GlobalData( lbm::getLbmDataUID() ) { gdUseFunction( "create LbmData", lbm::gd::initLbmData, fs, hsSet ); } Listing 6.6: Configuration of global data.

Configuration of Block Data List. 6.7 shows the configuration of the simulation data objects. For the sake of simplicity, only a subset of the required simulation data is given. In detail, three smart scopes are specified. The first specifies kernels to allocate a field for the PDFs, which is allocated on either the CPU or the GPU. Additionally, two smart scopes to allocate data fields for the velocity are specified. If the simulation is executed on a CPU, only one field is allocated, but for GPUs one field on the CPU and one on the GPU is allocated. This is required to visualize the velocity field. Thus, it is possible to configure a smart scope so that a specific kernel is selected, one kernel is always executed, or no kernel is executed for certain UIDs. Again, the specified kernels allocate, initialize, and return the simulation data. To access the data fields during the simulation, the UIDs returned by the functions in the header of the smart scopes are used (Sec. 6.4).

USE_BlockData( lbm::getSrcUID() ) { bdUseFunction( "initSrcPDFField", lbm::initPDFField, fs, hsCPU, bs ); bdUseFunction( "initSrcPDFField on GPUs", gpu_lbm::initPDFField, fs, hsGPU, bs ); }

USE_BlockData( lbm::getVelUID() ) { bdUseFunction( "initVelField", lbm::initVelField, fs, hsSet, bs ); }

USE_BlockData( lbm::getVelUIDGPU() ) { bdUseFunction( "initVelField on GPUs", gpu_lbm::initVelField, fs, hsGPU, bs ); } Listing 6.7: Configuration of Blockdata.

Configuration of Sweeps In Sec. 6.3 the design of Sweeps was introduced. Basically, Sweeps consist of three parts: a pre-processing step, a work step, and a post-processing step. In List. 6.8 the smart scopes USE_Before and USE_After correspond to the pre- and post-processing steps. The pre-processing executes the communication, which is described in detail in Part III. The post-processing writes out the Paraview data for visualization. Kernels that are scheduled in a USE_Before or a USE_After scope have the signature void (void). For the actual work step two kernels are defined, one for simulations on CPUs and one for GPUs. The USE_Sweep smart scope is described in detail in Sec. 6.2.

USE_SweepSection( getPDFSweepUID() ) {

USE_Before() { bfUseFunction( "communicate", comm::createCommFunctions(cdUID), fs, hsSet); } USE_Sweep(){ swUseFunction( "CPU: LBMSweep", lbm::sweep::basicLBMSweep, fs, hsCPU, bs ); swUseFunction( "GPU: LBMSweep", gpu_lbm::sweep::basicLBMSweep, fs, hsGPU, bs ); } USE_After(){afUseFunction( "write paraview", writeParaview, fs, hsSet );} } Listing 6.8: Configuration of Sweeps.

7 Software Quality

The software quality of WaLBerla is an important aspect that the WaLBerla development team constantly strives to improve. In the following, a detailed overview of the software quality is provided. The classification is based on the software quality factors introduced in Sec. 5.1.

7.1 Usability

To provide a software framework that is usable and easy to understand, WaLBerla features clear software structures, guides for new users, and a readable input file format for the activation and parametrization of simulation tasks. In detail, the implementation of WaLBerla is clearly structured due to the low physical dependencies between packages. Thus, individual components can be understood on their own or be used as black boxes. Further, the implementation of the functionality management as well as the Sweep and Block software design concepts automatically results in interfaces that are easy to use for clients. Amongst others, they not only enable the configuration of simulation tasks in one place without the need to completely understand or modify WaLBerla, but also allow complex simulation tasks to be broken down into well-defined work steps and reduce the time required for the parallelization of simulation tasks. Structurally, the code is divided into several namespaces to underline the logical and physical structure. For example, each package, the utilities, and the sequence control are in their own namespace. Additionally, naming conventions for the software design and the implementation are identical for a better understanding. C++ coding guidelines have been adhered to throughout WaLBerla for a better readability and consistency of the code, and the implementation is documented via Doxygen [83]. Multi-user development is supported by the version control system Subversion [84]. Further, a user guide and an example application are provided for clients to try out WaLBerla. Finally, the logging system can provide detailed information on the initialization and the simulation progress, so that clients can verify the workflow.

7.2 Reliability

For a reliable implementation it is important that errors can easily be detected and that components of the code can be tested separately. For the detection of errors WaLBerla supports a DEBUG mode that, due to the involved overhead, has to be activated by a compilation switch. With the help of this mode all accesses to the simulation data objects are monitored, e.g. for array index bounds. Additionally, ASSERTs are activated to detect possibly uninitialized pointers. Further, sections that lead to a termination of the simulation can be easily identified by the logging system, as it provides a logging mode for detailed progress information. Physical correctness can also be monitored in the DEBUG mode, as prior to reads and writes of physical quantities it can be checked whether they are in a valid state. For example, it can be monitored whether velocity and density are in a valid range. On detection of errors the simulation is terminated, providing information on the error that occurred. In addition, build checks are executed after each check-in to the version control system, and check-ins have to be approved by at least two members of the WaLBerla development team.

7.3 Portability

WaLBerla has been designed as a massively parallel software framework that is portable to various supercomputers. It has already been ported to the clusters Jugene [85], Juropa [86], and Lima [86], and to the heterogeneous CPU–GPU clusters Judge [87], Dirac [88], and Tsubame 2.0 [25]. The only dependencies WaLBerla requires are the Boost library [89] and the cross-platform build system CMake [90]. From an operating system point of view, the development and application of WaLBerla under Windows and Linux are supported.

7.4 Maintainability and Expandability

A fundamental design decision of WaLBerla has been the minimization of the physical dependencies, which inherently ensures good maintainability and expandability. The logical design further improves this by abstracting from, e.g., simulation tasks, data types, simulation data objects, functionality, and boundary conditions, such that it is possible to easily redesign or add new functionality. In particular, the Sweep design concept allows to extend simulation tasks by new Sweeps, while the functionality management allows the modification and dynamic adaption of functionality within Sweeps. In addition, the data structures for the simulation data are designed so that it is possible to use different data layouts, to use data structures optimized for certain architectures, or to use own implementations if the need arises. The general design supports the exchange of modules if the requirements cannot be fulfilled by the existing ones. However, effort is spent during the design of modules on being as general as possible while remaining efficient.

7.5 Efficiency and Scalability

Efficiency and scalability are two of the major design goals of WaLBerla. Here, WaLBerla benefits from the long-term experience in the optimization of numerical codes for serial [91, 92, 93, 94] as well as parallel large scale simulations [95] of the research group in which WaLBerla is developed. WaLBerla avoids the introduction of significant framework overhead by avoiding expensive language constructs for the finest granularity of operations. Instead, this functionality is encapsulated in kernels (see Sec. 6.2). Computationally, WaLBerla provides data structures for the simulation data that support different kinds of data layouts and padding of data elements, which are both hidden from the user of the C++ interface. For the parallelization of simulation tasks, WaLBerla supports different parallelization paradigms, e.g. pure-MPI or hybrid parallelization. Additionally, the overlapping of work and communication is supported.
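To illustrate what a data layout switch means for the finest granularity of access, the following sketch contrasts the two common index calculations for a field with several unknowns per cell. The function and parameter names are chosen for illustration only and do not mirror the WaLBerla interface.

#include <cstddef>

// AoS (array of structures): all unknowns of one cell are stored adjacently.
inline std::size_t indexAoS( std::size_t x, std::size_t y, std::size_t z, std::size_t f,
                             std::size_t xSize, std::size_t ySize, std::size_t cellSize ) {
   return ( ( z * ySize + y ) * xSize + x ) * cellSize + f;
}

// SoA (structure of arrays): the same unknown f of all cells is stored adjacently,
// which favors vectorization and coalesced accesses on GPUs.
inline std::size_t indexSoA( std::size_t x, std::size_t y, std::size_t z, std::size_t f,
                             std::size_t xSize, std::size_t ySize, std::size_t zSize ) {
   return ( ( f * zSize + z ) * ySize + y ) * xSize + x;
}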


8 Modules and Tools available in WaLBerla

Along with the maintainability and the parallelization, the usability of a software framework for numerical simulation depends on the available tools and modules. Amongst others, WaLBerla offers tools and modules to:

• read in input files [96],
• write out visualization data in the vtk file format [97],
• convert physical input parameters into dimensionless lattice quantities, and support variables and equations in the input file [98],
• write post-processing data aggregated in charts [99].

Additionally, the WaLBerla::Core prototype offers modules and tools for the simulation data, boundary condition handling, logging, and time measurements that are introduced in the following.

8.1 The Module for the Simulation Data

The design and implementation of the data structures for the simulation data is crucial for the serial kernel performance, and also for the flexibility and maintainability of the software framework. These data structures have to provide efficient access to the finest data granularity, i.e. access to the unknowns or lattice cells. No expensive language constructs may be utilized for the realization, e.g. access via function pointers or dynamic polymorphism. The best solutions in C in terms of performance are either plain pointers or VLAs (Variable Length Arrays) [1] pointing to the grid of unknowns in an appropriate storage format. However, there is also the need for, e.g., debugging, layout switching, padding, and the possibility to use different data types and simulation data objects. In WaLBerla the pointer to the unknowns and the data access have been encapsulated through the class hierarchy depicted in Fig. 8.1. This hierarchy consists

[1] VLAs are only available in C99.

of four classes. The class template bd::Interface provides an interface for all the different simulation data objects. bdfield::Storage acts as a mere storage container, which is used if no access to a single unknown is required or possible. In particular, this class is well suited to allocate data for GPUs, as in this case no access to the unknowns is possible on the host, but information on sizes is accessible. The class bdfield::Field adds functionality for the access of the data. Thus, it is the standard class for all simulation data objects, e.g. grids of unknowns for the PDFs, density, velocity, temperature, or potential. util::NonCopyable prevents copying of the simulation data.

Class util::NonCopyable   Simulation data classes usually store large amounts of data, e.g. up to several GB, so that it should be prevented that they are copied by accident. To prevent all possible copy operations is the task of util::NonCopyable. Classes deriving from util::NonCopyable cannot be copied, as the copy constructor and the copy assignment operator are declared private and left undefined [100]. This approach is easy to implement and well visible to clients of the data structures. As the relation “MyClass is-implemented-in-terms-of util::NonCopyable” holds true, the derivation should be private.
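A minimal sketch of this C++03 idiom is given below; the class and member names are illustrative only.

// Copy constructor and copy assignment are declared private and left
// undefined, so any attempt to copy a derived object fails to compile.
class NonCopyable {
protected:
   NonCopyable() {}
   ~NonCopyable() {}
private:
   NonCopyable( const NonCopyable& );             // intentionally not implemented
   NonCopyable& operator=( const NonCopyable& );  // intentionally not implemented
};

// Private derivation expresses "is-implemented-in-terms-of".
class LargeSimulationData : private NonCopyable { /* ... */ };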

Class bd::Interface   This is the abstract base class for all simulation data classes. It is responsible for the access and storage of the BDUIDs and provides a single interface to all simulation data objects, such that all simulation data objects can be stored in a single array within a Block. This corresponds to the facade software design pattern [66].

Class bdfield::Storage   This class is responsible for the storage of the data. The class features two template parameters: Type and CellSize. Type determines the data type of the unknowns and CellSize the number of unknowns for each lattice cell. For example, the LBM field for the PDFs of the D3Q19 model results in a Storage with CellSize = 19. Further, the class also stores the number of cells in x-, y-, and z-dimension, and an enum for the layout. Currently, based on this parameter it is decided if the SoA or the AoS layout is used. These parameters are provided via the constructor. As constructor arguments the class also receives boost::function [89] objects for the data allocation and deallocation. A boost::function object represents a callable object, e.g. a function pointer or a function object, having a specific function signature. This has the advantage that the class is completely independent of the memory allocation. By providing different allocation functions, various optimization techniques can be implemented and different hardware architectures can be supported. For example, by providing functions internally utilizing CUDA, the bdfield::Storage class can be used for the organization of the simulation data located on the GPU. In addition, functions providing the special data alignments required for SSE or for coalesced loads and stores on GPUs (see Sec. 4.2 for the requirements) can be supported. Furthermore, by outsourcing the memory allocation, the overall allocated memory can easily be monitored without changing the bdfield::Storage class. This implementation realizes the strategy software design pattern [66]. For the data access the class provides only the public member functions getData (raw pointer access), getIndex (index calculation), and print (print to std::out).

Class bdfield::Field   The Storage class is a mere data container not providing any member functions for the data access apart from the access to the raw pointer. Consequently, the Field class derives from the Storage class and adds the functionality for the data access. Two get member functions are specified. The first provides access to a single unknown in the field, accessed by its x, y, z index for the cell and by the f index that specifies the unknown within the cell. The second function requires the additional parameters file and line, and is only needed for debug purposes, as it applies range checks and tests for the validity of the accessed unknown. The functionality required for these checks is again provided by boost::function objects via the constructor. If the get member functions are accessed via a macro, i.e. myfield.GET(x,y,z,f), then a compile time switch can define if the debugging functionality is turned on or off.
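Such a compile-time switch between checked and unchecked access can be sketched with a macro like the following; WALBERLA_DEBUG and the exact get signatures are assumptions for illustration only.

// In debug builds the checked overload with file/line information is used,
// otherwise the fast overload; clients write myfield.GET(x,y,z,f) in both cases.
#ifdef WALBERLA_DEBUG
#  define GET( x, y, z, f )  get( (x), (y), (z), (f), __FILE__, __LINE__ )
#else
#  define GET( x, y, z, f )  get( (x), (y), (z), (f) )
#endif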

Figure 8.1: UML class diagram for the simulation data hierarchy. [Diagram omitted: bd::Interface stores the BDUID (uid_) and offers getUID(); bdfield::Storage<Type, CellSize> derives from bd::Interface and util::NonCopyable, stores the data pointer (data_), the layout (layout_), the sizes (x/y/zSize_, x/y/zAllocSize_) and boost::function objects for printing, allocation, and deallocation, and offers getData(), getIndex(), print(), and x/y/zSize(); bdfield::Field<Type, CellSize> derives from Storage, additionally receives access-control and value-check boost::function objects via its constructor, and offers the two get() member functions; util::NonCopyable declares its copy constructor and copy assignment operator private.]

8.2 Boundary Condition Handling for Complex Geometries

The correct treatment of the boundary conditions (BCs) is an essential part of all simulation tasks. Usually, there are several BCs implemented for the treatment of, e.g., inflow, outflow, solid walls, interfaces, and heat sources or drains. The required BCs strongly depend on the simulation task. In addition, new boundary conditions are often added and BCs are likely to be exchanged by more advanced models. From a software development point of view BCs are well suited for reuse, as many simulation tasks share the same BCs. Hence, a maintainable and extendable software design concept for the administration of BCs is crucial. WaLBerla supports a BC handling software design based on the encoding of the physical status of cells in Flags. The concept allows each simulation task to select its own subset of BCs, which can be specified via the input file. Additionally, complex geometries, e.g. porous media, can be read in from a binary encoded file storing one byte per lattice cell. The BCs in WaLBerla are always executed before or after the kernel execution. This has the advantage that the kernel is independent of the BCs, increasing its reusability. Each Block has its own BC handling; for the CPU implementation the BC handling is located in the module bc and for the GPU implementation in the module gpubc (see Sec. 5.2 for details on modules). In the following the software concept for the handling of BCs is introduced.

Treating BCs based on Flags   The physical state of a lattice cell is often encoded by so-called flags in fluid dynamics simulations. A flag thereby stores a bitmask corresponding to a certain state, e.g. liquid, solid wall, or inflow, which consists of a single bit and is equal to 2^(n-1) for the n-th boundary condition. Thus, BCs can be combined, e.g. a lattice cell can be a fluid cell which is near an obstacle: Flag(3) = FLUID(1) + NEAR_OBST(2). In order to determine if a certain state is active the bit-wise & operator can be used: Flag(3) & NEAR_OBST(2) == NEAR_OBST(2). WaLBerla encapsulates the bitmask for the cell states in the Flag class, providing operators for the manipulation of the flags.
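The encoding can be made concrete with a few lines of plain C++; the Flag class itself wraps such bitmasks, the raw unsigned integers below are used only for illustration.

#include <cassert>

int main() {
   const unsigned int FLUID     = 1u << 0;   // 2^0 = 1, first state bit
   const unsigned int NEAR_OBST = 1u << 1;   // 2^1 = 2, second state bit

   unsigned int cell = FLUID | NEAR_OBST;    // Flag(3): fluid cell near an obstacle

   // State tests via the bit-wise & operator:
   assert( ( cell & NEAR_OBST ) == NEAR_OBST );
   assert( ( cell & FLUID     ) == FLUID     );
   return 0;
}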

Naive Approach   A first implementation approach is shown in List. 8.1. The depicted function traverses all lattice cells and, depending on the state, treats the boundary conditions. Here, some boundary conditions require information of the complete lattice cell and some are only based on the PDF of one lattice

velocity and its inverse, e.g. bounce back. The disadvantage of this approach is that the bitmasks for the BCs are stored in a globally accessible enum, which has to contain all available BCs. This enum also has to be known to all parts of the code that treat BCs and thus leads to poor physical dependencies. Furthermore, the function treatBC() has to be able to treat all BCs of all simulation tasks, or several treatBC() functions have to be implemented. The first option results in poor performance and a waste of bits in the flags, as many BCs have to be tested and stored although they might not be needed by the simulation task. The second results in unnecessary code duplication, as identical functionality is present in many implementations. Both options should be avoided for large-scale software frameworks.

enum bcCells { FLUID = 1, BC_1 = 2, BC_2 = 4, BC_3 = 8 };

template< /*...*/ >
void treatBCs(){

   // Fetch Simulation Data Objects //
   boost::shared_ptr< /*...*/ > flags   = /*...*/;
   boost::shared_ptr< /*...*/ > pdf_src = /*...*/;
   /*...*/

   // For BCs that Require Operations on a per Direction Basis //
   for ( /*...z...*/ )
    for ( /*...y...*/ )
     for ( /*...x...*/ )
      if( /* BC_1 */ ) {
         /* Treatment Per Cell */
         /* Treat BC_1 */
      } else if( /* BC_2 */ ) {
         for( /*...f...*/ ){
            /* Treatment Per Lattice Direction */
            /* Treat BC_2 */
         }
      }
      else /*...*/
}
Listing 8.1: Boundary condition handling.

General Class Design   The final class hierarchy for the BC handling is illustrated in Fig. 8.2. In general, the WaLBerla::Core BC handling encapsulates the function treatBC() as a member function of the templated class bc::Handling. The template arguments thereby are the BC classes, which treat the actual boundary conditions, and the bitmasks for the BC flags are provided via the constructor as

BCUIDs. BCUIDs are a typedef of the uids::UID class described in Sec. 6.1. The current implementation features a maximum of five BC types. So far this has proved to be sufficient, but it could easily be extended if the need arises. In detail, this allows to select a subset of boundary conditions, each represented by its own class that is assigned an arbitrary bitmask. Further, the class Handling is publicly derived from the abstract base class bc::Interface. Via the interface of this class the boundary conditions can be treated. As the interface is not templated, the actual BCs are not known to the calling client, so that the BCs do not have to be globally known. In addition, bc::Interface is derived from the base class of all simulation data classes, bd::Interface, so that the boundary condition handling can be stored along with all other simulation data objects and no extensions for each Block are required. Hence, the BC handling is physically completely independent and can easily be exchanged, reused, and tested. The implementation of the treatBC() function is given in List. 8.3. It can be seen that it fetches all required data, processes all operations necessary on a per-cell level, e.g. updating the velocity for ramped inflows, and updates the PDFs influenced by BCs. The class VoidBC is a placeholder for BC class slots that are unused. With the help of the type trait boost::is_same it can be detected at compile time if all five BC class slots are used. The actual BC treatment is executed by the BC classes.

Figure 8.2: UML class diagram for the boundary condition handling. [Diagram omitted: the class template bc::Handling<Stencil, BCType1, ..., BCType5> stores the five BC objects (bc1_ to bc5_) and a pointer to its Block, receives the BDUID and the five BCUIDs via its constructor, and offers treatBC() and finalize(); it derives from bc::Interface, which in turn derives from bd::Interface; gpubc::Handling derives from bc::Handling and overrides treatBC() and finalize().]

A good example for an extension is the treatment of BCs on GPUs. Here, the treatment of the BCs is different, as the BCs have to be treated on the GPU with the help of CUDA kernels. To extend the BC handling for GPUs, the new class gpubc::Handling is derived from the class bc::Handling and only reimplements the parts of the virtual interface that have to be adapted, i.e. in the virtual member function treatBC a CUDA kernel is called to update the boundary conditions instead of the original implementation of List. 8.3. In List. 8.2 the call to the BC handling is depicted. It can be seen that this client code is completely unaware of the BC classes as well as of whether the BCs are treated on CPUs or GPUs.

shared_ptr< bc::Interface > bcHandling = block->getBD( handlingUID );
bcHandling->treatBC();
Listing 8.2: Treatment of the boundary conditions.
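The decoupling used here is plain virtual dispatch. The following stripped-down sketch (class names simplified, not the actual WaLBerla interfaces) illustrates why the client code in List. 8.2 stays unaware of the concrete handling.

#include <iostream>

struct BCInterface {                       // role of bc::Interface
   virtual ~BCInterface() {}
   virtual void treatBC() = 0;
};

struct CPUHandling : public BCInterface {  // role of bc::Handling<...>
   virtual void treatBC() { std::cout << "treat BCs in CPU loops\n"; }
};

struct GPUHandling : public CPUHandling {  // role of gpubc::Handling<...>
   virtual void treatBC() { std::cout << "launch CUDA kernels for the BCs\n"; }
};

int main() {
   BCInterface* handling = new GPUHandling();   // chosen during initialization
   handling->treatBC();                         // client code is architecture-agnostic
   delete handling;
   return 0;
}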

Figure 8.3: UML class diagram for the NoSlip class. [Diagram omitted: bc::NoSlip<Stencil> stores a pointer to its Block and a shared pointer to the PDF field (src_) and offers treatBCDir(Uint,Uint,Uint,Uint), treatBCCells(), and finalize(); gpubc::NoSlip<Stencil> additionally stores the GPU PDF storage (src_gpu_) and CPU/GPU index lists of the no-slip cells.]

template< typename Sten,
          template< typename > class BC1, template< typename > class BC2,
          template< typename > class BC3, template< typename > class BC4,
          template< typename > class BC5 >
void Handling< Sten, BC1, BC2, BC3, BC4, BC5 >::treatBC(){

   // First fetch the FlagField //
   shared_ptr< /*...*/ > flags = /*...*/;

   // Fetch Flags corresponding to BC 1 to 5 //
   const Flag &f1 = bc1_.flag(); /* ... */;

   // Fetch Flag Corresponding to LIQUID and //
   // Cells that are Near to an Obstacle     //
   const Flag &NEAR_OBST = /* ... */;
   const Flag &LIQUID    = /* ... */;

   ////////////////////////
   // TREATMENT PER CELL //
   ////////////////////////

   // Treat BCs based on Cells for BC 1 to 5 //
   if( !boost::is_same< VoidBC<Sten>, BC1<Sten> >::value ) bc1_.treatBCCells();
   /* ... */;

   /////////////////////////////
   // TREATMENT PER DIRECTION //
   /////////////////////////////

#  pragma omp parallel for
   for( Uint z = 0; z < /*...*/; ++z )
    for( Uint y = 0; y < /*...*/; ++y )
     for( Uint x = 0; x < /*...*/; ++x ) {

      // Are there any BC Cells Near This Cell //
      const Flag &flag = flags->GET_SCALAR(x,y,z);
      if( (flag & NEAR_OBST) != NEAR_OBST ) continue;

      for( Uint d = 1; d < /*...*/; ++d ) {

         // Get the Index and Flag of the Current Neighbour Cell //
         Uint n_x = x + Sten::cx[d];
         Uint n_y = y + Sten::cy[d];
         Uint n_z = z + Sten::cz[d];
         Flag nFlag = flags->GET_SCALAR(n_x,n_y,n_z).get();

         // Is it a FLUID Cell then continue //
         if( nFlag & LIQUID ) continue;

         if( !is_same< VoidBC<Sten>, BC1<Sten> >::value && ((nFlag & f1) == f1) )
            bc1_.treatBCDir(x,y,z,d);
         else if( /*...*/ ) { /* Do the Same for BCs 2-5 */ }
         else { /* Do Error Handling if nFlag could not be matched */ }
      }
   }
}
Listing 8.3: Boundary condition handling.

BC Classes bc::NoSlip and gpubc::NoSlip   List. 8.3 shows that the actual BC treatment is encapsulated in BC classes. Two BC class examples for the treatment of the no-slip BC for solid walls are depicted in an UML class diagram in Fig. 8.3. The class bc::NoSlip can be one of the template parameters of the bc::Handling class. One of the constructor parameters is the Block for which the BCs are treated; hence, the BC class has access to all simulation data objects, e.g. the PDF field. The interface of the class has to offer the functions treatBCCells, treatBCDir, and finalize. The function treatBCCells is empty for the no-slip BC, as no per-cell operations are required, while the treatBCDir function implements the actual no-slip BC. Both functions are inline, so that no additional overhead for the function call should occur. The function finalize is used to optimize data structures, e.g. list-based implementations for fixed obstacles. For the CPU version of this class the function is empty. However, the GPU implementation gpubc::NoSlip sets up an index list of the no-slip cells and transfers it from host to device. Note that the parameters of the function treatBCDir differ for the CPU and the GPU version, as on the GPU only one kernel is called to treat all cells and lattice directions.
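As an illustration of what treatBCDir has to do for the no-slip case, a hedged sketch of the simple bounce-back rule is given below; the field type, the stencil members, and the index order are placeholders and do not reproduce the actual WaLBerla kernels.

typedef unsigned int Uint;   // stands in for the WaLBerla unsigned integer type

// The PDF of fluid cell (x,y,z) pointing in direction d towards the solid
// neighbour is reflected into the inverse direction at the same cell.
template< typename Sten, typename PDFField >
inline void treatNoSlipDir( PDFField& pdf, Uint x, Uint y, Uint z, Uint d ) {
   pdf.get( x, y, z, Sten::inv[d] ) = pdf.get( x, y, z, d );
}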

Initialization and Configuration   As stated above, the BC handling is physically independent. If a simulation task requires its own implementation of BCs, it simply does not configure the BC handling. In detail, the configuration can be seen in List. 8.4. One simulation data object is added for the flags and one for the handling class. Here, the createHandling functions either create and initialize an object of type bc::Handling or gpubc::Handling. Further, the initialization of the BCs, i.e. the reading of the input data, the complex geometry files, and the initialization of the flags, is executed by the function initBC.

USE_BC_Storage(getFlagUID()){
   bdUseFunction("initFlagField", initFlagField, fs, hsSet, bs);
}

USE_BC_Handling(::bc::getHandlingUID()){
   bdUseFunction("createCPUHandling", createCPUHandling, fs, hsCPU, bs);
   bdUseFunction("createGPUHandling", createGPUHandling, fs, hsGPU, bs);
}

USE_Late_InitFunction(gdInitBC){
   initUseFunction("initBoundary Conditions", initBC, fs, hsSet);
}
Listing 8.4: Configuration of the boundary condition handling.

8.3 The WaLBerla Logging Tool

For a software framework, a tool that offers the possibility to reliably report errors and warnings and to log run-time information is of highest importance. Of special interest is the requirement to log information for later evaluation, e.g. simulation results, timings, and input parameters, or to gain a deeper insight into the workflow of the executed processes. However, even more important is the logging of debug information, including errors, for simulations executed in batch jobs on massively parallel supercomputers. This feature is essential for the understanding of the parallel workflow of the simulation task. For a parallel software framework such as WaLBerla it is important that the logging system supports threaded as well as process parallel environments. In addition, the logging system should not have an impact on the performance of production runs. Computationally, it is thus important to be able to restrict the amount of logging output in terms of data as well as files, especially for massively parallel simulations. For example, it is not efficient or even feasible to provide detailed information on the simulation on, e.g., the JUGENE supercomputer featuring about 300k cores. Further, the tool should be configurable to support different modes of execution, e.g. debug or production runs. Furthermore, the tool should be portable, easy to use, and above all reliable. In the following the WaLBerla logging tool is introduced.

A First Example

int main(){

   std::string logFile  ("myOutputFile.log");
   std::string logLevel ("progress");
   bool        logAppend = false;

   Logging::configure(logFile, logLevel, logAppend);

   if(/*...*/) LOG_ERROR  ("An Error has Occured");
   if(/*...*/) LOG_WARNING("A Warning has Occured");

   for(/* timeloop */){
      LOG_INFO("Running Time Step " /* << ... */);
      /*...*/
   }
}
Listing 8.5: Usage example of the WaLBerla logging tool.

In List. 8.5 an example for the usage of the WaLBerla logging tool is depicted. During the configuration it is possible to provide an output file base name, the granularity of the logging, and whether old files shall be overwritten or appended. The actual log messages are specified via the macros LOG_ERROR, LOG_WARNING, LOG_INFO, LOG_PROGRESS, and LOG_PROGRESS_DETAIL. The macros are used to add additional functionality. For example, the LOG_ERROR macro exits the simulation, and through the macros the actual logging messages can be specified either as strings, “logging ” + str + “ further log text”, or in stringstream syntax, “logging” << str << “ further text”. Additionally, logging levels can be turned on or off by redefining the macros. An example for the output of the WaLBerla logging tool is depicted in List. 8.6. For each logging message the logging level, a time stamp, and the actual message are written.

[INFO    ]------(5.046 sec) Running TimeStep 1
[PROGRESS]------(5.102 sec) Extracting Data for Communication
[PROGRESS]------(5.273 sec) Communication via MPI
[PROGRESS]------(5.792 sec) Inserting Data for Communication
[PROGRESS]------(5.925 sec) Running LBM Sweep
Listing 8.6: Output example for the logging tool.
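One possible way to realize macros that both accept stringstream syntax and add behavior such as terminating the run on errors is sketched below; the macro body is an assumption for illustration and not the actual WaLBerla definition.

#include <cstdlib>
#include <sstream>

#define LOG_ERROR( msg )                                                \
   do {                                                                 \
      std::stringstream ss_;                                            \
      ss_ << msg;                  /* enables "text" << value << ... */ \
      /* forward ss_.str() to the logging instance here */              \
      std::exit( EXIT_FAILURE );   /* an error terminates the run */    \
   } while( false )

// Usage: LOG_ERROR( "Running Time Step " << ts << " failed" );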

Logging Levels In order to structure and also to minimize the logging output for production runs several logging levels are supported.

Logging Level       What?                          Output               Configuration
No log              Errors and Warnings            std::cerr and File   Input File
Info                High Level Information         std::cerr and File   Input File
Progress            Simulation Progress            File                 Input File
Detailed Progress   Detailed Simulation Progress   File                 Input File + Compile Switch

Hereby, it is possible to report high level status information (info) such as the time-step or the simulation parameters with which the simulation has been executed. The progress of the simulation can be reported in a standard (progress) as well as in a detailed (detailed progress) mode. A good overview on how to use the logging levels is depicted in List. 8.5. The logging levels are hierarchical, so that the progress level also writes the information of the info logging level. All logging levels except the detailed logging level can be specified via the input file. Thus, no recompilation is required if the logging level is changed. The detailed progress mode is the only logging level to be applied to performance critical sections, as these logging statements can be removed by a compilation switch.
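The compilation switch for the detailed level can be pictured as follows; the macro names are illustrative and the forwarding to LOG_PROGRESS is merely an assumption for the sketch.

#ifdef WALBERLA_LOG_DETAIL
#  define LOG_PROGRESS_DETAIL( msg )  LOG_PROGRESS( msg )   // active in detailed builds
#else
#  define LOG_PROGRESS_DETAIL( msg )  ((void)0)             // compiled away, zero overhead
#endif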

Logging Output   The output of the logging messages is written to logging files and/or to std::cerr depending on the logging level. For parallel simulations one file per process is created, and in threaded environments a boost::mutex ensures the correct output of the logging messages. For massively parallel simulations writing data to thousands of small files is not optimal. A better solution in terms of performance would be to aggregate the data for subsets of processes in several master processes, each of which then writes out the aggregated data to one single file. However, for production runs rarely all processes need to write data. Only errors, warnings, and simulation results need to be written by a small subset of processes. Hence, it is possible to create files only for processes that have to log a message and to avoid the creation of thousands of unnecessary files. High level information that is equal on all processes and simulation results are only written on the root process, i.e. the process with rank zero. This is achieved through the introduction of logging sections realized by smart scopes.

Logging Sections   The impact of the logging tool on performance should be as minimal as possible, especially for logging levels that are not activated. Logging levels that are selectable at run-time, i.e. via the input file, always have a slight impact on performance, as they cannot be removed by compilation switches. However, the creation of the logging strings also has an impact on performance. With the help of WaLBerla logging sections this overhead can be avoided if no log message has to be written.

ROOT_Section(){
   LOG_PROGRESS_SEC(){
      std::stringstream ss;
      ss << "Flag UIDs: " /* << ... */;
      /*...*/
   }
}
Listing 8.7: Usage of logging sections.

Here, the smart scope LOG_PROGRESS_SEC is only executed if the progress logging level is activated, so that the information which boundary conditions are mapped to which UIDs is only constructed if needed. Additionally, the log message is only written by the root process, as the information of the log message is the same on all processes. This is achieved by the smart scope ROOT_Section.

Figure 8.4: UML class diagram for the util::Logging class. [Diagram omitted: util::Logging derives privately from util::Singleton; private members mutex_ : boost::mutex and logStream_ : std::ofstream; private helpers writeLog and writeCerr; public member functions logError, logWarning, logInfo, logProgress, logProgressDetail, configure, and the inherited instance().]

The Logging Class   The public interface of the util::Logging class depicted in Fig. 8.4 offers a logging function for each logging level; these use the private member functions writeLog and/or writeCerr to write their messages. The class is privately derived from the util::Singleton class, which implements the singleton software design pattern: “Ensure a class only has one instance, and provide a global point of access to it” [66, Page 127]. The singleton pattern has been chosen to be certain that there is only one logging instance, which is globally accessible through the function instance(). The implementation of the Singleton class provides lifetime control and thread safety, and has been reused from the PE software framework [77].
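For readers unfamiliar with the pattern, a minimal Meyers-style singleton is sketched below; the actual util::Singleton reused from the PE framework is more elaborate, providing lifetime control and thread safety.

class LoggingSketch {
public:
   static LoggingSketch& instance() {
      static LoggingSketch theInstance;   // created on first use, one per program
      return theInstance;
   }
private:
   LoggingSketch() {}                                 // construction only via instance()
   LoggingSketch( const LoggingSketch& );             // not implemented
   LoggingSketch& operator=( const LoggingSketch& );  // not implemented
};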

8.4 Performance and Time Measurements

For an HPC software framework, performance and run-times are important measures. WaLBerla provides the util::WallTimeLogger class for quick and easy wall clock time measurements. The aim is to provide a tool to measure the time spent in a certain function without editing this function or the calling client, i.e. time

measurements shall be executed implicitly. This is achieved by wrapping the actual function call within a callable object, which has the same signature as the original function. Internally, the callable object calls the util::WallTimeLogger before and after the function call. The details of the util::WallTimeLogger and its application are discussed in the following.

Figure 8.5: UML class diagram for the util::WallTimeLogger. [Diagram omitted: private members name_ : string, overallTime_ : Time, startTime_ : Time, and mutex_ : boost::mutex; public member functions WallTimeLogger(name : string), ~WallTimeLogger(), before(), and after().]

The util::WallTimeLogger Class   The interface of the util::WallTimeLogger is depicted in Fig. 8.5. For the identification of the output the class stores the string name, and for thread safety a boost::mutex. With the functions before and after the time measurement can be started and stopped. The overall wall clock time is written at the end of the lifetime of a util::WallTimeLogger object, i.e. in the destructor, to std::cerr and to the logging system. Further, the util::WallTimeLogger class provides the minimum, average, and maximum time a thread or process has needed, for parallel as well as hybrid simulations. List. 8.8 shows the output of a time measurement, which is only written by one process, i.e. the root process.

[RESULT  ]-----(40.394 sec) ------------------------------------------
Walltime of: PDF Sweep Kernels CPU
Min: 32.0626 sec,  Max: 32.9483 sec,  Avg: 32.3835 sec
------------------------------------------
Listing 8.8: Output of the util::WallTimeLogger class.

Using the util::WallTimeLogger Class   The class can be applied in a standard way as depicted in List. 8.9, but this involves changing the measured function or the calling client. Another option is the application of the command software design pattern [66]. The command pattern is a behavioral design pattern with the intention to “Encapsulate a request as an object, thereby letting you parametrize clients with different requests, queue or log requests, and support undoable operations.” [66, Page 233]. In particular, the function whose wall clock time should be measured is encapsulated into a callable object. If this object is executed, first the util::WallTimeLogger is called to start the time measurement, next the actual function is called performing the actual task, and last the util::WallTimeLogger is called again to stop the time measurement. In List. 8.10 it is shown how the Sweep definition for the LBM has to be modified in order to measure the time spent in the LBM kernel. The LBM kernels are handed over to the function wrapFunction, which creates a callable object of type boost::function. In detail, List. 8.11 and List. 8.12 show how this is achieved with the help of the class util::CallableObject. util::CallableObject consists of a template base class and several specializations for different function signatures. Only one specialization, for the signature void (A1, A2), is depicted, where A1 and A2 are template data types. With the help of the member function getFunc the callable object is constructed, which is a boost::function object to which the member function doStuff is bound. Thus, each time the created boost::function object is called, doStuff is invoked, which hands over the given parameters to the actual function call, and the before() and after() functions are executed on the Interactor object, e.g. of the util::WallTimeLogger class. With this approach it is easy and convenient to measure the run-times of single functions or kernels for a fine-grained performance analysis of simulation tasks. The advantage of this approach to time measurements is that no external tools are needed and that, through the application of the command pattern, no changes to the actual implementation of the kernels or the calling client, e.g. the time-loop, are required.

// Create the Wall Clock Time Logger //
shared_ptr< util::WallTimeLogger > wallPDFSweepCPU(
   new util::WallTimeLogger("PDF Sweep Kernels CPU ") );

// Start Measuring Time //
wallPDFSweepCPU->before();

// Call Function //

// End Measuring Time //
wallPDFSweepCPU->after();
Listing 8.9: Using the util::WallTimeLogger class.

// Create the Wall Clock Time Logger //
shared_ptr< util::WallTimeLogger > wallPDFSweepCPU(
   new util::WallTimeLogger("PDF Sweep Kernels CPU ") );

// ... //

USE_Sweep(){
   swUseFunction( "CPU: LBMSweep",
      util::wrapFunction(lbm::sweep::basicLBMSweep, wallPDFSweepCPU),
      fs, hsCPU, bs );
   swUseFunction( "GPU: LBMSweep",
      util::wrapFunction(gpu_lbm::sweep::basicLBMSweep, wallPDFSweepGPU),
      fs, hsGPU, bs );
}
Listing 8.10: Using the util::WallTimeLogger class by applying the command software design pattern.

// Create a boost::function object that first calls the interactor, //
// then the actual function, and then the interactor again          //
template< typename FuncSig, typename Interactor >
boost::function< FuncSig > wrapFunction( FuncSig& f, shared_ptr< Interactor > i ) {

   util::CallableObject< FuncSig, Interactor > call( f, i );
   return call.getFunc();

}
Listing 8.11: Creation of a callable object for time measurements.

// This Class Provides the Possibility to do   //
// Something Before and After a Function Call  //
template< typename FuncSig, typename Interactor >
class CallableObject{
};

// One of the Specializations of the CallableObject Class //
template< typename A1, typename A2, typename Interactor >
class CallableObject< void (A1,A2), Interactor >{

public:

   CallableObject( const boost::function< void (A1,A2) >& f,
                   boost::shared_ptr< Interactor > i ) : func_(f), inter_(i) {}

   // Returns a boost::function that is Bound to the doStuff Member Function //
   // The Function Returns void and Awaits two Arguments                     //
   boost::function< void (A1,A2) > getFunc(){
      return boost::bind( &CallableObject::doStuff, *this, _1, _2 );
   }

private:

   // Allows to do Something Before and After a Function Call //
   void doStuff( A1 a1, A2 a2 ){
      inter_->before();
      func_(a1,a2);
      inter_->after();
   }

   boost::function< void (A1,A2) >    func_;
   boost::shared_ptr< Interactor >    inter_;
};
Listing 8.12: Time output of the util::WallTimeLogger class.

9 Lattice Boltzmann Software Frameworks

In the following, several software frameworks centered around the LBM are introduced. The intention is not to provide a thorough comparison, but an overview of LB software frameworks developed in the community and their intended purpose and simulation tasks. All reported features have been taken from the referenced web pages, the referenced recent publications, as well as the user guides of the frameworks. Please note that the reported features of the software frameworks are likely to change with newer revisions and have been reported to the best knowledge of the author of this thesis. Amongst others, the software frameworks Palabos, OpenLB, LB3D, ILBDC, Muphy, Peano, Virtual Fluids, Ludwig [101], Sailfish [102], and HEMELB [103] are being developed. For some of them, detailed information is given in Tab. 9.1 and a brief introduction is presented in the following.

Palabos   Palabos [105] is an open source software framework centered around the LBM with the focus on providing the necessary flexibility to support several simulation tasks as well as the efficiency for HPC simulations. The intention is to provide a software framework, so that scientists can focus on their simulation tasks instead of software development. The target audience are research groups and industry. The framework is easily portable and has been tested in massively parallel simulations on the Jugene supercomputer. Support for Palabos can be obtained from the FlowKit Ltd. company, which also ensures the quality of the code. The framework is developed jointly by the University of Geneva, the Parallel Computing Group SPC, and FlowKit Ltd. Amongst others, Palabos has been applied to the simulation tasks: multi-phase flows in porous media, blood flows in cerebral aneurysms, non-Newtonian fluids, large eddy simulations, free-surface flow, and multi-component and multi-phase flows.

OpenLB   OpenLB [106] is an open source software framework with a focus on genericity and extendability to support various LB simulation tasks. The software framework is a community code coordinated by the Karlsruhe Institute of Technology to support researchers and is also applicable to teaching purposes. The software framework has been applied to parallel block-structured simulations using MPI and OpenMP. It supports several LB models, stencils, and boundary conditions. Amongst others, OpenLB has been applied to the simulation tasks: thermal fluids, mixture modeling, and the airflow in the human nose and lungs.

Framework                 WaLBerla [104]          Palabos [105]               OpenLB [106]       LB3D [107]
Open Source               No                      Yes                         Yes                Yes
Programming Language      C++, C, CUDA, OpenCL    C++, Python, Java           C++                Fortran 90
Parallelization           MPI + Hybrid +          MPI                         MPI + Hybrid       MPI
Paradigms                 Heterogeneous
Domain Decomposition      Block-Structured        Block-Structured            Block-Structured   Block-Structured
Grid Refinement           Under Development       Yes                         No                 No
GPU Support               Yes                     No                          No                 No
Output File Formats       VTK                     VTK, Streamlines,           VTK                VTK, HDF
                                                  Iso-Surfaces
Interactive Steering      No                      No                          Yes                No
Check Pointing            Under Development       Yes                         Yes                Yes
Publications              [69][73][108]           [109][110]                  [111]              [107]

Framework                 ILBDC                   Muphy                       Peano [78]         Virtual Fluids
Open Source               No                      No                          Yes                No
Programming Language      Fortran 90, C           Fortran 90, CUDA            C++                C, C++, CUDA
Parallelization           MPI, Hybrid             MPI                         MPI, Hybrid        MPI, Hybrid
Paradigms
Domain Decomposition      List-Based              Flooding-Based              List-Based         Block-Structured
Grid Refinement           No                      No                          Yes                Yes
GPU Support               No                      Yes                         No                 Yes
Output File Formats       VTK, Tecplot            VTK, Tecplot                VTK, HDF, Tecplot
Interactive Steering      No                      No                          Yes                Yes
Check Pointing            Yes                     Yes                         Yes
Publications              [79][112]               [113][114][115]             [116][117][118]    [119][120][121]

Table 9.1: Lattice Boltzmann software frameworks.

LB3D   LB3D [107] is a software framework with the focus on multi-component fluid simulations. It is portable and has been applied on various supercomputers, e.g. on Jugene using up to around 300k cores. The software framework is developed at University College London, the University of Stuttgart, and Eindhoven University of Technology. Amongst others, LB3D has been applied to the simulation tasks: micro mixing, flows through porous media, colloidal emulsions, and self-assembled gyroid mesophases.

Muphy   The software framework Muphy has been designed for multi-scale and multi-physics biomedical simulations. Originally, it has been developed for IBM Blue Gene chips, but it is also applicable to parallel simulations on GPU clusters. The domain decomposition is list-based and sub-domains are determined by a flooding algorithm. Indirect addressing is also applied for the GPU implementation. Amongst others, Muphy has been applied to the simulation tasks: bio-polymer translocation across nano-pores and hemodynamics with an anatomic resolution of the blood vessels.

Peano   The open source Peano software framework [78] developed at the Technische Universität München supports numerical simulations of PDEs (partial differential equations). It is optimized for dynamic adaptivity based on k-spacetrees and supports shared and distributed memory parallelization. Further, the aim of Peano is to provide a general-purpose environment for simulation tasks that is maintainable and extendable. Amongst others, Peano has been applied to the coupling of LBM and MD (molecular dynamics) simulations, direct numerical simulation of the Navier-Stokes equations for flows through porous media, fluid structure interaction, and heat transfer.

ILBDC   The flow solver developed within the ILBDC (International Lattice Boltzmann Development Consortium) has been designed to be memory efficient and highly optimized for simulations of LB fluid flows within complex geometries. It is also used as a benchmark code to evaluate the performance of supercomputers. The data structures of the fluid solver are list-based and the data access for the LBM is based on indirect addressing. The fluid solver is vectorized via SSE and supports massively parallel, statically load balanced MPI and hybrid OpenMP simulations. Amongst others, the solver has been applied to simulations of porous media and non-Newtonian blood flows, desalination, and aeroacoustics in porous media.

Virtual Fluids   The software framework Virtual Fluids has been developed with a focus on accurate and adaptive LB simulations. Its domain decomposition for parallel simulations, including dynamic load balancing, is based on block-structured grids. Currently, adaptive simulations on heterogeneous GPU clusters are being developed for Virtual Fluids. Amongst others, Virtual Fluids has been applied to the simulation tasks: interactive steering on GPUs for the optimization of the simulation setup in pre-processing, multi-phase and free-surface flows, and fluid structure interaction of turbulent flows around bridges or marine propellers.

10 Discussion and Conclusion

10.1 Discussion on the Development and Significance of Maintainable and Usable Software Frameworks in Academia

The development of well-designed software frameworks is a challenging and time consuming task. In the following pro and contra discussion, the investment in and the benefits of the development, as well as the significance of maintainable and usable software frameworks for the investigation of complex simulation tasks in academia, are addressed.

10.1.1 Contra

• Prototyping and Implementation are Slow: Writing reliable, maintainable, extendable, and multi-purpose code is more time consuming than implementing single-purpose designs. Additionally, abstracting functionality requires decent programming experience and a very good and complete overview over and understanding of the field of application. Prototyping new models or optimizations seems to take long, as software quality, coding guidelines, and requirements of other simulation tasks have to be respected.

• Training Period: It takes time to get familiar with a large software framework. One might not understand all parts and design decisions of the software. Additionally, it takes time to teach new project members. The simultaneous training of several new project members can result in temporary strong reductions of productivity.

• Project Management Overhead: Coordination and management of software projects take a considerable amount of time, as they involve on the one hand organizational duties such as the organization of meetings or the monitoring of milestones and software quality, but also the leading of and working in teams. This requires communication among team members, and individual project members have different aims, opinions, and ideas, potentially leading to conflicts that have to be resolved.

• Documentation, Examples, and Tutorials: Creating extended documentation, examples, and tutorials for other users takes time.

• Software Development is Tedious and Often Unaccounted Workload: Especially for the first generation of PhD students, the development of a software framework results in a reduced amount of scientific output. Often the implementation of tools or components for software frameworks is unaccounted and rejected for publication. Maintainable and well-designed implementations commonly do not count as scientific achievements. The real goal is the publication of, e.g., improved physical models, performance optimizations, or simulation results. Additionally, developing and implementing a large-scale software framework is teamwork, while a PhD thesis is an individual task.

10.1.2 Pro

• Only Feasible Option: For the simulation of complex tasks involving multi-physics, multi-scale, and massive parallelization there is no other option. The complexity involved in these simulations cannot be handled by non-documented, non-structured, and non-maintainable implementations and will result in certain failure. In addition, within one single PhD thesis not all the involved functionality can be implemented such that the implementation is useful and understandable for others. A complete reimplementation for each new simulation task, new PhD thesis, or new level of complexity is not desirable, as the time for the reimplementation of complex functionality, e.g. parallelization, dynamic load balancing, or complex numerical algorithms, is significant. Hence, the complexity and the features of software frameworks have to be increased brick by brick by several developers. This is only possible for well-designed, usable, maintainable, and extendable software.

• Effort Will Pay Off: Well-structured and designed implementations are more usable and reliable, and with that are less error prone, and errors can be detected quicker. Especially for massively parallel simulations this is of utmost importance, because many standard tools for debugging do not work for large process counts, and debugging is very time consuming and costly. Additionally, the development time for the integration of new simulation tasks is reduced by well-designed software frameworks, as many features and tools can be reused.

• Working in a Team is Advantageous: New PhD students can learn from the experienced developers. Coding conventions, clean coding, good programming style, proper documentation, and programming knowledge can be transferred or be gained by the actual implementation. In addition, bugs or problems can be discussed with several experts that all know at least the basics of the implementation and know the technical terms.

• Software Frameworks as Platform for Cooperation: For scientific computing, cooperation is a key to success. Each research group has its own focus and its field of expertise. Successfully combining this expertise is advantageous for research. A well-designed software framework can provide a platform for collaboration, where each research group can focus on its special topic, e.g. model development, solving of simulation tasks, software development, hardware-aware optimization, or algorithmic enhancements. In summary, a more complex simulation task can be solved with higher accuracy in shorter time.

• Preservation of Knowledge: As the development of a large-scale software framework involves several developers, it is unlikely that the complete knowledge leaves the research group at the same time. Additionally, knowledge is preserved within the documentation and the implementation.

10.1.3 Final Remarks

Especially in academia there are certain obstacles for the development of well-designed software frameworks, but there are also great chances. If no time is to be spent on software development, but large-scale simulation tasks shall be investigated, it is highly recommended to use existing software frameworks instead of starting a new project without a proper design phase. The pro and contra arguments clearly demonstrate that for the investigation of large-scale multi-physics simulation tasks the key to success is the use of well-designed software frameworks. For this reason, the academic environment has to support the development of such software frameworks. In the case of WaLBerla, this necessary support has been provided. Additionally, the willingness of the project members to work as a team and support each other has been and is excellent.

10.2 Conclusion

The design of a multi-purpose software framework for complex fluid simulations has to fulfill various requirements. It has to be reliable, usable, maintainable, efficient and scalable, and portable, and it has to support the needs of its clients. The implementation of WaLBerla was preceded by a thorough design phase. Here, care has been taken to optimize the physical dependencies between the single packages of WaLBerla in order to provide good reliability, maintainability, and extendability. Furthermore, the logical design of WaLBerla is clearly structured by the application of the Sweep and Block concepts. With the help of these two concepts, complex aspects can be broken down into well-defined elements. The functionality of these elements, i.e. Sweeps or Blocks, can be further individualized by the functionality management that WaLBerla provides. Furthermore, WaLBerla features several tools and interchangeable modules. Amongst others, WaLBerla offers classes for the storage of simulation data that are suitable for simulations on CPUs as well as GPUs and support various optimizations, e.g. data layout adaption and padding. For the logging of the simulation progress, simulation results, errors, and warnings, WaLBerla features a flexible logging system. The advantage for users of WaLBerla is that they can focus on their special tasks. They do not have to start from scratch, but can reuse, extend, and exchange existing functionality.

Part III

The WaLBerla Parallelization Design Concept - Efficiency and Flexibility for Hybrid and Heterogeneous Simulations on Current Supercomputers


For the simulation of benchmark cases such as the flow around a cylinder it might still be sufficient to perform serial simulations. However, for complex engineering tasks parallel simulations are the only feasible option, especially in the multi-core era. As the investment costs for new Tier-0 supercomputers are usually in the order of tens of millions of euros, an efficient exploitation of the hardware should be self-evident. In order to achieve optimal serial performance as well as good scalability, a deep understanding of the involved workflows, the bottlenecks, and the hardware is required. Additionally, parallelizing algorithms is a non-trivial and time-consuming task. Hence, parallel implementations should be designed to be reusable and maintainable. The main contributions of this Part are the introduction of software design concepts for parallel simulations, a detailed analysis of the performance of parallel multi-GPU and heterogeneous CPU-GPU simulations, and the validation of the measured performance results by means of a systematic approach to performance investigations.

Software Design Concepts for Parallel Simulations   The presented software concepts enable efficient and scalable massively parallel simulations, and cut down the time spent on the parallelization of additional simulation tasks due to the offered flexibility and maintainability. For the accomplishment of these features, it has to be possible to easily exchange existing parts of the parallelization in order to adapt it to new communication patterns, new simulation data structures, or hardware architectures. The best solution would be if only the sections requiring new functionality had to be reedited. However, as Wellein et al. [122] have stated, “performance times flexibility is a constant”. Thus, a thorough assessment of the necessary flexibility and the desired performance has to be made in order to satisfy both demands. Conceptually, this balance between performance and flexibility is accomplished in WaLBerla by implementing the non-performance-critical parts of the algorithms in a way that is as maintainable and hardware-independent as possible, and by offering the possibility to exchange performance-critical sections of the implementation by hardware-aware, highly optimized kernels.

Performance Analysis   For the analysis of the efficiency of WaLBerla, performance models are applied to obtain an implementation-independent upper bound for the sustainable performance of LBM simulations, and to gain a deeper insight into the underlying workflow. In detail, the presented approach is divided into four steps: consideration of the hardware capabilities for an upper performance bound, estimations taking the design decisions of the implementation into account to gain a deeper insight and to predict the performance for certain simulation scenarios, followed by the analysis of the serial and parallel performance.

State of the Art   In the following, the work of others that is of relevance for this Part is summarized. Subsequent sections will highlight further details to underline similarities or differences if applicable. Fundamentals on performance optimization can be found in [35] and [123]. [124] also provides an introduction to scientific computing and has established the bandwidth obtained by the so-called “Schönauer triad” for the evaluation of the performance of supercomputers. Wellein et al. [125] [126] [45] have thoroughly evaluated the performance of multi-core compute nodes, and Pohl et al. [127] have optimized the serial LBM cache performance. A single-GPU implementation of the LBM using the D3Q13 model has first been proposed by Tölke et al. [48]. Obrecht et al. [49] and Habich et al. [50] published results for the D3Q19 model. A multi-GPU implementation of the LBM has been published by Xiong et al. [128]. Bernaschi et al. [129] were the first to propose indirect addressing for multi-GPU LBM simulations. Wang et al. [130] published a multi-GPU implementation including overlapping of communication and work. Zaspel et al. [131] reported a multi-GPU approach for free-surface flows based on the Navier-Stokes equations. Shimokawabe et al. [132] report phase field simulations for dendritic solidification achieving 1 PFLOPS on the TSUBAME 2.0 supercomputer. Amongst others, parallel software designs for LBM simulations have been introduced by Zeiser et al. [79] for the ILBDC software framework, featuring a static load balanced MPI parallelization and list-based data structures optimized for porous media simulations; Freudiger et al. [80] [119] proposed the parallel software framework Virtual Fluids, capable of adaptive block-structured simulations; the massively parallel software framework MUPHY [113] [114] has been applied to, e.g., massively parallel cardiovascular flows for a full heart-circulation system at near red-blood cell resolution; and the software framework Palabos [105] [110], offering a flexible, highly parallelized and publicly available environment for LBM simulations, has been applied to, e.g., blood flows in cerebral aneurysms. The hybrid parallelization paradigm has been analyzed in detail in [133] and [134]. Gabriel et al. [135] describe a framework for the comparative performance analysis of applications using MPI, and how to determine the cause of performance variations in MPI simulations. Kerbyson et al. [136] have developed a performance model for a direct numerical simulation code for turbulence modeling.

11 Software Design for Flexible and Efficient Parallelizations

To support parallel simulations of several complex engineering tasks a flexible and maintainable parallel implementation is required. Indeed, this is of utmost importance for the acceptance and the usage of an HPC software framework, as hereby the time for development cycles of new simulation tasks and scenarios can be cut down. Certainly, it is clear that for the initial conceptual design a higher effort is required than for, e.g., a single-purpose hard-coded one-dimensional parallel implementation. Hence, for WaLBerla a significant amount of time has been spent on the development of the software design concepts providing the basis for the parallel implementation presented in this Chapter. Special care has also been taken to achieve a good balance between performance and flexibility. In particular, the design concepts abstract from the simulation data, the communication patterns, and the communicated data to achieve a physically independent implementation providing the required flexibility. With the aid of these software designs, parts of the implementation can easily be exchanged, offering the possibility to implement and use highly optimized hardware-aware kernels wherever and whenever needed. This also allows to support multi-GPU simulations and the overlapping of communication and work.

Heterogeneous Simulations on CPUs and GPUs The support of heterogeneous simulations on CPU-GPU clusters is a key aspect of this thesis. Heterogeneous simulations are simulations using different architectures, either in parallel or consecutively. A consecutive heterogeneous simulation executes the algorithm partly on the CPU and partly on the GPU, but not in parallel. Commonly, the most compute-intensive part is simulated on the GPU to gain a speedup. Parallel heterogeneous simulations are simulations where part of the simulation domain is simulated on the GPU and part on the CPU. Hence, both architectures work in parallel and have to exchange data. For this thesis both approaches have been applied: for LBM simulations the parallel heterogeneous approach has been used, while the consecutive heterogeneous approach has been applied to the simulation of particulate flows on GPUs in Chap. 16.

One benefit of the parallel heterogeneous approach is that the compute resources of the CPUs can be utilized, which would otherwise be idle. In addition, larger domain sizes can be simulated due to the additional memory of the CPU. Especially for the LBM and its high memory requirements for the PDFs this is of interest. However, for this thesis the gain of insight into the methodology of combining two different architectures such as CPU and GPU has been equally important, as future supercomputers are also likely to be heterogeneous (see Sec. 3 for details). In particular, the requirements for efficient heterogeneous simulations that are also applicable to complex algorithms have been of interest. Hence, a historical overview of all prototype designs is also given to highlight which requirements have to be met in order to achieve a benefit from heterogeneous simulations. For the final implementation of heterogeneous simulations in WaLBerla a hybrid parallelization approach has been developed. It is based on the tasking paradigm and supports non-uniform Blocks for static load balancing without increasing the complexity of the MPI parallelization.

Parallelization Paradigms In total, WaLBerla supports four different parallelization designs:

Paradigm                 Description                                               API                  Referenced as
Message Passing          Message passing for shared and distributed memory         MPI                  Pure-MPI parallelization approach
Hybrid                   Shared memory parallelization based on parallelizing      OpenMP + MPI         Hybrid OpenMP parallelization approach
                         for-loops, message passing for distributed memory
Hybrid - Tasking         Shared memory parallelization based on tasks,             Boost.Thread + MPI   Hybrid parallelization approach
                         message passing for distributed memory
Hybrid - Heterogeneous   Shared memory parallelization based on tasks exploiting   Boost.Thread + MPI   Heterogeneous parallelization approach
                         heterogeneous architectures, message passing for
                         distributed memory

All implementations support massively parallel simulations on CPUs and GPUs, overlapping of work and communication, and a three-dimensional domain decomposition for parallel simulations.

Structure of the Chapter The pure-MPI parallelization software design concept of WaLBerla is introduced in Sec. 11.2. In Sec. 11.3 and Sec. 11.4 this design is extended to multi-GPU simulations and the overlapping of communication and work. For the distributed memory parallelization these designs are also used by the hybrid parallelization approach described in Sec. 11.5 and the heterogeneous parallelization approach proposed in Sec. 11.6. The pure-MPI parallelization of the prototype WaLBerla::V1 has already been scaled up to around 300k cores [69] and optimized for free-surface flows [72].

11.1 Introduction to Parallel Computing

In order to simulate complex computational engineering tasks, several major challenges have to be overcome. For example, the simulation of real-world scenarios is usually very compute intensive. The first important step to fulfill these needs are highly optimized serial implementations. However, one compute core is most commonly not sufficient to provide acceptable or feasible simulation run-times, or the main memory of one compute node is not sufficient to meet the memory demands. The solution is parallel computing, where several compute components (CPU cores or GPUs) work in parallel. Typically, parallelization is applied for two reasons: either to speed up the computation (strong scaling) or to be able to run simulations with a higher number of unknowns (weak scaling). In strong scaling scenarios, the amount of work stays the same, but the number of compute nodes is increased, e.g. until a satisfying time to solution is reached or the simulation task stops scaling. Weak scaling corresponds to increasing the amount of work as well as the number of compute devices such that the work per device stays constant. There are two basic parallelization categories: shared memory parallelization and distributed memory parallelization. These two can also be combined, resulting in hybrid parallelization designs.

Figure 11.1: Data exchange for distributed memory parallelization for stencil-based algorithms, such as the LBM. The outer domain layer of a process is copied to the ghost layer of the next neighbor processes.

Shared Memory Parallelization As the name implies, shared memory parallelization makes use of multiple CPUs sharing a physical address space. The workload is processed by parallel working threads, which can access the complete memory. Threads can either process a subset of the work or can be assigned to individual tasks. As the threads can access all available memory there is no need for explicit communication, but the threads may have to be synchronized if required by the data dependencies. An example of a shared memory scenario is a simulation on one compute node using OpenMP [137]. Here, several threads work in parallel, e.g. each working on a certain subset of a for-loop or processing tasks from a task queue. Usually, a maximum of one thread per compute core is used, executing its share of the overall work or the tasks which have been assigned to it. Care has to be taken for NUMA (non-uniform memory access) designs, as here the time required to access the main memory depends on the relative position between processor and memory. In detail, NUMA or ccNUMA (cache coherent NUMA) designs group sets of compute cores together. Each of these subsets is connected to its own local memory, which it accesses most efficiently. Multiple sets are connected via a less efficient interconnect, e.g. the Intel QPI (QuickPath Interconnect). For details see [35]. Thus, to achieve the maximum sustained performance each thread has to be "pinned" to a subset of the compute cores and has to work on memory closest to this subset.
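To make the NUMA aspect concrete, the following minimal sketch (illustrative only, not WaLBerla code) shows an OpenMP loop parallelization in which the arrays are initialized with the same static schedule as the work loop, so that the first-touch policy maps each memory page into the NUMA domain of the thread that later processes it. The pinning of the threads to cores is assumed to be arranged externally, e.g. via the OMP_PLACES and OMP_PROC_BIND environment variables or a tool such as likwid-pin.

#include <cstdio>
#include <cstdlib>
#include <omp.h>

int main() {
    const std::size_t n = 1u << 24;   // large enough not to fit into any cache
    double* a = static_cast<double*>(std::malloc(n * sizeof(double)));
    double* b = static_cast<double*>(std::malloc(n * sizeof(double)));

    // First touch: pages are mapped into the NUMA domain of the thread that
    // writes them first, hence the same schedule as in the work loop is used.
#pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; }

    // Work loop: each thread processes its share of the iteration space and
    // therefore accesses memory that is local to its own socket.
#pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) a[i] = b[i];

    std::printf("copied %zu doubles using %d threads\n", n, omp_get_max_threads());
    std::free(a);
    std::free(b);
    return 0;
}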

Distributed Memory Parallelization For distributed memory parallelization, one or several compute components each work on their individual address space and are connected via a communication network. The compute clusters LIMA and Tsubame 2.0 introduced in Sec. 3.1 are good representatives of this setup. Distributed memory parallelization works on shared as well as distributed memory systems. The most common standard used for distributed memory parallelization is MPI (Message Passing Interface) [138]. It is based on the message passing model and supports several communication patterns.

Figure 11.2: Activity diagram for the workflow of the MPI parallelization software design in WaLBerla: data is extracted from the simulation data, sent and received via MPI over the InfiniBand network, and inserted into the simulation data of the receiving process.

For local or nearest neighbor communication a process only exchanges messages with a subset of all processes, its nearest neighbors. These messages usually contain data from the outer layer of the process-local simulation domain, which is required by the receiving process to fulfill data dependencies. The exchange of data between processes is depicted in Fig. 11.1. Global communication involves the transfer of data between all processes. An example is the summation of a variable across all processes via the global reduction function MPI_Reduce. For simulations using MPI, usually one process per compute core is spawned.
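Both patterns can be illustrated with a small self-contained MPI program (a sketch for illustration, not taken from WaLBerla): each process exchanges one ghost layer with its two neighbors in a one-dimensional, non-periodic process chain using non-blocking point-to-point calls, and afterwards a global quantity is summed up on the root process with MPI_Reduce.

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int layerSize = 1000;   // cells in one ghost layer (placeholder value)
    std::vector<double> sendLower(layerSize, rank), sendUpper(layerSize, rank);
    std::vector<double> recvLower(layerSize), recvUpper(layerSize);

    const int lower = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    const int upper = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    // Nearest-neighbor communication: exchange the outer domain layer with the
    // two neighbors of the process chain (non-periodic boundaries).
    MPI_Request req[4];
    MPI_Irecv(recvLower.data(), layerSize, MPI_DOUBLE, lower, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recvUpper.data(), layerSize, MPI_DOUBLE, upper, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(sendLower.data(), layerSize, MPI_DOUBLE, lower, 1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(sendUpper.data(), layerSize, MPI_DOUBLE, upper, 0, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    // Global communication: sum a local quantity over all processes on rank 0.
    double localMass = 1.0, globalMass = 0.0;
    MPI_Reduce(&localMass, &globalMass, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("global mass: %f\n", globalMass);

    MPI_Finalize();
    return 0;
}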

11.2 A Flexible MPI Parallelization Concept Suitable for Various Simulation Tasks

One of the goals of WaLBerla is the development of a parallelization that can easily be adapted to new simulation tasks, but still is efficient and scales well. In summary, the following design goals have been specified for the parallelization:

• Support for various simulation tasks
• Minimization of the parallelization effort for standard communication patterns
• Efficient resource utilization and good scalability for massively parallel simulations

In general, an MPI parallelization has to execute the work steps depicted in Fig. 11.2. Each MPI process has to extract the data to be communicated from the simulation data in some way. Next, this data has to be sent via MPI to the receiving processes and received by these processes.

The received data is then inserted into the simulation data. A geometric overview of the WaLBerla parallelization is depicted in Fig. 11.3.

Software Design Concept To be easily extendable to new simulation tasks, the parallelization concept has to abstract from the simulation data structures during the extraction and insertion phase, and from the sent data types and the communication patterns imposed by the simulation tasks during the MPI send and receive phase. Further, the communication overhead has to be minimized in order to achieve good scalability. Conceptually, WaLBerla implements these requirements by introducing a buffer management system, Communication Type objects, and Communication Data objects.

Buffer Management The interface of the buffer management system offers the possibility to store message data with arbitrary data types for various communication patterns. Internally, it stores buffers, as well as the number of sent and received bytes, for each rank to which data is sent or from which data is received. The buffers are implemented by dynamic arrays of data type char. Hence, the MPI messages are abstracted from the data type, because any data structure can be cast to chars. The buffer management system is implemented in the communication module package (Fig. 5.4). Thus, it can be easily substituted by an alternative communication concept, if the need arises. As the buffer management system is merely a storage container for the communicated data, a possibility to control the workflow is required, which is achieved by the Communication Type objects.
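The idea can be sketched as follows (a strongly simplified illustration of the concept, not the actual WaLBerla interface): one char buffer is kept per destination rank, and arbitrary trivially copyable data is appended to it as raw bytes, so that the buffer layer itself never needs to know the communicated data type.

#include <cstring>
#include <map>
#include <vector>

// Minimal sketch of a per-rank send buffer store: any trivially copyable type
// is appended as raw chars, i.e. the buffer layer is independent of the data type.
class SendBufferStore {
public:
    template <typename T>
    void pack(int rank, const T* data, std::size_t count) {
        std::vector<char>& buf = buffers_[rank];
        const std::size_t oldSize = buf.size();
        buf.resize(oldSize + count * sizeof(T));
        std::memcpy(buf.data() + oldSize, data, count * sizeof(T));
    }

    const std::vector<char>& buffer(int rank) const { return buffers_.at(rank); }
    std::size_t bytes(int rank) const { return buffers_.at(rank).size(); }
    void clear() { for (auto& b : buffers_) b.second.clear(); }

private:
    std::map<int, std::vector<char>> buffers_;   // one send buffer per MPI rank
};

// Usage sketch: pack five PDFs destined for rank 3, then hand the raw buffer to MPI.
// SendBufferStore store;
// double pdfs[5] = {0.1, 0.2, 0.3, 0.2, 0.1};
// store.pack(3, pdfs, 5);
// MPI_Isend(store.buffer(3).data(), store.bytes(3), MPI_CHAR, 3, tag, comm, &req);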

Communication Types Implementation-wise, a Communication Type is a collection of kernels that is configured individually for each simulation task, so that each simulation task can freely specify its own communication workflow and communication patterns. Either predefined kernels can be selected from the communication module or individual kernels can be added by a simulation task. The standard kernels are suitable for nearest-neighbor communication of arbitrary grid-based data, where the nearest neighbors are determined by arbitrary stencils, e.g. a D3Q7, D3Q19, or D3Q27 stencil. A detailed activity diagram for the workflow of the standard Communication Type for CPUs is depicted in Figs. 11.4, 11.5, and 11.6. The workflow of the depicted Communication Type is split into three steps: an extraction, a communication, and an insertion phase. These are equivalent to the work steps of the general MPI communication workflow shown in Fig. 11.2.

Figure 11.3: Geometric overview of the halo exchange in the WaLBerla communication concept. The WaLBerla parallelization is block-structured. It supports several Blocks per process, and bundling of messages to the same process. Messages consist of a header encoding the BlockID and the actual data.

Communication Data Objects Up until now, the abstraction from the simulation data is still missing. Here, WaLBerla uses Communication Data objects to encapsulate the access to the simulation data. These provide an interface to extract and insert data from and to the simulation data. For each kind of simulation data object, e.g. the PDF data field, one Communication Data object is required. In detail, the Communication Type object receives a list of Communication Data objects as input and accesses the simulation data via these objects. Hence, the Communication Type object is completely unaware of the communicated data; it simply organizes the copying of data to and from the communication buffers and the simulation data.

Figure 11.4: UML activity diagram for the standard Communication Type object in WaLBerla. The type is split into three parts: the data extraction, the communication, and the data insertion. Each part is implemented by its own kernel and can be exchanged individually to allow a maximum of flexibility. To extract the data from the simulation data, the extract data kernel first loops over all Blocks, and over all communication directions pointing towards the nearest neighbors. Hence, data is always exchanged between two Blocks, and these Blocks are always geometrically connected. For the actual extraction process, it is distinguished if the receiving Block is allocated on the same process as the sending one. If they are both allocated on the same process, a local communication step is executed. For local communication data is directly exchanged between the simulation data objects. This is in contrast to the communication between two Blocks located on different processes. Here, the data that shall be communicated is first copied to a buffer.

Figure 11.5: UML activity diagram for the standard communication type in WaLBerla. The MPI communication step first traverses all send buffers to which data has been written in the current communication step. For each buffer an MPI send and receive command is issued. There exist several standard kernels for the MPI communication utilizing different MPI command pairs. Examples are the pairs MPI_Isend and MPI_Irecv, or MPI_Isend and MPI_Probe, MPI_Get_count and MPI_Recv.

Figure 11.6: UML activity diagram for the standard communication type in WaLBerla. The data insertion step copies the data from the receive buffers into the simulation data. In detail, the insert data kernel loops over all receive buffers to which data has been copied in this communication step. From the message header the Block to which the data has to be copied is determined and the data is inserted into the simulation data of this Block.

Configuration of Communication Types and Objects The configuration of the Communication Type and Communication Data objects is described in List. 11.1 and List. 11.2 by extending the example configuration introduced in Sec. 6.6. For the configuration, new smart scopes are introduced. The general concept behind smart scopes has been described in Sec. 6.2.

USE_Communication_Type_Object( coCommID ) {
   USE_CommType_Container() {

      USE_Type( "Send / Recvs", fs, hsCPU ) {
         communication::partUseFunction( "Extract Data",
            communication::extractData,               communication::LOCAL );
         communication::partUseFunction( "Communicate Via MPI SendRecv",
            communication::communicateViaMPISendRecv, communication::MPI );
         communication::partUseFunction( "Insert Data",
            communication::insertData,                communication::MPI );
      }

      USE_Type( "Send / Recvs", fs, hsGPU ) {
         communication::partUseFunction( "Copy Buffers",
            communication::copyBuffers,               communication::MPI );
         communication::partUseFunction( "Extract Data",
            communication::extractData,               communication::LOCAL );
         communication::partUseFunction( "Communicate Via MPI SendRecv",
            communication::communicateViaMPISendRecv, communication::MPI );
         communication::partUseFunction( "Insert Data",
            communication::insertData,                communication::MPI );
         communication::partUseFunction( "Set Buffers",
            communication::setBuffers,                communication::MPI );
      }
   }
}

Listing 11.1: Configuration of a Communication Type.

The Communication Type object consists of a container storing the actual Communication Types, which can be selected during a simulation run. In this case two types are defined with the help of the USE_Type smart scope: one for simulations on CPUs and one for GPUs.

The Communication Type for CPUs is composed of three kernels, whose functionality has been explained above. For each kernel a debug string, the actual kernel, and an enum are specified; the enum decides whether the kernel has to be executed only for parallel simulations ("MPI") or also for serial simulations ("LOCAL"). A detailed description of the Communication Type for GPUs is given in Sec. 11.3. To identify the Communication Type object, the COMMUID coCommID is specified in the outermost smart scope.

USE_CommData_Object( coCommID, cdUID ) {
   USE_CommData( "Communicate PDF data" ) {

      USE_SenderState( "From hsCPU", lbm::getSrcUID(), fs, hsCPU, bs ) {
         communication::stateUseFunction( "to hsCPU",
            lbm::getSrcUID(),
            lbm::copyPDFsToMPIBuffer,
            lbm::copyPDFsFromMPIBuffer,
            lbm::localCommunicationPDF,
            hsCPU, bs );
      }

      USE_SenderState( "From hsGPU", gpu_lbm::getBuffersUID(), fs, hsGPU, bs ) {
         communication::stateUseFunction( "to hsGPU",
            gpu_lbm::getSrcBuffersUID(),
            gpu_lbm::copyPDFsToMPIBuffer,
            gpu_lbm::copyPDFsFromMPIBuffer,
            gpu_lbm::localCommunicationPDF,
            hsGPU, bs );
      }
   }
}

Listing 11.2: Configuration of a Communication Data object.

For certain simulation tasks it is possible that the data that has to be communicated is stored in several different kinds of simulation data objects, e.g. the PDF data field and the field for the geometry information. Hence, for a Communication Data object, which is created via the USE_CommData_Object smart scope, it is possible to specify several simulation data objects with the USE_CommData smart scope. In the depicted example only one object is specified, the PDF data field. For each of these simulation data objects several sender states can be defined, which specify the state of the sending Block using the UID concept introduced in Sec. 6.1. Two sender states are defined in List. 11.2: one for simulations that are executed on CPUs and one for GPUs. To each sender state several receiver states can be added. These consist of three kernels, and the state of the receiving Block. The kernels are responsible for extracting and inserting the data

from the simulation data objects, which are specified by the UID in the sender state header. Note that different simulation data objects can be used for each sender state. Here, the sender state for CPUs directly uses the PDF data field, whereas the sender state for GPUs uses a buffer data structure to optimize the PCI-E transfers. Sec. 11.3 describes the GPU implementation in detail.

Parallel Simulation Tasks Chap. 13 shows that scalable massively parallel LBM simulations minimizing the communication overhead are possible with the presented concept. Amongst others, the WaLBerla MPI parallelization concept in the prototype WaLBerla::V1 has been used for free-surface flows [98], particulate flows on up to around 300000 compute cores [69], and self-propelled particle complexes [139]. Further, the prototype WaLBerla::Core has been used for heterogeneous simulations on GPU clusters [140], parallel multigrid simulations on GPUs [141], and electroosmotic flows in "lab on a chip" devices [142], and is currently used to investigate froth foams and melting processes in rapid prototyping applications. The method for the simulation of free surfaces is a well-suited example demonstrating the requirements on the pure-MPI parallelization, as it is the simulation task integrated in WaLBerla with the highest demands. For the free-surface method it has to be possible to communicate data from different simulation data objects, e.g. PDFs, fill levels, or geometry information. These objects consist either of scalar or vector data per lattice cell, and data has to be communicated to satisfy either the data dependencies of the D3Q19 or the D3Q27 model. Additionally, data from different simulation data objects has to be sent at the same time, which for minimal communication times requires bundling of this data in the same message. Furthermore, the algorithm for the coalescence of bubbles that possibly span several processes requires an iterative nearest neighbor communication pattern. More details on the parallelization of the free-surface method can be found in [98].

Figure 11.7: Activity diagram of the workflow for multi-GPU parallelizations: GPU-GPU copies between the data fields and the GPU buffers, PCI Express transfers via cudaMemcpy, and InfiniBand send and receive via MPI.

Figure 11.8: Schematic overview of the multi-GPU parallelization. Depicted is a process having two Blocks. Communication between the process-local Blocks is realized by swapping the corresponding buffers, whereas MPI communication involves PCI-E transfers of the GPU buffers. GPU-GPU copy operations are required to extract and insert data from and to the simulation data fields and the buffers.

11.3 Multi-GPU Parallelization Concept

Besides the development of new algorithms that efficiently exploit the massively parallel GPU architecture, the establishment of scalable multi-GPU implementations is one of the most crucial steps towards real-world simulations of complex engineering problems using GPUs. For this purpose the minimization of the communication overhead is of utmost importance, especially for GPUs. Reasons for this are, for example, that for the LBM GPUs are about three to five times faster than CPUs (Sec. 12.1), that multi-GPU communication involves two steps, the PCI-E and the InfiniBand communication, and that GPUs have a small main memory limiting the number of unknowns compared to CPUs. Hence, an efficient multi-GPU implementation is important in order to sustain a large portion of the serial performance. The workflow of the parallel GPU implementation is depicted in Fig. 11.7, and a geometric overview is shown in Fig. 11.8.

PCI-E Transfers To achieve optimal performance for the PCI-E transfers, GPU-GPU copy operations are introduced that transfer the boundary layer of the involved simulation data objects into buffers located on the GPU. These buffers are transferred to the host buffer management system via one PCI-E transfer for each buffer. Using the D3Q19 model there are six buffer planes and twelve edge buffers. In these buffers only the minimum amount of data is stored, e.g. for the D3Q19 model only a maximum of five PDF functions are stored per cell. Additionally, the buffer management system uses pinned memory, which is allocated by posix_memalign and is registered to CUDA as described in Sec. 4.1.
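The following sketch illustrates this kind of pinned-buffer setup with the CUDA runtime API; buffer size, alignment, and the missing error handling are placeholders, and it is not the WaLBerla implementation itself.

#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    const size_t bytes = 5 * 180 * 180 * sizeof(double);   // e.g. one D3Q19 buffer plane
    void* hostBuffer = nullptr;

    // Page-aligned host allocation instead of cudaHostAlloc ...
    posix_memalign(&hostBuffer, 4096, bytes);

    // ... then register the memory with CUDA so that it is pinned and can be
    // used for fast asynchronous PCI-E transfers.
    cudaHostRegister(hostBuffer, bytes, cudaHostRegisterDefault);

    void* deviceBuffer = nullptr;
    cudaMalloc(&deviceBuffer, bytes);

    // Asynchronous device-to-host transfer of one communication buffer.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(hostBuffer, deviceBuffer, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(deviceBuffer);
    cudaHostUnregister(hostBuffer);
    free(hostBuffer);
    return 0;
}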

Communication Types and Data Objects To map the described multi-GPU parallelization scheme to the Communication Type and Communication Data parallelization concept introduced in Sec. 11.2, the standard Communication Type object for CPUs has to be extended by two kernels managing the GPU-GPU copy operations. The remaining three kernels can be reused from the CPU Communication Type, as the workflow does not differ between CPU and GPU, resulting in the Communication Type depicted in List. 11.1. Further, the Communication Data object has to be extended by an additional sender state, as different kernels are required for the data extraction and insertion due to the differing data structure and programming paradigm. This is depicted in List. 11.2. Note that this sender state is executed for the state (fs, hsGPU, bs) and extracts or inserts the communicated data into the GPU buffers depicted in Fig. 11.8. To extend the parallelization for GPUs, only kernels that required new functionality had to be replaced, and new kernels had to be added. All shared functionality could be reused. This again shows that the WaLBerla parallelization concept offers good maintainability and reusability.

11.4 Overlapping Work and Communication

Depending on the algorithm, the hardware, and the simulated scenario, the communication overhead can significantly reduce the performance of a simulation. A possibility to lessen or prevent this is to execute the communication in parallel with the actual simulation, i.e. to overlap work and communication. In order to do so, the dependencies of the PDFs in the outer layer of a Block have to be adhered to for the LBM. Thus, the LBM kernel is split into two kernels: one updates the outer layer of the Block, and one updates the inner region. In Fig. 11.9 the time line of one LBM step is depicted. First, the boundary conditions are satisfied, and the outer kernel is processed. For the GPU these kernels are executed in the standard stream 0, which cannot asynchronously overlap with other streams. Next, the inner kernel as well as the extraction, insertion, and the MPI communication are executed asynchronously in stream 1 and stream 2. Hence, the time spent for all PCI-E transfers and the MPI communication can be hidden by the execution of the inner kernel, if the time of the inner kernel is longer than the time for the communication. If this is not the case, only part of the communication is hidden.

However, the GPU-GPU copy operations introduced in Sec. 11.3 cannot be overlapped, as current GPUs can only overlap the execution of different kernels in special situations [39]. See Sec. 4.1 for details. With this approach it is possible to overlap the communication and work of GPUs for the pure-MPI, the hybrid, and the heterogeneous parallelization. For the CPU implementation, one thread always has to be dedicated to the execution of the communication.

Figure 11.9: Schematic diagram for overlapping of communication and work on GPUs.
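A condensed sketch of this stream setup is given below; the kernels, launch configurations, domain sizes, and the buffer handling are placeholders, and the MPI exchange is only indicated by a comment.

#include <cuda_runtime.h>
#include <cstddef>

// Placeholder kernels: one updates the outer layer of the Block, one the inner region.
__global__ void outerKernel(double* /*pdfs*/) { /* boundary/outer layer update */ }
__global__ void innerKernel(double* /*pdfs*/) { /* inner region update */ }

void timeStep(double* d_pdfs, double* d_buf, double* h_buf, std::size_t bufBytes,
              cudaStream_t s1, cudaStream_t s2)
{
    // Boundary conditions and outer layer in the default stream 0.
    outerKernel<<<128, 128>>>(d_pdfs);
    cudaDeviceSynchronize();

    // Inner kernel in stream 1 ...
    innerKernel<<<128, 128, 0, s1>>>(d_pdfs);

    // ... overlapped with the PCI-E transfer of the pinned communication buffer
    // and the MPI exchange handled via stream 2.
    cudaMemcpyAsync(h_buf, d_buf, bufBytes, cudaMemcpyDeviceToHost, s2);
    cudaStreamSynchronize(s2);
    /* exchange h_buf with the neighboring processes via MPI here */
    cudaMemcpyAsync(d_buf, h_buf, bufBytes, cudaMemcpyHostToDevice, s2);

    // Both streams have to be finished before the next time step starts.
    cudaDeviceSynchronize();
}

int main() {
    const std::size_t cells    = 64 * 64 * 64;
    const std::size_t bufBytes = 5 * 64 * 64 * sizeof(double);
    double *d_pdfs, *d_buf, *h_buf;
    cudaMalloc((void**)&d_pdfs, 19 * cells * sizeof(double));
    cudaMalloc((void**)&d_buf, bufBytes);
    cudaMallocHost((void**)&h_buf, bufBytes);          // pinned host buffer
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    timeStep(d_pdfs, d_buf, h_buf, bufBytes, s1, s2);
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFreeHost(h_buf); cudaFree(d_buf); cudaFree(d_pdfs);
    return 0;
}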

11.5 A Software Design Concept for a Shared Memory Tasking Model

The general idea behind shared memory parallelizations has already been introduced in Sec. 11.1. In this Section the software design concept for the hybrid shared memory tasking model implemented in WaLBerla is introduced. Basically, the concept extends Blocks by the possibility to divide their region into further sub-regions. The original Block is thereby reduced to a mere container storing an arbitrary number of SubBlocks. These SubBlocks are derived from the Block class, extending it by features required for the hybrid parallelization. Fig. 11.12 shows the UML class diagram of the SubBlock class.

Subdivision of Blocks into SubBlocks The concept has been implemented such that both the Block and the SubBlock data structure can be selected. Depending on the state of a Block it is decided whether a Block is further subdivided into SubBlocks (see Fig. 11.5). The state is implemented by the UID management introduced in Sec. 6.1. To minimize the communication overhead between SubBlocks, the subdivision is always carried out in the z-dimension, as the xy-planes are consecutive in memory. Finally, the simulation data is stored either in the Block or in the SubBlocks. Again, the executed kernels are dependent on the state of the Block or the SubBlock. All initialization functions can be reused for the tasking model, as Block and SubBlock provide the same interface for the memory allocation.

Figure 11.10: Data allocation for the pure-MPI and the tasking model.

Tasking via Thread Pools The hybrid parallelization approach is based on a tasking model to support heterogeneous simulations, which require different tasks to be executed for each architecture. To support the tasking model, each SubBlock is assigned its own thread pool. In particular, a thread pool internally stores a queue for tasks, and a pool of threads that process the tasks. A schematic diagram of a thread pool is depicted in Fig. 11.11(b). Tasks encapsulate a callable object and all the parameters to call this object later on, which again corresponds to the command pattern [66] (see Sec. 8.4 for further details). The implementation of the threads within the thread pools is based on boost.threads [89]. OpenMP could have been used as an alternative, as it also supports tasking. However, for this thesis the existing thread pool implementation of the PE software framework [143] has been reused.
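As an illustration of the underlying mechanism (a deliberately minimal sketch, not the PE/WaLBerla thread pool), the following class stores tasks as type-erased callable objects in a queue and processes them with a fixed set of worker threads; std::thread is used here instead of Boost.Thread for brevity.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal task pool: tasks are callables (command pattern) processed by workers.
class ThreadPool {
public:
    explicit ThreadPool(unsigned numThreads) {
        for (unsigned i = 0; i < numThreads; ++i)
            workers_.emplace_back([this] { workLoop(); });
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lock(mutex_); done_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void enqueue(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(mutex_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }

private:
    void workLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();   // execute the task outside the lock
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool done_ = false;
};

A SubBlock owning such a pool would, for example, enqueue one LBM task per thread, each parameterized with the range of lattice cells it has to update, and wait for all tasks to finish before the next sweep; the per-kernel synchronization is omitted in this sketch for brevity.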

Initialization and Kernel Execution For the tasking model all initialization and sweep kernels have to be executed for each SubBlock instead of the original Block, have to be broken down into tasks, and the thread pools have to be synchronized after each kernel call. Instead of rewriting all kernels, a universal hybrid wrapper function is provided, which is executed for the original Block. This wrapper function traverses the SubBlocks and executes the appropriate kernels for the SubBlocks through the thread pools. In Fig. 11.11(a) the UML activity diagram of the wrapper function is depicted. For the execution of the LBM Sweep as many tasks as threads in the thread pool are scheduled, but e.g. for the extraction of the simulation data for the communication there are more scheduled tasks than threads, i.e. one task for each communication direction and each SubBlock. Hence, breaking down the work of the LBM kernel into tasks is trivial for the GPU, as only one thread is associated with each GPU. For the CPU special parameters are provided indicating which lattice cells have to be updated by each task. The assignment of the lattice cells is done along the y-dimension to achieve good load balancing of the tasks: the z-dimension is already divided among the SubBlocks, and the LBM kernels achieve the best performance for long x-dimensions. Furthermore, the thread pools also execute all initialization tasks, so that the memory is mapped to the correct NUMA domain.

Figure 11.11: Hybrid wrapper function (a) and thread pool (b).
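A compressed sketch of such a wrapper is shown below (hypothetical, heavily simplified stand-ins; the real kernel selection in WaLBerla goes through the functionality management): it walks over the SubBlocks of a Block, splits the lattice cells of each SubBlock along the y-dimension into one task per pool thread, and synchronizes afterwards.

#include <algorithm>
#include <cstdio>
#include <functional>
#include <vector>

// Hypothetical, heavily simplified stand-ins for illustration only.
struct SubBlock {
    int ySize      = 48;   // lattice cells in y-direction
    int numThreads = 6;    // threads in the SubBlock's thread pool
    // Stand-in: executes the task immediately; the real SubBlock would
    // enqueue it into its thread pool.
    void run(const std::function<void()>& task) { task(); }
    void synchronize() { /* wait until all enqueued tasks have finished */ }
};

void lbmKernel(SubBlock&, int yBegin, int yEnd) {
    std::printf("updating rows [%d,%d)\n", yBegin, yEnd);
}

// Hybrid wrapper: executed once per original Block; schedules the LBM kernel
// as one task per pool thread and splits the work along the y-dimension.
void callSweepHybrid(std::vector<SubBlock>& subBlocks) {
    for (SubBlock& sb : subBlocks) {
        const int chunk = (sb.ySize + sb.numThreads - 1) / sb.numThreads;
        for (int t = 0; t < sb.numThreads; ++t) {
            const int yBegin = t * chunk;
            const int yEnd   = std::min(sb.ySize, yBegin + chunk);
            if (yBegin < yEnd)
                sb.run([&sb, yBegin, yEnd] { lbmKernel(sb, yBegin, yEnd); });
        }
    }
    for (SubBlock& sb : subBlocks) sb.synchronize();   // barrier after the sweep
}

int main() {
    std::vector<SubBlock> subBlocks(2);   // e.g. one SubBlock per CPU socket
    callSweepHybrid(subBlocks);
    return 0;
}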

Input Specification and Configuration List. 11.3 depicts how the subdivision of Blocks can be described in the input file. The number of SubBlocks per compute node has to be defined by a SubBlock environment. It can be seen that for each thread pool a different number of threads can be specified, i.e. for GPUs one thread is used, whereas for CPUs as many threads as available cores are used. Further, the size of the SubBlocks can be adjusted by the Load parameter. In total, all SubBlocks combined have to have the same size as the original Block, but the individual size in the z-dimension may differ. As stated above, the same kernels are still used, and only wrapper functions have to be provided for the tasking model. List. 11.4 shows, as a modified version of List. 6.8, how this has to be configured in the application configuration for the LBM Sweep. Only the modified sections are given. In particular, all original Blocks are initialized to the BS bs. Hence, only for these Blocks the function callSweepHybrid is executed in the tasking case. The BS of the SubBlocks are initialized according to the required functionality. If the SubBlock is executed on a CPU, bsCPU is set, and if it is executed on a GPU, bsGPU is set. Thus, the correct LBM kernels can be selected for each hardware by the wrapper function. The BS are also read in from the input file as shown in List. 11.3.

SubBlock{
   BS      CPU;
   Load    48;
   Threads 6;
}
SubBlock{
   BS      CPU;
   Load    48;
   Threads 6;
}
SubBlock{
   BS     GPU;
   Load   240;
   Thread 1;
   Device 0;
}

Listing 11.3: Heterogeneous node setup.

Figure 11.12: UML diagram of the SubBlock class.

Intra- and Inter-Node Communication For the communication the hybrid implementation uses the software design concept of the pure-MPI parallelization introduced in Sec. 11.2. Therefore, a new Communication Type (List. 11.1, Fig. 11.4) has to be specified. In particular, the functions extractData and insertData for the extraction and insertion phase have to be extended. Here, the communicated data need not be extracted and inserted from and to the simulation data of the original Block, but from the simulation data of the SubBlocks, and data has to be exchanged between the SubBlocks. In detail, this results in one additional for-loop, but still the same Communication Data objects (extraction and insertion of the data) can be utilized. They only have to be adjusted to the new UIDs depicted in List. 11.4. For the communication between the SubBlocks the data is always exchanged along the z-dimension. Hence, for multi-GPU simulations the functions extracting and inserting the communicated data can be optimized: in these cases the GPU-GPU transfer operations can be avoided, because the data is already consecutive in memory. However, all functions extracting and inserting data to and from the simulation data objects have to be scheduled as tasks for the thread pools, and at the end of the extraction and insertion phase the threads have to be synchronized. The resulting activity diagram for the extraction phase is depicted in Fig. 11.13. The communication via MPI does not have to be changed, as still the same amount of data has to be sent from the same buffers. Hence, the flexibility of the WaLBerla implementation is clearly shown by this example, as only those parts requiring additional functionality had to be replaced.

Figure 11.13: UML activity diagram for the extraction phase of the hybrid parallelization approach.

Domain Size   GPU   Heterogeneous-MPI   Heterogeneous-Hybrid   Heterogeneous-Tasking
175³          216   174                 238                    271
210³          217   201                 256                    289
245³          222   221                 263                    295

Table 11.1: Heterogeneous performance comparison of three different approaches. The tasking approach always achieves the highest performance. Measurements have been obtained on an Intel Xeon 5550 "Westmere" and on one NVIDIA Tesla C2070, which is equivalent to an NVIDIA Tesla M2070 (see Sec. 3.2). Performance is given in MFLUPS (million fluid lattice cell updates per second).

HSUID hsHetero("hsHetero");
BSUID bs      ("bsDefault");
BSUID bsCPU   ("bsCPU");
BSUID bsGPU   ("bsGPU");

USE_Sweep(){
   swUseFunction("Hybrid Wrapper", sweep::callSweepHybrid        , fs, hsHetero, bs    );

   swUseFunction("CPU:LBMSweep"  , lbm::sweep::basicLBMSweep     , fs, hsHetero, bsCPU );
   swUseFunction("GPU:LBMSweep"  , gpu_lbm::sweep::basicLBMSweep , fs, hsHetero, bsGPU );

   swUseFunction("CPU:LBMSweep"  , lbm::sweep::basicLBMSweep     , fs, hsCPU   , bs    );
   swUseFunction("GPU:LBMSweep"  , gpu_lbm::sweep::basicLBMSweep , fs, hsGPU   , bs    );
}

Listing 11.4: Configuration of the PDF sweep for the tasking model.

11.6 A Design Study for Heterogeneous Simulations on GPU Clusters

The first heterogeneous parallelization approach implemented for this thesis [140] has been based on the pure-MPI implementation introduced in Sec. 11.2. One MPI process has been executed per CPU core, either interacting with a GPU or processing part of the simulation domain. Each compute component, i.e. a CPU core or a GPU, has been assigned a number of Blocks representing its compute capabilities in order to balance the load. With the help of the functionality management of WaLBerla each architecture could execute its own set of initialization and kernel functions. Finally, the interface between the architectures has been the MPI buffers: the layout of these buffers has been specified, and the extraction and insertion functions of each architecture used this layout. In Tab. 11.1 the performance for this approach (Heterogeneous-MPI) is depicted for one GPU and one CPU compute node. It can be seen that the performance of this heterogeneous pure-MPI approach is at best as fast as the GPU-only performance. The detailed reasons for this are given in [140]. In short, the major performance issue is that all Blocks have to be of equal size, and thus several Blocks have to be allocated on each GPU (∼30). The resulting overhead cancels out any performance gain from the CPUs. The change to a hybrid OpenMP parallelization approach on the CPU side improved the performance (Tab. 11.1, Heterogeneous-Hybrid). For this approach, only one MPI process has been executed per CPU socket, processing one larger Block, and one process per GPU, processing fewer Blocks (∼5). Although a benefit could be achieved from this setup, it did not meet the expectations due to the multiple Blocks on the GPU side.

Heterogeneous Parallelization Approach using Tasking For the final solution, variable Block sizes have been introduced using the hybrid parallelization approach introduced in Sec. 11.5. This concept offers the most fine-grained option for load balancing: as it operates on xy-planes instead of Blocks, it minimizes the heterogeneous communication overhead and maximizes the sub-domain sizes by allocating only one SubBlock per device. Sec. 11.5 describes in detail how a simulation using either CPUs or GPUs is configured and executed. The heterogeneous simulation requires no major changes for the following reasons: two implementations, one for the CPU and one for the GPU architecture, each using its own data structures, initialization, and kernels, have already been implemented. Additionally, the functions extracting and inserting data from the CPU and the GPU data structures exist. Only the functions for the intra-node communication between CPU and GPU have to be added, which are very similar to the existing

functions. What is still missing is the correct selection of the executed kernels, which is provided by the WaLBerla functionality management. The configuration of the extraction and insertion functions is shown in List. 11.5, and the node setup in the input file is depicted in List. 11.3. Here, one SubBlock is used for each CPU socket and one SubBlock for each GPU. The obtained performance for the final heterogeneous parallelization approach is depicted in Tab. 11.1 (Heterogeneous-Tasking), and is discussed in detail in Chap. 13.

USE_CommData_Object( coCommID, cdUID ) {
   USE_CommData( "Communicate PDF data" ) {

      USE_SenderState( "From bsCPU", lbm::getSrcUID(), fs, hsHetero, bsCPU ) {
         communication::stateUseFunction( "to bsCPU",
            lbm::getSrcUID(),
            lbm::copyPDFsToMPIBuffer,
            lbm::copyPDFsFromMPIBuffer,
            lbm::localCommunicationPDF,
            hsHetero, bsCPU );
         communication::stateUseFunction( "to bsGPU",
            gpu_lbm::getSrcBuffersUID(),
            lbm::copyPDFsToMPIBuffer,
            gpu_lbm::copyPDFsFromMPIBuffer,
            communication::localCommunicationPDF_CPUtoGPU,
            hsHetero, bsGPU );
      }

      // Configuration for GPU-GPU and GPU->CPU is similar to CPU->CPU and CPU->GPU
   }
}

Listing 11.5: Configuration of Communication Data objects for the heterogeneous parallelization approach.

Requirements In summary, the following minimum requirements were identified to enable heterogeneous simulations in general:

• The hardware setup of one compute node has to be described for the initialization process
• The domain partitioning has to assign each hardware component its own simulation sub-domain
• For each architecture, an individual set of data structures, initialization and kernel functions is required
• An interface for the communication between different architectures has to be specified

In order to achieve good performance results, additional requirements must be met:

• A static load balancing is required in order to account for the different compute capabilities of GPUs and CPUs
• The heterogeneous communication overhead has to be minimized
• The sub-domains for the individual devices have to be maximized


12 Modeling the Performance of Parallel Pure-LBM Simulations

So-called "white box performance models" are an important source of information to identify and understand performance bottlenecks as well as to evaluate the maximum sustainable performance. White box approaches model the internal structure of a system, and thus can provide new insight if measurement and estimate differ. In this Chapter the serial performance of one compute node and the communication overhead of LBM simulations are investigated with the help of white box models, to establish a lightspeed estimate for the LBM performance as well as to precisely estimate the communication overhead. The lightspeed performance estimate in Sec. 12.1 is based on a model described in [35] [124] [144]. The model can predict the sustainable performance for a certain architecture prior to any code development, so that the obtainable performance on different hardware architectures can be discussed independently of specific implementations. Comparing the measured serial performance of one compute node to a lightspeed estimate is an important step towards understanding the performance properties of an implementation. Further, a performance model for the communication overhead is presented in Sec. 12.2. For this model the maximum amount of data to be sent by a single process, and the bandwidth measured in synthetic benchmarks with which this data is communicated, are determined to estimate the communication time. In addition, the model takes the design decisions made for WaLBerla into account to precisely estimate the time spent communicating. In particular, hidden overheads in the implementation can be and have been detected in this way. For example, a problem in the interaction of memory allocated by cudaHostAlloc and the MPI library OpenMPI [37] has been detected, which led to higher than expected communication times. After detection, the call of cudaHostAlloc has been replaced by a page-aligned memory allocation and a registration to CUDA by cudaHostRegister, as described in detail in Sec. 4.1. Additionally, the model provided further insight into the communication overhead, as it gives detailed information on the times spent in the individual communication phases for each parallelization approach and each simulation scenario. Further, through deviations of predictions and measurements during the development process of the model it

could be understood which effects are of importance. For example, in the beginning only the out-of-cache bandwidth has been considered by the model. This proved to be incorrect, as the send buffers used during the communication via MPI already reside in the cache if the message length is short.

Details on the Simulation Setup and the Domain Partitioning For all provided measurements in this Chapter a lid-driven cavity scenario has been simulated, with ECC turned on and in DP. All utilized compilers and libraries are given in Chap. 3. The domain decomposition for parallel simulations has a strong influence on the parallel efficiency, as it directly influences the amount of communicated data. For the comparison of the introduced parallelization approaches it should thus be equal. Hence, for all parallelization approaches the same domain decomposition has been used for parallel simulations. The intra-node domain decompositions have been kept as balanced as possible. For hybrid simulations on one CPU compute node the simulation domain is subdivided into two SubBlocks in the z-dimension, one SubBlock for each socket. The work distribution within the SubBlocks is in the y-dimension. The same setup has been used for the pure-MPI simulations. Here, twelve Blocks and processes with a spatial alignment of 1x2x6 in the x-, y-, and z-dimension are used on one compute node of the LIMA or the TSUBAME 2.0 cluster. Also for simulations using GPUs the same spatial alignment has always been used for the pure-MPI and the hybrid parallelization approach. Here, the GPU devices are always aligned in the z-dimension, and the inter-node distribution is always three-dimensional. Throughout this thesis cubic domain sizes have been used per GPU or CPU compute node. This is specified by a domain size number, e.g. domain size 180 represents a simulation domain of 180³ lattice cells per device.

12.1 Performance Model for the Kernel Performance

The first step towards modeling the performance of parallel LBM simulations is the estimation of the serial performance on one compute node, i.e. the LBM kernel performance for one compute node, to establish a scaling baseline that includes no communication overhead. To estimate an upper bound for the LBM kernel performance, the balance model [35], neglecting the boundary condition treatment and assuming non-temporal (NT) stores on CPUs, is applied in this thesis. For large domain sizes LBM implementations using the SRT model are usually memory bound. This can be seen by investigating the machine balance B_m (Eq. 12.2) and the code balance B_c (Eq. 12.3). B_m describes the ratio of the sustained machine bandwidth b_s in [bytes/s] to the peak performance p_max in [FLOPS], whereas B_c provides the ratio of the number n_b of bytes loaded and stored for the algorithm to the number n_f of executed FLOPS. If the "lightspeed" balance l (Eq. 12.4) is approximately 1, a code is not bandwidth limited. The results reported for the LBM in Tab. 12.1 indicate a memory bandwidth limitation, with a higher imbalance for the GPU than for the CPU. Hence, the performance P_e of the LBM can be determined by P_e = b_s / n_b.

Architecture   Theoretical DP FLOPS   Theoretical        Attainable         B_m    B_c    l      LBM Estimate
               [TFLOPS]               Bandwidth [GB/s]   Bandwidth [GB/s]                         [MLUPS]
"Fermi"        0.52                   144                95                 0.18   1.52   0.12   312
"Westmere"     0.12                   64                 40.4               0.34   1.52   0.22   133

Table 12.1: Attainable memory bandwidth and LBM kernel performance estimation. Bandwidth results have been obtained from [146] and [50]. ECC has been turned on for all measurements.

On CPU architectures the attainable memory bandwidth b_s can be estimated via the STREAM copy benchmark [145]. This synthetic benchmark implements a vector copy a_i = b_i, which matches well with the memory access pattern of the stream step. For GPUs an in-house implementation [50] of the benchmark has been utilized. For the LBM,

n_b = 19 (PDFs) \times 2 (load+store) \times 8 bytes (double precision) = 304 bytes,   (12.1)

and the number of floating point operations n_f is about 200 FLOPS [45]:

B_m = \frac{b_s}{p_{max}}   (12.2)

B_c = \frac{n_b}{n_f}   (12.3)

l = \min\left(1, \frac{B_m}{B_c}\right).   (12.4)
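As a worked example with the "Westmere" values already listed in Tab. 12.1 (attainable bandwidth b_s = 40.4 GB/s, peak performance p_max = 0.12 TFLOPS):

B_m = \frac{40.4\,\mathrm{GB/s}}{120\,\mathrm{GFLOPS}} \approx 0.34\ \mathrm{bytes/FLOP}, \quad B_c = \frac{304\,\mathrm{bytes}}{200\,\mathrm{FLOPS}} = 1.52\ \mathrm{bytes/FLOP}, \quad l = \min(1, 0.34/1.52) \approx 0.22,

P_e = \frac{b_s}{n_b} = \frac{40.4 \cdot 10^9\,\mathrm{bytes/s}}{304\,\mathrm{bytes}} \approx 133\ \mathrm{MLUPS},

in agreement with the lightspeed estimate reported in Tab. 12.1.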

12.2 Performance Model for the Communication Overhead

In addition to the kernel performance, a performance model for parallel simulations has to estimate the communication overhead, as the overall run-time t_r is

t_r = t_k + t_o + t_c,   (12.5)

where t_k is the kernel run-time, t_o the framework overhead, and t_c the communication overhead. Amongst others, the aim of this Section is a deeper understanding of the different communication approaches, i.e. the pure-MPI, hybrid, and heterogeneous parallelization approach introduced previously in this Part. In detail, it is of interest which model is best suited for which simulation setup or scenario, how the communication time relates to the kernel run-time, and how severe the resulting influence on the overall performance is. Additionally, the performance model for the communication overhead has been developed to identify communication overheads as well as for performance predictions of simulations. Hence, it closely resembles the design decisions made for WaLBerla and does not provide the minimum achievable communication time. However, the basic concept of the performance model is applicable in general and not restricted to WaLBerla. In a detailed study of single-node, weak scaling, and strong scaling scenarios these objectives are investigated in the following by comparing the measured communication time t_mc

t_{mc} = t_r - t_k   and   t_{mc} = t_c + t_o   (12.6)

with the times predicted by the performance model.

Communication Model The communication model can be expressed by the following steps:

• Determine the process / thread having the longest communication time
• Determine the amount of data to be sent by this process / thread
• Determine the size of each message
• Determine the bandwidth with which each message is transferred
• Estimate the time required to communicate the data

For each simulation setup this algorithm can be applied in order to estimate the time spent communicating, t_ec:

t_{ec} = \sum_{i=0}^{msg_{mpi}} \frac{n_{sr} \cdot s_i}{b_i^{MPI}(s_i, x_i, p_i)} + \sum_{i=0}^{msg_{e+i}} \frac{n_s \cdot n_{sr} \cdot s_i}{b_i^{e+i}(s_i, x_i, p_i)} + t_{GPU-GPU}.   (12.7)

The first sum accounts for the MPI communication and the second sum for the extraction and insertion phase. The nomenclature for the communication model is given in Tab. 12.2. In detail, the bandwidth b_i^X depends on the message size, on the relative position of the communicating processes p_i, i.e. the bandwidth differs for intra-socket, inter-socket, or inter-node communication, and on how many processes or threads simultaneously transfer a message in the same communication direction on the same node.

Variable    Description
msg_mpi     Number of MPI messages
msg_e+i     Number of data transfers during the extraction and insertion phase
s_i         Message size
b_i^X       Effective bandwidth (X = MPI or e+i)
x_i         Number of threads or processes sharing the same resource
p_i         Relative position of the communicating processes
n_sr        Send and receive factor
n_s         Stream factor

Table 12.2: Nomenclature for the communication model.

The factor n_sr accounts for how often the message is sent and received, while the factor n_s accounts for how many read and write streams are required for the transfer. For example, a memory copy on a CPU requires one load and one store stream if NT-stores can be utilized, and two load streams and one store stream if not. A PCI-E copy counts as one stream. The following further assumptions have been made for the model. It is assumed that the diagonal communication directions can be neglected, as they only account for a small fraction of the communication time. Thus, a maximum of six communication neighbors is taken into account. Additionally, it is assumed that all processes running on the same node send and receive data to and from a certain communication direction simultaneously, sharing common resources. Further, a homogeneous cluster setup is assumed, i.e. all compute nodes and all communication links are identical.
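The following small routine (an illustrative sketch, not part of WaLBerla) evaluates Eq. 12.7 for a given set of messages; the effective bandwidths are assumed to be provided from the synthetic benchmark data described below, and the example values in main are purely hypothetical.

#include <cstddef>
#include <cstdio>
#include <vector>

struct Message {
    std::size_t bytes;      // message size s_i
    double      bandwidth;  // effective bandwidth b_i(s_i, x_i, p_i) in bytes/s,
                            // taken from the synthetic benchmark measurements
};

// Evaluate Eq. (12.7): estimated communication time in seconds.
double estimateCommunicationTime(const std::vector<Message>& mpiMessages,
                                 const std::vector<Message>& extractInsertTransfers,
                                 double nSR,       // send/receive factor n_sr
                                 double nS,        // stream factor n_s
                                 double tGpuGpu)   // measured GPU-GPU copy time
{
    double t = tGpuGpu;
    for (const Message& m : mpiMessages)             // MPI communication
        t += nSR * static_cast<double>(m.bytes) / m.bandwidth;
    for (const Message& m : extractInsertTransfers)  // extraction + insertion
        t += nS * nSR * static_cast<double>(m.bytes) / m.bandwidth;
    return t;
}

int main() {
    // Hypothetical example: one xy-plane of a 180^3 domain with 5 PDFs in DP
    // (180*180*5*8 bytes), communicated at assumed bandwidths of 5 and 20 GB/s.
    std::vector<Message> mpi  = { {180u * 180u * 5u * 8u, 5.0e9} };
    std::vector<Message> copy = { {180u * 180u * 5u * 8u, 20.0e9} };
    std::printf("estimated t_ec = %e s\n",
                estimateCommunicationTime(mpi, copy, 2.0, 2.0, 0.0));
    return 0;
}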

Input Parameters The performance model requires the following input parameters:

• The number of transferred bytes and the size of each message.
• The obtainable PCI-E, InfiniBand, QPI, and main memory bandwidth, dependent on the message size, the relative position of the communicating processes, the utilized cluster, and the used libraries.
• The relative position of the communicating processes.
• How many processes / threads execute the same task simultaneously and share resources.
• The time required for the GPU-GPU copy operations.

121 6 10 9 5 8 7 4 6 5 3 4 2 3 2P - Cached 2 2P - Cached Send 3P - Cached Send 1 MPI Bandwidth in [GB/s] 2P - Out of Cache 1 MPI Bandwidth in [GB/s] 6P - Cached Send 0.001 0.01 0.1 1 10 0.001 0.01 0.1 1 10 Transfered Data in [MB] Transfered Data in [MB] (a) Two Processes (b) Three and Six Processes

Figure 12.1: Intra-socket MPI bandwidth measurements. The intra-socket MPI bandwidth for the exchange benchmark [147] is depicted for two, three, and six simultaneously communicating processes. Bandwidth results are reported for three cases: send and receive buffer (Cached), only the send buffer (Cached Send), and no buffer (Out of Cache) loaded into cache prior to the communication. All bandwidth measurements have been obtained on one socket of the Tsubame 2.0 cluster, and the reported bandwidth corresponds to B_MPI = \frac{4 s_i}{t_m} \cdot \frac{1000^3}{1024^3}, where s_i is the size of the message in bytes and t_m the measured time. Further, the thread affinity of the processes has been set to the cores of the first socket in ascending order.

Here, the time for the GPU-GPU copy operations is an input parameter for simplification reasons. The relative position of the communicating processes is required in order to determine over which bus the data is transferred, i.e. intra-socket, inter-socket, or inter-node communication. This information is required to choose the correct synthetic benchmark for the measurement of the obtainable bandwidth.

12.2.1 Bandwidth Measurements

In this thesis synthetic benchmarks are used to determine the bandwidth that can be obtained for certain data exchange patterns. These patterns have been chosen to closely resemble the communication patterns occurring in LBM simulations. For example, the performance estimation of simulations on a single node requires the intra-socket MPI bandwidth, the QPI bandwidth, and the PCI-E bandwidth. In addition, for weak and strong scaling scenarios the inter-node MPI bandwidth is needed. Further, the bandwidth is measured in dependence on the transferred amount of data, and it is accounted for how many processes / threads share the same resource. Hence, for each individual data transfer the bandwidth is specifically determined depending on the communication partners. Usually, the standard benchmark suite of the Intel MPI Benchmarks [147] can be used to determine the obtainable MPI bandwidth for various communication patterns. However, for the WaLBerla implementation only the MPI send buffers are already located in the cache. This results from the execution order of the communication phases, as in the extraction phase, which precedes the MPI communication, only the MPI send buffers are loaded into cache. As the Intel MPI Benchmarks cannot provide bandwidth measurements for this scenario, in all subsequent Sections an in-house implementation assuming cached MPI send buffers and non-cached receive buffers has been used.

Threads              2       4       6       8       10      12
Bandwidth [GB/s]     12.52   14.18   14.84   15.18   15.38   15.54

Table 12.3: Socket-to-socket QPI bandwidth. Results have been obtained for a multi-thread copy benchmark. The implementation is based on OpenMP [137] and the reported QPI bandwidth corresponds to B_QPI = 2 \cdot \frac{s_i \cdot n_t}{t_m} \cdot \frac{1000^3}{1024^3}, where s_i is the size of the message in bytes, t_m is the measured time, and n_t the number of threads.

Intra-Socket Bandwidth The intra-socket bandwidth for an exchange benchmark [147] is depicted in Fig. 12.1(a) for two, and in Fig. 12.1(b) for three and six processes. This benchmark implements a periodic chain communication pattern, i.e. each process sends and receives a message to and from its two neighbors. This communication pattern is equivalent to LBM simulations using an equal number of GPU devices, a one-dimensional domain decomposition, and MPI only.

Inter-Node Bandwidth In Fig. 12.2(a), the inter-node MPI bandwidth is shown for a multi send-receive benchmark [147] between two nodes. In this benchmark the processes are grouped in pairs and each process of a pair sends and receives one message. Additionally, the two processes of each pair are located on different nodes, so that each pair executes one process per node. For the LBM, the send-receive benchmark equals the data exchange between two neighboring Blocks. To approximate the obtainable MPI bandwidth for LBM simulations with more than one process per node, MPI bandwidth measurements for multiple process pairs are used.

(a) Inter-Node MPI Bandwidth  (b) PCI-E Bandwidth. Bandwidth in [GB/s] over the transferred data in [MB]; curves in (a): 2P, 4P, 6P; curves in (b): 1/2/3 GPU and 1/2/3 GPU Async.

Figure 12.2: Inter-node MPI and PCI-E bandwidth. The inter-node MPI bandwidth is depicted for a multi send-receive benchmark [147] on two nodes. The reported MPI bandwidth corresponds to $B_{MPI} = \frac{2 \cdot s_i}{t_m} \cdot \frac{1000^3}{1024^3}$, where $s_i$ is the size of the message in bytes and $t_m$ the measured time. The PCI-E bandwidth between host and device is depicted for an array copy benchmark of up to three simultaneous streams. Results are reported for cudaMemcpy and cudaMemcpyAsync. For the tasking parallelization approach the cudaMemcpyAsync bandwidth is utilized.

QPI Bandwidth In Tab. 12.3 the sustainable QPI bandwidth is presented for a multi-thread stream benchmark $a_i^j = b_i^j$, where $a^j$ and $b^j$ are two vectors and $j$ is the thread id. The setup of the thread affinity and the memory allocation ensures that the bi-directional QPI bandwidth is measured, i.e. half the threads copy data to the local memory of S0 and the other half to the local memory of S1. The QPI bandwidth measurements for the message sizes that occur in the LBM simulations (32 KB - 2 MB) are the same, i.e. independent of the message size.
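As an illustration of such a multi-threaded copy measurement, the following sketch uses OpenMP. It only reproduces the access pattern $a^j = b^j$; the thread pinning and NUMA placement that actually route half of the traffic across the QPI link have to be arranged externally, e.g. via the affinity settings of the runtime. All names and sizes are illustrative.

```cpp
// Minimal sketch of a multi-threaded copy benchmark a^j = b^j (one array pair
// per thread j). The NUMA placement and pinning needed to actually stress the
// QPI link are not shown; with the first-touch allocation below the copies
// would remain socket-local.
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n    = 1 << 18;  // elements per thread (~2 MiB of doubles)
    const int         reps = 1000;
    const int         nt   = omp_get_max_threads();

    double elapsed = 0.0;
    #pragma omp parallel reduction(max : elapsed)
    {
        std::vector<double> a(n, 0.0), b(n, 1.0);   // per-thread arrays

        #pragma omp barrier                         // start all threads together
        const double t0 = omp_get_wtime();
        for (int r = 0; r < reps; ++r)
            for (std::size_t i = 0; i < n; ++i)
                a[i] = b[i];
        elapsed = omp_get_wtime() - t0;

        double checksum = 0.0;                      // keep the copy loop from
        for (std::size_t i = 0; i < n; ++i)         // being optimized away
            checksum += a[i];
        if (checksum < 0.0)
            std::printf("unexpected checksum\n");
    }

    // Count read + write traffic of all threads and repetitions.
    const double gib = 2.0 * n * sizeof(double) * reps * nt
                       / (1024.0 * 1024.0 * 1024.0);
    std::printf("copy bandwidth: %.2f GiB/s with %d threads\n", gib / elapsed, nt);
    return 0;
}
```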

PCI-E Bandwidth The obtainable PCI-E bandwidth for an array copy benchmark is depicted in Fig. 12.2(b). Here, the PCI-E bandwidth for up to three streams has been measured to represent simultaneous PCI-E transfers of multiple GPUs.
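The following sketch illustrates such a PCI-E copy measurement with the CUDA runtime API, using pinned host memory and cudaMemcpyAsync on several streams to mimic simultaneous transfers. It uses a single device for brevity, whereas the measurements above stem from one transfer per GPU; buffer sizes, stream count, and repetition count are illustrative.

```cpp
// Minimal sketch of a host-to-device PCI-E bandwidth measurement with pinned
// memory and cudaMemcpyAsync on several streams (single device for brevity).
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int    nStreams = 3;
    const size_t bytes    = 32u << 20;   // 32 MiB per stream
    const int    reps     = 20;

    std::vector<cudaStream_t> streams(nStreams);
    std::vector<char*> host(nStreams), dev(nStreams);
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMallocHost(reinterpret_cast<void**>(&host[s]), bytes);  // pinned memory
        cudaMalloc(reinterpret_cast<void**>(&dev[s]), bytes);
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        for (int s = 0; s < nStreams; ++s)
            cudaMemcpyAsync(dev[s], host[s], bytes,
                            cudaMemcpyHostToDevice, streams[s]);
    // The stop event is recorded in the legacy default stream and is therefore
    // ordered after the asynchronous copies in the blocking streams above.
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    const double gib = double(bytes) * nStreams * reps / (1024.0 * 1024.0 * 1024.0);
    std::printf("host-to-device: %.2f GiB/s over %d streams\n",
                gib / (ms / 1000.0), nStreams);

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFreeHost(host[s]);
        cudaFree(dev[s]);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```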

12.2.2 Intra-Node Communication

In this Section, first the intra-node communication is investigated for two and three GPUs using the pure-MPI, the hybrid, and the heterogeneous parallelization approach, and a one-dimensional domain decomposition in the z-dimension. For these manageable scenarios the different phases of the communication can be estimated in good agreement, so that overheads can be identified more easily than for massively parallel simulations with higher complexity and possible overlapping effects. In addition, the intra-node predictions are used as baseline for the parallel simulations. For all investigated scenarios the measured communication time $t_{mc}$ is compared to the estimated communication time $t_e$ and the kernel time $t_k$ for all communication phases. In addition, the parallelization models are compared.

Two GPUs using the Pure-MPI Parallelization Approach In Fig. 12.3(a) and Fig. 12.3(b) the communication times for simulations using two GPUs and the Pure-MPI parallelization approach are depicted. In this scenario each process sends and receives one xy-plane. Each message has to be transferred via the PCI-E bus from device to host, sent and received via MPI, and transferred from host to device via the PCI-E bus. Hence, the following parameters have been used in Eq. 12.7 to estimate the communication times:

msg^mpi = 1, msg^e+i = 1, n_sr = 2, n_s = 1, s_0 = x^2 · 8 · 5 [bytes], b_0^mpi: Intra-Socket - 2P (Fig. 12.1(a)), b_0^e+i: PCI-E - 2 GPU (Fig. 12.2(b))
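Eq. 12.7 itself is not part of this excerpt. Purely as an illustration of how the listed parameters enter the estimate, a communication-time model of the following general form, summing the MPI, PCI-E (extraction/insertion), and GPU-GPU contributions, would be consistent with the cost breakdown reported below; this is a hedged sketch under these assumptions, not the thesis' exact equation.

\[
t_e \;\approx\; \mathrm{msg}^{mpi}\,\frac{s_0}{b_0^{mpi}(s_0)}
\;+\; \mathrm{msg}^{e+i}\,\frac{s_0}{b_0^{e+i}(s_0)}
\;+\; t_{GPU\text{-}GPU},
\qquad s_0 = x^2 \cdot 5 \cdot 8~\text{bytes},
\]

where $x$ is presumably the edge length of the cubic per-GPU domain (one xy-plane of $x^2$ cells, five distribution functions per cell, 8 bytes per value), the bandwidths are read off the measured curves referenced in the parameter list, and the GPU-GPU copy time is the measured input parameter mentioned above.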

It can be seen that the measured MPI communication time differs for domain sizes larger than 100³ per GPU by roughly 30 % from the estimation. If a call to MPI_Barrier, which is not included in the time measurement, is issued before each communication phase (see Fig. 12.5), this difference is reduced to about 3-5 %. Hence, the large differences between measurement and estimation result from slightly out of sync processes due to load imbalances or other issues. Slight load imbalances can occur, for example, due to the uneven distribution of boundary cells. The estimate of the time for the combined extraction and insertion PCI-E transfers is depicted in Fig. 12.3(b). Here, the difference between measurement and estimation is about 5-15 %. Partly, this difference can be attributed to the framework overhead for the buffer management. The estimates for the MPI / PCI-E / GPU-GPU copy times account for around 30 % / 50 % / 20 % of the overall estimated communication time. Further, the overall estimated communication time differs by about 30 % / 12 % from the measurements without an MPI_Barrier, and by about 11 % / 5 % from the measurements including an MPI_Barrier, for a domain size of 100 / 180. The total estimated communication time is about 6-10 % of the kernel time for domain sizes larger than 100. Additionally, the plateaus in the communication times result from the increased message sizes due to the padding of the communication buffers and the simulation domain.
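As a small illustration of the measurement methodology, the barrier used to separate process skew from the actual communication time could be placed as in the following sketch; the function name and the omitted message code are placeholders.

```cpp
// Sketch: the MPI_Barrier absorbs load imbalance between the processes and is
// itself not part of the timed interval, so that the measured time reflects
// the communication only.
#include <mpi.h>

double timed_communication_phase(MPI_Comm comm /*, send/recv buffers ... */) {
    MPI_Barrier(comm);               // synchronize; waiting time ends up here
    const double t0 = MPI_Wtime();
    // ... post MPI_Irecv / MPI_Isend for the PDF planes and MPI_Waitall ...
    return MPI_Wtime() - t0;         // communication time without process skew
}
```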

(a) MPI Communication Time for 2 GPUs  (b) PCI-E Communication Time for 2 GPUs
(c) MPI Communication Time for 3 GPUs  (d) PCI-E Communication Time for 3 GPUs

Figure 12.3: Intra-node communication overhead for the pure-MPI parallelization approach. Time in [ms] over the cubic domain size per GPU; curves: MPI Communication, MPI Communication - Barrier, Estimate (panels a, c) and PCI-E Transfer, Estimate, GPU-GPU Transfer (panels b, d).

Three GPUs using the Pure-MPI Parallelization Approach For this simulation setup two processes have to send and receive one plane and one process has to communicate two xy-planes. Hence, the maximum transferred data is two xy-planes. To estimate the communication time the following parameters have been used:

msg^mpi = 2, msg^e+i = 2, n_sr = 2, n_s = 1, s_{0,1} = x^2 · 8 · 5 [bytes], b_{0,1}^mpi: Intra-Socket - 3P (Fig. 12.1(b)), b_0^e+i: PCI-E - 3 GPU (Fig. 12.2(b)), b_1^e+i: PCI-E - 1 GPU (Fig. 12.2(b))

In Fig. 12.3(c) it can be seen that the measured MPI communication time differs for domain sizes larger than 100³ lattice cells by about 13-40 % from the estimate, and the deviation is again reduced by a preceding MPI_Barrier to about 1-6 %. For the PCI-E results presented in Fig. 12.3(d) the estimate differs by about 1-5 % from the measured results. The estimates for the MPI / PCI-E / GPU-GPU copy times account for around 30 % / 55 % / 15 % of the overall estimated communication time. Further, the overall estimated communication time differs by about 16 % / 5 % from the measurements without an MPI_Barrier and by about 2 % / 1 % from the measurements including an MPI_Barrier for a domain size of 100³ / 180³. In total, the estimated communication time is about 12-20 % of the kernel time for domain sizes larger than 100³ and about twice as long as the estimated communication time for the two GPU setup.

Two and Three GPUs using the Hybrid Parallelization Approach The communication times of simulations using two and three GPUs and the hybrid parallelization approach are depicted in Fig. 12.4(a). Technically, the same amount of boundary data as for the MPI approach has to be exchanged. However, for the hybrid parallelization the MPI communication phase can be avoided, as the data is exchanged with the help of the shared memory. Instead, the communicating threads have to be synchronized: once at the end of the extraction phase and once after the insertion phase. Additionally, the GPU-GPU copy operations are not required, as pointed out in Sec. 11.5. The estimates differ from the measurements by a roughly constant offset: for the domain sizes of 100 and 180 by about 10 % and 28 % or around 0.08 ms for the two GPU setup, and by about 16 % and 9 % or 0.13 ms for the three GPU setup. This difference can be explained by the synchronization overhead for the threads, which is not accounted for in the performance model. As the thread synchronization in WaLBerla is not yet well optimized, this will be further improved in future work. In total, the hybrid parallelization approach requires about 20 % / 40 % less time based on the estimate and about 40 % / 35 % based on the measurements of the two and three GPU setup than the MPI implementation in this scenario. The ratio between measured communication time and kernel time is about 6 % to 3 % for domain sizes larger than 100 for the two GPU setup, and about 13 % to 7 % for the three GPU setup. Compared to the two GPU setup, the measured communication time of the three GPU setup is 2 to 2.3 times longer.
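The two synchronization points can be pictured as in the following sketch. WaLBerla's hybrid implementation is based on boost::threads (cf. Sec. 13.1.4); the sketch uses C++20 std::barrier and invented callback names purely for illustration, and it creates the threads per call although a real implementation would keep a persistent thread pool.

```cpp
// Simplified sketch of the hybrid communication scheme: data is exchanged via
// shared host memory, so the MPI phase is replaced by two thread barriers,
// one after the extraction phase and one after the insertion phase.
#include <barrier>
#include <functional>
#include <thread>
#include <vector>

void hybrid_time_step(int nDevices,
                      const std::function<void(int)>& extract,  // GPU -> host buffer
                      const std::function<void(int)>& insert,   // neighbor buffer -> GPU
                      const std::function<void(int)>& kernel) { // LBM sweep
    std::barrier sync(nDevices);
    std::vector<std::thread> workers;
    for (int d = 0; d < nDevices; ++d) {
        workers.emplace_back([&, d] {
            extract(d);
            sync.arrive_and_wait();   // all send buffers complete in shared memory
            insert(d);
            sync.arrive_and_wait();   // buffers may safely be reused afterwards
            kernel(d);
        });
    }
    for (auto& w : workers) w.join();
}
```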

(a) Hybrid Communication Overhead  (b) Heterogeneous Communication Overhead. Time in [ms] over the domain size; curves: Estimate - 2 GPU, Estimate - 3 GPU, PCI-E Transfer - 2 GPU, PCI-E Transfer - 3 GPU.

25 50 75 100 125 150 175 25 50 75 100 125 150 175 Domain Size Domain Size (a) Hybrid Communication Overhead (b) Heterogeneous Communication Overhead

Figure 12.4: Communication overhead of the hybrid and heterogeneous parallelization approach using two and three GPUs.

To estimate the communication times the following parameters have been used:

2 GPUs: msg^mpi = 0, msg^e+i = 1, n_sr = 2, n_s = 1, s_{0,1} = x^2 · 8 · 5 [bytes], b_0^e+i: PCI-E - 2 GPU (Fig. 12.2(b))
3 GPUs: msg^mpi = 0, msg^e+i = 2, n_sr = 2, n_s = 1, s_{0,1} = x^2 · 8 · 5 [bytes], b_0^e+i: PCI-E - 3 GPU (Fig. 12.2(b)), b_1^e+i: PCI-E - 1 GPU (Fig. 12.2(b))

Two and Three GPUs using the Heterogeneous Parallelization Approach The communication overhead for the heterogeneous parallelization approach is depicted in Fig. 12.4(b). Here, the heterogeneous communication overhead for simulations using two GPUs and a CPU is larger than for simulations using only two GPUs and the hybrid parallelization approach, as in the heterogeneous case one GPU has to transfer two xy-planes instead of one. For a domain size of 174³ per device the estimate is around 33 % higher. Estimate and measurement differ for the two GPU setup and the domain sizes of 102³ and 174³ by about 50 % and 20 %. For the three GPU setup the estimated communication overhead is identical to the estimate of simulations using only three GPUs, but the measurements of the heterogeneous simulation are slightly higher for smaller domain sizes. The estimate and the measurement of the heterogeneous three GPU setup differ for the domain sizes of 102³ and 174³ by about 20 % and 10 %. For both the two and the three GPU setup these differences can again be explained by the synchronization overhead for the thread pools, as this is not accounted for in the performance model. In total, the measured communication time is about 5 % to 10 % of the kernel time for domain sizes larger than 100³ for the two GPU setup and about 17 % to 9 % for the three GPU setup.

Figure 12.5: Part of a time line of a simulation on one CPU socket showing one communication step. In each process time line the color blue indicates that the process is active, red represents that the process is waiting for the other processes. The simulated scenario is a lid driven cavity. Process P0 simulates the lid and process P6 the bottom. Before each communication phase all processes are synchronized by an MPI barrier. The time line has been created with the Intel Trace Analyzer & Collector (ITAC) [148].

To estimate the communication times the following parameters have been used:

2 GPUs: msg^mpi = 0, msg^e+i = 2, n_sr = 2, n_s = 1, s_{0,1} = x^2 · 8 · 5 [bytes], b_0^e+i: PCI-E - 2 GPU (Fig. 12.2(b)), b_1^e+i: QPI - 10 T (Tab. 12.3)
3 GPUs: msg^mpi = 0, msg^e+i = 2, n_sr = 2, n_s = 1, s_{0,1} = x^2 · 8 · 5 [bytes], b_0^e+i: PCI-E - 3 GPU (Fig. 12.2(b)), b_1^e+i: PCI-E - 1 GPU (Fig. 12.2(b))

12.2.3 Inter-Node Communication

For the investigation of the inter-node communication the pure-MPI, the hybrid, and the heterogeneous parallelization approach are compared in weak and strong scaling scenarios. In a strong scaling scenario the amount of work is held constant for an increasing number of nodes, whereas for weak scaling experiments the amount of work is scaled linearly with the number of nodes. In all simulations the domain size per GPU is cubic, and the intra-node domain decomposition is one-dimensional in the z-dimension, while the inter-node domain decomposition is three-dimensional. Hence, the simulation domain for a weak scaling experiment with a domain size of 180 per GPU on 8 compute nodes (a 2×2×2 node grid) using three GPUs per node is 360×360×1080 lattice cells, i.e. 180×180×540 cells per node. A strong scaling scenario with domain size 180 has a simulation domain of 180×180×180 lattice cells independent of the number of used nodes. For all investigated scenarios again the measured communication time $t_{mc}$ is compared to the estimated communication time $t_{ec}$ and the kernel time $t_k$ for all communication phases.

Parallel Simulations using the Pure-MPI Parallelization Approach In Fig. 12.6 the measured and estimated communication times of the Pure-MPI parallelization approach for weak and strong scaling scenarios are depicted and compared to the kernel and GPU-GPU transfer times. For the setups using eight compute nodes the maximum transferred data is four planes, and for 27 nodes six planes have to be communicated. Increasing the number of nodes beyond 27 does not add any more data. Hence, the estimate for the communication time is constant for more than 27 nodes. In detail, the following parameters have been used to estimate the communication time:

8 nodes: n^mpi = 4, n^e+i = 4, n_sr = 2, n_s = 1, s_i = x^2 · 8 · 5 [bytes]; MPI bandwidths b_{0-1}^mpi, b_2^mpi, b_3^mpi: Inter-Node / Intra-Socket curves (3P, 1P, 3P); extraction/insertion bandwidths b_{0-2}^e+i: PCI-E - 3 GPUs, b_3^e+i: PCI-E - 1 GPU
27-343 nodes: n^mpi = 6, n^e+i = 6, n_sr = 2, n_s = 1, s_{0,1} = x^2 · 8 · 5 [bytes]; MPI bandwidths b_{0-3}^mpi, b_4^mpi, b_5^mpi: Inter-Node / Intra-Socket curves (3P, 2P, 3P); extraction/insertion bandwidth b_{0-5}^e+i: PCI-E - 3 GPUs

It can be seen that the estimated communication time for the weak scaling scenarios is in good agreement. For the domain size of 140 the maximum difference between measurement and estimate is about 4 %, and for the domain size of 180 it is about 9 %. In particular, 45 % of the estimated time is spent in the MPI communication, 40 % in the extraction and insertion phase, and around 10 % in the GPU-GPU transfer operations. Comparing estimated communication time and kernel time shows a maximum ratio of around 82 % for the domain size of 140 and 65 % for the domain size 180. Hence, the communication overhead will have a significant impact on the parallel efficiency. However, the estimate predicts that this overhead can be hidden by overlapping communication and work, as the time spent in communication is still less than the time for the kernel.

This is not the case for the strong scaling scenario of the domain size 180 depicted in Fig. 12.6(e), as here the kernel time drops most rapidly for an increasing number of nodes, due to its O(n³) dependence on the number of lattice cells compared to O(n²) for the communication. It can be seen that from 27 nodes on the communication time is larger than the kernel time and hence cannot be completely overlapped any more. The estimate and the measurement also agree well; the maximum difference between measurement and estimate is about 18 %.

Parallel Simulations using the Hybrid Parallelization Approach The communication overhead for the hybrid parallelization approach has also been investigated for weak scaling scenarios of domain size 140 and 180, and for a strong scaling scenario of domain size 180. The resulting communication times are depicted in Fig. 12.6. Additionally, the communication times have been investigated for simulation setups using only two GPUs, as this is a common hardware setup for GPU clusters. During the MPI communication the maximum communicated data is three planes while using 8 nodes; for more than 27 nodes it is 6 planes. In contrast to the pure-MPI approach, the simulation domain per process is not cubic for the hybrid parallelization approach. All messages involving the z-dimension are larger by a factor of three for the simulations using three GPUs and by a factor of two for the simulations using two GPUs. Hence, there are two different message sizes, s_small and s_large. For the hybrid parallelization approach only one process per node is executed. Thus, the InfiniBand bandwidth is not shared amongst several processes. In detail, the following parameters have been used to estimate the communication times:

8 nodes: n^mpi = 3, n^e+i = 4, n_sr = 2, n_s = 1, s_small = x^2 · 8 · 5 [bytes], s_large = 3 · x^2 · 8 · 5 [bytes], b_{0-3}^mpi: Inter-Node - 1P, b_{0-2}^e+i: PCI-E - 3 GPUs, b_3^e+i: PCI-E - 1 GPU
27-343 nodes: n^mpi = 6, n^e+i = 6, n_sr = 2, n_s = 1, s_small = x^2 · 8 · 5 [bytes], s_large = 3 · x^2 · 8 · 5 [bytes], b_{0-5}^mpi: Inter-Node - 1P, b_{0-5}^e+i: PCI-E - 3 GPUs

The weak scaling results depicted in Fig. 12.6(b) and Fig. 12.6(d) for the domain sizes 140 and 180, and two GPUs show a good agreement between measurement and estimate. For the domain size 140 the maximum deviation of the measured communication time from the estimate is about 14 %. For the domain size 180 the maximum deviation is 4 % for the experiments using three GPUs, and 8 % for two GPUs. A detailed investigation of the estimated communication shows that 45 % / 45 % / 45 % of the time is spent in the MPI communication, 44 % / 44 % / 40 % in the extraction and insertion phase, and 11 % / 11 % / 15 % in the GPU-GPU copy operations for the domain sizes 140 and 180 with three GPUs, and 180 with two GPUs.

(a) Weak Scaling, Pure-MPI, 3 GPUs, 140³  (b) Weak Scaling, Hybrid, 3 GPUs, 140³
(c) Weak Scaling, Pure-MPI, 3 GPUs, 180³  (d) Weak Scaling, Hybrid, 3 GPUs, 180³
(e) Strong Scaling, Pure-MPI, 3 GPUs, 180³  (f) Strong Scaling, Hybrid, 3 GPUs, 180³

Figure 12.6: Communication overhead of weak and strong scaling experiments using three GPUs, and the pure-MPI and hybrid parallelization approach for a cubic domain size of 140 and 180 per GPU. Time in [ms] over the number of GPGPUs; curves: Communication, Estimate, GPU-GPU Transfer, Kernel.

The maximum estimated communication time is about 82 % / 68 % / 48 % of the kernel run-time. Hence, also for the hybrid approach the communication overhead has a significant influence on the parallel efficiency, but it can also be hidden behind the kernel time. Comparing the pure-MPI and the hybrid parallelization approach shows that the estimated communication times of the pure-MPI approach are about 3 % lower for the domain size 140, and 7 % lower for the domain size 180. The measured communication times of the pure-MPI approach are at most about 8 % faster for the domain size 140, and 10 % for the domain size 180. In particular, the higher deviation of the measured communication times from the estimates can be explained to a large extent by the additional thread synchronization overhead of the hybrid approach. The difference in the estimates stems from the fact that, although both approaches have to send the same amount of data via the InfiniBand link per node, the pure-MPI approach is able to obtain a slightly higher bandwidth on a node basis due to the three sending and receiving processes. This can also be seen in Fig. 12.2(a), where a send and receive between two processes achieves a maximum bandwidth of 4.86 GB/s, and a send and receive between three process pairs achieves a maximum bandwidth of 1.75 GB/s each, i.e. 5.25 GB/s in total.

For the strong scaling scenarios depicted in Fig. 12.6 both parallelization approaches have a nearly identical estimate of the communication time. This is because here the multi-process bandwidth benefit of the pure-MPI approach is canceled out by the penalty for its smaller message sizes compared to the hybrid approach. However, while the measured pure-MPI communication time closely resembles the estimate, the hybrid measurements are about 40 % higher than the estimate. This offset is fairly constant and results from the thread synchronization overhead, which is independent of the domain size and hence is worse for the smaller domain sizes as they occur in strong scaling scenarios. On average, the estimate is composed of 41 % MPI communication, 45 % insertion and extraction, and 14 % GPU-GPU copy operations. Comparing the communication times to the kernel time shows that for simulations on 27 nodes and more the communication overhead is larger and thus cannot be completely overlapped with work.

Parallel Simulations using the Heterogeneous Parallelization Approach The communication times for the weak and strong scaling experiments using the heterogeneous parallelization approach are depicted in Fig. 12.7 for a domain size of 180, and two and three GPUs. The estimation of the MPI communication for the heterogeneous parallelization approach is identical to the one of the hybrid approach.

(a) Weak Scaling, 2 GPUs, Hybrid, 180³  (b) Weak Scaling, 2 GPUs, Heterogeneous, 180³
(c) Weak Scaling, 3 GPUs, Heterogeneous, 180³  (d) Strong Scaling, 2 GPUs, Hybrid, 180³
(e) Strong Scaling, 3 GPUs, Heterogeneous, 180³

Figure 12.7: Communication overhead of weak and strong scaling experiments using two and three GPUs, and the hybrid and heterogeneous parallelization approach for the cubic domain size of 180. Time in [ms] over the number of GPGPUs; curves: Communication, Estimate, GPU-GPU Transfer, Kernel.

Additionally, the estimation of the time required for the extraction and insertion phase is similar, as always the threads interacting with the GPUs are responsible for the maximum communication time. Additionally, it is neglected that for the insertion and extraction phase part of the data has to be copied from the GPU and part from the CPU; instead it is assumed that the complete buffer has to be copied from the GPU. Only the intra-socket extraction and insertion time for the two GPU setup differs from the hybrid times, as discussed in detail in Sec. 12.2.2. Hence, the same parameters as for the hybrid approach have been used to estimate the communication overhead for the three GPU setup, and for the two GPU setup only the estimate for the intra-socket communication has been taken from Sec. 12.2.2. Consequently, the measured communication times of the hybrid and the heterogeneous parallelization approach agree well for the weak scaling experiments. The measured communication time of the heterogeneous approach differs by at most 4 % from the hybrid approach. However, the difference between estimated communication time and kernel time is higher for the heterogeneous approach: for the two GPU setup it is 53 %, and for the three GPU setup it is 74 %. In detail, the estimate is composed of about 46 % / 45 % MPI communication, 39 % / 45 % insertion and extraction, and 15 % / 10 % GPU-GPU copy operations for the two / three GPU setup. For the strong scaling scenarios the general picture for the heterogeneous and hybrid parallelization approach is the same, although the measured communication time of the heterogeneous approach is higher for 64 (192 GPUs) and 125 (374 GPUs) nodes. The difference is between 5 % and 15 %. In detail, the following parameters have been used to estimate the communication times:

8 nodes: msg^mpi = 3, msg^e+i = 4, n_sr = 2, n_s = 1, s_small = x^2 · 8 · 5 [bytes], s_large = 3 · x^2 · 8 · 5 [bytes], b_{0-3}^mpi: Inter-Node - 1P, b_{0-2}^e+i: PCI-E - 3 GPUs, b_3^e+i: PCI-E - 1 GPU
27-343 nodes: msg^mpi = 6, msg^e+i = 6, n_sr = 2, n_s = 1, s_small = x^2 · 8 · 5 [bytes], s_large = 3 · x^2 · 8 · 5 [bytes], b_{0-5}^mpi: Inter-Node - 1P, b_{0-5}^e+i: PCI-E - 3 GPUs


13 Performance Results for Pure-LBM Fluid Flows

In this Chapter the single compute node and parallel performance of pure-LBM simulations is investigated for CPUs and GPUs. Therefore, in Sec. 13.1 the performance on one compute node is evaluated with the help of the balance performance model introduced in the previous Chapter, which shows that the LBM implementations for both CPU and GPU are well optimized. This is a necessary requirement for the investigation of parallel simulations in Sec. 13.2, as codes with an unoptimized serial implementation scale better, which renders a detailed investigation of their scaling behavior pointless. In particular, the performance analysis gives insight into the impact of the communication overhead for different simulation setups, domain sizes, and hardware architectures as well as into the differences between the investigated parallelization approaches. Further, the speed up of heterogeneous simulations over GPU-only simulations and of GPU-only simulations over CPU-only simulations is investigated.

Technical Simulation Details Technically, all results have been obtained with WaLBerla using the D3Q19 stencil and the SRT model. The final performance result is determined as the maximal performance of ten measurements. The performance results are reported in mega / giga fluid lattice updates per second (MFLUPS / GFLUPS). The unit of FLUPS is well suited for LBM performance, as it is a precise and comparable quantity in contrast to FLOPS. In this thesis, a performance comparison based on FLOPS is not applied for several reasons. First, the same algorithm / functionality can be implemented using a varying number of FLOPS. Second, the number of executed FLOPS is dependent on the compiler, the libraries, and the utilized hardware. Hence, FLOPS are not comparable. All simulations utilizing GPUs have been executed on the Tsubame 2.0 in DP and with ECC turned on, while all simulations utilizing only CPUs have been executed on LIMA. All utilized compilers and libraries are given in Sec. 3. For the pure-MPI parallelization approach, one process is used per compute core or GPU, and the thread affinity has been set with the help of the MPI library. The thread affinity of heterogeneous and hybrid simulations has been set using likwid-pin [149], and only one process has been executed on each compute node.

Slow or defective nodes have been sorted out prior to the execution of the simulation based on ping-pong run-times and GPU accessibility. Further, in all simulation runs the domain size of each CPU compute node or GPU has been cubic, and the domain decomposition of parallel simulations has been three-dimensional. The domain decomposition is described in detail in Sec. 12.2.3. Process numbers for weak and strong scaling experiments have hence been increased equally in all dimensions, resulting in node counts of 1, 8, 27, 64, etc.
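For illustration, the reported unit can be computed from the number of fluid cells, the number of time steps, and the run-time as in the following sketch; the helper name and the example numbers (100 time steps in 2.2 s) are chosen only so that the result roughly reproduces the single-GPU kernel performance of about 240 MFLUPS quoted below.

```cpp
// Sketch: MFLUPS (mega fluid lattice updates per second) from the number of
// fluid cells, the number of executed time steps, and the measured run-time.
#include <cstdio>

double mflups(double fluidCells, double timeSteps, double seconds) {
    return fluidCells * timeSteps / seconds / 1.0e6;
}

int main() {
    // Example: a 174^3 pure-fluid domain, 100 time steps in 2.2 s -> ~240 MFLUPS.
    std::printf("%.1f MFLUPS\n", mflups(174.0 * 174.0 * 174.0, 100.0, 2.2));
    return 0;
}
```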

Kernel Performance For all charts the kernel performance represents the performance $P_k = \frac{n_c}{t_k}$ resulting from the time $t_k$ spent in the LBM kernel. This time includes the time for the update of all lattice cells ($n_c$) as well as the time spent for the BC handling, but it does not contain any time required for the communication $t_c$. For parallel simulations $t_k$ is the maximal time spent in the LBM kernel by any process $i$, i.e. $t_k = \max_i(t_k^i)$. The resulting kernel performance is equal to the maximal obtainable parallel performance that can be achieved, as no communication time and no parallel framework overhead is included. The kernel performance is established as scaling baseline instead of the serial performance of one core, as the main memory bandwidth within a compute node does not scale linearly.
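For parallel runs the kernel time entering this definition is the maximum over all processes, which could be obtained, for example, with a max-reduction as sketched below (function name illustrative).

```cpp
// Sketch: t_k for a parallel run is the maximum of the per-process kernel
// times, obtained here via an MPI max-reduction.
#include <mpi.h>

double max_kernel_time(double localKernelTime, MPI_Comm comm) {
    double tk = 0.0;
    MPI_Allreduce(&localKernelTime, &tk, 1, MPI_DOUBLE, MPI_MAX, comm);
    return tk;   // t_k = max_i t_k^i, entering P_k = n_c / t_k
}
```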

Estimated Performance The estimated performance $P_e$ can be determined by two approaches. First, independent of the implementation, by $P_e = \frac{n_c}{t_{ek} + t_{ec}}$, where $t_{ek}$ is the time predicted by the lightspeed estimate for the update of all cells, and $t_{ec}$ is the estimated time for the communication. Second, by $P_e = \frac{n_c}{t_k + t_{ec}}$ for a more precise estimate of the parallel performance, which replaces the lightspeed estimate by the actual kernel performance measurement. The first approach is suited for the in-advance analysis of the expected performance on a compute architecture, e.g. to determine if one or the other system will be more advantageous for a certain task, or to predict the benefit of a certain optimization strategy, e.g. heterogeneous simulations. The second approach is better suited for precise investigations, e.g. for performance predictions, the investigation of the influence of the communication overhead, or the discovery of overheads in the communication. Hence, in the following the second approach has been chosen.

Measured Performance The measured performance $P_m$ is compared to $P_k$ and $P_e$ to investigate the influence of the communication overhead, and to investigate whether the deviation of the measured performance from the maximal possible parallel performance, i.e. $P_k$, is solely caused by the communication overhead, or whether there are additional bottlenecks or overheads. Additionally, the results are compared to the estimates of the performance model to predict the performance for the full Tsubame 2.0 cluster, and for different domain partitionings, i.e. 1D vs. 2D vs. 3D.

Speedup factors given in the subsequent Sections have been determined by $S = \frac{P_m}{P_k^1}$, where $P_k^1$ is the kernel performance of one GPU for simulations using up to one full compute node, and the kernel performance of one full compute node for parallel simulations. This takes into account the heterogeneous scalability of modern clusters, as the intra-node scalability differs from the inter-node scalability.

13.1 Single Node Performance of LBM Simulations on CPUs and GPUs

In this Section the performance of LBM simulations for one compute node is investigated. This is necessary to establish a performance baseline for parallel simulations, as the performance within a compute node does not scale linearly, e.g. with the number of GPUs, due to heterogeneous node designs. Moreover, the scalability within a compute node is of fundamental interest. In detail, it is investigated how large the impact of the communication overhead is for the different parallelization approaches, and how the overlapping of work and communication influences the kernel performance. In order to address these issues, simulations have been conducted on one to three GPUs, and on one CPU compute node using the pure-MPI, the hybrid, and the heterogeneous parallelization approach.

13.1.1 Simulations Using a Single GPU

The performance results obtained for one GPU are depicted in Fig. 13.1 for the hybrid and the heterogeneous parallelization approach. As for one GPU there is no significant framework and parallelization overhead, the kernel performance is equivalent to the single GPU performance. Also, the performance for the pure-MPI and the hybrid parallelization approach is identical. It can be seen that the kernel performance for the domain size 174 is about 240 MFLUPS. This is about 23 % lower than what the lightspeed estimate established in Sec. 12.1 predicts. Nevertheless, this is a good result, as the presented lightspeed is a very rough estimate, which overestimates the obtainable performance to be suitable as an upper bound. In particular, it does not account for the 19 simultaneous load and store streams of the LBM, which further investigations have shown to lower the lightspeed estimates towards the measured kernel performance [44].

The maximal kernel performance for the overlapping case is about 13 % lower than the measurements without overlap. This is expected, as for the overlapping case two kernels are required, and the kernel for the outer domain region has to load more data than needed, as there is no load instruction for a single word on GPUs. A maximal cubic domain size of 174 has been used; increasing the domain size further will increase the performance. In Fig. 13.1(b) the heterogeneous performance for one GPU and CPU is depicted. Note that the domain sizes have always been approximately set to $s_D^3 \cdot \frac{P_k^{GPU}}{P_k^{CPU} + P_k^{GPU}}$ for the GPU, and to about $s_D^3 \cdot \frac{P_k^{CPU}}{P_k^{CPU} + P_k^{GPU}}$ for the CPU. For the heterogeneous simulations the sum of these domain sizes has been used, leading to the cubic domain size $s_D$ given in Fig. 13.1(b). Thus, the setup is a weak scaling experiment where the amount of work is scaled together with the compute capabilities of the utilized hardware. It can be seen that the maximally achieved kernel performance for the CPU is around 80 MFLUPS. This is 20 MFLUPS lower than the maximal kernel performance given in Fig. 13.3(a), which results from the comparatively smaller domain sizes and from the fact that only 11 (10 for the overlap case) threads/cores can be used instead of 12. The maximal kernel performance for the GPU has been 250 MFLUPS. Hence, the load balance between GPU and CPU has been approximately 3:1. The maximally achieved performance in the heterogeneous simulations is around 310 MFLUPS; the maximal sum of the kernel performances of GPU and CPU is 330 MFLUPS. Thus, a maximal gain of around 60 MFLUPS (24 % speed up) is achieved for the heterogeneous approach over the single GPU performance, and the achieved performance is about 94 % of the maximal combined kernel performance of CPU and GPU. Further, the kernel performance of the heterogeneous approach is about 80 MFLUPS (32 % speed up) higher than the kernel performance of the hybrid approach. The balance model predicts an increase of around 133 MFLUPS (42 % speed up). In summary, for large domain sizes 20 MFLUPS are lost due to the small sub-domain sizes to which the CPU is restricted by the required load balance and the limited amount of memory on the GPU side. Another 20 MFLUPS are lost due to the communication overhead, which cannot be overcome for one GPU and one CPU by overlapping work and communication.
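A proportional split that distributes whole xy-planes between CPU and GPU, as described above, could look like the following sketch; the function name is illustrative, and the example numbers (P_k^GPU of about 250 MFLUPS, P_k^CPU of about 80 MFLUPS) are the single-node values quoted in this Section.

```cpp
// Sketch: splitting the z-extent of a sub-domain between CPU and GPU in
// proportion to their kernel performance, assigning whole xy-planes only.
#include <cmath>
#include <cstdio>

int cpuPlaneCount(int totalPlanes, double pkCpu, double pkGpu) {
    const double cpuShare = pkCpu / (pkCpu + pkGpu);
    return static_cast<int>(std::lround(totalPlanes * cpuShare));
}

int main() {
    const int total = 230;   // xy-planes of the combined heterogeneous domain
    const int cpu   = cpuPlaneCount(total, 80.0, 250.0);
    // Yields 56 planes for the CPU and 174 for the GPU, i.e. roughly the 3:1
    // load balance and the 174^3 GPU sub-domain mentioned in the text.
    std::printf("CPU: %d planes, GPU: %d planes\n", cpu, total - cpu);
    return 0;
}
```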

13.1.2 Simulations Using Two GPUs

In Fig. 13.1 and Fig. 13.2 the performance results for simulations using two GPUs and the hybrid, the heterogeneous, and the pure-MPI parallelization are depicted. The maximal kernel performance for the pure-MPI and the hybrid parallelization approach is 459 MFLUPS; for the heterogeneous approach it is 561 MFLUPS.

(a) Hybrid Parallelization, 1 GPU  (b) Heterogeneous Parallelization, 1 GPU
(c) Hybrid Parallelization, 2 GPUs  (d) Heterogeneous Parallelization, 2 GPUs
(e) Hybrid Parallelization, 3 GPUs  (f) Heterogeneous Parallelization, 3 GPUs

Performance in [MFLUPS] over the cubic domain size; curves: measured performance, Kernel, Kernel (Overlap), Estimate, and Lightspeed Estimate (single GPU).

Figure 13.1: Single node performance of simulations using the hybrid or the heterogeneous parallelization.

(a) Pure-MPI Parallelization, 2 GPUs  (b) Pure-MPI Parallelization, 3 GPUs. Performance in [MFLUPS] over the cubic domain size; curves: 2/3 GPUs, 2/3 GPUs (Overlap), Kernel, Estimate.

Figure 13.2: Single node performance of simulations using two and three GPUs, and the pure-MPI parallelization.

The maximal achieved performance is highest for the heterogeneous approach with 527 MFLUPS (+20 %), followed by the hybrid approach with 439 MFLUPS, and the pure-MPI approach with 429 MFLUPS (-2 %). The higher gain in performance (88 MFLUPS instead of 60 MFLUPS) for the heterogeneous two GPU setup over the one GPU setup results from three reasons: First, the maximal CPU kernel performance is higher, as the CPU computes a larger domain size, i.e. $2 s_D^3 \cdot \frac{P_k^{CPU}}{2 P_k^{GPU} + P_k^{CPU}} > s_D^3 \cdot \frac{P_k^{CPU}}{P_k^{GPU} + P_k^{CPU}}$. Second, the two GPU hybrid results include a parallelization overhead, which the one GPU hybrid measurements do not. Third, with the heterogeneous approach it is possible to simulate larger domain sizes, which leads to a higher kernel performance for CPU and GPU. The difference in performance between the pure-MPI and the hybrid parallelization approach can be explained by the communication overhead, as the kernel performance is identical. For details on the communication overhead in this setup see Sec. 12.2.2. The overlapping of work and communication yields no benefit for any parallelization approach or domain size. Performance estimation and measurements are in good agreement for larger domain sizes (75-180). Hence, no significant framework overhead is present in this scenario.

13.1.3 Simulations Using Three GPUs

For the three GPU setup depicted in Fig. 13.1 and Fig. 13.2 the general picture is the same as for the two GPU setup. The maximal kernel performance is 793 MFLUPS for the heterogeneous approach, and 689 MFLUPS for the pure-MPI and the hybrid approach.

(a) Pure-MPI: 1 CPU, Kernel, Lightspeed Estimate  (b) Hybrid: 1 CPU - OMP, 1 CPU - Tasking, 1 CPU - Tasking (Overlap), Lightspeed Estimate. Performance in [MFLUPS] over the cubic domain size.

Figure 13.3: Single node performance of simulations using one CPU compute node, and pure-MPI (a) and hybrid (b) parallelization.

However, the kernel performance for small domain sizes is different for the pure-MPI and the hybrid approach. For a domain size of 75 the pure-MPI kernel performance is 502 MFLUPS, and for the hybrid approach it is 472 MFLUPS (-6 %). This difference results from the synchronization overhead for the threads, which is included in the kernel time for the hybrid model. The maximal measured performance for the heterogeneous approach is about 732 MFLUPS (+15 %), for the hybrid approach it is about 632 MFLUPS, and for the pure-MPI approach it is about 612 MFLUPS (-3 %). The lightspeed estimate predicts a maximal kernel performance of 1069 MFLUPS for the heterogeneous approach, of which 75 % is achieved. Again, the performance estimates agree well with the measurements. The overlapping of communication and work has no performance benefit for the three GPU setup, but for the pure-MPI approach it is nearly identical to the non-overlap version, and for the hybrid and heterogeneous approach it is only slightly slower. Comparing the maximal single GPU setup with the three GPU setup shows a speed up of about 2.6 for the pure-MPI as well as the hybrid approach. The maximal difference between performance and kernel performance is about 30 % for domain sizes larger than 75. Hence, even for multi-GPU simulations within one compute node the communication overhead has a significant influence on the performance.

13.1.4 Simulations Using a Single CPU Compute Node

In this Section the CPU node performance is investigated. However, the focus is on a comparison of CPU and GPU. Detailed investigations of the LBM performance on CPUs have been reported in [45] [50].

Fig. 13.3(a) depicts the single node performance on LIMA using the pure-MPI parallelization approach. It can be seen that a maximum of around 96 MFLUPS is achieved, which is close (98 %) to the maximal kernel performance for large domain sizes, and about 72 % of the lightspeed estimate predicted in Sec. 12.1. The maximal difference between performance and kernel performance is about 10 % for domain sizes larger than 75. Hence, for simulations within one compute node the communication overhead is not significant, which is in contrast to the GPU results. Comparing the CPU and the single GPU performance leads to a speed up of 2.6 for the GPU for a domain size of 170, of 2.5 for the domain size of 100, and of 2.0 for the domain size 50. Fig. 13.3(b) depicts the performance results obtained with the hybrid parallelization approach. It can be seen that the implementation based on OpenMP is nearly identical to the tasking approach introduced in Sec. 11.5. Only for cubic domain sizes smaller than 75 is the performance of the OpenMP version better. This is because of the slower thread synchronization of boost::threads compared to OpenMP [126], which is independent of the domain size, and thus more prominent for smaller domain sizes. The maximal achieved performance for both implementations is about 97 MFLUPS. Comparing the hybrid and the pure-MPI implementation shows no significant deviations. For the implementation on CPUs it is also possible to overlap communication and work. It can be seen in Fig. 13.3(b) that for simulations using one compute node this yields no performance benefit: the maximal achieved performance of about 85 MFLUPS is 13 % lower than for the implementation without overlap.

13.2 Performance Evaluation of Parallel Simulations

The objective of this Section is to investigate the scalability of the LBM on GPUs using different parallelization approaches in weak and strong scaling scenarios for varying domain sizes. Therefore, the measured performance has been investigated and compared to the performance model for the pure-MPI, the hybrid, and the heterogeneous parallelization approach. In addition, the GPU performance results are compared to the CPU performance. To gain a deeper insight, predictions using the performance model introduced in Sec. 12.2 have also been used.

(a) Weak Scaling, Pure-MPI, 3 GPUs, 140³  (b) Weak Scaling, Pure-MPI, 3 GPUs, 180³
(c) Estimated Weak Scaling, Pure-MPI, 3 GPUs  (d) Strong Scaling, Pure-MPI, 3 GPUs, 180³

Performance in [GFLUPS] over the number of GPGPUs; curves: Weak/Strong Scaling, Weak/Strong Scaling (Overlap), Kernel, Kernel (Overlap), Estimate, and in (c) estimates for the domain sizes 60³, 100³, 140³, 180³ with linear scaling.

Figure 13.4: Weak scaling and strong scaling performance of simulations using MPI-only.

13.2.1 GPU Performance Results for the Pure-MPI Parallelization Approach

The weak scaling performance of LBM simulations using the pure-MPI parallelization approach for a cubic domain size of 140 and 180 per GPU and up to 343 nodes (1029 GPUs) is depicted in Fig. 13.4. A maximum of around 129 GFLUPS has been achieved if the communication overhead is not overlapped for the domain size of 140, and 144 GFLUPS for the domain size of 180. Compared to the kernel performance this is a loss of around 44 % and 40 %. As the performance estimates show, the performance degradation can be explained in good agreement by the communication overhead. If the communication overhead is overlapped, a maximum of 183 GFLUPS can be achieved for the domain size of 140 and 195 GFLUPS for the domain size of 180, which is a speed up of around 1.42 and 1.35 over the implementation without overlap.

Comparing the kernel performance and the measured performance of the overlap implementation shows a near ideal scaling (94 %) for both domain sizes. Part of the remaining overhead results from the GPU-GPU copy operations, which cannot be overlapped with work, as the current GPUs can only overlap different compute kernels in special cases, as described in Sec. 4.1. In Fig. 13.4(c) the performance model has been used to investigate the performance that could be achieved using the full Tsubame 2.0 cluster, and also whether the communication overhead can still be overlapped for small domain sizes. For the domain size of 180 a maximal performance of 744 GFLUPS can be achieved, which corresponds to around 40 % of the total available theoretical bandwidth and around 60 % of the total available stream memory bandwidth of the full machine. Further, the predictions show that the communication overhead can be hidden for the domain sizes of 180 and 140, but for the domain size of 100 the communication overhead can only be completely hidden up to 8 compute nodes. For the domain size of 60 the communication overhead cannot be hidden completely at all. For all domain sizes the implementation overlapping the communication with work achieved the best performance results.

The strong scaling performance obtained with the pure-MPI parallelization approach for the cubic domain size 180, and up to 125 nodes (375 GPUs) is shown in Fig. 13.4(d). A maximum of 6.8 GFLUPS can be achieved for the implementation without overlap and 8.5 GFLUPS for the implementation with overlap. An increase from 1 node to 8 nodes leads to a speed up of 5.8, to 27 nodes to a speed up of 9.2, and to 125 nodes to a speed up of 14.61 for the implementation including overlap of work and communication. The performance estimates are in good agreement with the measured performance. An analysis of the strong scaling performance shows that the scalability in strong scaling scenarios is mainly influenced by two effects: first, by the reduction of the kernel performance for smaller domain sizes, which can clearly be seen by comparing the kernel performance with the linear scaling (the linear scaling is the performance a single GPU achieves for the domain size of 180 scaled by the number of GPUs); second, by the communication overhead. For 1 to 27 nodes the communication overhead is the most severe single effect, but for larger node numbers the reduction of the kernel performance has a larger impact. Using 27 nodes the kernel performance is lower than the linear scaling by around 5.5 GFLUPS, and the communication overhead further reduces the performance by around 8.6 GFLUPS. For 125 nodes the differences are around 51 GFLUPS and 29 GFLUPS. Hence, for good strong scalability not only the communication overhead has to be minimized, but also the kernel performance for small domain sizes has to be optimized.

13.2.2 GPU Performance Results for the Hybrid Parallelization Approach

In Fig. 13.5 the weak scaling performance for the hybrid parallelization approach is depicted. In general, the overall picture is the same as for the pure-MPI parallelization. Comparing the maximal achieved performance of the two approaches shows that for the pure-MPI approach a slightly higher performance (140: +5 %, 180: +1 %) can be achieved for the implementation without overlap of communication and work, but for the implementation with overlap the performance of the two approaches does not differ. This can be explained by the analysis of the communication overhead in Sec. 12.2.3. For the strong scaling experiments depicted in Fig. 13.5(e), however, the hybrid approach achieves a lower performance than the pure-MPI approach. This mainly stems from the domain-size-independent thread synchronization overhead required by the hybrid approach. Sec. 12.2.3 discusses and Fig. 12.6 shows that this results in a higher communication effort. In addition, Sec. 13.1.3 discusses that this also leads to a lower kernel performance of the hybrid parallelization approach compared to the kernel performance of the pure-MPI approach for smaller domain sizes. Both effects can be neglected for large domain sizes, as the weak scaling results show, but especially for the small domain sizes occurring in strong scaling experiments they are significant. Hence, for the implementation including overlap a maximum of 5.7 GFLUPS is achieved, which is about 33 % lower than for the pure-MPI implementation. An increase from 1 node to 8 nodes leads to a speed up of 5.7, to 27 nodes to a speed up of 7.3, and to 125 compute nodes to a speed up of 9.7 for the implementation including overlap of work and communication.

13.2.3 GPU Performance Results for the Heterogeneous Parallelization Approach

The weak scaling behavior of the heterogeneous parallelization approach depicted in Fig. 13.5(d) is largely the same as the scaling behavior of the hybrid approach. Nearly perfect scaling is achieved for the implementation including overlap of communication and work. The communication overhead is significant, and hence the performance of the implementation not including overlap is reduced. Further, the measured performance and the estimate agree well. A maximum of about 210 GFLUPS is achieved for the implementation including overlap, which is 8 % more than the performance obtained for the hybrid parallelization approach. As Sec. 12.1 and Sec. 13.1.3 state, a performance gain of 13 % could be achieved based on the stream bandwidth. In Fig. 13.5(c) the weak scaling performance using two GPUs per node for the hybrid and the heterogeneous parallelization approach is compared, which is a common setup for GPU clusters.

(a) Weak Scaling, Hybrid, 3 GPUs, 140³  (b) Weak Scaling, Hybrid, 3 GPUs, 180³
(c) Weak Scaling, Hybrid, 2 GPUs, 180³  (d) Weak Scaling, Heterogeneous, 3 GPUs, 180³
(e) Strong Scaling, Hybrid, 3 GPUs, 180³  (f) Strong Scaling, Heterogeneous, 3 GPUs, 180³

Performance in [GFLUPS] over the number of GPGPUs; curves: Weak/Strong Scaling, Weak/Strong Scaling (Overlap), Kernel, Kernel (Overlap), Estimate.

Figure 13.5: Weak scaling and strong scaling performance of simulations using the hybrid and heterogeneous parallelization approach.

For this scenario the implementation based on the heterogeneous parallelization approach achieves a maximal performance gain of 14 % over the hybrid implementation. Based on the stream bandwidth a performance gain of 20 % is predicted. The strong scaling performance of the heterogeneous parallelization approach depicted in Fig. 13.5(f) is hampered by the same effects as the implementation for the hybrid parallelization. In addition, the kernel performance of the heterogeneous implementation is further decreased compared to the hybrid implementation. The reason for this is that the heterogeneous implementation effectively performs one further strong scaling step compared to the hybrid implementation: the domain simulated per compute node is not subdivided among three GPUs as for the hybrid approach, but among three GPUs and two CPU sockets. Additionally, the load balancing between CPU and GPU becomes more difficult for small domain sizes, as only entire xy-planes are distributed. Thus, a maximum of 4.6 GFLUPS is achieved for the heterogeneous implementation, which is about 20 % lower than the performance of the hybrid implementation. Further, the kernel performance of the implementation including overlapping of work and communication is affected most, leading to a lower performance of the overlap implementation than of the implementation without overlap for 125 nodes.

13.3 CPU Performance Results

In Fig. 13.6 the weak scaling performance on the LIMA cluster up to 343 nodes (4116 cores) for the implementations based on the pure-MPI and the hybrid parallelization approach is depicted. It can be seen that the maximal achieved performance is similar for both parallelization approaches: the maximal obtained performance based on the pure-MPI / hybrid parallelization approach for the domain size 102 is 25 / 25 GFLUPS, for the domain size 180 it is 30 / 28.5 GFLUPS, and for the domain size 348 it is 32 / 31 GFLUPS. Comparing the measured performance to the kernel performance results in a near ideal scaling, showing that the ratio of communication overhead and kernel run-time is lower for the CPU implementation than for the GPU implementation. For the CPU implementation the overlapping of work and communication does not show any benefit for the investigated domain sizes 102, 180, and 348, although the performance for the domain size 102 is nearly identical for both implementations. Hence, in weak scaling scenarios there is no need to overlap communication and work for the CPU implementation. The strong scaling performance of the pure-MPI approach is depicted in Fig. 13.6(c). For 216 nodes (2592 cores) a speed up of around 50 is reached, for 27 nodes (324 cores) a speed up of around 18 is achieved. Again, the strong scaling performance is affected by the reduction of the kernel performance as well as by the communication overhead.

(a) Weak Scaling, Pure-MPI  (b) Weak Scaling, Hybrid
(c) Strong Scaling, Pure-MPI  (d) Strong Scaling, Hybrid

Performance in [GFLUPS] over the number of compute nodes; curves for the domain sizes 102, 180, 348 (weak scaling) and 180, 360 (strong scaling), with and without overlap, plus kernel and linear scaling references.

Figure 13.6: Weak and strong scaling performance of simulations using the pure-MPI and the hybrid parallelization approach on LIMA.

From 125 nodes on, the influence of the lower kernel performance is higher. In agreement with the GPU implementation, the hybrid strong scaling performance depicted in Fig. 13.6(d) is lower than the pure-MPI performance. A maximum of around 3 GFLUPS (-39 %) is achieved on 216 compute nodes. Similar to the weak scaling performance, no benefit is obtained for the implementation including overlap.

Figure 13.7: Packed bed of hollow cylinders.

13.4 Performance Evaluation for Flows through a Porous Medium

As a showcase for the applicability of the presented implementation, the flow through a packed bed of hollow cylinders (see Fig. 13.7) is presented in this Section. The purpose of this investigation is to carry out simulations to optimize the structural composition of the granular medium, in order to minimize the pressure drop, to maximize the surface of the catalytically active granular media, and to achieve an optimal residence time of the fluid. As many simulations are required for the determination of the optimal layout, the time to solution of one simulation is highly relevant. Hence, the maximum performance the LBM can achieve, i.e. for the simplest LBM collision model available, is evaluated. In Fig. 13.8(a) the performance for a coarse discretization of the geometry with 100×100×1536 lattice cells and a one-dimensional domain decomposition is depicted. The pure-MPI parallelization approach including overlap has been used, as it has been the fastest implementation for strong scaling scenarios. In total, the geometry included around 9.2 M fluid cells, corresponding to a volume fraction of 60 % fluid. A maximum of around 8 GFLUPS is achieved for the simulation of the plain channel containing no obstacles, which corresponds to 520 time steps per second. The simulations including the geometry achieved a maximum of 8 GLUPS (giga lattice updates per second). This shows that the additionally required BC treatment has no major impact on performance. However, as all lattice cells are streamed and collided independent of their BC state, the resulting maximal number of updated fluid cells is reduced, i.e. 5 GFLUPS. The CPU implementation achieved a maximum of 2.3 GFLUPS on 128 compute nodes. A comparison of the parallel efficiency of the CPU and the GPU implementation is depicted in Fig. 13.8(b).
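The relation between the quoted GLUPS and GFLUPS values follows directly from the fluid volume fraction; a short check:

\[
\frac{9.2 \times 10^{6}}{100 \cdot 100 \cdot 1536} \approx 0.60,
\qquad
8~\text{GLUPS} \times 0.60 \approx 5~\text{GFLUPS}.
\]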

To investigate the performance for two- and three-dimensional domain decompositions, estimates predicted by the performance model for the communication are depicted in Fig. 13.8(c). It can be seen that better scaling can be obtained, but also that a three-dimensional domain decomposition does not necessarily achieve higher performance. For the two-dimensional domain decomposition the domain size in x-dimension is not reduced during the strong scaling, which results in a higher kernel performance compared to the three-dimensional decomposition. The performance for a finer resolution of 256 × 256 × 3600 lattice cells is depicted in Fig. 13.8(d).
To avoid the large difference between LUPS and FLUPS, list-based data structures would be better suited, as they can achieve nearly the same number of FLUPS for porous media as for a plain channel. For CPUs these data structures are used by the ILBDC code [79], and for GPUs they are used in the MUPHY software framework [129]. For WaLBerla no list-based data structures have been implemented, as simulation tasks that dynamically change the geometry during run-time are also supported, which would require a rebuild of the list data structures.

Figure 13.8: Strong scaling performance of porous media simulations using the pure-MPI parallelization for two domain sizes: 100 × 100 × 1536 and 256 × 256 × 3600 lattice cells. Performance results for CPUs and GPUs are depicted for 1D, 2D, and 3D domain decompositions; additionally, the parallel efficiency for CPUs and GPUs for the coarser resolution is given. Panels: (a) performance in G(F)LUPS, pure-MPI, 1D, 100² × 1536; (b) parallel efficiency, pure-MPI, 1D, 100² × 1536; (c) pure-MPI, 1D + 2D + 3D, 100² × 1536; (d) pure-MPI, 1D + 2D, 256² × 3600.


14 Conclusion

In this Part the software design concepts for the parallelization of WaLBerla have been introduced. These concepts support pure-MPI, hybrid, and heterogeneous parallelization approaches for massively parallel simulations on heterogeneous CPU-GPU clusters. Further, it is possible to overlap work and communication. It has been shown by a systematic performance analysis that efficient and scalable as well as maintainable and usable implementations are possible in WaLBerla. Further, the performance results on the LIMA and Tsubame 2.0 clusters show that

• for weak scaling experiments good scalability can be achieved for multi-GPU simulations on Tsubame 2.0 with the presented software design concepts
• for strong scaling experiments not only the communication overhead has to be minimized, but also the kernel performance for small domain sizes has to be optimized
• the communication overhead is significant for GPUs, while for CPUs the communication overhead is only significant for strong scaling experiments and small domain sizes
• one GPU can achieve a speed up of around 2.5 over one CPU compute node
• the heterogeneous parallelization approach using one GPU and one CPU compute node can achieve 24 % higher performance than GPU-only simulations
• for simulations on a single compute node and for weak scaling experiments the heterogeneous parallelization approach is the fastest of the three parallelization approaches
• in strong scaling experiments the pure-MPI parallelization approach is fastest, as the hybrid and heterogeneous parallelization approaches have the additional overhead of thread synchronization

Furthermore, a performance model for parallel LBM simulations has been introduced. It consists of two parts, the modeling of the kernel performance and the estimation of the communication overhead. To determine the kernel performance either the balance model can be used or the kernel performance can be measured. The model for the communication overhead relies on synthetic benchmarks for the measurement of the bandwidth. Using the kernel performance as scaling baseline, the single compute node performance as well as the weak and strong scaling performance could be accurately estimated. Moreover, with the help of the estimates bottlenecks have been found, a deeper understanding of the performance results has been gained, and it could be shown that WaLBerla does not introduce any significant performance overheads. Finally, the developed parallelization has been successfully applied to predict the performance of an industrially relevant porous medium geometry.

Part IV

Complex Flow Simulations - Applying WaLBerla


Part II of this thesis has introduced the design goals of WaLBerla. In summary, WaLBerla is a by design maintainable software framework that is suitable for multi-physics simulation tasks and that allows efficient, scalable, and hardware-aware simulations. In Part III it has been shown that WaLBerla provides a scalable parallelization and that efficient implementations are possible. Consequently, this Part gives an outlook on the applicability of WaLBerla by means of an overview of the simulation tasks implemented in WaLBerla.

Particulate Flows One of the simulation tasks integrated in WaLBerla is the simulation of particulate flows. It is based on a direct numerical simulation approach for volumetric rigid bodies that couples the fluid simulation to a multi-body dynamics simulation for the treatment of the particles. This thesis has extended the method for the use on heterogeneous CPU-GPU clusters. The method is validated by drag force calculations and the sedimentation of spheres. Two possible application scenarios are highlighted in showcases: the simulation of fluidizations and nano fluids. For both showcases the single node performance is investigated, and for the fluidization experiments over 600 GPUs and around 10^6 rigid objects have been used.

Simulation Tasks Implemented in WaLBerla A brief overview of further simulation tasks implemented in WaLBerla is given. In particular, simulation approaches for free-surface flows, electron beam melting processes, self-propelled particle complexes, and electro-osmotic flows are introduced. Here, the intention is to show that WaLBerla is being applied and that thus one of the design goals, being usable and maintainable, has been achieved.

Structure of this Part In Chap. 15 the method for the simulation of particulate flows is introduced, followed by its validation and application in Chap. 16. Chap. 17 provides an overview of further simulation tasks.


15 Direct Numerical Simulation of Rigid Bodies in Particulate Flows

One of the simulation tasks implemented in WaLBerla is the simulation of particulate flows, i.e. systems of particles immersed in a moving fluid. Processes involving particulate flows like sedimentation or erosion can be found in our daily life, but they are also important for many industrial applications, such as segregation, fluidization, aggregation, or agglomeration processes. Amongst others, the investigation of the involved complex transport and stabilization processes is of interest, as they are still not fully understood.
In the simulation of particulate flows the particles are often represented by point masses. This allows for efficient parallelization and the simulation of large particle ensembles. However, no geometric information for collisions and hydrodynamic interactions is available; usually, these effects are treated by the introduction of interaction potentials. In this thesis a method for the direct numerical simulation of volumetric rigid bodies has been applied for the simulation of particulate flows. All particles are fully resolved, so that the geometrical information is preserved, and, due to the rigid body assumption [150], they are considered non-deformable. To simulate the dynamics of the rigid bodies the LB fluid simulation is coupled to a multi-body dynamics simulation, the PE physics engine [77]. The PE physics engine supports massively parallel distributed memory simulations of rigid bodies that can be composed of primitives, e.g. spheres, capsules, and boxes.
Computationally, the direct numerical simulation of particulate flows is challenging, as the required fine resolution of the particles results in high computational demands for the overall scheme. Thus, large-scale simulations are only feasible on parallel computers [151]. As GPUs offer high computational performance, the algorithm for the simulation of particulate flows has been extended in this thesis to support the massively thread parallel GPU architecture.

Structure of this Chapter An overview of the method for the simulation of particulate flows is provided in Sec. 15.1, followed by a detailed description of two of the involved Sweeps: the rigid body mapping (Sec. 15.2) and the force calculation (Sec. 15.3). For each Sweep the GPU implementation details are given.

15.1 The Particulate Flows Algorithm

The method for particulate flows presented in this thesis is a direct numerical simulation approach for volumetric rigid bodies that interact with the fluid via a two-way coupling, fully resolving the hydrodynamic forces.

Extension of the LBM for Particulate Flows The fluid flow is modeled via the LBM for incompressible flows, as variations in the density due to gravity can be neglected in the targeted simulation tasks. The hydrodynamic force exerted by the fluid on the rigid bodies is calculated via the momentum exchange method [17]. For the interaction of rigid bodies and fluid the boundary conditions for moving walls are applied (Sec. 2.1), which block out the fluid. In order to obtain accurate results the rigid bodies have to be discretized by several lattice cells; usually, spherical bodies are discretized by 6 to 12 lattice cells. An alternative approach is the immersed BC method [152], which models the rigid bodies as surface points and simulates the fluid cells within the rigid bodies.

Treatment of the Rigid Body Dynamics For the movement, the collisions, and the contact dynamics WaLBerla has explicit coupling mechanisms to a specialized multi-body dynamics simulator, the PE (physics engine), as described in [77] [143]. The PE is a complex software framework in its own right that models the dynamics of large ensembles of rigid objects by integrating Newton's law of motion. Multi-body dynamics simulators are common for robotics, virtual reality, and real-time applications, but also for the simulation of large-scale granular media.

Figure 15.1: Time loop for the particulate flow simulation task.

For coupling WaLBerla and the PE the time steps are synchronized and mechanisms are provided to transfer data from WaLBerla to the PE and vice versa. This can be used to implement an explicit fluid-structure coupling. In detail, momenta from the Eulerian WaLBerla grids can be applied to the rigid objects in the Lagrangian PE framework, while geometry and velocity information from the Lagrangian side within the PE can be mapped back onto the Eulerian grids in WaLBerla. All particle related data, i.e. positions, velocities, forces, and torques, are stored in the PE. Further, with the PE it is possible to use not only spheres, but also boxes, capsules, and objects composed of these primitives. The implementation for particulate flows on CPUs supports this, while the implementation for GPUs is restricted to spherical objects.

The Parallel Fast Frictional Dynamics Algorithm The PE uses the PFFD (parallel fast frictional dynamics) [143] algorithm that supports massively parallel distributed memory simulations of up to millions or even billions of rigid bodies [69]. Its good scalability stems from the strictly local collision response of the PFFD algorithm. Only collisions of rigid bodies in direct contact are taken into account for the collision response, resulting in a nearest neighbor communication pattern. For the PFFD the simulation domain is decomposed into several sub-domains, which correspond to the sub-domains of WaLBerla in the case of particulate flow simulations. Rigid bodies are only stored on the processes in which the bodies are contained. Note that as rigid bodies are volumetric, they can be present in several sub-domains. Structurally, the process containing the center of mass of the rigid body is responsible for the update and the collision detection. One PE simulation step involves the update of the rigid body positions and velocities, the detection and treatment of collisions, as well as four communication phases for parallel simulations.

The Time Loop for the Particulate Flow Simulation Task The algorithm for particulate flows depicted in Fig. 15.1 can be broken down into four Sweeps. At first, the rigid bodies have to be discretized and the resulting BCs have to be mapped. This is required by the following fluid flow computation Sweep to update the fluid cells. For the flow computation the LBM kernel described in Sec. 4.2 is utilized. For the force calculation three force components are taken into account: the hydrodynamic force, buoyancy, and gravity. The calculation of the hydrodynamic forces only depends on the PDFs and the velocities of the rigid objects. For the update of the rigid body positions in the rigid body dynamics Sweep the PE only requires the acting forces and the torque of the current time step as input. During this Sweep either one or several PE time steps can be triggered, depending on the requirements of the simulation task. Further details on the rigid body mapping Sweep and the force calculation Sweep are introduced in Sec. 15.2 and Sec. 15.3.
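The resulting sequence per time step can be summarized in the following sketch; all function names are hypothetical placeholders for the corresponding Sweeps and do not correspond to the actual WaLBerla interfaces.

    // Illustrative sketch of the time loop of Fig. 15.1 (hypothetical names).
    void rigidBodyMappingSweep();   // Sec. 15.2: map rigid bodies to MO / NEAR_OBST BCs
    void fluidFlowSweep();          // LBM stream/collide kernel incl. BC handling
    void communicatePDFs();         // the only communication of the hydrodynamic coupling
    void forceCalculationSweep();   // Sec. 15.3: hydrodynamic force, buoyancy, gravity
    void rigidBodyDynamicsSweep();  // one or several PE time steps, executed on the CPU

    void timeLoop(int timeSteps)
    {
        for (int t = 0; t < timeSteps; ++t) {
            rigidBodyMappingSweep();
            fluidFlowSweep();
            communicatePDFs();
            forceCalculationSweep();
            rigidBodyDynamicsSweep();
        }
    }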

Computational Remarks Computationally, only the rigid body mapping, the fluid flow computation, and the force calculation Sweeps are executed on the GPU. The particle dynamics are still executed on the CPU. Hence, part of the particle data has to be transferred from the host to the device at the beginning of the rigid body mapping Sweep, and the forces and the torque of the rigid bodies have to be transferred from the device to the host before the rigid body dynamics Sweep. This is not a significant drawback for the performance, as it results in relatively small amounts of transferred data: in total, for simulations in double precision, 80 bytes to the device and 48 bytes from the device have to be copied per sphere. At the end of the fluid flow computation Sweep the PDFs are communicated. This is the only communication involved in the hydrodynamic interaction; the mapping and the force calculation Sweeps are completely local. Of course, the rigid body dynamics Sweep requires further communication, which is executed by the PE and is transparent to the client.
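The transfer volumes follow directly from the rigid body data needed on the device (position, radius, velocity, and angular velocity) and the results needed on the host (force and torque). The following sketch only illustrates this bookkeeping; the struct and function names are hypothetical and not part of WaLBerla.

    #include <cuda_runtime.h>

    // Hypothetical per-sphere data layout: in double precision this amounts to
    // 80 bytes per sphere to the device and 48 bytes back, as stated above.
    struct SphereInput {
        double position[3];         // 24 bytes
        double radius;              //  8 bytes
        double velocity[3];         // 24 bytes
        double angularVelocity[3];  // 24 bytes
    };
    struct SphereOutput {
        double force[3];            // 24 bytes
        double torque[3];           // 24 bytes
    };

    // Copy the rigid body state to the device before the mapping Sweep and
    // fetch the accumulated forces/torques before the rigid body dynamics Sweep.
    void transferSphereData(SphereInput* d_in, const SphereInput* h_in,
                            SphereOutput* h_out, const SphereOutput* d_out,
                            int numSpheres)
    {
        cudaMemcpy(d_in, h_in, numSpheres * sizeof(SphereInput), cudaMemcpyHostToDevice);
        // ... mapping, LBM, and force calculation kernels are executed here ...
        cudaMemcpy(h_out, d_out, numSpheres * sizeof(SphereOutput), cudaMemcpyDeviceToHost);
    }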

Method Origin and Alternative Simulation Approaches The presented method for particulate flows is based on [153] [154] [155]. The accuracy of the algorithm has been improved and applied to massively parallel simulations in [151] [69]. This thesis has extended the algorithm for use on GPUs, i.e. it has been optimized for massively parallel thread environments.
There exist various numerical methods to simulate particulate flows. Amongst others, Stokesian dynamics [156] [157], Euler-Lagrangian methods [158], distributed Lagrange multiplier methods [159], and discrete element methods [160] have been applied for the simulation of particulate flows. Approaches based on the LBM have been presented by Ladd [161] [162], Aidun et al. [163], and Qi [164].

15.2 Mapping the Rigid Bodies

In this step the rigid bodies are mapped to boundary conditions. The flag that indicates the presence of a rigid body is the BC MO. All lattice cells with a midpoint inside a rigid body are marked as MO. Additionally, all lattice cells adjacent to a MO are marked as NEAR_OBST cells, if they themselves are not a MO or one of the GHOST cells of a Block. NEAR_OBST cells are required for the BC handling and the hydrodynamic force calculation to avoid checks for index bounds.

15.2.1 Algorithmic Details

The mapping algorithm requires the following simulation data objects, i.e. 3D fields or grids storing a scalar or vectorial quantity per lattice cell: two scalar fields (the body-cell relation fields), the velocity field, the post-collision PDFs, and the flag field containing the BC information. In particular, the body-cell relations store one int per lattice cell in which the ID of the rigid body is stored. The boundary condition handling and the force calculation Sweep use this relation to determine the velocity of the boundary and to determine to which rigid body a certain lattice cell exerts momentum. Two body-cell relations are required to avoid synchronization overhead in threaded environments. As the rigid bodies are volumetric and the PFFD algorithm allows slight overlaps, race conditions can occur if several threads processing different rigid bodies access the same memory address. In detail, the mapping algorithm is divided into five steps. Each step traverses all rigid bodies and, for each rigid body, all cells belonging to an extended axis-aligned bounding box.

• Step 0: The original state after the mapping at time step t_{n-1}: the BCs MO and NEAR_OBST are mapped to the flag field and the body-cell relations are set correctly (Fig. 15.2(a)).
• Step 1: The body-cell relations are swapped, so that the old state is still available after this step. All MO and NEAR_OBST BCs are cleared from the flag field and set to fluid (Fig. 15.2(b)).
• Step 2: All lattice cells with a midpoint within a rigid body are marked as MO BC cells. As the PE engine allows slight overlaps of rigid bodies, a cell midpoint can be contained within two or more particles. In this case always the body with the highest ID is mapped (Fig. 15.2(c)).
• Step 3: The velocities of the rigid bodies are written to the velocity field. Translation as well as rotation are taken into account (Fig. 15.2(d)).
• Step 4: Due to the dynamics of the rigid bodies the mapping changes from time step to time step. New obstacle cells are simply blocked out, but new fluid cells have to be reconstructed. In this thesis the easiest approximation is used: the macroscopic velocity of the new fluid cell is set to the velocity of the surface of the rigid body, while the macroscopic density is set to the reference density ρ0. More advanced approaches for the estimation of the density can be applied [165]. The detection of these cells is based on the body-cell relation of time step t_{n-1} and the flag field of t_n. If a lattice cell has an entry in the body-cell relation, but is no longer a MO cell, it has to be reconstructed (Fig. 15.2(e)).
• Step 5: The NEAR_OBST BCs are updated, resulting in the final state at time step t_n (Fig. 15.2(f)).

Figure 15.2: The work steps of the rigid body mapping Sweep. Panels: (a) state after mapping in time step t_{n-1}; (b) swap body-cell relation and clear flags; (c) map rigid bodies; (d) set rigid body velocities; (e) treat left over cells; (f) final state t_n. The depicted cells distinguish local particle IDs (1, 2), moving obstacle cells, and near obstacle cells.

Four update steps are required because of data dependencies between the rigid bodies due to the possible overlapping of bounding boxes, as well as dependencies between threads. In detail, a serial implementation could be implemented in two steps: here, steps 1, 2, and 3 can be combined, as it can be detected with the help of the rigid body update sequence which bodies have already been mapped and which have not. For an implementation updating the rigid bodies in parallel this is not possible, so that these steps have to be split. Even for a serial implementation, first all rigid bodies have to be remapped in order to determine the left over lattice cells.

Figure 15.3: Two examples for cells updated by one thread within the extended bounding box of a rigid body.

15.2.2 GPU Implementation Details

Each of the five mapping steps has been implemented in its own CUDA kernel. Before these kernels are executed, the rigid body data, i.e. position, radius, velocity, and angular velocity, is transferred from host to device. For the specification of the thread blocks the following strategy has been used: each rigid body is represented by one 2D thread block. The number of threads per thread block corresponds to the number of lattice cells in the x- and y-dimension of the extended bounding box of the rigid body. Hence, each thread in a thread block updates all lattice cells of one row in z-dimension within the extended bounding box, as depicted in Fig. 15.3.
Additionally, the avoidance of race conditions is of importance. As bounding boxes and rigid bodies can overlap, multiple threads in different thread blocks can possibly access identical memory addresses. For mapping step 1 there is no need for synchronization, as all threads write the same values. In mapping step 2 the flags can be set to MO without special treatment, as again only identical values are written; the body-cell relation, however, has to be set by atomic operations in order to obtain a deterministic algorithm. For mapping steps 3 and 4 no race conditions can occur, as the threads only work on data corresponding to a specific rigid body ID. Mapping step 5 is similar to step 1, only identical flag values are written.
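A condensed sketch of how such a mapping kernel can be organized is given below. It only covers the marking of MO cells (step 2) for spherical bodies and is not the actual WaLBerla kernel: the struct, the flag value, the cell-midpoint convention, and the use of atomicMax to realize the "highest ID wins" rule are illustrative assumptions. The kernel is assumed to be launched with one thread block per rigid body, sized to the largest occurring bounding box, so threads outside the current body's box return early.

    // Hypothetical sketch of mapping step 2 for spheres (not the WaLBerla kernel).
    struct DeviceSphere { double x, y, z, r; };

    __global__ void mapSpheresToMO(const DeviceSphere* spheres,
                                   const int3* bboxMin, const int3* bboxExtent,
                                   unsigned char* flags, int* bodyCellRelation,
                                   int3 fieldSize, unsigned char moFlag)
    {
        const int body       = blockIdx.x;     // one thread block per rigid body
        const DeviceSphere s = spheres[body];
        const int3 lo        = bboxMin[body];
        const int3 ext       = bboxExtent[body];

        const int i = threadIdx.x;             // x offset inside the bounding box
        const int j = threadIdx.y;             // y offset inside the bounding box
        if (i >= ext.x || j >= ext.y) return;  // block is sized for the largest box

        for (int k = 0; k < ext.z; ++k) {      // each thread walks one z-row
            const int cx = lo.x + i, cy = lo.y + j, cz = lo.z + k;
            // Midpoint test (assuming cell midpoints at integer coordinates + 0.5).
            const double dx = cx + 0.5 - s.x;
            const double dy = cy + 0.5 - s.y;
            const double dz = cz + 0.5 - s.z;
            if (dx * dx + dy * dy + dz * dz > s.r * s.r) continue;

            const int cell = (cz * fieldSize.y + cy) * fieldSize.x + cx;
            flags[cell] = moFlag;                      // all writers store the same value
            atomicMax(&bodyCellRelation[cell], body);  // highest body ID wins on overlap
        }
    }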

15.3 Force Computation for the Rigid Bodies

During the force calculation Sweep the forces acting on the rigid bodies have to be determined. Here, three different forces are taken into account: gravity, buoyancy, and the hydrodynamic force. The gravitational force \vec{F}_g and the buoyancy force \vec{F}_b can be determined per particle by Eq. 15.1 and Eq. 15.2. The non-dimensionalization for forces in the LBM is given in Eq. 15.3.

\vec{F}_g = m_p \vec{g} = \rho_p V_p \vec{g}    (15.1)

\vec{F}_b = \rho_0 V_p \vec{g}    (15.2)

\vec{F} = \vec{F}^+ \frac{(\delta t^+)^2}{\rho_0^+ (\delta x^+)^4}    (15.3)

In contrast, the evaluation of the hydrodynamic forces requires the transfer of Eulerian data to the Lagrangian particles.

Figure 15.4: The momentum exchange method (fluid nodes x_f and rigid body nodes x_b).

15.3.1 Algorithmic Details

During the stream step the fluid acts via the hydrodynamic forces on the rigid bodies. For the LBM the hydrodynamic force can be determined with the momentum exchange method [17]. In particular, the fluid molecules collide with the surface of the rigid bodies and experience a change in momentum. The resulting hydrodynamic force F_h is the sum over the momentum exchanges in all collisions. Computationally, this results in the summation of the momentum change experienced by each post-collision PDF f̃_α that will collide in the next stream step with a certain rigid body. Fig. 15.4 depicts all PDFs in fluid nodes x_f that will collide with rigid body nodes x_b. The momentum exchange for a single PDF is equal to e_α f̃_α − e_inv(α) f_inv(α). To which particle the force resulting from a single PDF has to be added is determined by the body-cell relation. The final equation for the hydrodynamic force on one rigid body is given in Eq. 15.4. All forces are handed over to the PE engine for the subsequent rigid body dynamics Sweep. Here, f_inv(α) is determined by the application of the boundary condition handling described in Sec. 2.1.

\vec{F}_h = \sum_{\vec{x}_b} \sum_{\alpha=1}^{19} \vec{e}_\alpha \left[ 2 \tilde{f}_\alpha(\vec{x}_f, t) + 6 w_\alpha \rho_w \, \vec{e}_{inv(\alpha)} \cdot \vec{u}_w \right] \frac{\delta x}{\delta t}    (15.4)

169 15.3.2 GPU Implementation Details

For the setup of the thread blocks the same approach as for the rigid body mapping Sweep has been chosen: again, one thread block calculates the hydrodynamic force for one rigid body. Computationally, there is no danger of race conditions for the force calculation between different rigid objects or thread blocks, as the PDFs are only read and each particle has its own force vector. However, all threads within a thread block have to sum up their contributions to the same force and torque vector, which will result in race conditions if no special care is taken. The solution is a force reduction at the end of the force calculation Sweep. Each thread in a thread block computes the contribution to the force and torque of its assigned row in z-dimension and stores it in shared memory. In particular, for all lattice cells having a NEAR_OBST flag the surrounding cells are tested for MO flags. If a MO cell is found, the momentum exchange is computed and added to the thread-local vectors in the shared memory. As soon as all threads have finished their calculation, a parallel reduction is carried out, stepwise summing up subsets of the force and torque contributions of each thread until the final force and torque vector is determined. Finally, the force and torque values are transferred from the device to the host and handed over to the PE engine.
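The reduction part of this Sweep can be sketched as follows. The kernel is only an illustration of the shared-memory tree reduction described above and not the actual WaLBerla implementation; it assumes one thread block per rigid body, exactly TPB threads per block with TPB a power of two, and it omits the momentum exchange details and the analogous torque reduction for brevity.

    // Hypothetical sketch of the per-body force reduction (torque analogous).
    template <int TPB>
    __global__ void reduceBodyForce(double3* bodyForce /* , fields omitted */)
    {
        __shared__ double3 sForce[TPB];
        const int t = threadIdx.y * blockDim.x + threadIdx.x;   // flat thread index

        double3 f = make_double3(0.0, 0.0, 0.0);
        // ... walk the assigned z-row of the extended bounding box: for every
        //     NEAR_OBST cell check the neighboring directions for MO cells and
        //     add the momentum exchange contribution of Eq. 15.4 to f ...
        sForce[t] = f;
        __syncthreads();

        // Tree reduction in shared memory.
        for (int stride = TPB / 2; stride > 0; stride /= 2) {
            if (t < stride) {
                sForce[t].x += sForce[t + stride].x;
                sForce[t].y += sForce[t + stride].y;
                sForce[t].z += sForce[t + stride].z;
            }
            __syncthreads();
        }
        if (t == 0) bodyForce[blockIdx.x] = sForce[0];   // one rigid body per block
    }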

16 Particulate Flow Validation and Showcases

The method for particulate flows implemented on GPUs is validated and applied in this Chapter. For the validation the drag force on a fixed isolated sphere is calculated, and in order to evaluate the accuracy in dynamic scenarios the terminal velocity of a falling sphere is investigated. Additionally, two showcases are presented: simulations of fluidizations and nanofluids. The single compute node performance is evaluated for both scenarios. In addition, for the fluidizations the parallel performance on over 600 GPUs and for around 10^6 rigid bodies is investigated. For the validation of the nanofluids on GPUs the short-time self diffusion of isolated spheres is investigated.

Structure of this Chapter A basic validation of the method is discussed in Sec. 16.1 with the help of drag force calculations and simulations of falling spheres. In Sec. 16.2 the performance of fluidizations is investigated and in Sec. 16.3 a nanofluids showcase is presented. Finally, the performance of the method on GPUs is investigated for serial, heterogeneous, and parallel simulations on over 600 GPUs.

16.1 Validation

Experiments for the validation of the introduced method for the simulation of particulate flows have already been carried out in [151] [155] [157]. For the validation of the GPU implementation first the drag force on a fixed sphere in a Stokes flow has been evaluated; as a dynamic scenario a falling sphere immersed in fluid has been investigated. In all simulations the dimensionless viscosity ν has been equal to 1/6, which corresponds to τ = 1. The accuracy of the SRT model varies for different values of τ. For particulate flow simulations, a value of around 1 best conforms to the results obtained with the MRT model, as shown in Götz [151].
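For reference, the stated values are consistent with the common BGK relation between the lattice viscosity and the relaxation time (in lattice units with c_s^2 = 1/3):

\nu = c_s^2 \left( \tau - \frac{1}{2} \right) = \frac{1}{3} \left( 1 - \frac{1}{2} \right) = \frac{1}{6}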

Figure 16.1: Drag force computations in creeping flows around a sphere. Panels: (a) drag force deviation in % over the ratio of cubic domain size to sphere radius for the radii r = 3, 4, and 5; (b) creeping flow around a sphere.

16.1.1 Validation of the Drag Force

For Stokes or creeping flows there exists an analytical solution for the drag force exerted on a sphere

\vec{F}_d = 3 \pi \eta d_s \vec{u}_{rel},    (16.1)

where η is the dynamic viscosity, d_s the diameter of the sphere, and u_rel the approach velocity. Eq. 16.1 assumes a sphere in an infinite domain. As this is not the case for LBM simulations, there is a deviation from the analytical solution that depends on the domain size. Additionally, an error is introduced by the discretization of the spheres, especially by the staircase approximation: the staircase approximation assumes the boundary to be located halfway between lattice nodes, which does not hold for static and dynamic spherical particles. This error can be reduced by BCs taking the exact position of the boundary into account [11] [17], as has also been shown in [151].
In Fig. 16.1 the deviation of the simulated from the analytical drag force is depicted. The Re (Reynolds number) for all simulations has been less than 0.1 and no-slip boundary conditions have been used at the boundary of the simulation domain. In detail, three different radii have been investigated for various cubic domain sizes. For an identical simulation domain to radius ratio of 100:1, the simulated drag force deviates from the analytical solution by about 9.3 % for the radius of 3, 5.0 % for the radius of 4, and 5.7 % for the radius of 5. The deviation for a ratio of 200:1 is about 4.39 % for the radius of 5.
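A small helper along the following lines suffices to evaluate such deviations; it merely restates Eq. 16.1 and the relative error, and the names are illustrative.

    #include <cmath>

    // Analytical Stokes drag, Eq. 16.1: F_d = 3 * pi * eta * d_s * u_rel.
    double stokesDrag(double eta, double sphereDiameter, double approachVelocity)
    {
        const double pi = 3.14159265358979323846;
        return 3.0 * pi * eta * sphereDiameter * approachVelocity;
    }

    // Relative deviation of a simulated drag force from the analytical value in percent.
    double dragDeviationPercent(double simulatedDrag, double analyticalDrag)
    {
        return 100.0 * std::fabs(simulatedDrag - analyticalDrag) / analyticalDrag;
    }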

172 0.0035 0.003 0.0025 0.002 0.0015

r = 3, D = 300^3 0.001

|Terminal Velocity| r = 4, D = 400^3 r = 5, D = 500^3 0.0005 Analytical Solution 0 5 10 15 20 25 30 35 40 Time Steps in [k]

Figure 16.2: Terminal velocity of falling spheres.

16.1.2 Validation of the Terminal Velocity

In addition to an analytical solution for the drag force, Stokes’ law also provides a solution for the settling or terminal velocity of a falling sphere

\vec{u}_s = \frac{2}{9} \frac{(\rho_p - \rho_0)}{\eta} \vec{g} \, r_s^2,    (16.2)

where r_s is the radius of the sphere. In Fig. 16.2 the terminal velocities of spheres with different radii in a simulation domain with a domain to radius ratio of 100:1 are depicted. The terminal velocity of the sphere with radius 3 shows the strongest fluctuations: due to the coarse approximation the changes in the discrete volume of the sphere are most pronounced, resulting in deviations in the drag force. By correcting the drag force, F_d^* = F_d \frac{V_s}{V_d}, with the ratio of the sphere volume V_s and the discrete sphere volume V_d, the fluctuations can be reduced, as shown in Götz [151]. The simulated terminal velocity differs from the analytical solution by about 6.0 % for the sphere radius of 3, 5.32 % for the radius of 4, and 5.34 % for the radius of 5. This agrees well with the deviation of the drag force from the analytical solution.

16.2 Showcase: Fluidization

In fluidization processes fluid is pumped through a granular medium in the opposite direction of gravity. The granular medium is confined in a containment on a porous plate, through which the fluid is pumped. Depending on the fluid, the inflow velocity, and the granular medium, different fluidization regimes can be observed. For example, fixed beds are at rest; only slight particle vibrations result from the fluid flow. If the inflow velocity is increased from this state, at some point the exerted hydrodynamic drag is equal to gravity. This is the minimum velocity required for the fluidization of the granular medium to a fluidized bed. The granular medium will start to act like a fluid, e.g. objects with lower density float and the surface of the granular medium is perpendicular to gravity. Further increasing the inflow velocity will result in a reduced density of the granular medium and dynamic behavior, similar to the behavior of boiling water, until the granular medium is carried away with the fluid. In Fig. 16.3 fluidizations with different inflow velocities are depicted.
Fluidized beds offer good mixing properties, so that heat or chemical species can be homogeneously distributed or dissipated. Additionally, they provide a larger surface-fluid interface than fixed beds. The fields of application for fluidized beds are the mixing or transport of granular media as well as chemical reactions where the granular medium either reacts with the fluid or acts as a catalyst. For this thesis the focus has been on the implementation of fluidizations on GPUs as well as on the heterogeneous and multi-GPU performance. The validation of the fluidization simulations has been covered in Götz [151]. Further information on fluidization scenarios can be found in [166].

Figure 16.3: Fluidization experiments using different inflow velocities: (a) |u_in| = 0, (b) |u_in| = 0.001, (c) |u_in| = 0.0013, (d) |u_in| = 0.0016, (e) |u_in| = 0.002. The particle radii have been randomly chosen between 3-5; the density of the dark particles has been 6·ρ_fluid and for the others 8·ρ_fluid. In total, 32000 particles have been used in each experiment. The surface height for the experiments (a)-(c) has converged, but for (d) and (e) particles are still dragged out of the simulation domain.

Simulation Setup For the simulation of fluidizations first a fixed bed has to be created. Therefore, spheres have been placed on a Cartesian grid with random initial velocities and angular momenta. Next, the granular medium has been allowed to settle without fluid interaction. The direct contact of the spheres with the inflow plane has been prevented by a plane that is permeable for the fluid, in order to avoid plugging of the inflow. For all simulation setups the radii of the spheres have been randomly chosen between 3-5 with either a density of 6ρ0 or 8ρ0. Further, the given volume fractions Φ correspond to the percentage of solids within the complete simulation domain, and the flow direction has always been the positive y-dimension.

16.2.1 Single GPU Performance

For the evaluation of the single GPU performance, fluidization simulations with varying rigid body numbers and domain sizes have been executed. In particular, the volume fractions of 2.5 %, 10 %, and 20 % rigid bodies have been investigated for the cubic domain sizes 100³, 200³, and 300³. The results are presented in Tab. 16.1. Compared to the pure-LBM performance, a performance reduction of around 30 % / 20 % for SP / DP and a volume fraction of 2.5 % can be observed. For the volume fraction of 10 % about 50 % / 40 % are lost, and for 20 % around 70 % / 60 %.

Figure 16.4: Particulate flow heat map for all involved work steps. A domain size of 200³ lattice cells has been used for all simulations. Depicted is the run-time of the work steps in percent of the total run-time, dependent on the volume fraction and in DP.

For a detailed investigation of the run-times of the individual work steps a heat map is depicted in Fig. 16.4. Of particular interest is the run-time of the rigid body dynamics Sweep, as it is computed on the CPU. If the CPU portion of the overall run-time were high, the speed up achievable through GPUs would be limited according to Amdahl's law [35]. However, the rigid body dynamics Sweep requires less than 5 % of the overall run-time even for a volume fraction of 20 %. For an expected speed up of 2-5 this results in no major performance reduction. The most expensive steps in terms of performance are the force calculation Sweep and the treatment of the BCs. Both work steps traverse the extended bounding boxes of the rigid bodies, but in contrast to the rigid body mapping Sweep only a fraction of the threads performs actual work, namely in cells that are set to NEAR_OBST. Furthermore, both steps have to control the work flow to evaluate for which lattice directions force has to be added or BCs have to be treated. Additionally, the force calculation requires a parallel reduction for the force and torque vector per rigid body.
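A rough Amdahl estimate with these numbers, assuming the roughly 5 % CPU share of the rigid body dynamics Sweep stays constant, makes this explicit:

S_{total} = \frac{1}{s + \frac{1 - s}{S_{GPU}}} = \frac{1}{0.05 + \frac{0.95}{5}} \approx 4.2

That is, even at the upper end of the expected speed up range (S_GPU = 5) the CPU-resident part reduces the overall speed up only from 5 to about 4.2, and for S_GPU = 2 the loss is negligible (S_total ≈ 1.9).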

16.2.2 Heterogeneous Performance of Particulate Flow Simulations

The heterogeneous-hybrid parallelization approach introduced in Sec. 11.6 has been applied to the simulation of particulate flows. The intention has been to show that heterogeneous simulations are also possible for complex simulation tasks. The implementation for particulate flows introduced in the previous Chapter is well suited for heterogeneous simulations, as the GPU-accelerated part of the algorithm only requires the communication of the PDFs. Hence, the kernel or serial performance is reduced compared to pure-LBM simulations, but the communication effort is identical.

N³     Φ [%]   Spheres   SP+ECC [MLUPS]   DP+ECC [MLUPS]
100    0       0         230              137
       2.5     100       176              112
       10      400       116              79
       20      800       72               55
200    0       0         285              162
       2.5     800       182              133
       10      3200      128              91
       20      6400      83               61
300    0       0         298              -
       2.5     2700      225              -
       10      10800     127              -
       20      21600     79               -

Table 16.1: Single GPU performance of particulate flows on an NVIDIA Tesla M2070 (Sec. 3.2) and an Intel Xeon 5650 (Sec. 3.1).

CPU Implementation Computationally, the CPU implementation is based on a hybrid parallelization approach via OpenMP using the same algorithm as the GPU implementation. Here, all for-loops over the bounding boxes of the rigid bodies are parallelized with the help of OpenMP pragmas; additionally, the LBM and BC for-loops have been parallelized. In contrast to the GPU implementation the rigid bodies are traversed sequentially. No other special optimization techniques have been applied to the CPU implementation.

Figure 16.5: Hybrid heterogeneous setup.

Figure 16.6: Domain decomposition for particulate flows: (a) heterogeneous-hybrid setup, (b) parallel setup.

Domain Decomposition The conceptual domain decomposition is depicted in Fig. 16.5. One process is executed for the GPU and one for each CPU socket, which spawns five OpenMP threads. For the load balancing six Blocks have been allocated on the GPU and one on each CPU, indicating a GPU to CPU performance ratio of 3:1. In order to optimize the data locality the PE and the WaLBerla simulation sub-domains have been identical, and for each process one instance of the PE has been allocated. The resulting process distribution can be seen in Fig. 16.6(a).

Performance Results In Tab. 16.2 the performance of heterogeneous simulations with a volume fraction of 2.5 % and 10 % is shown. For each case the CPU and GPU performance for their sub-domains as well as the heterogeneous performance is given. For the GPU also the performance using 6 Blocks is provided. It can be seen that the performance of the GPU for one or six Blocks does not differ significantly. This is due to the fact that the rigid body mapping and force calculation Sweeps only process bounding boxes instead of the complete simulation domain; hence, in contrast to the LBM kernel, they do not suffer from the reduced Block sizes. Additionally, the communication between Blocks on the same process has been optimized, as described in Sec. 11.3. In particular, the communication between process-local Blocks does not require any data transfers between host and device, but only a swapping of pointers. In total, a speed up of around 25 % over the pure-GPU simulation using one Block is achieved for the volume fraction of 2.5 % and about 29 % for the volume fraction of 10 %. According to the combined sustained bandwidth of both architectures a light-speed improvement of around 40 % would be possible.

Φ [%]   Domain            Processes (GPU / CPU)   Blocks (GPU / CPU)   DP+ECC [MLUPS]   Speed Up [%]
2.5     600 × 100 × 100   1 / 2                   6 / 2                163              25
        450 × 100 × 100   1 / -                   1 / -                130
        450 × 100 × 100   1 / -                   6 / -                127
        150 × 100 × 100   - / 2                   - / 2                46
10      600 × 100 × 100   1 / 2                   6 / 2                133              29
        450 × 100 × 100   1 / -                   1 / -                103
        450 × 100 × 100   1 / -                   6 / -                102
        150 × 100 × 100   - / 2                   - / 2                29

Table 16.2: Heterogeneous performance of particulate flow simulations. The sub-domain size per Block has been 75³ lattice cells and the rigid body radii have been between 3-5. The simulations have been carried out on a single NVIDIA Tesla M2070 (Sec. 3.2) and an Intel Xeon 5650 (Sec. 3.1).

16.2.3 Weak and Strong Scaling Experiments

The parallel performance of particulate flows has been evaluated in weak and strong scaling experiments. For the parallelization the pure-MPI parallelization approach introduced in Chap. 11 and a three-dimensional domain decomposition have been used (see Fig. 16.6(b)). All results have been obtained in double precision on the Tsubame 2.0 supercomputer. The best performance out of ten simulations has been chosen, and slow or defective nodes have been removed from the node list.

Weak Scaling For all weak scaling experiments a cubic domain size of 150³ lattice cells has been used per GPU, and volume fractions of 3 % and 10 % rigid objects have been investigated. The radius of all spheres has been 4 lattice cells. In summary, a maximum of 1 million rigid bodies has been simulated on 216 nodes (648 GPUs). The scalability on 216 nodes has been 92 % for the 3 % volume fraction and 85 % for the volume fraction of 10 % for the case in which work and communication overlap. Here, only the communication of the PDFs is overlapped; the PE engine communication overhead still remains. This is also the reason why the overlapping case does not perfectly match the kernel performance. An additional result is that the scalability of the simulation with 10 % volume fraction is less than the scalability of the simulation with a volume fraction of 3 %.

Figure 16.7: Weak and strong scaling experiments for fluidizations with volume fractions of 3 % and 10 %. Panels: (a) weak scaling, (b) strong scaling. Performance is given in GFLUPS over the number of GPGPUs.

Strong Scaling In all strong scaling experiments the cubic simulation domain of 150³ lattice cells has been chosen. The speed up that is obtained for 8 nodes over 1 node in the overlapping case is around 5 / 5.6 for the volume fraction of 3 % / 10 %. For 27 nodes a speed up of about 5.5 / 7.39 is obtained, and for 125 nodes a speed up of around 8 / 12 is obtained. The domain size of a single GPU for 125 nodes has been 30³ lattice cells.

16.3 Showcase: Nanofluids

Nanofluids are stable suspensions of particles with dimensions in the nanometer range. They are of special interest due to their unique properties that traditional macroscopic models fail to predict. For example, in experimental measurements enhanced energy and mass transfer capabilities could be detected for nanofluids compared to the fluids without nano particles [169]. One possible reason for these enhanced properties could be the Brownian fluctuations of the nano particles, as the enhancements depend on the particle size and temperature [170] [171].

Direct Numerical Simulation of Nanofluids As the cause of the nanofluid properties is still not clear and measurements at the nano scale are difficult, the investigation of these processes in multi-scale and multi-physics simulations is extremely promising. An example for the aggregation of four nano particles is shown in Fig. 16.8. In joint work with the concurrent PhD thesis of Felipe Aristizabal at McGill University [172] a method for the simulation of nanofluids has been implemented in WaLBerla. In detail, the fluid flow has been modeled with the FLBM [18] (Sec. 2.2) and the nano particles have been modeled with the approach for particulate flows introduced in this Chapter. Additionally, a passive scalar transport equation discretized by a finite-volume approach of second order [173] has been implemented for the investigation of the heat or mass transfer. The focus of this thesis has been the validation and integration of a FLBM implementation for GPUs in WaLBerla based on the CPU implementation by Neumann [19], and the extension of WaLBerla for the proposed simulation task. The implementation of the mass and heat transfer on CPUs, the nanofluid simulations, and the analysis of the obtained results have been conducted by Felipe Aristizabal and will be published separately.

Figure 16.8: Aggregation of four particles with a diameter in the range of nano meters. Two sequences are depicted, one without fluctuations (a-b) and one with fluctuations (c-f). The particles are attracted by Van-der-Waals forces that have been modeled by pair potentials [167] [168]. Van-der-Waals forces are short range attractive forces that emerge from intermolecular interactions. Panels: (a) no fluctuations, t = 0; (b) no fluctuations; (c) fluctuations, t = 0; (d)-(f) fluctuations.

16.3.1 Validation: Diffusing Sphere

For the validation of the short-time random movement of nano particles Ladd [174] proposed the investigation of the mean squared displacement \|\vec{d}_p\| = \|\vec{x}_P(t) - \vec{x}_P(0)\|^2 of the particles. In detail, an isolated sphere has been placed in the center of a cubic domain of 200³ lattice cells and the squared displacement has been measured over time. In Fig. 16.9 the normalized mean squared displacement

\frac{E(\|\vec{d}_p\|)}{6 D_s t} \approx \frac{1}{N} \sum_{i=1}^{N} \frac{\|\vec{d}_P\|}{6 D_s t}    (16.3)

D_s = \frac{k_B T}{6 \pi \rho_0 \nu r_p}    (16.4)

of 200 particle traces is plotted over the reduced time t / (r_p^2 / \nu_L), where D_s is the self diffusion coefficient. As reported by Ladd [174], the results for varying volume fraction Φ should collapse onto a single master curve if normalized by the reported approach. It can be seen that the results obtained with the GPU implementation of WaLBerla match well with the results obtained by Ladd.
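The normalization itself is straightforward to reproduce; the following helper functions simply restate Eq. 16.3 and Eq. 16.4 as reconstructed above, with all quantities assumed to be given in consistent lattice units.

    // Stokes-Einstein self diffusion coefficient, Eq. 16.4.
    double selfDiffusionCoefficient(double kBT, double rho0, double nu, double rp)
    {
        const double pi = 3.14159265358979323846;
        return kBT / (6.0 * pi * rho0 * nu * rp);
    }

    // Normalized mean squared displacement of N particle traces at time t, Eq. 16.3.
    double normalizedMSD(const double* squaredDisplacement, int N, double Ds, double t)
    {
        double sum = 0.0;
        for (int i = 0; i < N; ++i)
            sum += squaredDisplacement[i];
        return sum / (N * 6.0 * Ds * t);
    }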

16.3.2 Performance of the FLBM

Figure 16.9: Self diffusion of isolated spheres. Results have been averaged over 200 simulation runs. The black symbols are the results reported by Ladd [174].

The single node performance of the FLBM GPU implementation is depicted in Fig. 16.10 for one, two, and three GPUs. In principle, the FLBM should be well suited for GPUs, as it uses the same global memory access pattern as the pure-LBM implementation, but computes more FLOPS per lattice cell due to the MRT model and the random number generation. In [175] it has been shown that the MRT can be efficiently implemented on GPUs. However, in contrast to the standard MRT implementation, the FLBM MRT transformation cannot be executed by a single matrix-vector multiplication, as the random fluctuations \vec{R} have to be added to the moments:

g_\alpha^{(neq)} = n_\alpha f_\alpha^{(neq)}, \quad \tilde{\vec{g}} = M^T (D \cdot M \vec{g}^{(neq)} + \vec{R}), \quad f_\alpha^{(t+1)} = f_\alpha^{(eq)} + n_\alpha^{-1} \tilde{g}_\alpha.    (16.5)

Hence, two matrix-vector multiplications are required. Additionally, the equilibrium distribution functions f_α^(eq) have to be stored in temporary variables, as they are required for the determination of f_α^(neq) and for the final update step. This is critical for the performance of the GPU implementation, as it is one of the reasons why even more registers are required for the FLBM than for the already very register consuming implementation of the pure-LBM. In particular, this leads to a low number of concurrently working threads and thread blocks. Hence, resulting from the higher register requirements, the random number generation, and the MRT model, the FLBM performance is lower than the pure-LBM performance by a factor of 3 (see Fig. 16.10). In future work it has to be evaluated whether the performance of the FLBM can be improved by reducing the register count through further restructuring of the FLBM algorithm.
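A per-cell transliteration of Eq. 16.5 makes the two matrix-vector products explicit. The sketch below is an illustration only, not the WaLBerla GPU kernel; it assumes a D3Q19 lattice, treats the relaxation matrix D as diagonal, and takes M, the scaling factors n_α, and the random vector R as given.

    constexpr int Q = 19;   // D3Q19

    // Hypothetical per-cell FLBM collision following Eq. 16.5.
    __device__ void flbmCollide(const double f[Q], const double feq[Q],
                                const double M[Q][Q], const double D[Q],
                                const double n[Q], const double R[Q],
                                double fOut[Q])
    {
        double g[Q], m[Q], gTilde[Q];

        for (int a = 0; a < Q; ++a)                   // g_a^(neq) = n_a * f_a^(neq)
            g[a] = n[a] * (f[a] - feq[a]);

        for (int a = 0; a < Q; ++a) {                 // first product: m = M * g^(neq)
            m[a] = 0.0;
            for (int b = 0; b < Q; ++b)
                m[a] += M[a][b] * g[b];
            m[a] = D[a] * m[a] + R[a];                // relax moments, add fluctuations
        }

        for (int a = 0; a < Q; ++a) {                 // second product: gTilde = M^T * m
            gTilde[a] = 0.0;
            for (int b = 0; b < Q; ++b)
                gTilde[a] += M[b][a] * m[b];
            fOut[a] = feq[a] + gTilde[a] / n[a];      // f_a^(t+1) = f_a^(eq) + n_a^{-1} gTilde_a
        }
    }

The temporaries for f^(eq), g, m, and gTilde in this sketch also illustrate why the register pressure is higher than for the pure-LBM kernel.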

183 250 250 225 225 200 200 175 175 150 150 125 125 100 100 75 75 1 GPU 50 50 2 GPUs 25 Performance in [MFLUPS] 25 3 GPUs Performance in [MFLUPS] 1 GPU 25 50 75 100 125 150 175 25 50 75 100 125 150 175 Cubic Domain Size per Device Cubic Domain Size (a) FLBM (b) Pure-LBM

Figure 16.10: Single node performance of the FLBM for one to three GPUs com- pared to the single GPU performance of the pure-LBM.

Further, the performance of the FLBM drops for domain sizes having an x-dimension divisible by 32 elements. These drops result from the scheduling of the thread blocks in warps, which always contain 32 threads. For example, for the cubic domain size of 32³, 34 elements are allocated in x-dimension due to the ghost layers, so that for this domain size the last warp only updates 2 elements. The first element is an actual LBM cell for which work has to be executed, the second is a ghost cell for which no work has to be executed. However, for this last warp also 32 threads are scheduled, and the execution time is equal to the execution time of a warp updating 32 lattice cells. For the cubic domain size of 31 the last warp has no actual work to do, so that the performance is not reduced. The same drops can also be observed for the pure-LBM implementation.
The weak scaling from one to three GPUs shows perfect scalability. The inter-node scalability is expected to be superior to the pure-LBM implementation, as the serial work load is increased for the FLBM while the communication pattern and the amount of data are identical to the LBM. No comparison to the CPU implementation has been conducted, as the CPU implementation is not well optimized and hence no fair comparison could be presented.

184 17 Simulation Tasks in WaLBerla

In the previous Chapters the method for particulate flows integrated in WaLBerla has been introduced. This Chapter provides a brief overview of some of the other simulation tasks developed in and simulated with WaLBerla. The intention is to show that WaLBerla is being applied and that thus one of the design goals, being usable and maintainable, has been achieved.

17.1 Free-Surface Simulations

WaLBerla supports the simulation of two-phase flows via a free-surface method. Free-surface methods neglect the fluid flow in the gas phase and only model the pressure in this phase. This is possible for a wide range of two-phase flow simulation tasks having a negligible gas flow in the lighter phase. The free-surface method models the liquid phase by the LBM using the TRT collision operator [11]. Moreover, the interface between the two phases is represented via interface cells, which form a sharp and dynamic interface at which the boundary conditions between the two phases are applied. In particular, the implemented free-surface method is a volume of fluid approach that simulates the liquid fill level in the interface cells. The method incorporates surface tension, wetting effects, and capillary effects, as depicted in Fig. 17.1. In addition, it is possible to simulate bubbles and bubble coalescence; the breakup of bubbles is currently under development by Daniela Anderl and Martin Bauer. Originally, the method has been developed by Ginzburg et al. [176], extended to incorporate surface tension by Körner et al. [177], and parallelized by Pohl [75]. Further, it has been optimized for massively parallel simulations, extended by wetting effects, and integrated into WaLBerla by Donath [98].

Figure 17.1: Simulation of free-surface flows: (a) bubbly flows, (b) wetting effects, (c) capillary rise. Images are taken from Donath [98].

From the software framework point of view, the free-surface method imposed high requirements. It requires complex communication patterns that WaLBerla has to support. For example, for the free-surface method not only the PDFs have to be communicated, but also fill levels, geometry information, normals, and pressure data of the bubbles, in varying combinations for different Sweeps. Especially the iterative communication pattern involved in the merging of bubbles imposed high demands, as bubbles can span several processes, which results in non-next-neighbor communication. Additionally, the free-surface method uses many simulation data objects, e.g. cell-based data for the fill levels, normals, and geometry, but also block-based data for the bubble information. Further information on the free-surface method can be found in [72] [73] [75] [98].

17.2 Fast Electron Beam Melting

Figure 17.2: Heat distribution in a powder layer induced by an electron beam. Image courtesy of Matthias Markl.

Electron beam melting can be applied for additive manufacturing processes. During these processes the final product is successively built up layer by layer from melted powder: a layer of powder is applied on a surface and is partly melted by an electron beam, and this step is repeated until all layers of the product have been created. Fig. 17.2 shows an example of the heat distribution resulting from the electron beam for the first layer of powder. The simulation task is investigated as part of the FastEBM project [60]. The goal of the project is to speed up the manufacturing process by increasing the intensity of the electron beam and by the application of time multiplexing. With the help of numerical simulations the processing parameters for the industrial application are optimized.

186 Methodically, the PE physics engine is used to create the powder layers while the free-surface method combined with a passive scalar model for heat transport is used to model the dynamics of the melting process. Further information on the project can be found on the project web site [60] and on the 2D implementation in Attar [178].

17.3 Self-Propelled Particle Complexes

Figure 17.3: A simplistic three-sphere swimmer adopted from Najafi et al. [179]. Image taken from Pickl et al. [139].

Another of WaLBerla's simulation tasks is investigating the properties of self-propelled micro-devices in the low Re regime. A simplistic swimmer, consisting of three spheres connected by two damped harmonic oscillators, is depicted in Fig. 17.3. By the application of a time-irreversible force cycle the swimmer is propelled forward. Computationally, the fluid is simulated with the LBM and the swimmer is modeled by the PE physics engine, which also includes functionality to simulate force generators, e.g. springs and joints [180]. Further information can be found in Pickl et al. [139].

Figure 17.4: Plug-like velocity profile induced by electro-osmosis. Image taken from Masilamani et al. [181]

17.4 Electroosmotic Flows

In lab-on-a-chip devices there is the need to transport and control liquid samples, as in many chemical and biological processes liquids are involved as transport media. One important mechanism to control the fluid flow in microchannels is electroosmosis. Electroosmotic flows (see Fig. 17.4) are induced by the application of an electrostatic potential gradient along the microchannel that forces ions in the liquid to migrate along the electric field.
In WaLBerla the fluid in electroosmotic flows is simulated using the LBM, and the electrostatic potential is determined by solving the Poisson-Boltzmann equation for binary and symmetric electrolyte solutions with a finite-difference SOR solver. With the help of numerical simulations the transient behavior and the transport characteristics of electroosmotic flows are analyzed. In detail, combined pressure-driven and electrostatically driven flows have been analyzed for microchannels including surface roughness as well as for non-Newtonian fluids. Further details can be found in Masilamani et al. [181] [182].
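To illustrate the kind of finite-difference SOR iteration mentioned above, the following sketch shows one lexicographic SOR sweep for a linearized (Debye-Hückel type) Poisson-Boltzmann problem, laplace(psi) = kappa^2 * psi + b, on a uniform grid. It is an illustration under these assumptions only and not the solver used in WaLBerla; boundary handling and the treatment of the full nonlinear equation are omitted.

    // One SOR sweep for the 7-point discretization of  laplace(psi) = kappa^2 * psi + b.
    void sorSweep(double* psi, const double* b, int nx, int ny, int nz,
                  double h, double kappaSq, double omega)
    {
        for (int k = 1; k < nz - 1; ++k)
            for (int j = 1; j < ny - 1; ++j)
                for (int i = 1; i < nx - 1; ++i) {
                    const int c = (k * ny + j) * nx + i;
                    const double neighborSum =
                        psi[c - 1]       + psi[c + 1] +
                        psi[c - nx]      + psi[c + nx] +
                        psi[c - nx * ny] + psi[c + nx * ny];
                    // Gauss-Seidel value of the cell, then over-relaxation with omega.
                    const double gs = (neighborSum - h * h * b[c]) / (6.0 + kappaSq * h * h);
                    psi[c] += omega * (gs - psi[c]);
                }
    }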

18 Conclusion

In this Part a method for the direct numerical simulation of rigid bodies in particulate flows has been introduced. The method is based on a two-way coupling of LBM fluid flows and rigid body dynamics and fully resolves the non-deformable objects to retain the geometrical information. In particular, the rigid body dynamics are treated by the physics engine PE. Therefore, the acting hydrodynamic forces are determined from the fluid flow and handed over to the PE engine. Within the fluid flow the particles are treated as solid walls that block out the fluid.
The performance results obtained in single GPU experiments and in weak and strong scaling experiments on over 600 GPUs show that particulate flows can be efficiently implemented on GPUs. However, special care has to be taken to avoid race conditions. With the help of drag force computations and the determination of the terminal velocity of a falling sphere the method has been validated; an error of around 5 % compared to analytical solutions could be achieved.
The method has been applied to fluidization experiments and to the simulation of nano particles. In both simulation tasks the transport processes are of special interest; nanofluids, for example, offer unexpected transport properties. These will be investigated in future work. The largest simulation used up to 10^6 particles.
In addition, further simulation tasks integrated in WaLBerla have been introduced. Together with the particulate flow simulation task, these show that the WaLBerla design goal of being usable, maintainable, and extendable has been achieved.


Part V

Conclusion and Future Work


19 Conclusion

In this thesis the software framework WaLBerla has been designed and implemented. Its primary design goals are efficient and scalable massively parallel simulations centered around the LBM and the suitability for various multi-physics simulation tasks, enabled by a maintainable software design. With the help of heterogeneous simulations on CPU-GPU clusters it has been demonstrated that complex communication patterns can be efficiently and flexibly implemented in WaLBerla. The scalability of WaLBerla has been analyzed and optimized with the help of a performance model for the communication overhead. Additionally, it has been shown that WaLBerla can be used for various multi-physics simulation tasks and is already applied in several research projects. As a showcase, simulations of particulate flows on CPU-GPU clusters have been presented. This clearly shows that the design goals of WaLBerla have been achieved.

WaLBerla: Designing a Large-Scale Software Framework in Academia For the investigation of complex multi-physics, multi-scale simulation tasks requiring massive parallelization, the use of a well-designed software framework is of utmost importance. For this reason, in the design phase of WaLBerla the requirements for a software framework dedicated to cutting down the development time needed by scientists in academia for CSE simulation tasks have been analyzed. The evaluation of these requirements by means of software quality factors shows a strong need for efficiency and maintainability. As a result, the physical dependencies of WaLBerla have been optimized, and software concepts providing clear interfaces for the sequence control, the work steps of simulation tasks, the dynamic selection of functionality, the domain decomposition, and the parallelization have been implemented. This allows users to easily reuse existing implementations, but also to extend, exchange, and add functionality. Despite its emphasis on maintainability, WaLBerla is still efficient, as it has been designed to introduce no performance overhead in performance-critical sections. Here, the general design concept is to utilize easily exchangeable kernels for performance-critical work steps, which can be optimized for architectures, programming paradigms, and simulation tasks, and to avoid expensive language constructs. In summary, WaLBerla is a software framework that is maintainable by design, is suitable for massively parallel multi-physics simulations, can aid scientists in focusing on their simulation task, and provides a platform for collaboration and knowledge transfer.
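
The following sketch illustrates the kernel-exchange idea in a strongly simplified form: callable kernels are registered under a tag and selected at run-time, so that an architecture-specific implementation can be chosen without touching the calling code. The KernelContainer class and the string tags are assumptions made for this illustration and do not reproduce WaLBerla's actual FunctionContainer, UID, or smart-scope mechanisms:

#include <cstdio>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

using Kernel = std::function<void()>;

// Kernels are stored as exchangeable callable objects and selected by a tag at run-time.
class KernelContainer {
public:
    void add(const std::string& tag, Kernel kernel) { kernels_[tag] = std::move(kernel); }

    void run(const std::string& tag) const {
        const auto it = kernels_.find(tag);
        if (it == kernels_.end())
            throw std::runtime_error("no kernel registered for tag " + tag);
        it->second();   // the selected kernel itself contains no further run-time branching
    }

private:
    std::map<std::string, Kernel> kernels_;
};

int main() {
    KernelContainer lbmSweep;
    lbmSweep.add("CPU", [] { std::puts("optimized CPU stream/collide kernel"); });
    lbmSweep.add("GPU", [] { std::puts("CUDA stream/collide kernel"); });
    lbmSweep.run("GPU");   // in a framework the tag would come from the simulation setup
    return 0;
}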

Performance Evaluation on Heterogeneous Supercomputers The parallel performance of LBM simulations has been evaluated for three parallelization designs, i.e. the pure-MPI, hybrid, and heterogeneous parallelization approaches, on the heterogeneous CPU-GPU supercomputer Tsubame 2.0. For simulations using a single node the heterogeneous parallelization approach achieves the best performance results. Here, heterogeneous simulations using one GPU and one CPU compute node can achieve a speedup of around 24% over simulations using only one GPU. In weak scaling experiments on over 1000 GPUs the heterogeneous simulation approach also performs best, but in strong scaling experiments the pure-MPI parallelization approach delivered the best performance. In detail, the hybrid and heterogeneous parallelization approaches suffered from the domain-independent thread synchronization overhead. The heterogeneous approach additionally has the smallest simulation domain sizes per device, which results in lower kernel performance. This leads to the insight that for good strong scalability not only the communication overhead has to be minimized, but also the kernel performance has to be maximized for small domain sizes. Further, the performance results show that for LBM simulations the overlapping of work and communication leads to a significant performance improvement on GPUs, but not on CPUs. The performance model for the communication overhead could accurately predict the performance of LBM simulations. For the prediction of the parallel performance the kernel or serial compute node run-time has been measured and combined with the predicted communication times in order to achieve precise results. This is especially important for strong scaling scenarios. With the help of these predictions a deeper understanding of the obtained performance results has been gained, and bottlenecks have been identified and optimized. Additionally, the predictions show that WaLBerla introduces no significant performance overhead.
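
A strongly simplified form of such a model, stated here only to illustrate the approach (the symbols below are not the exact nomenclature of the thesis), estimates the run-time of one time step from the measured kernel time and the predicted transfer times of the exchanged messages:

\[
  T_\mathrm{step} \approx T_\mathrm{kernel} + T_\mathrm{comm},
  \qquad
  T_\mathrm{comm} \approx \sum_{i=1}^{n_m} \left( T_{\mathrm{lat},i} + \frac{S_i}{B_i} \right),
\]

where n_m is the number of messages per time step, T_{lat,i} the latency, S_i the size, and B_i the attainable bandwidth of the link (e.g. PCI-E, QPI, or InfiniBand) over which message i is transferred.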

Multi-Physics Simulations In this thesis it has also been shown that particulate flows modeled by the direct numerical simulation of rigid bodies can be efficiently implemented on GPUs. However, special care has to be taken to avoid race conditions. The terminal velocity of a falling sphere could be accurately determined with an error of around 5%. Good scalability has been shown for fluidization experiments on over 600 GPUs with around 10^6 rigid bodies. Further, the performance of the FLBM has been investigated on GPUs.

20 Future Work

The basic functionality of WaLBerla, i.e. sequence control, parallelization, data structures, and functionality management, has been implemented, and a basis for good software quality has been established. However, WaLBerla is still under development, as a single PhD project cannot implement the complete functionality required for the simulation of complex computational dynamics tasks. A fundamental future task will be to preserve the understanding that only a continued effort towards high software quality can keep the project on a successful path, and that straying from this path will result in the failure of the WaLBerla design goals. Possible next steps and extensions are to make WaLBerla available to a broader audience, to support adaptivity, interactive steering, and optimized post-processing, and to apply WaLBerla to a larger variety of complex simulation tasks.

Broader Community Due to the good usability of WaLBerla, the development cycle of simulation tasks can be cut down, and it is possible to provide a platform for interdisciplinary collaboration and knowledge transfer. Making WaLBerla available to a broader audience will be beneficial for the community. However, further improvements of the usability of WaLBerla are required before this is possible, i.e. tutorials for new users, a web page including project management tools, and the support of higher order boundary conditions and collision models.

Application of WaLBerla WaLBerla has to be applied in order to increase its acceptance by a broader audience. This is already done with the presented simulation tasks. Here, the further development of the method for particulate flows towards more accurate simulations, by integrating higher order boundary conditions, an improved reconstruction of left-over fluid cells, and an improved hydrodynamic force calculation, is an important step. Also, the modeling of additional physical effects, such as energy or mass transport, is of interest for the simulation of fluidized beds and nanofluids.

Adaptivity and Dynamic Load Balancing A very interesting yet challenging research topic currently investigated in WaLBerla is dynamically load-balanced adaptive simulations on massively parallel supercomputers. Many of the simulation tasks in WaLBerla will benefit from this technique in terms of performance as well as accuracy. Through its software design concepts, i.e. the Sweep, the Block, and the functionality management, WaLBerla is well-suited for dynamic simulations. For example, functionality can be dynamically changed during run-time, since the simulation data is encapsulated within the Blocks.

Interactive Steering Interactive steering coupled to real-time simulations could be used for direct interaction with and feedback from simulations. Fields of application could be the identification of promising parameter sets in coarse simulations or direct feedback in medical applications.

Data Reduction and Post-Processing In order to increase the usability and the resilience of WaLBerla on supercomputers, restart and fault-tolerance mechanisms, data reduction, and visualization techniques suitable for massively parallel simulations have to be supported. This has already been partly achieved by Donath [98] and Götz [151], but still has to be systematically extended to a broader range of simulation tasks.

List of Tables

2.1 Entry vectors for the MRT transformation matrix ...... 14

3.1 Specifications of the LIMA and the Tsubame 2.0 cluster ...... 18
3.2 Dual-socket Intel “Westmere” ...... 20
3.3 Specifications of the NVIDIA Tesla M2050 and M2070 ...... 22

6.1 WaLBerla UID types ...... 45

9.1 Lattice Boltzmann software frameworks ...... 82

11.1 Heterogeneous performance comparison ...... 112

12.1 Attainable memory bandwidth and LBM kernel performance estimation ...... 119
12.2 Nomenclature for the communication model ...... 121
12.3 Socket to socket QPI bandwidth ...... 123

16.1 Single GPU performance of particulate flows ...... 177
16.2 Heterogeneous performance of particulate flow simulations ...... 179


List of Figures

1.1 Flow under a common bridge ...... 1

2.1 The D3Q19 model ...... 7
2.2 LBM stream and collide step ...... 10
2.3 The bounce-back boundary condition for solid walls ...... 13

3.1 Shared memory ccNUMA compute node ...... 19
3.2 Schematic layout of CPUs and GPUs ...... 21

5.1 Component and cyclic dependency examples ...... 38
5.2 Example for excessive compile time dependencies ...... 39
5.3 Example for excessive link time dependencies ...... 40
5.4 Physical dependency of the prototypes WaLBerla::V1 and WaLBerla::Core ...... 41

6.1 UML class diagram for UIDs ...... 44
6.2 Schematic UML class diagram for function::Containers and function::Functors ...... 47
6.3 UML activity diagram of the Sweep concept ...... 50
6.4 Domain partitioning ...... 51
6.5 UML class diagram of the Block class ...... 52
6.6 Block data allocation ...... 53
6.7 Block optimization strategies ...... 54
6.8 UML activity diagram for a simulation run ...... 54

8.1 UML class diagram for the simulation data hierarchy ...... 66
8.2 UML class diagram for the boundary condition handling ...... 69
8.3 UML class diagram for the NoSlip class ...... 70
8.4 UML class diagram for the logging class ...... 76
8.5 UML class diagram for the util::WallTimeLogger ...... 77

11.1 Distributed memory parallelization ...... 95
11.2 Activity diagram for the workflow of the MPI parallelization ...... 97
11.3 Geometric overview of the MPI parallelization ...... 99
11.4 Communication Type UML activity diagram, part 1 ...... 100
11.5 Communication Type UML activity diagram, part 2 ...... 101

11.6 Communication Type UML activity diagram, part 3 ...... 101
11.7 Activity diagram of the workflow for multi-GPU parallelizations ...... 104
11.8 Multi-GPU MPI parallelization ...... 105
11.9 Schematic diagram for the overlapping communication and work on GPUs ...... 107
11.10 Data allocation for the pure-MPI and the tasking model ...... 108
11.11 Hybrid wrapper function and thread pool ...... 109
11.12 UML diagram of the SubBlock class ...... 110
11.13 UML activity diagram for the extraction phase of the hybrid parallelization approach ...... 111

12.1 OpenMPI bandwidth - intra-socket ...... 122
12.2 Inter-node MPI bandwidth and PCI-E bandwidth ...... 124
12.3 Intra-node communication overhead of the pure-MPI parallelization approach ...... 126
12.4 Communication overhead of the hybrid and heterogeneous parallelization approach using two and three GPUs ...... 128
12.5 Parallel timeline ...... 129
12.6 Communication overhead of weak and strong scaling experiments using three GPUs, and the pure-MPI and the hybrid parallelization approach ...... 132
12.7 Communication overhead of weak and strong scaling experiments using two and three GPUs, and the hybrid and the heterogeneous parallelization approach ...... 134

13.1 Single node performance of simulations using the hybrid or the heterogeneous parallelization ...... 141
13.2 Single node performance of simulations using two and three GPUs, and pure-MPI parallelization ...... 142
13.3 Single node performance of simulations using one CPU compute node, and pure-MPI and hybrid parallelization ...... 143
13.4 Weak scaling and strong scaling performance of simulations using MPI-only ...... 145
13.5 Weak scaling and strong scaling performance of simulations using the hybrid and heterogeneous parallelization approach ...... 148
13.6 CPU weak and strong scaling performance of simulations using the pure-MPI and the hybrid parallelization approach on LIMA ...... 150
13.7 Packed bed of hollow cylinders ...... 151
13.8 Strong scaling performance of porous media simulations using the pure-MPI approach ...... 153

15.1 Time loop for the particulate flows simulation task ...... 162
15.2 The work steps of the rigid body mapping Sweep ...... 166

15.3 Two examples for cells updated by one thread within the extended bounding box of a rigid body ...... 167
15.4 The momentum exchange method ...... 169

16.1 Drag force computation ...... 172
16.2 Terminal velocity of falling spheres ...... 173
16.3 Fluidization experiments using different inflow velocities ...... 174
16.4 Particulate flow performance heat map ...... 176
16.5 Hybrid heterogeneous setup ...... 177
16.6 Domain decomposition for particulate flows ...... 178
16.7 Weak and strong scaling experiments for fluidizations ...... 180
16.8 Aggregation of four nanoparticles ...... 181
16.9 Self diffusion sphere ...... 183
16.10 Single node performance of the FLBM ...... 184

17.1 Free-surface flow simulations ...... 185
17.2 Electron beam melting ...... 186
17.3 A simplistic three-sphere swimmer ...... 187
17.4 Plug-like velocity profile induced by electro-osmosis ...... 187


Listings

2.1 Unoptimized implementation of the LBM ...... 11
2.2 Unoptimized implementation of the FLBM ...... 16

4.1 A simple CUDA C program ...... 25
4.2 Allocation of host memory in CUDA ...... 26
4.3 Allocation of device memory in CUDA ...... 26
4.4 A simple CUDA kernel ...... 26
4.5 Overlapping computation with PCI-E transfers ...... 26

6.1 Adding LBM kernels to a FunctionContainer ...... 46
6.2 Adding LBM kernels to a FunctionContainer without smart scopes ...... 48
6.3 Execute function ...... 48
6.4 Configuration of UIDs ...... 56
6.5 Configuration of initialization kernels ...... 56
6.6 Configuration of global data ...... 57
6.7 Configuration of Block data ...... 57
6.8 Configuration of Sweeps ...... 58

8.1 Boundary condition handling ...... 68
8.2 Treatment of the boundary conditions ...... 70
8.3 Boundary condition handling ...... 71
8.4 Configuration of the boundary condition handling ...... 72
8.5 Simple example for the usage of logging in WaLBerla ...... 73
8.6 Output example for the logging tool ...... 74
8.7 Example for the usage of logging sections ...... 75
8.8 Output of the util::WallTimeLogger class ...... 77
8.9 Using the util::WallTimeLogger class ...... 79
8.10 Using the util::WallTimeLogger class by applying the command software design pattern ...... 79
8.11 Creation of a callable object for time measurements ...... 79
8.12 Time output of the util::WallTimeLogger class ...... 80

11.1 Configuration of a Communication Type ...... 102
11.2 Configuration of a Communication Data object ...... 103
11.3 Heterogeneous node setup ...... 110
11.4 Configuration of the PDF sweep for the tasking model ...... 112

11.5 Configuration of Comm Data objects for the heterogeneous parallelization approach ...... 114

Bibliography

[1] M. Griebel, S. Knapek, and G. Zumbusch. Numerical Simulation in Molecular Dynamics. Springer, 2007. [2] F. Durst. Fluid Mechanics - An Introduction to the Theory of Fluid Flows. Springer, 2006. [3] D. Hänel. Molekulare Gasdynamik - Einführung in die Kinetische Theorie der Gase und Lattice-Boltzmann-Methoden. Springer, 2004. [4] P.L. Bhatnagar, E.P. Gross, and M. Krook. A Model for Collision Pro- cesses in Gases. I. Small Amplitude Processes in Charged and Neutral One- Component Systems. Phys. Rev., 94:511–525, 1954. [5] H. Chen, S. Chen, and W.H. Matthaeus. Recovery of the Navier-Stokes Equa- tions Using a Lattice-Gas Boltzmann Method. Phys. Rev. A, 45:5339–5342, 1992. [6] Y.H. Qian, D. d’Humières, and P.Lallemand. Lattice BGK Models for Navier- Stokes Equation. Europhysics Letters, 17(6):479, 1992. [7] O. Filippova, S. Succi, F. Mazzocco, C. Arrighetti, G. Bella, and D. Hänel. Multiscale Lattice Boltzmann Schemes With Turbulence Modeling. Journal of Computational Physics, 170(2):812 – 829, 2001. [8] X. He and L.-S. Luo. Theory of the Lattice Boltzmann Method: From the Boltzmann Equation to the Lattice Boltzmann Equation. Phys. Rev. E, 56(6):6811–6817, 1997. [9] M. Junk, A. Klar, and L.-S. Luo. Asymptotic Analysis of the Lattice Boltz- mann Equation. Journal of Computational Physics, 210(2):676–704, 2005. [10] M. Wittmann, T. Zeiser, G. Hager, and G. Wellein. Comparison of Different Propagation Steps for Lattice Boltzmann Methods. submitted to Computers & Mathematics with Applications, abs/1111.0922, 2011. [11] I. Ginzburg, F. Verhaeghe, and D. d’Humières. Two-Relaxation-Time Lattice Boltzmann Scheme: About Parametrization, Velocity, Pressure and Mixed Boundary Conditions. Commun. Comput. Phys., 3(2):427, 2008.

205 [12] Z. Guo and T.S. Zhao. Explicit Finite-Difference Lattice Boltzmann Method for Curvilinear Coordinates. Phys. Rev. E, 67:066709, 2003. [13] D. D’Humières, I. Ginzburg, M. Krafczyk, P. Lallemand, and L.-S. Luo. Multiple-Relaxation-Time Lattice Boltzmann Models in Three Dimensions. Hilosophical Transactions of the Royal Society - Series A: Mathematical, Physical and Engineering Sciences, 360(1792):437–451, 2002. [14] M. Geier, A. Greiner, and J.G. Korvink. Cascaded Digital Lattice Boltzmann Automaton for High Reynolds Number Flows. Phys. Rev. E, 73:066705, 2006. [15] S. Ansumali and I.V. Karlin. Single Relaxation Time Model for Entropic Lattice Boltzmann Methods. Phys. Rev. E, 65(2-3):056312, 2002. [16] S. Izquierdo and N. Fueyo. Characteristic Nonreflecting Boundary Con- ditions for Open Boundaries in Lattice Boltzmann Methods. Phys. Rev. E, 78:046707, 2008. [17] D. Yu, R. Mei, L.-S. Luo, and W. Shyy. Viscous Flow Computation With the Method of Lattice Boltzmann Equation. Prog. Aero. Sci., 39(5):329–367, 2003. [18] B. Dünweg, U.D. Schiller, and A.J.C. Ladd. Statistical Mechanics of the Fluc- tuating Lattice Boltzmann Equation. Phys. Rev. E, 76:036704, 2007. [19] P. Neumann. Numerical Simulation of Nanoparticles in Brownian Motion Using the Lattice Boltzmann Method. Diploma Thesis, Chair for System Sim- ulation, University of Erlangen, 2008. [20] A.J.C Ladd. Numerical Simulations of Particulate Suspensions via a Dis- cretized Boltzmann Equation. Part 2. Numerical Results. Journal of Fluid Me- chanics, 271:311–339, 1994. [21] A.J.C Ladd and R. Verberg. Lattice-Boltzmann Simulations of Particle-Fluid Suspensions. Journal of Statistical Physics, 104(5):1191–1251, 2001. [22] The Top 500 List, http://www.top500.org/list/2011/11/100 , November 2011. [23] Information on Jaguar, http://www.olcf.ornl.gov/computing- resources/jaguar/. [24] Information on Tianhe-1A, http://nscc-tj.gov.cn/en/. [25] Information on the Tsubame 2.0, http://www.gsic.titech.ac.jp/ en/tsubame2. [26] Information on Titan, http://www.olcf.ornl.gov/computing- resources/titan/.

206 [27] The Green Top 500 List, http://www.green500.org/lists/2011/11/ top/list.php, November 2011. [28] Information on the Blue Gene/Q Prototype, http://www.research. ibm.com/. [29] Information on the K Computer, http://www.aics.riken.jp/en/. [30] Information on the Linpack Benchmark, http://www.netlib.org/ benchmark/hpl/. [31] Information on Road Runner, http://www.lanl.gov/. [32] K. Alvin, B. Barrett, R. Brightwell, S. Dosanjh, Al. Geist, S. Hemmert, M. Heroux, D. Kothe, R. Murphy, J. Nichols, R. Oldfield, A. Rodrigues, and J.S. Vetter. On the Path to Exascale. International Journal of Distributed Systems and Technologies, 1(2):1–22, 2010. [33] The DEEP Project, http://www.deep-project.eu/deep-project/ EN/Project/_node.html. [34] Information on LIMA, http://www.rrze.uni-erlangen.de/ dienste/arbeiten-rechnen/hpc/systeme/lima-cluster.shtml. [35] G. Hager and G. Wellein. Introduction to High Performance Computing for Scientists and Engineers. CRC Press, Inc., 2010. [36] The intel 64 and ia-32 architecture software developer’s manuel, http://www.intel.com/content/www/us/en/processors/ architectures-software-developer-manuals.html. [37] Information on OpenMPI, Version 1.4.2 (Tsubame 2.0) and Version 1.4.7 (LIMA), http://www.open-mpi.org/. [38] Information on the Intel C++ Compiler, Version 11.1, http://www.intel. com/. [39] The NVIDIA Cuda Programming Guide, http://developer. download.nvidia.com/compute/DevZone/docs/html/C/doc/ CUDA_C_Programming_Guide.pdf. [40] J. Habich, T. Zeiser, G. Hager, and G. Wellein. Performance Analysis and Optimization Strategies for a D3Q19 Lattice Boltzmann Kernel on nVIDIA GPUs Using CUDA. Advances in Engineering Software, 42(5):266–272, 2011. [41] Information on NVIDIA, http://www.nvidia.de. [42] J. Sanders and E. Kandrot. CUDA by Example: An Introduction to General- Purpose GPU Programming. Addison-Wesley Professional, 2010. [43] The SKALB Project, http://www.skalb.de/.

207 [44] J. Habich. Dissertation (titel to be announced). PhD thesis, University of Erlangen-Nuremberg, 2012. [45] G. Wellein, T. Zeiser, G. Hager, and S. Donath. On the Single Processor Performance of Simple Lattice Boltzmann Kernels. Computers and Fluids, 5(8- 9):910–919, 2006. [46] S. Williams, J. Carter, L. Oliker, J. Shalf, and K. Yelick. Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms. In IEEE Interna- tional Symposium on Parallel and Distributed Processing, 2008, pages 1–14, 2008. [47] J. Wilke, T. Pohl, M. Kowarschik, and U. Rüde. Cache Performance Op- timizations for Parallel Lattice Boltzmann Codes. In Euro-Par 2003 Parallel Processing, volume 2790 of Lecture Notes in Computer Science. Springer, 2003. [48] J. Tölke and M. Krafczyk. Teraflop Computing on a Desktop PC With GPUs for 3D CFD. Int. J. Comput. Fluid Dyn., 22(7):443–456, 2008. [49] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux. A New Approach to the Lattice Boltzmann Method for Graphics Processing Units. Computers & Mathematics with Applications, 61(12):3628–3638, 2011. [50] J. Habich, C. Feichtinger, H. Köstler, G. Hager, and G. Wellein. Performance Engineering for the Lattice Boltzmann Method on GPGPUs: Architectural Requirements and Performance Results. Computers & Fluids, abs/1112.0850, 2012. [51] F. Zafar, M. Olano, and A. Curtis. GPU Random Numbers via the Tiny Encryption Algorithm. In Proceedings of the Conference on High Performance Graphics, Saarbrücken 2010, pages 133–141. Eurographics Association, 2010. [52] F. Zafar and M. Olano. Tiny Encryption Algorithm for Parallel Random Numbers on the GPU. In Proceedings of the 2010 ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, Washington D.C. 2010, pages 1:1–1:1. ACM, 2010. [53] B. Oestereich. Analyse und Design mit UML 2.0 - Objektorientierte Softwareen- twicklung. Oldenburg, 2005. [54] Erlangen Regional ComputingCenter (RRZE), http://www.hpc. informatik.uni-erlangen.de/. [55] Institute of Fluid Mechanics (LSTM), http://www.lstm.uni- erlangen.org/. [56] Chair of Metals Science and Technology (WTM), http://www.wtm.uni- erlangen.de/en-index.html.

208 [57] Processing Laboratory (PPL), http://coulombe-group. mcgill.ca/Coulombe_group/Welcome.html. [58] Cluster of Excellence Engineering of Advanced Materials (EAM), http:// www.eam.uni-erlangen.de/. [59] The DECODE Project, http://www.decode-project.eu/. [60] The FastEBM Project, http://www.fastebm.eu/. [61] Simulation von Proteinschäumen, http://www.lstm.uni- erlangen.de/forschung/forschungsbereiche/b4/projekte/ proteinschaeume.shtml. [62] The KONWIHR Project, http://www.konwihr.uni-erlangen.de/ projekte/laufende-projekte/walberla-mc-de.shtml. [63] Information on SuperMUC, http://www.lrz.de/services/ compute/supermuc/. [64] J. Lakos. Large-Scale C++ Software Design. Addison Wesley Longman Pub- lishing Co., Inc., 1996. [65] R.C. Martin. Agile Software Development: Principles, Patterns, and Practices. Prentice Hall, 2003. [66] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995. [67] J. Ludewig and H. Lichter. Software Engineering: Grundlagen, Menschen, Prozesse, Techniken. Dpunkt.Verlag GmbH, 2010. [68] R. C Martin. More C++ Gems. Cambridge University Press, 2000. [69] J. Götz, K. Iglberger, M. Stürmer, and U. Rüde. Direct Numerical Simulation of Particulate Flows on 294912 Processor Cores. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’10), New Orleans, Louisiana, USA, 2010. IEEE Computer Society. [70] B. Boehm. A Spiral Model of Software Development and Enhancement. ACM SIGSOFT Software Engineering Notes, 11(4):14–24, 1986. [71] H. van Vliet. Software Engineering: Principles and Practice. John Wiley & Sons, Inc., 2008. [72] S. Donath, C. Feichtinger, T. Pohl, J. Götz, and U. Rüde. Localized Par- allel Algorithm for Bubble Coalescence in Free Surface Lattice-Boltzmann Method. In Euro-Par 2009 Parallel Processing, volume 5704 of Lecture Notes in Computer Science, pages 735–746. Springer, 2009.

209 [73] S. Donath, K. Mecke, S. Rabha, V. Buwa, and U. Rüde. Verification of Surface Tension in the Parallel Free Surface Lattice Boltzmann Method in WaLBerla. Computers & Fluids, 45(1):177–186, 2010. [74] J. Götz, K. Iglberger, C. Feichtinger, S. Donath, and U. Rüde. Coupling Multibody Dynamics and Computational Fluid Dynamics on 8192 Processor Cores. Parallel Computing, 36(2-3):142–151, 2010. [75] T. Pohl. High Performance Simulation of Free Surface Flows Using the Lattice Boltzmann Method. PhD thesis, Chair for System Simulation, University of Erlangen, 2008. [76] D. Vandevoorde and N.M. Josuttis. C++ Templates: The Complete Guide. Addison-Wesley, 2003. [77] K. Iglberger. Software Design of a Massively Parallel Rigid Body Framework. PhD thesis, Chair for System Simulation, University of Erlangen, 2010. [78] Tobias Weinzierl. Peano - A Framework for PDE Solvers on Spacetree Grids, http://www5.in.tum.de/peano. [79] T. Zeiser, G. Hager, and G. Wellein. Benchmark Analysis and Application Results for Lattice Boltzmann Simulations on NEC SX Vector and Intel Ne- halem Systems. Parallel Processing Letters, 19(4):491–511, 2009. [80] S. Freudiger. Entwicklung eines Parallelen, Adaptiven,Komponentenbasierten Strömungskerns für Hierarchische Gitter auf Basis des Lattice-Boltzmann- Verfahrens. PhD thesis, University of Braunschweig, 2009. [81] F. Schornbaum. WaLBerla: Towards an Adaptive, Dynamically Load- Balanced, Massively Parallel Lattice Boltzmann Fluid Simulation. Talk: SIAM Conference on Parallel Processing for Scientific Computing, 2012. [82] Information on Paraview, http://www.paraview.org/. [83] Information on Source Code Documentation Generator Tool Doxygen, http://www.doxygen.org/. [84] Subversion. Information on the Version Management System Subversion, http://subversion.apache.org/. [85] Information on Jugene, http://www.fz-juelich.de/ias/jsc/EN/ Expertise/Supercomputers/JUGENE/JUGENE_node.html. [86] Information on Juropa, http://www.fz-juelich.de/ias/jsc/EN/ Expertise/Supercomputers/JUROPA/JUROPA_node.html. [87] Information on Judge, http://www.fz-juelich.de/ias/jsc/EN/ Expertise/Supercomputers/JUDGE/JUDGE_node.html.

210 [88] Information on Dirac, http://www.nersc.gov/users/ computational-systems/dirac/. [89] Information on the C++ Library Boost, http://www.boost.org/. [90] Information on CMake, http://www.cmake.org/. [91] M. Kowarschik. Data Locality Optimizations for Iterative Numerical Algorithms and Cellular Automata on Hierarchical Memory Architectures. PhD thesis, Chair for System Simulation, University of Erlangen, 2004. [92] J. Treibig. Efficiency Improvements of Iterative Numerical Algorithms on Modern Architectures. PhD thesis, Chair for System Simulation, University of Erlan- gen, 2008. [93] M. Stürmer, G. Wellein, G. Hager, H. Köstler, and U. Rüde. Challenges and Potentials of Emerging Multicore Architectures. In High Performance Comput- ing in Science and Engineering. Garching/Munich 2007, pages 551–566. Springer, 2008. [94] T. Gradl, C. Freundl, H. Köstler, and U. Rüde. Scalable Multigrid. In High Performance Computing in Science and Engineering. Garching/Munich 2007, pages 475–483. Springer, 2008. [95] B. Bergen. Hierarchical Hybrid Grids: Data Structures and Core Algorithms for Efficient Finite Element Simulations on Supercomputers. PhD thesis, Chair for System Simulation, University of Erlangen, 2005. [96] J. Götz, S. Donath, C. Feichtinger, K. Iglberger, and U. Rüde. Concepts of WaLBerla Prototype 0.0. Technical Report 07-4, University of Erlangen- Nuremberg, 2007. [97] W. Schroeder, K. Martin, and B. Lorensen. The Visualization Toolkit. Kitware Inc., 2004. [98] S. Donath. Wetting Models for a Parallel High-Performance Free Surface Lattice Boltzmann Method. PhD thesis, Chair for System Simulation, University of Erlangen, 2011. [99] C. Feichtinger, J. Götz, S. Donath, K. Iglberger, and U. Rüde. Concepts of WaLBerla Prototype 0.1. Technical Report 07-10, University of Erlangen- Nuremberg, 2007. [100] S. Meyers. Effective C++ : 55 Specific Ways to Improve Your Programs and Designs. Addison-Wesley Professional, 2005. [101] J. Desplat, I. Pagonabarraga, and P. Bladon. LUDWIG: A Parallel Lattice- Boltzmann Code for Complex Fluids. Computer Physics Communications, 134(3):273–290, 2001.

211 [102] The Sailfish Software Framework, http://sailfish.us.edu.pl/. [103] The HEMELB Software Framework, http://www.2020science.net/ software/hemelb. [104] The WaLBerla Software Framework, http://www10.informatik. uni-erlangen.de/Research/Projects/walberla/. [105] The Palabos Software Framework, http://www.palabos.org/. [106] The OpenLB Software Framework, http://optilb.org/openlb/. [107] The LB3D Software Framework, http://mtp.phys.tue.nl/lb3d/. [108] C. Feichtinger, S. Donath, H. Köstler, J. Götz, and U. Rüde. WaLBerla: HPC Software Design for Computational Engineering Simulations. Journal of Computational Science, 2(2):105–112, 2011. [109] A. Parmigiani, C. Huber, O. Bachmann, and B. Chopard. Pore-Scale Mass and Reactant Transport in Multiphase Porous Media Flows. Journal of Fluid Mechanics, 686:40–76, 2011. [110] B. Chopard, D. Lagravay, O. Malaspinasy, and R. Ouaredy. A Lattice Boltz- mann Modeling of Bloidflow in Cerebal Aneurysms. In J.C.F. Pereira and A. Sequeira, editors, Proceedings of the European Conference on Computational Fluid Dynamics, Lisbon/Portugal 2010, page 12, 2010. [111] Recent Articles Regarding the OpenLB Software Framework, http:// optilb.org/openlb/articles. [112] M. Wittmann, G. Hager, J. Treibig, and G. Wellein. Leveraging Shared Caches for Parallel Temporal Blocking of Stencil Codes on Multicore Proces- sors and Clusters. Parallel Processing Letters, 20(4):359–376, 2010. [113] M. Bernaschi, S. Melchionna, S. Succi, M. Fyta, E. Kaxiras, and J.K. Sir- car. MUPHY: A Parallel MUlti PHYsics/scale Code for High Performance Bio-Fluidic Simulations. Computer Physics Communications, 180(9):1495–1502, 2009. [114] A. Peters, S. Melchionna, E. Kaxiras, J. Lätt, J. Sircar, M. Bernaschi, M. Bi- son, and S. Succi. Multiscale Simulation of Cardiovascular Flows on the IBM Bluegene/P: Full Heart-Circulation System at Red-Blood Cell Resolution. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2010, pages 1–10. EEE Computer Society Washington, DC, USA, 2010. [115] M. Bisson, M. Bernaschi, S. Melchionna, S. Succi, and E. Kaxiras. Multi- scale Hemodynamics Using GPU Clusters. Communications In Computational Physics, 11:48–64, 2011.

212 [116] P. Neumann and T. Neckel. A Dynamic Mesh Refinement Technique for Lattice Boltzmann Simulations on Octree-Like Grids. Computational Mechan- ics, 2012. Published online. [117] T. Weinzierl. A Framework for Parallel PDE Solvers on Multiscale Adaptive Cartesian Grids. PhD thesis, Institut für Informatik, Technische Universität München, 2009. [118] H-J. Bungartz, M. Mehl, T. Neckel, and T. Weinzierl. The PDE Framework Peano Applied to Fluid Dynamics: an Efficient Implementation of a Parallel Multiscale Fluid Dynamics Solver on Octree-Like Adaptive Cartesian Grids. Computational Mechanics, 46:103–114, 2010. [119] S. Freudiger, J. Hegewald, and M. Krafczyk. A Parallelization Concept for a Multi-Physics Lattice Boltzmann Prototype Based on Hierarchical Grids. Progress in Computational Fluid Dynamics, 8(1-4):168–178, 2008. [120] J. Linxweiler. Ein Integrierter Softwareansatz zur Interaktiven Exploration und Steuerung von Strömungssimulationen auf Many-Core-Architekturen. PhD thesis, Technische Universität Braunschweig, 2011. [121] S. Geller. Ein explizites Modell fur die Fluid-Struktur-Interaktion basierend auf LBM und p-FEM. PhD thesis, Technische Universität Braunschweig, 2010. [122] G. Wellein. The Lattice Boltzmann Method: Basic Performance Character- istics and Performance Modeling. Talk: SIAM Conference on Computational Science and Engineering, 2011. [123] S. Goedecker and A. Hoisie. Performance Optimization of Numerically In- tensive Codes. SIAM Conference on Computational Science and Engineering, Philadelphia, USA, 2001. [124] W. Schönauer. Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers. Karlsruhe, 2000. [125] J. Treibig, G. Wellein, and G. Hager. Efficient Multicore-Aware Paralleliza- tion Strategies for Iterative Stencil Computations. Journal of Computational Science, 2(2):130–137, 2011. [126] J. Treibig, G. Hager, and G. Wellein. Complexities of Performance Pre- diction for Bandwidth-Limited Loop Kernels on Multi-Core Architectures. In S. Wagner, M. Steinmetz, A. Bode, and M.M. Müller, editors, High Per- formance Computing in Science and Engineering, Garching/Munich 2009, pages 3–12. Springer, 2010. [127] T. Pohl, M. Kowarschik, J. Wilke, K. Iglberger, and U. Rüde. Optimization and Profiling of the Cache Performance of Parallel Lattice Boltzmann Codes in 2D and 3D. PARALLEL PROCESSING LETTERS, 13(4):549–560, 2003.

213 [128] Q. Xiong, B. Li, J. Xu, X. Fang, X. Wang, L. Wang, X. He, and W. Ge. Efficient Parallel Implementation of the Lattice Boltzmann Method on Large Clusters of Graphic Processing Units. Chinese Science Bulletin, 57:707–715, 2012. [129] M. Bernaschi, M. Fatica, S. Melchionna, S. Succi, and E. Kaxiras. A Flexible High-Performance Lattice Boltzmann GPU Code for the Simulations of Fluid Flows in Complex Geometries. Concurrency and Computation: Practice and Experience, 22(1):1–14, 2010. [130] X. Wang and T. Aoki. Multi-GPU Performance of Incompressible Flow Computation by Lattice Boltzmann Method on GPU Cluster. Parallel Com- puting, 37(9):521–535, 2011. [131] P. Zaspel and M. Griebel. Solving Incompressible Two-Phase Flows on Multi-GPU Clusters. Computers and Fluids, 2012. [132] T. Shimokawabe, T. Aoki, T. Takaki, T. Endo, A. Yamanaka, N. Maruyama, A. Nukada, and S. Matsuoka. Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, Seattle Washington 2011, pages 3:1–3:11. ACM, 2011. [133] R. Rabenseifner, G. Hager, and G. Jost. Hybrid MPI/OpenMP Parallel Pro- gramming on Clusters of Multi-Core SMP Nodes. In 17th Euromicro Inter- national Conference on Parallel, Distributed and Network-based Processing, 2009, pages 427–436, 2009. [134] A. Debudaj-Grabysz and R. Rabenseifner. Nesting OpenMP in MPI to Im- plement a Hybrid Communication Method of Parallel Simulated Annealing on a Cluster of SMP Nodes. In Proceedings of the 12th European PVM/MPI Users’ Group Conference on Recent Advances in Parallel Virtual Machine and Mes- sage Passing Interface, pages 18–27. Springer, 2005. [135] E. Gabriel, F. Sheng, R. Keller, and M. Resch. A Framework for Compara- tive Performance Analysis of MPI Applications. In The 2006 International Con- ference on Parallel and Distributed Processing Techniques and Applications, 2006. [136] D.J. Kerbyson and K.J. Barker. A Performance Model of Direct Numerical Simulation for Analyzing Large-Scale Systems. In 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pages 1824 –1830, 2011. [137] Information on OpenMP, http://www.openmp.org/. [138] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Program- ming With the Message-Passing Interface. MIT Press, 1994.

214 [139] K. Pickl, J. Götz, K. Iglberger, J. Pande, K. Mecke, A.-S. Smith, and U. Rüde. All Good Things Come in Threes - Three Beads Learn to Swim With Lattice Boltzmann and a Rigid Body Solver. CoRR, abs/1108.0786, 2011. [140] C. Feichtinger, J. Habich, H. Köstler, G. Hager, U. Rüde, and G. Wellein. A Flexible Patch-Based Lattice Boltzmann Parallelization Approach for Hetero- geneous GPU–CPU Clusters. Parallel Computing, 37(9):536–549, 2011. [141] H. Köstler. A robust geometric multigrid solver within the WaLBerla framework. Talk: 15th Copper Mountain Conference on Multigrid Methods, 2011. [142] D. Bartuschat, D. Ritter, and U. Rüde. Parallel Multigrid for Electrokinetic Simulation in Particle-Fluid Flows. In 2012 International Conference on High Performance Computing and Simulation (HPCS). IEEE, 2012. Accepted for pub- lication. [143] K. Iglberger and U. Rüde. Massively Parallel Granular Flow Simulations With Non-Spherical Particles. Computer Science - Research and Development, 25:105–113, 2010. [144] S. Williams, A. Waterman, and D. Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM, 52(4):65–76, 2009. [145] J.D. McCalpin. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Com- puter Architecture Newsletter, pages 19–25, 1995. [146] G. Wellein, J. Habich, G. Hager, and T. Zeiser. Node-Level Performance of the Lattice Boltzmann Method on Recent Multicore CPUs. Extended abstract ParCFD, 2011. http://parcfd2011.bsc.es/sites/default/files/ abstracts/id141-wellein.pdf. [147] The Intel MPI Benchmark User Guide 3.2.3 . [148] Information on the Intel Trace Analyzer & Collector, http://software. intel.com/en-us/articles/intel-trace-analyzer/. [149] Information on the Likwid Performance Tools, http://code.google. com/p/likwid/. [150] M. Anitescu. Optimization-Based Simulation of Nonsmooth Rigid Multi- body Dynamics. Math. Program., 105(1):113–143, 2006. [151] J. Götz. Massively Parallel Simulation of Particulate Flows. PhD thesis, Chair for System Simulation, University of Erlangen, 2012.

215 [152] L.-S. Luo Y. Peng. A Comparative Study of Immersed-Boundary and In- terpolated Bounce-Back Methods in LBE. Progress in Computational Fluid Dy- namics, 8(1-4):156–167, 2008. [153] K. Iglberger. Lattice-Boltzmann Simulation of Flow Around Moving Par- ticles. Master thesis, Chair for System Simulation, University of Erlangen, 2005. [154] C. Feichtinger. Drag Force Simulations of Particle Agglomerates With the Lattice-Boltzmann Method. Bachelor thesis, Chair for System Simulation, University of Erlangen, 2005. [155] K. Iglberger, N. Thürey, and U. Rüde. Simulation of Moving Particles in 3D With the Lattice Boltzmann Method. Computers and Mathematics with Ap- plications, 55(7):1461–1468, 2008. [156] J.F. Brady and G. Bossis. Stokesian Dynamics. Annu. Rev. Fluid Mech., 20(1):111–157, 1988. [157] C. Binder, C. Feichtinger, H.J. Schmid, N. Türey, W. Peukert, and U. Rüde. Simulation of the Hydrodynamic Drag of Aggregated Particles. Journal of Colloid and Interface Science, 301:155–167, 2006. [158] K. Höfler, M. Müller, S. Schwarzer, and B. Wachmann. Interacting Particle- Liquid Systems. In E. Krause and W. Jäger, editors, High Performance Comput- ing in Science and Engineering ’98, pages 54–64. Springer, 1998. [159] T.-W. Pan, D.D. Joseph, R. Bai, R. Glowinski, and V. Sarin. Fluidization of 1204 Spheres: Simulation and Experiment. J. Fluid Mech., 451(1):169–191, 2002. [160] M. Griebel, S. Knapek, and G. Zumbusch. Numerical Simulation in Molecular Dynamics. Numerics, Algorithms, Parallelization, Applications, volume 5 of Texts in Computational Science and Engineering. Springer, 2007. [161] A.J.C. Ladd. Numerical Simulations of Particulate Suspensions via a Dis- cretized Boltzmann Equation. Part 1. Theoretical Foundation. J. Fluid Mech., 271:285–309, 1994. [162] A.J.C. Ladd. Numerical Simulations of Particulate Suspensions via a Dis- cretized Boltzmann Equation. Part 2. Numerical Results. J. Fluid Mech., 271:311–339, 1994. [163] C.K. Aidun, Y. Lu, and E.-J. Ding. Direct Analysis of Particulate Suspen- sions With Inertia Using the Discrete Boltzmann Equation. J. Fluid Mech., 373:287–311, 1998. [164] D. Qi. Lattice-Boltzmann Simulations of Particles in Non-Zero-Reynolds- Number Flows. J. Fluid Mech., 385:41–62, 1999.

216 [165] R. Mei, L.-S. Luo, P.Lallemand, and D. d’Humières. Consistent Initial Con- ditions for Lattice Boltzmann Simulations. Computers and Fluids, 35(8-9):855– 862, 2006. [166] L.G. Gibilaro. Fluidization-Dynamics: The Formulation and Applications of a Predictive Theory for the Fluidized State. Chemical Petrochemical & Process. Butterworth-Heinemann, 2001. [167] V.A. Parsegian. Van Der Waals Forces: A Handbook for Biologists, Chemists, Engineers, and Physicists. Cambridge University Press, 2006. [168] W.B. Russel, D.A Saville, and W.R. Schowalter. Colloidal Dispersions. Cam- bridge Monographs on Mechanics and Applied Mathematics. Cambridge University Press, 1992. [169] X.-Q. Wang and A.S. Mujumdar. Heat Transfer Characteristics of Nanoflu- ids: a Review. International Journal of Thermal Sciences, 46(1):1–19, 2007. [170] S.P. Jang and S.U.S. Choi. Effects of Various Parameters on Nanofluid Ther- mal Conductivity. Journal of Heat Transfer, 129(5):617–623, 2007. [171] C.H. Chon, K.D. Kihm, S.P. Lee, and S.U.S. Choi. Empirical Correlation Finding the Role of Temperature and Particle Size for Nanofluid (Al2O3) Thermal Conductivity Enhancement. Applied Physics Letters, 87(15):153107, 2005. [172] Department of Chemical Engineering, MCGill University, http:// coulombe-group.mcgill.ca/Coulombe_group/Welcome.html. [173] S.V. Patankar. Numerical Heat Transfer and Fluid Flow. Series in Computa- tional Methods in Mechanics and Thermal Sciences. Hemisphere Pub. Corp., 1980. [174] A.J.C. Ladd. Short-Time Motion of Colloidal Particles: Numerical Simula- tion via a Fluctuating Lattice-Boltzmann Equation. Phys. Rev. Lett., 70:1339– 1342, 1993. [175] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux. A New Approach to the Lattice Boltzmann Method for Graphics Processing Units. Computers & Mathematics with Applications, 61(12):3628–3638, 2011. [176] I. Ginzburg and K. Steiner. Lattice Boltzmann Model for Free-Surface Flow and Its Application to Filling Process in Casting. J. Comp. Phys., 185:61–99, 2003. [177] C. Körner, M. Thies, and R.F. Singer. Modeling of Metal Foaming With Lattice Boltzmann Automata. Adv. Eng. Mat., 4:765–769, 2002.

217 [178] E. Attar. Simulation of Selective Electron Beam Melting Processes. PhD thesis, Chair of Metals Science and Technology, University of Erlangen, 2011. [179] A. Najafi and R. Golestanian. Simple Swimmer at Low Reynolds Number: Three linked Spheres. Phys. Rev. E, 69:062901, 2004. [180] K. Pickl. Rigid Body Dynamics: Links and Joints. Master’s thesis, Chair for System Simulation, University of Erlangen, 2009. [181] K. Masilamani, S. Ganguly, C. Feichtinger, and U. Rüde. Hybrid Lattice- Boltzmann and Finite-Difference Simulation of Electroosmotic Flow in a Mi- crochannel. Fluid Dynamics Research, 43(2):025501, 2011. [182] K. Masilamani. WaLBerla: Investigation of Electrostatic Effects in Particu- late and Electro-Osmotic Flows. Diploma Thesis, Chair for System Simula- tion, University of Erlangen, 2010.

Selected Conference and Journal Publications

[1] J. Habich, C. Feichtinger, H. Köstler, G. Hager, and G. Wellein. Performance Engineering for the Lattice Boltzmann Method on GPGPUs: Architectural Requirements and Performance Results. Computers & Fluids, abs/1112.0850, 2012. [2] C. Feichtinger, J. Habich, H. Köstler, G. Hager, U. Rüde, and G. Wellein. A Flexible Patch-Based Lattice Boltzmann Parallelization Approach for Heterogeneous GPU–CPU Clusters. Parallel Computing, 37(9):536–549, 2011. [3] K. Masilamani, S. Ganguly, C. Feichtinger, and U. Rüde. Hybrid Lattice-Boltzmann and Finite-Difference Simulation of Electroosmotic Flow in a Microchannel. Fluid Dynamics Research, 43(2):025501, 2011. [4] C. Feichtinger, S. Donath, H. Köstler, J. Götz, and U. Rüde. WaLBerla: HPC Software Design for Computational Engineering Simulations. Journal of Computational Science, 2(2):105–112, 2011. [5] J. Götz, K. Iglberger, C. Feichtinger, S. Donath, and U. Rüde. Coupling Multibody Dynamics and Computational Fluid Dynamics on 8192 Processor Cores. Parallel Computing, 36(2-3):142–151, 2010. [6] C. Feichtinger, J. Götz, S. Donath, K. Iglberger, and U. Rüde. Parallel Computing. Numerics, Applications, and Trends, chapter WaLBerla: Exploiting Massively Parallel Systems for Lattice Boltzmann Simulations, pages 241–260. Springer, 2009. [7] S. Donath, C. Feichtinger, T. Pohl, J. Götz, and U. Rüde. Localized Parallel Algorithm for Bubble Coalescence in Free Surface Lattice-Boltzmann Method. In Euro-Par 2009 Parallel Processing, volume 5704 of Lecture Notes in Computer Science, pages 735–746. Springer, 2009. [8] S. Donath, C. Feichtinger, T. Pohl, J. Götz, and U. Rüde. A Parallel Free Surface Lattice Boltzmann Method for Large-Scale Applications. In R. Biswas, editor, Parallel CFD 2009, 21st International Conference on Parallel Computational Fluid Dynamics, May 18-22, 2009, Moffett Field, CA, USA, pages 198–202, 2009.

[9] J. Götz, C. Feichtinger, K. Iglberger, S. Donath, and U. Rüde. Large Scale Simulation of Fluid Structure Interaction using Lattice Boltzmann Methods and the ‘Physics Engine’. In Geoffry N. Mercer and A. J. Roberts, editors, Proceedings of the 14th Biennial Computational Techniques and Applications Conference, CTAC-2008, volume 50, pages C166–C188, 2008. [10] C. Binder, C. Feichtinger, H.J. Schmid, N. Türey, W. Peukert, and U. Rüde. Simulation of the Hydrodynamic Drag of Aggregated Particles. Journal of Colloid and Interface Science, 301:155–167, 2006.
