
Copyright 2015 Babak Behzad

OPTIMIZING PARALLEL I/O PERFORMANCE OF HPC APPLICATIONS

BY

BABAK BEHZAD

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2015

Urbana, Illinois

Doctoral Committee:

Professor Marc Snir, Chair
Professor Marianne Winslett
Professor William Gropp
Doctor Dean Hildebrand, IBM Almaden Research Center

ABSTRACT

Parallel I/O is an essential component of modern High Performance Computing (HPC). Obtaining good I/O performance for a broad range of applications on diverse HPC platforms is a major challenge, in part because of complex inter-dependencies between I/O middleware and hardware. The parallel file system and I/O middleware layers all offer optimization parameters that can, in theory, result in better I/O performance. Unfortunately, the right combination of parameters is highly dependent on the application, the HPC platform, and the problem size and concurrency. Scientific application developers do not have the time or expertise to take on the substantial burden of identifying good parameters for each problem configuration. They resort to using system defaults, a choice that frequently results in poor I/O performance. We expect this problem to be compounded on exascale-class machines, which will likely have a deeper software stack with hierarchically arranged hardware resources.

We present a line of solutions to this problem, comprising an autotuning system for optimizing I/O performance, I/O performance modeling, I/O tuning, I/O kernel generation, and I/O pattern analysis. We demonstrate the value of these solutions across platforms and applications, and at scale.

To my loving family and friends, who helped me on this journey.

ACKNOWLEDGMENTS

I would like to express my deepest appreciation to Marc Snir, whom I was thrilled to have as my adviser. I also greatly appreciate the help and support of Marianne Winslett, Bill Gropp, and Dean Hildebrand. Without their support this work would not have been possible. Let me also thank LBL's Prabhat and Suren Byna, and all the members of the ExaHDF5 project, for their continuous support. I also would like to thank Quincey Koziol and Ruth Aydt from The HDF Group.

My labmates deserve a special mention. I spent many hours with them, shared many ideas, and discussed many issues. They taught me as much as the courses did. Fredrik Kjolstad, Aparna Sasidharan, Jon Calhoun, Alex Brooks, Hoang-Vu Dang, and Farah Hariri were the best labmates one could hope for.

Without the love and help of my family this would not have been possible. I would like to first thank my sister and brother-in-law, Banafsheh Behzad and Hadi Tavassol, for always being there for me. And finally, I thank my mom and dad, Simin Samadian and Mohammad Reza Behzad, for giving me the opportunity to be on this journey.

GRANTS

This work is supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center, the Texas Advanced Computing Center, and the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. It was partly supported by NSF grant 0938064.

This work was supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S.
Department of Energy, under contract numbers DE-AC02-05CH11231 and DE-AC02-06CH11357. This research used resources of the National Energy Research Scientific Computing Center.

This work is supported by NSF grant 0938064; by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231; and by The HDF Group. This research used resources of the Texas Advanced Computing Center.

This work is supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contracts DE-AC02-05CH11231 and DE-AC02-06CH11357. It used resources of the Texas Advanced Computing Center.

TABLE OF CONTENTS

List of Figures
List of Tables
List of Algorithms

CHAPTER 1  Introduction
  1.1  Dissertation Organization

CHAPTER 2  Background
  2.1  Parallel I/O
  2.2  HPC Platforms
  2.3  Application I/O Kernels

CHAPTER 3  Taming Parallel I/O Complexity with Autotuning
  3.1  Autotuning Framework
    3.1.1  H5Evolve: Sampling the search space
    3.1.2  H5Tuner: Setting I/O parameters at runtime
  3.2  Experimental Setup
    3.2.1  Scale and dataset sizes
    3.2.2  Parameter space
  3.3  Results
    3.3.1  Tuned I/O performance results
    3.3.2  Tuned configurations
    3.3.3  Tuned I/O performance across platforms
    3.3.4  Tuned I/O for different benchmarks
    3.3.5  Tuned I/O at different scales
  3.4  Conclusions

CHAPTER 4  Improving Parallel I/O Autotuning with Performance Modeling
  4.1  Experimental Setup
  4.2  Empirical Performance Models
    4.2.1  Nonlinear regression model preliminaries
    4.2.2  Development of I/O performance models
  4.3  Integration of Performance Models in the Autotuning Framework
  4.4  Experimental Results
    4.4.1  Performance models vs. genetic algorithms
    4.4.2  Testing on a space similar to training
    4.4.3  Testing on a larger space
    4.4.4  Testing on a different application: VORPAL-IO
    4.4.5  Testing on a larger scale
    4.4.6  Large-scale results
    4.4.7  Overall improvement
    4.4.8  Analysis of the interdependencies
  4.5  I/O Interference
  4.6  Conclusions

CHAPTER 5  A Multi-Level Approach for Understanding I/O Activity in HPC Applications
  5.1  Framework
  5.2  Evaluation
    5.2.1  VPIC-IO Benchmark
    5.2.2  Simple HDF5 Benchmark
  5.3  Conclusions

CHAPTER 6  Automatic Generation of I/O Kernels for HPC Applications
  6.1  Framework
    6.1.1  I/O Tracing: Recorder
    6.1.2  Trace Merging
    6.1.3  Code Generation
  6.2  Setup and Evaluation Results
    6.2.1  Correctness of the framework
    6.2.2  Quality of the generated code
  6.3  Conclusions

CHAPTER 7  Pattern-driven Parallel I/O Tuning
  7.1  I/O Autotuning Framework
    7.1.1  I/O Traces
    7.1.2  Extraction and Identification of High-level I/O Patterns
  7.2  Setup and Evaluation Results
    7.2.1  An application with the same I/O pattern
    7.2.2  An application with a similar I/O pattern
    7.2.3  A new application
  7.3  Conclusions

CHAPTER 8  Related Work
  8.1  Autotuning
  8.2  I/O Modeling
  8.3  I/O Recording
  8.4  I/O Replaying
  8.5  I/O Patterns

CHAPTER 9  Concluding Remarks
  9.1  Comparison of the approaches
  9.2  Contributions
  9.3  Future Research Directions

REFERENCES

LIST OF FIGURES

2.1  Parallel I/O stack and various tunable parameters
2.2  An illustration of Cray CB algorithm 2
2.3  Partitioning of file domains and processors between aggregators in VPIC-IO when the Lustre stripe size is (a) 16 MB, (b) 128 MB
2.4  3D block structure of VORPAL-IO datasets in HDF5
3.1  Overall architecture of the autotuning framework
3.2  A pictorial depiction of the genetic algorithm used in the autotuning framework
3.3  Design of the H5Tuner component as a dynamic library that intercepts HDF5 functions to tune I/O parameters
3.4  An XML file showing a sample configuration with optimization parameters at different levels of the parallel I/O stack. The tuning can