Comparison and End-to-End Performance Analysis of Parallel File Systems

Dissertation

submitted for the attainment of the academic degree Doktor-Ingenieur (Dr.-Ing.)

submitted to the Technische Universität Dresden, Fakultät Informatik (Faculty of Computer Science)

submitted by

Diplom-Ingenieur für Informationssystemtechnik Michael Kluge, born on 13 February 1974 in Torgau/Elbe

Reviewers: Prof. Dr. rer. nat. Wolfgang E. Nagel, Technische Universität Dresden; Prof. Dr. rer. nat. habil. Thomas Ludwig, Universität Hamburg

Date of submission: 10 January 2011; date of defense: 5 September 2011


Time estimates of a distinguished engineer

James T. Kirk: How much refit time before we can take her out again?
Montgomery Scott: Eight weeks, Sir, but ya don’t have eight weeks, so I’ll do it for ya in two.
James T. Kirk: Mr. Scott. Have you always multiplied your repair estimates by a factor of four?
Montgomery Scott: Certainly, Sir. How else can I keep my reputation as a miracle worker?

from the movie ’Star Trek III: The Search for Spock’

Acknowledgment

First of all I’d like to thank everyone who supported this thesis, either by advice or by sharing and discussing ideas. The first I want to thank is Prof. Wolfgang E. Nagel for the support of my work at the ZIH at TU Dresden and for advising this thesis. After joining the former ZHR in 2002 as a student assistant I had the chance to be a part of different projects, work abroad for six months and finally to become a regular staff member. Next, I’d like to thank my reviewer Prof. Thomas Ludwig for his advice and feedback. I’d like to thank all colleagues that worked closely with me within these years, guided me, and shared thoughts. Especially Andreas Knüpfer for constantly reminding me to focus on the completion of this thesis, for discussing the structure and the content and for setting a good example. And Guido Juckeland for taking over my parts of the tasks concerning the HRSK complex. Ralph Müller-Pfefferkorn, Matthias Müller and Holger Brunst I want to thank for reviewing my compositions. From Indiana University I want to thank Stephen Simms and Robert Henschel for their input concerning the file system. Without the opportunity to take part in the Bandwidth Challenge at the SuperComputing 2007 conference in Reno a lot of things would have developed differently. Holger Mickler and Thomas Ilsche have written their Master’s theses under my supervision on the topic of my doctoral thesis and contributed to different parts. I have to thank them for their basic groundwork that opened our eyes to the global problems. My wife Grit and my kids Ronja and Tjorven I thank for tolerating my countless attempts to convert family time into work time. I’d like to thank my grandparents, parents and my brother for believing in my work and for their support.

Abstract

This thesis presents a contribution to the field of performance analysis for Input/Output (I/O) related problems, focusing on the area of High Performance Computing (HPC). Besides the compute nodes, High Performance Computing systems need a large number of supporting components that add their individual behavior to the overall performance characteristic of the whole system. Especially file systems in such environments have their own infrastructure. File operations are typically initiated at the compute nodes and proceed through a deep software stack until the file content arrives at the physical medium. A handful of shortcomings characterize the current state of the art for performance analyses in this area: the lack of system wide data collection, of a comprehensive analysis approach for all collected data, of a trace event analysis adjusted to I/O related problems, and of methods to compare current with archived performance data.

This thesis proposes to instrument all soft- and hardware layers to enhance the performance analysis for file operations. The additional information can be used to investigate performance characteristics of parallel file systems. To perform I/O analyses on HPC systems, a comprehensive approach is needed to gather related performance events, examine the collected data and, if necessary, replay relevant parts on different systems. One larger part of this thesis is dedicated to algorithms that reduce the amount of information found in trace files to the level that is needed for an I/O analysis. This reduction is based on the assumption that for this type of analysis all I/O events, but only a subset of the synchronization events of a parallel program trace, have to be considered. To extract an I/O pattern from an event trace, only those synchronization points are needed that describe dependencies among different I/O requests. Two algorithms are developed to remove negligible events from the event trace.

Considering the related work on the analysis of parallel file systems, the inclusion of counter data from external sources, e.g. the infrastructure of a parallel file system, has been identified as a major milestone towards a holistic analysis approach. This infrastructure contains a large amount of valuable information that is essential to describe performance effects observed in applications. This thesis presents an approach to collect and subsequently process and store these data. Ways to correctly merge the collected values with application traces are discussed. Here, a revised definition of the term “performance counter” is the first step, followed by a tree based approach to combine raw values into secondary values. A visualization approach for I/O patterns closes another gap in the analysis process. Replaying I/O related performance events or event patterns can be done by a flexible I/O benchmark. The constraints for the development of such a benchmark are identified, as well as the overall architecture for a prototype implementation.

Finally, different examples demonstrate the usage of the developed methods and show their potential. All examples are real use cases and are situated on the HRSK research complex and the 100GBit Testbed at TU Dresden. The I/O related parts of a bioinformatics and a CFD application have been analyzed in depth and enhancements for both are proposed. An instance of a Lustre file system was deployed and tuned on the 100GBit Testbed by the extensive use of external performance counters.

Contents

Abstract

1 Introduction
  1.1 Introduction to High Performance Computing and Parallel File Systems
  1.2 Performance Analysis in the Context of I/O and Contributions
  1.3 Organization of this Thesis

2 I/O in the Context of High Performance Computing
  2.1 A Brief History of HPC I/O
  2.2 Local, Distributed and Parallel File Systems
  2.3 User Access to Parallel File Systems
    2.3.1 POSIX I/O
    2.3.2 MPI-I/O
    2.3.3 HDF5
    2.3.4 (p)NetCDF
  2.4 Administrative Challenges
  2.5 Future Directions

3 Analysis of Current Parallel File Systems and Related Work
  3.1 Parallel Metrics
    3.1.1 Taxonomy of Parallel File Systems
    3.1.2 Performance Metrics for Parallel File Systems
  3.2 Selected Parallel File Systems
    3.2.1 GPFS
    3.2.2 CXFS
    3.2.3 Lustre
    3.2.4 PanFS
    3.2.5 pNFS
    3.2.6 Comparison
    3.2.7 Other Present Parallel File Systems
  3.3 Storage Hardware
    3.3.1 RAID
    3.3.2 Integration Factor of Storage Hardware
    3.3.3 Further Developments
  3.4 State of the Art for Storage System Monitoring
    3.4.1 Definitions
    3.4.2 Storage Hardware Monitoring
  3.5 State of the Art for I/O Tracing
    3.5.1 Tracing at the Kernel Level
    3.5.2 Tracing at the User Space
    3.5.3 Multi-Layer I/O Tracing
  3.6 State of the Art for I/O Benchmarking
    3.6.1 Overview
    3.6.2 Fixed I/O Benchmarks
    3.6.3 Flexible I/O Benchmarks
  3.7 Approach for a Comprehensive I/O Analysis
  3.8 Summary

4 End-To-End Performance Analysis of Large SAN Structures
  4.1 System Wide Performance Analysis
    4.1.1 Redefinition of the Term “Performance Counter”
    4.1.2 Extended Notion of Hardware Counters
    4.1.3 Examples for Extended Counters
    4.1.4 Consequences from the Users Perspective
    4.1.5 Calculation of Secondary Counters
    4.1.6 Layered Counter Collection Architecture
    4.1.7 Secondary Counter Calculation Tree
    4.1.8 Timing Issues in Distributed Data Collection Environments
    4.1.9 Timestamp Calibration for Data Sources
    4.1.10 Layout of the System Design
  4.2 I/O Tracing, Analysis and Replay on Parallel File Systems
    4.2.1 Extended I/O Tracing
    4.2.2 Intermediate Data Representation
    4.2.3 Elimination of Redundant Synchronization Points from Event Traces
    4.2.4 Pattern Matching for POSIX I/O
    4.2.5 I/O Pattern Visualization
    4.2.6 I/O Replay and Benchmarking
  4.3 Summary

5 Prototype Implementation and Evaluation
  5.1 Deployment
    5.1.1 SGI Altix 4700
    5.1.2 NetworkX PC Farm
    5.1.3 System Monitor Deployment
  5.2 Challenges of the Implementation
    5.2.1 Timer and Offset Calibration
    5.2.2 Connection of the Dataheap Prototype and the Trace Facility
    5.2.3 Supported Systems/Hardware, Permanent Storage Containers
  5.3 Evaluation
    5.3.1 Scalability Analysis of a Bioinformatics Application
    5.3.2 I/O Analysis of the Semtex Application
    5.3.3 Tuning Lustre for Bandwidth on a 100GBit/s Link
    5.3.4 Summary

6 Conclusion and Outlook
  6.1 Contributions
  6.2 Future Work
  6.3 Future Research

List of Figures

Bibliography

1 Introduction

This chapter provides an introduction to the topic of this thesis. At first it introduces High Performance Computing and parallel file systems in general and the requirements for a performance analysis and optimization in this domain. Furthermore, it explains the current approaches and problems in this area and shows the contribution of this thesis.

1.1 Introduction to High Performance Computing and Parallel File Systems

Executing long-running simulations on large compute resources is a common task in many scientific areas. High Performance Computing (HPC) has become a vital part of the scientific community. A typical setup for a research project is to couple laboratory experiments with simulations on HPC resources to validate results, predict possible outcomes and to explore future research strategies. Reading and writing large amounts of data is an essential part of most scientific parallel programs. But the challenge of handling lots of data is not limited to the HPC community. Typical use cases that involve huge Input/Output (I/O) requirements include the handling of input data that can be processed independently (High Throughput/Capacity Computing) or the supply of data sets for on-demand access (Rich Media/Databases).

Disk access time is precious because most applications block until the data has been transferred to/from the main memory during the disk access. The measure of how well an application uses an HPC system is typically the number of useful operations done on this system within a certain amount of time. Useful operations are in most cases either floating point or integer calculations. Hence, all blocking communications and I/O activities, as well as any overhead introduced through parallel programming, are reasons why users cannot exploit the full performance of an HPC system. The requirement profile from the users’ point of view is driven by the wish to finish those parts of a program that do not perform calculations as fast as possible while keeping the costs within an affordable limit. Thus, for solving large scale problems on HPC machines, efficient parallel programming necessarily has to be accompanied by fast parallel I/O access.

The gap between the speed of the processor and the attached memory is increasing, as is the gap between the speed of the memory and the disk drives. The problem of how to move data fast enough between the main memory and the disks grows with this gap. Storage systems are built by connecting lots of hard disk drives, network hardware components as well as several software layers into a fast and highly reliable file system. Parallel applications rely on file systems that are consistent over different nodes or even clusters and provide features like a fixed set of access semantics, high capacity and throughput, parallel data and metadata access, reliability, and fault tolerance. As the performance growth of a computer system is no longer defined by the clock frequency of the processor but by the number of parallel computing cores within a machine, there is a growing demand for storage capacity and bandwidth. The amount of storage capacity shipped by all vendors together has on average grown by more than 40% each year. According to IDC1, a total of 1.3 EB (1.3 × 10^18 Bytes) was shipped to customers in 2007, 2.1 EB in 2008 and 3.3 EB in 2009.

1http://www.idc.com/getdoc.jsp?containerId=prUS23012911 (13.09.2011)

Within an HPC environment, parallel file systems provide the necessary I/O bandwidth and capacity to the compute resources. These file systems are built from many individual soft- and hardware components and face some major challenges today. The biggest challenge, the furious increase in the number of cores per system, demands higher rates of metadata operations. From the requirement to design balanced systems in terms of I/O bandwidth and compute power follows immediately the need for more disks and therefore a larger infrastructure. For example, the ASCI Purple system2 installed in 2004 has 1536 individual IBM Power5 nodes with a 2 PB GPFS file system that provides a bandwidth of approx. 122 GB/s. It was built by using 500 RAID controllers and about 11,000 S-ATA and FC disks.

If the bandwidth of a file system is divided by the number of disks that support this file system, the average disk utilization at the maximum file system speed can be calculated. For ASCI Purple, a supercomputer brought into service at the Lawrence Livermore National Laboratory in 2005, this number is around 11 MB/s. In 2004 one individual enterprise S-ATA disk delivered a bandwidth of about 90 MB/s. Thus, less than 15% of the available bandwidth of the disks is actually available to applications running on ASCI Purple. As each disk consumes about 30 Watt, only 1355 disks would be needed if they were running at maximum bandwidth, saving about 290 kW of power just for the disks. Finding the root cause for this problem and designing new file systems differently would help to save energy and to reduce the total cost of ownership of such a system. On smaller systems the utilization of the raw disk bandwidth is usually not better. For example, the file system at the HRSK complex at Technische Universität Dresden consists of 640 FC disks and delivers a bandwidth of 8 GB/s (12.5 MB/s per disk) to applications.

The introduction of Solid State Disk Drives (SSDs) has already started to influence current storage products and will hopefully change this picture. At this time the most common use case for SSDs in HPC storage environments are file system log files. As SSDs are rather small and provide a very high bandwidth for sequential write accesses, they are very well suited for file system journals. By placing the metadata intensive part (journal, inodes) and the bandwidth oriented part (file content) of a file system onto different components, both are able to work more efficiently.

Another challenge for parallel file systems are unified site-wide file systems (multi-cluster file systems). The advantage of this approach is the availability of all data at different locations, which enhances the users’ possibilities to handle data. Having fewer file systems should also ease the administration as isolated storage islands are left behind. The drawback from the viewpoint of the performance analyst is that I/O requests to a server may now come through different interfaces at the same time, maybe even through different network technologies. Such a distributed HPC I/O environment needs a new type of analysis approach.

For users, I/O on an HPC system is difficult to understand. Parallel high performance file systems are usually presented by so-called ’hero’ numbers that show the capabilities of the file system under extreme circumstances.
The two numbers that are typically presented are the maximum number of bytes transferred per second with very large sequential access patterns and the number of executed I/O calls per second for very small and random I/O access patterns. Both characterize the file system in some way but do not allow the user to predict the number he will get with his specific access pattern. Hero numbers also do not provide the system analyst with the necessary knowledge to explain to the user why the performance of his I/O requests is as bad as it is. Even if the I/O requests are well formed, it is hard for the user and the system analyst to understand how the system behaves with a lot of different users doing their work in parallel. Currently deployed file systems on very large parallel machines also do not allow users to specify a quality of service (QoS) or some kind of performance allocation. User access to parallel file systems has not changed much during the last decade [DB99, Day08]. Most files stored by users in a parallel file system are still rather small.
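For reference, the per-disk utilization figures quoted above for ASCI Purple follow from simple arithmetic (a sketch using the rounded values given in the text: 122 GB/s aggregate bandwidth, about 11,000 disks, 90 MB/s and 30 W per disk):

\[
b_{\mathrm{disk}} = \frac{B_{\mathrm{fs}}}{N_{\mathrm{disks}}} = \frac{122\,\mathrm{GB/s}}{11\,000} \approx 11\,\mathrm{MB/s},
\qquad
\frac{b_{\mathrm{disk}}}{b_{\mathrm{max}}} = \frac{11\,\mathrm{MB/s}}{90\,\mathrm{MB/s}} \approx 12\,\% < 15\,\%
\]
\[
N_{\mathrm{min}} = \frac{B_{\mathrm{fs}}}{b_{\mathrm{max}}} = \frac{122\,000\,\mathrm{MB/s}}{90\,\mathrm{MB/s}} \approx 1355,
\qquad
P_{\mathrm{saved}} \approx (11\,000 - 1355) \cdot 30\,\mathrm{W} \approx 290\,\mathrm{kW}
\]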

2https://asc.llnl.gov/computing_resources/purple

1.2 Performance Analysis in the Context of I/O and Contributions

The results of a performance analysis are of versatile use. Besides the removal of a potential bottleneck, the outcome can also help to improve the procurement process and, not least, to better understand actual systems. Today’s systems offer many configuration options that can change the behavior of a system dramatically. The following example illustrates this: In 2009 the storage complex of the HRSK system at TU Dresden faced dramatically bad response times from the installed Lustre [Sun08b] file system (version 1.4). Every full hour the file system was basically completely unresponsive for a couple of minutes. It turned out that at each full hour a script was started that tested a range of basic file system functions on each of the more than 700 nodes. One of the tests wrote one byte to a file and checked afterwards whether the file contained the correct byte or not. At that time the write cache on the DDN controllers was disabled due to the fact that these caches are not mirrored between the two controllers that are used as a redundant pair. Therefore each of the 1-byte requests resulted in a single disk access. This generated a lot of outstanding requests on the controllers and caused the controllers to deny further requests at the point where the internal request queue was full. This in turn caused retries on the file servers, which sent the requests again to the DDN controllers. This request ping-pong was only visible by looking at the number of outstanding commands on the RAID controllers and the load on the Lustre servers. Enabling the write cache and thus allowing requests to be merged solved this problem. The controllers are connected to an uninterruptible power supply and the controller cache is backed up by a battery, thus the chance of losing data is minimal. This is an example where a single configuration bit that is set or not can change the behavior of a system in a dramatic way. With the tools available today it is still hard to figure out whether an I/O subsystem is slow because there is a heavy load situation or because one of the components is about to fail and delivers only a portion of its usual performance.

Besides this type of problem induced by configuration options there are also system limits that influence the day-to-day performance of parallel applications. Metadata operations in Lustre, CXFS and many other file systems are handled by a single server. Thus, all applications share the maximum rate of metadata operations that this server can handle. For a performance analysis of a program in a production environment it would be beneficial to combine performance data collected on the metadata server with the execution trace of the program. This way it would be possible to describe a relation between the overall rate of metadata operations on the whole system and the metadata requests and execution time of different I/O operations from the application point of view. This would further allow investigating and optimizing features like server side caching.

Other kinds of I/O problems manifest themselves during the day-to-day usage of a computing resource [HM07]. If a program that induces a bad I/O behavior is detected, the work that needs to be done at first is to identify the resource that is behaving differently under this workload and then to figure out why this resource is behaving the way it is.
Subsequently the findings can be used to either adapt the system to the I/O pattern of this application or to run the scenario on different machines in order to find a system that fits the application. Another use is the creation of benchmark scenarios that allow vendors to design the next generations of HPC machines. Analyzing performance problems within the infrastructure of a parallel file system is currently a hard task. Current approaches have the drawback of using only a part of the available information or rely on intrusive changes to the system which are not feasible for production systems. What is needed to enable an overall I/O analysis on HPC systems is to standardize the access to performance data that are potentially useful for an I/O analysis.

This thesis will present a workflow for an improved data acquisition, analysis and replay scheme for I/O requests within real-world HPC storage systems. While this methodology is being developed, various aspects of parallel file systems are being discussed. The work done within this thesis focuses on advancing the following research areas:

• Data acquisition and analysis in distributed environments,
• Program trace based analysis of parallel programs,
• Pattern matching and replay of I/O requests,
• Local and wide area file system measurements.

The overall target is to enhance the observability of I/O requests in general, starting at the user space and going all the way down to the disks that store the data.

1.3 Organization of this Thesis

The chapter I/O in the Context of High Performance Computing provides an introduction to the topic of I/O on High Performance Computing systems. After a short outline of the history, the key concepts of parallel file systems are introduced. The chapter is concluded by an overview of methods that are available to end users to access parallel file systems and to exploit their capabilities.

The chapter Analysis of Current Parallel File Systems and Related Work is dedicated to a description of the research environment that this thesis originated from and shows recent work done in this area. The state of the art for parallel file systems is examined in detail. Current aspects in the field of I/O tracing, I/O system monitoring and current trends in hard- and software are covered as well.

The chapter End-To-End Performance Analysis of Large SAN Structures contains the main contributions of this thesis. An approach to enable end-to-end performance analysis of storage systems is presented. To achieve this goal the definition of performance counters is revisited and extended. An I/O workflow is developed that is based on an adapted pattern matching approach and a flexible I/O benchmark. The workflow employs an application trace with embedded performance information from the infrastructure of the file system. The data within the trace file are reduced by a unique approach that removes events from the application trace that are not needed for an I/O analysis.

The chapter Prototype Implementation and Evaluation shows use cases where the approach developed in the preceding chapters is applied to real-world problems. An analysis of typical access patterns of different I/O intensive HPC applications is presented as well as their impact on different file systems. Some of the presented examples demonstrate the use of the data collection and analysis in distributed environments. A conclusion and an outlook to future research complete this work.

2 I/O in the Context of High Performance Computing

“Applications on MPPs often require a high aggregate bandwidth of low latency I/O to secondary storage. This requirement can be met by internal parallel I/O subsystems that comprise dedicated I/O nodes, each with processor, memory, and discs” [FCBH95].

2.1 A Brief History of HPC I/O

Parallel computer systems are designed to handle large data sets. The definition of “large” is a moving target and an optimistic lower bound is at least the amount of main memory within the corresponding system. Massively Parallel Processing (MPP) systems can have multiple terabytes of main memory today and the bandwidth provided by individual storage devices is between 50 MB/s and 1.5 GB/s, depending on the technology used. The need to combine multiple disks into one file system arose when the speed of a single disk could no longer satisfy the bandwidth requirements of parallel supercomputers at the end of the 1980s. At this time [CHKM93, FCBH95] defined an I/O “balanced” computer system as one where the ratio between the peak floating point operations and the peak I/O rate was at maximum 100 floating point operations per byte. These analyses also showed that some applications required this ratio to be much smaller in order to spend more time in computation than in waiting for I/O. Most of the parallel computing systems in this period were equipped with parallel file systems designed and implemented by the respective system vendor.

Intel for example designed HPC machines as well as file systems in the 1980s and 1990s. For the iPSC series of supercomputers the Concurrent File System (CFS) [FPD93, NKP+96] provided a lot of features common to current I/O subsystems like caching, prefetching and preallocation. File data was striped in 4 KB chunks over all available disks. Caching allowed multiple nodes to reread the same data from a memory cache instead of from the disk. Prefetching is a technique where an algorithm tries to detect file access patterns and loads file content transparently into the server’s cache before the application actually requests the data. This way the data is available to the application immediately without any latency. Preallocation tries to allocate additional contiguous blocks on a disk upon the first write request(s) for a file, before the application uses this additional space. The reasons to preallocate blocks are to avoid the search for free blocks on each write request and to minimize the number of seek operations during the read or rewrite of the file. CFS also introduced the concept of specialized I/O nodes. This places additional servers between the disks and the computing nodes and thus separates the disk access and provides indirect access to physical blocks. This way multiple nodes can see the same file system at the cost of higher latency.

The IBM Vesta file system [CF96] is considered to be one of the first parallel file systems. Vesta is built by connecting I/O and compute nodes of the IBM SP1 via a network. Vesta interleaves a file over all available I/O nodes and provides multiple logical views onto the same file. The file itself is not a stream of bytes as defined for traditional UNIX file systems [LS90] but a set of cells. Logical views are subfiles that can contain selected parts of the data set like rows or columns. It has been developed as a research file system and was renamed to PIOFS as a commercial product.
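Both CFS and Vesta distribute file data round-robin across their storage targets. A generic striping formula (a sketch; s denotes the stripe size, e.g. 4 KB in CFS, N the number of disks or I/O nodes, and o a logical byte offset within the file) maps each byte to a target and to an offset local to that target:

\[
\mathrm{target}(o) = \left\lfloor \frac{o}{s} \right\rfloor \bmod N,
\qquad
\mathrm{local\_offset}(o) = \left\lfloor \frac{o}{s \cdot N} \right\rfloor \cdot s + (o \bmod s)
\]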

CM-5 sfs [KR94] is an example of a parallel file system that uses a combination of RAID3 technology [CLG+94] and NFS [SBD+03] to provide a file system to multiple CM-5 nodes. File content is striped in 16 byte blocks over the available disks in software using RAID3 on a dedicated I/O node. An adaption of code taken from an NFS server is used as the communication protocol. The connection topology in the dedicated storage network is a fat tree.

The main applications for high performance I/O subsystems in that time period have been:

• data input, data output
• checkpointing
• out-of-core computation (management of temporary/additional memory)

Data input and output often only happen at the startup and the final phase of a program. Large amounts of data need to be read from and written to the disk to fill the main memory with an initial data set or to write the results. Checkpointing is a technique to avoid the loss of computation results due to machine failures. If it is known that a parallel program only finishes correctly in 95% of all runs, it is logical to spend up to 5% of the program run time to dump the complete program state to permanent storage and to resume from this checkpoint after a failure instead of repeating the whole program run. For example, as checkpointing can take up to 75% of the I/O capacity, Sandia National Lab (SNL) has a bandwidth requirement of at least 1 GB/s per TeraFlop/s (TF) of computing power [Hil07]. A factor that describes this bandwidth requirement even better is the portion of the main memory of a machine that needs to be transferred to the disk during a checkpoint. Thus in the real case, transferring 750 MB/s at SNL over 3 minutes every hour, applications with up to 135 GB main memory consumption can be checkpointed with a 1 GB/s I/O bandwidth. It is apparent that current supercomputers with 300 TB of main memory1 demand a much larger bandwidth or new approaches to deal with I/O. Out-of-core computation is a method to deal with data sets that exceed the available main memory [CWN97, WGW93, Kot95]. A part of the data set is held in memory and the rest of the data is on the disks. The traditional way of dealing with large amounts of data was “demand paging” and relied on the operating system to load and unload data as needed2. Out-of-core computation is practically paging under user (program) control. This approach is an extension of programming schemes that are aware of memory hierarchies and not only look at the main memory, caches and registers but at the disks as well.

The main research activities in the area of I/O identified in 1996 [GVW96, AUB+96] were:

• increasing parallel access to storage devices for more parallel bandwidth
• effective caching to exploit locality
• overlapping computation and I/O to hide storage access latencies
• better I/O scheduling and the optimization of data accesses

All these topics are still valid today [JD07] and have influenced current standards like MPI-I/O [Cen03]. Current research also includes power efficiency, the increasing complexity of storage hardware, metadata performance and other topics [Bia08, Jac09], like site-wide file systems or wide-area file systems. In 1998, IBM introduced GPFS as the successor of PIOFS. GPFS is considered one of the major file systems in the HPC market and is covered in detail in Chapter 3. Within the same timeframe cluster computing became more and more popular, accompanied by cluster file systems like PVFS and Lustre. Both will be covered in Chapter 3 as well.
The basic demands on parallel file systems have not changed significantly. They still need to scale with the overall machine size, need to be reliable and provide permanent storage capacity.

1the “Jaguar” system at ORNL in June 2010
2today commonly referred to as “swapping”

2.2 Local, Distributed and Parallel File Systems

The major service that a storage component offers is the capability of depositing data sets for longer periods together with a guarantee that the integrity of these data sets is not touched and that the data can be read at any time. The most common components that hold data even if powered down are solid state disks, hard disk drives and tapes. In the area of HPC, with growing numbers of CPUs in a system, the need for more capacity, more bandwidth and more parallel concurrent I/O operations is always present. Current installations feature thousands of spinning disks with typically more than 10 GB/s bandwidth.

A file system in general is an interface layer between a storage device and the user of a computing resource [TE03]. A storage device provides blocks or objects that can be used to store data, but it does not provide any way to structure these data. A file system introduces the concepts of a) files that have at minimum a name and some content and b) directories that are used to place files into hierarchical structures [Bia08]. Besides that, files are usually associated with more metadata, like the time of last access or modification, the file layout on the file servers or indicators of the owner of the file. Having all files organized within a single directory tree is commonly referred to as a "Common Name Space" (CNS). This CNS on the other hand can be constructed by using multiple storage devices or file systems.

A local file system is a file system that is only available on a single computer. It provides access to local resources as it is only shared among the different processes within the same operating system image on one host. This might not be true for virtualized systems where multiple hosts (that have no knowledge about each other) share a local disk or parts of it. The underlying storage concept is also referred to as “Direct Attached Storage” (DAS). Most desktop workstations have one local disk or a simple RAID controller that provides storage for the operating system and user data. This type of storage is typically the first that people deal with when they start working with computers. Thus, its I/O response times (especially those for many small requests) define the expectations on how a file system should behave.

Distributed file systems employ a client-server concept where one or multiple file servers export a local file system and allow multiple clients to use this file system in parallel. The advantage for the user is that he can access his data from different computing resources without copying the data. On the client side the remote file system is presented like a local file system. Any access to the file system is redirected to the file server. A taxonomy of present distributed file systems is given in [MRB06]. The most commonly used are NFS [SBD+03] in Unix/Linux environments and CIFS [Sto02], or respectively Samba, in Windows environments. In a distributed file system all accesses to one file are handled by one file server. This server handles the metadata for this file as well as the content. Typically the clients and servers are connected by a commodity network like Ethernet. This concept of “Network Attached Storage” (NAS) is widely used in computing centers. Today all major workstation operating systems are equipped with the necessary software to access this kind of distributed file system out of the box. More sophisticated NAS systems are nowadays shipped as appliances and are easy to install and manage.
The standard NFS server implementation, for example as implemented in the Linux kernel, does not scale enough for the requirements that larger institutions have and does not provide the features that are needed to handle a file service at such a level, like resiliency against server failures and an easy user administration. Thus, there is a market for purpose-built file servers that implement their own file system and provide an NFS/CIFS/Samba frontend.

In contrast to NAS, “Storage Area Networks” (SAN) are dedicated networks designed only for storage traffic. The technologies and protocols used are often optimized for low latencies and are not suitable for wide area connections. SANs are a result of the observation that storage and other traffic need to be separated [MT03]. For SANs the dominant technology is still FibreChannel; iSCSI is an additional option whose market share is growing. SANs are designed to transport file content and enable multiple file servers or computing resources to access a pool of attached storage devices. A very common scheme for a pure SAN file system is to give all clients direct access to all storage devices without using a file server as mediator.

[Figure 2.1: Simplified comparison of distributed (left) and parallel (right) file system accesses to multiple file servers; clients reach the file servers over a network; the diagram distinguishes physical connections from access paths]

In particular, NAS storage provides (synchronized) access to files whereas SAN provides access to storage blocks. As a pure SAN concept has no file server, the clients need to either make sure that they access different regions of the global storage or synchronize accesses to the same region. Therefore SAN based file systems use the concept of a metadata server.

One of the distinctive features identified by Placek [MRB06] for distributed file systems is the “system function”, which is based on the requirements of the applications using the file system. One category within this feature is “performance” and this is the place where parallel file systems fit in. The unique feature that characterizes a parallel file system is the fact that clients can access data through multiple paths to the storage components in parallel. This allows clients to access files in parallel by reading data from different locations at the same time (see Figure 2.1). HPC file systems are typically designed to deliver the highest bandwidth both from one and from many clients, for a single file as well as for multiple files. At the same time they have to provide high rates of concurrent I/O operations from thousands of clients in parallel with low latency.

Traditionally, parallel file systems for HPC have been designed for one installation and are used together with this system until the end of its lifetime. Using the same HPC file system for multiple supercomputing resources has been a topic only for about the last five years. Usually the file system bought together with the last supercomputer is not sufficient for the next one, and extending a given installation in terms of bandwidth and capacity is typically not an option. Within the last years there have been a lot of efforts to create site-wide file systems using the Lustre file system and standard 10 Gigabit/s Ethernet [Sun08d]. These multi-cluster file systems are a tribute to the fact that even the visualization of the simulation results needs a well performing file system. Coupling different types of clusters (computation, visualization, database) via a broadband file system is a common way to tackle the typical problems of handling large amounts of data.

There are also file systems that realize parallel access to the file content but do not fit into the HPC area. GoogleFS is one example [GGL03]. GoogleFS stores file content distributed as chunks over multiple servers and keeps multiple copies of the same chunk at different locations. These locations can be geographically divided. If one of the copies becomes unavailable the chunk can still be retrieved from another copy. This design has been driven by the need to survive even if storage node failures become a normal event instead of an exception. It also allows data to be distributed and replicated transparently over the globe to have fast access to an (almost) local copy of the data everywhere.


Figure 2.2: Typical user view of an HPC system and the attached file systems

2.3 User Access to Parallel File Systems

As stated in Section 2.1 the requirements of users on HPC file systems have not changed much over time. I/O is typically done at program startup and at program termination as well as at (mostly fixed) intervals during the execution time to dump intermediate results or program states. Users commonly expect the logical layout of an HPC infrastructure to look similar to what is depicted in Figure 2.2 and access the file system by using one of the available "Application Programming Interfaces" (API) on the machine. An HPC system typically provides at least one large and very fast file system that is not backed up and is used for scratch data. An additional HOME file system is used for all data that need to be kept over a long time (source code, input and result files). Users access HPC machines through one or more login nodes; sometimes the HOME file system is exported to the scientist’s desktop as well.

How users access parallel file systems has changed over time. First of all, the use of a standardized application programming interface (API) is an imperative. Most user programs utilize the “Portable Operating System Interface for Unix” (POSIX) standard. This has typically been supported by parallel file systems but very often prevented them from exploiting their full performance. Therefore many of the first parallel file systems have been shipped with an individual high level access library that allowed users to access data in parallel but limited the portability of these programs. Today MPI and OpenMP are two commonly used parallel programming paradigms in the HPC community. The OpenMP standard does not define any high level primitives to access data in parallel. Thus, any kind of parallel data access has to be introduced by the user or routed through an additional high level I/O library. With the introduction of the MPI Version 2 standard, MPI-I/O [Cen03] was introduced. It provides an access layer to file data besides POSIX. MPI-2 defines a standard interface for parallel I/O that supports derived data types (besides the standard data types that most imperative programming languages have), coordinates the access to shared files, and allows collective and asynchronous operations and the creation of per-process views onto a file.

Users always have a demand for higher capacities, higher metadata rates and more bandwidth. Today a 1 TB disk is normal equipment for workstations while HPC file systems reach capacities in the PB range. The “Deutsches Klimarechenzentrum”3 currently plans for an archival of 10 PB per year. This is only the amount of data copied from the parallel file system to the archive. It does not include the checkpoints and intermediate results written at program runtime.

3German High Performance Computing Centre for Climate- and Earth System Research, http://www.dkrz.de

The current technology trends for microprocessor hardware allow the prediction that in the near future any increase in performance will only come from using more cores in parallel than before. The race for higher processor frequencies has come to an end. Thus, the demand for parallel file systems to allow more parallel accesses from many cores and the need to use other standards than POSIX to access data will become inevitable.

2.3.1 POSIX I/O

The "Portable Operating System Interface for Unix" (POSIX) standard [IEE04, Ope] is an application programming interface (API) that (among other things) defines the most popular standard to access file systems. The semantics of system calls like open(), write() or close() are defined there. As this interface has been implemented in almost all current operating systems it provides a portable way to access file data. In this context a file is always a stream of bytes. The semantics (as well as the problems arising from them) are discussed in detail in [Bia08]. One of the major critic points from the HPC perspective is the way how the results of concurrent file accesses are defined. The POSIX file and directory handling API has been designed to handle the I/O requests from a single operating system and multiple threads. Therefore a lot of the semantics within the standard assumes that the data that is needed to provide the necessary views onto the file system (like last access time to a file) is accessible at low costs. This is no longer true for multiple host looking at the same file system. One example where POSIX semantics prevent applications from exploiting the full capabilities of a parallel file system are the semantics of write() and read(). If one process writes to a file, all at once or no changes from this specific I/O call have to be visible for a client that is reading the file at the same point in time. In order to implement this in a parallel file system, locking and additional communications between the clients and the file servers are needed, thus slowing down the actual work. POSIX defines blocking I/O calls as well as non-blocking I/O calls. All blocking I/O calls wait until either the operating system or the physical disks have confirmed that the data block has been written to/or read from the disk. The process that issued the I/O request waits and is unable to execute other requests. Asynchronous or non-blocking I/O allows to issue an I/O request that is being handled in the background. From the application point of view a request identifier is returned from this I/O call which can be used later to retrieve the result of the I/O operation. This allows to overlap computation and I/O accesses, as I/O accesses typically utilize other resources than the CPU. An additional “list I/O” interface allows to combine multiple I/O requests into a single I/O call. This reduces the number of system calls for this process and most probably the number of locks that have to be acquired to change the same amount of data. Parallel programs that rely on the POSIX I/O calls usually do I/O in a two-phase fashion [MKK+08]. In the first phase all data is read by one (or a few) tasks. Within the second step the data is distributed with additional communication to all participating tasks. This scheme is used in a reverse manner to store data for checkpoints and final results.

2.3.2 MPI-I/O

MPI (Message Passing Interface) is a standard that describes an API that can be used to create portable parallel programs in distributed memory environments. The first version has been available since 1994. MPI-I/O was originally a research project at the IBM T. J. Watson Research Center4, later also supported by NASA Ames5 [FWNK96, CPD+96].

4http://www.watson.ibm.com

Positioning / synchronism / coordination (noncollective vs. collective):

Explicit offsets
  blocking, noncollective: MPI_File_read_at, MPI_File_write_at
  blocking, collective: MPI_File_read_at_all, MPI_File_write_at_all
  nonblocking and split collective, noncollective: MPI_File_iread_at, MPI_File_iwrite_at
  nonblocking and split collective, collective: MPI_File_read_at_all_begin/_end, MPI_File_write_at_all_begin/_end

Individual file pointers
  blocking, noncollective: MPI_File_read, MPI_File_write
  blocking, collective: MPI_File_read_all, MPI_File_write_all
  nonblocking and split collective, noncollective: MPI_File_iread, MPI_File_iwrite
  nonblocking and split collective, collective: MPI_File_read_all_begin/_end, MPI_File_write_all_begin/_end

Shared file pointer
  blocking, noncollective: MPI_File_read_shared, MPI_File_write_shared
  blocking, collective: MPI_File_read_ordered, MPI_File_write_ordered
  nonblocking and split collective, noncollective: MPI_File_iread_shared, MPI_File_iwrite_shared
  nonblocking and split collective, collective: MPI_File_read_ordered_begin/_end, MPI_File_write_ordered_begin/_end

Table 2.1: Systematic overview of MPI-2 I/O functions

In 1997 MPI-I/O became part of the new MPI-2 standard and is now a well established interface for portable parallel I/O. MPI-I/O defines a standard interface for parallel I/O. With the emergence of parallel file systems there have always been libraries to access those file systems in parallel [FCBH95]. Most of them have been implemented by the system vendor and either provided POSIX semantics or offered an additional library that allowed file system dependent programming only. MPI-I/O incorporates many of the ideas of its predecessors. For example, a logical view on a file was available in the IBM Vesta file system years before. At first MPI-I/O defines primitives that provide services like file creation, resizing and deletion. In contrast to POSIX I/O, a file descriptor can be shared among different processes. MPI-I/O has non-blocking and blocking interface functions as well as collective operations. These three ways of accessing data (shared/not shared, blocking/non-blocking, collective/individual) form the space of available functions to access the content of a file. A list of data access functions is given in Table 2.1.


Figure 2.3: File view of four processes accessing different parts of a file

5http://www.nasa.gov/centers/ames

MPI-I/O allows the user to define views on a file (Figure 2.3). File views define which parts of a file are visible to an individual process and which data structures are stored within these visible parts. By using MPI the user has the possibility to define his own data types as a combination of predefined data types like integers or character arrays. This includes vectors of data types as well as structures and subsets. These derived data types can be used to store and retrieve complex data sets from a file with a single call. The file view in Figure 2.3, for example, allows process 1 to access the first byte in the file and every sixth byte afterwards in a way which hides the (virtual) gap of five bytes in between. For the individual processes their parts of the file (as defined by the view) are presented as logically connected. They are accessible by reading/writing one after the other without the need to seek to the right position within the file between the read/write operations. The unit of access does not have to be individual bytes but can be any native or derived MPI data type.

MPI itself is a standard and not a library. This allows system vendors to implement the needed functionality (like SGI with MPT [SGI05]) in a proprietary software package. There is also a small number of Open Source projects, mostly driven by large U.S. laboratories in collaboration with universities, like MPICH/MPICH26 and Open MPI7. One interesting facet of current MPI implementations is the fact that a lot of them share the same MPI-I/O implementation, ROMIO [TGL99]. This holds true for all examples stated in this subsection. The reasons for this are, at first, the fact that a couple of vendor specific MPI implementations are derivatives of MPICH/MPICH2 and, secondly, that the functionality that has to be implemented for MPI-I/O is rather large. Thus, for a custom MPI implementation it is an easy choice to take ROMIO.
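The file view mechanism can be sketched in a few lines of C (a minimal example assuming an interleaved byte layout with a stride equal to the number of processes, i.e. a simplified variant of the situation in Figure 2.3; the file name and transfer size are arbitrary):

#include <mpi.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* e.g. four processes as in Figure 2.3 */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "view_example.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each process sees every size-th byte of the file, starting at byte 'rank'.
       The vector type describes 1024 visible bytes, each followed by a gap of size-1 bytes. */
    MPI_Datatype filetype;
    MPI_Type_vector(1024, 1, size, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);
    MPI_File_set_view(fh, rank, MPI_BYTE, filetype, "native", MPI_INFO_NULL);

    /* The 1024 bytes written here land in the interleaved positions of the file,
       but appear contiguous from the point of view of this process. */
    char buf[1024];
    memset(buf, 'A' + rank, sizeof(buf));
    MPI_File_write_all(fh, buf, sizeof(buf), MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_Type_free(&filetype);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}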

2.3.3 HDF5

Besides the actual need to access data efficiently in parallel from different hosts there is also a need to store data in a portable way so that data sets can be accessed on different types of machines. This eases the migration to a new computing system as well as the exchange of data within scientific communities. The basic idea is to use a machine independent data representation to store data permanently. Machine dependent data representations might differ from architecture to architecture, e.g. the same data type (as written down in the source code) might in the simplest case have a different length. Therefore the process reading such an element will read too many or not enough bytes, resulting in an incorrect value.

HDF5 [YCC+06, Gro09] is the fifth major version of the "Hierarchical Data Format" (HDF) project which started about 20 years ago at the National Center for Supercomputing Applications (NCSA) to provide architecture independent access to scientific data. The main requirements include organizing the access to a large amount of data and allowing lots of different data types. Thus, the library is not restricted to scalar or array-oriented data. HDF5 keeps data encapsulated in objects and stores all objects in a hierarchical structure (see Figure 2.4). An object can be any kind of data like images, arrays, tables or binary data. Each HDF5 file has exactly one distinguished root node. Objects can refer to other objects, and HDF5 files can be embedded in or linked to other HDF5 files. Objects can also have properties attached. This way it is possible to create structures within an HDF5 file that are similar to a directory tree in UNIX file systems. There are two types of objects in a file, one is a dataset and the other one is a group of datasets. Groups are used to realize the hierarchical structure while datasets represent the actual data to be stored. A dataset in turn consists of the data space which describes an array that holds the data.

6http://www.mcs.anl.gov/research/projects/mpi/mpich
7http://www.open-mpi.org

[Figure 2.4: Example view of hierarchical data structures in a HDF5 file: a root group containing a numeric dataset "table", a group "images" with the datasets "image1" and "image2", and a "binary" object]

The data space contains information like array sizes etc. It is possible to store user defined attributes in a dataset as well as hints whether the data is extendable, stored in compressed format or placed in an external HDF5 file. The data types supported are standard types like integers, floating point numbers and variables of different length, like strings. It is also possible to create derived data types that can have multiple dimensions and be nested. The HDF5 library allows defining access to subsets of the original data within a single HDF5 call and reshaping data layouts between the representation in the file and the representation in the machine’s main memory. HDF5 has early been adopted to read NetCDF (see next section) files as well. Many tools are available to retrieve information from and to store information in HDF5 files. Today HDF5 is widely used as data storage in lots of projects. One of the biggest is the Earth Observing System (EOS)8 that gathers data for a multi-year study of global climate change [ZTM03]. The total amount of data to be stored in HDF format for EOS is expected to be in the area of 15 PB.
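The concepts described above can be sketched in a few lines of C using the HDF5 API (a minimal example; the file, group, dataset and attribute names as well as the array sizes are arbitrary):

#include <hdf5.h>

int main(void) {
    /* Create a new HDF5 file; the root group "/" exists implicitly. */
    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Groups provide the hierarchical structure, similar to directories. */
    hid_t images = H5Gcreate2(file, "/images", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* A dataset combines a data space (here a 3x4 array) with a data type. */
    hsize_t dims[2] = { 3, 4 };
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/table", H5T_NATIVE_INT, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    int data[3][4] = { { 1, 8, 9, 8 }, { 5, 3, 4, 7 }, { 2, 11, 7, 6 } };
    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    /* Attach a small attribute to the dataset as additional metadata. */
    hid_t aspace = H5Screate(H5S_SCALAR);
    hid_t attr   = H5Acreate2(dset, "version", H5T_NATIVE_INT, aspace,
                              H5P_DEFAULT, H5P_DEFAULT);
    int version = 1;
    H5Awrite(attr, H5T_NATIVE_INT, &version);

    H5Aclose(attr); H5Sclose(aspace);
    H5Dclose(dset); H5Sclose(space);
    H5Gclose(images); H5Fclose(file);
    return 0;
}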

2.3.4 (p)NetCDF

NetCDF [RD90] is based on a library (CDF) developed initially at NASA to provide a high level abstraction for multidimensional, array-oriented data. The data in a NetCDF file is self-describing, which means that a NetCDF file also carries a standardized description of its own content. NetCDF handles its data as named variables with multidimensional coordinate systems and sets of attribute/value pairs that describe properties of these variables. Each variable within one file has the same dimensions and its own set of attributes. One dimension of the variables can be unlimited, which allows appending data to an existing file. A single NetCDF data access can read or write any orthogonal subset of the stored data set. NetCDF is a widely used standard to store data, for example in the climate community. The original NetCDF implementation from Unidata provides tools to convert a human readable data description into a NetCDF file and back.

Parallel NetCDF [LLC+03] (pNetCDF) is an extension to allow parallel access to data stored in NetCDF files. Presented at Supercomputing 2003, this enhancement was necessary to speed up one of those parts of an application that otherwise would never scale with the number of processors. pNetCDF provides functionality alongside the original NetCDF so that porting programs to the new API was an option rather than a must. The implementation was done on top of HDF5/MPI-I/O, which are established, well recognized standards and are available on most supercomputers.

8http://hdfeos.org

[Figure 2.5 diagram: client-side I/O stack (application, pNetCDF, HDF5, MPI-I/O, POSIX I/O, VFS, file system, block layer, network; spanning user space and kernel space), connected over the network to the file server stack (VFS, file system, block layer, RAID controller logic, disks)]

Figure 2.5: Example for an I/O stack for an application that is using pNetCDF

HDF5/MPI-I/O, which are established, well recognized standards and are available on most supercomputers. Figure 2.5 shows the I/O stack from the application up to the spinning disks for pNetCDF. The main difference between NetCDF and pNetCDF is that multiple processes that belong to the same MPI communication group can read/write a NetCDF file (with the same file format) in parallel. One can say that pNetCDF translates accesses from a program into MPI-I/O calls which in turn read a NetCDF file correctly. In contrast to the original serial implementation, pNetCDF allows the user to specify non-contiguous regions of memory for the data placement via an additional API.
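As an illustration of the parallel access path described above, the following sketch writes one distributed variable with the PnetCDF C API from all processes of an MPI communicator; the variable layout and file name are only examples and error checking is omitted.

```c
#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimid, varid;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* All processes of the communicator create and define the file collectively. */
    ncmpi_create(MPI_COMM_WORLD, "output.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", (MPI_Offset)nprocs * 100, &dimid);
    ncmpi_def_var(ncid, "values", NC_DOUBLE, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    /* Each rank writes its own disjoint region with a single collective call. */
    MPI_Offset start = (MPI_Offset)rank * 100, count = 100;
    double buf[100];
    for (int i = 0; i < 100; i++) buf[i] = rank + i * 0.01;
    ncmpi_put_vara_double_all(ncid, varid, &start, &count, buf);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}
```

The collective call is translated by the library into MPI-I/O requests, so the MPI-I/O layer shown in Figure 2.5 is used underneath.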

2.4 Administrative Challenges

Beside the challenges of how to interact with users and how to exploit their full performance, large parallel file systems face a lot of administration related challenges. Reducing the amount of time that is needed to check a file system for consistency is one of the biggest ones. Consistency checks ensure that the logical structure that a file system implementation depends on is intact. For example, in the Lustre file system, file content is split across multiple server nodes. The information which node has which chunk of a file is usually stored in objects on the content servers whereas the metadata server keeps references to the objects on the content servers. What has to remain consistent within this example is, firstly, that all objects the MDS is using to address the content of a particular file are present on the content servers and, secondly, that all objects on all content servers show up as objects of some file on the MDS. As advised by the vendor, the HRSK complex at TU Dresden had to check the larger of the two deployed Lustre file systems for consistency in 2009. The file system is about 50 TB in size. Dumping all information needed to start the check itself took more than one day. The test to cross check whether all file chunks for a file are still present and whether all available file chunks do actually belong to a file was stopped after it had run for more than a week, and even the vendor could not estimate how long it would take. The observed file system problems have then been corrected in a different way that solved the visible symptoms, but the goal of a global consistency check was missed. The way out of this would be parallel file system checks, which at this point in time are not implemented. Further challenges to which HPC administrators have to respond at some point include an automated tuning of the file system and an automated problem analysis. Having a suitable way of finding those users that cause trouble on one of the infrastructure layers of the file system seems to be as inevitable as tuning the file systems automatically as the workload changes.
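The consistency criterion described above boils down to two set inclusions: every object referenced by the MDS must exist on some content server, and every object found on a content server must be referenced by the MDS. The following sketch illustrates this cross-check on two already dumped and sorted ID lists; it is a conceptual illustration only, not the actual check tool, and the object IDs are made up.

```c
#include <stdio.h>

/* Conceptual cross-check between the object IDs the MDS references and the
 * object IDs actually present on the content servers (both assumed sorted).
 * Every MDS reference needs a backing object (otherwise the file has a hole),
 * and every stored object must be referenced (otherwise it is an orphan). */
static void cross_check(const long *mds, size_t n_mds,
                        const long *ost, size_t n_ost)
{
    size_t i = 0, j = 0;
    while (i < n_mds && j < n_ost) {
        if (mds[i] == ost[j])     { i++; j++; }
        else if (mds[i] < ost[j]) { printf("missing object %ld\n", mds[i++]); }
        else                      { printf("orphan object %ld\n", ost[j++]); }
    }
    while (i < n_mds) printf("missing object %ld\n", mds[i++]);
    while (j < n_ost) printf("orphan object %ld\n",  ost[j++]);
}

int main(void)
{
    long mds[] = {1, 2, 4, 7};   /* objects referenced by file layouts       */
    long ost[] = {1, 2, 3, 4};   /* objects found on the content servers     */
    cross_check(mds, 4, ost, 4); /* reports orphan object 3, missing object 7 */
    return 0;
}
```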

2.5 Future Directions

The major future challenge that users have to face concerning I/O is the fact that a lot of user programs still rely on serial I/O and that this serialized part will put an unwanted upper bound on the possible degree of parallelization. As more FLOPS for a single application will only come from using more cores in the near future, it is essential to move away from the POSIX interface whenever possible. Multi-cluster file systems are a common need for HPC centers and environments like those seen at universities. Data acquisition, computing and evaluation are usually done on separate systems. Traditionally only the second step involves HPC while the first and the last step are done on individual hardware, like microscopes or visualization systems. Copying files around and maintaining individual file trees on different machines is not a feasible approach to get the most value out of a researcher's work time. Data on HPC file systems is usually accessed over a high speed interconnect that is only available within a cluster. Sharing these data (at least) among the various compute clusters a single institution provides is a first step towards unified file access. The target is to connect compute resources and user systems with a single file system in a fast and secure way with only two limits, the network bandwidth and the network latency.
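As a hedged illustration of the move away from serial I/O mentioned above, the following sketch lets every process of an MPI program write its own disjoint region of one shared file with a single collective MPI-I/O call, instead of funneling all data through one process that performs serial I/O; the file name and data layout are arbitrary examples.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1024;                  /* elements written per process */
    double buf[1024];
    for (int i = 0; i < N; i++) buf[i] = rank;

    /* All processes open the same shared file and write rank-dependent,
       non-overlapping regions with one collective call. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "result.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```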

3 Analysis of Current Parallel File Systems and Related Work

This chapter gives an introduction into parallel file systems by comparing the most important specimens of this genre of file systems in the area of High Performance Computing (HPC). For the further understanding of this thesis, the current state of the art in the fields of I/O tracing, I/O benchmarking and system monitoring in HPC environments is also presented.

3.1 Parallel File System Metrics

For studying performance characteristics, the first step is to choose well defined metrics. Those can be chosen, for example, from a list of services that the investigated systems offer [Jai91]. To compare parallel file systems from the view of a performance analyst it is not enough to look at performance data only, as criteria like monitoring capabilities have to be taken into account as well. The next subsections create a taxonomy of parallel file systems based on performance metrics as well as criteria that have been derived from an architectural examination.

3.1.1 Taxonomy of Parallel File Systems

Besides a comparison of parallel file systems based on performance data, the first step should be to take a look at the design and the architecture and to find differences, common points and trends that help to compare file systems from a theoretical point of view. Therefore a couple of criteria are defined that will be used in the remainder of this chapter to compare present parallel file systems.

Topology, Data and Metadata Access

One way to compare parallel file systems (and real world installations) is to examine how data is accessed. Parallel file systems for HPC systems used to be designed on top of monolithic, exclusively used SAN infrastructures. Today this is moving towards storage with a large bandwidth that is shared between a few local computing resources or a set of distributed clusters scattered over a whole continent [SPB07]. Direct Attached Storage (DAS) is not suitable for today's HPC systems as it is not possible to share the storage between different hosts in a cluster. In addition it does not provide the necessary bandwidth, although it is possible to attach multiple hundreds of TB to a single computer by using expensive storage controllers. Conventional Network Attached Storage (NAS), in particular when using the NFS version 3 file protocol, does not meet the requirements in an HPC environment either, as clients do not access the available storage in parallel. One special distinguishing mark for HPC storage architectures are file server based designs versus pure SAN designs. File server based designs use dedicated resources to convert file based requests from storage clients to block based requests. If this layer is missing in an I/O architecture, clients access data either directly on the block level or based on objects. How data (file content) is accessed can be further distinguished, for example by the support for hierarchical storage management (HSM) [TE03]. Further characteristics are the use of logical volume managers and the level (client, server, storage hardware) at which data protection schemes are used.

Metadata within large HPC file systems is often kept at a different location than the file data itself. As requests to metadata storage are, due to the nature of the data, rather small and very frequent, it seems logical to place the metadata on disks with a low response time. File systems like Lustre [Sun08b], (C)XFS [SE04a] or XTREEMFS [HCK+08] allow or even force the metadata to reside on a special server or block device. This allows the further development of metadata management strategies that are not dependent on a single file system implementation. One example of such a system is BabuDB [SKH+10].

Another criterion to differentiate HPC storage installations is the transport layer for the file content. Traditionally, solutions are based on FC as this technology provides the better bandwidth/cost ratio for very short distances. Because 10 Gigabit Ethernet is used in HPC installations as well today, the iSCSI protocol is also an option to connect storage servers and RAID systems. Another way to compare file system designs as well as real installations on this level is whether the transport path used for file content is the same physical path as for metadata. This is often referred to as in-band (path sharing) vs. out-of-band (separated paths). Separated paths have the theoretical advantage that the response times for metadata requests do not suffer if the file system is currently serving lots of file content requests in parallel. If in addition the metadata is handled on a separate file server, the response times for metadata requests are supposed to be independent of the other I/O activities. The SGI CXFS file system and the EMC Celerra Multi-Path File System are two examples of such a concept. A drawback of this solution is the fact that the metadata for a file needs to be updated separately as it resides physically on a different server than the file content. This requires additional communication and might increase the latency of the overall request.

Security

Security is an important feature of parallel file systems as well. HPC systems are often used in classified environments. Traditional UNIX features differentiate between individual users by assigning each user a unique identification number (UID). This UID (as well as one group identification number, GID) is stored as part of the metadata of a file. Traditionally, within UNIX environments each file has adjustable access rights that define how the owner of a file, the group to which the file belongs as well as "everyone else" can access a file. In the simplest case the UID of a user accessing a file has to match the UID associated with the file. If a file system is used on a compromised host, a user that obtains root access can use any UID and therefore can access any file on the file system. At the moment almost all installations of parallel file systems use the concept of a "trusted host". In a trusted host environment it is assumed that all individual systems are well maintained and all users use the same UID everywhere.
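A minimal sketch of the traditional UID/mode check described above, using the POSIX stat() interface, is given below; it only illustrates the owner/group/other decision for read access and deliberately ignores supplementary groups and privileged users.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Simplified illustration of the traditional UNIX access decision: select the
 * owner, group or "everyone else" permission bits of a file depending on the
 * UID/GID of the calling process and test them for read access. */
int may_read(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return 0;

    if (st.st_uid == getuid())
        return (st.st_mode & S_IRUSR) != 0;   /* owner bits apply  */
    if (st.st_gid == getgid())
        return (st.st_mode & S_IRGRP) != 0;   /* group bits apply  */
    return (st.st_mode & S_IROTH) != 0;       /* everyone else     */
}

int main(void)
{
    printf("readable: %d\n", may_read("/etc/hostname"));
    return 0;
}
```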

Advanced security schemes for parallel file systems use keys that have been negotiated between the client and the file server (or an additional key server) to authenticate requests to the file system. This is more secure than the trusted host scheme because of the independence from the UID. It relies on a username/password combination or an otherwise provided user identification. A typical example to realize trusted communication over unsafe networks is Kerberos [NT94]. Within Kerberos all components that want to communicate with each other authenticate themselves against a central server and get credentials that allow them to perform certain actions, like connecting to a file server. The file server has to check the credentials presented by the client and to decide whether the access attempted by the client can be fulfilled or should be denied.

Scalability

Probably the most often asked question in an HPC environment is "Does it scale?". At present this can be interpreted as the question whether or not the metadata performance as well as the overall bandwidth will remain at the same level as clients are added to a system. Another question is whether additional hard- or software is able to push the file system performance further. One trend among present HPC file systems is that metadata is handled separately. The amount of metadata requests is a function of the number of metadata operations (like file creations or deletions) plus the number of accesses to small regions. Accessing small regions causes additional synchronization efforts to fulfill the POSIX notions of "last writer wins" and "everything or nothing". In order to be compliant with the POSIX standard, file systems have to make sure that a client either sees all modifications that a single write request is doing or none. Therefore the file servers need to guard the individual file accesses based on the regions that a process is accessing. These locks have to be granted and revoked, which requires additional communication between the server maintaining the lock information and the clients. The locking strategy of a file system for different regions of a file (block wise, byte wise) also has an influence on the amount of metadata traffic. If two clients are accessing different regions of a file but both regions belong to the same locking entity, then the metadata server has to juggle this lock between the clients. This will result in a high load on the server and poor performance on both clients.

The efficient handling of large directories that contain millions of files is one of the issues related to metadata. There are two points to look at within this scenario. The first one is how the locking of directories is handled if multiple clients try to create files in parallel, while the second point of interest is whether or not large directories are split among multiple metadata servers.

Another challenge for scalability is the caching of file content on the client side. If this is supported, these caches need to be kept coherent. Caching of file data on a client has the potential that data read from the disk is reused and that lots of smaller requests can be merged into one larger request. It also has the potential of transferring modified file content directly between clients, thus reducing the latencies for such requests and reducing the work for the file server. Due to the complexity of implementing a coherent cache for file content some parallel file systems don't use such a feature.
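The influence of the lock granularity discussed above can be illustrated with a small sketch: two disjoint byte ranges still collide if locks are handed out per block, so the lock has to be moved back and forth between the two clients. The block size below is an arbitrary example and the mapping is a simplification of what a real lock manager does.

```c
#include <stdio.h>

#define LOCK_BLOCK_SIZE 4096   /* illustrative lock granularity in bytes */

/* Map a byte range onto the range of lock units (blocks) it touches. */
static void lock_units(long offset, long length, long *first, long *last)
{
    *first = offset / LOCK_BLOCK_SIZE;
    *last  = (offset + length - 1) / LOCK_BLOCK_SIZE;
}

int main(void)
{
    long a_first, a_last, b_first, b_last;

    /* Client A writes bytes [0, 2048), client B writes bytes [2048, 4096).
     * The ranges do not overlap, but with block-granular locking both map
     * to lock unit 0, so the lock ping-pongs between the two clients. */
    lock_units(0,    2048, &a_first, &a_last);
    lock_units(2048, 2048, &b_first, &b_last);

    printf("client A locks blocks %ld-%ld\n", a_first, a_last);
    printf("client B locks blocks %ld-%ld\n", b_first, b_last);
    printf("conflict: %s\n",
           (a_last >= b_first && b_last >= a_first) ? "yes" : "no");
    return 0;
}
```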

Reliability, Availability and Serviceability

File systems have to be available, and data stored has to be read back without corruption. End-to-end data integrity, from the actual hard disk drives up to the process using the data, is an important criterion in times where the number of components in storage subsystems increases. It has been shown that silent data corruption on hard disk drives can happen at any time and that the chance of hitting this phenomenon in an HPC installation is rather high. CERN has published a study about silent data corruption [Pet07a]. It shows a bit error rate of about 1/10^7 for disk accesses and not 1/10^12 or 1/10^14 as advertised by disk drive manufacturers for desktop and enterprise class drives. Storage vendors have also identified this as a challenge that needs an answer soon1. Sophisticated storage hardware like DDN's S2A99002 does an automated check for bad data during a read from the disks and tries to correct wrong blocks on the disk by rewriting them. The ANSI T10-DIF [Pet07b] standard is another example to provide end-to-end data integrity. It enlarges each data block with additional fields, one cyclic redundancy check (CRC) field, one "reference" field and an application specific field. Typically the second field contains the physical block number that the data belongs to or another tag that allows verifying the physical location of the block. This approach can therefore detect misaligned reads and writes. The tag which is specific to the application can be set by the application itself, but there is currently no standard interface to specify it.

1BOF session at SuperComputing Conference 2009: Best Practices for Deploying Parallel File Systems
2http://www.datadirectnet.com/9900

As these additional fields are added to each block, special disks are needed that provide storage for more than 512 bytes per 512 bytes of user data. Today enterprise disks already support block sizes larger than 512 bytes, e.g. Seagate allows logical block sizes of 512, 520, 524 and 528 bytes.

High Availability (HA) and robustness are major challenges for systems that are expensive and have a rather short life time. The ability to survive failures of any single component is vital to get the most out of an HPC system, as usually millions of dollars are spent and the life time is not much longer than five years; each lost day costs more than 8000 Euro in a 15 million Euro installation. The need for reliable components for parallel file systems includes disks, controllers, file servers, network components as well as software. An automated failover strategy helps to migrate services to redundant components after a failure. In addition, the protection of file data through the calculation and storage of redundant information is essential.

The definition of the serviceability of parallel file systems (and the underlying storage hardware) is naturally different from the serviceability of other types of file systems. While it might be acceptable for desktop computers to be offline once every two months for an hour just to check the file system, this is definitely not practicable in HPC environments. On the one hand the file systems are too large to be checked on a regular basis, and on the other hand they cannot be taken offline. Another important task is the re-balancing or re-striping of data. This is needed for example when a file system is grown, if disks within a storage pool are at different filling levels or when data needs to be migrated between disks in a pool (in the background). The reason for performing this re-balancing is usually one of the following: freeing a disk because it needs to be taken into maintenance mode or be replaced, increasing the overall performance by balancing free space among all disks, or migrating data between different types of disks in a pool or between pools. Some of the parallel file systems that are available today allow constructing a larger file system from different smaller file systems (or disk pools) with different capabilities. For example it is possible to use a couple of expensive disks for a fast cache directory and cheaper but slower disks for the rest of the file system. In addition, tapes can be used for files that have not been accessed for some time. Managing these different storage pools, like adding and removing pools or migrating data between pools, needs to be done while the file system is in use. The definition of serviceability also has to consider which and how many components can be taken offline without interrupting the file system availability. This is directly connected to the need to survive the loss of any individual hardware component (no single point of failure).

Features and Compatibility

One set of distinguishing marks not to be underestimated while comparing parallel file systems is the list of supported features and the compatibility with different client operating systems. Today most parallel file systems necessarily support full POSIX semantics because this is often a requirement of the applications that use the file system. Being POSIX compatible on the other hand introduces performance bottlenecks due to certain restrictions within the POSIX standard itself. One of these restrictions, for example, is that a concurrent read on one node has to see either all or none of the data written in parallel on another node. If the writing node writes to multiple file servers in parallel and the reading node reads a portion of a file that is distributed over multiple servers, this will become a performance problem due to the needed synchronization with the metadata server. Support for other standard UNIX features like access control lists or quotas are important criteria as well. In addition to the full POSIX support, parallel file systems can also support relaxed POSIX semantics or allow disabling certain POSIX features altogether. For example, the Lustre file system allows specifying a delay time for the update of a directory inode3 or enabling file locking per compute node.

3short for "index node", contains the metadata of a file system entry

Furthermore, the setup that is needed on the client side to mount and use the file system does also have an impact, i.e. whether it is just an installable option or requires special treatment. Finally, the list of hardware platforms and the network hardware that the file server infrastructure is able to operate on does also influence the acceptance of a parallel file system.

3.1.2 Performance Metrics for Parallel File Systems

There have been many comparisons of parallel file systems, for example [Mar04, COTW05, SMMB08], as well as lots of performance evaluations of individual installations of parallel file systems, like [Klu06] or [KL07]. The metrics most commonly obtained via artificial benchmarks in such comparisons are:

• number of metadata operations per second (e.g. file creations or unlinks)
• maximum bandwidth with large sequential blocks (a measurement sketch is given at the end of this subsection)
• maximum number of random I/O operations with small blocks
• throughput/response time with different block sizes and access patterns, including random access to blocks

The first three metrics have proven useful as they provide information about different parts of a parallel file system because they only put stress on specific parts. The number of files that can be created per second is commonly limited by the metadata server which has to maintain the metadata structures of the file system and to provide a consistent view to all clients. The maximum bandwidth determined by using large blocks puts no burden on the metadata server but tries to fully utilize all components along the path for file content. The third metric can be used to test the locking granularity on the metadata server as well as the ability of the disk back end to handle access to random locations. The last metric in this list is useful to check how well a file system is able to serve the I/O requests of applications. The presented degrees of freedom for the generation of the access pattern can easily be extended by others, like the number of clients involved in a measurement, the number of processes per node, the number of file servers, the storage hardware and the type of the interconnect.

Beside artificial benchmarks, file systems are typically compared by using application specific metrics like throughput/response time for a specific set of patterns or just the runtime of an application specific I/O benchmark. One example here is b_eff_io [RK00] that executes a fixed set of patterns on a file system and generates detailed statistics as well as a single number at the end which is supposed to represent the file system as a whole. The next (larger) step of I/O benchmarking after application specific benchmarks are I/O workloads that put a real workload onto a file system. This workload consists of different access patterns used in parallel and can either be recorded on another file system or created artificially. External influences on benchmarking results that have to be considered and are orthogonal to the metrics presented above are file system parameters like block sizes or effects that are triggered by the day to day use of a file system. File systems show a different behavior after they have been formatted and used for some time. The list of free blocks is not contiguous anymore but most likely a segmented list. This will cause write accesses to be slower because large blocks might now be split up into multiple pieces. The age of a file system as well as the filling level can have a significant impact on performance numbers [SS97].
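A minimal sketch of the second metric in the list above, the sequential bandwidth with large blocks, measured with plain POSIX calls, is shown below; block size, block count and file name are illustrative, and a real benchmark would additionally vary the number of clients and processes and account for client side caching effects.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE (4L * 1024 * 1024)   /* 4 MiB blocks, illustrative */
#define NUM_BLOCKS 256                  /* 1 GiB in total             */

int main(void)
{
    char *buf = malloc(BLOCK_SIZE);
    int fd = open("bench.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0 || buf == NULL) return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < NUM_BLOCKS; i++)
        if (write(fd, buf, BLOCK_SIZE) != BLOCK_SIZE) return 1;
    fsync(fd);                           /* include the time to reach the disks */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mib = (double)BLOCK_SIZE * NUM_BLOCKS / (1024.0 * 1024.0);
    printf("%.1f MiB in %.2f s = %.1f MiB/s\n", mib, sec, mib / sec);

    close(fd);
    free(buf);
    return 0;
}
```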

3.2 Selected Parallel File Systems

After introducing metrics to compare parallel file systems, some selected specimens of this type of file system are categorized within this section. The selection of which file systems are investigated in the following subsections has firstly been made based on their usage within the HPC community and secondly on key concepts used to realize the individual file systems.


Figure 3.1: Side by side example of a GPFS installation with and without VSD server

GPFS and Lustre are the most interesting because they have been selected by DARPA4 as the two file systems for the HPCS5 (High Productivity Computing Systems) project. HPCS is a project to design a new generation of economically viable HPC systems. Furthermore, CXFS has been selected because it provides powerful shared access on the block level to multiple clients and for this reason differs from the other two. The fourth candidate chosen for a detailed analysis is Panasas' file system PanFS. As one of the first file systems it provides object based access, and it has also been selected for recent installations, for example the Roadrunner system at Los Alamos National Lab.

3.2.1 GPFS

The General Parallel File System (GPFS) is a parallel file system that has its origins at the IBM6 Almaden Research Lab [SH02]. The first product was released in 1998 for IBM's proprietary operating system AIX only and is a successor of PIOFS and the Tiger Shark multimedia file system [Has98]. A port to Linux followed in 2001. In general, GPFS is a shared disk file system where all clients access the same disks in parallel and compete for blocks on these disks. A disk in GPFS can be either a real or a virtual shared disk. The additional layer of virtual shared disks (VSD) between the clients and the actual storage devices allows accessing remote storage devices as if they resided in the same SAN and were shared. An alternative to this approach is to use the iSCSI protocol [Gol03] which allows the concurrent access to an exported physical disk. GPFS supports up to 4096 disks of up to 1 TB in size each. Figure 3.1 shows an example of two GPFS deployments, one with direct attached disks and one utilizing a VSD server. GPFS multi-clustering is done by accessing the SAN/VSD connections remotely, so no additional soft- or hardware is needed for this functionality. GPFS has its own implementation of a logical volume manager (LVM) [BJK+05] which has been implemented on top of the VSD layer. It is directly embedded into the file system, has knowledge about the disk topology and provides fault tolerance and load balancing among servers, controllers and disks. A volume manager in UNIX/Linux environments is part of the operating system. It defines an additional abstraction layer between the physical disks and the file systems and provides an abstracted view of the available disks. This can be used to create virtual disks from multiple physical disks. In addition, an LVM is able to manage the disk space dynamically and to grow or shrink the disk space on the block level. The file system implementation uses a virtual disk in the same way as a physical disk. Different file system implementations allow changing the size of an existing file system and can therefore grow together with the virtual disk.

4Defense Advanced Research Projects Agency
5DARPA HPCS Petascale Computer Project
6International Business Machines Corporation, www.ibm.com

Larger files are automatically split up into smaller parts which are distributed over multiple disks. The typical block size for striping is 256 KB. Striping works best if all disks have the same performance and size. The reason to use disks with the same performance is obvious: there is no need for load balancing due to slow parts in the file system. Otherwise larger disks would be used more often as GPFS tries to maintain a balance in terms of free disk space among all connected disks. This would make them a bottleneck for the whole system. GPFS allows the administrator to choose between an optimization for more throughput or an optimization for better space utilization. Smaller files are stored in units of (at minimum) 1/32 of the typical block size. Because accessing a single file can make use of all attached disks, GPFS recognizes various access patterns and will automatically trigger readahead and send data to the clients before the data actually has been requested. If the access patterns of an application do not match the predefined access patterns it is possible to add prefetch hints to the file system.

Large directories with thousands of files are realized by using an extensible hashing function. The "extensible" part in this concept is the fact that the lowest N bits of the hash value are always used to get a block number for a file name, where N is a function of the directory size. Large directories allocate multiple blocks on the disk to store the directory content. If a new block needs to be added to the directory entry, existing entries will be moved to the new block if their corresponding hash value (calculated by the new hash function) shows that their entry now belongs to the new block. In this way large directories are automatically rebalanced at the moment when a new disk block is added to the current block list. Thus, the penalty of adding a block grows linearly with the number of directory entries.

To keep the file system consistent, GPFS uses a shared log with a fixed size to record all uncommitted changes in the file system that involve metadata. User data (file content) is not recorded in this log. Although the amount of metadata operations that are "in flight" (done by the client but not written to the disk) is usually rather small, concurrent access to this log might cause collisions and is a potential performance bottleneck. As this log is shared among all GPFS nodes, any node can replay the last (metadata) activities of a failed node. As uncommitted file content is not preserved, it is possible that the client reverts changes in the file system. This can happen if the file content has been written to the disks but the failing node was not able to remove the associated entry from the log.

While GPFS uses a decentralized locking strategy for read and write locks, there is always one central management node that will handle conflicts and perform read and write accesses in the case of a collision. As the possibility that different nodes in a cluster work on different parts of the file system is rather high, this approach does scale with the number of nodes in a cluster. The first node that accesses a file is granted the right to hand out locks for this file. This right also includes the right to cache file content locally. Frequent access to the same file from different (GPFS) nodes will therefore thrash the file cache on those nodes as only one node is allowed to cache file content.
As for GPFS the smallest locking range is a block and not a byte in a file, multiple clients accessing (writing) the same physical block are able to create a "false sharing" scenario where the lock for a particular block is frequently exchanged between both nodes. This scenario also involves a read-modify-write operation from the disk, which limits the performance.

Metadata is handled on all GPFS nodes in a cluster. Although there is a central lock manager for the whole file system, the GPFS node that has a particular file open for the longest time becomes the owner (meta node) of this file and maintains locks and space allocations for this file. An advantage of this solution is the fact that metadata intensive operations can potentially be distributed among different servers. Allocations of new blocks for a file are handled by all nodes serving this file in parallel. The meta node is informed about inode updates and can synchronize all nodes that serve this file if a client requests a synchronized view (e.g. via the stat() system call) of a file. One GPFS node in the cluster becomes the token manager. A token manager is a single instance that knows all tokens that have been handed out to other GPFS nodes. Tokens are handed out, for example, to access inodes or to acquire read or write locks on a file. But the token manager itself never handles the actual locking. This is done by the meta node for a file. The meta node will also perform all necessary communication (flushing caches on all GPFS nodes for a file etc.) to ensure a valid file state if the token manager needs to revoke a token. Only if a GPFS node fails does the token manager revoke all tokens from this node to maintain a consistent view of the file system.

Free disk space in a shared disk environment is handled through an allocation map. This allocation map contains a list of all free blocks on all disks. Within GPFS this map is divided into regions of equal size. Each region can be locked by a GPFS node. That way no further communication is needed if a node needs to allocate a free block. One node (the allocation manager) has an overview of all usable regions and assigns regions to nodes that request disk space.

As in every environment, there can be failures in GPFS installations as well. The most common types of failure are defective disks or nodes and network service interrupts. All GPFS nodes participate in a group membership protocol via periodic heartbeat messages. If a network problem occurs, the majority of the remaining nodes will take over and all nodes that are not in this majority group will stop issuing I/O requests to avoid "split brain" situations. In addition, the majority group will try to block disk access for all other nodes. In case of a failing node, another node will try to replay the logs from this node and afterwards all tokens granted to this node can be revoked from the lock manager and the token manager. The disks used by the VSD layer in GPFS are usually volumes exported by a RAID controller. Failed disks are not a problem as they are handled by the controller. GPFS supports a number of administrative tasks, like cleaning a disk that needs to be replaced or the shrinking, growing or re-balancing of a file system. Re-balancing a GPFS file system can be done during normal operations. GPFS file systems can be exported via NFSv4 [SBD+03] and Samba. Within the Teragrid7 and the DEISA [LPG+07] projects GPFS is used as a wide area file system.
In the American Teragrid project, eleven partners are connected using high-performance networks to create one of the largest distributed computing resources. In this project four sites share a 500 TB file system over a 30 GBit/s WAN. The European DEISA project connects more than ten sites where almost all sites have a 10 GBit/s Ethernet connection. This research infrastructure is provided by the associated national research networks and GEANT28. Initially the project started by connecting multiple sites via Multi-Cluster (MC-) GPFS. The implementation basically provides a unified view onto distributed GPFS file systems. The individual file systems are still managed by each site but all sites can mount all file systems by using a single mount point. The MC-GPFS approach was then adapted to support more systems that are not based on AIX. Today the DEISA infrastructure is still based on MC-GPFS, but it provides a global file system to an SGI Altix system and a Cray XT4 as well. A direct impact on the product from both the Teragrid and the DEISA projects is the introduction of a local cache, Panache [AEH+08], for files used at one site that reside at a remote site.

GPFS is also used on several HPC machines in the TOP500 list. One example is the BlueGene installation at the Research Center Jülich9. One interesting attribute of the usage of GPFS on these types of machines is the fact that the file system is attached via I/O forwarders. I/O requests that occur on one of the (rather small and lightweight) cores are forwarded to a dedicated I/O node that will actually execute the I/O call and deliver the result to the compute node. This approach reduces the number of active clients on the file server, thus reducing the possibility for lock contentions.

GPFS/PERCS has been selected by DARPA as one of the two file systems for the HPCS project [Rog07]. This will push the development towards more scalability and reliability, targeted at HPC systems that have a sustained performance of multiple petaflops. The projected bandwidth is 1.5 TB/s10, with 10000 nodes, 300000 cores and 50000 spinning disks. Evaluations have shown that at this point in time RAID5 will not be a choice anymore. Due to the large sizes of the individual disks, an additional failure during a disk rebuild is very likely [Rog07].

7https://www.teragrid.org/web/user-support/gpfswan
8http://www.geant2.net
9http://www.fz-juelich.de/jsc/jugene
10http://www.ncsa.illinois.edu/BlueWaters/system.html


Figure 3.2: Side by side comparison of a traditional RAID system and an approach with declustered RAID; disks with the same color belong to the same RAID group

New developments also include a strategy called declustered RAID ("enhanced" RAID is the corresponding IBM term). The idea of declustered RAID goes back to 1992, when Holland and Gibson (now Panasas Inc.) [HG92] presented an idea how to deal with larger disk arrays and small RAID groups. The aim is to distribute RAID blocks among different disks and have more disks involved during a rebuild. Typically disk controllers have disks attached which are divided into RAID groups. Parity bits are only calculated within each group. Within declustered RAID, parity groups are formed from a subset of all available disks on all attached storage devices (see Figure 3.2). This has a couple of advantages and only a few drawbacks. The drawback is that now all disks have to be accessible by all RAID controllers. This requires a SAN or iSCSI infrastructure (which is typically also present in the traditional approach). The approach has one advantage in terms of costs as the RAID controllers can be replaced by simple JBOD11 disk containers. The declustered RAID functionality can be moved into the file server, which calculates the disks that have to be accessed for a specific I/O request. If the declustered RAID is not realized on a per-disk base but rather on the block level (or a larger chunk size), the full potential of this approach can be exploited. If a disk in such an environment fails, all connected servers can rebuild the data chunks and store the recalculated chunks on other disks. This speeds up the rebuild process and avoids the traditional bottlenecks like the replaced disk being the only rebuild target. The performance penalty for the rebuild is also shared among all available resources. Another advantage is that this type of RAID does not need spare disks anymore.
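The placement idea behind declustered RAID can be sketched as follows: parity groups of a fixed width are scattered over a much larger pool of disks instead of being pinned to a fixed subset, so a rebuild after a disk failure involves many disks. The rotation used below is only meant to illustrate the principle; it is not the layout function of an actual product.

```c
#include <stdio.h>

#define NUM_DISKS   12   /* disks in the shared pool (illustrative)         */
#define GROUP_WIDTH 4    /* blocks per parity group, e.g. 3 data + 1 parity */

/* Return the disk that stores block 'member' (0..GROUP_WIDTH-1) of parity
 * group 'group'. The starting disk is rotated from group to group so that
 * consecutive parity groups land on different disk subsets and a rebuild
 * after a disk failure is spread over (almost) all disks in the pool. */
static int disk_of(long group, int member)
{
    /* rotate the start by a stride that is coprime to the pool size */
    long start = (group * (GROUP_WIDTH + 1)) % NUM_DISKS;
    return (int)((start + member) % NUM_DISKS);
}

int main(void)
{
    for (long g = 0; g < 6; g++) {
        printf("parity group %ld uses disks:", g);
        for (int m = 0; m < GROUP_WIDTH; m++)
            printf(" %d", disk_of(g, m));
        printf("\n");
    }
    return 0;
}
```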

3.2.2 CXFS

CXFS (Clustered XFS) is a shared disk parallel file system developed by SGI12 [SE04b]. It relies on a dedicated SAN structure where all clients directly access all available disks. CXFS is built on top of SGI's XFS [CH06] file system. XFS, originally developed for SGI's own UNIX derivative IRIX, is now available for many platforms. The source code of XFS is part of the Linux kernel while CXFS is a proprietary product.

11Just a Bunch Of Disks
12Silicon Graphics International, www.sgi.com

Figure 3.3: Example installation of CXFS providing access to an HPC resource and client workstations

Like GPFS, CXFS employs a layer between the hosts and the SAN structure that allows the creation of virtual disks from physical disks. In contrast to GPFS, CXFS utilizes its own implementation of a logical volume manager (xvm) that (at least in the Linux kernel) is placed between the virtual file system layer (VFS) and the disk driver. xvm reads the configuration for the configured volumes from the labels of all attached physical disks and constructs volumes based on this information. It detects multiple available paths to the same disks automatically and uses these as failover paths for the case that a disk is no longer accessible through the (preconfigured) primary path. Multiple paths from a host system to a disk can easily be created by attaching the hosts as well as the disks with multiple connections to the SAN infrastructure. A SAN switch between the hosts and the disks is, for this reason and due to the way high availability and resiliency are implemented, a necessary building block of a CXFS file system. Shared disk file systems gain their performance from accessing a virtual disk that automatically distributes write and read accesses across multiple physical disks. As there is no additional layer of file servers between the hosts and the storage, this approach is comparatively easy to implement. Either the logical volume manager or the underlying physical disks need to provide some kind of data protection. Parallel file systems that are not based on shared disks have, for example, the opportunity to implement data protection features on the file server.

To the clients and to the file servers the CXFS file system looks the same as a local XFS file system. Therefore the file system journal is written either to a distinguished journal volume or to a part of the XFS file system. As the journal is shared in both cases, the metadata server can replay the journal of a failed client node by itself.

Within each CXFS installation there is always just one active metadata server. This centralized approach simplifies the design and reduces complexity but immediately introduces a bottleneck. Like in other current parallel file systems, the path for the metadata can be separated physically from the path for the file content. This guarantees a normal response time for metadata operations even in cases where the file system is delivering high bandwidth. Metadata traffic is typically routed over a TCP/IP network while for file content FC based SANs are the common case. It has been shown that the metadata performance of

CXFS drops far below the capabilities of a comparable standalone XFS file system if a failover metadata server is configured [Klu06]. CXFS employs a membership concept together with periodic heartbeat messages to ensure that the nodes in the CXFS cluster (including the primary and failover metadata server) are alive. If two servers act as primary and secondary metadata server, one of the client nodes can be configured as "tie breaker". It decides which of the two servers becomes the main metadata manager in a split brain situation. This can happen when both servers are alive but do not recognize each other due to a network failure. Both could potentially assume that they should serve the file system now. In order to make sure that only one metadata server is active, the server that has the quorum on its side will reset the other server. At this point the SAN switch required within each CXFS installation plays a vital role in protecting the file system from the activities of failed nodes. The primary CXFS metadata server takes control over the ports on the SAN switch and disables the ports on the fabric itself if a node does not send the heartbeat messages in time. This behavior ensures that an unresponsive host cannot corrupt the file system, because it becomes physically disconnected from the disks.

CXFS file systems can be exported by distinguished file servers via NFS or CIFS and provide workstations access to the file system via commodity Ethernet. A typical example of a CXFS installation is given in Figure 3.3. The SAN is the central connection point for all clients and all disks. The largest CXFS installation that is currently in production is located at the Leibniz-Rechenzentrum in München. The SGI Altix 4700 system has a total of 600 TB scratch space and achieves a bandwidth of about 40 GB/s from the CXFS file system. The SGI Altix 4700 system at the HRSK complex at Technische Universität Dresden, which was installed in 2006, has a 68 TB CXFS file system which delivers a maximum bandwidth of 10 GB/s. This is about 85 percent of the theoretical maximum. CXFS (as well as GPFS) supports the Data Management API (DMAPI) [The97]. It allows monitoring file events and managing file content. With this API it is possible to migrate the content of a file physically between different storage devices and thus to create a Hierarchical Storage Management (HSM).

Looking at SGI's history, they have always built large machines with a shared memory architecture that combine multiple racks into a single operating system image. CXFS has been developed to serve as the parallel file system for this kind of machines. So it has been designed to serve just a few but very powerful clients. Today most machines in HPC environments are constructed from thousands of small nodes. CXFS is currently not capable of providing a high performance file system for such an architecture. The main reasons are that an approach with only one active metadata server (something that it shares with other parallel file systems) and a large monolithic SAN structure does not scale. As all clients see the thousands of physical disks, this would lead to a large management overhead on the clients. Approaches that employ an intermediate layer of file servers like GPFS or Lustre (which is covered in the next subsection) fit better into environments with a large number of clients.

3.2.3 Lustre

Lustre13 is a shared, object-based file system designed for HPC clusters [Sun08a, Sun08d, Sun08b]. Peter Braam founded Cluster File Systems, Inc. (CFS) in 2001 and developed Lustre as a commercial product. Before that he had been a senior scientist at Carnegie Mellon University where he developed file systems like Coda [Bra98] and its successor InterMezzo [BCS99]. During the early years the main customers of CFS were the large U.S. laboratories. The first installation in a production environment took place at Lawrence Livermore National Laboratory (LLNL) with the "Multiprogrammatic Capability Resource" (MCR) in 2003. CFS was acquired by SUN Microsystems in September 2007. Lustre became part of

13the word "Lustre" is derived from Linux+cluster


Figure 3.4: High level overview of a typical Lustre installation

SUN's HPC strategy while the source code remained publicly available. In turn, SUN was acquired by Oracle in 2009, which is continuing to support Lustre.

Lustre employs a client-server topology like many other parallel file systems. The following parts are vital to a Lustre installation:

• a metadata server (MDS) with one or more metadata targets (MDT) and one MGS (configuration storage)
• object storage servers (OSS) with one or more object storage targets (OST)
• clients

The connection between the individual components is depicted in Figure 3.4. The MGS is a storage that provides a unified view onto the configuration information of a Lustre file system. The MDT is used to store properties of files and directories in the file system, like directory or file names, access times etc., and the pointers to the actual file content. The MDS mounts the MDT as a local file system and provides a mapping from the file name in a Lustre file system to the file content residing on the OSTs. Currently the file format for the MDT and the OST is a modified ext3/ext4 file system called ldiskfs. The MDS organizes the distribution of the file content over the OSS servers. Each file is split into parts that can reside on different OSTs. Each part of a file gets a unique identifier on those OSTs. The OSS servers mount the OSTs as local file systems and store file content based on object identifiers (IDs). The OSS servers provide access based on objects and not on blocks. Internally all data is still stored on a block device but the external interface abstracts different sections of a file as objects that can be read or written. The MDS keeps the file layout (the sum of all objects for this file on the OSTs) in its metadata storage. The layout for a single file is also referred to as logical object volume (LOV). All transfers of file content are done via direct parallel communication between the client and the OSTs. This provides the scalability to achieve high bandwidth for single files as well as for multiple files from a few up to thousands of clients.

On the server side a Lustre installation relies on a modified Linux kernel. Most of the patches provide a modified ext3/ext4 file system as MDT/OST storage and contain performance fixes. For example, for some transfers Lustre bypasses the VFS layer in the Linux kernel. Having Solaris as the operating system on the server and ZFS instead of ldiskfs as the back end file system has been announced by the Lustre team. On the client side four options are currently available. In addition to a patched kernel there is a patchless client available as well. It consists of loadable modules for the Linux kernel as well as userspace tools.


Figure 3.5: Structure of the basic kernel modules used for the Lustre file system on both client and server sides

The drawback of this approach is a reduced performance. The third option is an I/O forwarder that sends all Lustre related requests to a proxy node that will execute the actual I/O request. The last option is a file system access library that works completely in user space. This way a process that is running on a node without its own Lustre file service can still use a remote Lustre file system. The basic structure of the kernel modules is depicted in Figure 3.5. The llite module converts VFS requests into Lustre requests. The metadata client module (mdc) is responsible for the communication with the MDS server. All file content is transferred through the lov kernel module to the object storage client module (osc) which is responsible for the communication with the OSS servers. The network layer lnet connects servers and clients. On the server side the ost module gets requests from the lnet module and forwards these to the obd filter. This filter is an abstraction layer to present the block based ldiskfs file system as an object based storage device to the ost module.

Lustre uses a distributed locking strategy like GPFS. Locks requested by a client are checked against existing locks. Locks for a specific file region are split up on demand. To improve the metadata handling, Lustre can embed lock requests into the action that is connected with the lock. All information required to create a file is packed by the client together with the demanded lock into a single message and sent to the MDS, which can perform the file creation as a single transaction. If too many clients try to acquire locks on a shared file, Lustre is able to revoke all granted locks for this file and force the clients into write-through mode. In write-through mode, no locks are handed out for a particular file anymore and the server handles all I/O requests directly. Locks on the servers are handled with page size granularity and can be shared (read type lock) or exclusive (write type lock). By default a client requesting a lock will get one that covers as much of the currently unlocked file space as possible. This large lock is split up on demand.

To connect servers and clients, the Lustre LNET layer supports TCP/IP networks, InfiniBand (IB), Myrinet, Quadrics Elan as well as Cray SeaStar and RapidArray. Remote Direct Memory Access (RDMA) is supported whenever the underlying network hardware and the protocols support this feature. To simplify the Lustre software structure and to ease the integration of upcoming network stacks, all Lustre networking is wrapped up in its own kernel module lnet. There are multiple advantages of this approach. The first and most obvious one is the ability to connect components that make use of different network technologies. For example it is possible to have IB and Ethernet cards in all Lustre servers and provide the same file system to a cluster with an IB network and to a cluster with an Ethernet network. This allows exporting the Lustre file system to individual workstations or other commonly used resources like visualization clusters. lnet can also be used as a bridge to tunnel connections through different types of networks [Sun08c]. With this technology it is for example possible to connect two IB based clusters via a wide area TCP connection.

To avoid congestion when lots of nodes try to access files in parallel, Lustre employs a credit based control mechanism. All lnet packets have credit information piggybacked. Sending a message from a client consumes multiple credits (which must have been granted before). On the sender side there needs to be a credit that allows the placement of the message in the sender's queue, and additionally a receiver credit on the sender side that guarantees that the receiver has sufficient space allocated to store the message.

Since 2009, Hewlett-Packard14 as well as Cray15 have offered storage products based on the Lustre file system. The HP product "HP StorageWorks Scalable File Share" is based on the open source version 1.6 and is sold as a standalone component. Cray offers the Lustre file system as home and scratch file system for their XT series of supercomputers. Lustre is used by numerous HPC systems in the TOP500 list; in June 201016 seven of the top ten systems had a Lustre file system. A lot of large HPC sites use Lustre as a site wide file system by mounting the same file system on multiple clusters. One of the most interesting installations is the usage of Lustre on the IBM Blue Gene machines. The reason for this is the I/O forwarding from the compute nodes. BlueGene is built of node cards which are placed on a node board [RJKW, IRYB08]. A node board can have I/O cards which will forward I/O requests to file servers. The node cards run a proprietary lightweight compute node kernel that forwards all requests that it is unable to execute by itself to an I/O card. The I/O card is running a full Linux kernel and daemons that handle incoming requests. This technique reduces the number of clients accessing a file system to the number of I/O cards.

The way Lustre divides metadata and data leads to new challenges for the system administration as well. One obvious criterion for an intact file system is the fact that all objects for a single file should exist on the OSS servers and that in turn all objects that are present do belong to a file and are not orphans. To actually check this criterion, at first the internal databases of the MDS and all OSTs have to be dumped. Afterwards a serial process on one active Lustre client is started that reads through all the data, verifies the presence of all needed objects and corrects problems. During this time no other client is allowed to access the system. On a moderately sized Lustre file system (approx. 60 TB and 10-20 million files) this operation takes weeks. It needs to be done on an unmounted file system and will therefore cause a downtime of the system. During this time a lot of CPU cores are idle, and performing the actual integrity check should rather be done as a parallel job.

Lustre 1.8 has been released at the end of 2008. One of the bigger changes is the introduction of OST pools. An OST pool is a definable subset of all available OSTs. Files and directories can now be striped over pools instead of OSTs and objects can be migrated between pools. Pools can also be used to logically separate different clients or users that access the same Lustre file system. As the OSTs within a Lustre file system do not have to be of the same kind, it is now possible to create pools that benefit from different storage technologies like "large but slow" SATA disks and "fast but small" SSD disks. Disks of the same type can be aggregated in a pool and users can use the potential of both technologies.

Adaptive timeouts are another enhancement in Lustre 1.8. Timeouts are used to detect failures in a Lustre installation.
Similar to a TCP packet, which gets a “time to live” assigned at the sender side, each Lustre message is assigned a timeout. If this timer reaches zero, the message can be sent again (assuming that the receiving node is busy) or has to be canceled (assuming that the receiving node went down). In a heterogeneous environment the maximum acceptable time to wait for the response to a message depends on the physical location of the node, on the current utilization of the network and on the load on the receiver's side. Lustre 1.6 used a single value for this timeout which could be fairly large in HPC environments. Assuming that it is possible to lose a message on the network, each message that is lost has to be resent a few times. Only after these retries is it correct to assume that the receiver side is failing. Thus, it can take a long time before a failing node is detected. Lustre 1.8 uses adaptive timeout values that are adjusted to the current response time of the servers.

14www.hp.com 15www.cray.com 16http://www.mail-archive.com/[email protected]/msg06854.html

Each file server calculates a timeout profile for its own requests and communicates this profile to the clients. The clients then adapt the timeout values for messages going to this server. This leads to more stability under high load conditions and to a faster detection of failing nodes. Further enhancements in Lustre 1.8 are version based recovery and OSS read caches. The latter is an attempt to avoid reading the same file data multiple times from disk if multiple clients request these data. Version based recovery improves the chances of not losing file data if one client fails. If a client other than the failing one has uncommitted data for a file that the failing client was trying to modify (and the transaction from the failing node cannot be recovered), then the file modifications from the other client had to be discarded in Lustre 1.6. This was due to a transaction number that was associated with each file modification. If one transaction could not be recovered, all following transactions had to be ignored on the server. With version based recovery the Lustre server can check whether the client is changing an unmodified part of the file by assigning version identifiers to inodes. Each Lustre operation now has the version identifiers attached that have been valid before and after the file modification. If these identifiers match those stored on the metadata server, then a transaction can be replayed without the need to replay all transactions that happened before. In the more distant future Lustre will have clustered metadata servers and Oracle's ZFS as file system for the MDT and OSTs. Using ZFS as the back end would be very beneficial as ZFS automatically executes consistency checks on each file system access and uses a transactional logic to maintain the file system integrity.
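The general idea behind such adaptive timeouts can be illustrated with a small sketch. The following C fragment is a minimal illustration only and does not reflect Lustre's actual implementation; it assumes a hypothetical per-server estimator that smooths observed service times with an exponentially weighted moving average and derives the timeout from it.

```c
#include <stdio.h>

/* Hypothetical per-server state: smoothed estimate of the service time. */
struct srv_timeout {
    double est_s;    /* current service time estimate in seconds */
    double alpha;    /* smoothing factor, 0 < alpha <= 1 */
    double margin;   /* safety factor applied on top of the estimate */
};

/* Feed one observed round trip time and return the adapted timeout. */
static double adapt_timeout(struct srv_timeout *t, double observed_s)
{
    t->est_s = t->alpha * observed_s + (1.0 - t->alpha) * t->est_s;
    return t->margin * t->est_s;
}

int main(void)
{
    struct srv_timeout oss = { .est_s = 1.0, .alpha = 0.25, .margin = 4.0 };
    double samples[] = { 0.8, 1.2, 5.0, 6.0, 1.0 };   /* server under load, then idle again */

    for (int i = 0; i < 5; i++)
        printf("observed %.1f s -> timeout %.2f s\n",
               samples[i], adapt_timeout(&oss, samples[i]));
    return 0;
}
```

A timeout derived this way grows while the server is loaded and shrinks again afterwards, which is exactly the behavior a single static value cannot provide.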

3.2.4 PanFS

For a couple of years it has been shown that any centralized approach to file locking and strict POSIX semantics will be a dead end for parallel file systems [RFL+04]. The PanFS parallel file system is a development of Panasas, a relatively young company founded in 1999 [NSM04]. Their solution is a combined hard- and software approach. The base component of this system is a storage blade that consists of two disks, one CPU and memory. This unit is also referred to as “object based storage device” (OSD). Another part of this parallel file system are the metadata servers (MDS, “director blade”). Ten OSDs and one MDS are placed into one shelf17. A shelf is the basic building block for a PanFS installation. Multiple shelves can be used (and added on demand) to scale the file system to higher bandwidth and higher capacity. A storage blade in a Panasas shelf provides object storage like an ordinary hard disk provides block storage. The main principle of object based storage has already been introduced with Lustre, but for PanFS some of the underlying concepts need to be explained in more detail. Figure 3.6 shows the conceptual differences between traditional block based disk access and object based access. The advantage of using objects instead of blocks as the smallest addressable unit is that objects can carry more metadata information, don't have to be of equal size, can be checked for integrity or can be mirrored transparently. The metadata for an object can be everything that is needed to identify and work with files, like the size or access times, as well as known file attributes or data about the file layout on the internal hard drives. As the OSD handles objects instead of blocks it is easier to enforce security policies for a file system. Blocks have no metadata and the block access layer can therefore not check whether a specific access is allowed. An object can store the required data in its metadata and the device driver for the OSD is thus able to check each access as well. PanFS requires an additional driver to access the OSDs from the operating system. This driver is also responsible for the end-to-end data encryption and protection. Checksums are calculated on the fly and embedded into the object. This introduces a computational overhead on the client side for each communication between the host and the storage device but significantly helps to avoid silent data corruption.

17contains also power modules, network switch etc.

Figure 3.6: Side by side comparison of block based and object based access to storage

The OSD can also stripe files with different configurations on a per-file basis. Small files are mirrored automatically by PanFS, bigger files are striped via RAID 0, 1 or 5. For each file an OSD creates only a single object that has a list of all local extents that belong to this file. An extent is a list of blocks on the local hard disks of the storage blade that contain a chunk of the file content. Each file can be striped over all available OSDs. Shelves can be logically connected to partitions (called a “realm” at Panasas) [Szo]. Within a partition it is possible to create volumes. Each file in the file system belongs to one volume. Each volume has one active MDS and one or more backup MDSs. Upon an MDS failure any MDS within the whole installation is able to take over from the failed MDS independent of partition bounds.

How a client has to distribute a file over the available storage shelves is signaled to the client at the time of the file creation. For each new file the client receives a file layout from the active MDS. The MDS follows one more rule during the layout creation: it tries to balance the free capacity between the blades. The balancing (moving objects around between different OSDs) is done automatically when new files are created. Existing objects are only moved between OSDs if requested by the management software. Failed and replaced blades are automatically refilled with data. Moving objects that are currently in use is possible due to a callback strategy. Objects that need to be moved are copied to the new destination and all clients are informed via a callback that their cached content for this file is invalid. Each client then reloads the data from the new location and the object at the old location can be removed safely.

The connection between the client hosts and the storage and MDS blades is made via Ethernet TCP/IP using the iSCSI [MS03] protocol. A challenge at this point is that many clients might access files that are distributed over all OSDs at the same time and that the individual OSD thus has too many TCP connections that need to be managed. This yields a congestion problem within the operating system running on the blades. Thus, the I/O performance of the OSDs will drop [NSM04]. To avoid this problem each partition is divided into parity groups. Each group contains a subset of the available OSDs. Different extents of a file are placed within different parity groups. The number of blades that belong to one blade subset is calculated as the one number between 8 and 11 that has the minimum remainder of the division of the number of blades in the partition by this number. If this number is 9, the first part of the file will be placed on any 9 blades of the partition and the next part of the file is placed on another group of 9 blades. This way clients will only open connections to a subset of all OSDs in a shelf in parallel. Which parts of a file are prefetched to the clients (readahead) is maintained on the OSD as well. This has the advantage that multiple clients accessing the same file can benefit if the file content has already been moved into the OSD cache by a different client. The amount of memory used for readahead is divided equally among all data streams that are managed by an OSD. The management blade (MDS) in a shelf handles the mapping of file names to objects. As the number of management blades increases with the number of shelves, the possibility to access metadata in parallel increases as well. The MDS also provides a data reconstruction service that will control the replacement of failed OSDs. Directory structures are maintained on the MDS as well. If different clients open the same file, each has the possibility to cache file content and will be notified by the MDS if their caches need to be purged. The concept that the OSDs only handle objects and that the MDS is responsible to form a file system out of these objects allows the MDS to present different virtual views onto the objects. As the internal representation of an object within the OSD and the way the OSD manages objects is completely transparent outside of the OSD, there is (theoretically) no problem in moving objects around between different OSD vendors. Objects within the (ANSI T10 SCSI) OSD standard are divided into four categories. The base of the object hierarchy is the root object. Each OSD is a root object as well. The next level are partition objects. Partitions are containers for user objects or collection objects that belong to the same security or management group (quotas, security keys). User objects are created on the client side and represent files. Collection objects group user objects together that share a common property like the file type or ownership. As each object is self contained, objects can be migrated without losing information. This definition should theoretically allow to copy object based storage devices that are manufactured by different vendors, the same way as it is possible to copy block based devices from different vendors that share the same geometry. The largest PanFS installation (at the end of 2009) is the Roadrunner system at the Los Alamos National Laboratory in the USA.
It is using 100 shelves each with 1 director and 10 OSD blades and delivers a sustained performance of more than 50 GB/s.
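To make the parity group sizing rule above concrete, the following small C sketch (an illustration only, not Panasas code) picks the group width between 8 and 11 that leaves the minimum remainder for a given number of blades in a partition.

```c
#include <stdio.h>

/* Pick the group width w in [8,11] that minimizes (blades % w),
 * following the sizing rule described in the text. */
static int parity_group_width(int blades_in_partition)
{
    int best_w = 8;
    int best_rem = blades_in_partition % 8;

    for (int w = 9; w <= 11; w++) {
        int rem = blades_in_partition % w;
        if (rem < best_rem) {
            best_rem = rem;
            best_w = w;
        }
    }
    return best_w;
}

int main(void)
{
    int examples[] = { 30, 44, 99, 100 };
    for (int i = 0; i < 4; i++)
        printf("%3d blades -> parity groups of %d\n",
               examples[i], parity_group_width(examples[i]));
    return 0;
}
```

For 99 blades, for example, the rule yields groups of 9 blades, so consecutive parts of a file land on disjoint sets of 9 blades and every client only keeps connections to a fraction of all OSDs open at once.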

3.2.5 pNFS

The first developments towards secure, scalable NAS started at Carnegie Mellon University in the mid 1990’s with the NASD (Network-Attached Secure Disks) project. The work was done together with industry partners (HP, IBM, STK, Seagate) and resulted in a draft standard which was submitted to the OSD Technical Work Group18 of the Storage Networking Industry Association (SNIA) in 1999. The ANSI T10 committee ratified the ANSI T10 SCSI Standard for object based storage devices in 2004 [T1004]. The current developments in this area can be divided into three groups: work on the OSD standard, work on compatible hardware and development of file systems that make use of the capabilities of OSD stor- age. The current work on the OSD standard is focused on version 2 (OSDv2). Actual topics for OSDv2 are an improved handling of errors as well as security, quality of service and snapshots. Compatible hardware as a product is currently only offered by Panasas. Seagate [RI07] implemented the OSD pro- tocol on a single disk and demonstrated a prototype with a FibreChannel OSD disk and an iSCSI OSD controller together with IBM in 2005. IBM has a prototype software implementation as well [FMN+05] as an OSD simulator for Linux. This simulator uses ordinary disks and provides an almost complete subset of the T10 SCSI OSD commands via an iSCSI interface19. 18project T10/1355-D 19http://www.haifa.il.ibm.com/projects/storage/objectstore/osdlinux.html 42 3. ANALYSIS OF CURRENT PARALLEL FILE SYSTEMS AND RELATED WORK

Much more work has been done on providing object based storage with a software approach. An OST in the Lustre file system is basically an OSD, although it does not support the ANSI T10 OSD standard. An open SUN Solaris implementation of the T10 OSD standard20 [Cov05] is available too. There are also approaches to implement a file system based on T10 OSD on Linux [DHH+06]. Object based storage can be divided into three categories as done in [RI07]. The smallest available unit is a single disk drive just like the Seagate prototype. The second category are small intelligent units like the Panasas blades that use traditional disks and a CPU to provide object based storage, while the last category are software approaches like Lustre. The concept of a scalable NAS system is also realized in the pNFS extension21 of version 4 of the NFS protocol22 (NFSv4). This extension has been approved by the NFSv4 working group of the Internet Engineering Task Force (IETF) and published as RFC 5661-5664 in January 2010. For OpenSolaris23 as well as for Linux [HH07] prototype implementations are available, but at the moment, at least in terms of stability, the Linux implementation is not yet ready for production environments. The concept for pNFS is very similar to other modern scalable approaches, like PanFS or Lustre. A separate metadata server provides file layout information and thus redirects client requests to data servers which are responsible for providing the actual file content. In contrast to Lustre and PanFS it has been designed to export different local file systems to many clients. Seen from this point of view pNFS is an additional layer in the software stack that realizes parallel access to either new or existing file systems. The pNFS extension itself only describes additions to NFSv4 that affect the communication between the client and the pNFS server. How the client accesses the actual storage device is not part of this standard. The only thing the client gets is the file layout from the pNFS server that describes which storage components contain which parts of the data. This way the client can (and has to) access the storage directly.
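The division of labor just described, where the metadata server hands out a layout and the client then talks to the data servers directly, can be sketched with a simplified data structure. The struct and the offset calculation below are purely illustrative assumptions (the real pNFS layout types defined in the RFCs, such as the file, block and object layouts, are more involved).

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified, illustrative file layout as a client might receive it
 * from the metadata server; not the actual RFC 5661 data structures. */
struct file_layout {
    uint32_t stripe_unit;     /* bytes per stripe unit */
    uint32_t num_servers;     /* number of data servers holding the file */
    const char **servers;     /* addresses of the data servers */
};

/* Map a byte offset to the data server that stores it. */
static const char *server_for_offset(const struct file_layout *l, uint64_t offset)
{
    uint64_t stripe_index = offset / l->stripe_unit;
    return l->servers[stripe_index % l->num_servers];
}

int main(void)
{
    const char *ds[] = { "ds0", "ds1", "ds2", "ds3" };
    struct file_layout layout = { 1 << 20, 4, ds };   /* 1 MiB stripe units */

    for (uint64_t off = 0; off < (5 << 20); off += (1 << 20))
        printf("offset %8llu -> %s\n", (unsigned long long)off,
               server_for_offset(&layout, off));
    return 0;
}
```

Once the client holds such a layout, no request for file content has to pass through the metadata server anymore; only layout acquisition and invalidation remain centralized.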

3.2.6 Comparison

As outlined in Section 3.1.1, parallel file systems can be compared by using many different criteria. Within this section the focus will be on those that are key for upcoming challenges and will most probably influence the future use of these file systems. In the near future parallel file systems will have a growing component count, will be shared between multiple HPC machines and will have to serve more and more parallel accesses as the number of cores per HPC machine grows. A list of the selected features can be taken from Table 3.1. As motivated in Section 3.1.1, silent data corruption and the increasing component count in HPC SANs demand an increased level of integrity checking, from the client side all the way down to the disks. If file systems are shared among multiple installations, dividing users from each other needs to be done at a more secure level than just on a trusted host basis. Allowing clients to cache metadata and data will reduce the load on both metadata and content servers and thus reduce traffic in the SAN. Object based storage seems to be a viable way as an additional abstraction layer between the file system and the storage device. This allows the file system implementations to deal with bigger files on the basis of a few object identifiers instead of millions of block numbers. The acceptance of parallel file systems from the administrative perspective will be based on the tools available and on metrics like the time needed to do a full file system check. For example GPFS, PanFS and Lustre handle a rebalancing of full disks by running extra tools. Within CXFS this problem cannot occur as all attached disks form one CXFS volume. Of the four selected file systems, CXFS is usable only on clusters of few but very large SMP machines, thus it does not fit onto most of the current and future HPC machines because no one except SGI builds this kind of machine.

20http://opensolaris.org/os/project/osd/ 21http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-pnfs-obj-12.txt 22http://www.ietf.org/html.charters/nfsv4-charter.html 23http://opensolaris.org/os/project/nfsv41

Table 3.1: Comparison of selected parallel file systems by distinguished criteria

Criterion                 | GPFS                          | CXFS                    | Lustre                               | PanFS
metadata handling         | decentralized on all servers  | single metadata server  | single metadata server               | one metadata server per shelf
separated metadata path   | no                            | yes                     | no                                   | possible
metadata scalability      | with the number of servers    | no                      | no                                   | with the number of shelves
security                  | trusted host                  | trusted host            | trusted host                         | via key management
client cache coherency    | yes                           | yes                     | via locks between server and client  | via callbacks between server and client
topology, content access  | mixed SAN and NAS             | SAN only                | NAS (SAN for OSTs)                   | NAS
data protection           | RAID only                     | RAID only               | RAID only                            | end-to-end checksums
storage type              | block based                   | block based             | object based                         | object based
HSM support               | DMAPI                         | DMAPI                   | no                                   | no

Among the remaining three file systems GPFS has the largest user community, a long history and is well established on the market. Panasas is a relatively new system with only a few installations. The realized approach is promising because a file system that can be grown in terms of hardware will also gain metadata performance. This is a weak spot in the Lustre architecture as it is limited to only one metadata server. Within GPFS the metadata performance also scales with the number of servers. The Panasas system is the only system providing an end-to-end data integrity feature which allows the detection of all data corruptions that occur between the client and the disks. All other systems rely on the RAID controllers to deliver the correct blocks and on the error correction systems of the transport media. If all involved memories and buses are protected against accidentally flipped bits this might be enough. A checksum that is attached to each piece of data and checked at each access attempt (like the ANSI T10-DIF standard defines) is a more straightforward solution.

3.2.7 Other Present Parallel File Systems

After analyzing the major parallel file systems that are relevant for the HPC community, other relevant file systems are covered in the following subsections at a lower level of detail. Three more specimens, PVFS2, Red Hat GFS and Sun QFS, will be presented and some more just mentioned.

PVFS2

The PVFS2 architecture [Tea03] is very similar to the Lustre architecture. The layered software stack provides support for different network protocols and storage drivers. PVFS2 has integrated support for MPI-I/O via ROMIO. The metadata are kept in a Berkeley DB database. Servers can be configured redundantly. Providing access to metadata is a service within PVFS2 that can be distributed among multiple servers, which in turn can also be used as content servers.

The main difference to Lustre is that PVFS2 does not provide full POSIX semantics and does not support any locking in the interaction between the server and the client. The caches of the clients are not kept coherent. This eases the implementation of the server side as the server itself can run without maintaining an internal state machine. Server as well as client failures are much easier to handle this way. In addition PVFS2 defines the result of a concurrent write operation of two clients to the same section of a file as “undefined”.

Red Hat Global FS

Global FS (GFS) is part of Red Hat Enterprise Linux. GFS is POSIX compliant and designed for shared disk environments that scale to more than 100 nodes [PBB+00, Red06]. The first versions of GFS incorporated a locking strategy based on locks on the SCSI devices themselves. Newer versions use an abstract lock service that runs on one of the GFS server nodes. SCSI device based locks are still part of the overall picture as a client is able to detect that something is locked by looking at the device. In the next step it can figure out who is actually holding the lock by asking the global lock manager. GFS uses “Global Network Block Devices” (GNBD) to export shared disks over commodity Ethernet. With GNBDs the global view onto a GFS installation looks almost like an NFS installation where GNBDs are equivalent to NFS servers, with the difference that NFS servers export a file system while GNBD servers export disk blocks. Similar to CXFS, GFS uses cluster memberships and fencing within the SAN fabric to block failed nodes from the file system. On the GFS servers the Linux Volume Manager (LVM2) is used to combine single disks into RAID devices.

Sun QFS

Sun QFS is a shared disk file system that uses a single metadata server to mediate the client accesses to the disks. An underlying volume manager realizes a RAID protection scheme. SUN “Storage Archive Manager” (SAM) is a storage product that combines different types of storage devices like magnetic disks or tapes into a single file system and automatically moves file content between the devices based on predefined rules. The combination of SAM and QFS creates a shared disk file system with an automated migration (Hierarchical Storage Management, HSM) of file content from disks to tapes. The overall structure is very much like SGI CXFS and DMF.

Research and Related File Systems

There are many more research and commercial parallel file systems available. The file systems described in Sections 3.2 to 3.2.7 are used at a lot of HPC sites and are installed on the machines populating the TOP500 list. The file systems within this subsection are in most cases interesting from the HPC point of view. Current research file systems including the FraunhoferFS24, XtreemFS [HCK+08] or IBM's Scale out File Services (SoFS) [ODA+08] are tackling problems like the combination of multiple file systems into one, distributed metadata or file system sharing between multiple machines like in GRID environments. File systems with parallel access to file content like [WBM+06] can be considered as distributed file systems as they target cloud computing environments and provide additional features like data replication. Apple's Xsan 2 [App08] is a shared disk file system like CXFS but only allows 64 clients. Parallel data access is realized via a volume manager that groups physical disks together. It is fully compatible with Quantum's StorNext25 file system, which basically provides Xsan 2 clients for Linux, Windows and other operating systems.

24http://www.fhgfs.com

EMC's Celerra Multi-Path File System [EMC06] uses an approach similar to pNFS. It extends the NFS protocol by dividing the metadata and the data path, using the iSCSI protocol to connect the clients directly with the storage. The “IBRIX Fusion Cluster File System” [SRSC] uses a similar technique to segment the available disk space and manages metadata and data on a per-segment basis where each segment has its own MDS. IBM's zFS [RT03] uses object based storage together with file servers. The unique approach here is that the individual file system caches of all servers together form a global file system cache which is used by all clients cooperatively. PLFS is a research file system [BGG+09] which is used at LANL on top of PanFS, GPFS and Lustre to rearrange user I/O patterns. It changes patterns generated by the application to be more suitable for the underlying file system. It is basically not a real file system but an interposition layer.

3.3 Storage Hardware

Parallel file systems are built from hard- and software components. While the previous sections covered the software layers that realize a file system, this section covers the hardware part. Hardware for HPC file systems can be divided into the storage devices including controllers, network hardware for both NAS and SAN infrastructures, and hardware for all types of file servers. Besides an overview of current storage devices this section will also depict some current trends.

3.3.1 RAID

With his article “RAID: high-performance, reliable secondary storage” [CLG+94] published in 1994 Peter Chen picked up a popular storage topic and categorized different approaches that have been made since the 1980s to combine multiple magnetic disks into a faster and more reliable virtual disk. Today the basic building block for most parallel HPC file systems is a RAID controller that is running in a RAID5 or RAID6 configuration for file content or RAID10 for metadata respectively. The most commonly used RAID levels are 0, 1, 5 and 6 as well as combinations of these. RAID0 distributes blocks over all available disks by writing each block to the next logical disk. This RAID level uses all disks and will lose all stored data if a single disk fails. RAID1 is a mirroring of two (or more) disks. If two disks are used, only half of the total capacity is available but the system would survive a disk failure. RAID5 uses N+1 disks, N disks to store the actual data plus one disk to store one parity block per block written to the N disks. If a block on one of the N disks needs to be modified, the parity block has to be updated as well. The disk that is used to store the additional block is different for each block. This RAID level survives the failure of one disk and has only the space overhead of one disk. RAID6 typically uses two additional disks to store two differently calculated checksums. This scheme can lose two disks and still has all stored data accessible. RAID6 is a special case of a RAID N+M that uses N disks to store data and M disks to store M different checksums. For this last type of RAID typically Reed-Solomon codes are used to calculate the additional checksums. RAID5 is becoming obsolete as the chance that a second disk is failing during a rebuild is getting too large. The probability for such a failure is increasing with the disk size. The error rate assumed is that 1 byte out of 10^15 gives an error while it is being read from the disk, thus the bit error rate is b_e = 10^-15. The probability that after the first disk failed, another disk will fail as well during the reconstruction

25http://www.quantum.com/products/software/stornextfx

process of the first disk is 1 − (1 − b_e)^(capacity × disks). For an 8+1 RAID5 configuration with 2 TB disks this failure probability is about 1.6%. But this is just the number for a single RAID volume. IBM assumes it will need about 50000 disks for the DARPA HPCS project which would result in more than 5000 individual RAID volumes. At this level the loss of data with RAID5 is a certainty and not just a possibility. Even using two additional parity disks for eight data disks reduces the chance for a data loss in five years only to 4% [Rog07]. Three parity disks per eight data disks on the other hand introduce additional overhead in terms of usable capacity and computational power needed to calculate these parities on the fly.
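The numbers above can be reproduced with a few lines of code. The following C sketch evaluates 1 − (1 − b_e)^(capacity × disks) for the 8+1 configuration with 2 TB disks mentioned in the text; the use of log1p/expm1 is only a numerical precaution for the very small error rate.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double bit_error_rate = 1e-15;   /* one bad byte per 10^15 bytes read */
    const double disk_capacity  = 2e12;    /* 2 TB per disk, in bytes */
    const int    data_disks     = 8;       /* 8+1 RAID5: 8 disks must be read for a rebuild */

    double bytes_read = disk_capacity * data_disks;

    /* p_fail = 1 - (1 - b_e)^(bytes_read); expm1/log1p keep precision for tiny b_e */
    double p_fail = -expm1(bytes_read * log1p(-bit_error_rate));

    printf("bytes read during rebuild: %.3e\n", bytes_read);
    printf("probability of an error during rebuild: %.2f%%\n", 100.0 * p_fail);
    return 0;
}
```

The program prints a probability of roughly 1.6%, the per-volume figure quoted above.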

To speed up a RAID rebuild, one idea already presented by Holland and Gibson in 1992 and Chen in 1994 is to decluster the parity blocks. In RAID5 and RAID6 parity blocks are calculated within fixed groups of disks, usually within a RAID volume. Declustered RAID splits disks into individual partitions and creates a RAID group based on randomly composed partitions from different disks. So basically each individual block can have a different set of disks that is used to store the data and the parity information. The advantages of this approach have been already discussed in Section 3.2.1. An additional drawback is the higher possibility of unbalanced accesses to disks which would result in lower bandwidth as well as higher latencies. If a host accesses two blocks from two different 8+2 RAID groups but both groups share one disk there are 18 disks that have one block to read and one disk that has two blocks to read. This emphasizes the need for an end-to-end performance analysis, where information of all incorporated layers of a file system access need to be analyzed.
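The placement idea behind declustered RAID can be sketched as follows. The C fragment below is only an illustration under simplifying assumptions (a seeded pseudo-random selection instead of a real partitioning of disks): for every stripe it draws a different set of disks out of the whole pool, so rebuild traffic and accesses are spread over all disks instead of one fixed group.

```c
#include <stdio.h>
#include <stdlib.h>

#define TOTAL_DISKS  40   /* disks in the pool */
#define GROUP_SIZE   10   /* 8+2 style group per stripe */

/* Choose GROUP_SIZE distinct disks for one stripe, seeded by the stripe id
 * so the mapping is reproducible without storing a placement table. */
static void disks_for_stripe(unsigned stripe, int *out)
{
    char used[TOTAL_DISKS] = { 0 };
    srand(stripe);                       /* illustrative only */

    for (int chosen = 0; chosen < GROUP_SIZE; ) {
        int d = rand() % TOTAL_DISKS;
        if (!used[d]) {
            used[d] = 1;
            out[chosen++] = d;
        }
    }
}

int main(void)
{
    int group[GROUP_SIZE];
    for (unsigned stripe = 0; stripe < 3; stripe++) {
        disks_for_stripe(stripe, group);
        printf("stripe %u uses disks:", stripe);
        for (int i = 0; i < GROUP_SIZE; i++)
            printf(" %d", group[i]);
        printf("\n");
    }
    return 0;
}
```

Because two stripes usually share only a few disks, exactly the unbalanced access situation described above can occur, which is one reason why an end-to-end view of such a system is needed.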

3.3.2 Integration Factor of Storage Hardware

One current trend for storage hardware is to integrate more components and more functionality into intelligent storage controllers. More advanced RAID controllers employ both conventional processors and field programmable gate arrays (FPGAs) to manage the disks and to maintain a constant performance. Calculating parity blocks and checksumming data within a RAID volume fits perfectly onto an FPGA. Storage controllers like DDN's S2A 9900 are assembled from more than ten FPGAs and multiple embedded processors [Fel08]. FPGAs have also been integrated into file servers like BlueArc's Titan series and the storage products of Oracle/SUN.

Recent projects try to consolidate storage solutions and to incorporate the file service into a unified storage controller. Examples for these approaches are the products of companies like Terascala or Xyratex. This kind of storage controller does not hand out blocks anymore but provides a file system (see Figure 3.7). Such a unification is not new for distributed file systems and vendors like NetApp and BlueArc, but it is new in the HPC area. Traditionally these integrated controllers only export NFS and CIFS, which are not well suited as HPC storage as they do not provide the necessary bandwidth for large installations. Exporting Lustre or GPFS on the other hand does fit the demands quite well. The advantages of the new types of controllers are that a storage solution will be composed of fewer individual components and that file system updates can be done by the same vendor that delivers the storage hardware. The implementation of a whole parallel file system does not fit onto an FPGA. Thus, this trend could initiate a switch to x86 compatible processors in future storage systems.

From the view of a performance analyst a drawback is that these systems would be less analyzable if the vendors ship them as black boxes. A whole layer in the hierarchy of storage software is moving behind a wall that is usually not penetrable for analysis tools. On the other hand a lot of components that influence the file system performance reside within this layer. An integration into a performance analysis process is necessary and again amplifies the need for a system wide performance analysis approach.

Figure 3.7: Side by side comparison of a current installation with file servers and an integrated storage appliance

3.3.3 Further Developments

Further current trends and developments for storage hardware and software can be observed in the fol- lowing areas:

• integration of solid state drives (SSDs)
• energy efficiency
• data integrity
• self managing and self tuning storage

Depending on the type of technology used for SSDs they provide a much better performance than hard disk drives [MW08]. One of the major advantages of SSDs is that they provide access to any block of data with a constant and very low latency. At the moment this feature is most commonly used to store file system journals or to provide read/write caches for slower disks. Journals are a list of very small pieces of data where continuously parts of the list are added, updated or erased. The high bandwidth capabilities for sequential I/O patterns allow to use SSDs in a tiered storage environment, especially to cache application checkpoints. Freitas and Wilcke propose in [FW08] a more general approach that uses storage-class memory (SCM) as the future storage system technology. Their idea completely erases the main memory from the computer system architecture and uses something that can be called a large L4 cache to keep file data close to the CPU. The SCM is used as permanent storage which interacts directly with the L4 cache.

As energy efficiency is a vibrant topic in HPC, it needs to be applied to HPC storage as well. An energy efficiency analysis has to involve all levels, including file servers, SAN switches etc. Usable metrics in this area could be the amount of energy spent while storing or reading one GB. An analysis could further include studies about the energy that a data set consumes during its lifetime. This could lead to more tailored system designs.

Data integrity is a feature of storage systems that HPC users assume to be “just there”. But one million enterprise hard disk drives today return about 200,000 invalid data blocks per year26. If this number is extrapolated to the 1280 disks in the DDN controllers of the HRSK system at Technische Universität Dresden, these disks deliver roughly 250 wrong blocks per year to the storage controllers. Reasons are flipped bits, for example due to cosmic radiation, or wrong head positioning. The already mentioned T10-DIF standard (see Section 3.1.1) is one way to ensure that the correct file content is read anyway. Another approach, used in DDN controllers, is to check with the parity blocks returned from a read access whether the internal 8+2 RAID6 scheme is still intact. If a wrong block is detected, the controller will recalculate the correct data and write it to the disks. Another trend is the growing number of disks and controllers in a single file system. Larger throughput requirements result in larger storage systems as the gap between the memory speed and the disk speed is increasing. Thus, large installations create a demand for self-managing and self-tuning storage. Self-managing includes for example the transparent handling of failures and the seamless integration of spare parts. Self-tuning on the other hand targets an efficient use of the whole system, especially making the raw bandwidth that the disks provide available to the application.
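The extrapolation of invalid blocks above is simple arithmetic; the short C sketch below merely restates it with the numbers used in the text.

```c
#include <stdio.h>

int main(void)
{
    const double bad_blocks_per_year = 200000.0;  /* per one million drives */
    const double drives_in_sample    = 1000000.0;
    const double hrsk_disks          = 1280.0;    /* disks behind the DDN controllers */

    /* invalid blocks per drive and year, scaled to the HRSK disk count */
    double per_drive  = bad_blocks_per_year / drives_in_sample;
    double per_system = per_drive * hrsk_disks;

    printf("expected invalid blocks per year for %g disks: %.0f\n",
           hrsk_disks, per_system);   /* roughly 250 */
    return 0;
}
```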

3.4 State of the Art for Storage System Monitoring

As systems become more complex, the wish to monitor their activities becomes more urgent. Systems with a lot of internal components that provide a large number of tuning parameters on the one hand and performance metrics on the other hand raise the desire to use these performance metrics to optimize the system behavior. By definition, “a monitor is a tool used to observe the activities of a system” [Jai91]. The next subsections will cover some basic definitions that apply to monitoring systems and present recent work done in the area of monitoring storage systems.

3.4.1 Definitions

Most monitoring systems are driven by either events or timer interrupts. Events describe a change of a state in a system and have a timestamp associated with them. Timer interrupt driven systems report internal states or results of measurements, typically in fixed time intervals. The basic difference is that for the interrupt driven systems the monitored state is reported at regular intervals while an event based mechanism reports changes only. Both approaches have their assets and drawbacks. With an interrupt based activation system it is possible to lose state changes if they occur more than once within a single time interval. On the other hand it is at least known that the system still works because it sends data at regular intervals. Event based systems, by definition, do not lose state changes. If state changes occur at a lower rate than the timer frequency, the accumulated amount of data is smaller with an event based approach. The drawback is that at any given time it is not possible to be certain about the status of the monitored system as it could be unresponsive and no longer able to send out messages but has not dropped the connection yet. The event based activation mechanism is commonly preferred for error detection systems. The system status is tested at definable intervals but the associated monitoring process will only emit a signal when something has turned for the worse or for the better. One example is Nagios [Har03] which checks services on arbitrary systems on a regular basis and reports problems on demand. Performance analysis tools for storage systems tend to read and update displayed data at regular intervals. This holds true for simple commands like “top” on UNIX systems as well as for more advanced systems like SGI's Performance Co-Pilot [SGI09].
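The difference between the two activation mechanisms can be illustrated with a small sketch. The C fragment below is an illustration only, using a simulated series of samples instead of a real data source: the timer driven monitor reports the state in every interval, the event driven monitor reports only on changes.

```c
#include <stdio.h>

/* Simulated samples of a monitored value; in reality these would be read
 * from hardware or the operating system once per interval. */
static const long samples[] = { 3, 3, 3, 7, 7, 9, 9, 9 };
static const int  nsamples  = 8;

/* Timer driven: report the current state in every interval. */
static void timer_driven_monitor(void)
{
    for (int i = 0; i < nsamples; i++)          /* one loop pass = one interval */
        printf("interval %d: state = %ld\n", i, samples[i]);
}

/* Event driven: report only when the observed state changes. */
static void event_driven_monitor(void)
{
    long last = samples[0];
    for (int i = 1; i < nsamples; i++) {
        if (samples[i] != last) {               /* a state change is the "event" */
            printf("event at %d: %ld -> %ld\n", i, last, samples[i]);
            last = samples[i];
        }
    }
}

int main(void)
{
    timer_driven_monitor();   /* eight reports, although the value changes only twice */
    event_driven_monitor();   /* two reports, one per state change */
    return 0;
}
```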

26“Best Practices for Deploying Parallel File Systems” BOF session at SuperComputing Conference 2009

A set of events that is ordered by their associated timestamps is commonly referred to as a trace file or in short trace. A trace file can hold additional properties that contain metadata like the start time of the trace or the system name that has been used to generate the trace. Each trace event in turn can have additional properties like the process number or information about the call stack for program traces. One example for a scalable trace file format can be found in [AKRBHB+06]. The OTF format library includes special support for parallel I/O. OTF is available under BSD open source license. The time resolution at which a piece of hardware or software is capable of delivering data does also influence the usefulness of the collected data. As one aim of this thesis is to merge program traces and external data it would be beneficial to have the external data at a comparable resolution as events occur in the program trace. Closely connected to the definition of the resolution is the term overhead. The overhead describes the impact of the measurement itself on the examined system. If the system has a special piece of hardware to perform an internal monitoring (like DDN’s S2A 9550 that has an 8 MHz Atmel Chip) this might cost virtually nothing. But if the host operating system is executing this kind of requests in parallel to its regular work (as it is the case on Voltaire InfiniBand switches) this can be very intrusive and have a huge impact on the overall performance. This is a case where Heisenbergs uncertainty principle hits the computer science when a monitored system changes its behavior due to the fact that it is being monitored. The next subsections cover some recent approaches to monitor I/O storage systems and HPC file systems especially.

3.4.2 Storage Hardware Monitoring

HPC storage is typically designed in a tiered approach. Various classes (tapes, different types of disks) of storage devices with different performance properties are available for an HPC system. Typically each hardware vendor has its own monitoring tools, like DDN's RSM GUI, SUN's StorageTek Workload Analyzer and LSI's SANtricity Storage Manager. These tools have been designed to monitor the storage hardware of a specific vendor and provide performance capabilities as a by-product. They are able to present performance numbers of various hardware parts. The StorageTek Workload Analyzer in addition has the capability to create an I/O trace from all I/O requests. None of the available tools provides on-demand external access. Both LSI and DDN based storage support a proprietary application programming interface (API) that can be used to collect various performance properties. This includes the read and write bandwidth on the host ports and on the internal ports, statistics about the current cache usage, cache hits, I/O request sizes and information about the performance at the back end.

3.5 State of the Art for I/O Tracing

This section gives an overview of the current state of the art for I/O tracing in Linux environments. Kernel level, user space and multi-level trace approaches are considered.

3.5.1 Tracing at the Kernel Level

Like other operating systems, the Linux kernel supports memory mapped I/O. Memory mapped I/O in general allows the content of a file to be virtually mapped to a memory region. The mapping operation itself is typically a very short operation as no file content is transferred. Any access to a memory location within the mapped area will then load the necessary part of the file into main memory (paging on demand), where it can be modified like normal memory.


Figure 3.8: Tracefs architecture [WAZ04]

Any modifications to the mapped memory region will be written back to the file either while the file is still mapped or at the latest when the mapping is released. This is completely transparent for the user space but does use the I/O infrastructure. Embedding information about this kind of I/O operation into program traces would be very valuable. It could help to explain effects that are visible on the user side but originate at the kernel level. At the moment there is only one project that provides I/O request tracing, tracefs [WAZ04]. Tracefs is an interposition layer that is placed within the Linux kernel between the VFS layer and the actual file system. The advantage of tracing file system accesses at this level is the possibility to track requests for memory mapped I/O as well. Tracefs collects its information via hooks within the VFS layer that forward all information to input filters. Figure 3.8 shows the architecture of tracefs. It is not completely kernel based, as user space components steer which I/O events are traced. The internal structure of tracefs is further divided into several parts. The input filters provide a first level of separation between wanted and unwanted events. Relevant events are forwarded to assembly drivers. They are responsible for creating an event stream. This stream could for example be a defined trace format that contains all raw events or statistics generated for fixed intervals. Output filters take this stream and enhance it by adding compression, checksumming, encryption and other features. At the end output drivers take the stream and store it on a disk or write it to a network stream. This approach is very flexible as it allows to trace events for selected parts of a file system only and to trace definable event sets. There is a general problem that complicates the combination of Linux kernel level events and user space program traces. First, the necessary events have to be recorded at different locations within the running kernel. In a second step these events are transferred to the user space and have to be mapped into the user space trace information. It would be valuable if these events could be accessed from the user space directly as this would ease the use of kernel level events. Additional tools realized within the tracefs project can help to filter and group native kernel events to reduce the amount of information. Desnoyers [Des09] developed as a byproduct of his Ph.D. thesis "Kernel Tracepoints" as a standardized way to instrument the Linux kernel. These instrumentation points can be placed anywhere in the kernel to provide hooks to read the current state of internal variables. Depending on whether a probe is activated and assigned to one of these hooks, events are recorded or discarded. Trace buffers are handled internally and are transferred to the user space with standard LIB-C I/O calls. This feature entered the mainline Linux kernel with revision 2.6.32. Without these instrumentation points it is possible to extend the desired kernel modules with their own instrumentation and transport the generated events via RelayFS [TZ06] into the user space [KN07].

3.5.2 Tracing at the User Space

There are many user space tracing facilities available that support the recording of I/O events. One of the first works in this area was already done in 1985 [ODCH+85] to evaluate the design of the UNIX 4.2 BSD file system. Konwinski [KBNQ07] covers some recent approaches and also provides some semantics to compare the different tracing frameworks. The classification is roughly aligned on the following features:

• parallel file system compatibility
• ease of use and installation
• anonymization of trace events
• event types and event filtering
• trace format
• overhead
• replay capabilities
• supporting tool set and type of instrumentation
• consideration of time skew and drift for distributed clocks

The remainder of this section will describe //Trace [MWS+07] and VampirTrace [TD07, MKJ+07] by using some selected criteria from the list above. //Trace is a project of the Parallel Data Lab of Carnegie Mellon University27. //Trace has been designed to track all I/O requests via library preloading. A library is placed between the application and the standard LIB-C, providing the system entry points for calls like open() or read(). The library reads the original entry points for the intercepted functions from the standard LIB-C. This way it is possible to create a wrapper function around each function. Now all I/O requests can be routed through the additional library. As one of its main features, //Trace attempts to discover I/O dependencies by generating multiple trace files for different runs of the same application. One trace file per compute node involved is generated at runtime. Within each run the I/O requests of one of the processes are artificially slowed down. The impact of this slowdown on the other processes exposes data dependencies. Dependencies from multiple program runs are merged into one final trace file which contains all I/O events in an approximate order. In addition, //Trace can replay this I/O trace. The tool set around //Trace focuses on this trace replay part. The trace format is a proprietary binary format and does not take into account the time drift between different nodes. VampirTrace is developed at the Center for Information Services and High Performance Computing at Technische Universität Dresden together with the Jülich Supercomputing Centre at FZ Jülich [MKJ+07]. VampirTrace provides the infrastructure to instrument either the source code or the binary of an application automatically or manually. The instrumented application binary produces a bundle of trace files in OTF format [AKRBHB+06]. Currently the default bundle consists at least of a definition file containing global definitions which are shared among all processes and one file per process containing the trace events. I/O tracing is realized by intercepting I/O related LIB-C calls. In contrast to //Trace it is possible to record performance counters as well. Time drifts between the compute nodes are corrected on demand. VampirTrace supports event filters that will stop the recording of events based on the number of filled trace buffers, the number of function calls or the function name, thus keeping the trace file at a reasonable size while not disrupting a subsequent trace analysis. The OTF trace file format is supported by a number of performance analysis tools, like Vampir, Open|Speedshop [SGH06] or Scalasca [GWW+08]. Neither tool supports the anonymization of trace events. The overhead for a single program run with VampirTrace is smaller compared to //Trace.
The only intention of VampirTrace is to write trace events and not to add additional information like data dependencies. Recent additions to the OTF tool set allow to replay at least the MPI portion of a program.
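Library preloading as used by //Trace and VampirTrace can be illustrated with a minimal interposition library. The following C sketch is an illustration only, not code from either tool: it wraps write() by looking up the original entry point with dlsym(RTLD_NEXT, ...) and printing the duration of each call.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Original write() entry point from the next library in the lookup order. */
static ssize_t (*real_write)(int, const void *, size_t) = NULL;

ssize_t write(int fd, const void *buf, size_t count)
{
    if (real_write == NULL)
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    ssize_t ret = real_write(fd, buf, count);   /* forward the actual request */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;

    /* A real tracer would append an event record to a trace buffer here. */
    fprintf(stderr, "write(fd=%d, %zu bytes) took %.1f us\n", fd, count, us);
    return ret;
}
```

Built as a shared object (for example with gcc -shared -fPIC -o libiowrap.so iowrap.c -ldl) and activated via the LD_PRELOAD environment variable, every write() call of an unmodified application passes through this wrapper, which is exactly the mechanism the tools above rely on.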

27http://www.pdl.cmu.edu

Figure 3.9: PIOViz tool architecture [LKK+07]

There are more approaches to I/O tracing based on library preloading that have not been mentioned due to the fact that all features provided by those tools are also provided by either VampirTrace or //Trace.

3.5.3 Multi-Layer I/O Tracing

A common approach to track I/O requests is to instrument the different software layers that take part in an I/O operation. This allows detecting and describing the transformations that each I/O layer applies to the original request and is very useful for performance debugging of high level libraries like MPI-I/O or HDF5. At the moment there are no approaches that cross the kernel boundary. All approaches that collect data on a single host have been designed for one or the other side of this boundary. As an extension it would be useful to integrate I/O related data that are not generated on the compute hosts into a holistic I/O analysis approach. Lu and Shen [LS07] propose to start an I/O analysis at those parts that are understood best, the actual storage devices. If anomalies are discovered at the lower levels of an I/O storage system, their respective effects can immediately be traced at higher levels. This work suggests to compare I/O characteristics from a trace file with characteristics collected on an anomaly-free system. A statistical approach like [MK04] would also be feasible for this kind of analysis. Lu and Shen also propose to use a layer-independent representation of I/O events. This obviously eases the analysis if all request types from all layers can be transformed into a single representation. The PIOViz [LKK+07] environment has been designed to analyze requests within the parallel PVFS2 file system. Application I/O traces are merged with traces obtained from the PVFS2 servers. To achieve this the PVFS2 server processes are started in an MPI-like environment. The merged trace can then be analyzed with the Jumpshot trace viewer. Another performance analysis enhancement concerning PVFS2 can be found in [KL08]. Additional statistics collected in the different internal layers of the PVFS2 server are analyzed and used to detect bottlenecks. An architectural overview is given in Figure 3.9. Figure 3.10 shows a Jumpshot screenshot of a combined trace obtained on PVFS2 clients and servers while tracing large, non-collective MPI-I/O requests. Stardust [TSS+06] is a system built to track individual I/O requests as they move through the different software layers. It relies on an instrumented research file system, UrsaMinor. All activities are stored in a relational database. A “breadcrumb” (a randomly generated request identifier) is added to each I/O request and kept as the request moves through the software layers. This way it is possible to analyze individual requests in great detail. Beside the file system, the authors instrumented the networking layer, the process scheduler, and the buffer cache as well. This approach also works on distributed systems and deals with drifting clocks by ordering events by exploiting logical dependencies that are inherent to the events collected. Besides this work done in HPC environments there are related fields that have to deal with storage related problems. DIADS [BBU+09] is a system designed to analyze database requests in great detail. DIADS correlates the following data: (sub-)requests which are the results of splitting single database requests, I/O relevant data like configuration information, performance data as well as special system events like failed disks. It uses machine learning techniques (Kernel Density Estimation) and domain knowledge (configuration and hierarchy information) to identify those parts of the storage system that influence the query performance.
This way it is possible to track configuration changes in the storage system and their impact on individual queries over time.
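The breadcrumb idea used by Stardust can be sketched in a few lines. The structure and helper below are illustrative assumptions only, not Stardust code: a request carries a randomly generated identifier, and every layer logs its activity together with that identifier so the records can later be joined across layers.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative request context: the breadcrumb travels with the request. */
struct io_request {
    uint64_t breadcrumb;   /* randomly generated request identifier */
    uint64_t offset;
    uint64_t length;
};

/* Every layer emits a record keyed by the breadcrumb; in Stardust these
 * records end up in a relational database instead of on stderr. */
static void log_activity(const struct io_request *req, const char *layer)
{
    fprintf(stderr, "breadcrumb=%016llx layer=%s offset=%llu length=%llu\n",
            (unsigned long long)req->breadcrumb, layer,
            (unsigned long long)req->offset, (unsigned long long)req->length);
}

int main(void)
{
    srand((unsigned)time(NULL));
    struct io_request req = {
        .breadcrumb = ((uint64_t)rand() << 32) ^ (uint64_t)rand(),
        .offset = 4096, .length = 1 << 20
    };

    /* The same identifier shows up in every layer the request passes. */
    log_activity(&req, "file system client");
    log_activity(&req, "network");
    log_activity(&req, "object storage server");
    return 0;
}
```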

3.6 State of the Art for I/O Benchmarking

Contrary to common belief, performance evaluation is an art [Jai91]. Benchmarking is one way to evaluate the performance of parts of a system or of a whole system. Benchmarking file systems is a common task done for the same reasons why benchmarking is done in general: analyzing the present system, predicting the performance of different workloads and preparing future procurements. Large HPC storage systems consist of thousands of components, where potentially each component has an impact on the overall behavior because it may have a cache or may perform some kind of optimization (like I/O request reordering).

Figure 3.10: Jumpshot screenshot visualizing a PVFS2 trace where four clients write data to PVFS2 servers with contiguous data regions and non-collective MPI-I/O calls [LKK+07]

One of the main topics of this thesis is to provide a complete workflow for an I/O analysis. I/O replay as a possibility to investigate I/O problems is a topic which needs to be covered within this workflow as well. The next subsections will therefore depict the state of the art for I/O benchmarks that can possibly replay I/O traces.

3.6.1 Overview

There already exists a wide variety of I/O benchmarks. [TZJW08] compares 415 different benchmarks used in 106 papers. Within this paper I/O benchmarks have been separated into three groups: workload replays, trace replays and microbenchmarks. This is arguable as lots of benchmarks today are customizable and their runtime behavior can fit into multiple categories. Thus, one good classification factor (among others) is the flexibility that an I/O benchmark offers. The first category, fixed I/O benchmarks, always run the same workload but might be able to scale this workload as the number of parallel processes increases. The other category, flexible I/O benchmarks, execute any given I/O pattern. The next sections classify some well known parallel benchmarks into the two categories. In order to divide benchmarks between fixed and flexible, a separator needs to be defined. For this thesis the separator is whether the executed I/O pattern can be defined by the benchmark user. The background for this decision is the overall aim of this work, to compare parallel file systems from a user perspective. Thus, there is a need for a very flexible I/O benchmark with extensive reporting capabilities.

3.6.2 Fixed I/O Benchmarks

Typical fixed parallel I/O benchmarks are either microbenchmarks or stripped real world applications. Microbenchmarks are used to evaluate individual limits (also called “hero numbers”) of a file system. They stress an individual part of the hardware to get a single number, like the maximum bandwidth or the file creation rate. There are usually no possibilities to adapt their behavior beyond scaling the number of parallel threads. Those benchmarks are often created on demand. For example during the procurement and the acceptance phase of the HRSK system at Technische Universität Dresden a custom benchmark was used to determine the file system bandwidth. Stripped real world applications like Flash I/O [Zin01], BTIO from the NAS Parallel Benchmark Suite [BBB+94, WdW03] or MADbench2 [BCOS05] represent a part of real I/O workload. This kind of benchmark is used to examine the I/O part of an application run. It usually consists of one or a few access patterns that have a few adjustable settings. For example, the I/O pattern can be executed by using HDF5, MPI-I/O or Fortran I/O calls. But the I/O pattern executed remains the same. b_eff_io [RK00] is a benchmark that has evolved from the analysis of various applications and executes different adjustable I/O patterns. The types of patterns are fixed but parameters like stride size or the use of collective or non-collective I/O operations are modified by the benchmark during the program run. Metadata benchmarks are also a category of benchmarks that produce performance results for one spe- cific aspect of a (parallel) file system. A good overview of metadata benchmarks is given in [Bia08]. The benchmark developed by Biardszki “DMetabench” reports phase based performance data for metadata operations like file creations, for writing the first bytes of a new file, or for calling stat().

3.6.3 Flexible I/O Benchmarks

vdbench [Van09] is a benchmark written in Java that is able to run artificial workloads as well as to rerun I/O traces collected with the SUN StorageTek Workload Analyzer Tool (SWAT). An artificial workload in vdbench can be defined by a few parameters like the fractions of reads and writes or the data transfer size. The full flexibility to run any kind of benchmark is given by replaying I/O traces from the SWAT tool. As these traces can be generated artificially as well, any kind of I/O pattern is executable by vdbench. FileBench [MM09] is a program that executes definable I/O workloads. It has its own “Workload Model Language”. This language allows to define processes, threads and so called flowops. Each I/O operation belongs to a flowop, including all attributes like the use of Direct I/O or others. A process is a set of threads that in turn conduct a set of flowops. The workloads are customizable by variables that for example specify the root directory for all I/O operations or the number of concurrent threads. IOR is a benchmark originated at Lawrence Livermore National Lab [LM07, SS07]. It executes I/O calls by using four different I/O APIs. It can be driven from a script, meaning that it is able to run a very limited set of I/O patterns that can be restarted with different parameters. The most interesting feature from a performance analysis point of view is the ability to find outliers while the benchmark is running. For this analysis IOR calculates the mean execution time for each operation and reports all those processes that have an execution time longer than the mean plus an adjustable limit. There exists another set of benchmarks that is hard to fit into either of the two categories. Workload specific benchmarks execute a definable set of patterns that can be selected from a fixed pattern set but can be rearranged in any order to generate some randomness. One example for this type of benchmark is depicted in detail in [Smi01]. The overall aim of this type of benchmark is to evaluate file system characteristics at a usage level that is similar to a real user environment. The results of those measurements are more suitable to describe the behavior of a file system under real world conditions.
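The outlier reporting of IOR described above boils down to a simple rule. The C sketch below is an illustration of that rule only (not IOR code): it computes the mean execution time of one operation over all processes and reports every process whose time exceeds the mean plus an adjustable limit.

```c
#include <stdio.h>

/* Report every process whose operation time exceeds mean + limit. */
static void report_outliers(const double *times_s, int nprocs, double limit_s)
{
    double sum = 0.0;
    for (int i = 0; i < nprocs; i++)
        sum += times_s[i];
    double mean = sum / nprocs;

    for (int i = 0; i < nprocs; i++)
        if (times_s[i] > mean + limit_s)
            printf("process %d is an outlier: %.2f s (mean %.2f s)\n",
                   i, times_s[i], mean);
}

int main(void)
{
    /* Per-process execution times of one write phase, in seconds. */
    double times[] = { 1.1, 1.0, 1.2, 4.8, 1.1, 1.0 };
    report_outliers(times, 6, 1.0);   /* adjustable limit of one second */
    return 0;
}
```

With these example numbers the mean is 1.7 s, so only the slow fourth process is reported, which is exactly the kind of straggler information that is useful while a benchmark is still running.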

3.7 Approach for a Comprehensive I/O Analysis

This chapter and the previous one depicted the scientific and technical knowledge for parallel file systems and described the state of the art for tools that are able to analyze performance properties of these file systems. Deployments of parallel file systems involve a lot of different types of components, like host adapters, the file system network and RAID controllers. The impact that the individual hard- and software components have on the overall performance is typically well analyzable on the level of the individual component. But assigning blame to a component while identifying file system limits based on a global view and a specific workload is a hard task, as there is no common approach to analyze I/O requests as they move through the different I/O layers of the file system infrastructure. The approaches mentioned in the previous sections all together do not allow a system analyst to gain a global view onto an HPC I/O subsystem and to analyze all associated parts. Some of the approaches are not applicable to production systems because they rely on research file systems or are based on intrusive software changes that vendors would not support. Some approaches focus on specific layers of the software and can therefore not provide an in-depth analysis. One example for data that is typically missing are normal performance degradations on RAID controllers. For example, on large HPC systems the I/O subsystem typically has a feature called "verify". This allows the RAID controller to check all data on all disks and to detect and correct wrong blocks on a disk. All needed actions are done by the controller itself, including the decision when to perform the verify. This creates additional traffic at the storage back end (between the controller and the disks) and needs to be taken into account during the analysis of a production system. A comprehensive approach should help performance analysts to tackle I/O problems that emerge during file system deployments as well as problems that individual users bring up. Typical tasks that need to be done frequently in such an environment are:

• record and analyze I/O requests for an individual user or an application

• record and analyze I/O requests for the whole system, for short time periods as well as for long time measurements
• benchmarking and I/O replay on different file systems and different configurations
• collect performance data from the supporting infrastructure
• correlate data from the infrastructure with user I/O requests
• perform a system wide analysis to identify current performance problems
• compare the current state of the file system with collected knowledge about the file system behavior
• comprehensive error tracking and failure analysis

All these tasks together are at least partially supported by different tools that use different interfaces and are in no way designed to work together. Thus, it is very difficult to analyze performance properties of parallel file systems in depth. A comparison of these requirements and an analysis of the capabilities of the tools mentioned in the previous sections reveals that the biggest improvements can be made by exploiting performance data of the file system infrastructure. The work areas, starting with the data collection and advancing to the data analysis, where enhancements would push the possibilities to perform a file system analysis, include:

• application/user based data collection including I/O replay
• compute node data collection, like kernel I/O events
• system wide data collection
• combined analysis of all collected data
• reduction of the amount of data that needs to be analyzed
• storing and retrieving I/O performance data and comparing current data with archived requests
• analysis and formal description of the transformations that the individual layers of an I/O infrastructure apply to an I/O request

This thesis will provide enhancements for the first five work areas.

3.8 Summary

This chapter has given an overview of both hard- and software components that are used to implement and deploy parallel file systems. Performance metrics for parallel file systems have been introduced and some details about selected parallel file system implementations have been depicted. A short overview of I/O relevant monitoring, tracing and benchmarking approaches covered all necessary parts of a typical I/O analysis workflow. The next chapter will provide a structured approach to perform an end-to-end I/O analysis, covering all software and hardware layers between the application and the actual disks, as well as an approach for a holistic I/O analysis workflow, beginning with the step of tracing I/O events and subsequently using the collected information for performance debugging of applications, libraries and the file system itself.

4 End-To-End Performance Analysis of Large SAN Structures

This chapter contains the main contributions of this thesis: an architecture to collect and subsequently process system wide performance data in distributed environments, and a structured approach for I/O tracing, analysis and replay on parallel file systems.

4.1 System Wide Performance Analysis

Program tracing offers valuable input to scientists and systems analysts about how a program performs on a specific system. This will hopefully lead to conclusions about how to optimize the program or the libraries used for this system architecture, or how to tune the system for the program. The decision which data are needed to analyze a single problem depends on the object of investigation. Up to now, the analysis of parallel programs and especially the I/O analysis is bound to information that is either available on the host system (compute node) executing the code or to information that is only accessible via methods that do not provide a holistic view of the system. For instance, in order to analyze the I/O performance of a RAID system it is common to sit in front of the management GUI of the RAID controller, watching the performance on the disks while the HPC system is executing the benchmark code. If the I/O performance suddenly drops during the program run, it is difficult to correlate this information with regions in the program code or with phases of the program execution. Another example that also demands a holistic view of the system are investigations of the power consumption of HPC systems. This kind of information can usually only be collected outside the system, as almost no system currently supports power measurements of its components from within the operating system. As a fundamental principle, any component of a large SAN infrastructure can potentially contribute useful data to the performance analysis. If, for example, FC or Ethernet switches have internal bottlenecks and the utilization of such internal components can be analyzed and merged with performance traces from applications, this will promote performance inquiries in this field a lot. Instead of guessing that under certain situations this particular switch shows problems, this would become a known fact. Another very promising set of data could be collected by monitoring RAID controllers. As their internal components provide additional caches and perform actions like command reordering, it is reasonable to incorporate this information into program traces as well. Embedding these data into a program trace makes it possible to perform an end-to-end analysis of the I/O operations in parallel programs, utilizing all layers between the source code and the actual disks that finally store the data. A picture of an instrumented infrastructure is given in Figure 4.1. Providing a generic approach for this challenge requires the distributed collection of performance data at different places and the processing of these data in a way that they can actually be merged into or correlated with a program trace. In order to be able to develop such a system, at first the current notion of the term "Performance Counter" needs to be revisited. The shortcomings discovered will lead to an advanced definition. With this requirement analysis, the demands for a distributed data collection system will be described, followed by a proposal for a system architecture.

Figure 4.1: Example for an instrumented NAS/SAN environment with instrumentation points

4.1.1 Redefinition of the Term “Performance Counter”

Hardware performance counters are a valuable addition to the performance analysis process. These counters are used for profiling as well as for event tracing. Recording counters together with individual events allows to track individual requests precisely, at the price of a higher overhead. Profiling, on the other hand, typically collects statistical information at fixed intervals and cannot provide the same level of detail as tracing, but is very useful for getting a first overview of where a program is spending its time. The first performance counters usable for the HPC community have been implemented on processors for reasons like hardware design and testing. Today a CPU typically provides a large set of counters in specialized programmable registers that can count almost any event on the CPU, like cache and TLB misses, memory transfers, bubbles in the various internal pipelines and so on. These registers allow access to a fixed number of counters that can be monitored in parallel. As monitoring these counters on a per-event base introduces too much overhead, it is reasonable to collect counter values at each function entry and exit during a program run. After specifying the active counters, their values can be read at any time with a typically very low overhead. This results in a monotonically increasing non-negative integer sequence where the difference of two values gives the count in the interim time. Today PAPI [BDG+00, BDHM99, aU09] is a widely used standard library to access performance counters in a platform independent way. PAPI has been designed to provide access to at least a standard set of counters on different platforms by using a fixed name (like "PAPI_FLOPS") and to record this information on a per-process base. This per-process view is also maintained if the process is migrated between cores. The access to the standardized counter names is realized via derived counters. If a specific counter is not available natively on a CPU, its value is calculated by using other counters. Recent CPU design developments changed this "easy to use" situation for hardware counters. Since there are processor components that are not associated to a single core only, new hardware counters have been introduced that are shared between multiple cores (so-called off-core or un-core counters). An example are counters for caches shared between all cores of a CPU chip. This breaks the mapping of per-core counters to per-process counters once and for all. At the same time it becomes useless to account the counter value to a single process exclusively, because this counter is unable to distinguish between cache misses from the different cores.

With newer releases of PAPI starting from version 4.0.0 the internal structure has been generalized, allowing plugins to provide more counters from other resources. This enables external components to provide data sources that use counters with different semantics. Yet, a new model is needed for the interpretation by tools that utilize PAPI. First of all, a description of the properties of the counters is needed, for example which location is associated with a counter, like a whole node or HyperTransport Link 3. This generalizes the notion of hardware performance counters as they have been used until now. Taking this one step further and including a few additional properties, many different information sources can be covered by a new, more general model.

4.1.2 Extended Notion of Hardware Counters

Traditional CPU performance counters use an implicit set of properties which is the same for all of them. The counters are assumed to be related to a single CPU core (hardware perspective) or the current process (software perspective). The latter is realized through a software mapping from the hardware perspective. They are read locally by the core or the process itself at arbitrary points in time. Adding alternatives to these properties will allow more general forms of performance counters to be covered. In the following, a list of categories of alternative properties is developed as a suggestion, followed by a number of examples that inspired this generalization.

Location

Traditional counters are always assigned to the local core or process. As the mapping from the hardware perspective to any process based view is done by software (PAPI), an extension for hardware based location descriptions is needed. A more general model for this property needs to describe natively available counters based on the physical location within a whole system. This description has to provide physical locations in a hierarchical structure, for example cores, CPUs, nodes, racks, network cards and so on.

Unit and Data Type

The new model should support a unit of measurement other than 1, for example Bytes, Bytes/s, Watt, or centigrade, with or without metric prefixes like MB/s or kW. Furthermore, it should support additional data types besides unsigned integers, in particular signed integers and floating point numbers.

Value Type

The type of the value at the moment is a monotonically increasing sum over time (sum type). As an alternative, absolute values should be allowed that are not monotonically increasing (absolute type).

Time Scope

A traditional counter value always refers to the time interval since the start of the measurement (from-start scope). There should be alternatives that refer to the time since the last sample (since-last scope) or until the next sample (until-next scope). Also, momentary values should be supported (point scope).

Read Mode and Frequency

Current CPU hardware counters can be read at arbitrary time stamps (poll mode). For other counter sources there might be external restrictions when a new value can be obtained, e.g. because of a limited resolution (fixed frequency mode). If an external party is responsible for collecting the counter samples, the local process will be unable to trigger when exactly new values will be delivered (push mode).

Overview/Summary

The following table gives an overview of the suggested properties. From these, various combinations can be used to describe a number of useful scenarios. However, not all of the combinations reflect valid use cases.

Property Category        Options for Generalized Counters
Location                 identifier for core or process specification; enumeration of identifiers for groups; identifier for locations like disks, switches, etc.
Unit and Data Type       string description for unit; unsigned integer, signed integer, or IEEE754 double floating point, always 64 bits
Value Type               sum type or absolute type
Time Scope               from-start, since-last, until-next, point value
Read Mode, Frequency     push, pull, fixed frequency
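To make the generalized model concrete, the following sketch encodes the property categories from the table above as a small data structure. It is an illustration only; the enumeration and field names are assumptions made for this example and are not taken from PAPI or any other existing tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ValueType(Enum):
    SUM = "sum"            # monotonically increasing sum over time
    ABSOLUTE = "absolute"  # momentary value, not necessarily monotonic

class TimeScope(Enum):
    FROM_START = "from-start"
    SINCE_LAST = "since-last"
    UNTIL_NEXT = "until-next"
    POINT = "point"

class ReadMode(Enum):
    PULL = "pull"                   # can be read at arbitrary timestamps
    PUSH = "push"                   # an external party delivers the samples
    FIXED_FREQUENCY = "fixed-freq"  # new values only at a fixed rate

@dataclass
class CounterDescriptor:
    name: str          # e.g. "/fastfs/san/controller1/volume1/read_bandwidth"
    location: str      # hierarchical location: core, node, switch port, ...
    unit: str          # e.g. "1", "Bytes/s", "Watt", "degree Celsius"
    data_type: type    # int or float (64 bit wide in a real implementation)
    value_type: ValueType
    time_scope: TimeScope
    read_mode: ReadMode
    frequency_hz: Optional[float] = None  # only for fixed-frequency sources

# A traditional CPU counter and a power meter described with the same model:
flops = CounterDescriptor("node1/core0/PAPI_FP_OPS", "node1/core0", "1",
                          int, ValueType.SUM, TimeScope.FROM_START, ReadMode.PULL)
power = CounterDescriptor("rack3/powermeter0/power", "rack3", "Watt",
                          float, ValueType.ABSOLUTE, TimeScope.SINCE_LAST,
                          ReadMode.PUSH, frequency_hz=1.0)
```

With such descriptors, a collection system can decide per counter whether values may be polled, have to be interpolated, or need to be broken down to a group of processes.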

4.1.3 Examples for Extended Counters

A number of carefully selected use cases together with the appropriate properties are presented below.

Traditional Counters

Examples are the number of floating point or integer operations, level 1 cache misses or hits, or all counters as supported for example by version 3 of the PAPI library. The traditional CPU hardware counters can be described with the default settings that have been implicitly assumed so far: local core or process, unsigned integer value with unit "1", sum value, counting from the start, can be read with arbitrary frequency.

Un-Core CPU Counters and Node-Wide Counters

Good representatives for the un-core counters introduced in Section 4.1.1 are counters for misses, hits, reads, and writes of shared level 3 caches. The same description can be used for further counters if the group is extended from the cores/processes of one CPU to all cores/processes in a compute node (cluster node). E.g., network interfaces for Ethernet or InfiniBand connections provide "packets transmitted" and "packets received" counters (and a few more) that can be covered with the same counter properties. File system counters that are related to the local node belong to this category as well. The location property has to be changed from a single core or process to a group specification: group of cores or processes, unsigned integer value with unit "1", sum value, counting from the start, can be read with arbitrary frequency.

SAN I/O Counters

A new kind of global performance counters can be provided by storage area networks (SAN). As a central part of the computing infrastructure these counters would be valid for a rather large group of nodes. Unlike other counter sources, SAN counters cannot be accessed directly from one of the cores/processes, because the interface usually requires special access privileges. Therefore, the information needs to be relayed by a privileged daemon process which reads the statistics periodically and passes them on to the consumer. The values should be read from the RAID controllers at the same frequency as they are updated internally. An example for a point value that exists in a RAID controller is the number of outstanding SCSI commands. It is only valid at the time it is read and can only be guessed or interpolated at best effort for other points in time. Other central parts of the computing infrastructure could be handled in a similar way, for example central network switches. The properties for this type of counters are: group of cores or processes, unsigned integer value with unit "Count", sum value, counting from the start or point value, push mode with fixed frequency.
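A privileged relay daemon as described above could look roughly like the following sketch. The function read_outstanding_commands() is a hypothetical placeholder for the vendor-specific controller interface, and the JSON-over-TCP transport is likewise only an assumption made for the example.

```python
import json
import socket
import time

def read_outstanding_commands() -> int:
    """Placeholder for the vendor-specific controller query (hypothetical)."""
    raise NotImplementedError

def relay(collector_host: str, collector_port: int, interval_s: float = 1.0) -> None:
    # The daemon runs inside the trusted area and pushes data outward,
    # so no inbound access to the controller network is required.
    with socket.create_connection((collector_host, collector_port)) as sock:
        while True:
            sample = {
                "counter": "controller1/outstanding_scsi_commands",
                "timestamp": time.time(),   # point value, valid only at read time
                "value": read_outstanding_commands(),
            }
            sock.sendall((json.dumps(sample) + "\n").encode())
            time.sleep(interval_s)          # matches the controller's update rate
```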

Electrical Power Consumption

The power supply is an important part of the computing infrastructure. Sophisticated power meters can provide an essential new type of counter, not in terms of classic computing performance but in terms of energy efficiency. Like the previous example, it combines the information for larger groups of nodes. Yet, its values, which are given in "Watt", are provided as real numbers, either as the average power consumption over the time since the last counter sample or as a point value. Again, it has a fixed sampling frequency determined by the power meter. This results in: group of cores or processes, floating point value with unit "Watts", absolute value, since last sample, push mode with fixed frequency.

Temperature Counters

Temperature counters share most of their properties with the counters for electrical power consumption. Unlike those, temperatures of certain hardware components like CPUs or motherboards can be read locally. They provide a momentary value with only a limited update frequency: group of cores or processes, floating point value with unit centigrade, absolute value, point value, pull mode with limited frequency.

Application Specific Values and Parameters

As another type of properties, application specific values could be handled by the new model as software counters. They could carry any numerical variable that provides valuable insight into the program execution behavior. For example, the residual of a PDE solver algorithm over the run-time might tell about the numerical stability. Other examples could be important subroutine arguments, problem dimensions, lengths of work lists, or resource allocation numbers. Such counters will provide point values or values for the most recent time interval. Also, vital parameters of a computing system could be recorded in this fashion. As they are specified at a certain time for the future, the until-next time scope becomes useful here. The properties for this kind of counters are wide spread: single process or group of processes, integer or floating point value with any unit, sum or absolute value, any scope, arbitrary frequency.

4.1.4 Consequences from the User's Perspective

After introducing enhanced semantics for performance counters, the question arises how users are expected to deal with this complexity. The perspective of the users of an HPC system is typically job based. If a user job is running on nodes that in parallel execute jobs of other users as well, all node local counters need to be broken down on a user, job, or process base in order to be meaningful for the performance analysis. This is currently not doable with most counters. For example, network throughput in Linux environments is collected via the /proc interface, which only reports system wide numbers. Currently, the only way to utilize this kind of counters is to allocate a node exclusively or to make sure that the traced program is the only program using these resources. Another, more severe challenge for interfaces that provide access to this broader spectrum of counters is the question how users and performance analysts are supposed to specify the counters they want to use and how these counters are mapped to the physically available measurement points. For example, if a user wants to read energy counters and different parts of the system are connected to different power meters, the associated power meters cannot be selected before the job has been scheduled. The values from these different measurement points then have to be combined and form the final value the user is expecting. A common method to display timelines of counter values is to draw them as piecewise constant functions. This is a correct assumption for counters that are read in intervals and report the sum or the average of the work done between two read cycles. This type of presentation nevertheless has jumps at the supporting points of the function if the function value changes. If the counter represents a physical property like the temperature that is inherently continuously differentiable in the mathematical sense, this assumption is not true anymore. Thus, at least a linear interpolation or other interpolation methods that reflect the counter properties more appropriately are needed.

4.1.5 Calculation of Secondary Counters

The enhanced possibilities to describe performance counters also demand an extended update scheme for secondary (or derived) counter values. The sheer amount of available counters within current supercomputer architectures makes the selection of the right set of counters a difficult task. In a generic infrastructure, arbitrary arithmetic operations should be allowed for the combination of counter streams. The combination of actual counters in the current PAPI style is straightforward, as different time scopes can be recalculated. The piecewise constant interpolation provides meaningful values for all possible timestamps. Point values are more difficult to handle, as the physical type of the counter defines (or at least provides bounds for) how values between two adjacent measurement points have to be interpolated. Interpolation is sometimes inevitable, in particular in push mode when the samples are delivered from external locations. Calculating interpolated values for a combination of two counter timelines for all timestamps can discard timing properties (e.g. disrupt any notion of a fixed update interval) for the secondary counter. If both values are updated by external processes at close but not identical timestamps, the secondary value will emit two new values within a short time. Unary operations are generally uncritical in this scenario. Typical examples for counter combinations are the sum or the average in a certain time window. This includes examples like the derivative or integral over time or the explicit reduction of the timer resolution. Special problems arise when the counters have an update-on-change semantic. If such a counter is part of a secondary counter, a timeout or heartbeat-like value is needed that indicates when a possible update would have been delivered and when it is safe to assume that the value did not change. This ensures that the calculation of derived values can be done at least at the heartbeat interval. Otherwise, updates of the secondary value would only be conducted when the update-on-change counter changes its value.
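The heartbeat idea for update-on-change counters can be sketched as a small wrapper around the counter stream: if no change arrives within the heartbeat interval, the wrapper re-confirms the last known value so that derived counters can still be recalculated. This is an illustration under these assumptions, not part of an existing tool.

```python
import time
from typing import Optional, Tuple

Sample = Tuple[float, float]  # (timestamp, value)

class HeartbeatCounter:
    """Re-emits the last value if an update-on-change source stays silent."""

    def __init__(self, heartbeat_s: float):
        self.heartbeat_s = heartbeat_s
        self.last: Optional[Sample] = None

    def on_update(self, timestamp: float, value: float) -> Sample:
        # A real change arrived from the data source.
        self.last = (timestamp, value)
        return self.last

    def poll(self, now: Optional[float] = None) -> Optional[Sample]:
        """Called at heartbeat intervals; returns a synthetic sample if the
        source has not changed recently, otherwise None."""
        now = time.time() if now is None else now
        if self.last is not None and now - self.last[0] >= self.heartbeat_s:
            # The value did not change; confirm it with the current timestamp
            # so that secondary counters can be recalculated at least this often.
            self.last = (now, self.last[1])
            return self.last
        return None
```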

Each counter included in the calculation of secondary values that is updated from an external source arrives with a delay at the host that is performing the counter combination. As a consequence, the counter with the biggest delay determines the total update delay of the secondary counter. Subsection 4.1.7 introduces a tree structure that has been designed to deal with unsynchronized counter updates.

4.1.6 Layered Counter Collection Architecture

The overall aim of this thesis is to provide a structured approach for a performance analysis of I/O operations in HPC environments by employing all available data sources. The collection of all information that is available in the user space of the program can be done very accurately by traditional tracing facilities. Thus, the focus for the integration of all other accessible data sources can be on an architecture that supports the distributed collection, the preparation of these performance data, and the merging of these data with the program trace. In the following, some prerequisites and architectural needs of a distributed performance counter collection system are discussed. The discussion starts with the basic requirements, followed by demands that elevate the system architecture above the current state of the art as described in Section 3.4.

Basic Requirements

The system design is targeted to enable an in-depth performance analysis of I/O operations, from the user space I/O calls on the application side down to the metal or silicon. Currently, the data sources that are missing for a system wide I/O analysis are mostly outside the reach of normal users. Additional sources in this case include at minimum file servers, network switches, and RAID controllers. Some of these devices already support remote access to performance data; especially queries via the "Simple Network Management Protocol" (SNMP) are supported by quite a few. Others only provide access via proprietary interfaces. The primary goals of the system architecture are therefore to unify the different possibilities to access the data and to make them available on demand for performance tools. The system design is not necessarily limited to an analysis of I/O requests in applications. In fact, other data from external sources that are of interest for an inclusion in program traces (or other analysis possibilities) can be covered with a generic approach as well.

Support for Enhanced Performance Counters

As described in Section 4.1.1, the system design has to consider the new definition of performance counters. Problems in this area arise when counters have an update-on-change property that complicates the calculation of secondary values. In these cases it is difficult to distinguish between a counter source that is not changing its value and a counter source that does not deliver values due to an error. This is solvable by using additional communication, e.g. some kind of heartbeat messages that assure that the data source has delivered all data that it has produced. Other additional counter properties like time scope or different data types are not supposed to impose additional problems.

Accurate Calculation of Secondary Values

The calculation of secondary values as described in Section 4.1.5 is a necessity in this environment. There are two main reasons for this. The first is data reduction and the second is to provide counters with well defined semantics. Data reduction is needed, for example, in the case of an analysis of a large number of components that contribute to a single value, like the power consumption of a large HPC system that has multiple power meters attached. Another example is the calculation of the cumulative performance of all cores in a system or of all RAID controllers that support a specific file system. The division of the total performance of all CPUs or all storage devices by their power consumption provides a very useful metric to analyze the power efficiency (MFLOPS per Watt). This metric is used in "The GREEN 500" list that ranks the world's most energy-efficient supercomputers. The ranking is done by using the MFLOPS per Watt that a supercomputer delivers during the Linpack benchmark. Embedding values for this metric into a program trace of a Linpack benchmark run could help to identify regions of the code that consume power but do not contribute much to the final result. The semantics of a secondary counter can be derived from the semantics of the used native counters. In the example above, the MFLOPS per Watt counter combines a "since last" counter with either a "point" or a "since last" counter, depending on the type of power meter used. Thus, the result is either a "point" value or a "since last" value. As the secondary metrics are supposed to be included in a trace file during the ongoing tracing process, the data collection and all coupled calculations have to be doable in almost real time. Due to the expected large amount of counters, the realization of the design requires a parallel (e.g. multi-threaded) implementation.

Unified View onto Native and Secondary Counters

Each native and secondary performance counter belongs to a specific hardware domain and describes properties of this domain. In order to maintain a systematic view of all counters available for an HPC system, all counters have to be placed into a logical structure. All different domain types form a set of orthogonal properties that can be used to describe a particular subsystem. For instance, an HPC system with ten power lines and ten power meters is automatically divided into ten power domains. An automated mapping from the user perspective of "my power domain" to the global view of "power domain 0-9" needs to be done as transparently as possible. Counters for file systems provide a rich hierarchical structure. Each file system has inherently two basic categories that can be used to describe and analyze it. It can be used in parallel on different HPC machines and provide counters that describe the client or system specific performance. On the other hand, it utilizes (or is set up by using) a SAN infrastructure which could provide performance data that are related to one or a set of systems that currently use the file system. Putting this information into a structure results in a tree where all possible paths from the root to the leaves are valid counter names. An example for a counter tree for a SAN file system used on different machines is given in Figure 4.2.
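Such a counter tree can be represented, for example, as a nested mapping in which every path from the root to a leaf is a valid counter name. The sketch below models its names loosely after Figure 4.2; the concrete naming scheme is an assumption made for this example.

```python
from typing import Dict, List, Union

Tree = Dict[str, Union["Tree", None]]

# Leaves (None) are the actual counters; inner nodes only structure the name space.
counter_tree: Tree = {
    "fastfs": {
        "san": {
            "controller1": {
                "volume1": {"read_bandwidth": None, "write_bandwidth": None},
                "volume2": {"read_bandwidth": None, "write_bandwidth": None},
            },
            "fc_switch1": {"port1": {"in_bandwidth": None, "out_bandwidth": None}},
        },
        "clients": {
            "hpc1": {"read_bandwidth": None, "write_bandwidth": None},
            "hpc2": {"read_bandwidth": None, "write_bandwidth": None},
        },
    }
}

def counter_names(tree: Tree, prefix: str = "") -> List[str]:
    """Enumerate all root-to-leaf paths as counter names like
    '/fastfs/san/controller1/volume1/read_bandwidth'."""
    names: List[str] = []
    for key, child in tree.items():
        path = f"{prefix}/{key}"
        if child is None:
            names.append(path)
        else:
            names.extend(counter_names(child, path))
    return names

print(counter_names(counter_tree))
```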

Permanent Storage

It should be possible to store all collected values of native as well as of secondary counters permanently together with the original timestamps. These data can be a valuable resource for the analysis of an event that requires the answer to questions like “What has changed during the last hour?”. Another example is a post mortem analysis of the power consumption of a system. If all power meters write their data to permanent storage, it is possible to correlate these data with user jobs in order to analyze the power consumption on a per user basis.
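A minimal form of such a permanent store can be sketched as a single relational table that keeps the counter name, the original timestamp and the value; a question like "What has changed during the last hour?" then becomes a simple range query. The SQLite schema below is only an illustration of the idea, not the storage format used later in this thesis.

```python
import sqlite3
import time

def open_store(path: str = "counters.db") -> sqlite3.Connection:
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE IF NOT EXISTS samples (
                       counter   TEXT NOT NULL,
                       timestamp REAL NOT NULL,   -- original timestamp in seconds
                       value     REAL NOT NULL)""")
    con.execute("CREATE INDEX IF NOT EXISTS idx_ct ON samples(counter, timestamp)")
    return con

def store(con: sqlite3.Connection, counter: str, timestamp: float, value: float) -> None:
    con.execute("INSERT INTO samples VALUES (?, ?, ?)", (counter, timestamp, value))
    con.commit()

def last_hour(con: sqlite3.Connection, counter: str):
    """Retrieve all samples of one counter recorded during the last hour."""
    since = time.time() - 3600
    return con.execute("SELECT timestamp, value FROM samples "
                       "WHERE counter = ? AND timestamp >= ? ORDER BY timestamp",
                       (counter, since)).fetchall()
```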

Scalability Issues

DDN S2A RAID controllers deliver very detailed statistics about incoming SCSI request sizes. For each 16 KB window there exists a counter that is updated once per second with the number of requests that fit into the window. The HRSK complex at the Technische Universität Dresden has eight controller pairs and the maximum request size can be multiple MB. Thus, more than 16,000 individual counters have to be read and subsequently processed from this data source in order to get a manageable overview of the requests hitting a file system. To handle this amount of data, where multiple native counters are updated in parallel and their values are fed into the calculation of secondary values, an approach that utilizes parallel programming is needed.

Figure 4.2: Simplified performance counter tree for a file system

Interface to External Tools

Being able to unify traditional program traces and externally collected data is one of the basic requirements of the proposed system. There are additional possibilities beside the attachment of a trace facility. Remote access to historic counter values has proven valuable for tools like Ganglia. Applying this idea to program and I/O analysis might allow the analysis of artifacts in trace files by adding additional data to an existing trace file.

Security

Interesting data sources like RAID controllers and SAN switches are usually not accessible for normal users. Direct access to performance data on RAID controllers is done through interfaces or tools that allow much more than just reading these data, e.g. changing the system configuration. Thus, the process reading the data from the controller needs to be secured and should push data to the outside instead of being pulled for new data.

Resiliency

An I/O analysis very often requires data recorded at moments where no indication of a problem is present. As a consequence, the data collection process has to run permanently. This places some additional burdens in terms of resiliency on the distributed approach. Each component needs to be able to deliver (or cache) data even if other components have failed or have been restarted.

4.1.7 Secondary Counter Calculation Tree

A secondary counter is a combination of other counters that are available in the environment as described. Within this subsection a structured approach is presented to deal with unsynchronized counter updates from counters with different semantics. Each counter update consists of a value and a timestamp that has been assigned to the value at the time of the counter update. Thus, an update for a counter C can be described as a pair U_C = (t, v) with t, v ∈ ℝ. All counter updates for a time interval [t_s, t_e] form an update set S_C. Due to the digital nature of any computer hardware, t can also be expressed as the number of clock ticks of a clock at the specific system. For this timestamp the clock with the highest resolution should be used. The timestamp itself is distorted by an error that is inherent to any measurement. At this point the error is assumed to be small enough to not have an influence on the following considerations. This assumption is not valid if the error leads to a distortion of the timeline of the secondary counter. This would be the case if the counter values fluctuate strongly and the error is close to or larger than the update interval of the counter. For the case of the SAN infrastructure, most of the counters are updated in the range of seconds and the timestamps are synchronized by a local NTP server. Thus, the error associated with the timestamps is expected to be less than 1%.

Each counter is either a native counter or a secondary counter. A native counter is a counter whose values are read directly from a data source and where the timestamp that is associated with each value is assigned by the process that is communicating with the data source. A secondary counter is a combination of native or of other secondary counters. In general, the combination of counter updates can be described as an operation that maps multiple update sets into a new one. Figure 4.3 gives an example of an update of a secondary counter C_r from two counter sources. In this figure the secondary counter is always recalculated at any update of the source counters. If both counters provide "since-last" semantics, the values from V_2^1 and V_3^2 can be used directly to calculate the target value V_5^r. If C_1 has a "since-last" semantic, the update can only be calculated if V_3^1 is known. In the last case, where the values of C_1 have "point" semantics, an intermediate value between V_2^1 and V_3^1 has to be interpolated. If this train of thought is continued, the picture becomes very complex, as C_2 could also have a different time scope than "until next". In the general case, secondary counters with more than two sources have to be considered.

Figure 4.3: Example of a counter update with two counter sources that are updated at different and regular intervals

As counters from different data sources might be updated at different intervals or with an offset, the questions arise when and how the secondary counter has to be updated. Obvious answers to the first question are: either whenever one of the counters gets a new value assigned (source update) or at fixed intervals (target update). In some cases it might also be useful to update the target counter only if one distinct source counter is updated. This could, for example, be the one with the lowest update frequency. Answers to the second question depend on the counter properties (see Section 4.1.2) as well as on the properties of the secondary counter. For example, if a counter with the time scope "point" is combined with a counter with the time scope "next", the secondary counter will have the time scope "point". Within the next paragraphs a generic update scheme that works with an arbitrary number of counters with different properties will be constructed. At first, a generic update scheme requires obtaining the properties of the result counter. Parts of these semantic properties can be derived from the source counters. At least the update frequency and the update scheme (source update or target update) can also be determined independently of the source counters. Thus, the respective timestamp when the target counter needs to be updated depends on either all source counters, an external update timer, or a combination of both. As a consequence, it is reasonable to first define an interpolation function i that buffers incoming counter updates and provides intermediate values for the timestamps that are needed to calculate the secondary counter. Within this interpolation function it would also be possible to provide integrals and derivatives of the original values. Figure 4.4 depicts the update scheme as described in this paragraph.

Figure 4.4: Update scheme for two counters, the use of an external update timer is optional

Different possibilities to interpolate a value for a timestamp between two measurements also have to be taken into account. An interpolation should always be done with respect to the physical properties of the underlying data source. As an example, values from temperature sensors should be interpolated with functions that do not have jumps. If two counters are updated with different frequencies, it can also happen that multiple elements of S_1 have to be used to calculate one element of S_r. Furthermore, for counters with time scope "last", only the counter update after the desired interpolation point provides the necessary data to calculate the interpolated value. One of the design targets is to provide almost real-time analysis capabilities. Due to this requirement, it should be possible to calculate the secondary values as soon as one of the native counters has been updated. The update delay can be defined as the time difference that cannot be under-run between the update of a native counter and the update of the secondary counter. Using the analysis from the previous paragraph, the maximum update delay for a specific secondary counter is about the inverse of the update frequency of the native counter that has the smallest update frequency (if the time for the calculation itself is neglected). This "close to real-time" feature also limits the set of interpolation functions, as all groups of functions that require the whole update set to calculate an interpolated value cannot be used without additional treatment.
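The interpolation function i and the combination step could be sketched as follows: each source keeps a buffer of its updates and answers value requests for arbitrary timestamps by linear interpolation (returning nothing as long as the requested timestamp is not yet bracketed by two samples), while the combining function recalculates the secondary counter whenever one of the sources delivers a new value. Linear interpolation is only one possible choice here, as discussed above; all names in the sketch are illustrative.

```python
import bisect
from typing import Callable, List, Optional, Tuple

class InterpolatingBuffer:
    """Buffers (timestamp, value) updates, assumed to arrive in increasing
    timestamp order, and interpolates linearly between them."""

    def __init__(self) -> None:
        self.times: List[float] = []
        self.values: List[float] = []

    def add(self, t: float, v: float) -> None:
        self.times.append(t)
        self.values.append(v)

    def value_at(self, t: float) -> Optional[float]:
        i = bisect.bisect_left(self.times, t)
        if i < len(self.times) and self.times[i] == t:
            return self.values[i]            # exact match, no interpolation needed
        if i == 0 or i == len(self.times):
            return None                      # t is not yet bracketed by two samples
        t0, t1 = self.times[i - 1], self.times[i]
        v0, v1 = self.values[i - 1], self.values[i]
        return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def combine_on_update(buffers: List[InterpolatingBuffer],
                      op: Callable[[List[float]], float],
                      source: int, t: float, v: float) -> Optional[Tuple[float, float]]:
    """Feed a new update of one source; if all other sources can be interpolated
    at this timestamp, emit an update of the secondary counter."""
    buffers[source].add(t, v)
    values = []
    for i, buf in enumerate(buffers):
        x = v if i == source else buf.value_at(t)
        if x is None:
            return None  # the interpolation has to wait for a later sample
        values.append(x)
    return (t, op(values))
```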

As a basic definition, a general function f that maps n update sets S_1, ..., S_n into a new update set S_r can be defined as S_r = f_n(S_1, ..., S_n). At first, some properties of f_2 will be discussed and a generic approach is constructed from these. As discussed before, only some properties of a counter are relevant for a description of f_2, in particular "time scope", "value type", and "data type".

Figure 4.5: Update tree for the calculation of a GFLOP/s per Watt value from two nodes with individual power meters

Figure 4.6: UML sequence diagram for adding two values from two power meters

Scalar data types can be easily converted. Non-scalar data types need special treatment; it might be possible to break them down into one or more scalar values. For this reason, and to be able to map the different counter properties to a standardized set of properties, it is useful to define a couple of f_1 functions that are basically conversion functions for counter properties that differ from the properties of the secondary counter.

The steps to construct S_r (and thus f_2) for two arbitrary counters C_1 and C_2 are:

1. Derive the property set of C_r from the properties of C_1 and C_2.

2. Determine the set of conversion functions f_1 that have to be applied to S_1 and S_2.

3. Determine the interpolation functions for both input update sets.

Now f_3 can be constructed in a recursive fashion by using the output of one f_2 as input for the next f_2. A generalized f_n can be constructed by a tree structure of f_2 functions. For f_n, all steps from the list above have to be applied to the whole tree individually, one step after the other.

Figure 4.5 provides an example of a calculation tree for the calculation of a GFLOP/s per Watt value of a two node system where each node is connected to a power meter that measures the actual voltage and current in parallel. The number of floating point operations is collected per node and forwarded to the process calculating the secondary counter. The GFLOP/s values are collected by using traditional PAPI counters, thus they are monotonically increasing and provide the number of operations done between two adjacent counter updates. The values for arbitrary timestamps are calculated as the number of operations between two counter updates. This is the first interpolation function that is applied to the original GFLOP/s values. The values from both nodes are then added up. Starting with step 1 of the construction scheme as depicted before, the counter properties for the new counter have to be determined. As one of the sources provides point values, the results are point values as well. The complete set of properties of the secondary counter can be found in Table 4.1. There are no conversions of the source values needed, thus step 2 can be omitted. The most interesting question for step 3 is how to calculate the power consumption of the system for arbitrary timestamps. The values provided by the power meter are only valid for the moment of the measurement. As the voltage and the current are continuous physical quantities, an interpolation with splines [PW01] is a feasible approach. Cubic splines are continuously differentiable and are calculated by using all available data points and two additional values that describe two free parameters at the first and the last data point. The free parameters describe the gradient at these points. One of the common approaches for these parameters is to set them to zero and therefore assume that the values stay (almost) the same. To interpolate a value for a timestamp between the last two data points it is also valid to use only a subset of the last counter updates collected, as the influence of older data points on the interpolated value is much smaller than the influence of more recent counter updates. One of the not so obvious consequences of the fact that the secondary counter is updated at any update of one of the native counters is that the update frequency is not fixed. Figure 4.6 shows how an update is handled in such a case for the addition of the two counters. Four timestamps t_1, ..., t_4 are considered. The first counter is updated at the timestamps t_1 and t_3 and the other counter at t_2 and t_4. The figure shows what happens at t_2 to get an updated value from the f_2 process. When the second counter is updated, a notification is sent to f_2 which in turn requests a value for this timestamp from the interpolation function for the first counter. As this interpolation needs both counter updates that are adjacent to an interpolation point (t_1 and t_3), it will block until t_3. After the interpolation is done, a short time after t_3, the f_2 process releases an update for the secondary counter for the timestamp t_2 with the update delay t_3 − t_2 and with a value that is the sum of the original value of the second counter and the interpolated value of the first counter at t_2.

Table 4.1: Counter properties for the secondary counter GFLOP/s per Watt

Property Category        Value
Location                 system wide
Unit                     GFLOP/s per Watt
Data Type                IEEE754 double floating point
Value Type               absolute type
Time Scope               point value
Read Mode, Frequency     arbitrary frequency
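The recursive construction of f_n from binary combinations can be illustrated with a small tree of operator nodes that mirrors Figure 4.5: two multiply nodes turn voltage and current into per-node power, add nodes sum the node powers and the GFLOP/s values, and a divide node at the root yields GFLOP/s per Watt. The sketch only shows the structural idea; property derivation and interpolation are omitted, and the leaf names are illustrative.

```python
from typing import Callable, Dict, List, Union

Node = Union[str, "OpNode"]  # a leaf is referenced by its counter name

class OpNode:
    """Inner node of a calculation tree: combines the values of its children."""

    def __init__(self, op: Callable[[List[float]], float], children: List[Node]):
        self.op = op
        self.children = children

    def evaluate(self, samples: Dict[str, float]) -> float:
        vals = [c.evaluate(samples) if isinstance(c, OpNode) else samples[c]
                for c in self.children]
        return self.op(vals)

add = lambda v: sum(v)
mul = lambda v: v[0] * v[1]
div = lambda v: v[0] / v[1]

# Tree corresponding to Figure 4.5.
gflops_per_watt = OpNode(div, [
    OpNode(add, ["node1/GFLOPS", "node2/GFLOPS"]),
    OpNode(add, [OpNode(mul, ["node1/current", "node1/voltage"]),
                 OpNode(mul, ["node2/current", "node2/voltage"])]),
])

# One evaluation with already interpolated values for a common timestamp:
print(gflops_per_watt.evaluate({
    "node1/GFLOPS": 42.0, "node2/GFLOPS": 40.0,
    "node1/current": 10.0, "node1/voltage": 230.0,
    "node2/current": 9.5,  "node2/voltage": 230.0,
}))
```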

4.1.8 Timing Issues in Distributed Data Collection Environments

The main usage of the proposed architecture is to provide valuable data for an I/O analysis of parallel programs. There are basically two possibilities to merge trace data and external data into a single data source that is usable for post mortem tools like trace file viewers. The first possibility is to store the external data beside the trace file and to perform the merge step post mortem [Mic07]. This possibility requires reading and rewriting the whole trace file after the program has finished. Typically, trace file formats like OTF [AKRBHB+06] require all timestamps to be ordered within the trace file. As trace files today reach sizes of multiple TB, it is preferable to employ the second option, merging all data during the trace run. The information within trace files provides the most insight if it is correct. Combining events generated on multiple CPUs or nodes commonly involves an adjustment of the timers used on the systems in order to get a correct global view of an application [DKMN08]. This has been proven to be doable with some heuristics that use properties of the trace file to maintain a logical event order or by using additional communication at distinguished times during the program run. Adding trace data collected on hosts that are external to the HPC system complicates this situation, as those hosts cannot participate in this (internal) synchronization scheme. In addition, the data that has been read from the counter source arrives with a delay at the host running the program. This delay consists of three parts. The first one is the inherent latency of the counter source, the second the transport latency of the network, and the last one the processing time within the data collection framework (that for example performs the calculation of secondary values). The first kind of latency is due to the fact that at the moment when the data has been read from the data source, it represents a value from the past. This delay can potentially be compensated by a calibration of the daemon reading the data. The second kind of latency cannot be avoided and the last one can only be minimized by careful programming. The only timer source that is available by default and can provide the necessary references to align two timelines from two different hosts located within different systems is the UNIX time read by the function gettimeofday(). As a baseline, the NTP protocol offers a way to ensure that the internal clocks used for gettimeofday() are synchronized within the microsecond range at the moment of reading. NTP varies the clock speed to keep the local clocks synchronized. If the two NTP daemons update the clock frequencies differently on the two hosts, the resulting trace file will have nonlinear anomalies. To have the timestamps associated with the external data adjusted as exactly as possible to the global time used in the trace file, the external process collecting the data has to participate in a global time synchronization scheme. As it is preferable to integrate counter values from remote sources into the final trace file during the actual trace run, a method how to do this exactly has to be developed. Three possible ways have been evaluated in [Ils09]. The methods are called "push", "pull" and "post mortem". In the "push" method, all data is sent from the data source to the tracing facility immediately (see Figure 4.7). Within the trace facility an additional thread receives the values.
It discards the original timestamp assigned to each value and places the data into the trace file with the actual timestamp from the trace facility. The advantage of this approach is that, under the assumption of a rather stable network latency, the data is placed with an almost constant temporal displacement into the trace file. If the network latency is smaller than the inherent latency and the update frequency of the data source, the temporal displacement in the resulting trace file is negligible. In addition, there is no need for further time synchronization. The drawback is the fact that this method needs an additional thread to receive the external values from the network, which can interrupt the traced program at any time. Thus, it is very likely that it changes the performance characteristics of the traced program.

Figure 4.7: UML sequence diagram for "push" integration

The "pull" method avoids the usage of an additional thread (see Figure 4.8). The actual value is sent from the data source to the trace facility and cached either within the trace facility or within the network stream. It is only written to the trace file if the application emits another event that needs to be saved in the event stream. Thus, whenever the application generates an event, the trace facility checks whether it has to write a record for external data beside the data for the application event. This method is a compromise which avoids the usage of an external thread while a live integration of the external values is still possible. As no additional thread is present, values that are pushed to the tracing facility can only be used if the application transfers the control to the trace facility. This imposes problems if a counter is updated at smaller intervals than the application is emitting events. In order to ensure ordered timestamps within the trace file, the original timestamps have to be discarded in this method as well. In this case, valuable data can get lost, as for each application event only one value can be placed into the trace file.

Figure 4.8: UML sequence diagram for "pull" integration

The last method relies on the caching of all relevant data on an additional host and an integration of all values at the end of the program run. To still have the timestamps ordered, all values are placed in an artificial process that is created within the event stream as the program terminates. At the start of the trace run, the tracing facility informs the external host which counters have to be recorded. The external host can then collect all values locally until the trace facility signals the end of the trace run. Figure 4.9 shows a UML diagram for the "post mortem" method.

Figure 4.9: UML sequence diagram for "post mortem" integration
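The "pull" variant, for instance, can be sketched as a small cache inside the tracing facility: incoming external samples merely overwrite the cached value, and the cache is flushed into the event stream the next time the application itself triggers an event. The two write callbacks are stand-ins for whatever the actual trace facility provides; this is not VampirTrace code.

```python
from typing import Callable, Dict

class PullIntegration:
    """Caches the latest external counter values and writes them only when the
    traced application emits an event of its own."""

    def __init__(self,
                 write_counter: Callable[[float, str, float], None],
                 write_event: Callable[[float, str], None]) -> None:
        self.write_counter = write_counter   # callbacks into the trace facility
        self.write_event = write_event       # (stand-ins, not a real API)
        self.cache: Dict[str, float] = {}

    def on_external_value(self, counter: str, value: float) -> None:
        # The original timestamp is discarded, as described in the text; only
        # the most recent value per counter survives until the next event.
        self.cache[counter] = value

    def on_application_event(self, timestamp: float, event: str) -> None:
        for counter, value in self.cache.items():
            self.write_counter(timestamp, counter, value)
        self.cache.clear()
        self.write_event(timestamp, event)
```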

4.1.9 Timestamp Calibration for Data Sources

Timestamps for traditional application traces are generated on the fly as the application runs on the compute nodes. These timestamps can have a resolution that is equal or close to the clock frequency of the processor on the compute node. With the assumption that all hosts involved use NTP, their clocks are synchronized within the range of a few microseconds. In the case of the PC farm at the TU Dresden it has been shown [DKMN08] that a local clock on a compute node drifts up to 15 ms within about 10 minutes. As the host running the trace facility has to map the timestamps that are attached to the external data to its own timestamps, possible sources for the mapping error are the maximum time drift e_td, the remaining difference of the NTP synchronization e_NTP, and the error during the generation of the external timestamps within the data collection daemon e_gen. If the measurement is rather short, e_td can be neglected for the examination of the embedding error. The "generation error" e_gen for the timestamp itself includes the communication delay that is introduced by the fact that the performance data are sometimes collected on an external host and not on the device itself. For a correct embedding of the data into the application trace file, the issue is how to determine a correct timestamp t_v (and to minimize e_gen) within the daemon that is reading the performance data from the infrastructure component. This would not be needed if the component itself assigned a timestamp to these values. But first of all, none of the used components provides this feature, and second, this component would need to participate in the global timer synchronization scheme. If a measurement cycle is started at a time t_s and takes the time d_m, the value(s) read are associated with a timestamp t_v within the interval t_s < t_v < (t_s + d_m). What needs to be done is determining the correct offset d_off = t_v − t_s. The calculation of this offset can be done by exploiting a known behavior of the analyzed system and deriving the calculation formula from properties of this behavior. The remaining error of this calculation is e_gen. In those cases where the error from the NTP synchronization can be neglected, as d_m >> e_NTP, the daemon should be calibrated and the timer offset d_off with 0 < d_off < d_m determined. In the case d_m << e_NTP this is not necessary, as any positive impact of adjustments to d_off is undone by the remaining error of the NTP synchronization.

The outcome of this calibration process is a value d_off that needs to be added to each of the original timestamps that are assigned to the values on the host reading the performance data. This correction can either be done on this host or on the central host of the counter collection infrastructure.
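One way to determine d_off is to provoke a known load pattern on the instrumented component and to choose the offset that best aligns the daemon's samples with the locally recorded start times of the load bursts. The sketch below shows this idea in its simplest form; the reaction criterion and the grid search are assumptions made for the illustration, not the calibration procedure used in this thesis.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp assigned by the daemon, value)

def estimate_offset(burst_starts: List[float], samples: List[Sample],
                    cycle_duration: float, steps: int = 100) -> float:
    """Try offsets in [0, cycle_duration] and return the one that minimizes the
    gap between each known burst start and the first corrected sample that
    reacts to it (here simply: the first sample with a non-zero value)."""
    reactions = [t for t, v in samples if v > 0.0]   # hypothetical reaction criterion
    best_offset, best_error = 0.0, float("inf")
    for k in range(steps + 1):
        d_off = cycle_duration * k / steps
        error = 0.0
        for start in burst_starts:
            gaps = [t + d_off - start for t in reactions if t + d_off >= start]
            error += min(gaps) if gaps else cycle_duration
        if error < best_error:
            best_offset, best_error = d_off, error
    return best_offset
```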

4.1.10 Layout of the System Design

The outcome of the sections before is a set of requirements that have been integrated in a framework. The development of the framework was driven by an identification of the participating systems and the demands each system places on the framework. The following paragraphs give an overview of the components that are involved in a minimal scenario as well as their individual role in the scenario and some restrictions that had to be considered during the development of the framework. All relevant components of the infrastructure have to be instrumented to deliver the requested data in a standardized way. The system has been designed for production systems. A file system analysis within this infrastructure involves vital components where problems induced by the performance analysis process are not tolerable. This infrastructure is typically hidden from normal users and therefore cannot 4.2. I/O TRACING, ANALYSIS AND REPLAY ON PARALLEL FILE SYSTEMS 73

Figure 4.10: Basic components of the Dataheap framework

Thus, any generated data has to be pushed from within this infrastructure to a less protected environment that is then responsible for any subsequent data processing. In addition, the instrumentation shall not impact the system performance; the correct balance between the impact on the monitored system and the usefulness of the gathered data has to be maintained.

The compute nodes in the HPC system that run the instrumented program and the trace facility have to be able to access the external counters on demand. It should be possible for the user to list the available counters and to get a current status that describes whether the data source for a specific counter is currently active and whether or not the permanent storage (if any) is accessible. For secondary counters, the combined status of all native counters that contribute to this counter value has to be used to describe this property. As discussed in the preceding subsections, the tracing facility has to be adapted to support the embedding of external sources.

A storage component has to provide permanent storage for counter data and on-demand access to historic data. The permanent storage allows the use of different types of storage formats and also the reuse of existing repositories. In general, this feature is needed for an efficient use of post mortem tools to analyze counter timelines that for some reason are not embedded into a trace file. For the additional cost of rewriting the trace file, it allows traces to be modified on demand by adding or removing counters and counter timelines to be used in external tools.

At this point the framework and its basic components can be described. The name of the framework, as it will later be referred to, is “Dataheap”. Besides the collection of the native counter values from all available sources, the framework also has the following main tasks to perform:

• management of all counters including their current status
• providing a single counter name space among different data sources
• on-demand collection of counter values
• calculation of secondary values
• connection management between data sources, storage containers and the analysis system

All major components that take part in this scenario are depicted in Figure 4.10.
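To make the status handling for secondary counters concrete, one plausible combination rule (an assumption, since the exact rule is not prescribed here) is that a secondary counter is reported as active only if every contributing native counter is active, and its storage as accessible only if all underlying stores are accessible:

    def secondary_counter_status(native_statuses):
        # native_statuses: one dict per contributing native counter,
        # e.g. {"active": True, "storage_ok": True}
        return {
            "active": all(s["active"] for s in native_statuses),
            "storage_ok": all(s["storage_ok"] for s in native_statuses),
        }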

4.2 I/O Tracing, Analysis and Replay on Parallel File Systems

One of the main targets of this thesis is to improve the tool chain used to analyze data input and output for parallel programs. Almost all I/O requests that an application performs during its runtime can be collected with a trace facility like VampirTrace.


Figure 4.11: Workflow diagram for the analysis of parallel file systems. Activities are colored brown, the original source code and all other input as well as all intermediate and the final results are marked yellow.

Analyzing the trace file should reveal all necessary information, like the number of files used, access patterns and the degree of I/O parallelism. As today's machine architectures pressure programs to use more and more cores in parallel, the I/O part needs to be done in parallel as well. Only by issuing parallel I/O calls can users at least hope that their applications scale to large numbers of cores without I/O becoming a bottleneck. Parallel I/O, on the other hand, complicates the analysis process as dependencies between the different I/O activities have to be considered. For users it is also hard to tell the program analyst what type of I/O request they actually use. They are typically encouraged to use one of the high level I/O interfaces like MPI I/O, HDF5 or parallel NetCDF that in turn issues the I/O requests to the file system. Within Linux, which is currently the dominating operating system on the machines within the Top500 list of the largest supercomputing systems in the world, almost all high level I/O libraries transform application requests into POSIX I/O calls and rely on a POSIX compatible file system. Catching I/O activities at this level, tracing the usage of the I/O libraries and their internals as well, and combining this with external data from the file systems' infrastructure offers enough insight into application I/O to optimize the system at the right places.

In order to analyze the I/O characteristics of a parallel program, it is feasible to extract file access patterns from application traces. These patterns can then be used to optimize either the application or the high level I/O interface, or to replay these patterns on different HPC and/or file systems to find a file system that responds faster to a specific pattern (see Figure 4.11). Besides pure application specific events, node local and global performance data have to be recorded as well. Afterwards the execution structure of the parallel program has to be analyzed and the amount of information has to be reduced to represent only the I/O relevant part. The final step is to run a pattern matching algorithm on the I/O structure of a program to reveal repetitions. The next subsections cover this approach in detail as well as an I/O benchmark that is able to replay arbitrary I/O patterns.

4.2.1 Extended I/O Tracing

Tracing facilities (see Section 3.5.2) have been designed to efficiently store events from parallel programs like function entries and exits, parallel regions or other constructs of programming paradigms in trace files. Events can have properties attached that describe that event in more detail. For example, the type of I/O done by different functions can be encoded in addition to the function name.

One of the common points to be discussed is the question whether a full trace of the application is needed to perform this kind of analysis or if profiling or phase based profiling is adequate to get the desired level of information. Typically, pure profiling provides summary information and is helpful in cases where the application is only performing one task. Only in these cases can performance properties be assigned to the different regions of the source code or to the file system. If applications loop through phases like reading input data, performing computations, or writing checkpoints, it is no longer possible to assign an average value that represents a single measurement for a whole program run. For one specific task it is always possible to generate a profile that contains the information needed, for example, to determine the typical file size that an application is using. As a result of the overall approach and the research areas that this thesis covers, it is hard to say in advance how far an analysis for an individual application might evolve. It is possible that the user space analysis reveals bad I/O patterns and changes to these patterns solve the performance problems a user has. It is also possible that the initial trace needs to be extended with external data or that the I/O requests in this trace need to be broken down individually. It is very likely that the temporal order of the I/O requests makes a difference and needs to be taken into account as well. Thus, only a full trace supports the whole approach and has the potential that the data can be analyzed again from different points of view. Each profile only contains data that has been condensed to support a special purpose, but from a full trace it is possible to (re-)create any profile required.

The final aim of I/O tracing is to understand I/O patterns that occur in parallel applications and subsequently to modify these patterns and to tune parallel file systems to provide faster response times for individual patterns. This type of analysis requires detailed knowledge about how different processes access files as well as information about which parts of a file are written just once and which parts are reread at a later time during a program run. This information makes it possible to determine the type of optimization that is feasible for a specific application. Knowledge about all soft- and hardware layers included in the I/O stack enables performance analysts to find the right layer to implement the optimization strategy. For example, for programs that use high level I/O libraries like HDF5 it is valuable to implement optimization features into the library itself, but it might also be possible to adapt the application to access HDF5 more efficiently. Functions like write(), writev() or fprintf() are all of the type “write” or rather “output”. The additional properties needed to perform the desired I/O analysis within this thesis (beside the semantics of I/O calls) are function arguments.
Function arguments contain all information needed for a comprehensive I/O analysis, like file names, access type, file offsets and access sizes.

4.2.2 Intermediate Data Representation

Tracing parallel programs can very quickly result in huge trace files. As computing systems are still getting faster, more events have to be recorded for the same time interval. The current trend towards more parallelism leads to more event streams that are generated in parallel. Thus, there is an increasing event density as well as a growth in the number of sources that generate event streams. As an initial assumption, a single event consists at least of an event identifier (like function entry or exit, or a new counter value), a process number (the processor on which the event occurred) and the associated timestamp. Assuming further that 64 bits are used for each field, one event needs 24 bytes of memory or disk space.
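The 24-byte figure and the resulting data rates can be made concrete with a small sketch; the three 64-bit fields are exactly the assumption stated above:

    import struct

    # one trace event: 64-bit event identifier, 64-bit process number, 64-bit timestamp
    EVENT_FORMAT = "=qqq"
    assert struct.calcsize(EVENT_FORMAT) == 24      # bytes per event

    def trace_rate_gb_per_s(events_per_process_per_s, processes):
        # aggregate trace data rate for a given event rate and process count
        return events_per_process_per_s * processes * struct.calcsize(EVENT_FORMAT) / 1e9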

Small I/O and MPI requests are usually executed within less than 100 microseconds, which results in at least 10,000 events that each processor can emit in one second. 100,000 processors can therefore potentially generate about 24 GB of trace data per second. Besides the trace facility, this requires a well performing file system and an efficient analysis of the recorded data. As stated in Section 4.2.1, program tracing can be used to collect all information needed for an I/O analysis. Automatically analyzing the I/O requests stored in the trace file requires the extraction of all I/O information as well as all synchronization points from an application trace. The reason to look at all function calls that synchronize two or more processes is to maintain the given order of I/O sequences between the different tasks in the parallel program.

As the statements in the following sections require a formal view onto I/O events, synchronization events, and their properties, a couple of definitions need to be made first. In order to analyze the I/O activities among parallel processes, all relevant function calls as well as all synchronization calls are extracted from the application trace and need to be placed into a context that describes the chronological relations among I/O calls within the different tasks. For the approach presented in this thesis all events are placed in a Directed Acyclic Graph (DAG). Graphs have been proven to represent events from parallel program runs efficiently [Kra00, Knü09].

As explained in more detail in Section 2.3, user access to parallel file systems is mostly done in phases and through high level libraries. Parallel programs typically load one or more initial data files, write a result at the end and write checkpoints during the program runtime. If programs and file systems are well tuned and have been designed together with the rest of the system and are therefore balanced, the amount of I/O time used for writing checkpoints within an HPC application should be below the average time that is lost per job if the system has a failure. A checkpoint contains the current internal state of the program and can be used to restart the calculation after a machine failure or a job restart through the batch system. During its runtime an application is typically moving through different phases where different regions of source code are executed. Especially during a computationally intensive phase there is likely less I/O or communication. Phases that dump checkpoints are most probably dominated by I/O activities and some synchronization events. This is a strong motivation for an approach that distinguishes these phases automatically and reduces the amount of information to the amount that is absolutely necessary to understand the I/O requests of a parallel application.

As stated before, two basic types of events are relevant for the graph construction: synchronization events and I/O events. I/O events always happen on a single process while synchronization events have multiple participants. This holds true for collective I/O calls (for example from the MPI 2 standard) as well, as they can be split up into a per-process I/O event and an additional synchronization operation of the associated processes. The type of graph used in the following partially relies on the happened before relation defined in [Lam78].

Definition 1. Let $E = \{e_p^i : i \in \mathbb{N}, p \in P\}$ be a set of events with a sequence number $i \in \mathbb{N}$ on processes $p \in P$. The happened before relation $\to$ is the smallest transitive irreflexive relation that satisfies the following two conditions:

1. If events $e_p^i$ and $e_p^j$ happen on the same process $p$ and $e_p^i$ occurs before $e_p^j$, then $e_p^i \to e_p^j$.
2. If event $e_p^i$ sends a message from process $p$ which is received by event $e_q^j$ on process $q$, then $e_p^i \to e_q^j$.

The set of vertices in the DAG is denoted $V$ and the set of all edges $K = \{(u, v) : u, v \in V\}$. An edge in the graph from a vertex $u$ to a vertex $v$ indicates that $u$ happened directly before $v$. The happened before relation describes per definition all dependencies of all events among each other. In order to store this information, a table of size $V \times V$ is needed. In our case this information is represented as a DAG such that the transitive hull of the DAG gives the happened before relation. All relevant events belong either to the set of all I/O events $E_{io} = \{t_p^i\}$ or to the set of all synchronization events $E_{sync} = \{s_p^i\}$. The next definition aggregates all I/O calls performed by a parallel application between two synchronization points into one I/Oset.

(a) pseudo MPI code:

    MPI_Barrier(MPI_COMM_WORLD);
    if( rank==0 ) {
        open("file");
        write("content");
        close();
        MPI_Send(to 1);
    } else {
        MPI_Recv(from 0);
        open("file");
        read("content");
        close();
    }
    MPI_Barrier(MPI_COMM_WORLD);

(b) example event stream:

    event 1,  rank 1: MPI_Barrier(all)
    event 2,  rank 0: MPI_Barrier(all)
    event 3,  rank 1: MPI_Recv(from 0)
    event 4,  rank 0: open("file")
    event 5,  rank 0: write("content")
    event 6,  rank 0: close()
    event 7,  rank 0: MPI_Send(to 1)
    event 8,  rank 0: MPI_Barrier(all)
    event 9,  rank 1: open("file")
    event 10, rank 1: read("content")
    event 11, rank 1: close()
    event 12, rank 1: MPI_Barrier(all)

Figure 4.12: Example of a pseudo MPI code (a) and example event stream (b) for two processes

The reason for defining such a set is to be able to handle all consecutive I/O calls in a process as a compound event in the following, thus reducing the size of the graph. The I/Oset is used as a construct to refer to multiple I/O events by using a single reference. It does not replace the original events. I/O events generated by a single process are already ordered. The kind of information that is needed is the ordering of I/O events among processes. An ordering can be imposed by any synchronization event. If two processes execute I/O events in parallel, for example between two barriers, no ordering among the events in the different streams can be established anyway.

Definition 2. Let $F = \{t_p^a, \ldots, t_p^b : a, b \in I, p \in P\}$ be a set of consecutive I/O events with a range of increasing sequence numbers between $a$ and $b$. For any pair of events $t_p^i$ and $t_p^j$ from $F$ with $i < j$ it follows that $t_p^i \to t_p^j$. $F$ is treated as a single compound event (I/Oset) if $\nexists\, s_p^x \in E_{sync}$ with $t_p^i \to s_p^x \to t_p^j$.

At this point it is possible to define one stream per process that contains ordered I/Osets and synchronization events. The type of DAG used starts with an artificial ROOT vertex that serves as a common starting point for all processes. The graph is finalized by a common LAST vertex where all processes end. The last vertex that has been created while the events from the trace file were processed is connected to this artificial vertex. The ROOT vertex and the LAST vertex provide a global synchronization point at the start and the end of the program. Both are treated as normal synchronization vertices in the following.

Each I/Oset and each synchronization event from each process is assigned to one vertex. Edges in the DAG represent the order of events for each process. Instead of assigning each synchronization event exactly one vertex, all processes that participate in the same synchronization event, for example a collective call to MPI_Barrier, use only one common vertex for it. After introducing I/Osets, this is the second aggregation method. Therefore, vertices that are related to an I/Oset have exactly one incoming and one outgoing edge, while a vertex related to a synchronization event has as many incoming and outgoing edges as the number of processes participating in the synchronizing event. This requires a relaxed definition of the happened before relation because a synchronization vertex does now represent multiple events. Within this vertex the information which process sent the message and which received it (as required by Lamport's definition) is irrelevant and lost. In addition, if multiple processes move forward together from vertex $u$ to vertex $v$, only one edge is created and marked as used by those processes, resulting in a set of distinct process numbers assigned to each edge. As a result, no parallel edges exist in the graph. For further reference two more definitions are needed that describe the set of vertices that are reachable from a given vertex within the DAG.
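A compact sketch of this construction (compare Figure 4.13) is given below. It is a simplified illustration rather than the actual implementation; the event representation and the way synchronization events are identified (a shared identifier per barrier or message match) are assumptions made for the example.

    from collections import defaultdict

    def build_dag(events, processes):
        # events: chronologically merged event stream, e.g.
        #   {"type": "io",   "proc": 0, "call": 'open("file")'}
        #   {"type": "sync", "proc": 0, "id": "barrier-1"}
        # The sync "id" is assumed to be identical for all participants
        # of one synchronization event.
        edges = defaultdict(set)                  # (u, v) -> set of processes using this edge
        current = {p: "ROOT" for p in processes}  # last vertex per process
        open_sync = {}                            # sync id -> vertex name
        io_sets = {}                              # I/Oset vertex -> list of contained I/O calls
        counter = 0

        def new_vertex(kind):
            nonlocal counter
            counter += 1
            return "%s-%d" % (kind, counter)

        def advance(p, v):
            edges[(current[p], v)].add(p)         # one edge, labeled with process numbers
            current[p] = v

        for ev in events:
            p = ev["proc"]
            if ev["type"] == "io":
                if current[p] not in io_sets:     # open a new I/Oset for this process
                    v = new_vertex("ioset")
                    io_sets[v] = []
                    advance(p, v)
                io_sets[current[p]].append(ev["call"])
            else:                                 # synchronization event: shared vertex
                v = open_sync.setdefault(ev["id"], new_vertex("sync"))
                advance(p, v)

        for p in processes:                       # common final vertex
            advance(p, "LAST")
        return edges, io_sets

Under these assumptions, applied to the event stream of Figure 4.12 the sketch yields two compound I/O vertices and three shared synchronization vertices between ROOT and LAST, matching the walkthrough in Figure 4.14.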


Figure 4.13: Outline of the algorithm constructing the DAG


Figure 4.14: Demonstration of the DAG construction with the event set from Figure 4.12

Definition 3. Let $v$ be a vertex in the DAG. All vertices that are reachable from $v$ down to LAST form the successors of $v$.

Definition 4. Let $v$ be a vertex in a DAG. All vertices that are reachable via a reverse search from $v$ up to ROOT form the predecessors of $v$.

A short illustrating example for an MPI code written in C is given in Figure 4.12(a). A possible set of events resulting from a program run with two processes (see Figure 4.12(b)) is used in Figure 4.14 to depict the construction of the graph. An outline of the algorithm is given in Figure 4.13. Starting with a root vertex for all processes (a), a new sync vertex is added as rank 1 calls MPI_Barrier (b). As rank 0 calls MPI_Barrier, the sync vertex from step (b) is found to have the correct barrier number; thus, rank 0 joins the edge from ROOT to this vertex (c). Now rank 1 calls the receive from rank 0 and thus opens a new sync vertex (d). All I/O activities from event 4 to 6 are put into one compound I/O vertex (e); afterwards rank 0 sends a message to rank 1 (f), joining the sync vertex opened in step (d). Within (g) a new sync vertex is opened for the MPI_Barrier call from event 8. All I/O from rank 1 is then aggregated into a new compound I/O vertex (h), followed by joining the previously created sync vertex (i). The graph is then finalized by adding a single final vertex that is common for all processes (j).

Non-blocking synchronization events like calls to MPI_Isend and the affiliated calls to MPI_Test and MPI_Wait have to be considered in this structure as well. These events usually have a start event that initiates a data transfer or a synchronization operation and a wait event where the completion of the initial event is tested. If this test fails, the application/library can either wait within this event until the initial event completes, or it can continue and repeat the test at a later point in time. As only the final (and successful) event denotes a true synchronizing event, the initial and all failed completion events can be ignored. This way, the graph is constructed as if there had been a blocking synchronization call at the time of the final completion test. This is possible due to the fact that the ignored events do not introduce any ordering among the I/O calls that happen on the processes that use non-blocking synchronization.

Each vertex in the graph is either an I/Oset or a synchronization vertex. Thus, it is possible to define two disjoint sets $V_{io} = \{v : v \in V, type(v) = \text{I/Oset}\}$ and $V_{sync} = \{v : v \in V, type(v) = \text{sync}\}$ where each contains only a single type of vertex and $V = V_{io} \cup V_{sync}$. All events from $E$ are used to construct the graph; each event from $E$ is used to construct only one element of $V$. The graph holds all I/Osets and synchronization events from the original trace file. All I/O dependencies within the graph can be described as a subset of the transitive closure $C$ of this graph. The set of all edges in the graph is a binary relation over all vertices $V$. Due to the construction algorithm and the introduction of the I/Oset, this set of edges describes only relations between either two synchronization vertices or between one synchronization vertex and one I/Oset. The transitive closure of this relation also contains the relations between I/Osets. The subset $C_{io}$ of the transitive closure $C$ that contains only the relations between two I/Osets describes all dependencies between I/O calls in the application. It is also possible to state that:

Lemma 1. An I/Oset depends on all I/Osets that are predecessors of the respective I/Oset.

4.2.3 Elimination of Redundant Synchronization Points from Event Traces

The current structure of the graph has exactly one vertex for all synchronization events that belong together. Synchronization vertices are part of the graph to maintain the dependencies between different I/Osets. For an I/O analysis, only those synchronization vertices are needed that enforce these dependencies. It can be shown that some of the synchronization vertices are redundant. One example is a set of consecutive parallel barriers over all processes without any I/O requests between these barriers. It is obvious that these vertices can be merged into a single one. The first problem that arises when the graph is altered by removing vertices is to decide which vertices have to remain in the graph and which can be removed. While the transitive closure $C$ changes as the graph is being altered, $C_{io}$ has to be maintained such that no dependency is lost. The proposed method of altering the graph and reducing the number of vertices is to merge two vertices, or rather to replace two vertices with a new one. All incoming edges of both vertices become incoming edges of the new vertex, and all outgoing edges are treated the same way. If both vertices use identical edges to the same predecessor or successor, only one edge is created and marked with the unified set of process numbers from both vertices.
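This merge operation can be sketched compactly, reusing the edge representation from the construction sketch above (a mapping from vertex pairs to the set of process numbers using the edge); again a simplified illustration:

    def merge_vertices(edges, u, v, m):
        # Replace the two synchronization vertices u and v by a new vertex m.
        # edges: dict mapping (source, target) -> set of process numbers
        merged = {}
        for (a, b), procs in edges.items():
            a2 = m if a in (u, v) else a
            b2 = m if b in (u, v) else b
            if a2 == m and b2 == m:
                continue                                      # drop the edge(s) between u and v
            merged.setdefault((a2, b2), set()).update(procs)  # unify identical edges
        return merged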

The question whether $C_{io}$ has to remain exactly the same or whether it is reasonable to allow dependencies to be added to the graph while it is being optimized still has to be answered. Adding dependencies between two I/Osets that have no ordering in the original graph is only possible for two I/Osets from different processes, as all I/Osets within a single process are already ordered. Additional dependencies can therefore not lead to an invalid execution order of the I/Osets, but they can enforce one specific execution order. It has been shown that allowing additional entries in $C_{io}$ greatly increases the number of vertices that can be merged. The first category of synchronization vertices that can be merged are those synchronization vertices that are directly connected through an edge where all processes use this single edge (see Figure 4.15).


Figure 4.15: Example of merging synchronization vertices with one incoming edge

This is equal to the statement that the target vertex of this edge has exactly one incoming edge. Let $u$ be the source vertex of this edge, $v$ be the target vertex, and $m$ be the merged vertex. Merging $u$ and $v$ does not violate the integrity constraint, as no ordering of vertices is touched: all predecessors of $u$ become predecessors of $m$ and all successors of $v$ become successors of $m$. Finding all possible candidates for this step can be done in $O(|K|)$ steps. The step can either be repeated after all candidates have been merged, or longer sequences of directly connected synchronization vertices can be merged at once. The latter requires somewhat more operations than the original step, but still has a time complexity of $O(|K|)$.

Another category of directly connected synchronization vertices $u$ and $v$ (where $u$ is the predecessor of $v$) that can be merged are those vertices where only one path from $u$ to $v$ exists (see Figure 4.16). Let $u$ be the source vertex of this edge, $v$ be the target vertex, and $m$ be the merged vertex. Merging $u$ and $v$ does not violate the integrity constraint, as again all predecessors of $u$ and $v$ become predecessors of $m$ and all successors of $u$ and $v$ become successors of $m$. Any additional path beside the direct path from $u$ to $v$ would, on the other hand, create a cycle in the graph, as an outgoing edge of $m$ would necessarily lead back to an incoming edge of $m$. Finding all candidates without checking for the presence of additional paths can again be done in $O(|K|)$ time. To test for additional paths between a pair $(u, v)$, it is possible to remove the direct connection between $u$ and $v$ temporarily. Then a list of successors of $u$ can be generated and tested for the existence of $v$. If $v$ is not in this list, the direct connection from $u$ to $v$ is obviously the only path between the vertices. Creating a list of successors can be done by either a breadth-first search or a depth-first search, which both have a complexity of $O(|V| + |K|)$. As this needs to be done in the worst case at most $|K|$ times (if almost all vertices in the graph are synchronization vertices), the time complexity is $O(|K| \cdot (|V| + |K|))$. Another approach to solve this problem would be to calculate an adjacency matrix, which would require $O(|V|^2)$ steps. This matrix has to be kept up to date as the graph is changed by the algorithm.
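The single-path test described above maps directly to a small sketch: temporarily ignore the edge $(u, v)$, compute the successors of $u$ with a breadth-first search, and merge only if $v$ is not reachable. The edge representation is the same assumed dictionary as in the previous sketches:

    from collections import deque

    def only_direct_path(edges, u, v):
        # True if the direct edge (u, v) is the only path from u to v.
        adjacency = {}
        for (a, b) in edges:
            if (a, b) == (u, v):
                continue                     # temporarily remove the direct connection
            adjacency.setdefault(a, []).append(b)

        seen, queue = {u}, deque([u])        # breadth-first search for the successors of u
        while queue:
            node = queue.popleft()
            for nxt in adjacency.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return v not in seen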

Figure 4.16 also contains an example where an additional dependency between two I/Osets is introduced by the merging operation. Before the merge of the two synchronization vertices, the blue and the red I/O vertex could potentially be executed in any order, so any I/O event in the red vertex could have happened before, together with, or after the I/O events in the blue vertex. After the graph optimization, exactly one of the many possible execution orders will be used.


Figure 4.16: Example of merging synchronization vertices with different source and target vertices

The type of vertices merged in Figure 4.15 is actually a subset of the vertices merged in Figure 4.16. As the first step has a much better time complexity and as it reduces the number of vertices for the second step, it is worth treating this subset of vertices differently. The next logical step is to merge this approach into the generation step of the DAG. This provides the advantage that the part of the graph that has to be visited for the search for multiple paths from the source to the target vertex is expected to be much smaller. At this point a basis is needed for deciding when a specific edge that has a synchronization vertex as source as well as target can be checked for removal during the construction of the DAG. For this approach a short excursus into the actual algorithm that constructs the graph from the event stream is needed. If $N$ processes participate in a synchronization event (which is represented as one vertex in the graph), $N$ events are used to construct the specific vertex. This is a result of the fact that event streams for different processes are typically recorded individually. The first of the $N$ events constructs the vertex and places it as successor vertex after the last vertex used for the process to which the event belongs. All other events from all other processes for this synchronization event result in an edge from the last vertex used for the respective process to the vertex constructed by the first event. If all events for one synchronization event have been read, the associated synchronization vertex is considered complete; the vertex in the graph now has $N$ incoming edges.

Definition 5. Let $v$ be a vertex in a DAG. If $v \in E_{sync}$ then $v$ has as many incoming and outgoing edges as the number of processes that participate in the synchronizing operation. During the construction of the graph, $v$ will be called complete if all incoming and outgoing edges have been added to $v$ by the construction algorithm.

Definition 6. Let $v$ be a vertex in a DAG. If $v \in E_{sync}$ then $v$ will be called predecessor complete if $v$ is complete and all predecessors of $v$ are complete as well. This property ensures that all possible paths that can lead to $v$ have been constructed.

The fact of being complete is also the basis of the decision whether two directly connected synchronization vertices can be tested for a unification. The question whether there are multiple paths between two vertices $u$ and $v$ is determinable within a graph that is still being constructed if $u$ has all outgoing edges, $v$ has all incoming edges, and all vertices that are predecessors of $v$ are complete in the sense that all needed incoming and outgoing edges are present. This last condition ensures that no event that is read later adds more paths.


Figure 4.17: Outline of the graph optimization during the DAG construction


Lemma 2. Let $u$ and $v$ be two vertices in a DAG that is under construction with $u, v \in E_{sync}$ and $(u, v) \in K$. If $v$ is predecessor complete, and if $(u, v)$ is removed from $K$ and $u$ is afterwards not in the list of predecessors of $v$, then there is only a single path between $u$ and $v$ in the final graph.

In Figure 4.17 an outline of the actual algorithm for the graph construction and the optimization performed alongside it is given. Further optimizations are possible, as vertices can be marked predecessor complete permanently once this property has been observed.

4.2.4 Pattern Matching for POSIX I/O

Parallel programs often loop through parts of their source code multiple times. Typically, I/O is embedded into these structures or it is placed before or after computationally intensive regions. In addition, each program uses only a few I/O calls and request sizes out of the wide spectrum of available combinations of arguments for the different I/O functions. This triggers the assumption that repetitions in the event stream exist and that a pattern matching approach can be used to reduce the amount of information that is automatically extracted and presented to the performance analyst.


Figure 4.18: Outline of compression algorithm

By identifying recursive patterns it is possible to reduce a large amount of trace events to a smaller but more complex event set. If these fewer sets contain information that gives hints for possible performance enhancements, they are much easier to identify by the human eye. This can also help in an automated analysis of relations between events, thus reducing the number of optimization possibilities that have to be considered.

In the following, the terms event, function call and token will be used extensively to describe the pattern matching approach used within this thesis. As the name already states, the term function call describes a single call to one I/O function including all arguments. This call is recorded in the event stream as one event with a few additional properties that describe the arguments, the return value and the type of I/O. Each function call is assigned a unique numerical identifier. Each unique set of arguments for each function call also gets an identifier. The combination of the function identifier and the argument identifier gives a token that describes a bijective mapping between a function call including all arguments and an integer number. A list of consecutive tokens is a sequence or a pattern. A pattern is a sequence that shows up multiple times in a token stream.

The pattern analysis process is based on an approach presented in [NMSdS07]. This approach has recently also been extended to be usable for I/O analysis [VMMR09, KKMN10]. The original motivation for this algorithm was to compress program traces on the fly during the program run. The basic idea is to map each function call and its arguments to a token and to perform a pattern analysis using the generated token list. The order of function calls and the individual arguments can be derived per process from the graph constructed within Section 4.2.3 by following the labels along the graph's edges. For each vertex visited, one or more function calls have to be reconstructed to build the token list for each process. For each I/Oset $\in E_{io}$ the list of function calls can be created from the events stored in the individual I/Osets. Each synchronization vertex has a set of associated processes that participated in the construction of the vertex. All sets of processes used in an individual graph can be enumerated. This number can be used to generate a synchronization event for the reconstructed event stream. This could be, for example, a call to MPI_Barrier() with an identifier that points to an MPI communicator which has been constructed for this specific set of processes.


Figure 4.19: Descriptive example for the calculation of the compression complexity, identical colors denote matching tokens in the compression queue

If only two processes participate in a synchronization event, any type of blocking point-to-point communication call can be used to make one process wait for the other. The number assigned to each synchronization vertex can also be used to create the unique token which is required for the pattern matching. In addition, this allows an automated replay of the generated patterns with an MPI parallel I/O benchmark.

The resulting set of tokens per process can now be compressed. As pattern matching is not the central part of this thesis but an appropriate tool to find repetitions in an I/O sequence, other possible approaches are not investigated in detail.

The algorithm is based on a sliding window that is searched for a repeated sequence and starts with an empty compression queue. Each element in the queue is a pair $(p, r)$ where $p$ denotes a single token or a pattern number and $r$ denotes the number of repetitions for this element. All known patterns are kept in a pattern buffer that provides a mapping between the pattern identifier and the token sequence. Each token list for the individual processes is compressed individually, but known patterns are shared across all queues. The tokens are taken from the per-process token list and added at the end of the individual compression queue piece by piece. The compression step in the algorithm is applied after each addition of a token to the queue. Before the pattern search is started, the end of the compression queue is tested for whether it matches one of the already detected sequences. This step is repeated until no further match can be found in the pattern buffer. Afterwards, the end of the queue is tested for the presence of a repeated sequence. The repetition can contain multiple tokens and has to be directly adjacent to the other occurrence of this token sequence. If a repetition is found, the original tokens are replaced. The new pattern is placed in the pattern buffer. The original tokens from the end of the compression queue are removed and a new queue element is added where $p$ points to the pattern identifier and $r = 2$. An outline of the algorithm is given in Figure 4.18. If the pattern identifier at the end of the queue matches the pattern identifier in the queue element directly before the last element, the last element is removed and the pattern counter is increased by one.
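A compact sketch of this compression step is given below. It is a simplification of the algorithm described above: pattern identifiers are generated as strings to keep them distinct from integer tokens, and elements whose repetition count is already greater than one are not folded into new patterns.

    def compress_stream(tokens, patterns):
        # patterns: shared pattern buffer, pattern id -> tuple of queue symbols
        queue = []                                  # compression queue of [symbol, repetitions]

        def tail(n):                                # symbols of the last n queue elements
            return tuple(sym for sym, _ in queue[-n:])

        def plain_tail(n):                          # last n elements exist and are uncompressed
            return len(queue) >= n and all(rep == 1 for _, rep in queue[-n:])

        for tok in tokens:
            queue.append([tok, 1])

            replaced = True                         # replace the tail by known patterns
            while replaced:
                replaced = False
                for pid, seq in patterns.items():
                    n = len(seq)
                    if plain_tail(n) and tail(n) == seq:
                        del queue[-n:]
                        queue.append([pid, 1])
                        replaced = True
                        break

            # search for two directly adjacent, identical sequences at the end of the queue
            for n in range(len(queue) // 2, 0, -1):
                if plain_tail(2 * n) and tail(2 * n)[:n] == tail(n):
                    pid = "P%d" % (len(patterns) + 1)
                    patterns[pid] = tail(n)
                    del queue[-2 * n:]
                    queue.append([pid, 2])
                    break

            # fold a repeated symbol at the tail into the preceding queue element
            if len(queue) >= 2 and queue[-1][0] == queue[-2][0] and queue[-1][1] == 1:
                queue[-2][1] += 1
                queue.pop()

        return queue

For the token stream 1, 2, 3, 1, 2, 3, 1, 2, 3 and an initially empty pattern buffer, this yields a single queue element that refers to the pattern (1, 2, 3) with a repetition count of three.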

This algorithm allows the detection and the replacement of recursive patterns. The length of the compression queue determines the execution time of each step as well as the maximum length of the patterns that can be identified. The maximum detectable pattern length is half the queue length. The time complexity depends on the length $L$ of the compression queue. The best case is that the compression queue contains two identical token sequences of length $L/2$. This needs $L/2$ comparison operations to find the start of the second token sequence and $L - 2$ comparisons for the sequence elements. Finding the start of the second sequence always needs at most $L/2$ operations. If a sequence start is found at position $P$ (where $P$ is between $L/2$ and $L - 1$), then it needs in the worst (as well as in the matching) case $L - P - 1$ comparisons. The number of possible sequence starts is $L/4$, using the rationale that every second element is a possible sequence start. Thus, at maximum there are $\sum_{N=1}^{L/4} (L - P - 1)$ comparisons where $P = L - 2N$, leading to

$$\sum_{N=1}^{L/4} (2N - 1) = 2 \sum_{N=1}^{L/4} N - \frac{L}{4} = 2\,\frac{\frac{L}{4}\left(\frac{L}{4} + 1\right)}{2} - \frac{L}{4} = \frac{1}{16} L^2 \qquad (4.1)$$

The worst case time complexity for this algorithm is $O(n^2)$. A descriptive example of the variables used is given in Figure 4.19. In this figure, matching elements are marked by the same color. The compression queue can contain 10 elements. The algorithm will identify the last eight elements as two identical sequences of length four. At first, a match to the element at position $L$ is searched for by walking the queue backwards. If one is found (at position $P$), the three elements between $L$ and $P$ and the three elements before $P$ are checked for equality. The last eight elements in this case can then be removed and replaced by one element that points to the new pattern and has a recurrence of two.

There are different possibilities for further improvements of this approach. The amount of data presented to the performance analyst could potentially be further reduced by the detection of access patterns across processes. Another enhancement of the compression algorithm can be achieved by modifying the token generation, which currently maps function calls with different arguments to different tokens. It is conceivable to replace some of the arguments with variables. These variables could represent multiple instances of the same argument. One example supporting this idea are consecutive write accesses of $N$ bytes with the libc function pwrite(). This function takes the offset where the bytes are written to as an argument. In the case mentioned before, this offset could be expressed as offset $= x + i \cdot N$ where $x$ is the offset of the first access and $i$ is a loop variable.
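The parameterization of arguments can be illustrated with a small sketch that collapses a run of pwrite() calls whose offsets advance by a constant stride into one parameterized token; the tuple representation of the calls is an assumption made for the example.

    def collapse_strided_writes(calls):
        # calls: list of (function, size, offset) tuples, e.g. ("pwrite", N, x + i*N)
        if len(calls) < 2 or any(name != "pwrite" for name, _, _ in calls):
            return calls
        size, first = calls[0][1], calls[0][2]
        strided = all(s == size and off == first + i * size
                      for i, (_, s, off) in enumerate(calls))
        if strided:
            # one parameterized token: offset = first + i*size, repeated len(calls) times
            return [("pwrite", size, "offset = %d + i*%d" % (first, size), len(calls))]
        return calls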

4.2.5 I/O Pattern Visualization

Besides an automated pattern detection, a visual inspection of the patterns helps to identify optimization strategies. Analyzing performance problems with the help of graphical tools has a long tradition [MHJ91, NAW+96]. For the analysis of user program code that utilizes MPI, OpenMP, pthreads or other libraries and language extensions there are well established notions of what users or analysts expect from such a tool. For the visualization of I/O patterns, one of the few approaches can be found in [RPS+08]. The advantages of a file based access visualization are manifold. If an application issues a lot of requests that are close together (in terms of the accessed byte range) over a long time frame, this might not become obvious by looking at the I/O pattern alone. The same problem occurs if access patterns for two different regions in a file are entangled. If all accesses of each process to a file over the whole duration of the program run are collected, this type of access can be disclosed. One possibility to solve this kind of access problem is to introduce a write cache that collects all accesses to a specific region, thus limiting the number of I/O calls. Serial accesses to a file from the same process would also become visible immediately. Another type of access that can be optimized are write requests issued by the same process to different locations within a rather short time. This can be seen in parallel applications that share a single file and write their part of a distributed memory space into the file. This access type can be optimized with list I/O and can also be detected by a visual inspection. Another possibility to show valuable information beside “who accessed what” is to show temporal properties of file access patterns. For example, the I/O performance of a parallel process that modifies a file by rewriting parts of it multiple times can possibly be enhanced by a write cache as well. Typically, read-after-write (RAW) and write-after-read (WAR) accesses are of interest as well, as there are similar enhancements possible.

[Figure content: X axis — byte offset within the file; Y axis — process index plus an indicator row; markers for write block start, write block end, concurrent access, read-after-write, write-after-read]

Figure 4.20: Visualization approach for file access patterns of parallel applications

In general, any situation where a part of a file is accessed more than once is of interest. Taking the considerations from the previous paragraphs into account, an analysis of the data to be visualized shows that the most interesting questions are which processes accessed which parts of a file and how the access has been realized. Thus, a two-dimensional image is able to show this type of information. Figure 4.20 shows an approach to visualize the access to a specific file by rendering an image that is at least of the size number of processes × file size. The X axis denotes the byte range within the file whereas the Y axis is split into two parts. At first it contains a process identification number ranging from 0 to N − 1 for N parallel processes. The second part of this image marks those regions of the file where one of the concurrent access patterns given above is detected.
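A minimal rendering sketch for such an access map is given below, assuming the accesses for one file have already been extracted as (process, offset, length) tuples; the binning and the simple "accessed more than once" indicator row are simplifications of the scheme in Figure 4.20.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_access_map(accesses, nprocs, file_size, bins=1024):
        # accesses: iterable of (process, offset, length) tuples for one file
        image = np.zeros((nprocs + 1, bins))
        scale = bins / float(file_size)
        for proc, offset, length in accesses:
            lo = int(offset * scale)
            hi = max(int((offset + length) * scale), lo + 1)
            image[proc, lo:hi] += 1
        # last row: byte ranges touched by more than one access (candidate RAW/WAR/concurrent)
        image[nprocs, :] = (image[:nprocs, :].sum(axis=0) > 1)
        plt.imshow(image, aspect="auto", interpolation="nearest")
        plt.xlabel("byte offset within the file (binned)")
        plt.ylabel("process index / indicator row")
        plt.show()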

4.2.6 I/O Replay and Benchmarking

A common requirement of a detailed I/O analysis is to replay the I/O patterns observed. The reasons for this can be the need to answer the question whether a different file system might be able to handle a specific I/O pattern better, the creation of benchmarks for new procurements, or the creation of an artificial environment where the application or the I/O pattern is executed multiple times in parallel or scaled to a higher number of parallel processes. The last requirement enables scaling studies that make it possible to predict how an application will perform on a specific machine and to figure out whether I/O can become a performance bottleneck that limits scalability. Within the next paragraphs, a couple of requirements for a flexible I/O benchmark are discussed. As the main result of these considerations, a new I/O benchmark, “DIOS”, has been developed.

Support for high level access methods

As the source of the collected parallel I/O patterns is one of the high level I/O libraries mentioned in the sections before, the replay program or benchmark should be able to replay patterns specified in these high level interfaces. The benefit is the possibility to generate a trace from this replay and to verify the replay accuracy. In addition, it allows testing the implementations of different I/O libraries on the same hardware to identify the one that shows the best performance for an individual use case.

Replay of arbitrary I/O patterns

As the target is to replay the pattern as closely as possible to the original I/O sequence, the benchmark has to replay any kind of recursive or nested patterns. In addition, all facilities that support the tracking of requests have to be provided as well, for example asynchronous operations. In order to maintain the data dependencies, the replay also has to consider the embedded synchronization events that have ordered the I/O requests in the original program. Thus, a parallel replay approach is inevitable.

The design of a versatile replay engine for I/O patterns needs to include the possibility to use artificially created patterns as input. This feature enables performance analysts to evaluate file access patterns before the users have written the first line of code and thus to support the application development. To enable user definable patterns that have not been extracted from an I/O trace, a few more constructs that describe the input for the replay engine are needed. Beside using recursive I/O patterns, a flexible way to access a pool of different files has been proven useful, for example to replay the I/O patterns originating from the analysis of gene sequences [HM07]. File pools should allow to create, unlink and access larger sets of files that are distributed over a directory hierarchy. This helps to create benchmarks that replay metadata intensive I/O patterns, like the file creation rate or small random I/O to different files.

Further requirements for the replay of an application trace or an artificial pattern are, first, some kind of randomness for certain parameters and, second, the possibility to insert sleep or waiting times. The second feature is needed to partially emulate those parts of an application where it actually computes results. This additional time can have an impact on the I/O performance reported by the replay engine, as the execution time of the same I/O pattern may differ depending on the waiting times embedded in the pattern. One obvious example is asynchronous I/O, where the operating system has the option to perform all data transfers in the background if the application is able to fill the slot between the start of the asynchronous I/O operation and the associated wait operation.

Figure 4.21 shows a simplified input rule set for the I/O replay engine in EBNF notation. Beside file pools, derived MPI data types can be constructed as well and sleep/wait states are considered. Randomization is applied to the function parameters, which are not shown. A parallel I/O test is described by using multiple tasks where each will be assigned to one physical process at the start of the I/O replay. A task consists of I/O event sets which can be grouped recursively and contain lists of the I/O calls. Figure 4.22 shows an example input for the DIOS benchmark for the I/O pattern used in Figure 4.12.

An additional possibility to enhance the handling of arbitrary patterns is to compute the input files a benchmark needs. If a pattern does not create the files it is reading during its execution, the user running the replay engine has to create these files. For patterns that originate from program traces it might not be obvious which files of which sizes are needed. Replaying the pattern without actually executing the I/O requests should reveal this information.
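The last point, discovering the required input files without actually executing the I/O, can be sketched as a dry run over the pattern. The event representation below is hypothetical and only serves to illustrate the idea:

    def required_input_files(events):
        # events: list of dicts such as {"op": "open", "file": "a.dat"} or
        #         {"op": "read", "file": "a.dat", "offset": 0, "size": 4096}
        # Returns the minimum size each file must have before the replay starts.
        needed = {}
        created = set()
        for ev in events:
            if ev["op"] == "write":
                created.add(ev["file"])              # produced by the pattern itself
            elif ev["op"] == "read" and ev["file"] not in created:
                end = ev.get("offset", 0) + ev.get("size", 0)
                needed[ev["file"]] = max(needed.get(ev["file"], 0), end)
        return needed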

I/O pattern transformation

It is beneficial to evaluate the I/O patterns generated by different types of high level I/O interfaces for the same workload. This allows scientists to evaluate different options to implement I/O into their application without writing a single line of code. This also enables the evaluation of different what-if situations, for example comparing nonblocking MPI-I/O calls either with their blocking alternatives or with nonblocking POSIX I/O. Another option while changing I/O patterns is to insert waiting times between consecutive I/O calls. This allows the operating system or the file system, for example, to flush buffers or to prefetch data. A modification of pattern parameters like the file system used for a test or the number of files involved also allows scaling studies that can simulate a parallel execution of an I/O pattern on a whole system.

    iotest         = (poolcreate|task|mpi_datatype)*
    task           = (series|poolrandom|poollinear)*
    series         = (series|eventset|mpi_eventset|async_eventset)*
    poolrandom     = (eventset)*
    poollinear     = (eventset)*
    mpi_eventset   = (mpi_fopen|mpi_fclose|mpi_fdelete|mpi_set_view|
                      mpi_read|mpi_write|mpi_wait|sleep)+
    eventset       = (fopen|open|fclose|close|buf_read|buf_read|
                      buf_write|buf_seek|read|write|seek|sync|sleep|
                      poolopen|poolrandom|unlink|pread|pwrite|
                      barrier|p2p_send|p2p_recv)+
    async_eventset = (acancel|aerror|afsync|aread|awrite|
                      areturn|asuspend)+

Figure 4.21: Simplified EBNF rule set for the input file for the I/O replay engine

Figure 4.22: Example I/O test for the I/O pattern from Figure 4.12


Figure 4.23: Components of the DIOS replay engine

Flexible evaluation back end

The aim of the benchmark is to run different types of patterns and to evaluate the performance differences as the file system under test or parameters of the executed I/O patterns are changed. As the type of performance report needed by the performance analyst may change from task to task, it is helpful to have a freely definable reporting structure. One possibility to realize this is to separate the replay/execution engine from the reporting back end. This can be done by reporting the start and the end of each I/O operation as well as the associated parameters to the back end. This scheme also allows a chain of active back ends that, for example, report a read/write bandwidth, the number of operations executed per time window or other usable statistics. Metrics that can be computed by the back end itself for the performance analysis of a workload include:

• request size statistics
• read/write request ratio
• types of I/O functions used
• frequency of I/O requests
• degree of parallelism
• latency/bandwidth statistics per request type

The metrics can be computed either per client or for the whole benchmark run. In addition, it is valuable to save system specific metadata together with the benchmark results. These include the mounted file systems, the type of the operating system, file system specific settings of the operating system, as well as physical properties like a list of network interfaces, a list of the host names used, etc. The final architecture of the replay engine is depicted in Figure 4.23.
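A possible back end along these lines, computing a few of the listed metrics from the reported operations, could look like the following sketch; the reporting interface (one call per finished I/O operation) is an assumption made for the example.

    from collections import defaultdict

    class StatisticsBackend:
        # Receives one report per I/O operation: start/end time, type and byte count.

        def __init__(self):
            self.ops = defaultdict(list)        # operation type -> list of (duration, nbytes)

        def report(self, op_type, t_start, t_end, nbytes=0):
            self.ops[op_type].append((t_end - t_start, nbytes))

        def summary(self):
            result = {}
            for op, samples in self.ops.items():
                total_time = sum(d for d, _ in samples)
                total_bytes = sum(b for _, b in samples)
                result[op] = {
                    "requests": len(samples),
                    "avg_request_size": total_bytes / len(samples),
                    "avg_latency": total_time / len(samples),
                    "bandwidth": total_bytes / total_time if total_time else 0.0,
                }
            return result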

4.3 Summary

This chapter presented a holistic approach for a system wide I/O performance analysis. After extending the current notion of performance counters, a system has been designed to collect and subsequently process performance data and to integrate the results of this processing step into program traces. A distributed counter collection architecture allows program traces from parallel applications to be combined with external data collected on the infrastructure of a parallel file system. The second part of this chapter presented an approach for handling program traces that contain a huge amount of individual I/O requests. The main ideas are to reduce the amount of data by removing the regions of a program trace where no I/O requests are executed, and to compress the remaining I/O events with a pattern matching approach. A visualization approach for I/O patterns and a versatile I/O pattern replay engine conclude the chapter.

5 Prototype Implementation and Evaluation

This chapter is dedicated to the evaluation of the algorithms for the construction, compression and analysis of I/O calls as introduced in the previous chapter. It describes experiments that demonstrate the application of the ideas from the previous chapter to real use cases and shows their potential.

The software infrastructure that is presented and evaluated in this chapter has been designed to support the process of identification of I/O bottlenecks as well as the subsequent performance optimization. Possible performance optimizations include changes to the software stack starting at the application, changes in the operating system, within the hardware, and in the file system software. While the analyses in the first chapters of this thesis have revealed gaps that prevent a holistic I/O analysis, the approaches presented in the preceding chapter provided material to close these gaps. Furthermore, this chapter demonstrates the application of the enhanced approach to real use cases. The remainder of this chapter is organized as follows. At first the software and hardware environment used for most of the case studies is described. Within the next section some of the challenges for the realization of a distributed counter collection environment are examined. Most of these challenges are related to functionalities needed to enable the interoperability with a tracing facility. The third and largest part contains case studies that show how the approaches developed within this thesis help to improve production file systems, to compare file systems, and to enhance I/O in applications.

5.1 Deployment

Most of the case studies have been conducted on the “Hochleistungsrechner-Speicherkomplex” (HRSK) complex at the Technische Universität Dresden. This complex was put into operation at the end of 2006 and has been designed especially to support users with high demands for I/O and to enable research that includes large databases, millions of files and large bandwidth requirements. The system consists of two machines, an SGI Altix 4700 that provides large shared memory systems and a Linux NetworX PC farm that is constructed of heterogeneous compute nodes. The Altix system has been designed to support large parallel jobs while the PC farm is designated to execute many small jobs. Both machines have their own parallel file system. In addition, the file system attached to the SGI Altix is exported to the PC farm through an NFS gateway. The overall setup is shown in Figure 5.1 and will be explained in more detail in the next subsection.

5.1.1 SGI Altix 4700

The SGI Altix system consists of 2048 Intel Itanium II cores (1024 dual-core Montecito sockets) running at 1.6 GHz. All sockets are divided among five logical partitions, where each partition makes up one shared memory system. Three shared memory systems use 512 cores each under a single operating system instance and are reserved for batch jobs. The remainder is shared between an instance of 384 cores that is used for batch jobs and login and 128 cores that are dedicated to interactive use and FPGA tests. The three large partitions are equipped with 4 GB of main memory per core, the smaller systems with 1 GB.

Figure 5.1: Overview of the HRSK system at Technische Universität Dresden (for details of the CXFS setup see Figure 5.2, for the Lustre setup see Figure 5.3)

Figure 5.2: CXFS setup within the HRSK complex

Figure 5.3: Lustre setup within the HRSK complex

The Altix system uses SGI’s CXFS file system [1] as HPC storage system (see Figure 5.2). Four couplets of DDN S2A 9500 storage controllers provide 68 TB of capacity for this shared disk file system. Each couplet is connected to ten disk enclosures with 16x 146 GB FC disks each. All available 32 ports (4 GBit/s FC, each couplet provides eight ports) of the DDN controllers are connected to a SAN infrastructure. At the time of the measurements the storage was [2] used for two file systems, one for the home directories and one for scratch space. The scratch space is extended by a hierarchical storage management (HSM) system: the SGI product “Data Migration Facility” (DMF) realizes a transparent background layer of 1 PB of tape storage. Each of the four partitions that are used through the batch system is connected with eight 4 GBit/s links to the FC switches, which provides about 3.2 GB/s peak bandwidth per partition. Two redundant metadata servers mediate the access to the block devices. The metadata servers are not identical in terms of hard- and software: the primary server provides CXFS and DMF capabilities whereas the secondary provides the CXFS file systems only. Both servers share an SGI TP9300 storage array as storage for the file system metadata. Theoretically, the file system is able to deliver a bandwidth of about 12 GB/s; 10 GB/s have been observed in tests. The HSM system consists of 2500 LTO4 [3] tapes and 30 tape drives that provide a raw capacity of 1 PB. The primary CXFS metadata server (MDS) runs the HSM system, which periodically checks the filling level of the scratch file system. If it exceeds a configurable threshold, the HSM system starts migrating disk blocks from the disks to the tapes, thus freeing disk space. The metadata stay in the file system. If a user accesses a file whose content has been migrated to the tapes, the I/O call blocks until the file content has been moved back to the disks.

5.1.2 Linux NetworX PC Farm

The PC farm consists of 725 nodes equipped with one, two or four dual-core AMD Opteron processor sockets running at 2.6 GHz. All nodes together provide 2576 cores. The system is equipped with two SDR InfiniBand fabrics. One fabric is used for interprocess communication and one for I/O.

[1] Version 5.2.1.1 at the time of most of the measurements.
[2] The home file system was migrated to a BlueArc NAS server in fall 2010.
[3] Linear Tape Open, a product specification for magnetic tapes and tape drives.

Ten file servers are connected to the I/O fabric as well. They provide a Lustre file system [4] to all nodes. Eight servers are configured as OSSs (object storage servers, see Section 3.2.3), two as redundant metadata servers (see Figure 5.3). The eight OSSs are grouped into pairs, where each pair is connected to one DDN S2A 9500 couplet. Only half of the available connections on the DDN controllers are in use; thus, the theoretical limit for the file system bandwidth is 6.4 GB/s. The file system has delivered a bandwidth of more than 4 GB/s in the production environment. As in the case of CXFS, the two metadata servers share an SGI TP 9300 RAID controller as Lustre metadata target (MDT). There is no HSM support for this installation. Not shown in either Figure 5.2 or Figure 5.3 is the NFS gateway. Within the production environment it exports the CXFS file systems to the PC farm: 16 NFS servers, which are also CXFS clients, export both CXFS file systems and provide a total bandwidth of 4 GB/s. This gives the PC farm users indirect access to the HSM system.

5.1.3 System Monitor Deployment

For the analysis of both file systems with the approach developed in this thesis, the infrastructures of both file systems have to be instrumented. Both systems use DDN S2A 9500 controllers, which provide the following metrics of interest for an I/O performance analysis:

• either per FC host port or as the sum over all ports:
  – read and write bandwidth
  – total number of bytes transferred since the start of the controller
  – total number of I/O operations
  – histogram of the lengths of incoming requests
  – histogram of the time needed to complete a request
• for the RAID controller cache:
  – histogram of the offsets of the host requests into cache segments
  – information about the writeback and the read cache as well as cache prefetch hits
• for the internal disk back end:
  – statistics about the time to complete a request (can be broken down to the internal tier/channel/drive structure)
  – histogram of the lengths of the disk requests
  – histogram of how often and how many host I/O requests could be merged into a single disk request
  – information about disk rebuild and disk verify operations

Internally, the DDN controller uses an embedded microprocessor that executes the housekeeping tasks and updates most performance statistics once per second. These counters can be read at any time but return the same value within one update cycle. Thus, the collection daemon should read the values at most once per update cycle, as higher read rates would lead to wrong conclusions if rates like MB/s are calculated from these values. Only a few counters, like the number of outstanding disk requests, are updated multiple times per second.
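A polling loop that respects this update behavior can be sketched in a few lines. The sketch below is not the code of the actual collection daemon; read_total_bytes_written() is a hypothetical stand-in for a query of the cumulative byte counter of one FC host port via the controller interface, and the printed rate is derived from two cumulative readings taken one update cycle apart.

```cpp
// Minimal sketch of a counter polling loop, not the actual collection daemon.
// read_total_bytes_written() is a hypothetical helper that stands in for reading
// the cumulative "bytes written" counter of one FC host port from the controller.
#include <chrono>
#include <iostream>
#include <thread>

unsigned long long read_total_bytes_written() {
    // Placeholder: a real daemon would query the controller here.
    static unsigned long long simulated = 0;
    simulated += 100ULL * 1000 * 1000;  // pretend 100 MB were written since the last call
    return simulated;
}

int main() {
    using clock = std::chrono::steady_clock;
    const auto update_cycle = std::chrono::seconds(1);  // internal DDN update interval

    unsigned long long last_bytes = read_total_bytes_written();
    auto last_time = clock::now();

    for (int i = 0; i < 5; ++i) {
        // Never poll faster than the device updates its statistics, otherwise two
        // reads may return the identical value and a derived MB/s rate becomes 0.
        std::this_thread::sleep_for(update_cycle);

        const unsigned long long bytes = read_total_bytes_written();
        const auto now = clock::now();
        const double seconds =
            std::chrono::duration<double>(now - last_time).count();

        const double mb_per_s = (bytes - last_bytes) / seconds / 1e6;
        std::cout << "write bandwidth: " << mb_per_s << " MB/s\n";

        last_bytes = bytes;
        last_time = now;
    }
    return 0;
}
```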

[4] Version 1.6.7.2 at the time of the measurements.

Table 5.4: I/O related instrumentation points and their properties for the HRSK CXFS file system

Instrumentation Point   Access                      Time Resolution   Read-out Time
FC switches             SNMP protocol               1 s               < 1 s
SGI Altix 4700 host     -                           -                 -
DDN 9500 controller     telnet or proprietary API   1 s               200 ms - 2 s
TP 9300 controller      proprietary API             > 4 s             1 s - 4 s

Furthermore, the SAN infrastructure of the CXFS file system provides useful metrics. The SAN consists of a single layer of FC switches. The ports connected to the DDN controllers can be monitored from the controllers; the connections to the compute nodes, on the other hand, cannot be monitored on the host side. Thus, monitoring these ports on the switches is valuable as it reveals the I/O bandwidth that each individual host is using.

Table 5.4 shows the performance data sources for the CXFS file system infrastructure. On the Altix hosts, only a monitoring of the FC cards would provide additional data, but the cards currently in use do not allow such a monitoring. As the presented approach targets production systems, changing the associated card driver is not an option; monitoring the FC ports on the switch can provide this information as well. All RAID controllers in use provide information via proprietary APIs that are not disclosed to the public. The API for the TP 9300 controllers allows arbitrary request intervals, but if the performance data are queried at intervals of less than four seconds, the controller becomes unresponsive. On the client side, no sources are available to gather client-side statistics.

Table 5.5 shows the instrumentation points for the Lustre environment.

Table 5.5: I/O related instrumentation points and their properties for the HRSK Lustre file system

Instrumentation Point                Access                           Time Resolution   Read-out Time
PC farm compute host                 IB card via OFED [5] libraries   any               < 100 µs
SDR IB fabric                        -                                -                 -
Lustre Metadata Server (MDS)         via /proc                        any               1 µs
Lustre Object Storage Server (OSS)   via /proc                        any               1 µs
DDN 9500 controller                  telnet or proprietary API        1 s               200 ms - 2 s
TP 9300 controller                   proprietary API                  > 4 s             1 s - 4 s

[5] Open Fabrics Enterprise Distribution.

Data sources for the DDN devices are available at the same level of detail as for the CXFS file system. Within the /proc file system, a large number of different data points is available on the file servers:

• on an MDS server:
  – counters for the number of metadata operations, broken down by the type of operation and for each client individually
  – number of disk operations per client
  – number of clients that have the associated file system mounted
  – general MDS statistics, like the average time to complete a request, the request queue depth and many more


• on an OSS server:
  – number of bytes read/written per object storage target (OST), also broken down for each client individually
  – histogram of the number of pages that are involved in a single I/O operation
  – histogram of the number of disk I/O requests in flight
  – histogram of the I/O request sizes and I/O completion times

Reading values from the /proc file system has the additional advantage that it can be done as often as needed and very quickly. The counter values are taken from the live file system and are therefore always up to date.

The OFED InfiniBand software stack provides tools to request performance data from the InfiniBand subnet manager. The subnet manager is a central component of every InfiniBand installation and runs either on a dedicated host or on one of the switches. In the case of the PC farm, the first option has been chosen. The cards themselves do not collect any performance metrics.
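To illustrate how cheaply these values can be obtained, the following sketch reads and parses one per-OST stats file. The path follows the /proc layout of Lustre 1.6/1.8-class servers (/proc/fs/lustre/obdfilter/<OST name>/stats); the example OST name and the exact set of counters are assumptions and may differ between Lustre versions.

```cpp
// Minimal sketch: read one Lustre OSS stats file from /proc, assuming the
// /proc/fs/lustre/obdfilter/<OST>/stats layout of Lustre 1.6/1.8-class servers.
// Typical lines look like "<counter> <count> samples [<unit>] <min> <max> <sum>",
// where the last three fields are only present for counters with a value histogram.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main(int argc, char** argv) {
    const std::string path = (argc > 1)
        ? argv[1]
        : "/proc/fs/lustre/obdfilter/scratch-OST0000/stats";  // example OST name

    std::ifstream stats(path);
    if (!stats) {
        std::cerr << "cannot open " << path << "\n";
        return 1;
    }

    std::string line;
    while (std::getline(stats, line)) {
        std::istringstream in(line);
        std::string name, samples_word, unit;
        unsigned long long count = 0, min = 0, max = 0, sum = 0;

        in >> name >> count >> samples_word >> unit;
        if (in >> min >> max >> sum) {
            // Counter with a value distribution, e.g. read_bytes / write_bytes.
            std::cout << name << ": " << count << " calls, " << sum << " bytes total\n";
        } else {
            // Plain event counter, e.g. the number of getattr or setattr requests.
            std::cout << name << ": " << count << " calls\n";
        }
    }
    return 0;
}
```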

5.2 Challenges of the Implementation

Section 4.1.10 introduced a distributed system that makes performance data from external devices accessible for a holistic, system wide performance analysis. The basis for this performance analysis is an application trace that contains the application's I/O events. One of the main questions is how accurately the external performance data can be embedded into the trace file.

5.2.1 Timer and Offset Calibration

Referring to Section 4.1.9, the daemons for the RAID controllers and for the IB card statistics on the PC farm need a calibration of the offset between the internal update of the performance data within the device and the point in time when these values are read. To generate data for the calibration of both counters, a single benchmark can be used. As defined by the POSIX standard, I/O calls that perform Direct I/O have to finish all associated data transfers within the duration of the I/O call; the function call returns only when all data have been physically written to the disks. Thus, in a correct trace file, there is a logical sequence where the compute node issues the request, the IB card transmits the data, the DDN controller writes the data, and finally the I/O request finishes on the compute node. This scenario can be realized with an instrumented benchmark program that records a trace file which also contains the data from the IB cards and the RAID controllers. Ideally, all transfers seen on the IB card should happen between the start and the end of the write request, and all activities on the DDN controller have to occur within the same timeframe. In order to avoid influences of the counter management infrastructure, the file data are written to a single OST and the counter for the associated FC port on the DDN controller is included into the trace file. There should be no background load on the RAID controllers, so the benchmark has to be executed in a controlled environment. At first the cycle for a single measurement is explained; after that, the calibration of the data source is done with multiple repeated measurements. For the initial setup, without any corrections to the timestamps, all values from the DDNs are placed into the trace file with the timestamp associated with the end of the measurement (t_s + d_m), values for the IB cards with the timestamp from the beginning of the measurement (t_s). The data collected from the DIOS benchmark (see Section 4.2.6) running on the PC farm is depicted in Figure 5.6. The measurement writes 200 MiB of data with a single request. From the data in Table 5.1, an average bandwidth of about 124 MB/s can be calculated for a single chunk of data.
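The core of such a calibration measurement is one large write performed with Direct I/O, so that the call cannot return before the data has left the node. The following is only a minimal sketch of this step; the file name and the 4 KiB alignment are illustrative assumptions, not the actual DIOS configuration described in Section 4.2.6.

```cpp
// Minimal sketch of the calibration write: one 200 MiB request issued with O_DIRECT,
// so that write() only returns after the data has been transferred to the storage.
// File name and alignment are illustrative assumptions, not the DIOS configuration.
#define _GNU_SOURCE 1
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    const size_t size = 200UL * 1024 * 1024;   // 200 MiB in a single request
    const size_t alignment = 4096;             // typical O_DIRECT alignment

    void* buffer = nullptr;
    if (posix_memalign(&buffer, alignment, size) != 0) {
        perror("posix_memalign");
        return 1;
    }
    memset(buffer, 0xab, size);

    int fd = open("/lustre/scratch/calibration.dat",
                  O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    // The start and end timestamps of this call mark the interval in which all
    // traffic on the IB card and the DDN FC port belonging to this transfer must appear.
    ssize_t written = write(fd, buffer, size);
    if (written != (ssize_t)size) {
        perror("write");
    }

    close(fd);
    free(buffer);
    return 0;
}
```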

Figure 5.6: Counter timelines without clock synchronization for the calibration of the InfiniBand and DDN RAID controllers (bandwidth in MB/s over time in seconds; shown are the DDN port write counter, the write() request, and the IB port write counter)

A detailed analysis of the data shows that the first values from the InfiniBand card that indicate a data transfer have been read immediately after the start of the write() call and that the last such value is recorded just before the call finishes. Thus, d_off for the IB counter can be set to 0. An explanation for this behavior is that the counters for the IB cards are not read directly from the cards: they are calculated by the IB subnet manager, which is queried through the IB card for these data. Thus, it is reasonable to assume that the subnet manager reports data belonging to the time when the card started the query and not to the time when the manager processed the query.

The DDN controller provides six measurement samples (see Table 5.1), three of which are relevant for the time of this specific write request. To evaluate these and to determine d_off, some additional assumptions have to be made. First, it can safely be assumed that the real bandwidth outside the write() call is zero. It can further be expected that the write bandwidth on the FC wire is almost constant over the whole time. The latter assumption is justified by the values from the counter of the IB card and by the fact that the data transfer has been issued with a single statement; thus, there should be no overhead during the data transfer that is due to the file system management. In addition, a single request is small enough to easily fit into the write cache of the DDN controllers. The average bandwidth on the FC port of the controller is expected to be higher than the bandwidth seen from write() as it includes the FC headers. Calculating the integral of this timeline, about 216 MB of raw data have been transmitted for the request in this case, resulting in an average of 138 MB/s.

Table 5.1: Timestamps and values read for the write counter of one FC port of a DDN 9500 controller for the calibration measurement. A bandwidth value at timestamp X depicts the average bandwidth between the timestamps X and X + 1. At all other timestamps the values are 0.

Timestamp in seconds   Value in MB/s
13.38                  5.24
14.39                  137.99
15.42                  78.96
16.38                  0.52

Figure 5.7: Histograms of the transferred data volumes as calculated from the values read from the DDN controllers (unstable and stable timer), beside a few outliers the data are the same.

Figure 5.8: Timer offsets for the 40 measurements (unstable and stable timer), the drift rate for the read offset of the uncalibrated measurement can be calculated from the gradient of the curve.

Figure 5.9: Scatter plots of the calculated timer offsets and the transferred data volume (unstable and stable timer), the correction of the read offset has no influence on the stability of the calculated data volumes.

Figure 5.10: Scatter plots of the calculated timer offsets and the reading offset (unstable and stable timer), most times the difference between the offset calculated at the start and the end of the measurement is stable.

The associated write request was executed from timestamp 14.20 to 15.81. Thus, one of the samples in Table 5.1 was collected completely within the write() call; the value of this sample matches the calculated average. The assumption can be made that the sample between the timestamps 15.42 and 16.38 is composed of a part with about 138 MB/s and a part with 0.0 MB/s (after the controller has acknowledged all write requests and the client’s write() call has returned). From that, the point in time where the data transfer finished can be calculated with a linear equation. Let b_avg be the average bandwidth within this sample period and b_max the real bandwidth as calculated before. The part d_data of the sample period d during which data have actually been transmitted can be calculated with the ratio equation:

\[
  d_\mathrm{data} = \frac{d \cdot b_\mathrm{avg}}{b_\mathrm{max}} \qquad (5.1)
\]

The difference between this point in time and the end of the write request on the client side gives the offset for the timer adjustment of the daemon reading the values from the DDN controller. In the depicted case the calculation yields a value of 150 milliseconds. The same line of thought can be pursued for the beginning of the write process, leading to the same value. The difference between these two values should be close to zero; it indicates whether or not the offset between a read cycle of the daemon and a counter update within the RAID controllers is changing. In addition, the term “read offset” describes the difference between the timestamp at which a value is read and the point in time at which the value should have been read if the one second read interval were adhered to exactly.
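For illustration, the following sketch redoes this calculation with the numbers from Table 5.1 and the write() interval given above; it is a worked example of Equation (5.1), not part of the measurement infrastructure.

```cpp
// Worked example of Equation (5.1): estimating the timer offset d_off of the DDN
// counter daemon from one calibration write. All numbers are the example values
// from Table 5.1 and Section 5.2.1.
#include <iostream>

int main() {
    // Partially filled sample period read from the FC port write counter.
    const double sample_start = 15.42;   // begin of the sample period (controller clock)
    const double sample_end   = 16.38;   // end of the sample period
    const double b_avg        = 78.96;   // average bandwidth reported for this period, MB/s
    const double b_max        = 138.0;   // sustained FC bandwidth derived from the full sample

    // Equation (5.1): the fraction of the sample period during which data was flowing.
    const double d      = sample_end - sample_start;
    const double d_data = d * b_avg / b_max;

    // The transfer therefore ended at sample_start + d_data on the controller clock.
    const double transfer_end_controller = sample_start + d_data;

    // End of the write() call as recorded in the application trace (client clock).
    const double write_end_client = 15.81;

    // d_off is the correction to be applied to the controller timestamps (about 0.15 s).
    const double d_off = transfer_end_controller - write_end_client;

    std::cout << "d_data = " << d_data << " s, d_off = " << d_off << " s\n";
    return 0;
}
```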

To evaluate this correction, which has to be applied to the timestamps assigned to the DDN controller values, the measurement described in the previous paragraphs has to be repeated several times and the statistical properties of the calculated timer offsets have to be analyzed. For the RAID controller two additional metrics can be analyzed. First, an analysis of the transferred data volumes reveals how accurate and stable the values read from the FC port on the DDN controllers are. Second, the offsets for the timer correction and the difference between the offsets calculated from the beginning and from the end of the write request are of interest. Figures 5.7 to 5.10 show the results of an analysis of two experiments, each of which includes 40 repetitions of the write() request described above.

The diagrams on the left side of Figures 5.7 to 5.10 show the estimation of the embedding error for a measurement daemon with an unstable timer, while the diagrams on the right side show the same analysis with an enhanced (stable) timer on the daemon side. For the diagrams on the left side, the read interval is slightly larger than one second. The first set of diagrams shows the distribution of the data volume that has been transferred with the different write() requests. With the exception of a few outliers, a value of 205 MiB (216 MB) is calculated most of the time. Higher values include additional traffic that disturbs the measurement: this measurement was done on one node of the PC farm while the system was not in production, but other nodes or the file servers can still introduce additional traffic. The calculated offsets depicted in Figure 5.8 show that the offset for a timer that does not read the values at exactly one second intervals drifts between −0.4 and 1 second. The last two sets of graphs, Figures 5.9 and 5.10, show scatter plots of the offsets calculated at the start and the end of the measurement against the transferred data volume and against the read offset, respectively. Both show that differences in the calculated data volume do not automatically lead to problems for the calculation of the offsets and that a fixed read offset does not guarantee stable differences between the offsets at the start and the end of the write. Nevertheless, all graphs together show that the use of a stable timer source leads to a stable timer offset, which in turn can be used to adjust all timestamps of the data source, thus leading to an optimized embedding of the RAID controller data into a trace file.

5.2.2 Connection of the Dataheap Prototype and the Trace Facility

For the implementation of the prototype the “post mortem” approach described in Section 4.1.8 is used, as it promises the least impact on the analyzed program. However, utilizing this approach has a few practical consequences. First, the external management host has to collect the values of the selected counters on behalf of the trace facility. As memory space is limited on the management host, these values either have to be cached on disk or can only be recorded for a limited amount of time. Additional care needs to be taken with runaway tracing processes to avoid situations where the management process collects values forever. Second, the transport delay from the data source to the management host has to be accounted for as well. When the trace facility signals the end of the data collection process, it needs to continue collecting the remote values until the timestamp of the remote counter matches the last timestamp observed within the tracing facility.
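The catch-up rule and the offset correction can be condensed into a short sketch. fetch_next_sample() below is a hypothetical accessor for the samples buffered on the management host and the synthetic sample series only stands in for that buffer; the sketch illustrates the termination criterion and the timestamp shift, not the actual prototype interface.

```cpp
// Sketch of the post mortem merge step: drain buffered counter samples until the
// remote timeline has caught up with the end of the trace, keep only the samples
// that fall into the traced interval, and shift them by the calibrated offset d_off.
#include <iostream>
#include <optional>
#include <vector>

struct Sample {
    double timestamp;  // seconds on the clock of the data source
    double value;      // e.g. bandwidth in MB/s
};

// Placeholder for the buffer on the management host: a synthetic series with one
// sample per second. In the real system these values would come from the cache
// that was filled during the measurement.
std::optional<Sample> fetch_next_sample() {
    static double t = 0.0;
    if (t > 30.0) return std::nullopt;
    Sample s{t, 100.0 + t};
    t += 1.0;
    return s;
}

std::vector<Sample> merge_window(double trace_start, double trace_end, double d_off) {
    std::vector<Sample> result;
    while (auto s = fetch_next_sample()) {
        s->timestamp -= d_off;                 // shift onto the trace time base
        if (s->timestamp > trace_end) break;   // remote timeline has caught up, stop collecting
        if (s->timestamp >= trace_start) result.push_back(*s);
    }
    return result;
}

int main() {
    for (const auto& s : merge_window(5.0, 12.0, 0.15))
        std::cout << s.timestamp << " s: " << s.value << "\n";
    return 0;
}
```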

5.2.3 Supported Systems/Hardware, Permanent Storage Containers

A prototype implementation of the proposed infrastructure has been done in C++. The components are connected through a communication middleware based on TCP. All components use a fixed API that allows the capabilities of the software to be extended further, for example to attach new data sources. It also allows parts of the software to be replaced dynamically and new components to be attached. At the time of writing, the software consisted of the following main components:

1. The Dataheap framework with an integrated calculation of derived counters.
2. The remote monitoring facility that allows recording of counter data.
3. A GUI for live analysis.
4. The storage container that supports MySQL and SQLite databases.
5. Several data sources that collect data from:
   • compute hosts, for example for counters like network data, InfiniBand, CPU load and memory statistics
   • Lustre MDS and OSS servers
   • DDN S2A devices
   • LSI storage devices
   • the lm-sensors infrastructure on Linux hosts
   • the CPU powerstate monitor
6. C and C++ libraries to access the framework.
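To give an idea of what a TCP-connected data source could look like, the following generic sketch pushes one counter sample per second as a plain text record to a management host. It is explicitly not the Dataheap API or wire protocol (neither is documented here); host, port, counter name and line format are assumptions made for this illustration only.

```cpp
// Generic illustration of a TCP-connected data source, NOT the Dataheap protocol.
// It pushes "counter timestamp value" text records to an assumed management host.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <string>
#include <thread>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9999);                      // assumed management port
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);  // assumed management host

    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    for (int i = 0; i < 10; ++i) {
        double now = std::chrono::duration<double>(
            std::chrono::system_clock::now().time_since_epoch()).count();
        double value = 42.0 + i;                      // placeholder counter value
        std::string line = "ddn1.port3.write_mb_s " + std::to_string(now) + " "
                         + std::to_string(value) + "\n";
        if (send(fd, line.data(), line.size(), 0) < 0) { perror("send"); break; }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    close(fd);
    return 0;
}
```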

5.3 Evaluation

This section covers results of experiments and performance analyses done with the methods developed in Chapter 4. The first two subsections cover analyses of usage scenarios on the HRSK infrastructure, while the last subsection is dedicated to Lustre tuning experiences on 100 GBit/s network hardware.

Figure 5.11: Input file for the DIOS benchmark to emulate the I/O sequence of THREADER, a file pool is used to realize a loop over 9000 files

5.3.1 Scalability Analysis of a Bioinformatics Application

THREADER [Jon98] is a Bioinformatics application that plays an important role in the process of predicting the function of proteins [HM07]. The threading process uses the gene sequence of a protein to predict its 3-D structure: THREADER searches a database that contains the gene sequences of known proteins and their folding structures and tries to find similarities between the gene sequences of the known proteins and the input protein. Biologists at the Max Planck Institute of Molecular Cell Biology and Genetics have been running threading jobs on the PC farm very frequently and have typically submitted up to 100000 jobs at once. The program itself is sequential. However, running multiple instances of this program has exposed I/O bottlenecks that slow down all programs using the same file system. The THREADER database consists of approximately 9000 files with sizes between 20 and 100 KB. The term “database” is misleading in this case, as the files are actually plain text files. The I/O sequence used by THREADER was obtained with the strace utility. As THREADER was only available as a binary, monitoring the system calls revealed that each database file is opened, its whole content is read with one large request and, as the last activity, the file is closed. This I/O sequence was modeled with the DIOS benchmark, resulting in the input pattern shown in Figure 5.11. After reading one file, the DIOS benchmark sleeps for a short amount of time to account for the part of the application that processes the input data. In order to reproduce the problems the application causes on the system, this benchmark has been run multiple times with different numbers of parallel processes. In all cases, one process per compute node was used to put the maximum load on the file system and to avoid data reuse on the compute nodes. Figures 5.12 to 5.14 show an analysis of the execution time of the benchmark as the number of processes running in parallel is increased. The numbers depict how much time a process spends on average executing fopen64(), fread() and fclose() for all files, and the RPC call rate on the Lustre metadata server. The performance numbers for the graphs have been collected by running the scaling experiment on different days and contain more than one set of performance data for each distinct number of parallel processes. The system was running in normal production mode during all tests and was filled with jobs from various users.
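The per-file access cycle observed with strace can be reproduced in a few lines. The sketch below is a simplified stand-in for the emulated sequence (the actual emulation uses the DIOS input of Figure 5.11); the file names and the 10 ms compute delay are assumptions for illustration.

```cpp
// Simplified stand-in for the emulated THREADER I/O sequence: every database file
// is opened, read with one large request and closed, followed by a short pause
// that represents the compute part. File names and the 10 ms delay are assumptions.
#include <chrono>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

int main() {
    std::vector<std::string> files;
    for (int i = 0; i < 9000; ++i)
        files.push_back("db/entry_" + std::to_string(i) + ".txt");  // hypothetical names

    std::vector<char> buffer(128 * 1024);  // large enough for the 20-100 KB files

    for (const auto& name : files) {
        FILE* f = fopen64(name.c_str(), "r");               // same call as in the trace
        if (!f) continue;                                    // skip missing files
        size_t n = fread(buffer.data(), 1, buffer.size(), f);  // whole file, one request
        (void)n;
        fclose(f);
        // Emulate the time THREADER spends processing the entry it just read.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return 0;
}
```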

Figure 5.12: Scatter plots for the fopen64() function from multiple runs of scaling experiments executed on different days on a production system, the time on the y-axis is the sum over all function calls on all processes

Figure 5.13: Scatter plots for the fread() function from multiple runs of scaling experiments executed on different days on a production system, the time on the y-axis is the sum over all function calls on all processes

Figure 5.14: Scatter plots for the fclose() function from multiple runs of scaling experiments executed on different days on a production system, the time on the y-axis is the sum over all function calls on all processes

Figure 5.15: Correlation of load and RPC call rate on the metadata server, different parts of the data are highlighted

Figure 5.16: Scatter plots for fread() from multiple runs of scaling experiments executed on different days and different compute nodes

Figure 5.17: Per-process plot of the average time for one fopen64() call and the traffic on the IB card for a single run

Figure 5.18: Per-process plot of the average time for one fread() call and the traffic on the IB card for a single run

For each I/O function that has been used, one graph shows the correlation between the number of parallel processes and the sum of the execution time spent within this I/O function over all processes. The second graph always shows the correlation between the RPC call rate on the metadata server and the time spent within the I/O function. The first conclusion that can be drawn from the data is that at least two different data sets can be distinguished within the data for the fopen64() and fclose() functions. The time spent within fread() is not as strongly correlated with the RPC call rate as the times spent within the two other functions. A further analysis of the data shows that the performance numbers for fewer than 50 parallel processes are mostly scattered within the graphs. The separation for fopen64() and fclose() starts above 50 processes, which leads to the conclusion that smaller numbers of processes do not put load onto the file system in a way that the response time of the file system is influenced by background activities from other processes.

Table 5.2: Program runtime and I/O time for different file access patterns, all times are in seconds. All rows show typical times for 300 parallel processes accessing 9000 files either in one directory or distributed over 90 directories, in linear and random order. The times for the I/O functions are the sum of the exclusive execution times of all processes.

access   directories   runtime   fopen64 time   fread time   fclose time   avg. RPCs/sec
linear   1             451       11243          102796       11530         23000
random   1             326       25528          41526        23044         25000
linear   90            345       19954          58158        15945         25000
random   90            276       35104          7360         33187         30000

A global analysis of the correlation between the load (the number of running processes plus the number of processes waiting for I/O) and the RPC call rate on the metadata server reveals the cause for the differences seen in the execution times of the fopen64() and fclose() functions for the same number of processes. Figure 5.15 depicts the different regions that processes belong to if the number of processes is above 50 and the load on the MDS is either above or below 30. A load of 30 (see Figure 5.15) seems to be a dividing point for this use case on the Lustre metadata server. The metadata server creates threads on demand, and most of them are responsible for the access to the metadata target. A higher load on the server indicates more requests that need to access the metadata storage, which creates a bottleneck that becomes visible within an application through higher latencies for I/O requests. Figure 5.16 shows that the SAN bandwidth that can be reached with such a workload is rather small; there is almost no influence from the background workloads (or the RPC call rate on the MDS). In addition, Figures 5.17 and 5.18 show a comparative analysis of two runs with 300 processes with different background loads (different user jobs) on the file system. The graphs in Figures 5.12 to 5.14 depict the fact that the sums of the execution times of the individual I/O functions differ between runs with the same number of processes. At first sight, the analysis of the trace files does not show any significant outlier, and the execution times for the I/O calls per process within a single run do not differ significantly either. What differs among the processes is the total amount of data that is sent through the InfiniBand card. This effect is due to the (unknown) background I/O workload. In contrast to the initial expectation, this does not influence the average call latency; in fact, the program run with the smaller runtime has larger averages for the fread() calls but smaller averages for fopen64() and fclose(). Thus, it can be concluded that the metadata load introduced by background I/O activities has a strong influence on the I/O performance for this type of I/O pattern. The concurrent access to the same set of files is the key to performance in these cases. The most likely root cause for the performance problem are concurrent updates of the file access time in the file inodes. There are different possibilities to improve this situation. A first step would be to let all processes access the files in random order; the next step could be to scatter the files over different directories. Table 5.2 shows that accessing the database files in random order instead of sequentially does result in better performance for this I/O pattern. The random access pattern with the files distributed over multiple directories shows the shortest runtime for 300 parallel processes. Accessing the files in random order within different subdirectories thrashes the MDS inode cache and results in most of the time being spent opening and closing the files. The counters for the fibre channel ports of the DDN controllers show that the short aggregate time for the fread() calls is not a result of a higher bandwidth. The most likely reason is the locking of the OST objects, which favors patterns where different files spread over multiple directories are accessed.
On the other hand, accessing files from processes on different hosts in linear order seems to create a lot of contention on the OSTs.
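The first proposed change, randomizing the access order, amounts to shuffling the file list before the open/read/close loop. A hedged sketch is shown below; the file names are hypothetical and the seed is derived from the process ID so that every process obtains a different permutation.

```cpp
// Sketch of the first proposed improvement: every process accesses the database
// files in its own random order instead of the identical linear order.
#include <algorithm>
#include <random>
#include <string>
#include <unistd.h>
#include <vector>

std::vector<std::string> randomized_file_list() {
    std::vector<std::string> files;
    for (int i = 0; i < 9000; ++i)
        files.push_back("db/entry_" + std::to_string(i) + ".txt");  // hypothetical names

    // A different permutation per process avoids that all processes hit the same
    // inode at the same time, which is what stresses the MDS and the OST locking.
    std::mt19937 rng(static_cast<unsigned>(getpid()));
    std::shuffle(files.begin(), files.end(), rng);
    return files;
}

int main() {
    auto files = randomized_file_list();
    return files.empty() ? 1 : 0;  // the open/read/close loop would follow here
}
```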

Further possibilities to improve the response time for this file access pattern are moving to a different file system or using a wrapper library or file system that encloses all files in one large file. This would allow file content prefetching to be used much more efficiently. The most efficient alternative would be to alter the source code of the application to use a single large data source for accessing the known protein structures. From this example it can be concluded that, to efficiently reuse data in the inode cache of the MDS, all processes need to access the same file within a short time period. Accessing a large number of different files from multiple processes thrashes this cache and leads to longer access times. On the other hand, this has the advantage that the accesses to the actual file data through the OST objects become much faster, as there is much less contention for individual file objects. This kind of analysis is only realizable by creating a unified view onto the system and looking at all pieces that make up a file system.

5.3.2 I/O Analysis of the Semtex Application

Semtex is a Computational Fluid Dynamics application that is used for simulations within the collaborative research centre Sonderforschungsbereich 609 at Technische Universität Dresden. The code runs frequently on the SGI Altix and the PC farm with up to 512 parallel processes, but does scale to higher core counts.

Figure 5.19: Screenshot of the timeline of the original Semtex version showing a checkpoint phase, black lines denote MPI messages and the different colors in the individual process bars mark different regions in the program code.

1: barrier_sync( 0);
1: fopen64( ../MODELS/TMF_Nz256.chk.366, w);
1: barrier_sync( 0);
1: writev( 13, somewhere, 351);
1: writev( 13, somewhere, 1440000);
31:[
    1: writev( 13, somewhere, 0);
    1: writev( 13, somewhere, 1440000);
]
3:[
    32:[
        1: barrier_sync( 9);
        1:[
            1: writev( 13, somewhere, 0);
            1: writev( 13, somewhere, 1440000);
        ]
    ]
    32:[
        1: barrier_sync( 10);
        1:[
            1: writev( 13, somewhere, 0);
            1: writev( 13, somewhere, 1440000);
        ]
    ]
    ...
]

Figure 5.20: Excerpt of the I/O pattern of process 0.

2: barrier_sync( 0);
128: barrier_sync( 9);

Figure 5.21: Excerpt of the I/O pattern of process 3.


Figure 5.22: Amount of time spent in different function groups for a strong scaling experiment with the Semtex application. All times are in seconds and depict the sum of all exclusive execution times of all functions for the particular function group over all processes.

Table 5.3: Number of events, I/O sets, original and remaining synchronization vertices for different numbers of processors for the Semtex application

processes   total events   I/O sets   org. sync. vertices   remaining sync. vertices
8           867552         41045      391129                40991
16          2467308        43589      1138313               43471
32          7539732        45077      3705049               44831
64          26535012       46253      13140593              45751
128         99094272       47705      49226269              46691

The interesting part from the CFD point of view is to explore possibilities to optimize the magnetic stirring of a silicon melt. A single simulation can take three to four months to complete. Thus, the application implements a checkpoint/restart mechanism that regularly dumps the current state of the calculation into a file. This allows the simulation to be interrupted after a number of time steps and to be restarted from this checkpoint. These checkpoints are also used for the visualization of the simulation results, e.g. to create movies that show the development of flows within the silicon melt. In the way the simulation is used at the moment, the application runs for about four hours, executing exactly 5000 time steps. Checkpoints are dumped every 200 steps. At the end of each cycle of 5000 time steps, the application terminates and a new job is submitted automatically to the batch system, which restarts the application to execute the next cycle. For a first analysis, the application was instrumented to collect MPI and I/O events only. Figure 5.19 shows a screenshot of a section of the timeline where one checkpoint is written, taken with the graphical trace file analyzer Vampir. For this test the program was executed on 32 processors.

Figure 5.23: Access pattern for a checkpoint file written by Semtex with MPI-I/O. The term "Timeline" on the x-axis is misleading as in this picture the file offset is plotted against the process identifier.

The black lines denote MPI messages, red bars mark the periods that the application spent in MPI, green marks time in user code, and I/O time is marked yellow. Excerpts of the associated I/O patterns as generated by the pattern matching approach for the main process (process 0) and one of the other processes (process 3) are depicted in Figures 5.20 and 5.21. Here, each colon marks a pattern. The number before the colon denotes how often the following operation or pattern is executed. A left square bracket indicates that this pattern contains one or multiple sub-patterns. To enhance readability, sub-patterns are further indented. All ways of analyzing the trace data, the pattern matching approach, the trace visualization, and the access visualization approach (not shown), uncover that only one process writes all checkpoint data. Within Figure 5.19, processes 1-63 send their checkpoint data to process 0, which writes it to a file. Figure 5.20 additionally reveals the pattern that is used by process 0. The MPI point-to-point communications (as used in Semtex and displayed in Vampir) do not show up in the I/O patterns with communication partners like 0 and 1, 0 and 2, 0 and 3, and so on. Instead, internally within the implementation of the information reduction approach, each distinct set of processes that synchronize (use blocking communication) at some point during the runtime of the program is mapped to a unique identifier. As different synchronization points can be merged into a new synchronization point while unnecessary synchronization events are removed, the final set of process groups that have to synchronize with each other can differ from the initial set. For a message exchange and the ordering of the I/O operations it also does not matter which process sent the message and which process received it. Thus, the barrier_sync(X) points denote a synchronization between two or more processes. The information about which real MPI ranks belong to which synchronization set is given as additional output. To further analyze the behavior of Semtex, a strong scaling study has been done by running the program with different numbers of processors on the same input data set. Figure 5.22 shows the I/O time and the MPI time for one checkpoint from this experiment. The main problem with the checkpoint approach of the original program version is that the time to write one checkpoint grows linearly with the number of processes: N − 1 sequential data exchanges and write operations are executed. This does not necessarily result in a more than linear growth of the time spent in I/O, but the time spent waiting within an MPI barrier for the signal that the checkpoint has been written increases with each additional process. The I/O pattern depicted in Figure 5.21 shows up in all trace files. Table 5.3 shows the metrics obtained by applying the data reduction approach from Section 4.2.3 to the trace files. It can be concluded that the number of synchronization vertices after the optimization process is in the order of the number of I/O sets [KKMN10]. This is expected as only one process is executing I/O requests.
As an I/O set combines all I/O requests of one process between two synchronization points and most of the MPI communications in the application are point-to-point operations, a one-to-one relationship between the number of I/O sets and synchronization vertices is the expected outcome of the graph reduction process. For 128 processes as given in Table 5.3, only 0.1% of the initial synchronization vertices are required to describe the dependencies between all I/O calls. As checkpointing is a process that can be done in the background at some additional cost, a two-step approach was used to improve the checkpoint procedure of Semtex. At first, the data that is part of the checkpoint is copied to an additional memory location within each process; afterwards the computation can continue. In a second step the data are written to the file system in parallel by an additional background thread. As the program is using MPI, asynchronous MPI I/O operations transfer the data to the file system in the background. To realize this two-step approach, a double buffering scheme has been implemented [MKK+08]. This scheme reduces the time needed to issue one checkpoint from 90 seconds down to 4 seconds. The possibilities to reduce the time spent in I/O during a checkpoint phase are exhausted with this approach; almost no time is spent in I/O calls anymore.
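The double-buffering idea can be condensed into the following hedged sketch: each process copies its checkpoint data into a shadow buffer and starts a non-blocking MPI-I/O write at its own non-overlapping file offset. The actual implementation in Semtex [MKK+08] differs in detail; buffer size, file name and the call sites are illustrative assumptions.

```cpp
// Hedged sketch of the two-step checkpoint scheme: copy the data into a shadow
// buffer, then let MPI-I/O write it in the background while the solver continues.
// Buffer size, file name and the call sites are illustrative, not the Semtex code.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                    // doubles per process (example size)
    std::vector<double> state(n, 1.0);        // data that has to be checkpointed
    std::vector<double> shadow(n);            // second buffer for the background write

    MPI_File fh;
    char filename[] = "checkpoint.chk";
    MPI_File_open(MPI_COMM_WORLD, filename,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Request req = MPI_REQUEST_NULL;
    for (int step = 0; step < 1000; ++step) {
        // ... compute one time step on 'state' ...

        if (step % 200 == 0) {
            // Step 1: make sure the previous checkpoint write has finished, then
            // copy the current state so that the computation can continue at once.
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            shadow = state;

            // Step 2: non-blocking write of the process-local, non-overlapping
            // file region (compare the access pattern in Figure 5.23).
            MPI_Offset offset = (MPI_Offset)rank * (MPI_Offset)n * (MPI_Offset)sizeof(double);
            MPI_File_iwrite_at(fh, offset, shadow.data(), n, MPI_DOUBLE, &req);
        }
    }

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```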

Figure 5.24: Counter analysis of a process with slow checkpointing showing a high degree of correlation between the traffic on the InfiniBand port and the Lustre client read statistic

Figure 5.25: Counter analysis of a process with fast checkpointing with locally buffered I/O and without traffic on the InfiniBand port

As the I/O activities have been modified to be executed in the background, they actually influence the first couple of iterations that are done right after the checkpoint. This is expected as both the background I/O process and the computation share the same CPU and demand memory bandwidth. Figure 5.23 shows the optimized I/O access pattern for one of the checkpoint files: each process writes an equal part of the file into a region that does not overlap with the regions accessed by other processes. The application is running on both machines of the HRSK system in Dresden, but the development initially started on the SGI Altix system. When moving to the PC farm it was noticed that reading the checkpoint always resulted in a couple of processes that read their portion very fast and then had to wait for the other processes. The runtime of the initial I/O phase can differ by a factor of 50 when running with 512 processes. This problem did not show up when the program was executed on the same, but empty, system under controlled conditions. To analyze this issue, the DIOS benchmark was used to replay this I/O phase within the production environment. On the compute nodes the number of bytes sent and received on the InfiniBand card has been collected; from the file system infrastructure, data from the file servers and RAID controllers have been embedded into the trace file as well. An investigation of the counters for the file system infrastructure did not reveal any problem, as the load on the metadata server was normal and the bandwidth on the RAID controllers was around 3 GB/s, which is reasonable for this workload. The explanation for the runtime differences was found by correlating the number of bytes received on the InfiniBand card with the number of bytes that the file system delivered to the local process. For the slower processes (see Figure 5.24) there is a strong correlation between both timelines. For the fast processes (see Figure 5.25) there is almost no traffic on the InfiniBand card, but the counter for the Lustre client shows about 1.1 GB/s read bandwidth. The faster clients still have the file content from the previous write cycle of the checkpoint in the local VFS cache. The slower clients experience a higher memory pressure from other jobs in the system and thus have less memory available for the VFS cache. This is an example of an analysis that is only doable by looking at global and node-local counters at the same time.

5.3.3 Tuning Lustre for Bandwidth on a 100GBit/s Link

Starting in mid 2010, the ZIH at TU Dresden had the opportunity to test the first commercial 100 GBit/s WAN connectivity. Alcatel-Lucent and T-Systems established a connection between the Bergakademie Freiberg and the Technische Universität Dresden using a single fiber and a single wavelength on the optical layer. The link spans 60 km (37 miles) between the two sites and was used to test the first commercially available equipment under realistic conditions. The latency for a ping over TCP/IP is about 0.8 ms. Five subprojects with different backgrounds have been scheduled to analyze the capabilities of this link. The first subproject consists of performance tests that have been designed to exercise the TCP/IP layer. The task of the second subproject was to deploy three different parallel file systems at both sites, to determine the impact of the latency on the file system performance, and to analyze general file system characteristics on this hardware. The other three subprojects include virtual machines, video conferencing, and NAS tests. This subsection shows some of the analyses done in the initial setup phase of the second subproject. The file systems to be used within this subproject are Lustre, GPFS and the Fraunhofer File System (FhGFS). At the point in time when this thesis was finished, only the tests for the first subproject and the initial tests for the Lustre file system had been done. Figure 5.26 gives an overview of the hardware used for all five subprojects. Both locations deployed 16 HP servers that each have a single CPU socket, one 8 GBit/s FC port, one 10 GBit/s Ethernet uplink to the 100 GBit/s switch and one DDR InfiniBand uplink to a local compute cluster. The FC ports are directly connected to storage equipment provided by DDN, two S2A 9900 in Dresden and one SFA 10000 in Freiberg.

Figure 5.26: Outline of the hardware setup for the 100Gbit project

Both systems provide 16 FC ports running at 8 Gbit/s. Due to the FC encoding, all links provide about 12.8 GB/s total bandwidth at each site. The main goal of the first subproject was to saturate the link by using multiple parallel point-to-point connections. As one of its main results, a maximum throughput of 98.2 GBit/s could be reached. This in turn is the limit for all further tests that use TCP/IP as communication protocol.

As its first task, the second subproject had to exercise Lustre on the 100GBit link. Deploying a Lustre file system in a new environment is done by tuning all parts of the software stack (see Figure 3.5) individually. Before using the file system, at least the following steps are advised (the recommended test tool is given in brackets):

• test of the lnet bandwidth between the clients and the servers (lst)
• raw bandwidth I/O tests on all disks individually (sgpdd-survey)
• parallel raw bandwidth I/O tests on all disks together (sgpdd-survey)
• individual and parallel bandwidth tests that include the server side software stack (obdfilter-survey)

The bandwidth between two groups of Lustre nodes can be tested with the built-in lnet self test. With four parallel simulated read and write transfers of 1 MB size per server, the maximum lnet bandwidth was determined to be 94.4 GBit/s. 1 MB is the maximum size of an lnet message, and more parallel requests would only help in case of higher latencies.

One of the main work areas was the tuning of the DDN systems for maximum bandwidth. At first, sgpdd-survey showed a write performance of about 640 MB/s per FC port on the DDN devices and a read speed of about 780 MB/s. The IOR benchmark [SS07] was used to write blocks of 10 MiB from the SGI ICE system to the DDN storage in Dresden, resulting in similar performance numbers. As the maximum speed of one port is about 800 MB/s and all ports on the storage controllers were needed to fully exploit the capabilities of the 100GBit/s equipment, this problem had to be fixed. Figure 5.27 shows a visualization of a trace file collected while tuning the file system at Dresden together with the clients in Dresden. To debug the situation described above, one of the FC connections was routed through a FC switch, thus breaking the direct connection between the FC card on the OSS server and the DDN controller. The two timelines at the bottom left of the figure show the difference in the bandwidth per port; port 4 of the first controller had been rerouted.

Figure 5.27: Screenshot of an analysis of the IOR benchmark running on the 100 GBit test equipment using Vampir

Tier:16 004f8b00 DL00 r0 w0 l1 fl0 fr2 R:100 M:100 I:100
NOTE INT_PS 11-4 06:54:53 Medium Error Disk 16P JK11A4YAKEHT5W Key: 3 ASC 11 ASCQ 0 FRU 0 Info 004FE864
c1p3 WWN:500143800425A27A port:3 OX_ID:0033 T:143
INFO INT_PS 11-4 06:54:53 Data recovered 2 (cmd_id 772) disk:16P address: 4fe800 LUN 16, 00004fe8
Tier:16 004fe800 DL00 r1 w0 l0 fl0 fr2 R:100 M:100 I:100
c1p3 WWN:500143800425A27A port:3 OX_ID:0033 T:143

Figure 5.28: Excerpt of the log of a DDN controller that shows a media error on a disk

As this fixed the write performance problem, it became obvious that either the protocol used in the case of a back-to-back connection is the problem or that a FC switch is needed in any case to reach the full write performance. By forcing the QLogic FC cards on the HP nodes to use the FC point-to-point protocol, the write speed went up to 750 MB/s. By default the cards use the FC arbitrated loop protocol to connect to the DDN controllers and use point-to-point only as a fallback. For a reason that still has to be determined, using the FC arbitrated loop protocol results in bad performance for write accesses only. However, the QLogic card does not use the point-to-point protocol if connected back-to-back with a storage device.

Another problem that showed up in the trace visualization is a sudden drop in the read performance observed on one port of the RAID controller. An investigation of the log files on this controller revealed a media error (see Figure 5.28) on one of the SATA disks used. Comparing the associated timestamps revealed a connection between both events. A further analysis of the trace file showed that it takes about 45 seconds until this port is at full speed again. At the end of the run of the IOR benchmark, this port is used for an additional 23 seconds compared to the other FC ports. In this case the read portion of the IOR benchmark was running for about 450 seconds. Thus, under ideal conditions it could have finished in about 427 seconds, which means a reduction of the runtime to 94.8% and a potential performance improvement of 5.3%. Removing this disk and continuing with 8 + 1 disks eliminated the problem.

With both issues fixed, the write performance reached 11.8 GB/s and the read performance 12.0 GB/s for local tests with the two S2A 9900. The SFA 10000 in Freiberg delivered about 10.4 GB/s write performance and 11.4 GB/s read performance. On the SFA, the options for tuning the QLogic HBA driver have a large impact on the system performance. One relevant option is the request queue depth on the QLogic HBA, another is whether the driver is allowed to tune this queue depth itself. Disabling the latter option results in better write performance, enabling it in better read performance. The optimal queue depth for the maximum write performance was 16. On each compute cluster only 24 clients have been used for the tests. To actually achieve these numbers, Direct I/O had to be used. Without Direct I/O the processes on the compute nodes are busy copying pages between user space and kernel space. Direct I/O eliminated this bottleneck and pushed the write performance for a single node to a maximum of 1.8 GB/s out of the theoretical limit of 2 GB/s.

The final bandwidth tests executed early in November 2010 delivered an average bandwidth of 11.8 GByte/s for unidirectional reads over a time period of about 10 minutes. This matches exactly the available bandwidth on the Lustre networking layer of 94.4 GBit/s. Due to an open issue regarding the Lustre routing for read accesses on the Freiberg site, bidirectional tests had to be done with write accesses. Writing in both directions resulted in a maximum of 21.9 GByte/s, about 87% of the peak network bandwidth. This is considered an excellent result.

5.3.4 Summary

This chapter provided example scenarios for the application of the theoretical approaches developed in the previous chapter. It has been shown how the analysis of I/O related performance issues can be greatly improved by the integration of data from the file system infrastructure. Both the graphical interactive analysis and a comparative analysis of multiple trace files revealed more insight into the interaction of applications, the operating system and components that are external to the compute nodes.

6 Conclusion and Outlook

This chapter provides a conclusion of the thesis and offers an outlook on future work and research. At first, the main contributions are recapitulated. The remainder of this chapter gives an overview of possibilities to enhance the approaches presented within this thesis. These proposals are split into two sections: one for future work, which points towards the application of methods developed in other research areas to problems related to I/O performance analysis, and a second that describes research possibilities that are either specific to this research area or specific to the context of performance analysis in HPC environments.

6.1 Contributions

This thesis has emphasized the need to analyze I/O related performance shortcomings in HPC environments from a global perspective. After a general introduction to parallel file systems, typical usage scenarios of I/O subsystems were described. This was followed by an overview of the most commonly used parallel file system implementations and a description of some challenges that these file systems have to face in the near future. The growing number of components that will make up future parallel file systems, as well as ideas like additional caching layers between the main memory of the HPC system and the disks, amplify the need for a holistic analysis approach.

I/O requests are typically triggered from user space. These requests traverse the file system infrastructure, starting with the I/O libraries and the operating system running on the compute nodes and ending at the physical disks, with each layer transforming and adapting the requests. Before this thesis, the state of the art for I/O performance analysis only allowed an analysis of the user space I/O requests by program tracing and visualization. Other I/O layers could only be inspected with additional tools that are often specific to a single piece of hardware and do not provide the interfaces required for an integration into a holistic analysis approach.

Traditionally, application tracing approaches only include data collected on the compute nodes in the program trace. It has been shown that external data, especially from the infrastructure of a parallel file system, are essential for the explanation of I/O effects observed on compute nodes. In a more general sense, it is beneficial to include any performance data that is external to the hosts executing a parallel program in a program trace if it contributes to the performance analysis. Beyond its relevance for an I/O analysis, such an integrated approach has also proven useful elsewhere for evaluating the energy efficiency of HPC systems by combining power consumption data with program traces [HSM+10].

The analysis of related approaches, especially from the field of database performance analysis, and the need for an integrated program and infrastructure tracing have driven the development of methods that support a holistic analysis. An evaluation of different methods for integrating access to external data into a tracing facility led to the conclusion that caching the data on an additional system, combined with a post mortem merge step, is the preferred way. A revision of the term performance counter allows a broader range of data sources to be used. This extension was necessary to overcome the restrictions that originated from the use of traditional CPU performance counters.

An enhanced approach to the further analysis of I/O events is one of the main contributions of this thesis. A Directed Acyclic Graph is used to describe all dependencies between the I/O sequences of all processes. The graph consists of synchronization and I/O vertices. Not all synchronization vertices constructed during the import of the trace events are needed to describe I/O dependencies; these can be removed, so that a subsequent analysis has fewer events to consider. In fact, for real applications it is possible to reduce the number of synchronization vertices to the number of I/O events. Two algorithms have been presented that merge adjacent synchronization vertices: the first works on the final graph, while the second extends the first by merging vertices already during the graph construction.

A substantial amount of work went into the subsequent processing and presentation of the external data. The number of raw counters that some data sources provide and the need for an almost real-time analysis motivated the development of an improved update scheme that allows an arbitrary number of counters to be combined into a secondary counter (a minimal sketch of this idea is given below). The examples within the previous chapter used multiple secondary counters in parallel to support the performance analysis of different I/O workloads.

The overall aim of the work was to establish methods that allow an end-to-end comparison of parallel file systems. The implemented prototype enables the performance analyst to gather the available data quickly and efficiently. Standard analysis methods like trace visualization and statistical analysis can then be used to identify performance inefficiencies. Another frequent requirement for an I/O analysis is to replay I/O accesses on different platforms and file systems. Thus, the development of a benchmark that is able to replay arbitrary I/O patterns was a valuable addition to the holistic analysis approach.
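To make the idea of secondary counters more concrete, the following C sketch combines two raw counters into one derived bandwidth value. It is an illustration only, not the Dataheap implementation; the structure names, the microsecond timestamps and the simple interval handling are assumptions.

```c
/* Minimal sketch of a secondary counter: two raw counters (e.g. bytes
 * written on two controller ports) are combined by one node of an
 * update tree into a single derived bandwidth value. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t value;      /* raw sample                          */
    uint64_t timestamp;  /* sample time in microseconds         */
} raw_counter;

/* Combine node: aggregate bandwidth over the interval covered by the
 * most recent samples of both children. */
static double combined_bandwidth(const raw_counter *prev_a, const raw_counter *cur_a,
                                 const raw_counter *prev_b, const raw_counter *cur_b)
{
    uint64_t bytes = (cur_a->value - prev_a->value) + (cur_b->value - prev_b->value);
    /* Use the larger of the two intervals; a real implementation would
     * interpolate samples that arrive at different rates. */
    uint64_t dt_a = cur_a->timestamp - prev_a->timestamp;
    uint64_t dt_b = cur_b->timestamp - prev_b->timestamp;
    uint64_t dt   = dt_a > dt_b ? dt_a : dt_b;
    return dt ? (double)bytes / ((double)dt / 1e6) : 0.0;   /* bytes/s */
}

int main(void)
{
    raw_counter a0 = { 1000, 0 }, a1 = { 6000, 1000000 };
    raw_counter b0 = { 2000, 0 }, b1 = { 9000, 1000000 };
    printf("combined: %.0f bytes/s\n", combined_bandwidth(&a0, &a1, &b0, &b1));
    return 0;
}
```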

6.2 Future Work

As a future enhancement of the pattern matching approach, it would be useful to evaluate other compression algorithms. The current implementation has worked well for the problems presented in the previous chapter, but a compression scheme that works across multiple processes and detects patterns of arbitrary length would be valuable. Multiple algorithms have been published on this topic in performance analysis research and even more in Bioinformatics. The compression of trace events has been done with specialized data structures, for example in [Knü09], and with compressed suffix trees in [PKS+08]; the latter is a well-known data structure that is heavily used in Bioinformatics. A successful application to the field of I/O analysis could significantly reduce the size of the I/O patterns extracted from massively parallel applications. The next step for the improvement of the I/O replay engine would be to enable the replay of an abstract I/O pattern that is automatically transformed into an HDF5, POSIX I/O or MPI-I/O pattern. This would enable users to test the performance of possible implementations of their I/O patterns without writing a single line of code.
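The following C sketch illustrates what such an abstract pattern could look like and how a POSIX back end might replay it. The record layout and the file name are hypothetical and do not correspond to the DIOS input format; an HDF5 or MPI-I/O back end would translate the same records into the respective API calls instead.

```c
/* Hypothetical abstract I/O record and a POSIX back end that replays it. */
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

typedef enum { OP_READ, OP_WRITE } io_op;

typedef struct {
    io_op    op;       /* kind of access        */
    uint64_t offset;   /* file offset in bytes  */
    uint64_t size;     /* request size in bytes */
} io_record;

/* Translate one abstract record into a POSIX call; pread/pwrite keep
 * the replay independent of the current file position. */
static ssize_t replay_posix(int fd, const io_record *rec, char *buf)
{
    if (rec->op == OP_READ)
        return pread(fd, buf, rec->size, (off_t)rec->offset);
    return pwrite(fd, buf, rec->size, (off_t)rec->offset);
}

int main(void)
{
    io_record pattern[] = {            /* tiny hand-written example pattern */
        { OP_WRITE, 0,       1 << 20 },
        { OP_WRITE, 1 << 20, 1 << 20 },
        { OP_READ,  0,       1 << 20 },
    };
    char *buf = malloc(1 << 20);
    int fd = open("replay.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || buf == NULL)
        return 1;
    for (size_t i = 0; i < sizeof(pattern) / sizeof(pattern[0]); i++)
        if (replay_posix(fd, &pattern[i], buf) < 0)
            return 1;
    close(fd);
    free(buf);
    return 0;
}
```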

6.3 Future Research

For a deeper understanding of parallel file systems it would be useful to analyze file system characteristics statically. This can be done by modeling the impact that each file operation has on the file system infrastructure. Such a model allows a quantitative determination of fundamental file system limits (or the limits of a given infrastructure), e.g. the maximum number of nodes that one file server supports or a limit on how many processes or cores can be used in parallel from a single node (a back-of-the-envelope sketch of such a limit calculation is given below). If deployment activities like those for the 100GBit testbed are executed on a larger scale, manually searching for anomalies in trace files is no longer appropriate. An automated approach that is able to compare different performance counters among all processes and to correlate them with the data gathered on the I/O infrastructure would help.
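As a simple illustration of such a static model, the following C sketch derives the maximum number of clients a single file server could sustain from a bandwidth limit and a metadata limit. All numbers are assumptions chosen for the example, not measured values.

```c
/* Back-of-the-envelope sketch of a static file server limit model.
 * All input values are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double server_bw_mb   = 1200.0;  /* assumed server backend bandwidth, MB/s   */
    double per_client_mb  = 75.0;    /* assumed sustained demand per client, MB/s */
    double metadata_limit = 15000.0; /* assumed metadata ops/s the server handles */
    double per_client_ops = 40.0;    /* assumed metadata ops/s per client         */

    int by_bandwidth = (int)(server_bw_mb / per_client_mb);
    int by_metadata  = (int)(metadata_limit / per_client_ops);
    int max_clients  = by_bandwidth < by_metadata ? by_bandwidth : by_metadata;

    printf("clients limited by bandwidth: %d, by metadata: %d -> max %d\n",
           by_bandwidth, by_metadata, max_clients);
    return 0;
}
```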

In addition, an integrated file system analysis would help to optimize a file system on a per-user basis. For example, since different optimization options could be set on the users' directories for different access patterns, it would be valuable to determine these options automatically (an illustrative sketch of such a tuning rule is given at the end of this section). Options of interest, e.g. for a Lustre environment, could be the stripe size, the stripe count or the read-ahead behavior. This requires a live analysis of the file system, which would need to collect all user space I/O requests, compress and analyze them, and apply automated reasoning. Once the collection of all user requests is in place, it should be extended to cover all I/O layers. As each layer in an I/O infrastructure needs to transform the I/O requests for the next lower layer, these transformations should be modeled and analyzed as well. Especially the aggregation of smaller requests into larger ones and strategies to place or add caches in the file system infrastructure are of interest. This could lead to more tailored I/O subsystems. In addition, the collection of all these requests could help to understand the I/O subsystem behavior.

Scientific data can have a very long lifetime. Guidelines of the German Research Foundation require that results from experiments and simulations be archived for at least ten years. To keep data for such a long time, they will most probably be copied to an offline storage medium like tape and will need to be migrated between different technology epochs. It is very likely that at least parts of these data are accessed from time to time. As all transfers between the different media cost energy, an Information Lifecycle Management approach for scientific data could help to minimize the number of transfers and to determine an optimal place for a data set within a hierarchical storage infrastructure. This would require a user-based behavior analysis, which could be done with an enhanced version of the pattern matching approach together with an access pattern database. A realization could also include prefetching between the different I/O layers.
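The following C sketch illustrates what an automated per-directory tuning rule could look like. The policy thresholds and the helper functions choose_stripe_count() and apply_striping() are hypothetical; a real implementation would set the options through the Lustre user space tools or API.

```c
/* Illustrative sketch of an automated per-directory tuning rule.
 * All thresholds and helper functions are hypothetical placeholders. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t avg_request_bytes;  /* observed average request size      */
    uint64_t avg_file_bytes;     /* observed average file size         */
    int      shared_file_io;     /* 1 if many processes share one file */
} access_profile;

/* Hypothetical policy: large shared files benefit from striping over
 * many OSTs, small per-process files are kept on a single OST. */
static int choose_stripe_count(const access_profile *p, int available_osts)
{
    if (p->shared_file_io && p->avg_file_bytes > (1ULL << 30))
        return available_osts;               /* stripe wide             */
    if (p->avg_file_bytes < (64ULL << 20))
        return 1;                            /* keep small files local  */
    return available_osts / 2;
}

static void apply_striping(const char *dir, int stripe_count)
{
    /* Placeholder: a real implementation would call the Lustre user
     * space API or invoke lfs setstripe for 'dir'. */
    printf("would set stripe count %d on %s\n", stripe_count, dir);
}

int main(void)
{
    access_profile p = { 4 << 20, 8ULL << 30, 1 };
    apply_striping("/lustre/users/alice", choose_stripe_count(&p, 16));
    return 0;
}
```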

List of Figures

2.1 Simplified comparison of distributed (left) and parallel (right) file system accesses to multiple file servers ...... 16
2.2 Typical user view of an HPC system and the attached file systems ...... 17
2.3 File view of four processes accessing different parts of a file ...... 19
2.4 Example view of hierarchical data structures in a HDF5 file ...... 21
2.5 Example for an I/O stack for an application that is using pNetCDF ...... 22

3.1 Side by side example of a GPFS installation with and without VSD server ...... 30
3.2 Side by side comparison of a traditional RAID system and an approach with declustered RAID; disks with the same color belong to the same RAID group ...... 33
3.3 Example installation of CXFS providing access to an HPC resource and client workstations ...... 34
3.4 High level overview of a typical Lustre installation ...... 36
3.5 Structure of the basic kernel modules used for the Lustre file system on both client and server sides ...... 37
3.6 Side by side comparison of a block and an object oriented access to storage ...... 40
3.7 Side by side comparison of a current installation with file servers and an integrated storage appliance ...... 47
3.8 Tracefs architecture [WAZ04] ...... 50
3.9 PIOViz tool architecture [LKK+07] ...... 52
3.10 Jumpshot screenshot visualizing a PVFS2 trace where four clients write data to PVFS2 servers with contiguous data regions and non-collective MPI-I/O calls [LKK+07] ...... 53

4.1 Example for an instrumented NAS/SAN environment with instrumentation points ...... 58
4.2 Simplified performance counter tree for a file system ...... 65
4.3 Example of a counter update with two counter sources that are updated at different and regular intervals ...... 66
4.4 Update scheme for two counters, the use of an external update timer is optional ...... 67
4.5 Update tree for the calculation of a GFLOP/s per Watt value from two nodes with individual power meters ...... 68
4.6 UML sequence diagram for adding two values from two power meters ...... 68
4.7 UML sequence diagram for “push” integration ...... 71
4.8 UML sequence diagram for “pull” integration ...... 71
4.9 UML sequence diagram for “post mortem” integration ...... 71
4.10 Basic components of the Dataheap framework ...... 73
4.11 Workflow diagram for the analysis of parallel file systems. Activities are colored brown, the original source code and all other input as well as all intermediate and the final results are marked yellow ...... 74
4.12 Example of a pseudo MPI code (a) and example event stream (b) for two processes ...... 77
4.13 Outline of the algorithm constructing the DAG ...... 78
4.14 Demonstration of the DAG construction with the event set from Figure 4.12 ...... 79
4.15 Example of merging synchronization vertices with one incoming edge ...... 81
4.16 Example of merging synchronization vertices with different source and target vertices ...... 82
4.17 Outline of the graph optimization during the DAG construction ...... 83

4.18 Outline of compression algorithm ...... 84
4.19 Descriptive example for the calculation of the compression complexity ...... 85
4.20 Visualization approach for file access patterns of parallel applications ...... 87
4.21 Simplified EBNF rule set for the input file for the I/O replay engine ...... 89
4.22 Example I/O test for the I/O pattern from Figure 4.12 ...... 89
4.23 Components of the DIOS replay engine ...... 90

5.1 Overview of the HRSK system at Technische Universität Dresden ...... 92
5.2 CXFS setup within the HRSK complex ...... 92
5.3 Lustre setup within the HRSK complex ...... 93
5.4 I/O related instrumentation points and their properties for the HRSK CXFS file system ...... 95
5.5 I/O related instrumentation points and their properties for the HRSK Lustre file system ...... 95
5.6 Counter timelines without clock synchronization for the calibration of the InfiniBand and DDN RAID controllers ...... 97
5.7 Histograms of the transferred data volumes as calculated from values read from the DDN controllers ...... 98
5.8 Timer offsets for the 40 measurements (unstable and stable timer) ...... 98
5.9 Scatter plots of the calculated timer offsets and the transferred data volume (unstable and stable timer) ...... 98
5.10 Scatter plots of the calculated timer offsets and the reading offset (unstable and stable timer) ...... 98
5.11 Input file for the DIOS benchmark to emulate the I/O sequence of THREADER ...... 101
5.12 Scatter plots for the fopen64() function from multiple runs of scaling experiments executed on different days ...... 102
5.13 Scatter plots for the fread() function from multiple runs of scaling experiments executed on different days ...... 102
5.14 Scatter plots for the fclose() function from multiple runs of scaling experiments executed on different days ...... 102
5.15 Correlation of load and RPC call rate on the metadata server, different parts of the data are highlighted ...... 103
5.16 Scatter plots for fread() from multiple runs of scaling experiments executed on different days and different compute nodes ...... 103
5.17 Per-process plot of the average time for one fopen64() call and the traffic on the IB card for a single run ...... 103
5.18 Per-process plot of the average time for one fread() call and the traffic on the IB card for a single run ...... 103
5.19 Screenshot of the timeline of the original Semtex version showing a checkpoint phase ...... 105
5.20 Excerpt of the I/O pattern of process 0 ...... 106
5.21 Excerpt of the I/O pattern of process 3 ...... 106
5.22 Time spent in different function groups of a strong scaling experiment with the Semtex application ...... 106
5.23 Access pattern for a checkpoint file written by Semtex with MPI-I/O ...... 107
5.24 Counter analysis of a process with slow checkpointing ...... 109
5.25 Counter analysis of a process with fast checkpointing ...... 109
5.26 Outline of the hardware setup for the 100Gbit project ...... 111
5.27 Screenshot of an analysis of the IOR benchmark running on the 100 GBit test equipment using Vampir ...... 112
5.28 Excerpt of the log of a DDN controller that shows a media error on a disk ...... 112

Bibliography

[AEH+08] R. Ananthanarayanan, M. Eshel, R. Haskin, M. Naik, F. Schmuck, and R. Tewari. Panache: A Parallel WAN Cache for Clustered Filesystems. SIGOPS Oper. Syst. Rev., 42(1):48–53, 2008. [AKRBHB+06] Andreas Knüpfer, Ronny Brendel, Holger Brunst, Hartmut Mix, and Wolfgang E. Nagel. Introducing the Open Trace Format (OTF). In Vassil N. Alexandrov, Geert Dick van Albada, Peter M.A. Sloot, Jack Dongarra, Eds., Computational Science – ICCS 2006: 6th International Conference, Reading, UK, May 28-31, 2006. Proceed- ings, volume II of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2006. [App08] Apple, Inc. Xsan 2 administrator’s guide. http://images.apple.com/xsan/ docs/Xsan_2_Admin_Guide.pdf, 2008. [aU09] ICL at UTK. PAPI Documentation. http://icl.cs.utk.edu/papi, 2009. [AUB+96] Anurag Acharya, Mustafa Uysal, Robert Bennett, Assaf Mendelson, Michael Beynon, Jeff Hollingsworth, Joel Saltz, and Alan Sussman. Tuning the Performance of I/O- Intensive Parallel Applications. In IOPADS ’96: Proceedings of the Fourth Workshop on I/O in Parallel and Distributed Systems, pages 15–27. ACM, 1996. [BBB+94] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrish- nan, and S. Weeratunga. NAS Parallel Benchmarks. http://www.nas.nasa. gov/News/Techreports/1994/PDF/RNR-94-007.pdf, 1994. [BBU+09] Shivnath Babu, Nedyalko Borisov, Sandeep Uttamchandani, Ramani Routray, and Aameek Singh. DIADS: Addressing the "My-Problem-or-Yours" Syndrome with Inte- grated SAN and Database Diagnosis. In FAST’09: Proccedings of the 7th conference on File and stroage technologies, pages 57–70, Berkeley, CA, USA, 2009. USENIX Association. [BCOS05] J. Borrill, J. Carter, L. Oliker, and D. Skinner. Integrated Performance Monitoring of a Cosmology Application on Leading HEC Platforms. In ICPP ’05: Proceedings of the 2005 International Conference on Parallel Processing, pages 119–128. IEEE Computer Society, 2005. [BCS99] Peter J. Braam, M. Callahan, and P. Schwan. The Intermezzo File System. The Perl Conference 3, O’Reilly Open Source Convention, 1999. [BDG+00] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A Scalable Cross- Platform Infrastructure for Application Performance Tuning using Hardware Counters. In Supercomputing ’00: Proceedings of the 2000 ACM/IEEE conference on Supercom- puting (CDROM), page 42, Washington, DC, USA, 2000. IEEE Computer Society. [BDHM99] S. Browne, C. Deane, G. Ho, and P. Mucci. PAPI: A Portable Interface to Hardware Performance Counters. In Proceedings of Department of Defense HPCMP Users Group Conference, June 1999. 122 Bibliography

[BGG+09] John Bent, Garth Gibson, Gary Grider, Ben McClelland, Paul Nowoczynski, James Nunez, Miloe Polte, and Meghan Wingate. PLFS: A Checkpoint Filesystem for Paral- lel Applications. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1–12, Portland, Oregon, 2009. ACM. [Bia08] Christoph Biardzki. Analyzing Metadata Performance in Distributed File Systems. PhD thesis, TU München, 2008. [BJK+05] Marcelo Barrios, Terry Jones, Scott Kinnane, Mathis Landzettel, Safran Al-Safran, Jerry Stevens, Christopher Stone, Chris Thomas, and Ulf Troppens. Sizing and Tuning GPFS. www.redbooks.ibm.com/redbooks/pdfs/sg245610.pdf, 2005. [Bra98] Peter J. Braam. The Coda Distributed File System. Linux Journal, June 1998. [Cen03] Maui High Performance Computing Center. SP Parallel Programming II Work- shop. http://www.mhpcc.edu/training/workshop2/mpi_io/MAIN. html, 2003. [CF96] Peter F. Corbett and Dror G. Feitelson. The Vesta Parallel File System. ACM Transac- tions on Computer Systems, pages 225–264, 1996. [CH06] Dave Chinner and Jeremy Higdon. Exploring High Bandwidth Filesystems on Large Systems. Technical report, Ottawa Linux Symposium, SGI, 2006. [CHKM93] R. Cypher, A. Ho, S. Konstantinidou, and P. Messina. Architectural Requirements of Parallel Scientific Applications with Explicit Communication. SIGARCH Computer Architecture News, 21(2):2–13, 1993. [CLG+94] Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, and David A. Pat- terson. RAID: High-Performance, Reliable Secondary Storage. ACM Comput. Surv., 26(2):145–185, 1994. [COTW05] J. Cope, M. Oberg, H. Tufo, and M. . In (2005). Woitaszek. Shared Parallel File Sys- tems in Heterogeneous Linux Multi-Cluster Environments. In Proceedings of the 6th LCI International Conference on Linux Clusters: The HPC Revolution, Chapel Hill, North Carolina, 2005. [Cov05] Harriet Coverston. OpenSolaris T10 OSD Reference Implementation. http://www. dtc.umn.edu/disc/resources/CoverstonISW5.pdf, 2005. [CPD+96] Peter Corbett, Jean-Pierre Prost, Chris Demetriou, Garth Gibson, Erik Riedel, Jim Ze- lenka, Yuqun Chen, Ed Felten, Kai Li, John Hartman, Larry Peterson, Brian Bershad, Alec Wolman, and Ruth Aydt. Proposal for a Common Parallel File System Program- ming Interface 1.0. Technical report, School of Computer Science, Carnegie Mellon University, 1996. [CWN97] Thomas H. Cormen, Jake Wegmann, and David M. Nicol. Multidimensional, Multi- processor, Out-of-Core FFTs with Distributed Memory and Parallel Disks (extended abstract). In IOPADS ’97: Proceedings of the fifth workshop on I/O in parallel and distributed systems, pages 68–78. ACM, 1997. [Day08] Shobhit Dayal. Characterizing HEC Storage Systems at Rest. Technical report, Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-109, July 2008. [DB99] John R. Douceur and William J. Bolosky. A Large-Scale Study of File-System Con- tents. In SIGMETRICS ’99: Proceedings of the 1999 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 59–70. ACM, 1999. Bibliography 123

[Des09] Matthias Desnoyers. Low-Impact Operating System Tracing. PhD thesis, Université de Montréal, 2009. [DHH+06] David Du, Dingshan He, Changjin Hong, Jaehoon Jeong, Vishal Kher, Yongdae Kim, Yingping Lu, Aravindan Raghuveer, and Sarah Sharafkandi. Experiences Building an Object-Based Storage System based on the OSD T-10 Standard. In Proceedings of the the 23rd IEEE / 14th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, Maryland, 2006. [DKMN08] Jens Doleschal, Andreas Knüpfer, Matthias S. Müller, and Wolfgang E. Nagel. Internal Timer Synchronization for Parallel Event Tracing. In Proceedings of the 15th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 202–209, Berlin, Heidelberg, 2008. Springer-Verlag. [EMC06] EMC2. Deploying Celerra MPFSi in High-Performance Computing Environ- ments. http://www.emc.com/collateral/hardware/white-papers/ h2237-deploying-celerrampfsi.pdf, 2006. [FCBH95] Dror G. Feitelson, Peter F. Corbett, Sandra Johnson Baylor, and Yarsun Hsu. Parallel I/O Subsystems in Massively Parallel Supercomputers. IEEE Concurrency, 3(3):33– 47, 1995. [Fel08] Dave Fellinger. A Storage Architecture to Support Petaflop Clusters and Beyond. Talk at the HLRS Workshop on Scalable Global Parallel File Systems, 2008. [FMN+05] M. Factor, K. Meth, D. Naor, O. Rodeh, and J. Satran. Object Storage: The Future Building Block for Storage Systems. In International Symposium on Mass Storage Systems and Technology, pages 119–123, Los Alamitos, CA, USA, 2005. IEEE Com- puter Society. [FPD93] James C. French, Terrence W. Pratt, and Mriganka Das. Performance Measurement of the Concurrent File System of the Intel iPSC/2 Hypercube. Journal of Parallel Distributed Computing, 17(1-2):115–121, 1993. [FW08] R. F. Freitas and W. W. Wilcke. Storage-class memory: The next storage system tech- nology. IBM Journal of Research and Development, 52(4):439–447, 2008. [FWNK96] S. A. Fineberg, P. Wong, B. Nitzberg, and C. Kuszmaul. PMPIO - A Portable Imple- mentation of MPI-IO. In FRONTIERS ’96: Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation, page 188, Washington, DC, USA, 1996. IEEE Computer Society. [GGL03] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. SIGOPS Operating Systems Review, 37(5):29–43, 2003. [Gol03] Jeffrey S. Goldner. The Emergence of iSCSI. Queue, 1(4), 2003. [Gro09] The HDF Group. HDF5. http://www.hdfgroup.org/HDF5/, 2009. [GVW96] Garth A. Gibson, Jeffrey Scott Vitter, and John Wilkes. Strategic Directions in Storage I/O Issues in Large-Scale Computing. ACM Comput. Surv., 28(4):779–793, 1996. [GWW+08] Markus Geimer, Felix Wolf, Brian J.N. Wylie, Erika Ábrahám, Daniel Becker, and Bernd Mohr. The SCALASCA Performance Toolset Architecture. In Proc. of the International Workshop on Scalable Tools for High-End Computing (STHEC), pages 51–65, Kos, Greece, June 2008. [Har03] Richard C. Harlan. Network management with Nagios. Linux J., 2003(111):3, 2003. 124 Bibliography

[Has98] R. L. Haskin. Tiger Shark: A Scalable File System for Multimedia. IBM Journal of Research and Development, 42(2):185–197, 1998. [HCK+08] Felix Hupfeld, Toni Cortes, Björn Kolbeck, Jan Stender, Erich Focht, Matthias Hess, Jesus Malo, Jonathan Marti, and Eugenio Cesario. The XtreemFS architecture—a case for object-based file systems in Grids. Concurrency and Computation: Practice & Experience, 20(17):2049–2060, 2008. [HG92] Mark Holland and Garth A. Gibson. Parity Declustering for Continuous Operation in Redundant Disk Arrays. In ASPLOS-V: Proceedings of the fifth international con- ference on Architectural support for programming languages and operating systems, pages 23–35, New York, NY, USA, 1992. ACM. [HH07] Dean Hildebrand and Peter Honeyman. Direct-pNFS: Scalable, Transparent, and Ver- satile Access to Parallel File Systems. In HPDC ’07: Proceedings of the 16th inter- national symposium on High performance distributed computing, pages 199–208, New York, NY, USA, 2007. ACM. [Hil07] Dean Hildebrand. Distributed Access to Parallel File Systems. PhD thesis, University of Michigan, 2007. [HM07] R. Henschel and M.S. Muller. I/O Induced Scalability Limits of Bioinformatics Appli- cations. In Bioinformatics and Bioengineering, 2007. BIBE 2007. Proceedings of the 7th IEEE International Conference on, pages 609 –613, October 2007. [HSM+10] Daniel Hackenberg, Robert Schöne, Daniel Molka, Matthias Müller, and Andreas Knüpfer. Quantifying power consumption variations of HPC systems using SPEC MPI benchmarks. Computer Science - Research and Development, 25:155–163, 2010. [IEE04] IEEE. IEEE Std 1003.1-2001 Standard for Information Technology — Portable Oper- ating System Interface (POSIX) Base Definitions, Issue 6. http://ieeexplore. ieee.org/stamp/stamp.jsp?arnumber=01309815, 2004. Revision of IEEE Std 1003.1-1996 and IEEE Std 1003.2-1992) Open Group Technical Standard Base Specifications, Issue 6. [Ils09] Thomas Ilsche. Analyse und Anwendung globaler und lokaler Datenquellen für Pro- grammspuren. Master’s thesis, Technische Universität Dresden, 2009. [IRYB08] Kamil Iskra, John W. Romein, Kazutomo Yoshii, and Pete Beckman. ZOID: I/O- Forwarding Infrastructure for Petascale Architectures. In PPoPP ’08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel program- ming, pages 153–162. ACM, 2008. [Jac09] Jack Dongarra. An Overwiew of HPC and Challenges for the Future, Keynote Speech at HPC Asia 2009. http://www.nchc.org.tw/en/research/download. php?FILE_ID=3, 2009. [Jai91] Raj Jain. The Art of Computer Systems Performance Analysis: Techniques for Experi- mental Design, Measurement, Simulation and Modeling. Number ISBN 0-471-50336- 3. John Wiley and Sons, Inc., 1991. [JD07] John L. Hennessey and David A. Patterson. Computer Architecture, Fourth Edition. Morgan Kaufmann, 2007. [Jon98] David T. Jones. THREADER : Protein Sequence Threading by Double Dynamic Pro- gramming. In Steven Salzberg, David Searls, and Simon Kasif, editors, Computational Methods in Molecular Biology. Elsevier Science, 1998. Bibliography 125

[KBNQ07] Andy Konwinski, John Bent, James Nunez, and Meghan Quist. Towards an I/O Tracing Framework Taxonomy. In PDSW ’07: Proceedings of the 2nd international Workshop on Petascale Data Storage, pages 56–62, New York, NY, USA, 2007. ACM. [KKMN10] Michael Kluge, Andreas Knüpfer, Müller, and Wolfgang E. Nagel. Efficient Pattern based I/O Analysis of Parallel Programs. In The 39th International Conference on Parallel Processing, First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010), pages 144–153. CPS, September 2010. [KL07] Julian M. Kunkel and Thomas Ludwig. Performance Evaluation of the PVFS2 Ar- chitecture. In PDP ’07: Proceedings of the 15th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, pages 509–516, Washington, DC, USA, 2007. IEEE Computer Society. [KL08] Julian M. Kunkel and Thomas Ludwig. Bottleneck Detection in Parallel File Systems with Trace-Based Performance Monitoring. In Euro-Par ’08: Proceedings of the 14th International Euro-Par Conference on Parallel Processing, pages 212–221. Springer- Verlag, 2008. [Klu06] Michael Kluge. Performance Evaluation of the CXFS File system on the HPC/Storage Complex for data-intensive computing at the TU Dresden. Computational Methods in Science and Technology, pages 29–32, 2006. [KN07] Michael Kluge and Wolfgang E. Nagel. Analysis of Linux Scheduling with VAMPIR. In Proceedings of the 7th International Conference on Computational Science, Part II, ICCS ’07, pages 823–830, Berlin, Heidelberg, 2007. Springer-Verlag. [Knü09] Andreas Knüpfer. Advanced Memory Data Structures for Scalable Event Trace Analy- sis, Dissertation. Suedwestdeutscher Verlag fuer Hochschulschriften, September 2009. [Kot95] David Kotz. Disk-directed I/O for an out-of-core Computation. In Proceedings of the Fourth IEEE International Symposium on High Performance Distributed Computing, pages 159–166, 1995. [KR94] Thomas T. Kwan and Daniel A. Reed. Performance of the CM-5 Scalable File System. In ICS ’94: Proceedings of the 8th International Conference on Supercomputing, pages 156–165. ACM, 1994. [Kra00] Dieter Kranzlmüller. Event Graph Analysis For Debugging Massively Parallel Pro- grams. PhD thesis, Joh. Kepler University Linz, Austria, 2000. [Lam78] Leslie Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Commun. ACM, 21(7):558–565, 1978. [LKK+07] Thomas Ludwig, Stephan Krempel, Michael Kuhn, Julian M. Kunkel, and Chris- tian Lohse. Analysis of the MPI-IO Optimization Levels with the PIOViz Jump- shot Enhancement. In Franck Cappello, Thomas Herault, and Jack Dongarra, edi- tors, PVM/MPI, volume 4757 of Lecture Notes in Computer Science, pages 213–222. Springer, 2007. [LLC+03] Jianwei Li, Wei-keng Liao, Alok Choudhary, Robert Ross, Rajeev Thakur, William Gropp, Rob Latham, Andrew Siegel, Brad Gallagher, and Michael Zingale. Paral- lel netCDF: A High-Performance Scientific I/O Interface. In SC ’03: Proceedings of the 2003 ACM/IEEE conference on Supercomputing, page 39, Washington, DC, USA, 2003. IEEE Computer Society. [LM07] William Loewe and Christopher Morrone. IOR benchmark. http://www.llnl. gov/icc/lc/siop/downloads/download.html, 2007. 126 Bibliography

[LPG+07] Hermann Lederer, Gavin J. Pringle, Denis Girou, Marc-André Hermanns, and Gio- vanni Erbacci. DEISA: Extreme Computing in an Advanced Supercomputing Environ- ment. In Christian H. Bischof, H. Martin Bücker, Paul Gibbon, Gerhard R. Joubert, Thomas Lippert, Bernd Mohr, and Frans J. Peters, editors, Parallel Computing: Ar- chitectures, Algorithms and Applications, ParCo 2007, Forschungszentrum Jülich and RWTH Aachen University, Germany, 4-7 September 2007, volume 15 of Advances in Parallel Computing, pages 687–688. IOS Press, 2007. [LS90] Eliezer Levy and Abraham Silberschatz. Distributed File Systems: Concepts and Ex- amples. ACM ACM Computing Surveys, 22(4):321–374, 1990. [LS07] Pin Lu and Kai Shen. Multi-Layer Event Trace Analysis for Parallel I/O Performance Tuning. In ICPP ’07: Proceedings of the 2007 International Conference on Parallel Processing, page 12, Washington, DC, USA, 2007. IEEE Computer Society. [Mar04] Martin W. Margo. An Analysis of State-of-the-Art Parallel File System for Linux. In Proceedings of 5th LCI (Linux Clusters Institute) Conference, Austin, Texas, 2004. [MHJ91] Allen D. Malony, David H. Hammerslag, and David J. Jablonowski. Traceview: A Trace Visualization Tool. IEEE Softw., 8(5):19–28, 1991. [Mic07] Holger Mickler. Kombinierte Messung und Analyse von Programmspuren und sys- temweiten I/O-Ereignissen. Master’s thesis, Technische Universität Dresden, 2007. [MK04] Michael Kluge. Statistische Analyse von Programmspuren für MPI-Programme. Diplo- marbeit, Technische Universität Dresden, November 2004. [MKJ+07] Matthias S. Müller, Andreas Knüpfer, Matthias Jurenz, Matthias Lieber, Holger Brunst, Hartmut Mix, and Wolfgang E. Nagel. Developing Scalable Applications with Vam- pir, VampirServer and VampirTrace. In Christian Bischof, Martin Bücker, Paul Gib- bon, Gerhard Joubert, Thomas Lippert, Bernd Mohr, and Frans Peters, editors, Parallel Computing: Architectures, Algorithms and Applications, volume 15 of Advances in Parallel Computing, pages 637–644. IOS Press, 2007. ISBN 978-1-58603-796-3. [MKK+08] Holger Mickler, Andreas Knüpfer, Michael Kluge, Matthias S. Müller, and Wolfgang E. Nagel. Trace-Based Analysis and Optimization for the Semtex CFD Application – Hidden Remote Memory Accesses and I/O Performance. In Euro-Par 2008 Workshops: Parallel Processing, Lecture Notes in Computer Science, pages 295–304, Las Palmas de Gran Canaria, August 2008. Springer. [MM09] R. McDougall and J. Mauro. FileBench. http://www.solarisinternals. com/si/tools/filebench, September 2009. [MRB06] Martin Placek and Technical Report Rajkumar Buyya. A Taxonomy of Distributed Storage Systems, Technical Report, GRIDS-TR- 2006-11. Technical report, Grid Com- puting and Distributed Systems Laboratory, The University of Melbourne, Australia, 3 July 2006. [MS03] K.Z. Meth and J. Satran. Features of the iSCSI Protocol. Communications Magazine, IEEE, 41(8):72–75, Aug. 2003. [MT03] R. J. T. Morris and B. J. Truskowski. The Evolution of Storage Systems. IBM Systems Journal, 42(2):205–217, 2003. [MW08] Mark Moshayedi and Patrick Wilkison. Enterprise SSDs. Queue, 6(4):32–39, 2008. [MWS+07] Michael P. Mesnier, Matthew Wachs, Raja R. Sambasivan, Julio Lopez, James Hen- dricks, Gregory R. Ganger, and David R. O’Hallaron. //Trace: Parallel Trace Replay Bibliography 127

with Approximate Causal Events. In FAST ’07: Proceedings of the 5th USENIX con- ference on File and Storage Technologies, page 24. USENIX Association, 2007. [NAW+96] Wolfgang E. Nagel, Alfred Arnold, Michael Weber, Hans-Christian Hoppe, and Karl Solchenbach. VAMPIR: Visualization and Analysis of MPI Resources. In Supercom- puter 63, Volume XII, Number 1, pages 69–80, 1996. [NKP+96] Nils Nieuwejaar, David Kotz, Apratim Purakayastha, Carla Schlatter Ellis, and Michael L. Best. File-Access Characteristics of Parallel Scientific Workloads. EEE Transactions on Parallel and Distributed Systems, 7(10):1075–1089, 1996. [NMSdS07] Michael Noeth, Frank Mueller, Martin Schulz, and Bronis R. de Supinski. Scalable Compression and Replay of Communication Traces in Massively Parallel Environ- ments. In International Parallel and Distributed Processing Symposium, pages 1–11, 2007. [NSM04] David Nagle, Denis Serenyi, and Abbie Matthews. The Panasas ActiveScale Storage Cluster: Delivering Scalable High Bandwidth Storage. In SC ’04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, page 53, Washington, DC, USA, 2004. IEEE Computer Society. [NT94] B.C. Neuman and T. Ts’o. Kerberos: An Authentication Service for Computer Net- works. Communications Magazine, IEEE, 32(9):33 –38, September 1994. [ODA+08] S. Oehme, J. Deicke, J.-P. Akelbein, R. Sahlberg, A. Tridgell, and R. L. Haskin. IBM Scale Out File Services: Reinventing Network-Attached Storage. IBM Journal of Re- search and Development, 52(4):319–328, 2008. [ODCH+85] John K. Ousterhout, Hervé Da Costa, David Harrison, John A. Kunze, Mike Kupfer, and James G. Thompson. A Trace-Driven Analysis of the UNIX 4.2 BSD File System. SIGOPS Operation Systems Review, 19(5):15–24, 1985. [Ope] The OpenGROUP. The Single UNIX Specification Version 3. http://www.unix. org/single_unix_specification. [PBB+00] Kenneth W. Preslan, Andrew P. Barry, Jonathan Brassow, Russell Cattelan, Adam Manthei, Erling Nygaard, Seth Van Oort, David Teigland, Mike Tilstra, Matthew T. O’Keefe, Grant Erickson, and Manish Agarwal. Implementing Journaling in a Linux Shared Disk File System. In IEEE Symposium on Mass Storage Systems, pages 351– 378, 2000. [Pet07a] Peter Kelemen. Silent Corruptions. http://fuji.web.cern.ch/fuji/talk/ 2007/kelemen-2007-C5-Silent_Corruptions.pdf, 2007. Talk at the 8th Annual Workshop on Linux Clusters for Super Computing. [Pet07b] Martin K. Peterson. T10 Data Integrity Feature. http://www.usenix.org/ event/lsf07/tech/petersen.pdf, 2007. Talk at the Linux Storage & Filesys- tem Workshop. [PKS+08] Robert Preissl, Thomas Köckerbauer, Martin Schulz, Dieter Kranzlmüller, Bronis R. de Supinski, and Daniel J. Quinlan. Detecting Patterns in MPI Communication Traces. In ICPP ’08: Proceedings of the 2008 37th International Conference on Parallel Process- ing, pages 230–237. IEEE Computer Society, 2008. [PW01] Wolfgang Preuß and Günter Wenisch. Lehr- und Übungsbuch Numerische Mathematik. Fachbuchverlag Leipzig, 2001. [RD90] R. Rew and G. Davis. NetCDF: An Interface for Scientific Data Access. Computer Graphics and Applications, IEEE, 10(4):76–82, Jul 1990. 128 Bibliography

[Red06] Red Hat, Inc. Red Hat GFS 6.1 Administrator’s Guide. http://www.redhat. com/docs/manuals/csgfs/pdf/rh-gfs-en-6_1.pdf, 2006. [RFL+04] Rob Ross, Evan Felix, Bill Loewe, Lee Ward, Gary Grider, and Rob Hill. HPC File Systems and Scalable I/O: Suggested Research and Development Topics for the fiscal 2005-2009 time frame. http://institutes.lanl.gov/hec-fsio/docs/ FileSystems-DTS-SIO-FY05-FY09-R&D-topics-final.pdf, 2004. [RI07] Erik Riedel and Sami Iren. Object Storage and Applications. Linux Storage & Filesys- tem Workshop, February 12-13, San Jose, CA, 2007. [RJKW] Rob Ross, Jose Moreira, Kim Cupps, and Wayne Pfeiffer. Parallel I/O on the IBM Blue Gene /L System, technical paper. http://www.bgconsortium.org/file_ system_newsletter2.pdf. [RK00] Rolf Rabenseifner and Alice Koniges. Effective File-I/O Bandwidth Benchmark. In Arndt Bode, Thomas Ludwig, Wolfgang Karl, and Roland Wismüller, editors, Euro- Par 2000 Parallel Processing, volume 1900 of Lecture Notes in Computer Science, pages 1273–1283. Springer Berlin / Heidelberg, 2000. [Rog07] Roger Haskin. GPFS - New Directions. Talk at the HLRS Workshop on Scalable Global Parallel File Systems, 2007. [RPS+08] Rob Ross, Tom Peterka, Han-Wei Shen, Yuan Hong, Kwan-Liu Ma, Yu Hongfeng, and Kenneth Moreland. Visualization and Parallel I/O at Extreme Scale. Journal of Physics: DOE SciDAC 2008 Conference, July 2008. [RT03] Ohad Rodeh and Avi Teperman. zFS - A Scalable Distributed File System Using Object Disks. In MSS ’03: Proceedings of the 20 th IEEE/11 th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS’03), page 207, Washington, DC, USA, 2003. IEEE Computer Society. [SBD+03] S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, and D. Noveck. Network File System (NFS) Version 4 Protocol. http://tools.ietf. org/html/rfc3530, April 2003. [SE04a] Laura Shepard and Eric Eppe. SGI InfiniteStorage Shared Filesystem CXFS: A High- Performance, Multi-OS Filesystem from SGI. Technical report, SGI, 2004. [SE04b] Laura Shepard and Eric Eppe. SGI InfiniteStorage Shared Filesystem CXFS: A High- Performance, Multi-OS Filesystem from SGI. http://www.sgi.com/pdfs/ 2691.pdf, 2004. [SGH06] Martin Schulz, Jim Galarowicz, and William Hachfeld. Open|SpeedShop: open source performance analysis for Linux clusters. In SC ’06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 14, New York, NY, USA, 2006. ACM. [SGI05] SGI. Message Passing Toolkit (MPT) Users Guide. http://techpubs.sgi. com/library/manuals/3000/007-3773-003/pdf/007-3773-003. pdf, 2005. [SGI09] SGI. Performance Co-Pilot Home Page. http://oss.sgi.com/projects/ pcp, 2009. [SH02] Frank Schmuck and Roger Haskin. GPFS: A Shared-Disk File System for Large Com- puting Clusters. In FAST ’02: Proceedings of the 1st USENIX Conference on File and Storage Technologies, page 19, Berkeley, CA, USA, 2002. USENIX Association. Bibliography 129

[SKH+10] J. Stender, B. Kolbeck, M. Högqvist, F. Hupfeld, and S. Zurich. BabuDB: Fast and Efficient File System Metadata Storage. In Proceedings of the International Workshop on Storage Network Architecture and Parallel I/Os, 2010. [Smi01] Keith Arnold Smith. Workload-Specific File System Benchmarks. PhD thesis, Harvard University, Cambridge, MA, USA, 2001. [SMMB08] Zoe Sebepou, Kostas Magoutis, Manolis Marazakis, and Angelos Bilas. A Compara- tive Experimental Study of Parallel File Systems for Large-Scale Data Processing. In LASCO’08: First USENIX Workshop on Large-Scale Computing, pages 1–10, Berke- ley, CA, USA, 2008. USENIX Association. [SPB07] Stephen C. Simms, Gregory G. Pike, and Doug Balog. Wide Area Filesystem Perfor- mance using Lustre on the TeraGrid. In Proceedings of the TeraGrid 2007 Conference, Madison, Wisconsin, 2007. [SRSC] Amina Saify, Ramesh Radhakrishnan, Sudhir Srinivasan, and Onur Celebioglu. Achieving Scalable I/O Performance in High-Performance Computing Environments. http://www.ibrix.com/resource-center-resources.htm. [SS97] Keith A. Smith and Margo I. Seltzer. File System Aging - Increasing the Relevance of File System Benchmarks. SIGMETRICS Perform. Eval. Rev., 25(1):203–213, 1997. [SS07] H. Shan and J. Shalf. Using IOR to analyze the I/O performance of HPC platforms. Cray Users Group Meeting (CUG) 2007, Seattle, Washington, May 7-10, 2007, 2007. [Sto02] Storage Networking Association. Common Internet File System (CIFS), Technical Ref- erence, Revision: 1.0. http://www.snia.org/tech_activities/CIFS/ CIFS-TR-1p00_FINAL.pdf, 3 January 2002. [Sun08a] Sun Microsystems, Inc. The Lustre 1.6 Operations Manual. http://manual. lustre.org/images/c/c0/820-3681_v16.pdf, 2008. [Sun08b] Sun Microsystems, Inc. Lustre File System - High-Performance Storage Architec- ture and Scalable Cluster File System. https://www.sun.com/offers/docs/ LustreFileSystem.pdf, 2008. [Sun08c] Sun Microsystems, Inc. Lustre Networking. http://www.sun.com/software/ products/lustre/docs/Lustre-networking.pdf, 2008. [Sun08d] Sun Microsystems, Inc. Peta-Scale I/O with the Lustre File System. https://www. sun.com/offers/docs/Peta-Scale_wp.pdf, 2008. [Szo] Wolfgang Szoecs. Personal Communications, Panasas Inc. [T1004] ANSI T10. ANSI T10/1355-D Object based Storage Devices (OSD), final T10 committee working draft. http://www.t10.org/cgi-bin/ac.pl?t=f&f= osd-r10.pdf, 2004. [TD07] ZIH TU Dresden. VampirTrace. http://www.tu-dresden.de/zih/ vampirtrace, 2007. [TE03] Ulf Troppens and Rainer Erkens. Speichernetze - Grundlagen und Einsatz von Fibre Channel SAN, NAS, iSCSI und InfiniBand. dpunkt.verlag GmbH, 2003. [Tea03] PVFS2 Development Team. Parallel Virtual File System, Version 2, PVFS Develop- ers Guide. http://www.pvfs.org/cvs/pvfs-2-8-branch-docs/doc/ /pvfs2-guide.pdf, 2003. 130 Bibliography

[TGL99] Rajeev Thakur, William Gropp, and Ewing Lusk. Data Sieving and Collective I/O in ROMIO. In FRONTIERS ’99: Proceedings of the The 7th Symposium on the Frontiers of Massively Parallel Computation, page 182. IEEE Computer Society, 1999. [The97] The Open Group. Systems Management: Data Storage Management (XDSM) API. The Open Group, 1997. [TSS+06] Eno Thereska, Brandon Salmon, On Salmon, John Strunk, Matthew Wachs, Michael Abd el malek, Julio Lopez, and Gregory R. Ganger. Stardust: Tracking Activ- ity in a Distributed Storage System. In ACM SIGMETRICS Conference on Measure- ment and Modeling of Computer Systems, pages 3–14. ACM Press, 2006. [TZ06] Tom Zanussi. relayfs Home Page. http://relayfs.sourceforge.net, November 2006. [TZJW08] Avishay Traeger, Erez Zadok, Nikolai Joukov, and Charles P. Wright. A Nine Year Study of File System and Storage Benchmarking. Trans. Storage, 4(2):1–56, 2008. [Van09] Henk Vandenbergh. Vdbench Users Guide. http://dfn.dl. sourceforge.net/project/vdbench/vdbench/vdbench%205. 01fix1/vdbench501fix1.pdf, September 2009. [VMMR09] Karthik Vijayakumar, Frank Mueller, Xiaosong Ma, and Philip C. Roth. Scalable I/O Tracing and Analysis. In PDSW ’09: Proceedings of the 4th Annual Workshop on Petascale Data Storage, pages 26–31, Portland, Oregon, 2009. ACM. [WAZ04] C. P. Wrigth, A. Aranya, and E. Zadok. Tracefs: A File System to Trace Them All. In Proceedings of the Third USENIX Conference on File and Storage Technologies (FAST 2004), pages 129–143. USENIX Association, 2004. [WBM+06] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. Ceph: A Scalable, High-Performance Distributed File System. In OSDI ’06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, pages 22–22. USENIX Association, 2006. [WdW03] Parkson Wong and Rob F. Van der Wijngaart. NAS Parallel Benchmarks I/O Ver- sion 2.4. http://www.nas.nasa.gov/News/Techreports/2003/PDF/ nas-03-002.pdf, 2003. [WGW93] David Womble, David Greenberg, and Stephen Wheat. Beyond Core: Making Par- allel Computer I/O Practical. Proceedings of the Darthmouth Institute for Advances Graduate Studies, 1993. [YCC+06] MuQun Yang, Christian Chilan, Albert Cheng, Quincey Koziol, Mike Folk, and Leon Arber. Parallel I/O Performance Study and Optimizations with HDF5, A Scientific Data Package. The IBM HPC Systems Scientific Computing User Group, Boulder, CO, 2006. [Zin01] Michael Zingale. FLASH I/O Benchmark Routine Home Page. http://www. ucolick.org/~zingale/flash_benchmark_io, 2001. [ZTM03] Shen Zhao, Abe Taaheri, and Ray Milburn. HDF-EOS Interface Based on HDF5, Vol- ume 1: Overview and Examples, White Paper. Technical report, Raytheon Company, October 2003.