Advanced Memory Data Structures for Scalable Event Trace Analysis
Total Page:16
File Type:pdf, Size:1020Kb
Advanced Memory Data Structures for Scalable Event Trace Analysis Dissertation zur Erlangung des akademischen Grades Doktor rerum naturalium (Dr. rer. nat.) vorgelegt an der Technischen Universität Dresden Fakultät Informatik eingereicht von Diplom-Mathematiker Andreas Knüpfer geboren am 25. Juli 1976 in Zwickau Gutachter: Prof. Dr. rer. nat. Wolfgang E. Nagel, Technische Universität Dresden Prof. Dr.-Ing. Wolfgang Lehner, Technische Universität Dresden Prof. Dr. techn. habil. Dieter Kranzlmüller, Ludwig-Maximilians-Universität München Tag der Verteidigung: 16. Dezember 2008 Dresden im März 2009 1 Contents Abstract 5 1 Performance Analysis for HPC Applications 7 1.1 Computation and Performance . 7 1.2 Performance Analysis and Optimization . 8 1.3 Performance Analysis Tools . ... 8 1.4 Contribution of this Thesis . 9 2 State-of-the-Art in Event-Based Trace Analysis and Related Work 11 2.1 TracingandLogging .................................. 11 2.2 TraceAnalysisTools .................................. 12 2.2.1 Vampir and VampirServer . 12 2.2.2 ParaverandDimemas ............................... 17 2.2.3 KojakandScalasca................................. 18 2.2.4 Jumpshot ...................................... 22 2.2.5 Debugging Wizard (DeWiz) . 22 2.2.6 Tuning and Analysis Utilities (TAU) . 26 2.3 Trace File Formats . 28 2.3.1 Common Design of Trace File Formats . 28 2.3.2 The Vampir Trace Format Version 3 (VTF3) . 30 2.3.3 The Structured Trace Format (STF) . 30 2.3.4 TheOpenTraceFormat(OTF). 31 2.3.5 The Epilog Trace Format . 32 2.3.6 The Jumpshot Trace Formats . 33 2.3.7 TheParaverTraceFormat ............................ 33 2.3.8 TheDeWizTraceFormat.............................. 34 2.3.9 TheTAUTraceFormat............................... 34 2.4 Memory Data Structures . 35 2.4.1 The Vampir and VampirServer Data Structures . 35 2.4.2 The EARL Data Structures . 36 2.4.3 The DeWiz Data Structures . 37 2.4.4 The TAU Data Structures . 38 2.4.5 Similarity to Trace File Formats . 39 2.5 Access Methods to Event Data Structures . ..... 39 2.5.1 Sequential Iterator . 39 2.5.2 Time Interval Search . 40 2.5.3 Statistic Summaries . 40 2.5.4 Timeline Visualization . 42 2.5.5 Automatic Analysis . 44 2.6 Event Trace Compression by Statistical Clustering . ...... 45 2.7 Memory Access Traces and Compression . ..... 46 2.8 CompressionofMPIReplayTraces . 47 2 3 The Design of the CCG Data Structure 49 3.1 Trace Data and Trace Information . 49 3.2 Tree Data Structures for Event Traces . ....... 50 3.2.1 Time Stamps versus Time Durations . 52 3.2.2 The Bounded Branching Factor . 52 3.3 In-MemoryCompression ................................ 53 3.3.1 Deviation Bounds for Soft Properties . 54 3.3.2 Sub-TreeComparison ............................... 55 3.3.3 Computational Effort . 56 3.4 Customized Analysis Algorithms . 57 3.4.1 The Conservative Approach . 58 3.4.2 Event Access, Iteration and Searching . 58 3.4.3 TheCachedSummaryQuery............................ 59 3.4.4 The Timeline Query . 60 3.4.5 MPISend-ReceiveMatching. 61 3.4.6 MPI Collective Operation Matching . 62 3.4.7 Automatic Analysis Methods . 63 3.5 Persistent Storage and Restoring . ..... 63 3.6 Summary .......................................... 64 4 Algorithms for the CCG Data Structure 65 4.1 CCGConstruction ..................................... 65 4.1.1 General Construction . 65 4.1.2 Splitting of Wide Nodes . 66 4.1.3 GraphNodeEncoding ............................... 69 4.1.4 Graph Node Allocation . 71 4.2 CCGCompression ..................................... 71 4.2.1 Search for Replacement Nodes . 72 4.2.2 CachingofNodes.................................. 72 4.2.3 Influence of the Node Search Order . 74 4.2.4 NodeComparison ................................. 75 4.2.5 Compression Metrics . 77 4.3 The Combined Construction and Compression Algorithm . 78 4.4 Advanced Construction and Compression Techniques . ........ 79 4.4.1 Re-Compression of Existing CCGs . 79 4.4.2 Merging of Disjoint CCGs . 82 4.4.3 Adaptive Deviation Bounds . 82 4.5 CCG Analysis Algorithms . 85 4.5.1 The Random-Access Iterator . 85 4.5.2 Timestamp Search . 86 4.5.3 Summary Query Algorithms . 87 4.5.4 The Timeline-Rendering Algorithm . 90 4.5.5 The MPI Send-Receive Matching Algorithm . 91 4.6 Persistent Storage and Restoring . ..... 92 4.7 CCGs with Distributed Data . 94 4.7.1 Distributed Data Decomposition . 95 4.7.2 Distributed CCG Construction . 95 4.7.3 Distributed CCG Compression . 96 4.7.4 Distributed Serialization and Restoring . 97 4.7.5 Distributed Evaluation . 97 3 5 Evaluation of CCG Algorithms 99 5.1 Theoretical and Synthetic Evaluation . 99 5.1.1 CompressionModel ................................ 99 5.1.2 Non-Monotone Compression . 100 5.1.3 BestCaseCompression .............................. 101 5.1.4 WorstCaseCompression.............................. 102 5.2 Real-World Construction and Compression . 104 5.2.1 Small Scale Compression . 105 5.2.2 Large Scale Compression . 108 5.2.3 Influence of Branching Factor . 108 5.2.4 InfluenceofTraceSize.............................. 109 5.2.5 Influence of Search Length Parameter . 114 5.3 Advanced Construction and Compression Algorithms . 116 5.3.1 Re-CompressionofCCGs ............................. 116 5.3.2 Adaptive CCG Compression . 117 5.3.3 Serialization and Restore of CCGs . 118 5.4 CachedSummaryQueries ................................ 119 5.4.1 Cache Strategies . 119 5.4.2 Experiment Results . 120 5.5 MPISend-ReceiveMatching . 120 5.6 Recommended Parameter Settings . 123 5.6.1 Deviation Bounds for Soft Properties . 123 5.6.2 Algorithm Parameters . 125 5.7 Comparison to State-of-the-Art Tools . 125 6 Conclusion and Outlook 129 A Acknowledgements 131 List of Figures 133 Bibliography 137 4 Contents 5 Abstract The thesis presents a contribution to the analysis and visualization of computational performance based on event traces with a particular focus on parallel programs and High Performance Computing (HPC). Event traces contain detailed information about specified incidents (events) during run-time of programs and allow minute investigation of dynamic program behavior, various performance metrics, and possible causes of performance flaws. Due to long running and highly parallel programs and very fine detail resolutions, event traces can accumulate huge amounts of data which become a challenge for interactive as well as automatic analysis and visualization tools. The thesis proposes a method of exploiting redundancy in the event traces in order to reduce the memory requirements and the computational complexity of event trace analysis. The sources of redundancy are repeated segments of the original program, either through iterative or recursive algorithms or through SPMD-style parallel programs, which produce equal or similar repeated event sequences. The data reduction technique is based on the novel Complete Call Graph (CCG) data structure which allows domain specific data compression for event traces in a combination of lossless and lossy methods. All deviations due to lossy data compression can be controlled by constant bounds. The compression of the CCG data structure is incorporated in the construction process, such that at no point substantial uncompressed parts have to be stored. Experiments with real-world example traces reveal the potential for very high data compression. The results range from factors of 3 to 15 for small scale compression with minimum deviation of the data to factors > 100 for large scale compression with moderate deviation. Based on the CCG data structure, new algorithms for the most common evaluation and analysis methods for event traces are presented, which require no explicit decompression. By avoiding repeated evaluation of formerly redundant event sequences, the computational effort of the new algorithms can be reduced in the same extent as memory consumption. The thesis includes a comprehensive discussion of the state-of-the-art and related work, a detailed pre- sentation of the design of the CCG data structure, an elaborate description of algorithms for construction, compression, and analysis of CCGs, and an extensive experimental validation of all components. 6 Contents 7 1 Performance Analysis for HPC Applications This chapter provides an introduction to the field of High Performance Computing (HPC) in general and the requirement for performance optimization in this domain. Furthermore, it explains the contribution of performance measurement and analysis in the optimization process and shows the contribution of this thesis for the efficient analysis of huge amounts of trace data. 1.1 Computation and Performance Computation has always been an important tool in science and engineering. It allows to deduce accurate and specific facts about a real-world system from general scientific theories by means of mathematical models. This intellectual tool itself is much older than the term computer as we use it today. Indeed, computer used to be a human profession in former times and the modern English word for a computing machine was derived from that [Gri05]. Interestingly enough, parallel and distributed computing was already prevalent in this past era! After the birth of electronic computing in the middle of the 20’th century, this scientific tool has become much more powerful and more affordable and was utilized in more and more scientific and engineering domains. The computational speed has been growing enormously in an exponential manner for almost 50 years according to Moore’s Law [Moo65], and the growth of the memory and storage capacities has been almost as fast. This has created unpredicted advancements in all fields of science and engineering and far beyond. Today, computing devices are regarded as ubiquitous and.