
ABSTRACT

Performance Analysis of Program Executions on Modern Parallel Architectures

by

Xu Liu

Parallel architectures have become common in supercomputers, data centers, and mobile chips. Usually, parallel architectures have complex features: many hardware threads, deep memory hierarchies, and non-uniform memory access (NUMA). Program designs without careful consideration of these features lead to poor performance on such architectures. First, multi-threaded programs can suffer from performance degradation caused by imbalanced workload, overuse of synchronization, and parallel overhead. Second, parallel programs may suffer from the long latency to main memory. Third, in a NUMA system, memory accesses can be remote rather than local. Without a NUMA-aware design, a threaded program may have many costly remote accesses and imbalanced memory requests to NUMA domains.

Performance tools can help us take full advantage of the power of parallel architectures by providing insight into where and why a program fails to obtain top performance. This dissertation addresses the difficulty of obtaining insights about performance bottlenecks in parallel programs using lightweight measurement techniques. This dissertation makes four contributions. First, it describes a novel performance analysis method for OpenMP programs, which can identify root causes of performance losses. Second, it presents a data-centric analysis method that associates performance metrics with data objects. This data-centric analysis can identify both a program's problematic memory accesses and the associated variables; this information can help an application developer optimize programs for better locality. Third, this dissertation discusses the development of a lightweight method that collects memory reuse distance to guide cache locality optimization. Finally, it describes a lightweight profiling method that can help pinpoint performance losses in programs on NUMA architectures and provide guidance about how to transform the program to improve performance.

To validate the utility of these methods, I implemented them in HPCToolkit, a state-of-the-art profiler developed at Rice University. I used the extended HPCToolkit to study several parallel programs. Guided by the performance insights provided by the new techniques introduced in this dissertation, I optimized all of these programs and was able to obtain non-trivial improvements in their performance. The measurement overhead incurred by these new analysis methods is very small in both runtime and memory.

Acknowledgments

I would first like to thank my advisor, John Mellor-Crummey, at Rice University. His guidance, inspiration, and support infuse this work and fostered my interest in research. More specifically, this thesis consists of text from the published research papers [88, 85, 87, 89, 86] written by John and me. I would also like to acknowledge current and former members of the Rice HPCToolkit team: Laksono Adhianto, Michael Fagan, Mark Krentel, and Nathan Tallent, for their development and maintenance of HPCToolkit. Without their effort, this dissertation would not have been possible.

I want to express my gratitude to my friends in the Department of Computer Science at Rice University, such as Zhuhua Cai, Milind Chabbi, Karthik Murphy, and Chaoran Yang, and friends outside of the department, such as Jiebo Li, Chao Sun, and Jiakui Wang. They have spent time discussing research issues with me and inspired my thoughts about innovations.

I would like to thank my wife, Shuyin Jiao, who has continued showing her love to me since 2004. Though we were in different countries for three years, she never stopped her support for me. Shuyin has given me concrete advice and help whenever I came across any difficulty. She is a good wife and also a good advisor in my life.

Finally, I would like to express my gratitude and respect to my father, Yuyuan Liu, and my mother, Sumei Yang. Your love and support drive me to pursue the Ph.D. degree. This dissertation is dedicated to both of you.

Contents

Abstract ii
Acknowledgments iv
List of Illustrations ix
List of Tables xiii

1 Introduction 1

2 Background 7
2.1 Hardware Sampling 7
2.2 Call Path Profiling 10
2.3 Overview of HPCToolkit 13
2.4 Benchmarks for Evaluation 15

3 Related Work 18
3.1 Performance Tools for OpenMP Programs 18
3.2 Measurement-based Tools for Analyzing Memory Bottlenecks 21
3.3 Simulation Methods for Memory Bottlenecks 23
3.3.1 Performance Tools Supporting Reuse Distance Measurement 23
3.3.2 Algorithms for Accelerating Reuse Distance Measurement 25
3.4 Tools for Identifying NUMA Inefficiencies 27
3.5 Comparison with Our Approaches 29

4 Identifying Bottlenecks in OpenMP Programs 32
4.1 Approach 34

4.1.1 Attributing Idleness and Lock Waiting 35
4.1.2 Deferred Context Resolution 43
4.2 Tool Implementation 49
4.3 Case Studies 51
4.3.1 AMG2006 53
4.3.2 LULESH 57
4.3.3 NAS BT-MZ 59
4.3.4 HEALTH 61
4.4 Discussion 63

5 Analyzing Memory Bottlenecks with Lightweight Measurement 65
5.1 Motivation 68
5.1.1 Data-centric Profiling 68
5.1.2 Scalability 70
5.2 Data-centric Capabilities for HPCToolkit 72
5.2.1 Online Call Path Profiler 73
5.2.2 Post-mortem Analyzer and User Interface 78
5.3 Case Studies 79
5.3.1 AMG2006 81
5.3.2 Sweep3D 85
5.3.3 LULESH 87
5.3.4 Streamcluster 90
5.3.5 Needleman-Wunsch 91
5.4 Discussion 92

6 Analyzing Memory Bottlenecks with Cache Simulation 94
6.1 Identify Data Locality Bottlenecks 96

6.2 Lightweight Reuse Distance Measurement 98
6.2.1 Reuse Metric Computation 100
6.2.2 Call Path Collection 101
6.2.3 Data-centric Analysis 102
6.2.4 Overhead and Accuracy Analysis 104
6.3 Tool Organization 106
6.3.1 Call Path Profiler 106
6.3.2 Pin Plug-in 107
6.3.3 Post-mortem Analysis Tool 108
6.4 Case Studies 108
6.4.1 LULESH 109
6.4.2 Sweep3D 112
6.4.3 NAS LU 113
6.4.4 Sphot 115
6.4.5 S3D 116
6.5 Discussion 117

7 Analyzing NUMA Inefficiencies in Threaded Programs 119
7.1 NUMA-aware Program Design 123
7.2 Address Sampling 125
7.3 NUMA Metrics 127
7.3.1 Identifying Remote Accesses and Imbalanced Requests 127
7.3.2 NUMA Latency per Instruction 129
7.4 Metric Attribution 130
7.4.1 Code- and Data-centric Attribution 131
7.4.2 Address-centric Attribution 131
7.5 Pinpointing First Touches 133
7.6 Tool Implementation 135

7.6.1 Online Profiler 135
7.6.2 Offline Analyzer and Viewer 136
7.7 Experiments 136
7.7.1 LULESH 139
7.7.2 AMG2006 142
7.7.3 Blackscholes 145
7.7.4 UMT2013 147
7.8 Discussion 148

8 Conclusions 150
8.1 Summary of Contributions 150
8.2 Open Problems 152

Bibliography 154

Illustrations

2.1 An example profile presented by hpcviewer for a parallel program running on a 48-core machine. 14
2.2 An example trace view of hpctraceviewer for a parallel program running on a 48-core machine. 15

4.1 The call stack for master, sub-master, and worker threads in an OpenMP program with three levels of nested parallelism. As shown, stacks grow from bottom to top. The calling contexts for the parallel regions shown in grey are scattered across two or more threads. 44
4.2 HPCToolkit's time-centric view rendering of a complete execution of AMG2006 on 64 cores – 8 MPI ranks with 8 OpenMP threads/rank. 53
4.3 HPCToolkit's time-centric view rendering of a single iteration of AMG2006's solver on 64 cores – 8 MPI ranks with 8 OpenMP threads/rank. 54
4.4 A calling context view of AMG2006's solver phase. More than 12% of the execution time of all threads is spent idling in the highlighted parallel region. 55
4.5 A bottom-up view of call path profiles for LULESH. madvise, which is called while freeing memory, is identified as the routine with the most idleness. 57
4.6 Pseudo code showing how memory is allocated, initialized, and freed in LULESH. 58

4.7 A calling context view of call path profiles for BT-MZ shows that the amount of idleness reported for exch qbc is over seven times larger than the work it performs. 60
4.8 Bottom-up view of the HEALTH benchmark showing lock contention in the GOMP library. 61

5.1 Code-centric profiling aggregates metrics for memory accesses in the same source line; data-centric profiling decomposes metrics by variable. 69
5.2 A loop allocating data objects in the heap. 71
5.3 A simplified workflow that illustrates data-centric measurement and analysis in HPCToolkit. Rectangles are components of HPCToolkit; ellipses are inputs or outputs of different components. 72
5.4 The top-down data-centric view of AMG2006 shows a problematic heap allocated variable and its two accesses that suffer the most remote events to this variable. 82
5.5 The bottom-up data-centric view of AMG2006 shows problematic allocation call sites which may reside in different call paths. 84
5.6 The data-centric view of Sweep3D shows heap allocated variables with high latency. Three highlighted arrays have high memory access latency. 85
5.7 The data-centric view of Sweep3D shows a memory access with long latency to array Flux. This access is in a deep call chain. 86
5.8 The data-centric view shows problematic heap allocated arrays with high latency in LULESH. The red rectangle encloses the call paths of all heap variables that incur significant memory latency. 88
5.9 The data-centric view shows a problematic static variable and its accesses with high latency in LULESH. The ellipse shows that variable's name. 89

5.10 The data-centric view associates a large number of NUMA-related events to a problematic heap allocated variable and its inefficient accesses in the Streamcluster benchmark. 90
5.11 The data-centric view highlights two problematic heap allocated variables and their inefficient accesses in the Needleman-Wunsch benchmark. 92

6.1 Shadow profiling used in reuse distance measurement. 99
6.2 The workflow of our prototype. 107
6.3 Pinpoint problematic code sections and instructions in LULESH with their full calling contexts. 109
6.4 The text output from the reuse distance measurement for LULESH. It is an example of a reuse pair. The reuse instruction at line 1316 shows high memory access latency in Figure 6.3. 110
6.5 LULESH source code for the use and reuse described in Figure 6.4. It shows an optimization opportunity for a pair of OpenMP parallel regions. The arrow points to the use and reuse of z8n. The parallel loops bounded with blue rectangles should be fused to improve data locality. 111
6.6 Identify problematic memory accesses and reuse points within nested loops in Sweep3D. 113
6.7 Identify data locality optimization opportunity in the NAS LU benchmark. 114
6.8 Identify problematic memory accesses with reuse annotations. 115
6.9 Pinpoint one of the problematic memory accesses in the S3D program with its full calling context. 117

7.1 Three common data distributions in a program on a NUMA architecture. The first distribution allocates all data in NUMA domain 1. It suffers from both locality and bandwidth problems. The second distribution maps data to each of the NUMA domains and avoids centralized contention. The third distribution co-locates data with computation, which both maximizes low-latency local accesses and reduces bandwidth contention. 124
7.2 Identifying the first touch context for each heap variable. Our NUMA profiler applies both code-centric and data-centric attribution for first touches. 134
7.3 Identifying NUMA bottlenecks in LULESH using code-centric, data-centric, and address-centric attributions. 139
7.4 Address-centric view showing the overall access patterns of RAP diag data in AMG2006 across all threads. 143
7.5 Address-centric view for the accesses to RAP diag data in the most significant loop in AMG2006. 143
7.6 Address-centric view showing the overall access patterns of RAP diag j in AMG2006 across all threads. 144
7.7 Address-centric view for the accesses to RAP diag j in the most significant loop in AMG2006. 144
7.8 Address-centric view showing the access patterns of buffer across all threads. 145
7.9 Memory access patterns across threads in BlackScholes. 146
7.10 A loop kernel in UMT2013 that has many remote accesses to STime. 147

Tables

4.1 Running times for our case study applications with (a) the unmodified GOMP runtime library (no sampling), (b) the modified GOMP library with lightweight instrumentation to support blame shifting (no sampling), and (c) the modified GOMP library while profiling at 200 samples per second per thread. Execution times represent the average over three executions. 52
4.2 Summary of performance bottlenecks identified using our tool and our code optimizations in response. 63

5.1 Measurement configuration and overhead of benchmarks. 81
5.2 Improvement for different phases of AMG2006 using coarse-grained numactl and fine-grained libnuma. 83

6.1 Assessing locality optimization opportunities. 97
6.2 Measurement overhead for our tool. 108

7.1 Configurations of different sampling mechanisms on different architectures. 137

7.2 Runtime overhead measurement of HPCToolkit-NUMA. The number outside the parentheses is the execution time without monitoring and the percentage in the parentheses is the monitoring overhead. The absolute execution times on different architectures are incomparable because we adjust the benchmark inputs according to the number of cores in the system. 137

Chapter 1

Introduction

Over the past decades, computational modeling and simulation have become an integral component of scientific research. They not only complement and verify experimental results, but also provide solutions for scientific problems that cannot be solved with traditional experimental methods. "Scientific computing has become an important contributor to all scientific research programs" [110]. To provide greater insight into the nature of physical phenomena, scientific programs have become increasingly sophisticated. With the exponential growth of computation and data used for modeling and simulation, scientific programs can no longer solve problems in reasonable time when executed sequentially. Today, parallel scientific programs routinely employ clusters and supercomputers to meet their computational needs. However, it is always difficult to obtain high performance. The difficulties come from two aspects: software and hardware.

On the one hand, parallel programs are expected to scale on large supercomputers. For example, the ACM Gordon Bell Prize at SC13 [8] recognized a fluid dynamics simulation that ran with 6.4 million threads on Lawrence Livermore National Laboratory's "Sequoia" cluster [80] and achieved a sustained performance of 14.4 PetaFlops. Usually, such large-scale programs employ sophisticated programming models for large-scale parallelism. To parallelize programs inside a node, one can use multithreaded models, such as Pthreads [26], OpenMP [111], Cilk [50], or Intel Threading Building Blocks [114]. To parallelize programs across nodes, one can use a message passing model, such as MPI [97], or partitioned global address space (PGAS) programming models, such as Coarray Fortran [34], UPC [48], or Chapel [37]. How to choose appropriate programming models and use them efficiently for best performance is still an open question for program developers.

On the other hand, modern supercomputers are complex. First, modern supercomputers have become large scale. For example, Lawrence Livermore National Laboratory's Sequoia BlueGene/Q cluster [80] has more than 1.5 million cores and supports more than six million threads with a peak performance of 20 PetaFlops. Recent supercomputers have begun to employ a large number of heterogeneous accelerators to increase their computational power. For example, Oak Ridge National Laboratory's Titan cluster [109] has nearly 300 thousand CPU cores and 18 thousand GPUs with a peak performance of more than 20 PetaFlops. Another example of a heterogeneous cluster is Stampede [136], deployed at the Texas Advanced Computing Center. It integrates thousands of Intel Xeon processors with Intel Xeon Phi co-processors for a peak performance of nearly 10 PetaFlops. Second, modern supercomputers employ deep memory hierarchies to bridge the speed gap between CPU and memory. A typical processor has two or three levels of cache in the memory hierarchy. Recently, IBM announced its POWER8 processors, which have four levels of cache [64]. Frequent data movement in such deep memory hierarchies not only degrades performance but also consumes energy. Third, modern multi-socket architectures integrate multiple microprocessors. Each microprocessor has its own memory directly attached. Such architectures have scalable memory bandwidth because different microprocessors can access memory in parallel. On a multi-socket node, a microprocessor can access remote memory because all memories are in a shared address space. However, accessing memory directly attached to a microprocessor is faster than accessing memory attached to other microprocessors. This architecture is known as a non-uniform memory access (NUMA) architecture, which has become the standard for multi-socket nodes of scalable parallel systems.

Though powerful, mapping applications to architectures with these new features is difficult. First, massive parallelism in a system can lead to poor performance if threads or processes have imbalanced workloads. Second, a deep memory hierarchy cannot avoid long-latency accesses to main memory if a program has cache-unfriendly memory access patterns, which often evict frequently used data from caches. Third, NUMA architectures may degrade the performance of multi-threaded programs that lack a NUMA-aware design. Such programs suffer from excessive remote accesses and memory bandwidth contention.

To solve this inefficient mapping problem, one needs performance insights to guide code design and optimization. To obtain performance insights, using a performance tool to analyze program executions is the most straightforward method. Current performance tools mainly use two methods to understand program performance: simulation and measurement. A simulation-based tool feeds program instructions to a hardware simulator and monitors the program's detailed execution. Simulation-based tools, such as SLO [16], MemSpy [95], ThreadSpotter [115], and MACPO [113], mainly focus on memory hierarchy bottlenecks. Based on detailed simulations, these tools can give insights into data access patterns, which provide useful guidance for code optimization. However, modern parallel architectures are too complex to simulate every feature. Moreover, the exhaustive simulation used by SLO and MemSpy may cause a 100-1000× slowdown of the program execution. ThreadSpotter and MACPO adopt sampling techniques to selectively simulate the behavior of only some memory accesses to reduce measurement overhead.

Unlike simulation, measurement-based tools monitor program executions on real hardware. They can either instrument code to query performance metrics or sample performance events during program executions. Instrumentation-based tools, such as gprof [58],* TAU [121], and SCALASCA [140], incur large measurement overhead if the instrumentation is fine grained. For example, TAU reports a 30× slowdown [103] for monitoring and proposes reducing overhead by discarding data at the cost of accuracy. Unlike instrumentation, sampling-based tools, such as Intel Vtune [67], HPCToolkit [2], and Oprofile [82], can collect performance data with very low overhead.

* While gprof uses sampling to attribute execution time to functions, it uses compiler-inserted prologues and epilogues to attribute costs to call graph edges between functions. This can lead to high measurement overhead with small functions [51].

In this dissertation, I describe new execution-analysis techniques that employ sampling in conjunction with both simulation- and measurement-based methods to study performance of parallel programs running on modern architectures with many hardware threads, deep memory hierarchies, and NUMA systems. In the rest of this chapter, I briefly describe the thesis statement, the contributions of this work, and the dissertation organization.

Thesis statement Using sampling-based measurements triggered by hardware performance monitoring units in conjunction with call path profiling, one can develop lightweight profiling methods to analyze program executions on a parallel architecture with many hardware threads, deep memory hierarchies, and NUMA systems. These methods can provide rich information to guide code optimization.

Contributions This dissertation describes the design and implementation of several new performance measurement and analysis methods for programs running on parallel architectures. First, I developed a lightweight profiler for OpenMP programs. The profiler can collect performance data with low overhead and attribute it to the root causes of bottlenecks, i.e., code that causes load imbalance, over-synchronization, and serialization. Moreover, this method constructs user-level calling contexts for monitored parallel regions across OpenMP threads. Second, I developed a data-centric profiler that identifies memory bottlenecks. It leverages hardware sampling mechanisms to monitor memory accesses and associates performance metrics with both memory access instructions and accessed data objects. This data-centric profiler can quantify memory bottlenecks with very low overhead. Third, I developed a tool that combines measurement and simulation of memory hierarchies to guide memory bottleneck optimization. The tool can benefit from detailed simulation as well as low overhead sampling-based measurement. Finally, I developed a profiler that can identify and quantify performance bottlenecks in NUMA systems. Moreover, it can provide guidance for designing NUMA-aware code.

I implemented and integrated these novel measurement and analysis methods in Rice University's HPCToolkit [2], which is the state-of-the-art performance tool for parallel programs. It employs hardware sampling and call path profiling techniques to collect performance data.

Dissertation organization Chapter 2 describes sampling based on hardware performance monitoring unit events, call path profiling, and the HPCToolkit infrastructure, which provide the foundation for the work described in this dissertation. Chapter 2 also describes the benchmarks that are used in this dissertation to validate our analysis methods. Chapter 3 reviews previous work in the area of program performance measurement and analysis. Chapter 4 describes the design and implementation of a new profiler for OpenMP programs. Chapter 5 shows the design and implementation of a data-centric profiler for analyzing memory performance bottlenecks. Chapter 6 integrates sampling-based data-centric measurement with detailed cache simulation to provide additional insights for memory optimization. Chapter 7 describes the design and implementation of a lightweight profiler for guiding NUMA optimization. Finally, Chapter 8 presents some conclusions from our work and identifies a few directions for future research.

Chapter 2

Background

An effective performance tool for long-running parallel programs should have two basic features: low measurement overhead and insightful analysis. On the one hand, collecting performance data with low runtime and space overhead is essential for a performance tool to be useful for measuring production runs of HPC applications. On the other hand, a performance tool is valuable if and only if it provides insightful analysis that can guide performance optimization. This dissertation describes a set of tools that have both features. To achieve this, I built these tools based on two existing techniques: hardware sampling and call path profiling. Moreover, our approaches are based on the infrastructure of HPCToolkit [2], a performance tool developed at Rice University. The rest of this chapter reviews these two techniques, the HPCToolkit performance tool, and the benchmarks used to validate our approaches.

2.1 Hardware Sampling

For over a decade, microprocessors have included performance monitoring hardware to provide insight into an application's behavior. Using a hardware performance monitor to periodically collect samples during an application's execution is an effective way to associate work (e.g., instructions executed), resource consumption (e.g., cycles), and inefficiency (e.g., latency) with program regions. Such asynchronous sampling is the mechanism underlying our data-centric analysis of program executions. Most commonly, microprocessors contain hardware support for event-based sampling. Some processors support instruction-based sampling as well. We briefly review these two approaches.

Event-based Sampling Event-based sampling (EBS) uses hardware performance counters to count the occurrence of various kinds of events during a program execution. One configures an event counter to trigger an interrupt when a particular count threshold is reached. Typically, microprocessors support several performance counters, which enables multiple events to be monitored in a single execution.

On out-of-order microprocessors, EBS using hardware counters often does not accurately attribute events to the instructions that caused them. Several factors make precise attribution difficult, including long pipelines, speculative execution, and the delay between when an event counter reaches a threshold and when it triggers an interrupt. As a result, it is problematic to use EBS to support analysis of a program execution at the level of individual instructions. In response, modern Intel processors support precise event-based sampling (PEBS), which can accurately attribute samples [125]; however, PEBS only supports a small subset of events [68].

As part of EBS and PEBS support, Itanium 2 processors provide Event Address Registers (EARs) [69] and Nehalem provides a load latency facility [68] to support data-centric analysis. These features enable one to sample loads and record information including effective address, cache and TLB misses, and access latency.
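To make the mechanism concrete, the following sketch shows one way a Linux tool might configure event-based sampling through the perf_event_open system call. The sampled event, the period, and the precise_ip setting are illustrative assumptions for this example, not the configuration used by the tools described in this dissertation.

// Minimal sketch: configuring Linux event-based sampling with perf_event_open.
// The event choice, sample period, and precise_ip level are assumptions.
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>

static int perf_event_open(perf_event_attr *attr, pid_t pid, int cpu,
                           int group_fd, unsigned long flags) {
  return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main() {
  perf_event_attr attr;
  std::memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_HARDWARE;
  attr.config = PERF_COUNT_HW_CPU_CYCLES;     // event to sample (assumption)
  attr.sample_period = 1000000;               // interrupt every ~10^6 events
  attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR;  // record PC and data address
  attr.precise_ip = 2;        // request precise attribution (e.g., PEBS) if available
  attr.disabled = 1;
  attr.exclude_kernel = 1;

  int fd = perf_event_open(&attr, 0 /*this process*/, -1 /*any cpu*/, -1, 0);
  if (fd < 0) { std::perror("perf_event_open"); return 1; }

  // A real profiler would mmap the ring buffer and install a signal handler
  // that records a call path at each counter overflow; this sketch only
  // enables and later disables the event.
  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
  // ... monitored work runs here ...
  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
  close(fd);
  return 0;
}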

Instruction-based Sampling For more accurate attribution of performance events on out-of-order processors, Dean et al. [42] developed an approach known as instruction-based sampling (IBS). Periodically, IBS selects an instruction for monitoring. As a monitored instruction moves through the execution pipeline, hardware records the occurrence of key events (e.g., cache and TLB misses), latencies of pipeline stages, the effective address of a branch target or memory operand, and whether the instruction completed or aborted. As a sampled instruction completes, IBS triggers an interrupt, signaling that the details of a monitored instruction's execution history are available for inspection. The first tool to use IBS was DEC's Dynamic Continuous Profiling Infrastructure—a low overhead, system-wide, code-centric, flat profiler known as DCPI [6].

Modern AMD Opteron processors support two flavors of IBS: fetch sampling and op sampling [45]. Fetch sampling is used to diagnose performance problems associated with instruction fetch. Op sampling provides information about instructions that complete. Information recorded about an instruction in flight includes the effective address (virtual and physical) for a load/store, details about the memory hierarchy response (e.g., cache and TLB miss or misalignment), and various latency measures. AMD's CodeAnalyst uses IBS to precisely attribute processor events to code regions using flat profiling [4].

IBM POWER5 and subsequent generations of POWER architectures use a mechanism similar to IBS to count marked events [126]. When a marked event is counted, POWER processors set two special purpose registers. The sampled instruction address register (SIAR) records the precise instruction address of the sampled instruction. The sampled data address register (SDAR) records the effective address touched by a sampled instruction if this instruction is memory related. A PMU can be configured to trigger an interrupt when a marked event count reaches some threshold. When a sample is triggered, a tool can use SIAR and SDAR to attribute marked events to both code and data.

2.2 Call Path Profiling

A call path is a chain of functions with calling relationships. Associating performance losses with call paths provides unique performance insight into program executions [61, 62]. For example, consider a threaded program that employs task-based parallelism. If threads spend a large amount of time spin waiting in synchronization routines for access to shared resources, flat profiling without call path information cannot tell which task causes the waiting. Moreover, to understand data allocation in C++ programs, one can monitor allocation routines. However, without call path information, one cannot know the variable in the source code associated with any particular allocation performed by the C++ language runtime. Because of its potential to provide deep insight, call path profiling is widely used in analyzing program performance [118, 133, 66, 112], correctness [18], and security [127, 20]. Therefore, determining the full call path of any execution point of interest is important. Currently, there are three principal techniques to achieve this: call stack recording, precise call path encoding, and probabilistic calling contexts.

Call stack recording There are two principal techniques used to determine call paths during an execution: unwinding the call stack and maintaining a shadow call stack. These call paths are organized using compact data structures, such as call graphs [58, 59] or calling context trees [5, 7].

Call stack unwinding walks procedure frames on the call stack using either compiler-recorded information (e.g., libunwind [104], stackwalkerAPI [23], and OpenSpeedShop [118]) or results from binary analysis (e.g., HPCToolkit [133], VTune [66], and Oracle Solaris Studio [112]). Stack unwinding requires neither instrumentation nor maintenance of state information at each call and return. As a result, it adds no execution overhead except when a stack trace is requested. This technique is well suited for coarse-grained execution monitoring, e.g., sampling-based performance tools and debuggers. However, applying call stack unwinding to gather calling context information for each machine instruction executed would repeatedly gather slowly changing calling contexts at unacceptably high overhead.
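As an illustration of the unwinding technique, the sketch below walks the current thread's stack with the libunwind API. It is a minimal example under the assumption that compiler-recorded unwind information is available; it is not the unwinder used by HPCToolkit, which relies on binary analysis.

// Illustrative call stack unwinding with libunwind (not HPCToolkit's unwinder).
// A sampling tool would call record_call_path() from its sample handler.
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <cstdio>

void record_call_path() {
  unw_context_t context;
  unw_cursor_t cursor;
  unw_getcontext(&context);            // capture the current register state
  unw_init_local(&cursor, &context);   // start unwinding this thread's stack

  while (unw_step(&cursor) > 0) {      // walk one procedure frame at a time
    unw_word_t pc, offset;
    char name[256];
    unw_get_reg(&cursor, UNW_REG_IP, &pc);
    if (unw_get_proc_name(&cursor, name, sizeof(name), &offset) == 0)
      std::printf("  %#lx  %s+%#lx\n", (unsigned long)pc, name, (unsigned long)offset);
    else
      std::printf("  %#lx  ??\n", (unsigned long)pc);
  }
}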

Stack shadowing [58, 23, 107, 22, 29] involves maintaining the calling context as the execution unfolds; this can be accomplished by instrumenting every function entry and exit either at compile time or using binary rewriting. The advantage of stack shadowing is that the stack trace is ready at every instant and hence can be collected in constant time. Stack shadowing is well suited for tracking call paths of fine-grained instructions during program execution. A disadvantage of stack shadowing is that instrumentation of every call and return adds significant overhead, especially for functions called frequently. Moreover, it requires extra space to maintain the shadow stack. Such high runtime and space overhead makes the stack shadowing method difficult to apply efficiently in performance tools.
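The sketch below illustrates stack shadowing using the compiler's -finstrument-functions entry/exit hooks; the thread-local vector and the use of compiler instrumentation (rather than binary rewriting) are assumptions made for this example.

// Minimal shadow-stack sketch using compiler-inserted entry/exit hooks.
// Compile the application with -finstrument-functions; this file itself
// should be compiled without that flag (or rely on the attributes below).
#include <vector>

extern "C" {
  void __cyg_profile_func_enter(void*, void*) __attribute__((no_instrument_function));
  void __cyg_profile_func_exit(void*, void*)  __attribute__((no_instrument_function));
}

static thread_local std::vector<void*> shadow_stack;

extern "C" void __cyg_profile_func_enter(void* fn, void* /*call_site*/) {
  shadow_stack.push_back(fn);                 // push the callee's entry address
}

extern "C" void __cyg_profile_func_exit(void* /*fn*/, void* /*call_site*/) {
  if (!shadow_stack.empty()) shadow_stack.pop_back();   // pop on every return
}

// A profiler's sample handler can copy shadow_stack to obtain the calling
// context of the interrupted instruction in constant time, without unwinding.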

Instead of instrumenting all functions, one can choose to instrument a subset of functions on hot paths. Bernat and Miller [14] developed a method that chooses a few functions to instrument. This selective instrumentation can balance the overhead and accuracy. Zhuang et al. [144] designed an adaptive method to sample hot paths, called adaptive bursting. They developed a variety of approaches to reduce instrumentation overhead and obtain high accuracy.

Precise call path encoding Recording all call paths in program executions is useful for debugging [18] and security analysis [20]. However, both call stack unwinding and stack shadowing incur high runtime and space overhead to do this. Precise call path encoding is an alternative with low overhead.

Ball and Larus [12] first proposed an accurate and lightweight path encoding algorithm—the BL algorithm—to determine dynamic execution frequencies of control-flow paths in a routine. The BL algorithm places instrumentation on the edges of control flow graphs to encode control flow paths. Precise calling context encoding (PCCE) [129], based on the BL algorithm, uniquely represents the call path of any execution point using a series of numbers. These numbers are automatically generated by instrumenting the source code of a program at specific points in its call graph. With this lightweight instrumentation, call path encoding can achieve very low overhead. However, the use of PCCE is limited in practice. There are two follow-on efforts that extend it. First, Li et al. [83] propose a dynamic encoding method that instruments binaries on the fly for both executables and dynamically loaded libraries. This method can automatically analyze function pointers and avoid the profiling burden of PCCE. Second, Zeng et al. [142] develop an encoding method that works for both procedural and object-oriented programs. It addresses the encoding space pressure to make the method applicable to large-scale programs.
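The toy example below illustrates the basic idea behind numeric calling-context encoding. It is a deliberate simplification, not the BL or PCCE algorithm itself: each call site adjusts a running identifier by a statically assigned increment, so the identifier's value at any execution point names one call path. The per-call-site increments are chosen by hand for illustration.

// Simplified illustration of numeric calling-context encoding (assumption:
// increments are assigned so that distinct call paths yield distinct ids).
#include <cstdio>

static int ctx_id = 0;   // current calling-context identifier

void leaf() {
  // ctx_id identifies how we got here:
  // 0 -> main->a->leaf, 1 -> main->b->leaf (for the increments chosen below)
  std::printf("leaf reached with context id %d\n", ctx_id);
}

void a() { ctx_id += 0; leaf(); ctx_id -= 0; }   // call-site increment 0
void b() { ctx_id += 1; leaf(); ctx_id -= 1; }   // call-site increment 1

int main() {
  a();   // prints context id 0
  b();   // prints context id 1
  return 0;
}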

Probabilistic calling contexts Precise call path encoding adds substantial instrumentation to a program execution and can still cause high runtime and space overhead. To address this problem, one can encode call paths probabilistically with approximation. Bond and McKinley [19] developed a method to hash contexts to numeric identifiers. Because two different contexts may map to the same identifier, they proposed probabilistic calling contexts to efficiently determine calling contexts with high probability by instrumenting function call sites. Mapping identifiers back to their call paths is not well supported in their approach. To increase the probability of hashing contexts to unique identifiers, Mytkowicz et al. [106] used program counters and call stack sizes to encode contexts. To reduce the probability of multiple contexts mapping to one identifier, they modified the program and activation records. Moreover, they trained program executions to build an offline map between contexts and identifiers, reducing the online encoding and decoding overhead. Bond et al. [18] further improved this approach by eliminating the offline training effort and using online information to encode contexts.
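The following sketch illustrates how a probabilistic calling-context value can be maintained by instrumented call sites. The specific hash (3·v + call-site id) follows the general style of Bond and McKinley's scheme as I recall it and, like the call-site ids, should be read as an assumption for illustration.

// Sketch of a probabilistic calling-context value maintained at call sites.
#include <cstdint>
#include <cstdio>

static thread_local uint64_t pcc = 0;   // probabilistic calling-context value

inline uint64_t pcc_push(uint64_t call_site_id) {
  uint64_t saved = pcc;
  pcc = 3 * pcc + call_site_id;   // cheap context hash; collisions are unlikely
  return saved;
}
inline void pcc_pop(uint64_t saved) { pcc = saved; }

void leaf() { std::printf("context value: %llu\n", (unsigned long long)pcc); }

void callee() {                                   // instrumented call to leaf()
  uint64_t s = pcc_push(/*call_site_id=*/17);     // hypothetical call-site id
  leaf();
  pcc_pop(s);
}

int main() {
  uint64_t s = pcc_push(/*call_site_id=*/42);     // hypothetical call-site id
  callee();
  pcc_pop(s);
  return 0;
}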

2.3 Overview of HPCToolkit

HPCToolkit [2] is an open-source, multi-platform, sampling-based performance tool for measurement and analysis of application performance on parallel systems. It takes fully optimized binary executables as input, samples program executions, records call paths using call stack unwinding, and collects performance profiles and traces. As a code-centric profiler, HPCToolkit associates performance metrics with program code sections, such as statements, loops, and routines. HPCToolkit has two graphical user interfaces — hpcviewer [3, 2] and hpctraceviewer [134] — to present profiles and traces respectively. Figure 2.1 is an example output of hpcviewer. There are three panes in the figure. The top one is the source code pane. The bottom left one shows program contexts with full call paths. These contexts can be configured to show as a calling context view, caller's view, or flat view. The bottom right one shows the metrics associated with each program context. Both inclusive and exclusive values are computed for all metrics and can be selectively displayed in the metric pane.

Figure 2.1: An example profile presented by hpcviewer for a parallel program running on a 48-core machine.

Figure 2.2 is an example output of hpctraceviewer. The main pane in the figure shows the trace view of the program execution. Trace lines for processes and threads are stacked along the vertical axis and time flows from left to right. The color at each point in a thread's time line represents the procedure executing at that time. The top right pane shows the call path of the sample pointed to by the white crossbar. By choosing different call path depths in this pane, the trace pane can show multi-level views. The bottom left is the depth view showing the depth of the call stack along the timeline. Finally, the bottom right pane is a mini map that indicates the fraction of the whole execution trace examined in the trace view.

Figure 2.2: An example trace view of hpctraceviewer for a parallel program running on a 48-core machine.

Both hpcviewer and hpctraceviewer provide foundational support for intuitive analysis in our approaches described in this dissertation.

2.4 Benchmarks for Evaluation

In this dissertation, we study several benchmarks to validate our new performance methods. We briefly describe these benchmarks below.

• AMG2006 [79], one of the Sequoia benchmarks developed by Lawrence Livermore National Laboratory, is a parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids. The driver for this benchmark builds linear systems for various 3D problems. It consists of sequential and parallel regions that both affect its performance. AMG2006 is an MPI-OpenMP hybrid benchmark written in C.

• Sphot, one of LLNL's ASC Sequoia benchmark codes [79], is a 2D photon transport code. Photons are born in hot matter and tracked through a spherical domain that is cylindrically symmetric on a logically rectilinear, 2D mesh. It is an MPI+OpenMP benchmark written in Fortran 77. We study a single-process, single-thread execution of the benchmark in this dissertation.

• UMT2013 is a benchmark from the LLNL CORAL benchmark suite [78]. It performs three-dimensional, non-linear, radiation transport calculations using deterministic methods. UMT2013 is coded in a hybrid of C, C++, and Fortran and parallelized with MPI and OpenMP. In this dissertation, we only used OpenMP, not MPI.

• LULESH [77], a UHPC application benchmark developed by Lawrence Livermore National Laboratory, is an Arbitrary Lagrangian Eulerian code that solves the Sedov blast wave problem for one material in 3D. In this dissertation, we study a highly-tuned LULESH implementation written in C++ with OpenMP.

• NAS BT-MZ [73] is a multi-zone parallel benchmark written in Fortran that employs two levels of nested OpenMP parallelism. It divides the data into several zones, which are computed in first-level parallel regions. In each zone, computation is also parallelized in second-level parallel regions. This benchmark manages load balance in the outer-level parallel regions.

• NAS LU [10] is a lower-upper Gauss-Seidel solver in the NAS parallel benchmark suite. We examine its OpenMP version configured with a class A workload running on one thread.

• HEALTH [40] is a benchmark that is part of the Barcelona OpenMP Tasks Suite. With the medium input, it creates more than 17 million untied tasks during execution. The benchmark is used to evaluate task creation and scheduling policies inside an OpenMP runtime system.

• Sweep3D [1] is an ASCI benchmark that solves a time independent discrete ordinates 3D Cartesian geometry neutron transport problem. It is written in Fortran and parallelized with MPI without threading.

• Streamcluster [31] is one of the Rodinia benchmarks. For a stream of input points, it finds a predetermined number of medians so that each point is assigned to its nearest center. The quality of the clustering is measured by the sum of squared distances metric. It is written in C++ with OpenMP.

• Needleman-Wunsch (NW) [31], another Rodinia benchmark, is a nonlinear global optimization method for DNA sequence alignments. It is also implemented in C++ with OpenMP.

• S3D, developed at Sandia National Laboratories, is a direct numerical simulation application that solves the full compressible Navier-Stokes, total energy, species and mass continuity equations coupled with detailed chemistry [32]. S3D is written in Fortran 90 and is parallelized with MPI. We study a single process execution of the benchmark.

• Blackscholes is a benchmark from the PARSEC benchmark suite [17]. It performs option pricing with the Black-Scholes partial differential equation (PDE). It is coded in C and parallelized using OpenMP.

Chapter 3

Related Work

This chapter reviews previous work on methods and tools that measure and analyze program executions on parallel architectures. Section 3.1 discusses performance tools that analyze bottlenecks in OpenMP programs. Sections 3.2 and 3.3 review measurement-based and simulation-based tools and methods that identify performance losses in memory hierarchies, respectively. Section 3.4 describes previous work on performance problems in NUMA architectures along with tools and optimization techniques to help address them. Finally, Section 3.5 distinguishes our approaches from existing work.

3.1 Performance Tools for OpenMP Programs

Tools for profiling and tracing OpenMP programs use either instrumentation or sampling as their principal measurement approach.

Examples of instrumentation-based tools are TAU [121], ompP [52], Scalasca [140], Vampir [60], and Periscope [55]. Recently, Score-P [99] was developed to provide an integrated instrumentation and measurement infrastructure for TAU, Scalasca, Vampir, and Periscope. These tools use the OPARI [102] source-to-source instrumentation tool to insert monitoring code around OpenMP constructs and the POMP [102] monitoring API to support measurement. POMP was proposed as a potential standard tool API for OpenMP [101]; however, it was not embraced by the OpenMP standards committee. Concerns about POMP include its reliance on instrumentation, the complexity of the API, and its runtime overhead. IBM provides a nearly complete implementation of the POMP API as part of their High Performance Computing Toolkit (HPCT) [33] and uses binary rewriting to add POMP calls to application executables. Due to concerns about overhead, IBM does not support the POMP API in optimized versions of their OpenMP runtime.

In contrast, tools such as Intel Vtune Amplifier XE 2013 [66], PGI's PGPROF [137], and OpenSpeedShop [118] use sampling to monitor OpenMP programs with low measurement overhead. While Cray's CrayPat [38] principally uses sampling, it relies on Cray and PGI compilers to add tracepoints to record entry/exit of parallel regions and worksharing constructs. However, none of these tools pinpoint code regions that cause idleness or lock waiting in an OpenMP program. With the exception of PGPROF, which relies on special characteristics of PGI's OpenMP runtime, these tools cannot attribute performance information to user-level calling contexts and instead present an implementation-level view that separates a main thread from OpenMP worker threads.

Oracle Solaris Studio (OSS) [112] is a full-featured, sampling-based performance tool that supports analysis of OpenMP programs built using Oracle's OpenMP compiler and runtime [138]. A key component of Oracle's OpenMP runtime is an API that provides information to support sampling-based performance tools [71]. Using the API requires no compiler support. Oracle's API confers two important advantages to OSS. First, it enables OSS to collect intricate performance metrics like overhead, idleness, and work. Second, it enables OSS to attribute these metrics to full user-level calling contexts. A drawback of OSS, however, is that to collect profiles, OSS must record sample traces and then resolve full calling contexts from these traces during post-mortem analysis. Such traces quickly grow large as a program executes.

A significant difference between OSS and our work is the approach for measurement and attribution of waiting and idleness. OSS attributes idle or spinning threads to contexts in which waiting occurs rather than trying to relate the idleness back to its causes. For example, OSS does not blame sequential regions for thread idleness. Consequently, it does not highlight sequential regions as opportunities for improving program performance.

To address the generic problem of attributing performance losses to their causes, Tallent and Mellor-Crummey [132] proposed a method for performance analysis now known as blame shifting. Their approach is based on the insight that an idle thread is a symptom of a performance problem rather than the cause. They describe a strategy for sampling-based performance analysis of Cilk programs executed by a work stealing runtime. Rather than a thread attributing samples to itself when it is idle, it announces its idleness to working threads using two shared variables. These two variables count the number of idle and working threads separately. At any asynchronous sample event, a working thread reads both variables and computes its share of the blame, which is proportional to the number of idle threads divided by the number of working threads, for not keeping idle threads busy. We refer to this strategy as undirected blame shifting because a thread has no knowledge of which other thread or threads will share the blame for its idleness.
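A minimal sketch of this undirected blame-shifting accounting appears below; the counter names and the sample handler are hypothetical, and the real bookkeeping lives inside the runtime and profiler discussed later in this dissertation.

// Sketch of undirected blame shifting: idle threads advertise themselves in a
// shared counter, and each working thread that takes a sample charges itself a
// proportional share of blame for the idleness it failed to prevent.
#include <atomic>

static std::atomic<long> idle_threads{0};
static std::atomic<long> working_threads{0};

// Called by a thread when it starts or stops waiting for work.
void enter_idle() { working_threads--; idle_threads++; }
void leave_idle() { idle_threads--; working_threads++; }

// Per-thread metrics accumulated at asynchronous samples.
thread_local double work_metric = 0.0;
thread_local double idleness_blame = 0.0;

// Called from the sample (signal) handler of a *working* thread.
void on_sample(double sample_period) {
  long idle = idle_threads.load();
  long working = working_threads.load();
  work_metric += sample_period;
  if (working > 0 && idle > 0)
    // This thread absorbs its share of the idle threads' lost time:
    // sample_period * (idle / working).
    idleness_blame += sample_period * (double)idle / (double)working;
}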

Later, Tallent et al. [135] proposed a different blame shifting technique for lock spin waiting. In the extended technique, a thread waiting for a lock shifts blame for its waiting to the lock holder. We refer to this strategy as directed blame shifting, since an idle thread arranges to transfer blame to a specific thread. When OpenMP programs execute, various circumstances require either a directed or undirected blame shifting strategy to attribute idleness or lock waiting to its underlying causes.

3.2 Measurement-based Tools for Analyzing Memory Bottlenecks

Prior work on measurement-based data-centric profilers focuses on only part of the problem. The remainder of this section describes tools that pinpoint poor temporal and spatial cache locality bottlenecks.

Irvin and Miller were perhaps the first to recognize the importance of data for performance [70]. In response, they extended Paradyn [100] to support a data view. They used Paradyn to dynamically instrument an executable to measure the performance of semantic operations, such as collective communication primitives, and they used static analysis to associate measurements with data structures. When a user requested performance data for a particular array, Paradyn dynamically added instrumentation to collect it. Shortcomings of their approach included (1) their tool must know what semantic operations require instrumentation, (2) users must request performance data about particular arrays, (3) costs associated with data are measured using instrumentation rather than hardware counters, which means that details of hardware behavior (e.g., cache misses or latency) are unobservable, and (4) their tool presents performance data using time-line views rather than relating it to source code.

In the first work to employ asynchronous, event-based sampling (EBS) for data-centric analysis, Itzkowitz et al. introduced memory profiling to Sun ONE Studio [72]. They apply this tool to collect and analyze memory accesses in a sequential program and report measurement data in flat profiles. As described in Section 2.1, EBS has a shortcoming of "skid", which leads to inaccurate attribution of samples to instructions. Sun ONE Studio attempted to compensate for EBS attribution skid using a backtracking search. However, their backtracking algorithm only succeeds if the instruction responsible for an event and the interrupt PC received in the signal handler are in the same basic block. Shortcomings of their tool are that (1) it requires that applications be specially compiled to pad the generated code with nop instructions to improve their ability to map events correctly and (2) it does not collect calling context information to attribute costs in context. Without information about full calling context, one can't understand the relationship between data objects and program references when using complex, modular libraries such as PETSc [11].

Buck and Hollingsworth developed Cache Scope [24] to perform data-centric anal- ysis using Itanium2 event address registers (EAR). Cache Scope associates latency with data objects and functions that accessed them. Unlike HPCToolkit, Cache Scope does not collect information about calling context and only associates latency metrics with code at the procedure level rather than at the level of loops or individual source lines.

Rutar and Hollingsworth have developed variable blame analysis [116] to map profiling information provided by hardware performance counters to high-level data structures. They use data flow analysis to associate blame with program variables for costs that they observe at runtime using sampling. Shortcomings of their tool are similar to those of Sun ONE Studio: (1) it is susceptible to the skid inherent in event-based sampling, described in Section 2.1, and (2) their tool requires that executables be specially compiled with LLVM [76], which provides information necessary for their data-flow-based blame attribution. It is not clear how their tool reports blame; the authors don't describe a user interface and present the association between costs and data object names (without any association with code locations) in tabular form.

Tallent and Kerbyson extended HPCToolkit to support data-centric tracing [131]. However, their work focused on global arrays used in Partitioned Global Address Space (PGAS) programming models, rather than the heap-allocated or static variables used in a common program.

The PerfExpert tool [25, 124, 49], developed at the University of Texas, analyzes data collected by HPCToolkit. To identify data locality bottlenecks, PerfExpert measures program executions to compute local CPI and associate it with code sections such as statements, loops, and functions. PerfExpert requires multiple runs of a program to collect a large set of hardware performance metrics. It automatically analyzes these data and presents results in a form that is easy for users to understand. However, PerfExpert does not directly provide data-centric analysis; it relies on a simulation tool, MACPO, for detailed data-centric analysis; we discuss MACPO in the next section.

3.3 Simulation Methods for Memory Bottlenecks

We briefly describe related work from two areas, including tools leveraging reuse distance measurement to direct data locality optimizations, and algorithms for accelerating reuse distance measurement.

3.3.1 Performance Tools Supporting Reuse Distance Measurement.

Several tools measure and attribute reuse distance to assess data locality problems.

Several tools associate reuse distance measurements with source code to help guide data locality optimization. Beyls and D'Hollander developed Suggestions for Locality Optimizations (SLO) [15, 16] to diagnose temporal locality problems in a program's execution. SLO uses a modified GCC compiler [53] to instrument every memory access instruction to collect data reuse information. Another compiler-based tool is MACPO [113]. MACPO uses profile feedback from UT's PerfExpert [25, 124, 49] tool to instrument memory accesses in problematic code regions using the LLVM compiler infrastructure [90]. MACPO reduces instrumentation overhead by only analyzing a fraction of memory accesses. With compiler support, both SLO and MACPO can compute rich reuse information, including reuse distance, call paths, and the variables involved. However, as compiler-based tools, SLO and MACPO only collect reuse distance data from the binaries generated by their customized GCC and LLVM compilers, respectively. For code generated by Intel, IBM, or PGI compilers, which may optimize memory access patterns at compile time, SLO and MACPO cannot directly analyze memory access patterns.

The work most closely related to the approach using reuse distance simulation described in this dissertation is ThreadSpotter, a commercial tool based on the StatCache [13] and StatStack [47] simulators. Both StatCache and StatStack use the same statistical reuse pair measurement sampler to model caches with random and LRU replacement policies. Their sampler randomly selects a load or store instruction to monitor and sets up a watch point on the cache line touched by the memory access. These tools use watch points (implemented using page protection and single stepping) to trigger an interrupt when the same cache line is accessed later by a subsequent instruction. A monitored instruction and an instruction that subsequently accesses the same cache line form a reuse pair for performance optimization suggestions.

ThreadSpotter attributes reuse distance information for a reuse pair to the source code line associated with the second access in the pair. To help pinpoint which of the many variable accesses by a source line might be the problem, ThreadSpotter also indicates the source code line where the problematic memory location was last written; since a source line usually writes only a single variable, this often will be helpful [115].

A strength of ThreadSpotter is that it o↵ers some suggestions about how one might improve the memory hierarchy utilization by a program.

Marin and Mellor-Crummey developed a tool [94] to collect reuse distance infor- mation and attribute it to calling contexts. Their tool does not quantify the impact of data locality bottlenecks in an execution, which means that data locality optimiza- tion e↵orts may yield little improvement. Moreover, their tool uses a cache simulator to assess the memory hierarchy response for every memory access, which has high overhead.

Marin et al. developed MIAMI [93], which is a simulation tool for diagnosing ex- ecution performance problems. MIAMI is based on static binary analysis and online binary instrumentation to collect reuse distance information. It uses a machine inde- pendent model to derive metrics, quantify potential optimization gains, and analyze causes of performance losses. MIAMI can identify a variety of program performance bottlenecks, such as insucient instruction level parallelism, poor data locality, inef-

ficient memory prefetching, and resource contentions.

3.3.2 Algorithms for Accelerating Reuse Distance Measurement

Reuse distance [54] of a memory location is a useful method to quantify a program's data locality. It is defined as the number of unique cache lines accessed between the use and reuse of a memory location. For example, given a sequence of cache line accesses—a, b, c, b, a—the reuse distance for cache line a is 2. For a fully associative cache, if the reuse distance for a cache line is larger than the cache size, a capacity cache miss occurs. Used as a performance metric, a long reuse distance always means poor data locality. Unlike cache miss metrics, a reuse distance simulation can provide both the use and reuse instructions, which are helpful for diagnosing data locality bottlenecks. However, collecting full reuse distance histograms can slow a program execution by 1000× [16]. Several techniques have been developed to reduce this overhead. We classify these techniques into three categories: sampling, approximation, and parallelization.
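To make the definition concrete, the following didactic sketch computes reuse distances for the example access sequence above by counting the distinct cache lines touched between a use and its reuse. It is only an illustration of the metric; production tools use more efficient data structures (e.g., balanced trees) and the sampling techniques discussed next.

// Illustrative reuse distance computation over a trace of cache-line ids.
#include <cstdio>
#include <set>
#include <unordered_map>
#include <vector>

int main() {
  std::vector<char> trace = {'a', 'b', 'c', 'b', 'a'};   // example from the text
  std::unordered_map<char, size_t> last_access;          // line -> index of its last access

  for (size_t i = 0; i < trace.size(); ++i) {
    char line = trace[i];
    auto it = last_access.find(line);
    if (it == last_access.end()) {
      std::printf("%c: first access (infinite reuse distance)\n", line);
    } else {
      // Count distinct lines accessed strictly between the use and the reuse.
      std::set<char> distinct(trace.begin() + it->second + 1, trace.begin() + i);
      std::printf("%c: reuse distance %zu\n", line, distinct.size());
    }
    last_access[line] = i;
  }
  return 0;   // prints reuse distance 1 for b and 2 for a, matching the text
}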

Sampling techniques: Besides the shadow-profiling-based sampling used in this chapter, there are other sampling methods. An optimized version of SLO [15] uses reservoir sampling to monitor only selected accesses, which reduces measurement overhead from 1000× to 5×. MACPO periodically enables and disables instrumentation to collect reuse information during sampling windows, yielding overheads that range from 1.25×–5.52×. Zhong and Chang [143] describe a sampling-based approach that alternates between intervals of full monitoring and lightweight monitoring. They maintain and collect full reuse distance information only during full monitoring intervals. During lightweight monitoring intervals, they only maintain summary information that enables them to properly measure long-distance reuse that stretches beyond the end of an interval of full monitoring. This approach reduces reuse-distance measurement overhead to 2×–4×. Schuff et al. [117] describe a similar approach that supports analysis of multithreaded program executions. Our tool executes monitored intervals concurrently with unmonitored execution, which reduces the impact of monitoring on overall execution time. However, since our tool simulates multiple monitored intervals concurrently in separate processes, it is unable to correct for long-distance reuse beyond the end of a monitored interval as Zhong and Chang do.

Reuse distance approximation: Instead of collecting accurate reuse information with high overhead, some tools collect approximate information with low overhead. Shen et al. [120] use time to approximate the reuse distance instead of tracking memory accesses between instructions accessing the same cache line. This approximation yields a measurement overhead of 2×–5×. Ding and Zhong [44] use a tree compression strategy to approximate reuse distance rather than manipulating the complete balanced search tree used traditionally. Their tree compression strategy reduces measurement overhead from O(N log M) to O(N log log M), where N is the length of the execution and M is the size of program data.

Parallelization: Besides our work, several other researchers exploit parallel execution to accelerate reuse distance measurement using a divide-and-conquer approach. They divide the memory reference traces into small pieces and compute reuse distance information for each piece in parallel. Niu et al. [108] use a cluster to compute reuse distance for serial programs, leading to a 13×–50× speedup on 64 processors. Cui et al. [39] calculate reuse-distance information for serial programs on a GPU and obtain a 20× speedup.

3.4 Tools for Identifying NUMA Inefficiencies

Several performance tools provide support for analyzing NUMA performance issues in multi-threaded programs. These tools mainly use two kinds of methods: simulation and measurement. Simulation tools such as MACPO [113] and NUMAgrind [141] instrument programs to collect memory traces. NUMAgrind feeds memory traces into a cache simulator. The simulator simulates an architecture with NUMA memory hierarchies to analyze the memory traces. However, such simulation-based tools add overhead that increases a program's execution time by as much as 100×. To reduce overhead, MACPO traces a subset of memory accesses and analyzes them post-mortem without using a simulator. MACPO's instrumentation only increases a program's execution time by 3×–6×.

On the other hand, measurement-based tools can provide insights with low overhead. For example, Memphis [96] uses AMD instruction-based sampling (IBS) to capture remote accesses and associates them with static variables. It compares IBS with event-based sampling (EBS) using traditional hardware event counters, showing that IBS gives deeper insights into a program's NUMA bottlenecks. The author shows that IBS accurately attributes samples to instructions, unlike EBS. Moreover, IBS supports data-centric analysis, which can highlight problematic data objects that are accessed from different contexts. Compared to memory trace collection, IBS has much lower overhead.

MemProf [75], another measurement-based tool, also uses AMD IBS to measure a program's NUMA bottlenecks. MemProf associates NUMA metrics with heap-allocated variables and partially supports attribution to static variables by coarsely aggregating metrics incurred by static variables in the same load module (executable or shared library).

Besides NUMA profilers, some previous work improves NUMA locality and performance using support from libraries [21], compilers [92], and operating systems [36, 41].

These approaches aim to ameliorate NUMA problems to the greatest extent possible without source code changes. While these approaches require developers to use specific libraries, compilers, or OS kernels, our tool guides offline optimization of the source code, which yields better code that can be run anywhere without any restrictions.

3.5 Comparison with Our Approaches

This section compares our tools with previous work described in this chapter and distinguishes our approaches. First, Section 3.1 describes existing tools that measure and analyze OpenMP programs. These tools use instrumentation or sampling to collect performance data. Existing instrumentation-based tools have relatively high overhead. For example, Morris et al. showed that instrumenting every procedure entry and exit in the MADNESS [63] code slows down execution by 30× [103]. On the other hand, while existing sampling-based tools have low measurement overhead, they do not provide enough insights to guide code optimization. For example, Oracle Solaris Studio [112] supports lightweight measurement of OpenMP programs and constructs user-level calling contexts for sampled OpenMP parallel regions. However, it does not identify root causes of performance losses due to thread waiting at barriers or for locks. Tallent et al. [132, 135] developed two analysis methods to identify root causes of thread idleness in work-stealing programs and thread waiting for locks. However, these two methods are separate and are not designed for OpenMP programs. Moreover, Oracle Solaris Studio does not support construction of user-level contexts for statically linked binaries. In contrast, we developed a lightweight performance tool that addresses both of the aforementioned issues. The requirements for tool support that we identified in our research became the basis for a new tools application program interface for OpenMP. Our tool can identify root causes of performance losses in OpenMP programs using blame shifting and can attribute these losses to user-level calling contexts. We discuss these two approaches in Sections 4.1.1 and 4.1.2, respectively.

Second, Section 3.2 presents the shortcomings of existing measurement-based tools for analyzing memory performance bottlenecks: insufficient support for parallel programs and lack of analysis of memory bottlenecks in both deep cache hierarchies and multi-socket nodes. For example, some tools [72, 24] only analyze cache locality for programs running on a single-socket machine; and some tools [96, 75] only analyze NUMA problems for threaded programs. No existing tool supports comprehensive analysis of all locality problems for large-scale programs coded with both distributed- and shared-memory models. Our data-centric tool, described in Section 5.2, addresses all of these difficulties and provides efficient support for such hybrid parallel programs.

Third, Section 3.3 describes existing simulation-based tools for identifying memory bottlenecks. Their common drawback is high runtime overhead. To address this, our tool combines lightweight measurement and detailed simulation to reduce overhead but still provide rich information for analysis. The work most related to ours is ThreadSpotter [115]. It has modest measurement overhead to collect reuse distance during program executions. Unlike ThreadSpotter, our tool collects full calling contexts for reuse memory access pairs and supports data-centric attribution. We discuss these capabilities, integrated with our novel lightweight reuse distance tool, in Section 6.2.

Finally, Section 3.4 shows that existing tools for analyzing NUMA bottlenecks fall short of what we desire in one or more ways. Instrumentation-based tools suffer from relatively high overhead. For example, MACPO [113] samples a subset of memory accesses to reduce measurement overhead for NUMA problems. However, it still slows down program executions by 3×–6×. On the other hand, tools that employ sampling-based measurement have lower overhead, yet they do not provide the full range of insights we desire. Specifically, Memphis [96] and MemProf [75] use hardware performance monitoring units to collect samples related to NUMA events. Both tools can attribute performance metrics to both program code and data objects. However, they do not provide enough insights for optimization guidance. For example, they cannot directly tell programmers how to transform the code to optimize NUMA bottlenecks. To address this issue, our tool defines and computes derived metrics to quantify NUMA bottlenecks, described in Section 7.3. Section 7.4 describes how our tool attributes these metrics to code, data, and address ranges, which provides unique deep insights about NUMA bottlenecks by identifying important code sections, data objects, and access patterns. To guide code optimization for NUMA bottlenecks, our tool identifies data initialization points that determine data placement; this is described in Section 7.5.

Chapter 4

Identifying Bottlenecks in OpenMP Programs

Developing multithreaded programs for shared memory systems can be tedious, especially if one directly uses a primitive threading library. To simplify development of multithreaded programs, the high performance computing community has developed the OpenMP Application Program Interface (OpenMP API) [111]. The compiler directives, library routines, and environment variables collectively known as the OpenMP API define a portable model for shared-memory parallel programming.

OpenMP is the preferred model for adding threading to MPI applications for recent supercomputers based on highly-threaded processors.

Given the importance of OpenMP, it is necessary to develop effective tools that pinpoint and quantify causes of performance losses in OpenMP applications. To address that goal, our research aims to develop effective performance analysis strategies and techniques for OpenMP, prototype these strategies and techniques in a tool to evaluate their utility, and ensure that these strategies and techniques are embodied in a general OpenMP performance tools API that is suitable for inclusion in the OpenMP standard. This chapter describes the results of our research.

There are several challenges for building effective performance tools for OpenMP. First, a tool must have low runtime overhead. Otherwise, any measurements it records will be suspect. Second, one must attribute performance measurements back to the program. In previous work [2], we have shown that attributing performance metrics to full calling contexts in parallel programs is essential to understand context-dependent performance bottlenecks, such as synchronization delays. Third, in parallel programs, identifying symptoms of performance losses is easy; gaining insight into their causes requires a more sophisticated approach.

To address these challenges, we developed a novel framework for measurement and analysis of OpenMP programs. Our approach supports all OpenMP features including nested parallel regions, work-sharing constructs, synchronization constructs, and OpenMP tasks. We have developed two novel methods to measure and analyze the performance of OpenMP programs. First, we developed a measurement methodology that attributes blame for work and inefficiency back to program contexts. We show how to adapt prior work on measurement methodologies (blame shifting, see Section 3.1) to OpenMP programs. We extended these methods to support dynamic thread-level parallelism in both time-shared and dedicated environments. Furthermore, we describe a more space-efficient mechanism for handling locks. Second, we developed a novel deferred context resolution method that supports on-the-fly attribution of performance metrics to full calling contexts within an OpenMP program execution. This approach enables us to collect compact call path profiles for OpenMP program executions without the need for traces.

We demonstrate the utility of our tool in case studies of four well-known application benchmarks using a version of the GNU OpenMP (GOMP) implementation, augmented with lightweight instrumentation and an API to support our measurement approach. Using insights provided by our tool, we were able to significantly improve the performance of these codes.

The next section of the chapter surveys major OpenMP profiling tools and distinguishes their various approaches from our approach. Section 4.1 describes how we attribute blame for performance losses in OpenMP programs. Section 4.1 also explains our online deferred context resolution method for attributing performance metrics to full calling contexts in OpenMP programs. Section 4.2 discusses implementation issues for our tool. Section 4.3 demonstrates the effectiveness of our solution by studying four OpenMP benchmarks and identifying nontrivial improvement opportunities. Section 4.4 summarizes the conclusions of this chapter and describes our ongoing work.

4.1 Approach

Providing insight into the performance of OpenMP programs requires careful design and implementation of mechanisms for measuring and attributing execution costs.

To achieve low overhead, we use a measurement approach principally based on asynchronous sampling. The efficacy of sampling-based profiling for analyzing program performance is well known, e.g., [133]. A sampling-based performance tool can collect a flat profile that identifies source lines and routines where an OpenMP program spends its time without any special support from an OpenMP runtime system. Such an approach can identify symptoms of inefficiencies, such as spin waiting at barriers or for locks. However, providing insight into causes of inefficiencies in OpenMP programs requires designing tool measurement capabilities for this purpose and assistance from OpenMP runtime systems. We designed and prototyped lightweight instrumentation for OpenMP runtime systems that enables our tool to attribute idleness to causes. To make our approach attractive as a potential standard OpenMP tools API, important goals of our work were to minimize the runtime overhead of our instrumentation and developer effort needed to add it to any OpenMP runtime.

The following two sections elaborate our approach. Section 4.1.1 discusses how to attribute performance metrics to their causes for OpenMP programs. Section 4.1.2 describes how to attribute metrics to full calling contexts on-the-fly for OpenMP programs.

4.1.1 Attributing Idleness and Lock Waiting

Our tool measures and attributes an OpenMP thread's execution time into four categories:

• Idleness: The time a thread is either spin waiting or sleep waiting at a barrier for work. For example, OpenMP threads waiting at a barrier inside or outside a parallel region are idle. If a thread is not idle, the thread is busy.

• Work: The time a busy thread spends executing application code, including both serial and parallel regions.

• Overhead: The time a busy thread spends executing code in the OpenMP runtime system.

• Lock waiting: The time a thread spends spin waiting for locks.

These four metrics reflect different aspects of an OpenMP program execution. Depending on which metrics are prominent, different tuning strategies apply. If idleness dominates work attributed to a code region, that code region is insufficiently parallelized. If lock waiting associated with a code region is high, that code is associated with significant lock contention and the use of mutual exclusion in that code region needs review. If overhead for a code region is high but idleness is low, increasing the granularity of parallelism may reduce the overhead. If overhead is low and idleness is high, then decreasing the granularity of parallelism may reduce the idleness. Finally, if the overhead and idleness for a region are both high, then changing the granularity of parallelism will not help performance and the parallelization of the code region merits review.

To collect these four metrics, we modified a version of the GNU OpenMP runtime to explicitly maintain a state variable for each thread. When a thread takes a sample, it can query the runtime to determine which metric should be adjusted. It is worth noting that the sum of these four metrics represents the aggregate execution time across all threads.
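A minimal sketch of this mechanism follows; the enum, function names, and placement are illustrative and are not GOMP's actual internals.

    #include <atomic>

    enum class thread_state { WORKING, OVERHEAD, IDLE, LOCKWAIT };

    struct thread_info {
        std::atomic<thread_state> state{thread_state::WORKING};
        void* awaited_lock = nullptr;   // meaningful only while state == LOCKWAIT
    };

    thread_local thread_info self;

    // The runtime flips the state at well-defined points ...
    void enter_runtime()          { self.state = thread_state::OVERHEAD; }
    void enter_barrier_wait()     { self.state = thread_state::IDLE; }
    void begin_lock_wait(void* l) { self.awaited_lock = l; self.state = thread_state::LOCKWAIT; }
    void resume_user_code()       { self.awaited_lock = nullptr; self.state = thread_state::WORKING; }

    // ... and the profiler's sample handler queries it to decide which of the
    // four metrics (work, overhead, idleness, lock waiting) a sample increments.
    thread_state query_state()    { return self.state.load(std::memory_order_relaxed); }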

Finally, we use different techniques to attribute these metrics to their causes.

Since work and overhead belong to the threads performing them, we attribute both metrics to a thread receiving a sample. For idleness, we apply an undirected blaming technique to attribute it to busy threads, highlighting the fact that threads suffering from excessive work are the causes of idleness. For lock waiting, we apply a directed blaming technique to attribute it to lock holders to identify problematic locks.

The rest of this section discusses the challenges and our solutions for the undirected and directed blaming techniques in OpenMP programs, and then describes how we integrate undirected and directed blaming to attribute idleness and lock waiting in OpenMP programs.

Undirected Blaming for Idleness

As discussed in Section 3.1, previous work by Tallent and Mellor-Crummey [132] considers how to blame idleness among threads in a static thread pool. Their approach is insufficient for OpenMP, because OpenMP allows the number of worker threads to change dynamically during execution. In contrast, our undirected blaming technique supports attribution of idleness for OpenMP programs in the presence of dynamic thread-level parallelism in both time-shared and dedicated environments.

Equation 4.1 shows how we compute idleness for a program context in the presence of dynamic parallelism.

I_c = \sum_{k=1}^{s_c} I_{c,k}
    = \sum_{k=1}^{s_c} p \, \frac{i_{c,k}}{b_{c,k}}
    = \sum_{k=1}^{s_c} p \left( \frac{i_{c,k} + b_{c,k}}{b_{c,k}} - 1 \right)
    = \sum_{k=1}^{s_c} p \left( \frac{t_{c,k}}{b_{c,k}} - 1 \right)        (4.1)

Ic is the total idleness attributed to a specific context c in the program over time. Ic is measured in processor cycles or microseconds and represents the aggregate idleness across all threads while one or more threads executed code in context c. Ic,k is the idleness attributed to context c for the kth sample. p is the sampling period, which is constant during program execution. sc is the total number of samples taken in the context c. ic,k is the number of idle threads in the system when sample k occurs in context c. bc,k is the number of busy threads in the system when sample k occurs in context c. Like Tallent and Mellor-Crummey [132], we apportion blame for instantaneous idleness in the system among the busy threads at the time. tc,k, the sum of ic,k and bc,k, is the total number of threads considered by our undirected blaming technique in the context c when sample k is taken. Equation 4.1 shows how we transform our idleness method to compute total idleness for a context by replacing the need for ic,k, a measure of instantaneous idleness, with tc,k, a measure of instantaneous thread-level parallelism. After the transformation, there is no need for our tool to maintain a shared variable reflecting the number of idle threads in the system.

Note, however, that choosing an appropriate value for tc,k depends on the environment. To see why, consider the appropriate action concerning sequential code occurring before the first OpenMP parallel region. In a time-shared environment, an OpenMP program should not be charged for idleness outside a parallel region. In contrast, in a dedicated environment, sequential code should be charged for the idleness of allocated cores even if threads have not yet been allocated to run on them. In the text that follows, we elaborate the details of how our tool automatically chooses the appropriate value of tc,k, depending on the environment.

For a time-shared environment, we should compute the idleness incurred by the OpenMP threads required by each code region. If a code region in an OpenMP program does not use all hardware threads, these threads are available to other co-running programs and the OpenMP program should not be charged for unused threads. Thus, a sequential code region of an OpenMP program is not charged for idleness because it requires only one thread and keeps it busy. To handle this case, in a time-shared environment, we let tc,k be the number of threads in use when sample k is received in context c. In this case, tc,k varies with both c and k. Fortunately, OpenMP provides an API to query the number of threads in use at any time. Thus, we can compute Ic,k eagerly at each sample k and accumulate it into Ic.
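A minimal sketch of the per-sample computation for the time-shared case, assuming the profiler can obtain the number of busy threads from the runtime (the accumulator and parameter names are illustrative; omp_get_num_threads() reports the size of the current team, so serial regions report one thread and accrue no idleness):

    #include <omp.h>

    struct context_metrics { double idleness = 0.0; };

    void on_sample(context_metrics& ctx, double sampling_period, int busy_threads) {
        int threads_in_use = omp_get_num_threads();   // t_{c,k}: threads in use at this sample
        if (busy_threads > 0) {
            // I_{c,k} = p * (t_{c,k} / b_{c,k} - 1), accumulated eagerly per Equation 4.1
            ctx.idleness += sampling_period *
                            (static_cast<double>(threads_in_use) / busy_threads - 1.0);
        }
    }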

For a dedicated environment, we propose two ways to choose tc,k values in order to compute two different idleness metrics, each reflecting inefficiency from a different viewpoint. To compute core idleness, we set tc,k to be the total number of cores* in a dedicated system. This setting blames working threads in an OpenMP program if a core in the system is unused. For this case, the value of tc,k is constant during execution and can be read from system configuration information. Alternatively, to compute thread idleness, we want tc,k to be the maximum number of threads ever in use during a program's execution. This idleness metric is useful when assessing the scalability of an OpenMP program on different numbers of threads. In this case, the maximum number of threads used by a program will be unknown until the program terminates because an OpenMP program can dynamically adjust its thread count using an API. To handle this case, we add two callbacks to the OpenMP runtime system so our tool is notified when a thread is created or destroyed. In these callbacks, our tool maintains an instantaneous global thread count using atomic increment/decrement. During execution, our tool records the maximum instantaneous thread count observed. When computing thread idleness, tc,k represents the maximum instantaneous thread count during execution. Since our tool won't know its value until the end of an execution, we separately accumulate the two terms of Equation 4.1, p Σk 1/bc,k and p sc, for each context. At the end of an execution, we compute both core and thread idleness metrics for each context c according to Equation 4.1 by simply multiplying p Σk 1/bc,k by the appropriate value of tc,k, which is constant for all k samples in both cases, and subtracting the second term, which is proportional to the number of samples in c. Both of the aforementioned methods attribute idleness to sequential code regions in OpenMP programs, but relative to different baselines. In our case studies, we use the thread idleness metric to identify performance bottlenecks for benchmarks running in a dedicated environment.

*Alternatively, one could do this for SMT hardware thread contexts.
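A sketch of the two thread-lifecycle callbacks is shown below; the callback names are illustrative and are not part of any standard OpenMP interface.

    #include <atomic>

    std::atomic<int> live_threads{0};
    std::atomic<int> max_threads{0};

    void on_thread_create() {
        int now = live_threads.fetch_add(1) + 1;
        // Record the maximum instantaneous thread count observed during the run.
        int prev = max_threads.load();
        while (now > prev && !max_threads.compare_exchange_weak(prev, now)) { /* retry */ }
    }

    void on_thread_destroy() {
        live_threads.fetch_sub(1);
    }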

Directed Blaming for Lock Waiting

Tallent et al. [135] developed a version of directed blaming for pthread programs, described in Section 3.1. Their method, which uses function wrapping to override pthread locking primitives, is not applicable to OpenMP runtime systems for three reasons. First, OpenMP runtimes have several different mutual exclusion implementations: locks, critical, ordered, and atomic sections. There is no defined interface that a tool can wrap. Second, efficiently wrapping a locking primitive requires intimate knowledge about the representation it uses. However, OpenMP runtime vendors are unwilling to commit to a fixed lock representation—they want the freedom to use whatever representation they deem best for a particular architecture because the choice of lock representation may affect a program's speed and scalability. Third, the approach in [135] associates extra state with each live lock to store blame information. Here, we describe a new approach for directed blaming for OpenMP programs that efficiently supports directed blaming for all mutual exclusion constructs in an OpenMP runtime system. Furthermore, it requires only modest additional space for managing blame; namely, proportional to the maximum number of concurrently contended locks rather than the maximum number of live locks.

Our directed blaming technique works in three parts: lock acquisition, asynchronous samples, and lock release. To minimize the overhead of lock acquisition, we do not require that any code be added to the critical path of an uncontended lock acquisition in an OpenMP runtime. Only if a lock acquisition fails and a thread is forced to wait does our tool record the lock we are trying to acquire in a thread-local variable. In this case, recording the lock will have little or no effect on execution time since this operation is done only when a thread begins to wait.

When a thread receives an asynchronous sample, our performance tool's signal handler queries the thread state maintained in the OpenMP runtime. Besides returning the thread state, if the thread is currently in the lockwait state, the state query API we defined for the OpenMP runtime will also return the address of the lock that the thread is awaiting. If the thread is in the lockwait state, the signal handler uses the lock address as the key for a hash table. Our tool uses a hash table of fixed, modest size, e.g., 1024 entries, to maintain a map of blame that must be charged to the thread holding any particular lock.† The hash table is initialized when our tool is

†The idea of only maintaining blame for actively contended locks in a hash table is due to Alexandre Eichenberger (IBM T.J. Watson Research Center).

first attached to the program. When the signal handler tries to accumulate blame for a lock in the hash table, it allocates an entry for the lock in the table on demand if one doesn't already exist. Hash table entries are tagged with the lock address. If two held locks hash to the same entry, blame for one will be discarded. When monitoring is enabled and a lock release callback is registered with the OpenMP runtime, then every lock release will invoke the callback with the lock address as an argument. At this point, we look up the lock in the hash table, accept any blame that has accumulated in the table for the lock, and reset the table entry's state to free so that it is available to be reallocated on demand. If the blame accumulated in the table is greater than 0, then our tool unwinds the call stack and attributes the accumulated blame for lock waiting to the calling context of the lock release.

When designing the protocol for concurrently maintaining the hash table to accumulate blame, we considered two types of implementations: a heavyweight algorithm that carefully handles any data race that might arise when two locks hash to the same table entry, and a lighter weight implementation that doesn't use atomic operations to manage data races. The heavyweight approach uses atomic operations and has higher runtime overhead. The lighter weight method might cause a lock to receive additional blame or some blame to be lost. The odds of this happening in practice are extremely small. As a result, we opted for the lighter weight protocol. It works exceptionally well in practice because (a) the number of locks actively contended at the same time is much smaller than the size of the hash table, so collisions are rare, and (b) only locks with very large sample counts charged to them are worthy of attention by an analyst. While data races can change sample counts slightly, infrequent errors of this sort are extremely unlikely to change an analyst's assessment as to whether a particular lock is a problem or not.
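A sketch of the blame table under this lighter weight protocol; the table size, hash function, and routine names are illustrative, and, as discussed above, updates are deliberately non-atomic, so a rare collision or race may lose or misattribute a little blame.

    #include <array>
    #include <cstdint>

    struct blame_entry { std::uintptr_t lock = 0; std::uint64_t blame = 0; };
    static std::array<blame_entry, 1024> blame_table;

    static blame_entry& entry_for(std::uintptr_t lock_addr) {
        return blame_table[(lock_addr >> 4) % blame_table.size()];
    }

    // Sample-handler path: a thread in the lockwait state charges one sample of
    // blame toward whoever holds the lock it is awaiting.
    void charge_blame(std::uintptr_t awaited_lock) {
        blame_entry& e = entry_for(awaited_lock);
        if (e.lock == 0) e.lock = awaited_lock;      // allocate the entry on demand
        if (e.lock == awaited_lock) e.blame += 1;    // a colliding lock's blame is dropped
    }

    // Lock-release callback path: the releasing thread accepts accumulated blame
    // and frees the entry so it can be reallocated on demand.
    std::uint64_t accept_blame(std::uintptr_t released_lock) {
        blame_entry& e = entry_for(released_lock);
        std::uint64_t samples = 0;
        if (e.lock == released_lock) {
            samples = e.blame;
            e.blame = 0;
            e.lock  = 0;
        }
        return samples;   // if nonzero, unwind and attribute to the release context
    }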

Integrating Directed and Undirected Blaming

When an OpenMP program has threads in both idle and lockwait states at the same time, it suffers performance losses from both thread idleness and lock contention.

To handle this case properly, the directed and undirected blaming techniques need to coordinate. The lockwait metric needs no special care because we can always attribute it directly to the context of the lock holder. However, the attribution of idleness in such circumstances deserves careful attention.

Consider an OpenMP program execution at a point in time when there are ni threads idle, nw threads working, and nl threads waiting for locks. There are three reasonable ways to consider attributing blame for idleness among these threads. First, we consider using undirected blaming to charge the idleness of the ni idle threads only to the nw working threads. Any working thread that receives an asynchronous sample in this execution state charges itself for ni/nw units of idleness. This blaming scheme highlights code regions being executed by working threads as contributing to the idleness of other threads, but it ignores threads holding locks as deserving extra attention. The second method uses undirected blaming to charge an equal fraction of idleness to each thread that is working or waiting for a lock. This method attributes ni/(nw + nl) idleness to the context that receives an asynchronous sample in a working or waiting thread. This scheme charges idleness not only to the context of each working thread that receives an asynchronous sample, but also to the context of any thread waiting for a lock. The rationale of this scheme is that an idle thread is waiting for all working and waiting threads. Note, however, that each thread waiting for a lock is itself waiting for a lock holder. Our third method uses the undirected blaming technique to charge idleness to threads in both the working and lock waiting states, and then uses directed blaming to charge idleness inherited by a thread waiting for a lock to its lock holder. For example, if all nl waiting threads are spinning at the same lock, this method attributes ni/(nw + nl) idleness to each context that receives an asynchronous sample in a working thread, ni·nl/(nw + nl) idleness to the lock release, and no idleness to the context in each waiting thread. We choose the third of these methods because it transfers any blame for idleness received while waiting for a lock directly to the lock holder.
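As a purely hypothetical illustration of the three schemes, suppose a sample is taken when ni = 4 threads are idle, nw = 2 are working, and nl = 2 are spinning on a single lock. The first scheme charges each sampled working thread 2 units of idleness; the third charges each sampled working thread 1 unit and transfers 2 units to the context of the lock release:

    \frac{n_i}{n_w} = \frac{4}{2} = 2, \qquad
    \frac{n_i}{n_w + n_l} = \frac{4}{2 + 2} = 1, \qquad
    \frac{n_i \, n_l}{n_w + n_l} = \frac{4 \cdot 2}{2 + 2} = 2 .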

4.1.2 Deferred Context Resolution

Attributing metrics to full calling contexts is essential for understanding the performance of parallel programs and the underlying causes of inefficiencies. Consequently, interpreting the runtime stack for both master and worker threads in an OpenMP program is paramount. Most OpenMP runtimes manage master and worker threads separately. In fact, to our knowledge, only PGI's OpenMP runtime uses a cactus stack to readily provide both master and worker threads with a unified call stack view.

To illustrate the problem tools face interpreting call stacks of OpenMP threads, consider Figure 4.1, which shows the call stacks of several threads in the presence of nested parallelism. When a worker thread in an outer parallel region encounters nested parallelism, it becomes a master thread for the inner region. Threads other than the master thread lack full calling context for computations they perform. Assembling the complete user-level calling context for work in a parallel region requires information about the calling context for the region itself, as well as that of any enclosing regions. Thus, to assemble the full calling context for region 3 being executed by the level 3 worker, the level 3 worker requires the location in region 3 from its own stack, the context in which region 3 was invoked by the level 2 sub-master thread, the context in which region 2 was invoked by the level 1 sub-master thread, and finally the context in which region 1 was invoked by the master thread.

Figure 4.1 : The call stack for master, sub-master and worker threads in an OpenMP program with three levels of nested parallelism. As shown, stacks grow from bottom to top. The calling contexts for the parallel regions shown in grey are scattered across two or more threads.

A straightforward way to support assembly of distributed context information involves the following steps when encountering a parallel region:

1. Create a unique ID for the parallel region instance.

2. If the master thread encounters a parallel region, unwind its call stack to capture the region's creation context and store the resulting context in a region table where it can be looked up by region ID.

3. If a sub-master thread or a worker encounters a new parallel region, unwind its call stack to the parallel region at the base of its stack, look up the calling context for that parallel region in the region table, assemble the full calling context of the new region instance, and register it in the region table.

When a thread receives an asynchronous sample as the program runs, the thread merely needs to unwind its own call stack, determine the parallel region at the base of its stack, look up the prefix of its calling context in the region table, and assemble the final result. To support this approach, the OpenMP runtime system needs to supply a callback at the entry of each parallel region instance and an API that the tool can use to query a parallel region instance ID.
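A minimal sketch of the region table this straightforward scheme relies on (the data structure and names are our illustration, not HPCToolkit's implementation; a "context" is reduced to a vector of return addresses):

    #include <cstdint>
    #include <mutex>
    #include <unordered_map>
    #include <vector>

    using Context = std::vector<void*>;

    class RegionTable {
        std::unordered_map<std::uint64_t, Context> table_;  // region instance ID -> creation context
        std::mutex mutex_;
    public:
        // Called when a (sub-)master thread enters a parallel region: 'prefix' is
        // the full calling context in which the region instance was created.
        void register_region(std::uint64_t region_id, Context prefix) {
            std::lock_guard<std::mutex> g(mutex_);
            table_[region_id] = std::move(prefix);
        }
        // Called from the sample handler: prepend the enclosing region's context to
        // the suffix unwound from the sampled thread's own stack.
        Context assemble(std::uint64_t region_id, const Context& suffix) {
            std::lock_guard<std::mutex> g(mutex_);
            Context full = table_[region_id];
            full.insert(full.end(), suffix.begin(), suffix.end());
            return full;
        }
    };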

Capturing the context using the outlined approach, however, is more expensive than it needs to be. The overhead of this approach can be reduced in some cases.

If a parallel region is small and worker threads in the region run quickly enough so that no asynchronous samples are taken while they are executing work in the region, then there is no need to recover the full calling context for the region. To avoid the cost of context capture when no worker thread has been sampled in the region, Itzkowitz [71] proposed a deferred context resolution technique that is used in Oracle Solaris Studio [112].

Rather than capturing the calling context for a new parallel region instance as it is entered, Oracle Solaris Studio defers capturing context until the end of the region instance. They maintain a global map that indicates whether a thread received an asynchronous sample within a parallel region instance. If not, then one can skip capturing the region's context. While deferred context resolution can dramatically reduce measurement overhead for small but frequently called parallel regions, it introduces a new problem: a thread can no longer immediately assemble the full context for its work when it receives an asynchronous sample. The context for any enclosing parallel regions will only become available after the regions exit. In Oracle Solaris Studio, Itzkowitz et al. cope with this by recording available partial context information for each sample in a trace. Then, in a post-mortem phase, full context is reconstructed using context for enclosing parallel regions that later becomes available. Supporting deferred context resolution places an additional constraint on an OpenMP runtime system—a callback from region exit is required.

To avoid the need for large traces and post-mortem context assembly when using deferred context resolution, we devised an approach for online deferred context resolution. Whenever the master thread receives an asynchronous sample, it unwinds its call stack and enters its call path into the reference calling context tree (CCT).

Whenever an OpenMP worker thread receives an asynchronous sample, the thread queries its parallel region ID. Its full calling context won't be known at this point because the context of the enclosing parallel region won't be known until after the region exits. Let us denote the call stack suffix the worker thread recovers at this point as s. Since the calling context for the enclosing parallel region is not yet known, s will have a placeholder at its base associated with its enclosing parallel region instance ID r1. The thread next records s in a thread-private CCT rooted at r1. This CCT is stored in a thread CCT map indexed by r1. When the thread receives a subsequent sample, it again unwinds its call stack. If it is still in the parallel region instance with ID r1, it adds its new call stack suffix to its calling context tree rooted at r1 in the thread CCT map.

When a master or sub-master thread exits a parallel region r, it checks the region instance sample map to see if a sample was processed by any worker or sub-master thread in that parallel region instance. If so, the thread unwinds its call stack to acquire the calling context s2 of the parallel region instance. s2 is entered in the resolved region map indexed by the parallel region ID r.

When a worker or sub-master thread receives an asynchronous sample, if it finds itself in a parallel region different from the one for the previous sample, it inspects the thread CCT map to see if any context information for its CCTs is available from the resolved region map. If so, it prepends context information from the resolved region map to its relevant CCTs. Any CCT whose context is now fully resolved is taken out of the thread CCT map and placed into the reference CCT. When a thread exits or a process terminates, all remaining deferred contexts are resolved and any remaining information in the thread CCT map is folded into the reference CCT.
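A schematic of the bookkeeping involved is sketched below; all names are illustrative, and each per-thread CCT is simplified to a single call path for brevity.

    #include <cstdint>
    #include <map>
    #include <set>
    #include <vector>

    using Context  = std::vector<void*>;
    using RegionId = std::uint64_t;

    struct DeferredResolutionState {
        // Region instances in which some worker or sub-master thread took a sample.
        std::set<RegionId> region_instance_sample_map;
        // Region instance ID -> creation context, filled in when the region exits.
        std::map<RegionId, Context> resolved_region_map;
        // Per thread (thread-local in the real tool): region instance ID ->
        // partial CCT rooted at a placeholder for that region.
        std::map<RegionId, Context> thread_cct_map;
    };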

We compare our online method for deferred context resolution with the offline method used by Oracle Solaris Studio using the LULESH application benchmark, which is one of our case studies described in Section 4.3.2. For a 48-thread execution of LULESH using the same program input and sampling rate, Oracle Solaris Studio records a 170MB performance data file to perform deferred context resolution post-mortem. Using our online deferred context resolution strategy, our tool records a compact profile for each thread, 8MB total—a reduction in data volume of more than 20×.

Like Oracle Solaris Studio, our prototype tool distinguishes OpenMP runtime frames on the call stack from others by simply identifying that they belong to the shared library for the OpenMP runtime. For the OpenMP standard, a solution suitable for statically-linked programs is needed as well. We are part of the OpenMP standards committee working group designing a new tools API that provides the necessary support. The new API has the OpenMP runtime record stack addresses for procedure frames that enter and exit the OpenMP runtime. When one uses call stack unwinding, this interface will enable one to identify sequences of procedure frames on the stack that belong to the OpenMP runtime rather than user code, even for statically linked programs. We are in the process of evaluating this design to provide feedback to the OpenMP tools API subcommittee.

A note about context resolution for OpenMP tasks. OpenMP 3.0 tasks [9] support work-stealing execution. With respect to context resolution, this introduces an additional consideration. Since task creation may be separated from task execution, tasks have an additional component to their context—the creation context.

To record the creation context of a task, we add an 8-byte field to the OpenMP task data structure maintained by the runtime system. Once a non-immediate-run task is created, a callback function is invoked to pass the 8-byte field to the performance tool. The performance tool unwinds the call stack to collect the task creation context in the callback function invoked at each task creation point. It then updates the 8-byte field in the task structure with the task creation context. When a task is executed, the performance tool uses two query APIs to get the task creation context and the root frame of the executing task. We combine the two to get the full context of the task. One complicated but practical case is that one task can be nested in another task. We can concatenate the outer task creation context with the inner task creation context to resolve the full context of the inner task. However, if the program recursively creates nested tasks, the call site of inner-most tasks may be deep in the call stack. Long recursive call paths aren't typically of much interest to a performance analyst. For that reason, we collapse the call stack for recursive tasks. If the root frame of one task is the same as that of its parent, we fold the performance data of the task into its parent and skip its frames on the call stack.
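A sketch of the per-task field and creation callback, using hypothetical names (this is not GOMP's or OMPT's actual interface):

    #include <cstdint>

    struct task_tool_data { std::uint64_t creation_context = 0; };

    // Hypothetical tool routine: unwind the creating thread's stack, record the
    // creation context in the tool's CCT, and return an identifier for it.
    std::uint64_t record_creation_context() { return 0; /* placeholder */ }

    // Invoked by the runtime at task creation: stash the creation context in the
    // task's 8-byte field so it is available wherever the task later executes.
    void on_task_create(task_tool_data* slot) {
        slot->creation_context = record_creation_context();
    }

    // Queried when the task executes (possibly on another thread); combined with
    // the root frame of the executing task to form the task's full context.
    std::uint64_t creation_context_of(const task_tool_data* slot) {
        return slot->creation_context;
    }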

Incorporating the creation context into the complete context for every OpenMP task is expensive. The overhead of eagerly gathering this information cannot be tied to the sampling rate, and therefore it is uncontrolled. Consequently, we do not employ creation-extended task context construction by default. Rather than providing a full creation context for a task by default, we simply associate the task with its enclosing parallel region. If a performance analyst needs the full creation context for tasks, our tool will collect it upon request, though this extra information increases monitoring overhead. Our case studies in Section 4.3 did not require full creation contexts for tasks to provide performance insights.

4.2 Tool Implementation

To evaluate our ideas for measurement and analysis of OpenMP program executions, we implemented them in the HPCToolkit [2] performance tools, which are introduced in Section 2.3. We extended HPCToolkit's existing call path profiler to support the techniques described in the previous section. Using these techniques, our extended HPCToolkit is able to analyze the performance of OpenMP and OpenMP+MPI programs by both collecting measures of work, idleness, overhead, and lock waiting, and then attributing these measures to full calling contexts in a unified user-level view. To analyze performance data, we use HPCToolkit's hpcviewer graphical user interface. hpcviewer associates both measured and derived metrics with full calling contexts, thereby presenting a unified view reconstructed using our online deferred context resolution strategy.

The implementation of our techniques, however, requires support from the OpenMP runtime system. Consequently, we took special care to minimize the required support. Our initial prototype was based on an open-source OpenMP implementation—GNU OpenMP (GOMP) [56]. To support both our blaming strategies and online deferred context resolution, we modified GOMP source code to insert necessary callbacks, query functions, and data structures associated with parallel regions and tasks. Our modification encompassed 5 files and less than 50 lines of code. The software development cost to add the necessary OpenMP runtime support for our methods is quite low. Also, as we show in Section 4.3, the runtime impact of our proposed support is similarly low.

To quantify the improvement one can obtain after optimization, we compute two derived metrics from each raw metric we defined in Section 4.1.1. For example, we derive absolute idleness and relative idleness metrics from the raw idleness metric.

Absolute idleness quantifies the idleness in a given context relative to the entire effort of the program. It shows the maximum possible (percentage) improvement if the idleness is eliminated for that context. Relative idleness reflects the parallel efficiency for the context. A high relative idleness value means the context is making poor use of parallel resources. A good rule of thumb is to focus optimization efforts on contexts with both high absolute idleness and high relative idleness.

Note that both derived metrics are computed after the raw data has been collected.

The defining equations are shown in Equation 4.2. The notation key in Equation 4.2 is as follows: Ic,abs is the absolute idleness computed for context c; Ic,rel is the relative idleness computed for context c; Ic, Wc, Oc, and Lc are idleness, work, overhead, and lock waiting, respectively, attributed to the context c; Ir, Wr, Or, and Lr are idleness, work, overhead, and lock waiting, respectively, aggregated for the whole program.

I_{c,abs} = \frac{I_c}{I_r + W_r + O_r + L_r} \times 100\%

I_{c,rel} = \frac{I_c}{I_c + W_c + O_c + L_c} \times 100\%        (4.2)

It is worth noting that analogous derived metrics exist for work, overhead and lock waiting as well.
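For concreteness, a small sketch of how the derived idleness metrics of Equation 4.2 can be computed from the raw metrics (the struct and function names are illustrative, not HPCToolkit's actual code):

    // One instance holds the four raw metrics for a context or for the whole program.
    struct raw_metrics { double idleness, work, overhead, lockwait; };

    double absolute_idleness(const raw_metrics& ctx, const raw_metrics& whole_program) {
        double total = whole_program.idleness + whole_program.work +
                       whole_program.overhead + whole_program.lockwait;
        return 100.0 * ctx.idleness / total;        // I_{c,abs}
    }

    double relative_idleness(const raw_metrics& ctx) {
        double total = ctx.idleness + ctx.work + ctx.overhead + ctx.lockwait;
        return 100.0 * ctx.idleness / total;        // I_{c,rel}
    }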

4.3 Case Studies

In this section, we evaluate the utility and overhead of our measurement approach using a set of application benchmarks that employ OpenMP parallel loops with and without nested parallel regions, OpenMP tasking, and OpenMP in conjunction with MPI. We studied four benchmarks: AMG2006 [79], LULESH [77], NAS BT-MZ [73], and HEALTH [40]. These benchmarks are described in Section 2.4.

We compiled each of these benchmarks using the GNU 4.6.2 compiler with -O3 optimization and linked them with our modified version of the GOMP library from GCC 4.6.2. We measured the performance of these codes on a system with four 12-core AMD Magny-Cours processors and 128 GB memory. For AMG2006, we studied executions consisting of 4 MPI processes and 8 OpenMP threads per process. For LULESH, we studied executions with 48 OpenMP threads. For NAS BT-MZ, we studied executions that used 4 threads for the outer parallel region and 8 threads for inner parallel regions. For HEALTH, we studied runs on 8 OpenMP threads. We measured the performance of these codes using asynchronous sampling at a rate of 200 samples/s per thread.

We first evaluate the measurement overhead associated with our modified GOMP library and profiler. Table 4.1 compares the execution time of each benchmark code

Benchmark     native GOMP    modified GOMP    profiling
AMG2006       54.02s         54.10s           56.76s
LULESH        402.34s        402.56s          416.78s
NAS BT-MZ     32.10s         32.15s           34.23s
HEALTH        71.74s         72.20s           74.27s

Table 4.1 : Running times for our case study applications with (a) the unmodified GOMP runtime library (no sampling), (b) the modified GOMP library with lightweight instrumentation to support blame shifting (no sampling), and (c) the modified GOMP library while profiling at 200 samples per second per thread. Execution times represent the average over three executions.

using the native GOMP library, our modified GOMP library with performance measurement hooks, and the modified GOMP library with our profiler attached. The capabilities employed by our profiler include undirected, directed, and integrated blaming for idleness, work, overhead, and lock waiting, as well as online deferred context resolution for parallel regions, with task contexts resolved to their enclosing parallel region. By comparing the times in the first and second columns of the table, we see that our GOMP modifications to support performance tools add almost no overhead. Comparing the first and third columns shows that the measurement overhead using our profiler is less than 5% for each of these codes.

Note that if we resolve the full creation context for tasks in HEALTH, the run time of a profiled execution jumps from 74.27s to 431.48s. This is more than a 6× slowdown compared to its unmonitored execution. Resolving the full creation context for tasks in HEALTH is so expensive because HEALTH creates 17 million tiny tasks as it executes, with each task requiring a call stack unwind to recover its full creation context. To keep measurement overhead low, our profiler's default measurement approach is to only resolve task contexts to their enclosing parallel region.

Figure 4.2 : HPCToolkit’s time-centric view rendering of a complete execution of AMG2006 on 64 cores – 8 MPI ranks with 8 OpenMP threads/rank.

In the following four sections, we illustrate the effectiveness of our profiling approach by describing insights it delivered for each of our benchmark applications and how they helped us identify optimizations that yielded non-trivial improvements to these codes.

4.3.1 AMG2006

Figure 4.2 shows a time-centric view of traces of AMG2006 executing on 8 MPI ranks with 8 OpenMP threads per rank. An execution of AMG2006 consists of three phases: MPI initialization, problem setup, and solve.

During initialization, only the master thread in each MPI rank is active. The initial white space in the traces of OpenMP worker threads shows that they are idle before they are created. During the setup phase shown in Figure 4.2, the light gray color

Figure 4.3 : HPCToolkit’s time-centric view rendering of a single iteration of AMG2006’s solver on 64 cores – 8 MPI ranks with 8 OpenMP threads/rank.

for OpenMP worker threads represents worker threads blocked in pthread_cond_wait.

This view illustrates that OpenMP threads are idle most of the time; the largest component of that idleness occurs while the master thread in each MPI rank executes serial code for hypre_BoomerAMGCoarsen. Our code-centric view for the full execution (not shown) reports that 34.9% of the total effort in the execution corresponds to idleness while hypre_BoomerAMGCoarsen executes; the relative idleness our tool attributes to this routine is 87.5%—exactly what we would expect with one of 8 cores working within an MPI rank. Our tool's quantitative measure for idleness attributed to this serial code is a result of our blame shifting approach. Neither VTune nor Oracle Solaris Studio directly associates idleness with serial code. During the setup phase, short bursts of activity by OpenMP worker threads separate long gray intervals of idleness. However, the intervals of activity by the worker threads are of

Figure 4.4 : A calling context view of AMG2006’s solver phase. More than 12% of the execution time of all threads is spent idling in the highlighted parallel region.

unequal lengths, which indicates that even when threads are working, their work is imbalanced. While the setup phase in this benchmark execution accounts for a large fraction of the total time, the solve phase dominates in production computations.

For that reason, we focus the rest of our analysis on the performance of the solve phase. Figure 4.3 shows an expanded view of one iteration of the solve phase. The imbalanced intervals of gray indicate idleness caused by a severe load imbalance in OpenMP loops employed by hypre_BoomerAMGRelax.

To analyze the solver phase in more detail, we use HPCToolkit's code-centric hpcviewer to analyze measurement data collected for just the solve phase of the benchmark in an execution by 4 MPI ranks with 8 OpenMP threads per rank. Figure 4.4 attributes idleness, work, and overhead to the complete calling context of parallel regions.

Figure 4.4 shows only inclusive metric values. The 40.75% aggregate absolute idleness metric for the solve phase means that threads are working only about 60% of the time.

The OpenMP runtime overhead for the solve phase is negligible—only 0.02%. Drilling down the call path from main, we find that a large part of the idleness in the solve phase is attributed to a parallel region—hypre_BoomerAMGRelax._omp_fn.23, which is highlighted in Figure 4.4. Work by that routine (and routines it calls) accounts for 12.42% of the idleness in the solve phase. The relative idleness measure of 34.15% for this program context indicates that when this parallel region is active, roughly one third of the threads are idle. The discussion in Section 4.1.1 indicates that if overhead is low and idleness is high, one should reduce the granularity of parallel work to increase parallelism and improve load balance.

To apply this optimization, we examined the source code corresponding to hypre_BoomerAMGRelax._omp_fn.23 in the source pane of Figure 4.4. We found that this parallel region decomposes the computation into one piece of work for each thread and each thread is statically assigned a piece of work. Since each piece of work computes an equal number of iterations in the parallel loop, it would appear that each thread has an equal amount of work. However, the execution time of iterations differs significantly due to characteristics of the input dataset, resulting in load imbalance.

To reduce the load imbalance that causes the high idleness in this parallel region, we decomposed the work into a number of smaller pieces and used dynamic scheduling to balance the work among threads. Decomposing the work into five chunks per thread provided enough flexibility for dynamic scheduling to reduce load imbalance without adding too much overhead for the scheduling itself. The optimization reduced the idleness of hypre_BoomerAMGRelax._omp_fn.23 from 3.05e+11 to 4.53e+09 CPU

Figure 4.5 : A bottom-up view of call path profiles for LULESH. madvise, which is called while freeing memory, is identified as the routine with the most idleness.

cycles, reducing the running time of the solve phase by 11%.
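The change amounts to replacing the static one-chunk-per-thread decomposition with a dynamically scheduled loop; the sketch below shows the pattern only and is not the actual hypre source.

    // Illustrative only: split the iteration space into roughly five chunks per
    // thread and let the OpenMP runtime hand them out dynamically.
    void relax(double* u, const double* rhs, int n, int nthreads) {
        int chunk = n / (5 * nthreads);
        if (chunk < 1) chunk = 1;
        #pragma omp parallel for schedule(dynamic, chunk)
        for (int i = 0; i < n; ++i) {
            u[i] += rhs[i];   // placeholder for the real relaxation computation
        }
    }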

4.3.2 LULESH

In Figure 4.5, we analyze measurements from a 48-thread execution of LULESH using a bottom-up view which attributes metric consumption to routines and their calling contexts. We sorted the results to identify routines with the highest exclusive absolute idleness. The top routine is the madvise system call, which is called by free from both the application and the OpenMP runtime. The idleness associated with free amounts to 4.6% of the total thread execution time. The percentage of relative idleness indicates that only the master thread calls free and the other 47 threads are idle.

    for ( ... ) {
        malloc a[n], b[n], c[n];
        #pragma omp parallel for
        for (i = 0; i < n; i++) {
            ... initialize and use a[i], b[i], c[i] ...
        }
        free a, b, c;
    }

Figure 4.6 : Pseudo code showing how memory is allocated, initialized, and freed in LULESH.

It is worth noting that the performance tool reports that free accounts for only 0.1% of total execution time. Without idleness blame shifting, it would not appear that memory management is a potential performance problem. To see if we could improve performance by using a better memory allocator, we used Google's TCMalloc [57] (a high-performance threaded allocator) instead of glibc's implementations of malloc and free. Surprisingly, the execution time drops from 402s to 102s—a 75% improvement!

To understand the improvement, which was well beyond our expectations, we compared profiles of the original and optimized versions of LULESH. As we expected, the absolute idleness associated with free drops from 4.6% to less than 0.1%. However, we also found that the work associated with some parallel regions shrank by a factor of 20. Further examination showed that data is allocated immediately before entering these regions on each iteration of an enclosing loop. Pseudo code in Figure 4.6 shows the structure of these problematic parallel regions. Because glibc's free releases memory pages to the operating system, a subsequent malloc causes pages to be added back to the application's address space. As threads attempt to write these newly-allocated pages in the parallel region, they fault and stall as the operating system lazily zero-fills the pages. Repeatedly freeing, reallocating, and zero-filling pages is surprisingly costly. Unlike glibc, TCMalloc does not return deallocated pages to the operating system, so it avoids the cost of repeated zero-filling, leading to a 75% speedup.

While our idleness measure didn’t help us predict the full benefit of changing the memory allocator, it did uncover the “tip of the iceberg,” which helped us improve the performance of LULESH. The high overhead of zero-filling pages that we missed with our measurements was masquerading as work since it was incurred in parallel regions.

An analysis based on sampling measurements of real time and graduated instructions might have provided some additional insight into operating system overhead that was unobserved.

4.3.3 NAS BT-MZ

To evaluate HPCToolkit’s support for measurement and analysis of OpenMP pro- grams with nested parallel regions, we used it to study the performance of the

NAS BT-MZ benchmark. Figure 4.7 shows the unified calling contexts that HPC-

Toolkit recovers for nested parallel regions in BT-MZ. In the figure, the parallel region

MAIN . omp fn.3 is nested inside MAIN . omp fn.0,showingthatourmethodprop- erly reconstructs calling contexts for nested parallel regions. The aggregate absolute idleness is larger than absolute work,whichmeansthatthreadsareidlemorethan half of the time during execution. Examination of the inclusive idleness measure- ments for the calling context of the nested parallel regions shows that most of the idleness comes from the innermost region. To optimize BT-MZ, we refactored the exch qbc routine highlighted in Figure 4.7 because it has 1.55% absolute work but 60

Figure 4.7 : A calling context view of call path profiles for BT-MZ shows that the amount of idleness reported for exch_qbc is over seven times larger than the work it performs.

11.1% absolute idleness. The relative idleness for this routine is more than 87%.

Looking deeper along the call path inside the innermost parallel region, we see that pthread_create is responsible for idleness that amounts to roughly 3% of the overall execution cost. This cost is incurred because GOMP does not use a persistent thread pool for threads in nested parallel regions. Threads used in nested parallel regions are created at region entry and destroyed at region exit. For a nested parallel region in a loop, frequent thread creation and destruction can degrade performance.

Changing the inner parallel region inside exch_qbc to a sequential region eliminated the idleness caused by thread creation and destruction, improving performance by 8%.

Figure 4.8 : Bottom-up view of the HEALTH benchmark showing lock contention in the GOMP library.

4.3.4 HEALTH

To evaluate HPCToolkit’s support for measurement and analysis of programs based on

OpenMP tasking, we studied the HEALTH benchmark. In this execution, we used

HPCToolkit’s (low-overhead) default context resolution for tasks, which attributes tasks back only to their enclosing parallel region rather than their full creation context.

In experiments with HEALTH on eight threads, we found that its absolute idleness was much less than 1%, indicating that each thread was working on a parallel task most of the time. However, severe lock contention in the benchmark caused threads to spend much of their time spin waiting. Figure 4.8 shows a bottom-up view of 62 the lock waiting metrics for an execution of HEALTH. The absolute lock wait metric shows threads in HEALTH spent roughly 75% of their total execution time waiting for locks. As described in Section 4.1.1, blame for lock waiting gets attributed to the context where the lock holder releases the lock. Figure 4.8 shows blame being attributed up call paths from lock release to contexts where the lock was released.

The highly contended locks are released on line 201 of GOMP_task and line 348 of GOMP_taskwait. Locks released at these points account for significant waiting: roughly 41% and 33% of the thread execution time respectively. The OpenMP task sim_village_par._omp_fn.2 is the common caller of GOMP_task and GOMP_taskwait, where lock contention leads to significant waiting. The lock release highlighted in the source code pane shows the problematic lock, task_lock. Note that task_lock is used in the GOMP library, not the user’s code. task_lock is the lock used to manipulate the task queue created in a parallel region. It causes excessive contention in sim_village_par._omp_fn.2, as this routine recursively spawns 17 million tiny tasks and all threads contend for the lock to access the task queue. The task_lock used by sim_village_main_par._omp_fn.1 causes little contention, so there is no need to optimize this parallel region. The best way to reduce contention for the task queue inside sim_village_par._omp_fn.2 is to reduce the number of tasks by increasing task granularity. Another version of the program achieves this objective by using a cutoff to stop spawning tasks once granularity falls below a threshold. This improvement reduced program execution time by 82%, which is consistent with what our lock waiting measurements predicted.
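The cutoff can be sketched in C with OpenMP tasking; the village structure, the depth parameter, and the CUTOFF value below are illustrative stand-ins rather than the benchmark's actual code:

    #include <omp.h>

    #define CUTOFF 4                      /* illustrative granularity threshold */

    struct village { struct village *children, *next; /* ... */ };

    /* Hypothetical recursive traversal modeled on HEALTH's simulation.
     * Below the cutoff, children are processed sequentially, so far fewer
     * tasks are enqueued and contention on the runtime's task queue drops. */
    static void sim_village(struct village *v, int depth)
    {
        for (struct village *c = v->children; c != NULL; c = c->next) {
            if (depth < CUTOFF) {
                #pragma omp task firstprivate(c, depth)
                sim_village(c, depth + 1);
            } else {
                sim_village(c, depth + 1);     /* sequential: no new task */
            }
        }
        #pragma omp taskwait
        /* ... simulate village v itself ... */
    }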

app     | problem                                                                               | optimization                                                          | improvement
AMG2006 | high idleness in parallel regions                                                     | use dynamic scheduling to eliminate load imbalance                    | 11%
LULESH  | high idleness caused by the OS frequently zero filling allocated pages                | use Google’s tcmalloc threaded allocator to reduce page zero filling  | 75%
BT-MZ   | high idleness caused by frequent thread creation and exit in nested parallel regions  | eliminate unnecessary inner parallel regions                          | 8%
HEALTH  | high lock contention for the task queue in OpenMP runtime                             | coarsen the task granularity                                          | 82%

Table 4.2 : Summary of performance bottlenecks identified using our tool and our code optimizations in response.

4.4 Discussion

This chapter proposes mechanisms to support low-overhead, sampling-based performance analysis of OpenMP programs. We demonstrate that an implementation of these techniques in HPCToolkit provides deep insight into the performance of threaded program executions by measuring and attributing informative metrics including idleness, work, overhead, and lock waiting. Our OpenMP profiler employs online deferred context resolution to efficiently and accurately attribute these metrics to full calling contexts in compact profiles, avoiding the space overhead of traces required by prior tools. Reducing the space overhead is an important aspect of our strategy that will enable it to scale to large parallel systems. The case studies reported in this chapter validate the effectiveness and low overhead of our approach.

Table 4.2 indicates the problems we identified using our tool in each of the application benchmarks, summarizes the optimizations we applied to improve each program, and shows the percent improvement that we achieved.

We are presently working with the OpenMP standards committee to define a standard tools API for OpenMP, called OMPT. OMPT provides support for our blame shifting approach. This includes mandatory interfaces that enable us to attribute performance metrics to full calling contexts as well as optional interfaces to pinpoint and quantify root causes of thread idleness and lock waiting. Moreover, OMPT provides support sufficient for analysis of both statically and dynamically linked applications.

Currently, OMPT has been accepted as an official technical report [46] and will be sent to the OpenMP language committee for review. IBM has added support for a draft of OMPT to a new lightweight OpenMP runtime system for their XL compilers for Blue Gene/Q. Mellor-Crummey added initial support for OMPT to Intel’s OpenMP runtime. An implementation of HPCToolkit employs this draft interface to measure OpenMP programs executed by these runtimes.

In the future, we plan to generalize our approach for deferred context resolution to support assembly of global view call path profiles on a heterogeneous system that offloads computation from a host node populated with conventional multicore chips to attached manycore accelerators, such as Intel Xeon Phi chips.

Chapter 5

Analyzing Memory Bottlenecks with Lightweight Measurement

In modern processors, accesses to deep layers of the memory hierarchy incur high latency. Memory accesses with high latency not only degrade performance but also increase energy consumption. For many programs, one can reduce average memory latency by staging data into caches and accessing it thoroughly before it is evicted.

Access patterns that do so are said to have excellent data locality. There are two types of data locality common to both sequential and multithreaded programs: spatial and temporal. An access pattern exploits spatial locality when it accesses a memory location and then accesses nearby locations soon afterward. Typically, spatial locality goes unexploited when accessing data with a large stride or indirection. An access pattern exhibits temporal locality when it accesses individual memory locations multiple times.

Multi-socket systems have an extra layer in the memory hierarchy that poses an additional obstacle to performance. Processors in such systems are connected with a high bandwidth link, e.g., HyperTransport for AMD processors and QuickPath for Intel processors. To provide high aggregate memory bandwidth on multi-socket systems, memory is partitioned and some is directly attached to each processor. Each processor has its own memory controller to access directly-attached memory. Such an architecture has Non-Uniform Memory Access (NUMA) latency. Accesses to directly-attached memory are known as local accesses. A processor can also access memory attached to other processors; such accesses are known as remote accesses. Remote accesses have higher latency than local accesses, so each thread in a multi-threaded program must access local data for high performance. A thread that primarily performs local accesses is said to have NUMA locality.

Tools for identifying data locality bottlenecks use either simulation or measurement. Simulation-based data-centric tools, such as CPROF [81], MemSpy [95], ThreadSpotter [115], SLO [16, 15], and MACPO [113], instrument some or all memory accesses and compute the approximate memory hierarchy response with a cache simulator. There are two drawbacks to simulation. First, gathering information about memory accesses with pervasive instrumentation is expensive. Although sampling techniques for monitoring accesses, e.g., [117, 143], can reduce measurement overhead for instrumentation, they do so with some loss of accuracy. Second, the accuracy of simulation-based methods depends on the cache simulators they use. Since it is prohibitively expensive for cache simulators to model all details of memory hierarchies in modern multi-socket systems, the aforementioned tools make simplifying assumptions for simulators that reduce the tools’ accuracy.

Unlike simulation-based tools, measurement-based tools collect performance metrics using hardware performance counters. Measurement-based tools can directly assess the performance of a program execution with low overhead. Measurement-based tools can be classified as code-centric and/or data-centric. In the best case, code-centric tools such as VTune [66], Oprofile [82], CodeAnalyst [4], and gprof [58] map performance metrics back to statements accessing data. Such code-centric information is useful for identifying problematic code sections but fails to highlight problematic variables. Data-centric tools, on the other hand, attribute performance metrics to variables or dynamic memory allocations. A tool that combines both data-centric and code-centric analysis is more powerful for pinpointing and diagnosing data locality problems. Section 5.1 elaborates on the motivation for data-centric tools.

A variety of data-centric tools currently exist; we have discussed them in detail in Section 3.2. Some tools focus on data locality in sequential codes [85, 87, 24]; the others focus on NUMA problems in threaded codes [96, 75]; none of them supports comprehensive analysis of all kinds of data locality problems. Moreover, existing tools work on modest numbers of cores on a single node system; none of them tackles the challenge of scaling and is applicable across a cluster with many hardware threads on each node. Obviously, data-centric measurement tools must be scalable if they are to be used to study codes on modern supercomputers with non-trivial input data.

To address this challenge, we extended the HPCToolkit performance tools [2] with data-centric capabilities to measure and analyze program executions on scalable parallel systems. Our resulting tools have three unique capabilities for data-centric measurement and analysis.

• They report all data locality issues. HPCToolkit can measure and analyze memory latency in threaded programs running on multi-socket systems. Failing to exploit temporal, spatial, and NUMA data locality exacerbates memory latency.

• They work for large-scale hybrid programs that employ both MPI [98] and OpenMP [111]. HPCToolkit collects data-centric measurements for scalable parallel programs with low runtime and space overhead. Moreover, HPCToolkit aggregates measurement data across threads and processes in a scalable way.

• They provide intuitive analysis of results for optimization. HPCToolkit provides an intuitive graphical user interface that enables users to analyze data-centric metrics attributed to variables, memory accesses, and full calling contexts. It provides multiple data-centric views to highlight variables and attribute costs to instructions that access them.

HPCToolkit exploits hardware support for data-centric measurement on both AMD Opteron and IBM POWER processors. To evaluate the effectiveness of our methods, we employ our tools to study five parallel benchmarks, covering MPI, OpenMP and hybrid programming models. With the help of HPCToolkit, we identified data locality problems in each of these benchmarks and improved their performance using insights gleaned with our tools.

The rest of this chapter is organized as follows. Section 5.1 elaborates on the motivation for this work. Section 5.2 presents the design and implementation of our new capabilities in HPCToolkit. Section 5.3 studies five well-known benchmarks and shows how HPCToolkit’s data-centric analysis capabilities provide insights about data access patterns that are useful for tuning. Finally, Section 5.4 summarizes our conclusions of this chapter and outlines our future directions.

5.1 Motivation

There are two principal motivations for extending HPCToolkit to support data-centric measurement and analysis of program executions on scalable parallel systems. In Section 5.1.1, we illustrate the importance of data-centric profiling for analyzing data locality problems. In Section 5.1.2, we describe why scalability has become an important concern for data-centric profilers.

5.1.1 Data-centric Profiling

Two capabilities distinguish data-centric profiling from code-centric profiling. First, while code-centric profiling can pinpoint a source code line that suffers from high access latency, without a mapping from machine instructions to character positions on a source line, code-centric profilers can’t distinguish the contribution to latency associated with different variables accessed by the line. In contrast, data-centric profiling can decompose the latency and attribute it to individual variables accessed by the same source line. Figure 5.1 illustrates the latency decomposition that data-centric profiling can provide. Data-centric methods can decompose the latency associated with line 4 and attribute it to different variables. From the percentage of latency associated, one can see that array C is of principal interest for data locality optimization. One can then continue to investigate the program to determine how to improve data locality for C. Second, data-centric profiling aggregates metrics from all memory accesses that are associated with the same variable. Aggregate metrics can highlight problematic variables in a program. This can help one identify when a data layout is poorly matched to access patterns that manipulate the data, or pinpoint inefficient memory allocations on a NUMA architecture.

Figure 5.1 : Code-centric profiling aggregates metrics for memory accesses in the same source line; data-centric profiling decomposes metrics by variable.

5.1.2 Scalability

Scalability is a significant concern for data-centric profilers because (a) application performance on scalable parallel systems is of significant interest to the HPC community; (b) memory latency is a significant factor that limits performance and increases energy consumption; and (c) it is desirable to study executions on large-scale data sets of interest rather than forcing application developers to construct representative test cases that can be studied on small systems. Unfortunately, existing data-centric profilers do not scale to highly-parallel MPI+OpenMP programs.

Data-centric profilers for scalable parallel systems should have low time and space overhead. Runtime overhead mainly comes from two parts: collecting performance metrics and tracking variable allocations. One can reduce the overhead of collecting metrics by sampling hardware performance monitoring units (described in Section 2.1) with a reasonable sampling period. However, tracking variable allocations is difficult to control. If a program frequently allocates data, the overhead of tracking allocations can be unaffordable. Unfortunately, existing tools lack the capability to reduce such overhead; instead, they assume that programs don’t frequently allocate heap data.

We show a benchmark in our case study that contradicts this assumption.

Space overhead is also critical when profiling parallel executions. Systems like LLNL’s Sequoia support millions of threads. If each thread in a large-scale execution generates 1MB of performance data, a million threads would produce a terabyte of performance data. Thus, for analysis at scale a compact profile is necessary to keep the size of measurement data manageable. However, many existing tools trace variable allocations and memory accesses [70, 75]. The size of memory access traces is proportional to execution time and the number of active threads. A trace of variable allocations can easily grow unaffordably large. Consider the code in Figure 5.2. A memory allocation is called 100 times in a loop, so 100 allocations would be recorded in a thread trace. However, if this loop is in an OpenMP parallel region executed by each MPI process, millions of allocations would be recorded in a large-scale execution on a system like LLNL’s Sequoia.

    for (i = 0; i < 100; i++) {
        var[i] = malloc(size);
    }

Figure 5.2 : A loop allocating data objects in the heap.

Besides overhead, existing tools do not display data-centric results in a scalable way for effective analysis. Again, consider the code in Figure 5.2. Metrics may be dispersed among 100 allocations, without showing any hot spot. However, one might be interested in aggregating metrics for these 100 allocations per thread. If this loop is called by multiple threads and MPI processes, metrics associated with these data objects should be coalesced to highlight var as a problematic array.

To address these issues, we added scalable data-centric profiling support to the HPCToolkit performance tools. HPCToolkit employs novel techniques to scale the data collection and presentation with low time and space overhead.

Figure 5.3 : A simplified workflow that illustrates data-centric measurement and analysis in HPCToolkit. Rectangles are components of HPCToolkit; ellipses are inputs or outputs of different components.

5.2 Data-centric Capabilities for HPCToolkit

Figure 5.3 shows a simplified workflow that illustrates data-centric measurement and analysis in HPCToolkit. HPCToolkit consists of three principal components: an online call path profiler, a post-mortem analyzer, and a graphical user interface (GUI) for presentation. The profiler takes a fully optimized binary executable as its input. For each thread in a program execution, the profiler collects samples and associates costs with memory accesses and data objects. Section 5.2.1 describes the profiler. The post-mortem analyzer, described in Section 5.2.2, gathers all profiles collected by the profiler for each process and thread. It also analyzes the binaries, extracts information about static code structure, and maps profiles to source lines, loops, procedures, dynamic calling contexts, and variable names. Finally, the GUI, also described in Section 5.2.2, displays intuitive views of data-centric analysis results that highlight problematic data and accesses. Each component is designed to scale.

5.2.1 Online Call Path Profiler

As a program executes, HPCToolkit’s profiler triggers samples, captures full calling contexts for sample events, tracks variables, and attributes samples to both code and variables. To minimize synchronization overhead during execution, each thread records its own profile. The following four sections describe the implementation of data-centric support for each of these capabilities.

Triggering Samples

The profiler first programs each core’s PMU to enable instruction-based sampling or marked event sampling with a pre-defined period. When a PMU triggers a sample event, the profiler receives a signal and reads PMU registers to extract performance metrics related to the sampled instruction. To map these performance metrics to both code and data, the profiler records the precise instruction pointer of the sampled instruction and the effective address of the sampled instruction if it accesses memory.
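HPCToolkit programs the PMU through processor-specific mechanisms (IBS on AMD Opteron, marked events on POWER). As a rough, hedged illustration of the idea on Linux rather than the tool's actual implementation, the sketch below uses the generic perf_event_open interface to request samples that carry a precise instruction pointer, the effective data address, and a call chain; the event and sampling period chosen here are placeholders:

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    /* Illustrative only: a real profiler would select an IBS or marked-event
     * configuration specific to the processor instead of CPU cycles. */
    int open_sampling_event(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size           = sizeof(attr);
        attr.type           = PERF_TYPE_HARDWARE;
        attr.config         = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period  = 100000;                   /* pre-defined period */
        attr.sample_type    = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
                              PERF_SAMPLE_CALLCHAIN;    /* IP + data address  */
        attr.precise_ip     = 2;                        /* request low skid   */
        attr.disabled       = 1;
        attr.exclude_kernel = 1;

        int fd = syscall(__NR_perf_event_open, &attr,
                         0 /* this thread */, -1 /* any cpu */, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return -1; }
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        return fd;  /* samples are then read from the fd's mmap'd ring buffer */
    }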

Capturing Full Calling Contexts

As described in Section 2.3, HPCToolkit unwinds the call stack at each sample event. To do so, it uses on-the-fly binary analysis to locate the return address for each procedure frame on the call stack [133]. Call paths are entered into a calling context tree (CCT) [5]. A CCT reduces the space needed for performance data by coalescing common call path prefixes.

Supporting data-centric analysis required two changes to HPCToolkit’s call stack unwinder. First, we adjust the leaf node of the calling context we obtain by unwinding from the signal context to use the precise IP recorded by PMU hardware. This avoids “skid” between the monitored instruction and the sample event that typically occurs on out-of-order processors. Second, we create multiple CCTs for each thread. Inserting a sample call path into a specific CCT depends on the sample feature. For example, we create a CCT that includes all samples that do not access memory; other CCTs include call paths that touch different types of variables, such as static and heap-allocated variables.
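A minimal sketch of the per-thread bookkeeping this implies is shown below; the type and field names are hypothetical and only illustrate the idea of keeping one CCT root per sample class, not HPCToolkit's actual data structures:

    #include <stddef.h>

    enum sample_class { CCT_NO_MEMORY, CCT_STATIC_DATA, CCT_HEAP_DATA,
                        CCT_UNKNOWN_DATA, CCT_NUM_CLASSES };

    typedef struct cct_node {
        void            *frame_ip;     /* return address identifying the frame */
        struct cct_node *parent, *children, *sibling;
        unsigned long    metrics[4];   /* e.g., latency, remote accesses, ...  */
    } cct_node_t;

    typedef struct thread_profile {
        cct_node_t roots[CCT_NUM_CLASSES];   /* one CCT per storage class */
    } thread_profile_t;

    /* Pick the CCT a sample belongs to; its (skid-corrected) call path is
     * then inserted under that root. */
    cct_node_t *root_for_sample(thread_profile_t *tp, int touches_memory,
                                int is_static, int is_heap)
    {
        if (!touches_memory) return &tp->roots[CCT_NO_MEMORY];
        if (is_heap)         return &tp->roots[CCT_HEAP_DATA];
        if (is_static)       return &tp->roots[CCT_STATIC_DATA];
        return &tp->roots[CCT_UNKNOWN_DATA];
    }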

Tracking Variables

To support data-centric analysis, we augmented HPCToolkit to track the life cycle of each variable. We track the allocation and deallocation of static and heap allocated data. Variables that do not belong to static or heap allocated data are treated as unknown data.

Static data Data allocated in the .bss section in load modules are static data. Each static variable has a named entry in the symbol table that identifies the memory range for the variable with an offset from the beginning of the load module. The life cycle of static variables begins when the enclosing load module (executable or dynamic library) is loaded into memory and ends when the load module is unloaded. The profiler tracks the loading and unloading of load modules. When a load module is loaded into memory, HPCToolkit reads the load module’s symbol table to extract information about the memory ranges for all of its static variables. These memory ranges are inserted into a map for future use. All load modules in use are linked in a list for easy traversal. If a load module is unloaded, the load module together with its search tree of static data is removed from the list.

HPCToolkit tracks all static variables used during a program execution. Unlike other tools, HPCToolkit not only tracks static variables in the executable, but also static variables in dynamically-loaded shared libraries. Moreover, it collects fine-grained information for each static variable rather than simply attributing metrics to load modules.

Heap-allocated data Variables in the heap are allocated dynamically during execution by one of the malloc family of functions (malloc, calloc, realloc). Since heap-allocated data may be known by different aliases, e.g., function parameters, at different points in an execution, HPCToolkit’s profiler uses the full call path of the allocation point for a heap-allocated data block to uniquely identify it throughout its lifetime. To associate a heap-allocated variable with its allocation call path, the profiler wraps memory allocation and free operations in a monitored execution. At each monitored allocation, the profiler enters into a map an association between the address range of a data block and its allocation point. At each monitored free, the profiler deletes an association from the map.
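As an illustration of the wrapping idea (a hedged sketch, not HPCToolkit's implementation), the C code below interposes on malloc and free with dlsym(RTLD_NEXT, ...); capture_call_path, record_range, and remove_range are hypothetical helpers that unwind the stack and maintain the map from address ranges to allocation points:

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stddef.h>

    /* Hypothetical helpers: a call-stack unwinder and an interval map from
     * [ptr, ptr+size) to the calling context of the allocation. */
    extern void *capture_call_path(void);
    extern void  record_range(void *lo, size_t size, void *call_path);
    extern void  remove_range(void *lo);

    void *malloc(size_t size)
    {
        static void *(*real_malloc)(size_t);
        if (!real_malloc)
            real_malloc = (void *(*)(size_t)) dlsym(RTLD_NEXT, "malloc");

        void *p = real_malloc(size);
        /* Only sizable blocks are recorded (see the strategies described
         * below); small allocations are skipped to keep overhead low. */
        if (p && size >= 4096)
            record_range(p, size, capture_call_path());
        return p;
    }

    void free(void *p)
    {
        static void (*real_free)(void *);
        if (!real_free)
            real_free = (void (*)(void *)) dlsym(RTLD_NEXT, "free");

        if (p)
            remove_range(p);   /* all frees are tracked, but need no unwind */
        real_free(p);
    }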

Unknown data Stack variables do not belong to static or heap allocated data. Such variables are either not easily tracked or have little impact on performance. HPCToolkit treats stack variables as unknown data because stack variables seldom become data locality bottlenecks. In addition, HPCToolkit does not track C++ template container allocations. C++ template containers directly use a low level system call brk to allocate memory. Because brk sets the data segment instead of returning the allocated ranges, it is difficult to track in HPCToolkit.

Recording address range information for static variables incurs little overhead because it happens once when a load module is loaded into the program’s address space. However, the overhead of tracking heap allocations and deallocations hurts the scalability of the profiler. If a program allocates and frees memory with high frequency, wrapping allocates and frees and capturing the full calling context for each allocation may cause large overhead. For example, the execution time of AMG2006, one of our case study benchmarks, increases by 150% when monitoring all allocations and frees. To reduce such overhead, we use the following three strategies.

• We do not track all memory allocations. Usually, large arrays suffer from more severe locality problems than small ones. Typically, there are also more opportunities for optimizing data locality for large arrays. For that reason, HPCToolkit does not track any heap allocated variable smaller than a threshold, which we have chosen as 4K. However, we still track all calls to free to avoid attributing costs to wrong variables. Because we don’t collect calling contexts for frees, wrapping all frees is not costly.

• We use inlined assembly code to directly read execution context information from registers to aid in call stack unwinding. Our assembly code incurs lower overhead than libc’s getcontext.

• Since unwinding the call stack for frequent allocations in deep calling contexts is costly, we reduce unwinding overhead by identifying the common prefix for adjacent allocations to avoid duplicate unwinds of prefixes that are already known. We accomplish this by placing a marker (known as a trampoline) in the call stack to identify the least common ancestor frame of the calling contexts for two temporally adjacent allocations [51]. Using this technique, each allocation only needs to unwind the call path suffix up to the marked frame.

Because the first method can lead to inaccuracy due to the information loss, we only use it when necessary to avoid unaffordable time overhead. We always enable the other two methods because they are always beneficial. In our case study of AMG2006, these approaches reduce the time overhead of tracking variables from 150% to less than 10%.

Attributing Metrics to Variables

By correlating information about accesses from PMU samples with memory ranges for variables, HPCToolkit performs data-centric attribution on-the-fly. It first creates three CCTs in each profile, each recording a different storage class: static, heap, and unknown. This aggregation highlights which storage class has more problematic locality.

For each sample, HPCToolkit searches the map of heap allocated variables using the effective address provided by the PMU registers. If a sample in a thread touches a heap allocated variable, the thread prepends the call path for the variable allocation to the call path for the sample and then adds it to its local heap data CCT. It is worth noting that the memory allocation call path may reside in a different thread than the one in which the PMU sample for the access is taken. Copying a call path from another thread doesn’t require a lock because once created, a call path is immutable. If a copied call path is already in the local thread CCT, the thread coalesces the common paths. Although a memory access is mapped to a heap allocated variable using the allocated memory ranges, the allocation call path uniquely identifies a heap allocated variable. This CCT copy-and-merge operation successfully addresses the problem of multiple allocations with the same call path, which we illustrated with Figure 5.2.

If multiple heap allocated data objects have the same allocation call path, they are merged online and treated as a single variable. Memory accesses to di↵erent variables are separated into di↵erent groups identified by the variables’ allocation paths.

If HPCToolkit does not find a heap allocated variable matching an access, it looks up the effective address for the access in data structures containing address ranges for static variables. The search is performed on each load module (the executable and dynamically loaded libraries) in the active load module list to look for a variable range. If the sample accesses a static variable, the profiler records the variable name from the symbol table in a dummy node and inserts it into the static data CCT of the thread that takes the sample. The sample with the full call path is inserted under the dummy node. Therefore, if multiple samples touch the same static variable, they all have the same dummy node as their common prefix for merging. Like heap allocated variables, the dummy nodes of static variable names separate the CCT into multiple groups.

If the sample does not access any heap allocated variable or static variable, we insert the sample into the thread’s unknown data CCT. All of a thread’s call paths touching unknown data are grouped together in this CCT.
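Putting the three cases together, the attribution performed at each sample can be summarized by the hedged C sketch below; it reuses the hypothetical thread_profile_t from the earlier sketch, and lookup_heap_block, lookup_static_variable, and the CCT-insertion helpers are invented names standing in for HPCToolkit's internal routines:

    typedef struct { void *alloc_path; } heap_block_t;

    extern void         *unwind_call_path(void *ip);
    extern heap_block_t *lookup_heap_block(void *effective_addr);
    extern const char   *lookup_static_variable(void *effective_addr);
    extern void         *prepend_path(void *prefix, void *path);
    extern void          cct_insert(cct_node_t *root, void *path);
    extern void          cct_insert_under_dummy(cct_node_t *root,
                                                const char *var, void *path);

    /* Attribute one PMU sample: 'ip' is the precise instruction pointer,
     * 'ea' the effective address if the sampled instruction accessed memory. */
    void attribute_sample(thread_profile_t *tp, void *ip, void *ea,
                          int touches_memory)
    {
        void *path = unwind_call_path(ip);

        if (!touches_memory) {
            cct_insert(&tp->roots[CCT_NO_MEMORY], path);
            return;
        }
        heap_block_t *blk = lookup_heap_block(ea);      /* map of live blocks */
        if (blk) {
            /* Prepend the (immutable) allocation call path, then merge into
             * the heap-data CCT; identical allocation paths coalesce.        */
            cct_insert(&tp->roots[CCT_HEAP_DATA],
                       prepend_path(blk->alloc_path, path));
            return;
        }
        const char *var = lookup_static_variable(ea);   /* symbol-table ranges */
        if (var) {
            cct_insert_under_dummy(&tp->roots[CCT_STATIC_DATA], var, path);
            return;
        }
        cct_insert(&tp->roots[CCT_UNKNOWN_DATA], path); /* e.g., stack data */
    }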

The data-centric attribution strategy used by HPCToolkit’s profiler aggregates data access samples in both coarse-grain (storage class) and fine-grain (individual variables) ways. Because each thread records data into its own CCTs, no thread synchronization is needed and data-centric attribution incurs little overhead.

5.2.2 Post-mortem Analyzer and User Interface

HPCToolkit’s post-mortem analyzer takes load modules and profiles as inputs. It reads the symbol table from the executable and dynamically loaded libraries, and maps CCT nodes to either function or variable symbols. It then extracts line mapping information from debugging sections and maps symbol names to source code.

To generate compact profile results, which is important for scalability, HPCToolkit’s post-mortem analyzer merges profiles from different threads and processes. Data-centric profiles, which consist of at most three CCTs per thread (one each for static, heap, and unknown storage classes), are amenable to coalescing. The analyzer merges CCTs of the same storage class across threads and processes. Context paths for individual variables and their memory accesses can be merged recursively across threads and processes. For heap allocated variables, if their allocation call paths are the same, even if they are from different threads or processes, they are coalesced. For static variables, the analyzer merges them from different threads and processes if they have the same symbol name. Because all memory access call paths have variable CCT nodes as prefixes, they can be automatically merged after the merging of variables. The profile merging overhead grows linearly with the increasing number of threads and processes used by the monitored program. HPCToolkit’s post-mortem analyzer uses a scalable MPI-based algorithm to parallelize the merging process using a reduction tree [130].

Finally, the analyzer outputs a database for HPCToolkit’s GUI. The GUI sorts performance metrics related to each variable and instruction. It provides a top-down view to explore the costs associated with each dynamic calling context in an execution.

One can easily identify performance losses within full calling contexts for variables or instructions. Moreover, HPCToolkit’s GUI also provides a complementary bottom-up view. If the same malloc function is called in di↵erent contexts, the bottom-up view aggregates all performance losses and associates them with the malloc call site. In case studies, we show how these two views guide data locality optimization.

5.3 Case Studies

We evaluated HPCToolkit’s data-centric extensions for analyzing data locality issues on two machines. Our study focused on evaluating measurement and analysis of highly multithreaded executions at the node level. HPCToolkit’s MPI-based post-mortem analysis naturally scales for analysis of data from many nodes.

The first test platform for our study is a POWER7 cluster. Each node has four POWER7 processors with a total of 128 hardware threads. Each POWER7 processor in a node has its own memory controller so there are four NUMA nodes. We reserved 4 nodes, with up to 512 threads, to evaluate the scalability of HPCToolkit. The applications we studied use one or both of MPI for inter-node communication and OpenMP for intra-node communication. HPCToolkit uses marked events to collect performance data. A second test platform for evaluating HPCToolkit’s data-centric analysis is a single node server with four AMD Magny-Cours processors. There are 48 cores in this machine and 8 NUMA locality domains. HPCToolkit uses instruction-based sampling (IBS) to glean performance data.

We studied five well-known application benchmarks coded in C, C++ and Fortran, covering OpenMP, MPI and hybrid programming models. We built these programs using GNU or IBM compilers with full optimization. These programs are AMG2006 [79], Sweep3D [1], LULESH [77], Streamcluster [31], and Needleman-Wunsch (NW) [31]. A detailed description of these benchmarks is in Section 2.4. These benchmarks are configured to scale to the full size of our testbeds. The configuration and measurement overhead for these benchmarks is shown in Table 5.1. One may ask how we chose events to monitor and why these events affect performance. HPCToolkit either computes derived metrics [87] to identify whether a program is memory-bound enough for data locality optimization or counts occurrences of a specific event with a traditional hardware counter and evaluates its performance impact. We only apply data-centric analysis to memory-bound programs. As the table shows, in our case studies HPCToolkit’s measurement overhead was 2.3–12%. The profile size (i.e., space overhead) ranged from 8–33 MB.

code          | number of cores      | monitored events         | execution time | execution time with profiling
AMG2006       | 4 MPI × 128 threads  | PM_MRK_DATA_FROM_RMEM*   | 551s           | 604s (+9.6%)
Sweep3D       | 48 MPI processes     | AMD IBS                  | 88s            | 90s (+2.3%)
LULESH        | 48 threads           | AMD IBS                  | 17s            | 19s (+12%)
Streamcluster | 128 threads          | PM_MRK_DATA_FROM_RMEM    | 25s            | 27s (+8.0%)
NW            | 128 threads          | PM_MRK_DATA_FROM_RMEM    | 77s            | 80s (+3.9%)

*PM_MRK_DATA_FROM_RMEM is a marked event that measures data access from remote memory.

Table 5.1 : Measurement configuration and overhead of benchmarks.

5.3.1 AMG2006

Figure 5.4 is an annotated top-down view from HPCToolkit’s hpcviewer for AMG2006, which executes with four MPI processes and 128 threads per process on our POWER7 cluster. To present data-centric measurements, we add variable names or allocation call stacks as context prefixes in the hpcviewer’s pane for program contexts. In the figures for our case studies, we show only inclusive metrics.

In Figure 5.4, the second line of the metric pane shows that 94.9% of the remote memory accesses are associated with heap allocated variables. Data allocated on line 175 of hypre_CAlloc is the target of 22.2% of the remote memory accesses in the execution. Selecting the allocation call site in the navigation pane shows the source code for the allocation, enabling one to identify the variable allocated. The highlighted source code line shows that the column indices for non-zeros in a matrix (S_diag_j) are the target of these remote accesses. Deeper on the allocation call path, one can see that this variable is allocated by calloc. Immediately below the calloc is a dummy node heap data accesses; this node serves as the root of all memory accesses to this data. Calling contexts for two accesses are shown below this point; the leaf of each access call path is highlighted with a rectangle. One access accounts for 19.3% of total remote memory accesses and the other for 2.9%. One can select a highlighted access to display its source code in the top pane. Insets in Figure 5.4 show the source code for these accesses. S_diag_j is accessed in different loops. Since these loops are inside OpenMP outlined functions (with suffix $$OL$$), they are executed by multiple threads in a parallel region.

Figure 5.4 : The top-down data-centric view of AMG2006 shows a problematic heap allocated variable and the two accesses that suffer the most remote events to this variable.

phases   | initialization | setup | solver | whole program
original | 26s            | 420s  | 105s   | 551s
numactl  | 52s            | 426s  | 87s    | 565s
libnuma  | 28s            | 421s  | 80s    | 529s

Table 5.2 : Improvement for different phases of AMG2006 using coarse-grained numactl and fine-grained libnuma.

The performance data in the GUI shows that there is a mismatch between the allocation of S_diag_j and its initialization by the master thread in one NUMA domain, and accesses by OpenMP worker threads executing in other NUMA domains. The workers all compete for memory bandwidth to access the data in the master’s NUMA domain. Besides S_diag_j, there are many other variables in AMG2006 with the same NUMA problem. To avoid contending for data allocated in a single NUMA domain, we launch the program with numactl [74, 84] and specify that all memory allocations should be interleaved across all NUMA domains. Table 5.2 shows the performance improvement. Using numactl reduces the running time of the solver phase from 105s to 87s. However, with numactl the initialization and setup phases are slower because interleaved allocations are more costly. The higher cost of interleaved allocation offsets its benefits for the solver phase.
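For reference, the coarse-grained experiment amounts to launching the benchmark under numactl with interleaving applied to every allocation; the executable name and arguments below are placeholders:

    numactl --interleave=all ./amg2006 <benchmark arguments>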

A more surgical approach is to apply libnuma’s interleaved allocator [74] only to problematic variables identified by HPCToolkit. We used the bottom-up view provided by HPCToolkit’s GUI, shown in Figure 5.5, to identify problematic variables.

This view shows different call sites that invoke the hypre allocator. Figure 5.5 shows that while S_diag_j accounts for 22.2% of remote accesses, there are six other variables that are the target of more than 7% of remote accesses. We focus on these problematic variables. If a variable is allocated and initialized by the master thread, we use libnuma’s interleaved allocator instead. If a variable is initialized in parallel, we change the calloc to malloc, which enables the “first-touch” policy to allocate memory near the computation. Table 5.2 shows that using libnuma to apply interleaved allocation selectively avoids dilating the cost of the setup and initialization phases. Moreover, the execution time of the solver phase is 8% faster than when using numactl and ubiquitous interleaved allocation. Using libnuma, we avoid interleaved allocation for thread local data, which eliminates remote accesses to that data that arise with ubiquitous interleaved allocation.

Figure 5.5 : The bottom-up data-centric view of AMG2006 shows problematic allocation call sites which may reside in different call paths.
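A hedged sketch of the fine-grained fix in C: an allocation of a problematic array is switched to libnuma's interleaved allocator (the function name, element type, and fallback are illustrative; error handling is elided; link with -lnuma):

    #include <numa.h>
    #include <stdlib.h>

    double *alloc_shared_array(size_t nelems)
    {
        if (numa_available() < 0)                /* no NUMA support: fall back */
            return malloc(nelems * sizeof(double));

        /* Pages of this block are interleaved round-robin across all NUMA
         * domains, spreading bandwidth demand instead of concentrating it in
         * the master thread's domain. Release with numa_free(ptr, bytes).    */
        return numa_alloc_interleaved(nelems * sizeof(double));
    }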

Figure 5.6 : The data-centric view of Sweep3D shows heap allocated variables with high latency. Three highlighted arrays have high memory access latency.

5.3.2 Sweep3D

We ran Sweep3D on our 48-core AMD system and used IBS to monitor data fetch latency, which highlights data locality issues. The second line of the navigation pane in Figure 5.6 shows that 97.4% of total latency is associated with heap allocated variables. The top three heap allocated variables, Flux, Src, and Face, account for 39.4%, 39.1%, and 14.6% of the latency respectively. Because these three arrays account for 93.1% of total latency, we only focus on optimizing their locality. Figure 5.7 shows a problematic access to Flux, which accounts for 28.6% of the total latency. This access, residing in line 480, is deeply nested in the call chain and loops. The two inner-most loops in line 477 and 478 traverse Flux with left-most dimension first and right-most dimension second. Since Fortran uses column-major array layouts, the loop in line 477 has a long access stride. These long strides disrupt spatial locality and hardware prefetching, which leads to elevated cache and TLB miss rates. To improve data locality, one could consider changing the access pattern by interchanging loops or transforming the data layout. In this case, interchanging loops is problematic. Examining the rest of the accesses to Flux, we see all are problematic, so we interchange the dimensions of Flux by inserting the last dimension between the first and second. With this data transformation, accesses to Flux have unit stride and improved spatial locality.

Figure 5.7 : The data-centric view of Sweep3D shows a memory access with long latency to array Flux. This access is in a deep call chain.

Both Src and Face suffer from the same spatial locality problem as Flux. Similarly, we transpose their layouts to match their access patterns. The optimization reduces the execution time of the whole program by 15%. It is worth noting that marked event sampling on POWER7 can also identify such optimization opportunities. One can sample the PM_MRK_DATA_FROM_L3 event† to quantify the locality issue in Sweep3D. Because Sweep3D is a pure MPI program, MPI processes are always co-located with their data, so no NUMA problem exists and there is no need to examine NUMA-related events.

†This is a marked event that monitors cache hits in the L3 cache.

5.3.3 LULESH

We ran LULESH on our 48-core AMD system. As with Sweep3D, we used IBS for data-centric monitoring. The first metric column in Figure 5.8 shows the data access latency associated with heap allocated variables in LULESH. The second metric column shows a NUMA-related metric that reflects accesses to remote DRAM; this event is analogous to the POWER7 marked event PM_MRK_DATA_FROM_RMEM. The figure shows that heap allocated variables account for 66.8% of total latency and 94.2% of the execution’s remote memory accesses. The individual heap allocated variables are shown along the allocation call path. Annotations to the left of the allocation call sites show the names of the variables allocated. Each of the top seven heap allocated variables accounts for 3.0–9.4% of the total latency. The R_DRAM_ACCESS metric shows that most of these variables are accessed remotely. By examining the source code, we found that all heap allocated variables in LULESH are allocated and initialized by the master thread. According to the Linux “first touch” policy, all of these variables are allocated in the memory attached to the NUMA node containing the master thread. Consequently, the memory bandwidth of that NUMA node becomes a performance bottleneck. To alleviate contention for that bandwidth, we use libnuma to allocate all variables with high remote accesses in an interleaved fashion. This optimization speeds up the program by 13%.

Figure 5.8 : The data-centric view shows problematic heap allocated arrays with high latency in LULESH. The red rectangle encloses the call paths of all heap variables that incur significant memory latency.

We continue our analysis of LULESH by considering static variables. Figure 5.9 shows that static variables account for 23.6% of total access latency. The static variable f_elem is a hotspot because it accounts for 17% of total latency. There are two principal loops that have high access latency for f_elem. Both loops have exactly the same structure and one is shown in the source pane of Figure 5.9. From the figure, one can see that f_elem is a three-dimensional array. The first dimension (left-most) is an indirect access using array nodeElemCornerList in line 801. The last dimension (right-most) is computed from a function Find_Pos in line 802. Thus, accesses to f_elem are irregular. Though optimizing data locality for irregular accesses is difficult, we found one opportunity to enhance the data locality for f_elem in this loop. The second dimension (highlighted by a rectangle) ranges from 0 to 2. We transposed f_elem to make this dimension the last, which enables these three accesses to exploit spatial locality since C is row-major. This transposition reduces LULESH’s execution time by 2.2%.

Figure 5.9 : The data-centric view shows a problematic static variable and its accesses with high latency in LULESH. The ellipse shows that variable’s name.

5.3.4 Streamcluster

Figure 5.10 : The data-centric view associates a large number of NUMA related events with a problematic heap allocated variable and its inefficient accesses in the Streamcluster benchmark.

We ran Streamcluster on a node of our POWER7 system with 128 threads. Streamcluster suffers from serious NUMA data locality issues. Figure 5.10 shows that 98.2% of total remote memory accesses are related to heap allocated variables. The annotation to the left of the allocation that accounts for 92.6% of total remote accesses shows that it is associated with the variable block. Line 175 contains problematic accesses to p1.coord and p2.coord, which use pointers to access regions of block. This code is called from two different OpenMP parallel regions and accounts for 55.5% and 37% of total remote accesses from these two contexts. Examining the source code, we found that block is allocated and initialized by the master thread, so all worker threads access it remotely. We address this problem by leveraging the Linux “first touch” policy. Initializing block in parallel allocates parts of it near each thread using it. The optimization reduces both remote accesses and contention for memory bandwidth. We also applied this optimization to point.p, which accounts for 5.5% of remote accesses. This optimization reduces Streamcluster execution time by 28%.
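A hedged sketch of the first-touch fix (the array name and type are placeholders for Streamcluster's block): allocate with malloc, then initialize in parallel so each thread's first write places the pages it will later use in its own NUMA domain:

    #include <stdlib.h>
    #include <omp.h>

    float *alloc_and_touch(size_t n)
    {
        float *block = malloc(n * sizeof(float));    /* pages not yet placed */

        /* Under Linux's first-touch policy, each page is bound to the NUMA
         * domain of the thread that first writes it. Using the same schedule
         * here as in the compute loops keeps data near the threads using it. */
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            block[i] = 0.0f;

        return block;
    }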

5.3.5 Needleman-Wunsch

We ran Needleman-Wunsch on a node of our POWER7 system with 128 threads. As with Streamcluster, this code suffers from a high ratio of remote memory accesses. A snapshot of HPCToolkit’s GUI shown in Figure 5.11 indicates that 90.9% of remote memory accesses are associated with heap allocated variables. Two variables, referrence and input_itemsets, are hot spots that account for 61.4% and 29.5% of total remote accesses. The problematic accesses occur on lines 163–165. The maximum function called inside an OpenMP parallel region _Z7runTestiPPc._omp_fn.0 takes both referrence and input_itemsets as inputs. However, both variables are allocated and initialized by the master thread. To address this problem, we use libnuma to distribute the allocation of these two variables across all NUMA nodes to alleviate contention for memory bandwidth. This optimization speeds up the program by 53%.

Figure 5.11 : The data-centric view highlights two problematic heap allocated variables and their inefficient accesses in the Needleman-Wunsch benchmark.

5.4 Discussion

Augmenting HPCToolkit with support for data-centric profiling of highly parallelized programs enables it to quantify temporal, spatial, and NUMA locality. HPCToolkit’s data-centric measurement and analysis capabilities leverage PMU hardware on IBM POWER7 and AMD Opteron processors. By using sampling and a compact profile representation, HPCToolkit has low time and space overhead; this helps it scale to a large number of cores. In case studies, we applied HPCToolkit’s data-centric analysis capabilities to study five well-known parallel benchmarks and attribute data-centric metrics to both code and variables. These capabilities provided intuitive analysis results that enabled us to easily identify and optimize data locality bottlenecks. With data-centric feedback from HPCToolkit, we were able to improve the performance of these benchmarks by 13–53%.

Hardware sampling support for data-centric profiling is insufficient in today’s supercomputers. For example, ORNL’s Titan supercomputer uses AMD processors that support instruction-based sampling. However, the operating system on Titan disables IBS. LLNL’s Sequoia supercomputer is based on Blue Gene/Q processors. Though the A2 cores of a Blue Gene/Q ASIC have SIAR and SDAR registers as part of support for instruction sampling, this capability is not usable because full support for instruction sampling is missing from the multicore ASIC. The utility of measurement-based data-centric analysis of parallel program executions, which we demonstrate, motivates including hardware support for data-centric measurement in future generation systems.

We plan to extend our work in several ways. First, rather than overlooking heap allocations smaller than 4K, we think that monitoring some of them will enable HPCToolkit to provide useful data-centric feedback for programs with data structures built from lots of small allocations. Second, we plan to explore extensions that enable us to associate data-centric measurements with stack-allocated variables. Finally, we plan to enhance HPCToolkit’s measurement and analysis to provide guidance for where and how to improve data locality by pinpointing initializations that associate data with a memory module and identifying opportunities to apply transformations such as data distribution, array regrouping, and loop fusion.

Chapter 6

Analyzing Memory Bottlenecks with Cache Simulation

In modern processors, the speed gap between caches and memory is wide. For example, it takes 3–5 CPU cycles to access the L1 cache but several hundred cycles to access the memory. Thus, for good performance, one must exploit data locality and make good use of caches. There are two kinds of data locality: spatial and temporal. An access pattern exploits spatial locality when it accesses a memory location and then accesses a location within the same cache line soon afterwards. If a program’s access pattern does not match the data layout, spatial locality goes unexploited. Temporal locality occurs when the same memory location is accessed two or more times close together in time.

There are three ways that one might assess locality in a program. Caşcaval and Padua describe a compiler-based approach for estimating cache misses and locality [27]. Their method, which uses data dependence distances, can be accurate for regular programs; however, for programs with irregular access patterns, one must execute the code and measure or simulate locality. On some processors, a hardware performance monitoring unit (PMU) can sample a program execution with low overhead and collect data locality related metrics for memory accesses [42, 45, 126]. While hardware measurements can assess locality with little runtime overhead, they can’t diagnose underlying causes of poor locality. Using simulation, one can compute detailed measures of data locality, along with information that pinpoints the causes of poor locality. However, detailed cache simulation of every memory access can slow an execution by 1000× [16].

Simulators quantify data locality by computing reuse distance, also known as LRU stack distance [54]. A long reuse distance for a cache line predicts a capacity cache miss, revealing poor data locality. However, existing tools that calculate reuse distance using cache simulators lack one or more capabilities needed for detailed diagnosis of locality problems. First, data locality optimization obtains non-trivial performance improvement if and only if data accesses are bottlenecks. Unless a code is sufficiently memory-bound, optimizing data locality may not improve performance. Second, reuse distance information alone isn’t enough. Diagnosing poor memory locality requires correlating reuse distance information back to the instructions sharing data and the contexts in which these instructions execute. Poor memory locality should be attributed to data structures as well. Third, compiler-based tools for measuring reuse distance can’t compute accurate information about data access patterns in executables produced by other optimizing compilers. Finally, collecting accurate reuse distance information for every memory access is very costly; thus, an important aspect of any tool that measures reuse distance is how it mitigates this cost.

In this chapter, we describe a novel, efficient, and effective tool for data locality measurement and analysis. Unlike other tools, our tool uses both statistical PMU sampling to quantify the cost of data locality bottlenecks and cache simulation to compute reuse distance to diagnose the causes of locality problems. This approach enables us to collect rich information to provide insight into a program’s data locality problems. Our tool attributes quantitative measurements of observed memory latency to program variables and dynamically allocated data, as well as code. Our tool also identifies data touched by reuse pairs as well as the accesses involved, identified with their full calling context. Finally, our tool employs both sampling and parallelization to accelerate the computation of representative reuse distance information. In experiments with five serial codes, our average increase in execution time was 13%. To demonstrate the utility of our tool, we describe our experiences studying five benchmarks. Our tool pinpointed causes of poor locality in each. Our tool identified that three of these codes were memory bound and provided insights that enabled us to make non-trivial improvements to each of these codes.

The rest of this chapter is organized as follows. Section 6.1 describes derived metrics we use to identify data locality bottlenecks. Section 6.2 describes our mechanism for lightweight reuse distance collection. Section 6.3 describes our tool implementation. Section 6.4 describes case studies in which we use our tool to pinpoint data locality bottlenecks in five benchmarks. We then explore the effect of optimizations suggested by insights from our tool. Section 6.5 discusses some conclusions of this chapter and describes our plans for future work.

6.1 Identify Data Locality Bottlenecks

The first step in data locality optimization is to pinpoint data locality bottlenecks. A data locality bottleneck is a time-consuming code section whose performance is degraded significantly by the latency of accessing data in caches and memory. Only improving data locality for a data locality bottleneck can significantly reduce a program’s execution time.

To identify data locality bottlenecks, we compute l_ins, the average memory latency per instruction, as shown below in Equation 6.1.

$$ l_{ins} = \frac{latency}{\#ins} = \frac{latency}{\#mem} \times \frac{\#mem}{\#ins} \qquad (6.1) $$

l_ins | latency | optimization strategy
low   | low     | no optimization needed
low   | high    | optimization would yield little benefit
high  | low     | low priority for optimization
high  | high    | high priority for optimization

Table 6.1 : Assessing locality optimization opportunities.

In Equation 6.1, latency is the total memory access latency in cycles measured for all memory accesses by instructions in a code region; #mem is the number of memory-related instructions; and #ins is the total number of instructions of any kind. l_ins represents the average memory hierarchy delay per instruction in a code region, computed as the product of two terms. The first term represents the average data latency per access. The second term represents the relative frequency of memory accesses.

In our experience, CPU bound programs have values of l_ins < 0.1. If l_ins is larger than 0.5 for the entire program, then the program is sufficiently memory-bound that it could benefit from data locality optimization.
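As a hypothetical worked example of Equation 6.1 (the counts below are invented for illustration): a region that retires 4 × 10^9 instructions, of which 10^9 access memory with a total measured latency of 2 × 10^9 cycles, yields

$$ l_{ins} = \frac{2\times 10^{9}}{10^{9}} \times \frac{10^{9}}{4\times 10^{9}} = 2 \times 0.25 = 0.5, $$

which by the thresholds above is just memory-bound enough to merit data locality optimization.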

We use our latency metric to quantify the severity of data locality bottlenecks. Table 6.1 shows how different l_ins and latency combinations can be used to prioritize optimization efforts. If l_ins is low, there is little to be gained from optimization. If l_ins is high, reorganizing data or computation to reduce memory access latency could improve performance. If both l_ins and latency are high, then program changes might dramatically reduce latency, which could significantly reduce execution time.

Our tool uses instruction-based sampling to accumulate the total memory latency observed for each memory instruction and the total number of dynamic instances observed for each instruction (memory access or otherwise). In post-mortem analysis, we use this raw data to calculate l_ins for memory instructions, loops, function bodies, and call paths. This information enables us to pinpoint problematic memory access instructions in each code region that represents a data locality bottleneck.

We begin our analysis of a program execution by examining l_ins for the outermost program scope, e.g., main. If l_ins is sufficiently high, indicating a data locality bottleneck, we recursively narrow our focus to the child scopes (e.g., loops or called procedures) where the problem is most severe, i.e., where l_ins and latency are high, until we identify the scope or scopes that are the root causes of the problem; these scopes merit tuning.

6.2 Lightweight Reuse Distance Measurement

Though we can use l_ins and latency metrics to identify problematic memory accesses that degrade a program’s performance, these memory accesses are victims rather than causes. The real cause is that the data they access have been evicted from cache hierarchies. Cache eviction occurs when reuse distance is larger than the cache capacity.* We use reuse distance information to understand why specific data is evicted from cache. We focus our optimization efforts on references using and reusing the data. We must consider both the access patterns of each of these references and the layout of the data they touch.

*There are also other causes for cache evictions, such as cold and conflict misses; however, in this chapter, we only focus on capacity cache misses.

Figure 6.1 : Shadow profiling used in reuse distance measurement

To compute reuse distance measurements efficiently, we employ a sampling-based technique known as shadow profiling [105] to collect performance data with low time overhead. The principle of shadow profiling is to cut a program execution into slices and schedule computation of reuse distance measurements for each slice on an idle computational core in the system. Since the execution of shadow measurement processes can be overlapped with the native execution of a program, shadow profiling can occur with little time overhead. Each of our shadow processes executes under the control of Pin [91], which is a dynamic binary instrumentation tool. Pin provides a rich set of high-level APIs to instrument a program with analysis routines at different granularities. We use a Pin plug-in to instrument each memory access, call, and return. We describe the work performed by our instrumentation in the following sections. Figure 6.1 and our explanation below describe how our tool uses shadow profiling to collect reuse distance measurements in practice:

1. The main process begins execution and is monitored using instruction-based sampling. At an IBS sample event, if the main process has executed for a sufficiently long interval since the beginning of the execution or the last fork event, the process will decide to fork.

2. The main process uses call stack unwinding to determine the full calling context of the fork event† and then forks a child process with a clone of the current process state. The child process state includes information about the calling context of the fork along with information about live blocks of heap-allocated data. We explain the details of the information our tool maintains for dynamic memory allocations and how it is used in Section 6.2.3.

3. After the fork event, the child launches a copy of Pin and directs Pin to attach to the child. The rest of the child’s execution will occur under the control of Pin.

4. As Pin executes the child process, it uses a plug-in to monitor all memory accesses and compute reuse distance metrics for a predefined number of memory accesses. When the required number of memory accesses has been monitored, the child process records its reuse distance measurements into a file and exits.

†We use HPCToolkit’s call stack unwinder [133] to collect calling contexts.

The following sections describe our approach in greater detail. Section 6.2.1 describes the reuse distance metrics we compute to support effective analysis. Section 6.2.2 describes an innovative method for collecting full call paths of both use and reuse memory accesses. Section 6.2.3 describes the data-centric analysis that our tool employs to identify data structures involved in the reuse distance analysis. Section 6.2.4 analyzes the accuracy and overhead of our sampling-based reuse distance measurement mechanism.

6.2.1 Reuse Metric Computation

Our Pin instrumentation for memory accesses gets invoked with the instruction pointer and the effective address each time an instruction loads from or stores to memory. To compute reuse distance, we use a hash table to maintain the set of effective addresses touched by memory accesses in a child process. We number addresses accessed and compute the reuse distance when an address is retouched. The granularity of reuse distance information maintained can be configured to individual memory addresses or cache lines. We define two reuse metrics: average reuse distance and reuse frequency. To compute aggregate reuse information, we need to judge whether two reuse pairs are the same. If reuse pair p2 has the same instruction pointers, touches the same data, and has the same call paths as a previous pair p1, then we aggregate p2 into p1 by incrementing p1's frequency metric and adding the reuse distance of p2 to that of p1. In post-mortem analysis, we use the frequency count and total distance for a reuse pair to compute the average reuse distance for that pair.
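To make this bookkeeping concrete, the following is a minimal sketch of the hash-table scheme described above, written for 64-byte cache-line granularity; the class and member names are illustrative and not taken from our implementation.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <unordered_map>

// Tracks reuse distance per cache line: the reuse distance of an access is the
// number of distinct lines touched since the previous access to the same line.
class ReuseTracker {
  uint64_t clock_ = 0;                                // global access counter
  std::unordered_map<uint64_t, uint64_t> lastAccess_; // line -> time of last access
  std::map<uint64_t, uint64_t> activeLines_;          // time of last access -> line

public:
  // Returns the reuse distance for this access, or -1 for a first touch.
  int64_t access(uint64_t addr) {
    uint64_t line = addr >> 6;                        // 64-byte line granularity
    int64_t distance = -1;
    auto it = lastAccess_.find(line);
    if (it != lastAccess_.end()) {
      // Lines accessed more recently than this line's previous access are
      // exactly the distinct lines touched in between.
      auto pos = activeLines_.find(it->second);
      distance = std::distance(std::next(pos), activeLines_.end());
      activeLines_.erase(pos);
    }
    lastAccess_[line] = clock_;
    activeLines_[clock_] = line;
    ++clock_;
    return distance;
  }
};
```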

6.2.2 Call Path Collection

Beyls and D’Hollander show that call paths to both use and reuse instructions are important for data locality optimization to reduce reuse distances [16]. The reuse scope, defined as the least common ancestor (LCA) [16] of call paths to use and reuse instructions, is where optimization needs to occur. To reduce the reuse distance and improve data locality, when legal, one should hoist the use and reuse instructions into the LCA and apply loop fusion.
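As a generic illustration of this transformation (hypothetical code, not taken from any benchmark studied later), fusing two loops that use and reuse the same array moves the reuse into a single loop body, so each element is reused while it is still resident in cache:

```cpp
// Before fusion: a[i] is used in the first loop and reused in the second,
// so roughly n distinct elements are touched between use and reuse.
void before(const double* a, double* b, double* c, int n) {
  for (int i = 0; i < n; ++i) b[i] = 2.0 * a[i];   // use of a[i]
  for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];  // reuse of a[i], ~n elements later
}

// After fusion: a[i] (and b[i]) are reused immediately, shrinking the reuse distance.
void after(const double* a, double* b, double* c, int n) {
  for (int i = 0; i < n; ++i) {
    b[i] = 2.0 * a[i];
    c[i] = a[i] + b[i];
  }
}
```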

A full call path to a memory access instruction is represented as a chain of call sites beginning with main and ending with the function containing the memory access instruction. Our tool uses a two-part technique to obtain the call path to each monitored memory access at low expense. We initialize the call path for a child process with the full calling context of the parent at the time it forks. We collect the parent process's calling context using HPCToolkit's call stack unwinder, which employs on-the-fly binary analysis [133]. When our Pin plug-in gains control of a child process, the initial execution context of the child is in the signal handler where it was forked. Our Pin plug-in advances the execution of the child process until the child returns to the context in the application code where the signal was received.

At that point, our Pin plug-in instruments all function calls and returns in the child process to maintain the call path as execution proceeds. From that point onward, our Pin plug-in incrementally updates the call path for the current execution point upon each function call and return; this approach is similar to one used by DeadSpy [30], which also incrementally maintains a call stack in a Pin plug-in to attribute memory accesses to a full call path. Our tool appends a new call site to the call path when it encounters a call instruction and removes the last call site on the path when it encounters a return instruction. Using this approach, we know the full call path to the currently executing instruction in the child process at all times.
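The sketch below illustrates the incremental call-path maintenance just described. OnCall and OnReturn stand for the analysis routines that the instrumentation would invoke at each call and return; the names and the representation of the path are illustrative.

```cpp
#include <cstdint>
#include <vector>

// Shadow call path: call-site instruction pointers, oldest first.
static std::vector<uint64_t> callPath;

// Seed the path with the parent's calling context unwound at fork time.
void InitFromParentContext(const std::vector<uint64_t>& unwoundPrefix) {
  callPath = unwoundPrefix;
}

// Invoked by the instrumentation before each call instruction.
void OnCall(uint64_t callSiteIP) {
  callPath.push_back(callSiteIP);
}

// Invoked by the instrumentation before each return instruction.
void OnReturn() {
  if (!callPath.empty()) callPath.pop_back();
}

// At each monitored memory access, callPath holds the full calling context.
```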

6.2.3 Data-centric Analysis

As we have shown previously [85], knowing which source statement is responsible for high-latency accesses often isn't enough to diagnose data locality problems if the statement touches multiple data objects. To understand data locality problems in detail, one must associate reuse information not only with instructions accessing a data object, but also with data objects themselves. Below we describe how we associate accesses with static data, heap-allocated data, and stack-allocated data.

Static data Static variables include global variables and local variables declared using the static keyword in C/C++ or the save keyword in Fortran. Static variables are live throughout the whole program execution. In an executable's symbol table, static variables are visible as symbols tagged with OBJECT. In the main process, our tool reads the symbol table to extract the data range and name for each static variable and records these signatures in internal data structures for later use. When the main process forks a child, the information associating address ranges with static variables becomes available in the child's address space, so each access to a static variable can be recognized as such.

Heap data Variables in the heap are allocated dynamically during execution by one of the malloc family of functions. Since heap-allocated data may be known by different aliases, e.g., function parameters, at different points in an execution, we use the full call path of the allocation point of a heap-allocated data block to uniquely identify it throughout its lifetime. To associate a heap-allocated variable with its allocation call path, we wrap all memory allocation and free operations in both the main and child processes. At each allocation, we enter an association between the address range of a data block and its allocation point. At each free, we delete the association from the map. In the main process, we use call stack unwinding (as described earlier) at the allocation point to determine the allocation context. When the main process forks a child process for monitoring, the child inherits the state of the allocation map of the main process. Any allocation or free performed by the child process updates the map inherited from the main process. When the child process encounters an allocation point, it uses the execution context maintained by our Pin plug-in instead of gathering the context with call stack unwinding.
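A minimal sketch of this allocation map appears below. RecordAllocationContext stands for whatever mechanism supplies the calling context at the allocation point (call stack unwinding in the main process, the Pin-maintained context in a child); all names are illustrative.

```cpp
#include <cstdint>
#include <cstdlib>
#include <map>
#include <vector>

struct HeapBlock {
  size_t size;
  std::vector<uint64_t> allocPath;   // call path of the allocation point
};

static std::map<uint64_t, HeapBlock> heapMap;             // start address -> block
extern std::vector<uint64_t> RecordAllocationContext();   // assumed context hook

void* wrapped_malloc(size_t size) {
  void* p = malloc(size);
  if (p) heapMap[(uint64_t)p] = HeapBlock{size, RecordAllocationContext()};
  return p;
}

void wrapped_free(void* p) {
  heapMap.erase((uint64_t)p);
  free(p);
}

// Map an effective address back to the block (and hence allocation path) containing it.
const HeapBlock* lookup(uint64_t addr) {
  auto it = heapMap.upper_bound(addr);
  if (it == heapMap.begin()) return nullptr;
  --it;                                                     // block starting at or below addr
  return (addr < it->first + it->second.size) ? &it->second : nullptr;
}
```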

Stack data Local variables for a procedure are allocated on the call stack upon entry to the procedure. We use the following strategy to associate memory accesses with stack variables. Before an executable is launched, we use an offline tool to analyze the debugging information for the executable. This information is available in the executable's DWARF debugging sections. For each procedure, we compute a sequence of tuples that represent its local variables. Each tuple contains the name of the procedure, the name of a local variable, its offset in the procedure's frame, and the source file and line on which it is declared. If DWARF information is missing for a procedure, we only record a tuple that indicates the extent of the procedure's frame. We write these tuples to a file that will be read by each shadow profiler process. During execution, when we encounter an access to an address in the stack while executing a child process with our Pin plug-in, we examine the call stack for the current execution context, use information about the stack pointer and base pointer for each procedure on the call path to identify the procedure frame that contains the target address, and then use the DWARF information to associate the access with the local variable whose offset range within the frame contains the target address.
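The tuples produced by the offline DWARF pass might be represented as in the following sketch; the field names are illustrative.

```cpp
#include <cstdint>
#include <string>

// One tuple emitted by the offline DWARF analysis, mirroring the description above:
// procedure, variable, frame offset, and the declaring source file and line.
struct StackVarTuple {
  std::string procedure;   // enclosing procedure name
  std::string variable;    // local variable name (empty if DWARF info is missing)
  int64_t     frameOffset; // offset of the variable within the procedure's frame
  std::string file;        // source file of the declaration
  int         line;        // source line of the declaration
};
```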

6.2.4 Overhead and Accuracy Analysis

The overhead of our measurement approach comes from both the main and child processes. In the main process, we use instruction-based sampling to collect information needed to associate lins and latency with call paths, as described in Section 6.1. With one sample per 64K micro-ops, its overhead is less than 5%. Another source of overhead in the parent process is the wrapping of heap data allocation operations to support data-centric analysis. This overhead was not particularly large for any of our experiments. We define the total overhead of monitoring in the parent process as m.

The costs associated with child monitoring processes forked to measure reuse distance are affected by the frequency with which they are forked and the number of memory accesses that they monitor before quitting. The period between forking monitoring processes should be chosen to avoid oversubscribing available cores. To quantify the execution cost of monitoring processes, we define several parameters: the number of cores in the system p, the average execution time for an individual child process t, and the period r with which monitoring processes are created. It is worth noting that t is relatively consistent because we specify that a child process will monitor a fixed number of memory accesses before terminating. Equation 6.2 shows the criteria for choosing an appropriate t and r. Since the native execution of the main process takes up one core, the remaining cores can be used for monitoring processes.

\[ \frac{t}{r} \le p - 1 \qquad (6.2) \]

As shown in Figure 6.1, the executions of child processes overlap with one another and with the parent process as well. The only non-overlapped overhead is that of the last child process. Since the last child process can be triggered within an interval (0, r) before the end of the main process, the expected non-overlapped execution time of child processes is t − r/2. Therefore, we quantify the non-overlapped execution overhead o of our method in Equation 6.3.

\[ o = m + t - \frac{r}{2} \qquad (6.3) \]

The accuracy of our sampling approach also depends upon t and r. On the one hand, t is proportional to the number of memory accesses monitored by one child process. The more memory accesses monitored, the more accurate our reuse measurements will be. On the other hand, r reflects the code coverage of our method.

A small r means a high sampling frequency, which will yield high coverage of the execution. However, according to Equation 6.2, t and r trade off against one another and should be carefully chosen. In our tool, t and r may be chosen by users to trade off overhead vs. accuracy.
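As a concrete illustration using the configuration from our case studies in Section 6.4, where the platform has p = 48 cores and shadow processes are forked every r = 0.8 seconds, Equation 6.2 requires the average child lifetime to satisfy t ≤ (p − 1) r = 47 × 0.8 s ≈ 37.6 s; any smaller t avoids oversubscribing the remaining cores.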

Our tool also incurs memory overhead that affects the forking frequency of child processes. Because our tool periodically forks processes that inherit the parent process's context, it requires more memory than the native execution. We should make sure that the memory footprint of the main process and co-existing child processes does not exceed the total memory in the system.

6.3 Tool Organization

Our tool consists of three parts: a call path profiler, a Pin plug-in, and a post-mortem analyzer. The workflow of our tool is shown in Figure 6.2.

6.3.1 Call Path Profiler

The call path profiler injects monitoring code into the address space of an application process as it begins execution. Before profiling begins, it parses the symbol table and DWARF sections in the binary to extract information about static data and procedure frames. This information is passed to each child process as part of the state it inherits from the main process when it is forked. Within the main process, the profiler initiates instruction-based sampling and attributes dynamic instruction counts and latency profiles to their full calling contexts. Periodically, the main process forks off shadow processes that collect detailed reuse information for a pre-determined number of accesses.

Figure 6.2 : The workflow of our prototype.

6.3.2 Pin Plug-in

When a child process begins, it invokes Pin to monitor its execution. The data space of the child process contains the calling context from which the child process was forked. Our Pin plug-in maintains the calling context at every call and return so that the full calling context is available as each memory access executes. For each memory access, our Pin plug-in is presented with the instruction pointer and the effective address of the memory access. The memory access handler in the plug-in uses a cache simulator to collect reuse distance information. Our plug-in associates each memory access with the full call path in which it occurs. Once a child process has performed the pre-determined number of accesses, the plug-in records reuse distance measurement results in a file.

code      LULESH   Sweep3D   LU      Sphot   S3D    average
overhead  15.5%    12.2%     16.7%   11.3%   8.1%   12.8%

Table 6.2 : Measurement overhead for our tool.

6.3.3 Post-mortem Analysis Tool

The post-mortem analysis tool reads the latency and instruction measurements from the call path profiler and the reuse distance measurements gathered by our Pin plug-in. Our post-mortem analysis tool maps the latency and instruction profile results to the source code using the line information in DWARF sections. A graphical user interface enables the user to explore the latency and reuse distance measurement data for an application interactively. The post-mortem analyzer aggregates all reuse data measurement files collected by the Pin plug-in and correlates the measurements with full call paths. Reuse distance information from our tool enables a developer using our graphical user interface to understand reuse patterns that correspond to high latency measurements.

6.4 Case Studies

To evaluate our tool's utility, we studied five HPC application benchmarks: LULESH [77], Sweep3D [1], NAS LU [10], Sphot [79], and S3D [32]. The platform for our experiments is a single-node system with four AMD 12-core Magny-Cours processors. Our tool collects an instruction-based sample every 64K instructions, invokes shadow processes every 0.8 seconds, and collects 100M memory accesses in a shadow process. Our platform has 128GB memory, which is large enough for our measurement configurations. The execution time overhead of applying our tool to the aforementioned applications is shown in Table 6.2.

Figure 6.3 : Pinpoint problematic code sections and instructions in LULESH with their full calling contexts.


6.4.1 LULESH

Figure 6.3 shows our graphical user interface, hpcviewer, highlighting one of the data locality bottlenecks in LULESH. In this case, we only consider the inclusive metrics computed by hpcviewer. The LATENCY/INS metric (lins) for the whole program is 4.84.

frequency is 2002519, distance is 59350996
*****data touched by use: NOT static variable
0x40145e(/home/xl10/kgs1/scalability/luleshOMP-0611.cc: 1505)
0x4015ac(/home/xl10/kgs1/scalability/luleshOMP-0611.cc: 1685)
0x403989(/home/xl10/kgs1/scalability/luleshOMP-0611.cc: 2882)
0x7f5edd88f97f(: 0)
*****use call stack*****
0x4085ea(/home/xl10/kgs1/scalability/luleshOMP-0611.cc: 1527)
0x4014c5(/home/xl10/kgs1/scalability/luleshOMP-0611.cc: 1587)
0x4015ac(/home/xl10/kgs1/scalability/luleshOMP-0611.cc: 1685)
0x403989(/home/xl10/kgs1/scalability/luleshOMP-0611.cc: 2882)
0x7f5edd88f97f(: 0)
*****data touched by reuse: NOT static variable
0x40145e(/home/xl10/kgs1/scalability/luleshOMP-0611.cc: 1505)
0x4015ac(/home/xl10/kgs1/scalability/luleshOMP-0611.cc: 1685)
0x403989(/home/xl10/kgs1/scalability/luleshOMP-0611.cc: 2882)
0x7f5edd88f97f(: 0)
*****reuse call stack*****
0x40ac0f(/home/xl10/kgs1/scalability/luleshOMP-0611.cc: 1316)
0x400f87(/home/xl10/kgs1/scalability/luleshOMP-0611.cc: 1229)
0x401518(/home/xl10/kgs1/scalability/luleshOMP-0611.cc: 1539)
0x4015ac(/home/xl10/kgs1/scalability/luleshOMP-0611.cc: 1685)
0x403989(/home/xl10/kgs1/scalability/luleshOMP-0611.cc: 2882)
0x7f5edd88f97f(: 0)

Figure 6.4 : The text output from the reuse distance measurement for LULESH, showing an example of a reuse pair. The reuse instruction at line 1316 shows high memory access latency in Figure 6.3.

This means that LULESH is a memory-bound benchmark that could benefit from data locality optimization. Drilling down the call path, we use the LATENCY metric to pinpoint problematic code sections and instructions. A parallel region at line 1229 invokes a loop (shown at the top of the source code pane) that accounts for 35.7% of the total latency. The individual memory access instructions in this loop at lines 1316, 1310, and 1320 are the top contributors to this latency.

With knowledge of the problem location, we correlate it with our reuse distance measurements. Reuse distance measurements are not yet available within our graphical user interface.

Figure 6.5 : LULESH source code for the use and reuse described in Figure 6.4. It shows an optimization opportunity for a pair of OpenMP parallel regions. The arrow points to the use and reuse of z8n. The parallel loops bounded with blue rectangles should be fused to improve data locality.

Figure 6.4 shows textual information about a reuse pair corresponding to one problematic memory access instruction on line 1316 (as the reuse instruction), which accounts for 7.9% of the total memory access latency. The high values for the frequency and distance metrics indicate that there is a serious data locality problem associated with this memory access instruction. Our data-centric analysis indicates that the variables touched by both use and reuse are not stack variables. The data allocation call stacks for the use and reuse accesses indicate that both accesses touch z8n. By examining the full calling contexts of both the use and reuse instructions, we see that their least common ancestor is the function CalcForceForNodes called on line 1685 (not shown in the figure). From the least common ancestor, the call paths to the use and reuse lead to two different parallel loops. The source code of the loops containing the two accesses is shown in Figure 6.5. Line numbers in Figure 6.5 are consistent with the reuse distance measurement result in Figure 6.4. The arrow in Figure 6.5 marks the use and reuse statements. The variable z8n is reused in a function call, and the real reuse points in the function body are enclosed with ellipses.

It is worth noting that x8n and y8n also suffer long reuse distances in these two loops.

To reduce the reuse distance between the uses and reuses of x8n, y8n, and z8n, we bring the two parallel loops to the LCA, which means that we remove the function call on line 1539 and inline the body of function CalcForceForNodes. Then we fuse the two parallel loops. The fusion is straightforward because the two loops have the same trip count and stride. The optimization to address the data locality problem highlighted by our tool improves the performance of LULESH by 12.9% on 32 cores.

6.4.2 Sweep3D

Sweep3D is a memory-bound benchmark. Figure 6.6 shows that Sweep3D has a latency per instruction of 2.52 over the full execution. Two problematic loops occur at lines 387 and 477, which are deeply nested in a common call path. These loops have latencies per instruction of 3.51 and 3.29, respectively, which are larger than the average. The top pane in the figure shows the source code for both loops. Our reuse distance measurements show that the heap variables src and flux suffer long reuse distances, touch more than 4e+5 cache lines, and have a high reuse frequency of 2.9e+8. An obvious data locality problem for both variables is the mismatch between their data layout and access patterns. The last dimension of both arrays should be transposed with the second dimension. This simple data layout transformation improves the performance of the whole program by 9.2%.

More data reuse opportunities can still be observed for both src and flux. They are reused in the outer loop at line 326. A more intricate loop restructuring described in [94] improves the performance of Sweep3D by 60%.

Figure 6.6 : Identify problematic memory accesses and reuse points within nested loops in Sweep3D.


6.4.3 NAS LU

Figure 6.7 shows that LU is also memory-bound, but less severely than LULESH and Sweep3D. LU has a 0.87 cycle latency per instruction. Drilling down the call path shows that the most problematic memory accesses are in the innermost loop beginning at line 194 in file rhs.f. This loop has a latency of more than 10 cycles per instruction. There are two serious data locality problems in this loop.

Figure 6.7 : Identify data locality optimization opportunity in NAS LU benchmark.

The access stride for a static array u is large because the memory access pattern and data layout are mismatched. However, it is not easy to fix. Our reuse distance measurements show that u is reused from another loop that is also shown in Figure 6.7.

A further examination of the code shows that the use and reuse are in two separate OpenMP parallel loops that have the same loop bounds and stride. Fusing these two parallel loops to reduce the reuse distance of u improves the overall performance of the program by 5.1% in an 8-thread execution.

Figure 6.8 : Identify problematic memory accesses with reuse annotations.

6.4.4 Sphot

Figure 6.8 shows the call stack view of problematic memory accesses in Sphot. Memory accesses in the deeply nested loop highlighted in the figure account for 70.7% of the measured access latency. The top four memory accesses in this loop show relatively high reuse distances. Red arrows annotate the use and reuse points indicated by our reuse distance measurements. The arrow in the circle means that the data is reused by the same statement in different iterations. The poor locality for itype, bom, and sqm is due to the mismatch between access patterns and data layouts: the index of the innermost loop, ks, is the last dimension of the three arrays. Additionally, our tool pinpoints that these three arrays are allocated on the stack in subroutine execute.

We changed the data layouts in Sphot by exchanging the first and second dimensions of the problematic arrays. The transformation reduces the latency in this loop by 41.2%. However, the performance of the total program improves by just 1.2%.

The reason for the tiny improvement is that Sphot is not memory-bound. As shown in Figure 6.8, the latency per instruction metric for the whole program is less than 0.1; therefore, optimizing data locality yields little overall improvement, as one would expect.

6.4.5 S3D

Figure 6.9 shows that S3D is not a memory-bound application: its latency per instruction is only 0.02. For that reason, optimizing data locality should not improve performance much. To validate this hypothesis, we tried to improve some long-latency memory accesses. By drilling down the call path from main, we identified a loop at line 391 in file chemkin_m.f90 that accounts for 9.6% of the total latency.

The four-dimensional array yspecies, allocated on the heap, is re-touched by the same instruction in a different iteration, as shown with the red circle in Figure 6.9.

The long distance is due to the mismatch between the memory access pattern and the data layout of yspecies. Since Fortran is column-major, traversing the last dimension of the array causes memory accesses with large stride. We transposed the data layout of yspecies to match the access pattern. However, the overall performance improvement with the optimization was less than 1%, which confirmed our expectation that improving the data locality of a non-memory-bound code would yield little benefit.

Figure 6.9 : Pinpoint one of the problematic memory accesses in the S3D program with its full calling context.


6.5 Discussion

This chapter describes the design and implementation of a data locality profiler. It combines various sampling mechanisms to achieve very low measurement overhead and rich optimization information. With several case studies, we show that our tool can successfully identify poor memory access patterns and problematic variables.

We envision extending our tool in two ways. First, our profiler operates on single-threaded programs. It could be extended to multithreaded programs by carefully parking threads before forking and creating matching threads in each child process.

Second, our tool could be extended to analyze latency and reuse measurement data, as well as code structure, to provide guidance about how to optimize data locality.

Chapter 7

Analyzing NUMA Inefficiencies in Threaded Programs

As microprocessors have become faster and multiple cores per chip have become the norm, memory bandwidth has become an increasingly critical rate-limiting factor for system performance. For multiprocessor systems, scaling memory bandwidth proportional to processing power has led to designs in which microprocessors include memory controllers on chip. As a result, the aggregate memory bandwidth of systems scales with the number of microprocessors.

Multiprocessor systems in which some memory is locally attached to each processor are known as Non-Uniform Memory Access (NUMA) architectures because it is faster for a microprocessor to access memory that is locally attached rather than memory attached to another processor. Some microprocessors, e.g., IBM's POWER7 [123], have NUMA organizations on-chip as well, with some cores having lower-latency access to some directly attached cache and/or memory banks than others. To simplify discussion of systems where NUMA effects may exist within and/or between microprocessors, we simply refer to cache/memory along with all CPU cores that can access it with uniform latency as a NUMA domain. There is not only a difference in latency when accessing data in local vs. remote NUMA domains; there is also a difference in bandwidth: cores typically have significantly higher bandwidth access to local cache/memory than to remote.

Systems with multiple NUMA domains are challenging to program efficiently.

Without careful design, multithreaded programs may experience significant performance losses when running on systems with multiple NUMA domains if they access remote data too frequently. On such systems, multithreaded programs achieve the best performance when threads are bound to specific cores and each thread mostly processes data co-located in the same NUMA domain as the core on which it executes.

In addition to latency, contention can also hurt the performance of multithreaded programs. If many of the data accesses performed by a multithreaded program are remote, contention for limited bandwidth between NUMA domains can be a bottleneck. This problem can be particularly acute if a large data array is mapped to a single NUMA domain and many threads compete for the limited bandwidth in and out of that domain. This situation is more common than one might think. By default, today's Linux systems employ a “first-touch” policy to bind pages of memory newly allocated from the operating system to memory banks directly attached to the NUMA domain where the thread that first accesses the page resides. As a result, if a single thread initializes a data array, but multiple threads process the data later, severe contention can arise.

Tailoring a program for efficient execution on systems with multiple NUMA domains requires identifying and adjusting data and computation layouts to minimize each thread's use of remote data and avoid contention for bandwidth between NUMA domains. Due to the myriad of opportunities for mis-steps, tools that can provide insight into NUMA-related performance losses are essential for guiding optimization of multithreaded programs.

There are two broad classes of techniques for identifying NUMA bottlenecks in a program: simulation and measurement. Tools such as MACPO [113] and NUMAgrind [141] use simulation to identify NUMA bottlenecks in a program. A drawback of tools that simulate all memory accesses is that they are slow, which makes them of limited use for programs with significant running time. To address this shortcoming, a new class of tools, e.g., ThreadSpotter [115], apply simulation sparingly to selected memory accesses to reduce execution overhead.

In contrast, measurement-based tools, such as TAU [121], Intel VTune Amplifier [66], IBM Visual Performance Analyzer (VPA) [65], AMD CodeAnalyst [4], and CrayPat [43], can gather data to provide insight into NUMA performance at much lower cost. On today's microprocessor-based systems, such tools can monitor NUMA-related events using a microprocessor's on-chip Performance Monitoring Unit (PMU). These tools measure and aggregate NUMA-related events and associate them with source code contexts, such as functions and statements. We call this approach code-centric analysis. A shortcoming of code-centric measurement and analysis is that it often fails to provide enough guidance for NUMA optimization [96]. Data-centric analysis tools, which can provide deeper insight into NUMA bottlenecks, use advanced PMU capabilities to gather instruction and data address pairs to associate instructions that access memory with the variables that they touch. Existing data-centric tools, such as HPCToolkit [85, 87, 89], Memphis [96], and MemProf [75], can identify both instructions and variables that suffer from NUMA problems.

There are three principal shortcomings of existing tools for measurement-based data-centric analysis. First, since PMU support for data-centric analysis differs significantly across microprocessors from different vendors and even processor generations, most data-centric tools support only a single family of PMU designs. Second, these tools do not assess the impact of NUMA bottlenecks on overall program performance. Without such information, one may invest significant effort to improve the design of a code for NUMA systems to address measured inefficiencies and net only a small performance gain. Finally, while existing tools can identify NUMA-related bottlenecks, they fail to provide information to guide NUMA-aware code optimization.

To address these shortcomings, we developed a lightweight tool for measurement and analysis of performance problems on multicore and multiprocessor systems with multiple NUMA domains. Our profiler outperforms existing tools in three ways. First, our tool is widely applicable to nearly all modern microprocessor-based architectures.

Second, we define several derived metrics that can be computed by tools to assess the severity of NUMA bottlenecks. These metrics can effectively identify whether NUMA-related performance losses in a program are significant enough to warrant optimization. Third, our tool analyzes memory accesses and provides a wealth of information to guide NUMA optimization, including information about how and where to distribute data to maximize local accesses and reduce memory bandwidth contention.

We describe case studies using four well-known multi-threaded codes that highlight the capabilities of our tool and demonstrate their utility. For three of the programs, our tool provided unique insights unavailable with other tools. These guided us to code changes that yielded non-trivial performance improvements. In the course of our studies, we found, somewhat surprisingly, that stack variables sometimes play a significant role in NUMA bottlenecks. Another case study demonstrates the utility of our novel metrics for assessing the severity of NUMA bottlenecks.

The rest of this chapter is organized as follows. Section 7.1 describes NUMA problems and introduces NUMA optimization strategies. Section 7.2 describes hardware support for data-centric measurement in modern microprocessors, which provides the foundation for our tool. Section 7.3 describes derived metrics our tool computes to quantify the impact of NUMA-related performance losses. Section 7.4 shows how we attribute metrics with different views for NUMA analysis. Section 7.5 describes how we efficiently pinpoint locations in the source code that need modification to effect NUMA-aware data distributions. Section 7.6 describes the implementation of our tool. Section 7.7 presents case studies that illustrate the use of our tool to identify and fix NUMA-related bottlenecks in four multithreaded programs. Finally, Section 7.8 summarizes the conclusions of this chapter and plans for future work.

7.1 NUMA-aware Program Design

There are two causes of NUMA bottlenecks: excessive remote accesses and uneven distribution of requests to different NUMA domains. In modern processors, remote accesses have more than 30% higher latency than local accesses [122]. If one co-locates data in the same NUMA domain with a thread that manipulates it most frequently, the program can benefit from the fast local accesses. We use the term co-location to refer to the process of mapping part or all of a data object and the thread that accesses it most frequently to the same NUMA domain.

On the other hand, uneven distribution of memory accesses to NUMA domains can lead to contention and unnecessary bandwidth saturation in both on-chip and off-chip interconnects, caches, and memory controllers. Contention for interconnect and memory controller bandwidth has been observed to increase memory access latency by as much as a factor of five [41]. Instead of mapping large data objects to a single NUMA domain, in many cases one can reduce contention and associated bandwidth saturation by distributing large data objects across all NUMA domains. We call this optimization contention reduction.

Figure 7.1 illustrates various strategies for mapping data to NUMA domains and discusses the latency, contention, and performance issues associated with alternative distributions.

Figure 7.1 : Three common data distributions in a program on a NUMA architecture. The first distribution allocates all data in NUMA domain 1. It suffers from both locality and bandwidth problems. The second distribution maps data to each of the NUMA domains and avoids centralized contention. The third distribution co-locates data with computation, which both maximizes low-latency local accesses and reduces bandwidth contention.

One can identify excessive remote accesses to program data objects by measuring NUMA-related events using techniques described in the next section. One can identify contention by counting requests to different NUMA domains. Co-locating data and computation is the most powerful optimization, as it reduces the need for bandwidth to remote domains, reduces bandwidth contention, and reduces latency by exploiting the lower latency and higher bandwidth access to local data. In cases where there is not a fixed binding between threads and data and/or concurrent computations may focus on only parts of a data object at a time, using memory interleaving to avoid contention for a single NUMA domain may be beneficial. However, in some cases, using interleaving to balance memory requests to NUMA domains may hurt locality and performance [75, 89].

When data objects are not allocated using interleaved memory pages, the Linux “first touch” policy binds a new memory page obtained from the OS to a physical page frame managed by a memory controller when the page is first read or written. To ensure that data will be mapped to the proper NUMA domain, one must carefully design the code that first touches each of the principal data structures in a multithreaded program. To control allocation without completely refactoring an application, one can introduce an initialization pass right after a variable is allocated to control its page distribution.

To tailor a program for NUMA systems, application developers need tools that

• pinpoint the variables suffering from uneven memory requests, so one can use different allocation methods (e.g., interleaved allocation) to balance the memory requests,

• analyze the access patterns across threads to guide NUMA locality optimization, and

• identify where data pages are bound to NUMA domains.

To our knowledge, no prior profiling tool provides all of these capabilities. In the next section, we describe address sampling, a key technique needed to build tools with these desired capabilities.

7.2 Address Sampling

Address sampling, which involves collecting instruction and data address pairs to associate memory references with the data that they touch, is essential for profiling NUMA access patterns. PMUs on most recent processors from AMD, Intel, and IBM support address sampling. A processor can support efficient NUMA profiling if and only if it has the following three capabilities.

• It can record the effective address touched by a sampled instruction that accesses memory. It is important for this support to guarantee that memory accesses are uniformly sampled.

• It can sample memory-related events, such as load/store instructions and cache accesses/misses, as well as measure memory access latency. Such information is useful for quantifying NUMA-related bottlenecks.

• It can capture the precise instruction pointer for each sample. Special hardware support is required for correct attribution of access behavior to instructions on processors with out-of-order cores [45].

We know of five hardware-based address sampling techniques used in modern processors: instruction-based sampling (IBS) [45] in AMD Opteron and its successors, marked event sampling (MRK) [126] in IBM POWER5 and its successors, precise event-based sampling (PEBS) [68] in Intel Pentium 4 and its successors, data event address sampling (DEAR) [69] in Intel Itanium, and precise event-based sampling with load latency extension (PEBS-LL) [68] in Intel Nehalem and its successors.

These hardware sampling mechanisms are described in detail in Section 2.1.

Since support for address sampling is not universally available, e.g., on ARM processors, we developed a software-based strategy for address sampling that we call Soft-IBS. Soft-IBS can simulate address sampling with memory access instrumentation. Using an instrumentation engine based on LLVM [90], we instrument every memory access instruction using a function that captures the effective address and instruction pointer of the sampled instruction. The engine invokes an instrumentation stub function that our profiler overloads to monitor memory accesses. Rather than recording information for each memory access, our profiler records information for every nth memory access, where the value of n can be selected when the profiler is launched.
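The sketch below shows the shape of such an instrumentation stub (illustrative names; record_sample stands for the profiler routine that attributes the sample):

```cpp
#include <cstdint>

// Per-thread access counter and the sampling period n; 10,000,000 matches the
// Soft-IBS configuration listed in Table 7.1.
static thread_local uint64_t accessCount = 0;
static const uint64_t kSamplePeriod = 10000000;

// Assumed profiler hook that attributes one address sample.
extern void record_sample(uint64_t ip, uint64_t effAddr);

// Stub invoked by the LLVM-inserted hook before every load and store.
extern "C" void on_memory_access(uint64_t ip, uint64_t effAddr) {
  if (++accessCount % kSamplePeriod == 0)
    record_sample(ip, effAddr);   // behaves like one hardware address sample
}
```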

These six address sampling mechanisms are the foundation for our NUMA profiling tool. The aforementioned hardware and software mechanisms for address sampling each have their own strengths and weaknesses for NUMA profiling. Naturally, the hardware mechanisms are much more efficient than Soft-IBS. In Section 7.7, we compare the overhead of address sampling with these various mechanisms.

7.3 NUMA Metrics

Using information we gather with address sampling, we compute several metrics to gain insight into NUMA bottlenecks. In the next section, we describe metrics to identify remote accesses and imbalanced memory requests. These metrics can be computed with information gathered using any mechanism for address sampling. In Section 7.3.2, we describe some additional metrics that can be derived using information available from only some PMU implementations.

7.3.1 Identifying Remote Accesses and Imbalanced Requests

To understand remote accesses, our profiler computes two derived metrics: Ml and Mr. Ml is the number of sampled memory accesses touching data in the local NUMA domain, while Mr is the number of sampled memory accesses touching data in a remote NUMA domain. Unless Mr ≪ Ml for a code region, the code region may suffer from NUMA problems. To compute Ml and Mr, our tool needs two pieces of information for each address sample: the NUMA domain that is the target of an effective address and the NUMA domain of the thread performing the access. Our profiler uses the move_pages API from libnuma [74] to query the NUMA domain for the effective address. To identify the NUMA domain for a thread, we have two mechanisms. With PMU support for address sampling, a sample includes the ID of the CPU performing the access. For Soft-IBS, we bind each thread to a core and maintain a static mapping between thread and CPU ID that we query at runtime. Our tool uses libnuma's numa_node_of_cpu to map the CPU ID to its NUMA domain. For each address sample, if the effective address and thread are in the same NUMA domain, we increment Ml; otherwise, we increment Mr.

To evaluate memory request balance, our tool counts the number of sampled memory accesses to each NUMA domain by each thread. As before, we identify the NUMA domain for an access using the libnuma move_pages API. If the aggregate number of accesses to some NUMA domains is much larger than to others, the access pattern may cause memory bandwidth congestion.
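The following sketch shows how one sample could be classified with libnuma (a simplified illustration; classify_sample and NumaCounters are illustrative names, and error handling is minimal):

```cpp
#include <cstdint>
#include <unistd.h>   // getpagesize
#include <numa.h>     // numa_node_of_cpu (link with -lnuma)
#include <numaif.h>   // move_pages

struct NumaCounters { uint64_t Ml = 0, Mr = 0; };

// Classify one address sample as a local (Ml) or remote (Mr) access.
void classify_sample(void* effAddr, int cpuId, NumaCounters& c) {
  uintptr_t pageSize = (uintptr_t)getpagesize();
  void* page = (void*)((uintptr_t)effAddr & ~(pageSize - 1));
  int status = -1;
  // With a NULL node array, move_pages reports the NUMA node holding each page.
  if (move_pages(0 /* this process */, 1, &page, nullptr, &status, 0) != 0 ||
      status < 0)
    return;                                   // page not resident or query failed
  int threadNode = numa_node_of_cpu(cpuId);   // NUMA domain of the accessing core
  (status == threadNode) ? ++c.Ml : ++c.Mr;
}
```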

By themselves, the aforementioned metrics can be misleading. For example, if a thread loads a variable allocated in another NUMA domain into its private cache and touches it frequently, then even though no further remote accesses occur, the Mr attributed to this thread is high, because move_pages reports the variable as residing in a remote NUMA domain. Therefore, one needs to use other metrics to eliminate this bias. In the next section, we describe a latency-per-instruction metric, which can be computed on some architectures, that addresses this shortcoming.

7.3.2 NUMA Latency per Instruction

IBS and PEBS-LL support measuring the latency of remote accesses. When this information is available, our tool computes the NUMA latency per instruction to provide additional insight into NUMA bottlenecks. We compute the NUMA latency per instruction, lpiNUMA, to quantify a NUMA bottleneck’s impact on overall program performance. If a NUMA bottleneck’s lpiNUMA is small, even if Mr is large, NUMA optimization of the program can’t improve performance much. Equation 7.1 defines lpiNUMA.

\[ \mathrm{lpi}_{\mathrm{NUMA}} = \frac{l_{\mathrm{NUMA}}}{I} = \frac{l_{\mathrm{NUMA}}}{I_{\mathrm{NUMA}}} \times \frac{I_{\mathrm{NUMA}}}{I_{\mathrm{MEM}}} \times \frac{I_{\mathrm{MEM}}}{I} \qquad (7.1) \]

In the equation, lNUMA is the total latency of all remote accesses (including accesses to remote caches and memory); INUMA, IMEM, and I represent the number of remote accesses, memory accesses, and instructions executed, respectively. This metric can be computed for the whole program or any code region. The equation can be computed as the product of three terms: the average latency per remote access, the fraction of memory accesses that are remote, and the ratio of memory accesses per instruction executed. The resulting quantity indicates whether NUMA performance losses are significant for a code region.

Equation 7.2 shows how we approximate lpiNUMA using address sampling with IBS.

\[ \mathrm{lpi}_{\mathrm{NUMA}} \approx \frac{l^{s}_{\mathrm{NUMA}}}{I^{s}} \qquad (7.2) \]

Here, l^s_NUMA is the accumulated latency for sampled remote accesses and I^s is the number of sampled instructions. Calculating lpiNUMA this way yields an approximate value because l^s_NUMA and I^s are representative subsets of lNUMA and I. Equation 7.3 shows how we approximate lpiNUMA using address sampling with PEBS-LL.

\[ \mathrm{lpi}_{\mathrm{NUMA}} \approx \frac{l^{s}_{\mathrm{NUMA}}}{E^{s}_{\mathrm{NUMA}}} \times \frac{E_{\mathrm{NUMA}}}{I} \qquad (7.3) \]

As PEBS-LL samples memory access events, we can obtain an absolute event count E_NUMA and an average latency per sampled remote access event, l^s_NUMA / E^s_NUMA, for the whole program or any code region. With a conventional PMU counter, we can collect the absolute number of instructions I executed by the monitored program or any code region. Experimentally, we have found that if lpiNUMA is larger than 0.1 cycles per instruction, the NUMA losses for a program or important code region are significant enough to warrant optimization.

7.4 Metric Attribution

To help a user understand NUMA performance losses, our tool attributes metrics using three different approaches.

• Code-centric attribution correlates performance metrics with instructions, loops, and functions that have high access latency.

• Data-centric attribution associates performance metrics with variables that have high access latency.

• Address-centric attribution summarizes a thread's memory accesses, which is useful for understanding data access patterns.

Each attribution technique highlights different aspects of NUMA performance losses. Together, they provide deep insight into NUMA bottlenecks. We describe these attribution methods in the next two sections.

7.4.1 Code- and Data-centric Attribution

Using hardware or software support for address sampling, our tool can accurately attribute costs to both code and data. Our approach for code- and data-centric attribution leverages existing support in HPCToolkit [89], which is described in Chapter 5.

For code-centric attribution, we determine the call path for each address sample by unwinding the call stack. We then associate NUMA metrics with the call path.

For data-centric attribution, we directly associate metrics with variables, including static variables and dynamically allocated heap data. Our tool identifies address ranges associated with static variables by reading symbols in the executable and dynamically loaded libraries. We identify address ranges associated with heap-allocated variables by tracking memory allocations and frees. Our tool attributes each sampled heap variable access to the full calling context where the heap variable was allocated.

7.4.2 Address-centric Attribution

Address-centric attribution provides insight into the memory access patterns of each thread. Such information is needed by application developers to help them adjust data distributions to minimize NUMA overhead. Prior data-centric tools, e.g., [96, 89, 75], identify problematic code regions and data objects, but don't provide insight into data access patterns to guide optimization. Below, we first describe a naive address-centric attribution strategy and then introduce refinements to make this approach useful.

For each memory access m to a variable x, e.g., an array, we compute the addresses that form the lower and upper bounds of the range accessed by m and update the lower and upper bounds of x accessed for each procedure along the call path to m. At analysis time, we plot the upper and lower bounds of the data range accessed by each thread for any variable in any calling context to gain insight into the data access patterns across threads for code executed in that context.

This approach is too simplistic to be useful. Often data ranges are accessed non-uniformly because of loops, conditionals, and indirect references. For some program regions, a hot variable segment may account for 90% of a thread's accesses, while other segments of this variable only account for 10%. Therefore, instead of computing the minimum and maximum effective addresses for the whole memory range allocated for the variable, we represent a variable's address range with a sequence of bins, each bin representing a subrange. We treat each bin as a separate synthetic variable that gets its own data- and address-centric attributions. As performance measurements are associated with individual bins, one can easily identify the hot bins of a variable.

We only use the access patterns of the hot bins to represent the access patterns of the whole variable. It is worth noting that selecting the number of bins for variables is important. A large number of bins for a variable can show fine-grained hot ranges but may ignore some important patterns. Currently, our tool divides a variable with an address range larger than five pages into five bins by default; one can change this number via an environment variable.
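A minimal sketch of this binning scheme is shown below; the structure and function names are illustrative, and the default of five bins follows the description above.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Bin {
  uint64_t lo, hi;             // sub-range of the variable's address range
  uint64_t minAddr, maxAddr;   // bounds actually touched (minAddr > maxAddr if empty)
  uint64_t samples;            // sampled accesses falling in this bin
};

// Split a variable's address range into nBins synthetic variables.
std::vector<Bin> make_bins(uint64_t start, uint64_t size, int nBins = 5) {
  std::vector<Bin> bins;
  uint64_t step = size / nBins;
  for (int i = 0; i < nBins; ++i) {
    uint64_t lo = start + i * step;
    uint64_t hi = (i == nBins - 1) ? start + size : lo + step;
    bins.push_back(Bin{lo, hi, hi, lo, 0});   // min/max initialized to "empty"
  }
  return bins;
}

// Attribute one sampled address to the bin that contains it.
void attribute(std::vector<Bin>& bins, uint64_t addr) {
  for (auto& b : bins) {
    if (addr >= b.lo && addr < b.hi) {
      b.minAddr = std::min(b.minAddr, addr);
      b.maxAddr = std::max(b.maxAddr, addr);
      ++b.samples;
      return;
    }
  }
}
```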

The aforementioned approach for maintaining address ranges merges them online to keep profiles compact. However, different memory accesses experience different latencies. For that reason, when analyzing access ranges for variables at different levels of abstraction (statement, loop, procedure, and various levels in the call path), one should not give equal weight to access ranges in all contexts. One can use aggregate latency measurements attributed to a context as a guide to identify what program contexts are important to consider for NUMA locality optimization, and then use address range information for those contexts as a guide to understand what changes to data and/or code mappings will be needed to improve NUMA performance.

7.5 Pinpointing First Touches

As discussed in Section 7.1, identifying the source code first touching variables associated with NUMA performance losses is essential for optimization. However, manually identifying code performing first touches to variables is often difficult for complex programs. To automatically identify first touches, our tool uses a novel approach that employs page protection in Linux. Our strategy does not require any instrumentation of memory accesses, so it has low runtime overhead.

Figure 7.2 shows how we trap a first touch access to a large variable, e.g., an array.

Our tool first installs a SIGSEGV handler before a monitored program begins execution. Then, our tool monitors memory allocations in the program using wrappers for allocation functions. Inside the wrapper, our tool masks off the read and write permission of the allocated memory range between the first and last page boundaries within the variable's extent. When the monitored executable accesses the protected pages, the OS delivers a SIGSEGV signal. Our tool's SIGSEGV handler catches the signal and does three things. First, it uses the signal context to perform code-centric attribution. Second, it uses the data address that caused the fault (available in the signal info structure) to perform data-centric attribution.

Figure 7.2 : Identifying the first touch context for each heap variable. Our NUMA profiler applies both code-centric and data-centric attribution for first touches.

With both code- and data-centric attribution of the SIGSEGV signal, one knows the location of first touches to every monitored variable. Finally, the handler restores read and write permissions for the variable's monitored pages. It is worth noting that multiple threads may initialize a variable concurrently in a parallel loop, so more than one thread may enter the SIGSEGV handler. Thus, multiple threads may concurrently identify first touches and record code- and data-centric attributions. Call paths of first touches to the same variable from different threads are merged post-mortem.
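The sketch below outlines this mechanism (illustrative names; attribute_first_touch stands for the code- and data-centric attribution performed in the handler, and only the faulting page is unprotected here, whereas the full tool restores all of the variable's monitored pages):

```cpp
#include <csignal>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

// Assumed hook that records code- and data-centric attribution for a first touch.
extern void attribute_first_touch(void* faultAddr, void* signalContext);

static void segv_handler(int, siginfo_t* info, void* ctx) {
  void* addr = info->si_addr;             // faulting data address
  attribute_first_touch(addr, ctx);       // attribute the first touch
  // Restore access to the faulting page so the first touch can proceed
  // (simplification: the full tool restores the whole monitored range).
  uintptr_t pageSize = (uintptr_t)getpagesize();
  uintptr_t page = (uintptr_t)addr & ~(pageSize - 1);
  mprotect((void*)page, pageSize, PROT_READ | PROT_WRITE);
}

void install_handler() {
  struct sigaction sa = {};
  sa.sa_sigaction = segv_handler;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGSEGV, &sa, nullptr);
}

// Called from the allocation wrapper: protect the page-aligned interior of the block.
void protect_block(void* start, size_t size) {
  uintptr_t pageSize = (uintptr_t)getpagesize();
  uintptr_t lo = ((uintptr_t)start + pageSize - 1) & ~(pageSize - 1);  // first page boundary
  uintptr_t hi = ((uintptr_t)start + size) & ~(pageSize - 1);          // last page boundary
  if (hi > lo) mprotect((void*)lo, hi - lo, PROT_NONE);
}
```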

7.6 Tool Implementation

We implemented support for analyzing NUMA inefficiencies as extensions to HPCToolkit [2], an open-source performance tool for call path profiling of parallel programs. HPCToolkit is described in Section 2.3. In the rest of this section, we describe how we extended HPCToolkit's measurement, analysis, and presentation tools to support analysis of NUMA performance problems. We refer to our modified version of HPCToolkit for pinpointing and analyzing NUMA bottlenecks as HPCToolkit-NUMA.

7.6.1 Online Profiler

Our extensions to HPCToolkit for NUMA performance analysis perform three tasks. First, they configure PMU hardware for address sampling. We extended HPCToolkit to leverage Perfmon [35] to control PMUs that employ IBS, PEBS, and DEAR. Our extensions use Linux perf events [139] to configure PMUs for architectures that support MRK and PEBS-LL. For software-based sampling, we extended HPCToolkit to override callbacks for instrumentation hooks inserted for loads and stores by LLVM; these instrumentation callbacks record information each time a predefined threshold of accesses occurs.

Second, HPCToolkit's hpcrun utility captures these address samples and attributes them to code and data, recording them in augmented calling context trees (CCTs) [89]. The augmented CCT our NUMA extensions record is a mixture of variable allocation paths, memory access call paths, and first touch call paths. Dummy nodes in the augmented CCT separate segments of calling context sequences recorded for different purposes.

Third, the profiler collects NUMA metrics including Ml, Mr, metrics that show the number of sampled accesses touching each NUMA domain, latency metrics, and address-centric metrics that summarize each thread's variable accesses in each subtree of the CCT.

7.6.2 Offline Analyzer and Viewer

Adapting HPCToolkit's hpcprof offline profile analyzer for NUMA measurement was trivial. The only enhancement needed was the ability to perform [min, max] range computations when merging different thread profiles. Instead of accumulating metric values associated with the same context, [min, max] merging requires a customized reduction function.

We added an address-centric view to HPCToolkit's hpcviewer interface to display address-centric measurements for all threads. The view plots the minimum and maximum address accessed for a variable vs. the thread index. The address range for a variable is normalized to the interval [0, 1]. The upper right pane in Figure 7.3 shows an example of this view for a heap-allocated variable. In the next section, we show how this novel view provides insight that helps guide optimization for NUMA platforms.

7.7 Experiments

We tested HPCToolkit-NUMA on five different architectures to evaluate its functionality using different hardware and software support for address sampling. Table 7.1 shows our choices of events and sampling periods for evaluating each of the address sampling mechanisms. The criteria for choosing events are based on (1) sampling every memory access (not only NUMA-related events or instructions) to avoid biased results for access patterns and (2) sampling all instructions (if possible) to support computing NUMA latency per instruction.

Sampling mechanism                   Processor              Threads  Event                    Sampling period
Instruction-based sampling (IBS)     AMD Magny-Cours        48       IBS op                   64K instructions
Marked event sampling (MRK)          IBM POWER7             128      PM_MRK_FROM_L3MISS       1
Precise event-based sampling (PEBS)  Intel Xeon Harpertown  8        INST_RETIRED:ANY_P       1000000
Data event address registers (DEAR)  Intel Itanium 2        8        DATA_EAR_CACHE_LAT4      20000
PEBS with load latency (PEBS-LL)     Intel Ivy Bridge       8        LATENCY_ABOVE_THRESHOLD  500000
Software-supported IBS (Soft-IBS)    AMD Magny-Cours        48       memory accesses          10000000

Table 7.1 : Configurations of the different sampling mechanisms on different architectures.

Method    LULESH        AMG2006       Blacksholes
IBS       295s (+24%)   89s (+37%)    192s (+6%)
MRK       93s (+5%)     27s (+7%)     132s (+4%)
PEBS      65s (+45%)    96s (+52%)    82s (+25%)
DEAR      90s (+7%)     120s (+12%)   73s (+4%)
PEBS-LL   35s (+6%)     57s (+8%)     67s (+3%)
Soft-IBS  604s (+200%)  220s (+180%)  270s (+30%)

Table 7.2 : Runtime overhead measurement of HPCToolkit-NUMA. The number outside the parentheses is the execution time without monitoring, and the percentage in the parentheses is the monitoring overhead. The absolute execution times on different architectures are not comparable because we adjust the benchmark inputs according to the number of cores in the system.

For the tests we ran, the sampling period we chose for each event except MRK gives 100–1000 samples per second per thread.* To evaluate the tool, we used four multi-threaded benchmarks: LULESH [77], AMG2006 [79], Blacksholes [17], and UMT2013 [78]. These benchmarks are described in detail in Section 2.4.

Some NUMA optimizations of LULESH and AMG2006 are described in Chapter 5.

Guided by insights from HPCToolkit-NUMA, we were able to develop significantly better NUMA-aware optimizations for both LULESH and AMG2006, advancing the state of the art.

Table 7.2 shows the measurement overhead when running HPCToolkit-NUMA on different architectures with different sampling methods. From the table, we can see that the overhead of HPCToolkit-NUMA differs across sampling methods. Soft-IBS incurs the highest runtime overhead because it is based on software instrumentation; PEBS incurs the second highest overhead because we compensate for off-by-1 attribution by the PMU using online binary analysis to identify the previous instruction, which is difficult for x86 code;† IBS has the third highest overhead because it samples all kinds of instructions and its sampling rate is high; the other sampling methods incur very low runtime overhead. At runtime, the aggregate runtime footprint of HPCToolkit-NUMA's data structures was less than 40MB for any sampling method on any of the architectures.

*Marked event sampling on POWER7 with the fastest sampling rate under user control generates less than 100 samples/second per thread.

In the case studies that follow, we only show measurements obtained using IBS and MRK because HPCToolkit-NUMA can provide similar analysis results using any sampling method. Moreover, IBS is one of the two PMU hardware types that supports the lpiNUMA metric, while MRK can show how we analyze NUMA bottlenecks using NUMA metrics we derived without the help of latency information. For experiments using IBS, we use a system with four 12-core AMD Magny-Cours processors. Overall, the system has 48 cores and 128GB memory, which is evenly divided into eight NUMA domains. For experiments using MRK, we use a system with four eight-core POWER7 processors. Overall, the system has 128 SMT hardware threads and 64GB memory.

In this study, we consider each socket a NUMA domain. We run LULESH, AMG2006 and Blackscholes on all hardware threads; we run UMT2013 with 32 threads because its standard input size is limited to 32 threads.

[Figure 7.3 is a screenshot of the extended hpcviewer: the metric pane spans NUMA domain 0 through domain 7 with special and common metrics, and the annotated call paths show where z is allocated, accessed, and first touched.]

Figure 7.3 : Identifying NUMA bottlenecks in LULESH using code-centric, data-centric and address-centric attributions.

7.7.1 LULESH

We first measure LULESH with IBS on the AMD machine. Figure 7.3 shows the results of our NUMA performance analysis displayed in a modified version of HPCToolkit's hpcviewer, which is described in Section 2.3. Unlike the original hpcviewer, the top right pane shows the address-centric view, which represents the memory ranges accessed by individual threads. The bottom left pane shows the program structure of synthetic CCTs. Annotations in this pane show a mixture of call paths in the CCT.

†It would be better to perform this correction during post-mortem analysis.

The overall program's NUMA latency per instruction (lpiNUMA) is 0.466. This is significantly larger than our 0.1 rule of thumb, which means the NUMA problems in LULESH are significant enough to warrant optimization. We first investigate the heap-allocated variables and then other kinds of variables. The heap-allocated variables have an lpiNUMA of 11.7, and 74.2% of the total latency is caused by remote NUMA domain accesses. We drill down the call paths in the bottom left pane to the call sites (operator new[] in the figure) of variable allocations, discovering three variables each with more than 8% of total latency caused by NUMA accesses. One can identify the variable names from the source code pane by clicking their allocation sites (marked as 2160, 2164 and 2159 respectively to the left of operator new[]).
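
As a reminder of the metric's intent (Section 7.3 gives the precise definition; the form shown here is only an approximation), the whole-program value is roughly

    \mathrm{lpi_{NUMA}} \approx \frac{\textrm{access latency attributed to NUMA problems (cycles)}}{\textrm{instructions executed}}

so 0.466 means that, on average, each executed instruction carries nearly half a cycle of NUMA-related latency, well above the 0.1 rule of thumb. The per-variable lpiNUMA values reported below use a data-centric normalization, which is why they can exceed the whole-program value.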

Here, we study the variable z, which causes the most significant NUMA losses.

Overall, z accounts for 11.3% of the total latency caused by remote accesses.

We observe two facts: (1) Mr (labeled NUMA_MISMATCH in the metric pane) is roughly seven times Ml (labeled NUMA_MATCH in the metric pane); and (2) all accesses to z target NUMA domain 0 (NUMA_NODE0 equals the sum of Ml and Mr). Therefore, we infer that the pages for z are all allocated in NUMA domain 0 but accessed by threads in other domains. The top right pane in Figure 7.3 plots the minimum and maximum addresses accessed in z by each thread. From the figure, we can see that, other than thread 0, each thread touches a subset of z. Threads with higher ranks touch higher address intervals in z. Visualizing the results of address sampling provides the key insight about how to adjust the layout of heap data to improve performance, namely by co-locating data and the computation upon it in the same NUMA domain.

Based on this address-centric view, it is clear that one could improve the NUMA performance of LULESH by dividing the memory range allocated for z into eight contiguous regions, segmented by the rectangles shown in the top right pane of Figure 7.3. One should allocate each block in the appropriate NUMA domain so it will be co-located with the threads that access it. We implemented this strategy by adjusting the code that first touches z, identified by our tool and shown in the top left pane of Figure 7.3. This optimization exploits the higher bandwidth and lower latency of local memory. It also reduces the bandwidth consumed between NUMA domains.
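
The sketch below illustrates this kind of first-touch change; it is not the actual LULESH source, and the array name and size are only placeholders. Under Linux's default first-touch policy, letting each OpenMP thread initialize the block of z it later computes on places that block's pages in the thread's own NUMA domain.

    #include <omp.h>

    int main() {
      const long numNodes = 1000000;        // placeholder problem size
      double *z = new double[numNodes];     // stands in for a LULESH nodal array

      // Original code: only the master thread initialized z, so under the
      // first-touch policy every page of z was placed in NUMA domain 0.
      //
      // NUMA-aware version: a static schedule gives each thread the same
      // contiguous block of indices it uses in the compute loops, so first
      // touch places each block's pages in the accessing thread's domain.
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < numNodes; ++i)
        z[i] = 0.0;

      // ... compute loops partition i with the same static schedule ...
      delete[] z;
      return 0;
    }

The same pattern applies to the other heap-allocated arrays optimized below.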

We similarly optimize other heap-allocated variables, including x, y, xd, yd and zd.

LULESH also makes heavy use of a stack-allocated variable nodelist. Since our tool does not currently provide detailed NUMA analysis of stack data, we modified the source code of the program to declare nodelist as a static variable. HPCToolkit-NUMA's data-centric analysis shows that nodelist accounts for 20.3% of the total latency caused by remote accesses and for 13.3% of the mismatched memory accesses (Mr). A high lpiNUMA is associated with nodelist, meaning that it has high NUMA latency that warrants optimization. The Mr metric associated with nodelist is about seven times as large as Ml and all accesses target NUMA domain 0, which means that nodelist is initialized by the master thread but accessed in parallel by worker threads in other NUMA domains. Address-centric analysis identifies that nodelist has the same per-thread access pattern as z, shown in Figure 7.3. As for z, such an access pattern reveals that a block-wise distribution of the pages allocated for nodelist would be appropriate.

Guided by our tool, we implemented this block-wise data distribution for both the heap-allocated and stack variables and were able to speed up LULESH by 25% on our AMD system. In contrast, the page interleaving strategy suggested by our prior work described in Section 5.3.3 gave only a 13% improvement over the original code not tuned for NUMA architectures.
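
For comparison, the interleaved placement used by the earlier strategy can also be requested explicitly through libnuma; the sketch below is one way to do so and is not necessarily how Section 5.3.3 applies it.

    #include <numa.h>
    #include <cstddef>

    // Sketch: allocate an array whose pages are interleaved round-robin across
    // all NUMA domains. Interleaving balances memory traffic among domains but,
    // on an N-domain system, still leaves roughly (N-1)/N of accesses remote.
    double *alloc_interleaved(std::size_t n) {
      return static_cast<double *>(numa_alloc_interleaved(n * sizeof(double)));
    }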

Measuring LULESH with MRK on POWER7 showed similar NUMA problems: 66% of L3 cache misses access remote memory. Heap-allocated arrays, such as x, y, z, xd, yd and zd, account for 65% of remote accesses, while the stack variable nodelist accounts for 31%. On our POWER7 system, HPCToolkit-NUMA's address-centric view showed that these variables have access patterns similar to those we observed on our AMD system. Using a block-wise page distribution for these variables improved the execution time of LULESH by 7.5% on our POWER7 system. In contrast, using an interleaved page allocation (as suggested by the prior work described in Section 5.3.3) degraded performance on POWER7 by 16.4%.‡

7.7.2 AMG2006

We ran AMG2006 with 48 threads on our AMD system, measuring it using IBS.

HPCToolkit-NUMA showed that AMG2006 has an lpiNUMA of more than 0.92, which means it has significant NUMA problems (more severe than LULESH) that are worth investigating. The heap-allocated variables of AMG2006 account for 61.8% of the total memory latency caused by remote accesses. By looking at heap variable allocation call paths, we identified one problematic variable, RAP_diag_data. RAP_diag_data accounts for 18.6% of total latency, with an lpiNUMA of 15.9 and 8.1% of total Mr. By examining the sampled accesses and the first-touch access to RAP_diag_data, we found that RAP_diag_data was allocated and initialized by the master thread but accessed by all worker threads in other NUMA domains.

The address-centric view in Figure 7.4 shows the access patterns of RAP_diag_data across all 48 threads aggregated over the whole program.

‡The NUMA optimization for the LULESH benchmark may not apply to the real LULESH application code because its input data can be highly irregular. Our optimization for the LULESH benchmark aims to validate our tool, not to improve the real LULESH code.


Figure 7.4 : Address-centric view showing the overall access patterns of RAP_diag_data in AMG2006 across all threads.

Figure 7.5 : Address-centric view for the accesses to RAP_diag_data in the most significant loop in AMG2006.

However, these threads do not show an obvious access pattern that can guide the page distribution for this variable. We therefore investigate the threads' access patterns for RAP_diag_data in individual OpenMP parallel regions rather than over the whole program. The most interesting parallel region shown in the call path is hypre_boomerAMGRelax._omp, which accounts for 74.2% (13.8/18.6) of the NUMA access latency caused by RAP_diag_data.

Figure 7.5 shows the access patterns of RAP_diag_data in this parallel region. Clearly, threads have a regular access pattern for RAP_diag_data here. Because accesses in hypre_boomerAMGRelax._omp dominate the cost of accessing RAP_diag_data, we can use this access pattern to direct the data distribution. As for LULESH, we apply a block-wise distribution at the first-touch location identified by our tool to allocate RAP_diag_data evenly across the NUMA domains.

If we analyze the source code through code-centric attribution alone, we find that accesses to RAP_diag_data in hypre_boomerAMGRelax._omp use the values of another array as indices (i.e., RAP_diag_data[A_diag_i[i]]), leading to indirect memory accesses. Without our address-centric analysis, one cannot determine where data layout changes are needed and how to refine them to improve performance.
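
The fragment below sketches the indirect access just described; the variable roles are modeled on the reported reference RAP_diag_data[A_diag_i[i]] and are not taken from the hypre source.

    #include <omp.h>

    // Sketch of an indirect access like the one attributed to
    // hypre_boomerAMGRelax._omp (names follow the text above, not hypre itself).
    void relax_sketch(long n, const long *A_diag_i, const double *RAP_diag_data,
                      const double *x, double *y) {
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; ++i) {
        // Which element (and therefore which page) of RAP_diag_data a thread
        // touches depends on the runtime contents of A_diag_i, so inspecting
        // this loop statically cannot reveal per-thread address ranges;
        // address sampling observes the ranges that actually occur.
        y[i] += RAP_diag_data[A_diag_i[i]] * x[i];
      }
    }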

We examined other hot variables and found another NUMA bottleneck:


Figure 7.6 : Address-centric view showing the overall access patterns of RAP_diag_j in AMG2006 across all threads.

Figure 7.7 : Address-centric view for the accesses to RAP_diag_j in the most significant loop in AMG2006.

RAP_diag_j accounts for 10.6% of the total latency caused by NUMA accesses. Figure 7.6 and Figure 7.7 show its address-centric analysis results, considering all accesses and the accesses in the most significant parallel region, respectively. Clearly, the access pattern in Figure 7.7 is much more regular than the one in Figure 7.6. Because memory accesses in this parallel region account for 73.6% of the total latency for RAP_diag_j in the whole program, we use its regular access pattern to allocate the pages of RAP_diag_j in a block-wise fashion at its first-touch location.

Besides these two variables, three other heap-allocated variables suffer from high remote access latency. According to the access patterns from address-centric analysis, one of them can be optimized using a block-wise distribution, as for RAP_diag_data and RAP_diag_j. The other two show each thread accessing the whole range of the variable, leading to an optimization using interleaved page allocation. Our optimizations achieve a 51% reduction in the running time of the solver phase of AMG2006. In production codes that employ this software, the running time of the solver is most important. Without guidance from our address-centric analysis, our prior NUMA optimization of AMG2006, described in Section 5.3.1, used interleaved allocation for every problematic variable, which only improved the solver-phase performance of AMG2006 by 36%.


Figure 7.8 : Address-centric view showing the access patterns of buffer across all threads.


7.7.3 Blackscholes

We measured Blackscholes on our AMD system using IBS. HPCToolkit-NUMA shows a much smaller lpiNUMA value (0.035 cycles per instruction) than the threshold (0.1) over the entire program, indicating that Blackscholes would not benefit from NUMA optimization. To validate this assessment, we eliminated the NUMA bottlenecks in Blackscholes and showed that this optimization barely improved the program's performance.

HPCToolkit-NUMA identified that heap-allocated variables account for 66.8% of the total latency caused by NUMA accesses, and that 51.6% of the latency is associated with the variable buffer. With the values of Ml and Mr, together with the data source metrics, we identified that buffer is allocated in only one NUMA domain by the master thread but accessed evenly by all threads in the system.

[Figure 7.9 depicts the layout of buffer over the address range 0x100–0x900. (a) Original access pattern: within each section of buffer, threads 1, 2, 3, ... access interleaved sub-ranges. (b) Improved access pattern: each thread accesses a single contiguous block.]

Figure 7.9 : Memory access patterns across threads in Blackscholes.

HPCToolkit-NUMA’s address-centric analysis in Figure 7.8 shows a regular access pattern across all threads. Each thread touches a sub-range of buffer in an ascending order, with large overlaps. To understand why the program reveals such pattern, we analyzed the source code. The top of Figure 7.9 shows the memory layout for buffer.Theprogramsetsfivepointerstopointdi↵erentsectionsofbuffer. All threads access to each section in parallel, leading to the access pattern shown in

Figure 7.9a. According to our address-centric analysis, the three example threads in the figure touch the address ranges of (0x100, 0x700), (0x200, 0x800) and (0x300,

0x900) respectively, matching the pattern revealed by address-centric results shown in Figure 7.8.

Because threads show non-local accesses to buffer, we regroup these sections into an array of structures, as shown in Figure 7.9b. With this optimization, the three example threads touch (0x100, 0x300), (0x400, 0x600), and (0x700, 0x900), respectively, with no overlap.
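
A simplified sketch of the regrouping is shown below; the field names are invented for illustration and the PARSEC source differs. Instead of five pointers carving one buffer into parallel sections, each option's fields are stored together, so the options assigned to a thread occupy one contiguous, non-overlapping range that the thread can first-touch itself.

    #include <cstddef>

    // Before (sketch): five pointers into one large buffer, one per field, so a
    // thread's accesses are scattered across all five sections (Figure 7.9a).
    struct OptionArrays {
      double *price, *strike, *rate, *volatility, *time;   // all point into buffer
    };

    // After (sketch): an array of structures keeps the five values of each
    // option together, so thread t's options form one contiguous block
    // (Figure 7.9b).
    struct OptionData {
      double price, strike, rate, volatility, time;
    };

    OptionData *allocate_options(std::size_t numOptions) {
      return new OptionData[numOptions];
    }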

To allocate the data block-wise, we changed the buffer initialization loop that our tool identified as the first-touch location. Originally, only the master thread initializes buffer. We parallelized the initialization loop using OpenMP to make sure that each thread first touches its own data. With this optimization, there is no longer any latency related to buffer caused by remote accesses.

    do c=1,nCorner
      do ig=1,Groups
        source=Z%STotal(ig,c)+Z%STime(ig,c,Angle)
      enddo
    enddo

Figure 7.10 : A loop kernel in UMT2013 that has many remote accesses to STime.

Although we largely eliminated the NUMA latency in the program by co-locating data and computation, the execution time of Blackscholes improved by less than 0.1%. This trivial improvement confirms that our derived metric lpiNUMA reflects the severity of NUMA problems: one can estimate the potential gains from NUMA optimization by examining lpiNUMA.

7.7.4 UMT2013

We ran UMT2013 on our POWER7 system with 32 threads, sampling instructions that cause L3 data cache misses. We bound each thread to a hardware core in each of the four NUMA domains. According to Ml and Mr, HPCToolkit-NUMA showed that 86% of L3 cache misses lead to remote memory accesses and that 47% of the remote accesses are due to references to heap-allocated variables.

Using HPCToolkit-NUMA, we identified the allocation call path of a hot variable, STime, which is a three-dimensional array that accounts for 18.2% of total remote accesses. Code-centric analysis in HPCToolkit-NUMA associates all remote accesses to STime with the reference shown in Figure 7.10. The reason for the high remote access ratio is that STime is allocated in one NUMA domain but accessed by threads in all NUMA domains. With address-centric analysis, HPCToolkit-NUMA identifies that STime has a staggered access pattern across threads, similar to the variable buffer in Blackscholes shown in Figure 7.8. A deeper analysis of the source code showed that the nested loop iterating over STime in Figure 7.10 is in an OpenMP parallel region. Two-dimensional planes of STime indexed by Angle are assigned to threads in a round-robin fashion. To ensure the data is co-located with its computation, we parallelized the initialization loop of STime identified by first-touch analysis. This strategy has each thread initialize the part of STime that it accesses in the computation stage. This optimization eliminated most remote accesses to STime and yielded a 7% speedup for the program as a whole.
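
The sketch below shows the shape of the parallelized initialization in C++ with OpenMP rather than the program's Fortran; the dimensions and layout are assumptions based on the description above, not the UMT2013 source.

    #include <omp.h>

    // Sketch: treat STime as nAngles planes of nCorner x nGroups values, mirroring
    // the loop in Figure 7.10. schedule(static, 1) hands planes to threads
    // round-robin, matching the assignment used in the compute phase, so first
    // touch places each plane in the domain of the thread that later reads it.
    void init_STime(int nAngles, int nCorner, int nGroups, double *STime) {
      #pragma omp parallel for schedule(static, 1)
      for (int angle = 0; angle < nAngles; ++angle)
        for (int c = 0; c < nCorner; ++c)
          for (int ig = 0; ig < nGroups; ++ig)
            STime[((long)angle * nCorner + c) * nGroups + ig] = 0.0;
    }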

7.8 Discussion

In this chapter, we present a profiling tool that identifies and analyzes NUMA bottlenecks in multi-threaded programs and helps guide program performance tuning by providing new metrics and insights into data access patterns. Using PMU support in modern microprocessors, our measurement-based tool can gather the information it needs with low runtime overhead. We demonstrate the utility of our tool and of the information it provides by using them to optimize four well-known benchmarks. While one code did not warrant NUMA optimization (according to our metrics) and did not benefit significantly when optimization was applied anyway, our tool delivered insight and guidance that enabled us to significantly improve the performance of the other three codes.

In our experiments using different hardware for address sampling, we observed that not all mechanisms are equally suited to our analysis. Although both IBS and MRK sample instructions, IBS samples all kinds of instructions, so one needs to filter out samples not of interest in software, which adds extra overhead. On the other hand, with IBS it is trivial to compute the load/store fraction of the whole instruction stream to assess the performance impact of memory instructions. In contrast, MRK can only sample instructions causing specific events (such as L3 cache misses or remote accesses). Consequently, MRK can highlight problematic memory instructions with low overhead. DEAR, PEBS and PEBS-LL sample events. PEBS and PEBS-LL can directly sample NUMA events, while DEAR does not support NUMA events. Both instruction sampling and event sampling can effectively identify problematic memory accesses.

Finally, IBS and PEBS-LL can measure latency for sampled load instructions. This information can be used to derive the metrics described in Section 7.3.

Our future work is five-fold. First, we plan to add full support for monitoring stack variables instead of requiring them to be changed to static or heap-allocated ones for detailed measurements. Second, we plan to extend our tool to analyze more complex access patterns. Third, we plan to collect trace-based measurements to study time-varying NUMA patterns in addition to profiles. Fourth, we plan to augment hpcviewer with a new view to better present code- and data-centric measurements.

Finally, our strategy for pinpointing first touches is at present implemented only for heap-allocated variables. We plan to extend it to static variables by protecting their pages when the executable or libraries are loaded, before execution begins.

Chapter 8

Conclusions

This chapter presents conclusions drawn from this dissertation. Section 8.1 summarizes the contributions of this dissertation; Section 8.2 discusses some open problems that are opportunities for future work.

8.1 Summary of Contributions

This dissertation demonstrates that, with lightweight sampling mechanisms, one can build effective and efficient tools that analyze performance losses in programs running on modern parallel architectures with many hardware threads, deep memory hierarchies, and NUMA. By applying lightweight data collection, novel metrics, and new metric attribution, this dissertation addresses the difficulty of identifying performance bottlenecks and providing useful insights to guide optimization.

Lightweight data collection Lightweight data collection is essential for monitoring large-scale parallel programs because it introduces little intrusion into the native execution. This dissertation shows the feasibility of collecting performance samples and their calling contexts with low overhead. On the one hand, I have shown that a variety of sampling techniques, supported by traditional hardware counters, modern PMUs, and software, can collect samples with low overhead. On the other hand, I have shown two methods that can efficiently determine the full calling context of samples: one is call stack unwinding based on dynamic binary analysis; the other is a hybrid method combining call stack unwinding with function call/return monitoring. Lightweight sampling with full calling contexts forms the basic pool of performance data for any further analysis.

Novel metrics This dissertation has shown that new metrics derived from raw data are important for identifying and quantifying performance bottlenecks. First, decomposing timing metrics into work, idleness, and lock wait is useful for understanding the bottlenecks in OpenMP programs, as shown in Chapter 4. These metrics are also helpful for assessing the performance gains due to optimization. Second, Chapters 5, 6 and 7 illustrate the utility of latency metrics. These latency metrics quantify the severity of memory bottlenecks and assess their influence on the whole program's performance degradation. They focus optimization efforts on the memory bottlenecks whose removal can lead to significant improvement.

New metric attribution Attributing metrics is essential for understanding the root causes of performance bottlenecks and providing optimization strategies. This dissertation proposes three novel metric attribution methods: code-centric attribution for root causes, data-centric attribution, and address-centric attribution. First, code-centric attribution associates performance losses in OpenMP programs with their root causes. It identifies the code regions that suffer from load imbalance and excessive synchronization. Second, data-centric attribution associates performance metrics with variables. Combined with code-centric attribution, it can pinpoint problematic data access patterns and suggest data layout transformations. Finally, address-centric attribution is developed for choosing the best data distribution across NUMA domains, one that maximizes local accesses and balances memory requests.

8.2 Open Problems

In Sections 4.4, 5.4, 6.5, and 7.8, I discussed straightforward extensions of the approaches that I described. In this section, I present some higher-level ideas for future work. I plan to extend the approaches described in this dissertation in four ways: generalizing multithreading support for performance analysis beyond OpenMP, improving data-centric profiling to provide additional performance insights, providing high-level feedback to guide optimization, and extending the analysis to performance problems in non-HPC programs.

Performance analysis for multithreading In the future, we plan to apply the knowledge obtained from standardizing OMPT to design standard tool APIs for other multithreading models and for programming models for accelerated computing, such as OpenCL [128] and OpenACC [28].

Data-centric tracing The data-centric analysis described in this dissertation is coarse-grained, at the level of aggregate profiles. Because access patterns may change in different phases of a program, the best data layout may not be the same for the whole program execution. In the future, I plan to add a timestamp to each sampled memory access, known as tracing, and apply dynamic layout transformation to match changes in access patterns.

High-level optimization guidance Our approaches implemented in HPCToolkit currently focus on data collection and attribution. Although raw performance data provide deep insight into a program's performance, they cannot easily guide code optimizations, especially for non-expert users. Therefore, it is important to provide high-level analysis to bridge the gap between raw data and the user's knowledge. PerfExpert [25, 124, 49] and ThreadSpotter [115] are two tools that attempt to address this problem. These tools report high-level optimization suggestions by matching a program's performance data with predefined patterns. In the future, I plan to use data mining and machine learning methods to study more performance patterns, and to incorporate automatic code transformation for optimization into the HPCToolkit framework.

Performance issues beyond HPC programs In the future, I plan to develop new analysis methods for non-HPC programs, such as Big Data analytics programs, server workloads, and mobile applications. These kinds of programs have performance issues beyond the memory subsystem, such as network and disk I/O. Moreover, energy consumption is a key concern, so power profiling is an interesting topic.

Bibliography

[1] Accelerated Strategic Computing Initiative. The ASCI Sweep3D Benchmark

Code, 1995.

[2] L. Adhianto et al. HPCToolkit: Tools for performance analysis of optimized

parallel programs. Concurrency and Computation: Practice and Experience,

22:685–701, 2010.

[3] L. Adhianto, J. Mellor-Crummey, and N. R. Tallent. Effectively presenting call

path profiles of application performance. In Proc. of the 2010 Workshop on

Parallel Software Tools and Tool Infrastructures, held in conjunction with the

2010 International Conference on Parallel Processing,2010.

[4] Advanced Micro Devices. AMD CodeAnalyst performance analyzer.

http://developer.amd.com/tools-and-sdks/heterogeneous-computing/

archived-tools/amd-codeanalyst-performance-analyzer/. Last accessed:

Jan. 6, 2013.

[5] G. Ammons, T. Ball, and J. R. Larus. Exploiting hardware performance coun-

ters with flow and context sensitive profiling. In SIGPLAN Conference on Pro-

gramming Language Design and Implementation, pages 85–96, NY, NY, USA,

1997. ACM.

[6] J. M. Anderson et al. Continuous profiling: Where have all the cycles gone?

ACM TOCS.,15(4):357–390,1997. 155

[7] M. Arnold and P. F. Sweeney. Approximating the calling context tree via

sampling. Technical Report 21789, IBM, 1999.

[8] Association for Computing Machinery. ACM gordon bell prize at SC13. http:

//awards.acm.org/press_releases/bell-2013.pdf, November 2013. Last

accessed: Apr. 16, 2014.

[9] E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel,

P. Unnikrishnan, and G. Zhang. The design of OpenMP tasks. IEEE Trans.

Parallel Distrib. Syst.,20(3):404–418,Mar.2009.

[10] D. H. Bailey, E. Barszcz, et al. The NAS parallel benchmarks – summary and

preliminary results. In Proc. of the 1991 ACM/IEEE conference on Supercom-

puting, Supercomputing ’91, pages 158–165, New York, NY, USA, 1991. ACM.

[11] S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith. Efficient management

of parallelism in object oriented numerical software libraries. In E. Arge, A. M.

Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific

Computing, pages 163–202. Birkhäuser Press, 1997.

[12] T. Ball and J. R. Larus. Efficient path profiling. In Proceedings of the 29th

Annual ACM/IEEE International Symposium on Microarchitecture,MICRO

29, pages 46–57, Washington, DC, USA, 1996. IEEE Computer Society.

[13] E. Berg and E. Hagersten. Fast data-locality profiling of native execution.

SIGMETRICS Perform. Eval. Rev.,33(1):169–180,2005.

[14] A. R. Bernat and B. P. Miller. Incremental call-path profiling. Concurrency

and Computation: Practice and Experience,2006. 156

[15] K. Beyls and E. D'Hollander. Discovery of locality-improving refactorings by

reuse path analysis. Proc. of the 2nd Intl. Conf. on High Performance Comput-

ing and Communications (HPCC),4208:220–229,Sept.2006.

[16] K. Beyls and E. H. D’Hollander. Refactoring for data locality. Computer,

42(2):62 –71, Feb. 2009.

[17] C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton Uni-

versity, January 2011.

[18] M. D. Bond, G. Z. Baker, and S. Z. Guyer. Breadcrumbs: Efficient context

sensitivity for dynamic bug detection analyses. In Proceedings of the 2010 ACM

SIGPLAN Conference on Programming Language Design and Implementation,

PLDI ’10, pages 13–24, New York, NY, USA, 2010. ACM.

[19] M. D. Bond and K. S. McKinley. Probabilistic calling context. In OOPSLA ’07:

Proceedings of the 22nd annual ACM SIGPLAN conference on Object oriented

programming systems and applications, pages 97–112, New York, NY, USA,

2007. ACM.

[20] M. D. Bond, V. Srivastava, K. S. McKinley, and V. Shmatikov. Efficient,

context-sensitive detection of real-world semantic attacks. In Proceedings of

the 5th ACM SIGPLAN Workshop on Programming Languages and Analysis

for Security, PLAS ’10, pages 1:1–1:10, New York, NY, USA, 2010. ACM.

[21] F. Broquedis, N. Furmento, B. Goglin, P.-A. Wacrenier, and R. Namyst. Forest-

GOMP: An efficient OpenMP environment for NUMA architectures. Intl. Jour-

nal of Parallel Programming,38(5-6):418–439,2010. 157

[22] D. Bruening. Efficient, Transparent, and Comprehensive Runtime Code Ma-

nipulation. PhD thesis, Dept. of EECS, MIT, September 2004.

[23] B. Buck and J. K. Hollingsworth. An API for runtime code patching. The

International Journal of High Performance Computing Applications,14(4):317–

329, Winter 2000.

[24] B. R. Buck and J. K. Hollingsworth. Data centric cache measurement on the

Intel ltanium 2 processor. In SC ’04: Proc. of the 2004 ACM/IEEE Conf.

on Supercomputing, page 58, Washington, DC, USA, 2004. IEEE Computer

Society.

[25] M. Burtscher, B.-D. Kim, J. Diamond, J. McCalpin, L. Koesterke, and

J. Browne. Perfexpert: An easy-to-use performance diagnosis tool for HPC ap-

plications. In Proc. of the 2010 ACM/IEEE Intl. Conf. for High Performance

Computing, Networking, Storage and Analysis,SC’10,pages1–11,Washington,

DC, USA, 2010. IEEE Computer Society.

[26] D. R. Butenhof. Programming with POSIX threads. Addison-Wesley Longman

Publishing Co., Inc., Boston, MA, USA, 1997.

[27] C. Caşcaval and D. A. Padua. Estimating cache misses and locality using stack

distances. In Proc. of the 17th annual Intl. conference on Supercomputing,ICS

’03, pages 150–159, New York, NY, USA, 2003. ACM.

[28] CAPS Enterprise, Cray Inc., NVIDIA, and Portland Group. OpenACC, di-

rectives for accelerators. http://www.openacc-standard.org. Last accessed:

May. 08, 2014. 158

[29] M. Chabbi, X. Liu, and J. Mellor-Crummey. Call paths for pin tools. In Pro-

ceedings of Annual IEEE/ACM International Symposium on Code Generation

and Optimization, CGO ’14, pages 76:76–76:86, New York, NY, USA, 2014.

ACM.

[30] M. Chabbi and J. Mellor-Crummey. DeadSpy: A tool to pinpoint program

inefficiencies. In Proceedings of the Tenth International Symposium on Code

Generation and Optimization, CGO ’12, pages 124–134, New York, NY, USA,

2012. ACM.

[31] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron.

Rodinia: A benchmark suite for heterogeneous computing. In Proc. of the

2009 IEEE Intl. Symp. on Workload Characterization (IISWC),pages44–54,

Washington, DC, USA, 2009.

[32] J. H. Chen et al. Terascale direct numerical simulations of turbulent combustion

using S3D. Computational Science & Discovery,2(015001),2009.

[33] I.-H. Chung. IBM high performance toolkit, 2008. https://computing.llnl.

gov/tutorials/IBM.HPC.Toolkit.Chung.pdf.

[34] C. Coarfa, Y. Dotsenko, J. Mellor-Crummey, F. Cantonnet, T. El-Ghazawi,

A. Mohanti, Y. Yao, and Chavarr´ıa-Miranda. An evaluation of Global Address

Space Languages: Co-Array Fortran and Unified Parallel C. In Proceedings

of the 10th ACM SIGPLAN Symposium on Principles and Practice of Parallel

Programming (PPoPP 2005), Chicago, Illinois, 2005.

[35] H.-P. Corporation. Perfmon kernel interface. http://perfmon2.sourceforge.

net/. Last accessed: Dec. 12, 2013.

[36] A. Cox and R. Fowler. The implementation of a coherent memory abstraction

on a NUMA multiprocessor: Experiences with PLATINUM. In Proc. of the

12th ACM Symp. on Operating Systems Principles,SOSP’89,pages32–44,

New York, NY, USA, 1989.

[37] Cray Inc. The Chapel Parallel Programming Language. http://chapel.cray.

com. Last accessed: Apr. 16, 2014.

[38] Cray Inc. Using Cray performance analysis tools, April 2011. Document S-

2376-52, http://docs.cray.com/books/S-2376-52/S-2376-52.pdf.

[39] H. Cui, Q. Yi, J. Xue, L. Wang, Y. Yang, and X. Feng. A highly parallel reuse

distance analysis algorithm on GPUs. In IEEE 26th Intl. Parallel Distributed

Processing Symposium (IPDPS), 2012,pages1080–1092,May2012.

[40] S. R. Das and R. M. Fujimoto. A performance study of the cancelback protocol

for time warp. SIGSIM Simul. Dig., 23(1):135–142, 1993.

[41] M. Dashti et al. Traffic management: a holistic approach to memory placement

on NUMA systems. In Proc. of the 18th Intl. Conf. on Architectural Support for

Programming Languages and Operating Systems, ASPLOS ’13, pages 381–394,

New York, NY, USA, 2013.

[42] J. Dean et al. ProfileMe: Hardware support for instruction-level profiling on

out-of-order processors. In Proc. of the 30th annual ACM/IEEE Intl. Sympo-

sium on Microarchitecture, pages 292–302, Washington, DC, USA, 1997.

[43] L. DeRose, B. Homer, D. Johnson, S. Kaufmann, and H. Poxon. Cray perfor-

mance analysis tools. In Tools for High Performance Computing,pages191–199.

Springer Berlin Heidelberg, 2008. 160

[44] C. Ding and Y. Zhong. Predicting whole-program locality through reuse dis-

tance analysis. In Proc. of the ACM SIGPLAN 2003 conference on Program-

ming Language Design and Implementation, PLDI ’03, pages 245–257, New

York, NY, USA, 2003. ACM.

[45] P. J. Drongowski. Instruction-based sampling: A new performance analy-

sis technique for AMD family 10h processors. http://developer.amd.com/

Assets/AMD_IBS_paper_EN.pdf, November 2007. Last accessed: Dec. 13, 2013.

[46] A. Eichenberger, J. Mellor-Crummey, M. Schulz, N. Copty, J. Cownie, R. Di-

etrich, X. Liu, E. Loh, and D. Lorenz. OpenMP Technical Report 2 on the

OMPT Interface. http://openmp.org/mp-documents/ompt-tr2.pdf.

[47] D. Eklov and E. Hagersten. StatStack: Efficient modeling of LRU caches.

In IEEE Intl. Symp. on Performance Analysis of Systems Software (ISPASS),

pages 55–65, Mar. 2010.

[48] T. El-Ghazawi, W. W. Carlson, and J. M. Draper. UPC specifications. http:

//upc.gwu.edu/documentation.html,2003.

[49] L. Fialho and J. Browne. Framework and modular infrastructure for automation

of architectural adaptation and performance optimization for hpc systems. In

J. Kunkel, T. Ludwig, and H. Meuer, editors, Supercomputing,volume8488

of Lecture Notes in Computer Science,pages261–277.SpringerInternational

Publishing, 2014.

[50] M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5

multithreaded language. In Proc. of the 1998 ACM SIGPLAN Conference on 161

Programming Language Design and Implementation,pages212–223,Montreal,

Quebec, Canada, June 1998.

[51] N. Froyd, J. Mellor-Crummey, and R. Fowler. Low-overhead call path profiling

of unmodified, optimized code. In Proc. of ICS’05, pages 81–90, New York,

NY, USA, 2005. ACM Press.

[52] K. Fürlinger and M. Gerndt. ompP: A profiling tool for OpenMP. In Proc.

of the First and Second International Workshops on OpenMP,pages15–23,

Eugene, Oregon, USA, May 2005. LNCS 4315.

[53] GCC Compiler. http://gcc.gnu.org. Last accessed: Jan. 7, 2013.

[54] J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques for storage

hierarchies. IBM Syst. J., 9(2):78–117, June 1970.

[55] M. Gerndt and M. Ott. Automatic performance analysis with Periscope. Con-

curr. Comput. : Pract. Exper., 22(6):736–748, Apr 2010.

[56] GNU OpenMP implementation GOMP home page.

http://gcc.gnu.org/projects/gomp.

[57] Google Inc. TCMalloc: Thread-Caching Malloc. http://goog-perftools.

sourceforge.net/doc/tcmalloc.html.

[58] S. L. Graham, P. B. Kessler, and M. K. McKusick. Gprof: A call graph exe-

cution profiler. In Proc. of the 1982 SIGPLAN Symposium on Compiler Con-

struction, pages 120–126, New York, NY, USA, 1982. ACM Press.

[59] D. Grove, G. DeFouw, J. Dean, and C. Chambers. Call graph construction in

object-oriented languages. In Proceedings of the 12th ACM SIGPLAN Confer- 162

ence on Object-oriented Programming, Systems, Languages, and Applications,

OOPSLA ’97, pages 108–124, New York, NY, USA, 1997. ACM.

[60] D. Hackenberg, G. Juckeland, and H. Brunst. Performance analysis of multi-

level parallelism: inter-node, intra-node and hardware accelerators. Concurr.

Comput. : Pract. Exper., 24(1):62–72, Jan 2012.

[61] R. J. Hall. Call path profiling. In Proc. of the 14th international conference on

Software engineering, pages 296–306, New York, NY, USA, 1992. ACM Press.

[62] R. J. Hall. Call path refinement profiles. IEEE Trans. Softw. Eng.,21(6):481–

496, 1995.

[63] R. J. Harrison, G. I. Fann, T. Yanai, and G. Beylkin. Multiresolution quan-

tum chemistry in multiwavelet bases. Lecture Notes in Computer Science,

2660/2003:103–110, 2003.

[64] IBM Corp. POWER8 Processor. In Hot Chips: A Symposium on High Perfor-

mance Chips,2013.

[65] IBM Corporation. IBM Visual Performance Analyzer User Guide, version 6.2.

http://bit.ly/ibm-vpa-62. Last accessed: Dec. 12, 2013.

[66] Intel VTune Amplifier XE 2013. http://software.intel.com/en-us/

intel-vtune-amplifier-xe, April 2013. Last accessed: Dec. 12, 2013.

[67] Intel Corporation. Intel VTune performance analyzer. http://www.intel.

com/software/products/vtune.

[68] Intel Corporation. Intel 64 and IA-32 architectures software developer’s manual,

Volume 3B: System programming guide, Part 2, Number 253669-032, June 2010. 163

[69] Intel Corporation. Intel Itanium Processor 9300 series reference manual for

software development and optimization, Number 323602-001, March 2010.

[70] R. B. Irvin and B. P. Miller. Mapping performance data for high-level and data

views of parallel program performance. In Proc. of ICS’96, pages 69–77, New

York, NY, USA, 1996. ACM.

[71] M. Itzkowitz, O. Mazurov, N. Copty, and Y. Lin. An OpenMP runtime API

for profiling. http://www.compunity.org/futures/omp-api.html.

[72] M. Itzkowitz, B. J. N. Wylie, C. Aoki, and N. Kosche. Memory profiling using

hardware counters. In SC ’03: Proc. of the 2003 ACM/IEEE Conf. on Super-

computing, page 17, Washington, DC, USA, 2003. IEEE Computer Society.

[73] H. Jin and R. F. V. der Wijngaart. Performance characteristics of the multi-

zone nas parallel benchmarks. J. Parallel Distrib. Comput.,66(5):674–685,May

2006.

[74] A. Kleen. A NUMA API for Linux. http://developer.amd.com/wordpress/

media/2012/10/LibNUMA-WP-fv1.pdf, 2005. Last accessed: Dec. 12, 2013.

[75] R. Lachaize, B. Lepers, and V. Quéma. MemProf: A memory profiler for

NUMA multicore systems. In Proc. of the 2012 USENIX Annual Technical

Conf., USENIX ATC’12, Berkeley, CA, USA, 2012.

[76] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program

analysis & transformation. In Proc. of the Intl. Symp. on Code Generation and

Optimization, page 75, Washington, DC, USA, 2004. IEEE Computer Society. 164

[77] Lawrence Livermore National Laboratory. Livermore Unstructured Lagrangian

Explicit Shock Hydrodynamics (LULESH). https://codesign.llnl.gov/

lulesh.php. Last accessed: Dec. 12, 2013.

[78] Lawrence Livermore National Laboratory. LLNL Coral Benchmarks. https:

//asc.llnl.gov/CORAL-benchmarks. Last accessed: Dec. 12, 2013.

[79] Lawrence Livermore National Laboratory. LLNL Sequoia Benchmarks. https:

//asc.llnl.gov/sequoia/benchmarks. Last accessed: Dec. 12, 2013.

[80] Lawrence Livermore National Laboratory. LLNL Sequoia Supercomputer.

https://asc.llnl.gov/computing_resources/sequoia.

[81] A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: A

case study. Computer,27(10):15–26,1994.

[82] J. Levon et al. OProfile. http://oprofile.sourceforge.net.

[83] J. Li, Z. Wang, C. Wu, W.-C. Hsu, and D. Xu. Dynamic and adaptive calling

context encoding. In Proceedings of Annual IEEE/ACM International Sympo-

sium on Code Generation and Optimization,CGO’14,pages120:120–120:131,

New York, NY, USA, 2014. ACM.

[84] Linux man page. numactl(8). http://linux.die.net/man/8/numactl, 2014.

Last accessed: Dec. 12, 2013.

[85] X. Liu and J. Mellor-Crummey. Pinpointing data locality problems using data-

centric analysis. In Proc. of the 9th IEEE/ACM Intl. Symp. on Code Generation

and Optimization,pages171–180,Washington,DC,2011. 165

[86] X. Liu and J. Mellor-Crummey. A tool to analyze the performance of multi-

threaded programs on NUMA architectures. In Proceedings of the 19th ACM

SIGPLAN Symposium on Principles and Practice of Parallel Programming,

PPoPP ’14, pages 259–272, New York, NY, USA, 2014. ACM.

[87] X. Liu and J. Mellor-Crummey. Pinpointing data locality bottlenecks with low

overheads. In Proc. of the 2013 IEEE Intl. Symp. on Performance Analysis of

Systems and Software, Austin, TX, USA, April 21-23, 2013.

[88] X. Liu, J. Mellor-Crummey, and M. Fagan. A new approach for performance

analysis of OpenMP programs. In Proceedings of the 27th International ACM

Conference on International Conference on Supercomputing,ICS’13,pages69–

80, Eugene, Oregon, USA, 2013. ACM.

[89] X. Liu and J. M. Mellor-Crummey. A data-centric profiler for parallel programs.

In Proc. of the 2013 ACM/IEEE Conference on Supercomputing,Denver,CO,

USA, 2013.

[90] LLVM Compiler Infrastructure. http://www.llvm.org. Last accessed: Jan. 7,

2013.

[91] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace,

V. J. Reddi, and K. Hazelwood. Pin: Building customized program analysis

tools with dynamic instrumentation. In Proc. of the 2005 ACM SIGPLAN

conference on Programming Language Design and Implementation,pages190–

200, New York, NY, USA, 2005. ACM Press.

[92] Z. Majo and T. R. Gross. Matching memory access patterns and data placement

for NUMA systems. In Proc. of the 10th IEEE/ACM Intl. Symp. on Code 166

Generation and Optimization, pages 230–241, New York, NY, USA, 2012.

[93] G. Marin, J. Dongarra, and D. Terpstra. MIAMI: A framework for application

performance diagnosis. In Proc. of the 2014 IEEE Intl. Symp. on Performance

Analysis of Systems and Software, Monterey, CA, USA, March 23-25, 2013.

[94] G. Marin and J. Mellor-Crummey. Pinpointing and exploiting opportunities

for enhancing data reuse. In IEEE Intl. Symposium on Performance Analysis

of Systems and Software, ISPASS ’08, pages 115–126, Washington, DC, USA,

2008. IEEE Computer Society.

[95] M. Martonosi, A. Gupta, and T. Anderson. MemSpy: Analyzing memory sys-

tem bottlenecks in programs. In ACM SIGMETRICS and PERFORMANCE

’92 International Conference on Measurement and Modeling of Computer Sys-

tems, pages 1–12, June 1992.

[96] C. McCurdy and J. S. Vetter. Memphis: Finding and fixing NUMA-related

performance problems on multi-core platforms. In Proc. of 2010 IEEE Intl.

Symp. on Performance Analysis of Systems Software,pages87–96,Mar.2010.

[97] Message Passing Interface Forum. MPI-2: Extensions to the Message Passing

Interface Standard,1997.

[98] Message Passing Interface Forum. MPI: A Message Passing Interface Standard,

1999.

[99] D. Mey, S. Biersdorf, C. Bischof, K. Diethelm, D. Eschweiler, M. Gerndt,

A. Knüpfer, D. Lorenz, A. Malony, W. Nagel, Y. Oleynik, C. Rössel, P. Saviankou,

D. Schmidl, S. Shende, M. Wagner, B. Wesarg, and F. Wolf. Score-P: A uni-

fied performance measurement system for petascale applications. In C. Bischof, 167

H.-G. Hegering, W. E. Nagel, and G. Wittum, editors, Competence in High

Performance Computing 2010,pages85–97.SpringerBerlinHeidelberg,2012.

[100] B. P. Miller et al. The Paradyn parallel performance measurement tool. IEEE

Computer,28(11):37–46,1995.

[101] B. Mohr, A. D. Malony, H.-C. Hoppe, F. Schlimbach, G. Haab, J. Hoeflinger,

and S. Shah. A performance monitoring interface for OpenMP. In Proceedings

of the Fourth European Workshop on OpenMP,Rome,Italy,2002.

[102] B. Mohr, A. D. Malony, S. Shende, and F. Wolf. Design and prototype of

a performance tool interface for OpenMP. The Journal of Supercomputing,

23:105–128, 2002.

[103] A. Morris, A. D. Malony, S. Shende, and K. Huck. Design and implementation

of a hybrid parallel performance measurement system. In Proceedings of the

2010 39th International Conference on Parallel Processing,ICPP’10,pages

492–501, Washington, DC, USA, 2010. IEEE Computer Society.

[104] D. Mosberger-Tang. libunwind. http://www.nongnu.org/libunwind.

[105] T. Moseley, A. Shye, V. J. Reddi, D. Grunwald, and R. Peri. Shadow profiling:

Hiding instrumentation costs with parallelism. In Proc. of the Intl. Symposium

on Code Generation and Optimization,CGO’07,pages198–208,Washington,

DC, USA, 2007. IEEE Computer Society.

[106] T. Mytkowicz, D. Coughlin, and A. Diwan. Inferred call path profiling. In

Proceedings of the 24th ACM SIGPLAN Conference on Object Oriented Pro-

gramming Systems Languages and Applications,OOPSLA’09,pages175–190,

New York, NY, USA, 2009. ACM. 168

[107] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic

binary instrumentation. In Proc. of the 2007 ACM SIGPLAN Conf. on Pro-

gramming Language Design and Implementation, PLDI ’07, pages 89–100, June

2007.

[108] Q. Niu, J. Dinan, Q. Lu, and P. Sadayappan. Parda: A fast parallel reuse

distance analysis algorithm. In IEEE 26th Intl.Parallel Distributed Processing

Symposium (IPDPS),pages1284–1294,May2012.

[109] Oak Ridge National Laboratory. Titan Supercomputer. https://www.olcf.

ornl.gov/titan.

[110] Office of Science, U.S. Department of Energy. Scientific discovery through

advanced computing. http://www.scidac.gov/SciDAC.pdf, March 2000. Last

accessed: Apr. 16, 2014.

[111] OpenMP Architecture Review Board. OpenMP application program interface,

version 3.0. http://www.openmp.org/mp-documents/spec30.pdf,May2008.

[112] Oracle. Oracle Solaris Studio. http://www.oracle.com/technetwork/

server-storage/solarisstudio/overview/index.html.

[113] A. Rane and J. Browne. Enhancing performance optimization of multi-

core/multichip nodes with data structure metrics. ACM Trans. Parallel Com-

put.,1(1):3:1–3:20,May2014.

[114] J. Reinders. Intel Threading Building Blocks. O’Reilly, Sebastopol, CA, 2007.

[115] Rogue Wave Software. ThreadSpotter manual, version 2012.1. http://www.

roguewave.com/documents.aspx?Command=Core_Download&EntryId=1492, 169

August 2012. Last accessed: Dec. 12, 2013.

[116] N. Rutar and J. K. Hollingsworth. Assigning blame: Mapping performance to

high level parallel programming abstractions. In Proc. of the 15th Intl. Euro-Par

Conf. on Parallel Processing, pages 21–32, Berlin, Heidelberg, 2009. Springer-

Verlag.

[117] D. L. Schuff, M. Kulkarni, and V. S. Pai. Accelerating multicore reuse distance

analysis with sampling and parallelization. In Proc. of PACT’10,PACT’10,

pages 53–64, New York, NY, USA, 2010. ACM.

[118] M. Schulz, J. Galarowicz, D. Maghrak, W. Hachfeld, D. Montoya, and S. Cran-

ford. Openspeedshop: An open source infrastructure for parallel performance

analysis. Sci. Program., 16(2-3):105–121, Apr. 2008.

[119] A. Sembrant, D. Black-Schaffer, and E. Hagersten. Phase guided profiling for

fast cache modeling. In Proceedings of the Tenth International Symposium on

Code Generation and Optimization, CGO ’12, pages 175–185, New York, NY,

USA, 2012. ACM.

[120] X. Shen, J. Shaw, B. Meeker, and C. Ding. Locality approximation using time.

In Proc. of the 34th annual ACM SIGPLAN-SIGACT Symposium on Principles

Of Programming Languages, POPL ’07, pages 55–61, New York, NY, USA,

2007. ACM.

[121] S. Shende and A. D. Malony. The TAU parallel performance system. Interna-

tional Journal of High Performance Computing Applications, ACTS Collection

Special Issue,2005. 170

[122] S. Boyd-Wickizer et al. Corey: An operating system for many cores. In Proc. of

the 8th USENIX conference on Operating Systems Design and Implementation,

pages 43–57, Berkeley, CA, USA, 2008.

[123] B. Sinharoy et al. IBM POWER7 multicore server processor. IBM JRD,

55(3):1:1–29, May 2011.

[124] O. Sopeju, M. Burtscher, A. Rane, and J. Browne. AutoSCOPE: Automatic

Suggestions for Code Optimizations using PerfExpert. In 2011 International

Conference on Parallel and Distributed Processing Techniques and Applications,

pages 19–25, July 2011.

[125] B. Sprunt. Pentium 4 performance-monitoring features. IEEE Micro,22(4):72–

82, 2002.

[126] M. Srinivas et al. IBM POWER7 performance modeling, verification, and eval-

uation. IBM JRD, 55(3):4:1–19, May/June 2011.

[127] V. Srivastava, M. D. Bond, K. S. McKinley, and V. Shmatikov. A security

policy oracle: Detecting security holes using multiple API implementations. In

Proceedings of the 32Nd ACM SIGPLAN Conference on Programming Language

Design and Implementation, PLDI ’11, pages 343–354, New York, NY, USA,

2011. ACM.

[128] J. E. Stone, D. Gohara, and G. Shi. Opencl: A parallel programming standard

for heterogeneous computing systems. IEEE Des. Test,12(3):66–73,May2010.

[129] W. N. Sumner, Y. Zheng, D. Weeratunge, and X. Zhang. Precise calling context

encoding. In Proceedings of the 32Nd ACM/IEEE International Conference on 171

Software Engineering - Volume 1, ICSE ’10, pages 525–534, New York, NY,

USA, 2010. ACM.

[130] N. R. Tallent, L. Adhianto, and J. M. Mellor-Crummey. Scalable identification

of load imbalance in parallel executions using call path profiles. In SC 2010 In-

ternational Conference for High-Performance Computing, Networking, Storage

and Analysis, New York, NY, USA, November 2010. ACM.

[131] N. R. Tallent and D. Kerbyson. Data-centric performance analysis of PGAS

applications. In Proc. of the Second Intl. Workshop on High-performance Infras-

tructure for Scalable Tools (WHIST),SanServoloIsland,Venice,Italy,2012.

[132] N. R. Tallent and J. Mellor-Crummey. Effective performance measurement and

analysis of multithreaded applications. In Proc. of the 14th ACM SIGPLAN

Symposium on Principles and Practice of Parallel Programming,pages229–240,

New York, NY, USA, 2009. ACM.

[133] N. R. Tallent, J. Mellor-Crummey, and M. W. Fagan. Binary analysis for

measurement and attribution of program performance. In Proc. of the 2009

ACM PLDI, pages 441–452, NY, NY, USA, 2009. ACM.

[134] N. R. Tallent, J. Mellor-Crummey, M. Franco, R. Landrum, and L. Adhianto.

Scalable fine-grained call path tracing. In Proceedings of the international con-

ference on Supercomputing, ICS ’11, pages 63–74, New York, NY, USA, 2011.

ACM.

[135] N. R. Tallent, J. Mellor-Crummey, and A. Porterfield. Analyzing lock con-

tention in multithreaded applications. In Proc. of the 15th ACM SIGPLAN

Symposium on Principles and Practice of Parallel Programming,2010. 172

[136] Texas Advanced Computing Center. Stampede Supercomputer. https://www.

tacc.utexas.edu/stampede.

[137] The Portland Group. PGPROF profiler guide parallel profiling for scientists

and engineers. http://www.pgroup.com/doc/pgprofug.pdf,2011.

[138] R. van der Pas. OpenMP support in Sun Studio. https://iwomp.zih.

tu-dresden.de/downloads/3.OpenMP_Sun_Studio.pdf, 2009.

[139] V. Weaver. The unofficial Linux Perf Events web-page. http://web.eece.

maine.edu/~vweaver/projects/perf_events. Last accessed: Dec. 12, 2013.

[140] F. Wolf, B. J. N. Wylie, E. Ábrahám, D. Becker, W. Frings, K. Fürlinger,

M. Geimer, M.-A. Hermanns, B. Mohr, S. Moore, M. Pfeifer, and Z. Szebenyi.

Usage of the SCALASCA toolset for scalable performance analysis of large-

scale parallel applications. In Tools for High Performance Computing,pages

157–167. Springer Berlin Heidelberg, 2008.

[141] R. Yang et al. Profiling directed NUMA optimization on Linux systems: A case

study of the Gaussian computational chemistry code. In Proc. of the 2011 IEEE

Intl. Parallel & Distributed Processing Symposium,pages1046–1057,2011.

[142] Q. Zeng, J. Rhee, H. Zhang, N. Arora, G. Jiang, and P. Liu. Deltapath: Precise

and scalable calling context encoding. In Proceedings of Annual IEEE/ACM In-

ternational Symposium on Code Generation and Optimization, CGO ’14, pages

109:109–109:119, New York, NY, USA, 2014. ACM.

[143] Y. Zhong and W. Chang. Sampling-based program locality approximation. In

Proc. of the 7th Intl. Symposium on Memory Management,ISMM’08,pages

91–100, New York, NY, USA, 2008. ACM. 173

[144] X. Zhuang, M. J. Serrano, H. W. Cain, and J.-D. Choi. Accurate, efficient,

and adaptive calling context profiling. In Proc. of the 2006 ACM SIGPLAN

conference on Programming language design and implementation,pages263–

271, New York, NY, USA, 2006. ACM.