
RICE UNIVERSITY

Performance Analysis for Parallel Programs: From Multicore to Petascale

by

Nathan Russell Tallent

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE Doctor of Philosophy

APPROVED, THESIS COMMITTEE:

John Mellor-Crummey (Chair), Professor of Computer Science and Electrical & Computer Engineering

Vivek Sarkar, E.D. Butcher Professor of Computer Science and Electrical & Computer Engineering

Peter Varman, Professor of Electrical & Computer Engineering and Computer Science

Robert Fowler, Chief Domain Scientist, High Performance Computing, Renaissance Computing Institute

HOUSTON, TEXAS
MARCH 2010

UMI Number: 3421196. Copyright 2010 by ProQuest LLC. All rights reserved.

Abstract

Performance Analysis for Parallel Programs: From Multicore to Petascale

by Nathan Russell Tallent

Cutting-edge science and engineering applications require petascale computing. Petascale computing platforms are characterized by both extreme parallelism (systems of hundreds of thousands to millions of cores) and hybrid parallelism (nodes with multicore chips). Consequently, to effectively use petascale resources, applications must exploit concurrency at both the node and system level — a difficult problem. The challenge of developing scalable petascale applications is only partially aided by existing languages and compilers.
As a result, manual performance tuning is often necessary to identify and resolve poor parallel and serial efficiency. Our thesis is that it is possible to achieve unique, accurate, and actionable insight into the performance of fully optimized parallel programs by measuring them with asynchronous-sampling-based call path profiles; attributing the resulting binary-level measurements to source code structure; analyzing measurements on-the-fly and postmortem to highlight performance inefficiencies; and presenting the resulting context-sensitive metrics in three complementary views. To support this thesis, we have developed several techniques for identifying performance problems in fully optimized serial, multithreaded and petascale programs. First, we describe how to attribute very precise (instruction-level) measurements to source-level static and dynamic contexts in fully optimized applications — all for an average run-time overhead of a few percent. We then generalize this work with the development of logical call path profiling and apply it to work-stealing-based applications. Second, we describe techniques for pinpointing and quantifying parallel inefficiencies such as parallel idleness, parallel overhead and lock contention in multithreaded executions. Third, we show how to diagnose scalability bottlenecks in petascale applications by scaling our measurement, analysis and presentation tools to support large-scale executions. Finally, we provide a coherent framework for these techniques by sketching a unique and comprehensive performance analysis methodology. This work forms the basis of Rice University's HPCTOOLKIT performance tools.

Acknowledgments

This dissertation represents more than just my past few years of Computer Science graduate study. Seemingly by accident, I became involved in the early stages of the HPCTOOLKIT performance tools project (née HPCView), inaugurated by John Mellor-Crummey.
Consequently, before beginning any work toward this dissertation, I had helped build most of what became the proto-HPCTOOLKIT. Nevertheless, I must highlight this dissertation's profound debt to others. The most generous share of credit goes to my advisor, John Mellor-Crummey, whose guidance and insight inform and infuse this work. I must also acknowledge several additional collaborators (in alphabetical order):

• Laksono Adhianto, who is the primary implementer of HPCTOOLKIT's presentation tool, hpcviewer.

• Mike Fagan, who contributed to Chapter 3's on-the-fly binary analysis for unwinding call stacks and whose continual questions uncover weaknesses in our thinking.

• Mark Krentel, whose efforts and commitment to correctness have vastly improved HPCTOOLKIT's ability to dynamically and statically monitor processes and threads.

• Allan Porterfield, who helped develop Chapter 6's blame shifting.

Additionally, I am grateful to (in chapter order):

• Chapter 3: Mark Charney and Robert Cohn of Intel, who assisted with XED2 [38].

• Chapter 6: Robert Fowler, for focusing our attention on MADNESS; Robert Harrison, for helping us with his MADNESS code; and William Scherer, for reminding us of Bacon's prior work on dual-representation locks and pointing out the similarity to STM contention managers.

• Chapter 7: Anshu Dubey and Chris Daley of the FLASH team; and Peter Lichtner, Glenn Hammond and other members of the PFLOTRAN team. Both teams graciously provided us with a copy of their respective code, configuration advice, and a test problem of interest.

Finally, I would like to acknowledge Robert Fowler, who was deeply involved with HPCTOOLKIT while at Rice; Gabriel Marin, who was part of the original HPCTOOLKIT team; Nathan Froyd, who worked on an early version of what is now HPCTOOLKIT's measurement tool; and Cristian Coarfa, who first explored the scalability analysis technique used in Chapter 7.
Development of the HPCTOOLKIT performance tools would not have been possible without

• support from the Department of Energy's Office of Science under cooperative agreements DE-FC02-07ER25800 and DE-FC02-06ER25762;

• equipment purchased in part with funds from NSF Grant CNS-0421109;

• resources at the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357;

• resources at the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

* * *

While academic supervision and financial support are necessary for dissertation research, they are not sufficient. To my parents, who lived like sojourners for their children; and to my grandfather Jack, who wanted to see this day: this dissertation is dedicated to you. To my wife, two sons and baby: we let the wind sweep away the world's wisdom and, despite a shoestring budget and some competition between midnight baby sitting and midnight paper writing, have been the happier for it.

And finally, would science be possible without a starting point? For all knowledge proceeds from faith of whatever kind. You lean on God, you proceed from your own ego, or you hold fast to your ideal. The person who does not believe does not exist. At the very least, one who had nothing standing immediately firm before him could not find a point for his thinking to even begin. And how could someone whose thinking lacked a starting point ever investigate something scientifically?

Abraham Kuyper, October 20, 1880. [24, p. 486]

Contents

1 Introduction
2 A Methodology for Performance Analysis
  2.1 Introduction
  2.2 Principles of Performance Analysis
  2.3 From Principles to Practical Methods
    2.3.1 Measurement
    2.3.2 Attribution
    2.3.3 Analysis
    2.3.4 Presentation
  2.4 Related Work
  2.5 Discussion
3 Measurement & Attribution: Fully Optimized Applications
  3.1 Introduction
  3.2 Binary Analysis for Call Path Profiling
    3.2.1 Inferring Procedure Bounds
    3.2.2 Computing Unwind Recipes
    3.2.3 Evaluation
  3.3 Binary Analysis for Source-Level Attribution
    3.3.1 Recovering the Procedure Hierarchy
    3.3.2 Recovering Alien Contexts
    3.3.3 Recovering Loop Nests
    3.3.4 Normalization
    3.3.5 Summary
  3.4 Putting It All Together
    3.4.1 MOAB
    3.4.2 S3D
  3.5 Related Work
  3.6 Discussion
4 Measurement & Attribution: Logical Call Path Profiling
  4.1 Introduction
  4.2 The Challenges of Work Stealing
  4.3 Logical Call Path Profiles
    4.3.1 Logical Call Paths
    4.3.2 Representing Logical Call Path Profiles
  4.4 Obtaining Logical Call Path Profiles
    4.4.1 Logical Stack Unwinding
    4.4.2 Thread Creation Contexts
    4.4.3 An API for Logical Unwinding
  4.5 Logical Call Path Profiles of Cilk Executions
  4.6 Related Work
  4.7 Discussion
5 Analysis of Multithreaded Executions: Work Stealing
  5.1 Introduction
  5.2 Pinpointing Parallel Bottlenecks
    5.2.1 Quantifying Insufficient Parallelism
    5.2.2 Quantifying Parallelization Overhead
    5.2.3 Analyzing Efficiency
  5.3 Measurement and Analysis of Cilk Executions
    5.3.1 Parallel Work and Idleness
    5.3.2 Parallel Overhead
    5.3.3 Case Study
  5.4 Related Work
  5.5 Discussion
6 Analysis of Multithreaded Executions: Lock Contention
  6.1 Introduction
  6.2 Attributing Idleness to its Calling Context
    6.2.1 A Straightforward Strategy
    6.2.2 Blocking (Sleep-waiting)
    6.2.3 Evaluation
  6.3 Blaming Idleness on Lock-holders
    6.3.1 Extending a Prior Strategy
    6.3.2 Making It Practical
    6.3.3 Evaluation
  6.4 Communicating