CHABBI-DOCUMENT-2015.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
ABSTRACT Software Support for Efficient Use of Modern Computer Architectures by Milind Chabbi Parallelism is ubiquitous in modern computer architectures. Heterogeneity of CPU cores and deep memory hierarchies make modern architectures difficult to program efficiently. Achieving top performance on supercomputers is difficult due to complex hardware, software, and their interactions. Production software systems fail to achieve top performance on modern architectures broadly due to three main causes: resource idleness, parallel overhead,anddata movement overhead.Thisdissertationpresentsnovelande↵ectiveperformance analysis tools, adap- tive runtime systems,andarchitecture-aware algorithms to understand and address these problems. Many future high performance systems will employ traditional multicore CPUs aug- mented with accelerators such as GPUs. One of the biggest concerns for accelerated systems is how to make best use of both CPU and GPU resources. Resource idleness arises in a parallel program due to insufficient parallelism and load imbalance, among other causes. To assess systemic resource idleness arising in GPU-accelerated architectures, we developed efficient profiling and tracing capabilities. We introduce CPU-GPU blame shifting—a novel technique to pinpoint and quantify the causes of resource idleness in GPU-accelerated archi- tectures. Parallel overheads arise due to synchronization constructs such as barriers and locks used in parallel programs. We developed a new technique to identify and eliminate redundant barriers at runtime in Partitioned Global Address Space programs. In addition, we developed a set of novel mutual exclusion algorithms that exploit locality in the memory hierarchy to improve performance on Non-Uniform Memory Access architectures. In modern architectures, inefficient or unnecessary memory accesses can severely degrade program performance. To pinpoint and quantify wasteful memory operations, we developed a fine-grain execution-monitoring framework. We extended this framework and demonstrated the feasibility of attributing fine-grain execution metrics to source and data in their contexts for long running programs—a task previously thought to be infeasible. The solutions described in this dissertation were employed to gain insights into the per- formance of a collection of important programs—both parallel and serial. The insights we gained enabled us to improve the performance of several important programs by a signifi- cant margin. Software for future systems will benefit from the techniques described in this dissertation. Acknowledgments I wish to thank my thesis committee: Prof. John Mellor-Crummey, Prof. Vivek Sarkar, Prof. Peter Varman, and Dr. Costin Iancu. It is exceptionally difficult to acknowledge the help of my advisor Prof. John Mellor- Crummey, which has extended beyond the time I spent at Rice University and beyond the academic realm. Prof. Mellor-Crummey’s paper about the MCS lock, which I read in 2006 during a course under Prof. Greg Andrews at The University of Arizona, sparked my interest in shared-memory synchronization. It was the same year the paper won the prestigious Edsger W. Dijkstra Prize in Distributed Computing. I had my first interaction with Prof. Mellor-Crummey in the Fall of 2008, when I emailed him with my interest in his work. Prof. Mellor-Crummey’s prompt reply and the ongoing PACE project encouraged me to consider Rice seriously. Although I was admitted to Rice in 2009, I could not join that year due to immigration complications. Prof. Mellor-Crummey was patient to wait until I joined Rice in the Fall of 2010. In the months and years that followed, I would only see Prof. Mellor-Crummey’s abilities with awe: how he pulled a reference to a related work in matter of seconds and how he remembered author’s names, paper titles, and processor architectures. Early years of the graduate studies were filled with many “fake eureka” moments, which IwouldsharewithProf.Mellor-Crummey,hewouldpatientlylistentomy“ideas”ofrein- venting the wheel, and respectfully point me to a work in 1990’s that had already solved it, in a better way on many occasions. As the time progressed, some seeds of ideas grew and started showing promise. Over the years Prof. Mellor-Crummey has helped me set higher standards in my research goals and impact. He helped me turn my half-baked ideas into concrete solutions. He would ask me the right set of questions, in search of answers through which I would learn a great deal about my own work, improve my ideas, and know more about the state-of-the-art. In stark contrast with my experience in industry, where doing anything outside the defined boundaries was a taboo, working under Prof. Mellor-Crummey, v there is no research question that is not worth pursuing. I am fortunate to have explored compilers, heterogenous architectures, high performance computing, binary analysis, per- formance analysis, data race detection, model checking, synchronization, computer security, and several other aspects in my graduate career. Having started my graduate career with inspiration from the MCS lock, it was my desire to contribute in that direction. We came close to doing something related to locks multiple times during the last three years: binary analysis for identifying mutual exclusion on one occasion and building efficient reader-writer locks on two occasions. However, we had to give up both because it was too hard to do or already been done very recently. The insight came when Prof. Mellor-Crummey and I were looking at lock contention in NWChem. Prof. Mellor-Crummey came up with the idea of batching requests from the same node when there were local contenders; I was already aware of lock cohorting. After that we both envisioned an MCS lock with multiple levels of hierarchy to take advantage of NUMA machines. Subse- quently, we devised the HMCS lock, showed its superiority both theoretically and empirically. This work has only improved over time with new design for an adaptive HMCS lock and more. I am indebted to Prof. Mellor-Crummey for always holding the bar higher for me to reach newer heights. Beyond research, Prof. Mellor-Crummey has helped me in my presentation skills, con- nected me with several researchers, and helped plan my vacations too. IwouldliketothanktheHPCToolkit research sta↵: Laksono Adhianto, Mark Krentel, and Mike Fagan. Without their constant support this work would not have been possible. IwouldliketocalloutMikeFaganforcollaboratinginmanyofmyprojectsandbeing extremely helpful in paper writing and solving both computer science and mathematical problems. I want to thank my collaborators Costin Iancu, Wim Lavrijsen, and Wibe de Jong from Lawrence Berkeley National Laboratory. I am indebted to Costin for his warmth, connecting me with people, and helping with my job search. I would like to thank Dr. Tracy Volz at Rice University for training me on many occasions for my presentations and providing me with the opportunity to train other students at Rice. This acknowledgement will be incomplete without mentioning the great support form vi the administrative sta↵of the department of computer science at Rice University. I would like to particularly thank Belia Martinez for her warmth and friendliness, which make her the friend of all graduate students. Belia also deserves accolades for helping me with my immigration process. This acknowledgement will be incomplete without mentioning the constant feedback that fellow graduate students provided in my talks, research work, and publications, including this dissertation. Reshmy, Rajesh, Aarthi, Shruti, Priyanka, Deepak, Karthik, Rahul, Gaurav, and Rajoshi, who became my Rice family had a positive role to play. They made my years in Houston at Rice memorable and fun filled. I would like to thank Girish for being a close friend in Houston. Few people would shamelessly eat dinner at a friend’s place as many times as I have eaten in Girish’s house. Alargeroundofthankstomyfather,mother,andbrother.Theyhavestoodwithme through my good and bad times. They have encouraged me to accomplish my dreams, shown confidence in my aspirations, and provided moral, spiritual, and financial support at all times. I would like to thank my wife for her unconditional support in my endeavors. I wish to thank the funding agencies that supported my work: Defense Advanced Research Projects Agency (DARPA), U.S. Department of Energy (DOE), Texas Instruments, Ken Kennedy Institute for Information Technology, and Lawrence Berkeley National Laboratory. I am thankful to the world class computational facilities provided by XSEDE, U.S. DOE laboratories, and Rice University’s Research Computing Support Group, which made this research possible. Specifically, this work is funded in part by the Defense Advanced Research Projects Agency through AFRL Contract FA8650-09-C-7915, DOE Office of Science under Cooper- ative Agreements DE-FC02-07ER25800, DE-FC02-13ER26144, DE-FC02-12ER26119, DE- SC0008699, and DE-SC0010473, Sandia National Laboratory purchase order 1293383, and Lawrence Berkeley National Laboratory subcontract 7106218. This research used the resources provided by the National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory supported by the DOE Office of Science under Contract No. DE-AC05-00OR22725, Keeneland Computing Facility at the Georgia vii Institute of Technology supported by the National Science Foundation under Contract OCI- 0910735, Data Analysis and Visualization Cyberinfrastructure funded by NSF under grant OCI-0959097, Blacklight