Efficient Run-time Support For Global View Programming of Linked Data on Distributed Memory Parallel Systems

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Darrell Brian Larkins, M.S.

Department of Computer Science and Engineering

The Ohio State University

2010

Dissertation Committee:

P. Sadayappan, Advisor
Atanas Rountev
Paul A. G. Sivilotti

© Copyright by

Darrell Brian Larkins

2010

ABSTRACT

Developing high-performance parallel applications that use linked data structures on distributed-memory clusters is challenging. Many scientific applications use algorithms based on linked data structures like trees and graphs. These structures are especially useful in representing relationships between data which may not be known until runtime or may otherwise evolve during the course of a computation. Methods such as n-body simulation, Fast Multipole Methods (FMM), and multiresolution analysis all use trees to represent a fixed space populated by a dynamic distribution of elements. Other problem domains, such as data mining, use both trees and graphs to summarize large input datasets into a set of relationships that capture the information in a form that lends itself to efficient mining.

This dissertation first describes a runtime system that provides a programming interface to a global address space representation of generalized distributed linked data structures, while providing scalable performance on distributed memory computing systems. This system, the Global Chunk Layer (GCL), provides data access primitives at the element level, but takes advantage of coarse-grained data movement to enhance locality and improve communication efficiency. The key benefits of using the GCL system include efficient shared-memory style programming of distributed dynamic, linked data structures, the abstraction and optimization of structural elements common to linked data, and the ability to customize many aspects of the runtime to tune application performance.

Additionally, this dissertation presents the design and implementation of a tree-specific system for efficient parallel global address space computing. The Global Trees (GT) library provides a global view of distributed linked tree structures and a set of routines that operate on these structures. GT is built on top of the generalized data support provided by the GCL runtime and can inter-operate with other parallel programming models such as MPI, as well as with existing global view approaches such as Global Arrays. This approach is based on two key insights: First, tree-based algorithms are easily expressed in a fine-grained manner, but data movement must be done at a much coarser level of granularity for good performance. Second, since GT is focused on a single data abstraction, attributes unique to tree structures can be exploited to provide optimized routines for common operations.

This dissertation also describes techniques for improving program performance and program understanding using these frameworks. Data locality has a significant impact on the communication properties of parallel algorithms. Techniques are presented that use profile-driven data reference traces to perform two types of data layout in Global Trees programs. Lastly, approaches for understanding and analyzing program performance data are presented along with tools for visualizing GT structures. These tools enable GT developers to better understand and optimize program performance.

To my wonderful wife, Lynette, who has fulfilled all my dreams.

To Rachel and Emily, two beautiful dreams that are still there when I wake.

ACKNOWLEDGMENTS

I am deeply grateful for the opportunity to study at The Ohio State University.

Without the support and care of many people that I’ve met throughout my life, this endeavor wouldn’t have been possible.

First and foremost, I thank my advisor, P. Sadayappan. He has supported my growth and development throughout this entire process. He has been a mentor and role model, and I have learned from his example as an excellent researcher and teacher.

Other faculty at Ohio State have been influential in the way that I have come to think and reason about challenges that I face. Nasko Rountev has helped me to understand the discipline and process of research. Paul Sivilotti has forged many neural pathways regarding formal reasoning. Srini Parthasarathy offered valuable insights about global view distributed programming. I especially thank Gerald Baumgartner, who helped repair the damage done by my undergraduate time at Ohio State.

I also wish to thank my parents, Darrell and Shirley, for encouraging my inquisitiveness and putting up with all my shenanigans. The encouragement of my sister, Cara, and my rabid Buckeye fan and brother-in-law, Joe, has also meant a great deal to me.

Having spent a decade in industry before returning to academia, there are many people whose nudging and listening have helped with this accomplishment. Thanks to Joe Judge and Gary Ellison at AT&T Bell Laboratories, for seeing a diamond in the rough. Jim Hoburg, for many years of intense conversations and for showing me that critical thinking can be a lifestyle and not just a skill. I am grateful for all of the good friends and colleagues that I have met along the way: Chad Maue, Matt Curtin, Larry Ogrodnek, Kyri Sarantakos, Seth Robertson, James Tanis, Alex Dupuy, Brian Lindauer, Matt Miller and Oliver Stockhammer. Special thanks to Jen Yates, Scott Alexander and Martin Jansche for sharing their insights and experience with the doctoral process.

It’s difficult to imagine surviving this experience without the friends that I have made while in school. Matt Lang, your thoughts and humor have helped me preserve some sanity. Josh Levine and Jim Dinan, much thanks to each of you for both contributing to and distracting me from my progress with research. I would also like to thank my friends on the outside, Rick Martin, Melanie Fuller, and the many, many others who have aided me in keeping it all together for the last six years.

Lastly, to my dearest wife, Lynette, who has sacrificed much to permit me to pursue my dream. Without you, I would be incomplete; I love you. To my children, Rachel and Emily, who can’t read yet: I love you too and hope that one day you will understand why daddy spent so much time pushing buttons and looking at a glass screen with letters on it.

D. Brian Larkins
Columbus, Ohio
June 11, 2010

VITA

September 13, 1971 ...... Born – Cincinnati, OH

1996 ...... B.S. Computer Science, The Ohio State University

2008 ...... M.S. Computer Science, The Ohio State University

2005 — Present ...... Graduate Research Associate, The Ohio State University

Research Publications

James Dinan, Sriram Krishnamoorthy, D. Brian Larkins, Jarek Nieplocha, and P. Sadayappan. “Scalable Work Stealing”. ACM/IEEE Conference on High Performance Computing (SC ’09). November 2009.

D. Brian Larkins, James Dinan, Sriram Krishnamoorthy, Atanas Rountev, P. Sadayappan. “Global Trees: A Framework for Linked Data Structures on Distributed Memory Parallel Systems”. ACM/IEEE Conference on High Performance Computing (SC ’08). November 2008.

James Dinan, D. Brian Larkins, Jarek Nieplocha, P. Sadayappan. “Scioto: A Framework for Global-View Task Parallelism”. International Conference on Parallel Processing (ICPP ’08). September 2008.

D. Brian Larkins. “Internet Routing and DNS Voodoo in the Enterprise.” Conference on Large Installation Systems Administration (USENIX LISA ’99). November 1999.

Instructional Publications

D. Brian Larkins, William Harvey. “Introductory Computational Science Using MATLAB and Image Processing”. International Workshop on Teaching Computational Science (WTCS 2010). May 2010.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

Studies in:

High Performance Parallel Systems: Prof. P. Sadayappan
Programming Languages and Software Engineering: Prof. Atanas Rountev
Distributed Systems: Prof. Paolo A.G. Sivilotti
Artificial Intelligence: Prof. Eric Fosler-Lussier

TABLE OF CONTENTS

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita...... vii

List of Figures ...... xiii

List of Program Listings ...... xv

List of Algorithms ...... xvii

Chapters:

1. Background ...... 1

1.1 Overview of Tree and Graph Structures ...... 1
1.2 Parallel Computation with Linked Structures ...... 4
1.2.1 Data Models ...... 5
1.2.2 Control Models ...... 10
1.3 Application survey of tree and graph algorithms used in HPC ...... 11

2. Global Chunks: A Framework for General Linked Data Structures on Distributed Memory Parallel Systems ...... 14

2.1 Overview ...... 14
2.2 Programming Model ...... 16
2.2.1 Data Models ...... 16
2.2.2 Data Access Views ...... 19
2.2.3 Data Consistency Models ...... 22
2.3 Programming Interface ...... 27
2.3.1 Core Programming Constructs ...... 27
2.3.2 Core Operations ...... 29
2.3.3 Data Consistency and Coherence Management ...... 34
2.3.4 Using Global Chunks ...... 35
2.3.5 Customizing Node Allocation ...... 38
2.4 Design and Implementation ...... 39
2.4.1 Global Pointers ...... 41
2.4.2 Caching and Buffering ...... 44
2.4.3 Custom Allocation ...... 48
2.4.4 Interoperability ...... 49
2.5 Experimental Evaluation ...... 50
2.5.1 Coarse-Grained Data Movement ...... 50
2.5.2 Global Pointer Overhead ...... 51
2.5.3 Chunk Global View ...... 52

3. Global Trees: Specialized Support for Global View Tree Structures . . . 54

3.1 Overview ...... 54
3.2 Programming Model and Interface ...... 56
3.2.1 Core Programming Constructs ...... 57
3.2.2 Programming Interface ...... 58
3.2.3 Using Global Trees ...... 63
3.2.4 Custom Node Allocation ...... 65
3.2.5 Tree Traversals ...... 66
3.3 Experimental Evaluation ...... 68
3.3.1 Cluster OpenMP ...... 69
3.3.2 Unified Parallel C ...... 69
3.3.3 Global Trees Scalability ...... 70
3.3.4 Barnes-Hut Performance Analysis ...... 71

4. Data Locality in Global Trees ...... 75

4.1 Overview ...... 75
4.1.1 Approaches ...... 76
4.1.2 Contributions ...... 78
4.2 Analysis ...... 79
4.2.1 Reducing Communication Time ...... 80
4.2.2 Reducing Global Trees Overhead ...... 82
4.2.3 Techniques ...... 83
4.3 Profile-guided Data Layout ...... 85
4.3.1 Data Structures ...... 86
4.3.2 Profile Gathering ...... 87
4.3.3 Element Mapping Algorithm ...... 88
4.3.4 Chunk Mapping ...... 89
4.3.5 Profile-driven Allocation in Iterative Applications ...... 93
4.3.6 Collective Data Relayout ...... 94
4.4 Data Relayout Implementation ...... 95
4.4.1 Path Identifiers ...... 95
4.4.2 Chunk Reservation ...... 97
4.4.3 Supporting Remote Allocation ...... 99
4.4.4 Collective and Iterative Data Relayout ...... 103
4.5 Empirical Evaluation ...... 103
4.5.1 Remote Node Allocation Overhead ...... 104

5. Understanding Global Trees Program Performance and Behavior . . . . . 113

5.1 Overview ...... 113
5.2 Collected Runtime Performance Data ...... 114
5.2.1 GCL/GT Memory Usage ...... 114
5.2.2 Data Access Statistics ...... 115
5.2.3 Chunk Packing Efficiency ...... 117
5.3 Interpreting Statistics ...... 118
5.3.1 Element-to-Chunk Mapping ...... 118
5.3.2 Data Distribution ...... 121
5.3.3 System Performance ...... 122
5.4 Data Visualization ...... 123
5.4.1 Owner-Labeled Global Tree Visualization ...... 124
5.4.2 Profile-Labeled Global Tree Visualization ...... 124

6. Related Work ...... 128

6.1 Programming Models for Global-View Distributed Data Structures ...... 128
6.2 Support for Parallel Linked Data Structures ...... 131
6.3 Data Locality Improvement ...... 132
6.4 Performance Analysis and Improvement ...... 134

7. Future Work ...... 136

7.1 Global Chunk Layer ...... 136
7.2 Locality-Aware Load Balancing ...... 137
7.3 Hierarchical Run-time System Support ...... 138

8. Conclusion ...... 139

Bibliography ...... 140

LIST OF FIGURES

Figure Page

1.1 2-D Spatial Decomposition for Barnes-Hut ...... 6

1.2 Quadtree Representation of a Non-Uniform Space ...... 7

2.1 Global Chunks Programming Environment ...... 15

2.2 Relationships between the core GCL programming constructs. . . . . 27

2.3 Memory layout of GCL related structures on a single process. . . . . 39

2.4 Data Layout of Linked Elements ...... 42

2.5 Absolute and relative global pointers ...... 42

2.6 Impact of chunking on element data access...... 51

2.7 Global pointer overhead relative to computational workload...... 52

2.8 Performance of Vector Scaling with Uniform Global and Chunk Global Views ...... 53

3.1 Global Tree and corresponding chunk structure ...... 55

3.2 Scalability of GT with a vector scaling operation ...... 70

3.3 Parallel performance of Barnes-Hut in thousands of bodies processed per second on 32 cluster nodes for Cluster OpenMP, UPC, and Global Trees implementations on a 512k body problem with a chunk size of 256. 72

3.4 Cluster OpenMP coherence traffic vs. number of processors ...... 73

3.5 Parallel performance of Global Trees Barnes-Hut over a range of chunk sizes ...... 74

4.1 Allocation Costs by Allocation Method and Distance ...... 105

4.2 Depth-First Access Relative Performance ...... 106

4.3 Depth-First Access Relative Pointer Usage ...... 107

4.4 Depth-First Access Cache Misses ...... 108

4.5 Similarity in Tree Structure over Simulation Iterations ...... 110

4.6 Barnes-Hut Relative Performance ...... 111

4.7 Barnes-Hut Relative Pointer Usage ...... 112

5.1 Effective Communication Bandwidth ...... 119

5.2 Chunk Packing Efficiency in Barnes-Hut ...... 120

5.3 Effective Bandwidth Utilization ...... 121

5.4 Large Octree Visualization ...... 126

5.5 Owner-Labeled Global Tree ...... 127

5.6 Profile-Labeled Global Tree ...... 127

LIST OF PROGRAM LISTINGS

2.1 GCL Operations on Global State...... 29

2.2 GCL Operations on Element Groups...... 29

2.3 GCL Operations on Global Pointers...... 30

2.4 GCL Operations on Chunks...... 30

2.5 GCL Operations on Elements ...... 31

2.6 GCL Operations on Chunked Elements ...... 33

2.7 GCL Consistency and Coherence Operations ...... 35

2.8 Example: Searching a structure...... 36

2.9 Example: Scaling all elements in a structure ...... 37

2.10 Typical Recursive Access Structure in GCL ...... 47

3.1 GT Operations on Global State ...... 58

3.2 GT Operations on Node Groups ...... 59

3.3 GT Operations on Global Pointers ...... 60

3.4 GT Operations on Tree Nodes ...... 60

3.5 GT Operations to Allocate Nodes ...... 62

3.6 Example: Copying a Global Tree...... 65

3.7 Example: Bottom-up Traversal ...... 67

4.1 Example: Conventional Child Allocation ...... 99

4.2 Example: Improved Child Allocation ...... 102

LIST OF ALGORITHMS

Algorithm Page

1 Global Pointer Dereferencing Algorithm ...... 45

2 Profile-driven Element Mapping ...... 90

CHAPTER 1

Background

1.1 Overview of Tree and Graph Structures

Advances in technology have enabled us to collect, record, and manage data rather efficiently. Large data stores abound and are growing. For example, scientific data from observations, experiments, or simulation easily exceed petabyte scales.

As of 2006, Google’s BigTable storage system was holding over 800 terabytes of data for Internet web crawling alone [26]. The need to process, visualize, and analyze such data efficiently is an important consideration in a wide range of application domains.

The information represented by such data can often be represented in the form of trees or graphs in a very concise and meaningful fashion. Examples abound, ranging from gene expression networks to social networks, from XML datasets to access patterns on the world wide web. Similarly, a large number of data-intensive applications often rely on complex, dynamic, tree- or graph-based data structures to house important meta-data information. Examples here include data mining, visualization, numerical and scientific algorithms.

The compute- and data-intensive nature of such applications and the scale of the data involved necessitate the use of large-scale modern parallel computers. Although there have been dramatic strides in the hardware performance of modern high-end systems, these improvements have been accompanied by a corresponding increase in the complexity of these systems: modern parallel computers have increasingly large numbers of processors, multiple levels of parallelism (multi-core, SMP, cluster), and deeper memory hierarchies. Therefore, efficient deployment of such applications on modern and emerging clusters is challenging.

Achieving high performance on such systems requires both a balanced workload within the parallel structure of the computation to best utilize the available processors, and the exploitation of data locality to minimize data access times. The distribution of the data amongst the processors needs to take into account the cost of data movement between the processors. In addition, movement of data between the different levels of the memory hierarchy needs to be minimized in order to reduce the overall data movement cost. This is usually achieved by adapting applications and runtime systems to be aware of the different levels of the memory hierarchy that incur different data movement costs.

From a programmer’s viewpoint, the complexity of the code required to implement a given algorithm or simulation is a function of the level of detail that the programming model exposes to the programmer and the number of decisions and choices to be made, together with the level of detail required to manage performance-related aspects of the underlying hardware. Currently, the dominant parallel programming model is message passing using MPI. MPI programming is tedious and difficult to debug because of the explicitly partitioned view of the program’s data among the processes [45]. In addition to the fundamental data structures of an algorithm, additional meta-data structures must be maintained by the programmer to hold information about where the different parts of the data are located and to explicitly orchestrate the movement of the data between processes. With out-of-core algorithms, another level of complexity is added in maintaining which data is in memory versus on disk.

An added complexity is the identification of the most efficient realization in a multi-level cluster, where individual nodes are multi-core processors and access to remote memory over the interconnect is faster than access to local disk.

Parallel programming models provide a combination of data and control abstractions. The data abstraction divides the addressable memory in the system into units of locality. All elements in the same unit of locality are assumed to incur the same data movement cost. A simplified data abstraction that improves productivity ideally presents a uniform view of the memory in the system. On the other hand, achieving good scalability requires a data abstraction that can be efficiently mapped to the non-uniform deep memory hierarchy in modern parallel systems. This is typically achieved by creating one unit of locality for each processor memory, and distinguishing between access to the data in local memory and to that in non-local memory.

The control abstraction provides mechanisms to identify and express the parallelism in the computation. It also optionally provides mechanisms for automatic load balancing. The abstraction is either computation-centric, with parallelism specified in terms of data or functions used in the computation, or architecture-centric. In an architecture-centric control abstraction, a fixed number of control flow units are defined, typically one per processor, and these are allowed to execute in parallel with the user managing the synchronization between them. While an architecture-centric control abstraction can be easily mapped to the processors in a parallel system, computation-centric abstractions facilitate ease of programming. In addition, architecture-centric specification of parallelism encourages the programmer to partition the computation with implicit dependences between the tasks in a part. Because this information is not available to the runtime, automatic support for load balancing is a challenge.

The middleware frameworks described herein have been developed to reconcile the seemingly conflicting requirements of performance and productivity for computations involving linked structures such as trees and graphs. These frameworks take a data-structure-centric view of shared data, where graph-based dynamic data structures, such as trees, drive the rest of the system. A key feature of these systems is to allow the programmer to have multiple views of the shared data as well as multiple views of the control and tasking model.

This flexibility of data and control abstractions can be leveraged to varying degrees, depending on whether the goal is a quick prototype for validating ideas on small-scale problems, an efficient realization of large-scale problems, or somewhere in between these extremes.

1.2 Parallel Computation with Linked Structures

Applications using shared linked data structures such as trees and graphs are parallelized with respect to a data model and also a control model. A data model defines the means for data distribution as well as the mechanisms used to access non-local data. The control model similarly describes the process by which work is mapped to physical processors and how this mapping changes over the entire execution of the program.

1.2.1 Data Models

Data Distribution

When developing an application that relies on a shared linked data structure, the designer must consider where this shared data should be placed and how non-local access happens. The placement of the shared data is governed by the data distribution.

Data distribution is done with either static or dynamic techniques.

A static data distribution is defined by a set of a priori conditions that map some unique feature of the data to a particular location. Static distribution can be as simple as having a master process that stores the entire shared structure, with all other processes replicating the data. Alternatively, each process could “own” a specific portion of the problem space and be responsible for storing all shared data corresponding to this part of the problem space. Because the criteria for locating data are statically defined, they must be known a priori and the resolution mechanism must be universal and deterministic. One primary advantage of static techniques is that the distribution is relatively simple and requires low overhead to implement in actual applications. On the other hand, because the distribution is defined in advance, the application may not be able to effectively adapt to changes in the data structure as the computation evolves.
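To make the idea concrete, a minimal sketch of a static mapping rule is shown below; the helper name and hashing scheme are illustrative only and do not appear in any system described in this dissertation.

#include <stdint.h>

/* Static distribution: a deterministic, globally known rule maps a key
 * derived from each element (e.g., a quantized spatial coordinate) to
 * the process that owns it.  Any process can locate any element without
 * communication, but the mapping cannot adapt as the structure evolves. */
static inline int owner_of(uint64_t element_key, int nprocs) {
    return (int)(element_key % (uint64_t)nprocs);
}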

Consider the Barnes-Hut technique for solving the n-body problem. Given a three-dimensional space and a collection of n particles, Barnes-Hut builds an octree in which the root node represents the entire space and its children each represent a single octant of the same space. The two-dimensional problem is simpler and demonstrates the spatial decomposition of a planar space into a quadtree, as shown in Figure 1.1a, with the corresponding tree in Figure 1.1b.

Figure 1.1: 2-D Spatial Decomposition for Barnes-Hut. (a) Spatial Decomposition; (b) Quadtree Representation.

The entire space is subdivided into quadrants, with each quadrant containing a number of bodies corresponding to the physical system under simulation. Each region is continually subdivided until only a fixed number of particles reside in a given quadrant (in this example, 2 particles per quadrant). Each quadrant containing particles becomes a leaf in the tree, with interior nodes representing larger volumes of space. Thus, sparsely populated regions will be represented by a shallow branch of the tree and dense regions will be deeper. The computation performs an interaction computation for each particle, which then alters the position and direction of each particle for the next iteration of the simulation.
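The construction just described can be sketched as follows for the sequential two-dimensional case; this is our illustration using the two-particle leaf capacity from the example, not the distributed representation developed in later chapters.

#include <stdlib.h>

#define MAX_PER_QUAD 2              /* leaf capacity used in the example */

typedef struct { double x, y; } body_t;

typedef struct quad {
    double cx, cy, half;            /* center and half-width of quadrant */
    int    nbodies;                 /* bodies stored here (leaves only)  */
    body_t body[MAX_PER_QUAD];
    struct quad *child[4];          /* NULL for leaves                   */
} quad_t;

/* Insert a body, subdividing a full leaf into four child quadrants.
 * (A production version must also guard against coincident points.)    */
static void quad_insert(quad_t *q, body_t b) {
    if (q->child[0] == NULL) {                    /* leaf */
        if (q->nbodies < MAX_PER_QUAD) { q->body[q->nbodies++] = b; return; }
        for (int i = 0; i < 4; i++) {             /* split into quadrants */
            quad_t *c = calloc(1, sizeof(quad_t));
            c->half = q->half / 2;
            c->cx = q->cx + ((i & 1) ? c->half : -c->half);
            c->cy = q->cy + ((i & 2) ? c->half : -c->half);
            q->child[i] = c;
        }
        for (int i = 0; i < q->nbodies; i++)      /* push bodies down */
            quad_insert(q, q->body[i]);
        q->nbodies = 0;
    }
    int i = (b.x >= q->cx ? 1 : 0) | (b.y >= q->cy ? 2 : 0);
    quad_insert(q->child[i], b);                  /* route to the quadrant */
}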

Figure 1.2: Quadtree Representation of a Non-Uniform Space

For small values of n, it is possible to statically partition the octree by creating a single data structure and replicating it on all processes. Larger values of n, however, will result in a structure that exceeds the memory of a single machine and must be partitioned over multiple processors. Static partitioning could be done by having each process claim a specific region of space and manage the storage of all tree data representing the spatial location and contained particles of that region. Since the initial distribution of particles may be highly non-uniform, like the tree in Figure 1.2, static partitioning based on spatial location could result in a significant imbalance of data among processors. Alternatively, the particles could be partitioned into p groups and assigned to processes by linearizing the octree using a space-filling curve. This approach would result in a twofold advantage over the basic spatial decomposition: first, processes would each contain a similarly sized part of the tree; second, the set of particles on a given process would be spatially close to one another (due to the linearization), which improves data locality.
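One standard way to realize such a linearization is a Morton (Z-order) key, sketched below; this is illustrative and not necessarily the specific curve used by the systems cited in this chapter.

#include <stdint.h>

/* Interleave the bits of a quantized coordinate to build a Morton key.
 * Sorting particles by this key linearizes the octree along a
 * space-filling curve; cutting the sorted sequence into p equal pieces
 * gives each process a compact, spatially local portion of the tree.   */
static uint64_t spread3(uint64_t v) {   /* two zero bits between each bit */
    v &= 0x1fffff;                      /* 21 bits per dimension */
    v = (v | v << 32) & 0x1f00000000ffffULL;
    v = (v | v << 16) & 0x1f0000ff0000ffULL;
    v = (v | v <<  8) & 0x100f00f00f00f00fULL;
    v = (v | v <<  4) & 0x10c30c30c30c30c3ULL;
    v = (v | v <<  2) & 0x1249249249249249ULL;
    return v;
}

static uint64_t morton3(uint32_t x, uint32_t y, uint32_t z) {
    return spread3(x) | spread3(y) << 1 | spread3(z) << 2;
}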

A dynamic data distribution is one that may adaptively change the placement of data based on information known at runtime. Because linked structures are often used to represent dynamic and irregular data, it can be beneficial to re-evaluate data location with respect to the evolving state of the computation. When shared tree or graph structures are being created, data placement can depend on the placement of related elements, rather than a predetermined static map. Dynamic data distribution also allows data to migrate between processors over time and permits the data layout to be restructured during program execution.

The space-filling curve linearization used in the Barnes-Hut algorithm described above can create a tension between load balancing and optimizing communication when the distribution of particles is highly irregular [44]. Dynamic data distribution allows the data placement decision to depend on information available at runtime, such as current workload estimates or other metrics to heuristically quantify tree irregularity. Since these heuristics may be computationally intensive, the overhead of dynamic distribution techniques must be balanced with the benefit they provide to application performance.

Data Communication

Given a distributed data structure, a computation will typically require non-local information to complete. At a high level, an application may approach this by communicating the non-local data to the local process or, alternatively, by moving the computation to the process where the data is located. Data communication is traditionally performed in either a shared memory style or via message passing. Applications may use either style, both, or none depending on program design and implementation.

Parallel applications written in the shared memory style rely on some form of a global namespace to uniquely identify data elements across a cluster. This global namespace represents a virtual shared memory in a compute cluster which is built on top of a message passing communication fabric. Data can be accessed with a globally unique identifier and a dereferencing operator. Some distributed shared memory systems create a single global address space and replicate it among all processes, whereas other approaches partition and distribute the shared memory. Replicating the shared space is very convenient for application developers but may result in poor performance due to problems such as false sharing. Additionally, scalability is a problem with heap replication because the shared heap size cannot exceed the physical memory limit of the smallest participating process.

In the distributed case, globally addressed data is accessed by mapping the global identifier to a tuple which contains the owning process identifier as well as a pointer to the region of memory on that process. This approach is known as a Partitioned Global Address Space (PGAS) [90]. In some systems, dereferencing a global pointer is done by using a language-level operator, but most often it is performed by function call. Global pointers may refer to arbitrary memory structures. In contrast to the majority of message passing techniques, dereferencing a shared pointer is usually an asynchronous (or one-sided) operation. This is beneficial because it reduces the need for synchronization and can take advantage of hardware-supported remote DMA operations.
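A minimal sketch of this tuple-based dereference follows; the struct layout and the one_sided_get helper are assumptions standing in for a real one-sided call such as ARMCI_Get, not an actual PGAS or GCL API.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical one-sided transfer, standing in for e.g. ARMCI_Get. */
extern void one_sided_get(void *dst, int owner, uint64_t offset, size_t n);

/* A PGAS global pointer names data by owner rather than by address:
 * an <owning process, offset into its shared segment> tuple that is
 * meaningful on every node. */
typedef struct {
    int32_t  owner;
    uint64_t offset;
} global_ptr_t;

/* Dereference: local data is returned directly; remote data is fetched
 * with a one-sided get (shown blocking here, though such operations are
 * usually asynchronous). */
static void *deref(global_ptr_t gp, int my_rank, char *my_segment,
                   void *local_buf, size_t n) {
    if (gp.owner == my_rank)
        return my_segment + gp.offset;          /* no communication */
    one_sided_get(local_buf, gp.owner, gp.offset, n);
    return local_buf;
}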

Asynchronous operations are defined with a specific consistency model which permits the system to balance programmer convenience with communication efficiency.

Using global pointers to refer to specific tree or graph nodes makes parallelization of applications a much simpler task, but performance will likely be poor if the system issues a communication operation to access each element. Explicitly increasing the granularity of data access can improve communication efficiency, but also tends to complicate the data access procedure for a program.

Message passing systems such as LAPI and MPI are often used to take more control over the communication and computational structure of a parallel program.

With this finer level of control comes the burden of managing it. Tree and graph algorithms are typically expressed in terms of nodes and edges, often with a recursive structure. Each algorithm must be altered to an efficient message passing form, which can be a daunting task.

1.2.2 Control Models

Program control can be expressed either in terms of the available compute resources or as the work remaining to be completed. Viewing the computation from a resource perspective, a fixed number of control units are provided with a program and allowed to execute in parallel. This is typically realized by mapping each control unit to a physical processor core. This direct resource/core mapping makes it straightforward to map a given computation to a particular parallel computing system, but is more difficult to program. Task decomposition, scheduling, synchronization, and balancing are all left as a burden to the programmer.

Alternatively, control can be modeled as a set of tasks which need to be completed before the program can terminate. Tasks can represent the computation required for some data or a function defined by the program. Program decomposition into tasks makes it much easier for the developer to reason about control flow, but requires that either the program or a run-time system map tasks to available physical resources.

With a task-centric approach, computation can be mapped to processor cores either statically or dynamically. With a static task mapping, tasks are assigned to processors by a static, deterministic process. For example, a matrix multiplication may be broken into p tasks, with one task assigned to each of p processors. As with static data distribution, this makes the control model very simple, but may result in load imbalance with tasks that have a variable amount of computation. Similarly, dynamic task mapping uses runtime information to map available work to compute resources. This can result in better load balance, but incurs greater overhead in spawning and executing tasks.

spawning and executing tasks.

Dynamic mapping systems can perform load-balancing by executing tasks on un-

derutilized resources. Because the physical resource that may execute a task is not

necessarily the same resource on which the task was created or scheduled to execute,

tasks must be migratable structures. Tasks may migrate at the behest of a busy node,

at which point it may push tasks to a less busy node. Alternatively, idle processors

may steal tasks from busy ones, using a technique called work stealing.

Balancing a program workload creates a tension with the goal of minimizing data communication. By definition, irregular structures such as trees and graphs will often be unbalanced, and the computational dependencies are not known until runtime. This sets up a conflict between trying to keep program data collocated with the corresponding computation (minimizing data communication) and the desire to use all available program resources effectively (load balancing). Depending on the specific problem that is being solved, it may be more efficient to tolerate some load imbalance in order to avoid the communication overhead of migrating the task and/or corresponding data, or vice versa.

1.3 Application survey of tree and graph algorithms used in HPC

Scientific Data Analysis:

Many scientific and engineering applications rely on tree- and graph-based data structures. Approximate solutions to n-body problems rely on tree-like data structures to dynamically refine the granularity of computation to ensure the desired accuracy. Examples of applications from this class of algorithms include solutions to the gravity problem such as the Barnes-Hut method [10] and the Fast Multipole Method (FMM) [21] used in the calculation of electrostatic forces. Similar approaches are employed in the numerical calculus domain [2]. Engineering applications also draw on similar spatial structures for computing material properties [94, 19]. Examples can also be drawn from biology, where genetics applications employ graph-based data structures for performing genome analysis [34, 87].

Other scientific frameworks, such as MADNESS [48], use spatial decompositions to represent the domain of highly irregular multi-dimensional functions which correspond to quantum chemistry models. These functions are represented by trees which use hierarchical structure to allow for efficient multiresolution analysis. The computations involved on these trees are immense and require the use of very large-scale systems in terms of both computation and memory.

Clustering and Classification Algorithms: Clustering is unsupervised learning, often used by marketing departments to group similar types of customers. The goal of clustering is to partition the data points into groups or clusters such that the intra-cluster similarity is maximized while the inter-cluster similarity is minimized. In the basic algorithm (kMeans), the clusters are refined iteratively by assigning each data point to the cluster with the closest center [51]. More sophisticated algorithms [73, 93] make use of complex trees such as k-dimensional (kd-) trees and cluster feature (cf-) trees. These trees can potentially exceed the main memory and the disk capacity of a single node. From an algorithmic standpoint, parallelization of algorithms with such large meta structures is non-trivial in a non-shared environment. Other structures, such as R-trees, are often used to efficiently represent geo-spatial data and can be used to answer queries relative to location [46].

Frequent Pattern Mining Algorithms: The goal in frequent pattern mining is to find groups of items, sequences, trees or graphs that co-occur frequently in the given data set. We will consider three algorithms for our purpose here. Consider the most popular frequent itemset mining algorithm, FPGrowth [47]. The algorithm concisely represents the data set in the form of a prefix tree. Furthermore, it employs tree projections to prune the number of candidates at runtime. It has been shown that the pointer-based nature of the tree coupled with poor access locality limits the performance of this algorithm on modern architectures [42, 17, 18]. Previous research has demonstrated that the meta-data structure (FP-tree) can easily exceed the capacity of a single node’s memory and disk [18]. Furthermore, efficiently parallelizing this algorithm is often hard due to its memory-intensive and data-dependent nature.

CHAPTER 2

Global Chunks: A Framework for General Linked Data Structures on Distributed Memory Parallel Systems

2.1 Overview

This chapter describes a run-time framework which supports global view programming of linked data structures on distributed memory parallel systems. The run-time system has several components which work together to provide this support. This system is designed to support parallel applications which use pointer-based data structures, such as tree- and graph-based structures, that are globally shared and available to all processes active in a computation.

Since there is a large amount of commonality between these types of data structures, a lower-level substrate is used to abstract out functionality unique to working with irregular, linked data in general. This system is called Global Chunks and can be used as the basis for higher-level user libraries which support specific tree (e.g. Global Trees, described in Chapter 3) or graph structures.

The Global Chunk Layer (GCL) provides a base global namespace for sharing linked data structures and provides a rich set of operations to access and modify the shared data. This approach is based on the key insight that algorithms which rely on linked data structures are easily expressed in a fine-grained manner, but data movement should occur at a coarser level of granularity for good performance.

To permit fine-grained access with coarse-grained data movement, the GCL groups related data elements into chunks which can improve locality and communication efficiency.

Figure 2.1: Global Chunks Programming Environment. (The application sits atop Global Trees, the GCL, and Scioto for tasks and load balancing, which are in turn built over MPI for message passing and ARMCI for remote memory access (PGAS).)

As shown in Figure 2.1, the Global Chunk Layer is implemented on top of the ARMCI one-sided communication library [67]. ARMCI provides low-level support for the partitioned global address space and hardware-supported remote memory access.

The Global Trees layer is an example of a higher-level library which provides tree abstractions on top of the GCL, and it is the principal component that the user interacts with. However, the user is able to take advantage of the lower-level functionality of each component as well as interoperability with MPI.

The GCL also works in conjunction with the Scioto dynamic load balancing system to manage task parallelism exposed through parallel data structure traversals [35]. Scioto provides a scalable, locality-aware runtime system for the managed execution of tasks that execute in global space. This system forms a key component in the programming model and allows the user to express irregular and nested parallel computations through global tasks that are automatically load balanced across the cluster.

This section details the Global Chunk Layer and the contributions that it makes to providing efficient, generalized support for global view programming of data-structure-specific applications. First, the programming model is described with respect to the traditional approaches described earlier. Second, the interface for this library is described, followed by the implementation. We then provide some experimental analysis of the runtime performance.

2.2 Programming Model

2.2.1 Data Models

In the context of a distributed computing cluster, fine-grained element-wise global access will yield poor performance. To mitigate this effect, all globally-addressed data structures are grouped into chunks which provide greater locality and therefore higher performance. The view of the data presented to the tasks in the computation need not necessarily match the data layout.

Programs and libraries which utilize the Global Chunk Layer are executed in an SPMD manner with MIMD task-parallel regions. The application may structure these task-parallel regions to execute asynchronously on its own, or they may be managed by Scioto. When using Scioto, the application is structured to use portable thread-like tasks to operate on globally shared data. Typically, some collective calls occur during program initialization, but most communication happens via one-sided communication operations. Additional collectives are used for synchronization and termination. Each process can independently and asynchronously access any portion of the shared data without requiring the invocation of application code on the remote end.

Internally, the GCL represents individual data elements as collections called chunks.

Similar to shared pages in distributed shared memory systems, chunks permit element-level fine-grained access, but enable the runtime to perform coarse-grained data movement in order to reduce communication overhead. The application can access data elements without any explicit interaction with the chunking subsystem. For applications which require greater performance than the transparent fine-grained mechanisms provide, the chunking layer can be exposed to the application, permitting greater control over data locality.

Chunks have the following properties:

Contiguous: A chunk consists of a single contiguous memory segment; all elements within a chunk reside within this segment.

Global: A chunk is globally accessible. A reference to a chunk on one process is portable and refers to the same chunk when used within another process.

Ownership: A chunk is physically local to only one process. Each chunk has a master location, which exists in the shared memory segment of one and only one process.

Collections: A chunk is a collection of user-defined elements. The GCL imposes some minor restrictions on data structure definition (specifically with regards to connectivity information), but element content is otherwise defined by the application and is treated as opaque by the runtime system.

Granularity: A chunk is the unit of data transfer. All elemental data accesses correspond to an element located within a chunk. When operating in the default, relaxed mode, access to a specific element which is not local will result in the communication of an entire chunk.

Much like Global Arrays [69], the Global Chunk Layer uses a get/compute/put model for accessing elements of the global data structure. Prior to operating on a specific data element, a global pointer must be dereferenced through the API. After the application has completed its computation on the element, it is either written back to the global address space with a put operation or is discarded with a finish operation, if it is a copy and no updates are required. Put operations which update remote memory locations may return before the communication is complete. Remote updates are guaranteed to be complete only after the application calls one of the GCL synchronization operations.
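In code, the pattern has the following shape. The call names below (gcl_get_elem, gcl_put_elem, gcl_finish_elem) and the my_node_t body are stand-ins of our own for the element operations of Listing 2.5, whose exact signatures are not reproduced here.

/* get / compute / put with illustrative stand-in names. */
void scale_mass(gcl_elemgrp_t *grp, gcl_elemptr_t gptr, double factor) {
    my_node_t *node = gcl_get_elem(grp, gptr);  /* get: local copy (or a
                                                   direct pointer when the
                                                   chunk is local)        */
    node->mass *= factor;                       /* compute on the copy    */
    gcl_put_elem(grp, gptr, node);              /* put: publish update;
                                                   may return before the
                                                   remote write completes */
    /* read-only path: gcl_finish_elem(grp, node) discards the copy */
}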

Global Chunks works under the partitioned global address space (PGAS) model, in which all global data is partitioned among the participating processes. Global data resident in the shared memory segment of a single process is defined to be local data for this process, and shared data is local only to a single process. Inter-process communication happens via the creation, access, and updating of global data structures; however, the GCL is interoperable with other message-passing and one-sided communication systems, including MPI and ARMCI. In addition to the GCL global namespace, these other systems may be used to communicate shared program data.

2.2.2 Data Access Views

The GCL provides two data access views for using elements in a global linked data structure. These views differ in access modes and the granularity of access to the elements in the data structure:

The uniform global view provides the programmer with element-level access to individual nodes in the linked structure. This provides a simplified data abstraction to prototype applications, where chunking is transparent to the programmer.

The chunked global view requires the programmer to be aware of the organization of the data structure, encouraging explicit locality-aware organization of the computation in order to exploit the data locality inherent in the chunked data layout.

These different data views are inter-operable, and the mixing of uniform global and chunked-global programming styles is permitted in order to provide a continuum between rapid development and optimized performance. For operations which are performance-sensitive, the programmer may use the lower-level methods at the expense of higher code complexity. Conversely, when application development resources are at a premium, some performance may be traded for coding effort by using the uniform global view of the data.

Chunking happens in concert with the local virtual memory subsystem on each local processing element. This permits native support for out-of-core datasets. When the number of local chunks exceeds the size of core memory, chunks are automatically paged out by the virtual memory system. Likewise, when an out-of-core chunk is referenced, it will be paged in and made available for reading.

Uniform Global View

Referring to data that is shared amongst processors must be done in a portable manner. The GCL is targeted to pointer-based data structures, so any data element may contain references to other elements. Since these structures may be communicated to different processes, the use of traditional pointers is problematic. Uniform global access is supported through the use of special global pointers which may refer to nodes in a global address space. Global pointers consist of a reference to a chunk identifier and an offset within the chunk for the node which is referred to by the pointer. Uniform global access to data occurs explicitly using the API provided by Global Chunks.
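One plausible layout for such a pointer is sketched below; the real gcl_elemptr_t is opaque to applications, so this struct is only our illustration of the chunk-identifier-plus-offset encoding.

#include <stdint.h>

/* Sketch of a chunk-relative global pointer: the element is named by
 * the chunk that holds it plus its slot within that chunk. */
typedef struct {
    uint64_t chunk_id;  /* globally unique, portable chunk identifier */
    uint32_t offset;    /* element slot within that chunk */
} elemptr_sketch_t;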

The data elements used with the Global Chunk Layer are manipulated using the get, compute, and put model. The programmer explicitly creates a local copy of the nodes of interest using the get operation. The runtime system copies these nodes to the local processor’s memory, where direct pointer-based node access can be used for efficiency. The computation, including any updates, is performed on the local copy of the nodes. Any changes to the nodes are then updated in the global space using the put operation.

This model of computation is similar in spirit to the Global Arrays [68, 69] library. The get/compute/put model is more burdensome than direct pointer access, but is more similar in style to sequential or threaded programming than the traditional SPMD model used on distributed memory systems. The get/compute/put schema can be used in straightforward fashion around each global data reference, or can be used with greater regard for locality when optimizing data access in an application. For efficiency, chunks in the local process address space can also be directly accessed without a copy. Avoiding these unnecessary copies allows the GCL to yield performance closer to that of applications implemented using efficient distributed memory programming models than to that of traditional distributed shared memory systems.

The elements in the global structure are chunked, and the runtime system automatically copies an entire chunk from the global address space when a node in the chunk is requested. Multiple copies of chunks necessitate managing coherence and consistency. Rather than providing a general distributed shared memory solution to the problem, we leverage the linked nature of these data structures to support a restricted but efficient solution.

Using GCL directives, the user can advise the runtime system of the access characteristics within a phase of the program. For example, when a data structure is known to be accessed in a read-only fashion, runtime overhead can be reduced by removing extra checking to ensure mutual exclusion and coherence. Similarly, relaxed consistency guarantees allow the runtime to keep locally consistent views of the global data without the need for expensive coherence mechanisms. This is often sufficient for a large class of applications of interest. Data consistency is discussed in more detail below.

Chunked Global View

The chunked global view exposes the explicit chunking structure of the global linked data structure. This allows application designers to optimize algorithms for intra-chunk locality. A chunk-aware algorithm is able to directly compute on all nodes contained in a given chunk, and may also choose to explicitly overlap communication and computation when accessing entire chunks located on remote processes. The chunked global view defines a set of operations which are useful in realizing this functionality.

In addition to node access via global pointers, the chunked view allows get access to return references to chunks and collections of chunks. This allows a program to request the communication of one or more chunks, then compute on them locally.

Asynchronous operation provides the ability to overlap these communication and computation phases in order to provide good performance. Updating chunked data is supported at the chunk level as well as at the node level.
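A sketch of that overlap follows; the call names and types are illustrative stand-ins, since the GCL's actual chunk operations are those of Listings 2.4 and 2.6.

/* Pipeline chunks: prefetch chunk i+1 while computing on chunk i. */
void process_chunks(chunk_id_t *ids, int nchunks) {
    chunk_t *cur = chunk_get(ids[0]);             /* blocking first fetch */
    for (int i = 0; i < nchunks; i++) {
        request_t req;
        chunk_t *next = NULL;
        if (i + 1 < nchunks)
            chunk_get_async(ids[i + 1], &next, &req); /* start transfer  */
        for (int e = 0; e < chunk_nelems(cur); e++)
            process(chunk_elem(cur, e));          /* compute on local copy */
        if (i + 1 < nchunks) {
            chunk_wait(&req);                     /* complete the prefetch */
            cur = next;
        }
    }
}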

The chunked global view can provide general traversal primitives that operate on chunks. The application may use different traversal techniques depending on the desired behavior at chunk boundaries. Custom traversal patterns can be implemented by extending the general framework.

Since better spatial and temporal locality will yield higher overall performance, it is important to provide flexibility in chunking and distributing shared data. For many applications, algorithm-specific routines can be written which better match the chunk allocation with the data access pattern, giving better performance. The chunked global view provides default chunking as well as hook extensions which permit algorithm-specific chunking and allow forced re-chunking on demand. Similar to node layout within chunks, the default data distribution performed by the system may also be improved by taking into account application-specific details.

2.2.3 Data Consistency Models

The Global Chunks system provides two consistency models in order to efficiently support a wide range of linked-structure applications. The explicit consistency model leaves issues of data consistency largely to the programmer. The location consistency (LC) model provides a scalable model for weak consistency. Both of these modes are augmented by additional synchronization primitives when the consistency provided by the system is insufficient for the algorithm used.

Software distributed shared memory (DSM) systems typically use home-based lazy release consistency (HLRC) [57], which is a stronger memory consistency model than either mechanism described below. Both HLRC and LC use acquire/release semantics for exclusive access and as a synchronization mechanism. Maintaining cache coherency with HLRC requires that updates are always propagated to the owner of the modified page; page invalidations are then done when a process performs an acquire operation. Partitioned global address space languages use a variety of consistency models. UPC provides its own strict and relaxed consistency modes, with strict corresponding to sequential consistency and relaxed allowing arbitrary reordering of reads and writes relative to overall program execution order [86]. UPC provides fence, barrier, and wait/notify primitives. Titanium is another PGAS language which is a superset of Java and provides sequential consistency [91].

Programmer-Explicit Data Consistency

In the programmer-explicit mode, the Global Chunks runtime relies on the developer to manage synchronization and data consistency. Read and write semantics can be operated in either a strict or relaxed form.

The strict mode performs reads immediately, with program control returning when the requested data is available. Data is always fetched from the master location, and no caching of chunks is performed. Put operations similarly do not return until the write is complete at the remote end. The strict operating form is used mostly for checking algorithm correctness or where performance is not important. If two updates to the same location are issued concurrently, the runtime behavior is undefined.

While the relaxed mode has weaker guarantees on data completion and synchronization, this permits optimizations allowing efficient and scalable communication.

In this mode, when a program issues a get request, the entire chunk which contains the requested node is transferred and stored in a local cache. Subsequent requests for other remote nodes in this chunk are served directly from cache without communication. The relaxed mode follows location consistency semantics without supporting acquire/release operations.

When the program performs a put operation, the version in cache will be updated directly, ensuring local sequential consistency. Global Chunks performs write buffering in the relaxed mode. Multiple put operations will be buffered and the updates batched and coalesced. When the buffer is full, the updates are sent using a one-sided vector put operation, often resulting in a single bulk communication operation. Buffered writes update the chunk cache of the local process, so that updates are seen immediately by the application.

Versions stored in other processors’ chunk caches are not invalidated or automatically updated with the new values. If the program requires greater data synchronization at certain program points, atomic operations, fences, barriers, and flush operations may be used to explicitly control synchronization.
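The resulting idiom looks like the following sketch; gcl_put_elem and gcl_fence are our stand-in names, since the actual consistency and coherence operations are those of Listing 2.7.

/* Relaxed mode: puts are buffered, coalesced, and shipped as a single
 * one-sided vector put; the fence forces them out and completes them. */
void update_weights(gcl_elemgrp_t *grp, gcl_elemptr_t *gptrs,
                    my_node_t **nodes, const double *w, int n) {
    for (int i = 0; i < n; i++) {
        nodes[i]->weight = w[i];
        gcl_put_elem(grp, gptrs[i], nodes[i]); /* buffered; may coalesce */
    }
    gcl_fence(grp);   /* hypothetical fence/flush: remote updates are
                         only guaranteed complete after such a call    */
}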

Relaxed mode programmer-explicit consistency requires very little runtime overhead and is very efficient. Since the parallel decomposition of many algorithms often naturally yields a set of data-independent computations, they can be directly mapped into a program which uses Global Chunks and pay no penalty for unneeded synchronization and data consistency management. For example, many tree and graph computations enter a read-only phase after they have been constructed. Frequently, creation is a small portion of the runtime relative to the read-only phase of the program. In this case, explicit consistency will provide maximum performance for the bulk of the computation.

Location Consistency

Location consistency is a weak consistency model proposed and detailed by Gao and Sarkar in [41, 40]. Location consistency models a memory location as a partially ordered multiset of write and synchronization operations. The basic model supports read and write operations (which map to GCL get/put operations) and acquire/release operations for exclusive access to and synchronization of a shared memory location.

Formally, the LC model defines the state of a memory location, L, as the partially ordered multiset state(L) = (S, ≺), with S being a multiset and ≺ being a partial order on S. Each element e of S denotes both the type of event involving L (write, acquire, or release) and the processor performing the operation.

Upon performing a new memory operation, e, the state set is updated by adding e to S and updating the order, ≺, such that e succeeds all other operations which involve the same processor as e. If e is an acquire operation, the update rule for the partial order must also ensure that the most recent release operation e′, if well defined, precedes the acquire operation e.

Read semantics are defined by extending the set S to include read events, and by modifying the ordering relation such that any read on processor P_i is preceded by all existing operations in S also involving P_i.

The values that may be returned for a read operation, e, of location L from processor P_i are defined to be the set of values V:

V(e) = { v | ∃ w = write(P_j, v, L) ∈ S satisfying the two conditions below }

The first condition is that if w precedes the read e in ≺, then v can be included in V(e) only if w is the most recent predecessor write. The second is that if w does not precede the read, v is automatically included in V(e). A system is defined to be location consistent if, for each read operation R on location L, the result of R belongs to the value set V(R). Further, if the value set of R contains more than one value, there is a data race in the program. This is presented in more formal detail in [41].

Informally, a get operation on a shared data element may return any value that has been written to it from a different processor, or the value last written by the local process. Using acquire/release pairs both provides exclusive access to the element and ensures that the local copy has been synchronized with the master.
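As a concrete instance of these rules (our example, using the notation above, not taken from [41]): suppose P_1 and P_2 write to L concurrently, with no acquire/release operations, and P_1 then reads:

S = { w_1 = write(P_1, v_1, L), w_2 = write(P_2, v_2, L), r = read(P_1, L) }, with w_1 ≺ r

Here w_1 is the most recent predecessor write of r, contributing v_1, and w_2 does not precede r, so it is included automatically; thus V(r) = {v_1, v_2}. Because the value set contains more than one value, this execution exhibits a data race; ordering the accesses with acquire/release pairs would shrink V(r) to a single value.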

Location consistency is a useful model when accesses to distributed shared data are not independent. Concurrent access is well defined both with mutually exclusive access and without. Even though the consistency provided is very weak, implementation is tractable on distributed memory clusters, whereas many stronger models are not. Again, when applications require stronger consistency than the GCL operations under LC provide, traditional synchronization techniques such as explicit locking, barriers, and atomic operations can still be used to strengthen data consistency.

2.3 Programming Interface

The Global Chunk Layer uses several structures to manage the globally shared namespace and related data. This section first describes these constructs and then gives an overview of the programming interface. Next, the mechanisms used to keep the global data in a coherent state are described, followed by an overview of the custom element allocation interface.

2.3.1 Core Programming Constructs

Figure 2.2: Relationships between the core GCL programming constructs.

• Global Context: The GCL keeps some global state replicated in the private address space of each participating process. This includes state for the GCL interface, callback registrations, as well as process size and rank information.

• Element Groups: (gcl_elemgrp_t) An element group corresponds to a collection of elements which have similar size and connectivity parameters. Element groups may consist of a single global structure, a collection of global structures, or simply a set of elements. Distinct data element allocations from the same group will be made from the same pool of chunks. Element groups also maintain some semantic information about node linkage, used to avoid looping during traversal and to manage links between elements. The data consistency mode is also associated with element groups.

• Chunks: (gcl_chunk_t) A chunk consists of a collection of data elements grouped together for movement in a single communication transaction. Chunks are only visible to programmers using the chunked global view. The API exposes routines to create and update chunks directly, as well as to access the data elements contained within a chunk. The chunk data structure is logically defined as a header containing chunk metadata followed by an arbitrarily long sequence of data elements. The sequence length is determined during the chunk creation process.

• Global Element Pointers: (gcl_elemptr_t) These pointers are the basic element reference used by programmers. Global element pointers are portable references to any shared data element. All operations on global element pointers must be performed using the API to ensure safety.

• Data Elements: (gcl_elem_t) Data elements consist of two components: the element link structure, which is managed through the GCL API and is visible to the runtime, and the body, which is a user-defined structure that is opaque to the runtime.

2.3.2 Core Operations

Operations on the Global State

void gcl_init(int *argc, char ***argv);
void gcl_finalize(void);
void gcl_abort(void);

Listing 2.1: GCL Operations on Global State.

The gcl_init() and gcl_finalize() routines are both called only once during execution, and perform initialization and teardown of the runtime state, one-sided communication system, etc. Calling gcl_abort() halts program execution and gracefully releases network resources. These operations are all called collectively.

Operations on Element Groups

gcl_group_t gcl_group_init(size_t chunksize, size_t elemsize);
void gcl_group_destroy(gcl_group_t elemgrp);

Listing 2.2: GCL Operations on Element Groups.

The gcl_group_init() routine sets up an element group which can be used for element creation, usage, etc. The programmer must specify the chunksize parameter, which determines the number of elements per chunk, and the elemsize, which would typically be the sizeof() of the user-defined element structure. These routines are all collective routines.

Operations on Global Pointers

void gcl_set_inactive(gcl_elemptr_t ptr);
int gcl_is_active(gcl_elemptr_t ptr);
void gcl_elemptr_copy(gcl_elemptr_t from, gcl_elemptr_t to);
int gcl_elemptr_equals(gcl_elemptr_t p1, gcl_elemptr_t p2);

Listing 2.3: GCL Operations on Global Pointers.

Since references to elements in the global space must be portable, they are implemented as a structured datatype which may be embedded within a data element. Traditionally, applications which use pointer-based data structures check pointer values against the NULL value to test for the presence of children, parents, edges, etc. With an embedded structure type, testing against NULL is meaningless, so these checks must be handled differently. The GCL provides gcl_set_inactive() and gcl_is_active() to perform the equivalent of NULL assignment and comparison (i.e., p = NULL and p == NULL). Both gcl_elemptr_copy() and gcl_elemptr_equals() provide support for safe global pointer duplication and testing for referential equality.
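For example, a helper that tests whether a list element has a successor might be sketched as follows; the link index constant NEXT is hypothetical:

/* Sketch: NULL-equivalent tests on global pointers. */
#define NEXT 0   /* hypothetical link index for the "next" pointer */

int has_next(gcl_group_t grp, gcl_elem_t *elem) {
  gcl_elemptr_t *next = gcl_get_link(grp, elem, NEXT);
  return gcl_is_active(*next);   /* stands in for (next != NULL) */
}

Clearing a link (the analogue of next = NULL) is handled by gcl_clear_link(), described with the element operations below.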

Operations on Chunks

long gcl_chunk_create(gcl_group_t *egrp);
long gcl_chunk_create_remote(gcl_group_t *egrp, int proc);
void gcl_chunk_destroy(long chunkid);
long gcl_chunk_nextfree(long chunkid);
int gcl_chunk_active_elem(long chunkid, int off);
void gcl_chunk_free_elem(long chunkid, int off);
int gcl_chunk_size(long chunkid);
int gcl_chunk_allocated(long chunkid);

Listing 2.4: GCL Operations on Chunks.

These routines provide the basic operations for dealing with chunks: the creation and destruction of new local or remote chunks, and primitives for allocating directly from a specific chunk. gcl_chunk_nextfree() returns the offset of the next unused element in a chunk. gcl_chunk_active_elem() and gcl_chunk_free_elem() test and set (respectively) the chunk metadata corresponding to the allocation status of a specific element within a chunk. Both gcl_chunk_size() and gcl_chunk_allocated() are used to interrogate a given chunk about its size and the number of allocated elements.
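As a small illustration, the metadata queries compose into a scan over a chunk; this sketch assumes that gcl_chunk_size() reports the chunk's capacity in elements:

/* Sketch: counting the live elements in a chunk using the
 * primitives in Listing 2.4. */
int live_elements(long chunkid) {
  int off, count = 0;
  for (off = 0; off < gcl_chunk_size(chunkid); off++) {
    if (gcl_chunk_active_elem(chunkid, off))   /* allocation status */
      count++;
  }
  return count;
}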

Operations on Elements

/* get/compute/put */
gcl_elem_t *gcl_get_elem(gcl_group_t elemgrp, gcl_elemptr_t ptr);
gcl_elem_t *gcl_get_elem_direct(gcl_group_t elemgrp, gcl_elemptr_t ptr);
void gcl_put_elem(gcl_group_t elemgrp, gcl_elemptr_t ptr, gcl_elem_t *elem);
void gcl_finish_elem(gcl_group_t elemgrp, gcl_elem_t *elem);

/* location consistent synchronization */
gcl_elem_t *gcl_acquire_elem(gcl_group_t elemgrp, gcl_elemptr_t ptr);
void gcl_release_elem(gcl_group_t elemgrp, gcl_elemptr_t ptr);

/* elem allocation */
gcl_elemptr_t gcl_elem_alloc(gcl_group_t elemgrp, gcl_elemptr_t hint);
gcl_elemptr_t gcl_elem_alloc_all(gcl_group_t elemgrp);
void gcl_elem_free_elem(gcl_group_t elemgrp, gcl_elemptr_t elem);

/* follow/modify link structure */
gcl_elemptr_t *gcl_get_link(gcl_group_t elemgrp, gcl_elem_t *elem, int linkidx);
void gcl_link_elem(gcl_group_t elemgrp, gcl_elemptr_t from, gcl_elem_t *fromelem, int index, gcl_elemptr_t to);
void gcl_clear_link(gcl_group_t elemgrp, gcl_elem_t *elem, int linkidx);
void gcl_clear_links(gcl_group_t elemgrp, gcl_elem_t *elem);

Listing 2.5: GCL Operations on Elements.

These operations are the core routines for accessing, modifying, and updating global data in the uniform global view. None of these routines are collective except for gcl_elem_alloc_all().

• gcl_get_elem(): Dereferences a global pointer. Ensures that the returned value is always a local copy of the element. A get operation must always be paired with either a call to gcl_put_elem(), for globally updating an element, or gcl_finish_elem(), for declaring that access to the element is complete and it is safe to discard. The get routine may involve communication if the requested data is non-local and the containing chunk is not in the local cache.

• gcl_get_elem_direct(): Dereferences a global pointer. May return either a direct reference to the element data or a pointer to a local copy. This routine is typically used when in-place element modification is legal and the shared data is known to be local. Non-local data is copied into a buffer as with gcl_get_elem(), above. A direct get operation must always be paired with either a call to gcl_put_elem(), for globally updating an element, or gcl_finish_elem(), for declaring that access to the element is complete and it is safe to discard. This routine may involve communication if the requested data is non-local and the containing chunk is not in the local cache.

• gcl_put_elem(): Stores the updated element into the global location specified. If the element was modified directly, this may be a no-op. The shared element data becomes invalid after this call and may subsequently be removed from the local process address space. This routine may involve communication if the master location for the element resides on another process.

• gcl_finish_elem(): Declares that data access to the specified element is complete and it is safe to discard. The element buffer may be reclaimed by the system or replaced with data for another element. No communication is performed.

• gcl_acquire_elem(): Used under location consistency to acquire mutually exclusive access to an element, and also to guarantee that the value used is up-to-date as specified by the rules of the consistency model. This routine may involve communication.

• gcl_release_elem(): Used under location consistency to release mutually exclusive access to an element and possibly update the master location of the element. This routine may involve communication.

• gcl_elem_alloc(): Allocates a new data element. This allocation follows the policy associated with the specified element group, with the default policy being local-open. The optional hint parameter may be used by the allocation mechanism to allocate the element from the same chunk as the element the hint references. May involve communication, depending on the allocation policy.

• gcl_elem_alloc_all(): Collectively allocates a single new data element and returns a reference to it to all calling processes. Typically this is used to allocate a globally visible entry point to the shared data structure, for example, the root node of a tree or subtree. This routine is collective and involves communication.

• gcl_link_elem(): Creates a uni-directional link between an element and a specified reference to another element. If possible (i.e., the linked nodes are within the same chunk), an optimized relative reference will be used (described in the next section). The updated link is not guaranteed to be globally visible until after a subsequent put operation and fence. No communication is performed. A short sketch combining allocation and linking follows this list.
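The following sketch combines allocation and linking; head is assumed to be an existing global pointer used as a locality hint, and link index 0 is hypothetical:

/* Sketch: allocate two elements and link the first to the second. */
gcl_elemptr_t a = gcl_elem_alloc(grp, head);   /* near head, if possible */
gcl_elemptr_t b = gcl_elem_alloc(grp, a);      /* same chunk as a, if possible */

gcl_elem_t *ae = gcl_get_elem(grp, a);
/* ... initialize the element body of a ... */
gcl_link_elem(grp, a, ae, 0, b);               /* a --(link 0)--> b */
gcl_put_elem(grp, a, ae);                      /* publish the update */
gcl_write_fence_all(grp);                      /* make the link globally visible */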

Operations on Chunked Elements

gcl_chunk_t **gcl_chunk_get_local(gcl_group_t *egrp, int *numchunks);
gcl_chunk_t *gcl_chunk_get(gcl_cid_t chunkid);
void gcl_chunk_put(gcl_chunk_t *chunk);
void gcl_iter_init(gcl_iter_t *chunkiter, gcl_chunk_t *chunk);
gcl_elem_t *gcl_iter_next(gcl_iter_t *chunkiter);
int gcl_iter_has_next(gcl_iter_t *chunkiter);
gcl_elem_t *gcl_chunk_get_elem(gcl_chunk_t *chunk, off_t off);
void *gcl_chunk_put_elem(gcl_chunk_t *chunk, off_t off, gcl_elem_t *elem);

Listing 2.6: GCL Operations on Chunked Elements.

These operations support programming in the chunked global view.

The gcl_chunk_get() and gcl_chunk_get_local() routines are for accessing chunk structures. The get operation will perform communication, if necessary, to retrieve a copy of the chunk corresponding to the identifier. The get local operation returns an array of references to all chunks corresponding to a specific chunk group located on the current process. This facilitates owner-compute operations at the chunk level. gcl_chunk_put() is used to update chunks with modified versions. None of these routines are collective.

The remaining routines are used to move through the elements of a chunk and compute on them. gcl_iter_init() initializes an iterator structure to iterate over the elements of a given chunk. The traditional iterator operations are implemented by gcl_iter_next() and gcl_iter_has_next(), which can be used to iterate over the elements of a chunk in a loop. The gcl_chunk_get_elem() and gcl_chunk_put_elem() routines provide random access to elements in a chunk. The chunk get and put routines may involve communication if the chunks reside on remote processes; otherwise these routines are neither collective nor perform communication.

2.3.3 Data Consistency and Coherence Management

The GCL provides support for relaxed mode programming, which caches reads and buffers writes. Often, a program will contain parts which have different consistency requirements for the global data structure. For example, an application may enter a phase where a tree is used primarily as a read-only structure. In this case, relaxed mode access would take advantage of caching and buffering to reduce communication and improve performance when accessing global tree data. The relaxed mode is the default consistency model. The GCL provides routines to change the consistency mode as well as operations to explicitly manage data consistency in the relaxed mode.

void gcl_enable_strict(gcl_group_t elemgrp);
void gcl_disable_strict(gcl_group_t elemgrp);

void gcl_flush_cache(gcl_group_t elemgrp);
void gcl_flush_elem(gcl_group_t elemgrp, gcl_nodeptr_t ptr);

void gcl_write_flush(gcl_group_t elemgrp, int proc);
void gcl_write_flush_all(gcl_group_t elemgrp);
void gcl_write_fence(gcl_group_t elemgrp, int proc);
void gcl_write_fence_all(gcl_group_t elemgrp);

Listing 2.7: GCL Consistency and Coherence Operations

Enabling strict mode disables caching and buffering during a phase of computation. The remaining routines are used to explicitly manage data consistency and synchronization in the relaxed mode.

The gcl_flush_cache() operation marks all lines of the cache as invalid, forcing subsequent reads to be updated from the master chunk. The gcl_flush_elem() routine explicitly flushes the cached chunk containing the specified element from the cache. No communication is performed.

The gcl_write_flush() and gcl_write_flush_all() routines are responsible for draining the write buffers of all pending gcl_put_elem() operations, either to a specific process or to all processes, respectively. The gcl_write_fence() and gcl_write_fence_all() routines similarly flush the write buffers and additionally guarantee that the writes have completed on the remote process.
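Putting these operations together, a program might manage consistency phase-wise; in this sketch, update_tree() and fine_grained_updates() are hypothetical application phases:

/* Relaxed phase: reads are cached, writes are buffered. */
update_tree(grp);                 /* many gcl_put_elem() calls */
gcl_write_fence_all(grp);         /* writes complete on remote processes */
gcl_flush_cache(grp);             /* discard possibly stale cached chunks */

/* Strict phase: caching and buffering disabled; every get/put
 * operates on the master copy. */
gcl_enable_strict(grp);
fine_grained_updates(grp);
gcl_disable_strict(grp);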

2.3.4 Using Global Chunks

This section will demonstrate two simple use cases for the Global Chunks library for working with distributed linked data.

1  int search_list(gcl_group_t grp, gcl_elemptr_t head, int key) {
2    gcl_elemptr_t cur, *next;
3    gcl_elem_t *curelem;
4
5    gcl_elemptr_copy(head, cur);
6
7    while (gcl_is_active(cur)) {
8      curelem = gcl_get_elem(grp, cur);
9
10     if (curelem->value == key) {
11       gcl_finish_elem(grp, curelem);
12       return 1;
13     }
14
15     next = gcl_get_link(grp, curelem, 0);
16     gcl_elemptr_copy(*next, cur);
17     gcl_finish_elem(grp, curelem);
18   }
19   return 0;
20 }

Listing 2.8: Example: Searching a linked list structure.

Uniform Global View

The first example, shown in Listing 2.8, involves a linked-list structure implemented with the GCL. This code uses the uniform global view, in which the chunking of the actual data, as well as its location in the partitioned global address space, is transparent to the application programmer. The caller specifies an element group, a global pointer to the head of the list, and a key value to search for.

This function begins by copying the global pointer to the head of the list into a cursor pointer. Next, the function loops through the list until the cursor points to an inactive (i.e., NULL) global address (line 7). Each element is fetched from the partitioned global address space, and its data value is compared against the key. If the key matches, then the search completes and returns a success value.

If the search does not complete, the function calls gcl_get_link() on line 15, which gets the value of the "next" global pointer in the list. Since this pointer is stored in the current element, it must be copied (line 16) to the cursor pointer. Lastly, the current element is discarded with gcl_finish_elem() and the search continues. If the end of the list is reached without finding a match for the key, the search terminates and returns a failure value.

Chunked Global View

1  void scale_elems(gcl_group_t grp, double factor) {
2    gcl_chunk_t **localchunks;
3    gcl_iter_t iter;
4    int nchunks, i;
5    my_elem_t *element;
6
7    localchunks = gcl_chunk_get_local(grp, &nchunks);
8
9    for (i=0; i<nchunks; i++) {
10     gcl_iter_init(&iter, localchunks[i]);
11     while (gcl_iter_has_next(&iter)) {
12       element = (my_elem_t *)gcl_iter_next(&iter);
13       element->data *= factor;
14     }
15   }
16 }

Listing 2.9: Example: Scaling all elements in a data structure.

This example, shown in Listing 2.9, demonstrates an application of the chunked global view. The chunked global view can be extremely efficient for operations which are completely independent of one another, i.e., "do-all" parallelism. Assume that there is a distributed shared data structure and the application needs to scale all of the data values in it by some constant factor.

First, the function gets the list of all chunks which are local to the process by calling gcl_chunk_get_local(). The scaling function then loops over each of the local chunks, so that all computation operates only on local data. To iterate over the elements in each chunk, the GCL provides an iterator construct, which is initialized for each local chunk (line 10). Using the iterator, each element in the chunk is accessed with gcl_iter_next(), and its data value is scaled by the scaling factor.

This operation is much more efficient than the equivalent uniform global view implementation for two main reasons. First, scaling happens without the need for any communication: all computation is constrained to local data, avoiding the possibility of remote data access. Second, this approach avoids much of the overhead of supporting the distributed global address space. The scaling function avoids the use of global pointers altogether, including the use of get/put operations, and the iterator construct uses native local memory pointers to quickly access the locally chunked data.

2.3.5 Customizing Node Allocation

The GCL allows the programmer to override the default allocation strategy with custom allocators which can take advantage of a priori knowledge of the data access pattern to improve locality. Custom allocators can keep private state between calls to the allocator. Additionally, gcl_elem_alloc(), which wraps the custom allocator, takes a hint parameter which allows the programmer to provide some context from the allocation site.

For example, the default node allocator uses a policy called local open. The local open allocator uses the notion of a "current chunk" and stores a reference to it in its private internal state. When a new allocation is requested, the element is allocated from the current chunk. If the chunk is full, a new chunk is allocated and the reference in the private state is updated to reflect the new current chunk. Hint information is not used with this strategy.

Depending on the importance of element placement for later computation, the allocator could be fairly sophisticated, allocating from remote chunks or from the best candidate in a locally kept pool of chunks.

2.4 Design and Implementation

Figure 2.3: Memory layout of GCL related structures on a single process.

Upon program initialization, the library allocates a large, contiguous memory segment through ARMCI, which is used for all globally visible data, as shown in Figure 2.3. ARMCI shared memory allocation is a collective process, so the memory is allocated at program initialization; non-collective chunk allocations from it are then managed by the GCL during program execution.

Since global data is implemented as distributed sets of chunks, indexing these chunks globally can be done either with a directory approach or by assigning an "owner" to each chunk. Directories must be updated when chunk migration occurs, whereas single-node ownership may result in inefficiencies when updates are all remote and never local. The approach used in the Global Chunks library is to partition the global chunk identifiers among processes, and have each process maintain a directory for its mapped chunks.

Each process keeps a portion of the distributed chunk directory in its own shared memory segment. In order to issue a one-sided communication operation to transfer a chunk, the communication system needs to know both the process owning the data and the location of the data on that process. Since the chunk identifiers are partitioned over the processes, chunk ownership (at least relative to the distributed chunk directory) can be recovered from the identifier directly.
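For instance, if identifiers were dealt out round-robin (one plausible partitioning; the actual scheme is internal to the GCL), recovering the directory owner reduces to a single modulo operation:

/* Sketch: directory owner of a chunk, assuming round-robin
 * identifier partitioning. */
int directory_owner(long chunkid, int nprocs) {
  return (int)(chunkid % nprocs);
}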

The directory maps a set of chunk identifiers to their locations in the partitioned global address space. Because this directory information may be referenced frequently, it is cached by each process. Inside the global context structure, each process keeps a chunk directory cache, which stores local copies of the chunk directory information. This structure is array based, indexed by chunk identifier and owning process. The directory cache is shared by all globally chunked data, and is not specific to a given global data structure.

When a remote chunk is required for a local computation, the system must perform a communication for the initial lookup in the distributed chunk directory. Subsequent lookups can be served from the directory cache, avoiding additional communication. In the case of frequent chunk migration, the cached information will need to be invalidated frequently, resulting in more communication requests.

Each instance of a global data structure corresponds to an element group within the GCL. In addition to group-specific parameters (data element size, chunk size, etc.), the element group maintains a data cache for remote chunks, as well as a collection of buffers used to batch write communications for efficiency.

2.4.1 Global Pointers

A key feature of the Global Chunks system is to provide applications the ability to use pointer-based data structures on a distributed memory system. Global pointers are the fundamental data structure used to provide portable, globally visible, elemental data access. There are two representations for global pointers. In the basic form, a global pointer is a tuple consisting of a globally unique chunk index and the offset of a particular element within that chunk. This structure is called an absolute global pointer, since the components are specified in globally unique, absolute terms.

Relative Global Pointers

Many applications will spend a great deal of time following links between nodes. Even when caching is used and chunks have good spatial locality, a global pointer dereference is still considerably more expensive than a standard pointer dereference, by a factor of 10-20. By adapting the chunk index field of a global pointer, we can reduce the cost of a global dereference to approximately 1.5 times the cost of a C pointer dereference for references within the same chunk. Reducing the dereferencing cost of global pointers by an order of magnitude is a significant contribution of this work.

Figure 2.4: Data Layout of Linked Elements

Figure 2.5: Absolute and relative global pointers

Consider the data layout shown in Figure 2.4. There are two nodes with a parent-child relationship. Each node consists of 12 bytes of content data, and at least two global pointer values located at offsets 12 and 28. As shown in Figure 2.5, an absolute global pointer can be used to refer to the parent node, located in chunk 121 at an offset of 4240 bytes from the beginning of the chunk.

If the child element is located in the same chunk as the parent, the global pointer can be adapted to refer to the distance between the nodes within the same chunk. A negative value in the chunk index is used to signify that the contained pointer is a relative global pointer, rather than absolute.

The value that is contained in the index is the offset from the location of the relative pointer itself to the beginning of the chunk. In this case, the location of the pointer to the child node is 4252 (4240 + 12-byte data). This is stored as a negative value in the chunk index.

The address of the referenced node is computed by adding the address of the pointer (4252), the negative chunk offset (-4252), and the node offset (4340). Since relative pointers may exist when caching is disabled, they must be valid when the dereferencing processor does not have a copy of the entire chunk. The use of relative pointers is automatic and opaque to the programmer regardless of caching status.
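In code, the dereference reduces to pointer arithmetic on the address of the embedded link itself. The following sketch uses illustrative field names taken from Figure 2.5, not the actual GCL structure layout:

typedef struct {
  long chunk_index;   /* negative: offset back to the start of the chunk */
  long node_offset;   /* offset of the target node within the chunk */
} elemptr_repr_t;     /* illustrative representation only */

static void *deref_relative(elemptr_repr_t *p) {
  /* p locates the link in memory; chunk_index (negative) walks back
   * to the chunk base; node_offset walks forward to the target node. */
  return (char *)p + p->chunk_index + p->node_offset;
}

With the values from the example above, (4240 + 12) + (-4252) + 4340 = 4340, which is exactly the offset of the child node within the chunk.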

The use of relative pointers improves the dereference operation by avoiding the directory lookup and allowing the target address to be computed directly by two integer additions. Relative pointers are only valid in the context of a link pointer in a chunked node. When relative pointers are copied, they are converted to the more portable absolute form. When either an absolute or relative pointer is used as the target address of a linking operation, the GCL will attempt to use a relative pointer, if possible.

2.4.2 Caching and Buffering

Since Global Chunks provides a relaxed consistency model, the runtime is permitted to cache remote data and defer updates for greater communication efficiency.

Caching

By default, Global Chunks operates in the relaxed explicit consistency mode. When performing a get operation, the runtime must determine the pointer type, the data location, and whether or not there is a local copy of the data in order to provide the application access to the data.

The procedure for performing a get operation is shown in Algorithm 1. This procedure performs the following steps to provide the application access to the global data:

1. The global pointer is checked to ensure that it is active (i.e., non-null).

2. If the pointer is a relative pointer, then the pointer must reside inside either a cached chunk or a local chunk. The address of the pointed-to data is computed directly, as described above, and returned to the application.

3. If the pointer is not a relative pointer, then the chunk identifier is used to determine ownership of the chunk.

4. If the referenced chunk is local to this process, the address of the pointed-to element is computed and returned.

5. Otherwise, the referenced chunk is remote: compute the hash value for the chunk in the local cache, and check the cache for the chunk.

6. If the chunk is in the cache, return the address of the referenced data element.

7. If the chunk is not in the cache and the cache is not full, issue a communication to fetch the chunk into an empty location in the cache. Return the address of the referenced data element.

8. Otherwise, find an entry in the cache which is safe to evict, and remove it from the cache. Then issue a communication to fetch the chunk into the newly emptied location in the cache. Return the address of the referenced data element.

Algorithm 1 Global Pointer Dereferencing Algorithm
input: ptr, a global pointer to an element
output: a local pointer to the referenced data element

gcl-get-elem(ptr)
  if is-active(ptr) then
    if ptr.chunkid < 0 then          // relative global pointer
      return compute-relative-addr(ptr)
    else                             // absolute global pointer
      owner ← get-owner(ptr.chunkid)
      if owner = self then
        return compute-local-addr(ptr)
      else
        (line, found, full) ← search-cache(ptr.chunkid)
        if found then
          return compute-cache-addr(ptr, line)
        else
          if full then
            line ← evict-from-cache(line)
          end if
          raddress ← get-directory(ptr.chunkid)
          fetch-chunk(owner, raddress, line)
          return compute-cache-addr(ptr, line)
        end if
      end if
    end if
  else
    error: null pointer dereference
  end if

The data cache is implemented as a hashed array of chunk-sized cache lines, allocated when the GCL element group is created. If possible, the cache memory is allocated from registered RDMA memory, to permit high-performance communication from the remote memory segment directly to the cache via the network interface card. The GCL maintains a valid/invalid bit flag for each cache line, to reduce the cost of flushing the cache and to avoid redundant memory allocation and release operations.

Cache Eviction and Chunk Replacement

Caching within Global Chunks happens at the chunk level, but data access occurs with respect to specific data elements. This can be problematic when the data cache on a process is at capacity and chunks must be evicted from it to make room for more recently accessed data.

Since data access occurs within the get/compute/put model, it is unsafe to evict chunks between a get and a put operation. To complicate matters, programs which use pointer-based shared data structures frequently access data in the following pattern:

void recursive_function(ptr) {
  node_t *node = get_node(ptr);
  for (int i=0; i < NUM_CHILDREN; i++) {
    recursive_function(node->child[i]);
  }
  put_node(ptr, node);
}

Listing 2.10: Typical Recursive Access Structure in GCL

In this example, the first call to recursive_function() would make the chunk containing the element pointed to by ptr ineligible for eviction until the recursion fully terminates. Caching occurs at the chunk level, so the runtime must track which chunks have elements that are in use. To prevent illegal chunk replacements, Global Chunks increments a reference counter in the chunk header during a get operation, and decrements the reference count during a put or finish operation. This has the drawback of adding a reference-count update to every global pointer dereference, but it allows the system to continue to operate when the cache is at capacity.
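A sketch of this bookkeeping follows; the header field and helper names are hypothetical:

/* Sketch: pinning chunks so that cached chunks with elements in use
 * are never evicted. */
void chunk_pin(gcl_chunk_t *chunk)   { chunk->refcount++; }  /* on get */
void chunk_unpin(gcl_chunk_t *chunk) { chunk->refcount--; }  /* on put/finish */

int chunk_evictable(gcl_chunk_t *chunk) {
  return chunk->refcount == 0;   /* safe to evict only when unpinned */
}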

Write Buffering

In addition to caching chunk data for efficient reads, Global Chunks stores updates into write buffers. The system maintains a write buffer for every other process participating in the computation. When data is written via a put operation, the copy in the local data cache is immediately updated to reflect the changes. Additionally, the updated element data is stored in the write buffer corresponding to the process owning the updated element.

When the buffer is full, or when an explicit flush or fence operation is invoked, the buffer is drained. Redundant updates are eliminated from the buffer, leaving only the update from the most recent put operation. The updates are coalesced into a vector of I/O updates and written to the shared memory segment on the remote process with an ARMCI update operation.

2.4.3 Custom Allocation

The performance benefits of caching reads can be greatly improved by reducing the cache miss rate. Because different applications may traverse linked data structures with a variety of access patterns, no single chunking policy will be optimal for all applications. For example, some algorithms may traverse a tree structure in a depth-first manner, others breadth-first, level-wise, etc. Global Chunks provides a mechanism for application writers to customize node allocation to improve locality when operating on cached chunks. The default allocator designates a new node from an unfilled chunk on the local process. This is sufficient when the data access pattern roughly follows the node creation pattern, but performance-critical usage may require a custom allocator.

Custom allocation is handled by implementing a set of callbacks which perform initialization, finalization, and element allocation. Allocators are able to access private internal state kept between allocations, and are passed a global pointer with the allocation request to hint at an optimal placement. The internal state can maintain a pool of open chunks to allocate from, affinity information for unfilled chunks on other processes, etc. The hint provides some context from the allocation site. Together, this information can help improve data locality and increase performance.

For example, consider the computational kernel of an application which operates level-wise over a tree, but where creation occurs in a top-down manner. A custom allocator could keep several open chunks corresponding to the various tree levels and allocate new nodes from the best match given the allocation hint provided.
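As an illustration only, a level-affinity allocator along these lines might be sketched as follows. The callback signature, MAX_LEVELS, level_of(), and make_elemptr() are all hypothetical; only gcl_chunk_create() and gcl_chunk_nextfree() come from Listing 2.4, and the sketch assumes gcl_chunk_nextfree() returns a negative value for a full chunk:

#define MAX_LEVELS 32
static long open_chunk[MAX_LEVELS];   /* private allocator state: one open
                                       * chunk per tree level */

gcl_elemptr_t level_alloc(gcl_group_t grp, gcl_elemptr_t hint) {
  int lvl = level_of(hint);           /* hypothetical: derive level from hint */
  long off = gcl_chunk_nextfree(open_chunk[lvl]);
  if (off < 0) {                      /* current chunk exhausted: open another */
    open_chunk[lvl] = gcl_chunk_create(&grp);
    off = gcl_chunk_nextfree(open_chunk[lvl]);
  }
  return make_elemptr(open_chunk[lvl], off);  /* hypothetical constructor */
}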

2.4.4 Interoperability

The Global Chunk Layer is implemented in C and currently uses the ARMCI one-sided communication library for all data movement. Programs which use Global Chunks directly, or any derivative libraries, are interoperable with a variety of programming languages including C, Fortran, C++, and Co-Array Fortran (CAF).

Global Chunks can also operate in an MPI environment, allowing computation phases that may be more efficiently implemented with message passing.

When programming under multiple programming models concurrently, details such as synchronization must be composed carefully to ensure correct behavior. For example, when composing MPI and Global Arrays with a Global Chunks program, barrier operations will have slightly different semantics with regard to communication completeness. Instead of requiring program developers to correctly order multiple barrier, fence, or flush operations, Global Chunks provides API access to the routines used to manage communication consistency. This allows the construction of composed synchronization routines which reduce programmer burden and complexity.

2.5 Experimental Evaluation

The Global Chunks Layer is a runtime system which provides an abstraction layer between the communication layer of the partitioned global address space and a data-structure-specific library targeted to a particular application domain. As such, the experimental analysis in this section focuses on features unique to the GCL runtime, with application-specific results left to later sections.

Most of these experiments were conducted on the Pentium 4 cluster at the Ohio Supercomputing Center. This cluster utilized 2x2.4 GHz dual-core nodes, each running 64-bit Linux and configured with 4GB of RAM. The interconnect used is 10Gbps InfiniBand. Other experiments were performed on the Glenn cluster at the Ohio Supercomputing Center. This cluster consists of 2.6 GHz quad-core and 2.5 GHz eight-core AMD Opteron processors, each running 64-bit Linux and configured with 8GB and 24GB of RAM, respectively. The cluster is connected via 10Gbps and 20Gbps InfiniBand interconnects.

2.5.1 Coarse-Grained Data Movement

One of the key advantages of the Global Chunks Layer over other partitioned global address space systems is the use of chunks for bulk data transfer. Using chunks involves a larger initial communication, but subsequent requests for remote data are satisfied without the need for additional transfers. As shown in Figure 2.6, two traversals of a large linked data structure are performed, one with a system implemented using GCL and another implemented using Berkeley UPC. When data access is completely local, the two systems perform similarly. When data access is completely remote, the Global Chunks Layer provides nearly the same level of performance, whereas the large number of communication events in the UPC version adds substantial overhead.

Figure 2.6: Impact of chunking on element data access.

2.5.2 Global Pointer Overhead

Dereferencing global pointers in a partitioned global address space is more costly than traditional pointer dereferencing in a sequential program. How this impacts performance depends greatly on the workload characteristics of the target application.

Figure 2.7 shows the overhead of global pointer dereferencing relative to the amount of computation performed at each data element. With little or no workload, data access is dominated by following links between data elements. The more an application is dominated by link traversal, the greater the relative impact of the global dereference overhead. As can be seen in the figure, the C pointer dereference is much faster than a GCL pointer dereference, which is in turn considerably faster than global pointer evaluation in either of the two UPC implementations. The vertical lines at 6 and 39 FLOPS represent the computational workload of Barnes-Hut, a common tree-based benchmark used for the evaluation of these systems. Given the relatively small amount of computation performed at each node in Barnes-Hut, global pointer overhead is a significant factor in application performance.

Figure 2.7: Global pointer overhead relative to computational workload.

2.5.3 Chunk Global View

When a computation can be decomposed into an operation that permits out-of-order processing of the global data structure, the chunked global view can provide a dramatic increase in performance. Using the chunked global view eliminates the overhead due to global pointer dereferencing for both read and write operations.

Figure 2.8: Performance of Vector Scaling with Uniform Global and Chunk Global Views

The computation used in this benchmark is a vector scaling operation, common in applications such as MADNESS [48]. Each node in the tree contains a vector of coefficients which are to be scaled by a constant factor. As shown in Figure 2.8, the impact of using the chunked global view varies depending on the per-node workload of the computation. With a unit-length vector, the chunked global view provides a speedup of between 4 and 11 times over the same computation following the explicit link structure. As the amount of computation increases to scaling a vector of length 1000, the computation is dominated by the scaling operation, and any inefficiencies arising from the use of the link structure become insignificant.

CHAPTER 3

Global Trees: Specialized Support for Global View Tree Structures

3.1 Overview

Global Trees (GT) is a run-time library and programming interface which provides a global address view for tree-based data structures on distributed memory clusters. In contrast to traditional distributed shared memory systems, GT is designed specifically to support dynamic linked data structures and, as a result, is able to focus on providing efficient access for applications which use these structures. Global Trees automates the allocation, distribution, and communication of shared data.

Applications which use tree data structures will usually access nodes within the tree in a fine-grained manner. Because it is crucial to accommodate element-wise access in an efficient manner, all globally shared data is grouped into chunks. Chunks are of a sufficient size to offset the overhead of performing a communication. Figure 3.1 shows how a binary tree may be laid out across a collection of chunks. The chunk structure is transparent to the application, but is a key component of the Global Trees system.

As shown earlier in Figure 2.1, Global Trees is implemented on top of several lower-level runtime libraries, with the ARMCI one-sided communication library at its base [67]. Global Trees is built upon the Global Chunks Layer (GCL) runtime system.

Figure 3.1: Global Tree and corresponding chunk structure

The GCL provides a general framework for chunking collections of linked data elements, performing chunk-wise operations, and managing communication and caching. This layer provides much of the functionality of a distributed shared-memory system, customized for use exclusively with linked data structures. The GT layer provides higher-level tree abstractions on top of the GCL and is the principal component that the user interacts with. However, the user is able to take advantage of the lower-level functionality of either component and also maintain interoperability with MPI code.

GT also takes advantage of the Scioto dynamic load balancing system to manage task parallelism exposed through parallel tree traversals [35]. Scioto provides a scalable, locality-aware runtime system for the managed execution of tasks that execute in the global space. This system forms a key component of the GT programming model and allows the user to express irregular and nested parallel computations through global tasks that are automatically load balanced across all processors.

3.2 Programming Model and Interface

Because GT is built upon the Global Chunks Layer, it inherits the majority of its programming model. Like Global Chunks, GT programs are executed in an SPMD manner with MIMD task-parallel regions managed by Scioto. Most communication in Global Trees happens via one-sided communication operations. Additional collectives are used for synchronization, termination, and tree traversal. Each process can independently and asynchronously access any portion of the shared tree without requiring the cooperation of the process storing the remote data.

Global Trees uses a get/compute/put model for tree access. Computation on global tree nodes happens by dereferencing a GT global pointer through the programming interface. After the application has completed its computation on a node, the node is either written back to the global address space with a put operation or, if it is a copy and no updates are required, discarded with a finish operation. Put operations which update remote memory locations may return before the communication is complete. Remote updates are guaranteed to be complete only after the application calls one of the GT synchronization operations. Global Trees currently uses the programmer-managed consistency model provided by Global Chunks.

Shared tree data is distributed among the participating processes. Global data resident in a single process is defined to be local data for that process, and shared data is local to only a single process. Communication happens via the creation and use of global tree structures; however, GT is interoperable with any other communication system compatible with the Global Chunks system, such as MPI, Global Arrays, ARMCI, etc.

3.2.1 Core Programming Constructs

The basic structure of a GT application consists of a collective call to initialize the runtime and underlying communication layer, followed by the creation of one or more node groups. The program may then flow asynchronously, with each process allocating, accessing, or updating global tree data independently. GT also provides built-in traversal routines which are collective, as well as operations to explicitly manage caching status.

• Node Groups: (gt_nodegrp_t) A node group corresponds to a collection of nodes which have similar size and connectivity parameters. Node groups may consist of a single global tree, a forest of global trees, or simply a set of nodes. Distinct node allocations from the same node group will be made from the same pool of chunks.

Node groups are created to represent a collection of tree nodes which have similar linkage characteristics and are allocated from the same chunk pool. Node groups can be created or destroyed, and are used when allocating new shared tree nodes. Caching of tree data also happens at the node group level.

• Global Node Pointers: (gt_nodeptr_t) These pointers are the basic node reference data type used by programmers. Global node pointers are opaque, portable references to any tree node. All operations on global node pointers must be performed using the GT programming interface to ensure safety.

Global node pointers are returned by the allocation routines and are also used within a node to represent linkages to other nodes. Global pointers may be marked as either active or inactive. Inactive global pointers correspond to NULL pointers. GT provides primitives to copy global pointers, test for referential equality, and also to test for and mark pointers as inactive.

• Tree Nodes: (gt_node_t) Tree nodes consist of two components: the node link structure, which is managed through the API and is visible to GT, and the node body, which is a user-defined structure that is opaque to the runtime. The node link structure contains a number of global node pointers which may refer to other nodes. The application specifies the number of links when initializing the node group structure. Links may point to a parent node, to child nodes, or may group nodes together in an application-specific manner. The API provides a means to distinguish between these links when updating them or performing tree traversals.

3.2.2 Programming Interface

Operations on the Global State

void gt_init(int *argc, char ***argv);
void gt_finalize(void);
void gt_abort(void);

Listing 3.1: GT Operations on Global State

The gt_init() and gt_finalize() routines are both called only once during execution, and perform initialization and teardown of state kept internal to Global Trees, as well as the GCL and ARMCI system layers. The gt_abort() function terminates program execution while gracefully releasing system and network resources. These operations are all called collectively.

Operations on Node Groups

gt_nodegrp_t gt_nodegrp_init(size_t chunksize, size_t nodesize, int nlinks);
void gt_nodegrp_destroy(gt_nodegrp_t grp);
void gt_nodegrp_free(gt_nodegrp_t grp);

gt_nodegrp_t gt_tree_create(size_t chunksize, size_t nodesize, int nlinks);

Listing 3.2: GT Operations on Node Groups

The gt_nodegrp_init() function initializes a node group (and the related GCL element group), which is to be used with a set of one or more trees that have similar node size characteristics. The programmer must specify the number of elements to be grouped together in a chunk as chunksize, as well as the size of a node in the tree. Multiple node types in a single tree are typically specified as union types. gt_nodegrp_destroy() is used to free all resources associated with a node group, while gt_nodegrp_free() keeps the group but releases all allocated tree node data associated with it.

The gt_tree_create() routine is provided for convenience. This routine creates a node group and allocates a root node for a new tree. The global pointer for this root node is broadcast to all processes. All of these routines are collective.

Operations on Global Pointers

These functions encapsulate the similarly named functions from the GCL library which perform the same tasks.

void gt_set_inactive(gt_nodeptr_t ptr);
int gt_is_active(gt_nodeptr_t ptr);
void gt_nodeptr_copy(gt_nodeptr_t from, gt_nodeptr_t to);
int gt_nodeptr_equals(gt_nodeptr_t p1, gt_nodeptr_t p2);

Listing 3.3: GT Operations on Global Pointers

Operations on Tree Nodes

gt_node_t *gt_get_node(gt_nodegrp_t grp, gt_nodeptr_t ptr);
gt_node_t *gt_get_node_direct(gt_nodegrp_t grp, gt_nodeptr_t ptr);
void gt_put_node(gt_nodegrp_t grp, gt_nodeptr_t ptr, gt_node_t *node);
void gt_finish_node(gt_nodegrp_t grp, gt_nodeptr_t ptr, gt_node_t *node);

void gt_link_child(gt_nodegrp_t grp, gt_nodeptr_t from, gt_node_t *fromnode, int index, gt_nodeptr_t to);
void gt_link_parent(gt_nodegrp_t grp, gt_nodeptr_t from, gt_node_t *fromnode, gt_nodeptr_t to);
void gt_link_node_to_link(gt_nodegrp_t grp, gt_nodeptr_t from, gt_node_t *fromnode, int linkid, gt_nodeptr_t to);

gt_nodeptr_t gt_get_parent(gt_nodegrp_t grp, gt_node_t *node);
gt_nodeptr_t gt_get_child(gt_nodegrp_t grp, gt_node_t *node, int index);
gt_nodeptr_t gt_get_link(gt_nodegrp_t grp, gt_node_t *node, int linkid);

Listing 3.4: GT Operations on Tree Nodes

The functions gt_get_node(), gt_get_node_direct(), gt_put_node(), and gt_finish_node() are similar to the GCL equivalents described in Section 2.3.2.

Global Trees supports a more structured notion of linking nodes than that provided by the GCL. Specifically, Global Trees distinguishes between three different categories of links within tree data. The gt_link_child(), gt_link_parent(), and gt_link_node_to_link() routines manage these link types, described as follows:

• Child Links: Child links are used to specify downward, directed edges in a tree from a parent node to a child. The function gt_link_child() updates a global pointer in the parent node corresponding to the child located at index. This operation can be specified only in terms of the from and to global pointers, but if the actual node data is passed in, it will be used directly for efficiency. If possible (i.e., the parent and child nodes are located within the same chunk), GT will use a relative pointer.

Child links are the links traversed by the built-in traversal patterns. Applications are responsible for preventing the introduction of cycles into the tree structure, which may cause traversals to loop indefinitely.

• Parent Links: Parent links are used to specify upward, directed edges in a tree from a child to a parent. As with setting child links, the from and to global pointers are required, but the child node may be passed as fromnode to avoid redundant processing and communication. Relative pointers will be used when possible.

• Other Links: Global Trees permits applications to keep additional links between nodes in a directed, acyclic manner. These may be used to traverse the leaves of a tree, or to otherwise group together logically related but structurally disjoint sections of the tree. GT provides mechanisms for updating and retrieving these application-specific pointers, because the runtime manages all connectivity information for efficiency. Relative global pointers are used when possible.

Operations to Allocate Nodes

void gt_node_alloc(gt_nodegrp_t grp, gt_nodeptr_t *new, gt_nodeptr_t hint);
void gt_node_linkalloc(gt_nodegrp_t grp, gt_nodeptr_t from, gt_node_t *fromnode, int linkid, gt_nodeptr_t *new, gt_nodeptr_t hint);
void gt_alloc_parent(gt_nodegrp_t grp, gt_nodeptr_t child, gt_node_t *childnode, gt_nodeptr_t *parent, int index);
void gt_alloc_child(gt_nodegrp_t grp, gt_nodeptr_t parent, gt_node_t *parentnode, gt_nodeptr_t *child, int index);

Listing 3.5: GT Operations to Allocate Nodes

In addition to the allocation routines provided by the underlying GCL framework, Global Trees also provides allocation primitives which are convenient for tree operations and which provide node allocators with extra context to improve node placement within chunks. Basic node allocation is done with gt_node_alloc(), which takes a hint parameter that is passed to the specific allocation strategy in use and may be used to specify a desired chunk to allocate the new node from. Similarly, gt_node_linkalloc() performs the same allocation procedure, but also creates a link between the two nodes.

Both the gt_alloc_parent() and gt_alloc_child() functions are used for extending a tree, growing either bottom-up or top-down, respectively. In addition to allocating a new tree node and linking it into the existing tree structure, these routines provide valuable context to the underlying allocation strategies. While any allocation placement will be correct, using the specific parent and child allocation primitives allows node placement to be made with respect to the overall tree structure, rather than as a sequence of disjoint memory allocations, as sketched below.
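For instance, one level of top-down construction might be sketched as follows, assuming grp, parent, and the NUM_CHILDREN bound (as in Listing 2.10) are in scope; node initialization is elided:

/* Sketch: growing one level of a tree top-down. gt_alloc_child()
 * allocates each child and links it under parent at index i, giving
 * the allocator the parent-child context. */
gt_nodeptr_t child;
gt_node_t *pnode = gt_get_node(grp, parent);
int i;

for (i = 0; i < NUM_CHILDREN; i++) {
  gt_alloc_child(grp, parent, pnode, &child, i);
  /* ... get, initialize, and put the new child node ... */
}
gt_put_node(grp, parent, pnode);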

62 Other Operations

Global Trees also provides barrier and fence operations. Since remote updates via put are one-sided, asynchronous communications, they may not complete before the updated values are needed to proceed with the computation. GT provides a fence operation to wait for all pending writes to complete, as well as a fence operation particular to writes destined for a specific process.

3.2.3 Using Global Trees

Tree nodes can be allocated either collectively (typically only when creating the root of a tree) or from a single process. Allocations are done with respect to a specific node group, which may use the default chunk allocation scheme or a custom allocator to optimize node/chunk placement. One-sided allocation calls take an optional hint parameter which may provide context from the call site to inform the allocation choice. For example, it is usually desirable to allocate a child node from the same chunk as its parent. The global pointer of the parent can be passed to the allocation routine to help the allocator place the new node "close" to the parent node. This is discussed in greater detail below.

Nodes are accessed by the get operation. Get operations specify a node group and a global pointer, and return a reference to node data. If the node is a remote node, the return value will be a reference to a copy of the remote node. When the node is local, different versions of the get operation can be used to work on the node directly, or on a copy which is safe to modify and discard.

Updates are effected by calling the put operation. Put operations take a global pointer and a pointer to a tree node, and update the master copy of the node. If the data was local and direct access mode was used to get the node, the put operation becomes a no-op. The put may not happen immediately and is only guaranteed to complete at the remote end after a fence or barrier operation.

Connections are forged between two global tree nodes by using link operations. The actual link operation is communication-free, as it relies on prior get and subsequent put operations to handle all updates to the node structure. Linking creates a one-way reference between a source node and a destination node. Routines are provided to explicitly set parent links, indexed child links, and general links between any two nodes in the global space.

Example: The code given in Listing 3.6 demonstrates a basic use of Global Trees to perform a recursive copy of a global tree structure. For brevity, this routine is assumed to be called in parallel on independent subtrees. A global pointer to the root node of the subtree to copy is passed in as src, as is a previously allocated destination node pointer, dst. This routine copies the subtree rooted at src to a subtree rooted at dst. In lines 6-7, both node pointers are dereferenced by gt_get_node(), which returns a pointer to the node structure. For each possible child of the source node, gt_get_child() is called in line 10, and the returned global node pointer is tested in line 11 to see whether this index points to a child. If so, a new tree node is allocated for the destination tree, with the destination node (the parent of the new node) as the allocation hint in line 12. The newly allocated child node is then linked to the parent by a downward edge in the following call to gt_link_child(). A parent link could be handled similarly. The call proceeds recursively in line 14, followed by an update of the destination node and a finish of the source node.

1  void tree_copy(gt_group_t ng, gt_nodeptr_t src, gt_nodeptr_t dst) {
2    gt_nodeptr_t schild, dchild;
3    gt_node_t *snode, *dnode;
4    int i;
5
6    snode = gt_get_node(ng, src);
7    dnode = gt_get_node(ng, dst);
8
9    for (i=0; i<NUM_CHILDREN; i++) {
10     schild = gt_get_child(ng, snode, i);
11     if (gt_is_active(schild)) {
12       gt_node_alloc(ng, &dchild, dst);
13       gt_link_child(ng, dst, dnode, i, dchild);
14       tree_copy(ng, schild, dchild);
15     }
16   }
17   gt_put_node(ng, dst, dnode);
18   gt_finish_node(ng, src, snode);
19 }

Listing 3.6: Example: Copying a global tree structure.

3.2.4 Custom Node Allocation

GT provides a facility for the programmer to override the default allocation strategy with custom allocators. This allows the allocation process to take advantage of a priori knowledge of the data access pattern to improve locality. Custom allocators are permitted to keep private information between calls to the allocator. The hint value passed into the allocation routine allows the programmer to provide some context from the allocation site. A custom allocator is particular to a specific node group; multiple node groups may each use distinct allocation strategies.

The default node allocator uses a policy called local-open. The local-open allocator keeps a reference to the chunk currently being allocated from. When a new allocation is requested, the node is allocated from the current chunk. A new chunk is allocated when the current chunk is full, and the private state is updated. Hint information is not used.
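As an illustration, the sketch below outlines a custom allocator in this spirit: it tries to co-locate a new node with the chunk holding the node named by the hint (for example, a parent's global pointer), and falls back to local-open behavior when that chunk is full. The callback signature, the gcl_chunkid_t type, and the helpers gt_chunk_of(), gt_chunk_full(), gt_chunk_alloc_node(), and gt_chunk_new() are assumptions for illustration, not the actual GT API.

typedef struct {
    gcl_chunkid_t open_chunk;   /* private state kept between allocator calls */
} near_parent_state_t;

gt_nodeptr_t near_parent_alloc(gt_group_t grp, void *hint, void *priv) {
    near_parent_state_t *s = (near_parent_state_t *)priv;

    if (hint != NULL) {
        /* hint is assumed to carry the parent's global pointer */
        gcl_chunkid_t c = gt_chunk_of(grp, *(gt_nodeptr_t *)hint);
        if (!gt_chunk_full(grp, c))
            return gt_chunk_alloc_node(grp, c);   /* co-locate with the parent */
    }
    /* hint missing or the parent's chunk is full: behave like local-open */
    if (gt_chunk_full(grp, s->open_chunk))
        s->open_chunk = gt_chunk_new(grp);
    return gt_chunk_alloc_node(grp, s->open_chunk);
}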

If the placement of nodes in chunks is critical for later computation, the allocator could be fairly sophisticated, allocating from remote chunks or from the best candidate in a locally kept pool of chunks. As with the built-in traversals, it is likely that a variety of different applications may benefit from similar allocation strategies.

3.2.5 Tree Traversals

Global Trees may be used to traverse shared trees by any application-defined method which follows the get/compute/put model of node access. Because many applications use common traversal orders, GT provides optimized, parallel traversals which are capable of invoking application code on each node during the traversal. Global tree traversals are performed by collectively registering a visitor callback, which performs a computation on a single tree node. A reference to the visitor callback routine is then passed as a parameter to the desired traversal pattern, which automatically generates a valid parallel traversal and invokes the callback on each node in the traversal. Currently, GT provides general pre- and post-order top-down traversals, as well as bottom-up and level-wise traversals.

Traversal Visitors:

A traversal visitor is a callback function which is invoked by one of the predefined traversal routines. The callback is invoked on every node in the tree or subtree being traversed, in the order specified by the traversal. To use a traversal, the programmer creates a callback function and registers it with GT. Later, the programmer invokes a traversal operation, passing the registered callback key to identify the desired visitor function to be invoked at each node. This indirection is required because the parallel traversals may be dynamically load-balanced, preventing the use of traditional function pointers to refer to callback routines. Visitor callbacks have the following form:

typedef void (*gt_visit_node_t)(gt_group_t nodegrp, gt_nodeptr_t ptr);

Built-in Traversal Example

The code fragment in Listing 3.7 demonstrates a typical usage of the built-in traversal routines. This example computes a "cost" value that measures the number of leaves contained in each subtree within the tree. Presume leaves are initialized to a weight of 1.0. Upon calling compute_cost(), a bottom-up traversal is performed, with each node accumulating the weights of its children along the way.

typedef struct {
    double weight;     /* accumulated leaf weight ("cost") of this subtree */
} mynode_t;

void cost_visitor(gt_group_t nodegrp, gt_nodeptr_t ptr) {
    mynode_t    *thisbody, *childbody;
    gt_nodeptr_t childptr;
    int          i;

    thisbody = (mynode_t *)gt_get_node(nodegrp, ptr);
    thisbody->weight = 0;

    for (i = 0; i < MAX_CHILDREN; i++) {   /* MAX_CHILDREN: assumed name for the fixed out-degree */
        childptr = gt_get_child(nodegrp, thisbody, i);
        if (!gt_nodeptr_isnull(childptr)) {
            childbody = (mynode_t *)gt_get_node(nodegrp, childptr);
            thisbody->weight += childbody->weight;
            gt_finish_node(nodegrp, childptr);
        }
    }
    gt_put_node(nodegrp, ptr, thisbody);
}

void compute_cost(gt_nodegrp_t nodegrp, gt_nodeptr_t *root) {
    gt_visitor_t costvisitor;

    costvisitor = gt_register_visit(cost_visitor);
    gt_traversal_bottomup(nodegrp, root, costvisitor);
}

Listing 3.7: Example: Bottom-up tree traversal.

Because GT automatically parallelizes the tree traversal, a node may be visited on any available processor. To preserve the portability of a visitor callback, all callbacks must be registered collectively. The callback function, cost_visitor(), performs this accumulation for a single node. To actually invoke the traversal, the node group, the root of the tree or subtree, and the registered visitor key are passed to gt_traversal_bottomup().

3.3 Experimental Evaluation

We present a scalability study of Global Trees using a benchmark which models the compression and reconstruction operators from the MADNESS multi-resolution quantum chemistry package. This experiment was performed on the Glenn cluster at the Ohio Supercomputing Center. This cluster consists of 2.6 GHz quad-core and 2.5 GHz eight-core AMD Opteron processors, each running 64-bit Linux and configured with 8GB and 24GB of RAM, respectively. The cluster is connected via 10Gbps and 20Gbps Infiniband interconnects.

We also present an experimental evaluation of the GT system using the Barnes-Hut n-body gravitational simulation as an example. We selected the existing shared-memory implementation of the Barnes-Hut benchmark from the Stanford SPLASH-2 benchmark suite [88] as the basis for our comparison. For this application, we compare the performance of a GT implementation with the performance achieved using Cluster OpenMP [50], a commercial software distributed shared memory platform, and explore the GT performance and design space through several microbenchmarks.

The Barnes-Hut experiments were conducted on a commodity cluster located at OSU. This cluster consists of quad-core 2.33GHz Intel Xeon nodes, each running 64-bit Linux and configured with 6GB of RAM. The cluster is connected via a 10Gbps Infiniband interconnect network.

3.3.1 Cluster OpenMP

Cluster OpenMP is a commercial product distributed by Intel that provides support for OpenMP parallel programs on distributed memory systems [50]. OpenMP is a parallel annotation language that allows programmers to annotate their source code with directives that enable parallel execution [32]. Intel's Cluster OpenMP is built on top of the Intel C++ compiler and the TreadMarks software distributed shared memory (DSM) system [3].

The OpenMP parallel model defines a flat, shared address space. This is supported under Cluster OpenMP by splitting the heap into private and shared portions. The private heap is only accessible by the local process, while the shared heap is managed by TreadMarks, which uses software DSM techniques to maintain consistency across all processes. Cluster OpenMP introduces additional markup that must be used to denote static variables as sharable, and it provides API functions that can be used to perform dynamic allocation from the shared heap.

3.3.2 Unified Parallel C

Unified Parallel C (UPC) is a programming language which allows for fine-grained data access and global, pointer-based data structures [86]. UPC is built on top of the GASNet partitioned global address space communication layer, which is similar in spirit to ARMCI [15]. UPC is a full language, a superset of C: the UPC compiler is a source-to-source compiler which emits C code annotated with calls to the one-sided communication system. Fine-grained data access in UPC is performed with fine-grained communication, unless the programmer explicitly writes the program to the contrary.

3.3.3 Global Trees Scalability

Figure 3.2: Scalability of GT with a vector scaling operation (speedup vs. number of processors, 4 to 1024).

Figure 3.2 demonstrates the scalability of the Global Trees system. This example uses a connected-tree version of the vector scaling benchmark used in Section 2.5. This benchmark was configured to use a tree with roughly 90 million nodes and a workload comparable to the coefficient scaling operation from the MADNESS framework.

3.3.4 Barnes-Hut Performance Analysis

The Barnes-Hut algorithm solves the n-body problem, which computes the gravitational force interactions between n bodies within a given region of space. A global tree structure represents a recursive spatial decomposition of a three-dimensional region containing the particles of interest. The algorithm has two principal components per iteration: tree construction and the force computation.

Initially, the position and mass of each body is determined by a known model and the bodies are distributed among processors. Each process iterates through its list of bodies and inserts them into the spatial decomposition tree. If there are too many bodies in a given cell, the cell is subdivided and the bodies inserted into the correct child at the next level down. Once all bodies have been inserted into the tree structure, a bottom-up traversal is performed to accumulate the centers of mass for each higher-level cell.

The particles are partitioned by a Morton-ordered tree traversal, which helps place spatially close bodies onto the same processor. Each process then iterates over its local bodies and performs a partial traversal of the tree to evaluate the force contributions from each body. If a distance threshold is met, the force contribution from many bodies in a cell may be approximated by the cell’s center of mass. This phase consists of n independent tasks which may operate on cached tree nodes. Once the forces have been computed, updated position and mass values are stored into the global space.

Figure 3.3 shows a parallel performance comparison between the Global Trees, UPC, and Cluster OpenMP implementations of Barnes-Hut. For this experiment, we focus on the performance of the force calculation kernel of Barnes-Hut that performs the n-body computation.

Figure 3.3: Parallel performance of Barnes-Hut in thousands of bodies processed per second on 32 cluster nodes for Cluster OpenMP, UPC, and Global Trees implementations on a 512k body problem with a chunk size of 256.

From this data, we see that both the Global Trees and UPC implementations achieve good performance on up to 32 processors. Global Trees outperforms UPC due to its use of chunking to gain additional communication efficiency. The Cluster OpenMP implementation scales well to 16 processors; however, it begins to slow down past 20 processors. This slowdown is due to coherence traffic generated by Cluster OpenMP's software DSM system in response to updates of shared data.

From the data in Figure 3.4 we can see that the rate at which coherence events are triggered during the force calculation kernel increases rapidly with the processor count even for the fixed 512k body problem size. However, around 28 processors, the rate begins to flatten, matching the corresponding flattening of performance seen in

Figure 3.3. This coherence traffic is generated in response to concurrent reads and writes to the list of bodies.

Figure 3.4: Cluster OpenMP coherence traffic (kSIGSEGV/sec for fetch, write, and total) vs. number of processors.

When processing a body during force computation, all other bodies must be examined to accumulate their gravitational force contributions into the current body's state for the next time step. There is no true data dependence between these reads and writes; however, because both the t and t+1 states are stored in the same structure, every update can potentially result in coherence traffic.

In addition, under the SPLASH-2 implementation of Barnes-Hut, all bodies are stored in a contiguous array, causing multiple bodies to lie within the same shared page in memory and further exacerbating the false sharing problem. In comparison, Global Trees is able to achieve lower overhead through relaxed, user-controlled coherence and by exploiting high-level knowledge of the tree data structure.

Compared with traditional DSM systems, where the unit of sharing is an OS-level page, the unit of data transfer in GT is a chunk. While the size of a page is often fixed for a given system, under GT the chunk size parameter can be tuned to match the locality of a given application to the characteristics of a given machine.

Figure 3.5: Parallel performance of Global Trees Barnes-Hut over a range of chunk sizes (16 to 16384 nodes, 32 processors).

In Figure 3.5 we evaluate the performance of the GT Barnes-Hut code over a range of chunk sizes for a 32-processor execution of the 512k body data set. From this data, we see that a chunk size of roughly 256 yields peak performance and that chunk sizes in the range 64-2048 yield performance within 90% of peak. For chunk sizes smaller than this range, performance falls off rapidly due to excess communication overhead. For chunk sizes larger than this range, performance also decreases as locality within a chunk decreases and transfer time increases. Obtaining good performance therefore requires a judicious choice of chunk size that balances communication efficiency with locality. Chunk size selection is explored in more detail in Section 5.3.1.

CHAPTER 4

Data Locality in Global Trees

4.1 Overview

Achieving good performance from a parallel application based on a large, distributed, irregular data structure is difficult. Extracting the most performance from modern large-scale distributed memory clusters places a great burden on both application software and run-time systems. Irregular problems, in particular, pose unique challenges with respect to load balance and data locality. The nature of many of these applications and the systems on which they run makes it hard to statically anticipate and improve program performance. Additionally, systems which are able to adapt and balance one component of the dynamic execution of a program, such as workload, often end up at odds with another, such as data locality.

Global Trees provides a programming model which allows developers to access a distributed tree structure using global pointers. It is possible to treat shared data accesses uniformly; however, doing so will generally result in poorer performance than a locality-aware system when run on a compute cluster. A program which references data in local memory will clearly be faster than one which needs data resident in remote memory. Intuitively, reducing the communication volume and communication overhead needed to support a computation will result in higher performance.

4.1.1 Approaches

The data reference locality of a program can be improved by addressing both its inherent data reference locality and its realized data locality [27]. The inherent data locality is specific to a particular algorithm, whereas the realized data locality is concerned with the system-level mapping of program data to actual storage locations within the cluster. Common approaches which aim to improve data reference locality are algorithmic modification, static data mapping techniques, the use of heuristic allocation strategies, and the use of dynamic program behavior to optimize data placement.

Since Global Trees is a general framework for tree-based computations, we focus on improving the realized data reference locality as provided by the Global Trees system.

All communication in GT occurs at the chunk level; hence, element allocations must be mapped to chunks. Both the mapping of a given data element to a particular chunk and the assignment of a specific chunk to a particular process will impact the amount of communication performed during program execution.

Static data mapping approaches, such as the local-open technique used by default in GT, can perform very well when the data access pattern can be characterized by a static rule. With the local-open allocator, nodes are placed in chunks in allocation order. If the data access matches the allocation pattern, then, intuitively, this is an optimal method for placing data. If the data access pattern differs from the static mapping, the program will likely have a larger communication volume and lower overall performance.

Heuristic allocation strategies employ a set of static rules along with the dynamic state of the program to make an educated guess about the best processor and chunk to allocate a node from. When deciding where to place a new node, these strategies only have access to information from the call site and information kept between allocations. Since many tree-based applications have a recursive structure during tree creation, allocation call site information may not provide enough context to make a good allocation choice.

A major drawback of improving data locality through heuristic allocation is the lack of global information within the allocation mechanism. For example, consider an application where the dominant access pattern is depth-first. If the tree is allocated in a depth-first fashion, then a static mapping is sufficient. If the tree is allocated in a different order, it is difficult to know whether the current allocation is for a leaf or for the root of a long branch. This is important because chunks should be full in order to get the benefits of coarse-grained communication, but they should also contain related data, so as to take advantage of spatial locality within the tree. Without additional global knowledge of future allocations, it is very difficult to develop heuristics which improve realized data reference locality.

Profile-driven techniques measure data access and communication patterns during a representative execution of a program and then use this information to guide future allocations. Since the profile is unique to each specific tree application, these approaches can be very accurate. Gathering profile data and performing analysis on it, however, can be expensive enough to offset the benefits of better data placement.

4.1.2 Contributions

The remainder of this chapter focuses on a profile-driven technique for improving data reference locality in tree-based applications. This technique improves the realized data locality of an application by grouping related nodes in the same chunks and placing those chunks on processors so as to reduce the communication volume.

This work makes the following contributions:

• Mapping Algorithm: A profile-driven technique is described which maps nodes to chunks and processors in a GT application. The profile collects the number of data references for each edge in the tree structure. This information is used to guide the placement of nodes corresponding to the most frequently traversed edges within the same chunk.

• Iterative Data Layout: A technique for applying the node mapping layout to tree-based algorithms which have an iterative structure, such as the Barnes-Hut algorithm. One or a small number of iterations are used to generate the data profile; future iterations then create the global tree according to the mapping.

• Collective Data Layout: A technique for applying the node mapping layout to tree-based algorithms via a collective data relayout operation. The program is allowed to operate for a period of time in the profile-gathering phase. When a good profile has been obtained, the program can collectively call a routine which computes a new mapping and moves the global tree data according to that mapping.

• Remote Data Allocation: An efficient technique for allocating nodes on remote processes under the get/compute/put model. This approach avoids both unnecessary communication and blocking for communications to complete.

• Empirical Evaluation: An empirical evaluation of these techniques is presented which demonstrates the effectiveness of this approach. Performance is evaluated using a breadth-first allocation / depth-first access benchmark, as well as the Barnes-Hut application from the SPLASH-2 benchmark suite [88].

4.2 Analysis

The goal of any locality optimization is to reduce the overall runtime of the program. By understanding the components of the overall runtime, we can focus on specific optimization opportunities that will result in a lower execution time.

Consider the sequential execution of a Global Trees program with a running time of t_seq. Then the ideal running time for a parallel execution on p processors is:

    t_ideal = t_seq / p

A more realistic model for parallel execution time accounts for communication time (t_comm) and GT parallel runtime system overhead (t_gt):

    t_real = t_seq / p + t_comm + t_gt

Decomposing the communication overhead into the traditional startup time (t_s) and per-byte transfer time (t_b) yields the following, with i communication events each of size B_i:

    t_real = t_seq / p + Σ_i (t_s + B_i · t_b) + t_gt        (4.1)

Therefore, improving performance at the system level involves minimizing the time spent communicating (specifically, time spent waiting for communication), as well as the time lost inside Global Trees and lower-level libraries managing the parallel execution and communication.

4.2.1 Reducing Communication Time

Extracting the communication cost from Equation 4.1 gives the following expression:

    t_comm = Σ_i (t_s + B_i · t_b)

This definition of t_comm accounts for the time spent in non-overlapping communication, and is the sum of the startup cost and transfer time for the i communication events in the execution.

In the PGAS programming model, global data is distributed amongst the participating processors. The computation on a given processor will need some portion of this global data, which can be broken into two groups: local and non-local data. Because local data access time is accounted for by the t_seq/p and t_gt terms, we can restrict our focus to accesses of non-local (or remote) data elements for the purpose of reducing non-overlapping communication time.

For a given computation, presume that λ data references are local and that ρ data references are non-local. The total number of data references (τ) can be expressed as:

    τ = λ + ρ

Global Trees uses the concept of chunks to perform bulk data transfers and improve locality. Non-local data may be cached, and then subsequently accessed at only local access cost. If we define λ_ρ as the number of non-local data accesses which are served from cache and ρ′ as the non-local accesses requiring communication, we get the following relation:

    τ = λ + λ_ρ + ρ′

Under the chunking model, unnecessary data may be communicated in a chunk transfer due to the element-to-chunk mapping. The number of data elements which are communicated but never referenced is defined as σ.

Now, if we let |ρ| and |σ| represent the respective data volumes of these elements in bytes, we get the following expression for t_comm:

    t_comm = ρ′ · t_s + (|ρ| + |σ|) · t_b        (4.2)

Thus, to improve communication time, the following factors must be considered, assuming t_s and t_b are inherent properties of the system:

• ρ′: Non-local data accesses requiring communication. Reducing the number of communication events will reduce the contribution of t_s to overall runtime.

• ρ: All non-local data accesses. Reducing the amount of remote data referenced will reduce the overall communication volume.

• σ: Unnecessary data transferred. Any unreferenced data which is communicated without computational overlap is wasted time spent in transfer.
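For intuition, consider a hypothetical system (the values here are illustrative only, not measurements) with t_s = 10 µs and t_b = 1 ns/byte, and a computation that references 10^6 remote nodes of 256 bytes each. Fetching each node individually gives ρ′ = 10^6, so t_comm ≈ 10^6 × 10 µs + 2.56 × 10^8 B × 1 ns/B ≈ 10.3 s, dominated by startup cost. Fetching 256-node chunks in which only half the nodes are ever referenced requires about 7,813 transfers, so t_comm ≈ 7,813 × 10 µs + 5.12 × 10^8 B × 1 ns/B ≈ 0.59 s: the σ term doubles the transferred volume, but the reduction in ρ′ still yields more than an order-of-magnitude improvement.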

4.2.2 Reducing Global Trees Overhead

Using Global Trees to provide a global namespace for distributed data access adds a certain amount of overhead to program execution time. Node allocation, reading and writing global data, and other components of the GT framework add complexity and time compared to native, language-supported data creation, access, and deletion operations. In particular, operations on the shared data structure which deal with the link structure of the tree are all handled by GT routines.

While startup and teardown costs may be larger than in a sequential code, they are typically far outweighed by overhead from the data access routines in the computational kernel. A large amount of the complexity that must be handled by Global Trees is the conversion from the global pointer representation to the ⟨processor, address⟩ representation used by the PGAS access subsystem.

One of the key techniques for reducing overhead when handling global pointers is to take advantage of relative pointers whenever possible. Recall from Algorithm 1 that dereferencing an absolute global pointer involves function call overhead, several conditionals, a directory lookup, and some pointer arithmetic, whereas a relative pointer is limited to a single conditional ("is this a relative pointer?") and two pointer additions.
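The sketch below contrasts the two dereference paths. The pointer encoding, field layout, and the helper deref_absolute() are assumptions for illustration; they are not the actual GT representation from Algorithm 1.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t flags;    /* low bit set: relative pointer (assumed encoding) */
    uint32_t offset;   /* node index within the current chunk */
} gp_t;

extern void *deref_absolute(gp_t p);  /* slow path: directory lookup, etc. */

/* Fast path: a relative pointer resolves with one conditional and simple
 * pointer arithmetic against the base of the chunk already in hand. */
static inline void *gp_deref(gp_t p, char *chunk_base, size_t node_size) {
    if (p.flags & 1)
        return chunk_base + (size_t)p.offset * node_size;
    return deref_absolute(p);
}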

When an application has a heavy workload at each node, t_gt will be small relative to the other terms in Equation 4.1. On the other hand, when the computational workload is low and the communication small, global pointer overhead can be a substantial impediment to good performance. If the computation can be decomposed into a form amenable to the chunked global view, then global pointer overhead can be eliminated completely. In reality, many computations based on linked data structures do not have a structure which allows for unordered access of the tree data. Hence, any system-level support for improving the use of relative pointers will be effective in reducing the impact of GT overhead, t_gt.

4.2.3 Techniques

Reducing communication and overhead in GT programs can be distilled into approaches which reduce the number of non-local data references (ρ and ρ′) and approaches which improve the element-to-chunk mapping, which can additionally reduce the transfer of unnecessary data (σ) from Equation 4.2.

Non-local data access occurs when a) the initial data allocation occurred elsewhere, b) the computation has been migrated but the data has not, or c) the data is shared and needed by many processes. Data layout is controlled by the parallel decomposition of the tree allocation process in conjunction with the chunk mapping strategy used by the allocator. Computation migration is a reactive process used when static task partitioning is inadequate; the destination of the migration will greatly impact the number of non-local data elements accessed. Lastly, shared data will need to be communicated irrespective of the originating location.

The following factors are useful when considering means for reducing communication and overhead in Global Trees applications:

• Chunksize: The number of elements in each chunk determines the granularity of communication in a GT application. Larger chunksizes incur less overhead due to communication startup costs, lowering ρ′. Smaller chunksizes will often carry less unnecessary data, lowering σ.

• Element-Chunk Mapping: Because of the dynamic, irregular nature of tree computations, mapping data elements to chunks (via node allocation) is imperfect. Improving the effectiveness of this mapping will lower σ and may also reduce ρ′. Additionally, this factor is significant when determining the legality of using a relative pointer. Relative pointers can only exist between nodes in the same chunk, so the mapping of correlated nodes to the same chunk has a significant impact on global pointer overhead.

• Communication Implementation: The total runtime is only increased when we must wait for communication needed by the computation. Therefore, any improvement in communication overlap or speculative communication may reduce ρ′, the number of blocking communications that we must wait for.

Of these factors, the locality optimization techniques presented in this chapter focus on improving the element-chunk mapping. Determining the best chunksize for a given application is discussed in detail in Chapter 5 and the communication implementation is not considered.

Influencing the element-to-chunk mapping can be done initially at allocation time, or later through a data re-layout operation. At allocation time, the mapping is determined by the allocation strategy, together with any additional semantic information provided by the application.

In general, the problem of finding an optimal algorithm for data placement and code arrangement that minimizes the number of cache misses (i.e., ρ′) is NP-hard. It has also been shown that efficient approximate algorithms produce sub-optimal mappings relative to an optimal solution [74].

Instead of attempting to make good placement choices at allocation time, we opt for a profile-driven approach to data layout. This approach takes a data reference trace for a Global Trees parallel program and uses this information to determine an element/chunk mapping for a subsequent phase of computation. The element mapping can be used to move nodes as a collective operation, or used to inform future element allocations. This approach extends Global Trees to support dynamic data distributions, as described in Section 1.2.1.

4.3 Profile-guided Data Layout

Many tree-based applications have a phase where the tree structure is created and then used for computation in another phase. It is frequently the case that the data access pattern used in the computation differs from the specific order used to create the tree.

Global Trees must decide where to place an element in a chunk at allocation time, with imperfect knowledge about subsequent allocations or the dominant data access pattern. As a result, profile data may improve data placement after the application has run in a steady state for a while.

The profile-guided layout process works in two steps. First, the program runs in a profile-gathering mode, where data is collected about data references during an application-specified phase of the program. Next, the runtime performs a partitioned layout which assigns elements to chunks, and chunks to processes, based on the information gathered from the profile data.

4.3.1 Data Structures

The partitioning step relies on two key data structures. The first data structure, edgerefs, tracks the number of times a specific tree edge has been traversed throughout a phase of computation. This data is maintained during the profiling step and is provided as input to the partitioning step. An edge is defined as a two-tuple containing the unique identifiers of each vertex on the edge, and the edgerefs structure is a mapping from edges to reference counts:

    edgerefs = global_id × global_id → edge traversal count

The ptrs structure is a collection keyed by unique node identifiers; it contains the global pointer mapping from the profile run to the new node location in the partitioned layout. This structure also keeps a boolean flag, assigned, which is initially false and is set once the node has been placed into a chunk within the layout algorithm.

    ptrs = global_id → old_ptr × new_ptr × assigned
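A C realization of these two structures might look like the sketch below; the concrete field and type names here are assumptions for illustration, not the actual GT implementation.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t v1, v2;        /* unique identifiers of the edge's two vertices */
    uint64_t refcount;      /* traversals observed during the profile phase */
} edgeref_t;

typedef struct {
    uint64_t     global_id; /* stable node identifier (e.g., a path id) */
    gt_nodeptr_t old_ptr;   /* node location during the profile run */
    gt_nodeptr_t new_ptr;   /* node location in the partitioned layout */
    bool         assigned;  /* set once the node is placed into a chunk */
} ptrent_t;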

4.3.2 Profile Gathering

The edgerefs structure is updated throughout the profile phase of the computation. When the application accesses the child or parent of a node, a counter corresponding to that edge in edgerefs is incremented.

A key premise of this approach is the ability to identify a specific location in the tree which is stable between runs and over different phases of computation. We desire an approach which is stable in the presence of minor differences in the tree structures, which rules out using global pointers for this purpose (a technique used by Chilimbi in his work [28]).

For spatial applications, each tree node represents a specific domain in space, at least in a relative sense. (In Barnes-Hut, for example, the bounding box represented by the octree may change each iteration.) Still, because each node has a fixed out-degree, a node can be identified either by a level and index pair, or by enumerating all possible locations in the tree and using this index to identify the node.

A more general approach is to uniquely identify a node by the path from the root of the tree to itself. The path consists of the indices of each child pointer that would need to be followed to find the given node. This has the advantage that it can be used to determine the unique identifiers of parents and children. This identifier may be very long, however. The minimum storage needed for a unique path identifier is:

    bits = log_2(outdegree) × max_depth        (4.3)
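For example, for a hypothetical octree (out-degree 8) allowed to grow to a maximum depth of 20, Eq. 4.3 gives log_2(8) × 20 = 60 bits per path identifier.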

In practice, this size may be increased by a factor of 2 or more in order to provide efficient access at runtime. Since the identifier needs to be stored within the node, this may have a negative impact on applications with very small node sizes.

Additionally, applications may have their own techniques for uniquely identifying nodes, which could also be used as a key into the edgerefs mapping.

4.3.3 Element Mapping Algorithm

During a profile run, presume that edges are traversed via a GT API function, and that the profile data is stored in the edgerefs structure.

The element partitioning algorithm has the following steps:

1. Sort edgerefs in descending order as edgerefs’.

2. Initialize the ptrs structure from edgerefs.

3. Partition the nodes into logical chunks.

4. Assign the chunks to specific processors.

5. Merge small chunks together.

6. Assign all nodes physical ⟨chunk, offset⟩ addresses.

7. Emit the allocation map to all processes.

First, the list of edges, edgerefs, is sorted in descending order of reference count and stored in edgerefs'. Then, each vertex of each edge is recorded in the ptrs structure, initially unassigned. Next, the vertices of the edges in edgerefs' are co-located in the same chunk, in descending reference count order. This algorithm is detailed in Algorithm 2. The chunks assigned here are logical chunks, which may be partially filled and have not been assigned to a specific processor.

Once nodes have been placed in chunks, partially filled chunks are assigned to specific processors and merged together. Finally, the mapping of unique identifiers to partitioned ⟨chunk, offset⟩ addresses is output; it is used for the data relayout operation or for allocations in the next iteration.

This algorithm has good complexity characteristics. The first step is O(n log n) in the number of nodes in the tree, which dominates performance of the algorithm. The other steps are either O(n) in the number of tree nodes, or O(n log n) in the number of chunks if sorting is used in step 5.

The partitioning algorithm takes a greedy approach, working through the edges with the highest reference counts first. Each vertex of an edge is checked to see if it has already been assigned to a chunk. If neither node has been assigned, a new chunk is created and both nodes are placed within it. If only one node has been assigned, we check whether there is enough room in its chunk; if so, the nodes are co-allocated, and if not, the unassigned node is placed into a new chunk. If both nodes have been assigned, we attempt to merge the two chunks together if their combined size is less than the maximum chunksize.

4.3.4 Chunk Mapping

The procedure listed in Algorithm 2 results in a set of chunks which have logical identifiers but have not been assigned to a specific process. The element mapping algorithm can improve performance by placing related nodes together, reducing σ from Eq. 4.2, and also by increasing the opportunity to use relative global pointers.

Algorithm 2 Profile-driven Element Mapping

for edge (n1, n2) ∈ edgerefs' do
    if ¬ptrs[n1].assigned ∧ ¬ptrs[n2].assigned then
        // neither vertex has been assigned
        c = new-chunk()
        np1 = add-node(c, n1)
        np2 = add-node(c, n2)
        ptrs[n1].new_ptr = np1
        ptrs[n2].new_ptr = np2
    else if ¬ptrs[n1].assigned then
        // only n2 has been assigned
        if size(chunk-of(n2)) < chunksize then
            np1 = add-node(chunk-of(n2), n1)
        else
            np1 = add-node(new-chunk(), n1)
        end if
        ptrs[n1].new_ptr = np1
    else if ¬ptrs[n2].assigned then
        // only n1 has been assigned
        if size(chunk-of(n1)) < chunksize then
            np2 = add-node(chunk-of(n1), n2)
        else
            np2 = add-node(new-chunk(), n2)
        end if
        ptrs[n2].new_ptr = np2
    else if size(chunk-of(n1)) + size(chunk-of(n2)) < chunksize then
        merge-chunks(chunk-of(n1), chunk-of(n2))
    end if
end for

Ideally, each chunk will be assigned to an owner which minimizes ρ′, the number of non-local data accesses requiring communication. In practice, imperfect processor mappings can be tolerated as long as the benefits from reducing unnecessary data communication, using more relative pointers, and dynamic load balancing outweigh inaccuracies in the chunk-to-processor mapping.

Round-Robin Chunk Distribution

The simplest means for mapping chunks to processors is simply to assign them in a round-robin manner. Because there is usually some affinity between one or more processes and a given chunk, this technique is not able to exploit any of that data locality, and is less effective than the other methods listed below. Round-robin distribution is very fast and results in a random data distribution which may be useful in some highly dynamic or load-balanced computations.

Edge-Weight Chunk Distribution

The edge-weight distribution uses the inter-chunk edges between nodes to guide which chunks should be placed together to create a balanced data distribution. Once all nodes have been placed within chunks, we construct a set of inter-chunk edges. For a given chunk c, each node within c may have a set of edges which reach into other chunks. The union of all such sets is then weighted by the edge reference counts given in edgerefs.

The set of global inter-chunk edges is then sorted in descending order by the reference count between each pair of chunks. The pair of chunks with the highest inter-chunk reference count is placed together, and this process is repeated until all chunks have been placed. This process is similar to the element mapping algorithm and results in a balanced data distribution.

A drawback of this algorithm is that no information about chunk ownership is used to assign a group of related chunks to a specific processor. In particular, this algorithm is a good candidate for use with a locality-aware dynamic load balancing system, but may suffer with static task mapping.

Owner-Weight Chunk Distribution

The owner-weight distribution considers the processor making each data reference during the profile-gathering phase and uses this information to assign chunks to specific processors. The algorithm which performs the chunk mapping relies on the existence of a data structure, chunkowners, which maps a chunk identifier, c, to an integer vector indexed 1...P. Each vector element p contains a count of the number of references made to nodes in chunk c by processor p.

This distribution can be implemented to provide a uniform data distribution, which may require limiting the number of chunks assigned to each processor. Often, better performance is observed with a non-uniform data distribution due to reduced communication. If this is acceptable, the partitioner iterates over each chunk and maps the chunk to the processor corresponding to the maximum value in chunkowners, ignoring any process-chunk limits, as sketched below.
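The non-uniform variant reduces to a simple argmax over the profiled reference counts, as in this sketch (the function and array names are assumptions for illustration):

#include <stdint.h>

/* chunkowners[c][p]: profiled reference count of chunk c by process p;
 * owner[c] receives the processor chosen for chunk c. */
void map_owner_weight(int nchunks, int nprocs,
                      uint64_t **chunkowners, int *owner) {
    for (int c = 0; c < nchunks; c++) {
        int best = 0;
        for (int p = 1; p < nprocs; p++)
            if (chunkowners[c][p] > chunkowners[c][best])
                best = p;
        owner[c] = best;   /* assign chunk c to its heaviest referencer */
    }
}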

92 Previous Owner Chunk Distribution

The previous owner distribution assigns a chunk to a processor according to the owner of the mapped data during the profile run. Similar to the chunkowners structure, each chunk maintains a vector of counters, one for each process. As a node is assigned to a chunk, the counter corresponding to the process which owned the node is incremented. When assigning chunks to processors, the maximum counter value determines where the chunk should be placed.

This approach is useful because it can be highly effective in reducing remote node allocations when performing iterative profile-driven data layout. If the application has a good inherent data distribution and subsequent iterations create the tree in a similar manner, using previous owner information will minimize the number of remote allocations performed. This distribution will have a minimal impact on cache misses and will be most useful in improving relative pointer usage.

4.3.5 Profile-driven Allocation in Iterative Applications

Once the element mapping process is completed, we are left with a ptrs structure which maps a global node identifier to a new global pointer specifying the location that the node should be moved to. Many scientific applications which use tree structures have an iterative nature, where the tree is used to model a physical system over time. In the case of n-body simulation, for example, a tree is constructed every timestep representing the physical space, with leaf nodes corresponding to particles (planets, stars, electrons, etc.). Since the simulation timestep is usually small, the tree structure remains similar from step to step.

We can use the profile gathered from one timestep to build a mapping which specifies the locations to allocate specific nodes from during the next timestep. In this case, the mapping is only used to guide future allocations according to the data layout, so the ptrs structure does not need to maintain the old_ptr information containing the previous location in the mapping.

This information is used with a special custom allocator in GT which gathers the profile using the local-open allocation strategy. Once a sufficient amount of profile data has been collected, the layout engine develops the new element mapping which is loaded into a special profile-driven allocator. When new nodes are allocated in the tree, the allocator determines the global path identifier of the current allocation and returns the location specified by the mapping to the application program.
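Conceptually, the profile-driven allocator is then a lookup into the element mapping with a fallback, as in the sketch below. The map type, lookup helper, and fallback name are assumptions for illustration; ptrent_t is from the earlier sketch, and pathid_t stands for a path-identifier type (Section 4.4.1).

/* Allocate the node identified by 'path' at the location chosen by the
 * layout engine; fall back to local-open when the current tree departs
 * from the profiled one. */
gt_nodeptr_t profile_alloc(gt_group_t grp, pathid_t path) {
    ptrent_t *e = layout_map_lookup(layout_map, path);
    if (e != NULL)
        return e->new_ptr;          /* mapped location from the profile */
    return local_open_alloc(grp);   /* unseen node: default strategy */
}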

4.3.6 Collective Data Relayout

To handle applications which do not follow this iterative structure, Global Trees provides a collective data layout operation which moves the data in a global tree in-place. As with the iterative approach, the program operates in a profile gathering phase for a period of time, with allocations being made using the local-open allocator.

The profiling phase is specified explicitly by the application programmer. When a sufficient data reference profile has been generated, the layout engine develops a new element mapping, resulting in a ptrs structure which specifies both old and new locations.

The collective data layout step is effectively a tree copy, with data being copied from the old locations to the new ones in the element mapping. To avoid synchronization and dependency problems, this step currently requires enough free memory in the partitioned global address space to make a full copy of the tree. Since the layout engine knows the full size of the global data structure, a sufficient number of free chunks is reserved by the mapping engine and used as targets to copy into during the data movement phase.

The actual data movement is done by iterating through each chunk using the chunk global view. Each node is updated by rewriting all global pointers contained within it to reflect the new addresses of its parent and children. Lastly, the node data is copied to its new location, using a buffered write if the new location resides on another processor. The sketch below illustrates this copy step.
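This is a minimal sketch of the copy step under the assumptions above; the ptrmap_t type and every helper (first_local_chunk(), rewrite_links(), new_location(), buffered_write(), and so on) are illustrative names, not the actual GT/GCL implementation.

#include <string.h>

void relayout_copy(gt_group_t grp, ptrmap_t *ptrs) {
    /* walk local chunks via the chunk (global) view */
    for (chunk_t *c = first_local_chunk(grp); c != NULL;
         c = next_local_chunk(grp, c)) {
        for (int n = 0; n < chunk_num_nodes(c); n++) {
            gt_node_t *node = chunk_node(c, n);
            rewrite_links(node, ptrs);  /* parent/child pointers -> new_ptr values */
            gt_nodeptr_t dst = new_location(ptrs, node);
            if (gt_ptr_is_local(grp, dst))
                memcpy(gt_local_addr(grp, dst), node, gt_node_size(grp));
            else
                buffered_write(grp, dst, node, gt_node_size(grp)); /* flushed in bulk */
        }
    }
}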

Once this step has completed, the old chunks may be returned to the GT/GCL freelist for re-use. Any global pointers maintained by the application prior to the relayout step, whether in stack, heap, or global variables, must be invalidated and regenerated by the application.

4.4 Data Relayout Implementation

4.4.1 Path Identifiers

One of the key premises of the element-mapping algorithm is the existence of globally unique node identifiers which represent the same location within a tree structure. These identifiers are required to support profile-driven allocation for iterative applications. In Barnes-Hut, for example, the tree is created anew for each timestep, and it may be subtly different from the tree used in the previous iteration.

To satisfy the need to identify nodes from iteration to iteration, the path from the root of the tree can be used to uniquely identify a node's location within the tree. A path identifier consists of the indices of each child pointer that would need to be followed from the root to the target node. The path identifier is stored in the GT-managed link structure of the user-defined tree node data structure, as a string of bytes, one byte for each level of the tree. To save space and reduce communication transfer times, this can be compressed to the size specified in Eq. 4.3, but at the cost of more computation when manipulating the path identifier during profile gathering and node allocation.

The element mapping algorithm works on edges between nodes, but the path identifier is stored within a node itself. The convention is that the data reference count corresponding to a path identifier p, from node n, represents the edge between n and the parent of n. Having the path identifier of the lower vertex of the edge allows the other vertex (the parent) to be determined quickly by simply truncating the path identifier.
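Under this convention, parent and child identifiers follow directly from truncation and extension of the byte string, as in the sketch below. The pathid_t layout and MAX_DEPTH bound are assumptions for illustration, not the actual GT representation.

#include <stdint.h>

#define MAX_DEPTH 32            /* illustrative bound on tree depth */

typedef struct {
    uint8_t depth;              /* number of levels below the root */
    uint8_t path[MAX_DEPTH];    /* path[i]: child index taken at level i */
} pathid_t;

/* The parent's identifier is the node's identifier truncated by one level
 * (assumes p is not the root). */
static pathid_t pathid_parent(pathid_t p) {
    p.depth--;
    return p;
}

/* A child's identifier appends the child index to the parent's path
 * (assumes p.depth < MAX_DEPTH). */
static pathid_t pathid_child(pathid_t p, uint8_t index) {
    p.path[p.depth++] = index;
    return p;
}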

Path identifiers are set on node creation, if possible. Many tree-based applications construct trees in a top-down fashion, allowing the path identifier of a newly allocated child to be determined by the identifier of the parent and the child index of the new allocation. GT supports this mechanism by providing a child allocation function (gt_alloc_child()) which is used by the profile-driven custom allocator to set the path identifiers of new allocations automatically.

Because the gt_alloc_child() routine relies on the path identifier of the parent already existing, care must be taken to initialize the path identifier of the root node before any child allocations take place. Since root node allocation and pointer distribution happen through the gt_tree_create() routine, this is handled automatically by the custom profile-driven node allocator.

In the case of a program which must construct the tree in a bottom-up manner, or in some other fashion, path identifiers can be set with a collective call after the tree has been created. This does require modification to the application code, but only the addition of a single collective call after the tree has been constructed.

Some unstructured tree algorithms, such as certain data-mining techniques, may not be able to use path identifiers to represent unique nodes within a global tree. In this case, the application must provide some mechanism to maintain edge reference count information or a unique numbering of tree nodes between iterations.

4.4.2 Chunk Reservation

Allocating data from the element mapping is done by extending the existing mechanisms to support the notion of reserved chunks. This is important both for iterative profile-driven allocation and for collective data relayout.

In the iterative mode, the tree structure may differ slightly from the profiled iteration, requiring some nodes in the current iteration to be allocated according to an alternate allocation method (typically local-open). The alternate allocator must not allocate from locations that may be used to service future allocations from the profile map. To prevent conflicts, the profile-driven allocator marks the chunks specified by the layout engine as reserved. Reserved chunks are unavailable for servicing allocations by other node allocation mechanisms. With the collective data relayout operation, a number of chunks are reserved so that the entire global tree structure can be migrated in one operation, without having to deal with conflicting allocations.

To support chunk reservation, the runtime system had to be modified to skip over reserved chunks when finding a new chunk to allocate from. Chunks are marked as reserved by the profile-driven allocator when changing state from the profile-gathering phase to the optimized allocation phase. The layout engine provides the ranges of chunk identifiers needed for the layout mapping to the allocator, which in turn marks the new chunks as reserved.

When nodes are allocated from reserved chunks, some overhead normally associated with allocation can be eliminated. Normally, chunk metadata is updated with the new allocation information, and additional checks must be made to ensure that it is safe to allocate from a specified chunk. Under chunk reservation, the allocation is presumed safe. When releasing all of the chunks that correspond to a global tree, all reserved chunks are moved to the global chunk freelist, ready to be used in a subsequent iteration.

The element mapping algorithm may result in some reserved chunks which are only partially full. This may occur if chunks are unable to be merged together, or if the last mapped nodes do not fill an entire chunk. Normally, the GCL does not check the master directory for the actual size of a chunk, to avoid an additional communication; each chunk is fetched at the full possible size in case local chunk directory information is out of date. With chunk reservation, the number of nodes reserved for a chunk is known in advance, ensuring that the cached chunk directory is always up-to-date. If a reserved chunk holds only a fraction of the possible number of elements, then a smaller communication operation can be issued, reducing the transfer time of the chunk.

4.4.3 Supporting Remote Allocation

Given the layout engine and element map, it is now possible for a process to allocate a node which is best placed in the shared memory segment of another process. To get good performance during the creation phase of a tree application, it is critical to avoid the fine-grained communication associated with a remote node allocation.

Efficient Node Allocation

One of the first problems encountered when implementing support for remote node allocation is the get/compute/put semantics used by GT, GCL, and ARMCI. The problem is demonstrated by Listing 4.1, which shows a typical node allocation.

1  alloc_child(gt_nodeptr_t parent, mynode_t *parentnode, int index) {
2      gt_nodeptr_t child;
3      mynode_t    *cnode;
4
5      ...
6      gt_alloc_child(nodegrp, parent, parentnode, &child, index);
7      cnode = gt_get_node(nodegrp, child);
8      ... initialize cnode data ...
9      gt_put_node(nodegrp, child, cnode);
10 }

Listing 4.1: Example: Conventional Child Allocation

When gt_alloc_child() returns a global pointer, child, the subsequent call to gt_get_node() on line 7 may cause a chunk communication if the chunk containing child is not in the local chunk cache. Worse, this routine does not even need to fetch the node, because it is simply initializing the node structure and then overwriting it on the remote end. This can cause a dramatic increase in allocation time, which must be paid for by the performance gained through the layout optimization.

To avoid the unnecessary communication, the semantics for get/put are relaxed for the special case of initialization. This is done by creating API support for two different operations. The first replaces the get operation with an allocate+initialize operation. The second replaces the put operation with a weaker write operation.

The allocate+initialize operation is handled by modifying the semantics of gt_alloc_child(). We add an additional argument, childnode, which is a buffer that may be stack or heap allocated and is used as a local node for initialization purposes:

void gt_alloc_child(gt_nodegrp_t grp, gt_nodeptr_t parent,
                    gt_node_t *parentnode, gt_nodeptr_t *child,
                    int index, gt_node_t *childnode);

By permitting the use of locally allocated storage for initialization, there is no need to call gt_get_node() and incur the cost of unneeded chunk transfers.

The put node operation also needs to be updated, because we have eliminated the call to gt_get_node(). This matters because gt_put_node() updates metadata under the assumption that there has been a preceding get (for example, reference counting of chunk cache entries). To handle this, a new operation is introduced:

void gt_init_node(gt_nodegrp_t grp, gt_nodeptr_t ptr, gt_node_t *node);

100 This operation doesn’t assume that there has been a previous fetch. The local cache is updated only if a copy of the chunk is present in the cache, otherwise the cache is left alone. All calls to gt init node() which are to local addresses are written directly, otherwise the node data is copied into the write buffer and will be sent to the remote node when the buffer is flushed.

The child allocation example shown above would be rewritten as:

alloc_child(gt_nodeptr_t parent, mynode_t *parentnode, int index) {
    gt_nodeptr_t child;
    mynode_t     cnode;

    ...
    gt_alloc_child(nodegrp, parent, parentnode, &child, index, &cnode);
    ... initialize cnode data ...
    gt_init_node(nodegrp, child, &cnode);
}

In this example, the call to gt_get_node() has been replaced with the updated gt_alloc_child() call, which also initializes the GT portion of the stack-allocated buffer passed in by the application. Most importantly, the only call which may introduce communication is gt_init_node(), and it communicates only when the write buffer to the destination processor is full.

Maintaining Location Consistency

Adapting the API to support these routines also requires some changes to the runtime behavior to preserve location consistency. Consider a new node allocation using the improved alloc/init method, as shown in Listing 4.2. In this example, presume that the element map specifies that a child of the root node, initialized inside of tcreate(), is to be allocated in remote memory.

1  gt_nodeptr_t c1;
2  mynode_t    *cn1;
3
4  tcreate(gt_nodeptr_t parent, mynode_t *parentnode, int index) {
5      gt_nodeptr_t child;
6      mynode_t     cnode;
7
8      ...
9      gt_alloc_child(nodegrp, parent, parentnode, &child, index, &cnode);
10     ... initialize cnode data ...
11     gt_init_node(nodegrp, child, &cnode);
12 }
13
14
15 tcreate(root, rootnode, 1);
16 ...
17 c1 = gt_get_child(nodegrp, rootnode, 1);
18 cn1 = gt_get_node(nodegrp, c1);

Listing 4.2: Example: Improved Child Allocation

When gt_init_node() is called, the initialized tree node is enqueued in the write buffer corresponding to the owner process. A potential problem arises when the call to gt_get_node() on line 18 is issued: the runtime will check the local chunk cache and not find the chunk containing the child node. If the call to gt_get_node() is issued before the write buffer has been flushed, the runtime will fetch a copy of the node data from the owning process while the locally initialized version sits in the write buffer. This is a race condition which violates location consistency and requires special handling to correct.

The rules for handling cache misses must therefore be modified to include a search through the write buffer for the owning process. When a cache miss occurs, the chunk is fetched using a one-sided memory operation. Then, the write buffer is examined and any pending writes to the destination chunk are replayed on the local copy of the chunk. By replaying pending updates, the view of those global memory locations is kept locally consistent, preserving the consistency guarantees of the GT/GCL system.

4.4.4 Collective and Iterative Data Relayout

The data relayout process is largely handled by a special custom allocator in Global Trees. The profile gathering step uses an array-backed structure to store the reference counts corresponding to each unique path identifier. Allocations made during the gathering phase are done according to the local-open strategy.

The actual data layout engine is written in Python, for flexibility and ease of prototyping. As a result, the existing implementation can be inefficient on larger profiles and datasets. Each process communicates the edges and reference counts through the filesystem using a memory-mapped file, which is then read into the layout engine. This can be made much more efficient using native C support and optimized parallel communication routines. Simulation of a C version of the layout algorithm shows good performance on medium and large-scale profile data.

At the largest scale, it may be necessary to parallelize the layout algorithm, which could be done using parallel sort algorithms and straightforward divide-and-conquer techniques [43]. A drawback of this approach to improving data locality is that the element mapping itself may become very large when profiling extremely large data structures.

4.5 Empirical Evaluation

The experimental evaluation of the profile-driven layout technique described above comprises three experimental setups. The first experiment uses a benchmark which allocates nodes from different locations in the partitioned global address space and illustrates the performance characteristics of different node allocation operations. The second experiment focuses on a benchmark which does breadth-first tree construction but has a depth-first dominant access pattern. Lastly, this work is applied to a GT-adapted version of the Stanford SPLASH-2 benchmark suite [88] and compared with the existing local-open based solution described in [61].

These experiments were conducted on the Glenn cluster at the Ohio Supercomputing Center. This cluster consists of 2.6 GHz quad-core and 2.5 GHz eight-core AMD Opteron processors, each running 64-bit Linux and configured with 8 GB and 24 GB of RAM, respectively. The cluster is connected via 10 Gbps and 20 Gbps InfiniBand interconnects.

4.5.1 Remote Node Allocation Overhead

This benchmark uses two different approaches for allocating global tree nodes. The first is the traditional alloc/get/put method; the second uses the alloc/init method described in Section 4.4.3. The benchmark allocates a large number of linked nodes from three shared memory segments: one that is local, one owned by a process co-located on the same cluster node (intra-node), and one owned by a process located on another cluster node (inter-node).

As shown in Figure 4.1, this benchmark was run with several node sizes to highlight amortized transfer speed versus startup costs. For the most expensive remote node allocations (inter-node), the modified allocation method reduces the overhead by roughly half. The improvement is substantial for intra-node allocations as well, at just under a factor of two. Even for local allocations, using the local buffer and modified allocation routines results in a speedup. This indicates that the technique should be used for all node allocations whenever possible, regardless of the potential for remote allocation.

[Plot: node initialization time (µsec) versus node size (176, 256, 512, and 1024 bytes), comparing the get/put and init methods at inter-node, intra-node, and local distances.]

Figure 4.1: Allocation Costs by Allocation Method and Distance

Breadth-first Creation / Depth-first Access

The benchmark used for these experiments simulates an application which creates a tree incrementally, in a breadth-first style. As nodes are added, the top of the tree fills out first, expanding level by level. The dominant access pattern is depth-first traversal, repeated many times in a phase distinct from tree creation. While this benchmark is designed to exhibit an extreme difference between creation and access patterns, large differences between the two are not uncommon in parallel tree-based algorithms such as R-trees or certain operators in MADNESS [46, 48]. For the profile run, the first traversal executes in profile-gathering mode, the program then performs the collective layout operation, and the remaining traversals run with the new data layout.
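The phase structure of such a profile run can be sketched as follows. The routine names for delimiting the profiling phase and triggering the collective relayout are placeholders, assuming explicit API calls as described in Section 5.4.2 rather than the exact GT interface:

    /* Stand-in types and routines; names are illustrative, not the GT API. */
    typedef void *gt_group_t;
    typedef void *gt_ptr_t;
    void build_tree_breadth_first(gt_group_t g, gt_ptr_t root);
    void depth_first_traversal(gt_group_t g, gt_ptr_t root);
    void gt_profile_begin(gt_group_t g);
    void gt_profile_end(gt_group_t g);
    void gt_relayout(gt_group_t g);

    void profile_run(gt_group_t nodegrp, gt_ptr_t root, int num_traversals) {
        build_tree_breadth_first(nodegrp, root);  /* breadth-first creation */

        gt_profile_begin(nodegrp);                /* first traversal runs   */
        depth_first_traversal(nodegrp, root);     /* in profiling mode      */
        gt_profile_end(nodegrp);

        gt_relayout(nodegrp);                     /* collective layout step */

        for (int i = 1; i < num_traversals; i++)  /* remaining traversals   */
            depth_first_traversal(nodegrp, root); /* use the new layout     */
    }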

These experiments are based on a tree with approximately 1.4 million nodes and a local chunk cache size of about 100 MB. While the actual layout was done by the Python-based layout engine, simulation of the layout algorithm on a dataset of this size takes less than a second. The dominant overhead is the sorting algorithm, which takes about 500 ms on the system under test.

[Plot: speedup relative to local-open (0–5) versus number of processors (4–32), for the local-open and profile-driven allocators.]

Figure 4.2: Depth-First Access Relative Performance

Figure 4.2 shows that the performance of the profile-driven approach is substantially better than local-open, due to the poor data locality realized by the default allocation strategy. This application is dominated by link access time, and the profile-driven layout remedies the locality problems by rearranging the nodes into a depth-first ordering within the chunks, and by locating those chunks on the processors which access them the most.

[Plot: percentage of reads using relative pointers versus chunk size (64–2048 nodes), for the local-open and profile-driven allocators.]

Figure 4.3: Depth-First Access Relative Pointer Usage

The fraction of reads performed using relative pointers is a major contributor to the performance improvement of the profile-driven implementation. High relative pointer usage is a key indicator that related nodes are being allocated from the same chunk. As shown in Figure 4.3, the profile-driven method is able to adapt the breadth-first chunking into a depth-first chunking. With the local-open allocator, as the tree gets deeper, each chunk fills entirely with nodes from the current level, giving the default allocator very poor spatial locality.

Since there is no difference in the total number of reads, we can compare the cache misses between the runs directly. Figure 4.4 shows a contributing factor for the initial performance drop of the profile-driven benchmark on a small number of processors: with only a few processors, the profile-driven layout incurs a higher number of cache misses than the default method. As more processors are added, the profile-driven run stabilizes and the number of cache misses increases, but at a rate much slower than the local-open run. These results show that the layout mapping has been successful in placing tree data on the correct processor, especially at higher process counts.

[Plot: cache misses (×1000) versus number of processors (up to 32), for the local-open and profile-driven allocators.]

Figure 4.4: Depth-First Access Cache Misses

Barnes-Hut

The Barnes-Hut algorithm solves the n-body problem, computing the force interactions among n bodies in a k-dimensional space. For the SPLASH-2 benchmark, this is a 3-dimensional space which is decomposed and represented using a shared octree structure. Initially, bodies in the system have position and mass determined by a known model, and the bodies are distributed among the processors. Each process owns several regions of space and is responsible for inserting bodies into the subtrees that correspond to these regions.

When computing the interaction, the particles are partitioned by a Morton-ordered tree traversal, which places spatially close bodies on the same processor for computation. Each process iterates over these bodies and performs a partial traversal of the octree, evaluating the force contributions from every other body in the system. The experiments below were run on a Barnes-Hut simulation with 512k bodies, simulated for 10 timesteps.
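For reference, a Morton (Z-order) key interleaves the bits of the quantized spatial coordinates so that numerically close keys tend to correspond to spatially close bodies. The sketch below is a generic 3-D Morton encoding, not code taken from the GT or SPLASH-2 sources:

    #include <stdint.h>

    /* Interleave the low 10 bits of quantized x, y, z coordinates into
     * a 30-bit Morton key; sorting bodies by this key groups spatially
     * nearby bodies together for partitioning. */
    static uint32_t morton3d(uint32_t x, uint32_t y, uint32_t z) {
        uint32_t key = 0;
        for (int i = 0; i < 10; i++) {
            key |= ((x >> i) & 1u) << (3 * i);
            key |= ((y >> i) & 1u) << (3 * i + 1);
            key |= ((z >> i) & 1u) << (3 * i + 2);
        }
        return key;
    }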

Barnes-Hut is well suited to the iterative layout approach because it computes the gravitational simulation over many timesteps. Since the purpose of the simulation is to understand the behavior of the particles in the system, rapid changes to the bodies in space are undesirable; this implies that the tree does not change drastically between timesteps. As shown in Figure 4.5, the tree retains over 80% of its structure over several timesteps. In our experience, tree similarity does decline with each subsequent timestep, but the decline is gradual, with 75% similarity remaining at 20 timesteps.

In this case, the tree represents a spatial domain, so tree similarity can be computed easily. Two trees are similar when they have the same nodes in the same locations. Tree similarity is computed by counting the nodes that are the same in the two trees, divided by the total number of nodes in both trees. Alternate techniques exist for computing the similarity of trees which do not have a fixed spatial representation [85].
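Read literally, with each shared node counted once in each tree, this measure corresponds to a Dice coefficient over the trees' node sets. The formalization below is our reading of the prose description, not a formula stated in the text:

    \mathrm{sim}(T_1, T_2) = \frac{2\,|N(T_1) \cap N(T_2)|}{|N(T_1)| + |N(T_2)|}

where $N(T)$ denotes the set of nodes of $T$, identified by spatial location.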

Figure 4.6 shows the relative improvement in performance when running the Barnes-Hut benchmark with both the local-open and profile-driven allocation methods.

[Plot: tree similarity (%) versus timestep (0–6), showing the per-process distribution and the average.]

Figure 4.5: Similarity in Tree Structure over Simulation Iterations

The profile-driven allocator shows a 10–17% improvement in overall run-time performance. These results do not account for the computation of the new data layout; as discussed above, an efficiently implemented layout engine should not significantly reduce the performance improvement. Simulating a greater number of timesteps with a given profile mapping also amortizes the layout cost across more iterations.

Analysis of the program performance data shows that the number of cache misses in the local-open and profile-driven runs is roughly the same. This indicates that the distribution of data to processes performed within Barnes-Hut between tree creation and force calculation is already quite good.

[Plot: speedup relative to local-open (0–1.6) versus number of processors (4–32), for the local-open and profile-driven allocators.]

Figure 4.6: Barnes-Hut Relative Performance

Since the allocation pattern does differ from the access pattern, the performance improvement is realized through more efficient use of relative global pointer accesses, as demonstrated in Figure 4.7. As the processor count increases, relative pointer usage improves in both cases. Since the total tree data is fixed, each processor requires a smaller portion of the overall tree as the number of processors increases. Because these portions are smaller, there are fewer chunks (and inter-chunk links) to traverse, and the effective impact of relative pointers increases.

[Plot: percentage of reads using relative pointers versus number of processors (up to 32), for the local-open and profile-driven allocators.]

Figure 4.7: Barnes-Hut Relative Pointer Usage

CHAPTER 5

Understanding Global Trees Program Performance and Behavior

5.1 Overview

One of the major advantages of using the Global Chunks and Global Trees systems is the ease of programming with globally shared data structures on distributed memory machines. While they remove much of the burden of explicit data management, these systems may also operate less efficiently than task-specific, hand-tuned monolithic applications. To achieve high performance, it is important to understand the data locality and communication properties of each application implemented with the GT and GCL frameworks. Both systems keep a variety of information useful in understanding GT and GCL program performance.

This chapter details the information that is available to application programmers and how it may be used to analyze program performance. The first section discusses the data collected by the runtime, followed by a brief overview of analyzing common performance issues and choosing runtime parameters. Lastly, tools for visualizing tree data are presented, along with examples visually correlating the global tree structure with collected profile and performance data.

5.2 Collected Runtime Performance Data

Global Trees and Global Chunks can be compiled to gather a large amount of information about the runtime characteristics of program performance. Many of the statistics described below can only be counted with additional processing, sometimes along the critical path for global data access. To avoid unnecessary performance loss when this information is not needed, GT/GCL uses conditional compilation to include support for performance counting. To enable performance counting, the runtime libraries are re-compiled with the GCL_PERF_COUNTING symbol defined.
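In outline, the counters compile away entirely when the symbol is undefined. The macro and counter names below are illustrative, not the exact GCL definitions:

    #include <stdint.h>

    #ifdef GCL_PERF_COUNTING
    #  define GCL_COUNT(ctr) ((ctr)++)
    #else
    #  define GCL_COUNT(ctr) ((void)0)  /* no overhead when counting is off */
    #endif

    static uint64_t gcl_total_reads;

    void *gcl_read_node(void *ptr) {
        GCL_COUNT(gcl_total_reads);     /* counted only when enabled */
        /* ... perform the actual read ... */
        return ptr;
    }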

5.2.1 GCL/GT Memory Usage

When a program which uses GCL or GT calls gcl_init() or gt_init(), a message like the following is printed on process 0:

    GCL/GT footprint: 512.0 MB shared heap
                      992.3 MB local cache (8192 lines)
    chunk allocator:  elemsize 488  chunksize 256  (threads=4)
    context size:     273.7 MB
    allocator size:   65.8 MB
    directory:        1.1 MB

This display gives a general overview of the memory usage of a Global Trees program. The first line describes the size of the local portion of the shared memory segment, in this case 512 megabytes. The second line gives the local chunk cache size, in both bytes and number of lines. Since the caching system operates like a fully-associative cache with keys generated by a hash function, it will contain the largest power-of-2 number of lines permitted by a run-time or compile-time limit set by the application, in this case 1 GB.

Next, the system displays information specific to a given instance of a globally shared data structure. In this case there is only a single global tree structure, so most of the remaining information pertains to it. This information includes the size of a single element, in this case 488 bytes, including both node link structure and node content. The chunking factor for this run is set to 256 nodes per chunk, and the program is executing on 4 cores.

The system also prints the size of the context structure, a global data structure responsible for managing the shared heap, chunk directory cache, and other information, and of the allocator structure itself, which maintains the per-process write buffers, the chunk allocation and free lists, counters, and the current allocation placement strategy. Lastly, the system displays the size of the portion of the distributed chunk directory located in the shared memory segment on each process.

5.2.2 Data Access Statistics

Global Trees also provides two routines, gt_alloc_print_stats() and a collective call, gt_alloc_print_global(), which display per-process and global data access statistics, respectively. The performance counters may be cleared at any time by calling gt_alloc_clear_counters(). When the print functions are called, the system produces output similar to the following:

    GT Global Data Access Statistics:
      total reads: 18920628   total writes: 3065637
      relreads: 6363831 (33.634%)
      direct reads: 1 local 0 remote - chunks allocated: 4489
      cache reads:  hits: 9775286  misses: 2982
                    local: 9142359  remote: 9778268
                    lrel: 2423790  rrel: 3940041
      cache writes: 20027  local writes: 3045610  flushes: 3047
      cache stats:  evictions: 0  collisions: 7140 (1.00)
      total allocations: 1148494  chunks allocated: 4489
      chunking efficiency: 0.999
      alloc: 1  link: 0  parent: 0  child: 1148493

This printout contains a dense amount of information and can be broken down into different groups:

Data Access:

The first line gives a breakdown of the total number of reads and writes since the counters were initialized or cleared. Of particular interest is the number of reads performed using relative pointers, so the relreads counter prints this value along with its percentage of total reads. The next line displays how many node accesses were made in strict mode, rather than under a relaxed consistency mode with caching and buffering. Since this application uses relaxed mode throughout, the only count here is for the root node allocation, which occurred collectively in strict mode. Lastly, the overall number of chunks allocated during the program phase is displayed.

Caching System Statistics:

The next four lines describe the performance of the caching and buffering systems used in the relaxed access mode. The first of these shows the total number of local and remote reads made by the application. Remote reads are further categorized as either cache hits or misses. The system also tracks how many local and remote reads were performed with relative pointers (lrel and rrel, respectively).

The total number of reads can be expressed by the following relation:

$r_{\mathrm{total}} \equiv (r_{\mathrm{hit}} + r_{\mathrm{miss}} + r_{\mathrm{direct}} + r_{\mathrm{local}}) \equiv (r_{\mathrm{local}} + r_{\mathrm{remote}})$

The next line tracks the number of buffered writes to cached nodes, the number of local writes, and the number of times the write buffer was flushed. The last line counts the number of chunk replacements (evictions) that occurred during the run as a result of the cache being full, and the collisions counter tracks the efficiency of the hash table used to manage the cache.

Allocation Statistics:

The last component of the data access statistics is focused on the way data was allocated since the counters were last reset. Since Global Trees supports different allocation mechanisms, each allocation strategy provides a callback which displays performance data specific to that allocator. For example, the profile-driven allocator reports how many nodes were allocated from the element map and how many were allocated using the default strategy. This gives a simple measure for estimating tree similarity between iterations.

The chunking efficiency is the ratio of the data volume actually needed by the program to the total chunk capacity allocated for it. It is computed by the following formula:

$\mathrm{eff} = \frac{\#\ \mathrm{total\ allocations}}{\#\ \mathrm{chunks\ allocated} \times \mathrm{chunksize}}$
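As a check, plugging in the counts from the statistics printout above gives eff = 1148494 / (4489 × 256) ≈ 0.999, matching the reported chunking efficiency.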

Lastly, the printout gives a distribution of the four different node allocation routines used. In this example, the program made a single call to gt_alloc_node() to allocate the root node; the remainder of the tree structure was created with calls to gt_alloc_child().

5.2.3 Chunk Packing Efficiency

Global Trees also provides an enhanced profiling mode which tracks the specific nodes that have been referenced in every chunk transferred. This gives a direct measure of the σ parameter from Equation 4.2 and is useful in quantifying the amount of wasted communication. Understanding the utilization of chunk contents can aid in selecting the best chunk size parameter, as discussed in the next section. This statistic is captured using a large array structure and can significantly increase the memory footprint of an execution; as such, the feature is disabled by default, even when performance gathering is enabled.

5.3 Interpreting Statistics

5.3.1 Element-to-Chunk Mapping

The two most useful tools for judging the quality of the element-to-chunk mapping are relative pointer usage and the static chunk packing efficiency counts available with extended profiling. Each statistic tells a different part of the performance story, and understanding their relationship is important when assessing a mapping.

Since relative pointers are only legal when a pointer refers to another node within the same chunk, high relative pointer use indicates a good grouping of the most referenced edges in the data structure. Relative pointer usage helps quantify both temporal and spatial locality within a chunk; counters are incremented on every relative access. Relative pointer counts are not specific to any given chunk: they are aggregated over all data accesses and may be skewed by a few exceptionally good or bad chunks.

Static chunk packing efficiency captures the spatial locality of elements within individual chunks. Each chunk is tracked as a bitfield, with a single bit for each node; when a node is referenced, its bit is set. The collection of chunk access bits can be analyzed at any point during program execution, displaying the individual nodes referenced in each chunk, or summarized with aggregate statistics such as the minimum, maximum, and average number of nodes referenced per chunk. These statistics can be used to quantify wasted communication and can also aid in the selection of chunk size.
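A minimal sketch of this style of tracking is shown below; the array bound and names are illustrative rather than the GT implementation:

    #include <stdint.h>

    #define MAX_CHUNKS      4096               /* illustrative limit */
    #define NODES_PER_CHUNK 256
    #define WORDS_PER_CHUNK (NODES_PER_CHUNK / 64)

    /* One bitfield per chunk: bit i is set when node slot i is touched. */
    static uint64_t access_bits[MAX_CHUNKS][WORDS_PER_CHUNK];

    static inline void mark_access(int chunk, int node) {
        access_bits[chunk][node / 64] |= UINT64_C(1) << (node % 64);
    }

    /* Distinct nodes referenced in a chunk (popcount is a GCC builtin). */
    static int chunk_utilization(int chunk) {
        int used = 0;
        for (int w = 0; w < WORDS_PER_CHUNK; w++)
            used += __builtin_popcountll(access_bits[chunk][w]);
        return used;
    }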

Empirical Chunk Size Selection

The range of acceptable chunk sizes is determined by application locality and communication efficiency. An extremely efficient communication subsystem with low transfer startup costs could support very small chunk sizes. Likewise, an application with extremely high locality can take advantage of very large chunk sizes, maximizing interconnect bandwidth, while an application with a very irregular access pattern may perform best with a small chunk size, given its low inherent data locality.

[Plot: effective bandwidth (MB/s) versus chunk size (4–16384 nodes), for element sizes of 1, 10, 100, 1000, and 10000 bytes and for the 440-byte Barnes node.]

Figure 5.1: Effective Communication Bandwidth

Using a small benchmark program, it is possible to model the bandwidth characteristics of the computing environment that will execute the application. As shown in Figure 5.1, communication efficiency improves as chunk size increases: larger chunk communications incur less communication startup cost and are better able to take advantage of the available interconnect capacity.

[Plot: chunk utilization (% nodes used) versus processor count (2–32), for chunk sizes c = 32 through c = 4096.]

Figure 5.2: Chunk Packing Efficiency in Barnes-Hut

While communication efficiency encourages larger chunk sizes, chunk packing efficiency is often higher with smaller chunk sizes. Figure 5.2 shows the chunk packing efficiency over a range of chunk sizes and processor counts for the Barnes-Hut application. As both processor count and chunk size increase, chunk packing efficiency decreases.

A good chunk size can be selected by combining the static chunk packing efficiency statistics with performance characterization data from the execution environment. As shown in Figure 5.3, composing the bandwidth data with the packing efficiency yields a chunk size that balances the bandwidth gained against the packing efficiency lost as chunk size increases. When these analytical results are compared with the actual results from Chapter 3 shown in Figure 3.5, the chunk sizes with the highest performance are similar to those indicated in Figure 5.3.
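The composition itself is a pointwise product of measured bandwidth and measured utilization at each chunk size. A minimal sketch, assuming the two curves have been tabulated into arrays (the names here are illustrative):

    /* Effective utilized bandwidth: raw transfer bandwidth at a given
     * chunk size, scaled by the fraction of each chunk actually used.
     * Returns the chunk size with the highest effective bandwidth. */
    int best_chunksize(const int *sizes, const double *bw_mbps,
                       const double *util_frac, int n) {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (bw_mbps[i] * util_frac[i] > bw_mbps[best] * util_frac[best])
                best = i;
        return sizes[best];
    }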

[Plot: effective utilized bandwidth (MB/s, 100–1200) versus processor count (2–32), for chunk sizes c = 32 through c = 4096.]

Figure 5.3: Effective Bandwidth Utilization

5.3.2 Data Distribution

The efficacy of the data distribution can be assessed by examining both the number of remote accesses and the number of misses in the chunk cache. The number of remote accesses relative to the total is especially useful if the parallel tree computations work largely on independent data: if processes should be working on local portions of the tree structure, a large number of remote accesses indicates a poor data distribution.

More important than the total number of references to nodes residing on remote processes is the number of times that remote chunks must be fetched into the local chunk cache. Each cache miss causes a communication event which blocks progress until the miss is serviced. The global cache miss statistics are useful in understanding how much runtime is spent in blocking communication, while the per-process cache miss counts can help identify specific imbalances in the data distribution.

5.3.3 System Performance

One of the key factors to keep in mind when running a GT or GCL based application is memory usage. The basic memory footprint is given by the size of the shared heap, chunk cache, global context, and chunk directory; in addition, at least one chunk allocator/node group will be needed, possibly more. This footprint does not include any application memory requirements for stack or heap data. Care must be taken to set these parameters with respect to the execution environment: since much of the memory used is allocated at initialization time, setting memory limits too large may cause the application to page, reducing performance. The shared heap size and cache size can be set either at compile time or at runtime, while the chunk directory size is determined by compile-time limits on the maximum number of processors and chunks allocated per process.
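For example, summing the components reported in the footprint printout of Section 5.2.1 suggests roughly 512.0 + 992.3 + 273.7 + 65.8 + 1.1 ≈ 1845 MB reserved at initialization, before any application stack or heap usage.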

Typically, runs on smaller numbers of processors require larger shared heap segments for data but can be configured with a smaller chunk directory structure. Larger runs need less room for shared data but may need larger chunk directory structures and larger chunk caches. Selecting values too small for the shared heap or the chunk directory results in runtime exceptions with relevant error messages.

Selecting too small a value for the chunk cache results in poor performance, albeit correct execution.

The chunk cache may be the source of performance problems if it is sized too small. The number of cache replacements made is printed along with the other data access statistics. In general, the cache should be as large as possible while still leaving enough memory for the program to run; large numbers of evictions indicate a loss in performance due to cache size.

5.4 Data Visualization

Global Trees provides some basic tools for visualizing the global tree data structure. At any point during execution, a GT program can call a special routine, gt_write_tree(), which writes the tree connectivity structure to a file. This function takes a GT node group and a global pointer, which can point to the root node or to any other node at the top of a subtree of interest.
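A call site might look like the following one-line sketch; the exact signature is not shown in this chapter, so the argument list, and in particular the output filename parameter, is an assumption:

    /* Dump the connectivity of the subtree rooted at 'root' for offline
     * visualization (filename parameter assumed for illustration). */
    gt_write_tree(nodegrp, root, "octree.dot");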

Visualizing these data structures can be difficult, given their size and shape. Consider the octree formats used by Barnes-Hut and MADNESS: at a depth of four hops from the root node, the tree potentially has over four thousand nodes, and it grows by a factor of eight at each subsequent level. Traditional hierarchical views of the tree data quickly become unwieldy and difficult to understand.

To address this, the tree representation is stored in the popular GraphViz format, commonly used for representing trees and graphs [39]. This representation can be navigated dynamically with tools such as the Zoomable Visual Transformation Machine [75] and paired with GT-specific data for a better understanding of program behavior. GraphViz supports multiple layout algorithms for displaying the tree data, which can be especially useful with large datasets [38]. The octree shown in Figure 5.4 is a spatial octree from Barnes-Hut displayed in a circular arrangement. This layout, with the root node in the center, gives a quick, intuitive view of the irregularity in the data and can be zoomed and navigated for more structural detail.

While understanding tree structure is useful in and of itself, Global Trees also provides the ability to process runtime characteristics of the tree structure and overlay them as annotations on the visualized data structure.

5.4.1 Owner-Labeled Global Tree Visualization

Figure 5.5 shows a small tree with nodes colored to reflect ownership of the actual node data. To collect this information, a simple program analyzes the global pointer data captured with the tree structure and assigns a color to each unique owning process. This information may be useful in understanding the cause of excess communication from chunk cache misses when the data access pattern is well understood but the runtime data layout is not.

5.4.2 Profile-Labeled Global Tree Visualization

In addition to ownership information, GT also provides a means for coloring tree nodes based on a data reference profile captured in conjunction with the tree structure data. Data reference information is captured using specialized API routines that mark the beginning and end of a data reference profiling phase. The data reference trace records every data access, both reads and writes, performed during the profile phase. This information is used to determine the "hottest" nodes within the tree structure.

An example of this display is shown in Figure 5.6. The blue nodes are "cool" nodes, which do not receive a substantial number of accesses; as the number of references to a tree node increases, its color shifts from blue toward red. The red nodes are the "hottest" nodes, referenced most frequently. The profile coloring is done by sorting the nodes in the tree by access count: the most heavily accessed node is colored red, and all other nodes are colored by their access count relative to this maximum value.

It is important to note that the tree structure is written as a snapshot at a specific point during program execution, whereas the profile is gathered over a period of time. The tree structure may change during that period, in which case some data references may not correspond to nodes contained in the snapshot. The visualization software eliminates these from the display, but prints a message if it finds references to unseen tree nodes.

Figure 5.4: Large Octree Visualization

Figure 5.5: Owner-Labeled Global Tree

Figure 5.6: Profile-Labeled Global Tree

CHAPTER 6

Related Work

There is a great deal of literature regarding the implementation of parallel applications based on irregular data structures. The related work can be broken down into four primary areas: parallel programming models, particularly the partitioned global address space (PGAS) model; support for parallel linked data structures; techniques to model and improve data locality; and parallel program performance analysis and visualization.

6.1 Programming Models for Global-View Distributed Data Structures

The topic of supporting linked, or pointer-based, data structures on distributed memory systems has been addressed in the literature through, roughly speaking, three mechanisms: Software Distributed Shared Memory (DSM), Distributed Shared Object (DSO) systems, and Partitioned Global Address Space (PGAS) models. With the Global Chunks and Global Trees frameworks, we seek to combine strengths from all three: coarse granularity of data movement to exploit spatial locality and enhance communication efficiency; a global namespace for accessing shared data objects; and locality-aware, efficient one-sided remote access to shared data in the partitioned global address space.

Software DSM systems provide the illusion of shared memory in the presence of physically distributed data. These systems typically function by symmetrically mapping shared data into the address space of all processes and interacting with the virtual memory infrastructure to perform coherence operations and maintain data consistency. Examples of such systems include TreadMarks [57, 58, 56], Cashmere [83], Beehive [82], Shasta [78], Blizzard [79], AURC [49], Cid [70], CRL [53], and Midway [13, 14, 12].

There are several key differences between DSM and Global Chunks/Global Trees. Depending on the size of the coherence unit, typically a page, unrelated data elements may be grouped together in such systems, resulting in poor performance due to false sharing. In the GCL, the impact of false sharing is minimized through a tunable chunk size parameter, locality-conscious chunk packing, and a relaxed consistency model. Additionally, under the PGAS model, the address space grows linearly with the number of processors and is not constrained to the addressable space of a single processor. Finally, due to the nature of the shared memory interface provided by such systems, one is not able to leverage locality optimizations and locality-driven scheduling on them. Our system also provides flexible consistency mechanisms, allowing the user to exploit knowledge about data access patterns to further reduce communication overhead.

Distributed Shared Object (DSO) systems such as Linda [1, 22], Emerald [54], Charm++ [55], CHAOS++ [25], and Orca [7, 8] permit sharing of data objects through replication and object migration. These systems utilize a global object namespace to enable sharing of distributed objects, and they eliminate the need for implicit coherence through relaxed consistency semantics that may be managed by the user, compiler, or runtime system. However, a fundamental limitation of many of these systems is that they typically cannot handle large-scale data structures (exceeding the main memory of a single node). More importantly, a majority of them do not expose a data structure view wherein locality and load balance can be improved via application-specific adaptation. Lastly, while our runtime also uses a global namespace to reference dynamically created data objects, it differs fundamentally from these systems in its use of coarse-grained data movement to enhance locality and improve communication efficiency.

PGAS parallel languages, including Co-Array Fortran [71], UPC [86], and Titanium [91], all face the fundamental problem of providing efficient access to globally shared data structures. In addition to the cost of moving data between processors, there is the overhead of detecting whether each reference is local or remote. With Co-Array Fortran, the problem of compiling for efficient communication is somewhat simpler than for UPC or Titanium, because the programming model involves explicit partitioning of arrays by process, so that each global data reference explicitly specifies the process involved; this partitioned view, however, limits the generality of the Co-Array Fortran model. UPC and Titanium have a very general model of shared data, but determining the processor and location is required for each global dereference, so to achieve high efficiency a significant burden is placed on the compiler to transform global references into local references at compile time when possible. Thus, current work on these global address space languages faces several challenges: generalized access to shared data via pointers limits the ability to bundle communications; references must be checked at runtime to determine whether pointers are local or global (an issue for dynamic distributed data structures built from pointers as the only primitive); load balance relies on programmer-specified traversal algorithms or on randomization alone; and detailed performance modeling is absent from both compile-time and run-time analysis. The Global Chunks approach alleviates several of the challenges currently faced in compiling global address space programming models for efficient parallel execution.

Other PGAS one-sided communication systems, including ARMCI [67], GASNet [15], SHMEM [9], and MPI-2 [66], provide a partitioned global view of shared data. Under these models, a global pointer is the tuple ⟨p, a⟩, where p is a process ID and a is the address of a data element in that process' address space. These models provide efficient and locality-aware one-sided access to shared data in a global address space. The ideas that our system borrows from DSM and DSO, namely bulk data movement and a global namespace, can greatly enhance the performance and communication efficiency of distributed linked structures under these models.

6.2 Support for Parallel Linked Data Structures

Many existing approaches support shared, distributed linked data structures by mapping them into more regular structures, such as sparse arrays. Efficiently supporting block-sparse arrays on distributed memory computers shares many of the challenges of linked data structures and has been studied by previous members of our research group, especially with respect to data movement and load balance [60, 59].

Several systems have also been proposed to address the specific challenge of supporting linked data structures on distributed memory parallel computers. The Parallel Irregular Trees library [5, 6] supports distributed chunked trees and exploits the characteristics of spatial decomposition trees to achieve good performance. Olden [76] supports pointer-based linked data structures on distributed memory clusters by migrating the computation when an access is made to non-local data; Olden was later adapted to perform data caching similar to Global Chunks, but relied heavily on computation migration [20].

N-body simulation on large-scale parallel systems was implemented in the ChaNGa framework [52]. This system focuses exclusively on n-body simulation and achieves very high scalability by relying on application-specific coarse-grained data movement, similar to the general technique provided by Global Chunks.

6.3 Data Locality Improvement

There is a great deal of literature devoted to improving data locality and program performance. Much of the previous work has focused on data references from sequential programs, specific applications, or other runtime systems. Our contributions to improving locality within the Global Trees framework address the novel data movement issues that arise when operating under the PGAS programming model.

Application-specific techniques have been written about extensively. Custom allocation strategies have been shown to improve performance in shared memory data mining applications [72]. The inspector/executor model has been shown to be effective for irregular SPMD scientific applications, including common molecular dynamics codes [33]. Space-filling curves have also been applied to parallel irregular applications in order to improve memory performance through data and computation reorderings [64]. Some work has been done to automate data locality improvement at the system level, but it has mostly focused on adapting coherence mechanisms and cache management options for distributed shared memory systems [11].

Work by Chilimbi has shown that data reference profiles are often stable over multiple program executions for a variety of applications [28]. This key result motivates the generality of the locality optimizations described in Chapter 4.

Other work, such as tree orientation analysis [4] and the prediction of locality phases for memory optimization [80], could also be used to extend and enhance the described technique.

Much of the prior work has focused on modeling and characterizing data locality optimizations using data reference traces from sequential programs. Chilimbi has published extensively on the use of hot stream representations of program data locality [27]; the algorithm used to compute these patterns of frequently occurring data references is O(n) in the size of the data reference trace, whereas the technique outlined in Chapter 4 is dominated by an O(n log n) sorting step in the size of the data structure. Ding has used the concept of reuse distance to characterize program data locality [37] and has described (along with Chilimbi) a composable model for analyzing locality in multi-threaded programs [36]. In general, the problem of finding an optimal data placement and code arrangement that minimizes the number of cache misses has been shown to be NP-hard, and efficient approximate algorithms have been shown to produce sub-optimal mappings relative to an optimal solution [74].

These data locality models and characterizations have been used to create online systems which simulate layouts from profile data [77] and instrument programs dynamically to prefetch data based on the profile [29]. Since garbage collection often relocates memory, it has been the subject of work on locality-aware techniques that improve data access [30], as well as systems which re-lay out data during collection according to the dominant access pattern of a phase of computation [92].

Other approaches make locality improvements at allocation time, rather than through a layout process. Allocating related data together using enhanced pool allocation was first described by Lattner [62] and later extended to utilize hot stream information [31].

6.4 Performance Analysis and Improvement

There are a wide variety of approaches to understanding parallel program performance. While many of the existing tools below are useful in conjunction with the GT performance tools, a unique property of most GT applications is the irregularity of the data structure and the difficulty of understanding the tradeoffs between load imbalance and data distribution.

Capturing profile information for large-scale parallel applications can be challenging due to the volume of data. The scalable log format, SLOG, provides a drawable log format which uses a hierarchy of bounding boxes to allow efficient time-based access to large-scale profile data [24]. The GASP framework provides a set of callbacks, inserted by a compiler system, to track events of interest in PGAS-based applications, such as remote variable accesses, get and put operations, barriers, and fences [84]. Since GT/GCL supports these operations through explicit function calls, much of the same data can be gathered automatically by the GT runtime.

There are a number of tools and frameworks for visualizing program performance in PGAS programming environments. Jumpshot is a zoomable viewer that reads the SLOG format and enables browsing of program events (such as communication) over the course of a program run [89]. The Parallel Performance Wizard (PPW), built on top of the GASP framework, can be used to analyze programs written in UPC [63]; it can display information using Jumpshot, along with many UPC-specific calls and communication events. Global Arrays provides tools to visualize the access contention of data elements in GA programs [69]. Mohr et al. have also studied performance analysis of Co-Array Fortran (CAF) programs [65].

Other established and powerful tools exist for performance analysis under other programming models. VAMPIR is a robust and well-established toolkit for the performance analysis of MPI programs [16]. Likewise, the TAU framework can help developers perform extensive performance analysis of multi-threaded programs [81].

CHAPTER 7

Future Work

There are many opportunities to extend and enhance the work presented in this dissertation. In particular, several promising avenues of research adapt and improve existing techniques for computing the element-to-chunk mapping. The layout computation is a modular step which could be improved by incorporating elements from other successful modeling techniques [27, 37] or advanced partitioning mechanisms [23]. Beyond incremental enhancements of existing contributions, three areas of interest form a basis for future work.

7.1 Global Chunk Layer

To date, the Global Chunk Layer has been used exclusively with Global Trees. The system is low-level enough to be of use in a wide range of applications, yet provides a higher level of functionality than typical PGAS one-sided communication layers. Much of the functionality present in the GCL is often re-implemented to achieve good performance in PGAS applications: for example, the UPC version of the Barnes-Hut application required modifying the original SPLASH-2 code to add caching and bulk data movement, whereas the GT version was only adapted to use GT get/put operations.

The flexibility provided by Global Chunks makes it a good candidate for supporting a wide range of irregular applications, such as block-sparse arrays, graphs, and other structured irregular data. Earlier work using a similar approach for handling block-sparse data has shown it to be effective with the bulk data movement style provided by the GCL [59].

7.2 Locality-Aware Load Balancing

Both Global Trees and Global Chunks can take advantage of techniques which balance an irregular workload and improve data locality, but it is challenging to do both at the same time. Currently, support for true locality-aware load balancing in Global Trees is limited to the built-in tree traversals. In general, tasks have affinities to certain processes depending on the data needed for the computation; using this affinity information to assign costs to tasks may be useful in balancing workload while minimizing communication.

Within the co-developed Scioto framework [35], work stealing is locality-oblivious. The stealing operation in Scioto could be enhanced from a simple one-sided communication to an active message which evaluates the candidate tasks and selects those with greater data affinity to the stealing process.

Similar to Olden [20], GCL/GT could be extended with robust support for explicit task migration using active messages. In particular, the chunk global view is useful for iterating over local data, but problems arise when data dependencies involve remote elements. These could be addressed by migrating a portion of the computation to the target process, then resuming the dependent computation upon completion.

7.3 Hierarchical Run-time System Support

Lastly, the current implementation makes no distinction among remote data on different cores, processors, or nodes. As shown in Figure 4.1, operations such as remote node allocation have very different costs depending on where the allocation actually needs to be performed. Extending GCL and GT to be aware of this hierarchy, and of the actual topology and structure of the cluster system, can lead to higher performance and economies of scale.

For example, processes which share memory on the same cluster node could also share chunk directory information, directory caches, and chunk caches. While some additional synchronization overhead may be needed to support the sharing of these resources, there are opportunities for greater efficiency in data access, memory footprint, and other runtime characteristics.

CHAPTER 8

Conclusion

This dissertation has presented a framework which efficiently supports global view programming with linked data structures on distributed memory systems. The system takes advantage of the partitioned global address space model to provide the convenience of fine-grained global data access, while internally using coarse-grained data movement for communication efficiency. Experimental results have validated this approach, yielding a system that scales well and combines flexibility with good overall performance.

Providing support for fine-grained access with coarse-grained data movement has exposed additional opportunities for improving data locality and understanding program performance. These opportunities have been leveraged to provide algorithms which reduce communication costs and lower runtime overhead by improving the element-to-chunk mapping of data.

The Global Chunks and Global Trees systems provide a basis for ongoing research into global view programming of irregular data structures, including support for unstructured trees and graphs, deeper understanding of program performance, and further improvements in data locality and performance through hierarchical runtime systems aware of the different levels of the system architecture and memory hierarchy.

BIBLIOGRAPHY

[1] S. Ahuja, N. Carriero, and D. Gelernter. Linda and Friends. IEEE Computer, 19(8):26–34, Aug. 1986.

[2] B. Alpert, G. Beylkin, D. Gines, and L. Vozovoi. Adaptive solution of partial differential equations in multiwavelet bases. J. Comput. Phys., 182(1):149–190, 2002. Univ. of Colorado, APPM preprint #409, June 1999; ftp://amath.colorado.edu/pub/wavelets/papers/mwa.pdf.

[3] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared memory computing on networks of workstations. IEEE Computer, 29(2):18–28, 1996.

[4] K. Andrusky, S. Curial, and J. N. Amaral. Tree-traversal orientation analysis. In G. Almási, C. Cascaval, and P. Wu, editors, LCPC, volume 4382 of Lecture Notes in Computer Science, pages 220–234. Springer, 2006.

[5] F. Baiardi, P. Mori, and L. Ricci. PIT: A Library for the Parallelization of Irregular Problems. In PARA 2002: Proceedings from the Applied Parallel Computing, Advanced Scientific Computing: 6th International Conference, page 185, London, UK, 2002. Springer-Verlag.

[6] F. Baiardi, P. Mori, and L. Ricci. Solving irregular problems through parallel irregular trees. In Parallel and Distributed Computing and Networks, pages 246–251, 2005.

[7] H. E. Bal, M. F. Kaashoek, and A. S. Tanenbaum. Experience with Distributed Programming in Orca. In Proc. of the 1990 Int’l Conf. on Computer Languages, pages 79–89, Mar. 1990.

[8] H. E. Bal, A. S. Tanenbaum, and M. F. Kaashoek. Orca: A Language for Distributed Programming. ACM SIGPLAN Notices, 25(5):17–24, May 1990.

[9] R. Bariuso and A. Knies. SHMEM user's guide, 1994.

[10] J. Barnes and P. Hut. A Hierarchical O(N log N) force calculation algorithm. Nature, 324:446–449, 1986.

[11] J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Adaptive software cache management for distributed shared memory architectures. In ISCA '90: Proceedings of the 17th annual international symposium on Computer Architecture, pages 125–134, New York, NY, USA, 1990. ACM.

[12] B. N. Bershad. Practical Considerations for Non-Blocking Concurrent Objects. In Proc. of the 13th Int’l Conf. on Distributed Computing Systems (ICDCS-13), pages 264–273, May 1993.

[13] B. N. Bershad and M. J. Zekauskas. Midway: Shared Memory Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors. Technical Report CMU-CS-91-170, School of Computer Science, Carnegie-Mellon University, Sept. 1991.

[14] B. N. Bershad, M. J. Zekauskas, and W. A. Sawdon. The Midway Distributed Shared Memory System. In Proc. of the 38th IEEE Int'l Computer Conf. (COMPCON Spring '93), pages 528–537, Feb. 1993.

[15] D. Bonachea. GASNet specification, v1.1. Technical Report UCB/CSD-02-1207, U.C. Berkeley, 2002.

[16] H. Brunst, H.-C. Hoppe, W. E. Nagel, and M. Winkler. Performance optimization for large scale computing: The scalable Vampir approach. In ICCS '01: Proceedings of the International Conference on Computational Science-Part II, pages 751–760, London, UK, 2001. Springer-Verlag.

[17] G. Buehrer, S. Parthasarathy, and A. Ghoting. Out-of-core frequent pattern mining on a commodity PC. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 86–95, 2006.

[18] G. Buehrer, S. Parthasarathy, S. Tatikonda, T. Kurc, and J. Saltz. Toward terabyte data mining: An architecture-conscious solution. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), March 2007.

[19] P. Camanho, C. Dávila, and D. Ambur. Numerical simulation of delamination growth in composite materials, 2001.

[20] M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. Journal of Parallel and Distributed Computing, 38(2):248–255, 1996.

[21] J. Carrier, L. Greengard, and V. Rokhlin. A Fast Adaptive Multipole Algorithm for Particle Simulations. SIAM Journal of Scientific and Statistical Computing, 9(4), 1988. Yale University Technical Report, YALEU/DCS/RR-496 (1986).

[22] N. Carriero and D. Gelernter. The S/Net's Linda Kernel. ACM Trans. on Computer Systems, 4(2):110–129, May 1986.

[23] Ü. V. Çatalyürek. A hypergraph-partitioning approach for coarse-grain decomposition. In Proceedings of Scientific Computing 2001 (SC2001), pages 10–16, 2001.

[24] A. Chan, W. Gropp, and E. Lusk. An efficient format for nearly constant-time access to arbitrary time intervals in large trace files. Scientific Programming, 16(2-3):155–165, 2008.

[25] C. Chang, A. Sussman, and J. Saltz. Support for distributed dynamic data structures in C++. Technical Report CS-TR-3416, 1995.

[26] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a distributed storage system for structured data. In OSDI ’06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, pages 15–15, Berkeley, CA, USA, 2006. USENIX Association.

[27] T. M. Chilimbi. Efficient representations and abstractions for quantifying and exploiting data reference locality. In PLDI '01: Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation, pages 191–202, New York, NY, USA, 2001. ACM.

[28] T. M. Chilimbi. On the stability of temporal data reference profiles. In PACT '01: Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, page 151, Washington, DC, USA, 2001. IEEE Computer Society.

[29] T. M. Chilimbi and M. Hirzel. Dynamic hot data stream prefetching for general-purpose programs. In PLDI '02: Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation, pages 199–209, New York, NY, USA, 2002. ACM.

[30] T. M. Chilimbi and J. R. Larus. Using generational garbage collection to implement cache-conscious data placement. In ISMM '98: Proceedings of the 1st international symposium on Memory management, pages 37–48, New York, NY, USA, 1998. ACM.

[31] T. M. Chilimbi and R. Shaham. Cache-conscious coallocation of hot data streams. In PLDI ’06: Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation, pages 252–262, New York, NY, USA, 2006. ACM.

[32] L. Dagum and R. Menon. OpenMP: An industry-standard API for shared-memory programming. IEEE Computational Science and Engineering, 5(1):46–55, Jan./Mar. 1998.

[33] R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang. Communication optimizations for irregular scientific computations on distributed memory architectures. J. Parallel Distrib. Comput., 22(3):462–478, 1994.

[34] A. L. Delcher, A. Phillippy, J. Carlton, and S. L. Salzberg. Fast algorithms for large-scale genome alignment and comparison. Nucl. Acids Res., 30(11):2478–2483, 2002.

[35] J. Dinan, S. Krishnamoorthy, B. Larkins, and J. Nieplocha. Scioto: A framework for global-view task parallelism. International Conference on Parallel Processing (ICPP), 2008.

[36] C. Ding and T. Chilimbi. A composable model for analyzing locality of multi- threaded programs. Technical Report MSR-TR-2009-107, Microsoft Research, 2009.

[37] C. Ding and Y. Zhong. Predicting whole-program locality through reuse distance analysis. In PLDI ’03: Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation, pages 245–257, New York, NY, USA, 2003. ACM.

[38] E. Gansner, Y. Koren, and S. North. Topological fisheye views for visualizing large graphs. In INFOVIS '04: Proceedings of the IEEE Symposium on Information Visualization, pages 175–182, Washington, DC, USA, 2004. IEEE Computer Society.

[39] E. R. Gansner and S. C. North. An open graph visualization system and its applications to software engineering. Software - Practice and Experience, 30:1203–1233, 1999.

[40] G. R. Gao and V. Sarkar. Location consistency: Stepping beyond the barriers of memory coherence and serializability. Technical report, McGill University, School of Computer Science, 1993.

[41] G. R. Gao and V. Sarkar. Location consistency - a new memory model and cache consistency protocol. Technical report, IEEE Trans. on Computers, 1998.

[42] A. Ghoting, G. Buehrer, S. Parthasarathy, D. Kim, A. Nguyen, Y. Chen, and P. Dubey. Cache-conscious frequent pattern mining on a modern processor. Proceedings of the 31st international conference on Very large data bases, pages 577–588, 2005.

[43] A. Grama, G. Karypis, V. Kumar, and A. Gupta. Introduction to Parallel Computing (2nd Edition). Addison Wesley, 2 edition, January 2003.

[44] A. Grama, V. Kumar, and A. Sameh. Scalable parallel formulations of the Barnes–Hut method for n-body simulations. Parallel Computing, 24(5–6):797–822, 1998.

[45] W. Gropp. Learning from the success of MPI. In HiPC '01: Proceedings of the 8th International Conference on High Performance Computing, pages 81–94, London, UK, 2001. Springer-Verlag.

[46] A. Guttman. R-trees: A dynamic index structure for spatial searching. In B. Yormark, editor, SIGMOD '84, Proceedings of Annual Meeting, Boston, Massachusetts, June 18-21, 1984, pages 47–57. ACM Press, 1984.

[47] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate genera- tion. ACM SIGMOD Record, 29(2):1–12, 2000.

[48] R. J. Harrison, G. I. Fann, T. Yanai, and G. Beylkin. Multiresolution quantum chemistry in multiwavelet bases. In International Conference on Computational Science, pages 103–110, 2003.

[49] L. Iftode, M. A. Blumrich, C. Dubnicki, D. L. Oppenheimer, J. P. Singh, and K. Li. Shared virtual memory with automatic update support. In International Conference on Supercomputing, pages 175–183, 1999.

[50] Intel Corporation. Cluster OpenMP user’s guide v9.1. (309096-002 US), 2005- 2006.

[51] A. Jain and R. Dubes. Algorithms for clustering data. Prentice-Hall, Inc. Upper Saddle River, NJ, USA, 1988.

[52] P. Jetley, F. Gioachin, C. Mendes, L. V. Kale, and T. R. Quinn. Massively parallel cosmological simulations with ChaNGa. In Proceedings of IEEE International Parallel and Distributed Processing Symposium 2008, 2008.

[53] K. L. Johnson, M. F. Kaashoek, and D. A. Wallach. CRL: High-Performance All-Software Distributed Shared Memory. June 1995.

[54] E. Jul, H. Levy, N. Hutchinson, and A. Black. Fine-Grained Mobility in the Emerald System. ACM Trans. on Computer Systems, 6(1):109–133, Feb. 1988.

[55] L. Kal´eand S. Krishnan. CHARM++: A Portable Concurrent Object Oriented System Based on C++. In A. Paepcke, editor, Proceedings of OOPSLA’93, pages 91–108. ACM Press, September 1993.

[56] P. Keleher. Lazy Release Consistency for Distributed Shared Memory. PhD thesis, Department of Computer Science, Rice University, Jan. 1995.

[57] P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy Release Consistency for Software Distributed Shared Memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 13–21, May 1992.

[58] P. Keleher, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proc. of the Winter 1994 USENIX Conference, pages 115–131, Jan. 1994.

[59] S. Krishnamoorthy, G. Baumgartner, C.-C. Lam, J. Nieplocha, and P. Sadayappan. Layout transformation support for the disk resident arrays framework. J. Supercomput., 36(2):153–170, 2006.

[60] S. Krishnamoorthy, U. Catalyurek, J. Nieplocha, A. Rountev, and P. Sadayappan. An extensible global address space framework with decoupled task and data abstractions. In Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), April 2006.

[61] D. B. Larkins, J. Dinan, S. Krishnamoorthy, S. Parthasarathy, A. Rountev, and P. Sadayappan. Global trees: a framework for linked data structures on distributed memory parallel systems. In SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pages 1–13, Piscataway, NJ, USA, 2008. IEEE Press.

[62] C. Lattner and V. Adve. Automatic pool allocation: improving performance by controlling data structure layout in the heap. In PLDI ’05: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, pages 129–142, New York, NY, USA, 2005. ACM.

[63] A. Leko, H.-H. Su, D. Bonachea, B. Golden, M. Billingsley, and A. George. Parallel performance wizard: a performance analysis tool for partitioned global-address-space programming models. In SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 186, New York, NY, USA, 2006. ACM.

[64] J. Mellor-Crummey, D. Whalley, and K. Kennedy. Improving memory hierarchy performance for irregular applications using data and computation reorderings. Int. J. Parallel Program., 29(3):217–247, 2001.

[65] B. Mohr, L. DeRose, and J. Vetter. A framework for performance analysis of Co-Array Fortran. Concurr. Comput.: Pract. Exper., 19(17):2207–2218, 2007.

[66] MPI Forum. MPI-2: Extensions to the Message-Passing Interface. Technical report, University of Tennessee, Knoxville, 1996.

[67] J. Nieplocha and B. Carpenter. ARMCI: A portable remote memory copy library for distributed array libraries and compiler run-time systems. Lecture Notes in Computer Science, 1586, 1999.

[68] J. Nieplocha, R. Harrison, and R. Littlefield. The Global Array Programming Model for High Performance Scientific Computing, 1995.

[69] J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global Arrays: A Nonuniform Memory Access Programming Model for High-Performance Computers. The Journal of Supercomputing, 10(2):169–189, 1996.

[70] R. S. Nikhil. Cid: A parallel, "shared-memory" C for distributed-memory machines. In LCPC '94: Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing, pages 376–390, London, UK, 1995. Springer-Verlag.

[71] R. Numrich and J. Reid. Co-Array Fortran for parallel programming. ACM Fortran Forum, 17(2):1–31, 1998.

[72] S. Parthasarathy, M. J. Zaki, M. Ogihara, and W. Li. Parallel data mining for association rules on shared memory systems. Knowl. Inf. Syst., 3(1):1–29, 2001.

[73] D. Pelleg and A. Moore. Accelerating exact k-means algorithms with geometric reasoning. Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 277–281, 1999.

[74] E. Petrank and D. Rawitz. The hardness of cache conscious data placement. In POPL ’02: Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 101–112, New York, NY, USA, 2002. ACM.

[75] E. Pietriga. A toolkit for addressing HCI issues in visual language environments. In IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pages 145–152, 2005.

[76] A. Rogers, M. C. Carlisle, J. H. Reppy, and L. J. Hendren. Supporting dynamic data structures on distributed-memory machines. ACM Transactions on Programming Languages and Systems, 17(2):233–263, March 1995.

[77] S. Rubin, R. Bodík, and T. Chilimbi. An efficient profile-analysis framework for data-layout optimizations. In POPL '02: Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 140–153, New York, NY, USA, 2002. ACM.

[78] D. J. Scales and K. Gharachorloo. Design and performance of the Shasta distributed shared memory protocol. In ICS '97: Proceedings of the 11th international conference on Supercomputing, pages 245–252, New York, NY, USA, 1997. ACM Press.

[79] I. Schoinas, B. Falsafi, A. R. Lebeck, S. K. Reinhardt, J. R. Larus, and D. A. Wood. Fine-grain access control for distributed shared memory. SIGPLAN Notices, 29(11):297–306, Nov. 1994.

[80] X. Shen, Y. Zhong, and C. Ding. Predicting locality phases for dynamic memory optimization. J. Parallel Distrib. Comput., 67(7):783–796, 2007.

[81] S. S. Shende and A. D. Malony. The TAU parallel performance system. Int. J. High Perform. Comput. Appl., 20(2):287–311, 2006.

[82] A. Singla and U. Ramachandran. The Beehive cluster system.

[83] R. J. Stets, D. Chen, S. Dwarkadas, N. Hardavellas, G. C. Hunt, L. Kontothanassis, G. Magklis, S. Parthasarathy, U. Rencuzogullari, and M. L. Scott. The implementation of Cashmere. Technical Report TR723, Department of Computer Science, University of Rochester, 1999.

[84] H.-H. Su, D. Bonachea, A. Leko, H. Sherburne, M. Billingsley III, and A. D. George. GASP! A standardized performance analysis tool interface for global address space programming models. In Proc. of Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA06), 2006.

[85] S. Tatikonda and S. Parthasarathy. Hashing Tree-Structured Data: Methods and Applications. Proceedings of International Conference on Data Engineering, pages 429–440, 2010.

[86] UPC Consortium. UPC language specifications, v1.2. Technical Report LBNL-59208, Lawrence Berkeley National Lab, 2005.

[87] J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith, M. Yandell, C. A. Evans, et al. The Sequence of the Human Genome. Science, 291(5507):1304–1351, 2001.

[88] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 24–36, Santa Margherita Ligure, Italy, 1995.

[89] C. E. Wu, A. Bolmarcich, M. Snir, D. Wootton, F. Parpia, A. Chan, E. Lusk, and W. Gropp. From trace generation to visualization: A performance framework for distributed parallel systems. In Proc. of SC2000: High Performance Networking and Computing, November 2000.

[90] K. Yelick, D. Bonachea, W.-Y. Chen, P. Colella, K. Datta, J. Duell, S. L. Graham, P. Hargrove, P. Hilfinger, P. Husbands, C. Iancu, A. Kamil, R. Nishtala, J. Su, M. Welcome, and T. Wen. Productivity and performance using partitioned global address space languages. In PASCO '07: Proceedings of the 2007 international workshop on Parallel symbolic computation, pages 24–32, New York, NY, USA, 2007. ACM.

[91] K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, and A. Aiken. Titanium: A high-performance Java dialect. In ACM 1998 Workshop on Java for High-Performance Network Computing, New York, NY, USA, 1998. ACM Press.

[92] C. Zhang and M. Hirzel. Online phase-adaptive data layout selection. In ECOOP '08: Proceedings of the 22nd European conference on Object-Oriented Programming, pages 309–334, Berlin, Heidelberg, 2008. Springer-Verlag.

[93] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. Proceedings of the 1996 ACM SIGMOD international conference on Management of data, pages 103–114, 1996.

[94] G. Zheng, M. S. Breitenfeld, H. Govind, P. Geubelle, and L. V. Kale. Automatic dynamic load balancing for a crack propagation application. Technical Report 06-08, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign, June 2006.
