Automatically Tuning Collective Communication for One-Sided Programming Models

Rajesh Nishtala
Electrical Engineering and Computer Sciences
University of California at Berkeley

Technical Report No. UCB/EECS-2009-168
http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-168.html
December 15, 2009

Copyright © 2009, by the author(s). All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley.

Committee in charge:
Professor Katherine A. Yelick, Chair
Professor James W. Demmel
Professor Panayiotis Papadopoulos

Fall 2009

Abstract

Technology trends suggest that future machines will rely on parallelism to meet increasing performance requirements. To aid in programmer productivity and application performance, many parallel programming models provide communication building blocks called collective communication.
These operations, such as Broadcast, Scatter, Gather, and Reduce, abstract common global data movement patterns behind a simple library interface, allowing the hardware and runtime system to optimize them for performance and scalability.

We consider the problem of optimizing collective communication in Partitioned Global Address Space (PGAS) languages. Rooted in traditional shared memory programming models, these languages deliver the benefits of sophisticated distributed data structures using language extensions and one-sided communication, which allows one processor to directly read and write memory associated with another. Many popular PGAS language implementations share a common runtime system, GASNet, for implementing such communication. To provide a highly scalable platform for our work, we present a new implementation of GASNet for the IBM BlueGene/P, allowing GASNet to scale to tens of thousands of processors.

We demonstrate that PGAS languages are highly scalable and that the one-sided communication within them is an efficient and convenient platform for collective communication. We show how to use one-sided communication to achieve 3× improvements in the latency and throughput of the collectives over standard message passing implementations. Using a 3D FFT as a representative communication-bound benchmark, for example, we see a 17% increase in performance on 32,768 cores of the BlueGene/P and a 1.5× improvement on 1,024 cores of the Cray XT4. We also show how the automatically tuned collectives can deliver more than an order of magnitude improvement in performance over existing implementations on shared memory platforms.

There is no obvious best algorithm that serves all machines and usage patterns, demonstrating the need for tuning; we therefore build an automatic tuning system in GASNet that optimizes the collectives for a variety of large-scale supercomputers and novel multicore architectures.
To understand the large search space, we construct analytic performance models and use them to minimize the overhead of autotuning. We demonstrate that autotuning is an effective approach to addressing performance optimizations on complex parallel systems.

Dedicated to Rakhee, Amma, and Nanna for all their love and encouragement.

Contents

List of Figures
1 Introduction
  1.1 Related Work
    1.1.1 Automatically Tuning MPI Collective Communication
  1.2 Contributions
  1.3 Outline
2 Experimental Platforms
  2.1 Processor Cores
  2.2 Nodes
    2.2.1 Node Architectures
    2.2.2 Remote Direct Memory Access
  2.3 Interconnection Networks
    2.3.1 CLOS Networks
    2.3.2 Torus Networks
  2.4 Summary
3 One-Sided Communication Models
  3.1 Partitioned Global Address Space Languages
    3.1.1 UPC
    3.1.2 One-sided Communication
  3.2 GASNet
    3.2.1 GASNet on top of the BlueGene/P
    3.2.2 Active Messages
4 Collective Communication
  4.1 The Operations
    4.1.1 Why Are They Important?
  4.2 Implications of One-Sided Communication for Collectives
    4.2.1 Global Address Space and Synchronization
    4.2.2 Current Set of Synchronization Flags
    4.2.3 Synchronization Flags: Arguments for and Against
    4.2.4 Optimizing the Synchronization and Collective Together
  4.3 Collectives Used in Applications
5 Rooted Collectives for Distributed Memory
  5.1 Broadcast
    5.1.1 Leveraging Shared Memory
    5.1.2 Trees
    5.1.3 Address Modes
    5.1.4 Data Transfer
    5.1.5 Nonblocking Collectives
    5.1.6 Hardware Collectives
    5.1.7 Comparison with MPI
  5.2 Other Rooted Collectives
    5.2.1 Scratch Space
    5.2.2 Scatter
    5.2.3 Gather
    5.2.4 Reduce
  5.3 Performance Models
    5.3.1 Scatter
    5.3.2 Gather
    5.3.3 Broadcast
    5.3.4 Reduce
  5.4 Application Examples
    5.4.1 Dense Matrix Multiplication
    5.4.2 Dense Cholesky Factorization
  5.5 Summary
6 Non-Rooted Collectives for Distributed Memory
  6.1 Exchange
    6.1.1 Performance Model
    6.1.2 Nonblocking Collective Performance
  6.2 Gather-to-All
    6.2.1 Performance Model
    6.2.2 Nonblocking Collective Performance
  6.3 Application Example: 3D FFT
    6.3.1 Packed Slabs
    6.3.2 Slabs
    6.3.3 Summary
    6.3.4 Performance Results
  6.4 Summary
7 Collectives for Shared Memory Systems
  7.1 Non-rooted Collective: Barrier
  7.2 Rooted Collectives
    7.2.1 Reduce
    7.2.2 Other Rooted Collectives
  7.3 Application Example: Sparse Conjugate Gradient
  7.4 Summary
8 Software Architecture of the Automatic Tuner
  8.1 Related Work
  8.2 Software Architecture
    8.2.1 Algorithm Index
    8.2.2 Phases of the Automatic Tuner
  8.3 Collective Tuning
    8.3.1 Factors that Influence Performance
    8.3.2 Offline Tuning
    8.3.3 Online Tuning
    8.3.4 Performance Models
  8.4 Summary
9 Teams
  9.1 Thread-Centric Collectives
    9.1.1 Similarities and Differences with MPI
  9.2 Data-Centric Collectives
    9.2.1 Proposed Collective Model
    9.2.2 An Example Interface
    9.2.3 Application Examples
  9.3 Automatic Tuning with Teams
    9.3.1 Current Status
10 Conclusion
Bibliography

List of Figures

2.1 Sun Constellation Node Architecture
2.2 Cray XT4 Node Architecture
2.3 Cray XT5 Node Architecture
2.4 IBM BlueGene/P Node Architecture
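Section 5.1.2 above tunes broadcast over families of trees. As a hedged illustration of the kind of communication schedule involved, the sketch below computes the per-round sends of a binomial-tree broadcast; the function name and the rank-relabeling convention are assumptions made for this example, not GASNet's actual interface.

```python
def binomial_broadcast_schedule(nprocs, root=0):
    """Per-round (sender, receiver) pairs for a binomial-tree broadcast.

    Illustrative sketch only: in round k, every rank that already holds
    the data forwards it to the rank 2**k positions away, so P processes
    finish in ceil(log2(P)) rounds with P - 1 messages total.
    """
    rounds = []
    have = {0}            # virtual ranks that already hold the data
    step = 1
    while step < nprocs:
        this_round = []
        for src in sorted(have):
            dst = src + step
            if dst < nprocs:
                # translate virtual ranks so that `root` plays rank 0
                this_round.append(((src + root) % nprocs,
                                   (dst + root) % nprocs))
        have |= {s + step for s in have if s + step < nprocs}
        rounds.append(this_round)
        step *= 2
    return rounds
```

The depth of the tree, and hence the number of rounds on the critical path, is one of the knobs the autotuner searches over.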
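Sections 5.3 and 8.3.4 above cover analytic performance models used to prune the autotuner's search space. A minimal sketch of model-guided radix selection, assuming a classic alpha-beta (per-message latency plus inverse-bandwidth) cost model — the function names and the specific cost formula are deliberately simplified stand-ins, not the dissertation's actual models:

```python
def kary_broadcast_cost(nprocs, nbytes, alpha, beta, radix):
    """Predicted broadcast time on a radix-k tree: each of the
    ceil(log_radix(nprocs)) tree levels costs up to `radix` serialized
    sends of (alpha + nbytes * beta) seconds each."""
    depth, reach = 0, 1
    while reach < nprocs:     # integer log_radix(nprocs), rounded up
        reach *= radix
        depth += 1
    return depth * radix * (alpha + nbytes * beta)

def best_radix(nprocs, nbytes, alpha, beta, radices=range(2, 9)):
    """Model-guided pruning: rank candidate radices by predicted cost,
    so only the most promising candidates need to be timed for real."""
    return min(radices,
               key=lambda k: kary_broadcast_cost(nprocs, nbytes,
                                                 alpha, beta, k))
```

Higher radices shorten the tree but serialize more sends at each parent, so the model captures the basic trade-off the tuner must resolve per machine and message size.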