Parallel Programming in OpenMP

About the Authors

Rohit Chandra is a chief scientist at NARUS, Inc., a provider of internet business infrastructure solutions. He previously was a principal engineer in the Compiler Group at Silicon Graphics, where he helped design and implement OpenMP.

Leonardo Dagum works for Silicon Graphics in the Server Platform Group, where he is responsible for the I/O infrastructure in SGI’s scalable Linux server systems. He helped define the OpenMP API. His research interests include parallel algorithms and performance modeling for parallel systems.

Dave Kohr is a member of the technical staff at NARUS, Inc. He previously was a member of the technical staff in the Compiler Group at Silicon Graphics, where he helped define and implement OpenMP.

Dror Maydan is a director at Tensilica, Inc., a provider of application-specific processor technology. He previously was an engineering department manager in the Compiler Group of Silicon Graphics, where he helped design and implement OpenMP.

Jeff McDonald owns SolidFX, a private software development company. As the engineering department manager at Silicon Graphics, he proposed the OpenMP API effort and helped develop it into the industry standard it is today.

Ramesh Menon is a staff engineer at NARUS, Inc. Prior to NARUS, Ramesh was a staff engineer at SGI, representing SGI in the OpenMP forum. He was the founding chairman of the OpenMP Architecture Review Board (ARB) and supervised the writing of the first OpenMP specifications.

Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, Ramesh Menon

Senior Editor: Denise E. M. Penrose
Senior Production Editor: Edward Wade
Editorial Coordinator: Emilia Thiuri
Cover Design: Ross Carron Design
Cover Image: © Stone/Gary Benson
Text Design: Rebecca Evans & Associates
Technical Illustration: Dartmouth Publishing, Inc.
Composition: Nancy Logan
Copyeditor: Ken DellaPenta
Proofreader: Jennifer McClain
Indexer: Ty Koontz
Printer: Courier Corporation

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances where Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

ACADEMIC PRESS
A Harcourt Science and Technology Company
525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
http://www.academicpress.com

Academic Press
Harcourt Place, 32 Jamestown Road, London, NW1 7BY, United Kingdom
http://www.academicpress.com

Morgan Kaufmann Publishers
340 Pine Street, Sixth Floor, San Francisco, CA 94104-3205, USA
http://www.mkp.com

© 2001 by Academic Press
All rights reserved
Printed in the United States of America
05 04 03 02 01    5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, recording, or otherwise—without the prior written permission of the publisher.

Library of Congress Cataloging-in-Publication Data is available for this book.

ISBN 1-55860-671-8

This book is printed on acid-free paper.

We would like to dedicate this book to our families:

Rohit—To my wife and son, Minnie and Anand, and my parents

Leo—To my lovely wife and daughters, Joanna, Julia, and Anna

Dave—To my dearest wife, Jingjun

Dror—To my dearest wife and daughter, Mary and Daniella, and my parents, Dalia and Dan

Jeff—To my best friend and wife, Dona, and my parents

Ramesh—To Beena, Abishek, and Sithara, and my parents

Foreword

by John L. Hennessy
President, Stanford University

FOR A NUMBER OF YEARS, I have believed that advances in software, rather than hardware, held the key to making parallel computing more commonplace. In particular, the lack of a broadly supported standard for programming shared-memory multiprocessors has been a chasm both for users and for software vendors interested in porting their software to these multiprocessors. OpenMP represents the first vendor-independent, commercial “bridge” across this chasm.

Such a bridge is critical to achieve portability across different shared-memory multiprocessors. In the parallel programming world, the challenge is to obtain both this functional portability as well as performance portability. By performance portability, I mean the ability to have reasonable expectations about how parallel applications will perform on different multiprocessor architectures. OpenMP makes important strides in enhancing performance portability among shared-memory architectures.

Parallel computing is attractive because it offers users the potential of higher performance. The central problem in parallel computing for nearly 20 years has been to improve the “gain to pain ratio.” Improving this ratio, with either hardware or software, means making the gains in performance come at less pain to the programmer! Shared-memory multiprocessing was developed with this goal in mind.


It provides a familiar programming model, allows parallel applications to be developed incrementally, and supports fine-grain communication in a very cost effective manner. All of these factors make it easier to achieve high performance on parallel machines. More recently, the development of cache-coherent distributed shared memory has provided a method for scaling shared-memory architectures to larger numbers of processors. In many ways, this development removed the hardware barrier to scalable, shared-memory multiprocessing. OpenMP represents the important step of providing a software standard for these shared-memory multiprocessors. Our goal now must be to learn how to program these machines effectively (i.e., with a high value for gain/pain). This book will help users accomplish this important goal.

By focusing its attention on how to use OpenMP, rather than on defining the standard, the authors have made a significant contribution to the important task of mastering the programming of multiprocessors.

Contents

Foreword, by John L. Hennessy
Preface

Chapter 1  Introduction
  1.1  Performance with OpenMP
  1.2  A First Glimpse of OpenMP
  1.3  The OpenMP Parallel Computer
  1.4  Why OpenMP?
  1.5  History of OpenMP
  1.6  Navigating the Rest of the Book

Chapter 2  Getting Started with OpenMP
  2.1  Introduction
  2.2  OpenMP from 10,000 Meters
    2.2.1  OpenMP Compiler Directives or Pragmas
    2.2.2  Parallel Control Structures
    2.2.3  Communication and Data Environment
    2.2.4  Synchronization
  2.3  Parallelizing a Simple Loop
    2.3.1  Runtime Execution Model of an OpenMP Program
    2.3.2  Communication and Data Scoping
    2.3.3  Synchronization in the Simple Loop Example
    2.3.4  Final Words on the Simple Loop Example
  2.4  A More Complicated Loop
  2.5  Explicit Synchronization
  2.6  The reduction Clause
  2.7  Expressing Parallelism with Parallel Regions
  2.8  Concluding Remarks
  2.9  Exercises

Chapter 3  Exploiting Loop-Level Parallelism
  3.1  Introduction
  3.2  Form and Usage of the parallel do Directive
    3.2.1  Clauses
    3.2.2  Restrictions on Parallel Loops
  3.3  Meaning of the parallel do Directive
    3.3.1  Loop Nests and Parallelism
  3.4  Controlling Data Sharing
    3.4.1  General Properties of Data Scope Clauses
    3.4.2  The shared Clause
    3.4.3  The private Clause
    3.4.4  Default Variable Scopes
    3.4.5  Changing Default Scoping Rules
    3.4.6  Parallelizing Reduction Operations
    3.4.7  Private Variable Initialization and Finalization
  3.5  Removing Data Dependences
    3.5.1  Why Data Dependences Are a Problem
    3.5.2  The First Step: Detection
    3.5.3  The Second Step: Classification
    3.5.4  The Third Step: Removal
    3.5.5  Summary
  3.6  Enhancing Performance
    3.6.1  Ensuring Sufficient Work
    3.6.2  Scheduling Loops to Balance the Load
    3.6.3  Static and Dynamic Scheduling
    3.6.4  Scheduling Options
    3.6.5  Comparison of Runtime Scheduling Behavior
  3.7  Concluding Remarks
  3.8  Exercises

Chapter 4  Beyond Loop-Level Parallelism: Parallel Regions
  4.1  Introduction
  4.2  Form and Usage of the parallel Directive
    4.2.1  Clauses on the parallel Directive
    4.2.2  Restrictions on the parallel Directive
  4.3  Meaning of the parallel Directive
    4.3.1  Parallel Regions and SPMD-Style Parallelism
  4.4  threadprivate Variables and the copyin Clause
    4.4.1  The threadprivate Directive
    4.4.2  The copyin Clause
  4.5  Work-Sharing in Parallel Regions
    4.5.1  A Parallel Task Queue
    4.5.2  Dividing Work Based on Thread Number
    4.5.3  Work-Sharing Constructs in OpenMP
  4.6  Restrictions on Work-Sharing Constructs
    4.6.1  Block Structure
    4.6.2  Entry and Exit
    4.6.3  Nesting of Work-Sharing Constructs
  4.7  Orphaning of Work-Sharing Constructs
    4.7.1  Data Scoping of Orphaned Constructs
    4.7.2  Writing Code with Orphaned Work-Sharing Constructs
  4.8  Nested Parallel Regions
    4.8.1  Directive Nesting and Binding
  4.9  Controlling Parallelism in an OpenMP Program
    4.9.1  Dynamically Disabling the parallel Directives
    4.9.2  Controlling the Number of Threads
    4.9.3  Dynamic Threads
    4.9.4  Runtime Library Calls and Environment Variables
  4.10  Concluding Remarks
  4.11  Exercises

Chapter 5  Synchronization
  5.1  Introduction
  5.2  Data Conflicts and the Need for Synchronization
    5.2.1  Getting Rid of Data Races
    5.2.2  Examples of Acceptable Data Races
    5.2.3  Synchronization Mechanisms in OpenMP
  5.3  Mutual Exclusion Synchronization
    5.3.1  The Critical Section Directive
    5.3.2  The atomic Directive
    5.3.3  Runtime Library Routines
  5.4  Event Synchronization
    5.4.1  Barriers
    5.4.2  Ordered Sections
    5.4.3  The master Directive
  5.5  Custom Synchronization: Rolling Your Own
    5.5.1  The flush Directive
  5.6  Some Practical Considerations
  5.7  Concluding Remarks
  5.8  Exercises

Chapter 6  Performance
  6.1  Introduction
  6.2  Key Factors That Impact Performance
    6.2.1  Coverage and Granularity
    6.2.2  Load Balance
    6.2.3  Locality
    6.2.4  Synchronization
  6.3  Performance-Tuning Methodology
  6.4  Dynamic Threads
  6.5  Bus-Based and NUMA Machines
  6.6  Concluding Remarks
  6.7  Exercises

Appendix A  A Quick Reference to OpenMP
References
Index

Preface

OPENMP IS A PARALLEL PROGRAMMING MODEL for shared memory and distributed shared memory multiprocessors. Pioneered by SGI and developed in collaboration with other parallel computer vendors, OpenMP is fast becoming the de facto standard for parallelizing applications. There is an independent OpenMP organization today with most of the major computer manufacturers on its board, including Compaq, Hewlett-Packard, Intel, IBM, Kuck & Associates (KAI), SGI, Sun, and the U.S. Department of Energy ASCI Program. The OpenMP effort has also been endorsed by over 15 software vendors and application developers, reflecting the broad industry support for the OpenMP standard.

Unfortunately, the main information available about OpenMP is the OpenMP specification (available from the OpenMP Web site at www.openmp.org). Although this is appropriate as a formal and complete specification, it is not a very accessible format for programmers wishing to use OpenMP for developing parallel applications. This book tries to fulfill the needs of these programmers.

This introductory-level book is primarily designed for application developers interested in enhancing the performance of their applications by utilizing multiple processors. The book emphasizes practical concepts and tries to address the concerns of real application developers. Little background is assumed of the reader other than single-processor programming experience and the ability to follow simple program examples in the Fortran programming language.


While the example programs are usually in Fortran, all the basic OpenMP constructs are presented in Fortran, C, and C++.

The book tries to balance the needs of both beginning and advanced parallel programmers. The introductory material is a must for programmers new to parallel programming, but may easily be skipped by those familiar with the basic concepts of parallelism. The latter are more likely to be interested in applying known techniques using individual OpenMP mechanisms, or in addressing performance issues in their parallel program.

The authors are all SGI engineers who were involved in the design and implementation of OpenMP and include compiler writers, application developers, and performance engineers. We hope that our diverse backgrounds are positively reflected in the breadth and depth of the material in the text.

Organization of the Book

This book is organized into six chapters.

Chapter 1, “Introduction,” presents the motivation for parallel programming by giving examples of performance gains achieved by some real-world application programs. It describes the different kinds of parallel computers and the one targeted by OpenMP. It gives a high-level glimpse of OpenMP and includes some historical background.

Chapter 2, “Getting Started with OpenMP,” gives a bird’s-eye view of OpenMP and describes what happens when an OpenMP parallel program is executed. This chapter is a must-read for programmers new to parallel programming, while advanced readers need only skim the chapter to get an overview of the various components of OpenMP.

Chapter 3, “Exploiting Loop-Level Parallelism,” focuses on using OpenMP to direct the execution of loops across multiple processors. Loop-level parallelism is among the most common forms of parallelism in applications and is also the simplest to exploit. The constructs described in this chapter are therefore the most popular of the OpenMP constructs.

Chapter 4, “Beyond Loop-Level Parallelism: Parallel Regions,” focuses on exploiting parallelism beyond individual loops, such as parallelism across multiple loops and parallelization of nonloop constructs. The techniques discussed in this chapter are useful when trying to parallelize an increasingly large portion of an application and are crucial for scalable performance on large numbers of processors.

Chapter 5, “Synchronization,” describes the synchronization mechanisms in OpenMP. It describes the situations in a shared memory parallel program when explicit synchronization is necessary. It presents the various OpenMP synchronization constructs and also describes how programmers may build their own custom synchronization in a shared memory parallel program.

Chapter 6, “Performance,” discusses the performance issues that arise in shared memory parallel programs. The only reason to write an OpenMP program is scalable performance, and this chapter is a must-read to realize these performance goals.

Appendix A, A Quick Reference to OpenMP, which details various OpenMP directives, runtime library routines, lock routines, and so on, can be found immediately following Chapter 6.

A presentation note regarding the material in the text. The code fragments in the examples are presented using a different font, shown below:

This is a code sample in an example.

This is an OpenMP construct in an example.

Within the examples all OpenMP constructs are highlighted in boldface monofont. Code segments such as variable names, or OpenMP constructs used within the text, are simply highlighted using the regular text font, but in italics as in this sample.

Acknowledgments

This book is based entirely on the OpenMP effort, which would not have been possible without the members of the OpenMP Architectural Review Board who participated in the design of the OpenMP specification, including the following organizations: Compaq Computer Corporation, Hewlett-Packard Company, Intel Corporation, International Business Machines (IBM), Kuck & Associates (KAI), Silicon Graphics, Inc. (SGI), Sun Microsystems, Inc., and the U.S. Department of Energy ASCI Program.

This book benefited tremendously from the detailed perusal of several reviewers, including George Adams of Purdue University, Tim Mattson of Intel Corporation, David Kuck of Kuck & Associates, Ed Rothberg of Ilog, Inc. (CPLEX Division), and Mary E. Zosel of the U.S. Department of Energy ASCI Program.

Writing a book is a very time-consuming operation. This book in particular is the joint effort of six authors, which presents a substantial additional challenge. The publishers, especially our editor Denise Penrose, were very patient and encouraging throughout. This book would just not have been possible without her unflagging enthusiasm.

Our employer, SGI, was most cooperative in helping us make time to work on the book. In particular, we would like to thank Ken Jacobsen, Ron Price, Willy Shih, and Ross Towle for their unfailing support for this effort, and Wesley Jones and Christian Tanasescu for their help with several applications discussed in the book.

Finally, the invisible contributors to this effort are our individual families, who let us steal time in various ways so that we could moonlight on this project.

CHAPTER 1

Introduction

ENHANCED COMPUTER APPLICATION PERFORMANCE is the only practical purpose of parallel processing. Many computer applications continue to exceed the capabilities delivered by the fastest single processors, so it is compelling to harness the aggregate capabilities of multiple processors to provide additional computational power. Even applications with adequate single-processor performance on high-end systems often enjoy a significant cost advantage when implemented in parallel on systems utilizing multiple, lower-cost, commodity microprocessors. Raw performance and price performance: these are the direct rewards of parallel processing.

The cost that a software developer incurs to attain meaningful parallel performance comes in the form of additional design and programming complexity inherent in producing correct and efficient computer code for multiple processors. If computer application performance or price performance is important to you, then keep reading. It is the goal of both OpenMP and this book to minimize the complexity introduced when adding parallelism to application code.

In this chapter, we introduce the benefits of parallelism through examples and then describe the approach taken in OpenMP to support the development of parallel applications.


The shared memory multiprocessor target architecture is described at a high level, followed by a brief explanation of why OpenMP was developed and how it came to be. Finally, a road map of the remainder of the book is presented to help navigate through the rest of the text. This will help readers with different levels of experience in parallel programming come up to speed on OpenMP as quickly as possible.

This book assumes that the reader is familiar with general algorithm development and programming methods. No specific experience in parallel programming is assumed, so experienced parallel programmers will want to use the road map provided in this chapter to skip over some of the more introductory-level material. Knowledge of the Fortran language is somewhat helpful, as examples are primarily presented in this language. However, most programmers will probably easily understand them. In addition, there are several examples presented in C and C++ as well.

1.1 Performance with OpenMP

Applications that rely on the power of more than a single processor are numerous. Often they provide results that are time-critical in one way or another. Consider the example of weather forecasting. What use would a highly accurate weather model for tomorrow’s forecast be if the required computation takes until the following day to complete? The computational complexity of a weather problem is directly related to the accuracy and detail of the requested forecast simulation. Figure 1.1 illustrates the performance of the MM5 (mesoscale model) weather code when implemented using OpenMP [GDS 95] on an SGI Origin 2000 multiprocessor.1 The graph shows how much faster the problem can be solved when using multiple processors: the forecast can be generated 70 times faster by using 128 processors compared to using only a single processor. The factor by which the time to solution can be improved compared to using only a single processor is called speedup. These performance levels cannot be supported by a single-processor system. Even the fastest single processor available today, the VPP5000, which can perform at a peak of 1512 Mflop/sec, would deliver MM5 performance equivalent to only about 10 of the Origin 2000 processors demonstrated in the results shown in Figure 1.1.2

1 MM5 was developed by and is copyrighted by the Pennsylvania State University (Penn State) and the University Corporation for Atmospheric Research (UCAR). MM5 results on SGI Origin 2000 courtesy of Wesley Jones, SGI.
2 http://box.mmm.ucar.edu/mm5/mpp/helpdesk/20000106.html.

Figure 1.1 Performance of the MM5 weather code: speedup versus number of processors used, for a 200 × 250 × 27 grid.

Because of these dramatic performance gains through parallel execution, it becomes possible to provide detailed and accurate weather forecasts in a timely fashion.

The MM5 application is a specific type of computational fluid dynamics (CFD) problem. CFD is routinely used to design both commercial and military aircraft and has an ever-increasing collection of nonaerospace applications as diverse as the simulation of blood flow in arteries [THZ 98] to the ideal mixing of ingredients during beer production. These simulations are very computationally expensive by the nature of the complex mathematical problem being solved. Like MM5, better and more accurate solutions require more detailed simulations, possible only if additional computational resources are available. The NAS Parallel Benchmarks [NASPB 91] are an industry standard series of performance benchmarks that emulate CFD applications as implemented for multiprocessor systems. The results from one of these benchmarks, known as APPLU, are shown in Figure 1.2 for a varying number of processors. Results are shown for both the OpenMP and MPI implementations of APPLU. The performance increases by a factor of more than 90 as we apply up to 64 processors to the same large simulation.3

3 This seemingly impossible performance feat, where the application speeds up by a factor greater than the number of processors utilized, is called superlinear speedup and is explained by cache memory effects, discussed in Chapter 6.

Figure 1.2 Performance of the NAS parallel benchmark, APPLU: MPI and OpenMP speedup versus number of processors.

Clearly, parallel computing can have an enormous impact on application performance, and OpenMP facilitates access to this enhanced performance. Can any application be altered to provide such impressive performance gains and over so many processors? It very likely can. How likely are such gains? It would probably take a large development effort, so it really depends on the importance of additional performance and the corresponding investment of effort. Is there a middle ground where an application can benefit from a modest number of processors with a correspondingly modest development effort? Absolutely, and this is the level at which most applications exploit parallel computer systems today. OpenMP is designed to support incremental parallelization, or the ability to parallelize an application a little at a time at a rate where the developer feels additional effort is worthwhile.

Automobile crash analysis is another application area that significantly benefits from parallel processing. Full-scale tests of car crashes are very expensive, and automobile companies as well as government agencies would like to do more tests than is economically feasible.

Figure 1.3 Performance of a leading crash code: application speedup versus number of processors.

Computational crash analysis applications have been proven to be highly accurate and much less expensive to perform than full-scale tests. The simulations are computationally expensive, with turnaround times for a single crash test simulation often measured in days even on the world’s fastest supercomputers. This can directly impact the schedule of getting a safe new car design on the road, so performance is a key business concern. Crash analysis is a difficult problem class in which to realize the huge range of scalability demonstrated in the previous MM5 example. Figure 1.3 shows an example of performance from a leading parallel crash simulation code parallelized using OpenMP. The performance or speedup yielded by employing eight processors is close to 4.3 on the particular example shown [RAB 98]. This is a modest improvement compared to the weather code example, but a vitally important one to automobile manufacturers whose economic viability increasingly depends on shortened design cycles.

This crash simulation code represents an excellent example of incremental parallelism. The parallel version of most automobile crash codes evolved from the single-processor implementation and initially provided parallel execution in support of very limited but key functionality. Some types of simulations would get a significant advantage from parallelization, while others would realize little or none. As new releases of the code are developed, more and more parts of the code are parallelized. Changes range from very simple code modifications to the reformulation of central algorithms to facilitate parallelization. This incremental process helps to deliver enhanced performance in a timely fashion while following a conservative development path to maintain the integrity of application code that has been proven by many years of testing and verification.

1.2 A First Glimpse of OpenMP

Developing a parallel computer application is not so different from writing a sequential (i.e., single-processor) application. First the developer forms a clear idea of what the program needs to do, including the inputs to and the outputs from the program. Second, algorithms are designed that not only describe how the work will be done, but also, in the case of parallel programs, how the work can be distributed or decomposed across multiple processors. Finally, these algorithms are implemented in the application program or code. OpenMP is an implementation model to support this final step, namely, the implementation of parallel algorithms. It leaves the responsibility of designing the appropriate parallel algorithms to the programmer and/or other development tools.

OpenMP is not a new computer language; rather, it works in conjunction with either standard Fortran or C/C++. It is comprised of a set of compiler directives that describe the parallelism in the source code, along with a supporting library of runtime routines available to applications (see Appendix A). Collectively, these directives and library routines are formally described by the application programming interface (API) now known as OpenMP.

The directives are instructional notes to any compiler supporting OpenMP. They take the form of source code comments (in Fortran) or #pragmas (in C/C++) in order to enhance application portability when porting to non-OpenMP environments. The simple code segment in Example 1.1 demonstrates the concept.

Example 1.1 Simple OpenMP program.

      program hello
      print *, "Hello parallel world from threads:"
!$omp parallel
      print *, omp_get_thread_num()
!$omp end parallel
      print *, "Back to the sequential world."
      end

The code in Example 1.1 will result in a single Hello parallel world from threads: message followed by a unique number for each thread started by the !$omp parallel directive. The total number of threads active will be equal to some externally defined degree of parallelism. The closing Back to the sequential world message will be printed once before the program terminates.

One way to set the degree of parallelism in OpenMP is through an operating system–supported environment variable named OMP_NUM_THREADS. Let us assume that this symbol has been previously set equal to 4. The program will begin execution just like any other program utilizing a single processor. When execution reaches the print statement bracketed by the !$omp parallel/!$omp end parallel directive pair, three additional copies of the print code are started. We call each copy a thread, or thread of execution. The OpenMP library routine omp_get_thread_num() reports a unique thread identification number between 0 and OMP_NUM_THREADS – 1. Code after the parallel directive is executed by each thread independently, resulting in the four unique numbers from 0 to 3 being printed in some unspecified order. The order may possibly be different each time the program is run. The !$omp end parallel directive is used to denote the end of the code segment that we wish to run in parallel. At that point, the three extra threads are deactivated and normal sequential behavior continues.
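As a small aside, and not an example from the text, the runtime library also provides the companion routine omp_get_num_threads(), which returns the size of the current team rather than a thread’s own number. A minimal sketch (assuming an OpenMP Fortran compiler and OMP_NUM_THREADS set as above) might look like this:

      program hello2
c     Declare the OpenMP library functions so that implicit
c     typing does not treat their results as reals.
      integer omp_get_thread_num, omp_get_num_threads
!$omp parallel
c     Each thread reports its own number and the team size.
      print *, omp_get_thread_num(), ' of ', omp_get_num_threads()
!$omp end parallel
      end

With four threads, each thread prints its own number together with the team size of 4, again in an unspecified order.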

Hello parallel world from threads:
1
3
0
2
Back to the sequential world.

This output occurs because the threads are executing without regard for one another, and there is only one screen showing the output. What if the digit of one thread’s number is printed before the carriage return from the previously printed thread number? In this case, the output could well look more like

Hello parallel world from threads:
13
02
Back to the sequential world.

Obviously, it is important for threads to cooperate better with each other if useful and correct work is to be done by an OpenMP program. Issues like these fall under the general topic of synchronization, which is addressed throughout the book, with Chapter 5 being devoted entirely to the subject. 8 Chapter 1—Introduction
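As a brief, hedged aside that is not from the text (it uses the critical construct, which Chapter 5 covers in detail), one simple way to keep each thread’s output line intact is to let only one thread at a time execute the print statement:

      program hello3
      integer omp_get_thread_num
!$omp parallel
c     The critical construct allows only one thread at a time
c     to execute the enclosed print, so output lines do not
c     interleave.
!$omp critical
      print *, 'Hello from thread ', omp_get_thread_num()
!$omp end critical
!$omp end parallel
      end

The prints still occur in an unspecified order, but each line is now produced in its entirety before another thread begins its own.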

This trivial example gives a flavor of how an application can go parallel using OpenMP with very little effort. There is obviously more to cover before useful applications can be addressed but less than one might think. By the end of Chapter 2 you will be able to write useful parallel computer code on your own!

Before we cover additional details of OpenMP, it is helpful to understand how and why OpenMP came about, as well as the target architecture for OpenMP programs. We do this in the subsequent sections.

1.3 The OpenMP Parallel Computer

OpenMP is primarily designed for shared memory multiprocessors. Figure 1.4 depicts the programming model or logical view presented to a programmer by this class of computer. The important aspect for our current purposes is that all of the processors are able to directly access all of the memory in the machine, through a logically direct connection. Machines that fall in this class include bus-based systems like the Compaq AlphaServer, all multiprocessor PC servers and workstations, the SGI Power Challenge, and the SUN Enterprise systems. Also in this class are distributed shared memory (DSM) systems. DSM systems are also known as ccNUMA (Cache Coherent Non-Uniform Memory Access) systems, examples of which include the SGI Origin 2000, the Sequent NUMA-Q 2000, and the HP 9000 V-Class. Details on how a machine provides the programmer with this logical view of a globally addressable memory are unimportant for our purposes at this time, and we describe all such systems simply as “shared memory.”

The alternative to a shared memory configuration is distributed memory, in which each processor in the system is only capable of directly addressing memory physically associated with it. Figure 1.5 depicts the classic form of a distributed memory system. Here, each processor in the system can only address its own local memory, and it is always up to the programmer to manage the mapping of the program data to the specific memory system where data is to be physically stored.

Figure 1.4 A canonical shared memory architecture: processors P0 through Pn all directly access a single shared memory.

Figure 1.5 A canonical message passing (nonshared memory) architecture: each of the processors P0 through Pn has its own local memory M0 through Mn, and the processors communicate over an interconnection network.

To access information in memory connected to other processors, the user must explicitly pass messages through some network connecting the processors. Examples of systems in this category include the IBM SP-2 and clusters built up of individual computer systems on a network, or networks of workstations (NOWs). Such systems are usually programmed with explicit message passing libraries such as Message Passing Interface (MPI) [PP96] and Parallel Virtual Machine (PVM). Alternatively, a high-level language approach such as High Performance Fortran (HPF) [KLS 94] can be used in which the compiler generates the required low-level message passing calls from parallel application code written in the language.

From this very simplified description one may be left wondering why anyone would build or use a distributed memory parallel machine. For systems with larger numbers of processors the shared memory itself can become a bottleneck because there is limited bandwidth capability that can be engineered into a single-memory subsystem. This places a practical limit on the number of processors that can be supported in a traditional shared memory machine, on the order of 32 processors with current technology. ccNUMA systems such as the SGI Origin 2000 and the HP 9000 V-Class have combined the logical view of a shared memory machine with physically distributed/globally addressable memory. Machines of hundreds and even thousands of processors can be supported in this way while maintaining the simplicity of the shared memory system model. A programmer writing highly scalable code for such systems must account for the underlying distributed memory system in order to attain top performance. This will be examined in Chapter 6.

1.4 Why OpenMP?

The last decade has seen a tremendous increase in the widespread availability and affordability of shared memory parallel systems.

Not only have such multiprocessor systems become more prevalent, they also contain increasing numbers of processors. Meanwhile, most of the high-level, portable and/or standard parallel programming models are designed for distributed memory systems. This has resulted in a serious disconnect between the state of the hardware and the software to support them. The goal of OpenMP is to provide a standard and portable API for writing shared memory parallel programs.

Let us first examine the state of hardware platforms. Over the last several years, there has been a surge in both the quantity and scalability of shared memory computer platforms. Quantity is being driven very quickly in the low-end market by the rapidly growing PC-based multiprocessor server/workstation market. The first such systems contained only two processors, but this has quickly evolved to four- and eight-processor systems, and scalability shows no signs of slowing. The growing demand for business/enterprise and technical/scientific servers has driven the quantity of shared memory systems in the medium- to high-end class machines as well. As the cost of these machines continues to fall, they are deployed more widely than traditional mainframes and minicomputers. Typical of these are bus-based machines in the range of 2 to 32 RISC processors like the SGI Power Challenge, the Compaq AlphaServer, and the Sun Enterprise servers.

On the software front, the various manufacturers of shared memory parallel systems have supported different levels of shared memory programming functionality in proprietary compiler and library products. In addition, implementations of distributed memory programming APIs like MPI are also available for most shared memory multiprocessors. Application portability between different systems is extremely important to software developers. This desire, combined with the lack of a standard shared memory parallel API, has led most application developers to use the message passing models. This has been true even if the target computer systems for their applications are all shared memory in nature. A basic goal of OpenMP, therefore, is to provide a portable standard parallel API specifically for programming shared memory multiprocessors.

We have made an implicit assumption thus far that shared memory computers and the related programming model offer some inherent advantage over distributed memory computers to the application developer. There are many pros and cons, some of which are addressed in Table 1.1. Programming with a shared memory model has been typically associated with ease of use at the expense of limited parallel scalability. Distributed memory programming on the other hand is usually regarded as more difficult but the only way to achieve higher levels of parallel scalability. Some of this common wisdom is now being challenged by the current generation of scalable shared memory servers coupled with the functionality offered by OpenMP.

Table 1.1 Comparing shared memory and distributed memory programming models.

Feature: Ability to parallelize small parts of an application at a time
  Shared Memory: Relatively easy to do. Reward versus effort varies widely.
  Distributed Memory: Relatively difficult to do. Tends to require more of an all-or-nothing effort.

Feature: Feasibility of scaling an application to a large number of processors
  Shared Memory: Currently, few vendors provide scalable shared memory systems (e.g., ccNUMA systems).
  Distributed Memory: Most vendors provide the ability to cluster nonshared memory systems with moderate to high-performance interconnects.

Feature: Additional complexity over serial code to be addressed by programmer
  Shared Memory: Simple parallel algorithms are easy and fast to implement. Implementation of highly scalable complex algorithms is supported but more involved.
  Distributed Memory: Significant additional overhead and complexity even for implementing simple and localized parallel constructs.

Feature: Impact on code quantity (e.g., amount of additional code required) and code quality (e.g., the readability of the parallel code)
  Shared Memory: Typically requires a small increase in code size (2–25%) depending on extent of changes required for parallel scalability. Code readability requires some knowledge of shared memory constructs, but is otherwise maintained as directives embedded within serial code.
  Distributed Memory: Tends to require extra copying of data into temporary message buffers, resulting in a significant amount of message handling code. Developer is typically faced with extra code complexity even in non-performance-critical code segments. Readability of code suffers accordingly.

Feature: Availability of application development and debugging environments
  Shared Memory: Requires a special compiler and a runtime library that supports OpenMP. Well-written code will compile and run correctly on one processor without an OpenMP compiler. Debugging tools are an extension of existing serial code debuggers. Single memory address space simplifies development and support of a rich debugger functionality.
  Distributed Memory: Does not require a special compiler. Only a library for the target computer is required, and these are generally available. Debuggers are more difficult to implement because a direct, global view of all program memory is not available.

There are other implementation models that one could use instead of OpenMP, including Pthreads [NBF 96], MPI [PP 96], HPF [KLS 94], and so on. The choice of an implementation model is largely determined by the type of computer architecture targeted for the application, the nature of the application, and a healthy dose of personal preference.

The message passing programming model has now been very effectively standardized by MPI. MPI is a portable, widely available, and accepted standard for writing message passing programs. Unfortunately, message passing is generally regarded as a difficult way to program. It requires that the program’s data structures be explicitly partitioned, and typically the entire application must be parallelized in order to work with the partitioned data structures. There is usually no incremental path to parallelizing an application in this manner. Furthermore, modern multiprocessor architectures are increasingly providing hardware support for cache-coherent shared memory; therefore, message passing is becoming unnecessary and overly restrictive for these systems.

Pthreads is an accepted standard for shared memory programming in the low end. However, it is not targeted at the technical or high-performance computing (HPC) spaces. There is little Fortran support for Pthreads, and even for many HPC class C and C++ language-based applications, the Pthreads model is lower level and awkward, being more suitable for task parallelism rather than data parallelism. Portability with Pthreads, as with any standard, requires that the target platform provide a standard-conforming implementation of Pthreads.

The option of developing new computer languages may be the cleanest and most efficient way to provide support for parallel processing. However, practical issues make the wide acceptance of a new computer language close to impossible. Nobody likes to rewrite old code to new languages. It is difficult to justify such effort in most cases. Also, educating and convincing a large enough group of developers to make a new language gain critical mass is an extremely difficult task.

A pure library approach was initially considered as an alternative for what eventually became OpenMP. Two factors led to rejection of a library-only methodology. First, it is far easier to write portable code using directives because they are automatically ignored by a compiler that does not support OpenMP. Second, since directives are recognized and processed by a compiler, they offer opportunities for compiler-based optimizations. Likewise, a pure directive approach is difficult as well: some necessary functionality is quite awkward to express through directives and ends up looking like executable code in directive syntax. Therefore, a small API defined by a mixture of directives and some simple library calls was chosen. The OpenMP API does address the portability issue of OpenMP library calls in non-OpenMP environments, as will be shown later.

1.5 History of OpenMP

Although OpenMP is a recently (1997) developed industry standard, it is very much an evolutionary step in a long history of shared memory programming models. The closest previous attempt at a standard shared memory programming model was the now dormant ANSI X3H5 standards effort [X3H5 94]. X3H5 was never formally adopted as a standard largely because interest waned as a wide variety of distributed memory machines came into vogue during the late 1980s and early 1990s. Machines like the Intel iPSC and the TMC Connection Machine were the platforms of choice for a great deal of pioneering work on parallel algorithms. The Intel machines were programmed through proprietary message passing libraries and the Connection Machine through the use of data parallel languages like CMFortran and C* [TMC 91]. More recently, languages such as High Performance Fortran (HPF) [KLS 94] have been introduced, similar in spirit to CMFortran. All of the high-performance shared memory vendors support some subset of the OpenMP functionality, but application portability has been almost impossible to attain. Developers have been restricted to using only the most basic common functionality that was available across all of them, which most often limited them to parallelization of only single loops. Some third-party compiler products offered more advanced solutions including more of the X3H5 functionality. However, all available methods lacked direct support for developing highly scalable parallel applications like those examined in Section 1.1. This scalability shortcoming inherent in all of the support models is fairly natural given that mainstream scalable shared memory computer hardware has only become available recently.

The OpenMP initiative was motivated from the developer community. There was increasing interest in a standard they could reliably use to move code between the different parallel shared memory platforms they supported. An industry-based group of application and compiler specialists from a wide range of leading computer and software vendors came together as the definition of OpenMP progressed. Using X3H5 as a starting point, adding more consistent semantics and syntax, adding functionality known to be useful in practice but not covered by X3H5, and directly supporting scalable parallel programming, OpenMP went from concept to adopted industry standard from July 1996 to October 1997. Along the way, the OpenMP Architectural Review Board (ARB) was formed.

For more information on the ARB, and as a great OpenMP resource in general, check out the Web site at www.OpenMP.org.

1.6 Navigating the Rest of the Book

This book is written to be introductory in nature while still being of value to those approaching OpenMP with significant parallel programming experience. Chapter 2 provides a general overview of OpenMP and is designed to get a novice up and running basic OpenMP-based programs. Chapter 3 focuses on the OpenMP mechanisms for exploiting loop-level parallelism. Chapter 4 presents the constructs in OpenMP that go beyond loop-level parallelism and exploit more scalable forms of parallelism based on parallel regions. Chapter 5 describes the synchronization constructs in OpenMP. Finally, Chapter 6 discusses the performance issues that arise when programming a shared memory multiprocessor using OpenMP.

CHAPTER 2

Getting Started with OpenMP

2.1 Introduction

A PARALLEL PROGRAMMING LANGUAGE MUST PROVIDE SUPPORT for the three basic aspects of parallel programming: specifying parallel execution, communicating between multiple threads, and expressing synchronization between threads. Most parallel languages provide this support through extensions to an existing sequential language; this has the advantage of providing parallel features within a familiar programming environment.

Different programming languages have taken different approaches to providing these extensions. Some languages provide additional constructs within the base language to express parallel execution, communication, and so on (e.g., the forall construct in Fortran-95 [ABM 97, MR 99]). Rather than designing additional language constructs, other approaches provide directives that can be embedded within existing sequential programs in the base language; this includes approaches such as HPF [KLS 94]. Finally, application programming interfaces such as MPI [PP 96] and various threads packages such as Pthreads [NBF 96] don’t design new language constructs: rather, they provide support for expressing parallelism through calls to runtime library routines.

OpenMP takes a directive-based approach for supporting parallelism. It consists of a set of directives that may be embedded within a program written in a base language such as Fortran, C, or C++.


There are two compelling benefits of a directive-based approach that led to this choice: The first is that this approach allows the same code base to be used for development on both single-processor and multiprocessor platforms; on the former, the directives are simply treated as comments and ignored by the language translator, leading to correct serial execution. The second related benefit is that it allows an incremental approach to parallelism—starting from a sequential program, the programmer can embellish the same existing program with directives that express parallel execution.

This chapter gives a high-level overview of OpenMP. It describes the basic constructs as well as the runtime execution model (i.e., the effect of these constructs when the program is executed). It illustrates these basic constructs with several examples of increasing complexity. This chapter will provide a bird’s-eye view of OpenMP; subsequent chapters will discuss the individual constructs in greater detail.

2.2 OpenMP from 10,000 Meters

At its most elemental level, OpenMP is a set of compiler directives to express shared memory parallelism. These directives may be offered within any base language—at this time bindings have been defined for Fortran, C, and C++ (within the C/C++ languages, directives are referred to as “pragmas”). Although the basic semantics of the directives is the same, special features of each language (such as allocatable arrays in Fortran 90 or class objects in C++) require additional semantics over the basic directives to support those features. In this book we largely use Fortran 77 in our examples simply because the Fortran specification for OpenMP has existed the longest, and several Fortran OpenMP compilers are available.

In addition to directives, OpenMP also includes a small set of runtime library routines and environment variables (see Figure 2.1). These are typically used to examine and modify the execution parameters. For instance, calls to library routines may be used to control the degree of parallelism exploited in different portions of the program.

These three pieces—the directive-based language extensions, the runtime library routines, and the environment variables—taken together define what is called an application programming interface, or API. The OpenMP API is independent of the underlying machine/operating system. OpenMP compilers exist for all the major versions of UNIX as well as Windows NT. Porting a properly written OpenMP program from one system to another should simply be a matter of recompiling.

Figure 2.1 The components of OpenMP: compiler directives, runtime library routines, and environment variables.

Furthermore, C and C++ OpenMP implementations provide a standard include file, called omp.h, that provides the OpenMP type definitions and library function prototypes. This file should therefore be included by all C and C++ OpenMP programs.

The language extensions in OpenMP fall into one of three categories: control structures for expressing parallelism, data environment constructs for communicating between threads, and synchronization constructs for coordinating the execution of multiple threads. We give an overview of each of the three classes of constructs in this section, and follow this with simple example programs in the subsequent sections. Prior to all this, however, we must present some sundry details on the syntax for OpenMP statements and conditional compilation within OpenMP programs. Consider it like medicine: it tastes bad but is good for you, and hopefully you only have to take it once.

2.2.1 OpenMP Compiler Directives or Pragmas

Before we present specific OpenMP constructs, we give an overview of the general syntax of directives (in Fortran) and pragmas (in C and C++). Fortran source may be specified in either fixed form or free form. In fixed form, a line that begins with one of the following prefix keywords (also referred to as sentinels):

!$omp ...
c$omp ...
*$omp ...

and contains either a space or a zero in the sixth column is treated as an OpenMP directive by an OpenMP compiler, and treated as a comment (i.e., ignored) by a non-OpenMP compiler. Furthermore, a line that begins with one of the above sentinels and contains a character other than a space or a zero in the sixth column is treated as a continuation directive line by an OpenMP compiler. 18 Chapter 2—Getting Started with OpenMP

In free-form Fortran source, a line that begins with the sentinel

!$omp ...

is treated as an OpenMP directive. The sentinel may begin in any column so long as it appears as a single word and is preceded only by white space. A directive that needs to be continued on the next line is expressed

!$omp &

with the ampersand as the last token on that line.
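To make the fixed-form continuation rule concrete, here is a minimal sketch (not an example from the text; the loop and variable names are invented for illustration) in which a parallel do directive is split across two lines:

      subroutine scale(a, b, n)
      integer n, i
      real a(n), b(n), temp
c     The sentinel occupies columns 1 through 5 of each
c     directive line; the "+" in column 6 of the second line
c     marks it as a continuation of the first.
c$omp parallel do shared(a, b, n)
c$omp+ private(i, temp)
      do i = 1, n
         temp = 2.0 * b(i)
         a(i) = temp
      enddo
      end

In free form, the same directive would simply end its first line with an ampersand, as described above.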

C and C++ OpenMP pragmas follow the syntax

#pragma omp ...

The omp keyword distinguishes the pragma as an OpenMP pragma, so that it is processed as such by OpenMP compilers and ignored by non-OpenMP compilers. Since OpenMP directives are identified by a well-defined prefix, they are easily ignored by non-OpenMP compilers. This allows application developers to use the same source code base for building their application on either kind of platform—a parallel version of the code on platforms that support OpenMP, and a serial version of the code on platforms that do not support OpenMP. Furthermore, most OpenMP compilers provide an option to disable the processing of OpenMP directives. This allows application developers to use the same source code base for building both parallel and sequential versions of an application using just a compile-time flag.

Conditional Compilation

The selective disabling of OpenMP constructs applies only to directives, whereas an application may also contain statements that are specific to OpenMP. This could include calls to runtime library routines or just other code that should only be executed in the parallel version of the code. This presents a problem when compiling a serial version of the code (i.e., with OpenMP support disabled), such as calls to library routines that would not be available.

OpenMP addresses this issue through a conditional compilation facility that works as follows. In Fortran any statement that we wish to be included only in the parallel compilation may be preceded by a specific sentinel. Any statement that is prefixed with the sentinel !$, c$, or *$ starting in column one in fixed form, or the sentinel !$ starting in any column but preceded only by white space in free form, is compiled only when OpenMP support is enabled, and ignored otherwise.

These prefixes can therefore be used to mark statements that are relevant only to the parallel version of the program.

In Example 2.1, the line containing the call to omp_get_thread_num starts with the prefix !$ in column one. As a result it looks like a normal Fortran comment and will be ignored by default. When OpenMP compilation is enabled, not only are directives with the !$omp prefix enabled, but the lines with the !$ prefix are also included in the compiled code. The two characters that make up the prefix are replaced by white spaces at compile time. As a result only the parallel version of the program (i.e., with OpenMP enabled) makes the call to the library routine. The serial version of the code ignores that entire statement, including the call and the assignment to iam.

Example 2.1 Using the conditional compilation facility.

      iam = 0

! The following statement is compiled only when
! OpenMP is enabled, and is ignored otherwise

!$    iam = omp_get_thread_num()
      ...

! The following statement is incorrect, since
! the sentinel is not preceded by white space
! alone
      y = x !$ + offset
      ...

! This is the correct way to write the above
! statement. The right-hand side of the
! following assignment is x + offset with OpenMP
! enabled, and only x otherwise.
      y = x &
!$&       + offset

In C and C++ all OpenMP implementations are required to define the preprocessor macro name _OPENMP to the value of the year and month of the approved OpenMP specification in the form yyyymm. This macro may be used to selectively enable/disable the compilation of any OpenMP specific piece of code.

The conditional compilation facility should be used with care since the prefixed statements are not executed during serial (i.e., non-OpenMP) compilation. For instance, in the previous example we took care to initialize the iam variable with the value zero, followed by the conditional assignment of the thread number to the variable. The initialization to zero ensures that the variable is correctly defined in serial compilation when the subsequent assignment is ignored.

That completes our discussion of syntax in OpenMP. In the remainder of this section we present a high-level overview of the three categories of language extension comprising OpenMP: parallel control structures, data environment, and synchronization.

2.2.2 Parallel Control Structures

Control structures are constructs that alter the flow of control in a program. We call the basic execution model for OpenMP a fork/join model, and parallel control structures are those constructs that fork (i.e., start) new threads, or give execution control to one or another set of threads. OpenMP adopts a minimal set of such constructs. Experience has shown that only a few control structures are truly necessary for writing most parallel applications. OpenMP includes a control structure only in those instances where a compiler can provide both functionality and performance over what a user could reasonably program.

OpenMP provides two kinds of constructs for controlling parallelism. First, it provides a directive to create multiple threads of execution that execute concurrently with each other. The only instance of this is the parallel directive: it encloses a block of code and creates a set of threads that each execute this block of code concurrently. Second, OpenMP provides constructs to divide work among an existing set of parallel threads. An instance of this is the do directive, used for exploiting loop-level parallelism. It divides the iterations of a loop among multiple concurrently executing threads. We present examples of each of these directives in later sections.
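As a hedged preview rather than an example from the text (the loop and the variable names are made up for illustration), the two kinds of constructs often appear together: the parallel directive forks a team of threads, and the do directive then divides the iterations of the enclosed loop among the members of that team.

      program saxpy
      integer i, n
      parameter (n = 1000)
      real x(n), y(n), a

      a = 2.0
      do i = 1, n
         x(i) = real(i)
         y(i) = 0.0
      enddo

c     Fork a team of threads, then split the loop iterations
c     among the members of that team.
!$omp parallel
!$omp do
      do i = 1, n
         y(i) = y(i) + a * x(i)
      enddo
!$omp end do
!$omp end parallel

      print *, 'y(n) = ', y(n)
      end

The combined parallel do directive, introduced in Chapter 3, expresses this common idiom more compactly.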

2.2.3 Communication and Data Environment

An OpenMP program always begins with a single thread of control that has associated with it an execution context or data environment (we will use the two terms interchangeably). This initial thread of control is referred to as the master thread. The execution context for a thread is the data address space containing all the variables specified in the program. This

includes global variables, automatic variables within subroutines (i.e., allocated on the stack), as well as dynamically allocated variables (i.e., allocated on the heap). The master thread and its execution context exist for the duration of the entire program. When the master thread encounters a parallel construct, new threads of execution are created along with an execution context for each thread. Let us now examine how the execution context for a parallel thread is determined. Each thread has its own stack within its execution context. This private stack is used for stack frames for subroutines invoked by that thread. As a result, multiple threads may individually invoke subroutines and execute safely without interfering with the stack frames of other threads. For all other program variables, the OpenMP parallel construct may choose to either share a single copy between all the threads or provide each thread with its own private copy for the duration of the parallel construct. This determination is made on a per-variable basis; therefore it is possible for threads to share a single copy of one variable, yet have a private per-thread copy of another variable, based on the requirements of the algorithms utilized. Furthermore, this determination of which variables are shared and which are private is made at each parallel construct, and may vary from one parallel construct to another. This distinction between shared and private copies of variables during parallel constructs is specified by the programmer using OpenMP data scoping clauses for individual variables. These clauses are used to determine the execution context for the parallel threads. A variable may have one of three basic attributes: shared, private, or reduction. These are discussed at some length in later chapters. At this early stage it is sufficient to understand that these scope clauses define the sharing attributes of an object. A variable that has the shared scope clause on a parallel construct will have a single storage location in memory for the duration of that parallel construct. All parallel threads that reference the variable will always access the same memory location. That piece of memory is shared by the parallel threads. Communication between multiple OpenMP threads is therefore easily expressed through ordinary read/write operations on such shared variables in the program. Modifications to a variable by one thread are made available to other threads through the underlying shared memory mechanisms. In contrast, a variable that has private scope will have multiple storage locations, one within the execution context of each thread, for the duration of the parallel construct. All read/write operations on that variable by a thread will refer to the private copy of that variable within that

thread. This memory location is inaccessible to the other threads. The most common use of private variables is scratch storage for temporary results. The reduction clause is somewhat trickier to understand, since reduc- tion variables have both private and shared storage behavior. As the name implies, the reduction attribute is used on objects that are the target of an arithmetic reduction. Reduction operations are important to many applica- tions, and the reduction attribute allows them to be implemented by the compiler efficiently. The most common example is the final summation of temporary local variables at the end of a parallel construct. In addition to these three, OpenMP provides several other data scop- ing attributes. We defer a detailed discussion of these attributes until later chapters. For now, it is sufficient to understand the basic OpenMP mecha- nism: the data scoping attributes of individual variables may be controlled along with each OpenMP construct.

2.2.4 Synchronization

Multiple OpenMP threads communicate with each other through ordinary reads and writes to shared variables. However, it is often necessary to coordinate the access to these shared variables across multiple threads. Without any coordination between threads, it is possible that multiple threads may simultaneously attempt to modify the same variable, or that one thread may try to read a variable even as another thread is modifying that same variable. Such conflicting accesses can potentially lead to incorrect data values and must be avoided by explicit coordination between multiple threads. The term synchronization refers to the mechanisms by which a parallel program can coordinate the execution of multiple threads. The two most common forms of synchronization are mutual exclusion and event synchronization. A mutual exclusion construct is used to control access to a shared variable by providing a thread exclusive access to a shared variable for the duration of the construct. When multiple threads are modifying the same variable, acquiring exclusive access to the variable before modifying it ensures the integrity of that variable. OpenMP provides mutual exclusion through a critical directive. Event synchronization is typically used to signal the occurrence of an event across multiple threads. The simplest form of event synchronization is a barrier. A barrier directive in a parallel program defines a point where each thread waits for all other threads to arrive. Once all the threads arrive at that point, they can all continue execution past the barrier. Each thread is therefore guaranteed that all the code before the barrier has been completed across all other threads.
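A hedged C sketch of both forms; the shared counter and the printed message are our own invention:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int count = 0;

        #pragma omp parallel
        {
            /* Mutual exclusion: one thread at a time updates the counter. */
            #pragma omp critical
            count = count + 1;

            /* Event synchronization: no thread passes the barrier until
               every thread has finished its update above. */
            #pragma omp barrier

            #pragma omp master
            printf("all %d threads have checked in\n", count);
        }
        return 0;
    }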

In addition to critical and barrier, OpenMP provides several other synchronization constructs. Some of these constructs make it convenient to express common synchronization patterns, while the others are useful in obtaining the highest performing implementation. These various constructs are discussed in greater detail in Chapter 5. That completes our high-level overview of the language. Some of the concepts presented may not become meaningful until you have more experience with the language. At this point, however, we can begin presenting concrete examples and explain them using the model described in this section.

2.3 Parallelizing a Simple Loop

Enough of high-level concepts: let us look at a simple parallel program that shows how to execute a simple loop in parallel. Consider Example 2.2: a multiply-add, or saxpy loop as it is often called (for “single-precision a*x plus y”). In a true saxpy the variable y is an array, but for simplicity here we have y as just a scalar variable.

Example 2.2 A simple saxpy-like loop.

      subroutine saxpy(z, a, x, y, n)
      integer i, n
      real z(n), a, x(n), y

      do i = 1, n
         z(i) = a * x(i) + y
      enddo
      return
      end

The loop in Example 2.2 has no dependences. In other words the result of one loop iteration does not depend on the result of any other iteration. This means that two different iterations could be executed simultaneously by two different processors. We parallelize this loop using the parallel do construct as shown in Example 2.3.

Example 2.3 The saxpy-like loop parallelized using OpenMP.

      subroutine saxpy(z, a, x, y, n)
      integer i, n
      real z(n), a, x(n), y

!$omp parallel do
      do i = 1, n
         z(i) = a * x(i) + y
      enddo
      return
      end

As we can see, the only change to the original program is the addition of the parallel do directive. This directive must be followed by a do loop construct and specifies that the iterations of the do loop be executed concurrently across multiple threads. An OpenMP compiler must create a set of threads and distribute the iterations of the do loop across those threads for parallel execution. Before describing the runtime execution in detail, notice the minimal changes to the original sequential program. Furthermore, the original program remains “unchanged.” When compiled using a non-OpenMP compiler, the parallel do directive is simply ignored, and the program continues to run serially and correctly.
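For comparison with the C/C++ form of the API, a hedged sketch of the same routine; this transcription is ours, and the arrays are indexed from 0:

    /* z and x are arrays of length n; a and y are scalars, as in Example 2.2. */
    void saxpy(float *z, float a, const float *x, float y, int n)
    {
        int i;

        /* Plays the same role as !$omp parallel do: iterations of the
           following loop are divided among a team of threads. */
        #pragma omp parallel for
        for (i = 0; i < n; i++)
            z[i] = a * x[i] + y;
    }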

2.3.1 Runtime Execution Model of an OpenMP Program

Let us examine what happens when the saxpy subroutine is invoked. This is most easily explained through the execution diagram depicted in Figure 2.2, which presents the execution of our example on four threads. Each vertical line represents a thread of execution. Time is measured along the vertical axis and increases as we move downwards along that axis. As we can see, the original program executes serially, that is, with a single thread, referred to as the “master thread” in an OpenMP program. This master thread invokes the saxpy subroutine and encounters the parallel do directive. At this point the master thread creates some additional threads (three in this example), and together with these additional threads (often referred to as “slave threads”) forms a team of four parallel threads. These four threads divide the iterations of the do loop among themselves, with each thread executing a subset of the total number of iterations. There is an implicit barrier at the end of the parallel do construct. Therefore, after finishing its portions of the iterations, each thread waits for the remaining threads to finish their iterations. Once all the threads have finished and all iterations have been executed, the barrier condition is complete and all threads are released from the barrier. At this point, the slave threads disappear and the master thread resumes execution of the code past the do loop of the parallel do construct. A “thread” in this discussion refers to an independent locus of control that executes within the same shared address space as the original

Figure 2.2 Runtime execution of the saxpy parallel program. (Figure annotations, top to bottom: master thread executes serial portion of the code; master thread enters the saxpy subroutine; master thread encounters parallel do directive and creates slave threads; master and slave threads divide iterations of the parallel do loop and execute them concurrently; implicit barrier, where each thread waits for all threads to finish their iterations; master thread resumes execution after the do loop and slave threads disappear.)

sequential program, with direct access to all of its variables. OpenMP does not specify the underlying execution vehicle; that is exclusively an implementation issue. An OpenMP implementation may choose to provide this abstraction in any of multiple ways—for instance, one implementation may map an OpenMP thread onto an operating system process, while another may map it onto a lightweight thread such as a Pthread. The choice of implementation does not affect the thread abstraction that we have just described. A second issue is the manner in which iterations of the do loop are divided among the multiple threads. The iterations of the do loop are not replicated; rather, each thread is assigned a unique and distinct set of iterations to execute. Since the iterations of the do loop are assumed to be independent and can execute concurrently, OpenMP does not specify how the iterations are to be divided among the threads; this choice is left to the OpenMP compiler implementation. However, since the distribution of loop iterations across threads can significantly affect performance, OpenMP does supply additional attributes that can be provided with the parallel do directive and used to specify how the iterations are to be distributed across threads. These mechanisms are discussed later in Chapters 3 and 6.
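As a preview only (the details are deferred to those chapters), one such attribute is the schedule clause; a hedged C sketch, with an arbitrary chunk size of 4:

    void saxpy_chunked(float *z, float a, const float *x, float y, int n)
    {
        int i;

        /* schedule(static, 4) deals chunks of 4 consecutive iterations to
           the threads round-robin; it changes only how the iterations are
           distributed, not what the loop computes. */
        #pragma omp parallel for schedule(static, 4)
        for (i = 0; i < n; i++)
            z[i] = a * x[i] + y;
    }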

2.3.2 Communication and Data Scoping

The shared memory model within OpenMP prescribes that multiple OpenMP threads execute within the same shared address space; we now

examine the data references in the previous example within this memory model. Each iteration of the loop reads the scalar variables a and y, as well as an element of the array x; it updates the corresponding element of the array z. What happens when multiple iterations of the loop execute concurrently on different threads? The variables that are being read—a, y, and x—can be read directly from shared memory since their values remain unchanged for the duration of the loop. Furthermore, since each iteration updates a distinct element of z, updates from different iterations are really to different memory locations and do not step over each other; these updates too can be made directly to shared memory. However, the loop index variable i presents a problem: since each iteration needs a distinct value for the value of the loop index variable, multiple iterations cannot use the same memory location for i and still execute concurrently. This issue is addressed in OpenMP through the notion of private variables. For a parallel do directive, in OpenMP the default rules state that the do loop index variable is private to each thread, and all other variable references are shared. We illustrate this further in Figure 2.3. The top of the figure illustrates the data context during serial execution. Only the master thread is executing, and there is a single globally shared instance of each variable in the program as shown. All variable references made by the single master thread refer to the instance in the global address space. The bottom of the figure illustrates the context during parallel execution. As shown, multiple threads are executing concurrently within the same global address space used during serial execution. However, along with the global memory as before, every thread also has a private copy of the variable i within its context. All variable references to variable i within a thread refer to the private copy of the variable within that thread; references to variables other than i, such as a, x, and so on, refer to the instance in global shared memory. Let us describe the behavior of private variables in more detail. As shown in the figure, during parallel execution each thread works with a new, private instance of the variable i. This variable is newly created within each thread at the start of the parallel construct parallel do. Since a private variable gets a new memory location with each thread, its initial value is undefined within the parallel construct. Furthermore, once the parallel do construct has completed and the master thread resumes serial execution, all private instances of the variable i disappear, and the master thread continues execution as before, within the shared address space including i. Since the private copies of the variable i have disappeared and we revert back to the single global address space, the value of i is assumed to be undefined after the parallel construct.

Figure 2.3 The behavior of private variables in an OpenMP program. (The figure shows a global shared memory holding the variables z, a, x, y, n, and i. During serial execution, only the master thread runs and all data references are to the global shared instances. During parallel execution, references to z, a, x, y, and n are still to the global shared instances, while each thread has a private copy of i and all references to i are to that private copy.)

This example was deliberately chosen to be simple and does not need any explicit scoping attributes. The scope of each variable is therefore automatically determined by the default OpenMP rules. Subsequent exam- ples in this chapter will present explicit data scoping clauses that may be provided by the programmer. These clauses allow the programmer to eas- ily specify the sharing properties of variables in the program for a parallel construct, and depend upon the implementation to provide the desired behavior.

2.3.3 Synchronization in the Simple Loop Example

Our simple example did not include any explicit synchronization construct; let us now examine the synchronization requirements of the code and understand why it still executes correctly. Synchronization is primarily used to control access to shared objects. There are two potential synchronization requirements in this example. First, the shared variable z is modified by multiple threads. However, recall that since each thread modifies a distinct element of z, there are no

data conflicts and the updates can proceed concurrently without synchronization. The second requirement is that all the values for the array z must have been updated when execution continues after the parallel do loop. Otherwise the master thread (recall that only the master thread executes the code after the parallel do construct) may not see the “latest” values of z in subsequent code because there may still be slave threads executing their sets of iterations. This requirement is met by the following property of the parallel do construct in OpenMP: the parallel do directive has an implicit barrier at its end—that is, each thread waits for all other threads to complete their set of iterations. In particular the master thread waits for all slave threads to complete, before resuming execution after the parallel do construct. This in turn guarantees that all iterations have completed, and all the values of z have been computed. The default properties of the parallel do construct obviated the need for explicit synchronization in this example. Later in the chapter we present more complex codes that illustrate the synchronization constructs provided in OpenMP.

2.3.4 Final Words on the Simple Loop Example

The kind of parallelism exposed in this example is known as loop-level parallelism. As we saw, this type of parallelism is relatively easy to express and can be used to parallelize large codes in a straightforward manner simply by incrementally parallelizing individual loops, perhaps one at a time. However, loop-level parallelism does have its limitations. Applications that spend substantial portions of their execution time in noniterative (i.e., nonloop) constructs are less amenable to this form of parallelization. Furthermore, each parallel loop incurs the overhead for joining the threads at the end of the loop. As described above, each join is a synchronization point where all the threads must wait for the slowest one to arrive. This has a negative impact on performance and scalability since the program can only run as fast as the slowest thread in each parallel loop. Nonetheless, if the goal is to parallelize a code for a modest-size multiprocessor, loop-level parallelism remains a very attractive approach. In a later section we will describe a more general method for parallelizing applications that can be used for nonloop constructs as well. First, however, we will consider a slightly more complicated loop and further investigate the issues that arise in parallelizing loops.

2.4 A More Complicated Loop

It would be nice if all loops were as simple as the one in Example 2.3; however, that is seldom the case. In this section we look at a slightly more complicated loop and how OpenMP is used to address the new issues that arise. The loop we examine is the outer loop in a Mandelbrot generator. A Mandelbrot generator is simply a program that determines which points in a plane belong to the Mandelbrot set. This is done by computing an iterative equation for each point we consider. The result of this calculation can then be used to color the corresponding pixel to generate the ubiquitous Mandelbrot image. This is a nice example because the Mandelbrot image is a visual representation of the underlying computation. However, we need not overly concern ourselves with the mechanics of computing the Mandelbrot set but rather focus on the structure of the loop.

Example 2.4 Mandelbrot generator: serial version.

      real*8 x, y
      integer i, j, m, n, maxiter
      integer depth(*, *)
      integer mandel_val

      ...
      maxiter = 200
      do i = 1, m
         do j = 1, n
            x = i/real(m)
            y = j/real(n)
            depth(j, i) = mandel_val(x, y, maxiter)
         enddo
      enddo

Example 2.4 presents the sequential code for a Mandelbrot generator. For simplicity we have left out the code for evaluating the Mandelbrot equation (it is in the function mandel_val). The code we present is fairly straightforward: we loop for m points in the i direction and n points in the j direction, generate an x and y coordinate for each point (here restricted to the range (0,1]), and then compute the Mandelbrot equation for the given coordinates. The result is assigned to the two-dimensional array depth. Presumably this array will next be passed to a graphics routine for drawing the image. Our strategy for parallelizing this loop remains the same as for the saxpy loop—we would like to use the parallel do directive. However, we

must first convince ourselves that different iterations of the loop are actually independent and can execute in parallel, that is, that there are no data dependences in the loop from one iteration to another. Our brief description of a Mandelbrot generator would indicate that there are no dependences. In other words, the result for computing the Mandelbrot equation on a particular point does not depend on the result from any other point. This is evident in the code itself. The function mandel_val only takes x, y, and maxiter as arguments, so it can only know about the one point it is working on. If mandel_val included i or j as arguments, then there might be reason to suspect a dependence because the function could conceivably reference values for some other point than the one it is currently working on. Of course in practice we would want to look at the source code for mandel_val to be absolutely sure there are no dependences. There is always the possibility that the function modifies global structures not passed through the argument list, but that is not the case in this example. As a matter of jargon, a function such as this one that can safely be executed in parallel is referred to as being thread-safe. More interesting from a programming point of view, of course, are those functions that are not inherently thread-safe but must be made so through the use of synchronization. Having convinced ourselves that there are no dependences, let us look more closely at what the loop is doing and how it differs from the saxpy loop of Example 2.3. The two loops differ in some fundamental ways. Additional complexities in this loop include a nested loop, three more scalar variables being assigned (j, x, and y), and a function call in the innermost loop. Let us consider each of these added complexities in terms of our runtime execution model. Our understanding of the parallel do/end parallel do directive pair is that it will take the iterations of the enclosed loop, divide them among some number of parallel threads, and let each parallel thread execute its set of iterations. This does not change in the presence of a nested loop or a called function. Each thread will simply execute the nested loop and call the function in parallel with the other threads. So as far as control constructs are concerned, given that the called function is thread-safe, there is no additional difficulty on account of the added complexity of the loop. As one may suspect, things are not so simple with the data environment. Any time we have variables being assigned values inside a parallel loop, we need to concern ourselves with the data environment. We know from the saxpy example that by default the loop index variable i will be private and everything else in the loop will be shared. This is appropriate for m and n since they are only read across different iterations and don’t change in value from one iteration to the next. Looking at the other

variables, though, the default rules are not accurate. We have a nested loop index variable j, and as a rule loop index variables should be private. The reason of course is that we want each thread to work on its own set of iterations. If j were shared, we would have the problem that there would be just one “global” value (meaning the same for all threads) for j. Consequently, each time a thread would increment j in its nested loop, it would also inadvertently modify the index variable for all other threads as well. So j must be a private variable within each thread. We do this with the private clause by simply specifying private(j) on the parallel do directive. Since this is almost always the desired behavior, loop index variables in Fortran are treated as having private scope by default, unless specified otherwise.1 What about the other variables, x and y? The same reasoning as above leads us to conclude that these variables also need to be private. Consider that i and j are private, and x and y are calculated based on the values of i and j; therefore it follows that x and y should be private. Alternatively, remember that x and y store the coordinates for the point in the plane for which we will compute the Mandelbrot equation. Since we wish each thread to work concurrently on a set of points in the plane, we need to have “parallel” (meaning here “multiple”) storage for describing a point such that each thread can work independently. Therefore we must specify x and y to be private. Finally we must consider synchronization in this loop. Recall that the main use of synchronization is to control access to shared objects. In the saxpy example, the only synchronization requirement was an implicit barrier at the close of the parallel do. This example is no different. There are two shared objects, maxiter and depth. The variable maxiter is only read and never written (we would need to check the mandel_val function to confirm this); consequently there is no reason or need to synchronize references to this shared variable. On the other hand, the array depth is modified in the loop. However, as with the saxpy example, the elements of the array are modified independently by each parallel thread, so the only synchronization necessary is at the end of the loop. In other words, we cannot allow the master thread to reference the depth array until all the array elements have been updated. This necessitates a barrier at the end of the loop, but we do not have to explicitly insert one since it is implicit in the end parallel do directive.

1 This is actually a tricky area where the Fortran default rules differ from the C/C++ rules. In Fortran, the compiler will make j private by default, but in C/C++ the programmer must declare it private. Chapter 3 discusses the default rules for Fortran and C/C++ in greater detail.
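To make the footnote concrete, here is a hedged C sketch of the same loop; the routine name, array layout, and 0-based bounds are our own, and mandel_val is assumed to be defined elsewhere:

    extern int mandel_val(double x, double y, int maxiter);

    void mandelbrot(int m, int n, int maxiter, int depth[m][n])
    {
        int i, j;
        double x, y;

        /* In C the inner index j is not given private scope by default,
           so it must be listed in the private clause along with x and y.
           The index i of the work-shared loop itself is private. */
        #pragma omp parallel for private(j, x, y)
        for (i = 0; i < m; i++) {
            for (j = 0; j < n; j++) {
                x = (i + 1) / (double) m;
                y = (j + 1) / (double) n;
                depth[i][j] = mandel_val(x, y, maxiter);
            }
        }
    }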

At this point we are ready to present the parallelized version of Example 2.4. As with the saxpy example, we parallelized the loop in Example 2.5 with just the parallel do directive. We also see our first example of the private clause. There are many more ways to modify the behavior of the parallel do directive; these are described in Chapter 3.

Example 2.5 Mandelbrot generator: parallel OpenMP version.

      maxiter = 200
!$omp parallel do private(j, x, y)
      do i = 1, m
         do j = 1, n
            x = i/real(m)
            y = j/real(n)
            depth(j, i) = mandel_val(x, y, maxiter)
         enddo
      enddo
!$omp end parallel do

2.5 Explicit Synchronization

In Section 2.3 we introduced the concept of synchronization in the context of a simple parallel loop. We elaborated on this concept in Section 2.4 with a slightly more complicated loop in our Mandelbrot example. Both of those loops presented just examples of implicit synchronization. This section will extend the Mandelbrot example to show why explicit synchronization is sometimes needed, and how it is specified. To understand this example we need to describe the mandel_val function in greater detail. The Mandelbrot equation (what is computed by mandel_val) is an iterative equation. This means that mandel_val iteratively computes the result for this equation until the result either has diverged or maxiter iterations have executed. The actual number of iterations executed is the value returned by mandel_val and assigned to the array depth(i,j). Imagine that, for whatever reason, it is important to know the total number of iterations executed by the Mandelbrot generator. One way to do this is to sum up the value for depth as it is computed. The sequential code to do this is trivial: we add a variable total_iters, which is initialized to zero, and we add into it each value of depth that we compute. The sequential code now looks like Example 2.6.

Example 2.6 Mandelbrot generator: computing the iteration count.

      maxiter = 200
      total_iters = 0

      do i = 1, m
         do j = 1, n
            x = i/real(m)
            y = j/real(n)
            depth(j, i) = mandel_val(x, y, maxiter)
            total_iters = total_iters + depth(j, i)
         enddo
      enddo

How does this change our parallel version? One might think this is pretty easy—we simply make total_iters a shared variable and leave everything else unchanged. This approach is only partially correct. Although total_iters needs to be shared, we must pay special attention to any shared variable that is modified in a parallel portion of the code. When multiple threads write to the same variable, there is no guaranteed order among the writes by the multiple threads, and the final value may only be one of a set of possible values. For instance, consider what happens when the loop executes with two threads. Both threads will simultaneously read the value stored in total_iters, add in their value of depth(j,i), and write out the result to the single storage location for total_iters. Since the threads execute asynchronously, there is no guarantee as to the order in which the threads read or write the result. It is possible for both threads to read the same value of total_iters, in which case the final result will contain the increment from only one thread rather than both. Another possibility is that although total_iters is read by the threads one after the other, one thread reads an earlier (and likely smaller) value but writes its updated result later after the other thread. In this case the update executed by the other thread is lost and we are left with an incorrect result. This phenomenon is known as a race condition on accesses to a shared variable. Such accesses to a shared variable must be controlled through some form of synchronization that allows a thread exclusive access to the shared variable. Having exclusive access to the variable enables a thread to atomically perform its read, modify, and update operation on the variable. This mutual exclusion synchronization is expressed using the critical construct in OpenMP. The code enclosed by the critical/end critical directive pair in Example 2.7 can be executed by only one thread at a time. The first thread to reach the critical directive gets to execute the code. Any other threads wanting to execute the code must wait until the current thread exits the critical section. This means that only one thread at a time can update the value of total_iters, so we are assured that total_iters always stores the latest result and no updates are lost. On exit from the parallel loop, total_iters will store the correct result.

Example 2.7 Counting within a critical section.

!$omp critical
      total_iters = total_iters + depth(j, i)
!$omp end critical
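In C syntax the same protection is written with a critical pragma; a hedged sketch, with the surrounding loop abbreviated into a small routine of our own:

    /* total_iters is shared; the update is guarded by a critical
       construct so that no increment is lost. */
    long count_iterations(int m, int n, int depth[m][n])
    {
        long total_iters = 0;
        int i, j;

        #pragma omp parallel for private(j)
        for (i = 0; i < m; i++)
            for (j = 0; j < n; j++) {
                #pragma omp critical
                total_iters = total_iters + depth[i][j];
            }

        return total_iters;
    }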

The critical/end critical directive pair is an example of explicit synchronization. It must be specifically inserted by the programmer solely for synchronization purposes in order to control access to the shared variable total_iters. The previous examples of synchronization have been implicit because we did not explicitly specify them but rather the system inserted them for us. Figure 2.4 shows a schematic for the parallel execution of this loop with the critical section. The schematic is somewhat simplified. It assumes that all threads reach the critical region at the same time and shows only one execution through the critical section. Nonetheless it should be clear that inserting a critical section into a parallel loop can have a very negative impact on performance because it may force other threads to temporarily halt their execution.

Figure 2.4 Parallel loop with a critical section. (Figure annotations: “Critical section”; “Waiting to enter critical section”; although each thread executes a critical section, only one critical section executes at any one time, assuring mutual exclusion.)

OpenMP includes a rich set of synchronization directives as well as a full lock interface for more general kinds of locking than the simple mutual exclusion presented here. Synchronization is covered in greater detail in Chapter 5.

2.6 The reduction Clause

In Example 2.7 we saw a critical section being used to protect access to a shared variable. The basic operation we are executing on that variable is a sum reduction. Reductions are a sufficiently common type of operation that OpenMP includes a reduction data scope clause just to handle them.

Using the reduction clause, we can rewrite Example 2.7 as shown in Example 2.8.

Example 2.8 Using the reduction clause.

      maxiter = 200
      total_iters = 0

!$omp parallel do private(j, x, y)
!$omp+ reduction(+:total_iters)
      do i = 1, m
         do j = 1, n
            x = i/real(m)
            y = j/real(n)
            depth(j, i) = mandel_val(x, y, maxiter)
            total_iters = total_iters + depth(j, i)
         enddo
      enddo
!$omp end parallel do

All we have done here is add the clause reduction(+:total_iters), which tells the compiler that total_iters is the target of a sum reduction operation. The syntax allows for a large variety of reductions to be specified. The compiler, in conjunction with the runtime environment, will implement the reduction in an efficient manner tailored for the target machine. The compiler can best address the various system-dependent performance considerations, such as whether to use a critical section or some other form of communication. It is therefore beneficial to use the reduction attribute where applicable rather than “rolling your own.” The reduction clause is an actual data attribute distinct from either shared or private. The reduction variables have elements of both shared

and private, but are really neither. In order to fully understand the reduction attribute we need to expand our runtime model, specifically the data environment part of it. We defer this discussion to Chapter 3, where the reduction clause is discussed in greater detail.
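For the C form of the clause, a hedged sketch of the same summation; the routine name and 0-based layout are ours, and mandel_val is assumed to be defined elsewhere:

    extern int mandel_val(double x, double y, int maxiter);

    long mandelbrot_total(int m, int n, int maxiter, int depth[m][n])
    {
        long total_iters = 0;
        int i, j;
        double x, y;

        /* Each thread sums into a private copy of total_iters; the
           partial sums are combined when the loop ends. */
        #pragma omp parallel for private(j, x, y) reduction(+:total_iters)
        for (i = 0; i < m; i++)
            for (j = 0; j < n; j++) {
                x = (i + 1) / (double) m;
                y = (j + 1) / (double) n;
                depth[i][j] = mandel_val(x, y, maxiter);
                total_iters += depth[i][j];
            }

        return total_iters;
    }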

2.7 Expressing Parallelism with Parallel Regions

So far we have been concerned purely with exploiting loop-level parallelism. This is generally considered fine-grained parallelism. The term refers to the unit of work executed in parallel. In the case of loop-level parallelism, the unit of work is typically small relative to the program as a whole. Most programs involve many loops; if the loops are parallelized individually, then the amount of work done in parallel (before the parallel threads join with the master thread) is limited to the work of a single loop. In this section we approach the general problem of parallelizing a program that needs to exploit coarser-grained parallelism than possible with the simple loop-level approach described in the preceding sections. To do this we will extend our now familiar example of the Mandelbrot generator. Imagine that after computing the Mandelbrot set we wish to dither the depth array in order to soften the resulting image. We could extend our sequential Mandelbrot program as shown in Example 2.9. Here we have added a second array, dith_depth(m,n), to store the result from dithering the depth array.

Example 2.9 Mandelbrot generator: dithering the image.

      real x, y
      integer i, j, m, n, maxiter
      integer depth(*, *), dith_depth(*, *)
      integer mandel_val

      maxiter = 200
      do i = 1, m
         do j = 1, n
            x = i/real(m)
            y = j/real(n)
            depth(j, i) = mandel_val(x, y, maxiter)
         enddo
      enddo
      do i = 1, m
         do j = 1, n
            dith_depth(j, i) = 0.5 * depth(j, i) +
     $          0.25 * (depth(j - 1, i) + depth(j + 1, i))
         enddo
      enddo

How would we parallelize this example? Applying our previous strategy, we would parallelize each loop individually using a parallel do directive. Since we are dithering values only along the j direction, and not along the i direction, there are no dependences on i and we could parallelize the outer dithering loop simply with a parallel do directive. However, this would unnecessarily force the parallel threads to synchronize, join, and fork again between the Mandelbrot loop and the dithering loop. Instead, we would like to have each thread move on to dithering its piece of the depth array as soon as its piece has been computed. This requires that rather than joining the master thread at the end of the first parallel loop, each thread continues execution past the computation loop and onto its portion of the dithering phase. OpenMP supports this feature through the concept of a parallel region and the parallel/end parallel directives. The parallel and end parallel directives define a parallel region. The block of code enclosed by the parallel/end parallel directive pair is executed in parallel by a team of threads initially forked by the parallel directive. This is simply a generalization of the parallel do/end parallel do directive pair; however, instead of restricting the enclosed block to a do loop, the parallel/end parallel directive pair can enclose any structured block of Fortran statements. The parallel do directive is actually just a parallel directive immediately followed by a do directive, which of course implies that one or more do directives (with corresponding loops) could appear anywhere in a parallel region. This distinction and its ramifications are discussed in later chapters. Our runtime model from Example 2.3 remains unchanged in the presence of a parallel region. The parallel/end parallel directive pair is a control structure that forks a team of parallel threads with individual data environments to execute the enclosed code concurrently. This code is replicated across threads. Each thread executes the same code, although they do so asynchronously. The threads and their data environments disappear at the end parallel directive when the master thread resumes execution. We are now ready to parallelize Example 2.9. For simplicity we consider execution with just two parallel threads. Generalizing to an arbitrary number of parallel threads is fairly straightforward but would clutter the example and distract from the concepts we are trying to present. Example 2.10 presents the parallelized code.

Example 2.10 Mandelbrot generator: parallel version.

      maxiter = 200
!$omp parallel
!$omp+ private(i, j, x, y)
!$omp+ private(my_width, my_thread, i_start, i_end)
      my_width = m/2
      my_thread = omp_get_thread_num()
      i_start = 1 + my_thread * my_width
      i_end = i_start + my_width - 1
      do i = i_start, i_end
         do j = 1, n
            x = i/real(m)
            y = j/real(n)
            depth(j, i) = mandel_val(x, y, maxiter)
         enddo
      enddo

      do i = i_start, i_end
         do j = 1, n
            dith_depth(j, i) = 0.5 * depth(j, i) + &
                0.25 * (depth(j - 1, i) + depth(j + 1, i))
         enddo
      enddo
!$omp end parallel

Conceptually what we have done in Example 2.10 is divided the plane into two horizontal strips and forked a parallel thread for each strip. Each parallel thread first executes the Mandelbrot loop and then the dithering loop. Each thread works only on the points in its strip. OpenMP allows users to specify how many threads will execute a parallel region with two different mechanisms: either through the omp_set_num_threads() runtime library procedure, or through the OMP_NUM_THREADS environment variable. In this example we assume the user has set the environment variable to the value 2. The omp_get_thread_num() function is part of the OpenMP runtime library. This function returns a unique thread number (thread numbering begins with 0) to the caller. To generalize this example to an arbitrary number of threads, we also need the omp_get_num_threads() function, which returns the number of threads forked by the parallel directive. Here we assume for simplicity only two threads will execute the parallel region. We also assume the scalar m (the do i loop extent) is evenly divisible by 2. These assumptions make it easy to compute the width for each strip (stored in my_width). A consequence of our coarse-grained approach to parallelizing this example is that we must now manage our own loop extents. This is necessary because the do i loop now indexes points in a thread’s strip and not in the entire plane. To manage the indexing correctly, we compute new start and end values (i_start and i_end) for the loop that span only the width of a thread’s strip. You should convince yourself that the example

computes the i_start and i_end values correctly, assuming my_thread is numbered either 0 or 1. With the modified loop extents we iterate over the points in each thread’s strip. First we compute the Mandelbrot values, and then we dither the values of each row (do j loop) that we computed. Because there are no dependences along the i direction, each thread can proceed directly from computing Mandelbrot values to dithering without any need for synchronization. You may have noticed a subtle difference in our description of parallelization with parallel regions. In previous discussions we spoke of threads working on a set of iterations of a loop, whereas here we have been describing threads as working on a section of an array. The distinction is important. The iterations of a loop have meaning only for the extent of the loop. Once the loop is completed, there are no more iterations to map to parallel threads (at least until the next loop begins). An array, on the other hand, will have meaning across many loops. When we map a section of an array to a parallel thread, that thread has the opportunity to execute all the calculations on its array elements, not just those of a single loop. In general, to parallelize an application using parallel regions, we must think of decomposing the problem in terms of the underlying data structures and mapping these to the parallel threads. Although this approach requires a greater level of analysis and effort from the programmer, it usually leads to better application scalability and performance. This is discussed in greater detail in Chapter 4.
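To connect the two ideas, here is a hedged C sketch of the same strip decomposition written for an arbitrary number of threads; it keeps the text's simplifying assumption that m divides evenly by the thread count, and every name in it is our own:

    #include <omp.h>

    extern int mandel_val(double x, double y, int maxiter);

    void mandel_and_dither(int m, int n, int maxiter,
                           int depth[n][m], int dith_depth[n][m])
    {
        #pragma omp parallel
        {
            int nthreads  = omp_get_num_threads();
            int my_thread = omp_get_thread_num();
            int my_width  = m / nthreads;     /* assumes m % nthreads == 0 */
            int i_start   = my_thread * my_width;
            int i_end     = i_start + my_width;
            int i, j;
            double x, y;

            /* Each thread computes, and then dithers, only its own strip
               of columns; no synchronization is needed in between. */
            for (i = i_start; i < i_end; i++)
                for (j = 0; j < n; j++) {
                    x = (i + 1) / (double) m;
                    y = (j + 1) / (double) n;
                    depth[j][i] = mandel_val(x, y, maxiter);
                }

            for (i = i_start; i < i_end; i++)
                for (j = 1; j < n - 1; j++)   /* interior rows only */
                    dith_depth[j][i] = (int) (0.5 * depth[j][i]
                        + 0.25 * (depth[j - 1][i] + depth[j + 1][i]));
        }
    }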

2.8 Concluding Remarks

This chapter has presented a broad overview of OpenMP, beginning with an abstracted runtime model and subsequently using this model to parallelize a variety of examples. The OpenMP runtime abstraction has three elements: control structures, data environment, and synchronization. By understanding how these elements participate when executing an OpenMP program, we can deduce the runtime behavior of a parallel program and ultimately parallelize any sequential program. Two different kinds of parallelization have been described. Fine-grained, or loop-level, parallelism is the simplest to expose but also has limited scalability and performance. Coarse-grained parallelism, on the other hand, demonstrates greater scalability and performance but requires more effort to program. Specifically, we must decompose the underlying

data structures into parallel sections that threads can work on independently. OpenMP provides functionality for exposing either fine-grained or coarse-grained parallelism. The next two chapters are devoted to describing the OpenMP functionality for each of these two forms of parallelism.

2.9 Exercises

1. Write a program that breaks sequential semantics (i.e., it produces different results when run in parallel with a single thread versus run sequentially).

2. Parallelize Example 2.2 without using a parallel do directive.

3. In the early 1990s there was a popular class of parallel computer architecture known as the single-instruction multiple-data (SIMD) architecture. The classification applied to systems where a single instruction would act on parallel sets of data. In our runtime model, this would be akin to having a single thread with multiple data environments. How should Example 2.3 be changed to execute as an SIMD program?

4. Why does Example 2.4 store the result of mandel_val in depth(j,i) as opposed to depth(i,j)? Does it matter for correctness of the loop?

5. Parallelize Example 2.6 without using a critical section or a reduction attribute. To do this you will need to add a shared array to store the partial sums for each thread, and then add up the partial sums outside the parallel loop. Chapter 6 describes a performance problem known as false sharing. Is this problem going to affect the performance of the parallelized code?

6. Generalize Example 2.10 to run on an arbitrary number of threads. Make sure it will compute the correct loop extents regardless of the value of m or the number of threads. Include a print statement before the first loop that displays the number of threads that are executing. What happens if the OMP_NUM_THREADS environment variable is not set when the program is run?

CHAPTER 3

Exploiting Loop-Level Parallelism

3.1 Introduction

MANY TYPICAL PROGRAMS IN SCIENTIFIC AND ENGINEERING application domains spend most of their time executing loops, in particular, do loops in Fortran and for loops in C. We can often reduce the execution time of such programs by exploiting loop-level parallelism, by executing iterations of the loops concurrently across multiple processors. In this chapter we focus on the issues that arise in exploiting loop-level parallelism, and how they may be addressed using OpenMP. OpenMP provides the parallel do directive for specifying that a loop be executed in parallel. Many programs can be parallelized successfully just by applying parallel do directives to the proper loops. This style of fine-grained parallelization is especially useful because it can be applied incrementally: as specific loops are found to be performance bottlenecks, they can be parallelized easily by adding directives and making small, localized changes to the source code. Hence the programmer need not rewrite the


entire application just to parallelize a few performance-limiting loops. Because incremental parallelization is such an attractive technique, parallel do is one of the most important and frequently used OpenMP directives. However, the programmer must choose carefully which loops to parallelize. The parallel version of the program generally must produce the same results as the serial version; in other words, the correctness of the program must be maintained. In addition, to maximize performance the execution time of parallelized loops should be as short as possible, and certainly not longer than the original serial version of the loops. This chapter describes how to make effective use of the parallel do directive to parallelize loops in a program. We start off in Section 3.2 by discussing the syntactic form and usage of the parallel do directive. We give an overview of the various clauses that can be added to the directive to control the behavior of a parallel loop. We also identify the restrictions on the loops that may be parallelized using this directive. Section 3.3 reviews the runtime execution model of the parallel do directive. Section 3.4 describes the data scoping clauses in OpenMP. These scope clauses control the sharing behavior of program variables between multiple threads, such as whether a variable is shared by multiple threads or private to each thread. The first part of Section 3.5 shows how to determine whether it is safe to parallelize a loop, based upon the presence or absence of data dependences. Then, a number of techniques are presented for making loops parallelizable by breaking data dependences, using a combination of source code transformations and OpenMP directives. Finally, Section 3.6 describes two issues that affect the performance of parallelized loops—excessive parallel overhead and poor load balancing—and describes how these issues may be addressed by adding performance-tuning clauses to the parallel do directive. By the end of the chapter, you will have gained a basic understanding of some techniques for speeding up a program by incrementally parallelizing its loops, and you will recognize some of the common pitfalls to be avoided.

3.2 Form and Usage of the parallel do Directive

Figure 3.1 shows a high-level view of the syntax of the parallel do directive in Fortran, while Figure 3.2 shows the corresponding syntax of the parallel for directive in C and C++. The square bracket notation ([. . .]) is used to identify information that is optional, such as the clauses or the end parallel do directive. Details about the contents of the clauses are presented in subsequent sections in this chapter.

!$omp parallel do [clause [,] [clause ...]]
      do index = first, last [, stride]
         body of the loop
      enddo
[!$omp end parallel do]

Figure 3.1 Fortran syntax for the parallel do directive.

#pragma omp parallel for [clause [clause ...]]
    for (index = first ; test_expr ; increment_expr) {
        body of the loop
    }

Figure 3.2 C/C++ syntax for the parallel for directive.

In Fortran, the parallel do directive parallelizes the loop that immediately follows it, which means there must be a statement following the directive, and that statement must be a do loop. Similarly, in C and C++ there must be a for loop immediately following the parallel for directive. The directive extends up to the end of the loop to which it is applied. In Fortran only, to improve the program’s readability, the end of the loop may optionally be marked with an end parallel do directive.

3.2.1 Clauses

OpenMP allows the execution of a parallel loop to be controlled through optional clauses that may be supplied along with the parallel do directive. We briefly describe the various kinds of clauses here and defer a detailed explanation for later in this chapter:

• Scoping clauses (such as private or shared) are among the most commonly used. They control the sharing scope of one or more variables within the parallel loop. All the flavors of scoping clauses are covered in Section 3.4.
• The schedule clause controls how iterations of the parallel loop are distributed across the team of parallel threads. The choices for scheduling are described in Section 3.6.2.
• The if clause controls whether the loop should be executed in parallel or serially like an ordinary loop, based on a user-defined runtime test. It is described in Section 3.6.1.

• The ordered clause specifies that there is ordering (a kind of synchronization) between successive iterations of the loop, for cases when the iterations cannot be executed completely in parallel. It is described in Section 5.4.2.
• The copyin clause initializes certain kinds of private variables (called threadprivate variables) at the start of the parallel loop. It is described in Section 4.4.2.

Multiple scoping and copyin clauses may appear on a parallel do; generally, different instances of these clauses affect different variables that appear within the loop. The if, ordered, and schedule clauses affect execution of the entire loop, so there may be at most one of each of these. The section that describes each of these kinds of clauses also defines the default behavior that occurs when that kind of clause does not appear on the parallel loop.

3.2.2 Restrictions on Parallel Loops

OpenMP places some restrictions on the kinds of loops to which the parallel do directive can be applied. Generally, these restrictions make it easier for the compiler to parallelize loops as efficiently as possible. The basic principle behind them is that it must be possible to precisely compute the trip-count, or number of times a loop body is executed, without actually having to execute the loop. In other words, the trip-count must be computable at runtime based on the specified lower bound, upper bound, and stride of the loop. In a Fortran program the parallel do directive must be followed by a do loop statement whose iterations are controlled by an index variable. It cannot be a do-while loop, a do loop that lacks iteration control, or an array assignment statement. In other words, it must start out like this:

DO index = lowerbound, upperbound [, stride]

In a C program, the statement following the parallel for pragma must be a for loop, and it must have canonical shape, so that the trip-count can be precisely determined. In particular, the loop must be of the form

for (index = start ; index < end ; increment_expr)

The index must be an integer variable. Instead of less-than (“<”), the comparison operator may be “<=”, “>”, or “>=”. The start and end values can be any numeric expression whose value does not change

Table 3.1 Increment expressions for loops in C.

    Operator    Forms of increment_expr
    ++          index++  or  ++index
    --          index--  or  --index
    +=          index += incr
    -=          index -= incr
    =           index = index + incr  or  index = incr + index  or  index = index - incr

during execution of the loop. The increment_expr must change the value of index by the same amount after each iteration, using one of a limited set of operators. Table 3.1 shows the forms it can take. In the table, incr is a numeric expression that does not change during the loop. In addition to requiring a computable trip-count, the parallel do directive requires that the program complete all iterations of the loop; hence, the program cannot use any control flow constructs that exit the loop before all iterations have been completed. In other words, the loop must be a block of code that has a single entry point at the top and a single exit point at the bottom. For this reason, there are restrictions on what constructs can be used inside the loop to change the flow of control. In Fortran, the program cannot use an exit or goto statement to branch out of the loop. In C the program cannot use a break or goto to leave the loop, and in C++ the program cannot throw an exception from inside the loop that is caught outside. Constructs such as cycle in Fortran and continue in C that complete the current iteration and go on to the next are permitted, however. The program may also use a goto to jump from one statement inside the loop to another, or to raise a C++ exception (using throw) so long as it is caught by a try block somewhere inside the loop body. Finally, execution of the entire program can be terminated from within the loop using the usual mechanisms: a stop statement in Fortran or a call to exit in C and C++. The other OpenMP directives that introduce parallel constructs share the requirement of the parallel do directive that the code within the lexical extent constitute a single-entry/single-exit block. These other directives are parallel, sections, single, master, critical, and ordered. Just as in the case of parallel do, control flow constructs may be used to transfer control to other points within the block associated with each of these parallel constructs, and a stop or exit terminates execution of the entire program, but control may not be transferred to a point outside the block.
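As a hedged C illustration of these rules (the array names are our own):

    /* Allowed: continue only finishes the current iteration early, so
       every iteration still runs and the trip-count is unchanged. */
    void reciprocals(const double *a, double *b, int n)
    {
        int i;

        #pragma omp parallel for
        for (i = 0; i < n; i++) {
            if (a[i] == 0.0)
                continue;
            b[i] = 1.0 / a[i];
        }
    }

    /* Not allowed as a parallel loop: a break (or a goto out of the loop)
       could end the loop before all iterations complete, so a loop written
       like the commented-out one below may not carry a parallel for.
       for (i = 0; i < n; i++) {
           if (a[i] == 0.0)
               break;
           b[i] = 1.0 / a[i];
       }
    */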

3.3 Meaning of the parallel do Directive

Chapter 2 showed how to use directives for incremental parallelization of some simple examples and described the behavior of the examples in terms of the OpenMP runtime execution model. We briefly review the key features of the execution model before we study the parallel do directive in depth in this chapter. You may find it useful to refer back to Figures 2.2 and 2.3 now to get a concrete picture of the following execution steps of a parallel loop. Outside of parallel loops a single master thread executes the program serially. Upon encountering a parallel loop, the master thread creates a team of parallel threads consisting of the master along with zero or more additional slave threads. This team of threads executes the parallel loop together. The iterations of the loop are divided among the team of threads; each iteration is executed only once as in the original program, although a thread may execute more than one iteration of the loop. During execution of the parallel loop, each program variable is either shared among all threads or private to each thread. If a thread writes a value to a shared variable, all other threads can read the value, whereas if it writes a value to a private variable, no other thread can access that value. After a thread completes all of its iterations from the loop, it waits at an implicit barrier for the other threads to complete their iterations. When all threads are finished with their iterations, the slave threads stop executing, and the master continues serial execution of the code following the parallel loop.

3.3.1 Loop Nests and Parallelism

When the body of one loop contains another loop, we say that the second loop is nested inside the first, and we often refer to the outer loop and its contents collectively as a loop nest. When one loop in a loop nest is marked by a parallel do directive, the directive applies only to the loop that immediately follows the directive. The behavior of all the other loops remains unchanged, regardless of whether a loop appears in the serial part of the program or is contained within an enclosing parallel loop: all iterations of loops not preceded by the parallel do are executed by each thread that reaches them.

Example 3.1 shows two important, common instances of parallelizing one loop in a multiple loop nest. In the first subroutine, the j loop is executed in parallel, and each iteration computes in a(0, j) the sum of elements from a(1, j) to a(M, j). The iterations of the i loop are not partitioned or divided among the threads; instead, each thread executes all

M iterations of the i loop each time it reaches the i loop. In the second subroutine, this pattern is reversed: the outer j loop is executed serially, one iteration at a time. Within each iteration of the j loop a team of threads is formed to divide up the work within the inner i loop and compute a new column of elements a(1:M, j) using a simple smoothing function. This partitioning of work in the i loop among the team of threads is called work-sharing. Each iteration of the i loop computes an element a(i, j) of the new column by averaging the element a(i, j – 1) of the previous column with its neighbors a(i – 1, j – 1) and a(i + 1, j – 1) in that same column.

Example 3.1 Parallelizing one loop in a nest.

        subroutine sums(a, M, N)
        integer M, N, a(0:M, N), i, j

        !$omp parallel do
        do j = 1, N
           a(0, j) = 0
           do i = 1, M
              a(0, j) = a(0, j) + a(i, j)
           enddo
        enddo
        end

        subroutine smooth(a, M, N)
        integer M, N, a(0:M + 1, 0:N), i, j

        do j = 1, N
        !$omp parallel do
           do i = 1, M
              a(i, j) = (a(i - 1, j - 1) + a(i, j - 1) + &
                         a(i + 1, j - 1))/3.0
           enddo
        enddo
        end

3.4 Controlling Data Sharing

Multiple threads within an OpenMP parallel program execute within the same shared address space and can share access to variables within this address space. Sharing variables between threads makes interthread communication very simple: threads send data to other threads by assigning values to shared variables and receive data by reading values from them.

In addition to sharing access to variables, OpenMP also allows a variable to be designated as private to each thread rather than shared among all threads. Each thread then gets a private copy of this variable for the duration of the parallel construct. Private variables are used to facilitate computations whose results are different for different threads.

In this section we describe the data scope clauses in OpenMP that may be used to control the sharing behavior of individual program variables inside a parallel construct. Within an OpenMP construct, every variable that is used has a scope that is either shared or private (the other scope clauses are usually simple variations of these two basic scopes). This kind of “scope” is different from the “scope” of accessibility of variable names in serial programming languages (such as local, file-level, and global in C, or local and common in Fortran). For clarity, we will consistently use the term “lexical scope” when we intend the latter, serial programming language sense, and plain “scope” when referring to whether a variable is shared between OpenMP threads. In addition, “scope” is both a noun and a verb: every variable used within an OpenMP construct has a scope, and we can explicitly scope a variable as shared or private on an OpenMP construct by adding a clause to the directive that begins the construct.

Although shared variables make it convenient for threads to communicate, the choice of whether a variable is to be shared or private is dictated by the requirements of the algorithm and must be made carefully. Both the unintended sharing of variables between threads and, conversely, the privatization of variables whose values need to be shared are among the most common sources of errors in shared memory parallel programs. Because it is so important to give variables correct scopes, OpenMP provides a rich set of features for explicitly scoping variables, along with a well-defined set of rules for implicitly determining default scopes. These are all of the scoping clauses that can appear on a parallel construct:

• shared and private explicitly scope specific variables.
• firstprivate and lastprivate perform initialization and finalization of privatized variables.
• default changes the default rules used when variables are not explicitly scoped.
• reduction explicitly identifies reduction variables.

We first describe some general properties of data scope clauses, and then discuss the individual scope clauses in detail in subsequent sections.

3.4.1 General Properties of Data Scope Clauses

A data scope clause consists of the keyword identifying the clause (such as shared or private), followed by a comma-separated list of variables within parentheses. The data scoping clause applies to all the variables in the list and identifies the scope of these variables as either shared between threads or private to each thread.

Any variable may be marked with a data scope clause—automatic variables, global variables (in C/C++), common block variables or module variables (in Fortran), an entire common block (in Fortran), as well as formal parameters to a subroutine. However, a data scope clause does have several restrictions.

The first requirement is that the directive with the scope clause must be within the lexical extent of the declaration of each of the variables named within a scope clause; that is, there must be a declaration of the variable that encloses the directive.

Second, a variable in a data scoping clause cannot refer to a portion of an object, but must refer to the entire object. Therefore, it is not permitted to scope an individual array element or field of a structure—the variable must be either shared or private in its entirety. A data scope clause may be applied to a variable of type struct or class in C or C++, in which case the scope clause in turn applies to the entire structure including all of its subfields. Similarly, in Fortran, a data scope clause may be applied to an entire common block by listing the common block name between slashes (“/”), thereby giving an explicit scope to each variable within that common block.

Third, a directive may contain multiple shared or private scope clauses; however, an individual variable can appear in at most a single clause—that is, a variable may uniquely be identified as shared or private, but not both.

Finally, the data scoping clauses apply only to accesses to the named variables that occur in the code contained directly within the parallel do/end parallel do directive pair. This portion of code is referred to as the lexical extent of the parallel do directive and is a subset of the larger dynamic extent of the directive that also includes the code contained within subroutines invoked from within the parallel loop. Data references to variables that occur within the lexical extent of the parallel loop are affected by the data scoping clauses. However, references from subroutines invoked from within the parallel loop are not affected by the scoping clauses in the dynamically enclosing parallel directive. The rationale for this is simple: References within the lexical extent are easily associated with the data

scoping clause in the directly enclosing directive. However, this association is far less obvious for references that are outside the lexical scope, perhaps buried within a deeply nested chain of subroutine calls. Identifying the relevant data scoping clause would be extremely cumbersome and error prone in these situations. Example 3.2 shows a valid use of scoping clauses in Fortran.

Example 3.2 Sample scoping clauses in Fortran.

        COMMON /globals/ a, b, c
        integer i, j, k, count
        real a, b, c, x
        ...
        !$omp parallel do private(i, j, k)
        !$omp+ shared(count, /globals/)
        !$omp+ private(x)

In C++, besides scoping an entire object, it is also possible to scope a static member variable of a class using a fully qualified name. Example 3.3 shows a valid use of scoping clauses in C++.

Example 3.3 Sample scoping clauses in C++.

        class MyClass {
            ...
            static float x;
            ...
        };

        MyClass arr[N];
        int j, k;
        ...

        #pragma omp parallel for shared(MyClass::x, arr) \
                                 private(j, k)

For the rest of this section, we describe how to use each of the scoping clauses and illustrate the behavior of each clause through simple examples. Section 3.5 presents more realistic examples of how to use many of the clauses when parallelizing real programs.

3.4.2 The shared Clause

The shared scope clause specifies that the named variables should be shared by all the threads in the team for the duration of the parallel construct. The behavior of shared scope is easy to understand: even within

the parallel loop, a reference to a shared variable from any thread continues to access the single instance of the variable in shared memory. All modifications to this variable update the global instance, with the updated value becoming available to the other threads.

Care must be taken when the shared clause is applied to a pointer variable or to a formal parameter that is passed by reference. A shared clause on a pointer variable will mark only the pointer value itself as shared, but will not affect the memory pointed to by the variable. Dereferencing a shared pointer variable will simply dereference the address value within the pointer variable. Formal parameters passed by reference behave in a similar fashion, with all the threads sharing the reference to the corresponding actual argument.
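The following minimal C++ sketch (our own illustration; the routine and variable names are hypothetical) makes this concrete: only the pointer value p is scoped, so every thread dereferences the same underlying array.

        void scale(double *p, int n)
        {
            // p itself is shared; each thread reads the same address from it and
            // then accesses the single, shared block of memory it points to.
            #pragma omp parallel for shared(p)
            for (int i = 0; i < n; i++)
                p[i] = 2.0 * p[i];
        }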

3.4.3 The private Clause

The private clause requires that each thread create a private instance of the specified variable. As we illustrated in Chapter 2, each thread allocates a private copy of these variables from storage within the private execution context of each thread; these variables are therefore private to each thread and not accessible by other threads. References to these variables within the lexical extent of the parallel construct are changed to read or write the private copy belonging to the referencing thread. Since each thread has its own copy of private variables, this private copy is no longer storage associated with the original shared instance of the variable; rather, references to this variable access a distinct memory location within the private storage of each thread.

Furthermore, since private variables get new storage for the duration of the parallel construct, they are uninitialized upon entry to the parallel region, and their value is undefined. In addition, the value of these variables after the completion of the parallel construct is also undefined (see Example 3.4). This is necessary to maintain consistent behavior between the serial and parallel versions of the code. To see this, consider the serial instance of the code—that is, when the code is compiled without enabling OpenMP directives. The code within the loop accesses the single instance of variables marked private, and their final value is available after the parallel loop. However, the parallel version of this same code will access the private copy of these variables, so that modifications to them within the parallel loop will not be reflected in the copy of the variable after the parallel loop. In OpenMP, therefore, private variables have undefined values both upon entry to and upon exit from the parallel construct.

Example 3.4 Behavior of private variables.

        integer x

        x = ...
        !$omp parallel do private(x)
        do i = 1, n
           ! Error! "x" is undefined upon entry,
           ! and must be defined before it can be used.
           ... = x
        enddo

        ! Error: x is undefined after the parallel loop,
        ! and must be defined before it can be used.
        ... = x

There are three exceptions to the rule that private variables are undefined upon entry to a parallel loop. The simplest instance is the loop control variable that takes on successive values in the iteration space during the execution of the loop. The second concerns C++ class objects (i.e., non-plain-old-data or non-POD objects), and the third concerns allocatable arrays in Fortran 90. Each of these languages defines an initial status for the types of variables mentioned above. OpenMP therefore attempts to provide the same behavior for the private instances of these variables that are created for the duration of a parallel loop.

In C++, if a variable is marked private and is a variable of class or struct type that has a constructor, then the variable must also have an accessible default constructor and destructor. Upon entry to the parallel loop, when each thread allocates storage for the private copy of this variable, each thread also invokes this default constructor to construct the private copy of the object. Upon completion of the parallel loop, the private instance of each object is destructed using the default destructor. This correctly maintains the C++ semantics for construction/destruction of these objects when new copies of the object are created within the parallel loop.

In Fortran 90, if an allocatable array is marked private, then the serial copy before the parallel construct must be unallocated. Upon entry to the parallel construct, each thread gets a private unallocated copy of the array. This copy must be allocated before it can be used and must be explicitly deallocated by the program at the end of the parallel construct. The original serial copy of this array after the parallel construct is again unallocated. This preserves the general unallocated initial status of allocatable arrays; it also means that such arrays must be allocated and deallocated within the parallel construct to avoid memory leakage.
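As a small C++ sketch of the construction/destruction behavior just described (our own example; the class and routine are hypothetical), each thread's private copy of w below is built with the default constructor on entry to the loop and destroyed when the loop completes:

        class Workspace {                            // hypothetical helper class
          public:
            double *buf;
            Workspace()  { buf = new double[64]; }   // accessible default constructor
            ~Workspace() { delete [] buf; }          // accessible destructor
        };

        void compute(int n)
        {
            Workspace w;                             // the master's copy; not seen inside the loop
            #pragma omp parallel for private(w)
            for (int i = 0; i < n; i++) {
                // each thread works in its own default-constructed copy of w
                w.buf[0] = i;
            }
            // every private copy has been destructed here; the master's w is untouched
        }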

None of these issues arise with regard to shared variables. Since these variables are shared among all the threads, all references within the parallel code continue to access the single shared location of the variable as in the serial code. Shared variables therefore continue to remain available both upon entry to the parallel construct as well as after exiting the construct.

Since each thread needs to create a private copy of the named variable, it must be possible to determine the size of the variable based upon its declaration. In particular, in Fortran if a formal array parameter of adjustable size is specified in a private clause, then the program must also fully specify the bounds for the formal parameter. Similarly, in C/C++ a variable specified as private must not have an incomplete type. Finally, the private clause may not be applied to C++ variables of reference type; while the behavior of the data scope clauses is easily deduced for both ordinary variables and for pointer variables (see below), variables of reference type raise a whole set of complex issues and are therefore disallowed for simplicity.

Lastly, the private clause, when applied to a pointer variable, continues to behave in a consistent fashion. As per the definition of the private clause, each thread gets a private, uninitialized copy of a variable of the same type as the original variable, in this instance a pointer-typed variable. This pointer variable is initially undefined and may be freely used to store memory addresses as usual within the parallel loop. Be careful that the scoping clause applies just to the pointer in this case; the sharing behavior of the storage pointed to is determined by the latter's scoping rules.

With regard to manipulating memory addresses, the only restriction imposed by OpenMP is that a thread is not allowed to access the private storage of another thread. Therefore a thread should not pass the address of a variable marked private to another thread, because accessing the private storage of another thread can result in undefined behavior. In contrast, the heap is always shared among the parallel threads; therefore pointers to heap-allocated storage may be freely passed across multiple threads.
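The following C++ sketch (our own, hypothetical illustration) shows a typical use of a private pointer: each thread's copy is initially undefined, so each thread points it at its own heap-allocated scratch buffer. The heap itself is shared, but each thread touches only the buffer it allocated.

        #include <cstdlib>

        void work(int n)
        {
            double *scratch;                    // hypothetical per-thread scratch pointer
            #pragma omp parallel private(scratch)
            {
                // each thread's private copy of scratch starts out undefined,
                // so it must be assigned before use
                scratch = (double *) std::malloc(100 * sizeof(double));

                #pragma omp for
                for (int i = 0; i < n; i++)
                    scratch[0] = i;             // each thread uses its own buffer

                std::free(scratch);             // release this thread's buffer
            }
        }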

3.4.4 Default Variable Scopes

The default scoping rules in OpenMP state that if a variable is used within a parallel construct and is not scoped explicitly, then the variable is treated as shared. This is usually the desired behavior for variables that are read but not modified within the parallel loop—if a variable is assigned within the loop, then that variable may need to be explicitly scoped, or it may be necessary to add synchronization around statements that access

the variable. In this section we first describe the general behavior of heap- and stack-allocated storage, and then discuss the behavior of different classes of variables under the default shared rule.

All threads share a single global heap in an OpenMP program. Heap-allocated storage is therefore uniformly accessible by all threads in a parallel team. On the other hand, each OpenMP thread has its own private stack that is used for subroutine calls made from within a parallel loop. Automatic (i.e., stack-allocated) variables within these subroutines are therefore private to each thread. However, automatic variables in the subroutine that contains the parallel loop continue to remain accessible by all the threads executing the loop and are treated as shared unless scoped otherwise. This is illustrated in Example 3.5.

Example 3.5 Illustrating the behavior of stack-allocated variables.

        subroutine f
        real a(N), sum

        !$omp parallel do private(sum)
        do i = ...
           ! "a" is shared in the following reference,
           ! while sum has been explicitly scoped as
           ! private

           a(i) = ...
           sum = 0
           call g(sum)
        enddo
        end

        subroutine g(s)
        real b(100), s
        integer i

        do i = ...
           ! "b" and "i" are local stack-allocated
           ! variables and are therefore private in
           ! the following references
           b(i) = ...
           s = s + b(i)
        enddo
        end

There are three exceptions to the rule that unscoped variables are made shared by default. We will first describe these exceptions, then present detailed examples in Fortran and C/C++ that illustrate the rules. First, certain loop index variables are made private by default. Second, in subroutines called within a parallel region, local variables and (in C and C++)

value parameters within the called subroutine are scoped as private. Finally, in C and C++, an automatic variable declared within the lexical extent of a parallel region is scoped as private. We discuss each of these in turn.

When executing a loop within a parallel region, if a loop index variable is shared between threads, it is almost certain to cause incorrect results. For this reason, the index variable of a loop to which a parallel do or parallel for is applied is scoped by default as private. In addition, in Fortran only, the index variable of a sequential (i.e., non-work-shared) loop that appears within the lexical extent of a parallel region is scoped as private. In C and C++, this is not the case: index variables of sequential for loops are scoped as shared by default. The reason is that, as was discussed in Section 3.2.2, the C for construct is so general that it is difficult for the compiler to figure out which variables should be privatized. As a result, in C the index variables of serial loops must be explicitly scoped as private.

Second, as we discussed above, when a subroutine is called from within a parallel region, local variables within the called subroutine are private to each thread. However, if any of these variables are marked with the save attribute (in Fortran) or as static (in C/C++), then these variables are no longer allocated on the stack. Instead, they behave like globally allocated variables and therefore have shared scope.

Finally, C and C++ do not limit variable declarations to function entry as in Fortran; rather, variables may be declared nearly anywhere within the body of a function. Such nested declarations that occur within the lexical extent of a parallel loop are scoped as private for the parallel loop.

We now illustrate these default scoping rules in OpenMP. Examples 3.6 and 3.7 show sample parallel code in Fortran and C, respectively, in which the scopes of the variables are determined by the default rules. For each variable used in Example 3.6, Table 3.2 lists the scope, how that scope was determined, and whether the use of the variable within the parallel region is safe or unsafe. Table 3.3 lists the same information for Example 3.7.

Example 3.6 Default scoping rules in Fortran.

        subroutine caller(a, n)
        integer n, a(n), i, j, m

        m = 3
        !$omp parallel do
        do i = 1, n
           do j = 1, 5
              call callee(a(i), m, j)
           enddo
        enddo
        end

        subroutine callee(x, y, z)
        common /com/ c
        integer x, y, z, c, ii, cnt
        save cnt

        cnt = cnt + 1
        do ii = 1, z
           x = y + c
        enddo
        end

Example 3.7 Default scoping rules in C.

        void caller(int a[], int n)
        {
            int i, j, m = 3;

        #pragma omp parallel for
            for (i = 0; i < n; i++) {
                int k = m;

                for (j = 1; j <= 5; j++)
                    callee(&a[i], &k, j);
            }
        }

        extern int c;

        void callee(int *x, int *y, int z)
        {
            int ii;
            static int cnt;

            cnt++;
            for (ii = 0; ii < z; ii++)
                *x = *y + c;
        }

3.4.5 Changing Default Scoping Rules

As we described above, by default, variables have shared scope within an OpenMP construct. If a variable needs to be private to each thread, then it must be explicitly identified with a private scope clause. If a construct requires that most of the referenced variables be private, then this default rule can be quite cumbersome since it may require a private clause

Table 3.2 Variable scopes for Fortran default scoping example.

Variable   Scope     Is Use Safe?   Reason for Scope
a          shared    yes            Declared outside parallel construct.
n          shared    yes            Declared outside parallel construct.
i          private   yes            Parallel loop index variable.
j          private   yes            Fortran sequential loop index variable.
m          shared    yes            Declared outside parallel construct.
x          shared    yes            Actual parameter is a, which is shared.
y          shared    yes            Actual parameter is m, which is shared.
z          private   yes            Actual parameter is j, which is private.
c          shared    yes            In a common block.
ii         private   yes            Local stack-allocated variable of called subroutine.
cnt        shared    no             Local variable of called subroutine with save attribute.

Table 3.3 Variable scopes for C default scoping example.

Variable   Scope     Is Use Safe?   Reason for Scope
a          shared    yes            Declared outside parallel construct.
n          shared    yes            Declared outside parallel construct.
i          private   yes            Parallel loop index variable.
j          shared    no             Index variable of a sequential loop (private by default only in Fortran).
m          shared    yes            Declared outside parallel construct.
k          private   yes            Auto variable declared inside parallel construct.
x          private   yes            Value parameter.
*x         shared    yes            Actual parameter is a, which is shared.
y          private   yes            Value parameter.
*y         private   yes            Actual parameter is k, which is private.
z          private   yes            Value parameter.
c          shared    yes            Declared as extern.
ii         private   yes            Local stack-allocated variable of called subroutine.
cnt        shared    no             Declared as static.

for a large number of variables. As a convenience, therefore, OpenMP provides the ability to change the default behavior using the default clause on the parallel construct.

The syntax for this clause in Fortran is

default (shared | private | none)

while in C and C++, it is

default (shared | none)

In Fortran, there are three different forms of this clause: default(shared), default(private), and default(none). At most one default clause may appear on a parallel region. The simplest to understand is default(shared), because it does not actually change the scoping rules: it says that unscoped variables are still scoped as shared by default.

The clause default(private) changes the rules so that unscoped variables are scoped as private by default. For example, if we added a default(private) clause to the parallel do directive in Example 3.6, then a, m, and n would be scoped as private rather than shared. Scoping of variables in the called subroutine callee would not be affected because the subroutine is outside the lexical extent of the parallel do. The most common reason to use default(private) is to aid in converting a parallel application based on a distributed memory programming paradigm such as MPI, in which threads cannot share variables, to a shared memory OpenMP version. The clause default(private) is also convenient when a large number of scratch variables are used for holding intermediate results of a computation and must be scoped as private. Rather than listing each variable in an explicit private clause, default(private) may be used to scope all of these variables as private. Of course, when using this clause each variable that needs to be shared must be explicitly scoped using the shared clause.

The default(none) clause helps catch scoping errors. If default(none) appears on a parallel region, and any variables are used in the lexical extent of the parallel region but not explicitly scoped by being listed in a private, shared, reduction, firstprivate, or lastprivate clause, then the compiler issues an error. This helps avoid errors resulting from variables being implicitly (and incorrectly) scoped.

In C and C++, the clauses available to change default scoping rules are default(shared) and default(none). There is no default(private) clause. This is because many C standard library facilities are implemented using macros that reference global variables. The standard library tends to be used pervasively in C and C++ programs, and scoping these globals as private is likely to be incorrect, which would make it difficult to write portable, correct OpenMP code using a default(private) scoping rule.
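A brief C++ sketch of default(none) in action (our own example; the routine is hypothetical): because no variable receives an implicit scope, leaving any of the listed variables out of a clause turns a silent scoping mistake into a compile-time error.

        void saxpy(int n, float alpha, const float *x, float *y)
        {
            int i;
            // every variable used in the loop must now be scoped explicitly;
            // omitting, say, shared(alpha) would cause a compiler error
            #pragma omp parallel for default(none) shared(n, alpha, x, y) private(i)
            for (i = 0; i < n; i++)
                y[i] = y[i] + alpha * x[i];
        }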

3.4.6 Parallelizing Reduction Operations

As discussed in Chapter 2, one type of computation that we often wish to parallelize is a reduction operation. In a reduction, we repeatedly apply a binary operator to a variable and some other value, and store the result back in the variable. For example, one common reduction is to compute the sum of the elements of an array:

        sum = 0
        do i = 1, n
           sum = sum + a(i)
        enddo

and another is to find the largest (maximum) value:

        x = a(1)
        do i = 2, n
           x = max(x, a(i))
        enddo

When computing the sum, we use the binary operator “+”, and to find the maximum we use the max operator. For some operators (including “+” and max), the final result we get does not depend on the order in which we apply the operator to elements of the array. For example, if the array contained the three elements 1, 4, and 6, we would get the same sum of 11 regardless of whether we computed it in the order 1 + 4 + 6 or 6 + 1 + 4 or any other order. In mathematical terms, such operators are said to be commutative and associative.

When a program performs a reduction using a commutative-associative operator, we can parallelize the reduction by adding a reduction clause to the parallel do directive. The syntax of the clause is

reduction (redn_oper : var_list)

There may be multiple reduction clauses on a single work-sharing directive. The redn_oper is one of the built-in operators of the base language. Table 3.4 lists the allowable operators in Fortran, while Table 3.5 lists the operators for C and C++. (The other columns of the tables will be explained below.) The var_list is a list of scalar variables into which we are computing reductions using the redn_oper. If you wish to perform a reduction on an array element or field of a structure, you must create a scalar temporary with the same type as the element or field, perform the reduction on the temporary, and copy the result back into the element or field at the end of the loop.
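The scalar-temporary workaround described in the last sentence looks like the following C++ sketch (our own illustration; the routine and array names are hypothetical), which reduces into a single element total[k]:

        void column_sum(int n, int k, const float a[][8], float total[])
        {
            float t = total[k];             // scalar temporary standing in for total[k]

            #pragma omp parallel for reduction(+: t)
            for (int i = 0; i < n; i++)
                t = t + a[i][k];            // reduce into the scalar temporary

            total[k] = t;                   // copy the result back after the loop
        }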

Table 3.4 Reduction operators for Fortran.

Operator   Data Types                                  Initial Value
+          integer, floating point (complex or real)   0
*          integer, floating point (complex or real)   1
-          integer, floating point (complex or real)   0
.AND.      logical                                     .TRUE.
.OR.       logical                                     .FALSE.
.EQV.      logical                                     .TRUE.
.NEQV.     logical                                     .FALSE.
MAX        integer, floating point (real only)         smallest possible value
MIN        integer, floating point (real only)         largest possible value
IAND       integer                                     all bits on
IOR        integer                                     0
IEOR       integer                                     0

Table 3.5 Reduction operators for C/C++.

Operator   Data Types                Initial Value
+          integer, floating point   0
*          integer, floating point   1
-          integer, floating point   0
&          integer                   all bits on
|          integer                   0
^          integer                   0
&&         integer                   1
||         integer                   0

For example, the parallel version of the sum reduction looks like this:

        sum = 0
        !$omp parallel do reduction(+ : sum)
        do i = 1, n
           sum = sum + a(i)
        enddo

At runtime, each thread performs a portion of the additions that make up the final sum as it executes its portion of the n iterations of the parallel do loop. At the end of the parallel loop, the threads combine their partial sums into a final sum. Although threads may perform the additions in an order that differs from that of the original serial program, the final result remains the same because of the commutative-associative property of the “+” operator (though as we will see shortly, there may be slight differences due to floating-point roundoff errors).

The behavior of the reduction clause, as well as restrictions on its use, are perhaps best understood by examining an equivalent OpenMP code that performs the same computation in parallel without using the reduction clause itself. The code in Example 3.8 may be viewed as a possible translation of the reduction clause by an OpenMP implementation, although implementations will likely employ other clever tricks to improve efficiency.

Example 3.8 Equivalent OpenMP code for parallelized reduction.

        sum = 0
        !$omp parallel private(priv_sum) shared(sum)

        ! holds each thread's partial sum
        priv_sum = 0

        ! same as serial do loop
        ! with priv_sum replacing sum
        !$omp do
        do i = 1, n
           ! compute partial sum
           priv_sum = priv_sum + a(i)
        enddo

        ! combine partial sums into final sum
        ! must synchronize because sum is shared
        !$omp critical
        sum = sum + priv_sum
        !$omp end critical
        !$omp end parallel

As shown in Example 3.8, the code declares a new, private variable called priv_sum. Within the body of the do loop all references to the original reduction variable sum are replaced by references to this private variable. The variable priv_sum is initialized to zero just before the start of the loop and is used within the loop to compute each thread's partial sum. Since this variable is private, the do loop can be executed in parallel. After the do loop the threads may need to synchronize as they aggregate their partial sums into the original variable, sum.

The reduction clause is best understood in terms of the behavior of the above transformed code. As we can see, the user need only supply the reduction operator and the variable with the reduction clause and can leave the rest of the details to the OpenMP implementation. Furthermore, the reduction variable may be passed as a parameter to other subroutines that perform the actual update of the reduction variable; as we can see, the above transformation will continue to work regardless of whether the actual update is within the lexical extent of the directive or not. However, the programmer is responsible for ensuring that any modifications to the variable within the parallel loop are consistent with the reduction operator that was specified.

In Tables 3.4 and 3.5, the data types listed for each operator are the allowed types for reduction variables updated using that operator. For example, in Fortran and C, addition can be performed on any floating point or integer type. Reductions may only be performed on built-in types of the base language, not user-defined types such as a record in Fortran or class in C++.

In Example 3.8 the private variable priv_sum is initialized to zero just before the reduction loop. In mathematical terms, zero is the identity value for addition; that is, zero is the value that, when added to any other value x, gives back the value x. In an OpenMP reduction, each thread's partial reduction result is initialized to the identity value for the reduction operator. The identity value for each reduction operator appears in the “Initial Value” column of Tables 3.4 and 3.5.

One caveat about parallelizing reductions is that when the type of the reduction variable is floating point, the final result may not be precisely the same as when the reduction is performed serially. The reason is that floating-point operations induce roundoff errors because floating-point variables have only limited precision. For example, suppose we add up four floating-point numbers that are accurate to four decimal digits. If the numbers are added up in this order (rounding off intermediate results to four digits):

((0.0004 + 1.000) + 0.0004) + 0.0002 = 1.000

we get a different result from adding them up in this ascending order:

((0.0002 + 0.0004) + 0.0004) + 1.000 = 1.001

For some programs, differences between serial and parallel versions resulting from roundoff may be unacceptable, so floating-point reductions in such programs should not be parallelized.

Finally, care must be exercised when parallelizing reductions that use subtraction (“–”) or the C “&&” or “||” operators. Subtraction is in fact not a commutative-associative operator, so the code to update the reduction variable must be rewritten (typically replacing “–” by “+”) for the parallel reduction to produce the same result as the serial one. The C logical operators “&&” and “||” short-circuit (do not evaluate) their right operand if the result can be determined just from the left operand. It is therefore not desirable to have side effects in the expression that updates the reduction variable, because the expression may be evaluated more or fewer times in the parallel case than in the serial one.
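For instance, a logical-AND reduction whose update expression has no side effects is unaffected by short-circuit evaluation, as in this small C++ sketch of our own (the routine name is hypothetical):

        int all_positive(int n, const double *a)
        {
            int ok = 1;                         // identity value for &&
            #pragma omp parallel for reduction(&&: ok)
            for (int i = 0; i < n; i++)
                ok = ok && (a[i] > 0.0);        // no side effects in the right operand
            return ok;
        }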

3.4.7 Private Variable Initialization and Finalization

Normally, each thread's copy of a variable scoped as private on a parallel do has an undefined initial value, and after the parallel do the master thread's copy also takes on an undefined value. This behavior has the advantage that it minimizes data copying for the common case in which we use the private variable as a temporary within the parallel loop. However, when parallelizing a loop we sometimes need access to the value that was in the master's copy of the variable just before the loop, and we sometimes need to copy the “last” value written to a private variable back to the master's copy at the end of the loop. (The “last” value is the value assigned in the last iteration of a sequential execution of the loop—this last iteration is therefore called “sequentially last.”)

For this reason, OpenMP provides the firstprivate and lastprivate variants on the private clause. At the start of a parallel do, firstprivate initializes each thread's copy of a private variable to the value of the master's copy. At the end of a parallel do, lastprivate writes back to the master's copy the value contained in the private copy belonging to the thread that executed the sequentially last iteration of the loop.

The form and usage of firstprivate and lastprivate are the same as the private clause: each takes as an argument a list of variables. The variables in the list are scoped as private within the parallel do on which the clause appears, and in addition are initialized or finalized as described above. As was mentioned in Section 3.4.1, variables may appear in at most one scope clause, with the exception that a variable can appear in both firstprivate and lastprivate, in which case it is both initialized and finalized.

In Example 3.9, x(1,1) and x(2,1) are assigned before the parallel loop and only read thereafter, while x(1,2) and x(2,2) are used within the loop as temporaries to store terms of polynomials. Code after the loop uses the terms of the last polynomial, as well as the last value of the index variable

i. Therefore x appears in a firstprivate clause, and both x and i appear in a lastprivate clause.

Example 3.9 Parallel loop with firstprivate and lastprivate variables.

        common /mycom/ x, c, y, z
        real x(n, n), c(n, n), y(n), z(n)
        ...
        ! compute x(1, 1) and x(2, 1)
        !$omp parallel do firstprivate(x) lastprivate(i, x)
        do i = 1, n
           x(1, 2) = c(i, 1) * x(1, 1)
           x(2, 2) = c(i, 2) * x(2, 1) ** 2
           y(i) = x(2, 2) + x(1, 2)
           z(i) = x(2, 2) - x(1, 2)
        enddo
        ...
        ! use x(1, 2), x(2, 2), and i

There are two important caveats about using these clauses. The first is that a firstprivate variable is initialized only once per thread, rather than once per iteration. In Example 3.9, if any iteration were to assign to x(1,1) or x(2,1), then no other iteration is guaranteed to get the initial value if it reads these elements. For this reason firstprivate is useful mostly in cases like Example 3.9, where part of a privatized array is read-only. The second caveat is that if a lastprivate variable is a compound object (such as an array or structure), and only some of its elements or fields are assigned in the last iteration, then after the parallel loop the elements or fields that were not assigned in the final iteration have an undefined value.

In C++, if an object is scoped as firstprivate or lastprivate, the initialization and finalization are performed using appropriate member functions of the object. In particular, a firstprivate object is constructed by calling its copy constructor with the master thread's copy of the variable as its argument, while if an object is lastprivate, at the end of the loop the copy assignment operator is invoked on the master thread's copy, with the sequentially last value of the variable as an argument. (It is an error if a firstprivate object has no publicly accessible copy constructor, or a lastprivate object has no publicly accessible copy assignment operator.) Example 3.10 shows how this works. Inside the parallel loop, each private copy of c1 is copy-constructed such that its val member has the value 2. On the last iteration, 11 is assigned to c2.val, and this value is copy-assigned back to the master thread's copy of c2.

Example 3.10 firstprivate and lastprivate objects in C++.

        class C {
          public:
            int val;

            // default constructor
            C() { val = 0; }
            C(int _val) { val = _val; }

            // copy constructor
            C(const C &c) { val = c.val; }

            // copy assignment operator
            C & operator = (const C &c) { val = c.val; return *this; }
        };

        void f()
        {
            C c1(2), c2(3);
            ...
        #pragma omp for firstprivate(c1) lastprivate(c2)
            for (int i = 0; i < 10; i++) {
        #pragma omp critical
                c2.val = c1.val + i;    // c1.val == 2
            }
            // after the loop, c2.val == 11
        }

3.5 Removing Data Dependences

Up to this point in the chapter, we have concentrated on describing OpenMP's features for parallelizing loops. For the remainder of the chapter we will mostly discuss how to use these features to parallelize loops correctly and effectively.

First and foremost, when parallelizing loops it is necessary to maintain the program's correctness. After all, a parallel program is useless if it produces its results quickly but the results are wrong! The key characteristic of a loop that allows it to run correctly in parallel is that it must not contain any data dependences. In this section we will explain what data dependences are and what kinds of dependences there are. We will lay out a methodology for determining whether a loop contains any data dependences, and show how in many cases these dependences can be removed by transforming the program's source code or using OpenMP clauses.

This section provides only a brief glimpse into the topic of data dependences. While we will discuss the general rules to follow to deal correctly with dependences and show precisely what to do in many common cases, there are numerous special cases and techniques for handling them that we do not have space to address. For a more thorough introduction to this

topic, and many pointers to further reading, we refer you to Michael Wolfe's book [MW 96]. In addition, a useful list of additional simple techniques for finding and breaking dependences appears in Chapter 5 of [SGI 99].

3.5.1 Why Data Dependences Are a Problem

Whenever one statement in a program reads or writes a memory location, and another statement reads or writes the same location, and at least one of the two statements writes the location, we say that there is a data dependence on that memory location between the two statements. In this context, a memory location is anything to which the program can assign a scalar value, such as an integer, character, or floating-point value. Each scalar variable, each element of an array, and each field of a structure constitutes a distinct memory location. Example 3.11 shows a loop that contains a data dependence: each iteration (except the last) writes an element of a that is read by the next iteration. Of course, a single statement can contain multiple memory references (reads and writes) to the same location, but it is usually the case that the references involved in a dependence occur in different statements, so we will assume this from now on. In addition, there can be dependences on data external to the program, such as data in files that is accessed using I/O statements, so if you wish to parallelize code that accesses such data, you must analyze the code for dependences on this external data as well as on data in variables.

Example 3.11 A simple loop with a data dependence.

        do i = 2, N
           a(i) = a(i) + a(i - 1)
        enddo

For purposes of parallelization, data dependences are important because whenever there is a dependence between two statements on some location, we cannot execute the statements in parallel. If we did execute them in parallel, it would cause what is called a data race. A parallel program contains a data race whenever it is possible for two or more statements to read or write the same memory location at the same time, and at least one of the statements writes the location.

In general, data races cause correctness problems because when we execute a parallel program that contains a data race, it may not produce the same results as an equivalent serial program. To see why, consider what might happen if we try to parallelize the loop in Example 3.11 by naively applying a parallel do directive. Suppose n is 3, so the loop iterates

just twice, and at the start of the loop, the first three elements of a have been initialized to the values 1, 2, and 3. After a correct serial execution, the first three values are 1, 3, and 6. However, in a parallel execution it is possible for the assignment of the value 3 to a(2) in the first iteration to happen either before or after the read of a(2) in the second iteration (the two statements are “racing” each other). If the assignment happens after the read, a(3) receives an incorrect value of 5.

3.5.2 The First Step: Detection

Now that we have seen why data dependences are a problem, the first step in dealing with them is to detect any that are present in the loop we wish to parallelize. Since each iteration executes in parallel, but within a single iteration statements in the loop body are performed in sequence, the case that concerns us is a dependence between statements executed in different iterations of the loop. Such a dependence is called loop-carried. Because dependences are always associated with a particular memory location, we can detect them by analyzing how each variable is used within the loop, as follows:

• Is the variable only read and never assigned within the loop body? If so, there are no dependences involving it.
• Otherwise, consider the memory locations that make up the variable and that are assigned within the loop. For each such location, is there exactly one iteration that accesses the location? If so, there are no dependences involving the variable. If not, there is a dependence.

To perform this analysis, we need to reason about each memory location accessed in each iteration of the loop. Reasoning about scalar variables is usually straightforward, since they uniquely identify the memory location being referenced. Reasoning about array variables, on the other hand, can be tricky because different iterations may access different locations due to the use of array subscript expressions that vary from one iteration to the next. The key is to recognize that we need to find two different values of the parallel loop index variable (call them i and i′) that both lie within the bounds of the loop, such that iteration i assigns to some element of an array a, and iteration i′ reads or writes that same element of a. If we can find suitable values for i and i′, there is a dependence involving the array. If we can satisfy ourselves that there are no such values, there is no dependence involving the array.

As a simple rule of thumb, a loop that meets all the following criteria has no dependences and can always be parallelized:

• All assignments are to arrays.
• Each element is assigned by at most one iteration.
• No iteration reads elements assigned by any other iteration.

When all the array subscripts are linear expressions in terms of i (as is often the case), we can use the subscript expressions and constraints imposed by the loop bounds to form a system of linear inequalities whose solutions identify all of the loop's data dependences. There are well-known general techniques for solving systems of linear inequalities, such as integer programming and Fourier-Motzkin projection. However, discussion of these techniques is beyond the scope of this book, so you should see [MW 96] for an introduction. Instead, in many practical cases the loop's bounds and subscript expressions are simple enough that we can find these loop index values i and i′ just by inspection. Example 3.11 is one such case: each iteration i writes element a(i), while iteration i + 1 reads a(i), so clearly there is a dependence between each successive pair of iterations.

Example 3.12 contains additional common cases that demonstrate how to reason about dependences and hint at some of the subtleties involved. The loop at line 10 is quite similar to that in Example 3.11, but in fact contains no dependence: unlike Example 3.11, this loop has a stride of 2, so it writes every other element, and each iteration reads only elements that it writes or that are not written by the loop. The loop at line 20 also contains no dependences, because each iteration reads only the element it writes plus an element that is not written by the loop since it has a subscript greater than n/2. The loop at line 30 is again quite similar to that at line 20, yet there is a dependence because the first iteration reads a(n/2 + 1) while the last iteration writes this element. Finally, the loop at line 40 uses subscripts that are not linear expressions of the index variable i. In cases like this we must rely on whatever knowledge we have of the index expression. In particular, if the index array idx is known to be a permutation array—that is, if we know that no two elements of idx have the same value (which is frequently the case for index arrays used to represent linked-list structures)—we can safely parallelize this loop because each iteration will read and write a different element of a.

Example 3.12 Loops with nontrivial bounds and array subscripts.

    10  do i = 2, n, 2
           a(i) = a(i) + a(i - 1)
        enddo

    20  do i = 1, n/2
           a(i) = a(i) + a(i + n/2)
        enddo

    30  do i = 1, n/2 + 1
           a(i) = a(i) + a(i + n/2)
        enddo

    40  do i = 1, n
           a(idx(i)) = a(idx(i)) + b(idx(i))
        enddo

Of course, loop nests can contain more than one loop, and arrays can have more than one dimension. The three-deep loop nest in Example 3.13 computes the product of two matrices C = A × B. For reasons we will explain later in the chapter, we usually want to parallelize the outermost loop in such a nest. For correctness, there must not be a dependence between any two statements executed in different iterations of the parallelized loop. However, there may be dependences between statements executed within a single iteration of the parallel loop, including dependences between different iterations of an inner, serial loop. In this matrix multiplication example, we can safely parallelize the j loop because each iteration of the j loop computes one column c(1:n, j) of the product and does not access elements of c that are outside that column. The dependence on c(i, j) in the serial k loop does not inhibit parallelization.

Example 3.13 Matrix multiplication.

        do j = 1, n
           do i = 1, n
              c(i, j) = 0
              do k = 1, n
                 c(i, j) = c(i, j) + a(i, k) * b(k, j)
              enddo
           enddo
        enddo

It is important to remember that dependences must be analyzed not just within the lexical extent of the loop being parallelized, but within its entire dynamic extent. One major source of data race bugs is that subroutines called within the loop body may assign to variables that would have shared scope if the loop were executed in parallel. In Fortran, this problem is typically caused by variables in common blocks, variables with the save attribute, and module data (in Fortran 90); in C and C++ the usual culprits are global and static variables. Furthermore, we must also examine how

subroutines called from a parallel loop use their parameters. There may be a dependence if a subroutine writes to a scalar output parameter, or if there is overlap in the portions of array parameters that are accessed by subroutines called from different iterations of the loop. In Example 3.14, the loop at line 10 cannot be parallelized because each iteration reads and writes the shared variable cnt in subroutine add. The loop at line 20 has a dependence due to an overlap in the portions of array argument a that are accessed in the call: subroutine smooth reads the two elements of a adjacent to the element it writes, and those neighboring elements are written by other iterations of the loop. Finally, the loop at line 30 has no dependences because subroutine add_count only accesses a(i) and only reads cnt.

Example 3.14 Loops containing subroutine calls.

        subroutine add(c, a, b)
        common /count/ cnt
        integer c, a, b, cnt

        c = a + b
        cnt = cnt + 1
        end

        subroutine smooth(a, n, i)
        integer n, a(n), i

        a(i) = (a(i) + a(i - 1) + a(i + 1))/3
        end

        subroutine add_count(a, n, i)
        common /count/ cnt
        integer n, a(n), i, cnt

        a(i) = a(i) + cnt
        end

    10  do i = 1, n
           call add(c(i), a(i), b(i))
        enddo

    20  do i = 2, n - 1
           call smooth(a, n, i)
        enddo

    30  do i = 1, n
           call add_count(a, n, i)
        enddo

3.5.3 The Second Step: Classification

Once a dependence has been detected, the next step is to figure out what kind of dependence it is. This helps determine whether it needs to be removed, whether it can be removed, and, if it can, what technique to use to remove it. We will discuss two different classification schemes that are particularly useful for parallelization.

We mentioned in Section 3.5.2 that dependences may be classified based on whether or not they are loop-carried, that is, whether or not the two statements involved in the dependence occur in different iterations of the parallel loop. A non-loop-carried dependence does not cause a data race because within a single iteration of a parallel loop, each statement is executed in sequence, in the same way that the master thread executes serial portions of the program. For this reason, non-loop-carried dependences can generally be ignored when parallelizing a loop.

One subtle special case of non-loop-carried dependences occurs when a location is assigned in only some rather than all iterations of a loop. This case is illustrated in Example 3.15, where the assignment to x is controlled by the if statement at line 10. If the assignment were performed in every iteration, there would be just a non-loop-carried dependence between the assignment and the use of x at line 20, which we could ignore. But because the assignment is performed only in some iterations, there is in fact a loop-carried dependence between line 10 in one iteration and line 20 in the next. In other words, because the assignment is controlled by a conditional, x is involved in both a non-loop-carried dependence between lines 10 and 20 (which we can ignore) and a loop-carried dependence between the same lines (which inhibits parallelization).

Example 3.15 A loop-carried dependence caused by a conditional.

        x = 0
        do i = 1, n
    10     if (switch_val(i)) x = new_val(i)
    20     a(i) = x
        enddo

There is one other scheme for classifying dependences that is crucial for parallelization. It is based on the dataflow relation between the two dependent statements, that is, it concerns whether or not the two statements communicate values through the memory location. Let the statement performed earlier in a sequential execution of the loop be called S1, and let the later statement be called S2. The kind of dependence that is the

most important and difficult to handle is when S1 writes the memory location, S2 reads the location, and the value read by S2 in a serial execution is the same as that written by S1. In this case the result of a computation by S1 is communicated, or “flows,” to S2, so we call this kind a flow dependence. Because S1 must execute first to produce the value that is consumed by S2, in general we cannot remove the dependence and execute the two statements in parallel (hence this case is sometimes called a “true” dependence). However, we will see in Section 3.5.4 that there are some situations in which we can parallelize loops that contain flow dependences.

In this dataflow classification scheme, there are two other kinds of dependences. We can always remove these two kinds because they do not represent communication of data between S1 and S2, but instead are instances of reuse of the memory location for different purposes at different points in the program. In the first of these, S1 reads the location, then S2 writes it. Because this memory access pattern is the opposite of a flow dependence, this case is called an anti dependence. As we will see shortly, we can parallelize a loop that contains an anti dependence by giving each iteration a private copy of the location and initializing the copy belonging to S1 with the value S1 would have read from the location during a serial execution. In the second of the two kinds, both S1 and S2 write the location. Because only writing occurs, this is called an output dependence. Suppose we execute the loop serially, and give the name v to the last value written to the location. We will show below that we can always parallelize in the presence of an output dependence by privatizing the memory location and in addition copying v back to the shared copy of the location at the end of the loop.

To make all these categories of dependences clearer, the loop in Example 3.16 contains at least one instance of each. Every iteration of the loop is involved in six different dependences, which are listed in Table 3.6. For each dependence, the table lists the associated memory location and

Table 3.6 List of data dependences present in Example 3.16.

                    Earlier Statement           Later Statement
Memory Location   Line   Iteration   Access   Line   Iteration   Access   Loop-carried?   Kind of Dataflow
x                 10     i           write    20     i           read     no              flow
x                 10     i           write    10     i + 1       write    yes             output
x                 20     i           read     10     i + 1       write    yes             anti
a(i + 1)          20     i           read     20     i + 1       write    yes             anti
b(i)              30     i           write    30     i + 1       read     yes             flow
c(2)              40     i           write    40     i + 1       write    yes             output

earlier and later dependent statements (S1 and S2), as well as whether the dependence is loop-carried and its dataflow classification. The two statements are identified by their line number, iteration number, and the kind of memory access they perform on the location. Although in reality every iteration is involved with every other iteration in an anti and an output dependence on x and an output dependence on c(2), for brevity the table shows just the dependences between iterations i and i + 1. In addition, the loop index variable i is only read by the statements in the loop, so we ignore any dependences involving the variable i. Finally, notice that there are no dependences involving d because this array is read but not written by the loop.

Example 3.16   A loop containing multiple data dependences.

      do i = 2, N - 1
10       x = d(i) + i
20       a(i) = a(i + 1) + x
30       b(i) = b(i) + b(i - 1) + d(i - 1)
40       c(2) = 2 * i
      enddo

3.5.4   The Third Step: Removal

With a few exceptions, it is necessary to remove each loop-carried dependence within a loop that we wish to parallelize. Many dependences can be removed either by changing the scope of the variable involved in the dependence using a clause on the parallel do directive, or by transforming the program's source code in a simple manner, or by doing both. We will first present techniques for dealing with the easier dataflow categories of anti and output dependences, which can in principle always be removed, although this may sometimes be inefficient. Then we will discuss several special cases of flow dependences that we are able to remove, while pointing out that there are many instances in which removal of flow dependences is either impossible or requires extensive algorithmic changes. When changing a program to remove one dependence in a parallel loop, it is critical that you not violate any of the other dependences that are present. In addition, if you introduce additional loop-carried dependences, you must remove these as well.

Removing Anti and Output Dependences

In an anti dependence, there is a race between statement S1 reading the location and S2 writing it. We can break the race condition by giving each thread or iteration a separate copy of the location. We must also ensure that S1 reads the correct value from the location. If each iteration initializes the location before S1 reads it, we can remove the dependence just by privatization. In Example 3.17, there is a loop-carried anti dependence on the variable x that is removed using this technique. On the other hand, the value read by S1 may be assigned before the loop, as is true of the array element a(i + 1) read in line 10 of Example 3.17. To remove this dependence, we can make a copy of the array a before the loop (called a2) and read the copy rather than the original array within the parallel loop. Of course, creating a copy of the array adds memory and computation overhead, so we must ensure that there is enough work in the loop to justify the additional overhead.

Example 3.17 Removal of anti dependences. Serial version containing anti dependences:

      ! Array a is assigned before start of loop.
      do i = 1, N - 1
         x = (b(i) + c(i))/2
10       a(i) = a(i + 1) + x
      enddo

Parallel version with dependences removed:

!$omp parallel do shared(a, a2)
      do i = 1, N - 1
         a2(i) = a(i + 1)
      enddo

!$omp parallel do shared(a, a2) private(x)
      do i = 1, N - 1
         x = (b(i) + c(i))/2
10       a(i) = a2(i) + x
      enddo

In Example 3.18, the last values assigned to x and d(1) within the loop and the value assigned to d(2) before the loop are read by a statement that follows the loop. We say that the values in these locations are live-out from the loop. Whenever we parallelize a loop, we must ensure that live-out locations have the same values after executing the loop as they would have if the loop were executed serially. If a live-out variable is scoped as shared on a parallel loop and there are no loop-carried output dependences on it (i.e., each of its locations is assigned by at most one iteration), then this condition is satisfied.

On the other hand, if a live-out variable is scoped as private (to remove a dependence on the variable) or some of its locations are assigned by more than one iteration, then we need to perform some sort of finalization to ensure that it holds the right values when the loop is finished. To parallelize the loop in Example 3.18, we must finalize both x (because it is scoped private to break an anti dependence on it) and d (because there is a loop-carried output dependence on d(1)). We can perform finalization on x simply by scoping it with a lastprivate rather than private clause. As we explained in Section 3.4.7, the lastprivate clause both scopes a variable as private within a parallel loop and also copies the value assigned to the variable in the last iteration of the loop back to the shared copy. It requires slightly more work to handle a case like the output dependence on d(1): we cannot scope d as lastprivate because if we did, it would overwrite the live-out value in d(2). One solution, which we use in this example, is to introduce a lastprivate temporary (called d1) to copy back the final value for d(1).

Of course, if the assignment to a live-out location within a loop is performed only conditionally (such as when it is part of an if statement), the lastprivate clause will not perform proper finalization because the final value of the location may be assigned in some iteration other than the last, or the location may not be assigned at all by the loop. It is still possible to perform finalization in such cases: for example, we can keep track of the last value assigned to the location by each thread, then at the end of the loop we can copy back the value assigned in the highest-numbered iteration. However, this sort of finalization technique is likely to be much more expensive than the lastprivate clause. This highlights the fact that, in general, we can always preserve correctness when removing anti and output dependences, but we may have to add significant overhead to remove them.
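As a rough sketch of such manual finalization (this is our illustration, not an example from the text), the loop below assigns the live-out variable x only when a hypothetical condition c(i) > 0 holds. Each thread records the highest-numbered iteration in which it assigned x, along with the value it assigned, in the hypothetical shared arrays last_iter and last_val; after the loop, a serial sweep copies back the value from the overall highest-numbered assigning iteration. The routine name, array names, and the MAXTHREADS bound are all assumptions introduced for the illustration.

      subroutine cond_finalize(a, b, c, n, x)
      integer n, i, j, iam, maxi
      real a(n), b(n), c(n), x
      integer MAXTHREADS
      parameter (MAXTHREADS = 64)
      integer last_iter(MAXTHREADS)
      real last_val(MAXTHREADS)
      external omp_get_thread_num
      integer omp_get_thread_num

      do j = 1, MAXTHREADS
         last_iter(j) = 0
      enddo

!$omp parallel do private(i, iam, x)
!$omp+ shared(a, b, c, last_iter, last_val)
      do i = 1, n
         if (c(i) .gt. 0.0) then
            ! conditional assignment to the live-out variable x
            x = 2.0 * b(i)
            a(i) = a(i) + x
            iam = omp_get_thread_num() + 1
            if (i .gt. last_iter(iam)) then
               ! remember this thread's highest assigning iteration
               last_iter(iam) = i
               last_val(iam) = x
            endif
         endif
      enddo

      ! Finalization: copy back the value assigned in the
      ! highest-numbered iteration, if any iteration assigned x at all.
      maxi = 0
      do j = 1, MAXTHREADS
         if (last_iter(j) .gt. maxi) then
            maxi = last_iter(j)
            x = last_val(j)
         endif
      enddo
      return
      end

The per-thread bookkeeping is cheap, but the extra arrays and the serial sweep are exactly the kind of added overhead the text warns about.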

Example 3.18 Removal of output dependences. Serial version containing output dependences:

      do i = 1, N
         x = (b(i) + c(i))/2
         a(i) = a(i) + x
         d(1) = 2 * x
      enddo
      y = x + d(1) + d(2)

Parallel version with dependences removed:

!$omp parallel do shared(a) lastprivate(x, d1)
      do i = 1, N
         x = (b(i) + c(i))/2
         a(i) = a(i) + x
         d1 = 2 * x
      enddo
      d(1) = d1

y = x + d(1) + d(2)

Removing Flow Dependences

As we stated before, we cannot always remove a flow dependence and run the two dependent statements in parallel because the computation performed by the later statement, S2, depends on the value produced by the earlier one, S1. However, there are some special cases in which we can remove flow dependences, three of which we will now describe. We have in fact already seen the first case: reduction computations. In a reduction, such as that depicted in Example 3.19, the statement that updates the reduction variable also reads the variable, which causes a loop-carried flow dependence. But as we discussed in Section 3.4.6, we can remove this flow dependence and parallelize the reduction computation by scoping the variable with a reduction clause that specifies the operator with which to update the variable.

Example 3.19 Removing the flow dependence caused by a reduction. Serial version containing a flow dependence:

      x = 0
      do i = 1, N
         x = x + a(i)
      enddo

Parallel version with dependence removed:

      x = 0
!$omp parallel do reduction(+: x)
      do i = 1, N
         x = x + a(i)
      enddo

If a loop updates a variable in the same fashion as a reduction, but also uses the value of the variable in some expression other than the one that computes the updated value (idx, i_sum, and pow2 are updated and used in this way in Example 3.20), we cannot remove the flow dependence simply by scoping the variable with a reduction clause. This is because the values of a reduction variable during each iteration of a parallel execution differ from those of a serial execution. However, there is a special class of reduction computations, called inductions, in which the value of the reduction variable during each iteration is a simple function of the loop index variable. For example, if the variable is updated using multiplication by a constant, increment by a constant, or increment by the loop index variable, then we can replace uses of the variable within the loop by a simple expression containing the loop index. This technique is called induction variable elimination, and we use it in Example 3.20 to remove loop-carried flow dependences on idx, i_sum, and pow2. The expression for i_sum relies on the fact that

$$\sum_{i=1}^{N} i = \frac{N(N+1)}{2}$$

This kind of dependence often appears in loops that initialize arrays and when an induction variable is introduced to simplify array subscripts (idx is used in this way in the example).

Example 3.20 Removing flow dependences using induction variable elimination. Serial version containing flow dependences:

      idx = N/2 + 1
      i_sum = 1
      pow2 = 2
      do i = 1, N/2
         a(i) = a(i) + a(idx)
         b(i) = i_sum
         c(i) = pow2
         idx = idx + 1
         i_sum = i_sum + i
         pow2 = pow2 * 2
      enddo

Parallel version with dependences removed:

!$omp parallel do shared(a, b, c)
      do i = 1, N/2
         a(i) = a(i) + a(i + N/2)
         b(i) = 1 + (i * (i - 1))/2
         c(i) = 2 ** i
      enddo

The third technique we will describe for removing flow dependences is called loop skewing. The basic idea of this technique is to convert a loop-carried flow dependence into a non-loop-carried one. Example 3.21 shows a loop that can be parallelized by skewing it. The serial version of the loop has a loop-carried flow dependence from the assignment to a(i) at line 20 in iteration i to the read of a(i - 1) at line 10 in iteration i + 1. However, we can compute each element a(i) in parallel because its value does not depend on any other elements of a. In addition, we can shift, or "skew," the subsequent read of a(i) from iteration i + 1 to iteration i, so that the dependence becomes non-loop-carried. After adjusting subscripts and loop bounds appropriately, the final parallelized version of the loop appears at the bottom of Example 3.21.

Example 3.21 Removing flow dependences using loop skewing. Serial version containing flow dependence:

      do i = 2, N
10       b(i) = b(i) + a(i - 1)
20       a(i) = a(i) + c(i)
      enddo

Parallel version with dependence removed:

      b(2) = b(2) + a(1)
!$omp parallel do shared(a, b, c)
      do i = 2, N - 1
20       a(i) = a(i) + c(i)
10       b(i + 1) = b(i + 1) + a(i)
      enddo
      a(N) = a(N) + c(N)

Dealing with Nonremovable Dependences

Although we have just seen several straightforward parallelization techniques that can remove certain categories of loop-carried flow dependences, in general this kind of dependence is difficult or impossible to remove. For instance, the simple loop in Example 3.22 is a member of a common category of computations called recurrences. It is impossible to parallelize this loop using a single parallel do directive and simple transformation of the source code because computing each element a(i) requires that we have the value of the previous element a(i - 1). However, by using a completely different algorithm called a parallel scan, it is possible to compute this and other kinds of recurrences in parallel (see Exercise 3).

Example 3.22   A recurrence computation that is difficult to parallelize.

      do i = 2, N
         a(i) = (a(i - 1) + a(i))/2
      enddo

Even when we cannot remove a particular flow dependence, we may be able to parallelize some other part of the code that contains the dependence. We will show three different techniques for doing this. When applying any of these techniques, it is critical to make sure that we do not violate any of the other dependences that are present and do not introduce any new loop-carried dependences that we cannot remove.

The first technique is applicable only when the loop with the nonremovable dependence is part of a nest of at least two loops. The technique is quite simple: try to parallelize some other loop in the nest. In Example 3.23, the j loop contains a recurrence that is difficult to remove, so we can parallelize the i loop instead. As we will see in Section 3.6.1 and in Chapter 6, the choice of which loop in the nest runs in parallel can have a profound impact on performance, so the parallel loop must be chosen carefully.

Example 3.23 Parallelization of a loop nest containing a recurrence. Serial version:

      do j = 1, N
         do i = 1, N
            a(i, j) = a(i, j) + a(i, j - 1)
         enddo
      enddo

Parallel version:

      do j = 1, N
!$omp parallel do shared(a)
         do i = 1, N
            a(i, j) = a(i, j) + a(i, j - 1)
         enddo
      enddo

The second technique assumes that the loop that contains the nonremovable dependence also contains other code that can be parallelized. By splitting, or fissioning, the loop into a serial and a parallel portion, we can achieve a speedup on at least the parallelizable portion of the loop. (Of course, when fissioning the loop we must not violate any dependences between the serial and parallel portions.) In Example 3.24, the recurrence computation using a in line 10 is hard to parallelize, so we fission it off into a serial loop and parallelize the rest of the original loop.

Example 3.24 Parallelization of part of a loop using fissioning. Serial version:

      do i = 1, N
10       a(i) = a(i) + a(i - 1)
20       y = y + c(i)
      enddo

Parallel version:

      do i = 1, N
10       a(i) = a(i) + a(i - 1)
      enddo

!$omp parallel do reduction(+: y)
      do i = 1, N
20       y = y + c(i)
      enddo

The third technique also involves splitting the loop into serial and parallel portions. However, unlike fissioning this technique can also move non-loop-carried flow dependences from statements in the serial portion to statements in the parallel portion. In Example 3.25, there are loop-carried and non-loop-carried flow dependences on y. We cannot remove the loop-carried dependence, but we can parallelize the computation in line 20. The trick is that in iteration i of the parallel loop we must have available the value that is assigned to y during iteration i of a serial execution of the loop. To make it available, we fission off the update of y in line 10. Then we perform a transformation, called scalar expansion, that stores in a temporary array y2 the value assigned to y during each iteration of the serial loop. Finally, we parallelize the loop that contains line 20 and replace references to y with references to the appropriate element of y2. A major disadvantage of this technique is that it introduces significant memory overhead due to the use of the temporary array, so you should use scalar expansion only when the speedup is worth the overhead.

Example 3.25 Parallelization of part of a loop using scalar expansion and fissioning. Serial version:

      do i = 1, N
10       y = y + a(i)
20       b(i) = (b(i) + c(i)) * y
      enddo

Parallel version:

      y2(1) = y + a(1)
      do i = 2, N
10       y2(i) = y2(i - 1) + a(i)
      enddo
      y = y2(N)

!$omp parallel do shared(b, c, y2)
      do i = 1, N
20       b(i) = (b(i) + c(i)) * y2(i)
      enddo

3.5.5   Summary

To prevent data races in a loop we are parallelizing, we must remove each loop-carried dependence that is present. There is a loop-carried dependence whenever two statements in different iterations access a memory location, and at least one of the statements writes the location. Based upon the dataflow through the memory location between the two statements, each dependence may be classified as an anti, output, or flow dependence.

We can remove anti dependences by providing each iteration with an initialized copy of the memory location, either through privatization or by introducing a new array variable. Output dependences can be ignored unless the location is live-out from the loop. The values of live-out variables can often be finalized using the lastprivate clause.

We cannot always remove loop-carried flow dependences. However, we can parallelize a reduction, eliminate an induction variable, or skew a loop to make a dependence become non-loop-carried. If we cannot remove a flow dependence, we may instead be able to parallelize another loop in the nest, fission the loop into serial and parallel portions, or remove a dependence on a nonparallelizable portion of the loop by expanding a scalar into an array.

Whenever we modify a loop to remove a dependence, we should be careful that the memory or computational cost of the transformation does not exceed the benefit of parallelizing the loop. In addition, we must not violate any of the other data dependences present in the loop, and we must remove any new loop-carried dependences that we introduce. The discussion in this section has emphasized data dependences that appear within parallel loops. However, the general rules for detecting, classifying, and removing dependences remain the same in the presence of other forms of parallelism, such as coarse-grained parallelism expressed using parallel regions.

3.6   Enhancing Performance

There is no guarantee that just because a loop has been correctly parallelized, its performance will improve. In fact, in some circumstances parallelizing the wrong loop can slow the program down. Even when the choice of loop is reasonable, some performance tuning may be necessary to make the loop run acceptably fast. Two key factors that affect performance are parallel overhead and loop scheduling. OpenMP provides several features for controlling these factors, by means of clauses on the parallel do directive. In this section we will briefly introduce these two concepts, then describe OpenMP's performance-tuning features for controlling these aspects of a program's behavior. There are several other factors that can have a big impact on performance, such as synchronization overhead and memory cache utilization. These topics receive a full discussion in Chapters 5 and 6, respectively.

3.6.1   Ensuring Sufficient Work

From the discussion of OpenMP's execution model in Chapter 2, it should be clear that running a loop in parallel adds runtime costs: the master thread has to start the slaves, iterations have to be divided among the threads, and threads must synchronize at the end of the loop. We call these additional costs parallel overhead.

Each iteration of a loop involves a certain amount of work, in the form of integer and floating-point operations, loads and stores of memory locations, and control flow instructions such as subroutine calls and branches. In many loops, the amount of work per iteration may be small, perhaps just a few instructions (saxpy is one such case, as it performs just a load, a multiply, an add, and a store on each iteration). For these loops, the parallel overhead for the loop may be orders of magnitude larger than the average time to execute one iteration of the loop. Unfortunately, if we add a parallel do directive to the loop, it is always run in parallel each time. Due to the parallel overhead, the parallel version of the loop may run slower than the serial version when the trip-count is small.

Suppose we find out by measuring execution times that the smallest trip-count for which parallel saxpy is faster than the serial version is about 800. Then we could avoid slowdowns at low trip-counts by versioning the loop, that is, by creating separate serial and parallel versions. This is shown in the first part of Example 3.26.

Example 3.26 Avoiding parallel overhead at low trip-counts. Using explicit versioning:

      if (n .ge. 800) then
!$omp parallel do
         do i = 1, n
            z(i) = a * x(i) + y
         enddo
      else
         do i = 1, n
            z(i) = a * x(i) + y
         enddo
      end if

Using the if clause:

!$omp parallel do if (n .ge. 800)
      do i = 1, n
         z(i) = a * x(i) + y
      enddo

Creating separate copies by hand is tedious, so OpenMP provides a feature called the if clause that versions loops more concisely. The if clause appears on the parallel do directive and takes a Boolean expression as an argument. Each time the loop is reached during execution, if the expression evaluates to true, the loop is run in parallel. If the expression is false, the loop is executed serially, without incurring parallel overhead.

Just as it may be unknown until runtime whether there is sufficient work in a loop to make parallel execution worthwhile, it may be unknown until runtime whether there are data dependences within the loop. The if clause can be useful in these circumstances to cause a loop to execute serially if a runtime test determines that there are dependences, and to execute in parallel otherwise.
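For instance (this sketch is ours, not an example from the text), a loop that accumulates into an array through an index array ix carries dependences only when ix contains repeated values, which may be known only at runtime. Assuming the logical flag no_duplicates has been set by earlier code that inspected ix, the loop can be versioned with an if clause:

      ! Assume earlier code has set no_duplicates to .true. only if
      ! every element of ix(1:n) is distinct; otherwise different
      ! iterations may update the same element of a.
!$omp parallel do if (no_duplicates)
      do i = 1, n
         a(ix(i)) = a(ix(i)) + b(i)
      enddo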

When we want to speed up a loop nest, it is generally best to parallelize the loop that is as close as possible to being outermost. This is because of parallel overhead: we pay this overhead each time we reach a parallel loop, and the outermost loop in the nest is reached only once each time control reaches the nest, whereas inner loops are reached once per iteration of the loop that encloses them.

Because of data dependences, the outermost loop in a nest may not be parallelizable. In the first part of Example 3.27, the outer loop is a recurrence, while the inner loop is parallelizable. However, if we put a parallel do directive on the inner loop, we will incur the parallel overhead n - 1 times, that is, each time we reach the do i loop. We can solve this problem by applying a source transformation called loop interchange that swaps the positions of inner and outer loops. (Of course, when we transform the loop nest in this way, we must respect its data dependences, so that it produces the same results as before.) The second part of the example shows the same loop nest, after interchanging the loops and parallelizing the now-outermost i loop.

Example 3.27 Reducing parallel overhead through loop interchange. Original serial code:

      do j = 2, n        ! Not parallelizable.
         do i = 1, n     ! Parallelizable.
            a(i, j) = a(i, j) + a(i, j - 1)
         enddo
      enddo

Parallel code after loop interchange:

!$omp parallel do
      do i = 1, n
         do j = 2, n
            a(i, j) = a(i, j) + a(i, j - 1)
         enddo
      enddo

In this example, we have reduced the total amount of parallel overhead, but as we will see in Chapter 6, the transformed loop nest has worse utilization of the memory cache. This implies that transformations that improve one aspect of a program's performance may hurt another aspect; these trade-offs must be made carefully when tuning the performance of an application.

3.6.2   Scheduling Loops to Balance the Load

As we have mentioned before, the manner in which iterations of a parallel loop are assigned to threads is called the loop's schedule. Using the default schedule on most implementations, each thread executing a parallel loop performs about as many iterations as any other thread. When each iteration has approximately the same amount of work (such as in the saxpy loop), this causes threads to carry about the same amount of load (i.e., to perform about the same total amount of work) and to finish the loop at about the same time. Generally, when the load is balanced fairly equally among threads, a loop runs faster than when the load is unbalanced. So for a simple parallel loop such as saxpy, the default schedule is close to optimal.

Unfortunately it is often the case that different iterations have different amounts of work. Consider the code in Example 3.28. Each iteration of the loop may invoke either one of the subroutines smallwork or bigwork. Depending on the loop instance, therefore, the amount of work per iteration may vary in a regular way with the iteration number (say, increasing or decreasing linearly), or it may vary in an irregular or even random way. If the load in such a loop is unbalanced, there will be synchronization delays at some later point in the program, as faster threads wait for slower ones to catch up. As a result the total execution time of the program will increase.

Example 3.28   Parallel loop with an uneven load.

!$omp parallel do private(xkind)
      do i = 1, n
         xkind = f(i)
         if (xkind .lt. 10) then
            call smallwork(x(i))
         else
            call bigwork(x(i))
         endif
      enddo

By changing the schedule of a load-unbalanced parallel loop, it is possible to reduce these synchronization delays and thereby speed up the program. A schedule is specified by a schedule clause on the parallel do directive. In this section we will describe each of the options available for scheduling, concentrating on the mechanics of how each schedule assigns iterations to threads and on the overhead it imposes. Chapter 6 discusses the trade-offs between the different scheduling types and provides guidelines for selecting a schedule for the best performance.

3.6.3   Static and Dynamic Scheduling

One basic characteristic of a loop schedule is whether it is static or dynamic:

• In a static schedule, the choice of which thread performs a particular iteration is purely a function of the iteration number and number of threads. Each thread performs only the iterations assigned to it at the beginning of the loop.

• In a dynamic schedule, the assignment of iterations to threads can vary at runtime from one execution to another. Not all iterations are assigned to threads at the start of the loop. Instead, each thread requests more iterations after it has completed the work already assigned to it.

A dynamic schedule is more flexible: if some threads happen to finish their iterations sooner, more iterations are assigned to them. However, the OpenMP runtime system must coordinate these assignments to guarantee that every iteration gets executed exactly once. Because of this coordination, requests for iterations incur some synchronization cost. Static scheduling has lower overhead because it does not incur this cost, but it cannot compensate for load imbalances by shifting more iterations to less heavily loaded threads.

In both schemes, iterations are assigned to threads in contiguous ranges called chunks. The chunk size is the number of iterations a chunk contains. For example, if we executed the saxpy loop on an array with 100 elements using four threads and the default schedule, thread 1 might be assigned a single chunk of 25 iterations in which i varies from 26 to 50. When using a dynamic schedule, each time a thread requests more iterations from the OpenMP runtime system, it receives a new chunk to work on. For example, if saxpy were executed using a dynamic schedule with a fixed chunk size of 10, then thread 1 might be assigned three chunks with iterations 11 to 20, 41 to 50, and 81 to 90.
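To make the dynamic case concrete, the saxpy loop could request a dynamic schedule with a chunk size of 10 as shown below; the schedule clause syntax itself is described in the next section, and the particular chunk size here is purely illustrative, not a recommendation from the text.

      ! chunks of 10 iterations are handed out to threads on demand
!$omp parallel do schedule(dynamic, 10)
      do i = 1, n
         z(i) = a * x(i) + y
      enddo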

3.6.4   Scheduling Options

Each OpenMP scheduling option assigns chunks to threads either statically or dynamically. The other characteristic that differentiates the schedules is the way chunk sizes are determined. There is also an option for the schedule to be determined by an environment variable.

The syntactic form of a schedule clause is

schedule(type[, chunk])

The type is one of static, dynamic, guided, or runtime. If it is present, chunk must be a scalar integer value. The kind of schedule specified by the clause depends on a combination of the type and whether chunk is present, according to these rules, which are also summarized in Table 3.7:

• If type is static and chunk is not present, each thread is statically assigned one chunk of iterations. The chunks will be equal or nearly equal in size, but the precise assignment of iterations to threads depends on the OpenMP implementation. In particular, if the number of iterations is not evenly divisible by the number of threads, the runtime system is free to divide the remaining iterations among threads as it wishes. We will call this kind of schedule "simple static."

• If type is static and chunk is present, iterations are divided into chunks of size chunk until fewer than chunk remain. How the remaining iterations are divided into chunks depends on the implementation. Chunks are statically assigned to processors in a round-robin fashion: the first thread gets the first chunk, the second thread gets the second chunk, and so on, until no more chunks remain. We will call this kind of schedule "interleaved."

• If type is dynamic, iterations are divided into chunks of size chunk, similarly to an interleaved schedule. If chunk is not present, the size of all chunks is 1. At runtime, chunks are assigned to threads dynamically. We will call this kind of schedule "simple dynamic."

• If type is guided, the first chunk is of some implementation-dependent size, and the size of each successive chunk decreases exponentially (it is a certain percentage of the size of the preceding chunk) down to a minimum size of chunk. (An example that shows what this looks like appears below.) The value of the exponent depends on the implementation. If fewer than chunk iterations remain, how the rest are divided into chunks also depends on the implementation. If chunk is not specified, the minimum chunk size is 1. Chunks are assigned to threads dynamically. Guided scheduling is sometimes also called "guided self-scheduling," or "GSS."

• If type is runtime, chunk must not appear. The schedule type is chosen at runtime based on the value of the environment variable OMP_SCHEDULE. The environment variable should be set to a string that matches the parameters that may appear in parentheses in a schedule clause.

For example, if the program is run on a UNIX system, then performing the C shell command

setenv OMP_SCHEDULE "dynamic,3"

before executing the program would result in the loops specified as having a runtime schedule actually being executed with a dynamic schedule with chunk size 3. If OMP_SCHEDULE is not set, the choice of schedule depends on the implementation.

The syntax is the same for Fortran and C/C++. If a parallel loop has no schedule clause, the choice of schedule is left up to the implementation; typically the default is simple static scheduling.

A word of caution: It is important that the correctness of a program not depend on the schedule chosen for its parallel loops. A common way for such a dependence to creep into a program is if one iteration writes a value that is read by another iteration that occurs later in a sequential execution. If the loop is first parallelized using a schedule that happens to assign both iterations to the same thread, the program may get correct results at first, but then mysteriously stop working if the schedule is changed while tuning performance. If the schedule is dynamic, the program may fail only intermittently, making debugging even more difficult.
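The following contrived sketch (ours, not from the text) shows how such a schedule dependence can hide. Iteration 100 writes the shared variable t and iteration 101 reads it; under a default simple static schedule with a handful of threads, both iterations usually fall into the same chunk and the loop appears to work, but an interleaved schedule such as schedule(static, 100), or any dynamic schedule, can place them on different threads and expose the race.

      ! Incorrect code: the result is right only if the schedule happens
      ! to run iterations 100 and 101, in order, on the same thread.
!$omp parallel do shared(a, t)
      do i = 1, 1000
         if (i .eq. 100) t = a(i)
         if (i .eq. 101) a(i) = t + 1.0
      enddo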

3.6.5   Comparison of Runtime Scheduling Behavior

To help make sense of these different scheduling options, Table 3.7 compares them in terms of several characteristics that affect performance. In the table, N is the trip-count of the parallel loop, P is the number of threads executing the loop, and C is the user-specified chunk size (if a schedule accepts a chunk argument but none was specified, C is 1 by default).

Determining the lower and upper bounds for each chunk of a loop's iteration space involves some amount of computation. While the precise cost of a particular schedule depends on the implementation, we can generally expect some kinds of schedules to be more expensive than others. For example, a guided schedule is typically the most expensive of all because it uses the most complex function to compute the size of each chunk. These relative computation costs appear in the "Compute Overhead" column of Table 3.7.

As we have said, dynamic schedules are more flexible in the sense that they assign more chunks to threads that finish their chunks earlier. As a result, these schedules can potentially balance the load better.

Table 3.7   Comparison of scheduling options.

    Name             type      chunk      Chunk Size            Number of Chunks   Static or Dynamic   Compute Overhead
    Simple static    static    no         N/P                   P                  static              lowest
    Interleaved      static    yes        C                     N/C                static              low
    Simple dynamic   dynamic   optional   C                     N/C                dynamic             medium
    Guided           guided    optional   decreasing from N/P   fewer than N/C     dynamic             high
    Runtime          runtime   no         varies                varies             varies              varies

To be more precise, if a simple dynamic schedule has a chunk size of C, it is guaranteed that at the end of the loop, the fastest thread never has to wait for the slowest thread to complete more than C iterations. So the quality of load balancing improves as C decreases. The potential for improved load balancing by dynamic schedules comes at the cost of one synchronization per chunk, and the number of chunks increases with decreasing C. This means that choosing a chunk size is a trade-off between the quality of load balancing and the synchronization and computation costs.

The main benefit of a guided schedule over a simple dynamic one is that guided schedules require fewer chunks, which lowers synchronization costs. A guided schedule typically picks an initial chunk size $K_0$ of about $N/P$, then picks successive chunk sizes using this formula:

$$K_i = \left(1 - \frac{1}{P}\right) \times K_{i-1}$$

For example, a parallel loop with a trip-count of 1000 executed by eight threads is divided into 41 chunks using a guided schedule when the minimum chunk size C is 1. A simple dynamic schedule would produce 1000 chunks. If C were increased to 25, the guided schedule would use 20 chunks, while a simple dynamic one would use 40. Because the chunk size of a guided schedule decreases exponentially, the number of chunks required increases only logarithmically with the number of iterations. So if the trip-count of our loop example were increased to 10,000, a simple dynamic schedule would produce 400 chunks when C is 25, but a guided schedule would produce only 38. For 100,000 iterations, the guided schedule would produce only 55.

In cases when it is not obvious which schedule is best for a parallel loop, it may be worthwhile to experiment with different schedules and measure the results.

In other cases, the best schedule may depend on the input data set. The runtime scheduling option is useful in both of these situations because it allows the schedule type to be specified for each run of the program simply by changing an environment variable rather than by recompiling the program.

Which schedule is best for a parallel loop depends on many factors. This section has only discussed in general terms such properties of schedules as their load balancing capabilities and overheads. Chapter 6 contains specific advice about how to choose schedules based on the work distribution patterns of different kinds of loops, and also based upon performance considerations such as data locality that are beyond the scope of this chapter.

3.7   Concluding Remarks

In this chapter we have described how to exploit loop-level parallelism in OpenMP using the parallel do directive. We described the basic techniques used to identify loops that can safely be run in parallel and presented some of the factors that affect the performance of the parallel loop. We showed how each of these concerns may be addressed using the constructs in OpenMP.

Loop-level parallelism is one of the most common forms of parallelism found in programs, and is also the easiest to express. The parallel do construct is therefore among the most important of the OpenMP constructs. Having mastered loop-level parallelism, in the next chapter we will move on to exploiting more general forms of parallelism.

3.8   Exercises

1. Explain why each of the following loops can or cannot be parallelized with a parallel do (or parallel for) directive.

   a) do i = 1, N
         if (x(i) .gt. maxval) goto 100
      enddo
100   continue

b) x(N/2:N) = a * y(N/2:N) + z(N/2:N)

   c) do i = 1, N
         do j = 1, size(i)
            a(j, i) = a(j, i) + a(j + 1, i)
         enddo
      enddo

   d) for (i = 0; i < N; i++) {
          if (weight[i] > HEAVY) {
              pid = fork();
              if (pid == -1) {
                  perror("fork");
                  exit(1);
              }
              if (pid == 0) {
                  heavy_task();
                  exit(1);
              }
          }
          else
              light_task();
      }

   e) do i = 1, N
         a(i) = a(i) * a(i)
         if (fabs(a(i)) .gt. machine_max .or. &
             fabs(a(i)) .lt. machine_min) then
            print *, i
            stop
         endif
      enddo

2. Consider the following loop:

      x = 1
!$omp parallel do firstprivate(x)
      do i = 1, N
         y(i) = x + i
         x = i
      enddo

a) Why is this loop incorrect? (Hint: Does y(i) get the same result regardless of the number of threads executing the loop?)

b) What is the value of i at the end of the loop? What is the value of x at the end of the loop?

c) What would be the value of x at the end of the loop if it was scoped shared?

d) Can this loop be parallelized correctly (i.e., preserving sequential semantics) just with the use of directives?

3. There are at least two known ways of parallelizing the loop in Example 3.11, although not trivially with a simple parallel do directive. Implement one. (Hint: The simplest and most general method uses parallel regions, introduced in Chapter 2 and the focus of the next chapter. There is, however, a way of doing this using only parallel do directives, although it requires additional storage, takes O(N log N) operations, and it helps if N is a power of two. Both methods rely on partial sums.)

4. Write a parallel loop that benefits from dynamic scheduling.

5. Consider the following loop:

!$omp parallel do schedule(static, chunk)
      do i = 1, N
         x(i) = a * x(i) + b
      enddo

Assuming the program is running on a cache-based multiprocessor system, what happens to the performance when we choose a chunk size of 1? 2? Experiment with chunk sizes that are powers of two, ranging up to 128. Is there a discontinuous jump in performance? Be sure to time only the loop and also make sure x(i) is initialized (so that the timings are not polluted with the cost of first mapping in x(i)). Explain the observed behavior.

CHAPTER 4

Beyond Loop-Level Parallelism: Parallel Regions

4.1   Introduction

The previous chapter focused on exploiting loop-level parallelism using OpenMP. This form of parallelism is relatively easy to exploit and provides an incremental approach towards parallelizing an application, one loop at a time. However, since loop-level parallelism is based on local analysis of individual loops, it is limited in the forms of parallelism that it can exploit. A global analysis of the algorithm, potentially including multiple loops as well as other noniterative constructs, can often be used to parallelize larger portions of an application such as an entire phase of an algorithm. Parallelizing larger and larger portions of an application in turn yields improved speedups and scalable performance.

This chapter focuses on the support provided in OpenMP for moving beyond loop-level parallelism. This support takes two forms. First, OpenMP provides a generalized parallel region construct to express parallel execution. Rather than being restricted to a loop (as with the parallel do construct discussed in the previous chapter), this construct is attached to an arbitrary body of code that is executed concurrently by multiple threads.


This form of replicated execution, with the body of code executing in a replicated fashion across multiple threads, is commonly referred to as "SPMD"-style parallelism, for "single-program multiple-data." Second, within such a parallel body of code, OpenMP provides several constructs that divide the execution of code portions across multiple threads. These constructs are referred to as work-sharing constructs and are used to partition work across the multiple threads. For instance, one work-sharing construct is used to distribute iterations of a loop across multiple threads within a parallel region. Another work-sharing construct is used to assign distinct code segments to different threads and is useful for exploiting unstructured, noniterative forms of parallelism. Taken together, the parallel region construct and the work-sharing constructs enable us to exploit more general SPMD-style parallelism within an application.

The rest of this chapter proceeds as follows. We first describe the form and usage of the parallel region construct, along with the clauses on the construct, in Section 4.2. We then describe the behavior of the parallel region construct, along with the corresponding runtime execution model, in Section 4.3. We then describe the data scoping issues that are specific to the parallel directive in Section 4.4. Next, we describe the various ways to express work-sharing within OpenMP in Section 4.5 and present the restrictions on work-sharing constructs in Section 4.6. We describe the notion of orphaned work-sharing constructs in Section 4.7 and address nested parallel constructs in Section 4.8. Finally, we describe the mechanisms to query and control the runtime execution parameters (such as the number of parallel threads) in Section 4.9.

4.2   Form and Usage of the parallel Directive

The parallel construct in OpenMP is quite simple: it consists of a parallel/end parallel directive pair that can be used to enclose an arbitrary block of code. This directive pair specifies that the enclosed block of code, referred to as a parallel region, be executed in parallel by multiple threads. The general form of the parallel directive in Fortran is

!$omp parallel [clause [,] [clause ...]]
      block
!$omp end parallel

In C and C++, the format is

#pragma omp parallel [clause [clause] ...]
    block

4.2.1   Clauses on the parallel Directive

The parallel directive may contain any of the following clauses:

    PRIVATE (list)
    SHARED (list)
    DEFAULT (PRIVATE | SHARED | NONE)
    REDUCTION ({op|intrinsic}:list)
    IF (logical expression)
    COPYIN (list)

The private, shared, default, reduction, and if clauses were discussed earlier in Chapter 3 and continue to provide exactly the same behavior for the parallel construct as they did for the parallel do construct. We briefly review these clauses here.

The private clause is typically used to identify variables that are used as scratch storage in the code segment within the parallel region. It provides a list of variables and specifies that each thread have a private copy of those variables for the duration of the parallel region.

The shared clause provides the exact opposite behavior: it specifies that the named variable be shared among all the threads, so that accesses from any thread reference the same shared instance of that variable in global memory. This clause is used in several situations. For instance, it is used to identify variables that are accessed in a read-only fashion by multiple threads, that is, only read and not modified. It may be used to identify a variable that is updated by multiple threads, but with each thread updating a distinct location within that variable (e.g., the saxpy example from Chapter 2). It may also be used to identify variables that are modified by multiple threads and used to communicate values between multiple threads during the parallel region (e.g., a shared error flag variable that may be used to denote a global error condition to all the threads).

The default clause is used to switch the default data-sharing attributes of variables: while variables are shared by default, this behavior may be switched to either private by default through the default(private) clause, or to unspecified through the default(none) clause. In the latter case, all variables referenced within the parallel region must be explicitly named in one of the above data-sharing clauses. Finally, the reduction clause supplies a reduction operator and a list of variables, and is used to identify variables used in reduction operations within the parallel region.
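As a small combined illustration (ours, not one of the book's examples), the fragment below uses several of these clauses on a parallel region: the array a is shared, scratch and the loop variables are private scratch storage, default(none) forces every variable to be scoped explicitly, and total is a reduction variable that accumulates each thread's partial sum of squares. It assumes a has been initialized elsewhere.

      real a(1000), total, scratch
      integer i, iam, nthreads
      external omp_get_thread_num, omp_get_num_threads
      integer omp_get_thread_num, omp_get_num_threads

      total = 0.0
!$omp parallel default(none) shared(a)
!$omp+ private(i, iam, nthreads, scratch)
!$omp+ reduction(+: total)
      nthreads = omp_get_num_threads()
      iam = omp_get_thread_num()
      ! each thread sums a strided subset of the elements
      do i = iam + 1, 1000, nthreads
         scratch = a(i) * a(i)
         total = total + scratch
      enddo
!$omp end parallel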

The if clause dynamically controls whether a parallel region construct executes in parallel or in serial, based on a runtime test. We will have a bit more to say about this clause in Section 4.9.1. Before we can discuss the copyin clause, we need to introduce the notion of threadprivate variables. This is the subject of Section 4.4.

4.2.2   Restrictions on the parallel Directive

The parallel construct consists of a parallel/end parallel directive pair that encloses a block of code. The section of code that is enclosed between the parallel and end parallel directives must be a structured block of code, that is, a block of code consisting of one or more statements that is entered at the top (at the start of the parallel region) and exited at the bottom (at the end of the parallel region). Thus, this block of code must have a single entry point and a single exit point, with no branches into or out of any statement within the block. While branches within the block of code are permitted, branches to or from the block from without are not permitted.

Example 4.1 is not valid because of the presence of the return statement within the parallel region. The return statement is a branch out of the parallel region and therefore is not allowed. Although it is not permitted to branch into or out of a parallel region, Fortran stop statements are allowed within the parallel region. Similarly, code within a parallel region in C/C++ may call the exit subroutine. If any thread encounters a stop statement, it will execute the stop statement and signal all the threads to stop. The other threads are signalled asynchronously, and no guarantees are made about the precise execution point where the other threads will be interrupted and the program stopped.

Example 4.1   Code that violates restrictions on parallel regions.

      subroutine sub(max)
      integer n

!$omp parallel
      call mypart(n)
      if (n .gt. max) return
!$omp end parallel

      return
      end
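One way to restructure this code so that no branch leaves the parallel region is sketched below; this rearrangement is ours rather than the book's. It assumes that mypart computes a per-thread value (so n is scoped private here) and that it is acceptable to record the error condition in a shared flag and return only after the region has ended; the critical section simply avoids a technically racy concurrent store to the flag.

      subroutine sub(max)
      integer max, n
      logical too_big
      too_big = .false.

!$omp parallel private(n)
      call mypart(n)
      if (n .gt. max) then
!$omp critical
         too_big = .true.
!$omp end critical
      endif
!$omp end parallel

      if (too_big) return
      ! ... rest of the subroutine ...
      return
      end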

4.3   Meaning of the parallel Directive

The parallel directive encloses a block of code, a parallel region, and creates a team of threads to execute a copy of this block of code in parallel. The threads in the team concurrently execute the code in the parallel region in a replicated fashion.

We illustrate this behavior with a simple example in Example 4.2. This code fragment contains a parallel region consisting of the single print statement shown. Upon execution, this code behaves as follows (see Figure 4.1). Recall that by default an OpenMP program executes sequentially on a single thread (the master thread), just like an ordinary serial program. When the program encounters a construct that specifies parallel execution, it creates a parallel team of threads (the slave threads), with each thread in the team executing a copy of the body of code enclosed within the parallel/end parallel directive. After each thread has finished executing its copy of the block of code, there is an implicit barrier while the program waits for all threads to finish, after which the master thread (the original sequential thread) continues execution past the end parallel directive.

Example 4.2   A simple parallel region.

      ...
!$omp parallel
      print *, 'Hello world'
!$omp end parallel
      ...


Figure 4.1   Runtime execution model for a parallel region.

Let us examine how the parallel region construct compares with the parallel do construct from the previous chapter. While the parallel do construct was associated with a loop, the parallel region construct can be associated with an arbitrary block of code. While the parallel do construct specified that multiple iterations of the do loop execute concurrently, the parallel region construct specifies that the block of code within the parallel region execute concurrently on multiple threads without any synchronization. Finally, in the parallel do construct, each thread executes a distinct iteration instance of the do loop; consequently, iterations of the do loop are divided among the team of threads. In contrast, the parallel region construct executes a replicated copy of the block of code in the parallel region on each thread.

We examine this final difference in more detail in Example 4.3. In this example, rather than containing a single print statement, we have a parallel region construct that contains a do loop of, say, 10 iterations. When this example is executed, a team of threads is created to execute a copy of the enclosed block of code. This enclosed block is a do loop with 10 iterations. Therefore, each thread executes 10 iterations of the do loop, printing the value of the loop index variable each time around. If we execute with a parallel team of four threads, a total of 40 print messages will appear in the output of the program (for simplicity we assume the print statements execute in an interleaved fashion). If the team has five threads, there will be 50 print messages, and so on.

Example 4.3   Replication of work with the parallel region directive.

!$omp parallel
      do i = 1, 10
         print *, 'Hello world', i
      enddo
!$omp end parallel

The parallel do construct, on the other hand, behaves quite differently. The construct in Example 4.4 executes a total of 10 iterations divided across the parallel team of threads. Regardless of the size of the parallel team (four threads, or more, or less), this program upon execution would produce a total of 10 print messages, with each thread in the team printing zero or more of the messages.

Example 4.4   Partitioning of work with the parallel do directive.

!$omp parallel do
      do i = 1, 10
         print *, 'Hello world', i
      enddo

These examples illustrate the difference between replicated execution (as exemplified by the parallel region construct) and work division across threads (as exemplified by the parallel do construct). With replicated execution (and sometimes with the parallel do construct also), it is often useful for the programmer to query and control the number of threads in a parallel team. OpenMP provides several mechanisms to control the size of parallel teams; these are described later in Section 4.9.

Finally, an individual parallel construct invokes a team of threads to execute the enclosed code concurrently. An OpenMP program may encounter multiple parallel constructs. In this case each parallel construct individually behaves as described earlier: it gathers a team of threads to execute the enclosed construct concurrently, resuming serial execution once the parallel construct has completed execution. This process is repeated upon encountering another parallel construct, as shown in Figure 4.2.

Figure 4.2   Multiple parallel regions. (The figure sketches a program in which the master thread executes the serial regions and is joined by a team of slave threads at each of two successive parallel/end parallel constructs, with serial execution resuming between and after them.)

4.3.1   Parallel Regions and SPMD-Style Parallelism

The parallel construct in OpenMP is a simple way of expressing parallel execution and provides replicated execution of the same code segment on multiple threads. It is most commonly used to exploit SPMD-style parallelism, where multiple threads execute the same code segments but on different data items. Subsequent sections in this chapter will describe different ways of distributing data items across threads, along with the specific constructs provided in OpenMP to ease this programming task.

4.4   threadprivate Variables and the copyin Clause

A parallel region encloses an arbitrary block of code, perhaps including calls to other subprograms such as another subroutine or function. We define the lexical or static extent of a parallel region as the code that is lexically within the parallel/end parallel directive. We define the dynamic extent of a parallel region to include not only the code that is directly between the parallel and end parallel directive (the static extent), but also to include all the code in subprograms that are invoked either directly or indirectly from within the parallel region. As a result the static extent is a subset of the statements in the dynamic extent of the parallel region. Figure 4.3 identifies both the lexical (i.e., static) and the dynamic extent of the parallel region in this code example. The statements in the dynamic extent also include the statements in the lexical extent, along with the statements in the called subprogram whoami.

      program main
!$omp parallel                      ! lexical (static) extent
      call whoami
!$omp end parallel
      end

      subroutine whoami             ! also part of the dynamic extent
      external omp_get_thread_num
      integer iam, omp_get_thread_num
      iam = omp_get_thread_num()
!$omp critical
      print *, "Hello from", iam
!$omp end critical
      return
      end

Figure 4.3   A parallel region with a call to a subroutine.

These definitions are important because the data scoping clauses described in Section 4.2.1 apply only to the lexical scope of a parallel region, and not to the entire dynamic extent of the region. For variables that are global in scope (such as common block variables in Fortran, or global variables in C/C++), references from within the lexical extent of a parallel region are affected by the data scoping clause (such as private) on the parallel directive. However, references to such global variables from the dynamic extent that are outside of the lexical extent are not affected by any of the data scoping clauses and always refer to the global shared instance of the variable.

Although at first glance this behavior may seem troublesome, the rationale behind it is not hard to understand. References within the lexical extent are easily associated with the data scoping clause since they are contained directly within the directive pair. However, this association is much less intuitive for references that are outside the lexical scope. Identifying the data scoping clause through a deeply nested call chain can be quite cumbersome and error-prone. Furthermore, the dynamic extent of a parallel region is not easily determined, especially in the presence of complex control flow and indirect function calls through function pointers (in C/C++). In general the dynamic extent of a parallel region is determined only at program runtime. As a result, extending the data scoping clauses to the full dynamic extent of a parallel region is extremely difficult and cumbersome to implement. Based on these considerations, OpenMP chose to avoid these complications by restricting data scoping clauses to the lexical scope of a parallel region.

Let us now look at an example to illustrate this issue further. We first present an incorrect piece of OpenMP code to illustrate the issue, and then present the corrected version.

Example 4.5   Data scoping clauses across lexical and dynamic extents.

      program wrong
      common /bounds/ istart, iend
      integer iarray(10000)
      N = 10000
!$omp parallel private(iam, nthreads, chunk)
!$omp+ private(istart, iend)
      ! Compute the subset of iterations
      ! executed by each thread
      nthreads = omp_get_num_threads()
      iam = omp_get_thread_num()
      chunk = (N + nthreads - 1)/nthreads
      istart = iam * chunk + 1
      iend = min((iam + 1) * chunk, N)
      call work(iarray)
!$omp end parallel
      end

      subroutine work(iarray)
      ! Subroutine to operate on a thread's
      ! portion of the array "iarray"
      common /bounds/ istart, iend
      integer iarray(10000)
      do i = istart, iend
         iarray(i) = i * i
      enddo
      return
      end

In Example 4.5 we want to do some work on an array. We start a parallel region and make runtime library calls to fetch two values: nthreads, the number of threads in the team, and iam, the thread ID within the team of each thread. We calculate the portions of the array worked upon by each thread based on the thread id as shown. istart is the starting array index and iend is the ending array index for each thread. Each thread needs its own values of iam, istart, and iend, and hence we make them private for the parallel region. The subroutine work uses the values of istart and iend to work on a different portion of the array on each thread. We use a common block named bounds containing istart and iend, essentially containing the values used in both the main program and the subroutine.

However, this example will not work as expected. We correctly made istart and iend private, since we want each thread to have its own values of the index range for that thread. However, the private clause applies only to the references made from within the lexical scope of the parallel region. References to istart and iend from within the work subroutine are not affected by the private clause, and directly access the shared instances from the common block. The values in the common block are undefined and lead to incorrect runtime behavior. Example 4.5 can be corrected by passing the values of istart and iend as parameters to the work subroutine, as shown in Example 4.6.

Example 4.6 Fixing data scoping through parameters. program correct common /bounds/ istart, iend integer iarray(10000) N = 10000 !$omp parallel private(iam, nthreads, chunk) !$omp+ private(istart, iend) ! Compute the subset of iterations ! executed by each thread 4.4 threadprivate Variables and the copyin Clause 103

      nthreads = omp_get_num_threads()
      iam = omp_get_thread_num()
      chunk = (N + nthreads - 1)/nthreads
      istart = iam * chunk + 1
      iend = min((iam + 1) * chunk, N)
      call work(iarray, istart, iend)
!$omp end parallel
      end

      subroutine work(iarray, istart, iend)
      ! Subroutine to operate on a thread's
      ! portion of the array "iarray"
      integer iarray(10000)
      do i = istart, iend
         iarray(i) = i * i
      enddo
      return
      end

By passing istart and iend as parameters, we have effectively changed all references to these otherwise "global" variables to refer instead to the private copies of those variables within the parallel region. This program now behaves in the desired fashion.

4.4.1 The threadprivate Directive

While the previous example was easily fixed by passing the variables through the argument list instead of through the common block, it is often cumbersome to do so in real applications where the common blocks appear in several program modules. OpenMP provides an easier alternative that does not require modification of argument lists, using the threadprivate directive. The threadprivate directive is used to identify a common block (or a global variable in C/C++) as being private to each thread. If a common block is marked as threadprivate using this directive, then a private copy of that entire common block is created for each thread. Furthermore, all references to variables within that common block anywhere in the entire program refer to the variable instance within the private copy of the common block in the executing thread. As a result, multiple references from within a thread, regardless of subprogram boundaries, always refer to the same private copy of that variable within that thread. Furthermore, threads cannot refer to the private instance of the common block belonging to another thread. As a result, this directive effectively behaves like a

private clause except that it applies to the entire program, not just the lexical scope of a parallel region. (For those familiar with Cray systems, this directive is similar to the taskcommon specification on those machines.) Let us look at how the threadprivate directive proves useful in our previous example. Example 4.7 contains a threadprivate declaration for the /bounds/ common block. As a result, each thread gets its own private copy of the entire common block, including the variables istart and iend. We make one further change to our original example: we no longer specify istart and iend in the private clause for the parallel region, since they are already private to each thread. In fact, supplying a private clause would be in error, since that would create a new private instance of these variables within the lexical scope of the parallel region, distinct from the threadprivate copy, and we would have had the same problem as in the first version of our example (Example 4.5). For this reason, the OpenMP specification does not allow threadprivate common block variables to appear in a private clause. With these changes, references to the variables istart and iend always refer to the private copy within that thread. Furthermore, references in both the main program as well as the work subroutine access the same threadprivate copy of the variable.

Example 4.7 Fixing data scoping using the threadprivate directive.

      program correct
      common /bounds/ istart, iend
!$omp threadprivate(/bounds/)
      integer iarray(10000)
      N = 10000
!$omp parallel private(iam, nthreads, chunk)
      ! Compute the subset of iterations
      ! executed by each thread
      nthreads = omp_get_num_threads()
      iam = omp_get_thread_num()
      chunk = (N + nthreads - 1)/nthreads
      istart = iam * chunk + 1
      iend = min((iam + 1) * chunk, N)
      call work(iarray)
!$omp end parallel
      end

      subroutine work(iarray)
      ! Subroutine to operate on a thread's
      ! portion of the array "iarray"
      common /bounds/ istart, iend

!$omp threadprivate(/bounds/)
      integer iarray(10000)

      do i = istart, iend
         iarray(i) = i * i
      enddo
      return
      end

Specification of the threadprivate Directive

The syntax of the threadprivate directive in Fortran is

!$omp threadprivate (/cb/[,/cb/]...)

where cb1, cb2, and so on are the names of common blocks to be made threadprivate, contained within slashes as shown. Blank (i.e., unnamed) common blocks cannot be made threadprivate. The corresponding syntax in C and C++ is

#pragma omp threadprivate (list)

where list is a list of named file scope or namespace scope variables. The threadprivate directive must be provided after the declaration of the common block (or file scope or global variable in C/C++) within a subprogram unit. Furthermore, if a common block is threadprivate, then the threadprivate directive must be supplied after every declaration of the common block. In other words, if a common block is threadprivate, then it must be declared as such in all subprograms that use that common block: it is not permissible to have a common block declared threadprivate in some subroutines and not threadprivate in other subroutines. Threadprivate common block variables must not appear in any other data scope clauses. Even the default(private) clause does not affect any threadprivate common block variables, which are always private to each thread. As a result, it is safe to use the default(private) clause even when threadprivate common block variables are being referenced in the parallel region. A threadprivate directive has the following effect on the program: When the program begins execution there is only a single thread executing serially, the master thread. The master thread has its own private copy of the threadprivate common blocks. When the program encounters a parallel region, a team of parallel threads is created. This team consists of the original master thread and some number of additional slave threads. Each slave thread has its own

copy of the threadprivate common blocks, while the master thread continues to access its private copy as well. Both the initial copy of the master thread, as well as the copies within each of the slave threads, are initialized in the same way as the master thread's copy of those variables would be initialized in a serial instance of that program. For instance, in Fortran, a threadprivate variable would be initialized only if the program contained block data statements providing initial values for the common blocks. In C and C++, threadprivate variables are initialized if the program provided initial values with the definition of those variables, while objects in C++ would be constructed using the same constructor as for the master's copy. Initialization of each copy, if any, is done before the first reference to that copy, typically when the private copy of the threadprivate data is first created: at program startup time for the master thread, and when the threads are first created for the slave threads. When the end of a parallel region is reached, the slave threads disappear, but they do not die. Rather, they park themselves on a queue waiting for the next parallel region. In addition, although the slave threads are dormant, they still retain their state, in particular their instances of the threadprivate common blocks. As a result, the contents of threadprivate data persist for each thread from one parallel region to another. When the next parallel region is reached and the slave threads are re-engaged, they can access their threadprivate data and find the values computed at the end of the previous parallel region. This persistence is guaranteed within OpenMP so long as the number of threads does not change. If the user modifies the requested number of parallel threads (say, through a call to a runtime library routine), then a new set of slave threads will be created, each with a freshly initialized set of threadprivate data. Finally, during the serial portions of the program, only the master thread executes, and it accesses its private copy of the threadprivate data.
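To show what the C/C++ binding of the directive looks like in practice, here is a minimal sketch of our own (the variable name counter is hypothetical, not taken from the text): a file scope variable is declared threadprivate, each thread stores into its own copy in one parallel region, and, provided the number of threads has not changed, reads that same value back in the next region.

#include <stdio.h>
#include <omp.h>

/* File scope variable; each thread gets its own persistent copy. */
int counter = 0;
#pragma omp threadprivate(counter)

int main(void)
{
    /* First parallel region: each thread writes its own copy. */
    #pragma omp parallel
    {
        counter = omp_get_thread_num();
    }

    /* Second parallel region: each thread still sees the value it
       stored above, as long as the number of threads is unchanged. */
    #pragma omp parallel
    {
        printf("thread %d: counter = %d\n",
               omp_get_thread_num(), counter);
    }
    return 0;
}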

4.4.2 The copyin Clause

Since each thread has its own private copy of threadprivate data for the duration of the program, there is no way for a thread to access another thread's copy of such threadprivate data. However, OpenMP provides a limited facility for slave threads to access the master thread's copy of threadprivate data, through the copyin clause. The copyin clause may be supplied along with a parallel directive. It can either provide a list of variables from within a threadprivate common block, or it can name an entire threadprivate common block. When a copyin clause is supplied with a parallel directive, the named threadprivate variables (or the entire threadprivate common block if so specified) within

the private copy of each slave thread are initialized with the corresponding values in the master's copy. This propagation of values from the master to each slave thread is done at the start of the parallel region; subsequent to this initialization, references to the threadprivate variables proceed as before, referencing the private copy within each thread. The copyin clause is helpful when the threadprivate variables are used for scratch storage within each thread but still need initial values that may either be computed by the master thread, or read from an input file into the master's copy. In such situations the copyin clause is an easy way to communicate these values from the master's copy to that of the slave threads. The syntax of the copyin clause is

copyin (list)

where the list is a comma-separated list of names, with each name being either a threadprivate common block name, an individual threadprivate common block variable, or a file scope or global threadprivate variable in C/C++. When listing the names of threadprivate common blocks, they should appear between slashes. We illustrate the copyin clause with a simple example. In Example 4.8 we have added another common block called cm with an array called data, and a variable N that holds the size of this data array being used as scratch storage. Although N would usually be a constant, in this example we are assuming that different threads use a different-sized subset of the data array. We therefore declare the cm common block as threadprivate. The master thread computes the value of N before the parallel region. Upon entering the parallel region, due to the copyin clause, each thread initializes its private copy of N with the value of N from the master thread.

Example 4.8 Using the copyin clause.

      common /bounds/ istart, iend
      common /cm/ N, data(1000)
!$omp threadprivate (/bounds/, /cm/)

      N = ...
!$omp parallel copyin(N)

      ! Each threadprivate copy of N is initialized
      ! with the value of N in the master thread.
      ! Subsequent modifications to N affect only
      ! the private copy in each thread
      ... = N
!$omp end parallel
      end
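For readers using the C/C++ binding, the following small sketch of our own (the variable name n and the value assigned to it are hypothetical) does the same thing: n is threadprivate at file scope, the master thread sets it before the parallel region, and copyin initializes every thread's copy from the master's value.

#include <stdio.h>
#include <omp.h>

/* Threadprivate scratch size, playing the role of N in Example 4.8. */
int n = 0;
#pragma omp threadprivate(n)

int main(void)
{
    n = 1000;          /* computed by the master thread */

    /* copyin copies the master's value of n into each thread's
       private copy at the start of the parallel region. */
    #pragma omp parallel copyin(n)
    {
        printf("thread %d starts with n = %d\n",
               omp_get_thread_num(), n);
        n = n / 2;     /* changes only this thread's copy */
    }
    return 0;
}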

4.5 Work-Sharing in Parallel Regions

The parallel construct in OpenMP is a simple way of expressing parallel execution and provides replicated execution of the same code segment on multiple threads. Along with replicated execution, it is often useful to divide work among multiple threads—either by having different threads operate on different portions of a shared data structure, or by having different threads perform entirely different tasks. We now describe several ways of accomplishing this in OpenMP. We present three different ways of accomplishing division of work across threads. The first example illustrates how to build a general parallel task queue that is serviced by multiple threads. The second example illustrates how, based on the id of each thread in a team, we can manually divide the work among the threads in the team. Together, these two examples are instances where the programmer manually divides work among a team of threads. Finally, we present some explicit OpenMP constructs to divide work among threads. Such constructs are termed work-sharing constructs.

4.5.1 A Parallel Task Queue

A parallel task queue is conceptually quite simple: it is a shared data structure that contains a list of work items or tasks to be processed. Tasks may range in size and complexity from one application to another. For instance, a task may be something very simple, such as processing an iteration (or a set of iterations) of a loop, and may be represented by just the loop index value. On the other hand, a complex task could consist of rendering a portion of a graphic image or scene on a display, and may be represented in a task list by a portion of an image and a rendering function. Regardless of their representation and complexity, however, tasks in a task queue typically share the following property: multiple tasks can be processed concurrently by multiple threads, with any necessary coordination expressed through explicit synchronization constructs. Furthermore, a given task may be processed by any thread from the team. Parallelism is easily exploited in such a task queue model. We create a team of parallel threads, with each thread in the team repeatedly fetching and executing tasks from this shared task queue. In Example 4.9 we have a function that returns the index of the next task, and another subroutine that processes a given task. In this example we chose a simple task queue that consists of just an index to identify the task—the function get_next_task returns the next index to be processed, while the subroutine

process_task takes an index and performs the computation associated with that index. Each thread repeatedly fetches and processes tasks, until all the tasks have been processed, at which point the parallel region completes and the master thread resumes serial execution.

Example 4.9 Implementing a task queue.

      ! Function to compute the next
      ! task index to be processed
      integer function get_next_task()
      common /mycom/ index
      integer index

!$omp critical
      ! Check if we are out of tasks
      if (index .eq. MAX) then
         get_next_task = -1
      else
         index = index + 1
         get_next_task = index
      endif
!$omp end critical
      return
      end

      program TaskQueue
      integer myindex, get_next_task

!$omp parallel private (myindex)
      myindex = get_next_task()
      do while (myindex .ne. -1)
         call process_task (myindex)
         myindex = get_next_task()
      enddo
!$omp end parallel
      end

Example 4.9 was deliberately kept simple. However, it does contain the basic ingredients of a task queue and can be generalized to more complex algorithms as needed.
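The same pattern carries over directly to the C/C++ binding. The sketch below is our own rendering of Example 4.9 in C (MAX_TASKS, next_index, and the body of process_task are hypothetical placeholders); a critical section again protects the shared queue index.

#include <stdio.h>
#define MAX_TASKS 100

static int next_index = 0;

/* Return the next task index, or -1 when the queue is exhausted.
   The critical section makes the fetch-and-increment atomic. */
int get_next_task(void)
{
    int task;
    #pragma omp critical
    {
        if (next_index == MAX_TASKS)
            task = -1;
        else
            task = next_index++;
    }
    return task;
}

void process_task(int index)
{
    printf("processing task %d\n", index);  /* stand-in for real work */
}

int main(void)
{
    #pragma omp parallel
    {
        int myindex = get_next_task();
        while (myindex != -1) {
            process_task(myindex);
            myindex = get_next_task();
        }
    }
    return 0;
}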

4.5.2 Dividing Work Based on Thread Number

A parallel region is executed by a team of threads, with the size of the team being specified by the programmer or else determined by the implementation based on default rules. From within a parallel region, the number of threads in the current parallel team can be determined by calling the OpenMP library routine

integer function omp_get_num_threads()

Threads in a parallel team are numbered from 0 to number_of_threads - 1. This number constitutes a unique thread identifier and can be determined by invoking the library routine

integer function omp_get_thread_num()

The omp_get_thread_num function returns an integer value that is the identifier for the invoking thread. This function returns a different value when invoked by different threads. The master thread has the thread ID 0, while the slave threads have an ID ranging from 1 to number_of_threads - 1. Since each thread can find out its thread number, we now have a way to divide work among threads. For instance, we can use the number of threads to divide up the work into as many pieces as there are threads. Furthermore, each thread queries for its thread number within the team and uses this thread number to determine its portion of the work.

Example 4.10 Using the thread number to divide work.

!$omp parallel private(iam)
      nthreads = omp_get_num_threads()
      iam = omp_get_thread_num()
      call work(iam, nthreads)
!$omp end parallel

Example 4.10 illustrates this basic concept. Each thread determines nthreads (the total number of threads in the team) and iam (its ID in this team of threads). Based on these two values, the subroutine work determines the portion of work assigned to thread iam and executes that portion. Each thread needs to have its own unique thread ID; therefore we declare iam to be private to each thread. We have seen this kind of manual work-sharing before, when dividing the iterations of a do loop among multiple threads.

Example 4.11 Dividing loop iterations among threads.

      program distribute_iterations
      integer istart, iend, chunk, nthreads, iam
      integer iarray(N)

!$omp parallel private(iam, nthreads, chunk)
!$omp+ private (istart, iend)

      ...
      ! Compute the subset of iterations
      ! executed by each thread
      nthreads = omp_get_num_threads()
      iam = omp_get_thread_num()
      chunk = (N + nthreads - 1)/nthreads
      istart = iam * chunk + 1
      iend = min((iam + 1) * chunk, N)
      do i = istart, iend
         iarray(i) = i * i
      enddo
!$omp end parallel
      end

In Example 4.11 we manually divide the iterations of a do loop among the threads in a team. Based on the total number of threads in the team, nthreads, and its own ID within that team, iam, each thread computes its portion of the iterations. This example performs a simple division of work—we try to divide the total number of iterations, N, equally among the threads, so that each thread gets "chunk" number of iterations. The first thread processes the first chunk number of iterations, the second thread the next chunk, and so on. Again, this simple example illustrates a specific form of work-sharing, dividing the iterations of a parallel loop. This simple scheme can be easily extended to include more complex situations, such as dividing the iterations in a more complex fashion across threads (one such variation is sketched below), or dividing the iterations of multiple loops rather than just the single loop as in this example. The next section introduces additional OpenMP constructs that substantially automate this task.
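As one illustration of a more complex division, the following sketch of our own (written in C, with a hypothetical array name and size) assigns iterations to threads in a cyclic, or round-robin, fashion instead of in contiguous chunks; this spreads the work more evenly when the cost per iteration varies.

#include <omp.h>
#define N 10000

int main(void)
{
    static int iarray[N];

    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int iam      = omp_get_thread_num();

        /* Cyclic division: thread iam handles iterations
           iam, iam + nthreads, iam + 2*nthreads, ... */
        for (int i = iam; i < N; i += nthreads)
            iarray[i] = i * i;
    }
    return 0;
}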

4.5.3 Work-Sharing Constructs in OpenMP

Example 4.11 presented the code to manually divide the iterations of a do loop among multiple threads. Although conceptually simple, it requires the programmer to code all the calculations for dividing iterations and rewrite the do loop from the original program. Compared with the parallel do construct from the previous chapter, this scheme is clearly primitive. The user could simply use a parallel do directive, leaving all the details of dividing and distributing iterations to the compiler/implementation; however, with a parallel region the user has to perform all these tasks manually. In an application with several parallel regions containing multiple do loops, this coding can be quite cumbersome.

This problem is addressed by the work-sharing directives in OpenMP. Rather than manually distributing work across threads (as in the previous examples), these directives allow the user to specify that portions of work should be divided across threads rather than executed in a replicated fashion. These directives relieve the programmer from coding the tedious details of work-sharing, as well as reduce the number of changes required in the original program. There are three flavors of work-sharing directives provided within OpenMP: the do directive for distributing iterations of a do loop, the sections directive for distributing execution of distinct pieces of code among different threads, and the single directive to identify code that needs to be executed by a single thread only. We discuss each of these constructs next.

The do Directive

The work-sharing directive corresponding to loops is called the do work-sharing directive. Let us look at the previous example, written using the do directive. Compare Example 4.12 to the original code in Example 4.11. We start a parallel region as before, but rather than explicitly writing code to divide the iterations of the loop and parceling them out to individual threads, we simply insert the do directive before the do loop. The do directive does all the tasks that we had explicitly coded before, relieving the programmer from all the tedious bookkeeping details.

Example 4.12 Using the do work-sharing directive.

      program omp_do
      integer iarray(N)

!$omp parallel
      ...
!$omp do
      do i = 1, N
         iarray(i) = i * i
      enddo
!$omp enddo
!$omp end parallel
      end

The do directive is strictly a work-sharing directive. It does not specify parallelism or create a team of parallel threads. Rather, within an existing team of parallel threads, it divides the iterations of a do loop across the parallel team. It is complementary to the parallel region construct. The parallel region directive spawns parallelism with replicated execution across a team of threads. In contrast, the do directive does not specify any parallelism, and rather than replicated execution it instead partitions the iteration space across multiple threads. This is further illustrated in Figure 4.4. The precise syntax of the do construct in Fortran is

!$omp do [clause [,] [clause ...]]
      do i = ...
         ...
      enddo
!$omp enddo [nowait]

In C and C++ it is

#pragma omp for [clause [clause] ...]
    for-loop

where clause is one of the private, firstprivate, lastprivate, or reduction scoping clauses, or one of the ordered or schedule clauses. Each of these clauses has exactly the same behavior as for the parallel do directive discussed in the previous chapter. By default, there is an implied barrier at the end of the do construct. If this synchronization is not necessary for correct execution, then the barrier may be avoided by the optional nowait clause on the enddo directive in Fortran, or with the for pragma in C and C++. As illustrated in Example 4.13, the parallel region construct can be combined with the do directive to execute the iterations of a do loop in parallel. These two directives may be combined into a single directive, the familiar parallel do directive introduced in the previous chapter.

Figure 4.4 Work-sharing versus replicated execution. (Its two panels contrast replicated execution in a parallel region with work-sharing of a do loop in a parallel region.)

Example 4.13 Combining parallel region and work-sharing do.

!$omp parallel do
      do i = 1, N
         a(i) = a(i) ** 2
      enddo
!$omp end parallel do

This is the directive that exploits just loop-level parallelism, introduced in Chapter 3. It is essentially a shortened syntax for starting a parallel region followed by the do work-sharing directive. It is simpler to use when we need to run a loop in parallel. For more complex SPMD-style codes that contain a combination of replicated execution as well as work-sharing loops, we need to use the more powerful parallel region construct combined with the work-sharing do directive. The do directive (and the other work-sharing constructs discussed in subsequent sections) enables us to easily exploit SPMD-style parallelism using OpenMP. With these directives, work-sharing is easily expressed through a simple directive, leaving the bookkeeping details to the underlying implementation. Furthermore, the changes required to the original source code are minimal.
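To make the implied barrier and the nowait clause described above concrete in the C/C++ binding, here is a small sketch of our own (the array names and size are hypothetical). The first loop carries a nowait because the second loop writes a different array and reads nothing produced by the first, so threads that finish early may proceed without waiting.

#include <omp.h>
#define N 1000

int main(void)
{
    static double a[N], b[N];

    #pragma omp parallel
    {
        /* nowait: no barrier after this loop, because the second loop
           writes a different array and reads nothing produced here. */
        #pragma omp for nowait
        for (int i = 0; i < N; i++)
            a[i] = 0.5 * i;

        #pragma omp for
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * i;
    }   /* the parallel region still ends with an implied barrier */
    return 0;
}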

Noniterative Work-Sharing: Parallel Sections

Thus far when discussing how to parallelize applications, we have been concerned primarily with splitting up the work of one task at a time among several threads. However, if the serial version of an application performs a sequence of tasks in which none of the later tasks depends on the results of the earlier ones, it may be more beneficial to assign different tasks to different threads. This is especially true in cases where it is difficult or impossible to speed up the individual tasks by executing them in parallel, either because the amount of work is too small or because the task is inherently serial. To handle such cases, OpenMP provides the sections work-sharing construct, which allows us to perform the entire sequence of tasks in parallel, assigning each task to a different thread. The code for the entire sequence of tasks, or sections, begins with a sections directive and ends with an end sections directive. The beginning of each section is marked by a section directive, which is optional for the very first section. Another way to view it is that each section is separated from the one that follows by a section directive. The precise syntax of the sections construct in Fortran is

!$omp sections [clause [,] [clause ...]]
[!$omp section]

      code for the first section
[!$omp section
      code for the second section
      ...]
!$omp end sections [nowait]

In C and C++ it is

#pragma omp sections [clause [clause] ...]
{
    [#pragma omp section]
        block
    [#pragma omp section
        block
    ...
    ]
}

Each clause must be a private, firstprivate, lastprivate, or reduction scoping clause (C and C++ may also include the nowait clause on the pragma). The meaning of private and firstprivate is the same as for a do work-sharing construct. However, because a single thread may execute several sections, the value of a firstprivate variable can differ from that of the corresponding shared variable at the start of a section. On the other hand, if a variable x is made lastprivate within a sections construct, then the thread executing the section that appears last in the source code writes the value of its private x back to the corresponding shared copy of x after it has finished that section. Finally, if a variable x appears in a reduction clause, then after each thread finishes all sections assigned to it, it combines its private copy of x into the corresponding shared copy of x using the operator specified in the reduction clause. The Fortran end sections directive must be supplied, since it marks the end of the sequence of sections. Like the do construct, there is an implied barrier at the end of the sections construct, which may be avoided by adding the nowait clause; this clause may be added to the end sections directive in Fortran, while in C and C++ it is provided directly with the omp sections pragma. This construct distributes the execution of the different sections among the threads in the parallel team. Each section is executed once, and each thread executes zero or more sections. A thread may execute more than one section if, for example, there are more sections than threads, or if a

thread finishes one section before other threads reach the sections construct. It is generally not possible to determine whether one section will be executed before another (regardless of which came first in the program's source), or whether two sections will be executed by the same thread. This is because unlike the do construct, OpenMP provides no way to control how the different sections are scheduled for execution by the available threads. As a result, the output of one section generally should not serve as the input to another: instead, the section that generates output should be moved before the sections construct. Similar to the combined parallel do construct, there is also a combined form of the sections construct that begins with the parallel sections directive and ends with the end parallel sections directive. The combined form accepts all the clauses that can appear on a parallel or sections construct. Let us now examine an example using the sections directive. Consider a simulation program that performs several independent preprocessing steps after reading its input data but before performing the simulation. These preprocessing steps are

1. Interpolation of input data from irregularly spaced sensors into a regular grid required for the simulation step
2. Gathering of various statistics about the input data
3. Generation of random parameters for Monte Carlo experiments performed as part of the simulation

In this example we focus on parallelizing the preprocessing steps. Although the work within each is too small to benefit much from parallelism within a step, we can exploit parallelism across the multiple steps. Using the sections construct, we can execute all the steps concurrently as distinct sections. This code is presented in Example 4.14.

Example 4.14 Using the sections directive.

      real sensor_data(3, nsensors), grid(N, N)
      real stats(nstats), params(nparams)
      ...
!$omp parallel sections
      call interpolate(sensor_data, nsensors, &
                       grid, N, N)
!$omp section
      call compute_stats(sensor_data, nsensors, &
                         stats, nstats)
!$omp section
      call gen_random_params(params, nparams)
!$omp end parallel sections
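The same structure in the C/C++ binding looks like the sketch below, which is our own; the three routines are stand-ins for the hypothetical preprocessing steps of Example 4.14, with their argument lists omitted for brevity.

#include <stdio.h>

static void interpolate(void)       { printf("interpolating sensor data\n"); }
static void compute_stats(void)     { printf("gathering statistics\n"); }
static void gen_random_params(void) { printf("generating Monte Carlo parameters\n"); }

int main(void)
{
    /* Each section is executed once, by some thread of the team. */
    #pragma omp parallel sections
    {
        #pragma omp section
        interpolate();

        #pragma omp section
        compute_stats();

        #pragma omp section
        gen_random_params();
    }
    return 0;
}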

Assigning Work to a Single Thread

The do and sections work-sharing constructs accelerate a computation by splitting it into pieces and apportioning the pieces among a team's threads. Often a parallel region contains tasks that should not be replicated or shared among threads, but instead must be performed just once, by any one of the threads in the team. OpenMP provides the single construct to identify these kinds of tasks that must be executed by just one thread. The general form of the single construct in Fortran is

!$omp single [clause [,] [clause ...]]
      block of statements to be executed by just one thread
!$omp end single [nowait]

In C and C++ it is

#pragma omp single [clause [clause] ...]
    block

Each clause must be a private or firstprivate scoping clause (in C and C++ it may also be the nowait clause). The meaning of these clauses is the same as for a parallel, do, or sections construct, although only one private copy of each privatized variable needs to be created since only one thread executes the enclosed code. Furthermore, in C/C++ the nowait clause, if desired, is provided in the list of clauses supplied with the omp single pragma itself. In Fortran the end single directive must be supplied since it marks the end of the single-threaded piece of code. Like all work-sharing constructs, there is an implicit barrier at the end of a single unless the end single directive includes the nowait clause (in C/C++ the nowait clause is supplied directly with the single pragma). There is no implicit barrier at the start of the single construct—if one is needed, it must be provided explicitly in the program. Finally, there is no combined form of the directive because it makes little sense to define a parallel region that must be executed by only one thread. Example 4.15 illustrates the single directive. A common use of single is when performing input or output within a parallel region that cannot be successfully parallelized and must be executed sequentially. This is often the case when the input/output operations must be performed in the same strict order as in the serial program. In this situation, although any thread can perform the desired I/O operation, it must be executed by just one thread. In this example we first read some data, then all threads perform

some computation on this data in parallel, after which the intermediate results are printed out to a file. The I/O operations are enclosed by the single directive, so that one of the threads that has finished the computation performs the I/O operation. The other threads skip around the single construct and move on to the code after the single directive.

Example 4.15 Using the single directive.

      integer len
      real in(MAXLEN), out(MAXLEN), scratch(MAXLEN)
      ...
!$omp parallel shared (in, out, len)
      ...
!$omp single
      call read_array(in, len)
!$omp end single
!$omp do private(scratch)
      do j = 1, len
         call compute_result(out(j), in, len, scratch)
      enddo
!$omp single
      call write_array(out, len)
!$omp end single nowait
!$omp end parallel

At the beginning of the parallel region a single thread reads the shared input array in. The particular thread that performs the single section is not specified: an implementation may choose any heuristic, such as the first thread to reach the construct or always select the master thread. Therefore the correctness of the code must not depend on the choice of the particular thread. The remaining threads wait for the single construct to finish and the data to be read in at the implicit barrier at the end single directive, and then continue execution. After the array has been read, all the threads compute the elements of the output array out in parallel, using a work-sharing do. Finally, one thread writes the output to a file. Now the threads do not need to wait for output to complete, so we use the nowait clause to avoid synchronizing after writing the output. The single construct is different from other work-sharing constructs in that it does not really divide work among threads, but rather assigns all the work to a single thread. However, we still classify it as a work-sharing construct for several reasons. Each piece of work within a single construct is performed by exactly one thread, rather than performed by all threads

as is the case with replicated execution. In addition, the single construct shares the other characteristics of work-sharing constructs as well: it must be reached by all the threads in a team and each thread must reach all work-sharing constructs (including single) in the same order. Finally, the single construct also shares the implicit barrier and the nowait clause with the other work-sharing constructs.
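For reference, a C/C++ sketch of the same read, compute, and write pattern might look like the following (the routine names, array length, and the trivial bodies given here are hypothetical placeholders, not the book's code).

#include <stdio.h>
#define LEN 1000

/* Hypothetical stand-ins for the I/O and compute routines of Example 4.15. */
static double in[LEN], out[LEN];

static void read_array(double *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = i;
}

static void write_array(const double *a, int n)
{
    printf("first element: %g (of %d)\n", a[0], n);
}

int main(void)
{
    #pragma omp parallel
    {
        /* One thread reads the input; the implied barrier at the end of
           the single ensures the data is ready before any thread uses it. */
        #pragma omp single
        read_array(in, LEN);

        #pragma omp for
        for (int j = 0; j < LEN; j++)
            out[j] = 2.0 * in[j];

        /* One thread writes the result; nowait lets the others move on. */
        #pragma omp single nowait
        write_array(out, LEN);
    }
    return 0;
}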

4.6 Restrictions on Work-Sharing Constructs

There are a few restrictions on the form and use of work-sharing constructs that we have glossed over up to this point. These restrictions involve the syntax of work-sharing constructs, how threads may enter and exit them, and how they may nest within each other.

4.6.1 Block Structure

In the syntax of Fortran executable statements, there is a notion of a block, which consists of zero or more complete consecutive statements, each at the same level of nesting. Each of these statements is an assignment, a call, or a control construct such as if or do that contains one or more blocks at a nesting level one deeper. The directives that begin and end an OpenMP work-sharing construct must be placed so that all the executable statements between them form a valid Fortran block. All the work-sharing examples presented so far follow this rule. For instance, when writing a do construct without an enddo directive, it is still easy to follow this rule because the do loop is a single statement and therefore is also a block. Code that violates this restriction is shown in Example 4.16. The single construct includes only part of the if statement, with the result that statement 10 is from a shallower level of nesting than statement 20. Assuming that b has shared scope, we can correct this problem by moving the end single right after the end if.

Example 4.16 Code that violates the block structure requirement.

!$omp single
10    x = 1
      if (z .eq. 3) then
20       a(1) = 4
!$omp end single
         b(1) = 6
      end if

An additional restriction on the block of code within a construct is that it is not permissible to branch into the block from outside the construct, and it is not permissible to branch out of the construct from within the block of code. Therefore no thread may enter or leave the block of statements that make up a work-sharing construct using a control flow construct such as exit, goto, or return. Each thread must instead enter the work-sharing construct "at the top" and leave "out the bottom." However, a goto within a construct that transfers control to another statement also within the construct is permitted, since it does not leave the block of code.

4.6.2 Entry and Exit

Because work-sharing constructs divide work among all the threads in a team, it is an OpenMP requirement that all threads participate in each work-sharing construct that is executed (lazy threads are not allowed to shirk their fair share of work). There are three implications of this rule. First, if any thread reaches a work-sharing construct, then all the threads in the team must also reach that construct. Second, whenever a parallel region executes multiple work-sharing constructs, all the threads must reach all the executed work-sharing constructs in the same order. Third, although a region may contain a work-sharing construct, it does not have to be executed, so long as it is skipped by all the threads. We illustrate these restrictions through some examples. For instance, the code in Example 4.17 is invalid, since thread 0 will not encounter the do directive. All threads need to encounter work-sharing constructs.

Example 4.17 Illustrating the restrictions on work-sharing directives.

      ...
!$omp parallel private(iam)
      iam = omp_get_thread_num()
      if (iam .ne. 0) then
!$omp do
         do i = 1, n
            ...
         enddo
!$omp enddo
      endif
!$omp end parallel

In Example 4.17, we had a case where one of the threads did not encounter the work-sharing directive. It is not enough for all threads to encounter a work-sharing construct either. Threads must encounter the same

work-sharing construct. In Example 4.18 all threads encounter a work-sharing construct, but odd-numbered threads encounter a different work-sharing construct than the even-numbered ones. As a result, the code is invalid. It's acceptable for all threads to skip a work-sharing construct though.

Example 4.18 All threads must encounter the same work-sharing constructs.

      ...
!$omp parallel private(iam)
      iam = omp_get_thread_num()
      if (mod(iam, 2) .eq. 0) then
!$omp do
         do j = 1, n
            ...
         enddo
      else
!$omp do
         do j = 1, n
            ...
         enddo
      end if
!$omp end parallel

In Example 4.19 the return statement from the work-shared do loop causes an invalid branch out of the block.

Example 4.19 Branching out from a work-sharing construct.

      subroutine test(n, a)
      real a(n)
!$omp do
      do i = 1, n
         if (a(i) .lt. 0) return
         a(i) = sqrt(a(i))
      enddo
!$omp enddo
      return
      end

Although it is not permitted to branch into or out of a block that is associated with a work-sharing directive, it is possible to branch within the block. In Example 4.20 the goto statement is legal since it does not cause a branch out of the block associated with the do directive. It is not a good idea to use goto statements as in our example. We use it here only to illustrate the branching rules.

Example 4.20 Branching within a work-sharing directive.

      subroutine test(n, a)
      real a(n)

!$omp do
      do i = 1, n
         if (a(i) .lt. 0) goto 10
         a(i) = sqrt(a(i))
         goto 20
10       a(i) = 0
20       continue
      enddo
      return
      end

4.6.3 Nesting of Work-Sharing Constructs

OpenMP does not allow a work-sharing construct to be nested; that is, if a thread, while in the midst of executing a work-sharing construct, encounters another work-sharing construct, then the program behavior is undefined. We illustrate this in Example 4.21. This example violates the nesting requirement since the outermost do directive contains an inner do directive.

Example 4.21 Program with illegal nesting of work-sharing constructs.

!$omp parallel
!$omp do
      do i = 1, M
         ! The following directive is illegal
!$omp do
         do j = 1, N
            ...
         enddo
      enddo
!$omp end parallel

The rationale behind this restriction is that a work-sharing construct divides a piece of work among a team of parallel threads. However, once a thread is executing within a work-sharing construct, it is the only thread executing that code (e.g., it may be executing one section of a sections construct); there is no team of threads executing that specific piece of code anymore, so it is nonsensical to attempt to further divide a portion of work using a work-sharing construct. Nesting of work-sharing constructs is therefore illegal in OpenMP.

It is possible to parallelize a loop nest such as this one so that iterations of both the i and j loops are executed in parallel. The trick is to add a third, outermost parallel loop that iterates over all the threads (a static schedule will ensure that each thread executes precisely one iteration of this loop). Within the body of the outermost loop, we manually divide the iterations of the i and j loops such that each thread executes a different subset of the i and j iterations (a sketch of this approach appears at the end of this section). Although it is a synchronization rather than work-sharing construct, a barrier also requires the participation of all the threads in a team. It is therefore subject to the following rules: either all threads or no thread must reach the barrier; all threads must arrive at multiple barrier constructs in the same order; and a barrier cannot be nested within a work-sharing construct. Based on these rules, a do directive cannot contain a barrier directive.
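The following is a sketch of our own, in C, of the flattening trick just described (the matrix name and dimensions are hypothetical): an outer loop over thread numbers is work-shared with a static schedule, and each value of k handles a manually computed slice of the combined i-j iteration space.

#include <omp.h>
#define M 64
#define N 64

static double grid[M][N];

int main(void)
{
    int k;

    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();

        /* One iteration of the k loop per thread (static schedule with
           nthreads iterations and nthreads threads). */
        #pragma omp for schedule(static)
        for (k = 0; k < nthreads; k++) {
            long total = (long)M * N;
            long chunk = (total + nthreads - 1) / nthreads;
            long start = (long)k * chunk;
            long end   = start + chunk > total ? total : start + chunk;

            /* Walk this thread's slice of the flattened i-j space. */
            for (long idx = start; idx < end; idx++) {
                int i = (int)(idx / N);
                int j = (int)(idx % N);
                grid[i][j] = i + j;
            }
        }
    }
    return 0;
}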

4.7 Orphaning of Work-Sharing Constructs

All the examples that we have presented so far contain the work-sharing constructs lexically enclosed within the parallel region construct. However, it is easy to imagine situations where this might be rather restrictive, and we may wish to exploit work-sharing within a subroutine called from inside a parallel region.

Example 4.22 Work-sharing outside the lexical scope.

      subroutine work
      integer a(N)

!$omp parallel
      call initialize(a, N)
      ...
!$omp end parallel
      ...
      end

      subroutine initialize (a, N)
      integer i, N, a(N)

      ! Iterations of the following do loop may be
      ! executed in parallel
      do i = 1, N
         a(i) = 0
      enddo
      end

In Example 4.22 the work subroutine contains a parallel region to do some computation in parallel: it first initializes the elements of array a and then performs the real computation. In this instance the initialization happens to be performed within a separate subroutine, initialize. Although the do loop that initializes the array is trivially parallelizable, it is contained outside the lexical scope of the parallel region. Furthermore, it is possible that initialize may be called from within the parallel region (as in subroutine work) as well as from serial code in other portions of the program. OpenMP does not restrict work-sharing directives to be within the lexical scope of the parallel region; they can occur within a subroutine that is invoked, either directly or indirectly, from inside a parallel region. Such work-sharing constructs are referred to as orphaned, so named because they are no longer enclosed within the lexical scope of the parallel region. When an orphaned work-sharing construct is encountered from within a parallel region, its behavior is identical (almost) to that of a similar work-sharing construct directly enclosed within the parallel region. The differences in behavior are small and relate to the scoping of variables, and are discussed later in this section. However, the basic behavior in terms of dividing up the enclosed work among the parallel team of threads is the same as that of directives lexically within the parallel region. We illustrate this by rewriting Example 4.22 to use an orphaned work-sharing construct, as shown in Example 4.23. The only change is the do directive attached to the loop in the initialize subroutine. With this change the parallel construct creates a team of parallel threads. Each thread invokes the initialize subroutine, encounters the do directive, and computes a portion of the iterations from the do i loop. At the end of the do directive, the threads gather at the implicit barrier, and then return to replicated execution within the work subroutine. The do directive therefore successfully divides the do loop iterations across the threads.

Example 4.23 Work-sharing outside the lexical scope.

      subroutine work
      integer a(N)

!$omp parallel
      call initialize(a, N)
      ...
!$omp end parallel
      end

      subroutine initialize (a, N)
      integer i, N, a(N)

      ! Iterations of this do loop are
      ! now executed in parallel
!$omp do
      do i = 1, N
         a(i) = 0
      enddo
      end

Let us now consider the scenario where the initialize subroutine is invoked from a serial portion of the program, leaving the do directive exposed without an enclosing parallel region. In this situation OpenMP specifies that the single serial thread behave like a parallel team of threads that consists of only one thread. As a result of this rule, the work-sharing construct assigns all its portions of work to this single thread. In this instance all the iterations of the do loop are assigned to the single serial thread, which executes the do loop in its entirety before continuing. The behavior of the code is almost as if the directive did not exist—the differences in behavior are small and relate to the data scoping of variables, described in the next section. As a result of this rule, a subroutine containing orphaned work-sharing directives can safely be invoked from serial code, with the directive being essentially ignored. To summarize, an orphaned work-sharing construct encountered from within a parallel region behaves almost as if it had appeared within the lexical extent of the parallel construct. An orphaned work-sharing construct encountered from within the serial portion of the program behaves almost as if the work-sharing directive had not been there at all.

4.7.1 Data Scoping of Orphaned Constructs

Orphaned and nonorphaned work-sharing constructs differ in the way variables are scoped within them. Let us examine their behavior for each variable class. Variables in a common block (global variables in C/C++) are shared by default in an orphaned work-sharing construct, regardless of the scoping clauses in the enclosing parallel region. Automatic variables in the subroutine containing the orphaned work-sharing construct are always private, since each thread executes within its own stack. Automatic variables in the routine containing the parallel region follow the usual scoping rules for a parallel region—that is, shared by default unless specified otherwise in a data scoping clause. Formal parameters to the subroutine containing the orphaned construct have their sharing behavior determined by that of the corresponding actual variables in the calling routine's context. Data scoping for orphaned and non-orphaned constructs is similar in other regards. For instance, the do loop index variable is private by default

for either case. Furthermore, both kinds of work-sharing constructs disallow the shared clause. As a result, a variable that is private in the enclosing context (based on any of the other scoping rules) can no longer be made shared across threads for the work-sharing construct. Finally, both kinds of constructs support the private clause, so that any variables that are shared in the surrounding context can be made private for the scope of the work-sharing construct.

4.7.2 Writing Code with Orphaned Work-Sharing Constructs

Before we leave orphaned work-sharing constructs, it bears repeating that care must be exercised in using orphaned constructs. OpenMP tries to provide reasonable behavior for orphaned OpenMP constructs regardless of whether the code is invoked from within a serial or parallel region. However, if a subroutine contains an orphaned work-sharing construct, then this property cannot be considered encapsulated within that subroutine. Rather, it must be treated as part of the interface to that subroutine and exposed to the callers of the subroutine. While subroutines containing orphaned work-sharing constructs behave as expected when invoked from a serial code, they can cause nasty surprises if they are accidentally invoked from within a parallel region. Rather than executing the code within the work-sharing construct in a replicated fashion, this code ends up being divided among multiple threads. Callers of routines with orphaned constructs must therefore be aware of the orphaned constructs in those routines.

4.8 Nested Parallel Regions

We have discussed at length the behavior of work-sharing constructs contained within a parallel region. However, by now you probably want to know what happens in an OpenMP program with nested parallelism, where a parallel region is contained within another parallel region. Parallel regions and nesting are fully orthogonal concepts in OpenMP. The OpenMP programming model allows a program to contain parallel regions nested within other parallel regions (keep in mind that the parallel do and the parallel sections constructs are shorthand notations for a parallel region containing either the do or the sections construct). The basic semantics of a parallel region is that it creates a team of threads to execute the block of code contained within the parallel region construct, returning to serial execution at the end of the parallel construct. This behavior is followed regardless of whether the parallel region is encountered from within serial code or from within an outer level of parallelism.

Example 4.24 illustrates nested parallelism. This example consists of a subroutine taskqueue that contains a parallel region implementing task-queue-based parallelism, similar to that in Example 4.9. However, in this example we provide the routine to process a task (called process_task). The task index passed to this subroutine is a column number in a two-dimensional shared matrix called grid. Processing a task for the interior (i.e., nonboundary) columns involves doing some computation on each element of the given column of this matrix, as shown by the do loop within the process_task subroutine, while the boundary columns need no processing. The do loop to process the interior columns is a parallel loop with multiple iterations updating distinct rows of the myindex column of the matrix. We can therefore express this additional level of parallelism by providing the parallel do directive on the do loop within the process_task subroutine.

Example 4.24 A program with nested parallelism.

      subroutine TaskQueue
      integer myindex, get_next_task

!$omp parallel private (myindex)
      myindex = get_next_task()
      do while (myindex .ne. -1)
         call process_task (myindex)
         myindex = get_next_task()
      enddo
!$omp end parallel
      end

      subroutine process_task (myindex)
      integer myindex
      common /MYCOM/ grid(N, M)

      if (myindex .gt. 1 .AND. myindex .lt. M) then
!$omp parallel do
         do i = 1, N
            grid(i, myindex) = ...
         enddo
      endif
      return
      end

When this program is executed, it will create a team of threads in the taskqueue subroutine, with each thread repeatedly fetching and processing

tasks. During the course of processing a task, a thread may encounter the parallel do construct (if it is processing an interior column). At this point this thread will create an additional, brand-new team of threads, of which it will be the master, to execute the iterations of the do loop. The execution of this do loop will proceed in parallel with this newly created team, just like any other parallel region. After the parallel do loop is over, this new team will gather at the implicit barrier, and the original thread will return to executing its portion of the code. The slave threads of the now defunct team will become dormant. The nested parallel region therefore simply provides an additional level of parallelism and semantically behaves just like a nonnested parallel region. It is sometimes tempting to confuse work-sharing constructs with the parallel region construct, so the distinctions between them bear repeating. A parallel construct (including each of the parallel, parallel do, and parallel sections directives) is a complete, encapsulated construct that attempts to speed up a portion of code through parallel execution. Because it is a self-contained construct, there are no restrictions on where and how often a parallel construct may be encountered. Work-sharing constructs, on the other hand, are not self-contained but instead depend on the surrounding context. They work in tandem with an enclosing parallel region (invocation from serial code is like being invoked from a parallel region but with a single thread). We refer to this as a binding of a work-sharing construct to an enclosing parallel region. This binding may be either lexical or dynamic, as is the case with orphaned work-sharing constructs. Furthermore, in the presence of nested parallel constructs, this binding of a work-sharing construct is to the closest enclosing parallel region. To summarize, the behavior of a work-sharing construct depends on the surrounding context; therefore there are restrictions on the usage of work-sharing constructs—for example, all (or none) of the threads must encounter each work-sharing construct. A parallel construct, on the other hand, is fully self-contained and can be used without any such restrictions. For instance, as we show in Example 4.24, only the threads that process an interior column encounter the nested parallel do construct. Let us now consider a parallel region that happens to execute serially, say, due to an if clause on the parallel region construct. There is absolutely no effect on the semantics of the parallel construct, and it executes exactly as if in parallel, except on a team consisting of only a single thread rather than multiple threads. We refer to such a region as a serialized parallel region. There is no change in the behavior of enclosed work-sharing constructs—they continue to bind to the serialized parallel region as before. With regard to synchronization constructs, the barrier construct also binds

to the closest dynamically enclosing parallel region and has no effect if invoked from within a serialized parallel region. Synchronization constructs such as critical and atomic (presented in Chapter 5), on the other hand, synchronize relative to all other threads, not just those in the current team. As a result, these directives continue to function even when invoked from within a serialized parallel region. Overall, therefore, the only perceptible difference due to a serialized parallel region is in the performance of the construct. Unfortunately there is little reported practical experience with nested parallelism. There is only a limited understanding of the performance and implementation issues with supporting multiple levels of parallelism, and even less experience with the needs of application programs and their implications for programming models. For now, nested parallelism continues to be an area of active research. Because many of these issues are not well understood, by default OpenMP implementations support nested parallel constructs but serialize the implementation of nested levels of parallelism. As a result, the program behaves correctly but does not benefit from additional degrees of parallelism. You may change this default behavior by using either the runtime library routine

call omp_set_nested (.TRUE.)

or the environment variable

setenv OMP_NESTED TRUE

to enable nested parallelism; you may use the value false instead of true to disable nested parallelism. In addition, OpenMP also provides a routine to query whether nested parallelism is enabled or disabled:

logical function omp_get_nested()

As of the date of this writing, however, all OpenMP implementations only support one level of parallelism and serialize the implementation of further nested levels. We expect this to change over time with additional experience.
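As a hedged illustration in the C/C++ binding, the sketch below (our own; the printed messages are placeholders) requests nested parallelism, checks whether the request took effect, and then starts a parallel region inside another one. On implementations that serialize nesting, the inner region simply runs on a team of one thread.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Ask for nested parallelism; an implementation may still choose
       to serialize inner parallel regions. */
    omp_set_nested(1);

    if (!omp_get_nested())
        printf("nested parallelism is disabled or unsupported\n");

    #pragma omp parallel
    {
        int outer_id = omp_get_thread_num();

        /* If nesting is supported and enabled, each outer thread becomes
           the master of a new inner team; otherwise the inner region
           executes on a team of a single thread. */
        #pragma omp parallel
        {
            printf("outer thread %d, inner thread %d\n",
                   outer_id, omp_get_thread_num());
        }
    }
    return 0;
}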

4.8.1 Directive Nesting and Binding

Having described work-sharing constructs as well as nested parallel regions, we now summarize the OpenMP rules with regard to the nesting and binding of directives.

All the work-sharing constructs (each of the do, sections, and single directives) bind to the closest enclosing parallel directive. In addition, the synchronization constructs barrier and master (see Chapter 5) also bind to the closest enclosing parallel directive. As a result, if the enclosing parallel region is serialized, these directives behave as if executing in parallel with a team of a single thread. If there is no enclosing parallel region currently being executed, then each of these directives has no effect. Other synchronization constructs such as critical and atomic (see Chapter 5) have a global effect across all threads in all teams, and execute regardless of the enclosing parallel region. Work-sharing constructs are not allowed to contain other work-sharing constructs. In addition, they are not allowed to contain the barrier synchronization construct, either, since the latter makes sense only in a parallel region. The synchronization constructs critical, master, and ordered (see Chapter 5) are not allowed to contain any work-sharing constructs, since the latter require that either all or none of the threads arrive at each instance of the construct. Finally, a parallel directive inside another parallel directive logically establishes a new nested parallel team of threads, although current implementations of OpenMP are physically limited to a team size of a single thread.

4.9 Controlling Parallelism in an OpenMP Program We have thus far focused on specifying parallelism in an OpenMP parallel program. In this section we describe the mechanisms provided in OpenMP for controlling parallel execution during program runtime. We first describe how parallel execution may be controlled at the granularity of an individual parallel construct. Next we describe the OpenMP mechanisms to query and control the degree of parallelism exploited by the program. Finally, we describe the dynamic threads mechanism, which adjusts the degree of parallelism based on the available resources, helping to extract the maximum throughput from a system.

4.9.1 Dynamically Disabling the parallel Directives As we discussed in Section 3.6.1, the choice of whether to execute a piece of code in parallel or serially is often determined by runtime factors such as the amount of work in the parallel region (based on the input data

set size, for instance) or whether we chose to go parallel in some other portion of code or not. Rather than requiring the user to create multiple versions of the same code, with one containing parallel directives and the other remaining unchanged, OpenMP instead allows the programmer to supply an optional if clause containing a general logical expression with the parallel directive. When the program encounters the parallel region at runtime, it first evaluates the logical expression. If it yields the value true, then the corresponding parallel region is executed in parallel; otherwise it is executed serially on a team of one thread only. In addition to the if clause, OpenMP provides a runtime library routine to query whether the program is currently executing within a parallel region or not:

logical function omp_in_parallel()

This function returns the value true when called from within a parallel region executing in parallel on a team of multiple threads. It returns the value false when called from a serial portion of the code or from a serial- ized parallel region (a parallel region that is executing serially on a team of only one thread). This function is often useful for programmers and library writers who may need to decide whether to use a parallel algo- rithm or a sequential algorithm based on the parallelism in the surround- ing context.
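For instance, a library routine might use this query along the following lines; smooth_serial and smooth_parallel are simply placeholders for a sequential and a parallel implementation of the same computation:

      subroutine smooth(a, n)
      integer n
      real a(n)
      logical omp_in_parallel

      ! Use the sequential kernel if we are already running inside an
      ! active parallel region; otherwise exploit parallelism ourselves.
      if (omp_in_parallel()) then
         call smooth_serial(a, n)
      else
         call smooth_parallel(a, n)
      endif
      end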

4.9.2 Controlling the Number of Threads In addition to specifying parallelism, OpenMP programmers may wish to control the size of parallel teams during the execution of their parallel program. The degree of parallelism exploited by an OpenMP program need not be determined until program runtime. Different executions of a pro- gram may therefore be run with different numbers of threads. Moreover, OpenMP allows the number of threads to change during the execution of a parallel program as well. We now describe these OpenMP mechanisms to query and control the number of threads used by the program. OpenMP provides two flavors of control. The first is through an envi- ronment variable that may be set to a numerical value:

setenv OMP_NUM_THREADS 12

If this variable is set when the program is started, then the program will execute using teams of OMP_NUM_THREADS parallel threads (12 in this case) for the parallel constructs.

The environment variable allows us to control the number of threads only at program start-up time, for the duration of the program. To adjust the degree of parallelism at a finer granularity, OpenMP also provides a runtime library routine to change the number of threads during program runtime:

call omp_set_num_threads(16)

This call sets the desired number of parallel threads during program exe- cution for subsequent parallel regions encountered by the program. This adjustment is not possible while the program is in the middle of executing a parallel region; therefore, this call may only be invoked from the serial portions of the program. There may be multiple calls to this routine in the program, each of which changes the desired number of threads to the newly supplied value.

Example 4.25    Dynamically adjusting the number of threads.

      call omp_set_num_threads(64)
!$omp parallel private(iam)
      iam = omp_get_thread_num()
      call workon(iam)
!$omp end parallel

In Example 4.25 we ask for 64 threads before the parallel region. This parallel region will therefore execute with 64 threads (or rather, most likely execute with 64 threads, depending on whether dynamic threads is enabled or not—see Section 4.9.3). Furthermore, all subsequent parallel regions will also continue to use teams of 64 threads unless this number is changed yet again with another call to omp_set_num_threads. If neither the environment variable nor the runtime library calls are used, then the choice of number of threads is implementation dependent. Systems may then just choose a fixed number of threads or use heuristics such as the number of available processors on the machine. In addition to controlling the number of threads, OpenMP provides the query routine

integer function omp_get_num_threads()

This routine returns the number of threads being used in the currently executing parallel team. Consequently, when called from a serial portion or from a serialized parallel region, the routine returns 1.
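For instance, a parallel region might query the team size and thread number and divide the iterations of a loop by hand, as in the following sketch (the shared array a and its length n are assumed to exist elsewhere):

!$omp parallel private(iam, nthreads, ichunk, istart, iend, i)
      iam = omp_get_thread_num()
      nthreads = omp_get_num_threads()
      ! divide the n iterations into one contiguous chunk per thread
      ichunk = (n + nthreads - 1)/nthreads
      istart = iam*ichunk + 1
      iend = min(n, istart + ichunk - 1)
      do i = istart, iend
         a(i) = 2.0*a(i)
      enddo
!$omp end parallel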

Since the choice of number of threads is likely to be based on the size of the underlying parallel machine, OpenMP also provides the call

integer function omp_get_num_procs()

This routine returns the number of processors in the underlying machine available for execution to the parallel program. To use all the available pro- cessors on the machine, for instance, the user can make the call

call omp_set_num_threads(omp_get_num_procs())

Even when using a larger number of threads than the number of avail- able processors, or while running on a loaded machine with few available processors, the program will continue to run with the requested number of threads. However, the implementation may choose to map multiple threads in a time-sliced fashion on a single processor, resulting in correct execution but perhaps reduced performance.

4.9.3 Dynamic Threads In a multiprogrammed environment, parallel machines are often used as shared compute servers, with multiple parallel applications running on the machine at the same time. In this scenario it is possible for all the parallel applications running together to request more processors than are actually available. This situation, termed oversubscription, leads to contention for computing resources, causing degradations in both the performance of an individual application as well as in overall system throughput. In this situation, if the number of threads requested by each application could be chosen to match the number of available processors, then the operating system could improve overall system utilization. Unfortunately, the number of available processors is not easily determined by a user; furthermore, this number may change during the course of execution of a program based on other jobs on the system. To address this issue, OpenMP allows the implementation to automatically adjust the number of active threads to match the number of available processors for that application based on the system load. This feature is called dynamic threads within OpenMP. On behalf of the application, the OpenMP runtime implementation can monitor the overall load on the system and determine the number of processors available for the application. The number of parallel threads executing within the application may then be adjusted (i.e., perhaps increased or decreased) to match the

number of available processors. With this scheme we can avoid oversub- scription of processing resources and thereby deliver good system throughput. Furthermore, this adjustment in the number of active threads is done automatically by the implementation and relieves the programmer from having to worry about coordinating with other jobs on the system. It is difficult to write a parallel program if parallel threads can choose to join or leave a team in an unpredictable manner. Therefore OpenMP requires that the number of threads be adjusted only during serial portions of the code. Once a parallel construct is encountered and a parallel team has been created, then the size of that parallel team is guaranteed to remain unchanged for the duration of that parallel construct. This allows all the OpenMP work-sharing constructs to work correctly. For manual division of work across threads, the suggested programming style is to query the number of threads upon entry to a parallel region and to use that number for the duration of the parallel region (it is assured of remain- ing unchanged). Of course, subsequent parallel regions may use a differ- ent number of threads. Finally, if a user wants to be assured of a known number of threads for either a phase or even the entire duration of a parallel program, then this feature may be disabled through either an environment variable or a runtime library call. The environment variable

setenv OMP_DYNAMIC {TRUE, FALSE}

can be used to enable/disable this feature for the duration of the parallel program. To adjust this feature at a finer granularity during the course of the program (say, for a particular phase), the user can insert a call to the runtime library of the form

call omp_set_dynamic (.TRUE.)    ! use .FALSE. to disable dynamic threads

The user can also query the current state of dynamic threads with the call

logical function omp_get_dynamic ()

The default—whether dynamic threads is enabled or disabled—is implementation dependent. We have given a brief overview here of the dynamic threads feature in OpenMP and discuss this issue further in Chapter 6.
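As a sketch of how this might be used, the following fragment disables dynamic threads around a phase that assumes a fixed team size, and then restores the previous setting; phase_one is a placeholder for the actual computation:

      logical was_dynamic, omp_get_dynamic

      was_dynamic = omp_get_dynamic()
      ! this phase assumes a stable team of exactly eight threads
      call omp_set_dynamic(.FALSE.)
      call omp_set_num_threads(8)
!$omp parallel
      call phase_one()
!$omp end parallel
      ! restore the previous setting for the rest of the program
      call omp_set_dynamic(was_dynamic)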

4.9.4 Runtime Library Calls and Environment Variables In this section we give an overview of the runtime library calls and environment variables available in OpenMP to control the execution parameters of a parallel application. Table 4.1 gives a list and a brief description of the library routines in C/C++ and Fortran. The behavior of the routines is the same in all of the languages. Prototypes for the C and C++ OpenMP library calls are available in the include file omp.h. Missing from the list here are the library routines that provide lock operations—these are described in Chapter 5. The behavior of most of these routines is straightforward. The omp_set_num_threads routine is used to change the number of threads used for parallel constructs during the execution of the program, perhaps from one parallel construct to another. Since this function changes the number of threads to be used, it must be called from within the serial portion of the program; otherwise its behavior is undefined. In the presence of dynamic threads, the number of threads used for a parallel construct is determined only when the parallel construct begins executing. The omp_get_num_threads call may therefore be used to determine how many threads are being used in the currently executing region; this number will no longer change for the rest of the current parallel construct. The omp_get_max_threads call returns the maximum number of threads that may be used for a parallel construct, independent of the dynamic adjustment in the number of active threads. The omp_get_thread_num routine returns the thread number within the currently executing team. When invoked from either serial code or from within a serialized parallel region, this routine returns 0, since there is only a single executing master thread in the current team. Finally, the omp_in_parallel call returns the value true when invoked from within the dynamic extent of a parallel region actually executing in parallel. Serialized parallel regions do not count as executing in parallel. Furthermore, this routine is not affected by nested parallel regions that may have been serialized; if there is an enclosing parallel region that is actually executing in parallel, then the routine will return true. Table 4.2 lists the environment variables that may be used to control the execution of an OpenMP program. The default values of most of these variables are implementation dependent, except for OMP_NESTED, which must be false by default. The OMP_SCHEDULE environment variable applies only to the do/for constructs and to the parallel do/parallel for constructs that have the runtime schedule type. Similar to the schedule clause,

Table 4.1    Summary of OpenMP runtime library calls in C/C++ and Fortran.

C/C++:   void omp_set_num_threads (int)
Fortran: call omp_set_num_threads (integer)
         Set the number of threads to use in a team.

C/C++:   int omp_get_num_threads (void)
Fortran: integer omp_get_num_threads ()
         Return the number of threads in the currently executing parallel
         region. Return 1 if invoked from serial code or a serialized
         parallel region.

C/C++:   int omp_get_thread_num (void)
Fortran: integer omp_get_thread_num ()
         Return the thread number within the team, a value within the
         inclusive interval 0 and omp_get_num_threads() - 1. Master thread
         is 0.

C/C++:   int omp_get_max_threads (void)
Fortran: integer omp_get_max_threads ()
         Return the maximum value that calls to omp_get_num_threads may
         return. This value changes with calls to omp_set_num_threads.

C/C++:   int omp_get_num_procs (void)
Fortran: integer omp_get_num_procs ()
         Return the number of processors available to the program.

C/C++:   void omp_set_dynamic (int)
Fortran: call omp_set_dynamic (logical)
         Enable/disable dynamic adjustment of the number of threads used
         for a parallel construct.

C/C++:   int omp_get_dynamic (void)
Fortran: logical omp_get_dynamic ()
         Return .TRUE. if dynamic threads is enabled, and .FALSE.
         otherwise.

C/C++:   int omp_in_parallel (void)
Fortran: logical omp_in_parallel ()
         Return .TRUE. if this function is invoked from within the dynamic
         extent of a parallel region, and .FALSE. otherwise.

C/C++:   void omp_set_nested (int)
Fortran: call omp_set_nested (logical)
         Enable/disable nested parallelism. If disabled, then nested
         parallel regions are serialized.

C/C++:   int omp_get_nested (void)
Fortran: logical omp_get_nested ()
         Return .TRUE. if nested parallelism is enabled, and .FALSE.
         otherwise.

Table 4.2    Summary of the OpenMP environment variables.

OMP_SCHEDULE       Example value: "dynamic, 4"
                   Specify the schedule type for parallel loops with a
                   runtime schedule.

OMP_NUM_THREADS    Example value: 16
                   Specify the number of threads to use during execution.

OMP_DYNAMIC        Example value: TRUE or FALSE
                   Enable/disable dynamic adjustment of threads.

OMP_NESTED         Example value: TRUE or FALSE
                   Enable/disable nested parallelism.

the environment variable must specify the schedule type and may provide an optional chunk size as shown. These environment variables are exam- ined only when the program begins execution; subsequent modifications, once the program has started executing, are ignored. Finally, the OpenMP runtime parameters controlled by these environment variables may be modified during the execution of the program by invoking the library rou- tines described above.
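As an illustration, a loop declared with the runtime schedule type picks up its schedule from OMP_SCHEDULE; with the setting shown in the comment, the loop below would be scheduled dynamically in chunks of four iterations (process is a placeholder routine):

      ! setenv OMP_SCHEDULE "dynamic,4"   before starting the program
!$omp parallel do schedule(runtime)
      do i = 1, n
         call process(i)
      enddo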

4.10 Concluding Remarks In conclusion, it is worthwhile to compare loop-level parallelism exempli- fied by the parallel do directive with SPMD-style parallelism exemplified by the parallel regions construct. As with any parallel construct, both of these directive sets require that the programmer have a complete understanding of the code contained within the dynamic extent of the parallel construct. The programmer must examine all variable references within the parallel construct and ensure that scope clauses and explicit synchronization constructs are supplied as needed. However, the two styles of parallelism have some basic differ- ences. Loop-level parallelism only requires that iterations of a loop be independent, and executes different iterations concurrently across multi- ple processors. SPMD-style parallelism, on the other hand, consists of a combination of replicated execution complemented with work-sharing constructs that divide work across threads. In fact, loop-level parallelism is just a special case of the more general parallel region directive containing only a single loop-level work-sharing construct. 138 Chapter 4—Beyond Loop-Level Parallelism: Parallel Regions

Because of the simpler model of parallelism, loop-level parallelism is easier to understand and use, and lends itself well to incremental parallelization one loop at a time. SPMD-style parallelism, on the other hand, has a more complex model of parallelism. Programmers must explicitly concern themselves with the replicated execution of code and identify those code segments where work must instead be divided across the parallel team of threads. This additional effort can have significant benefits. Whereas loop-level parallelism is limited to loop-level constructs, and to only one loop at a time, SPMD-style parallelism is more powerful and can be used to parallelize the execution of more general regions of code. As a result it can be used to parallelize coarser-grained, and therefore larger, regions of code (compared to loops). Identifying coarser regions of parallel code is essential in parallelizing greater portions of the overall program. Furthermore, it can reduce the overhead of spawning threads and synchronization that is necessary with each parallel construct. As a result, although SPMD-style parallelism is more cumbersome to use, it has the potential to deliver better parallel speedups than loop-level parallelism.

4.11 Exercises 1. Parallelize the following recurrence using parallel regions:

      do i = 2, N
         a(i) = a(i) + a(i - 1)
      enddo

2. What happens in Example 4.7 if we include istart and iend in the pri- vate clause? What storage is now associated with the names istart and iend inside the parallel region? What about inside the subroutine work? 3. Rework Example 4.7 using an orphaned do directive and eliminating the bounds common block. 4. Do you have to do anything special to Example 4.7 if you put iarray in a common block? If you make this common block threadprivate, what changes do you have to make to the rest of the program to keep it cor- rect? Does it still make sense for iarray to be declared of size 10,000? If not, what size should it be, assuming the program always executes with four threads. How must you change the program now to keep it correct? Why is putting iarray in a threadprivate common block a silly thing to do? 4.11 Exercises 139

5. Integer sorting is a class of sorting where the keys are integers. Often the range of possible key values is smaller than the number of keys, in which case the sorting can be very efficiently accomplished using an array of “buckets” in which to count keys. Below is the bucket sort code for sorting the array key into a new array, key2. Parallelize the code using parallel regions.

integer bucket(NUMBUKS), key(NUMKEYS), key2(NUMKEYS)

      do i = 1, NUMBUKS
         bucket(i) = 0
      enddo
      do i = 1, NUMKEYS
         bucket(key(i)) = bucket(key(i)) + 1
      enddo
      do i = 2, NUMBUKS
         bucket(i) = bucket(i) + bucket(i - 1)
      enddo
      do i = 1, NUMKEYS
         key2(bucket(key(i))) = key(i)
         bucket(key(i)) = bucket(key(i)) - 1
      enddo

CHAPTER 5

Synchronization

5.1 Introduction Shared memory architectures enable implicit communication between threads through references to variables in the globally shared memory. Just as in a regular uniprocessor program, multiple threads in an OpenMP program can communicate through regular read/write operations on vari- ables in the shared address space. The underlying shared memory archi- tecture is responsible for transparently managing all such communication and maintaining a coherent view of memory across all the processors. Although communication in an OpenMP program is implicit, it is usu- ally necessary to coordinate the access to shared variables by multiple threads to ensure correct execution. This chapter describes the synchroni- zation constructs of OpenMP in detail. It introduces the various kinds of data conflicts that arise in a parallel program and describes how to write a correct OpenMP program in the presence of such data conflicts. We first motivate the need for synchronization in shared memory par- allel programs in Section 5.2. We then present the OpenMP synchroniza- tion constructs for mutual exclusion including critical sections, the atomic directive, and the runtime library lock routines, all in Section 5.3. Next we present the synchronization constructs for event synchronization such as barriers and ordered sections in Section 5.4. Finally we describe some of


the issues that arise when building custom synchronization using shared memory variables, and how those are addressed in OpenMP, in Section 5.5.

5.2 Data Conflicts and the Need for Synchronization OpenMP provides a shared memory programming model, with communi- cation between multiple threads trivially expressed through regular read/ write operations on variables. However, as Example 5.1 illustrates, these accesses must be coordinated to ensure the integrity of the data in the program.

Example 5.1    Illustrating a data race.

      ! Finding the largest element in a list of numbers
      cur_max = MINUS_INFINITY
!$omp parallel do
      do i = 1, n
         if (a(i) .gt. cur_max) then
            cur_max = a(i)
         endif
      enddo

Example 5.1 finds the largest element in a list of numbers. The origi- nal uniprocessor program is quite straightforward: it stores the current maximum (initialized to minus_infinity) and consists of a simple loop that examines all the elements in the list, updating the maximum if the current element is larger than the largest number seen so far. In parallelizing this piece of code, we simply add a parallel do direc- tive to run the loop in parallel, so that each thread examines a portion of the list of numbers. However, as coded in the example, it is possible for the program to yield incorrect results. Let’s look at a possible execution trace of the code. For simplicity, assume two threads, P0 and P1. Further- more, assume that cur_max is 10, and the elements being examined by P0 and P1 are 11 and 12, respectively. Example 5.2 is a possible interleaved sequence of instructions executed by each thread.

Example 5.2    Possible execution fragment from Example 5.1.

    Thread 0                           Thread 1
    read a(i) (value = 12)
                                       read a(j) (value = 11)
    read cur_max (value = 10)
                                       read cur_max (value = 10)
    if (a(i) > cur_max) (12 > 10)
    cur_max = a(i) (i.e., 12)
                                       if (a(j) > cur_max) (11 > 10)
                                       cur_max = a(j) (i.e., 11)

As we can see in the execution sequence shown in Example 5.2, although each thread correctly executes its portion of the code, the final value of cur_max is incorrectly set to 11. The problem arises because a thread is reading cur_max even as some other thread is modifying it. Con- sequently a thread may (incorrectly) decide to update cur_max based upon having read a stale value for that variable. Concurrent accesses to the same variable are called data races and must be coordinated to ensure correct results. Such coordination mechanisms are the subject of this chapter. Although all forms of concurrent accesses are termed data races, syn- chronization is required only for those data races where at least one of the accesses is a write (i.e., modifies the variable). If all the accesses just read the variable, then they can proceed correctly without any synchronization. This is illustrated in Example 5.3.

Example 5.3    A data race without conflicts: multiple reads.

!$omp parallel do shared(a, b)
      do i = 1, n
         a(i) = a(i) + b
      enddo

In Example 5.3 the variable b is declared shared in the parallel loop and read concurrently by multiple threads. However, since b is not modi- fied, there is no conflict and therefore no need for synchronization between threads. Furthermore, even though array a is being updated by all the threads, there is no data conflict since each thread modifies a distinct element of a.

5.2.1 Getting Rid of Data Races When parallelizing a program, there may be variables that are ac- cessed by all threads but are not used to communicate values between them; rather they are used as scratch storage within a thread to compute some values used subsequently within the context of that same thread. Such variables can be safely declared to be private to each thread, thereby avoiding any data race altogether.

Example 5.4    Getting rid of a data race with privatization.

!$omp parallel do shared(a) private(b)
      do i = 1, n
         b = func(i, a(i))
         a(i) = a(i) + b
      enddo

Example 5.4 is similar to Example 5.3, except that each thread com- putes the value of the scalar b to be added to each element of the array a. If b is declared shared, then there is a data race on b since it is both read and modified by multiple threads. However, b is used to compute a new value within each iteration, and these values are not used across itera- tions. Therefore b is not used to communicate between threads and can safely be marked private as shown, thereby getting rid of the data conflict on b. Although this example illustrates the privatization of a scalar variable, this technique can also be applied to more complex data structures such as arrays and structures. Furthermore, in this example we assumed that the value of b was not used after the loop. If indeed it is used, then b would need to be declared lastprivate rather than private, and we would still successfully get rid of the data race.
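For instance, if the final value of b were needed after the loop, the same code could be written with lastprivate as in the following sketch:

!$omp parallel do shared(a) lastprivate(b)
      do i = 1, n
         b = func(i, a(i))
         a(i) = a(i) + b
      enddo
      ! b now holds the value it was assigned in iteration i = n,
      ! just as it would after the serial loop
      print *, b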

Example 5.5    Getting rid of a data race using reductions.

      cur_max = MINUS_INFINITY
!$omp parallel do reduction(MAX:cur_max)
      do i = 1, n
         cur_max = max (a(i), cur_max)
      enddo

A variation on this theme is to use reductions. We return to our first example (Example 5.1) to find the largest element in a list of numbers. The parallel version of this code can be viewed as if we first find the larg- est element within each thread (by only examining the elements within each thread), and then find the largest element across all the threads (by only examining the largest element within each thread computed in the previous step). This is essentially a reduction on cur_max using the max operator, as shown in Example 5.5. This avoids the data race for cur_max since each thread now has a private copy of cur_max for computing the local maxima (i.e., within its portion of the list). The subsequent phase, finding the largest element from among each thread’s maxima, does require synchronization, but is implemented automatically as part of the reduction clause. Thus the reduction clause can often be used successfully to overcome data races.

5.2.2 Examples of Acceptable Data Races The previous examples showed how we could avoid data races altogether in some situations. We now present two examples, each with a data race, but a race that is acceptable and does not require synchronization.

Example 5.6 tries to determine whether a given item occurs within a list of numbers. An item is allowed to occur multiple times within the list—that is, the list may contain duplicates. Furthermore, we are only interested in a true/false answer, depending on whether the item is found or not.

Example 5.6    An acceptable data race: multiple writers that write the same value.

      foundit = .FALSE.
!$omp parallel do
      do i = 1, n
         if (a(i) .eq. item) then
            foundit = .TRUE.
         endif
      enddo

The code segment in this example consists of a do loop that scans the list of numbers, searching for the given item. Within each iteration, if the item is found, then a global flag, foundit, is set to the value true. We exe- cute the loop in parallel using the parallel do directive. As a result, itera- tions of the do loop run in parallel and the foundit variable may potentially be modified in multiple iterations. However, the code contains no synchro- nization and allows multiple threads to update foundit simultaneously. Since all the updates to foundit write the same value into the variable, the final value of foundit will be true even in the presence of multiple updates. If, of course, the item is not found, then no iteration will update the foundit variable and it will remain false, which is exactly the desired behavior. Example 5.7 presents a five-point SOR (successive over relaxation) kernel. This algorithm contains a two-dimensional matrix of elements, where each element in the matrix is updated by taking its average along with the four nearest elements in the grid. This update step is performed for each element in the grid and is then repeated over a succession of time steps until the values converge based on some physical constraints within the algorithm.

Example 5.7    An acceptable data race: multiple writers, but a relaxed algorithm.

!$omp parallel do
      do j = 2, n - 1
         do i = 2, n - 1
            a(i, j) = (a(i, j) + a(i, j - 1) + a(i, j + 1) &
                       + a(i - 1, j) + a(i + 1, j))/5
         enddo
      enddo

The code shown in Example 5.7 consists of a nest of two do loops that scan all the elements in the grid, computing the average of each element as the code proceeds. With the parallel do directive as shown, we execute the do j loop in parallel, thereby processing the columns of the grid concurrently across multiple threads. However, this code contains a data dependence from one iteration of the do j loop to another. While iteration j computes a(i,j), iteration j + 1 is using this same value of a(i,j) in com- puting the average for the value at a(i,j + 1). As a result, this memory location may simultaneously be read in one iteration while it is being modified in another iteration. Given the loop-carried data dependence in this example, the parallel code contains a data race and may behave differently from the original, serial version of the code. However, this code can provide acceptable behavior in spite of this data race. As a consequence of this data race, it is possible that an iteration may occasionally use an older value of a(i,j – 1) or a(i,j + 1) from a preceding or succeeding column, rather than the latest value. However, the algorithm is tolerant of such errors and will still con- verge to the desired solution within an acceptable number of time-step iterations. This class of algorithms are called relaxed algorithms, where the parallel version does not implement the strict constraints of the original serial code, but instead carefully relaxes some of the constraints to exploit parallelism. As a result, this code can also successfully run in parallel without any synchronization and benefit from additional speedup. A final word of caution about this example. We have assumed that at any time a(i,j) either has an old value or a new value. This assumption is in turn based on the implicit assumption that a(i,j) is of primitive type such as a floating-point number. However, imagine a scenario where instead the type of a(i,j) is no longer primitive but instead is complex or structured. It is then possible that while a(i,j) is in the process of being updated, it has neither an old nor a new value, but rather a value that is somewhere in between. While one iteration updates the new value of a(i,j) through multiple write operations, another thread in a different iter- ation may be reading a(i,j) and obtaining a mix of old and new partial val- ues, violating our assumption above. Tricks such as this, therefore, must be used carefully with an understanding of the granularity of primitive write operations in the underlying machine.

5.2.3 Synchronization Mechanisms in OpenMP Having understood the basics of data races and some ways of avoiding them, we now describe the synchronization mechanisms provided in

OpenMP. Synchronization mechanisms can be broadly classified into two categories: mutual exclusion and event synchronization. Mutual exclusion synchronization is typically used to acquire exclusive access to shared data structures. When multiple threads need to access the same data structure, then mutual exclusion synchronization can be used to guarantee that (1) only one thread has access to the data structure for the duration of the synchronization construct, and (2) accesses by multiple threads are interleaved at the granularity of the synchronization construct. Event synchronization is used to signal the completion of some event from one thread to another. For instance, although mutual exclusion pro- vides exclusive access to a shared object, it does not provide any control over the order in which threads access the shared object. Any desired ordering between threads is implemented using event synchronization constructs.

5.3 Mutual Exclusion Synchronization OpenMP provides three different constructs for mutual exclusion. Al- though the basic functionality provided is similar, the constructs are de- signed hierarchically and range from restrictive but easy to use at one extreme, to powerful but requiring more skill on the part of the program- mer at the other extreme. We present these constructs in order of increas- ing power and complexity.

5.3.1 The Critical Section Directive The general form of the critical section in Fortran is

!$omp critical [(name)]
      block
!$omp end critical [(name)]

In C and C++ it is

#pragma omp critical [(name)]
    block

In this section we describe the basic critical section construct. We discuss critical sections with the optional name argument in the subsequent section. We illustrate critical sections using our earlier example to find the largest element in a list of numbers. This example as originally written

(Example 5.1) had a data conflict on the variable cur_max, with multiple threads potentially reading and modifying the variable. This data conflict can be addressed by adding synchronization to acquire exclusive access while referencing the cur_max variable. This is shown in Example 5.8. In Example 5.8 the critical section construct consists of a matching begin/end directive pair that encloses a structured block of code. Upon encountering the critical directive, a thread waits until no other thread is executing within a critical section, at which point it is granted exclusive access. At the end critical directive, the thread surrenders its exclusive access, signals a waiting thread (if any), and resumes execution past the end of the critical section. This code will now execute correctly, since only one thread can examine/update cur_max at any one time.

Example 5.8    Using a critical section.

      cur_max = MINUS_INFINITY
!$omp parallel do
      do i = 1, n
!$omp critical
         if (a(i) .gt. cur_max) then
            cur_max = a(i)
         endif
!$omp end critical
      enddo

The critical section directive provides mutual exclusion relative to all critical section directives in the program—that is, only one critical section is allowed to execute at one time anywhere in the program. Conceptually this is equivalent to a global lock in the program. The begin/end pair of directives must enclose a structured block of code, so that a thread is assured of encountering both the directives in order. Furthermore, it is illegal to branch into or jump out of a critical sec- tion, since either of those would clearly violate the desired semantics of the critical section. There is no guarantee of fairness in granting access to a critical sec- tion. In this context, “fairness” refers to the assurance that when multiple threads request the critical section, then they are granted access to the crit- ical section in some equitable fashion, such as in the order in which requests are made (first come, first served) or in a predetermined order such as round-robin. OpenMP does not provide any such guarantee. How- ever, OpenMP does guarantee forward progress, that is, at least one of the waiting threads will get access to the critical section. Therefore, in the presence of contention, while it is possible for a thread to be starved while 5.3 Mutual Exclusion Synchronization 149

other threads get repeated access to a critical section, an eligible thread will always be assured of acquiring access to the critical section.

Refining the Example The code in Example 5.8 with the added synchronization will now execute correctly, but it no longer exploits any parallelism. With the critical section as shown, the execution of this parallel loop is effectively serialized since there is no longer any overlap in the work performed in different iterations of the loop. However, this need not be so. We now refine the code to also exploit parallelism.

Example 5.9    Using a critical section.

      cur_max = MINUS_INFINITY
!$omp parallel do
      do i = 1, n
         if (a(i) .gt. cur_max) then
!$omp critical
            if (a(i) .gt. cur_max) then
               cur_max = a(i)
            endif
!$omp end critical
         endif
      enddo

The new code in Example 5.9 is based on three key observations. The first is that the value of cur_max increases monotonically during the exe- cution of the loop. Therefore, if an element of the list is less than the present value of cur_max, it is going to be less than all subsequent values of cur_max during the loop. The second observation is that in general most iterations of the loop only examine cur_max, but do not actually update it (of course, this is not always true—for instance, imagine a list that is sorted in increasing order). Finally, we are assuming that each ele- ment of the array is a scalar type that can be written atomically by the underlying architecture (e.g., a 4-byte or 8-byte integer or floating-point number). This ensures that when a thread examines cur_max, it reads either an old value or a new value, but never a value that is partly old and partly new, and therefore an invalid value for cur_max (such might be the case with a number of type complex, for instance). Based on these obser- vations, we have rewritten the code to first examine cur_max without any synchronization. If the present value of cur_max is already greater than a(i), then we need proceed no further. Otherwise a(i) may be the largest element seen so far, and we enter the critical section. Once inside the crit- ical section we examine cur_max again. This is necessary because we 150 Chapter 5—Synchronization

previously examined cur_max outside the critical section so it could have changed in the meantime. Examining it again within the critical section guarantees the thread exclusive access to cur_max, and an update based on the fresh comparison is assured of being correct. Furthermore, this new code will (hopefully) enter the critical section only infrequently and bene- fit from increased parallelism. A preliminary test such as this before actually executing the synchro- nization construct is often helpful in parallel programs to avoid unneces- sary overhead and exploit additional parallelism.

Named Critical Sections The critical section directive presented above provides global synchroniza- tion—that is, only one critical section can execute at one time, anywhere in the program. Although this is a simple model, it can be overly restric- tive when we need to protect access to distinct data structures. For instance, if a thread is inside a critical section updating an object, it unnecessarily prevents another thread from entering another critical sec- tion to update a different object, thereby reducing the parallelism exploited in the program. OpenMP therefore allows critical sections to be named, with the following behavior: a named critical section must syn- chronize with other critical sections of the same name but can execute concurrently with critical sections of a different name, and unnamed criti- cal sections synchronize only with other unnamed critical sections. Conceptually, if an unnamed critical section is like a global lock, named critical sections are like lock variables, with distinct names behav- ing like different lock variables. Distinct critical sections are therefore eas- ily specified by simply providing a name with the critical section directive. The names for critical sections are in a distinct namespace from program variables and therefore do not conflict with those names, but they are in the same global namespace as subroutine names or names of common blocks (in Fortran). Conflicts with the latter lead to undefined behavior and should be avoided.

Example 5.10    Using named critical sections.

      cur_max = MINUS_INFINITY
      cur_min = PLUS_INFINITY
!$omp parallel do
      do i = 1, n

         if (a(i) .gt. cur_max) then
!$omp critical (MAXLOCK)
            if (a(i) .gt. cur_max) then
               cur_max = a(i)
            endif
!$omp end critical (MAXLOCK)
         endif

         if (a(i) .lt. cur_min) then
!$omp critical (MINLOCK)
            if (a(i) .lt. cur_min) then
               cur_min = a(i)
            endif
!$omp end critical (MINLOCK)
         endif
      enddo

Example 5.10 illustrates how named critical sections may be used. This extended example finds both the smallest and the largest element in a list of numbers, and therefore updates both a cur_min along with a cur_max variable. Each of these variables must be updated within a criti- cal section, but using a distinct critical section for each enables us to exploit additional parallelism in the program. This is easily achieved using named critical sections.

Nesting Critical Sections Nesting of critical sections—that is, trying to enter one critical section while already executing within another critical section—must be done carefully! OpenMP does not provide any protection against deadlock, a condition where threads are waiting for synchronization events that will never happen. A common scenario is one where a thread executing within a critical section invokes a subroutine that acquires a critical section of the same name (or both critical sections could be unnamed), leading to deadlock. Since the nested critical section is in another subroutine, this nesting is not immediately apparent from an examination of the code, making such errors harder to catch. Why doesn’t OpenMP allow a thread to enter a critical section if it has already acquired exclusive access to that critical section? Although this feature is easy to support, it incurs additional execution overhead, adding to the cost of entering and leaving a critical section. Furthermore, this overhead must be incurred at nearly all critical sections (only some simple cases could be optimized), regardless of whether the program actually contains nested critical sections or not. Since most programs tend to use simple synchronization mechanisms and do not contain nested critical sections, this overhead is entirely unnecessary in the vast majority of programs. The OpenMP designers therefore chose to optimize for the common case, leaving the burden of handling nested critical sections to the programmer. There is one concession for nested mutual exclusion: along

with the regular version of the runtime library lock routines (described in Section 5.3.3), OpenMP provides a nested version of these same routines as well. If the program must nest critical sections, be careful that all threads try to acquire the multiple critical sections in the same order. For instance, suppose a program has two critical sections named min and max, and threads need exclusive access to both critical sections simultaneously. All threads must acquire the two critical sections in the same order (either min followed by max, or max followed by min). Otherwise it is possible that a thread has entered min and is waiting to enter max, while another thread has entered max and is trying to enter min. Neither thread will be able to succeed in acquiring the second critical section since it is already held by the other thread, leading to deadlock.
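For instance, if every thread needs both of the critical sections named min and max at once, a consistent acquisition order such as the following avoids deadlock; update_min_and_max is a placeholder for the protected work:

!$omp critical (min)
!$omp critical (max)
      ! both critical sections are held here; every thread acquires
      ! min before max, so no two threads can block each other
      call update_min_and_max(a(i))
!$omp end critical (max)
!$omp end critical (min)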

5.3.2 The atomic Directive Most modern shared memory multiprocessors provide hardware sup- port for atomically updating a single location in memory. This hardware support typically consists of special machine instructions, such as the load-linked store-conditional (LL/SC) pair of instructions on the MIPS pro- cessor or the compare-and-exchange (CMPXCHG) instruction on the Intel processors, that allow the processor to perform a read, modify, and write operation (such as an increment) on a memory location, all in an atomic fashion. These instructions use hardware support to acquire exclu- sive access to this single location for the duration of the update. Since the synchronization is integrated with the update of the shared location, these primitives avoid having to acquire and release a separate lock or critical section, resulting in higher performance. The atomic directive is designed to give an OpenMP programmer access to these efficient primitives in a portable fashion. Like the critical directive, the atomic directive is just another way of expressing mutual exclusion and does not provide any additional functionality. Rather, it comes with a set of restrictions that allow the directive to be implemented using the hardware synchronization primitives. The first set of restrictions applies to the form of the critical section. While the critical directive encloses an arbitrary block of code, the atomic directive can be applied only if the critical section consists of a single assignment statement that updates a scalar variable. This assignment statement must be of one of the following forms (including their commutative variations):

!$omp atomic
      x = x operator expr

      ...
!$omp atomic
      x = intrinsic (x, expr)

where x is a scalar variable of intrinsic type, operator is one of a set of pre- defined operators (including most arithmetic and logical operators), intrin- sic is one of a set of predefined intrinsics (including min, max, and logical intrinsics), and expr is a scalar expression that does not reference x. Table 5.1 provides the complete list of operators in the language. The cor- responding syntax in C and C++ is

#pragma omp atomic
    x <binop>= expr
    ...
#pragma omp atomic
    /* one of */ x++, ++x, x--, or --x

As you might guess, these restrictions on the form of the atomic directive ensure that the assignment can be translated into a sequence of machine instructions to atomically read, modify, and write the memory location for x. The second set of restrictions concerns the synchronization used for different accesses to this location in the program. The user must ensure that all conflicting accesses to this location are consistently protected using the same synchronization directive (either atomic or critical). Since the atomic and critical directive use quite different implementation mecha- nisms, they cannot be used simultaneously and still yield correct results. Nonconflicting accesses, on the other hand, can safely use different syn- chronization directives since only one mechanism is active at any one time. Therefore, in a program with distinct, nonoverlapping phases it is permissible for one phase to use the atomic directive for a particular vari- able and for the other phase to use the critical directive for protecting accesses to the same variable. Within a phase, however, we must consis- tently use either one or the other directive for a particular variable. And, of course, different mechanisms can be used for different variables within

Table 5.1    List of accepted operators for the atomic directive.

Language    Operators
Fortran     +, *, -, /, .AND., .OR., .EQV., .NEQV., MAX, MIN, IAND, IOR, IEOR
C/C++       +, *, -, /, &, ^, |, <<, >>

the same phase of the program. This also explains why this transformation cannot be performed automatically (i.e., examining the code within a crit- ical construct and implementing it using the synchronization hardware) and requires explicit user participation. Finally, the atomic construct provides exclusive access only to the location being updated (e.g., the variable x above). References to variables and side effect expressions within expr must avoid data conflicts to yield correct results. We illustrate the atomic construct with Example 5.11, building a histo- gram. Imagine we have a large list of numbers where each number takes a value between 1 and 100, and we need to count the number of instances of each possible value. Since the location being updated is an integer, we can use the atomic directive to ensure that updates of each element of the histogram are atomic. Furthermore, since the atomic directive synchro- nizes using the location being updated, multiple updates to different loca- tions can proceed concurrently,1 exploiting additional parallelism.

Example 5.11    Building a histogram using the atomic directive.

      integer histogram(100)

!$omp parallel do
      do i = 1, n
!$omp atomic
         histogram(a(i)) = histogram(a(i)) + 1
      enddo

How should a programmer evaluate the performance trade-offs be- tween the critical and the atomic directives (assuming, of course, that both are applicable)? A critical section containing a single assignment statement is usually more efficient with the atomic directive, and never worse. How- ever, a critical section with multiple assignment statements cannot be transformed to use a single atomic directive, but rather requires an atomic directive for each statement within the critical section (assuming that each statement meets the criteria listed above for atomic, of course). The perfor- mance trade-offs for this transformation are nontrivial—a single critical has the advantage of incurring the overhead for just a single synchronization construct, while multiple atomic directives have the benefits of (1) exploit- ing additional overlap and (2) smaller overhead for each synchronization construct. A general guideline is to use the atomic directive when updating

1 If multiple shared variables lie within a cache line, then performance can be adversely affected. This phenomenon is called false sharing and is discussed further in Chapter 6.

either a single location or a few locations, and to prefer the critical direc- tive when updating several locations.
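As a sketch of this trade-off, consider protecting two updates; the variables shown here (count, total, x, k) are placeholders:

      ! one critical section performs the two updates as a single unit
!$omp critical
      count(k) = count(k) + 1
      total = total + x
!$omp end critical

      ! separate atomic directives have lower overhead per construct and
      ! let updates to different locations overlap, but the pair of
      ! updates is no longer performed atomically as a whole
!$omp atomic
      count(k) = count(k) + 1
!$omp atomic
      total = total + x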

5.3.3 Runtime Library Lock Routines In addition to the critical and atomic directives, OpenMP provides a set of lock routines within a runtime library. These routines are listed in Table 5.2 (for Fortran and C/C++) and perform the usual set of operations on locks such as acquire, release, or test a lock, as well as allocation and deallocation. Example 5.12 illustrates the familiar code to find the largest element using these library routines rather than a critical section.

Example 5.12    Using the lock routines.

      real*8 maxlock
      call omp_init_lock(maxlock)

      cur_max = MINUS_INFINITY
!$omp parallel do
      do i = 1, n
         if (a(i) .gt. cur_max) then
            call omp_set_lock(maxlock)
            if (a(i) .gt. cur_max) then
               cur_max = a(i)
            endif
            call omp_unset_lock(maxlock)
         endif
      enddo
      call omp_destroy_lock(maxlock)

Lock routines are another mechanism for mutual exclusion, but pro- vide greater flexibility in their use as compared to the critical and atomic directives. First, unlike the critical directive, the set/unset subroutine calls do not have to be block-structured. The user can place these calls at arbi- trary points in the code (including in different subroutines) and is respon- sible for matching the invocations of the set/unset routines. Second, each lock subroutine takes a lock variable as an argument. In Fortran this vari- able must be sized large enough to hold an address2 (e.g., on 64-bit sys- tems the variable must be sized to be 64 bits), while in C and C++ it must be the address of a location of the predefined type omp_lock_t (in C and

2 This allows an implementation to treat the lock variable as a pointer to lock data structures allocated and maintained within an OpenMP implementation.

Table 5.2    List of runtime library routines for lock access.

Fortran: omp_init_lock (var)
C/C++:   void omp_init_lock (omp_lock_t*)
         Allocate and initialize the lock.

Fortran: omp_destroy_lock (var)
C/C++:   void omp_destroy_lock (omp_lock_t*)
         Deallocate and free the lock.

Fortran: omp_set_lock (var)
C/C++:   void omp_set_lock (omp_lock_t*)
         Acquire the lock, waiting until it becomes available, if
         necessary.

Fortran: omp_unset_lock (var)
C/C++:   void omp_unset_lock (omp_lock_t*)
         Release the lock, resuming a waiting thread (if any).

Fortran: logical omp_test_lock (var)
C/C++:   int omp_test_lock (omp_lock_t*)
         Try to acquire the lock; return success (true) or failure
         (false).

C++ this type, and in fact the prototypes of all the OpenMP runtime library functions, may be found in the standard OpenMP include file omp.h). The actual lock variable can therefore be determined dynamically, including being passed around as a parameter to other subroutines. In contrast, even named critical sections are determined statically based on the name supplied with the directive. Finally, the omp_test_lock(. . .) rou- tine provides the ability to write nondeterministic code—for instance, it enables a thread to do other useful work while waiting for a lock to become available. Attempting to reacquire a lock already held by a thread results in deadlock, similar to the behavior of nested critical sections discussed in Section 5.3.1. However, support for nested locks is sometimes useful: For instance, consider a recursive subroutine that must execute with mutual exclusion. This routine would deadlock with either critical sections or the lock routines, although it could, in principle, execute correctly without violating any of the program requirements. OpenMP therefore provides another set of library routines that may be nested—that is, reacquired by the same thread without deadlock. The interface to these routines is very similar to the interface for the regular routines, with the additional keyword nest, as shown in Table 5.3. These routines behave similarly to the regular routines, with the dif- ference that they support nesting. Therefore an omp_set_nest_lock opera- tion that finds that the lock is already set will check whether the lock is 5.4 Event Synchronization 157

Table 5.3    List of runtime library routines for nested lock access.

Fortran: omp_init_nest_lock (var)
C/C++:   void omp_init_nest_lock (omp_nest_lock_t*)
         Allocate and initialize the lock.

Fortran: omp_destroy_nest_lock (var)
C/C++:   void omp_destroy_nest_lock (omp_nest_lock_t*)
         Deallocate and free the lock.

Fortran: omp_set_nest_lock (var)
C/C++:   void omp_set_nest_lock (omp_nest_lock_t*)
         Acquire the lock, waiting until it becomes available, if
         necessary. If the lock is already held by the same thread, then
         access is automatically assured.

Fortran: omp_unset_nest_lock (var)
C/C++:   void omp_unset_nest_lock (omp_nest_lock_t*)
         Release the lock. If this thread no longer has a pending lock
         acquire, then resume a waiting thread (if any).

Fortran: logical omp_test_nest_lock (var)
C/C++:   int omp_test_nest_lock (omp_nest_lock_t*)
         Try to acquire the lock; return success (true) or failure
         (false). If this thread already holds the lock, then acquisition
         is assured, and return true.

held by the requesting thread. If so, then the set operation is successful and continues execution. Otherwise, of course, it waits for the thread that holds the lock to exit the critical section.
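Returning to the omp_test_lock routine mentioned above, the following sketch shows one way a thread can do other useful work while waiting for a lock; do_local_work and process_shared_queue are placeholder routines, and the lock variable is simply declared large enough to hold an address, as required by the text:

      integer*8 lck
      logical omp_test_lock

      call omp_init_lock(lck)
      ! keep doing useful local work until the lock becomes available;
      ! when omp_test_lock succeeds, the calling thread holds the lock
      do while (.not. omp_test_lock(lck))
         call do_local_work()
      enddo
      call process_shared_queue()
      call omp_unset_lock(lck)
      call omp_destroy_lock(lck)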

5.4 Event Synchronization As we discussed earlier in the chapter, the constructs for mutual exclusion provide exclusive access but do not impose any order in which the critical sections are executed by different threads. We now turn to the OpenMP constructs for ordering the execution between threads—these include bar- riers, ordered sections, and the master directive.

5.4.1 Barriers The barrier directive in OpenMP provides classical barrier synchronization between a group of threads. Upon encountering the barrier directive, each thread waits for all the other threads to arrive at the barrier.

Only when all the threads have arrived do they continue execution past the barrier directive. The general form of the barrier directive in Fortran is

!$omp barrier

In C and C++ it is

#pragma omp barrier

Barriers are used to synchronize the execution of multiple threads within a parallel region. A barrier directive ensures that all the code occurring before the barrier has been completed by all the threads, before any thread can execute any of the code past the barrier directive. This is a simple directive that can be used to ensure that a piece of work (e.g., a computational phase) has been completed before moving on to the next phase.

We illustrate the barrier directive in Example 5.13. The first portion of code within the parallel region has all the threads generating and adding work items to a list, until there are no more work items to be generated. Work items are represented by an index in this example. In the second phase, each thread fetches and processes work items from this list, again until all the items have been processed. To ensure that the first phase is complete before the start of the second phase, we add a barrier directive at the end of the first phase. This ensures that all threads have finished generating all their tasks before any thread moves on to actually processing the tasks.

A barrier in OpenMP synchronizes all the threads executing within the parallel region. Therefore, like the other work-sharing constructs, all threads executing the parallel region must execute the barrier directive, otherwise the program will deadlock. In addition, the barrier directive synchronizes only the threads within the current parallel region, not any threads that may be executing within other parallel regions (other parallel regions are possible with nested parallelism, for instance). If a parallel region gets serialized, then it effectively executes with a team containing a single thread, so the barrier is trivially complete when this thread arrives at the barrier. Furthermore, a barrier cannot be invoked from within a work-sharing construct (synchronizing an arbitrary set of iterations of a parallel loop would be nonsensical); rather it must be invoked from within a parallel region as shown in Example 5.13. It may, of course, be orphaned.

Finally, work-sharing constructs such as the do construct or the sections construct have an implicit barrier at the end of the construct, thereby ensuring that the work being divided among the threads has completed before moving on to the next piece of work. This implicit barrier may be overridden by supplying the nowait clause with the end directive of the work-sharing construct (e.g., end do or end sections).

Example 5.13 Illustrating the barrier directive.

!$omp parallel private(index)
      index = Generate_Next_Index()
      do while (index .ne. 0)
          call Add_Index(index)
          index = Generate_Next_Index()
      enddo

      ! Wait for all the indices to be generated
!$omp barrier

      index = Get_Next_Index()
      do while (index .ne. 0)
          call Process_Index(index)
          index = Get_Next_Index()
      enddo
!$omp end parallel

5.4.2 Ordered Sections

The ordered section directive is used to impose an order across the iterations of a parallel loop. As described earlier, iterations of a parallel loop are assumed to be independent of each other and execute concurrently without synchronization. With the ordered section directive, however, we can identify a portion of code within each loop iteration that must be executed in the original, sequential order of the loop iterations. Instances of this portion of code from different iterations execute in the same order, one after the other, as they would have executed if the loop had not been parallelized. These sections are only ordered relative to each other and execute concurrently with other code outside the ordered section. The general form of an ordered section in Fortran is

!$omp ordered
      block
!$omp end ordered

In C and C++ it is

#pragma omp ordered
      block

Example 5.14 illustrates the usage of the ordered section directive. It consists of a parallel loop where each iteration performs some complex computation and then prints out the value of the array element. Although the core computation is fully parallelizable across iterations of the loop, it is necessary that the output to the file be performed in the original, sequential order. This is accomplished by wrapping the code to do the output within the ordered section directive. With this directive, the computation across multiple iterations is fully overlapped, but before entering the ordered section each thread waits for the ordered section from the previous iteration of the loop to be completed. This ensures that the output to the file is maintained in the original order. Furthermore, if most of the time within the loop is spent computing the necessary values, then this example will exploit substantial parallelism as well.

Example 5.14 Using an ordered section.

!$omp parallel do ordered
      do i = 1, n
          a(i) = ... complex calculations here ...

          ! wait until the previous iteration has
          ! finished its ordered section
!$omp ordered
          print *, a(i)
          ! signal the completion of ordered
          ! from this iteration
!$omp end ordered
      enddo

The ordered directive may be orphaned—that is, it does not have to occur within the lexical scope of the parallel loop; rather, it can be encountered anywhere within the dynamic extent of the loop. Furthermore, the ordered directive may be combined with any of the different schedule types on a parallel do loop. In the interest of efficiency, however, the ordered directive has the following two restrictions on its usage. First, if a parallel loop contains an ordered directive, then the parallel loop directive itself must contain the ordered clause. A parallel loop with an ordered clause does not have to encounter an ordered section. This restriction enables the implementation to incur the overhead of the ordered directive only when it is used, rather than for all parallel loops. Second, an iteration of a parallel loop is allowed to encounter at most one ordered section (it can safely encounter no ordered section). Encountering more than one ordered section will result in undefined behavior. Taken together, these restrictions allow OpenMP to provide both a simple and efficient model for ordering a portion of the iterations of a parallel loop in their original serial order.
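For readers working in C, a sketch analogous to Example 5.14 might look as follows; compute_value is a hypothetical stand-in for the complex calculation in that example.

      #include <stdio.h>
      #include <math.h>

      /* Hypothetical stand-in for the "complex calculations" of Example 5.14. */
      static double compute_value(int i)
      {
          return sqrt((double)i) * 3.14159;
      }

      int main(void)
      {
          enum { N = 100 };
          double a[N];
          int i;

          /* The ordered clause on the loop is required because the loop body
             contains an ordered directive. */
          #pragma omp parallel for ordered
          for (i = 0; i < N; i++) {
              a[i] = compute_value(i);          /* runs concurrently across iterations */
              #pragma omp ordered
              printf("a[%d] = %f\n", i, a[i]);  /* printed in original loop order */
          }
          return 0;
      }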

5.4.3 The master Directive

A parallel region in OpenMP is a construct with SPMD-style execution—that is, all threads execute the code contained within the parallel region in a replicated fashion. In this scenario, the OpenMP master construct can be used to identify a block of code within the parallel region that must be executed by the master thread of the executing parallel team of threads. The precise form of the master construct in Fortran is

!$omp master
      block
!$omp end master

In C and C++ it is

#pragma omp master
      block

The code contained within the master construct is executed only by the master thread in the team. This construct is distinct from the other work-sharing constructs presented in Chapter 4—not all threads are required to reach this construct, and there is no implicit barrier at the end of the construct. If another thread encounters the construct, then it simply skips past this block of code onto the subsequent code. Example 5.15 illustrates the master directive. In this example all threads perform some computation in parallel, after which the intermediate results must be printed out. We use the master directive to ensure that the I/O operations are performed serially, by the master thread.

Example 5.15 Using the master directive.

!$omp parallel

!$omp do
      do i = 1, n
          ... perform computation ...
      enddo

!$omp master
      print *, intermediate_results
!$omp end master

      ... continue next computation phase ...
!$omp end parallel

This construct may be used to restrict I/O operations to the master thread only, to access the master's copy of threadprivate variables, or perhaps just as a more efficient instance of the single directive.
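As a rough C analog of Example 5.15 (our sketch, not from the text), the program below uses master to print an intermediate result; because master implies no barrier, an explicit barrier is added before the next phase would begin.

      #include <stdio.h>
      #include <omp.h>

      #define N 1000

      int main(void)
      {
          static double x[N];
          double sum = 0.0;
          int i;

          #pragma omp parallel
          {
              /* Phase 1: all threads share the work (implicit barrier at the end). */
              #pragma omp for reduction(+:sum)
              for (i = 0; i < N; i++) {
                  x[i] = 0.5 * i;
                  sum += x[i];
              }

              /* Only the master thread prints; the others skip past immediately. */
              #pragma omp master
              printf("intermediate sum = %f (thread %d)\n",
                     sum, omp_get_thread_num());

              /* Explicit barrier so no thread starts phase 2 before the print. */
              #pragma omp barrier

              /* ... phase 2 would go here ... */
          }
          return 0;
      }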

5.5 Custom Synchronization: Rolling Your Own

As we have described in this chapter, OpenMP provides a wide variety of synchronization constructs. In addition, since OpenMP provides a shared memory programming model, it is possible for programmers to build their own synchronization constructs using ordinary load/store references to shared memory locations. We now discuss some of the issues in crafting such custom synchronization and how they are addressed within OpenMP.

We illustrate the issues with a simple producer/consumer-style application, where a producer thread repeatedly produces a data item and signals its availability, and a consumer thread waits for a data value and then consumes it. This coordination between the producer and consumer threads can easily be expressed in a shared memory program without using any OpenMP construct, as shown in Example 5.16.

Example 5.16 A producer/consumer example.

      Producer Thread               Consumer Thread

      data = ...
      flag = 1                      do while (flag .eq. 0)
                                    ... = data

Although the code as written is basically correct, it assumes that the various read/write operations to the memory locations data and flag are performed in strictly the same order as coded. Unfortunately this assumption is all too easily violated on modern systems—for instance, a compiler may allocate either one (or both) of the variables in a register, thereby delaying the update to the memory location. Another possibility is that the compiler may reorder either the modifications to data and flag in the producer, or conversely reorder the reads of data and flag in the consumer. Finally, the architecture itself may cause the updates to data and flag to be observed in a different order by the consumer thread due to reordering of memory transactions in the memory interconnect. Any of these factors can cause the consumer to observe the memory operations in the wrong order, leading to incorrect results (or worse yet, deadlock).

While it may seem that the transformations being performed by the compiler/architecture are incorrect, these transformations are commonplace in modern computer systems—they cause no such correctness problems in sequential (i.e., nonparallel) programs and are absolutely essential to obtaining high performance in those programs. Even for parallel programs the vast majority of code portions do not contain synchronization through memory references and can safely be optimized using the transformations above. For a detailed discussion of these issues, see [AG 96] and [KY 95]. Given the importance of these transformations to performance, the approach taken in OpenMP is to selectively identify the code portions that synchronize directly through shared memory, thereby disabling these troublesome optimizations for those code portions.

5.5.1 The flush Directive

OpenMP provides the flush directive, which has the following form:

!$omp flush [(list)]           (in Fortran)
#pragma omp flush [(list)]     (in C and C++)

where list is an optional list of variable names.

The flush directive in OpenMP may be used to identify a synchronization point in the program. We define a synchronization point as a point in the execution of the program where the executing thread needs to have a consistent view of memory. A consistent view of memory has two requirements. The first requirement is that all memory operations, including read and write operations, that occur before the flush directive in the source program must be performed before the synchronization point in the executing thread. In a similar fashion, the second requirement is that all memory operations that occur after the flush directive in the source code must be performed only after the synchronization point in the executing thread. Taken together, a synchronization point imposes a strict order upon the memory operations within the executing thread: at a synchronization point all previous read/write operations must have been performed, and none of the subsequent memory operations should have been initiated. This behavior of a synchronization point is often called a memory fence, since it inhibits the movement of memory operations across that point.

Based on these requirements, an implementation (i.e., the combination of compiler and processor/architecture) cannot retain a variable in a register/hardware buffer across a synchronization point. If the variable is modified by the thread before the synchronization point, then the updated value must be written out to memory before the synchronization point; this will make the updated value visible to other threads. For instance, a compiler must restore a modified value from a register to memory, and hardware must flush write buffers, if any.

Similarly, if the variable is read by the thread after the synchronization point, then it must be retrieved from memory before the first use of that variable past the synchronization point; this will ensure that the subsequent read fetches the latest value of the variable, which may have changed due to an update by another thread.

At a synchronization point, the compiler/architecture is required to flush only those shared variables that might be accessible by another thread. This requirement does not apply to variables that cannot be accessed by another thread, since those variables could not be used for communication/synchronization between threads. A compiler, for instance, can often prove that an automatic variable cannot be accessed by any other thread. It may then safely allocate that variable in a register across the synchronization point identified by a flush directive.

By default, a flush directive applies to all variables that could potentially be accessed by another thread. However, the user can also choose to provide an optional list of variables with the flush directive. In this scenario, rather than applying to all shared variables, the flush directive instead behaves like a memory fence for only the variables named in the flush directive. By carefully naming just the necessary variables in the flush directive, the programmer can choose to have the memory fence for those variables while simultaneously allowing the optimization of other shared variables, thereby potentially gaining additional performance.

The flush directive makes only the executing thread's view of memory consistent with global shared memory. To achieve a globally consistent view across all threads, each thread must execute a flush operation. Finally, the flush directive does not, by itself, perform any synchronization. It only provides memory consistency between the executing thread and global memory, and must be used in combination with other read/write operations to implement synchronization between threads.

Let us now illustrate how the flush directive would be used in the previous producer/consumer example. For correct execution the program in Example 5.17 requires that the producer first make all the updates to data, and then set the flag variable. The first flush directive in the producer ensures that all updates to data are flushed to memory before touching flag. The second flush ensures that the update of flag is actually flushed to memory rather than, for instance, being allocated into a register for the rest of the subprogram. On the consumer side, the consumer must keep reading the memory location for flag—this is ensured by the first flush directive, requiring flag to be reread in each iteration of the while loop. Finally, the second flush assures us that any references to data are made only after the correct value of flag has been read.

Together, these constraints provide a correctly running program that can, at the same time, be highly optimized in code portions without synchronization, as well as in code portions contained between synchronization (i.e., flush) points in the program.

Example 5.17 A producer/consumer example using the flush directive.

      Producer Thread               Consumer Thread

      data = ...
      !$omp flush (data)
      flag = 1                      do
      !$omp flush (flag)            !$omp flush (flag)
                                    while (flag .eq. 0)
                                    !$omp flush (data)
                                    ... = data

Experienced programmers reading the preceding description about the flush directive have probably encountered this problem in other shared memory programming models. Unfortunately this issue has either been ignored or addressed in an ad hoc fashion in the past. Programmers have resorted to tricks such as inserting a subroutine call (perhaps to a dummy subroutine) at the desired synchronization point, hoping to thereby prevent compiler optimizations across that point. Another trick is to declare some of the shared variables as volatile and hope that a reference to a volatile variable at a synchronization point would also inhibit the movement of all other memory operations across that synchronization point. Unfortunately these tricks are not very reliable. For instance, modern compilers may inline a called subroutine or perform interprocedural analyses that enable optimizations across a call. In a similar vein, the interpretation of the semantics of volatile variable references and their effect on other, nonvolatile variable references has unfortunately been found to vary from one compiler to another. As a result these tricks are generally unreliable and not portable from one platform to another (sometimes not even from one compiler release to the next). With the flush directive, therefore, OpenMP provides a standard and portable way of writing custom synchronization through regular shared memory operations.
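For reference, a C rendering of the producer/consumer pattern of Example 5.17 might look like the following sketch; it assumes the producer runs on thread 0 and the consumer on thread 1, and the values involved are illustrative.

      #include <stdio.h>
      #include <omp.h>

      int main(void)
      {
          int data = 0;
          int flag = 0;

          omp_set_num_threads(2);
          #pragma omp parallel shared(data, flag)
          {
              if (omp_get_thread_num() == 0) {      /* producer */
                  data = 42;
                  #pragma omp flush(data)           /* publish data before flag */
                  flag = 1;
                  #pragma omp flush(flag)           /* make the flag visible */
              } else {                              /* consumer */
                  for (;;) {
                      #pragma omp flush(flag)       /* force flag to be reread */
                      if (flag == 1) break;
                  }
                  #pragma omp flush(data)           /* now safe to read data */
                  printf("consumed %d\n", data);
              }
          }
          return 0;
      }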

5.6 Some Practical Considerations

We now discuss a few practical issues that arise related to synchronization in parallel programs. The first issue concerns the behavior of library routines in the presence of parallel execution, the second has to do with waiting for a synchronization request to be satisfied, while the last issue examines the impact of cache lines on the performance of synchronization constructs.

Most programs invoke library facilities—whether to manage dynamically allocated storage through malloc/free operations, to perform I/O and file operations, or to call mathematical routines such as a random number generator or a transcendental function. Questions then arise: What assumptions can the programmer make about the behavior of these library facilities in the presence of parallel execution? What happens when these routines are invoked from within a parallel piece of code, so that multiple threads may be invoking multiple instances of the same routine concurrently?

The desired behavior, clearly, is that library routines continue to work correctly in the presence of parallel execution. The most natural behavior for concurrent library calls is to behave as if they were invoked one at a time, although in some nondeterministic order that may vary from one execution to the next. For instance, the desired behavior for malloc and free is to continue to allocate and deallocate storage, maintaining the integrity of the heap in the presence of concurrent calls. Similarly, the desired behavior for I/O operations is to perform the I/O in such a fashion that an individual I/O request is perceived to execute atomically, although multiple requests may be interleaved in some order. For other routines, such as a random number generator, it may even be sufficient to simply generate a random number on a per-thread basis rather than coordinating the random number generation across multiple threads.

Such routines that continue to function correctly even when invoked concurrently by multiple threads are called thread-safe routines. Although most routines don't start out being thread-safe, they can usually be made thread-safe through a variety of schemes. Routines that are "pure"—that is, those that do not contain any state but simply compute and return a value based on the value of the input parameters—are usually automatically thread-safe since there is no contention for any shared resources across multiple invocations. Most other routines that do share some state (such as I/O and malloc/free) can usually be made thread-safe by adding a trivial lock/unlock pair around the entire subroutine (to be safe, we would preferably add a nest_lock/nest_unlock). The lock/unlock essentially makes the routine single-threaded—that is, only one invocation can execute at any one time, thereby maintaining the integrity of each individual call. Finally, for some routines we can often rework the underlying algorithm to enable multiple invocations to execute concurrently and provide the desired behavior; we suggested one such candidate in the random number generator routine.

Although thread safety is highly desirable, the designers of OpenMP felt that while they could prescribe the behavior of OpenMP constructs, they could not dictate the behavior of all library routines on a system.

Thread safety, therefore, has been left as a vendor-specific issue. It is the hope and indeed the expectation that over time libraries from most vendors will become thread-safe and function correctly. Meanwhile, you the programmer will have to determine the behavior provided on your favorite platform.

The second issue is concerned with the underlying implementation, rather than the semantics of OpenMP itself. This issue relates to the mechanism used by a thread to wait for a synchronization event. What does a thread do when it needs to wait for an event such as a lock to become available or for other threads to arrive at a barrier? There are several implementation options in this situation. At one extreme the thread could simply busy-wait for the synchronization event to occur, perhaps by spinning on a flag in shared memory. Although this option will inform the thread nearly immediately once the synchronization is signaled, the waiting thread can adversely affect the performance of other processes on the system. For instance, it will consume processor cycles that could have been usefully employed by another thread, and it may cause contention for system resources such as the bus, network, and/or the system memory. At the other extreme the waiting thread could immediately surrender the underlying processor to some other runnable process and block, waiting to resume execution only when the synchronization event has been signaled. This option avoids the waste in system resources, but may incur some delay before the waiting thread is informed of the synchronization event. Furthermore, it will incur the context switch overhead in blocking and then unblocking the waiting thread. Finally, there is a range of intermediate options as well, such as busy-waiting for a while followed by blocking, or busy-waiting combined with an exponential back-off scheme.

This is clearly an implementation issue, so that the OpenMP programmer need never be concerned about it, particularly from a correctness or semantic perspective. It does have some impact on performance, so the advanced programmer would do well to be aware of these implementation issues on their favorite platform.

Finally, we briefly mention the impact of the underlying cache on the performance of synchronization constructs. Since all synchronization mechanisms fundamentally coordinate the execution of multiple threads, they require the communication of variables between multiple threads. At the same time, cache-based machines communicate data between processors at the granularity of a cache line, typically 64 to 128 bytes in size. If multiple variables happen to be located within the same cache line, then accesses to one variable will cause all the other variables to also be moved around from one processor to another.

This phenomenon is called false sharing and is discussed in greater detail in Chapter 6. For now it is sufficient to mention that since synchronization variables are likely to be heavily communicated, we would do well to be aware of this accidental sharing and avoid it by padding heavily communicated variables out to a cache line.
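One common way to apply this advice is to pad each thread's heavily communicated variable out to a full cache line, as in the following C sketch of ours; the 128-byte line size and the array bound are illustrative assumptions, not requirements.

      #include <stdio.h>
      #include <omp.h>

      #define CACHE_LINE  128      /* assumed cache line size in bytes */
      #define MAX_THREADS 64       /* assumed upper bound on team size */

      /* Padding each per-thread counter to a full cache line keeps two
         threads' counters from sharing a line and ping-ponging between
         processors. */
      struct padded_counter {
          int  count;
          char pad[CACHE_LINE - sizeof(int)];
      };

      static struct padded_counter counters[MAX_THREADS];

      int main(void)
      {
          int t, nthreads = 0, total = 0;

          #pragma omp parallel
          {
              int i, id = omp_get_thread_num();
              #pragma omp master
              nthreads = omp_get_num_threads();

              for (i = 0; i < 1000000; i++)
                  counters[id].count++;    /* each thread touches only its own line */
          }

          for (t = 0; t < nthreads; t++)
              total += counters[t].count;
          printf("total = %d\n", total);
          return 0;
      }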

5.7 Concluding Remarks

This chapter gave a detailed description of the various synchronization constructs in OpenMP. Although OpenMP provides a range of such constructs, there is a deliberate attempt to provide a hierarchy of mechanisms. The goal in OpenMP has been to make it easy to express simple synchronization requirements (for instance, a critical section or a barrier requires only a single directive) while at the same time providing powerful mechanisms for more complex situations, such as the atomic and flush directives. Taken together, this hierarchy of control tries to provide flexible constructs for the needs of complex situations without cluttering the constructs required by the more common and simpler programs.

5.8 Exercises

1. Rewrite Example 5.8 using the atomic directive instead of critical sections.

2. Rewrite Example 5.8 using locks instead of critical sections.

3. Most modern microprocessors include special hardware to test and modify a variable in a single instruction cycle. In this manner a shared variable can be modified without interrupt, and this forms the basis of most compiler implementations of critical sections. It is, however, possible to implement critical sections for two threads in software simply using the flush directive, a status array (either "locked" or "unlocked" for a thread), and a "turn" variable (to store the thread id whose turn it is to enter the critical section). Rewrite Example 5.8 using a software implementation of critical sections for two threads. How would you modify your code to accommodate named critical sections?

4. Generalize your software implementation of critical sections from Exercise 3 to arbitrary numbers of threads. You will have to add another status besides locked and unlocked.

5. Consider the following highly contrived example of nested, unnamed critical sections (the example computes a sum reduction and histograms the sum as it gets generated):

      integer sum, a(N), histogram(m)
!$omp parallel do
      do i = 1, N
!$omp critical
          sum = sum + a(i)
          call histo(sum, histogram)
!$omp end critical
      enddo
      end

      subroutine histo(sum, histogram)
      integer sum, histogram(m)
!$omp critical
      histogram(sum) = histogram(sum) + 1
!$omp end critical
      return
      end

As you know, nesting of unnamed (or same-name) critical sections is not allowed in OpenMP. Rewrite this example using your own software implementation for critical sections (see Exercise 3). Did you have to change your implementation in any way to still make it work correctly? Can you think of a different, less contrived situation where you might want to nest critical sections of the same name?

CHAPTER 6

Performance

6.1 Introduction

The previous chapters in this book have concentrated on explaining the OpenMP programming language. By now, you should be proficient in writing parallel programs using the different constructs available in the language. This chapter takes a moment to step back and ask why we would want to write a parallel program. Writing a program in OpenMP does not offer esthetic advantages over writing serial programs. It does not provide algorithmic capabilities that are not already available in straight Fortran, C, or C++. The reason to program in OpenMP is performance. We want to utilize the power of multiple processors to solve problems more quickly.

Some problems lend themselves naturally to a programming style that will run efficiently on multiple processors. Assuming access to a good compiler and sufficiently large data sets, it is difficult, for example, to program a matrix multiply routine that will not run well in parallel. On the other hand, other problems are easy to code in ways that will run even slower in parallel than the original serial code. Modern machines are quite complicated, and parallel programming adds an extra dimension of complexity to coding.


In this chapter, we attempt to give an overview of factors that affect parallel performance in general and also an overview of modern machine characteristics that affect parallelism. The variety of parallel machines developed over the years is quite large. The performance characteristics of a vector machine such as the SGI Cray T90 are very different from the performance characteristics of a bus-based multiprocessor such as the Compaq ProLiant 7000, and both are very different from a multithreaded machine such as the Tera MTA. Tuning a program for one of these machines might not help, and might even hinder, performance on another. Tuning for all possible machines is beyond the scope of this chapter.

Although many styles of machines still exist, in recent years a large percentage of commercially available shared memory multiprocessors have shared fundamental characteristics. In particular, they have utilized standard microprocessors connected together via some type of network, and they have contained caches close to the processor to minimize the time spent accessing memory. PC-derived machines, such as those available from Compaq, as well as workstation-derived machines, such as those available from Compaq, HP, IBM, SGI, and Sun, all share these fundamental characteristics. This chapter will concentrate on getting performance on these types of machines. Although the differences between them can be large, the issues that affect performance on each one are remarkably similar.

In order to give concrete examples with real-world numbers, we will choose one machine, the SGI Origin 2000, to generate sample numbers. While the exact numbers will differ, other cache-based microprocessor machines will have similar characteristics. The Origin 2000 we use contains sixteen 195 MHz MIPS R10000 microprocessors, each containing a 32 KB on-chip data cache and a 4 MB external cache. All codes were compiled with version 7.2.1 of the MIPSPro Fortran compilers.

A few core issues dominate parallel performance: coverage, granularity, load balancing, locality, and synchronization. Coverage is the percentage of a program that is parallel. Granularity refers to how much work is in each parallel region. Load balancing refers to how evenly balanced the work load is among the different processors. Locality and synchronization refer to the cost to communicate information between different processors on the underlying system. The next section (Section 6.2) will cover these key issues. In order to understand locality and synchronization, some knowledge of the machine architecture is required. We will by necessity, then, be forced to digress a bit into a discussion of caches.

Once we have covered the core issues, Section 6.3 will discuss methodologies for performance tuning of application codes. That section will discuss both tools and general guidelines on how to approach the problem of tuning a particular code. Finally we will have two sections covering more advanced topics: dynamic threads in Section 6.4 and NUMA (Non-Uniform Memory Access) considerations in Section 6.5.

6.2 Key Factors That Impact Performance

The key attributes that affect parallel performance are coverage, granularity, load balancing, locality, and synchronization. The first three are fundamental to parallel programming on any type of machine. The concepts are also fairly straightforward to understand. The latter two are highly tied in with the type of hardware we are considering—cache-based microprocessor systems. Their effects are often more surprising and harder to understand, and their impact can be huge.

6.2.1 Coverage and Granularity

In order to get good performance on a parallel code, it is necessary to parallelize a sufficiently large portion of the code. This is a fairly obvious concept, but what is less obvious and what is perhaps even counterintuitive is that as the number of processors is increased, the performance of the application can become dominated by the serial portions of the program, even when those portions were relatively unimportant in the serial execution of the program. This idea is captured in Amdahl's law, named after the computer architect Gene Amdahl. If F is the fraction of the code that is parallelized and Sp is the speedup achieved in the parallel sections of the code, the overall speedup S is given by

    S = \frac{1}{(1 - F) + \dfrac{F}{S_p}}

The formula can easily be derived as follows. In a serial execution, the program will execute in Ts time. In the parallel execution, the serial portion (1 - F) of the code will execute in time (1 - F)Ts, while the parallel portion (F) will execute in time

    \frac{F\,T_s}{S_p}

Adding these two numbers together gives us a parallel execution time of

    (1 - F)\,T_s + \frac{F\,T_s}{S_p}

Dividing the serial execution time by the parallel execution time gives us Amdahl's law.
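Carrying out that division explicitly (our rendering of the step just described):

    S = \frac{T_s}{(1 - F)\,T_s + \dfrac{F\,T_s}{S_p}} = \frac{1}{(1 - F) + \dfrac{F}{S_p}}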

The key insight in Amdahl's law is that no matter how successfully we parallelize the parallel portion of a code and no matter how many processors we use, eventually the performance of the application will be completely limited by F, the proportion of the code that we are able to parallelize. If, for example, we are only able to parallelize code for half of the application's runtime, the overall speedup can never be better than two because no matter how fast the parallel portion runs, the program will spend half the original serial time in the serial portion of the parallel code. For small numbers of processors, Amdahl's law has a moderate effect, but as the number of processors increases, the effect becomes surprisingly large. Consider the case where we wish to achieve within 10% of linear speedup, and the parallel portion of the code does speed up linearly. With two processors, plugging in the formula for the case that the overall speedup is 1.8, we get that

    1.8 = \frac{1}{(1 - F) + \dfrac{F}{2}}

or F = 89%. Doing the same calculation for 10 processors and an overall speedup of nine, we discover that F = 99%. Portions of the code that took a meaningless 1% of the execution time when run on one processor become extremely relevant when run on 10 processors.

Through Amdahl's law, we have shown that it is critical to parallelize the large majority of a program. This is the concept of coverage. High coverage by itself is not sufficient to guarantee good performance. Granularity is another issue that affects performance. Every time the program invokes a parallel region or loop, it incurs a certain overhead for going parallel. Work must be handed off to the slaves and, assuming the nowait clause was not used, all the threads must execute a barrier at the end of the parallel region or loop. If the coverage is perfect, but the program invokes a very large number of very small parallel loops, then performance might be limited by granularity.

The exact cost of invoking a parallel loop is actually quite complicated. In addition to the costs of invoking the parallel loop and executing the barrier, cache and synchronization effects can greatly increase the cost. These two effects will be discussed in Sections 6.2.3 and 6.2.4. In this section, we are concerned with the minimum cost to invoke parallelism, ignoring these effects. We therefore measured the overhead to invoke an empty parallel do (a parallel do loop containing no work—i.e., an empty body; see Example 6.1) given different numbers of processors on the SGI Origin 2000. The time, in cycles, is given in Table 6.1.

Table 6.1 Time in cycles for empty parallel do.

      Processors    Cycles
       1            1800
       2            2400
       4            2900
       8            4000
      16            8000

Example 6.1 An empty parallel do loop.

!$omp parallel do
      do ii = 1, 16
      enddo
!$omp end parallel do

In general, one should not parallelize a loop or region unless it takes significantly more time to execute than the parallel overhead. Therefore it may be worthwhile to determine the corresponding numbers for your specific platform. Making these types of measurements brings to mind once again the issue of loop-level parallelism versus domain decomposition. Can we reduce the parallel overhead by doing a coarse-grained parallel region? To check, we measured the amount of time needed to perform just an empty !$omp do, contained within an outer parallel region (Example 6.2). The time, in cycles, is given in Table 6.2. The !$omp do scales much better than the !$omp parallel do. So, we can significantly decrease our overhead by using the coarse-grained approach.

Example 6.2 An empty !$omp do.

!$omp do
      do j = 1, 16
      enddo
!$omp enddo
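In C, the coarse-grained approach amounts to creating one parallel region around several loops, each of which then needs only a work-sharing for construct. The sketch below is ours and the array names are placeholders; the implicit barrier at the end of the first for is what makes it safe for the second loop to read a.

      #include <omp.h>

      #define N 10000

      void phases(double *a, double *b)
      {
          int i;

          /* One parallel region is created once; each loop inside it is only
             a work-sharing construct, which Table 6.2 suggests is much cheaper
             than creating a fresh parallel region per loop. */
          #pragma omp parallel private(i)
          {
              #pragma omp for
              for (i = 0; i < N; i++)
                  a[i] = 2.0 * b[i];

              #pragma omp for
              for (i = 0; i < N; i++)
                  b[i] = a[i] + 1.0;
          }
      }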

6.2.2 Load Balance

A chain is only as strong as its weakest link. The analog in parallel processing is that a parallel loop (assuming no nowait) will not complete until the last processor completes its iterations.

Table 6.2 Time in cycles for empty !$omp do.

      Processors    Cycles
       1            2200
       2            1700
       4            1700
       8            1800
      16            2900

If some processors have more work to do than other processors, performance will suffer. As previously discussed, OpenMP allows iterations of loops to be divided among the different processors in either a static or dynamic scheme (based on the discussion in Chapter 3, guided self-scheduling, or GSS, is really a special form of dynamic). With static schemes, iterations are divided among the processors at the start of the parallel region. The division is only dependent on the number of iterations and the number of processors; the amount of time spent in each iteration does not affect the division. If the program spends an equal amount of time in every iteration, both static and dynamic schemes will exhibit good load balancing. Simple domain decompositions using non-sparse algorithms often have these properties. In contrast, two classes of codes might suffer from load balancing problems: codes where the amount of work is dependent on the data and codes with triangular loops.

Consider the problem of scaling the elements of a sparse matrix—that is, multiplying each element by a constant. Assume the matrix is stored as an array of pointers to row vectors. Each row vector is an array of column positions and values. A code fragment to execute this algorithm is given in Example 6.3. Since the code manipulates pointers to arrays, it is more easily expressed in C++.

Example 6.3 Scaling a sparse matrix.

      struct SPARSE_MATRIX {
          double *element;    /* actual elements of array */
          int    *index;      /* column position of elements */
          int    num_cols;    /* number of columns in this row */
      } *rows;
      int num_rows;

      ...
      for (i = 0; i < num_rows; i++) {
          SPARSE_MATRIX tmp = rows[i];
          for (j = 0; j < tmp.num_cols; j++) {
              tmp.element[j] = c * tmp.element[j];
          }
      }

A straightforward parallelization scheme would put a parallel for pragma on the outer, i, loop. If the data is uniformly distributed, a simple static schedule might perform well, but if the data is not uniformly distributed, so that some rows have many more points than others, load balancing can be a problem with static schedules. One thread might get many more points than another, and it might then finish much later than the other. For these types of codes, the load balancing problem can be solved by using a dynamic schedule. With a schedule type of (dynamic,1), each thread will take one row at a time and will only get another row when it finishes the current row. When the first thread finishes all its work, the slowest thread can have at most one row left to process. As long as a single row does not have a significant fraction of the total work, all the threads will finish in about the same time.

You might ask why not always use a dynamic scheduling algorithm. If load balancing is the most important contributor to performance in the algorithm, perhaps we should, but there are also costs to using dynamic schedules. The first cost is a synchronization cost. With a (dynamic,1), each thread must go to the OpenMP runtime system after each iteration and ask for another iteration to execute. The system must take care to hand out different iterations to each processor. That requires synchronization among the processors. We will discuss synchronization in Section 6.2.4. The amount of synchronization can be alleviated by using a larger chunk size in the dynamic clause. Rather than handing out work one iteration at a time, we can hand out work several iterations at a time. By choosing a sufficiently large chunk size, we can eliminate most or all of the synchronization cost. The trade-off is that as we increase the chunk size, we may get back the load balancing problem. In an extreme case, we could choose the chunk size to be num_rows/P, where P is the number of threads. We would have the same work distribution as in the static case and therefore the same load balancing problem. The hope is that there is a happy medium—a chunk size large enough to minimize synchronization yet small enough to preserve load balancing. Whether such a point exists depends on the particular piece of code.
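A sketch of this scheme, applied to the loop of Example 6.3 and reusing its declarations, might look as follows (written here in C, so the struct keyword appears); the chunk size of 1 mirrors the (dynamic,1) schedule discussed above and is only illustrative.

      /* Assumes the declarations of rows and num_rows from Example 6.3. */
      void scale_matrix(double c)
      {
          int i, j;

          /* Each thread grabs one row at a time, so long rows and short rows
             still balance out across the team. */
          #pragma omp parallel for private(j) schedule(dynamic, 1)
          for (i = 0; i < num_rows; i++) {
              struct SPARSE_MATRIX tmp = rows[i];
              for (j = 0; j < tmp.num_cols; j++)
                  tmp.element[j] = c * tmp.element[j];
          }
      }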

The second downside to using dynamic schedules is data locality. We will discuss this issue in more detail in the next section. For now, keep in mind that locality of data references may be an issue depending on the algorithm. If locality is an issue, there is a good chance that the problem cannot be overcome by using larger chunk sizes. Locality versus load balancing is perhaps the most important trade-off in parallel programming. Managing this trade-off can be the key to achieving good performance.

The sparse matrix example is a case where the amount of work per iteration of a loop varies in an unpredictable, data-dependent manner. There are also examples where the amount of work varies but varies in a predictable manner. Consider the simple example of scaling a dense, triangular matrix in Example 6.4.

Example 6.4 Scaling a dense, triangular matrix.

      for (i = 0; i < n - 1; i++) {
          for (j = i + 1; j < n; j++) {
              a[i][j] = c * a[i][j];
          }
      }

We could parallelize this loop by adding a parallel for pragma to the i loop. Each iteration has a different amount of work, but the amount of work varies regularly. Each successive iteration has a linearly decreasing amount of work. If we use a static schedule without a chunk size, we will have a load balance problem. The first thread will get almost twice as much work as the average thread. As with the sparse matrix example, we could use a dynamic schedule. This could solve the load balancing problem but create synchronization and locality problems. For this example, though, we do not need a dynamic schedule to avoid load balancing problems. We could use a static schedule with a relatively small chunk size. This type of schedule will hand off iterations to the threads in a round-robin, interleaved manner. As long as n is not too small relative to the number of threads, each thread will get almost the same amount of work. Since the iterations are partitioned statically, there are no synchronization costs involved. As we shall discuss in the next section, there are potential locality issues, but they are likely to be much smaller than the locality problems with dynamic schedules.

Based on the previous discussion, we can see that a static schedule can usually achieve good load distribution in situations where the amount of work per iteration is uniform or if it varies in a predictable fashion as in the triangular loop above. When the work per iteration varies in an unpredictable manner, then a dynamic or GSS schedule is likely to achieve the best load balance.
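As a concrete instance of the predictable case just summarized, the triangular loop of Example 6.4 could be given an interleaved static schedule as sketched below (our C sketch; the chunk size of 1 and the double** argument type are illustrative choices).

      /* Interleaved static schedule for the triangular loop of Example 6.4:
         early (long) and late (short) rows are spread evenly across threads
         with no dynamic-scheduling overhead. */
      void scale_triangle(double **a, int n, double c)
      {
          int i, j;

          #pragma omp parallel for private(j) schedule(static, 1)
          for (i = 0; i < n - 1; i++) {
              for (j = i + 1; j < n; j++)
                  a[i][j] = c * a[i][j];
          }
      }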

However, we have implicitly assumed that all threads arrive at the parallel loop at the same time, and therefore distributing the work uniformly is the best approach. If the parallel loop is instead preceded by a do or a sections construct with a nowait clause, then the threads may arrive at the do construct at different times. In this scenario a static distribution may not be the best approach even for regular loops, since it would assign each thread the same amount of work, and the threads that finish their portion of the iterations would simply have to wait for the other threads to arrive and catch up. Instead, it may be preferable to use a dynamic or a GSS schedule—this way the earliest arriving threads would have an opportunity to keep processing some of the pending loop iterations if some other threads are delayed in the previous construct, thereby speeding up the execution of the parallel loop.

We have thus far focused on distributing the work uniformly across threads so that we achieve maximal utilization of the available threads and thereby minimize the loop completion time. In the next section we address the other crucial factor affecting the desired loop schedule—namely, data locality.

6.2.3 Locality

On modern cache-based machines, locality is often the most critical factor affecting performance. This is the case with single-processor codes; it is even more true with multiprocessor codes. To understand how locality affects performance, we need to have some basic understanding of how caches work. We give a basic overview of caches in the following subsection. For a detailed discussion of these issues, see [HP 90]. If you already understand caches, you can safely skip the subsection.

Caches

The programmer can view a computer as a system comprising processing elements and memory. Given a Fortran statement a(i) = b(i) + c(i), the programmer can view the system as bringing in from memory to the processor the values present in locations b(i) and c(i), computing the sum inside the processor, and returning the result to the memory location reserved for a(i). Programmers might like to view memory as monolithic. They would like there to be no fundamental difference between the memory location for one variable, a(i), and another, b(i). On uniprocessor systems (we will discuss multiprocessors a bit later in this subsection), from a correctness point of view, memory really is monolithic. From a performance point of view, though, memory is not monolithic. It might take more time to bring a(i) from memory into the processor than it does to bring in b(i), and bringing in a(i) at one point in time of the program's execution might take longer than bringing it in at a later point in time.

Two factors lead to the design of machines with varying memory latencies. The first factor is that it is both cheaper and more feasible to build faster memory if that memory is small. Multiple technologies exist for building memory. Faster technologies tend to be more expensive (presumably one can also devise slow and expensive technology, but no one will use it). That means that for a given budget, we can pay for a small amount of faster memory and a larger amount of slower memory. We cannot replace the slower memory with the faster one without raising costs. Even ignoring costs, it is technologically more feasible to make accesses to a smaller memory faster than accesses to a larger one. Part of the time required to bring data from memory into a processor is the time required to find the data, and part of the time is the time required to move the data. Data moves using electrical signals that travel at a finite velocity. That means that the closer a piece of data is to the processor, the faster it can be brought into the processor. There is a limit to how much space exists within any given distance from the processor. With microprocessors, memory that can be put on the same chip as the processor is significantly faster to access than memory that is farther away on some other chip, and there is a limit to how much memory can be put on the processor chip.

The fact that we can build smaller amounts of faster memory would not necessarily imply that we should. Let's say, for example, that we can build a system with 10 times as much slow memory that is 10 times less expensive. Let's say that memory references are distributed randomly between the fast and slow memory. Since there is more slow memory, 90% of the references will be to the slow memory. Even if the fast memory is infinitely fast, we will only speed up the program by about 10% (this percentage can be derived using another variation of Amdahl's law). The costs, on the other hand, have increased by a factor of two. It is unlikely that this trade-off is worthwhile.

The second factor that leads to the design of machines with varying memory latencies, and that makes it worthwhile to build small but fast memories, is the concept of locality. Locality is the property that if a program accesses a piece of memory, there is a much higher than random probability that it will again access the same location "soon." A second aspect of locality is that if a program accesses a piece of memory, there is a much higher than random probability that it will access a nearby location soon. The first type of locality is called temporal locality; the second is called spatial. Locality is not a law of nature. It is perfectly possible to write a program that has completely random or systematically nonlocal memory behavior. Locality is just an empirical rule. Many programs naturally have locality, and many others will be written to have locality if the prevalent machines are built to take advantage of locality.

Let us consider some examples that illustrate locality. Let us say that we wish to zero a large array. We can zero the elements in any order, but the natural way to zero them is in lexical order, that is, first a(0), then a(1), then a(2), and so on. Zeroing an array this way exhibits spatial locality. Now let's assume that we want to apply some type of convolution filter to an image (e.g., a simple convolution may update each element to be a weighted average of its neighboring values in a grid). Each element in the image will be touched a number of times equal to the size of the filter. If the filter is moved lexically through the image, all accesses to a given point will occur close in time. This type of algorithm has both spatial and temporal locality.

The prevalence of locality allows us to design a small and fast memory system that can greatly impact performance. Consider again the case where 10% of the memory is fast and 90% is slow. If accesses are random and only 10% of the memory accesses are to the fast memory, the cost of the fast memory is probably too large to be worthwhile. Caches, on the other hand, provide a way to exploit locality to ensure that a much larger percentage of your memory references are to faster memory.

Caches essentially work as follows. All information is kept in the larger, slower memory. Every time the system brings a memory reference into the processor, it "caches" a copy of the data in the smaller, faster memory, the cache. For simplicity, we will use the generic term "memory" to refer to the slower, bigger memory and "cache" to refer to the faster, smaller memory. Before performing a memory reference, the system checks to see if the data is inside the cache. If the data is found, the system does not go to memory; it just retrieves the copy of the data found in the cache. Only when the data is not in the cache does the system go to memory. Since the cache is smaller than the memory, it cannot contain a copy of all the data in the memory. In other words, the cache can fill up. In such cases, the cache must remove some other data in order to make room for the new data.

Whenever the user writes some data (i.e., assigns some variable), the system has two options. In a write-through cache, the data is immediately written back to memory. In such a system, memory and cache are always consistent; the cache always contains the same data found in the memory. Write-through caches are not as bad as they might seem. No computation depends on a write completing so the system can continue to compute while the write to memory is occurring. On the other hand, write-through caches do require moving large amounts of data from the processor to memory. This increases the system overhead and makes it harder to load other data from memory concurrently.

The alternative to a write-through cache is a write-back cache. In write-back caches, the system does not write the data to memory immediately. The cache is not kept consistent with memory. Instead, a bit is added to each entry in the cache indicating whether the entry is dirty, that is, different from the value in main memory. When the system evicts an entry from the cache (in order to make room for other data), the system checks the bit. If it is dirty, main memory is updated at that point. The advantage of the write-back scheme is that a processor can potentially write the same location multiple times before having to write a value back to memory. In today's systems, most caches that are not on the same chip as the processor are write-back caches. For on-chip caches, different systems use write-back or write-through.

Let us look in more detail at caches by considering an example of a simple cache. Assume that memory is referenced via 32-bit addresses. Each address refers to a byte in memory. That allows for up to 4 GB of memory. Assume a 64 KB cache organized as 8192 entries of 8 bytes (1 double word) each. One way to build the cache is to make a table as in Figure 6.1. We use bits 3 through 15 of a memory address to index into the table, and use the three bits 0 through 2 to access the byte within the 8 bytes in that table entry. Each address maps into only one location in the cache.[1] Each entry in the cache contains the 8 bytes of data, 1 bit to say whether the entry is dirty, and a 16-bit tag. Many addresses map into the same location in the cache, in fact all the addresses with the same value for bits 3 through 15. The system needs some way of knowing which address is present in a given cache entry. The 16-bit tag contains the upper 16 bits of the address present in the given cache entry. These 16 bits, along with the 13 bits needed to select the table entry and the 3 bits to select the byte within a table entry, completely specify the original 32-bit address.

Every time the processor makes a memory reference, the system takes bits 3 through 15 of the address to index into the cache. Given the cache entry, it compares the tag with the upper 16 bits of the memory address. If they match, we say that there is a "cache hit." Bits 0 through 2 are used to determine the location within the cache entry. The data is taken from cache, and memory is never accessed. If they do not match, we say there is a "cache miss." The system goes to memory to find the data. That data replaces the data currently present in the entry of the cache. If the current data is dirty, it must first be written out to memory. If it is not dirty, the system can simply throw it out.

[1] Such caches are called direct mapped caches. In contrast, set associative caches allow one memory address to map into more than one table entry. For simplicity, we will stick to direct mapped caches although most real-world caches are set associative.

[Figure 6.1 Simplified cache. A 32-bit memory address is split into a 16-bit tag (bits 31-16), a 13-bit index (bits 15-3), and a 3-bit byte offset (bits 2-0); the cache is a table of 8192 lines, each holding a tag, 8 bytes of data, and a dirty/clean bit.]
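To make the geometry in Figure 6.1 concrete, the following C sketch of ours splits a 32-bit address into the tag, index, and offset fields described above; the bit widths match this simplified 64 KB cache, not any particular machine.

      #include <stdio.h>
      #include <stdint.h>

      /* Field widths for the simplified direct-mapped cache of Figure 6.1:
         8192 entries (13 index bits) of 8 bytes each (3 offset bits); the
         remaining 16 bits of a 32-bit address form the tag. */
      #define OFFSET_BITS 3
      #define INDEX_BITS  13

      int main(void)
      {
          uint32_t addr   = 0x12345678u;  /* an arbitrary example address */
          uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
          uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
          uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

          printf("addr = 0x%08x: tag = 0x%04x, index = %u, offset = %u\n",
                 addr, tag, index, offset);
          return 0;
      }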

This cache allows us to exploit temporal locality. If we reference a memory location twice and in between the two references the processor did not issue another reference with the same value of the index bits, the second reference will hit in the cache. Real caches are also designed to take advantage of spatial locality. Entries in a cache are usually not 8 bytes. Instead they are larger, typically anywhere from 32 to 256 contiguous bytes. The set of entries in a single cache location is called the cache line. With longer cache lines, if a reference, a(i), is a cache miss, the system will bring into the cache not only a(i) but also a(i+1), a(i+2), and so on.

We have described a single-level cache. Most systems today use multiple levels of cache. With a two-level cache system, one can view the first cache as a cache for the second cache, and the second cache as a cache for memory. The Origin 2000, for example, has a 32 KB primary data cache on the MIPS R10000 chip and a 1-4 MB secondary cache off chip. Given a memory reference, the system attempts to find the value in the smaller primary cache. If there is a miss there, it attempts to find the reference in the secondary cache. If there is a miss there, it goes to memory.

Multiprocessors greatly complicate the issue of caches. Given that OpenMP is designed for shared memory multiprocessors, let us examine the memory architecture of these machines. On a shared memory machine, each processor can directly access any memory location in the entire system. The issue then arises whether to have a single large cache for the entire system or multiple smaller caches for each processor. If there is only one cache, that cache by definition will be far away from some processors since a single cache cannot be close to all processors.

might be faster than memory, it will still be slower to access memory in a faraway cache than to access memory in a nearby cache. Consequently, most systems take the approach of having a cache for every processor. In an architecture with per-processor caches, if a processor references a particular location, the data for that location is placed in the cache of the processor that made the reference. If multiple processors reference the same location, or even different locations in the same cache line, the data is placed in multiple caches. As long as processors only read data, there is no problem in putting data in multiple caches, but once processors write data, maintaining the latest correct value of the data becomes an issue; this is referred to as the cache coherence problem. Let us say that processor 1 reads a location. That data is put inside its cache. Now, let's say that processor 2 writes the same location. If we are not careful, the data inside processor 1's cache will be old and invalid. Future reads by processor 1 might return the wrong value. Two solutions can be used to avoid this problem. In an update-based protocol, when processor 2 writes the location, it must broadcast the new value to all the caches that hold the data. The more common protocol, though, is an invalidation-based protocol. When processor 2 writes a location, it asks for exclusive access to the cache line. All other caches containing this line must invalidate their entries. Once a processor has exclusive access, it can continue to write the line as often as it likes. If another processor again tries to read the line, the other processor needs to ask the writing processor to give up exclusive access and convert the line back to shared state.
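To make the address arithmetic concrete, the following sketch computes the byte offset, table index, and tag for the simple direct mapped cache of Figure 6.1. The subroutine name and the use of the Fortran IBITS intrinsic are our own illustrative choices; they are not part of OpenMP or of any particular hardware interface.

      ! Sketch: decompose a 32-bit address for the direct mapped cache
      ! of Figure 6.1 (8192 lines of 8 bytes). Assumes a default integer
      ! of at least 32 bits.
      subroutine decompose(addr, offset, index, tag)
        integer, intent(in)  :: addr
        integer, intent(out) :: offset, index, tag
        offset = ibits(addr, 0, 3)     ! bits 2-0: byte within the line
        index  = ibits(addr, 3, 13)    ! bits 15-3: which of the 8192 lines
        tag    = ibits(addr, 16, 16)   ! bits 31-16: compared against the stored tag
      end subroutine decompose

On a reference, the cache uses index to pick a line and declares a hit only if the stored tag equals tag; offset then selects the byte within the 8-byte line.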

Caches and Locality

In the previous subsection, we gave a basic overview of caches on multiprocessor systems. In this subsection, we go into detail on how caches can impact the performance of OpenMP codes. Caches are designed to exploit locality, so the performance of a parallel OpenMP code can be greatly improved if the code exhibits locality. On the Origin 2000, for example, the processor is able to do a load or a store every cycle if the data is already in the primary cache. If the data is in neither the primary nor the secondary cache, it can take about 70 cycles [HLK 97].2 Thus, if the program has no

2 The approximation of 70 cycles is a simplified number to give the reader a feel for the order of magnitude of time. In reality, other factors greatly complicate the estimate. First, the R10000 is an out-of-order machine that can issue multiple memory references in parallel. If your code exhibits instruction-level parallelism, this effect can cut the perceived latency significantly. On the other hand, the Origin 2000 is a ccNUMA machine, meaning that some memory is closer to a processor than other memory. The 70-cycles figure assumes you are accessing the closest memory.

locality, it can slow down by a factor of 70. Spatial locality is often easier to achieve than temporal locality. Stride-1 memory references3 have perfect spatial locality. The cache lines on the secondary cache of the Origin are 128 bytes, or 16 double words. If the program has perfect spatial locality, it will only miss every 16 references. Even so, without temporal locality, the code will still slow down by about a factor of four (since we will miss once every 16 references, and incur a cost of 70 cycles on that miss). Even on uniprocessors, there is much that can be done to improve locality. Perhaps the classic example on uniprocessors for scientific codes is loop interchange to make references stride-1. Consider the following loop nest:

      do i = 1, n
         do j = 1, n
            a(i, j) = 0.0
         enddo
      enddo

Arrays in Fortran are stored in column-major order, meaning that the columns of the array are laid out one after the other in memory. As a result, elements from successive rows within the same column of an array are laid out adjacently in memory; that is, the address of a(i,j) is just before the address of a(i+1,j) but is n elements away from the address of a(i,j+1), where n is the size of the first dimension of a. In the code example, successive iterations of the inner loop do not access successive locations in memory. Now, there is spatial locality in the code; given a reference to a(i,j), we will eventually access a(i+1,j). The problem is that it will take time, in fact n iterations. During that time, there is some chance that the line containing a(i,j) and a(i+1,j) will be evicted from the cache. Thus, we might not be able to exploit the spatial locality. If we interchange the two loops as follows:

      do j = 1, n
         do i = 1, n
            a(i, j) = 0.0
         enddo
      enddo

3 Stride refers to the distance in memory between successive references to a data structure. Stride-1 therefore implies that multiple references are to successive memory locations, while stride-k implies that multiple references are to locations k bytes apart.

the array references are stride-1. Successive references in time are adjacent in memory. There is no opportunity for the cache line to be evicted, and we are able to fully exploit spatial locality. Many compilers will automatically interchange loops for the user, but sometimes more complicated and global transformations are needed to improve locality. Discussing uniprocessor cache optimizations in detail is beyond the scope of this book. Multiprocessor caches add significant com- plications. With uniprocessors, the user need only worry about locality. With multiprocessors, the user must worry about restricting locality to a single processor. Accessing data that is in some other processor’s cache is usually no faster than accessing data that is in memory. In fact, on the Ori- gin 2000, accessing data that is in someone else’s cache is often more expensive than accessing data that is in memory. With uniprocessors, the user needs to insure that multiple references to the same or nearby loca- tions happen close to each other in time. With multiprocessors, the user must also insure that other processors do not touch the same cache line.

Locality and Parallel Loop Schedules

The effect of locality and multiprocessors can perhaps be best seen in the interaction between loop schedules and locality. Consider again our example of scaling a sparse matrix. Imagine that this scaling is part of an interactive image-processing algorithm where the user might scale the same matrix multiple times. Assume that the total size of the matrix is small enough to fit in the aggregate caches of the processors. In other words, each processor's portion of the matrix is small enough to fit in its cache. After one scaling of the matrix, the matrix will be in the aggregate caches of the processors; each processor's cache will contain the portion of the matrix scaled by the particular processor. Now when we do a second scaling of the matrix, if a processor receives the same portions of the matrix to scale, those portions will be in its cache, and the scaling will happen very fast. If, on the other hand, a processor receives different portions of the matrix to scale, those portions will be in some other processor's cache, and the second scaling will be slow. If we had parallelized the code with a static schedule, every invocation of the scaling routine would divide the iterations of the loop the same way; the processor that got iteration i on the first scaling would get the same iteration on the second scaling. Since every iteration i always touches the same data, every processor would scale the same portions of the matrix. With a dynamic schedule, there is no such guarantee. Part of the appeal of dynamic schedules is that iterations are handed out to the first processor that needs more work. There is

no guarantee that there will be any correlation between how work was handed out in one invocation versus another. How bad is the locality effect? It depends on two different factors. The first is whether each processor’s portion of the data is small enough to fit in cache. If the data fits in cache, a bad schedule means that each proces- sor must access data in some other cache rather than its own. If the data is not small enough, most or all of the data will be in memory, regardless of the schedule. In such cases, dynamic schedules will have minimal impact on performance. To measure the effect, we parallelized a dense version of the matrix scaling algorithm applied repeatedly to a piece of data:

      do i = 1, n
         do j = 1, n
            a(j, i) = 2.0 * a(j, i)
         enddo
      enddo
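Parallelized versions of this loop might look like the sketch below. The dynamic chunk size of 16 is purely illustrative; the experiments described next simply used a chunk large enough to keep synchronization costs small.

! Static schedule: iteration i goes to the same thread on every
! invocation, so each processor rescales the same columns each time.
!$omp parallel do schedule(static)
      do i = 1, n
         do j = 1, n
            a(j, i) = 2.0 * a(j, i)
         enddo
      enddo

! Dynamic schedule: chunks go to whichever thread asks next, so there
! is no guarantee of cache reuse across invocations.
!$omp parallel do schedule(dynamic, 16)
      do i = 1, n
         do j = 1, n
            a(j, i) = 2.0 * a(j, i)
         enddo
      enddo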

We experimented with three different data set sizes: 400 by 400, 1000 by 1000, and 4000 by 4000. We ran each data size with both a static and a dynamic schedule on both one and eight processors. The dynamic schedule was chosen with a chunk size sufficiently large to minimize synchronization costs. The sizes were chosen so that the 400 by 400 case would fit in the cache of a single processor, the 1000 by 1000 case would be bigger than one processor's cache but would fit in the aggregate caches of the eight processors, and the 4000 by 4000 case would not fit in the aggregate caches of the eight processors. We can see several interesting results from the data in Table 6.3. For both cases where the data fits in the aggregate caches, the static case is about a factor of 10 better than the dynamic case. The penalty for losing locality is huge. In the 400 by 400 case, the dynamic case is even slower than running on one processor. This shows that processing all the data from one's own cache is faster than processing one-eighth of the data from some other processor's cache. In the 1000 by 1000 case, the static schedule

Table 6.3 Static versus dynamic schedule for scaling.

    Size            Static Speedup    Dynamic Speedup    Ratio: Static/Dynamic
    400 x 400             6.2               0.6                    9.9
    1000 x 1000          18.3               1.8                   10.3
    4000 x 4000           7.5               3.9                    1.9

speeds up superlinearly, that is, more than by the number of processors. Processing one-eighth of the data from one's own cache is more than eight times faster than processing all the data from memory. For this data set size, the dynamic case does speed up a little bit over the single-processor case. Processing one-eighth of the data from someone else's cache is a little bit faster than processing all of the data from memory. Finally, in the large 4000 by 4000 case, the static case is faster than the dynamic, but not by such a large amount.4

We mentioned that there were two factors that influence how important locality effects are to the choice of schedule. The first is the size of each processor's data set. Very large data sets are less influenced by the interaction between locality and schedules. The second factor is how much reuse is present in the code processing a chunk of iterations. In the scaling example, while there is spatial locality, there is no temporal locality within the parallel loop (the only temporal locality is across invocations of the parallel scaling routine). If, on the other hand, there is a large amount of temporal reuse within a parallel chunk of work, scheduling becomes unimportant. If the data is going to be processed many times, where the data lives at the start of the parallel loop becomes mostly irrelevant. Consider, as an example, matrix multiply. Matrix multiplication does O(n³) computation on O(n²) data. There is, therefore, a lot of temporal locality as well as spatial locality. We timed a 1000 by 1000 matrix multiply on an eight-processor Origin 2000 using both static and dynamic schedules. The static schedule achieved a speedup of 7.5 while the dynamic achieved a speedup of 7.1.

From the scaling and matrix multiply examples we have given, one might conclude that it is always better to use static schedules, but both examples are cases with perfect load balance. Locality can make such a large difference that for cases such as the in-cache scaling example, it is probably always better to use a static schedule, regardless of load balance. On the other hand, for cases such as matrix multiply, the dynamic penalty is fairly small. If load balancing is an issue, dynamic scheduling would probably be better. There is a fundamental trade-off between load balancing and locality. Dealing with the trade-off is unfortunately highly dependent on your specific code.
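For reference, a straightforward way to parallelize the matrix multiply is to give each thread a block of columns of the result, as in the sketch below. This is not the exact code we timed; the array names are illustrative and c is assumed to start out zeroed.

!$omp parallel do schedule(static)
      do j = 1, n
         do k = 1, n
            do i = 1, n
               c(i, j) = c(i, j) + a(i, k) * b(k, j)
            enddo
         enddo
      enddo

Because each thread writes only its own columns of c, no synchronization is needed inside the loop, and each element of c is reused n times, which is the temporal locality referred to above.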

4 On some machines, dynamic and static schedules for the large data set might yield almost identical performance. The Origin 2000 is a NUMA machine, and on NUMA machines locality can have an impact on memory references as well as cache references. NUMA will be described in more detail at the end of the chapter.

False Sharing

Even with static schedules, good locality is not guaranteed. There are a few common, easy-to-fix errors that can greatly hurt locality. One of the more common is false sharing. Consider counting up the even and odd elements of an array in Example 6.5. Every processor accumulates its portion of the result into a local portion of an array, local_s. At the end of the work-shared parallel loop, each processor atomically increments the shared array, is, with its portion of the local array, local_s. Sounds good, no?

Example 6.5 Counting the odd and even elements of an array.

      integer local_s(2, MAX_NUM_THREADS)

!$omp parallel private(my_id)
      my_id = omp_get_thread_num() + 1

!$omp do schedule(static) private(index)
      do i = 1, n
         index = MOD(ia(i), 2) + 1
         local_s(index, my_id) = local_s(index, my_id) + 1
      enddo

!$omp atomic
      is(1) = is(1) + local_s(1, my_id)
!$omp atomic
      is(2) = is(2) + local_s(2, my_id)
!$omp end parallel

The problem comes with how we create local_s. Every cache line (entry in the table) contains multiple contiguous words, in the case of the Origin 2000, 16 eight-byte words. Whenever there is a cache miss, the entire cache line needs to be brought into the cache. Whenever a word is written, every other address in the same cache line must be invalidated in all of the other caches in the system. In this example, the different processors' portions of local_s are contiguous in memory. Since each processor's portion is significantly smaller than a cache line, each processor's portion of local_s shares a cache line with other processors' portions. Each time a processor updates its portion, it must first invalidate the cache line in all the other processors' caches. The cache line is likely to be dirty in some other cache. That cache must therefore send the data to the new processor before that processor updates its local portion. The cache line will thus ping-pong among the caches of the different processors, leading to poor

performance. We call such behavior false sharing because a cache line is being shared among multiple processors even though the different processors are accessing distinct data. We timed this example on an Origin using eight processors and a 1,000,000-element data array, ia. The parallel code slows down by a factor of 2.3 over the serial code. However, we can modify the code.

Example 6.6 Counting the odd and even elements of an array using distinct cache lines.

      integer local_s(2)

!$omp parallel private(local_s)
      local_s(1) = 0
      local_s(2) = 0
!$omp do schedule(static) private(index)
      do i = 1, n
         index = MOD(ia(i), 2) + 1
         local_s(index) = local_s(index) + 1
      enddo
!$omp atomic
      is(1) = is(1) + local_s(1)
!$omp atomic
      is(2) = is(2) + local_s(2)
!$omp end parallel

Instead of using a shared array indexed by the processor number to hold the local results, we use the private clause as shown in Example 6.6. The system insures that each processor’s local_s array is on different cache lines. The code speeds up by a factor of 7.5 over the serial code, or a factor of 17.2 over the previous parallel code. Other types of codes can also exhibit false sharing. Consider zeroing an array:

!$omp parallel do schedule(static)
      do i = 1, n
         do j = 1, n
            a(i, j) = 0.0
         enddo
      enddo

We have divided the array so that each processor gets a contiguous set of rows, as shown in Figure 6.2. Fortran, though, is a column major language, with elements of a column allocated in contiguous memory locations. Every column will likely use distinct cache lines, but multiple consecutive rows of the same column will use the same cache line.

Figure 6.2 Dividing rows across processors: each processor is assigned a contiguous block of rows of the array.

Figure 6.3 Dividing columns across processors: each processor is assigned a contiguous block of columns of the array.

If we parallelize the i loop, we divide the array among the processors by rows. In every single column, there will be cache lines that are falsely shared among the processors. If we instead interchange the loops, we divide the array among the processors by columns, as shown in Figure 6.3. For any consecutive pair of processors, there will be at most one cache line that is shared between the two. The vast majority of the false sharing will be eliminated.
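In code, the column-wise division is just the parallelized version of the interchanged loop shown earlier; a sketch:

!$omp parallel do schedule(static)
      do j = 1, n
         do i = 1, n
            a(i, j) = 0.0
         enddo
      enddo

Each thread now zeroes whole columns, so apart from at most one line at the boundary between two adjacent threads' blocks of columns, no cache line is written by more than one thread.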

Inconsistent Parallelization

Another situation that can lead to locality problems is inconsistent parallelization. Imagine the following set of two loops:

      do i = 1, N
         a(i) = b(i)
      enddo
      do i = 1, N
         a(i) = a(i) + a(i - 1)
      enddo

The first loop can be trivially parallelized. The second loop cannot easily be parallelized because every iteration depends on the value of a(i-1) written in the previous iteration. We might have a tendency to parallelize the first loop and leave the second one sequential. But if the arrays are small enough to fit in the aggregate caches of the processors, this can be the wrong decision. By parallelizing the first loop, we have divided the a array among the caches of the different processors. Now the serial loop starts and all the data must be brought back into the cache of the master processor. As we have seen, this is potentially very expensive, and it might therefore have been better to let the first loop run serially.
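In code, the tempting version parallelizes only the first loop, as in this sketch:

!$omp parallel do
      do i = 1, N
         a(i) = b(i)
      enddo
! The recurrence cannot easily be parallelized and stays serial.
      do i = 1, N
         a(i) = a(i) + a(i - 1)
      enddo

If a and b fit in the aggregate caches, the first loop scatters a across all the processors' caches, and the serial loop then pays to pull it all back to the master's cache; running both loops serially may well be faster.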

6.2.4 Synchronization

The last key performance factor that we will discuss is synchronization. We will consider two types of synchronization: barriers and mutual exclusion.

Barriers

Barriers are used as a global point of synchronization. A typical use of barriers is at the end of every parallel loop or region. This allows the user to consider a parallel region as an isolated unit and not have to consider dependences between one parallel region and another or between one parallel region and the serial code that comes before or after the parallel region. Barriers are very convenient, but on a machine without special support for them, barriers can be very expensive. To measure the time for a barrier, we timed the following simple loop on our Origin 2000:

!$omp parallel
      do i = 1, 1000000
!$omp barrier
      enddo
!$omp end parallel

On an eight-processor system, it took approximately 1000 cycles per barrier. If the program is doing significantly more than 1000 cycles of work in between barriers, this time might be irrelevant, but if we are parallelizing very small regions or loops, the barrier time can be quite significant. Note also that the time for the actual barrier is not the only cost to using barriers. Barriers synchronize all the processors. If there is some load imbalance and some processors reach the barrier later than other processors, all the processors have to wait for the slow ones. On some codes, this time can be the dominant effect.

So, how can we avoid barriers? First, implicit barriers are put at the end of all work-sharing constructs. The user can avoid these barriers by use of the nowait clause. This allows all threads to continue processing at the end of a work-sharing construct without having to wait for all the threads to complete. Of course, the user must insure that it is safe to eliminate the barrier in this case. Another technique for avoiding barriers is to coalesce multiple parallel loops into one. Consider Example 6.7.

Example 6.7 Code with multiple adjacent parallel loops.

!$omp parallel do
      do i = 1, n
         a(i) = ...
      enddo

!$omp parallel do
      do i = 1, n
         b(i) = a(i) + ...
      enddo

There is a dependence between the two loops, so we cannot simply run the two loops in parallel with each other, but the dependence is only within corresponding iterations. Iteration i of the second loop cannot proceed until iteration i of the first loop is finished, but all other iterations of the second loop do not depend on iteration i. We can eliminate a barrier (and also the overhead cost for starting a parallel loop) by fusing the two loops together as follows in Example 6.8.

Example 6.8 Coalescing adjacent parallel loops.

!$omp parallel do
      do i = 1, n
         a(i) = ...
         b(i) = a(i) + ...
      enddo
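When two work-shared loops inside a single parallel region are genuinely independent, the nowait clause mentioned above removes the barrier after the first loop. The arrays and loop bodies in the following sketch are made up for illustration:

!$omp parallel
!$omp do
      do i = 1, n
         a(i) = 2.0 * a(i)
      enddo
!$omp enddo nowait
!$omp do
      do i = 1, m
         c(i) = c(i) + 1.0
      enddo
!$omp enddo
!$omp end parallel

Because the second loop never touches a, a thread that finishes its share of the first loop can move straight on without waiting for the others.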

Barriers are an all-points form of synchronization. Every processor waits for every other processor to finish a task. Sometimes, this is excessive; a processor only needs to wait for one other processor. Consider a simple two-dimensional recursive convolution:

      do j = 2, n - 1
         do i = 2, n - 1
            a(i, j) = 0.5 * a(i, j) + 0.125 * (a(i - 1, j) + a(i + 1, j) + a(i, j - 1) + a(i, j + 1))
         enddo
      enddo

Neither loop is parallel by itself, but all the points on a diagonal can be run in parallel. One potential parallelization technique is to skew the loop so that each inner loop covers a diagonal of the original:

      do j = 4, 2 * n - 2
         do i = max(2, j - n + 1), min(n - 1, j - 2)
            a(i, j - i) = 0.5 * a(i, j - i) + 0.125 * (a(i - 1, j - i) + a(i + 1, j - i) + a(i, j - i - 1) + a(i, j - i + 1))
         enddo
      enddo

After this skew, we can put a parallel do directive on the inner loop. Unfortunately, that puts a barrier in every iteration of the outer loop. Unless n is very large, this is unlikely to be efficient enough to be worth it. Note that a full barrier is not really needed. A point on the diagonal does not need to wait for all of the previous diagonal to be finished; it only needs to wait for points in lower-numbered rows and columns. For example, the processor processing the first element of each diagonal does not need to wait for any other processor. The processor processing the second row need only wait for the first row of the previous diagonal to finish. We can avoid the barrier by using an alternative parallelization scheme. We divide the initial iteration space into blocks, handing n/p columns to each processor. Each processor's portion is not completely parallel. A processor cannot start a block until the previous processor has finished its corresponding block. We add explicit, point-to-point synchronization to insure that no processor gets too far ahead. This is illustrated in Figure 6.4, where each block spans n/p columns, and the height of each block is one row. Parallelizing this way has not increased the amount of parallelism. In fact, if the height of each of the blocks is one row, the same diagonal will execute in parallel. We have, though, made two improvements. First, with the previous method (skewing), each processor must wait for all the other processors to finish a diagonal; with this method (blocking), each processor only needs to wait for the preceding processor. This allows a much cheaper form of synchronization, which in turn allows early processors to proceed more quickly. In fact, the first processor can proceed without any

Figure 6.4 Point-to-point synchronization between blocks. The columns are divided among processors P0 through P4; each processor must wait for the preceding processor to finish the corresponding block before starting its own.

synchronization. The second advantage is that we are able to trade off synchronization for parallelism. With the skewing method, we have a barrier after every diagonal. With this method, we have a point-to-point synchronization after each block. At one extreme we can choose the block size to be one row. Each processor will synchronize n times, just as with the skewing method. At the other extreme, we make each block n rows, and the entire code will proceed sequentially. By choosing a size in the middle, we can trade off load balancing for synchronization.
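One way to express the blocked, pipelined scheme in OpenMP is sketched below, in the spirit of the point-to-point synchronization of Section 5.5: a shared array records the last row each thread has finished, and flush makes the updates visible. The names lastrow and MAX_THREADS, the simple partitioning of the columns, and the fixed block height of one row are our own illustrative choices; a and n are the array and size from the loops above.

      integer, parameter :: MAX_THREADS = 128     ! illustrative upper bound
      integer lastrow(0:MAX_THREADS-1)            ! last row finished by each thread (shared)
      integer p, nthr, chunk, jlo, jhi, i, j

      lastrow = 1                 ! "row 1 done" lets thread 0 start immediately

!$omp parallel private(p, nthr, chunk, jlo, jhi, i, j)
      p     = omp_get_thread_num()
      nthr  = omp_get_num_threads()
      chunk = (n - 2 + nthr - 1) / nthr           ! columns per thread
      jlo   = 2 + p * chunk
      jhi   = min(n - 1, jlo + chunk - 1)

      do i = 2, n - 1
         if (p .gt. 0) then
            ! wait until the thread owning the columns to our left has finished row i
            do
!$omp flush
               if (lastrow(p - 1) .ge. i) exit
            enddo
!$omp flush
         endif
         do j = jlo, jhi
            a(i, j) = 0.5 * a(i, j) + 0.125 * (a(i - 1, j) + a(i + 1, j) + a(i, j - 1) + a(i, j + 1))
         enddo
!$omp flush
         lastrow(p) = i                           ! publish our progress
!$omp flush
      enddo
!$omp end parallel

Choosing a block height larger than one row simply means updating lastrow once per block of rows instead of once per row, which reduces synchronization at the cost of some parallelism at the start and end of the pipeline.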

Mutual Exclusion

Another common reason for synchronization is mutual exclusion. Let us say, for example, that multiple processors are entering data into a binary tree. We might not care which processor enters the data first, and it might also be fine if multiple processors enter data into different parts of the tree simultaneously, but if multiple processors try to enter data to the same part of the tree at the same time, one of the entries may get lost or the tree might become internally inconsistent. We can use mutual exclusion constructs, either locks or critical sections, to avoid such problems. By using a critical section to guard the entire tree, we can insure that no two processors update any part of the tree simultaneously. By dividing the tree into sections and using different locks for the different sections, we can ensure that no two processors update the same part of the tree simultaneously, while still allowing them to update different portions of the tree at the same time. As another example, consider implementing a parallel sum reduction. One way to implement the reduction, and the way used by most compilers implementing the reduction clause, is to give each processor a local sum

and to have each processor add its local sum into the global sum at the end of the parallel region. We cannot allow two processors to update the global sum simultaneously. If we do, one of the updates might get lost. We can use a critical section to guard the global update. How expensive is a critical section? To check, we timed the program in Example 6.9. The times in cycles per iteration for different processor counts are given in Table 6.4.

Example 6.9 Measuring the contention for a critical section.

!$omp parallel
      do i = 1, 1000000
!$omp critical
!$omp end critical
      enddo
!$omp end parallel

It turns out that the time is very large. It takes about 10 times as long for eight processors to do a critical section each than it takes for them to execute a barrier (see the previous subsection). As we increase the num- ber of processors, the time increases quadratically. Why is a critical section so expensive compared to a barrier? On the Origin, a barrier is implemented as P two-way communications followed by one broadcast. Each slave writes a distinct memory location telling the master that it has reached the barrier. The master reads the P locations. When all are set, it resets the locations, thereby telling all the slaves that every slave has reached the barrier. Contrast this with a critical section. Critical sections on the Origin and many other systems are implemented using a pair of hardware instructions called LLSC (load-linked, store con- ditional). The load-linked instruction operates like a normal load. The store conditional instruction conditionally stores a new value to the loca- tion only if no other processor has updated the location since the load. These two instructions together allow the processor to atomically update memory locations. How is this used to implement critical sections? One

Table 6.4 Time in cycles for doing P critical sections.

    Processors    Cycles
    1                100
    2                400
    4              2,500
    8             11,000

processor holds the critical section and all the others try to obtain it. The processor releases the critical section by writing a memory location using a normal store. In the meantime, all the other processors are spinning, waiting for the memory location to be written. These processors are con- tinuously reading the memory location, waiting for its value to change. Therefore, the memory location is in every processor’s cache. Now, the processor that holds the critical section decides to release it. It writes the memory location. Before the write can succeed, the location must be invalidated in the cache of every other processor. Now the write succeeds. Every processor immediately reads the new value using the load-linked instruction. One processor manages to update the value using the store conditional instruction, but to do that it must again invalidate the location in every other cache. So, in order to acquire and release a critical section, two messages must be sent from one processor to all the others. In order for every processor to acquire and release a critical section, 2P messages must be sent from one processor to all the others. This is a very expensive process. How can we avoid the large expense of a critical section? First, a criti- cal section is only really expensive if it is heavily contended by multiple processors. If one processor repeatedly attempts to acquire a critical sec- tion, and no other processor is waiting, the processor can read and write the memory location directly in its cache. There is no communication. If multiple processors are taking turns acquiring a critical section, but at intervals widely enough spaced in time so that when one processor acquires the critical section no other processors are spinning, the proces- sor that acquires the critical section need only communicate with the last processor to have the critical section, not every other processor. In either case, a key improvement is to lower the contention for the critical section. When updating the binary tree, do not lock the entire tree. Just lock the section that you are interested in. Multiple processors will be able to lock multiple portions of the tree without contending with each other. Consider the following modification of Example 6.9, where instead every processor locks and unlocks one of 100 locks rather than one critical section:

!$omp parallel
      do i = 1, 1000000
         call omp_set_lock(lock(mod(i, 100) + 1))
         call omp_unset_lock(lock(mod(i, 100) + 1))
      enddo
!$omp end parallel
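The sketch above assumes the 100 locks have already been created. Each lock variable must be initialized with omp_init_lock before its first use and may be freed with omp_destroy_lock afterwards; the integer kind required for a lock variable is implementation dependent (a pointer-sized integer such as integer*8 is a common choice). A sketch of that setup:

      integer*8 lock(100)        ! lock variables; required kind is implementation dependent
      integer i

      do i = 1, 100
         call omp_init_lock(lock(i))
      enddo

      ! ... parallel region using omp_set_lock/omp_unset_lock as above ...

      do i = 1, 100
         call omp_destroy_lock(lock(i))
      enddo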

If only one lock was used, the performance would be exactly equivalent to the critical section, but by using 100 locks, we have greatly reduced the contention for any one lock. While every locking and unlocking still requires communication, it typically requires a two-way communication rather than a P-way communication. The time to execute this sequence on eight processors is about 500 cycles per iteration, much better than the 10,000 cycles we saw for the contended critical section. Another approach to improving the efficiency of critical sections is using the atomic directive. Consider timing a series of atomic increments:

      do i = 1, 1000000
!$omp critical
         isum = isum + i
!$omp end critical
      enddo
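The atomic version of the same loop simply replaces the critical section with an atomic directive:

      do i = 1, 1000000
!$omp atomic
         isum = isum + i
      enddo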

On eight processors, this takes approximately the same 11,000 cycles per iteration as the empty critical section. If we instead replace the critical section with an atomic directive, the time decreases to about 5000 cycles per iteration. The critical section version of the increment requires three separate communications while the atomic only requires one. As we mentioned before, an empty critical section requires two communications, one to get the critical section and one to release it. Doing an increment adds a third communication. The actual data, isum, must be moved from one processor to another. Using the atomic directive allows us to update isum with only one communication. We can implement the atomic in this case with a load-linked from isum, followed by an increment, followed by a store conditional to isum (repeated in a loop for the cases that the store conditional fails). The location isum is used both to synchronize the communication and to hold the data.

The use of atomic has the additional advantage that it leads to the minimal amount of contention. With a critical section, multiple variables might be guarded by the same critical section, leading to unnecessary contention. With atomic, only accesses to the same location are synchronized. There is no excessive contention.
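For a simple sum such as this, of course, the most convenient route is usually to let the compiler generate the local sums and the final synchronized update by using the reduction clause mentioned earlier; a sketch:

!$omp parallel do reduction(+: isum)
      do i = 1, 1000000
         isum = isum + i
      enddo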

6.3 Performance-Tuning Methodology

We have spent the bulk of this chapter discussing characteristics of cache-based shared memory multiprocessors and how these characteristics interact with OpenMP to affect the performance of parallel programs. Now we are going to shift gears a bit and talk in general about how to go about

improving the performance of parallel codes. Specific approaches to tuning depend a lot on the performance tools available on a specific platform, and there are large variations in the tools supported on different platforms. Therefore, we will not go into too much specific detail, but instead shall outline some general principles.

As discussed in earlier chapters, there are two common styles of programming in the OpenMP model: loop-level parallelization and domain decomposition. Loop-level parallelization allows for an incremental approach to parallelization. It is easier to start with a serial program and gradually parallelize more of the code as necessary. When using this technique, the first issue to worry about is coverage. Have you parallelized enough of your code? And if not, where should you look for improvements? It is important not to waste time parallelizing a piece of code that does not contribute significantly to the application's execution time. Parallelize the important parts of the code first.

Most (if not all) systems provide profiling tools to determine what fraction of the execution time is spent in different parts of the source code. These profilers tend to be based on one of two approaches, and many systems provide both types. The first type of profiler is based on pc-sampling. A clock-based interrupt is set to go off at fixed time intervals (typical intervals are perhaps 1–10 milliseconds). At each interrupt, the profiler checks the current program counter, pc, and increments a counter. A postprocessing step uses the counter values to give the user a statistical view of how much time was spent in every line or subroutine.

The second type of profiler is based on instrumentation. The executable is modified to increment a counter at every branch (goto, loop, etc.) or label. The instrumentation allows us to obtain an exact count of how many times every instruction was executed. The tool melds this count with a static estimate of how much time an instruction or set of instructions takes to execute. Together, this provides an estimate for how much time is spent in different parts of the code. The advantage of the instrumentation approach is that it is not statistical. We can get an exact count of how many times an instruction is executed. Nonetheless, particularly with parallel codes, be wary of instrumentation-based profilers; pc-sampling usually gives more accurate information.

There are two problems with instrumentation-based approaches that particularly affect parallel codes. The first problem is that the tool must make an estimate of how much time it takes to execute a set of instructions. As we have seen, the amount of time it takes to execute an instruction can be highly context sensitive. In particular, a load or store might take widely varying amounts of time depending on whether the data is in cache, in some other processor's cache, or in memory. The tool typically

does not have the context to know, so typically instrumentation-based profilers assume that all memory references hit in the cache. Relying on such information might lead us to make the wrong trade-offs. For example, using such a tool you might decide to use a dynamic schedule to minimize load imbalance while completely ignoring the more important locality considerations.

The second problem with instrumentation-based profilers is that they are intrusive. In order to count events, the tool has changed the code. It is quite possible that most of the execution time goes towards counting events rather than towards the original computation. On uniprocessor codes this might not be an important effect. While instrumenting the code might increase the time it takes to run the instrumented executable, it does not change the counts of what is being instrumented. With multiprocessor codes, though, this is not necessarily the case. While one processor is busy counting events, another processor might be waiting for the first processor to reach a barrier. The more time the first processor spends in instrumentation code, the more time the second processor spends at the barrier. It is possible that an instrumented code will appear to have a load imbalance problem while the original, real code does not.

Using a pc-sampling-based profiler, we can discover which lines of code contribute to execution time in the parallel program. Often, profilers in OpenMP systems give an easy way to distinguish the time spent in a parallel portion of the code from the time spent in serial portions. Usually, each parallel loop or region is packaged in a separate subroutine. Looking at a profile, it is therefore very easy to discover what time is spent in serial portions of the code and in which serial portions. For loop-based parallelization approaches, this is the key to figuring out how much of the program is actually parallelized and which additional parts of the program we should concentrate on parallelizing.

With domain decomposition, coverage is less of an issue. Often, the entire program is parallelized. Sometimes, though, coverage problems can exist, and they tend to be more subtle. With domain decomposition, portions of code might be replicated; that is, every processor redundantly computes the same information. Looking for time spent in these replicated regions is analogous to looking for serial regions in the loop-based codes.

Regardless of which parallelization strategy, loop-level parallelization or domain decomposition, is employed, we can also use a profiler to measure load imbalance. Many profilers allow us to generate a per-thread, per-line profile. By comparing side by side the profiles for multiple threads, we can find regions of the code where some threads spend more time than others. Profiling should also allow us to detect synchronization issues. A

profile should be able to tell us how much time we are spending in critical sections or waiting at barriers. Many modern microprocessors provide hardware counters that allow us to measure interesting hardware events. Using these counters to mea- sure cache misses can provide an easy way to detect locality problems in an OpenMP code. The Origin system provides two interfaces to these counters. The perfex utility provides an aggregate count of any or all of the hardware counters. Using it to measure cache misses on both a serial and parallel execution of the code can give a quick indication of whether cache misses are an issue in the code and whether parallelization has made cache issues worse. The second interface (Speedshop) is integrated together with the profiler. It allows us to find out how many cache misses are in each procedure or line of the source code. If cache misses are a problem, this tool allows us to find out what part of the code is causing the problem.

6.4 Dynamic Threads

So far this chapter has considered the performance impact of how you code and parallelize your algorithm, but it has considered your program as an isolated unit. There has been no discussion of how the program interacts with other programs in a computer system. In some environments, you might have the entire computer system to yourself. No other application will run at the same time as yours. Or, you might be using a batch scheduler that will insure that every application runs separately, in turn. In either of these cases, it might be perfectly reasonable to look at the performance of your program in isolation.

On some systems and at some times, though, you might be running in a multiprogramming environment. Some set of other applications might be running concurrently with your application. The environment might even change throughout the execution time of your program. Some other programs might start to run, others might finish running, and still others might change the number of processors they are using.

OpenMP allows two different execution models. In one model, the user specifies the number of threads, and the system gives the user exactly that number of threads. If there are not enough processors or there are not enough free processors, the system might choose to multiplex those threads over a smaller number of processors, but from the program's point of view the number of threads is constant. So, for example, in this mode running the following code fragment:

      call omp_set_num_threads(4)
!$omp parallel
!$omp critical
      print *, 'Hello'
!$omp end critical
!$omp end parallel

will always print “Hello” four times, regardless of how many processors are available. In the second mode, called dynamic threads, the system is free at each parallel region or parallel do to lower the number of threads. With dynamic threads, running the above code might result in anywhere from one to four “Hello”s being printed. Whether or not the system uses dynamic threads is controlled either via the environment variable OMP_DYNAMIC or the runtime call omp_set_dynamic. The default behavior is implementation dependent. On the Origin 2000, dynamic threads are enabled by default. If the program relies on the exact number of threads remaining constant, if, for example, it is important that “Hello” is printed exactly four times, dynamic threads cannot be used. If, on the other hand, the code does not depend on the exact number of threads, dynamic threads can greatly improve performance when running in a multiprogramming environment (dynamic threads should have minimal impact when running stand-alone or under batch systems).

Why can using dynamic threads improve performance? To understand, let's first consider the simple example of trying to run a three-thread job on a system with only two processors. The system will need to multiplex the jobs among the different processors. Every so often, the system will have to stop one thread from executing and give the processor to another thread. Imagine, for example, that the stopped, or preempted, thread was in the middle of executing a critical section. Neither of the other two threads will be able to proceed. They both need to wait for the first thread to finish the critical section. If we are lucky, the operating system might realize that the threads are waiting on a critical section and might immediately preempt the waiting threads in favor of the previously preempted thread. More than likely, however, the operating system will not know, and it will let the other two threads spin for a while waiting for the critical section to be released. Only after a fixed time interval will some thread get preempted, allowing the holder of the critical section to complete. The entire time interval is wasted.

One might argue that the above is an implementation weakness. The OpenMP system should communicate with the operating system and inform it that the other two threads are waiting for a critical section. The problem is that communicating with the operating system is expensive. The system would have to do an expensive communication every critical

section to improve the performance in the unlikely case that a thread is preempted at the wrong time. One might also argue that this is a degenerate case: critical sections are not that common, and the likelihood of a thread being preempted at exactly the wrong time is very small. Perhaps that is true, but barriers are significantly more common. If the duration of a parallel loop is smaller than the interval used by the operating system to multiplex threads, it is likely that at every parallel loop the program will get stuck waiting for a thread that is currently preempted.

Even ignoring synchronization, running with fewer processors than threads can hinder performance. Imagine again running with three threads on a two-processor system. Assume that thread 0 starts running on processor 0 and that thread 1 starts running on processor 1. At some point in time, the operating system preempts one of the threads, let us say thread 0, and gives processor 0 to thread 2. Processor 1 continues to execute thread 1. After some more time, the operating system decides to restart thread 0. On which processor should it schedule thread 0? If it schedules it on processor 0, both thread 0 and thread 2 will get fewer CPU cycles than thread 1. This will lead to a load balancing problem. If it schedules it on processor 1, all processors will have equal amounts of CPU cycles, but we may have created a locality problem. The data used by thread 0 might still reside in cache 0. By scheduling thread 0 on processor 1, we might have to move all its data from cache 0 to cache 1.

We have given reasons why using more threads than the number of processors can lead to performance problems. Do the same problems occur when each parallel application uses fewer threads than processors, but the set of applications running together in aggregate uses more threads than processors? Given, for example, two applications, each requesting all the processors in the system, the operating system could give each application half the processors. Such a scheduling is known as space sharing. While space sharing is very effective for applications using dynamic threads, without dynamic threads each application would likely have more threads than allotted processors, and performance would suffer greatly. Alternatively, the operating system can use gang scheduling: scheduling the machine so that an application is only run when all of its threads can be scheduled. Given two applications, each wanting all the processors in the system, a gang scheduler would at any given time give all the processors to one of the applications, alternating over time which of the applications gets the processors. For programs run without dynamic threads, gang scheduling can greatly improve performance over space sharing, but using dynamic threads and space sharing can improve performance compared to no dynamic threads and gang scheduling.

There are at least three reasons why gang scheduling can be less effective than space sharing with dynamic threads. First, gang scheduling can lead to locality problems. Each processor, and more importantly each cache, is being shared by multiple applications. The data used by the multiple applications will interfere with each other, leading to more cache misses. Second, gang scheduling can create packaging problems. Imagine, for example, a 16-processor machine running two applications, each using nine threads. Both applications cannot run concurrently using gang scheduling, so the system will alternate between the two applications. Since each one only uses nine threads, seven of the processors on the system will be idle. Finally, many applications' performance does not scale linearly with the number of processors. For example, consider running two applications on a two-processor machine. Assume that each one individually gets a speedup of 1.6 on two processors. With gang scheduling, each application will only get the processors half the time. In the best case, each application will get a speedup of 0.8 (i.e., a slowdown of 20%) compared to running serially on an idle machine. With space sharing, each application will run as a uniprocessor application, and it is possible that each application will run as fast as it did running serially on an idle machine.
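A code that tolerates a varying team size typically enables dynamic threads and then asks how many threads it actually received inside the parallel region; a minimal sketch (the print statement is just for illustration):

      integer nthr
      call omp_set_dynamic(.true.)
!$omp parallel private(nthr)
      nthr = omp_get_num_threads()     ! may be fewer than the number requested
!$omp master
      print *, 'running with ', nthr, ' threads'
!$omp end master
!$omp end parallel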

6.5 Bus-Based and NUMA Machines

Many traditional shared memory machines, such as the Sun UE10000, Compaq's AlphaServer 8400, and SGI's Power Challenge, consist of a bus-based design as shown in Figure 6.5. On one side of the bus sit all the processors and their associated caches. On the other side of the bus sits all the memory. Processors and memory communicate with each other exclusively through the bus.

Figure 6.5 A bus-based multiprocessor: the processors, each with its own cache, sit on one side of a shared bus, and the memory modules sit on the other.

From the programmer's point of view, bus-based multiprocessors have the desirable property that all memory can be accessed equally quickly from any processor. The problem is that the bus can become a single point of contention. It will usually not have enough bandwidth to support the worst case where every processor is suffering cache misses at the same time. This can lead to a performance wall. Once the bus is saturated, using more processors, either on one application or on many, does not buy any additional performance. A second performance limitation of bus-based machines is memory latency. In order to build a bus powerful enough to support simultaneous accesses from many processors, the system may slow down the access time to memory of any single processor.

An alternative type of system design is based on the concept of NUMA. The HP Exemplar [BA 97], SGI Origin 2000 [LL 97], Sequent NUMA-Q 2000 [LC 96], and the Stanford DASH machine [DLG 92] are all examples of NUMA machines. The architecture can be viewed as in Figure 6.6. Memory is associated with each processor inside a node. Nodes are connected to each other through some type of, possibly hierarchical, network. These are still shared memory machines. The system ensures that any processor can access data in any memory, but from a performance point of view, accessing memory that is in the local node or a nearby node can be faster than accessing memory from a remote node.

There are two potential performance implications to using NUMA machines. The first is that locality might matter on scales significantly larger than caches. Recall our dense matrix scaling example from earlier in the

Figure 6.6 A NUMA multiprocessor: each node pairs a processor and its cache with a local memory, and the nodes are connected through a network.

chapter. When scaling a 4000 by 4000 element matrix, we said that cache locality was not that important since each processor’s portion of the ma- trix was larger than its cache. Regardless of which processor scaled which portion of the matrix, the matrix would have to be brought into the pro- cessor’s cache from memory. While a 4000 by 4000 element matrix is big- ger than the aggregate cache of eight processors, it is not necessarily bigger than the memory in the nodes of the processors. Using a static scheduling of the iterations, it is possible that each processor will scale the portion of the matrix that resides in that processor’s local memory. Using a dynamic schedule, it is not. The second performance implication is that a user might worry about where in memory a data structure lives. For some data structures, where different processors process different portions of the data structure, the user might like for the data structure itself to be distributed across the dif- ferent processors. Some OpenMP vendors supply directive-based exten- sions to allow users control over the distribution of their data structures. The exact mechanisms are highly implementation dependent and are beyond the scope of this book. One might get the impression that NUMA effects are as critical to per- formance as cache effects, but that would be misleading. There are several differences between NUMA and cache effects that minimize the NUMA problems. First, NUMA effects are only relevant when the program has cache misses. If the code is well structured to deal with locality and almost all the references are cache hits, the actual home location of the cache line is completely irrelevant. A cache hit will complete quickly regardless of NUMA effects. Second, there is no NUMA equivalent to false sharing. With caches, if two processors repeatedly write data in the same cache line, that data will ping-pong between the two caches. One can create a situation where a uniprocessor code is completely cache contained while the multiprocessor version suffers tremendously from cache issues. The same issue does not come up with memory. When accessing data located in a remote cache, that data is moved from the remote cache to the local cache. Ping-ponging occurs when data is repeatedly shuffled between multiple caches. In con- trast, when the program accesses remote data on a NUMA machine, the data is not moved to the memory of the local node. If multiple processors are writing data on the same page,5 the only effect is that some processors, the nonlocal ones, will take a little longer to complete the write. In fact, if

5 The unit of data in a cache is a cache line. The analogous unit of data in memory is a page.

the two processors are writing data on different cache lines, both writes might even be cache contained; there will be no NUMA effects at all. Finally, on modern systems, the difference between a cache hit and a cache miss is very large, a factor of 70 on the Origin 2000. The differences between local and remote memory accesses on the Origin are much smaller. On a 16-processor system, the worst-case latency goes up by a factor of 2.24. In terms of bandwidth, the difference is even smaller. The amount of bandwidth available from a processor to the farthest memory node is only 15% less than the amount of bandwidth available to local memory. Of course, the ratio of remote to local memory latency varies across different machines; for larger values of this ratio, the NUMA effect and data distribution optimizations can become increasingly important.

6.6 Concluding Remarks

We hope that we have given an overview of different factors that affect performance of shared memory parallel programs. No book, however, can teach you all you need to know. As with most disciplines, the key is practice. Write parallel programs. See how they perform. Gain experience. Happy programming.

6.7 Exercises

1. Implement the blocked version of the two-dimensional recursive, convolutional filter described in Section 6.2.4. To do this you will need to implement a point-to-point synchronization. You may wish to reread Section 5.5 to understand how this can be done in OpenMP.

2. Consider the following matrix transpose example:

      real*8 a(N, N), b(N, N)
      do i = 1, N
         do j = 1, N
            a(j, i) = b(i, j)
         enddo
      enddo

a) Parallelize this example using a parallel do and measure the parallel performance and scalability. Make sure you measure just the time to do the transpose and not the time to page in the memory. You also should make sure you are measuring the performance on a

“cold” cache. To do this you may wish to define an additional array that is larger than the combined cache sizes in your system, and initialize this array after initializing a and b but before doing the transpose. If you do it right, your timing measurements will be repeatable (i.e., if you put an outer loop on your initialization and transpose procedures and repeat it five times, you will measure essentially the same time for each transpose). Measure the parallel speedup from one to as many processors as you have available with varying values of N. How do the speedup curves change with dif- ferent matrix sizes? Can you explain why? Do you observe a limit on scalability for a fixed value of N? Is there a limit on scalability if you are allowed to grow N indefinitely?

b) One way to improve the sequential performance of a matrix transpose is to do it in place. The sequential code for an in-place transpose is

      do i = 1, N
         do j = i + 1, N
            swap = a(i, j)
            a(i, j) = a(j, i)
            a(j, i) = swap
         enddo
      enddo

Parallelize this example using a parallel do and include it in the timing harness you developed for Exercise 2a. Measure the performance and speedup for this transpose. How does it compare with the copying transpose of Exercise 2a? Why is the sequential performance better but the speedup worse?

c) It is possible to improve both the sequential performance and parallel speedup of matrix transpose by doing it in blocks. Specifically, you use the copying transpose of Exercise 2a but call it on subblocks of the matrix. By judicious choice of the subblock size, you can keep more of the transpose cache-contained, thereby improving performance. Implement a parallel blocked matrix transpose and measure its performance and speedup for different subblock sizes and matrix sizes.

d) The SGI MIPSPro Fortran compilers [SGI 99] include some extensions to OpenMP suitable for NUMA architectures. Of specific interest here are the data placement directives distribute and distribute_reshape. Read the documentation for these directives and apply them to your

transpose program. How does the performance compare to that of the default page placement you measured in Exercise 2a–c? Would you expect any benefit from these directives on a UMA architecture? Could they actually hurt performance on a UMA system? This Page Intentionally Left Blank APPENDIX A A Quick Reference to OpenMP #ifdef _OPENMP #endif ... omp directive-name [clause] ... [clause] directive-name omp + + #pragma omp section compilation #pragma omp parallel [clause] ... structured-block #pragma omp for [clause] ... for-loop #pragma omp sections [clause] ... { [ structured-block] ... } #pragma omp single [clause] ... structured-block ContinuationConditional \ Trailing Syntax !$ !$omp Free form Free Trailing & Trailing Parallel region construct region Parallel Work-sharing constructs Work-sharing !$omp+ Fixed form Fixed OpenMP directives. !$ | c$ *$ !$omp | c$omp *$omp Table A.1 sentinel directive-name [clause] ... #pragma !$omp section compilation Sentinel Fortran C/C Fortran Continuation Conditional !$omp parallel [clause] ... structured-block !$omp end parallel !$omp do [clause] ... do-loop !$omp enddo [nowait] !$omp sections [clause] ... [ structured-block] ... !$omp end sections [nowait] !$omp single [clause] ... structured-block !$omp end single [nowait]

Table A.1 OpenMP directives. (Continued)

Combined parallel work-sharing constructs
    Fortran:
        !$omp parallel do [clause] ...
            do-loop
        [!$omp end parallel do]

        !$omp parallel sections [clause] ...
            [!$omp section
             structured-block] ...
        !$omp end parallel sections
    C/C++:
        #pragma omp parallel for [clause] ...
            for-loop

        #pragma omp parallel sections [clause] ...
        { [#pragma omp section
           structured-block] ... }

Synchronization constructs
    Fortran:
        !$omp master
            structured-block
        !$omp end master

        !$omp critical [(name)]
            structured-block
        !$omp end critical [(name)]

        !$omp atomic
            expression-statement

        !$omp ordered
            structured-block
        !$omp end ordered

        !$omp barrier

        !$omp flush [(list)]
    C/C++:
        #pragma omp master
            structured-block

        #pragma omp critical [(name)]
            structured-block

        #pragma omp atomic
            expression-statement

        #pragma omp ordered
            structured-block

        #pragma omp barrier

        #pragma omp flush [(list)]

Data environment
    Fortran:  !$omp threadprivate (/c1/, /c2/)
    C/C++:    #pragma omp threadprivate (list)
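For orientation, the following minimal Fortran program exercises the parallel region construct from Table A.1 together with one of the runtime library routines from Table A.3. It is only an illustrative sketch; compile it with whatever OpenMP flag your compiler requires.

      program hello
      integer omp_get_thread_num
!$omp parallel
      print *, 'hello from thread ', omp_get_thread_num()
!$omp end parallel
      end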

Table A.2 OpenMP directive clauses. For each clause, the constructs on which it may appear are listed; the Fortran construct names are given, with the corresponding C/C++ names (for, parallel for) implied where they differ.

    shared(list)                             parallel region, parallel do, parallel sections
    private(list)                            parallel region, do, sections, single, parallel do, parallel sections
    firstprivate(list)                       parallel region, do, sections, single, parallel do, parallel sections
    lastprivate(list)                        do, sections, parallel do, parallel sections
    default(private | shared | none)         (Fortran) parallel region, parallel do, parallel sections
    default(shared | none)                   (C/C++) parallel region, parallel for, parallel sections
    reduction(operator | intrinsic : list)   parallel region, do, sections, parallel do, parallel sections
    copyin(list)                             parallel region, parallel do, parallel sections
    if(expr)                                 parallel region, parallel do, parallel sections
    ordered                                  do, parallel do
    schedule(type[, chunk])                  do, parallel do
    nowait                                   do, sections, single

Table A.3 OpenMP runtime library routines.

    call omp_set_num_threads(integer)    void omp_set_num_threads(int)
        Set the number of threads to use in a team.
    integer omp_get_num_threads()        int omp_get_num_threads(void)
        Return the number of threads in the currently executing parallel region.
    integer omp_get_max_threads()        int omp_get_max_threads(void)
        Return the maximum value that omp_get_num_threads may return.
    integer omp_get_thread_num()         int omp_get_thread_num(void)
        Return the thread number within the team.
    integer omp_get_num_procs()          int omp_get_num_procs(void)
        Return the number of processors available to the program.
    call omp_set_dynamic(logical)        void omp_set_dynamic(int)
        Control the dynamic adjustment of the number of parallel threads.
    logical omp_get_dynamic()            int omp_get_dynamic(void)
        Return .TRUE. if dynamic threads is enabled, .FALSE. otherwise.
    logical omp_in_parallel()            int omp_in_parallel(void)
        Return .TRUE. for calls within a parallel region, .FALSE. otherwise.
    call omp_set_nested(logical)         void omp_set_nested(int)
        Enable/disable nested parallelism.
    logical omp_get_nested()             int omp_get_nested(void)
        Return .TRUE. if nested parallelism is enabled, .FALSE. otherwise.

Table A.4 OpenMP lock routines.

    omp_init_lock(var)           void omp_init_lock(omp_lock_t*)
        Allocate and initialize the lock.
    omp_destroy_lock(var)        void omp_destroy_lock(omp_lock_t*)
        Deallocate and free the lock.
    omp_set_lock(var)            void omp_set_lock(omp_lock_t*)
        Acquire the lock, waiting until it becomes available, if necessary.
    omp_unset_lock(var)          void omp_unset_lock(omp_lock_t*)
        Release the lock, resuming a waiting thread (if any).
    logical omp_test_lock(var)   int omp_test_lock(omp_lock_t*)
        Try to acquire the lock, returning success (TRUE) or failure (FALSE).
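As a usage sketch for the lock routines in Table A.4: the code below assumes the Fortran bindings listed above and declares the lock variable as an address-sized integer, which is what OpenMP Fortran implementations of this generation typically require (check your compiler's documentation for the exact type).

      integer*8 lck                ! assumed: address-sized integer for the lock
      integer count
      count = 0

      call omp_init_lock(lck)
!$omp parallel shared(lck, count)
      call omp_set_lock(lck)       ! only one thread at a time past this point
      count = count + 1
      call omp_unset_lock(lck)
!$omp end parallel
      call omp_destroy_lock(lck)

      print *, 'threads that incremented count: ', count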

Table A.5 OpenMP environment variables.

    OMP_SCHEDULE      (example: "dynamic, 4")   Specify the schedule type for parallel loops with a RUNTIME schedule.
    OMP_NUM_THREADS   (example: 16)             Specify the number of threads to use during execution.
    OMP_DYNAMIC       (TRUE or FALSE)           Enable/disable dynamic adjustment of threads.
    OMP_NESTED        (TRUE or FALSE)           Enable/disable nested parallelism.

Table A.6 OpenMP reduction operators.

    Fortran:  +  *  -  .AND.  .OR.  .EQV.  .NEQV.  MAX  MIN  IAND  IOR  IEOR
    C/C++:    +  *  -  &  |  ^  &&  ||

Table A.7 OpenMP atomic operators.

    Fortran:  +  *  -  /  .AND.  .OR.  .EQV.  .NEQV.  MAX  MIN  IAND  IOR  IEOR
    C/C++:    ++  --  +  *  -  /  &  ^  <<  >>  |

References

[ABM 97] Jeanne C. Adams, Walter S. Brainerd, Jeanne T. Martin, Brian T. Smith, and Jerrold L. Wagener. Fortran 95 Handbook. MIT Press. June 1997.

[AG 96] S. V. Adve and K. Gharachorloo. “Shared Memory Consistency Models: A Tutorial.” WRL Research Report 95/7, DEC, September 1995.

[BA 97] Tony Brewer and Greg Astfalk. “The Evolution of the HP/Convex Exemplar.” In Proceedings of the COMPCON Spring 1997 42nd IEEE Computer Society International Conference, February 1997, pp. 81–86.

[BS 97] Bjarne Stroustrup. The C++ Programming Language, third edition. Addison-Wesley. June 1997.

[DLG 92] Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John Hennessy, Mark Horowitz, and Monica S. Lam. “The Stanford DASH Multiprocessor.” IEEE Computer, Volume 25, No. 3, March 1992, pp. 63–79.

[DM 98] L. Dagum and R. Menon. “OpenMP: An Industry-Standard API for Shared Memory Programming.” IEEE Computational Science and Engineering, Vol. 5, No. 1, January–March 1998.

[GDS 95] Georg Grell, Jimmy Dudhia, and David Stauffer. A Description of the Fifth-Generation Penn State/NCAR Mesoscale Model (MM5). NCAR Technical Note TN-398+STR. National Center for Atmospheric Research, Boulder, Colorado. June 1995.

[HLK 97] Cristina Hristea, Daniel Lenoski, and John Keen. “Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro Benchmarks.” In Proceedings of Supercomputing 97, November 1997.

[HP 90] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann. 1990.

[KLS 94] Charles H. Koelbel, David B. Loveman, Robert S. Schreiber, Guy L. Steele Jr., and Mary E. Zosel. The High Performance Fortran Handbook, Scientific and Engineering Computation. MIT Press. January 1994.

[KR 88] Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language, second edition. Prentice Hall. 1988.

[KY 95] Arvind Krishnamurthy and Katherine Yelick. “Optimizing Parallel Programs with Explicit Synchronization.” In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), June 1995, pp. 196–204.

[LC 96] Tom Lovett and Russell Clapp. “STiNG: A CC-NUMA Computer System for the Commercial Marketplace.” In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pp. 308–317.

[LL 97] James Laudon and Daniel Lenoski. “The SGI Origin: A ccNUMA Highly Scalable Server.” In Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997, pp. 241–251.

[MR 99] Michael Metcalf and John Ker Reid. Fortran 90/95 Explained. Oxford University Press. August 1999.

[MW 96] Michael Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley. 1996.

[NASPB 91] D. H. Bailey, J. Barton, T. A. Lasinski, and H. Simon. The NAS Parallel Benchmarks. NAS Technical Report RNR-91-002. 1991.

[NBF 96] Bradford Nichols, Dick Buttlar, and Jacqueline Proulx Farrell. Pthreads Programming: A POSIX Standard for Better Multiprocessing. O’Reilly & Associates. September 1996.

[PP 96] Peter Pacheco. Parallel Programming with MPI. Morgan Kaufmann. October 1996.

[RAB 98] Joseph Robichaux, Alexander Akkerman, Curtis Bennett, Roger Jia, and Hans Leichtl. “LS-DYNA 940 Parallelism on the Compaq Family of Systems.” In Proceedings of the Fifth International LS-DYNA Users’ Conference. August 1998.

[SGI 99] Silicon Graphics Inc. Fortran 77 Programmer’s Guide. SGI Technical Publications Document Number 007-0711-060. Viewable at techpubs.sgi.com, 1999.

[THZ 98] C. A. Taylor, T. J. R. Hughes, and C. K. Zarins. “Finite Element Modeling of Blood Flow in Arteries.” Computer Methods in Applied Mechanics and Engineering. Vol. 158, Nos. 1–2, pp. 155–196, 1998.

[TMC 91] Thinking Machines Corporation. The Connection Machine CM5 Technical Summary. 1991.

[X3H5 94] American National Standards Institute. Parallel Extensions for Fortran. Technical Report X3H5/93-SD1-Revision M. Accredited Standards Committee X3. April 1994.

