UPC Collective Optimization

Savvas Petrou

August 20, 2009

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2009

Abstract

Efficient collective operations are an important feature for new parallel programming languages. The size of massively parallel machines is increasing fast, and the performance of these operations is crucial to the scalability of many applications. The Unified Parallel C (UPC) language is one of the new parallel programming languages based on the Partitioned Global Address Space (PGAS) model. The purpose of this project is to investigate the performance and scaling of the current implementations of collective operations and also to develop new implementations based on tree structure algorithms. For the benchmarks, two Computational Fluid Dynamics (CFD) codes are used. This type of application makes extensive use of functions that gather and scatter information for various parameters of the simulation, which makes them ideal for benchmarking collective operations. The benchmarks are executed on the United Kingdom National Supercomputing Service, HECToR. The system is a hybrid supercomputer consisting of a massively parallel scalar system and a vector system. Both systems are tested to observe the behaviour of the algorithms on different architectures.

Contents

1 Introduction

2 Background
  2.1 Unified Parallel C
    2.1.1 UPC Compilers
    2.1.2 Collectives implementations
  2.2 The system
    2.2.1 HECToR Cray XT4
    2.2.2 HECToR Cray X2
  2.3 Applications
    2.3.1 BenchC
    2.3.2 XFlow

3 Porting
  3.1 BenchC
    3.1.1 Preprocessor flags
    3.1.2 Timers
    3.1.3 Header files issue
    3.1.4 Broadcasts within all-reductions
  3.2 XFlow
    3.2.1 Collective operations file
    3.2.2 Timers
    3.2.3 Header files issue
    3.2.4 Broadcasts within all-reductions
  3.3 Tests

4 Optimizations
  4.1 Broadcast
    4.1.1 Default CFD code implementation
    4.1.2 Implementation 1
    4.1.3 Implementation 2
    4.1.4 Implementation 3
    4.1.5 GASNet implementation
    4.1.6 MTU implementation
  4.2 Reduction
    4.2.1 Default CFD code implementation
    4.2.2 Implementation 1
    4.2.3 Implementation 2
    4.2.4 MTU implementation

5 Benchmarks and Analysis
  5.1 Input parameters
    5.1.1 BenchC
    5.1.2 XFlow
  5.2 Benchmarks
    5.2.1 Cray XT
    5.2.2 Cray X2

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future work

A HECToR modules
  A.1 Cray XT4

B Graphs
  B.1 Cray XT
    B.1.1 Broadcast
    B.1.2 Reduction
  B.2 Cray X2
    B.2.1 Broadcast
    B.2.2 Reduction

C Work Plan

List of Tables

3.1 Timers used in BenchC depending on programming standard, architecture and compiler
4.1 Quad tree statistics for first broadcast implementation
4.2 Quad tree statistics for second broadcast implementation
5.1 Input parameters for BenchC
5.2 Percent of time spent in collective operations [BenchC, XT4]
5.3 Percent of time spent in collective operations [BenchC, X2]
5.4 Input parameters for XFlow
5.5 Percent of time spent in collective operations [XFlow, XT4]
5.6 Percent of time spent in collective operations [XFlow, X2]

List of Figures

2.1 UPC shared variables example
2.2 AMD Opteron Quad-core architecture schematic
2.3 BenchC execution diagram
4.1 User broadcast implementation
4.2 First implementation of broadcast
4.3 Second implementation of broadcast
4.4 Third implementation of broadcast
4.5 Binomial tree of order 4
4.6 MTU GET implementation of broadcast
4.7 User implementation of reduction
4.8 First implementation of reduction
4.9 Second implementation of reduction
5.1 Broadcast tree rank benchmarks [BenchC, XT4]
5.2 All broadcast implementations comparison [BenchC, XT4]
5.3 All broadcast implementations comparison [XFlow, XT4]
5.4 Reduction tree rank benchmarks [BenchC, XT4]
5.5 All reduction implementations comparison [BenchC, XT4]
5.6 All reduction implementations comparison [XFlow, XT4]
5.7 Broadcast tree rank benchmarks [BenchC, X2]
5.8 All broadcast implementations comparison [BenchC, X2]
5.9 All broadcast implementations comparison [XFlow, X2]
5.10 Reduction tree rank benchmarks [BenchC, X2]
5.11 All reduction implementations comparison [BenchC, X2]
5.12 All reduction implementations comparison [XFlow, X2]
B.1 All broadcast implementations comparison (tree rank 4) [BenchC, XT4]
B.2 All broadcast implementations comparison (tree rank 8) [BenchC, XT4]
B.3 Broadcast tree rank benchmarks [XFlow, XT4]
B.4 All broadcast implementations comparison (tree rank 4) [XFlow, XT4]
B.5 All broadcast implementations comparison (tree rank 8) [XFlow, XT4]
B.6 All reduction implementations comparison (tree rank 4) [BenchC, XT4]
B.7 All reduction implementations comparison (tree rank 16) [BenchC, XT4]
B.8 Reduction tree rank benchmarks [XFlow, XT4]
B.9 All reduction implementations comparison (tree rank 4) [XFlow, XT4]
B.10 All reduction implementations comparison (tree rank 16) [XFlow, XT4]
B.11 All broadcast implementations comparison (tree rank 4) [BenchC, X2]
B.12 All broadcast implementations comparison (tree rank 8) [BenchC, X2]
B.13 Broadcast tree rank benchmarks [XFlow, X2]
B.14 All broadcast implementations comparison (tree rank 4) [XFlow, X2]
B.15 All broadcast implementations comparison (tree rank 8) [XFlow, X2]
B.16 All reduction implementations comparison (tree rank 4) [BenchC, X2]
B.17 All reduction implementations comparison (tree rank 8) [BenchC, X2]
B.18 Reduction tree rank benchmarks [XFlow, X2]
B.19 All reduction implementations comparison (tree rank 4) [XFlow, X2]
B.20 All reduction implementations comparison (tree rank 8) [XFlow, X2]
C.1 Gantt diagram

Acknowledgements

First of all, I would like to thank my supervisors, Dr Adam Carter and Jason Beech-Brandt, for their continued support and supervision throughout this project. Also, I want to express my appreciation to my colleague and friend Christopher Charles Evi-Parker for his help during the proofreading stage. Last but not least, I thank my parents for making this Masters degree possible.

Chapter 1

Introduction

In parallel programming there are many factors that need to be considered in order to develop an application with good parallel characteristics. One of these factors is efficient communication between the processes when performing collective operations. Depending on the data dependencies and the algorithms used, different operations are important for each application. For portability and performance reasons these operations are usually part of the parallel programming language specification. This creates an abstraction layer and gives developers the opportunity to write their own implementations in cases where the default versions have poor performance. Moreover, the codes are able to execute on a variety of systems and architectures without the need for porting procedures.

Unified Parallel C (UPC) is one of the new parallel programming languages based on the Partitioned Global Address Space (PGAS) model. The main feature of the model is the use of single-sided communications for data exchange between processes. This offers great flexibility to the developer and makes it possible to parallelise applications that have dynamic communication patterns.

The United Kingdom National Supercomputing Service, known by the name HECToR, has recently installed a UPC compiler for both architectures of the system. HECToR is a Cray XT5h hybrid architecture system. It consists of a Cray XT4 massively parallel scalar system and a Cray X2 vector system. In addition to the installation of the UPC compiler, the XT system was upgraded from dual-core to quad-core processors. The multi-core architecture introduced new potential optimisations for collective functions.

The purpose of the project is to benchmark the currently available implementations of collective operations and develop new versions in an effort to optimise them. For the benchmarks two applications for Computational Fluid Dynamics (CFD) are going to be used. They are both developed by the same software development group,