Communicating Sequential Processes in High Performance Computing

Alasdair James Mackenzie King

August 20, 2015

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2015

Abstract

Communicating Sequential Processes has been used in High Performance Computing with success and offers benefits in providing a model for safe parallel and concurrent programs. Its mathematical basis allows programs to be evaluated and proved for their completeness, ensuring the program is free of common problems found within parallel and concurrent programming. Modern parallel programs are mainly implemented in MPI or OpenMP. These tools have had a long time to mature and provide programmers with very high performing programs for distributed and shared memory. This project aims to use CSP with a high level language to evaluate the usefulness and ability of the model and of higher level languages. Python is a popular language and its use within HPC is increasing. An implementation of CSP should provide Python with the ability to program for High Performance Computers, allowing programmers to make use of the features Python brings. This project uses PyCSP as the basis of the CSP implementation, making modifications where required to run the library on ARCHER. Graph500 is used as a benchmark for the implementation, comparing the PyCSP version with the C MPI version supplied with the Graph500 specification. The project results in a partial implementation of the PyCSP features that works on ARCHER using MPI. The final program performed worse than the C version. This was to be expected from Python, and parts of the program were improved through the use of scientific libraries, which provide performance-oriented data structures in C. Due to the absence of some of the features, the program could not scale beyond one node for the final implementation of Graph500. However, the CSP model proved to be reliable, with no communication problems during implementation. Further work could see a higher performing library providing CSP for programmers, to help design and simplify the construction of parallel and concurrent programs.

Contents

1 Introduction
  1.1 Communicating Sequential Processes
    1.1.1 Building systems in CSP
    1.1.2 Implementations
  1.2 Python Language & Its use within HPC
    1.2.1 Python Language
    1.2.2 Parallelism and Concurrency
    1.2.3 Python Limitations in HPC
    1.2.4 PyCSP
  1.3 Graph500
    1.3.1 Specification
    1.3.2 Graph Generation
    1.3.3 Breadth First Search
    1.3.4 Implementations

2 Python/CSP implementation
  2.1 PyCSP Process Creation
    2.1.1 Distributed PyCSP Processes
  2.2 PyCSP Inter-process Communication
    2.2.1 Alternative Solutions and Considerations
    2.2.2 Syntax for modified Parallel and modified Channels
    2.2.3 Alts! (Alternatives) and Guards
  2.3 Multiple Readers and Writers of networked channels
  2.4 Graph Generation
  2.5 Breadth First Search
    2.5.1 Design
    2.5.2 Implementation

3 Results
  3.1 Graph500 MPI Benchmark (C version)
    3.1.1 Test conditions
    3.1.2 Graph500 Results (C version)
  3.2 Python/CSP Results Comparison
    3.2.1 Errors
    3.2.2 Test Conditions
    3.2.3 Execution Time
    3.2.4 Errors during results
    3.2.5 Length of Program

    3.2.6 Performance Pitfalls

4 Conclusions

5 Further Work
  5.1 Multi Channels
  5.2 Distributed Processes
  5.3 PyCSP upgrade
  5.4 CSP implementations
  5.5 CSP design in other parallel implementations

6 Appendix
  6.1 Graph500 Makefile
  6.2 bfs_simple.c listing
  6.3 First implementation of GraphReader.py
  6.4 Improved implementation of GraphReader.py

List of Figures

1.1 A vending machine that takes in a coin and returns a chocolate and returns to wait on a new coin
1.2 A vending machine that takes in a coin and returns a chocolate or toffee
1.3 Example process graph of 3 concurrent processes connected with point to point channels. [1]
1.4 For loop comparisons between Python, C, Fortran and Java
1.5 Python list comprehension
1.6 Declaration of Classes and Functions in Python
1.7 Python's Lambda syntax
1.8 Python decorator example
1.9 An example of a send in Mpi4Py with each of the two types of sends.
1.10 Sample GNU Octave graph generator supplied by Graph500

2.1 A PBS script to run a python job on ARCHER
2.2 PyCSP Parallel functions for process instantiation. Docstrings have been removed for figure clarity
2.3 Modification to provide each MPI Rank with a process to run
2.4 Example process graph of 3 concurrent processes connected with single channels
2.5 Network channel example showing the reading end sink
2.6 Network channel example showing the writing end source
2.7 Distributed channel management for Distributed Processes
2.8 Physical view of distributed PyCSP processes with networked channels
2.9 Amendments to Alternative.py to allow for MPI channels.
2.10 Modifications made to mpi_channels
2.11 PyCSP Graph Solver graph with MPI rank numbers
2.12 Updated channel end discovery
2.13 Changes made to mpi_channel send. Similar changes are made for recv
2.14 Python serial version of the sample graph generation
2.15 Vampir Trace of graph500 sample C/MPI
2.16 Distributed Breadth-First Expansion with 1D Partitioning from [28]
2.17 CSP diagram for PyCSP implementation of Breadth First Search graph solver
2.18 Graph Decomposer process
2.19 Breadth first search worker
2.20 Queue processes
2.21 Main program entry point

3.1 Modified Makefile for Graph500 test
3.2 Graph500 sample code PBS script for one node at scale 12

List of Tables

1.1 Graph500 Problem Class definitions [2]

3.1 Graph500 results for one full node - 24 processes - at each graph scale
3.2 Graph500 results for Scale 12
3.3 Graph500 results for Scale 19
3.4 Graph500 results for Scale 22
3.5 Graph500 results for Scale 26
3.6 PyCSP execution times for one node at Scale 12
3.7 Line counts for each implementation
3.8 Execution time with initial graphreader.py implementation

Acknowledgements

I would like to thank my supervisor Dr Daniel Holmes for his support and advice, and family and friends for enduring me talking at length about the project. Thank you to EPCC for allowing the time and resources to work on this project.

Chapter 1

Introduction

Programs for High Performance Computing (HPC) have traditionally been written in well established programming languages for scientific computing. Primarily C and Fortran, these languages dominate the development of scientific software for high performance computing. Java, C# and Python have also found use, with some programmers using these higher level languages to perform the same tasks as C and Fortran. One reason for this is that new programmers are taught these languages, which are more commonly found amongst both programmers and people in their chosen field. These languages are more common in general systems programming and are often the first languages that people encounter. Some academic staff may have little programming background, and have taken to these languages for their ease of use, or because they are part of scientific software packages, for example Matlab. The Python programming language has found some use within High Performance Computing and is readily used outside of HPC. The language is often taught as a beginners' language, but it has plenty of library support for scientific computing. Programming in parallel can often be challenging, and depending on the level of expertise, interacting with threads directly - pthreads - or other implementations is often passed over in favour of higher level abstractions, such as OpenMP or MPI. These abstractions move some of the complexity of process management and communication away from the programmer, providing a standard interface to help them build high performing parallel and concurrent systems. This reduces the barrier to entry and allows people to make better use of the hardware. MPI provides a single program multiple data view which exploits communication between processes of a large distributed system to parallelise work. OpenMP provides the programmer with compiler directives that allow the compiler to insert and optimise the relevant code. This makes better use of shared memory machines that provide a large number of concurrent threads. Poorly performing parts of programs can be annotated to tell the compiler to break the work up amongst these threads and improve the performance of the program. Using these tools the programmer can then spend time thinking - for both MPI and OpenMP - about how best to organise and access their data, making effective use of parallelism and concurrency. Communicating Sequential Processes provides a model for concurrent and parallel systems. This model can be implemented using threading and communication primitives to provide the programmer with a clear, runtime safe and structured view of their program. Implementing CSP can help the programmer think about and describe with greater clarity the structure of the program. For languages such as Python, this would allow a high level language to flexibly and clearly implement a program for high performance computers.

In order to investigate the usability of this model and the abilities of Python as a language for high performance programming, this project will take an implementation of CSP and investigate its use on a HPC system - ARCHER - evaluating it by implementing a Graph500 solver. This program should aim to perform a simple version of Graph500 and the results of bringing CSP with Python to HPC systems are discussed.

1.1 Communicating Sequential Processes

Communicating Sequential Processes (CSP) [3], developed by Tony Hoare, is a language for describing concurrent systems, using process calculi to mathematically design a parallel or concurrent system. The semantics used to define a system require that processes and their communication adhere to rules. One of the rules is that all processes are equal when executed and there is no order unless it has been imposed by the communication. The communication between these processes is described as a channel. These channels are considered to be blocking, and send messages between the processes in the order they were placed on the channel. These strict rules allow programmers to accurately describe the behaviour of their system and prove mathematically the validity of the system. The justification for this is similar to the reasons for strongly typed programming languages: the rules of CSP can help prevent large concurrent or parallel programs from live-locking or deadlocking during communication. This is very useful for safety critical programs, but it can also apply to general programs, and these situations occur regularly when creating programs in MPI or other concurrent and parallel frameworks. In High Performance Computing, should a program deadlock at a point in the simulation, this could mean having to rerun a very large simulation, costing time and money. Programs designed using CSP can aim to have dead-locks and live-locks eliminated before the user runs their program. The mathematical principles behind CSP allow tools such as FDR [4] to validate a CSP program by checking that the designed system conforms to the rules, ensuring the safety of the program. While this project does not focus on the provability of a CSP system, it will test the feasibility of CSP for use on high performance computers by testing whether the benefits of CSP can make a more robust and correct implementation of Graph500. CSP has been used in High Performance Computing in the past on the Meiko Computing Surface [5]. The Meiko made particular use of the Inmos Transputer [6, 7], which was programmed using OCCAM [8]. The Transputer processor boards were arranged in a 2-dimensional array allowing each processor to be connected to some others. For CSP programs this means an individual process can reside on each processor and communicate over a hardware topology similar to the system diagram used to design the program. OCCAM is the first implementation of CSP and many of the later implementations are influenced by the design choices made in the implementation of CSP for OCCAM. The internals of OCCAM can be found in the reference manual for the language [9], and its relation to other implementations can be found in Section 1.1.2. It is useful to go through some of the important features of CSP in more detail. These will be relevant to discussions on how to design systems, and their use in the development of Graph500. CSP programs feature several basic building blocks that can be used to form a program.

Processes   Processes are units of work. When implemented these are often functions or a class object, and they can have a lifetime that is shorter than the totality of the program. In either case, they are single anonymous entities, unlike ranks in MPI, where ranks 1 and 2 are copies of the same program that persist throughout the lifetime of the program. While MPI processes will operate on the one data set differently, a CSP process can complete a task, or a group of similarly operating tasks, on the data set. Processes can have the same behaviour as MPI processes, where each rank would be the equivalent of a process doing a particular task on the data set. While MPI processes communicate via point to point messages at any time during the program, for a CSP process this is different. Each CSP process has a defined number of inputs and outputs, as opposed to communication from within the process based on an event.

Channels   CSP treats processes as isolated objects; they must communicate through channels. Each process exists as its own entity and does not share the contents of its memory with other processes outside of an explicit message communication. While the channel could be implemented as a shared memory construct, for the purposes of the model it is considered a message from one process to the other. The message between two processes can be considered as an event. This event blocks until the subscribed process has consumed the event. For a system comprised of two processes, neither process can progress until the event has been completed. A process has no restriction on the number of channels it can consume as long as they are treated in the same manner. Channel implementations can vary in how they achieve multiple senders and receivers, but each multiple receive completes as soon as there is a matching send from one of the many senders. To manage more than one channel it can be useful to specify the construct of a choice. The process can then decide which channel to accept. Managing this choice is often implemented as an Alternative or Alt.
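To make the channel idea concrete, the following is a minimal sketch of a producer and consumer connected by a single blocking channel, written against the PyCSP API described later in Section 1.2.4 (the exact names, such as Channel.reader(), Channel.writer() and retire(), are taken from the PyCSP documentation and may differ between versions):

from pycsp.parallel import process, Channel, Parallel, retire, shutdown

@process
def producer(cout, n):
    # Write n messages; each write blocks until the reader takes the value.
    for i in range(n):
        cout(i)
    retire(cout)          # signal that no more data will be written

@process
def consumer(cin):
    # Read until the channel is retired by all of its writers.
    while True:
        print cin()

chan = Channel('A')
Parallel(producer(chan.writer(), 10),
         consumer(chan.reader()))
shutdown()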

Alts   The choice, Alt, Alternative or Alternation, provides the programmer with an interface to pick between one or more channels based on an action or by virtue of which channel is preferred over another. This structure allows other channels to be handled if there is data, without blocking on the preferred channel. While this can make processes non-deterministic, it does allow large systems to be free of blocks as long as at least one of the channels can be resolved.

1.1.1 Building systems in CSP

The building blocks of CSP, implemented in the preferred language of the programmer, can be used to build complete stable systems over an array of processors. Programmers can make use of the modular nature of their program to design the intended system before committing code to machine. This can be expressed in the formal notation or as a diagram. The latter will be used in this project to describe the design and implementation of Graph500. Using process calculus to describe a CSP program presents the mathematical view of the system. Examples taken from Tony Hoare's original paper can be seen in Figures 1.1 & 1.2.

VMS = (coin → (choc → VMS))

Figure 1.1: A vending machine that takes in a coin and returns a chocolate and returns to wait on a new coin

This example shows a program that runs a vending machine. It accepts the input of a coin from the user. This is sent to the process, which returns the desired item. The state at each stage of the process is described in the semantics of how the coin interacts with the vending machine. This is the mathematical representation of the system, and by following the rules of this system you can say for certain that the vending machine will return an item when given a coin. While this is a simple example, the formal description can expand to describe a more complex system. Figure 1.2 shows a more complicated vending machine.

VMCT = µX • coin → (choc → X | toffee → X)

Figure 1.2: A vending machine that takes in a coin and returns a chocolate or toffee

Depending on the preferences of a programmer, a visual representation of the vending machine can be more useful. Figure 2.4 shows a simple CSP program as a diagram. This is somewhat similar to some types of UML diagram that show the relationships between objects and the direction of the channels. These figures show the style that will be used throughout this project when designing components for Graph500. Both reading and writing ends will be labelled on a directional line joining two or more boxes, with each box labelled with the process name. This system can also be drawn as a flow diagram. Each process is represented by a box and the directional lines show the reading and writing ends of the channels. The vending machine diagram has been overly expanded to show the full detail of the program. This could be implemented with fewer processes and channels. For bigger programs this method of designing CSP programs is often advised, as it gets the programmer to think about the communication and how the functions should interact with each other. The programmer can manually trace over the diagram in a similar manner to FDR or tracing programs that can evaluate the formal logic of the system. Time spent at this stage can help eliminate bugs that happen during development through communication problems. It does not protect against bugs that occur within the functions, only in how processes communicate. Careful consideration should be taken when a process is consuming or generating more than one output, or when two processes have two or more channels between them. The programmer should think clearly about the order in which the channels should be consumed, or use an Alt to protect the channels.

Figure 1.3: Example process graph of 3 concurrent processes connected with point to point channels. [1]

1.1.2 Implementations

OCCAM   OCCAM [9] was the first language used to implement CSP. It found commercial use on systems using the Inmos Transputer and has been ported to other systems.

JCSP   JCSP [10] is one of the most developed implementations of CSP for traditional programming languages. The library builds on the Java concurrency primitives and can be used with other JVM languages such as Groovy. JCSP adds further features beyond standard CSP by providing networked channels and the ability to describe a process as mobile. This allows for the serialisation of processes across channels, providing mobile units of work.

C++CSP   As of 2007 this library provides C++ with an implementation of CSP [11]. C++CSP2 provides the latest and upgraded implementation of CSP for C++. This library is influenced by JCSP and features much of the same breadth of implementation, providing networked channels and mobile processes.

Go   Go [12] is a new language created at Google. Its concurrency model takes direct inspiration from CSP, and previous work by Rob Pike has featured CSP as an influence, e.g. large parts of the Plan 9 operating system and the concurrency model of Limbo. While Go provides a channel object, it is not able to communicate with a process on a networked computer and behaves more like the occam channels. The syntax for defining channels and processes is markedly different from other implementations. Any function can be used as a "goroutine" if the function is called with a "go" prefix. This behaves similarly to a parallel statement, as the go keyword tells the compiler to create a lightweight thread for the function. Channels are used with a "<-" operator, which can be thought of as meaning "in to". The position of the channel object denotes the object that is being written "in to". Channels in Go are synchronous and block when called; however, a channel can be made non-blocking by attaching a buffer. Messages are written to the attached buffer and the call returns immediately. Go is still a new language and there are currently very few examples of its use in scientific computing, however as its concurrency model is built in as a first-class entity, it may prove to be useful.

It should be noted that Go's goroutines are better described as co-routines, and are context switched into use. To achieve parallelism, the runtime can be told the maximum number of cores on the system and will schedule some of the routines to run on other cores. However, experience using one of the implementations carries over, as the CSP model is consistent throughout.

PyCSP   PyCSP is an implementation of CSP for Python. It borrows from JCSP and modifies the implementation to best suit Python. More on PyCSP can be found in Section 1.2.4.

1.2 Python Language & Its use within HPC

1.2.1 Python Language

The Python language has found use within the scientific community for its ease of writing, brevity, extensive libraries and fast development time from idea to running code. While in use since 1991 and commonly thought of as a scripting language like Perl, it has been used more often to create fully featured programs and is often taught today as the first language for new students. The mainstream programming languages for HPC scientific programming have been FORTRAN and C. Other languages have made an impact, but most are languages that are statically typed and compiled. This gives the compiler a lot to work with, and it can make optimisations to the code before it is run. Python is not a compiled language and is dynamically typed, and this can present some challenges for its use. Python requires the use of an interpreter program to change the Python program into byte code and then into machine code. This may sound like Java, but the interpreter does not make use of Just In Time compilation - unlike the Java VM - and there is an initial cost for compiling the code to byte code before it is run. This does mean that successive runs of the code are faster, but in both cases each line is interpreted as the program is run. This prevents the development cycle from having compile-time checks, where the program would be compiled and a list of warnings and errors could be seen before the program is run. This includes checking of types, as this happens when the lines are evaluated; however, Python enforces strict rules and, unlike Perl, an exception will be thrown should types be used inappropriately. Some of these problems can be mitigated by changing the way the program is developed. The advantage of programming with an interpreter is that sections of code can be tested quickly, and a terminal can be kept open with the interpreter running to check and experiment with syntax. For example, the manipulation of lists or recursive functions can be checked quickly before being used in the design of the intended program. This can also be useful when trying out some of the libraries that are bundled with Python or ones installed by the programmer. Python is bundled with a rich library set that helps expand or provide the intended functionality. Some of these libraries are utilities that help the programmer debug, unit test or profile their code. These libraries are consumed with an import statement at the top of the program and contained within their own namespace. However, the use of Python in HPC has found that the import command, when loading Python libraries from the file-system, causes a major slow-down on large distributed-memory clusters. More information on this can be found in Section 1.2.3. Python syntax is markedly less verbose than some languages such as Java, and enforces strict rules on whitespace to delimit the scope of a block, and on how types can interact with each other. The use of whitespace to delimit blocks of code can be considered frustrating by programmers used to semicolons at the end of lines and braces surrounding code. This can be argued to be a temporary discomfort, and the benefit of enforcing this rule provides a degree of assurance that a page of Python code will have a certain level of clarity. This is especially useful for long standing code that may be maintained by several developers over a long period of time. Python syntax is often remarked to be close to written English or pseudocode.

An example of this high level syntax is the for loop. As shown in Figure 1.4, the Python loop expresses the same thing as the others, but in a form closer to English. This can be beneficial for people who are beginning programming or people who find some other languages slower to compose.

# Python
for index in iterator_object:
    # do things with index

/* C */
for (int i = 0; i < n; i++) {
    /* do things with i */
}

! Fortran
do i = 1, n
   ! do things with i
end do

// Java
for (int i = 0; i < n; i++) {
    // do things with i
}

Figure 1.4: For loop comparisons between Python, C, Fortran and Java

Similar to FORTRAN’s powerful array syntax, python has list comprehensions. These can be confusing to some programmers and can take time to create. An example is shown in Figure 1.5.

my_list = [x for x in list_of_A if x is G]

Figure 1.5: Python list comprehension

List comprehensions provide a powerful structure for iterating over a list to generate another list. They are often used for the construction of lists where conditions are met. Their complexity arises when nesting comprehensions or providing complex conditions. They can be used for dictionary types as well. Python's syntax to define functions and classes is short, requiring only the definition that the block is a function or class, and what arguments it should take in or the class it inherits from. Classes require the use of a special name, __init__, which is the class constructor. Other special double-underscore names exist in Python, such as __name__, which returns the name of a function. Example syntax is shown in Figure 1.6.

# Python function
def func(var):
    # do lots of things
    return  # stuff

# Python class
class Foo(superclass):
    def __init__(self, constructor_vars):
        # initialise the constructor
        pass

    def Spam(self, num=1):
        for i in range(num):
            print "Spam!"

Figure 1.6: Declaration of Classes and Functions in Python

Python's type system is dynamic - sometimes loosely described as weak typing - as opposed to C or Java, where the types are strictly enforced by the compiler. While the programmer does not need to tell Python what type a variable is, they do have to care about when and what they are doing with it. This can result in some ambiguity when using types. In Java, to achieve a Dictionary (HashMap) requires a declaration of the HashMap class and what type the map will contain, which requires a verbose set of instructions, especially if the map should contain an Array or a second HashMap. In Python, it only requires that the dictionary keys are a hashable type and unique. The value can then be any type, and all that is required is inserting into or constructing the dictionary with the intended contents. While each is achieving the same thing, the advantage of not worrying about the types - to the extent of declarations - allows the programmer to quickly get work done and spend time thinking about what should be done. The dictionary can now contain a varied set of data, such as a nested dictionary where some elements might be lists and some integers or strings. Python programs can benefit - when used appropriately - from mixing functional, procedural and object oriented design patterns without compromising the structure of the program adversely. It should be noted that the philosophy of Python is to find one good way of doing something and do that often. An example of the added functional style is the use of lambdas. This allows a small anonymous function to be declared inline and bound to a name, using the syntax shown in Figure 1.7.

power_2 = lambda n: n ** 2

print power_2(8)  # result -> 64

Figure 1.7: Python’s Lambda syntax

This can be used where it is required in the program, e.g. for an explicit single use function within a section of code, or where naming a permanent function in the scope above is unnecessary. The stack and heap are automatically managed by Python's garbage collection, though objects can be tagged for deletion using the 'del' statement. The features of Python that make it comfortable and easy to program come at a cost. When programming in C, the language does not manage memory for the programmer, and care must be taken when packing structures, sizing arrays and iterating through memory. Python by default performs worse when doing such operations, for example inserting into lists or deep copying memory. Python's performance can be improved by binding Python with C directly. Libraries such as Cython [13] and ctypes [14] allow the programmer to move performance critical components to C without diverging greatly from Python's philosophy.
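As a small illustration of the ctypes approach (not code from the dissertation), the standard C maths library can be loaded and one of its functions called directly from Python; the same mechanism can expose hand-written C routines for the performance critical parts of a program:

import ctypes
import ctypes.util

# Load the system C maths library and describe the signature of sqrt.
libm = ctypes.CDLL(ctypes.util.find_library('m'))
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print libm.sqrt(2.0)  # 1.4142135623730951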

Decorators Python allows for a function to wrap other functions, which can be used to modify how the function operates. Functions can be defined within others, and inherit their scope. In current versions of python, this functionality is provided by the ’@’ symbol, as shown in Figure 1.8.

def Spam(func):
    def inner(word):
        return "Spam"
    return inner

@Spam
def breakfast(word):
    return word

breakfast('Toast')  # -> "Spam"

Figure 1.8: Python decorator example

1.2.2 Parallelism and Concurrency

Python provides modules to create threads or processes to allow parallelism and concurrency. The main modules are threading and multiprocessing. There is also a lower level API for threads, and greenlets, which implement a very lightweight thread [15]. As the PyCSP library makes use of these, below is a brief overview of each, with an awareness of some of the restrictions placed by each module.

Threads   Threads are provided by the threading module and are used in PyCSP via the @process decorator. Threads in Python are considered by some to have poor performance [16][p216]. However, this is mostly in cases where the program is threading a CPU bound section of code. Python threads tend to be used in cases where I/O tasks are required or where the overhead is less than the single-threaded performance for the task. The main issue that is cited is the Global Interpreter Lock (GIL). This lock allows only one thread to execute byte code on the interpreter at a given time.

GIL   The Global Interpreter Lock exists to prevent two threads clashing on the interpreter, making sure only one is running code on the interpreter at a time. It also makes managing and scheduling threads easier to implement in the Python interpreter. The GIL is released when a thread enters a section of byte code for I/O, or an external call that can use its own native threading. The other condition is if a thread has owned the lock for a certain amount of time. This is a count of the number of byte code instructions progressed - around 100 - and when a thread has reached this condition the lock will be given to the next waiting thread. For I/O bound tasks, this implementation does not cause a problem; non-scientific code is more likely to read/write files or make use of devices such as sockets. A thread blocked on an I/O call can release the GIL and another thread can lock the interpreter for processing. In some situations there can be contention on the lock when the previous thread wakes from I/O, resulting in a reduction in performance. The programmer should consider what work they wish to do and choose the most appropriate tool. If the work is mostly CPU bound the preferred tool is the multiprocessing module.

MultiProcess   The multiprocessing module provides the same functionality as the threading module but with some changes to the underlying operation. Instead of creating a thread, the module will make use of a sub-process call to create a new OS process. There are restrictions on the interaction of variables with this new process, as it requires a system call to create and cannot share variables between processes. The benefit of creating a second process from the interpreter is that it is not bound by the GIL. This gives Python the ability to 'thread' CPU bound code and achieve the required increase in performance. PyCSP makes use of this module through the @multiprocess decorator.
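A minimal sketch of the difference in practice (illustrative only, not taken from the dissertation): a CPU bound function mapped over a process pool runs on several cores, because each worker is a separate interpreter with its own GIL:

from multiprocessing import Pool

def work(n):
    # CPU bound work; each call runs in its own OS process,
    # so the GIL of the parent interpreter is not a bottleneck.
    return sum(i * i for i in xrange(n))

if __name__ == '__main__':
    pool = Pool(processes=4)
    print pool.map(work, [10 ** 6] * 4)
    pool.close()
    pool.join()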

1.2.3 Python Limitations in HPC

There are some limitations found within Python that can impede its use for High Performance Computing. However, there are efforts to mitigate these, allowing programmers to use the best qualities of Python without sacrificing performance. Algorithm performance can be improved by writing portions of the code in C or Fortran. Often the programmer can use an existing library written in C or Fortran and consume it within Python. The two most prominent libraries for scientific computing are Numpy and SciPy.

Numpy   Numpy [17] features a comprehensive package for mathematics with significant use of arrays and matrices. It allows large arrays in N dimensions to be stored and indexed, with methods for matrix-matrix multiplication, BLAS [18, 19] functions and integration with Python's built in data structures.
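For example (a small illustration, not from the dissertation), a dense matrix-vector product and boolean indexing are both handled by Numpy's compiled routines rather than Python loops:

import numpy as np

# 3x3 matrix and a vector; the arithmetic runs in compiled C/BLAS code.
a = np.arange(9, dtype=np.float64).reshape(3, 3)
x = np.ones(3)

print np.dot(a, x)   # [  3.  12.  21.]
print a[a > 4]       # boolean indexing: [ 5.  6.  7.  8.]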

SciPy   SciPy [20] offers further scientific modules such as LAPACK [21] routines, Fourier transforms, sparse matrices and signal processing. SciPy builds on Numpy arrays, which allows the library to provide its sparse matrix types.

Mpi4Py   Python can make use of MPI through the Mpi4Py [22] module. This uses the C bindings to provide MPI 2.2 support to Python. Moving from the C to the Python library can be slightly confusing. MPI_Init happens at the point when the module is imported at the top of the program (from mpi4py import MPI). Once the MPI namespace import has initiated MPI, the communicator can be retrieved as MPI.COMM_WORLD. All subsequent interaction with MPI functions is through methods of the communicator object. The library defines two copies of each method. An example is shown in Figure 1.9.

comm = MPI.COMM_WORLD

comm.ssend(obj, dest=1, tag=0)

comm.Ssend(buffer, dest=1, tag=0)

Figure 1.9: An example of a send in Mpi4Py with each of the two types of sends.

The difference between the two methods is that the lower case method takes in any Python object, whereas the upper case method only takes a buffer object. The buffer object is a distinct type in Python, and its use is preferred when sending Numpy arrays. A Numpy array can be referred to by a pointer, and the buffer object becomes this pointer to the memory allocated to the array. This also helps improve performance, as the type of the array can be specified. This saves time on introspection and the time taken to serialise a Python object before sending. Another difference is that where the C version would update a variable through a pointer, the Python version returns the value from the method.
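A short sketch of the buffered form in use (illustrative; run with two MPI ranks, e.g. aprun -n 2 python example.py): the Numpy array is passed as a buffer together with its MPI datatype, so no pickling takes place:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(4, dtype=np.float64)
    # Upper case Send: the NumPy buffer is transferred directly.
    comm.Send([data, MPI.DOUBLE], dest=1, tag=0)
elif rank == 1:
    buf = np.empty(4, dtype=np.float64)
    comm.Recv([buf, MPI.DOUBLE], source=0, tag=0)
    print buf    # [ 0.  1.  2.  3.]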

DLFM   Python's use in High Performance Computing has revealed that on large distributed memory supercomputers, importing modules adds a significant overhead when using large numbers of Python instances. When Python programs are distributed over the compute nodes, each program will call the same set of import lines at the top of the program. This means a large number of requests to load the libraries from the file-system, resulting in a bottleneck. EPCC has researched one possible way round this by implementing a program named DLFM [23], which distributes the dynamic libraries (dlcache.dat) and the Python file modules (fmcache.dat). A white paper presenting DLFM [23] demonstrates its scalability using GPAW [24] as an example, resulting in a marked improvement in the import time. DLFM takes the cached modules and distributes them amongst the nodes. The nodes load the libraries from this cached file that was sent to them. This reduces the number of Python programs requesting modules from the file-system and speeds up the time taken to load all the modules. The white paper reports that a drastic improvement over the standard import mechanism was achieved.

1.2.4 PyCSP

PyCSP was first released in 2007 and has since made progress to its current implementation. While borrowing heavily from JCSP, some changes have been made to better appeal to the user base and better fit the Python language. To implement processes PyCSP defines:

@process: defines a standard process using the threading library
@multiprocess: the same as @process, but makes use of the multiprocessing module

Other process types included in PyCSP:
@sshprocess: defines a process on a particular host
@clusterprocess: a process that will be part of a cluster using the PBS scheduler

The main types used are the process and multiprocess processes. These are implemented as decorators and are placed on the line above the declaration of a function. Channels are objects created with the Channel constructor, and each has a read end - short-hand symbol a plus (+) - and a write end - short-hand symbol a minus (-). PyCSP implements channels as all-to-all. Channels are not typed and there are no special cases for all-to-one or the other relationships found in JCSP. This allows PyCSP channels to have as many readers and writers as required by the programmer. Channel objects do not mandate the type of data that can be sent along them, other than that the objects must be serialisable. PyCSP implements the choice in Alternation.py. Alternation provides Fair, Priority and Default selection functions as wrappers for the main Alternation implementation, allowing the programmer to modify the behaviour of the choice. When channels are consumed by the alternation, the programmer can give preference to, or load balance between, the channels being consumed. The Alternation - Alt for brevity - is needed since a CSP channel is a blocking send and receive. The Alt consumes channels by using a Guard object, and each channel in the Alt is one of a list of these Guards. The Alt will iterate over the Guards to test if a channel event has occurred and return the first Guard that has a completed event. If a Guard has not completed, the Alt will proceed to the next guard in its list. The Alt can take 4 types of guards: InputGuard, OutputGuard, SkipGuard and TimeoutGuard. The InputGuard wraps a channel_read end, and this has been updated for MPI channels. OutputGuards wrap the channel_write end and provide similar functionality to the InputGuard. The last two Guards are special, with Skip allowing the alternative to define a default action, and Timeout providing a timed use of the guard before terminating the Alt. Channels can be terminated by 'poisoning' or 'retiring' the channel when it should stop being used. This frees up resources and can be used to cleanly - in the case of retiring - stop all communication before the program terminates. To create processes, the 'Parallel' function is used; it takes a list of processes and their channels. It is also possible to add a multiply symbol (*) followed by the number of processes required, and Parallel will create that many copies of the process object. PyCSP so far seems to have been tested on commodity clusters but shows it can scale well and successfully implement algorithms for scientific computing. For example, an implementation of k-nearest neighbour search and a token ring test [25] shows that, with modifications, PyCSP can scale to 512 CPU cores - each running a process - with the same number of channels. This shows that it should be possible to run PyCSP on ARCHER at this level, possibly with some modifications.
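To illustrate how these pieces fit together, the following sketch uses the select wrapper and guard names described above (the exact wrapper name, assumed here to be AltSelect, and the form of its return value vary between PyCSP releases, so this should be read as an outline rather than version-exact code):

from pycsp.parallel import (process, Channel, Parallel, AltSelect,
                            InputGuard, shutdown)

@process
def worker(cout, name):
    # Write three messages to this worker's channel.
    for i in range(3):
        cout((name, i))

@process
def collector(cin1, cin2, count):
    # Service whichever input channel has data ready, count times in total.
    for _ in range(count):
        guard, msg = AltSelect(InputGuard(cin1), InputGuard(cin2))
        print msg

a = Channel('A')
b = Channel('B')
Parallel(worker(a.writer(), 'A'),
         worker(b.writer(), 'B'),
         collector(a.reader(), b.reader(), 6))
shutdown()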

1.3 Graph500

The Graph500 [2] benchmark was developed as a data-intensive benchmark that provides a performance result that is more than just the floating-point performance of a given computer. This should lead to a more accurate metric for performance when estimating the everyday physics simulation applications in use by researchers. While the main focus of this project is not Graph500, this benchmark is used as an example application. The Graph500 specification and the requirements for implementation are outlined in this section. The project's implementation of Graph500 will be tested against the sample MPI implementation supplied by the Graph500 Committee, but this will not be a test of ARCHER's performance in completing the benchmark. It will be used to assess whether the Python/CSP implementation has sufficient coverage to satisfy the requirements of a valid implementation.

1.3.1 Specification

A valid Graph500 benchmark implementation must comprise two kernels. The first kernel constructs an undirected graph from an edge list. This kernel is timed, as is the second kernel. The second kernel performs a breadth first search (BFS) on the graph. The graph must be searched 64 times, with each iteration selecting a randomly generated starting point. Once an iteration is complete the search is validated and then the next iteration can begin. The output should display the times for each stage of the two kernels - first the times for graph generation and construction, then the kernel times for BFS completion and validation. The end of the output displays statistics and settings used for the instance of the benchmark.

Problem class       Scale   Edge factor   Approx. storage size in TB
Toy (level 10)      26      16            0.0172
Mini (level 11)     29      16            0.1374
Small (level 12)    32      16            1.0995
Medium (level 13)   36      16            17.5922
Large (level 14)    39      16            140.7375
Huge (level 15)     42      16            1125.8999

Table 1.1: Graph500 Problem Class definitions [2]

Graph500 can be run with different problem classes, defined by the scale - the base-2 logarithm of the number of vertices in the graph - while the edge factor usually remains at 16. The edge factor defines the number of edges per vertex. The classes increase in size and the full set can be seen in Table 1.1.
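As a worked example (the 16 bytes per edge used here - two 8-byte vertex labels - is an assumption chosen because it reproduces the storage column of Table 1.1):

# Size of the "Toy" (level 10) problem class.
scale, edgefactor = 26, 16
vertices = 2 ** scale            # 67,108,864 vertices
edges = edgefactor * vertices    # 1,073,741,824 edges

# Two 8-byte vertex labels per edge in the raw edge list:
print edges * 16 / 1e12          # ~0.0172 TB, matching Table 1.1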

1.3.2 Graph Generation

The graph is constructed as a list of tuples where the first element stores the start point vertex and the second the end point vertex. This edge list has some rules that must be observed. The list cannot indicate locality, as this is the job of the graph construction kernel. The edge list will be modified into a data structure that will be distributed. There is no restriction on the type of data structure; this could be in the form of a sparse matrix, linked-list or hash-table. The decision on the best structure is left to the implementer.
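As one possible illustration (not the structure used later in this project), an edge list can be turned into a dictionary-of-sets adjacency structure in a few lines of Python:

from collections import defaultdict

def build_adjacency(edge_list):
    # Build an undirected adjacency structure from (start, end) tuples.
    adj = defaultdict(set)
    for start, end in edge_list:
        adj[start].add(end)
        adj[end].add(start)
    return adj

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print build_adjacency(edges)[2]    # set([0, 1, 3])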

The Graph500 Committee provides sample code in GNU Octave [26] - a MATLAB-like program. Figure 1.10 shows a simpler version of the generator that is provided in the sample code.

function ij = kronecker_generator (SCALE, edgefactor)
  %% Generate an edgelist according to the Graph500
  %% parameters. In this sample, the edge list is
  %% returned in an array with two rows, where StartVertex
  %% is first row and EndVertex is the second. The vertex
  %% labels start at zero.
  %%
  %% Example, creating a sparse matrix for viewing:
  %%   ij = kronecker_generator (10, 16);
  %%   G = sparse (ij(1,:)+1, ij(2,:)+1, ones (1, size (ij, 2)));
  %%   spy (G);
  %% The spy plot should appear fairly dense. Any locality
  %% is removed by the final permutations.

  %% Set number of vertices.
  N = 2^SCALE;

  %% Set number of edges.
  M = edgefactor * N;

  %% Set initiator probabilities.
  [A, B, C] = deal (0.57, 0.19, 0.19);

  %% Create index arrays.
  ij = ones (2, M);
  %% Loop over each order of bit.
  ab = A + B;
  c_norm = C / (1 - (A + B));
  a_norm = A / (A + B);

  for ib = 1:SCALE,
    %% Compare with probabilities and set bits of indices.
    ii_bit = rand (1, M) > ab;
    jj_bit = rand (1, M) > ( c_norm * ii_bit + a_norm * not (ii_bit) );
    ij = ij + 2^(ib - 1) * [ii_bit; jj_bit];
  end

  %% Permute vertex labels
  p = randperm (N);
  ij = p(ij);

  %% Permute the edge list
  p = randperm (M);
  ij = ij(:, p);

  %% Adjust to zero-based labels.
  ij = ij - 1;
end

Figure 1.10: Sample GNU Octave graph generator supplied by Graph500

Both the sample code and the Octave code generate a list of edges. The sample code takes this edge list and builds an adjacency matrix. This is a symmetric sparse matrix that is used to store the graph. These data structures, once distributed, cannot be modified by any of the subsequent parts of the program. After the creation and distribution of the graph, 64 search keys are to be created. These are the points in the graph that become the root node in the Breadth First Search.

1.3.3 Breadth First Search

The second kernel performs a breadth first search of the graph from each of the 64 search keys, producing a BFS tree that is then validated. Among the reference implementations described in Section 1.3.4, the MPI solver is implemented in bfs_simple.c, with other versions that exploit different features of MPI, such as single sided communications. The simple version is used to help keep the requirements of the Python CSP version manageable and in the functional comparison between the two versions. A serial sketch of the search itself is shown below.
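For reference, a serial level-synchronous BFS over an adjacency structure can be written in a few lines of Python (an illustrative sketch, not the distributed solver developed in Chapter 2); it returns the parent of each reached vertex, which forms the BFS tree that the validation step inspects:

def bfs(adj, root):
    # Level-synchronous breadth first search.
    # adj maps each vertex to an iterable of its neighbours.
    parent = {root: root}
    frontier = [root]
    while frontier:
        next_frontier = []
        for vertex in frontier:
            for neighbour in adj.get(vertex, ()):
                if neighbour not in parent:
                    parent[neighbour] = vertex
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return parent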

1.3.4 Implementations

Graph500 provides a repository of reference code. This contains examples of sequential, OpenMP, MPI and XMT versions of code that implement Graph500. In this project, the serial and MPI sample code will be used during development, with the MPI version used as the sample code to compare against the Python-CSP version. The reference code also provides a separate example of the graph generator in C.

Chapter 2

Python/CSP implementation

Implementing Graph500 using PyCSP on ARCHER required setting up a copy of the library using virtualenv. PyCSP is not provided on ARCHER by default but can be installed from the Python Package Index using the pip command. virtualenv is used on a directory to create a local instance of Python and a locally linked set of libraries. pip is run from within this environment to install PyCSP. Once set up, a PBS job can be run that sources the local environment and runs a PyCSP program.

2.1 PyCSP Process Creation

Processes in PyCSP are created by listing the required processes in the Parallel() function. This, however, has no knowledge of the architecture of the machine and cannot make any intelligent placement of processes. Since all processes are instances of threading or multiprocessing objects and are local to the interpreter, any program that should scale beyond one node needs a copy of the interpreter on each node. On ARCHER, programs are placed on the compute nodes using the aprun command. This is part of ALPS [27], which manages and allocates processes across the compute nodes. Using aprun allows the scheduler to create N copies of the PyCSP program across ARCHER. An example of a PBS script can be seen in Figure 2.1. However, this has just created clones of the same program. Each PyCSP instance defines the same Parallel and will execute the processes without knowledge of the others. To amend this, PyCSP must know which processes are assigned to each node, allowing the creation of a Parallel that is run across nodes, with the ability to run a second Parallel within the distributed PyCSP processes.

2.1.1 Distributed PyCSP Processes

The Parallel in the current implementation is a wrapper around a function that takes in a list of processes. These are then iterated over and the start method of each is called. This is a method inherited from the threading and multiprocessing classes that the decorated processes are built on. This can be seen in Figure 2.2. The Parallel variable plist holds the processes that were declared by the programmer. This function - _parallel - will need to be amended to allow processes to be distributed over ARCHER.

#!/bin/bash --login

# PBS job options (name, compute nodes, job time)
#PBS -N graph-500
#PBS -l select=1
#PBS -l walltime=00:10:00
#PBS -A d68

export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)
cd $PBS_O_WORKDIR

source ./bin/activate
aprun python program.py

Figure 2.1: A PBS script to run a python job on ARCHER

During development there was no obvious way to implement this using an API from Cray to achieve something similar to the distribution of MPI programs. To solve this problem, the _parallel function was modified to use MPI via the mpi4py module. Each copy of the PyCSP program initialises MPI and obtains a rank and the size of the pool from MPI_COMM_WORLD. Now that each copy of the program has this information from MPI, it can use it to run one process from the list. Figure 2.3 shows the processes passed to the Parallel function and the modification, with the private parallel function now only being called with one of the processes. Since plist can be indexed by rank number, the first process in the list is assigned rank 0, and so on up to the size of the communicator. This now allows PyCSP to run processes over ARCHER. Other solutions to this were discussed, such as serialising and sending processes from rank 0 to the other ranks. This would allow more than one process to be issued per rank, but the scaling was deemed worse than having each copy pick a process to run. This would be useful work for mobile processes. Some CSP implementations allow processes to be sent via channels, treating them as mobile tasks. This could also be used in situations where CSP processes are dynamically instanced based on the required decomposition of work. Should a node have more work than required, a mobile process could be sent to redistribute work and move to a quieter node. One other prospective solution for distributing processes on ARCHER was the use of the @clusterprocess decorator. This process is able to read the PBS node file that lists all the compute node hostnames. However, ARCHER does not use the PBS system in full - hydra is used for standard clusters where ALPS is used on ARCHER - and the node file is not available to users. Now that processes can be distributed, there is the issue of communicating between them.

def Parallel(*plist, **kwargs):
    return _parallel(plist, True, kwargs)

def _parallel(plist, block, kwargs):
    processes = []
    for p in plist:
        if type(p) == list:
            for q in p:
                processes.append(q)
        else:
            processes.append(p)

    # Update process parameters
    if kwargs:
        for p in processes:
            p.update(**kwargs)

    # Start processes
    for p in processes:
        p.start()

    if block:
        # Blocking behaviour
        return_values = []
        for p in processes:
            return_values.append(p.join_report())
        return return_values

    else:
        # Spawn
        p, _ = getThreadAndName()
        p.spawned.extend(processes)

Figure 2.2: PyCSP Parallel functions for process instantiation. Docstrings have been removed for figure clarity

# outside the parallel function
Parallel(process_1(chan+, chan2-),
         process_2(chan-, chan3+), process_3(chan2+, chan3-))

# inside the parallel function
return _parallel(plist[rank], True, kwargs)

Figure 2.3: Modification to provide each MPI Rank with a process to run
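The idea behind this modification can be summarised in a short, hypothetical helper (illustrative only, not the project's actual code; parallel_by_rank is an invented name): every aprun-launched copy of the program builds the same process list, and the MPI rank selects the single entry that copy should run:

from mpi4py import MPI

def parallel_by_rank(plist):
    # Every copy of the program evaluates the same Parallel(...) call and
    # therefore builds an identical plist; the rank picks one entry.
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    if rank >= len(plist):
        raise IndexError("aprun launched more ranks than declared processes")
    proc = plist[rank]
    proc.start()                  # threading/multiprocessing start(), as in Figure 2.2
    return proc.join_report()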

2.2 PyCSP Inter-process Communication

When designing PyCSP programs - and CSP programs in general - they may start with a connected diagram of the processes. Figure 2.4 shows an example diagram of 3 connected processes. For these processes to communicate between nodes, some form of networked channel is required.

Figure 2.4: Example process graph of 3 concurrent processes connected with single channels

PyCSP has the ability to create networked channels by passing the address of the terminating end of the channel to the channel constructor. A simple example program is shown in Figures 2.5 and 2.6. The figures show one program defining a 'source', where data will be written, and another program defining a 'sink', where data will be received. The 'source' program specifies, on line 8, the network address of the machine hosting the sink's channel. This is implemented using TCP/IP sockets and allows the programmer to communicate between distributed processes.

 1  import pycsp.parallel as csp
 2
 3  @csp.process
 4  def sink(channel_in):
 5      print channel_in()
 6
 7  par = csp.Par()
 8  chan = csp.Channel("A")
 9  par.Parallel(-chan)
10  csp.shutdown()

Figure 2.5: Network channel example showing the reading end sink

This implementation of networked channels is not able to work on ARCHER's compute nodes. ARCHER features the Cray XC30's proprietary Aries interconnect, and the compute nodes use this as their own network, separate from any front end node from which the programmer launched the PyCSP job. The PyCSP channel assumes that the programmer has prior knowledge of which nodes are hosting each subscribing process, but on ARCHER the programmer is not able to acquire this information beforehand. Not only are two jobs unlikely to be scheduled in the same place, there is no command to reserve particular nodes and no list of hosts from the programmer's view of ARCHER. MPI can be used in place of the current implementation of network channels, as it has already solved these discovery problems. MPI knows the location of the node hosting a particular rank number and can correctly send messages between processes. Since processes are now being distributed with one per rank, the communication relationships shown in Figure 2.4 can be expressed.

1  import pycsp.parallel as csp
2
3  @csp.process
4  def source(channel_out):
5      print channel_out("Spam Spam Spam")
6
7  par = csp.Par()
8  chan = csp.Channel("A", connect=("192.168.1.1", 5678))
9  par.Parallel(+chan)
10 csp.shutdown()

Figure 2.6: Network channel example showing the writing end source

PyCSP channels are designed to have one channel class and a subclass for the channel ends, which is further sub-classed for each type of end - read or write. This can be replicated for the new ChannelMPI class. Each channel end is amended to call a private method of the channel object: ChannelMPIEndWrite will call the method for posting an MPI_Ssend, and ChannelMPIEndRead will post an MPI_Recv. In order to communicate, each channel must know the rank it is sending to and the source of the messages it receives. The best place to add this functionality is in the Parallel function.
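A minimal sketch of this arrangement is shown below, assuming mpi4py. The class names follow the text (ChannelMPI, ChannelMPIEndWrite, ChannelMPIEndRead), but the bodies are illustrative rather than the library code; the comm, tag and other_end attributes are filled in later by the Parallel function (Figure 2.7).

# Illustrative sketch only - not the PyCSP source. Each end calls a private
# method on the shared channel object, which issues a blocking MPI call.
from mpi4py import MPI

class ChannelMPI(object):
    def __init__(self, name):
        self.name = name
        self.comm = None        # set later via set_comm()
        self.tag = None         # set later via set_tag()
        self.other_end = None   # rank of the opposite end, from the channel graph

    def _send(self, msg):
        # blocking synchronous send, matching the blocking CSP semantics
        self.comm.ssend(msg, dest=self.other_end, tag=self.tag)

    def _read(self):
        # blocking receive from the opposite end
        return self.comm.recv(source=self.other_end, tag=self.tag)

class ChannelMPIEndWrite(object):
    def __init__(self, channel):
        self.channel = channel
        self.end_type = "write"

    def __call__(self, msg):
        return self.channel._send(msg)

class ChannelMPIEndRead(object):
    def __init__(self, channel):
        self.channel = channel
        self.end_type = "read"

    def __call__(self):
        return self.channel._read()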

1  if self.MPI:
2      comm = MPI.COMM_WORLD
3      world = comm.Get_size()
4      if world >= 1:
5          rank = comm.Get_rank()
6          if plist[rank].args:
7              channel_ends = [[end for end in proc.args] for proc in plist]
8              list_ends = [end for ends in channel_ends for end in ends]
9              my_ends = channel_ends[rank]
10             channel_index = [channel_ends.index(ends) for ends in channel_ends]
11             channel_names = list(set(chan.channel.name for chan in list_ends))
12             channel_graph = {}
13             for my_end in my_ends:
14                 for end in list_ends:
15                     if my_end.channel.name == end.channel.name \
16                             and end.end_type != my_end.end_type:
17                         end_rank = [channel_ends.index(ends)
18                                     for ends in channel_ends for e in ends if e is end]
19                         channel_graph[my_end.channel.name] = [my_end,
20                                                               rank, end, end_rank[0]]
21                         end.channel.set_graph(channel_graph)
22                         end.channel.set_comm(comm)
23                         end.channel.set_tag(channel_names.index(end.channel.name))
24             return self._parallel_mpi(plist[rank], True, kwargs)

Figure 2.7: Distributed channel management for Distributed Processes

Figure 2.7 shows the code required to set up networked channels. Since each node is now running this code, each can enumerate the channels from its own view of the system. All processes have knowledge of their own channels and those of the other processes, and can set up a data structure - channel_graph - to store each channel and the endpoints of that channel. Lines 13-19 loop over the channel data structures to find the opposite end for each of the ends that the current process owns. These ends are entered into the channel_graph with the rank of the owning process; within this loop the rank is discovered from the index of the channel in the process list. Lines 21 to 23 update the channel object with the channel_graph, a copy of the MPI_COMM object and the tag. Each channel's tag is the index of the channel's name in an ordered list. To match the rules of CSP, the MPI channels use blocking calls, mirroring the blocking nature of the sockets implementation. With a working implementation of MPI channels, the final view of the PyCSP program on ARCHER can be seen in Figure 2.8. Each process described in the original diagram and placed in the parallel is assigned a node and can now communicate via a point to point channel between nodes.

Figure 2.8: Physical view of distributed PyCSP processes with networked channels

2.2.1 Alternative Solutions and Considerations

When starting processes using MPI, aprun must be given the correct number of processes that will be instantiated within the parallel. Misinforming aprun will result either in an error from the parallel that the list index was out of range for a given rank, or in an error during runtime because a process has not been allocated a rank.

Implementing channels with point-to-point MPI functions does restrict any improvements that could be gained by using one of the MPI collective communications. If it is known that a channel is sending to more than one process, it would perform better to use MPI_Bcast, MPI_Scatter or MPI_Alltoall. While this was considered, channels implementing collectives were deemed too complex, requiring their own sub-communicator to correctly send and receive the data to the correct subscribing processes. This would need to be investigated in further work.

This implementation requires further work to provide better support for multiple channels. Some of this work is detailed in Section 2.3.
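One way to fail earlier and more clearly than the index error described above would be to compare the communicator size against the number of processes handed to the Parallel. The following check is a hedged sketch and is not part of the modified library:

# Hypothetical guard: verify that the number of ranks started by aprun
# matches the number of processes given to the Parallel.
from mpi4py import MPI

def check_process_count(plist):
    size = MPI.COMM_WORLD.Get_size()
    if size != len(plist):
        raise RuntimeError(
            "Parallel was given %d processes but aprun launched %d ranks"
            % (len(plist), size))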

2.2.2 Syntax for modified Parallel and modified Channels

An effort was made to minimise the changes that implementing processes and channels for ARCHER would have on the syntax of PyCSP. One syntax change does break backwards compatibility, but provides a clearer and more maintainable interface when using a parallel. The Parallel function was originally a function within the process module; however, Parallel is not the only function for creating processes - there are also Spawn and Sequence - so these have been moved into their own class. The Parallel function is now a member of the Par() class, making it clearer when creating a parallel whether it is using MPI or not. This is handled using an option in the Par() constructor: to create a Parallel using MPI, the programmer must set Par(MPI=True) when instantiating the Par object. MPI channels are used by instancing ChannelMPI instead of Channel. The MPI postfix makes it easier to switch between channel implementations and gives a clear distinction as to which channel implementation is in use.
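A short usage sketch of the modified syntax is given below. The source and sink processes are illustrative examples, and the choice of + and - for the two channel ends follows the usage in the final program (Figure 2.21); Par(MPI=True) and ChannelMPI are the option and class described above.

# Example use of the modified Parallel and MPI channel syntax. The two
# processes here are illustrative only.
import pycsp.parallel as csp

@csp.process
def source(chan_out):
    chan_out("Spam Spam Spam")

@csp.process
def sink(chan_in):
    print chan_in()

par = csp.Par(MPI=True)        # the Parallel hands one process to each MPI rank
chan = csp.ChannelMPI("A")     # MPI-backed channel instead of a socket channel
par.Parallel(source(-chan), sink(+chan))
csp.shutdown()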

2.2.3 Alts! (Alternatives) and Guards

On completing process distribution and network channels, consideration of how to implement graph500 revealed that more work on the infrastructure was required, namely the PyCSP features of Guards and the Alternative.

To allow a process to better consume more than one MPI channel, the Alt implementation was updated to add support for MPI defined channels. So as not to change the functionality of the Alt - such that it deviates from the correct behaviour or the additions break any existing functionality - the Alt now has some added checking for the type of channel. In the new implementation there has been no change to the [Alt | Pri | Fair]Select functions; the changes have been made in the Alternation class and in the CSP MPI channels.

When the Alt is used, it first calls a method execute with the list of guards passed. This is internally consumed by the _choice() method. For each guard the channel is inspected and the channel's _post_read() or _post_send() method called. This creates the equivalent of a non-blocking communication, waits for the channel to be acknowledged and reads the message buffer. This behaviour is translated into the MPI equivalent for networked channels. Within the Alt, the _choice() method is amended to check whether the channel in the guard is an MPI channel. With the MPI flag set, the MPI channel's _post_read() method is called, and any other checks made by the Alt are skipped. The Alt then needs to test for the completion of the MPI communication. Figure 2.9 shows this change to the _choice() method. Some of the existing infrastructure has to be used, such as line 14, where the process thread running the CSP process needs to know that the channel completed; this is due to a test later in the Alternative code that is reset after the completion of the Alt. The implementation uses MPI_Waitany, collecting the non-blocking receive objects from all the channels in the Alt and waiting until one has been resolved.

1  if chan.type == "MPI":
2      chan_type = "MPI"
3      chan._post_read(proc)
4
5  if chan_type == "MPI":
6      request = []
7      status = Status()
8      for chan in reqs.keys():  # move that up to post and discard result of active
9          if chan.active:
10             request.append(chan.channel.irecv_result)
11         else:
12             chan._post_read(proc)
13     wait_id, msg = Request.waitany(request, status)
14     proc.state = SUCCESS
15     for chan in self.guards:
16         if chan[0].channel.tag == status.tag:
17             proc.result_ch = chan[0].channel.name
18             chan[0].active = False

Figure 2.9: Amendments to Alternative.py to allow for MPI channels.

This matches the existing implementation; after a request has been consumed, a replacement non-blocking communication is posted for the next iteration of the Alt. The mpi_channels have been amended to implement non-blocking communication. Figure 2.10 shows the additions to the channel end.

# non-blocking receive method added to mpi channel
def _irecv(self):
    self.irecv_result = self.comm.irecv(source=self.other_end, tag=self.tag)


# addition to the channel read end
def _post_read(self, process, ack=False):
    if self.active is not True:
        self.channel._irecv()
        self.active = True

Figure 2.10: Modifications made to mpi_channels

The channel stores the result of the non-blocking communication as a member attribute that is read by the alternative. The channel end will post a non-blocking receive only if the channel flag active has not been set, which avoids dangling communications. This now allows all types of Alts to be used on MPI channels, allowing a node to read and manage multiple distributed processes. MPI Channels have also been tested to check that the action on a guard can complete, allowing for the use of a function decorated as @choice.
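With these changes in place, an Alt over several MPI channels can be used in the usual PyCSP style. The sketch below is illustrative only; the process and channel-end names are examples, and the AltSelect/InputGuard calls follow the form used later in Figure 2.20.

# Illustrative process reading from whichever of two MPI channels is ready.
import pycsp.parallel as csp

@csp.process
def collector(in_a, in_b):
    for _ in range(2):
        # wait on whichever channel completes its non-blocking receive first
        chan, msg = csp.AltSelect(csp.InputGuard(in_a),
                                  csp.InputGuard(in_b))
        print msg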

2.3 Multiple Readers and Writers of networked channels

During the project it was realised that multiple readers and writers would be required to expand beyond one node; however, during discussions it was not clear in what way they should be implemented, and this led to the feature being moved out of scope for this project. Part of the issue is the use of MPI. For channels using sockets, while there is still an atomic operation on the socket, one-to-any and any-to-one channels can be addressed without major changes to the underlying operation. For an MPI-based channel, consideration is needed of the optimal implementation: a collective communication that is optimised for bulk sending of buffers to multiple subscribers, or a rudimentary implementation using multiple point-to-point sends.

Figure 2.11: PyCSP Graph Solver graph with MPI rank numbers

Figure 2.11 shows the diagram of the implemented PyCSP breadth first search, with the red numbers indicating the assigned rank for each process. It can be seen from this diagram that the graph distributor needs a one-to-any channel, the workers require an any-to-one channel and the queue sender needs a one-to-any channel. Implementing this relationship in PyCSP using MPI requires changing the parallel method and the mpi_channel code.

During the project, a restricted period of 2-3 days was allocated to implement a basic multiple point-to-point send and to evaluate what would be required for a full implementation. The design was to count the number of ends that the channel had and issue the corresponding number of sends and receives. This is not an optimal implementation, as it requires a loop over point-to-point channels, negating any optimisation that can be found in an MPI collective communication and requiring a call for every communication. The necessary changes to the parallel.py code were to record the correct number of opposite ends of the channel, and their ranks.

1  for my_end in my_ends:
2      end_count = 0
3      for end in list_ends:
4          if my_end.channel.name == end.channel.name and end.end_type != my_end.end_type:
5              # end_rank = [channel_ends.index(ends) for ends in channel_ends for e in ends if e is end]
6              end_rank = self.list_duplicates(channel_ends, [ends for ends in channel_ends for e in ends if e is end][0])
7              channel_graph[my_end.channel.name] = [my_end, rank, end, end_rank]

Figure 2.12: Updated channel end discovery

Figure 2.12 shows the updated end discovery with the original code commented out. Originally, line 5 would unpack the list of channel ends to create a flat list and search for the matching index of the current end. While this works for one end and returns the rank of the opposite end of the current rank's end, it does not account for duplicate ends: if there are two channel_end_reads of the same channel object, only the first one will be correctly indexed. The new implementation returns a list of ranks for each matching channel end. This now allows each channel to know who the sender or receiver is at each channel end. The channel still sends point-to-point messages, only now, if there are two senders, two matching receives will be posted for each send. The required code changes to mpi_channel are shown in Figure 2.13.

# channel gains a _send_mul method
def _send_mul(self, msg):
    for rank in self.other_end:
        self.comm.ssend(msg, dest=rank, tag=self.tag)

# __call__ added a check to see if it is a multi channel
# this is most likely unnecessary and could be implemented by
# looping over a list of one end rank if it is not a multi channel
if self.mul:
    return self.channel._send_mul(msg)
else:
    return self.channel._send(msg)

Figure 2.13: Changes made to mpi_channel send. Similar changes are made for recv
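The corresponding receive-side changes are not reproduced in Figure 2.13. A sketch of what a matching _recv_mul method might look like - assuming the same other_end list of ranks - is given below; this is an assumption for illustration, not the code actually used.

# Assumed receive-side counterpart to _send_mul: post one matching blocking
# receive per sender recorded in other_end. Illustrative only.
def _recv_mul(self):
    msgs = []
    for rank in self.other_end:
        msgs.append(self.comm.recv(source=rank, tag=self.tag))
    return msgs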

This implementation does work, but there is either an outstanding communication or a failure to terminate the running processes: the program completes its work but remains running. However, it does show that this approach is possible and that more work is required. Section 5.1 details some of the considerations for further work.

2.4 Graph Generation

The Graph500 specification provides a sample graph generator, shown previously in Figure 1.10. A serial version of the graph generator was built to understand how the graph structure is built, for the development of the Python breadth first search.

import numpy as np
import numpy.ma as ma
import math
import random

def graph_gererator(scale, edgefactor=16):

    N = int(math.pow(2, scale))
    M = edgefactor * N
    A = 0.57
    B = 0.19
    C = 0.19

    ij = np.ones((2, M))

    ab = A + B
    c_norm = C / (1 - ab)
    a_norm = A / ab

    for ib in xrange(scale):
        ii_rand = np.random.rand(1, M)
        ii_bit = ma.filled(ma.masked_where(ii_rand > ab, ii_rand), 0)

        jj_rand = np.random.rand(1, M)
        ii_not = np.logical_not(ii_bit).astype(int)
        jj_bit = ma.filled(ma.masked_where(jj_rand > (c_norm *
                                                      ii_bit +
                                                      a_norm *
                                                      ii_not), jj_rand), 0)
        ii_jj = np.concatenate((ii_bit, jj_bit))
        ij = ij + math.pow((ib - 1), 2) * ii_jj

    # np.random.shuffle(ij)

    ij = ij.astype(int) - 1

    ij = ij.reshape(M, 2)

Figure 2.14: Python serial version of the sample graph generation

Figure 2.14 implements the Octave sample using NumPy. It uses the supplied scale parameter to compute a power of 2 and multiplies this by the edge factor of 16; this large number defines the number of edges. An array is created and then populated, storing two numbers for each of the calculated number of edges. The program returns this two dimensional array as the edge list, which is then translated into the sparse matrix that represents the graph.
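As a concrete check of the sizes this produces, the scale used later for the full benchmark runs gives

N = 2^{\text{scale}} = 2^{26} = 67{,}108{,}864, \qquad M = 16N = 16 \times 2^{26} = 1{,}073{,}741{,}824,

matching the vertex and edge counts reported in Section 3.1.1.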

This program was checked against the supplied Octave program and the C version to confirm the output from the Python version. While both worked, the C sample code implemented a seed, which allowed the same graph to be generated for each scale factor. The Python-generated graph was run with the C version's breadth first search worker, and this validated the output. While the Python version was generating correct graphs, it was decided that the main priority was to implement the breadth first search. The C version, producing stable and reproducible graphs, was the better candidate to use over the Python implemented version, as this prevents subtle changes in the graphs from altering the results.
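Had the Python generator been kept, reproducibility could be recovered by seeding NumPy's random number generator before drawing the edge samples. The wrapper below is a minimal sketch of this idea - the seed value is arbitrary and the wrapper is not part of the project code - assuming the generator of Figure 2.14 returns the edge list as described above.

# Possible extension (not in the original code): fix the NumPy seed so that
# the same scale always produces the same graph, as the C generator does.
import numpy as np

def seeded_graph_generator(scale, edgefactor=16, seed=2015):
    np.random.seed(seed)                        # fixed seed => reproducible edge list
    return graph_gererator(scale, edgefactor)   # generator from Figure 2.14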

2.5 Breadth First Search

2.5.1 Design

To design the Python/CSP version of the breadth first search, the first task was tracing and understanding the approach of the C implementation. Figure 2.15 shows a section of the Vampir trace for the graph500 sample program. The program was run with a small graph and a small number of nodes to make the trace easier to observe. It can be noted that the program is dominated by communication; there are very few areas of calculation within the program - denoted by the green areas in each process bar. The red zones indicate time the program spent on communication, with the black lines showing the direction of the MPI communication and the MPI call. MPI_Isends dominate the communication between ranks as they exchange vertices, with a reduction at the end to make sure each rank is on the same level in the graph before proceeding to the next level. Noting this interaction helps visualise the behaviour of the ranks when reading the sample code.

The program defines bfs_simple.c, which implements the breadth first search, with main.c having created the adjacency matrix and distributed the graph amongst the ranks. The code has comments indicating that bfs_simple.c implements a 1 dimensional level synchronous search. For this project, the Python/CSP version implements the same algorithm, as closely as possible, and traverses the same graph as the C generated version.

The first step in implementing was to find a pseudo-code description of the algorithm. The C version contains a large body of code implementing the algorithm in a C style, and this should not be copied across, as the Python implementation should be written to make best use of the high level features in Python. Some of the work required in the C version will be provided by functions within Python or through a scientific library.

Figure 2.16 shows a pseudo-code implementation presented at the Supercomputing 2005 Conference [28]. This implementation creates a data structure to store the level of each vertex as the program descends through the levels. Each processor is given a slice of the adjacency matrix and stores a local set of vertices it owns. The program finds the neighbours of each vertex - some of which may be on other processors - and then swaps vertices with each of the other processors. By creating a union of the incoming neighbours and the local neighbours, each processor determines the vertices for that level, and can record the visitation of each vertex at that level. This implementation is not the best performing; there are other implementations using a combination of top-down and bottom-up searching, as well as implementations decomposing the graph matrix in 2D. These implementations provide a greater speed-up in the total number of edges visited and better scalability for large graphs over thousands of nodes [29].

Figure 2.15: Vampir Trace of graph500 sample C/MPI

Algorithm 1 Distributed Breadth-First Expansion with 1D Partitioning

1:  Initialize L_vs(v) = 0 if v = vs (where vs is a source), and L_vs(v) = ∞ otherwise
2:  for l = 0 to ∞ do
3:    F ← {v | L_vs(v) = l}, the set of local vertices with level l
4:    if F = ∅ for all processors then
5:      Terminate main loop
6:    end if
7:    N ← {neighbors of vertices in F (not necessarily local)}
8:    for all processors q do
9:      N_q ← {vertices in N owned by processor q}
10:     Send N_q to processor q
11:     Receive N̄_q from processor q
12:   end for
13:   N̄ ← ∪_q N̄_q (the N̄_q may overlap)
14:   for v ∈ N̄ and L_vs(v) = ∞ do
15:     L_vs(v) ← l + 1
16:   end for
17: end for

Figure 2.16: Distributed Breadth-First Expansion with 1D Partitioning from [28]

The communication between processors can be seen in lines 10 and 11 of Figure 2.16 and shows a point-to-point relationship with the neighbours in the graph. The C version implements this by making a note of all processors that are requesting vertices and matching each request until all have been satisfied, there are no more vertices to send, and each vertex in the frontier has been searched. For the PyCSP version, since all channels are blocking, it is not possible to connect two processes directly, as both would block waiting to receive the other's send. This could be alleviated with a buffered channel, but at some point the channel would block if the buffer space was full. The safest way is to add processes to coordinate messages between the processes working on the breadth first search.

Figure 2.17 shows the processes and communication directions for the implementation of the breadth first search. The graph is decomposed in the graph distributor after reading in the edge list from file. Each worker is given a section of the graph and searches through it. The two processes at the bottom offer a scalable way of swapping data between the workers. The queue_store can implement an Alt to read from multiple senders and store the neighbours from each of the workers; by sending them on to a second process it can return immediately to the Alt and handle the next incoming send. The Queue Sender waits on an Alt for a worker asking for data, and can then respond by removing the first set of data from its internal queue, and then check if the Store has anything to put on the Queue.

2.5.2 Implementation

The next stage was to implement the algorithm. The search works on a Compressed Sparse Row (CSR) matrix and it was necessary to find how best to search this structure. A presentation on Large-Scale Graph Traversal Algorithms on Multi-core Clusters [30] shows diagrams of how to search the CSR graph using the row and column indices in the CSR data structure to traverse the graph.

Figure 2.17: CSP diagram for PyCSP implementation of Breadth First Search graph solver

A serial example of this graph and search method was constructed using the serial implementation of breadth first search. On completing this, the serial algorithm was adapted to implement the 1D level-synchronous algorithm in serial - omitting the decomposition and the sending and receiving of the neighbouring vertices. Having completed this and verified it using the small test graph, the implementation was then inserted into the worker process. The final step was implementing the sending and receiving of distributed neighbours from the decomposed data.

@csp.multiprocess
def graph_decomposer(to_worker, workers=1):
    graph = gr.read_graph("src/c_graph")
    # graph = gr.test_graph()
    split = graph.shape[1] / workers
    graph_chunk = [graph[0 + split * nw : split * (nw + 1)] for nw in xrange(workers)]
    for worker in xrange(workers):
        to_worker(graph_chunk[worker])

Figure 2.18: Graph Decomposer process

The graph decomposer uses methods from the graph reader (gr) to read in the edge list from disk and turn it into a symmetric sparse matrix. This is then decomposed amongst the workers. The decomposition is in one dimension over the rows of the matrix and stored in a list. For the number of workers specified as an argument to the decomposer, each worker is sent its portion of the matrix.
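The decomposition itself is simple slicing arithmetic over the rows. A small worked sketch with assumed sizes (a 4096-row matrix split over 4 workers) shows the row ranges each worker would receive:

# Worked example of the row decomposition performed by graph_decomposer,
# with assumed sizes (4096 rows, 4 workers); Python 2 integer division.
N = 4096
workers = 4
split = N / workers                      # 1024 rows per worker
chunks = [(split * nw, split * (nw + 1)) for nw in xrange(workers)]
print chunks                             # [(0, 1024), (1024, 2048), (2048, 3072), (3072, 4096)]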

@csp.multiprocess
def bfs_worker(queue_put, queue_get, queue_ask, graph_in):
    graph = graph_in()
    sp_rows = graph.indptr.tolist()
    sp_col = graph.indices.tolist()
    Level = {0: [0]}
    i = 0  # level counter
    while True:
        Level[i+1] = []
        F = Level[i]
        if not F:
            break
        for u in F:
            try:
                start, end = sp_rows[u:u+2]
            except:
                break
            neighbors = sp_col[start:end]
            queue_put(neighbors)
            queue_ask('1')
            new_neighbors = queue_get()
            u_neighbors = list(set(neighbors) | set(new_neighbors))
            for v in u_neighbors:
                if v not in [u for x in Level.values() for u in x]:
                    Level[i+1].append(v)
        # increment level
        i += 1
    csp.retire(queue_put, queue_get, queue_ask, graph_in)

Figure 2.19: Breadth first search worker

The bfs_worker is the main component of the solver and reads in the decomposed matrix. Using the row and column indices, the worker resolves the neighbours of the current frontier of vertices and appends them to the next level. The level is incremented when the neighbours of the current frontier have been gathered from the other workers and recorded at that level. This code mirrors the pseudo-code algorithm as closely as possible, to implement the algorithm as correctly as understood.
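The CSR lookup at the heart of the worker can be demonstrated on a tiny graph. The sketch below, assuming SciPy, mirrors the indptr/indices slicing used by bfs_worker; the 4-vertex edge set is an assumed example, not the benchmark graph.

# Small worked example of the CSR neighbour lookup used in bfs_worker.
import numpy as np
from scipy.sparse import csr_matrix

# Undirected edges 0-1, 0-2 and 2-3, entered symmetrically
row = np.array([0, 1, 0, 2, 2, 3])
col = np.array([1, 0, 2, 0, 3, 2])
data = np.ones(len(row), dtype=int)
graph = csr_matrix((data, (row, col)), shape=(4, 4))

sp_rows = graph.indptr.tolist()
sp_col = graph.indices.tolist()

u = 0
start, end = sp_rows[u:u + 2]
neighbors = sp_col[start:end]
print neighbors          # neighbours of vertex 0, here [1, 2]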

@csp.multiprocess
def queue_push(worker_put, queue_out):
    while True:
        chan, msg = csp.AltSelect(csp.InputGuard(worker_put))
        if chan:
            queue_out(msg)

@csp.multiprocess
def queue(queue_in, worker_ask, worker_get):
    queue_list = []
    while True:
        queue_list.append(queue_in())
        chan = csp.AltSelect(csp.InputGuard(worker_ask))[0]
        if chan == worker_ask:
            try:
                worker_get(queue_list.pop())
            except:
                pass

Figure 2.20: Queue processes

if __name__ == "__main__":
    N = 2  # number of BFS workers

    PAR = csp.Par()

    CHAN_A2Qask = csp.Channel('A')
    CHAN_A2Qget = csp.Channel('B')
    CHAN_A2Qput = csp.Channel('C')
    CHAN_Q2Q = csp.Channel('D')
    to_worker = csp.Channel('G')

    PAR.Parallel(graph_decomposer(-to_worker, workers=N),
                 queue_push(+CHAN_A2Qput, -CHAN_Q2Q),
                 queue(+CHAN_Q2Q, +CHAN_A2Qask, -CHAN_A2Qget),
                 bfs_worker(-CHAN_A2Qput, +CHAN_A2Qget, -CHAN_A2Qask, +to_worker) * N)
    csp.shutdown()

Figure 2.21: Main program entry point

The Queue processes are required due to the blocking nature of CSP channels. They allow each of the workers to submit its current set of neighbours; each set is then collected and sent to the other workers. The use of the Alt allows the Queue processes to wait on a worker progressing to the point where it can send, without blocking on a slow worker. Two processes are required so that more than one worker can place items in the queue and then pop them from the list stored in the other process; condensing this into one process would result in a bottleneck or a process waiting for items on the queue.

Each of the processes uses a standard PyCSP parallel (as opposed to the MPI parallel), which defines each process and its channels. Multiple workers are defined using the overloaded multiply operator to create multiple workers and their channels; this saves writing out multiple workers and connecting individual channels. When the program is complete - the while True loop in the worker has ended - the worker asks that the channels retire. This tidies up any communication still present in the other processes and the parallel can terminate. This then allows csp.shutdown - equivalent to MPI_Finalize() - to terminate any outstanding PyCSP resources and exit the program.

Chapter 3

Results

3.1 Graph500 MPI Benchmark (C version)

3.1.1 Test conditions

The graph500 specification code offers various implementations of the graph500 search. The MPI version is provided with a few example implementations of the breadth first search using different techniques in MPI, such as single sided communication. The MPI implementation also offers hybrid examples using OpenMP. For this project the makefile has been altered to provide the compiled instance of the sample code for ARCHER. This can be seen in Figure 3.1, and the fully supplied version in Appendix 6.1.

LDFLAGS =
MPICC = cc
CFLAGS = -O3 -hfp3 -Drestrict=__restrict__ -DGRAPH_GENERATOR_MPI

all: graph500_mpi_simple

GENERATOR_SOURCES = ../generator/graph_generator.c \
        ../generator/make_graph.c ../generator/splittable_mrg.c \
        ../generator/utils.c
SOURCES = main.c oned_csr.c oned_csc.c utils.c validate.c onesided.c onesided_emul.c
HEADERS = common.h oned_csr.h oned_csc.h redistribute.h mpi_workarounds.h onesided.h

graph500_mpi_simple: bfs_simple.c $(SOURCES) $(HEADERS) $(GENERATOR_SOURCES)
	$(MPICC) $(CFLAGS) $(LDFLAGS) -o graph500_mpi_simple bfs_simple.c $(SOURCES) $(GENERATOR_SOURCES) -lm

clean:
	rm -f graph500_mpi_* *.o *.a

Figure 3.1: Modified Makefile for Graph500 test

When the code was run initially it was not able to complete the graph in a suitable time. The scale factor was set to 26, resulting in 67,108,864 vertices and 1,073,741,824 edges. The test was re-run until it completed successfully, requiring 128 nodes and 3072 MPI processes to complete in time. One of the problems was the time taken to verify the results of each of the 64 searches. Due to time constraints, and considerations made in the development of the PyCSP version, the tests were revised to give results for the time taken to complete the graph at different scales on one node. Each node has 24 cores and each core is given an MPI process. Figure 3.2 shows the script used to launch one of the jobs. Four scales of the graph size - 12, 19, 22, 26 - were run; each job was run for a set of MPI process counts and writes its results to disk.

#!/bin/bash --login

#PBS -N Graph-500
#PBS -l select=1
#PBS -l walltime=00:05:0
#PBS -A d68

declare -a proc=(1 2 4 8 16 24)

# This shifts to the directory that you submitted the job from
cd $PBS_O_WORKDIR
export SKIP_VALIDATION=1
export TMPFILE=12_graph
export REUSEFILE=12_graph

for p in "${proc[@]}"
do
    aprun -n ${p} ./graph500_mpi_simple 12 > output12-${p}.out 2>&1
done

Figure 3.2: Graph500 sample code PBS script for one node at scale 12

The environment variables were constant for each run, turning off verification and providing a name for the graph file. The results of each run are tabulated in Section 3.1.2, with the added times for the scale 26 run on 128 nodes included in Table 3.1.

3.1.2 Graph500 Results (C version)

SCALE  graph_generation  num_mpi  construct_time  max_time    stddev_time
12     0.00730801        24       0.0139911       0.00373507  1.54E-05
19     0.629941          24       0.716639        0.0453639   5.62E-04
22     0.836143          24       1.62771         0.32676     0.00111185
26     12.6612           24       28.0326         5.39301     0.0111059
26     18.6299           3072     14.0901         2.88266     *

Table 3.1: Graph500 results for one full node (24 processes) at each graph scale

* This run was the attempt at the required minimum number of MPI processes needed to complete the full graph500 instance at scale 26. Verification was turned on for this run, which consumed resources, and was not used in the later tests.

SCALE  graph_generation  num_mpi  construct_time  min_time    max_time    stddev_time
12     0.0265441         1        0.012105        0.00293708  0.00299907  1.92E-05
12     0.00656796        2        0.0123849       0.00302696  0.00314403  1.38E-05
12     0.00718689        4        0.0127518       0.00324702  0.003335    1.66E-05
12     0.00722599        8        0.0136659       0.00351095  3.55E-03    7.18E-06
12     0.00718403        16       0.013803        0.00361109  0.00427318  8.07E-05
12     0.00730801        24       0.0139911       0.00362897  0.00373507  1.54E-05

Table 3.2: Graph500 results for Scale 12

SCALE  graph_generation  num_mpi  construct_time  min_time   max_time   stddev_time
19     0.790883          1        2.06451         0.399528   0.410434   0.00156961
19     0.556753          2        1.39775         0.344795   0.346945   0.000419654
19     0.592343          4        1.01341         0.198309   0.20039    0.000521682
19     0.639701          8        0.844094        0.109326   0.111283   0.000402307
19     0.628806          16       0.743075        0.0584629  0.0608492  6.13E-04
19     0.629941          24       0.716639        0.043354   0.0453639  5.62E-04

Table 3.3: Graph500 results for Scale 19

SCALE  graph_generation  num_mpi  construct_time  min_time   max_time   stddev_time
22     4.80895           1        22.3538         3.59975    3.66905    0.0115404
22     3.37572           2        11.2749         2.91507    2.95624    0.00539458
22     2.27344           4        6.08213         1.62888    1.6459     0.00310387
22     1.01325           8        3.31696         0.881078   0.890302   0.00174122
22     1.22642           16       2.06445         0.464662   0.470392   0.00138788
22     0.836143          24       1.62771         0.321799   0.32676    0.00111185

Table 3.4: Graph500 results for Scale 22

SCALE  graph_generation  num_mpi  construct_time  min_time   max_time   stddev_time
26     98.3229           1        920.407         67.1938    67.2706    0.0178972
26     56.3777           2        390.639         51.6419    51.75      0.028925
26     44.1044           4        173.108         29.0679    29.1819    0.0281701
26     29.2613           8        102.174         15.8148    15.9112    0.0225621
26     14.6115           16       41.5264         8.01662    8.06182    0.00920867
26     12.6612           24       28.0326         5.329      5.39301    0.0111059

Table 3.5: Graph500 results for Scale 26

3.2 Python/CSP Results Comparison

3.2.1 Errors

When using the MPI version, the program cannot make use of more than one worker process. This is due to not being able to create MPI channels with multiple readers and writers. A solution to this problem is discussed in Section 2.3 and in the further work (Section 5.1). The program was therefore amended to use the standard channel object and parallel, which restricts the PyCSP program to running on one ARCHER node. The test conditions and results used to compare with graph500 were amended to account for this.

The standard PyCSP solution works; however, there is an error in the reported level of a vertex. When the graph is decomposed, a slice of the adjacency matrix is given to each worker. Each worker looks up the neighbours of a vertex using the row and column indices of the matrix. For some vertices this changes the order in which vertices are visited, and this changes the level at which they are recorded. Attempts were made to correct this, first by decomposing the matrix by columns and, second, by slight changes in the way neighbouring vertices are sent to other workers. On testing, both solutions failed to place some of the vertices at the correct level. The result - using the test graph of 7 vertices - is that 2 of the 7 are recorded at a level greater than they should be. While this is not ideal, and limits the program to searching from one key only - to ensure the workers own the right starting vertex - it does show that all vertices were visited and that this part of the breadth first search is working.

3.2.2 Test Conditions

The test was run using the python2.7 interpreter compiled by Cray on ARCHER. It uses numpy (version 1.8.0), scipy (version 0.13.3) and mpi4py (version 1.3.1). The test is run by allocating one Python instance to one node, with a run for each increment of workers - 1, 2, 4, 8, 16, 24 - and timed. The times record the overall time to complete, and the time to load and build the adjacency matrix. The graph generation is not recorded, as the tests use the same graph as the graph500 results.

3.2.3 Execution Time

Table 3.6 shows the execution time to read and build the adjacency matrix into a Python data structure, and the time taken by the program to search the graph. Each scale of graph was tested in a similar manner to the Graph500 tests seen in Section 3.1.2.

3.2.4 Errors during results

It was found at graph scale 12 that, when trying to run using 24 workers, the graph search did not terminate. Taking the program back to development and running a comparison on ARCHER, it was found through print statements that some workers had no neighbours to search through. This would indicate that the decomposition of the graph, in relation to the number of workers, leaves some workers without adequate work to do. This causes a process either to block, preventing termination, or to finish with the possibility of some outstanding communication that prevents the program from terminating correctly.

SCALE  construction_time  stddev_time  num_workers  completion_time  stddev_time
12     4.223849177        1.972004719  1            9.329909027      2.372934485
12     2.068689525        0.002644139  2            2.157787442      0.017362234
12     2.066382945        0.005250794  4            2.166416228      0.004873987
12     2.072104096        0.002972816  8            2.22431922       0.007149915
12     2.058872759        0.006182667  16           2.304450512      0.009934848
12     2.085581541        0.010549086  24           *

Table 3.6: PyCSP execution times for one node at Scale 12

There is an absence of results for the greater scales. During testing the graph was run at scale 19; the time taken was longer than 2 hours. This prompted the changes made to the graph reader module, discussed in Section 3.2.6. Tested on the development workstation, the amended version could complete 1/64th of the dataset in 30 seconds, so a full run would take possibly 30 minutes to build the sparse matrix. It is predicted that the use of multi channels would have allowed the decomposition of the graph and the sparse matrix to be constructed on each of the workers. 64 workers would have been required and, due to the limit of 24, the greater sized graphs were not tested.

3.2.5 Length of Program

Python's syntax allows the user to express the work to be done in fewer keywords and lines than the C version. In some cases this is due to the programmer not needing to preallocate memory or define types for variables. Python makes assumptions about variables: if this is the first time a variable is used, and it is being assigned a list, then the variable is allocated and typed as a list until it is explicitly cast to another type or overwritten. With consideration as to the correctness of the implementation, the number of lines for each version is shown in Table 3.7.

Implementation      Total lines
C bfs.c             175
PyCSP bfs_worker    28

Table 3.7: Line counts for each implementation

The C version is longer due to the management of the data structures and the queues for incoming messages from neighbours. There is also the use of macros to try to reduce the complexity of iterating over the vertices; these macros help determine whether a vertex has been visited in the bitmap of vertices. The Python version tests for visitation using a list comprehension over the level dictionary: vertices are recorded by adding them into this structure, and the test is the comprehension loop over the dictionary. This is a more expensive operation than checking whether a bit has been set, but results in a shorter and more visibly clear program.
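The difference between the two tests can be seen directly in a short sketch. The data below is an assumed example; the first test is the comprehension used by bfs_worker, the second a set-based alternative with constant-time membership checks.

# Comparison of the visitation tests discussed above, with example data.
Level = {0: [0], 1: [2, 5], 2: [7]}
v = 5

# Test used by the Python worker: flatten every level and scan the list
visited_scan = v in [u for x in Level.values() for u in x]

# Cheaper alternative: maintain a set of visited vertices alongside Level
visited_set = set(u for x in Level.values() for u in x)
visited_fast = v in visited_set

print visited_scan, visited_fast         # both True for this example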

SCALE  construction_time  stddev_time  num_workers  completion_time  stddev_time
12     3.704339345        1.165233873  1            22.62175504      1.615372901
12     3.361498356        1.576420557  2            3.463044008      1.55058553
12     2.445601702        0.00489394   4            2.542200724      0.00489394
12     2.436365048        0.004975978  8            2.582115332      0.003894847
12     2.46907266         0.066988261  16           2.707217693      0.071827643
12     2.446461678        0.004840208  24           *

Table 3.8: Execution time with the initial graphreader.py implementation

3.2.6 Performance Pitfalls

Parts of the Python program performed badly, through external factors and through choices made in parts of the program. One particularly slow component is the graph reader, which can be seen in Appendices 6.3 and 6.4. The edge list is loaded by one of the PyCSP processes and a Compressed Sparse Row (CSR) matrix slice is then sent to the workers. The edge list is read in from disk, converted into a NumPy array and then converted into a Python list. This is due to the initial implementation used to remove duplicates and build the adjacency matrix. It is implemented naively with two loops: the first builds two lists of start and end vertices while deleting elements where a vertex appears twice; the second works on a Python nested list of zeros, building the sparse matrix. This performs badly, as it requires two linear loops over a large data structure and allocates more memory than is required.

Another performance problem was sorting the list of edges. To index the matrix correctly, the edge list was sorted so that the start vertices were in order, followed by the end vertices. This does simplify the creation of the adjacency matrix and prevents it from jumping around in memory to set each value. However, as this was a Python list, it used the standard Python sort. For a large list - and this became clear when trying to load the scale 19 graph, which is 128MB in size - this caused a dramatic slowdown. Python implements a hybrid insertion and merge sort, which should scale at worst as N log N; however, this performed worse than the NumPy merge sort when reimplemented. This is possibly due to the original version's data structure performing poorly in memory as merges occur during the sort. The NumPy version performs an in-place merge sort on the data structure, which is an optimised NumPy array, allowing the sort to take advantage of optimisations in the data structure and in the sorting algorithm itself.

The last improvement to this part of the program was to modify the creation of the adjacency matrix. It is a poor decision to unpack the edge list into two separate lists; this can be better implemented by referencing the two inner elements of each row of the NumPy array directly, so that for each edge the two vertices can be used as they are needed to build the sparse matrix.
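The sorting claim can be examined independently of any particular machine. The sketch below shows how the two approaches could be timed; the randomly generated edge array is a stand-in for a real edge list, the sizes are assumptions, and no timings are asserted here.

# Hedged sketch for comparing the two sort strategies; the edge array is
# generated randomly as a stand-in for an edge list read from disk.
import timeit
import numpy as np

edges = np.random.randint(0, 2 ** 12, size=(16 * 2 ** 12, 2))

def python_sort():
    py_edges = edges.tolist()
    py_edges.sort()                           # Timsort over a nested Python list

def numpy_sort():
    np_edges = edges.copy()
    np_edges.sort(axis=0, kind='mergesort')   # in-place mergesort on the NumPy array

print timeit.timeit(python_sort, number=3)
print timeit.timeit(numpy_sort, number=3)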

Chapter 4

Conclusions

The goal of creating a Graph500 implementation using Python and CSP has been met with varied success. The final program did perform within the expected parameters for Python, and it was noted at the beginning that the expected performance would be lower than C. A full implementation of Graph500 is expected to take longer to produce than the project's time allocation, and would not perform as well as the sample code. However, programming in Python for High Performance Computing does have advantages. Python's concise syntax and quick access to libraries lower the barrier to a serviceable program. Python's ability to process large amounts of data was restricted by the design and implementation of the PyCSP solver and the missing infrastructure components. A program that scaled beyond one node would be expected to complete the scale 19 graph in a reasonable amount of time; the program did, however, perform within expectations for the smaller graph sizes.

During development the code was moved between a development machine and ARCHER without changes in compiler or development tools. This makes Python for high performance programming less cumbersome and more portable, because any system that has Python installed can run the code. There are exceptions for libraries, but the use of tools such as virtualenv allows exotic libraries to be installed without having to be supplied by the system provider.

PyCSP greatly simplifies the interaction with the threading and communication primitives in Python. The overall syntax required to create a PyCSP program - decorators, channel objects, and process construction through the Parallel - is small and concise. This leaves only a few concepts for the programmer to learn and moves them on more quickly to considering how best to implement their system. Python's quick prototyping ability should help the programmer design the system quickly and redesign it as needed. With a full implementation of PyCSP for ARCHER, this system should have been able to execute at the required scale to complete the search of the graph. Further work on PyCSP for supercomputers would allow PyCSP to be put to further tests in scientific applications.

CSP itself, for describing parallel systems, helps define the extent of the system from the beginning. It is still helpful to have a serial program, but the parallel system can be defined by breaking up tasks into processes and moving the data over channels. The resulting designs are independent of implementation, and the use of diagrams helps the programmer visualise what events should occur. For this project, it helped identify what was required when implementing the infrastructure. For example, a process taking in multiple channels will require the use of an Alternative; this led to the implementation of the Alt supporting the distributed channels. The diagrams created for this project could be used to program a solution to graph500 in a different CSP implementation without impact on the overall system.

However, the diagrams do not protect against overlooking features. While multiple readers and writers are implemented in the standard version of PyCSP, it was not apparent until implementing the program that they would be required for the distributed version. A distributed version could have been developed through the use of point-to-point channels for new workers; however, this would have been cumbersome to program and would have increased the complexity of the program. Alternative solutions were considered but, when described using the CSP diagram, presented flaws in distributing the data, or communication bottlenecks when sharing neighbour vertices.

The CSP model limited the problems with parallelism and communication during this project, and the remaining problems were restricted to errors in programming, the implementation of features, and the consequent capabilities of the system due to these factors. For future experiments CSP seems to be useful for High Performance Computing, and a suitable implementation of the model would add value to programming for large supercomputers. Strict design rules and a suitable implementation could reduce some of the complexity in programming, and allow other languages to fully implement a suitable interface for providing parallel programs.

Chapter 5

Further Work

5.1 Multi Channels

For PyCSP to scale beyond one node on ARCHER would have required a correct implementation of multiple-ended channels. The implementation in this project shows the scope and indicates that further work is needed; the version produced during this project was not completed in time to be used and still has unresolved implementation errors. The channels' relationships with each other are information that can be resolved by parsing the list of processes in the Parallel. A newer implementation could understand this and issue the correct collective, or know where sends should be directed and in what manner. The intent of the programmer would need to be expressed to help choose the collective, where some may wish data to be broadcast and others may require a scatter of data used in decomposition.

5.2 Distributed Processes

For PyCSP to work on ARCHER, processes are distributed using MPI. This is portable to machines that have an implementation of MPI and the MPI binding library for Python, but it places a dependency on the PyCSP library. There is scope for the processes to be distributed independently of MPI. This would also require consideration of how distributed channels are implemented, but work to assess the requirements should be undertaken. Processes would need to know the node they are to run on, and a data structure similar to MPI_COMM_WORLD would be needed to allow processes to know where they live, and to allow channels to connect to each other without the programmer's prior knowledge.

5.3 PyCSP upgrade

PyCSP currently supports Python 2.7 but is missing support for Python 3. It is important to keep libraries up to date with changes in the language, and Python 3 will continue to improve and add support for features that may be relevant to High Performance Computing. A brief assessment of the extent of the changes an upgrade would involve indicated that the interfaces to the threading and multiprocessing modules would need to change.

5.4 CSP implementations

Using CSP for high performance programming is well within the reach of some implementations. Opportunities to explore its effectiveness and performance constraints compared with traditional parallelisation schemes should be considered. CSP could be used effectively and help implement systems that can flexibly change the implementation of processes without impacting the integrity of the program.

5.5 CSP design in other parallel implementations

Using CSP to design programs using traditional tools should be considered. Abstracting away the technical details and using CSP to design the intended system, could help spot problems in communication and work distribution before implementation. While some of the concepts in CSP may not be applicable, drawing diagrams of systems and working through potential designs may help programmers avoid mistakes. These can arise from moving data in badly performing ways or creating tasks that are too small or too large to exploit the parallel infrastructure.

Chapter 6

Appendix

6.1 Graph500 Makefile

LDFLAGS = # -fopenmp -g # -g -pg
MPICC = cc # mpicc
CFLAGS = -O3 -hfp3 -Drestrict=__restrict__ -DGRAPH_GENERATOR_MPI

all: graph500_mpi_simple graph500_mpi_one_sided graph500_mpi_replicated graph500_mpi_replicated_csc graph500_mpi_custom

GENERATOR_SOURCES = ../generator/graph_generator.c ../generator/make_graph.c ../generator/splittable_mrg.c ../generator/utils.c
SOURCES = main.c oned_csr.c oned_csc.c utils.c validate.c onesided.c onesided_emul.c
HEADERS = common.h oned_csr.h oned_csc.h redistribute.h mpi_workarounds.h onesided.h

graph500_mpi_simple: bfs_simple.c $(SOURCES) $(HEADERS) $(GENERATOR_SOURCES)
	$(MPICC) $(CFLAGS) $(LDFLAGS) -o graph500_mpi_simple bfs_simple.c $(SOURCES) $(GENERATOR_SOURCES) -lm

graph500_mpi_one_sided: bfs_one_sided.c $(SOURCES) $(HEADERS) $(GENERATOR_SOURCES)
	$(MPICC) $(CFLAGS) $(LDFLAGS) -o graph500_mpi_one_sided bfs_one_sided.c $(SOURCES) $(GENERATOR_SOURCES) -lm

graph500_mpi_replicated: bfs_replicated.c $(SOURCES) $(HEADERS) $(GENERATOR_SOURCES)
	$(MPICC) $(CFLAGS) $(LDFLAGS) -o graph500_mpi_replicated bfs_replicated.c $(SOURCES) $(GENERATOR_SOURCES) -lm

graph500_mpi_replicated_csc: bfs_replicated_csc.c $(SOURCES) $(HEADERS) $(GENERATOR_SOURCES)
	$(MPICC) $(CFLAGS) $(LDFLAGS) -o graph500_mpi_replicated_csc bfs_replicated_csc.c $(SOURCES) $(GENERATOR_SOURCES) -lm

clean:
	rm -f graph500_mpi_* *.o *.a

6.2 bfs_simple.c listing

/* This version is the traditional level-synchronized BFS using two queues.
 * A bitmap is used to indicate which vertices have been visited.
 * Messages are sent and processed asynchronously throughout the code to
 * hopefully overlap communication with computation. */

void run_bfs(int64_t root, int64_t* pred) {
  const size_t nlocalverts = g.nlocalverts;

  /* Set up the queues. */
  int64_t* restrict oldq = g_oldq;
  int64_t* restrict newq = g_newq;
  size_t oldq_count = 0;
  size_t newq_count = 0;

  /* Set up the visited bitmap. */
  const int ulong_bits = sizeof(unsigned long) * CHAR_BIT;
  int64_t visited_size = (nlocalverts + ulong_bits - 1) / ulong_bits;
  unsigned long* restrict visited = g_visited;
  memset(visited, 0, visited_size * sizeof(unsigned long));
#define SET_VISITED(v) do {visited[VERTEX_LOCAL((v)) / ulong_bits] |= (1UL << (VERTEX_LOCAL((v)) % ulong_bits));} while (0)
#define TEST_VISITED(v) ((visited[VERTEX_LOCAL((v)) / ulong_bits] & (1UL << (VERTEX_LOCAL((v)) % ulong_bits))) != 0)

  /* Set up buffers for message coalescing, MPI requests, etc. for
   * communication. */
  const int coalescing_size = 256;
  int64_t* restrict outgoing = g_outgoing;
  size_t* restrict outgoing_counts = g_outgoing_counts;
  MPI_Request* restrict outgoing_reqs = g_outgoing_reqs;
  int* restrict outgoing_reqs_active = g_outgoing_reqs_active;
  memset(outgoing_reqs_active, 0, size * sizeof(int));
  int64_t* restrict recvbuf = g_recvbuf;
  MPI_Request recvreq;
  int recvreq_active = 0;

  /* Termination counter for each level: this variable counts the number of
   * ranks that have said that they are done sending to me in the current
   * level.  This rank can stop listening for new messages when it reaches
   * size. */
  int num_ranks_done;

  /* Set all vertices to "not visited." */
  {size_t i; for (i = 0; i < nlocalverts; ++i) pred[i] = -1;}

  /* Mark the root and put it into the queue. */
  if (VERTEX_OWNER(root) == rank) {
    SET_VISITED(root);
    pred[VERTEX_LOCAL(root)] = root;
    oldq[oldq_count++] = root;
  }

#define CHECK_MPI_REQS \
  /* Check all MPI requests and handle any that have completed. */ \
  do { \
    /* Test for incoming vertices to put onto the queue. */ \
    while (recvreq_active) { \
      int flag; \
      MPI_Status st; \
      MPI_Test(&recvreq, &flag, &st); \
      if (flag) { \
        recvreq_active = 0; \
        int count; \
        MPI_Get_count(&st, MPI_INT64_T, &count); \
        /* count == 0 is a signal from a rank that it is done sending to me \
         * (using MPI's non-overtaking rules to keep that signal after all \
         * "real" messages. */ \
        if (count == 0) { \
          ++num_ranks_done; \
        } else { \
          int j; \
          for (j = 0; j < count; j += 2) { \
            int64_t tgt = recvbuf[j]; \
            int64_t src = recvbuf[j + 1]; \
            /* Process one incoming edge. */ \
            assert (VERTEX_OWNER(tgt) == rank); \
            if (!TEST_VISITED(tgt)) { \
              SET_VISITED(tgt); \
              pred[VERTEX_LOCAL(tgt)] = src; \
              newq[newq_count++] = tgt; \
            } \
          } \
        } \
        /* Restart the receive if more messages will be coming. */ \
        if (num_ranks_done < size) { \
          MPI_Irecv(recvbuf, coalescing_size * 2, MPI_INT64_T, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &recvreq); \
          recvreq_active = 1; \
        } \
      } else break; \
    } \
    /* Mark any sends that completed as inactive so their buffers can be \
     * reused. */ \
    int c; \
    for (c = 0; c < size; ++c) { \
      if (outgoing_reqs_active[c]) { \
        int flag; \
        MPI_Test(&outgoing_reqs[c], &flag, MPI_STATUS_IGNORE); \
        if (flag) outgoing_reqs_active[c] = 0; \
      } \
    } \
  } while (0)

  while (1) {
    memset(outgoing_counts, 0, size * sizeof(size_t));
    num_ranks_done = 1; /* I never send to myself, so I'm always done */

    /* Start the initial receive. */
    if (num_ranks_done < size) {
      MPI_Irecv(recvbuf, coalescing_size * 2, MPI_INT64_T, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &recvreq);
      recvreq_active = 1;
    }

    /* Step through the current level's queue. */
    size_t i;
    for (i = 0; i < oldq_count; ++i) {
      CHECK_MPI_REQS;
      assert (VERTEX_OWNER(oldq[i]) == rank);
      assert (pred[VERTEX_LOCAL(oldq[i])] >= 0 && pred[VERTEX_LOCAL(oldq[i])] < g.nglobalverts);
      int64_t src = oldq[i];
      /* Iterate through its incident edges. */
      size_t j, j_end = g.rowstarts[VERTEX_LOCAL(oldq[i]) + 1];
      for (j = g.rowstarts[VERTEX_LOCAL(oldq[i])]; j < j_end; ++j) {
        int64_t tgt = g.column[j];
        int owner = VERTEX_OWNER(tgt);
        /* If the other endpoint is mine, update the visited map, predecessor
         * map, and next-level queue locally; otherwise, send the target and
         * the current vertex (its possible predecessor) to the target's
         * owner. */
        if (owner == rank) {
          if (!TEST_VISITED(tgt)) {
            SET_VISITED(tgt);
            pred[VERTEX_LOCAL(tgt)] = src;
            newq[newq_count++] = tgt;
          }
        } else {
          while (outgoing_reqs_active[owner]) CHECK_MPI_REQS; /* Wait for buffer to be available */
          size_t c = outgoing_counts[owner];
          outgoing[owner * coalescing_size * 2 + c] = tgt;
          outgoing[owner * coalescing_size * 2 + c + 1] = src;
          outgoing_counts[owner] += 2;
          if (outgoing_counts[owner] == coalescing_size * 2) {
            MPI_Isend(&outgoing[owner * coalescing_size * 2], coalescing_size * 2, MPI_INT64_T, owner, 0, MPI_COMM_WORLD, &outgoing_reqs[owner]);
            outgoing_reqs_active[owner] = 1;
            outgoing_counts[owner] = 0;
          }
        }
      }
    }
    /* Flush any coalescing buffers that still have messages. */
    int offset;
    for (offset = 1; offset < size; ++offset) {
      int dest = MOD_SIZE(rank + offset);
      if (outgoing_counts[dest] != 0) {
        while (outgoing_reqs_active[dest]) CHECK_MPI_REQS;
        MPI_Isend(&outgoing[dest * coalescing_size * 2], outgoing_counts[dest], MPI_INT64_T, dest, 0, MPI_COMM_WORLD, &outgoing_reqs[dest]);
        outgoing_reqs_active[dest] = 1;
        outgoing_counts[dest] = 0;
      }
      /* Wait until all sends to this destination are done. */
      while (outgoing_reqs_active[dest]) CHECK_MPI_REQS;
      /* Tell the destination that we are done sending to them. */
      MPI_Isend(&outgoing[dest * coalescing_size * 2], 0, MPI_INT64_T, dest, 0, MPI_COMM_WORLD, &outgoing_reqs[dest]); /* Signal no more sends */
      outgoing_reqs_active[dest] = 1;
      while (outgoing_reqs_active[dest]) CHECK_MPI_REQS;
    }
    /* Wait until everyone else is done (and thus couldn't send us any more
     * messages). */
    while (num_ranks_done < size) CHECK_MPI_REQS;

    /* Test globally if all queues are empty. */
    int64_t global_newq_count;
    MPI_Allreduce(&newq_count, &global_newq_count, 1, MPI_INT64_T, MPI_SUM, MPI_COMM_WORLD);

    /* Quit if they all are empty. */
    if (global_newq_count == 0) break;

    /* Swap old and new queues; clear new queue for next level. */
    {int64_t* temp = oldq; oldq = newq; newq = temp;}
    oldq_count = newq_count;
    newq_count = 0;
  }
#undef CHECK_MPI_REQS
}

6.3 First implementation of GraphReader.py

def read_graph(file_name=os.environ.get('REUSEFILE'), num_edges=12):
    nedges = int(16 * math.pow(2, num_edges))
    N = int(math.pow(2, num_edges))
    graph = np.fromfile(file_name, dtype=np.int64)
    graph = graph.reshape(nedges, 2)
    # print graph.shape
    py_graph = graph.tolist()
    py_graph.sort()

    edges_a = []
    edges_b = []
    for edge in py_graph:
        if edge[0] == edge[1]:
            py_graph.remove(edge)
        edges_a.append(edge[0])
        edges_b.append(edge[1])

    adj_matrix = [[0 for i in xrange(N)] for j in xrange(N)]
    for i in xrange(len(edges_a)):
        adj_matrix[edges_a[i]][edges_b[i]] = 1
        adj_matrix[edges_b[i]][edges_a[i]] = 1

    return csr_matrix(adj_matrix)
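This listing assumes module-level imports that are not reproduced in the appendix. A minimal sketch of that assumed surrounding context, including a hypothetical call with an illustrative file name, is:

    # Assumed context for the listing above (not shown in the appendix).
    import os
    import math
    import numpy as np
    from scipy.sparse import csr_matrix

    # Hypothetical invocation: read a scale-12 edge list (16 * 2**12 edges)
    # written by the graph generator and build its sparse adjacency matrix.
    adjacency = read_graph(file_name='edge_list.bin', num_edges=12)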

6.4 Improved implementation of GraphReader.py

def read_graph(split, file_name=os.environ.get('REUSEFILE'), num_edges=12):
    nedges = int(16 * math.pow(2, num_edges))
    N = int(math.pow(2, num_edges))
    graph = np.fromfile(file_name, dtype=np.int64)
    graph = graph.reshape(nedges, 2)
    graph.sort(axis=0, kind='mergesort')
    duplicits = (graph[:, 0] == graph[:, 1])
    graph = np.delete(graph, np.where(duplicits), axis=0)
    adj_matrix = [[0 for i in xrange(N)] for j in xrange(N)]
    for start, end in graph:
        adj_matrix[start][end] = 1
        adj_matrix[end][start] = 1
    return csr_matrix(adj_matrix)
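The main change over the first implementation is that self-loops are removed with a NumPy boolean mask rather than a Python-level loop. A small illustrative example of that idiom, using made-up data rather than benchmark output, is:

    import numpy as np

    # Toy edge list: each row is a (start, end) pair; the second row is a
    # self-loop and should be discarded.
    edges = np.array([[0, 1],
                      [2, 2],
                      [1, 3]], dtype=np.int64)

    self_loops = (edges[:, 0] == edges[:, 1])              # one flag per edge
    edges = np.delete(edges, np.where(self_loops), axis=0)
    # edges is now [[0, 1], [1, 3]]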
