THE UNIVERSITY OF NEWCASTLE UPON TYNE DEPARTMENT OF COMPUTING SCIENCE

Dataflow Development of Medium-Grained Parallel Software

by Jonathan William Harley

PhD Thesis

September 1993


Abstract

In the 1980s, multiple-processor computers (multiprocessors) based on conventional processing elements emerged as a popular solution to the continuing demand for ever-greater computing power. These machines offer a general-purpose parallel processing platform on which the size of program units which can be efficiently executed in parallel - the "grain size" - is smaller than that offered by distributed computing environments, though greater than that of some more specialised architectures. However, programming to exploit this medium-grained parallelism remains difficult. Concurrent execution is inherently complex, yet there is a lack of programming tools to support parallel programming activities such as program design, implementation, debugging, performance tuning and so on. In helping to manage complexity in sequential programming, visual tools have often been used to great effect, which suggests one approach towards the goal of making parallel programming less difficult. This thesis examines the possibilities which the dataflow paradigm has to offer as the basis for a set of visual parallel programming tools, and presents a dataflow notation designed as a framework for medium-grained parallel programming. The implementation of this notation as a programming language is discussed, and its suitability for the medium-grained level is examined.

Acknowledgements/Dedication

Although this thesis and the research it describes are my own work, I would like to express my gratitude to several people whose support I received in the course of the work. First among these must come my supervisor, Professor Peter Lee, for encouragement and guidance throughout my time at Newcastle. Additionally, I would like to thank Dr. Graham Megson and Dan McCue for constructive criticism on many pieces of work which eventually contributed to this thesis. I spent three months in Rennes, France during the course of my research, and I would like to thank Françoise André for her support and encouragement during that time. Finally, I acknowledge the financial support of the Science and Engineering Research Council of Great Britain during the first three years of my research, and of the EC ERASMUS scheme for funding my stay in Rennes.

This work is dedicated to the memory of my father, Dr. Anthony John Harley, 1933-1987, at one time an honorary lecturer in computer science at this university.

Table of Contents

List of Figures 7

Chapter 1 Introduction 9

1.1 Parallel Computing 9
1.2 Parallel Programming - the Problems 11
1.3 Solutions to the Problems 14
1.4 Dataflow 17
1.5 Main Contribution 20
1.6 Structure of the Subsequent Chapters 22

Chapter 2 Parallel Programming 23

2.1 Multiprocessors 23
2.1.1 The von Neumann machine 24
2.1.2 SISD Architectures 25
2.1.3 SIMD Architectures 25
2.1.4 MIMD Architectures 26
2.1.5 Distributed-memory Architectures 27
2.1.6 Shared-memory Architectures 29
2.1.7 Grain Size 30
2.2 Parallel Programming Models 32
2.2.1 Types of Parallelism 33
2.2.2 Parallel programming models 34
2.3 Programming Tools 38
2.3.1 Tools for Parallel Programming 39
2.3.2 Visual Programming 40
2.4 Dataflow 44
2.4.1 Dataflow for Parallelism 44
2.4.2 Architectures 46
2.4.3 Low-level dataflow languages 48

2.4.4 Dataflow design languages 49
2.5 Summary 51

Chapter 3 The MeDaL Notation 53

3.1 Concepts 54
3.2 MeDaL Semantics and Syntax 56
3.2.1 56
3.2.2 Actors 58
3.2.3 Companies 62
3.3 Design Decisions 63
3.3.1 General Visual Syntax 64
3.3.2 Actor Semantics 65
3.3.3 Semantics 67
3.3.4 Persistent Memory 69
3.3.5 Multiple Streams 72
3.4 Examples 79
3.4.1 The MeDaL Library 79
3.4.2 Example MeDaL Program 81
3.5 Summary 85

Chapter 4 Implementation Issues 87

4.1 Functions of the Programming System 88
4.1.1 The Division of Work 88
4.1.2 The Method Code Transformer 90
4.1.3 The Harness Generator 91
4.1.4 The Run-time System 92
4.2 Programming Language Interface 93
4.2.1 Programming actors in C and FORTRAN 94

4.2.2 Programming Interface in C++ 95
4.3 Distributed-memory Run-time System 99
4.3.1 Datapaths Between Distributed Actors 99
4.3.2 Run-time System for Distributed Actors 101

4.3.3 The Wrappers of Distributed Actors 103
4.4 Shared-memory Run-time System 105
4.4.1 Shared-memory Datapaths 106
4.4.2 Shared-memory Wrappers 107
4.4.3 Other Functions 110
4.5 Summary 111

Chapter 5 Performance Evaluation 113

5.1 Implementation Environment 115
5.2 MeDaL Run-time System Implementation 117
5.2.1 General Configuration 117
5.2.2 Specific Functions 118
5.3 Experimental Results 121
5.3.1 Basic Functions 122
5.3.2 Use of Basic Functions within Programs 123
5.3.3 Varying Numbers of Actors 129
5.4 Evaluation of Results 135
5.4.1 Grain Size 136
5.4.2 Limitations 137
5.4.3 Conclusions 138

Chapter 6 Conclusions 141

6.1 Summary 141
6.2 Conclusions 145
6.3 Future Work 149
6.4 Closing Remarks 151

Bibliography 153

Appendix A MeDaL Classes 164

Appendix B Example Application 172

List of Figures

Figure  Page
1.4a Dataflow graph to calculate x = 4y + √(z/7)  18
2.1a Examples of interconnection in distributed-memory multiprocessor architectures  28
2.1b Typical shared-memory architecture  29
2.1c Definition of multiprocessor grain sizes  31
3.2a Datapath syntax  56
3.2b General-purpose actors  60
3.2c Source actor  60
3.2d Sink actor  61
3.2e Merge actor  61
3.2f Replicator  62
3.2g Company (component view)  63
3.3a Memory entity  70
3.3b Persistent memory within an actor  71
3.3c Output looped back to input  73
3.3d Equalising method merge  74
3.3e Datapaths to persistent memory  75
3.3f  76
3.3g Demand-driven dataflow  77
3.3h Synchronous and Asynchronous inputs  78
3.4a Standard input and output actors  80
3.4b File input and output actors  80
3.4c Halt actor  81
3.4d MeDaL example program "matrix-mult"  82
4.0a Modules of a MeDaL programming system  87
4.2a Wrapper and real method functions using C++ interface  98
4.4a Shared-memory transmit algorithm  109
5.3a Basic RTS functions  122
5.3b Time-activity diagram - scenario 1  125
5.3c MeDaL diagram - scenario 1  126
5.3d Timings - scenario 1  127
5.3e Time-activity diagram - scenario 2  127
5.3f MeDaL diagram - scenario 2  128
5.3g Timings - scenario 2  128
5.3h Simplified matrix multiplication example  130
5.3i Matrix multiplication - against sequential MeDaL program  132
5.3j Matrix multiplication - speedup against serial program  134

Chapter 1

Introduction

1.1 Parallel Computing

The relentless drive for higher throughput in manufacturing industries has led factory managers not only to buy faster machines, which can speed up their production line by producing parts faster, but also to seek greater efficiency on the factory floor - for instance, by having two parts which are not components of each other made by different machines at the same time. Such gains in throughput are also desired from computing equipment; there is a clear and continuing demand for computers which can do more processing in less time. Not only are computers continually developed with ever faster number-crunching power, but the concept of increasing computational capacity by executing mutually independent computations in parallel is also increasingly being employed. In the race to build ever more powerful computers, the designers of the leading edge machines have therefore chosen paths which are leading away from the sequential processing model of John von Neumann, into designs which allow the concurrent processing of a number of parts of a computation. Of course, the success of these techniques relies on parallelism being available for exploitation in the software being developed. It is not always possible to parallelise: in the general case, any two instructions can be executed concurrently only when the data operated on by each one is not a direct result of the other - in which case there is no direct data dependency between those instructions. The same can be said of program modules which have inputs and outputs: two modules can only be executed concurrently when none of the inputs of either is an output of the other. This is equivalent to two machines being able to make parts of a man-made product which are not components of one another. For some computer applications, the absence of data dependencies between instructions is obvious; an example is vector multiplication, in which the

multiplication of each element of the vector by a scalar does not depend on the value of any of the vector's other elements. The usefulness of this characteristic of vector operations has led to one type of architecture for computers known as vector processors, which exploit the parallelism of vector operations by processing elements of a vector concurrently. However, such vector-based machines are clearly specialised in terms of what they are good at. Many, if not most, commercial applications do not employ large amounts of vector processing which could be executed faster by employing the type of machine just described. More general-purpose parallel machines have been built employing a number of (general-purpose) von Neumann processing units, each of which may execute different instructions on different data, in parallel. These machines are classified as MIMD machines [Flyn66], because they have Multiple Instruction streams and Multiple Data streams (as opposed to conventional, uniprocessor von Neumann machines which would be classified as SISD). Of course, from time to time, data dependencies arise between the streams of execution on different processors, so the separate streams need to communicate. In most such machines, this communication is initiated by the software. So, unlike some of the vector processors, on most MIMD multiprocessors the control of parallelism is the responsibility of the software, not the hardware. The advantage of this is that such computers are general-purpose; just as a factory manager may often prefer a versatile machine to a specialised one, MIMD multiprocessors appeal to those who may wish to exploit parallelism in many different types of application. The disadvantage of controlling parallelism in software is that decoding this software into hardware instructions, and executing them, takes valuable time. On machines in which the control of parallelism is driven by software, a parallel application must ultimately be expressed in the form of a parallel program with explicitly parallel sections, and explicit communication between these sections (though this form of the program is not necessarily visible to the programmer). Such is the variety of architectures even within the family of MIMD multiprocessors that this communication between sections of a parallel program may take one of several forms. One way of classifying communication types is by categorising architectures as shared memory, in which the multiple processing units share the

main memory, or distributed memory, in which each processing unit has some local memory which the others cannot access. Communication between processors is required when data produced by a process executing on one processor is needed by a process running on another: in the shared memory case, communication may take place through some synchronisation mechanism such as joint access to "flags" in memory which indicate when the data in question is ready, while with distributed memory, communication is through "message passing" between processing units (the messages being the data required). However, even within these two basic architectures, there are many variations each with different strengths and weaknesses, which therefore require different programming techniques in order to be exploited efficiently. It should be re-iterated that on MIMD multiprocessors, the communication between processes within an application is under software control - so the onus of controlling communication is generally on the programmer. This may be the programmer of the application, or when the program is written in a high-level language (HLL) it may be the programmer of the HLL who is ultimately responsible for generating the parallel program.
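The rule stated earlier in this section - that two program modules can run concurrently when neither's input is an output of the other - can be made concrete with a small sketch. The example below uses modern C++ threads purely for illustration (not any mechanism proposed in this thesis); the function names and data are invented for the example.

```cpp
#include <iostream>
#include <thread>
#include <vector>

// Two "modules": neither reads the other's output, so by the data-dependency
// rule above they may execute concurrently.
double sum_of_squares(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x * x;
    return s;
}

double product(const std::vector<double>& v) {
    double p = 1.0;
    for (double x : v) p *= x;
    return p;
}

int main() {
    std::vector<double> data{1.0, 2.0, 3.0, 4.0};
    double a = 0.0, b = 0.0;

    // Each module works only on its own inputs and its own result.
    std::thread t1([&] { a = sum_of_squares(data); });
    std::thread t2([&] { b = product(data); });
    t1.join();
    t2.join();

    // This final step depends on both results, so it must wait for them.
    std::cout << a + b << '\n';
    return 0;
}
```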

1.2 Parallel Programming - the Problems

Just as a commercial industrial product such as a turbine generator has a "life-cycle," consisting of design, manufacture, testing, use, and maintenance, so does a software product. The first three phases of the software life-cycle (design, implementation and testing) can be loosely termed "programming". The term parallel programming refers to the process of making an explicitly parallel program, from the design stage onwards. This definition excludes the use of those HLLs mentioned at the end of the previous section which generate parallel code from programs which are not explicitly parallel. Although such compilers can be a cost-effective way of parallelising existing programs, the code they produce is usually not as efficient as that of programs designed from scratch to run in parallel, since such compilers can only detect potential parallelism in the implementation of a program, and not parallelism in the design. Thus, the interest in parallel programming derives from the desire for more efficient parallel programs.


However, it is widely accepted that parallel programming is inherently difficult compared to traditional sequential programming. Apart from the fact that it is a relatively new and immature field, there are three main reasons for this difficulty:
• the complexity of managing parallelism from the design stage onwards;
• the low-level nature of many parallel programming languages;
• the lack of debugging support.
The problem of managing the complexity of parallel programming occurs mainly at the design stage if the program is designed from scratch (if not, it can occur later). In some applications the potential for parallelism may be obvious, such as in ray-tracing (the generation of images by tracing rays of light as they reflect off representations of objects), where the final value of any pixel depends only on the given representation of the scene, and not on the final value of any other pixel. In such cases the input data and work are simply divided into pieces, which are shared out among the available processors. However, in other applications - in which the data dependencies are far more complex - it is much less obvious where parallelism can be efficiently exploited. On one hand, if the programmer attempts to parallelise two computations between which there is a dependency, one computation may spend time waiting for the other, which is inefficient. On the other hand, if the programmer plays safe and places the two computations in sequence, computations which could have been executed concurrently are not, and again the program is less than optimally efficient. The programming difficulty described above is important because multiprocessors are generally employed in situations where speed (and hence, maximum efficiency) is considered of paramount importance, since that was the motivation for developing multiprocessor technology; yet most existing programming languages (explicitly parallel or not) concentrate on providing support only for implementation, not for parallel program design, or software maintenance at the design level. After the problems of design complexity have been dealt with, the next problem which must be faced is that of the low-level nature of parallel programming. The terms low-level and high-level refer to proximity to the machine's view of a program (hardware instructions) and to the user's view (mathematical or real-world objects to be manipulated) respectively. A higher "level" allows the human

Page 12 Chapter 1 Introduction programmer to understand code better, and hence be more productive. This principle led, in the 1950s and 1960s, to the development of what were at the time termed "high level languages," a term which has already been introduced in this chapter. However, the reality is that although these HLLs were higher-level than machine code, they fell far short of being a natural way of expressing programs; and subsequently languages have been developed (for sequential architectures) which offer a higher level still, and these will be discussed in the next section. Meanwhile, the multiplicity of parallel architectures described in section 1.1 has led to a proliferation of "parallel" versions of the older, sequential, textual HLLs. Many of the programming languages supplied by multiprocessor manufac­ turers are simply extended versions of popular languages like C and FORTRAN. It is a characteristic of these parallel languages that they typically reflect the target architecture in their language constructs, in other words, whether it is message­ passing or shared memory. Thus, the programmer must expend time and effort on controlling the machine (simply to make the program work in parallel) rather than on making the application more efficient. Of course, this reliance on architecture-·specific constructs makes the programs difficult to "port" between different machil1~s as well as laborious to write; and the focus on the mechanics of parallelism only aggravates the difficulties of identifying and hence implementing any potential for parallelism in the application being developed (a problem inherited from the design phase due to the lack of design tools, as identified above). Moreover, the need for low-level machine knowledge has kept parallel programming a domain for specialists, those who possess detailed knowledge about the architecture and operating system of their parallel machines. Such a situation has long been recognised as a cause of much expense and delay in producing and maintaining software for multiprocessor computers, and is entirely due to this low­ level bias of current parallel programming languages. In the third phase of parallel programming - testing - problems arise because there is little support for the debugging of parallel programs. Debugging tools for parallel systems are in their infancy, and tend to concentrate on low-level, architecture-specific aspects of the program such as communication, echoing the problems already outlined in this section. Parallel programs can fail for many

reasons which are not a problem in sequential programming, for instance because of synchronisation problems between co-operating tasks (sections of a program which carry out a particular function, such as multiplying a matrix or generating a table of results) or side-effects between particular combinations of tasks if they happen to be executing concurrently. Not only do few tools adequately address these problems, but where timing problems are involved, the presence or absence of monitoring for debugging purposes can be "intrusive" and affect the behaviour of the program. Even when debugging tools are available, they are often in the form of a "toolkit", a collection of different tools to aid different aspects of programming, with little or no integration, or even uniformity of user interface; each tool providing its own model of how the parallel program works (examples of tools being profilers and symbolic debuggers). Reconciling these models, understanding all the different possible views, and switching between them, is a further burden for the parallel programmer. When bugs require changes to the implementation or even the design of the program, programmers must return to the tools used for these earlier phases, increasing the number of tools which they must deal with, and hence the difficulty of programming. These problems in the design, implementation and testing of parallel programs clearly illustrate that the problems of parallel programming extend across several phases of the software life-cycle. Little effort has yet been put into developing Computer-Aided Software Engineering (CASE) tools for supporting the design of parallel programs and carrying this design through to implementation, let alone debugging. There is no widely accepted "methodology" for the process of building parallel software. In the 1970s, the lack of this kind of advanced software support for sequential programming led to the perception of the "software crisis". A similar state of affairs still applies in the field of parallel programming.

1.3 Solutions to the Problems

The field of sequential programming, being somewhat older than that of parallel programming, is therefore more mature. In considering solutions to the three-sided problem described above, it is useful to look at the ways in which software support for sequential programming has developed.


The difficulties arising from having to control computers at too low a level have been recognised for decades, and the ready answer to this is abstraction. Abstraction is what makes the HLLs discussed earlier "higher-level" than machine code; each sequential HLL implements a "virtual machine" which, abstracting away from the underlying (von Neumann) architecture, hides technical details from the programmers, enabling them to concentrate on the correctness of their algorithms. The more abstraction is applied, the higher-level a language becomes; the first level of abstraction produced assembly languages, and the second, HLLs such as FORTRAN, PL/I, PASCAL and C. Because this second level of abstraction hides machine-language details, software implemented in these languages is easily "portable" to machines with different machine-languages. In the 1970s and 1980s, abstraction was taken further and languages were developed which were even further from the underlying machine - sometimes adopting visual representations to aid understanding (examples of these "third-generation languages" are Ada, with its Mascot diagrams, and Smalltalk). However, it should be noted that with higher-level languages there is usually a trade-off between ease of understanding and efficiency; at a lower level, programmers have greater control over the machine, and hence can "tune" their code for greater efficiency (at the expense of potentially making more mistakes). Given that a slow, correct program is usually preferred by customers to a fast, incorrect one, the fact that higher-level languages have not been more widely adopted than they have can only be put down to a great optimism in the ability of programmers to produce correct code. So far, though, third-generation languages have mostly been restricted to sequential architectures. This is probably due to several factors: firstly, the time-lag between sequential and parallel computer technology. Secondly, the lack of a clear winner so far in the search for parallel programming models for parallel languages to be based on. And finally, perhaps, the fact that multiprocessors tend to be used in situations where speed is of the essence (as mentioned above), and hence efficiency is considered more important than the amount of programmer effort that will be needed to produce correct code. In the meantime, the level of abstraction represented by second-generation HLLs (especially C and FORTRAN) has become more or less standard in parallel

programming. Certainly, in adopting a familiar form of programming one reaps a benefit from the familiarity most programmers will have with it. However, there are problems with adapting this type of programming language for parallelism. The most obvious arises from the inherently sequential nature of existing HLLs; HLL programs take the form (on paper or on the screen) of a linear array of textual statements. This clearly does not accurately model the behaviour of even a sequential program (because of control jumps, loops, calls to subroutines and returns, etc.). For describing the concurrent sections of a parallel program, textual languages are even more inadequate. So, although programmers do not have to map algorithms onto low-level machine operations, they do have the burden of mentally mapping a sequential list of statements into a parallel form. The difficulty of this mapping is clearly linked to the first aspect of the parallel programming problem, namely the complexity of parallelism. Complex parallel programs can contain a web of data dependencies which are by no means easy to understand; and when one has to derive a mental model of a parallel system from a low-level textual description such as a HLL as described above - as is necessary in software maintenance - the problem is even more acute. One solution to this lies in the provision of more sophisticated design tools. For instance, cognitive science reveals that humans can interpret and understand diagrams much more easily than text, and the developing field of "visual programming" attempts to put this principle into practice. Visual programming environments have been developed for uniprocessor computers, replacing the traditional textual programming language with a visual language which can be entered and manipulated on a graphical display. The freedom which visual programming offers from linear arrays of text should be especially useful in parallel programming, with its additional complexities of concurrency and data dependencies. The benefits of visual programming environments in terms of helping programmers to manage complexity are becoming increasingly widely recognised: for instance, the appearance of Smalltalk, with its manipulation of graphical representations of objects, revolutionised the field of object-oriented programming. However, there has been no analogous contender in the parallel programming arena. The application of visual programming techniques to parallel programming has

implications on the other aspects of the parallel programming problem as well. In a visual programming environment, it is visual abstractions of lower-level functions and/or data which are manipulated, bringing the greater ease of interpretation that diagrams have over text. Also, a visual programming environment provides a ready framework into which further programming tools, such as debugging aids, can be integrated. Thus the benefits of abstraction and visual programming can be brought to the task of debugging parallel software. To summarise, it seems that an obvious strategy to help alleviate the problems of parallel programming would be to provide a visual programming environment. In such an environment the design, implementation and debugging of parallel programs could take place, entirely at a high level, employing visual abstraction. The question therefore arises, what would be a suitable visual abstraction for such a system to be based upon? One answer to this question is a visual abstraction known as the dataflow graph, also called the dataflow schema.

1.4 Dataflow

To explain what a dataflow graph is, it is first necessary to consider the concept of dataflow as opposed to control flow. Algorithms are almost always programmed on sequential computers in a form which describes the flow of control through the program, in terms of sequences of instructions and jumps to other sequences. Data items are defined in terms of the program's operations, for instance as operands to instructions and as parameters to be passed when a subroutine is called. The class of HLLs described as imperative languages implement this paradigm directly, using constructs such as loops, procedures and so on, which are abstractions (hence the name High Level) of machine operations. The other classes of HLLs, such as functional languages, translate other language forms into control flow form. This focus on control flow merely reflects the fact that all such languages are programming a virtual machine which is based on the von Neumann model. However, for parallel machines there is

no one accepted virtual machine which HLL designers can agree on; no one bridging model to bridge the gap between radically different architectures and one software programming environment. This is one reason why new models must be explored.


One model to emerge is the dataflow model, so called because its focus is not on the flow of control through a program but on the flow of data. Operations are defined in terms of the data they operate on, and not vice versa as in control flow. Thus, while a control flow program specifies the order in which operations must be executed, a dataflow program consists of a collection of operations together with a description of the way in which data flows between them; first the data is transformed by one operation, then another, and so on. Programs of this form can easily be visualised as a directed graph, with data flowing along the arcs and being processed at nodes which represent the operations. A very simple dataflow graph is illustrated below.


Figure 1.4a: Dataflow graph to calculate x = 4y + √(z/7)

In this simplified graph, data (in this case, numeric variables and constants) flow along explicit paths described by the lines, through operations (simple mathematical functions) represented by circles. The other shapes represent the input to and output from the program. Each operation has one or more inputs and one or more outputs, and can be executed when all of its inputs have become available. The interest in the dataflow model with relation to parallel programming is made clear by examining a dataflow graph. Since all data dependencies are explicitly shown in the graph, it is implicit which operations do not depend upon each other for data, and can therefore be executed in parallel. In the example above, the multiply and divide operations can always be executed in parallel, but the

addition operation must wait until both the multiply and the square-root operations have produced a result. Furthermore, if multiple instances of x are to be calculated from multiple pairs of y and z, the division operation can execute on one value of z while the square-root operation is still executing using the division's previous result; a pipeline can be set up to increase throughput, in a way exactly analogous to a factory assembly line. In terms of data dependencies, this is possible because the division does not depend on the result of the square root; there is no data path from the square root to the division operation. This relationship between two such unconnected operations is known as implicit parallelism, because it is implied by (implicit upon) the explicit data dependencies. Because operations can operate only on rigidly defined sets of data, namely their own input paths, there can be none of the "side effects" (one operation interfering with another's data) which have long been recognised as a problem in software engineering. The value of dataflow graphs as a design tool has long been recognised. Programs can be designed in the form of such graphs, and then implemented using the insight into the data dependencies gained from reading the graph. Since the graph is an abstraction, the design is not architecture-specific, and because it provides a visual overview, it should in general make the program easier to understand. Some systems analysis methodologies do indeed make use of data flow diagrams. The data flow design is usually a stage before a lower-level design of the program in the form of (control) flowcharts. It is a natural step to design programs in a control flow form if the programming is to be done using a HLL based on the sequential control flow model. However, it has also been established that it is possible to implement programs directly from these familiar data flow descriptions, or at least from a modified form of them. A dataflow program consists of the same type of operations as a traditional program, but instead of having a predetermined sequential order of execution, each operation is executed at any time after the resources it needs have become available: the data it operates on, and a processor to execute it. Dataflow programs are programs for a "virtual machine" based on the dataflow model. The model does not assume any particular number of processors, but clearly the more processors available, the more potential parallelism in a program can be exploited. The term "operations" is deliberately vague, since there are several possibilities

Page 19 Chapter 1 Introduction for its definition. An operation could be an individual machine instruction, a definition which typically results in a very large amount of parallelism in a program. As there is an overhead associated with every instruction (that of deciding when its resources have become available) this form of dataflow becomes very inefficient on conventional computers. Specialised dataflow machines are currently under development which seek to deal with these overheads in hardware, a subject which is discussed further in Chapter 2. An alternative definition is Large Grain dataflow in which the operations are independent program modules which are themselves based on control flow; they exchange data between separate Von Neumann processors, which may even be located in different machines. The term grain size refers to the size of units which may be executed in parallel. Large Grain dataflow is also discussed in Chapter 2. While fine-grained dataflow requires specialised architectural features, and Large Grain Dataflow was designed with loosely-coupled networks of processors in mind, the MIMD multiprocessors discussed in the introduction operate efficiently at an intermediate grain size. It is this combination of medium-grained parallelism and dataflow techniques which this thesis examines.
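As an illustration of how the graph in Figure 1.4a could execute, the following sketch (written in modern C++ purely for illustration; it is not part of the system proposed in this thesis) lets each operation fire as soon as its inputs are available, with std::async standing in for a run-time system assigning ready operations to processors. The values chosen for y and z are arbitrary.

```cpp
#include <cmath>
#include <future>
#include <iostream>

int main() {
    double y = 3.0, z = 28.0;

    // The multiply and divide nodes share no data, so they may fire concurrently.
    std::future<double> mul = std::async(std::launch::async, [y] { return 4.0 * y; });
    std::future<double> div = std::async(std::launch::async, [z] { return z / 7.0; });

    // The square-root node depends only on the divide node's output.
    double root = std::sqrt(div.get());

    // The add node must wait for both the multiply and the square-root results.
    double x = mul.get() + root;

    std::cout << "x = " << x << '\n';   // 4*3 + sqrt(28/7) = 14
    return 0;
}
```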

1.5 Main Contribution

It was stated earlier that dataflow graphs are a useful design tool. However, given software support, there is no reason why dataflow programs should not be directly compiled into executable code, providing an alternative to the textual HLLs currently in use. The control-flow HLLs used to program MIMD multiprocessors usually employ their function or procedure constructs as the units which are executed in parallel. This thesis proposes a dataflow system in which this level of parallelism is used: the "operations" in a dataflow graph are written in a familiar imperative HLL. In fact, the dataflow system could act as a harness around already-written functions of a sequential program, as a means of parallelising it. Using such a system, the programmer would be able to design a program in the form of a dataflow graph on a visual display, using a graph editing tool. The operations are "filled in" in windows using a traditional text editor; and the system

uses the graph and the code segments to generate a parallel program, extracting the implicit parallelism from the dataflow graph. This is analogous to the HLL compilers discussed earlier which produce parallel code from sequential; but the software development system proposed here has the benefit of access to the program's design (the top level of abstraction) and not just implementation code. Using the editors, it would be possible to modify the graph and code sections at will, and visual aids to debugging could be presented in terms of the dataflow graph - in other words, the whole parallel programming process could take place at a high level of abstraction. This thesis contends that such a system is a viable solution to the problems involved with parallel programming described in section 1.2. The system described uses a new form of dataflow graph notation specially designed for the purpose of supporting high-level visual programming of conventional (shared and distributed memory) MIMD multiprocessors. The features of this new notation are described fully in subsequent chapters. In the field of parallel processing, the efficiency with which parallel programs execute is often considered of paramount importance, since faster execution is of course the main motivation for employing parallelism. Thus, the most important criterion for the success of the parallel programming system presented here is likely to be the efficiency of the executable parallel code which it generates. Therefore, the practical focus of this thesis is on the efficient implementation of a run-time system which enables segments of sequential code (as described above) to run in a dataflow fashion on a conventional MIMD multiprocessor architecture. The examination of the efficiency of an actual implementation of such a run-time system (on a shared-memory multiprocessor) forms the main practical result arising from this work. Finally, the usefulness of such a system is evaluated, taking into account the conclusions on the efficiency of the dataflow approach, and the various possibilities the proposed system may hold for further software tools.
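As a purely hypothetical sketch of the idea (the real programming interface is described in Chapter 4, and none of the names below are taken from it), a medium-grained "operation" might simply be an ordinary sequential C++ function, with the programming system generating a wrapper that the run-time system calls once every input has arrived:

```cpp
#include <functional>
#include <iostream>
#include <vector>

// Hypothetical sketch only: a sequential function written by the programmer.
std::vector<double> scale_row(const std::vector<double>& row, double factor) {
    std::vector<double> out(row.size());
    for (std::size_t i = 0; i < row.size(); ++i) out[i] = row[i] * factor;
    return out;
}

// A data item travelling along a datapath (invented for this illustration).
struct Token {
    std::vector<double> row;
    double factor;
};

// The generated wrapper: invoked when the operation's inputs are complete,
// it runs the sequential body and forwards the result along the output path.
void scale_row_wrapper(const Token& in,
                       const std::function<void(const std::vector<double>&)>& emit) {
    emit(scale_row(in.row, in.factor));
}

int main() {
    Token t{{1.0, 2.0, 3.0}, 2.0};
    scale_row_wrapper(t, [](const std::vector<double>& r) {
        for (double v : r) std::cout << v << ' ';
        std::cout << '\n';
    });
    return 0;
}
```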


1.6 Structure of the Subsequent Chapters

The remainder of this thesis is structured as follows. The context of this work is described in Chapter 2: each of the main components is considered in turn (MIMD architectures, software tools for parallel programming and debugging, visual programming and the dataflow model) in terms of the previous research in each of these fields. The various ways in which the dataflow paradigm has been developed are summarised, relating some of the important concepts to the system to be considered in later chapters. Chapter 3 gives details of the proposed notation for medium-grained dataflow, with the emphasis on its uses as a framework for parallel programming. The syntax and semantics of the notation are presented, with a worked example. There is a discussion of the main design decisions involved, focussing in particular on the provision of persistent memory and synchronisation mechanisms, as well as some of the alternatives which were considered. In Chapter 4 the possibilities for implementations of the proposed system, on both shared-memory and distributed-memory architectures, are described. In particular, it is demonstrated how dataflow graphs can be automatically compiled into executable parallel code, forming a harness for HLL functions. The run-time support needed by the executable code thus generated is also described in detail. Chapter 5 describes an actual implementation of a run-time system, as outlined in Chapter 4, on one hardware platform. Emphasis is given to demonstrating that the implementation allows the efficient execution of parallel programs in a dataflow fashion at the medium-grained level of parallelism. Finally, Chapter 6 summarises what conclusions may be drawn from the experiences gained with the system, and what significance this type of system may have, within the context provided by Chapter 2. In particular, conclusions are drawn on what the potential range of application areas might be for this type of system. Possible future developments for the system so far developed, and potential related topics for research, are also outlined.

Chapter 2

Parallel Programming

Having briefly described the parallel programming problem, and outlined the approach which this thesis proposes in order to move towards a solution, it is necessary to examine the components of the programming problem in more detail. Much research has already been undertaken into the problem of how to increase processing throughput by using parallelism, and to summarise all of this research would be a gargantuan task. Therefore, the discussion presented here is deliberately limited to the work which has had a direct influence on the material developed in the remaining chapters. Since the scope of this research is itself limited - to the study of "dataflow development of medium-grained parallel software" - this chapter serves to define the terms used in this title, and to present the important concepts involved in understanding these terms. The motivation for this research is to examine the possibilities for the provision of a graphical software development environment for medium-grained parallel software; this chapter focuses on four key areas - the underlying medium-grained multiprocessor technology; parallel programming models; parallel programming tools, including visual tools; and dataflow. An understanding of each of these facets of the parallel programming problem is necessary in order to understand the original work presented in subsequent chapters.

2.1 Multiprocessors

Much research into parallel programming has come about as a result of the availability of parallel computers, the architectures of which allow multiple operations to occur concurrently. Since programming consists of controlling what operations take place, it is necessary to examine the technology underlying parallel programming.


2.1.1 The von Neumann machine Multiprocessors are descended from single-processor technology, so before considering multiprocessors, it is useful to review the important concepts of uniprocessing. For many years, most uniprocessors have been based on the "von Neumann machine" [vonN58]. A simple von Neumann machine consists of a memory store which contains instructions and data, and a processing element which performs operations in a perpetual cycle of fetching a coded instruction from the store, decoding and executing it, the result being put back in the store (the fetch-decode-execute cycle). One element of the store is designated as the program counter - a pointer to the next instruction to be fetched - and this is automatically updated at the end of each cycle. The sequence of operations can be selected by the program itself, by altering the program counter. In the general case, each instruction has at least one operand (input) and a result (output). As stated in Chapter 1, in principle, when the output of neither of two instructions is an input of the other, those two instructions can potentially be executed in parallel. However, the existence of only one processing element and one store means that in practice, no two instructions can be executed concurrently, and this has led to experimentation with other architectures which derive from the simple von Neumann model. In fact, a wide variety of architectures have been experimented with, working on a variety of principles, and so a need has arisen for a classification scheme to organise these computer architectures into "families" governed by the same operating principles. One such scheme has been proposed by Flynn [Flyn66]. Although Flynn's classification is fairly limited in terms of today's architectures, and various enhancements to it have subsequently been proposed (for example [Dunc90]), Flynn's scheme remains a useful and elegant way of describing broad types of architectures. Flynn's classification is simply based on the instruction streams (how many instructions are fetched concurrently) and data streams (how many items of data are fetched concurrently to be operated on) in an architecture. These two characteristics are simply classified as "single" or "multiple" streams. Thus, in Flynn's scheme there are four basic families of architecture: SISD (Single Instruction stream, Single Data stream), SIMD (Single Instruction, Multiple Data), MISD (Multiple Instruction, Single Data) and MIMD (Multiple Instruction,


Multiple Data). There are many examples of each of these types of architecture, with the exception of MISD machines, which have not, so far, proved to be a useful configuration.
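The fetch-decode-execute cycle described above can be caricatured in a few lines of code. The sketch below uses C++ purely as a notation; the three-field instruction format and opcodes are invented for the illustration. It shows a single processing element working through a single store, one instruction per cycle, which is why a plain von Neumann machine can never execute two instructions concurrently.

```cpp
#include <array>
#include <iostream>

int main() {
    // Each "instruction" is {opcode, operand cell, result cell}.
    // Opcode 0 = halt, 1 = add the operand cell into the result cell.
    std::array<std::array<int, 3>, 3> program{{{1, 0, 1}, {1, 0, 1}, {0, 0, 0}}};
    std::array<int, 2> data{5, 0};          // data cells of the store
    std::size_t pc = 0;                     // the program counter

    for (;;) {
        auto instr = program[pc];           // fetch
        int op = instr[0];                  // decode
        if (op == 0) break;                 // execute: halt
        data[instr[2]] += data[instr[1]];   // execute: add, result back to store
        ++pc;                               // advance the program counter
    }
    std::cout << "data[1] = " << data[1] << '\n';   // 5 + 5 = 10
    return 0;
}
```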

2.1.2 SISD Architectures The conventional von Neumann machine is of course classed as SISD, since it fetches a single instruction to operate on a single data item in each cycle. It is worth noting that a certain amount of parallelism has been introduced into SISD architectures, particularly by employing an instruction pipeline in which, as one instruction is being executed, the next is decoded, while the following instruction is being fetched, all concurrently. Although this is a very limited form of parallelism, it is an important concept and is used in many of the other architectures described below, in addition to their other forms of parallelism.

2.1.3 SIMD Architectures SIMD machines, as their name suggests, execute the same instruction on multiple data items concurrently. There are a number of different architectures in this class, among which are array and vector processors. Array processors typically have a single control unit, and an array of directly connected processing units; the processing units each have their own registers and store, and hence operate on different data items, but the execution of operations is directed by the control unit (a master-slave arrangement). An early example of this type of architecture was the Illiac-IV [Barn68], a more recent example being the DAP [Hock81]. Vector processors are similar, achieving parallelism by providing hardware which allows whole vectors to be used as operands to instructions, and producing whole vectors as the result, the same (single) instruction being applied to each element of the vector concurrently. The TI-ASC [Wats72] and the Cray-1 [Russ78] are typical examples of vector processing architectures. To achieve further parallelism, some vector processors employ memory pipelining (referring to the fetching of vectors from store) in addition to the instruction pipelining described above. Although SIMD architectures have been very successfully employed in many heavily numerical applications, these architectures have their limitations in terms of applicability. Any master-slave arrangement is liable to be subject to bottlenecks in some cases; in the case of array processors, when there is insufficient data


parallelism - that is, when there is less data to be operated on than there are processing units - the full speedup cannot be achieved. A problem when using vector processors is that the vectors used for operands and results are of a fixed size on each machine, and when a problem requires vectors which are, for example, somewhat greater than this size, there is the same problem of inefficiency due to a mismatch of sizes. So, although SIMD architectures can often be used very successfully if used for applications which map well onto their particular architecture, they lack the generality to be applied to a variety of problems - such as those containing vectors of varying sizes, or indeed the many computing tasks containing no significant amount of vectorisable mathematics.
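The kind of data parallelism SIMD machines exploit can be illustrated with a simple element-wise vector operation. In the sketch below (ordinary scalar C++, written only to show the dependence structure; the function and data are invented for the example), iteration i touches only element i of each vector, so no iteration depends on any other and a vector unit could apply the single instruction to many elements at once; a vector longer than the machine's fixed register length would have to be processed in chunks, which is the size-mismatch problem just noted.

```cpp
#include <iostream>
#include <vector>

// y[i] = a * x[i] + y[i]: each element's result depends only on the
// corresponding elements of x and y, never on a neighbouring element.
void scale_and_add(double a, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = a * x[i] + y[i];     // independent of every other iteration
}

int main() {
    std::vector<double> x{1.0, 2.0, 3.0, 4.0};
    std::vector<double> y{10.0, 10.0, 10.0, 10.0};
    scale_and_add(2.0, x, y);
    for (double v : y) std::cout << v << ' ';   // 12 14 16 18
    std::cout << '\n';
    return 0;
}
```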

2.1.4 MIMD Architectures Since MIMD machines can concurrently execute different instructions on different data, this gives them greater flexibility of control than SIMD machines. This makes them a more general-purpose solution to parallel programming. MIMD machines have already been employed in many different application areas. As with the SIMD family, there are a number of different architectures within this classification; among the approaches to MIMD which have been tried are:
• Systolic arrays [Kung82], which have an array of interconnected processing elements which operate on data pumped through the array. Various topologies for the arrays are appropriate for different applications; in the CHiP [Snyd82] the array topology is reconfigurable, providing a general-purpose machine.
• Dataflow architectures, which are described in more detail in section 2.4.
• Shared-memory architectures, in which a number of processors share a store, usually through a common bus.
• Distributed-memory architectures, containing a network of communicating processors each with its own store.
Of these, the latter two have become by far the most prevalent. Their strength lies in the fact that the processing units involved in these architectures are, in themselves, essentially von Neumann machines; multiprocessors based on these architectures usually incorporate a number of commonly-available uniprocessor processing elements linked together. This strategy is particularly cost effective,

since it "rides on the back of" advances in uniprocessor technology, and the only hardware technology which needs to be developed specifically for these multiprocessors is that of the communication links between the processors (and protocols for using them efficiently). There is, though, a trade-off between speed and generality; after all, hardware solutions to problems are in general faster than software solutions, and the more general-purpose machines lack the specialised hardware of, for instance, vector processors. The fact that on conventional multiprocessors parallelism is managed by software leads to the variety of parallel programming models described in section 2.2. However, since many of these models are based on the underlying architecture involved, it is first necessary to examine these architectures in more detail.

2.1.5 Distributed-memory Architectures Distributed-memory multiprocessor architectures consist of a number of processing nodes, with each node comprising one processor and a memory store which is only accessible by that processor; there is no globally accessible memory. Processors communicate by sending data messages to other nodes via communication links, each processor being linked to one or more other nodes. However, the more nodes each node is linked to, the greater the sophistication of communications electronics needed, which limits the number of links implemented in practice. Architectures exist with a number of different topologies of interconnection between the nodes, of varying complexity from linear arrays, to hypercubes (notably the Intel iPSC [Gehr88]), to complete connectivity. Figure 2.1a illustrates the first and third of these arrangements, though it should be emphasised that there are many others.


[Diagram: two interconnection topologies - a linear array and a fully interconnected network - in which each node comprises a memory and a processor.]

Figure 2.1a: Examples of interconnection in distributed-memory multiprocessor architectures.

An additional limitation is that the bandwidth of inter-node communications is usually much lower than that between a node's processor and its local memory. This means that algorithms must be found with a minimal amount of communication for maximal efficiency. Indeed, in general it is preferable to undertake extra processing if this can avoid extra communication [Padd93]. Naturally, if not all processors are connected to each other, the cost of communicating between any two nodes increases in proportion to the number of "hops" messages must make. From the programmer's point of view a completely-connected architecture is the simplest to use, but on any other type of distributed-memory multiprocessor, programming involves finding an algorithm which is efficient on the available topology of processors. This makes the development of software for these machines far more complex than that of software for uniprocessors. Nevertheless, distributed-memory multiprocessors have a number of attractive features which have made them popular; perhaps the most important of these is scalability. This arises because each inter-node link can be a simple point-to-point communications bus, used only by the two processors involved. Therefore, further processors (and hence memory) can be added to the architecture with new inter-node links as appropriate, without placing a greater load on existing links. For applications in which the computing load is very large, and the communication needed between tasks is relatively small, this makes distributed-memory multiprocessors attractive.
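A rough cost model makes this trade-off concrete. Under the common simplifying assumption of store-and-forward routing, the time to deliver a message can be approximated as follows; the symbols are introduced here for illustration only and are not taken from the thesis.

```latex
T_{\text{comm}}(n, h) \;\approx\; h \,\bigl(t_s + n\,t_w\bigr)
```

Here t_s is the per-message start-up latency, t_w the transfer time per word, n the message length in words, and h the number of hops between the two nodes. Whenever recomputing a value locally costs less than the corresponding T_comm, it is cheaper to recompute it than to communicate it, which is the point made above.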


2.1.6 Shared-memory Architectures In contrast to distributed-memory multiprocessors, shared-memory multiprocessor architectures share memory between a number of processors. Machines of this type are typically symmetric, in the sense that all processors have equal access to all of the memory (there is no master-slave relationship). Because of this, it is never necessary for a particular process to know where within a topology of physical processors it is running, unlike any non-fully-interconnected distributed-memory multiprocessor. There may in fact be several physical banks of memory, but the mapping of logical to physical addresses is easily done by hardware; at the software level there appears to be only one logical memory. Because software does not need to know about physical processor or memory location, the programming model of these architectures is very simple: all data in memory is normally available immediately, the only basic access mechanism needed being the use of software locks (usually supported by hardware instructions) to restrict concurrent access, for instance in a concurrent producer/consumer situation. Memory access is typically through a common bus (Figure 2.1b). Because this medium is shared, bus bandwidth becomes the limiting factor in expanding the number of processors in the machine.

[Diagram: several processors and several memory banks all attached to a single shared bus.]

Figure 2.1b: Typical shared-memory architecture

Shared-memory machines have been built in a range of orders of magnitude of processing power, from the Firefly multiprocessor workstation [Thac87] to the 2 to 20 processor Encore Multimax [Enco89], to the Cray X-MP


[Lars84]. Whatever the scale, the attractiveness of this type of architecture lies mainly in the simplicity of the programming model. The distributed-memory and shared-memory models are not completely exclusive. A few hybrid architectures such as the KSR-1 [Kend92] have used elements of both models. Moreover, a current research topic is the implementation of a "virtual" shared memory on distributed memory architectures. In such architectures, extensions to the operating system of a distributed memory machine provide the programmer with a logical shared memory, which is in fact distributed across the available nodes. Essentially, when a memory access is requested, the system checks which node that location is physically stored on, and if it is not the current node, retrieves it from another node using message-passing - thus relieving the programmer of the burden of explicit message-passing. The current interest in virtual shared memory perhaps demonstrates the superiority of shared memory as a programming model.
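The flag-based producer/consumer synchronisation mentioned at the start of section 2.1.6 can be sketched as follows. Modern C++ atomics stand in here for the hardware-supported locks of the machines discussed; the example is illustrative only and not part of any system described in this thesis. Both threads address the same logical memory, and the flag tells the consumer when the shared value is ready.

```cpp
#include <atomic>
#include <iostream>
#include <thread>

int main() {
    int shared_value = 0;                    // ordinary shared memory
    std::atomic<bool> ready{false};          // the "flag"

    std::thread producer([&] {
        shared_value = 42;                                  // write the data...
        ready.store(true, std::memory_order_release);       // ...then raise the flag
    });

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire))      // wait for the flag
            std::this_thread::yield();
        std::cout << "received " << shared_value << '\n';
    });

    producer.join();
    consumer.join();
    return 0;
}
```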

2.1.7 Grain Size Having described the most popular platforms for parallel processing in terms of their architecture, there is one further aspect of hardware which has a bearing on parallel software development, and must therefore be considered: namely, the question of whether the cost of making a program parallel is justified by the processing time gains achieved. There is always a "cost" associated with running a program in parallel, in terms of time spent performing tasks necessary to allow the program to run in parallel. On distributed-memory architectures, the major cost involved is that of transferring code and data to the node on which they will be needed; on shared-memory machines, the system calls needed to create tasks, process control blocks, manage locks and so on for parallel processes are the most significant cost. These costs form an "overhead" of time lost which must be compared to the time gained by parallel execution. For example, if a program contains a loop whose body takes time 20t to execute and which is repeated 5 times, the program will not execute faster if each iteration of that loop is executed as a different process, unless the time taken to start each of these 5 processes (each containing one iteration) is less than 16t (80t in total). This start-up cost is not the only consideration in deciding whether (and how) to make a program parallel. Even if the sections of a program which can be

executed in parallel are long compared to the start-up overhead, separate parallel tasks often need to co-operate with each other, and they can only do this when both tasks are ready - in other words, they need to synchronise. This synchronisation event could be, for example, a producer process making data available for a consumer process. Synchronisation events, too, carry an overhead. Where synchronisation occurs for tasks to pass data between one another, this overhead includes any time taken to copy the data from one area of shared memory to another, or to transmit the data from one processing element to another. It is the time taken by these two types of overhead, in relation to the total time taken by parallel sections between synchronisation events, which determines whether or not parallelisation is worthwhile in a particular case. Thinking of time as a "size", the term grain size is used to describe the time taken by a process between synchronisation events. For any given multiprocessor architecture, there is a certain minimum grain size, below which parallelisation is not efficient; namely, grains which take less time to execute than does the code which implements the synchronisation (including communication time). This will vary from one architecture to another, since the time taken to synchronise depends on the hardware operations required. To be more architecture-independent, it is more useful to define grain size in terms of numbers of instructions rather than absolute time. Gordon Bell proposed a classification of parallel architectures according to grain size, defining fine, medium, coarse and very coarse grained parallelism [Lee87]. This is adapted in Figure 2.1c in an updated form which more accurately represents present-day usage. The grain size definitions illustrated below are adopted in the remainder of this thesis.


Grain Size   Synchronization Interval (Instructions)   Construct for Parallelism
Fine         < 20                                       Inter-instruction (hardware controlled)
Medium       20 - 2,000                                 Inter-task (within a single program)
Coarse       2,000 - 10,000                             Multiprocessing of concurrent processes
Large        10,000 +                                   Distributed processing across local area network

Figure 2.1c: Definition of multiprocessor grain sizes

This concept of a minimum grain size, below which parallelism in software produces no gains in terms of reducing execution time, is the final important way in which the architecture of conventional parallel machines affects the software which runs on them. The next component of parallel programming to be considered is a level above the hardware - the programming models on which languages for parallel programming are based.
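The break-even reasoning above can be captured in a few lines of C. The sketch below simply compares estimated sequential and parallel execution times; the function and parameter names are invented for illustration, and the model (grains of equal length, start-up costs paid serially, grains themselves overlapping perfectly) is deliberately simplistic.

#include <stdbool.h>

/* Is it worth running n_grains pieces of work, each taking grain_time, in
 * parallel rather than in sequence?  startup_cost is the per-process creation
 * overhead and sync_cost the per-grain synchronisation overhead. */
bool worth_parallelising(double grain_time, int n_grains,
                         double startup_cost, double sync_cost)
{
    double sequential = grain_time * n_grains;
    double parallel   = grain_time                    /* grains execute together   */
                      + n_grains * startup_cost       /* process creation overhead */
                      + n_grains * sync_cost;         /* synchronisation overhead  */
    return parallel < sequential;
}

/* The loop example of section 2.1.7: a body of 20t repeated 5 times pays off
 * only while the per-process start-up cost stays below 16t (80t in total):
 *   worth_parallelising(20.0, 5, 15.0, 0.0)  ->  true
 *   worth_parallelising(20.0, 5, 17.0, 0.0)  ->  false                         */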

2.2 Parallel Programming Models

Very little programming in the 1990s is undertaken at machine code level, because of the complexity of the programs currently being developed. When programming in parallel, that complexity is increased so greatly that high-level programming is essential. This section discusses the programming models of parallelism on which current high-level parallel programming languages for conventional MIMD multiprocessors are based. This section, and the next (on parallel programming tools), focus on control-flow based, rather than dataflow based, programming models. Dataflow is considered in more detail in section 2.4. This chapter also focuses on the operational, rather than formal, aspects of parallel programming.

Although the need for structured programming languages for parallel programming was recognised long ago [Tane78], parallel programming languages developed in the early 1980s nevertheless evolved in an ad hoc manner. Often the programming model used by such languages simply added the parallel processing

primitives provided by one particular architecture to an existing language, with little attempt at generalisation. This approach sacrifices ease of use and portability for efficiency of implementation, though this is not surprising given the emphasis on speeding up execution times. More recently, a number of more general programming models which are not architecture-specific have been introduced, and their usefulness has been proven on a wider range of real applications. This answers Karp's criticism [Karp87] that the early parallel languages were not tested on realistic enough examples to prove their worth for general-purpose parallel programming. The remainder of this section describes the major models of parallelism on which parallel programming systems have so far been based. First, however, it is useful to identify an abstraction of parallelism with which to compare these models.

2.2.1 Types of Parallelism

One abstraction of software parallelism has been proposed by Carriero and Gelernter [Carr88], who identify three basic types of parallelism which a parallel programming model may offer the programmer, namely:
• result parallelism
• activity parallelism
• structure parallelism.
Taking each of these in turn, result parallelism is the type of parallelism used when a task is broken down into multiple sub-tasks, each of which executes in parallel to produce a different part of the overall task's result. For instance, consider a parallel relational program in which a table Z is required which is the intersection of two tables Y and V, Y being the union of two further tables W and X, and V being the projection of tuple t over table U. The calculation of union(W, X) → Y and that of project(t, U) → V can be executed in parallel, producing the two partial results V and Y concurrently. Activity parallelism is an alternative to result parallelism, in which a number of identical tasks are created to work on the data needing to be processed, taking slices of the data until it is used up. An example of this is a matrix operation in which each row or column of the resulting matrix could be calculated independently, in parallel. If there were more columns or rows to be processed than

available processors, one task might be created per processor, and each process would take the data needed for an output column from a "pool" of input data, until all the result columns had been calculated. This type of parallelism is typical of master/worker process organisation. Finally, structure parallelism can be exploited when a task has a number of identifiably separate stages involved in producing its result, and a stream of data which must go through these stages. In such a case, separate processes can execute each stage in parallel while the data is streamed through, in a software equivalent to the pipelines described in section 2.1. These three orthogonal types of parallelism may of course all be employed within the same parallel program - if the programming model allows. In the following section, various programming models are discussed, with reference to the ease with which these three types of parallelism may be employed by the programmer.
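As an illustration of activity parallelism in this master/worker style, the C fragment below uses POSIX threads to let a fixed number of identical workers take column indices from a shared "pool" until the work is used up. The matrix, its dimensions and the column computation are invented for the example and stand in for a real application.

#include <pthread.h>

#define ROWS    256
#define COLS    256
#define WORKERS   4

static double result[ROWS][COLS];
static int next_col = 0;                                  /* the pool of remaining work */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for the real computation of one output column. */
static void compute_column(int col)
{
    for (int row = 0; row < ROWS; row++)
        result[row][col] = (double)row * (double)col;
}

/* Each identical worker repeatedly takes the next column from the pool
 * until no columns remain: activity parallelism. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&pool_lock);
        int col = next_col++;
        pthread_mutex_unlock(&pool_lock);
        if (col >= COLS)
            return NULL;
        compute_column(col);
    }
}

int main(void)
{
    pthread_t workers[WORKERS];
    for (int i = 0; i < WORKERS; i++)
        pthread_create(&workers[i], NULL, worker, NULL);
    for (int i = 0; i < WORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}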

2.2.2 Parallel programming models

Just as the types of parallelism which can be expressed in a parallel programming language can be classified, it is useful to similarly classify the models for parallel programming which exist. Most models are based on the concept of processes, which undertake the computation and are within themselves sequential, and differ only in the way in which these processes synchronise when data dependencies occur. One way of classifying the models is according to whether they assume the existence of shared resources to aid synchronisation, or instead rely on message-passing; the obvious analogy being with the two main types of multiprocessor described above. Within the classification of message-passing models, there are two sub-types: those models which assume asynchronous communication, and those which use synchronous message-passing. This classification along similar lines to the classification of architectures does not necessarily mean that each such programming model is specific to one particular architecture, or even to one type of architecture; portability can be achieved by providing the features demanded by the model in software. For instance, a message-passing system could be run on a shared-memory machine, by passing the messages through an area of shared memory in a controlled way. The

classification is merely a convenient perspective from which to view the available models. Each of the three types of model (shared memory, asynchronous message-passing, and synchronous message-passing) can now be considered in turn. An early shared-resource model for synchronisation is the semaphore [Dijk68]. Semaphores provide an abstraction of hardware mechanisms for synchronisation; the semaphore has a value, and two operations, P (which decreases the value) and V (which increases it). The value, however, is not allowed to drop below zero, and if it is zero, any process which subsequently tries to use the P operation on it is suspended until another process uses a V operation on it. Thus, semaphores are a shared resource which can be used to synchronise processes. However, they are a low-level mechanism for synchronisation, and most models support higher-level abstractions for synchronisation. A higher-level model which has proved popular is that of the monitor [Hoar74]. A monitor is an object consisting of the resource, such as shared memory, and the operations which access that resource; access is not allowed from outside the monitor. Execution of the operations is guaranteed to be mutually exclusive, i.e. execution can only be by one process at a time. Thus, if one process reaches the execution of a monitor operation while another process is already using that monitor, it is suspended until the other process leaves the monitor, thus providing transparent synchronisation. Monitor constructs have been added to existing programming languages to form languages suitable for parallel programming, as in Concurrent Pascal [Brin77], and formed the basis for new programming languages, such as Mesa [Lamp80]. An alternative model based on a shared-memory paradigm is the Linda model [Ahuj86]. The central concept in Linda is that of the tuple space, a shared memory space containing structured data items called tuples. Data can be placed in tuple space using a write operation, and read or copied out by any other process; indeed, processes can also be placed in tuple space, to be executed by the system at some time when a processor is free. Linda has been shown to be successful across a range of grain sizes, on both shared-memory and distributed architectures, where it implements a virtual shared memory [Carr88]. Shared-memory models in general are particularly well suited to program structures in which the problem is broken down into sub-tasks, which can use different data to work on different parts of the overall computation, in other words result parallelism. Because Linda supports

process tuples, it also supports activity parallelism, with a master process generating data tuples and "slave" process tuples to consume them when processors become available. Turning to programming models which are not based on shared resources, the archetypal message-passing model is known as the Communicating Sequential Processes model (CSP) [Hoar78] [Broo84]. This model relies on explicit send and receive operations used to pass data between two specific, named processes. At any one time, all data is owned by one process; this concept of a single owner is central. Synchronisation is achieved because messages cannot be received until they have been sent. The CSP model has been directly implemented as a programming language [Jaza80], and the Occam language is also based closely upon it [Jone85]. While most CSP-based languages allow only static allocation of processing tasks, a more recent language, Strand88, is a CSP-based system which supports dynamic work allocation [Catt89] [Fost90]. An alternative model based on asynchronous message passing is the Actor model [Agha86]. Asynchronous message-passing systems are particularly well-suited to applications in which pipelines of processes each pass on data to the next, in other words where there is structure parallelism.

While in CSP and its derivatives communication between processes is asynchronous, the third type of model is based on synchronous message-passing. As in asynchronous message passing, all data has one owner; however, in synchronous systems, many data objects remain in the ownership of one process which acts as a caretaker, performing any necessary operations on them. Thus, when a process needs to perform an operation on some object, in this type of model the process must synchronise with the object's caretaker process (i.e., wait for it to return the result). This is the basis of the Remote Procedure Call (RPC), the remote procedure being the object's caretaker, and the call being synchronous. The Ada language's rendezvous mechanism is another example of this type of programming model [Mund86]. The two types of message-passing model, synchronous and asynchronous, are philosophically similar, and it has been argued that they are directly equivalent in expressive power and performance [Laue79]. However, synchronous message-passing systems are designed more for client/server processing arrangements, and are therefore more suited to activity than to structure parallelism.
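To make the lowest-level of these mechanisms concrete, the semaphore's P and V operations described at the start of this section can be sketched in C using POSIX threads. This is a minimal illustration of the semantics only, not Dijkstra's original formulation nor any particular system's semaphore implementation.

#include <pthread.h>

/* A counting semaphore: V increases the value, P decreases it, and a process
 * attempting P while the value is zero is suspended until another process
 * performs a V. */
typedef struct {
    int             value;
    pthread_mutex_t lock;
    pthread_cond_t  nonzero;
} semaphore;

void semaphore_init(semaphore *s, int initial)
{
    s->value = initial;
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->nonzero, NULL);
}

void P(semaphore *s)                 /* decrease; suspend while the value is zero */
{
    pthread_mutex_lock(&s->lock);
    while (s->value == 0)
        pthread_cond_wait(&s->nonzero, &s->lock);
    s->value--;
    pthread_mutex_unlock(&s->lock);
}

void V(semaphore *s)                 /* increase; wake one suspended process */
{
    pthread_mutex_lock(&s->lock);
    s->value++;
    pthread_cond_signal(&s->nonzero);
    pthread_mutex_unlock(&s->lock);
}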


As noted earlier, the fact that these parallel programming models have similarities to certain concurrent architectures does not mean they can only be implemented on such architectures. In the same way, the fact that each model is particularly suited to one of the three types of parallelism described above certainly does not imply that programs based on one of these models cannot express any other type of parallelism. However, the fact that each of these types of programming model allows a certain type of flexibility that the others do not implies that certain algorithms are easier to express in one model than another. Because algorithms which can only be expressed awkwardly tend to result in an inefficient implementation, in recent years there has been a move towards parallel programming models which are abstracted further away from architectural features. One solution is to provide a model which consists of processes and a set of sequencing operations encompassing all three models above, so that whichever operations seem most appropriate can be chosen [Brow85c]. An alternative is to adopt an object-oriented programming approach and adapt it for parallelism [Toko87]. The object-oriented paradigm is convenient, since it involves encapsulating data and the operations which can be applied to it within self-contained units, which can therefore easily be distributed amongst processors. This distribution of objects can be done in two ways: either explicitly by the program (by inheriting the properties of parallel objects and synchronisation objects, as in Presto [Bers87]); or transparently by the system. In the latter case, the system must trap invocations of other objects if they are not on the same node. The two objects can then communicate either by message-passing, or by object migration - moving one of the objects so that they then reside on the same node. The management of object placement can be done in various ways; for instance by tagging each object with a unique global identifier, as in the Emerald system [Jul88], or by maintaining a virtual shared object memory as the Amber system does [Chas89]. Systems based on object mobility have proved efficient, but the question of decision-making in such systems (when to migrate an object and when to leave it in place) is still a topic for further research. While parallel object-oriented systems are typically aimed at distributed systems consisting of a number of von Neumann processors, one final control-flow based parallel programming model does not assume any one particular hardware

architecture. The Bulk Synchronous Parallelism (BSP) model is aimed at "providing guaranteed performance at near-optimal processor utilisation" [Vali90]. It does this by arranging for "parallel slackness" - in other words, ensuring that there are normally more virtual processes in existence than there are physical processors. In this way, all available processors can be kept busy processing the waiting tasks. This idea itself is not new; systems which are made efficient by judicious "grain-packing" of processes onto processors in order to balance execution and communication times have already been proposed [Boon88]. The main innovation of BSP is instead that its synchronisation facilities are not architecture-based. Instead, they are based on optionally controlling processing tasks by the use of "supersteps" of a set period of time, after which the tasks are synchronised. This period can be controlled by software, at run-time if required, thus providing control over the effective grain size of the program (above a certain minimum necessitated by the hardware being used). Thus, BSP appears to provide a programming model which is abstracted away from traditional architectural models, while retaining the aspect of efficiency which is ever important in parallel processing.
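The superstep structure can be sketched as a simple loop in C, here using a POSIX barrier to stand in for the BSP synchronisation point. The sketch only illustrates the shape of a BSP computation; the work and communication routines are empty placeholders, and real BSP systems supply their own primitives and run-time control of the superstep period.

#include <pthread.h>

#define PROCS      8                        /* virtual processes ("parallel slackness") */
#define SUPERSTEPS 100

static pthread_barrier_t superstep;

static void local_compute(int id, int step) { (void)id; (void)step; /* local work    */ }
static void exchange_data(int id, int step) { (void)id; (void)step; /* communication */ }

/* Each virtual process alternates asynchronous local work with a global
 * synchronisation point marking the end of the superstep. */
static void *virtual_process(void *arg)
{
    int id = *(int *)arg;
    for (int step = 0; step < SUPERSTEPS; step++) {
        local_compute(id, step);
        exchange_data(id, step);
        pthread_barrier_wait(&superstep);    /* all processes synchronise here */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[PROCS];
    int       ids[PROCS];
    pthread_barrier_init(&superstep, NULL, PROCS);
    for (int i = 0; i < PROCS; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, virtual_process, &ids[i]);
    }
    for (int i = 0; i < PROCS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&superstep);
    return 0;
}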

Section 2.1 introduced parallel processing hardware, and this section has discussed a conceptual layer above the hardware, the abstract models involved in programming such hardware. The next step, therefore, is to move up another layer to examine the actual tools which can be used for parallel programming, based on these models.

2.3 Programming Tools

It has long been realised that parallel programming is sufficiently different from sequential programming that new, specialised tools would be needed for it, beyond simply providing parallel programming languages based on the models described in the previous section. In particular, the complexities introduced by concurrency give rise to specific programming needs unique to parallel programming [Wino79]. A number of software tools have been developed to help the parallel programmer manage the complexity involved, but so far these have often failed to meet the needs of real application programmers [Panc90]. The following section examines

some of the tools which are available, and assesses what is needed. Section 2.3.2 then describes an alternative approach to the provision of tools to manage the complexity of parallel programming - namely, the application of techniques from the field of visual programming.

2.3.1 Tools for Parallel Programming

Perhaps the most common means of parallel programming is to use text-based HLLs; thus the most common tools used are compilers. There are three main approaches to providing compilers for parallel programming:
• To extend existing programming languages with parallel constructs, such as the Pisces system based on FORTRAN [Prat85];
• To provide "auto-parallelising" compilers which detect potential parallelism within a program and generate parallel code accordingly [Kuck81];
• To provide compilers for specifically parallel languages such as those mentioned in the previous section.
Many systems provide no tools for parallel programming other than the compiler and standard debuggers not originally intended for debugging parallel programs. However, other tools to be used in conjunction with parallel program compilers exist, particularly tools for inter-procedural analysis to discover data dependencies [Alle85]. There are also debuggers designed for parallel programming [Cheu90]. Some, such as Instant Replay [LeBl87], focus on ensuring reproducible behaviour; others, such as Parasight, on minimal intrusiveness [Aral88]. Although such tools ease certain tasks, parallel programming still remains generally difficult. Part of the problem may be that many of the existing tools each address only one aspect of parallel programming. Few systems provide an integrated computer-aided software engineering (CASE) environment encompassing the whole of the parallel software life-cycle (including design, implementation, debugging/testing and performance tuning). However, the trend seems to be towards this ideal of a complete, integrated programming environment. The Pie system provides an architecture-independent programming environment based on a modular programming "metalanguage" on which tools for implementation and instrumentation (i.e. the collection of data for

debugging and performance tuning) are based [Sega85]. There are also systems based around existing languages; the Schedule system provides an integrated set of programming tools for a parallel FORTRAN [Dong87]. On a wider scale, a system called IPE provides an Integrated Programming Environment which is intended to be able to support all stages of the software development process for distributed parallel programming [Szaf92]. IPE is not based on one of the conventional parallel programming models, but instead uses a business organisation metaphor, in an attempt to make parallel programming more accessible. IPE is also aimed at large-grained computation, another trend in parallel programming tools as networks of workstations (which lend themselves only to the large-grain level of parallelism) become almost ubiquitous. IPE makes use of graphics to specify and modify some aspects of the application. This, too, is becoming a trend in providing tools and even programming environments for parallel (and most other types of) programming. To understand why, it is necessary to examine the nature of visual programming.

2.3.2 Visual Programming

The term visual programming is used to refer to the design and development of software using visual images rather than text. The study of programming using images - in other words, computer graphics - is younger than that of parallel programming, because high-resolution bitmapped graphics technology has only relatively recently become so cheap that it is available to all. However, it has long been realised that graphical displays are inherently easier to understand than purely textual ones, for a number of reasons which include:
• Pictures contain more dimensions of expression - not only can they represent two or three-dimensional objects, but they can represent other physical properties such as size, shape, colour, direction and distance.
• Pictures resemble the real world - they are concrete rather than abstract. This means that the viewer can apply the same cognitive skills used to assess the world around us. These skills are used constantly (and are learnt much earlier in life than language skills), and are therefore more familiar. This in turn makes the information transfer rate much faster [Raed85].


• Pictures are random-access rather than sequential-access - the eye can instantly move around a picture, and can move easily between the overall picture and a detailed view.
It has been shown that, following from reasons such as those above, programming using visual abstractions of computer operations is easier than programming using textual HLLs [Powe83]. Moreover, it is the last of the three attributes above which makes visual programming particularly well suited to the task of parallel programming. From the principles described above, the field of visual programming has developed. Visual programming languages employ graphical representations of programs, in a self-consistent notation; a familiar example of a graphical representation of a program is the control flowchart, in which boxes represent operations and arrows between them represent the flow of control from one to the next. A notation can have formal properties (i.e., formal definitions of the syntax and semantics of combining aspects of the notation together) so that it is a precise definition of the behaviour of any program which can be expressed in it [Hare88]. To design or edit a program using a visual language involves a visual display and the direct manipulation [Schn83] [Card87] of easily recognisable shapes on it using appropriate input devices such as the mouse, trackerball etc. Entering and editing a program in a visual language is directly analogous to the same operations for a textual language, using a graphical editor rather than a text editor. Graphical editors which allow the manipulation of graphical images (essentially moving pixels on a bit-mapped display) are not significantly more complex than text editors.

Visual programming is beginning to have a major impact on programming methods. Among its advantages is that complete programming environments can be built around a visual abstraction, including integrated toolsets covering several stages of the software development cycle [Ambl89]. But not only have visual programming techniques made programming easier; in the field of parallel programming they have begun to make things possible which were previously infeasible, due to the difficulty of managing the complexities of parallelism. Visually-based tools for parallel programming have begun to appear. These tools can be divided into three main types [Roma89]:


• visual programming environments (in which algorithms are specified);
• program visualisation and animation tools (graphical representations of algorithms and their run-time behaviour); and
• data visualisation and animation tools (graphical representations of data structures and their run-time behaviour).
Taking each of these in turn, visual programming environments for sequential languages have been under development for some years. One of the earliest was the Cornell Program Synthesiser [Teit81] which, while not completely graphical, introduced the concept of syntax-directed entry of application code, the programmer's input being restricted to the syntactically correct possibilities. This has become a common feature of software tools which support visual programming. Through the 1980s, many visual programming environment systems have become available [Chan87]. Those programming environments designed for parallel programming are only a fraction of the total, but there have been a number of successful systems among them. Some have provided visual tools designed to produce code in an existing textual programming language, such as graphical Occam [Mour89], which is based on control flow graphs which illustrate Occam's constructs for parallelism and its communication channels. Others have developed new visual abstractions for parallel programming, at a variety of levels; for example Poker, which uses graphs to describe the architectural characteristics of distributed systems [Snyd82]. At a more architecture-independent level, CODE provides a Computation Oriented Display Environment in which graphs represent the data and scheduling dependencies between processes [Brow89]. Other systems have adopted existing architecture-independent models, such as the object-oriented model [Roge88]. At a higher level still, PegaSys uses graphics representing statements in a formal logic to express the specification and design of parallel programs [Mori85] [Mori86]. Another system, Faust, provides a spectrum of tools, from high-level project management tools to a low-level event-display tool [Guar89]. It was noted above that visual programming tools can be classified as those for algorithm visualisation and those for data visualisation, and many of the tools provided by visual parallel programming environments such as those mentioned above fall into these categories. In addition, there are a number of such tools which do not form part of a complete visual programming environment.


The purpose of algorithm visualisation tools which are not integrated into a programming environment per se is to generate graphical representations (or "visualisations") of parallel programs developed in another way. There are many possibilities for algorithm visualisation [Brow85]. For instance, recent developments of the Pie system provide visual depictions of algorithmic behaviour designed for performance debugging [Sega89]; the Voyeur system provides hierarchical graphical views of Poker programs [Soch89]; and views of parallel programs can also be generated and viewed using an entity-relationship based query language [Schw86]. The motivation for these tools is that if programmers have access to a rich variety of ways of viewing the parallel program, they will be better able to understand the complexities of the algorithm and hence of the program's behaviour. To represent the program's behaviour at run-time is the aim of algorithm animation tools. Such tools are based on some other graphical representation of the algorithm and add to it graphical cues (such as colour changes) which represent the activity of the program, such as the position of threads of control [Brow85b]. Of more interest are data visualisation tools, since the contents of data structures are generally much more complex than the flow of control, and therefore the programmer benefits more from visualisations of data. The VIPS system provides both algorithm and data visualisation tools for Ada programs [Isod87]. Since data structures are defined by the programmer, the programmer also specifies the most appropriate way to display each variable required, using a Figure Description Language. The visualisations can be chosen from a predefined set; or, in Incense, a data visualisation tool for the Mesa language, they can be defined by the programmer [Meye83]. The theme common to all systems which provide data visualisation tools is that understanding the flow of data through a program is central to the understanding of the program's behaviour. Indeed, the flow of data has been used as the central visual concept in many sequential visual programming languages. In parallel visual languages, the relationships between processes and in particular their synchronisation has tended to be the focus. However, the dataflow paradigm does allow the visual specification of synchronisation, as a few systems have proved. They are described in the next section, which examines dataflow in more detail.


2.4 Dataflow

The basic concept of the dataflow paradigm was introduced in Chapter 1. The idea of describing programs in terms of the flow of data through them has been around almost as long as that of describing them in terms of flow of control. One advantage of the dataflow approach is that data items usually represent real-world entities, and so to describe the program in terms of these (rather than in terms of an artificially devised flow of control) makes the program description closer to the real world, and hence easier to understand. For this reason, "data flow diagrams" have long been used for systems analysis, to describe the overall design of a program though not the implementation. It is also because of this closeness to the real world that dataflow is popular in the field of visual programming, since relating programs to the real world makes them more understandable - the chief goal of visual programming. However, most of the interest motivating research into dataflow for its own sake has been due to the implicit parallelism found in dataflow-based descriptions of programs. The following sections describe the field of dataflow research, which is in itself a complex area with a number of distinct branches.

2.4.1 Dataflow for Parallelism

As explained in Chapter 1, dataflow programs consist of discrete operations, which have data inputs and outputs. Dataflow programs are almost always considered as digraphs - directed graphs in which the operations are the nodes, and data items flow in a set direction along the arcs of the graph. These arcs describe all of the data dependencies which exist between operations; the only data to which operations have access is that of their inputs, there being no global memory. From this, it is implied that any two operations which do not have an explicit dependency have the potential to be executed concurrently, since they do not share any resources for which there could be contention. Indeed, even if the operations are not ultimately executed in parallel, the lack of any shared information between them means that there can be no side-effects (one software module interfering with another's data, causing results not obvious from looking at either module). This freedom from side-effects is a very useful property even when developing sequential software [Stev82]. In a parallel environment, where debugging


information is often harder to collect and interpret, this becomes even more important. It has been shown that this implicit parallelism is straightforward to extract from dataflow descriptions of programs, for use in executing the operations of the program in parallel [Trel77]. Thus, on architectures on which concurrent execution is possible, it is often useful to use a dataflow rather than control-flow description of a program; in other words, to use dataflow not just as a design tool, but for the implementation as well. Indeed, particular interest began to be taken in dataflow as an implementation medium in its own right in the early 1970s, when a research project at MIT undertook to implement an early parallel architecture based on the dataflow paradigm. Research into dataflow machine architectures is described in the following section, but from the early 1970s onwards, there was increasing interest in the dataflow paradigm at all levels of software development - at the design level and at the programming language level (described in section 2.4.3), as well as at the machine architecture level. Much dataflow design work arose from the dataflow system originally devised to show how the proposed MIT dataflow architecture would be programmed [Denn74]. This system consisted of a design notation based on dataflow graphs, not dissimilar to the example dataflow graph in section 1.4. In this system (and its successors), data is introduced into the program, and flows through the graph. The dataflow operations (operations in this case meaning machine instructions such as addition, multiplication etc.) are executed when all of their inputs are present. This is usually known as firing - operations are fired when they have data available on all inputs (and when processing hardware is available to execute the operation). Each operation has two states: enabled, which means that data are available on all inputs, so the operation can fire as soon as the appropriate hardware is available; and not enabled, when data are not available on all inputs. The system of waiting until all inputs are available before becoming enabled is known as the strict enabling rule, and most dataflow systems have followed this for simplicity of implementation. The type of dataflow system described above is known as data-driven, since the fact that data are available drives the events. This is the scheme used in most dataflow systems; however, an alternative exists, in demand-driven dataflow.


Here it is demands for the data needed which drive the events (enabling operations which would produce the data required, which may in turn enable further operations to produce the data they require, and so on). The obvious analogy is that data-driven dataflow corresponds to the imperative language model, while demand-driven dataflow corresponds to the functional language model. In data-driven dataflow, data is "pumped" through the graph, while in demand-driven systems it is "sucked" through. Some systems of dataflow allow both interpretations [Shar82], but data-driven dataflow is the more common, particularly in the research into dataflow architectures. This research is now discussed in more detail.
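The strict, data-driven enabling rule can be expressed compactly in C. The node structure and names below are invented for illustration and do not correspond to any particular dataflow machine; each input arc is modelled simply as a count of the tokens waiting on it.

#include <stdbool.h>

#define MAX_INPUTS 4

/* One operation (node) of a dataflow graph. */
typedef struct {
    int  n_inputs;
    int  tokens_waiting[MAX_INPUTS];   /* tokens queued on each input arc        */
    void (*operation)(void);           /* the work performed when the node fires */
} node;

/* Strict enabling rule: a node is enabled only when every input arc
 * holds at least one data token. */
static bool enabled(const node *n)
{
    for (int i = 0; i < n->n_inputs; i++)
        if (n->tokens_waiting[i] == 0)
            return false;
    return true;
}

/* Data-driven firing: when the node is enabled (and hardware is free), one
 * token is consumed from each input arc and the operation is executed,
 * producing tokens on the node's output arcs in turn. */
static void fire(node *n)
{
    if (!enabled(n))
        return;
    for (int i = 0; i < n->n_inputs; i++)
        n->tokens_waiting[i]--;
    n->operation();
}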

2.4.2 Architectures

Dataflow architectures are designed to provide a parallel alternative to the inherently serial von Neumann model of processor architecture. Such architectures exploit the potential parallelism of dataflow programs by executing operations which do not have immediate dependencies concurrently on their multiple processing units. These processing units are "fed" from a queue of enabled operations, and because the concurrent operations do not depend on each other for sequencing, the processors can execute them asynchronously. Items of data are passed between modules of the architecture as data tokens containing the data value and the destination operation of that data item; operations are passed as operation packets comprising the opcode, operands and destinations of their results. This packet communication system makes dataflow machines particularly suited to distributed organisation, and makes dataflow architectures easily scalable. There are two main types of dataflow architecture. In the MIT machines and others, only one data token is allowed to exist on each arc of the dataflow graph at a time [Denn80] [Denn88]. This form of dataflow avoids problems of sequencing successive data tokens on the same arc, and has been used for its deterministic properties [Rumb77]; it is known as static dataflow. This type of system involves feedback, since if an operation's output arc remains full, that operation cannot fire again if it would produce another data token for that arc. This information about the full/empty state of each arc must be fed back by control tokens. In dynamic dataflow, on the other hand, multiple data tokens are allowed to exist on the same arc. Dynamic dataflow machines, such as those developed at


Manchester [Gurd80] and by a second team now at MIT [Arvi90] use a tagged-token system in which each data token carries a unique tag to identify its sequence on an arc. This removes the need for feedback and allows loop unravelling - successive iterations of loops can sometimes be executed in parallel rather than in sequence, the results being restored to the correct order using the tags. Thus, dynamic dataflow architectures provide greater potential for the exploitation of parallelism in a program [Gurd86]. However, this ability to exploit large amounts of fine-grained parallelism is not the unqualified advantage it might seem; it is often the case that there is more potential parallelism in dataflow programs than the architecture can handle, leading to very long queues of enabled operations and high memory usage by the data tokens produced. Thus, a need has been identified to reduce the parallelism in programs for dynamic dataflow architectures [Bic87]. This and other problems, such as the cost of the large packet-switched networks needed when many processing units are implemented, are ongoing topics of dataflow architecture research. Various alternatives to the classic static or dynamic dataflow architectures have also been proposed. For instance, it has been suggested that dual architectures would be possible, with instructions activated either by control tokens or data tokens, according to context [Hopk79]. Most others have been hybrid architectures, based on cheap von Neumann processing elements, but with dataflow synchronisation of the operations [Trel82] [Gaud85] [Bueh87]. Such architectures generally use a slightly coarser-grained definition of operations than the instruction-level dataflow architectures; hardware support for dataflow is used because of the ease of extraction of parallelism, and freedom from side-effects. Similar architectures have implemented dataflow "macros" which allow groups of instructions to be scheduled as a single dataflow unit or actor [Evri90]. These architectures still provide fine-grained parallelism, but there seems to be a significant trend towards coarser-grained parallelism. This is not to say that fine-grained dataflow architectures have no future; work on both static and dynamic architectures continues [Gurd87], and there is hope that the main technical problems with such architectures may soon be overcome [Nikh89]. However, any trend towards larger grain sizes is relevant here because it shifts the focus towards a higher level of abstraction; in this case, away from

hardware and towards software. The software uses of dataflow, therefore, are described in the following sections.

2.4.3 Low-level dataflow languages

A number of textual dataflow languages, somewhat akin to the HLLs used to program conventional architectures, have been developed. This section provides only a brief overview, in order to provide context for the later discussion of higher-level dataflow systems; the details of these languages are not directly relevant. Many of these textual dataflow languages follow the pattern of imperative HLLs, but with important differences, notably that of single assignment. Languages such as ID [Arvi77], Lucid [Wadg85] and SISAL [Cann90] do not have variables, which can be modified at any time, but simply values, which can only be assigned once. Values model the arcs of dataflow graphs; they are set by the operation which produces the value, and read by the operation which consumes it. These languages can be compiled into code for dataflow architectures, which then use the assignment of values to sequence the operations. Programming languages based on the dataflow paradigm have been called applicative languages [Acke82]. A number of hybrid languages have also been proposed, combining dataflow and control-flow features; for instance applicative languages which allow variables rather than values [Ruig90], or languages based on path expressions [Oldh84] or CSP [Bond89] [Gaud89]. The variety of programming languages used is almost as great as the variety of broadly dataflow-based architectures. Not all dataflow-based languages have been developed specifically to program dataflow architectures; the dataflow paradigm has also been used simply for its advantages as a programming model and software engineering tool, namely the ease with which parallelism can be identified and freedom from side-effects [Gajs82]. An example is Mentat, an object-oriented programming system based on a dataflow model [Grim87]. Another is the Tyger model, a hybrid between dataflow and control flow which employs alternating "stripes" of dataflow and control-flow execution, with shared values passed between the two [Stok90]. A further benefit of the dataflow model which these systems and others exploit is its architecture-independence, making it easy to port dataflow programs between


different types of parallel architecture [SldI90].

As mentioned earlier, the details of these text-based dataflow languages are not important; what is significant is their place within the context of dataflow research as a whole. These applicative languages and others form the implementation language for dataflow programs, at a higher level than the machine instructions needed to manipulate the target architecture, but at a lower level than the overall program design. Having briefly outlined the work which has been done in the first two of these areas, we can finally turn to the last, the area of dataflow program design.

2.4.4 Dataflow design languages

The obvious vehicle to use for designing dataflow programs is the dataflow graph. The advantages of using graphs for program design are the same as those of visual programming: graphical representations of algorithms are simple to understand. In the case of dataflow, this is particularly evident in the visual representation of implicit parallelism, since potential parallelism can be seen wherever two operations do not have direct dependencies. A number of notations for dataflow graphs have been proposed; the original motivation for these was to provide visual design languages for programs to be implemented on dataflow architectures [Kosi73] [Denn74]. For this reason, a number of variations arose which reflected design decisions within the proposed architecture, such as specialised arcs for control tokens [Woo83]. However, the basic features such as digraphs, arcs representing the paths for data, nodes representing operations, and so on, remain common [Davi82]. Another feature which many notations include is the ability to include cycles in the graph; although allowing data tokens to loop back tends to introduce nondeterminacy, it allows operations to be given information about the past; this can only be done with loops, since operations themselves have no persistent state between firings. This ability to express "history-sensitive computations" by allowing graph cycles has proved useful.

Besides these basic concepts, the graph notations include specialised features corresponding to programming constructs in lower-level languages, allowing selection, iteration and so on. Examples of specialised features found in some of

these graphical notations are:
• replicators, which copy one input data token to two output arcs;
• unions (merges), which pass on a token from one of a choice of input arcs to one output arc;
• selectors, which pass on one of several possible input tokens according to the value of a control input;
• case nodes, which pass on their input token if it has a certain value;
• loop nodes, which pass on their input data token while a control input remains true.
These operations are equivalent in grain size to the fine-grained operations implemented on dataflow architectures. However, similar graphs continued to be used when the usefulness of applying dataflow techniques to non-dataflow architectures began to be recognised - in other words, in a less fine-grained environment [Shar85]. In such cases, fewer specialised features of the notation need be provided, since the programmer generally implements the operations themselves, rather than having to rely on what operations are available in the hardware. Operations provided by the programmer within a dataflow graph are generally called dataflow actors (with reference to the actor model, described in section 2.2.2). It was primarily the move towards distributed computing of the 1980s which motivated research into dataflow at a larger grain size. In the LGDF (Large Grain Data Flow) language [Babb82] [DiNu88], dataflow graphs can be designed in which the operations are whole modules of the eventual application program. Programs designed in LGDF are intended for parallel execution across a network of workstations, hence the need for a large grain size because of the relatively high cost of communications. While still essentially a design language, LGDF2 [DiNu89] moves even closer to doubling as a high-level implementation language; TDFL (Task-level Data Flow Language) [Suhl90] is another system with the same objective. In these systems, while the program modules are still coded in some textual HLL, the overall design (i.e. the dataflow side of the program) is expressed in the higher-level dataflow language. Because it is straightforward to detect parallelism in a dataflow program, it should be easy for software to take a machine-readable form of the dataflow graph, and combine it with the module code (the dataflow actors) to

form an executable parallel program. This is a powerful idea because it allows programmers to generate parallel programs without having to write parallel code, thus avoiding the pitfalls of parallel textual HLLs. Because of this, the programmer's own code remains portable - only the tool which generates the parallel code need be modified for different platforms. The task of generating the parallel code is shifted from the programmer to the system, with the aid of a visual CASE tool for parallel programming. One system which already provides such a tool is Paralex [Baba91]. Paralex programs use a very simple (acyclic) dataflow language for the program design, which can be edited using a graphical editor. The actors are coded in textual HLLs using traditional text editors. The system then compiles this into a set of program modules which can be distributed across a group of interconnected workstations. Although the primary goal of Paralex is fault-tolerant computing through replication of the actors, it provides an effective way of parallelising very computationally intensive programs. One important aspect Paralex has in common with design languages such as LGDF is that it is intended for a very large grain size - that appropriate to networked workstations. This is reflected in the design of their dataflow notations, which have no specialised operations; the languages provide only very high-level support for dataflow execution, and lower-level operations such as input/output, replicating data items, and merging streams of data items must be done by the programmer within a module. To summarise: applicative languages and others cater for fine-grained dataflow (on specialised architectures); there are systems based on textual HLLs for dataflow execution at a medium-grained level (on conventional multiprocessor architectures); and there are visually-oriented design tools for very large grain dataflow (on distributed systems). However, as yet no visually-oriented dataflow design system has been proposed which caters specifically for medium-grained multiprocessors of the kind described earlier in this chapter.

2.5 Summary

This chapter began by discussing multiprocessor technology, and in particular

shared and distributed-memory multiprocessors. These were described in the context of the other forms of parallel processing equipment, and in terms of the software grain size which they support. Then, the methods by which they can be programmed were discussed, in terms of the programming models upon which parallel programming languages are based. Some of the tools which can be used to aid parallel programming were also described, with particular emphasis on visual programming. As noted in section 2.3.2, visual programming is particularly well suited for parallel programming, because visual abstractions are easier to understand and work with than textual abstractions - a fact which is particularly important given the high degree of complexity of concurrent programs of non-trivial size. The trend towards integrated programming environments was noted. The dataflow paradigm was then described in detail. Dataflow is inherently a visual model, and it was mentioned that for this reason, a number of sequential visual programming languages have been based on dataflow concepts. Although dataflow graphs were originally intended for the design of fine-grained dataflow programs, they have also been used as a visual programming language at the large-grained level. This thesis seeks to fill the gap in current dataflow programming research between the fine-grained and large-grained levels of parallelism. The next chapter introduces a new form of dataflow graph designed with efficient execution at the medium-grained level in mind, and intended for use as a visual language; and indeed, as the basis of an integrated CASE environment, supporting the development of parallel software from the design stage, through implementation and debugging to performance tuning. For reasons of scope, this thesis examines the effectiveness of the new notation only in terms of efficiency of execution, and does not include an evaluation of its use as a visual design and implementation language. The possibilities for such an evaluation are discussed in Chapter 6. Moreover, this research is limited to the consideration of dataflow programs on conventional MIMD multiprocessors. This chapter has sought to show that there is a need for a visual design and programming language for medium-grained multiprocessors; the remainder of the thesis seeks to demonstrate that dataflow is a feasible candidate for such a language.

Chapter 3

The MeDaL Notation

This chapter introduces MeDaL (an acronym for Medium-grained Dataflow Language). The MeDaL notation has been developed as a visual way of specifying parallel programs, and in some respects resembles the various dataflow notations described in the previous chapter. However, MeDaL is unique in being designed specifically for the medium-grained (inter-task) parallelism commonly used for parallel programming on the medium-grained MIMD multiprocessors described in Chapter 2. This chapter describes MeDaL and the decisions taken in its design, in relation to many of the issues of parallel programming discussed in Chapter 2; Chapter 4 goes on to describe how MeDaL can be implemented to allow the efficient implementation of parallel programs. MeDaL, however, is not designed purely to be a visual specification of parallel programs. Rather, it is designed to fulfil some of the roles of a software development methodology, supporting the development of parallel software from the design phase, through implementation, debugging and testing, to performance analysis and tuning. This thesis concentrates on the implementation phase, though tools based on MeDaL for supporting the other phases are described in Chapter 6. However, these wider aims of MeDaL should be borne in mind and will be referred to in the discussion of some of the design decisions. This chapter is structured as follows: firstly, the main concepts of the MeDaL notation are described, to provide an understanding of the MeDaL approach to describing parallel programs. Then the semantics and syntax of the notation are described in detail, drawing on these concepts. Having presented MeDaL, a number of the key design decisions that were made during its evolution are discussed, and a number of the alternative ideas which were considered but rejected are described. It is envisaged that any full implementation of MeDaL would include a library of useful functions which could be incorporated into MeDaL programs, and a minimal suggested library is also described. An example MeDaL program is then presented, and its salient features discussed. The implementation and performance of this

program is described in Chapter 5. The final section of this chapter gives a reprise of the main features of MeDaL, and shows how MeDaL achieves its aims as a flexible parallel program specification language.

3.1 Concepts

This section describes the specific aspects of dataflow used in the MeDaL notation. MeDaL is a data-driven dataflow notation in the tradition of those described in section 2.4. Programs are specified in the form of a MeDaL graph, consisting of nodes (called dataflow actors) and lines connecting them (called datapaths). Data can only flow through datapaths in one, specified direction, but loops in the graph are allowed, so MeDaL graphs are cyclic digraphs. MeDaL follows the dynamic dataflow model in that more than one item of data is allowed to exist on a datapath at the same time. With specific exceptions, MeDaL actors obey the strict enabling rule; they fire only when data is available on all input paths (and since datapaths are directed, each path is an output path for one actor and an input path for one actor). Once data has initially entered the graph, actors become enabled when at least one item of data is available on each of their input paths, and fire repeatedly (normally consuming one item from each of these paths) for as long as this condition continues to be met. When this condition is no longer true, the actor becomes disabled once it finishes processing its last complete set of inputs, but may become enabled again later. Execution of the program terminates when no actor is in an enabled state (or when the program is halted artificially). Although this firing rule is the fundamental basis of the execution of MeDaL programs, there are a number of variations to it, provided for greater flexibility of programming, which form the basis of much of the MeDaL notation and semantics. MeDaL is designed for medium-grained parallel programming, and therefore only describes algorithms at the medium-grained level, in other words, at the sub-program level. MeDaL actors are tasks, such as subroutines within an application program; the tasks themselves are not described by the notation. Like the large-grained dataflow systems described in Chapter 2, the graphical representation of an

actor corresponds to a section of code written in some other language; in the case of MeDaL, it corresponds to a single function or procedure. It is envisaged that actor code will normally be written in some textual imperative language, such as a C or C++ function, a Pascal procedure or a FORTRAN subroutine. The interface to these code sections, in an abstract form, is included in the specification of MeDaL. The code section which forms an actor in MeDaL is known as a method (following object-oriented terminology), and an actor which contains a method (all actors contain zero or one methods) is known as a method actor. MeDaL consists of six kinds of actor, which can be divided into two types: general-purpose and special-purpose. The function of general-purpose actors is not specified by the notation, and hence all such actors contain a method. The special-purpose actors are, as the name suggests, provided for a specific purpose and do not necessarily contain method code supplied by the programmer; but in some cases, a method can be supplied if the programmer wishes, and these cases are described below. The components which make up the MeDaL notation consist purely of closed shapes and lines which connect them, but textual annotations are recommended to aid the programmer in relating the MeDaL graph level of the program to the actor-method level. In particular, all method actors should be labelled with the name of the method (as used in its declaration in the underlying code). All of the illustrations of MeDaL features and the examples in this thesis include annotations, but it should be borne in mind that they do not have any syntactic effect on the notation itself. MeDaL programs can be given a modular structure; to follow the acting metaphor, these modules (consisting of a collection of actors and datapaths) are called companies. A program comprises one or more companies. These companies form a tree structure, with the first company at the root of the tree, and any company can contain any other company (except the root company), including itself, thus allowing recursion. Companies serve as the visual equivalent of a macro. When executed, a MeDaL graph is created from the root company, and subsequently, whenever data flowing along a datapath encounters a company, the graph used for execution is rewritten to include the graph contained in that company. This hierarchical structure encourages the top-down structured design of

programs using the notation. With these points in mind, it is now possible to describe the notation itself.
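Before turning to the detailed syntax, the standard enabling and firing cycle described above can be summarised in a short sketch. The following C fragment is purely illustrative: the structures and names are assumptions, not part of the notation, and the special firing rules of the source and merge actors (described later) are ignored.

    /* Minimal sketch of the standard MeDaL enabling/firing cycle.
     * All names and structures here are illustrative assumptions. */
    typedef struct datapath { int length; /* items currently queued */ } datapath;

    typedef struct actor {
        int        n_inputs;
        datapath **inputs;
        void     (*fire)(struct actor *);  /* consumes one item from each input */
    } actor;

    /* An actor is enabled when every input path holds at least one item. */
    static int enabled(const actor *a) {
        for (int i = 0; i < a->n_inputs; i++)
            if (a->inputs[i]->length == 0)
                return 0;
        return 1;
    }

    /* Execution terminates when no actor is in an enabled state. */
    void run(actor **actors, int n) {
        int fired;
        do {
            fired = 0;
            for (int i = 0; i < n; i++)
                if (enabled(actors[i])) { actors[i]->fire(actors[i]); fired = 1; }
        } while (fired);
    }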

3.2 MeDaL Semantics and Syntax

As outlined above, MeDaL programs consist of several components:
• Companies, which contain:
• Datapaths
• Actors, which may contain:
• Methods
This section describes the semantics and visual syntax of datapaths and then actors. The actor methods interface with datapaths in various ways, and these interfaces are described in the appropriate sections. Finally, the syntax and semantics of companies are described.

3.2.1 Datapaths

Semantics
A datapath can be thought of as a unidirectional buffer between a producer and a consumer. Data items are queued in FIFO order and cannot overtake each other. The data items must all be of the same, specified data type, but this data type need not be an atomic type; structured data types such as records and arrays can be passed as single items, for instance.

There are two types of datapath, called E-type and F-type paths (a mnemonic for empty and full paths). E-type paths are used in most circumstances, and implement a classic FIFO queue in which, when a consumer reads the head of the queue, that data item is removed from the queue. F-type paths are similar, except that if there is nothing queued behind the head (i.e. removing the head would leave the queue empty), the item at the head is not removed, only copied; it remains in the queue for as long as it is the only item in that queue, and may be consumed any number of times. This implements a form of "persistent" memory, in the sense that the item is stored persistently across firings of the consumer actor. This is important because actors themselves are not intended to retain any state (such as persistent memory) between firings.
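To make the difference concrete, the following C sketch shows what reading the head of a queue at fire-time might do in each case. The queue structure and the function name are illustrative assumptions only; the notation itself specifies only the behaviour.

    #include <stddef.h>

    typedef enum { E_TYPE, F_TYPE } path_kind;

    typedef struct node { void *item; struct node *next; } node;
    typedef struct queue {
        path_kind kind;
        node *head, *tail;   /* FIFO: items appended at the tail, read at the head */
    } queue;

    /* Read the item at the head of the queue, as performed when the consumer fires.
     * E-type: the item is always removed.
     * F-type: if the head is the only item, it is copied but left in place,
     *         so it can be consumed again on later firings. */
    void *read_head(queue *q) {
        if (q->head == NULL) return NULL;     /* the consumer would not have fired */
        void *item = q->head->item;
        if (q->kind == E_TYPE || q->head->next != NULL) {
            node *old = q->head;
            q->head = old->next;
            if (q->head == NULL) q->tail = NULL;
            /* a real implementation would free(old) here */
        }
        return item;
    }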


Syntax
A datapath is represented as a line between the producer actor and the consumer actor. The direction of flow is indicated by an arrowhead at the consumer end. The arrowhead is coloured white in the case of an E-type path, black in the F-type case (Figure 3.2a). In addition, the ends of the datapath line can only touch the "southern" edge of its producer actor, and the "northern" edge of its consumer. Thus, the overall flow of data is north to south, but datapaths are not constrained to being straight north-south lines, and effectively data may flow in any compass direction so long as the datapath connects an output to an input.


Figure 3.2a: Datapath syntax

Method interface
Datapaths can only be manipulated by the method code in the producer actor. The actor can send a data item on a particular output path, adding it to the path's queue, which may cause the consumer actor to fire immediately. The producer can also flush a particular output path, destroying all data in the queue, ensuring that the next item to be sent on that path will immediately be the head of the queue. This is particularly useful in conjunction with F-type paths.

There are two further operations on datapaths. Send-sticky acts like send, placing a data item on an output path; however, being "sticky" means that it cannot be delivered to the consumer - even if it reaches the head of the queue - until re-sent using send. In conjunction with the method interface with the actor (see below), this provides a powerful form of persistent memory between firings (though unlike the persistence of the last item in an F-type path, the stickiness mechanism requires the intervention of a method actor). This mechanism is described fully in section 3.4. Finally, a predicate function is_sticky is provided which, for a given output path, returns the boolean true if the tail item of that path is "sticky", and false if not.

For the purposes of the notation, datapaths are assumed to be infinite in storage capacity; in other words, send operations always succeed, and never "block" the

producer. If the system running a MeDaL program runs out of memory, this is assumed to be fatal to the MeDaL program (as it would be to any other application).
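An implementation might expose these operations to method code as a small procedural interface. The following C declarations are a hedged sketch of such an interface: the names (medal_send and so on) and the signatures are assumptions made for illustration; MeDaL itself specifies the operations only abstractly.

    #include <stddef.h>

    /* Hypothetical producer-side interface to output datapaths. */
    typedef struct datapath datapath;

    /* Append an item to the path's queue; may enable the consumer immediately.
     * Assumed never to block: path capacity is treated as unbounded. */
    void medal_send(datapath *out, const void *item, size_t size);

    /* Destroy all items currently queued on the path, so that the next item
     * sent becomes the head of the queue (useful with F-type paths). */
    void medal_flush(datapath *out);

    /* Like send, but the item is "sticky": it is not delivered to the consumer,
     * even at the head of the queue, until it is re-sent with medal_send. */
    void medal_send_sticky(datapath *out, const void *item, size_t size);

    /* Returns non-zero if the item at the tail of the given output path is sticky. */
    int medal_is_sticky(const datapath *out);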

3.2.2 Actors

General Semantics
Actors have 0 or more inputs and 0 or more outputs, each input or output being connected to a datapath in accordance with datapath semantics. Like datapaths, actors' inputs and outputs are typed (they only carry items of one data type), and it follows that the type of each input or output must match the type of the datapath to which it is connected. When data arrives on a datapath for a particular input, this may cause the recipient actor to become enabled; the definition of enabled varies between the six different kinds of actor presented below. At some point in time after the actor becomes enabled, it begins execution or fires. Firing denotes the start of execution of some code associated with the actor: either a method supplied by the programmer, or in the case of the special-purpose actors, code provided by the system to carry out a predefined action in the absence of a method. All actors are assumed to terminate. In most cases, an actor can then be fired again (after termination) if data is again available on all input paths.

General Method Interface
An actor's only interaction with its input paths occurs at the point when it fires. At this time, it effectively performs a read-input operation on each of its input paths, taking the data item at the head of each datapath's queue (which may or may not remove it from the queue - see the section on datapath semantics above). Also at firing, the actor effectively performs a read-output operation on each of its output queues, which returns an empty data item of the appropriate type if send-sticky has not been used on that output, or, if send-sticky was used, returns the "sticky" data item sent. All inputs and outputs read in this way are then passed to the method (or internal) code, which begins executing. This procedure is automatic, and only the results are visible to the programmer.

Although these read operations occur only once, when the actor fires, actors are allowed to place data on output paths using send etc. at any time during execution,

in order to allow any actors which may depend on them for data to fire as early as possible.

This programming interface applies to all six types of actor provided by the MeDaL notation, and the particular semantics and syntax of each of these are now presented. The six types are: general-purpose actors, deep actors (a subset of general-purpose actors), source actors, sink actors, merge actors and replicators.
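As an illustration of the read-input/read-output behaviour, a method for a general-purpose actor with two inputs and one output might look like the following C sketch. The parameter-passing convention is an assumption (the notation only requires that the inputs and any sticky output value are made available to the method when it fires), and the sketch reuses the hypothetical interface declared above; NULL stands in here for the "empty" output item.

    /* Hypothetical method for a general-purpose actor with two inputs and one
     * output.  At firing, the runtime reads the head of each input queue and
     * any sticky value on the output, and passes them in as parameters. */
    void accumulate(const double *sample,      /* head of input path 1 (E-type) */
                    const double *scale,       /* head of input path 2 (F-type) */
                    double *partial_sum,       /* sticky output value, or NULL  */
                    datapath *out)             /* handle for send operations    */
    {
        double result = (partial_sum ? *partial_sum : 0.0) + (*sample) * (*scale);

        /* Keep the running total "stuck" on the output between firings;
         * a later firing could re-send it with medal_send to release it. */
        medal_send_sticky(out, &result, sizeof result);
    }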

General-purpose Actors: Semantics
The general-purpose actor has one or more inputs and one or more outputs. There is no default action, so general-purpose actors always contain a method supplied by the programmer. The standard general-purpose actor follows the general semantics above.

However, there is also a special type of general-purpose method actor called the deep actor (the reason for the name will be revealed in section 3.3.2). Multiple instantiations of the same deep actor can exist at the same time; for instance, if there are three or more items of data in each of its input paths, a deep actor can fire three times immediately, rather than waiting until the method has terminated before firing again. Each instantiation consumes the head of each input queue, and may then execute concurrently with the others; however, the act of firing itself (and the reading of the heads of input queues) is not concurrent. Moreover, making a method actor a deep actor does not guarantee concurrency, but only allows it if sufficient processors happen to be available. It should be emphasised that only one copy of each general-purpose actor can be executing at a given time unless it is explicitly a deep actor; in this respect, MeDaL encompasses both the static and dynamic models of dataflow.

General-purpose Actors: Syntax
A general-purpose actor is represented in the MeDaL notation as a round-ended box, as shown in Figure 3.2b. The input(s) arrive on the north edge and the output(s) leave from the south edge. The box contains a label indicating what task it performs. Deep actors are denoted by a shaded border round the outer edge.



Figure 3.2b: General-purpose actors

Source Actor: Semantics
The source actor is the only type of actor with no inputs. A source actor fires only once, when the company which contains it is expanded (at the start of the program's execution in the case of the root company) - see section 3.1. If no method is supplied by the programmer, the default action is to place the boolean value true on its output path and then terminate. It can have only one output path. Since all other actors cannot fire until data arrives on their input path(s), it follows that all useful MeDaL programs must contain at least one source actor.

Source Actor: Syntax
The source actor is drawn as a polygon made from a square above a right-angled triangle with the right angle pointing south. The output leaves from the point of the triangle. The polygon may contain a label, which is especially appropriate if the actor has been given a method.

Figure 3.2c: Source actor

Sink Actor: Semantics
The sink actor is the only type of actor with no outputs. It has one input. The default action after firing is to do nothing, so that the data arriving on the input path (and being consumed when the actor fires) is lost. However, a method can be provided, for instance to release memory or other resources used by the data item.


Sink Actor: Syntax
The sink actor is represented by a polygon made of a square below a right-angled triangle pointing north. The input arrives at the north point of this triangle. The box may contain a label.

Figure 3.2d: Sink actor

Merge Actor: Semantics
The merge actor is the only type of actor which is not subject to the normal firing rule for actors. It has two or more input paths, and one output path, all of which must be of the same data type. The actor fires when a data item arrives on any one input path, consuming that data item only. If two items arrive simultaneously on two paths, the actor will fire twice, consuming the items one at a time; but the order in which they are taken is not defined. Thus the merge actor simply copies its input(s) to the output. It cannot contain a programmer-supplied method, for reasons explained in section 3.3.2.

Merge Actor: Syntax
The merge actor is a closed semicircle, with the straight edge on the north side. The inputs arrive on this north edge and the output leaves from the southmost point of the semicircle. (In Figure 3.2e a merge actor with two input paths and its one output path is illustrated.)

Figure 3.2e: Merge actor


Replicator: Semantics

The replicator is essentially a specialised form of actor with a fixed copying function. It has one input path and two or more output paths, all of which must be of the same data type. The replicator's function is to copy the input to all of the outputs. It cannot contain a method (again, the reason for this is discussed in section 3.3.2).

Replicator: Syntax
The replicator actor is a right-angled triangle pointing north, as illustrated in Figure 3.2f. The input path arrives at the north point, and the outputs leave from the south edge. (In Figure 3.2f it is illustrated with its input path and two output paths.)

Figure 3.2f: Replicator

3.2.3 Companies
The final visual aspect of the MeDaL notation is the company. As outlined in section 3.1, a company is essentially a representation of a hierarchical module of a MeDaL program. Each company can be viewed in two ways: as a component of another company, or as a MeDaL graph in its own right. Naturally, these two different logical views of a company mean that two graphical representations are needed.

Viewed as a component, a company simply replaces part of a graph. A company can appear anywhere in a MeDaL graph, in place of a number of actors and datapaths; it is a visual macro, analogous to textual macros in textual programming languages. Graphically, it is represented as a rectangular box (which may contain a label indicating the name of the company). To be useful, at least one datapath must enter any company. Input paths enter via the north edge, but since the company is not itself a destination to which data can be delivered, the input paths do not have arrowheads here. The company may also have outputs, which leave from the south edge. Figure 3.2g illustrates a company with two inputs and two outputs; in a real MeDaL program, it is likely that the inputs and outputs would be

annotated to permit association with the exploded (graph) view.

Figure 3.2g: Company (component view)

Viewed as a graph, the company is represented as a larger rectangular box, containing other components of the MeDaL notation; indeed, all MeDaL components appear within a company. It should also contain a textual label indicating its name (the name of the program itself, in the case of the root company). Companies which are components of another company will have at least one input path, which enters from the north side of the box, and if there are output paths, they leave from the south side of the box. An example of this notational device can be seen in the matrix multiplication example in section 3.4.2. The number of inputs and outputs to a company, and their left-to-right ordering, must remain the same for both of its two views: for example, the leftmost datapath seen entering a company when viewed as a component is the leftmost path to emerge from the north side of the box when the company is viewed as a graph.

It should be noted that at run-time, the graph contained in a company is added to the graph of the main program the first time that a data item is sent to a datapath which crosses into that company (it is also at this point that any source actors contained in the company are fired). The fact that companies are used for dynamic expansion of the graph at run-time means that recursion is possible. Thus, companies play both a structural role, in terms of organising groups of MeDaL components into a hierarchical structure, and a dynamic role, in providing groups of components to be executed at run-time.

3.3 Design Decisions

This section examines some of the main design decisions which were made during the process of designing the MeDaL notation. After briefly justifying the visual syntax, a number of details of the semantics are examined: firstly the semantics of

the various types of actors, then those of datapaths. The least clear-cut design decision was the form which persistent memory should take, and section 3.3.4 describes a number of the alternative mechanisms which were considered for this, along with possible visual representations which could have formed part of the notation. It is likely that any future research based on MeDaL would wish to re-evaluate the design criteria explained in this section, so as full an account as possible is given.

3.3.1 General Visual Syntax
The shapes of the MeDaL actors were chosen primarily to be as visually distinctive as possible. It is the shapes rather than the sizes which distinguish actors, but it is assumed that the actors will be drawn with the relative proportions illustrated in section 3.2. The merge and replicator actors are the shapes proposed by Sharp [Shar82]. The round-ended boxes of the general-purpose actors correspond to what are usually shown as circles in more traditional notations such as Sharp's, but it was found that an elongated shape was convenient for longer, more descriptive textual labels - necessary in MeDaL to relate the visual level to the hidden method level. The source and sink actors are novel to the MeDaL notation, and their shapes were chosen simply to be distinctive, emphasising the special semantics of these actors.

The use of arrowheads on the datapaths is not, strictly speaking, necessary in MeDaL, since the notation specifies that where a datapath connects to a south edge or point it is an output, and where it connects to a north edge or point it is an input. However, it was found that the use of arrowheads reinforced this directional convention, making the graphs easier to read. Additionally, some means of visually distinguishing E-type and F-type datapaths was needed. One possibility would have been to use dashed or dotted lines, but incorporating the distinction in the colouring of the arrowheads was preferred because this difference between datapaths, after all, only has an effect on the consumer, which is where the arrowheads are located.


3.3.2 Actor Semantics
This section briefly justifies the provision of the six different types of MeDaL actor. Some notations have chosen to have only one type of actor, while others have provided a wide variety; this decision is essentially a trade-off between providing useful functions (which, being part of the system, can be implemented for optimum efficiency) and providing generality and a simple programming model. MeDaL attempts to compromise between these goals.

For instance, the replicator and the merge actor are provided as useful functions. They have proved useful in designing applications using MeDaL; it is regularly necessary for one data stream to be copied to several different actors, and for several streams to be merged to lead to one destination (for instance when several sections of a program have been generating the same type of results). Neither of these actors is strictly necessary - their purpose could be achieved by multiple outputs and inputs carrying the same data. However, this would place an extra burden on the method programmer. Moreover, these actors can be implemented efficiently: because both can fire immediately, the overhead of checking for enablement can be avoided, and because their function is known to be small, the overhead of starting a new process or thread of control to execute them can be avoided. The replicator is not allowed to contain a method in order to preserve the latter efficiency gain. The merge actor is not allowed to contain a method since it has a different firing rule to the other actors, and a more complex interface to a method would therefore be needed, incorporating some mechanism for indicating which input contained a data item. Although some benefit could be seen in allowing merge actors to contain a method - the method could, for instance, have provided sorting functions - simplicity of the programming interface was seen as the overriding factor.

The source actor is included in the notation because of its special firing rule; it was a design criterion that MeDaL graphs should express an entire program, from start to finish, including its initial input; therefore, there was a need for an actor which could fire without previous input. Additionally, method source actors can provide input data (including structured data such as in the matrix-multiplication example in section 3.4) to other actors.


The sink actor might be seen as being of limited use, since it produces no output, but it is included for two reasons. The first reason is completeness, since it complements the source actor; the other is modularity and code re-use. It is a principle of code re-use that programs should be structured as general-purpose modules; the MeDaL equivalent is to design actors with general functions. Such actors can then simply be plugged into a graph where their function is needed. However, if a particular output of that actor is not needed in some circumstances, it cannot simply be left unconnected, since all datapaths must connect an output to an input in a legal MeDaL program. The alternatives are either to re-write the actor code for those specific circumstances, or to connect that output to a sink actor which will simply ignore it. The latter approach has the extra advantage that if the program is modified later and that output is needed, the sink actor can be removed and other actors inserted, without resorting to changes at the method code level.

The provision of specialised functions such as the merge actor may seem out of place in a medium-grained programming language; however, it is precisely because these specialised functions are primitives of the language that they can be recognised at compile time and hence implemented as efficiently as possible. There is a trade-off to be made between efficiency and generality, and in providing a few specialised actors the MeDaL notation attempts to balance these factors.

The provision of the standard general-purpose method actor is obvious at the medium-grained level, being a vehicle for the programmer's code. The inclusion of the deep actor involves more complex issues. Since copies of its method code can execute concurrently on different sets of input data, it provides an extra type of concurrency available to a programmer using MeDaL. Thus, there are three types of parallelism available in MeDaL. These can be characterised as:
• vertical parallelism - when two actors have a direct dependency (i.e. a datapath) between them, they are normally drawn one above the other;
• horizontal parallelism - when two actors do not have any direct dependency, they can be drawn next to each other horizontally; and
• depth parallelism - a third dimension is needed for copies of the same actor, occupying the same place on the graph, but operating on different


data. Hence the name "deep actor". Vertical, horizontal and depth parallelism correspond exactly to the structure, result and activity parallelism described in Chapter 2. This makes the maximum possible flexibility of programming techniques available when using MeDaL to design parallel programs.

3.3.3 Datapath Semantics
As described above, two types of datapath are provided: the normal E-type and the specialised F-type. The main advantage of the F-type path is that when one data item needs to be used repeatedly, it need not be repeatedly transmitted by the producer. Obviously, transmission of a data item on a datapath incurs some overhead, both in terms of processing time (consumed by the actor doing the transmitting) and of memory space (occupied by the datapath queue). This overhead would very likely be significant if the two actors involved are executing on different processing nodes of a message-passing architecture, especially if the data item is a large, structured object such as a matrix. Since MeDaL was designed with efficient implementation on both shared-memory and distributed-memory architectures in mind, this was an important consideration.

A corollary of this is that the F-type path allows different input streams to one actor to contain different numbers of data items. Using only E-types, to fire N times an actor must have N items of data on every input path. This is reasonable at the fine-grained level, where the operations carried out by actors are mathematical and have a set number of parameters. However, at a higher level, there is often no one-to-one relationship between streams of different types of data. When there is a one-to-N relationship, the one item would have to be transmitted N times, incurring the overhead described above. A further use of the F-type path, in combination with the flush operation, is that it makes possible a "most recent value" service, for occasions on which an actor is only interested in the most recent value of a particular stream of data. Thus, the F-type datapath is useful both in terms of functionality and of potential efficiency. There were a number of alternative strategies which could have been used to provide similar functionality, and these will be discussed in section 3.3.5.

However, even accepting that the functionality of the F-type path is useful, this

does not justify the concept on which both types of datapath are based, namely the provision of a FIFO queue. There are, of course, a number of alternatives to simple FIFO queues, and different types of dataflow system (see Chapter 2) have used a number of different types of path. In the feedback interpretation of dataflow, for example, a simple, finite FIFO queue is used. However, for MeDaL it was decided that physical limitations on queues are an implementation issue which need not be incorporated in the abstract notation, hence the assumption of infinite-sized queues.

In loop-unravelling dataflow, paths no longer implement simple queues; data tokens (items) may overtake each other according to information encoded in them at an earlier stage (using tags or colouring). The need for this arises because, in loop-unravelling dataflow, several instantiations of one actor can exist concurrently (in the same way as for MeDaL's deep actors). When this is the case, one instantiation may be started after another, but may (depending on the data values being processed) finish first; thus the data being processed become disordered. Because the actors are simple fine-grained operations, the actor receiving the results from the multiply-instantiated actor cannot in general be sophisticated enough to return the data tokens to their correct order, so this is done by the system, by re-ordering data according to its colour or tag.

However, because MeDaL is aimed at medium-grained processing rather than fine-grained, it is reasonable to assume that where a deep actor is used, if the order of the data is important, the receiving actor can include code to re-order the data items correctly. This gives the programmer flexibility to decide what mechanism to use for re-ordering; for instance, a tag field could be used if the data items are records. In the case of the matrix multiplication example in section 3.4, the first element of a fixed-size array can be used to indicate which row (of the matrix to which it belongs) a particular item represents. Since MeDaL also provides a persistent memory mechanism (see below), in which a re-ordered data structure can be built up over a number of firings, this is a reasonable burden to place on the programmer, and it has the benefit of making MeDaL itself simpler and more elegant.


3.3.4 Persistent Memory
It has long been a principle of dataflow languages that dataflow operations themselves should not be able to retain any state, partly to keep intact the prized "freedom from side-effects" principle. On the other hand, the usefulness of some sense of persistent state has been recognised by some, leading for instance to decisions to allow cycles in dataflow graphs, as described in Chapter 2. Absence of persistent state is also desirable from the point of view of the implementor of a dataflow language. This is because, in the general case, if an actor fires, reserves some memory for its persistent state, and then does not have enough input data to fire again, there is no way for the system to tell whether that actor will ever fire again; so the memory it reserved can never be reclaimed by a "garbage collector".

However, when considering medium-grained dataflow, there is one criterion which was seen as essential - that of keeping data structures sufficiently large-grained to ensure efficient execution. In a medium-grained environment, it is important to be able not only to break data structures down into their components for processing, but also to build small components up into larger ones. Without this ability, data structures can only get smaller, causing actors receiving many small data items to fire once for the arrival of each one, potentially causing unacceptable overheads. This problem is unique to medium-grained dataflow, since in fine-grained dataflow there is less need to build up complex data structures, and at the very coarse-grained level other persistent storage (such as filestore) can be used, since its overheads are small compared to those of communication between nodes. The question therefore arises of what mechanism can be provided for persistent storage in which complex data types can be built up out of simpler ones - and should it be purely at the method level, or should it be represented at the visual (graph) level as well? The following sections examine three possible answers to this question.

i. Memory Entities
One possible solution to the problem of persistent storage would be to provide a new, third type of entity within the MeDaL notation (in addition to datapaths and

actors) - a memory entity. An actor could, for instance, communicate with a memory entity to store data during one execution, and again to retrieve it following a later firing. Since such communication would be two-way, either two standard datapaths would be needed, or a new type of communication path could be employed. Figure 3.3a illustrates a possible visual representation of the latter.


Figure 3.3a: Memory entity

The advantage of a notation like this is that it provides an explicit visual representation for the persistent memory, making it easy to see which actors in a MeDaL program use persistent memory. It is desirable that an actor which needs persistent memory should be able to store data items of more than one data type; this could be accommodated by allowing connections to several typed (single data type) memory entities, several typed connections to one general (non-type-specific) memory entity, or one typeless connection to a typeless memory entity. In the latter case, the send and fetch operations needed by the actor could even be pattern-based operations similar to Linda's tuple-space operations (see Chapter 2). Since the contents of the memory entity would be managed by the MeDaL run-time system, garbage collection would be possible.

There are, however, a number of disadvantages. The extra visual representation would make the MeDaL notation more complex, and hence less easy to understand. In particular, the introduction of a new type of connection which does not behave like normal datapaths is confusing and counter-intuitive. The method programming interface, too, would have to be complex, specifying what type of data was required from the memory entity (or which memory entity to query). Each fetch operation from the memory entity would require two messages to be passed (one in each direction), and even if this did not constitute a major communications overhead, it would require some programming complexity. However, the major objection is perhaps that this mechanism does not fit in well with the dataflow paradigm of data flowing from one actor to the next. A "cleaner" solution would incorporate

persistent memory into either the actors or the datapaths, and these are the remaining two options to be examined.

ii. Persistent Memory in Actors
It was noted in the introduction to this section that allowing actors to own persistent memory in which they can store some state between firings is a disadvantage from the point of view of automatic garbage collection. However, this approach does have a number of merits which weigh against that drawback. First among them is simplicity: each actor with persistent memory could simply own a typeless pointer to a block of memory (which could, on request, be extended). The actor could then manage the storage of variables within this memory, so the mechanisms used could be as simple (or complex) as the application required. The presence of persistent memory within the actor could easily be denoted visually, for instance by shading one end as in Figure 3.3b; a visual distinction would be useful to the MeDaL programmer, if only to see at a glance which actors owned persistent memory.

Figure 3.3b: Persistent memory within an actor

This mechanism provides persistent memory in a very simple and natural way, given that the requirement is for an actor to be able to own some persistent memory in which, for instance, simple or small data types can be built up into more complex ones. The programming interface need not be particularly complicated. The only disadvantage is that the onus of storage management (for instance, indexing which items of data are where in persistent memory, as well as releasing redundant memory) falls on the method programmer.

iii. Persistent Memory in Datapaths
The final option, and the one which was chosen for MeDaL, avoids the latter disadvantage. As described in section 3.2, the approach chosen consists of extending the semantics of the "send" operation on output datapaths to include a variant which uses the datapath being sent to as persistent memory (i.e. send-sticky). This is extremely simple and has a number of advantages, as follows.


Firstly, datapaths are already ordered and typed, so no indexing method is needed to keep track of what data items of what data type are where in the persistent memory area, as would be needed in the other schemes above. Secondly, memory management of the datapaths is already done by the system, so persistent memory can be managed by the same mechanism, keeping both the implementation and the semantics of the persistent memory mechanism simple. Thirdly, no explicit retrieval operation is needed, since it can happen automatically at fire-time. The only slight disadvantage is that the use of this mechanism is not obvious at the MeDaL graph level, since it is implemented at the method interface level, in the form of operations on datapaths by actors - hence it does not have a visual representation. However, its use could easily be indicated by a run-time or postmortem tool, by changing the colour or line style of a datapath to indicate that it contains persistent data.

3.3.5 Multiple Streams
The problem arising when an actor needs to have different numbers of data items arriving on different input streams was described in section 3.3.3 on datapath semantics. This problem is closely related to the question of how to manage persistent memory, because if a mechanism is to be provided by which one data item is retained from one stream while many arrive on another, then clearly one approach is to store that one data item in persistent memory. The F-type paths described in section 3.3.3 are in effect a form of persistent memory, retaining the last item in a FIFO queue to stop it becoming empty. This is a method of altering the synchronisation of data arriving at actors' inputs without altering the firing rule. Its advantage is simplicity (indeed, it is invisible at the method code level); its disadvantage is that it forms a second type of persistent memory in addition to that described above. In designing MeDaL, a number of options were considered which would have solved the multiple-streams problem and the persistent memory problem in a more integrated way. This section describes seven of these options, and why only the last of them was chosen.

i. The Do-nothing Approach
Provision for multiple streams and persistent memory is not strictly necessary if

actors' outputs are fed back into their inputs. Figure 3.3c illustrates this being done. A merge actor is used so that the initial input and subsequent outputs arrive at the same method actor input.

Figure 3.3c: Output looped back to input

The disadvantages of such an approach are many. For instance, the graph surrounding this one actor would become extremely complex if it had, say, five outputs which might either be looped back or (when finished with) delivered somewhere else. Also, this approach allows an N:1 relationship between input streams, but not N:M. The loop-back circuit cannot easily be flushed, to allow replacement of the value being used on that input to the actor. The overhead of transmitting data on an output path and storing it until the next firing would be incurred, even if the next firing was immediate. And if complex data types were being built up from simpler ones during the cycling process, the data typing of datapaths would have to be relaxed, and the method code programmer would have the extra burden of working out what stage of the process had been reached for a particular input.

ii. A New Method Merge
This idea combines the MeDaL merge actor with some form of persistent memory. Relaxing the rule that merge actors cannot contain methods, this actor would contain a method which, according to the merge actor's firing rule, would execute whenever data arrives on any one of its inputs. It would then store this data in a persistent-memory queue, one for each output; and if there was data in every queue, would dispatch the head of each queue on its outputs. It would have semantics similar to those of the F-type path, retaining the head of the queue if the tail was empty. In this way, it would equalise the numbers of items of data on each datapath stream. Figure 3.3d illustrates such an actor with in-actor persistent memory, but it

could equally well make use of persistent memory in datapaths.

Figure 3.3d: Equalising method merge

The advantage of this idea is that only one type of datapath would be required (the simple E-type). However, this solution would not be efficient. The actor-firing overhead would be incurred each time data arrived on any path. The actor would also be a potential bottleneck, since actors cannot fire again until they have finished executing; in this case, a bottleneck would occur if data items arrived on the input paths faster than the actor could process them. Finally, because data items would be duplicated on the output paths (for instance, in the case of two streams of N and 1 items of data, the one item of data on the second stream would be transmitted N times), far more datapath storage would be used, needlessly duplicating values.

iii. Datapaths Into Persistent Memory
Another strategy would be to adapt the idea of persistent memory as being either within an actor or a separate entity. If an actor's private persistent memory were considered separate from it (either attached to it or as a separate memory entity), a system could be employed whereby datapaths could deliver data items directly to persistent memory rather than to the actor. The actor would still be subject to the strict enabling rule, but this would only apply to the actor's inputs, not the persistent-memory inputs. Inputs to persistent memory would be handled automatically by the system when they arrived, by some kind of server mechanism. Possible visual representations of this are shown in Figure 3.3e.



Figure 3.3e: Datapaths to persistent memory

The attraction of this method is that, again, F-type paths are not needed, since the same mechanism is used both for general persistent memory and for storing data from streams with potentially different numbers of items. However, for user-defined data types to be transmitted through datapaths, the server code and the method interface with persistent memory would need to be sophisticated. Moreover, the firing rule might need to be altered to ensure that the actor could not fire until at least one data item had arrived in each area of persistent memory, since that item might be essential to execution. Finally, the actor would not be able to fire when data arrived only at a persistent-memory server and not at the actor's main inputs. This mechanism, therefore, is not very flexible.

iv. Using Shared Memory
The last problem, that of not firing even when some data had arrived, would be solved if actors could share persistent memory. All datapaths would go to actors, and the actors would be subject to the normal firing rule. If actors could share persistent memory stores, they would not need to be synchronised; yet every time any item of data arrived, an actor would fire and deal with it as necessary. Possible representations for this, using a separate memory entity and using joint access to internal persistent memory, are given in Figure 3.3f. In any representation, the graph would become very complex if more than two actors were to share persistent memory.



Figure 3.3f: Shared memory

Although this approach solves the problem of different numbers of data items arriving on different streams, it does not solve that of an actor firing only to discover that a vital data item from a different stream has not yet arrived. Indeed, this is the kind of synchronisation problem which dataflow's strict enabling rule is designed to solve, so it would clearly be counter-productive to re-introduce such problems in this way. Moreover, if the function of some actors is purely to accept data and store it in persistent memory, this method is perhaps rather heavy-handed, incurring the actor-firing overhead each time data arrives on any stream. Meanwhile, the prize of actors' freedom from side-effects is lost.

v. Use of Demand Dataflow
An altogether different strategy would be to incorporate demand-driven dataflow into MeDaL, which is otherwise purely data-driven. An actor would fire when all "data-driven" inputs had arrived; it would then "demand" data from its other inputs, which would be provided by actors above it, which would keep the items in persistent memory until needed. Figure 3.3g illustrates a graphical representation for this, using two different styles of arrowhead for data-driven and demand-driven paths. The demand-driven path has arrowheads at both ends to emphasise that messages are passed in both directions (the demand upwards, and the data downwards).


Figure 3.3g: Demand-driven dataflow

Of course, the fact that two messages would need to be sent for each demand input would decrease efficiency on a distributed architecture; streams with differing numbers of data items would be catered for, but only at the expense of transmitting some items across datapaths more often than necessary, as with the method merge solution above. In addition to this, both the method programming interface and the visual language would be made more complicated by this extension. The desire for both efficiency and as simple a programming model as possible ruled this idea out.

vi. Synchronous and Asynchronous Inputs
The final approach to be considered involves persistent memory and a more complex firing rule. Most of the previous solutions attempted to keep the model simple by maintaining a simple firing rule, but all involved other complexities. However, the firing rule should not be considered sacrosanct. Rather than insisting that all inputs contain data before an actor can fire, it would be possible to have a system in which not all inputs need do so. Of course, some inputs might be vital prerequisites to an actor's execution, but others might be optional. The necessary ones could be called synchronous inputs, and the optional ones asynchronous, since only the necessary ones need be synchronised (by all being present) for the actor to fire. The two types would be distinguished visually; Figure 3.3h illustrates a distinction either by arrowhead style, or by the inputs arriving from different directions: synchronous inputs arriving at the north edge of the actor's box, asynchronous inputs arriving at the west edge.


Figure 3.3h: Synchronous and Asynchronous inputs

The actors in Figure 3.3h are shown as containing persistent memory. This would be necessary in order to cope with different input streams containing different numbers of items, since the smaller number of items could be stored in persistent memory. Thus, this scheme would be relatively efficient, requiring no extra transmission on datapaths of data items needed more than once. Persistent memory in datapaths rather than in actors could equally well be used. The only disadvantage of this scheme is that the method programming interface would need to be somewhat more complex, since different numbers of inputs may be present at each firing. The interface with the actor method would need to include information about which inputs were present and which were not, and this information would have to be decoded by the method.

vii. E-type and F-type Paths
The E-type and F-type datapaths described in section 3.2 are very similar to the synchronous and asynchronous paths described above. However, in the case of F-type paths, the normal firing rule is circumvented by the datapath delivery mechanism itself, rather than being explicitly relaxed. This has the advantage that it can be specified purely at the visual MeDaL level, rather than requiring the method interface to be made more complex. In conjunction with persistent memory being located in datapaths rather than in actors, it is as efficient as synchronous and asynchronous inputs, while being simpler for the programmer to use. Hence, out of all the alternatives described above, the combination of E-type and F-type datapaths with datapath-based persistent memory was chosen as the best solution to the problem of how to provide efficient persistent memory.


3.4 Examples

Having presented the MeDaL notation, this section provides some examples of its use. The main example used here is a parallel matrix multiplication program. The features of the notation which it illustrates are described.

First, however, the MeDaL library must be described. It was stated in section 3.1 that a library of actors would be provided for the programmer to make use of, and a minimal library is described in the following section. These library actors are an integral part of the MeDaL language; however, they use actors of types described above and do not introduce new features of syntax or semantics.

3.4.1 The MeDaL Library
This library contains five method actors, based on the primitive types described above, which in an implementation would contain instructions to the MeDaL run-time system. The reason for their existence is to provide a clean interface between the MeDaL program and the "outside world", i.e. the host computer system.

Standard Input and Output Actors
The standard input actor (labelled stdin) is essentially a source actor which fires whenever a line of character input appears on the program's standard input stream, as provided by the host operating system. The data type of the output path is text-string. The standard output actor (stdout) is the corresponding sink actor, sending its input to the program's standard output stream. Again, the data type of its input must be text-string. (In Figure 3.4a they are illustrated with their input or output path present.)

The terminology for these actors is inspired by the Unix operating system, which provides these streams as a standard facility which programs can use without needing to know whether, for instance, the input comes from a keyboard, file or other device. However, MeDaL does not assume the Unix model, and the stdin and stdout actors could be implemented to access devices directly on other systems; they are intended to provide portable input/output mechanisms.



Figure 3.4a: Standard input and output actors

File Input and Output Actors
These are based on general-purpose actors. The file input actor filein has one input, which specifies the filename, and two outputs: one for any error messages, and one on which the contents of the file are placed, line by line. All three datapaths are of type text-string. The fact that the filename is an input to the actor means that the filename is under program control at run-time. The file output actor fileout has two inputs and one output: one input for the filename, and the other for lines of text to be sent to the file; the output is for any error messages. Again, the inputs and output are all of type text-string. In Figure 3.4b the fileout actor is shown with an F-type path as the filename input, with the effect that the filename to be used need only be transmitted once, and will remain the same each time a new line arrives on the data input.


Figure 3.4b: File input and output actors
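Purely as an illustration, a fileout-style method could be written along the following lines in C. The interface is the hypothetical one used in the earlier sketches (medal_send, datapath), and the error handling shown is an assumption; it is not the library's actual implementation.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical method for a fileout-style actor: appends one line of text
     * to the named file, sending any error message on the text-string output. */
    void fileout(const char *filename,   /* head of the F-type filename input */
                 const char *line,       /* head of the E-type data input      */
                 datapath *errors)       /* output path for error messages     */
    {
        FILE *f = fopen(filename, "a");
        if (f == NULL) {
            const char *msg = "fileout: cannot open file";
            medal_send(errors, msg, strlen(msg) + 1);
            return;
        }
        fprintf(f, "%s\n", line);
        fclose(f);
    }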

Halt Actor
The halt actor is a sink actor (labelled halt) which contains a method. When executed, its effect is to cause the entire program to halt as soon as all currently executing actors have finished their processing and terminated (it was stated in section 3.2.2 that all actors are assumed to terminate). This allows a more elegant termination than waiting until all actors run out of inputs. It is intended that it could

also be used for debugging, not only to force early termination of the program as a whole (i.e. while there are still enabled actors) but also as a breakpoint after which the program could be restarted.


Figure 3.4c: Halt actor

3.4.2 Example MeDaL Program
As a simple example of how a program can be expressed in the MeDaL notation, consider a program for multiplying two matrices together. It is well established that there is potential parallelism in this computation, since each element of the resulting matrix depends only on the values of one row of the first matrix and one column of the second. Assuming the first matrix A is of size i,j and the second B of size j,k, the result will of course be of size i,k. Thus, for maximum parallelism, one would create i x k processes, each of which would compute one element of the result (by multiplying one row of A by one column of B; a vector-by-vector calculation). However, this is a fine-grained computation (each process computes one value). On a MIMD multiprocessor, where the process-creation overhead is significant, such an approach is likely to be inefficient. A better solution at the medium-grained level is to create k processes, each of which computes a whole column of the result matrix, using the whole of A and one column of B. In other words, this is a matrix-by-vector calculation. Figure 3.4d below is an example MeDaL program, matrix-mult, which takes this approach. The actor code associated with this MeDaL graph can be found in Appendix B.


[Figure content: the root company matrix-mult contains the source actors matrix 1, matrix 2 and output matrix filename, the 2mult company, and the fileout and stdout actors; the 2mult company, shown expanded, contains the actors split into vectors, multiply matrix by vector and merge vectors into matrix, with datapaths labelled matrix 1, matrix 2, result vectors and result matrix.]

Figure 3.4d: MeDaL example program "matrix-mult"

The program consists of two companies, the root company (labelled in the northwest corner with the name of the program) and a company called 2mult which serves to group together the actors concerned with the main processing work of the

program. 2mult can be seen in its "component" form within the root company, and in its expanded form below. It can be seen that the program could be rewritten to eliminate 2mult, by placing its contents where the 2mult component appears in the root company. Indeed, this is effectively what happens at run-time; the use of a company in this case is only for notational convenience.

The root company comprises three source actors, one library actor and one sink actor, as well as the 2mult company. The source actors Matrix 1 and Matrix 2 each contain a simple method which transmits a matrix on its output path. The other source actor transmits a single character string on its output path, containing the file name to be used for storing the matrix resulting from the computation. The stdout actor exists purely to print out any error messages which may be output from the fileout actor. This illustrates the building-block use of the sink actor: although in this case the error output of fileout is not dealt with by the program, fileout need not be rewritten so as not to produce one; it can be used as it is. Additionally, the program could later be expanded to handle errors, simply by replacing the sink actor with a general-purpose actor. The fileout and stdout actors are used as examples of library actors; in a real program, it is likely that filein actors would also be used to load in the input matrices. They are omitted here only for simplicity.

The expanded form of the 2mult company illustrates how companies other than the root company receive inputs from "outside" (the paths which appear from the north side of the box) and can transmit a result back "outside" (the path which ends on the south side of the box). In this case, the two inputs are the matrices to be multiplied. The second of these goes to the actor labelled split into vectors, which fires only once, but transmits each row of that matrix on its output as a separate vector. The multiply matrix by vector actor fires once each time one of these vectors arrives; it has two input paths, one for a matrix and one for a vector. Because the matrix input is an F-type path, the matrix is not consumed and remains present as an input each time this actor fires. Moreover, since multiply matrix by vector is a deep actor, each matrix-by-vector calculation can be computed in parallel (assuming sufficient processors are available). Each result (a column of the result matrix) is then output.
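The actual actor code is given in Appendix B. Purely as an illustration of what such a method might look like, the following C sketch computes one result column from the whole matrix and one vector, carrying the column index as the first element of the vector (the convention described in the following paragraphs). The fixed dimension N, the parameter-passing convention and the medal_send call are assumptions carried over from the earlier sketches, not the thesis's own code.

    #define N 64   /* illustrative fixed matrix dimension */

    /* Hypothetical method for the "multiply matrix by vector" deep actor.
     * The incoming vector carries its column index in element 0, followed by
     * the N elements to multiply; the result is sent in the same form so that
     * the merging actor can place it in the right column of the result. */
    void multiply_matrix_by_vector(const double matrix[N][N], /* F-type input */
                                   const double vec[N + 1],   /* E-type input */
                                   datapath *out)
    {
        double result[N + 1];
        result[0] = vec[0];                    /* pass the column index through */
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                sum += matrix[i][j] * vec[j + 1];
            result[i + 1] = sum;
        }
        medal_send(out, result, sizeof result);
    }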


The merge vectors into matrix actor fires once each time one of these result columns arrives. Its only purpose is to copy the vector to the appropriate column of the result matrix, which is built up (in textual form) on its output path using the send-sticky technique. When every column is present, it sends the whole resulting matrix. This goes to the fileout actor, which stores it in a file, and the program terminates.

Of course, the multiplying actor and the merging actor need to know which column of the result their vector input represents. For each vector, an index value is needed to indicate which column the vector is. This index could be passed from split into vectors to multiply matrix by vector on an extra datapath, then on from multiply matrix by vector to merge vectors into matrix on another separate datapath. This would have the advantage of making the passing of this piece of information explicit at the MeDaL graph level. However, it also carries the penalty of transmitting and storing a separate item of data. Since this index is a numerical value, and the vector also contains numerics, it is more efficient to transmit it as part of the vector, for instance as the first element of each vector at each stage. In the interests of efficiency and of simplifying this example graph, that approach was chosen here. If the index value were not of the same data type, it could at least be sent in the same data item (using a structured data item).

Two types of parallelism are illustrated by this example: structure (pipeline) parallelism, since the actions of splitting a matrix into vectors, multiplying, and merging the results can be done in parallel; and activity parallelism, due to making a number of instantiations of the actor which does the multiplication. An alternative strategy would have been to control the amount of parallelism by constructing the graph as a number of side-by-side copies of the multiplication actor, each with its own datapath from the vector producer and to the vector merger - in other words, to employ horizontal rather than depth parallelism in designing the program. While this is a valid approach, it would be less flexible in this particular example, since the amount of parallelism available to be exploited by the program would then be fixed in the design, rather than decided at run-time, making the program harder to modify later.
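The merging actor described at the start of this subsection is a natural use of send-sticky. The following C sketch is purely illustrative: it reuses the hypothetical interface and the dimension N from the earlier sketches, accumulates the partial result as a numeric array rather than in the textual form used by the example, and uses the first element as a count of columns received - all of these are assumptions of the sketch, not details of the program in Appendix B.

    #include <string.h>

    /* Hypothetical method for "merge vectors into matrix": accumulates result
     * columns in a partial matrix kept sticky on the output path, and releases
     * the whole matrix once every column has arrived. */
    void merge_vectors_into_matrix(const double vec[],   /* column index + N values */
                                   const double *partial,/* sticky output value, or NULL */
                                   datapath *out)
    {
        double local[1 + N * N];                 /* element 0: columns merged so far */
        if (partial == NULL)
            memset(local, 0, sizeof local);      /* first firing: start an empty matrix */
        else
            memcpy(local, partial, sizeof local);

        int col = (int)vec[0];
        for (int i = 0; i < N; i++)
            local[1 + i * N + col] = vec[i + 1]; /* place the column in the matrix */
        local[0] += 1.0;

        if ((int)local[0] == N)
            medal_send(out, local, sizeof local);        /* complete: deliver downstream */
        else
            medal_send_sticky(out, local, sizeof local); /* keep for the next firing */
    }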


3.5 Summary

The example above has been described in detail, and if this explanation is added to the size of the MeDaL graph and its method code, the simple matrix-mult program may not seem very concise. However, this explanation is provided only to clarify the description in the previous section of the operation of programs expressed in MeDaL. Once one fully understands how MeDaL programs work, MeDaL graphs and method code need very little annotation, and provide a concise description of a potentially parallel program. This is important because, naturally, the easier it is to understand (and modify) the design of a program, the easier the development and maintenance of the software becomes. The main advantage of MeDaL is the implicit parallelism found in dataflow graphs. In addition, MeDaL provides a number of specialised features. To summarise these:

• MeDaL programs consist of two levels: a dataflow graph and method code
This scheme is of course designed primarily for medium-grained parallelism, where chunks of code the size of imperative-language procedures or functions can efficiently be executed in parallel. Rather than hindering understanding of the program, the two-level structure of MeDaL programs (graph and textual code) is designed to make programs easier to understand, describing first the whole structure - the graph level - and then, once that is understood, the computational details. Another benefit of this structure is that existing imperative-language functions can be adapted and placed in a MeDaL "harness" graph to facilitate their transformation into a parallel program.

• MeDaL graphs consist of actors and datapaths
Actors fire according to well-defined firing rules, and data flows between them on unidirectional datapaths. These concepts are simple, making the graphs easy to interpret. Data flows from top to bottom, but cycles are allowed, for flexibility.

• MeDaL provides six primitive actor types
There are six basic types of actor: the general-purpose actor, the deep actor, the source and sink actors, the replicator and the merge actor. Each varies in its firing rule or number of outputs. All but the replicator and merge can contain a method,

All but the replicator and merge can contain a method, i.e. a section of code which performs some function, expressed for instance in some textual imperative language. The non-method actors are more fine-grained, but are not inefficient since they can easily be optimised out at compile-time. The provision of six types of actor is a compromise: an attempt to provide flexibility and generality without making it necessary to learn the functions of a large number of different primitives.

• Datapaths can carry structured data

MeDaL's datapaths can carry structured data items such as matrices, vectors, records etc., corresponding to the input parameters of the method functions, and are hence efficient at the medium-grained level. The standard datapath is the E-type, which represents a simple FIFO queue, a simple but general concept.

• Datapath semantics provide two types of persistent memory

Memory which is persistent between different firings of an actor is desirable for two reasons: it allows actors to have two or more input streams which receive different numbers of data items, and it allows actors to build up complex data items out of simpler ones over a number of firings. MeDaL datapaths provide a different type of persistent memory for each of these two purposes. Firstly, at the graph level, F-type paths are provided which are FIFO except that they retain their last element rather than becoming empty. Thus, actors do not have to wait for the same number of input data items on every path. Secondly, the semantics of the operations which actors perform on datapaths include not only a normal send, but a send-sticky which does not allow the data item being "sent" to be delivered, so that it is still available on the next firing. Thus output data can remain "stuck" in an output path to be manipulated by the actor using more than one set of input data.

The features described above are designed to make MeDaL a useful language for the design, implementation and testing of parallel programs for medium-grained multiprocessors. This chapter has described the language itself, and illustrated its use in designing a parallel matrix multiplication program. The next chapter, therefore, turns to the implementation of MeDaL as a design and implementation language.

Chapter 4

Implementation Issues

In order to implement MeDaL as a complete programming system, with the features described in the previous chapter, a number of distinct modular tasks can easily be identified. A graphical editor is needed to edit the MeDaL graph; and a text editor is needed to edit the actor methods. A module is needed to extract information from the MeDaL graph to form the basis of a parallel program (the "harness"), transforming the MeDaL program into a structure appropriate to the target architecture. Another module may be needed to package the method code into a form which can be directly executed, for instance by adding code which fetches data items from datapaths and passes them as parameters to function calls. The harness, and the code which it surrounds, can then be combined into a parallel program using an existing compiler. Since MeDaL allows run-time expansion (of datapath queues and of the MeDaL graph itself, through the use of companies) a run-time library module would also be needed to support or manage execution of the parallel program. Figure 4.0a illustrates the interconnection of these modules.

[Diagram: the text editor produces method code and the graph editor produces the MeDaL graph; these feed the code transformer, and the result is combined into the parallel program, which is supported by the run-time system.]

Figure 4.0a: Modules of a MeDaL programming system

The dotted lines from the harness to the program and the run-time system indicate that harness information could be built into either the program or the run-time system; there are a number of options for the exact constitution of the run-time system, and these are enumerated in the next section. The technology involved in text editors is essentially very simple, and graph editors are also well understood (see Chapter 2).

The MeDaL graph editor would simply allow the addition, movement and deletion of MeDaL actors and datapaths, according to the syntax rules described in Chapter 3. It would allow "browsing" of the graph, and would need to support examination of every level of the graph (for instance by opening further graph-editor windows onto companies, and by opening text-editor windows onto method code associated with a particular actor). These operations are sufficiently simple not to require further discussion here; therefore only the final output of the graph editor, which is the input to the harness-generating module, will be considered further. This chapter concentrates on the other roles of the MeDaL programming system: how parallel code can be automatically generated from actor methods and the MeDaL graph, and how to support the execution of this parallel code at run-time. Section 4.1 discusses in more detail what is needed from the various parts of the programming system; section 4.2 describes the programming interface between the system and the programmer's actor method code; section 4.3 raises some of the issues of how such a system can be implemented on a distributed-memory architecture; and section 4.4 describes the implementation of the system on a shared-memory architecture. This implementation was used for the experimentation to be described in the next chapter. The machines used for the work described in this chapter were an Intel iPSC/2 hypercube (distributed memory) and an Encore Multimax 520 (shared memory).

4.1 Functions of the Programming System

This section examines the techniques which can be used to generate executable parallel programs from method code and MeDaL graphs, using graph and method transforming modules and a run-time system, as described above. First, however, it is necessary to define more clearly the distinction between these transformation modules and the run-time system, in terms of the work which they do.

4.1.1 The Division of Work

As mentioned in the previous section, there are several possible strategies for the provision of a run-time system, depending on whether harness information is incorporated or not. The options are:


• a standard library of support functions could be provided;
• a tailor-made run-time system could be built for each application using information from the harness;
• a hybrid approach could be adopted, with a library of generalised functions which is fed data about the dependencies of a specific application, for instance by means of a file loaded before the start of dataflow execution.

The choice between these options is simply one of work allocation between the modules of the MeDaL programming system; the greater the support given by the run-time system, the less work needs to be done by the transforming modules to turn the MeDaL graph and method code into an executable program. Each approach has its advantages and disadvantages. The provision of a generalised, pre-compiled run-time library is a common approach, and has the advantage that it reduces compile time. However, more coding effort must be expended in ensuring that it is generalised enough to deal with all possible requirements, and it will result in functions containing code not all of which is used in every situation. This is a disadvantage on distributed-memory architectures, since it is desirable to avoid the overhead of distributing unused code between processing nodes and storing it in the relatively limited memory there. The tailoring of the run-time system effectively moves complexity from the run-time library to the transformation modules, but causes longer compile-times. Finally, the hybrid approach, while offering a compromise, has the unfortunate characteristic that the reading-in and interpretation of the application-specific information adds to the total run time of the program. The obvious comparison which can be drawn is with compiled versus interpreted languages - interpreters can be very flexible but are, in general, slower. For this reason, this approach was not chosen. The criterion used to choose between the other two approaches was the characteristics of the underlying architecture. On a distributed-memory architecture, relatively complex decisions must be made about the way in which processing tasks will be allocated to processors, in order to minimise communication (which is the major factor determining throughput). On a symmetric shared-memory multiprocessor, such decisions are not necessary.

Since these decisions take time, it is desirable to take them at compile-time, to ensure the best possible efficiency of execution (more detail about the nature of these decisions is given in the next section). Therefore, the second approach (with the bulk of the work done in the transformer modules) was adopted for use with the distributed-memory architecture, while the first approach (with a pre-compiled, generalised run-time library) was used on the shared-memory architecture, bringing the benefit of faster compilation without a significant loss in execution efficiency. Thus, two distinct balances of work allocation between modules of the system were tried: on the shared-memory architecture, simple transformation modules generated a program which executed with the aid of a relatively complex run-time library; and on the distributed-memory machine, more complex transformation modules were planned, supported by a simpler core run-time system. Section 4.3 describes the experimentation using the distributed-memory machine, and section 4.4 the implementation using shared memory. However, there are some general issues involved which are common to both, and these are considered in the immediately following sections. To summarise, the three components of the MeDaL programming system to be considered here are:

• the method code transformer (input: a collection of fragments of HLL code, the actor methods)
• the harness generator (input: some representation of the MeDaL graph)
• the run-time system (possible input: the harness)

Each of these will now be dealt with in turn.

4.1.2 The Method Code Transformer

Ideally, the HLL method code written by the programmer should not contain any code specifically to perform dataflow functions, in order to minimise programming effort and to maximise portability. The input data (data items from the input paths) and output data (items to be transmitted on output paths) should be passed in and out using the programming language's normal mechanisms for parameter passing. The method code transformer, therefore, must undertake the task of interfacing actor code with the mechanism used to implement datapaths between actors (through shared memory or inter-node communication links as appropriate).

The main function of this interface is to extract information about what inputs and outputs an actor has, and to bind these to appropriate calls to the datapath mechanism. The simplest way of implementing such an interface is to generate a "wrapper" function for each actor, which undertakes the task of retrieving the input data from datapaths, then calling the method code, passing in the data thus retrieved as input parameters. It is this wrapper which is called when the actor fires, rather than the method code itself. Thus, a program comprising the method functions and the wrapper for each one is the output of the method code transforming module. The requirements for the programming language interface (between the method code and the wrapper and/or the run-time system) are discussed further in section 4.2.

4.1.3 The Harness Generator

While the method code is transformed into code corresponding to the program's actors, the MeDaL graph must be used to generate code corresponding to the program's datapaths (the "harness" which allows dataflow-based execution). At first glance, the harness generator's task of transforming lines in a graphical editor into executable code might seem a much bigger task than that of the method code transformer, which merely augments HLL code sections. However, since the main function of the datapaths is to transport data, and since most commercially available multiprocessors provide mechanisms which do this, the task is not as great as it might seem. Only code which makes use of these mechanisms in the appropriate way need be generated. Therefore, the task of the graph transformer is to extract information from the graph concerning the source and destination actors of each datapath, and to generate efficient code to enable the method wrappers and the run-time system to control the available data transport mechanism. This code is no more complex than a series of assignments mapping datapaths to actor inputs, and actor outputs to datapaths. While doing this, the graph transformer must take into account actors which do not contain a method. Although these are logically actors, the fact that they contain no method means that they do not need a method wrapper, and in fact do not need to be executed separately at all;

so it is reasonable to consider them merely part of the graph for the purposes of generating executable code. An example would be a merge actor: suppose actors A and B have output datapaths x and y respectively which are the inputs of a merge actor C, and that the output datapath of C, called z, is the input of actor D. Rather than generating code which maps A→x, x→C, B→y, y→C, C→z and z→D, it would be more efficient simply to map A→z, B→z and z→D. In this way, a non-method actor and two datapaths can be optimised out completely. Other optimisations are similarly possible when generating the datapath code for other non-method actors. The code generated in the way described above is then incorporated either into the run-time system, or into a further level of wrapper around the method code, according to the approach being adopted (see above). In the case in which the code derived from the MeDaL graph is used in a new wrapper level, a wrapper is created round each company giving details of the connections between actors in the company. This code is not executed like an actor, but is called by the run-time system whenever data first enters a company, to find out which actor the data should be delivered to. The harness generator would also need to generate code to handle such tasks as type conversions in a heterogeneous environment. However, for simplicity's sake, this thesis only considers homogeneous multiprocessors, i.e. those in which the CPU and data representation are the same on all processing nodes.
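Purely as an illustration of the kind of code the harness generator might emit for the example above, the following fragment records the optimised mappings as a static table. The structure and the names in it are invented here for exposition, and are not taken from the actual generator.

    // Hedged sketch of generated harness data for the A, B -> (merge C) -> D
    // example, after the merge actor and paths x, y have been optimised out.
    enum { ACTOR_A, ACTOR_B, ACTOR_D };

    struct PathMap { int src_actor, src_output, dst_actor, dst_input; };

    static const PathMap company0_paths[] = {
        { ACTOR_A, 0, ACTOR_D, 0 },   // A's output is mapped straight onto z
        { ACTOR_B, 0, ACTOR_D, 0 },   // B's output is also mapped onto z
    };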

4.1.4 The Run-time System

Having considered the code transformer and the harness generator, the remaining component is the run-time system, and much of the functionality of this has already been implied. The full list of the run-time system's functions is as follows. It must:

• do any machine-specific preparations for parallel processing;
• maintain a "roadmap" describing datapath connectivity (see below);
• generate any data from MeDaL source actors;
• implement MeDaL library actors such as halt and file input/output;
• decide when method actors are able to fire, and invoke them via their wrappers;
• detect termination of the program and shut down gracefully.


In addition, it may perform other functions such as the saving of "checkpoint" snapshots of the running program, enabling it to be restarted later from the point at which it was saved; the collection of datapath contents and actor variables for debugging; and run-time garbage collection. The roadmap referred to above is a dynamic data structure, the maintenance of which is central to the role of the run-time system. Essentially, it records the mapping between actor outputs and datapaths, and the destination of each datapath. It takes the form of a set or array of companies, each of which contains actors, datapaths and possibly companies. As described above, the code which sets up this data structure is either part of a tailor-made run-time system, or is available in code which can be called by the run-time system. However, because MeDaL graphs can expand at run-time (for instance through recursive calling of companies), the roadmap must be dynamically extensible. The roadmap must also contain information such as the physical location in memory of datapaths and actors, and information relating the group of datapaths which form the inputs of each actor, so that the run-time system can easily check whether an actor is runnable (i.e., when all of such a group of datapaths are non-empty). It is this run-time information held in the roadmap which is (indirectly) accessed by the method programming language interface, including the method wrappers, to put into effect the "transmitting" and "receiving" of data through datapaths. This interface is described next.
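Before turning to that interface, the following fragment illustrates, purely by way of example, the kind of structures such a roadmap might use. The type and field names are invented here and are not those of the actual implementation.

    // Hedged sketch of roadmap structures: a dynamically extensible set of
    // companies, each holding its actors and datapaths.  Names are illustrative.
    struct DatapathInfo { void *queue; int dst_actor; int dst_input; };

    struct ActorInfo {
        int   n_inputs;
        int  *input_paths;   // indices of the datapaths feeding this actor;
                             // the actor is runnable when all are non-empty
        void *location;      // physical location (node or address) of the actor
    };

    struct CompanyInfo {
        int           n_actors;
        ActorInfo    *actors;
        int           n_paths;
        DatapathInfo *paths;
    };

    struct Roadmap {
        int           n_companies;
        CompanyInfo **companies;   // grows as the graph expands at run-time
    };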

4.2 Programming Language Interface

It has already been stated that the code transformer module takes method code and generates code consisting of each method with a wrapper round it, to handle the receiving and sending of items of data from and to datapaths. It has been asserted that, ideally, the method code should not need to include any code to control the flow of its input and output data through the dataflow graph. However, there are a number of issues which mean that this idea may not be quite achievable in practice. This section identifies these issues, and describes how they can be tackled using FORTRAN, C and C++ as method implementation languages.


Although there are essentially only two points at which method code needs an interface to the system - the receiving of input data and the sending of output data - there are complications to both of these. Because of the semantics of datapaths, it is necessary for method code to have two operations on output data (send and send-sticky), and to be able to find out whether or not a particular output path currently contains a "sticky" item of data from a previous firing of that actor. There are also a number of features which are desirable from the point of view of efficiency; for instance, if an input is to be passed on to an output unchanged (or largely unchanged, if it is a relatively large structure), physical copying of the data from an input parameter to an output parameter is undesirable. Since the aim of MeDaL is to provide a programming environment which is efficient at the medium-grained level, this type of issue must be addressed.

4.2.1 Programming actors in C and FORTRAN

Passing input parameters to an actor method is simple in any block-structured imperative language, since a mechanism already exists, namely that of input parameters to functions or procedures. Passing output parameters back to the caller is less easy in languages like C and FORTRAN, since these languages only have provision for functions to return one value as the result, whereas in MeDaL there can be an arbitrary number of outputs from an actor. An approach often adopted is to pass in (using the input parameter mechanism) pointers to the variables which will be used as the function's outputs. A simplistic solution to the need for "stickiness" semantics would be simply to pass in not only variables for the actor's outputs, but a flag for each output variable marking its "stickiness" which could be tested by the method code, and set or unset by it, resulting in a sticky or non-sticky send when the method terminated - the wrapper code handling the actual sending. Additionally, each output would need a further flag marking whether or not there actually was any data to be sent on that particular output (since it is not a requirement that data items be sent on every output for every firing). Although this simplistic solution is easy to implement, it has two main disadvantages. Firstly, if the wrapper code is to handle the transmission of each output variable, only one item of data can be sent on each output path for each firing - unless a more complex scheme, such as using variable-length arrays for each potential output, is used.

Secondly, if one particular output of an actor is ready half-way through its execution, and the datapath on which it is to be sent is empty, then not sending it until the end of execution may cause the destination actor of that datapath to wait for that particular input longer than necessary. For optimum efficiency, it must be possible to send an item on a datapath as soon as it is ready to be sent. The simplest way to solve these problems in C and FORTRAN is to use function calls representing the send and send-sticky operations. These may be calls direct to the run-time system, or more probably, to the wrapper level. This is likely to be necessary because, to transmit data on a datapath, the function needs not only the data item itself and information about which of the actor's output paths to use, but also the position of the actor in the roadmap - something which cannot be determined at compile-time. Since it is also desirable to hide this detail from the actor programmer, each actor wrapper could include a function called by the actor method code to fill in this information before the data-sending operation goes ahead. Alternatively, in C, macros could be provided instead of function calls, so that the appropriate code would be added to the main body of the actor method before compile-time (by the C preprocessor). However, most of the problems raised by the use of C or FORTRAN as the programming interface can be elegantly avoided by using C++ as the implementation language.
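A sketch of what such a function-call interface could look like in C-style code is given below. The names medal_send() and medal_send_sticky(), and their signatures, are purely illustrative stand-ins for the wrapper-level calls described above; they are not the actual MeDaL interface.

    /* Hedged sketch of a C-style actor method using explicit send calls.  Both
       functions stand for calls into the actor's wrapper; their names and
       signatures are assumptions made here for illustration only. */
    void medal_send(int output_path, const void *item, unsigned size);
    void medal_send_sticky(int output_path, const void *item, unsigned size);

    void example_actor(const int *in0)
    {
        int partial = *in0 * 2;
        medal_send(0, &partial, sizeof partial);       /* sent as soon as ready,
                                                          before the method returns */
        int result = partial + 1;
        medal_send_sticky(1, &result, sizeof result);  /* remains available ("stuck")
                                                          on the next firing         */
    }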

4.2.2 Programming Interface in C++

C++ includes many programming features missing from the older HLLs, and these features can be used to hide much complexity from the actor programmer. In a C++ method programming interface, C++ objects representing each input path and each output path are passed to the method function as parameters. The class of each object depends on the data type carried by the datapath it represents. For instance, an input variable read from a datapath which carries integers would be an object of a class Mint representing MeDaL-integers (integers with extra information and functions to facilitate the integer's passing through the MeDaL dataflow system). Objects of this class contain not only the integer's value, but the "stickiness" flag, and the run-time information about location within the dynamic "roadmap" (instantiated by the wrapper before the object is passed to the method function itself).
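The real class definitions are listed in Appendix A; purely as an indication of the idea, a minimal Mint-like class might look as follows. The member names and the commented-out run-time calls are assumptions made here, not the actual code.

    // Indicative sketch only; the actual MeDaL classes appear in Appendix A.
    enum Direction { in, out };
    struct ActorId { unsigned troupe; unsigned actor; };
    typedef ActorId *ActorIdP;

    class Mint {
        int       value;     // the integer carried through the datapath
        int       sticky;    // send-sticky flag
        Direction dir;       // input or output object?
        int       path;      // which of the actor's input or output paths
        ActorIdP  owner;     // which actor of which troupe is using the path
    public:
        void setup(Direction d, int p, ActorIdP a) {
            dir = d; path = p; owner = a; sticky = 0;
            // an input object would fetch its value here, e.g. by a call
            // such as RTS_Receive(owner, path, &value);
        }
        operator int() const { return value; }   // read like an ordinary integer
        Mint &operator=(int v) {                  // assignment to an output object
            value = v;                            // would transmit immediately,
            // e.g. RTS_Transmit(owner, path, &value, sticky);
            return *this;
        }
    };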


The MeDaL classes can contain not only information, but member functions which perform operations on them - such as retrieving the value from an input datapath, and placing a new value on an output datapath (with the sticky attribute set or unset). Essentially, these MeDaL classes form part of the method wrapper. However, by splitting the wrapper two ways (into the wrapper-function which is called by the run-time system, and the class code relating to input/output objects) it is possible to make the wrapper-function simpler - by making the MeDaL classes contain all possible generalised code, the wrapper-function need only contain any method-specific code. This in turn makes the function of the method-transforming module simpler. The pre-defined MeDaL classes can include definitions of many commonly-used data structures, such as arrays, which can then be passed between actors through datapaths. However, using C++, programmers are not limited to the data types provided: they could easily implement new MeDaL data types by using inheritance to create new classes based on those provided. So long as a type is provided which can transfer blocks of memory through datapaths, any other more complex type can be built on top of this without needing technical knowledge of the interface with the run-time system. A minimal set of MeDaL classes is included in Appendix A. A further way in which the C++ interface can hide technical detail from the programmer is in the provision of data-sending functions to be called during the actor method when data becomes ready for transmission on a datapath. While in C a function call (or a macro to hide one) must explicitly be used, this can be avoided in C++ by using the operator overloading mechanism. By overloading the assignment operator, a class member function can be called whenever a value is assigned to a MeDaL object; this function can check whether the object is an output object (corresponding to an output datapath), and if so, do whatever is necessary for it to be placed immediately on that datapath. None of this detail is visible to the actor method programmer, and naturally assignment to an output object can occur any number of times within an actor (for instance, within a loop). The assignment function provided with the class can also deal efficiently with copying; for instance, when an array from an input path is assigned to an output path, the assignment need only copy a pointer to where the array is stored in memory from one object to the other.


In other programming languages, this task would usually fall to the actor method programmer. Other memory management issues can also be efficiently handled using C++; for instance, when an array from an input path is not copied to any output path, the memory taken up by that array should be "garbage-collected", and this can be done automatically by setting up a C++ destructor function for the class which frees any memory which is not marked as still being wanted (this marking being done by the assignment function). An example of this can be found in Appendix A in the MeDaL array (Mary) class. Thus, by shifting much of the wrapper's complexity into standard class definitions, using the techniques described above, the remaining wrapper-function becomes very simple. Figure 4.2a illustrates typical wrapper code using the classes defined in Appendix A. The wrapper-function, c0a2, is what is actually called by the run-time system when it is found that all of this actor's input paths contain data. This wrapper then calls the real method function actor2.

    void c0a2(unsigned company)        // company is passed in
    {                                  // by the run-time system
        ActorId me;
        ActorIdP meep = &me;
        Mint ip0, op0;                 // actor's input and output
        void actor2(Mint&, Mint&);

        me.troupe = company;           // this structure holds the
        me.actor = 2;                  // identity of this actor.

        ip0.setup(in, 0, meep);        // tell i/o objects whether
                                       // they are input or output;
        op0.setup(out, 0, meep);       // WHICH input or output;
                                       // and the actor's identity

        RTS_ActGoing(meep);            // tell run-time system
                                       // actor has started

        actor2(ip0, op0);              // run the actor method

        RTS_ActOver(meep);             // tell run-time system
                                       // actor has completed.
    }

    void actor2(Mint & in0, Mint & out0)   // pass by reference
    {
        int inv;

        inv = 100 - in0;               // do something trivial
                                       // (note type conversions
                                       // are handled by C++)

        out0 = inv;                    // assignment calls
                                       // transmit function
    }

Figure 4.2a: Wrapper and real method functions using the C++ interface

A company must of course be passed to the wrapper at run-time because new companies can be created during execution. However, the other parts of the wrapper - the number of this actor within the company, and the numbers and types of the input and output paths - do not change. Actors with more than one input and output would simply have more lines of the form ip1.setup(in, 1, meep) which give each input/output object all the information necessary to call a generalised datapath receive or transmit function in the run-time system at the appropriate time - immediately, in the case of input objects (receives), and at assignment time in the case of output objects (transmits). The three arguments to the setup function are a token indicating whether the object is for input or output (actually an enumerated type); the number of that particular input or output, numbered from 0; and the pointer to the structure indicating what actor of which company is accessing that path. The example above (and the code listings in the appendices) is taken from the shared-memory implementation of MeDaL; however, the type of architecture on which the MeDaL program is executing should not have a great impact on the functionality of the wrapper code, only the underlying run-time system. The run-time system, naturally, contains much architecture-specific detail. However, the general issues involved and techniques which can be used to implement them are discussed in the following sections.


4.3 Distributed-memory Run-time System

In comparing techniques used to implement the MeDaL run-time system on distributed-memory and shared-memory architectures, there are two important areas which account for most of the differences: the techniques used in transporting data items from one actor to the next, and the role played by the run-time system. Both types of architecture have their strengths and weaknesses in both of the areas mentioned above. Dataflow is, in a sense, a message-passing paradigm; it is easy to draw comparisons between dataflow's actors and datapaths, and the processing nodes and links of message-passing architectures. However, this is not to say that a dataflow system of the type proposed here necessarily maps easily and efficiently onto a message-passing architecture. The problem is that since parallelism is extracted from a MeDaL program by the programming system rather than the programmer, the programming system must make the decisions about which nodes to place which actors on - moreover, it must sometimes do this at run-time. The load-balancing of dynamic processes is a non-trivial task, especially on a distributed architecture, on which load details of other nodes must be transmitted through the relatively slow inter-node links. The simplest strategy would be to have a single process on one node dedicated to doing this - but in a complex application, this process could become a bottleneck. However, leaving such issues as load-balancing aside, before an efficient dataflow system can be produced, it is necessary to consider the basic techniques which can allow dataflow programs to run at all on distributed-memory multiprocessors. It was stated above that the two main issues are the implementation of datapaths between actors, and the role of the run-time system and the method code wrappers; these are dealt with in the following three sections.

4.3.1 Datapaths Between Distributed Actors

The basic mechanism for passing messages between processes on the iPSC/2 hypercube architecture, as with many other distributed multiprocessors, is explicitly to transmit messages, containing the data to be sent, to a specific destination. The operating system on the iPSC/2 (which runs on every node) provides functions to do this;

on some other architectures, the application program itself would have to handle routing of messages from one processor to the next until they reach their destination. Each processing node in the hypercube has a unique number, and since each node can run multiple processes, each process on a given node must be given a unique Process Identifier (PID) when it is started up. This message-passing mechanism can be adopted directly for the implementation of datapaths, the only caveat being that each actor must know the physical location of the actor to which each of its output paths leads. This is handled by the wrapper code, as will be shown below. Conveniently, the iPSC/2's operating system provides not only a message-sending system call, but both blocking and non-blocking message-receiving calls; if data arrives for a process and no receive call is (blocked and) waiting for data items to arrive, the operating system places the data on a FIFO queue where it stays until the process requests more data. This provides everything necessary for the implementation of MeDaL's E-type paths. F-type paths simply remember the last item of data from the queue, and this datum is stored by the method-wrapper. The operating system detects and reports queue overflow which, as described in Chapter 3, is considered fatal to the execution of MeDaL programs; no "feedback" is needed from a consumer actor to its producer actor, to tell it whether or not the queue is full, so this does not constitute any message-passing overhead.
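As an indication of how the method-wrapper might realise an F-type path on top of these queues, consider the fragment below. The Item type and the helpers data_pending() and receive_item() are invented names standing for the operating system's non-blocking probe and receive calls; they are not the actual iPSC/2 interface.

    /* Hedged sketch of an F-type (last-value-retained) input path kept in the
       method-wrapper.  All names here are illustrative assumptions. */
    enum { MAX_PATHS = 8 };
    typedef struct { char bytes[256]; } Item;

    extern int  data_pending(int path);    /* non-blocking: is a message queued?   */
    extern Item receive_item(int path);    /* take the next message off the queue  */

    static Item last_item[MAX_PATHS];      /* retained element, one per F-type path */
    static int  have_last[MAX_PATHS];

    Item receive_ftype(int path)
    {
        if (!have_last[path] || data_pending(path)) {
            last_item[path] = receive_item(path);   /* consume a fresh item if any  */
            have_last[path] = 1;
        }
        return last_item[path];                     /* otherwise re-use the last one */
    }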

Naturally, data items which are sent to output paths using send-sticky are not actually sent to their destinations, but are rather stored by the method-wrapper to be passed back into the actor on its next firing. The optimisation of the dataflow graph (involving the removal of non-method actors and re-mapping of datapath destinations) described above also helps to reduce message-passing. All of these techniques by which the number of messages being transmitted can be reduced are important not only because message-passing is costly in terms of time, but because the bandwidth of inter-node communication links is limited, and too many messages can cause "congestion", leading to even greater delay. It can be seen from the discussion above that the wrapper code around each actor's method code is of great importance in efficiently implementing dataflow programs on a distributed-memory architecture. The way in which the wrapper does its work, in co-operation with the run-time system, can now be described.


4.3.2 Run-time System for Distributed Actors

In a sense, the wrapper code around each actor forms part of the dataflow run-time system on a distributed architecture, handling the technical details of sending output data to its correct destination. To have execution of the dataflow program controlled by one central process would be a potential bottleneck both in terms of processing throughput and of inter-node congestion. However, as indicated above, there are situations in which processes need information about other processes on other nodes, and this will inevitably lead to inter-node traffic. These situations are as follows:

• for load balancing (even in the simplest possible terms, e.g. the initial allocation of new actor processes to nodes);
• for termination detection;
• for consulting the roadmap to find out the node and PID of the actor at the consumer end of an output path.

The latter situation occurs because of the need to keep the roadmap coherent. Since the roadmap can be extended dynamically at run-time, not all actors can be set up before execution begins; some will be "created" (i.e., their code copied to a processing node and set running) during execution, namely, when data is sent to them. In other words, the act of producing data for an actor which does not yet exist causes its creation. Data could be produced for an actor C, before its creation, by two other actors A and B, each producing data for a different input path of C. If this happened simultaneously, and actor creation could happen in two places at once, two copies of the new actor C might be created - and neither might ever be able to execute, since it might never receive data on all of its input paths. Thus, the ability to extend the roadmap must reside with one process on one node. There are two possible strategies for this, namely:

• The roadmap could be distributed between different nodes, and a key allowing extension could be held by any one node. A process wishing to acquire this key would have to request it from other nodes repeatedly until it was found and "given to" the searching process.
• One centralised copy of the roadmap could be held by a process always residing on the same node. This process would undertake any necessary extension of the map and creation of new actors.


Both of these strategies are viable, but each has its problems. The first has the potential for a higher number of messages, while the second has the potential to be a bottleneck. The first of these two approaches has been successfully adopted in implementing a virtual shared memory on the iPSC/2 [Lahj91], and indeed the handling of a dataflow roadmap could easily be implemented given a virtual shared memory. However, the development work described here pre-dates the completion of the virtual shared memory work which was in progress at the time. Ultimately, the second strategy was adopted for its far greater simplicity. Thus, a system was adopted whereby a single process implementing a number of run-time system functions ran on a single processing node (the hypercube's host processor). This process, known as Equity (since all of the actor functions "belong" to it), maintained a single, global roadmap containing all of the information about which actors were located on which nodes (and with which PID). This roadmap was initially generated by the dataflow harness code incorporated into the run-time system at compile-time (see section 4.1.1).

The functions, therefore, of the Equity run-time system were to manage all the situations defined above in which inter-node co-operation is needed, in the following ways.

Load Balancing

Since Equity knows the physical location of all currently running actors, it has the only available information which can be used for load balancing. Thus, new actor processes can be placed on the nodes with the fewest already executing processes.

Termination Detection

Equity can co-operate with the actor wrappers (see below) and be informed whether an actor on a remote node is executing or not. In addition, any message sent to a halt actor anywhere in the dataflow graph is automatically routed to Equity so that it can perform the shutdown.

Creation of New Actors

It was stated earlier that the act of producing data for an actor which is not yet in existence causes the creation of the new actor. Essentially, when an actor's wrapper wishes to transmit data on a datapath for which it does not know the destination,


it sends a message to Equity requesting the appropriate information from the roadmap. If the actor does not already exist, Equity creates it before passing its location back to the actor which made the request.

In addition to these three functions, it is also Equity's job to transmit data from source actors, and to handle program input/output (i.e., data to and from the input/output library actors). The former was necessary because source actors are virtually always too lightweight to be worth the cost of distributing to a remote node, and the latter because all I/O on the iPSC/2 must go through the host processor anyway. This being the case, it was simplest for all messages to output actors to be sent automatically to Equity for processing.

4.3.3 The Wrappers of Distributed Actors

As explained above, the role of the actor method wrapper is to hide the technical aspects of dataflow execution from the actor programmer. Obviously, because actors can only communicate with other parts of the program through datapaths, the implementation of receiving and sending data through datapaths forms the main basis of the wrapper code. Considering transmission first, it was described above that when an actor wishes to send a data item to an actor the location of which it does not know, it requests the location from Equity. In fact, initially, the actor does not know the location of any of the actors to which its output paths go; a form of late binding of output paths to their destinations is employed. What happens is as follows: when an actor wishes to place data on an output path, it calls a function which is part of its wrapper (since C++ was not available on the iPSC/2, this was by means of an explicit function call). The parameters to this call are the data item to be sent, and the unique number of the datapath on which it is to be sent. This wrapper maintains a lookup table mapping each output datapath to the node, PID, and input port number of the actor which form the destination of that datapath. If the entry corresponding to an output path is empty, the wrapper sends a message to Equity requesting the destination of that output path (identifying the path by specifying the current actor's company identification, number within the company, and the number of the output path).

The wrapper code then blocks until Equity sends back the node, PID and input number of the input path and destination actor, and when this arrives, places this information into the lookup table. The important point is that this lookup table is persistent in relation to the actor; so once the information has been found, it will not cause message-passing to find it again for subsequent transmits on that datapath, even during future executions. This approach reduces the amount of message-passing needed, and hence speeds up execution; though it does make subsequent load-balancing difficult. In fact, the run-time system can be completely passive under this arrangement; all communication is initiated by the actors themselves (or their wrappers), and the run-time system simply responds. Since the wrappers need not deal with messages from the run-time system except when one has specifically been requested, this makes them reasonably simple. It was mentioned above that actor wrappers maintain a datapath destination look-up table which is persistent relative to actor firings. This is possible because the processes in which actors and their wrappers run are not destroyed once the actor has terminated. Since MeDaL graphs can contain cycles, and MeDaL actors can place more than one item of data on an output path during one execution, it is not possible - in the general case - to know, once an actor has completed its work, whether it will be able to fire again in the future. Therefore, the following procedure must be followed. Once an actor completes, and control returns to the wrapper, the wrapper checks the operating system's input queues for its process, and if all are non-empty, runs again immediately. It checks these queues using the asynchronous checking call, iprobe(), which does not block waiting for data to arrive in the queue. However, if any of the queues are empty, it sends a message to Equity stating that its actor is no longer working, and blocks until data arrives on that queue (using the synchronous call, cprobe(), so that its process is suspended, freeing that processing node). When data is received, such that all paths again contain data, the wrapper sends a message telling Equity that its actor is working once more.

In this way Equity can detect termination, which might arise through all actors completing their work, but can also happen if deadlock occurs. In many programs there will be far more of these messages than any other type between the nodes and the run-time system


(in the iterative programs used for experimentation they made up between 84% and 90% of the messages received by Equity); however, since the messages are short, can easily be processed by the run-time system, and can be sent asynchronously (the actor does not need to wait for a reply), this does not significantly slow down the execution. There are, however, more sophisticated algorithms which could be adapted, a number of which are discussed in [Matt87].
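The procedure just described might be sketched as follows. The calls iprobe() and cprobe() are the ones named above, but their exact signatures, and the helpers notify_equity() and run_method(), are assumptions made purely for illustration and do not correspond to the actual wrapper code.

    /* Hedged sketch of the wrapper's behaviour after its actor method returns. */
    extern int  n_input_paths;
    extern long path_type[];                 /* message type used for each input path  */
    extern long iprobe(long typesel);        /* non-blocking probe (assumed signature)  */
    extern void cprobe(long typesel);        /* blocking probe (assumed signature)      */
    extern void notify_equity(int working);  /* invented helper: report actor status    */
    extern void run_method(void);            /* invented helper: fire the actor again   */

    void wrapper_loop(void)
    {
        for (;;) {
            int all_ready = 1;
            for (int p = 0; p < n_input_paths; p++)
                if (!iprobe(path_type[p]))           /* is any input queue empty?   */
                    all_ready = 0;

            if (!all_ready) {
                notify_equity(0);                    /* tell Equity: not working    */
                for (int p = 0; p < n_input_paths; p++)
                    cprobe(path_type[p]);            /* suspend until data arrives  */
                notify_equity(1);                    /* tell Equity: working again  */
            }
            run_method();                            /* fire the actor once more    */
        }
    }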

This section has described the issues involved in implementing a run-time system which supports all the main features of MeDaL on a distributed-memory architecture such as the iPSC/2, and what techniques can be used to address these issues. However, the work described here does not go beyond demonstrating that such an implementation is possible. Because of the lack of experimental time available on such an architecture, no claims can be made that such an implementation would provide a viable system for implementing parallel programs, in terms of time-efficiency compared to other systems.

4.4 Shared-memory Run-time System

Even though dataflow can be seen as a message-passing model, the availability of shared memory makes implementing dataflow programs (and, arguably, parallel programs in general) easier than it is on a distributed architecture in a number of ways. In particular, there is no need for mechanisms to work out which processor a particular actor is running on, and since the run-time system code is easily accessible to all actor methods, the actors themselves can execute this code (via run-time library function calls) rather than requiring a separate process to do this. The Encore Multimax, like other symmetric shared-memory multiprocessors, provides HLL-level support for parallelism in the form of a library of parallelism primitives for creating parallel processes and for mutual exclusion. The parallel process mechanism involves the creation of "threads," which are essentially "lightweight" processes in the sense that they are not costly in terms of time and space to create and destroy. If the program creates more threads than there are processors available to run them, the system handles scheduling of the threads through its own run-queue; the programmer's view is that all threads run concurrently.

This system is convenient because, as has often been noted, most applications contain more potential parallelism than is available on current MIMD multiprocessors [Vali90]. Thus, if more processes are set going than there are processors, this ensures that all processors will be kept busy all the time, giving near-optimum efficiency (subject, of course, to the overhead of process creation and destruction). As described earlier in this chapter, the run-time system implemented on the shared-memory machine provides a generalised (not tailored) set of functions; in fact, it provides a second run-time library, which forms an interface between the method wrapper code and the threads library. The shared-memory run-time system can, like the distributed-memory one, be considered as the sum of three separate parts: the code which implements the datapaths, the method wrapper code, and the rest of the run-time library. The following sections deal with each of these in turn.

4.4.1 Shared-memory Datapaths

Datapaths can easily be implemented in shared memory, which is accessible to both of the actors which use the path, namely the producer and the consumer. In fact, using the merge-actor optimisation described in section 4.1.3, there may actually be several producers writing to the tail of one datapath queue. Mutual exclusion must therefore be used to ensure consistency of the datapath structure, and this can be easily implemented by creating a lock associated with each datapath, which must be acquired before writing can take place (note that the consumer also alters the queue by removing the head element, and so must also acquire the lock). This lock can be a simple parallelism primitive such as a semaphore. A simple bounded-buffer algorithm can be used to implement the FIFO queue needed for datapaths. Since MeDaL datapaths are theoretically infinite, it is necessary for the producer to extend the buffer (assuming memory is available) when it becomes full. This results in the producer holding the path's lock for longer than normal. However, simple heuristics can be added to determine by how much the buffer is extended each time it becomes full, to ensure that this happens relatively infrequently; and in any case, memory allocation is generally not a heavyweight operation.
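The following fragment sketches such a lock-protected, extensible datapath. The Lock type stands in for whichever mutual-exclusion primitive the parallelism library provides (here it is reduced to a no-op placeholder), the growth heuristic is an arbitrary choice, and a real implementation would also reclaim the space occupied by consumed items; none of this is the actual MeDaL run-time code.

    // Hedged sketch of a shared-memory datapath: a FIFO queue of fixed-size
    // items, protected by a lock and extended by the producer when it fills.
    #include <cstdlib>
    #include <cstring>

    struct Lock {
        void acquire() { /* placeholder: e.g. semaphore wait   */ }
        void release() { /* placeholder: e.g. semaphore signal */ }
    };

    struct Datapath {
        Lock    lock;
        char   *buffer;       // items stored contiguously; head..tail-1 are valid
        size_t  item_size;
        size_t  head, tail;   // next item to remove / next free slot
        size_t  capacity;     // current number of slots

        void put(const void *item) {              // called by a producer actor
            lock.acquire();
            if (tail == capacity) {               // path full: extend it
                capacity = capacity * 2 + 16;     // simple growth heuristic
                buffer = (char *)std::realloc(buffer, capacity * item_size);
            }
            std::memcpy(buffer + tail * item_size, item, item_size);
            tail++;
            lock.release();
        }

        int get(void *item) {                     // consumer also takes the lock
            lock.acquire();
            int ok = (head < tail);
            if (ok) {
                std::memcpy(item, buffer + head * item_size, item_size);
                head++;
            }
            lock.release();
            return ok;                            // 0 if the path was empty
        }
    };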


4.4.2 Shared-memory Wrappers

The method code wrappers needed in a shared-memory implementation are almost identical to those in the distributed-memory version; their main purpose is to set up the C++ input/output objects with all the information needed to receive and transmit data through datapaths - information which is not known until run-time. The classes for these input and output objects contain member functions (such as the overloaded assignment operator) which use this information to call functions in the run-time library. Since these are simple function calls, the run-time library code is executed by the actors themselves - even when this involves extending the roadmap (sections of the roadmap also being protected by locks, to prevent two actors altering a section concurrently). So, although the code for roadmap maintenance is part of the run-time system, and its details do not need to be visible to the wrapper or actor method levels, this code is executed in concurrent processes, achieving greater parallelism than the simple centralised-roadmap scheme adopted on the hypercube machine. The shared-memory run-time system itself is not included in this thesis since it is highly machine-specific; however, its function call interface is clearly illustrated in the MeDaL classes listed in Appendix A. All function names beginning RTS_ are calls to run-time library functions. Because a generalised run-time library is used, an extra wrapper layer is needed to provide the dataflow harness describing the layout of a particular dataflow graph. This takes the form of an extra wrapper function, one for each company in the MeDaL program. Each company harness function is called in turn by a single further function, which is effectively a wrapper for the whole program, in the following way. When starting execution, control enters the run-time library (not the programmer's method code), which calls the whole-program wrapper function. Since the run-time library is pre-compiled, this function must have a specific name such as RegisterCompanies(). This function embodies knowledge of the company harness functions and calls each of them in turn, allowing them to register themselves with the run-time system: they call run-time library functions which place appropriate numbers of actors and datapaths in a data structure (the company register) along with pointers to the actor wrapper functions,

the destinations of the datapaths and so on. Thus, after every company harness function has been called, the run-time system has a record of the layout of each company. The company data structure used in the register of companies, however, only contains static data about the internal layout of each company. This data structure is then used during execution as a template, from which the dynamic roadmap can be built. The roadmap itself is a set of dynamic structures called troupes, each of which is essentially a company, with the addition of dynamic data such as the physical locations of datapath memory buffers, locks to protect them, etc. When data flows into a troupe for the first time during execution, the troupe is instantiated with the appropriate actors and datapaths, using the template in the company register. In this way, the roadmap can be dynamically extended very quickly (and, indeed, potentially garbage-collected) without altering the information needed to create further troupes. Since troupe creation involves the creation of data structures for an entire company at once, it is of course a relatively heavyweight operation compared to the placing of data on a datapath; however, it occurs correspondingly more rarely - in fact MeDaL programs can be developed without using troupes at all. Moreover, it occurs within the threads running actor processes, and so two different new troupes can be set up concurrently. Both datapath extension and troupe creation occur as a result of an actor method transmitting a data item on a datapath. In fact, time spent within the transmit function accounts for almost all the time the actor spends executing run-time library code. Since acquiring a datapath lock (or, relatively rarely, a roadmap lock in order to extend the roadmap) is the only action which may force actors to synchronise with each other, it is important to consider the algorithm for transmitting data items on a datapath, in order to find the minimum time for which these locks must be held. Figure 4.4a is a simplified, pseudo-code version of the algorithm used. Essentially, each datapath has a lock; each actor has a lock; the whole roadmap has a lock; and a global count of running actors has a lock.


    acquire path_lock;
    place data at tail of datapath;
    if (path is now full) {
        extend path;
    }
    release path_lock;

    acquire roadmap_lock;
    if (destination is in a new troupe) {
        create troupe from appropriate company;
        add troupe to roadmap;
    }
    release roadmap_lock;

    acquire destination actor_lock;
    if ((actor is a depth actor) or (actor is not running)) {
        enabled = True;
        for (index = 1 to number_of_input_paths) {
            if (input_path[index] is empty) enabled = False;
        }
        if (enabled) {
            mark actor as running;
            acquire running_lock;
            increment running_count;
            release running_lock;
            create new thread to run actor;
        } else {
            release actor_lock;
        }
    } else {
        release actor_lock;
    }

Figure 4.4a: Shared-memory transmit algorithm

Note that when datapaths are created, they are allocated a certain amount of empty space to start with, and they are extended whenever they become full, so that the next item of data can quickly be added to the tail of the path. The reason why the roadmap must be locked before even testing whether the destination is in an (as yet) uninstantiated troupe is that two different datapaths, being written to concurrently, may both have destinations in the same uninstantiated troupe; so a situation in which both of these try to instantiate the new troupe must be avoided. The destination actor must also have a lock which is acquired before the sender begins to check whether the actor is enabled for execution, to prevent another transmitter to the same destination actor from firing it (and thus removing items from the queues) while the check is in progress.

However, testing each of the actor's input paths for non-emptiness does not involve acquiring the lock for each datapath, since adding further items to a non-empty path will not affect enablement. It can be seen that, if the destination actor is enabled, its actor_lock is not released by this algorithm. The lock is in fact released by the RTS_ActGoing() function in the actor wrapper (see Figure 4.2a). This is called after data has been read from all of the actor's input datapaths. Another actor adding data to the input paths of the actor can then enter the critical section and check whether to fire another copy of the actor. The algorithm above was chosen on two criteria: firstly, to be as simple as possible, and secondly, to ensure that actors are fired as early as possible. The second criterion relies on the underlying parallelism package to place new actor threads on a run-queue if no processors are currently available to execute them. A more sophisticated algorithm could be used to ensure that there were always as many actors executing as processors available, to avoid the thread creation and termination overheads by starting a set number of threads which would execute one actor and then another. However, such an algorithm would also need to maintain a queue of enabled actors, in order to avoid having to check the enabled state of every actor in the roadmap. Assuming the overhead of the two queueing mechanisms was comparable, the only difference then would be the thread creation overhead; and since the point of threads is that they are lightweight to create and destroy, this possible gain was not deemed to be worth the extra complexity of code.

4.4.3 Other Functions

Apart from creating the company register, maintaining the roadmap, handling the transmitting and receiving of data items through datapaths, and firing actors, the run-time library's other main task is to support the MeDaL library actors. Since some file handling operations can take orders of magnitude longer than the operations done within the transmit function described above, file handling is not done within actor threads. Instead, a dedicated input/output handler thread must be set up, to monitor any input files (including the standard input) and accept any data from them, using the normal transmit function to place it on a datapath; and to accept any data items sent to file outputs (including the standard output) and write them to the appropriate file. Data items sent to file output actors are handled in a special way in the transmit function (but not illustrated above, for simplicity);

rather than being placed in a datapath like other items, data to be sent to a file output actor is placed in a specially designated buffer of shared memory. The run-time system's input/output thread monitors this buffer, so as to retrieve new output data to be sent to a file. The halt actor is very simple to implement using shared memory. This, too, is handled in a special way by the transmit function; when data is sent to a halt actor, a global flag is simply set indicating that no actors may be fired. This flag is checked in the transmit routine (again, not shown above for simplicity) at the same time as the other conditions for firing (whether the destination actor is a depth actor, or is not already running). Thus, after the flag has been set, all currently executing actors run to termination as normal and shut down gracefully, but no new actors are started. The final function of the run-time system is to detect termination, and this too is very simple, using the running_count shown in Figure 4.4a. When this count reaches zero - whether through normal circumstances or through the halt flag preventing any enabled actors from firing - termination of the program occurs. Using the techniques described in this section, a fully working shared-memory run-time system has been implemented on the Encore Multimax, and the performance of this run-time system, and of the way in which it executes programs, are described in the next chapter.
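The special handling of file output actors, the halt actor and termination detection can be summarised by the following additions to the transmit algorithm of Figure 4.4a. This is a sketch only; the flag and buffer names are invented for illustration and do not appear in the implementation described.

/* extra cases at the start of the transmit routine */
if (destination actor is a file output actor) {
    place data item in the designated shared-memory output buffer;
    return;                            /* the input/output thread writes it to the file */
}
if (destination actor is the halt actor)
    set halt_flag;                     /* no further actors may be fired */

/* the firing test of Figure 4.4a gains one extra condition */
if (enabled and not halt_flag) { mark actor as running; ... }

/* termination detection, whenever running_count is decremented */
if (running_count == 0)
    terminate the program;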

4.5 Summary

This chapter has described the issues and problems involved in implementing a programming system based on the MeDaL notation, on both shared-memory and distributed-memory multiprocessors, and has presented techniques and solutions to address them. Although MeDaL was designed for use in the whole software life-cycle of design, implementation, debugging, performance tuning, and maintenance, the work described in this chapter concentrates heavily on software support for implementation, though clearly the design of the programming system has some implications for design and debugging. However, success in implementing MeDaL programs is the key to the success of the whole MeDaL concept, since unless

efficient implementation is possible, MeDaL cannot be considered a successful parallel programming tool. This chapter has sought to demonstrate that it is possible to implement MeDaL programs directly on both main types of MIMD architecture. Indeed, the implementation techniques described have proven successful, in the sense that the major parts of the programming system described here have been implemented and real MeDaL programs have been run, on both architectures used. However, this is of course a very limited definition of success. In order to consider the fuller picture, the efficiency and effectiveness of software developed using MeDaL must be examined, and this forms the focus of the next chapter.

Chapter 5

Performance Evaluation

Having demonstrated that it is possible to implement a programming system based on MeDaL on medium-grained multiprocessors, a far more complex question arises: whether such a system is worthwhile. The question is complex because there is no general way in which one can completely evaluate the efficiency and effectiveness of a programming system. This complexity arises from the general-purpose nature of programming languages and systems; they can be used for many different purposes, and the criteria for success vary between different cases. Because MeDaL is designed to support all the phases of the software life-cycle, there are many components which must be designed and tested in order to obtain information about MeDaL's suitability for those phases of software development (these modules and their roles were described in Chapter 4). The implementation and evaluation of all of these modules is beyond the scope of this thesis. However, it can be seen that the MeDaL run-time system is a key component on which the success of the whole system hinges. MeDaL actor code does not contain parallel constructs: it is the run-time system which manages parallelism for the whole program. It is therefore vital to the success of MeDaL as a parallel programming system that this management of parallelism by the run-time system can be implemented efficiently enough for programs implemented using MeDaL to exhibit speedups when running on multiple processors. If this is not so, then clearly the exercise as a whole can be regarded as not worthwhile. The term "speedups" used above is a vague and somewhat controversial term, since it implies a comparison, but it is not clear exactly what the comparison should be between. Should the time taken to execute a program in parallel be compared with: the same program running on only one processor; a program with the same functionality written purely sequentially, without any parallel constructs; a program with the same functionality using an alternative method of parallelisation; or a theoretical "ideal" version of the algorithm being used? None of these is an entirely

fair comparison. However, as the primary aim of this chapter is to examine the behaviour of the MeDaL system, the latter two can reasonably be ignored. Comparison to alternative methods of parallelisation is of only limited interest, since it would provide as much information about the other parallelisation methods as about MeDaL. Comparison to ideal versions of the algorithm being parallelised would provide information about the algorithm rather than about the performance of the MeDaL system. Thus, this chapter is limited to a definition of speedup whereby

speedup = Tserial / Tparallel

where Tserial is the time taken by a serial version of the program - either a MeDaL program running on one processor (hence serially) or a program based on the same algorithm but without using MeDaL or any other parallel constructs. Using this definition of speedup, when employing 2 processes (in this case, actors) to do some computation, the ideal would be that the speedup would be 2 (i.e. that the time taken would be half as long). When employing 3 processes, the ideal speedup would be 3, and so on - this is known as linear speedup. In practice, however, as stated in Amdahl's Law [Amda67], there is usually some part of the program, involved with starting up parallelism, which cannot itself be parallelised - it must remain serial. Thus, there is an upper bound on speedup which means that linear speedup can rarely, in practice, be achieved. An alternative to measuring speedups is to measure the efficiency of a parallel program. One useful measure of a parallel program's efficiency is the proportion of the program's total execution time which is spent executing the code which actually performs the "useful" algorithm, rather than the code which implements the parallel constructs. Of course, the structure of the parallel program has a significant effect on efficiency - an algorithm which requires frequent synchronisation between processors, for instance, is likely to be inefficient. Moreover, a programming language can be used to express a wide range of programs (usually each having a number of possible structural forms). Thus, it is not possible to prove in general that MeDaL programs are efficient. However, it remains that an efficient implementation is needed to underpin an efficient design of any parallel program; so if it can be shown that the means of

parallelisation used by MeDaL do not, in themselves, make parallel programs unacceptably inefficient, it follows that efficient parallel programs can be implemented in MeDaL. Therefore, this chapter concentrates on evaluating the efficiency of an implementation of a MeDaL run-time system which provides the parallelism for MeDaL programs. Only once this has been covered are the speedups recorded on an example application presented, to provide a general guide to the success of MeDaL. The remainder of the chapter describes the implementation environment, and the key parts of the implementation which were measured. The experimental results themselves are then presented and evaluated.
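Before turning to that, the upper bound implied by Amdahl's Law [Amda67], mentioned above, can be made concrete with a small worked example. This is an illustration of the general law only, not a figure measured in this work: if a fraction s of a program's execution must remain serial, then with n processors

    speedup(n) = 1 / (s + (1 - s)/n)

so with, say, s = 0.1 and n = 8 the bound is 1 / (0.1 + 0.9/8), approximately 4.7, and no number of processors can push the speedup beyond 1/s = 10.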

5.1 Implementation Environment

Since the work implementing a MeDaL run-time system on a distributed architecture (the Intel iPSC/2 - see Chapter 4) did not extend over a long enough period to produce a fully working implementation, this chapter deals only with the implementation on a shared-memory multiprocessor, the Encore Multimax. The hardware used was a Multimax 520 system, with 14 30MHz NS32532 processors and 96MBytes of memory. The processors in Multimax systems are fully symmetric, and communicate with a logically single shared memory, through a common bus with a typical transfer rate of 100MBytes/second. There are two processors to each processor card, which share 256KBytes of write-deferred cache memory on each card. The mention of cache memory is relevant because its presence can have a significant effect on the observed execution times of parallel programs, not only because access to cache memory is faster than access to main memory, but because cache hits reduce the need for access to memory through the bus, thus reducing the chances of contention between processors for the bus. This latter effect can be a major cause of loss of efficiency in shared-memory parallel processing. The maintenance of cache memory in shared-memory multiprocessors is relatively complex, since cache coherency (consistency between two or more caches which may hold the same word of shared memory) must be maintained. The mechanisms for providing cache coherency, however, are transparent to the application programmer. The operating system available on the Multimax was UMAX4.3, a variety of Unix similar to BSD4.3, though with System V-like extensions (including the


provision of shared segments) and system calls to manage concurrent access to shared memory. Parallel programming is possible at the operating system level, since Unix processes are scheduled to run on any available processor. However, the synchronisation mechanisms available at this level are heavyweight; they require many instructions, and take of the order of milliseconds to execute. To be efficient, therefore - by reducing this overhead to an insignificant proportion of the total execution time - the size of the computations undertaken by each processor would have to be correspondingly large, certainly of the order of thousands of instructions. Thus, concurrency at this level could be described as large-grained parallelism. However, parallel programming at a finer-grained level is also provided for on the Multimax, in the form of run-time libraries which provide synchronisation functions which operate directly in shared memory without using the concurrency system calls provided by the operating system kernel. This effectively gives the programmer a way to bypass heavyweight operating system constructs. The finest-grain interface available is called EPT, the Encore Parallel Threads library - a thread being the lightweight equivalent of a Unix process. Using this library, programs can be built in which the individual processing tasks can run as (potentially concurrent) threads; these threads can be created and destroyed with a much smaller overhead than that of the creation and destruction of Unix processes, enabling programs to be efficient at a finer grain size. In fact, threads run within a number of Unix processes, which are created at the start of the program's execution and not destroyed until the EPT-based program terminates. In other words, there is a hierarchy:
• threads (medium-grained) can be run in any of a group of participating
• Unix processes (large-grained), which can execute on any available
• processor.
The EPT run-time library handles the decisions about the scheduling of threads within the available processes, and the operating system handles the decisions about the scheduling of processes on processor hardware. Thus, application programmers only need to structure their programs into individual threads, and handle the synchronisation between these threads using primitives such as semaphores provided by the EPT library. These synchronisation primitives must


also be used to regulate access to shared memory; the onus of shared memory management is on the programmer. This is the programming environment in which many parallel applications are developed on the Multimax (and similar environments are available on similar architectures). However, the management of shared data structures and synchronisation between threads can require much programmer effort. MeDaL aims to provide a "higher-level" programming system in which all control of parallelism is handled by the system, and the programmer need only write sequential processes.
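To make the contrast concrete, the following sketch shows the kind of code an application programmer must write at the thread-library level. It is an illustration only: the routine names (create_thread, sem_wait, sem_signal) are generic placeholders, not the actual EPT interface, which is not reproduced here.

/* programmer-managed parallelism at the thread-library level (names illustrative only) */
shared result_buffer;                 /* data structure in shared memory */
semaphore buffer_lock = 1;            /* protects the buffer against concurrent access */

worker(work_item) {
    result = compute(work_item);      /* the useful, sequential computation */
    sem_wait(buffer_lock);            /* synchronisation is the programmer's responsibility */
    add result to result_buffer;
    sem_signal(buffer_lock);
}

main() {
    for (each work_item)
        create_thread(worker, work_item);   /* one lightweight thread per task */
    wait for all worker threads to finish;
}

Under MeDaL, by contrast, the programmer writes only the sequential body of compute(); the locking, thread creation and termination detection are handled by the run-time system described in the next section.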

5.2 MeDaL Run-time System Implementation

The MeDaL run-time system undertakes the tasks of memory management and the management of, and synchronisation between, medium-grained processes (threads on the Multimax). It does this by providing a second run-time system "between" the MeDaL programmer's sequential code sections and the EPT library. Since MeDaL uses a dataflow model, process management is via the firing of actors (each actor being executed in one thread), and the only access to shared memory (requiring synchronisation to prevent concurrent access) is through datapaths; so these two functions (actor firing and datapath management) are the main tasks which the MeDaL run-time system performs. Sections 5.3 and 5.4 evaluate the efficiency of the implemented run-time system, but first it is necessary to understand specifically what the run-time system does. This section briefly describes the implementation of the MeDaL run-time system in the environment described in the previous section.

5.2.1 General Configuration

As discussed in sections 4.1 and 4.4, the approach taken on the shared-memory architecture was to provide a generalised run-time system. Thus, the final runnable parallel program consists of:
• the actor code, written by the application programmer;
• the actor wrappers, which envelop each actor in an EPT thread;
• roadmap-generating code, derived from the MeDaL graph;
• the generalised MeDaL run-time library, which makes calls to the EPT library.


The first three components are compiled from the MeDaL graph and actor code supplied by the programmer; these are then linked to the MeDaL run-time system library (called RTS) and the EPT run-time library (libept). Control enters the program through RTS, which sets up the thread environment via calls to libept. The RTS then calls the roadmap-generating code, which in turn calls RTS functions to register each company (consisting of actors and datapaths) into the roadmap maintained by RTS. In this way, the functions of the RTS are generalised - the code for each specific MeDaL application is separate.

Once this roadmap-generating code has finished executing, the RTS runs the source actor code for the root company; the data items placed on datapaths by the source actors normally cause other actors to fire, and so execution of the MeDaL program, potentially in parallel, begins. The procedure described above is essentially a "set-up" procedure necessary before parallel execution can begin, and as such imposes an overhead on the execution time of the program. However, the extra time needed to perform this (over a program written directly in the EPT environment without using MeDaL) only amounts to a few hundred instructions, or a few thousand for a very complex application. Given that parallel programming is only needed in cases where large amounts of processing are done (millions of instructions at the least), this set-up overhead is not significant, and will not be considered in detail.
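The registration sequence performed by the roadmap-generating code can be sketched as follows. The RTS_Register... names are invented purely for illustration; only the overall calling sequence (one registration call per company, actor and datapath) is taken from the description above.

/* roadmap-generating code for one company (names illustrative only) */
build_roadmap() {
    company = RTS_RegisterCompany("matrix-mult");
    a1 = RTS_RegisterActor(company, split_actor_wrapper);
    a2 = RTS_RegisterActor(company, multiply_actor_wrapper);
    RTS_RegisterDatapath(company, a1, 1, a2, 1);   /* a1's output port 1 feeds a2's input port 1 */
    /* ...one call for each remaining actor and datapath in the MeDaL graph... */
}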

5.2.2 Specific Functions

The set-up overhead is insignificant because it is incurred only once for each execution run of the program, and is therefore an overhead of the whole program. However, MeDaL memory management (e.g. the maintenance of datapaths) and actor-firing functions are called repeatedly during the lifetime of the program, and therefore have a much more important effect on the overall efficiency of the program. These operations are now described in more detail. The main functions are:

• Transmit an item of data to an output datapath;
• Receive an item of data from an input datapath; and
• Yank a "sticky" item of data back from an output datapath.
The algorithms for each of these are discussed in turn below. However, there is one common factor: each of these functions alters the contents of a datapath. To prevent

concurrent access, which could potentially leave the datapath in an inconsistent state, each datapath has a lock associated with it (in fact, a semaphore) which must be acquired before any operation can take place. Consider the following scenarios:
• Two actors share an output datapath (where a merge actor has been optimised out). Since any two actors can potentially run concurrently, they could both attempt to write at the same moment.
• An actor Z has two input paths A and B. A contains several items of data, B none. While one actor writes to A, another writes to B. The latter causes Z to fire, and so read data from both A and B. Therefore, there could be an attempt to both write to, and read from, A at the same time.
These scenarios illustrate that, in general, both read-write and write-write conflicts can occur on datapaths. A simple scheme was adopted by which the datapath's lock must be acquired before any read or write operation proceeds. A more sophisticated scheme might, for instance, check for the existence of other actors sharing a datapath, or for other inputs to the destination; however, the extra complexity of code needed to do this was not felt sufficient to justify its inclusion in this simple implementation. With the principle in mind that the datapath's lock must be acquired before any operation on it, the three main functions Transmit, Receive and Yank may be described. When an actor wishes to transmit an item of data on its output port, it simply passes the data (or, in the case of structured data such as arrays, a pointer to the data) and the number of the output port (output ports being numbered from left to right) to the RTS function Transmit. This function consults the roadmap to map the output port to a datapath and then acquires the lock for that datapath, possibly blocking until the lock is released if another actor already owns it. Having done this, the next step is to place the data item onto the tail of the datapath's queue structure, extending the size of the queue if necessary. This size extension requires the allocation of a new block of memory for the queue, and the copying across of the elements from the old to the new memory areas - a relatively heavyweight operation. To ensure that re-allocation does not happen often, whenever the queue becomes full its size is doubled; therefore, datapath extension rarely causes other actors to block (more complex algorithms for extension involving heuristics, or non-contiguous memory areas, would also be possible). Having added the data item to the queue and possibly adjusted its size, the datapath's lock is released.
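The queue handling just described can be summarised in the same pseudocode style as Figure 4.4a. The structure field names are invented for illustration; only the locking and doubling behaviour is taken from the description above.

/* Transmit: append one data item to the tail of a datapath's queue */
path = roadmap lookup of (sending actor, output port number);
acquire path.lock;                        /* may block if another actor owns the lock */
if (path.count == path.capacity) {
    allocate a new block of 2 * path.capacity elements;
    copy the path.count existing elements into the new block and free the old one;
    path.capacity = 2 * path.capacity;    /* doubling keeps re-allocation rare */
}
path.queue[path.tail] = data item (or pointer to it);
advance path.tail; increment path.count;
release path.lock;
/* the firing check of Figure 4.4a then follows */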


Transmitting a data item may, of course, result in the datapath's destination actor becoming enabled, and so firing; this too takes place within the Transmit function. Having added the data item being transmitted to a datapath, Transmit looks up the path's destination actor. If the actor is known to be a general-purpose (non-depth) actor which is already running, it cannot be fired immediately, and so no further action is taken. Otherwise, Transmit now acquires a lock pertaining to the destination actor (to prevent two actors checking for fireability concurrently), then the lock for each of that actor's input paths, checking as it goes along that each path is non-empty; if a path is empty, the procedure is aborted and all locks already obtained are released. If, however, all paths are found to be non-empty, the actor is fired, by telling EPT to run the actor's wrapper as a thread. EPT places the new thread on its run queue before returning (the new thread runs immediately only if a processor is available). This EPT operation adds a significant overhead to the time taken by Transmit (see below), but only in the case where Transmit does actually result in the firing of another actor. Finally, the datapath locks are released again. Note that if an actor is fired, Transmit does not release the destination actor's lock - see Receive, below.

It can be seen from the discussion above that Transmit acquires and releases the lock associated with its datapath, and then acquires and releases a number of further locks, depending on how many input paths the receiving actor has, and what these input paths contain. At best only one path lock will be acquired and released; at worst, for an actor with n inputs, there will be n acquisitions and n releases (in the case where only the last path is empty). Turning to the Receive function, this is called by the actor wrapper code (when it is run as a thread by EPT). Receive is called for each input path. Like Transmit, the number of the input path must be mapped to an actual datapath (paths being numbered from left to right as usual). The datapath's lock is acquired, the data item at the head of the queue is read, and the queue size decremented; then the lock is released. No further work need be done by the Receive function. Once Receive has been called for each of an actor's input paths, the actor wrapper can release the actor's lock, allowing any Transmits in progress to check whether it can be run again. The Yank function is similar to Receive, except that it receives data from a


given output path, rather than an input path. Like Receive, it is called by the actor wrapper as part of the firing procedure; the actor wrapper checks each output path in turn to see whether a "sticky" item of data is present. Since only the producer can affect sticky data items, there is no need to acquire the datapath locks for this unless the producer is a depth actor (though the mapping from output path to physical datapath must still be done). When sticky data is detected, Yank is called on that datapath. Yank simply acquires the datapath lock, reads the tail item of data, decrements the queue size and releases the lock. Thus, the data item is completely removed from that path; it is up to the actor to re-transmit it (stickily or otherwise). Since the mapping of output path to datapath has already been done, Yank appears to take slightly less time than Receive (see below). However, the two operations are in fact directly equivalent. Having discussed the functionality of these three key components of the RTS, and shown what the algorithms involve, the actual times taken to execute these algorithms can now be examined.
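As a companion to the Transmit sketch above, Receive and Yank can be outlined in the same way. Again, the field names are invented for illustration only.

/* Receive: take the item at the head of an input datapath */
path = roadmap lookup of (receiving actor, input port number);
acquire path.lock;
data item = path.queue[path.head];
advance path.head; decrement path.count;
release path.lock;

/* Yank: reclaim the sticky item at the tail of an output datapath */
path = roadmap lookup of (producing actor, output port number);
acquire path.lock;                        /* strictly needed only if the producer is a depth actor */
retreat path.tail;                        /* the most recently transmitted (sticky) item */
data item = path.queue[path.tail];
decrement path.count;
release path.lock;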

5.3 Experimental Results

To determine how efficient or inefficient the presence of the MeDaL RTS makes a parallel program, one first needs to know the amount of execution time spent in the RTS as a fraction of the total execution time of the program. Therefore, the starting point in measuring this fraction is to measure the time taken by the three main RTS operations, Transmit, Receive and Yank. The Multimax used for this experimentation is a multi-user timesharing system running a complex operating system. As it was not possible to obtain sole use of the machine for experimentation, all programs were potentially subject to interference from other processes running on the machine at the same time; while this interference does not affect the functionality of programs, it does affect the timing, since at any time the program may be suspended to allow other processes the use of processors. Therefore, in order to approximate a true picture of the time taken for particular operations, the test programs were each run approximately 30 times, and the lowest observed time was recorded. The lowest observed time represents the program run in which there was the least operating system interference, and so provides the most accurate approximation to the real cost (in


terms of time) of the RTS operations. In addition, all experiments were run at times when the average system load was less than 1.0 (i.e. averaged over one-minute intervals, there was less than one runnable user process across all 14 processors). Of course, if the test programs had been run more times, lower execution times might have been recorded; however, the standard deviation of results was found to be sufficiently small not to warrant more runs. The mean execution time was typically 10% higher than the lowest observed time. It should also be noted that all of the test programs were run in the same execution environment, namely within an EPT thread. This is true even in those cases below which do not involve parallelism, to enable fair comparisons. The following sections give the results of experimentation first with the basic RTS functions, and then with more complex programs which incorporate them. The performance of a real application program - the matrix multiplication example from Chapter 3 - is then presented and discussed. Finally, the efficiency of these programs is examined.
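The measurement procedure just described can be summarised by the following sketch. It is illustrative only; the actual test harness is not reproduced here, and the warm-up step reflects the caching effect noted in the next section.

/* measure one RTS operation: repeat about 30 times and keep the minimum */
call the operation under test a few times first;    /* warm the cache */
best = infinity;
for (run = 1 to 30) {
    start = read clock;
    perform the operation under test;                /* e.g. Transmit, Receive or Yank */
    elapsed = read clock - start;
    if (elapsed < best) best = elapsed;
}
report best;        /* the run with least operating-system interference */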

5.3.1 Basic Functions

Figure 5.3a shows the time cost of the main RTS functions. The timings, of course, relate to the hardware platform used and are of no significance to other (perhaps newer) hardware; for this reason, the number of source-level instructions involved is also given. However, these too are of only limited value since some of the instructions are in fact function calls to the EPT library, which may then in turn call the operating system; so not all "instructions" are of the same cost. To further aid comparison, the times taken to execute tight loops of 100 floating-point multiplications and 100 integer multiplications are also given.

Operation             Instructions   Time (μs)
Yank                  13             35
Receive               16             37
Transmit              64             118
Transmit + fire       74             445
EPT thread startup    N/A            161
100 flop mults        100            277
100 int mults         100            146

Figure 5.3a: Basic RTS functions


It can be seen that the call to libept to start up a new thread (or more accurately, place it on a run queue) forms a significant part of the cost of any Transmit call which involves firing another actor: 36% in the case of firing a two-input actor. In fact, it was observed that the first few calls to EPT to start up a new thread took significantly longer than subsequent calls; this is thought to be due to memory allocation overheads and a cacheing effect (the relevant code is present in cache after the first call, and so can be executed much more quickly). The cacheing effect was also observed in other functions, so the minimum observed times are from programs in which the appropriate functions were called several times before the timing was recorded, to ensure presence in the cache. There are two other minor points to note about the timings given in Figure 5.3a. Firstly, the timings given for Transmit refer to a specimen situation in which the destination actor has two inputs. Timings for Transmit to actors with fewer or more input paths were found to vary by only a few microseconds; this could extend into the tens of microseconds only for actors with large numbers of input paths, a rare occurrence. Secondly, for comparison with the timings given for floating-point and integer multiplications, it was found that in 500 microseconds, 182 floating-point multiplications or 327 integer multiplications could be executed. These figures will be referred to later in this chapter. For floating-point multiplications, this is equivalent to around 0.5 megaflops, a performance figure which is verified by benchmarks such as LINPACK. The integer operation was deliberately chosen to be a poor case, and equates to around 1 MIPS; benchmarks show that for other integer-based operations, the Multimax can reach 7.5 MIPS.

Having determined the time-cost of the basic RTS functions, the overhead caused by their use in parallel programs can now be considered.

5.3.2 Use of Basic Functions within Programs

Although Figure 5.3a above gives an indication of the minimum overhead of using the MeDaL RTS functions, these overheads do not constitute the whole cost of using RTS for the parallelism constructs and shared-memory management of parallel programs. In addition to these costs, there are the costs of the actor wrappers and of the programming language interface between RTS and whatever


language the actors are written in; in the case of this implementation, C++. Whenever an actor wishes to transmit an item of data on a datapath, an overhead is incurred before the RTS is even called. In the case of C++, transmission is triggered by an assignment to a variable representing an output path; assignment is overloaded, so a function which handles assignment is called in the C++ library (essentially part of the actor wrapper) which in turn calls the RTS. These overheads are important because it is their cost which will determine how efficient parallel programs which use the MeDaL RTS can be. In particular, if the overhead of the RTS functions grows as the amount of processing being done by the rest of the program grows, this would place an upper bound on the range of algorithms which can be efficiently implemented using MeDaL. Therefore, this section examines whether this is in fact the case. Rather than trying to time each such overhead individually, it is more informative to look at the overall picture using program scenarios. To examine these scenarios, consider a parallel program which consists of two processes, A and B, which require time Ta and Tb to complete, respectively. Clearly, the fastest possible sequential version of this program would take time Ta + Tb to run, and the fastest possible parallel version (in theory) would take max(Ta, Tb). Assume further that Ta = Tb; then in theory a parallel version (with no data dependency between the two processes) could execute both processes in time Ta. The question now becomes: what overhead does the use of the MeDaL RTS in practice add to this time? This depends on exactly what the relationship between A and B is. The possible scenarios are:
1. A and B are the same code, operating on different input data.
2. A and B are two different pieces of code, operating on the same or different input data.
3. A and B are different pieces of code, and one is dependent on the other.
In MeDaL, scenario 1 is represented by a depth actor; two items of data are sent on its input path and two copies of the actor are fired. Scenario 2 corresponds to horizontal parallelism, where the two data items are transmitted on two separate datapaths, causing each of the two actors to fire. Scenario 3 corresponds to vertical parallelism, in which the output path of A, for instance, is the input path of B. To look at these scenarios another way, in scenario 1, actors A and B share two

datapaths - the input path and the output path of the depth actor. In scenario 2, no paths are shared - A and B have separate input and output paths. In scenario 3, one path, the path between A and B, is shared. The presence of shared datapaths is important because, as described earlier, access to datapaths must be mutually exclusive - thus when two actors both require access at the same time, an unavoidable extra overhead on execution is created. Scenarios 1 and 2 are the most interesting because they constitute the worst and best case, respectively, when considering the use of RTS functions. In the case of the depth actor, both actors must contend for access to a datapath both when reading their input data and when writing their output data. In scenario 2, neither actor has to contend when either reading or writing. Figure 5.3b illustrates what happens during execution of scenario 1. There are assumed to be three processes running concurrently: one which transmits the initial data, and two which receive the data, process it, and transmit further data. Each of these processes consists of several individual activities, which are labelled as follows. The label T represents the Transmit function; the label R the Receive function; A the actor process; and T', a transmit function which does not result in an actor firing, since any further actor firing can be considered the overhead of that actor, rather than actor A. The times taken by these activities are Tt, Tr, Ta and Tt', respectively. Dotted lines in the diagram indicate the presence of dependencies between the processes; the order of execution illustrated has both Transmit functions occurring before either Receive is allowed access to the datapath which they all access. This need not, in practice, be the case; however, the overall time is not affected.

[Time-activity diagram omitted: Process 1 performs the two Transmit operations (T); Processes 2 and 3 each perform a Receive (R), the actor computation (A) and a non-firing transmit (T'), with dotted lines showing the dependencies between processes. Time runs from left to right.]

Figure 5.3b: Time-activity diagram - scenario 1


It can be seen from this diagram that, since T' is known to take longer than R, one of the processes is blocked for time Tt' - Tr. The diagram illustrates that the minimum possible execution time in this scenario is 2Tt + 2Tt' + 2Tr + Ta. Of course, this does not take into account the extra overheads mentioned at the start of this section, so in practice the time taken is slightly longer. These overheads can be measured experimentally. Figure 5.3c illustrates a fragment of a MeDaL program representing scenario 1. The actor C transmits two items of data to depth actor A. Two copies of A then fire, each spending time Ta processing and then transmitting a further item.

[MeDaL graph omitted: source actor C with a single output datapath into depth actor A.]

Figure 5.3c: MeDaL diagram - scenario 1

A parallel program representing this MeDaL program was executed using the MeDaL run-time system, and the elapsed time was measured from the point when actor C first called Transmit, to the completion of the second Transmit called by an instantiation of actor A, thus measuring 2Tt + 2Tt' + 2Tr + Ta, plus any extra overheads incurred. In the experiments with this program, the time Ta was varied from 0 to 5000 microseconds, in steps of 500, to find out whether any extra overhead occurred depending on the time taken by actor A. The results are shown below.


[Graph omitted: total measured time (μs) plotted against Ta (μs), for Ta from 0 to 5000μs in steps of 500μs.]

Figure 5.3d: Timings - scenario 1

As the results show, from 1000μs upwards, the total execution time recorded increases in steps of roughly 500μs, suggesting that the run-time system overheads involved in this scenario are constant. Turning to scenario 2, Figure 5.3e illustrates the activity of the three processes in this case. Since the two Transmit functions operate on different datapaths, the corresponding Receive functions can start immediately.

[Time-activity diagram omitted: Process 1 performs the two Transmit operations (T) on separate datapaths; Processes 2 and 3 each perform a Receive (R), the actor computation (A) and a non-firing transmit (T') without contending for a shared datapath. Time runs from left to right.]

Figure 5.3e: Time-activity diagram - scenario 2

It can be seen from this diagram that the total minimum possible execution time in this scenario is 2Tt + Tt' + Tr + Ta. Again, an example program was constructed to

measure this time. Figure 5.3f illustrates a fragment of a MeDaL program to do this; the actor C transmits an item of data to actor A and to actor B, which each spend time Ta processing before transmitting an item of data on their respective output path.

[MeDaL graph omitted: source actor C with two output datapaths, one into actor A and one into actor B.]

Figure 5.3f: MeDaL diagram - scenario 2

The time from the start of the first of the two Transmit calls made in C until both A and B had completed their Transmit calls was measured, thus capturing 2Tt + Tt' + Tr + Ta as described above, plus the extra overheads incurred. Again, the time Ta was varied from 0 to 5000 microseconds in steps of 500. Table 5.3g shows the results obtained.

[Table omitted: total measured time (μs) against Ta (μs), for Ta from 0 to 5000μs in steps of 500μs.]

Table 5.3g: Timings - scenario 2


As one would expect, the times recorded are somewhat smaller for scenario 2 than those for scenario 1. Again, however, the timings obtained for each Ta can be seen to increase in steps of roughly 500μs, adding further evidence to suggest that the overheads involved in the RTS functions are fixed, and do not depend on the amount of processing being done by the actors they support. The implications of this are discussed in section 5.4.
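As a rough cross-check, the minimum times derived from the two time-activity diagrams can be evaluated using the Figure 5.3a timings. This is an illustration only, not an additional measurement: taking Tt = 445μs (Transmit + fire), Tt' = 118μs and Tr = 37μs,

    scenario 1:  2Tt + 2Tt' + 2Tr + Ta  =  890 + 236 + 74 + Ta  ≈  1200μs + Ta
    scenario 2:  2Tt +  Tt' +  Tr + Ta  =  890 + 118 + 37 + Ta  ≈  1045μs + Ta

Both expressions add a fixed amount to Ta, which is consistent with the constant steps of roughly 500μs observed as Ta was increased in 500μs increments; the measured times are larger still because they also include the actor wrapper and C++ interface overheads described at the start of this section.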

Having shown that the overhead imposed by the MeDaL RTS does not appear to vary with the amount of processing being done by the two processes running in parallel, one further issue must be addressed: that of whether the RTS overheads increase when increasing numbers of parallel processes are involved. This issue is dealt with in the next section.

5.3.3 Varying Numbers of Actors

In order to examine the behaviour of the MeDaL RTS in a situation in which both the time taken to execute an actor and the number of actors vary, timing measurements were taken from one of the example programs coded using MeDaL. A simplified form of the matrix multiplication example given in Chapter 3 was chosen, because it is a computational problem which is easily scaled to whatever size is desired. This ease of scaling makes it a tough test for any parallel programming system: because the problem itself scales so well, any shortfall from ideal speedup is attributable to the parallel system rather than to the algorithm. Therefore this application is a good test of MeDaL's potential. The MeDaL diagram of the program used is shown in Figure 5.3h. For further details of the algorithm used, see Chapter 3, section 3.4.2; the version used here is simplified only in that it contains no companies or file handling.


[MeDaL graph omitted: company matrix-mult. Source actors supply matrix A and matrix B; an actor splits matrix B into strips, which pass along the strips datapath to a depth actor that multiplies the matrix by the vectors; the result vector strips are merged back into the result matrix, which is printed to stdout.]

Figure 5.3h: Simplified matrix multiplication example

As in the program scenarios in the previous section, measurements were made of the time taken for the whole of the parallel portion of the program; specifically, from the start of the first Transmit operation on the datapath marked strips in Figure 5.3h, to the end of the last Transmit operation on the datapath marked result vector strips. Further, the RTS was modified so that the actor marked merge vectors into matrix was not allowed to fire, to avoid interference effects. In the timing experiments with this program, the sizes of the two matrices being multiplied were varied from 625 elements (25x25) each to 22500 elements (150x150) each. The number of processes allowed to participate in the computation, n, was varied from 1 to 8. For each experiment, the second matrix B was split up into n "strips" of columns to be multiplied by matrix A; thus for each value of n,

the depth actor fires n times, receiving a strip of vectors to be multiplied by matrix A. Naturally, in the case of n=1, the code involved executes completely sequentially. However, since matrix B is passed on in only one strip, the minimum number of RTS basic functions is called; Transmit is called only once on each datapath. When n=2, Transmit is called twice on each path, and so on. Thus, by comparing the times taken for varying numbers of matrix multiplication actors to the time taken by one actor, it is possible to derive an understanding of any loss of efficiency caused by the use of basic RTS functions. This comparison is a measurement of "speedup" as discussed at the start of this chapter - the "serial" version in this case being the same MeDaL program running on a single processor. However, the speedup figure for a given array size and a given number of actors contains more information than just the unparallelisable, serial fraction of the algorithm used. The amount by which a speedup figure is less than the ideal, linear speedup figure provides a good indication of the efficiency of the parallel program; the patterns in which these speedups vary can illustrate whether efficiency increases or decreases as the number of actors and array size vary. An easy way to identify these patterns is to examine a visual representation of the data. Figure 5.3i shows a three-dimensional bar chart of the data obtained by experimentation, plotting array size against number of processes against speedup.


[3-D bar chart omitted: speedup plotted against number of processors (1 to 8) and matrix size (25x25 to 150x150), with ideal linear speedup shown for comparison.]

Figure 5.3i: Matrix multiplication - speedup against sequential MeDaL program

For comparison, the "ideal" of linear speedup is also shown on the graph. Thus, where n=4, the ideal speedup would be 4; the graph shows that with an array size of 25x25, the speedup was some way short of this; with 50x50 arrays, the speedup was closer to 4; and for all larger array sizes, the speedup was very close to 4.

In fact, a clear general trend illustrated by the graph is that the larger the array size, the closer to linear the speedup becomes. In other words, the conclusion of the previous section - that no further overheads are incurred by increasing the amount of processing being done - holds. The other trend illustrated by the graph is that, for each array size over 25x25, adding each extra actor, i.e. each step from n to n+1, results in a similar increase in speedup. These steps are slightly larger for larger array sizes, but each matrix size over 25x25 shows a smooth, almost linear progression from small to large values of n. What this pattern suggests is that, above a certain data (and hence processing) size, extra MeDaL actors may be employed without incurring any unforeseen overheads, in other words without any significant loss of efficiency. Use of the


MeDaL RTS for parallelism within a program does not place any barrier to achieving close-to-linear speedup; at large problem sizes, the time spent executing the RTS basic functions becomes a very small fraction of the total execution time, thus causing only a small loss of efficiency. Unsurprisingly, when the amount of processing being done by each actor is small, the overhead of using the MeDaL RTS remains relatively large. This explains why the multiplications of 25x25 and 50x50 matrices do not speed up as well as those of larger matrices. The fact that, when multiplying 25x25 matrices, the maximum speedup is obtained when n=5 and declines for larger values is due to the algorithm used. The strips given to each actor were of equal size, the number of vectors in each strip being the integer division of the total number of rows in the array by n. The last actor handled the remainder from this division, if any; there was none in the case of n=5, since 25 is exactly divisible by 5. For larger values of n, each actor was doing small amounts of work, with the last actor doing even less. Thus, the overhead imposed by the RTS became more significant, and a drop in speedup was observed, in just the same way as in the scenarios described in section 5.3.2. The alternative way of measuring speedup, mentioned at the start of this chapter, is to measure gains in execution time not against a version of the same (MeDaL) program running on one processor, but against a purely serial program, using essentially the same algorithm but without any parallel constructs. As stated earlier, the basic algorithm involved in matrix multiplication scales very well. When processing larger matrices, the only overhead incurred (above the extra multiplication instructions) is to execute iteration constructs. However, any parallel version of this algorithm must partition the matrices, and transmit these partitions to the worker processes; and the larger the matrices, the greater this partitioning and transmitting task becomes. Moreover, because this partitioning must be done before the partitions can be processed, it is not parallelisable; so if it is included in the measurement of total time taken, it represents an increasingly large overhead. Thus, as matrix size increases, speedup would not increase as fast as it would if this overhead was fixed. However, as the overhead of partitioning grows only in proportion to the number of elements, while the time taken for the multiplication grows in proportion to the cube of the matrix dimension, partitioning represents


an increasingly small serial fraction of the total computation, and so one would expect speedups to increase even if only gradually. Speedup results for matrix multiplication were derived using the same figures used in the previous experiment, but compared against a purely serial matrix multiplication program processing matrices of the same range of sizes. Because the time for the MeDaL version was measured from the first Transmit operation on the datapath labelled strips in Figure 5.3h, the time for partitioning is included. The graph of speedups against number of processors against matrix size is shown in Figure 5.3j.

[3-D bar chart omitted: speedup relative to a serial program, plotted against number of processors (1 to 8) and matrix size (25x25 to 150x150), with ideal linear speedup shown for comparison.]

Figure 5.3j: Matrix multiplication - speedup against serial program

This speedup graph shows the expected behaviour: as matrix size grows, so too does the speedup, but more slowly than in the previous example, as the overhead of partitioning also increases. When more processors are used, the partitioning process again grows, and so the graph is flatter on both these axes. The fact that speedups do increase with larger matrices and more processors illustrates that the overhead of using the MeDaL RTS does not become prohibitive across a range of grain sizes (and experiments with larger arrays and more processors suggested that this pattern continues). Nevertheless, comparison with linear speedup shows that this is a fairly poor result. However, this result is a reflection on the implementation of matrix multiplication used as much as on the RTS implementation. Apart from the fact that serial matrix multiplication does not suffer from an increasing partitioning overhead, the MeDaL version suffered on two counts. Firstly, the actual multiplication algorithm was written for simplicity rather than efficiency; various shortcuts were ignored in favour of a simple working implementation. Secondly, and more importantly, the most obvious strategy for partitioning was used - sections of matrix B were copied into individual arrays representing the strips to be processed by the depth actor. This copying is relatively time-consuming, and could have been avoided by passing only pointers to shared memory. However, the latter is undesirable from the design point of view, since the use of shared memory undermines the property of freedom from side-effects; in this sense, it is against the MeDaL "philosophy," and hence it was not the strategy chosen in this case. These two drawbacks, added to the general overhead of partitioning for parallel matrix multiplication, account for the apparently poor performance of this application illustrated in Figure 5.3j. Had the whole MeDaL programming system been available, in particular the harness generator, it would have been feasible to implement other examples which did not have the drawbacks of the one actually implemented, and thus would have shown MeDaL in a better light; but the fact that wrapper code for applications had to be written by hand made this impossible in the time available. In any case, the aim of the exercise was to evaluate the efficiency of the run-time system, rather than the use of the MeDaL notation itself.
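The copy-based partitioning strategy criticised above can be sketched as follows. This is an illustration only; the identifiers are invented and the actual actor code is not reproduced here.

/* split matrix B (rows x cols) into n column strips and transmit each one */
width = cols / n;                              /* integer division */
for (k = 1 to n) {
    first = (k - 1) * width + 1;
    last  = (k < n) ? k * width : cols;        /* the last strip takes any remainder */
    allocate a strip of (last - first + 1) columns;
    copy columns first..last of B into the strip;   /* the copying criticised above; passing a
                                                       pointer into shared memory would avoid it */
    Transmit(strip, output port 1);                 /* each Transmit fires one copy of the depth actor */
}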

Having studied the behaviour of the MeDaL RTS implementation and a sample application, using the data presented in this section, these results can now be analysed.

5.4 Evaluation of Results

To understand the significance of the results described in the last section, it is necessary to return to the concept of grain size described in earlier chapters. Grain

size refers to the amount of processing done by a process between synchronisations - in the case of MeDaL, the amount of processing done by an actor between receiving its input data and transmitting its output data. Therefore, in considering the question posed at the start of this chapter, about whether MeDaL can be implemented efficiently enough to provide speedups for parallel programs, an alternative question can be put: is there a grain size for actors at which their execution under MeDaL is acceptably efficient?

5.4.1 Grain Size

The results in section 5.3.2 show that MeDaL overheads remain fixed as the amount of processing (grain size) in actors increases; thus, as stated above, the proportion of computation taken up by RTS functions decreases as the grain size gets larger. This means that, if MeDaL is efficient at any grain size, there is no upper bound on the grain size at which it is efficient. The results presented in section 5.3.3 further show that above a certain grain size, one can add further processors to work on the computation without a significant loss of efficiency (at least, on the fairly small numbers of processors available on the Multimax). This confirms that there is no upper bound on this machine. However, the figures obtained in the experiments in both these sections clearly show that there is a lower bound. A first guess at this lower bound is suggested by scenario 2 in section 5.3.2 - the "best case" for the RTS overhead of adding another actor. Given that the total amount of processing being done takes time 2Ta, the program first showed a speedup (over a sequential version) when Ta = 1500μs. Thus, one could say that the minimum efficient grain size for MeDaL is 1500μs on the Multimax; to put this in a more architecture-independent context, using the figures mentioned in section 5.3.1, this is equivalent to around 540 floating-point multiplications (or 980 integer multiplications). Using the worst case rather than the best case to determine the lower bound, the scenario 1 program began to exhibit a speedup when approximately Ta = 2500μs. If this is used as a measurement of the minimum efficient grain size, then the grain size is 2500μs on the Multimax, or roughly 920 floating-point multiplications. Of course, floating-point operations are in themselves something of a worst

case in terms of computer performance, so in a practical program, perhaps with a mix of integer and floating-point operations, the grain size (in terms of instructions) necessary for this implementation of MeDaL to yield speedups would in fact be somewhat higher. However, even if we conclude that the minimum efficient grain size of this implementation is in the range 1000-2000 instructions, this still falls clearly within what is suggested for the term "medium-grained" in Chapter 2, section 2.1.7.
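The instruction-count equivalents quoted above follow directly from the loop timings given in section 5.3.1; the following is simply the arithmetic behind the figures already quoted, not new data.

    best case  (Ta = 1500μs):  (1500/500) x 182 ≈ 546 floating-point multiplications
                               (1500/500) x 327 ≈ 981 integer multiplications
    worst case (Ta = 2500μs):  (2500/500) x 182 ≈ 910 floating-point multiplications

i.e. roughly the 540, 980 and 920 operations quoted above.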

5.4.2 Limitations

Identifying the minimum efficient grain size has its limitations. It is not generally possible to say that because a program has potentially parallel computations of above a minimum size it is "worth" parallelising, because of two factors:
• the desire for "acceptable" speedups; and
• the presence of unparallelisable sections of the algorithm.
The first of these two factors is hard to define qualitatively, not only because the standard of what may be "acceptable" will vary from one application to another, but also because it tends to be a fairly subjective judgement.

However, a general feel for the performance of the MeDaL RTS can be obtained by tentatively imposing some criterion of what is an acceptable speedup. For instance, if one arbitrarily suggests that only speedups which come within 25% of linear speedup are acceptable, then clearly the speedup obtained in the case of a 3000μs grain size in scenario 1 above (see section 5.3.2, in particular Figure 5.3d) is no longer acceptable. Although MeDaL can run 6000μs worth of processing in only 5351μs, this is only a speedup of 1.121 compared to a theoretical linear speedup of 2. In this example the grain size has to rise to 5000μs before the speedup reaches 1.502, crossing the arbitrary 25% line at 1.5. The second factor, that of unparallelisable components of an algorithm, is demonstrated very clearly in the matrix multiplication example. Because the timing was to include the RTS overhead of transmitting strips of vectors to the depth actor, the start time used was the start of the first call to Transmit made by the actor which splits matrix B into strips. However, this meant that the total time recorded included the time taken by that actor to generate the other strips (i.e., to partition that data). As the algorithm used was chosen for simplicity rather than efficiency,

for the smaller matrices the overhead imposed by this serial algorithm added significantly to the total execution time, contributing to the poor speedups seen for the smaller matrix sizes. This point is not limited to the matrix multiplication example, but is a general feature of parallel programming: where a non-trivial algorithm is needed to partition data so that it can be worked on in parallel, this algorithm must usually be executed serially, and as such causes a further overhead on parallelism, limiting the speedups which can be achieved. Moreover, the fact that the algorithm in the parallel section must deal with partitioned data may make it more complex; this is true in the matrix multiplication example, in which the actor which does the multiplying must deal with strips and know which column of matrix B each vector corresponds to. Of course, in a complex program, it is possible that other tasks can continue while partitioning etc. is going on, and so efficiency is not significantly decreased. But in simple programs, it can be a major factor.

5.4.3 Conclusions

One general conclusion, therefore, is that in deciding whether or not it is worth parallelising a computational problem, one should take into careful account not only the grain size of the computations, but also the potential for parallelism. The tighter the criteria required to make speedups acceptable, the closer attention must be paid to the algorithm used. This point is not limited to MeDaL, but can be applied to high-level programming languages in general. Further conclusions can be drawn from the performance of the matrix multiplication example. As explained above, the lack of efficiency of the partitioning algorithm led to less than ideal scalability in terms of the amount of processing being done. Therefore, when compared to the serial version, MeDaL exhibited fairly poor speedups. The partitioning part of the program was inefficient for two reasons: firstly, because of the way it was coded (including copying sections of data to transmit as separate data items); and secondly, because it was inherently serial. From the first of these two points, a conclusion can be drawn in the form of a recommendation to MeDaL programmers: for maximum efficiency, when shared memory is available it should be exploited. Had the matrix-multiplication example merely passed pointers to matrix B's position in shared memory (and the multiplication actor been coded with knowledge of matrix B's storage

structure), the program would have been considerably more efficient. In other words, a tradeoff can sometimes be made between efficiency and the much-vaunted "freedom from side-effects" property of the dataflow paradigm. Either maximum efficiency or minimum danger of side-effects can be chosen, but not both. On a shared-memory architecture, the MeDaL system does not enforce a choice one way or the other; it is for the programmer to decide, based on the particular case. From the second point - that the partitioning section of the program was inherently serial - one final conclusion can be drawn. It was noted earlier that in a more complex program, there may be other tasks which can execute concurrently with partitioning tasks, thus masking the inefficiency of such serial partitioning algorithms. The conclusion which can be drawn from this is that MeDaL is likely to be more useful for the implementation of complex parallel programs, with many potentially concurrent sections, than for simple programs. This conclusion was expected from the outset, and once again it is a general point about the nature of high-level programming: the higher-level the language, the more inefficiencies are introduced; so the smaller the program, the less benefit can be gained from using a high-level language. Because MeDaL provides horizontal, vertical and depth parallelism, and because it allows tasks to be executed as soon as resources allow, MeDaL is particularly suited to large, complex parallel programs. For such programs, whenever the number of enabled actors exceeds the number of processors, MeDaL can keep all processors busy and thus ensure maximum efficiency - this is a major benefit considering that the application programmer does not have to write any parallel code to achieve it. The more complex the program, the more benefit MeDaL is likely to bring - this is as true of MeDaL as of any other programming system. Of course, this statement is not conclusively proven. However, the experimentation with small, simple programs described in this chapter clearly indicates that the design of MeDaL should be no hindrance to the efficient execution of parallel programs. It has been shown, in particular, that:
• executing two actors is efficient so long as the actors are each doing a certain minimum amount of processing, consistent with the "medium-grain" size;
• this minimum remains constant when the total number of actors is


increased, and when the amount of work being done by each actor is increased.

Therefore, use of the MeDaL run-time system as a means of providing parallel constructs for a program allows efficient execution in the medium-grained context - assuming the algorithm to be parallelised can itself be implemented efficiently. It should be emphasised that all of the test programs were run using a realistic (if simple) implementation of the RTS. Furthermore, the test programs were also coded in a realistic way, i.e. the actor wrapper code, although written by hand, was written in such a way as to simulate the code which would have been generated from a MeDaL graph. The fact that the module of the MeDaL system to generate this code was not implemented prohibited the testing of the RTS with more complex applications; but the simple tests described here indicate that the components which would be used for more complex programs allow efficient execution within the constraints described above. The general conclusion, therefore, is that on the strength of the implementation used, the MeDaL programming system appears to support the efficient execution of parallel programs.
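To make the shared-memory recommendation above concrete, the following is a minimal sketch of the two coding strategies discussed, expressed using the C++ interface classes listed in Appendix A and the run-time system's RTS_Shmalloc routine. It is not taken from the thesis's actual example code: the function names and path bindings are invented, and it is assumed that the output objects have already been attached to their datapaths by wrapper code.

// Strategy 1: transmit matrix B itself as a data item; every send
// copies all n*n elements.
void send_b_by_copy(const float *B, unsigned n, Mary<float> &b_out)
{
    b_out = Mary<float>(B, n * n);   // copies the whole of B
    b_out.send();                    // transmitted as a separate data item
}

// Strategy 2: place B in shared memory once and transmit only its
// address; the multiplication actor must then know B's storage layout.
// Conveying a pointer through an integer path is only meaningful on a
// shared-memory machine whose pointers fit in an int (as on the Multimax).
void send_b_by_pointer(const float *B, unsigned n, Mint &b_out)
{
    float *shared_B = (float *)RTS_Shmalloc(n * n, sizeof(float));
    for (unsigned i = 0; i < n * n; i++)
        shared_B[i] = B[i];          // copied into shared memory just once
    b_out = (int)shared_B;           // only the address travels down the path
}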

Chapter 6

Conclusions

In order to present a concise view of the conclusions which may be drawn from this work, this chapter first gives an overview of the various aspects of the research described in the previous chapters, and then draws these aspects together. The possibilities for future related research which this work opens up are then discussed.

6.1 Overview

The thesis opened by describing the current state of parallel computing, and in particular how the demand for high processing performance at low cost has led to a variety of architectures which allow concurrent processing. Of these architectures, two types of MIMD machine - shared-memory and distributed-memory multiprocessors - have become particularly prevalent. These two types of multiprocessor are attractive because they utilise standard processing elements, already manufactured in bulk, and therefore offer a good price/performance ratio. Furthermore, the control of parallelism on these machines is left to software rather than being implemented in hardware, which makes them general-purpose. This generality of application comes (as is often the case) at the expense of some speed, since flexibility at a higher level - in this case system software - generally inhibits optimisation at a lower level, in this case hardware. However, this flexibility, as well as the relatively low cost of multiprocessor machines, has made them successful. Because synchronisation on these multiprocessors is not as fast as on some specialised architectures, they do not offer fine-grained (inter-instruction) parallelism; rather, they offer medium-grained or coarse-grained parallelism. Medium-grained parallelism is typically exploited by the use of specifically-designed parallel programming languages which map onto the constructs for parallelism provided by the machine's operating system.


These parallel programming languages come in a variety of models. Often, the model of parallelism used for a language is based upon a target architecture; for instance, many programming models assume a message-passing architecture, and base their constructs for synchronisation on the explicit transfer of data through structured channels, which map onto the inter-node links of a message-passing architecture. Alternatively, the model may assume the presence of shared resources such as semaphores which can be easily implemented on a shared-memory architecture. Indeed, many parallel programming languages are no more than well-established HLLs with constructs added to control the specific features of one architecture. The advantage of this approach is that since the parallel language code maps closely to the target architecture, the resulting code controls the machine efficiently and so achieves good processing performance. The disadvantage is that the code - and moreover the programming skills involved - are to a large extent unportable. This can only have hindered the success of parallel architectures. Of course, a few more architecture-independent parallel languages have been developed. However, no one leading candidate has emerged, and this may be partly due to two main factors: the lack of integrated programming environments, and the fact that such languages are mostly text-based. To consider the first of these factors, it is necessary to remember that parallel programming is inherently more complex than sequential programming, and hence parallel programs are more difficult for the programmer to design, implement, debug and tune for performance. However, many parallel programming languages have tended to concentrate on implementation (and, to an extent, debugging). When tools for these activities are available, they often come in the form of piecemeal "toolsets" rather than integrated systems based around a common programming model. The fact that the programmer must learn and remember the user interface for each tool makes the task of programming more difficult. In the field of sequential programming, this has led to the development of integrated CASE tools which aid the programming process from design through to maintenance; however, such tools are still rare in parallel programming. The second factor - the largely textual nature of parallel programming languages - is also problematic because of the complexity of parallelism. Textual languages are inherently abstract and sequential, and these factors, added to the inherent complexity of parallel programming, make textual parallel languages


uniquely difficult to understand. However, textual parallel programming languages remain predominant, probably because standard graphical displays have only relatively recently become affordable. From these arguments it follows that there is a need for graphical parallel programming languages around which an integrated programming environment can be based. While there are a few such programming environments, not all are designed for medium-grained multiprocessor machines, and of those which are, none use the dataflow paradigm. Yet the dataflow paradigm seems a feasible candidate for such an integrated parallel programming system. It is an inherently visual model, and has been used as the basis of sequential programming languages because it is easy to use. But because dataflow graphs contain implicit parallelism, dataflow is well suited for use as the basis of a parallel programming language. One possible solution, therefore, to the difficulty of parallel programming on medium-grained multiprocessors, might be to base a parallel programming language on the dataflow model, combining medium-grained sequential computations (written in an arbitrary sequential programming language) together using a dataflow graph. However, existing dataflow graph notations are not well suited to the medium-grained level; many contain specialised fine-grained operations which would be inefficient at the medium-grained level, and others contain no specialised operations at all, leaving much work to the programmer. As a result, it was necessary to design a new dataflow graph notation specifically for use at the medium-grained level. This notation is called MeDaL (Medium-grained Dataflow Language) and is based on traditional data-driven dataflow notations. As in traditional notations, data tokens flow between actors through unidirectional datapaths. The actors are subject to the strict enabling rule and contain no persistent state. As well as general-purpose actors, which contain a medium-grained computation as described above (the computation being known as the actor's method), there are four specialised types of actor which need not or cannot contain a method: the source, sink, replicator and merge actors. Because they do not contain a method, these operations are in a sense fine-grained; however, they can be implemented specially and so do not have a significant influence on efficiency.


In addition to these basic elements, MeDaL contains a number of features designed to make it an effective medium-grained programming language. These include a notation for a modular, hierarchical structure, through a system of grouping actors together into companies; a library of standard functions, such as actors to handle program input/output; and provision for activity parallelism (as well as structure and result parallelism) in the form of "deep actors," multiple copies of which can fire concurrently. A further important feature of MeDaL is the ability of its datapaths to convey arbitrary structured data types, and facilities to store data items persistently relative to firings of actors. This persistent storage takes the form firstly of datapaths which retain a copy of the last data item to pass through them (F-type paths), and secondly of operations on datapaths which prevent a data item which is sent from being delivered immediately (sticky sending). The provision of persistent memory is important not only because it reduces programmer effort, but because it allows complex data types to be built up out of simpler ones, which aids working at a larger grain size.
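To make the sticky-sending facility concrete, the following is a minimal sketch of how an actor method might use it through the C++ interface classes listed in Appendix A. The actor, its paths and the termination flag are invented for this illustration, and the wrapper is assumed to have already bound the objects to their datapaths.

// Hypothetical actor method: each firing appends one incoming row to an
// output array, holding the partially-built item on the datapath with
// sticky sends until a flag on a second input path marks the final row.
void append_row(Mary<float> &row, Mint &last, Mary<float> &result)
{
    unsigned n   = row.getSize();
    unsigned old = result.getSize();   // contents held back by earlier firings
    result.extend(n);                  // make room for the new row
    // (extend() allocates one spare element per call; a real actor would
    // allow for this when laying out its data.)
    for (unsigned i = 0; i < n; i++)
        result[old + i] = row[i];

    if (last.integer() != 0)
        result.send();                 // complete: deliver the whole item
    else
        result.sendSticky();           // keep the item held on the datapath
}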

There are various options for the way in which MeDaL programs can be compiled and executed, and these options were described in Chapter 4. It was shown how information can be extracted from the MeDaL graph and combined with the actor method code and a run-time library to form an executable program. It was illustrated that the method code can have a simple interface with the MeDaL system using encapsulation features of the C++ language. The run-time library is central to the efficient execution of MeDaL programs, and the way in which this can be implemented on both distributed-memory and shared-memory architectures was described. The fact that this architecture-dependent detail is implemented in the MeDaL run-time library and the code generated from the MeDaL graph - rather than in the code supplied by the programmer - supports the aim of MeDaL to provide a portable parallel programming environment. The efficiency issues involved in implementing MeDaL on both types of architecture were described in detail, since efficiency is crucial to the success of any parallel programming language; programming techniques devised to address these issues were given. Having considered the implementation of MeDaL as a programming language in theory, the final step was to evaluate the implementation in practice. Efficiency

being the focus, the performance of an implementation of the MeDaL run-time system was examined. Due to time limitations, this part of the practical work was restricted to the Multimax shared-memory architecture. Chapter 5 described the implementation and the results of experimentation with certain key parts of the run-time system; namely the functions pertaining to the transmission of data items onto datapaths, and the firing of actor methods. The time taken by these functions was measured not only in isolation, but within the context of best-case and worst-case usage scenarios of MeDaL program fragments. Within the context of these scenarios, the overheads imposed by the functions of the MeDaL run-time system were shown to be approximately constant. Further experimentation using an example application revealed that the speedups achieved by MeDaL programs increase in a steady progression when the number of participating actors, and the amount of work done by each actor, are varied. This suggests that MeDaL could be used for programming a variety of applications without incurring unexpected overheads leading to loss of efficiency. Having identified the precise effect on efficiency of using MeDaL as an implementation system, it was shown that MeDaL allows efficient execution (in the sense that it delivers speedups) when the computations involved are of 1,000 instructions upwards. Thus, MeDaL was demonstrated to be a viable candidate for use as a medium-grained parallel programming language, in the terms defined in Chapter 2.

6.2 Conclusions

The above overview of the preceding chapters of this thesis provides the basis for a summary of the conclusions which can be drawn. The motivation for this work results from the research described in Chapter 2. The general conclusion is that programming languages for the multiprocessor architectures described mainly operate at a low level (close to architectural details), do little to alleviate the conceptual difficulties of parallel programming, and are not well supported by programming tools. However, a second conclusion is that the field of serial programming has clearly benefitted from abstraction, for reasons both of portability and ease of understanding. From this arose the interest in providing a higher-level parallel programming system designed for these multiprocessors. Visual programming seems to have much to offer from the point


of view of abstraction in programming languages, and the dataflow paradigm is an example of a very visual model from which it is straightforward to extract potential parallelism. Thus, the final conclusion of the research into parallel programming is that dataflow offers possibilities for a visual parallel programming system which have not previously been explored on medium-grained multiprocessor machines. Therefore, the aim of the project became to design a dataflow language for medium-grained parallel programming, and in conjunction with this, to develop techniques which allow the efficient execution of programs expressed in this language. In designing the dataflow language, MeDaL, a number of points became clear, such as the need for a small number of efficient primitives, and for a modular structure to prevent any one graph becoming unmanageably large and complex. However, more important than either of these are the issues of persistent memory and synchronisation. It was found that allowing complex data types to be built up from simpler ones, and making provision for actors to deal with data streams containing unequal numbers of data items, make the language much more flexible: not only can many algorithms be expressed more concisely, but the inclusion of these features in the language removes the need for the programmer to implement them when needed. Of course, the solutions to these problems incorporated into the language are only part of a range of possible solutions, and several of the alternatives given in Chapter 3 are also clearly viable. In particular, the option of allowing actors to contain persistent state, and the provision of synchronous and asynchronous datapaths, seem to be realistic propositions. The design of the MeDaL language, and the subsequent development of the techniques which make it possible to implement efficiently, was to some extent an iterative process. For each language feature, it was first decided what would be needed, and then the possible implementation was considered; the notation was often then modified in the light of what seemed practical. An example of this is the deep actor, which is a compromise between a desired feature (provision for activity parallelism) and efficiency considerations (the need to avoid the creation of large numbers of datapaths, hence the decision that deep actors should share their input path(s)), resulting in a tidy, concise notation. Of course, the techniques described in this thesis are based on the provisions for parallel programming provided by the operating systems of the architectures


which were available. It must be recognised that other architectures may not provide the same facilities. In particular, whereas the hypercube's operating system provides functions which transport program code or data from any node to any other, many distributed-memory multiprocessors (such as transputer-based machines) only provide support for communication between one node and the next, through a specified link. In such cases, further techniques would need to be developed to implement actor-to-actor communication through a point-to-point network. This should not pose a major problem, since techniques for configuration and communication on distributed architectures have already been developed. However, it must ultimately be borne in mind that the results arrived at with the shared-memory architecture do not allow hard conclusions to be drawn about the suitability of MeDaL for distributed-memory architectures. While some techniques were developed by which the MeDaL run-time system could be implemented on the hypercube, without the implementation itself the conclusion can be no more than informed speculation. However, since dataflow maps closely to the message-passing model (with actors as nodes and datapaths as inter-node links), there seem to be strong grounds for hope that MeDaL would prove reasonably efficient. Moreover, since one of the main aims of implementing the basic functions of the MeDaL run-time system was to deduce the actor grain size at which MeDaL can execute efficiently, it must be noted that the conclusions about grain size drawn in Chapter 5 are also specific to the Multimax. Nevertheless, since synchronisation mechanisms do not vary greatly between shared-memory architectures, it seems likely that similar results could be achieved on similar machines. Furthermore, it should certainly be possible to improve on the figures obtained with the prototype implementation, which was coded for simplicity rather than speed. For instance, little attempt was made to ensure that cache memory was used to the fullest extent possible, a factor which can have a significant effect on execution time. If the code were profiled and re-written to optimise for speed, the overhead of the MeDaL run-time system could be reduced somewhat, resulting in an improvement in the actor grain size supported (that is, a somewhat smaller minimum grain size). A further limitation is that a few of the features of MeDaL were not implemented in the prototype run-time system. Standard input and file handling were not implemented. These were not considered of great importance because programs

considered worth implementing in a parallel language are generally CPU-bound rather than I/O-bound, and so the provision of these features would not have significantly affected grain size or overall efficiency. In addition, the company system was not fully implemented, and no test was made of recursion. The possibility of recursion was included at the design stage because it makes certain algorithms much easier to express concisely - there can be little doubt of this. However, recursion involves the creation of a new copy of the company involved at run-time, and therefore carries a significant overhead, not only of time, but more particularly of memory space. The large-scale use of recursion might approach system-dependent limitations on memory size, and so might not be the most appropriate technique. Whether this should be reflected in the notation is another matter. The more general question which arises out of this, though, is for what types of program MeDaL is not a suitable design and implementation medium. The point was made in Chapter 5 that algorithms with an unparallelisable partitioning section do not, in general, speed up well. More generally, the greater the serial fraction of an algorithm, the less it will (on its own) benefit from implementation in MeDaL. In a sense, the fact that MeDaL imposes its own overheads means that there is not only a minimum actor grain size at which MeDaL is viable, but a minimum program size. Of course, this minimum is less easily quantifiable than the minimum grain size, since the criteria for the success of whole applications are more complex, and will vary more from one situation to another. Conversely, MeDaL will provide benefits as a programming language when parallelising algorithms with a small serial fraction; when implementing actors with a large grain size relative to the minimums imposed by the MeDaL run-time system; and when implementing complex programs. Indeed, the more complex the program, the more benefit can be gained from MeDaL's two main strengths. The first is its simplicity, which makes developing and maintaining complex parallel software (with many different actors and a complex web of dependencies) simpler than with a textual language. This is especially true because MeDaL does not require the programmer to write any code for controlling the overall threads of control, nor for synchronisation; naturally, the larger the parallel program, the more effort this takes in a traditional, textual parallel language. The second is that the

dataflow scheduling of MeDaL programs ensures the maximum possible throughput of processing tasks, since each actor joins the run queue as early as possible (i.e. when all its input data is available). Of course, this does not guarantee optimal processor utilisation, but that is a matter for the scheduler rather than for MeDaL. Examples of applications which might prove complex and large-grained enough to benefit greatly from implementation in MeDaL include: text formatting, which consists of many different tasks such as filling, justifying, and contents and index generation, with various interdependencies; relational database querying, where a variety of relational operations take place in parallel and may feed their results to further operations; program compilation using a parallel make; complex graphics processes such as ray-tracing; and any computation consisting of a number of stages through which streams of data are fed. Since MeDaL is intended to be a flexible, general-purpose programming language, the possibilities are endless. Of course, without an implementation of the full MeDaL programming system, it was not possible to experiment with the process of implementing programs in MeDaL, and it is possible that the experience gained from such experimentation might have raised further issues relating to the design of the language. Such experimentation would be a priority in any future research based on this work. The next section describes this and other possible directions for any such future work.

6.3 Future Work

Given that the prototype described in this thesis implemented only part of the MeDaL programming system, the obvious starting point for any future work would be to implement the remainder. The main modules which need to be implemented are a syntax-directed MeDaL graph editor and graph browser, and the transformation module which generates actor and company wrappers from the MeDaL graph. The presence of these modules would allow many more trial applications to be implemented fairly rapidly, given that MeDaL can be used as a parallel harness for sections of existing sequential code. The experience gained from developing these further applications would provide feedback on whether the semantics of the


MeDaL notation were acceptable, from the point of view of implementing acceptably efficient parallel programs. Furthermore, the availability of a prototype MeDaL programming system would make it possible to study its use by those implementing the trial applications. It would be desirable to study the cognitive models of parallelism involved when using MeDaL as a programming system. Such a study would result in an analysis of MeDaL's use as a parallel programming language, from the point of view of providing a medium for the design and implementation of parallel programs; and would answer questions on the acceptability of the features which MeDaL provides, and the way in which it provides them (its syntax), which this thesis has not been able to address. This inquiry into the syntax and semantics of MeDaL would provide a more empirical basis for presenting MeDaL as a viable parallel programming system, replacing the assumptions on which this thesis relies. Having shown that MeDaL programs can be efficiently executed at the medium-grained level, thus providing a solid technical framework, an inquiry into its suitability as a language seems the obvious next step. Once the above steps were complete, and any feedback from them incorporated into the notation (or indeed into the design of the system as a whole), the next stage would be to design and implement further programming tools based on the graphical notation. The bulk of these would be needed to support the debugging and performance tuning phases of development. Among the possibilities would be tools to support the placing of breakpoints in a graph (using an appropriate visual metaphor such as a bar across a datapath), and checkpointing of MeDaL programs. Single-stepping (executing one actor at a time) and fragment execution (executing a subset of a graph, for testing purposes) should also be supported. Naturally, these functions would require support in the MeDaL system as well as being part of the graph editing/browsing system. There are many other possibilities for graphical tools. Tools for data and process visualisation would greatly aid in the understanding of how a program behaved during execution. Data visualisation could include notations for visualising memory usage in datapaths (such as through colour - for instance progressing from blue for an empty datapath through to red for high memory usage

datapaths - or through the thickness of lines). It should also be possible to allow the programmer to select a particular data item, or group of items, in a datapath queue and examine their contents. Process visualisation could include displaying which actors were running, enabled or not enabled, by the use of colour or icons; and colour or some graph form (such as bar charts or dials) could represent how many times a particular actor has fired, and/or how long was spent executing that actor (for one firing of the actor or a group of firings). Further tools to be based around MeDaL could include utilities for the detection of nondeterministic constructs and race conditions at run-time, and tools for extracting code from existing sequential applications and harnessing such code into the form of a MeDaL program. Support for a range of languages for actor implementation should be developed to aid this. At the same time, tools should be provided for developing applications from scratch using MeDaL; indeed, there seems no inherent reason why tools for software project support and program documentation should not be integrated into the same graphical framework. After all, if the method code of an actor can be made available by (for instance) clicking on its graphical representation, there is no reason why the documentation for that module should not be made available in the same way. More general documentation about a section of the program could be linked to the company level, and so on. With a rich set of tools based on the MeDaL notation, supporting the whole software development process, MeDaL would become an example of the type of integrated visual software development environment which the field of parallel programming currently lacks.

6.4 Closing Remarks

In order to make the first step towards this goal of an integrated visual parallel programming environment, it was necessary to take elements from several traditionally separate areas of research. Ideas were taken from the fields of parallel hardware development; parallel programming; visual programming; and dataflow research, which has generally been regarded as a separate branch of each of these fields. The original contribution of this thesis is to explore the synthesis of ideas from these fields.


Within the context of these research fields, this thesis adds nothing to the field of parallel hardware; indeed, the intention was always to make the programming of existing platforms easier and more accessible. This thesis also adds little to the field of visual programming; MeDaL is just one variant in a set of existing visual languages (intended for sequential processing) based on the dataflow paradigm. However, this thesis does seek to add to the field of dataflow software research, by proposing new constructs which make dataflow efficient at the medium-grained level; and by proposing a potential solution to the difficulty of parallel programming, this thesis aims to contribute to that field as well. It could be argued that proposed languages such as MeDaL, which seek to operate on a "higher level" than existing languages, can never be successful. It was mentioned at the start of this chapter that implementation at a higher level generally involves imposing overheads over an implementation at a lower level, thus reducing efficiency. This has certainly proven true with MeDaL, which executes programs less efficiently than if they were implemented in the level below - the level which the MeDaL run-time system itself manipulates. However, reductio ad absurdum shows that if this were the only criterion, all software applications would be implemented in machine code, since it is the most efficient! Yet this is not the case. The fact is that abstraction away from low-level details makes possible software projects which would be infeasible if low-level implementation were used; the loss of efficiency is acceptable in order to make possible larger, more complex software. The history of computing science shows a clear trend towards ever-greater complexity in software, and a parallel trend towards increasingly high-level programming systems to solve the problems involved. The MeDaL system proposes one path towards a solution to problems which, unless trends change drastically, will become increasingly evident in the future.


Appendix A

MeDaL Classes

This appendix presents C++ classes which define the interface between the actor programmer's code and the MeDaL run-time system. Objects belonging to these classes represent data items which have been received from an actor's input datapaths, or are to be sent to an actor's output datapaths. As described in Chapter 4, these classes contain the generalised parts of the wrapper level, leaving the actor method wrapper code to contain only the application-specific code. For simplicity, the full range of operations implemented in the MeDaL classes (mathematical operations such as addition, multiplication etc. for the MeDaL Integer, boolean-algebra operations for the MeDaL Boolean) have been omitted. These are not necessary so long as the programmer can cast the variable into a built-in type to do these operations. Also, "stickiness" is omitted from the simple types; assignment results in a non-sticky send. This seems sensible in a prototype implementation because the simple types consist of only one value, whereas stickiness is intended for use in building up complex data types from simpler ones. Note that the class presented for transmitting arrays through datapaths is a template class, a feature of the most recent version of C++ at the time of writing (version 3.0). This allows one class definition for any kind of array, for instance arrays of integers, floats, characters, or user-defined types. The template is instantiated when it is used in method and wrapper code. Examples of this can be seen in Appendix B. Note also that assignment to elements of MeDaL Arrays does not automatically result in the transmission of the whole array on a datapath; explicit send and sendSticky operations are provided. These are felt to be more appropriate semantics given that many elements in an array must often be changed during the execution of one actor. Array element assignment is nevertheless overloaded, in order to provide array bounds checking.
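Before the header itself, the following short sketch (not part of the original medal.h; the method name and path bindings are invented) may help to show how these classes appear to an actor method. The wrapper is assumed to have already bound count_in to an input path and text_out to an output path by calling their setup() functions, as described in Chapter 4.

// Hypothetical actor method using the classes defined below.
void example_method(Mint &count_in, Mary<char> &text_out)
{
    Mary<char> line;                   // the template instantiated for char
    line = "hello";                    // string assignment copies the constant

    for (int i = 0; i < count_in.integer(); i++) {
        text_out = line;               // assignment alone does not transmit...
        text_out.send();               // ...an explicit send is required
    }
}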


// medal.h
// C++ Header file defining data structures used by actors
//
// version of 18/01/93
//
// Copyright (c) 1993 Jon Harley BSc
// See main copyright notice
//

#include <stdio.h>      // for sprintf() in checkbounds()

/****************************************************************
 * Standard types */
enum mdir_type { in, out, user };   // in 0, out 1.
enum Boolean { true, false };

/****************************************************************
 * Things used by the Run Time System - functions used and data
 * structures they need
 * These things could be a class if the RTS was in C++, but it's
 * in C so most of these things have to be in C form.
 */
typedef unsigned char Bool;
#define BOOL_T 'T'
#define BOOL_F 'F'

#define NULL 0

typedef struct aid {
    unsigned troupe;
    unsigned actor;
} ActorId;
typedef struct aid *ActorIdP;

extern "C" {
unsigned RTS_GetActIn(ActorIdP, unsigned);
unsigned RTS_GetActOut(ActorIdP, unsigned);
void RTS_WriteMsg(unsigned, char *);
void RTS_Shfree(void *);
void *RTS_Shmalloc(unsigned, unsigned);

void RTS_TxBool(ActorIdP, unsigned, Bool);
Bool RTS_RxBool(ActorIdP, unsigned);
void RTS_TxInt(ActorIdP, unsigned, int);
int RTS_RxInt(ActorIdP, unsigned);
void RTS_TxArray(ActorIdP, unsigned, void *, unsigned, Bool);
void RTS_RxArray(ActorIdP, unsigned, void **, unsigned *, Bool *);
void RTS_UnTxArray(ActorIdP, unsigned, void **, unsigned *);
Bool RTS_IsHeld(ActorIdP, unsigned);
void RTS_ActGoing(ActorIdP);
void RTS_ActOver(ActorIdP);
}

/****************************************************************
 * Classes passed in and out
 */

//
// MeDaL boolean
class Mbool {
    Boolean truth;
    mdir_type direction;
    unsigned whichpath;
    ActorIdP my_id;
    void transbool();
    Boolean recbool();


public:
    Mbool() { truth = true; direction = user; whichpath = 0; }
    Mbool(Boolean tree) { truth = tree; direction = user; }
    void setup(mdir_type dir, unsigned what, ActorIdP me_ptr);
    operator Boolean() { return (truth); }
    Mbool operator=(Mbool & black);
    Mbool operator=(Boolean & white);
    Mbool operator!();
    // |, & etc. omitted for brevity
};

void Mbool::setup(mdir_type dir, unsigned what, ActorIdP me_ptr)
{
    direction = dir;
    my_id = me_ptr;

    if (direction == in) {
        whichpath = RTS_GetActIn(me_ptr, what);
        truth = recbool();
    }
    else {
        whichpath = RTS_GetActOut(me_ptr, what);
    };
}

Boolean Mbool::recbool(void)
{
    Bool inval;

    inval = RTS_RxBool(my_id, whichpath);
    if (inval == BOOL_T) return true;
    else if (inval == BOOL_F) return false;

    // printf("Aargh! %i!", (int)inval); or something.
    return false;
}

void Mbool::transbool(void)
{
    Bool truth_val;

    if (truth == true) truth_val = BOOL_T;
    else truth_val = BOOL_F;

    RTS_TxBool(my_id, whichpath, truth_val);
}

Mbool Mbool::operator=(Mbool & black)
{
    truth = black.truth;
    if (direction == out) transbool();
    return *this;
}

Mbool Mbool::operator=(Boolean & white)
{
    truth = white;
    if (direction == out) transbool();
    return *this;
}


Mbool Mbool::operator!(void)
{
    Boolean tt;

    if (truth == true) tt = false;
    else tt = true;

    return Mbool(tt);
}

//
// MeDaL integer
class Mint {
    int intval;
    mdir_type direction;
    unsigned whichpath;
    ActorIdP my_id;
    void transint();
    int recint();
public:
    Mint() { intval = 0; direction = user; whichpath = 0; }
    Mint(int z) { intval = z; direction = user; }
    Mint(float f) { intval = (int)f; direction = user; }
    void setup(mdir_type dir, unsigned what, ActorIdP me_ptr);
    // operator int() { return intval; }   // doesn't work ... g++ bug
    int integer() { return intval; }       // so we must do this :-(
    operator float() { return (float)intval; }
    operator unsigned int() { return (unsigned int)intval; }
    Mint& operator=(Mint &);
    Mint& operator=(const int &);
    Mint& operator=(const unsigned int &);
    Mint& operator=(const float &);
};

void Mint::setup(mdir_type dir, unsigned what, ActorIdP me_ptr)
{
    direction = dir;
    my_id = me_ptr;

    if (direction == in) {
        whichpath = RTS_GetActIn(me_ptr, what);
        intval = recint();
    }
    else {
        whichpath = RTS_GetActOut(me_ptr, what);
    };
}

int Mint::recint(void)
{
    int inval;

    inval = RTS_RxInt(my_id, whichpath);
    return inval;
}

void Mint::transint(void)
{
    RTS_TxInt(my_id, whichpath, intval);
}


Mint & Mint::operator=(Mint & choc)
{
    intval = choc.intval;
    if (direction == out) transint();
    return *this;
}

Mint & Mint::operator=(const int & cake)
{
    intval = cake;
    if (direction == out) transint();
    return *this;
}

Mint & Mint::operator=(const unsigned int & bar)
{
    intval = (int)bar;
    if (direction == out) transint();
    return *this;
}

Mint & Mint::operator=(const float & chip)
{
    intval = (int)chip;
    if (direction == out) transint();
    return *this;
}

//
// MeDaL array
template <class Type> class Mary {
    unsigned size;     // size in terms of number of elements, not bytes.
    Type *aa;
    Boolean is_held;
    Boolean keep;
    mdir_type direction;
    unsigned whichpath;
    ActorIdP my_id;
    void init(const Type*, unsigned, Boolean);
    void chuck(void);
    unsigned checkbounds(unsigned);
    void recarray();
    void yankarray();
    void transarray();
public:
    Mary(unsigned ni=0) { init(0, ni, true); }
    Mary(const Type *ar, unsigned ni) { init(ar, ni, true); }
    Mary(const Mary &m) { init(m.aa, m.size, true); }
    ~Mary() { chuck(); }

    void setup(mdir_type dir, unsigned what, ActorIdP me_ptr);
    unsigned getSize() { return size; }
    void extend(unsigned);
    Boolean isSticky() { return is_held; }
    void send();
    void sendSticky();


    Mary& operator=(const Mary&);
    Mary& operator=(const Type*);   // only for Mary<char> really.
    Type& operator[](unsigned ix) {
        unsigned nix = checkbounds(ix);
        return aa[nix];
    }
};

template <class Type> void Mary<Type>::init(const Type *array, unsigned sz, Boolean scratch)
{
    aa = NULL;
    size = 0;

    if (scratch == true) {
        is_held = false;
        direction = user;
        keep = false;
    };

    if (sz == 0) return;

    extend(sz);

    if (array != 0) {
        for (unsigned ix = 0; ix < size; ix++) aa[ix] = array[ix];
    };
}

template <class Type> void Mary<Type>::chuck(void)
{
    if ((aa != NULL) && (keep == false)) RTS_Shfree(aa);
}

template <class Type> unsigned Mary<Type>::checkbounds(unsigned index)
{
    char bounds_warn[64];

    if ((index >= size) || (aa == NULL)) {
        if (index >= size) {
            sprintf(bounds_warn,
                    "Warning, array out of bounds (%u/%u) in t%u a%u",
                    index, size, my_id->troupe, my_id->actor);
        }
        else {
            sprintf(bounds_warn, "Dire warning! About to deref null pointer");
        };
        RTS_WriteMsg(3, bounds_warn);
        return 0;
    };

    return index;
}

template <class Type> void Mary<Type>::recarray(void)
{
    void *bb;
    Bool f_type;

    bb = (void *)aa;
    RTS_RxArray(my_id, whichpath, &bb, &size, &f_type);


    if (f_type == BOOL_T) keep = true;

    aa = (Type *)bb;
}

template <class Type> void Mary<Type>::yankarray(void)
{
    void *bb;

    bb = (void *)aa;
    RTS_UnTxArray(my_id, whichpath, &bb, &size);

    aa = (Type *)bb;
}

template <class Type> void Mary<Type>::transarray(void)
{
    Bool please_hold;

    if (is_held == true) please_hold = BOOL_T;
    else please_hold = BOOL_F;

    RTS_TxArray(my_id, whichpath, (void *)aa, size, please_hold);

    aa = NULL;          // maybe a new array
    size = 0;           // of the same size
    is_held = false;    // should be created here.
}

template <class Type> void Mary<Type>::setup(mdir_type dir, unsigned what, ActorIdP me_ptr)
{
    direction = dir;
    my_id = me_ptr;
    is_held = false;

    if (direction == in) {
        whichpath = RTS_GetActIn(me_ptr, what);
        recarray();
    }
    else {
        if (direction == out) {
            whichpath = RTS_GetActOut(me_ptr, what);
            if (RTS_IsHeld(me_ptr, whichpath) == BOOL_T) is_held = true;
            yankarray();
        };
    };
}

template <class Type> void Mary<Type>::extend(unsigned extra)
{
    Type *new_aa;
    unsigned ix = 0, new_size, typesize;

    if (keep == true) {
        RTS_WriteMsg(3, "Warning: user is extending F-type input path.");
    };


    new_size = size + extra + 1;        // you get 1 free :-0
    typesize = sizeof(Type);

    if (((void *)new_aa = RTS_Shmalloc(new_size, typesize)) == NULL)
        return;

    while (ix < size) new_aa[ix] = aa[ix++];
    while (ix < new_size) new_aa[ix++] = (Type)0;

    if (aa != NULL)
        RTS_Shfree(aa);

    aa = new_aa;
    size = new_size;
}

template <class Type>
void Mary<Type>::sendSticky(void)
{
    is_held = true;
    if (direction == out)
        transarray();
}

template <class Type>
void Mary<Type>::send(void)
{
    is_held = false;
    if (direction == out)
        transarray();
}

template <class Type>
Mary<Type>& Mary<Type>::operator=(const Mary &ma)
{
    if (this == &ma)
        return *this;       // copying an array to itself would be silly

    if (aa != NULL)
        RTS_Shfree(aa);

    init(ma.aa, ma.size, false);

    return *this;
}

// The following is purely for assigning constant strings to Mary<char> things.
// It would be inadvisable to use it for any other Mary.
template <class Type>
Mary<Type>& Mary<Type>::operator=(const Type *s_ptr)
{
    unsigned s_len = 0;

    if (aa != NULL)
        RTS_Shfree(aa);

    if ((s_ptr != NULL) && (sizeof(Type) == sizeof(char)))
    {
        while (*(s_ptr + s_len) != 0)
            ++s_len;
        init(s_ptr, s_len, false);
    };

    return *this;
}
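To make the distinction between send() and sendSticky() concrete before the example application (whose actor 4 relies on it), the fragment below sketches a hypothetical accumulating actor method. The method name, the two-element array layout and the threshold of ten firings are assumptions made purely for illustration. sendSticky() transmits the array with the hold flag set, so on the actor's next firing setup() finds the path held and retrieves the same array via yankarray(), allowing the partial result to persist between firings.

// Hypothetical accumulator: element 0 holds a running total, element 1 a
// count of firings. sendSticky() keeps the partial result on the output
// path; send() finally releases it downstream.
void companyX_accumulate(Mary<float> & in0, Mary<float> & out0)
{
    if (out0.isSticky() == false)       // first firing: nothing held on the path yet
    {
        out0.extend(2);
        out0[0] = 0.0;
        out0[1] = 0.0;
    };

    out0[0] = out0[0] + in0[0];
    out0[1] = out0[1] + 1.0;

    if (out0[1] < 10.0)
        out0.sendSticky();
    else
        out0.send();
}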


Appendix B

Example Application

This appendix presents the code needed to implement the simplified matrix multiplication example described in Chapter 5. Three source code files are included; their functions are as follows:

generic.c   This program is part of the wrapper code which would ultimately be generated from the MeDaL graph by the wrapper generator module, though in this case it was coded by hand (although, for realism, it is written as if it had been generated automatically). Entry into the executable derived from the MeDaL program is in fact into the run-time system; after some initial setting-up, the RTS calls the function InitCo0() (initialise company 0, the root company) in generic.c. This function name is, of course, fixed at the compile time of the RTS. This function, and those it calls, "register" the actors and datapaths with the RTS (by calling RTS functions) so that the RTS can build up an initial "roadmap" of the layout of the graph. Among the information passed to the RTS in this way is a pointer to the similar initialisation function of any sub-companies, and a pointer to the wrapper function of each actor; these wrappers are contained in the next file.

wrappers.C   This module contains the wrapper functions for each actor method. Each such function is called by the EPT system when requested to do so by the MeDaL run-time system, which passes EPT a pointer to the wrapper function. The wrappers simply set up the data structures needed by the actor method functions before calling them. Again, these wrapper functions were coded in as realistic a way as possible, to simulate their having been generated automatically from a MeDaL graph.

actors.C   This file contains the C++ source code actually written by the application programmer. The code is that used in the experiments described in Chapter 5.


file: generic.c

/* company0: generic.c
 * generated from company0/src/* and graph info
 * by hand
 */

#include "thread.h"
#include "rts.h"

extern void c0a0(unsigned);
extern void c0a1(unsigned);
extern void c0a2(unsigned);
extern void c0a3(unsigned);
extern void c0a4(unsigned);
extern void c0a5(unsigned);
extern void RTS_WriteMsg(unsigned, char*);

Bool InitCo0(unsigned myid)
{
    CompanyP myco;
    Bool ok;
    CActorP actr;
    CompanyP RTS_RegisterCompany(unsigned, char *, SFuncP);
    CActorP RTS_MakeActor(CompanyP, ACT_TYPE, FuncP);
    void RTS_SetActIn(CActorP, unsigned);
    void RTS_SetActOut(CActorP, unsigned);
    void RTS_MakePath(CompanyP, PATH_TYPE, unsigned, unsigned, Bool);
    void SrcCo0(unsigned);

    if ((myco = RTS_RegisterCompany(myid, "MatMul", (SFuncP)SrcCo0)) == (CompanyP)NULL)
        return FALSE;

    /* The following sets up the actors & paths. It knows about merged paths.
     * The order of actors and output paths determines their place in the roadmap.
     */

    actr = RTS_MakeActor(myco, ACT_M_SOURCE, (FuncP)c0a0);
    RTS_SetActOut(actr, 0);
    actr = RTS_MakeActor(myco, ACT_M_SOURCE, (FuncP)c0a1);
    RTS_SetActOut(actr, 1);
    actr = RTS_MakeActor(myco, ACT_GEN, (FuncP)c0a2);
    RTS_SetActIn(actr, 1);
    RTS_SetActOut(actr, 2);
    actr = RTS_MakeActor(myco, ACT_DEPTH, (FuncP)c0a3);
    RTS_SetActIn(actr, 0);
    RTS_SetActIn(actr, 2);
    RTS_SetActOut(actr, 3);
    actr = RTS_MakeActor(myco, ACT_GEN, (FuncP)c0a4);
    RTS_SetActIn(actr, 3);
    RTS_SetActIn(actr, 6);
    RTS_SetActOut(actr, 4);
    actr = RTS_MakeActor(myco, ACT_GEN, (FuncP)c0a5);
    RTS_SetActIn(actr, 4);
    RTS_SetActOut(actr, 5);
    actr = RTS_MakeActor(myco, ACT_STDOUTS, (FuncP)NULL);
    RTS_SetActIn(actr, 5);

    RTS_MakePath(myco, PATH_F, 3, MARY_SIZE, BOOL_F);
    RTS_MakePath(myco, PATH_E, 2, MARY_SIZE, BOOL_F);
    RTS_MakePath(myco, PATH_E, 3, MARY_SIZE, BOOL_F);
    RTS_MakePath(myco, PATH_E, 4, MARY_SIZE, BOOL_F);


    RTS_MakePath(myco, PATH_E, 5, MARY_SIZE, BOOL_F);
    RTS_MakePath(myco, PATH_E, 6, MARY_SIZE, BOOL_F);
    RTS_MakePath(myco, PATH_E, 4, BOOL_SIZE, BOOL_F);

    if ((myco->num_actors < 7) || (myco->num_paths < 7))
        return FALSE;

    return TRUE;
}

/*****************************************************************************
 * The source-actor code is conglomerated into this function.
 */
void SrcCo0(unsigned as_troupe)
{
    /* there are no non-method sources */
    c0a0(as_troupe);
    c0a1(as_troupe);
}


file: wrapper.C

// company 0 actors 0-5                                          c++
// The MeDaL-generated bits
// see the following file, actors.C, for which actor is which.
//

#include "medal.h"

// We must give these C linkage because they will be called from C.
extern "C"
{
    void c0a0(unsigned);    // called from SrcCo0 every time a troupe is
    void c0a1(unsigned);    // created from this company
    void c0a2(unsigned);    //
    void c0a3(unsigned);    // called as threads
    void c0a4(unsigned);    //
    void c0a5(unsigned);    //
}

void c0a0(unsigned mytroupe)
{
    Mary<float> op0;
    ActorId me;
    ActorIdP meep = &me;
    void company0_actor0(Mary<float>&);

    me.troupe = mytroupe;
    me.actor = 0;

    op0.setup(out, 0, meep);

    company0_actor0(op0);
}

void c0a1(unsigned mytroupe)
{
    Mary<float> op0;
    ActorId me;
    ActorIdP meep = &me;
    void company0_actor1(Mary<float>&);

    me.troupe = mytroupe;
    me.actor = 1;

    op0.setup(out, 0, meep);

    company0_actor1(op0);
}

void c0a2(unsigned mytroupe)
{
    Mary<float> ip0;
    Mary<float> op0;
    ActorId me;
    ActorIdP meep = &me;
    void company0_actor2(Mary<float>&, Mary<float>&);

    me.troupe = mytroupe;
    me.actor = 2;

    ip0.setup(in, 0, meep);


    op0.setup(out, 0, meep);

    RTS_ActGoing(meep);

    company0_actor2(ip0, op0);

    RTS_ActOver(meep);
}

void c0a3(unsigned mytroupe)
{
    Mary<float> ip0;
    Mary<float> ip1;
    Mary<float> op0;
    ActorId me;
    ActorIdP meep = &me;
    void company0_actor3(Mary<float>&, Mary<float>&, Mary<float>&);

    me.troupe = mytroupe;
    me.actor = 3;

    ip0.setup(in, 0, meep);
    ip1.setup(in, 1, meep);

    op0.setup(out, 0, meep);

    RTS_ActGoing(meep);

    company0_actor3(ip0, ip1, op0);

    RTS_ActOver(meep);
}

void c0a4(unsigned mytroupe)
{
    Mary<float> ip0;
    Mbool ip1;
    Mary<float> op0;
    ActorId me;
    ActorIdP meep = &me;
    void company0_actor4(Mary<float>&, Mbool, Mary<float>&);

    me.troupe = mytroupe;
    me.actor = 4;

    ip0.setup(in, 0, meep);
    ip1.setup(in, 1, meep);

    op0.setup(out, 0, meep);

    RTS_ActGoing(meep);

    company0_actor4(ip0, ip1, op0);

    RTS_ActOver(meep);
}

void c0a5(unsigned mytroupe)
{
    Mary<float> ip0;
    Mary<char> op0;
    ActorId me;
    ActorIdP meep = &me;
    void company0_actor5(Mary<float>&, Mary<char>&);


    me.troupe = mytroupe;
    me.actor = 5;

    ip0.setup(in, 0, meep);

    op0.setup(out, 0, meep);

    RTS_ActGoing(meep);

    company0_actor5(ip0, op0);

    RTS_ActOver(meep);
}


file: actors.C

// this is the only part the user writes:

#define VEC_SIZE 150
#define STRIPE_WIDTH 19
#define NUM_STRIPES 8

void company0_actor0(Mary<float> & out0)    // source actor for matrix A
{
    unsigned sq, ix;

    sq = VEC_SIZE * VEC_SIZE;
    out0.extend(sq);

    for (ix = 0; ix < sq; ix++)
        out0[ix] = 0;

    for (ix = 0; ix < VEC_SIZE; ix++)
        out0[(ix * VEC_SIZE) + ix] = 1;

    out0.send();
}

void company0_actor1(Mary<float> & out0)    // source actor for matrix B
{
    unsigned sq, ix;

    sq = VEC_SIZE * VEC_SIZE;
    out0.extend(sq);

    for (ix = 0; ix < sq; ix++)
        out0[ix] = sq - ix;

    out0.send();
}

// actor 2 splits the input array into columns and outputs each column with
// a corresponding index to say *which* column it actually is.
void company0_actor2(Mary<float> & in0, Mary<float> & out0)
{
    unsigned strip, index, row, col, outcol;

    for (index = 0; index < NUM_STRIPES; index++)
    {
        out0.extend((VEC_SIZE + 1) * STRIPE_WIDTH);
        outcol = 0;
        for (col = (index * STRIPE_WIDTH); col < ((index + 1) * STRIPE_WIDTH); col++)
        {
            if (col >= VEC_SIZE)
                out0[outcol + VEC_SIZE] = 99999;    // end of last stripe may be empty
            else
            {
                for (row = 0; row < VEC_SIZE; row++)
                {
                    out0[outcol + row] = in0[col + (VEC_SIZE * row)];
                };
                out0[outcol + VEC_SIZE] = col;
            };
            outcol += (VEC_SIZE + 1);
        };
        out0.send();
    };
}


// actor 3 multiplies the array in in0 by a stripe in in1.
void company0_actor3(Mary<float> & in0, Mary<float> & in1, Mary<float> & out0)
{
    unsigned i, j, which, row, el, striper, resrow;
    float sum;

    out0.extend((VEC_SIZE + 1) * STRIPE_WIDTH);

    for (striper = 0; striper < STRIPE_WIDTH; striper++)
    {
        resrow = striper * (VEC_SIZE + 1);
        which = in1[resrow + VEC_SIZE];
        if (which != 99999)
        {
            row = which * VEC_SIZE;
            for (i = 0; i < VEC_SIZE; i++)
            {
                sum = 0;
                for (j = 0; j < VEC_SIZE; j++)
                {
                    sum += (in0[row + j] * in1[resrow + i]);
                };
                out0[resrow + i] = sum;
            };
        };
        out0[resrow + VEC_SIZE] = which;    // pass on which result row this is
    };
    out0.send();
}

// actor 4 receives result vectors and puts them back together as an array
// by using sendSticky.
void company0_actor4(Mary<float> & in0, Mbool in1, Mary<float> & out0)
{
    unsigned index, which, col, sq, striper;
    Boolean dummy;

    dummy = in1;
    sq = VEC_SIZE * VEC_SIZE;

    if (out0.isSticky() == false)
    {
        out0.extend(sq + 1);
        out0[sq] = 0.0;                 // number of stripes processed
    };

    for (striper = 0; striper < STRIPE_WIDTH; striper++)
    {
        which = (int)in0[(striper * (VEC_SIZE + 1)) + VEC_SIZE];
        if (which != 99999)
        {
            col = which * VEC_SIZE;
            for (index = 0; index < VEC_SIZE; index++)
            {
                out0[(index * VEC_SIZE) + which] = in0[(striper * (VEC_SIZE + 1)) + index];
            };
        };
    };

    out0[sq] = out0[sq] + 1.0;

    if (out0[sq] < (float)NUM_STRIPES)
        out0.sendSticky();
    else
        out0.send();
}


// This actor's input is the result array, and it outputs it as a character
// stream to a Stdout actor.
void company0_actor5(Mary<float> & in0, Mary<char> & out0)
{
    unsigned row, col;
    char cbuf[32];

    out0 = "And the result is ... \n";
    out0.send();

    for (row = 0; row < VEC_SIZE; row++)
    {
        for (col = 0; col < VEC_SIZE; col++)
        {
            sprintf(cbuf, "\t%3.2f", in0[(row * VEC_SIZE) + col]);
            out0 = cbuf;
            out0.send();
        };
    };

    out0 = "\nTa-dah!\n";
    out0.send();
}
