Compiling for a multithreaded dataflow architecture: algorithms, tools, and experience
Feng Li

To cite this version:

Feng Li. Compiling for a multithreaded dataflow architecture: algorithms, tools, and experience. Other [cs.OH]. Université Pierre et Marie Curie - Paris VI, 2014. English. NNT: 2014PA066102. tel-00992753v2

HAL Id: tel-00992753 https://tel.archives-ouvertes.fr/tel-00992753v2 Submitted on 10 Sep 2014

Université Pierre et Marie Curie
École Doctorale Informatique, Télécommunications et Électronique

Compiling for a multithreaded dataflow architecture: algorithms, tools, and experience

by Feng LI

Doctoral thesis in Computer Science

Supervised by Albert COHEN. Presented and publicly defended on 20 May 2014.

Before a jury composed of: President, Reviewer, Reviewer, Examiner, Examiner, Examiner.

I would like to dedicate this thesis to my mother Shuxia Li, for her love.

Acknowledgements

I would like to thank my supervisor Albert Cohen; there is nothing more exciting than working with him. Albert, you are an extraordinary person, full of ideas and motivation. The discussions with you always enlightened and helped me. I was lucky to have you as my supervisor when I first came to research as a PhD student. I would also like to thank Dr. Antoniu Pop; I still enjoy the time we spent working together, and he helped me a lot during my thesis, as a good colleague and a friend. I am grateful to my thesis reviewers, Prof. Guang Gao and Dr. Fabrice Rastello. Thank you for taking the time to read the thesis carefully and for providing useful feedback. For the revised version of the thesis, I would also like to thank Dr. Stéphane Zuckerman, who gave very useful suggestions and comments. I would like to thank all the members of our PARKAS team. Prof. Marc Pouzet, Prof. Jean Vuillemin, and Dr. Francesco Zappa Nardelli invited many talks in our group, bringing in new ideas and inspiration. Tobias Grosser, Jun Inoue, Riyadh Baghdadi: I still enjoy the times we spent discussing together, climbing together, and flying to conferences. Boubacar Diouf, Jean-Yves Vet, Louis Mandel, Adrien Guatto, Timothy Bourke, Cédric Pasteur, Léonard Gérard, Guillaume Baudart, Robin Morisset, Nhat Minh Lê: you are wonderful people to work with! Special thanks to my friends Wenshuang Chen and Shunfeng Hu; I always like your cooking! Also to my friends Shengjia Wang, Ke Song, and Yuan Liu: I enjoy our beer time and the moments when we discuss and implement a new idea.

Abstract

Across the wide range of multiprocessor architectures, all seem to share one common problem: they are hard to program. It is a general belief that parallelism is a software problem, and that perhaps we need more sophisticated compilation techniques to partition the application into concurrent threads. Many experts also make the point that the underlying architecture plays an equally important role: there needs to be a fundamental change in processor architecture before one may expect significant progress in the programmability of multiprocessors. Our approach favors a convergence of these viewpoints. The convergence of dataflow and von Neumann architectures promises latency tolerance, the exploitation of a high degree of parallelism, and a low thread switching cost. Multithreaded dataflow architectures require a high degree of parallelism to tolerate latency. On the other hand, it is error-prone for programmers to partition the program into a large number of fine-grain threads. To reconcile these facts, we aim to advance the state of the art in automatic thread partitioning, in combination with programming language support for coarse-grain, functionally deterministic concurrency. This thesis presents a general thread partitioning algorithm for transforming sequential code into a parallel data-flow program targeting a multithreaded dataflow architecture. Our algorithm operates on the program dependence graph and on the static single assignment form, extracting task, pipeline, and data parallelism from arbitrary control flow, and coarsening its granularity using a generalized form of typed fusion. We design a new intermediate representation to ease code generation for an explicit token matching dataflow execution model. We also implement a GCC-based prototype. We also evaluate coarse-grain dataflow extensions of OpenMP in the context of a large-scale 1024-core, simulated multithreaded dataflow architecture. These extensions and the simulated architecture allow the exploration of innovative memory models for dataflow computing. We evaluate these tools and models on realistic applications.

Contents

List of Figures

Nomenclature

1 Introduction
  1.1 Hybrid Dataflow for Latency Tolerance
    1.1.1 Convergence of dataflow and von Neumann
    1.1.2 Latency Tolerance
    1.1.3 TSTAR Multithreaded Dataflow Architecture
  1.2 Task Granularity
  1.3 Motivation
  1.4 Dissertation Outline

2 Problem Statement
  2.1 Explicit token matching shifts the challenges in hardware design to compilation
  2.2 The complex data structure should be handled in an efficient way
  2.3 Related Work
    2.3.1 Compiling imperative programs to data-flow threads
    2.3.2 SSA as an intermediate representation for data-flow compilation
    2.3.3 Decoupled software pipelining
    2.3.4 EARTH thread partitioning
    2.3.5 Formalization of the thread partitioning cost model

3 Thread Partitioning I: Advances in PS-DSWP
  3.1 Introduction
    3.1.1 Decoupled software pipelining
    3.1.2 Loop distribution
  3.2 Observations
    3.2.1 Replacing loops and barriers with a task pipeline
    3.2.2 Extending loop distribution to PS-DSWP
    3.2.3 Motivating example
  3.3 Partitioning Algorithm
    3.3.1 Definitions
    3.3.2 The algorithm
  3.4 Code Generation
    3.4.1 Decoupling dependences across tasks belonging to different treegions
    3.4.2 SSA representation
  3.5 Summary

4 TSTAR Dataflow Architecture
  4.1 Dataflow Execution Model
    4.1.1 Introduction
    4.1.2 Past Data-Flow Architectures
  4.2 TSTAR Dataflow Execution Model
    4.2.1 TSTAR Multithreading Model
    4.2.2 TSTAR Memory Model
    4.2.3 TSTAR Synchronization
    4.2.4 TSTAR Dataflow Instruction Set
  4.3 TSTAR Architecture
    4.3.1 Thread Scheduling Unit

5 Thread Partitioning II: Transform Imperative C Program to Dataflow Program
  5.1 Revisit TSTAR Dataflow Execution Model
  5.2 Partitioning Algorithms
    5.2.1 Loop Unswitching
    5.2.2 Build Program Dependence Graph under SSA
    5.2.3 Merging Strongly Connected Components
    5.2.4 Typed Fusion
    5.2.5 Data Flow Program Dependence Graph
  5.3 Modular Code Generation
  5.4 Implementation
  5.5 Experimental Validation
  5.6 Summary

6 Handling Complex Data Structures
  6.1 Streaming Conversion of Memory Dependences (SCMD)
    6.1.1 Motivating Example
    6.1.2 Single Producer Single Consumer
    6.1.3 Single Producer Multiple Consumers
    6.1.4 Multiple Producers Single Consumer
    6.1.5 Generated Code for Motivating Example
    6.1.6 Discussion
  6.2 Owner Writable Memory
    6.2.1 OWM Protocol
    6.2.2 OWM Extension to TSTAR
    6.2.3 Expressiveness
    6.2.4 Case Study: Matrix Multiplication
    6.2.5 Conclusion and perspective about OWM
  6.3 Summary

7 Simulation on Many Nodes
  7.1 Introduction
  7.2 Multiple Nodes Dataflow Simulation
  7.3 Resource Usage Optimization
    7.3.1 Memory Usage Optimization
    7.3.2 Throttling
  7.4 Experimental Validation
    7.4.1 Experimental Settings
    7.4.2 Experimental Results
      7.4.2.1 Gauss Seidel
      7.4.2.2 Viola Jones
      7.4.2.3 Sparse LU
  7.5 Summary

8 Conclusions and Future Work
  8.1 Contributions
  8.2 Future Work

Personal Publications

References

List of Figures

1.1 TStar High level Architecture.
1.2 The Parallelism-Overhead Trade-off (from Sarkar)

2.1 General strategy for thread partitioning.

3.1 Barriers inserted after loop distribution.
3.2 Pipelining inserted between distributed loops. Initialize the stream (left), producer and consumer thread (right).
3.3 Uncounted nested loop before partitioning.
3.4 Uncounted nested loop in SSA form.
3.5 Loops after partitioning and annotated with OpenMP stream extension.
3.6 Pipelining and parallelization framework.
3.7 Definition and use edges in the presence of control flow.
3.8 Control dependence graph of Figure 3.3. Express the definition of treegion.
3.9 Split conditional statements to expose finer grained pipelining.
3.10 Algorithm for marking the irreducible edges.
3.11 Structured typed fusion algorithm.
3.12 Before partitioning (left), and after partitioning (right). Loop with control dependences.
3.13 Normal form of code (left) and using streams (right).
3.14 Normal form of code (left) and SSA form of the code (right).
3.15 Apply our algorithm to generate the parallel code. Producer thread (left) and consumer thread (right).
3.16 Multiple producers with applied our algorithm, the generated code.

4.1 Firing rule example.
4.2 The basic organization of the static (a) and dynamic (b) model.
4.3 Program dependence graph of an illustrative example.
4.4 Producer consumer relationship.
4.5 Producer consumer code.
4.6 TSTAR Highlevel Architecture.
4.7 Overview of the Thread Scheduling Unit (TSU).

5.1 Explicit Matching Operation.
5.2 Program Dependence Graph for explicit token matching compilation example.
5.3 Running example (left) and after unswitching (right).
5.4 SSA form before unswitching (left) and after (right).
5.5 SSA representation after loop unswitching.
5.6 SSA-PDG for the code example in Figure 5.5.
5.7 SSA PDG after merging the SCC.
5.8 (A) SSA-PDG after typed fusion. (B) Data Flow Program Dependence Graph.
5.9 Splitting data dependences: (A) the original SSA-PDG and (B) the generated DF-PDG.
5.10 SSA representation for a simple loop carried dependence
5.11 Corresponding SSA-PDG (left) and a partial DF-PDG (right) for code in 5.10.
5.12 Caller and callee, threaded version.
5.13 Implementation within GCC.
5.14 Merge Sort running on 4 cores.
5.15 Computing the 42th Fibonacci number on 4 cores (above) and 24 cores (below).

6.1 Non-linear access for array A within a loop.
6.2 Decouple computation and memory accesses with loop distribution.
6.3 Single producer single consumer
6.4 Single producer multiple consumers
6.5 Multiple producers and single consumer
6.6 Generated code for control program using SCMD.
6.7 Generated code for control program using SCMD.
6.8 Owner Writable Memory.
6.9 Pipeline using OpenStream.
6.10 OpenStream cache example.
6.11 First phase: Matrix allocation and initialization.
6.12 Second phase: Matrix multiplication.
6.13 Third phase: Output the results.

7.1 COTSon Architecture.
7.2 Multiple nodes simulation on COTSon.
7.3 Code example for simple frame memory allocator
7.4 Simple frame memory allocator
7.5 Code example for simple frame memory allocator
7.6 Host Space TLS Frame Memory Allocator
7.7 Excessive parallelism.
7.8 Simulation Results
7.9 Simulation Results
7.10 Simulation Results
7.11 Dependence of gauss seidel: A. same iteration dependences (left) B. cross iteration dependences (right)
7.12 Gauss seidel implementation with OpenStream.
7.13 Gauss seidel implementation with OWM support.
7.14 Gauss seidel implementation with OWM support (final task).
7.15 Viola Jones OpenStream kernel.
7.16 Cascade data structure.
7.17 Dynamic OWM memory allocation.
7.18 Dependence patterns of sparse lu: A. k = 0 B. k = 1.
7.19 Sparse lu OpenStream implementation for bdiv dependences.
7.20 Sparse lu OpenStream implementation for bdiv dependences (with OWM support).

8.1 High level overview of the compilation and simulation architecture.

Chapter 1

Introduction

The design complexity and power constraints of large monolithic processors forced the industry to develop Chip-Multiprocessor (CMP) architectures. Across the diverse range of multiprocessor architectures, all seem to share one common problem: they are hard to program. The application writer has to take detailed architecture-specific measures to improve performance, and the resulting code is hard to read, maintain, or debug, not to mention performance portability. It is a general belief that the difficulty of parallel programming is a software problem. Perhaps we need more sophisticated compilers, to partition the applications into tasks that can run in parallel and to coordinate or synchronize them to implement a given functional semantics. Or perhaps we need more abstract parallel programming languages, allowing the programmer to write applications or port legacy code in a productive way, and then rely on static and dynamic algorithms to exploit the target architecture. It is also argued that the design and programming interface of a multiprocessor architecture plays an equally important role. It may very well be that the conventional cache-coherent multiprocessors derived from the von Neumann model are not suitable for the scalable and productive exploitation of parallelism. Following this school of thought, there is a need for a fundamental change in processor architecture before we can expect significant progress in the use of multiprocessors. This thesis presents a general thread partitioning algorithm for transforming sequential code into a parallel data-flow program targeting a multithreaded dataflow architecture. Our algorithm operates on the program dependence graph and on the static single assignment form, extracting task, pipeline, and data parallelism from arbitrary control flow, and coarsening its granularity using a generalized form of typed fusion. We design a new intermediate representation to ease code generation for an explicit token matching dataflow execution model. We also implement a GCC-based prototype. We also evaluate coarse-grain dataflow extensions of OpenMP in the context of a large-scale 1024-core, simulated multithreaded dataflow architecture. These extensions and the simulated architecture allow the exploration of innovative memory models for dataflow computing. We evaluate these tools and models on realistic applications.

1.1 Hybrid Dataflow for Latency Tolerance

Our study starts from dataflow architectures. By contrast with the von Neumann architecture, the execution of an instruction in the dataflow model is driven by the availability of data. In 1975, Dennis and Misunas proposed the static dataflow architecture Dennis & Misunas [1979]. Due to its limitations with iterative constructs and reentrancy Arvind & Culler [1986b], the dynamic, or tagged-token, dataflow architecture was proposed by Watson and Gurd Watson & Gurd [1979]; it exposes additional parallelism by allowing multiple instances of reentrant code, but the main disadvantage of the dynamic model is the extra overhead of matching tags on tokens.

1.1.1 Convergence of dataflow and von Neumann

The limits of the static dataflow model and the large cost of dynamic dataflow execution of individual instructions led to the convergence of dataflow and von Neumann models. The hybrid architecture considered in this thesis moves from fine-grained parallelism towards a coarse-grained execution model, which combines the power of the dataflow model for exploiting parallelism with the efficiency of the control-flow model. Two key features supporting this shift are low-cost sequential scheduling of instructions and the use of registers to temporarily hold the results of instructions in a single thread. Sequential instruction scheduling takes advantage of the fact that any dataflow program displays some level of sequential execution. For example, when the output of one node goes directly to the next node, the two nodes can never execute in parallel; there is therefore little point in scheduling them to different processors. Sequential scheduling enables a coarser grain of parallelism than the one-instruction-per-task granularity of the pure dataflow model, and allows the use of simple control-flow sequencing within the grain. This comes from the fact that data-driven sequencing is unnecessarily general and comes at the cost of the overhead required to match tokens. Combining sequential scheduling with instruction-level context switching also gives another perspective on dataflow architectures: multithreading. In multithreaded architectures, a thread is defined as a sequence of statically ordered instructions such that, once the first instruction is fired, the remaining instructions execute without interruption. Multiple flavors of hybrid architectures exist: one approach essentially follows the von Neumann architecture with a few dataflow additions (large-grain dataflow architectures), another approach extends a dataflow architecture with some von Neumann additions (multithreaded dataflow architectures) Papadopoulos & Traub [1991]; Lee & Hurson [1994]; Iannucci [1988]; Nikhil [1989]; Nikhil et al. [1992]; Culler et al. [1991].

1.1.2 Latency Tolerance

The term von Neumann bottleneck was coined by John Backus in his 1977 ACM Turing Award lecture. According to Backus:

Surely there must be a less primitive way of making big changes in the store than by pushing vast numbers of words back and forth through the von Neumann bottleneck. Not only is this tube a literal bottleneck for the data traffic of a problem, but, more importantly, it is an intellectual bottleneck that has kept us tied to word-at-a-time thinking instead of encouraging us to think in terms of the larger conceptual units of the task at hand. Thus programming is basically planning and detailing the enormous traffic of words through the von Neumann bottleneck, and much of that traffic concerns not significant data itself, but where to find it. Backus [1978]

The shared bus between the program memory and data memory leads to the von Neumann bottleneck: the limited communication bandwidth between the CPU and memory limits the effective processing speed when the CPU is required to perform minimal processing on large amounts of data, which is also referred to as the memory wall. The memory wall is the growing disparity of speed between the CPU and the memory outside the CPU chip. Because of the memory wall, enhancing the CPU alone cannot guarantee improvements in system performance. From 1986 to 2000, CPU speed improved at an annual rate of 55 percent while memory speed only improved at 10 percent. Moreover, when memory is physically distributed, the latency of the network and network interface is added to that of accessing the local memory on the node. Given these trends, it was expected that memory latency would become an overwhelming bottleneck in computing performance. With the rapid progress of multiprocessor speed, two main issues have to be addressed: latency and synchronization. Latency appears in various forms in a multiprocessor, from a few cycles at the instruction level, such as pipeline bubbles caused by branch instructions, to tens of cycles caused by cache misses (memory latency), up to hundreds of cycles caused by inter-processor communication, or even worse, thousands of cycles caused by I/O. Synchronization is equally important to enforce the ordering of instruction executions according to their data dependencies, in order to ensure deterministic behavior. There are basically two ways of fighting latency: reducing it and tolerating it. For instance, as a hardware approach, we might replace the disk with a new storage medium as fast as RAM to reduce the I/O latency; or we could apply a software approach, removing some of the inter-communication by applying compiler optimizations, or applying branch prediction at the instruction level to remove pipeline bubbles. However, no matter how advanced the technology is, we cannot eliminate all latencies. The latencies caused by communication and synchronization are inherent in parallel applications. Tolerating latency may therefore be the only available option under these circumstances. Prefetching and multithreading are two ways of tolerating latencies. Prefetching is a mechanism that loads data into the cache or local memory before it is actually used, anticipating that it will be used in the near future. Thus prefetching is very effective if the latency can be predicted at compile time. By contrast, multithreading is more general and flexible in coping with unpredicted latencies. In a multithreaded dataflow architecture (or multithreading with dataflow origins), a program is partitioned into multiple threads; each thread is the unit of execution and is scheduled to run dynamically. Within a thread, instructions are issued and executed sequentially, as on a von Neumann architecture. Across threads, execution is scheduled according to a firing rule. A strict firing rule allows a thread to execute only when all its inputs are available. During the execution of a thread, if there is a long-latency operation, the processor can switch to another ready thread and do other useful work. If there is enough parallelism in the application and the multithreaded machine can switch threads quickly, latencies caused by long-latency operations can be overlapped with the execution of useful instructions from other threads. Latencies are thus effectively hidden on such a multithreading machine. A key open question for multithreaded dataflow architectures is how best to partition programs into threads and what degree of granularity is best.
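To make the strict firing rule concrete, the short C sketch below models a thread's readiness with a synchronization count of missing inputs. The type and function names are our own illustration, not the TSTAR interface presented in Chapter 4.

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative model of a dataflow thread under a strict firing rule:
     * the thread may fire only once all of its inputs have arrived. */
    typedef struct {
        int missing_inputs;   /* synchronization count of inputs not yet seen */
    } df_thread;

    /* Called each time a producer delivers one input.
     * Returns true when the thread becomes ready to execute. */
    static bool input_arrived(df_thread *t) {
        return --t->missing_inputs == 0;
    }

    int main(void) {
        df_thread t = { .missing_inputs = 2 };
        printf("ready after 1st input: %d\n", input_arrived(&t)); /* 0 */
        printf("ready after 2nd input: %d\n", input_arrived(&t)); /* 1 */
        return 0;
    }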

1.1.3 TSTAR Multithreaded Dataflow Architecture

TSTAR is the multithreaded dataflow architecture we target: at the inter-thread level, threads are executed in a multithreading environment and scheduled according to the dataflow model to better exploit parallelism, while at the intra-thread level, each thread is compiled into sequential code for a von Neumann processor. The TSTAR execution model is a feed-forward dataflow execution model with explicit token matching, where producers write directly to their consumers. This model originates from several contributions of our partners in the TERAFLUX project (http://www.teraflux.eu): the DDM execution model from Arandi & Evripidou [2011] and the T* instruction set from Portero et al. [2011]. We present the TSTAR execution model in detail in Chapter 4. Figure 1.1 shows the TSTAR top-level architecture. C2 is a single core, which contains a processing element (x86-64 ISA with TStar extensions) along with its L1 cache. Each core also includes a partition of the L2 cache. In order to support the dataflow execution of threads, each core includes a hardware module that handles the scheduling of threads, the Local Thread Scheduling Unit (L-TSU). In addition to the cores, each node also contains a hardware module to coordinate the scheduling of the data-flow threads among the cores, the Distributed Thread Scheduling Unit (D-TSU), as well as a hardware module that monitors the performance and faults of the cores, the Distributed Fault Detection Unit (D-FDU). Each core is identified by a unique id, the Core ID (CID), and each group of cores belongs to a node whose id is the Node ID (NID). Nodes are connected via an inter-node network, the Network on Chip (NoC). Cores within a node are also connected via a NoC.

1.2 Task Granularity

An important issue in multiprocessor performance is the granularity of the program execution. The granularity of a parallel program can be defined as the average size of a sequential unit of computation in the program, with no inter-processor synchronization or communication. For a given multiprocessor machine, there is a fundamental trade-off between the granularity (amount of parallelism) and the overhead of synchronization. It is desirable for a parallel program to have small granularity, so that the multiprocessor can exploit a larger amount of parallelism. It is also desirable for a parallel program to have larger granularity, so that the synchronization and communication overhead is relatively small compared to the actual workload. Sarkar [1989] articulates this trade-off as the competing contributions of the ideal parallel execution time, the amount of time required to execute the program in the absence of overhead, and the overhead factor, the extra work required to schedule and coordinate the tasks. The ideal parallel execution time is multiplied by the overhead factor to yield the actual parallel execution time, the amount of time required to complete the program for a given task granularity in the presence of scheduling overhead. Figure 1.2 (from Sarkar) illustrates the general characteristics of the parallelism-overhead trade-off for a typical program running on a machine with ten processors.

Figure 1.1: TStar High level Architecture. (The figure shows the TSTAR top-level architecture on a single chip: memory and I/O attached to an inter-node NoC connecting nodes N1..Nn; each node, a logical abstraction, contains NM, D-TSU, and D-FDU modules and an intra-node NoC connecting cores C1..Cm; each core contains a processing element, an L1 cache, an L-TSU, and a partition of the L2 cache.)

This plot is suggestive of what was experienced running various programs on contemporary multiprocessors, but it does not express the data from a specific machine or application. The normalized execution time is the ratio n/s, where n is the number of processors and s is the actual speedup relative to a single processor. The normalized ideal parallel execution time increases from 1 to n as the task granularity increases from a single instruction to 100,000 instructions, at which point the entire program executes as a single task. The task overhead factor is given by (g + o)/g, where g is the task size in instructions and o is the per-task overhead, in this case 1000 instructions.

(The plot shows normalized execution time versus task size in instructions, on a logarithmic scale from 10 to 100,000, with curves for the actual parallel execution time, the ideal parallel execution time, and the overhead, assuming 10 processors, a 100,000-instruction program, and an overhead of 1000 instructions per task.)

Figure 1.2: The Parallelism-Overhead Trade-off (from Sarkar)

The figure shows that, for a given multiprocessor, there is a minimum program granularity value below which performance degrades significantly. Although the overhead shown in this figure is relatively large (1000 instructions per task), we believe that as multiprocessor designs evolve, the overhead will decrease accordingly. There is always a trade-off between the amount of parallelism and the task switching overhead. In our multithreaded dataflow architecture, we assume a relatively low task overhead, on the order of one to ten instructions per task; the hardware support largely reduces the synchronization and communication overhead. We therefore assume that task overhead is not a first-order issue: the objective of thread partitioning is to expose the maximum amount of parallelism, and coarsening the granularity remains an optimization pass in the procedure.
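The trade-off can be reproduced numerically. The short C program below evaluates the two competing terms for the parameters quoted above; it assumes, as our own simplification, that the ideal parallel time is bounded below by the duration of one task, i.e. T_ideal = max(N/n, g), which is a modeling assumption rather than a formula taken from Sarkar.

    #include <stdio.h>

    /* Reproduce the shape of the parallelism-overhead trade-off:
     * N = program size (instructions), n = processors,
     * g = task size (instructions), o = per-task overhead. */
    int main(void) {
        const double N = 100000.0, n = 10.0, o = 1000.0;

        for (double g = 10.0; g <= 100000.0; g *= 10.0) {
            /* Normalized ideal time: 1 when tasks are small, n when the
             * whole program is one task (assumes T_ideal = max(N/n, g)). */
            double ideal = (g > N / n) ? g / (N / n) : 1.0;
            double overhead_factor = (g + o) / g;      /* Sarkar's (g+o)/g  */
            double actual = ideal * overhead_factor;   /* normalized actual */
            printf("g=%6.0f  ideal=%5.2f  overhead=%6.2f  actual=%6.2f\n",
                   g, ideal, overhead_factor, actual);
        }
        return 0;
    }

With these parameters the actual normalized time is lowest for task sizes around 10,000 instructions and grows sharply for very small tasks, matching the qualitative behavior illustrated in Figure 1.2.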

1.3 Motivation

There are three fundamental problems to be solved when compiling a program for parallel execution on a multiprocessor:

• Identifying parallelism in the program.

• Partitioning the program into sequential tasks.

• Scheduling the tasks on processors.

The problem of identifying or expressing parallelism belongs to the domain of programming languages. Scheduling the tasks on processors is managed by a dedicated Thread Scheduling Unit (TSU) in the TSTAR multithreaded dataflow architecture. Partitioning the program into sequential tasks targeting the TSTAR architecture is the main focus of this dissertation. Multithreaded dataflow architectures require a high degree of thread-level parallelism (TLP) to tolerate latency. It is error-prone to ask programmers to partition a program into a large number of fine-grain threads. On the other hand, porting legacy code takes a large amount of work. Therefore, compiler support for multithreaded dataflow architectures is essential so that programs can be partitioned efficiently and the architectures widely accepted. Automatic program parallelization techniques plus programmer effort (e.g., annotations) can be applied to identify potential TLP. In our target architecture, threads are non-preemptive: once a thread starts its execution, it cannot be interrupted. In the non-preemptive model, thread start and end points are decided at compile time. It is the compiler's job to perform thread partitioning and optimize the partitioned code at compile time. Because of this, a compiler targeting a multithreaded dataflow architecture faces real challenges when automatically partitioning threads:

• It must partition the program correctly with respect to the dependence constraints.

• It should perform optimizations on the partitioned threads to make them execute efficiently.

• It should handle complex data structures (like arrays) in an efficient way.

Thread partitioning is a challenging task in developing a compiler for multithreaded dataflow architectures. Measuring performance metrics on the multithreaded dataflow architecture plays an equally important role. Before the real hardware processor is built, we need a flexible methodology to analyze the behavior of the proposed architecture. By performing simulations and analyzing the results with a full-system simulator, we can gain a thorough understanding of how the proposed architecture behaves and how to improve it, and validate the results before it goes into the production cycle. We address all the issues mentioned above in this dissertation.

1.4 Dissertation Outline

The rest of the dissertation is organized as follows. In Chapter 2 we state the problem of compiling for a multithreaded dataflow architecture. As a preliminary study on thread partitioning, Chapter 3 presents a thread partitioning algorithm that extends loop transformations and Parallel-Stage Decoupled Software Pipelining (PS-DSWP). Section 3.1 presents the related work on PS-DSWP and loop distribution. Section 3.2 presents our observations on extending loop distribution to PS-DSWP and a motivating example. Section 3.3 presents our thread partitioning algorithms based on the SSA and treegion representations. After deciding the partition points between the original treegions, Section 3.4 explains the code generation challenges in the presence of multiple producers and consumers. Chapter 4 presents a general view of the TSTAR dataflow architecture. Section 4.1 gives an overview of the dataflow execution model. Section 4.2 further explains the TSTAR execution model from different aspects (multithreading model, memory model, and synchronization), and Section 4.3 gives an overview of the TSTAR architecture. Chapter 5 presents a general thread partitioning algorithm targeting the TSTAR architecture, which transforms sequential imperative programs into parallel data-flow programs. The algorithm operates on a program dependence graph in SSA form, extracting task, pipeline, and data parallelism from arbitrary control flow, and coarsening its granularity using a generalized form of typed fusion. A prototype is implemented in GCC, and we give the evaluation results in the final section. Chapter 6 complements Chapter 5. In this chapter, we study methods for handling complex data structures (such as arrays) in thread partitioning. Streaming Conversion of Memory Dependences (SCMD) dynamically connects independent writes and reads to the same memory location with streams. The Owner Writable Memory (OWM) model is proposed to reduce the communication overhead when operating on complex data structures. Chapter 7 presents the simulation infrastructure and benchmarks written with the OWM memory model to validate the approach. We conclude and draw some perspectives in Chapter 8.

Chapter 2

Problem Statement

Figure 2.1 shows a general strategy for automatic thread partitioning. A serial program is translated into intermediate representations (the Program Dependence Graph and the Data Flow Program Dependence Graph). The IR is the input to the compile-time analysis phase of the partitioning system. The output produced by compile-time analysis is a partition of the program graph into subgraphs that represent tasks. Finally, the code generation phase generates code for both the x86 and TSTAR dataflow architectures. Depending on the target, the compilation strategy diverges after thread partitioning. If the target is the x86 architecture, the TSTAR backend in the compiler is disabled and the generated code is linked against an x86 dataflow runtime library. If the target is the TSTAR dataflow architecture, the backend translates the dataflow builtin functions into the dataflow ISA, so that the program can be simulated on the full-system simulator. Thread partitioning is a challenging task in developing a compiler for a multithreaded dataflow machine. Besides the points we discussed in the previous chapter, we stress a few more points in this section.

(Figure 2.1 depicts the flow: a serial program passes through the front end to a Program Dependence Graph and a Data Flow Program Dependence Graph, which feed the thread partitioning compile-time analysis; the resulting partitioned graph goes to code generation, which, depending on the chosen target, emits code for the x86 architecture linked against a dataflow runtime and run on an x86 machine, or code for the TSTAR architecture using the dataflow ISA and run on the simulator.)

Figure 2.1: General strategy for thread partitioning.

2.1 Explicit token matching shifts the challenges in hardware design to compilation

In the non-preemptive thread model, each thread runs to completion once started. The data and control dependences need to be resolved and satisfied at the partitioning stage. Unlike tagged-token dataflow machines, tokens are stored separately in the Frame Memory, and an explicit token matching strategy is used in the TSTAR execution model. Tagged-token dataflow machines are implemented through sophisticated matching hardware, which dynamically schedules operations with available operands Arvind & Culler [1986b]; Memo & Culler [1983]. When a token arrives at a processor, the tag it carries is checked against the tags present in the token store. If a matching token is found, it is extracted and the corresponding instruction is enabled for execution; otherwise, the incoming token is added to the store. This allows for a simple non-blocking processor pipeline that can overlap instructions from closely related or completely unrelated computations. However, the matching operation involves considerable complexity on the critical path of instruction scheduling Gajski et al. [1982]. A more subtle problem with the matching paradigm is that a failure to find a match implicitly allocates resources within the token store. Mapping computations to processors thus places an unspecified commitment on the token storage hardware; if this resource becomes overcommitted, the program may deadlock. Explicit token matching shifts the complications from the hardware design of the token-matching system to the compiler. The compiler has to match producers and consumers explicitly. In the TSTAR execution model, as producer data-flow threads communicate by writing directly into the data-flow frame of their consumers, it is necessary that, along all data dependence edges of the program dependence graph, the producer nodes know the data-flow frame of the consumer nodes. This becomes more complicated when the producer and consumer are created by separate control programs, which means there needs to be a way to communicate such information. We address this issue in Chapter 5.
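As an illustration of what the compiler must arrange, the following self-contained C sketch mimics explicit token matching in software: the producer holds a handle to its consumer's frame, writes its result directly into the right slot, and decrements the consumer's synchronization count, firing the consumer when the count reaches zero. The function and field names are hypothetical and do not correspond to the actual TSTAR builtins.

    #include <stdio.h>

    /* Hypothetical software model of explicit token matching: no token
     * store, no associative matching; producers write straight into the
     * consumer's frame. */
    typedef struct df_thread {
        int sc;                            /* synchronization count            */
        int frame[2];                      /* data-flow frame: one slot/input  */
        void (*body)(struct df_thread *);  /* sequential thread body           */
    } df_thread;

    static void df_write(df_thread *consumer, int slot, int value) {
        consumer->frame[slot] = value;     /* direct write into consumer frame */
        if (--consumer->sc == 0)           /* last missing input just arrived  */
            consumer->body(consumer);      /* strict firing: run to completion */
    }

    static void add_and_print(df_thread *self) {
        printf("%d\n", self->frame[0] + self->frame[1]);
    }

    int main(void) {
        df_thread consumer = { .sc = 2, .body = add_and_print };
        df_write(&consumer, 0, 40);        /* first producer                   */
        df_write(&consumer, 1, 2);         /* second producer: fires consumer  */
        return 0;
    }

The point is that there is no token store and no associative matching: the compiler, or the control program that creates the threads, must hand each producer the frame of its consumer, which is what the DF-PDG construction of Chapter 5 is designed to provide.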

2.2 The complex data structure should be handled in an efficient way

There are two basic approaches to representing data structures such as arrays: direct and indirect access schemes Gaudiot & Wei [1989]. The direct access scheme treats each structure element as an individual data token. The dataflow functionality principle implies that all operations are side-effect free, and the direct access scheme is compatible with this principle. However, the absence of side effects also means that an operation on a token carrying a vector, an array, or another complex data structure results in a new data structure, which greatly increases the communication overhead in practice. In an indirect access scheme, on the other hand, data structures are stored in special memory units and their elements are accessed through "read" and "write" operations. The literature has shown that storing the data structures and representing them by indirect access incurs less overhead than the direct access scheme in transmitting data, reconstructing the resultant array, and randomly accessing array elements. However, the indirect access approach is not free of charge. When the array elements are stored in separate storage, to preserve the functionality principle of dataflow, a write to a single element of the array might result in copying the entire array. If the array elements are stored in a virtual memory that is globally addressable by all nodes, maintaining a consistent view of the virtual memory across the related threads becomes a demanding task. We address this problem in Chapter 6.
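The cost alluded to above can be seen in a minimal C sketch of a purely functional array update (our own illustration, not code from the thesis): because the old version must remain valid, writing a single element forces an O(n) copy.

    #include <stdlib.h>
    #include <string.h>

    /* Purely functional "write": returns a new array version with a[i] = v,
     * leaving the original untouched. The one-element update costs O(n). */
    static double *functional_store(const double *a, size_t n, size_t i, double v) {
        double *b = malloc(n * sizeof *b);
        if (!b) return NULL;
        memcpy(b, a, n * sizeof *b);   /* copy the whole array          */
        b[i] = v;                      /* apply the one-element change  */
        return b;                      /* both versions now coexist     */
    }

    int main(void) {
        double a[3] = { 1.0, 2.0, 3.0 };
        double *b = functional_store(a, 3, 1, 42.0);  /* copies all 3 elements */
        free(b);
        return 0;
    }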

2.3 Related Work

2.3.1 Compiling imperative programs to data-flow threads

The problem of compiling imperative programs for data-flow execution has been widely studied. Beck et al. [1989] propose a method for translating control flow to data flow, and show that data-flow graphs can serve as an executable intermediate representation in parallelizing compilers. Ottenstein et al. [1990] study such a translation using the Program Dependence Web, an intermediate representation based on gated SSA Tu & Padua [1995] that can directly be interpreted in either control-, data-, or demand-driven models of execution. Programs are transformed into the MIT dataflow program graph Arvind & Nikhil [1990], targeting the Monsoon architecture. Najjar et al. [1994] evaluated multiple techniques for extracting thread-level data-flow. These papers target a token-based, instruction-level data-flow model, analogous to the simulation of hardware circuits. In contrast, our data-flow model does not require tokens or associative maps, shifting to the compiler the effort of explicitly assigning consumer threads to their producers. The comparison between our approach and the token-based solution is further discussed in Chapter 5. In addition, thread-level data-flow requires additional effort to coarsen the grain of concurrency, handle the dynamic creation of threads, and manage their activation records (data-flow frames).

2.3.2 SSA as an intermediate representation for data-flow compilation

The static single assignment form (SSA) is formally equivalent to a well-behaved subset of the continuation-passing style (CPS) model Appel [1998]; Kelsey [1995], which is used in compilers for functional languages such as Scheme, ML, and Haskell. The data-flow model has been tied closely to functional languages, since the edges in a data-flow graph can be seen both as encoding dependence information and as continuations in a parallel model of execution. The SSA representation builds a bridge between imperative languages and the data-flow execution model. Our algorithm uses the properties of SSA to streamline the conversion of general control flow into thread-level data-flow.
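The correspondence can be made concrete with a small sketch of our own in C: each basic block becomes a function, and the Φ node at the join point becomes the parameter of the join function, so that every use sees exactly one reaching definition.

    #include <stdio.h>

    /* SSA view of:  if (c) x = a * 2; else x = a - 1;  return x + 1;
     *   x1 = a * 2;  x2 = a - 1;  x3 = phi(x1, x2);  return x3 + 1;
     * Functional (CPS-like) view: the join block is a function whose
     * parameter plays the role of the phi node x3. */
    static int join_block(int x3) { return x3 + 1; }
    static int then_block(int a)  { int x1 = a * 2; return join_block(x1); }
    static int else_block(int a)  { int x2 = a - 1; return join_block(x2); }

    static int f(int a, int c)    { return c ? then_block(a) : else_block(a); }

    int main(void) {
        printf("%d %d\n", f(10, 1), f(10, 0));  /* prints: 21 10 */
        return 0;
    }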

2.3.3 Decoupled software pipelining

Closely related to our work, and in particular to our analysis framework, is the decoupled software pipelining (DSWP) technique Ottoni et al. [2005]. It partitions loops into long-running threads that communicate via inter-core queues, following the execution model of Kahn process networks Kahn [1974]. DSWP builds a Program Dependence Graph (PDG) Ferrante et al. [1987], combining control and data dependences (scalar and memory). In contrast to the DOALL and DOACROSS Cytron [1986] methods, which partition the iteration space into threads, DSWP partitions the loop body into several stages connected by pipelining to achieve parallelism. It exposes parallelism in cases where DOACROSS is limited by loop-carried dependences on the critical path, and it handles uncounted loops, complex control flow, and irregular pointer-based memory accesses. Parallel-Stage Decoupled Software Pipelining (PS-DSWP) Raman et al. [2008] is an extension that combines pipeline parallelism with some stages executed in a DOALL, data-parallel fashion. For example, when there are no dependences between loop iterations of a DSWP stage, the incoming data can be distributed over multiple data-parallel worker threads dedicated to this stage, while the outgoing data can be merged to proceed with downstream pipeline stages. These techniques have a few caveats, however. They offer limited support for decoupling along backward control and data dependences. They provide a complex yet somewhat conservative code generation method to decouple dependences between source and target statements governed by different control flow.

2.3.4 EARTH thread partitioning

Another thread partitioning method closely related to our work is the one used in the EARTH architecture compilation technique Tang et al. [1997]. That thread partitioning method operates on the Program Dependence Flow Graph (PDFG). A PDFG can be built from the Program Structure Tree (PST) representation. The PST of a program is a hierarchical representation of the control structure of the program (similar to the control dependence graph we use in this thesis). In the PST representation, all the nodes that are control dependent on the same node belong to the same region. The PDFG is built upon the PST by introducing data dependences inside each basic region. Cross-region data dependences are promoted up to a common region level. Our thread partitioning method, presented in Chapter 5, differs in several aspects:

• In the EARTH partitioning method, the promotion of cross-region dependences creates extra overhead upon each promotion: the outer regions need to create continuations to wait for the data passed from the inner region, and when the destination region gets the data from the source, it still needs to pass the data down to the lower region where the actual dependence occurs. Our method, instead of promoting dependences, preserves the point-to-point cross-region dependence, passing the consumer's thread information to its producer by building and analyzing the Data Flow Program Dependence Graph (DF-PDG). The detailed algorithm is described in Section 5.2.5.

• Our method operates on the SSA-PDG form. There are two main reasons for relying on an SSA-based representation of the PDG: (1) SSA is formally equivalent to a well-behaved subset of the continuation-passing style model; by making the reaching definition unique for each use, it effectively converts the scalar flow into a functional program. (2) It reduces the complexity of analyzing def-use chains from O(N²) to O(N), as the number of def-use edges can otherwise become very large, sometimes quadratic in the number of nodes.

2.3.5 Formalization of the thread partitioning cost model

Sarkar [1989] has formally defined the macro-dataflow model and developed a cost model for the partitioning method targeting this dataflow architecture, showing that the problem of determining the optimal partition, i.e. the one with the smallest execution time, is NP-complete in the strong sense. Tang Tang & Gao [1999]; Tang et al. [1997] has formalized the EARTH dataflow execution model and developed a cost model for this architecture, showing that the thread partitioning problem is NP-complete. The TSTAR dataflow architecture is similar to EARTH. The formalization is not our focus in this thesis since it has already been established. Thus, in this thesis, we provide a simple heuristic algorithm for coarsening the granularity of the dataflow threads.

Chapter 3

Thread Partitioning I: Advances in PS-DSWP

As a preliminary study on thread partitioning, we present a thread partitioning algorithm in this chapter. The algorithm extends loop distribution with pipelining towards Parallel-Stage Decoupled Software Pipelining (PS-DSWP). By asserting a synchronous concurrency hypothesis, the data and control dependences can be decoupled naturally with only minor changes to existing algorithms that have been proposed for loop distribution and loop fusion. Section 3.1 presents the related work on PS-DSWP and loop distribution. Section 3.2 presents our observations on extending loop distribution to PS-DSWP and a motivating example. Section 3.3 presents our thread partitioning algorithms based on the SSA and treegion representations. After deciding the partition points between the original treegions, Section 3.4 explains the code generation challenges in the presence of multiple producers and consumers.

3.1 Introduction

The work most closely related to this chapter is decoupled software pipelining and loop distribution. We recall the state of the art in both and present the original finding at the source of this work: by extending loop distribution with pipelining and asserting a synchronous concurrency hypothesis, arbitrary data and control dependences can be decoupled very naturally with only minor changes to existing algorithms that have been proposed for loop distribution Kennedy & McKinley [1990].

3.1.1 Decoupled software pipelining

Decoupled Software Pipelining (DSWP) Ottoni et al. [2005] is one approach to automatically extract threads from loops. It partitions loops into long-running threads that communicate via inter-core queues. DSWP builds a Program Dependence Graph (PDG) Ferrante et al. [1987], combining control and data dependences (scalar and memory). DSWP then introduces a load-balancing heuristic to partition the graph according to the number of cores, making sure no recurrence spans multiple partitions. In contrast to the DOALL and DOACROSS Cytron [1986] methods, which partition the iteration space into threads, DSWP partitions the loop body into several stages connected by pipelining to achieve parallelism. It exposes parallelism in cases where DOACROSS is limited by loop-carried dependences on the critical path. Generally speaking, the DSWP partitioning algorithm handles uncounted loops, complex control flow, and irregular pointer-based memory accesses. Parallel-Stage Decoupled Software Pipelining Raman et al. [2008] (PS-DSWP) is an extension that combines pipeline parallelism with some stages executed in a DOALL, data-parallel fashion. For example, when there are no dependences between loop iterations of a DSWP stage, the incoming data can be distributed over multiple data-parallel worker threads dedicated to this stage, while the outgoing data can be merged to proceed with downstream pipeline stages. These techniques have a few caveats, however. They offer limited support for decoupling along backward control and data dependences. They provide a complex code generation method to decouple dependences among source and target statements governed by different control flow, but despite its complexity, this method remains somewhat conservative. By building the PDG, DSWP also incurs a higher algorithmic complexity than typical SSA-based optimizations. Indeed, although traditional loop pipelining for ILP focuses on innermost loops of limited size, DSWP is aimed at processing large control flow graphs after aggressive inter-procedural analysis and optimization. In addition, loops in DSWP are handled by the standard algorithm as ordinary control flow, missing potential benefits of treating them as a special case. To address these caveats, we turned our analysis to the state of the art in loop distribution.

3.1.2 Loop distribution

Loop distribution is a fundamental transformation in program restructuring systems designed to extract data parallelism for vector or SIMD architectures Kennedy & McKinley [1990]. In its simplest form, loop distribution consists of breaking up a single loop into two or more consecutive loops. When aligning loop distribution with the strongly connected components of the data-dependence graph, one or more of the resulting loops expose iterations that can be run in parallel, exposing data parallelism. Barriers are inserted after the parallel loops to enforce precedence constraints with the rest of the program. An example is presented in Figure 3.1.

Figure 3.1: Barriers inserted after loop distribution.
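The listing of Figure 3.1 is not reproduced here, so the following C sketch is our own illustration of the transformation it depicts, expressed with OpenMP parallel loops for concreteness: a loop whose statements communicate through array A is split into two consecutive loops, and the barrier implied at the end of the first parallel loop preserves the precedence constraint. The helper functions and loop bounds are invented for the example.

    static double f(int i)    { return (double)i; }      /* placeholder work */
    static double g(double x) { return 2.0 * x + 1.0; }  /* placeholder work */

    /* Before distribution: the two statements communicate through A with a
     * distance-one dependence (A[i-1] was produced in the previous iteration). */
    void original_loop(double *A, double *B, int n) {
        for (int i = 1; i < n; i++) {
            A[i] = f(i);          /* produce A[i]   */
            B[i] = g(A[i - 1]);   /* consume A[i-1] */
        }
    }

    /* After distribution: each loop is data-parallel on its own; the barrier
     * implied at the end of the first parallel loop preserves the precedence
     * constraint on A before the second loop starts. */
    void distributed_loops(double *A, double *B, int n) {
        #pragma omp parallel for
        for (int i = 1; i < n; i++)
            A[i] = f(i);
        /* implicit barrier here */
        #pragma omp parallel for
        for (int i = 1; i < n; i++)
            B[i] = g(A[i - 1]);
    }

    int main(void) {
        double A[8] = {0}, B1[8] = {0}, B2[8] = {0};
        original_loop(A, B1, 8);
        distributed_loops(A, B2, 8);
        return B1[7] == B2[7] ? 0 : 1;   /* both versions compute the same B */
    }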

3.2 Observations

It is quite intuitive that the typical synchronization barriers in between distributed data-parallel loops can be weakened, resulting in data-parallel pipelines. We aim to provide a comprehensive treatment of this transformation, generalizing PS-DSWP in the process.

21 3.2.1 Replacing loops and barriers with a task pipeline

In the previous example, we can remove the barrier between the two distributed loops and replace it with pipelining, so that the two loops run in parallel.

(Figure 3.2 listing: the left column defines INIT_STREAM, which initializes the stream and inserts a delay; the right column shows the decoupled producer and consumer loops.)

Figure 3.2: Pipelining inserted between distributed loops. Initialize the stream (left), producer and consumer thread (right).
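Likewise, the following sketch illustrates, with our own naming and a small bounded FIFO standing in for the hardware stream, how the barrier of the previous example can be replaced by produce/consume operations, with one initial delay inserted by INIT_STREAM. It is an illustration of the idea, not the code of Figure 3.2.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define N   64                 /* iterations              */
    #define CAP  8                 /* stream (FIFO) capacity  */

    static double buf[CAP];
    static atomic_int head, tail;  /* single-producer single-consumer ring */

    static void produce(double v) {            /* blocking push */
        int t = atomic_load(&tail);
        while (t - atomic_load(&head) == CAP)  /* stream full: wait */
            ;
        buf[t % CAP] = v;
        atomic_store(&tail, t + 1);
    }

    static double consume(void) {              /* blocking pop */
        int h = atomic_load(&head);
        while (atomic_load(&tail) == h)        /* stream empty: wait */
            ;
        double v = buf[h % CAP];
        atomic_store(&head, h + 1);
        return v;
    }

    static double A[N], B[N];

    static void INIT_STREAM(void) { produce(0.0); }  /* insert one delay: A[0] */

    static void *producer(void *arg) {         /* first distributed loop */
        (void)arg;
        for (int i = 1; i < N; i++) {
            A[i] = (double)i;
            produce(A[i]);                     /* forward A[i] on the stream */
        }
        return NULL;
    }

    static void *consumer(void *arg) {         /* second distributed loop */
        (void)arg;
        for (int i = 1; i < N; i++)
            B[i] = 2.0 * consume() + 1.0;      /* reads A[i-1] thanks to the delay */
        return NULL;
    }

    int main(void) {
        INIT_STREAM();
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        printf("B[N-1] = %f\n", B[N - 1]);
        return 0;
    }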

Figure 3.2 shows that pipelined execution is possible: the INIT_STREAM function inserts one delay into a communication stream; the produce/consume primitives implement a FIFO, enforcing the precedence constraint of the data dependence on array A and communicating the value in case the hardware needs this information. When distributing loops, scalar and array expansion (privatization) is generally required to eliminate memory-based dependences. The conversion to a task pipeline avoids this complication through the use of communication streams. This transformation can be seen as an optimized version of scalar/array expansion, in bounded memory and with improved locality Pop et al. [2009].

3.2.2 Extending loop distribution to PS-DSWP

The similarity between DSWP and distributed loops with data-parallel pipelines is striking. First, both partition the loop into multiple threads. Second, both avoid partitioning the loop iteration space: they partition the instructions of the loop body instead. But four arguments push in favor of refining DSWP in terms of loop distribution.

1. Loop distribution leverages the natural loop structure, where the granularity of thread partitioning can be easily controlled. Moreover, it is useful to have a loop control node to which to attach information about the iteration of the loop, including closed forms of induction variables; this node can also be used to represent the loop in additional transformations.

2. Using a combination of loop distribution and fusion, then replacing barriers with pipelining, leads to an incremental path in compiler construction. This path leverages existing intermediate representations and loop nest optimizers, while DSWP relies on new algorithms and a program dependence graph.

3. Considering the handling of control dependences, a robust and general algorithm already exists for loop distribution. McKinley and Kennedy's technique handles arbitrary control flow Kennedy & McKinley [1990] and provides a comprehensive solution. The same method could be applied to DSWP, transforming control dependences into data dependences and storing boolean predicates into streams. After restructuring the code and updating the control dependence graph and data dependence graph, the code generation algorithm for PDGs Baxter & Bauer [1989]; Ferrante & Mace [1985]; Ferrante et al. [1988] can be used to generate parallel code. This solution would handle all cases where the current DSWP algorithm fails to clone a control condition.

4. Since loop distribution does not partition the iteration space, it can also be applied to uncounted loops. Unfortunately, the termination condition needs to be propagated to downstream loops. This problem disappears through the use of a conventional communication stream when building task pipelines.

From this high-level analysis, it appears possible to extend loop distribution with pipelining to implement PS-DSWP and handle arbitrary control dependences. Yet the method still seems rather complex, especially the if-conversion of control dependences and the code generation step from the PDG. We go one step further and propose a new algorithm adapted from loop distribution but avoiding these complexities.

3.2.3 Motivating example

Our method makes one more assumption to reduce complexity and limit the risk of overhead. It amounts to enforcing the synchronous hypothesis on all communicating tasks in the partition Halbwachs et al. [1991]. A sufficient condition is to check that the source and target of any decoupled dependence depend on the same control node. Consider the example in Figure 3.3. S1 and S7 implement the loop control condition and the induction variable, respectively. S2, S3, and S6 are control dependent on S1. S3 is a conditional node; S4, S5, and L1 are control dependent on it. In the inner loop, L2 and L3 are control dependent on L1. When we apply DSWP to the outer loop, the control dependences originating from S1 must be if-converted by creating several streams (the number of streams depends on the number of partitions). When decoupling along the control dependence originating from S3, a copy of the conditional node must be created, as well as another stream.

S1  while (p != NULL) {
S2    x = p->value;
S3    if (c1) {
S4      x = p->value / 2;
S5      ip = p->inner_loop;
L1      while (ip) {
L2        do_something(ip);
L3        ip = ip->next;
        }
      }
S6    ... = x;
S7    p = p->next;
    }

Figure 3.3: Uncounted nested loop before partitioning.

Figure 3.4 shows the conversion to SSA form. Just like GCC, we use a loop-closed SSA form distinguishing between loop-Φ and cond-Φ nodes. The latter

S1  while (p1 = Φloop(p0, p2)) {
S2    x1 = p1->value;
S3    if (c1) {
S4      x2 = p1->value / 2;
S5      ip1 = p1->inner_loop;
L1      while (ip2 = Φloop(ip1, ip3)) {
L2        do_something(ip2);
L3        ip3 = ip2->next;
        }
      }
      x3 = Φc1(x1, x2);   /* cond-Φ */
S6    ... = x3;
S7    p2 = p1->next;
    }

Figure 3.4: Uncounted nested loop in SSA form.

take an additional condition argument, appearing as a subscript, to make the selection condition explicit. The partitioning technique will build a stream to communicate this condition from its definition site to the cond-Φ node's task. We build on the concept of treegion, a single-entry multiple-exit control-flow region induced by a sub-tree of the control dependence graph. In the following, we assume the control flow is structured, which guarantees that the control dependence graph forms a tree. Every sub-tree can be partitioned into concurrent tasks according to the control dependences originating from its root. Any data dependence connecting a pair of such tasks induces communication over a dedicated stream. We call taskM_N the N-th task at level M of the control flow tree. In Figure 3.4, after building the control dependence tree, one may partition it into 3 tasks (task1_1, task1_2 and task1_3) at the root level, and one may further partition task1_2 into the nested tasks task2_1 and task2_2. One may then check for data parallelism in the inner loops; if they do not carry any dependence, one may isolate them in additional data-parallel tasks, such as task3_1 in this example. Figure 3.5 shows the task- and stream-annotated code using an OpenMP syntax. Figure 3.6 shows the nested pipelining and data parallelization corresponding

//task0-0 (main task)
S1  while (p1 = Φloop(p0, p2)) {
      //persistent-task1-1
      #pragma task firstprivate(p1) output(x1)
      {
S2      x1 = p1->value;
      }
      //persistent-task1-2
      #pragma task output(c1, x2)
      {
S3      if (c1) {
          //persistent-task2-1
          #pragma task firstprivate(p1) output(ip1) lastprivate(x2)
          {
S4          x2 = p1->value / 2;
S5          ip1 = p1->inner_loop;
          }
          //persistent-task2-2
          #pragma task input(ip1)
          {
L1          while (ip2 = Φloop(ip1, ip3)) {
              //parallel-task3-1
              #pragma omp task firstprivate(ip2)
              {
L2              do_something(ip2);
              }
L3            ip3 = ip2->next;
            }
          }
        }
      }
      //persistent-task1-3
      #pragma task input(c1, x1, x2)
      {
        x3 = Φc1(x1, x2);    // cond-Φ
S6      ... = x3;
      }
S7    p2 = p1->next;
    }

Figure 3.5: Loops after partitioning, annotated with the OpenMP stream extension.

[Figure 3.6 depicts the main task feeding persistent_task1_1, persistent_task1_2 and persistent_task1_3 through streams carrying p and x, with persistent_task2_1 and persistent_task2_2 nested inside task1_2, and the data-parallel task3_1 spawned at each iteration of the inner loop on stream ip.]

Figure 3.6: Pipelining and parallelization framework.

The main task is executed first, and a pipeline is created between the main task and its three inner tasks task1_1, task1_2 and task1_3. Among these, the same variable x used to be defined in the control flow regions of both task1_1 and task1_2, and used in task1_3. This output dependence must be eliminated prior to partitioning into tasks, so that task1_1 and task1_2 can be decoupled, while task1_3 decides internally which value to use. Nested tasks are introduced to provide fine-grained parallelism. It is of course possible to adapt the partition and the number of nesting levels according to load balancing and synchronization overhead. The generated code is well structured, and simple top-down heuristics can be used.

In the execution model of OpenMP 3.0, a task instance is created whenever the execution flow of a thread encounters a task construct; no ordering of tasks can be assumed. Such an execution model is well suited for unbalanced loads, but the overhead of creating tasks is significantly more expensive than synchronizing persistent tasks. To improve performance, we use the persistent task model for pipelining, in which a single instance handles the full iteration space, consuming data on the input stream and producing on the output stream Pop & Cohen [2011a].

In Figure 3.6, all the tasks except task3_1 use the persistent model to reduce the overhead of task creation; task3_1 is an ordinary task following the execution model of OpenMP 3.0 (instances are spawned every time the control flow encounters the task directive). All these tasks are scheduled by the OpenMP runtime.

One problem with the partitioning algorithms is that the number of def-use edges (scalar dependences) can become very large, sometimes quadratic in the number of nodes Kennedy & Allen [2002]. Figure 3.7 (left) presents an example that illustrates this problem. Statements S1 and S2 define the variable x. These definitions all reach the uses in statements S3 and S4 by passing through S5. Because each definition could reach every use, the number of definition-use edges is proportional to the square of the number of statements. These dependences constitute the majority of the edges in a PDG. SSA provides a solution to this problem. In SSA form, each assignment creates a different variable name, and at points where control flow joins, a special operation is inserted to merge the different incarnations of the same variable. Figure 3.7 (right) is the original program in SSA form. A merge node (Φ) is inserted at S5, killing the definitions from S1 and S2. In SSA form, the number of definition-use edges is thus reduced from quadratic to linear.

The systematic elimination of output dependences is also facilitated by the SSA form, with a Φ node in task1_3. Notice that the conditional expression from which this Φ node selects one or another input also needs to be communicated through a data stream.

When modifying loop distribution to rely on tasks and pipelining rather than barriers, it is not necessary to distribute the loop control node: one may run it entirely in the master task, which in turn activates tasks for the inner partitions. The statements inside each partition form a treegion whose root is the statement that is dependent on the loop control node. Distributed loops can then be connected with pipelining when there are data dependences between them.

One concern here is that loop distribution with task pipelines may not provide enough expressiveness to extract pipeline parallelism.

Without SSA:                        With SSA:
S1: x = ...     S2: x = ...         S1: x1 = ...    S2: x2 = ...
                                    S5: x5 = phi(x1, x2)
S3: ... = x     S4: ... = x         S3: ... = x5    S4: ... = x5

Figure 3.7: Definition and use edges in the presence of control flow.

This is not a problem, however, since we may apply the same method to every treegion rooted at a conditional statement; with some special care for the nested tasks, we obtain fine-grained parallelism without explicitly decoupling the control dependences. Considering again the example in Figure 3.3, its control dependence tree is given in Figure 3.8. The root treegion includes all the nodes of the control dependence graph. treegion1_2 is the treegion at conditional level 1 whose root is node S2. treegion1_3 is at conditional level 1 and includes nodes (S3, S4, S5, L1, L2, L3). treegion2_1 is at conditional level 2 and its root is node L1, which is the loop control node of the inner loop.

Following our approach, we start from the treegion at conditional level 0, which is the whole loop; an implicit task is created for it as the master task. The treegions at level 1 are created as sub-tasks running in the context of the main task. If there are data dependences between treegions at the same level and no recurrence, we connect them with communication streams. If there is a dependence from the master task to an inner task, the value from the enclosing context can be forwarded to the inner task as in a firstprivate clause of OpenMP. Dependences from an inner task to the master task are also supported; although lastprivate is not natively supported for OpenMP 3.0 tasks, it is a necessary component of our streaming task representation.

[Figure 3.8: the control dependence tree has the loop control node (S1, S7) as its root, with children S2, S6 and S3; the children of S3 are S4, S5 and L1; the children of L1 are L2 and L3. treegion1_2 is rooted at S2, treegion1_3 at S3, and treegion2_1 at L1.]

Figure 3.8: Control dependence graph of Figure 3.3, illustrating the definition of treegions.

lastprivate(x) is associated with a synchronization point at the end of the task and makes the value of x available to the enclosing context. The same algorithm can be applied recursively to the treegions at the next inner level. For example, for treegion1_3 at level 1, the sub-treegions at level 2 are treegion2_4, treegion2_5 and treegion2_1; we may create sub-tasks by merging treegion2_4 and treegion2_5 into one sub-task and treegion2_1 (which is also the inner loop) into another, or only for part of them. To reveal data parallelism, we can reuse the typed fusion algorithm introduced by McKinley

and Kennedy Kennedy & Mckinley [1993]: it is possible to fuse communicating data-parallel nodes to increase the synchronization grain or improve load balancing. In this example, the inner loop around node L2 does not carry any dependence, and we decouple it from its enclosing task to expose data parallelism.
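For illustration, the following hypothetical fragment sketches how the firstprivate/lastprivate forwarding described above could be written with the stream-extended task syntax of Figure 3.5. The names work, combine, total and r are illustrative only and are not part of the running example.

while (p != NULL) {
  int v = p->value;                 /* defined in the master task        */
  int r;
  #pragma task firstprivate(v) lastprivate(r)
  {
    r = work(v);                    /* inner task consumes v, defines r  */
  }                                 /* r is synchronized at the task end */
  total = combine(total, r);        /* master task uses r                */
  p = p->next;
}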

3.3 Partitioning Algorithm

In this section, we present our partitioning algorithm, based on the SSA and treegion representations. We first define our model and the important constructs used by the algorithm, then present and describe the algorithm itself.

3.3.1 Definitions

In this work, we only target natural structured loops Böhm & Jacopini [1966]. Such loops are single-entry single-exit CFG sub-graphs with one entry block and possibly several back edges leading to the header from inside the loop. break and continue statements can be preprocessed to comply with this restriction, but we plan to lift it altogether in the future.

Treegion The canonical definition of a treegion is a non-linear, single-entry multiple-exit region of code containing basic blocks that constitute a sub-graph of the CFG. We alter this definition to bear on the Control-Dependence Graph (CDG) instead, so we will be looking at single-entry multiple-exit sub-graphs of the CDG.

Loop Control Node In the representation we employ later, we use a loop control node to represent the loop. The loop control node includes the statements that evaluate the loop control expression and determine the next iteration. Although control dependences in loops can be handled by the standard algorithm by converting them to control flow, there are advantages in treating them as a special case and coalescing them into a single node (the loop control node): not only is the backward dependence removed by building the loop control node, so that the control dependence graph forms a tree, but this node can also be used to represent the loop in all sorts of transformations.

Conditional Level After building the loop control node, the control dependence graph of structured code is a tree whose root is the loop control node of the outermost loop level. We define the conditional level of every node in the control dependence graph as the depth of the node in the tree; the root, at depth 0, has conditional level 0. The conditional level of a treegion is the conditional level of the root node of the treegion (subtree). We write treegionN_M to identify a treegion where N is its conditional level and M is the number of its root node.
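The following minimal C sketch shows one way to represent CDG nodes and compute conditional levels as tree depths; the field and function names are our own assumptions, not part of the prototype.

/* Minimal sketch of a CDG node; the conditional level is the depth of
   the node in the control dependence tree rooted at the outermost loop
   control node. */
struct cdg_node {
  int id;                     /* statement or loop-control-node number  */
  struct cdg_node *parent;    /* immediate control dependence parent    */
  struct cdg_node **children; /* roots of the nested treegions          */
  int n_children;
  int cl;                     /* conditional level (depth in the tree)  */
};

/* Recursively assign conditional levels, starting from the root
   (the outermost loop control node) with level 0. */
void assign_conditional_levels(struct cdg_node *n, int level) {
  n->cl = level;
  for (int i = 0; i < n->n_children; i++)
    assign_conditional_levels(n->children[i], level + 1);
}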

3.3.2 The algorithm

The algorithm takes an SSA representation of a single function, and returns a concurrent representation annotated with tasks and communication streams.

Step 1: Transform Conditional Statements to Conditional Variables To achieve fine-grained pipelining, conditional statements are split to introduce conditional variables, as shown in Figure 3.9. Full conversion to three-address SSA form is also possible (as performed in GCC or LLVM, for example).

if (condition(i))      /* is transformed to */     c1 = condition(i);
                                                   if (c1)

Figure 3.9: Split conditional statements to expose finer grained pipelining.

Step 2: Build the Program Dependence Graph under SSA Building the program dependence graph constructs together the control dependence graph, the data dependence graph (through memory) and the scalar dependence graph (through registers).

The control dependence graph of structured code is a tree whose root is the loop control node. The leaves of the tree are non-conditional statements, and the other nodes are conditional statements or the loop control nodes of inner loops. We start by building the control dependence graph and evaluate the conditional level of each node in the graph. Every node of the control dependence graph is a statement from the compiler's intermediate representation of the loop, except for the loop control node. The loop control node is built by searching for the strongly connected component starting from the loop header node (at each loop nest level) in the program dependence graph. The data dependence graph can be built by array dependence analysis Kennedy & Allen [2002] for the loop. Every pair of data dependences must be analyzed to mark the irreducible edges in a later step if there are recurrences.

Step 3: Marking the Irreducible Edges A partition can preserve all dependences if and only if there exists no dependence cycle spanning more than one output loop Allen & Kennedy [1987]; Kuck et al. [1981]. In our case, for treegions at the same conditional level that share a common immediate parent, if there are dependences that form a cycle, we mark the edges between them as irreducible. If the statements are at different conditional levels or do not share a common immediate parent, we find the least common ancestor of both nodes, walk backwards along the CDG until we reach the immediate children of the least common ancestor, and mark the edge between them as irreducible. The algorithm is presented in Figure 3.10; a C sketch of the ancestor walk is given below.
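The sketch below implements the ancestor walk of Figure 3.10, reusing the cdg_node fields sketched earlier; insert_irreducible_edge is an assumed helper.

extern void insert_irreducible_edge(struct cdg_node *, struct cdg_node *); /* assumed helper */

struct cdg_node *up_n_level(struct cdg_node *v, int n) {
  while (n-- > 0)
    v = v->parent;
  return v;
}

struct cdg_node *least_common_ancestor(struct cdg_node *x, struct cdg_node *y) {
  while (x != y) {                 /* walk the deeper node upwards */
    if (x->cl > y->cl)       x = x->parent;
    else if (y->cl > x->cl)  y = y->parent;
    else { x = x->parent; y = y->parent; }
  }
  return x;
}

void mark_irreducible(struct cdg_node *x, struct cdg_node *y) {
  struct cdg_node *ca = least_common_ancestor(x, y);
  x = up_n_level(x, x->cl - ca->cl - 1);   /* child of ca containing x */
  y = up_n_level(y, y->cl - ca->cl - 1);   /* child of ca containing y */
  insert_irreducible_edge(x, y);
}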

Step 4: Structured Typed Fusion Before partitioning, to reveal data parallelism, we type every node in the dependence graph as parallel or !parallel: if there is a loop-carried dependence inside a node, it is typed !parallel, otherwise parallel. The parallel nodes are candidates for data parallelization. The goal is to merge nodes of this type to create the largest parallel loops, reducing synchronization overhead and (generally) improving data locality. Further partitioning can happen in the following step, starting from this maximally type-fused configuration.

// input:  PDG Graph PDG(V,E)   PDG -- Program Dependence Graph
// input:  CDG Graph CDG(V,E)   CDG -- Control Dependence Graph
// output: Irreducible_edge_set

SCCs = find_SCCs(PDG)
for each SCC in SCCs:
  for each pair of nodes (Vx, Vy) in SCC:
    // CL stands for the conditional level in the control dependence graph.
    // Find the least common ancestor of both nodes, and walk backwards
    // along the CDG, until reaching the immediate children of the least
    // common ancestor, and mark the edge in between as irreducible.
    Vca = get_least_common_ancestor(Vx, Vy)
    Vx  = up_n_level(CDG, Vx, Vx.CL - Vca.CL - 1)
    Vy  = up_n_level(CDG, Vy, Vy.CL - Vca.CL - 1)
    // mark edge (Vx, Vy) irreducible
    Irreducible_edge_set.insert(edge(Vx, Vy))

Figure 3.10: Algorithm for marking the irreducible edges.

Given a DAG whose edges represent dependences and whose vertices represent statements of the loop body, we want to produce an equivalent program with a minimal number of parallel loops; each parallel loop should be as large as possible to amortize the synchronization overhead. Even when such coarse-grained parallel loops are not desired, we can still partition between iterations when possible. In our case, we need a structured typed loop fusion algorithm. We revisit McKinley and Kennedy's fast typed fusion Kennedy & Mckinley [1993] as a recursive algorithm traversing the control dependence tree. Starting from the treegion at conditional level 0, which is the whole loop, we check whether there are loop-carried dependences between iterations. If there are none, we stop and annotate the whole loop as parallel. If there are, we descend into each inner treegion, identifying those that carry no loop-carried dependence; these are marked parallel and we try to merge them. There are two constraints when fusing nodes: (1) parallelization-inhibiting constraints and (2) ordering constraints. A parallelization-inhibiting fusion is one where there are no loop-carried dependences before fusion but there would be after; we must skip

this kind of fusion, which would degrade data parallelism. The ordering constraint states that two loops cannot be validly fused if there exists a path of loop-independent dependences between them that contains a loop or statement that is not being fused with them. The time complexity of the typed fusion algorithm is O(E+V) Kennedy & Mckinley [1993], and our structured extension has the same complexity.

void StructuredTypedFusion()
  Queue queue = new Queue()
  queue.push(treegions_at_level_0)
  while (not queue.empty()) {
    treegions = queue.pop()
    G = build_pdg_by_treegion(treegions)
    for each treegion in treegions:
      if loop_carried_no_dependence(treegion) {
        parallel_treegion.insert(treegion)
        update_typed_dependence_graph(G, treegion.num)
      } else {
        treegions_at_inner_level.insert(treegion)
      }
    queue.push(treegions_at_inner_level)
    B = get_parallelization_inhibiting_edges(parallel_treegion)
    t0 = 'parallel'
    TypedFusion(G, T, B, t0)
  }

procedure TypedFusion(G, T, B, t0)
  // G=(V,E) is the TYPED dependence graph,
  // including control, data and scalar dependences.
  // type(n) returns the type of a node.
  // B is the set of parallelization-inhibiting edges.
  // t0 is a specific type for which we will find a minimal fusion.
end TypedFusion

Figure 3.11: Structured typed fusion algorithm.

Step 5: Structured Partitioning Algorithm After updating the CDG following typed fusion, the partitioning algorithm starts from the treegion at conditional level 0 and, for all of its child treegions at conditional level 1, decides where to partition. The partition point can be any point between these treegions at the same level, except across the irreducible edges created in Step 3. The algorithm may decide at every step whether it is desirable to further partition any given task into several sub-tasks. Note that we do not provide a heuristic for making the partitioning decision (coarsening the granularity, etc.) here; a detailed partitioning algorithm is presented in Chapter 5, where a simple heuristic for coarsening the granularity is applied. Consider the example in Figure 3.12.

Before partitioning:
for (i...)
  x = work(i)
  if (c1)
    y = x + i;
    if (c2)
      z = y * y;
    q = z - y;

After partitioning:
for (i...)
  BEGIN task1_1
    x = work(i)
  END task1_1
  BEGIN task1_2
    if (c1)
      BEGIN task2_1
        y = x + i;
      END task2_1
      BEGIN task2_2
        if (c2)
          z = y * y;
      END task2_2
      BEGIN task2_3
        q = z - y;
      END task2_3
  END task1_2

Figure 3.12: Loop with control dependences, before partitioning (top) and after partitioning (bottom).

The code in Figure 3.12 (top) is partitioned into 2 tasks, and one of them (task1_2) is further partitioned into 3 sub-tasks.

3.4 Code Generation

The partitioning algorithm decides the partition points between the original treegions. With the support of the stream extension of OpenMP, code generation then amounts to inserting the input and output directives and, with the support of nested tasks, relying on the downstream, extended OpenMP compilation algorithm (called OpenMP expansion). But some challenges remain, especially in the presence of multiple producers and consumers. We use the SSA form as an intermediate representation to generate the streaming code.

3.4.1 Decoupling dependences across tasks belonging to different treegions

Clearly, if we decouple a dependence between tasks in the same treegion, the appropriate input and output clauses can be inserted naturally. But what about communication between tasks at different levels? Considering the example in Figure 3.13, suppose we partition the loop into 3 main tasks: task1_1 with S1, task1_2 with (S2, S3), and task1_3 with S4, and that task1_2 is further divided so that task2_1 contains S3. If we insert the produce and consume operations directly into the loop, unmatched production and consumption will result.

Figure 3.13: Normal form of code (left) and using streams (right).

The answer comes from following the synchronous hypothesis and slightly modifying the construction of the SSA form in presence of concurrent streaming tasks.

3.4.2 SSA representation

We use the Static Single Assignment (SSA) form as an intermediate representation of the source code. A program is in SSA form if every variable used in the program appears exactly once on the left-hand side of an assignment. We use the SSA form to eliminate output dependences in the code and to disambiguate the flow of data across tasks in multiple-producer configurations.

/* Normal form of the code. */      /* Code under SSA form. */
S1: r1 = ...                        S1: r1_1 = ...
S2: if (condition)                  S2: if (condition)
S3:   r1 = ...                      S3:   r1_2 = ...
S4: ... = r1                        S4: r1_3 = phi(r1_1, r1_2)
                                    S5: ... = r1_3

Figure 3.14: Normal form of code (left) and SSA form of the code (right).

Considering the example in Figure 3.14, if we partition the statements into (S1), (S2, S3), (S4), we need to implement precedence constraints for the output dependence between partitions (S1) and (S2, S3), which decreases the degree of parallelism and induces synchronization overhead. Eliminating the output dependences with the SSA form leads to the introduction of multiple streams in the partitioned code. In order to merge the information coming from different control flow branches, a Φ node is introduced in the SSA form. The Φ function is not normally implemented directly: after the optimizations are completed, the SSA representation is transformed back into ordinary code with additional copies inserted on the incoming edges of (some) Φ functions. We need to handle the case where multiple producers in a given partition reach a single consumer in a different partition. When decoupling a dependence whose sink is a Φ node, the exact conditional control flow leading to the Φ node is not accessible to the out-of-SSA algorithm generating ordinary code.
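For instance, out-of-SSA lowering of the right-hand side of Figure 3.14 would replace the Φ node by copies on its incoming control flow edges, roughly as sketched below (a schematic illustration, not the exact GCC output).

S1: r1_1 = ...;
    r1_3 = r1_1;        /* copy on the fall-through edge */
S2: if (condition) {
S3:   r1_2 = ...;
      r1_3 = r1_2;      /* copy on the conditional edge  */
    }
S4: ... = r1_3;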

Task-closed Φ node SSA loop optimizations use the concept of loop-closed Φ node, which enforces the additional property that no SSA name is used outside of the loop where it is defined. To enforce this property, Φ nodes must be inserted at the loop exit node to catch the variables that are used outside of the loop. Here we give a similar definition for the task-closed Φ node: if multiple SSA names of a variable are defined in one partition and the variable is used in another, a Φ node is created for this variable at the end of the partition. This is the place where we join/split the stream. We need to make sure that the different definitions of the variable are merged in this partition before the value flows to a downstream one. This node is removed when converting back from SSA.

Task-closed stream Our partitioning algorithms generate nested pipelining code to guarantee that all communications follow the synchronous hypothesis. For each boundary, if there are one or more definitions of a variable coming through from different partitions, we insert a consumer at this boundary to merge the incoming data, and immediately insert a producer to forward the merged data at the rate of the downstream control flow.

1. When partitioning at a boundary, if there are multiple definitions of a scalar inside the treegion and it is used in other treegions at the same conditional level, we create a Φ node at the end of this partition to merge all the definitions, and also update the SSA variable in later partitions.

2. If there is a Φ node at the end of a partition, insert a stream named with the left-hand side variable of the Φ node.

3. At the place where this variable is used, which is also a Φ node, add a special stream-Φ node to consume.

4. To generate code for the stream-Φ, use the boolean condition associated with the conditional phi node it originates from.

Let us consider the SSA-form example in Figure 3.14, where we partition the code into (S1, S2, S3) and (S4, S5). A Φ node is inserted at the end of the first partition, r1_4 = Φ(r1_1, r1_2), and the Φ node in the later partition is updated from r1_3 = Φ(r1_1, r1_2) to r1_5 = Φ(r1_4). In the second step, we find that partition (S1, S2, S3) has a Φ node at its end, so we insert a stream produce there. In partition (S4, S5), the Φ node is followed by a use of the variable, so we insert a stream consume. The generated code is shown in Figure 3.15. This example illustrates the generality of our method and shows how fine-grain pipelines can be built in the presence of complex, multi-level control flow.

/* Producer. */                        /* Consumer. */
S1: r1_1 = ...                         S4: r1_5 = phi(r1_4)
S2: if (condition)                         r1_5 = consume(stream_r1_4, i)
S3:   r1_2 = ...                       S5: ... = r1_5
    r1_4 = phi(r1_1, r1_2)
    produce(stream_r1_4, r1_4)

Figure 3.15: Apply our algorithm to generate the parallel code. Producer thread (left) and consumer thread (right).

If we decide to partition the statements into (S1), (S2, S3), (S4, S5), which is the multiple-producer case, the generated code is shown in Figure 3.16.

/* Producer 1. */
S1: r1_1 = ...
    r1_2 = phi(r1_1)
    produce(stream_r1_2, r1_2)

/* Producer 2. */
S2: if (condition)
S3:   r1_3 = ...
    r1_4 = phi(r1_3)
    produce(stream_r1_4, r1_4)

/* Consumer. */
S4: r1_5 = phi(r1_2, r1_4)
    if (condition)
      r1_5 = consume(stream_r1_4, i)
    else
      r1_5 = consume(stream_r1_2, i)
S5: ... = r1_5
    i++

Figure 3.16: Generated code in the multiple-producer case, after applying our algorithm.

For multiple consumers, the stream extension of OpenMP broadcasts the produced values to all consumers, which is appropriate for our case.

3.5 Summary

In this chapter, we propose a method to decouple independent tasks in serial programs, to extract scalable pipelining and data parallelism. Our method leverages a recent proposal of a stream-processing extension of OpenMP, with a persistent task semantics that eliminates the overhead of scheduling task instances each time a pair of tasks needs to communicate. Our method is inspired by the synchronous hypothesis: communicating concurrent tasks share the same control flow. This hypothesis simplifies the coordination of communicating tasks over nested levels of parallelism. The synchronous hypothesis also facilitates the definition of generalized, structured typed fusion and partitioning algorithms that preserve the loop structure information. These algorithms have proven essential to adapting the grain of parallelism to the target and to the effectiveness of compile-time load balancing. These partitioning algorithms also handle DOALL parallelization inside a task pipeline. We use a combination of SSA, control dependence tree and (non-scalar) dependence graph as an IR. With the support of SSA, our method eliminates the nested multiple-producer and multiple-consumer problems of PS-DSWP. SSA also provides additional applicability, elegance and complexity benefits.

Persistent task model. The persistent task model benefits from eliminating the overhead of scheduling task instances; however, it reduces the potential parallelism in recurrent control structures.

Synchronous hypothesis. The structured partitioning algorithm assumes the synchronous hypothesis, which simplifies the coordination of communicating tasks over nested levels of parallelism. However, this hypothesis sometimes results in overly coarse-grained tasks (e.g., when marking the irreducible edges, if two statements are at different conditional levels, we promote the inner one towards its ancestors until both are in the same treegion, and mark both root nodes as irreducible).

Conclusion. The partitioning algorithm presented in this chapter serves as a preliminary study: the persistent task model and the synchronous hypothesis restrict the exploration of parallelism, and sufficient parallelism is the key to covering latency. In Chapter 5, we therefore present a more mature thread partitioning algorithm that removes the synchronous restriction and targets the TSTAR dataflow architecture with a non-preemptive thread model, where we can exploit maximum parallelism. The TSTAR architecture and its detailed execution model are presented in Chapter 4.

Chapter 4

TSTAR Dataflow Architecture

In this chapter, we present several abstraction layers of the TSTAR dataflow architecture, from the threading model to the dataflow ISA extensions. Section 4.1 gives an overview of the dataflow execution model and of past dataflow architectures. Section 4.2 further explains the TSTAR execution model from different aspects (threading model, memory model, synchronization details and ISA extension), and Section 4.3 gives an overview of the TSTAR architecture.

4.1 Dataflow Execution Model

4.1.1 Introduction

In the dataflow execution model, a program is represented by a directed graph Arvind & Culler [1986a]; Dennis et al. [1974]; Dennis & Misunas [1979]; Karp & Miller [1966]. The nodes of the graph are primitive instructions such as arithmetic or comparison operations. Directed arcs between the nodes represent the data dependencies between the instructions. Execution of a test or operation is enabled by the availability of the required values. The completion of one operation makes the resulting value or decision available to the elements of the program whose execution depends on it. When the program begins, special activation nodes place data onto certain key input arcs, triggering the rest of the program.

Whenever a specific set of input arcs of a node (called a firing set) has data on it, the node is said to be fireable Arvind & Culler [1986a]; Veen [1986]; Davis & Keller [1982]. A fireable node is executed at some undefined time after it becomes fireable. Firing the actor removes the data token from each arc in the firing set, performs its operation, and places new data tokens on its output arcs. The actor then ceases execution and waits to become fireable again.

The execution of a dataflow program can be described by a sequence of snapshots. Each snapshot shows the dataflow program with tokens and associated values placed on some arcs of the program. Execution advances from one snapshot to the next through the firing of some link or actor that is enabled in the earlier snapshot.

[Figure 4.1 depicts an operator f with input tokens Vl and Vr; firing consumes both input tokens and produces the result token V on the output arc.]

Figure 4.1: Firing rule example.
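As an illustration, the following C sketch models the firing rule in software for a binary operator as in Figure 4.1; the arc and node structures are our own assumptions, not part of any real dataflow ISA.

#include <stdbool.h>

struct arc { bool has_token; double value; };

struct node {
  struct arc **inputs;  int n_in;    /* the firing set (two arcs here) */
  struct arc **outputs; int n_out;
  double (*op)(double, double);      /* e.g. add, multiply             */
};

bool fireable(struct node *n) {
  for (int i = 0; i < n->n_in; i++)
    if (!n->inputs[i]->has_token) return false;
  return true;
}

void fire(struct node *n) {
  double result = n->op(n->inputs[0]->value, n->inputs[1]->value);
  for (int i = 0; i < n->n_in; i++)       /* consume the input tokens  */
    n->inputs[i]->has_token = false;
  for (int i = 0; i < n->n_out; i++) {    /* produce the output tokens */
    n->outputs[i]->value = result;
    n->outputs[i]->has_token = true;
  }
}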

4.1.2 Past Data-Flow Architectures

The original motivation for studying dataflow was the exploration of massive parallelism. One school of thought held that the conventional architecture is not suitable for the exploitation of parallelism, since its global program counter and global updatable memory become bottlenecks. The early studies focused on dataflow architectures.

The static architecture. The static dataflow model was proposed by Dennis and Misunas Dennis & Misunas [1975]. In this model, the FIFO design of arcs is replaced by a simpler design where each arc can hold at most one data token. When there is a token on each input arc, and no token on any of the output arcs, the node fires. Figure 4.2 (a) shows the general organization of the static dataflow machine. The activity store contains instruction templates that represent the nodes of a dataflow graph. Each instruction template contains an operation code, slots for the operands, and destination addresses. To determine the availability of the operands, the slots contain presence bits. The update unit is responsible for detecting instructions that are ready to execute. When this condition is verified, the unit sends the address of the enabled instruction to the fetch unit via the instruction queue. The fetch unit fetches and sends a complete operation packet, containing the corresponding opcode, data and destination list, to one of the operation units and clears the presence bits. The operation unit performs the operation, forms result tokens, and sends them to the update unit. The update unit stores each result in the appropriate operand slot and checks the presence bits to determine whether the activity is enabled.

[Figure 4.2 sketches the two organizations: (a) the static machine, where an update unit, instruction queue, fetch unit and operation unit(s) cycle around the activity store; (b) the dynamic machine, where a matching unit pairs tokens with identical tags before the fetch unit retrieves the instruction from program memory and the processing unit executes it.]

Figure 4.2: The basic organization of the static (a) and dynamic (b) model.

Advantages and disadvantages. The main advantage of the static dataflow model is its simplified mechanism for detecting the availability of the required operands (fireable nodes). However, the static dataflow model has a serious drawback: the additional acknowledgement arcs increase data traffic in the system by a factor of 1.5 to 2.0 when dealing with iterative constructs and reentrancy Arvind & Culler [1986a]. The acknowledgement scheme can transform reentrant code into an equivalent graph that allows pipelined execution of consecutive iterations, but the single-token-per-arc limitation means that the execution of consecutive iterations can never fully overlap, even when no loop-carried dependences exist.

The dynamic architecture. The dynamic, or tagged-token, architecture was proposed by Watson and Gurd Watson & Gurd [1979] and by Arvind and Culler Arvind & Culler [1983]. It exposes additional parallelism by allowing multiple instances of reentrant code. The firing rule becomes: a node is enabled as soon as tokens with identical tags are present on each of its input arcs. Figure 4.2 (b) shows the general organization of the dynamic dataflow model. Tokens are received by the matching unit, which is a memory containing a pool of waiting tokens. The unit's basic operation brings together tokens with identical tags. If a match exists, the corresponding token is extracted from the matching unit, and the matched token set is passed to the fetch unit. If no match is found, the token is stored in the matching unit to await a matching token. In the fetch unit, the tags of the token pair uniquely identify an instruction to be fetched from the program memory.

Advantages and disadvantages. The main advantage of the dynamic dataflow model is its support for iterative constructs and reentrancy: higher parallelism is obtained by allowing multiple tokens on an arc. For example, a loop can be dynamically unfolded at runtime by creating multiple instances of the loop body and allowing concurrent execution of the iterations. It has been shown that this model offers the maximum possible parallelism in any interpreter Arvind et al. [1977]. The main disadvantage of this model is the extra overhead of matching tags on tokens. Instead of the simple presence or absence check of the static dataflow model, tagged-token dataflow execution requires a token to be checked against all other waiting tokens for a match. If a match is found, the matching token must be extracted from the waiting storage. The matching function and its associated storage are most often realized with a large hash table and sophisticated insertion and extraction algorithms. The resulting hardware is highly complex and slow. A more subtle problem with token matching is the complexity of allocating memory cells: a failed match implicitly allocates extra memory within the matching hardware. The memory of such hardware is usually limited, and once it is exhausted the machine may deadlock.

The hybrid architecture. The limits of the static dataflow model and the overhead of the dynamic dataflow model led to the convergence of dataflow and von Neumann models. The hybrid architecture moves from fine-grained parallelism towards a coarser-grained execution model, which combines the power of the dataflow model for exploiting parallelism with the efficiency of the control-flow model. Hybrid architectures diverge in two directions: some approaches are essentially von Neumann architectures with a few dataflow additions (large-grain dataflow architectures), others are essentially dataflow architectures with some von Neumann additions (multithreaded dataflow architectures) Iannucci [1988]; Nikhil [1989]; Nikhil et al. [1992]; Culler et al. [1991]. In the multithreaded dataflow model, the dataflow principle is adjusted so that a sequence of instructions is issued consecutively by the matching unit without matching further tokens, except for the first instruction of the thread. Data passed between instructions of the same thread is stored in registers instead of being written back to memory. In the large-grain dataflow model, the dataflow graph is analyzed and divided into subgraphs, and the subgraphs are compiled into sequential von Neumann processes. These processes are executed in a multithreaded environment, scheduled according to the dataflow model to better exploit parallelism. Large-grain dataflow machines typically decouple the matching stage from the execution stage by FIFO buffers, in which case off-the-shelf microprocessors can be used for the execution stage.

Large-grain dataflow architectures. Large-grain dataflow architectures were well studied and developed in the 1990s. *T Nikhil et al. [1992] is a direct descendant of dataflow architectures, especially of Monsoon Papadopoulos & Culler [1990], and unifies them with the von Neumann architecture. In the Associative Dataflow Architecture ADARC Strohschneider & Waldschmidt [1994], the processing units are connected via an associative communication network. The Pebble architecture Roh & Najjar [1995] is a large-grain dataflow architecture that decouples the synchronization unit and the execution unit within the processing elements.

The EARTH architecture Maquelin et al. [1996]; Hendren et al. [1997] is based on the Multithreaded Architecture (MTA) and dates back to the Argument Fetch Dataflow Processor Dennis & Gao [1988]. An MTA node consists of an execution unit, which may be an off-the-shelf RISC microprocessor, and a synchronization unit that supports dataflow thread synchronization. The synchronization unit determines which threads are ready to be executed. The execution unit and the synchronization unit share the processor's local memory, which is cached, and communicate via FIFO queues: a ready queue containing ready thread identifiers links the synchronization unit to the execution unit, and an event queue holding local and remote synchronization signals connects the execution unit to the synchronization unit and also receives signals from the network. A register use cache keeps track of which register set is assigned to which function activation. Our threading model is similar to the EARTH threading model: both rely on non-preemptive threads.

4.2 TSTAR Dataflow Execution Model

4.2.1 TSTAR Multithreading Model

The basic execution model of TSTAR is a multithreading execution model that exploits application parallelism at different levels. It derives from the dataflow execution model, where the communication and synchronization latencies caused by inter-processor communication can be overlapped with useful computation. In the TSTAR execution model, instructions are not synchronized and scheduled individually, but are combined into larger units called DF-threads. A DF-thread is a sequence of instructions that executes sequentially and runs to completion once started (non-preemptive).

DF-Thread A DF-thread is a non-preemptive dataflow thread, with the following properties:

• They are activated only when all their input data are available. This data is available either in the DF-frame (Frame Memory) of the thread or in the OWM of the producer threads.

• They write their output either to the consumer threads' allocated DF-frames in Frame Memory or to their own OWM (Owner Writable Memory).

• Each thread has a single entry point and a single exit point. Therefore there are

– no jumps between DF-threads, and
– no jumps within a DF-thread.

• The compiler controls thread granularity for the most efficient execution.

• Data consistency is guaranteed, because the model does not allow threads to read and write the same address simultaneously.

The TSTAR execution model is based on U-Interpreter principles. The thread meta-data, such as the producer-consumer relationships and the ready count (the number of producers) of each thread, are generated at compile time. The compiler also inserts the necessary initialization code, which consists of loading the thread meta-data into the Thread Scheduling Unit (TSU). The meta-data is stored in the Graph Memory (GM). Each instantiation of a thread is distinguished by a unique context that is maintained at execution time. The dynamic context is combined with the static meta-data to uniquely identify each thread's Ready Count (RC) entry and its input and output data. A DF-thread is ready to be scheduled when all its producers have finished execution, so that their data are available. This guarantees that even though the results of a thread are written to global memory space during its execution, they are only made "visible" to the consumers when the thread completes its execution. As a result, we can safely restart the thread's execution in case of a failure.

4.2.2 TSTAR Memory Model

The TSTAR memory model is a hybrid memory model. It consists of the thread local storage, the owner writable memory, and the frame memory. The thread local storage allows better exploitation of locality within a single DF-thread. The owner writable memory serves as a globally addressable memory, reducing the cost of communicating complex data structures such as arrays and vectors. The frame memory is used to manage the synchronization of dataflow threads.

Thread Local Storage (TLS) Sequential scheduling avoids the overhead of the inefficient communication of tokens among nodes that characterized past dataflow architectures. The TLS allows better exploitation of locality within a single DF-thread. TLS is part of the thread's address space; only the owner thread can read or write this memory. Whenever a DF-thread is ready to execute, it reads its input data from its associated Frame Memory (or Owner Writable Memory), and during the sequential execution of the DF-thread, temporary variables can be stored in TLS, avoiding extra overhead.

Owner Writable Memory (OWM) One general criticism leveled at dataflow computing is the management of data structures. The dataflow functionality principle implies that all operations are side-effect free. However, the absence of side effects also means that if tokens are allowed to carry vectors, arrays, or other complex data structures, an operation on a data structure results in a new data structure, which greatly increases the communication overhead in practice. Efficiently representing and manipulating complex data structures in a dataflow execution model remains a challenge.

We propose the OWM memory model to reduce the communication overhead when complex data structures are passed between threads. OWM is a globally addressable memory; before a thread can write to a portion of this memory, it has to claim ownership of it beforehand. At any point in time, only the thread that owns the memory can write to it. While write ownership is held, a read from another thread is not guaranteed to see consistent data. When write ownership is released, a consistent view of the data must be visible to every other thread. Note that the release operation can be performed explicitly by the thread or implicitly by the model; the latter occurs when the OWM is used by a thread to write its results, which are made available to the consumer threads upon completion of the thread's execution. This memory can serve the requirements of the single-assignment semantics required for functional objects. However, the ability for other threads to subsequently reclaim write ownership adds to the flexibility of its usage.
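The chapter does not fix a concrete OWM API; the sketch below therefore uses hypothetical owm_acquire and owm_release calls purely to illustrate the ownership protocol described above.

extern void owm_acquire(void *region);   /* hypothetical: claim write ownership   */
extern void owm_release(void *region);   /* hypothetical: release, publish data   */

void producer_fill(double *region, int n) {
  owm_acquire(region);            /* only the owner may write the region        */
  for (int i = 0; i < n; i++)
    region[i] = i * 2.0;          /* writes are not guaranteed visible yet      */
  owm_release(region);            /* consumers now observe a consistent view    */
}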

Frame Memory (FM) FM is used to manage the synchronization of dataflow threads. The inputs to a consumer thread are written to its frame by one or more producer threads. Each thread has a synchronization count (SC), which is decreased by each write. When the SC reaches zero, the thread is ready to run.

4.2.3 TSTAR Synchronization

Figure 4.3 shows an illustrative example. BB1, BB2, BB3 and BB4 are code regions corresponding to 4 different threads. Data dependences — represented by solid arrow lines — and control dependences — represented by dashed arrow lines — enforce a partial ordering on threaded execution.1 Together, these data and control dependences form the Program Dependence Graph (PDG) of the function. BB2 may only be executed after its input values z and x have been produced by BB1. To express producer-consumer and control dependences, we define an abstract data-flow interface suitable for parallelization passes in compilers as well as for expert programmers developing low-level data-flow code. The interface defines two main components: data-flow threads, or simply threads when clear from the context, together with their associated data-flow frames, or simply frames. The frame of a data-flow thread stores its input values, and optionally some local variables or thread metadata. The frame's address also serves as an identifier for the thread itself, to synchronize producers with consumers. Communications between threads are single-sided: the producer thread knows the address of the data-flow frames of its dependent, consumer threads. A thread writes its output data directly into the data-flow frames of its consumers. Each thread is associated with a Synchronization Counter (SC) to track the satisfaction of producer-consumer dependences: upon termination of a thread, the SC of its dependent threads is decremented.

1Control dependence arcs are not to be mistaken for control flow arcs Wolfe[1995].

BB1: x = 5;  y = 6;  z = x*y;
BB2: if (z-x > 9) goto BB3; else goto BB4;    (inputs x, z from BB1)
BB3: ret = z/10;                              (input z)
BB4: ret = z*10;                              (input z)

Figure 4.3: Program dependence graph of an illustrative example.

A thread may execute as soon as its SC reaches 0, which may be determined immediately when the producer decrements the SC. The initial value of the SC is derived from the dependence graph: it is equal to the number of arguments of the thread, each one corresponding to an externally defined use.

In contrast, token-based approaches require checking the presence of the necessary tokens on incoming edges. This means that either (1) a scanner must periodically check the schedulability of dataflow threads, or (2) dataflow threads are suspendable. The former poses performance issues, while the latter requires execution under complex stack systems (e.g., cactus stacks) that may introduce artificial constraints on the schedule. The SC aggregates the information on the present and missing tokens for a thread's execution, allowing producer threads to decide when a given consumer is fireable. In our example, thread BB2 has inputs z and x and is not control dependent on any node; therefore its initial SC is 2.

Thread BB3 has input z and is control dependent on thread BB2, but its initial SC is 1 (BB3 is created by BB2, which satisfies the control dependence).
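A minimal software sketch of this synchronization-counter bookkeeping is given below; the df_thread structure and the enqueue_ready hook are our own assumptions. In the Figure 4.3 example, BB2 would start with SC = 2 and BB3 with SC = 1.

struct df_thread {
  int sc;                 /* remaining producer writes              */
  void (*code)(void *);   /* thread body                            */
  void *frame;            /* data-flow frame holding the inputs     */
};

extern void enqueue_ready(struct df_thread *);  /* assumed scheduler hook */

/* A producer writes a value into the consumer frame and decrements
   the consumer's SC; the consumer becomes fireable when SC reaches 0. */
void write_and_decrement(struct df_thread *consumer, int offset, long value) {
  *(long *)((char *)consumer->frame + offset) = value;
  if (--consumer->sc == 0)
    enqueue_ready(consumer);
}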

4.2.4 TSTAR Dataflow Instruction Set

We introduce the TSTAR dataflow instruction set to support our dataflow execution model. The TSTAR dataflow instructions are inspired by the T* ISA Giorgi et al. [2007]; Portero et al. [2011] and are designed as an extension to the x86 architecture. Paolo Faraboschi from HP Labs also contributed to the design of the interface. The instructions are implemented as compiler builtins, recognized as primitive operations of the compiler's intermediate representation. They can be implemented efficiently both in software and in hardware. We start by describing the TSTAR ISA with a simple example, then describe each instruction in detail. Figure 4.4 shows a producer-consumer pipeline relationship: the producer produces the value x and the consumer consumes this value. Figure 4.5 shows the detailed implementation with TSTAR builtin functions.

A. The main thread The main DF-thread takes care of creating the producer and consumer DF-threads and registering the dependences. In S1 and S2, tschedule creates the DF-threads for the producer and the consumer; the synchronization counter can either be initialized according to the number of producers (e.g., in S2, the consumer DF-thread's SC is initialized to 1) or be used to control when a thread becomes schedulable (e.g., in S1, the producer DF-thread does not depend on any other thread, but its SC is initialized to 1 at first and decremented by 1 in S5). In S3, we use tcache to cache the DF-frame of the producer thread locally, so that we can directly read/write it. In S4, we register the consumer thread id with the producer thread, so that when the producer thread executes, it can get the frame address of the consumer and write its results directly to it. S5 decreases the SC of the producer, after which the producer thread is ready to execute (its SC reaches zero after the decrement).

B. The producer thread The producer thread performs its computation (if any) and passes the results to its consumers. In this example, S7 loads the DF-frame of the current thread with tload; the cfp is locally cached and read-only. In S8, the producer gets the consumer thread id that was registered in the main thread, and S9 caches the consumer DF-frame locally. In S11, the result x is written directly into the consumer thread's DF-frame, and in S12 the SC of the consumer is decremented (tdecrease) after the write, at which point the SC reaches zero, so the consumer thread becomes ready to execute.

C. The consumer thread The consumer thread loads the DF-frame of the current thread in S14 and reads the value written by the producer thread. At this point, the communication between producer and consumer is complete.

task producer:   x = 5;
task consumer:   y = x;

Figure 4.4: Producer consumer relationship.

Here is a detailed description of the TSTAR instruction set interface.

dfthread_p tschedule(void (*func)(), int sc, int size);
    Creates a new data-flow thread and allocates its associated frame. func is a pointer to the argument-less function to be executed by the data-flow thread, sc is the initial value of the thread's synchronization counter, and size is the size of the data-flow frame to be allocated. It returns a pointer to the allocated data-flow frame. Once created, a thread cannot be canceled. Collection of thread resources is triggered by the completion of the thread's execution.

A. Main task

/* Main task: creates the producer and consumer tasks and
   resolves the dependences between them. */
void main() {
  /* Create the producer task; the synchronization counter is set to 1. */
S1  producer_tid = tschedule(task_producer, 1, sz);

  /* Create the consumer task; the synchronization counter is set to 1. */
S2  consumer_tid = tschedule(task_consumer, 1, sz);

S3  producer_fp = tcache(producer_tid);
S4  producer_fp->field_consumer_tid = consumer_tid;
S5  tdecrease(producer_tid);
S6  tdestroy();
}

B. Producer task

/* Producer task: passes x to the consumer task. */
void task_producer() {
S7  cfp = tload();
S8  consumer_tid = cfp->field_consumer_tid;
S9  consumer_fp = tcache(consumer_tid);
S10 x = 5;
S11 consumer_fp->x = x;
S12 tdecrease(consumer_tid);
S13 tdestroy();
}

C. Consumer task

/* Consumer task: gets x from the producer. */
void task_consumer() {
S14 cfp = tload();
S15 x = cfp->x;
S16 y = x;
S17 tdestroy();
}

Figure 4.5: Producer consumer code.

void tdecrease(dfthread_p tid, int n);
    Marks the target thread designated by tid (its frame pointer) to have its synchronization counter decremented by n upon termination of the current thread (lazy tdecrease). Once the current thread terminates, the synchronization counter of the target thread is decremented by n. When the synchronization counter of the target thread reaches zero, the thread is moved from the waiting queue to the ready queue. In contrast to lazy tdecrease, an eager tdecrease mode is also available in the implementation: it decreases the synchronization counter instantly, at the moment tdecrease is called. There is a trade-off between the two modes: lazy tdecrease buffers the pending decrements until thread termination, which potentially reduces the communication between threads, while the immediate communication of eager tdecrease makes the target thread become ready as soon as possible, potentially increasing the parallelism in the system.

void tdestroy();
    Terminates the current thread and deallocates its frame. In lazy mode, the pending tdecrease operations are merged and executed at this point.

void *tload();
    Loads the DF-frame of the current thread into a locally allocated memory chunk (read only) that is directly accessible by the thread with standard loads.

void *tcache(void *tid);
    Allocates the frame of the target thread id tid into a locally allocated memory chunk that is directly accessible by the thread with standard loads and stores.
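The following software model (not the hardware implementation) illustrates the difference between the lazy and eager tdecrease modes, reusing the df_thread structure and enqueue_ready hook assumed in the sketch of Section 4.2.3; the pending list and its fixed size are further simplifying assumptions.

enum td_mode { TD_EAGER, TD_LAZY };

struct pending { struct df_thread *target; int n; };
static struct pending pending_list[64];   /* lazy decrements of the current thread */
static int n_pending;

void model_tdecrease(struct df_thread *target, int n, enum td_mode m) {
  if (m == TD_EAGER) {
    if ((target->sc -= n) == 0)           /* decrement immediately */
      enqueue_ready(target);
  } else {
    pending_list[n_pending++] = (struct pending){ target, n };  /* defer */
  }
}

void model_tdestroy(void) {               /* flush the lazy decrements */
  for (int i = 0; i < n_pending; i++)
    if ((pending_list[i].target->sc -= pending_list[i].n) == 0)
      enqueue_ready(pending_list[i].target);
  n_pending = 0;
}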

4.3 TSTAR Architecture

The TSTAR architecture is a multithreaded dataflow architecture: at the inter-thread level, threads are executed in a multithreading environment and scheduled according to the dataflow model to better exploit parallelism, while at the intra-thread level, each thread is compiled into sequential code for a von Neumann processor. Figure 4.6 shows the TSTAR top-level architecture. Each core contains a processing element (x86-64 ISA with TSTAR extensions) along with its L1 cache; each core also includes a partition of the L2 cache. In order to support the dataflow execution of threads, each core includes a hardware module that handles the scheduling of threads, the Local Thread Scheduling Unit (L-TSU). In addition to cores, the nodes also contain a hardware module to coordinate the scheduling of the data-flow threads among the cores, the Distributed Thread Scheduling Unit (D-TSU), as well as a hardware module that monitors the performance and faults of the cores, the Distributed Fault Detection Unit (D-FDU). Each core is identified by a unique id, the Core ID (CID), and each group of cores belongs to a node whose id is the Node ID (NID). Nodes are connected via an inter-node network, the Network on Chip (NoC). Cores within a node are also connected via the NoC.

4.3.1 Thread Scheduling Unit

In the TSTAR architecture, the scheduling of DF-threads is done by the TSU. The TSU is composed of a local thread scheduling unit (L-TSU) in each core and a distributed thread scheduling unit (D-TSU) in each node. The management of resources, mainly continuations and DF-frames, is also done by the TSU. The D-TSU is informed by the FDU of any faulty core, and thus includes execution recovery mechanisms. Figure 4.7 shows an overview of the TSU.

Each DF-thread is uniquely associated with a DF-frame. The DF-frame stores the data and meta-data a DF-thread needs during execution. DF-threads are ready to run when all their input data is available in their associated DF-frame. The TSU tracks and updates the Synchronization Counter (SC), which is stored in the meta-data section of each DF-frame. When a DF-thread is created, its SC is initialized to the number of its producers. Each time a producer writes to its consumer, the consumer's SC is decreased by 1; DF-threads are ready to run when the SC has decreased to 0 (all input data is ready).

[Figure 4.6: on a single chip, memory, I/O and nodes N1..Nn are connected through a NoC; each node (logical abstraction) contains a per-node manager with D-TSU and D-FDU plus cores C1..Cm on an intra-node NoC; each core contains a processing element with its L1 cache, an L-TSU, and a partition of the L2 cache.]

Figure 4.6: TSTAR high-level architecture.

A continuation is used to keep track of all the information of each thread (a minimal struct sketch is given after the list). It consists of a tuple of three values:

1. The Frame Pointer (FP): the address of a DF-frame (unique for every DF-thread).

2. The Instruction Pointer (IP): the address of the DF-thread's code.

[Figure 4.7: within a node, the per-node manager (D-TSU and FDU) communicates with the L-TSU of each core and with the D-TSUs of neighbouring nodes.]

Figure 4.7: Overview of the Thread Scheduling Unit (TSU).

3. The Synchronization Counter (SC): the number of inputs required by the DF-thread before it can start execution.
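A minimal sketch of this tuple as a C structure (the field names are our own assumptions):

struct continuation {
  void *fp;          /* Frame Pointer: address of the DF-frame     */
  void (*ip)(void);  /* Instruction Pointer: the DF-thread's code  */
  int   sc;          /* Synchronization Counter: missing inputs    */
};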

Local Thread Scheduling Unit (L-TSU) The L-TSU handles the scheduling within the core in addition to the allocation and management of the DF-frames. The L-TSU uses three units to perform these operations:

• Waiting List (WL): The waiting list stores the continuations (IP, SC, FP) of newly created DF-threads. Their SCs are equal to the number of their producers.

• Pre-Load Queue (PQ): It stores the IP and FP of the DF-threads that are ready to execute, i.e., whose associated SCs have decreased to 0. It is organized as a FIFO queue.

• Free Frame List (FFL): It contains the FPs of the free DF-frames of the core. Whenever a DF-thread runs to completion, the associated DF-frame will be freed and added back to FFL.

The L-TSU keeps track of the dynamic continuations of threads in the Waiting List (WL). When a producer thread runs to completion and writes data to its consumers, the SC value in each consumer's continuation is decremented. When the SC reaches zero, the continuation is moved to the Pre-Load Queue (PQ) and is ready for execution.

• Thread Creation. The TCREATE instruction communicates to the L-TSU the IP and SC of the newly created DF-thread. This triggers the L-TSU to allocate a new DF-frame for the newly created DF-thread. The L-TSU locates the FP of the new DF-frame from the FFL. The L-TSU then stores in the WL the continuation (FP, IP, SC tuple) of the DF-thread. If there are no free DF-frames in FFL, then the TCREATE request is forwarded by the L-TSU to its node’s D-TSU.

• Thread Communication. Producers and consumers communicate with each other through TWRITE. Using TWRITE, the producer thread can write its result(s) directly to its consumer(s). Once a TWRITE is issued, the SC of its consumer(s) is decremented by one, either immediately (eager execution) or at the end of the thread (lazy execution). When the SC decreases to zero, the continuation is placed in the PQ. The TWRITE is snooped by the L-TSU so that the L-TSU can decrement the SC associated with the DF-thread. If the TWRITE affects a DF-frame that is not stored locally, it is routed by the NoC to the appropriate node, where similar snooping actions are performed.

• Thread Completion. When a DF-thread runs to completion, it invokes the TDESTORY instruction, which triggers the L-TSU to do a few things. It places the allocated DF-frame associated with the DF-thread on the FFL. It also transfers the continuation of the next ready DF-thread to the pipeline so that execution of that DF-thread begins and frees its associated PQ entry. Finally, L-TSU checks the WL for entries whose SCs have reached zero, and moves the FPs and IPs of those entries to the PQ and clears those entries in the WL.

Distributed Thread Scheduling Unit (D-TSU) Each node in the system contains a D-TSU that communicates with all the L-TSUs on the cores inside

59 that node. The D-TSU also communicates with the D-TSUs on the rest of the nodes in the system. The D-TSU communicates with the Fault Detection Unit (FDU) responsible for fault detection for the cores, the TSU in the node, as well as neighboring FDUs. The D-TSU handles inter-core and inter-node thread scheduling. The D-TSU uses three units to perform those operations:

• Free Frame Table (FFT). The FFT stores the number of free DF-frames for each core in its node, which corresponds to that cores number of FFL filled entries. When a DF-frame is allocated/released in a core, the cores’ corresponding entry in the FFT is decremented/incremented.

• Frame Request Queue (FRQ). The FRQ stores the frame allocation requests

• Threads to Cores List(TCL). The TCL keeps the information of all the DF- threads that exists in all the cores of its node. This information is used for fault recovery.

A D-TSU receives a TCREATE requests from other D-TSUs or from any of its nodes’ L-TSUs. An L-TSU forwards a TCREATE request to a D-TSU when there are no free DF-frames locally in its core. A D-TSU forwards a TCREATE request to another nodes’ D-TSU when there are no free DF- frames in its node. In both cases, when a TCREATE request arrives at a D-TSU it selects another core with free DF-frames to provide a free DF-frame. It first stores the TCREATE request in its FRQ, then checks its FFT to locate a core with a free DF-frame. It then forwards the IP and SC of the new DF-thread to a local L-TSU in its node that will provide the free DF-frame. When it receives the FP of the newly allocated DF- frame it forwards it to whoever made the TCREATE request (whether remotely from other D-TSUs or locally from one of its node’s L-TSUs). When a DF-thread invokes a TWRITE instruction with an FP that is remote (not local in the core where it is executing), the TWRITE is forwarded to the D- TSU of that DF-thread’s node. The D-TSU then forwards it to the appropriate L- TSU if it is local in the node, otherwise it sends it over the NoC to the appropriate D-TSU where its core is located. Note that, the FP is unique for every DF-frame and holds information about the node and core at which the DF-frame is located.

60 Chapter 5

Thread Partitioning II: Transform Imperative C Program to Dataflow Program

We discuss a different way of looking at imperative programs in this chapter—as data-flow threads. These data-flow threads can be extracted from a conventional static single assignment (SSA) representation. Starting from the extraction of fine-grained parallelism, grain-coarsening transformations reduce synchronization overhead while avoiding significant loss of parallelism. As an important contribu- tion, we lift all restriction on data dependence and control flow patterns, allowing for irreducible control flow and recursive calls. Our compilation algorithm also emphasizes modularity (separate compilation) and integration with existing de- velopment practices and binary interfaces. This chapter is limited to the exploitation of scalar data dependences, expect- ing programmer intervention to expose producer-consumer relations from array and pointer code by means of explicit scalar dependences. Handling arrays and complex structures are further discussed in Chapter6. Our approach is comple- mentary to programming language efforts to express inter-task dependences as pragma annotations Planas et al. [2009]; Pop & Cohen[2011a]; Pop et al. [2009]; Stavrou et al. [2008]; it contributes to reducing the verbosity of such annotations.

61 5.1 Revisit TSTAR Dataflow Execution Model

We revisit the TSTAR dataflow execution model in this section, and explain in details on compilation challenges with the explicit token matching.

The Explicit Token Store To optimize the token matching problem in dy- namic data flow architecture, Papadopoulos Papadopoulos & Culler[1990] pro- posed the explicit token store method. This method encodes the explicit ren- dezvous point for tokens to meet within a global address. As shown in Figure 5.1, every location that corresponds to an activation of a two-input operator is augmented with a presence bit that is initially in the empty state. When the first token arrives, it notices the empty state, and writes its value, v, into the location. The presence bit is then set into the present state. The second token notices the present state and reads the location, and then sets the presence bit back to empty. The operation defined by the instruction s (from the tag) is performed on the incoming token’s value and the value read, and a new token is generated with the resulting value, the same c, but a different s.

(c.s.l,2.0) (c.s.r,3.4236)

s: * c: empty s: * c: present 2.0 read

A. No operands, location is initially empty C. Second operand arrives, read location

(c.s.l,2.0) write s: * c: present 2.0 s: * c: empty (c.s.o,6.847)

B. First operand arrives write value into location D. Execute instruction, clear presence flag

Figure 5.1: Explicit Matching Operation.

TSTAR Explicit Token Store We adapt the concept of explicit token store in TSTAR execution model to multiple inputs with introducing the synchronization

62 counter. The global addressable storage (Frame Memory) for dataflow frames is used to encodes the explicit rendezvous point. We will explain in details in the following paragraph.

TSTAR Dataflow Execution Model To express producer-consumer and control dependences, we define an abstract data-flow interface suitable for paral- lelization passes in compilers as well as expert programmers developing low-level data-flow code. The interface defines two main components: data-flow threads, or simply threads when clear from the context, together with their associated data-flow frames, or simply frames. The frame of a data-flow thread stores its input values, and optionally some local variables or thread metadata. The frame’s address also serves as an identifier for the thread itself, to synchronize producers with consumers. Communications between threads are single-sided: the producer thread knows the address of the data-flow frames of its dependent, consumer threads. A thread writes its output data directly into the data-flow frames of its consumers. Each thread is associated with a Synchronization Counter (SC) to track the satisfaction of producer-consumer dependences: upon termination of a thread, the SC of its dependent threads is decremented. A thread may execute as soon as its SC reaches 0, which may be determined immediately when the producer decrements the SC. The initial value of the SC is derived from the dependence graph: it is equal to the number of arguments of the thread, each one correspond- ing to an externally defined use. In contrast, token-based approaches require checking the presence of the nec- essary tokens on incoming edges. This means that either (1) a scanner must pe- riodically check the schedulability of data flow threads, or (2) data flow threads are suspendable. The former poses performance and scalability issues, while the latter requires execution under complex stack systems (e.g., cactus stacks) that may introduce artificial constraints on the schedule. The SC aggregate the in- formation on the present and missing tokens for a thread’s execution, allowing producer threads to decide when a given consumer is fireable.

63 The challenges in compilation Explicit Token Matching shifts the compli- cations from the hardware design of token-matching system to the compiler. The compiler has to match the producers and consumers explicitly. In TSTAR exe- cution model, as producer data flow threads communicate by writing directly in the data flow frame of their consumers, it is necessary that, along all data de- pendence edges of the program dependence graph, the producer nodes know the data flow frame of the consumer nodes. It becomes more complicates when the producer and consumer are created by separate control programs, which means there needs to be a way to communicate such information. Figure 5.2 is the program dependence graph of a simple program. The dashed line represents control dependence and the solid line represents data dependence. If we partition each node as a task, we need a way to satisfy the control and data dependences. One simple way of satisfying the control dependences is to treat the control dependence edges as thread creation dependence (thread creation dependence is defined as, the target of the dependence will be created by the source of the dependence). In which way, S1 and C2 will be created by control node C1, and S2 will be created by control node C2, the only dependence left is the data dependence between S1 and S2.

C1

Data Dependency

Control Dependency S1 C2

S2

Figure 5.2: Program Dependence Graph for explicit token matching compilation example.

64 The problem needs to be solved is: How could we register S2(the con- sumer)’s DF-Frame information at the point S1 is created? S1 is created by C1, at the point when S1 is created, S1 does not have its consumer’s infor- mation at all. It does not even know when S2 will be created (S2 is created by C2, the execution of C2 depends on the dataflow execution model). We discuss the solution in details in later section with the help of our new IR (Data Flow Program Dependence Graph).

5.2 Partitioning Algorithms

The general approach for transforming sequential imperative programs into par- allel data-flow programs extracts the finest grain of thread-level parallelism, split- ting basic blocks at the statement level, which is then coarsened through typed fu- sion to reduce communication overhead. Strongly Connected Components (SCC) of the program dependence graph, where no parallelism can be exploited, are also coalesced. Our algorithm operates on a low-level program representation in Static Single Assignment (SSA) Cytron et al. [1990, 1991] form, common in modern produc- tion compilers like GCC and LLVM. Our algorithm is implemented as a new parallelization pass of GCC’s middle-end. We only consider scalar data dependences and control dependences. As stated in the introduction, arrays and pointers are currently ignored from the dependence analysis. The correctness of program transformations requires programmers to expose dependences as scalar dependences in the source program.

Description of the algorithm We first build the Program Dependence Graph Ferrante et al. [1987] under SSA form (SSA-PDG) from the serial program, then coarsen the granularity by merging the SCCs in the graph and apply typed fusion. To align the flow of values and data-flow frames with the control dependences, we define the Data-Flow Program Dependence Graph (DF-PDG), translated from the SSA-PDG. The DF-PDG is then used to generate target data-flow code. We illustrate our algorithm on the example in Figure 5.3 (left), where we assume that all functions are pure (no side effects, no state). In the loop body,

65 bar(i) is evaluated at each iteration, but only the last computed value will be used outside of the loop, along with the last value of i.

5.2.1 Loop Unswitching

The SSA form uses a unique name for each assignment to a variable. In this way, each use of a variable has a unique reaching definition. A merging Φ node is introduced in the SSA form at points where multiple control flow paths converge and a given variable is defined on more than one path. For the example on the left side of Figure 5.3, the Φ nodes for variables i, a and b will be placed before S9, in the loop header, as shown in Figure 5.4 (left). Some of these Φ nodes in the header carry redundant data flow: the Φ node for variable i defines a value used inside the loop body, while the Φ nodes for a and b only define values used outside of the loop. To differentiate the type of Φ node, we apply loop unswitching so that the inductive Φ node for variable i will remain at the header while the Φ nodes capturing the last values of a and b will be placed at loop exit, before their respective uses, as in Figure 5.4 (right).

S1a = 0; S1a = 0; S2i = 0; S2i = 0; S3b = 0; S3b = 0; S4 if(i < 100) { S9 while(i < 100) { do{ S6a=i; S6a=i; S7b= bar(i); S7b= bar(i); S8i= next(i); S8i= next(i); } S9} while(i < 100); } S12 if(a>b) S12 if(a>b) S13 ret=a; S13 ret=a; else else S14 ret=b; S14 ret=b; S16 return ret; S16 return ret;

Figure 5.3: Running example (left) and after unswitching (right).

66 # i1= Φ (i0, i2); if(i1 < 100) { # a1= Φ (a0, a2); do{ # b1= Φ (b0, b2); # i1= Φ (i0, i2); S9 while(i1 < 100) { S6 a2= i1; S6 a2= i1; S7 b2= bar(i1); S7 b2= bar(i1); S8 i2= next(i1); S8 i2= next(i1); S9} while(i1 < 100); } } /*use of a1, b1.*/ # a1= Φ (a0, a2); # b1= Φ (b0, b2); /* use of a1, b1.*/

Figure 5.4: SSA form before unswitching (left) and after (right).

5.2.2 Build Program Dependence Graph under SSA

There are two main reasons for relying on a SSA-based representation of the PDG:

1. making the reaching definitions unique for each use, effectively converting the scalar flow into a functional program;

2. reducing the complexity from O(N 2) to O(N), as the number of def-use edges can become very large, sometimes quadratic in the number of nodes Kennedy & Allen[2002].

In SSA form, multiple reaching definitions for a use are factored through Φ nodes, which ensures that the number of def-use chains is bounded by the number of chains in the control flow graph. Each node in the SSA-PDG represents a statement in SSA form and edges in the graph represent either control or data dependences. Control dependences can form several weakly connected graphs. We add control dependences from the entry node of the function to the root node of each weakly connected graph. The root node of each weakly connected graph is defined as the first node in the graph reached by the control flow. Figure 5.5 shows the SSA representation, after loop unswitching, for the code on Figure 5.3 (right), and the corresponding SSA-PDG is presented on Figure 5.6, where dashed edges represent control dependences and solid edges represent data

67 dependences. We add control dependences from the function entry point foo to S2, S4, S1, S3, S10, S11, S12, S15, S16. int foo(){ S1 a0 = 0; S10# a2= Φ(a0, a1) S2 i0 = 0; S11# b2= Φ(b0, b1) S3 b0 = 0; S12 if(a2> b2) S4 if(i0 <= 99) goto S13; goto S5; else else goto S14; goto S10; S13 ret0= a2; S5# i1= Φ(i0, i2); goto S15; S6 a1= i1; S7 b1= bar(i1); S14 ret1= b2; S8 i2= next(i1); S9 if(i2 <= 99) S15# ret2= Φ(ret0, ret1) goto S5; S16 return ret; else } goto S10;

Figure 5.5: SSA representation after loop unswitching.

foo

a2

S2 S4 S1 S3 S10 S11 b2 S12 S15 S16 i0 a0 ret2 b2 ret0 i0 ret1 b0 a1 a2 i1 b1 S5 S8 S6 S7 i2 S13 S14 i2 i1 S9

i1

Figure 5.6: SSA-PDG for the code example in Figure 5.5.

68 5.2.3 Merging Strongly Connected Components

In the SSA-PDG, SCCs present no parallelization opportunities. Their execution is sequential, so generating data flow threads would mostly incur parallelization overhead.1 We coalesce SCCs, as shown on Figure 5.7 where the new node SCC1 replaces nodes S5, S8 and S9 from Figure 5.6. This SCC corresponds to the induction on variable i in the loop.

foo

a2

S2 S4 S1 S3 S10 S11 b2 S12 S15 S16 i0 a0 ret2 b2 ret0 ret1 b0 a1 i0 a2 b1 i1 S6 S7 S13 S14

SCC1 i1

Figure 5.7: SSA PDG after merging the SCC.

5.2.4 Typed Fusion

Before partitioning, we coarsen the granularity of each data-flow thread using a generalized form of the typed fusion algorithm of McKinley and Kennedy Kennedy & Mckinley[1993]. In SSA-PDG, we assign a type to each node accord- ing to its control dependences: all nodes sharing identical control dependences are assigned the same type. Nodes of the same type are candidates for typed fusion. On Figure 5.7, the nodes S6 and S7 have the same type, as both are control dependent on S4, and can potentially be fused. Similarly, nodes S2, S4, S1, S3, S10, S11, S12, S15 and S16 could be fused, but this would lead to

1Some latency-hiding benefits exist, but we concentrate on parallelism extraction in this paper.

69 adverse side effects, in particular reducing parallelism by creating artificial SCCs. In our example, such a fused node would lead to SCC with nodes S13 and S14 because of the dependence chains S10-S13-S15 and S11-S14-S15. For this reason, we limit the fusion algorithm to avoid introducing new SCCs or increasing the size of existing SCCs, which are on the critical path of the program’s execution. Our approach for typed fusion follows a simple greedy algorithm, which starts from a random node in each typed set and adds new nodes of the same type to the fusion set as long as: (1) the new additions do not lead to creating a new SCC in the SSA-PDG after fusion; and (2) the fusion set does not contain SCCs itself. As this algorithm is applied after fusing the existing SCCs, the latter condition simply means that nodes that have self-dependences are not considered candidates even if their type matches. Figures 5.7 and 5.8 (A) present one possible outcome of typed fusion applied to the running example. There are five different types in the graph on Fig- ure 5.7: (foo), (S2, S4, S1, S3, S10, S11, S12, S15, S16), (SCC1, S6, S7), (S13) and (S14). Note that S13 and S14 have distinct types because their control dependences correspond to different truth values for S12. For the typed set (SCC1, S6, S7), SCC1 cannot be fused with any other node as it would increase the size of an existing SCC (restriction (2)). This only leaves S6 and S7, which can be fused and yield the fused node F2 on Figure 5.8 (A). For the typed set (S2, S4, S1, S3, S10, S11, S12, S15, S16), there are many possible outcomes, depending on the traversal order of typed sets. In this case and as we discussed above, restriction (1) does not allow, for example, nodes S4 and S10 to be in the same fusion set because of the dependence chain S4-S6-S10. The algorithm will always lead to at least three fusion sets for this type due to the long dependence chain S4-S6-S10-S13-S15. This technique coarsens the granularity of data flow threads, without inserting redundant computations and without increasing the number of instructions be- longing to SCCs (which usually happens when relying on basic blocks to partition the computation).

70 foo foo fp_f3 fp_f4

F1 F3 F4 F1 F3 F4 ret0 i0 a1 a2 b2 b1 ret1 i1 SCC1 F2 S13 S14 SCC1 F2 S13 S14

(A) (B)

Figure 5.8: (A) SSA-PDG after typed fusion. (B) Data Flow Program Depen- dence Graph.

5.2.5 Data Flow Program Dependence Graph

As data flow threads communicate by writing directly in the data flow frame of their consumers, it is necessary that, along all data dependence edges of the SSA-PDG, the producer nodes know the data flow frame of the consumer nodes. We transform the SSA-PDG graph to reflect the communication required to this effect.

Thread creation point Thread creation occurs, conditionally, along each con- trol dependence edge of the SSA-PDG. The thread creation points for a given node are its predecessor nodes in the control dependence graph. On Figure 5.8 (B), where dashed lines represent control dependences, the nodes foo, F1, F3 and SCC1 are thread creation points as they have outgoing control dependence edges. Control dependences can be conditional, like in the case of F3 which creates either S13 or S14 depending on the conditional statement S12, or unconditional in the case of the function entry point foo. At a thread creation point, the data-flow frame for the newly created thread is known and it needs to be passed to all threads producing data for this new thread.

Passing data-flow frame information The DF-PDG for our running exam- ple is shown on Figure 5.8 (B). We add data dependences to the SSA-PDG for

71 passing the data-flow frame information of consumer threads to producer threads. The edges with white triangular arrow communicate the data-flow frame of F4 (consumer) to the data-flow threads S13, S14 (producers). The values of ret0 and ret1, produced in S13 and S14 are consumed by F4. As the data is stored directly in the frame of the consumer, the producer must get a pointer to the frame of F4. The thread creation point for F4 is the function entry point, foo, and it needs to forward this frame pointer to all producers, which involves thread F3 in our example. We rely on the following algorithm to pass the data flow frame pointers of consumers to producers. For each data dependence from a node A to a node B in the SSA-PDG, we visit the predecessors of each node along control dependences until we reach a common predecessor P.

• If P is an immediate predecessor of the consumer node B, then P is the thread creation point for B and therefore knows the data-flow frame of B. We add data dependences for the frame pointer of B along all control dependence paths in the graph to A.

• If P is not an immediate predecessor of B, we need to split the data de- pendence as the frame of the consumer cannot be known due to diverging control flow paths, as illustrated on Figure 5.9. We remove the original data dependence from A to B and we add data dependences, for the same variable, from A to all successors D of P in the SSA-PDG such that there is a control dependence path from D to B. We further add data dependences for the data flow pointer of D from P to A and also for the data itself, from D to B.

This second case is illustrated on Figure 5.9, where data dependences from S1 to S2 and S3 need to be split. The common predecessor is the function entry node foo, which is not an immediate predecessor of either S2 or S3. The successor of foo that is a predecessor of S2 and S3 is C1, which is used to forward the data produced by S1. The frame pointer of C1 is sent to S1 from its thread creation point.

72 foo foo fp_c1

S1 C1 S1 a C1 a a a a

S2 S3 S2 S3

(A) (B)

Figure 5.9: Splitting data dependences: (A) the original SSA-PDG and (B) the generated DF-PDG.

Strongly connected components in the DF-PDG The algorithm for build- ing the DF-PDG presented above introduces additional data dependences that can lead to new SCCs in the graph. These new constraints need to be enforced, which serializes the execution. For this reason, we perform one additional pass of fusion of SCCs once the DF-PDG is constructed. The example on Figure 5.10 and corresponding SSA-PDG in Figure 5.11 illus- trate this issue. There are two data dependences: S1 to S2 and the loop-carried dependence S3 to S1. For the latter, the DF-PDG construction algorithm explores every control dependence paths linking S3 and S1 from a common predecessor, namely C0→S3 and C0→C1→S1. As S1 is only reached through C1, the data dependence is split and the data forwarded to S1 through C1, as shown by the extra data dependence edges on Figure 5.11 (right). Similarly, the data depen- dence from S1 to S2 needs to be split, resulting in the gray path. There are four different combinations for each data dependence and we only show one on Figure 5.11, yet it already results in a SCC involving nodes S1, S3 and C1.

73 C0 if(c) goto S1; else goto D1; S1 a2= Φ (a0, a1); S2 ... = a2; S3 a1 = ...; C1 if(c) goto S1 else goto D1 D1 ...

Figure 5.10: SSA representation for a simple loop carried dependence

C0 C0

fp_c1 a2 fp_c1

a1 S1 S3 S2 S1 S3 S2

a1 a1 a2 a2

C1 a1 C1

Figure 5.11: Corresponding SSA-PDG (left) and a partial DF-PDG (right) for code in 5.10. 5.3 Modular Code Generation

We put a strong emphasis on generality and flexible integration of data-flow compilation tools in a state-of-the-art development process. Modularity has not been considered a first-class objective in previous thread-level data-flow algo- rithms. We show that modular code generation is possible, provided that each processor core (or instruction fetch unit) is associated with a private user-level stack. The stack only needs to be accessed by this particular core. Data-flow threads themselves do not need any internal stack; they are non-suspendable and run sequentially w.r.t. other data-flow threads scheduled on the same core. Any data-flow thread scheduled on a given core is free to use the core’s private stack. This stack streamlines the implementation of classical (blocking) function calls.

74 It may also be used to spill registers within a thread, although data-flow frames may accommodate free space for this purpose. For modular compilation purposes, externally visible functions in a compila- tion unit should be cloned, to preserve the original control flow interface, while the clone is compiled into data-flow threads. The original function can be used when calling the function from outside of a parallel data-flow region or to avoid saturating the system with threads. The clone must be exported among the module’s symbols to seamlessly compose threaded code over separately compiled modules. Within a parallel region, all functions are called asynchronously. Internal functions, within the scope of the compilation unit, are directly compiled into data-flow threads. External functions, linked from separately compiled modules, and builtin functions from the compiler are wrapped into a data-flow thread in which they are called synchronously. Every threaded clone of a function is split into three stages: the entry thread, multiple compute threads, and the return thread.

• The entry thread implements the entry block of the control flow graph, creating all the threads for the blocks that are unconditionally executed upon entering the function.

• The compute threads are systematically created by the immediate predeces- sors in the control dependence graph, as described in the previous section. Thread creation is conditional on the predicate at the source of the control dependence. Input arguments and pointers to the frames of the dependent threads are handled according to the DF-PDG.

• The return thread propagates the return value to the continuation threads at the call site of the function.

Figure 5.12 illustrates the calling convention. Calling a data-flow function creates the entry thread (callee.entry) of the callee and the caller’s continuation thread (caller.bb.2); the latter will wait for the value of the callee’s return thread (callee.return).

75 create

bb thread create entry thread bb thread caller.bb.1 callee.entry callee.bb.1

create create create

bb thread return thread bb thread caller.bb.2 callee.return callee.bb.2 continuation

continuation passing caller threads callee threads

Figure 5.12: Caller and callee, threaded version.

5.4 Implementation

We target data-flow execution on a shared-memory multiprocessor with hardware coherence. Our prototype has been implemented within GCC (GNU Compiler Collec- tion) 4.7.0. GCC has been widely used for compilation/architecture research like automatic vectorization Nuzman et al. [2006] Nuzman & Henderson[2006], thread level speculation Liu et al. [2006] Renau et al. [2005], induction variable recognition Trifunovic et al. [2010] etc.

Compilation Figure 5.13 illustrates the GCC compilation framework. The compilation framework could be splitted into 3 parts: the front end, the middle end, and the back end. The middle end is independent of language or architecture, and most optimizations happens here. Language dependent code in the front end will be lowered to GENERIC. GENERIC is an IR that provides a language- independent way of representing an entire function in trees, and GENERIC will be lowered to GIMPLE (gimplification) in the middle end. GIMPLE is a three- address IR derived from GENERIC by breaking down GENERIC expressions into

76 tuples of no more than 3 operands. Most optimizations happens with GIMPLE representation. The partitioning algorithm is implemented as an optimization pass in GCC (THREAD PARTITION). We first build the Program Dependence Graph under SSA form (SSA-PDG) from the serial program, then coarsen the granularity by merging the SCCs in the graph and apply typed fusion. To align the flow of values and data-flow frames with the control dependences, we define the Data- Flow Program Dependence Graph (DF-PDG), translated from the SSA-PDG. The DF-PDG is then used to generate target data-flow code.

Front end C C++ java fortran

GENERIC PASSES

THREAD PARTITION

GIMPLE SSA-PDG

Most of the Typed Fusion optimization GIMPLE+SSA happens here DF-PDG

Code Gen Middle end GIMPLE

RTL

ASSEMBLE back end

Figure 5.13: Implementation within GCC.

Runtime It replaces the libgomp OpenMP runtime of GCC. Parallel data-flow code runs within an omp parallel single region, extending the scheduler for OpenMP tasks. The runtime system is called dfrt and is implemented in C++.

77 Worker threads are created by libgomp. Upon entering an omp parallel region, they start scheduling tasks from the (shared) ready queue. The scheduler takes data-flow threads whose SC reached 0 and moves them to the ready queue. We currently assume there are enough ready threads to occupy the proces- sor cores and hide latency. This hypothesis eliminates the need for a waiting queue collecting threads whose SC has not reach 0, since we do not need to start prefetching data or code for these threads. Revisiting this hypothesis may be necessary when studying applications with limited parallelism degree where scheduling and memory latencies are harder to hide. Data-flow frames are allocated from a dedicated memory pool. This pool internally uses slab allocation to accelerate the allocation and deallocation of frames of predefined sizes. The frame structure itself is laid out as follows:

• a thread template pointer referring to invariant meta-data shared by all thread instances of this template, including the function (code) pointer and the size of the frame;

• the thread’s synchronization counter (SC);

• pointers to frames of data- and control-dependent threads;

• the thread’s arguments.

The last two items correspond to the frame structure exposed in the abstract data-flow interface and generated by the compiler.

5.5 Experimental Validation

We validate our approach on two universal examples of tree recursion, Fibonacci and merge sort. The objectives are:

1. checking the method on diverse data and control flow, including loops over arrays, divide and conquer recursion, and data-dependent conditions;

2. Fibonacci exhibits the finest-grain threads possible, which gives a precise reference on the break even point and scalability for thread-level data flow compared to fine-grain data flow;

78 3. merge sort is more realistic and allows to illustrate typed fusion for grain coarsening.

We target an Intel Core i7-2720QM 4-core laptop (Sandy Bridge chip) and an AMD Opteron 6164 HE 24-core blade server (two Magny Cours chips). Both benchmarks are recursive, sequential C programs, and automatically parallelized.1 To assess the effect of thread granularity, we set a threshold for parallel recur- sive calls. Below this threshold, the serial version is executed. This programmer- controlled granularity complements the effects of the automatic typed fusion al- gorithm. Modular compilation allows the serial version to be called seamlessly as an external function. As an illustration, in the Fibonacci implementation below, fib.threaded will be transformed to data-flow threads, and fib.serial will not since it is declared as an external function.

extern int fib.serial (int);

int fib.threaded (int n) { if (n < THRESH) return fib.serial (n); else return fib.threaded (n-1) + fib.threaded (n-2); }

Figure 5.14 reports the performance of merge sort on 200, 000 random integers between 0 and 10, 000. The compiler automatically partitions the function into data-flow threads, then converts the data and control dependences into the proper frame operations. The algorithms not only parallelize the recursive division of the array, but also the merge operation. The latter is a good candidate for typed fusion: the array comparisons and assignments are dominated by the same loop header and can be fused into a coarser-grain thread.

1The array dependences in merge sort are covered by scalar dependences on the indexes, and can safely be ignored.

79 The grain threshold ranges from 20 to 218 (262, 144, effectively sequentializing the execution). The figures show the speedup as a function of the granularity of the parallel threads. As the grain threshold increases, speedup gained from parallel execution on multiple cores exceeds the overhead of thread creation. The generated code breaks even at the threshold of 24. This low break-even threshold is a benefit of the applicability of typed fusion on the merge operation. As a divide and conquer algorithm, the problem size is reduced in each division, the array eventually fitting into the cache; it reaches a maximal speedup of 2.82.

3 2.82 2.78 2.78 2.71 2.71 2.62 2.55 2.5 2.39 2.45 2.45

2.08 2

1.65 1.62 1.5

speed up speed 1.10 0.96 1

0.61 0.5 0.32 0.33 0.18

0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 granularity (log2 scale)

Figure 5.14: Merge Sort running on 4 cores.

Figure 5.15 shows the performance results for fib(42). Fibonacci is an ex- treme case where typed fusion is ineffective, since no pairs of instructions share the same control dependence. The generated code breaks even when setting the threshold at fib(15), where 227 threads are created. But as granularity increases, the overhead of thread synchronization decreases. Our results on 24 cores reach a speedup of 11.86, which validates our algorithm’s ability to exploit parallelism effectively.

80 3

2.55 2.43 2.5 2.34 2.40 2.35 2.25 2.28

2 1.92

1.5

1.09 1

speed up

0.5 0.41 0.20

0 11 13 15 17 19 21 23 25 27 29 31 granularity 14

11.86 12 11.4611.68 11.13 11.32 10.41 10.53 9.83 10 9.31 8.83 8.22 8

6 speed up 5.37

4 3.03

2 1.70 0.95 0.57 0.12 0.20 0.33 0 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

granularity

Figure 5.15: Computing the 42th Fibonacci number on 4 cores (above) and 24 cores (below).

81 5.6 Summary

We presented an automatic parallelization algorithm to compile arbitrary im- perative control flow to a multithreaded data-flow model in this chapter. The algorithm operates on an SSA form PDG, extracting task, pipeline and data parallelism, then applying a generalized form of typed fusion to coarsen the syn- chronization grain, and finally expressing the communications in a suitable way for tokenless threaded data-flow execution. Our prototype is implemented in a production compiler; it currently supports scalar dependences only. In next chapter, we complement this algorithm with two methods of handling complex data structures.

82 Chapter 6

Handling Complex Data Structures

One general criticism leveled at dataflow computing is the management of data structures. The functional nature of pure dataflow programs implies that all operations are side-effect free. The absence of side effect means that if tokens are allowed to carry vectors, arrays, or other complex data structures, an operation on a data structure results in a new data structure. Which will greatly increase the communication overhead in practice. The problem of efficiently representing and manipulating complex data structures in a dataflow execution model remains a challenge. In this chapter, we propose two methods of handling complex data structures. Section 6.1 presents Streaming Conversion of Memory Dependences (SCMD), a hybrid compilation-time and runtime algorithm, which decouples the memory ac- cesses with computations automatically and connects independent accesses to the same memory locations with streams in finest granularity. Section 6.2 present the Owner Writable Memory (OWM) model to reduce the communication overhead when complex structures passed over threads.

83 6.1 Streaming Conversion of Memory Depen- dences (SCMD)

Streaming Conversion of Memory Dependences (SCMD) is a hybrid compilation- time and runtime algorithm that automatically parallelize programs with arrays at the finest granularity, it connects independent writes and reads access to the same memory location with streams dynamically, in which way, the computation could be decoupled with even dynamic memory access patterns. In conventional dataflow, arrays are treated as a single memory region, the dependences are coarsened accordingly, and the parallelism opportunity is largely reduced.

6.1.1 Motivating Example

Figure 6.1 shows a general data access pattern for array A. S1 writes to memory location A[m(i)], S2 reads from memory location A[g(i)] and S3 writes to memory location A[f(i)]. m(int), g(int), f(int), compute 1(int) and compute 2(int) are side effect free functions.

for(i = 0;i

Figure 6.1: Non-linear access for array A within a loop.

The dynamic array accesses and unknown dependences make it impossible to be parallelized with compile time analysis. One might think a solution to decouple the computation and memory accesses with loop distribution and then parallelize the distributed loops as presented in Figure 6.2. But there are a few caveats here:

• The loop distribution method, as we discussed in Chapter3, inserts a bar- rier between distributed loops. In this example, the barriers are inserted between L1 and L2, L2 and L3. L3 could only be executed after parallel for execution of L2 finishes.

84 L1 parallel for(i = 0;i

L2 parallel for(i = 0;i

L3 for(i = 0;i

Figure 6.2: Decouple computation and memory accesses with loop distribution.

• L3 could not be executed in parallel. But at run time, there are still paral- lelization opportunities. Consider the situation where g(i), f(i) and m(i) are all evaluated to (0, 1, 2,...,N) at runtime. By using SCMD, the tasks could be matched automatically and executed in parallel.

• Loop distribution works when the loop is well structured. When complex control dependences get involved, the computation and memory accesses can be hardly analyzed at compilation time.

SCMD decouples the memory accesses with computations automatically and connects the accesses to different memory locations with streams in finest gran- ularity, which could exploit maximum parallelism in the program at execution time. We will explain SCMD in details. Consider the access order for a single memory location {A[m]|m ∈ [0,N −1]}. ~ ~ Let Sm denotes the access order for memory location A[m], and for {{x ∈ Sm}, x ∈ {W, R}}. W denotes write access and R denotes read access. The access order represents the read and write accesses to a single memory location during the ~ execution of the program. In our motivating example, Sm depends on how m(i), g(i) and j(i) evaluate in the code. e.g., if N = 1, m(i) , g(i) and f(i) ~ all evaluate to 0, we have the access order S0 = {W, R, W }; if N = 2, and m(i)

85 evaluates to {0, 1}, g(i) evaluates to {1, 1} and f(i) evaluates to {1, 0}, then the access order for memory location A[0] = {W, W }, and A[1] = {R, W, W, R}. Note in the access order for each memory location, the first element in the sequence must be W (producer always comes before consumer). The access order for memory location A[1] we showed here is just a partial access order, the A[1] must be initialized before this loop (e.g. when the array is initialized the first time). In SCMD, in order to get the finest granularity and maximum parallelism, each write to a memory location with its computation are forked as a producer, and each read to a memory location with its computation are forked as a consumer. In which way, the reads and writes to a each memory location will be connected with a dataflow stream. In the TSTAR dataflow execution model, the producer writes to its con- sumer(s) directly, which indicates, the producer has to have the knowledge of its consumer(s) before it is being executed. But in a program with dynamic array access patterns, at the point the producer is created, its consumers are unknown until the next access to the same memory location is executed. The question is, in a totally dynamic program, how do we connect writes and reads to the same memory location with dataflow streams? We will consider the access order separately in 3 possibly sequences, and discuss each in the following sections:

~ • Single producer and single consumer where Sm = {W, R}. ~ • Single producer and multiple consumers where Sm = {W, R, R}. ~ • Multiple producers and single consumer where Sm = {W, W, R}.

6.1.2 Single Producer Single Consumer ~ Assume we have the access order Si = {W, R} for memory location {A[i]|i ∈ [0,N − 1]}. When the value assigned to A[i] is produced (W in its access order ~ Si), we do not have the knowledge in which iteration of the for loop, it will be ~ consumed (R in its access order Si). In order to solve this problem, a proxy thread will be created and wait for the result produced by the producer, and also, wait for its consumer’s information (DF-Frame). When the consumer is created, The

86 pointer to its DF-Frame will be written to this proxy thread. When the proxy thread have both the result and the pointer to its consumer’s DF-Frame, it will be executed and write this result to its consumer. Figure 6.3 illustrates this solution. The dotted line with white triangle rep- resents the execution order in the control program: the control program for con- sumer 1 (CPC1) could only be executed after control program for producer A (CPPA) finishes its execution. The dashed lines represent thread creation relation- ship from its source to its destination, in this case, CPPA will create the producer thread (ProducerA) and the its proxy thread (proxy p), store the frame pointer of proxy p (fp proxy p) to shared buffer CTBL indexed by memory location A(i). The proxy thread has SC initialized to 2, and wait for the result from producer thread and the DF-Frame address of the consumer. CPC1 will create the consumer thread, and write the frame pointer of the consumer (fp consumer) to fp proxy by looking up the indexed by memory location A(i). When the synchronization counter of proxy p decrease to 0, it will write the result to the consumer thread pointed by fp consumer, and then the consumer will be ready for execution.

6.1.3 Single Producer Multiple Consumers ~ Assume we have the access order Si = {W, R, R} for memory location {A[i]|i ∈ [0,N − 1]}. In this case, we have multiple consumers that needs to consume the same value produced by the same producer. One might have the solution that count for the number of its consumers Nc, and have a proxy thread with

SC equals to Nc + 1. Wait for the information of all its consumers, and when all the consumers are created, broadcast the result. This method will block the execution of all its consumers till the last consumer is created, and works only if the number of consumers are known when proxy thread is created. In a dynamic program like Figure 6.1, we do not have the knowledge how many consumers one producer have until the entire loop is executed. So instead of one single proxy thread for the producer, we also create one proxy thread for each consumer. This consumer proxy thread will wait for the result passed from previous proxy thread, the DF-Frame information from the

87 producer: A[i] = ...; control program CPPA for Producer A

proxy thread for producer

thread creation sc=2 ProducerA proxy_p result write fp_consumer enforced dependences result

data dependences write consumer1: ... = A[i]; fp_consumer1 control program Main Control Program for consumer1

CPC1 write result

sc=1 consumer1 get fp_proxy result by stream index

CTBL fp_proxy stream i shared memory

Figure 6.3: Single producer single consumer

88 next consumer, and next consumer’s proxy thread. Once a proxy thread got those information, it will write the result to the consumer and this consumer’s proxy thread. Figure 6.4 illustrates this solution. The dotted line with white triangle rep- resents enforced dependence in the control program: the control program for consumer 1 could only be executed after control program for producer A finishes its execution. The dashed lines represent thread creation relationship from its source to its destination, and the red and blue lines represent the data depen- dences.

• CTBL. CTBL is the check up table indexes for each memory location. CTBL resides in the shared memory, it is used by the control program to resolve the dependences between producers and consumers.

• The control program. The control program takes care of thread cre- ation and matching producers with its consumers. For example, the control program for Producer A (CPA) creates the producer thread (ProducerA) and its proxy thread (proxy p). And set CTBL for memory location A[i] to this proxy thread’s DF-Frame address (fp proxy p); the control pro- gram for consumer1 (CPC1) creates the consumer thread (consumer1) and its proxy thread (proxy c1), and writes the DF-Frame information of this proxy thread, and the consumer thread to the previously created producer’s proxy thread (proxy p). The previous proxy thread’s DF-Frame address could be retrieved from CTBL. The same applies for consumer2.

• The producer thread. The producer thread does the computation, and stores the result to its proxy thread once it is finished (ProducerA in this example). Note the producer and its proxy thread are created together by CPA, the proxy thread’s DF-Frame pointer is written to the producer’s DF- Frame when both are created, so that the producer thread could restore its proxy thread at execution.

• The proxy thread. The proxy thread takes three inputs: one field for the result, one field for its next consumer, and one field for the next proxy thread of this consumer. Once this proxy thread gets all the inputs, it is

89 CTBL fp_proxy A[i] shared memory

producer: A[i] = ...; control program for Producer A CPA

proxy thread for producer

thread creation sc=3 ProducerA proxy_p result write fp_consumer enforced dependences result fp_proxy_c

data dependences write consumer1: ... = A[i]; fp_consumer1 control program fp_proxy_c1 for consumer 1 Main Control Program write CPC1 result

proxy thread for consumer sc=1 sc=3 consumer1 proxy_c1 result result fp_consumer fp_proxy_c

write consumer2: ... = A[i]; fp_consumer2 fp_proxy_c2 write result CPC2

control program for consumer2

proxy thread for consumer sc=1 sc=3 consumer2 proxy_c2 result result fp_consumer fp_proxy_c

Figure 6.4: Single producer multiple consumers

90 ready for execution. Once being executed, it will write the result to the consumer thread and the consumer proxy thread. For example, producer A’s proxy thread (proxy p) have SC initialized to 3, and its dataflow frame waits for 3 inputs: result, its consumer thread (fp consumer) and its consumer’s proxy thread (fp proxy c). Both fp consumer and fp proxy c are written to proxy p by the control program for consumer1 (CPC1) at the point they are created.

• The consumer thread. The consumer thread takes one input for the result produced by its producer. The result is written to the consumer thread by its producer’s proxy thread (consumer1 gets result from proxy p) or by previous consumer’s proxy thread (consumer2 gets its result from proxy c1).

6.1.4 Multiple Producers Single Consumer ~ Assume we have the access order Si = {W, W, R} for memory location {A[i]|i ∈ [0,N − 1]}. In this case, we have multiple producers and one single consumer to ~ the same memory location. In the case with access order Si = {W, W, R}, the first produced value should be abandoned and the second should be consumed. Since the access order is not statically decidable, at the time the first producer is created, we do not have the knowledge if another producer to the same memory location will be created right before the consumer is created. If we follow the method as in single producer single consumer discussed before, it will still work. CTBL maintains a mapping from the memory location to the most recent producer’s proxy thread. At the time the consumer to the same memory location is created, it looks up in CTBL and gets the most recent producer’s proxy thread. It then writes back the consumer’s dataflow thread information. But note in this way, the first producer’s proxy thread will become a zombie thread, and will never be executed. So we adapt the algorithm, once a producer is created, it lookups in the table CTBL, if the lookup for this its memory location is not empty, it means either a previously created producer’s proxy thread (multiple consumers situation) or a consumer’s proxy thread (multiple consumers situation) is already created. In

91 either way, both the proxy thread should be deallocated since a new write to the same memory location comes. Write a NULL value instead of newly created consumer thread’s DF-Frame pointer, in which case, at the time the proxy thread is being executed, it could deallocate itself by checking if this value is NULL. Figure 6.5 illustrates this solution. 2 producers for the same memory location A[i] are created before the consumer. The first producer (producer1) is created by the control program for producer 1 (CPP1). CPP1 set CTBL to the DF-Frame pointer of proxy thread for producer 1 (fp proxy p1). The second producer (producer2) is created by the control program for producer 2 (CPP2). CPP2 will first lookup in the table CTBL for this memory location A[i], and get previously created producer 1’s proxy thread fp proxy p1. It will write a NULL value to this proxy thread and decrease the SC by 2. When proxy p1 is executed and get this NULL value, it will run to deallocate itself and release the resources.

6.1.5 Generated Code for Motivating Example

Figure 6.6 and Figure 6.7 presents the generated code by applying SCMD algo- rithm for our motivating example in Figure 6.1. Figure 6.6 shows the dataflow tasks generated. For each producer, a proxy thread will be created, waits for the computation decoupled from the producer task, and also waits for its consumer’s DF-Frame information. The producer task computes the result, and writes the result back to its proxy thread once compu- tation is finished. e.g. func s1 is the producer thread, it gets the current DF- Frame information (get cfp()), and loads the inputs it needs (i for computation and producer proxy for its proxy thread’s DF-Frame). Once the computation finishes, it writes the result to proxy thread’s DF-Frame, and decrease the SC by 1 (tdecrease (producer proxy)). The proxy thread (func proxy s1) waits not only for the result computed from the producer thread, but also waits for its consumer’s information. When it gets its consumer’s DF-Frame, its consumer proxy’s DF-Frame and the computed result, it will be ready for execution. In case ~ of multiple producers’ case (Si = {W, W, R}), the NULL value is written as the consumer’s DF-Frame. When the producer get a NULL value (if (fp consumer == NULL)), it deallocates itself by calling tdestroy.

92 producer1: A[i] = ...;

control program CPP1 for Producer 1

proxy thread for producer 1

set to fp_proxy_p1 sc=2 Producer1 proxy_p1 result write fp_proxy_p1 A[i] fp_consumer result

CTBL shared memory get the proxy thread

producer2: A[i] = ...; write fp_consumer=NULL set to fp_proxy_p2

control program CPP2 for producer 2

proxy thread for producer 2 sc=2 Producer2 write proxy_p2 result result fp_consumer

thread creation

enforced dependences write consumer: ... = A[i]; fp_consumer to proxy_p2 operations CPC write Main Control Program control program for consumer2 fresult

sc=1 consumer result

Figure 6.5: Multiple producers and single consumer

93 /* Producer thread(S1).*/ /* Consumer thread(S2).*/ void func_s1() void func_s2() { { current_fp= get_cfp (); current_fp= get_cfp (); i= current_fp->i; result= current_fp->result; result= compute_1(i); tdestroy (); } producer_proxy= current_fp->proxy; tdecrease(producer_proxy); /* Consumer’s proxy thread(S2).*/ tdestroy (); void func_proxy_s2() } { /* Producer’s proxy thread(S1).*/ current_fp= get_cfp (); void func_proxy_s1() result= current_fp->result; { current_fp= get_cfp (); fp_consumer= current_fp->consumer; result= current_fp->result; if(fp_consumer == NULL) fp_consumer= current_fp->consumer; tdestroy (); /* Multiple producers case. Destroy this zombie proxy thread.*/ fp_consumer->result= result; if(fp_consumer == NULL) tdecrease(fp_consumer); tdestroy (); fp_consumer_proxy= fp_consumer->result= result; current_fp->consumer_proxy; tdecrease(fp_consumer); fp_consumer_proxy->result= result; tdecrease(fp_consumer_proxy); fp_consumer_proxy= tdestroy (); current_fp->consumer_proxy; } fp_consumer_proxy->result= result; tdecrease(fp_consumer_proxy); tdestroy (); }

Figure 6.6: Generated code for control program using SCMD.

For each consumer, a proxy thread will also be created in multiple consumers’ ~ case (Si = {W, R, R}). At the time consumer task is executed, it will get the current DF-Frame, and load the result for further processing. The consumer’s proxy thread (func proxy s2) takes care of passing the result computed by the producer to the next consumer in case of multiple consumers. And will be deal- located automatically when a NULL value is received. Figure 6.7 shows the generated control program. The control program takes care of resolving all the memory dependences. We will take the control program for producer S1 as an example. We specify each statement with Ln, represents

94 the nth line in the code in the following discussion.

1 void foo.control (){ 2 init_CTBL (); 3 for(i = 0;iproxy= fp_producer_proxy_s1; 12 fp_producer_s1->i=i; 13 /* Schedule the producer to execute*/ 14 tdecrease(fp_producer_s1); 15 16 tmp= get_CTBL(idx_m); 17 fp_current_proxy= tmp.val; 18 status= tmp.status; 19 /* handle multiple producers case. Write NULL(dummy) value to 20 destroy the proxy thread.*/ 21 if(status != EMPTY){ 22 fp_current_proxy->val= NULL; 23 tdecrease(fp_current_proxy.val); 24 } 25 set_CTBL(idx_m, fp_producer_proxy_s1); 26 27 // Control Program for consumer S2. 28 idx_g=g(i); 29 /* Create the consumer thread.*/ 30 fp_consumer_s2= tschedule(func_s2, 1, sz_p_s2); 31 /* Create the consumer’s proxy thread.*/ 32 fp_consumer_proxy_s2= tschedule(func_proxy_s2, 1, sz_pp_s2); 33 fp_current_proxy= get_CTBL(idx_g).val; 34 /* Register both consumer and consumer’s proxy in case of 35 multiple consumers.*/ 36 fp_current_proxy->consumer= fp_consumer_s2; 37 fp_current_proxy->consumer_proxy= fp_consumer_proxy_s2; 38 set_CTBL(idx_g, fp_consumer_proxy_s2); 39 // Control Program for producer S3. Similar to S1, omited. 40 } 41 }

Figure 6.7: Generated code for control program using SCMD.

L7 creates the producer thread, and L9 creates the producer’s proxy thread. L11 register the proxy thread’s DF-Frame to the producer by writing to the pro-

95 ducer thread’s DF-Frame. L12 writes i to the producer’s DF-Frame. tdecrease (fp producer s1) in L14 decreases the SC of producer thread, schedules the producer to execute. get CTBL gets the proxy thread’s information stored in CTBL ( L16). L21 will check multiple producer’s case. If it is not empty, it means either a previous producer or a consumer’s proxy thread is written to CTBL for this memory loca- tion. To deallocate this unused producer proxy thread and release the resources it allocates, writes a NULL value so that it could deallocate itself once it is be- ing executed. set CTBL sets the CTBL with memory location A[idx m] to the producer’s proxy thread just created.

6.1.6 Discussion

SCMD decouples the memory access with computation automatically and con- nects the accesses to different memory locations with streams in a finest gran- ularity, which could exploit maximum parallelism in the program at execution time — at the cost of thread creation. For each memory location, at least one thread will be created. As we discussed in chapter1, we assume the cost of task creation is relatively cheap in TSTAR architecture, and the extracted parallelism and decoupled computation still benefits.

6.2 Owner Writable Memory

The Owner Writable Memory (OWM) model is used to reduce communication overheads when complex data structures are passed between threads. The name and idea originate from Prof. Ian Watson from the University of Manchester; our work in this chapter mainly covers the design from the compilation point of view. OWM is a globally addressable memory: before a thread can write to a portion of it, the thread has to claim ownership beforehand, and at any point in time only the thread holding ownership of that memory may write to it. While write ownership is held, a read from another thread is not guaranteed to see consistent data. When write ownership is released, a consistent view of the data must be visible to any other thread. Note that the release operation can be performed explicitly by the thread or implicitly by the model. The latter happens when a thread uses OWM to write its results, which are made available to the consumer thread upon completion of the producing thread's execution. This memory can serve the requirements of the single-assignment semantics required for functional objects; however, the ability for other threads to subsequently reclaim write ownership adds flexibility of usage.

6.2.1 OWM Protocol

For a complete description of the OWM memory model, we also include the OWM protocol here. The OWM protocol was first formalized by François Gindraud Gindrand et al. [2013], who was doing his master thesis at our lab. We give a short introduction here, and continue with our work on the compilation side of the OWM design in later sections. The OWM protocol is inspired by a distributed, directory-based MSI protocol. The global OWM memory address space is mapped locally to each node on the NoC. Before a task can access an OWM subregion, it has to claim ownership beforehand. The owner always keeps track of the nodes that hold a valid copy of the subregion. The ownership of an OWM subregion is resolved as follows:

• The globally addressable OWM memory is distributed across the nodes. The address tells us on which node a subregion was allocated in the first place; we call this node its first owner.

• When the ownership changes, the first owner always keeps the information about the current owner. When a claim-ownership or data request is received, the first owner forwards the request to the current owner and renews its ownership information.

One problem with MSI is the atomicity of bus events. On the NoC, we assume all messages eventually arrive, without packet loss or duplication, but in any order. We must therefore ensure that a task accessing a region in W mode invalidates all copies of that region on other nodes before the tasks that depend on it are executed. We assure this property by adding a memory semantic, tpublish. When all the modifications within the OWM subregion are done, the owner task has to execute tpublish on the region explicitly, to ensure that the copies held by all other nodes depending on the new data are invalidated.

Protocol description Each node on the NoC operates on two message queues, a send queue and a receive queue. Nodes communicate via messages; sending a message is abstracted as removing one message from the send queue of the source node and adding it atomically to the receive queue of the destination node. The protocol can be divided into three message pairs: (DataRequest, DataAnswer), (OwnerRequest, OwnerAnswer), (InvalidateRequest, InvalidateAck). We discuss each in detail below; a simplified sketch of the per-node message handling follows the list.

• Data request. DataRequest and DataAnswer messages are equivalent to the BusRd event in MSI. The request is sent to the first owner of the region and forwarded to the current owner. When the owner node receives this request, it replies with a DataAnswer message containing the fresh data and adds the requesting node to the list of valid nodes. When the requesting node receives the DataAnswer, it updates its local copy of the OWM region, sets the valid flag to true, and resets the requested flag.

• Ownership request. OwnerRequest and OwnerAnswer are similar to the BusM event in MSI. In snooping MSI, the bus guarantees that only one BusM event can occur at a time. In the OWM memory model, enforced dependences are added between tasks so that no ownership change can occur while another node has claimed ownership and has not yet published the data. The request message is sent to the first owner of the region and forwarded to the current owner; the first owner updates its ownership information upon seeing the OwnerRequest message. When the destination node receives this message, it sets the valid flag to true and sends an OwnerAnswer, which packs the data together with the ownership-response metadata, to the new owner. When the requesting node receives this message, it updates the region it requested with the data received. The valid-set information is also sent in the metadata by the previous owner; the requesting node updates this information and adds the previous owner to the set.

• Invalidation request. Invalidation complements the ownership transfer process: upon modification, we explicitly send invalidation requests to the other nodes that hold a local copy. The InvalidateRequest is sent to all nodes in the valid set; the valid set is copied to the waiting-invalidation-acknowledge set (WIAS) before it is reset. When a node receives an InvalidateRequest, it sets its valid flag to false and sends back an InvalidateAck message to acknowledge the sender. When the sender receives the InvalidateAck, it removes the source node from the WIAS.
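
The sketch below is a minimal, illustrative rendering of this per-node message handling in C; the type and function names (owm_msg_t, node_state_t, send_msg, handle_msg) are assumptions introduced for exposition and do not correspond to the actual simulator or runtime API.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical message and per-node state types (names are illustrative). */
typedef enum { DATA_REQUEST, DATA_ANSWER, OWNER_REQUEST, OWNER_ANSWER,
               INVALIDATE_REQUEST, INVALIDATE_ACK } owm_msg_kind_t;

typedef struct { owm_msg_kind_t kind; int src; int region; } owm_msg_t;

typedef struct {
  bool valid, requested, is_owner;
  uint64_t valid_set, wias;      /* bitmasks: nodes with a copy / awaiting acks */
} node_state_t;

/* Stub for the assumed messaging primitive: enqueue on dst's receive queue. */
static void send_msg(int dst, owm_msg_t m) { (void)dst; (void)m; }

/* Dispatch one message popped from the receive queue of node `self`. */
void handle_msg(int self, node_state_t *s, owm_msg_t m) {
  switch (m.kind) {
  case DATA_REQUEST:                        /* owner answers with fresh data */
    s->valid_set |= 1ull << m.src;
    send_msg(m.src, (owm_msg_t){ DATA_ANSWER, self, m.region });
    break;
  case DATA_ANSWER:                         /* requester refreshes its copy */
    s->valid = true; s->requested = false;
    break;
  case INVALIDATE_REQUEST:                  /* drop the local copy, ack the owner */
    s->valid = false;
    send_msg(m.src, (owm_msg_t){ INVALIDATE_ACK, self, m.region });
    break;
  case INVALIDATE_ACK:                      /* owner waits until WIAS is empty */
    s->wias &= ~(1ull << m.src);
    break;
  default:                                  /* OWNER_REQUEST / OWNER_ANSWER analogous */
    break;
  }
}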

6.2.2 OWM Extension to TSTAR

OWM is one single memory region, but it can be further divided into smaller subregions for finer granularity. We introduce owm_tsubscribe and owm_tpublish as an extension to the TSTAR ISA to support OWM. One can subscribe part of the OWM region to a thread (by calling owm_tsubscribe), which means that before this thread is executed, the ownership of the subregion will be acquired and ready for access. A thread publishes its modifications to the OWM region it acquired by calling owm_tpublish. Before the modifications are published, reads from other threads are not guaranteed to see consistent data. OWM is a weak memory model: it is the programmer's responsibility to take care of data consistency and dependences. In Figure 6.8, OWM subregion A is subscribed both to DF-thread A and to DF-thread B. When DF-thread B executes, the data it sees depends on whether owm_tpublish in DF-thread A has been executed: if it has, DF-thread B sees the modified version, otherwise it sees the old data. If DF-thread B must always see the updated data, the programmer should add an enforced dependence from DF-thread A to DF-thread B. Here is the detailed description of the OWM ISA:

void owm_tsubscribe(void *tid, int off, int offowm, int size, int mode);


Figure 6.8: Owner Writable Memory.

Subscribe the OWM subregion described by (offowm, size, mode) to be cached before executing the dataflow thread with thread id tid: offowm is the initial offset into the global OWM region, size is the size of the OWM subregion to be subscribed, and mode describes the access mode to the region (read-only, write-only or read-write). The pointer to the locally cached OWM region is stored in the DF-frame location described by (tid, off), where tid is the thread id and off is the offset in the thread's DF-frame.

void owm_tpublish(void *regptr, int size);

Publish the modifications to the OWM region described by (regptr, size). If size is 0, it publishes the region starting at regptr using the size that was registered during the owm_tsubscribe operation. This way, different threads can be subscribed to different segments of the same region using different sizes.
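
As a minimal illustration of how these two primitives are intended to be combined, the sketch below subscribes a subregion to a thread and publishes it after modification. The DF-frame layout (frame_t), the constants (OFF_OWM, SZ, OWM_RW) and the C-level prototypes of the TSTAR builtins are illustrative assumptions, not the actual generated code or ABI.

#include <stddef.h>

/* Assumed C-level bindings of the TSTAR/OWM builtins used below; the exact
   prototypes in the compiler backend may differ. */
extern void *tschedule(void (*func)(void), int sc, size_t frame_size);
extern void  tdecrease(void *tid);
extern void *tload(void);
extern void  tdestroy(void);
extern void  owm_tsubscribe(void *tid, int off, int offowm, int size, int mode);
extern void  owm_tpublish(void *regptr, int size);

/* Illustrative DF-frame layout and constants (assumptions). */
typedef struct { char *region; } frame_t;
enum { OFF_OWM = 0, SZ = 4096, OWM_RW = 2 };

void worker(void) {
  frame_t *fp = tload();            /* current DF-frame */
  char *region = fp->region;        /* pointer to the locally cached subregion */
  region[0] = 42;                   /* in-place modification */
  owm_tpublish(region, 0);          /* 0: use the size registered at subscribe time */
  tdestroy();
}

void spawn_worker(void) {
  void *tid = tschedule(worker, 1, sizeof(frame_t));
  /* Cache OWM[OFF_OWM : OFF_OWM+SZ] in read-write mode before `worker` runs;
     the cached pointer is written at offset 0 of its DF-frame. */
  owm_tsubscribe(tid, offsetof(frame_t, region), OFF_OWM, SZ, OWM_RW);
  tdecrease(tid);                   /* allow the thread to become ready */
}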

6.2.3 Expressiveness

OWM is integrated into the OpenStream compiler as a language extension. OpenStream is a highly expressive stream-computing extension to OpenMP 3.0 designed by Antoniu Pop Pop & Cohen [2011b]. One can use OpenStream to decompose programs into tasks and make the flow of data among them explicit, thus exposing data, task, and pipeline parallelism. The code in Figure 6.9 shows a simple example of expressing pipeline parallelism with OpenStream: x is defined as a stream, the first loop produces values to stream x, and the second loop consumes from this stream. For more information about OpenStream, interested readers may refer to Pop & Cohen [2013] and Pop & Cohen [2011b].

int x __attribute__((stream));

for (i = 0; i < N; ++i) {
#pragma omp task output (x)    // producer: writes one value to stream x
  x = ...;                     /* produce */
}

for (i = 0; i < N; ++i) {
#pragma omp task input (x)     // consumer: reads one value from stream x
  ... = x;                     /* consume */
}

Figure 6.9: Pipeline using OpenStream.

OWM extension to OpenStream The OWM extension extends OpenStream with the OWM memory model, thus reducing the communication overhead when complex data structures are involved. The extension is a simple cache pragma:

#pragma omp task cache (ACCESS_MODE: MEM[OFF:SIZE])

The cache pragma subscribes the task to the OWM subregion described by MEM[OFF:SIZE] with read (R), write (W) or read-write (RW) access mode. The current pragma supports only one-dimensional arrays, but it could easily be extended to multi-dimensional arrays.

The simple usage of the pragma is illustrated in Figure 6.10. tstar_owm_alloc allocates OWM memory of size N*N*sizeof(DATA). Task 1 writes to this OWM memory region and task 2 reads from it. Note that the two tasks are synchronized by stream sync: task 2 is only executed when task 1 finishes.

int sync __attribute__((stream));

DATA *A = tstar_owm_alloc(N * N * sizeof(DATA));

/* task 1: writes the OWM region. */
#pragma omp task cache(W: A[:N*N]) output(sync)
{
  for (i = 0; i < N * N; i++)
    A[i] = ...;               /* write the region */
}

/* task 2: reads the OWM region once task 1 has finished. */
#pragma omp task cache(R: A[:N*N]) input(sync)
{
  for (i = 0; i < N * N; i++)
    ... = A[i];               /* read the region */
}

Figure 6.10: OpenStream cache example.

6.2.4 Case Study: Matrix Multiplication

Matrix multiplication is a good example to illustrate the expressiveness of the OWM extension in practice. The example proceeds in three phases: in the first phase, one task allocates and initializes all the matrices in OWM memory; in the second phase, the matrices are partitioned into blocks, and each task caches the OWM subregions it needs, computes its results, and stores them into the output matrix; finally, a last task waits for the end of all previously created tasks, then prints and verifies the results. We explain each phase in detail below.

Matrix allocation and initialization. Figure 6.11 shows the code for matrix allocation and initialization. The input matrices A, B and the output matrix C are allocated by calling tstar_owm_alloc, and fill_matrix initializes all the matrices. The cache pragma subscribes matrices A, B and C in write mode; when fill_matrix executes, all the OWM subregions it subscribes to are ready for writing, and the modifications are published at the end of the task. Stream init is used to synchronize phase one and phase two, so that the computation can only start when the initialization finishes.

int init __attribute__((stream));

DATA *A = tstar_owm_alloc(N * N * sizeof(DATA));
DATA *B = tstar_owm_alloc(N * N * sizeof(DATA));
DATA *C = tstar_owm_alloc(N * N * sizeof(DATA));

#pragma omp task cache(W: A[:N*N], B[:N*N], C[:N*N]) output(init)
  fill_matrix(A, B, C, N);

Figure 6.11: First phase: Matrix allocation and initialization.

Matrix multiplication. The main computations are done in this phase; Figure 6.12 shows the code. The matrix is divided into blocks: each thread caches BLOCKSZ rows of matrix A and the entire matrix B in read mode, and BLOCKSZ rows of matrix C in write mode. When a thread executes, it computes A_{BLOCKSZ×N} × B_{N×N} = C_{BLOCKSZ×N}; at the end of each thread, the modifications to matrix C are published and thus available for reading by other threads. Each task created in this phase writes a single value to stream finish, which acts as a waiting barrier in the last task: that task waits for the termination of all threads created in this phase.

Output the results. Figure 6.13 shows the final thread, which waits for the termination of all the threads created in phase two. Once all the computations are done, it will output the results and do the verification if necessary. Stream finish acts as a barrier, waits for N/BLOCKSZ inputs from stream finish. Each thread created in phase two writes to stream finish once finished.

for (j = nb = 0; j < N; j += BLOCKSZ, ++nb) {
  int aoff = j * N;         /* first of the BLOCKSZ rows of A */
  int boff = 0;             /* entire matrix B */
  int coff = j * N;         /* first of the BLOCKSZ rows of C */

#pragma omp task cache(R: A[aoff:N*BLOCKSZ], B[boff:N*N]) \
                 cache(W: C[coff:N*BLOCKSZ]) output(finish)
  {
    for (int jj = j; jj < j + BLOCKSZ; ++jj)
      for (int k = 0; k < N; ++k)
        for (int l = 0; l < N; ++l)
          C[jj * N + k] += A[jj * N + l] * B[l * N + k];
    finish = 1;             /* signal completion on the barrier stream */
  }
}

Figure 6.12: Second phase: Matrix multiplication.

#pragma omp task cache(R: A[:N*N], B[:N*N], C[:N*N]) \
                 input(finish >> final_view[N/BLOCKSZ])
{
  dump_result_and_verify(A, B, C, N);
}

Figure 6.13: Third phase: Output the results.

6.2.5 Conclusion and perspective about OWM

The validation of the OWM memory model is presented in Chapter 7, where we study four benchmarks with OWM support (matrix multiplication, sparse LU, Gauss Seidel and Viola Jones); those benchmarks are validated with a system-level simulator. In this chapter, we stress a few points:

• The OWM memory model is a loosely coupled memory model. Compared to word-based cache coherence, the protocol is largely simplified, under the assumption that users synchronize all the tasks that access the same OWM subregion in order to preserve the atomicity of ownership. There is usually a trade-off between programmability and flexibility: we shift complexity from the hardware design to the user, but at the same time provide a compilation tool chain to simplify this procedure.

• The OWM extension to OpenStream provides easy-to-use compilation support. As presented in the matrix multiplication case study, the OWM extension can easily be integrated into dataflow programs, and the user can use OpenStream to synchronize between tasks. In Chapter 7, we present another use case where dynamic memory allocation can also be replaced with the OWM extension with little effort. We have also integrated our backend support into the OpenStream compiler: the lowered builtin functions are translated directly to the TSTAR ISA and linked with part of the OpenStream library (the runtime related to streaming operations) and with part of the runtime support in our simulator. The user can simply write OpenStream programs and leave the rest to the compilation support, which automatically targets our multithreaded dataflow architecture.

• The OWM extension could easily be extended to support multi-dimensional arrays. One limitation of the current OWM support is the lack of support for multi-dimensional arrays: in the implementation of benchmarks where two-dimensional arrays are used, we usually have to remap the memory regions as a single-dimensional array, which may have extra cost. But the OWM extension could easily be extended; an abstract polyhedral representation could be used to represent an OWM region in the multi-dimensional case.

6.3 Summary

To complement the thread partitioning algorithm proposed in Chapter 5, we presented two ways of handling complex data structures in this chapter: SCMD decouples computations and memory accesses at the finest granularity, exploiting as much parallelism as possible, and the OWM memory model reduces the overhead when complex data structures are exchanged across threads. For expressiveness, and as part of our compilation tool chain, we extend OpenStream with the OWM extension, so the programmer can easily write dataflow programs with OWM support.

Chapter 7

Simulation on Many Nodes

Thread partitioning is a challenging task in developing a compiler for a multithreaded dataflow architecture, and measuring performance metrics on that architecture plays an equally important role. Before the real hardware processor is built, we need a flexible methodology to analyze the behavior of the proposed architecture. By performing simulations and analyzing the results with a full-system simulator, we can gain a thorough understanding of how the proposed architecture behaves and how to improve it, and validate the results before it goes into the production cycle. Our focus in this chapter is not the precise timing model of the simulation, but the capability of simulating interesting benchmarks on thousands of cores (multiple nodes) targeting the TSTAR architecture. We have rewritten and adapted a few interesting benchmarks for the TSTAR architecture and simulated them with manycore configurations, showing their potential scalability once a precise timing model is available. Another focus in this chapter is the resource requirements of many-node simulation. Multiple-node simulation of parallel programs requires more resources than single-node simulation and sequential execution; unless precautions are taken, programs with tremendous parallelism or a large number of nodes will saturate, and even deadlock, a host machine of reasonable size. As one of our contributions to the simulator, we analyze the resource requirements on the host and guest machines and propose solutions that largely reduce the memory usage on both. The solutions are implemented and integrated in the COTSon simulator. This chapter is organized as follows. Section 7.1 gives a short introduction to the COTSon simulation framework. Section 7.2 presents the many-node simulation scenario. Section 7.3 presents the memory usage optimization and thread throttling for resource management. Section 7.4 presents our benchmarks and the results of the experimental validation.

7.1 Introduction

Figure 7.1 sketches the structure of the COTSon architecture. COTSon uses AMD's SimNow simulator for the functional simulation of each node in the cluster. For the timing simulation, COTSon attaches a specific timing model to each component (i.e., cores, caches, memory, disks and network interfaces) of the target architecture.

Figure 7.1: COTSon Architecture.

The support for a many-node simulation is given by interconnecting a large number of COTSon nodes. Each node is composed of a certain number of cores with their hardware components, and nodes are connected to each other through their network interfaces. When a particular application needs to communicate using the network, it executes some code that eventually reaches the NIC. This produces a NIC event that reaches the NIC synchronous timing model; the response time from the timing model is then used by the functional simulator to schedule the emission of the packet into the external world. The packet is sent to an external entity called the network mediator.

7.2 Multiple Nodes Dataflow Simulation


Figure 7.2: Multiple nodes simulation on COTSon.

Figure 7.2 shows the multiple-node simulation structure on COTSon. It consists of several components: the host machine, the guest machines, and the COTSon nodes. The host machine is where the COTSon instances run. COTSon supports multiple-node simulation by allowing multiple instances of COTSon; the communication and synchronization of the instances go through the mediator. The guest machine is the machine (both hardware and operating system level) that is simulated by a COTSon instance. We create one worker for each CPU within the guest machine; each worker polls the centralized task queue for ready tasks. At the execution of each task, the TSTAR instructions are trapped by COTSon for functional simulation. In Figure 7.2, when task1 in worker 4 (COTSon node 1) executes TCREATE and TCACHE, they are trapped by COTSon, which calls the registered functions tcreate and tcache on the COTSon node on which the guest machine is simulated. We describe the TSTAR instructions in the COTSon infrastructure below.
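
A minimal sketch of this per-CPU worker loop is shown below; the names (ready_queue_pop, run_task, df_frame_t) are illustrative assumptions, not the actual COTSon runtime API.

#include <stddef.h>

/* Illustrative DF-frame handle; the real frame layout lives in shared memory. */
typedef struct df_frame df_frame_t;
extern df_frame_t *ready_queue_pop(void);   /* assumed: blocks until a task is ready */
extern void        run_task(df_frame_t *);  /* assumed: executes the task body; its
                                               TSTAR instructions are trapped by COTSon */

/* One worker per simulated CPU in the guest machine. */
void *worker_main(void *arg) {
  (void)arg;
  for (;;) {
    df_frame_t *fp = ready_queue_pop();     /* poll the centralized task queue */
    if (fp == NULL)
      break;                                /* simulation shutting down */
    run_task(fp);                           /* ends with TDESTROY, freeing the frame */
  }
  return NULL;
}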

TCREATE. TCREATE is trapped by COTSon into the functional simulation, and the registered function tcreate is then called (Figure 7.2, steps 1 and 2). It tries to allocate a new DF-frame for the new DF-thread in shared memory. If the allocation is successful, the new identifier of the DF-frame (TID1 in this case) is returned as the result of executing the TCREATE instruction. The DF-frames in shared memory are shared by all COTSon processes and protected with locks.

TCACHE. TCACHE is used to cache a remote frame locally. It is trapped by the functional simulation, and the registered function tcache is then called, with the DF-frame id passed along. In step 2, it looks up TID3 in the shared DF-Frames; if it is found, the entire DF-frame is copied from host to guest (more precisely, from the shared memory to the local heap of this worker thread), and the pointer to the local copy is finally returned to TCACHE (step 5). The task can then directly read and modify this DF-frame. When tdestroy is called, the modifications are synchronized and become visible to other tasks/nodes.
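
The sketch below illustrates, under assumed names (frame_table_lookup, tls_alloc, tls_map_find, tls_map_insert), what the registered tcache function does on the host side: look up the shared frame, copy it into the worker's TLS heap, and record the mapping so later calls return the cached copy.

#include <stdint.h>
#include <string.h>

/* Illustrative frame layout: size is the total frame size in bytes. */
typedef struct { uint64_t tid; int sc; size_t size; char data[]; } df_frame_t;

/* Assumed helpers: shared-memory lookup, TLS heap allocation, tid -> cached-copy map. */
extern df_frame_t *frame_table_lookup(uint64_t tid);      /* shared DF-Frames (locked) */
extern void       *tls_alloc(size_t size);                /* worker-local heap */
extern void       *tls_map_find(uint64_t tid);
extern void        tls_map_insert(uint64_t tid, void *copy);

/* Functional-simulation handler for TCACHE: returns a guest pointer to a
   local copy of the DF-frame identified by tid. */
void *tcache_handler(uint64_t tid) {
  void *cached = tls_map_find(tid);          /* already cached in this task? */
  if (cached)
    return cached;
  df_frame_t *shared = frame_table_lookup(tid);
  if (!shared)
    return NULL;                             /* unknown thread id */
  void *copy = tls_alloc(shared->size);      /* reserve space in the TLS heap */
  memcpy(copy, shared, shared->size);        /* copy shared memory -> local heap */
  tls_map_insert(tid, copy);                 /* a second tcache/tload reuses it */
  return copy;
}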

TLOAD. TLOAD is a shortcut for TCACHE applied to the current thread. It is trapped by the functional simulation, and the registered function tload is then called. The current thread id, stored in thread-local storage, is used to locate the current DF-frame in the shared DF-Frames; if it is found, the DF-frame is copied from host to guest, and the pointer to the local copy is returned to TLOAD. Another difference between TLOAD and TCACHE is that the frame loaded by TLOAD is read-only: the data stored in the DF-frame is needed by the computations in the current thread.

TDECREASE. TDECREASE decrements the synchronization counter (SC) of the target thread, designated by its thread id, by n, either at the time it is called (eager tdecrease) or upon termination of the current thread (lazy tdecrease, when TDESTROY is called). It is trapped by the functional simulation, and the registered function tdecrease is called. In eager tdecrease, the target DF-frame id and n are passed along with TDECREASE; the handler looks up the target DF-frame and, once it is found, decreases its SC by n. The SC is checked after the decrement: if it reaches zero, the frame is moved to the ready queue. In lazy tdecrease, the TDECREASE instruction is cached.
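
A minimal sketch of the eager path, under the same assumed helper names as the previous sketch (frame_table_lookup, ready_queue_push), is shown below; it is illustrative only and not the actual COTSon handler.

#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t tid; int sc; size_t size; char data[]; } df_frame_t;

extern df_frame_t *frame_table_lookup(uint64_t tid);  /* shared DF-Frames */
extern void        ready_queue_push(df_frame_t *fp);  /* centralized task queue */
extern void        frame_lock(df_frame_t *fp);
extern void        frame_unlock(df_frame_t *fp);

/* Functional-simulation handler for eager TDECREASE. */
void tdecrease_handler(uint64_t target_tid, int n) {
  df_frame_t *fp = frame_table_lookup(target_tid);
  if (!fp)
    return;                          /* unknown thread id */
  frame_lock(fp);                    /* frames are shared and lock-protected */
  fp->sc -= n;                       /* decrement the synchronization counter */
  int ready = (fp->sc == 0);
  frame_unlock(fp);
  if (ready)
    ready_queue_push(fp);            /* the DF-thread becomes schedulable */
}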

TDESTROY. TDESTROY is trapped by the functional simulation, which calls the registered function tdestroy. tdestroy terminates the current thread and deallocates its DF-frame in the shared DF-Frames. When running in lazy mode, it aggregates and executes the cached instructions (e.g., several TDECREASEs to the same thread are aggregated into a single TDECREASE) before deallocation.

7.3 Resource Usage Optimization

We consider two approaches to reduce resource usage at runtime. The first optimizes memory usage irrespective of the number of concurrently active threads; the second aims at throttling task creation in highly parallel applications.

7.3.1 Memory Usage Optimization

The TLS corresponding to each worker in the guest machine is a chunk of preallocated heap memory. When TCACHE or TLOAD is called, the allocator allocates memory from this TLS heap before the DF-frame is copied, and the heap memory is released at the end of the thread. Consider the code example in Figure 7.3:

/* A task example */
void task()
S0 {
S1   current_fp = tload ();
S2   i1 = current_fp->i1;
S3   consumer_tid = current_fp->consumer_tid;
S4   consumer_fp = tcache(consumer_tid);
S5   i2 = consumer_fp->i2;
S6   consumer_fp->result = i1 + i2;
S7   tdecrease(consumer_fp);
S8   tdestroy ();
   }

Figure 7.3: Code example for simple frame memory allocator

Figure 7.4 shows how the simple allocator works for this code example. Each worker is initialized with its TLS heap memory for caching DF-Frames locally. In this example, cp points to the start of the heap (0x80000000). When tload is called (S1), the current DF-Frame is cached locally (copied from shared memory to the TLS heap memory) and cp advances by the size of this DF-Frame. In the meantime, the mapping between the allocated TLS DF-Frame and the thread id is inserted, so that another call to tcache/tload in the same thread returns the cached DF-Frame directly. When tcache is called (S4), the DF-Frame is cached locally as consumer_fp, cp advances by the size of the consumer DF-Frame, and the mapping between the TLS copy and the shared DF-Frame is inserted. When tdestroy is called, the local TLS heap memory is freed by resetting cp to the beginning of the heap (0x80000000).

Figure 7.4: Simple frame memory allocator

The simple allocator is simple yet efficient: it operates on a per-task basis, and at the end of each task the TLS memory is freed. It satisfies our needs most of the time. However, control tasks involve more calls to tcache. A control task takes care of task creation, initialization and dependence resolution; during dependence resolution, the consumer's thread id needs to be written to its producer after both are created. E.g., in Figure 7.5, the control task creates producer and consumer tasks and registers the consumer thread id in the producer thread's DF-Frame. In order to write to a thread, the control task has to cache that thread's frame locally and then operate on it, so tcache is called NUM times for different thread ids. When NUM increases, the preallocated heap eventually raises an out-of-memory error.

/* A task example */
void control_task()
S0 {
     for (i = 0; i < NUM; i++) {
S1     producer_tid = tschedule(task_p, 1, sz);
S2     consumer_tid = tschedule(task_c, 1, sz);
S3     fp = tcache(producer_tid);
S4     fp->consumer = consumer_tid;
S5     tdecrease(producer_tid);
     }
S6   tdestroy ();
   }

Figure 7.5: Code example for simple frame memory allocator

In order to solve this memory consumption problem in the guest machine, we need to consider the following question:

• Where should the allocator for the TLS heap memory be implemented? One obvious option is to implement the allocator in guest space, since the TLS heap memory resides in guest memory. We could add a builtin function tls_alloc before tcache/tload is called and pass the returned TLS frame pointer to tcache as a new argument for caching the frame locally; at the point where the cached frame is no longer used, a new builtin function tls_free would be inserted to free the allocated memory. But this method puts too much pressure on either the compiler or the programmer, who must decide where tls_alloc and tls_free should be inserted. Another option is to manipulate the TLS heap memory from host space: tcache or tload mark the exact allocation point, and they are trapped by the functional simulation anyway. The allocator can then allocate the memory needed and return the guest-space address. Still, tls_free needs to be inserted when the cached frame is no longer in use; it can be inserted by the compiler at the point where the cached frame is last used, at the cost of one more instruction, or merged directly into tdecrease.

Figure 7.6: Host space TLS frame memory allocator

Figure 7.6 shows how the code in Figure 7.5 works with the new allocator. The new allocator is implemented in host space and manipulates the TLS heap memory. The free list stores a vector of currently available TLS heap regions, where ADDRESS and SIZE mark the start and the size of each free region; the allocated list stores the allocated TLS heap regions, again described by their ADDRESS and SIZE. At the start of the worker, the free list is initialized with the entire TLS heap memory and the allocated list is empty (S0). In the first iteration, tcache (S3) is trapped by the functional simulation and the new allocator is used to allocate the new TLS DF-Frame: the allocator looks in the free list for an exactly matching free memory region; if no exact match exists, it finds the closest match (in this case, 0x1024), splits that region, and moves the allocated part into the allocated list. When tdecrease (S5) is called, the allocator frees the cached TLS DF-Frame and pushes it back onto the free list. In the next iteration, since the cached size is always the same, the exact match applies to the (ADDRESS, SIZE) tuple (0x80000000, 0x256) in the free list, and the region is again freed and pushed back when tdecrease (S5) is called. In this special case, the memory allocator keeps working for any number NUM of iterations.
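
The following is a minimal sketch of this exact-match/closest-match policy; the function names (tls_frame_alloc, tls_frame_free), the fixed-size containers and the guest-address representation are illustrative assumptions, and error handling is omitted.

#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t addr; size_t size; } region_t;

#define MAX_REGIONS 1024
static region_t free_list[MAX_REGIONS];   static size_t n_free;
static region_t alloc_list[MAX_REGIONS];  static size_t n_alloc;

/* Allocate `size` bytes of guest TLS heap: prefer an exact match, otherwise
   split the closest (smallest sufficient) free region. Returns 0 on failure. */
uint64_t tls_frame_alloc(size_t size) {
  size_t best = n_free;
  for (size_t i = 0; i < n_free; i++) {
    if (free_list[i].size == size) { best = i; break; }          /* exact match */
    if (free_list[i].size > size &&
        (best == n_free || free_list[i].size < free_list[best].size))
      best = i;                                                  /* closest match */
  }
  if (best == n_free)
    return 0;                                                    /* out of memory */
  uint64_t addr = free_list[best].addr;
  if (free_list[best].size == size)
    free_list[best] = free_list[--n_free];                       /* remove entry */
  else {
    free_list[best].addr += size;                                /* split region */
    free_list[best].size -= size;
  }
  alloc_list[n_alloc++] = (region_t){ addr, size };
  return addr;
}

/* Free: move the region back from the allocated list to the free list. */
void tls_frame_free(uint64_t addr) {
  for (size_t i = 0; i < n_alloc; i++)
    if (alloc_list[i].addr == addr) {
      free_list[n_free++] = alloc_list[i];
      alloc_list[i] = alloc_list[--n_alloc];
      return;
    }
}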

7.3.2 Throttling

Parallel execution of programs requires more resources and more complex resource management than sequential execution. Unless precautions are taken, programs with tremendous parallelism will saturate, and even deadlock, a machine of reasonable size. This resource problem arises in any system that allows dynamic generation of concurrent tasks, especially in dataflow architectures where parallelism is well exploited. Ideally, enough parallelism should be exposed to fully utilize the machine on which the program is executing, while minimizing the resource requirements of the program. A particularly nasty case occurs in certain loops; consider the example in Figure 7.7. The loop L0 in the control task executes very rapidly and all the tasks created within it are started at about the same time, so the memory usage is proportional to N, the number of iterations. Disaster! Controlling the parallelism with throttling is necessary in this case. Culler & Arvind [1988] proposed the K-bounded loop technique as a dynamic software solution: the compiler analyzes the code and determines the maximum store usage for a loop cycle, and at runtime the hardware decides how many loop cycles are allowed to execute in parallel from the activity level of the machine and the static information about maximum store usage per cycle.

void control_task()
L0 {
     for (i = 0; i < N; i++) {
       /* creates one task per iteration: frame memory usage grows with N */
       tschedule(task, 1, sz);
       ...
     }
   }

Figure 7.7: Excessive parallelism.

The problem with K-bounded loops is that it is not a general solution (e.g., it cannot handle recursion), and there is the overhead of the extra instructions inserted to control the loops. Our solution is a hardware solution targeting the resource limitation on frame memory, but it should also apply to other types of resources. Frame memory is used each time a new thread is created (where tschedule is executed) and is freed at the end of the thread when tdestroy is executed. We mark a thread that is going to create other tasks, whose resource usage will always increase once it is executed, as greedy, and a thread that does not create tasks as generous. Once a generous thread is executed, it is guaranteed to release its resources in the end and thus to increase the total available resources in the system. Our heuristic is simple: each time tschedule is executed, it checks the availability of the resources needed to allocate the thread; if they are available, it allocates them and returns the allocated DF-Frame address. If not, it tries to schedule a ready task marked as generous and executes it right away. Once the resources occupied by this generous thread are released, it checks the availability again and keeps scheduling generous threads until enough resources have been released. If there are no threads marked as generous in the ready queue, executing greedy threads would only increase the resource demands; in this case, the program raises an exception for insufficient resources and exits.
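
A compact sketch of this heuristic is given below. The helper names (frame_alloc, pop_ready_generous, execute_task) and the failure behavior are illustrative assumptions; the real mechanism is implemented in the simulated hardware.

#include <stddef.h>
#include <stdlib.h>

typedef struct df_frame df_frame_t;

extern df_frame_t *frame_alloc(size_t frame_size);      /* NULL if frame memory is exhausted */
extern df_frame_t *pop_ready_generous(void);             /* ready task that creates no tasks */
extern void        execute_task(df_frame_t *fp);         /* runs it; frees its frame at TDESTROY */

/* tschedule with throttling: on resource shortage, run generous tasks to
   release frame memory before retrying the allocation. */
df_frame_t *tschedule_throttled(size_t frame_size) {
  for (;;) {
    df_frame_t *fp = frame_alloc(frame_size);
    if (fp != NULL)
      return fp;                                         /* enough resources: done */
    df_frame_t *generous = pop_ready_generous();
    if (generous == NULL) {
      /* Only greedy tasks are ready: executing them would raise demand further. */
      exit(EXIT_FAILURE);                                /* insufficient resources */
    }
    execute_task(generous);                              /* frees resources on completion */
  }
}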

7.4 Experimental Validation

We conducted multiple-node simulations on five benchmarks: Fibonacci, Gauss Seidel, Matrix Multiplication, Sparse LU and Viola Jones. The benchmarks have been implemented in two different flavors. One flavor writes programs with the low-level TSTAR instruction set directly (Fibonacci and Matrix Multiplication); the other uses OpenStream and compiler support to express dataflow parallelism, and has been used for the more complex benchmarks (Gauss Seidel, Sparse LU and Viola Jones). The multi-node implementations of the latter benchmarks use the OpenStream extension for OWM. The runtime support library for OpenStream (to match dependences over streams) is integrated into the COTSon runtime.

7.4.1 Experimental Settings

The experimental evaluation involved the work and support of several collaborators, including Dr. Paolo Faraboschi from HP Labs, who led the design and implementation of the TSTAR extension of COTSon, and the group of Prof. Roberto Giorgi at the University of Siena, who provided the hardware platform and support for running the simulations.

Software stack We use the DF-proxies branch of the OpenStream compiler, in which we integrated our TSTAR backend implementation and OWM support. The simulated architecture uses SimNow version 4.6.2 and the most recent version of COTSon with support for the TSTAR architecture (the TSUF branch). Our resource usage optimization implementation is also integrated in TSTAR.

Hardware stack The host machine for simulation is an HP ProLiant DL585 G7 based on the AMD Opteron 6200 Series, with 64 CPUs and 1 TB of RAM. The guest machine is configured using SimNow with varying numbers of CPUs and RAM sizes, and runs Ubuntu 9.10.

Applications We have chosen five benchmarks for evaluation: Fibonacci, Gauss Seidel, Matrix Multiplication, Sparse LU and Viola Jones. Except for Fibonacci, all the benchmarks are integrated with OWM support, which is used to reduce the communication overhead and enable multiple-node support.

• Fibonacci is a simple benchmark which recursively computes the nth Fibonacci number. The dataflow version spawns tasks recursively.

• Gauss Seidel is an iterative method widely used to solve a linear system of equations in numerical linear algebra.

• Matrix Multiplication is a well-known algorithm that computes the product of two matrices.

• Sparse LU factors a matrix as the product of a lower triangular matrix and an upper triangular matrix.

• Viola Jones is an object detection framework proposed by Paul Viola and Michael Jones to provide competitive object detection rates in real-time.

7.4.2 Experimental Results

Figure 7.8 shows the results with functional simulation. Our focus in this chapter is not the precise timing model, but the capability of simulating interesting benchmarks on multiple-node configurations targeting the TSTAR architecture with OWM support. All benchmarks are written in OpenStream with the OWM extension. The simulation covers multiple node configurations: each shared-memory node runs 4 cores, so in the 128-core configuration we have 32 nodes in action. Each node is a COTSon instance simulating a guest machine with our selected operating system, and the timing of all nodes is synchronized via COTSon's network mediator. In terms of application-level synchronization, one node is selected as the master node and the other nodes act as slave nodes. The master node initializes all the resources shared between nodes (e.g., the shared task queue) and waits for the slave nodes to finish their initialization. Each worker then polls the centralized task queue and schedules a task when it is ready.

[Plot: multiple-node simulation on COTSon; speedup vs. number of cores (number of nodes × 4 cores/node), from 4 to 128 cores, for Sparse LU, Gauss Seidel, Matrix Multiplication, Viola Jones and Fibonacci.]

Figure 7.8: Simulation Results

The functional simulation reflects the parallelism in each benchmark (Fibonacci > Viola Jones > Matrix Multiplication > Gauss Seidel > Sparse LU). The non-linear speedup shows the potential overhead in the compilation and simulation environment. The overhead mainly comes from the control program, which takes care of creating all the tasks and resolving dependences; OpenStream manages stream data structures through shared memory, so we have to constrain all the control programs to be scheduled on the same node. Figure 7.9 shows the simulation results of Viola Jones with up to 1024 cores (tuned for the number of cores): we tuned the results by adapting the scale parameter of this benchmark so that, in each configuration, there are enough tasks for each worker. The simulation results show that the system is fully functional in manycore simulation and exposes good parallelism opportunities.

[Plot: multiple-node simulation of Viola Jones on COTSon (tuned); speedup vs. number of cores, rising from 3.52 at 4 cores to 681.34 at 1024 cores (7.25, 14.10, 25.57, 47.10, 92.52, 170.12, 352.55 at 8 to 512 cores).]

Figure 7.9: Simulation Results

[Plot: multiple-node simulation of Viola Jones on COTSon (scale = 1.01); speedup vs. number of cores, rising from 3.52 at 4 cores to 47.10 at 64 cores and saturating around 72 from 128 cores onward.]

Figure 7.10: Simulation Results

Figure 7.10 shows the simulation results of Viola Jones with up to 1024 cores with the scale parameter fixed (scale = 1.01). The problem size of the benchmark limits the parallelism in the simulation: from 128 cores onward, the speedup stops increasing, because the total number of tasks is not sufficient to keep the large number of workers busy. We provide a detailed analysis of each benchmark in the following subsections (Matrix Multiplication is skipped since it was presented in Chapter 6).

7.4.2.1 Gauss Seidel

Figure 7.11 shows the dependence details of Gauss Seidel. In this figure, (A) shows the dependences within the same iteration and (B) shows the cross-iteration dependences; we go through both in detail. Every square represents a block of the Seidel computation: the red lines represent dependences within the same iteration, and the dashed black lines represent cross-iteration dependences.

Figure 7.11: Dependences of Gauss Seidel: A. same-iteration dependences (left); B. cross-iteration dependences (right)

Take the block in the center as an example (the light grey square in the left figure): the computation of this block depends on its left neighbor block and its top neighbor block. Note, however, that it does not depend on all the values of the top and left blocks, only on the parts marked by the green rectangles, so we can pass just the relevant data in dataflow streams and avoid the unnecessary cost of passing the entire blocks. The dependences from the center block to its right and bottom neighbor blocks must also be preserved: the center block needs the old values of those two blocks, before they are recomputed in this iteration (marked by the grey rectangles). Figure 7.12 shows our OpenStream implementation of the dependences of this center block. In this code, sbottom_ref, stop_ref, sright_ref, sleft_ref and scenter_ref are arrays of streams used for the computation; e.g., in the input clauses, stop_ref[green_rect_t] is the stream carrying the value computed by the top neighbor (green rectangle) and sbottom_ref[grey_rect_b] is the stream carrying the value from the bottom neighbor (grey rectangle).

The input and output streams must be correctly indexed so that the computation gets the correct values: the top and left rectangles must come from the computation of the current iteration, while the right and bottom rectangles must come from the computation of the previous iteration. In the output clauses, once the computation is done and the value of the center block has been updated for this iteration, the borders of the center block are sent to the output streams. gauss_seidel_df computes the center block from all its inputs and sends the results to the output streams.

/* for each block */
#pragma omp task \
    input(sbottom_ref[grey_rect_b] >> top_in[block_size], \
          stop_ref[green_rect_t] >> bottom_in[block_size], \
          sright_ref[grey_rect_r] >> left_in[block_size], \
          sleft_ref[green_rect_l] >> right_in[block_size], \
          scenter_ref[lgrey] >> center_in[block_size*block_size]) \
    output(stop_ref[next_t] << top_out[block_size], \
           sbottom_ref[next_b] << bottom_out[block_size], \
           sleft_ref[next_l] << left_out[block_size], \
           sright_ref[next_r] << right_out[block_size], \
           scenter_ref[next_c] << center_out[block_size*block_size])
{
  gauss_seidel_df(blocks_x, blocks_y, block_size, numiters, it, id_x, id_y,
                  left_in, top_in, bottom_in, right_in, center_in,
                  left_out, top_out, bottom_out, right_out, center_out);
}

Figure 7.12: Gauss seidel implementation with OpenStream.

The dependence pattern of Gauss Seidel does not incur communication overhead, since we only pass the relevant data in streams. Still, we can use the OWM memory model to further reduce communication by storing the matrix in OWM: the initialization tasks and the terminal tasks can then write directly to this matrix instead of communicating via streams. Figure 7.13 shows the terminal task that directly caches the relevant block and writes its result to the matrix. In this figure, matrix is stored in OWM memory, and the task caches the relevant block matrix[matrix_offset:block_elements], where matrix_offset is the offset of this block and block_elements is its size. It is cached in write mode, and gauss_seidel_df_finish writes the final result of this block directly to matrix. Note that the final stream acts as a barrier stream: it waits until all the computations are done, so that the final task can read the results from this matrix, as shown in Figure 7.14.

/* for each block */
#pragma omp task \
    input(sleft_ref[numiters*blocks+id] >> left_in[block_size], \
          stop_ref[numiters*blocks+id] >> top_in[block_size], \
          scenter_ref[numiters*blocks+id] >> center_in[block_elements]) \
    cache(W: matrix[matrix_offset:block_elements]) output(final_stream)
{
  gauss_seidel_df_finish(&matrix[matrix_offset], id_x, id_y, N, block_size,
                         left_in, top_in, NULL, NULL, center_in);
}

Figure 7.13: Gauss seidel implementation with OWM support.

In Figure 7.14, the final stream waits for all terminal tasks to finish before this last task is executed. Currently, the OWM interface supports only one-dimensional arrays, so in this Seidel implementation we need an extra step that transforms the data back into a two-dimensional array. This step will be eliminated once multi-dimensional array support is added to the OWM interface.

#pragma omp task input(final_stream >> final_view[blocks_x * blocks_y]) \
                 cache(R: matrix[:N*N])
{ /* use the results matrix. */ }

Figure 7.14: Gauss seidel implementation with OWM support (final task).
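
As a small illustration of the remapping step mentioned above, the helper below flattens a (row, column) coordinate into an offset in the one-dimensional OWM array; the function name and signature are illustrative, not part of the OWM interface.

#include <stddef.h>

/* Row-major remapping of a 2D element index into the flat OWM region. */
static inline size_t owm_offset(size_t row, size_t col, size_t ncols) {
  return row * ncols + col;
}

/* Example: access element (i, j) of an N x N matrix stored in OWM,
   i.e. matrix[owm_offset(i, j, N)] instead of matrix[i][j]. */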

7.4.2.2 Viola Jones

Viola Jones is an object detection framework, proposed by Paul Viola and Michael Jones Viola & Jones [2001], that provides competitive object detection rates in real time. The first version we used was implemented by Daniel Gracia Perez from Thales Research and Technology; we integrated OWM support into this implementation so that it can run on multi-node instances.

Figure 7.15 shows the OpenStream version of the computation-intensive part of Viola Jones (some parts are omitted in this display example for simplicity); runStageOnTile does the main computation work. The parallelism in the code is straightforward: in both phases, we could either parallelize the outermost loop (scaleIndex) or, for finer granularity, the inner loops (irow or icol). The dependence between phase one and phase two is synchronized via the array of streams sync, and in the second phase the computed information (centerX, centerY and other data) is passed to the final task via dataflow streams.

for (scaleIndex = 0; scaleIndex < scaleIndexMax; scaleIndex++) {
  /* Phase one: Detection. */
  #pragma omp task output(sync[scaleIndex])
  for (irow = 0; irow < nTileRows; irow += rowStep) {
    for (icol = 0; icol < nTileCols; icol += colStep) {
      for (iStage = 0; iStage < nStages; iStage++) {
        if (goodPoints[irow * width + icol]) {
          // Update goodPoints according to detection threshold
          if (runStageOnTile(imgInt_f, imgSqInt_f, irow, icol,
                             tileHeight, tileWidth, height, width,
                             &(cascade->stageClassifier[iStage]),
                             scaleFactor, scale_correction_factor)) {
            goodPoints[irow * width + icol] = 0;
          }}}}}

  /* Phase two: Determine the position of the object detection. */
  #pragma omp task input(sync[scaleIndex]) \
      output(centerX_stream << m_centerX[NB_MAX_DETECTION]) \
      output(centerY_stream << m_centerY[NB_MAX_DETECTION]) \
      ...
  for (irow2 = 0; irow2 < gpRows; irow2++) {
    for (icol2 = 0; icol2 < gpCols; icol2++) {
      if (goodPoints[irow2][icol2]) {
        // Calculation of the Center of the detection
        centerX = (tileWidth - 1) * 0.5 + (icol2 * colStep);
        centerY = (tileHeight - 1) * 0.5 + (irow2 * rowStep);
      } } } }

Figure 7.15: Viola Jones OpenStream kernel.

Viola Jones accesses large amounts of shared data, most of which are complex data structures and read-only. How convenient it is to integrate OWM into this benchmark is therefore an important and interesting question. Figure 7.16 shows the data structure of cascade, one of many complex data structures in Viola Jones. The structure of CvHaarClassifierCascade goes through several levels of indirect references; the cascade is dynamically allocated from heap memory, initialized once, and used read-only throughout the entire program.

/* cascade or tree of stage classifiers */
typedef struct CvHaarClassifierCascade {
  int count;                /* number of stages */
  int orig_window_sizeR;
  int orig_window_sizeC;
  /* array of stage classifiers */
  CvHaarStageClassifier* stageClassifier;
} CvHaarClassifierCascade;

Figure 7.16: Cascade data structure.

In order to simplify the usage of OWM memory in this scenario, we provide customized replacements for malloc and free: one can link malloc and free in the program to the OWM version without modifying the allocation and deallocation of those data structures. Figure 7.17 shows this solution. We cache the OWM memory pool (owm_mem_pool) used for allocation in the first task; loadImage and readClassifCascade are the initializers for the image and the cascade, and malloc and free are linked to our OWM version without any modification of the code. The returned pointers into the OWM global memory address space are stored and passed between tasks through firstprivate. The firstprivate clause is implicit by default in OpenStream; we emphasize the point by making it explicit. As the final code shows, using OWM memory as a globally addressable, read-only memory is very easy, even in the case of dynamic allocation.

CvHaarClassifierCascade* cascade = NULL;
CvHaarClassifierCascade** cascade_p = &cascade;
Image* pgm = NULL;
Image** pgm_p = &pgm;

#pragma omp task output(sync_stream) \
    cache(W: owm_mem_pool[:MAX_POOL_SIZE])
{
  *pgm_p = loadImage((char*)imgName);
  *cascade_p = readClassifCascade(haarFileName);
}

#pragma omp task input(sync_stream) \
    firstprivate(cascade_p) firstprivate(pgm_p) \
    cache(R: owm_mem_pool[:MAX_POOL_SIZE])
{
  CvHaarClassifierCascade* cascade = *cascade_p;
  Image* pgm = *pgm_p;
  /* Usage of cascade and pgm. */
}

Figure 7.17: Dynamic OWM memory allocation.

7.4.2.3 Sparse LU

The dependence pattern of Sparse LU is more restrictive in terms of parallelism exploitation, as shown in Figure 7.18. Each block in the graph represents a read+write block. The lines with arrows represent dependences between blocks: both the source and the destination block are read by the corresponding function, and when the computation is done, the results are written to the destination block. E.g., for the red dotted line, the destination block is computed by dest = bdiv(src, dest), and for the green solid line, the destination block is computed by dest = bmod(src1, src2, dest). Figure 7.18A shows the dependences when the outer loop iteration equals 0 (k = 0) and the inner loops execute serially; Figure 7.18B shows the dependences when the outer loop iteration equals 1 (k = 1). Due to the dependence pattern, the next iteration of the outer loop cannot start until the previous iteration finishes, but the computation within each iteration can be executed following the dataflow dependences. Part of the implementation of the dependences represented by the red line in this figure is presented in Figure 7.19.

[Figure: block dependence graphs for k = 0 (left) and k = 1 (right); legend: lu_block, fwd, bdiv, bmod.]

Figure 7.18: Dependence patterns of Sparse LU: A. k = 0; B. k = 1.

for (k = 0; k < num_blocks; ++k) {
  for (i = k + 1; i < num_blocks; ++i) {
    if ((*fill_flags)[i][k] != 0) {
      #pragma omp task \
          input(streams[i*num_blocks+k] >> iview[bds]) \
          output(streams[i*num_blocks+k] << oview[bds]) \
          peek(streams[k*num_blocks+k] >> iview2[bds])
      {
        bdiv(block_size, iview2, iview);
        memcpy(oview, iview, bds * sizeof(double));
      }
    }}}

Figure 7.19: Sparse LU OpenStream implementation for the bdiv dependences.

One characteristic of Sparse LU is its communication pattern: entire blocks are passed via dataflow streams multiple times due to the dependence pattern. E.g., in Figure 7.18A, block(0,0) is passed to block(0,1), block(0,2), block(1,0) and block(2,0). These overheads can be reduced with OWM: the synchronization pattern stays the same, but we keep the matrix in OWM, so that when a dependence is satisfied, the task can directly read and write the relevant OWM region. The OWM version corresponding to Figure 7.19 is presented in Figure 7.20.

for (k = 0; k < num_blocks; ++k) {
  for (i = k + 1; i < num_blocks; ++i) {
    if ((*fill_flags)[i][k] != 0) {
      #pragma omp task \
          input(streams[i*num_blocks+k] >> iview) \
          output(streams[i*num_blocks+k] << oview) \
          peek(streams[k*num_blocks+k] >> iview2) \
          cache(RW: matrix[(i*num_blocks+k)*bds:bds]) \
          cache(R: matrix[(k*num_blocks+k)*bds:bds])
      {
        bdiv(block_size, &matrix[(i*num_blocks+k)*bds],
             &matrix[(k*num_blocks+k)*bds]);
      }
    }}}

Figure 7.20: Sparse lu OpenStream implementation for bdiv dependences (with OWM support).

7.5 Summary

This chapter covered the following:

• Simulator Resource usage optimization largely reduces the memory constraints on the host machine. With the optimization applied to COTSon, dataflow programs constantly free and reuse the frame memory allocated on the guest machine (which loosens the memory constraints on the host machine), so that the memory consumption of multiple-node simulation becomes affordable.

• Simulator Throttling makes larger applications simulatable. Scaling the simulation to many cores requires larger input sizes for the benchmarks, and the current OpenStream implementation generates one control task that creates all the tasks in advance; task creation outpaces task execution, so resource demands keep stacking up. Our throttling method releases the resources in use by scheduling generous tasks whenever a resource shortage occurs while allocating a new task.

• Compiler The OWM extension for OpenStream makes it easier to write dataflow programs with OWM support. The OWM extension integrated into OpenStream is straightforward and easy to use for the programmer: it is automatically lowered to builtin functions in the GCC middle end, compiled to the OWM ISA in the compiler backend, and then simulated on COTSon with TSTAR support.

• Compiler+Simulator A great effort has been made to simulate OpenStream programs on COTSon. OpenStream comes with an efficient runtime system, one part of which manages streams and resolves dependences, while another part handles scheduling, thread creation, etc. We had to replace the scheduling and thread-creation part with the TSTAR backend, generating the TSTAR ISA, while still linking the stream-management part into the object file. On COTSon, the guest-machine runtime system, which hooks in as a pre-main function to create the workers, also has to be linked into the final object file.

• Benchmarks A few interesting benchmarks are simulated with OWM support. We have written several benchmarks with OWM support to express parallelism and to minimize communication overhead. These benchmarks are compiled down to the TSTAR ISA and simulated on COTSon in a multiple-node environment (up to 32 nodes).

Note Our focus in this chapter is not the precise timing model of the simulation, but the capability of simulating interesting benchmarks on thousands of cores targeting the TSTAR architecture. It would nevertheless be interesting to see the results with a precise timing model. This thesis is related to the TERAFLUX project Solinas et al. [2013] Giorgi et al. [2014]1, and the timing model is handled by another project partner; at the time this thesis is being written, the timing model still suffers in multiple-node simulation (the multiple-node synchronization mechanism conflicts with another locking mechanism in the simulator, which results in potential deadlock issues). We will provide simulation results with a precise timing model once it becomes available.

1 More information about this project can be found at http://www.teraflux.eu

Chapter 8

Conclusions and Future Work

To conclude this thesis, we review our main contributions, and then present ongoing and future opportunities for our thread partitioning work.

8.1 Contributions

The contributions of this dissertation include:

• A compilation algorithm extending loop transformations and Parallel Stage Decoupled Software Pipelining (PS-DSWP). By asserting a control-flow hypothesis inspired by synchronous concurrency, the data and control dependences can be decoupled naturally with only minor changes to existing algorithms that have been proposed for loop distribution and typed fusion.

• The design of the TSTAR instruction set to support our dataflow execution model, and its implementation within GCC as part of an end-to-end compilation tool chain.

• A complete algorithm that extracts dataflow parallelism from imperative programs. Our algorithm operates on a program dependence graph in static single assignment form. It exposes task, pipeline, and data parallelism from arbitrary control flow, and allows its granularity to be coarsened using a generalized form of typed fusion. A new intermediate representation — the Data Flow Program Dependence Graph (DF-PDG) — has been designed to ease code generation for explicit token match, or feed-forward, dataflow execution models. A prototype is implemented as an optimization pass in GCC.

• The design of a programming interface for the Owner Writable Memory model (OWM), to support complex data structures and thread partitioning with in-place computations. The OWM extension has been proposed and evaluated as an extension of the OpenStream research language, and the hardware interface has been implemented in the GCC backend.

• A hybrid compile-time and runtime algorithm for the Streaming Conversion of Memory Dependences (SCMD), to automatically parallelize programs with complex data structures at the finest granularity. SCMD decouples the memory accesses from computations and connects the accesses to different memory locations through streams and a dynamic dependence resolution method.

• Enhancements to a simulation platform for large-scale dataflow architectures based on the COTSon framework. These include memory usage optimization for multi-node configurations, and contributions to the implementation of TSTAR in COTSon.

• Conversion of realistic benchmarks for multithreaded dataflow execution, and evaluation of these on the large-scale simulation environment with OWM support.

8.2 Future Work

Figure 8.1 shows our general attempt to address the problem described in the introduction: "How to program a large-scale, multithreaded dataflow architecture in an efficient way?" The attempt has been conducted in two directions in our lab: a language design to express the data, task and pipeline parallelism in a program, and an automatic thread partitioning method to exploit finer-grain thread-level parallelism. Following the first direction, Antoniu Pop designed the OpenStream research language, a highly expressive stream-computing extension to OpenMP 3.0. Combining OpenStream with the second direction, detailed in this thesis, one may decompose programs into tasks and make the flow of data among them explicit, while relying on automatic parallelization techniques in the compiler and at runtime to expose finer-grain data, task and pipeline parallelism. The two directions and the associated methods are implemented in GCC as two separate passes. Our goal is to eventually combine the two paths: one would use OpenStream to identify the coarse-grain parallelism in the program, and the automatic thread partitioning method would further exploit finer-grain parallelism inside each task. The final view is pictured in Figure 8.1. The OpenStream parallel program written by the user is passed through the compiler front end; then, in the compiler middle end, two passes are applied: the OpenStream compilation pass transforms the program into coarse-grain dataflow tasks communicating through dataflow streams, and each of these tasks is further passed to the automatic thread partitioning pass, which automatically analyzes it and generates finer-grain dataflow tasks. Then, in the code generation pass, the compiler backend generates code according to the user's choice: either for the x86 architecture with TSTAR runtime support, or directly for the TSTAR architecture, which can be simulated on COTSon with TSTAR support.

[Diagram: OpenStream Parallel Program → Front End → OpenStream Compilation Pass → Coarse Grained Dataflow Tasks (Task 1, Task 2, ..., Task n) → Automatic Thread Partitioning Pass → Fine Grained Dataflow Tasks → Code Generation (Backend) → X86 architecture (X86 machine) or TSTAR architecture (simulator).]

Figure 8.1: High level overview of the compilation and simulation architecture.

But there are still challenges in combining the two passes:

There are, however, still challenges in combining the two passes:

• Intermediate Representation. The OpenStream pass uses the control flow graph (CFG) as its IR, and tasks and their dependences are delimited by annotation boundaries. The Automatic Thread Partitioning pass uses the Program Dependence Graph (PDG) and the Data Flow Program Dependence Graph (DF-PDG) as its IRs. There needs to be a consistent way of representing tasks and their dependences in both passes.

• Communication. The way coarse grain tasks are partitioned by OpenStream, as well as the dependence and communication channels between these coarse grain tasks, should be preserved after the Automatic Thread Partitioning pass. We may need an outlet input task and an outlet output task, and probably their continuations, in each coarse grain task, to consume inputs from the outlet outputs of other coarse grain tasks, and to produce outputs to the outlet inputs of other coarse grain tasks.

• Implementation. The OpenStream pass is an early pass in the GCC middle end, where the program is represented in GIMPLE. Automatic Thread Partitioning is implemented as a late pass in the GCC middle end, where the SSA representation is available in GIMPLE-SSA form. How to communicate and encode the dependence information generated by the first pass to the second pass, without losing information, remains a challenge.

[Figure 8.1: High level overview of the compilation and simulation architecture. The diagram chains the OpenStream Parallel Program through the Front End, the OpenStream Compilation Pass (producing Coarse Grained Dataflow Tasks: Task 1, Task 2, ..., Task n), the Automatic Thread Partitioning Pass (producing Fine Grained Dataflow Tasks), and Code Generation in the Backend, targeting either the x86 architecture on an x86 machine or the TSTAR architecture on the simulator.]

Another interesting direction for future work concerns the timing model. As part of the TERAFLUX project, the compilation framework generates code for a simulated multithreaded dataflow architecture. In order to evaluate the results more precisely, an accurate timing model needs to be integrated into COTSon. This work will be done in collaboration with partners of the project to complete the validation of the approach.

Personal Publications

1. Li, F., Pop, A. & Cohen, A. (2012b). Automatic extraction of coarse-grained data-flow threads from imperative programs. Micro, IEEE, 32, 19–31.

2. Li, F., Arnoux, B. & Cohen, A. (2012a). A Compiler and Runtime System Perspective to Scalable Data-Flow Computing. In 5th Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG), Paris, France.

3. Li, F., Pop, A. & Cohen, A. (2011). Advances in Parallel-Stage Decoupled Software Pipelining. In Workshop on Intermediate Representations (WIR), Chamonix, France.

4. Solinas, M., Badia, R.M., Bodin, F., Cohen, A., Evripidou, P., Faraboschi, P., Fechner, B., Gao, G.R., Garbade, A., Girbal, S., Goodman, D., Koliai, S., Li, F., Luján, M., Morin, L., Mendelson, A., Navarro, N., Pop, A., Trancoso, P., Ungerer, T., Valero, M., Weis, S., Watson, I., Zuckermann, S. & Giorgi, R. (2013). The TERAFLUX project: Exploiting the dataflow paradigm in next generation teradevices. In Euromicro DSD, Santander, Spain.

5. Trifunovic, K., Cohen, A., Razya, L. & Li, F. (2011). Elimination of memory-based dependences for loop-nest optimization and parallelization. In GCC Research Opportunities Workshop (GROW'11), Chamonix, France.

6. Trifunovic, K., Cohen, A., Edelsohn, D., Li, F., Grosser, T., Jagasia, H., Ladelsky, R., Pop, S., Sjödin, J. & Upadrasta, R. (2010). GRAPHITE Two Years After: First Lessons Learned From Real-World Polyhedral Compilation. In GCC Research Opportunities Workshop (GROW'10), Pisa, Italy.

7. Giorgi, R., Badia, R.M., Bodin, F., Cohen, A., Evripidou, P., Faraboschi, P., Fechner, B., Gao, G.R., Gayatri, R., Garbade, A., Girbal, S., Goodman, D., Khan, B., Koliai, S., Lê, N.M., Li, F., Luján, M., Morin, L., Mendelson, A., Navarro, N., Patejko, T., Pop, A., Trancoso, P., Ungerer, T., Weis, S., Watson, I., Zuckermann, S. & Valero, M. (2014). Teraflux: Harnessing dataflow in next generation teradevices. MICPRO Special Issue on European Projects in Digital Systems Design.

References

Allen, R. & Kennedy, K. (1987). Automatic translation of FORTRAN programs to vector form. ACM Trans. on Programming Languages and Systems, 9, 491–542.

Appel, A.W. (1998). SSA is functional programming. SIGPLAN Not., 33, 17–20.

Arandi, S. & Evripidou, P. (2011). DDM-VMc: The data-driven multithreading virtual machine for the Cell processor. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, HiPEAC '11, 25–34, ACM, New York, NY, USA.

Arvind & Culler, D.E. (1986a). Annual Review of Computer Science, vol. 1, 1986, chap. Dataflow Architectures, 225–253, Annual Reviews Inc., Palo Alto, CA, USA.

Arvind & Culler, D.E. (1986b). Dataflow architectures. Annual Review of Computer Science, 1, 225–253.

Arvind, A. & Culler, D. (1983). The tagged token dataflow architecture (preliminary version). Tech. rep., Laboratory for Computer Science, MIT, Cambridge, MA.

Arvind, A., Gostelow, K.P. & Plouffe, W. (1977). Indeterminacy, monitors, and dataflow. ACM SIGOPS Operating Systems Review, 11, 159–169.

Arvind, K. & Nikhil, R.S. (1990). Executing a program on the MIT tagged-token dataflow architecture. IEEE Trans. Computers, 39, 300–318.


Backus, J. (1978). Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. Commun. ACM, 21, 613–641.

Baxter, W. & Bauer, H.R., III (1989). The program dependence graph and vectorization. In Proc. of the 16th ACM SIGPLAN-SIGACT Symp. on Principles of programming languages, POPL ’89, 1–11, ACM, New York, NY, USA.

Beck, M., Johnson, R. & Pingali, K. (1989). From control flow to dataflow. Tech. rep., Cornell University.

Böhm, C. & Jacopini, G. (1966). Flow diagrams, Turing machines and languages with only two formation rules. Commun. ACM, 9, 366–371.

Culler, D.E. & Arvind (1988). Resource requirements of dataflow programs. SIGARCH Comput. Archit. News, 16, 141–150.

Culler, D.E., Sah, A., Schauser, K.E., von Eicken, T. & Wawrzynek, J. (1991). Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine. SIGARCH Comput. Archit. News, 19, 164–175.

Cytron, R. (1986). Doacross: Beyond vectorization for multiprocessors. In Intl. Conf. on Parallel Processing (ICPP), Saint Charles, IL.

Cytron, R., Ferrante, J. & Sarkar, V. (1990). Experiences using control dependence in PTRAN. In Selected papers of the second workshop on Languages and compilers for parallel computing, 186–212, Pitman Publishing, London, UK.

Cytron, R., Ferrante, J., Rosen, B.K., Wegman, M.N. & Zadeck, F.K. (1991). Efficiently computing static single assignment form and the control dependence graph. ACM Trans. on Programming Languages and Systems, 13, 451–490.

Davis, A. & Keller, R. (1982). Data flow program graphs. Computer, 15, 26–41.


Dennis, J. & Misunas, D. (1979). Data processing apparatus for highly parallel execution of stored programs. US Patent 4,153,932.

Dennis, J., Fosseen, J. & Linderman, J. (1974). Data flow schemas. In A. Ershov & V.A. Nepomniaschy, eds., International Symposium on Theoretical Programming, vol. 5 of Lecture Notes in Computer Science, 187–216, Springer Berlin Heidelberg.

Dennis, J.B. & Gao, G.R. (1988). An efficient pipelined dataflow processor architecture. In Proceedings of the 1988 ACM/IEEE Conference on Supercomputing, Supercomputing '88, 368–373, IEEE Computer Society Press, Los Alamitos, CA, USA.

Dennis, J.B. & Misunas, D.P. (1975). A preliminary architecture for a basic data-flow processor. In Proceedings of the 2nd annual symposium on Computer architecture, ISCA ’75, 126–132, ACM, New York, NY, USA.

Ferrante, J. & Mace, M. (1985). On linearizing parallel code. In Proc. of the 12th ACM SIGACT-SIGPLAN Symp. on Principles of programming languages, POPL ’85, 179–190, ACM, New York, NY, USA.

Ferrante, J., Ottenstein, K.J. & Warren, J.D. (1987). The program dependence graph and its use in optimization. ACM Trans. on Programming Languages and Systems, 9, 319–349.

Ferrante, J., Mace, M. & Simons, B. (1988). Generating sequential code from parallel code. In Proceedings of the 2nd international conference on Supercomputing, ICS '88, 582–592, ACM, New York, NY, USA.

Gajski, D., Padua, D., Kuck, D. & Kuhn, R. (1982). Second opinion on data flow machines and languages. Computer.

Gaudiot, J.L. & Wei, Y.H. (1989). Token relabeling in a tagged token data-flow architecture. Computers, IEEE Transactions on, 38, 1225–1239.

Gindrand, F., Cohen, A. & Nardelli, F.Z. (2013). Definition, code generation, and formal verification of a software controlled cache coherence protocol. Master's thesis.


Giorgi, R., Popovic, Z. & Puzovic, N. (2007). DTA-C: A Decoupled multi- Threaded Architecture for CMP Systems. In SBAC-PAD, 263–270.

Giorgi, R., Badia, R.M., Bodin, F., Cohen, A., Evripidou, P., Faraboschi, P., Fechner, B., Gao, G.R., Garbade, A., Gayatri, R., Girbal, S., Goodman, D., Khan, B., Koliai, S., Landwehr, J., Lê, N.M., Li, F., Luján, M., Mendelson, A., Morin, L., Navarro, N., Patejko, T., Pop, A., Trancoso, P., Ungerer, T., Watson, I., Weis, S., Zuckerman, S. & Valero, M. (2014). Teraflux: Harnessing dataflow in next generation teradevices. Microprocessors and Microsystems.

Halbwachs, N., Caspi, P., Raymond, P. & Pilaud, D. (1991). The synchronous dataflow programming language Lustre. Proceedings of the IEEE, 79, 1305–1320.

Hendren, L., Tang, X., Zhu, Y., Ghobrial, S., Gao, G., Xue, X., Cai, H. & Ouellet, P. (1997). Compiling C for the EARTH multithreaded architecture. International Journal of Parallel Programming, 25, 305–338.

Iannucci, R.A. (1988). Toward a dataflow/von Neumann hybrid architecture. SIGARCH Comput. Archit. News, 16, 131–140.

Kahn, G. (1974). The semantics of a simple language for parallel programming. In J.L. Rosenfeld, ed., Information processing, 471–475, North Holland, Amsterdam, Stockholm, Sweden.

Karp, R. & Miller, R. (1966). Properties of a model for parallel computations: Determinacy, termination, queueing. SIAM Journal on Applied Mathematics, 14, 1390–1411.

Kelsey, R.A. (1995). A correspondence between continuation passing style and static single assignment form. In Papers from the 1995 ACM SIGPLAN workshop on Intermediate representations, IR '95, 13–22, ACM, New York, NY, USA.


Kennedy, K. & Allen, J.R. (2002). Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Kennedy, K. & McKinley, K.S. (1990). Loop distribution with arbitrary control flow. In Proceedings of the 1990 ACM/IEEE conference on Supercomputing, Supercomputing '90, 407–416, IEEE Computer Society Press, Los Alamitos, CA, USA.

Kennedy, K. & Mckinley, K.S. (1993). Typed fusion with applications to parallel and sequential code generation. Tech. rep., Rice University.

Kuck, D.J., Kuhn, R.H., Padua, D.A., Leasure, B. & Wolfe, M. (1981). Dependence graphs and compiler optimizations. In ACM SIGPLAN-SIGACT symp. on Princ. of Prog. Lang., POPL ’81, 207–218, ACM, New York, NY, USA.

Lee, B. & Hurson, A. (1994). Dataflow architectures and multithreading. Computer, 27, 27–39.

Li, F., Pop, A. & Cohen, A. (2011). Advances in Parallel-Stage Decoupled Software Pipelining. In Workshop on Intermediate Representations (WIR), Chamonix, France.

Li, F., Arnoux, B. & Cohen, A. (2012a). A Compiler and Runtime System Perspective to Scalable Data-Flow Computing. In 5th Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG), Paris, France.

Li, F., Pop, A. & Cohen, A. (2012b). Automatic extraction of coarse-grained data-flow threads from imperative programs. Micro, IEEE, 32, 19–31.

Liu, W., Tuck, J., Ceze, L., Ahn, W., Strauss, K., Renau, J. & Torrellas, J. (2006). POSH: A TLS compiler that exploits program structure. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '06, 158–167, ACM, New York, NY, USA.


Maquelin, O., Gao, G.R., Hum, H.H.J., Theobald, K.B. & Tian, X.M. (1996). Polling watchdog: Combining polling and interrupts for efficient message handling. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, ISCA '96, 179–188, ACM, New York, NY, USA.

Arvind & Culler, D.E. (1983). Tagged token dataflow architecture. CSG Memo, Laboratory for Computer Science, MIT, Cambridge, MA.

Najjar, W., Roh, L. & Böhm, A.P.W. (1994). An evaluation of medium-grain dataflow code. Intl. J. of Parallel Programming, 22, 209–242, doi:10.1007/BF02577733.

Nikhil, R.S. (1989). Can dataflow subsume von Neumann computing? SIGARCH Comput. Archit. News, 17, 262–272.

Nikhil, R.S., Papadopoulos, G.M. & Arvind (1992). *T: A multithreaded massively parallel architecture. SIGARCH Comput. Archit. News, 20, 156–167.

Nuzman, D. & Henderson, R. (2006). Multi-platform auto-vectorization. In Proceedings of the International Symposium on Code Generation and Optimization, CGO '06, 281–294, IEEE Computer Society, Washington, DC, USA.

Nuzman, D., Rosen, I. & Zaks, A. (2006). Auto-vectorization of interleaved data for SIMD. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '06, 132–143, ACM, New York, NY, USA.

Ottenstein, K.J., Ballance, R.A. & MacCabe, A.B. (1990). The program dependence web: a representation supporting control-, data-, and demand-driven interpretation of imperative languages. In Proc. of the ACM SIGPLAN 1990 Conf. on Programming Language Design and Implementation, PLDI '90, 257–271, ACM, New York, NY, USA.

Ottoni, G., Rangan, R., Stoler, A. & August, D.I. (2005). Automatic Thread Extraction with Decoupled Software Pipelining. In IEEE/ACM Intl. Symp. on Microarchitecture, 105–118, IEEE Computer Society, Los Alamitos, CA, USA.


Papadopoulos, G.M. & Culler, D.E. (1990). Monsoon: An explicit token-store architecture. SIGARCH Comput. Archit. News, 18, 82–91.

Papadopoulos, G.M. & Traub, K.R. (1991). Multithreading: A revisionist view of dataflow architectures. In Proceedings of the 18th Annual International Symposium on Computer Architecture, ISCA ’91, 342–351, ACM, New York, NY, USA.

Planas, J., Badia, R.M., Ayguadé, E. & Labarta, J. (2009). Hierarchical Task-Based Programming With StarSs. Intl. J. on High Performance Computing Architecture, 23, 284–299.

Pop, A. & Cohen, A. (2011a). A Stream-Computing Extension to OpenMP. In Proc. of the 4th Intl. Conf. on High Performance and Embedded Architectures and Compilers (HiPEAC'11).

Pop, A. & Cohen, A. (2011b). A Stream-Computing Extension to OpenMP. In Intl. Conf. on High Performance and Embedded Architectures and Compilers (HiPEAC’11).

Pop, A. & Cohen, A. (2013). OpenStream: Expressiveness and data-flow compilation of streaming programs. ACM Trans. Archit. Code Optim., 9, 53:1–53:25.

Pop, A., Pop, S. & Sjödin, J. (2009). Automatic Streamization in GCC. In GCC Developer's Summit, Montreal, Quebec.

Portero, A., Yu, Z. & Giorgi, R. (2011). T-Star (T*): An x86-64 ISA extension to support thread execution on many cores. In HiPEAC ACACES-2011, 277–280, Fiuggi, Italy.

Raman, E., Ottoni, G., Raman, A., Bridges, M.J. & August, D.I. (2008). Parallel-stage decoupled software pipelining. In Proc. of the 6th annual IEEE/ACM Intl. Symp. on Code Generation and Optimization, CGO '08, 114–123, ACM, New York, NY, USA.


Renau, J., Tuck, J., Liu, W., Ceze, L., Strauss, K. & Torrellas, J. (2005). Tasking with out-of-order spawn in TLS chip multiprocessors: Microarchitecture and compilation. In Proceedings of the 19th Annual International Conference on Supercomputing, ICS '05, 179–188, ACM, New York, NY, USA.

Roh, L. & Najjar, W.A. (1995). Design of storage hierarchy in multithreaded architectures. In Proceedings of the 28th annual international symposium on Microarchitecture, 271–278, IEEE Computer Society Press.

Sarkar, V. (1989). Partitioning and Scheduling Parallel Programs for Multiprocessors. MIT Press, Cambridge, MA, USA.

Solinas, M., Badia, R.M., Bodin, F., Cohen, A., Evripidou, P., Faraboschi, P., Fechner, B., Gao, G.R., Garbade, A., Girbal, S., Goodman, D., Koliai, S., Li, F., Luján, M., Morin, L., Mendelson, A., Navarro, N., Pop, A., Trancoso, P., Ungerer, T., Valero, M., Weis, S., Watson, I., Zuckermann, S. & Giorgi, R. (2013). The TERAFLUX project: Exploiting the dataflow paradigm in next generation teradevices. In Euromicro DSD, Santander, Spain.

Stavrou, K., Nikolaides, M., Pavlou, D., Arandi, S., Evripidou, P. & Trancoso, P. (2008). TFlux: A Portable Platform for Data-Driven Multithreading on Commodity Multicore Systems. In Intl. Conf. on Parallel Processing (ICPP'08), 25–34, Portland, Oregon.

Strohschneider, J. & Waldschmidt, K. (1994). Adarc: A fine grain dataflow architecture with associative communication network. In EUROMICRO 94. System Architecture and Integration. Proceedings of the 20th EUROMICRO Conference, 445–450, IEEE.

Tang, X. & Gao, G.R. (1999). Automatically partitioning threads for multithreaded architectures. Journal of Parallel and Distributed Computing, 58, 159–189.

Tang, X., Wang, J., Theobald, K.B. & Gao, G.R. (1997). Thread partitioning and scheduling based on cost model. In Proceedings of the Ninth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '97, 272–281, ACM, New York, NY, USA.

Trifunovic, K., Cohen, A., Edelsohn, D., Li, F., Grosser, T., Jagasia, H., Ladelsky, R., Pop, S., Sjödin, J. & Upadrasta, R. (2010). GRAPHITE Two Years After: First Lessons Learned From Real-World Polyhedral Compilation. In GCC Research Opportunities Workshop (GROW'10), Pisa, Italy.

Trifunovic, K., Cohen, A., Razya, L. & Li, F. (2011). Elimination of memory-based dependences for loop-nest optimization and parallelization. In GCC Research Opportunities Workshop (GROW'11), Chamonix, France.

Tu, P. & Padua, D. (1995). Gated SSA-based demand-driven symbolic analysis for parallelizing compilers. In Proc. of the 9th Intl. Conf. on Supercomputing, ICS ’95, 414–423, ACM, New York, NY, USA.

Veen, A.H. (1986). Dataflow machine architecture. ACM Comput. Surv., 18, 365–396.

Viola, P. & Jones, M. (2001). Robust real-time object detection. International Journal of Computer Vision.

Watson, I. & Gurd, J. (1979). A prototype data flow computer with token labelling. In Proceedings of the National Computer Conference (AFIPS), 623.

Wolfe, M.J. (1995). High Performance Compilers for Parallel Computing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
