Dynamic Task Scheduling and Binding for Many-Core Systems through Stream Rewriting

Dissertation submitted in partial fulfillment of the requirements for the academic degree of Doktor-Ingenieur (Dr.-Ing.) at the Faculty of Computer Science and Electrical Engineering of the University of Rostock

submitted by

Lars Middendorf, born 21 September 1982 in Iserlohn, residing in Rostock

Rostock, 3 December 2014

Reviewers:

Prof. Dr.-Ing. habil. Christian Haubelt, Chair of Embedded Systems, Institute of Applied Microelectronics and Computer Engineering, University of Rostock

Prof. Dr.-Ing. habil. Heidrun Schumann, Chair of Computer Graphics, Institute of Computer Science, University of Rostock

Prof. Dr.-Ing. Michael Hübner, Chair of Embedded Systems for Information Technology, Faculty of Electrical Engineering and Information Technology, Ruhr-Universität Bochum

Date of submission: 3 December 2014

Date of defense: 5 March 2015

Acknowledgements

First of all, I would like to thank my supervisor Prof. Dr. Christian Haubelt for his guidance during the years, for the scientific assistance in writing this thesis, and for the chance to do research on a very individual topic. In addition, I thank my colleagues for interesting discussions and a pleasant working environment. Finally, I would like to thank my family for supporting and understanding me.

Contents

1 Introduction
1.1 Stream Rewriting
1.2 Related Work
1.3 Contributions
1.4 Publications

2 Stream Rewriting
2.1 Stream Rewriting Machine
2.2 Parallel Rewriting
2.3 Scheduling
2.4 Stream Compression
2.5 Memory Access

3 Source Models
3.1 Functional Programs
3.2 Task Graphs
3.3 C Source Code

4 Multi-Core Architectures
4.1 Introduction
4.2 Shared Memory Architecture
4.3 Stream Rewriting Network
4.4 Comparison
4.5 Conclusion

5 Graphics Processing
5.1 Introduction
5.2 Related Work
5.3 Generic Graphics Pipeline
5.4 Functional Processor
5.5 SIMT Shader Core
5.6 General Purpose Graphics Processor

6 High-Level Synthesis
6.1 Introduction
6.2 Related Work
6.3 Implementation
6.4 Results

7 Conclusion
7.1 Results
7.2 Future Work

8 References

1 Introduction

The scalability of modern computer architectures is mostly limited by stagnating clock rates and increasing energy consumption. Multi- and many-core systems offer the ability to overcome these issues by performing computations on several functional units in parallel. Although a multi-core design usually consumes a larger area than an equivalent single-core design, it can offer similar computational power at lower clock frequencies and thus helps to reduce thermal and power issues [1]. Thus, consistently designing hardware and software architectures towards concurrency and parallelism is currently the most effective and, in the long term, the best approach to accelerating an application [2]. The concept of multiple cores can be recognized in general purpose processors [3] and graphics processing units [4], which are optimized for maximum performance, but also in mobile and embedded systems, which must perform computationally intensive tasks in resource constrained environments [5]. In particular, the integration of more functional units and their interconnections already involves significant technical problems to be solved in the hardware domain. For instance, Figure 1 shows three example processors with a complexity ranging from 29K to 7.1B transistors. Although each generation can take advantage of technological improvements and shrinking structures, the simple duplication of arithmetic and logical blocks leads to an unconstrained growth of the power usage [6]. Hence, it is often better to spend the newly won capacity on optimized and therefore more specialized functionality, which results in complex heterogeneous systems. As a result, the integration of different processor types becomes possible, but on the other hand, the utilization of these additional resources turns out to be a significant and currently unsolved issue. Due to its complexity, the task management is often moved to the software layer, which is then responsible for the binding and scheduling of concurrent work items to processor cores [7] and involves several different programming models (Section 1.2.1). When developing embedded real-time systems, the focus usually lies on meeting hard deadlines that require the completion of certain tasks within a given period of time. For these applications, a static schedule is usually pre-computed at design time to ensure correctness and predictability of the system.

Figure 1. Photographs of various processors from the 8088 to modern GPUs: Intel 8088 (29K transistors, 1979) [6], Intel Nehalem (8 cores, 2.3B transistors, 2010) [6], and NVidia GK110 (7.1B transistors, 2013) [8].

Figure 2. Architecture of the Tesla GPU [9]. Figure 3. Streaming multiprocessor [9].

However, the creation of a static schedule can take a significant amount of time and is only feasible if the expected workload is known at design time. Hence, the software development for many-core architectures like the SCC processor with 48 cores [10], the 64-core tile processor [11] or graphics processing units [9] becomes challenging in the case of dynamic parallelism. Especially for irregular workloads, the efficient communication between cooperating tasks raises several questions not yet solved in general [12], so that the resulting schedule is usually not an exact solution [13]. Therefore, it seems remarkable that for the specific use case of graphics processing, many-core architectures with thousands of shader processors have become common and the performance of applications scales accordingly [14]. One reason can be found in the architecture of GPUs, which is adapted to the parallel workloads of graphics processing. For instance, the Tesla processor (Figure 2) contains a large array of streaming multiprocessors, which are fed by specialized units for vertex, pixel and compute work. In particular, each of these tasks, like the rasterization of a triangle, can be naturally partitioned into a sufficient number of smaller work items for each processor. For example, the hardware scheduler of the Tesla GPU creates a separate thread for each 2x2 quad of pixels. In particular, there are up to 768 threads per multiprocessor, which help to hide the latency of floating-point operations and memory requests at moderate cost. On the other hand, the communication between threads is limited and the system actually requires a minimal workload for efficiency. Hence, this architecture is well-suited for data parallel problems with extensive floating-point arithmetic, which apply the same type of computation to a large number of input elements. However, in the case of irregular data access and control flow, like sorting or binary search, a general purpose CPU can outperform a GPU despite the latter's several times higher theoretical arithmetic throughput [15].

Figure 4. How to map a complex application with dynamic parallelism to many-core architectures?

One reason is the usage of the SIMT (single-instruction multiple-threads) execution model in GPUs, which evaluates the same instruction for a group of threads in parallel. In these architectures, shader cores (SP) consist only of an arithmetic unit and share common functionality like the memory interface. Especially the instruction fetch and decode stages are shared with other cores in the same multiprocessor (Figure 3). Thus, in the optimal case, all cores of a multiprocessor are active and work in parallel on the same instruction, but data dependent branches can lead to different control paths, which cause performance degradation (incoherent control flow). Hence, in order to take advantage of current GPUs, a problem must fit into the data parallel execution model. As a result, the dynamic binding and scheduling of applications with numerous tasks has not been solved for many-core architectures in general (Figure 4). Instead, both general purpose CPUs and graphics processors have their particular strengths, so that the question arises how these contrary concepts can be combined. For this purpose, this thesis presents a novel execution model, which deals with the problem of dynamic binding and scheduling of irregular workloads in a highly concurrent environment. Since several concepts from general purpose CPUs and graphics processors are combined, the relation between these architectures is compared in Figure 5. On the left hand side, the single core CPU exclusively relies on instruction-level parallelism but is also well-suited for irregular data access due to a cache hierarchy. Irregular control flow is usually handled efficiently (for example by branch prediction), and dynamic scheduling can be implemented in software. When moving towards the right hand side of this chart, parallelism, throughput, and performance increase, but the programming model also becomes more specialized. Thus, a multi-core CPU enables task and thread-level parallelism

Figure 5. Comparison of CPU and GPU architectures.

but a program must first be partitioned into concurrently running processes. Unlike instruction-level parallelism, which can be extracted automatically using Tomasulo's algorithm [16], the synchronization of separately running threads must be specified explicitly by the programmer and often adds undesirable complexity to an application. While general purpose processors have evolved into multi-core CPUs and are moving towards many-core architectures, graphics processors are already highly parallel and are becoming more flexible instead. Each major version of Direct3D [17] introduces new shader types and adds more capabilities. For instance, the compute shader of Direct3D 11 [18] enables general purpose computations and random memory access. However, the application must still be adapted to the given programming model in order to take advantage of the large-scale data parallelism. Obviously, there is a tradeoff between generality and specialization for a particular problem, so that a GPU built from large general purpose cores would consume unnecessary area and power. However, in this thesis, it is questioned whether there could possibly exist an intermediate class of parallel architectures, which inherits features from both CPUs and GPUs. For instance, it would be advantageous if a hardware scheduler, able to handle millions of threads, could be combined with irregular and general purpose computations. Due to the usage of a stream as the main computation model, these architectures are characterized as stream rewriting machines (SRM). In summary, this thesis deals with the following three questions:

1. How can dynamic parallelism be modeled? A fundamental question is the description of applications in such a way that they are scalable for many-core architectures. The basic task graph model contains a fixed number of nodes and can therefore not describe dynamic parallelism. In contrast, C source code is a much more expressive representation but inherently sequential. Data parallel loop programs contain explicit parallelism but usually cannot describe the dynamic creation of individual tasks.

2. How can the dynamic scheduling and binding of irregular tasks be performed? The proposed system should be able to handle a large number of unpredictable and frequently varying tasks as well as their dynamic creation and completion. Also, recursive tasks must be managed efficiently. As a result, the static binding of tasks to computational resources and static scheduling are not feasible. In addition, the method must be simple enough to be implemented in hardware.

3. How can the hardware architecture support dynamic scheduling and binding? An important aspect is the architecture of many-core systems supporting dynamic parallelism. Should the architecture provide support for task management, and how can the interface between hardware and software be defined? Can the flexibility of a software implementation be combined with the throughput of a hardware scheduler?

Before detailing the contributions of this thesis, the proposed execution model of stream rewriting is presented in the next section.

Figure 6. Input models and target platforms of stream rewriting.

1.1 Stream Rewriting

This thesis proposes an alternative approach for managing dynamic parallelism at the system level. The central innovation is a technique called stream rewriting, which acts as an intermediate and platform independent representation of concurrent programs (Figure 6). On the left hand side, several source models like imperative C code, functional programs and task graphs are shown. A significant part of this thesis deals with the translation of these models into a stream rewriting machine (SRM), which is the abstract model for stream rewriting. For example, an SRM can be derived from an imperative C program, from functional languages, as well as from task graphs, which are often used to describe the dependencies in parallel applications. Hence, stream rewriting is an abstract model for concurrency, which focuses especially on the aspect of dynamic parallelism and can describe both recursive functions and dynamically expandable task graphs. On the right hand side, there are several target platforms for the implementation of a concrete SRM. It can be either compiled into software running on general purpose processors or generic many-core architectures. In addition, the SRM can be synthesized into application-specific hardware modules. An example design flow for stream rewriting is illustrated in Figure 7. At design time, the given application is modeled as a static or dynamic task graph. This graph can be converted into rewriting rules with the help of the decomposition tree, which describes the topology as a structure of sequential and parallel blocks. Eventually, the decomposition tree can be translated into a functional program, which is then compiled into rewriting rules and evaluated at runtime using the stream rewriting machine. A more detailed description of this technique and a comparison of several implementations can be found in Section 3.

Figure 7. Design flow for stream rewriting.

Figure 8. Stream Rewriting Machine. Figure 9. Minimalistic example of stream rewriting: the initial stream s0 = ⟨+ ∗ 1 2 ∗ 3 4⟩ is rewritten into s1 = ⟨+ 2 12⟩ and finally into s2 = ⟨14⟩.

A stream rewriting machine (SRM) basically consists of a token stream, a rewriting function, and a set of rewriting rules, as shown in Figure 8. While the stream contains the state of the system, the behavior and therefore the implementation of the program is described by the rewriting rules. In this model, each rule detects a certain pattern in the stream and replaces the corresponding part with the results of a computation. The stream circulates in the system, so that all rules are continuously checked and, if appropriate, the associated function is evaluated and the results are inserted into the stream. Although a formal model of the rewriting process is specified in Section 2, a minimalistic example is already presented in Figure 9. Here, the initial stream encodes the expression 1 ⋅ 2 + 3 ⋅ 4 and consists of literal tokens drawn in green and function tokens drawn in blue. In particular, a rewriting function can execute if it is followed by a sufficient number of arguments, so that their data dependencies have been resolved. In the first rewriting step, only the multiplications are ready and can be evaluated since two literals are available for each of them. The addition has to wait until the second step to produce the final result 14. Even in this small example, several important properties of stream rewriting can be studied. First, it describes all scheduling decisions as find-and-replace operations. Since data dependencies are encoded into the stream, they can be automatically resolved via pattern matching. On the other hand, the rewriting of independent patterns like the two multiplications can be performed in parallel because the corresponding sub-streams do not overlap. The result of these two different data paths is synchronized on the stream via the pattern matching of the + operator. As a consequence, the whole complexity of task management can be reduced to find-and-replace operations on the stream. In the following sections, several related models for parallel programming are presented and compared to stream rewriting. Before dealing with the contributions of this thesis, related work is described first in the next section.
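To make this find-and-replace interpretation concrete, the following minimal C sketch replays the example of Figure 9 on a plain array of tokens. It is only an illustration under simplifying assumptions (a fixed arity of two and operators encoded directly as tokens, following the function-first layout of Figure 9); the formal model with explicit CALL tokens follows in Section 2, and all names are illustrative rather than taken from the thesis.

#include <stdio.h>

/* Simplified token: either a literal value or an operator ('*' or '+'). */
typedef struct { int is_op; int value; } Token;

/* One scan over the stream: every pattern "op literal literal" is replaced
   in place by a single literal holding the result. Returns the number of
   rewrites, so that the caller can iterate until a fixed point is reached. */
static int rewrite(Token s[], int *len) {
    int count = 0;
    for (int i = 0; i + 2 < *len; i++) {
        if (s[i].is_op && !s[i + 1].is_op && !s[i + 2].is_op) {
            int a = s[i + 1].value, b = s[i + 2].value;
            s[i].is_op = 0;
            s[i].value = (s[i].value == '*') ? a * b : a + b;
            for (int j = i + 3; j < *len; j++)   /* close the gap of two tokens */
                s[j - 2] = s[j];
            *len -= 2;
            count++;
        }
    }
    return count;
}

int main(void) {
    /* Initial stream s0 of Figure 9: + * 1 2 * 3 4, i.e. 1*2 + 3*4 */
    Token s[] = { {1,'+'}, {1,'*'}, {0,1}, {0,2}, {1,'*'}, {0,3}, {0,4} };
    int len = 7;
    while (rewrite(s, &len) > 0)
        ;                                  /* iterate until no pattern matches */
    printf("result = %d\n", s[0].value);   /* prints 14 */
    return 0;
}

Note that the two multiplications do not depend on each other and touch disjoint parts of the stream; a parallel implementation could therefore rewrite them on different processing units, which is exactly the property exploited in Section 2.2.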

1.2 Related Work

There already exist alternative solutions to the problem of managing concurrency in dynamic systems. In this section, their particular strengths and drawbacks are described in comparison to stream rewriting.

1.2.1 Programming Models

Several different programming models for parallelism are presented in this sub-section.

1.2.1.1 Data Parallelism

Figure 10. CUDA model of computation [19].

In addition to graphics processing, modern GPUs can also be utilized to accelerate general purpose and numeric computations. Especially data parallel problems can take advantage of hundreds or thousands of processor cores, which provide several TFLOPS of floating-point performance [20]. For this purpose, compute APIs like CUDA or OpenCL [21] permit executing a kernel function on a parallel grid of threads (Figure 10). Each instance runs the same program on a different location in the grid, which can be determined by a set of pre-defined variables. Hence, this model of computation is most useful for image processing or numerical simulations, which apply the same function to every element of an array. In addition, threads are organized in blocks, which run on the same processor and can exchange data via shared memory. Actually, the kernel function is equivalent to the body of a data parallel loop, while the iteration is managed by the hardware scheduler of the GPU. Although two dimensional grids are most common, 1D and 3D domains are also possible. For instance, the following example shows the definition of a CUDA kernel function, which adds the two one-dimensional floating-point arrays a and b and stores the result in c:

__global__ void Add(float *a, float *b, float *c) {
    /* global index of the current thread within the one-dimensional grid */
    uint id = blockIdx.x * blockDim.x + threadIdx.x;
    c[id] = a[id] + b[id];
}

First, the global index of the current thread is calculated using the built-in variables blockIdx.x, blockDim.x and threadIdx.x, which specify the coordinates of the current block, its dimensions and the index of the current thread within the block. The Add kernel is invoked from the C part of a CUDA program using the following syntax, which specifies the number of blocks and the number of threads per block to execute:

Add<<<blockCount, blockDim>>>(a, b, c);

Hence, the total number of executed threads equals blockCount ⋅ blockDim.
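Before turning to the equivalent sequential code, a hypothetical host-side setup for this kernel is sketched below. The array size N and the block configuration are illustrative values and not taken from the thesis; the kernel Add is the one defined above.

#include <cuda_runtime.h>

int main(void) {
    const int N = 1024;                        /* illustrative problem size      */
    const int threadsPerBlock = 256;           /* blockDim in the text above     */
    const int numBlocks = N / threadsPerBlock; /* blockCount in the text above   */

    float *a, *b, *c;                          /* device arrays                  */
    cudaMalloc(&a, N * sizeof(float));
    cudaMalloc(&b, N * sizeof(float));
    cudaMalloc(&c, N * sizeof(float));
    /* ... fill a and b via cudaMemcpy ... */

    /* launches numBlocks * threadsPerBlock = N threads in total */
    Add<<<numBlocks, threadsPerBlock>>>(a, b, c);
    cudaDeviceSynchronize();

    /* ... read back c via cudaMemcpy ... */
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}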

The equivalent C code for this CUDA program is shown below and emulates this behavior using two nested loops:

void Add(float *a, float *b, float *c) {
    for (int blockIdx = 0; blockIdx < blockCount; blockIdx++) {
        for (int threadIdx = 0; threadIdx < blockDim; threadIdx++) {
            int id = blockIdx * blockDim + threadIdx;
            c[id] = a[id] + b[id];
        }
    }
}

For two- and three-dimensional kernels, the corresponding steps are performed accordingly for the y and z coordinates. There also exist several domain-specific compute APIs, like DirectCompute [22], which is integrated in Direct3D, or the C++ language extension C++ AMP [23]. OpenMP [24] is an already established API for general purpose processors, which utilizes pragmas to mark parallel loops. In addition, recent implementations of OpenMP can also employ the graphics processor via CUDA [25]. All of these programming models offer a scalable solution for data parallel problems, which can be described as nested loops, but dependencies between subsequent iterations usually prevent a concurrent evaluation and are therefore not supported. Although it is possible in certain cases to remove these dependencies by transforming loop indices [26] [27], the pure data parallel model is not suited to handle algorithms with dynamic workloads. In contrast, stream rewriting supports all types of data parallelism and can describe especially the dynamic nesting of loops, a variable number of iterations as well as recursive branch-and-bound algorithms. Further, this thesis proposes a technique that handles loops with a large number of iterations (Section 2.3.4) and introduces specialized rewriting rules for data parallel loops (Section 2.4.1). As a result, stream rewriting is a true superset of these existing techniques and also permits the irregular and heterogeneous problems described in the next section.
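For comparison with the CUDA kernel above, the OpenMP variant of the same element-wise addition only marks an ordinary loop as parallel with a single pragma; this is a generic sketch rather than an example from the thesis:

/* Element-wise addition as a data parallel loop: the pragma lets the
   OpenMP runtime distribute the iterations across the available cores. */
void Add(const float *a, const float *b, float *c, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}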

1.2.1.2 Data Flow Networks

Figure 11. Data flow model of a JPEG Decoder [28].

Data flow process networks offer an abstract model of concurrency, which reduces the complexity and development time of parallel applications [29]. A dataflow graph (DFG) consists of isolated, concurrently running processes, which are also called actors and communicate only via dedicated point-to-point connections. No central mechanism for synchronization is required if local groups of actors are self-scheduled, waiting for the required number of inputs to become available. In addition, large

data-flow graphs can be composed hierarchically by wrapping sub-graphs into actors, and an abstract dataflow model can be re-targeted automatically for many different platforms. For instance, the data flow model of the JPEG Decoder (Figure 11) contains several different actors like the Huffman Decoder, the DC Decoder or the IDCT, which are connected via queues and work in parallel. Hence, an initial implementation of this DFG can be derived by mapping actors directly to software threads or hardware modules [30] [31]. As an optimization, several connected blocks can be merged into more coarse grained actors via clustering to reduce the communication overhead and therefore improve latency and throughput [32]. For synchronous data-flow graphs [33], the rates of each actor are fixed and specified as part of the model. Consequently, an optimal schedule and the maximum size of the connecting buffers can be determined analytically [34]. However, not all problems can be modelled with constant rates, and a fully dynamic self-scheduling can be expensive when implemented in software [35]. Thus, it becomes feasible to divide a graph into statically scheduled and dynamic parts to improve the performance [36]. However, in the case of embedded systems, which interact with dynamically changing environments, often an adaptive mapping [37] is required to account for frequently varying workloads [38]. Stream rewriting provides an alternative approach for the scheduling of complex applications with different types of parallelism. Although its expressiveness is limited to series-parallel graphs [39] without feed-back loops, this technique allows for varying data rates and also permits the recursive expansion of actors at runtime (Section 3.2.3).

1.2.1.3 StreamIt

The StreamIt programming language is based on Java and has been specially designed to describe data flow networks. Comparable to a class in Java, the basic building block of StreamIt is a filter module containing one input and one output channel. More complex structures can be constructed by using special keywords to combine multiple filters into pipelines, split-joins, and feedback loops. Since all StreamIt actors are annotated with production and consumption rates, the compiler can generate an optimal static schedule [40]. In addition, filters without an internal state can be replicated automatically, so that both data and pipeline parallelism are supported [41]. The abstract high-level description of StreamIt makes it possible to transform and optimize the data flow graph automatically for different target platforms [42]. StreamIt has been used as a high-level language for a hardware synthesis tool which is able to automatically derive an efficient FPGA implementation [43] [44]. Similarly, stream programs can also be translated into CUDA code that runs on the GPU [45]. In this case, the graph is restructured according to the architectural characteristics of CUDA to exploit as much data parallelism as possible. However, filters containing an internal state cannot be translated, so that these parts of the program remain on the CPU. StreamIt has also been used to model a flexible graphics pipeline on top of a reconfigurable processor grid. Since the fixed data rates of the StreamIt filters are incompatible with the varying amount of data produced in the graphics pipeline, the language has been extended to support variable rates as well. For load balancing, actors are replicated and mapped

automatically to cores according to the expected workload, but it is also possible to modify the configuration at runtime. As a central difference, stream rewriting supports arbitrary recursion and function pointers, which cannot be described by StreamIt programs. Stream rewriting also supports dynamic data flow graphs and can adapt itself to varying workloads at runtime without an external reconfiguration. On the other hand, StreamIt permits feed-back loops in the data flow graph, which cannot be modelled using stream rewriting.

1.2.2 System Level Parallelism

When designing the software and hardware architecture of many-core systems, a central problem is the scheduling and binding of tasks to processing units.

1.2.2.1 Task Graphs

At the system level, the parallelism of many-core processors is exploited using multiple threads [46] [47], which communicate via shared memory or message passing. The application is often represented as a task graph [48] [49] and then mapped onto the target architecture [50]. The binding of tasks to processor cores can be calculated either as a preprocessing step [51] or dynamically at runtime [7]. In the case of heterogeneous systems, the binding must consider several additional constraints regarding the capabilities of each core [52]. A static approach for task mapping is presented in [53] and uses both pipelining and data parallelism to maximize the utilization of processor cores in an MPSoC (multi-processor system-on-chip). Similarly, a technique for dynamic task mapping on a processor array has been presented in [54], but in contrast to stream rewriting, it requires a central scheduler, which works well for a small number of processors but might create a bottleneck in many-core systems. While the static approach most certainly produces a better solution if all factors are known at design time, this thesis puts the focus especially on the dynamic case to handle a varying and large number of tasks with unpredictable workload at runtime. In fact, stream rewriting transfers the concept of out-of-order execution, which is implemented as scoreboarding [55] or the Tomasulo algorithm [16] for individual instructions, to the system level. However, in contrast to the out-of-order execution of data flow graphs, no global reservation station is required [56].

1.2.2.2 Invasive Computing

Invasive computing [57] has been developed for heterogeneous many-core systems and eliminates the central scheduler of previous architectures [12] [58]. Instead, tasks are encapsulated as relocatable work items called i-lets. They can be pushed between neighboring cores for load-balancing and thus allow an application to dynamically invade a grid of processors. Stream rewriting pursues a similar goal at a higher level of abstraction since it represents tasks as data tokens on a stream, which are decoupled from the implementation. In addition, stream rewriting also offers a mechanism to synchronize tasks via local pattern matching, while the invasive platform relies on atomic operations or mutexes for this purpose.

1.2.2.3 Work Stealing

Work stealing takes the opposite approach of invasive computing by letting processors explicitly fetch or steal tasks from their neighbors [59] [60]. An execution model based on work stealing and an extension to the C programming language (Cilk) are presented in [61] and [62]. Similar to the concept of this thesis, they can invoke functions on a new thread via the spawn keyword to implement dynamic and also recursive task graphs. For comparison, the stream rewriting machine (SRM) creates threads by generating a sequence of control and data tokens. However, work stealing requires shared memory for synchronization, while stream rewriting is able to join threads using local pattern matching. Hence, the main difference is the choice of the stream as a common data structure.

1.2.3 Hardware Architectures

Concurrency and parallelism can be exploited at different abstraction levels in different ways. At the system level, the usage of multiple cores [63] permits running several distinct tasks simultaneously [64], but also the instruction set can be designed to utilize explicit data parallelism for SIMD and vector processing.

1.2.3.1 Many-Core Systems

In this work, several highly parallel hardware architectures are presented, like the many-core system with 128 RISC processors in Section 4 or the graphics processor in Section 5. The Larrabee processor has also been designed as a more flexible GPU and is based on up to 48 x86-compatible cores. The instruction set of the Larrabee has been extended with vector arithmetic for floating-point calculations. Instead of dynamic scheduling, multiple hardware threads utilize the execution units. The chip is essentially an array of general purpose processors with cache coherence, virtual memory, and arbitrary data exchange via a ring-bus. Related designs like the SCC architecture [10] and the tile processor [11] also contain a grid of processor cores and, unlike CUDA, they all support heterogeneous tasks. However, in contrast to the stream rewriting architectures, there is no hardware support for pipeline parallelism and also the scheduling must be handled in software. As an advantage, stream rewriting provides a solution for distributing tasks at the system level, which is simple enough to be implemented in hardware. In Section 4.3, the responsible hardware component, the stream rewriting network (SRN), is described in more detail.

1.2.3.2 Data Flow Machines

In addition, stream rewriting shares some similarities with early data flow machines like the systolic array and the Xputer [65]. They usually consist of a two dimensional array of processing elements with limited control logic for arithmetic operations. Similarly, the execution is also driven by data dependencies and therefore self-timed. The functionality of each core is simple and consists of a single instruction that is applied to a large amount of input data. The PipeRench architecture [66] and its successor, the queue machine [67], also process a token stream in hardware, but unlike stream rewriting, they support neither control tokens nor function calls, so that their main use case is the acceleration of a basic block like the

inner body of a loop. Similarly, it is not possible to create new threads at runtime, which limits their applicability mostly to data parallel problems; arithmetic operations always work on adjacent tokens, so that usually an expensive reordering is required. In comparison, these architectures are well suited for data parallel problems but are less useful for general data flow graphs. In contrast, stream rewriting embeds control information into the stream and allows redefining the structure of the data flow graph at runtime. As a result, complex control flow like recursion and expandable shader nodes can be modelled using the same rewriting technique based on pattern matching (Section 3.3).

1.2.3.3 Instruction Level Parallelism

The amount of instruction level parallelism (ILP) in a program is highly variable and usually depends on the application. For example, algorithms from the fields of numerical computations, image and graphics processing already contain a large degree of intrinsic parallelism that can be directly utilized by the compiler. In particular, binding and scheduling can be performed at design time to identify concurrent operations and data dependencies [68] [69]. For instance, the instruction set of transport triggered architectures [70] consists exclusively of parallel move instructions between functional units, registers and memory. Similarly, VLIW designs like [71] allow the compiler [72] to pack several independent instructions into a bundle, which is then executed in parallel [73]. EPIC architectures like the Itanium processor [74] extend this concept by reserving a special bit in the opcode to chain an arbitrary number of concurrent instructions [75]. A more backward compatible approach is the usage of SIMD (single instruction multiple data) extensions like SSE [76] or AVX [77], which introduce additional registers and opcodes for parallel vector computations [78] [79] [80]. Also, classic vector processors [81] distinguish between scalar and vector instructions but usually support much wider registers than the SIMD extensions. However, many of these techniques must be explicitly considered by the programmer or depend extensively on the quality of static analysis in the compiler, which is usually limited by dynamic control flow like loops and branches [82]. Stream rewriting also resolves data dependencies dynamically at runtime, while the granularity of a rewriting rule spans from a single instruction to entire functions.
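As a brief illustration of the SIMD extensions mentioned above, the following generic sketch (not taken from the thesis) adds two float arrays eight elements at a time using AVX intrinsics; the length n is assumed to be a multiple of eight.

#include <immintrin.h>

/* Adds two float arrays using 256-bit AVX registers (8 floats per operation). */
void add_avx(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   /* load 8 floats from a          */
        __m256 vb = _mm256_loadu_ps(b + i);   /* load 8 floats from b          */
        __m256 vc = _mm256_add_ps(va, vb);    /* one instruction, 8 additions  */
        _mm256_storeu_ps(c + i, vc);          /* store 8 results to c          */
    }
}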

Figure 12. Source models and target platforms for stream rewriting, which are presented in this thesis.

1.3 Contributions

This thesis proposes stream rewriting as a novel model of computation for the specification, analysis, and implementation of parallel applications. The fundamental contribution is the observation that this model permits specifying dynamic scheduling and binding as rewriting rules in order to handle a large amount of irregular tasks without a central scheduler. The structure of this thesis and its contributions are presented in Figure 12, which distinguishes between the core model of stream rewriting, the source models, and the implementations. Based on an abstract model, which is named the stream rewriting machine (SRM), the most important challenges like parallel rewriting, scheduling, synchronization, and memory access are discussed from a theoretical point of view in Section 2. In this section, the following contributions are most significant:

- Stream rewriting is based on local operations: Rewriting operations modify locally constrained and non-overlapping intervals of the stream and can therefore be performed on different regions in parallel. Also, the synchronization of concurrent threads is performed via local pattern matching on the stream and does not require shared memory or atomic operations.

- Stream rewriting is driven by data dependencies: The execution order of tasks is defined implicitly by data dependencies, so that an implementation can choose an optimal schedule at runtime.

- Concurrent threads are synchronized by dynamic binding: Access to shared resources is serialized by dynamically binding interfering tasks to the same processor, which permits complex atomic read-modify-write operations.

- Stack-based scheduling enables deep recursion: The stack-based scheduling automatically balances between a depth-first and a breadth-first traversal of the call tree. This approach avoids an exponential growth of tokens during recursive traversal but still maintains a large degree of concurrency.

- Local environments embed shared data into the stream: Local environments redefine the value of global variables for a restricted sub-stream and therefore provide an efficient mechanism for the precise distribution of shared data without side-effects.

Further, three source models are analyzed in this thesis, which can be translated into rewriting rules for the stream rewriting machine (Section 3):

1. Functional Language: In combination with a grammar and semantics, the stream rewriting machine is actually a hardware interpreter for a minimalistic functional language. Due to the absence of hidden side-effects, the purely functional model is well-suited for modeling complex tasks with interdependencies and concurrency.

2. Task Graphs: Since task graphs provide an efficient model of computation for the specification, analysis, and implementation of concurrent applications, a novel approach for scheduling the class of series-parallel task graphs is also presented. For this purpose, the topology of the graph and the state of the tasks are encoded as a stream of tokens, which is iteratively rewritten at different positions in parallel. Hence, this approach is most useful for compute-intensive applications with varying workload.

3. C Source Code: In addition to functional programs, the translation of a C-like language into a stream rewriting machine is also described. The presented compilation technique enables the usage of stream rewriting as an intermediate format for the high-level synthesis of hardware modules (Section 6).

In addition, the abstract concept of stream rewriting has been implemented and evaluated on three different target platforms:

1. Many-Core System: The utilization of stream rewriting is evaluated for dynamic task binding and scheduling in embedded many- and multi-core systems. In particular, several different FPGA implementations with up to 128 general purpose cores have been explored. In this case, the rewriting is either performed entirely in software or as a combination of software and custom hardware. In contrast to the classic scheduling of periodic real-time tasks, stream rewriting is more appropriate for handling a large number of dynamic work-items (Section 4).

2. Graphics Processor: Since graphics processing usually involves a huge number of threads as well as data and pipeline parallelism, it represents an ideal test case for stream rewriting. Despite the computational power of modern graphics processing units (GPU), a main limitation of these architectures is often the lack of efficient on-chip communication between different shader cores. In this context, stream rewriting enables an efficient communication infrastructure for the creation of highly optimized and application-specific rendering pipelines. As a result, three different architectures could be evaluated using an FPGA. Especially for graphics processing, this thesis presents the following contributions (Section 5):

- Application-Specific Rendering Pipelines: The proposed graphics processor supports an arbitrary number of dynamic and recursive shader stages as well as complex data flow and light-weight synchronization within the rendering pipeline.

- Unified Model of Computation for Graphics: A graphics application must usually be split into two parts for GPU and CPU, which are developed using separate languages and optimized according to different rules. The CPU part is either executed sequentially or moderately multi-threaded, while the GPU part of the program is adapted to the more restricted but highly parallel model of the graphics processor. Although the graphics API connects these two totally different environments, the binary decision between CPU and GPU leads to an artificial break and introduces communication overhead. For this purpose, stream rewriting offers a continuous transition between task, pipeline and data parallelism as well as support for individual threads.

- Complete Prototype System: Most importantly, the concept of stream rewriting for graphics processing has been evaluated as part of a complete test system consisting of several applications, an OpenGL driver, a kernel mode driver, and the FPGA prototype connected via PCIe. As a result, the measurements include all costs that would also occur for an ASIC, so that the experiments provide valuable insights for future GPU architectures.

- OpenGL Extension for Stream Rewriting: This thesis proposes a novel OpenGL extension (EXT_stream_rewriting), which integrates stream rewriting into the existing programming model and adds the stream rewriting shader as a new general purpose shader stage.

- Advanced Rendering Techniques: Several advanced rendering techniques, like K-buffering, path-tracing and the recursive generation of procedural geometry, are implemented on the stream rewriting processor but require significant effort to run efficiently on current GPUs.

3. Hardware Synthesis: By generating specialized hardware modules for a particular stream rewriting machine, recursion and dynamic scheduling also become available for high-level synthesis. The compiler is based on the concepts of Section 3.3 and translates a C program into rewriting rules, which are eventually used to derive an HDL implementation. The tool chain is evaluated using several generic test cases and a more complex ray-tracing example (Section 6).

The thesis is concluded by a final review (Section 7) of the stream rewriting technique and a summary of the achieved improvements. In addition, a list of currently unsolved problems in the concept and further technical limitations is presented in this section.

1.4 Publications

Many of the results presented in this thesis have been published previously. The following list contains the most significant publications on stream rewriting and their relation to individual sections of this thesis:

- Hardware synthesis of recursive functions through partial stream rewriting [83]: This paper describes a high-level compiler that generates hardware for specialized stream rewriting machines. In contrast to existing tools, the system supports all types of direct and indirect function calls as well as recursion. The SRM in this paper corresponds to the basic model, and the results of the experiments are presented in Section 4. Several smaller functions and a more complex ray-tracing example have been synthesized and evaluated. The author of this thesis has developed the main concept and the implementation, which are analyzed in [83].

- Dynamic task mapping onto multi-core architectures through stream rewriting [84]: This paper takes the opposite direction and puts the focus on generality. Consequently, stream rewriting has been implemented entirely in software on a many-core system and the token stream is stored in shared memory. A comparison to [83] shows improved scalability. The author of this thesis is responsible for the concept of mapping task graphs into rewriting rules (Section 3.1) and for the experiments, which are presented in Section 4.2.

- A novel graphics processor architecture based on partial stream rewriting [85]: In this paper, the author of this thesis presents a graphics processor based on functional programming, which utilizes the concept of stream rewriting for task management (Section 3.1). In particular, this architecture introduces the stack-based scheduling (Section 2.3.3), instancing, and environments (Section 2.4). The author of this thesis has developed both the concept and the implementation of this paper.

- A Programmable Graphics Processor based on Partial Stream Rewriting [86]: The graphics processor has been redesigned for improved performance. Similar to existing GPUs, this version is based on SIMT cores, supports a wide data stream and the parallel dispatch of data parallel threads, and has been evaluated using a larger set of test cases. In addition, this design establishes the correct ordering of write operations by placing them on the stream (Section 5.4). The author of this thesis has developed the concept, the hardware design of the graphics processor, the required software, and is also responsible for the presented experiments.

- System Level Synthesis of Many-Core Architectures using Parallel Stream Rewriting [87]: This paper extends the concept of stream rewriting to general purpose many-core architectures with up to 128 processors on an FPGA prototype. In order to provide sufficient bandwidth, the hierarchical stream rewriting network employs parallel decoding and dispatch of the token stream [87]. The author of this thesis has developed the main concept and the implementation.

The following list contains the relevant publications of the author:

[88] L. Middendorf and C. Haubelt, "Scheduling of Recursive and Dynamic Data-Flow Graphs using Stream Rewriting," in Proceedings of the Special Edition on Data-flow Programming Models and Machines (MPP'14), Paris, France, October 2014.
[87] L. Middendorf and C. Haubelt, "System Level Synthesis of Many-Core Architectures using Parallel Stream Rewriting," in Proceedings of the 2014 Electronic System Level Synthesis Conference (ESLsyn), San Francisco, CA, 2014.
[89] C. Haubelt, F. Ludwig, L. Middendorf and C. Zebelein, "Using Stream Rewriting for Mapping and Scheduling Data Flow Graphs onto Many-Core Architectures," in Proceedings of the Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, November 2013.
[85] L. Middendorf and C. Haubelt, "A novel graphics processor architecture based on partial stream rewriting," in Design and Architectures for Signal and Image Processing (DASIP), 2013 Conference on, Cagliari, 2013.
[86] L. Middendorf and C. Haubelt, "A Programmable Graphics Processor based on Partial Stream Rewriting," Computer Graphics Forum, vol. 32, no. 7, pp. 325-334, 2013.
[84] L. Middendorf, C. Zebelein and C. Haubelt, "Dynamic Task Mapping onto Multi-Core Architectures through Stream Rewriting," in Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII), 2013 International Conference on, Agios Konstantinos, 2013.
[83] L. Middendorf, C. Bobda and C. Haubelt, "Hardware synthesis of recursive functions through partial stream rewriting," in Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, San Francisco, CA, 2012.

2 Stream Rewriting

Stream rewriting is a novel method for dynamic scheduling and binding. In this section, a formal model of computation for stream rewriting (Section 2.1) is defined, which is called the stream rewriting machine (SRM). Section 2.2 specifies the parallel rewriting of the stream by partitioning (Section 2.2.3) and binding (Section 2.2.4) different segments to concurrently working processing units. In particular, the stream scheduling (Section 2.3) describes the selection of tokens, which are considered for rewriting at a particular instant. Stream compression (Section 2.4) is proposed as a technique to reduce redundant information in order to decrease the size of the stream. On top of the purely functional SRM, the usage of memory access and side effects is specified in Section 2.5.

2.1 Stream Rewriting Machine

The stream rewriting machine (SRM) is an abstract model of computation.

2.1.1 Definitions

An SRM can be defined as:

Definition D1: A stream rewriting machine (SRM) is a 3-tuple (s, F, r) consisting of the following elements (Figure 8):
1. a stream s of tokens, which stores the state of the system,
2. a set of functions F representing the behavior of the system, and
3. a rewriting function r that maps a stream to a subsequent stream.

1. Stream: Formally, a stream s := ⟨t₁, …, tₙ⟩ ∈ Σ∗ is defined as a word from an alphabet Σ of tokens. In the case of a finite stream, the elements of s can be accessed as tᵢ ∈ Σ and the length of the stream is denoted as |s| := n. In addition, two streams a := ⟨a₁, …, aₙ⟩ and b := ⟨b₁, …, bₘ⟩ are concatenated via the associative operator +:

a + b := ⟨a₁, …, aₙ, b₁, …, bₘ⟩ with |a + b| := |a| + |b|   (1)

Further, the empty stream ε := ⟨⟩ has the length |ε| = 0 and acts as a neutral element for the concatenation, so that the triple (Σ∗, +, ε) specifies a monoid.

2. Functions: The program is given by a finite set of functions F := {f₁, …, fₖ}, each of them mapping m input tokens to n output tokens with fᵢ: Σᵐ → Σⁿ.

3. Rewriting: The rewriting function r: Σ∗ → Σ∗ maps the current state of the system to a successor state by invoking executable tasks from F and replacing at least one token in the stream if possible. Hence, an infinite sequence of streams or system states is given by:

sᵢ₊₁ := r(sᵢ)   (2)

This iterative process has already been illustrated by Figure 8. In fact, each iteration in the SRM derives a new stream sᵢ₊₁ from the previous one sᵢ.

If the stream does not change between two iterations, so that sᵢ₊₁ = sᵢ, a fixed point has been reached and there will be no further modifications. Hence, the final result is defined as the limiting value of the sequence:

r∞(s) := limᵢ→∞ rⁱ(s)   (3)
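As a worked illustration using the example of Figure 9 (and Figure 14 below): if s₀ is the stream encoding 1 ⋅ 2 + 3 ⋅ 4, then r(s₀) = s₁ evaluates both multiplications, r(s₁) = s₂ = ⟨14⟩ evaluates the addition, and r(s₂) = s₂ because a stream consisting only of literals contains no executable pattern. The sequence therefore reaches a fixed point after two iterations, and r∞(s₀) = ⟨14⟩.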

2.1.2 Stream Operations

In this section, additional stream operations are defined, which will be required to specify the behavior and several extensions of the basic SRM in the following sections. Two streams a := ⟨a₁, …, aₙ⟩ and b := ⟨b₁, …, bₘ⟩ are identical if and only if they have the same length and their tokens are equal:

a = b ⟺ (|a| = |b|) ∧ (∀i ∈ [1, |a|]: aᵢ = bᵢ)   (4)

Similar to sets, there is a test whether a token t ∈ Σ is contained in a finite stream s := ⟨s₁, …, sₙ⟩:

t ∈ s ⟺ ∃i ∈ [1, n]: sᵢ = t   (5)

Likewise, a sub-stream s′ ⊆ s corresponds to a compact range in the stream s:

s′ ⊆ s ⟺ ∃u, v ∈ Σ∗: s = u + s′ + v   (6)
s′ ⊂ s ⟺ (s′ ⊆ s) ∧ (s′ ≠ s)

For instance, a sub-stream with 1 ≤ i ≤ j ≤ |s| can be constructed by applying the interval operator [i, j] on a finite stream s := ⟨s₁, …, sₙ⟩:

s[i, j] := ⟨sᵢ, …, sⱼ⟩ ⊆ s   (7)

The result of multiple subsequent rewriting operations is recursively specified as follows:

r⁰(s) := s and rⁿ(s) := r(rⁿ⁻¹(s)) for n > 0   (8)

Hence, the set of streams derived from s is specified as:

r∗(s) := { rⁿ(s) : n ∈ ℕ }   (9)

For example, the maximum length of any stream s′ ∈ r∗(s), and therefore an upper bound for the required storage, is given by sup { |s′| : s′ ∈ r∗(s) } if existent. Based on r∗, the relation (Σ∗, ≼) determines if a stream s′ can be rewritten from s:

s′ ≼ s ⇔ s′ ∈ r∗(s)   (10)

According to its definition, the relation ≼ is both reflexive and transitive and thus forms a pre-order. For example, the final result r∞(s) is always more derived than the initial stream s if existent, so that r∞(s) ≼ s. However, (Σ∗, ≼) does not conform to a partial order since the rewriting process might run into a cyclic fixed point according to the condition a = r(b) ∧ b = r(a) but a ≠ b. In this case, it follows that a ≼ b and b ≼ a for two distinct elements a ≠ b, which is insufficient for a partial order.

However, a canonical equivalence relation (Σ∗, ≡) between streams can be defined as:

a ≡ b ⇔ (a ≼ b) ∧ (b ≼ a)   (11)

It follows immediately from the definition of (Σ∗, ≡) that this relation is reflexive, symmetric, and transitive. Therefore, two streams s and s′ are equivalent if they are identical or located on a cyclic fixed point, so that both can be derived from each other. Thus, a strict order (Σ∗, ≺), comparing the equivalence classes of two streams, is given by:

a ≺ b ⇔ (a ≼ b) ∧ (a ≢ b)   (12)

As a consequence, the condition r(s) ≺ s holds true for each valid rewriting step which does not end in an infinite loop. In addition, the final stream r∞(s) is the infimum of all generated streams:

r∞(s) = inf≺ r∗(s)   (13)

2.1.3 Stream Grammar

After the introduction of the basic data structures, the function r is now specified in more detail. For this purpose, a minimal alphabet of tokens and a minimal language are specified to describe the set of all valid streams. More complex stream rewriting machines with application-specific extensions will be presented in subsequent sections. In particular, the stream s ∈ S ⊆ Σ∗ is a word from a language S with the alphabet Σ := {CALL} ∪ ℤ, which is defined by the following grammar in Extended Backus-Naur Form:

Definition D2:

stream := { expr }
expr := const | call
const := ⟨literal value ∈ ℤ⟩
call := { const } 'CALL'

Hence, each token in the stream is either the special symbol CALL or a literal value from ℤ. Likewise, each expression represents the invocation of a function fᵢ: Σᵐ → Σⁿ ∈ F := {f₁, …, fₖ}, m, n ∈ ℕ, which is identified by the literal i ∈ [1, k] and an optional list of arguments. The function fᵢ is invoked if there are at least m preceding arguments available on the stream. As a result, the tokens of the invocation are replaced by the n results of fᵢ.

Definition D3: A sequence of tokens ⟨…, a₁, …, aₘ, i, CALL, …⟩ with a₁, …, aₘ ∈ ℤ and 1 ≤ i ≤ k is called executable if the CALL token referring to fᵢ: Σᵐ → Σⁿ is preceded by at least m literal values.

Figure 13. Expression f(1) + f(2) encoded using alternative stream formats. Instead of the function identifier i, the function symbols 'f' and '+' are used to improve readability.

As a result, the rewriting process replaces the pattern

⟨…, a₁, …, aₘ, i, CALL, …⟩ with a₁, …, aₘ ∈ ℤ and 1 ≤ i ≤ k   (14)

by the following sub-stream:

⟨…, b₁, …, bₙ, …⟩ with ⟨b₁, …, bₙ⟩ := fᵢ(a₁, …, aₘ)   (15)

Essentially, the stream stores a call tree in post-order format with its arguments placed in front of the CALL token. Alternatively, a preorder format can be defined by specifying the call expression using the following grammar rule:

call := 'CALL' i { const }   (16)

Although preorder and post-order formats are equivalent, each of them offers different advantages. For example, the preorder format requires less effort for decoding since the function and therefore also the number of expected arguments is known as soon as the identifier is read. Important for the post-order format: both the innermost and also the first invocations are located at the beginning of the stream, which is most useful for the stack-based scheduling (Section 2.3.3). In contrast, the preorder layout contains the executable tasks at the end. In this thesis, both formats are utilized in different sections. As a third alternative, an in-order storage format is also possible but would require parentheses to associate arguments and CALL tokens. It is therefore not discussed further. For comparison, the expression f(1) + f(2) is encoded using these three formats and Figure 13 shows the resulting streams as well as the associated expression tree. Here, the set of functions is given by F := {f, +}, but for the sake of clarity, the identifier i ∈ {1, 2}, originally defined as an integer, is symbolically described by the name of the function.

Figure 14. Expression tree evaluated using stream rewriting: with F := {∗, +}, the stream S0 = ⟨1 2 ∗ CALL 3 4 ∗ CALL + CALL⟩ is rewritten into S1 = ⟨2 12 + CALL⟩ and finally into S2 = ⟨14⟩.

2.1.4 Examples

A minimalistic example of stream rewriting, which uses the post-order grammar, is shown in Figure 14. There are two functions defined by F := {∗, +} to implement integer multiplication as well as addition. Initially, the first stream S0 corresponds to the expression 1 ⋅ 2 + 3 ⋅ 4 and the tree on the right. Indicated by the arrows, both multiplications are ready to execute because each CALL token and the corresponding identifier are preceded by two literals. As a result, the corresponding parts of the stream are replaced by 1 ⋅ 2 = 2 and 3 ⋅ 4 = 12 to produce the next iteration S1. Most notably, both instances can be executed concurrently because they are working on separate parts of the stream, which is important to process a large amount of tasks in parallel. Also, data dependencies are handled automatically by the pattern matching. Hence, the addition has to wait for the intermediate results of the two multiplication stages, which become available in step S1. Afterwards, the addition can execute as well to produce the final result in S2. Instead of a single literal, a function can also generate a sub-stream of literal and CALL tokens to describe pending invocations. For example, a recursive function might be dynamically expanded into a sub-tree of expressions to perform its computation. This method is called continuation-passing style [90], known from functional programming, and maps natively to the execution model of stream rewriting. The example in Figure 15 calculates the Fibonacci number fib(3) via recursive stream rewriting. In iteration s₀, the initial call is placed on the stream and matching patterns are successively replaced until the final result becomes ready.

Figure 15. Rewriting of the Fibonacci function

fib(n) = fib(n − 1) + fib(n − 2)

In this example, the stream s := 〈2〉 consists of a single literal value and is therefore a fixed-point, so that rewrite(s) = s. In general, streams consisting exclusively of literal values cannot be rewritten further and thus correspond to final results.
An important observation is that the stream can both grow and shrink during the rewriting process. Hence, the recursive expansion might lead to an exponential increase of the stream size, which can be reduced to a linear growth by the stack-based scheduling (Section 2.3.2). However, not all programs terminate, and some applications like continuously running hardware systems might also never reach a fixed-point by design. In this case, there exists no element s_n with rewrite(s_n) = s_n, the sequence does not converge, and therefore the limiting value does not exist. Otherwise, the final stream usually contains the result of the computation.
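To make the post-order pattern matching concrete, the following C++ sketch performs one sequential rewriting iteration over a token stream. The Token and Stream types, the fixed arity table, and the numeric identifiers (1 for multiplication, 2 for addition) are illustrative assumptions and not part of the SRM specification.

```cpp
#include <cstdio>
#include <vector>

struct Token { bool isCall; long value; };            // literal value or CALL marker
using Stream = std::vector<Token>;

bool operator==(const Token& a, const Token& b) { return a.isCall == b.isCall && a.value == b.value; }

static Token lit(long v) { return {false, v}; }
static Token call()      { return {true, 0}; }

// Hypothetical function table: identifier 1 = '*', identifier 2 = '+', both binary.
static long   apply(long id, long x, long y) { return id == 1 ? x * y : x + y; }
static size_t arity(long /*id*/)             { return 2; }

// One iteration: replace every executable pattern <x1..xn, id, CALL> by its result.
Stream rewriteOnce(const Stream& s) {
    Stream out;
    size_t i = 0;
    while (i < s.size()) {
        if (s[i].isCall) { out.push_back(s[i]); ++i; continue; }
        size_t j = i;                                  // run of literals s[i..j-1]
        while (j < s.size() && !s[j].isCall) ++j;
        if (j < s.size() && j - i >= arity(s[j - 1].value) + 1) {
            long   id = s[j - 1].value;                // last literal is the identifier
            size_t n  = arity(id);
            for (size_t k = i; k < j - 1 - n; ++k)     // keep literals not consumed here
                out.push_back(s[k]);
            out.push_back(lit(apply(id, s[j - 1 - n].value, s[j - n].value)));
            i = j + 1;                                 // skip arguments, identifier and CALL
        } else {
            for (size_t k = i; k < j; ++k) out.push_back(s[k]);
            i = j;
        }
    }
    return out;
}

int main() {
    // S0 = <1 2 * CALL 3 4 * CALL + CALL> from Figure 14, evaluates to <14>
    Stream s = {lit(1), lit(2), lit(1), call(), lit(3), lit(4), lit(1), call(), lit(2), call()};
    while (true) {
        Stream next = rewriteOnce(s);
        if (next == s) break;                          // fixed-point: only literals remain
        s = next;
    }
    std::printf("%ld\n", s.front().value);             // prints 14
}
```

Each call to rewriteOnce corresponds to one element of the sequence S0, S1, S2 shown in Figure 14; the parallel variants discussed in Section 2.2 distribute exactly this find-and-replace step over multiple rewriting units.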

2.2 Parallel Rewriting
Based on the abstract model of stream rewriting, the advantages and problems of a parallel rewriting algorithm are analyzed. The SRM has been extended into a parallel stream rewriting machine (PSRM), shown in Figure 16. Instead of a single block, it contains multiple parallel rewriting units, which communicate via the stream rewriting network (SRN). In particular, the SRN introduces the concept of partitioning the stream into segments (Section 2.2.3), which might be evaluated concurrently. Further, the SRN is responsible for assigning these fragments to rewriting units, which is called binding (Section 2.2.4). Before these terms are defined in detail, several conceptual issues with parallel stream rewriting are discussed in Sections 2.2.1 and 2.2.2.
Figure 16. Parallel Stream Rewriting Machine

2.2.1 Concurrency of Stream Rewriting Stream rewriting is well-suited for a parallel implementation due to the following properties, which are derived from the formal specification (Definition D1 on page 18).

1.) No explicit execution order. The execution order of rewriting rules is defined implicitly by data dependencies, so that multiple rules without interdependencies can be evaluated in parallel. Thus, an implementation can choose an optimal schedule at runtime in order to maximize utilization of available processing units.

2.) Executable rewriting patterns do not overlap. According to the stream grammar, the intervals of executable rules are disjoint, which can be shown as follows. Assume that a := 〈s_i, …, s_j〉 and b := 〈s_k, …, s_l〉 are two overlapping sub-streams of s = 〈…, s_i, …, s_k, …, s_j, …, s_l, …〉 with i < k < j < l. If both a and b are executable, it immediately follows from Definition D3 (page 20) that s_i, …, s_(j−1), s_k, …, s_(l−1) ∈ ℤ and s_j = s_l = CALL. For this condition to become true, the CALL token s_j would have to lie within the interval of literals 〈s_k, …, s_(l−1)〉, so that b cannot be executable, which leads to a contradiction. As a consequence, the pattern matching can be performed in parallel on disjoint sub-streams without synchronization. Hence, if a rewriting pattern is found in one sub-stream, there will be no other executable pattern in the same range.

Figure 17. Executable tasks never overlap on the stream.

3.) Modifications of the stream are local and disjoint. According to the definition of the rewriting operation, the input pattern is replaced by the outputs of the associated function, but the rest of the stream remains unmodified. Hence, the execution of a rewriting rule affects only the corresponding input tokens. Since the input intervals of executable rules are disjoint, two concurrent rewriting rules never modify the same part of the stream in parallel. As a consequence, the rewriting process can be partitioned onto multiple cores, while the pattern detection, the evaluation, and the generation of output tokens can be performed atomically without synchronization. Figure 17 illustrates the parallel rewriting of two executable rules marked in blue and green; only the first case is possible according to the specification: neither input nor output segments can overlap, and in addition, the relative order of rewritten fragments must be preserved.

4.) Relaxed pattern matching is acceptable. The rewriting function only requires that at least one matching rule is replaced to guarantee monotonicity of the sequence (s_n). Hence, implementations with limited computing resources are free to choose a suitable, non-empty sub-set of rules that are actually executed per iteration. The stack-based scheduling (Section 2.3.2) evaluates only rules that are located within a fixed-size prefix of the stream in order to reduce the total memory requirement. In particular, implementations can stop the pattern matching at any time if their computational resources are fully utilized and invoking more tasks provides little benefit.

5.) The stream can be partitioned without exact knowledge of its contents. Based on the relaxed pattern matching, the stream can be partitioned into arbitrary segments for parallel rewriting on multiple cores. Obviously, CALL expressions will probably be split at the borders of the segments, but their rewriting can be postponed until there are no other executable tasks available. Hence, a scheduling algorithm can be implemented without accessing the token stream, which might be especially useful for architectures with distributed or non-uniform memory.

Figure 18. Reassembly of rewritten fragments

2.2.2 Implementation Challenges
In addition to the benefits of parallel stream rewriting, a concurrent implementation has to deal with the following problems:

1.) Partitioning of the stream
While the stream can be split at any position without affecting the correctness of the final result, not every scheme leads to efficient load-balancing. In particular, a coarse-grained partitioning into large blocks reduces the costs of stream management and synchronization since each core can work independently for a longer time. However, a fine-grained splitting of the stream into individual tasks leads to a better load-balancing at the expense of higher communication costs. Several approaches for stream partitioning are compared in Section 2.2.3.

2.) Stream Distribution and Scalability
Especially for many-core systems, the distribution of tokens to the processing units represents a possible performance bottleneck because the stream is specified as a central storage with global read and write access from all cores. In particular, it has to be ensured that the synchronization costs are kept low and the bandwidth is sufficiently high to keep all cores utilized. While an initial implementation might place the stream into a globally shared memory, alternative hardware architectures for stream management and distributed storage of the stream should also be considered.

3.) Efficient reassembly of rewritten fragments
Although both the pattern matching and the rewriting of sub-streams can be performed in parallel, the resulting fragments must eventually be reassembled into a single token stream to produce either an intermediate result or the final limiting value of the stream. However, the size of these fragments is unpredictable because it depends on the rewriting process itself. Figure 18 shows the rewriting of three fragments a, b, c into a', b', c'. In the case of a, the result has the same length and can be stored at the original position. In contrast, fragment b shrinks and fragment c is expanded, so that both do not fit into the original stream. As a result, the storage format of the stream should optimally support fast insertion and removal of tokens at arbitrary positions to allow an in-place rewriting of the stream. Otherwise, parts of the stream must be copied to fill gaps or to provide additional space. As an alternative solution, the size of the rewritten fragments can also be estimated or constrained. Different approaches for stream management are discussed in Section 2.2.3.

Figure 19. Partitioning and parallel rewriting of the stream.

2.2.3 Partitioning
In order to perform a parallel rewriting of a stream s, it must first be partitioned into sub-streams s_1, …, s_k, which can be evaluated concurrently. Due to the pattern matching, the stream can be subdivided almost arbitrarily into non-overlapping intervals. In particular, the partitioning can be performed without looking at the content of the stream. However, not all fragmentations are efficient and feasible. In the worst case, no segment contains an executable CALL expression (Definition D3 on page 20). In contrast to the concept of binding in the next section, no assumptions concerning the number and usage of rewriting units are made at this point. Thus, a generic algorithm for the parallel rewriting of a stream consists of the following three steps, which are also illustrated in Figure 19:

1. Split the given stream into segments.
First, the stream s is partitioned into sub-streams s_1, …, s_k with s := ∑ s_i. The number and size of these segments depend on the exact implementation, and multiple different approaches are compared in the next sections. For instance, a system with p processing units would most likely split the stream into at least k ≥ p parts to ensure complete utilization. Formally, this process is described by a function

split: Σ* × ℕ → ℕ,   (17)

which attaches a segment number to each token of the stream s := 〈s_1, …, s_|s|〉. Hence, each segment s_i can be described as the sub-stream of tokens which are tagged with number i:

s_i := ∑_{j=1..|s|} 〈s_j〉 if split(s, j) = i, 〈〉 else   (18)

Formally, the assignment for a given stream s is specified as split_s(j) := split(s, j). For instance, in the example, the stream s is partitioned into two segments s_1 and s_2:

s := 〈1, 2, CALL, 3, 4, CALL〉, split_s := {(1,1), (2,1), (3,1), (4,2), (5,2), (6,2)}

s_1 := 〈1, 2, CALL〉, s_2 := 〈3, 4, CALL〉

In addition, the function split must be monotonic, so that:

∀ j, k ∈ ℕ: j ≤ k ⇒ split_s(j) ≤ split_s(k)   (19)

As a result, each non-empty segment s_i starts at the token with index min{j : split_s(j) = i} and ends at the token with index max{j : split_s(j) = i}. Due to the monotonicity of split_s, each segment corresponds to a compact interval of the stream, i.e. a sub-stream s_i := 〈s_a, …, s_b〉. Hence, two neighboring tokens s_j and s_(j+1) in a sub-stream s_i are also adjacent in s := 〈…, s_j, s_(j+1), …〉. As a result, an executable CALL pattern in a sub-stream s_i is also a valid pattern in s and can therefore be rewritten. Important to note, the opposite argument is not true. Figure 20 shows the tradeoffs of several partitioning schemes. In case (a), the entire stream is assigned to a single segment. The uniform segmentation (b) does not require access to the stream but might cut matching CALL expressions. In contrast, the more precise heuristic in (c) starts a new segment only after a potentially matched rewriting rule.
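As an illustration of these schemes, the following C++ sketch computes segment numbers for a token stream, once with the uniform segmentation (b) and once with the CALL heuristic (c). The Token type, the per-segment size heuristic, and the function names are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

struct Token { bool isCall; long value; };
using Stream = std::vector<Token>;

// b) Uniform segmentation: the segment number 1..k depends only on the token index.
std::vector<int> splitUniform(const Stream& s, int k) {
    std::vector<int> seg(s.size());
    for (std::size_t j = 0; j < s.size(); ++j)
        seg[j] = static_cast<int>(j * k / s.size()) + 1;   // monotonic by construction
    return seg;
}

// c) CALL heuristic: a new segment is started only directly after a CALL token,
//    so that a potentially executable pattern is never cut in the middle.
std::vector<int> splitAfterCall(const Stream& s, int k) {
    std::vector<int> seg(s.size());
    const std::size_t target = s.size() / k + 1;           // rough tokens per segment
    int cur = 1;
    std::size_t count = 0;
    for (std::size_t j = 0; j < s.size(); ++j) {
        seg[j] = cur;
        ++count;
        if (s[j].isCall && count >= target && cur < k) { cur = cur + 1; count = 0; }
    }
    return seg;
}
```

Both assignments are monotonic as required by Equation (19); splitAfterCall needs to inspect the tokens but avoids cutting a pattern, matching scheme (c) in Figure 20.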

2. Parallel Rewriting
In a second step, each segment s_i is rewritten individually into s_i':

s_i' := rewrite(s_i)   (20)

3. Reassembly of the stream fragments
Hence, the parallel rewriting function can be specified as:

rewrite_par(s) := ∑_i rewrite(s_i)   (21)

Figure 20. Different partitioning schemes: a) single segment, b) uniform segmentation (k = 2), c) CALL heuristic.

Obviously, for k > 1, it has to be considered that rewrite_par(s) is in general not equivalent to rewrite(s) because CALL expressions might be split at the border between two segments. As a result, the complete pattern is not located in any segment and cannot be rewritten. For example, the stream s := 〈0, f, CALL〉, which contains a call to function f, is uniformly partitioned into s_1 := 〈0〉 and s_2 := 〈f, CALL〉, which leads to the incorrect fixed-point rewrite_par(s) = rewrite(s_1) + rewrite(s_2) = s. Hence, in order to avoid these false fixed-points, the stream eventually has to be rewritten at once. As a result, the definition of rewrite_par is extended into the general parallel rewriting function:

rewrite_gen(s) := rewrite_par(s) if rewrite_par(s) ≠ s, rewrite(s) else   (22)

2.2.4 Binding
While the partitioning defines the splitting of the stream into segments, which might be evaluated concurrently, the term binding describes the assignment of work-items to computational resources in a multi-core system. For stream rewriting, the smallest unit of work is a segment s_i, which is assigned to a rewriting unit of the PSRM (Figure 16, page 23). In particular, the binding function assigns a segment i of the partitioned stream s to a rewriting unit bind(s, i):

bind: Σ* × ℕ → ℕ   (23)

When assuming a system with p rewriting units, the rewriting of a sub-stream on a particular rewriting unit r ∈ [1, p] is denoted by the indexed function rewrite_r: Σ* → Σ*. Hence, the parallel rewriting function of Section 2.2.3 can be refined into:

rewrite_par(s) := ∑_i rewrite_{bind(s,i)}(s_i)   (24)

Similar to the partitioning schemes, several different binding strategies can be defined. For instance, the minimalistic binding bind(s, i) = 1 computes all segments on the same core regardless of the partitioning. A round-robin mechanism, which utilizes all rewriting units, can be specified as bind(s, i) = (i mod p) + 1. Also, a random binding like bind(s, i) = random([1, p]) might be possible. If all rewriting units support the same compute capabilities, there are no further constraints for binding a sub-stream. However, for heterogeneous systems with limited resources and custom hardware blocks, a specialization of rewriting units is reasonable. For this purpose, the function supports(r, j) indicates whether the rewriting unit r is able to evaluate function f_j:

supports: ℕ × ℕ → {0, 1}, (r, j) ↦ 1 if rewriting unit r supports function f_j, 0 else   (25)

The definition of this function depends on the capabilities of an actual implementation. In addition, the function contains: Σ* × ℕ → {0, 1} is specified to examine whether a stream s := 〈s_1, …, s_n〉 might contain a call to a particular function f_j:

contains(s, j) := 〈j, CALL〉 ⊆ s   (26)

By comparing the results of contains and supports, it can be asserted whether a given rewriting unit is able to evaluate a segment of the stream. Hence, the binding in a heterogeneous system must satisfy the following constraint:

∀ j ∈ [1, |F|]: contains(s_i, j) ≤ supports(bind(s, i), j)   (27)

Here, each function from the set F := {f_1, …, f_|F|} of the SRM that is required by the sub-stream s_i must also be supported by the selected rewriting unit bind(s, i).

2.2.5 Stream Rewriting Network (SRN)
The stream rewriting network (SRN), which connects the rewriting units and the token stream (Figure 16, page 23), is defined as a tuple of the split (Equation (17)) and binding (Equation (23)) functions:

Definition D4: A stream rewriting network (SRN) is a tuple consisting of a split and a binding function:

SRN := (split, bind) with split: Σ* × ℕ → ℕ and bind: Σ* × ℕ → ℕ
The split function must satisfy the following monotonicity constraint:

∀ j, k ∈ ℕ: j ≤ k ⇒ split(s, j) ≤ split(s, k)

Three examples of stream rewriting machines are presented in Figure 21, which are based on different variants of split and bind from Sections 2.2.3 and 2.2.4. The abstract network component of the PSRM (Figure 16, page 23) is now replaced by a concrete topology. The ring architecture (Figure 21a) concatenates the rewriting units in a chain, and the results of the first rewriting operation are passed to subsequent cores. Hence, each element of the recursive sequence (s_n) is partitioned into a single segment that is assigned to a different core. The example contains two rewriting units (p = 2), so that the initial stream s_0 is rewritten on unit 1, stream s_1 is rewritten on unit 2, and the third iteration s_2 is again bound to unit 1. For a concrete implementation and reasonably large streams, it can be assumed that both rewriting units are working in parallel for most of the time. An advantage of the ring topology is its simplicity since the network merely consists of queues between the rewriting units.

Figure 21. Different types of stream rewriting networks: a) ring architecture (p = 2), b) shared memory (p = 2), c) parallel distribution (p = 2).

The scalability of the ring network has been evaluated for the hardware synthesis of recursive functions (Section 6) and for many-core systems (Section 4). Since the throughput of this chain depends on the slowest rewriting unit, a worst case can occur if the size of the stream varies greatly between subsequent iterations. Since each rewriting step from s_n to s_(n+1) is performed entirely by a single core, there is one rewriting unit which is responsible for the most expensive iteration. If the computation time between iterations of the stream varies frequently, some rewriting units are always over- or underutilized. The shared-memory stream rewriting network (Figure 21b) divides the stream into as many parts as there are rewriting units. In this example, the stream is partitioned into two segments and bound to two rewriting units (p = 2). Here, the network is a crucial part and must be able to handle multiple requests from separate rewriting units in parallel. In addition, the memory of the token stream should provide sufficient bandwidth to keep the cores utilized. Even without considering a concrete implementation, the shared-memory SRM most certainly requires more hardware resources than the ring, but it performs a parallel rewriting and distributes each step equally to all rewriting units. The third architecture (Figure 21c) contains specialized router components, which switch to another rewriting unit after a CALL token has been detected. As a difference, the routers access the stream although no complete pattern matching is required. In particular, the heuristic can also be performed in parallel on a group of tokens (Section 4.3.5) and provides a more fine-grained load-balancing. As a result, the recombination of stream fragments represents a potential bottleneck. All three designs have been implemented, and a thorough comparison of scalability, performance, and resource usage can be found in Section 4.4.

2.3 Scheduling
The scheduling of the stream rewriting process can be divided into two phases. The first one is called global scheduling and takes place before the pattern matching, while the local scheduling is responsible for selecting executable expressions. Similar to the partitioning, the global scheduling is based on parameters like the size of the stream and does not require access to the tokens, while the local scheduling actually has to look into the stream. In this section, both concepts are described according to the post-order format of the stream grammar but can be adapted to the preorder format as well.

2.3.1 Local Scheduling
The selection of an executable CALL expression (Definition D3 on page 20) from the token stream is called local scheduling. Data dependencies of concurrent threads are resolved by iteratively applying find-and-replace operations, and all modifications of the stream are local. No assumptions concerning the rewriting order on the stream are made, and there is no guarantee that a specific pattern is evaluated as soon as possible. In contrast to a find-and-replace algorithm, it is not a mistake to ignore some executable CALL expressions. Hence, it is valid to check only a part of the stream, and the rewriting can be performed independently or selectively on different sections. Both aspects support the scalability of a hardware implementation since they allow to split the stream and to distribute work items to different rewriting units. In summary, the advantages of the local scheduling via pattern matching can be described as follows:

• Automatic scheduling driven by data dependencies
The execution of a CALL expression can start as soon as there is a sufficient number of preceding literal tokens, which correspond to intermediate results of already completed functions.

• Light-weight creation and completion of threads
New tasks are created by emitting the corresponding CALL expressions and do not require interaction with a global scheduler. For implementations which distinguish the two token types with a single bit, no additional costs are involved.

• Synchronization using local operations
Parallel branches of a call tree can be evaluated in parallel and synchronized via pattern matching, as shown for fib(1) and fib(2) of the Fibonacci function in Figure 15 on page 22. Most important, the recombination is a local operation on the stream and always affects a compact range of tokens. As a consequence, the synchronization of threads requires neither access to a global memory nor atomic operations.

Although not all applications can benefit from the dynamic scheduling via stream rewriting because their use case requires the predictability of a static schedule, many algorithms from the field of computer graphics fit perfectly into this model of computation (Section 5).

Figure 22. Exponential growth due to recursive expansion. Figure 23. SRM with stack-based scheduling.

2.3.2 Global Scheduling
Due to the recursive expansion, the stream can grow exponentially as shown in Figure 22. While a large stream potentially contains a huge number of matching rules, which might be evaluated in parallel, its storage requirements can quickly exceed the available memory. Hence, the size of the stream is a trade-off between concurrency and storage capabilities. However, for a particular implementation, it can be assumed that there is a minimum stream size m, which must be reached to achieve full utilization of all processing cores, but above this threshold, little performance benefit is expected. In this context, the concept of global stream scheduling describes which parts of a stream are considered for rewriting in a particular iteration. For this purpose, the function

schedule: S → S   (28)

selects a sub-stream, invokes rewrite on the fragment, and re-assembles the results. For instance, the scanning of the whole stream is described by schedule(s) := rewrite(s) and corresponds to a breadth-first search of the call tree. Hence, for a function which performs two recursive calls up to a depth of d, the maximum stream size is bounded by O(2^d). In contrast, a depth-first traversal would limit its length to O(d) but could rewrite at most one executable rule per iteration. As a tradeoff between memory consumption and performance, a mixture of depth-first and breadth-first traversal expands only executable CALL tokens within the first m tokens of the stream. Hence, for a stream s := 〈s_1, …, s_n〉, the scheduling is specified as:

schedule(s) := rewrite(〈s_1, …, s_m〉) + 〈s_(m+1), …, s_n〉 if n > m, rewrite(〈s_1, …, s_n〉) else   (29)

According to the stream grammar (Definition D2 on page 20), this approach first checks the innermost invocations at the beginning of the stream, which are most likely to be ready. For a hardware or software implementation, the inactive remainder can be pushed onto a stack and does not need to be touched in each iteration. As a result, the execution time of an iteration is independent of the total size of the stream and thus also of the total number of managed threads. The data flow model of the modified stack-based SRM (Figure 23) contains a switch after the rewriting unit, which redirects the stream to the stack if the limit of m tokens has been reached. In case of a short stream, tokens are fetched from the stack and appended to the end of the stream accordingly. As a result, the stack works as a buffer for large streams and keeps the size of the active part always below or equal to the optimal size of m tokens.
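A minimal sketch of this scheduling step, treating the rewriting function as a parameter, could look as follows; the names schedule and Rewriter as well as the Token layout are illustrative assumptions.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

struct Token { bool isCall; long value; };
using Stream = std::vector<Token>;
using Rewriter = std::function<Stream(const Stream&)>;

// Equation (29): rewrite only the first m tokens, keep the inactive remainder as-is.
Stream schedule(const Stream& s, std::size_t m, const Rewriter& rewriteOnce) {
    if (s.size() <= m) return rewriteOnce(s);
    Stream prefix(s.begin(), s.begin() + m);             // active part
    Stream rest(s.begin() + m, s.end());                 // remainder destined for the stack
    Stream out = rewriteOnce(prefix);
    out.insert(out.end(), rest.begin(), rest.end());     // reassemble in original order
    return out;
}
```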

Figure 24. The recursive expansion can produce an arbitrary number (n) of literal tokens.

Similar to the partitioning, too small values for m might cause the SRM to run into a false fixed-point. For instance, the stream 〈1, 2, 3, 1, CALL〉 describes the invocation of f_1 using the three parameters 〈1, 2, 3〉. Therefore, in case of m < 5, the SRM never considers the entire expression for rewriting. In addition, the recursive nesting of CALL invocations can produce an arbitrary number of literal tokens at the beginning of the stream. For instance, the following function is called recursively and produces a prefix of n literals (Figure 24):

f(i) := 〈i, i+1, 1, CALL〉 if i < n, 〈i〉 else, where the literal 1 denotes the identifier of f   (30)

Hence, even well-formed functions can produce intermediate streams with an unbounded amount of literal tokens at the beginning. As a result, there exists no fixed m ∈ ℕ which is sufficient for all stream rewriting machines. Consequently, further constraints have to be defined in order to guarantee the correctness of the stack-based scheduling. For this purpose, the following parameters are derived from the set of functions F.

1. Let a_max := max_{f ∈ F} args(f) be the maximum number of arguments of all functions f ∈ F. The maximum exists since the set of functions F := {f_1, …, f_n} is finite and args(f) ∈ ℕ.
2. Let d be the maximum recursion depth of any function in stream s.
According to the grammar (Definition D2 on page 20), also shown below, each invocation is constructed by a sequence of at most a_max arguments, the identifier i, and the CALL token:

stream := { expr }
expr := const | call
const := literal value ∈ ℤ
call := { expr } i 'CALL'

Hence, a run of literals has a maximum length of d ⋅ a_max + 1 tokens including the identifier i. By choosing m > d ⋅ a_max + 2, it can be guaranteed that the prefix 〈s_1, …, s_m〉 of the stream s := 〈s_1, …, s_m, …〉 contains at least one CALL token. Therefore, there also exists a first CALL token s_c with c := min{j : s_j = CALL} in the stream, which is only preceded by literals, so that the stream has the following format:

s := 〈x_1, …, x_(c−2), i, CALL, …〉 with x_1, …, x_(c−2) ∈ ℤ   (31)

As a result, the first CALL is preceded by c − 2 literals and the identifier i. It becomes executable and can be rewritten if there are at least args(f_i) arguments. In case of c − 2 < args(f_i), the stream is malformed since the requested number of arguments can never be produced from the literals x_1, …, x_(c−2). Thus, for a well-formed stream, there is always at least one executable CALL within the first d ⋅ a_max + 2 tokens. Also important, the assumptions made in this section require that literal results are removed from the stream as soon as possible.

Figure 25. Finite state machine controlling the stack. Figure 26. Saving and restoring tokens.

2.3.3 Stack Management for Global Scheduling
Formally, the behavior of the stack can be described as a finite state machine, presented in Figure 25, with the four states INIT, PUSH, REVERSE, and POP, which are specified as follows:

• State: INIT
Initially, tokens are passed through (copyToken) and increment a counter (count). If the optimal size of m tokens has been reached, indicated by count ≥ m, the stack switches into the PUSH state to store the remainder of the stream on the stack. Otherwise, if the end of the stream is reached (isEnd) without the threshold being surpassed (count < m), the stack switches into the POP state to fill up the stream.

• State: PUSH
In this state, the remainder of the stream is stored in memory (storeToken). Since the tokens are written sequentially in FIFO order, the newly appended segment must be reversed (Figure 26), so that the state machine goes into the REVERSE state when the end of the stream has been detected.

• State: REVERSE
After the remainder of the stream has been stored on the stack, the REVERSE state swaps pairs of tokens in-place (swapTokens) before returning to the INIT state. For this purpose, two pointers are initialized to the beginning and the end of the newly appended section, the latter denoting the topmost element of the stack. In each step, the associated tokens are exchanged and the pointers are moved towards each other. This reversal is necessary since the most recently pushed token is currently at the top of the stack but should be fetched last. The layout of the stack in this state is also illustrated in steps (2) and (5) of Figure 26.

• State: POP
The previously stored tokens are appended to the end of the stream (loadToken) until the optimal size of m tokens is reached or the stack becomes empty. Eventually, the stack returns to the initial state to process the next iteration.
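The following C++ sketch mirrors this state machine in software: one call builds the next active stream of at most m tokens from the rewritten stream and the spill stack. The class name, the container choices, and the Token type are illustrative assumptions; a hardware implementation would process tokens one by one instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Token { bool isCall; long value; };
using Stream = std::vector<Token>;

class StackScheduler {
public:
    explicit StackScheduler(std::size_t m) : m_(m) {}

    // Builds the next active stream of at most m tokens from the rewritten stream
    // 'in' and the spill stack, mimicking one pass of the INIT/PUSH/REVERSE/POP states.
    Stream nextActive(const Stream& in) {
        Stream active;
        std::size_t i = 0;
        for (; i < in.size() && active.size() < m_; ++i)             // INIT: pass through
            active.push_back(in[i]);
        if (i < in.size()) {
            std::size_t first = stack_.size();
            for (; i < in.size(); ++i) stack_.push_back(in[i]);      // PUSH: spill remainder
            std::reverse(stack_.begin() + first, stack_.end());      // REVERSE: fix fetch order
        } else {
            while (active.size() < m_ && !stack_.empty()) {          // POP: refill the stream
                active.push_back(stack_.back());
                stack_.pop_back();
            }
        }
        return active;
    }
    bool empty() const { return stack_.empty(); }

private:
    std::size_t m_;
    Stream stack_;
};
```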

Figure 26 illustrates the process of loading and storing tokens by showing the contents of both the stack and the stream for three iterations. Here, the threshold is set to m = 2 tokens. As a consequence, the last two of the four tokens in s_0 are pushed onto the stack (1) and will be reversed (2). In this example, the blue token 1 is expanded into 1a, 1b, and 1c from s_0 to s_1 (3). Again, the last two tokens (1b, 1c) are stored (4) on the stack and are also reversed (5). Eventually, in s_2, the stream shrinks (6) and the available position is filled up with token 1b from the stack (7). As a result of the swap, the next token to be fetched is located at the top of the stack. Also, the stored tokens are not touched again until there is enough space available on the stream. A short example of this scheduling technique for a recursive call tree is shown in Figure 27. Here, the color of each token shows the branch and the number displays its depth. In this example, the threshold is chosen as m = 4, so that the stream is expanded quickly until the workload is sufficient to fully utilize the system. From this point, only the first 4 tokens, which correspond to the inner tasks, are rewritten. Eventually, the stream shrinks again, the postponed tokens are fetched from the stack and move back into the active part. In general, the memory requirements of this technique can be analyzed as follows: If the call tree has a maximum depth of d and performs two recursive invocations per node, a breadth-first approach would expand all executable CALL expressions, which leads to a memory requirement of O(2^d) and O(d) iterations (Table 1). In contrast, a depth-first traversal expands only the first CALL per iteration and therefore requires O(2^d) iterations with a maximum stream size of O(d). The stack-based scheduling represents a tradeoff between these two approaches: As long as the system is underutilized with |s_n| < m, all executable tasks are expanded to reach the optimal operating point. If there exists an iteration with |s_n| ≥ m, this approach is equivalent to parallel depth-first traversals with a maximum depth of d. Thus, the memory requirement is bounded by O(m) and the number of iterations can be approximated by O(2^d / m). Hence, in comparison to a breadth-first traversal, there are more but smaller iterations, so that the total number of executed tasks remains the same.

Figure 27. Combination of breadth-first and depth-first evaluation

Figure 28. Stack and stream size per iteration for computing the recursive sum from 0 to 100.

For example, a typical branch-and-bound algorithm recursively subdivides a given problem into smaller tasks, which can be solved individually and later recombined. This class of algorithms is particularly well-suited for stream rewriting because it allows evaluating individual tasks in parallel. A generic example of such an algorithm is the recursive function sum(a, b), which accumulates the interval [a, b]:

sum(a, b) := sum(a, (a+b)/2) + sum((a+b)/2 + 1, b) if b > a, a else   (32)

Each invocation of sum creates two additional instances of this function, so that the number of tokens is roughly doubled per iteration (Figure 27). As a result, the total memory requirement of sum(a, b) can be approximated as O(b − a). For comparison, Figure 28 shows the size of the stack (blue) and the token stream (red) when using the proposed scheduling algorithm with m = 256. In fact, the length of the stream increases exponentially until the target size of 256 tokens has been reached. Then, the remainder is stored on the stack and the number of tokens grows linearly per iteration. In particular, the length of the active stream remains constant while the stack compensates the varying amount of tokens. At the end of the computation, both the stack and the stream shrink until all tokens are processed. As a result, the global scheduling algorithm utilizes the stack to reduce the exponential expansion into a linear growth. Important for a possible hardware implementation, the scheduling decision can be made by a relatively simple state machine (Figure 25, page 33).
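Expressed as a rewriting rule, each invocation of sum either returns its result or emits the two recursive calls and a combining addition in post-order format. The following sketch shows this continuation-passing encoding; the numeric identifiers SUM_ID and ADD_ID are illustrative assumptions.

```cpp
#include <vector>

struct Token { bool isCall; long value; };
using Stream = std::vector<Token>;

constexpr long SUM_ID = 1;   // sum(a, b)
constexpr long ADD_ID = 2;   // x + y

static Token lit(long v) { return {false, v}; }
static Token call()      { return {true, 0}; }

// Output of one sum(a, b) invocation according to Equation (32).
Stream sumRule(long a, long b) {
    if (b <= a) return {lit(a)};                          // leaf: a single literal result
    long mid = (a + b) / 2;
    return {
        lit(a),       lit(mid), lit(SUM_ID), call(),      // sum(a, mid)
        lit(mid + 1), lit(b),   lit(SUM_ID), call(),      // sum(mid + 1, b)
        lit(ADD_ID),  call()                              // '+' combines both partial sums
    };
}
```

The emitted sub-stream is again a valid post-order expression, so the pattern matching of Section 2.1 synchronizes the two partial sums automatically once their literals become available.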

Table 1. Comparison of binary tree traversal algorithms.

              Depth-First   Breadth-First   Parallel Stack
Tasks         O(2^d)        O(2^d)          O(2^d)
Memory        O(d)          O(2^d)          O(m)
Iterations    O(2^d)        O(d)            O(2^d / m)

2.3.4 Large Scale Expansion
For many-core systems with more than a hundred processor cores, an adequate number of work items or tasks is necessary to fully utilize the available computing resources. Often, it is also more efficient to generate tasks dynamically within the processor instead of loading a large stream from external memory. Hence, a mechanism must be provided which is able to quickly expand function calls at runtime. In the traditional data-parallel model of OpenCL and CUDA, a kernel function is invoked n times in parallel on a 1D, 2D or 3D grid. For the 1D case, this behavior can be replicated using the following rewriting rule that invokes the kernel function f on the interval [a, b]:

range(f, a, b) := 〈a, f, CALL, a+1, f, CALL, …, b−1, f, CALL, b, f, CALL〉   (33)

However, for large values of n := b − a + 1, this approach is not suitable since it schedules the entire expansion onto a single core. For example, if n is set to 1,000,000, this particular core will produce 3,000,000 tokens during the execution of range. Since the evaluation of a rewriting rule is atomic and cannot be interrupted by design, most of the remaining cores will be blocked until all 3,000,000 tokens are written to the stream. While multiple rules can be evaluated in parallel, the resulting sub-streams must be reassembled in the correct order. For this purpose, both hardware and software implementations usually require reorder buffers as a temporary storage. If the capacity of these buffers is exceeded, the evaluation of multiple instances is serialized, which reduces the throughput. For graphics hardware, there exist similar constraints that limit the number of outputs per shader instance. For example, the geometry shader of Direct3D can generate up to 1024 floating-point values per invocation and is therefore not suitable for large-scale data amplification. In addition, its performance decreases significantly for larger numbers of outputs, as shown by [91] and [92]. More important, the stream would also grow excessively because all 1,000,000 instances are generated at the same time, which would render the scheduling approach of Section 2.3.2 almost useless. Actually, the parallel rewriting of a token stream faces exactly the same problem because each CALL expression can produce a variable number of outputs. However, it is possible to perform the expansion recursively, so that the number of outputs per instance remains below the critical threshold and takes advantage of the stack-based scheduling. Similar to the recursive sum, the following branch-and-bound algorithm subdivides the interval [a, b] until the maximum block size c has been reached:

range(f, a, b) := 〈f, a, (a+b)/2, range, CALL, f, (a+b)/2 + 1, b, range, CALL〉 if b − a > c, 〈a, f, CALL, a+1, f, CALL, …, b−1, f, CALL, b, f, CALL〉 else   (34)
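In imperative form, the expansion rule could be sketched as follows: an interval larger than the block size is split into two pending range calls, otherwise the individual kernel invocations are emitted. RANGE_ID, KERNEL_ID, and the block size of 256 are illustrative assumptions.

```cpp
#include <vector>

struct Token { bool isCall; long value; };
using Stream = std::vector<Token>;

constexpr long RANGE_ID  = 1;   // range(a, b): expands an index interval recursively
constexpr long KERNEL_ID = 2;   // f(i): the data-parallel kernel itself
constexpr long BLOCK     = 256; // maximum number of instances emitted at once

static Token lit(long v) { return {false, v}; }
static Token call()      { return {true, 0}; }

Stream rangeRule(long a, long b) {
    Stream out;
    if (b - a > BLOCK) {
        long mid = (a + b) / 2;                         // split into two pending sub-ranges
        out = { lit(a),       lit(mid), lit(RANGE_ID), call(),
                lit(mid + 1), lit(b),   lit(RANGE_ID), call() };
    } else {
        for (long i = a; i <= b; ++i) {                 // emit the actual kernel calls
            out.push_back(lit(i));
            out.push_back(lit(KERNEL_ID));
            out.push_back(call());
        }
    }
    return out;
}
```

Since every invocation emits at most two recursive calls or one block of kernel calls, the output per instance stays bounded and the stack-based scheduling of Section 2.3.2 keeps the active stream small.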


Figure 29. Recursive expansion of 1024 instances.
For example, the recursive expansion of the interval [0, 1023] using this technique is illustrated in Figure 29. Due to the stack-based scheduling, the active part of the stream is sub-divided once per iteration, while the remaining part on the stack is left unmodified. Initially, the top-level call is evaluated on a single core, but starting from the second iteration, the expansion could potentially run on two cores in parallel. As a result, recursion is not only supported by stream rewriting but rather essential to ensure optimal task distribution and also to reduce the maximum size of the stream. As an alternative, there is also a mechanism for instancing (Section 2.4), which offers a compact encoding for data-parallel tasks. However, in contrast to the recursive expansion, it requires a separate grammar rule and therefore does not work on the basic SRM. On the other hand, the explicit instancing provides more information to the SRM and can therefore be optimized better. For example, the corresponding expression of the graphics processor (Section 5.4) takes the size of a two-dimensional region, and a hardware scheduler then distributes the entire set of tasks uniformly to the processor cores.

2.3.5 Open Systems
The initial stream s_0 might be used to describe the input of a continuously running device such as a graphics processor. In this case, there probably exists no upper bound for the number of tokens in s_0, so that the total length of the stream is either unknown or infinite. Hence, a scheduling algorithm which requires random access to the stream is not adequate in such a scenario since only a small part of the stream is accessible at any time. Instead, the system should incrementally rewrite and combine incoming tokens. For this purpose, the model of the SRM is updated according to Figure 30 and contains the novel entry and exit points. In particular, the entry point appends tokens to the stream, while results are extracted at the exit point. Results can be either literal values or more specialized tokens described in Section 2.5.1.

Figure 30. SRM with entry and exit for processing infinite streams.

As a fundamental insight and similar to the stack-based scheduling, it is actually sufficient to rewrite a finite sub-stream of a potentially infinite stream. Hence, the stream s is partitioned into a finite front part s_front and a possibly infinite remainder s_rest:

s := s_front + s_rest with |s_front| ≤ m   (35)

The first part has a maximum length of m tokens and can be passed to the rewriting function, while the remainder is left unmodified:

rewrite_open(s) := rewrite(s_front) + s_rest   (36)

The entry point of the modified SRM (Figure 30) switches between the input stream and the already circulating tokens in the ring. In particular, it appends incoming tokens to the end of the current stream as long as it is shorter than m tokens, so that in combination with the stack, the active part of the stream is always kept below this threshold. In Section 2.3.2, it has been shown that a lower bound for m can be constructed based on the maximum number of arguments and the recursion depth of all rewriting rules. As a result, the rewriting of the infinite stream can be reduced to a finite prefix of m tokens. Each optimized hardware or software implementation already defines some limits for these parameters based on the size of internal queues and local storages, so that these additional requirements impose moderate costs. For instance, to handle functions with up to 255 arguments and 32 levels of recursion, the system must be able to store a stream of ≈8192 tokens, which can be packed into 32 KB for a 32-bit SRM.
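A minimal sketch of such an entry point, assuming simple software queues, only admits new tokens while the active stream has room below the threshold m; the name feedEntryPoint and the container types are illustrative.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

struct Token { bool isCall; long value; };

// Entry point of the open SRM (Figure 30): admit input tokens only while the
// circulating stream is shorter than m, so the active part never exceeds the threshold.
void feedEntryPoint(std::deque<Token>& input, std::vector<Token>& active, std::size_t m) {
    while (active.size() < m && !input.empty()) {
        active.push_back(input.front());
        input.pop_front();
    }
}
```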

2.4 Stream Compression
Although the SRM can work on streams of arbitrary length, the storage requirements and computation time of a particular implementation should roughly correspond to the number of processed tokens. For instance, a smaller stream generally causes fewer stack operations, which usually access the slow external memory. Similarly, the performance of the abstract SRM architectures shown in Figure 21 on page 30 depends on the stream processing bandwidth. Also, for computationally less expensive tasks with a large number of arguments, the pattern matching can become a bottleneck of the architecture. As an optimization, two mechanisms are proposed which reduce the size of the stream by removing redundant data:

1. The concept of instancing improves the throughput of data parallel tasks, which are called several times using almost the same set of arguments. 2. Hierarchical environments define a set of shared variables for a branch of the call-tree and can contain different types of tasks.

Both approaches move shared data, which would otherwise have been passed as arguments, into a central location and therefore reduce the amount of temporary data on the stream.

2.4.1 Instancing
The basic model of stream rewriting already supports data parallelism through recursion. For example, the technique for large-scale expansion (Section 2.3.4) utilizes the stack to iterate over huge data sets in parallel. However, all invocations must still be written to the stream and thus require bandwidth and computation time. Hence, especially for the medium-sized leaves of a recursive expansion, it would be advantageous if a data-parallel task could be written once and distributed to a given number of rewriting cores. In the context of graphics processing, the term instancing usually describes the explicit repetition of a task and is often handled by specialized hardware for performance reasons. In particular, instancing can be used to draw a large number of similar objects with minimal CPU usage [17]. For stream rewriting, the basic grammar is extended by the following rule:

Definition D5: The evaluation of a CALL expression on the interval [0, n), n ∈ ℤ, is specified by the instantiation rule:

call_inst := { expr } n i 'CALL_INST'

As a result, the function f_i is invoked n times, once for each element of the interval [0, n), and the index of the current iteration is appended to the call. Formally, the instance expression is expanded into a sequence of calls as shown in the following rewriting pattern:

〈x_1, …, x_k, n, i, CALL_INST〉 → 〈x_1, …, x_k, 0, i, CALL, …, x_1, …, x_k, n−1, i, CALL〉 with x_1, …, x_k ∈ ℤ   (37)

For example, the two streams in Figure 31 are equivalent, but the instanced version requires only a fraction of the memory. Similar to the CALL expressions, the instantiation rule requires all arguments x_1, …, x_k to be literal values, but the exact instant of the unfolding is implementation-specific. As a result, the SRM can optimize the processing depending on the current size of the stream and in combination with the global scheduling. Hence, the instance expression is formally equivalent to a sequence of calls but allows for additional optimizations at runtime. Therefore, the SRM may switch to a recursive or partial expansion if the length of the stream has already reached the threshold for global scheduling. Likewise, an implementation may dispatch the task to several rewriting units in parallel without storing all instances in memory. Thus, CALL_INST expressions are comparable to vector instructions and signal explicit data parallelism to the underlying hardware. Instance expressions can also be nested in order to iterate over multiple dimensions or hierarchical grids. In combination with regular calls, the expansion can also be performed conditionally to skip large groups of instances and therefore accelerate recursive branch-and-bound algorithms.
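The full unfolding of Equation (37) can be sketched as follows; the Token layout and the function name expandInstances are illustrative assumptions, and a real SRM would typically expand only as many instances as fit the current scheduling threshold.

```cpp
#include <vector>

struct Token { enum Kind { LIT, CALL, CALL_INST } kind; long value; };
using Stream = std::vector<Token>;

static Token lit(long v) { return {Token::LIT, v}; }

// args = shared literal arguments x1..xk, id = function identifier, n = instance count.
Stream expandInstances(const std::vector<long>& args, long id, long n) {
    Stream out;
    for (long i = 0; i < n; ++i) {
        for (long a : args) out.push_back(lit(a));    // shared arguments are replicated
        out.push_back(lit(i));                        // index of the current instance
        out.push_back(lit(id));
        out.push_back({Token::CALL, 0});
    }
    return out;
}
```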

Figure 31. Instancing can significantly reduce the stream size for data parallel problems.

Figure 32. Expansion of a FOR/END block with min=0, max=8 and step=2.
A more general alternative to the instanced calls is the FOR/END block, which replicates the enclosed sub-stream regardless of the contained tokens (Figure 32):
Definition D6: FOR loops are described as a block of FOR/END tokens and take three literals to specify start, end, and step size of the loop counter:

for := min max step 'FOR' { expr } 'END'

INDEX tokens within the FOR loop are replaced by the counter of the loop at the given depth. A depth of zero refers to the innermost loop.

index := depth 'INDEX'

At runtime, the inner sub-stream is replicated according to the given range, and contained INDEX tokens are replaced by the current iteration number. In addition, loops can also be nested to iterate over multi-dimensional grids. For this purpose, the INDEX token also specifies a relative depth to identify the correct loop. The example shown in Figure 32 iterates from 0 to 8 with a step size of 2, so that the SRM produces the four iterations with index values 0, 2, 4, and 6.
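A simplified expansion of such a block, ignoring nested loops and non-zero INDEX depths, could be sketched as follows; the Token layout is an illustrative assumption.

```cpp
#include <vector>

struct Token { enum Kind { LIT, CALL, FOR, END, INDEX } kind; long value; };
using Stream = std::vector<Token>;

// min, max, step are the three literals preceding the FOR token; body is the enclosed
// sub-stream between FOR and its matching END.
Stream expandFor(long min, long max, long step, const Stream& body) {
    Stream out;
    for (long i = min; i < max; i += step) {              // 0, 2, 4, 6 for Figure 32
        for (Token t : body) {
            if (t.kind == Token::INDEX && t.value == 0)    // depth 0: innermost loop
                t = {Token::LIT, i};
            out.push_back(t);
        }
    }
    return out;
}
```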

Figure 33. Triangle parameters are replicated. Figure 34. Stream compression using environment.

2.4.2 Environments
The second technique for stream compression deals with the automatic management of shared data in a highly concurrent environment. Since stream rewriting supports recursive tasks, shared data can also occur at different levels of the hierarchy. For example, the rendering of a scene usually starts with one or multiple draw calls, which are successively expanded into triangles and finally rasterized into pixels. Often, each level of this hierarchy requires access to certain parameters of its predecessor steps. Hence, shared data must be replicated and passed as arguments to a large number of subsequent CALL expressions. For instance, the computations of a pixel are based on the coordinates of a triangle or the result of a triangle-setup, as shown in Figure 33. In this example, the triangle data must be replicated for each pixel and therefore introduces a significant overhead.

For this purpose, local environments allow specifying a set of shared parameters p_1, …, p_k only once for a sub-stream enclosed into a pair of LET/END tokens:

Definition D7: An environment block guarantees that the shared parameters p_1, …, p_k are initialized during the evaluation of the inner expression expr.

let := p_1 … p_k k 'LET' { expr } 'END'

Since the results of an environment might be passed as arguments to another CALL, literals at the end of a block are extracted by exchanging them with the END token:

〈x ∈ ℤ, END〉 → 〈END, x ∈ ℤ〉   (38)

Finally, empty environments are removed from the stream using the rule:

〈p_1, …, p_k, k, LET, END〉 → 〈〉   (39)

Environment blocks offer an efficient mechanism for the distribution of shared read-only data and support automatic memory management. Neither current graphics nor compute APIs provide comparable functionality: for instance, CUDA and OpenCL support data exchange between threads via shared memory, but it must be managed manually and relies on explicit synchronization. On the other hand, the aggregation of shared parameters hinders the parallel distribution of the token stream (Section 2.2.3) because the SRM has to read every environment before the contained CALL expressions can be evaluated.

2.5 Memory Access
Several applications like graphics processing require both read and write access to large datasets that cannot be stored on the stream due to their size. In addition, random memory access is necessary for some operations but does not fit well into the concept of stream rewriting, which focuses on locality instead. Since shared memory cannot be avoided completely, this section introduces a safe and predictable model for the integration of side-effects into the rewriting process. According to the formal specification, the functions f ∈ F are pure, so that the final result of the stream does not depend on the evaluation order. Actually, the scheduling of rewriting rules is based on the partial order defined by their data dependencies.

Figure 35. SRM with shared memory. Figure 36. Different types of conflicts on the stream: a) write-after-write conflicts, b) atomic increments, c) unknown dependencies.

However, if a rewriting function is allowed to read or even modify a shared memory, this partial ordering may not be sufficient to avoid conflicts. Hence, the extension of the SRM model with shared memory access (Figure 35) effectively introduces global data dependencies that are not encoded into the stream and are therefore invisible to the scheduler. Some potential conflicts are illustrated in Figure 36. For example, writes to the same address or pixel must be performed in order (Figure 36 a) and the execution of atomic increments (Figure 36 b) should not overlap. The worst case is shown in Figure 36 c): here, the dependencies of f are unknown because the corresponding CALL tokens have not been expanded yet. In general, the following options are available for handling memory access:

• Unordered Access
In the most simplistic case, the order of memory operations is either not relevant or can be ensured by inserting artificial data dependencies into the CALL expressions. This approach has been proposed for the high-level synthesis of recursive functions and the general purpose architecture based on shared memory (Section 4.2). If the SRM is based on shared memory anyway, this model does not require additional rules or hardware components. As an example, the Mandelbrot and raytracing test cases (Section 4.3.6) write each pixel of the screen only once.

• Platform-dependent Synchronization
Although the SRM does not account for memory requests, each rewriting function can still use the synchronization capabilities of the underlying implementation, like atomic operations or semaphores. In this case, the memory model of the basic SRM would be comparable to CUDA or OpenGL, which provide synchronization primitives for shared memory access but also require the manual synchronization of threads.

• Shared Environments
Alternatively, if shared data is written only once and read by multiple subsequent functions, the shared storage can be modeled using environments (Section 2.4.2). As an advantage, no external memory is required since all data is kept on the stream and will be automatically garbage-collected. Hence, this technique is most useful for exchanging temporary data between cooperating threads but infeasible for global results because environments do not persist on the stream.

While the existing capabilities of the SRM provide a reasonable compromise for some applications, they are not sufficient to handle a large number of random memory requests and read-modify-write operations, which are commonly used in graphics processing. Hence, in the next two sections, the synchronization of writes (Section 2.5.1) and the atomic execution of rewriting functions (Section 2.5.2) are specified.

2.5.1 Write Ordering
If an application is performing ordered write operations, the final stream can be interpreted as a sequence of address and data tuples that are intended to be stored in memory. The final result of the rewriting process is specified as the limiting value of the sequence:

result(s) := lim_{n→∞} s_n with s_0 := s, s_(n+1) := rewrite(s_n)   (40)

By definition, the limiting value consists of tokens which cannot be rewritten further and therefore remain on the stream. Since all rewriting operations maintain the relative alignment of the stream, the results are automatically ordered. Hence, the stream offers the possibility to return an output stream of literal values. For simple functions like Fibonacci (Section 2.1.4), which compute a scalar value, the output stream result(s) also contains a single literal (Figure 15, page 22):

result(〈3, fib, CALL〉) = 〈2〉   (41)

When evaluating multiple different functions in parallel, the results are concatenated:

result(〈3, fib, CALL, 4, fib, CALL〉) = 〈2, 3〉   (42)

Since the results correspond to CALL expressions of the initial stream, this technique is particularly useful for rendering algorithms that construct an image in multiple steps and thus are expected to maintain the original draw order. This method is applied in Section 2.5.1 and has been evaluated as part of the stream rewriting network in Section 4.3.3 as well as for graphics processing in Section 5. Especially for the post-order encoding, the pattern matching has to look ahead a large number of tokens to determine whether a literal value is left on the stream or can possibly be applied as an argument to a subsequent CALL. Thus, the basic stream grammar is extended with a special OUT token, which definitively marks the preceding tokens as final:
Definition D8: The output token is never rewritten and marks the inner expression as final result:

out := expr 'OUT'

The graphics processor in Section 5.4 utilizes a more specialized version of this generic mechanism and supports a PIXEL token, which additionally contains coordinates, color, and depth values. Instead of (address, data) tuples, the output stream can also be compressed using a run-length encoding to save bandwidth (Section 4.3.3). This technique enables ordered writes at moderate costs and fits into the functional model of stream rewriting. In addition to literal results and further invocations (CALL), a rewriting function can also return a sequence of memory writes. However, a significant disadvantage of this method is that even non-conflicting write operations are serialized and kept on the stream. Hence, in the following section, a more fine-grained approach for synchronizing both read and write operations is presented.


Figure 37. Execution orders used for synchronization. Figure 38. SRM with distributed shared memories.

2.5.2 Synchronization
In general, the SRM has to support the synchronization of arbitrary read or write operations, random memory access, and atomic operations. High-performance applications like graphics processing usually require a significant bandwidth of random and irregular read-modify-write operations into external memory. For instance, the rendering of a single pixel might generate at least four memory reads for each bi-linear texture fetch as well as two atomic read-modify-write operations of the depth and color buffers. While texture fetches are mostly unordered reads, the framebuffer usually requires atomic and, in case of blending, also ordered access. More generally, memory conflicts are side effects and cannot be determined from the stream at this point because calls encode only local data transfers. As a consequence, the grammar of the stream is extended again to explicitly provide information on global dependencies. Formally, let a and b describe two instances of a function, start(a) and start(b) the starting times of their execution, and end(a) and end(b) their end times. In addition, the functions addr(a) and addr(b) specify the memory locations which are accessed during the execution of a and b. In particular, the execution of a rewriting function is ordered according to the following three categories (Figure 37):

• Default: The rewriting function can execute without any restrictions. It can perform arbitrary memory operations and emit all types of CALL expressions. Hence, any execution times which satisfy the following condition are allowed:

start(a) ≤ end(a) ∧ start(b) ≤ end(b)   (43)

• Atomic: The execution of the rewriting functions a and b may not overlap with other instances bound to the same address. It can emit all types of tokens and CALL expressions, but the following constraint must be considered:

addr(a) = addr(b) ⇒ [start(a), end(a)] ∩ [start(b), end(b)] = ∅   (44)

• Ordered: All functions with the same address are evaluated in the order they appear on the completely expanded stream. Hence, for a stream s := s_1 + a + s_2, where any derived stream s' might contain b, the execution intervals must be disjoint and strictly ordered. In addition, an ordered function cannot emit further calls.

∃ s' ∈ S: b ∈ s' ⇒ start(a) ≤ end(a) < start(b) ≤ end(b)   (45)

The extended SRM is illustrated in Figure 38 and contains the following two additions: First, the address space of the shared memory is partitioned and each rewriting unit has exclusive access to its own block. Hence, in this example, there are four rewriting units, four blocks of memory, and consequently the address space must be divided into four segments. An actual implementation might also combine the four memory blocks into a single external memory as long as the responsibility of a certain rewriting unit for a specific address range is preserved. The second addition are the router components, which have already been shown for the parallel distribution of the stream in the generic SRM (Figure 21c, page 30) and are responsible for assigning executable rewriting functions to rewriting cores. Here, the routers perform the binding according to the memory location a rewriting function is potentially accessing. For instance, all instances which might require a synchronized access to the address range of memory block 1 are passed to rewriting unit 1. In order to describe the new semantics, the grammar of the stream is extended with the following rules for atomic and ordered CALL expressions, which explicitly state the utilized address:

Definition D9: Atomic and ordered calls are specified by the rules:

_ : = { } 'ATOMIC' 'CALL' _ : = { } 'ORDERED' 'CALL' Hence, for ordinary expressions, a rewriting function can be bound to any core but atomic functions must be executed on the rewriting unit containing the specified address. CALL Ordered functions additionally have to check if there are any preceding and not yet executed invocations on the stream. Since the address is threatened as an opaque handle by the scheduler, it may point to an actual memory location but can also refer to a memory range, a pixel column or another shared resource. The most restrictive category is the ordered execution since it guarantees that all preceding instances on the stream with the same address have been completed with . In the worst case, a rewriting function consequently has to wait until all previous defaults calls are expanded and the entire set of dependencies has been ( ) < ( ) resolved, which is similar to the mechanism of the R-Buffer [93]. However, due to the recursive expansion of the stream, memory dependencies might be unknown (Figure 36c) for some sub-stream and thus, all subsequent ordered functions must be blocked completely. As a result, the strict execution order represents the most expensive category of synchronization. Though, according to the global scheduling algorithm (Section 2.3.2),

which prefers the beginning of the stream, the waiting instances might be pushed onto the stack. As an optimization, a function without an explicit memory address can also be declared as constant to indicate that it will not introduce memory conflicts.

Definition D10: A constant function does not perform memory writes and invokes only constant functions.

call_const := { expr } 'CONST' 'CALL'

Since a constant call guarantees that it will never expand into a non-constant invocation, it does not interfere with atomic and ordered operations. For high-level languages, the constant annotation might either be specified as part of the signature, as in C/C++, or determined automatically by the compiler as part of the global optimization step.
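To illustrate the kind of operation the atomic category has to protect, the following C sketch models the depth and color buffer update mentioned at the beginning of this section as a read-modify-write sequence per pixel address. The framebuffer_t type and the function write_pixel are purely illustrative names and not part of the SRM implementation; on the SRM, binding such a function to its pixel address guarantees that two instances targeting the same pixel never interleave.

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    float    depth[640 * 480];
    uint32_t color[640 * 480];
} framebuffer_t;

/* depth test and color write form one read-modify-write sequence per pixel;
   bound to addr (atomic category), two instances with the same addr never overlap */
static void write_pixel(framebuffer_t *fb, int addr, float z, uint32_t rgba)
{
    if (z < fb->depth[addr]) {
        fb->depth[addr] = z;
        fb->color[addr] = rgba;
    }
}

int main(void)
{
    framebuffer_t *fb = calloc(1, sizeof *fb);
    for (int i = 0; i < 640 * 480; i++) fb->depth[i] = 1.0f;   /* far plane */
    write_pixel(fb, 320 * 640 + 240, 0.5f, 0xff0000ffu);
    free(fb);
    return 0;
}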

2.5.3 Discussion The concept for memory synchronization generalizes the approach of Pomegranate [94], which contains several reordering networks to reorder temporary results between subsequent pipeline stages. Otherwise, there also exist alternative techniques to ensure deterministic memory access in a concurrent environment: For general purpose CPUs in multi-core systems, atomic operations are usually implemented by locking the memory bus or as a combination of load-linked and store-conditional instructions in the MIPS processor [95]. However, these fine-grained techniques do not provide the required scalability for many-core architectures since a large number of pending requests is needed to hide the latency of the external memory. GPUs usually contain dedicated raster output stages (ROP) for blending and atomic operations [9]. Since the SRM is designed towards a more flexible software-centric approach, specialized hardware components are omitted. As an alternative, atomic operations can also be supported directly by the memory subsystem, but usually a large L2 cache is required to hide the latency of external memory, as in the Fermi design [96]. The dynamic binding of the SRM ensures the atomic execution of an entire rewriting function and therefore supports all combinations of atomic operations for both integer and floating-point data types. Most important, it also avoids the need for globally locking certain memory locations. However, as a disadvantage, the throughput of both arithmetic and memory operations is coupled more tightly, so that the distribution of the memory addresses can impact the load balancing negatively. For instance, if the majority of memory requests targets an address in block zero, also the arithmetic part of these functions is evaluated by rewriting unit zero (Figure 38, Page 45). However, such a worst case can be prevented by interleaving addresses between blocks and possibly splitting functions into arithmetic and memory sections, which are then routed independently.

3 Source Models After specifying the basic concept of the stream rewriting machine, three source models are presented in this section. At the lowest level of abstraction, a functional language directly corresponds to the execution model of the stream rewriting machine (Figure 39). In addition, task graphs and also imperative C source code can be layered on top of this functional model.

Figure 39. Source Models

3.1 Functional Programs In this section, a functional language for stream rewriting is proposed, which can be seen as an assembly language for the SRM and is based on the concepts of [85].

3.1.1 Introduction Actually, the basic execution model of stream rewriting already defines a minimalistic functional program with a grammar consisting of nested CALL expressions. Similarly to existing languages like ML [97] or Haskell [98], imperative constructs like loops and branches can be emulated using recursion. For example, the high-level compiler for hardware synthesis (Section 3.3) generates a set of rewriting rules from a C program by transforming basic blocks into separate functions and passing active variables as arguments. Also important in the context of functional programming, the rewriting of a CALL expression is strict because it requires all arguments to be evaluated before a function is entered. In contrast, lazy languages like Haskell compute arguments on demand. In the execution model of stream rewriting, the strict evaluation order is the only guarantee, which is given regarding the execution time of a function. Besides random memory access (Section 2.5), rewriting rules are mathematically pure, free of side-effects and therefore well-suited to model concurrent computations. As a result, this model exposes mostly the same properties and constraints as a strict functional programming language. Hence, programming the SRM in a functional style fits well to the underlying concepts of iterative rewriting.

3.1.2 Related Work The evaluation of functional programs using rewriting of sequences has already been shown for the language JOY [99]. However, since their tokens can contain arbitrary values including lists, this approach is not directly suitable for a hardware implementation. In addition, the work of [100] describes the relation between term rewriting systems and functional languages. Especially for multi-core systems, a stateless functional model can remove the need for atomic operations and memory coherency, which has been evaluated in [101] using a minimalistic C++ based λ-calculus. Likewise, the natural parallelism of term rewriting and the support for multiple types of concurrency has been emphasized by [102]. In fact, the authors also propose a partitioned term rewriting to automatically utilize multiple processors. The central difference to the SRM is the storage of the rewriting expression. In their system, there is a fixed set of randomly accessible cells, which store intermediate results and dependencies. Likewise, there is also a bit to distinguish between reduced and complex terms, which roughly correspond to the literal and CALL expressions. However, the SRM stores expressions as a stream and dependencies are given through relative positioning using a grammar. As an advantage, the storage for terms does not need to be allocated explicitly but instead, any expression can be dynamically inserted into the stream. The Reduceron [103] processor and its successor, the PilGRIM [104] architecture, also execute functional programs, which are based on a minimal subset of Haskell, directly in hardware. Concurrency is exploited by reducing several independent expressions in parallel, so that a function can possibly be evaluated in a single cycle, which basically corresponds to the instruction-level parallelism (ILP) also found in general purpose processors. Since the Reduceron processor is not pipelined, it is unclear how efficiently floating-point operations might be executed on this architecture. Both systems employ a garbage-collected heap to store active nodes and therefore require random memory access. In contrast, stream rewriting is based on local pattern matching. Shared data is embedded into the stream using environments (Section 2.4.2) and is also automatically garbage-collected.

3.1.3 Language Definition Similar to assembly code, the functional language is a minimalistic representation at the lowest level of abstraction and intended as a compile target. A function definition consists of a name, a parameter list and the body, containing calls to other functions or built-in operators. The following example defines a function that returns the sum of its two arguments x and y:

func f(x,y) = add_i32(x,y);

It is also possible to declare a constant variable in the global scope using the const keyword:

const PI = 3.14159265359;

Within a function, immutable local variables are supported to store temporary values:

var temp = 1;

There is also an if/else operator to select between two alternative values, used in the following example to return the minimum of x and y (clt_i32 = compare less than):

func min(x,y) = if (clt_i32(x,y)) x else y;

In addition, tuples are created using the following syntax:

var tupel = (1,2,3);

Data parallelism is modeled using the FOR expression (Section 2.4.1), which takes a loop variable, a range and a step size. In the following example, the loop is dynamically expanded into 256 writes of zero with a stride of 4 bytes:

for (addr, 0, 1024, 4) { write(addr, 0); }

The LET statement (Section 2.4.2) evaluates the inner expression under the assumption of the given assignment and redefines a constant for a particular scope. Thus, the functions g and h evaluate to g(1) = 2 and h(1) = 3 in the following example:

const c = 1;
func f(x) = add_i32(x, c);
func g(x) = let (c = 1) f(x) // g(1) = 2
func h(x) = let (c = 2) f(x) // h(1) = 3

The following example shows a part of a more complex ray-tracer (Section 5.4), which calculates the intersection between a ray p(t) := s + t ⋅ r and a fixed ground plane. It returns an intersection tuple containing a function f, which describes the color of the material, the position t on the ray, the geometry normal n, and further parameters c. If the plane is not hit, a reference to the background is returned instead.

func intersection(f, t, n[3], c[3]) = { f, t, n, c };
func background(v[3], n[3], c[3]) = 0; // background is always black
func plane_color(v[3], n[3], c[3]) = { ... compute color of plane ... }
func intersect_plane(s[3], r[3]) = {
    var t = neg_f32(div_f32(add_f32(s[1], -16.0), r[1]));
    if (cgt_f32(t, 0.0)) {
        intersection(plane_color, t, {0.0, -4.0, 0.0}, {1.0, 1.0, 1.0});
    } else {
        intersection(background, 100000.0, {0.0, 0.0, 0.0}, {1.0, 1.0, 1.0});
    }
};

3.1.4 Discussion The functional language describes the basic elements of the stream grammar (Section 2) in a textual form. It can be either used for the low-level programming of a stream rewriting machine, which is evaluated in Section 5.4 for graphics processing, or as an intermediate representation for more high-level compilers.

3.2 Task Graphs Task graphs provide an efficient model of computation for specification, analysis and implementation of concurrent applications. This section deals with the mapping of series-parallel task graphs into rewriting rules by describing both the topology and the state as a stream of tokens. As a result, nodes can be replicated on demand and also the recursive instantiation of sub-graphs is supported. Hence, the presented approach is most useful for compute-intensive applications that must adapt to frequently varying and unpredictable workload at runtime. This section is based on the papers [84] and [87].

3.2.1 Introduction Due to physical restrictions and power consumption, concurrency and parallelism are currently the most effective and in the long term the best way of accelerating a program. In this context, task graphs offer an abstract model of computation which reduces the complexity and development time of parallel applications. Instead of designing explicitly for a specific multi-core architecture or hardware platform, an abstract task graph [49] retains its concurrency and can be retargeted automatically for different environments [50] [105] [106]. A task graph is a directed acyclic graph (DAG) that contains the tasks as nodes and their dependencies as edges. An edge between two nodes specifies that the task associated with the first node must be completed before the second one can be started. Hence, the edges define a partial execution order, so that independent tasks may be evaluated in parallel on different processing units. A central challenge is to provide a mapping between tasks and processing units for each time-step that is optimal in terms of resource usage and correct with respect to the dependency relation. In addition, there are often also several other constraints on communication costs and execution time to fulfill. In general, there are offline scheduling algorithms [107] [51], which calculate a static schedule during a preprocessing step [108], and online algorithms that generate the schedule dynamically at runtime [7]. Both types have their advantages and drawbacks for a certain kind of application. A static schedule usually requires less computational overhead at runtime and leads to a more predictable behavior, which might be important for embedded systems with hard real-time requirements. On the other hand, an online scheduling algorithm can react to varying workloads at runtime while an inappropriate static schedule may either waste resources or processing time.

Figure 40. Stream rewriting of a task graph

This thesis presents a method for online scheduling of series-parallel (SP) task graphs [39], using the execution model shown in Figure 40. Contrary to arbitrary task graphs, an SP graph has a regular structure and consists of nested fork-join constructs [109]. Most important for stream rewriting, an SP graph can be decomposed into a tree of parallel and sequential computations [110]. The resulting decomposition tree (DT) is an isomorphic representation of the SP task graph but, unlike the original graph, it can be stored uniquely in preorder or post-order form. Hence, the decomposition tree can be used to serialize the graph into a stream of parallel and sequential tasks, which is not possible for graphs with arbitrary topologies, so that this restriction to series-parallel graphs is a reasonable trade-off. In addition, an SP compatible sub-graph corresponds to a compact sub-range of the stream. Thus, it is possible to expand the graph dynamically by inserting the corresponding tokens. Hence, the creation of new tasks is lightweight and does not require explicit book-keeping. This capability is especially valuable for algorithms with dynamically generated workloads like nested data-parallelism and recursive branch-and-bound techniques.

3.2.2 Related Work There already exist several different solutions for online scheduling of dynamic task graphs: A conditional task graph [111] has a static topology but individual edges can be enabled or disabled at runtime by the use of predicates. The system in [112] generates separate schedules for each alternative path, which are eventually merged. It combines advantages from static and dynamic scheduling by mapping mutually exclusive tasks on the same processing unit for improved resource sharing and reduced latency. Though, stream rewriting provides a similar functionality by creating sub-graphs at runtime. The parametric task graph [113] [114] is a different concept but pursues similar goals. Large task graphs are stored in symbolic form and expanded at runtime in order to save memory and reduce scheduling time. Stream rewriting supports the parametric expansion of task graphs but uses local pattern matching instead of a central scheduler. By streaming tasks into external memory, it is also possible to handle a large amount of instances, even when not using a symbolic form. A series-parallel (SP) task graph is constructed recursively using serial and parallel composition, which corresponds to the fork-join execution model at the application level. The restriction to the subset of series-parallel graphs allows optimizing response time and throughput of a schedule in polynomial time [115]. For stream rewriting, SP graphs have been chosen because they can be represented as a token stream and modified using rewriting rules to enable online scheduling of dynamic SP task graphs. The stream rewriting machine can be regarded as a subset of a process algebra, a different modelling technique for parallel behavior [116], and shares some similarities with it. However, the SRM trades the support of general task graphs and global events for the ability to distribute the execution arbitrarily on many-core architectures.

Although working on synchronous data-flow graphs (SDF), the out-of-order execution approach in [56] also deals with the problem of mapping tasks on a multi-core system. Though, the concept of a central buffer station for data exchange with explicit thread management is quite different from scheduling via streaming and pattern matching. Also related, the execution model of the queue machine evaluates an acyclic data flow graph by iteratively modifying a circulating stream [67]. But unlike the SRM, which embeds control information into the stream, their concept is based on data tokens. As a result, the queue machine supports neither dynamic flow control nor recursion. In addition, there are approaches which try to reduce the amount of dynamic scheduling by using static analysis [117]. By looking into the implementation of a task, production and consumption rates can often be determined, so that it can be considered synchronous. As a result, at least these parts can be scheduled using quasi-static scheduling. Another approach, which tries to decrease dynamic data-flow, is the introduction of separate modes, which define synchronous data rates for each actor. While modes can be switched dynamically, each mode has its own static schedule [118].

3.2.3 Series Parallel Graphs The initial model of the application is a task graph G := (V, E) with vertices V := {v_1, …, v_n} and edges E ⊆ V × V. In addition, the weights w: E → ℕ for each edge e := (v_i, v_j) ∈ E describe the number of parameters transmitted from a predecessor task v_i to its successor v_j. In consistence with the abstract definition of stream rewriting (Section 2.1), each task of this graph will be associated with a function that maps inputs to output values. Since each node v_i is associated with the function f_i: ℤ^p_i → ℤ^q_i, the sum of incoming and outgoing edge weights must match p_i and q_i:

p_i = ∑_{(v_j, v_i) ∈ E} w(v_j, v_i)  and  q_i = ∑_{(v_i, v_j) ∈ E} w(v_i, v_j) (46)

Since the focus has been put on the class of series-parallel (SP) task graphs [109] with a regular structure of nested fork-join constructs, the graph can be uniquely serialized into a stream. Most important, a sub-graph always corresponds to a compact sub-stream, so that all rewriting rules are local operations working on a limited number of tokens. Formally, an SP graph G := (T, E, I, O) with tasks T, edges E, inputs I and outputs O can be constructed recursively by the rules shown in Figure 41:

Figure 41. Rules for recursive construction of serial-parallel task graphs.

1.) The SP graph of a single task t is specified as:

sp(t) := ({t}, ∅, {t}, {t}) (47)

2.) Two SP graphs G_1 and G_2 are sequentially composed by inserting edges from O_1 to I_2:

seq(G_1, G_2) := (T_1 ∪ T_2, E_1 ∪ E_2 ∪ (O_1 × I_2), I_1, O_2)  with  G_1 := (T_1, E_1, I_1, O_1) and G_2 := (T_2, E_2, I_2, O_2) (48)

3.) Similarly, parallel composition is defined by the union:

par(G_1, G_2) := (T_1 ∪ T_2, E_1 ∪ E_2, I_1 ∪ I_2, O_1 ∪ O_2) (42)

The conditions I_1 ∩ I_2 ≠ ∅ and O_1 ∩ O_2 ≠ ∅ are explicitly allowed, but apart from these shared inputs and outputs, the task and edge sets of both sub-graphs must be disjoint. If both inputs share a common task t ∈ I_1 ∩ I_2, it is used as a synchronization point. Similarly, there might also be one or multiple shared output nodes t ∈ O_1 ∩ O_2. Due to this construction, each SP graph can be described as a tree of leaf tasks, parallel, and sequential computations denoted by sp, par and seq expressions. It is called the decomposition tree (DT) of a series-parallel graph and retains the original dependencies between tasks. For example, Figure 42 shows an SP graph and the corresponding decomposition tree, which is specified by the term seq(par(sp(t_1), sp(t_2)), par(sp(t_3), seq(sp(t_4), sp(t_5)))). For a given SP graph, the decomposition tree can be derived in O(log n) using the parallel algorithm described in [109]. However, the decomposition is not unique, so that multiple equivalent decomposition trees might correspond to a single SP graph. For example, the sequence of tasks t_1, t_2, t_3, t_4, shown in the left of Figure 43, can be described by the following decomposition trees:

DT_1 := seq(seq(seq(sp(t_1), sp(t_2)), sp(t_3)), sp(t_4))
DT_2 := seq(seq(sp(t_1), sp(t_2)), seq(sp(t_3), sp(t_4)))
DT_3 := seq(sp(t_1), seq(sp(t_2), seq(sp(t_3), sp(t_4))))

Figure 42. Task graph and decomposition tree (DT). Figure 43. Three equivalent DTs of the same graph.
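As an illustration, the following C sketch shows one possible in-memory representation of such a decomposition tree together with a constructor for DT_3. The type and function names (dt_node_t, dt_sp, dt_seq, dt_par, dt_depth) are illustrative only and are not taken from the SRM tool-chain.

#include <stdio.h>
#include <stdlib.h>

typedef enum { NODE_SP, NODE_SEQ, NODE_PAR } dt_kind_t;

typedef struct dt_node {
    dt_kind_t kind;
    int task_id;                   /* only used for NODE_SP leaves */
    struct dt_node *left, *right;  /* sub-trees for NODE_SEQ / NODE_PAR */
} dt_node_t;

static dt_node_t *dt_new(dt_kind_t kind, int task_id, dt_node_t *l, dt_node_t *r)
{
    dt_node_t *n = malloc(sizeof *n);
    n->kind = kind; n->task_id = task_id; n->left = l; n->right = r;
    return n;
}

static dt_node_t *dt_sp(int task_id)                 { return dt_new(NODE_SP,  task_id, NULL, NULL); }
static dt_node_t *dt_seq(dt_node_t *a, dt_node_t *b) { return dt_new(NODE_SEQ, 0, a, b); }
static dt_node_t *dt_par(dt_node_t *a, dt_node_t *b) { return dt_new(NODE_PAR, 0, a, b); }

/* depth of the tree = minimum number of rewriting iterations to reach the fixed-point */
static int dt_depth(const dt_node_t *n)
{
    if (n->kind == NODE_SP) return 1;
    int l = dt_depth(n->left), r = dt_depth(n->right);
    return 1 + (l > r ? l : r);
}

int main(void)
{
    /* DT_3 from the text: seq(sp(t1), seq(sp(t2), seq(sp(t3), sp(t4)))) */
    dt_node_t *dt3 = dt_seq(dt_sp(1), dt_seq(dt_sp(2), dt_seq(dt_sp(3), dt_sp(4))));
    printf("depth of DT_3: %d\n", dt_depth(dt3));   /* prints 4 */
    (void)dt_par;
    return 0;
}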

The choice of a particular tree is implementation dependent but in general, the DT can be characterized according to its depth and the number of temporary nodes, which are required during the rewriting process. In particular, the depth of the DT represents the minimum number of iterations until a fixed-point can be reached and the number of temporary nodes corresponds to the length of the intermediate stream. As a result, there is a tradeoff between the execution time, measured in iterations, and the storage requirements. In general, a smaller number of temporary nodes leads to a smaller stream and a more shallow tree reaches the fixed-point in fewer iterations. In this example, DT_1 requires the expansion of the whole tree before the leaf t_1 can be rewritten. Hence, the three sequential nodes are first expanded and then successively reduced while the results of t_1, …, t_4 become available. Hence, the stream must provide enough capacity to store all four tasks at once, so that the intermediate streams grow from |w_0| = 2 over |w_1| = 4 and |w_2| = 6 up to |w_3| = 8 tokens before the stream shrinks again. If the same tree is rotated into DT_3, the task t_1 is emitted in the first iteration by the sequential root node. It can execute in the next iteration and produce its successor, so that the stream length stays constant at four tokens and there is only one leaf task on the stream at any time. Although DT_2 has the lowest depth, it also takes four iterations to finish due to data dependencies, so that in this case, DT_3 is the most efficient solution. Although important, the optimization of the DT is left for future work. Most important for stream rewriting, the decomposition tree represents a scheme to serialize the task graph on the stream, so that both scheduling and modifications of the topology can be expressed in terms of rewriting rules. For this purpose, the restriction to series-parallel topologies is necessary to ensure that the graph can be transformed into a tree. In particular, there is at most one directed path between two nodes in a tree, so that regardless of the serialization scheme, each node is visited at most once. Especially important for parallel rewriting, each task is located at a single position in the stream and each sub-graph is associated with a continuous region. Hence, local modifications of the graph correspond to local modifications of the stream.

On the other hand, an arbitrary graph may contain several paths towards a single shared node, which complicates the storage format. For instance, it is possible to store the shared node only at the first appearance and embed references into the stream, which has been proposed for the rewriting of binary decision diagrams (BDD) in [119]. However, using references would require a global dictionary of nodes and an appropriate naming scheme, which directly conflicts with the approach of local rewriting rules. In particular, the creation and synchronization of threads would always require global access, while these operations can currently be performed locally on a stream fragment. Hence, stream rewriting omits arbitrary topologies in favor of improved concurrency. If arbitrary connections are necessary for a particular use case, they can also be implemented manually as queues in memory. In addition, for many use cases like graphics processing, an SP topology already offers a significant advantage in comparison to current rendering pipelines (Section 5).

3.2.4 Functional Model The SP task graph can be translated into an equivalent decomposition tree (DT) consisting of nested sp, seq or par operations. Before specifying rewriting rules, the tree will first be transformed into a functional program. While the leaf nodes sp(t_i) with i = 1 … n can be immediately associated with functions f_i, the remaining parallel (par) and sequential (seq) compositions must also be converted into a functional form. Let n = |T| be the number of tasks, k be the number of sequential nodes and m be the number of parallel nodes in the DT. The functional program P is given as a finite set of functions:

P := {f_1, …, f_n, seq_1, …, seq_k, par_1, …, par_m} (49)

The execution model of stream rewriting is self-timed, so that a sequential ordering must be enforced using data dependencies in case of seq and a barrier at the end of a par operation. For this purpose, a strict evaluation order is assumed, which guarantees that all arguments of a function must be available before execution is started: Let Φ denote the function of an sp, seq or par operation, which will be recursively defined as follows:

1) Leaf Task A leaf node sp(t_i) with task t_i is mapped directly to the associated function:

Φ(sp(t_i)) := f_i (50)

2) Sequence Similarly, a sequential node seq(G_1, G_2), with two sub-graphs G_1 and G_2, corresponds to a functional composition. Due to data dependencies, the function of the second sub-graph Φ(G_2) will be evaluated after the first sub-graph Φ(G_1):

Φ(seq(G_1, G_2)) := Φ(G_2) ∘ Φ(G_1) (51)

3) Parallel Composition For the parallel composition par(G_1, G_2), two functions g_1: ℤ^a1 → ℤ^b1 and g_2: ℤ^a2 → ℤ^b2 for both sub-graphs G_1 and G_2 are defined as:

g_1 := Φ(G_1)  and  g_2 := Φ(G_2) (52)

The input vector x := (x_1, …, x_a1+a2) of the par node contains the arguments for g_1 and g_2, which can be evaluated in parallel to produce the vector y := (y_1, …, y_b1+b2) according to:

(y_1, …, y_b1) := g_1(x_1, …, x_a1)  and  (y_b1+1, …, y_b1+b2) := g_2(x_a1+1, …, x_a1+a2) (53)

Finally, the strict evaluation order synchronizes both data paths via the identity function id:

Φ(par(G_1, G_2)) := (x_1, …, x_a1+a2) ↦ id(y_1, …, y_b1+b2) (54)

Here, the usage of id ensures that all y_1, …, y_b1+b2 are evaluated to guarantee that both g_1 and g_2 are completed before the parallel process finishes. Hence, the end of the parallel composition enforces a barrier for the computations of both sub-graphs. Important to note, without the strict evaluation order of stream rewriting, this assumption cannot be made. As a result, the application, initially given as an SP task graph, is translated into the functional program P and can be evaluated via stream rewriting.
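The following C sketch illustrates this functional model under the simplifying assumption that every task maps a single integer to a single integer: seq corresponds to function composition and par makes both results available before the joint computation continues. All names are illustrative and not part of the SRM implementation.

#include <stdio.h>

typedef int (*task_fn)(int);

/* Phi(seq(G1, G2)) = Phi(G2) o Phi(G1) */
static int eval_seq(task_fn g1, task_fn g2, int x) { return g2(g1(x)); }

/* Phi(par(G1, G2)): both halves are evaluated before the results are used (barrier) */
static void eval_par(task_fn g1, task_fn g2, int x1, int x2, int *y1, int *y2)
{
    *y1 = g1(x1);
    *y2 = g2(x2);
}

static int inc(int x) { return x + 1; }
static int dbl(int x) { return 2 * x; }

int main(void)
{
    int y1, y2;
    printf("seq: %d\n", eval_seq(inc, dbl, 1));   /* dbl(inc(1)) = 4 */
    eval_par(inc, dbl, 3, 3, &y1, &y2);
    printf("par: %d %d\n", y1, y2);               /* 4 and 6 */
    return 0;
}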

3.2.5 Rewriting Rules The basic model of stream rewriting (Section 2.1.1) consists of nested CALL expressions and is therefore already sufficient to represent the call tree of the functional program P. A more detailed functional language can be constructed using additional tokens. Also important in this context, functions can emit not only literals but also CALL tokens themselves to represent pending invocations. This technique is called continuation-passing style [90] and maps natively to the execution model. For instance, a single leaf task sp(t_i) is mapped trivially to the following sub-stream:

〈…, x_1, …, x_p_i, f_i, CALL, …〉 (55)

Likewise, the sequence of tasks t_i and t_j is translated into a functional composition of the corresponding functions f_i and f_j, which can be described by the pattern:

〈…, x_1, …, x_p_i, f_i, CALL, f_j, CALL, …〉 (56)

Here, the outputs of f_i are supplied as arguments to f_j and the strict evaluation order ensures that f_j cannot start execution until f_i is finished. For example, let the two tasks a and b be defined as f_a(x) = x + 1 and f_b(x) = 2x. According to the previous definition, the sequential composition is given as the new task c with:

f_c(x) := 〈x, a, CALL, b, CALL〉 (57)

The invocation of c with the argument x = 1 and the identifiers a = 1, b = 2 and c = 3 leads to the following sequence of streams (58):

w_0 := 〈1, 3, CALL〉
w_1 := 〈1, 1, CALL, 2, CALL〉
w_2 := 〈2, 2, CALL〉
w_3 := 〈4〉

Similarly, the parallel execution of two tasks g_1 and g_2 with a_1 and a_2 arguments, which are eventually joined at a node j (Figure 41 on page 53), can be specified by the definition:

〈x_1, …, x_a1, g_1, CALL, x_a1+1, …, x_a1+a2, g_2, CALL, j, CALL〉 (59)

An example for the parallel composition with g_1 := ∗, g_2 := ∗ and j := + is shown in Figure 14 on page 22. Here, both multiplications can be computed in parallel and their result is joined by the addition task.
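The rewriting of the sequential composition from equations (57) and (58) can be simulated with a few lines of C. The sketch below assumes a simplified post-order encoding in which a call is a single CALL(id) token preceded by its one literal argument; the token layout and helper names are illustrative and not the exact SRM format.

#include <stdio.h>

enum { LIT, CALL };
typedef struct { int kind, val; } tok_t;        /* val = literal value or function id */
enum { ID_A = 1, ID_B = 2, ID_C = 3 };          /* a(x)=x+1, b(x)=2x, c = b after a */

/* rewrite the leftmost executable call; returns the new length or -1 at a fixed-point */
static int step(tok_t *s, int len)
{
    for (int i = 1; i < len; i++) {
        if (s[i].kind != CALL || s[i - 1].kind != LIT) continue;
        int x = s[i - 1].val, id = s[i].val;
        tok_t out[4]; int n = 0;
        if (id == ID_A) out[n++] = (tok_t){LIT, x + 1};
        if (id == ID_B) out[n++] = (tok_t){LIT, 2 * x};
        if (id == ID_C) {                        /* expand into the calls of a and b */
            out[n++] = (tok_t){LIT, x};
            out[n++] = (tok_t){CALL, ID_A};
            out[n++] = (tok_t){CALL, ID_B};
        }
        /* splice: replace the two matched tokens s[i-1..i] by out[0..n-1] */
        if (n >= 2) { for (int j = len - 1; j > i; j--) s[j + n - 2] = s[j]; }
        else        { for (int j = i + 1; j < len; j++) s[j + n - 2] = s[j]; }
        for (int j = 0; j < n; j++) s[i - 1 + j] = out[j];
        return len + n - 2;
    }
    return -1;
}

int main(void)
{
    tok_t s[16] = { {LIT, 1}, {CALL, ID_C} };    /* w0 = <1, c> */
    int len = 2, next;
    while ((next = step(s, len)) != -1) len = next;
    printf("result: %d\n", s[0].val);            /* prints 4 = b(a(1)) */
    return 0;
}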

3.2.6 Dynamic Task Graphs For applications with a varying amount of workload, static task graphs might either over- or underestimate the actual computational requirements. In addition, recursive algorithms require starting new tasks dynamically and parallelizing the execution of branches. Also, it should be possible to select different tasks at runtime, similar to a conditional task graph. Currently, a task is evaluated to a tuple of integers, which are then passed to the next instance via sequential or parallel nodes. By redefining a task as f_i: ℤ^p_i → Σ*, it can return any sequence of tokens, which may be constants but also entire sub-graphs (Figure 45).

Figure 44. Dynamic task graphs. Figure 45. Dynamic expansion of sub-graphs.

Hence, the principles used for the composition of the initial stream in Section 2 can also be applied dynamically. Although the graph is dynamic, it is still encoded on the stream and remains series-parallel at any time. However, the ability to expand tasks via stream rewriting at runtime leads to the following additional use-cases (Figure 44), which are not possible with static SP task graphs:

 Conditional Task Graphs Similar to literal outputs, which are produced by emitting literals, the id of the CALL token can be calculated dynamically, too. Therefore, it becomes possible to conditionally select one or multiple successor tasks. In the following example, the task f either invokes g or h depending on the value of the argument x:

f(x) := 〈x, g, CALL〉 if x < 0
f(x) := 〈x, h, CALL〉 else (60)

 Recursion In addition, tasks can also be expanded recursively to process hierarchical data structures. Branches of identical depth are created in the same iteration and can be rewritten in parallel. For example, the task f is either expanded further via g or replaced by the leaf task h depending on the value of x:

f(x) := 〈x, g, CALL〉 if x > 0
f(x) := 〈x, h, CALL〉 else
g(x) := 〈x − 1, f, CALL〉
h(x) := 〈〉 (61)

 Finite State Machine As an extension of the conditional task graph, the dynamic selection of the successor task can also be used to implement finite state machines (FSM) on the stream. This technique might be particularly useful to virtualize a large and possibly unpredictable number of FSMs. In general, each state s_i is mapped into a task that invokes the next state s_j dynamically if the condition c_ij is true. The following example shows the basic scheme for implementing an FSM with n states and transitions c_ij (a C sketch of this scheme is shown after the list):

s_1(x) := 〈x, s_1, CALL〉 if c_11(x)  …  〈x, s_n, CALL〉 if c_1n(x)
…
s_n(x) := 〈x, s_1, CALL〉 if c_n1(x)  …  〈x, s_n, CALL〉 if c_nn(x) (62)
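As an illustration of the FSM scheme, the following C sketch emulates two states that select their successor dynamically, similar to equation (62). The state functions and the table-driven driver are illustrative placeholders that only mimic the stream behaviour on a single core; on the SRM each transition would instead emit the corresponding CALL token.

#include <stdio.h>

enum { S_EVEN, S_ODD, S_DONE };                 /* state identifiers */
typedef int (*state_fn)(int *x);                /* a state returns the id of its successor */

static int even_state(int *x) { *x /= 2;         return *x == 1 ? S_DONE : (*x % 2 ? S_ODD : S_EVEN); }
static int odd_state (int *x) { *x = 3 * *x + 1; return S_EVEN; }   /* 3x+1 is even for odd x */

int main(void)
{
    static const state_fn states[] = { even_state, odd_state };
    int x = 6;
    int s = (x % 2) ? S_ODD : S_EVEN;
    while (s != S_DONE) {
        printf("x = %d\n", x);
        s = states[s](&x);                      /* emulates emitting <x, next, CALL> */
    }
    return 0;
}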

3.2.7 Conclusion In this section, a formal approach for the translation of static and dynamic task graphs into the execution model of stream rewriting has been presented, so that the SRM is appropriate for running a large variety of different applications. In particular, also the support for dynamic task graphs requires only the basic model of stream rewriting consisting of CALL and literal tokens. Therefore, the technique specified in this section is compatible with all SRM implementations. Further, the restriction to series-parallel task graphs represents a tradeoff, which enables dynamic scheduling without central task management. It actually complies with the basic principles of stream rewriting, which aims to improve the processing of unpredictable tasks at runtime. Although this decision excludes several task graphs, which might be trivial to handle using a traditional scheduler, the dynamic and recursive expansion of nodes at runtime facilitates scenarios that cannot be described using a static approach. Several examples are evaluated using three different SRM architectures in Section 4.

3.3 C Source Code In this section, the generation of rewriting rules from a C-like language is specified. It will be used to construct a tool-chain for the high-level synthesis of recursive functions, which is based on a hardware implementation of the stream rewriting machine (Section 6).

3.3.1 Introduction The mapping of a functional language to the execution model of stream rewriting is straightforward because each CALL expression already represents the invocation of a function (Section 3.1). Hence, an imperative language like C can be compiled into rewriting rules by translating it into a functional form. The two main differences between the imperative and the functional forms are the notion of state and control flow. In contrast to C, the functional language does not allow modifying the value of a local variable. Actually, local declarations are just aliases and not associated with a mutable state or a storage location. Hence, in compliance with the underlying execution model, all modifications occur through rewriting parts of the stream. As a result, the modification of a local variable can only be implemented by passing the new value to another CALL expression. Immediately linked to the concept of a mutable state, the idea of control flow provides an implicit order for the operations of the program. However, due to the absence of side-effects, the functional model does not need to include an absolute execution order. Instead, the evaluation of expressions is implicitly controlled by data dependencies. In addition, the control flow of the program can also be seen as another mutable state variable. Hence, the translation of the C source code requires analyzing both data and control flow of the program and mapping modifications into rewriting operations. In particular, each basic block of the program can be modeled as the state of an FSM, while each state is translated into a separate rewriting rule according to Equation (62). Similarly, active local variables are passed as parameters. In addition, rewriting rules can also be further sub-divided into more fine-grained operations to fit into the primitives of the target platform.

〈CALL COUNT 0 1〉
〈OUT 0, CALL COUNT 1 1〉
〈OUT 0, OUT 1, CALL EXIT 2 1〉
〈OUT 0, OUT 1〉
Figure 46. Recursive loop counting from 0 to 1

For example, for hardware synthesis, the compiler automatically derives a set of primitive operations, which are directly implemented in hardware, and adapts the rewriting functions accordingly (Section 6.3.2). The individual steps of the compilation process are described in the following sections. Here, the preorder format of the stream grammar is used in combination with the extension for ordered writes (Section 2.5.1), which adds the OUT token.

3.3.2 Functions and Recursion Function calls are the most basic operation in the SRM and can be represented directly using CALL expressions. Due to the pattern matching, multiple nested CALL frames persist on the stream until their arguments are computed, so that all types of recursion are supported. Similar to branching, function pointers are implemented by choosing the identifier of the CALL token dynamically. The example in Figure 46 shows a recursive counter, which outputs the numbers from min to max with min, max ∈ ℤ. In order to improve readability, the function identifier is replaced by the name of the function with COUNT = 1 and EXIT = 2. Hence, the corresponding SRM program using the preorder format is given by:

F := {count, exit} (63)

count: ℤ² → Σ*, (min, max) ↦ 〈OUT, min, next, min + 1, max〉  with next := (CALL, COUNT) if min < max and next := (CALL, EXIT) if min ≥ max (64)

The function exit is called at the end of the loop and removes both arguments from the stream to terminate the thread:

exit: ℤ² → Σ*, (min, max) ↦ 〈〉 (65)

These rewriting rules can be described by the following C program, which consists solely of function calls and arithmetic expressions. In particular, the loop counter is passed as an argument to the recursive call instead of being stored in a local variable:

void exit(int min, int max);

void count(int min, int max) {
    out(min);
    void (*next)(int, int) = (min < max) ? count : exit;
    next(min+1, max);  // modify the counter for the next iteration
}

Hence, the next task is to bring an arbitrary C program into this format by removing control flow and assignments to local variables. However, it should be noted that the rewriting system does not define an explicit order of evaluation, so that all calls are asynchronous by default. The actual order is determined by data dependencies, so that inner calls are evaluated first. Synchronous calls and barriers are special cases that can be implemented by introducing artificial dependencies, which has been shown for task graphs in Section 3.2.

3.3.3 Branching and Loops The following C code shows a more high-level description of the counter example from the previous section; the corresponding control flow graph is illustrated in Figure 47.

void count(int min, int max) {
    int i = min;
    do {
        out(i);
        i = i + 1;
    } while (i < max);
}

First, each function is converted into basic blocks by translating branches and loops into conditional jumps and labels. According to the control flow graph, there are three basic blocks in this example, named block1 to block3. The function starts in block1, which directly calls the body of the loop in block2. Similar to the previous example, the current value of the loop counter is printed. Likewise, the data-dependent branch at the end is implemented by selecting between the two different successor blocks block2 or block3.

Figure 47. Mapping of flow control into rules.

The last block, block3, exits the program and therefore does not call any further functions. A verbose C representation after this intermediate step corresponds to the following code:

void count(int min, int max) {
block1:
    i = min;
    goto block2;
block2:
    out(i);
    i = i + 1;
    if (i < max) goto block2;
    goto block3;
block3:
    ;
}

As a result, each basic block is terminated by at least one unconditional jump and one or multiple conditional branches. In order to convert each of these blocks into a rewriting function, the next step eliminates local variables and parameters. For this purpose, the code is brought into a single static assignment format (SSA), where each variable is written exactly once. Therefore, each assignment generates a new variable to track the most recent value. For example, the result of the assignment i = i + 1 is stored in the new variable i_1:

void block1(int min, int max) {
    block2(min, max);
}
void block2(int i, int max) {
    out(i);
    int i_1 = i + 1;
    void (*next)(int, int) = (i < max) ? block2 : block3;
    next(i_1, max);
}
void block3(int i, int max) { }

Active local variables are added to the signature of each block. For instance, block2 reads and writes the variable i, so that it has to be included into the parameter list. As a consequence, the predecessors of block2 must supply a value for i when branching to it. In addition, also the outputs of a block are merged, so that the argument lists of all successor blocks (here {block2, block3}) are equivalent and the branch at the end can switch between two function pointers. Though, arrays of local variables are not supported since they hinder the data flow analysis and complicate the transformation into the SSA form. Finally, the C program consists exclusively of declarations and function calls. Thus, the code can be directly converted into the functional language by replacing arithmetic operations with built-in macros. In contrast to the initial example, there are three rules, but two of them are equivalent and could be merged. Also, it should be noted that the conditional if/else is actually an expression and forms the return value of block2.

The following functional program represents the result of the compilation process and can be evaluated on the stream rewriting machine according to Section 3.1:

func block1(min, max) = block2(min, max);
func block2(i, max) = {
    out(i);
    var i_2 = add_i32(i, 1);
    if (clt_i32(i, max)) block2(i_2, max) else block3(i_2, max);
}
func block3(i, max) = {};

As a consequence, a restricted subset of C can be compiled into the abstract model of the stream rewriting machine and therefore profits from the dynamic parallelism. In order to execute the program on an actual implementation, the functional language must subsequently be translated into a platform dependent format. For example, Section 6 describes a high-level synthesis tool for recursive functions, which performs further optimizations and finally outputs RTL code. Other implementations of the SRM like the general purpose many-core system (Section 4) or the graphics processors (Section 5) have not been evaluated using this approach. However, targeting these architectures would require a similar technique in order to generate code for the shader processor (Section 5.5) or to invoke a software library for stream rewriting (Section 4.2).

4 Multi-Core Architectures When designing the software and hardware architecture of many-core systems with hundreds of processors on a single chip, a central problem is the scheduling and binding of work-items to execution units. This section contains a novel synthesis flow for applications with highly dynamic and unpredictable behavior, which is based on the concept of parallel stream rewriting. Complex examples, evaluated using an FPGA prototype, show the effectiveness of stream rewriting architectures. This section is based on [84] and [87].

4.1 Introduction Multi and many-core systems offer numerous benefits like reduced energy consumption and latency, as well as improved throughput for both high-performance and low-power applications. Scalability is achieved by parallelization instead of high clock rates and can therefore overcome technological limitations like thermal heat or power issues [1]. While functional units and also processor cores can be replicated at the expense of increased area, the utilization of the additional resources remains a more fundamental problem [12]. Hence, beside the design of the actual hardware architecture, also the programming raises several challenges. In particular, the binding and scheduling of tasks to processor cores as well as the efficient communication and synchronization at the system level are not yet solved in general [7]. Hence, task mapping is often a trade-off between an optimal solution and application specific heuristics [13]. Conventionally, parallelism is exploited by partitioning an application into tasks that can run on different processor cores and communicate via shared memory or message passing. In particular, the binding of tasks to processor cores can either be calculated as a preprocessing step [51] or dynamically at runtime. Usually, the set of tasks contains interdependencies, described as a directed graph [49], which must be considered for scheduling and binding the application on the target architecture [50]. For a static application model [120], specified as a graph of task nodes and its dependencies, an optimal schedule can be pre-computed at design time [108]. However, in case of embedded systems, which are interacting with dynamic environments, often an adaptive mapping [37] is required to account for varying workloads. Figure 48 shows the design flow for many-core systems based on stream rewriting. The application is first modelled as a task graph and translated into a set of rewriting rules and an initial stream, which has been discussed in Section 3.2. In this section, the hardware and software architectures for stream rewriting machines are designed and evaluated. Since the system must be able to handle a large and unpredictable number of dynamically created tasks, the majority of static optimizations, which are based on an extensive knowledge of the behavior at design time, cannot be applied in this case.

Figure 48. Design flow for many-core architecture



Figure 49. SRM using ring or shared memory. Figure 50. Stream rewriting network with routers (R).

In order to analyze the consequences of different design decisions, three separate architecture templates for many-core systems are evaluated. The first design is based on the ring architecture (Figure 49 top), which has already been shown in Section 2.2. The rewriting cores are implemented as general purpose processors and connected via queues storing the token stream. The second design (Figure 49 bottom) is also based on general purpose processors, but in this case, the token stream is stored in an external shared memory and a small circular communication channel is used only for synchronization. The advantage of the shared memory design is simplicity, generality and portability since it does not require custom hardware components. Especially important for the wide variety of embedded SoCs, it can be ported to different platforms and performs the stream rewriting entirely in software. Despite its flexibility, the shared memory design clearly exhibits a potential bottleneck for the performance of the system. Though, the experiments show scalability for a moderate number of cores and reasonably coarse-grained tasks. On the other hand, the third design employs the stream rewriting network (SRN) (Figure 50) to connect a set of general purpose processor cores. It consists of several routers (R) for task distribution and reassembly of the resulting sub-streams. Also, the pattern matching is performed in parallel by specialized hardware components, while the processors are responsible for executing the detected tasks. Therefore, this architecture combines the performance advantages of application-specific hardware with the flexibility of software on off-the-shelf general purpose processors. As a result, the experiments show the scalability of the SRN for up to 128 cores. Thus, the SRN is more likely usable for many-core architectures and represents an alternative to the classic processor arrays connected via a mesh network [11] and dataflow machines [65]. Both architectures use the post-fix format of the stream grammar (Section 2.1.3) without extensions, so that tasks are described as the pattern 〈x_1, …, x_n, id, CALL〉 with arguments x_1, …, x_n ∈ ℤ. Tasks are implemented as the functions of a C program, whose entry point is given by the identifier id, so that the original tool-chain of the corresponding processor can be utilized. In the next sections, the shared memory architecture and the SRN are presented in more detail, while the more simplistic ring is mainly used for comparison (Section 4.4).

4.2 Shared Memory Architecture In this section, an algorithm for parallel stream rewriting using shared memory is described.

4.2.1 Parallel Rewriting In the shared-memory architecture, the token stream is stored in a global memory and can be accessed by a set of general purpose processor cores. All cores are running the same software and there is a circular channel for synchronization. Hence, the stream can be partitioned into blocks and each core rewrites a specific segment (Section 2.2). A uniform splitting of the stream into blocks of the same size is illustrated in Figure 51. However, in addition to the problem of false fixed-points, which has already been resolved, a concrete implementation has to consider the following challenges:

 Stream Modification The size of the stream naturally grows and shrinks during the rewriting process. Similarly, the length of a particular segment is most likely different after a rewriting step, so that it does not fit into its previous storage location. However, inserting or deleting a token segment in the middle of a large stream is costly, involves copy operations and would require an explicit synchronization of all cores.

 Stream Partitioning The partitioning of the stream into equally sized segments guarantees optimal load-balancing if the tasks are distributed uniformly. However, for recursive problems, it has been shown that the innermost and therefore executable tasks are either located at the end or the beginning of the stream (Section 2.3.3). If the block size is chosen too large, a certain core might receive all executable tasks, while the rest of the system is only copying non-matching tokens. On the other hand, a block which is too small might contain too few rewriting rules to be efficient. Hence, it can be expected that the optimal block size must be adapted dynamically.

 Memory Coherence Since all processors access non-overlapping regions of the stream, no direct conflict occurs. However, two neighboring tokens, which are written by distinct cores, can belong to the same cache line. When using incoherent caches with a write-back policy, it is possible that different versions of a cache line from separate cores overwrite each other. Hence, if the platform does not offer memory coherence, data caches must be configured as write-through and explicitly invalidated after a rewriting step.

Figure 51. Partitioning and parallel rewriting of the stream.

Figure 52. Software and hardware architecture. Figure 53. Phases and synchronization of the cores.

The software, running on each core, is split into the application layer, which defines the tasks, and a kernel, which is responsible for the actual stream rewriting (Figure 52). It runs an infinite loop and implements a parallel stream rewriting algorithm consisting of the five steps Partition (P), Decode (D), Allocate (A), Execute (E) and Synchronize (S). If a task is detected as executable, the application is invoked to produce the resulting tokens. The software image is exactly the same on all cores in order to keep function pointers interchangeable. The kernel receives synchronization data from the preceding processor via the communication channel, performs the required computations and forwards the results. In particular, the token stream is organized as a large FIFO in external memory and the circular communication channel is used to exchange its address range as the 4-tuple fifo := (start, end, read, write) ∈ ℤ⁴. Here, start and end define the dimensions of the FIFO, while read and write specify the corresponding read and write pointers. Actually, the FIFO tuple is the only shared variable, which is simultaneously accessed by multiple cores, and it is used to allocate non-overlapping regions on the stream that do not require protection. Hence, the tuple acts as a shared token, which is owned by exactly one of the cores and passed around for synchronization. The five steps of the algorithm are illustrated in Figure 53 and are described in more detail in the following list; a structural sketch of the resulting kernel loop is given after the list:

1. Partition (P) In the first step, each core receives the current FIFO tuple and reserves a certain amount of input tokens. For this purpose, the size of the stream is evaluated and the optimal block size b is determined using a heuristic. If no executable task has been found, b is increased in the next iteration until one core tries to rewrite the entire stream, in order to handle the special case of incorrect fixed-points. The read pointer is moved forward and wrapped around if necessary, and finally, the modified FIFO is passed to the successor. This phase is performed sequentially, since each core has to wait for the FIFO tuple, but due to its simplicity, it takes only a few cycles.

2. Decode (D) After a core has allocated its input range, it can immediately start analyzing the corresponding token stream. It calculates the number of output tokens, which will be produced during the rewriting process, so that the output region can be allocated accordingly. Therefore, the stream is decoded using pattern matching and each task is executed in a special mode in order to retrieve the number of outputs. In contrast to the partitioning step, the decoding phase can be performed in parallel on all cores and is only limited by memory throughput.

3. Allocate (A) At this point, the sub-stream of each core, the executable tasks and the number of tokens, which will be eventually produced during the rewriting process, have been determined. As shown in Figure 51, the resulting fragments must be compacted in memory to build the stream of the next iteration. Hence, the start address of the outputs generated by a particular core depends on the number of tokens written by the preceding processors. Similar to the partitioning of the input stream, also the allocation of the output parts in the FIFO must be performed in order. Thus, in the allocate phase, each core receives the current FIFO, moves the write pointer and passes the tuple to the next processor. As a consequence, the allocation phases of different cores cannot overlap and must be serialized. However, similar to the partitioning step, the computations are inexpensive in comparison to the stream rewriting.

4. Execute (E) In the execute phase, the tasks marked in the decode phase are invoked and the resulting tokens are written into the previously allocated regions of the global token stream. The rewriting can be performed in parallel but takes most likely the largest amount of time per iteration.

5. Synchronize (S) Finally, the cores are synchronized in the last step to ensure that all cores have written their results before the algorithm starts again with the partitioning and decoding phases. Without the explicit synchronization, a read-after-write conflict is possible between the execute and decode phases of subsequent iterations.
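The following C sketch outlines the structure of this kernel loop. The FIFO tuple and the five phases correspond to the list above; the helper functions (receive_fifo, send_fifo, choose_block_size, decode_block, execute_block, barrier) are trivial stubs standing in for the platform-specific parts and are not the actual implementation.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

typedef struct { uint32_t start, end, read, write; } fifo_t;   /* shared 4-tuple */

/* --- placeholder stubs; a real port would talk to the channel and memory --- */
static fifo_t shared = { 0, 1024, 0, 0 };
static fifo_t receive_fifo(void)          { return shared; }
static void   send_fifo(fifo_t f)         { shared = f; }
static void   barrier(void)               { }
static size_t choose_block_size(fifo_t f) { uint32_t s = f.write - f.read; return s < 64 ? s : 64; }
static void   decode_block(uint32_t in, size_t n, size_t *out) { (void)in; *out = n; }
static void   execute_block(uint32_t in, size_t n, uint32_t out) { (void)in; (void)n; (void)out; }

static void kernel_iteration(void)
{
    /* 1. Partition (sequential): reserve an input range of the stream */
    fifo_t f = receive_fifo();
    size_t b = choose_block_size(f);
    uint32_t in = f.read;
    f.read += (uint32_t)b;                 /* wrap-around omitted in this sketch */
    send_fifo(f);

    /* 2. Decode (parallel): count the tokens this core will produce */
    size_t out_tokens = 0;
    decode_block(in, b, &out_tokens);

    /* 3. Allocate (sequential): reserve the output range in order */
    f = receive_fifo();
    uint32_t out = f.write;
    f.write += (uint32_t)out_tokens;
    send_fifo(f);

    /* 4. Execute (parallel): rewrite the block into the allocated region */
    execute_block(in, b, out);

    /* 5. Synchronize: all cores finish before the next iteration starts */
    barrier();
}

int main(void)
{
    for (int i = 0; i < 3; i++) kernel_iteration();
    puts("done");
    return 0;
}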

The chronological order of these five phases and the synchronization points between adjacent cores are illustrated in Figure 53. The duration of the partition phase is notated as t_P and similarly, the other time spans are named accordingly as t_D, t_A, t_E and t_S. Based on their timing, two types of phases can be distinguished. First, there are the partition (P), allocate (A) and synchronization (S) phases, which are executed sequentially but always have a fixed length:

t_P + t_A + t_S = const. (66)

On the other hand, the duration of the decode (D) phase depends on the block size b and can be approximated by t_D ∈ O(b). Similarly, if there exists a maximum delay t_max for each task, the upper bound for the length of the execution phase (E) is given by t_E ∈ O(b ⋅ t_max). Hence, in order to achieve scalability, the granularity of tasks and the block size must comply with:

t_P + t_A + t_S ≪ t_D + t_E (67)

Further, let x ∈ ℝ be the ratio between both types of phases, so that the following equation holds true:

x ⋅ (t_P + t_A + t_S) = t_D + t_E (68)

In particular, x is determined by the granularity of the threads. Coarse-grained threads, which perform more work per invocation, lead to a longer execution phase and in general, larger values of x increase the scalability of the system. The relation between the thread granularity and the number of cores is computed more formally in the following. For this purpose, the total execution time t(n, x) of an iteration using n cores is approximated as the sum of the non-parallelizable and parallelizable phases:

t(n, x) := n ⋅ (t_P + t_A + t_S) + (t_D + t_E) / n = (n + x/n) ⋅ (t_P + t_A + t_S)

Hence, for a given value of x, the speedup of n cores can be computed as:

S_x(n) := t(1, x) / t(n, x) = (1 + x) / (n + x/n) (69)

By differentiating and solving the equation dS_x(n)/dn = 0, a local maximum can be found at n := √x. As a result, for an application with granularity x, the best speedup is achieved using n := √x cores. For a larger system, the overhead of the non-parallelizable phases P, A and S becomes significant and for a smaller number of cores, there is still unused concurrency. The speedup for up to 16 cores and x ∈ {4, 9, 16, 64} is plotted in Figure 54. For x = 64, the maximum acceleration of ≈ 4 is reached with 8 cores.


Figure 54. Speedup for different cores. Figure 55. Speedup for thread granularity.
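The speedup model can be checked numerically with a few lines of C; the program below is only a sketch for verifying the formulas and is not part of the SRM software.

#include <stdio.h>
#include <math.h>

/* S_x(n) = (1 + x) / (n + x/n) according to equation (69) */
static double speedup(double n, double x) { return (1.0 + x) / (n + x / n); }

int main(void)
{
    double x = 64.0;
    double n_opt = sqrt(x);                 /* local maximum at n = sqrt(x) */
    printf("x = %.0f: best speedup %.2f with %.0f cores\n", x, speedup(n_opt, x), n_opt);
    /* prints approx. 4.06 with 8 cores, matching Figure 54 */

    /* granularity needed for efficiency alpha on n cores, equation (70) */
    double alpha = 0.9, n = 4.0;
    printf("alpha = %.0f%%, n = %.0f: x >= %.0f\n", 100 * alpha, n, (alpha * n * n - 1) / (1 - alpha));
    return 0;
}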

On the other hand, the speedup can be evaluated with respect to x in order to find the optimal thread granularity for a given system with n cores. As shown in Figure 55 for n ∈ {1, 2, 4, 8} cores, larger values of x provide better scalability. The curves asymptotically approach the maximum theoretical speedup given by the limiting value lim_{x→∞} S_x(n) = n, which is never reached for a concrete implementation. However, it is possible to determine the thread granularity which is necessary to achieve a reasonable fraction α of the maximum speedup. For this purpose, the inequality S_x(n) ≥ α ⋅ n is solved for x:

S_x(n) ≥ α ⋅ n  ⇔  (1 + x) / (n + x/n) ≥ α ⋅ n  ⇔  x ≥ (α ⋅ n² − 1) / (1 − α) (70)

For instance, in order to reach an efficiency of α = 90% using 4 cores, the granularity must be larger than (0.9 ⋅ 16 − 1) / 0.1 = 134. Similarly, for α = 50% and a system with 8 cores, the granularity must be at least (0.5 ⋅ 64 − 1) / 0.5 = 62.

Figure 56. Resource distribution of the test system with eight cores.

4.2.2 Hardware Platform The proposed architecture has been synthesized as a SoC for the ML605 evaluation board from Xilinx, which contains a Virtex 6 VLX240T FPGA and 512 MB of external DDR3 memory. The rewriting cores are implemented as MicroBlaze processors, which are configured with 16 KB caches and floating-point units and run at 100 MHz. All processors can access the token stream in the external memory via an AXI interconnect clocked at 200 MHz. The timing has been met for up to eight cores and the resource consumption in terms of six-input lookup tables (LUT), flip-flops (FF), block RAMs (BRAM) and DSP blocks (DSP48) is listed in Table 2. Each core adds roughly 6,000 LUTs to the system, but when synthesized individually, a single MicroBlaze requires ≈ 3,500 LUTs. Actually, the remaining 2,500 LUTs are spent on the connection between the processor and the external memory. In particular, the distribution of lookup tables and slice registers for a system with eight cores is illustrated in Figure 56. While the memory controller (DDR3) consumes 12% of the lookup tables and 14% of the occupied slice registers, the eight MicroBlaze cores make up half of the system. However, the other half must be dedicated to the AXI interconnect (green), which is responsible for routing memory requests between the processor cores and the DDR3 controller. Hence, the global memory access of each core is relatively expensive although it offers the convenience to implement stream rewriting in software.

Hence, stream rewriting based on shared memory is most useful for existing platforms that already contain a reasonably fast memory interconnect. However, if the synthesis of custom hardware modules is possible, the stream rewriting network (SRN), which will be presented in Section 4.3, allows for a much larger number of cores.

Table 2. Resource usage of the shared memory test system.

Component        Cores  LUT       FF        BRAM 18K/36K  DSP48E
System           1      14,874    19,446    1 / 29        5
                 2      20,309    24,795    2 / 41        10
                 4      31,494    35,463    4 / 65        20
                 6      41,695    46,117    6 / 89        30
                 8      52,737    56,772    8 / 113       40
MicroBlaze       1      ≈ 3,557   ≈ 2,814   0 / 4         5
DDR3 Controller         ≈ 7,629   ≈ 8,401   0 / 0         0

4.2.3 Recursive Tests Stream rewriting allows for dynamic scheduling of recursive functions. Therefore, the performance of the system is evaluated using the following three representative recursive functions to simulate the workload of a typical branch-and-bound algorithm:

1.)  rsum(a, b) := rsum(a, m) + rsum(m + 1, b) with m := ⌊(a + b) / 2⌋   if b > a
     rsum(a, b) := a                                                     else        (71)

2.)  fib(n) := fib(n − 2) + fib(n − 1)   if n ≥ 2
     fib(n) := n                          else        (72)

3.)  ack(m, n) := n + 1                      if m = 0
     ack(m, n) := ack(m − 1, 1)              if m > 0 ∧ n = 0
     ack(m, n) := ack(m − 1, ack(m, n − 1))  if m > 0 ∧ n > 0        (73)

The runtime of each function and the size of the token stream per iteration are shown in Figure 57 on the next page. Since the actual computations of these functions mostly consist of a few integer operations, a delay of 1,000 cycles was inserted to emulate a coarse grained task. Both the recursive sum and the Fibonacci test show a reasonable scalability of 3.0 on a system with four cores. However, the step from four to eight cores yields only a minor speedup to 4.0-4.5. On the contrary, the Ackermann function does not scale at all since mostly one thread at the end of the stream can be replaced per iteration and the rest is just copied. Further explanations can be found when looking at the size of the stream. Actually, rsum and fib expand exponentially with a peak after ≈15 iterations, but ack(3,4) is oscillating and takes more than 10,000 iterations to converge, of which only the first 200 steps are shown.
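For reference, the three workloads can also be written as ordinary recursive C functions. The following sketch matches the definitions (71)-(73) as given above; the function delay() merely stands in for the artificial busy-wait of roughly 1,000 cycles and is an assumption of this sketch:

extern void delay(unsigned cycles);     /* hypothetical busy-wait of ~1,000 cycles */

int rsum(int a, int b) {                /* balanced divide-and-conquer sum (71) */
    delay(1000);
    if (b > a) {
        int m = (a + b) / 2;
        return rsum(a, m) + rsum(m + 1, b);
    }
    return a;
}

int fib(int n) {                        /* asymmetric call tree (72) */
    delay(1000);
    return (n >= 2) ? fib(n - 2) + fib(n - 1) : n;
}

int ack(int m, int n) {                 /* sequential worst case (73) */
    delay(1000);
    if (m == 0) return n + 1;
    if (n == 0) return ack(m - 1, 1);
    return ack(m - 1, ack(m, n - 1));
}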

Figure 57. Performance of recursive test cases (token stream size per iteration and execution times in ms of rsum(0,10000), fib(20) and ack(3,4) for 1 to 8 cores).

4.2.4 Application Tests In addition to the generic examples, the three more complex tests Mandelbrot, Bezier3D and Raytracing, which represent different types of concurrency, have also been evaluated. The Mandelbrot test represents a typical branch-and-bound algorithm that subdivides the screen (640x480) recursively into small rectangles, whose contents are then computed in parallel (a sketch of this recursive subdivision scheme is shown below). In general, Mandelbrot is a classic data parallel problem since each pixel can be evaluated independently. However, in order to improve load-balancing, the recursive expansion method (Section 2.3.4) is utilized and iteratively enlarges the stream in parallel. Bezier3D implements a complex rendering pipeline for the rasterization of Bezier surfaces and contains both data and pipeline parallelism. First, the surface is subdivided into quads and then into triangles, which are lit and projected on the screen. Each 2D triangle is drawn by recursively determining the set of pixels that are located inside. As an optimization, blocks outside of the triangle are hierarchically skipped. Similarly, the Raytracing test draws a simple geometric scene consisting of a reflecting sphere lying on top of an infinite checkerboard. The color of each pixel is determined by several sub-tasks, which shoot rays into the scene and calculate the nearest intersection point. Especially the screen area of the sphere is expensive to compute since the reflection requires additional rays. In contrast to a fixed assignment between pixels and cores, the reflected rays are handled by dynamically created tasks, which are distributed across all cores.
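As an illustration of the recursive expansion used by the Mandelbrot test, the following hypothetical sketch subdivides a screen rectangle into dynamically created tasks using the get/put/call runtime interface described in Section 4.3.4; the 8x8 tile size and the per-pixel stage mandel_pixel are assumptions of this sketch and not part of the measured implementation:

/* get, put, call: runtime macros of the stream rewriting interface (Section 4.3.4) */
void mandel_pixel();                    /* per-pixel iteration, defined elsewhere */

void mandel_rect() {
    int x0, y0, x1, y1;                 /* screen-space rectangle */
    get(x0); get(y0); get(x1); get(y1);
    if ((x1 - x0) <= 8 && (y1 - y0) <= 8) {
        /* small enough: create one task per pixel of the tile */
        for (int y = y0; y < y1; y++)
            for (int x = x0; x < x1; x++) {
                call(mandel_pixel, 2); put(x); put(y);
            }
    } else {
        /* otherwise split into four quadrants that are expanded in parallel */
        int mx = (x0 + x1) / 2, my = (y0 + y1) / 2;
        call(mandel_rect, 4); put(x0); put(y0); put(mx); put(my);
        call(mandel_rect, 4); put(mx); put(y0); put(x1); put(my);
        call(mandel_rect, 4); put(x0); put(my); put(mx); put(y1);
        call(mandel_rect, 4); put(mx); put(my); put(x1); put(y1);
    }
}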

Figure 58. Performance and screenshots of test applications (execution times in ms of Mandelbrot, Bezier3D and Raytracing for 1, 2, 4 and 8 cores).

The computation time of each test case and the corresponding screenshots are shown in Figure 58. All examples indicate that the system scales, although the speedup stagnates at eight cores. However, it has to be considered that all examples are scheduled dynamically and especially the Raytracing test utilizes more than one million threads. As a result, the generic software-based stream rewriting technique represents a viable solution for highly dynamic scheduling and binding in many-core systems.

4.3 Stream Rewriting Network In contrast to the shared memory approach, the Stream Rewriting Network (SRN) potentially offers a larger throughput since it decodes and distributes a wide stream in parallel.

4.3.1 Overview The architecture of the SRN (Figure 50) is refined by the schematic in Figure 59. In this example, it consists of 2 groups of 4 processors each, which is called a 2 × 4 configuration. However, in general, the design can be parameterized from 1 × 1 to 8 × 16 cores, and experimental results as well as resource usages are presented in Section 4.3.6.

Figure 59. SRN for parallel stream rewriting (2x4). Figure 60. Parallel decoding of the token stream.

Especially for the larger configurations, the distribution of the token stream to the cores might cause a potential bottleneck, since the stream is a global resource. Hence, the interconnection network should provide a sufficiently large bandwidth, so that the cores never run empty and are always fully utilized. Though, depending on the granularity of the tasks, the actual throughput of each core is most likely less than one token per cycle, so that the resources of the SRN can be safely shared between processors. As a result, the cores are organized in a two-level hierarchy of parallel and scalar data paths, which provides a trade-off between area and performance. At the top level, the global stream can transmit one token per processor group and cycle (limited by memory throughput). In each cycle, these tokens are processed by the parallel decoder and distributed to the processor groups. It uses a fast heuristic to ensure that the tokens of a matching expression are sent to the same processor group. Hence, the partitioning can vary between the case that all tokens are assigned to the same processor group and the other extreme that each token is assigned to a different group. On average, it can be expected that each group receives one token per cycle, so that the further processing and exact pattern matching at the scalar decoder is based on streams, which transmit at most a single token per cycle. The throughput of each core is less than one token per cycle for moderately complex tasks. Thus, the stream is distributed by the scalar splitter to the processors of a group for rewriting. It is reassembled by the scalar combiner, widened again and merged by the parallel combiner. Finally, the resulting stream is stored as a FIFO in global memory and the memory write module is responsible for handling write operations. Since all parameters and results are exchanged using the token stream, the performance of the system is limited by its throughput. However, the memory is accessed in bursts of tokens that can be split across multiple banks to ensure the required bandwidth. Hence, the larger the configuration, the more cycles each task should take, so that the cores remain fully utilized.

4.3.2 Stream Decoding The parallel decoder does not need to detect every pattern but instead it is responsible for dividing a parallel stream at nearly optimal positions, so that the tokens of matching calls are all sent to the same group for final decoding by the scalar decoder. The individual steps of this algorithm are illustrated in Figure 60 and work on several tokens in parallel. Assume that the stream is given as s := 〈t1, …, t|s|〉 with tokens ti. First, possible break points at the transitions between literal and call tokens are marked with a stop bit:

stop(i) := 1 if (type(ti) ≠ CALL) ∧ (type(ti+1) = CALL), and stop(i) := 0 else (74)

The prefix sum (Figure 60) of the stop bits is used to attach a thread id to each token:

thread(i) := stop(1) + … + stop(i) (75)

Threads are bound to consecutive groups, so that the sub-stream for a group g is constructed by removing unwanted tokens:

sub(g) := 〈 ti : thread(i) mod G = g 〉 (76)

where G denotes the number of processor groups. In particular, the calculation of the stop bits and the prefix sum is also pipelined, allowing the decoder to process several tokens per cycle. The scalar decoder accepts ≈ 1 token per cycle from the narrow stream and looks for matching patterns according to the specification. Each processor reads its sub-stream and, in case of an executable expression, it immediately jumps to the function and rewrites the stream.
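The splitting heuristic can be summarized by the following sequential C sketch. It is only an illustration: the token layout, the stop-bit condition and the round-robin group mapping are assumptions, and the actual decoder computes the stop bits and the prefix sum in a pipelined fashion for several tokens per cycle:

#include <stddef.h>

typedef enum { TOK_LIT, TOK_CALL, TOK_ID } tok_kind;        /* simplified token kinds */
typedef struct { tok_kind kind; unsigned value; } token_t;  /* value: literal or stage id */

/* Assign each token of the stream s to a processor group (equations (74)-(76)). */
void split_stream(const token_t *s, size_t len, size_t groups, size_t *group_of) {
    size_t thread_id = 0;                       /* running prefix sum of the stop bits */
    for (size_t i = 0; i < len; i++) {
        group_of[i] = thread_id % groups;       /* consecutive threads -> consecutive groups */
        if (i + 1 < len && s[i].kind != TOK_CALL && s[i + 1].kind == TOK_CALL)
            thread_id++;                        /* stop bit at the start of the next call */
    }
}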

4.3.3 Ordered Writes In addition to scheduling, the SRN also handles memory writes and is based on the concept described in Section 2.4. While it is possible to handle memory writes directly in the task function, this behavior can lead to race conditions because the SRN does not provide any guarantee at which time and on which resource a task is executed. For instance, the rasterization test in Section 4.3.6 allows multiple overlapping triangles to be computed in parallel as long as the pixel writes are committed in order. As a solution, the final literal values on the stream are interpreted as (address, value) tuples, which should be written into memory. In particular, the literal values at the beginning of the stream can be safely extracted, because these tokens will never be modified again, and the write operations mem[a] := v are performed for each consecutive pair (a, v) of these literals. In addition, the tuples are also run-length encoded to save stream and memory bandwidth. 4.3.4 Software Development In contrast to the fixed pattern matching, the functionality of the tasks is more flexible due to the software implementation. The SRN is compatible with any processor that has an input and output streaming interface and supports the three opcodes listed in Table 3. In particular, the processor has to provide an opcode get for reading a token and an opcode put for writing a literal. Further, an opcode call is required that takes a function and its number of arguments to emit a CALL token. The scalar decoder (Figure 50) already marks matching calls with a special flag, so that the processor only has to invoke the associated task of an expression. Here, the identifier, which is located immediately after the CALL token, is interpreted as the starting address of the function. Since the three opcodes get, put and call represent a minimal interface between the application and the underlying stream rewriting architecture, their functionality might also be implemented as a runtime library in case of a pure software implementation.
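Returning to the ordered writes of Section 4.3.3, the following sketch shows how the committed prefix of the stream could be extracted in software. It reuses the simplified token_t type from the previous sketch and omits the run-length encoding; the exact token layout is an assumption:

/* token_t and TOK_LIT as defined in the previous sketch.
 * Extract the leading (address, value) literal pairs and commit them to memory;
 * returns the number of tokens removed from the stream head. */
size_t commit_leading_writes(const token_t *s, size_t len, unsigned *mem) {
    size_t i = 0;
    while (i + 1 < len && s[i].kind == TOK_LIT && s[i + 1].kind == TOK_LIT) {
        mem[s[i].value] = s[i + 1].value;       /* mem[address] = value */
        i += 2;
    }
    return i;
}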

Table 3. Minimal set of opcodes for stream rewriting

OpCode                  Description
get(x)                  Read literal token from stream.
put(x)                  Append literal token to the stream.
call(func, arg_count)   Emit CALL token of function func with arg_count arguments.

The following complete example shows a C program of the recursive Fibonacci function:

void fib() {
    int x;
    get(x);                         // read argument x
    if (x <= 1) {
        put(x);                     // return result
    } else {
        call(add, 2);               // calculate sum of
        call(fib, 1); put(x-1);     // fib(x-1)
        call(fib, 1); put(x-2);     // fib(x-2)
    }
}

void add() {
    int x, y;
    get(x);                         // read argument x
    get(y);                         // read argument y
    put(x+y);                       // return result x+y
}

In order to improve the readability of the source code, a set of utility functions has been built on top of these basic instructions. For instance, the macros param_* declare and read arguments, the put* functions write typed values to the stream, and call* performs type-safe invocations. Thus, the example can be written more concisely as:

void fib() {
    param_1i(x);
    if (x <= 1) {
        put1i(x);                   // return result
    } else {
        call(add, 2);               // calculate sum of
        call1i(fib, x-1);           // fib(x-1)
        call1i(fib, x-2);           // fib(x-2)
    }
}

void add() {
    param_1i(x);
    param_1i(y);
    put1i(x+y);
}

As a further improvement, the high-level compiler (Section 3.3) for hardware synthesis of recursive functions (Section 6) inserts these stream instructions automatically.

Config  Cores  LUT      FF       BRAM 18K/36K  DSP48
1x1     1      19,910   29,768   6 / 26        3
1x2     2      21,060   30,856   6 / 30        6
1x4     4      23,023   33,032   6 / 38        12
1x8     8      27,129   37,380   6 / 54        24
1x16    16     34,157   46,080   6 / 86        48
2x16    32     52,519   66,336   3 / 154       96
4x16    64     88,580   111,270  5 / 286       192
8x16    128    174,601  225,197  9 / 550       384

Table 4. Resource usage of the SRN test system.

4.3.5 Implementation First, the scalability of the design is tested, which is followed by a direct comparison to the shared memory architecture presented in Section 4.2. A prototype of the SRN architecture has been implemented using the VC707 evaluation board from Xilinx. The design also contains a memory controller (MIG), a VGA controller and a PCIe interface, so that it can be used as an add-on card in a PC. The processor is a custom implementation of the MicroBlaze from Xilinx, which is slower but smaller and can run at the same frequency as the SRN (200MHz). It has been specially designed for minimal size to test the concept of stream rewriting in a many-core environment. As a result, the processor does not support floating-point operations in hardware, but there is a 32-bit multiplier, so that floating-point operations can be emulated. Due to the lack of pipelining, each instruction takes at least three cycles to complete. In order to evaluate the scalability of the architecture, different configurations of the SRN with up to 128 cores have been synthesized. The resource usage in terms of look-up tables (LUT), flip-flops (FF), block RAM modules (BRAM) with 18/36 Kbit and multipliers (DSP48) is listed in Table 4. There is an almost linear increase of ≈ 1,207 LUTs per core and a fixed overhead of ≈ 16,640 LUTs. Since the FPGA contains ≈ 300,000 LUTs in total, even the largest 8x16 configuration fits onto the device without timing issues. Performance and scalability of the system are tested using both generic and complex test cases. The generic test cases consist of the recursive sum rsum, the Fibonacci function fib and the Ackermann function ack. In contrast to the balanced call tree of the recursive sum (Figure 61 a), the task graph of fib is asymmetric (Figure 61 b). To simulate a reasonable work load, each non-leaf task sleeps for approximately 8192 cycles and each leaf task for 4096 cycles. The worst case for stream rewriting is given by the Ackermann function (sleeps 4096 cycles), which executes only the single task at the end of the stream per iteration.

Figure 61. Call trees of the recursive sum and Fibonacci function.

The test is completed by the three application tests mandelbrot, bezier3d and raytracing introduced in Section 4.2.4. Especially the bezier3d test can take advantage of the ordered writes because it draws long spans of the same color.

4.3.6 Results For each configuration, the execution times of every test case have been measured in cycles and can be found in the paper [87]. However, in order to estimate the scalability of the SRN, the relative speedup is more important than the absolute but platform-specific timings. Therefore, the duration of a computation in relation to the 1x1 configuration is shown in Figure 62 on the next page. It can be seen that the speedup of rsum(0,100) begins to stagnate at 15 for 2x16 cores due to insufficient work load. As a result, this test is bound by the latency of the external stream memory. However, when rewriting the larger streams of the rsum(0,10000) and rsum(0,100000) functions, the performance scales almost linearly with the number of cores. For example, the 8x16 configuration achieves a speedup of 126 for rsum(0,100000) and the Fibonacci function runs 117 times faster. As expected, the Ackermann function does not scale well since, by construction, only one task can execute per iteration. Though, it still profits from the parallel decoding and the increased bandwidth in the 2x16, 4x16 and 8x16 configurations. In addition to the generic examples, the Mandelbrot and Bezier3d tests can be parallelized and achieve large speedups, while the scalability of the Raytracing test is mediocre. Hence, the SRN is well-suited for handling large task graphs if there is an adequate amount of concurrency. Also important, the same software has been used for all configurations of a particular test case.

4.4 Comparison The SRN is directly compared to the ring and shared memory techniques using software-based stream rewriting. For this purpose, all three architectures for stream rewriting have been implemented on the ML605 board from Xilinx and utilize the MicroBlaze soft-core processor with floating-point unit and 32KB of local memory. Since both the ring and shared memory architectures perform the stream rewriting in software, they can run on the same generic hardware platform shown in Figure 52. The central difference of the ring design is the usage of the circular communication channel instead of the external memory for passing tokens. However, in the SRN architecture, the stream management is moved into hardware and the ordered writes are placed on the stream as tokens, so that the memory interface of the processors can be omitted. The resource consumptions of these two hardware architectures are listed in

Table 5. Although the SRN is slightly smaller for the 4x2 configuration, there is a larger hardware overhead for up to four cores, so that the size of both designs is roughly comparable.

Figure 62. Relative speedup and scalability of the SRN using various test cases (rsum(100), rsum(1000), rsum(10000), rsum(100000), fib(20), ack(3,5), Mandelbrot, Bezier3D and Raytracing for the configurations 1x1 to 8x16).

Table 5. Resource consumption of the stream rewriting architectures on ML605

Configuration                 Cores  LUT     FF      BRAM 18K/36K  DSP48
Ring, Shared Memory           1      14,874  19,446  1 / 29        5
(software stream rewriting)   2      20,309  24,795  2 / 41        10
                              4      31,494  35,463  4 / 65        20
                              8      52,737  56,772  8 / 113       40
Stream Rewriting Network      1x1    16,185  21,266  3 / 25        5
(hardware stream rewriting)   2x1    20,823  26,148  3 / 36        10
                              4x1    31,188  39,597  5 / 63        20
                              4x2    42,732  48,114  21 / 95       40

Figure 63. Performance comparison of different stream rewriting architectures (execution times in ms of rsum(100000), fib(20), ack(3,4), Mandelbrot, Raster2D and Raytracing for 1, 2, 4 and 8 cores).

The performance results of all three platforms and six test cases are illustrated in Figure 63. Given exactly the same compute capabilities, it can be seen that the SRN provides a performance advantage over the shared memory and ring architectures in nearly all tests. While the difference is moderate for the compute-intensive tests like rsum(100000), fib(20), Mandelbrot and Raster2D, the speedup of the SRN becomes significant for the bandwidth-limited ack(3,4) and Raytracing. In case of the Ackermann function, only one task is executed per iteration, so that the stream is mostly copied within the system. Likewise, the raytracing test employs several nested tasks per pixel and therefore also benefits from the hardware-based stream distribution. The ring pipeline is usually slower than the two other topologies for compute-intensive test cases. In particular, consecutive iterations are mapped onto subsequent cores, so that all the expensive leaf tasks of the Mandelbrot test are assigned to a single core. Consequently, in the eight core configuration, seven processors are waiting for input data, while the last one is busy and responsible for all expensive computations. However, since the stream is only stored in memory between the last and the first core of the chain, less bandwidth is consumed and the ring architecture outperforms the shared memory design in the Ackermann test case. Also, the rendering of the Bezier3D scene using the ring with eight cores is even slightly faster than the SRN and represents the best timing for this particular test case. As a result, the scalability of the ring architecture depends heavily on the application and is mostly unpredictable. The performance of the shared memory design is often located between the SRN and the ring architecture. However, the relatively minor improvement from 4 to 8 cores shows that this specific implementation is also limited by memory bandwidth. In particular, for both the shared memory and the ring, there are worst cases, which are not relevant for the SRN.

4.5 Conclusion This section demonstrates two important aspects of stream rewriting:

1) It has been shown that stream rewriting can be implemented on generic multi-core systems without special hardware components. Especially the shared memory technique is highly portable and might be integrated into several existing embedded SoCs. Although the relatively slow MicroBlaze processor is mostly limited by memory throughput, the tests show scalability in many cases.

2) The performance of stream rewriting has been evaluated for many-core systems with up to 128 processors. A special on-chip network, the stream rewriting network (SRN), has been used for both pattern matching and task distribution. The results (Figure 62) show linear scalability for some generic test cases, while the more complex applications like Bezier3D and Raytracing still achieve a speedup of more than 64 for 128 cores. A direct comparison of all three techniques indicates a performance advantage of the SRN over the shared memory and ring architectures. Since the SRN requires modifications at the system level, the shared memory technique offers an advantage in case of existing designs, for which such a fundamental change might not be feasible.

5 Graphics Processing Despite the computational power and memory bandwidth of modern graphics processing units (GPU), a main limitation of these architectures is often the lack of efficient on-chip communication between different shader cores. Although individual stages of the Direct3D and OpenGL rendering pipeline are programmable, its topology and data flow remain fixed. In this section, novel hardware architectures for software-based rendering are presented, which provide an efficient communication infrastructure for the creation of highly optimized and application-specific rendering pipelines. For this purpose, the state of the pipeline is encoded as a stream of data and control tokens, while the functionality of the shader stages is represented as rewriting operations. As a result, this architecture supports an arbitrary number of dynamic and recursive shader stages as well as complex data flow and light-weight synchronization within the rendering pipeline. This section is based on paper [86] and further unpublished work.

5.1 Introduction Since the introduction of shaders in Direct3D [17] with version 8.0 and in OpenGL [121] via the ARB_vertex_program and ARB_fragment_program extensions, there has been a shift from the traditional fixed-function pipeline of the original OpenGL towards a more programmable and software-centric approach [122]. Subsequently, further pipeline stages have been added, so that there are currently up to six different shader types for pixels, vertices, tessellation, per-triangle, and general purpose computations [18]. In recent years, GPUs have evolved from specialized rendering devices to general purpose computation platforms supporting a wide range of applications [123]. Compute APIs like CUDA [19] [124] and OpenCL [125] offer even more fine-grained control of the underlying graphics hardware by directly exposing the grid of processing cores. However, their architecture is mainly focused on data parallelism and forces a large number of cores to execute the same kernel. Hence, in order to take advantage of the parallelism achieved by hundreds of identical processing cores [4], problems must be translated into a set of data parallel kernels [126] [127]. In addition, there is little support for the creation of custom pipeline stages, so that queues and even the scheduling must be implemented in software. While this development might eventually lead to a revival of software rendering, current graphics processors consist of fixed, configurable and programmable components, whose complexity is rather exposed than hidden by the main graphics APIs. On the other hand, even modern GPUs typically do not allow redefining the structure of the graphics pipeline and neither permit the addition of custom shader stages. As a consequence, there is a discrepancy between the computational power of individual shader stages and the flexibility of the graphics pipeline in general. For example, separate shaders for surfaces and lights, linked dynamically on demand, could enforce a more modular material design similar to Pixar's RenderMan [128].

Also, hardware support for recursion is still weak or non-existent, although recursive data structures and algorithms are common for graphics processing. In this context, recursive shaders could enable custom tessellation patterns, procedural geometry, ray-tracing, and the traversal of the scene graph to be implemented entirely on the GPU. Hence, both the limitations of recent graphics hardware and also the fixed programming model hinder the development of more advanced rendering techniques.

Figure 64. Graphics pipeline based on stream rewriting.

However, it is possible to implement custom rendering pipelines by manually scheduling shader instances in software and storing work queues in external memory [129]. In fact, this approach effectively emulates a different type of graphics hardware on top of general purpose computing APIs like CUDA or OpenCL. This section presents a much more general solution for dynamic rendering pipelines and evaluates the usage of stream rewriting for graphics processing using three different architectures, which are based on a modified version of the stream rewriting network (Section 4.3). The first architecture is based on functional programming. The second design utilizes SIMT shader cores and tile-based rendering, while the third prototype employs general purpose processors. The concept of stream rewriting for graphics processing is illustrated in Figure 64 and consists of two views. The rendering pipeline on the top is modelled as a graph of programmable shader nodes, which are invoked by a draw call and eventually produce pixels that are written into the framebuffer. At the hardware level, the state of the pipeline is represented as a token stream and the functionality of the shader stages corresponds to rewriting operations. Initially, the draw call is placed on the stream as a sequence of tokens that refer to an input node of the graph. When the stream becomes empty, the execution of the draw call is completed and at this point all primitives are written into the framebuffer. Beside the general benefits for many-core architectures, stream rewriting offers the following benefits for graphics processing:

• Data and Pipeline Parallelism The majority of compute APIs are focused on data parallelism and only CUDA supports a very limited type of dynamically nested tasks on recent GPUs. However, graphics processing requires data parallelism within the same stage and pipeline parallelism between subsequent stages. In addition, even the most recent programmable graphics pipeline does not support programmable communication between shader units beside data exchange via the global memory. Here, stream rewriting provides a unified model of computation, which can describe both types of concurrency.

• Virtual Rendering Pipeline All graphics processors based on stream rewriting support an arbitrary number of custom pipeline stages defined by the software with full programmability. Therefore, a more flexible graphics pipeline can be built to better fit the application-specific requirements of a particular rendering algorithm. There is only one generalized type of shader stage, which can be instantiated and connected using queues, feed-back loops as well as fork and join nodes to create complex topologies.

• Recursive Shaders Recursion is commonly used in geometric algorithms but current graphics hardware provides little support for this powerful technique. However, the proposed model directly supports recursion in a general way and also parallelizes the evaluation of concurrent branches via load-balancing on multiple processors.

Before the computational model of stream rewriting for graphics processing is presented, the next sections first contain a discussion of related work.

5.2 Related Work This section describes several different architectures related to graphics processing.

5.2.1 Direct3D and OpenGL The functionality of current GPUs is exposed at the application level through various programming interfaces (API), which can be distinguished into graphics and compute APIs. In graphics mode, the individual steps of the rendering process are represented as stages of a pipeline (Figure 65). An application can use APIs like Direct3D [130] or OpenGL [121] to configure the pipeline and to supply input data.

Figure 65: The Direct3D 11 graphics pipeline [http://msdn.microsoft.com]

The rectangular stages correspond to fixed blocks of hardware and can be modified by a set of parameters only. In contrast, the rounded blocks can be programmed by custom functions, which are called shaders. For instance, the user can load a function that calculates the lighting of vertices or pixels into the corresponding stage. However, the data-flow between these blocks is based on a fixed set of registers, so that the structure of the rendering pipeline cannot be changed. As a result, new shader types for specialized tasks have been introduced continuously. While Direct3D 8.0 knew only pixel and vertex shaders,

the current version 11.1 supports up to six different shader types. In addition, the amount of input and output data per stage is fixed. The only programmable stage that can amplify data is the geometry shader, but it can output a maximum of 1024 values per invocation. Since computing power is growing faster than memory bandwidth, it can be advantageous to create geometry procedurally in the rendering pipeline. Therefore, the tesselator unit, which can generate a large amount of dynamic vertices, has been added recently in Direct3D 11. But it is also a fixed function that works on a unit rectangle, and only the stages before and after the tesselator are programmable to compensate for this limitation. Feedback loops are not directly supported, but the intermediate results of a computation can be streamed out into memory resources and later read back. If the requirements of the application match this scheme, the graphics APIs offer a convenient high-performance interface for rendering. However, for more sophisticated problems, compute APIs like OpenCL [125] or CUDA [124] must be employed. They offer more flexibility by directly exposing the grid of processing cores and are thus most useful for data-parallel problems. As a result, the creation of custom pipeline structures on the GPU is still an open problem, because there is little hardware support for pipeline parallelism. For example, queues can be implemented in shared memory and threads are synchronized using atomic functions. While this approach has been particularly successful for implementing custom rendering techniques [131], it effectively simulates a completely different type of hardware on top of the data-parallel processor grid. This thesis proposes stream rewriting as a tradeoff between these two modes to enable efficient thread management and scheduling in hardware, while the execution of shaders is handled in software for maximum flexibility.

5.2.2 Hardware Architectures In addition to the GPU architectures from NVIDIA [96] or AMD [132], there exist also several alternative designs, which improve certain aspects of the rendering pipeline: The Larrabee processor [133] has been already mentioned in Section 1 as a many-core system but actually it has been designed as a more flexible GPU based on up to 48 x86- compatible cores. Thus, the instruction set contains additional opcodes for vector arithmetic for floating-point calculations. Instead of dynamic scheduling, the execution units are utilized by multiple hardware threads. In addition, the Larrabee supports cache coherence, virtual memory and arbitrary data exchange between the cores to facilitate high-level software development of rendering architectures. The two relevant rendering techniques are rasterization and ray-tracing. The ray processing unit RPU [134] [135] has been developed especially for raytracing and handles irregular control flow and recursion. Shader programs can contain a special trace instruction, which performs a scene traversal in hardware and computes the nearest intersection point of a ray. Similarly, the mobile SGRT ray-tracing architecture [136] contains a dedicated tree traversal unit with four pipelines, which handles recursion by storing pending threads in a FIFO that is connected to the input buffer. The concept of stream rewriting is based on a similar mechanism but instead of rays, there is native support for recursive and indirect functions in general.

The programmable culling unit (PCU) [137] represents a step towards a programmable but hardware accelerated rasterizer. Unlike the pixel shader, which works on individual pixels, the PCU can speed up rendering by excluding larger blocks of pixels from the rasterization process. In comparison, the fully reconfigurable stream rewriting pipeline could also support a programmable culling stage as part of the rasterizer. Also related to the token stream, the F-Buffer [138] stores the intermediate state of pixel shaders in external memory. However, since the token stream contains control information to encode state and topology of the entire rendering pipeline, it can be seen as a generalized form of the F-Buffer. Order independent transparency requires the fragments for each pixel to be sorted and merged from back to front. There are several software solutions like depth peeling [139], the K+-buffer [140] or linked lists [141]. Alternatively, the R-Buffer [93] keeps all active fragments of the current frame in a queue and writes them in order for each pixel. For this purpose, each fragment has to circulate in the R-Buffer until all further values at the same location have been written. Hence, the maximum number of iterations corresponds to the depth complexity of the scene. The Delay Stream [142] is actually a relaxed version of the R-Buffer and defers the processing of rendering commands until a certain threshold is reached. This allows the rasterizer to look ahead in the command stream and cull triangles, which will be overdrawn by subsequent primitives. The Pomegranate design [94] is a scalable graphics architecture with several vertex and pixel pipelines working in parallel. Here, the main challenge is maintaining the original order of triangles from the draw call, so that the rendered image is indistinguishable from a sequential implementation. For this purpose, Pomegranate contains a point-to-point network, which sorts the resulting triangle fragments according to their screen space position. The stream rewriting network (SRN) architecture introduced in Section 4.3 resembles a generic variant of this design.

5.2.3 Software Pipelines Before introducing the concept in Section 5.3, several different software and hardware architectures related to more flexible graphics pipelines are presented. The creation of custom graphics pipelines on current GPUs has already been researched extensively since it provides the foundation for the implementation of novel rendering techniques [143]. While a rendering pipeline usually contains both data and pipeline parallelism, compute APIs like CUDA or OpenCL are optimized for data parallel kernels with coherent control flow. Also, the communication between processing cores is usually bound to several restrictions. Hence, the central challenge is the dynamic scheduling of heterogeneous work items within the limitations of the CUDA model. One possible solution for the management of dynamic threads on the GPU is the implementation of the scheduler as a kernel itself [131] [144] and the usage of work stealing for load-balancing [145]. For instance, a hierarchical software rasterizer has been implemented entirely in CUDA [129] but requires significant modifications to fit into the data parallel scheme. Due to the lack of hardware support, queues must be stored as

buffers in global or shared memory, while the read and write pointers are accessed using atomic operations. Though, it is possible to manage queues with multiple concurrent producers using parallel prefix sums [146]. However, most of these software-based scheduling approaches require combining all tasks into a single kernel (mega-kernel) that selects the correct sub-task at runtime through dynamic branching. Therefore, this technique effectively interferes with the single-instruction multiple-thread (SIMT) architecture of modern GPUs, which is optimized for executing groups of threads that follow the same control path. As a possible solution, the authors of [147] propose to sort threads according to their kernel and to divide complex shaders into more specialized tasks. A more recent feature of CUDA is called dynamic parallelism and permits the recursive invocation of kernels on the GPU [148]. However, the driver has to reserve up to 150MB of external memory for each level of recursion, so that this technique is most useful for large nested grids of kernels. Stream rewriting supports all types of recursion and a more fine-grained creation of individual threads by inserting control tokens into the stream. The generic GRAMPS pipeline shares some similarities with this theoretical model and also supports different types of concurrency [149] like pipeline and data parallelism. Shaders are represented as process nodes and connected via ordered and unordered communication queues. In comparison, stream rewriting basically restricts the topology of the shader network to the class of series parallel graphs [109], which are built from sequential and parallel composition. This restriction allows storing the shader network and the contents of the communication queues as a stream (Section 3.2), so that the inputs of a particular shader instance are always located in a compact range of tokens. There is no explicit management of threads required and therefore also the topology of the network can be modified at runtime. Another challenge in rendering pipelines is the parallel processing of multiple geometric primitives and the subsequent reordering of fragments, before they are written into the framebuffer. For this purpose, the authors of the K-Buffer [150] propose a hardware extension for distributing fragments to a specific shader core. Since this functionality does not exist in current GPUs, a workaround to minimize the resulting memory hazards is presented in [151] and a different solution [140] utilizes atomic operations.

5.2.4 Shader and Compute Languages The first programmable graphics hardware supported only a few selected instructions and registers, so that shader programs were naturally very short and had been optimized manually in assembly language to exploit the limited capabilities. Due to the development of flexible graphics processors with a general purpose instruction set, more complex shaders became available and required the usage of high-level languages like Cg (multi-platform), HLSL (Direct3D) or GLSL (OpenGL). These languages are variants of C with special data types and are optimized for graphics processing. In particular, each shader program replaces the functionality of a previous fixed block like the vertex or fragment processing stages and interacts with the remaining fixed hardware via a set of input and output registers. However, the communication between shader stages is currently an underrated aspect of all OpenGL or Direct3D shading

languages. Several restrictions are lifted due to the further improved flexibility of modern graphics hardware, so that more functionality can be moved into software, but on the other hand, modularity and reusability are becoming important issues. For this purpose, the Spark [152] shading language bundles the functionality of an effect, which is often spread across several stages, into a single class. This aspect-oriented approach is orthogonal to the programming model of stream rewriting and could be optionally layered on top of the SRM. Hence, the Spark language could take advantage of the dynamic composition and becomes even more useful as the SRM enables almost arbitrarily complex rendering pipelines. A scene can contain several different types of materials and lights. For instance, there can be point-lights, spot-lights, an ambient environment light or the sun. Some materials may be reflective or interact with the light in different ways. As a result, the calculations for each combination of a material and a light are unique. Offline renderers like, for instance, Pixar's RenderMan use different light and material shaders that can invoke each other [128]. Although there is only one active pixel-shader in Direct3D 11, the API still supports a limited form of dynamic linkage, which permits to combine different subroutines at runtime. Though, each variant requires a separate draw call, which is expensive in terms of CPU cycles. For comparison, stream rewriting is even more flexible and enables the dynamic composition of shader fragments without CPU interaction. Also, the generation of procedural geometry using grammars [153] [154] might map natively to the concept of stream rewriting. For the general purpose bulk-synchronous GPU programming language (BSGP) [155], which supports dynamic threads as well as fork and join constructs, the stream rewriting processor (SRP) could be a possible target architecture. Similarly, the Cilk programming language extends C with the spawn keyword to invoke functions on a new thread and therefore also supports dynamic and recursive tasks. Basically, spawn could be implemented on the SRM as a processor opcode that inserts the corresponding control and data tokens into the stream. The diversity of shading languages, which are often tied to a special rendering architecture, hinders the maintainability and reusability of shader programs. The AnySL [156] compiler demonstrates a method that allows a renderer to support several different shading languages through recompilation. In this context, the functional programming model could also be used as an intermediate language that is dynamically mapped to different target platforms. As a result, the stream rewriting language presented in this thesis might also become relevant for existing rendering systems. Similarly, the GRAMPS programming model also deals with the dynamic scheduling of irregular pipeline and data parallelism [149]. However, it employs a more traditional global scheduler that manages task queues and thread instances while the SRM uses local pattern matching.

In addition, several functional languages have also been proposed for GPU programming. For instance, Vertigo [157] is based on Haskell and allows composing complex objects from parametric surfaces and geometric operators in the shader. For example, a torus is constructed by rotating a circle using the revolve operator and the displace function adds a height field onto an existing surface. In addition, the library already contains several basic definitions for computing the interaction between light and materials, which can be applied to these objects as well. Since Vertigo targets an old class of graphics hardware (DX8.1), the functional program is compiled into a restrictive pseudo assembly code without flow control. Using the SRM, it might be possible to execute at least a sub-set directly in hardware and therefore retain the modularity of the functional language. The more recent functional shading language Renaissance [158] is conceptually based on Vertigo and pursues similar goals. Technically, it does not generate low-level assembly code but utilizes the vendor and platform independent OpenGL Shading Language (GLSL). While the underlying hardware usually expects a pair of tightly coupled vertex and fragment shaders, Renaissance allows specifying the shader as a single program and attaches different frequencies to variables. For instance, uniform variables never change during the invocation of a draw call, while per-vertex variables are automatically interpolated to the fragment level. The environment of stream rewriting supports an arbitrary number of user-defined frequencies although there is no automatic interpolation. The language Obsidian [159] is also based on Haskell but targets general purpose computations on GPUs and therefore generates CUDA source code. Consequently, the language provides an abstract array type, which either represents a concrete dataset or references the result of a previous computation. There are several standard functions like map, fold or zip for array manipulation that can be combined with user-defined functions by a special composition operator. Hence, by treating the GPU as a vector machine, the error-prone indexing of individual array elements and the manual handling of threads can be mostly avoided, so that the development of CUDA kernels is greatly simplified. A very similar approach for C++ is also attempted by the library Thrust [160], which contains a large number of templates for array handling on the GPU and also supports the fusion of kernel functions.

5.2.5 Recursive Shaders Despite the coarse grained dynamic parallelism of recent GPUs [161], general support for recursive functions and functional languages requires significant effort [162]. In particular, the two most important challenges are the storage of the stack for a large number of threads [163] and the SIMT divergence during recursion [164]. Though, there are several important use cases for recursive algorithms and data structures in computer graphics [20]. For instance, by storing the geometry of a scene as a bounding volume hierarchy, visible parts can be queried hierarchically with logarithmic complexity [165]. In addition, point hierarchies allow the quality to be improved dynamically at visually important regions like the silhouette [166]. Hence, these recursive techniques help to avoid computations on elements of a scene, which are either not in the field of vision or do not contribute to the quality of the final image.

However, the implementation of recursion on the GPU has not been solved in general. Although recursive shaders can be constructed by a while loop and a stack in local or global memory, this approach raises several issues. Beside the high latency of global memory and the overhead of dynamic branching, it is not possible to create a new thread for each recursive invocation, so that this technique is far from optimal. While recursion can be eliminated in some cases like the stack-less raytracing implementation [167], this method visits only one branch of the tree and therefore does not represent a general solution. For highly detailed scenes, the memory bandwidth required to read the geometry data from external memory can represent a bottleneck. Hence, it is often more efficient to generate the geometry dynamically within the rendering pipeline from a more coarse grained representation like higher-order surfaces [168]. For this purpose, Direct3D 11 introduces the tesselator stage, which subdivides a square, triangle or line segment, which is then mapped to the corresponding geometry using the domain shader. If more flexibility is required, also the compute shader can be used to generate geometry, which has been shown for stack-based terrains in [169]. Further, a recursive approach for geometric subdivision has been presented by [170] but requires loading and storing the entire geometry for each level of detail, so that it basically performs a breadth-first expansion of the stream. Contrary, the stack-based scheduling (Section 2.3.2) employs a combination of depth-first and breadth-first traversal and therefore requires significantly less memory bandwidth. As an alternative, the adaptive tesselation of [171] is written in CUDA and does not require special hardware support for packing the resulting vertices into a continuous stream. Instead, it uses a multi-pass approach to recombine the outputs of each patch. First, the number of vertices per patch, which depends on the tessellation level, is computed. Then, the output address of each patch is determined by adding the size of all preceding patches using the fast parallel prefix sum [146]. Finally, the patches are sub-divided and each instance writes its outputs to the previously computed address in parallel. However, this technique is not applicable for stream rewriting since it relies on a fixed number of inputs per patch, which fits better to the data parallel model of a SIMT GPU.

5.2.6 Task Queues Queues are the preferred communication medium between concurrent processes, because they decouple different rates of production and consumption while maintaining the order of elements. For instance, a task queue can be used for distributing non-data-parallel work items [172]. However, for optimal performance, both reader and writer must be replicated many times, so that multiple elements can be produced and consumed in parallel. Currently, there are two possibilities to implement queues on the GPU. The first method uses a buffer in shared memory and atomic operations to update the read and write pointers [145]. The second approach uses prefix sums to assign parts of the queue to distinct threads, before the actual calculation happens. Afterwards, multiple threads can access the memory simultaneously without explicit synchronization. Although this method employs more concurrency, the amount of data a thread writes into the queue must be known or calculated beforehand, so that the memory regions can be aligned seamlessly [146].
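The prefix-sum based allocation can be summarized by the following sequential C sketch; on a GPU the exclusive scan itself would of course be computed in parallel [146], and all names are purely illustrative:

#include <stddef.h>

/* count[i]  : number of elements thread i wants to append to the queue
 * offset[i] : first queue slot assigned to thread i (exclusive prefix sum)
 * returns the new total length of the queue */
size_t assign_queue_slots(const size_t *count, size_t *offset, size_t threads) {
    size_t total = 0;
    for (size_t i = 0; i < threads; i++) {
        offset[i] = total;          /* thread i writes to [offset[i], offset[i] + count[i]) */
        total += count[i];
    }
    return total;
}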

5.3 Generic Graphics Pipeline The execution model of the proposed graphics processor is based on the concept of stream rewriting. In this section, the mapping between a generic rendering pipeline and the rewriting rules of the token stream is described. After the abstract rendering pipeline and its functional representation have been specified, several advanced examples and three different implementations are presented. Similar to the task graphs presented in Section 3.2, the stream holds the state and the topology of the rendering pipeline, while each function, and therefore each shader stage, is modeled as a rewriting rule.

5.3.1 Shader Stages The basic building block of the rendering pipeline is a generic shader stage (Figure 66) with input and output registers. In addition, shaders can be hierarchically combined to create more complex networks using either sequential (seq) or parallel (par) composition.

Figure 66. Blocks of the generic rendering pipeline.

Similar to the mapping of task graphs (Section 3.1), the rendering pipeline is modeled as a functional program consisting of shaders, sequential, and parallel nodes:

pipe := shader | seq(pipe1, …, pipen) | par(pipe1, …, pipen) (77)

The generic stage is specified as a function that maps the n input values to m results:

f : ℤ^n → ℤ^m (78)

Here, ℤ represents a machine word and can contain both integer and floating-point numbers. Similar to Section 3.1, the sequence of two rendering pipelines f : ℤ^n → ℤ^k and g : ℤ^k → ℤ^m corresponds to the composition:

seq(f, g) := g ∘ f ∈ { h : ℤ^n → ℤ^m },  seq(f, g)(x1, …, xn) := g(f(x1, …, xn)) (79)

As shown on top of Figure 66, the input values are first processed by f and then passed to g to produce the outputs of the combined stage. Hence, the resulting function has the same signature as a single shader and can be used itself for hierarchical combination.

92 Figure 67. Pipeline fragments modelled via rewriting rules.

The parallel composition par(f, g) of f : ℤ^n → ℤ^k and g : ℤ^n → ℤ^l can also be described in a similar way. Here, an implicit fork block distributes the input registers (x1, …, xn) of the combined stage to both f and g to produce the k + l result values of the join block:

par(f, g)(x1, …, xn) := (y1, …, yk, z1, …, zl) with (y1, …, yk) := f(x1, …, xn) and (z1, …, zl) := g(x1, …, xn) (80)

The formal description of the join operation uses the identity function to synchronize the results of f and g. For this purpose, the strict evaluation order of stream rewriting ensures that both are completed before the join is invoked:

join(k, l) := id, i.e. (y1, …, yk, z1, …, zl) ↦ (y1, …, yk, z1, …, zl) (81)

In summary, the formalism for mapping a generic graphics pipeline into rewriting rules is equivalent to the scheduling of task graphs presented in Section 3.2.

5.3.2 Pipeline Fragments Based on the rules for parallel and sequential composition, also the more complex pipeline fragments shown in Figure 67 can be derived using a similar approach. The examples presented in this section are based on the post-order format of the stream grammar but could also be described in preorder. Each node represents a single pipeline stage and the listed fragments can be combined recursively to create more complex topologies. Example 1 in Figure 67 shows a minimalistic pipeline built from the two shaders A and B. Similar to the behavior of pixel and vertex shaders, each invocation of A creates exactly one call to B. In terms of stream rewriting, shader A can be specified by the rule:

A(x1, …, xn) := 〈 y1, …, ym, CALL, B 〉 (82)

Here, x1, …, xn denote the inputs and y1, …, ym the outputs of stage A, and the CALL token, which carries the number of arguments, is followed by the identifier of the invoked stage B. However, it is also possible to produce several invocations by emitting multiple tokens. In example 2, the stage B is called twice, so that the rule is extended to:

A(x1, …, xn) := 〈 y1, …, ym, CALL, B, z1, …, zk, CALL, B 〉 (83)

This functionality can be used to describe the geometry shader stage of Direct3D, which emits a variable number of triangles per invocation. A primitive might also be culled by returning an empty sequence. In addition, it is also possible to choose the target stage dynamically by calculating the stage identifier (Example 3 in Figure 67):

A(x1, …, xn) := 〈 y1, …, ym, CALL, B 〉 if x1 < 0, and A(x1, …, xn) := 〈 y1, …, ym, CALL, C 〉 else (84)

Here, shader A either calls B or C depending on the value of argument x1. Similar to the dynamic branching in Direct3D, the shader can switch between different code paths. In addition, shaders can also be linked dynamically by treating the stage identifier as a function pointer. For example, different materials and lights in a scene could be stored as function pointers and combined without intervention from the CPU. Further, a shader can invoke itself to create a feed-back loop (Example 4 in Figure 67). For instance, a tesselation shader T can use this capability to create a smooth surface by recursively subdividing a triangle t into four fragments t1, …, t4. First, the shader checks if the maximum detail level has already been reached and then either draws the triangle by calling the DRAW stage (d = 0) or performs a further subdivision step (d > 0):

T(t, d) := 〈 t1, d − 1, CALL, T, …, t4, d − 1, CALL, T 〉 if d > 0, and T(t, d) := 〈 t, CALL, DRAW 〉 else (85)

Similar to the task graphs, the pipeline can be split and later synchronized by the fork-join construct, which basically corresponds to a sub-routine. In example 5, a complex surface shader s invokes multiple material layers m1 and m2, whose results are eventually combined by shader c:

s(…) := 〈 …, CALL, m1, …, CALL, m2, CALL, c, … 〉 (86)

Hence, the rewriting rules are suitable for modelling a large number of different pipeline configurations. As a result, the flexibility is greatly enhanced in comparison to the fixed rendering pipelines of Direct3D and OpenGL (Section 5.2.1).
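To relate the recursive tessellation rule (85) to the software interface of Section 4.3.4, the following sketch expresses it with the get/put/call opcodes. The 2D integer vertex layout, the midpoint subdivision and the draw_tri stage are assumptions for illustration only; the typed put*/param_* helpers are omitted for brevity:

/* get, put, call: runtime macros of the stream rewriting interface (Section 4.3.4) */
void draw_tri();                            /* rasterizer stage, defined elsewhere */

void tess() {
    int ax, ay, bx, by, cx, cy, d;
    get(ax); get(ay); get(bx); get(by); get(cx); get(cy);   /* triangle (a, b, c) */
    get(d);                                                 /* remaining detail level */
    if (d == 0) {
        call(draw_tri, 6);                                  /* draw the triangle */
        put(ax); put(ay); put(bx); put(by); put(cx); put(cy);
    } else {
        int abx = (ax + bx) / 2, aby = (ay + by) / 2;       /* edge midpoints */
        int bcx = (bx + cx) / 2, bcy = (by + cy) / 2;
        int cax = (cx + ax) / 2, cay = (cy + ay) / 2;
        /* four sub-triangles, each tessellated with detail level d - 1 */
        call(tess, 7); put(ax);  put(ay);  put(abx); put(aby); put(cax); put(cay); put(d - 1);
        call(tess, 7); put(abx); put(aby); put(bx);  put(by);  put(bcx); put(bcy); put(d - 1);
        call(tess, 7); put(cax); put(cay); put(bcx); put(bcy); put(cx);  put(cy);  put(d - 1);
        call(tess, 7); put(abx); put(aby); put(bcx); put(bcy); put(cax); put(cay); put(d - 1);
    }
}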

5.3.3 Dynamic Parallelism and Recursion Similar to the dynamic task graphs, also the structure of the rendering pipeline can be modified at runtime. For this purpose, a shader returns a sub-stream, which encodes the corresponding rendering pipeline instead of literal values. Several use cases for this approach are shown in Figure 68. For instance, a dynamic fork shader f (Figure 68a) can create k threads of stage B, where k is computed at runtime, by emitting a variable number of tokens:

f(x1, …, xn) := 〈 0, args1, CALL, B, …, argsk, CALL, B, CALL, J 〉 (87)

where argsi denotes the arguments of the i-th invocation of B and J is the join shader.

Figure 68. Examples of dynamic and recursive rendering pipelines.

In particular, the join shader J requires one argument and waits for the zero literal at the beginning for synchronization. As a possible use case, this technique enables a graphics processor to generate a variable number of draw calls within a shader, which is not feasible with the current Direct3D and OpenGL implementations. The interaction between lighting and materials is usually highly complex and often depends on application-specific requirements to create a particular graphics style. For this purpose, high-quality off-line renderers like RenderMan [128] usually support separate shader programs for lights and materials, which are combined on demand. Similarly, recent GPUs offer dynamic linking to allow modular shaders, but it is actually performed on the CPU and therefore does not account for dynamic conditions calculated within a shader. However, since the stage identifier for a token can be chosen dynamically, different paths in the shader network can be selected for each primitive. The example shown in Figure 68 b) displays a combination of three materials and three light stages, which would otherwise require nine statically linked shaders. In addition, stream rewriting can express recursive shaders to implement branch-and-bound algorithms (Figure 68 c), which are commonly used as a performance optimization when iterating over large data sets. By conditionally invoking or discarding subsequent stages, parts of the call tree can be cut early to omit unnecessary computations. For instance, the intersection test between a ray and the scene may skip larger blocks based on their bounding volume. Another example for recursion is the tesselation shader illustrated in Figure 69. Here, each invocation of the patch rule subdivides the given triangle into four smaller fragments. As a result, the triangular patch of one iteration is subsequently expanded into a set of 16 triangles after two further iterations, which are eventually sent to a rasterizer stage. Most notably, the concept of stream rewriting naturally maps to the refinement steps.

Figure 69. Recursive tessellation using stream rewriting.

Figure 70. System architecture of the stream processor. Figure 71. Structure of a stream rewriting core.

5.4 Functional Processor In this section, the design of a stream rewriting processor (SRP), which is based on the functional language from Section 3.1, is implemented and evaluated.

5.4.1 Stream Rewriting Processor The proposed system architecture of the stream rewriting processor (SRP) is illustrated in Figure 70. At the top level, the processor contains several rewriting cores (SRC), which are working on different streams in parallel (Figure 71). Each of them has access to a local buffer, which stores the token stream, and the external memory, containing framebuffer and textures. While all cores receive the same sequence, each core is responsible for executing the whole rendering pipeline for a different part of the screen. The partitioning is configured by the host CPU using parameters. Further, the SRP is equipped with a PCIe interface for communication with the host CPU and a VGA controller displaying the rendered image. The PCIe channel is split into a debug interface and a DMA controller, responsible for sending and receiving token streams.

5.4.2 Stream Rewriting Core Similar to the abstract execution model, the stream rewriting core consists of a ring pipeline processing the token stream. In addition, there is also an entry point, which receives new tokens from the DMA controller, and an exit point, where results can leave the ring. The rewriting rules are implemented in the pipeline stages in the middle, so that one iteration in the ring corresponds to one invocation of the rewriting function. The queue between exit and entry is used as a temporary storage to compensate for varying stream lengths. First, the decoder receives the token stream and performs the pattern matching. It generates a sequence of program addresses that are used in the instruction fetch stage to load a program, which evaluates the rule or bypasses unmatched tokens. Likewise, the parameter fetch stage retrieves global parameters from the parameter memory. LET tokens are interpreted by the configuration stage, which also initializes the program memory. The execution stage evaluates arithmetic operations and is responsible for memory access. Finally, the results are written back on the token stream to proceed with the next iteration or leave the core at the exit point. In contrast to a RISC processor, there are no temporary registers, so that all inputs of an instruction are either arguments or parameters. Hence, the argument fetch, execute and write back phases of a single rewriting function never overlap. Therefore, several types of pipeline conflicts can be avoided, which leads to considerably simpler hardware. On the other hand, complex functions must be split in a preprocessing step and intermediate results are passed as arguments on the stream.

The execution unit (Figure 72) contains several stages and their arrangement has been chosen according to the requirements of graphics processing to chain as many commands as possible. Several stages have additional registers or even a stack to store temporary values. As a result, multiple subsequent instructions can be combined into macro opcodes to reduce the number of dependent rules significantly.

Figure 72. Pipelined execution unit (stages: DOT, Request Memory, Div/Sqrt, Receive Memory, Pack, RGBA, Unpack, Compare, And/Or/Xor, IF/ELSE/END_IF, Memory Write).

Beside the standard 32-bit integer and single-precision floating-point formats, the pipeline also supports 32-bit packed RGBA colors for framebuffer operations and textures. The first stage can evaluate a 4D dot product, which is a very common operation in geometry computations. An integer variant is also used for address calculation of pixels and texels. After that, the REQUEST stage may initiate a read operation at the memory interface while the result will be collected later in the RECEIVE stage. Similarly, the rest of the stages are responsible for packing floating-point values, performing color operations (RGBA), comparisons, and bitwise logical operations. The IF/ELSE/END_IF stage conditionally discards values and contains a stack for up to 32 nested blocks. Finally, the last stage may optionally write the value into external memory. The proposed ALU implements a complex instruction set and requires a large 48-bit opcode, but on the other hand, it can perform up to 11 operations per cycle, so that the equivalent functionality would consume a far larger number of RISC instructions. Also important, the intermediate results are stored in small local registers instead of a large register file. If utilized by a large number of threads, the latency of more than 100 cycles can be hidden and especially graphics applications benefit from the improved throughput. For example, the perspective correct rasterization of a pixel with vertex colors takes six cycles and a bilinear texture sample is completed in four cycles. Also, blend operations, consisting of a read, a color operation (RGBA), and a write-back step, can be performed in two cycles.

5.4.3 Scheduling The rewriting process expands and evaluates multiple functions per iteration. It represents a breadth-first search of the call tree, so that the number of tokens in the stream may grow exponentially with the call depth. Consequently, a mechanism to throttle the expansion of the stream is necessary to keep the length below a threshold b (Figure 73). Figure 73. Expansion with constraint b.

According to the stack-based scheduling (Section 2.3.2), the current sequence is divided into an active part of at most b tokens and the inactive remainder, as shown in Figure 73. New tokens are created only in the active part while the inactive part is left unmodified, so that the system combines depth-first and breadth-first traversal. When the size of the stream exceeds the threshold b, tokens at the end of the stream are shifted into the inactive area. Later, when the active part has completed evaluation and shrinks, the tokens at the end are moved back and can continue execution. For example, a FOR loop is interrupted and later resumed by writing the current state on the stream. Similarly, the entry point only appends new tokens to the end of the stream until the threshold size of b has been reached.
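The following sketch models this throttling policy in software; it is only an illustration, not the actual hardware implementation, and all names, sizes and data structures are assumptions:

#include <stddef.h>

typedef unsigned int token_t;

#define B 3072                        /* threshold b for the active part (value illustrative) */

/* Software model of the stack-based throttling described above: the active
   part of the stream circulates in ring[], everything beyond the threshold B
   is spilled into the inactive area spill[] and moved back once the active
   part has shrunk again. */
static token_t ring[4 * B];
static size_t  ring_len;
static token_t spill[1 << 20];
static size_t  spill_len;

static void balance_stream(void)
{
    /* spill: keep at most B tokens in the active part */
    while (ring_len > B)
        spill[spill_len++] = ring[--ring_len];

    /* refill: resume suspended tokens once there is room again */
    while (ring_len < B && spill_len > 0)
        ring[ring_len++] = spill[--spill_len];
}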

5.4.3.1 Software Development The SRP has a long pipeline with a complex instruction set, which offers numerous optimization opportunities by chaining appropriate operations but requires extensive knowledge of its micro-architecture. Even for the assembly language of Section 3.1.3, a manual mapping into ALU opcodes would be inefficient and difficult. In particular, the assembly contains generic operations like the 32-bit integer addition add_i32 or a floating-point multiplication mult_f32, which must be assigned to the corresponding ALU stages. In addition, this architecture does not contain temporary registers, so that complex expressions, which do not fit into the ALU scheme, must be split into multiple passes (a small illustration follows the list below). Hence, the assembler performs the following optimization steps on the functional program:

1. LET and FOR expressions are moved into separate functions, which are later translated into distinct rewriting rules.
2. Arithmetic expressions like additions or multiplications are optimized to reduce and balance the depth of the expression tree.
3. All functions are mapped into ALU operations using a greedy algorithm.
4. Unused functions and constants are removed from the program.
5. The tree is partitioned, so that each function can be computed in one pass through the ALU and therefore corresponds to a rewriting function.

As a result, the expression tree of each function has a maximum depth of one node and consists solely of ALU operations, so that the program can be directly translated into the binary instruction format.
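As a small illustration of the splitting in step 5, an expression that needs two ALU passes is divided into two rewriting rules, with the intermediate value travelling on the stream instead of residing in a temporary register. The sketch below uses the C notation of Section 5.6.3 purely for readability; the expression and all names are hypothetical:

/* Hypothetical illustration: (a*b + c*d) * e does not fit into a single ALU
   pass, so it is split into two rules; the intermediate dot product is passed
   as a stream argument instead of being kept in a temporary register. */
void pass2(void)              /* second rule: final multiplication */
{
    float t, e;
    get(t); get(e);
    put(t * e);
}

void pass1(void)              /* first rule: dot-product part */
{
    float a, b, c, d, e;
    get(a); get(b); get(c); get(d); get(e);
    put(a*b + c*d);           /* intermediate result on the stream */
    put(e);
    call(pass2, 2);           /* continue in the second rule */
}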

5.4.3.2 Stream Debugger Debugging highly concurrent programs often requires a different approach than examining single-threaded applications. Due to the large number of threads, the current state of a multi-core machine contains numerous instances of registers, local variables and stack frames, so that it becomes particularly difficult for the user to track the progress of the parallel execution. Similar to the verification of hardware components, the debugger often creates a trace file, which can be evaluated offline [173]. However, there are also combinations like the hardware-aware debugger for the OpenGL shading language [174], which enables source-level debugging at runtime by instrumenting shader programs.

Figure 74. Schematic of the stream rewriting debugger.

For stream rewriting, the debugger (streamdb) is implemented as a combination of a software library and a hardware component inserted into the token stream (Figure 74). It has the ability to block the stream and return the token at the current position to the software layer via an asynchronous communication channel. The debugger decodes the token into a readable representation and annotates the name of the corresponding function for CALL expressions. In addition, the stream can be advanced to the next control token, to the end of the iteration, or the device can run until the stream becomes empty, which usually marks the end of a larger computation. The debugger is especially useful to detect deadlocks caused by infinitely running threads. It shows the contents of the stream and can be attached or detached at any time. Since the debugger runs as a separate application and uses a different communication channel than the graphics driver, it does not interfere with the stream computations.

5.4.4 Results In order to demonstrate the capabilities of the proposed processor architecture, it has been prototyped using an FPGA board and connected to a PC via PCI Express. Two example pipelines, implementing rasterization and ray-tracing, show the improved configurability compared to existing GPUs. A schematic of the hardware and software components in the test system is shown in Figure 75. The ML605 board is equipped with a PCIe connector and was plugged into a PC with an Intel Core 2 Q9650 running at 3.00GHz on a Dell OptiPlex 780 mainboard. The operating system for this test system is Ubuntu 11.10 with kernel version 3.0.0. The DMA controller shown in Figure 75 utilizes the embedded PCIe core of the FPGA to autonomously transfer data from the mainboard to the FPGA. It provides a bi-directional data stream, which is mapped to a file descriptor using a custom Linux driver, so that applications can communicate with the graphics processor using file IO. At the topmost level, the application uses two libraries streamrw and streamdb (debugger) that provide an abstract view of the SRM. The driver manages two buffers in system memory, which provide the space for the two FIFOs. It uses Linux kernel functions to send the physical addresses of these buffers to the DMA controller on the FPGA. At the hardware level, the PCIe packets are transmitted from the root complex on the mainboard to the embedded PCI Express endpoint on the Virtex 6.

Figure 75. Setup of the complete test system.

The DMA controller itself autonomously sends read requests to the mainboard to transfer portions of the FIFO to the stream rewriting processor. The proposed stream rewriting processor has been synthesized for the Virtex 6 XC6VLX240T-1FFG1156 FPGA and tested using the ML605 evaluation board from Xilinx. The resource usage for a varying number of cores is presented in Table 6. In this case, a LUT corresponds to a 6-input lookup table of the FPGA, a FF represents a single register, DSP48 is an embedded multiply-adder and a BRAM is a fixed 18K or 36K memory block. The design met timing at 200MHz.

Table 6. Resource usage of the FPGA prototype

Cores  LUT     FF      BRAM 18K  BRAM 36K  DSP48
1      16,230  30,771  26        32        14
2      25,835  48,867  49        44        28
3      35,674  66,963  72        56        42
4      43,328  85,049  95        68        56

5.4.4.1 Rasterization

Figure 76. Example pipeline implemented for rasterization.

A simplified rasterization pipeline has been implemented to demonstrate the abilities of the stream rewriting processor (Figure 76). It contains LET environments, FOR loops and the stream is split and synchronized several times. In addition, the data-rates between stages differ significantly because the triangle is sub-divided into a much larger number of pixels. The actual drawing is similar to the technique presented in [175] and works by testing pixels against the edge equations of the triangle. The input of the rasterization pipeline is a sequence of triangles in world space. For each of the three vertices, there is a series of calculations. First, the vertex is transformed into view space using a matrix-vector-multiplication (Matrix). Next, it is projected on the screen and finally scaled according to the current viewport. The pipeline is split and later recombined, so that these computations can be performed in parallel. At this position, the triangle is two-dimensional and fed into the setup stage to calculate its bounding box and three edge equations. The edges will be evaluated at every pixel but are constant for the whole triangle, so that they can be stored as part of an environment (dashed box). For comparison, pixel coordinates are represented by two words, while the three edge equations take nine words. Hence, without the environment, the data passed between the stages and therefore also the data on the stream would be more than quadrupled. Inside the environment, two nested FOR loops iterate over the pixels of the box (IterateY, IterateX). The coordinates are tested against the edge equations and pixels outside the triangle are discarded. Color and address are calculated and the pixel is drawn into the framebuffer. Although it is possible to implement a rasterization pipeline in CUDA, it must be re-engineered to fit into the data-parallel processor grid [176]. In contrast, the SRP architecture allows describing the stages of the rendering pipeline directly as individual functions. In order to evaluate performance and scalability of the stream rewriting processor, various models from

 http://simonfuhrmann.de/uni/bthesis.html
 http://www-i8.informatik.rwth-aachen.de
 http://www-graphics.stanford.edu/data/3Dscanrep/

have been rendered at a resolution of 640x480 pixels. Figure 77 lists the render time for one frame measured at the application level. The render time for the same type of model increases less than linearly when choosing a more detailed version because most of the work is done at the pixel level, which is mostly independent of the triangle count.


Test             1 Core   2 Cores  3 Cores  4 Cores
Bunny (1k)       156 ms   95 ms    91 ms    73 ms
Bunny (5k)       275 ms   200 ms   190 ms   173 ms
Bunny (10k)      411 ms   326 ms   314 ms   292 ms
Armadillo (4k)   199 ms   176 ms   175 ms   151 ms
Armadillo (9k)   330 ms   306 ms   303 ms   278 ms
Armadillo (18k)  567 ms   541 ms   535 ms   506 ms
Sphere (1k)      115 ms   92 ms    91 ms    84 ms
Sphere (5k)      222 ms   168 ms   178 ms   161 ms
Sphere (20k)     627 ms   531 ms   544 ms   515 ms

Figure 77. Render times of rasterization test.

5.4.4.2 Ray-Tracing In addition to the rasterization pipeline, also a ray-tracing test case has been implemented that sends rays from the camera into the scene to determine the color of a pixel. Shadows are calculated by tracing secondary rays from the intersection point towards the light source. While the first test uses a constant color, the second renders a textured plane and the last two examples show three lit spheres, optionally with shadows (Figure 78). The performance measurements of these tests are listed in Figure 79 and show an almost linear speedup in all cases. In contrast to the rasterizer, the ray-tracing application requires a significantly higher number of per-pixel computations to calculate the intersection between a ray and a sphere. However, the costs of each pixel are roughly the same, so that the partitioning of the screen into separate areas for each core works well. In case of the shadows sample, the minor slowdown might be caused by the black background, which does not utilize additional shadow rays. Also, in case of a reflecting sphere, a better mechanism for global load-balancing, like the stream rewriting network (Section 4.3), might provide an advantage.

Figure 78. Screenshots of ray-tracing scenes (Color, Plane, Shadows).


Test        1 Core   2 Cores  3 Cores  4 Cores
Color Only  3093 ms  1547 ms  1031 ms  770 ms
Plane       248 ms   125 ms   83 ms    64 ms
Spheres     3117 ms  1559 ms  1039 ms  776 ms
Shadows     4079 ms  2076 ms  1401 ms  1069 ms

Figure 79. Render times of raytracing test.

5.4.5 Conclusion A novel graphics processor based on stream rewriting has been presented that allows describing custom rendering pipelines as a functional program. As a result, the rendering pipeline is virtualized by reducing all functionality to rewriting operations. Although the FPGA prototype cannot compete with commercial ASIC solutions in terms of performance, the two non-trivial examples show the flexibility, the feasibility and to some extent also the scalability of the functional SRP architecture. Most important, the design has been evaluated as part of a standard computer system, so that the measurements include all costs that would also occur for a commercial graphics device. The experimental results clearly indicate the success of the architecture for an authentic test case. Thus, the concept of stream rewriting via pattern matching has been proven as a viable solution for the problem of scheduling and synchronization in a multi-threaded environment. The next step for this architecture would be the development of a compiler for a more high-level functional language that could be layered on top of the existing assembly. In addition, it would also be interesting to combine the functional concept with the traditional OpenGL pipeline.

5.5 SIMT Shader Core In this section, the architecture of a SIMT based design is presented and, similar to the functional approach (Section 5.4), its performance is analyzed using an FPGA prototype. The main innovation of this architecture is the usage of a global dispatcher (Figure 80), which redistributes the token stream in each iteration and therefore improves load-balancing. The shader core (Figure 81) executes the same instruction for four threads in parallel (SIMT) and contains a local register file for intermediate results to reduce the traffic on the stream. The costs of read-modify-write operations are reduced by storing pixels into an on-chip tile buffer (Figure 80) before they are flushed into the global memory. As a result, the SIMT architecture achieves better scalability than the previous design and has been evaluated using a larger number of examples (Section 5.5.4).

Figure 80. Architecture of the stream rewriting processor. Figure 81. Design of a stream rewriting core (SRC).

5.5.1 Overview The concept of stream rewriting leads to an efficient hardware architecture for graphics processing because all rewriting operations modify only local, compact and non-overlapping regions of the stream. For comparison, one of the general purpose many-core systems presented in Section 4 organizes several rewriting cores (SRC) sequentially as a ring, so that all tokens, even if they are not modified, have to pass through the execution units of a core before they can be accessed by its successor. Therefore, it takes as many cycles to execute a function locally as it would require passing it to another core. The results of the ring architecture show scalability but load-balancing becomes challenging and the architecture mainly benefits from pipelining (Section 4.4). In contrast, in this architecture (Figure 80), the stream is also circulating in a ring but the stream rewriting cores are arranged in parallel. Only several distinguished hardware units like the entry and exit point, the raster output stage (ROP), the stack, and a performance monitor, are chained. Tokens sent by the host CPU correspond to draw calls and are fetched by the entry point until the ring is filled with a minimum work load. Similarly, the exit point allows results, which should not be written into the framebuffer, to be returned to the CPU. In order to avoid bottlenecks and to keep all components utilized, the data width of the ring is chosen large enough to handle several tokens (here: 3) per cycle. The actual rewriting is performed by the group of parallel stream rewriting cores (SRC). For clarity, the schematic (Figure 80) shows only three cores but actually, the architecture has been evaluated for up to eight cores. A dispatcher creates threads from executable rules, which are then passed to the array of stream rewriting cores. Finally, the results are reassembled in the correct order by the combiner. The ROP stage performs blending and depth-testing on the incoming pixels and stores them in a local memory tile (here 512kb), which is later flushed into external memory. In particular, the pixels are described as three consecutive tokens (Section 2.5.1) storing the address, color and depth values. Also, there is a performance monitor that records the number of executed tasks and OUT tokens per iteration. Due to the recursive expansion, the size of the stream varies frequently and may quickly exceed the capacity of the available on-chip memory. As specified for global scheduling in Section 2.3.2, the stack stores the last part of long streams in the external memory, so that only the active part of the stream with at most b tokens is circulating in the ring. Consequently, this architecture utilizes the post-order format of the stream grammar.

5.5.2 Stream Rewriting Core Each stream rewriting core (Figure 81) contains a ring pipeline of several stages, three register files and connections to the dispatcher and the combiner. Executable rules are received from the dispatcher and translated into corresponding tasks, circulating in the pipeline. For non-matching parts of the stream, a pass-through program is executed, which simply copies the input tokens. There are separate register files for inputs, local variables and outputs, so that the decoding of rules, the execution of tasks and the write back of results can occur in parallel. Beside the instruction memory, a limited set of constants is also available to store global parameters like the current camera matrix or light vectors.

Table 7. Instruction Set of the SRC (excerpt).

Name                    Function               Description
Integer
DOT_I32 r, a, b, c, d   r ← a⋅b + c⋅d          dot product
ADD_I32 r, a, b, c, d   r ← a + b + c + d      sum of four
SEQ_I32 r, a, b, c, d   r ← (a = b) ? c : d    set on equal
SLT_I32 r, a, b, c, d   r ← (a < b) ? c : d    set on less
Float
DOT_F32 r, a, b, c, d   r ← a⋅b + c⋅d          dot product
SEQ_F32 r, a, b, c, d   r ← (a = b) ? c : d    set on equal
SLT_F32 r, a, b, c, d   r ← (a < b) ? c : d    set on less
RCP_F32 r, a            r ← 1/a                approximate reciprocal
RSQ_F32 r, a            r ← 1/√a               inverse square root [177]
Memory
READ    r, a            r ← mem[a]             read 32-bit word
SAMPLE  r, u, v, tx     r ← tex2D(tx, u, v)    sample texture at (u, v)
Control
EMIT    a, b            conditional write      append token a to stream if b ≠ 0
EXIT    a               conditional exit       exit program if a ≠ 0

Table 7 lists some of the instructions supported by the SRC. Each of them can refer to up to four operands, which are either fetched from the input, local or constant register file. The result of an ALU operation is stored in the local register file. In addition, there are two control flow instructions: An EMIT instruction appends a token to the output stream and is used to produce the results of a rewriting rule. EXIT either exits the shader conditionally or terminates the execution immediately if its operand is zero. There are no complex control flow instructions, but instead this behavior can be emulated by using the pipeline fragments shown in Section 3.4 to implement branches and feed-back loops similar to functional programming. As an advantage, dynamic flow control can be scheduled globally across all cores. The actual program execution takes place in the ring pipeline of n stages (here n = 32), which are grouped into the blocks instruction fetch (IF), instruction decode (ID), ALU, execute (EX), write-back (WB) and a stage for thread management (TR). For each instruction of a thread, the IF stage fetches the current command word, which is then decoded in the ID stage that also loads the corresponding inputs, locals or constant values. The next stage is the combined ALU and memory stage, which implements deeply pipelined integer and floating-point operations as well as load instructions and texture fetches. Then, the execute stage (EX) handles control flow and finally, the write-back stage (WB) stores the results either in the local or the output register file. Each hardware thread is represented by a tuple describing the execution context and the associated resources of a thread. It contains the current instruction pointer ip ∈ ℤ as well as the numbers i, o, l ∈ ℤ of assigned input, output and local registers:

T := (ip, i, o, l) ∈ 𝕋 ⊆ ℤ × ℤ × ℤ × ℤ        (88)

The design of the SRC resembles a barrel processor and supports multiple hardware threads for improved utilization and latency hiding. In fact, there are two levels of concurrency. First, each stage can store its own thread, so that the processor can execute up to n unrelated rules in parallel. In addition, the pipeline also supports SIMT parallelism up to a level of w (here w = 4) to account for the data parallelism of computer graphics tasks. As a consequence, each stage can process up to w threads in parallel as long as they execute the same rewriting rule. Hence, the maximum number of active threads in the pipeline is n ⋅ w (here 128). For comparison, in terms of CUDA, the warp size of the SRC is w and its functionality approximately corresponds to n ⋅ w CUDA cores.

5.5.3 Shader Programs The stream rewriting processor (Figure 80 on page 103) emulates a graphics pipeline by modifying a token stream. Each stage corresponds to a rewriting rule, so that the complete pipeline can be described as a set of functions f_s with s ∈ ℕ (see Section 5.3). Conceptually, the SRP contains a single program memory as an array of instructions:

prog : ℤ → Σ        (89)
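The thread context tuple of (88) can be transcribed directly into a C structure; the field widths are not specified in the text, so plain integers are used here purely for illustration:

/* Transcription of the thread context tuple (88); types are illustrative. */
typedef struct {
    int ip;   /* current instruction pointer          */
    int i;    /* number of assigned input registers   */
    int o;    /* number of assigned output registers  */
    int l;    /* number of assigned local registers   */
} thread_context_t;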

Figure 82. Linkage of CALL expressions and shader stages.

A pipeline stage is defined by binding a logical stage identifier s ∈ ℤ, which is also used in the CALL token, to the starting address of the corresponding rewriting rule:

stage : ℤ → ℤ        (90)

Hence, an executable CALL token with stage identifier s causes the SRC to create a new thread at address stage(s). While the thread executes, it produces a variable amount of output tokens via the EMIT instruction, which are then used to replace the original CALL expression. The relation between the stage identifier and the associated shader is also illustrated in Figure 82. Individual shaders are stored in the flat program storage. The start address of each shader is defined by the stage table, while the last instruction of a program is always the EXIT opcode. This extra level of indirection permits rebinding logical shader stages without modifying the program memory. In the current implementation of the SRP, the stage table has a minimum size of 64 entries to define up to 64 logical stages. An example is given by the following shader program, which is stored at address 100 and returns the sum of both arguments:

100: ADD_I32 L0, I0, I1    # add inputs I0, I1
101: EMIT    L0, 1         # emit local L0
102: EXIT    0             # exit shader

By setting stage(4) := 100, this shader is invoked using a CALL expression with index 4:

〈… , 1, 2, 4, CALL, … 〉 → 〈… , 3, … 〉        (91)

Currently, the system is programmed from a host CPU using a low-level hardware-specific library. Similar to CUDA or Direct3D, calls are submitted using the dispatch function, which takes the id of the stage, an optional rectangle and a list of arguments:

dispatch(stage, rect, args)        (92)

Technically, dispatch sends the following stream to the stream rewriting processor:

〈arg1, … , argn, stage, CALL〉        (93)

In the next section, a prototype implementation using this API is presented.
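A hypothetical host-side use of this function, assuming the dispatch(stage, rect, args) form of (92), could look as follows; with stage 4 bound to address 100, it corresponds to the rewriting in (91):

/* Hypothetical host-side call; the exact signature of dispatch() is not
   specified here, so this is only a sketch. */
dispatch(4, NULL, 1, 2);   /* appends <1, 2, 4, CALL>; the SRP rewrites it to <3> */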

5.5.4 Implementation In order to evaluate the concept of stream rewriting for graphics processing, the SRP has been implemented using Verilog and synthesized for the VC707 prototyping platform from Xilinx, which features a Virtex 7 XC7VX485T FPGA that can be re-configured with a user-defined netlist. Most important, the board also has a PCIe connector and can be plugged into a PC for communication between the SRP and a graphics application. The hardware and software setup of the complete test system consists of a PC with a dual core 3.20GHz Pentium 4 CPU with Ubuntu (kernel 3.2). The SRP has been evaluated with eight SRCs in an 8 × 4 configuration. The design met timing at 200MHz and required 173,774 6-input lookup tables, 224,757 registers, 396 embedded multiply-adders, and 374 BRAM blocks. Individual utilizations are listed in Table 8.

Table 8. Resource usage per component

Component         LUT      Slice Registers  BRAM  DSP48E1
System (8 Cores)  174,438  225,493          374   396
SRC Pipeline 0    17,768   21,812           24    48
Dispatcher        1,144    1,850            0     0
Combiner          3,645    4,368            0     0
Stack             2,101    2,478            2     0
ROP + TILE        3,557    8,785            128   8
MIG               10,156   12,909           0     0
PCIe Interface    1,707    2,273            14    0
HDMI Interface    480      838              7     0

5.5.4.1 Generic Performance Tests To estimate the raw performance and scalability of the design, three generic test cases are run first, recording the required cycles, processed tokens, executed threads and the maximum stack size. All tests render into a tile of 256x256 pixels and the measurements do not include the effort to initialize or flush the buffer. The tests draw a plain rectangle (256x256) as well as two rectangular triangles with interpolated colors and texturing.

DOT_I32 L0, $X, A0, $Y, B0   # edge 0
DOT_I32 L1, $X, A1, $Y, B1   # edge 1
DOT_I32 L2, $X, A2, $Y, B2   # edge 2
SLT_I32 L3, L0, C0, 0, 1     # if L0 exit
...
EMIT color, 1                # write PIXEL
EMIT depth, 1                # token to the
EMIT PIXEL($X, $Y), 1        # stream

Figure 83. Rasterization Shader

The rectangle uses a short shader, which emits a single pixel token per iteration, to measure the throughput of the system. Likewise, the triangle shader (Figure 83) contains arithmetic instructions for attribute interpolation and outputs only pixels, which are contained within the triangle. It uses instancing (Section 2.4.1) to enumerate and optionally discard uncovered 16x16 blocks [176] [178]. The source code shown in Figure 83 checks if a single pixel (x, y) is inside the triangle by testing the condition a_i ⋅ x + b_i ⋅ y ≥ c_i for each of the three edges (i ∈ {0, 1, 2}) of the triangle. The results of these tests are presented in Figure 84. For one core, the plain rectangle takes three cycles per pixel because each core can emit one token per cycle and each pixel is encoded using three tokens (Section 2.5.1) in hardware for address, color and depth. The number of OUT tokens has been recorded as zero because in this minimalistic example, they are immediately processed by the ROP stage and never reach the monitor (Figure 80). Since the global stream has a width of three tokens in this implementation, the rectangle test is bandwidth limited for more than three cores. In contrast, the colored and textured triangles require more work per pixel, so that the ratio between communication and computation is more favorable. Though, the speed-up also drops for more than four cores with 16 pipelines, but the more expensive textured triangle scales slightly better.
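Expressed at C level, the per-pixel test of Figure 83 corresponds roughly to the following sketch; it uses the get/put/call style of Section 5.6.3 instead of SRC microcode, and the edge coefficients as well as the emitted values are placeholders:

/* C-level sketch of the per-pixel test from Figure 83 (illustrative only):
   a pixel (x, y) is kept if it lies on the inner side of all three edge
   equations a_i*x + b_i*y >= c_i. In the real shader the coefficients come
   from constant registers / the environment. */
void pixel_shader(void)
{
    int x, y;
    get(x); get(y);

    /* placeholder coefficients describing a right triangle with legs of 256 */
    static const int A0 = 1, B0 = 0, C0 = 0;
    static const int A1 = 0, B1 = 1, C1 = 0;
    static const int A2 = -1, B2 = -1, C2 = -256;

    if (A0*x + B0*y < C0) return;    /* outside edge 0 */
    if (A1*x + B1*y < C1) return;    /* outside edge 1 */
    if (A2*x + B2*y < C2) return;    /* outside edge 2 */

    put(0xFFFFFFFF);                 /* PIXEL token: color ... */
    put(0);                          /* ... depth ...          */
    put(y * 640 + x);                /* ... and address        */
}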

[Figure 84 charts: speed-up over one core for 1, 2, 4 and 8 cores is 1.0 / 1.99 / 2.98 / 2.98 for the plain rectangle, 1.0 / 1.99 / 3.87 / 4.69 for the colored triangle and 1.0 / 1.95 / 3.67 / 4.73 for the textured triangle.]

Cores (SRC)        1        2        4        8
Plain Rectangle
  Cycles           197,185  98,885   66,121   66,121
  Tokens           0        0        0        0
  Max. Stack       0        0        0        0
  Threads          65,536   65,536   65,536   65,536
Colored Triangle
  Cycles           178,904  89,856   46,180   38,154
  Tokens           3,345    3,345    3,345    3,345
  Max. Stack       246      246      246      246
  Threads          38,912   38,912   38,912   38,912
Textured Triangle
  Cycles           376,410  192,919  102,579  79,524
  Tokens           4,257    4,257    4,257    4,257
  Max. Stack       1,152    1,152    1,152    1,152
  Threads          38,912   38,912   38,912   38,912

Figure 84. Performance of generic tests.

5.5.4.2 Recursion In addition, the recursive capabilities and the performance of the stack are evaluated by drawing a rectangular 2D triangle through repeated subdivision. The triangle is recursively split into four sub-triangles until the pixel level has been reached, so that each iteration roughly quadruples the number of tokens. The results have been recorded for various triangle sizes and are shown at the top of Figure 85. Obviously, the number of executed threads, and therefore also the number of clock cycles, grows exponentially when the edge length of the triangle is doubled. Likewise, multiple cores offer some scalability but cannot compensate for the exponential growth of threads. However, the maximum required stack size increases only linearly. Similarly, a trace of the stack size and the number of active tokens is shown at the bottom of Figure 85. Despite the frequently varying size of the stack, the number of circulating tokens remains nearly constant at the requested level (here b = 3072).

[Figure 85, top: number of cycles, maximum stack size and number of threads for triangle sizes 16x16, 64x64 and 256x256 with 1, 2, 4 and 8 cores. Bottom: trace of the stack size and the number of processed tokens over the iterations.]

Figure 85. Results of recursive tests.

5.5.4.3 Graphics Pipeline Based on the generic examples, a more complex pipeline is built to estimate performance and flexibility of the system. In addition, the SRP architecture using a global dispatcher and parallel working cores is compared to the much simpler ring pipeline, which arranges the SRCs sequentially and uses a FIFO instead of a stack. All tests render into a framebuffer of 640x480 pixels and six tiles (256x256). In contrast to the generic tests, the complete rendering time of a frame is measured. The results are listed in Figure 86 and Figure 87 for different numbers of cores, while the corresponding screenshots are shown in Figure 88 and Figure 89. The models used in this test, the bunny, dragon and armadillo meshes, are taken from the Stanford 3D Scanning Repository: http://www.graphics.stanford.edu/data/3Dscanrep/. First, a rasterization pipeline has been implemented with perspective correct attribute interpolation [179] and triangle setup as custom shader stages. It supports both vertex colors and per-pixel lighting by interpreting the color as a normal vector. In addition, Phong Tessellation [180] is added as a set of recursive rules and interpolates a surface based on the vertex normals. In this example, a cube is subdivided one, two or four times by a recursive shader program. Further, a torus is rendered using vertex colors, per-pixel lighting or texture mapping. In addition, a test using small triangles (3 pixels) is run to estimate the non-pixel-related costs like the triangle setup. It can be observed that most tests scale with the number of cores and that the parallel arrangement is slightly more efficient, but the results also expose several scalability issues, which are discussed in the next section.

[Figure 86 charts: render times in ms for 1, 2, 4 and 8 cores. Panels: Bunny with vertex colors (948 / 3,851 / 16,301 triangles), Bunny with per-pixel lighting (948 / 3,851 / 16,301 triangles), objects (Deformed 1,280 / Armadillo 4,860 / Dragon 11,102 triangles) and Phong Tessellation of a cube (1x: 48 / 2x: 192 / 4x: 3,072 triangles).]

Figure 86. Performance of rendering tests.

[Figure 87 charts: render times in ms for 1, 2, 4 and 8 cores. Panels: Torus with 512 triangles (vertex color, per-pixel lighting, texture & vertex color) and small triangles of 3 pixels (1K / 4K / 16K triangles).]

Figure 87. Performance results of rendering tests.

Figure 88. Screenshots (640x480) of rendering tests: Bunny Color (1K / 4K / 16K) and Phong Tessellation (1x / 2x / 4x).

Figure 89. Screenshots (640x480) of rendering tests: Dragon (11,102), Deformed (1,280), Armadillo (4,860), Torus (Color / Lighting / Texture & Color), Bunny Lit (1K / 4K) and Bunny Color (16K).

5.5.5 Discussion The performance evaluation of the SRP prototype indicates some limitations of the proposed architecture and leads to the following results:

 Stream Management The throughput of the stream might easily become a limitation of this architecture, so that it has been made three tokens wide in the presented implementation. The combiner (Figure 80) reassembles multiple narrow streams, which are coming from the SRCs, into a wide stream in parallel. Similarly, the ROP unit also reads three tokens at once and can therefore write one pixel per cycle. In addition, its capacity could be easily extended by widening the stream to six or even more tokens. For this prototype, a raw fill rate of 200 MPixel/s is sufficient since the focus has been put on the flexibility of the shader cores (SRC). However, the generic tests in Section 5.5.4.1 show that for simple shaders, which output a constant color, the stream still becomes a bottleneck and that, according to the stream width of three tokens, their speed-up is also limited to three.

 Stack-based Scheduling In comparison to the FIFO, the main advantage of the stack is the reduced memory usage. However, it has a lower raw throughput due to implementation details. Hence, for one core, there is just the overhead of the stack without the benefits of the improved scheduling, which can be seen in the Phong Tessellation test case.

 Dispatcher The dispatcher is currently the slowest component because it has a maximum throughput of one token per cycle. Hence, the usage of instancing provides a significant speed-up because it can reuse input data, which allows creating up to four SIMT threads per cycle. Currently, geometry processing has been implemented via CALL tokens, so that it does not utilize SIMT and also suffers from stream contention. In contrast, the pixel operations are specified as instanced calls (Section 2.4.1) and dispatched at a much higher rate. Therefore, the experiments in Figure 86 and Figure 87 show a better scalability for models using fewer and larger triangles.

This section presents the evaluation of stream rewriting for graphics processing that combines a traditional SIMT approach for local threads at each core with a global scheduling algorithm for stream management. As a result, it effectively solves many existing limitations and supports both programmable computations and communication. In the next section, the flexibility of this approach is extended towards the general purpose processors.

5.6 General Purpose Graphics Processor Based on the experimental results of the functional processor and the SIMT design, a third stream rewriting processor (SRP) for OpenGL graphics is developed in this section. Instead of specialized shader cores and microcode, this architecture utilizes general purpose processors, which can be programmed in ANSI C and are optimized for instruction-level parallelism. The flexibility of this architecture is further enhanced by a two-level cache hierarchy, custom atomic operations through dynamic binding (Section 2.5.2) and support for stream compression (Section 2.4). In addition, a functional simulator and an OpenGL extension for stream rewriting simplify the development of test cases. As a result, these novel capabilities greatly reduce the complexity of several advanced rendering methods, like order-independent transparency via K-buffering, path-tracing and the recursive generation of procedural geometry. Although these techniques require significant effort to run efficiently on current GPUs, the experiments show that they fit well into the concept of stream rewriting and thus benefit from the flexibility of this stream rewriting processor (SRP).

Figure 90. Components of the SRP test system (user mode application, runtime API, SW emulator, HDL simulation, kernel DMA driver, mainboard PCIe root, FPGA PCIe endpoint, DDR3 RAM, SRP, HDMI output).

5.6.1 Overview The complete test system of the SRP is shown in Figure 90 and consists of several hardware and software layers. A graphics application can program the SRP either via a runtime API or by using a custom OpenGL extension. In addition to the hardware SRP, there are two simulated devices:

1. The HDL simulation is written in Verilog and based on the same source code as the hardware to provide a slow but cycle-accurate model. Hence, it is mainly used to verify correct functionality of the hardware SRP.

2. The software emulator utilizes a much faster but less precise functional model, which is better suited for complex test cases. It is integrated into the driver stack and performs binary translation of the shader at runtime, so that stream rewriting applications can be run at decent speed without modifications.

A hardware prototype of the SRP has been implemented on the VC707 FPGA board from Xilinx and runs at 200MHz. Most important for graphics processing, the board includes 1GB of 800MHz DDR3 memory, a PCIe slot, and a HDMI output. In this configuration, the initial token stream of the application is copied to the SRP via the PCIe interface. Similar to the other PCIe interfaces, a kernel mode driver allocates a ring buffer in system memory and the SRP fetches the commands autonomously via DMA transfers. In addition, read-backs from the SRP to the host CPU are implemented in a similar way.

Table 9. Resource Usage of the FPGA (Relative Usage in %)

#Cores  LUT            REG             BRAM   DSP48E1
1       27,909 (9%)    41,234 (6%)     7/137  10
2       34,834 (11%)   49,051 (8%)     7/147  20
4       49,084 (16%)   65,767 (10%)    8/181  40
8       76,405 (25%)   96,854 (15%)    8/221  80
16      128,583 (42%)  159,033 (26%)   8/301  160
32      240,188 (79%)  282,591 (46%)   7/447  320

The entire system has been synthesized, placed and routed for different configurations of the SRP, which consists of up to 32 stream rewriting cores (SRC). The resource utilization of each setup is listed in Table 9. The resources are distinguished into six-input look-up tables (LUT), registers (REG), block RAMs (BRAM) with 18/36 kbits and 25x18 bit multipliers (DSP48E1). In the next sections, the components of this SRP are described in more detail.

5.6.2 Stream Rewriting Processor (SRP)

The data flow model of the stream rewriting processor (Figure 91) consists of concurrently working hardware blocks that are connected via queues (arrows), so that all steps of the stream rewriting process can be optimally performed in parallel. In general, the structure resembles the abstract design of current GPUs [9] and contains a decoder for work items and several stream rewriting cores (SRC). The stream rewriting network (green) with routers (R) transports the token stream. In addition, each core also has access to the external memory via a separate network (red) with routers (M) and a shared L2 cache (256 KB).

Figure 91. Architecture of the stream rewriting processor.
Figure 92. Binary token format of the SRP (literal token: 1-bit type + 32-bit data; CALL token: 1-bit type followed by flag, function and argument fields).

Here, the single most significant difference to established architectures is the global feed-back loop at the left side, which allows the SRCs to dynamically generate work-load during the execution of a shader. The rewriting process starts at the PCIe interface with the initial stream received from the CPU. The entry point ensures that arriving tokens are always appended to the end of the existing stream and truncates the input stream if the optimal size of b tokens has been reached (Section 2.3). Executable CALL expressions are detected by the decoder and are then rewritten by the tree of stream rewriting cores (Section 5.4.2). The cores are connected through a hierarchical network of routers (R), which split and reassemble the resulting token stream in the correct order. Eventually, the stream reaches the stack, which implements the scheduling algorithm presented in Section 2.3, and at the exit point, results are removed from the stream and sent back to the CPU. In contrast, pending computations are passed to the entry point to proceed with the rewriting process in another iteration. Memory requests from the SRCs are collected using a 512 bit wide ring-bus and routed through a direct-mapped L2 cache in order to reduce the latency of external memory accesses. The cache utilizes a write-back strategy and contains 4096 blocks of 64 bytes. Eventually, the results of a read operation are sent back to the SRC via a second hierarchical network (red) with routers (M). The binary token format is illustrated in Figure 92 and consists of 33 bits. The most significant bit distinguishes between literal values and CALL tokens, which contain the number of arguments, the address of the corresponding function and additional flags.

5.6.3 Stream Rewriting Core (SRC) While the stream management and decoding is well-suited for dedicated hardware blocks and pipelining, shaders are written in software to combine the performance advantages of fixed hardware with the flexibility of general purpose computations. The stream rewriting core (SRC) is designed as a general purpose processor with additional stream rewriting instructions (Table 10), but in contrast to the CUDA multiprocessor, it is optimized for single-thread performance to handle frequently varying and incoherent streams. Similar to the Echelon research GPU from NVidia [181], it runs only four unrelated threads to execute different code paths without performance loss. Since the results of a rewriting operation must be reassembled into a single stream, a moderate number of threads also helps to minimize the capacity of the required input and reorder buffers (255 tokens per thread). In addition, the SRC also contains a dual-issue pipeline as well as an 8-way associative L1 cache (Figure 93). The instruction set of the SRC is mostly binary compatible to the MicroBlaze architecture from Xilinx, so that the existing GCC tool-chain can be employed for the compilation of shader programs.

Figure 93. Schematic of the stream rewriting core (stream input/output, input and reorder buffers, fetch, decode, register file, ALU, FPU, FMUL, put/call, branch, local memory, L1 cache, memory request/receive).

Table 10. Instructions for Stream Rewriting

OpCode             Description
get(x)             Read a literal value from the stream and store it in variable x.
put(x)             Write x as a literal on the stream.
call(func, argc)   Generate a CALL token for function func with argc arguments.

However, the internal design of the SRC is fundamentally different and optimized for moderate instruction-level parallelism. Four hardware threads are cycling through the processor in order to compensate the latency of floating-point operations without the costs of dynamic scheduling. Also, the FETCH stages load two subsequent opcodes from the 32KB local memory, which is shared among all threads and also contains a separate stack for each thread as well as local data. In the DECODE stages, up to five operands are read from the register file with 4 × 32 = 128 entries. The two ALU stages perform integer arithmetic, the full FPU is responsible for floating-point addition, multiplication, division, and square-root, but there is also a second FMUL unit, so that two multiplications can be issued per cycle. In addition, the branch unit handles control flow and computes the address of the next instructions to be fetched. Beside floating-point division and square-root, most of the functional units are pipelined, so that the SRC can reach a maximum throughput of two instructions per cycle.

Each of the four threads owns a small cache with a size of 512 bytes, which is divided into eight fully associative blocks. Since the rendering of a frame usually requires random access to multiple source and destination buffers, associativity has been preferred in favor of the total cache size. Due to the small number of threads, a cache becomes necessary in this design to reduce the costs of an external memory access (> 40 cycles latency). For stream rewriting, the SRC implements the three additional opcodes get, put and call described in Table 10. In particular, get reads a literal token from the stream, put emits a literal and call generates a new function call. Initially, the STREAM UNIT (Figure 93) scans the stream for executable tasks and invokes the corresponding function, whose entry point is an address in local memory given by the stage identifier s. The function itself is then responsible for fetching arguments and producing output tokens. For instance, the following shader f returns the sum of its two arguments:

void f()
{
    int x, y;
    get(x);      // read argument x
    get(y);      // read argument y
    put(x+y);    // write sum of x and y
}

More complex examples are presented in Section 5.6.6. Instead of specifying the stream opcodes manually, it would also be possible to let a high-level compiler map the basic blocks into individual shaders (Section 3.3).

Table 11. Runtime API of the SRP

Function                      Description
Stream
  put(x)                      Append literal x to the stream.
  call(func)                  Append CALL token to the stream.
  receive(size)               Read results from the SRP.
Synchronization
  sync()                      Ensures that all preceding calls are finished.
  flush()                     Start rewriting of the token stream.
Memory
  mem_alloc(size)             Allocates size bytes of global memory.
  mem_free(ptr)               Frees a global memory region starting at ptr.
  mem_write(addr, size, data) Write data into global memory.
  mem_read(addr, size)        Read size bytes from global memory starting at addr.
  mem_set(addr, size, data)   Clear global memory region starting at addr with data.
Utility
  display(addr)               Display video output at the address addr.
  config(addr, size, data)    Write data into local memory of all SRCs.

5.6.4 Runtime API The runtime library abstracts the software interface to the SRP, which consists of the functions listed in Table 11. In particular, the main purpose of this library is to build the command stream for the SRP, to provide basic memory management, and to invoke certain system calls and utility functions. Therefore, more high-level libraries like OpenGL can omit hardware details and token encoding. In addition, the runtime API works mostly transparently with all three device types, so that test cases can be quickly brought from the emulator to the real hardware with minimal effort. The stream instructions are similar to the opcodes of the SRP itself (Table 10). Here, put writes a literal and call inserts an invocation into the command stream. Likewise, results of a computation can be retrieved by receive. As an important implementation detail, the stream is buffered locally before it is written to the kernel mode driver. In addition, the runtime library is usually linked statically, so that the corresponding functions can be inlined to reduce the overhead of this immediate style API. The runtime API supports two different instructions for synchronization. A call to flush transfers the current stream to the SRP and starts the rewriting process. Hence, it is normally called at the end of a command sequence and also before receiving data from the SRP. Since there might be still tokens from a previous call circulating in the ring (Figure 91), there is a second function sync, which ensures that all preceding tasks are rewritten completely. For this purpose, it blocks the entry point until the ring becomes empty and is usually called before modifying global configuration data. Both functions are inserted into the stream and execute in-order but asynchronously. In addition, the runtime library provides the functions mem_alloc and mem_free for allocating and deallocating global memory. There are also three system calls for writing, reading and clearing a continuous region of memory, which are executed on the SRC. In contrast to the global memory, the local RAM must be managed manually and can be written for all SRCs in parallel by using the config system call. At the end of a frame, display sets the start address of the front buffer for HDMI output.
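A typical host-side sequence built from these functions might look as follows; the buffer size, the stage name draw_scene and all argument values are purely illustrative and not part of the actual implementation:

/* Hypothetical host-side sequence using the runtime API of Table 11. */
unsigned int fb = mem_alloc(640 * 480 * 4);   /* framebuffer in global memory */
mem_set(fb, 640 * 480 * 4, 0);                /* clear it                     */

put(fb);                  /* pass the buffer address as a literal argument */
call(draw_scene);         /* append a CALL token for the top-level stage   */
flush();                  /* submit the stream and start rewriting         */

receive(4);               /* read back results remaining on the stream     */
display(fb);              /* scan out the finished frame via HDMI          */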

[Figure 94: the standard OpenGL pipeline with display list, per-vertex operations, rasterization, per-fragment operations, pixel operations and texture memory, writing into the framebuffer in global memory; the stream rewriting shader is added as an alternative programmable path between the application and global memory.]

Figure 94. OpenGL pipeline with stream rewriting module.

5.6.5 OpenGL Integration On top of the runtime library (Section 5.6.4), a proof-of-concept OpenGL front-end has been implemented without support of the fixed function pipeline. A special feature of OpenGL is the extension mechanism, which allows introducing experimental functionality into the API before it is standardized. The standard OpenGL pipeline and its relation to stream rewriting are illustrated in Figure 94. Initially, rendering commands are processed by the per-vertex, rasterization, and per-fragment stages, which also contain programmable shaders in more recent OpenGL versions. This thesis proposes an alternative type of programmable stage called the stream rewriting shader, which bypasses the existing per-vertex, rasterization, and per-fragment processing, but stream rewriting commands are still compiled into display lists. In order to minimize the number of modifications, the extension EXT_stream_rewriting is built on top of the existing OpenGL shader API (glCreateProgram, glDeleteProgram). A stream program is actually written in C, compiled using the existing GCC tool chain for the MicroBlaze architecture, and the resulting ELF executable can be loaded via glProgramBinary. The stream program exports a list of global variables and functions, whose addresses can be retrieved by calling glGetUniformLocation. Variables are set using the class of glUniform* functions and for the invocation of a stream program, the following commands are added to emit literal tokens of various types (*) and function calls:

void glPut*(T x, T ...);
void glCall(GLuint addr);

Similar to the glBegin/glEnd immediate mode, stream commands are exclusively placed inside a block of glBeginStream and glEndStream. As a result, the same GL state can be used for multiple calls to reduce the validation overhead in the driver.

void glBeginStream();
void glEndStream();

The minimalistic example below shows the recursive shader subdiv, which tessellates the triangle (a, b, c) until the specified detail level has been reached. In this case, the coordinates are passed to the next stage (draw_tris) for further processing.

#define get3(v) get(v[0]); get(v[1]); get(v[2]);
#define put3(v) put(v[0]); put(v[1]); put(v[2]);

void subdiv()
{
    float a[3], b[3], c[3], ab[3], bc[3], ca[3];
    get3(a); get3(b); get3(c);      // read 3x3 floats

    int level;
    get(level);                     // read detail level

    if (level == 0)                 // end of recursion reached
    {
        put3(a); put3(b); put3(c);  // draw current triangle
        call(draw_tris, 12);
        return;
    }

    int d = level-1;                // proceed with subdivision

    for (int i = 0; i < 3; i++)
    {
        ab[i] = (a[i]+b[i])*0.5f;   // midpoint of a and b
        bc[i] = (b[i]+c[i])*0.5f;   // midpoint of b and c
        ca[i] = (c[i]+a[i])*0.5f;   // midpoint of c and a
    }

    // create four smaller triangles
    put3(ca); put3(a);  put3(ab); put(d); call(subdiv, 13);
    put3(ab); put3(b);  put3(bc); put(d); call(subdiv, 13);
    put3(bc); put3(c);  put3(ca); put(d); call(subdiv, 13);
    put3(ab); put3(bc); put3(ca); put(d); call(subdiv, 13);
}

The compiled ELF image is loaded into a program object p:

const char elf_image[] = { ... };
GLuint p = glCreateProgram();
glProgramBinary(p, GL_FORMAT_ELF, elf_image, sizeof(elf_image));
GLuint subdiv = glGetUniformLocation(p, "subdiv");

The following code selects the program and utilizes the new functions to invoke the shader for the coordinates (a, b, c):

glUseProgram(p);      // select shader program p
glBeginStream();      // begin initial stream
glPut3fv(a);          // write coordinates a, b, c
glPut3fv(b);
glPut3fv(c);
glCall(subdiv);       // append call token
glEndStream();        // submit initial stream

In the next section, the proposed extension EXT_stream_rewriting and the OpenGL prototype driver are utilized to implement more complex examples.

5.6.6 Results

In this section, experimental results and performance measurements of the FPGA prototype are presented. The design is implemented using the VC707 board and connected to a PC with a dual-core 3.20 GHz Pentium 4 CPU running Ubuntu (kernel 3.2). Hardware configurations with up to 32 cores have been tested. In addition, the software simulator (SW) is evaluated on an i5-4670K processor with 3.4 GHz and Windows 8.1. For the hardware version, the execution time of a shader is measured using the ARB_timer_query extension. All images have been rendered at a resolution of 640x480 pixels. This section consists of three parts. As a reference for the subsequent test cases, the arithmetic and memory throughput of the SRP is evaluated first. The second part utilizes several basic rendering tests to demonstrate the ability of the SRP to execute various graphics pipelines in software. Finally, the third part contains three advanced examples, which are difficult to implement on current GPUs and therefore especially benefit from the more flexible architecture of the stream rewriting processor.

Table 12. Arithmetic Performance in MIPS

#          ADD        MULT       FADD      FMULT      FDIV
Max./SRC   400.00     400.00     200.00    400.00     27.59
1          375.98     376.39     196.90    376.38     23.53
2          751.11     752.75     393.79    752.73     47.06
4          1,498.80   1,505.29   787.53    1,505.05   94.12
8          2,983.74   3,008.52   1,574.73  3,008.66   188.22
16         5,911.50   6,009.33   3,146.95  6,010.53   376.41
32         11,598.97  11,957.08  6,277.81  11,960.83  752.60
SW         2,083.41   1,666.92   1,655.36  3,428.56   524.09

Table 13. Memory Throughput in MB/s

#        Read      Write     Increment  Clear
1        254.37    325.80    149.15     1,517.96
2        498.92    609.05    293.77     3,017.31
4        971.44    1,047.80  561.56     5,955.01
8        1,432.23  1,564.73  935.67     5,946.51
16       1,291.72  1,133.41  1,006.56   5,938.92
32       1,262.05  916.24    852.54     5,929.74
SW       8,870.77  7,735.24  9,226.41   not supported

5.6.6.1 Performance Tests

The ALU performance is measured by distributing operations of a specific type equally over all SRCs; the results are shown in Table 12. In addition, the theoretical maximum of each instruction type per SRC is listed in the first row. The largest configuration reaches ≈ 90% of the theoretical performance. Except for the division, the raw ALU performance of the hardware SRP exceeds the results of the software emulation, so that it most likely does not represent a bottleneck in the further test cases.

The VC707 prototyping platform contains a DDR3 SODIMM running at 800 MHz / 1600 Mbps with 64-bit data width, which leads to a maximum throughput of 12.8 GB/s for the external memory interface. In order to match this peak performance, the L2 cache and the memory interconnect are running at 200 MHz with 512-bit data width. However, the actual efficiency of DRAM access heavily depends on the address pattern. In this test, the capability of the memory sub-system to handle concurrent requests from up to 32 SRCs is measured. In particular, 256x1 MB are read, written, incremented, and cleared by separate threads (Table 13). In comparison to existing GPUs and the software simulator, the memory bandwidth is several orders of magnitude smaller, so that a performance difference in the further tests is certainly caused by memory access.
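As a quick cross-check of these figures: 1600 MT/s × 8 B (64-bit DRAM interface) = 12.8 GB/s on the external side, and 200 MHz × 64 B (512-bit internal interface) = 12.8 GB/s on the L2/interconnect side, so the internal data path is indeed matched to the external peak bandwidth.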

5.6.6.2 Generic Tests

After measuring the basic performance of the SRP, several generic applications are run to evaluate distinct use-cases of the system. The Mandelbrot example demonstrates the evaluation of a data-parallel function using stream rewriting, where each pixel is computed individually. Similarly, Rotozoom displays a rotating texture, but instead of expensive computations, there is one read and one write operation per pixel. Further tests include the rendering of a texture-mapped cube, a torus with per-pixel lighting, and the bunny with 69,451 triangles from the Stanford Computer Graphics Laboratory. Bezier3D contains a complex pipeline, which subdivides and draws a bi-cubic Bezier patch, and in the N-Patch test a low-poly object is tessellated using N-patches [182]. Ray-casting reimplements a classic 2.5D ray casting engine in the shader and displays an old cellar environment, while the ray-tracing test draws a reflecting sphere on a plane with a chessboard pattern to measure the efficiency of dynamic scheduling with non-uniform workload. For each test, the average rendering times per frame are shown in Figure 96 and selected images are displayed in Figure 95.
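Because the host only has to emit the initial token stream, such data-parallel tests require very little setup code. The following sketch is not taken from the thesis; it assumes a shader entry point mandel_line that processes one scanline as well as a glPut1i variant of the glPut* family, and it shows how one call token per scanline could be submitted through the proposed extension:

GLuint mandel_line = glGetUniformLocation(p, "mandel_line");
glUseProgram(p);
glBeginStream();                 // begin initial stream
for (int y = 0; y < 480; y++)
{
    glPut1i(y);                  // scanline index as the only argument
    glCall(mandel_line);         // one dynamically scheduled thread per scanline
}
glEndStream();                   // submit initial stream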

Figure 95. Screenshots (640x480) of the rendering tests: Torus, Bezier3D, Bunny (69,451 triangles), N-Patch, Ray Casting, and Ray Tracing.

Figure 96. Average rendering time per frame in ms for different numbers of cores (1-32 and SW): Mandelbrot, Rotozoom (Texture), Textured Cube, Torus, Bezier3D, Bunny (69,451 triangles), N-Patch (948*43 triangles), Ray Casting, and Ray Tracing.

The results expose the general scalability of the system but also limitations of the current implementation. For instance, the software emulator is faster in all cases except Bezier3D and Mandelbrot, which suggests stream contention or cache thrashing. Also, the increased number of small triangles causes stagnation at eight cores for the Bunny and N-Patch tests.

5.6.6.3 Advanced Tests

In this section, three examples, which benefit from the improved flexibility of the stream rewriting processor, are presented. The corresponding images are shown in Figure 97 and measurements are presented in Figure 98 and Figure 99. The k-buffer has been proposed to solve the problem of order-independent transparency with a single geometry pass [150]. Regardless of the draw order, it collects the k nearest color and depth samples during the rendering of the scene. At the end of a frame, the stored samples are combined using alpha-blending. Hence, the rendering of a pixel requires the insertion of the current color and depth values into a sorted list of k elements. A significant technical challenge is to guarantee the correctness of this read-modify-write operation when multiple concurrent threads access the same location. However, recent GPUs often support only a fixed set of atomic operations for basic arithmetic, which do not cover this more complex and application-dependent case. Thus, the original k-buffer suffers from hazards and rendering artefacts, which can be resolved by introducing per-pixel spin-locks [140] or applying software error correction [151].
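To illustrate why this update exceeds the fixed set of GPU atomics, the following C sketch (an illustration only, assuming k = 4; it is not the SRP implementation itself) shows the per-pixel read-modify-write that has to be executed atomically: a new sample is inserted into the sorted list and the farthest stored sample is discarded.

#define K 4                              /* number of stored samples per pixel */

typedef struct
{
    float    depth[K];                   /* sorted, nearest first              */
    unsigned color[K];
} kbuffer_pixel_t;

/* Insert a new (depth, color) sample and keep only the K nearest ones. */
static void kbuffer_insert(kbuffer_pixel_t *p, float depth, unsigned color)
{
    if (depth >= p->depth[K-1])
        return;                          /* farther than all stored samples    */
    int i = K - 1;
    while (i > 0 && p->depth[i-1] > depth)
    {
        p->depth[i] = p->depth[i-1];     /* shift farther samples back         */
        p->color[i] = p->color[i-1];
        i--;
    }
    p->depth[i] = depth;                 /* place the new sample               */
    p->color[i] = color;
}

On the SRP, the same effect is achieved without hardware atomics by binding all updates of a scanline to the same rewriting core, as described below.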

Figure 97. Screenshots (640x480) of the rendering tests: k-Buffer (Gears, Drive, Building), PT0 (Basic), PT0 (Adaptive), Iteration Count, PT1 (Adaptive), PT2 (Adaptive), and Tree.

Figure 98. Performance of advanced rendering tests: a) k-Buffer (Gears, Drive, Building), b) Adaptive Path Tracing (PT0, PT1, PT2), each for 1-32 cores and SW.

Figure 99. Performance of advanced rendering tests: c) Tree (1-32 cores and SW), d) Rendering time per frame (basic: 8 rays per pixel vs. adaptive: 1-8 rays per pixel).

The k-buffer implementation for stream rewriting follows the original proposal, which describes a possible hardware extension to assign pixels to a specific shader core based on their coordinates (Fig. 2 in [150]). The suggested programmable scheduler can be implemented on the SRP in software using the atomic calls described in Section 2.5. Since the SRP is optimized towards coarse-grained threads, the dynamic binding is not performed at the level of pixels; instead, complete scanlines are assigned to a particular core. The k-buffer implementation of this test stores k = 4 elements per pixel and can therefore display up to four levels of transparency. It is evaluated using the gears, drive, and building models (Figure 97). The average rendering times per frame are shown in Figure 98a) and indicate acceptable scalability for up to eight cores as well as a moderate speedup for 16 cores. Since each pixel stores 2·k = 8 32-bit values for color and depth information, memory bandwidth can become a bottleneck of this test; therefore, the software emulator outperforms the hardware implementation. The second test employs path tracing to approximate the problem of global illumination. For this purpose, path tracing generates random rays, which are iteratively reflected from surfaces to collect both direct and indirect light contributions [183]. This technique effectively utilizes a Monte Carlo algorithm to approximate the integral of the rendering equation and is able to produce realistic images at the expense of computation time. However, since the calculations of pixels and rays are independent, it can benefit from many-core architectures. Unfortunately, stochastic algorithms generally do not map well to the data-parallel SIMT architectures of current GPUs, and in this case, the randomly generated rays lead to divergent control flow in the shader. As a possible solution, the author of [184] suggests removing inactive rays after each iteration and compacting the stream to improve utilization. Since the SRP already reassembles rewritten fragments into a single continuous stream, this compaction is effectively performed in hardware by the combiners of the stream rewriting network. As a result, the presented SRP architecture can handle non-data-parallel threads with different code paths at full speed. In addition, the conditional creation of additional rays corresponds to the dynamic instantiation of threads and fits well into the execution model of stream rewriting.
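The following sketch illustrates how one bounce of such a path tracer maps onto this execution model. It follows the conventions of the subdiv listing in Section 5.6.5 but is not the implementation used for the measurements; the type Hit and the helpers intersect_scene, shade, write_pixel, and next_ray are hypothetical:

void trace()
{
    float org[3], dir[3], weight[3];
    int pixel, depth;
    get3(org); get3(dir); get3(weight);        // current ray segment
    get(pixel); get(depth);                    // target pixel and remaining bounces

    Hit h = intersect_scene(org, dir);         // hypothetical helper
    if (depth == 0 || !h.valid)
    {
        write_pixel(pixel, shade(&h, weight)); // path ends: the thread disappears
        return;
    }

    next_ray(&h, org, dir, weight);            // hypothetical: random reflection
    put3(org); put3(dir); put3(weight);        // re-emit the ray as a new call token,
    put(pixel); put(depth - 1);                // i.e. dynamically create the next thread
    call(trace, 14);                           // token count chosen analogously to subdiv (assumed)
}

Rays that terminate simply emit no new call token, so the stream shrinks automatically and the combiners of the network perform the compaction described above.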

Three test cases (PT0, PT1, PT2) are presented in Figure 97 and show partly reflective spheres in a colored box. The basic example always computes 64 frames with 8 rays per pixel, but the adaptive version skips the remaining rays of a frame as soon as a pixel converges. Hence, each pixel in the image is computed using a different number of iterations (Figure 97, middle right, colors normalized), so that the rendering time of the adaptive approach drops after a few frames (Figure 99d). In contrast, the basic approach with a fixed number of rays wastes computation time by processing pixels, which are not likely to change. Due to the large amount of floating-point arithmetic, the adaptive path tracing achieves good scalability (Figure 98b). However, for the 32-core version, a delay had to be inserted after each frame, which is not shown here, to stabilize the power supply of the FPGA board. Similar to shape grammars [153], the recursive refinement of the token stream naturally corresponds to the creation of procedural geometry on the GPU and exceeds the capabilities of the Direct3D 11 tessellator stage. In order to demonstrate this functionality, a growing tree is generated in the shader based on an age parameter. For this purpose, the tree is constructed from cylindrical shapes [185], which are then rasterized as triangles. The final image of the tree is shown in Figure 97 and the performance results (Figure 99c) indicate scalability.

5.6.7 Conclusion

Based on the concept of stream rewriting, a novel graphics processor has been designed, which supports custom rendering pipelines, recursion, complex atomic operations, and incoherent workloads. It combines the performance advantages of hardware-based scheduling with the flexibility of general-purpose processors. Despite its limitations, the FPGA prototype provides worthwhile insights for future GPU architectures and runs a large variety of applications.

6 High-Level Synthesis

This section describes the usage of stream rewriting for high-level synthesis of hardware components. Current high-level synthesis tools based on C/C++ offer only limited support for recursion and function pointers. Stream rewriting enables hardware support for the dynamic creation of threads, parallel recursive tasks, and data-dependent branching. Complex examples are used to show the effectiveness of the proposed method. This section is mostly based on the paper [83].

6.1 Introduction

The synthesis of hardware components from a high-level software specification makes it possible to increase the complexity of the system while shortening the development time. In addition, resource consumption and performance can be optimized by automated design space exploration. Further, when compared to a manual implementation, the automated process is less error-prone and can be verified, so that the quality of the resulting design is greatly improved [186]. However, most tools only support a subset of the C/C++ language and focus on either data-flow or control-flow based applications [187]. While data-flow graphs can be mapped very efficiently into hardware, the synthesis of dynamic control flow and especially recursive function calls is often not possible. Usually, pointers are also not supported in general, because the tools rely on static analysis to constrain the set of possible locations [188] [189]. If a program does not match the expectations of the synthesis software, the compilation either fails or produces inappropriately large structures. Hence, a program must be adapted to the restrictions of the synthesis tools before it can be implemented in hardware [190]. Although high-level synthesis can be considered a great success, the following open problems can be identified:

1. Often a fixed mapping between high-level functionality and hardware components is constructed, so that the design cannot react to varying workloads at runtime.
2. Arbitrary communication between two modules requires at least one logical channel implemented via a network or direct connections.
3. If recursion is supported, the local state is often stored on a stack, which enforces sequential access and hinders concurrent execution of multiple branches.

In this context, the computational model of stream rewriting offers significant advantages for the high-level synthesis of RTL code from C sources. First, it has already been shown in Section 3.3 that a restricted form of C can be translated into an SRM. In this section, the SRM will be further used to generate RTL Verilog code. Finally, both steps are combined into a complete tool-chain for the high-level synthesis of recursive C functions (Figure 100).

Figure 100. The synthesis tool-chain consists of several steps

First, the compiler parses the source file and converts the program into an expression tree similar to a functional language (Section 3.3). This tree is then rewritten and mapped into an acyclic graph of rules describing functional units of the hardware. In the last step, the hardware implementation of these rules is generated and written to a Verilog source file, which serves as an input for RTL synthesis. Despite the fundamentally different representation, all C language features except indirect access to local variables and arguments can be supported. Functions with and without return value can be marked as asynchronous to execute on a new thread; the synchronization occurs when the result is accessed for the first time. As a result, recursive functions, indirect calls via pointers, and the dynamic creation of new threads are supported natively. However, unlike a software-based solution, which uses a stack for recursion, the synthesized hardware is able to execute multiple branches in parallel. Based on the distributed light-weight scheduling mechanism of stream rewriting, it becomes possible to spawn and synchronize multiple threads per clock cycle. As shown in Section 3.3, all branch operations are implemented as asynchronous function calls by the synthesis front end when it splits the source program, so that even normal control flow can take advantage of the hardware-based scheduling to hide the latency of long pipelines.
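As a small illustration of this programming model (a sketch only; the concrete syntax for marking a function as asynchronous is not shown in this section, so an attribute-style marker is assumed, and print is the exit-point function from Section 6.4):

__attribute__((async)) int work(int x);   // executes on a newly created thread

void main(int x)
{
    int a = work(x);        // spawns a thread and continues immediately
    int b = work(x + 1);    // second branch runs concurrently
    print(a + b);           // first access of a and b synchronizes both threads
}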

6.2 Related Work

In the context of hardware synthesis, there exist two different types of recursion:

 Structural recursion allows creating hierarchical designs and, similar to macros or inline functions, it is always expanded during synthesis. As a result, most tools and languages, including VHDL and Verilog, provide support for structural recursion.

 Generative recursion is based on dynamically computed values and requires the expansion of a function at runtime. Stream rewriting provides a method for generating hardware from a recursive behavior specification.

For comparison, there have been several attempts at recursive hardware synthesis [191]: The stack-based approach presented in [192] offers a general solution to this problem. Two stacks for storing local variables and the execution position are used to remember the previous state. The control flow is implemented as a hierarchical state machine controlling these stacks, which are implemented in fast on-chip memory. Although the computations within each function can be parallelized, it is not possible to execute several recursive branches concurrently. Similarly, the recent work on synthesis of affine recursion [193] uses a single stack but does not permit multiple invocations. A stack-less solution has been proposed in [194] to enable parallel processing of recursive calls. Similar to the SRM, a chain of processing nodes is connected in a pipeline. However, if the recursion level exceeds the maximum supported depth, an expensive dynamic reconfiguration of the chip is required. In contrast, the SRM is fully functional with only one node and the maximum recursion depth is only limited by the amount of available memory.

Term rewriting systems have already been used for both verification [195] and synthesis [196] of hardware components. Also, the Bluespec language [197], which is based on Haskell, transforms the program into a set of rules for hardware implementation. However, this concept is fundamentally different from the stream rewriting approach. The rules in Bluespec are guards monitoring the state and signals of each module. A rule executes by performing an atomic modification of the module's state. Concurrency is achieved by allowing different rules to fire simultaneously and by partitioning the functionality into separate modules. In contrast, the stream rewriting rules work on a stream of tokens and allow substituting several non-overlapping parts of the stream in parallel. A very similar approach is the queue machine [67], which also performs iterative replacement operations on a stream. It is optimized for pipelined operations, but does not support control flow, recursion, or indirect calls. In addition, it always operates on neighboring tokens and does not provide random access to arguments. The concept of program evaluation by term rewriting has been presented for the software development language Joy [198], which is similar to FORTH [99]. The execution model of stream rewriting is related to the approach of Joy, but supports random access to arguments to avoid the costly reordering required by the queue machine.

6.3 Implementation

In this section, the generated hardware architecture is described; it resembles a more specialized version of the general purpose (Section 4) and graphics processors (Section 5). However, it is based on the pre-order format of the stream grammar.

6.3.1 Architecture

Figure 101 shows the layout of the component generated by the SRM tool-chain. In contrast to the abstract model, there are multiple processing elements (Core1 to Coren) and a shared memory block for data exchange. However, one core alone would also be sufficient, because each of them implements the complete functionality necessary for substituting the tokens. The result is passed to the next stage, so that subsequent iterations can be overlapped and executed in parallel. The shared memory can be accessed from any core as part of the term rewriting process. In addition to the formal specification, a replacement function may also modify or read the shared memory. However, the execution order of rules is based purely on data dependencies and does not account for side-effects. Hence, memory accesses must be serialized similar to synchronous calls if necessary.

Figure 101. Synthesized hardware architecture (entry point, rewriting cores Core1 to Coren, shared memory, memory-mapped I/O, exit point).

Figure 102. Structure of the rewriting core (pattern detection, ring buffer, constants, FSM, computational unit (CU), memory interface).

The actual stream processing takes place inside the rewriting core, which is illustrated in Figure 102. The token stream, coming from the entry point, arrives at the top-left corner of this scheme. First, it goes through the pattern detector, which marks all CALL tokens followed by a sufficient number of arguments. Each of these tokens can be executed immediately and is tagged with an extra bit indicating the positive match. The ring buffer is used as a temporary storage for the arguments, allowing them to be accessed later in any order. The buffer can hold a pending rule while another one is evaluated, so that argument fetch and execution can be overlapped. As a result, even the last argument can be read in the first cycle of the term rewriting process. Depending on the state of the match bit, the FSM either issues the corresponding sequence of tokens or initiates a pass-through operation. The pass-through operation is effectively a NOP, so that unmatched parts of the stream are preserved for further iterations. The computational unit (CU) is responsible for calculating the types and values of the new tokens by arithmetic or logic operations. Arguments are received either from the ring buffer or from a special constant RAM. The CU itself consists of several functional units (FUs), each one implementing a special function in hardware. During synthesis, the operations required by all rules determine the concrete set of FUs built into the computational unit. In addition to the formal specification, the last stage provides a general I/O interface, allowing rules to access external memory or peripherals. Since the execution order of rules is based purely on data dependencies, memory accesses must be explicitly serialized.
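The behavior of one pass through a rewriting core can be summarized by the following software model. It is an illustration only (a single hard-wired ADD rule on a pre-order stream), not the synthesized RTL, but it shows the interplay of pattern detection, substitution, and pass-through:

#include <stdio.h>

enum { LIT, CALL_ADD };                       /* token types of this toy example  */

typedef struct { int type; int value; } token_t;

/* One pass: substitute every CALL token that is already followed by enough
 * literal arguments, pass everything else through for a later iteration. */
static int rewrite_pass(const token_t *in, int n, token_t *out)
{
    int w = 0;
    for (int r = 0; r < n; )
    {
        if (in[r].type == CALL_ADD && r + 2 < n &&
            in[r+1].type == LIT && in[r+2].type == LIT)
        {
            out[w].type  = LIT;               /* the CU computes the new token    */
            out[w].value = in[r+1].value + in[r+2].value;
            w += 1;
            r += 3;                           /* the matched pattern is consumed  */
        }
        else
        {
            out[w++] = in[r++];               /* pass-through (NOP)               */
        }
    }
    return w;                                 /* length of the rewritten stream   */
}

int main(void)
{
    /* ⟨CALL_ADD, CALL_ADD, 1, 2, 3⟩ encodes (1 + 2) + 3 and needs two passes. */
    token_t a[] = { {CALL_ADD,0}, {CALL_ADD,0}, {LIT,1}, {LIT,2}, {LIT,3} };
    token_t b[8];
    int n = rewrite_pass(a, 5, b);            /* -> ⟨CALL_ADD, 3, 3⟩              */
    n = rewrite_pass(b, n, a);                /* -> ⟨6⟩                           */
    printf("%d token(s), value %d\n", n, a[0].value);
    return 0;
}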

6.3.2 Rule Extraction

The functional program generated by the analysis phase (Figure 100) does not yet represent hardware operations. First, abstract operations like addition or multiplication are mapped onto functional units (FUs) from a library. In addition, multiple simple operations can be combined, but this merge may hinder the reusability of a FU. Since every core is able to execute all functions, hardware resources can be shared between rewriting rules. Thus, the hardware costs of the CU (Figure 102) do not increase if an operation is used multiple times. Instead, the CU is composed of all functional units selected in this phase. Almost the complete execution is pipelined to support complex arithmetic like floating-point operations at high clock rates. Hence, there is no possibility to reuse temporary results: all inputs must be either arguments of the rule or fetched from the constant memory. As a consequence, rules containing multiple FUs in a chain must be split and are executed in multiple steps (Figure 103):

f1 : ℤ³ → TOKEN³, (a, b, c) ↦ ⟨CALL, f2, a·b, c⟩
f2 : ℤ² → TOKEN, (a, b) ↦ ⟨a + b⟩

Figure 103. A complex expression tree is split into two simpler rewriting rules f1 and f2.
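As a small worked example (values chosen for illustration only): calling f1 with a = 2, b = 3, c = 4 is reduced in two iterations, where the first rewrite computes the product and the second one the sum:

⟨CALL, f1, 2, 3, 4⟩  →  ⟨CALL, f2, 6, 4⟩  →  ⟨10⟩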

However, the proposed architecture allows pipelining these calculations, because the next step is then evaluated on the next core. The expression tree shown on the left side of Figure 103 implements a multiply-add operation. For this example, it is assumed that there exist only separate functional units for addition and multiplication. Hence, a break is inserted after the multiplication to split the tree into the two rewriting rules f1 and f2. Both rules are linked by inserting a ⟨CALL, f2⟩ token from the first rule to pass the temporary result to the second rule. In the example illustrated in Figure 103, the argument c is used only in the sum and is therefore copied using a NOP operation.

6.3.3 Analysis

The definition of the rewriting system allows for an efficient hardware implementation due to the following properties:

 Prefix rules facilitate hardware decoding
All rules use the CALL token and belong to a single class of prefix rules, so that a single parametrized pattern detector component can detect all of them. As a result, the complexity of this module does not depend on the number of rules in the program. Further, since all rules are prefix rules, no backtracking is required either.

 Pipelining
Each rule corresponds to an acyclic data flow graph containing a fixed number of inputs and outputs. Therefore, the substitution of tokens can be pipelined to increase the maximum clock frequency of the synthesized circuit.

 Light-weight dynamic scheduling
Deep pipelines also increase the latency of the module, but this can be partly compensated if the ring contains an appropriate number of matching function calls. As a result, complex functions containing both compute-intensive arithmetic and irregular control flow can be implemented without branch prediction.

 Locality of stream modifications
A replacement operation is a local modification of the stream, removing tokens and inserting results. Hence, the size of the ring buffer (Figure 102), which is required to hold the inputs of at most two executable expressions, is proportional to the maximum number of arguments. Due to this locality and the absence of an explicit execution order, the stream can be modified at multiple positions simultaneously without affecting the final result.

The design space can be explored by synthesizing multiple functionally equivalent versions of the hardware with different resource and performance characteristics.

6.3.4 Scalability

All rules perform local rewrite operations by removing tokens and inserting results. Further, the abstract model of the SRM does not define an explicit execution order, so that multiple matching rules can be replaced simultaneously without affecting the final result. Hence, the computation can be accelerated by instantiating several identical rewriting cores that modify the stream of tokens at multiple positions. According to Figure 101, the architecture of the Stream Rewriting Machine consists of the rewriting cores Core1 to Coren and a shared memory block, which may be replaced by a more sophisticated approach. Since the results of one core are passed to the next stage, subsequent iterations can be overlapped and executed in parallel. Hence, the rewriting function is composed n times to create a pipeline of n stages, and the number of iterations required to calculate the result is reduced by the factor n.
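For illustration (numbers chosen here, not taken from the thesis): a computation that needs 12 sequential rewriting iterations can, under ideal conditions, be completed after 12 / 4 = 3 passes through a pipeline of n = 4 cores; in practice, the speedups reported in Section 6.4 remain below this ideal factor.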


Figure 104. Pipelined rewriting of consecutive iterations.

6.4 Results

The proposed tool-chain has been implemented and employs a prototype compiler that generates RTL Verilog from restricted C source code.

6.4.1 Recursive Functions

Table 14 and Figure 105 show the execution time in cycles for several recursive functions and different numbers of cores. All examples have been generated from C source code by the presented tool chain, and the resulting Verilog modules have been evaluated using a cycle-accurate RTL simulator. The number of cores is modified using a generic parameter and does not require a new run of the high-level synthesis tools. The function sum is a simple recursive sum using integer addition with a latency of one cycle. It runs slowly because every instance creates only one additional thread, but due to pipelining it achieves a moderate speedup. The function rsum is almost 10 times faster, because it uses a balanced binary call-tree, which better utilizes the architecture of the ring. The Fibonacci (fib) examples show a similar behavior and also compute multiple branches in parallel. The Ackermann function (ack) also contains two recursive invocations, but has dependencies, which hinder a concurrent evaluation.
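Since only the Fibonacci function is listed explicitly further below, the following definitions are plausible reconstructions of sum and rsum rather than the original sources; they illustrate the difference between a linear and a balanced call tree:

// linear recursion: every instance creates only one additional thread
int sum(int n)
{
    if (n == 0) return 0;
    return n + sum(n - 1);
}

// balanced recursion: the range is split into two halves per level
int rsum_range(int lo, int hi)
{
    if (lo == hi) return lo;
    int mid = (lo + hi) / 2;
    return rsum_range(lo, mid) + rsum_range(mid + 1, hi);
}

int rsum(int n) { return rsum_range(1, n); }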

Table 14. Execution Time in Cycles

Test\Cores   1      2      4      8      16
sum(100)     32148  16589  9229   6479   6255
rsum(100)    3460   1989   1262   1223   1215
fib(9)       1530   959    709    734    735
fib(12)      5827   3226   1894   1335   1463
fib(16)      38698  20443  10502  5871   4930
ack(2,5)     6821   6253   5981   5845   5777
ack(3,2)     53789  37945  35729  34917  34649

Figure 105. Execution time in cycles for sum(100), rsum(100), fib(9), fib(16), ack(2,5), and ack(3,2) with 1 to 16 cores.

        1 Core              4 Cores             16 Cores
Test    LUT    FF    BRAM   LUT    FF    BRAM   LUT     FF     BRAM
sum     1311   1085  7      4443   3809  27     17109   14378  105
rsum    1727   1402  8      6170   5074  29     23829   19770  113
fib     1335   1089  7      4539   3825  27     17493   14442  105
ack     1318   1096  7      4471   3853  27     17221   14554  105
all     2060   1669  10     7580   6142  34     29344   24046  130

Figure 106. Resource usage of look-up tables (LUT), registers (FF), and BRAMs for 1, 4, and 16 cores.

All functions have been synthesized for the XC6VLX240T-FF1156-1 FPGA from Xilinx by using the provided Verilog RTL compiler (XST). The source code of the Fibonacci function is shown below; the other test cases are implemented in a similar way:

// writes x to the exit point (generates OUT token)
void print(int x);

// calculates the Fibonacci number x
int fib(int x)
{
    if (x < 2)
        return x;
    return fib(x-1) + fib(x-2);
}

void main(int x)
{
    print(fib(x));
}

The resource usage of 6-input look-up tables (LUT), flip-flops (FF), and embedded RAMs (BRAM) consumed by a particular implementation is shown in Figure 106. The high-level synthesis time without RTL synthesis always remained below one second using an Intel i7-2720 CPU and 8 GB RAM. In addition, a Stream Rewriting Machine implementing all four functions has been synthesized for comparison. Since every functional unit is shared globally across all rules, the combination requires only 30% more resources. The design does not achieve a true linear speedup, because the pipelined execution of multiple iterations also consumes more cycles than the single-core variant. However, according to the definition of the term rewriting system, a fully parallel evaluation is possible, so that these results are unrelated to the computational model and only reflect a drawback of the current implementation.

6.4.2 Ray-tracing

In order to provide a more complete example, a ray tracing application has been implemented. It demonstrates the usage of function pointers, recursive functions, and the parallel evaluation of branches; hence, all elements of the computational model are covered by this example. Ray tracing generates images by sending rays from the camera through every pixel into a three-dimensional scene. The color of the pixel is then determined by the object at the nearest intersection point. In this example, the objects are stored as a set of parameters and a pointer to an intersection function, which takes the origin and direction of a ray and returns the color of the corresponding object at the intersection point. Similar to virtual functions, the usage of the function pointer makes it possible to handle the three types of objects (Plane, Sphere, Node) uniformly and without switch statements. The scene itself is also stored as an object and represents a tree that is visited recursively. The basic data structures of this test case are declared as follows:

struct Node;
struct Intersection;

// Generic intersection function
typedef Intersection (*IntersectFunc)(Node *node, float3 start, float3 dir);

// Stores one object of the scene
struct Node
{
    float4 pos;               // position
    IntersectFunc intersect;  // intersection function
    int left;                 // left child node
    int right;                // right child node
    float3 color;             // color rgb
};

// Represents an intersection between a ray and the scene
struct Intersection
{
    float dist;               // distance from the start
    float3 color;             // color at this position
};

// intersection functions for all three types
Intersection plane_intersect (Node *node, float3 start, float3 dir);
Intersection sphere_intersect(Node *node, float3 start, float3 dir);
Intersection node_intersect  (Node *node, float3 start, float3 dir);
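The thesis does not list the intersection functions themselves; the following node_intersect is therefore only a sketch of how the recursive traversal could look. It assumes a helper get_node() that maps a child index to a Node pointer; both children are visited through their stored function pointers, and in the stream rewriting model these two recursive calls can be evaluated as parallel branches:

Intersection node_intersect(Node *node, float3 start, float3 dir)
{
    Node *l = get_node(node->left);                // hypothetical helper
    Node *r = get_node(node->right);
    Intersection a = l->intersect(l, start, dir);  // indirect, possibly recursive call
    Intersection b = r->intersect(r, start, dir);  // evaluated as a parallel branch
    return (a.dist < b.dist) ? a : b;              // keep the nearer intersection
}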

Figure 107. Results of the ray tracing simulation.

The ray tracing simulation has been run to produce 32x32 pixel images of different scenes and assumes 12 cycles of latency for floating-point addition, 8 cycles for multiplication, and 28 cycles for division and square root. These values nearly correspond to the parameters of the floating-point cores from Xilinx when optimized for maximum frequency. The resulting images and the number of cycles are shown in Figure 107. As a result, it can be seen that the computational model is able to evaluate complex C programs containing recursion and function pointers. As a further improvement, it would be possible to place the state machine into a RAM, so that already synthesized hardware could be patched by updating the microcode.

7 Conclusion

This section contains an integral discussion of stream rewriting, a comparison of the presented approaches, and ideas for further improvements.

7.1 Results

This thesis on stream rewriting has led to several new insights for dynamic scheduling in multi- and many-core environments. The achievements can be summarized as follows:

 Stream rewriting as a new model of computation
Although term and string rewriting systems are well known, the concept of stream rewriting has first been described in this thesis and the related papers. The most significant difference is the aspect of local rewriting rules, which enable an efficient many-core implementation. In contrast to other models of computation, stream rewriting can encapsulate data and pipeline parallelism, supports dynamic topologies, and schedules a large number of threads based on their dependencies.

 Concept for concurrent stream processing
The basic approach of replacing patterns in the stream already offers a large degree of concurrency by construction. Several parallel stream rewriting machines have been proposed (Section 2.2), implemented, and evaluated in Sections 4, 5, and 6.

 Synchronization via the stream and without shared memory Stream rewriting supports the synchronization of concurrent threads without atomic operations in shared memory. Parallel data paths are joined locally using rewriting rules and the access to global resources is synchronized by dynamically binding interfering tasks to the same rewriting unit.

 Connection of task graphs and stream rewriting
Task graphs are commonly used to specify concurrent and dependent parts of an application. This thesis shows an approach for translating these graphs into stream rewriting programs (Section 3.2).

 Compilation of a high-level language for the SRM
In Section 3.3, a compiler for the hardware synthesis of recursive C functions is presented. It generates HDL code for a specialized SRM. In particular, the compiler converts the control flow between basic blocks into rewriting rules and passes local variables as arguments.

 Extensive tests using several FPGA prototypes
Except for Section 6, which is based on an RTL simulation, all stream rewriting architectures have actually been prototyped using an FPGA implementation. Since hardware never works in isolation, the experiments have been performed in a realistic test environment including external memory, a PCIe connection to the host processor, drivers, OS, and application software, so that the measurements cover all aspects of the system. A large number of complex examples have been run to prove the usefulness of stream rewriting for a wide variety of problems and platforms.

7.2 Future Work

There are still some unsolved questions regarding the concept and the implementation of stream rewriting, which might be the topic of further research.

7.2.1 Stream Rewriting

The following possible improvements have been identified for the basic rewriting model:

 Modeling of arbitrary graph topologies Although hierarchically expandable task graphs can be mapped into rewriting rules, their topology must conform to the sub-class of series-parallel graphs (Section 3.2.3). It might be possible to extend the basic definition of the SRM with additional rules that would allow for more complex elements like feed-back loops or even arbitrary topologies. Currently, the restriction to series-parallel graphs ensures the locality of operations since sub-graphs correspond to compact sub-streams. Hence, the most important challenge is the efficient processing of an arbitrary graph as a token stream.

 Classification of data flow graphs
The capabilities of stream rewriting in relation to data flow graphs are not completely clear. The SRM offers the flexibility to conditionally emit tokens to arbitrary actors and to instantiate sub-graphs at runtime. On the other hand, even very simple homogeneous data flow graphs cannot be mapped into rewriting rules if their topology contains a loop or is not series-parallel. In addition to arbitrary topologies, it should be researched which types of data flow graphs can be translated into rewriting rules.

 Reassembly of stream fragments causes implicit synchronization The synchronization of threads is a local operation since it involves only a limited range of tokens on the stream. Similarly, the rewriting process can be parallelized since matching rewriting rules never overlap. However, the reassembly of the rewritten stream fragments is the most expensive part because at this point, the system must guarantee that the resulting sub-streams are concatenated in their original order. In case of significantly varying task granularities, some processors might have finished the rewriting of their sub-stream earlier than others or might have generated a larger amount of tokens. Although reorder buffers usually compensate for different throughputs, a stall might occur at the synchronization point of the streams.

7.2.2 Embedded Systems

Stream rewriting is suitable for dynamic scheduling in embedded and mobile systems and might be especially valuable for the connection of heterogeneous components and different computational models at the system level. For this thesis, the following open problems have been identified:

 Real-time scheduling
The current research on stream rewriting has put a focus on handling a large number of dynamically created threads. Especially for graphics processing, throughput is more important than latency, and the SRM is untimed by construction. The presented model does not impose any constraints concerning the execution times of threads because most of these decisions are made dynamically at runtime to facilitate concurrency. Hence, stream rewriting is not suitable for real-time systems with hard deadlines.

 Static and dynamic scheduling Stream rewriting offers a novel mechanism for scheduling a large amount of dynamically created threads and also enables the recursive expansion of task graphs at runtime. Although this flexibility induces fixed costs, it might not be required entirely for all problems. By identifying static parts of an application, it might be possible to determine a tradeoff between static and dynamic scheduling.

 Power estimation The topic of power consumption has been completely omitted in this work but might be important for the usage of stream rewriting in embedded or mobile systems.

7.2.3 Graphics Processing

The initial concept of stream rewriting has been developed to improve the flexibility of graphics processing by introducing a dynamic rendering pipeline with programmable communication. However, for a competitive implementation, the following issues should be resolved:

 Efficient GPU Implementation
The performance of the FPGA prototype is limited by its size and several technological factors. As a result, the presented implementation is scalable to some extent but still several orders of magnitude slower than modern graphics processors. By porting stream rewriting to CUDA or OpenCL, the improved flexibility of the SRM could possibly be combined with the massive computational power of GPUs. However, the main challenge is the adaptation of the rewriting algorithm to data-parallel SIMT architectures, since the execution of different threads might not be feasible with current hardware.

 OpenGL and Direct3D drivers
This thesis describes an incomplete OpenGL implementation that offers the stream rewriting functionality via an extension. By adding the remaining standard functions, it would become feasible to evaluate the performance of existing OpenGL applications. Similarly, a Direct3D driver for the SRM graphics processor would enable a much larger number of test cases.

 Novel and minimalistic graphics API
Due to legacy interfaces and inappropriate abstractions, the communication between the CPU and the graphics processor involves a significant overhead. Especially in Direct3D, each individual draw call causes an expensive transition between user and kernel mode, so that an application can quickly become CPU bound. Also for OpenGL, the classic rendering pipeline no longer represents the structure and the capabilities of the underlying hardware. As a consequence, the driver has to perform dynamic code generation and validation at runtime to translate the current OpenGL state into hardware instructions. The runtime API (Section 5) is a statically linked library, which provides a minimal interface to the stream rewriting hardware and consists of a few functions for encoding the initial stream. Likewise, shaders are written in plain C/C++, compiled via the standard GCC tool chain, and submitted as pre-compiled binaries. The functionality of fixed hardware components like texture mapping units or the rasterizer is provided via software libraries and included in the shader binary. Hence, an application has full control over the device but can still apply domain-specific optimizations.

8 References

[1] David Geer, "Industry Trends: Chip Makers Turn to Multicore Processors," Computer, vol. 38, no. 5, pp. 11-13, May 2005.

[2] G. Blake, R.G. Dreslinski, and T Mudge, "A survey of multicore processors," Signal Processing Magazine, IEEE, vol. 26, no. 6, pp. 26-37, November 2009.

[3] Per Hammarlund et al., "Haswell: The Fourth-Generation Intel Core Processor," IEEE Micro, vol. 34, no. 2, pp. 6-20, March 2014.

[4] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, "Demystifying GPU microarchitecture through microbenchmarking," in 2010 IEEE International Symposium on Performance Analysis of Systems Software (ISPASS), White Plains, NY, March 2010, pp. 235-246.

[5] K. Hirata and J. Goodacre, "ARM MPCore; The streamlined and scalable ARM11 processor core," in Design Automation Conference, 2007. ASP-DAC '07. Asia and South Pacific, Yokohama, 2007, pp. 747-748.

[6] Shekhar Borkar and Andrew A. Chien, "The future of microprocessors," Communications of the ACM, vol. 54, no. 5, pp. 67-77, May 2011.

[7] P. Choudhury, P.P. Chakrabarti, and R. Kumar, "Online Scheduling of Dynamic Task Graphs with Communication and Contention for Multiprocessors," Parallel and Distributed Systems, IEEE Transactions on, vol. 23, no. 1, pp. 126-133, jan. 2012.

[8] NVIDIA, Nvidia's next generation compute architecture: Kepler gk110. [Online]. http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf

[9] Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym, "NVIDIA Tesla: A unified graphics and computing architecture," Micro, IEEE, vol. 28, no. 2, pp. 39-55, March 2008.

[10] T.G. Mattson et al., "The 48-core SCC Processor: the Programmer's View," in High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, New Orleans, LA, 2010, pp. 1-11.

[11] Anant Agarwal, "The tile processor: A 64-core multicore for embedded processing," in Proceedings of HPEC Workshop, 2007.

[12] J. Henkel et al., "Invasive manycore architectures," in Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific, Sydney, NSW, 2012, pp. 193-200.

[13] L. Gauthier, Sungjoo Yoo, and A.A. Jerraya, "Automatic generation and targeting of application-specific operating systems and embedded systems software," vol. 20, no. 11, pp. 1293-1301, Nov 2001.

[14] John Nickolls and William J Dally, "The GPU computing era," Micro, IEEE, vol. 30, no. 2, pp. 56-69, March 2010.

[15] Victor W. Lee et al., "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," ACM SIGARCH Computer Architecture News, vol. 38, no. 3, pp. 451-460, June 2010.

[16] R. M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of Research and Development, vol. 11, no. 1, pp. 25-33, January 1967.

[17] David Blythe, "The Direct3D 10 system," ACM Transactions on Graphics (TOG), vol. 3, pp. 724-734, July 2006.

[18] Chas Boyd, "The DirectX 11 compute shader," ACM SIGGRAPH 2008 classes, 2008.

[19] Rob Farber, CUDA Application Design and Development.: Morgan Kaufmann, 2011.

[20] Timo Aila, Samuli Laine, and Tero Karras, "Understanding the efficiency of ray traversal on GPUs--Kepler and Fermi addendum," Proceedings of ACM High Performance Graphics 2012, Posters, pp. 9-16, 2012.

[21] H. Tomiyama, T. Hieda, N. Nishiyama, N. Etani, and I. Taniguchi, "SMYLE OpenCL: A programming framework for embedded many-core SoCs," in Design Automation Conference (ASP-DAC), 2013 18th Asia and South Pacific, Yokohama, Japan, 2013, pp. 565-567.

[22] Tianyun Ni, "Direct Compute - Bring GPU Computing to the Mainstream," in GPU Technology Conference, 2009.

[23] Kate Gregory and Ade Miller, C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++.: Microsoft Press, 2012.

[24] Leonardo Dagum and Ramesh Menon, "OpenMP: an industry standard API for shared-memory programming," Computational Science & Engineering, IEEE, vol. 5, no. 1, pp. 46-55, January 1998.

[25] Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann, "OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization," SIGPLAN Not., vol. 44, no. 4, pp. 101-110, April 2009.

[26] M.E. Wolf and M.S. Lam, "A Loop Transformation Theory and an Algorithm to Maximize Parallelism," IEEE Transactions on Parallel and Distributed Systems , pp. 452-471, October 1991.

[27] Hritam Dutta , Frank Hannig, Holger Ruckdeschel, and Jürgen Teich, "Efficient control generation for mapping nested loop programs onto processor arrays," Journal of Systems Architecture, vol. 53, no. 5-6, pp. 300-309, may 2007.

[28] Joachim Keinert et al., "SystemCoDesigner- an Automatic ESL Synthesis Approach by Design Space Exploration and Behavioral Synthesis for Streaming Applications," ACM Transactions on Design Automation of Electronic Systems, vol. 14, no. 1, pp. 1:1--1:23, January 2009.

[29] E.A. Lee and T.M. Parks, "Dataflow process networks," Proceedings of the IEEE, vol. 83, no. 5, pp. 773-801, May 1995.

[30] Jörn W. Janneck et al., "Synthesizing hardware from dataflow programs: An MPEG-4 simple profile decoder case study," in Signal Processing Systems, 2008. SiPS 2008. IEEE Workshop on, 2008, pp. 287-292.

[31] Gustav Cedersjö and Jörn W. Janneck, "Software Code Generation for Dynamic Dataflow Programs," in Proceedings of the 17th International Workshop on Software and Compilers for Embedded Systems, vol. SCOPES '14, Sankt Goar, Germany, 2014, pp. 31-39.

[32] Joachim Falk, Joachim Keinert, Christian Haubelt, Jürgen Teich, and Shuvra S. Bhattacharyya, "A Generalized Static Data Flow Clustering Algorithm for Mpsoc Scheduling of Multimedia Applications," in Proceedings of the 8th ACM International Conference on Embedded Software, Atlanta, GA, USA, 2008, pp. 189-198.

[33] E.A. Lee and D.G. Messerschmitt, "Synchronous data flow," Proceedings of the IEEE, vol. 75, no. 9, pp. 1235-1245, September 1987.

[34] Edward Ashford Lee and David G. Messerschmitt, "Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing," IEEE Transactions on Computers, vol. C-36, no. 1, pp. 24-35, jan. 1987.

[35] M. Damavandpeyma, S. Stuijk, T. Basten, M. Geilen, and H. Corporaal, "Modeling static-order schedules in synchronous dataflow graphs," in Design, Automation Test in Europe Conference Exhibition (DATE), 2012, Dresden, March 2012, pp. 775-780.

[36] Joachim Falk et al., "Analysis of SystemC actor networks for efficient synthesis," ACM Transaction on Embedded Computing, vol. 10, no. 2, pp. 18:1-18:34, jan 2011.

[37] Peng Yang et al., "Managing dynamic concurrent tasks in embedded real-time multimedia systems," in 15th International Symposium on System Synthesis, 2002. , Kyoto, Japan, 2002, pp. 112-119.

[38] S. Stuijk, M. Geilen, B. Theelen, and T. Basten, "Scenario-aware dataflow: Modeling, analysis and implementation of dynamic applications," in Embedded Computer Systems (SAMOS), 2011 International Conference on, Samos, july 2011, pp. 404-411.

[39] Qinghua Li, Youlin Ruan, ShidaYang, and Tingyao Jiang, "An optimal scheduling algorithm for fork-join task graphs," in Parallel and Distributed Computing, Applications and Technologies, 2003. PDCAT'2003. Proceedings of the Fourth International Conference on, aug. 2003, pp. 587-589.

[40] Weijia Che and Karam S. Chatha, "Unrolling and retiming of stream applications onto embedded multicore processors," in Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, San Francisco, CA , 2012, pp. 1268-1273.

[41] Michael I. Gordon, William Thies, and Saman Amarasinghe, "Exploiting coarse-grained task, data, and pipeline parallelism in stream programs," in Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, San Jose, California, USA, 2006, pp. 151-162.

[42] Manjunath Kudlur and Scott Mahlke, "Orchestrating the execution of stream programs on multicore platforms," ACM SIGPLAN Notices, vol. 43, no. 6, pp. 114-124, jun 2008. [Online]. http://doi.acm.org/10.1145/1379022.1375596

[43] A. Hagiescu, Weng-Fai Wong, D.F. Bacon, and R. Rabbah, "A computing origami: Folding streams in FPGAs," in Design Automation Conference, 2009. DAC '09. 46th ACM/IEEE, San Francisco, CA , july 2009, pp. 282-287.

[44] Amir Hormati, Manjunath Kudlur, Scott Mahlke, David Bacon, and Rodric Rabbah, "Optimus: efficient realization of streaming applications on FPGAs," in Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems, New York, NY, USA, 2008, pp. 41-50.

[45] Amir H. Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke, "Sponge: portable stream programming on graphics engines," in Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems, Newport Beach, California, USA, 2011, pp. 381-392.

[46] J. Kaida et al., "Task mapping techniques for embedded many-core SoCs," in SoC Design Conference (ISOCC), 2012 International, Jeju Island, 2012, pp. 204-207.

[47] I. Ahmad and Yu-Kwong Kwok, "On parallelizing the multiprocessor scheduling problem," Parallel and Distributed Systems, IEEE Transactions on, vol. 10, no. 4, pp. 414-431, apr 1999.

[48] M. Abid, K. Jerbi, M. Raulet, O. Deforges, and M. Abid, "System level synthesis of dataflow programs: HEVC decoder case study," in Electronic System Level Synthesis Conference (ESLsyn), 2013, 2013, pp. 1-6.

[49] H. El-Rewini, H.H. Ali, and T. Lewis, "Task scheduling in multiprocessing systems," Computer, vol. 28, no. 12, pp. 27-37, Dec 1995.

[50] Yu-Kwong Kwok and Ishfaq Ahmad, "Benchmarking and Comparison of the Task Graph Scheduling Algorithms," Journal of Parallel and Distributed Computing, vol. 59, no. 3, pp. 381-422, December 1999.

[51] Ying Yi, Wei Han, Xin Zhao, A.T. Erdogan, and T. Arslan, "An ILP formulation for task mapping and scheduling on multi-core architectures," in Design, Automation Test in Europe Conference Exhibition, 2009. DATE '09., Nice, april 2009, pp. 33-38.

[52] H. Topcuoglu, S. Hariri, and Min-You Wu, "Task scheduling algorithms for heterogeneous processors," in Heterogeneous Computing Workshop, 1999. (HCW '99) Proceedings. Eighth, San Juan , 1999, pp. 3-14.

[53] Hoeseok Yang and Soonhoi Ha, "Pipelined data parallel task mapping/scheduling technique for MPSoC," in Design, Automation Test in Europe Conference Exhibition, 2009. DATE '09., Nice, 2009, pp. 69-74.

[54] Chen-Ling Chou and R. Marculescu, "User-Aware Dynamic Task Allocation in Networks-on-Chip," in Design, Automation and Test in Europe, 2008. DATE '08, Munich, 2008, pp. 1232-1237.

[55] Ping Xiang, Yi Yang, Mike Mantor, Norm Rubin, and Huiyang Zhou, "Many-thread aware instruction-level parallelism: architecting shader cores for GPU computing," in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, Minneapolis, Minnesota, USA, 2012, pp. 449-450.

[56] D. Baudisch, J. Brandt, and K. Schneider, "Out-Of-Order Execution of Synchronous Data-Flow Networks," in Embedded Computer Systems (SAMOS), 2012 International Conference on, 2012, pp. 168-175.

[57] Jürgen Teich et al., "Invasive computing: An overview," in Multiprocessor System-on- Chip.: Springer, 2011, pp. 241-268.

[58] J. Becker, S. Friederich, J. Heisswolf, R. Koenig, and D. May, "Hardware prototyping of novel invasive multicore architectures," in Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific, 2012, pp. 201-206.

[59] K. Agrawal, C.E. Leiserson, and J. Sukha, "Executing task graphs using work-stealing," in Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, 2010, pp. 1-12.

[60] Yizhuo Wang, Weixing Ji, Feng Shi, and Qi Zuo, "A work-stealing scheduling framework supporting fault tolerance," in Design, Automation Test in Europe Conference Exhibition (DATE), 2013, Grenoble, France , 2013, pp. 695-700.

[61] Robert D. Blumofe and Charles E. Leiserson, "Scheduling multithreaded computations by work stealing," Journal of the ACM (JACM), vol. 46, no. 5, pp. 720-748, 1999.

[62] Robert D. Blumofe et al., "Cilk: an efficient multithreaded runtime system," in Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Santa Barbara, California, USA, 1995, pp. 207-216.

[63] M.D. McCool, "Scalable Programming Models for Massively Multicore Processors," Proceedings of the IEEE, vol. 96, no. 5, pp. 816-831, May 2008.

[64] B.A. Nayfeh and K. Olukotun, "A single-chip multiprocessor," Computer, vol. 30, no. 9, pp. 79-85, September 1997.

[65] R. Hartenstein, "A decade of reconfigurable computing: a visionary retrospective," in Design, Automation and Test in Europe, 2001. Conference and Exhibition 2001. Proceedings, Munich, 2001, pp. 642-649.

[66] S.C. Goldstein et al., "PipeRench: a reconfigurable architecture and compiler," Computer, vol. 33, no. 4, pp. 70-77, April 2000.

[67] Herman Schmit, Benjamin Levine, and Benjamin Ylvisaker, "Queue Machines: Hardware Compilation in Hardware," in Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2002, pp. 152-160.

[68] M. Lam, "Software pipelining: an effective scheduling technique for VLIW machines," ACM SIGPLAN Notices, vol. 23, no. 7, pp. 318-328, July 1988.

[69] S. Pillai and M.F. Jacome, "Compiler-directed ILP extraction for clustered VLIW/EPIC machines: predication, speculation and modulo scheduling," in Design, Automation and Test in Europe Conference and Exhibition, 2003, Munich, Germany, 2003, pp. 422- 427.

[70] V.A. Zivkovic, R. J W T Tangelder, and H.G. Kerkhoff, "Design and test space exploration of transport-triggered architectures," in Design, Automation and Test in Europe Conference and Exhibition 2000. Proceedings, Paris, 2000, pp. 146-151.

[71] Alex K. Jones, Raymond Hoare, Dara Kusic, Joshua Fazekas, and John Foster, "An FPGA-based VLIW processor with custom hardware execution," in Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, 2005, pp. 107-117.

[72] John R Ellis, "Bulldog: A compiler for VLIW architectures," New Haven, CT, USA, Tech. rep. 1985.

[73] Bingfeng Mei, S. Vernalde, D. Verkest, and R. Lauwereins, "Design methodology for a tightly coupled VLIW/reconfigurable matrix architecture: a case study," in Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings, vol. 2, 2004, pp. 1224-1229.

[74] S. Naffziger et al., "The implementation of a 2-core, multi-threaded itanium family processor," IEEE Journal of Solid-State Circuits, vol. 41, no. 1, pp. 197-209, Jan 2006.

[75] Harsh Sharangpani and Ken Arora, "Itanium processor microarchitecture," Micro, IEEE, vol. 20, no. 5, pp. 24-43, September 2000.

[76] S.K. Raman, V. Pentkovski, and J. Keshava, "Implementing streaming SIMD extensions on the Pentium III processor," Micro, IEEE, vol. 20, no. 4, pp. 47-57, July 2000.

[77] Nadeem Firasta, Mark Buxton, Paula Jinbo, Kaveh Nasri, and Shihjong Kuo, "Intel avx: New frontiers in performance improvements and energy efficiency," 2008.

147 [78] P. Bonnot et al., "Definition and SIMD Implementation of a Multi-Processing Architecture Approach on FPGA," in Design, Automation and Test in Europe, 2008. DATE '08, 2008, pp. 610-615.

[79] N. Togawa, K. Tachikake, Y. Miyaoka, M. Yanagisawa, and T. Ohtsuki, "Instruction set and functional unit synthesis for SIMD processor cores," in Design Automation Conference, 2004. Proceedings of the ASP-DAC 2004. Asia and South Pacific, Yohohama, Japan, 2004, pp. 743-750.

[80] J. Davila et al., "Design and implementation of a rendering algorithm in a SIMD reconfigurable architecture (MorphoSys)," in Design, Automation and Test in Europe, 2006. DATE '06. Proceedings, vol. 2, Munich, 2006, pp. 52-57.

[81] C. Kozyrakis and D. Patterson, "Overcoming the limitations of conventional vector processors," in Computer Architecture, 2003. Proceedings. 30th Annual International Symposium on, 2003, pp. 399-409.

[82] David W Wall, "Limits of instruction-level parallelism," SIGOPS - Operating Systems Review, vol. 25, no. Special Issue, pp. 176-188, April 1991.

[83] Lars Middendorf, Christophe Bobda, and Christian Haubelt, "Hardware synthesis of recursive functions through partial stream rewriting," in Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, San Francisco, CA , june 2012, pp. 1203-1211.

[84] Lars Middendorf, Christian Zebelein, and Christian Haubelt, "Dynamic Task Mapping onto Multi-Core Architectures through Stream Rewriting," in Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII), 2013 International Conference on, Agios Konstantinos, 2013, pp. 196-204.

[85] L. Middendorf and C. Haubelt, "A novel graphics processor architecture based on partial stream rewriting," in Design and Architectures for Signal and Image Processing (DASIP), 2013 Conference on, Cagliari, 2013, pp. 38-45.

[86] Lars Middendorf and Christian Haubelt, "A Programmable Graphics Processor based on Partial Stream Rewriting," Computer Graphics Forum, vol. 32, no. 7, pp. 325-334, 2013.

[87] Lars Middendorf and Christian Haubelt, "System Level Synthesis of Many-Core Architectures using Parallel Stream Rewriting," in Electronic System Level Synthesis Conference (ESLsyn), Proceedings of the 2014, San Francisco, CA , 2014, pp. 1-6.

148 [88] Lars Middendorf and Christian Haubelt, "Scheduling of Recursive and Dynamic Data- Flow Graphs using Stream Rewriting," in Proceedings of Special Edition on Data-flow Programming Models and Machines (MPP’14), Paris, France, 2014.

[89] Christian Haubelt, Florian Ludwig, Lars Middendorf, and Christian Zebelein, "Using stream rewriting for mapping and scheduling data flow graphs onto many-core architectures," in Signals, Systems and Computers, 2013 Asilomar Conference on, 2013, pp. 1431-1435.

[90] A. W. Appel and T. Jim, "Continuation-passing, closure-passing style," in Proceedings of the 16th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, Austin, Texas, USA, 1989, pp. 293-302.

[91] Haik Lorenz and Jürgen Döllner, "Dynamic mesh refinement on GPU using geometry shaders," in 16-th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG) - Full Papers , 2008, pp. 97-104.

[92] Christopher Dyken, Martin Reimers, and Johan Seland, "Semi-Uniform Adaptive Patch Tessellation," Computer Graphics Forum, vol. 28, no. 8, pp. 2255-2263, 2009.

[93] Craig M. Wittenbrink, "R-buffer: A Pointerless A-buffer Hardware Architecture," in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, Los Angeles, California, USA, 2001, pp. 73-80.

[94] Matthew Eldridge, Homan Igehy, and Pat Hanrahan, "Pomegranate: a fully scalable graphics architecture," in Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, 2000, pp. 443-454.

[95] K.C. Yeager, "The Mips R10000 superscalar ," Micro, IEEE, vol. 16, no. 2, pp. 28-41, April 1996.

[96] Craig M Wittenbrink, Emmett Kilgariff, and Arjun Prabhu, "Fermi GF100 GPU architecture," Micro, IEEE, vol. 31, no. 2, pp. 50-59, March 2011.

[97] Robin Milner, The definition of standard ML: revised.: MIT press, 1997.

[98] Simon L Peyton Jones, Haskell 98 language and libraries: the revised report.: Cambridge University Press, 2003.

[99] Paul Frenger, "The JOY of forth," SIGPLAN Not., vol. 38, no. 8, pp. 15-17, aug 2003.

[100] José Meseguer, "Conditional rewriting logic as a unified model of concurrency," Theoretical Computer Science, vol. 96, no. 1, pp. 73-155, April 1992.

149 [101] Jochem H. Rutgers, Marco J. G. Bekooij, and Gerard J. M. Smit, "Programming a Multicore Architecture Without Coherency and Atomic Operations," in Proceedings of Programming Models and Applications on Multicores and Manycores, Orlando, FL, USA, 2014, pp. 29:29-29:38.

[102] Goguen Joseph, Kirchner Claude, and Meseguer Josè , "Concurrent Term Rewriting As a Model of Computation," in Proc. Of a Workshop on Graph Reduction, Santa Fe, New Mexico, USA, 1987, pp. 53-93.

[103] Matthew Naylor and Colin Runciman, "The Reduceron Reconfigured and Re- evaluated," Journal of Functional Programming, vol. 22, no. 4-5, pp. 574-613, August 2012.

[104] Arjan Boeijink, Philip K. F. Hölzenspies, and Jan Kuper, "Introducing the PilGRIM: A Processor for Executing Lazy Functional Languages," in Implementation and Application of Functional Languages.: Springer Berlin Heidelberg, 2011, pp. 54-71.

[105] Tao Yang and A. Gerasoulis, "DSC: scheduling parallel tasks on an unbounded number of processors," Parallel and Distributed Systems, IEEE Transactions on, vol. 5, no. 9, pp. 951-967, September 1994.

[106] Tang Lei and S. Kumar, "A two-step genetic algorithm for mapping task graphs to a network on chip architecture," in Digital System Design, 2003. Proceedings. Euromicro Symposium on, Belek-Antalya, Turkey, sept. 2003, pp. 180-187.

[107] Yu-Kwong Kwok and Ishfaq Ahmad, "Static scheduling algorithms for allocating directed task graphs to multiprocessors," ACM Comput. Surv., vol. 31, no. 4, pp. 406- 471, dec 1999.

[108] Yu-Kwong Kwok and I. Ahmad, "Dynamic critical-path scheduling: an effective technique for allocating task graphs to multiprocessors," Parallel and Distributed Systems, IEEE Transactions on, vol. 7, no. 5, pp. 506-521, May 1996.

[109] Jacobo Valdes, Robert E. Tarjan, and Eugene L. Lawler, "The recognition of Series Parallel digraphs," in Proceedings of the eleventh annual ACM symposium on Theory of computing, Atlanta, Georgia, USA, 1979, pp. 1-12.

[110] Lucian Finta, Zhen Liu, Ioannis Milis, and Evripidis Bampis, "Scheduling UET-UCT Series-Parallel Graphs on Two Processors," Theoretical Computer Science, vol. 162, no. 2, pp. 323-340, Aug. 1996.

150 [111] Yuan Xie and W. Wolf, "Allocation and scheduling of conditional task graph in hardware/software co-synthesis," in Design, Automation and Test in Europe, 2001. Conference and Exhibition 2001. Proceedings, Munich, 2001, pp. 620-625.

[112] P. Eles, K. Kuchcinski, Z. Peng, A. Doboli, and P. Pop, "Scheduling of conditional process graphs for the synthesis of embedded systems," in Design, Automation and Test in Europe, 1998., Proceedings, Paris, 1998, pp. 132-139.

[113] M. Cosnard and M. Loi, "Automatic task graph generation techniques," in System Sciences, 1995. Vol. II. Proceedings of the Twenty-Eighth Hawaii International Conference, vol. 2, Wailea, HI, jan 1995, pp. 113-122.

[114] M. Cosnard, E. Jeannot, and L. Rougeot, "Low memory cost dynamic scheduling of large coarse grained task graphs," in Parallel Processing Symposium, 1998. IPPS/SPDP 1998., Orlando, FL , mar-3 apr 1998, pp. 524-530.

[115] A.N. Choudhary, B. Narahari, D.M. Nicol, and R. Simha, "Optimal processor assignment for a class of pipelined computations," Parallel and Distributed Systems, IEEE Transactions on, vol. 5, no. 4, pp. 439-445, apr 1994.

[116] J. C. M. Baeten, "A brief history of process algebra," Theoretical Computer Science, vol. 335, no. 2-3, pp. 131-146, may 2005.

[117] Christian Zebelein, Joachim Falk, Christian Haubelt, and Jürgen Teich, "Classification of General Data Flow Actors into Known Models of Computation," in 6th ACM/IEEE International Conference on Formal Methods and Models for Co-Design, 2008. MEMOCODE 2008., Anaheim, CA, 2008, pp. 119-128.

[118] W. Plishker, N. Sane, and S.S. Bhattacharyya, "A generalized scheduling approach for dynamic dataflow applications," in Design, Automation Test in Europe Conference Exhibition, 2009. DATE '09., Nice, april 2009, pp. 111-116.

[119] Shin-ichi Minato and Shinya Ishihara, "Streaming BDD manipulation for large-scale combinatorial problems," in Design, Automation and Test in Europe, 2001. Conference and Exhibition 2001. Proceedings, Munich, 2001, pp. 702-707.

[120] Richard Membarth et al., "Dynamic Task-Scheduling and Resource Management for GPU Accelerators in Medical Imaging," in Proceedings of the 25th International Conference on Architecture of Computing Systems, Munich, Germany, 2012, pp. 147- 159.

[121] Dave Shreiner, OpenGL programming guide: the official guide to learning OpenGL, versions 3.0 and 3.1. Amsterdam: Addison-Wesley, 2009.

151 [122] Erik Lindholm, Mark J. Kilgard, and Henry Moreton, "A User-programmable Vertex Engine," in Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, 2001, pp. 149-158.

[123] Shuai Mu et al., "Evaluating the potential of graphics processors for high performance embedded computing," in Design, Automation Test in Europe Conference Exhibition (DATE), 2011, Grenoble, march 2011, pp. 1-6.

[124] M. Garland et al., "Parallel Computing Experiences with CUDA," Micro, IEEE, vol. 28, no. 4, pp. 13-27, july 2008.

[125] Benedict Gaster, Timothy G. Mattson Aaftab Munshi, OpenCL Programming Guide. Amsterdam: Addison-Wesley Longman, 2011.

[126] C. Nugteren, H. Corporaal, and B. Mesman, "Skeleton-based automatic parallelization of image processing algorithms for GPUs," in Embedded Computer Systems (SAMOS), 2011 International Conference on, Samos, july 2011, pp. 25-32.

[127] G.-J. van den Braak, B. Mesman, and H. Corporaal, "Compile-time GPU memory access optimizations," in Embedded Computer Systems (SAMOS), 2010 International Conference on, Samos, july 2010, pp. 200-207.

[128] Anthony A Apodaca and Larry Gritz, Advanced RenderMan: Creating CGI for motion pictures.: Morgan Kaufmann, 2000.

[129] Fang Liu, Meng-Cheng Huang, Xue-Hui Liu, and En-Hua Wu, "CUDA renderer: a programmable graphics pipeline," in ACM SIGGRAPH ASIA 2009 Sketches, 2009, pp. 34:1-34:1.

[130] Jason Zink, Matt Pettineo, and Jack Hoxley, Practical rendering and computation with Direct3D 11.: CRC Press, 2011.

[131] Kun Zhou et al., "RenderAnts: interactive Reyes rendering on GPUs," in ACM SIGGRAPH Asia 2009 papers, Yokohama, Japan, 2009, pp. 155:1--155:11.

[132] Zhang Ying, Peng Lu, Li Bin, Peir Jih-Kwon, and Chen Jianmin, "Architecture comparisons between Nvidia and ATI GPUs: Computation parallelism and data communications," in 2011 IEEE International Symposium on Workload Characterization (IISWC), Austin, TX , 2011, pp. 205-215.

[133] L. Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing," ACM Transactions on Graphics, vol. 27, no. 3, pp. 18:1-18:15, August 2008.

152 [134] Sven Woop, Gerd Marmitt, and Philipp Slusallek, "B-KD trees for hardware accelerated ray tracing of dynamic scenes," in Proceedings of the 21st ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware, Vienna, Austria, 2006, pp. 67-77.

[135] Sven Woop, Jörg Schmittler, and Philipp Slusallek, "RPU: a programmable ray processing unit for realtime ray tracing," in ACM SIGGRAPH 2005 Papers, Los Angeles, California, 2005, pp. 434-444.

[136] Won-Jong Lee et al., "SGRT: a mobile GPU architecture for real-time ray tracing," in Proceedings of the 5th High-Performance Graphics Conference, Anaheim, California, 2013, pp. 109-119.

[137] Jon Hasselgren and Thomas Akenine-Möller, "PCU: the programmable culling unit," ACM Transactions On Graphics, vol. 26, no. 3, pp. 92:1-92:10, 2007.

[138] William R. Mark and Kekoa Proudfoot, "The F-buffer: A Rasterization-order FIFO Buffer for Multi-pass Rendering," in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, Los Angeles, California, USA, 2001, pp. 57-64.

[139] Fang Liu, Meng-Cheng Huang, Xue-Hui Liu, and En-Hua Wu, "Efficient Depth Peeling via Bucket Sort," in Proceedings of the Conference on High Performance Graphics, New Orleans, Louisiana, 2009, pp. 51-57.

[140] Andreas A. Vasilakis and Ioannis Fudos, "K+-buffer: Fragment Synchronized K-buffer," in Proceedings of the 18th Meeting of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, San Francisco, California, 2014, pp. 143-150.

[141] Jason C Yang, Justin Hensley, Holger Grün, and Nicolas Thibieroz, "Real-Time Concurrent Linked List Construction on the GPU," Computer Graphics Forum, vol. 29, no. 4, pp. 1297-1304, 2010.

[142] Timo Aila, Ville Miettinen, and Petri Nordlund, "Delay Streams for Graphics Hardware," ACM Transactions on Graphics (TOG), vol. 3, pp. 792-800, July 2003.

[143] Fang Liu, Meng-Cheng Huang, Xue-Hui Liu, and En-Hua Wu, "FreePipe: a programmable parallel rendering architecture for efficient multi-fragment effects," in Proceedings of the 2010 ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, Washington, D.C., 2010, pp. 75-82.

[144] Markus Steinberger et al., "Softshell: Dynamic Scheduling on GPUs," ACM Transactions on Graphics, vol. 31, no. 6, pp. 161:1--161:11, November 2012.

153 [145] Stanley Tzeng, Anjul Patney, and John D. Owens, "Task management for irregular- parallel workloads on the GPU," in Proceedings of the Conference on High Performance Graphics, Saarbrucken, Germany, 2010, pp. 29-37.

[146] Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens, "Scan Primitives for GPU Computing," in Proceedings of the 22Nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, San Diego, California, 2007, pp. 97-106.

[147] Samuli Laine, Tero Karras, and Timo Aila, "Megakernels Considered Harmful: Wavefront Path Tracing on GPUs," in Proceedings of the 5th High-Performance Graphics Conference, Anaheim, California, 2013, pp. 137-143.

[148] Stephen Jones. (2012) Introduction to dynamic parallelism. [Online]. http://on- demand.gputechconf.com/gtc/2012/presentations/S0338-GTC2012-CUDA- Programming-Model.pdf

[149] D. Sanchez, D. Lo, R.M. Yoo, J. Sugerman, and C. Kozyrakis, "Dynamic Fine-Grain Scheduling of Pipeline Parallelism," in Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, Washington, DC, USA, oct. 2011, pp. 22-32.

[150] Louis Bavoil, Steven P. Callahan, Aaron Lefohn, Joao L. D. Comba, and Claudio T. Silva, "Multi-fragment Effects on the GPU Using the K-buffer," in Proceedings of the 2007 symposium on Interactive 3D graphics and games, Seattle, Washington, 2007, pp. 97- 104.

[151] Nan Zhang, "Memory-Hazard-Aware K-Buffer Algorithm for Order-Independent Transparency Rendering," IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 2, pp. 238-248, Feb 2014.

[152] Tim Foley and Pat Hanrahan, "Spark: modular, composable shaders for graphics hardware," ACM Transactions on Graphics, vol. 30, no. 4, pp. 107:1-107:12, July 2011.

[153] Jean-Eudes Marvie, Cyprien Buron, Pascal Gautron, Patrice Hirtzlin, and Ga, "Gpu shape grammars," Computer Graphics Forum, vol. 31, no. 7pt1, pp. 2087-2095, September 2012.

[154] Cyprien Buron, Jean-Eudes Marvie, and Pascal Gautron, "GPU Roof Grammars," in Eurographics 2013-Short Papers, 2013, pp. 85-88.

[155] Qiming Hou, Kun Zhou, and Baining Guo, "BSGP: bulk-synchronous GPU programming," ACM Transactions on Graphics, vol. 27, no. 3, pp. 19:1-19:12, August 2008.

154 [156] Ralf Karrenberg, Dmitri Rubinstein, Philipp Slusallek, and Sebastian Hack, "AnySL: efficient and portable shading for ray tracing," in Proceedings of the Conference on High Performance Graphics, Saarbrücken, Germany, 2010, pp. 97-105.

[157] Conal Elliott, "Programming Graphics Processors Functionally," in Proceedings of the 2004 ACM SIGPLAN Workshop on Haskell, Snowbird, Utah, USA, 2004, pp. 45-56.

[158] Chad Austin and Dirk Reiners, "Renaissance: A functional shading language," in Proceedings of Graphics Hardware, New York, 2005, pp. 1-8.

[159] Joel Svensson, Koen Claessen, and Mary Sheeran, "GPGPU kernel implementation and refinement using Obsidian," in Procedia Computer Science, ICCS 2010, 2010, pp. 2059- 2068.

[160] Nathan Bell and Jared Hoberock, "Thrust: A productivity-oriented library for CUDA," GPU Computing Gems, vol. 7, pp. 359-373, 2011.

[161] Yi Yang and Huiyang Zhou, "CUDA-NP: Realizing Nested Thread-level Parallelism in GPGPU Applications," in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '14, Orlando, Florida, USA, 2014, pp. 93-106.

[162] Lars Bergstrom and John Reppy, "Nested data-parallelism on the gpu," in Proceedings of the 17th ACM SIGPLAN international conference on Functional programming, Copenhagen, Denmark, 2012, pp. 247-258. [Online]. http://doi.acm.org/10.1145/2364527.2364563

[163] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron, "Scalable Parallel Programming with CUDA," Queue, vol. 6, no. 2, pp. 40-53, March 2008.

[164] Xin Huo, Sriram Krishnamoorthy, and Gagan Agrawal, "Efficient Scheduling of Recursive Control Flow on GPUs," in Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, Eugene, Oregon, USA, 2013, pp. 409-420.

[165] Hanan Samet, Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1990.

[166] Mathias Holst and Heidrun Schumann, "Normal Mapping for Surfel-Based Rendering," Journal of WSCG, pp. 9-16, 2007.

155 [167] Stefan Popov, Johannes Günther, Hans-Peter Seidel, and Philipp Slusallek, "Stackless KD-Tree Traversal for High Performance GPU Ray Tracing," Computer Graphics Forum, vol. 26, no. 3, pp. 415-424, 2007.

[168] Henry Schäfer, Matthias Nießner, Benjamin Keinert, Marc Stamminger, and Charles Loop, "State of the Art Report on Real-time Rendering with Hardware Tessellation," Eurographics 2014 - State of the Art Reports, pp. 93-117, 2014.

[169] Falko Löffler, Andreas Müller, and Heidrun Schumann, "Real-time rendering of stack- based terrains," in Proceedings of the 16th Annual Workshop on Vision, Modeling and Visualization, Berlin, Germany, 2011, pp. 161-168.

[170] Hyunjin Lee, Yuna Jeong, and Sungkil Lee, "Recursive Tessellation," in SIGGRAPH Asia 2013 Posters, Hong Kong, Hong Kong, 2013, pp. 16:1--16:1.

[171] Michael Schwarz and Marc Stamminger, "Fast GPU-based Adaptive Tessellation with CUDA," Computer Graphics Forum, vol. 28, no. 2, pp. 365-374, 2009.

[172] Kirill Garanzha, Jacopo Pantaleoni, and David McAllister, "Simpler and Faster HLBVH with Work Queues," in Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, Vancouver, British Columbia, Canada, 2011, pp. 59-64.

[173] Qiming Hou, Kun Zhou, and Baining Guo, "Debugging GPU Stream Programs Through Automatic Dataflow Recording and Visualization," ACM Transactions on Graphics, vol. 28, no. 5, pp. 153:1--153:11, December 2009.

[174] Magnus Strengert, Thomas Klein, and Thomas Ertl, "A hardware-aware debugger for the OpenGL shading language," in Proceedings of the 22Nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, San Diego, California, 2007, pp. 81-88.

[175] Juan Pineda, "A parallel algorithm for polygon rasterization," in Proceedings of the 15th Annual Conference on Computer Graphics and Interactive Techniques, 1988, pp. 17-20. [Online]. http://doi.acm.org/10.1145/54852.378457

[176] Samuli Laine and Tero Karras, "High-performance software rasterization on GPUs," in Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, Vancouver, British Columbia, Canada, 2011, pp. 79-88.

[177] Chris Lomont. (2003) Fast inverse square root. [Online]. www.lomont.org/Math/Papers/2003/InvSqrt.pdf

[178] Michael Abrash, "Rasterization on larrabee," Dr. Dobbs Journal, 2009.

156 [179] Marc Olano and Trey Greer, "Triangle scan conversion using 2D homogeneous coordinates," in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, Los Angeles, California, USA, 1997, pp. 89-95.

[180] Tamy Boubekeur and Marc Alexa, "Phong tessellation," ACM Transactions on Graphics (TOG), vol. 27, no. 5, pp. 141:1-141:5, 2008.

[181] Stephen W Keckler, William J Dally, Brucek Khailany, Michael Garland, and David Glasco, "GPUs and the future of parallel computing," IEEE Micro, vol. 31, no. 5, pp. 7- 17, September 2011.

[182] Alex Vlachos, Jörg Peters, Chas Boyd, and Jason L. Mitchell, "Curved PN Triangles," in Proceedings of the 2001 Symposium on Interactive 3D Graphics, vol. I3D '01, New York, NY, USA, 2001, pp. 159-166.

[183] James T Kajiya, "The rendering equation," in ACM Siggraph Computer Graphics, vol. 20, 1986, pp. 143-150.

[184] Dietger van Antwerpen, "Improving SIMD Efficiency for Parallel Monte Carlo Light Transport on the GPU," in Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, Vancouver, British Columbia, Canada, 2011, pp. 41-50.

[185] Yoichiro Kawaguchi, "A Morphological Study of the Form of Nature," SIGGRAPH Computer Graphics, vol. 16, no. 3, pp. 223-232, July 1982.

[186] M.C. McFarland, A.C. Parker, and R. Camposano, "The high-level synthesis of digital systems," Proceedings of the IEEE, vol. 78, no. 2, pp. 301-318, feb 1990.

[187] G. Martin and G. Smith, "High-Level Synthesis: Past, Present, and Future," Design Test of Computers, IEEE, vol. 26, no. 4, pp. 18-25, july-aug. 2009.

[188] Luc and Micheli, Giovanni De Semeria, "Resolution, optimization, and encoding of pointer variables," IEEE Trans. on CAD of Integrated Circuits and Systems5, vol. 20, no. 2, pp. 213-233, 2001.

[189] Luc and Sato, Koichi and Micheli, Giovanni De Semeria, "Synthesis of hardware models in C with pointers and complex data structures," IEEE Trans. VLSI Syst., vol. 9, no. 6, pp. 743-756, 2001.

[190] S.A. Edwards, "The challenges of hardware synthesis from C-like languages," in Design, Automation and Test in Europe, 2005. Proceedings, vol. 1, Munich, march 2005, pp. 66-67.

157 [191] I. Skliarova and V. Sklyarov, "Recursion in reconfigurable computing: A survey of implementation approaches," in Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, 31 2009-sept. 2 2009, pp. 224-229.

[192] V. Skylarov, I. Skilarova, and B. Pimentel, "FPGA-based implementation and comparison of recursive and iterative algorithms," in Field Programmable Logic and Applications, 2005. International Conference on, aug. 2005, pp. 235-240.

[193] Dan R. and Smith, Alex and Singh, Satnam Ghica, "Geometry of synthesis iv: compiling affine recursion into Static Hardware," in Proceedings of the 16th ACM SIGPLAN International Conference on Functional Programming, Tokyo, Japan, 2011, pp. 221- 233.

[194] George and ElGindy, Hossam A. Ferizis, "Mapping Recursive Functions to Reconfigurable Hardware," in Field Programmable Logic and Applications, 2006. FPL '06. International Conference on, Madrid, 2006, pp. 1-6.

[195] M. S. Chandrasekhar, J. P. Privitera, and K. W. Conradt, "Application of term rewriting techniques to hardware design verification," in Proceedings of the 24th ACM/IEEE Design Automation Conference, New York, NY, USA, 1987, pp. 277-282. [Online]. http://doi.acm.org/10.1145/37888.37930

[196] James C. and Arvind Hoe, "Hardware Synthesis from Term Rewriting Systems," in VLSI: Systems on a Chip.: Springer US, 2000, vol. 34, pp. 595-619.

[197] N. Dave, A. Pellauer, and M. Pellauer, "Scheduling as Rule Composition," in Formal Methods and Models for Codesign, 2007. MEMOCODE 2007. 5th IEEE/ACM International Conference on, Nice, 30 2007-june 2 2007, pp. 51-60.

[198] M. von Thun. (1996) A rewriting system for Joy. [Online]. http://www.latrobe.edu.au/humanities/research/research-projects/past- projects/joy-programming-language

Abstract

Multi- and many-core systems offer numerous benefits like reduced energy consumption and latency, as well as improved throughput for both high-performance and low-power applications. Scalability is achieved by parallelization instead of high clock rates and can therefore overcome technological limitations like thermal or power issues. While functional units and also processor cores can be replicated at the expense of increased area, the utilization of the additional resources remains a more fundamental problem. Hence, beside the design of the actual hardware architecture, the programming of a many-core system also raises several challenges. Especially in the case of irregular tasks and dynamic parallelism, the binding and scheduling of tasks to processor cores as well as the efficient communication and synchronization at the system level are not yet solved in general. As a consequence, the majority of static optimizations, which are based on an extensive knowledge of the behavior at design time, cannot be applied in this case.

This thesis proposes a novel model of computation, called stream rewriting, for the specification and implementation of highly concurrent applications. Basically, the active tasks of an application and their dependencies are encoded as a token stream, which is iteratively modified by a set of rewriting rules at runtime. As a result, the creation of new tasks, the synchronization of cooperating tasks, and the scheduling of dependent tasks are implemented as local pattern matching, which can be performed in parallel on several regions of the stream and does not require a central scheduler. Hence, stream rewriting is most useful for compute-intensive applications with frequently varying and unpredictable data rates and further enables global resource sharing as well as lightweight lock-free synchronization. In addition, this thesis presents a balanced scheduling algorithm, which restricts the memory usage of dynamic and recursive computations by switching automatically between depth-first and breadth-first traversal.

The concept of stream rewriting has been extensively evaluated for several different use cases, and all results presented in this thesis have been obtained either from a cycle-accurate HDL simulation or from an FPGA implementation. For example, stream rewriting has been utilized in a high-level synthesis tool to compile recursive functions into hardware. Several many-core systems with up to 128 general purpose processors have been implemented and show the scalability of stream rewriting for complex examples and recursive algorithms.

In particular, the concept of stream rewriting fits the requirements of graphics processing very well. Since both data and pipeline parallelism are available and match the requirements of specialized rendering algorithms, previously fixed blocks like the rasterizer or tessellator stages can be stored in a software library. Recursive and indirect function calls allow generating workloads dynamically at runtime, so that both rasterization and ray tracing can be implemented as a shader program. A novel graphics processor has been designed, which supports custom rendering pipelines, recursion, complex atomic operations, and incoherent workloads. It combines the performance advantages of hardware-based scheduling through stream rewriting with the flexibility of general purpose processors and thus provides worthwhile insights for future GPU architectures.
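
To make the execution model concrete, the following minimal C++ sketch illustrates the idea. It is a deliberately simplified illustration (the token format, the rule set, and all names are invented here and are not those of the stream rewriting machine developed in the thesis): a recursive task, Fibonacci, is encoded as a token stream that is rewritten pass by pass; task tokens expand into sub-tasks, and an ADD token reduces as soon as its two operands have become values.

```cpp
// Invented token format and rule set, for illustration only.
#include <cstdio>
#include <vector>

enum Tag { VAL, FIB, ADD };
struct Token { Tag tag; int v; };

// One rewriting pass: scan the stream and replace every redex that is found.
static std::vector<Token> rewrite(const std::vector<Token>& in) {
    std::vector<Token> out;
    for (const Token& t : in) {
        if (t.tag == FIB) {                              // expand a task token
            if (t.v < 2) { out.push_back({VAL, t.v}); continue; }
            out.push_back({FIB, t.v - 1});               // fib(n) -> fib(n-1) fib(n-2) ADD
            out.push_back({FIB, t.v - 2});
            out.push_back({ADD, 0});
        } else if (t.tag == ADD && out.size() >= 2 &&
                   out[out.size() - 1].tag == VAL && out[out.size() - 2].tag == VAL) {
            int b = out.back().v; out.pop_back();        // reduce: VAL VAL ADD -> VAL
            int a = out.back().v; out.pop_back();
            out.push_back({VAL, a + b});
        } else {
            out.push_back(t);                            // not ready yet, keep the token
        }
    }
    return out;
}

int main() {
    std::vector<Token> stream = {{FIB, 10}};             // initial task
    while (stream.size() != 1 || stream[0].tag != VAL)
        stream = rewrite(stream);                        // iterate until only a value remains
    std::printf("fib(10) = %d\n", stream[0].v);          // prints 55
    return 0;
}
```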

Kurzfassung

In contrast to single-core processors, multi- and many-core processors enable reduced energy consumption, lower latency, and higher throughput for compute-intensive applications. Instead of ever increasing clock rates, the performance gains of these systems are based on the parallelization of computations, so that the usual problems of high frequencies, such as heat dissipation and power consumption, can be avoided. The duplication of individual functional units and processors at the hardware level rarely poses an unsolvable problem anymore and mostly manifests itself in a larger area requirement. The efficient utilization of the additional functionality, in contrast, is the greater challenge and depends on a multitude of open questions. In particular, this thesis addresses the problem of how a large number of dynamic and irregular tasks can be distributed and scheduled onto the processors of a many-core system. It is important to note that, without prior knowledge about the kind of application and the distribution of the workload, many of the known optimizations for static systems cannot be applied here.

In this dissertation, stream rewriting has therefore been developed as a new method for describing applications with a large number of dynamic tasks and for managing them efficiently at runtime. The active tasks are packed into a data stream, which is continuously rewritten at runtime by repeated searching and replacing. Both the creation of new tasks and the synchronization of dependent tasks can be expressed by rewriting rules, each of which operates only on a bounded section of the stream and can therefore be executed in parallel on several processors. This technique is hence particularly well suited for compute-intensive applications with many unpredictable tasks.

To determine the performance and scalability of stream rewriting, a large number of experiments with multi-core systems has been carried out. On the one hand, many-core processors with up to 128 cores have been built on an FPGA, and the management of tasks via stream rewriting has been implemented in software and hardware. Tests with different applications show the scalability of this technique for a large number of dynamic and recursive tasks. Furthermore, stream rewriting is particularly well suited for graphics processing, since it supports both pipeline and data parallelism and thus simplifies the implementation of new rendering algorithms. The presented graphics processor is based on stream rewriting and allows the definition of user-defined rendering pipelines, recursion, as well as complex atomic operations and irregular tasks. The new architecture combines hardware-based task management via stream rewriting with the flexibility of software rendering and thereby points out possibilities for the development of future graphics processors.

Theses

Stream Rewriting

 Stream rewriting is a novel model of computation for the specification, analysis and implementation of parallel applications. It encapsulates both data and pipeline parallelism to manage a large number of dynamic, irregular and also recursively expandable tasks without a central scheduler.

 Rewriting operations modify locally constrained and non-overlapping intervals of the stream and can therefore be performed on different regions in parallel. The scalability of this approach is demonstrated by several hardware and software architectures for parallel stream rewriting.

 The execution order of tasks is defined implicitly by data dependencies, so that an implementation can choose an optimal schedule at runtime in order to maximize utilization of available processing units.

 The results of concurrent threads are synchronized via pattern matching on the stream, so that shared memory and atomic operations can be avoided.

 Although data parallelism is supported, stream rewriting has been especially designed for handling irregular and dynamic workloads. For this purpose, the basic algorithm requires a single rewriting rule and reduces all scheduling decisions to find-and-replace operations.

 Access to shared resources is serialized by dynamically binding interfering tasks to the same processor. This approach enables complex atomic read-modify-write operations in many-core systems without global locks or bus snooping and is also suitable for non-uniform memory (a minimal software analogy is sketched after this list).

 The stack-based scheduling automatically balances between depth-first and breadth-first traversal of the call tree to avoid an exponential growth of the token stream while still maintaining a large degree of concurrency (see the second sketch after this list). Most importantly, only a small part of the potentially large stream must be accessible, so that the remainder can be paged out into external memory.

 Local environments redefine the value of a global variable for a restricted sub-stream and therefore offer an efficient mechanism for the distribution of shared data in a highly concurrent environment. At the end of the computation, environments are automatically garbage collected and removed from the stream.

 Instancing reduces the size of the stream by specifying data-parallel rewriting rules only once, so that multiple iterations can be expanded on demand.

 The token stream serves as a platform-independent representation of a task graph or a functional program. Since each processor is only responsible for replacing its own patterns in the stream, different software or hardware components can cooperate without explicit knowledge of each other.
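
The serialization of shared-resource accesses mentioned above can be pictured with the following software analogy. It is only a sketch with invented helper structures, not the hardware mechanism of the thesis: tasks that touch the same resource are bound to the same worker queue by hashing the resource identifier, so every read-modify-write on a given resource is applied in order without locks or atomic instructions.

```cpp
#include <cstdio>
#include <functional>
#include <vector>

// A task that performs a read-modify-write on one shared resource (here: a counter).
struct Task { int resource; int delta; };

int main() {
    const int kWorkers = 4;
    std::vector<int> counters(8, 0);                      // shared resources
    std::vector<std::vector<Task>> queue(kWorkers);       // one queue per worker/core

    // Binding step: interfering tasks (same resource) always map to the same worker.
    std::vector<Task> tasks = {{3, +1}, {5, +2}, {3, +4}, {5, -1}, {3, +1}};
    for (const Task& t : tasks)
        queue[std::hash<int>{}(t.resource) % kWorkers].push_back(t);

    // Execution step: each worker drains its own queue sequentially, so all
    // updates of one resource are applied in order without locks or atomics.
    // (Real workers would run concurrently; their queues never interfere.)
    for (int w = 0; w < kWorkers; ++w)
        for (const Task& t : queue[w])
            counters[t.resource] += t.delta;              // plain, non-atomic update

    std::printf("counter[3]=%d counter[5]=%d\n", counters[3], counters[5]);  // 6 and 1
    return 0;
}
```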
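
Similarly, the balancing between breadth-first and depth-first expansion can be approximated by a simple length threshold on the stream. The sketch below reuses the invented Fibonacci encoding from the abstract and is only a crude stand-in for the stack-based algorithm of the thesis: while the stream is short, every task token may expand in a pass; once the stream grows beyond a limit, only one task per pass may expand, while reductions continue to shrink the stream.

```cpp
#include <cstdio>
#include <vector>

enum Tag { VAL, FIB, ADD };
struct Token { Tag tag; int v; };

// One pass over the stream; 'budget' limits how many task tokens may be expanded.
static std::vector<Token> pass(const std::vector<Token>& in, int budget) {
    std::vector<Token> out;
    for (const Token& t : in) {
        if (t.tag == FIB && (t.v < 2 || budget > 0)) {
            if (t.v < 2) { out.push_back({VAL, t.v}); continue; }
            --budget;                                     // spend one expansion
            out.push_back({FIB, t.v - 1});
            out.push_back({FIB, t.v - 2});
            out.push_back({ADD, 0});
        } else if (t.tag == ADD && out.size() >= 2 &&
                   out[out.size() - 1].tag == VAL && out[out.size() - 2].tag == VAL) {
            int b = out.back().v; out.pop_back();
            int a = out.back().v; out.pop_back();
            out.push_back({VAL, a + b});                  // reductions are always allowed
        } else {
            out.push_back(t);                             // task stays in the stream
        }
    }
    return out;
}

int main() {
    const size_t kLimit = 32;                             // threshold on the stream length
    std::vector<Token> s = {{FIB, 15}};
    size_t peak = 0;
    while (s.size() != 1 || s[0].tag != VAL) {
        // Short stream: expand everything (breadth-first). Long stream: expand
        // only one task per pass (depth-first-like) to throttle the growth.
        int budget = (s.size() <= kLimit) ? (int)s.size() : 1;
        s = pass(s, budget);
        if (s.size() > peak) peak = s.size();
    }
    std::printf("fib(15) = %d, peak stream length = %zu\n", s[0].v, peak);  // fib(15) = 610
    return 0;
}
```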

Source Models

 The basic execution model of stream rewriting corresponds to a functional language without side effects, which is well suited for modeling highly concurrent tasks. Hence, the implementations of the stream rewriting machine are in fact hardware and software interpreters for this functional language.

 The sub-class of series-parallel (SP) task graphs can be translated into stream rewriting programs (a sketch of such a translation follows this list). In addition, stream rewriting allows adapting the topology of the graph to varying workloads at runtime by the recursive expansion of nodes.

 High-level C programs can be converted into rewriting rules and therefore benefit from the parallel scheduling of recursive invocations. This thesis presents a compiler for the hardware synthesis of a C-like language, which generates HDL code for a specialized stream rewriting machine by mapping the control flow between basic blocks onto rewriting rules.
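
A possible translation of series-parallel task graphs into a linear token program, as referenced above, is sketched below. The node types and the JOIN token are invented for this illustration and do not correspond to the exact encoding used in the thesis: parallel branches are laid out side by side and closed by a join token carrying the number of branches, while series composition simply concatenates the sub-programs.

```cpp
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

// Hypothetical node types of a series-parallel (SP) task graph.
struct Node {
    enum Kind { TASK, SERIES, PARALLEL } kind;
    std::string name;                           // only used for TASK
    std::vector<std::shared_ptr<Node>> child;   // used for SERIES / PARALLEL
};

static std::shared_ptr<Node> task(std::string n) {
    return std::make_shared<Node>(Node{Node::TASK, std::move(n), {}});
}
static std::shared_ptr<Node> series(std::vector<std::shared_ptr<Node>> c) {
    return std::make_shared<Node>(Node{Node::SERIES, "", std::move(c)});
}
static std::shared_ptr<Node> parallel(std::vector<std::shared_ptr<Node>> c) {
    return std::make_shared<Node>(Node{Node::PARALLEL, "", std::move(c)});
}

// Flatten the SP graph into a linear token program: parallel branches are laid
// out side by side and closed by a JOIN token carrying the number of branches;
// series composition simply concatenates the sub-programs.
static void flatten(const std::shared_ptr<Node>& n, std::vector<std::string>& out) {
    switch (n->kind) {
    case Node::TASK:     out.push_back(n->name); break;
    case Node::SERIES:   for (auto& c : n->child) flatten(c, out); break;
    case Node::PARALLEL: for (auto& c : n->child) flatten(c, out);
                         out.push_back("JOIN(" + std::to_string(n->child.size()) + ")");
                         break;
    }
}

int main() {
    // (A ; (B || C) ; D) -- a small fork-join graph.
    auto g = series({task("A"), parallel({task("B"), task("C")}), task("D")});
    std::vector<std::string> program;
    flatten(g, program);
    for (const auto& tok : program) std::printf("%s ", tok.c_str());  // A B C JOIN(2) D
    std::printf("\n");
    return 0;
}
```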
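
The mapping of control flow between basic blocks onto rewriting rules can likewise be pictured as follows. The sketch is a hand-written software analogy (the compiler in the thesis emits HDL, not C++, and the token layout is invented here): each pending activation is one token carrying its current basic block and live variables, and leaving a block rewrites the token into its successor block, so several independent activations can progress within the same stream.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical encoding: one token per pending activation, carrying its live
// variables; leaving a block rewrites the token into the successor block.
// The C-like source being "compiled" here is:
//     int fac(int n) { int acc = 1; while (n > 1) { acc *= n; --n; } return acc; }
enum Block { ENTRY, LOOP, DONE };
struct Token { Block block; int n, acc; };

static Token rewrite(Token t) {
    switch (t.block) {
    case ENTRY: return {LOOP, t.n, 1};                                // acc = 1; goto LOOP
    case LOOP:  return (t.n > 1) ? Token{LOOP, t.n - 1, t.acc * t.n}  // loop body
                                 : Token{DONE, t.n, t.acc};           // exit condition
    default:    return t;                                             // DONE: value token
    }
}

int main() {
    // Several independent activations sit in the same stream and are rewritten
    // round by round; each one advances through its own control flow.
    std::vector<Token> stream = {{ENTRY, 5, 0}, {ENTRY, 7, 0}};
    bool busy = true;
    while (busy) {
        busy = false;
        for (Token& t : stream)
            if (t.block != DONE) { t = rewrite(t); busy = true; }
    }
    std::printf("fac(5)=%d fac(7)=%d\n", stream[0].acc, stream[1].acc);  // 120 and 5040
    return 0;
}
```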

Multi-Core Architectures

 Stream rewriting can be implemented as a thin software library for general purpose many-core architectures and embedded systems with strict resource constraints (a hypothetical API sketch follows this list). Hence, the concept of stream rewriting is not tied to a specific execution platform.

 Due to the simplicity of the pattern matching, both the binding and the scheduling of rewriting rules can also be implemented in hardware, while the functionality of the tasks remains in software for maximum flexibility. The resulting stream rewriting network connects up to 128 general purpose cores, and the FPGA implementation indicates scalability for a large number of complex test cases.
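
A thin software implementation, as claimed above, could expose an interface along the following lines. All names (Srw, on, push, run) are invented for this sketch and are not the API of the library developed in the thesis: the host registers one handler per task opcode, handlers replace a matched token by its result or by new tasks, and a run loop applies rewriting passes until only value tokens remain.

```cpp
#include <cstdio>
#include <functional>
#include <map>
#include <vector>

// Hypothetical library surface, for illustration only.
struct Tok { int op; int val; };                        // op 0 = plain value
using Stream  = std::vector<Tok>;
using Handler = std::function<void(const Tok&, Stream&)>;

class Srw {
public:
    void on(int op, Handler h) { rules_[op] = std::move(h); }
    void push(Tok t) { stream_.push_back(t); }
    // Run rewriting passes until no rule matches any token anymore.
    Stream run() {
        bool again = true;
        while (again) {
            again = false;
            Stream next;
            for (const Tok& t : stream_) {
                auto it = rules_.find(t.op);
                if (it != rules_.end()) { it->second(t, next); again = true; }
                else next.push_back(t);                 // token without a rule is kept
            }
            stream_.swap(next);
        }
        return stream_;
    }
private:
    std::map<int, Handler> rules_;
    Stream stream_;
};

int main() {
    enum { VALUE = 0, SQUARE = 1 };
    Srw m;
    // A trivial data-parallel rule: SQUARE(x) rewrites to the value x*x.
    m.on(SQUARE, [](const Tok& t, Stream& out) { out.push_back({VALUE, t.val * t.val}); });
    for (int i = 1; i <= 4; ++i) m.push({SQUARE, i});
    for (const Tok& t : m.run()) std::printf("%d ", t.val);   // 1 4 9 16
    std::printf("\n");
    return 0;
}
```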

Graphics Processing

 Despite the computational power and memory bandwidth of modern graphics processing units (GPUs), a main limitation of these architectures is often the lack of efficient on-chip communication between different shader cores. The dynamic binding of stream rewriting helps to overcome these issues and leads to a more flexible architecture for graphics processing units.

 Although individual stages of the Direct3D and OpenGL rendering pipeline are programmable, its topology and data flow are fixed. This thesis describes a graphics processor based on stream rewriting, which supports an arbitrary number of dynamic and recursive shader stages as well as complex data flow and lightweight synchronization within the rendering pipeline.

 The semantic gap between the sequential execution model of the CPU and the highly parallel but more restricted GPU requires an application to be split into two distinct parts, which are developed using separate languages and optimized according to different rules. In this context, stream rewriting offers a unified execution model for task, pipeline and data parallelism as well as support for individual threads.

 Most importantly, the concept of stream rewriting for graphics processing has been evaluated as part of a complete test system consisting of several applications, an OpenGL driver, a kernel-mode driver, and the FPGA prototype connected via PCIe. Despite the lower clock rate of the FPGA, the measurements include the delays and latencies of the whole system, which would also occur for an ASIC design.

 This thesis proposes a novel OpenGL extension, which integrates stream rewriting into the existing programming model and adds the stream rewriting shader as a new general purpose shader stage.

 Stream rewriting greatly reduces the complexity of several advanced rendering techniques, such as order-independent transparency via K-buffering, path tracing, and the recursive generation of procedural geometry, which require significant effort to run efficiently on current GPUs.
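
As a reminder of what K-buffering involves, the following CPU-side sketch shows the per-pixel insertion step of a K-buffer, i.e. keeping the K nearest fragments in depth order for later blending. It is unrelated to the GPU implementation evaluated in the thesis and only illustrates the data structure.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Per-pixel K-buffer cell: keep the K nearest fragments, sorted front to back,
// so that transparent fragments can later be blended in the correct order.
struct Fragment { float depth; unsigned color; };

constexpr int K = 4;

void kbufferInsert(std::vector<Fragment>& cell, const Fragment& f) {
    // Insert in depth order; drop the farthest fragment if the cell is full.
    auto pos = std::lower_bound(cell.begin(), cell.end(), f,
        [](const Fragment& a, const Fragment& b) { return a.depth < b.depth; });
    cell.insert(pos, f);
    if ((int)cell.size() > K) cell.pop_back();
}

int main() {
    std::vector<Fragment> pixel;                 // one cell of the K-buffer
    std::vector<Fragment> incoming = {{0.7f, 0xFF0000u}, {0.2f, 0x00FF00u},
                                      {0.9f, 0x0000FFu}, {0.4f, 0xFFFF00u}, {0.1f, 0x00FFFFu}};
    for (const Fragment& f : incoming) kbufferInsert(pixel, f);
    for (const Fragment& f : pixel)              // 4 nearest fragments, front to back
        std::printf("depth=%.1f color=%06X\n", f.depth, f.color);
    return 0;
}
```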

Curriculum Vitae

1. General Information

Personal details: Place of birth: Iserlohn; Date of birth: 21.09.1982

Address: Lars Middendorf, Flensburger Str. 26, 18109 Rostock

Contact: Phone: 0151 1447808; E-mail: [email protected]

2. Education

School: Aug. 1989 – July 1993 Brabeckschule Hemer; Aug. 1993 – June 2002 Friedrich-Leopold-Woeste-Gymnasium, Hemer, Abitur

Military service: July 2002 – March 2003 Basic military service in Goslar, Burbach and Hemer

University: April 2003 – March 2007 Diplom degree in computer science, TU Kaiserslautern

3. Employment

June 2007 – April 2009: Software developer for a CAD system at RG Technologies GmbH, Unterhaching

May 2009 – July 2011: Research associate at the Institute of Computer Science, Universität Potsdam

Since August 2011: Research associate at the Institut für Angewandte Mikroelektronik und Datentechnik, Universität Rostock

Declaration of Authorship

I hereby declare that I have written the present dissertation entitled

"Dynamic Task Scheduling and Binding for Many-Core Systems through Stream Rewriting"

independently and have used only the sources and aids indicated, and that I have marked all passages taken literally or in substance from other works. This thesis has not previously been submitted in the same or a similar form for assessment as an examination paper.

Rostock, 28.03.2015

______

Lars Middendorf
