Processing Real-time Data Streams on GPU-based Systems

Uri Verner


Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Uri Verner

Submitted to the Senate of the Technion — Israel Institute of Technology Tevet 5776 Haifa December 2015

This research was carried out under the supervision of Prof. Assaf Schuster and Prof. Avi Mendelson, in the Faculty of Computer Science.

List of Publications

Some results in this thesis have been published as articles by the author and research collaborators in conferences and journals during the course of the author’s doctoral research period, the most up-to-date versions of which are:

· Uri Verner, Avi Mendelson, and Assaf Schuster. Extending Amdahl’s law for multicores with Turbo Boost. IEEE Computer Architecture Letters (CAL), 2015.

· Uri Verner, Avi Mendelson, and Assaf Schuster. Scheduling periodic real-time communication in multi-GPU systems. 23rd International Conference on Computer Communication and Networks (ICCCN), pages 1–8, August 2014.

· Uri Verner, Avi Mendelson, and Assaf Schuster. Batch method for efficient resource sharing in real-time multi-GPU systems. In Distributed Computing and Networking, volume 8314 of LNCS, pages 347–362, 2014.

· Uri Verner, Assaf Schuster, Mark Silberstein, and Avi Mendelson. Scheduling processing of real-time data streams on heterogeneous multi-GPU systems. Proceedings of the 5th Annual International Systems and Storage Conference (SYSTOR), pages 1–12, June 2012.

· Uri Verner, Assaf Schuster, and Mark Silberstein. Processing data streams with hard real-time constraints on heterogeneous systems. Proceedings of the International Conference on Supercomputing (ICS), pages 120–129, May 2011.

Acknowledgements

I would like to thank my advisors, Prof. Assaf Schuster and Prof. Avi Mendelson, for their wise support, guidance, and life experience. I would also like to thank Prof. Mark Silberstein for interesting me in this subject and mentoring me during the first two years of my graduate studies; his effort and contribution to my research are greatly appreciated. My studies would not have been as pleasant without the devoted care of Yifat and Sharon. I enjoyed working with you and thank you wholeheartedly for your help. Finally, I would like to thank my family for their love and support. I thank my grandparents for being a source of inspiration to me. I thank my father for his advice and encouragement. Lastly, I wish to thank my wife Sveta for her understanding, love, support, and patience, which enabled me to complete this work. The generous financial help of the Metro 450 consortium, the Hasso Plattner Institute, and the Technion is gratefully acknowledged.

Contents

List of Figures

List of Tables

List of Listings

Abstract

Abbreviations and Notations

1 Introduction
1.1 Motivating Example: Wafer Production Inspection
1.2 Heterogeneous Systems
1.3 Real-Time Systems
1.4 Stream Processing
1.5 Contribution
1.5.1 Hard Real-Time Data Stream Processing Framework
1.5.2 CPU/GPU Work Distribution
1.5.3 Communication Scheduling on Heterogeneous Interconnects
1.5.4 Generalization to Scheduling Parallel Resources
1.6 Organization

2 Background and Related Work
2.1 Heterogeneous Systems
2.1.1 Multicore CPUs
2.1.2 Graphics Processing Units (GPUs)
2.1.3 CPU-GPU Systems
2.1.4 Multi-GPU Systems
2.1.5 System Interconnect
2.2 Real-Time Scheduling
2.2.1 Introduction
2.2.2 Task Model
2.2.3 Computing Platform Model
2.2.4 Other Scheduling Problems
2.3 Real-Time Scheduling in CPU-GPU Systems
2.3.1 Overview
2.3.2 Limited Control over GPU Resources
2.3.3 Estimation of Execution Time
2.3.4 Resource Scheduling

3 System Model
3.1 Hardware Model
3.2 Real-Time Data Stream
3.3 Packet Processing Model

4 Stream Processing in a CPU-GPU System
4.1 GPU Batch Scheduling
4.1.1 Schedulability
4.1.2 GPU Empirical Performance Model
4.2 CPU Scheduling
4.2.1 Scheduling Algorithms
4.3 Rectangle Method for CPU/GPU Stream Distribution
4.3.1 Considerations in Workload Distribution Between a CPU and a GPU
4.3.2 Rectangle Method
4.3.3 GPU Performance Optimizations
4.3.4 Real-Time Stream Processing Framework
4.4 Evaluation
4.4.1 Experimental Platform
4.4.2 Setting Parameters
4.4.3 Results
4.4.4 Performance Model
4.4.5 Time Breakdown
4.4.6 How Far Are We From the Optimum?
4.5 Related Work
4.6 Conclusions
4.7 Rectangle Method Algorithm Pseudo-code

5 Stream Processing in a Multi-GPU System
5.1 Virtual GPU
5.2 Split Rectangle
5.3 Joined Rectangles
5.4 Loose Rectangles
5.5 Evaluation of Virtual GPU
5.5.1 Experimental Platform
5.5.2 System Performance
5.5.3 Comparison to Baseline Methods
5.5.4 Scaling to More GPUs
5.5.5 Transfer Rates
5.5.6 Accuracy of Performance Model
5.6 Summary

6 Data Transfer Scheduling on the Interconnect
6.1 Introduction
6.2 Problem Definition
6.2.1 System Model
6.2.2 Problem Statement
6.3 Data Transfer Scheduling Method: Overview
6.4 Batch Method for Data Transfer and Computations
6.4.1 Sharing Resources in Real-time Systems
6.4.2 Interface
6.4.3 Implementation
6.4.4 Performance of the Batch Method for the Example in Section 6.4.1
6.4.5 Evaluation
6.4.6 Discussion
6.5 Controlling Bandwidth Distribution to Reduce Execution Times
6.5.1 Problem Definition
6.5.2 Formula for the Optimal Bandwidth Distribution Factor
6.5.3 Evaluation of Bandwidth Distribution Method
6.5.4 Conclusions
6.6 Task Set Optimization
6.6.1 Data Transfers to a Set of Batch Messages
6.6.2 Message Set Transformation Algorithm
6.6.3 Message Combining Operations
6.6.4 Message Set Scheduling and Validity Test
6.7 Evaluation
6.7.1 Application 1: Wafer Inspection
6.7.2 Application 2: Stream Processing Framework
6.8 Conclusions
6.9 MATLAB Implementation
6.9.1 Batch Run-Time Scheduling Algorithm
6.9.2 Interconnect Model
6.9.3 Computation of Data Rates
6.9.4 Non-Preemptive EDF Schedulability Test

7 Extended Multiprocessor Model for Efficient Utilization of Parallel Resources
7.1 Introduction
7.2 New Class of Scheduling Problems
7.2.1 Machine Environment
7.2.2 Job Characteristics
7.2.3 Optimality Criteria
7.2.4 Summary
7.3 Scheduling with Turbo Boost
7.3.1 Turbo Boost Overview and Related Works
7.3.2 Model and Scheduling Problem Definition
7.3.3 Existing Scheduling Techniques for Identical Machines
7.3.4 Scheduling Challenges of the New Model
7.3.5 Frequency-Aware Scheduling Method (FAS)
7.4 Validity and Optimality
7.4.1 Validity
7.4.2 Optimality
7.5 Evaluation
7.5.1 Methodology
7.5.2 Accuracy of the Performance Model
7.5.3 Synthetic Benchmarks
7.5.4 Web-log Analysis
7.5.5 Simulation of Performance on Other CPUs
7.6 Extending Amdahl’s Law for Multicores with Turbo Boost
7.6.1 Introduction
7.6.2 Background
7.6.3 Extended Amdahl’s Law
7.6.4 Evaluation
7.7 Discussion and Future Work

8 Discussion and Conclusion

Hebrew Abstract

List of Figures

1.1 Wafer metrology tool
1.2 Main topics addressed in this dissertation

2.1 GPU architecture example: Maxwell GM204
2.2 GPU memory hierarchy
2.3 Architecture of a heterogeneous multi-GPU compute node

3.1 Data stream model
3.2 Processing model

4.1 Pipelined processing on a GPU
4.2 SM utilization is non-linear in the number of jobs
4.3 Effective utilization of jobs with different sizes
4.4 Data transfer rates in a CPU-GPU system with P45 chipset
4.5 CPU/GPU stream partitioning example using the Rectangle method
4.6 Influence of rate variance on kernel execution time
4.7 Performance as a function of the number of CUDA blocks
4.8 Evaluation of workload distribution methods for stream sets with exponential rate distribution with average 500ms
4.9 Comparison of workload distribution methods for stream sets with exponential rate distribution with average 50ms
4.10 Evaluation of workload distribution methods for normal distribution
4.11 Evaluation of workload distribution methods for low level of parallelism
4.12 Evaluation of workload distribution methods for stream sets with different numbers of streams
4.13 Determining a safety margin for kernel time estimation
4.14 GPU breakdown of stages for exponential and normal rate distributions
4.15 GTX285 optimal throughput for AES-CBC stream processing
4.16 Throughput for AES-CBC stream processing using a GTX285 and two CPU cores

5.1 Illustration of multi-GPU distribution methods
5.2 Mapping streams to SMs of a virtual GPU
5.3 Evaluation of Virtual GPU method
5.4 Comparison of Virtual GPU method to baseline methods on two GPUs
5.5 Performance scaling for multiple GPUs: maximum (linear)
5.6 Performance scaling for multiple GPUs: minimum (sub-linear)
5.7 Measured CPU-GPU data transfer rates

6.1 Bandwidth scaling problem in CPU-GPU systems
6.2 Example of job processing steps in extended model
6.3 Asynchronous processing pipelines for two data streams
6.4 Architecture of a Tyan FT72B7015 system server with four GPUs
6.5 Data flow diagram of the data transfer scheduling method
6.6 Communication topology used in example
6.7 Example: timeline of data processing for different resource sharing methods
6.8 Example: timeline for the batch method
6.9 Technique for representing limited bi-directional bandwidth in the interconnect topology model
6.10 Network topology graph of a Tyan FT72B7015 system with four GPUs
6.11 Measured CPU to GPU transfer latency
6.12 Topology graph showing effective bandwidth in GB/s
6.13 Bus bandwidth benchmark with multiple GPUs
6.14 Predicted vs. measured execution time for each command stream and for the entire batch
6.15 Example: reducing transfer time by optimizing the bandwidth distribution
6.16 Diagram of a split in the routes of two data transfers that share a bus
6.17 Example: influence of bandwidth distribution factor on transfer time
6.18 Auxiliary plot of transfer time expressions to determine the optimal bandwidth distribution factor
6.19 Synchronous combining example
6.20 Asynchronous combining example
6.21 Example: interconnect topology
6.22 Throughput increase due to more efficient communication scheduling

7.1 Turbo frequency for Core i7-820QM CPU
7.2 Interconnect of a multi-GPU system
7.3 Maximum core frequency and compute throughput of a Core i7-4790T quad-core CPU with Turbo Boost
7.4 McNaughton’s scheduling technique
7.5 McNaughton’s technique is not optimal for P x
7.6 Computation of schedule event times
7.7 Schedule construction
7.8 Compact schedule example
7.9 Compact schedule transformation
7.10 Work redistribution illustration
7.11 Execution time prediction error % (INT32 jobs)
7.12 Execution time for workloads that cannot be distributed equally
7.13 Evaluation of the new scheduler for equally-sized AES-CBC jobs
7.14 Evaluation of the new scheduler on a web-log analysis application
7.15 Evaluation by simulation of the new scheduler on a Core i7-820QM CPU
7.16 Breakdown of preferred work distribution approach by CPU series and number of cores
7.17 estimation accuracy for AES-HW configuration
7.18 Energy scaling estimation accuracy for AES-HW configuration

List of Tables

4.1 Performance model estimation error

5.1 Multi-GPU distribution methods
5.2 Maximum transfer rates and kernel throughput
5.3 Performance model estimation error

6.1 Performance comparison of the different methods

7.1 Classic machine type classification
7.2 New machine environment types
7.3 Classification of scheduling problem and selection of scheduling algorithm
7.4 Work redistribution effect on execution time
7.5 Frequency specification: 8-core Intel Xeon E5-2690
7.6 Frequency specification: 12-core Intel Xeon E5-2658 v3
7.7 Execution time [sec]
7.8 Package energy consumption [J]

List of Listings

2.1 GPU kernel for computing the cubes (x3) of floating-point numbers (CUDA C)
2.2 GPU management code for kernel example
4.1 Earliest Deadline First - First Fit (EDF-FF)
4.2 Rectangle method algorithm pseudo-code
6.1 API for batch operations
6.2 Transfer rate computation for concurrent data transfers
6.3 Message set transformation algorithm
6.4 Batch run-time scheduling (MATLAB)
6.5 Interconnect model (MATLAB)
6.6 Computation of data rates (MATLAB)
6.7 Non-preemptive EDF schedulability test (MATLAB)

Abstract

Processing massive streams of data is an important problem in modern computer systems, and in particular for applications that process big data. Many such applications, including data analytics, production inspection, and fraud detection, require that the response times be below given thresholds. Meeting these constraints makes the scheduling problem a very challenging task. Each data stream generates a sequence of data packets that need to be processed within a given time from its arrival. GPUs are very promising candidates for providing the required processing power. However, their use for processing real-time streams is complicated by various factors, such as obscure operation mechanisms, complex performance scaling, and contention for bandwidth on a shared interconnect.

The goal of this dissertation is to develop an approach to efficient processing of real-time data streams on heterogeneous computing platforms that consist of CPUs and GPUs. To achieve this goal, we develop and investigate a processing model, scheduler, and framework. We present several work distribution techniques that statically assign the streams to the CPU or GPU in a way that simultaneously satisfies their aggregate throughput requirements and the deadline constraint of each stream alone. These methods serve as the basis for four new methods for partitioning work in multi-GPU systems, each presenting a different compromise between the complexity of the algorithm and the achievable throughput. We design a generic real-time data stream processing framework that implements our methods, and use it to evaluate them with extensive empirical experiments using an AES-CBC encryption operator on thousands of streams. The experiments show that our scheduler yields up to 50% higher system throughput than alternative methods.

Another major challenge in using multiple GPUs for real-time processing is in scheduling recurrent real-time data transfers among CPUs and GPUs on a shared interconnect. We develop a new scheduler and execution engine for periodic real-time data transfers in a multi-GPU system. The scheduler is based on a new approach where the basic unit of work sent for execution is a batch of data-block transfer operations. In our experiments with two realistic applications, our execution method yielded up to 7.9x shorter execution times than alternative methods. The scheduler analyzes the data transfer requirements and produces a verifiable schedule that transfers the data in parallel, and achieves up to 74% higher system throughput than existing scheduling methods.

The CPU contributes to the processing capability of the system by providing easily managed and predictable compute power. However, a recent trend in CPU design is to include a frequency scaling mechanism such as Turbo Boost that changes the performance-scaling curve. We characterize the processing speeds with Turbo Boost, and present an offline task scheduler that minimizes the total execution time. We also extend Amdahl’s law for speedup and energy consumption to take into account the effects of Turbo Boost. Finally, we generalize the new resource model and define a new class of scheduling problems that enables more efficient use of parallel resources by accurately characterizing their performance, thereby laying the foundation for further research.

Abbreviations and Notations

ALU — Arithmetic Logic Unit
API — Application Program Interface
ASIC — Application-Specific Integrated Circuit
CBC — Cipher Block Chaining
CPU — Central Processing Unit
CU — Compute Unit
DBF — Demand Bound Function
DDR — Double Data Rate
DIMM — Dual In-line Memory Module
DMA — Direct Memory Access
DRAM — Dynamic Random-Access Memory
EDF — Earliest Deadline First
FIFO — First In, First Out
FLOPS — Floating Point Operations per Second
FP — Floating Point
FPGA — Field-Programmable Gate Array
Gbit — Gigabit (10^9 bits)
GDDR — Graphics Double Data Rate
GOp — Giga Operations (10^9 operations)
GPGPU — General-Purpose computation on Graphics Processing Units
GPU — Graphics Processing Unit
GUI — Graphical User Interface
HPC — High-Performance Computing
I/O — Input / Output
ISA — Instruction Set Architecture
IP — Internet Protocol
LP — Linear Programming
Mbit — Megabit (10^6 bits)
MIC — Many Integrated Core
ms — millisecond or milliseconds
NIC — Network Interface Card
Op — Operation(s)
OS — Operating System
PCI — Peripheral Component Interconnect
PCIe — PCI Express
PTX — Parallel Thread Execution
QoS — Quality of Service
QPI — QuickPath Interconnect
RAM — Random Access Memory
RDMA — Remote Direct Memory Access
RM — Rate Monotonic
RTSPS — Real-time Data Stream Processing System
SDRAM — Synchronous Dynamic Random-Access Memory
sec — second or seconds (time unit)
SIMD — Single-Instruction Multiple-Data
SIMT — Single-Instruction Multiple-Threads
SLA — Service-Level Agreement
SM — Streaming Multiprocessor
SoC — System on a Chip
SSL — Secure Socket Layer
WCET — Worst-Case Execution Time


Chapter 1

Introduction

Modern data centers experience a continuously increasing demand for real-time processing of data streams that originate from “big data” sources such as social networks, web traffic, video streams, financial markets, and sensor arrays. Processing the data streams fast and in large volumes produces new high-value insights that enable systems and organizations to quickly react to changing conditions. Real-time data stream processing systems (RTSPSs) apply processing operations on such streams on-the-fly under strict processing-latency constraints posed by the application. RTSPSs can solve problems of fraud detection, algorithmic trading, production inspection, video and sensor monitoring, secure voice-over-IP (VoIP), risk management, and many others. Due to the growing demand for real-time stream processing, many of the applications require a high-performance computing platform designed to provide high throughput at sustainable power consumption. Recently, heterogeneous multi-processor compute nodes have become prevalent in such platforms, due to their ability to effectively handle various types of workloads. Especially notable is the widespread adoption of GPUs, which effectively handle the highly parallel parts of the computation, and are thus prime candidates for processing multiple data streams. Despite their performance advantage, efficiently utilizing heterogeneous platforms that include such compute accelerators under strict latency constraints, as required in RTSPSs, is challenging due to their diverse and asymmetric architecture, their high latency, and the complexity of the accelerators.

  The goal of this dissertation is to develop an approach to efficient processing of real-time data streams on a computing platform that consists of CPUs and GPUs.  

1.1 Motivating Example: Wafer Production Inspection

Semiconductor devices are made on round silicon wafers. The principal process in wafer fabrication is the creation of layers of patterns that implement circuits. The creation of a layer is performed in a series of steps that include depositing material onto the surface of the wafer, lithography (creating patterns using light), and etching (removal of redundant material). The patterns are created at very high resolution and the production process must be very accurate.

Figure 1.1: Wafer metrology tool

Due to the serial nature of wafer processing, the wafers are measured and inspected for damage in between production steps in order to avoid redundant processing. The wafer inspection process (metrology) is limited in time, as it must keep up with the production line. The metrology process includes measurement of the critical dimensions of the pattern (length of a transistor gate), searching for fabrication defects and their classification, and thin film measurement. Figure 1.1 illustrates a wafer metrology tool. The inspected area is accurately positioned below an array of sensors such as high-resolution line cameras and x-ray sensors, which collect measurement data at rates of multiple giga-pixels per second. The measurement data is streamed into a nearby computing platform for processing by way of image processing, model reconstruction, or methods that determine whether the wafer has passed the test.

From a computing perspective, the workload can be characterized as multiple data streams with high processing demands and strict processing latency constraints. Traditionally, such computing platforms were based on customized electronic boards or CPU-based rack servers. Current compute technologies can hardly handle the data generated in current process technology and with a wafer size of 300 mm. For future process generations and wafer size of 450 mm [Jon15], new data transfer and processing power technologies may be needed to meet these requirements.

1.2 Heterogeneous Systems

In modern computer systems, performance is achieved through parallelism and heterogeneity. Many data centers consist of a set of heterogeneous machines connected by a high-speed network. Each machine contains an interconnected set of processing units, memory, and I/O devices. The processing units can be a combination of general purpose processors, many-core stream processors (GPUs), digital signal processors (DSPs), application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), x86-based coprocessors, and other specialized accelerators. Each processing unit type has distinct architectural features that make it the better choice for certain types of workloads when considering, for example, computational throughput, latency, power efficiency, and cost. Making effective use of the diverse suite of processing units is a challenging task that involves work partitioning and mapping, scheduling, and synchronization, among others [KPSW93]. As a case study for a high-performance heterogeneous system, this work examines multi-GPU workstations that combine CPUs and discrete GPUs.

GPUs have gained popularity as efficient accelerators in a number of data-intensive application categories, including high-performance computing (HPC), image processing, and machine learning. Starting from fixed-function hardware designed for 2D and 3D graphics, GPUs evolved into fully-programmable many-core processors capable of general-purpose computations. Architecturally, GPUs are different from CPUs; besides having many more execution units than a CPU, each multiprocessor in a GPU interleaves tens of data-parallel threads to hide the latency of long-running operations such as accesses to off-chip memory. GPUs are natural candidates for speeding up processing of multiple data streams in parallel. However, their unique architectural characteristics hinder their straightforward application in stream processing systems and introduce further complications with the addition of per-stream latency bounds. Efficient use of the GPU’s computational resources for processing multiple data streams is complicated by the following factors:

• Low single thread performance – GPUs are optimized for throughput at the cost of single-thread performance. Thus, guaranteeing restricted processing times of individual streams is challenging.

• Load balancing issues – to reduce the amount of control logic, GPUs group execution units in single-instruction multiple-data (SIMD) arrays. Each such array executes threads in groups called warps (in CUDA terminology), one at a time, such that all the threads in a warp execute the same serial function on different data. When some thread in a warp completes execution earlier than others, its subsequent cycles on the execution unit are wasted, until the last thread in the warp completes execution. The streams can produce data items at different rates that result in computational tasks with proportionally different sizes. Thus, it is hard to assign the tasks to threads in a way that balances the load between the execution units.

• Non-linear throughput scaling – for better hardware utilization, GPUs require many concurrently active warps. Thus, effective throughput depends non-linearly on the number of concurrently processed streams, making run-time prediction difficult.

• Limited kernel concurrency – current NVIDIA GPUs limit the number of concurrent kernel executions. Therefore, data streams need to be batched and processed synchronously. However, data streams are asynchronous by nature in both their data arrival times and deadlines. Thus, processing all the streams within their individual time limits is challenging. Moreover, since GPU thread scheduling is managed in hardware, it is hard to predict the execution time of concurrent kernel calls.

These factors need to be addressed to make efficient use of the GPU’s computational resources for processing multiple data streams.

In a CPU-GPU compute node, the system interconnect consists of linked domains where the components use the same communication technology, such as PCI Express, QuickPath Interconnect (QPI), HyperTransport, memory buses, and NVLink [Fol14]. In modern architectures, including in multi-GPU systems, the compute power is growing much faster than the communication bandwidth, which generates increasingly higher pressure on the interconnect. Thus, making effective use of the interconnect is crucial for performance. However, its diverse architecture makes it difficult to attain predictable data transfer times when multiple data transfers are in progress on intersecting paths, likewise making it difficult to use the available bandwidth efficiently. This work also addresses the problem of effective use of the interconnect for data transfers with predictable transfer times.

1.3 Real-Time Systems

The logical correctness of real-time systems is verified on a model of the system comprised of jobs and processors. A job denotes a processing step that can execute on a single processor for a finite amount of time, and has a deadline, which is the time at which its execution must be finished. In real-time computing, heterogeneous multiprocessor computing platforms are modeled as unrelated parallel machines [CPW99, Bur91, SAÅ+04]. According to this model, the platform consists of processors that execute jobs independently and may have different execution times for the same job. However, GPU-based RTSPSs are incompatible with previous models in a number of ways:

• Batch processing – GPUs are designed to work in batch mode, i.e., execute groups of jobs, while in the classic model processors execute one job at a time.

• Variable job execution time – The execution time of a job on a processor is fixed in the classic model [Fun04, Rar14]. As we show later in this work, this model incorrectly represents the performance of CPUs and GPUs on workloads that are relevant to stream processing.

• Heterogeneous interconnect – The interconnect of a multi-GPU compute node spans several communication domains that use different technologies. This heterogeneity complicates the computation of the expected latency for data transfer operations and fundamentally changes the communication scheduling problem studied in previous works.

In hard real-time systems, a rigorous analysis of the timing of all the critical system components is required. Currently, commercially available programmable GPUs do not enable such analysis and are only suitable for non-critical real-time systems. The timing analysis in this work is done under the assumption that the system works reliably and with consistent performance.

The application of methods developed for the unrelated parallel machines model in an RTSPS results in low throughput and may lead to violation of temporal correctness, due to inefficient utilization of the computation and communication resources as well as imprecise performance modeling. This work aims to develop new methods that guarantee temporal correctness for all the admitted streams and maximize the system throughput under this constraint.

1.4 Stream Processing

Stream processing is a difficult problem from the system design perspective. This is because the data processing rate should not fall behind the aggregate throughput of arriving data streams, otherwise leading to buffer explosion or packet loss. Stateful processing, which is the focus of this work, also requires the previous processing results to be available for computations on newly arrived data. Even more challenging is the problem of real-time stream processing. In such applications, each stream specifies its processing latency bound, which restricts the maximum time arrived data may stay in the system. This requirement fundamentally changes the system design space, rendering throughput-optimized stream processing techniques inappropriate for several reasons. First, tight deadlines may prevent computations from being distributed across multiple computers because of unpredictable network delay. Furthermore, schedulability criteria must be devised in order to predetermine whether a given set of streams can be processed without violating their deadline requirements or exceeding the aggregate system throughput. Runtime predictions must thus take into account every aspect of the processing pipeline. A precise and detailed performance model is therefore crucial. The runtime prediction problem, hard in the general case, is even more challenging here: to allow deadline-compliant processing, the predictions must be conservative, in conflict with the goal of higher aggregate throughput. Many of the challenges of stream processing are common to various real-time data stream applications; hence this work considers a generic workload model where each stream represents a periodic source of data packets that need to be processed within a given latency bound.
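As a generic illustration of the shape such an admission (schedulability) criterion takes, the sketch below checks a per-packet latency bound and a utilization-style capacity bound. The StreamReq fields and the bound itself are illustrative assumptions only; the concrete tests used in this work are developed in Chapters 4–6.

#include <vector>

// Hypothetical per-stream parameters: packet period, conservative (worst-case)
// processing cost per packet on the target processor, and the latency bound.
struct StreamReq { double period; double wcet; double latency_bound; };

// A simple, conservative admission test in the spirit of classic real-time analysis:
// (1) every packet must be able to finish within its latency bound, and
// (2) the aggregate demand must not exceed the processor's capacity.
bool admit(const std::vector<StreamReq>& streams)
{
    double utilization = 0.0;
    for (const StreamReq& s : streams) {
        if (s.wcet > s.latency_bound)      // a single packet cannot meet its deadline
            return false;
        utilization += s.wcet / s.period;  // long-run fraction of the processor demanded
    }
    return utilization <= 1.0;             // conservative capacity bound (illustrative)
}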

1.5 Contribution

This dissertation presents new methods that enable efficient utilization of high-performance heterogeneous platforms in RTSPSs, and the main topics it addresses are summarized in Figure 1.2.

1.5.1 Hard Real-Time Data Stream Processing Framework

We design a framework for hard real-time processing of multiple streams on heterogeneous platforms with multiple CPUs and GPU accelerators. This was the first use of accelerators in hard real-time computations [MV15].

Figure 1.2: Main topics addressed in this dissertation (resource management and quality-of-service; architecture; applications)

hard real-time computations [MV15]. The core of the framework is an algorithm that finds an assignment of input streams to processors that satisfies the latency-bound requirements. The algorithm partitions the streams into subsets, one for each GPU and another for all the available CPUs. Each subset is validated to be schedulable using an accurate performance model of the system, taking into account all the steps of the computation and the inter-processor communica- tion. We implemented the stream processing framework in C/C++ and MATLAB and tested it on a multiple-stream AES-CBC encryption service that ran on a computing platform comprised of Intel CPUs and NVIDIA GPUs.

1.5.2 CPU/GPU Work Distribution

Non-linear throughput scaling in accelerators makes the stream partitioning problem computationally hard in general, even in a system with a single GPU, as it requires every two subsets of streams to be tested for schedulability. The exact solution requires exhaustive search in an exponential space of subsets and does not scale beyond a few input streams.

We develop a fast polynomial-time heuristic for scheduling thousands of streams on a system with a single GPU, and later extend it to multi-GPU systems. Each stream is represented using its deadline and rate properties as a point in the two-dimensional (rate-deadline) space. The heuristic finds a rectangle in this space, so that all the streams with their respective points in the rectangle are schedulable on the accelerator, and the remaining streams are schedulable on the CPUs. We show that this simple heuristic effectively handles many performance-critical challenges in CPU/GPU work distribution. With multiple GPUs, the distribution of streams can be optimized by using multiple rectangles; however, this further increases the search complexity and introduces the need for efficient scheduling of the communication on the interconnect. We develop several methods for stream partitioning in multiple-GPU systems that provide different tradeoff points between search complexity and throughput.
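For intuition only, the following host-side C++ sketch outlines one way such a rectangle search could be organized. The Stream fields, the choice of which corner of the rate-deadline plane the rectangle covers, and the two schedulability tests are hypothetical placeholders for the models of Chapter 4, not the thesis's implementation; the actual pseudo-code appears in Section 4.7.

#include <vector>

// Hypothetical stream descriptor: each stream is a point in (rate, deadline) space.
struct Stream { double rate; double deadline; };

// Placeholder schedulability tests; Chapter 4 develops the real performance models.
static bool is_gpu_schedulable(const std::vector<Stream>& s) { return s.size() <= 1000; } // dummy stand-in
static bool is_cpu_schedulable(const std::vector<Stream>& s) { return s.size() <= 100;  } // dummy stand-in

// Sketch of a rectangle search: take candidate rate/deadline thresholds from the
// streams themselves, assign the streams whose points fall inside the rectangle to
// the GPU and the rest to the CPUs, and keep the first partition that both sides accept.
bool find_rectangle(const std::vector<Stream>& streams,
                    std::vector<Stream>& gpu_set, std::vector<Stream>& cpu_set)
{
    for (const Stream& r : streams) {        // candidate rate threshold
        for (const Stream& d : streams) {    // candidate deadline threshold
            gpu_set.clear(); cpu_set.clear();
            for (const Stream& s : streams) {
                // which corner of the plane the rectangle covers is a modeling choice
                bool inside = (s.rate <= r.rate) && (s.deadline >= d.deadline);
                (inside ? gpu_set : cpu_set).push_back(s);
            }
            if (is_gpu_schedulable(gpu_set) && is_cpu_schedulable(cpu_set))
                return true;                  // found a feasible rectangle
        }
    }
    return false;                             // no feasible partition under this heuristic
}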

1.5.3 Communication Scheduling on Heterogeneous Interconnects

Each stream periodically produces data-packets that require processing from the system. The processing includes data transfer operations, such as its transfer to GPU memory, and computational operations that are executed on the CPUs and GPUs. Scheduling all these operations together to make efficient use of all the different resources is a very difficult task. We show that the problem can be alleviated using a pipeline-based approach that splits the problem into two simpler problems – computation scheduling and communication scheduling. Such splitting becomes possible because the two types of operations use different system components.

Efficient use of the interconnect is essential for a high-throughput system. However, the topology of the interconnect is such that transferring data on multiple paths at the same time, which is necessary to achieve high bandwidth, makes it difficult to predict the execution time of each individual data transfer operation, required for temporal correctness. To attain both efficiency and guaranteed temporal correctness, we defined a new concept that we call batch periodic data-transfer tasks. Each invocation of the task combines the data transfer needs of multiple data streams. Batch tasks can achieve high bandwidth by transferring data on different paths at the same time, while having predictable execution times.

We develop a two-level scheduler for the communication part of stream processing. The top level binds the communication requirements of data streams into batch tasks and schedules them using a version of the earliest-deadline-first (EDF) algorithm. The bottom level schedules the data transfers in the batch, with the aim of optimizing the total execution time.

We build an accurate performance model of the system interconnect in a multi-GPU system. For each bus, our model computes the distribution of bandwidth between the active data transfer operations, and it is able to reliably predict the completion time of each data transfer operation when multiple such operations are in progress. The model is configured with a topology graph of the interconnect that includes the bandwidth on each bus. This graph can be generated using our set of targeted data transfer benchmarks that measure the effective bandwidth in various combinations of simultaneous data transfers.
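The following self-contained C++ sketch illustrates the kind of completion-time prediction involved, using a simple equal-share bandwidth model in which every active transfer gets an equal fraction of each bus on its route and is limited by its most contended bus. The Transfer structure, the equal-share rule, and the numbers in main are illustrative assumptions; the thesis's model is calibrated from targeted bandwidth benchmarks (Chapter 6).

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <limits>
#include <vector>

// Hypothetical data transfer: bytes left to move and the buses its (non-empty) route crosses.
struct Transfer { double remaining; std::vector<int> route; double finish; };

// Predict completion times under an equal-share model: each transfer active on a bus
// receives an equal fraction of that bus, and its rate is set by its most contended bus.
void predict_finish_times(std::vector<Transfer>& ts, const std::vector<double>& bus_bw)
{
    double now = 0.0;
    std::vector<bool> done(ts.size(), false);
    std::size_t left = ts.size();
    while (left > 0) {
        std::vector<int> load(bus_bw.size(), 0);            // active transfers per bus
        for (std::size_t i = 0; i < ts.size(); ++i)
            if (!done[i]) for (int b : ts[i].route) ++load[b];
        std::vector<double> rate(ts.size(), 0.0);
        double dt = std::numeric_limits<double>::infinity();
        for (std::size_t i = 0; i < ts.size(); ++i) {
            if (done[i]) continue;
            double r = std::numeric_limits<double>::infinity();
            for (int b : ts[i].route)                        // bottleneck bus on the route
                r = std::min(r, bus_bw[b] / load[b]);
            rate[i] = r;
            dt = std::min(dt, ts[i].remaining / r);          // time to the next completion
        }
        now += dt;
        for (std::size_t i = 0; i < ts.size(); ++i) {
            if (done[i]) continue;
            ts[i].remaining -= rate[i] * dt;
            if (ts[i].remaining <= 1e-9) { done[i] = true; ts[i].finish = now; --left; }
        }
    }
}

int main()
{
    std::vector<double> bus_bw = { 8.0, 8.0 };               // two buses, 8 GB/s each (made-up numbers)
    std::vector<Transfer> ts = { { 16.0, { 0 },    0.0 },    // 16 GB over bus 0
                                 { 16.0, { 0, 1 }, 0.0 } };  // 16 GB over buses 0 and 1
    predict_finish_times(ts, bus_bw);
    for (const Transfer& t : ts) std::printf("predicted finish: %.2f s\n", t.finish);
    return 0;
}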

1.5.4 Generalization to Scheduling Parallel Resources

In scheduling, parallel resources such as multi-core CPUs are modeled as a set of processors, where each processor executes jobs one at a time independently of the other processors. In this model, assuming equal-speed processors, the throughput is linear in the number of active processors. Our performance analysis of GPUs and the interconnect shows that, in these resources, the throughput scales sub-linearly with the number of active execution units. We then show that this observation is also true for multi-core CPUs. Most modern CPUs include an integrated frequency scaling mechanism such as Turbo Boost. Turbo Boost opportunistically increases the cores’ operating frequency (speed) when the power, current, and temperature of the chip are below certain limits. Effectively, higher frequencies are achieved when fewer cores are active, so the speed of each core depends on the number of active cores; when more cores are active, each core works slower. A failure to account for dependencies between the processing units in the resource model used for scheduling negatively affects the utilization of the resource, even with an optimal scheduling algorithm. We define a new resource model for scheduling that characterizes the speed of each processor as a function of the state of the other processors (such as the number of active processors). To the best of our knowledge, our work was the first to study the problem of scheduling multiprocessors where the speed of any processor depends on the other processors. Our contributions in this area are as follows:

• We propose an extension of Graham’s α|β|γ classification of scheduling problems [GLLRK79] to a new class of problems, where processor speeds depend on the execution state of the multiprocessor. These new problems model many hardware resources more accurately and enable higher utilization.

• We characterize the computational throughput of CPUs with Turbo Boost technology and define a minimum-makespan scheduling problem that can be applied to reduce execution time in a number of applications. For this problem, we develop a new efficient scheduling algorithm, prove its optimality for three main sub-classes of the problem, and show that it achieves shorter makespan than existing methods.

• We show that different CPU models need to be scheduled differently.

• We extend Amdahl’s law for multicores with Turbo Boost. Amdahl’s law and its extension for energy are commonly used to obtain an initial evaluation of the expected maximum speedup and energy improvement that result from parallelizing sequential codes. Our refined formulas provide a tight upper bound on the expected parallelization benefits, closing a gap of up to 16% between the actual and predicted values (an illustrative sketch follows below).
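To illustrate the kind of refinement involved (a hedged sketch only, not the thesis's exact formulation, which is developed in Chapter 7), let f(n) denote a hypothetical core frequency when n cores are active. Normalizing to the single-core Turbo frequency f(1), the speedup bound for a program with sequential fraction s becomes

\[
S(n) = \frac{1}{\,s + \dfrac{1-s}{n}\cdot\dfrac{f(1)}{f(n)}\,}.
\]

With a flat frequency curve, f(n) = f(1), this reduces to the classic Amdahl bound; with Turbo Boost, f(n) < f(1) for larger n, which lowers the attainable speedup and accounts for much of the gap mentioned above.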

1.6 Organization

The remainder of this dissertation is organized as follows. Chapter 2 provides background on system architecture and scheduling, and reviews relevant prior work. Chapter 3 describes the system model that is considered in this work. Chapters 4 and 5 present our stream processing framework and its underlying methods for CPU/GPU work distribution in single- and multi-GPU systems. Chapter 6 discusses the challenges in scheduling data transfers on the interconnect for real-time data streams, and presents our two-level scheduler and execution engine for hard real-time data transfers. Chapter 7 lays the foundation for a new class of scheduling problems where processors influence each other’s performance, using the problem of makespan minimization of compute tasks on multi-core CPUs as a case study. Finally, Chapter 8 summarizes our results and discusses future work.

Chapter 2

Background and Related Work

In this chapter, we discuss background material and prior work on topics related to this dissertation. We briefly review the required hardware architecture background for CPU/GPU systems, discuss relevant real-time scheduling problems and algorithms, and survey challenges and models in data stream processing.

2.1 Heterogeneous Systems

Multiprocessor systems can be classified as homogeneous or heterogeneous. A homogeneous system has only one type of compute unit (CU), while a heterogeneous system includes more than one type. In a heterogeneous system, each CU type has unique features and strengths, and their combination provides advantages in performance and power for various applications.

Heterogeneous multiprocessor systems can be categorized using two parameters: instruction set architectures (ISAs) and locality of their CUs (integrated or discrete). An ISA defines the instructions that a processor can execute. Thus, in a system where all the CUs have the same ISA, a program can be executed on any CU; the difference between the CUs in such a system is in their microarchitecture. In contrast, in systems with CUs that have different ISAs, code that was compiled for one CU may need to be recompiled or even redesigned to execute on another CU.

The CUs in a heterogeneous system are called integrated if they are located on the same die; otherwise, they are called discrete. Integrated CUs commonly share resources and have benefits such as power efficiency, low data transfer overhead, effective resource management, and low response times. Such systems are found in many system-on-a-chip (SoC) processors, such as ARM’s big.LITTLE (same ISA) [ARM], AMD’s APU (different ISAs) [AMDa], NVIDIA’s Tegra K1 (different ISAs) [NVI14b], and Intel’s Broadwell (different ISAs) [Min15]. In contrast, discrete CUs are located on different dies, commonly use separate memories, communicate via an interconnect, and provide larger thermal headroom and higher performance. Such CUs can be found in x86 CPU + Xeon Phi (same ISA) [Int12] and CPU + discrete GPU (different ISAs) systems.

This thesis considers discrete heterogeneous systems that combine latency-oriented and throughput-oriented CUs. We focus on systems with multi-core CPUs and discrete GPUs; however, many of the methods developed here remain applicable for systems with other throughput-oriented CUs, such as FPGA boards [Bit] and x86-based coprocessors [Int12].

2.1.1 Multicore CPUs

Microprocessor design shifted from single-core to multicore after the method of increasing the operating frequency was no longer effective. Further increase in frequency caused unsustainable increase in power density and heat generation, and did less to increase performance due to limited Instruction-Level Parallelism (ILP) and the increasing gap between processor and memory speeds. The efforts to improve performance in multicore CPUs proceed in several directions, including the following:

• Increasing the number of cores (currently, up to 18 cores in Xeon series CPUs [Int15]).

• Adding auxiliary processing units on die, such as Advanced Vector Extension (AVX) [Cor15], an integrated GPU [Min15], and AES encryption accelerators [XD10].

• Improving the power performance by optimization of energy consumption at the microarchitecture level and use of advanced mechanisms, such as SpeedStep and Turbo Boost in Intel’s Sandy Bridge [RNA+12]. Turbo Boost utilizes the unused power budget of idle cores to increase the frequency of the active cores.

CPUs currently use DDR SDRAM memory that is spread across multiple DIMMs for higher bandwidth. Intel has recently announced a new non-volatile stacked memory technology called 3D XPoint, which might replace the technology used today.

2.1.2 Graphics Processing Units (GPUs)

Modern GPUs are fully programmable processors that offer high performance and energy efficiency. A GPU is composed of one or more parallel processing units called Streaming Multiprocessors (SMs). An SM is designed to execute up to thousands of user-defined threads concurrently and is based on the Single-Instruction Multiple-Threads architecture, where instructions are executed for a number of threads at a time. The threads are organized in fixed-size groups that are called warps, and the number of threads in a warp is defined by the GPU architecture; for example, the warp size in current NVIDIA GPUs is 32 [LNOM08] while in some AMD GPUs it is 64 [MH11]. The SM jointly creates, schedules, and executes the threads in each warp. A warp executes one instruction at a time across all of its threads for which this instruction is the next to be executed. All the threads within a warp start from the same program address, but their execution paths can diverge and converge back; for example, due to taking different branches of an if-else statement. In case of such divergence, the different paths are executed one after the other by their corresponding threads. Each SM manages a pool of warps and interleaves their execution to avoid occupying the execution elements for multiple cycles when a warp accesses off-chip memory and during other long operations.

However, if there are insufficient warps ready for execution, the utilization of the execution elements may be affected.

We describe the high-level architecture of a GPU using the example of NVIDIA’s GM204 (Maxwell) hardware architecture illustrated in Figure 2.1. The external package of the GPU processor contains an array of SMs, a work distribution engine that assigns thread blocks to SMs, four memory controllers, and a PCIe interface. Each SM is composed of four warp execution units that share a fast scratchpad memory that is called “shared memory,” a combined cache for texture and normal (by address) memory accesses, and an instruction cache. Each warp execution unit contains a warp scheduler capable of dispatching two instructions per warp every clock cycle, an instruction buffer, and several vector pipelines with multiple lanes for executing different instructions. These include 32-bit integer (INT32) and floating-point (FP32) arithmetic operations, memory load/store (LD/ST), complex FP32 functions such as logarithm and exponent (SFU), and 64-bit floating-point operations (FP64).

The GPU memory is organized hierarchically, as illustrated in Figure 2.2. It includes an off-chip DRAM memory located outside the GPU, an L2 cache located in the GPU package, and a fast scratchpad memory, an L1 cache, and a register file located inside each SM. Integrated GPUs share the off-chip DRAM memory with the CPU, while discrete GPUs have separate memories that are located on the GPU board alongside the GPU; this memory is commonly referred to as the GPU memory. High-end GPUs contain high-bandwidth GDDR off-chip memories that can stream data to the SMs with bandwidth of more than 300 GB/s [NVI15b], and emerging stacked memory technology is expected to increase it several times [Fol14, Joo15].

GPUs are used not only for graphics but also for general-purpose computations. In this dissertation, we consider the use of GPUs for data-stream processing computations that are not necessarily graphics-related; hence, we limit the discussion to GPUs capable of general-purpose computations. General-purpose computations on Graphics Processing Units (GPGPU) are supported by various languages, programming models, and runtime environments, including CUDA [NVI15a], OpenCL [HM15], and OpenACC [Ope13]. A number of different notations are used for GPU-related terms; in this dissertation we use the CUDA notation.

To invoke a GPU, an application that runs on a CPU calls a GPU kernel. A CUDA C implementation of a simple kernel function that computes the cube (third power) of each number in an array is shown in Listing 2.1. The kernel generates multiple threads, each of which executes the kernel function. The threads are partitioned into equal-size work groups called thread blocks, and the collection of thread blocks is called a grid. Each thread block constitutes an independent work unit that is assigned to an SM by the GPU’s high-level work distribution engine. The threads within each thread block are created and evicted at the same time, and executed on the same SM. These threads can cooperate by sharing data through the shared memory and synchronize their execution.
The thread blocks and the threads within each thread block are indexed using a 1D, 2D, or 3D indexing scheme; for example, a kernel may be called with 1024 thread blocks of 256 threads each, such that the thread blocks are indexed in 2D as (bx, by) ∈ [0..31, 0..31] and the threads within each thread block are indexed as (tx, ty) ∈ [0..15, 0..15]. 2D and 3D indexing facilitates mapping of threads to data that is organized in multidimensional arrays.

Figure 2.1: NVIDIA Maxwell GM204 architecture [NVI14a] ((a) GPU package, (b) SM, (c) warp execution unit)

Figure 2.2: GPU memory hierarchy

Within each thread block, the threads are partitioned into warps by index according to a linear indexing scheme (row major); e.g., in a 16x16 thread block running on an NVIDIA GPU (warp size 32), the first two rows of threads (ty ∈ {0, 1}) constitute the first warp.
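To make the mapping concrete, the illustrative kernel below (written in the same CUDA C style as Listing 2.1, but not taken from the thesis) derives a thread's row-major linear index and the warp it therefore belongs to; for a 16x16 thread block, threads with ty ∈ {0, 1} obtain linear indices 0–31 and thus form the first warp. The output buffer layout is an assumption made for the example.

// Illustrative kernel: derive each thread's warp from its 2D thread and block indices.
__global__ void warp_mapping_demo(int* warp_of_thread)
{
    // row-major linear index of this thread within its 2D thread block
    int linear_tid = threadIdx.y * blockDim.x + threadIdx.x;
    // threads with consecutive linear indices are grouped into the same warp
    int warp_id = linear_tid / warpSize;   // warpSize is 32 on current NVIDIA GPUs
    int lane_id = linear_tid % warpSize;   // position of the thread within its warp
    // row-major linear index of the thread block within a 2D grid
    int linear_bid = blockIdx.y * gridDim.x + blockIdx.x;
    // record the warp id so the host can inspect the mapping
    warp_of_thread[linear_bid * blockDim.x * blockDim.y + linear_tid] = warp_id;
    (void)lane_id;                         // lane_id is shown only to complete the picture
}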

The CUDA programming model refers to the GPU as a coprocessor of the main program that runs on the CPU. The CPU program manages the GPU memory, initiates data transfers between the main memory and the GPU memory, and triggers kernel execution. In CUDA terminology, the CPU-side subsystem is called the host and the GPU-side subsystem is called the device. Consequently, the main program, which runs on the CPU and manages the GPU, is called the host program. The main memory is called the host memory and the GPU on-board memory that is accessible to the CPU via PCI Express is called device memory.

A possible management code for the kernel example is shown in Listing 2.2. The host program allocates a buffer in GPU memory and copies data to it using a call to the CUDA runtime. It then makes a kernel call by specifying the name of the kernel function, execution parameters that specify the number of thread blocks to launch and the number of threads in each thread block, and the function parameters. The kernel call is an asynchronous operation and the host program continues immediately to the cudaMemcpy command, which blocks until the kernel has finished and copies the results back to the host. The allocated buffer is released after the operation.

A GPU device driver mediates between the host program and the device. To run a kernel or perform a data transfer, the program calls a function from the GPGPU runtime library, either directly or through a higher-level interface. The runtime runs in a protected environment (user space in Linux) that does not allow it to operate the device directly. Therefore, it passes the request to the device driver, which operates with elevated privileges (kernel space in Linux), via a mechanism provided by the OS. The driver arbitrates the access to the device, usually by means of queues and ring buffers, and communicates with the device using PCI Express read and write requests and by receiving CPU interrupts from the device. The device driver’s implementation is closely related to the GPU hardware and it is usually provided by the GPU manufacturer.

Kernel calls and data transfer requests from the host program can be passed via asynchronous (non-blocking) function calls. After executing an asynchronous operation, the program can query whether the operation completed through the runtime API. Alternatively, it can provide a callback function that will be called when the operation completes. The host and each device implement the release consistency memory model, whereby the data transferred to and from a GPU during kernel execution may be inconsistent with that observed by the running kernel. This means that consistency is enforced only at the kernel boundaries.
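The fragment below sketches this asynchronous pattern for the kernel of Listing 2.1. It is an illustration of the CUDA runtime API rather than code from the thesis, and it assumes that d_data was allocated with cudaMalloc, that h_in and h_out are pinned host buffers (cudaMallocHost, required for truly asynchronous copies), and that size, n, num_tblocks, and BLOCK_SIZE are defined as in Listing 2.2.

// Asynchronous copies and kernel launch on a CUDA stream, with non-blocking completion polling.
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(d_data, h_in, size, cudaMemcpyHostToDevice, stream);  // enqueue input copy
gpu_pow3_vec<<<num_tblocks, BLOCK_SIZE, 0, stream>>>(d_data, n);      // enqueue kernel
cudaMemcpyAsync(h_out, d_data, size, cudaMemcpyDeviceToHost, stream); // enqueue output copy

// The host may now do other work and poll for completion without blocking ...
while (cudaStreamQuery(stream) == cudaErrorNotReady) {
    /* overlap independent CPU work here */
}
// ... or block until every operation queued on the stream has finished.
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);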

// GPU kernel: compute (x * x * x) of every item x in an array
__global__ // kernel function qualifier
void gpu_pow3_vec(float* array, int n)
{
    // compute linear index of thread using built-in variables
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
    {
        float val = array[index];
        array[index] = val * val * val;
    }
}

Listing 2.1: GPU kernel for computing the cubes (x3) of floating-point numbers (CUDA C)

// Compute the cube (third power) of every item in an array using a GPU
void pow3_vec(float* in, float* out, int n)
{
    float *g_data; ... // (allocate memory on GPU)

    // compute execution configuration parameters
    int size = n * sizeof(float);
    static const int BLOCK_SIZE = 256;
    int num_tblocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

    cudaMemcpy(g_data, in, size, cudaMemcpyHostToDevice); // copy input to GPU

    // call GPU kernel with execution parameters
    gpu_pow3_vec<<<num_tblocks, BLOCK_SIZE>>>(g_data, n);

    cudaMemcpy(out, g_data, size, cudaMemcpyDeviceToHost); // copy output to CPU

    ... // (release allocated GPU memory)
}

Listing 2.2: GPU management code for kernel example
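The asynchronous interface described above can be combined with Listing 2.1 as follows. The sketch below is our illustration rather than code from this thesis: it enqueues the copies and the kernel on a CUDA stream and then polls for completion; buffer names are arbitrary, and for full copy/compute overlap the host buffers would have to be allocated with cudaMallocHost (pinned memory).

#include <cuda_runtime.h>

__global__ void gpu_pow3_vec(float* array, int n);   // kernel from Listing 2.1

// Sketch: issue the input copy, the kernel, and the output copy asynchronously,
// then poll the stream; the CPU is free to do other work while polling.
void pow3_vec_async(float* host_in, float* host_out, int n)
{
    size_t size = n * sizeof(float);
    float* g_data;
    cudaMalloc((void**)&g_data, size);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(g_data, host_in, size, cudaMemcpyHostToDevice, stream);
    gpu_pow3_vec<<<(n + 255) / 256, 256, 0, stream>>>(g_data, n);
    cudaMemcpyAsync(host_out, g_data, size, cudaMemcpyDeviceToHost, stream);

    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
        // overlap other CPU work here, or register a callback instead
    }

    cudaStreamDestroy(stream);
    cudaFree(g_data);
}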

2.1.3 CPU-GPU Systems

Architecture

CPUs and GPUs are both fully programmable CUs, but they are optimized for different types of workloads: CPUs are designed as practical general-purpose processors that control the computer's hardware devices and handle various types of tasks effectively, while GPUs are designed as computation accelerators and optimized for highly-parallel tasks with low irregularity (e.g., matrix multiplication).

CPU-GPU systems are being developed in both integrated and discrete configurations. Integrated designs such as Tegra K1 [NVI14b], Intel Core i7-3720QM [She13], and AMD FX-7600P [AMDb] can achieve better power/performance ratio, but discrete designs such as Xeon E7-8870 (18-core) + Tesla K40 are less power constrained, achieve higher overall performance, and are hence preferred in high-performance systems. A common configuration for a discrete CPU-GPU system consists of a CPU, main memory, and a PCI Express (PCIe) expansion board with a GPU processor and off-chip memory. We find a discrete CPU-GPU platform more suitable for high-throughput RTSPSs because such systems have high processing demands. Therefore, the discussion in this thesis is limited to discrete GPUs.

Heterogeneous Computing

Traditionally, the use of GPUs in general purpose computing was limited to accelerating specific compute-heavy portions of the program. However, recent studies have shown that techniques that use the CPU and GPU in close collaboration can provide additional performance. Such techniques are called heterogeneous computing techniques (HCTs). A recent survey of CPU-GPU heterogeneous computing techniques by Mittal and Vetter [MV15] describes the following main reasons for CPU-GPU collaboration:

• Leveraging unique architectural strengths. CPUs and GPUs are better suited for different types of applications. CPUs are equipped with up to tens of advanced cores and large caches, while GPUs have a large number of in-order cores, smaller caches, and work at lower frequency. HCTs can provide high performance for a large variety of applications.

• Matching application requirements to CU features and limitations. The relevant factors include long CPU-GPU data transfer times, branch divergence, and short latency requirements, which can cripple GPU efficiency. Thus, the characteristics of each CU need to be considered in the CPU/GPU work distribution.

• Improving resource utilization. When using only the CPU or the GPU, the average resource utilization is low and energy is being wasted. Intelligent management of the computational and memory resources of both CUs can improve the overall performance of the system.

• Making use of advances in CPU design. With an increasing number of cores and wider vector processing units, CPUs provide significant performance benefits even for workloads that map well to GPUs. Applying optimizations on both CPUs and GPUs can provide considerable speedup for the application.

The methods developed in this work are designed in line with heterogeneous computing principles and address all four of the listed benefits.


The main challenges in CPU-GPU heterogeneous computing include the following [MV15]:

• CU-architecture related challenges such as work distribution and load balancing, data transfer management, executing operations in parallel, and taking into account limitations of CUs.

• Application-specific challenges such as providing sufficient parallelism, mapping algorithms to the programming model, scheduling, and handling data dependencies.

• Achieving optimization targets, such as performance, fairness, and energy savings.

This work deals with many of the mentioned architecture-related and application-specific challenges, and targets performance optimization in RTSPSs.

Mittal and Vetter survey the techniques for partitioning the workload in a CPU-GPU system, which include the following:

• Static or dynamic work mapping. In static work mapping, the assignment of the tasks to the CUs is fixed and done at design time, while in dynamic work mapping, it is done at runtime.

• Tailoring the work distribution to the capabilities of each CU by considering its relative performance; selecting the best CU to execute a task by considering its performance, capabilities, and the task requirements; dynamic work assignment based on load on each CU; and aggregation of data transfers for higher efficiency.

• Pipelining of operations to execute data transfers and computations on CPUs and GPUs in parallel.



This work implements static work mapping, distributes the workload in accordance with the capabilities of each CU and the requirements of each task, and uses pipelining to achieve parallel execution.
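For illustration of these principles, the following sketch (ours, not from the thesis) statically splits an array of independent items between the CPU and the GPU in proportion to their assumed throughputs; the throughput figures, the square operator, and all names are placeholders.

#include <cuda_runtime.h>

__global__ void gpu_process(float* data, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] = data[i] * data[i];        // example operator: square each item
}

static void cpu_process(float* data, int n)
{
    for (int i = 0; i < n; ++i) data[i] = data[i] * data[i];
}

// Static work mapping: split n items once, according to relative throughput.
void process_static_split(float* data, int n,
                          double cpu_items_per_sec, double gpu_items_per_sec)
{
    double gpu_share = gpu_items_per_sec / (cpu_items_per_sec + gpu_items_per_sec);
    int n_gpu = (int)(gpu_share * n);
    int n_cpu = n - n_gpu;

    float* g_buf = NULL;
    if (n_gpu > 0) {
        cudaMalloc((void**)&g_buf, n_gpu * sizeof(float));
        cudaMemcpy(g_buf, data, n_gpu * sizeof(float), cudaMemcpyHostToDevice);
        gpu_process<<<(n_gpu + 255) / 256, 256>>>(g_buf, n_gpu);   // asynchronous launch
    }

    cpu_process(data + n_gpu, n_cpu);    // CPU part overlaps with the GPU kernel

    if (n_gpu > 0) {
        cudaMemcpy(data, g_buf, n_gpu * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(g_buf);
    }
}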

Role of the GPU in the System

As mentioned previously, the common mode of operation in a CPU-GPU system is that the GPU is used as a coprocessor. In CUDA, the capabilities of a GPU kernel to access system resources are limited to reading and writing data from the host memory and the memories of supported peer PCI Express devices (primarily other GPUs) [NVI15a]. CUDA does not provide operating-system functions to the GPU, such as task scheduling, access to the file system, and operation of I/O devices. Providing such capabilities to the GPU would enable more effective execution of these operations, which currently have to be implemented in the host program.


Providing OS capabilities to GPUs is challenging. The interface for operating a GPU is unknown to the OS, because it is not disclosed by the vendor and varies between GPU models. Rossbach et al. [RCS+11] called for the computational resources of the GPUs to be managed by the operating system, alongside the CPUs. The authors developed PTask [RCS+11], a framework for execution of compute tasks on the GPU. PTask uses a dataflow model, where the programmer defines a graph of tasks and their data dependencies, and the framework manages the task execution and handles the scheduling, memory allocation, and data movement. PTask is implemented as part of the OS kernel on top of the runtime API, which provides GPU resource management capabilities to the OS. PTask aims to provide fairness and isolation for non-real-time compute tasks. Silberstein et al. [SFKW13] called to further increase the OS support for GPUs, and implemented GPUfs, an infrastructure that enables programmers to access the file system directly from a GPU kernel. The framework presents to the programmer a POSIX-like API with operations on files such as open, close, read, and write. The API requires that all the threads in a warp participate in executing each operation. This allows effective collaborative mechanisms for the bookkeeping and data movement tasks to be implemented. The framework uses a CPU-side kernel module that handles remote procedure calls (RPCs) from the GPU. GPUfs caches the data in both the CPU and GPU and implements a weak consistency model that allows a CPU and a GPU to work on a file at the same time. Kim et al. [KHH+14] introduced GPUnet, which provides a socket-based network API to GPU kernels. Using GPUnet, kernels can communicate with CPUs and GPUs in remote machines. The framework implements efficient data transfer mechanisms, and implements direct communication between the GPU and network interface card (NIC) via peer-to-peer DMA transactions.

2.1.4 Multi-GPU Systems

Placing multiple CPUs and GPUs in the same node is a common way to improve performance, increase the size of GPU memory, and save floor space, energy, and cost. Figure 2.3 illustrates an example of a high-level architecture of a general-purpose multi-GPU compute node. The node consists of a main board with four CPU sockets. Each socket connects a CPU to a main memory module via a memory bus, other CPUs via the QuickPath Interconnect (QPI), and a GPU via PCI Express (PCIe). Here, two of the CPUs are also connected to a high-bandwidth Network Interface Card (NIC) via PCIe. The communication infrastructure that connects all the system components is called the system interconnect. Each CPU accesses its local memory directly via a memory bus, while distant CPUs and I/O Hubs (IOH) access it indirectly via the CPU interconnect, which in this example is based on QPI. Data transfers from and to GPU memory are executed by DMA controllers (a.k.a. DMA engines) over PCIe buses. The number of GPUs in such systems can be increased using PCIe switches that act as port multipliers.

2.1.5 System Interconnect

The interconnect of a CPU-GPU system is composed of several communication domains that use different technologies.


Figure 2.3: Architecture of a heterogeneous multi-GPU compute node (four CPUs connected by QPI, each with local DDR3 memory and a PCIe 3.0 GPU; two of the CPUs also connect Infiniband NICs via PCIe 3.0)

CPU Interconnect

Processor interconnect technologies such as Intel's QuickPath Interconnect (QPI) and HyperTransport provide high-bandwidth connectivity among CPUs and between CPUs and I/O devices. In Intel Nehalem-based systems, the communication path between a CPU and a GPU passes through an I/O hub, which is connected to the CPU via QPI on one side and to the GPU via PCI Express on the other. In Sandy Bridge and later generations, PCI Express devices connect to a CPU directly. HyperTransport can also be used to connect the CPU directly to an external device, such as an FPGA core. Both technologies are based on point-to-point links and aggregate multiple lanes per link for higher bandwidth; for example, a HyperTransport 3.1 link provides up to 25.6 GB/s in each direction [Hyp15, Hyp10], and a QPI link in a Haswell server provides 19.2 GB/s [Int14a]. For increased bandwidth, it is possible to use several links in parallel.

PCI Express

PCI Express (PCIe) is a peer-to-peer communication architecture with a hierarchical tree structure that provides a high-bandwidth connection to I/O devices [PS10, BAS04]. At the top of a PCIe system is a single Root Complex, which connects the PCIe fabric to the CPU and memory subsystem through one or more PCIe ports. In CPU-GPU systems, the Root Complex is usually located either in the uncore part of a CPU or in an I/O hub that is a part of the chipset. Each port of the Root Complex can be connected to an endpoint (such as a GPU or NIC) via a PCIe bus or shared between several endpoints via a switch or a hierarchy of switches that redirect the incoming traffic to the appropriate buses. A PCIe bus is a full-duplex link for transmitting packets between two components. Currently, most discrete GPUs have a 16-lane (x16) PCIe 2.0 or 3.0 connection, with a theoretical data bandwidth of 8 GB/s and 15.75 GB/s, respectively. However, the maximum achievable throughput is lower.

PCIe endpoints can read and write CPU memory and communicate with peer endpoint devices. The communication protocol is based on packets. For example, an endpoint may transmit a memory read request to the Root Complex, which later returns a completion packet with the read data. The packets may pass through switches that forward them to the appropriate egress ports. PCIe provides Quality of Service (QoS) features that provide the capability to route packets from different applications with different priorities and even with deterministic latencies and bandwidth. These include assigning different traffic classes to packets in order to prioritize them during arbitration at the switches.

GPUs that have two DMA controllers support bi-directional data transfer, while GPUs that have a single DMA controller can transfer data only in one direction at a time. GPUs from the same PCI Express domain (tree) can exchange data directly, while most systems do not support direct data transfer between GPUs in different domains due to chipset limitations. A common solution for transferring data between GPUs that reside in different domains is to stage the data in CPU memory, using two DMA controllers. Normally, PCIe devices only exchange data with CPU memory by Remote Direct Memory Access (RDMA), but the GPUDirect RDMA (NVIDIA) [GPU10] and DirectGMA (AMD) [Mat11] technologies enable a direct path for data exchange over PCIe between a compatible GPU and a third-party device, such as a network card (e.g., Mellanox ConnectX-4 [Mel14]) or storage adapter, and provide higher bandwidth and shorter latency than the traditional method.

NVIDIA announced the development of a high-speed point-to-point proprietary communication technology called NVLink, to be integrated in their Pascal series GPUs. The technology is expected to provide 5-12 times higher bandwidth than PCIe 3.0 for GPU-GPU and CPU-GPU communication, currently only with some IBM Power CPUs [Fol14]. The topology of the links is configurable and links can be ganged for higher bandwidth between a pair of devices.
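For reference, the quoted per-direction figures follow from the lane rate and line coding of each PCIe generation (a standard back-of-the-envelope check, not taken from the thesis): PCIe 2.0 uses 8b/10b encoding and PCIe 3.0 uses 128b/130b encoding.

\[
\text{PCIe 2.0 x16: } 16 \times 5\,\mathrm{GT/s} \times \tfrac{8}{10} = 64\,\mathrm{Gbit/s} = 8\,\mathrm{GB/s},
\qquad
\text{PCIe 3.0 x16: } 16 \times 8\,\mathrm{GT/s} \times \tfrac{128}{130} \approx 126\,\mathrm{Gbit/s} \approx 15.75\,\mathrm{GB/s}.
\]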

2.2 Real-Time Scheduling

2.2.1 Introduction

A defining characteristic of a real-time system is operating under timing constraints. Such systems are designed by creating and analyzing a schedule that lists the times at which each operation takes place. Real-time scheduling theory has its origins in the late 1950s. At that time, the systems were scheduled by cyclic execution of a fixed schedule created for a set of immediately-available tasks [SAÅ+04]; for example, McNaughton [McN59], Schild and Fredman [SF62], and Root [Roo65] developed methods for scheduling independent single-instance tasks with deadlines on uniprocessor and multiprocessor systems. Later, in the 1970s and 1980s, there was a transition from cyclic execution to fixed-priority scheduling that built upon the seminal work of Liu and Layland [LL73] (1973). The paper considered the problem of scheduling a set of periodically-recurring tasks on a uniprocessor, and presented two fundamental scheduling algorithms: Rate Monotonic (RM) and Earliest Deadline First (EDF). Their work considered the execution of a set of independent implicit-deadline periodic tasks. The two algorithms are preemptive and priority-driven; that is, at every instant, they execute the task with the highest priority that is ready for execution, interrupting lower-priority tasks. RM is a fixed-priority (also called static-priority) scheduling algorithm; it assigns a priority to each task once and for all. EDF is a dynamic-priority algorithm; it assigns a priority to each invocation of a periodic task that is inverse to the proximity of its deadline. Liu and Layland [LL73] proved that RM is optimal among fixed-priority algorithms and EDF is globally optimal for the considered task model, in the sense that if a task set can be scheduled to meet all deadlines by any algorithm (for RM, any fixed-priority algorithm), then their algorithm (RM or EDF) will also produce a valid schedule for this task set. Later, Dertouzos [Der74] demonstrated that EDF is optimal for any periodic and non-periodic task set on a uniprocessor.

In a multiprocessor system, the scheduling algorithm needs to determine the assignment of tasks to processors at each time instant. The task assignment techniques follow the partitioned or global scheduling approaches [RM00]. In partitioned scheduling, each task is statically assigned to a single processor, and each processor is scheduled independently. In global scheduling, jobs can migrate from one processor to another. Dhall and Liu [DL78] showed that the minimum guaranteed utilization when using RM in a global scheduling scheme is low (equivalent to the processing capacity of a single processor) due to what is called the "Dhall effect", while in a partitioned scheme it is significantly higher. This result had a considerable impact on the multiprocessor scheduling research and biased it towards partitioned scheduling for several years [SAÅ+04], until Phillips et al. [PSTW97] showed that the "Dhall effect" is due more to the high utilization of tasks than to global scheduling. Optimal partitioning of tasks between processors (the allocation problem) is akin to the Bin Packing problem, which is known to be NP-complete [GJ90]. Therefore, heuristic methods for Bin Packing such as First-Fit, Best-Fit, Next-Fit, and Worst-Fit are commonly used to partition the tasks [ZA05]; for example, Worst-Fit iteratively assigns an unassigned task to the least occupied processor. Global scheduling techniques include Proportionate-fair (Pfair) [BCPV96] based algorithms and simpler algorithms based on EDF and RM. Pfair-based algorithms generate job preemptions and migration overhead and are complex to implement; however, their maximum utilization is higher. A survey of the common variations of EDF and RM for multiprocessors can be found in [ZA05].

Graham et al. [GLLRK79] (1979) published an important survey and classification of scheduling problems. The classification has since been expanded and it is used to organize and categorize scheduling problems to this day [CPW99, Bru07, BK15]. We describe this classification in more detail in Chapter 7, where we also discuss the need for a new computing platform model to match the characteristics of modern processors and propose an extension to the classification parameters.
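As a concrete illustration of the partitioned approach, the sketch below (ours, not from the thesis) applies the Worst-Fit heuristic mentioned above: each task is placed on the currently least-utilized processor, and the partitioning is rejected if any processor would exceed full utilization. The per-processor test u ≤ 1 is used here only as the simplest (EDF-style, implicit-deadline) feasibility check.

#include <vector>
#include <algorithm>

struct Task {
    double e;                                  // execution requirement
    double p;                                  // period
    double utilization() const { return e / p; }
};

// Worst-Fit partitioning of tasks onto m processors.
// Returns the per-processor assignment, or an empty vector if some processor
// would exceed utilization 1 under this heuristic.
std::vector<std::vector<Task>> worst_fit_partition(std::vector<Task> tasks, int m)
{
    std::vector<std::vector<Task>> assignment(m);
    std::vector<double> load(m, 0.0);

    // Sorting by decreasing utilization is a common refinement of the heuristic.
    std::sort(tasks.begin(), tasks.end(),
              [](const Task& a, const Task& b) { return a.utilization() > b.utilization(); });

    for (const Task& t : tasks) {
        int least = (int)(std::min_element(load.begin(), load.end()) - load.begin());
        if (load[least] + t.utilization() > 1.0)
            return {};                         // not schedulable by this heuristic
        load[least] += t.utilization();
        assignment[least].push_back(t);
    }
    return assignment;
}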
The conventional approach for scheduling real-time systems uses an abstract model where tasks that represent system processes are composed of work elements called jobs that are assigned to processors for periods of time and need to be completed before their associated deadlines. The model may include additional constraints on the jobs' execution, including job ordering dependencies, processor limitations, and others. Three main types of quality-of-service requirement models are considered in the literature: hard real-time, soft real-time, and mixed-criticality real-time. In hard real-time scheduling problems, the objective is to construct a schedule that strictly meets all the execution requirements: all the jobs must be executed without violating any of the timing or other constraints. An offline schedulability test checks whether the schedule generated by the scheduling algorithm for the given task set is valid (meets all the requirements). Since the system may be required to react to events whose exact times are not known until runtime, the schedulability test approves the system only if the scheduling algorithm is guaranteed to generate a valid schedule for every possible completion of the missing information. In soft real-time scheduling problems, it is acceptable for some jobs to miss deadlines and the objective is to satisfy some optimization criteria, such as a minimum number of deadline misses or minimum cumulative tardiness (deviation from deadline). In mixed-criticality scheduling problems, the execution requirements must be satisfied for subsets of the task set at different levels of assurance [BD15, BBD+12]; this is currently an active research topic.

2.2.2 Task Model

A task is a collection of jobs that jointly implement a system function. A real-time job J = (a, e, d) is characterized by the following parameters:

• a – release time (also called arrival time): the time instant at which the job becomes available for execution.

• e – execution requirement: a scalar measure of the amount of required processing resources, such as computation cycles, execution time on some processor, or communication bandwidth. By convention, in uniprocessor and homogeneous systems the execution requirement is normalized to the computing capacity of the processor, and in heterogeneous systems it is normalized to the capacity of the first processor; i.e., a work unit takes one unit of time to execute (on the first processor). In many hard real-time systems, the execution requirement denotes the worst-case execution time (WCET).

• d – deadline: the time instant by which the job’s execution needs to be completed. A situation where a job’s execution was not completed by its deadline is called a deadline miss. The deadline in this characterization is given as an absolute time point.

Jobs can be preemptive or non-preemptive. The execution of a preemptive job can be interrupted at any time and resumed later on the processor where it was executed or on another processor, without any execution overhead. In contrast, a non-preemptive job executes without interruption on the processor it was assigned to until completion. Many real-time systems handle recurring tasks that generate jobs in a predictable manner. In particular, RTSPSs process streams of data that are characterized by recurring arrival of data packets from regular data sources such as video cameras. The execution requirements in such systems are commonly modeled as periodic or sporadic task systems.

A periodic task system τ = {τ_1, ..., τ_n} consists of a finite set of periodic tasks. Each periodic task generates a job at regular periodic intervals, and the deadline of each job, relative to its release time, is the same for all the jobs. A periodic task τ_i = (e_i, d_i, p_i) generates an infinite series of jobs J^i_0, J^i_1, ... with execution requirement e_i that are released at time intervals (periods) of p_i. Let the release time of J^i_0 be a^i_0. Then, for every integer k > 0, job J^i_k has release time a^i_k = a^i_0 + k · p_i. The parameter d_i is the task's deadline and specifies the deadlines of all the jobs relative to their release time; a job deadline specified relative to its release time is called a relative deadline. Thus, for any k ≥ 0, the absolute deadline of job J^i_k is d^i_k = a^i_k + d_i. There are precedence constraints between the jobs in a periodic task: a job can start executing only after the preceding job was completed, even if its release time is earlier. If the release time of job 0 is known, the task is said to be concrete.

A sporadic task system τ = {τ_1, ..., τ_n} consists of a finite set of sporadic tasks. Sporadic tasks are similar to periodic tasks, except that p_i denotes the minimum release separation time between consecutive jobs; i.e., it is known that a^i_{k+1} ≥ a^i_k + p_i, but the exact release times are not specified. The deadline of a job in a sporadic task is defined as d_i time units from its release time. By definition, any periodic task is also a sporadic task. A (periodic or sporadic) task is said to have a constrained, implicit, or arbitrary deadline if d_i ≤ p_i, d_i = p_i, or d_i ≥ 0, respectively. The following properties of a task are extensively used in schedulability tests:

Utilization The utilization u_i of a task τ_i is the ratio e_i/p_i. Utilization is the fraction of a processor required to serve a given task over the task's period and over a sufficiently long time interval.

Density The density δ_i of a task τ_i is the ratio e_i / min(d_i, p_i). Density equals the utilization if d_i ≥ p_i, or the fraction of a processor required to serve a given task over the period between a job's arrival and its deadline if d_i < p_i.
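As a brief worked example with hypothetical numbers (not from the thesis), consider a constrained-deadline task with e_i = 2, d_i = 5, and p_i = 10:

\[
u_i = \frac{e_i}{p_i} = \frac{2}{10} = 0.2,
\qquad
\delta_i = \frac{e_i}{\min(d_i, p_i)} = \frac{2}{\min(5, 10)} = 0.4 .
\]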

2.2.3 Computing Platform Model

Computing platforms can be classified as uniprocessor or multiprocessor. A uniprocessor system contains a single processor (processing unit) for executing the tasks. Multiprocessor systems contain multiple processors and come in different configurations such as homogeneous and heterogeneous. The conventional processor model makes the following assumptions: (a) at every time instant, each processor executes at most one job; (b) at every time instant, each job may not be processed on more than one processor. Each processor acts as an independent job execution unit with some computing capacity by which the job execution time is determined.


Multiprocessor systems are categorized by the computing capacities of their processors as follows:

Identical All the processors have the same computing capacity. This model characterizes homogeneous systems.

Uniform The computing capacity of each processor is characterized by its speed. The execution time of a job is inversely proportional to the speed of the processor that executes it. This model is commonly used to describe processors that use the same ISA.

Unrelated The computing capacity of each processor is characterized by its speed and job type. The execution time of a job on one processor is unrelated to its execution time on another processor. This model is commonly used to describe heterogeneous systems with processors that use different ISAs.
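The three categories can be summarized by how the processing time p_{i,j} of a job J_i on processor j relates to its execution requirement e_i; this is the standard formulation from the scheduling literature, written here only for reference, where s_j and s_{i,j} are speeds we introduce solely for this summary:

\[
\text{identical: } p_{i,j} = e_i,
\qquad
\text{uniform: } p_{i,j} = \frac{e_i}{s_j},
\qquad
\text{unrelated: } p_{i,j} = \frac{e_i}{s_{i,j}} .
\]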

2.2.4 Other Scheduling Problems

We briefly describe two classes of scheduling problems that use other task and computing platform models that are related to this work.

Time-Dependent Scheduling

In time-dependent scheduling, the processing time of a job is variable and depends on the starting time of its execution. This occurs when a delay in the execution of a job causes an increase or decrease in its execution time. The book "Time-Dependent Scheduling" by Gawiejnowicz [Gaw08] provides a detailed description of time-dependent scheduling models, problems, and algorithms.

In time-dependent scheduling, the processing time of a job J_i is given by a function of its starting time. In the most general form, which corresponds to the unrelated processors model, the processing time of J_i on processor j is given by

\[
p_{i,j}(t) = g_{i,j}(t), \qquad (2.1)
\]

where g_{i,j}(t) is an arbitrary non-negative function of the job starting time t, where t ≥ 0. It is assumed that jobs are non-preemptive. The above representation form is rarely used because it does not give any information about changes in the processing times. The following description of job processing times is more commonly used:

\[
p_{i,j}(t) = a_{i,j} + f_{i,j}(t), \qquad (2.2)
\]

where a_{i,j} ≥ 0 denotes the constant part of the execution time. Two important classes of time-dependent scheduling problems are scheduling jobs with deteriorating processing times, where the functions f_{i,j}(t) are non-decreasing, and scheduling jobs with shortening processing times, where f_{i,j}(t) are non-increasing. All the parameters of the problems are assumed to be known to the scheduler. Applications of time-dependent scheduling include repayment of multiple loans [GKD87] and recognizing aerial threats using radar [HLW93].
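A common special case in this literature, given here only as an illustration (it is not discussed in this thesis), is linear deterioration, where a job's processing time grows linearly with its start time:

\[
p_{i,j}(t) = a_{i,j} + b_{i,j}\, t, \qquad b_{i,j} \ge 0 ,
\]

so a job that starts later takes longer; shortening models are obtained analogously with a non-increasing f_{i,j}.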

Batch Scheduling

In some scheduling systems, a machine can jointly process sets of jobs, which are called batches. The completion time of all the jobs in a batch is defined as the completion time of the entire batch [Bru07]. Execution of a batch involves a fixed setup time, and the problem is to group the jobs into batches and to schedule these batches. Two types of batching problems are considered in the literature: p-batching problems and s-batching problems. The problems differ in the jobs' execution model. In p-batching problems, the jobs are executed in parallel (without interference), while in s-batching problems the jobs are executed serially.

2.3 Real-Time Scheduling in CPU-GPU Systems

A number of recent studies dealt with the problem of using GPUs in real-time systems. We first survey several notable works in the area, and then discuss the challenges and how they were addressed in previous studies.

2.3.1 Overview

In his dissertation [Ell15], Glenn Elliott made a fundamental contribution to understanding the challenges in using GPUs in real-time systems. Elliott carefully analyzed the costs involved in operating the GPU, including CPU interrupt handling, DMA controller configuration, and the effect of inter-processor communication on CPU performance. Elliott et al. developed GPUSync [EWA13], a framework for scheduling real-time sporadic tasks on clusters of CPUs and of GPUs. The framework is implemented as an extension of the LITMUSRT [Bra11] real-time Linux kernel. GPUSync takes a synchronization-based approach to the scheduling problem; it arbitrates access to the GPUs' processing units and the DMA controllers using k-exclusion locks. Such a lock gives access to the resources to up to k jobs at the same time. To get access to a resource, a job must first obtain one of the k tokens, which are allocated by a priority-based mechanism. The resource is further protected by a basic lock that ensures that at most one token-holding job is using the resource at any instant. GPU-using tasks can migrate between GPUs for load balancing.

Kato et al. presented TimeGraph [KLRI11], RGEM [KL11], and Gdev [KMMB12]. TimeGraph is an early work on using GPUs for real-time tasks and is aimed at real-time graphics applications. This GPU command scheduler was implemented in an open-source device driver and included two task scheduling priority policies: Predictable Response Time (PRT) and High Throughput (HT). In PRT a new job with the highest priority has to wait only for the completion of the job currently in execution, while in HT it may have to wait for several lower-priority jobs that were issued earlier by the same task. PRT provides greater predictability, while HT provides greater efficiency. TimeGraph is limited in that it does not support concurrent data transfer and kernel execution, nor multiple GPUs, and is intended to improve the responsiveness of interactive applications such as games and video players. RGEM is a run-time execution model for soft real-time GPGPU applications that is implemented entirely in user space. It supports tasks with fixed priorities. RGEM supports simultaneous data transfer and kernel execution. To reduce blocking times that result from large data transfers, it splits the data into chunks and schedules them according to given task priorities. Gdev provides advanced non-real-time GPU resource management features to the OS. It uses alternative implementations of the GPU environment (runtime and device driver), and hence is transparent to the user. Gdev offers a priority-aware scheduler and advanced GPU memory management services such as chunking of large data transfers, elimination of redundant data copies on the host, faster direct I/O operations for small data transfers, and page swapping to enable allocation of more memory than the physical memory on the GPU.

GPUs and their software environment are designed to perform large computational tasks at high throughput. To use them in a real-time system, it is necessary to meet the challenges of control over GPU resources, computation of worst-case execution time, and resource scheduling. Next, we describe these challenges in more detail and review the methods used by others to address them.

2.3.2 Limited Control over GPU Resources

Kernel calls and data transfer requests

Kernel calls and data transfer requests from the host program are executed by the GPU and DMA controllers, which are located on the GPU board. Thus, the time it takes to pass the request to the device and configure it for execution is considerable and limits its responsiveness. DMA configuration is one of the most time-consuming setup tasks. Aumiller et al. [ABKR12], Fujii et al. [FAN+13], Kato et al. [KAB13], and Rath et al. [RKL+14] showed that direct PCIe reads and writes can be an effective alternative to DMA operations for low-latency data transfers; using such methods Rath et al. [RKL+14] achieved response times of below 8 µs for processing sensor data on a GPU. Resource management and device arbitration for concurrent requests are performed by the device driver. Unfortunately, GPU device drivers are supplied by the vendors in closed-source form and are not easily modifiable. The host program has very limited tools for controlling the execution of kernels and data transfer operations:

1. It can define precedence dependencies between the operations using work queues (streams in CUDA, command queues in OpenCL); operations in different queues can be executed concurrently, which enables, for example, overlapping of data transfer and kernel execution.

2. CUDA supports relative priorities between the queues, so that as thread blocks in low-priority queues finish, waiting blocks in higher-priority queues are scheduled in their place [NVI15a]; however, the host program cannot explicitly define dependencies between the thread blocks (see the sketch below).

3. It can schedule the invocation of kernel calls and data transfer requests; however, their execution is scheduled by the device driver and is not within the program's control. Nor can kernel calls and data transfers be withdrawn or aborted once they have been enqueued, which severely limits the possibility of using a dynamic scheduling approach.

To gain control over the resources, Kato et al. used alternative implementations of the GPU environment (runtime and/or device driver) in their GPU resource management frameworks TimeGraph [KLRI11] (for graphics) and Gdev [KMMB12] (for general-purpose computations). This approach has a number of drawbacks. First, since GPU vendors do not publish a specification for the protocol between the GPU and the device driver, the protocol used by the open-source drivers is reverse-engineered from the official drivers, so the open-source drivers might not support all the GPU features. Second, official GPU drivers are frequently updated to support new models and features, so the open-source drivers require continuous maintenance. Elliott et al. used a different approach in GPUSync [Ell15, EWA13]. This framework manages the GPU resources by itself and uses the runtime and closed-source driver merely as intermediate layers to the GPU. Specifically, it runs a kernel-space scheduler that intercepts calls to the GPGPU runtime library and passes them to the runtime only when it is time for them to execute. This approach may be more practical as long as the GPU interfaces remain obscure.
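The sketch below (ours, not from the thesis) illustrates the queue-priority mechanism from item 2 using the CUDA runtime API; the kernels, buffer size, and launch configurations are placeholders, and the attainable priority range depends on the device.

#include <cuda_runtime.h>

__global__ void low_priority_kernel(float* d, int n)  { /* background work */ }
__global__ void high_priority_kernel(float* d, int n) { /* latency-sensitive work */ }

int main()
{
    int least, greatest;                       // numerically lower = higher priority
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t low, high;
    cudaStreamCreateWithPriority(&low,  cudaStreamNonBlocking, least);
    cudaStreamCreateWithPriority(&high, cudaStreamNonBlocking, greatest);

    float* d_buf;
    cudaMalloc((void**)&d_buf, 1 << 20);

    // As thread blocks from the low-priority stream finish, waiting blocks from
    // the high-priority stream are scheduled in their place; dependencies between
    // individual thread blocks still cannot be expressed.
    low_priority_kernel<<<256, 256, 0, low>>>(d_buf, 1 << 18);
    high_priority_kernel<<<32, 256, 0, high>>>(d_buf, 1 << 12);

    cudaDeviceSynchronize();
    cudaFree(d_buf);
    cudaStreamDestroy(low);
    cudaStreamDestroy(high);
    return 0;
}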

Preemption

A number of studies focused on kernel preemption. Calhoun and Jiang [CJ12] emulated kernel preemption by saving the kernel execution state (shared memory and registers) and exiting the kernel voluntarily every time quantum; such techniques have high overhead and/or result in long times between scheduling decisions. Basaran and Kang [BK12] used a kernel slicing technique for kernel preemption, where the grid is split into smaller sub-grids that are launched as separate kernels, and correct block indices are available through special variables. Zhong and He [ZH14] made the index corrections transparent to the programmer through modification of the low-level code (PTX or SASS). These methods incur lower preemption overhead but may also lead to lower GPU utilization and require code modification to set preemption points. Tanasic et al. [TGC+14] proposed architectural changes to NVIDIA GPUs to enable kernel preemption.

Data transfers to and from GPU memory are also non-preemptive and can take up to hundreds of milliseconds to complete. In their real-time GPU scheduler RGEM [KL11], Kato et al. proposed to break the DMA operations into smaller chunks to reduce priority inversion, an approach later adopted by others [ABKR12, BK12, EWA13]. However, when DMA operations are used, the data transfer time cannot be smaller than the DMA configuration time, so decreasing the chunk size beyond a certain point has very little effect on latency. Kato et al. chose 1 MB chunks for RGEM based on benchmarks. Using direct PCIe reads and writes it is possible to perform short data transfers faster [FAN+13].
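A minimal sketch of the chunking idea follows (our illustration, not RGEM's implementation): a large host-to-device copy is issued as a sequence of fixed-size pieces, so that a newly arrived higher-priority transfer waits at most roughly one chunk (plus one DMA setup) instead of the whole copy; with 1 MB chunks this matches the granularity chosen for RGEM.

#include <cuda_runtime.h>
#include <stddef.h>

// Copy 'size' bytes from host to device in chunks of 'chunk' bytes on 'stream'.
// Each iteration is a scheduling point at which other transfers can be issued.
void chunked_h2d_copy(void* dst, const void* src, size_t size,
                      size_t chunk, cudaStream_t stream)
{
    for (size_t off = 0; off < size; off += chunk) {
        size_t len = (size - off < chunk) ? (size - off) : chunk;
        cudaMemcpyAsync((char*)dst + off, (const char*)src + off, len,
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);   // bound the blocking time to one chunk
    }
}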


Thread scheduling

The GPU's internal thread scheduling mechanisms are not accessible to the host program, nor are they fully disclosed by the GPU vendors. Thread blocks can be assigned to the SMs in any order, and warps that are ready for execution can be interleaved arbitrarily. The host program has limited influence on the execution of the threads:

1. It determines the granularity of an SM task by specifying the thread block dimensions.

2. It can restrict the number of thread blocks resident on an SM by artificially inflating the amount of resources each thread block requires. For example, it can cause SMs to execute thread blocks one at a time. This technique prevents contention between thread blocks for SM resources and facilitates the timing analysis. However, it might result in a significant performance loss due to low utilization of the SM and memory bandwidth.
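As an illustration of the resource-inflation technique in item 2 (our sketch, not from the thesis), a kernel can request almost all of an SM's shared memory as dynamic shared memory at launch time, so that only one thread block fits on an SM at a time; the 48 KB capacity assumed below depends on the GPU generation.

#include <cuda_runtime.h>

__global__ void one_block_per_sm_kernel(float* data, int n)
{
    extern __shared__ char ballast[];   // dynamic shared memory used only as ballast
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] *= 2.0f;         // placeholder per-item work
    (void)ballast;                      // the padding itself is never accessed
}

void launch_serialized(float* d_data, int n)
{
    // Requesting ~47 KB of dynamic shared memory per block leaves no room for a
    // second resident block on an SM with 48 KB of shared memory, so thread
    // blocks execute one at a time per SM (at the cost of lower utilization).
    size_t ballast_bytes = 47 * 1024;
    one_block_per_sm_kernel<<<(n + 255) / 256, 256, ballast_bytes>>>(d_data, n);
}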

2.3.3 Estimation of Execution Time

Accurate estimation of worst-case execution time (WCET) of GPU operations is very difficult due to (1) fixed and opaque execution mechanisms; (2) asynchronous CPU-GPU communication protocols; and (3) resource sharing and concurrent execution. Due to the lack of transparency in the operation of the GPGPU runtime, device driver, GPU work distribution engine, and warp scheduler, formal analysis of the WCET is not possible, and research has focused on empiric performance models and statistical analysis.

Computing an upper bound on the invocation latency is very challenging because (1) the resource management mechanisms in the device driver are not revealed, and (2) the driver communicates with the GPUs and DMA controllers using asynchronous message-based communication protocols that use PCIe read/write operations and CPU interrupts. These difficulties can be ameliorated using the approaches proposed by Kato et al. [KMMB12] and Elliott et al. [Ell15]: using an alternative open-source GPU environment and managing the resources in the host program instead of the device driver.

Kernel execution time is difficult to predict due to the multitude of resources and massive parallelism. By design, GPUs create dozens of warps that compete for the compute elements and memory buses in order to hide latency and achieve high utilization; hence, accurate modeling and analysis of the execution would be extremely difficult. Traditional worst-case execution time (WCET) analysis for CPUs is inapplicable in this case because it assumes a small number of threads. Berezovskyi et al. [BSBT14] recently used measurements, statistical analysis, and extreme-value theory to provide a probabilistic WCET. Betts and Donaldson [BD13] presented two techniques for estimating the WCET of a kernel. The first technique estimates the earliest time the last warp is sent for execution and the execution time of this warp, using measurements. The second technique uses a model of the GPU scheduler, trace data, and timestamps collected at certain places in the code to estimate the effects of resource contention and compute the WCET. Huangfu and Zhang [HZ15] proposed a method that selectively bypasses GPU caches to improve the WCET analysis. Membarth et al. [MKH12] analyzed the WCET using a hardware model of the GPU and a control-flow graph of the kernel that was generated from PTX code using the Ocelot framework [KDY10].

2.3.4 Resource Scheduling

Resource scheduling in a real-time CPU-GPU system includes scheduling computations on the CPUs and GPUs and scheduling data transfers between the CUs in a way that meets the processing requirements and the user-provided latency constraints. Existing works on the problems of real-time scheduling in CPU-GPU systems include [KLRI11, KL11, LA14, EA12, EWA13, Ell15, HBK+14, ZHH15].

We have already discussed some aspects of resource scheduling in GPUSync, TimeGraph, and RGEM; here we discuss them in further detail. GPUSync [EWA13] considers the problem of scheduling sporadic tasks in a CPU-GPU system as a multiprocessor scheduling problem and uses a synchronization-based approach where the schedulers are implemented in k-exclusion locks. The framework supports any job-level fixed-priority scheduling algorithm. RGEM [KL11] uses priority queues to arbitrate access to the GPU and DMA controllers.

Han et al. [HBK+14] presented a real-time framework for multi-GPU systems called GPU-SPARC. The system considers a constrained-deadline sporadic task model where each real-time task is composed of a chain of CPU, GPU, or data transfer operations, and uses the Deadline Monotonic (DM) non-preemptive fixed-priority scheduling algorithm. The framework applies kernel slicing to split the kernel between multiple GPUs in order to reduce its execution time. This technique has larger overhead than the single-GPU kernel splitting because additional data transfers are required to make the data available to the kernels on all the GPUs.

Zhang et al. [ZHH15] recently presented a design of a real-time system for processing multiple data streams. Their design builds upon our work and extends it by accounting for network processing overhead and adding an admission control mechanism. The system processes batches of streams together in a processing pipeline that includes a data transfer of the input to the GPU memory, execution of a GPU kernel, and a data transfer of the output back to the main memory.

Chang et al. [CCKF13] studied the problem of scheduling implicit-deadline sporadic tasks on a platform that consists of multiple islands of homogeneous cores, where the aim is to minimize the number of required islands. The cores in each island share a fast memory of limited size and all the cores share a large global memory. The authors present scheduling algorithms that provide asymptotic approximation bounds on the number of required islands. In platforms with multiple CPUs and GPUs, this approach can be applied to the CPUs. The organization of memories and cores in GPUs is generally compatible with their platform model, but their use of the traditional one-task-per-core execution model is not compatible with the GPU's thread-level parallelism.

Kim et al. [KANR13, KRK13] proposed an adaptive GPU resource management framework for sporadic real-time tasks. Instead of running each GPU task on all the SMs, the framework allocates a subset of the GPU's SMs for each task.


Kuo and Hai [KH10] presented an EDF-based algorithm to schedule real-time tasks in a heterogeneous system. Their work is based on a system with a CPU and an on-chip DSP. This algorithm is not compatible with our system model because of the high latency of CPU to GPU communication.

In a recent dissertation, Raravi [Rar14] studies the problem of scheduling implicit-deadline sporadic tasks on heterogeneous systems with several types of processors. This model is similar to the unrelated processors model, except that each processor has one of a limited set of types, which defines its compute capabilities. A system with several CPUs and GPUs can be considered as a 2-type system. For such systems, the study considers task assignment problems that have different assumptions on task migration and use of shared resources. All these problems assume that no more than one task is executed on the same processor at the same time.



Chapter 3

System Model

Our system applies a data-processing application to a set of real-time data streams. It aims to achieve the maximum throughput while satisfying the real-time constraint of each stream.

3.1 Hardware Model

We consider a heterogeneous system composed of an interconnected set of discrete compute units (CUs), each with a local memory module, such as a CPU-GPU or multi-GPU system. The system contains one or more multi-core CPUs, one or more GPUs, and optional high-throughput network cards, each with a local memory module that is accessible to the other devices via the interconnect. One of the CPU cores is dedicated to controlling the job execution; it schedules the kernel executions and data transfers, and sends them for execution. Each CPU core, except those dedicated to a specific task, and each GPU device, is considered as a separate processor that executes the compute kernels assigned to it, one at a time.

A GPU executes jobs in batch mode; it receives a batch of jobs (data and instructions) that was previously copied to its local memory, and executes a kernel that collectively processes the data, making use of its multiple execution units. If needed, the results are then copied to a destination buffer on a different device. The batches are executed one at a time in a non-preemptive manner. Jobs in the batch are scheduled on the GPU's SIMD units by a hardware scheduler. Thus, we have no control over or information about the order in which the jobs will be executed; we assume that the scheduler strives to maximize GPU throughput.

3.2 Real-Time Data Stream

We consider an application model in which the system needs to process a set of hard real-time data streams S = {s_i}_{i=1}^N that generate data for processing. Each data stream s_i = (q_i, t_i, op_i, l_i) is a source of data packets of size q_i that arrive to the system sporadically, at least t_i time units apart. Each packet needs to be processed using an operator op_i within a hard deadline of l_i time units (latency bound) from its arrival. Any CU is capable of executing every operator. For each stream, the operator reads and updates a state in the memory of the CU where it is executed.


Figure 3.1: Data stream model (packets of size q_i arrive at least t_i apart and are processed by op_i, which updates state_i, no later than l_i after arrival)

Figure 3.2: Processing model (buffered packets are converted into jobs of size W_i with period T_i and relative deadline l_i − T_i, queued, and executed by the processing system)

Packets that belong to the same stream must be processed in the order of their arrival to preserve the correctness of the state, but can be processed in parallel with packets from other streams. The model is illustrated in Figure 3.1. At the beginning, the system receives a characterization of the stream set that includes the

parameters (qi, ti, opi, li) for each stream and begins processing it only after verifying that it can satisfy all the time constraints. The stream set is either constant or changes infrequently, and can be updated only if the following conditions are met:

• the system has sufficient resources to compute a new work plan without violating any old time constraints;

• the system verified that it can satisfy all the time constraints for the new stream set;

• the stream states can be migrated to match the new work plan without violating old time constraints, until the change comes into effect.

Upon arrival to the system, the packets are accumulated in a buffer until being sent for processing. The packet accumulation buffer of each stream resides on one of the memory modules in the system, such as in the main memory, in a GPU's on-board memory, or on the network card. For simplicity, it is assumed that all such buffers are stored in the main memory, unless otherwise noted.

3.3 Packet Processing Model

The processing model of the packets is illustrated in Figure 3.2. For every stream s_i, the system periodically inspects the buffer at intervals of T_i time units, and if any packets are present it removes all the accumulated packets from the buffer and sends them as a job J^i_k = (a^i_k, e^i_k, d^i_k) to the processing queue of a CU. An execution engine handles all aspects of execution of jobs from the queues on the respective CUs; it handles the related data transfers and state migrations and initiates the computation. The deadline of the job is (l_i − T_i) time units from its creation time, because the time between packet arrival and buffer inspection is included in the processing latency; i.e., d^i_k = a^i_k + l_i − T_i. For simplicity, we assume that the processing requirements of packets from s_i are proportional to their size with factor λ_i; hence, the size of a job that processes n^i_j packets of size q_i is e^i_j = λ_i · n^i_j · q_i. Jobs belonging to the same stream need to be processed sequentially, while jobs belonging to different streams can be processed in parallel.
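A minimal sketch of this job-creation step follows (our illustration, not the thesis implementation); the Stream and Job structures and the clock value are assumptions:

struct Stream { double q; double t; double l; double lambda; double T; };
struct Job    { double a; double e; double d; int stream_id; };

// Turn the packets accumulated for stream i into a job, following
// d = a + l_i - T_i and e = lambda_i * n * q_i from the processing model.
Job make_job(const Stream& s, int i, int n_packets, double now)
{
    Job j;
    j.stream_id = i;
    j.a = now;                           // job creation (release) time
    j.e = s.lambda * n_packets * s.q;    // execution requirement
    j.d = now + s.l - s.T;               // absolute deadline
    return j;
}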



Chapter 4

Stream Processing in a CPU-GPU System

This chapter studies the problem of processing a fixed set of real-time data streams on a system with a single multicore CPU and a single GPU. We define the real-time stream processing problem as follows:

Problem 1. Given a set of input streams S = {s_i : (q_i, t_i, op_i, l_i)}_{i=1}^N and a system with a CPU and a GPU (the CUs), set the job creation period for each stream (T_i), find an assignment of each stream to a CU, and find a feasible schedule for each CU, if such a schedule exists, or return failure otherwise.

The scheduling problem in this case can be characterized as scheduling sporadic tasks with arbitrary deadlines. The existing CPU-only scheduling approaches [ABB08, LL73, BCPV96] cannot be applied in our setup because they schedule jobs one or a few at a time. In contrast, GPUs require batches of thousands of jobs to be packed together, raising the question of how to select those to be executed together so that all of them comply with their respective deadlines. We use the term stream interchangeably when referring to a data stream or the sequence of jobs that corresponds to it.

Multiprocessor systems are known to be tightly coupled and use shared memory for inter-process communication. The time of such communication is low compared to task execution times. For this reason, a multiprocessor may use a global (centralized) scheduler. In contrast, the memory in our system is divided into several distant physical units – the main memory, shared by the CPU cores, and DRAM memory on the GPU board, shared by the SMs of that GPU. Global scheduling is not practical in our system for two reasons:

1. The standard GPU model is bulk synchronous; GPUs are not optimized for dynamic update of data in GPU memory while a kernel is running. Therefore, a running execution on a GPU must be completed before a new batch can be executed. Hence, the new batch for the GPU should be statically created in advance on a CPU, under the constraint of timely completion of all jobs in the batch. Even if an efficient mechanism for sending new jobs to a running kernel existed, it would be very hard to guarantee that their completion time will not exceed their deadlines due to scheduling dependencies between the new jobs, running jobs, and jobs added in the future. Nor can other new mechanisms for using GPUs, such as running multiple kernels simultaneously, be currently applied due to lack of isolation between tasks at the kernel, thread-block, and warp levels.

Figure 4.1: Batches are processed in a four-stage pipeline: (1) data aggregation, (2) main memory→GPU memory transfer, (3) kernel execution, (4) GPU memory→main memory transfer

2. The stream state is carried from one job in a stream to the next job in the same stream. The overhead of moving jobs between CUs might become a bottleneck due to slow communication between the CPU and the GPU. Furthermore, since a steady state is assumed, migration can be avoided with proper batch creation.

These considerations led us to choose the static partitioning method, where multiple streams are statically assigned to a particular CU.

4.1 GPU Batch Scheduling

The streams on the GPU are processed in batches of jobs, or batches. A batch is a collection of jobs that are sent together for execution on the GPU in the same kernel instance. Every batch goes through a four-stage synchronous pipeline: data aggregation in main memory, data transfer to local GPU memory, kernel execution, and transfer of results to main memory. The duration of a pipeline period, which is defined by the system, is denoted by T. The pipeline stages are illustrated in Figure 4.1.

Let S be a set of streams assigned to be processed on the GPU. The incoming packets from streams in S are sent for processing on the GPU as a batch at the end of every data aggregation period, every T time units. Thus, the buffer collection periods of all the streams in S are equal to T, i.e., ∀s_i ∈ S, T_i = T. The processing latency for the packets is at most 4T because there are four pipeline stages, so in order to guarantee that no packet misses its deadline, we require that 4T ≤ l_i for every stream in S, i.e., T ≤ (1/4) min_i l_i. T must be at least as long as any of the four pipeline stages. Given a candidate value of T, we compute the expected batch of jobs and rely on a GPU performance model to calculate the length of each pipeline stage. From a scheduling perspective, the batch jobs are the only jobs scheduled on the GPU. To differentiate between regular jobs and batch jobs, we denote the sizes of the jobs that compose a batch job by W = (W_i)_{i=1}^N. Our design for efficient job batching was guided by the following principles:


1. Batch as many jobs as are ready to be executed. Batches with many independent jobs have higher parallelism and provide more opportunities for throughput optimization by the hardware scheduler.

2. For every job, aggregate as much data as possible (this means aggregate as long as possible). This is because batch invocation on a GPU incurs some overhead, which is better amortized when the jobs are large. Moreover, transferring larger bulks of data between a GPU and a CPU improves the transfer throughput.

Distribution of job sizes in a batch affects the load balancing on the GPU. We batch jobs in a way that minimizes their overall execution time, while guaranteeing that all batches complete on time.

4.1.1 Schedulability

We test GPU schedulability for S = {s_i}_{i=1}^N in 4 steps:

1. Set pipeline and packet-aggregation period T = (1/4) min_{s_i ∈ S} l_i.

2. Calculate job sizes in a batch W = (W_i)_{i=1}^N, where W_i = λ_i ⌈T/t_i⌉ q_i.

3. Estimate the data transfer time (input and results) T_D and kernel execution time T_K.

4. If the system supports bi-directional data transfer, S is schedulable if max(T_D, T_K) ≤ T; otherwise it is considered as not schedulable.
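A compact sketch of the four steps follows (ours; the two estimators stand in for the performance model of Section 4.1.2 and are assumed to be provided elsewhere):

#include <vector>
#include <cmath>
#include <algorithm>

struct Stream { double q; double t; double l; double lambda; };

double estimate_transfer_time(const std::vector<double>& W);   // model stub (T_D)
double estimate_kernel_time(const std::vector<double>& W);     // model stub (T_K)

// Returns true if the stream set S passes the GPU schedulability test; sets T.
bool gpu_schedulable(const std::vector<Stream>& S, double& T)
{
    // Step 1: pipeline and packet-aggregation period T = (1/4) min l_i.
    double min_l = S[0].l;
    for (const Stream& s : S) min_l = std::min(min_l, s.l);
    T = min_l / 4.0;

    // Step 2: job sizes W_i = lambda_i * ceil(T / t_i) * q_i.
    std::vector<double> W;
    for (const Stream& s : S)
        W.push_back(s.lambda * std::ceil(T / s.t) * s.q);

    // Steps 3-4: compare the estimated stage lengths against T.
    return std::max(estimate_transfer_time(W), estimate_kernel_time(W)) <= T;
}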

4.1.2 GPU Empirical Performance Model

The GPU performance model is used by our schedulability testing function to estimate data transfer time and kernel execution time. It is not a general model: it was built specifically for our batch execution algorithm. In the model we assume that the jobs in a given batch are distributed among the SMs so that each SM is assigned the same number of jobs, regardless of job size. Such a distribution is equivalent to a random one that ignores load balancing considerations. At runtime, the thread-block scheduler on the GPU distributes the streams between the SMs dynamically, so the estimated execution time gives us an upper bound. The problem is therefore reduced to estimating the runtime of the jobs on a single SM, and taking the longest one among the SMs as the estimate of the runtime of the whole batch on a GPU. We now consider execution of jobs on an SM. For convenience, we express the performance of an SM as utilization, namely, the fraction of the maximum throughput achievable by a single SM. The maximum throughput for a given GPU stream processing kernel can easily be obtained by measuring the throughput of that kernel on tens of thousands of identical jobs. We use the term utilization function to describe the dependency of the SM utilization on the number of jobs invoked on that SM. The utilization function can be calculated by means of a lookup table generated by benchmarking a GPU kernel for different numbers of jobs, or using an interpolation function of these measurements. Figure 4.2 demonstrates the utilization

Figure 4.2: Non-linear SM utilization. Benchmark vs. the fitted formula U(x) = min( x / (0.004898·[x]² + 0.005148·[x] + 49.62), 1 ), plotted as SM utilization against the number of concurrent jobs (0–128).

Figure 4.3: Effective utilization of jobs with different sizes (job size in FLOP for jobs 1–6; utilization levels of 12%, 8%, and 4%).

Figure 4.2 demonstrates the utilization function for the AES-CBC kernel with four threads per data stream. The saw-like form of this graph is due to the execution of multiple jobs together in a single warp. Thus, in Figure 4.2, [x] is the rounding of the number of jobs (x) up to the closest multiple of the number of jobs per warp that is higher than or equal to x.

When we run jobs of different sizes, the utilization function becomes more complex. The number of running jobs on an SM dynamically decreases during execution, and therefore the utilization changes as well. Consider the example in Figure 4.3. It shows 6 jobs of different sizes invoked at the same time. Utilization at a certain time depends on the number of running jobs. In this example, the execution can be divided into 3 utilization periods. During the execution of job1, six jobs are running, so the utilization is U(6) = 12%. The time of this period is t_1 = 6·W_1·(12%·OPS_SM)^{-1}, where the size of job1 (W_1) is measured in number of operations, and OPS_SM is the maximum throughput of a single SM in operations/sec. Four jobs are left in the second time period. The length of this period is t_2 = 4·(W_3 − W_1)·(8%·OPS_SM)^{-1}. The length of period t_3 is calculated similarly. The SM processing time is the sum of all these periods.

We estimate the kernel execution time of a batch of jobs as follows. Given a batch of jobs W = (W_i)_{i=1}^{N}, where W_i = λ_i·⌈T/t_i⌉·q_i, the jobs are partitioned into equally-sized per-SM subsets {w_i^c}_{i=1}^{n}, c = 1, …, #SMs.

Let {w_i}_{i=1}^{n} be one of the subsets. We assume w.l.o.g. that w_1 ≥ w_2 ≥ … ≥ w_n, and define w_{n+1} := 0. The estimated execution time of this subset is calculated using the following formula:

T_K^SM({w_i}_{i=1}^{n}) = Σ_{i=1}^{n} (w_i − w_{i+1})·t(i) = Σ_{i=1}^{n} (w_i − w_{i+1})·i / (U(i)·H) .   (4.1)

Here, t(i) is the time it takes to perform one computation per job across i jobs concurrently on an SM, U(i) is the SM utilization function on the GPU, and H is the maximum throughput of an SM in computations per millisecond. The kernel execution time is the maximum of the per-SM execution times:

T_K = max_{1 ≤ c ≤ #SMs} T_K^SM({w_i^c}_{i=1}^{n}) .   (4.2)
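The following C sketch transcribes Equations (4.1) and (4.2) directly. It is a minimal illustration that assumes the per-SM subsets of job sizes are already given and that U() returns the benchmarked utilization function of Figure 4.2; the function and variable names are illustrative, not taken from the framework's source.

#include <stdlib.h>

/* SM utilization for i concurrent jobs (lookup table or interpolation,
 * cf. Figure 4.2); assumed to be filled from kernel benchmarks.         */
double U(int i);

static int cmp_desc(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x < y) - (x > y);                /* descending order */
}

/* Equation (4.1): estimated execution time of one SM's subset of jobs.
 * w[0..n-1] are job sizes in computations, H is the maximum SM
 * throughput in computations per millisecond.                           */
double sm_kernel_time(double *w, int n, double H)
{
    qsort(w, n, sizeof(double), cmp_desc);   /* w1 >= w2 >= ... >= wn    */
    double time = 0.0;
    for (int i = 0; i < n; i++) {
        double w_next = (i + 1 < n) ? w[i + 1] : 0.0;   /* w_{n+1} := 0 */
        /* Phase during which the (i+1) largest jobs are still running:
         * each processes (w[i] - w_next) computations at utilization
         * U(i+1), shared among i+1 jobs.                                 */
        time += (w[i] - w_next) * (i + 1) / (U(i + 1) * H);
    }
    return time;
}

/* Equation (4.2): the batch kernel time is the maximum over all SMs.    */
double kernel_time(double **subsets, int *sizes, int num_sms, double H)
{
    double tk = 0.0;
    for (int c = 0; c < num_sms; c++) {
        double t = sm_kernel_time(subsets[c], sizes[c], H);
        if (t > tk) tk = t;
    }
    return tk;
}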

To estimate the data transfer time T_D, the performance model uses a lookup table that contains a column of data sizes and two columns of achievable transfer rates (one for each direction). Figure 4.4 shows transfer rate and latency benchmark results for the configuration described later in Section 4.4.1. The effective bandwidth increases with chunk size until saturation. It is interesting to see that the throughput is not symmetric for transfers to and from the GPU. For chunks of 4 KiB or smaller, the transfer time is roughly 20 µs, and full bandwidth is achieved for chunks of 8 MiB and larger. The GPU's single DMA engine supports only uni-directional data transfer, so the bi-directional throughput equals the average throughput of the two directions.

In our system, incoming data for GPU-assigned streams is automatically stored contiguously in memory. This is achieved by allocating contiguous blocks of memory (256 B) to store incoming data from these streams. Consequently, the data for a whole batch can be transferred to the GPU as a single chunk in a single operation. We evaluate the precision of our model in Section 4.4.4.
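A possible shape for the transfer-time lookup is sketched below. The table layout, the field names, and the conservative rounding rule are assumptions for illustration only; the measured rates themselves would come from a benchmark such as the one in Figure 4.4. Because the single DMA engine serializes the two directions, this sketch simply adds the input and output transfer times.

/* Benchmark table of (chunk size, host-to-device rate, device-to-host
 * rate) rows, sorted by increasing chunk size.                          */
typedef struct {
    double bytes;     /* chunk size                        */
    double h2d_rate;  /* CPU->GPU transfer rate (bytes/ms) */
    double d2h_rate;  /* GPU->CPU transfer rate (bytes/ms) */
} xfer_row_t;

/* Largest benchmarked chunk not exceeding `bytes`.  Rates grow with
 * chunk size until saturation, so rounding down under-approximates the
 * rate and therefore over-approximates (safely) the transfer time.      */
static int pick_row(const xfer_row_t *tab, int rows, double bytes)
{
    int r = 0;
    while (r + 1 < rows && tab[r + 1].bytes <= bytes) r++;
    return r;
}

/* Estimated time to move `in_bytes` to the GPU and `out_bytes` back.    */
double lookup_transfer_time(const xfer_row_t *tab, int rows,
                            double in_bytes, double out_bytes)
{
    const xfer_row_t *in  = &tab[pick_row(tab, rows, in_bytes)];
    const xfer_row_t *out = &tab[pick_row(tab, rows, out_bytes)];
    return in_bytes / in->h2d_rate + out_bytes / out->d2h_rate;
}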

Implications on task scheduling Note that the existing batch scheduling problems s-batching and p-batching are not suitable in this case, because the execution time of a batch on the GPU is neither equal to the maximum nor to the sum of the execution times of the jobs.

4.2 CPU Scheduling

4.2.1 Scheduling Algorithms

The CPU continuously monitors the buffer and collects the packets immediately after their arrival; hence, we define T_i = 0 for every stream s_i assigned to the CPU. As soon as a packet arrives from a stream that was assigned to the CPU, the CPU appends a new job to the corresponding processing queue.

Figure 4.4: Data transfer rates in a CPU-GPU system with Intel P45 chipset. (a) Throughput [GByte/s] and (b) latency [ms] as a function of chunk size [Byte], for CPU→GPU, GPU→CPU, and bi-directional transfers.
From the point of view of the execution engine, a stream s_i = (q_i, t_i, l_i, op_i) that is processed on the CPU is a sporadic task τ_i = (e_i, d_i, p_i) with the following characteristics:

• job size e_i = λ_i·p_i (normalized to single-core computation capacity);

• relative deadline d_i = l_i;

• minimum release separation time p_i = t_i.

In our model, the latency bound is not limited by the period, so the tasks have arbitrary deadlines. We describe the set of streams assigned to the CPU as a sporadic task system and use an existing task scheduling algorithm to schedule the packet processing. A CPU schedulability-testing function is used in our system to validate that the streams assigned to the CPU will be scheduled for processing in a valid schedule, that is, that all jobs will be executed and no job will miss its deadline. The schedulability-testing function must match the CPU scheduling algorithm. We use [BB09] as the theoretical background for the analysis of schedulability of streams on multiprocessors. We now describe two common EDF-based scheduling algorithms for multiprocessors, and their schedulability tests.

Earliest-Deadline-First First-Fit (EDF-FF)

The Earliest-Deadline-First First-Fit (EDF-FF) algorithm, shown in Listing 4.1, is an EDF-based partitioned scheduling algorithm for multiprocessors. The algorithm partitions the tasks among the processors such that the schedule produced on each processor using EDF is feasible. In EDF, each arrived job that has not been executed to completion has a priority that is inverse to its absolute deadline. That is, the job with the closest deadline has the highest priority and is executed. Job preemption is allowed.

First-Fit is a heuristic for the Bin-Packing problem, which is NP-hard. EDF-FF attempts to partition the tasks to processors using a minimal number of processors, such that EDF-schedulability on each processor is guaranteed. If the number of necessary processors m returned by EDF-FF is smaller than or equal to the number of available processors m′ (m ≤ m′), then the task system is schedulable using this method. Otherwise, it is considered unschedulable. EDF-FF uses the following well-known test for EDF-schedulability on uniprocessors of tasks with arbitrary deadlines, which uses the task densities (recall δ_i = e_i / min(d_i, p_i)) [BB08].

Theorem 2 (Uniprocessor density test). A task system τ = {τ_i : ⟨e_i, d_i, p_i⟩}_{i=1}^{N} is EDF-schedulable on a uniprocessor, provided that:

Σ_{τ_i ∈ τ} δ_i ≤ 1 .   (4.3)

The EDF-schedulability test in Theorem 2 is sufficient but not necessary. Any task system that passes the test is EDF-schedulable, but an EDF-schedulable task system might not pass the test. Exact tests (sufficient and necessary) for EDF-schedulability on a uniprocessor (e.g., in [ZB09]) can be used in EDF-FF instead of the density test (Theorem 2), but their time complexity is higher.

Input:  A task set τ = {τ1, τ2, ..., τn}
Output: m (number of processors required),
        {P1, P2, ..., Pm} (partition of tasks to processors)

i := 1; m := 1;                  /* i - task index                          */
∆m := 0; Pm := ∅;                /* ∆m - total density on processor m       */
while (i ≤ n)
    q := 1;                      /* q - First-Fit processor number          */
    while ((∆q + δi) > 1)        /* EDF schedulability test (Theorem 2);    */
        q := q + 1;              /* ∆q is taken as 0 for unopened processors */
    end
    if (q > m)                   /* open a new processor                    */
        m := q;
        ∆m := 0; Pm := ∅;
    end
    ∆q := ∆q + δi; Pq := Pq ∪ {τi};
    i := i + 1;
end
return (m, {P1, P2, ..., Pm})

Listing 4.1: Earliest Deadline First - First Fit (EDF-FF)

Global EDF

In Global EDF, at runtime, arrived jobs are put into a queue that is shared by all the processors. Each job in the queue has a priority that is inverse to its absolute deadline, i.e., a job with a closer deadline has higher priority. At every moment, the m highest-priority jobs are selected for execution on the m processors. Job preemption is allowed, and a preempted job returns to the queue and may resume execution on a different processor. A well-known test for global EDF-schedulability of tasks is the following theorem [SB02, GFB03], which was proven to extend to tasks with deadlines larger than their period [BB08].

Theorem 3 (Multiprocessor density test). A task system τ = {τ_i : ⟨e_i, d_i, p_i⟩}_{i=1}^{N} is global-EDF-schedulable on a platform comprised of m unit-capacity processors, provided that:

Σ_{τ_i ∈ τ} δ_i ≤ m − (m − 1)·max_{τ_i ∈ τ} δ_i .   (4.4)

As in the uniprocessor density test, the multiprocessor density test is sufficient, but not necessary. A small illustration of both density tests follows. We then show an improved schedulability test that can be used for streams that satisfy the condition described next.
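The following C sketch is a direct check of the density conditions in Theorems 2 and 3. The task representation is an assumption chosen for the illustration, not the framework's data structure.

#include <math.h>
#include <stdbool.h>

typedef struct { double e, d, p; } task_t;   /* execution time, deadline, period */

static double density(task_t t) { return t.e / fmin(t.d, t.p); }   /* delta_i */

/* Theorem 2: sum of densities <= 1 on a uniprocessor.                   */
bool edf_uniprocessor_ok(const task_t *tau, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += density(tau[i]);
    return sum <= 1.0;
}

/* Theorem 3: sum of densities <= m - (m - 1) * max density, for global
 * EDF on m unit-capacity processors.                                    */
bool gedf_density_ok(const task_t *tau, int n, int m)
{
    double sum = 0.0, dmax = 0.0;
    for (int i = 0; i < n; i++) {
        double di = density(tau[i]);
        sum += di;
        if (di > dmax) dmax = di;
    }
    return sum <= m - (m - 1) * dmax;
}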

Improved Schedulability Test In the remainder of this section, we describe an improved schedulability test proposed by Baker and Baruah [BB09]. The test can be used for stream sets where, for each stream, the latency bound is not shorter than the packet inter-arrival period, i.e., for every stream s_i : (q_i, t_i, op_i, l_i), l_i ≥ t_i. We first define the terms used in this test:

DBF(τ_i, t)  For any interval length t, the demand bound function (DBF) bounds the maximum cumulative execution requirement by jobs of τ_i that both arrive in, and have deadlines within, any interval of length t:

DBF(τ_i, t) = max( 0, (⌊(t − d_i)/p_i⌋ + 1)·e_i ) .   (4.5)

LOAD(τ)  A measure of processor load, based upon the DBF function:

LOAD(τ) = max_{t>0} ( (1/t)·Σ_{τ_i ∈ τ} DBF(τ_i, t) ) .   (4.6)

µ(τ)  A bound on processor load. In the simpler test in Theorem 3, µ(τ) is a lower bound on the total density of processor demand at which a task may miss a deadline. In Theorem 4, a tighter bound on processor load is presented:

µ(τ) = m − (m − 1)·max_{τ_i ∈ τ} δ_i .   (4.7)

δ_sum(τ, µ)  The total density of the (⌈µ⌉ − 1) highest-density tasks:

δ_sum(τ, µ) = Σ_{k=1}^{⌈µ⌉−1} δ_k ,   (4.8)

where the tasks in τ are ordered such that δ_1 ≥ δ_2 ≥ … ≥ δ_N.

Theorem 4. A task system τ = {τ_i : ⟨e_i, d_i, p_i⟩}_{i=1}^{N} is schedulable upon a platform comprised of m unit-capacity processors, provided that:

LOAD(τ) ≤ max { µ(τ) − δ_sum(τ, µ(τ)), (⌈µ(τ)⌉ − 1) − δ_sum(τ, ⌈µ(τ)⌉ − 1) } .   (4.9)
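A literal transcription of DBF and LOAD (Equations 4.5 and 4.6) is sketched below in C. The candidate evaluation points follow the observation from [FBB06] quoted in the next paragraph; the struct and function names are illustrative assumptions, not the framework's code.

#include <math.h>

typedef struct { double e, d, p; } task_t;   /* execution time, deadline, period */

/* Equation (4.5): demand bound function of task ti over an interval of
 * length t.                                                              */
double dbf(task_t ti, double t)
{
    double jobs = floor((t - ti.d) / ti.p) + 1.0;
    return fmax(0.0, jobs * ti.e);
}

/* Equation (4.6): LOAD(tau) = max over t > 0 of (1/t) * sum_i DBF(tau_i, t).
 * Following [FBB06], it suffices to evaluate the maximum at the points
 * t = d_i + k * p_i that are smaller than the given horizon (e.g., the
 * least common multiple of the task periods).                            */
double load(const task_t *tau, int n, double horizon)
{
    double best = 0.0;
    for (int i = 0; i < n; i++) {
        for (double t = tau[i].d; t < horizon; t += tau[i].p) {
            double demand = 0.0;
            for (int j = 0; j < n; j++) demand += dbf(tau[j], t);
            if (demand / t > best) best = demand / t;
        }
    }
    return best;
}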

An extensive description and analysis of the test in Theorem 4 appears in [BB09]. In [FBB06] it is shown that it is sufficient to evaluate the maximum in the expression for LOAD(τ) only at points {d_i + k·p_i | k ∈ ℕ, τ_i ∈ τ} that are smaller than the least common multiple of all task periods, LCM_{τ_i ∈ τ}(p_i).

The test validates that the maximum processor load is not higher than an expression that includes the number of processors m and the densities δ_i of the tasks in τ. The test in Theorem 4 subsumes and extends the test in Theorem 3.

In [FBB06] it was shown that for the special case where d_i ≥ p_i for all tasks, LOAD(τ) = Σ_{τ_i ∈ τ} u_i (where u_i = e_i/p_i is the utilization). We show the proof for this theorem, starting with the following lemma:

Lemma 5. For any task τ_i : ⟨e_i, d_i, p_i⟩ such that d_i ≥ p_i,

max_{t>0} ( (1/t)·DBF(τ_i, t) ) = lim_{t→∞} ( (1/t)·DBF(τ_i, t) ) = u_i .   (4.10)

Proof. For a given t, let 0 ≤ r < p_i be the value such that ⌊(t − d_i)/p_i⌋ = (t − d_i − r)/p_i. Then,

(1/t)·DBF(τ_i, t) = (1/t)·( ⌊(t − d_i)/p_i⌋ + 1 )·e_i
                  = (1/t)·( (t − d_i − r)/p_i + 1 )·e_i
                  = ( 1 − (d_i − p_i + r)/t )·(e_i/p_i)
                  = ( 1 − (d_i − p_i + r)/t )·u_i
                  ≤ ( 1 − (d_i − p_i)/t )·u_i .

Since d_i ≥ p_i, (1/t)·DBF(τ_i, t) is bounded by a monotonically increasing function of t that converges to u_i when t → ∞. For (1/t)·DBF(τ_i, t) itself, lim_{t→∞} (1/t)·DBF(τ_i, t) = u_i. Hence, max_{t>0} (1/t)·DBF(τ_i, t) = u_i.

Theorem 6. For any task system τ = {τ_i : ⟨e_i, d_i, p_i⟩}_{i=1}^{N} such that ∀i : d_i ≥ p_i, the following is true:

LOAD(τ) = Σ_{τ_i ∈ τ} u_i .   (4.11)

Proof. In Lemma 5 we proved that for any τ_i such that d_i ≥ p_i,

max_{t>0} ( (1/t)·DBF(τ_i, t) ) = lim_{t→∞} ( (1/t)·DBF(τ_i, t) ) = u_i .

Consequently,

LOAD(τ) = max_{t>0} ( (1/t)·Σ_{τ_i ∈ τ} DBF(τ_i, t) ) = lim_{t→∞} ( (1/t)·Σ_{τ_i ∈ τ} DBF(τ_i, t) ) = Σ_{τ_i ∈ τ} u_i .

The schedulability test under these assumptions is much simpler:

Corollary 7. A task system τ = {τ_i : ⟨e_i, d_i, p_i⟩}_{i=1}^{N} such that ∀i : d_i ≥ p_i is schedulable upon a platform comprised of m unit-capacity processors, provided that:

Σ_{τ_i ∈ τ} u_i ≤ max { µ(τ) − δ_sum(τ, µ(τ)), (⌈µ(τ)⌉ − 1) − δ_sum(τ, ⌈µ(τ)⌉ − 1) } .   (4.12)

By definition, the task that corresponds to a data stream s_i : (q_i, t_i, op_i, l_i) has parameters d_i = l_i and p_i = t_i. Therefore, in stream sets where l_i ≥ t_i for every stream s_i, the corresponding task set satisfies the condition in Corollary 7, and the schedulability test in Corollary 7 can be used for such stream sets. In summary, we have shown an improved schedulability test for stream sets S = {s_i : (q_i, t_i, op_i, l_i)}_{i=1}^{N} where l_i ≥ t_i for every s_i ∈ S.

4.3 Rectangle Method for CPU/GPU Stream Distribution

In the previous sections we described the following principles for work distribution and scheduling to process a set of real-time streams on a CPU/GPU system:

• Partitioned scheduling: each data stream is assigned to a CU (CPU or GPU).

• Batch processing on GPU: packets from streams assigned to the GPU are periodically collected from the input buffers and processed in a 4-stage pipeline. The collection period is the same for all streams and defined by the system. From the point of view of the execution engine, each such collection of packets forms a batch of jobs to be executed by the hardware scheduler on the GPU.

• GPU schedulability test based on performance model: the validity of the schedule is tested by checking whether the collection period exceeds the expected processing time in any of the pipeline stages. The estimated latency of each stage is computed using an accurate performance model of the GPU and interconnect.

• Classic multiprocessor scheduling on CPU: existing multiprocessor scheduling and schedulability testing methods can be used to process streams on a multi-core CPU.

Using these principles, we provide a heuristic solution for the real-time stream processing problem. To complete our solution, it is necessary to define a policy for partitioning the streams between the CPU and the GPU. Each partition must be shown to be schedulable on its target CU. The problem addressed in this section is defined as follows:

Problem 8. For a set of real-time streams S = {s_i : (q_i, t_i, op_i, l_i)}_{i=1}^{N} and a system with a CPU and a GPU, find a partition of S into two sets – one to be processed on the CPU and the other on the GPU – such that each set is schedulable on its target CU.

Unfortunately, even with two partitions, the search space is exponential in the number of streams, which makes exhaustive search impractical. We thus developed a heuristic that reduces the number of tested partitions to polynomial in the number of streams.

4.3.1 Considerations in Workload Distribution Between a CPU and a GPU

A GPU poses several harsh constraints on its workload if the jobs are to be schedulable. It is therefore more efficient to first prune the jobs that are not schedulable on the GPU. Most importantly, invoking the GPU incurs overhead, which, coupled with the latency of data transfers to and from the GPU's memory, determines a lower bound on job processing time. Therefore, any job whose deadline is below this lower bound cannot be processed on the GPU and should be removed from its partition.

We next consider the job with the shortest deadline in the batch of jobs in the GPU's partition. Because the batch completes as a whole, this job effectively determines the deadline for all the jobs in the batch. We denote by d_low the threshold for the shortest deadline of jobs assigned to the GPU's partition. By lowering d_low we restrict the maximum execution time of a batch, thus effectively decreasing the number of jobs that can be assigned to the GPU. By increasing d_low we increase this number, with the penalty of having more short-deadline jobs assigned to the CPU. It makes sense to limit the range of values of d_low to one of the streams' latency limits l_i, zero (0), or infinity (∞). Once the value for d_low is set, it is possible to determine the job collection period T for GPU-assigned streams (Section 4.1), which is used for computing the job sizes.

A simple heuristic is therefore to test all partitions induced by d_low, where all jobs with deadlines above d_low are assigned to the GPU's partition, and all the rest are allocated to the CPU. This heuristic exhaustively tests all the created partitions for different values of d_low. It may fail, however, to find a schedulable partition for inputs with too many short-deadline jobs, even if one exists. For such a workload, decreasing d_low would result in too tight a deadline for the GPU, while increasing d_low would overwhelm the CPU with too many short-deadline jobs. A possible solution is to move a few longer-deadline jobs to the CPU's partition, thereby decreasing the load on the GPU and allowing it to assist the CPU in handling shorter-deadline jobs. The longer-deadline jobs impose no extra burden on CPU scheduling, as short-deadline jobs would. Therefore, a natural extension to the single-threshold heuristic is to add an upper bound d_high to limit the deadlines of jobs assigned to the GPU's partition.

Both heuristics, however, ignore the amount of work per job. Consequently, the GPU may be assigned a highly unbalanced workload, which would seriously decrease its throughput (as explained in Section 4.1), thus rendering the partition not schedulable. A reasonable approach to better load balancing is to remove work-intensive jobs from the GPU's partition. The increased load on the CPU can be compensated for by moving shorter jobs to the GPU, whose overall throughput would thus increase. Deadlines should be taken into account when selecting the work-intensive jobs to move to the CPU, in order to limit the deadline pressure on it.

Good workload distribution is, therefore, a tradeoff between setting a higher threshold on the shortest deadline and removing work-intensive jobs from the GPU's partition. We therefore characterize each stream s_i : (q_i, t_i, l_i, op_i) as ⟨r_i, d_i⟩, where r_i = λ_i·q_i/t_i is the computation rate – a measure of the job size – and d_i = l_i is the relative deadline. Our workload distribution algorithms define the distribution using these properties. The job size of stream s_i in a batch of jobs created for stream set S = {s_i}_{i=1}^{n} is defined as W_i = λ_i·⌈T/t_i⌉·q_i, where T is the job collection period. However, we did not include T in the expression for r_i since T is not a property of a stream. Note that if the number of computations per byte is the same for all streams (∀i : λ_i = λ), then we can omit λ_i from the expression for r_i and define r_i = q_i/t_i, the stream's data rate.

Figure 4.5: A partition using the Rectangle method of N = 10K data streams (represented by dots) with generation parameters: normally distributed rate, µ_R = 1 Mbps, σ_R = 0.5 Mbps; normally distributed deadline, µ_D = 50 ms, σ_D = 15 ms. Streams within the rectangle are assigned to the GPU.

We now present a workload distribution method that takes the above considerations into account.

4.3.2 Rectangle Method

The Rectangle stream partitioning method assigns all streams inside a contiguous area in rate-deadline space to the GPU, and the other streams to the CPU. In this method, the area is bounded by a rectangle, defined by four coordinates: [d_low, d_high, r_low, r_high]. For example, Figure 4.5 shows a partition of 10,000 streams with normally distributed rates and deadlines. Each dot represents a stream. All streams inside the black rectangle are assigned to the GPU, while the others are assigned to the CPU. We see that a stream is assigned to the GPU if its rate is in the range [0 bps, 1.8 Mbps] and its deadline is in the range [27 ms, 81 ms].

The number of rectangles in an n × n space is (n(n−1)/2)² = O(n⁴). We reduce the number of tested partitions to O(n²), first by setting the lower rate bound r_low to 0 (since slower streams facilitate load balancing), and second, by only increasing the upper rate bound r_high and decreasing the upper deadline bound d_high for each value of d_low. The full algorithm is shown in Listing 4.2.

Every partition returned by the algorithm produces deadline-compliant CPU and GPU schedules that were validated using the performance models. Following existing methods for real-time scheduling, it assumes perfect runtime prediction by the performance model [RM00, ER10, Bar06]. However, to compensate for imprecision, a safety margin is taken, which ensures 100% precision at the expense of a slight degradation in schedulable throughput. Furthermore, with a minor change to the algorithm, out of all the schedulable partitions (rectangles), we can choose the one for which the performance model predicts the largest safety (error) margin.

To reduce the number of tested partitions by a factor of O(n), the method makes use of the following property of schedulability: if a set of streams is schedulable on a CU, its subsets are schedulable as well. Similarly, if it is not schedulable, then no containing set is schedulable. It is thus enough to test O(n) partitions in order to find a schedulable assignment, if one exists, for a given lower deadline bound (with r_low fixed to 0).

Performing a full schedulability test on each candidate partition is time consuming, as it includes execution simulation for multiple jobs on the GPU. To reduce the schedulability testing time, we first try to detect non-schedulable partitions using a light and less accurate test. The test simply checks whether the total processing demand on the CPU or the GPU is higher than their maximum processing throughput. If this test detects that any of the CUs would be overloaded, then a full test is no longer necessary and the algorithm can move on to the next partition. Normally, since the algorithm stops as soon as the first valid solution is found, most of the examined partitions are pruned by this test. To further speed up the search for partitions that passed the previous test, GPU schedulability is tested only if the CPU schedulability test succeeded. The latter test is usually much quicker because the CPU performance model and schedulability function are simpler, and the GPU can normally handle many more streams than the CPU.

With O(n²) partitions tested, the search time for stream sets with thousands of streams can still be substantial, and it may be impractical to run the search periodically to adjust for changes in the workload. However, by coarsening the step size of the rectangle borders, it is possible to control the search time at the expense of solution quality; for simplicity, this option was not included in the algorithm. Since some partitions are skipped, a valid partition, or a better valid partition, may not be found even though it exists. With coarsening, the stream set can also be mapped onto a grid defined by the step size, such that each cell contains a small set of streams. This grid enables efficient extraction of the streams inside or outside a rectangle from the full stream set and computation of their total load, which provides an additional speedup. The Rectangle method can, therefore, be implemented in systems where the stream set is occasionally updated.

4.3.3 GPU Performance Optimizations

GPU load balancing

A GPU attains the best performance for SPMD programs, with all the threads having the same execution path and runtime. Multi-stream processing, however, may deviate from this pattern. The difference in the rates of different streams results in greatly varying job sizes in the same thread block and in degraded performance. Figure 4.6 shows that a kernel processing jobs with a wide spectrum of sizes took 48% more time than one processing equal-size jobs with the same total amount of work.

Figure 4.6: Influence of rate variance on kernel execution time

Threads that execute different jobs may follow different execution paths (diverge), resulting in serialized execution. Thus, limited fine-grain parallelism per job may require assigning several jobs to the same warp, where thread divergence and load imbalance might affect performance. We found, however, very little divergence for data stream processing, due to the periodicity of data stream functions. But load imbalance between the threads of a warp is detrimental to performance, since the warp execution time is dominated by the longest job. The distribution of jobs among warps and thread blocks is crucial to kernel performance. We use the following simple algorithm to increase GPU utilization.

1. Create warps with jobs of similar size;

2. Partition warps to M bins such that total job size in each bin is about the same1, where M is the number of SMs;

3. Assign the warps in each of the M bins to thread blocks, such that warps with similar size are assigned to the same block.

Since the thread blocks are scheduled by the hardware, not all thread blocks created from a single bin can be guaranteed to execute on the same SM. Our approach attempts to combine good latency hiding, by creating thread blocks that process jobs of similar size, with a balanced overall load on the SMs, by creating equally-sized bins. Our experience shows that the hardware scheduler can leverage these benefits to find an optimal schedule for the thread blocks. Experimentally, this approach provides better throughput than creating thread blocks with equal total job size and with maximally-similar job size.

1This is a version of the PARTITION problem, which is NP-complete. We use a popular greedy algorithm that provides a 4/3-approximation to the optimal solution (a sketch of this greedy rule follows below).
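The greedy rule referred to in the footnote (step 2 of the algorithm above) can be illustrated as follows: sort the warps by total job size and always place the next warp into the currently lightest bin. This is a generic sketch of that heuristic, not the framework's code; the warp_t.size field stands for the total job size of a warp, an assumed representation.

#include <stdlib.h>

/* Greedy "largest first" partition of warp workloads into m bins of
 * roughly equal total job size (step 2 of the load-balancing algorithm). */
typedef struct { double size; int bin; } warp_t;

static int by_size_desc(const void *a, const void *b)
{
    double x = ((const warp_t *)a)->size, y = ((const warp_t *)b)->size;
    return (x < y) - (x > y);
}

void partition_warps(warp_t *warps, int n, double *bin_load, int m)
{
    for (int b = 0; b < m; b++) bin_load[b] = 0.0;
    qsort(warps, n, sizeof(warp_t), by_size_desc);    /* largest first  */
    for (int i = 0; i < n; i++) {
        int lightest = 0;                             /* emptiest bin   */
        for (int b = 1; b < m; b++)
            if (bin_load[b] < bin_load[lightest]) lightest = b;
        warps[i].bin = lightest;
        bin_load[lightest] += warps[i].size;
    }
}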

Figure 4.7: Choosing the number of thread blocks. Workload: AES-CBC encryption of 3840 data streams with equal rates.

Selecting kernel execution parameters

The total number of thread blocks plays an important role in GPU performance. Figure 4.7 shows the throughput of an AES-CBC encryption kernel as a function of the number of thread blocks on a GPU with 30 SMs. We see that choosing the number of thread blocks as a multiple of the number of SMs maximizes utilization of the cores, whereas other choices cause up to 42% performance loss. Our threads-per-block choice satisfies two conditions: (1) the number of threads per block is maximized to allow better SM utilization; (2) the total number of thread blocks is a multiple of the number of SMs to balance the load between them.
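The two conditions can be met with a small calculation, sketched below. This is an assumed illustration of the rule described in the text: it takes the total thread count, the number of SMs, and the per-block thread limit as inputs (all hypothetical parameter names), and it is not the framework's launch-configuration code.

/* Choose launch parameters so that (1) the number of threads per block is
 * as large as possible and (2) the number of blocks is a multiple of the
 * number of SMs.                                                          */
typedef struct { int blocks; int threads_per_block; } launch_cfg_t;

launch_cfg_t choose_launch(int total_threads, int num_sms, int max_threads_per_block)
{
    launch_cfg_t cfg;
    /* Smallest block count that respects the per-block thread limit ...   */
    int min_blocks = (total_threads + max_threads_per_block - 1) / max_threads_per_block;
    /* ... rounded up to the nearest multiple of the number of SMs.        */
    cfg.blocks = ((min_blocks + num_sms - 1) / num_sms) * num_sms;
    /* Spread the threads evenly; rounding up keeps every thread covered.  */
    cfg.threads_per_block = (total_threads + cfg.blocks - 1) / cfg.blocks;
    return cfg;
}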

Data transfers

As mentioned in Section 4.1.2, incoming data for GPU-assigned streams is automatically stored contiguously in memory and periodically transferred to the GPU in a single data-transfer operation, which is more efficient than doing multiple small data transfers. This is done by pre-allocating a large chunk of GPU memory where contiguous blocks of memory (256 B) are manually allocated on demand to store incoming data from the streams.
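One way to realize the manual on-demand allocation of fixed 256-byte blocks from a pre-allocated chunk is a simple bump allocator over the chunk, sketched below. This is an assumed illustration, not the framework's allocator, and the names are hypothetical.

#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE 256   /* bytes per stream-data block */

typedef struct {
    char  *base;         /* start of the pre-allocated chunk    */
    size_t num_blocks;   /* number of 256 B blocks in the chunk */
    size_t next_free;    /* index of the next block to hand out */
} block_pool_t;

/* Pre-allocate one large contiguous chunk; blocks are then handed out
 * on demand without further allocations.                               */
int pool_init(block_pool_t *pool, size_t num_blocks)
{
    pool->base = malloc(num_blocks * BLOCK_SIZE);
    pool->num_blocks = num_blocks;
    pool->next_free = 0;
    return pool->base != NULL ? 0 : -1;
}

/* Returns the next contiguous 256 B block, or NULL when the chunk is
 * exhausted.  Consecutive allocations are adjacent in memory, so a
 * whole batch can later be transferred in one copy operation.          */
void *pool_alloc_block(block_pool_t *pool)
{
    if (pool->next_free >= pool->num_blocks)
        return NULL;
    return pool->base + (pool->next_free++) * BLOCK_SIZE;
}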

4.3.4 Real-Time Stream Processing Framework

We have implemented a framework for processing real-time data streams that we used to validate and evaluate the Rectangle method, and later, additional methods presented in this work. The framework includes an evaluation infrastructure, an implementation of the Rectangle method and of the CPU and GPU performance models, and an execution engine. The entire framework was implemented in C, and a simulation-only version was also implemented in MATLAB. Since our framework uses an offline scheduler, the simulated schedules exactly match the schedules produced at runtime.

For evaluation, the framework includes an implementation of the AES-CBC 128-bit encryption algorithm as a stream processing operator. The AES 128-bit symmetric block cipher [Nat01] is a standard algorithm for encryption, considered safe, and widely used for secure connections and classified information transfers (e.g., SSL). Several modes of operation exist for block ciphers. Cipher Block Chaining (CBC) is the most common – it enforces sequential encryption of data blocks by passing the encryption result of one block as the input to the encryption of the following block. Therefore, the AES-CBC stream processing operator requires storing the encryption result of the last block as the state for the next invocation of the operator on the stream.

The CPU implementation uses the Crypto++ 5.6.0 open-source cryptographic library with machine-dependent assembly-level optimizations and support for SSE2 instructions. For the GPU, we developed a CUDA-based multiple-stream implementation of AES. The implementation uses a parallel version of AES, in which every block is processed by four threads concurrently. Streaming data is simulated by an input generator that periodically generates random input for each data stream according to its preset rate.

4.4 Evaluation

In this section we evaluate our framework on a workstation with a multi-core Intel CPU and an NVIDIA GPU on thousands of streams.

4.4.1 Experimental Platform

Our platform was an Intel Core2 Quad 2.33 GHz CPU and an NVIDIA GeForce GTX 285 GPU card. The GPU has thirty 8-way SIMD cores and is connected to the North Bridge of an Intel P45 chipset on the main board by a 3 GB/s PCIe bus. During the simulation we dedicate one CPU core to the input generator and another to control the GPU execution. Data is processed on two CPU cores and on the GPU as an accelerator.

We compare the Rectangle method with two baseline techniques: MinRate, which attempts to optimize the scheduling of streams with respect to their rates, and MaxDeadline, which does the same with respect to the deadlines. In MinRate, the streams {s_i : ⟨r_i, d_i⟩} are sorted by rate in non-increasing order. The first k streams (those with the highest rates) are assigned to the CPU and all others to the GPU, where k satisfies Σ_{i=1}^{k} r_i ≤ τ_CPU and Σ_{i=1}^{k+1} r_i ≥ τ_CPU, and τ_CPU is the CPU throughput. In MaxDeadline, streams are sorted by deadline in non-decreasing order, so that the CPU is assigned the streams with the shortest deadlines up to its capacity τ_CPU, and the GPU processes the rest.

To examine the maximal effective throughput of each method, we do not use safety margins for the execution-time predictions in the experiments. As a result, the system may fail to process all the data on time for some workloads that pass the schedulability test. We measure the fraction of data that was processed on time, and use these measurements to determine the maximum throughput of each method. The accuracy of the performance model is evaluated in Section 4.4.4, and recommended safety margins for this system are determined there.
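The prefix-sum selection of k in MinRate can be sketched directly from its definition; MaxDeadline is analogous, with the streams pre-sorted by deadline instead of rate. A hedged C illustration, assuming the rates are already sorted in non-increasing order:

/* MinRate: given stream rates sorted in non-increasing order, assign the
 * first k (highest-rate) streams to the CPU, where k is the longest prefix
 * whose total rate still fits within the CPU throughput tau_cpu; the
 * remaining streams go to the GPU.                                         */
int minrate_split(const double *rate_sorted_desc, int n, double tau_cpu)
{
    double sum = 0.0;
    int k = 0;
    while (k < n && sum + rate_sorted_desc[k] <= tau_cpu) {
        sum += rate_sorted_desc[k];
        k++;
    }
    return k;   /* streams [0, k) -> CPU, streams [k, n) -> GPU */
}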

Figure 4.8: MaxDeadline fails on exponential rate distribution due to inefficient GPU utilization. Workload: N = 12K, ∆_R = Exp, ∆_D = Exp, µ_D = 500 ms; percentage of data processed on time vs. total stream rate (Gbps) for the Rectangle, MaxDeadline, and MinRate methods.

4.4.2 Setting Parameters

System throughput was measured as the highest total processing rate with all the data processed on time. We use the following notations:

N Number of streams

Rtot Total stream rate (Gbps)

µR, µD Average stream rate (Mbps), deadline (ms)

σR, σD Standard deviation of rate (Mbps), deadline (ms)

∆R, ∆D Rate and deadline distribution functions

In the experiments we used Exponential (Exp) and Normal (Norm) distribution functions for rate and deadline. Such workloads are common in data streaming applications. To avoid generating extremely high or low values, we limited generated values to [0.1µ, 30µ] for the exponential distribution, and to [0.002µ, 100µ] for the normal distribution, where µ denotes the average distribution value. Figure 4.5 displays an example of a randomly generated workload with the following parameters:

N = 10K, ∆R = ∆D = Norm, µR = 1Mbps, σR = 0.5Mbps, µD = 50ms, σD = 15 ms.

4.4.3 Results

We tested the efficiency of the Rectangle method on workloads with different distributions of stream rates and deadlines in three experiments. Each workload consisted of N = 12,000 streams. Figure 4.8 shows results of experiments where the average deadline is long (µ_D = 500 ms), the standard deviation of the rate is high, and the distribution is exponential. In this experiment, the throughput of the Rectangle method was 30% higher than that of MaxDeadline and similar to that of MinRate. MaxDeadline suffered from insufficient load balancing on the GPU, caused by simultaneous processing of streams with a wide range of rates. MinRate was effective because all deadlines were long, resulting in GPU processing of big batches.

Figure 4.9: MaxDeadline overloads the CPU with low-deadline streams. MinRate suffers from high batch scheduling overhead because low-deadline streams are processed on the GPU. Workload: N = 12K, ∆_R = Exp, ∆_D = Exp, µ_D = 50 ms; percentage of data processed on time vs. total stream rate (Gbps).

Figure 4.9 presents the performance of processing input with shorter average deadlines (µ_D = 50 ms). The workloads in this experiment are more time constrained, causing lower throughput in all three scheduling methods. MaxDeadline failed to produce a valid schedule for any value of the total rate larger than 2 Gbps, because the CPU was assigned more short-deadline streams than it could handle. For both MaxDeadline and MinRate, GPU throughput was limited to R_tot ≤ 5 Gbps. For MaxDeadline, the reason is inefficient load balancing due to high-rate streams, whereas for MinRate, the limiting factor is the batch scheduling overhead created by short-deadline streams. In comparison, the Rectangle method provides up to 7 Gbps throughput by finding the correct balance between rate and deadline.

Figure 4.10 presents results of experiments using normal distributions for rates and deadlines. Here we allow generation of values within a higher range, as described in Section 4.4.2; however, values are closer to µ on average due to the lower variance. The shorter deadlines greatly reduce the throughput of MinRate, which makes no effort to avoid assigning short-deadline streams to the GPU. In contrast, MaxDeadline assigns these streams to the CPU, and benefits from a lower variance of stream rate than in the previous experiments. However, for low values of total rate, MaxDeadline assigns to the CPU more short-deadline streams than it can handle.

The number of processed data streams defines the amount of parallelism in the workload. We tested the efficiency of our method for 4K streams, where the GPU is underutilized because there are not enough jobs that can execute in parallel. Figure 4.11 shows that the Rectangle method outperformed the baseline methods by finding a balance between the rates and deadlines. Assignment of high-rate streams to the CPU increases the number of parallel jobs on the accelerator. Controlling the deadline increases batch size and reduces batch scheduling overhead.

For our last experiment we created workloads with exponentially distributed rates and deadlines while increasing the number of streams. Figure 4.12 shows that the high overhead of batch scheduling – a result of processing short-deadline streams on the accelerator – reduced the throughput of MinRate to below the measured range.

Figure 4.10: Both baseline methods overload the CPU on low rates. MinRate fails for all rates. Workload: N = 12K, ∆_R = Norm, ∆_D = Norm, µ_D = 50 ms, σ_D = 15 ms; percentage of data processed on time vs. total stream rate (Gbps).

Figure 4.11: Low level of parallelism (4000 streams). The average rate of each stream is higher than in previous experiments. The highest throughput is achieved by the Rectangle method, which balances rate (increasing the number of parallel jobs on the accelerator) and deadline (reducing batch scheduling overhead). Workload: N = 4K, ∆_R = Norm, ∆_D = Norm, µ_D = 50 ms, σ_D = 15 ms.

Figure 4.12: Increasing the number of streams. The Rectangle method finds a schedulable solution for 17% more streams than MaxDeadline. MinRate chokes on the overhead of batch scheduling. Workload: ∆_R = Exp, µ_R = 300 Kbps, σ_R = 180 Kbps, ∆_D = Exp, µ_D = 100 ms, σ_D = 30 ms; percentage of data processed on time vs. number of streams (22,000–30,000; total rate 6.6–9.0 Gbps).

            COPY                KERNEL
            AVG      MAX        AVG      MAX
  CONST     3.3%     4.8%       3.7%     7.0%
  EXP       3.0%     4.7%       5.8%     11.3%

Table 4.1: Performance model estimation error

Figure 4.13: Safety margin for kernel time estimation. No misses with a 20% margin. Workload: N = 12K, ∆_R = Exp, ∆_D = Exp, µ_D = 500 ms (Fig. 4.10); percentage of data processed on time vs. total stream rate (Gbps) for the calculated estimate and for +10% and +20% margins.

4.4.4 Performance Model

The estimation error of our GPU performance model is shown in Table 4.1. Jobs of different sizes were randomly generated according to exponential and constant distributions. For each batch we calculated the estimated kernel time and data copying time, and measured its running time on the GPU. Then, we calculated the average and maximum precision errors. Figure 4.13 shows that with a 20% safety margin no deadlines are missed.

4.4.5 Time Breakdown

We analyzed the CPU and GPU states during processing. The results show that CPU scheduling overhead is less than 3% on all tested workloads. On average, this overhead is 1.4% for streams with constant rate distribution and 2.8% for streams with exponential rate distribution. The CPUs reach full occupancy with constant rates; idling is less than 1%. With the exponential distribution, the assignment puts more load on the GPU and does not fully occupy the CPUs. Therefore, the CPUs have idle periods for low values of total rate.

In the GPU, we see that processing time is not linear in the load. This is most apparent for the constant rate, where execution time grows more slowly than load. Figure 4.14 shows a breakdown of GPU relative execution time for three pipeline stages: copy input to accelerator, kernel execution, and copy output to main memory. We see that in the case of constant rate distribution, the running time of the kernel is similar to the total copying time, about 50% of the total time. For exponentially distributed streams, the share of kernel execution time is significantly larger (72%), because batches with higher variance of job size do not fully utilize the GPU.

Figure 4.14: GPU breakdown of pipeline stages (copy input, kernel, copy output) for exponential (EXP) and constant (CONST) rate distributions

4.4.6 How Far Are We From the Optimum?

Figure 4.15 presents AES-CBC performance results on a single GPU for workloads consisting of batches with equal-size jobs. Since scheduling is trivial in this case, the results can be considered optimal for the given total aggregated rate. The columns show the throughput results for the following execution modes:

1. A GPU kernel executes one large batch of jobs.

2. A kernel executes a stream of relatively large batches.

3. A GPU executes a stream of relatively large batches, including memory copies from/to the CPU memory.

4. A CPU executes a stream of jobs.

According to these results, the performance of dynamic data processing on the GPU is 5% lower than for static processing. This is a result of additional accesses to the off-chip memory required for streaming data. Dynamic data is stored in memory as a chain of data blocks, and threads must perform memory accesses to follow the pointers between these blocks. This overhead does not exist for static data, as jobs are simply stored sequentially. Interestingly, there is an order-of-magnitude difference in throughput between the GPU and a CPU core for this algorithm.

The throughput of a system with a GPU and two CPU cores for real-life workloads is compared to a system with three CPU cores in Figure 4.16 (one of the four cores is used for input generation). In the experiments, both rates and deadlines of 12K streams were generated using the constant, normal, and exponential distributions, with a 50 ms average deadline. The chart shows that our system achieves up to a 4-fold speedup over a CPU-only system even for streams with 50 ms deadlines.

Figure 4.15: Optimal results for a single GPU. AES-CBC throughput (Gbps): 13.5 for a GPU kernel on static data, 12.8 for a GPU kernel on dynamic data, 10.8 for a GPU kernel including memory copies to/from CPU memory, and 1.1 for a CPU core.

Figure 4.16: Throughput of GPU+2CPUs is 4 times higher than that of 3CPUs. Combined CPU+GPU throughput (Gbps) for real-life distributions with 50 ms deadlines: 13.0 (Const), 9.0 (Norm), and 7.0 (Exp) for GPU+2CPUs, versus 3.3 for 3CPUs; the bars also indicate the CPU contribution.

4.5 Related Work

The problem of hard real-time stream processing can be mapped to the domain of real-time periodic task scheduling. A series of studies dealt with different aspects of task scheduling and load balancing in CPU-GPU systems. Joselli et al. [JZC+08, JZC+09] proposed automatic task scheduling for CPU and GPU, based on load sampling. These methods are not suitable for hard real-time tasks, as there is no guarantee of task completion times. In specialized applications, such as FFT [OEMM08] or the matrix multiplication library [OKKY07], application-specific mechanisms for task scheduling and load balancing were developed. Dynamic load balancing methods for single- and multiple-GPU systems using task queues and work-stealing techniques were also developed [CVKG10, CT09, TPO10, ATNW11]. In [ATNW11], a set of workers on the accelerator execute tasks taken from local EDF (earliest-deadline-first) queues. These approaches cannot be used in our case, as it is very hard to guarantee hard real-time schedulability in such a dynamic environment. An important work by Kerr et al. [KDY10] describes Ocelot, an infrastructure for modeling the GPU. Modeling can be used in our work to create a precise performance model without benchmarking. Kuo and Hai [KH10] presented an EDF-based algorithm to schedule real-time tasks in a heterogeneous system. Their work is based on a system with a CPU and an on-chip DSP. This algorithm is not compatible with our system model because of the high latency of CPU-to-GPU communication. The problem of optimal scheduling of hard real-time tasks on a multiprocessor is computationally hard [Ram02]. Our work borrows some partitioning ideas from [RM00].

4.6 Conclusions

We have shown the first hard-deadline data stream processing framework on a heterogeneous system with CPU cores and a GPU accelerator. We have developed the Rectangle method, an efficient polynomial-time scheduling approach that uses simple geometric principles to find schedulable stream assignments. The Rectangle method's throughput was shown to be stable across different workloads and is higher than that of the compared baseline methods in all of the experiments. It is especially preferable for workloads with shorter deadlines and more streams. In Chapter 5 we extend the Rectangle method to systems with multiple CPUs and GPUs.

4.7 Rectangle Method Algorithm Pseudo-code

Input:  Workload W = {si : ⟨ri, di⟩}, n - length of W
Output: A schedulable Rectangle distribution of W:
        (Wcpu, Wgpu) if successful, or ∅ otherwise.

sorted_deadline = SortByDeadline(W)
sorted_rate     = SortByRate(W)
total_load      = SumRate(W)
/* indexes into sorted_deadline and sorted_rate:            *
 * [d_low,d_high] - deadline range of Wgpu                  *
 * [r_low,r_high] - rate range of Wgpu                      */
r_low = 1                        /* minimum rate is fixed at 0      */
for d_low = 1..n                 /* array indexes start from 1      */
    r_high = 0                   /* initial high rate - minimum     */
    d_high = n                   /* initial high deadline - maximum */
    min_deadline = GetDeadline(sorted_deadline[d_low])
    /* while loop: increase r_high and/or decrease d_high */
    while r_high <= n AND d_high >= d_low
        /* schedulability test based on load vs. throughput *
         * (a light and efficient opportunistic test)       */
        gpu_load = LoadInsideRect(d_low,d_high,r_low,r_high)
        cpu_load = total_load - gpu_load
        if cpu_load > MaxThroughput('CPU')
            r_high = r_high + 1       /* reduce load on CPU           */
            continue                  /* skip to next configuration   */
        end
        if gpu_load > MaxThroughput('GPU')
            d_high = d_high - 1       /* reduce load on GPU           */
            continue                  /* skip to next configuration   */
        end
        /* schedulability test based on the performance model  *
         * (more computationally heavy, but more accurate than *
         * the load vs. throughput test)                       */
        cpu_streams = StreamsOutRect(d_low,d_high,r_low,r_high)
        if IsSchedCPU(cpu_streams) == false
            r_high = r_high + 1
            continue
        end
        gpu_streams = StreamsInRect(d_low,d_high,r_low,r_high)
        if IsSchedGPU(gpu_streams,min_deadline) == false
            d_high = d_high - 1
            continue
        else    /* partition found */
            return (cpu_streams,gpu_streams)
        end
    end
    if d_high == n
        /* the CPU cannot handle this or a higher d_low; abort */
        break
    end
end
return ∅

Listing 4.2: Rectangle method algorithm pseudo-code

Chapter 5

Stream Processing in a Multi-GPU System

Current data processing systems employ multiple accelerators on each machine. Placing multiple GPUs on a single board saves power, floor space, and maintenance costs, and allows performance scaling with low latency and good load balancing.

The Rectangle method does not naturally support multiple GPUs. It is designed to offload a small portion of the workload to a CPU in order to maximize the efficiency of a GPU. In contrast, with multiple GPUs, it is not enough to maximize the throughput of any one GPU by optimizing its share of the workload. Rather, it must be decided how to partition the workload between the GPUs and which portion of each partition to offload to the CPUs, in an attempt to maximize the total throughput. Additionally, the temporal analysis of data transfers in a multi-GPU environment is significantly more complicated than in single-GPU systems due to contention and interference of simultaneous data transfer operations on the buses of the interconnect. Scheduling input and output data transfers between the main memory and the GPUs in a way that guarantees their completion on time is essential for optimizing the workload distribution between the GPUs. We address this important problem in Chapter 6.

On the basis of the Rectangle method, we developed several methods for data stream distribution between multiple CPU and GPU devices. The methods present different compromises between the number of tested partitions and the achievable throughput. The methods are illustrated in Figure 5.1 and summarized in Table 5.1.

5.1 Virtual GPU

For multiple GPUs, we apply the Rectangle method with a virtual GPU configuration. This virtual GPU represents the unification of the compute and memory resources of all the GPUs in the system under a single GPU device. For example, the unification of two GPUs with 30 SMs each, 1 GByte of on-board memory, and a 3 GByte/sec PCIe connection would be represented by a virtual GPU with 60 SMs, 2 GBytes of on-board memory, and a 6 GByte/sec PCIe connection.
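The unification itself is just an aggregation of device parameters, as in the following C sketch. The fields are assumptions chosen to match the example above, not the framework's device-specification format.

/* Build a virtual GPU specification by summing the resources of the
 * physical GPUs (SMs, on-board memory, PCIe bandwidth), as in the
 * two-GPU example above.                                              */
typedef struct {
    int    num_sms;
    double mem_gbytes;
    double pcie_gbytes_per_sec;
} gpu_spec_t;

gpu_spec_t make_virtual_gpu(const gpu_spec_t *gpus, int count)
{
    gpu_spec_t v = { 0, 0.0, 0.0 };
    for (int i = 0; i < count; i++) {
        v.num_sms             += gpus[i].num_sms;
        v.mem_gbytes          += gpus[i].mem_gbytes;
        v.pcie_gbytes_per_sec += gpus[i].pcie_gbytes_per_sec;
    }
    return v;
}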

Figure 5.1: Illustration of workload distribution methods for multiple GPUs in the rate-deadline space: Virtual GPU, Split Rectangle, Joined Rectangles, and Loose Rectangles

Multi-GPU method     Properties (incremental)                              Tested partitions
Virtual GPU          Simple extension                                      O(n²)
Split Rectangle      Reduced batching overhead or improved load balancing  O(n²·g·log n)
Joined Rectangles    Reduced batching overhead and improved load balancing O((n choose g+1)·g·log n)
Loose Rectangles     Increased utilization for deadline-constrained CPU    O((n choose 2g)·g·log n)

n – number of streams, g – number of GPUs

Table 5.1: Summary of workload distribution methods for multiple GPUs

Figure 5.2: Workload distribution to multiple GPUs in the Virtual GPU method. Streams are partitioned between the SMs of a virtual GPU. This maps each stream to an SM that corresponds to a real GPU.

Given some workload, if the Rectangle method produced a valid CPU/GPU partition, then it found the GPU partition schedulable on the virtual GPU. Figure 5.2 illustrates the part of the schedulability test where streams are partitioned between SMs (mapped to SMs). If the workload is found schedulable, then the streams are partitioned between the CPUs and the (non-virtual) GPUs. Each stream in the virtual-GPU partition is assigned to the GPU that corresponds to the SM it is mapped to. The number of tested workload partitions is the same as in the Rectangle method, O(n²).

When jobs are actually processed on a GPU, the produced stream-to-SM mapping is not enforced, since work on the GPU is scheduled by a hardware scheduler. The estimated execution time could thus be shorter than the actual time, and this error could result in the workload being falsely characterized as schedulable. The performance model avoids underestimation of execution times by using a scheduling policy that trivially assigns the same number of randomly chosen streams to each SM. In contrast, the hardware scheduler of a GPU dynamically schedules thread blocks for execution when resources become available, and although the scheduling policies of GPUs are not disclosed, we assume that they are superior to the simple scheduling policy of the performance model. We evaluate the performance of this method in Section 5.5.

5.2 Split Rectangle

When mapping streams to SMs, the Virtual GPU method does not attempt to prioritize the SMs of one GPU over those of another. As a result, the streams with deadlines that are equal to or close to d_min are distributed among all GPUs. Effectively, on every GPU the minimum stream deadline is approximately d_min. Since the batch collection period is (1/4)·d_min, batch processing on the GPUs is synchronous.

This does not have to be the case. In fact, it might be beneficial to use different batch collection periods for different GPUs. For example, suppose that a workload S with a minimum stream deadline d_min = 8 ms is partitioned between GPU1 and GPU2. Using the Virtual GPU method, the collection period for both GPUs is 8/4 = 2 ms. However, if all streams with deadlines in [8 ms, 20 ms] are assigned to GPU1 and all streams with deadlines higher than 20 ms to GPU2, then the batch collection period will be 2 ms for GPU1 and 5 ms for GPU2. The throughput of GPU2 might then be higher in the second case. Similarly, it may be beneficial to split the rectangle between the GPUs according to rate ranges instead of deadline ranges.

This method uses the Rectangle method with a modified GPU schedulability function. Whenever schedulability on the GPU is tested for a set of streams bounded by a rectangle, the function attempts to split the rectangle into vertical (deadline-range) or horizontal (rate-range) stripes, and assigns each stripe to a different GPU. If a split is found in which all stripes are schedulable on their respective GPUs, either by rate or by deadline, then the given rectangle is considered to be schedulable. For each of the O(n²) calls to the schedulability function by the Rectangle method, the modified function makes O(g·log n) schedulability tests to find the best split. Hence, the number of tested partitions is O(n²·g·log n).

5.3 Joined Rectangles

In the Split Rectangle method, the workload is distributed between the GPUs in a way that optimizes throughput by one parameter: rate or deadline. Here, we select the optimal rate bound for every vertical strip (range of deadline values). This approach attempts to maximize the utilization of each GPU for the input range, while taking advantage of the per-GPU minimum deadline. An exhaustive search for a schedulable partition can be done efficiently by selecting g + 1 deadline range bounds, and then, for every range i, using binary search to find the maximum r_high that defines a strip schedulable on GPU i. However, the search complexity in this case is significantly higher than in the previous methods, and the GPU-assigned workload is still bound to a contiguous deadline range. The number of tested partitions is O((n choose g+1)·g·log n).

5.4 Loose Rectangles

In workloads for which the CPU cannot process all the short-deadline streams, the Rectangle method assigns some long-deadline streams to the CPU in exchange for reducing the shortest deadline for the GPU. With multiple GPUs, it makes sense to assign all the short-deadline streams that were not assigned to a CPU to one GPU, while the other GPUs benefit from longer minimum deadlines in their working sets. Here, we do not enforce a contiguous deadline range on the area processed by the GPUs. Each GPU is assigned a rectangle, without overlapping. A search for a schedulable partition can be done in a manner similar to the Joined Rectangles method, but with 2g range bounds. This method achieves the best overall throughput, but its search complexity is high. The number of tested partitions is O((n choose 2g)·g·log n).

5.5 Evaluation of Virtual GPU

In this section we evaluate our framework using modern hardware on thousands of streams.

5.5.1 Experimental Platform

Our platform was an Intel Core2 Quad Q8200 2.33 GHz CPU and two NVIDIA GeForce GTX 285 GPUs with CUDA 4.1. Each GPU has thirty 8-way SIMD cores and is connected to the North Bridge of an Intel P45 chipset on the main board by a 3 GByte/sec PCIe bus. We used our real-time stream processing framework as the testing platform. The workload is based on the AES-CBC data stream encryption application. During the simulation we dedicate one CPU core to the input generator and another to control the GPU execution. Data is processed on two CPU cores, using the GPUs as accelerators. We also developed a tool that estimates the system throughput for a given type of workload on an arbitrary configuration of compute units. We use this tool to (1) compare the performance of our method to simpler heuristics that we consider as baselines in Section 5.5.3, and (2) estimate how system performance scales for up to 8 GPUs in Section 5.5.4.

5.5.2 System Performance

We tested the efficiency of our method on workloads with different distributions of stream rates and deadlines in five experiments using one and two GPUs. Each workload consisted of 12,000 streams. In each workload we used the constant, normal, or exponential distributions to randomly generate the rates and the deadlines. Such workloads are common in data streaming applications, and test the framework for robustness to large variance and asymmetric distributions. To avoid generating extremely high or low values, we limited generated figures to [0.1µ, 30µ] for the exponential distribution, where µ denotes the average distribution value, and to [0.002µ, 100µ] for the normal distribution. In each experiment, we measured the maximum throughput (sum of stream rates) for the given workload specification. The average stream rate can be calculated by dividing the maximum throughput by 12,000. For brevity, we will use the term throughput instead of maximum throughput.

Figure 5.3 shows the throughput of the framework with 2 CPUs and one or two GPUs, on 5 different workloads. The overall throughput with two GPUs was up to 87% higher than with one GPU, and the GPU-only throughput (the total rate processed by the GPUs) increased by up to 106%. An increase of more than 100% in GPU-only throughput may occur together with an increase of less than 100% in CPU-only throughput when the total speedup is not higher than 100%.

In the first experiment, equal rates and deadlines were assigned to all streams. The framework achieved the highest throughput for this workload since there is no load imbalance for GPU jobs and a relatively large batch collection period can be used. In the second experiment, the framework achieved 97% of the maximum throughput for a realistic workload of streams with normally distributed rates and constant 40 ms deadlines. In workloads with exponential distribution of rates (experiments 3 and 5), stream rates were as high as 150 Mbit/sec. The gain in throughput from an additional GPU for such workloads is smaller because the number of streams with very high rate that can be offloaded to the CPUs is limited. Such streams can create great load-balancing problems for the GPU. In the fourth experiment, the framework finds a provably deadline-compliant schedule for a workload that contains streams with sub-millisecond deadlines, with throughput as high as 94% of the maximum.

5.5.3 Comparison to Baseline Methods

We compared our method with two baseline techniques: MinRate and MaxDeadline. MinRate attempts to optimize the scheduling of streams with respect to their rates, and MaxDeadline does the same with respect to the deadlines. In MinRate, streams are sorted by rate, and the streams with the highest rates are assigned to the CPUs, up to their capacity, and the rest are assigned to the GPUs. In MaxDeadline, streams are sorted by deadline, and the streams with the shortest deadlines are assigned to the CPUs, up to their capacity, and the rest are assigned to the GPUs.


[Figure 5.3 chart: Performance of AES-CBC for 1-2 GPUs + 2 CPUs on 12,000 streams; GPU and CPU throughput in Gbit/sec for five workloads with constant, normal, and exponential rate and deadline distributions.]

Figure 5.3: An additional GPU increases the total throughput by up to 87%

Between GPUs, streams are equally distributed in random order.

We estimated the system throughput of each method by applying a performance estimation tool that we developed. The tool simulates the distribution of a workload to a set of compute units and estimates the resulting throughput. Its input is a set of CU specifications (CPUs and GPUs), the desired workload distribution method, and a specification of the workload. The tool generates a large pool of stream configurations following the given workload specification, POOL = {s_i : ⟨deadline_i, rate_i⟩}, and finds the maximum N for which the workload {s_1, ..., s_N} has a schedulable distribution (using the method) between the CUs. Internally, the CU specifications are passed to the performance model, which is used in order to determine whether a distribution is schedulable. The tool uses benchmark results and specification data of GPUs to make the performance model as precise as possible. Note that the tool uses a workload specification format in which the number of streams is not specified. Therefore, its throughput results may not match the results in Figure 5.3.

We ran a series of five experiments with different distributions of stream rates and deadlines for each of the compared methods. Figure 5.4 shows the throughput of our method (Virtual GPU), MinRate, and MaxDeadline. The table below the graph describes the settings used for generating the streams. Our method provided higher throughput in all of the experiments. For the const workload, all the methods provided similar throughput since the deadlines and rates of the streams were the same. MinRate was never better than MaxDeadline. Since this method ignores deadlines, it assigned streams with very short deadlines to the GPUs, thus imposing a very short and inefficient processing cycle. The performance of MaxDeadline was better, but the schedules it produces suffer from load-balancing problems of varying severity. The performance of Virtual GPU was higher by up to 51% than that of MaxDeadline.


[Figure 5.4 chart: throughput of Virtual GPU, MaxDeadline, and MinRate for workloads const, uni, norm, norm2, and exp.]

Stream-generation settings for the workloads in Figure 5.4:

  Workload   Rate [Mbps]                                      Deadline [ms]
  const      3.1                                              20.5
  uni        uniform in [0.2, 6]                              uniform in [1, 40]
  norm       normal with AVG=3.1 and STD=1.1                  normal with AVG=20.5 and STD=7.5
  norm2      normal with AVG=1.7 and STD=0.9                  normal with AVG=11.0 and STD=6.2
  exp        exponential with STD=1.1, truncated at 6 (max)   exponential with STD=7.1, truncated at 1 (min)

Figure 5.4: Comparison to baseline methods for 2 GPUs

5.5.4 Scaling to More GPUs

We tested how system throughput scales with up to 8 GPUs by applying the performance estimation tool described in Section 5.5.3. We ran a series of five experiments, the settings for which are shown in the table in Figure 5.4. We used configurations of two CPU cores and between 1 and 8 GPUs, with the CPU and GPU specifications described in Section 5.5.1. Since the number of CPUs is kept constant, the purpose of these experiments is to show how the throughput of the GPUs scales with the number of GPUs rather than how the total throughput scales.

Figure 5.5 shows the throughput of the system for experiment const. The throughput scales perfectly in the number of GPUs because every GPU processes the same number of streams as the first.

Figure 5.6a shows the throughput of the system for exp. A sample of 200 such streams is illustrated in Figure 5.6b. The system does not scale perfectly, with a x4.3 speedup in GPU-only throughput for 8 GPUs. This might be surprising at first, since streams are drawn from a pool with randomly generated values, so each additional GPU would apparently achieve similar utilization with a similar number of additional streams. However, as shown in Figure 5.6c, the minimum deadline of streams on the GPUs, d_min, gets shorter as the number of GPUs is increased. As a result, the batch collection period T = d_min/4 gets shorter, and the batching overhead increases (as explained in Section 4.1). The reason for the decrease in d_min lies in the limited capacity of the CPU.


[Figure 5.5 chart: GPU and CPU throughput (bars) and scaling factor (gray line) for 1-8 GPUs.]

Figure 5.5: Perfect scaling (gray line): all streams have a 20.5ms deadline and 3.1Mbit/sec rate

As more streams with short deadlines are drawn from the pool and assigned to the CPU by the Rectangle method, streams with slightly longer deadlines that were previously assigned to the CPU get reassigned to the GPU. If the number of CPUs scaled with the number of GPUs, then we would expect to get perfect scaling of the total throughput. In this experiment, the batch collection period gets as low as 0.4 ms. The minimum and maximum scaling observed in all of the experiments is shown in Figures 5.6a and 5.5, respectively.

5.5.5 Transfer Rates

Table 5.2 presents the maximum transfer rates between the main memory (Host) and the GPUs (Devices), and compares them to the maximum kernel performance for AES-CBC. One can see that the bandwidth of two GPUs is on the order of twice the bandwidth of a single GPU. This is in line with the bandwidth in the hardware model: each GPU gets half of the 16 PCI-Express lanes, which connect the GPUs to the North Bridge, which is in turn directly connected to the memory. The GeForce-series GPU cards have one DMA engine and cannot work in full-duplex. Hence, the rate of data transfer to the device and back is half the average uni-directional rate. Data transfers are a bottleneck of the system for AES-CBC when the kernel runs at maximum performance.

Figure 5.7 presents data transfer rates for a range of chunk sizes using one and two GPUs. For chunks of 4 KiB or smaller, the transfer time is roughly constant at around 20 µs, so the transfer rate is accordingly very low (up to 0.22 GByte/sec). For larger chunk sizes, the rate increases with size, reaching 95% of the maximum with chunks of 2 MiB (per GPU). Rather than always assuming the maximum rate, we use a transfer-rate lookup table in our performance model to predict data-transfer time. This is especially important when deadlines are short: such deadlines impose short batch processing cycles, so data is transferred in small chunks, and the transfer rates are lower than the maximum.
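The lookup-table prediction can be sketched as follows; the benchmark points below are hypothetical placeholders (the real table is built from measurements such as those in Figure 5.7), and linear interpolation between points is an assumption of this sketch rather than the exact rule used in the framework.

import bisect

# Hypothetical benchmark points: (chunk size in bytes, measured rate in GByte/s).
BENCH = [(4 << 10, 0.22), (64 << 10, 1.2), (512 << 10, 2.6), (2 << 20, 3.2), (16 << 20, 3.35)]

def transfer_rate(chunk_size: int) -> float:
    """Interpolate the expected transfer rate (GByte/s) for a chunk size."""
    sizes = [s for s, _ in BENCH]
    if chunk_size <= sizes[0]:
        return BENCH[0][1]
    if chunk_size >= sizes[-1]:
        return BENCH[-1][1]
    i = bisect.bisect_left(sizes, chunk_size)
    (s0, r0), (s1, r1) = BENCH[i - 1], BENCH[i]
    frac = (chunk_size - s0) / (s1 - s0)
    return r0 + frac * (r1 - r0)

def transfer_time_ms(chunk_size: int) -> float:
    """Predicted transfer time in milliseconds for one chunk."""
    rate_bytes_per_s = transfer_rate(chunk_size) * 1e9
    return chunk_size / rate_bytes_per_s * 1e3

if __name__ == "__main__":
    for size in (4 << 10, 256 << 10, 2 << 20):
        print(size, round(transfer_time_ms(size), 4), "ms")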

5.5.6 Accuracy of Performance Model

The estimation error of our GPU performance model for the workloads described in Section 5.5.2 is shown in Table 5.3.


[Figure 5.6 charts: (a) Throughput (bars) and scaling (gray line) for 1-8 GPUs; (b) Sample of 200 streams; (c) Minimum deadline on the GPU.]

Figure 5.6: GPU speedup is sub-linear due to decreasing minimum stream deadline. Deadlines and rates were generated using the truncated exponential distribution.

            D-to-H   H-to-D   H-to-D-to-H   GPU kernel
  1 GPU     2.77     3.355    1.517         1.875
  2 GPUs    5.31     5.81     2.775         3.75
  Speedup   x1.92    x1.73    x1.83         x2

Table 5.2: Maximum transfer rates and kernel throughput in GByte/sec (G = 10^9)

[Figure 5.7 charts: transfer rate on two GTX 285 GPUs with a P45 chipset; transfer rate in GByte/s vs. chunk size in bytes (log scale) for host-to-device (HD), device-to-host (DH), and round-trip (HDH) transfers, compared to the maximum kernel rate; (a) 1 GPU, (b) 2 GPUs.]

Figure 5.7: Data transfer rate to the GPU and back is lower than maximum kernel throughput for AES-CBC


            Kernel           H-to-D copy      D-to-H copy
            MIN     MAX      MIN     MAX      MIN     MAX
  1 GPU     -2%     15%      -2%     17%      -3%     17%
  2 GPUs    -2%     21%      -6%     -2%      -6%     2%

Table 5.3: Performance model estimation error

A negative error means that the measured execution time was higher than the estimate. For example, the measured execution time of the kernel was never higher than the estimated time by more than 2%. In the framework, we leave a safety margin of 10% to account for performance model errors. As the table shows, the performance model never estimated any time at less than 94% of the measured time. When the kernel runtime is very short (less than 2 ms) and the batch is composed of jobs with a wide range of sizes (experiment 5), the estimation error is large. This can be explained by: (1) the inaccurate estimation of small constant factors, which results in a large relative error when the runtime is short, and (2) the deliberately simplistic scheduling policy in the performance model (as described in Section 4.1.2). Interestingly, there was a 17% error in the prediction of small data transfers with one GPU despite the use of a lookup table for the expected transfer rate. Apparently, the actual time of the data transfers was longer than expected because it was calculated including a CUDA device synchronization command, while in the benchmark the data transfer times were measured more accurately using events.

5.6 Summary

The use of multi-GPU systems for real-time data stream processing presents new challenges in work distribution and data transfer scheduling. This chapter presented four new methods for data stream distribution between multiple CPUs and GPUs. The methods present different compromises between the number of tested partitions and the achievable throughput. For single-GPU systems, the Rectangle method presented in Chapter 4 aims to maximize the system throughput by offloading streams that reduce GPU efficiency to the CPU. In contrast, with multiple CUs (CPUs and GPUs), the cumulative throughput of the GPUs (and the system throughput in general) needs to be maximized, so it is necessary to decide how the streams will be distributed between the GPUs. The GPU throughput can be optimized, through stream distribution, by reducing load imbalance in the workload of each GPU and by allowing longer processing periods on the GPU, which results in more efficient data transfer and reduced kernel invocation overhead.

When using GPUs in asynchronous processing pipelines, it is difficult to schedule the CPU-GPU data transfers in a way that does not violate the timing constraints. Estimation of the time to complete a data transfer is complicated by possible contention in the interconnect caused by other asynchronous data transfers that happen to run at the same time, and made more difficult by the complex architecture of the interconnect. We address this important problem in Chapter 6.


The first method we presented, Virtual GPU, solves the data transfer scheduling problem by enforcing a synchronous operation mode for the GPUs, hence avoiding unexpected interference between data transfers to (or from) different GPUs. The method makes use of synchronous data transfers to all the GPUs, which causes bus contention but also achieves a higher data transfer rate and predictable transfer times.

We evaluated the method using a real system consisting of a four-core CPU and two GPUs, and ran AES-CBC encryption to process different streams of data with a variety of deadlines. We performed extensive experiments on thousands of streams with different workloads, and used a software tool to estimate the performance of the algorithm on system configurations with up to 8 GPUs. Our current framework allows up to 87% more data to be processed per time unit than the framework in our previous work by taking advantage of a second GPU, with up to a 106% increase in GPU-only throughput. We also show that by adding two GPUs to a quad-core machine, we get a 7-fold throughput improvement over highly optimized CPU-only execution.



Chapter 6

Data Transfer Scheduling on the Interconnect

6.1 Introduction

In this chapter we present the design of our scheduler and execution engine for data transfers. Scheduling the data transfers is necessary to guarantee deadline-compliant execution in the multi-GPU data stream scheduling methods presented in the previous chapter. The scheduler is based on a new approach where the basic unit of work sent for execution (a job) is a batch of data-block transfer operations.

Movement of data between memories and compute units is an integral part of data stream processing. In an RTSPS, data transfers and computations must be carefully scheduled to maintain logical correctness and guarantee that the processing latency bound of each stream is not exceeded. In the previous chapter, we showed that a method for scheduling CPU-GPU data transfers would enable advanced work distribution techniques in multi-GPU RTSPSs.

Studying the problem of scheduling data transfers in real-time systems is important for a number of reasons:

• The interconnect can be a bottleneck for performance. Figure 6.1a compares the growth of computational throughput to the growth of CPU-GPU bandwidth for compute nodes with Tesla-series GPUs¹. The figure shows that the compute power is growing much faster than the communication bandwidth. The gap increases substantially for multiple GPUs. As shown in Figure 6.1b, the ratio between aggregate computation and data transfer rates grows with the number of GPUs². For example, when scaling from four to eight GPUs in a dual-socket system, the CPU-GPU bandwidth remains the same because the system multiplexes GPU pairs on the same PCIe bus, while the compute throughput doubles. The development of data transfer scheduling methods that make efficient use of the available bandwidth is, therefore, crucial for high performance.

¹ The data represents maximum values, according to manufacturer specifications. Throughput values are for floating-point operations, such as ADD, MUL, and FMA. Bandwidth values are for unidirectional data transfer. The chart axes are scaled proportionally, for fair gradient comparison.
² Using a dual-socket system for four GPUs or more.


[Figure 6.1 charts: (a) Growth of GPU compute vs. data transfer bandwidth: GPU computational throughput in GOp/sec (C2050, M2090, K20, K20X, K40) and CPU-GPU bandwidth in GB/s (PCIe 2.0, PCIe 3.0), Oct-09 to Nov-13; (b) Multi-GPU compute vs. data transfer bandwidth: compute-to-bandwidth ratio in Op/Byte for 1, 2, 4, and 8 GPUs over the same period.]

Figure 6.1: Compute throughput grows faster than data transfer bandwidth in CPU-GPU systems

• Data transfer time is an integral part of the processing latency. Data transfers are essential for getting the data to and from the compute units for processing; hence, the transfer time is included in the streams’ processing latency and must be taken into account when checking that timing constraints are met.

• Heterogeneous nodes have become standard in high-performance systems. Heterogeneous compute nodes that consist of an interconnected set of CPUs, massively-parallel accelerators, and high-speed network interfaces have become standard in high-performance systems. It has been noted that collaboration between different CUs provides benefits in performance and energy consumption for applications in various domains, including numerical methods, image and video processing, data mining, database systems, and computational science [MV15]. Despite this trend, the state of the art in real-time scheduling theory for heterogeneous systems is not fully developed [Rar14]. The existing research in this field leans heavily towards scheduling the execution of tasks on compute units (e.g., [Rar14, Fun04, AR14, NWB11, TRC11]), and much less attention has been given to scheduling of data transfers on the system interconnect. Due to the centrality of data transfer for CU collaboration, we find data transfer scheduling an important topic of study.

• New hardware models are required for effective scheduling of modern interconnects. In multiprocessor real-time systems with a distributed topology, the interconnect serves as a shared resource that is managed by a data transfer scheduler.


Traditionally, interconnect schedulers in real-time systems assume simple topologies, such as a fully connected network or a single shared bus, where data transfers are performed one at a time to avoid contention [DAN+11, RAEP07, LS86]. Sinnen and Sousa [SS05] studied the problem of communication contention in task scheduling. They considered the problem of scheduling a program that is defined as a directed acyclic graph (DAG), where a vertex represents a computational task and an edge represents communication between tasks, on a network of point-to-point buses and shared buses. The communication scheduling is performed at the packet level, and it is assumed that a bus transmits one packet at a time. Contention is considered at the endpoints (limited number of connections) and in the network (occupied buses). In CPU-GPU systems, the packet scheduling algorithms are embedded into the interconnect and fixed, which excludes the use of the algorithms proposed in [SS05]. The scheduling problem in CPU-GPU systems concerns application-level data transfers, which can execute concurrently on the same bus, splitting its bandwidth; hence, the model proposed in [SS05] is not appropriate in this case. Hsiu et al. [HLK09] considered real-time task scheduling in a multi-layer bus system. A multi-layer bus system consists of heterogeneous processing elements that communicate with one another via multi-layer buses. Such a bus includes several layers, and each layer can execute one communication task at a time. This model also does not fit because it does not reflect dynamic bandwidth redistribution.

Indeed, none of these models fits the interconnect of a system with multiple CUs, where transferring data on different routes in parallel is essential to achieving high bandwidth.

To make use of parallel data transfer, one could suggest the multiprocessor model, where data routes are processors that execute data transfer tasks. However, this model does not fit the interconnect either, because it assumes that the processors are homogeneous and have constant throughput. The interconnect, on the other hand, is heterogeneous, as it is composed of several domains that use different communication technologies (e.g., QPI and PCI Express), and the data throughput on each route changes in accordance with bus contention. A new hardware model for the interconnect is therefore required for effective scheduling of data transfers in RTSPSs.

Recall that the processing model of data streams on a GPU is a synchronous pipeline that includes data aggregation in main memory, data transfer to local GPU memory, kernel execution, and transfer of results to main memory. From the point of view of the scheduler, within every period of the pipeline, a chunk of data needs to be transferred between the main memory and the GPU. In this chapter we consider a more general problem, where periodic data transfer tasks may be required not only on a CPU-GPU route but between any pair of endpoints on the interconnect. As we show later, the extended problem can be used to support complex data stream processing operations that include splitting packets between multiple CUs and communication between CUs.


Figure 6.2: Example of a 3-step job processing procedure. The input is split between two GPUs, processed there, and the results are gathered in an output buffer.

6.2 Problem Definition

6.2.1 System Model

We consider an RTSPS with multiple compute units and a heterogeneous interconnect. Here, we use an extended data stream model that supports complex processing operations. The next three subsections provide a formal definition of the data stream model, detail the system architecture, and describe the execution model.

Extended Real-Time Data Stream Model

Recall that a data stream is a periodic source of data packets that need to be processed within a given latency bound from their arrival (see Section 3.2). A data stream s_i is characterized by the tuple (p_i, t_i, op_i, l_i), whose parameters are as follows:

p_i – packet size;

t_i – time period between two consecutive packet arrivals;

op_i – processing operator applied to the packets (in order of their arrival);

l_i – latency bound on processing a packet.

In the extended model, the packet processing operator op_i is implemented as a chain of alternating data-transfer and computation steps (procstep_k)_{k=1}^{n_i} that are executed sequentially. For example, the processing chain that is illustrated in Figure 6.2 consists of three steps:

1. Split input data in two and transfer each part to a different GPU.

2. Run a kernel on each GPU to process the data.

3. Transfer results from the GPUs to an output buffer in CPU memory.

The computations and data transfers are mapped to specific compute units and data transfer routes on the interconnect. A data transfer step is composed of one or more data-block transfers between distant memories in the system, and is characterized by the set {(x_k, src_k, dst_k)}_{k=1}^{n}, where x_k denotes the size of the data block and (src_k, dst_k) denote the communication endpoints. A computational step runs computation on one or more processors, and is characterized by the set {(proc_k, kern_k)}_{k=1}^{m}, where proc_k is the processor identifier (e.g., a CPU core or a GPU) and kern_k is the computation to be executed.
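To make the extended model concrete, here is a minimal sketch of the data structures in Python; the class and field names (Stream, TransferStep, ComputeStep) are illustrative choices for this sketch and not part of the framework's implementation.

from dataclasses import dataclass
from typing import List, Union

@dataclass
class TransferStep:
    """One data-transfer step: a set of data-block transfers {(x_k, src_k, dst_k)}."""
    blocks: List[tuple]          # [(size_bytes, src, dst), ...]

@dataclass
class ComputeStep:
    """One computational step: a set of (processor, kernel) pairs."""
    kernels: List[tuple]         # [(proc_id, kernel_name), ...]

@dataclass
class Stream:
    """Extended data stream s_i = (p_i, t_i, op_i, l_i)."""
    packet_size: int             # p_i, bytes
    period: float                # t_i, seconds between packet arrivals
    op: List[Union[TransferStep, ComputeStep]]   # op_i: alternating steps
    latency_bound: float         # l_i, seconds

# The three-step chain of Figure 6.2: scatter to two GPUs, compute, gather (sizes are arbitrary).
example = Stream(
    packet_size=1 << 20,
    period=0.05,
    op=[
        TransferStep(blocks=[(1 << 19, "MEM0", "GPU0"), (1 << 19, "MEM0", "GPU1")]),
        ComputeStep(kernels=[("GPU0", "process"), ("GPU1", "process")]),
        TransferStep(blocks=[(4096, "GPU0", "MEM0"), (4096, "GPU1", "MEM0")]),
    ],
    latency_bound=0.25,
)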


[Figure 6.3 diagram: stream #1 (l_i = 250 ms) is processed in a five-stage pipeline with a 50 ms period (scatter to GPU0/GPU1, kernel, exchange of partial results, kernel, gather); stream #2 (l_i = 180 ms) is processed in a three-stage pipeline with a 60 ms period (copy to GPU0, kernel, copy back).]

Figure 6.3: Asynchronous processing pipelines for two data streams

Execution

The system processes multiple data streams using a centralized (not distributed) runtime engine that executes data transfer operations over the interconnect and launches kernels on the compute devices according to a predefined schedule. A data transfer operation cannot be aborted, but can include any part of a data block that needs to be transferred. In order to make efficient use of the resources, it (1) processes each stream in a synchronous pipeline that executes data transfer and computation steps in different stages (Figure 6.3), and (2) executes multiple operations on the same resource in parallel. The cycle time in the processing pipeline of each data stream is set to T_i = l_i / n_i, where l_i is the stream's relative deadline and n_i is the number of processing steps. Consequently, the processing pipelines of different streams may be asynchronous.

The pipeline execution mode for each stream eliminates scheduling dependencies between processing steps by assigning a time window for each operation, which simplifies the scheduling problem. Furthermore, since each operation requires either computational resources or the interconnect, data transfers and kernels can be scheduled independently. This chapter focuses on the data transfer scheduling problem.

The data transfer requirement from the system is a set of periodic data transfer tasks, where each task represents a data transfer step of a data stream. Following previous studies on scheduling data transfers [LS86, DM01, SS05], we will use the term message to describe a data transfer task.

Definition 9. A message m = (x, src, dst, p, a) is a source of data blocks of size x that arrive with period p starting at time a, to be transferred from a source src to a destination dst before the arrival of the following block.

For every data-transfer step {(x_k, src_k, dst_k)}_{k=1}^{n} of every stream s_i processed by the system, the message set includes a message m_{i,k} = (x_k, src_k, dst_k, T_i, 0).
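As an illustration of how the message set is derived, the following sketch generates one message per data block of every data-transfer step, with period T_i = l_i/n_i; it reuses the illustrative classes from the previous sketch and is not the thesis implementation.

from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Message:
    """A message m = (x, src, dst, p, a), per Definition 9."""
    size: int        # x, bytes
    src: str
    dst: str
    period: float    # p, seconds
    arrival: float   # a, seconds

def build_message_set(streams) -> List[Message]:
    """For every data-transfer step of every stream, emit m_{i,k} = (x_k, src_k, dst_k, T_i, 0)."""
    messages = []
    for s in streams:
        cycle = s.latency_bound / len(s.op)          # T_i = l_i / n_i
        for step in s.op:
            if not hasattr(step, "blocks"):          # skip computational steps
                continue
            for (size, src, dst) in step.blocks:
                messages.append(Message(size, src, dst, cycle, 0.0))
    return messages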


Heterogeneous Interconnect

The system interconnect consists of a network of controllers, memory nodes, and endpoint devices that are linked by point-to-point buses, and it is used to transfer data between memories in the system.

Endpoint devices act as nodes in the network. Each endpoint device directly loads and/or stores data in local memories, which reside either inside the device or externally as memory nodes. An endpoint device may include one or more DMA controllers that operate data transfers to and/or from it. A data transfer operation copies a chunk of data between the local memories of two endpoints on a fixed route by using a corresponding DMA controller. Each route of data transfer is associated with a specific DMA controller. Multiple DMA controllers can execute data transfers in parallel, but data transfers that require the same DMA controller are executed in FIFO order. Once a DMA transaction starts, it cannot be aborted.

The buses act as edges in the network. Each bus is composed of either one data transfer channel that can pass data in one direction or two such channels that can pass data in opposite directions. The bus has a certain bandwidth for each direction, which denotes the maximum total data transfer rate that can pass through it in this direction. The bandwidth of buses may vary due to the combination of different communication technologies in various parts of the interconnect. During the execution of a data transfer operation, the data passes through all the buses on the route from the source to the destination at a certain transfer rate. The rates of the data transfers that are in progress are limited by the bandwidths of the buses such that the bandwidth of at least one bus on each route is saturated (the saturated bus passes the maximum data rate). When a data transfer begins or ends, the bandwidths are redistributed instantly and automatically. Direct accesses of endpoint devices to local memories are assumed to be isolated from DMA transfers, either by static reservation of bandwidth on the buses along the route or by not executing them at the same time.

The bandwidth bottleneck of a data transfer does not always lie in the same bus, as the following example shows. Consider the architecture of the Tyan FT72B7015 system with four GPUs illustrated in Figure 6.4. The following table lists the (theoretical) bandwidth of the communication domains and their bus types in this system:

  Communication domain       Bus type    Bandwidth (GB/s)
  memory bus                 DDR3        32
  processor interconnect     QPI         9.6
  PCI Express bus            PCIe 2.0    8

One can see that QPI has higher bandwidth than PCIe 2.0. Therefore, the bandwidth from CPU1 to GPU1 is limited by PCI Express. However, the bottleneck for aggregate bandwidth is not necessarily in the bus with the smallest bandwidth. In the example, QPI is the bottleneck for the aggregate bandwidth from the main memory to GPUs 1 and 2 since each GPU uses a separate PCIe port.


[Figure 6.4 diagram: two CPUs with DDR3 memories connected by QPI; each CPU connects over QPI to an X58 IOH; each IOH connects over PCIe 2.0 to two GPUs (each with a DMA engine); an Infiniband NIC is attached to one of the IOHs.]

Figure 6.4: Architecture of a Tyan FT72B7015 system server with four GPUs

The controllers act as intermediate nodes that direct communication in the network. A controller is connected to two or more buses and forwards data that passes through it. Due to architectural limitations, controllers may limit the transfer rate of data that passes through them. Specifically, the rate may be limited when the controller transfers data in multiple directions at the same time, such as during bi-directional data transfers.

In some systems, data transfers between certain pairs of endpoints need to be staged in another memory. For example, in some multi-GPU compute nodes, data transfers between two GPUs that reside in different PCIe communication domains require staging in CPU memory. If different DMA controllers are used for the two routes (e.g., GPU→CPU and CPU→GPU), the two data transfers can overlap by using both DMA controllers at the same time.

Since some factors that limit the transfer rate are undocumented, we differentiate between theoretical and effective bandwidth. Theoretical bandwidth is the maximum rate at which data can be transferred over a communication network according to the hardware specification. Effective bandwidth is the maximum measured rate, which is typically lower than the theoretical bandwidth. We use the effective bandwidth for the edge weights instead of the theoretical bandwidth; for worst-case analysis, a lower bound on the effective bandwidth is used. Another relevant term is aggregate bandwidth, which is the total rate of data transfer over a number of routes.

Implications on task scheduling. Note that although data transfer speeds may change over time, time-dependent scheduling is not applicable to scheduling data transfers on the interconnect, because time-dependent scheduling assumes that the execution time of a job depends only on the time it started executing, while the data transfer speed also depends on which other data transfers are active at that time.


6.2.2 Problem Statement

For an RTSPS with a heterogeneous interconnect whose data transfer requirements are given by a set of messages M = {m_1, m_2, ..., m_n}, find a valid data transfer schedule.

6.3 Data Transfer Scheduling Method: Overview

The problem of scheduling periodic data transfers on the interconnect has a lot in common with the problem of scheduling periodic tasks on a uniprocessor, for which there exist optimal scheduling algorithms and efficient schedule validity tests (see Section 2.2). In fact, hard real-time systems conventionally use a similar model of the interconnect, where data transfer requests are handled by a shared bus that serves only one request at a time. However, in modern compute nodes, transferring data on multiple routes in parallel is essential to making effective use of the interconnect, so executing data transfers one at a time would be inefficient. To address this problem, we developed a scheduling method that executes data transfers on different routes in parallel.

Executing data transfers in parallel makes the scheduling problem more difficult, because concurrent data transfers may influence each other's transfer times due to contention on the buses. The transfer rates depend on the interconnect topology and on changes in the load on the buses due to events such as the start of a new data transfer or the completion of a running one.

Our method implements a compromise between scheduling complexity and execution efficiency. At its core, it uses a new execution model for data transfers in real-time systems where a basic operation includes a number of data blocks that need to be transferred and possibly their subsequent compute kernels; we call this a batch operation. Adding compute kernels is useful for scheduling computations and data transfers together, but we did not use it for scheduling periodic data transfers. High utilization of the interconnect can be achieved by transferring data blocks on different routes in one batch. Batch operations are executed one at a time and use the resource exclusively. Thus, the execution time of a batch operation depends only on the operations it contains and their execution schedule.

In Section 6.4 we describe the Batch method, a new method that optimizes the total execution time of a batch using a model of the interconnect and information about kernel execution times, and accurately predicts the execution time of such an operation offline. Heterogeneous interconnects combine several communication technologies that have unique features and undocumented characteristics, which makes it difficult to predict the data rates of each data transfer. In Section 6.4.3 we present an algorithm for computing the data transfer rates of multiple concurrent data transfers using a topology graph of the network that can be generated automatically using a set of benchmarks.

By default, the bandwidth of a bus is distributed fairly between the data transfers that traverse it. However, the total transfer time of a batch can be optimized by controlling the distribution of bus bandwidths. In Section 6.5 we present an optimal method for controlling the bandwidth distribution at an intersection point of two communication technologies.


This method can speed up data transfers by up to 14.6%. Controlling the bandwidth distribution is supported by many platforms at the hardware level, but it is not exposed to the user and is currently not being used for optimization. Our study shows initial results in this research direction.

In order to make effective use of the new execution model for periodic data transfers, it is necessary to combine individual data block transfer operations into batch operations. We define the term batch message – a periodic batch data transfer task – and present a new method that transforms the original message set into a set of batch messages.

Definition 10. A batch message m = ({(x_k, src_k, dst_k)}_{k=1}^{n}, p, a) is a source of data blocks that produces n data blocks every period p starting at time a, such that block k, of size x_k, needs to be transferred from src_k to dst_k before the arrival of the next n blocks.

A batch message is, hence, an implicit-deadline periodic task characterized by the tuple (a, c, p), where c = WCET({(x_k, src_k, dst_k)}_{k=1}^{n}) is the worst-case total transfer time of the data blocks in the set. Note that every message is also a batch message.

The data blocks to be transferred are released at different time points and with different deadlines, so they cannot simply be merged. It is necessary to group them into batches and schedule the batch operations in a way that yields high utilization but meets the timing constraints. Our algorithm starts with the initial set of messages and iteratively combines batch messages from the set in a way that maximizes the gain in efficiency (due to higher aggregate bandwidth), until no further efficiency gain is expected (Section 6.6).

After the message set is transformed into a batch message set, the batch messages are scheduled using the non-preemptive EDF algorithm; thus, the schedule can be checked using an existing validity test. In our experiments, the scheduler improved the overall performance of two realistic real-time systems by between 31% and 74% (Section 6.7). Our conclusions for this chapter are summarized in Section 6.8, and selected MATLAB implementations of the algorithms discussed in this chapter appear in Section 6.9. A data flow diagram of the method is depicted in Figure 6.5.
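The iterative combining step can be sketched as follows; this is a simplified illustration that assumes a predict_batch_time() helper (standing in for the execution-time prediction of Section 6.4), encodes batch messages as dictionaries, and merges only batch messages with equal periods, which is an assumption of this sketch rather than a property of the method.

from itertools import combinations

def merge(a, b):
    """Combine two batch messages (dicts with 'blocks', 'period', 'arrival') into one."""
    return {"blocks": a["blocks"] + b["blocks"], "period": a["period"], "arrival": a["arrival"]}

def batch_messages(messages, predict_batch_time):
    """Greedily merge batch messages while merging shortens the total transfer time."""
    batches = [dict(m) for m in messages]
    while True:
        best_gain, best_pair = 0.0, None
        for i, j in combinations(range(len(batches)), 2):
            a, b = batches[i], batches[j]
            if a["period"] != b["period"]:
                continue
            separate = predict_batch_time(a["blocks"]) + predict_batch_time(b["blocks"])
            merged = predict_batch_time(a["blocks"] + b["blocks"])
            gain = separate - merged          # positive if aggregate bandwidth helps
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:
            return batches                    # no further efficiency gain expected
        i, j = best_pair
        combined = merge(batches[i], batches[j])
        batches = [b for k, b in enumerate(batches) if k not in (i, j)] + [combined]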

6.4 Batch Method for Data Transfer and Computations

In order to allow better utilization of the communication buses (and so, of the entire system) while maintaining the ability to bound the execution time, we introduce a new run-time interface to the application (API). The new interface shifts from asynchronous command streams, as defined by CUDA and OpenCL, to batch operations that "bind" segments of multiple command streams together. It allows the system to optimize the execution, taking into account all the resources needed by the batch, and, in particular, to move resources from one stream to another in order to minimize the total execution time. From the point of view of the programmer, batch operations are executed atomically in issue order, but internally the run-time schedules the command stream segments concurrently, aiming to minimize the execution time.


[Figure 6.5 diagram: at configuration time, the interconnect topology graph, the periodic data transfer requirements, a task set optimizer, and a data rate / execution time predictor produce the periodic batch data transfer tasks, data routes, and a configuration file for the Batch method; at runtime, a master EDF engine releases batch jobs according to the schedule and issues DMA copy commands to the DMA controllers.]

Figure 6.5: Data flow diagram of the data transfer scheduling method


Using the new interface, we will show that the run-time can provide two main services to the application: better resource utilization and accurate prediction of execution times for future batch operations. Such information can be used by the application to better schedule the batch operations in a way that meets system requirements (such as deadlines). It can also be used to dynamically control the execution, hence reducing the gap between the worst-case and actual execution times, and further improving the utilization of the system and its power consumption. Our new model is also compatible with the task model of classic scheduling algorithms, such as earliest-deadline-first (EDF) and rate-monotonic (RM).

6.4.1 Sharing Resources in Real-time Systems

The common methods for resource sharing in real-time systems include resource partitioning and time division. The resource partitioning method calls for the resource to be divided into a number of smaller portions, and for each of them to be assigned to a task. For example, if the bandwidth of a bus is 8 GB/s, it can be distributed among four tasks so that each task is assigned 2 GB/s. The time division method calls for the time domain to be split into time slots that are assigned to the tasks according to a given priority scheme. For example, the 8 GB/s bus can be assigned for a period of time to task A and later to task B. Each of these methods allows the system to make the communication time deterministic at the expense of underutilizing resources such as the bus bandwidth, because a resource that has been assigned to a task but was not fully utilized will not be used by the other tasks. Underutilization of the resources can increase power consumption and prevent the system from meeting real-time deadlines. For efficient utilization of shared resources in real-time systems, the resource-to-task assignment needs to be adjustable in both time and bandwidth.

The current interfaces of CUDA and OpenCL do not provide mechanisms for real-time execution. The interfaces do not define methods to specify deadlines for commands or command streams, or methods for predicting their execution times. Applications that schedule the program execution in heterogeneous architectures usually do a worst-case static analysis of the execution patterns in order to guarantee that the deadlines are met. The analysis is based on static resource sharing methods, such as static resource allocation or time division, and assumes that the worst-case execution time (WCET) is known ahead of time for each job (computation or data transfer). Such a system can be suboptimal in terms of power and utilization, but it may also cause the system to miss deadlines, as the following example indicates.

Consider a real-time production inspection system that has a CPU and four GPUs, as illustrated in Figure 6.6. Every period of 50 ms, four images arrive in main memory, are copied to GPU memory, scanned for defects, and, for each image, a boolean value is returned to main memory. This value denotes whether defects were found. The image sizes are 32 MB, 128 MB, 128 MB, and 32 MB, and they are processed on GPUs 1-4, respectively, one image per GPU. The system is required to produce the result in main memory within 50 ms, i.e., before the next period starts. The throughput of a GPU for scanning an image is 10 GB/s (a value chosen arbitrarily). The result transfer time is on the order of microseconds; for simplicity, we ignore it in the following time analysis.


[Figure 6.6 diagram: MEM-CPU at 32 GB/s, CPU-IOH at 8 GB/s (in/out), and IOH-GPU buses at 6 GB/s to each of GPU1-GPU4.]

Figure 6.6: Communication topology of the explanatory example

[Figure 6.7 timelines: (a) Bandwidth allocation: the CPU→GPU transfers and GPU kernels complete after 76.8 ms; (b) Time division: completion after 56.65 ms.]

Figure 6.7: Timeline of data processing for different resource sharing methods

Using the resource allocation method (bandwidth allocation) for sharing resources, each task is statically assigned a portion of the communication bandwidth. In this system, the bandwidth bottleneck (to all four GPUs) is 8 GB/s on the CPU-IOH link. Assuming equal bandwidth distribution, each stream is assigned 2 GB/s. Figure 6.7a shows a timeline of the expected execution for this method. The processing time is computed as follows:

t_bandw = 128 MB / (2 GB/s) + 128 MB / (10 GB/s) = 64 ms + 12.8 ms = 76.8 ms.

The resulting time is longer than the 50 ms latency bound, so this approach does not guarantee that the latency constraints are satisfied.

Using time division, the full bandwidth is allocated to tasks for periods of time. The GPUs are allocated separately; hence, kernel executions may overlap. The order in which the system allocates bandwidth to the images is not strictly defined by the method; Figure 6.7b shows a timeline for the scenario with the shortest expected processing time. This time also exceeds the latency bound. Because only one image is transferred at a time, the method utilizes only the 6 GB/s bandwidth of an IOH-GPU bus rather than the full 8 GB/s of the CPU-IOH bus.

Note that the inability of the system to move allocated but unused resources from one stream to another caused the system to miss the deadline. In the next section we present the batch allocation method, which aims to ease this problem.

6.4.2 Interface

We propose the use of the following API for batch operations:


batchOp(dsts,srcs,sizes,kernels,n)
h = batchOpAsync(dsts,srcs,sizes,kernels,n)
b = batchOpQuery(h)
batchOpSynchronize(h)
t = predictBatchLatency(dsts,srcs,sizes,kernels,n)

Listing 6.1: API for batch operations

The API has three types of functions:

• Execution - batchOp() and batchOpAsync() execute a batch operation that contains n command streams. The function batchOp() is synchronous, while batchOpAsync() is asynchronous and returns a handler object h that the application uses to monitor the execution progress. Each command stream is either a data-block transfer, a kernel call, or a data-block transfer to a GPU followed by a kernel call that uses the transferred data on the target GPU.

• Latency prediction - predictBatchLatency() returns an execution time prediction for a batch operation. Here, the sources and destinations represent the memory modules, rather than specific addresses, such as GPU0 or MEM0 (local to CPU0).

• Progress monitoring - batchOpQuery(h) returns a boolean value indicating the completion status of an asynchronous batch operation, while batchOpSynchronize(h) blocks until the operation completes.

Using the API, the application can schedule data transfer operations and kernel calls, providing the run-time with the necessary information to use the interconnect and GPUs efficiently. In this work, we use a restricted model of a command stream: a stream is composed of a data transfer operation followed by a kernel call (both optional). This model does not limit the scheduling options for the application, as might seem to be the case. We discuss this issue in greater detail in Section 6.4.6.
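As a usage illustration, the sketch below issues the transfers and kernel calls of the inspection example (Section 6.4.1) as one batch operation; it assumes hypothetical Python bindings with the names of Listing 6.1, stubbed out here so the sketch is self-contained, and the predicted-latency value is only a placeholder.

# Hypothetical Python bindings mirroring Listing 6.1, stubbed for a stand-alone sketch.
def predictBatchLatency(dsts, srcs, sizes, kernels, n):
    return 0.0448                    # placeholder prediction in seconds (cf. Figure 6.8)

def batchOpAsync(dsts, srcs, sizes, kernels, n):
    return object()                  # placeholder handle

def batchOpSynchronize(h):
    pass

n = 4
dsts    = ["GPU1", "GPU2", "GPU3", "GPU4"]              # destination memories
srcs    = ["MEM0"] * 4                                   # source memory (host)
sizes   = [32 << 20, 128 << 20, 128 << 20, 32 << 20]     # bytes per data block
kernels = ["scan_defects"] * 4                           # kernel launched after each transfer

# Check, before issuing, that the batch fits the 50 ms period of the inspection example.
if predictBatchLatency(dsts, srcs, sizes, kernels, n) <= 0.050:
    h = batchOpAsync(dsts, srcs, sizes, kernels, n)
    # ... other work can overlap here ...
    batchOpSynchronize(h)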

6.4.3 Implementation

This section describes the implementation of the batch method for data transfer and kernel execution. The latency prediction algorithm is presented here and described in detail in the next section.

Execution

Calls to batchOp() and batchOpAsync() are processed in FIFO order. The execution algorithm schedules the command streams to execute in parallel with the goal of minimizing the execution time. For data transfers, it configures multiple DMA controllers to concurrently transfer data blocks that belong to different streams. Execution of the component operations in parallel is subject to the following constraints (a small validity-check sketch follows the list):


[Figure 6.8 timeline: with the batch method, the CPU→GPU transfers and GPU kernels of the example complete after 44.8 ms.]

Figure 6.8: Timeline for the batch method

1. Order dependencies. Command streams define execution order between data transfers and kernels in them.

2. Compute unit sharing. Kernel executions on the same compute unit must not overlap.

3. DMA controller sharing. Data transfers that use the same DMA controller must not overlap.

4. Staging in CPU memory. In cases where a data transfer between two GPUs requires staging in CPU memory, the GPU→CPU transfer must precede the CPU→GPU transfer.
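The sharing and staging constraints can be checked mechanically for a candidate parallel schedule. The sketch below is a minimal illustration under an assumed encoding of operations as dictionaries; it covers constraints 2-4, while order dependencies within a command stream (constraint 1) are assumed to be handled when the schedule is constructed.

def batch_is_valid(ops):
    """Check a candidate parallel schedule against the sharing/staging constraints above.

    ops: list of dicts with keys
      'resource'     - compute unit for kernels, DMA controller for transfers
      'start', 'end' - scheduled time window (seconds)
      'stage'        - optional 'd2h' or 'h2d' for GPU-GPU transfers staged in CPU memory,
                       together with a shared 'pair' id linking the two halves
    """
    def overlap(a, b):
        return a["start"] < b["end"] and b["start"] < a["end"]

    # Constraints 2 and 3: operations sharing a compute unit or DMA controller must not overlap.
    for i, a in enumerate(ops):
        for b in ops[i + 1:]:
            if a["resource"] == b["resource"] and overlap(a, b):
                return False

    # Constraint 4: for staged GPU-to-GPU transfers, the GPU->CPU half precedes the CPU->GPU half.
    halves = {}
    for op in ops:
        if op.get("stage"):
            halves.setdefault(op["pair"], {})[op["stage"]] = op
    for pair in halves.values():
        if "d2h" in pair and "h2d" in pair and pair["d2h"]["end"] > pair["h2d"]["start"]:
            return False
    return True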

Runtime Scheduling and Latency Prediction

We propose a heuristic algorithm that reduces bandwidth contention by overlapping data transfers with kernel execution and optimizes the aggregate bandwidth; a MATLAB implementation of the algorithm is shown in Section 6.9.1. This algorithm also computes the total execution time of the operation; i.e., it predicts the execution latency.

Due to order dependencies, it is not enough to optimize the schedule of the data transfers alone, because kernels may unnecessarily prolong execution if scheduled too late. To improve the overlap between kernel executions and data transfers, the algorithm aligns the completion times of the command streams (Figure 6.8); namely, it computes when to start executing each command stream such that the execution of all streams completes at the same time. Note how this approach reduces overlap and bandwidth contention between the data transfers (in comparison to Figure 6.7a, for example), thus allowing them to complete faster. To improve the aggregate bandwidth, the algorithm scans through various combinations of data transfers that can be executed in parallel on different DMA controllers to find the combination that yields the highest aggregate bandwidth.

The event times are computed iteratively by simulating the execution. Since our execution method aligns the kernel completion times, the event times are computed from the batch completion time backwards. For simplicity of discussion, we consider the moment when the batch execution completes as t = 0, and count the time backwards. Computing the start and completion times of kernels is trivial since they all start at time zero (on the reversed timeline) and execute in parallel for fixed amounts of time, given by the user. For data transfers, such computation is more complex, since it needs to take contention into account. The transfer rates of concurrent data transfers do not change as long as no transfer starts or completes, so they are constant between any two events.
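As an illustration of the completion-time alignment, here is a minimal sketch that computes start offsets so that all command streams finish together; it deliberately ignores bus contention between the transfers (which the actual algorithm accounts for by recomputing rates between events), and the numbers are arbitrary.

def align_completion_times(streams):
    """Compute start offsets so that every command stream finishes at the same time.

    streams: list of (transfer_time, kernel_time) pairs, in seconds.
    Returns (makespan, starts), where starts[i] is when stream i begins its transfer,
    measured forward from the start of the batch operation.
    """
    lengths = [t + k for t, k in streams]
    makespan = max(lengths)
    starts = [makespan - length for length in lengths]
    return makespan, starts

# Three command streams: (transfer time, kernel time) in arbitrary units.
makespan, starts = align_completion_times([(10.0, 5.0), (4.0, 12.0), (8.0, 0.0)])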


The ability to predict execution times can be used not only for scheduling the program's execution, but also for monitoring its progress and for debugging. Next, we present an algorithm for computing the transfer rates of concurrent data transfers.

Model and Algorithm for Bandwidth Distribution in Concurrent Data Transfers

We represent the interconnect topology as a weighted directed graph G = (V, E, b), where a vertex v ∈ V represents a component of the network, such as main memory, a CPU, a GPU, or a routing component (controller, I/O hub, switch, etc.). A directed edge e_ij ∈ E represents a uni-directional link from component n_i to component n_j, and the positive weight b(e_ij) represents its bandwidth. The topology graph can be used to detect the bandwidth bottleneck along a route between two components and across multiple routes. For example, Figure 6.10 illustrates the network topology graph of a Tyan FT72B7015 system with four GPUs. In this graph, the M0→GPU0 and M0→GPU1 routes both traverse the CPU0→IOH0 edge; hence the aggregate bandwidth on these two routes is 9.6 GB/s. The aggregate bandwidth for given source-destination pairs can be computed by solving the multi-source multi-sink maximum flow problem on the graph. The rate of a data transfer is limited by the minimum edge weight b(e) on its route in the graph. It may further be limited by concurrent data transfers that consume some of the bandwidth of the buses on its route. Section 6.9.2 shows a MATLAB representation of a system with a similar topology.

PCI Express controllers apply an arbitration mechanism to resolve contention between data transfers. The default arbitration mechanism sends data packets waiting on the input ports to the output port in round-robin fashion, which effectively results in fair bandwidth distribution between the data transfers. However, the bandwidth is not necessarily distributed equally. For example, consider three concurrent data transfers: M0→GPU0, GPU1→GPU0, and M0→GPU2. The bandwidth of the bus closest to GPU0 is equally distributed between the first two transfers, 4 GB/s each. Therefore, the 9.6 GB/s bandwidth of the CPU-IOH bus is not equally distributed between M0→GPU0 and M0→GPU2: the first gets 4 GB/s and the second the remaining 5.6 GB/s. The edge weights may need to be further lowered in order to compensate for low-level data packet delays that result from chipset limitations.

When data is transmitted in both directions, the aggregate bandwidth may be lower than the sum of the unidirectional bandwidths (due to chipset limitations). We suggest representing a bus with reduced aggregate bandwidth by adding auxiliary nodes and edges to the graph, as illustrated in Figure 6.9. Let A and B be two components connected by a bus, and let x, y, and z be the bandwidths of A→B, B→A, and A↔B, respectively. We add two auxiliary nodes and connect them with weighted edges as shown in the figure. One can see that the bi-directional throughput is limited by z because the communication in both directions passes through the middle edge, and the throughput on the A→B and B→A routes is limited by x and y, respectively.
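To illustrate the aggregate-bandwidth computation on the topology graph, here is a minimal sketch using the networkx max-flow solver; the graph below encodes only a fragment of Figure 6.10 (M0, CPU0, IOH0, GPU0, GPU1) with the theoretical bandwidths from the table in Section 6.2.1, and it is not the MATLAB representation referenced above.

import networkx as nx

def build_topology():
    """Partial directed topology graph; edge capacities in GB/s (theoretical values)."""
    g = nx.DiGraph()
    edges = [
        ("M0", "CPU0", 32.0),      # DDR3 memory bus
        ("CPU0", "IOH0", 9.6),     # QPI
        ("IOH0", "GPU0", 8.0),     # PCIe 2.0
        ("IOH0", "GPU1", 8.0),     # PCIe 2.0
    ]
    for u, v, bw in edges:
        g.add_edge(u, v, capacity=bw)
    return g

def aggregate_bandwidth(g, sources, sinks):
    """Multi-source multi-sink max flow via a super-source and a super-sink."""
    h = g.copy()
    for s in sources:
        h.add_edge("SRC", s)       # no capacity attribute -> treated as infinite
    for t in sinks:
        h.add_edge(t, "SNK")
    value, _ = nx.maximum_flow(h, "SRC", "SNK")
    return value

g = build_topology()
print(aggregate_bandwidth(g, ["M0"], ["GPU0", "GPU1"]))   # 9.6, limited by the CPU0->IOH0 edge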


[Figure 6.9 diagram: a bus between components A and B modeled with two auxiliary nodes so that BW(A→B) = x, BW(B→A) = y, and BW(A↔B) = z, with z < x + y.]

Figure 6.9: Technique for representing limited bi-directional bandwidth in the interconnect topology model

[Figure 6.10 graph: M0, CPU0, IOH0, GPU0, GPU1 on one side and M1, CPU1, IOH1, GPU2, GPU3 on the other, with QPI edges of 9.6 GB/s and PCIe edges of 8 GB/s.]

Figure 6.10: Network topology graph of a Tyan FT72B7015 system with four GPUs. The edge weights denote the theoretical bandwidth of the link in GB/s.

The algorithm shown in Listing 6.2 computes the expected data rate of each data transfer operation in a case where multiple data transfers are in progress. The algorithm iteratively finds the tightest bandwidth bottleneck for any data transfer and derives the final throughput for it. It then subtracts the allocated throughput from the edges on the route traversed by this data transfer and redistributes the remaining bandwidth between the remaining streams. A MATLAB implementation of the algorithm is shown in Section 6.9.3.

// Input:
//   G - topology graph (V,E,b)
//   D - set of source->destination data transfer routes
// Output: Data rates T(d) for each data transfer d in D
// Auxiliary functions:
//   n(e,D) - number of data transfers in D traversing e
//   P(d)   - the set of edges traversed by data transfer d
While D is not empty
    Remove from E all edges that are not in {P(d) | d in D}
    Find an edge e in E such that b(e)/n(e,D) is minimal
    r := b(e)/n(e,D)
    For each d in D such that e in P(d)
        T(d) := r
        For each e' in P(d)
            b(e') := b(e') - r
    Remove from D all d such that e in P(d)
Return T

Listing 6.2: Transfer rate computation for concurrent data transfers
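For readers who prefer a runnable form, here is a small Python rendering of the same fair-distribution idea (the thesis provides a MATLAB implementation in Section 6.9.3); the dictionary-based encoding of edges and routes is an assumption of this sketch.

def distribute_rates(capacity, routes):
    """Fair bandwidth distribution for concurrent transfers, in the spirit of Listing 6.2.

    capacity: dict edge -> bandwidth (GB/s); routes: dict transfer id -> list of edges.
    Returns dict transfer id -> rate (GB/s).
    """
    cap = dict(capacity)
    active = dict(routes)
    rates = {}
    while active:
        # Count the active transfers traversing each edge still in use.
        load = {}
        for path in active.values():
            for e in path:
                load[e] = load.get(e, 0) + 1
        # Tightest bottleneck: the edge with the smallest fair share.
        e_min = min(load, key=lambda e: cap[e] / load[e])
        share = cap[e_min] / load[e_min]
        # Transfers through the bottleneck are fixed at the fair share.
        done = [d for d, path in active.items() if e_min in path]
        for d in done:
            rates[d] = share
            for e in active[d]:
                cap[e] -= share          # leave the remainder to the remaining transfers
            del active[d]
    return rates

# The example from the text: M0->GPU0, GPU1->GPU0, M0->GPU2 (edge names are illustrative).
caps = {("CPU0", "IOH0"): 9.6, ("IOH0", "GPU0"): 8.0, ("GPU1", "IOH0"): 8.0, ("IOH0", "GPU2"): 8.0}
paths = {
    "M0->GPU0": [("CPU0", "IOH0"), ("IOH0", "GPU0")],
    "GPU1->GPU0": [("GPU1", "IOH0"), ("IOH0", "GPU0")],
    "M0->GPU2": [("CPU0", "IOH0"), ("IOH0", "GPU2")],
}
print(distribute_rates(caps, paths))   # {'M0->GPU0': 4.0, 'GPU1->GPU0': 4.0, 'M0->GPU2': 5.6}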


Configuration of the DMA controller and later synchronization with it add some latency; this time is typically short in comparison to the transmission time, but it can be significant for small data transfers. This time is added to the computed transfer time to give the total execution time of the batch operation. We model it as an expression of the form t_setup = c_1 + n·c_2, where n is the number of transferred data blocks; c_1 and c_2 can be determined experimentally. (A similar model was also used in [Bal10].)
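The constants c_1 and c_2 can be obtained, for example, with an ordinary least-squares fit over measured setup latencies; the sketch below uses hypothetical (n, latency) samples and is only an illustration of the fitting step, not the calibration procedure used in the thesis.

def fit_setup_model(samples):
    """Least-squares fit of t_setup = c1 + n*c2 from (n, measured setup time) samples."""
    m = len(samples)
    sum_n = sum(n for n, _ in samples)
    sum_t = sum(t for _, t in samples)
    sum_nn = sum(n * n for n, _ in samples)
    sum_nt = sum(n * t for n, t in samples)
    c2 = (m * sum_nt - sum_n * sum_t) / (m * sum_nn - sum_n ** 2)
    c1 = (sum_t - c2 * sum_n) / m
    return c1, c2

# Hypothetical measurements: number of data blocks vs. observed setup latency in ms.
samples = [(1, 0.016), (2, 0.027), (4, 0.050), (8, 0.095)]
c1, c2 = fit_setup_model(samples)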

6.4.4 Performance of the Batch Method for the Example in Section 6.4.1

Our example in Section 6.4.1 indicates that using the existing resource sharing mechanisms may lead to sub-optimal utilization of the shared resources and to a failure of the system (deadline miss). In this section, we return to this example and show that our method solves these problems for that case.

The batch method re-allocates the bandwidth during the execution in a predictable manner. Thus, the application is able to utilize the full 8 GB/s bandwidth, while still being able to validate that the execution completes within the given latency constraints. The application binds the CPU→GPU data block transfers and kernel calls into a batch operation, followed by another batch operation for the return values. The run-time executes the command streams concurrently, aiming for a minimum total execution time. Figure 6.8 shows a timeline of the expected execution of a batch operation consisting of the data transfers and kernel calls for the four images. Due to the higher bandwidth, the operation completes after 44.8 ms, which is consistent with the given latency constraints. Thus the schedule is valid. The example shows that our proposed method achieves shorter execution times than the existing methods by using the interconnect more efficiently.

6.4.5 Evaluation

To evaluate the batch method, we consider two applications running on two multi-GPU systems. We compare the batch method with two other bandwidth distribution methods: bandwidth allocation (resource partitioning) and time division. For each system, we provide the system specification, find and analyze the effective bandwidth, and compare the performance using each of the methods in test case applications.

Baseline Configurations and Their Effective Bandwidth

In this section we describe two existing systems and use the techniques described in Section 6.4.3 to analyze their effective bandwidth.

Nehalem multi-GPU system. This system is a Tyan FT72B7015 server featuring the following components:

• Two 4-core Intel Xeon 5620 CPUs at 2.4 GHz, based on the Nehalem micro-architecture.

• An Intel 5520/ICH10R chipset with a QPI processor interconnect at 4.8 GT/s (9.6 GB/s) and two I/O hubs, each with two PCIe 2.0 ports.

• A total of 24 GB RAM in two modules.

• Four NVIDIA Tesla C2050 GPUs, each with 3 GB of GDDR5 memory.


The system runs Ubuntu 10.04 x64 Linux with CUDA SDK 5.0.

Finding and analyzing the effective bandwidth. The batch method uses the effective aggregate bandwidth to compute the completion times of data transfers. In Section 6.4.3, we presented a method for computing the aggregate bandwidth from a topology graph with edge weights that represent bus bandwidth. We obtained the edge weights using a series of benchmarks. To find the effective bandwidth of a bus, we transferred 256 MiB (2^28 bytes) data blocks between CPU0 and the GPUs in host-to-device (H2D), device-to-host (D2H), and bi-directional (Bi) modes on all the routes that traverse this bus, and measured the aggregate throughput for each direction (Figure 6.13a). Using these measurements, we determined the effective bus bandwidths in Figure 6.12a. For simplicity, we did not include the bi-directional bandwidth measurements in the figure.

The results show that the bus bandwidth is asymmetric; the effective H2D PCIe bandwidth is between 25% and 50% higher than the D2H bandwidth. We also see that the effective bandwidth to the remote GPUs is 20% lower. Since the QPI bus has higher bandwidth than the PCIe bus, this indicates that the latency of the extra hop over QPI translates into throughput degradation. The H2D aggregate bandwidth scales up for two GPUs by 57% for local GPUs (saturating the QPI bus), and by 33% for remote GPUs. In contrast, the D2H bandwidth for two GPUs does not scale. For four GPUs, the H2D bandwidth does not scale further, but the aggregate D2H bandwidth lines up with the bandwidth of the local GPUs. We ascribe the reduced scaling to chipset limitations, except where the QPI bus was saturated. Since the GPUs only have one DMA engine, they are not able to get more bandwidth from bi-directional transfer, yet we see that for the local GPUs a bi-directional transfer is faster than an H2D transfer followed by a D2H.

Sandy Bridge multi-GPU system This system features the following components:

• Two 6-core Intel Xeon E5-2667 CPUs at 2.9 GHz based on the Sandy Bridge micro- architecture;

• An Intel C602 chipset with QPI at 8 GT/s (16 GB/s) and two PCIe 3.0 ports in each CPU;

• 64 GB of RAM in two modules;

• Two NVIDIA Tesla K10 cards, each with two GPU modules connected by a PCIe switch; each module includes a GPU and 4 GB of GDDR5 memory.

The system runs Red Hat Enterprise 6.2 Linux with CUDA SDK 5.0.

Finding and analyzing the effective bandwidth. We determined the effective bus bandwidths shown in Figure 6.12b using a series of benchmarks, as described for the Nehalem system. Figure 6.13b shows the benchmark results for the Sandy Bridge system.


Figure 6.11: Measured CPU to GPU transfer latency

The Sandy Bridge system has higher bandwidth than the Nehalem system, as it uses PCIe 3.0. Moreover, the bandwidth in Sandy Bridge is symmetric and is similar for local and remote GPUs, unlike in the Nehalem system. The aggregate bandwidth to the GPU pairs is only slightly higher than for the individual GPUs; this is expected, as the GPUs in a pair share a PCIe bus. However, moving further to four GPUs, the bandwidth scales almost perfectly. For a single GPU, bi-directional transfers increase the bandwidth by 32 %–72 % over uni-directional transfers, while for all four GPUs the increase in bandwidth is only 10 %. Figure 6.11 illustrates the latency of CPU→GPU data block transfers as a function of block size. The graph shows a minimum latency of 0.016 ms, which we identify as the setup time. The shape of the graph is consistent with our model, which divides the transfer time into a constant setup time and a transmission time that scales linearly with the block size.
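The two-parameter latency model can be written down directly. In the sketch below, the 0.016 ms setup time is the measured value from Figure 6.11, while the 11.8 GB/s bandwidth figure, the GiB-based unit conversion, and the function name are assumptions made only for illustration:

% Predicted CPU->GPU latency: constant setup time plus linear transmission time
predictLatencyMs = @(bytes, setupMs, bwGBps) setupMs + bytes/(bwGBps*2^30)*1e3;

setupMs = 0.016;                        % measured minimum latency (setup time)
bwGBps  = 11.8;                         % assumed effective H2D bandwidth on this route
sizes   = [4*2^10, 1*2^20, 256*2^20];   % 4 KiB, 1 MiB, 256 MiB
latencyMs = arrayfun(@(s) predictLatencyMs(s, setupMs, bwGBps), sizes)
% small blocks are dominated by the setup time, large blocks by the
% transmission time, matching the shape of the curve in Figure 6.11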

Application Based Performance Evaluation

The evaluation in this section is based on two applications: one that represents a solution for a wide range of algorithms in areas such as signal processing and image processing, and another that is motivated by an application we developed as part of the Metro 450 consortium for inspection of 450 mm wafers.

Domain decomposition. In many scientific GPGPU applications that work on large datasets, the data domain is decomposed into parts that fit in GPU memory. The application we examine uses two GPUs, and follows a common pattern in multi-GPU programming:

1: split input data (3.69 GB) in two and scatter between GPUs
2: repeat N=100 times
3:     for each GPU in parallel: compute boundary points (1 ms)
4:     -- synchronization barrier --
5:     for each GPU in parallel:
6:         get boundary points from other GPU (295.2 MB)
7:         compute internal points (23 ms)
8:     -- synchronization barrier --
9: gather output from GPUs

[Topology graphs omitted: (a) Nehalem and (b) Sandy Bridge, showing CPUs, I/O hubs, memory, and GPUs with per-edge effective bandwidths]

Figure 6.12: Topology graph showing effective bandwidth in GB/s

(a) Nehalem, measured aggregate throughput [GB/s]:

           single transfer                2-way concurrent        4-way
           GPU0   GPU1   GPU2   GPU3   GPU0+GPU1   GPU2+GPU3    all GPUs
    Bi      6.1    6.1    3.9    3.9      6.8         4.2          6.5
    H2D     6.0    6.0    4.8    4.8      9.4         6.4          9.4
    D2H     4.4    4.4    2.6    2.6      4.4         2.6          4.4

(b) Sandy Bridge, measured aggregate throughput [GB/s]:

           single transfer                2-way concurrent        4-way
           GPU0   GPU1   GPU2   GPU3   GPU0+GPU1   GPU2+GPU3    all GPUs
    Bi     16.5   15.0   20.3   20.1     15.8        20.2         26.1
    H2D    11.7   11.7   11.8   11.7     12.1        12.0         24.0
    D2H    11.1   11.0   11.8   11.8     12.5        12.6         23.4

Figure 6.13: Bus bandwidth benchmark. H=CPU (host), D=GPU (device)


In this application, the data transfer time is not negligible in comparison to the computation time, so increasing the data transfer efficiency could yield higher application performance. The execution times of the application using each of the bandwidth distribution methods for the data transfers are shown in Table 6.1. On the Sandy Bridge system, the batch method yielded an execution time 7.9 times shorter than with the bandwidth allocation method and half that of the time division method. On the Nehalem system, it was 28 % faster than bandwidth allocation and 15 % faster than time division. The lower performance of the bandwidth allocation method can be attributed to presenting an abstraction of a less powerful interconnect to the scheduler. The method splits the bandwidth among the data transfer paths such that the bandwidth assigned to a path is guaranteed even under worst-case contention with other paths. As a result, in cases with less contention, the bandwidth of the “idle” paths is considered lost, while in practice it is used for data transfers on the “active” paths. For example, data transfers on the CPU→GPU1 and GPU2→GPU1 paths use the PCIe bus leading to GPU1, which is the bandwidth bottleneck for such transfers. When only one of the paths is being used, it gets the full bandwidth of the PCIe bus; however, the bandwidth allocation method assigns only half of the bandwidth of the PCIe bus to this path. The time division method also does not fully utilize the available bandwidth, because doing so is impossible without executing several data transfers concurrently. In contrast, our results show that the batch method efficiently uses the available bandwidth.

Wafer inspection. In a real-time wafer inspection system, an image of a wafer arrives in main memory every 25 ms and is sent to the GPUs for defect detection. The defects on the wafer are distributed non-uniformly, so in order to balance the kernel execution times, the image is split between GPUs 0–3 as follows: 40 %, 10 %, 10 %, and 40 %. The four GPUs process the image at 10 GB/s. The system is required to produce a defect report in main memory every 60 ms, and its size equals that of the image. This requirement limits the resolution of the wafer images processed by the system, yet the maximal resolution depends on the bandwidth distribution method. To calculate the maximum resolution for each of the methods, we take into account that the system uses the non-preemptive EDF scheduler, and we check the validity of the schedule using the test described by Jeffay et al. [JSM91]. We use image size as a measure of the resolution. Table 6.1 shows the maximum image sizes using each of the bandwidth distribution methods. The batch method achieves the highest average throughput by multiplexing the data transfers. On the Nehalem system, it supports 39 % higher image resolution than the bandwidth allocation method because, when the short data transfers complete, it assigns some of the released bandwidth to the long transfers. It also achieves 37 % higher image resolution than the time division method by utilizing the aggregate bandwidth of multiple routes.

Table 6.1: Performance comparison of the different methods

                      Domain decomposition           Wafer inspection
                      (execution time)               (image size – larger is better)
                      Nehalem      Sandy Bridge      Nehalem      Sandy Bridge
    Batch method      20.03 sec    3.18 sec          82.46 MB     156.77 MB
    B/W allocation    25.59 sec    25.19 sec         59.28 MB     128.10 MB
    Time division     23.10 sec    6.43 sec          60.24 MB     112.58 MB

On Sandy Bridge, the batch method supports 22 % higher image resolution than the bandwidth allocation method because, when the short data transfers complete, it assigns some of the released bandwidth to the long transfers. It also achieves 39 % higher image resolution than the time division method by utilizing the aggregate bandwidth of multiple routes.

Execution Time Prediction

A fundamental part of the batch method is the execution timeline computation algorithm described in Section 6.4.3. This algorithm is used both for scheduling the execution of batch operations and for predicting their latency. Hence, precision is essential. The precision of the algorithm is evaluated by:

1. Computing an execution timeline for selected batch operations.

2. Executing the operations and recording the execution time of each command stream.

3. Comparing the execution times for each command stream and for the batch in total.

We ran two experiments on each system. On Nehalem, we executed the following batch operations:

    Stream   Experiment 1 (Batch #1)                 Experiment 2 (Batch #2)
    S1       256 MiB H2D (GPU0) + kernel (10 ms)     256 MiB D2H (GPU0)
    S2       128 MiB H2D (GPU1) + kernel (10 ms)     128 MiB D2H (GPU1)

On Sandy Bridge we executed similar experiments, but with four GPUs; the additional block sizes were 64 MiB and 32 MiB. Figure 6.14 displays the results of the computations and measurements side by side, and shows the prediction error of the total execution time. The prediction error was no more than 3.6 % for all the tested cases. The table in that figure shows that the computed execution times of individual command streams are also similar to the recorded values. We have shown that our algorithm correctly estimates command stream execution times for concurrent batch operations with different data transfer sizes. Thus, it can be used for efficiently scheduling the execution of command streams and predicting the execution time of batch operations.

Throughput Computation vs. Measurement

In Section 6.4.3, we presented a method for computing the aggregate throughput using a topology graph of the system. The topology graph represents bus bandwidth as edge weights, allowing the computation of end-to-end throughput by finding the bottleneck.


Figure 6.14: Predicted vs. measured execution time for each command stream and for the entire batch

However, the computed throughput may differ from the actual throughput due to chipset limitations. We evaluated the relative error of our algorithm by comparing the computed values with throughput measurements in a series of experiments with concurrent data transfers on the Sandy Bridge system. For the experiments in Figure 6.13b, the error was 0 % for H2D and D2H, and between 0 % and −4 % for Bi. The error in these experiments was low because the edge weights in the topology graph were set to match these measurements as closely as possible; the weights could not fully match the Bi bandwidth measurements, and an error of 4 % was observed. In some other combinations of data transfers, however, especially those including bi-directional transfers, larger errors were observed: the maximum error of 26 % occurred when data was transferred on the routes CPU0–GPU0 and CPU0–GPU3 in both directions, whereas the maximum error for uni-directional transfers was much smaller (0.7 % for H2D and 5.4 % for D2H). Using the technique for representing limited bi-directional bandwidth illustrated in Figure 6.9, we were able to improve the accuracy for these cases, but some limitations of the chipset could not be modeled.

6.4.6 Discussion

We showed in Section 6.4.5 that our method is efficient in executing workloads of data transfers and computations. The experimental results show that it achieves up to 7.9 times better results than the bandwidth allocation method (execution time) and up to 39 % better results than the time division method (higher image resolution) in realistic applications. In Section 6.4.5, we also showed that the batch execution time is predictable and that our latency prediction algorithm computes a close estimate of that execution time. Sufficient safety margins on bandwidth and setup time should be used to compensate for prediction errors where strict latency guarantees are required.

Our method executes batch operations that are a collection of command stream segments containing data block transfers and kernel calls. We have limited each command stream segment to contain only a data transfer followed by a kernel call (both optional), forcing the application to divide longer streams into multiple batch operations. Nonetheless, fine-grained scheduling can be achieved using two techniques: (1) dividing large data blocks into smaller chunks (also suggested in other works [KL11, BK12]), and (2) scheduling kernels in parallel with batch operations, without using the batch method's API. When using the second technique, the application needs to make sure that the kernel and the batch operation do not use the same GPU for computations. Using these techniques, the application can make fine-grained scheduling decisions for data transfers and kernel calls. Our method computes execution times assuming that no external factors influence the execution. Therefore, during the execution of a batch operation, the interconnect must be used exclusively by the execution algorithm. To guarantee this, all data transfers that use the interconnect must be issued through the batch method's interface. This requirement restricts the use of some memory access modes that are sometimes used in GPGPU applications, including direct read/write of remote memories between the CPUs and the GPUs, and peer memory access by GPUs. The application is allowed to run kernels in parallel with executing batch operations, provided that the target GPU is not used by the batch operation; a small execution time margin needs to be added to the time prediction in this case to compensate for the extra communication.

In Section 6.4.3 we presented an algorithm for computing throughput using a topology graph of the system with edge weights that represent bus bandwidth. The precision evaluation in Section 6.4.5 shows that this method is effective for data transfers from the CPU to the GPUs and in the opposite direction, but not for bi-directional data transfers, for which the results are not consistent with the topology graph analysis. In applications with workloads for which the throughput computation is not accurate, we recommend using a lookup table that contains the throughput values for every combination of data transfer routes that may be used in the application. The throughput values can be found using benchmarks. Since a batch operation uses the interconnect exclusively, the benchmark results are expected to be consistent with the throughput values during execution.

6.5 Controlling Bandwidth Distribution to Reduce Execution Times

The batch method achieves high throughput by sending data on multiple routes in the interconnect concurrently, thus saturating the bandwidth of buses shared by more than one route. By default, the bandwidth of a bus is distributed fairly between the data transfers that traverse it. Here, we present a technique for reducing the execution times of batch operations by manually controlling the bandwidth distribution.

Consider the example shown in Figure 6.15. Two data blocks with different sizes are transferred in contention for the shared 8 GB/s bus. With the fair bandwidth distribution policy (default), each block gets 4 GB/s until the smaller block completes, and then the remainder of the larger block is transferred at 6 GB/s.

The data blocks can be transferred faster with a different distribution of the 8 GB/s bus bandwidth. In the optimal case, the larger block would get 5.33 GB/s and the smaller would get 2.67 GB/s, and the two transfers would complete in 22.5 ms, which is 10 % faster than with the default distribution. This example demonstrates that shorter execution times can be achieved by optimizing the bandwidth distribution.

[Figure omitted. (a) Examined scenario: a 60 MB block to GPU0 and a 120 MB block to GPU1 leave CPU memory over a shared 8 GB/s bus and continue on 6 GB/s private buses. (b) Fair distribution: 4 GB/s each, completions at 15 ms and 25 ms. (c) Optimal distribution: 2.67 GB/s and 5.33 GB/s, both complete at 22.5 ms.]

Figure 6.15: Example: reducing transfer time by optimizing the bandwidth distribution
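The timeline in Figure 6.15 can be reproduced with a few lines of arithmetic. The sketch below (our own, using decimal MB and GB so that MB divided by GB/s gives ms) computes the completion times under the fair and the optimal distributions:

% Fair vs. optimal bandwidth distribution for the example of Figure 6.15
D1 = 60; D2 = 120;            % block sizes [MB]
W  = 8;  w  = 6;              % shared and private bus bandwidths [GB/s]

% Fair policy: 4 GB/s each until D1 completes, remainder of D2 at w
t1    = D1 / (W/2);                       % = 15 ms
tFair = t1 + (D2 - (W/2)*t1) / w;         % = 25 ms

% Optimal policy: the larger block gets 5.33 GB/s, the smaller 2.67 GB/s
b     = D1 / (D1 + D2);                   % = 1/3 (see Eq. 6.5 below)
tOpt  = max(D1/(b*W), D2/((1-b)*W));      % = 22.5 ms, both blocks finish together

speedup = (tFair - tOpt) / tFair;         % = 0.10, i.e., 10 %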

PCI Express provides several mechanisms for controlling bandwidth distribution. One of these mechanisms is to set the priority field in data packets, so that during contention controllers give higher priority to packets with a higher priority field value. An application can control bandwidth distribution by assigning higher priority to packets that belong to specific data transfers; however, this requires an interface between the application and the DMA controller that initiates the packets. Currently, GPU vendors do not expose such an interface, so we conduct a theoretical analysis of the benefits of optimizing the bandwidth distribution.

6.5.1 Problem Definition

We consider the configuration illustrated in Figure 6.16. Two data blocks of sizes D_1 and D_2 are transferred on routes that pass through a common bus (shared) and then split to different buses (private), starting at time t = 0. This configuration is relevant to many systems, because often the bottleneck is an intersection point of two bus types with different bandwidth. We denote the bandwidth of the shared bus by W, and the bandwidth of each of the private buses by w.

Hereafter, we assume without loss of generality that D_1 ≤ D_2. For the case where W ≥ 2w, each data block gets the full bandwidth of its private bus; hence there is no need for an alternative bandwidth distribution. Another special case is W ≤ w, where the average data rate equals W regardless of the bandwidth distribution policy. Therefore, we further consider only cases where w < W < 2w.


Figure 6.16: Split in the route of two data transfers that share a bus

Figure 6.17: Example: influence of the bandwidth distribution factor on transfer time. Parameters: data block sizes D_1 = 1.5, D_2 = 2 and bus bandwidths w = 1, W = 1.5.

To characterize the bandwidth distribution between the transfers, we define the bandwidth distribution factor, denoted by b, as the fraction of the shared bus bandwidth that goes to the first data transfer. Thus, during contention, block D_1 is transferred at data rate bW and block D_2 at data rate (1 − b)W. Since the data rate of a block cannot exceed the bandwidth of the private bus, the bandwidth distribution factor is limited as follows:

W − w ≤ bW ≤ w . (6.1)

Figure 6.17 illustrates the influence of the bandwidth distribution factor on the total transfer time for the setup [D_1 = 1.5, D_2 = 2, w = 1, W = 1.5]. The following problem is considered:

Problem 11. Given setup parameters D_1, D_2, W, and w, find the bandwidth distribution factor b that minimizes the total transfer time.

6.5.2 Formula for the Optimal Bandwidth Distribution Factor

In this section, we present a formula for the optimal bandwidth distribution factor. We start by developing the formula for the total transfer time as a function of b. There are two periods during the data transfer: contention and remainder. The contention ends as soon as (at least) one of the data transfers completes, i.e., at the earlier of D_1/(bW) and D_2/((1−b)W), and the remaining time equals the time it takes to transfer the remaining data at speed w. We denote the total transfer time, contention time, and remainder time by T, T_c, and T_r (T = T_c + T_r); the formulas for their computation are as follows:

    T_c = \min\left( \frac{D_1}{bW}, \frac{D_2}{(1-b)W} \right)                     (6.2)

    T_r = \frac{D_1 + D_2 - T_c \cdot W}{w}                                         (6.3)

    T = \frac{D_1 + D_2}{w} - T_c \cdot \left( \frac{W}{w} - 1 \right).             (6.4)
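As a numerical illustration of Equations (6.2)–(6.4), the sketch below (our own variable and function names) evaluates T(b) for the setup of Figure 6.17:

D1 = 1.5; D2 = 2; w = 1; W = 1.5;           % setup of Figure 6.17
Tc = @(b) min(D1./(b*W), D2./((1-b)*W));    % contention time, Eq. (6.2)
T  = @(b) (D1+D2)/w - Tc(b).*(W/w - 1);     % total transfer time, Eq. (6.4)
b  = [(W-w)/W, D1/(D1+D2), 0.5, w/W];       % range limits, intersection, fair
T(b)
% returns approximately [2.50  2.33  2.50  2.75]; the minimum at
% b = D1/(D1+D2) = 3/7 matches the curve shown in Figure 6.17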

The aggregate throughput during contention is higher than during the remainder; hence the total transfer time will be shortest when the contention time is longest. The formula for T indeed shows that T is at its minimum when T_c is at its maximum, because the expression (W/w − 1) is always positive. We now find the value of b that maximizes T_c. T_c is the minimum of two monotonic expressions, D_1/(bW) and D_2/((1−b)W); hence it reaches its maximum when their values are closest (see Figure 6.18). Since the first expression is decreasing and the second is increasing, T_c is at its maximum either at an intersection point or at the minimum value of b, i.e., (W−w)/W. The transfer time will clearly be shortest when the two data transfers complete at the same time, if possible, or when D_2 gets the maximum rate (and still completes after D_1). We show this analytically. The two expressions intersect for the following value of b:

    \frac{D_1}{b_x W} = \frac{D_2}{(1 - b_x)W} \;\Rightarrow\; b_x = \frac{D_1}{D_1 + D_2}.

If b_x lies below the range of valid values for b, then the smallest valid value, b_min = (W − w)/W, results in the shortest execution time. Let b_opt be the bandwidth distribution factor that results in the shortest execution time. Since b_opt has passed the intersection point (b_opt = b_min > b_x), the first expression is smaller than the second. It follows that for b = (W − w)/W,

    T_c = \min\left( \frac{D_1}{bW}, \frac{D_2}{(1-b)W} \right) = \frac{D_1}{bW} = \frac{D_1}{W - w}.

On the other hand, if b_min ≤ b_x, then the execution time is shortest at the intersection point b_opt = b_x = D_1/(D_1 + D_2). In summary, the optimal bandwidth distribution factor is computed as follows:

    b_{opt} = \begin{cases} \dfrac{W-w}{W}, & \dfrac{W-w}{W} > \dfrac{D_1}{D_1+D_2} \\[6pt] \dfrac{D_1}{D_1+D_2}, & \text{otherwise}. \end{cases}          (6.5)

We substitute b = b_opt into the formula for the contention time to derive the maximum contention time:

    T_c^{max} = \min\left( \frac{D_1 + D_2}{W}, \frac{D_1}{W - w} \right).

Next, we substitute T_c = T_c^{max} into the formula for T and use the identity \min(x, y) = -\max(-x, -y); thus, we arrive at the final form of the formula for the minimum total transfer time:

    T^{min} = \max\left( \frac{D_1 + D_2}{W}, \frac{D_2}{w} \right).                (6.6)
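Equations (6.5) and (6.6) translate directly into the short sketch below (the function name is our own). Called with the parameters of Figure 6.17 it returns b_opt = 3/7 and T_min ≈ 2.33, matching the minimum of the plotted curve; called with the 60 MB / 120 MB example of Figure 6.15 it returns b_opt = 1/3 and T_min = 22.5 ms.

function [bopt, Tmin] = optimalDistribution(D1, D2, W, w)
% Optimal bandwidth distribution factor (Eq. 6.5) and the resulting
% minimum total transfer time (Eq. 6.6), assuming D1 <= D2 and w < W < 2w.
if (W - w)/W > D1/(D1 + D2)
    bopt = (W - w)/W;
else
    bopt = D1/(D1 + D2);
end
Tmin = max((D1 + D2)/W, D2/w);
end

% example: [bopt, Tmin] = optimalDistribution(1.5, 2, 1.5, 1)   % 0.4286, 2.33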

6.5.3 Evaluation of Bandwidth Distribution Method

In this section, we evaluate whether optimizing the bandwidth distribution yields a shorter transfer time than the default fair distribution.


Figure 6.18: The minimum of the two monotonic expressions at b > bx is on the decreasing line

function of b and for the minimum transfer time (formulas 6.2 and 6.6): ( ) D1 + D2 − · W − D1 D2 T = Tc ( 1); Tc = min , − w ( w ) bW (1 b)W D + D D T min = max 1 2 , 2 . W w

The bandwidth distribution factor under fair distribution is b_fair = 0.5, yielding the following total transfer time:

    T^{fair} = \frac{2D_1}{W} + \frac{D_2 - D_1}{w}.                                (6.7)

The speedup achieved by using the optimal bandwidth distribution instead is given by the following expression (after simplification):

    X = \frac{T^{fair} - T^{min}}{T^{fair}} = \min\left( \underbrace{\frac{(D_2 - D_1)(W - w)}{2D_1 w + (D_2 - D_1)W}}_{(1)}, \underbrace{\frac{D_1(2w - W)}{2D_1 w + (D_2 - D_1)W}}_{(2)} \right).   (6.8)

Since the speedup depends on the problem parameters, an initial estimate of the benefit from bandwidth distribution can be obtained by finding the ratio between D1 and D2 that gives the highest speedup. For convenience, we denote the two expressions inside the min function in the formula for X by (1) and (2).

We observe that, with respect to D_2, (1) is monotonically increasing and (2) is monotonically decreasing; thus, the minimum of (1) and (2) is greatest at the point of their intersection, if it exists. We find the value of D_2 at this point:

(1) = (2)

    \frac{(D_2 - D_1)(W - w)}{2D_1 w + (D_2 - D_1)W} = \frac{D_1(2w - W)}{2D_1 w + (D_2 - D_1)W} \;\Rightarrow\; D_2^{max} = D_1 \cdot \frac{w}{W - w}.   (6.9)

We observe that for this value of D_2, the optimal bandwidth distribution factor is (W − w)/W (Equation 6.5) and the two data block transfers complete at the same time. The value of X in this case is as follows:

D (2w − W ) 3wW − 2w2 − W 2 X(Dmax) = (2) = 1 = = 2 2w−W · 4wW − 2w2 − W 2 2D1w + W −w WD1 W [substitution : u = (1 < u < 2)] w u2 − 3u + 2 = . u2 − 4u + 2

We next find the optimal ratio between the large and small bus bandwidths, denoted by u, by computing the derivative of X:

    X'(u) = \frac{2 - u^2}{(u^2 - 4u + 2)^2} = 0 \;\Rightarrow\; u = \sqrt{2}

    X(u = \sqrt{2}) = \frac{4 - 3\sqrt{2}}{4 - 4\sqrt{2}} \approx 0.14645           (6.10)

    b_{opt} = \frac{W - w}{W} = \frac{\sqrt{2} - 1}{\sqrt{2}} \approx 0.2929.       (6.11)

The analysis shows that the maximum speedup that can be achieved by optimizing the bandwidth distribution in the configuration considered in this section is 14.6 %. This speedup is achieved for W/w = √2 and D_2/D_1 = w/(W − w), using a bandwidth distribution factor of 0.2929.
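The 14.6 % bound can also be checked numerically by sweeping the valid parameter range. The sketch below (our own, built directly on Equations 6.6 and 6.7) evaluates the speedup over fair distribution on a grid of W/w and D_2/D_1 values:

% Numerical check of the maximum speedup over fair distribution
w = 1;                                 % normalize the private-bus bandwidth
u = linspace(1.01, 1.99, 199);         % W/w must satisfy 1 < u < 2
r = linspace(1, 20, 2000);             % D2/D1 ratios (with D1 = 1)
[U, R] = meshgrid(u, r);
W = U * w;  D1 = 1;  D2 = R * D1;
Tfair = 2*D1./W + (D2 - D1)/w;         % Eq. (6.7)
Tmin  = max((D1 + D2)./W, D2/w);       % Eq. (6.6)
X = (Tfair - Tmin) ./ Tfair;           % speedup over fair distribution
max(X(:))                              % ~0.146, i.e., about 14.6 %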

6.5.4 Conclusions

By default, bus bandwidth in modern compute node interconnects is distributed fairly between the data transfers that pass through it. We have shown that using a custom bandwidth distribution policy makes it possible to reduce the total transfer time of data blocks on different paths. This approach could speed up the batch method and yield higher performance of the stream processing system due to better utilization of the interconnect. Mechanisms for controlling bandwidth distribution in buses are supported by communication standards such as PCI Express, but some chipset and DMA controller manufacturers do not implement or expose this functionality, which is available only in low-level interfaces. Making these capabilities available at the level of task schedulers could enable more efficient use of interconnect resources.

We examined the problem of total transfer time minimization for two data blocks that pass through a shared bus and continue through different buses with smaller bandwidth. This scenario is common in high-performance compute nodes, where bandwidth bottlenecks often occur at a connection point between communication domains (such as QPI and PCIe). We developed a formula for computing the bus bandwidth distribution factor that minimizes the total transfer time, and also developed a formula to compute the expected transfer time in this case. The effectiveness of using an optimal bandwidth distribution was evaluated by computing the speedup in comparison to the default fair bandwidth distribution. We developed a formula for the speedup factor, which depends on the ratio between data block sizes and the ratio between bus bandwidths.

Having an upper bound on the expected speedup is useful for estimating whether the effort required to exploit these capabilities is worthwhile. We found that the speedup is limited to 14.6 %.

6.6 Task Set Optimization

In this section, for brevity, we use the term message when referring to a batch message.

6.6.1 Mapping Data Transfers to a Set of Batch Messages

As discussed in Section 6.4.3, each pipeline stage contains an operation that requires system resources, and this requirement needs to be satisfied before the beginning of the next pipeline cycle. Therefore, the requirement and time restrictions of the data transfer operations can be described as a set of messages. We call the set of batch messages, where each batch message describes a data transfer operation, the elementary message set. Note that other batch message sets can be used to represent the data transfer requirements in the system, but the elementary set is unique for any system; other representations can result, for example, from splitting a 12MB data transfer into four 3MB transfers.

Definition 12. An elementary message set

    E = \{ m_i : (x_i, src_i, dst_i, p_i, a_i) \}_{i=1}^{n}

describes the data transfer requirements of the system. For each data transfer operation there is one corresponding batch message in E that represents its data transfer requirements and timing constraints.

One can view the elementary message set as the basic specification of the communication requirements from the system. The system can optimize these messages by splitting them into smaller data blocks and combining blocks from different messages. By construction, a schedule of an elementary message set is valid if and only if it satisfies the system's requirements and timing constraints for data transfers. In order to optimize the communication, the system can generate other message sets that use the interconnect more efficiently; however, it needs to make sure that they are compatible with the original timing constraints. For example, the following two message sets describe the same resource requirement:

    E = { (10MB, CPU0, GPU1, 15ms, 0),
          (10MB, GPU1, CPU0, 15ms, 0),
          (15MB, CPU0, GPU2, 20ms, 0) }

    M = { ({ (10MB, CPU0, GPU1), (10MB, GPU1, CPU0) }, 15ms, 0),
          (15MB, CPU0, GPU2, 20ms, 0) }.

Note that message set E has three messages with one data block transfer each, and M has two messages, one of which was generated by combining two messages in E.
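To make the notation concrete, one possible MATLAB encoding of these two sets is shown below; the field names, the cell-of-blocks layout, and the MB/ms units are our own conventions for this example, not an interface defined by the thesis:

% Each message: a cell array of {size_MB, src, dst} blocks, a period p [ms],
% and a start time a [ms].
E(1).blocks = { {10,'CPU0','GPU1'} };  E(1).p = 15;  E(1).a = 0;
E(2).blocks = { {10,'GPU1','CPU0'} };  E(2).p = 15;  E(2).a = 0;
E(3).blocks = { {15,'CPU0','GPU2'} };  E(3).p = 20;  E(3).a = 0;

% M bundles the first two transfers of E into a single batch message.
M(1).blocks = { {10,'CPU0','GPU1'}, {10,'GPU1','CPU0'} };
M(1).p = 15;  M(1).a = 0;
M(2).blocks = { {15,'CPU0','GPU2'} };
M(2).p = 20;  M(2).a = 0;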

6.6.2 Message Set Transformation Algorithm

Algorithms described in the literature schedule messages at the elementary set level. These schedulers do not overlap data transfers because it is assumed that each data transfer needs to use the interconnect exclusively. This causes the interconnect to be used inefficiently and may also cause the system to fail the schedulability test. Our goal is to reduce the execution time demand of the message set, hence reducing the load on the system, i.e., the utilization. Note the difference between efficiency and utilization – efficient data transfer makes full use of the bandwidth of the interconnect, while system utilization is considered to be too high if the time requirement for using the interconnect cannot be satisfied.

Listing 6.3 describes the message set transformation algorithm in our scheduler. The algorithm transforms the elementary message set into one with lower utilization by iteratively combining messages that can be transferred in parallel. Hence, it increases the aggregate bandwidth and shortens the transfer time. The transforming operations, which will be described in the next subsection, preserve the message set's property of being compatible with the elementary message set. Our algorithm uses a greedy approach, where the optimization parameter is the sum utilization of the message set (\sum_i u_i). The sum utilization is the fraction of time that the resource is required over an infinite period of time.

function R = reduceUtilization(E)
% E - Elementary message set
% R - Optimized batch message set
R = E;
gain = gainMatrix(R);   % for msgs i,j, gain(i,j) stores
                        % u(m_i) + u(m_j) - u(combine(m_i,m_j)),
                        % where u(m_i) is the utilization of m_i
[dumax, idx] = max(gain(:));             % dumax: max utilization gain
[i_m1, j_m2] = ind2sub(size(gain), idx);
while dumax > 0
    R = combine(R, i_m1, j_m2);          % combine the best pair
    gain = update(gain, R, i_m1, j_m2);  % recompute the affected gains
    [dumax, idx] = max(gain(:));
    [i_m1, j_m2] = ind2sub(size(gain), idx);
end
end

Listing 6.3: Message set transformation algorithm

6.6.3 Message Combining Operations

Two messages can be combined into a message that includes data transfers from both original messages. Combining messages can reduce the utilization without relaxing the scheduling restrictions. We distinguish between synchronous and asynchronous message combining. Synchronous combining is more efficient, but can only be applied if the messages fulfill certain alignment conditions that are not required in asynchronous combining.

[Figure omitted. (a) Original message set: m1 with a 24 MB block every 20 time units, m2 with a 16 MB block every 10. (b) After combining: a single message m' carrying 12 MB + 16 MB every 10 time units.]

Figure 6.19: Synchronous combining example


Synchronous Combining

Consider the two messages illustrated in Figure 6.19a. The numbers in squares indicate the block sizes in MB and the vertical marks indicate their arrival times. The synchronous method adjusts the scheduling bounds of the message with the longer period (m1) to match the bounds of the other message (m2), and then combines the two messages into one. The bounds are adjusted by splitting m1's data block into smaller blocks and splitting the period respectively. As illustrated in Figure 6.19b, the block arrival times of the two messages are fully synchronous, so they can be combined. A necessary condition for synchronously combining messages is, hence, that the block arrival times of the message with the longer-or-equal period coincide with those of the other message.

Definition 13. We define synchronous combining as an operation that receives two messages,

    m_1 = (\{(x_k, src_k, dst_k)\}_{k=1}^{n_1},\; p_1,\; a_1)
    m_2 = (\{(x_k, src_k, dst_k)\}_{k=n_1+1}^{n_1+n_2},\; p_2,\; a_2),

where p_1 = n p_2 (n \in \mathbb{N}^+) and a_1 \equiv a_2 \pmod{p_2}, and returns

    m' = \left( \{(x_k/n, src_k, dst_k)\}_{k=1}^{n_1} \cup \{(x_k, src_k, dst_k)\}_{k=n_1+1}^{n_1+n_2},\; p_2,\; \min(a_1, a_2) \right).

Asynchronous Combining

Figure 6.20a shows an example of asynchronous combining. As illustrated, m1, with the larger period, overlaps four full periods of m2. Therefore, if we transfer one quarter of m1's 80 MB block in each of these periods, the data will be transferred before the deadline. We use this observation to combine the two messages into a single batch message (Figure 6.20b). Note that message m' reserves time for a 20 MB transfer of m1's data every 10 ms, but the reserved time in the 10 ms instances that are not fully overlapped by the larger period is not used at run-time.

In general, let m_1 and m_2 be two messages with periods p_1 and p_2, such that p_1 ≥ p_2 and the periods cannot be synchronously combined. Then, for some n \in \mathbb{N} and 0 < \Delta p_{12} < p_2, p_1 = (n + 1)p_2 + \Delta p_{12}.

[Figure omitted. (a) Original message set: m1 with an 80 MB block and m2 with a 16 MB block every 10 ms, shown over a 0–50 ms window. (b) After combining: batch message m' with 20 MB + 16 MB every 10 ms.]

Figure 6.20: Asynchronous combining example

We observe that, assuming arbitrary starting times, the minimum number of full p_2 periods overlapped by any instance of p_1 is n. Therefore, when combining the messages, we divide m_1's block into n pieces. Asynchronous combining can only be applied when the period of one message is at least twice as long as the other's, because otherwise a full overlap (i.e., n ≥ 1) of one period by the other is not guaranteed.

Suppose a_1 and a_2 are the starting times of m_1 and m_2. If a_1 ≥ a_2, then the starting time of the combined message is a_2. Otherwise, we need to align a_1 with the periods of m_2, i.e., with some time t_i = a_2 − i p_2; the formula for computing the starting time in this case is \bar{a}_2 = a_2 − \lceil (a_2 − a_1)/p_2 \rceil p_2.

Definition 14. We define asynchronous combining as an operation that receives two messages,

    m_1 = (\{(x_k, src_k, dst_k)\}_{k=1}^{n_1},\; p_1,\; a_1)
    m_2 = (\{(x_k, src_k, dst_k)\}_{k=n_1+1}^{n_1+n_2},\; p_2,\; a_2),

such that p_1 = (n + 1)p_2 + \Delta p_{12} (n ≥ 0, p_2 > \Delta p_{12}), and returns

    m' = \left( \{(x_k/n, src_k, dst_k)\}_{k=1}^{n_1} \cup \{(x_k, src_k, dst_k)\}_{k=n_1+1}^{n_1+n_2},\; p_2,\; \bar{a}_2 \right),

where \bar{a}_2 = a_2 if a_2 ≤ a_1, and \bar{a}_2 = a_2 − \lceil (a_2 − a_1)/p_2 \rceil p_2 if a_2 > a_1.
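A minimal sketch of the two combining operations, reusing the example message encoding from Section 6.6.1, is given below. The function names are ours, and the sketch assumes that the preconditions of Definitions 13 and 14 have already been checked:

function mc = combineSync(m1, m2)
% Synchronous combining (Definition 13): p1 = n*p2 and aligned arrivals.
n = m1.p / m2.p;                                     % integer by precondition
split = cellfun(@(b) {b{1}/n, b{2}, b{3}}, m1.blocks, 'UniformOutput', false);
mc.blocks = [split, m2.blocks];
mc.p = m2.p;
mc.a = min(m1.a, m2.a);
end

function mc = combineAsync(m1, m2)
% Asynchronous combining (Definition 14): p1 = (n+1)*p2 + dp, 0 < dp < p2.
n = floor(m1.p / m2.p) - 1;                          % fully overlapped p2 periods
split = cellfun(@(b) {b{1}/n, b{2}, b{3}}, m1.blocks, 'UniformOutput', false);
mc.blocks = [split, m2.blocks];
mc.p = m2.p;
if m2.a <= m1.a
    mc.a = m2.a;
else
    mc.a = m2.a - ceil((m2.a - m1.a)/m2.p) * m2.p;
end
end

For instance, taking p1 = 55 ms as a hypothetical period consistent with Figure 6.20 (with an 80 MB block) and m2 = 16 MB every 10 ms, combineAsync yields n = 4 and 20 MB pieces, matching the 20 MB + 16 MB pattern of Figure 6.20b.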

6.6.4 Message Set Scheduling and Validity Test

The optimized message set defines the execution and timing requirements of the batch data transfer jobs for the scheduler. The interconnect serves every batch job exclusively, so it is modeled as a uniprocessor. The batch messages are scheduled as non-preemptive implicit-deadline periodic tasks with specified release times; the preemptive model is not suitable in this case because (offline) preemption involves DMA configuration and synchronization overhead. Our scheduler uses the non-preemptive EDF algorithm, which is optimal for such tasks [GMR95]. Jeffay et al. showed that the problem of testing the validity of the schedule is NP-hard [JSM91], and presented a sufficient polynomial-time validity test that is used by our scheduler [JAM88]. A MATLAB implementation of the validity test algorithm appears in Section 6.9.4.
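For completeness, a sketch of such a sufficient test is given below. It follows the well-known conditions of Jeffay et al. for non-preemptive EDF on a single resource, assumes integer execution times C(i) and periods T(i), and is our own illustration rather than the implementation referenced above:

function ok = npEdfSufficientTest(C, T)
% Sufficient schedulability test for non-preemptive EDF (sketch).
% C(i): execution time, T(i): period of task i, both integers.
[T, order] = sort(T);  C = C(order);
n  = length(T);
ok = sum(C ./ T) <= 1;                    % condition (1): total utilization
if ~ok, return; end
for i = 2:n
    for L = T(1)+1 : T(i)-1               % condition (2): blocking by task i
        demand = C(i) + sum(floor((L - 1) ./ T(1:i-1)) .* C(1:i-1));
        if L < demand
            ok = false;
            return;
        end
    end
end
end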

6.7 Evaluation

We evaluate the efficiency of the proposed data transfer scheduler on two realistic applications from different domains: wafer inspection, and multi-stream encryption in real time. For each application, we demonstrate how the scheduler can be integrated into the system and measure its effect on performance.

6.7.1 Application 1: Wafer Inspection

A common method in wafer-production inspection (wafer metrology) is to locate defects during the manufacturing process by analyzing images taken by high-resolution cameras. The inspection must keep pace with the production line and meet hard real-time deadlines. We built a model of a GPU-based inspection system using realistic data rates and compute requirements. Our system is composed of a set of high-resolution cameras connected to a number of multi-GPU compute nodes. Each camera produces images of 200x1000 pixels with 16-bit encoding at a data rate of 360 MB/s (an image every 1.11 ms). The images from all the cameras are synchronously streamed into the nodes via fiberoptic-to-PCIe frame grabbers. Each image goes through five processing steps: (1) it is copied from the frame grabber to a GPU; (2) a series of image-processing algorithms, with an average demand of 1000 operations per pixel, are applied; (3) a synopsis of the results (only 4 KB per image) on each GPU is sent to the other GPUs; (4) the final defects list is produced; and (5) the list is copied to CPU memory. Steps 1 and 2 are much more time consuming than steps 3–5. The system processes each 'generation' of images synchronously on all GPUs. We consider compute nodes with four PCIe 2.0 slots (x16). One of them is used by the frame grabber and the others by NVIDIA K20m GPUs.

Prior to this work, the system processed the images one generation at a time, since previous attempts to do it in a pipeline failed due to contention for shared resources. To improve the system throughput, we first characterized the data streams according to the presented model. Then, we applied the algorithm in Listing 6.3 to find an efficient message set to represent the input. Finally, we used non-preemptive EDF to schedule the communication on the interconnect and the computations on each GPU.
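As an illustration of the characterization step, the sketch below builds the elementary message set for the frame-grabber-to-GPU transfers of a hypothetical 12-camera configuration, reusing the example message encoding from Section 6.6.1. The 0.4 MB image size follows from 200x1000 16-bit pixels, and the ~1.11 ms period from the 360 MB/s camera rate; the source label and the round-robin GPU assignment are assumptions made for the example:

nCameras = 12;                        % hypothetical camera count
imgMB    = 200*1000*2 / 1e6;          % 0.4 MB per image (200x1000 px, 16 bit)
periodMs = imgMB / 360 * 1e3;         % ~1.11 ms between images at 360 MB/s
gpus     = {'GPU0','GPU1','GPU2'};    % assumed round-robin split over 3 GPUs
for i = 1:nCameras
    E(i).blocks = { {imgMB, 'FrameGrabber', gpus{mod(i-1,3)+1}} };
    E(i).p = periodMs;                % period [ms]
    E(i).a = 0;                       % first arrival [ms]
end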

Results. By using the scheduler proposed in this work, the throughput of each compute node was increased by 67 %, with a processing capability of up to 15 cameras.

6.7.2 Application 2: Stream Processing Framework

We integrated the scheduling algorithm into a framework for processing asynchronous data streams under strict latency constraints on CPU-GPU systems [VSS11, VSSM12]. The user provides to the framework a characterization of a set of data streams that includes the data rate and latency bound (deadline) for each stream.

[Figure omitted: interconnect topology graphs of (a) System 1 and (b) System 2]

Figure 6.21: Interconnect topology. Edge labels denote bandwidth in GB/s.

The framework searches for a division of the streams into two groups – one to be processed by the CPUs and the other by the GPUs – that allows a valid schedule to be generated for both groups. We integrated our scheduler into the framework and experimented with different system configurations and workloads. The framework uses a model of the system architecture to predict the execution times. Therefore, we were able to evaluate its performance on different systems by merely changing the configuration, without executing on the actual hardware. The framework leaves sufficient error margins to cover possible inaccuracies in execution time prediction, so any schedule that the framework verified to be valid can be executed on the system without deadline misses. We examine the framework on a real streaming application, using two realistic hardware configurations, for a range of generated workloads that stress the system.

System Configurations

We ran the experiments using the configurations of two existing systems. Both systems have a multi-core CPU and two NVIDIA GTX680 GPUs with one DMA controller each. The two systems differ in the bandwidth of their interconnects, which are described in Figure 6.21.

Application

We evaluate the throughput of the framework for AES-CBC 128-bit encryption of multiple data streams. For this application, the CPU has a maximum throughput of 250 MB/sec, and each GPU has a maximum throughput of 7500 MB/sec (1 MB = 10^6 bytes). We generated a number of data stream pools, where in each pool the data rates and latency bounds of the streams were randomly generated according to one of the following distributions: constant, uniform, normal, and truncated exponential. For each distribution, we first randomly generated numbers in the range [0,1] and then projected them to the range of acceptable values. Data rates were projected to the range [0.2, 6] Mbit/sec, and deadlines were projected to [0.5, 200] ms. The settings for the generation of these workloads are described in Figure 6.22a.
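As an illustration, a minimal generator for one of the stream pools (experiment 9: exponential with STD = 1.0, truncated at 5) might look as follows. All names are ours, and the rescaling of the truncated samples into [0,1] before projection is our own reading of the procedure:

nStreams   = 1000;                           % hypothetical pool size
u = min(-log(rand(nStreams,1)), 5) / 5;      % Exp(1) samples, truncated at 5,
                                             % rescaled into [0,1]
rateMbit   = 0.2 + u .* (6 - 0.2);           % project to [0.2, 6] Mbit/s
v = rand(nStreams,1);                        % deadlines: uniform in [0,1]
deadlineMs = 0.5 + v .* (200 - 0.5);         % project to [0.5, 200] ms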

AES-CBC 128-bit encryption of multiple streams

 1. constant 0.5
 2. constant 0.1
 3. uniform in [0,1]
 4. uniform in [0.4,0.6]
 5. normal AVG=0.5 STD=0.2
 6. normal AVG=0.5 STD=1.0
 7. normal AVG=0.2 STD=0.2
 8. normal AVG=0.8 STD=0.2
 9. exp STD=1.0 truncate at 5
10. exp STD=1.0 truncate at 1
11. exp STD=0.2 truncate at 1
12. deadlines: uniform in [0,1]; data rates: constant 0.5

(a) Generation parameters for data rates and deadlines used in twelve experiments

[Charts omitted: throughput in MB/s (normal vs. optimized) and speedup per experiment 1–12, for (b) System 1 and (c) System 2]

Figure 6.22: Throughput increase due to more efficient communication scheduling

Results

The two charts in Figure 6.22 show the increase in throughput achieved by using the new data transfer scheduler in the two systems. In System 1, the framework was able to process 74 % more streams on average, and in System 2 the speedup was 31 %.

The scheduler automatically detected that overlapping transfers in opposite directions for different GPUs gives the highest bandwidth; hence, it outperformed the baseline scheduler, which overlaps data transfers in the same direction. The measured speedups are in line with the respective bandwidth ratios. For example, in System 2, the aggregate bandwidth in opposite directions (11.6 GB/s) is 37 % higher than the average bandwidth in the same direction (8.5 GB/s). Overlapping all four transfers was not possible since bi-directional data transfer is not supported by these GPUs.

6.8 Conclusions

In recent years, the compute power of CUs has grown faster than the interconnect bandwidth. As a result, the performance of many data-intensive applications is communication bounded. Previous works on real-time processing on heterogeneous compute nodes focused mainly on the problem of scheduling the computation. In this chapter, we focused on the communication scheduling problem.

Current state-of-the-art resource sharing methods for real-time systems logically partition resources that can serve multiple jobs at the same time into independent resources with smaller and fixed capacity that work on one job at a time. This model, where concurrent jobs are executed in isolation, has limited efficiency for complex resources such as the system interconnect, where execution of a data transfer (a job) on some path A → B can decrease the available bandwidth (capacity) for a concurrent data transfer on some intersecting path C → D. The abstraction of isolation in such cases is commonly achieved by allocating resources in excess, such as assuming worst-case bus contention in the interconnect example. This chapter presented a new approach where the jobs can share the full capacity of the resource without rigid allocation, and predictability is achieved using a detailed performance model.

We presented a new method for scheduling the data transfer requirements of an RTSPS with multiple CUs. To address the tradeoff between scheduling complexity and resource utilization efficiency, the method employs a two-level scheduling hierarchy. The top level schedules compound data transfer jobs, i.e., batch operations, to execute one at a time, while the bottom level schedules individual data transfers within a batch operation. This hierarchy achieves both effective scheduling and timing analysis at the task level, along with efficient utilization of the interconnect at the level of individual data transfer operations.

The batch method was introduced as an efficient implementation of the bottom-level scheduler. The method makes use of a detailed model of the interconnect to minimize and predict the execution time of batch operations, which can include data transfers and kernel executions. It makes use of multiple DMA engines and scheduling order optimizations to saturate the available bandwidth and ultimately reduce the makespan. Since a batch operation uses the interconnect exclusively, the batch method can effectively simulate the distribution of data rates during execution using a performance model and accurately predict the execution time of the operation. This prediction is later used to describe the corresponding job at the top-level scheduler. The batch method achieved up to 7.9 times better results than the bandwidth allocation method (execution time) and up to 39 % better results than the time division method (higher image resolution) in realistic applications.

To further reduce the execution time of batch operations, we considered optimization of the bandwidth distribution on buses that limit the data transfer rates. The default bus bandwidth distribution policy in case of contention is fair distribution. We have shown that by optimizing the bandwidth distribution at a point of contention, the transfer time of two data blocks can be reduced by up to 14.6 %, and we developed a method for computing the optimal bandwidth distribution.

Limited support in hardware and software makes it difficult to utilize this capability: although low-level mechanisms to control the bandwidth distribution exist in some communication standards, such as PCIe packet priorities, many hardware components do not currently implement them, nor is this capability exposed at higher-level interfaces. Investigating the practical application of this method is left for future research.

Data transfers in an RTSPS can be described by a set of periodic data transfer tasks with deadlines (messages). We presented a comprehensive scheduling method for these tasks that makes efficient use of the interconnect. The method redefines the data transfer requirements as a set of periodic batch tasks (batch messages), schedules the batch tasks, and uses the batch method to schedule the data transfers within each batch job. The transformation from the original set of data transfer tasks to the set of batch tasks is performed using a new algorithm that combines periodic data transfer tasks into periodic batch tasks. The algorithm is capable of combining tasks even with different periods. The combined task is constructed such that its scheduling constraints are at least as strict as the original constraints. The algorithm iteratively aims to optimize the total utilization, which is the fraction of time the resource is required by the tasks. Combining the tasks can reduce their utilization by achieving higher bandwidth, but the overhead may be larger than the gain. The algorithm greedily combines pairs of tasks with the greatest utilization gain, until no further gain is achieved. The optimized task set is scheduled using non-preemptive EDF. Our experiments show that our scheduler improved the overall performance of two realistic real-time systems by between 31 % and 74 %.

6.9 MATLAB Implementation

6.9.1 Batch Run-Time Scheduling Algorithm

function [makespan, schedule] = batchmethod(batch)
%BATCHMETHOD Generates an execution schedule for a batch job and computes
%its makespan (execution time).
%   Computes a schedule for executing a batch job that contains a number of
%   execution streams. Each execution stream is composed of a data transfer
%   followed by a kernel execution (both optional). The returned schedule
%   specifies the start and end time of each operation. In addition, the
%   function returns the makespan of the execution, i.e., the total
%   execution time.
%   The format of the input and output arguments is as follows:
%   INPUT:
%     batch = [ (dst1,src1,size1,kernel1),
%               (dst2,src2,size2,kernel2),
%               ...
%               (dstN,srcN,sizeN,kernelN)]
%   OUTPUT:
%     schedule = [ (copy begin, copy end, kernel begin, kernel end), %1
%                  (copy begin, copy end, kernel begin, kernel end), %2
%                  ...
%                  (copy begin, copy end, kernel begin, kernel end), %N]
%     makespan = a non-negative real number

global interconnect

if isempty(batch)
    makespan = 0;
    schedule = [];
    return;
end

% Since the stream completions are aligned, the event times are computed
% from completion to start. We define t=0 as the batch completion time
% and count the time from this time backwards, i.e., an event that
% happened 1 time unit before the completion time happens at t=1. The
% times are reversed to normal in the end, after computing the makespan.

%% Variables and notations
n = size(batch,1); % Number of execution streams in batch

% Schedule table. Event times are on the reversed schedule.
% (kernel begin, kernel end, copy begin, copy end)
reverseSchedule = inf * ones(n,4); %zeros(n,4);

%% Get list of DMA engines and re-enumerate as 1,2,3...
dmaIds = arrayfun(@(i) interconnect.getDmaId( ...
    batch{i,2},batch{i,1}), 1:size(batch,1));
uniqueDmas = unique(dmaIds);
dmaIds = arrayfun(@(x) find(uniqueDmas == x), dmaIds);
nDma = length(uniqueDmas); % Number of DMA engines

%% Schedule kernels
reverseSchedule = scheduleKernels(reverseSchedule, batch);

%% Schedule data transfers

% find first time point (on the reversed schedule) where a data
% transfer can be scheduled
t = min(reverseSchedule(:,2));

dmaInUse = zeros(nDma,1);   % list of DMAs in use
bexecuting = false(n,1);    % bit vector for data transfers in execution
nComplete = 0;              % number of streams completed
while(nComplete < n)
    % find potential ready data transfers on idle DMAs
    ready = find(...
        reverseSchedule(:,2) <= t & reverseSchedule(:,4) > t & ...
        dmaInUse(dmaIds(:)) == 0 );

    % if not all transfers finished, but there are no data transfers
    % in progress and no new data transfers can be started then
    % fast-forward to the next data transfer release time
    if isempty(ready) && all(bexecuting == 0)
        t = min(reverseSchedule(reverseSchedule(:,2) > t,2));
        continue;
    end

    % schedule new data transfers on idle DMAs
    if ~isempty(ready)
        % Try all possible combinations of transfers on DMA engines
        % to find the combination that results in the highest
        % aggregate throughput

        % split to DMA queues and limit to one data transfer per route
        [~,toScheduleFilter,~] = ...
            unique(strcat(batch(ready,1),batch(ready,2)));
        ready(~toScheduleFilter) = [];
        reschedDmas = unique(dmaIds(ready));
        dmaQueues = arrayfun(@(x) ready(dmaIds(ready) == x), ...
            reschedDmas, 'uniformoutput', false);

        % generate permutations
        vQueueLen = arrayfun(@(x) length(x{:}),dmaQueues)';
        permutations = 1:vQueueLen(1);
        for ii = 2:size(vQueueLen,1)
            permutations = combvec(permutations,1:vQueueLen(ii));
        end

        % find permutation with highest throughput
        maxThroughput = 0;
        selectedTrans = [];
        prevExecuting = find(bexecuting)';
        for iPerm = 1:size(permutations,2)
            p = permutations(:,iPerm);
            newRun = arrayfun(@(x) dmaQueues{x}(p(x)), ...
                1:size(dmaQueues,2))';
            allRun = [newRun; prevExecuting];
            T = computethroughput(allRun,batch);
            if sum(T) > maxThroughput
                maxThroughput = sum(T);
                selectedTrans = newRun;
            end
        end

        % update list of transfers in progress, DMAs in use, and
        % start time for new transfers
        bexecuting(selectedTrans) = 1;
        dmaInUse(dmaIds(selectedTrans)) = 1;
        reverseSchedule(selectedTrans,3) = t;
    end

    % compute the next data transfer completion time
    executing = find(bexecuting);   % selected for execution
    throughput = computethroughput(executing,batch);
    sizes = [batch{executing,3}]';  % remaining data sizes so far
    next_completion_time = t + min(sizes ./ throughput);
    next_release_time = min(reverseSchedule(reverseSchedule(:,2)>t,2));
    next_event_time = min([next_completion_time, next_release_time]);
    duration = next_event_time - t;

    % update remaining data to transfer in all executing streams
    EPS = 1e-10;   % account for roundoff error
    sizes = sizes - duration * throughput;
    sizes(sizes < EPS) = 0;
    for ii = 1:length(executing)
        batch{executing(ii),3} = sizes(ii);
    end

    % update state for completed data transfers
    completed = executing(sizes <= 0);
    reverseSchedule(completed,4) = next_event_time;
    dmaInUse(dmaIds(completed)) = 0;
    bexecuting(completed) = 0;
    nComplete = nComplete + length(completed);

    % update time variable
    t = next_event_time;
end

%% Add setup time and reverse schedule back to normal
% makespan is the time difference between first and last events
setup = setuptime(batch);
makespan = t + setup;
% reverse schedule
schedule = makespan - reverseSchedule(:,end:-1:1);
end

function reverseSchedule = scheduleKernels(reverseSchedule, batch)
% update times of command streams without kernels
reverseSchedule(strcmp(batch(:,4),''), 1:2) = 0;

% group kernels in per-GPU queues
uniqueKernelExecutingGpus = unique(batch(~strcmp(batch(:,4),''),1));
kernelQueues = arrayfun(@(x) find(strcmp(batch(:,1),x)), ...
    uniqueKernelExecutingGpus, 'uniformoutput', false);

% schedule kernels from queues
nKerQueues = size(kernelQueues,1);
kerQExecutionTimes = zeros(nKerQueues,1);   % kernel completion times
for iKerQueue = 1:nKerQueues
    t = 0;
    for iQueuePos = 1:size(kernelQueues{iKerQueue},1)
        iKernel = kernelQueues{iKerQueue}(iQueuePos);
        kerTime = kerneltime(batch(iKernel,4));
        reverseSchedule(iKernel,1) = t;             % start
        reverseSchedule(iKernel,2) = t + kerTime;   % stop
        t = t + kerTime;
    end
    kerQExecutionTimes(iKerQueue) = t;
end
end

function x = computethroughput(streams, batch)
global interconnect
if isempty(streams)
    x = [];
    return;
end

endpoints = [batch(streams,2), batch(streams,1)];
dmaSelected = 1:size(endpoints,1);
x = interconnect.bandwidthdistribution(endpoints(dmaSelected,:));
end

Listing 6.4: Batch run-time scheduling (MATLAB)
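For orientation, a hypothetical call to batchmethod is sketched below. The endpoint labels follow the Sandy Bridge model of Listing 6.5, while the kernel identifier, the data-size units (assumed GB), and the behavior of the helper functions setuptime and kerneltime (not shown in this appendix) are assumptions:

global interconnect
interconnect = InterconnectModel('Sandy Bridge');   % model from Listing 6.5
% one command stream per row: destination, source, data size, kernel id
batch = { 'g0', 'c0', 0.25,  'kernelA';   % H2D transfer followed by a kernel on GPU0
          'g1', 'c0', 0.125, ''        }; % H2D transfer only, GPU1
[makespan, schedule] = batchmethod(batch);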

6.9.2 Interconnect Model

1 classdef InterconnectModel < handle 2 properties 3 SystemName; 4 5 %% Topology graph properties 6 V; % vertices 7 nV; % number of vertices 8 E; % edges matrix 1 in (i,j) denotes edge i->j 9 C; % edge capacities 10 Paths; % paths between endpoints 11 PathEdges; % traversed edges between endpoints (sparse bitmap) 12 13 %% System properties 14 SetupTime; % DMA configuration time in ms 15 SetupTimeErr; % Possible deviation in setup time in ms 16 GpuDmaCount; % Number of DMA engines per GPU 17 18 %% Cache 19 BwDistLookup; % cache for bw distribution lookups 20 end 21 22 methods 23 function obj = InterconnectModel(name) 24 if nargin > 0 25 obj.SystemName = name; 26 end 27 switch lower(obj.SystemName) 28 case 'sandy␣bridge' 29 obj.loadsandybridge(); 30 ... % other models 31 end 32 end 33 34 function loadsandybridge(obj) 35 obj.SystemName = 'Sandy␣Bridge'; 36 37 %% define graph 38 obj.V = {'c0', 'c1', 'ioh0', 'ioh1', 'g0', 'g1', 'g2', 'g3'}; 39 obj.nV = length(obj.V); 40 obj.E = zeros(obj.nV,obj.nV); 41 obj.C = zeros(obj.nV,obj.nV); 42 obj.Paths = cell(obj.nV,obj.nV); 43 obj.PathEdges = cell(obj.nV,obj.nV); 44 45 %% set edges 46 obj.setedge('c0','ioh0',12.5,12.1); 47 obj.setedge('ioh0','g0',11.8,11.8); 48 obj.setedge('ioh0','g1',11.7,11.8);


            obj.setedge('c0','c1',12.0,12.6);
            obj.setedge('c1','ioh1',12.0,12.6);
            obj.setedge('ioh1','g2',11.7,11.1);
            obj.setedge('ioh1','g3',11.7,11.0);

            %% set paths
            obj.setpath('c0','g0',{'c0', 'ioh0', 'g0'});
            obj.setpath('c0','g1',{'c0', 'ioh0', 'g1'});
            obj.setpath('c0','g2',{'c0', 'c1', 'ioh1', 'g2'});
            obj.setpath('c0','g3',{'c0', 'c1', 'ioh1', 'g3'});

            %% setup
            obj.GpuDmaCount = 2;
            obj.SetupTime = 0.010;
            obj.SetupTimeErr = 0.003;
            obj.BwDistLookup = [];
        end

        function setedge(obj,vFrom,vTo,c,cBack)
            % cBack is optional and if present means that the edge is
            % bi-directional with capacity cBack
            i = obj.c2i(vFrom);
            j = obj.c2i(vTo);
            obj.E(i,j) = 1;
            obj.C(i,j) = c;
            if nargin == 5
                obj.E(j,i) = 1;
                obj.C(j,i) = cBack;
            end
        end

        function setpath(obj,vFrom,vTo,path)
            i = obj.c2i(vFrom);
            j = obj.c2i(vTo);
            indexPath = cellfun(@(x) obj.c2i(x), path);
            obj.Paths(i,j) = {indexPath};
            obj.Paths(j,i) = {indexPath(end:-1:1)};
            edgesTo = zeros(size(obj.E));
            edgesFrom = zeros(size(obj.E));
            for ii = 1:length(indexPath)-1
                edgesTo(indexPath(ii),indexPath(ii+1)) = 1;
                edgesFrom(indexPath(ii+1),indexPath(ii)) = 1;
            end
            obj.PathEdges(i,j) = { sparse(edgesTo) };
            obj.PathEdges(j,i) = { sparse(edgesFrom) };
        end

        function iVertex = c2i(obj,v)
            iVertex = find(strcmp(v,obj.V));
        end


        function T = bandwidthdistribution(obj, transferRoutes) ...

        function dmaid = getDmaId(obj,from,to) ...
    end
end

Listing 6.5: Interconnect model (MATLAB)


6.9.3 Computation of Data Rates

function T = bandwidthdistribution(obj, transferRoutes)
%BANDWIDTHDISTRIBUTION Computes the expected data rates of
%each data transfer operation in a case where multiple data
%transfers are in progress.
%
% INPUT:
%
%   dataStreams = [ src0 dst0;  % endpoints of concurrent
%                   src1 dst1;  % data transfers
%                   ...
%                 ];
%
% OUTPUT:
%
%   T = [ rate0;  % expected data rate
%         rate1;
%         ...
%       ];

%% optimization: perform lookup in cache
hashkey = [transferRoutes{:}]; % hashkey example: c0c0g0g3
if isfield(obj.BwDistLookup,hashkey)
    T = obj.BwDistLookup.(hashkey);
    return;
end

%% initialization
T = zeros(size(transferRoutes,1),1);
bMarkFinished = zeros(size(transferRoutes,1),1);
remainingC = obj.C; % remaining edge capacities (matrix)

%% limit to one stream per DMA controller
dmaIds = arrayfun(@(i) obj.getDmaId( ...
    transferRoutes{i,1},transferRoutes{i,2}), ...
    1:size(transferRoutes,1));
uniqueDmaIds = unique(dmaIds);
isUsingDma = arrayfun(@(x) find(dmaIds == x,1), uniqueDmaIds);
notUsingDmaMask = ones(size(transferRoutes,1),1);
notUsingDmaMask(isUsingDma) = 0;
bMarkFinished = bMarkFinished | notUsingDmaMask;

%% remove unused edges by marking used ones and removing
% unmarked
countUsedEdges = zeros(obj.nV, obj.nV);
for iRoute = 1:size(transferRoutes,1)
    if bMarkFinished(iRoute), continue, end
    src = obj.c2i(transferRoutes{iRoute,1});
    dst = obj.c2i(transferRoutes{iRoute,2});


    countUsedEdges = countUsedEdges + obj.PathEdges{src,dst};
end

%% compute data rates
while sum(bMarkFinished) < size(transferRoutes,1)
    % set capacities of unused edges to zero
    remainingC(countUsedEdges == 0) = 0;

    % Find bottleneck edge, store rate for each data transfer
    % that passes through it in ratePerRoute (fair distribution)
    proportionalBandwidth = remainingC ./ countUsedEdges;
    [ratePerRoute, I] = min(proportionalBandwidth(:));
    [v1, v2] = ind2sub(size(proportionalBandwidth),I);

    % update
    % 1. set bandwidthPerStream as the rate of all the streams
    %    that pass through the bottleneck edge v1->v2.
    % 2. for each such stream, remove their rate from all the
    %    edges that they traverse and decrease the edge count.
    for iRoute = 1:size(transferRoutes,1)
        if bMarkFinished(iRoute), continue, end
        src = obj.c2i(transferRoutes{iRoute,1});
        dst = obj.c2i(transferRoutes{iRoute,2});
        pathEdges = obj.PathEdges{src,dst};
        isCrossingEdgeV1V2 = full(pathEdges(v1,v2));
        if isCrossingEdgeV1V2
            T(iRoute) = ratePerRoute;
            traversedPos = find(pathEdges);
            remainingC(traversedPos) = ...
                remainingC(traversedPos) - ratePerRoute;
            countUsedEdges(traversedPos) = ...
                countUsedEdges(traversedPos) - 1;
            bMarkFinished(iRoute) = 1;
        end
    end
end
%% store results in cache for future queries
obj.BwDistLookup.(hashkey) = T;
end

Listing 6.6: Computation of data rates (MATLAB)
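For illustration, a hypothetical invocation of the model for two concurrent host-to-GPU transfers might look as follows. This is only a usage sketch: the endpoint names are taken from the Sandy Bridge topology defined in Listing 6.5, and the complete class (including the elided getDmaId method) is assumed to be available on the MATLAB path.

% sketch: query expected rates for two concurrent transfers
ic = InterconnectModel('sandy bridge');   % build the Sandy Bridge topology model
routes = {'c0', 'g0';                     % transfer 1: CPU socket 0 -> GPU 0
          'c0', 'g3'};                    % transfer 2: CPU socket 0 -> GPU 3
T = ic.bandwidthdistribution(routes);     % expected rate per transfer (same units as edge capacities)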


6.9.4 Non-Preemptive EDF Schedulability Test

function b = isNonPreemptiveEdfSchedulable(taskset)
%isNonPreemptiveEdfSchedulable A non-preemptive EDF schedulability test for
% periodic implicit-deadline task sets. A task set is a matrix of type:
% [a1, c1, p1; a2, c2, p2; ...]; where a is the offset, c is the WCET and p
% is the invocation period.
% The schedulability test is described in "On optimal, non-preemptive
% scheduling of periodic and sporadic tasks", Jeffay et al, 1988. A task
% set is considered schedulable if two conditions are satisfied:
% 1. utilization(messageSet) <= 1.0
% 2. Let the messages be sorted in non-decreasing order by period. Then,
%    for all i and all L with p1 < L < pi:
%    L >= ci + Sum{j=[1..i-1]}(floor((L-1)/pj)*cj)

n = size(taskset,1);
c = taskset(:,2);
p = taskset(:,3);

% Condition 1
u = sum(c./p);
b = ( u <= 1.0 );
if b == 0
    return
end

% Condition 2
% Intuitively, f(L) (the DBF function) is the processor demand for the
% interval [0, L-1] when all tasks are released at time zero, for all
% L, p1 < L < pn. We can compute f(L) in time O(pn) as follows.
% Initialize an array of integers A of size pn to zero. For each task
% Tk, 1 <= k <= n, add ck to location m of array A for all m that are
% multiples of pk. At the completion of this process the sum of the
% first L-1 locations of A will be f(L).

%% sort messages by period non-decreasingly
[p, IDX] = sort(p);
c = c(IDX);

%% compute DBF function for 0 <= t <= pn
STEP = p(1)/100;
demandArr = zeros(length(0:STEP:p(end)),1);
for iMsg = 1:n
    incPositions = floor((p(iMsg):p(iMsg):p(end)) / STEP);
    demandArr(incPositions) = demandArr(incPositions) + c(iMsg);
end
DBF = cumsum(demandArr);
b = all((0:STEP:p(end)) >= DBF');
end

Listing 6.7: Non-preemptive EDF schedulability test (MATLAB)
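As a usage sketch, the task set below is hypothetical and only illustrates the input format [offset, WCET, period] expected by the function above:

% sketch: test a hypothetical periodic task set
taskset = [0, 1, 5;     % task 1: offset 0, WCET 1, period 5
           0, 2, 10;    % task 2: offset 0, WCET 2, period 10
           0, 4, 20];   % task 3: offset 0, WCET 4, period 20
ok = isNonPreemptiveEdfSchedulable(taskset);  % 1 if schedulable, 0 otherwise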




Chapter 7

Extended Multiprocessor Model for Efficient Utilization of Parallel Resources

7.1 Introduction

In recent years, computer architecture has become more parallel and heterogeneous. Many systems now contain a variety of components, some of which have a complex internal structure, such as multicore CPUs, many-core accelerators (GPUs), distributed memories, processor interconnects, and high-speed communication networks. Scheduling such systems to make efficient use of the compute and memory resources is a challenging task that greatly affects system performance. In scheduling, a parallel resource is commonly modeled as a set of independent processing units, each of which executes tasks one at a time [ERAL95]. This model has been extensively studied in the literature for various scheduling problems (see survey in [CPW99]). Many real parallel resources, however, cannot be decomposed into independent serial components because of the inherent dependency between the processing units, due to sharing a power or thermal envelope, for example. To conform to the model, lowered and guaranteed speeds are used, and opportunities to achieve higher performance are not exposed to the scheduler.

In previous chapters we have shown how detailed performance models of GPUs and the system interconnect can be used for more efficient scheduling and higher performance in an RTSPS. While different in nature, a GPU and an interconnect have some similar performance properties; specifically, both resources serve multiple concurrent processing streams that affect each other's progress rates. This property was analyzed in detail in previous chapters: in GPUs, the number of jobs executed concurrently on an SM influences the progress rate of each job and the aggregate progress rate (Section 4.1.2), and in the interconnect, concurrent data transfers influence each other's progress due to bandwidth contention, affecting the aggregate bandwidth. This property is in fact common to many parallel resources in computer systems, including


multicore CPUs. In applications that schedule compute jobs on multicore CPUs, each core is considered an independent processing unit with a fixed processing speed. In practice, however, the processing speed of each core depends on the other cores in the CPU package. Core frequency scaling technologies such as Turbo Boost are one cause of this dependency; Turbo Boost opportunistically increases the cores' operating frequency (speed) when the power, current, and temperature of the chip are below certain limits. Schedulers that model the cores' performance using the base CPU frequency overlook up to 100% of additional frequency (e.g., in the Core i7-4550U CPU). Effectively, higher frequencies are achieved when fewer cores are active, so the speed of each core depends on the state of the other cores (active or idle). For example, the maximum core frequencies for different numbers of active cores in a Core i7-820QM CPU are shown in Figure 7.1.

Turbo Boost has implications on the expected speedup and improvement in energy consumption that result from parallelizing sequential codes. Rewriting sequential codes as multi-threaded is known to achieve speedup and reduced energy consumption on a given multicore system. However, writing and validating parallel codes requires significant programming effort and expertise. Thus, it is very important to have a preliminary estimate of the expected benefits. Such estimates are commonly computed using Amdahl's law [Amd67] and its extension for energy [WL08], which provide easy-to-compute formulas for upper bounds on the corresponding parameters. These formulas are based on several simplifying assumptions, such as the processors being identical and their speeds fixed, which do not hold with Turbo Boost. This observation encouraged us to revisit Amdahl's law for multicore processors.

We present a detailed model of resource performance that reveals new opportunities for efficient scheduling. Using this model, we define a new class of scheduling problems that can be used to boost performance in various parallel and heterogeneous systems. We demonstrate the usefulness of these problems with a new Turbo Boost aware method that efficiently schedules compute tasks on a CPU by solving an instance of a makespan (total execution time) minimization problem from the new class. While other parallel resources could achieve better speedup from a detailed model, we have chosen to analyze Turbo Boost because its relatively simple performance model allows us to focus on the new scheduling challenges that arise when taking inter-processor performance dependencies into account. To the best of our knowledge, the problem of scheduling multiprocessors where the speed of each processor is influenced by the state of other processors (e.g., active or idle) has not been studied. The main contributions of this chapter are as follows:

• We propose an extension of Graham's α|β|γ classification of scheduling problems [GLLRK79] with a new class of problems, where processor speeds depend on the execution state of the multiprocessor. These new problems model many hardware resources more accurately and enable higher utilization.

• We characterize the computational throughput of CPUs with Turbo Boost and define a minimum-makespan scheduling problem from the new class that can be applied to reduce execution time in a number of applications. For this problem, we present a new efficient scheduling algorithm that fits the work distribution to the core frequencies in Section 7.3 and prove its optimality for three main sub-classes of the problem in Section 7.4. Our evaluation using a realistic web-log analysis application and a variety of synthetic workloads has shown that the new scheduling method was more efficient than three commonly used schedulers that use the classic multiprocessor model (Section 7.5).


Figure 7.1: Turbo frequency for Core i7-820QM CPU (maximum Turbo Boost frequency [GHz] vs. active cores)

Figure 7.2: Interconnect of a multi-GPU system (CPUs, GPUs, and a PCI Express interconnect)

Table 7.1: Classic machine type classification

α1   Name                          Description
◦    single machine                uniprocessor
P    identical parallel machines   m processors with speed s = 1
Q    uniform parallel machines     m processors with speeds s_i
R    unrelated parallel machines   m processors with per-job speeds s_{i,j}
O    an open shop                  every job is processed on each of m processors
F    a flow shop                   every job on every processor in the same order
J    a job shop                    every job on every processor with total ordering


• We simulate various CPU models using their specification data and show that different CPUs and workloads need to be scheduled differently.

• We show analytically and by experimental evaluation that, from a software design perspective, Turbo Boost aggravates the speedup limitations obtained from Amdahl's law by making parallelization of sequential codes less profitable. This is in contrast to previous studies, which showed that frequency scaling mechanisms mitigate these limitations by providing higher performance for the same energy budget [AGS05, CJA+09]. We extend the model used in Amdahl's law to account for frequency changes. Based on this model, we propose refined formulas for the maximum expected speedup and energy scaling, and validate them experimentally.

7.2 New Class of Scheduling Problems

Graham et al. [GLLRK79] presented a common classification scheme for scheduling problems.


We describe and extend this scheme with a new class of scheduling problems that includes and extends the problem we presented previously. In Graham's scheme, problems are classified by the 3-field notation α|β|γ, where α describes the machine environment, β describes the job characteristics, and γ describes the objective criterion. Our extension mostly applies to the machine environment.

7.2.1 Machine Environment (α)

We use the term processor as a synonym for machine in Graham's work, the term multiprocessor to refer to the set of processors, and the term m-processor as an abbreviation for a multiprocessor with m processors.

The machine environment is described by the first field α := α1α2, where α1 describes the machine type from the list in Table 7.1, and α2 describes the number of processors. If α2 = ◦ (empty symbol), the number of processors is considered a variable of the problem; otherwise it is constant and equal to α2. For example, machine environment α = P describes an m-processor with identical processors (m is a variable). Machine types P, Q, and R describe multiprocessors with different levels of processor speed uniformity, while machine types O, F, and J differ in processing order restrictions. This work extends machine types P, Q, and R, as it focuses on processor speeds. R is the most generic machine type, and it contains the other two. For machines of this type, the speed of each processor is specified separately for every job. However, the current model assumes this speed to be independent of the state of the other processors, hence limiting its generality. To extend it, we define the concept of processor execution state, as follows. The execution state of a processor is either idle or executing a job of type λ, for some job type λ from a set of possible types Λ. Two jobs are considered to be of the same type if they affect processor speeds in the same manner. We denote the execution state of processor i by x_i, as follows:

x_i = ε: idle

x_i = λ: executing a job of type λ

The execution state of the multiprocessor is a vector of the execution states of its processors and is denoted by

X = (x_i)_{i=1}^{m}.    (7.1)

We propose an extension to the machine environment that enables characterization of speed dependencies between the processors. The extended machine environment is described by α := α1α2α3, where α3 is defined as follows:

α3 ∈ {◦, x},

α3 = ◦: independent processors; the processors’ speeds do not depend on the execution states of other processors (classic model).

α3 = x: dependent processors; the processors’ speeds depend on the execution states of other processors.


Table 7.2: New machine environment types

α1   α2   α3   Speed characterization
P    ·    x    {s_(a) | a ∈ M}
Q    ·    x    {s_{i,A} | A ⊂ M, i ∈ A}
R    ·    x    {s_{i,X} | i ∈ M, X ∈ (Λ ∪ {ε})^m}

M = {1, ..., m} - set of processors
a - number of active processors
A - set of active processors
X - execution state

We denote the speed of processor i when the execution state is X = (x_k)_{k=1}^{m} by s_{i,X}. The processor speed specification for an m-processor is hence described by:

(s_{i,X} | i ∈ {1, ..., m}, X = (x_k)_{k=1}^{m}, x_k ∈ Λ ∪ {ε}).    (7.2)

For machine types where the performance dependencies are simpler, shorter descriptions are used. The characterization of the processor speeds for the new machine environments is listed in Table 7.2. The new machine environments Px, Qx, and Rx are described as follows:

α = Px: All the processors have the same speed at every time point, and the speed is specified for every number of active processors.

α = Qx: The speed of each processor is specified for every set of active processors.

α = Rx: The speed of each processor is specified for each execution state.

The environments provide three levels of flexibility in expressing the dependency between the execution state of the processors and their speeds, allowing more accurate characterization of performance while keeping the basic processor speed properties of α1 ∈ {P, Q, R}.

7.2.2 Job Characteristics (β)

In the extended model, the processing requirement of each job j is given by p_j (job size). When the job is being executed, its execution progress increases at a rate equal to the processor's speed; the job is completed when its execution progress reaches p_j. In previous studies, the processing time of job j on processor i was characterized by p_{ij} [GLLRK79]. In the extended model, this processing time is not fixed; hence the processor speed and the job size are defined separately. Let X(t) denote the execution state of the multiprocessor at time t, and let proc(j, t) denote the identifier of the processor that executed job j at time t, or ε if none. Then, the progress rate of job j at time t is as follows:

s(j, t) = s_{proc(j,t),X(t)}  if proc(j, t) ≠ ε,  and  s(j, t) = 0  otherwise,    (7.3)

133

where s_{proc(j,t),X(t)} denotes the parameter s_{i,X} in the processor speed specification with i = proc(j, t) and X = X(t). The execution progress of job j at time τ is

P_j(τ) = ∫_{−∞}^{τ} s(j, t) dt    (7.4)

and the job's completion time is τ_j^{end} = min(τ | P_j(τ) = p_j). Other common job characteristics of the classic model include job availability, precedence constraints, preemption, and scheduling model (e.g., offline/online). For example, β = r_j, prec describes jobs with release dates and precedence constraints. An extended description of the job characteristics is provided in [CPW99].

7.2.3 Optimality Criteria (γ)

The third parameter γ defines the optimality criteria, i.e., it is a characterization of the best valid schedule. Some of the most common optimality criteria are to minimize the sum or the maximum of the job completion time C_i, lateness L_i = C_i − d_i, tardiness T_i = max(0, C_i − d_i), or unit penalty U_i = [if C_i ≤ d_i then 0 else 1]. For example, γ = C_max aims to minimize the makespan (max_i C_i).

7.2.4 Summary

We extended the machine environment in the classification of scheduling problems to include multiprocessors with speed dependencies among the processors. Each original parallel machine type now also has an extended counterpart. The new model provides a more accurate view of resource performance and enables scheduling algorithms to achieve better utilization. The extension of the machine environment has doubled the number of scheduling problems considered in the classification, because every original multiprocessor scheduling problem now also has an extended counterpart. For example, the basic problem of minimizing the makespan of preemptive jobs on a multiprocessor of type P, denoted by P|pmtn|Cmax, has a counterpart problem where the processors' speed depends on the number of active processors, denoted Px|pmtn|Cmax. As a case study, we use an instance of this problem to improve scheduling of compute jobs on a multicore CPU with Turbo Boost in Section 7.3. Alongside its utilization efficiency benefits, the extended model adds complexity to the scheduling problems. As our case study shows, even scheduling problems that have simple and efficient solutions under the classic model are difficult to solve under the extended model. We believe that finding efficient scheduling algorithms for the new problems would enable more efficient utilization of resources in modern computing systems, whose architectures are becoming increasingly parallel and complex.


(a) Core frequency: maximum Turbo Boost frequency [GHz] vs. active cores; (b) Total compute throughput [GOp/sec] vs. active cores

Figure 7.3: Maximum core frequency and compute throughput of a Core i7-4790T quad-core CPU with Turbo Boost

7.3 Scheduling with Turbo Boost

Turbo Boost technology enables higher CPU performance through increased core frequency. Higher frequency can shorten execution times and reduce energy consumption. However, increased frequency comes at the cost of higher power demand and increased heat generation, which must stay below certain limits to avoid damage to the chip. When the measured environmental parameters are below rated limits, Turbo Boost uses the headroom to increase the CPU frequency above the base frequency. It dynamically increases and decreases the frequency in the range between the base and maximum Turbo frequencies to keep the parameters within the limits. The increased core frequency in Turbo Boost can be utilized in task scheduling to optimize application goals. However, efficient task scheduling with Turbo Boost is more challenging than on processors with fixed frequency due to the following:

1. Execution time prediction: task execution times are hard to predict because processing speeds are volatile and depend on external factors.

2. Scheduling: even with an exact model of core frequency, existing scheduling methods cannot be applied as they assume fixed or controllable processor speeds.

We start with an overview of the Turbo Boost technology and related works, define a scheduling problem that uses an extended resource model, and describe a scheduling method that solves this problem.

7.3.1 Turbo Boost Overview and Related Works

Turbo Boost is a dynamic voltage and frequency scaling (DVFS) technology implemented in many Intel processors, ranging from ultra-low-power embedded CPUs to high-performance manycore CPUs [Int14b, Int13], and similar technologies such as AMD's Turbo Core exist in processors from other vendors.


The frequency is the same for all active cores and is set using the following parameters [Inta, Intb]: number of active cores, estimated current consumption, estimated prior and present processor power consumption, processor temperature, and type of workload. The increments occur in fixed steps of 133MHz for Nehalem and 100MHz for newer microarchitectures (Sandy Bridge through Broadwell). The scaling mechanism is automatic, but it is enabled and disabled on demand by the operating system. The maximum Turbo Boost frequency is dictated by the number of active cores and specified in the processor documentation. For example, the Core i7-4790T Haswell quad-core processor has a base frequency of 2.7GHz and maximum Turbo frequencies of 3.9, 3.8, 3.6, and 3.3GHz, which correspond to 1, 2, 3, and 4 active cores, respectively [Inta].

The effect of Turbo Boost on system performance depends on application compute and memory requirements [CJA+09, WSYI14]. Lo and Kozyrakis [LK14] suggested optimizing Turbo Boost performance for a set of efficiency metrics by enabling/disabling Turbo Boost dynamically at the OS level. Our work takes a different approach: performance is improved by optimizing the task schedule. Wakisaka et al. [WSYI14] studied the problem of minimizing the makespan of a task graph with communication costs on a processor with Turbo Boost. As this problem is NP-Hard, the authors suggested a heuristic algorithm that is based on List Scheduling. Gawiejnowicz [Gaw08] studied the problem of Time-Dependent scheduling, where processor speeds may change during the execution. However, dependencies between processors were not considered there. Zhong et al. [ZRL11, ZRL12] studied the problem of resource allocation and mapping (data partitioning) among and within multiprocessors where performance dependencies between processors exist. In this chapter, we study the problem of scheduling jobs on such multiprocessors, which also considers the time dimension.

7.3.2 Model and Scheduling Problem Definition

We formulate a scheduling problem for minimizing the makespan of a set of preemptive compute tasks on a CPU with Turbo Boost. For simplicity, we assume that the core frequency changes are immediate, and use the preemptive unrelated jobs model, i.e., there are no order dependencies between the tasks, their sizes are given in CPU cycles, and the execution of every task can be interrupted and later resumed on any processor without time penalty, but a task must not be executed on more than one core at a time. These assumptions hold for many compute-intensive tasks in High-Performance Computing (HPC).

To model the CPU performance we first characterize the Turbo Boost frequencies. Figure 7.3 shows the Turbo Boost maximum core frequency and total computational throughput for a Core i7-4790T quad-core CPU (throughput = |active_cores| × frequency). We observe two properties of the throughput function that hold for all the processors that support Turbo Boost:

1. Diminishing returns¹: as more cores are active, the frequency of each core diminishes (Fig. 7.3a).

2. Increasing: the total throughput is greater when more cores are active (Fig. 7.3b).

¹In economics, the principle that when the investment increases the average return decreases is called the law of diminishing returns.

We denote the speed of any active processor when i processors are active by s_(i), and the total throughput of the multiprocessor by S_i, i.e., S_i = i · s_(i). The observations above are expressed in the constraints given in the following problem definition.

Problem 15. Minimize the makespan of a set of preemptive jobs J = (p_j)_{j=1}^{n} on an execution-dependent identical m-processor whose speed is given by (s_(i))_{i=1}^{m}, given that

(A1) for 0 ≤ i ≤ m − 1, S_i ≤ S_{i+1}, and

(A2) for 1 ≤ i ≤ m − 1, s_(i) ≥ s_(i+1),

where S_0 = 0 and S_i = i · s_(i) for i > 0.
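As a small illustration of constraints (A1) and (A2), the sketch below checks them for a given speed table, using the per-core maximum Turbo frequencies of the Core i7-4790T quoted in Section 7.3.1 (3.9, 3.8, 3.6, and 3.3 GHz for 1-4 active cores). The function name is ours and not part of the thesis code.

function ok = satisfiesProblem15(s)
% s(i) = speed of each active core when i cores are active (e.g., in GHz)
S = (1:numel(s)) .* s;                 % total throughput S_i = i * s_(i)
ok = all(diff([0, S]) >= 0) && ...     % (A1) S_i non-decreasing (with S_0 = 0)
     all(diff(s) <= 0);                % (A2) s_(i) non-increasing
end

% example: Core i7-4790T maximum Turbo frequencies for 1-4 active cores
% satisfiesProblem15([3.9 3.8 3.6 3.3])   % expected to return true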

This problem is an instance of the Px|pmtn|Cmax scheduling problem in the classification described in Section 7.2.

7.3.3 Existing Scheduling Techniques for Identical Machines

For the problem of scheduling independent tasks on a multiprocessor in the traditional model, a simple technique for constructing an optimal schedule is known (McNaughton, 1959 [McN59]). For a set of preemptive jobs J = (p_i)_{i=1}^{n} that execute on m processors, the makespan of the schedule is given by the formula M = max(p_1, p_2, ..., p_n, (1/m) Σ_{i=1}^{n} p_i). The schedule is constructed by successively filling the processors until the processing time reaches M, preempting the jobs if necessary, as illustrated in Figure 7.4. The jobs can be scheduled in any order. The validity of this schedule is straightforward; all the jobs can be scheduled within M time units since M·m ≥ Σ_{i=1}^{n} p_i, and no job is scheduled on more than one machine at the same time since M ≥ max_i p_i. This schedule is optimal because its makespan equals one of the following lower bounds: (1) the execution time of the largest job or (2) the time required to execute all the jobs at full multiprocessor throughput. For job sets that are not amenable to equal work distribution among the processors due to job size constraints, i.e., max_i p_i > (1/m) Σ_{i=1}^{n} p_i, the method packs the work into as few processors as possible, such that each processor is assigned at most max_i p_i work. Other commonly used scheduling techniques for such workloads include global and partitioned queuing schemes, time slicing, and work stealing. Lawler & Labetoulle described a linear-programming-based polynomial solution to a more general scheduling problem [LL78].
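A compact sketch of McNaughton's wrap-around rule is given below. It assumes identical unit-speed processors, uses variable names of our own choosing, and returns, for each job part, the processor and its start/stop times on the schedule's time axis.

function assignments = mcnaughton(p, m)
% McNaughton's wrap-around rule for preemptive jobs on m identical
% unit-speed processors. Returns rows [job, processor, start, stop].
M = max(max(p), sum(p) / m);          % optimal makespan (this lower bound is attained)
bounds = [0, cumsum(p(:)')];          % job boundaries on a single virtual timeline
assignments = zeros(0, 4);
for j = 1:numel(p)
    lo = bounds(j);                   % job j occupies (lo, hi] on the virtual timeline
    hi = bounds(j + 1);
    while lo < hi
        proc = floor(lo / M) + 1;     % processor that owns this part of the timeline
        cut = min(hi, proc * M);      % split the job at the processor boundary
        assignments(end+1, :) = [j, proc, lo - (proc - 1) * M, cut - (proc - 1) * M];
        lo = cut;
    end
end
end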

7.3.4 Scheduling Challenges of the New Model

McNaughton's technique computes the schedule event times based on equal and fixed processor speeds. In the new multiprocessor model, this schedule cannot be applied due to the discrepancy between scheduled and real processing times. (The duration of a CPU cycle is the reciprocal (1/x) of the core frequency x for that cycle.)


Figure 7.4: McNaughton's scheduling technique

(a) McNaughton (event times corrected)   (b) Alternative schedule

Figure 7.5: McNaughton's technique is not optimal for Px

However, even if the event times in McNaughton's schedule are corrected to match actual processing times, the schedule may not be optimal. Consider the schedule produced using McNaughton's technique with corrected event times, illustrated in Figure 7.5a. The schedule includes four jobs, where p_1 and p_2 are twice the size of p_3 and p_4. Under the Px model, the makespan of this schedule is p_1/s_(3), as the speed of each of the three active processors is s_(3). Alternatively, p_4 could be placed on the idle processor, as shown in Figure 7.5b. Although the sizes of the jobs remained the same, their execution times were affected by changes in the processors' speeds. In this case, the execution would consist of two periods, the first with four active processors and the second with only two, and the makespan would be p_3/s_(4) + (p_1 − p_3)/s_(2). On a Core i7-820QM processor with (s_(i))_{i=1}^{m} = (3.1, 2.8, 1.9, 1.9) GHz, which satisfies the problem constraints, the alternative schedule has a 16% shorter makespan.
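The 16% figure can be reproduced with a few lines of arithmetic; in this sketch the job sizes are normalized so that p1 = p2 = 2 and p3 = p4 = 1.

% speeds s_(i) of each active core when i cores are active (Core i7-820QM, GHz)
s = [3.1 2.8 1.9 1.9];
p1 = 2; p3 = 1;                                % p1 = p2 = 2, p3 = p4 = 1 (normalized)
tMcNaughton = p1 / s(3);                       % three cores active for the whole run
tAlternative = p3 / s(4) + (p1 - p3) / s(2);   % four cores, then two cores
improvement = 1 - tAlternative / tMcNaughton;  % about 0.16, i.e., roughly 16% shorter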

7.3.5 Frequency-Aware Scheduling Method (FAS)

We present a new scheduling method for the problem at hand. The method constructs a schedule as follows.

Problem Classification and Algorithm Selection

FAS classifies the input problem into one of four categories by testing the two conditions described in Table 7.3, and selects a scheduling algorithm according to the category.
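A sketch of this classification step is given below. The category names follow Table 7.3; the function name is ours and the code is only an illustration of the two tests, not the thesis implementation.

function category = classifyProblem(p, s)
% p: job sizes; s: s_(i) for i = 1..m active processors
m = numel(s);
if max(p) <= sum(p) / m
    category = 'Equal Distribution';       % Condition 1 holds
    return;
end
d = diff(1 ./ s);                          % series 1/s_(i+1) - 1/s_(i)
if all(diff(d) >= 0)
    category = 'Pack';                     % the series is non-decreasing
elseif all(diff(d) < 0)
    category = 'Load Balance';             % the series is decreasing
else
    category = 'Non-monotonic';
end
end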

Construct a Speed-oblivious Work Plan

Normally, job execution times are given as input to the scheduling algorithm; here, only the processing requirement of each job (number of CPU cycles) is specified. The job execution times can only be computed given a schedule, because the processing time depends on the number of active processors as dictated by that schedule. Therefore, our method schedules the job processing requirements rather than their execution times, hence creating a work plan that does not depend on the processor speeds. This schedule would be valid on a unit-speed multiprocessor, but for a different processor the work axis must be transformed into time by scaling each work unit (CPU cycle) by its duration (the inverse of the processor speed and frequency). We use the terms initial schedule and work plan interchangeably to denote the schedule constructed in this step (Figure 7.6a).


Table 7.3: Problem classification

Condition 1: max_i p_i ≤ (1/m) Σ_i p_i ?
Condition 2: ∀i: 1/s_{i+1} − 1/s_i compared to 1/s_{i+2} − 1/s_{i+1}

Category             Condition 1   Condition 2
Equal Distribution   Yes           any
Pack                 No            ≤ (for all i)
Load Balance         No            > (for all i)
Non-monotonic        No            otherwise

(a) Work Plan   (b) Final Schedule

Figure 7.6: Computation of schedule event times


Compute Schedule Event Times

The final schedule is constructed by replacing the times of the job start and completion events in the work plan with the actual times according to the processor speeds. For every work unit, the processor speed is determined from the number of active processors in the work plan for that work unit, and its duration is computed as the inverse of that speed, i.e., if i processors were active then the duration is 1/s_(i). The time of each event is computed as the cumulative duration of the preceding work units. Figure 7.6b illustrates the computation of event times by scaling the work plan timeline.
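In code form, this time-axis transformation amounts to a cumulative sum of per-unit durations. The sketch below (names ours) assumes the work axis has been discretized into equal units and that, for each unit, the number of active processors in the plan is known.

function eventTimes = workPlanToTime(activeCount, s)
% activeCount(u) = number of active processors during work unit u of the plan
% s(i)           = speed of each active processor when i processors are active
% eventTimes(u)  = wall-clock time at which work unit u completes
unitDurations = 1 ./ s(activeCount);   % duration of each work unit is 1 / s_(i)
eventTimes = cumsum(unitDurations);    % event time = cumulative duration so far
end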

Scheduling Job Sets That Are Amenable to Equal Distribution

The total throughput of the multiprocessor is maximal when all the processors are active, as expressed in condition (A1) in the problem definition. Therefore, if the jobs can be scheduled such that the processors are assigned equal shares of the work, without executing any job on more than one processor at a time, then such a schedule would have minimal makespan. A set of jobs J = (p_i)_{i=1}^{n} can be equally distributed between the processors iff



max_i p_i ≤ (1/m) Σ_{i=1}^{n} p_i    (Condition 1).    (7.5)

Note that the condition does not depend on the processor speeds (s_(i))_{i=1}^{m}. We ascribe problems that satisfy this condition to the 'Equal Distribution' category, and use McNaughton's technique to construct the work plan. Figure 7.7a illustrates a work plan for a problem amenable to equal distribution.

Scheduling Job Sets That Are Not Amenable to Equal Distribution

For workloads that cannot be equally distributed between the processors, finding the optimal work distribution depends on the processor speeds (s_(i))_{i=1}^{m}. FAS optimizes the work distribution to achieve minimal makespan for these speeds. For two types of multiprocessors, where the processor speeds have certain monotonic qualities, FAS constructs an optimal schedule. For other multiprocessors, it fits the work distribution to the processor speeds using linear programming, achieving near-optimal results for most workloads. We categorize the problem into one of three categories using the processor speed properties specified next, and describe the schedule construction algorithm in each case. Aspects of validity and optimality of the method are discussed in Section 7.4.

Pack: ∀i: 1/s_{i+1} − 1/s_i ≤ 1/s_{i+2} − 1/s_{i+1}

On multiprocessors that satisfy this condition, our scheduling method maximizes the number of processors that have load max_i p_i. McNaughton's technique generates a schedule of this form, so we use it to generate the work plan. Figure 7.7b illustrates such a work plan.

Load Balance: ∀i: 1/s_{i+1} − 1/s_i > 1/s_{i+2} − 1/s_{i+1}

On multiprocessors that satisfy this condition, our scheduling method aims to achieve maximum load balancing without violating job size constraints. The algorithm repeatedly extracts the biggest job and assigns it to a new processor, until Condition 1 is true for the remaining jobs on the remaining processors. The remaining jobs, if any, are then equally distributed among the remaining processors using McNaughton's technique. Figure 7.7c illustrates such a work plan.
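A sketch of this distribution step is shown below (the function name is ours; jobs are assumed to be sorted in non-increasing order of size, and only the per-processor work shares are computed, after which McNaughton's technique fills the equally loaded processors).

function w = loadBalanceDistribution(p, m)
% p: job sizes sorted in non-increasing order; m: number of processors
% w: work share of each processor (largest jobs get dedicated processors,
%    the rest are spread equally, as in the Load Balance category)
w = zeros(1, m);
k = 1;                                  % next processor to dedicate
while k < m && k <= numel(p) && p(k) * (m - k + 1) > sum(p(k:end))
    w(k) = p(k);                        % biggest remaining job gets its own processor
    k = k + 1;
end
w(k:m) = sum(p(k:end)) / (m - k + 1);   % equal split of the remaining work
end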

Non-Monotonic:

Problems that do not fall into one of the previous categories belong to this most general type.

FAS defines a linear programming (LP) problem with m·n variables, where variable y_{i,j} denotes the amount of work from job j assigned to processor i. We denote the total work assigned to processor i by w_i = Σ_j y_{i,j}, and define the minimization function to be the makespan:

f = w_1 · (1/s_(1)) − Σ_{i=2}^{m} w_i · (1/s_(i−1) − 1/s_(i)).    (7.6)


Technion - Computer Science Department - Ph.D. Thesis PHD-2015-15 - 2015 P4 7 8 P4 P3 5 6 7 P3 5 6 P2 2 3 4 P2 2 3 4 5 P1 1 2 P1 1

(a) Equal Distribution (b) Pack

P4 5 6 P4 P3 3 4 5 P3 5 6 P2 2 3 P2 2 3 4 P1 1 P1 1

(c) Load Balance (d) Non-monotonic

Figure 7.7: Schedule construction

The LP problem is defined as follows:

∀i, j : 0 ≤ y_{i,j} ≤ p_j                       lower and upper bounds           (7.7)
∀j : Σ_i y_{i,j} = p_j                          total work on job j is p_j       (7.8)
∀i < m : w_i ≥ w_{i+1}                          sort (w_i) in descending order   (7.9)
∀i, j : w_i ≥ Σ_{i′=i}^{m} y_{i′,j}             job size limitations (not sufficient)   (7.10)

These constraints do not completely enforce the job size limitations, and a returned solution may result in a work distribution without a matching valid schedule (infeasible). In Section 7.4 we describe a necessary and sufficient constraint to test the feasibility of a work distribution, and show that satisfying constraint (7.10) above is necessary for a feasible work distribution. To generate a valid solution, the linear program is solved iteratively and the feasibility of the work distribution is checked, until a feasible solution is found. After each iteration that resulted in an invalid work distribution, an additional inequality is added that forces the LP solver to assign more work to the first processor that violates the job size constraints. This method was found to be very efficient in finding optimal or near-optimal work distributions. Once a feasible work distribution (w_i)_{i=1}^{m} is found, the schedule is constructed by first sorting the jobs by size in decreasing order, and then successively filling the processors until the processing time reaches w_i, preempting the jobs if necessary. For completeness, if the method fails to find a feasible solution using LP, it uses a work plan generated using McNaughton's technique.
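Assuming the Optimization Toolbox's linprog is available, one solver call for the LP above can be set up roughly as follows. This is only a sketch of a single solve with constraints (7.7)-(7.10); it omits the iterative feasibility-repair loop described above, and all variable names are ours.

function [w, y] = nonMonotonicLP(p, s)
% p: job sizes (1 x n); s: s_(i) for i = 1..m active processors (1 x m)
% y(i,j) = work of job j assigned to processor i; w(i) = total work on processor i
m = numel(s);  n = numel(p);
% objective (7.6): the coefficient of every y(i,j) is c(i)
c = zeros(m, 1);
c(1) = 1 / s(1);
c(2:m) = -(1 ./ s(1:m-1) - 1 ./ s(2:m));
f = repmat(c, n, 1);                       % column-major vectorization of y
% bounds (7.7)
lb = zeros(m * n, 1);
ub = kron(p(:), ones(m, 1));
% equality (7.8): sum_i y(i,j) = p(j)
Aeq = kron(eye(n), ones(1, m));
beq = p(:);
% row-sum operator: w = W * y(:)
W = kron(ones(1, n), eye(m));
% ordering (7.9): w(i+1) - w(i) <= 0
Aord = diff(eye(m)) * W;
% job-size limitation (7.10): sum_{i'>=i} y(i',j) - w(i) <= 0 for all i, j
Ajob = zeros(m * n, m * n);
row = 0;
for i = 1:m
    for j = 1:n
        row = row + 1;
        a = zeros(m, n);
        a(i:m, j) = 1;                     % + sum_{i'>=i} y(i',j)
        a(i, :) = a(i, :) - 1;             % - w(i) = - sum_j y(i,j)
        Ajob(row, :) = a(:)';
    end
end
A = [Aord; Ajob];
b = zeros(size(A, 1), 1);
yvec = linprog(f, A, b, Aeq, beq, lb, ub);
y = reshape(yvec, m, n);
w = sum(y, 2);
end

The returned distribution may still be infeasible in the sense of Theorem 16, in which case additional inequalities would be added and the program re-solved, as described above.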

7.4 Validity and Optimality

In this section, we prove that FAS is valid, and that it is optimal for the problems in categories Equal Distribution, Pack, and Load Balance.


Figure 7.8: Compact schedule example (work shares are non-increasing with processor index; each processor executes its share continuously from time 0)

To prove validity we must show that in the produced schedule the processing requirement of all the jobs has been supplied and that no job is scheduled on more than one processor at the same time. The optimality proof must show that no valid schedule for the problem exists that has a shorter makespan. We now define several concepts and a theorem, which will later be used in the discussion on validity and optimality.

Compact Schedule and Work Distribution

A compact schedule S is a schedule where each processor executes its work share continuously from time t = 0 without inserted delay, and the work share of any processor i is greater or equal to that of processor i + 1 (Figure 7.8). FAS constructs the schedule by ’filling’ processors with work, one after the other, and never increases the work share of a processor; hence it always produces a compact schedule.

The work distribution of a compact schedule is characterized by (w_1, ..., w_m), where w_i is the work share (total execution progress) assigned to processor i. From the definition of a compact schedule, it follows that w_1 ≥ ... ≥ w_m.

The amount of work completed when i processors were active is hence (w_i − w_{i+1}) for 1 ≤ i ≤ m − 1, and w_m for i = m. For simplicity, we define w_{m+1} := 0 to make the formula uniform for every i. The makespan of the compact schedule can be computed from the work distribution as follows:

makespan(S) = Σ_{i=1}^{m} (w_i − w_{i+1}) / s_(i).    (7.11)

The efficiency of schedules can hence be evaluated by their work distribution alone.
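In code form, equation (7.11) is a small helper (the name is ours):

function t = compactMakespan(w, s)
% w: work distribution of a compact schedule, with w(1) >= ... >= w(m)
% s: s_(i) = per-processor speed when i processors are active
% t: makespan = sum_i (w_i - w_{i+1}) / s_(i), with w_{m+1} := 0
wext = [w(:); 0];                                       % append w_{m+1} = 0
t = sum((wext(1:end-1) - wext(2:end)) ./ s(:));
end

For the work distribution (2, 2, 1, 1) and the Core i7-820QM speeds used earlier, this evaluates to 1/1.9 + 1/2.8, matching the alternative schedule of Figure 7.5b.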

Job Size Constraints

FAS optimizes the work distribution and then constructs a matching schedule. For a given work distribution, a matching valid schedule may not exist due to job size limitations. It must therefore be determined whether a valid schedule exists for a given work distribution, i.e., whether the work distribution is feasible. The following theorem defines a necessary and sufficient set of constraints for a feasible work distribution.


Theorem 16. Let J = (p_i)_{i=1}^{n} be a job set, p_1 ≥ ... ≥ p_n, and let W = (w_1, ..., w_m) be a work distribution. Then, there exists a valid schedule for J with work distribution W iff

Σ_{k=1}^{n} p_k = Σ_{k=1}^{m} w_k, and for every 1 ≤ i ≤ m and 1 ≤ j ≤ n,    (7.12)

if w_i < p_j then Σ_{k=1}^{j} p_k ≤ Σ_{k=1}^{i} w_k.    (7.13)

Proof. (→) Suppose a valid schedule for J exists with work distribution W. Since W is a valid work distribution for J, Σ_{k=1}^{n} p_k = Σ_{k=1}^{m} w_k. Let w_i < p_j for some i, j. Since a job cannot be processed on more than one processor at the same time, the progress of each job 1, ..., j by time w_i was at most w_i, and at least (p_k − w_i) work remained to be executed later on processors 1, ..., i − 1. Furthermore, if j < i then only j processors could be used to complete the remaining work for jobs 1, ..., j. Then,

Σ_{k=1}^{j} (p_k − w_i) ≤ Σ_{k=1}^{min(i,j)} (w_k − w_i)

⇒ Σ_{k=1}^{j} p_k ≤ Σ_{k=1}^{min(i,j)} w_k + w_i(min(i, j) − j) ≤ Σ_{k=1}^{i} w_k.

(←) We describe a method for constructing a valid schedule for J that has work distribution W. The schedule is constructed by successively filling the processors with work from p_1, ..., p_n, in order, until the work assigned to the current processor i reaches w_i, preempting the jobs if necessary. Since Σ_{k=1}^{n} p_k = Σ_{k=1}^{m} w_k, all the jobs have been fully scheduled. We now prove that no job is scheduled on more than one processor at a time (overlaps). Suppose, in contradiction, that some job j overlaps on processors i and i + 1. Since it overlaps, it executed on processor i from its starting time t until w_i and on processor i + 1 from time 0 until after time t, and therefore p_j > w_i. From the supposition, Σ_{k=1}^{j} p_k ≤ Σ_{k=1}^{i} w_k, but then jobs 1, ..., j fit in w_1, ..., w_i and job j would not be 'spilled' to processor i + 1, a contradiction. Therefore, no job overlaps in the constructed schedule. We have shown that the schedule supplies the execution demand of all the jobs and that no job was scheduled on more than one processor at a time; hence it is valid and a valid schedule for J with work distribution W exists.
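A direct transcription of the two conditions of Theorem 16 into a feasibility test (a sketch; the function name is ours, and both p and w are assumed to be sorted in non-increasing order, as in the theorem):

function ok = isFeasibleWorkDistribution(p, w)
% Theorem 16: a valid schedule with work distribution w exists for job set p
% iff the total work matches (7.12) and, whenever w(i) < p(j), the first j
% jobs fit into the first i work shares (7.13).
tol = 1e-9;
ok = abs(sum(p) - sum(w)) <= tol;                        % condition (7.12)
for i = 1:numel(w)
    for j = 1:numel(p)
        if w(i) < p(j) && sum(p(1:j)) > sum(w(1:i)) + tol
            ok = false;                                  % condition (7.13) violated
            return;
        end
    end
end
end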

Theorem 16 contains non-linear constraints, so it is not suitable for an LP problem formulation. The following theorem defines a set of linear constraints that are necessary for a feasible work distribution and were used in the LP problem formulation (constraint (7.10)):

Theorem 17. Let J = (p_i)_{i=1}^{n} be a job set where p_1 ≥ ... ≥ p_n, and let S be a valid schedule of J with work distribution (w_1, ..., w_m). Then,

for 1 ≤ i ≤ min(m, n),    Σ_{k=1}^{i} p_k ≤ Σ_{k=1}^{i} w_k.    (7.14)


Proof. Since a job cannot be executed on more than one processor at the same time, by the time processor i completed execution, the progress of each of p_1, ..., p_i was at most w_i. Therefore, a remainder of at least (p_k − w_i) was later executed on processors 1, ..., i − 1: Σ_{k=1}^{i} (p_k − w_i) ≤ Σ_{k=1}^{i} (w_k − w_i). Consequently, Σ_{k=1}^{i} p_k ≤ Σ_{k=1}^{i} w_k.

7.4.1 Validity

To prove the method’s validity, we first prove that the initial schedule is valid for the classic model and then that the final schedule is valid for the extended model.

Initial Schedule / Work Plan

For problems in categories Equal Distribution and Pack, our method uses McNaughton's technique to produce the initial schedule; therefore it is valid [McN59]. For problems in the Load Balance category, each job is either assigned to a separate processor or scheduled using McNaughton's technique on the remaining processors, so its processing requirement is supplied and it cannot be assigned to more than one processor at a time. For Non-Monotonic problems, the feasibility of the work distribution that is returned by the LP solver and used to construct the schedule is validated using the constraints in Theorem 16; hence the schedule is valid. If a solution is not found using LP, it is constructed using McNaughton's technique, so it is valid in this case too. We have thus shown that the initial schedule produced by our method is valid.

Final Schedule

The final schedule is built as a concatenation of periods that correspond to work units in the work plan. By construction, in each period, the execution of jobs that appear in the work unit’s job assignment progresses by the specified amount. Therefore, the execution progress during each such period is the same as the progress of the corresponding period in the initial schedule on a unit-speed multiprocessor. Consequently, the execution progress of each job at the end of the execution is the same for the initial and final schedules on their corresponding systems. Since the initial schedule is valid, it is guaranteed that each job is executed to completion in the final schedule. Also, since the job assignments in the final schedule are taken from the initial schedule, and the initial schedule is valid, no job is assigned to more than one processor at the same time. This concludes the validity proof.

7.4.2 Optimality

We prove that our method is optimal for the problems in categories Equal Distribution, Pack, and Load Balance.

Proof Scheme

We show that the work distribution of the schedule produced by our method guarantees that this schedule has the shortest makespan of all valid schedules.


(a) Original schedule   (b) Compact schedule

Figure 7.9: Compact schedule transformation

For Equal Distribution, the optimality is straightforward, as maximum throughput is achieved during the execution.

Compact Schedule Transformation

We exemplify how to transform any schedule into a compact form (Figures 7.9a, 7.9b). The vertical lines in the diagram denote time units and the numbers at the top denote the number of active processors. The transformation procedure is as follows:

1. Sort the time unit periods and the job assignments in them by the number of active processors in descending order.

2. In each period, reassign the jobs to the first k processors, where k is the number of active processors.

The compact schedule retains the validity and optimality properties of the original schedule, and its makespan is the same or shorter. We use the compact schedule as a uniform form for comparing makespans. In step 1, sorting the periods does not change the order of commands in each job, but only their time allocation. In step 2, migrating jobs between the processors does not change the execution time because a preemptive job model is used.

Work Redistribution

To understand how changes in work distribution affect the execution time, we examine the effect of local variations in the work distribution. A local variation here means that a certain amount of work w_kl is moved from processor k to processor l (k ≠ l). The original work distribution is denoted by W = (w_1, ..., w_m) and the modified one by W′ = (w′_1, ..., w′_m), where w′_k = w_k − w_kl, w′_l = w_l + w_kl, and w′_i = w_i for i ≠ k, l. The discussion here is limited to cases where the modification does not cause an inversion in the sorted order of w′_1, ..., w′_m, i.e., w′_1 ≥ ... ≥ w′_m. Let t and t′ denote the makespans of W and W′, respectively. Then, the execution time difference is

Δt = t′ − t = Σ_{i=1}^{m} (w′_i − w′_{i+1}) / s_(i) − Σ_{i=1}^{m} (w_i − w_{i+1}) / s_(i).    (7.15)


Technion - Computer Science Department - Ph.D. Thesis PHD-2015-15 - 2015 w4 w4 w3 w3 w2 w2 w1 w1 work work (a) w1 → wl (b) wk → wl, 1 < k < l

Figure 7.10: Work redistribution illustration

Two cases are considered separately: (1) k = 1, and (2) 1 < k < l, depicted in Figure 7.10. In the first case (Figure 7.10a), all the w_i, w_{i+1}, w′_i, and w′_{i+1} terms in the formula, except where i ∈ {1, 2, l, l + 1}, cancel each other out. The formula is reduced to the following expression:

Δt_1l = w_kl · (−1/s_(1) + (1/s_(l) − 1/s_(l−1))).    (7.16)

The following theorem proves that this time difference is non-positive, i.e., moving work from the processor with the biggest work share to another processor does not increase the execution time. Conversely, moving work to that processor does not decrease the execution time.

Theorem 18. For any 1 < i ≤ m,

1/s_(1) ≥ 1/s_(i) − 1/s_(i−1).    (7.17)

Proof. Condition A1 in the problem definition states that S1 ≤ S2 ≤ ... ≤ Sm. The proof layout is as follows:

S_{i−1} ≤ S_i ⇒ 1/S_{i−1} ≥ 1/S_i ⇒ i/S_{i−1} ≥ i/S_i ⇒ 1/S_{i−1} ≥ i/S_i − (i − 1)/S_{i−1}

S_1 ≤ S_{i−1} ⇒ 1/S_1 ≥ i/S_i − (i − 1)/S_{i−1} ⇒ 1/s_(1) ≥ 1/s_(i) − 1/s_(i−1).

This concludes the theorem proof.

In the second case (Figure 7.10b), where 1 < k < l, the formula for Δt_kl reduces to the following expression:

Δt_kl = w_kl · (−(1/s_(k) − 1/s_(k−1)) + (1/s_(l) − 1/s_(l−1))).    (7.18)

In this case, it is not possible to determine whether the execution time increases or decreases due to the modified work distribution without further inspection of the processor speeds (s_(i))_{i=1}^{m}. Condition 2 in the problem classification examines whether the series 1/s_{i+1} − 1/s_i for 1 ≤ i ≤ m − 1 is increasing, decreasing, or non-monotonic. This condition distinguishes between problems in Categories 2, 3, and 4. For problems in Category 2, the series is increasing, so Δt_kl ≥ 0, i.e., the modification increases the execution time. Conversely, for problems in Category 3, the series is decreasing, so the modification decreases the execution time. For problems in Category 4, the execution time decreases or increases depending on whether 1/s_(k) − 1/s_(k−1) ≥ 1/s_(l) − 1/s_(l−1) is true or not, respectively. Moving work from processor l to k (in the other direction) has the opposite increase or decrease effect on the execution time in each of the described cases. The effect of work distribution modification on execution time is summarized in Table 7.4.


Category   Processors     w_k → w_l   w_l → w_k
any        k = 1, l > 1   −           +
2          1 < k < l      +           −
3          1 < k < l      −           +
4          1 < k < l      ±           ±

±: + or − depending on whether 1/s_(k) − 1/s_(k−1) ≥ 1/s_(l) − 1/s_(l−1)

Table 7.4: Work redistribution effect on execution time


Optimality for Problems in Category 2

Our scheduling algorithm for problems in Category 2 iteratively assigns max_i p_i work to processors, until the remaining work is less than that, and this remainder, if it exists, is assigned to the next unused processor. The work distribution of the schedule is characterized as follows. There exists 1 ≤ j ≤ m such that:

• w_1 = ... = w_{j−1} = max_i p_i (maximum load)

• w_j = Σ_i p_i − (j − 1) · max_i p_i (remainder)

• w_{j+1} = ... = w_{m+1} = 0 (recall that w_{m+1} := 0)

Consider a case where our algorithm produced a schedule S with work distribution (w_1, ..., w_m) for job set J = (p_i)_{i=1}^{n}, and let S′ be some valid schedule for J with work distribution (w′_1, ..., w′_m). We prove that the makespan of S is shorter than or equal to that of S′ by showing that the work distribution of S′ can be changed to that of S, without increasing the execution time, using local modifications that bring the work distribution closer to the goal. The next modification is chosen by inspecting the difference between the work distributions:

• If w′_1 ≠ w_1 then it must be that w′_1 > w_1, or otherwise the largest job is necessarily executed on more than one processor at a time. We move the extra w′_1 − w_1 work from processor 1 to the other processors. This does not increase the execution time, as proven in the redistribution analysis shown previously.

• If the two work distributions differ only in the distribution of the remainder, i.e., w′_i = w_i for 1 ≤ i ≤ j − 1, then w_j = Σ_{i=j}^{m} w′_i because w_{j+1} = ... = w_m = 0. We transfer the entire load of processors j + 1, ..., m to w′_j, i.e., w′_{j+1}, ..., w′_m → w′_j, starting from w′_m backwards, until the work distribution matches that of S. Each such step does not change the sorted order of the work distribution elements, because (1) the load is moved from the last (potentially) non-idle processor to w′_j and (2) w′_j does not grow larger than w′_{j−1}. The work redistribution analysis shows that the execution time does not increase due to these modifications.


• Otherwise, let k be the first index such that w_k ≠ w′_k (2 ≤ k ≤ j − 1). w′_k must be smaller than w_k because w′_k ≤ w′_{k−1} and w′_{k−1} = w_{k−1} = w_k. Since the total work is equal in the two work distributions, there is w_k − w′_k more work assigned to processors k + 1, ..., m in the modified work distribution than in that of S. We move this extra work to processor k. The work redistribution analysis shows that this does not increase the execution time.

Each of the steps above brings the work distribution closer to that of our schedule without increasing the execution time. Therefore, the makespan of S is shorter than or equal to that of S′, proving the optimality of our method for problems in Category 2.

Optimality for Problems in Category 3

We show that our method is optimal for problems in Category 3. Let the jobs be sorted in descending order of size, p_1 ≥ ... ≥ p_n, and let k be the smallest index such that jobs p_k, ..., p_n can be distributed equally between processors k, ..., m, i.e., p_k ≤ (1/(m − k + 1)) Σ_{j=k}^{n} p_j. Our scheduling algorithm for problems in Category 3 assigns jobs p_1, ..., p_{k−1} to processors 1, ..., k − 1, one job per processor, and distributes the remaining jobs equally between processors k, ..., m. The work distribution of the schedule is (w_1, ..., w_m), where w_i = p_i for 1 ≤ i ≤ k − 1 and w_i = (1/(m − k + 1)) Σ_{j=k}^{n} p_j for k ≤ i ≤ m.

Theorem 19. Let S be a schedule produced by FAS with work distribution (w_1, ..., w_m) for job set J = (p_i)_{i=1}^{n}, and let S′ be some valid schedule for J with work distribution (w′_1, ..., w′_m). Then,

for 1 ≤ i ≤ k − 1,    Σ_{j=1}^{i} w′_j ≥ Σ_{j=1}^{i} w_j.    (7.19)

Proof. Since a job cannot be executed on more than one processor at the same time, by the time processor i completed execution, the progress of each p_k ∈ {p_1, ..., p_i} was at most w′_i. Therefore, a remainder of at least (p_k − w′_i) was later executed on processors 1, ..., i − 1: Σ_{k=1}^{i} (p_k − w′_i) ≤ Σ_{k=1}^{i} (w′_k − w′_i). Consequently, Σ_{k=1}^{i} p_k ≤ Σ_{k=1}^{i} w′_k. Since p_i = w_i for 1 ≤ i ≤ k − 1, the claim follows.

We show that the makespan of S is shorter than or equal to that of S′ using a series of local modifications that do not increase the execution time. The next modification is chosen by inspecting the difference between the work distributions:

• If for some 1 ≤ i ≤ k − 1, w_i ≠ w′_i, then let j be the first index such that w_j ≠ w′_j. Then, from Theorem 19, w′_j > w_j. We move w′_j − w_j work from processor j to processors with higher index. This does not increase the execution time, as proven in the redistribution analysis.



• Otherwise, let j be the last index such that w′_j ≠ w_j (j > k). Then w′_j < w_j, or otherwise, since w′_k ≥ ... ≥ w′_j, the total load would be greater than the total work Σ_{i=1}^{n} p_i. Since the total work is equal in the two work distributions, there is w_j − w′_j more work assigned to processors k, ..., j − 1 in the modified work distribution. We move this extra work to processor j. The work redistribution analysis shows that this does not increase the execution time.

Each of the steps above brings the work distribution closer to that of our schedule without increasing the execution time. Therefore, the makespan of S is shorter than or equal to that of S′, which proves that our method is optimal for problems in Category 3.

7.5 Evaluation

This section evaluates the efficiency of FAS and compares it with three other scheduling algorithms. The workloads for the comparison were chosen to represent typical HPC applications.

7.5.1 Methodology

We evaluated the efficiency of FAS for CPUs with Turbo Boost by running realistic and synthetic experiments on one physical CPU, and by simulating it with configurations of multiple Intel Core i7 and Xeon series CPUs.

Platform

The evaluation platform was a Supermicro workstation with two eight-core Xeon E5-2690 CPUs, each with 2MB/20MB of L2/L3 cache, 128GB DDR3 RAM, and two Kingston 240GB SSD drives in RAID5 configuration, running Linux kernel 3.13.0, and located in an air-conditioned server room. The base frequency of the CPUs is 2.9GHz, and the maximum Turbo frequencies in GHz for 1-8 active cores are 3.8, 3.6, 3.6, 3.4, 3.4, 3.3, 3.3, 3.3, respectively. We disabled hyper-threading, CPU power limitations, and the CPU in Socket #1, and configured the OS frequency scaling governor to run the CPUs at maximum frequency. Eight worker threads attached to the eight cores executed jobs from a job set according to a given schedule, and logged the start and completion time of each job.

Baseline Schedulers

Each job set was executed using FAS and the following three baseline scheduling methods. The total execution time of the job set was used as a measure of scheduler efficiency (shorter is better). McNaughton’s technique was described in Section 7.3.3. This offline scheduler splits jobs into parts that run on different cores at different times. To prevent them from executing on two


Figure 7.11: Execution time prediction error % (INT32 jobs) — measured vs. estimated execution time [sec] on each CPU core for 8 jobs with sizes 8,7,6,5,4,3,2,1, each executed on a different core; the per-core prediction errors range from 0.01% to 0.30%.

cores at the same time, jobs must not begin execution before their scheduled time. The method assumes fixed core speeds, so for a fair comparison with FAS, the lowest Turbo frequency in the specification (all cores active) was chosen (3.3GHz), rather than the base frequency (2.9GHz).

Global-Queue is a commonly used online task scheduling method. In this method, all the jobs are initially placed into a queue, in arbitrary order. Whenever a core becomes idle and the queue is not empty, it removes the job at the head of the queue and executes it.

Linux CFS [Mau10] is the default Linux process scheduler. It is an online scheduler that splits the processor time fairly between the running processes (a.k.a. time-slicing). When evaluating this scheduler, we executed each job in a new thread, instead of using the worker threads. The thread scheduling was done by the OS, but the threads were synchronized on an execution barrier before and after execution to eliminate process creation overhead.
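For reference, a minimal Python sketch of McNaughton's wrap-around rule is shown below, assuming identical cores with fixed speed (as in the baseline configuration above); it is an illustration, not the evaluation harness used in the experiments.

```python
def mcnaughton(jobs, m):
    """McNaughton's wrap-around rule for preemptive makespan minimization on m
    identical fixed-speed cores. Returns, per core, a list of
    (job_index, start, end) segments; a split job never overlaps itself."""
    T = max(max(jobs), sum(jobs) / m)      # optimal makespan for identical cores
    schedule = [[] for _ in range(m)]
    core, t = 0, 0.0
    for j, length in enumerate(jobs):
        remaining = length
        while remaining > 1e-12:
            if T - t <= 1e-12:             # current core is full: wrap around
                core, t = min(core + 1, m - 1), 0.0
            piece = min(remaining, T - t)
            schedule[core].append((j, t, t + piece))
            t += piece
            remaining -= piece
    return schedule

# 9 unit-size jobs on 8 cores (the "Equal Distribution" benchmark below):
# one job is split across two cores, at the end of one and the start of the other.
for core, segments in enumerate(mcnaughton([1.0] * 9, 8), start=1):
    print(core, segments)
```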

7.5.2 Accuracy of the Performance Model

Prior to evaluating the scheduling methods, we test whether our performance model, on which our method is based, is accurate for simple compute jobs. We defined a job set with eight jobs of sizes (8, 7, 6, 5, 4, 3, 2, 1), executing blocks of integer MUL and DIV commands proportional to their size, measured the execution time on each core, and computed the expected execution times using our model. As shown in Figure 7.11, the prediction error did not exceed 0.3%, which confirms that our model is accurate for this kind of application. The average error in the experiments described below was 0.5%.

7.5.3 Synthetic Benchmarks

We examine the performance of each of the schedulers on different types of workloads using synthetic benchmarks.

Large Job and Many Small Jobs

This benchmark tests the schedulers' performance on workloads that cannot be equally distributed between the processors. The workload consisted of job sets with job sizes [1,...,1,80]


Figure 7.12: Workloads that cannot be distributed equally — execution time [sec] vs. the number of small jobs (0-256) for FAS, McNaughton, Global Queue, and Linux CFS.

with 0, 1, 2, ... jobs of size 1. Since the large job cannot be executed in parallel, the load cannot be equally distributed between the eight cores (80 each) with fewer than 560 small jobs. Each job executed Quicksort from the standard C library on 4KB of integer data, in a loop.

Figure 7.12 clearly shows the advantage of FAS over the other methods in adapting to different workloads. FAS achieves stable and moderate increase in execution time as the total load increases, which implies near-optimal performance. McNaughton’s technique fails for workloads with few smaller jobs, taking up to 12% longer than FAS, as it packs the work into few cores. The execution time flattens when two cores have been filled, as the benefit of load balancing is reduced. Global Queue and Linux CFS both assign the large job and a fair share of small jobs to one of the cores, causing great load imbalance and a performance hit of x6.8 per job, compared to FAS. For Linux CFS there is also a sharp leap in execution time when the memory footprint outgrows the L2 cache; other schedulers process the jobs on each core serially, so their memory footprint is much smaller.

Equal Distribution

This benchmark tests the schedulers' performance for workloads where the work can be equally distributed between the processors using job preemption. The job set consisted of nine unit-size jobs that execute AES-CBC processing. Figure 7.13 shows our time measurement results. FAS and McNaughton's technique perfectly balanced the load between all the cores using job preemption, while Global Queue assigned two of the jobs to the same core and was 67% slower. Note that while core #1 was assigned two jobs, its execution time was shorter than double that of the other cores due to the higher Turbo frequency during the second period. Linux CFS sliced the jobs' execution time unevenly between the cores and had 13% higher execution time.


Figure 7.13: (AES-CBC) Equally-sized jobs: 9 jobs, 8 cores — execution time [sec] per core for FAS, McNaughton, Global Queue, and Linux CFS.

7.5.4 Web-log analysis

We evaluated the performance using a realistic web-log analysis application implemented as a MapReduce program. The application is composed of two steps:

1. Data preparation (Map): the application reads records of accesses to a website from a list of Apache Web Log files, extracts the client IPs from them, finds their countries of origin, and sorts them by country.

2. Data processing (Reduce): for each country, the application reads the IP addresses for this country, and uses each IP as a key to decrypt a new 128B data block that it reads from a file located on the SSD drive, using AES-CBC 128-bit decryption from the Crypto++ 5.6.2 library.

In this experiment, we only evaluated the data processing step of the application (step 2). We used the GeoWeb Apache Log Generator tool [Jaf13] to generate realistic Apache web logs. The logs were then preprocessed as described in step 1, and a text file for each country was created containing the IP addresses of accesses that originated from that country. For each file, we defined a job to process the IPs listed in it (20 jobs in total). From these jobs we generated 20 job sets, such that job set i = 1..20 was composed of the largest i jobs (top countries by number of accesses).

Figure 7.14a shows the execution times of the job sets using each of the schedulers. FAS achieved the shortest execution times in all of the experiments (within a 1% margin). The results emphasize the shortcomings of the baseline methods. McNaughton's technique packs the jobs into fewer processors, while balancing the load between the cores is preferred for the tested CPU model. Global Queue assigns the jobs in their order of appearance (random), which causes instability in the work distributions. In Linux CFS, some cores have to process shorter jobs in addition to longer ones, which results in extra load on the most heavily loaded cores. The speedup of FAS over the baseline schedulers (McNaughton, Global Queue, and Linux CFS) was at most 12%, 15%, and 13%, and 7%, 4%, and 4% on average, respectively (Figure 7.14b).
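To make the data preparation step (step 1) concrete, here is a minimal Python sketch; the log parsing and the `ip_to_country` lookup are illustrative placeholders of mine (the thesis pipeline used GeoWeb-generated Apache logs and a country lookup during preprocessing).

```python
from collections import defaultdict

def prepare_country_files(log_paths, ip_to_country):
    """Step 1 (Map): extract client IPs from Apache access logs and group them by
    country of origin. `ip_to_country` is a caller-supplied lookup (e.g., backed
    by a GeoIP database); it is a placeholder, not part of the thesis code."""
    by_country = defaultdict(list)
    for path in log_paths:
        with open(path) as log:
            for record in log:
                # In the Apache common/combined log format, the client IP is the
                # first whitespace-separated field of each record.
                ip = record.split(" ", 1)[0]
                by_country[ip_to_country(ip)].append(ip)
    # One text file per country; each file later becomes one processing job.
    for country, ips in by_country.items():
        with open(f"ips_{country}.txt", "w") as out:
            out.write("\n".join(ips) + "\n")
    return {country: len(ips) for country, ips in by_country.items()}
```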

7.5.5 Simulation of Performance on Other CPUs

In Section 7.3.5 we have shown that, theoretically, different work distribution (WD) should be used with different CPUs, depending on their Turbo frequencies. In this section, we extend this


Figure 7.14: Web-log analysis application. (a) Execution time [sec] vs. processed countries (num jobs) with FAS, McNaughton, Global Queue, and Linux CFS. (b) Speedup of FAS over the other schedulers vs. processed countries (num jobs).


Figure 7.15: Best work distribution on Core i7-820QM CPU — execution time [sec] with FAS, McNaughton, and Global Queue. (a) Job sizes [6,1,1,1,1,1]. (b) Job sizes [6,1...{x12}].

statement to real CPUs by simulating them using their Turbo frequencies specification. The simulation was based on our performance model, which was shown to be highly accurate in previous experiments.

Work distribution on Core i7-820QM

We found that the quad-core Core i7-820QM CPU for the mobile market best shows the importance of fitting the WD to the Turbo frequencies. Here, we compare FAS to McNaughton's technique, which packs the work into few active cores, and Global Queue, which balances the load. For this CPU, the two workloads [6,1,1,1,1,1] and [6,1...{x12}] (12 jobs of size 1) need to be scheduled with completely different work distributions. Figure 7.15a shows that for the first workload, packing the work into only two active cores is most efficient, with 13% speedup over the balanced schedule. On the other hand, Figure 7.15b shows that for the second workload Global Queue (load balancing) outperforms McNaughton's method (packing). Interestingly, FAS found an even better WD, which is neither fully load balanced nor fully packed, with 20% speedup over McNaughton's technique and 4% over Global Queue.

Breakdown by Series and Number of Cores

We collected and analyzed the frequency specifications of Intel's Core i7 and Xeon series CPUs (413 CPU models released by January 2015), and classified them as having a Packed or Balanced preferred WD. The classification was made based on the WD created for each CPU by FAS for the job set [10,1...{x5 · (m − 1)}], where m is the number of CPU cores. A Packed WD denotes that at least 66.7% of the work was done by cores 1..m/2. CPUs with only two cores and CPUs with uniform Turbo frequencies were omitted from the evaluation. Figure 7.16 shows the preferred WD by series and number of CPU cores. For the Core i7 series, each of the two WDs was preferred by close to 50% of the CPUs, while for the Xeon series, Balanced was preferred by 68% of the CPUs. Further analysis shows that a Packed WD was preferred by most CPUs with 4 cores, one 6-core i7, and 31% of the 8-core Xeons. CPUs with more than 8 cores always preferred a Balanced WD. In total, 63% of the CPUs preferred a Balanced WD.
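The classification rule itself is straightforward; the following sketch (illustrative, with made-up example distributions) applies it to a given work distribution:

```python
def classify_work_distribution(w):
    """Classify a work distribution as 'Packed' or 'Balanced' using the rule
    above: Packed if cores 1..m/2 perform at least 66.7% of the total work."""
    m = len(w)
    heaviest_half = sum(sorted(w, reverse=True)[: m // 2])
    return "Packed" if heaviest_half >= (2.0 / 3.0) * sum(w) else "Balanced"

# Two extremes for a hypothetical quad-core CPU and the job set [10, 1 x 15]
# (total work 25, i.e., the classification workload with m = 4):
print(classify_work_distribution([25.0, 0.0, 0.0, 0.0]))   # Packed
print(classify_work_distribution([10.0, 5.0, 5.0, 5.0]))   # Balanced
```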


Figure 7.16: Preferred work distribution by number of cores. (a) Core i7 (4-8 cores): 54 Packed, 59 Balanced. (b) Xeon (4-18 cores): 79 Packed, 166 Balanced.

7.6 Extending Amdahl’s Law for Multicores with Turbo Boost

7.6.1 Introduction

Multithreading is known to provide speedup and reduced energy consumption. However, writing and validating parallel codes requires significant programming effort and expertise. Thus, given a sequential code, it is important to have a preliminary estimate of the expected benefits of its parallelization. Such estimates are commonly computed using Amdahl's law [Amd67] and its extension for energy [WL08], which provide easy-to-compute formulas for upper bounds for the corresponding parameters. These formulas are based on several simplifying assumptions, such as the processors being identical and their speeds fixed, and they do not account for external factors that may affect the performance of the cores.

Most modern processors are limited by maximum power and thermal constraints. Therefore, in multicore processors the nominal frequency is set according to the power consumption and temperature when all the cores are active. However, when some of the cores are idle, the spare power budget can be utilized to increase the frequency of the active cores; most modern multicore processors implement such a mechanism. For example, Intel CPUs with Turbo Boost technology [Int08] are equipped with a frequency governor that aims to maximize core frequencies such that the power consumption, current consumption, and temperature stay below certain limits; AMD processors use a similar technology called Turbo Core. Hereafter, we refer to these mechanisms as Turbo Boost. These technological advances encouraged us to revisit Amdahl's law for multicore processors.

Using analysis and experimental evaluation, we show that, from a software design perspective, Turbo Boost aggravates the speedup limitations obtained by Amdahl's law by making parallelization of sequential codes less profitable. This is in contrast to previous studies showing that from a processor design perspective, frequency scaling mechanisms mitigate these limitations by providing higher performance for the same energy budget [AGS05, CJA+09]. We extend the model used in Amdahl's law to account for frequency changes. Using this extended model, we propose refined formulas for the maximum expected speedup and energy scaling, and validate them experimentally.


7.6.2 Background

Amdahl’s Law

Amdahl's law states that having multiple processors to accelerate a program has limited effect on the speedup unless the processing rate of its sequential component is increased by nearly the same magnitude. Gene Amdahl illustrated this "law" using the following simplified model. The program has two parts: a parallel part of proportionate size f, which is equally split without scheduling overhead between N identical processors running in parallel, and a sequential part of size 1 − f that does not overlap the parallel part. The speedup in this case is computed using the following formula:
$$\mathrm{speedup}_{\mathrm{Amdahl}} = \frac{1}{(1-f) + \frac{f}{N}}. \tag{7.20}$$
Using this model, Amdahl advocated for the continuation of using single-processor machines (in 1967) [Amd67, Amd13]; the formula was, in fact, derived from his original paper by others. Four decades later, when the industry turned to multicore processors, a series of works extended Amdahl's law to take into account additional characteristics of these processors (see the survey in [ABAKAFA13]). The assumptions in Amdahl's law were refined for multicores in two main directions:

• Architecture: use of various chip designs, including symmetric, asymmetric, dynamic, distributed, and heterogeneous multicores, and dynamic frequency scaling.

• Shared resources and overheads: cache contention, limited memory latency/bandwidth, communication between cores, synchronization, context switching, and computation scaling.

The extended models provided more accurate estimation of performance and energy consumption, and showed architectural tradeoffs in multicore design. Hill and Marty [HM08] compared different multicore designs under a fixed resource budget (e.g., chip area) and provided important insights for chip architects. Their designs consist of a number of basic cores and at most one more powerful core. These models generally cannot represent dynamic voltage and frequency scaling (DVFS), where the power of more than one core can be increased or decreased dynamically. Optimizing time and energy in multicores using DVFS was examined in [CM08, CM10, LnD10, AGS05, GO15]. Turbo Boost implements many such optimizations in an embedded automatic power management mechanism.

This paper considers a different aspect of Amdahl's law – estimating the improvement ratio in time and energy between parallel and serial codes on an existing multicore processor with Turbo Boost. Unlike previous works, in our model the sequential program runs on one core of the multicore processor.

Energy Scaling

The energy consumed by a program is equal to $P \times T$, where P is the average power consumption and T is the execution time. Woo and Lee [WL08] extended Amdahl's law for energy


efficiency. In their augmented model, each processor consumes one unit of power when it is in active state and a fraction π when it is in idle state (0 ≤ π ≤ 1). The energy consumption of the parallel program is given by the following formula:

$$E = \underbrace{(1-f)\cdot\bigl(1+(N-1)\pi\bigr)}_{\text{serial}} + \underbrace{\frac{f}{N}\cdot N}_{\text{parallel}} = 1 + (N-1)\pi(1-f). \tag{7.21}$$

The energy consumption of the sequential program, which is also executed on the multicore processor, is equivalent to the energy consumption of a parallel program with f = 0. Therefore, the energy scaling factor is as follows:

$$E^{+} = \frac{1 + (N-1)\pi}{1 + (N-1)\pi(1-f)}. \tag{7.22}$$

Normally, program parallelization reduces energy consumption, as the numerator is greater than the denominator.

7.6.3 Extended Amdahl’s Law

System Model

A compute-intensive sequential program is executed on a symmetric multiprocessor with N cores and Turbo Boost. An alternative implementation is considered, where a part of proportionate size f is split equally between the N cores without overhead; we call this implementation the parallel program. Our aim is to compute the maximum expected speedup and energy improvement of the parallel program over the sequential one.

Our model extends Amdahl's law to account for multiprocessors where the performance of each processor depends on the number of active (non-idle) processors. Concretely, we focus on multicore CPUs with Turbo Boost, which operate at higher frequency when fewer cores are active. For example, Tables 7.5 and 7.6 show that the maximum CPU frequency scales inversely with the number of active cores in two Xeon series CPUs. We denote the effective speed (performance) of every active core when n cores are active by s(n). Note that s(N) is higher than the base frequency. This is partly because the chip's temperature depends on other factors besides the number of active cores.

Extended Speedup Model

In the extended model, the core speeds are not fixed. For simplicity, we normalize all the speeds by s(1). Thus, when only one core is active, its normalized speed is 1, and when N cores are active, the normalized speed of each core is s(N)/s(1). We modify the speedup formula to match the extended model by introducing a correction factor for the parallel part:

$$\mathrm{speedup}_{\mathrm{multicore}} = \frac{1}{(1-f) + \frac{f}{N}\cdot\frac{s(1)}{s(N)}}. \tag{7.23}$$


Table 7.5: Frequency specification: 8-core Intel Xeon E5-2690

  Active cores       1    2    3    4    5    6    7    8    Base
  Frequency [GHz]    3.8  3.6  3.6  3.4  3.4  3.3  3.3  3.3  2.9

Table 7.6: Frequency specification: 12-core Intel Xeon E5-2658 v3

  Active cores       1    2    3    4    5-12   Base
  Frequency [GHz]    2.9  2.9  2.7  2.6  2.5    2.2

Normally, the factor s(1)/s(N) is greater than 1 because higher frequency is achieved when some cores are idle. Therefore, the above formula provides a tighter upper bound on the speedup than does Amdahl’s law.
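As a sanity check of Eq. (7.23) against Eq. (7.20), the following sketch evaluates both with the Xeon E5-2690 frequencies from Table 7.5 (an illustration of the formulas, not code from the thesis):

```python
def amdahl_speedup(f, n):
    # Eq. (7.20): fixed, identical processor speeds.
    return 1.0 / ((1.0 - f) + f / n)

def turbo_speedup(f, n, s1, sn):
    # Eq. (7.23): the parallel part is scaled by the frequency ratio s(1)/s(N).
    return 1.0 / ((1.0 - f) + (f / n) * (s1 / sn))

# Xeon E5-2690 (Table 7.5): s(1) = 3.8 GHz, s(8) = 3.3 GHz.
for f in (0.2, 0.6, 1.0):
    a, t = amdahl_speedup(f, 8), turbo_speedup(f, 8, 3.8, 3.3)
    print(f"f={f}: Amdahl {a:.2f}, refined {t:.2f}, overestimate {a / t - 1:.1%}")
```

At f = 1 this gives a speedup of 8 under Amdahl's law versus about 6.9 under the refined formula, an overestimate of roughly 15%, in line with the measured errors for INT-SB reported in Section 7.6.4.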

Extended Energy Scaling Model

The energy model of Woo and Lee [WL08] assumes that active cores always have the same power consumption. However, in multicores with frequency scaling mechanisms, active cores consume more power when some cores are idle, because they operate at higher frequency. We augment the model we used for the speedup formula with information about power to compute the energy improvement factor. Let P(n) denote the average power consumption of the CPU when n cores are active (1 ≤ n ≤ N). Then, the energy consumption during a period of length t where n cores are active is P(n) · t. The energy scaling factor is hence described by the following formula:

$$\frac{1 \cdot P(1)}{(1-f)\cdot P(1) + \frac{f}{N}\cdot\frac{s(1)}{s(N)}\cdot P(N)}.$$

We rewrite this formula in a slightly more compact form:

$$E^{+}_{\mathrm{refined}} = \frac{1}{(1-f) + \frac{f}{N}\cdot\left(\frac{P(N)}{s(N)} \Big/ \frac{P(1)}{s(1)}\right)}. \tag{7.24}$$
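The energy-scaling formulas can be exercised the same way; in the sketch below, the power values and the idle fraction are illustrative placeholders rather than measured parameters:

```python
def woo_lee_energy_scaling(f, n, pi):
    # Eq. (7.22): fixed-speed model; pi is the idle-power fraction of a core.
    return (1.0 + (n - 1) * pi) / (1.0 + (n - 1) * pi * (1.0 - f))

def refined_energy_scaling(f, n, s1, sn, p1, pn):
    # Eq. (7.24): P(1), P(N) are the average CPU powers with 1 and N active cores.
    return 1.0 / ((1.0 - f) + (f / n) * (pn / sn) / (p1 / s1))

# Illustrative placeholders only (not measurements from this chapter):
# an 8-core CPU with s(1)=3.8 GHz, s(8)=3.3 GHz, P(1)=60 W, P(8)=130 W, pi=0.3.
print(woo_lee_energy_scaling(0.8, 8, 0.3))
print(refined_energy_scaling(0.8, 8, 3.8, 3.3, 60.0, 130.0))
```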

7.6.4 Evaluation

In this section, we evaluate the accuracy of the traditional and refined formulas for speedup and energy scaling. We start with an experimental evaluation on two Intel multicore CPUs with Turbo Boost. Then, we perform a theoretical analysis to identify how different problem parameters will affect accuracy, assuming our model is correct, and examine the potential for improvement in a series of Intel Xeon CPU models, based on their specification.

Experimental Setup

We used two evaluation platforms that were located in an air-conditioned server room.


• A Sandy Bridge platform composed of two eight-core Intel Xeon E5-2690 CPUs with 128GB RAM that runs Linux kernel 3.13.0.

• A Haswell platform composed of two twelve-core Intel Xeon E5-2658 v3 CPUs with 64GB RAM that runs Linux kernel 4.0.9.

We disabled hyperthreading, CPU power limitations, and the CPU in Socket 1, and configured the OS frequency scaling governor to run the CPUs at maximum frequency. The base frequency and the maximum Turbo frequencies for the two processor types are shown in Tables 7.5 and 7.6. A worker thread was attached to each core and executed a given function in a loop, for a given number of iterations. Each thread measured its execution time, and thread 0 also measured the energy consumed by the cores and by the entire processor package, using processor counters MSR_PKG_ENERGY_STATUS and MSR_PP0_ENERGY_STATUS. The process was run with maximum priority. We measured the accuracy of speedup and energy improvement formulas in the following three configurations:

INT-SB: a synthetic sequence of integer MUL and DIV operations on the Sandy Bridge platform.

AES-SB: an AES-CBC 128-bit encryption function from the Crypto++ 5.6.2 library on the Sandy Bridge platform.

AES-HW: the AES-CBC function on the Haswell platform.

For each configuration, a program with a configurable parallel portion f ∈ [0, 1] and a sequential portion (1 − f) was executed, and its execution time and energy consumption were measured. The experiments were run with Turbo Boost enabled and again with it disabled.
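The package-energy readings mentioned above come from the CPU's RAPL energy counters. A minimal sketch of reading them through the Linux msr driver follows; the MSR addresses and bit layout are the standard Intel RAPL ones as I understand them, and the code is an illustration rather than the measurement harness used here.

```python
import os, struct, time

# Standard Intel RAPL MSR addresses (assumed here; not quoted from the thesis).
MSR_RAPL_POWER_UNIT   = 0x606
MSR_PKG_ENERGY_STATUS = 0x611   # package energy counter
MSR_PP0_ENERGY_STATUS = 0x639   # power-plane 0 (cores) energy counter

def read_msr(cpu, reg):
    """Read one 64-bit MSR via the Linux msr driver (needs root and `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)

def package_energy_joules(cpu=0):
    """Package energy counter in joules (a 32-bit counter that wraps around)."""
    esu = (read_msr(cpu, MSR_RAPL_POWER_UNIT) >> 8) & 0x1F   # energy status unit
    return (read_msr(cpu, MSR_PKG_ENERGY_STATUS) & 0xFFFFFFFF) * (0.5 ** esu)

if __name__ == "__main__":
    e0, t0 = package_energy_joules(), time.time()
    time.sleep(1.0)
    e1, t1 = package_energy_joules(), time.time()
    print(f"average package power: {(e1 - e0) / (t1 - t0):.1f} W")
```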

Speedup Measurements

The time measurement results are listed in Table 7.7. For each value of f, we computed the speedup by dividing the execution time of the sequential program (f = 0) by that of the parallel program, and used it to compute the error of each formula.

Figure 7.17 shows the estimation errors of the traditional and refined speedup formulas for the AES-HW configuration, with Turbo Boost enabled. The traditional formula overestimated the speedup by 16% for f = 1, while the refined formula accurately estimated the speedup with a maximum error of 0.3% and as low as 0.04% for f = 1. These results clearly show that the refined formula significantly improves the accuracy of the speedup estimation, indicating that the upper bound on speedup given by the refined formula is achievable for real-world applications.

For INT-SB, the error similarly increased, up to 15.1% for f = 1, while the refined formula was very accurate with 0.05% maximum error. The function here was designed for maximum utilization of the ALU; hence these results verify that the core frequencies are as expected. For AES-SB, the results were similar; the maximum error of the original formula was 16.2%, while for the refined formula it was 0.9%.


Table 7.7: Execution time [sec]

         Xeon E5-2690 (INT)    Xeon E5-2690 (AES)    Xeon E5-2658 v3 (AES)
  f      Turbo   No Turbo      Turbo   No Turbo      Turbo   No Turbo
  0      15.8    20.7          16.0    20.7          20.8    27.7
  0.2    13.1    17.1          13.2    17.1          17.0    22.3
  0.4    10.4    13.4          10.4    13.5          13.3    17.3
  0.6     7.7     9.8           7.7     9.9           9.5    12.3
  0.8     5.0     6.2           5.1     6.2           5.8     7.3
  1       2.3     2.6           2.3     2.7           2.0     2.3

Table 7.8: Package energy consumption [J]

         Xeon E5-2690 (INT)    Xeon E5-2690 (AES)    Xeon E5-2658 v3 (AES)
  f      Turbo   No Turbo      Turbo   No Turbo      Turbo    No Turbo
  0      712.8   639.6         757.1   662.0         876.4    1097.3
  0.2    608.7   548.0         646.8   581.5         728.5     918.6
  0.4    509.4   457.5         544.9   479.7         597.7     734.9
  0.6    407.3   366.3         440.8   388.3         449.7     548.6
  0.8    307.9   274.6         340.4   295.2         310.6     357.2
  1      209.4   182.8         229.6   202.6         166.4     168.8

Figure 7.17: Speedup estimation accuracy for AES-HW configuration — estimation error vs. parallel part for Amdahl's formula and the refined formula.

Figure 7.18: Energy scaling estimation accuracy for AES-HW configuration — estimation error vs. parallel part for the Woo & Lee formula and the refined formula.


Energy Consumption Measurements

Figure 7.18 shows the estimation errors of the traditional and refined energy scaling factor formulas for the AES-HW configuration, with Turbo Boost enabled. The refined formula significantly improved the accuracy, though its error was not as low as for the speedup. Notably, the maximum error of the original speedup and energy improvement formulas is the same at 16%. We show that this result is expected in our theoretical analysis in Section 7.6.4. As for speedup, the results indicate that the estimated improvement in energy consumption is achievable in real-world applications.

For INT-SB, the trend was similar. The original formula had errors of 5.0% and 15.1% for f = 0.8 and f = 1, while the refined formula had a maximum error of 0.8%, for f = 0.6, and an error of 0.03% for f = 1. For AES-SB, the errors were higher for both formulas; for the original formula they were 10% and 16.2%, for f = 0.8 and f = 1, while for the refined formula they were 2.1% and 0.9%, respectively.

The refined formula uses the average power consumption parameters P(1) and P(N). We measured P(n) for every 1 ≤ n ≤ N active cores as follows: a fixed-size workload was split equally between the n cores, and the power was computed by dividing the CPU package energy consumption by the execution time. On the Haswell platform with Turbo Boost enabled, for the AES-CBC workload, the power scaling was close to linear, $P^{T}_{pkg}(n) \approx 3.7n + 37.9$ [W], but the curve was not smooth. With Turbo Boost disabled, it was also close to linear and the curve was smoother; $P^{D}_{pkg}(n) \approx 3.2n + 35.7$ [W].
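A linear fit of this kind can be obtained with ordinary least squares; the sketch below uses synthetic sample points generated around the reported fit, purely for illustration:

```python
import numpy as np

# Synthetic sample points around the reported Turbo-enabled fit (illustration only).
rng = np.random.default_rng(0)
active_cores = np.arange(1, 13)
power_watts = 3.7 * active_cores + 37.9 + rng.normal(0.0, 0.8, size=active_cores.size)

# Ordinary least-squares line fit: P_pkg(n) ~= slope * n + intercept.
slope, intercept = np.polyfit(active_cores, power_watts, deg=1)
print(f"P_pkg(n) ~= {slope:.1f} * n + {intercept:.1f} [W]")
```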

Accuracy With Turbo Boost Disabled

Our refined formulas are based on the assumption that Amdahl's law accurately models multicores with fixed frequency; hence, we expect the original formulas to be accurate with Turbo Boost disabled. The results of our experiments confirm this assumption. For INT-SB, the maximum error of the original speedup function was only 0.1%. For AES-HW, the error was 1.2% for all f = 0.2 − 1.0. Interestingly, for AES-SB, the error for f = 0.2 − 0.8 was at most 0.4%, but for f = 1 it was 4.4%. Our temperature measurements indicate that this was not caused by thermal issues. Note that in all the experiments, the parallel part is executed by all the threads, so any delay caused by using all the cores (such as frequency throttling) should have an effect for any f > 0. We ascribe this error to microarchitectural issues.

For energy scaling, the formula based on Woo and Lee's model was also quite accurate. For INT-SB, the maximum error was 0.3%; for AES-HW it was 1.3%; and for AES-SB it was again 4.4%. Note that Woo and Lee's formula yields different estimates when Turbo Boost is enabled and when it is disabled because the idle state power parameter π is different in the two cases, which is why there is a difference in the estimation error.

Theoretical Analysis

Assuming that our extended model is correct, the error of the original formula is $\left|\frac{o-r}{r}\right|$, where o and r are the results of the original and refined formulas, respectively. After rearranging the


terms, the expected speedup error is
$$err_{\mathrm{speedup}} = \left(\frac{1}{N\cdot\left(\frac{1}{f}-1\right)+1}\right)\cdot\left(\frac{s(1)}{s(N)}-1\right). \tag{7.25}$$

The formula shows that the factors that can increase the error include increasing the parallel part f, changing to a CPU with a larger range of Turbo frequencies, or changing to a CPU with fewer cores. We used this formula to find that the average improvement potential for 36 Haswell-EP CPU models with 4-18 cores is 15.5%, and the maximum is 38%.

To compute the energy improvement error, we first need to bring the two formulas to use the same representation of power. π is defined as the fraction of the power of an active core consumed by an idle core; consequently, it can be represented as follows: $\pi = \frac{N}{N-1}\cdot\frac{P(1)}{P(N)} - \frac{1}{N-1}$. The expected error in energy improvement is
$$err_{\mathrm{energy}} = \frac{\left(\frac{s(1)}{s(N)}-1\right)\cdot f}{N\cdot\frac{P(1)}{P(N)}\cdot(1-f) + f}. \tag{7.26}$$

The error can increase by the factors mentioned for speedup, and additionally by changing to a CPU where idle cores consume less power. For f = 1 we get $err_{\mathrm{speedup}} = err_{\mathrm{energy}} = \frac{s(1)}{s(N)} - 1$. This explains why we got the same maximum error for speedup and energy in our experiments; e.g., for AES-HW, this error is $\frac{2.9}{2.5} - 1 = 16\%$, as we observed.
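Transcribing Eqs. (7.25) and (7.26) directly reproduces this number; in the sketch below, the P(1)/P(N) value is a placeholder, since it does not affect the result at f = 1:

```python
def err_speedup(f, n, s1, sn):
    # Eq. (7.25)
    return (1.0 / (n * (1.0 / f - 1.0) + 1.0)) * (s1 / sn - 1.0)

def err_energy(f, n, s1, sn, p1, pn):
    # Eq. (7.26)
    return (s1 / sn - 1.0) * f / (n * (p1 / pn) * (1.0 - f) + f)

# AES-HW at f = 1: s(1) = 2.9 GHz, s(12) = 2.5 GHz, so both errors are 2.9/2.5 - 1 = 16%.
print(err_speedup(1.0, 12, 2.9, 2.5))             # 0.16
print(err_energy(1.0, 12, 2.9, 2.5, 42.0, 83.0))  # 0.16 (P(1), P(N) are placeholders; they cancel at f = 1)
```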

7.7 Discussion and Future Work

Current schedulers of parallel resources are mostly based on the independent processing-unit model. This model is fairly accurate for sparsely distributed systems such as computer clusters, while in tightly packed systems such as multicore CPUs there are significant performance dependencies between the processing units. This chapter lays the foundation for characterization of such dependencies and efficient task scheduling in these systems.

We extended Graham's α|β|γ classification of scheduling problems [GLLRK79] to a new class of problems, which include an extended resource model. We believe that the extended model can be used to improve the utilization of various systems, including multicore and many-core processors, low-power systems, and communication networks. For example, the extended model can be used to achieve higher utilization in multicore CPUs with Turbo Boost, where the frequency is scaled in accordance with the number of active cores.

We characterized the performance of Intel CPUs with Turbo Boost and presented FAS, a new scheduler that adapts the work distribution to the CPU frequencies to achieve minimal total execution time. FAS performed as well as or better than McNaughton's technique, Global Queue, and Linux CFS schedulers in all of the realistic and synthetic workloads we tested on a platform with a Xeon E5-2690 CPU, and the analysis revealed weaknesses of these


methods for certain types of workloads. In a web-log analysis application, FAS generated up to 12% shorter schedules than the best baseline scheduler. On synthetic workloads with one large job and multiple smaller jobs, FAS balanced the load effectively, while with the Global Queue and Linux CFS schedulers each job took 6.8 times longer to complete. Even for a simple workload with equally-sized jobs, FAS's schedule was 13% shorter than that of Linux CFS and 67% shorter than that of Global Queue. Simulation of different CPUs based on their Turbo frequency specification showed that the work distribution approach (packed or balanced) needs to be adapted to the specific CPU and workload. Our theoretical analysis showed that for certain CPU models, an optimal schedule for n jobs can be constructed in O(n) time, while for others this task is more complex and the current solution may not work as a periodic online scheduler. Classification of CPUs by their scheduling properties can assist in the choice of a hardware configuration and the development of fast heuristic scheduling methods.

Turbo Boost and similar frequency scaling technologies such as Turbo Core have changed the scaling curve of parallel programs. One might expect that without any parallelization overhead, the speedup would be linear, but with Turbo Boost the performance scales sub-linearly. While the use of multicore processors has become ubiquitous in modern computer systems, much of the legacy software remains sequential due to the significant cost and effort required for its parallelization. Therefore, simple tools for initial evaluation of the speedup and energy savings that could be gained from the parallelization are needed.

We have shown that the speedup limitations obtained under Amdahl's law are greater on multicore processors with Turbo Boost, where the use of multiple cores is accompanied by lower frequency. Previous works [AGS05, CJA+09] conversely claimed that Turbo Boost mitigates these limitations. The difference in the conclusions results from a different choice of reference points. The prior works showed that multicore processors with scaling mechanisms can complete the program in shorter time than their fixed-performance counterparts that have the same power consumption. This work, on the other hand, evaluated the speedup factor achieved by parallelizing a sequential program such that it makes use of multiple CPU cores.

We presented a refined formula for Amdahl's law that computes a tight upper bound on the speedup for CPUs with Turbo Boost while having a minimal impact on the simplicity of the model. In our experiments with AES-CBC 128-bit encryption and a synthetic compute-intensive function on eight- and twelve-core Intel Xeon series CPUs with Turbo Boost, the estimated speedup using Amdahl's law was up to 16% too high. In contrast, our refined formula had an estimation error of less than 1%. The changes in core frequency with Turbo Boost affect the power consumption of the CPU, so the extension of Amdahl's law for energy scaling also needs to be revised. We presented a refined formula for the energy improvement factor that can be achieved from parallelization on a CPU with Turbo Boost and showed that it significantly reduces the estimation error.
Using theoretical analysis, we showed that increasing the parallel part, changing to a CPU with a larger range of Turbo frequencies, or changing to a CPU with fewer cores can increase the necessary correction factor. For 36 Haswell-EP CPU models with 4-18 cores, we found the average correction factor to be 15.5%, and the maximum 38%. For energy, changing to a CPU with more power-efficient idle cores can also increase the correction


factor. The analysis also shows that with a fully parallelizable program, the correction factor for speedup and energy is the same.

Normally, the frequency of a core is highest when all other cores are idle; however, this is not guaranteed by the CPU specification. Theoretically, it is possible that for some CPU and workload, the sequential program would run only at base frequency (e.g., 2.2GHz), while the parallel program would run at a Turbo frequency (e.g., 2.5GHz). In this case, the speedup would be higher than the upper bound given by Amdahl's law. For example, suppose the sequential program starts running at maximum Turbo frequency and shortly after that, the core overheats and its frequency drops to base; in the parallel program, however, each core runs at a lower Turbo frequency and does not overheat.


Chapter 8

Discussion and Conclusion

Computing platforms that combine latency-oriented and throughput-oriented processors, such as multicore CPUs and discrete GPUs, have the potential to provide high throughput and short response times. However, their use in real-time data stream processing systems (RTSPSs) is complicated by the need to simultaneously satisfy both the high aggregate throughput of all the streams and the strict processing latency bounds of each individual stream. In such a platform, each type of compute unit provides different qualities that are essential to satisfy the challenging requirements. Therefore, its effective use in an RTSPS can only be achieved through collaborative resource management and scheduling. The goal of this dissertation was to develop an approach to efficient processing of real-time data streams on a computing platform that consists of CPUs and GPUs.

One of the main challenges we encountered was to achieve efficient utilization of the GPU's rich computational resources for multiple small data-processing tasks that have different computational demands and deadlines. While GPUs are designed to effectively handle massively-parallel tasks that can be partitioned into independent work units, they execute these work units as a batch, which limits their ability to complete certain work units before others and complicates the timing analysis. This observation led us to a new approach for executing tasks with limited internal parallelism on the GPU – to execute batches of tasks concurrently; previous studies on GPU task scheduling only considered tasks that occupy the entire GPU [ACOTN10, EWA13, KL11, RCS+11]. By handling multiple data streams at the same time on the GPU, it is possible to achieve high utilization of the GPU resources.

However, the batch-based method gives rise to two main problems: uneven work distribution between the processing units and high latency. The first problem is due to combined execution of multiple tasks with different sizes, and it strongly affects the GPU throughput and the predictability of the kernel execution time. The second problem arises because small batches do not contain sufficient parallelism to saturate the GPU, and small DMA transactions result in low throughput; consequently, processing streams with short latency bounds (deadlines) on the GPU could result in deadline misses.

CPUs are more robust to variance in task sizes than GPUs and can complete a task faster, but they have smaller computational capacity. Therefore, we reversed the normal relation between


the CPU and GPU, and used the CPU as the offload device. In Chapter 4, we developed the Rectangle method for CPU-GPU work distribution. The method assigns the streams that negatively affect GPU performance, such as those with unusually high computational demands or short latency-bounds, to the CPU, and all other streams to the GPU. To find a valid distribution of the streams, it scans through the CPU/GPU partitions of the stream set in a 2D space where the axes are stream rate and deadline, in search of a partition that satisfies the computational requirements and time restrictions of all the streams. Our analysis showed that static assignment of data streams to the compute units is preferable to dynamic work distribution, due to the high communication cost involved in state migration between the CPU and GPU memories. Our evaluation of the Rectangle method for stream sets with different numbers of streams and various distributions of rates and deadlines showed that collaborative CPU-GPU work distribution is essential to provide reasonable performance for computationally demanding workloads that include short-deadline streams. The Rectangle method yielded up to 30% higher throughput in comparison to simpler collaborative stream distribution methods.

A key problem in real-time systems is computing the worst-case execution time (WCET) of the tasks. Unfortunately, modern GPUs do not lend themselves to formal timing analysis because their internal scheduling mechanisms and device drivers are not fully revealed by the vendors. Therefore, the research in computing the WCET of GPU kernels has focused on empirical evaluation and statistical methods [BSBT14, BD13] to estimate execution time. The WCET analysis in Chapter 4 focused on kernels that process multiple data-stream processing tasks. To further simplify the problem, we introduced two additional assumptions: (1) the same processing function is applied to all the streams, and (2) the length of the task is proportional to the size of the data being processed; these assumptions hold for many applications that process multiple streams, such as video monitoring for multiple cameras and a security layer for a VoIP server. We developed an empirical performance model for processing a batch of tasks with arbitrary sizes on the GPU. Our benchmarks show that the GPU performance scales sub-linearly with the number of tasks; i.e., increasing the number of tasks decreases the average time per task, but it increases the total execution time. This performance model did not match any of the models previously used in the real-time scheduling literature.

Multiple CPUs and GPUs can be used to meet the processing needs of an RTSPS. In this setup, the stream partitioning problem is more complex – it is necessary to decide not only whether to assign each stream to a CPU or a GPU, but also to select the specific CPU or GPU to which it will be assigned. The additional partitioning options enable better fitting of the stream distribution to the capabilities of each compute unit. For example, it is possible to improve load balancing on the GPUs by assigning small, medium, and large tasks to different GPUs. Similarly, it is possible to sacrifice the performance of one GPU that would process all the streams with short deadlines, to allow the others to benefit from greater efficiency.
In Chapter 5, we provided four methods for work partitioning in multi-GPU systems that build upon the Rectangle method and present different compromises between algorithm complexity and the achievable throughput. For a system with n GPUs, the methods select n non-overlapping rectangles in the rate-deadline space that define a partition of the streams. The streams that


fall into each of them are assigned to the corresponding GPU, and the rest of the streams are assigned to the CPUs. As in the Rectangle method, these methods search for a partition that satisfies the computational requirements and time restrictions. To decrease the search complexity, the methods reduce the search space by applying different constraints on the way streams can be partitioned. Using the Virtual GPU method, our stream processing framework allows up to 87% more data to be processed per time unit than the Rectangle method by taking advantage of a second GPU, and up to a 106% increase in GPU-only throughput.

Another problem of scaling to multiple GPUs is that of real-time scheduling of data transfers. In systems with one CPU and one GPU, the data transfers take non-intersecting paths, and have marginal effect on each other. With multiple GPUs, however, the data transfers have intersecting paths on the system interconnect, and slow each other down. In Chapter 6, we analyzed the interconnect bandwidth in multi-GPU systems and identified the following trends:

1. The computational power of GPUs grows faster than the data transfer bandwidth.

2. As the number of GPUs in the system increases, the bandwidth per GPU decreases.

We therefore expect that limited bandwidth will continue to be a major problem in multi-GPU systems. Emerging inter-GPU communication technologies such as NVLink [Fol14] will likely offset the computation/communication ratio, but we expect the gap will continue to grow. Our experiments show that running additional data transfers concurrently can increase the aggregate data transfer rate, but decrease the transfer rate of each individual data transfer; this performance behavior is reminiscent of the sub-linear throughput scaling observed in GPUs. From a scheduling point of view, the execution time of a data transfer (e.g., CPU1 → GPU1) depends on whether it shares bandwidth with other data transfers (e.g., CPU1 → GPU2) during all or even some of its execution. This presents a new and very challenging scheduling problem.

In order to handle it, we presented a new method for scheduling data transfers in RTSPSs with multiple CUs. The method employs a two-level scheduling hierarchy. The top level schedules compound data transfer jobs, i.e., batch operations, to execute one at a time, while the bottom level schedules individual data transfers within a batch operation. This hierarchy achieves both effective scheduling and timing analysis at the task level, along with efficient utilization of the interconnect at the level of individual data transfer operations. Our scheduler achieved up to 7.9 times better results than the bandwidth allocation method (execution time) and up to 39% better results than the time division method (higher image resolution) in realistic applications.

In Chapter 7, we investigated sub-linear performance scaling in multicore CPUs. The operating frequency of modern CPUs is limited by power consumption and thermal considerations. Most multicore CPUs have a frequency scaling mechanism such as Intel's Turbo Boost that opportunistically increases the operating frequency of the CPU when some of the cores are idle and do not consume their power and thermal budget. For example, the frequency of a Core i7-4790T quad-core CPU when all of its cores are active is limited by 3.3GHz, but when three of its cores are idle it can reach frequencies of up to 3.9GHz. Again, executing multiple tasks (relatively) increases the execution time of each task and decreases the average time per


task. We presented a new offline scheduler that minimizes the total execution time of tasks on a multicore CPU with Turbo Boost. The scheduler adapts the work distribution to the Turbo frequency specification of the target CPU and achieves near-optimal performance. Using our characterization of core speeds in multicores with Turbo Boost, we extended Amdahl's law, which serves as a guideline to assess the speedup and energy improvement of parallel programs over sequential ones. Amdahl's law assumes that the performance of the processing units is fixed, which is not the case for the cores in a CPU with Turbo Boost. Our refined formulas for speedup and energy savings provide tight bounds, improving the estimation accuracy by up to 16%.

Our observations regarding performance scaling in modern computer resources encouraged us to develop a new abstract multiprocessor model to be used in theoretical scheduling problems, where processor speeds depend on the execution state of the multiprocessor. In Chapter 7, we extended Graham's α|β|γ classification of scheduling problems [GLLRK79] to a new class of problems where the computing platform is characterized by the new model. These new problems model many hardware resources more accurately and enable higher utilization.

Future Work

Task Scheduling

As computing platforms become more heterogeneous and complex, effective resource sharing becomes increasingly important. We have shown that the models traditionally used in task scheduling problems inadequately represent modern computing platforms and limit their effective utilization. As a result, even scheduling algorithms that are optimal under the traditional model may be inefficient on real hardware. Our analysis of modern hardware resources revealed performance dependencies between components that have traditionally been considered independent, and we presented a new model that represents these dependencies and enables higher utilization. Therefore, many of the scheduling problems need to be reconsidered with the new model. In Section 7.2 we defined a new class of important theoretical task scheduling problems that use the new model. Studying these problems is a promising new research direction. Our analysis of one such problem, makespan minimization in multicore CPUs with Turbo Boost, showed that many of the traditional techniques for scheduling and timing analysis are inadequate under the new model, and new approaches to these problems need to be developed.

Applications

In this dissertation, we considered a specific model for a real-time stream processing system (RTSPS) that was defined in Chapter 3. Some RTSPSs have different processing requirements, hardware, and mode of operation, which would require reexamination of the proposed methods and the underlying theory. These differences may include the following:

• Multi-stream processing operations. In our model, each stream is processed independently of other streams, while in some stream processing operations, such as intersect


and join, the processing operation is applied on multiple streams. Other characteristics of the processing requirements of a data stream may also differ.

• Packet arrival and processing requirements. We have used the sporadic model to describe the real-time data streams. A different model may be more appropriate in systems with aperiodic data arrival and soft, varying, or unknown deadlines.

• Hardware topology. We considered a computational platform with a distributed topology, while an integrated design may be more suitable for low-power and embedded systems.

Real-time Communication Scheduling

The traditional communication scheduling methods that are used in real-time systems are not efficient on many modern interconnects, because they are based on models that consider simple topologies and avoid bus contention issues. We presented a new model that allows the scheduling algorithm to achieve more efficient utilization of the bandwidth, and a new scheduler for real-time data transfers that uses it. While our scheduler significantly increased the throughput relative to previous methods, further analysis is required to develop optimal scheduling algorithms and communication protocols under the new model.

Making Use of New Features and Technologies

Recent technological advances in GPUs and interconnection technologies provide new opportunities to improve data stream scheduling:

• Modern GPUs can execute multiple kernels concurrently. Using this feature it is possible to execute smaller batches of work asynchronously on the same GPU. The challenges of this direction include adapting the performance model, partitioning the streams into smaller batches, and scheduling them (taking their mutual influence into consideration, just as we did with data transfers).

• Using persistent kernels, it would be interesting to consider a pull-based work distribution mode, where each GPU gets work from a pool of tasks shared by CPUs and GPUs.

• NVLink is expected to influence data transfer scheduling. First, the bandwidth bottlenecks in the interconnect are likely to shift. Second, it connects GPUs in different PCI Express tree branches, hence creating alternative paths for communication between endpoints. While our interconnect model effectively adapts to different topologies, the change may require modifications to the data transfer scheduling methods.

• Virtualization support in GPUs may enable finer resource management capabilities for the scheduler, which could be used to achieve higher utilization and more accurate execution time prediction.



Bibliography

[ABAKAFA13] Bashayer M. Al-Babtain, Fajer J. Al-Kanderi, Maha F. Al-Fahad, and Imtiaz Ahmad. A survey on Amdahl's law extension in multicore architectures. International Journal of New Computer Architectures and their Applications, 3(3):30–46, 2013.

[ABB08] Björn Andersson, Konstantinos Bletsas, and Sanjoy Baruah. Scheduling arbitrary-deadline sporadic task systems on multiprocessors. Proceedings of the Real-Time Systems Symposium, pages 385–394, 2008.

[ABKR12] Jason Aumiller, Scott Brandt, Shinpei Kato, and Nikolaus Rath. Supporting low-latency CPS using GPUs and direct I/O schemes. IEEE 18th International Conference on Embedded and Real-Time Computing Systems and Applications, pages 437–442, 2012.

[ACOTN10] Cedric Augonnet, Jerome Clet-Ortega, Samuel Thibault, and Raymond Namyst. Data-aware task scheduling on multi-accelerator based platforms. In International Conference on Parallel and Distributed Systems, pages 291–298, December 2010.

[AGS05] M. Annavaram, E. Grochowski, and J. Shen. Mitigating Amdahl's law through EPI throttling. 32nd International Symposium on Computer Architecture, 2005.

[AMDa] AMD Inc. AMD accelerated processing units (APUs). http://www.amd.com/en-us/innovations/software-technologies/apu.

[AMDb] AMD Inc. AMD FX-7600P product detail. http://products.amd.com/en-us/search/APU/AMD-FX-Series-Processors/AMD-FX-Series-Processors-for-Laptops/FX-7600P-with-

[Amd67] Gene M. Amdahl. Validity of the single-processor approach to achieving large scale computing capabilities. AFIPS Conference Proceedings, 30:483–485, 1967.

[Amd13] Gene M. Amdahl. Computer architecture and Amdahl’s law. Computer, 46(12):38–46, 2013.


[AR14] Björn Andersson and Gurulingesh Raravi. Real-time scheduling with resource sharing on heterogeneous multiprocessors. Real-Time Systems, 50(2):270–314, 2014.

[ARM] ARM Inc. big.LITTLE technology. http://www.arm.com/products/processors/technologies/biglittleprocessing.php.

[ATNW11] Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, 23:187–198, 2011.

[Bal10] Mehmet Balman. Data transfer scheduling with advance reservation and provisioning. PhD thesis, Louisiana State University, 2010.

[Bar06] Sanjoy K. Baruah. The non-preemptive scheduling of periodic tasks upon multiprocessors. Real-Time Systems, 32:9–20, 2006.

[BAS04] R. Budruk, D. Anderson, and T. Shanley. PCI Express System Architecture. Mindshare PC System Architecture. Addison-Wesley, 2004.

[BB08] Sanjoy Baruah and Theodore Baker. Schedulability analysis of global EDF. Real-Time Systems, 38(3):223–235, 2008.

[BB09] Theodore P. Baker and Sanjoy K. Baruah. An analysis of global EDF schedulability for arbitrary-deadline sporadic task systems. Real-Time Systems, 43(1):3–24, 2009.

[BBD+12] Sanjoy Baruah, Vincenzo Bonifaci, Gianlorenzo D'Angelo, Haohan Li, Alberto Marchetti-Spaccamela, Nicole Megow, and Leen Stougie. Scheduling real-time mixed-criticality jobs. IEEE Transactions on Computers, 61(8):1140–1152, 2012.

[BCPV96] Sanjoy K. Baruah, N. K. Cohen, C. Greg Plaxton, and Donald A. Varvel. Proportionate progress: A notion of fairness in resource allocation. Algorithmica, 15(6):600–625, 1996.

[BD13] Adam Betts and Alastair Donaldson. Estimating the WCET of GPU-accelerated applications using hybrid analysis. Proceedings of the Euromicro Conference on Real-Time Systems, pages 193–202, July 2013.

[BD15] Alan Burns and R. Davis. Mixed criticality systems: A review. Technical report, Dept. of Computer Science, University of York, 2015.

[Bit] Bittware Inc. Multi-FPGA system for Tera class high performance computing & network processing. http://www.bittware.com/products-services-systems-solutions/terabox-high-performance-reconfigurable-computing-platform.


[BK12] Can Basaran and Kyoung-Don Kang. Supporting preemptive task executions and memory copies in GPGPUs. Euromicro Conference on Real-Time Systems, pages 287–296, July 2012.

[BK15] Peter Brucker and Sigrid Knust. Complexity results for scheduling problems. http://www.informatik.uni-osnabrueck.de/knust/class/, September 2015.

[Bra11] Bjorn B. Brandenburg. Scheduling and locking in multiprocessor real-time operating systems. PhD thesis, University of North Carolina at Chapel Hill, 2011.

[Bru07] Peter Brucker. Scheduling Algorithms. Springer, 2007.

[BSBT14] Kostiantyn Berezovskyi, Luca Santinelli, Konstantinos Bletsas, and Eduardo Tovar. WCET measurement-based and extreme value theory characterisation of CUDA kernels. pages 279–288, 2014.

[Bur91] Alan Burns. Scheduling hard real-time systems: a review. Software Engi- neering Journal, 6:116–128, 1991.

[CCKF13] Che Wei Chang, Jian Jia Chen, Tei Wei Kuo, and Heiko Falk. Real-time partitioned scheduling on multi-core systems with local and global memories. Proceedings of the Asia and South Pacific Design Automation Conference, pages 467–472, 2013.

[CJ12] Jon Calhoun and Hai Jiang. Preemption of a CUDA kernel function. Proceed- ings of the 13th ACIS International Conference on Software Engineering, Ar- tificial Intelligence, Networking, and Parallel/Distributed Computing, pages 247–252, August 2012.

[CJA+09] James Charles, Preet Jassi, Narayan S Ananth, Abbas Sadat, and Alexandra Fedorova. Evaluation of the Intel Core i7 Turbo Boost feature. IEEE Inter- national Symposium on Workload Characterization, pages 188–197, 2009.

[CM08] Sanaveun Cho and Rami G. Melhem. Corollaries to Amdahl’s law for energy. IEEE Computer Architecture Letters, 7(1):25–28, 2008.

[CM10] Sangyeun Cho and Rami G. Melhem. On the interplay of parallelization, pro- gram performance, and energy consumption. IEEE Transactions on Parallel and Distributed Systems, 21(3):342–353, 2010.

[Cor15] Intel Corporation. Intel architecture instruction set extensions programming reference, 2015.

[CPW99] Bo Chen, Chris N. Potts, and Gerhard J. Woeginger. A review of machine scheduling: Complexity, algorithms and approximability. In Handbook of Combinatorial Optimization. Springer US, 1999.

173

Technion - Computer Science Department - Ph.D. Thesis PHD-2015-15 - 2015 [CT09] Daniel Cederman and Philippas Tsigas. On sorting and load balancing on GPUs. SIGARCH Computer Architecture News, 36:11–18, 2009.

[CVKG10] Long Chen, O. Villa, S. Krishnamoorthy, and G. R. Gao. Dynamic load bal- ancing on single- and multi-GPU systems. In IEEE International Symposium on Parallel and Distributed Processing, pages 1 –12, 2010.

[DAN+11] Dakshina Dasari, Björn Andersson, Vincent Nelis, Stefan M. Petters, Arvind Easwaran, and Jinkyu Lee. Response time analysis of COTS-based multicores considering the contention on the shared memory bus. IEEE 10th Interna- tional Conference on Trust, Security and Privacy in Computing and Commu- nications, pages 1068–1075, 2011.

[Der74] Michael L. Dertouzos. Control robotics: The procedural control of physical processes. IFIP Congress, pages 807–813, 1974.

[DL78] Sudarshan K. Dhall and C. L. Liu. On a real-time scheduling problem. Oper- ations Research, 26(1):127–140, 1978.

[DM01] Marco Di Natale and Antonio Meschi. Scheduling messages with earliest deadline techniques. Real-Time Systems, 20:255–285, 2001.

[EA12] Glenn A. Elliott and James H. Anderson. Globally scheduled real-time mul- tiprocessor systems with GPUs. Real-Time Systems, 48(1):34–74, 2012.

[Ell15] Glenn A. Elliott. Real-time scheduling for GPUs with applications in ad- vanced automotive systems. PhD thesis, University of North Carolina at Chapel Hill, 2015.

[ER10] Friedrich Eisenbrand and Thomas Rothvoß. EDF-schedulability of syn- chronous periodic task systems is coNP-hard. In SODA, pages 1029–1034, 2010.

[ERAL95] Hesham El-Rewini, Hesham H. Ali, and Ted Lewis. Task scheduling in mul- tiprocessing systems. Computer, 28(12):27 – 37, 1995.

[EWA13] Alenn A. Elliott, Bryan C. Ward, and James H. Anderson. GPUSync: A framework for real-time GPU management. The 34th IEEE Real-Time Sys- tems Symposium, pages 33–44, 2013.

[FAN+13] Yusuke Fujii, Takuya Azumi, Nobuhiko Nishio, Shinpei Kato, and Masato Edahiro. Data transfer matters for GPU computing. In 19th IEEE Interna- tional Conference on Parallel and Distributed Systems, 2013.

[FBB06] Nathan Fisher, Theodore P. Baker, and Sanjoy Baruah. Algorithms for deter- mining the demand-based load of a sporadic task system. In Proceedings of

174

Technion - Computer Science Department - Ph.D. Thesis PHD-2015-15 - 2015 the 12th IEEE International Conference on Embedded and Real-Time Com- puting Systems and Applications, 2006.

[Fol14] Denis Foley. NVLink, Pascal and Stacked Memory: Feeding the appetite for big data. https://devblogs.nvidia.com/parallelforall/nvlink-pascal-stacked- memory-feeding-appetite-big-data, 2014.

[Fun04] Shelby Hyatt Funk. EDF scheduling on heterogeneous multiprocessors. PhD thesis, University of North Carolina at Chapel Hill, 2004.

[Gaw08] Stanislaw Gawiejnowicz. Time-dependent scheduling. Monographs in Theo- retical Computer Science. An EATCS Series. Springer, 2008.

[GFB03] Joël Goossens, Shelby Funk, and Sanjoy K. Baruah. Priority-driven schedul- ing of periodic task systems on multiprocessors. Real-Time Systems, 25(2- 3):187–205, 2003.

[GJ90] Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, 1990.

[GKD87] Sushil K. Gupta, Anand S. Kunnathur, and Krishnan Dandapani. Optimal repayment policies for multiple loans. Omega, 15(4):323–330, 1987.

[GLLRK79] Ronald L. Graham, Eugene L. Lawler, Jan Karel Lenstra, and A. H. G. Rin- nooy Kan. Optimization and approximation in deterministic sequencing and scheduling: a survey. Annals of Discrete Mathematics, 5:287–326, 1979.

[GMR95] Laurent George, Paul Muhlethaler, and Nicolas Rivierre. Optimality and non- preemptive real-time scheduling revisited. Technical report, 1995.

[GO15] Ujjwal Gupta and Umit Y. Ogras. Constrained energy optimization in hetero- geneous platforms using generalized scaling models. Computer Architecture Letters, 14(1):21–25, 2015.

[GPU10] NVIDIA GPUDirect technology: Accelerating GPU-based systems. Techni- cal Report May, Mellanox Technologies, 2010.

[HBK+14] Wookhyun Han, Hwidong Bae, Hyosu Kim, Jiyoen Lee, and Insik Shin. GPU- SPARC: Accelerating parallelism in multi-GPU real-time systems. High Per- formance Computing on Graphics Processing Units, 2014.

[HLK09] Pi-Cheng Hsiu, Der-Nien Lee, and Tei-Wei Kuo. Multi-layer bus optimization for real-time task scheduling with chain-based precedence constraints. 30th IEEE Real-Time Systems Symposium, pages 479–488, 2009.

175

Technion - Computer Science Department - Ph.D. Thesis PHD-2015-15 - 2015 [HLW93] Kevin I-J. Ho, Joseph Y-T. Leung, and W-D. Wei. Complexity of scheduling tasks with time-dependent execution times. Information Processing Letters, 48(6):315–320, 1993.

[HM08] Mark D. Hill and Michael R. Marty. Amdahl’s law in the multicore era. Com- puter, 41:33–38, 2008.

[HM15] Lee Howes and Aaftab Munshi. The OpenCL specification (v2.0). Khronos OpenCL Working Group, 2015.

[Hyp10] HyperTransport Technology Consortium. HyperTransport I/O link specifica- tion revision 3.10c, 2010.

[Hyp15] HyperTransport Consortium. HyperTransport link specifications. http://www.hypertransport.org/default.cfm?page=HyperTransportSpecifications, 2015.

[HZ15] Yijie Huangfu and Wei Zhang. Real-time GPU computing: Cache or no cache? IEEE 18th International Symposium on Real-Time Distributed Com- puting, pages 182–189, 2015.

[Inta] Intel Corporation. Intel Turbo Boost technology 2.0. http://www.intel.com/support/processors/corei7/sb/CS-032279.htm. Ac- cessed: 15-Jan-2014.

[Intb] Intel Corporation. Intel Turbo Boost technology frequently asked questions. http://www.intel.com/support/processors/sb/CS-029908.htm. Accessed: 15- Jan-2014.

[Int08] Intel Corporation. Intel Turbo Boost technology in Intel Core microarchitec- ture (Nehalem) based processors. 2008.

[Int12] Intel Corporation. Intel Xeon Phi coprocessor. pages 1–70, 2012.

[Int13] Intel Corporation. Intel Xeon Phi product family: Product brief, 2013.

[Int14a] Intel Corporation. Intel Xeon Processor E5 v3 product family processor. 2014.

[Int14b] Intel Corporation. Mobile 4th Gen Intel Core processor: Datasheet vol. 1, 2014.

[Int15] Intel. Xeon processor E7-8895 v3 specifications. http://ark.intel.com/products/84689, 2015.

[Jaf13] Dave Jaffe. Three approaches to data analysis with Hadoop. Technical report, Dell Solution Centers, 2013.

176

Technion - Computer Science Department - Ph.D. Thesis PHD-2015-15 - 2015 [JAM88] Kevin Jeffay, Richard Anderson, and Charles U. Martel. On optimal, non- preemptive scheduling of periodic and sporadic tasks. Department of Com- puter Science, University of Washington, 1988.

[Jon15] Handel Jones. Whitepaper: Semiconductor industry from 2015 to 2025. http://www.semi.org/en/node/57416, 2015. Last visited 05-Oct-2015.

[Joo15] Joonyoung Kim and Younsu Kim. HBM: Memory solution for bandwidth- hungry processors. Hot Chips 26, 2015.

[JSM91] Kevin Jeffay, Donald F. Stanat, and Charles U. Martel. On non-preemptive scheduling of period and sporadic tasks. Real-Time Systems Symposium, pages 129–139, 1991.

[JZC+08] Mark Joselli, Marcelo Zamith, Esteban Clua, Anselmo Montenegro, Aura Conci, Regina Leal-Toledo, Luis Valente, Bruno Feijo, Marcos d’ Ornellas, and Cesar Pozzer. Automatic dynamic task distribution between CPU and GPU for real-time systems. 11th IEEE International Conference on Computer Science and Engineering, 0:48–55, 2008.

[JZC+09] Mark Joselli, Marcelo Zamith, Esteban Clua, Anselmo Montenegro, Regina Leal-Toledo, Aura Conci, Paulo Pagliosa, Luis Valente, and Bruno Feijó. An adaptative game loop architecture with automatic distribution of tasks be- tween CPU and GPU. Computers in Entertainment, 7, 2009.

[KAB13] Shinpei Kato, Jason Aumiller, and Scott Brandt. Zero-copy I/O processing for low-latency GPU computing. International Conference on Cyber-Physical Systems, pages 170–178, 2013.

[KANR13] Junsung Kim, Bjorn Andersson, Dionisio De Niz, and Ragunathan Raj Rajku- mar. Segment-fixed priority scheduling for self-suspending real-time tasks. IEEE 34th Real-Time Systems Symposium, pages 246–257, 2013.

[KDY10] Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. Modeling GPU- CPU workloads and systems. In GPGPU, pages 31–42, 2010.

[KH10] Chin-Fu Kuo and Ying-Chi Hai. Real-time task scheduling on heterogeneous two-processor systems. In Algorithms and Architectures for Parallel Process- ing. 2010.

[KHH+14] Sangman Kim, Seonggu Huh, Yige Hu, X. Zhang, E Witchel, Mark Silber- stein, and Amir Wated. GPUnet: Networking abstractions for GPU programs. usenix.org, 2014.

[KL11] Shinpei Kato and Karthik Lakshmanan. RGEM: A responsive GPGPU execu- tion model for runtime engines. Real-Time Systems Symposium, pages 57–66, 2011.

177

Technion - Computer Science Department - Ph.D. Thesis PHD-2015-15 - 2015 [KLRI11] Shinpei Kato, Karthik Lakshmanan, Ragunathan Raj Rajkumar, and Yutaka Ishikawa. TimeGraph: GPU scheduling for real-time multi-tasking environ- ments. USENIX Annual Technical Conference, page 17, 2011.

[KMMB12] Shinpei Kato, M. McThrow, Carlos Maltzahn, and Scott Brandt. Gdev: First- class GPU resource management in the operating system. USENIX ATC, 2012.

[KPSW93] Ashfaq A. Khokhar, Victor K. Prasanna, Mohammad E. Shaaban, and Cho-Li Wang. Heterogeneous computing: challenges and opportunities. Computer, 26(6):18–27, 1993.

[KRK13] Junsung Kim, Ragunathan Rajkumar, and Shinpei Kato. Towards adaptive GPU resource management for embedded real-time systems. ACM SIGBED Review, 10:14–17, 2013.

[LA14] Haeseung Lee and Mohammad Abdullah Al Faruque. GPU-EvR: Run-time event based real-time scheduling framework on GPGPU platform. IEEE/ACM Design Automation and Test in Europe, 2014.

[LK14] David Lo and Christos Kozyrakis. Dynamic managment of TurboMode in modern multi-core chips. IEEE 20th International Symposium on High Per- formance Computer Architecture (HPCA), pages 603–613, 2014.

[LL73] Chang L. Liu and James W. Layland. Scheduling algorithms for multipro- gramming in a hard-real-time environment. Journal of the ACM, (1):46–61, 1973.

[LL78] Eugene L. Lawler and J. Labetoulle. On preemptive scheduling of un- related parallel processors by linear programming. Journal of the ACM, 25(203):612–619, 1978.

[LnD10] Sebastian Moreno Londoño and José Pineda De Gyvez. Extending Amdahl’s law for energy-efficiency. International Conference on Energy Aware Com- puting, (1):1–4, 2010.

[LNOM08] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, pages 39–55, 2008.

[LS86] John P. Lehoczky and Lui Sha. Performance of real-time bus scheduling al- gorithms. ACM SIGMETRICS Performance Evaluation Review, 14:44–53, 1986.

[Mat11] Alexis Mather. Tech brief: AMD FirePro SDI-Link and AMD DirectGMA technology, 2011.

[Mau10] Wolfgang Mauerer. Professional Linux Kernel Architecture. John Wiley & Sons, 2010.

178

Technion - Computer Science Department - Ph.D. Thesis PHD-2015-15 - 2015 [McN59] Robert McNaughton. Scheduling with deadlines and loss functions. Manage- ment Science, 6:1–12, 1959.

[Mel14] Mellanox Technologies. ConnectX-4 VPI product brief, 2014.

[MH11] Mike Mantor and Mike Houston. AMD Graphics Core Next. AMD Fusion Developer Summit, 2011.

[Min15] William Min. CES 2015: Introducing the 5th- generation Intel Core ”Broadwell” processors. http://www.bhphotovideo.com/explora/computers/news/ces-2015- introducing-5th-generation-intel-core-broadwell-processors, 2015.

[MKH12] Richard Membarth, Mario Körner, and Frank Hannig. Evaluation and opti- mization of many-core approaches for a medical image processing applica- tion. Technical report, Software Platform Embedded Systems 2020, 2012.

[MV15] Sparsh Mittal and Jeffrey S. Vetter. A survey of CPU-GPU heterogeneous computing techniques. ACM Computing Surveys, 47(4):69:1–69:35, 2015.

[Nat01] National Institute of Standards and Technology (NIST). FIPS-197: Advanced encryption standard (AES), 2001.

[NVI14a] NVIDIA Corporation. NVIDIA GeForce GTX 980 whitepaper. pages 1–32, 2014.

[NVI14b] NVIDIA Corporation. NVIDIA Tegra K1 whitepaper. 2014.

[NVI15a] NVIDIA Corporation. CUDA C programming guide (v7.0). (March), 2015.

[NVI15b] NVIDIA Corporation. Geforce GTX Titan X specifications. http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan- x/specifications, 2015.

[NWB11] M. Niemeier, A. Wiese, and S. Baruah. Partitioned real-time scheduling on heterogeneous shared-memory multiprocessors. 23rd Euromicro Conference on Real-Time Systems, pages 115–124, 2011.

[OEMM08] Yasuhiko Ogata, Toshio Endo, Naoya Maruyama, and Satoshi Matsuoka. An efficient, model-based CPU-GPU heterogeneous FFT library. In IPDPS, pages 1–10, 2008.

[OKKY07] Satoshi Ohshima, Kenji Kise, Takahiro Katagiri, and Toshitsugu Yuba. Paral- lel processing of matrix multiplication in a CPU and GPU heterogeneous envi- ronment. In Proceedings of the 7th International Conference on High Perfor- mance Computing for Computational Science, VECPAR’06, pages 305–318, Berlin, Heidelberg, 2007. Springer-Verlag.

179

Technion - Computer Science Department - Ph.D. Thesis PHD-2015-15 - 2015 [Ope13] OpenACC Working Group. The OpenACC application programming inter- face, version 2.0a. 2013.

[PS10] PCI-SIG. PCI Express 3.0 base specification. 2010.

[PSTW97] Cynthia A. Phillips, Cliff Stein, Eric Torng, and Joel Wein. Optimal time- critical scheduling via resource augmentation. In Proceedings of the 29th annual ACM symposium on theory of computing, pages 140–149. ACM Press, 1997.

[RAEP07] Jakob Rosén, Alexandru Andrei, Petru Eles, and Zebo Peng. Bus access op- timization for predictable implementation of real-time applications on mul- tiprocessor systems-on-chip. Proceedings of the Real-Time Systems Sympo- sium, pages 49–60, 2007.

[Ram02] Srikanth Ramamurthy. Scheduling periodic hard real-time tasks with arbitrary deadlines on multiprocessors. In Proceedings of the 23rd IEEE Real-Time Systems Symposium. IEEE Computer Society, 2002.

[Rar14] Gurulingesh Raravi. Real-time scheduling on heterogeneous multiprocessors. PhD thesis, Universidade Do Porto, 2014.

[RCS+11] Christopher J. Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, and Emmett Witchel. PTask: operating system abstractions to manage GPUs as compute devices. Proceedings of the Twenty-Third ACM Symposium on Op- erating Systems Principles, 2011.

[RKL+14] N. Rath, S. Kato, J. P. Levesque, M. E. Mauel, G. A. Navratil, and Q. Peng. Fast, multi-channel real-time processing of signals with microsecond latency using graphics processing units. Review of Scientific Instruments, 85, 2014.

[RM00] Srikanth Rarnarnurthy and Mark Moir. Static-priority periodic scheduling on multiprocessors. Proceedings of the IEEE Real-Time Systems Symposium, 0:69, 2000.

[RNA+12] Efraim Rotem, Alon Naveh, Avinash Ananthakrishnan, Eliezer Weissmann, and Doron Rajwan. Power-management architecture of the Intel microarchi- tecture code-named Sandy Bridge. IEEE Micro, 32(2):20–27, 2012.

[Roo65] J. G. Root. Scheduling with deadlines and loss functions on k parallel ma- chines. Management Science, 11(3):460–475, 1965.

[SAÅ+04] Lui Sha, Tarek Abdelzaher, Karl-Erik Årzén, Anton Cervin, Theodore Baker, Alan Burns, Giorgio Buttazzo, Marco Caccamo, John Lehoczky, and Aloy- sius K. Mok. Real time scheduling theory: A historical perspective. Real-Time Systems, 28(2/3):101–155, 2004.

180

Technion - Computer Science Department - Ph.D. Thesis PHD-2015-15 - 2015 [SB02] Anand Srinivasan and Sanjoy K. Baruah. Deadline-based scheduling of periodic task systems on multiprocessors. Information Processing Letters, 84(2):93–98, 2002.

[SF62] Albert Schild and Irwin J. Fredman. Scheduling tasks with deadlines and non-linear loss functions. Management Science, 9(1):73–81, 1962.

[SFKW13] Mark Silberstein, Bryan Ford, Idit Keidar, and Emmett Witchel. GPUfs: In- tegrating a file system with GPUs. Proceedings of the 18th international con- ference on architectural support for programming languages and operating systems, 32:485–498, 2013.

[She13] Maxim Shevtsov. OpenCL: advantages of the heterogeneous approach. Tech- nical report, Intel Corporation, 2013.

[SS05] Oliver Sinnen and Leonel A. Sousa. Communication contention in task scheduling. IEEE Transactions on Parallel and Distributed Systems, 16(6):503–515, 2005.

[TGC+14] Ivan Tanasic, Isaac Gelado, Javier Cabezas, Alex Ramirez, Nacho Navarro, and Mateo Valero. Enabling preemptive multiprogramming on GPUs. Pro- ceedings of the International Symposium on Computer Architecture, pages 193–204, 2014.

[TPO10] Stanley Tzeng, Anjul Patney, and John D. Owens. Task management for irregular-parallel workloads on the GPU. In High Performance Graphics, pages 29–37, 2010.

[TRC11] Hsiang Kuo Tang, Parmesh Ramanathan, and Katherine Compton. Combining hard periodic and soft aperiodic real-time task scheduling on heterogeneous compute resources. Proceedings of the International Conference on Parallel Processing, pages 753–762, 2011.

[VSS11] Uri Verner, Assaf Schuster, and Mark Silberstein. Processing data streams with hard real-time constraints on heterogeneous systems. In Proceedings of the International Conference on Supercomputing, pages 120–129, 2011.

[VSSM12] Uri Verner, Assaf Schuster, Mark Silberstein, and Avi Mendelson. Scheduling processing of real-time data streams on heterogeneous multi-GPU systems. Proceedings of the 5th Annual International Systems and Storage Conference, pages 1–12, 2012.

[WL08] Dong Hyuk Woo and Hsien-Hsin S. Lee. Extending Amdahl’s law for energy- efficient computing in the many-core era. Computer, 41:24–31, 2008.

181

Technion - Computer Science Department - Ph.D. Thesis PHD-2015-15 - 2015 [WSYI14] Yosuke Wakisaka, Naoki Shibata, Keiichi Yasumoto, and Minoru Ito. Task scheduling algorithm for multicore processor systems with Turbo Boost and Hyper-Threading. Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 2014.

[XD10] Chao Xu and Ying Ding. An improved static-priority scheduling algorithm for multi-processor real-time systems. PhD thesis, Chalmers University of Technology, 2010.

[ZA05] Omar U. P. Zapata and Pedro M. Alvarez. EDF and RM multiprocessor scheduling algorithms: Survey and performance evaluation. Queue, pages 1–24, 2005.

[ZB09] Fengxiang Zhang and Alan Burns. Schedulability analysis for real- time systems with EDF scheduling. IEEE Transactions on Computers, 58(9):1250–1258, 2009.

[ZH14] Jianlong Zhong and Bingsheng He. Kernelet: High-throughput GPU kernel executions with dynamic slicing and scheduling. IEEE Transactions on Par- allel and Distributed Systems, 25(6):1522–1532, 2014.

[ZHH15] Kai Zhang, Jiayu Hu, and Bei Hua. A holistic approach to build real-time stream processing system with gpu. Journal of Parallel and Distributed Com- puting, 83:44–57, 2015.

[ZRL11] Ziming Zhong, Vladimir Rychkov, and Alexey Lastovetsky. Data partitioning on heterogeneous multicore platforms. IEEE International Conference on Cluster Computing, pages 580–584, 2011.

[ZRL12] Ziming Zhong, Vladimir Rychkov, and Alexey Lastovetsky. Data partitioning on heterogeneous multicore and multi-GPU systems using functional perfor- mance models of data-parallel applications. IEEE International Conference on Cluster Computing, pages 191–199, 2012.

182

Abstract

The amount of digital data generated every year is growing exponentially. This data holds a wealth of knowledge, but understanding and analyzing it requires processing. Because of its large volume and high arrival rate, the industry tends to treat such data as streaming: it must be processed as it arrives and cannot be stored in its entirety for later use. The input to systems that process such data comes from diverse sources, including social networks, search engines, financial markets, GPS traces, video and audio streams, and sensor arrays. Data arrival rates keep growing over time. This growth can be attributed, among other factors, to the increasing number of smartphones and tablets, which can transmit video, audio, and location coordinates in real time, and to the ever-growing interaction with users on the web, enabled by monitoring nearly every action a user takes on a website. Beyond the need to analyze data as it arrives, dictated by limits on storage capacity and write rates, some applications also impose constraints on processing time. These include interactive applications (e.g., bounding the latency of voice and video calls), critical real-time applications (medical devices, production control, traffic-light control, and so on), applications that must meet timing requirements as a condition of a license or legal agreement, and applications that profit from fast response (e.g., in stock exchanges). Building systems that can process data streams under heavy load and timing constraints is a complex task. Such systems are called real-time stream processing systems (RTSPS).

To deliver the required throughput under heavy load, an RTSPS needs an advanced computing platform with substantial compute power and the ability to provide short response times. Since communication between machines is relatively slow, such systems tend to use powerful multiprocessor machines; however, the compute power of CPUs alone is sometimes insufficient. GPU accelerators can provide a considerable performance boost for stream processing, but scheduling them under timing constraints is complicated by several factors: operating mechanisms that are not exposed to the programmer, a programming model based on running a large number of tasks concurrently, a complex dependence of performance on the number of tasks, and contention for the shared interconnect when transferring the processor's input and output. By combining CPUs and GPUs, it is possible both to achieve high processing rates and to meet strict and diverse response-time requirements for many data streams. Systems that combine processors of different types are called heterogeneous systems. Such systems must address load balancing, task scheduling, scheduling of inter-processor communication, and more. In addition, to meet the timing constraints given by the user for each individual data stream, a thorough analysis of processing times is required to verify at design time that the system satisfies all of the constraints.

The goal of this work is to develop an approach for efficient real-time processing of data streams on heterogeneous computing platforms composed of CPUs and GPUs. Each data stream is a long-lived source of data packets that arrive sporadically, with a given minimal time between consecutive arrivals. Each packet must be processed within a given time from its arrival, identical for all packets in the stream, and the timing constraints are hard. Processing of a packet cannot begin before processing of its predecessor in the same stream has completed. Packet arrival rates and timing constraints are given separately for each stream.
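As an illustration of this task model (not code from the thesis; the names and structure below are hypothetical), the per-packet constraints can be written down directly: a hard absolute deadline derived from the arrival time, and a start time bounded by the completion of the previous packet in the same stream.

    from dataclasses import dataclass

    @dataclass
    class Stream:
        min_interarrival: float   # minimal time between consecutive packet arrivals
        relative_deadline: float  # each packet must finish within this time of arrival

    def absolute_deadline(stream, arrival_time):
        # Hard constraint: processing must complete by arrival + relative deadline.
        return arrival_time + stream.relative_deadline

    def earliest_start(arrival_time, predecessor_finish):
        # A packet may not start before its predecessor in the same stream finishes.
        return max(arrival_time, predecessor_finish)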

To this end, we developed a collection of models and advanced methods for partitioning and scheduling computation and communication resources in RTSPSs, and implemented a management infrastructure for real-time stream processing on such systems, which we used for our experiments and evaluation.

Contribution of This Work

We developed a framework for processing a large number of data streams with hard real-time constraints on a system containing CPUs and GPUs. At the core of the framework is an algorithm that finds a distribution of the data streams among the processors such that the system meets all timing constraints. The timing constraints on each processor are validated using an accurate performance model that accounts for all stages of computation and for the communication between the processors.
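For illustration only (the function names and the performance model below are assumptions, not the thesis's implementation, and the Stream sketch above is reused), the acceptance test at the core of such a framework has this shape: a candidate distribution is kept only if, on every processor, the model-predicted worst-case response time of each assigned stream stays within its deadline.

    def distribution_is_feasible(distribution, response_time_model):
        # distribution: dict mapping each processor to the list of streams assigned to it.
        # response_time_model(processor, co_runners, stream): predicted worst-case
        # response time of `stream` when the streams in `co_runners` share `processor`.
        for processor, streams in distribution.items():
            for stream in streams:
                wcrt = response_time_model(processor, streams, stream)
                if wcrt > stream.relative_deadline:
                    return False  # this distribution would miss a hard deadline
        return True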

The GPU is designed to handle large tasks composed of many subtasks executed concurrently. The dependence of a task's computation time on the number and size of its subtasks is complex, which makes distributing the data streams between the processors difficult. In this work we developed a fast heuristic algorithm that schedules thousands of data streams on CPU-GPU systems. The algorithm maps the data streams onto a two-dimensional space in which one axis is the data arrival rate and the other is the maximal processing time. It then searches for a partition of this space defined by a rectangle, such that the streams inside the rectangle are assigned to the GPU and the rest to the CPU. In this way, streams that burden the GPU and degrade its performance are assigned to the CPU. We extended the algorithm to multi-GPU systems and proposed four methods that offer different trade-offs between running time and efficiency. In our experiments, our methods increased system throughput by up to 50% while meeting all imposed timing constraints.
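The sketch below illustrates the rectangle-partition idea under simplifying assumptions (candidate bounds taken from the streams' own coordinates, one particular orientation of the rectangle, and hypothetical feasibility tests); it shows the shape of the search rather than the thesis's algorithm.

    def rectangle_partition(streams, gpu_feasible, cpu_feasible):
        # Each stream is placed at the point (arrival rate, relative deadline);
        # assumes each stream also carries its arrival rate as `s.rate`.
        rates = sorted({s.rate for s in streams})
        deadlines = sorted({s.relative_deadline for s in streams})
        for rate_bound in rates:
            for deadline_bound in deadlines:
                # One possible rectangle: streams inside it are assigned to the GPU.
                gpu_set = [s for s in streams
                           if s.rate <= rate_bound and s.relative_deadline >= deadline_bound]
                gpu_ids = {id(s) for s in gpu_set}
                cpu_set = [s for s in streams if id(s) not in gpu_ids]
                if gpu_feasible(gpu_set) and cpu_feasible(cpu_set):
                    return gpu_set, cpu_set   # first feasible rectangle found
        return None                           # no feasible rectangle partition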

The CPUs and GPUs are connected by an intra-system interconnect. In multi-GPU systems, when data is transferred to several GPUs at once, the bandwidth may be shared among the transfers, which affects their completion times. In a real-time system, processing must be guaranteed to finish on time, so data transfer times must be accounted for as well. The interconnect architecture is fairly intricate, so the bandwidth allotted to a transfer can change several times during its execution, which complicates scheduling; at the same time, transferring data over different paths in parallel is essential for utilizing the system's resources efficiently. Existing real-time scheduling methods do not account for these effects and typically rely on simpler performance models. Our experiments show that limited bandwidth is a serious problem in CPU-GPU systems, and in real-time systems in particular. To achieve both efficient utilization of the interconnect and accurate prediction of transfer times, we introduced a new concept: batch data-transfer operations. Each batch operation consists of several source-to-destination data transfers; during its execution the transfers proceed concurrently, but the interconnect is guaranteed to be free for the batch operation alone. To schedule data transfers under real-time constraints, we developed a two-level scheduler: the upper level composes batch operations from individual transfers and schedules them with an EDF (earliest deadline first) algorithm, while the lower level executes each batch efficiently. To predict the execution times of batch operations, we built an accurate performance model of the interconnect based on a topology graph and benchmark measurements. In experiments with two practical applications, our algorithm for executing batch operations (the lower level) achieved execution times shorter by a factor of 7.9 and supported 93% higher image resolution than existing methods, and the scheduler as a whole improved system throughput by up to 47% compared with the alternatives.
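A minimal sketch of the upper level follows, under deliberate simplifications: all batches are assumed to be released together, the composition of the batches is taken as given, and predict_batch_time stands in for the interconnect performance model. Only the EDF ordering and the deadline check are the point here; the names are assumptions, not the thesis's code.

    def edf_batch_schedule(batches, predict_batch_time, start=0.0):
        # batches: list of (deadline, transfers) pairs; `transfers` is one batch
        # of source-to-destination copies executed concurrently on the interconnect.
        schedule, now = [], start
        for deadline, batch in sorted(batches, key=lambda b: b[0]):  # earliest deadline first
            finish = now + predict_batch_time(batch)   # model-predicted batch duration
            schedule.append((now, finish, batch, finish <= deadline))
            now = finish
        return schedule  # (start, finish, batch, deadline_met) per batch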

In task scheduling, the common way to model a parallel resource such as a multicore CPU is as a set of independent processors, each executing one task at a time. Under this model, and assuming identical processor speeds, system throughput should grow linearly with the number of active processors. Our performance analysis of the GPU and of the interconnect shows, however, that their throughput is sublinear, and our analysis of multicore CPUs revealed that their throughput is sublinear as well, due to built-in frequency-scaling mechanisms. Such mechanisms, Turbo Boost among them, raise the processor's frequency (and hence its speed) while the processor's power and thermal budget is not exhausted, depending on the number of active cores. Since these frequency changes affect task computation times, they must be taken into account in processing-time analysis. We characterized the processor's speed while accounting for these frequency changes and defined a task scheduling problem whose goal is to minimize the overall completion time. The problem has a simple solution under the conventional performance model, but under the new model it turns out to be harder to solve. We developed an efficient scheduling algorithm for it and proved its optimality for three of the problem's four subtypes; in our experiments the algorithm showed a significant improvement over the alternatives. Using the processor-speed characterization, we extended Amdahl's law to speedup and energy improvement for multicore processors with Turbo Boost. The formulas we derived compute a tight upper bound on the improvement obtained by parallelizing a serial program on a multicore processor; in our experiments they were 61% more accurate than the original Amdahl's law.
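The exact bounds are derived in the body of the thesis; as a rough, assumed illustration of how a core-count-dependent frequency enters such a law, let $f(k)$ be the normalized frequency when $k$ cores are active and $s$ the serial fraction of the program. One simple Turbo-aware variant of the speedup bound is then

\[
S(n) \;=\; \frac{T(1)}{T(n)} \;=\; \frac{1/f(1)}{\dfrac{s}{f(1)} \;+\; \dfrac{1-s}{n\, f(n)}},
\]

which reduces to the classical Amdahl bound $S(n) = 1/\bigl(s + (1-s)/n\bigr)$ when $f(k) \equiv 1$; since typically $f(n) < f(1)$ for $n > 1$, this bound is tighter than the classical one.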

Our observations about the performance of modern multicore processors motivated us to develop a new abstract model of a multiprocessor system for use in scheduling problems. In the new model, the performance of each processor depends on the occupancy of the other processors. We extended Graham's standard classification of scheduling problems with a new class of problems that use the proposed model. Such problems model many hardware resources more accurately and make it possible to utilize them better.
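For context, Graham's classification writes a scheduling problem as three fields, machine environment | job characteristics | objective; two standard instances are

\[
P \,\|\, C_{\max} \quad\text{(identical parallel machines, minimize makespan)}, \qquad
1 \,|\, r_j \,|\, L_{\max} \quad\text{(single machine, release times, minimize maximum lateness)}.
\]

The precise machine-environment field added in this work for performance-dependent processors is defined in the thesis body.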

