Analysis and Modeling of the Timing Behavior of GPU Architectures

Master Thesis Petros Voudouris

Electronic Systems group Faculty of Electrical Engineering Eindhoven University of Technology

Supervisors: Henk Corporaal Gert-Jan van den Braak

November 2014


Abstract

Graphics processing units (GPUs) offer massive parallelism. For several years now, GPUs have also been usable for more general purpose applications: a wide variety of applications can be accelerated efficiently with the CUDA and OpenCL programming models.

Real-time systems frequently use many sensors that produce large amounts of data. GPUs can be used to process this data and offload work from the central processing unit (CPU). To use GPUs in real-time systems, time predictable behavior is required. However, it is hard to give an estimate of the worst case execution time (WCET) of a GPU program, since only few timing details of the GPU architecture are published. In addition, several arbiters are used inside GPUs, and their scheduling details are not always clearly described.

The Nvidia Fermi architecture is analyzed in order to identify the sources of time variation and unpredictability. Initially, all the main components of the architecture are discussed and investigated for sources of time variation and unpredictability.

From the analyzed components of the architecture we chose to continue with the warp scheduler, the on-chip scratchpad memory and the off-chip global memory. The warp scheduler determines the schedule of the warps and as a result significantly influences the total execution time. The scratchpad memory is widely used in GPU applications to hide the memory latency. Finally, the global memory instructions were analyzed since they are used by virtually all applications. Micro-benchmarks implemented in assembly were used to quantify the sources of variation. We execute the same experiments multiple times in order to determine the variation of the execution time among the experiments.

A timing model for the warp scheduler, based on the results from the experiments, was introduced. The model was tested on the Convolution Separable benchmark from the CUDA SDK. The model was accurate for the average case (96.5% of the experiments), with 2% and 3% error for the Rows and Columns kernels respectively. For the best case the model had -15% and -54% error, and for the worst case 28% and 1% error, for the Rows and Columns kernels (3.5% of the experiments). In addition, it was shown that the model scales up to 6 warps without any modification.

Furthermore, the variable-rate warp scheduler, which improves the time predictability of the GPU by assigning the warps to fixed slots, was implemented in GPGPU-Sim. Initial results show that assigning a higher scheduling rate to one warp can also improve performance by taking advantage of inter-warp locality.

Finally, based on the findings of this report, the GPU can be used in real-time systems with soft deadlines. By applying the suggested changes to the architecture, its time predictability can be improved further, so that applications with stricter timing constraints can be supported.


Acknowledgments

I would like to express my sincere gratitude to my supervisors Professor Henk Corporaal and Gert-Jan van den Braak, who have supported me throughout this thesis with great patience and immense knowledge. Without their help the realization of this project would not be possible. Their suggestions and encouragement helped me to overcome difficulties throughout this project and especially during writing this report.

Furthermore, I would like to thank all the members of the PARsE group for their valuable feedback.

Last but not least, I would like to thank my family and my friends for their continuous support throughout my studies.


Table of Contents

1. Introduction
1.1. Motivation
1.2. Problem statement
1.3. Organization of the report
2. Background
2.1. GPU architecture
2.2. CUDA programming model
2.3. GPGPU-Sim
3. Related work
3.1. Time predictability
3.2. Fermi GPU timing analysis
4. Time predictability for GPUs
4.1. Introduction to time predictability
4.2. Analysis of hardware components for timing variation
4.3. Conclusion
5. Quantification of time variation in Fermi GPU architecture
5.1. Experimental Methodology
5.2. Warp scheduler
5.3. On-chip scratchpad memory
5.4. Off-chip - Global memory
5.5. Conclusion
6. Timing model
6.1. Instruction scheduling timing model for a single warp
6.2. Formulas of the timing model
6.3. Scalability of the model
6.4. Conclusion
7. Benchmark: Convolution Separable
7.1. Application of the model
7.2. Comparison
7.3. Effect of the memory latency on the models
7.4. Analysis of the benchmark for time variation
7.5. Model scalability in terms of warps
7.6. Conclusion
8. Variable rate warp scheduling
8.1. Motivation for the variable rate warp scheduler
8.2. Implementation and Results
8.3. Conclusion
9. Conclusions and future work
9.1. Summary
9.2. Future work
9.3. Conclusion
References
Appendix A: Detailed experiments
A.1. Arithmetic instruction
A.2. Shared memory
A.3. Global memory
Appendix B: Detailed GPGPU-Sim diagram
Appendix C: Program trace from GPGPU-Sim
Appendix D: Acronyms and Abbreviations


1. Introduction

Initially, the motivation to use graphics processing units for general purpose computing in real-time systems is presented. Next, the problems of using GPUs in real-time systems are discussed. Finally, the outline of the report is given.

1.1. Motivation

Computer systems are becoming more powerful while at the same time they become more complex. The performance improvement is essential to serve the requirements of modern systems. However, state-of-the-art systems are so complex that it is not possible to analyze them in detail. In real-time systems the time predictability of a system is equally important as the correctness of the result. So, in the context of real-time systems, time predictability is essential.

Real-time systems typically appear in cyber-physical systems (Figure 1), where sensors and actuators are present. The control part of the system processes the data from the sensors and gives the appropriate commands to the actuators. Cameras can be used as sensors for visual control of the system and all the sensors produce independent streams of data. Therefore, an architecture which can exploit data level parallelization can process the data produced from the sensors efficiently.


Figure 1 A real-time system with GPU that exploits the data level parallelism derived from multiple sensors.

For a system with many sensors, data level parallelism can be exploited by the graphics processing unit (GPU). It can process the data and offload work from the central processing unit (CPU). GPUs provide massive parallelism and can be used in general purpose programs to improve the performance of compute-intensive parts of the program. The Nvidia Fermi architecture provides mechanisms that improve the programmability of the GPU and that are useful for general purpose GPU programming (GPGPU). Typically, GPUs hide memory latency by executing many threads concurrently. Fermi is the first Nvidia GPU that also has data caches in order to hide memory latency. In addition, Fermi is a highly parallel architecture and shows sources of unpredictability that are related to parallel execution and shared resources.


1.2. Problem statement

Multiple executions of the same program with the same input on the same GPU can take different execution times. The sources of variation have to be identified and quantified in order to determine the range of the variation.

To use the GPU in a cyber-physical system with real-time characteristics, it is required to have time predictable behavior that provides guarantees for the worst case execution time. Therefore, it is required to determine the variation in execution times. If the variation is bounded then the system has time predictable behavior. Otherwise, we cannot provide any guarantees for the worst case execution time and the architecture is time unpredictable.

Typically, timing analysis tools specialized for the worst case execution time are used to identify the timing behavior of an architecture. These tools are not available for GPUs, so a different approach has to be followed. In this report, a timing analysis of the Fermi GPU architecture is performed to identify the sources of time unpredictability. We focus on latency analysis, since throughput analysis for Fermi has been performed extensively in the literature. Microbenchmarks implemented in GPU assembly are used to identify and quantify the sources of time variation. To identify the variation of the execution time, the experiments are performed multiple times. Based on the experiments a timing model is built and tested. Finally, a scheduling algorithm is proposed that improves the time predictability of the GPU.

1.3. Organization of the report

The rest of the report is organized as follows. First, in section 2 background information on the Fermi GPU architecture and the GPGPU-Sim cycle level simulator is given. Next, in section 3 the related work on time predictability is presented. In addition, related studies on performance analysis of GPUs are discussed. Furthermore, section 4 introduces the concept of time predictability of hardware architectures and discusses the sources of timing variation in GPUs. Section 5 presents the experiments for the quantification of the sources of time variation for the warp scheduler, the scratchpad memory and the global memory. In section 6 the timing model for the warp scheduling of the instructions is illustrated, and the scalability of the model in terms of number of warps is discussed. In section 7, the model is tested with the Convolution Separable benchmark from the CUDA SDK and is compared with a state-of-the-art performance model. Finally, section 8 describes the variable rate warp scheduler, and section 9 concludes the report and presents the future work.


2. Background

Initially, the GPU architecture is described in section 2.1. Next, in section 2.2 background information is given on the software model for the GPU architecture. Finally, the GPGPU-Sim cycle level simulator is presented in section 2.3.

2.1. GPU architecture

For this assignment, we use the Nvidia Fermi GPU architecture. Fermi is not the latest architecture of Nvidia: the timeline in Figure 2 shows that the Kepler and Maxwell architectures were introduced in 2012 and 2014 respectively, while Fermi was introduced in 2009. The Fermi architecture was chosen because no assembler and simulator exist for Kepler and Maxwell. For Fermi, the asfermi assembler and the GPGPU-Sim simulator are available.

Figure 2 Nvidia timeline of the different architectures: Tesla (2006), Fermi (2009), Kepler (2012), Maxwell (2014).

The GPU communicates with the CPU over the peripheral component interconnect (PCI) bus. The CPU sends the data through PCI to the GPU (Figure 3). The GPU processes the data and sends back the results. The Fermi architecture has 16 SMs that are connected to each other with an interconnection network. In addition there is a level 2 cache memory which is shared between the SMs.

Figure 3 CPU and GPU communication.

Figure 4 Overview of the Fermi GPU architecture.

In the Fermi architecture, each SM (Figure 5) has two warp schedulers and two instruction dispatch units, so two warps can be executed concurrently by exploiting thread level parallelism (TLP). In addition, there is a register file of 128 KB, so the different threads can have their own registers. Since every thread has its own registers, the cost of context switching is eliminated. There are 2 sets of 16 cores, 16 load/store (LD/ST) units and 4 special function units (SFU) that execute the instructions.


Figure 5 Block diagram of the multithreaded SIMD of a Fermi GPU (SM) (Taken from [1]).

Furthermore, every SM has 64 KB of on-chip memory that can be configured as 48 KB of shared memory and 16 KB of L1 cache, or the other way around (Figure 6). The Fermi architecture also has a level 2 cache memory which is shared between the SMs. The L2 cache memory is unified for instructions and data. The memory hierarchy follows the traditional speed hierarchy but inverts the size hierarchy (Figure 7): considering all 16 SMs of the Fermi architecture, the lowest level (the register files) is bigger than the level 1 and level 2 cache memories.

Figure 6 The two shared memory configurations: 16 KB L1 data cache with 48 KB scratchpad, or 48 KB L1 data cache with 16 KB scratchpad.

Figure 7 Memory hierarchy of the Fermi GPU architecture, using the second configuration of Figure 6: register files 16 * 128 KB = 2048 KB (SRAM), L1 16 * 48 KB = 768 KB, L2 768 KB.
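For completeness, the sketch below shows how this split can be requested from the CUDA runtime with cudaFuncSetCacheConfig; the kernel name is a placeholder, and the call only expresses a preference that the driver may override.

// Minimal sketch (assumed kernel name "myKernel"): request the 48 KB shared
// memory / 16 KB L1 configuration of Figure 6 for this kernel.
#include <cuda_runtime.h>

__global__ void myKernel(float *data)
{
    // kernel body omitted
}

int main()
{
    // Prefer the large scratchpad: 48 KB shared memory and 16 KB L1 per SM.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    // cudaFuncCachePreferL1 would request the 16 KB shared / 48 KB L1 split instead.
    myKernel<<<1, 32>>>(0);
    cudaDeviceSynchronize();
    return 0;
}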


2.2. CUDA programming model

Nowadays, GPUs are not used only for graphics; they are also used to speed up the compute-intensive parts of general purpose programs. The massive parallelism of the GPUs can be exploited through programming models like the compute unified device architecture (CUDA) and OpenCL to accelerate programs. The programmer can choose which parts of the algorithm are best suited for mapping onto the GPU in order to take advantage of the highly parallel hardware that GPUs offer. CUDA and OpenCL provide abstractions to the programmer for the scratchpad memory, thread IDs and synchronization.

The central processing unit (CPU) and the GPU have separate memories. Data has to be transferred explicitly from one device to the other. The memory transfer hampers the speed-up if the computation on the GPU is small.

An example of executing a program on a GPU is presented in Figure 8. Initially, the normal execution of the program on the CPU is presented in (A). It consists of a part that can be parallelized and two sequential parts, at the beginning and at the end. The parallel part of the program can be executed on the GPU, as is shown in (B). First, it is required to transfer the data from the CPU (host) to the GPU (device) through direct memory access (DMA) (host to device, or H2D). After the parallel execution of the program on the GPU, the data have to be transferred back again to the CPU (device to host, or D2H). The potential speed-up depends on the cost of transferring the data between the CPU and the GPU and on the level of parallelism that exists in the program.

Figure 8 Example of CPU execution (A) and GPU acceleration (B). Legend: A: execution on the CPU; B: acceleration on the GPU; H2D: DMA from host (CPU) to device (GPU); D2H: DMA from device to host.

The Nvidia Fermi GPU is a scalable in-order architecture and consists of a set of multithreaded streaming multiprocessors (SMs). A multiprocessor employs an architecture called SIMT (Single-Instruction, Multiple-Thread). The threads are independent elements and can be executed in parallel. In the SIMT architecture the multiprocessor manages, schedules, and executes threads in groups of 32, called warps. The warps are grouped in thread blocks. A thread block is mapped to an SM, so all the warps of a thread block are executed on one SM (Figure 9). The hierarchy of threads provides flexibility to the programmer and as a result improves the programmability of the GPU.



Figure 9 Hierarchy of the threads, warps and thread blocks in the Fermi GPU architecture and the mapping of the thread blocks to different SMs

To clarify how the GPU is programmed, an example from the Hennessy and Patterson book [38] is used. Initially, the code in C is presented; it calculates double-precision aX plus Y (DAXPY). The C code is executed sequentially, so the iterations of the loop are executed one after the other.

0. // Invoke DAXPY (double-precision aX plus Y)
1. daxpy(n, 2.0, x, y);
2.
3. // DAXPY in C
4. void daxpy(int n, double a, double *x, double *y)
5. {
6.   for (int i = 0; i < n; ++i)
7.     y[i] = a*x[i] + y[i];
8. }

The CUDA code that has the same functionality as the C code is presented below. Note that this example does not show the transfers of the data to and from the GPU. To distinguish the code for the CPU (host) from the code for the GPU (device), the “__host__” and “__device__” keywords are used respectively. For the functions executed on the GPU, the number of thread blocks and threads has to be specified. The syntax to call a function on the GPU is extended to “FunctionName<<<dimGrid, dimBlock>>>(parameters)”, where dimGrid is the number of blocks and dimBlock is the number of threads per block.

CUDA provides identifiers for the thread blocks (blockIdx.x) and the threads (threadIdx.x). In addition, the size of the thread block is given by “blockDim.x”. By combining the information of these 3 identifiers (line 11) the threads can be distinguished. For example, for two thread blocks and a thread block size of 4, the variable “i” has the values presented in Table 1. By using “i” as an index into the arrays, all threads can process a different data element in parallel.

Table 1 Example of index calculations.

Ti   0  1  2  3  0  1  2  3
TBi  0  0  0  0  1  1  1  1
i    0  1  2  3  4  5  6  7


1. // Invoke DAXPY with 256 threads per Thread Block
2. __host__ //CPU
3. int nblocks = (n+255) / 256;
4.
5. daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
6.
7. // DAXPY in CUDA
8. __device__ //GPU
9. void daxpy(int n, double a, double *x, double *y)
10. {
11.   int i = blockIdx.x * blockDim.x + threadIdx.x;
12.   if (i < n) y[i] = a*x[i] + y[i];
13. }
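The listing above deliberately omits the H2D and D2H transfers of Figure 8. A minimal host-side sketch of those transfers for this DAXPY example is given below; the function and buffer names are illustrative, error checking is left out, and in a complete program the kernel would be declared with the “__global__” qualifier.

// Hypothetical host code around the daxpy kernel: allocate device buffers,
// copy x and y to the GPU (H2D), launch the kernel, and copy y back (D2H).
void run_daxpy(int n, double *x, double *y)
{
    size_t bytes = n * sizeof(double);
    double *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);   // H2D
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);   // H2D

    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);

    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);   // D2H
    cudaFree(d_x);
    cudaFree(d_y);
}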

2.3. GPGPU-Sim

GPGPU-Sim is a cycle level simulator designed to mimic the behavior of the Tesla and Fermi GPU architectures [3] [4]. The user can choose between these two architectures and select a range of parameters, such as the warp scheduler, the size of the caches, etc. The implementation of the simulator is based on documentation from Nvidia, patents and in some cases assumptions, since Nvidia does not provide all the details of its architectures. A high level architecture of GPGPU-Sim is presented in Figure 10. The architecture is split in two parts: the SIMT front-end (green) and the SIMD datapath (yellow). A more detailed view of this architecture is given in appendix B.

Initially, in the SIMT front-end the instructions are fetched from the instruction cache. The instructions are decoded and stored in the instruction buffer. Next, at the scoreboard the instructions are checked for read after write (RaW) and write after write (WaW) dependencies, and the scheduler decides which warp is going to be issued to the SIMD datapath. Furthermore, the SIMT stack is used to handle branch divergence, by storing the divergence and convergence points of each warp.

Figure 10 Architectural high level block diagram of GPGPU-Sim. (Taken from [4])

The SIMD datapath consists of the operand collector, two sets of 16 cores or streaming processors (SP) in the context of GPGPU-Sim, 4 special function units, and the shared memory (scratchpad and level 1 data cache). The operand collector is a set of buffers and arbitration logic used to provide the appearance of a multi-ported register file using multiple banks of single-ported RAM.


3. Related work

In this section the related work on time predictability is presented. Furthermore, the related work on the timing analysis of the Fermi GPU architecture is shown. For the timing analysis there are two types of related work: performance analysis for the WCET (worst case execution time) and performance analysis for the average case.

3.1. Time predictability

The problems that can occur in the WCET analysis of a system are described in [7]. More specifically, it describes how to decompose the problem into subtasks and it provides techniques that are used to define the upper bound of the execution time. It is shown that a given WCET analysis tool may not consider all these subtasks or may solve the same subtask in different ways. In addition, it is shown that the processor behavior analysis depends significantly on the complexity of the processor architecture.

Furthermore, threats to predictability and methods that can improve time predictability are described in [9]. The sources of unpredictability are categorized into architecture level, software level, task level and distributed operation level. In addition, sources of unpredictability that may occur through the interaction of these different levels are presented. For every category, the features which can be threats to time predictability are described.

The access time to off-chip memory is another source of time variability because it can vary from access to access. It depends on the dynamic memory traffic pattern and on the arrival times of the memory requests. The memory controller is responsible for serving the memory requests. The memory controller can be static or dynamic. Static memory controllers serve the memory requests with a static schedule. Dynamic memory controllers adapt to the different memory traffic patterns. Static memory controllers are time predictable but not flexible to memory traffic changes, and the opposite holds for dynamic ones. A memory controller with both static and dynamic characteristics that provides minimum throughput and maximum latency guarantees is proposed in [33].

Architectures with time predictable behavior are presented in [14] [15]. The main idea of these architectures is that they are composed of predictable subsystems; by composing predictable subsystems, a predictable system is obtained. In addition, fine-grained multi-threading is used in order to preserve the isolation of the execution. A time division multiple access (TDMA) scheme is used for the memory accesses. Also, the SPARC instruction set architecture (ISA) is extended with timing instructions that provide lower and upper bounds for the execution time [16]. Finally, time predictable architectures are accompanied by a set of WCET tools and a compiler that is specific to these architectures, in order to interpret the timing instructions appropriately (i.e. time blocks [17]).

On the contrary, Alan Burns et al. argue that it is not necessary to have time predictable subsystems in order to have a time predictable system [18]. They propose to use hardware components with independent random behavior (i.e. a random replacement policy in caches). In this way the system can be analyzed more easily through probabilistic methods.


3.2. Fermi GPU timing analysis

To analyze a processor at a low level, it is required to have accurate documentation that describes all the design choices of the architecture. The Nvidia Fermi whitepaper does not provide such details about the architecture [1]. More details regarding the control flow, cache and TLB (translation lookaside buffer) hierarchy of the Tesla architecture, obtained through detailed microbenchmarking, are given in reference [2]. The results can be used as an indicator for the Fermi architecture. Furthermore, a cycle accurate simulator that mimics the behavior of the Fermi architecture is implemented in [3] [4]. A throughput analysis made with microbenchmarks implemented at assembly level is given in [5]. This paper also describes how a register file bank conflict may be produced for the Kepler architecture [6]. First we describe the available literature on WCET analysis for GPUs. Afterwards, we discuss some studies on average case performance analysis.

A hybrid analysis for GPU kernels is presented in [24]. The analysis was built as a tool on top of GPGPU-Sim. The hybrid analysis inserts time stamps at particular program points, called instrumentation points. The testing method was based on [25], and traces were produced for the kernel. The traces are a sequence of tuples of the shape (ij, tj), where ij is an instrumentation point identifier and tj is the time of execution. The traces are processed to calculate the WCET. This work does not analyze the hardware architecture; it is more general, so it does not take into account any of the specific features of the GPU architecture.

Next, a static timing analysis for GPU kernels, based on the abstract CTA (cooperative thread array) simulation method that the authors introduce, is presented in [26]. The work focuses on one thread block only. Static divergence analysis was performed, and the mini-SIMT machine based on the research work in [27] was used. However, no experimental verification of this method was presented.

Finally, an analytical model for the execution time of GPU programs is proposed in the work of Hong and Kim [28]. The components of the model estimate the memory warp parallelism (the number of parallel memory requests from different warps) and the computation warp parallelism (the number of warps that can be executed during a memory access). The model is general and does not make any assumptions about the architecture. The results have been compared over multiple architectures, with 5.4% and 13.3% error for microbenchmarks and GPU computing applications respectively.


4. Time predictability for GPUs

An introduction to time predictability is given in this section. The different interpretations of time predictability in the literature and in this work are presented. Next, the hardware components of the Fermi GPU architecture are discussed with respect to time variable behavior.

4.1. Introduction to time predictability

Real-time systems are computing systems that have to react within precise timing constraints. Thus, the correctness of a real-time system depends not only on the value of the computation but also on the time at which the results are produced [8]. In the context of real-time systems a worst case execution time (WCET) analysis is performed in order to guarantee a stable and analyzable behavior of the system, while modern architectures are optimized for the common case (average case). To optimize the common case, complicated techniques are used, like speculative methods (branch prediction, prefetching etc.) and out-of-order execution. Even though these methods improve the performance of the system significantly, they are very difficult to analyze with WCET analysis methods.

To clarify this, an example based on the examples given in [15] is presented in Figure 11. In this example the best case execution time (BCET), average case execution time (ACET) and WCET are considered to be measured execution times; the observed execution time is measured through experiments for a specific data input on a specific processor. The lower and upper bounds are the theoretical calculations of the best and worst case execution times. In order to provide a system that always respects the timing restrictions, the upper bound should be used as the worst case guarantee.

Processor A is a modern processor which is optimized for the common case. It uses mechanisms that improve the average case, like caches. Caches are difficult to analyze because they are hardware controlled and the prediction of a miss requires detailed knowledge of the current state of the cache. Processor A has an average case execution time (ACET) which is close to the best case execution time (BCET). Since it uses mechanisms that are difficult to analyze, the difference between the worst case execution time and the upper bound is large: more pessimistic choices have to be made, which leads to a more conservative and less accurate model.

Processor B in Figure 11 is a simple architecture that does not use mechanisms which are difficult to analyze. Its average case is higher compared to processor A, since it does not improve the common case through mechanisms that are difficult to analyze (i.e. branch prediction, prefetching). The average case is more balanced between the BCET and WCET compared to processor A. Furthermore, since the architecture is simpler, it can be analyzed more precisely, so the difference between the WCET and the upper bound is smaller compared to processor A. From this example it is observed that, even though processor A has a smaller WCET compared to processor B, the upper bound of processor A is higher than that of processor B. So although processor B is slower on average, it can provide better worst case guarantees.



Figure 11 Time distribution for two types of processors. Processor A is a modern processor optimized for the average case. Processor B is a simple processor that can be analyzed more precisely.

A time predictable architecture should have a limited number of states. In this way it can be analyzed for any possible scenario. On the contrary, WCET analysis for a complex architecture, like processor A, might be infeasible: all the possible states of the system have to be analyzed, and for a complex system this leads to state space explosion and makes the analysis infeasible [29].

To decompose the problem, the timing barrier proposed in [10] can be used. If it is possible to bring the system into a known state periodically by using the timing barrier, then the state space is divided into smaller analyzable pieces.

For instance, a way to implement such a mechanism on a GPU would be to periodically transfer the parts of the code and data that are needed for a specific amount of time to the scratchpad memory. The scratchpad memory is software controlled and has time predictable access times. So, the system would work on code and data that are in a predictable memory, while in the background the next data can be transferred from the global memory to the scratchpad memory.
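A hedged sketch of this idea for the data part is shown below: the block works on a tile that already resides in the predictable scratchpad (shared) memory while the load of the next tile is issued early, and the __syncthreads() barrier is the point where the state is known again. The kernel, the tile size and the processing step are assumptions for illustration only.

// Illustrative double-buffered tiling: compute on the current tile in shared
// memory while the load of the next tile from global memory is in flight.
// Assumes one thread block of TILE threads and n a multiple of TILE.
#define TILE 256

__global__ void processTiles(const float *in, float *out, int n)
{
    __shared__ float buf[2][TILE];
    int tid = threadIdx.x;

    buf[0][tid] = in[tid];            // preload the first tile
    __syncthreads();                  // known state: tile 0 is in the scratchpad

    int cur = 0;
    for (int base = TILE; base < n; base += TILE) {
        int nxt = 1 - cur;
        buf[nxt][tid] = in[base + tid];                 // issue the next tile load early
        out[base - TILE + tid] = 2.0f * buf[cur][tid];  // work on the current tile
        __syncthreads();              // barrier: the next tile is now resident
        cur = nxt;
    }
    out[n - TILE + tid] = 2.0f * buf[cur][tid];         // last tile
}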

Currently there is no established formalization of time predictability in the literature. Formalizations of time predictability are compared and a new formalization is introduced in [10]. Another formalization of time predictability is proposed in [11], which differentiates the sources of unpredictability into software and hardware related ones. Furthermore, [12] introduces the deviation of cycles per instruction (CPI) as a metric for time predictability. All of these formalizations are general and do not make any assumption regarding the type of the architecture, so they can also be used for GPUs. In addition, [13] reports that time predictability depends on the hardware, the compiler and the WCET tools, but no formalization is given.


Based on the observations so far, we distinguish the following cases for the different types of execution:

- Constant timing behavior: A program with the same input executed on the same architecture always takes the same execution time. In other words the WCET and BCET are the same and the execution time can be precisely statically predicted (WCET = BCET).

- Variable, tightly bounded timing behavior: A program with the same input executed on the same architecture can have a deterministic number of different execution times. The execution time for all the different cases can be precisely statically calculated.

- Infeasibility of the timing analysis: The complexity of the architecture is so high that the analysis is not feasible. In order to provide precise timing guarantees it is required to investigate all the possible states of the system. For a complex system this leads to state space explosion, so the analysis is not feasible.

- Limited timing information of the architecture: In this case we do not have the required timing information of the architecture to provide a precise estimation of the timing behavior.

4.2. Analysis of hardware components for timing variation

The sources of time variation in the Fermi GPU architecture are presented in this section. The main hardware components of the architecture are presented and analyzed for sources of time variation.

Warp scheduler and scoreboard

The Fermi architecture has two warp schedulers (odd and even) per SM and one instruction dispatch unit per scheduler. The scheduler chooses one warp of 32 threads based on the scheduling algorithm and issues it to one of the sets of 16 SPs, to the 16 load/store units or to the 4 special function units. The warps are executed in a two-cycle lock-step fashion: initially, the first 16 threads of the warp are executed, and in the next clock cycle the other 16 threads are executed on the same SPs.

Figure 12 shows an example of the two-cycle lock-step execution of two SP instructions. In the first clock cycle (CC1) the first 16 threads of the two warps are executed on the two SIMD lanes. In the second cycle (CC2) the next 16 threads are executed on the same SIMD lanes. A similar procedure is followed for the load/store units. So every two clock cycles, two warps from different schedulers can be executed. However, since there are only 4 SFUs, 8 steps are required in this special case.



Figure 12 Example of two-lock-step execution for two warps.

The warp schedulers of the Fermi architecture use the loose round robin (LRR) algorithm: when a warp stalls, another warp is scheduled. When a load from the off-chip memory is executed, the kernel continues with the next instructions until the instruction that has a read after write (RaW) dependency with the load instruction. If there are enough independent instructions to hide the memory latency, the warp keeps being scheduled in round robin order. If the data from the load are not ready, the warp stalls and another warp is scheduled. The memory latency can be hidden if there are enough warps that can be executed. The choices of the warp scheduler can change the scheduling order and, as a result, the overall execution time.
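To illustrate what the scheduler can exploit, the sketch below places independent arithmetic between a global load and its first use, so a warp can keep issuing instructions until it reaches the RaW dependency on the loaded value. The kernel is illustrative, and whether the compiler preserves exactly this instruction order is not guaranteed.

// Illustrative only: independent instructions between a load and its first use.
__global__ void hideLatency(const float *g, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = g[i];              // long-latency global memory load
    float a = (float)i * 0.5f;   // independent work: no RaW dependency on v,
    float b = a + 3.0f;          // so the warp is not stalled yet
    out[i] = v * b;              // first use of v: the warp may stall here
}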

To sum up, the warp scheduler checks the scoreboard for data dependencies. Next, it checks the available execution units in order to choose an instruction that can be scheduled. Furthermore, it checks the addresses of the instructions in order to preserve the order of the instructions, since the instructions of a warp have to be executed in order. Finally, based on the scheduling algorithm, it chooses which warp to schedule (Figure 13).


Figure 13 Overview of the warp scheduler.


The scoreboard is a buffer shared between the two warp schedulers. In the scoreboard there can be multiple entries for every warp. An example of how the scoreboard is updated, based on the scoreboard organization given in [38], is given below. If the scoreboard updates its information out of order, then the update order can differ from execution to execution when the update requests arrive in a different order. If the update requests arrive in a different order from execution to execution, then the warps become ready in a different order from execution to execution and, consequently, they are scheduled in a different order from execution to execution.

Let us assume that we have 3 warps per scheduler, a scoreboard with 6 positions, and the following instruction sequence per warp:

0. LD.E R0, [100]
1. ADD R3, R2, R1
2. MAD R7, R3, R0, R4

Initially, the first 2 instructions of the 3 warps are in the scoreboard. Both instructions can be executed since there is no data dependency. The instructions are executed in order, so the load from the global memory (LD.E) is going to be executed first. We assume that the warps are chosen with the LRR algorithm, so W0 is going to be executed first, W2 after 2 cycles and W4 after 4 cycles.

Warp  Address  Instruction  Operands
0     0        LD.E         - - 1
2     0        LD.E         - - 1
4     0        LD.E         - - 1
0     1        ADD          - 1 1
2     1        ADD          - 1 1
4     1        ADD          - 1 1

Since LD.E has been executed, it is removed from the scoreboard and the multiply-add (MAD) instruction is inserted. MAD has a RaW dependency with LD.E and ADD, so its source operands are not ready (0). The ADD instruction can be executed since it does not have a dependency with LD.E.

Warp  Address  Instruction  Operands
0     0        ADD          - 1 1
2     0        ADD          - 1 1
4     0        ADD          - 1 1
0     1        MAD          0 0 1
2     1        MAD          0 0 1
4     1        MAD          0 0 1

Now that the ADD instruction has finished, the register “R3” is ready, so the field in the scoreboard for this source register is updated to 1. We assume that LD.E has a large latency, so the source operand that has the RaW dependency with LD.E is not ready yet. Therefore, the MAD instruction cannot be executed yet.

Warp  Address  Instruction  Operands
0     1        MAD          0 1 1
2     1        MAD          0 1 1
4     1        MAD          0 1 1

Now we assume that the global memory responds with the data and the register “R0” is ready. The MAD instruction can now be executed.

Warp  Address  Instruction  Operands
0     1        MAD          1 1 1
2     1        MAD          1 1 1
4     1        MAD          1 1 1


Operand Collector

The register file of the Fermi GPU architecture is huge compared to state-of-the-art CPUs. It has 32K 32-bit registers per SM, which is 128 KB; since there are 16 SMs, in total 2 MB are used for the register files. To save energy and area, the appearance of a multi-ported register file is implemented as multiple banks of single-ported RAM. An arbitration mechanism is used to minimize the register file bank conflicts. The arbiter queues the warps and selects up to 4 register file requests that do not produce bank conflicts. The algorithm of the arbitration mechanism is not provided by Nvidia. The arbitration mechanism can influence the timing behavior of the GPU, since it can determine the execution order of the warps. If the choices of the arbiter are not deterministic, then the order of the warps can change from execution to execution.

Figure 14 Operand collector taken from [4].

On-chip scratchpad memory

In the Fermi architecture, scratchpad memories are available. Scratchpad memories are software controlled and have time predictable behavior compared to caches, while they have similar performance. The programmer is responsible for allocating the data to the correct memory in order to obtain the desired performance, while for caches the memory management is performed automatically by the hardware.

The scratchpad memory has 32 banks that can be used in parallel by the 32 threads of a warp. When all the threads access a different bank, the access time is deterministic and is always the same for all the threads in a warp. A bank conflict occurs when multiple threads of a warp access different addresses that belong to the same bank. The execution time varies depending on the number of bank conflicts, but it is still deterministic.

The address patterns determine the number of bank conflicts. A modulo 32 operation on the address of the shared memory is performed in order to find which bank is going to be accessed. For example, Table 2, Table 3 and Table 4 show the bank conflicts for strides 4, 16 and 32 for the first 8 threads of a warp. The tables are organized as follows: the thread ID of the threads in the warp is presented in the first column, the memory address is shown in the second column and the last column shows the bank number. For stride 4 there is a bank conflict between thread 0 and thread 8: both threads of the same warp access bank 0. Similarly, for stride 16 the even threads access bank 0 and the odd threads access bank 16, so in total there are 16 bank conflicts for a warp of 32 threads. Finally, for stride 32 all threads access bank 0, so there are 32 bank conflicts.

Table 2 Bank conflicts for stride 4

Thread ID  Address  Bank
0          0        0
1          4        4
2          8        8
3          12       12
4          16       16
5          20       20
6          24       24
7          28       28
8          32       0

Table 3 Bank conflicts for stride 16

Thread ID  Address  Bank
0          0        0
1          16       16
2          32       0
3          48       16
4          64       0
5          80       16
6          96       0
7          112      16
8          128      0

Table 4 Bank conflicts for stride 32

Thread ID  Address  Bank
0          0        0
1          32       0
2          64       0
3          96       0
4          128      0
5          160      0
6          192      0
7          224      0
8          256      0
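The stride-32 pattern of Table 4 can be written as the illustrative kernel below; the bank of a word address idx is idx % 32, and the commented-out padded layout is one deterministic way to remove the conflicts. The kernel name and sizes are assumptions.

// Illustrative stride-32 shared memory access: every thread of the warp maps
// to bank (tid * 32) % 32 = 0, giving a 32-way bank conflict as in Table 4.
__global__ void strideAccess(float *out)
{
    __shared__ float s[32 * 32];          // unpadded layout: stride-32 conflicts
    // __shared__ float s[32 * 33];       // padded layout: indexing with tid * 33
    //                                    // spreads the accesses over different banks

    int tid = threadIdx.x;                // 0..31, one warp
    s[tid * 32] = (float)tid;             // word address tid*32 -> bank (tid*32) % 32
    __syncthreads();
    out[tid] = s[tid * 32];
}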

If the number of bank conflicts can be determined at compile time, then also the execution time can be determined. Techniques with deterministic behavior [30] that minimize the bank conflicts improve the execution time without increasing the complexity of the system.

The analysis to determine the number of bank conflicts increases the complexity of the whole analysis of the GPU. There are cases where the scratchpad memory addresses cannot be determined at compile time, for example when the index into the shared memory is given as an input to the kernel. In this case it is required to assume the worst case and count all the unknown addresses as bank conflicts. This approach leads to a pessimistic estimate of the worst case execution time.

The shared memory can also be used for atomic operations. For an atomic operation, first it is required to lock a memory location in the shared memory. Next the operation is performed, and finally the result is written back and the memory location is unlocked. If multiple threads try to perform an atomic operation on the same location, they have to be serialized. A detailed model for atomic operations [30] can be used to precisely define the execution time of a kernel with atomic operations.
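A small illustrative kernel for such an atomic operation on the scratchpad is shown below: all threads of a block atomically add into the same shared memory location, so the hardware serializes the updates exactly as described above. The kernel is an assumption for illustration, not code from the thesis.

// Illustrative shared memory atomic: all threads of the block update the same
// location, so the lock/update/unlock sequences are serialized by the hardware.
__global__ void sharedAtomicSum(const int *in, int *out)
{
    __shared__ int sum;
    if (threadIdx.x == 0) sum = 0;
    __syncthreads();

    atomicAdd(&sum, in[blockIdx.x * blockDim.x + threadIdx.x]);
    __syncthreads();

    if (threadIdx.x == 0) out[blockIdx.x] = sum;
}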

Caches

Caches are one of the biggest sources of time variation for CPUs. Some of the works that analyze the problems of time predictability with caches are presented in [7, 9, 10, 15, 20, 25]. Currently, GPUs hide the memory latency by executing many threads concurrently; the performance of GPUs does not depend on caches as much as that of CPUs. Typically the scratchpad memory and the high level of concurrent execution of threads are used to hide the latency of the global memory, so it is the programmer's responsibility to allocate the data in the fast memory. The L2 cache is used for communication, since it belongs to the global memory and is common for all threads. The replacement policy that is used for the caches in the Fermi GPU architecture is not described in [1]. An LRU replacement policy is assumed for the Tesla architecture [2]. It has been shown that LRU is the replacement policy that is most suitable for predictability [20], while in [19] it is proposed that locally deterministic replacement policies can favor time predictability.


The Fermi architecture has two levels of instruction and data caches. There are separate level 1 instruction and data caches. Having the instructions and data in different memories improves the time predictability of the system, since there is no interference between instructions and data. On the contrary, the level 2 cache is unified for instructions and data. The analysis of unified caches is more complicated, since we cannot separate the analysis of the instructions from that of the data. In addition, a constant cache is used to keep data whose values do not change. The separate memory for constants favors time predictability, since data that cannot change can be placed there and the memory accesses can have deterministic access times. Since a dedicated memory is used for this type of data, it is expected to have a smaller miss rate compared to a typical data cache.

GPUs are becoming more and more general purpose, so caches are expected to play a more important role in the future. Scheduling algorithms that take advantage of inter-warp and intra-warp locality have been proposed to use the cache efficiently, for example [22]. To extend our analysis in order to take the caches into account, a separate model for the caches, like [37], is required. By using a cache model we can calculate the hit rate and the miss rate and, as a result, give more accurate estimations for the best case and the worst case execution times.

SIMT stack - Control flow mechanism

The control flow in GPU architectures is different compared to a CPU. Predicate registers and stacks are used in order to keep track of the divergence and convergence points of a thread block. The instructions in an IF-THEN-ELSE statement are executed by the SIMD processor, but only some of the SIMD lanes are enabled for the THEN instructions and some lanes for the ELSE instructions.

In other words, the lanes with the predicate set to true execute the instructions and store the result, while the other SIMD lanes do not perform an operation or store a result. For the ELSE statement, the instructions use the complement of the predicate (relative to the THEN statement), so the SIMD lanes that were idle now perform the operation and store the result while their formerly active siblings do not. At the end of the ELSE statement, the instructions are unpredicated so the original computation can proceed. Thus, for equal length paths, an IF-THEN-ELSE operates at only 50% efficiency. A similar procedure is followed for nested IF statements.
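The illustrative kernel below shows such an IF-THEN-ELSE inside a warp: the even lanes execute the THEN path while the odd lanes are masked off, and then the roles are swapped for the ELSE path, which gives the 50% efficiency mentioned above for equal-length paths.

// Illustrative divergent branch: within a warp, even lanes take the THEN path
// and odd lanes take the ELSE path, so each path executes at half occupancy.
__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i & 1) == 0) {
        data[i] = data[i] * 2.0f;   // executed with the odd lanes masked off
    } else {
        data[i] = data[i] + 1.0f;   // executed with the even lanes masked off
    }
}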

This control flow scheme is advantageous for the time predictability of the system, since it does not use any speculative method like branch prediction, and the execution time of a branch can easily be predicted. In addition, the use of the predicate registers simplifies the control flow analysis and as a result the timing analysis of the GPU.

Off-chip global memory and interconnect

The global memory is off-chip GDRAM (graphics dynamic random-access memory), a special case of DRAM optimized for high bandwidth. It is common to all the SMs. To calculate the latency of the global memory, it is required to know the state of the memory, the contention on the interconnect between the SMs and the interference between the accesses from the different SMs. Therefore, the access time to the global memory can vary from execution to execution.


As mentioned before, the GPU has an off-chip memory that is separate from the CPU's memory. It is shared between all the SMs. The memory accesses can be coalesced, which means that accesses to consecutive memory locations can be combined and performed as one big memory access; the size of a coalesced memory access is 128 bytes. In GPGPU-Sim, the memory controller is configured as an out-of-order (OoO) First-Ready First-Come First-Serve controller. We have quantified the variation of the execution time for a subset of these sources of variation. More details are given in Section 5.4.
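The two illustrative kernels below contrast the access patterns: in the first, the 32 threads of a warp read 32 consecutive 4-byte words, which fit in a single 128-byte coalesced access, while in the second each thread is 32 words apart, so the accesses of one warp spread over many 128-byte segments. Names and the stride value are assumptions.

// Coalesced: consecutive threads read consecutive words, so one warp's
// 32 * 4 bytes = 128 bytes can be served by a single memory transaction.
__global__ void coalescedCopy(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: consecutive threads are 32 words apart, so the warp's accesses fall
// in different 128-byte segments and cannot be combined into one transaction.
__global__ void stridedCopy(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * 32];
}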

4.3. Conclusion

In this section we have given an introduction to time predictability. We categorized the different cases of time predictability based on the timing behavior of programs that execute on the same architecture with the same inputs. Next, the main hardware components of the Fermi GPU architecture were presented and discussed with respect to time variation.

We chose to analyze the warp scheduler since it determines the starting times of the instructions and influences the total execution time. In addition, we chose to also analyze the scratchpad memory and the global memory instructions, since they are frequently used in GPU applications.

Caches can influence the variation in the execution time significantly, but we did not treat them in this report. Caches have high complexity and require a lot of time to analyze in detail. The L1 caches have been analyzed in [37]; that cache model can be integrated with this work in the future. Furthermore, the L2 cache also needs to be analyzed. The L2 cache is even more complex to analyze than the L1 caches, since it is shared between the SMs and unified for instructions and data. To avoid the complexity of the L2 cache, we would design it to have random behavior; with random behavior it can be analyzed with probabilistic methods [18].


5. Quantification of time variation in Fermi GPU architecture

In this section we focus on the warp scheduler, the shared memory and the global memory, and we quantify the time variation on the real hardware. The results of the experiments are compared with the GPGPU-Sim cycle level simulator.

5.1. Experimental Methodology

To analyze the behavior of the GPU hardware architecture at a low level, it is required to have detailed and accurate information. In some cases, the documentation is not available or not precise. In order to describe the worst case scenario of the system, it is required to be able to define all the possible states that the system can be in; therefore the limited documentation makes the analysis of the WCET very difficult.

Since for the Fermi architecture the timing details are not available, we have to analyze its behavior through experiments. The goal of the experiments in this chapter is to identify and quantify the sources of unpredictability in a GPU. The experiments have been performed through microbenchmarking. The microbenchmarks are implemented in assembly language with the use of the "asfermi" assembler [36].

The code below presents the measurement method of the execution time for the microbenchmarks. A clock counter is incremented at half of the SM clock frequency and is stored in a 64-bit special register. From this register we read the 32 least significant bits with "SR_ClockLo" (line 2) and we store this value in the general purpose register "R0". The value of "R0" is shifted left by one (multiplied by 2) and stored in "R4" to get the current value of the clock (line 3).

The same procedure is followed after the execution of the code (lines 7 and 8). Finally, the values of registers "R5" and "R4" are subtracted to find the execution time (line 10). We have measured that the calculation of the current value of the clock followed by a non-dependent operation takes 30 cycles.

1. ...
2. S2R R0, SR_ClockLo;
3. SHL.W R4, R0, 0x1;
4.
5. // code to be measured
6.
7. S2R R1, SR_ClockLo;
8. SHL.W R5, R1, 0x1;
9.
10. IADD R6, R5, -R4;
11. ...

To verify the behavior of the hardware we performed the same experiments in GPGPU-Sim. The direct usage of assembly language in GPGPU-Sim is not supported, so the microbenchmarks were implemented in the parallel thread execution (PTX) instruction set. PTX is an intermediate instruction set that is compatible with multiple GPU generations. The PTX code is optimized and translated to the target-architecture instructions. For our purpose the compiler optimizations do not affect the analysis, due to the small number of instructions. Thus, the code in "asfermi" for the real hardware and the PTX code for GPGPU-Sim are comparable.

The execution time was measured through inline assembly. The code is similar to the "asfermi" code and is presented below. Initially the registers are declared. Similarly to the "asfermi" code, we read the clock value before and after the code that we want to measure (time1 and time2). The values are subtracted and the result is stored in variable "%0". The variable "%0" is reserved for the first output of the inline code; in this case "%0" is mapped to the variable "time_res".

12. asm (
13. "{\n\t"
14. ".reg .u32 time1;\n\t"
15. ".reg .u32 time2;\n\t"
16. //Register declaration
17.
18. "mov.u32 time1, %%clock;\n\t"
19. "shl.b32 time1, time1, 1;\n\t"
20.
21. //code to be measured
22.
23. "mov.u32 time2, %%clock;\n\t"
24. "shl.b32 time2, time2, 1;\n\t"
25.
26. "sub.u32 %0, time2, time1;\n\t"
27. "}"
28. : "=r"(time_res)
29. );
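For context, a hedged sketch of how such an inline measurement is embedded in a kernel and returned to the host is given below; the kernel name and output buffer are assumptions, and the measured code is left as a placeholder.

// Illustrative wrapper: each thread measures its own elapsed (doubled) clock
// value and writes the result to global memory for the host to collect.
__global__ void timeKernel(unsigned int *results)
{
    unsigned int time_res;
    asm volatile(
        "{\n\t"
        ".reg .u32 t1;\n\t"
        ".reg .u32 t2;\n\t"
        "mov.u32 t1, %%clock;\n\t"
        "shl.b32 t1, t1, 1;\n\t"
        // code to be measured would be inserted here
        "mov.u32 t2, %%clock;\n\t"
        "shl.b32 t2, t2, 1;\n\t"
        "sub.u32 %0, t2, t1;\n\t"
        "}"
        : "=r"(time_res));
    results[blockIdx.x * blockDim.x + threadIdx.x] = time_res;
}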

The tool chain to automate the procedure of the experiments is presented in Figure 15. Initially, the code is compiled and the executable is produced. With the use of the "cuobjdump" tool, the assembly code (.asm) is extracted from the executable. The code is modified into a form into which we can insert the "asfermi" code: a Python script produces and injects the assembly code that we want to measure. Using the "asfermi" assembler we produce the CUDA binary (.cubin), and the code is compiled with "NVCC" to produce the executable.

A Python script runs the executable and collects the timing measurements. The results of the timing measurements for all the threads are collected, and another Python script formats the timing measurements in such a way that it is easy to plot them.

The procedure is automated and can be configured for the real hardware or the GPGPU-Sim. The user has to specify the code for measurement by using the functions implemented in Python. Finally, the thread block size, number of thread blocks and the number of repetitions of the experiments can be tuned to perform different types of experiments.



Figure 15 Tool chain to perform and automate the experiments.

5.2. Warp scheduler

First, we use a thread block size of 32 threads, so there is only one warp. Two microbenchmarks, each with two "ADD" instructions, were used. The first microbenchmark has two "ADD" instructions with no data dependency. The second microbenchmark uses two "ADD" instructions with a RaW dependency. The results are presented in Figure 16. The vertical axis shows the execution time in clock cycles and the horizontal axis shows the number of instructions. The dark blue line is the execution of the instructions without a data dependency and the light blue line is the execution with a RaW dependency. From this graph we can see that in both cases the execution time increases linearly with the number of instructions. The execution with the RaW dependency has a higher slope, since the warp has to wait until the result of the instruction with the dependency is ready.

From the CUDA programming manual [34] we know that a new warp can be scheduled every two cycles. Figure 16 shows that, in the case that we have only one warp, a new instruction is scheduled after 6 cycles when there is no dependency. From this we conclude that the hardware can only schedule instructions from the same warp every 6 cycles; other experiments show that multiple warps can be scheduled every 2 cycles. Furthermore, Figure 16 shows that when there is a RaW dependency the latency increases to 18 clock cycles. From this we can conclude that the additional cost of a RaW dependency is 18 - 6 = 12 clock cycles.



Figure 16 The execution time of two micro-benchmarks with two add instructions. In the first micro-benchmark the instructions are independent, while in the second one they have a RaW dependency.

Instructions in a warp are executed in order. Since the scheduler has to respect the order of the instructions, there is only one possible schedule. On the contrary, for more than one warp per scheduler the situation is different. Warps are independent of each other, so they are scheduled dynamically without the need for any specific order. The scheduler dynamically chooses which warp to schedule based on the availability of the warps. Different scheduling decisions lead to different execution times. This behavior influences the time predictability of the system.


Figure 17 mad-add instructions with RaW dependency for different thread block sizes. Experiments on hardware

Figure 17 shows the execution time of two instructions for five different thread block sizes. For this microbenchmark, we use a multiply-add instruction ("MAD") and an "ADD" instruction that have a read after write (RaW) dependency. The horizontal axis presents the warp ID of the warps. The vertical axis shows the execution time in clock cycles. We use thread block sizes of 32, 64, 128, 256 and 512 threads.


First, it can be seen that the execution time for thread block sizes 32, 64 and 128 is the same for each warp. A warp can be scheduled every 2 cycles, but an instruction from the same warp can be scheduled only every 6 cycles (this behavior is explained in detail in section 6). Consequently, between two instructions of a warp "A", 2 other warps can be scheduled on the same scheduler without influencing the execution time of "A". For thread block sizes 256 and 512, which have more than 6 warps, the execution time of all warps increases and varies.


Figure 18 mad-add instructions with RaW dependency for different thread block sizes. Experiments at GPGPU-Sim

Figure 18 shows the same experiment for GPGPU-Sim. The horizontal axis shows the warp ID of the warps for the different thread block sizes, and the vertical axis the execution time in clock cycles. First, the execution times for thread block sizes 64 and 128 are different compared to 32. This means that the scheduler modeled in GPGPU-Sim schedules the warps every 2 cycles without taking into account the restriction of the 6 cycles.


Figure 19 Difference of the execution time of thread block size 32 with the slowest warp for the different thread block sizes

Figure 19 presents the difference between the slowest warp and the warp of the 32-thread configuration, for the real hardware and for GPGPU-Sim. The horizontal axis shows the thread block size and the vertical axis the percentage of the difference. First, for thread block sizes of 64 and 128 threads the difference between the slowest warp and the 32-thread configuration is 4% for GPGPU-Sim, while for the hardware it is 0%. In addition, the behavior of the warp scheduler for a thread block size of 256 is similar to hardware; the biggest difference is 4%. For a thread block size of 512 the simulator behaves differently compared to the hardware: the difference of the execution time of the slowest warp with the execution of thread block size 32 is significantly bigger. For the real hardware the biggest difference is 50%, while for the simulator it is 78%. Therefore, the real hardware is more scalable in terms of number of warps compared to GPGPU-Sim; in other words, the real hardware can maintain better performance for an increasing number of warps.

The execution time of the different warps is more balanced in real hardware compared to GPGPU-Sim. The scheduler in hardware seems to be fairer between the different warps: all warps experience approximately the same amount of extra delay compared to the case with thread block size 32. On the contrary, for a thread block size of 512 threads in GPGPU-Sim, warps 12 and 13 are much slower than the other warps.

Now that we have a better insight into the warp scheduler, we examine it for timing variations. By performing the same experiments multiple times we notice that for a thread block size of 512 threads the execution time can differ from experiment to experiment. More precisely, for thread block size 512 we repeated the experiment 100 times. The results are presented in Figure 20. The black line in the graph represents the common case: 94 of the 100 experiments had the same execution times. The remaining experiments had different execution times and show no common pattern.


Figure 20 100 experiments for mad add instruction with RaW dependency executed by 512 threads (16 warps)

One of the experiments with a different result was isolated to analyze its behavior in more detail. To visualize the results better, the starting times for the common case and the exception are presented in Table 5 and Table 6 respectively.

The first column presents the relative starting time of the warp. The “Odd” and “Even” columns contain the IDs of the warps that were scheduled by the odd and the even schedulers. The green boxes show the point in time that warps reach the first “S2R” instruction of the microbenchmark (line 2 of asfermi code).

Since in hardware we cannot see which instruction is executed in every cycle, the red boxes present the cycles for which we do not know exactly what is executed. A red box can be an instruction that we cannot identify at which cycle it is executed, or an idle cycle. Note that for every exception a different warp is scheduled first. Here we only check one case, since we did not find any common pattern for the exceptions.

Since warps are scheduled in LRR fashion based on their warp IDs, we expected that warp 0 would reach the “S2R” instruction first; however, for the common case of the even scheduler, warp 6 reaches it first. For the exception case, the warps from the odd scheduler reach the “S2R” instruction 30 clock cycles later than the warps from the even scheduler. In addition, the warps are executed in a different order compared to the common case, which shows that an event changed the scheduling order.

In hardware it is not possible to identify what event changed the scheduling order. We use GPGPU-Sim to trace all the instructions at cycle level. The simulator has deterministic behavior, so the results are the same in every execution. The exception case can therefore not be replicated, so we worked on the common case only. For the common case we found all the sources of events that change the scheduling order. We assume that in the exception case the same types of events occur, but in a different quantity and at different points in time.

The simulator was modified to collect the information from the two schedulers. At every clock cycle we monitor which instructions are executed. We performed the same experiment in the simulator as on the real hardware and collected the traces. Some characteristic parts of the execution traces are presented in Appendix C. From the analysis of the traces we can find out what exactly is executed between the “S2R” instructions.

From the traces of GPGPU-Sim we first observe that the warps are executed in order based on their warp IDs. This behavior was different in hardware: warp 6 reaches the “S2R” instruction first in hardware, while in GPGPU-Sim warp 0 does so. Secondly, from the information that we collected, the warp scheduler can be in one of three states:

• Active. The GPU executes the instructions from the different warps. After checking for RaW dependencies in the scoreboard, the scheduler can schedule a new warp every 2 clock cycles. In other words, the warp scheduler finds available warps that can be scheduled.

• Idle because of RaW dependencies. The scheduler cannot find any available warp to schedule, therefore it has to wait until a RaW dependency is resolved. If all warps are scheduled fairly, i.e. in strict RR, then all warps have to wait the same amount of time for their RaW dependency.

• Idle because the execution units are not available. Before scheduling a new instruction, the scheduler checks whether the SP and SFU units are available. More precisely, it checks the register between the “issue” and “execute” stage. If the execution units are not ready, the scheduler is not able to issue a new warp and has to wait. For instance, when two warps have to execute a load or store instruction, there is only one set of LD/ST units, so one of the warps has to wait.

From these states we relate the exception cases to the third state. When a warp stalls, it loses its turn. If the stall is not constant, the warp waits a different amount of time and can therefore end up with a different schedule.
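To illustrate how a non-constant stall can perturb the issue order, the toy sketch below models a loose round-robin pick in which warps are considered in warp-ID order and a warp that cannot issue simply loses its turn. It is purely illustrative (the warp count and stall pattern are hypothetical) and makes no claim about the real Fermi or GPGPU-Sim scheduler logic.

from collections import deque

def lrr_schedule(n_warps, n_slots, stalled):
    # stalled(warp, slot) -> True if the warp cannot issue in this slot.
    q = deque(range(n_warps))            # warps ordered by warp ID
    order = []
    for slot in range(n_slots):
        for _ in range(len(q)):
            w = q[0]
            q.rotate(-1)                 # loose RR: a skipped warp loses its turn
            if not stalled(w, slot):
                order.append(w)
                break
        else:
            order.append(None)           # no warp could issue (idle slot)
    return order

print(lrr_schedule(4, 8, lambda w, s: False))                   # [0, 1, 2, 3, 0, 1, 2, 3]
print(lrr_schedule(4, 8, lambda w, s: w == 1 and s in (1, 5)))  # [0, 2, 3, 0, 1, 2, 3, 0]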

Another explanation for the exception cases can be register file bank conflicts. The register file is in the operand collector and is organized in banks. The arbitration mechanism that avoids bank conflicts in the operand collector is not described in the Nvidia documentation. Register file bank conflicts can introduce extra stalls to the warps and change the scheduling order. However, in hardware we could not produce register bank conflicts using the access patterns that were used to produce register bank conflicts for Kepler [5].

Table 5 Starting times of the warps for the two schedulers, common case
Table 6 Starting times of the warps for the two schedulers, exception case

(Both tables list the relative starting time in clock cycles and the warp IDs issued by the odd and the even scheduler.)


5.3. On-chip scratchpad memory

Figure 21 presents the latency of 10 loads (LDS) and stores (STS) to the shared memory. The horizontal axis presents the number of operations (LDS and STS) and the vertical axis the time in clock cycles. In both cases the execution time increases linearly with the number of operations. Furthermore, the store has a higher slope than the load: a new load or store to the shared memory can be issued after 26 and 30 clock cycles respectively.


Figure 21 Latency for shared memory instructions

Figure 22 shows the execution time of memory accesses to the shared memory for one warp when there are bank conflicts. On the horizontal axis is the stride and on the vertical axis the execution time in clock cycles. It can be seen that the execution time of loads and stores to the shared memory increases linearly with the number of bank conflicts. For strides of 64 and 128 we have the same number of bank conflicts as for stride 32, so the execution time remains constant.
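The number of bank conflicts for such strided patterns can be reproduced with a small calculation. The sketch below (our own helper, assuming Fermi's 32 banks of 4-byte words) counts how many of the 32 threads of a warp map to the busiest bank; the conflict degree saturates at 32 for strides of 32, 64 and 128, which matches the flat part of Figure 22.

from collections import Counter

def conflict_degree(stride, n_threads=32, n_banks=32):
    banks = [(tid * stride) % n_banks for tid in range(n_threads)]
    return max(Counter(banks).values())     # threads hitting the busiest bank

for s in (1, 2, 4, 8, 16, 32, 64, 128):
    print(s, conflict_degree(s))            # 1, 2, 4, 8, 16, 32, 32, 32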

In addition, when we perform a store and a load to the shared memory with a RaW dependency, the execution time is almost the same (24 cycles higher) as the sum of the load and store execution times. So the execution time for the bank conflicts remains the same when loads and stores are interleaved.


Figure 22 Execution time of bank conflicts to the shared memory.

5.4. Off-chip - Global memory

Similarly to the shared memory, the latency to schedule loads and stores to the global memory is presented in Figure 23. First, it shows that the load takes much more time than the store, because of the write buffer that we mentioned before. The slope of the lines defines the time that is needed to schedule a new instruction: for load and store this is 370 and 34 clock cycles respectively.


Figure 23 Latency for global memory instructions.

Figure 24 presents the variation of the execution time from experiment to experiment of a load instruction to global memory, for a thread block size of 512 threads. The threads access consecutive memory addresses. The horizontal axis shows the warp ID of the warps and the vertical axis the execution time in clock cycles. All the experiments follow the same trend, but there is a small variation from execution to execution. This means that the warps of a thread block access the main memory in the same way in all experiments, but some other factors influence the execution time. These factors can be the number of pending requests, contention on the interconnect between the SMs, the refresh rate of the global memory, etc. From these experiments it is not possible to determine this time variation exactly; a detailed model of the memory is required to define it precisely.


Figure 24 One load to the global memory – 8 experiments.

Figure 25 shows the execution time of different experiments for a store instruction to the global memory, for a thread block size of 512 threads (16 warps). On the horizontal axis is the warp ID of the warps and on the vertical axis the execution time in clock cycles. First, it can be seen that within one execution (for instance the blue line) the warps with warp IDs 4 and 14 have different execution times than the other warps; the biggest difference is 100 cycles. We checked thread block sizes of 32, 64, 128, 256 and 512, and this behavior appeared only for 512, which shows that the number of warps influences the timing behavior.

Due to the big difference in execution time, and the fact that this problem is present only for a large number of warps, we assume that there is a write buffer: the data are stored to the write buffer, the execution continues with the next instruction, and the store to the global memory is performed in the background. To our knowledge there is no documentation about a write buffer for Fermi.

Secondly, by executing the same microbenchmark multiple times we see some small variation in the execution times of most warps. In all experiments some warps again have a significantly higher execution time than the other warps, but it is not always the same warps that experience the higher execution times. If we assume that there is a write buffer, as mentioned before, this means that the allocation of the buffer is based on the arrival time.


Figure 25 Multiple experiments for one store instruction to the global memory

5.5. Conclusion

Microbenchmarks implemented in “asfermi” were used to quantify the time variation for the warp scheduler, the shared memory and the global memory. From the experiments we measured the cycles per instruction (CPI) for arithmetic instructions with and without a RaW dependency. In addition we measured the CPI for the shared memory and the global memory instructions. The results are presented in Table 7.

Table 7 CPI of the different type of instructions.

Type      CPI
Reg       6
Reg RaW   18
LDS       26
STS       30
LD.E      350 - 500
ST.E      34

Furthermore, the time variation of the warp scheduler was analyzed for the microbenchmark with a “MAD” and an “ADD” instruction that have a RaW dependency, with a thread block size of 512 threads. Based on the analysis that we performed in GPGPU-Sim, the source of this variation is related to a non-constant stall of the warps that changes the scheduling order. This stall is caused by the unavailability of the execution units or the operand collector.

Moreover, we show that the scratchpad memory has time-predictable behavior when the bank conflicts can be determined. In contrast, the execution time of accesses to the global memory varies for store and load instructions. For load instructions the execution time varies among different executions. From the experiments with the store instruction we conclude that there is a write buffer: when the write buffer is full, the execution time of the store instruction increases and varies from warp to warp.


6. Timing model

In this section the model for the scheduling of the instructions is presented. The timing model takes assembly instructions as input and, based on how the instructions are scheduled and the measured latency of the instructions, provides an estimate of the execution time. Initially the timing model is presented and some examples for two simple microbenchmarks are given to explain the model in more detail. Next, the formulas that describe the model are given. Finally, the scalability of the model in terms of the number of warps is discussed.

6.1. Instruction scheduling timing model for a single warp

From the observations in Section 5 we define two types of dependencies between instructions. First, there is an instruction dependency. An instruction from the same warp can be scheduled after 6 cycles. Secondly, there is a data dependency. For one warp, an instruction with a data dependency can be scheduled 6+12 = 18 cycles later.

The execution is in pipeline fashion so the warps can be scheduled every 2 cycles. So we assume that the pipeline stages are separated in two parts. We are going to follow the terminology of the GPGPU-Sim [3] (Figure 10).

First there is the “SIMT front end” (green part in Figure 26), where the I-cache read, the decoding of instructions and the operations on the scoreboard are performed. Next there is the SIMD datapath (yellow part), where the operands are read and the actual execution of the instruction is performed. The next instruction from the same warp can be scheduled after 6 cycles, as the SIMT front end has a latency of 6 cycles (Figure 26).


Figure 26 Clock cycles of the SIMT front end and the SIMD datapath

If there is an instruction dependency, the next instruction has to wait until the previous instruction finishes the SIMT front end before it can start (Figure 27, upper part). We assume that there is no mechanism, like bypassing, that can let the instruction start before the previous instruction has finished. Similarly, for a data dependency the next instruction has to wait until the previous instruction has finished completely (Figure 27, lower part).

Figure 27 Instruction and data dependencies for one warp

When there are multiple warps, the warps are scheduled by the two schedulers (odd and even). An example of the execution of multiple warps for the even scheduler is presented in Figure 28. Since these are instructions from different warps, there is no dependency between them, so they can be scheduled every two cycles according to [1].


Figure 28 Execution for multiple warps of even scheduler. WID denotes the warp ID of every warp

Execution of two independent add instructions (measured interval from cycle 18 to cycle 54):
S2R R4, SR_ClockLo;
SHL.W R4, R4, 0x1;
IADD R10, R11, R12;
IADD R13, R14, R15;
S2R R5, SR_ClockLo;
SHL.W R5, R5, 0x1;

Execution of two add instructions with a RaW dependency (measured interval from cycle 18 to cycle 66):
S2R R4, SR_ClockLo;
SHL.W R4, R4, 0x1;
IADD R10, R11, R12;
IADD R13, R10, R15;
S2R R5, SR_ClockLo;
SHL.W R5, R5, 0x1;

Figure 29 The code and the timeline for the two micro-benchmarks. In the upper part the add instructions are independent; in the lower part the instructions have a RaW dependency.

The upper part of Figure 29 shows the timeline for the microbenchmark with the independent instructions. Initially the “S2R” instruction is executed. Since the “SHL” has a RaW dependency on “S2R”, it can start at cycle 18. The first “ADD” instruction only has an instruction dependency, so it can start at cycle 24. Similarly, the second “ADD” and the second “S2R” instruction can start at cycles 30 and 36. Again, the final “SHL” has a RaW dependency, so it has to wait until cycle 54 before it can start.

To compare the results of the experiments in Figure 16 with the timeline in Figure 29, the clock values have to be subtracted. We have to wait until the pipeline stage where the value of the clock is ready. We assume that the correct value of the clock is ready when the “S2R” instruction is finished, so we can read the special register that holds the clock value 2 cycles after the completion of “S2R”. This assumption is not strict: we could also assume that the value of the clock is ready before the completion of the instruction, and the result would be the same, as long as the same assumption is used for all cases. Consequently, for the example with the execution of two “ADD” instructions without dependency the measured time is 54 - 18 = 36 cycles, which matches the results of the experiments.


Similarly, we have the execution for the case with the RaW dependency (Figure 29, lower part). Initially the “S2R” is executed and the “SHL” has to wait until cycle 18 because of the RaW dependency. The first “ADD” instruction can start 6 cycles later. The second “ADD” has a RaW dependency, so it can start when the first “ADD” is finished, at cycle 42. The second “S2R” has no dependency on the “ADD”, so it can start 6 cycles later. The “S2R” and the “SHL” have a RaW dependency, so they are executed in the same way as before. Finally, for the example with the “ADD” instructions with a RaW dependency the measured time is 66 - 18 = 48 cycles, which is the same as the result of the experiments in Figure 16.
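Both timelines can be reproduced mechanically from the two dependency types. The sketch below is a minimal reading of the single-warp rules above (6-cycle front end, plus 12 extra cycles when an instruction has a RaW dependency on the immediately preceding instruction, which is the case for all dependencies in these two micro-benchmarks); it reproduces the measured 36 and 48 cycles of Figure 16.

FRONT_END = 6      # instruction dependency: same warp, next instruction after 6 cycles
DATAPATH = 12      # extra wait when a RaW dependency has to be resolved

def start_times(raw_deps):
    # raw_deps[i] is True if instruction i has a RaW dependency on instruction i-1.
    t = [0]
    for has_raw in raw_deps[1:]:
        t.append(t[-1] + FRONT_END + (DATAPATH if has_raw else 0))
    return t

# S2R, SHL(RaW), IADD, IADD, S2R, SHL(RaW)  -> independent adds (Figure 29, upper part)
indep = start_times([False, True, False, False, False, True])   # [0, 18, 24, 30, 36, 54]
# S2R, SHL(RaW), IADD, IADD(RaW), S2R, SHL(RaW) -> dependent adds (lower part)
dep = start_times([False, True, False, True, False, True])      # [0, 18, 24, 42, 48, 66]
print(indep[-1] - indep[1], dep[-1] - dep[1])                    # 36 and 48 clock cycles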

6.2. Formulas of the timing model

In this section the formulas that are used for the calculation of the execution time are described. We use the well-known formula for the execution time, T = CPI × N (equation 1), where T is the execution time in clock cycles (CC), CPI is the cycles per instruction and N is the number of instructions. We assume that during the execution of the kernel the clock frequency remains constant. For this model we use three types of instructions, arithmetic, shared memory and global memory instructions, so the formula expands to the sum of the execution times of these three types.

The execution time of the arithmetic instructions is described in equations 2 to 4. Since the execution time for arithmetic instructions with and without a RaW dependency is different, we have to distinguish them. The total execution time for the arithmetic instructions is the sum of the execution times of the independent and the dependent instructions (equation 2).

The execution time of the shared memory instructions is the sum of the execution times of the loads and stores to the shared memory (equation 5). Similarly to the arithmetic instructions, the execution times of loads and stores to the shared memory are given by equations 6 and 7.

For the global memory we describe the memory latency hiding. This behavior is described by equation 8: it is the sum of all instructions between the load request and the use of the data, i.e. the amount of time during which subsequent instructions can be executed in order to hide the memory latency. If the instruction directly after the load is the one that uses the data (a RaW dependency with the next instruction), no memory hiding can be achieved. Note that this is different from the memory hiding that is achieved when different warps are executed; here, the same warp continues its execution until the data are read.

The execution time of the global memory instructions is calculated similarly to the previous types of instructions. As mentioned before, a store to the global memory is performed to a write buffer, so the execution time for a store is much smaller than for a load to the global memory. Furthermore, the equation for a load to the global memory is extended by subtracting the memory hiding.
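Because the individual equations are rendered as images in the original document, the following minimal sketch gives one possible reading of them in code form, using the CPI values of Table 7. The function name, parameter names and the simplified memory-hiding term are our own assumptions, not the exact equations of the thesis.

CPI = {"reg": 6, "reg_raw": 18, "lds": 26, "sts": 30, "st_e": 34}

def exec_time(n, glm_load_latency=400, hiding_cycles=0):
    # n: instruction counts per type; hiding_cycles: cycles of independent work
    # that can overlap with each global load (0 means no latency hiding at all).
    t_arith = CPI["reg"] * n["arith"] + CPI["reg_raw"] * n["arith_raw"]
    t_shm = CPI["lds"] * n["shm_load"] + CPI["sts"] * n["shm_store"]
    t_glm = (CPI["st_e"] * n["glm_store"]
             + max(glm_load_latency - hiding_cycles, 0) * n["glm_load"])
    return t_arith + t_shm + t_glm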

6.3. Scalability of the model

Until now, the described model works for one warp. The scalability of the model needs to be analyzed in more detail. Figure 30 shows that the execution time remains the same for up to 6 warps. This behavior derives from the restriction of the instruction dependency: for 6 warps, the warps are scheduled one after the other and they fit in the 6 cycles of the SIMT front end, so up to this point the scheduler has only one choice.
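The 6-warp limit can also be read directly from the two issue rates: one issue every 2 cycles per scheduler and a 6-cycle gap between instructions of the same warp leave room for 6 / 2 = 3 warps per scheduler, i.e. 6 warps in total for the two schedulers, before any warp has to wait. A one-line check of this arithmetic (our own illustration):

SAME_WARP_ISSUE, NEW_WARP_ISSUE, N_SCHEDULERS = 6, 2, 2
# number of warps that can be interleaved without delaying any of them
print(N_SCHEDULERS * (SAME_WARP_ISSUE // NEW_WARP_ISSUE))   # 6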

For more warps the scheduler can choose which warp to schedule, and this choice is determined by the scheduling algorithm. For instance in Figure 31, for one warp (upper part) the scheduling algorithm does not influence the execution time. On the contrary, for the case with 7 warps, the scheduler can choose between warp 0 and warp 6. Since the scheduling algorithm is round robin, warp 6 will be scheduled.


Figure 30 Execution time of one add instruction for different numbers of warps. The execution time remains the same up to 6 warps.

To test the scalability of the model in terms of the number of warps, the model is applied to multiple warps, using the two microbenchmarks from before, and compared with the experiments. The model is tested for the case with 3 warps per scheduler. First the microbenchmark with no dependency is used (Figure 33). The code in assembly and the trace of the code are shown, and the timeline of the execution is presented in a similar way as before.

First warps 0, 2 and 4 are scheduled. Since there is a RaW dependency with the “SHL” instruction, all the warps have to wait until the completion of their previous instruction. The pipeline is full until the last “SHL”, where we again have a RaW dependency; at this instruction all warps have to wait until the end of their previous instructions. The execution times of all warps are the same. However, since warps 2 and 4 start 2 and 4 clock cycles later respectively, they also finish 2 and 4 clock cycles later. The timeline fits exactly with the measurements from the experiments. In a similar way the timeline is constructed for the case with the RaW dependency (Figure 32); again, the timeline fits with the experiments.

(A) For 1 warp
Even scheduler: W0.instr0 | Idle | Idle | W0.instr1
Odd scheduler:  Idle | Idle | Idle | Idle

(B) For 7 warps
Even scheduler: W0.instr0 | W2.instr0 | W4.instr0 | W6.instr0 (or W0.instr1)
Odd scheduler:  W1.instr0 | W3.instr0 | W5.instr0 | W1.instr1

Figure 31 Execution for 1 and 7 warps. For 7 warps the scheduler can choose which warp to schedule.


Assembly code of the microbenchmark:
S2R R4, SR_ClockLo;
SHL.W R4, R4, 0x1;
IADD R10, R11, R12;
IADD R13, R10, R15;
S2R R5, SR_ClockLo;
SHL.W R5, R5, 0x1;

Trace for 3 warps per scheduler with the RR scheduler: every instruction is issued for W0, W2 and W4 in turn (W0 S2R, W2 S2R, W4 S2R, W0 SHL, W2 SHL, W4 SHL, and so on). The first “S2R” instructions start at cycles 0, 2 and 4, and the final “SHL” instructions start at cycles 66, 68 and 70.

Figure 32 Execution of 3 warps, instructions with RaW dependency

Assembly code of the microbenchmark:
S2R R4, SR_ClockLo;
SHL.W R4, R4, 0x1;
IADD R10, R11, R12;
IADD R13, R14, R15;
S2R R5, SR_ClockLo;
SHL.W R5, R5, 0x1;

Trace for 3 warps per scheduler with the RR scheduler: every instruction is issued for W0, W2 and W4 in turn. The first “S2R” instructions start at cycles 0, 2 and 4, and the final “SHL” instructions start at cycles 54, 56 and 58.

Figure 33 Execution of 3 warps, independent add instructions


6.4. Conclusion

A timing model for the warp scheduler was presented, based on the observations from the experiments that we performed in Section 5. Initially the model for a single warp was proposed. The instruction dependency was introduced, whereby instructions from the same warp can be scheduled every 6 cycles. The formulas for the execution time were given, based on the well-known equation for the execution time in clock cycles, T = CPI × N.

Finally, the scalability of the model in terms of the number of warps was tested, and it was shown that the model can be extended to 6 warps (3 warps per scheduler) without any modification. For a higher number of warps further study of the warp scheduler is required.


7. Benchmark: Convolution Separable

We use the CPI measurements from Section 5 and the formalization of Section 6 to test the model on the convolution separable benchmark from the CUDA SDK. Afterwards, the model is compared with the model proposed by Hong and Kim [28].

7.1. Application of the model

The convolution separable application from the CUDA SDK was used to test the accuracy of the model. The 2D convolution is split into two 1D convolutions, one for the rows and one for the columns. The two parts of the algorithm are implemented in two different kernels that are executed one after the other. The kernels use 32 threads, so only one warp is running on the GPU. We use only one warp because the experiments have shown that the execution with one warp has time-predictable behavior; based on the analysis of Section 6.3 the model could also be tested for up to 6 warps. The distributions of the different types of instructions used by the two kernels are presented in Figure 34. Both kernels use the shared memory efficiently, so most of the instructions are arithmetic and shared memory instructions. For the rows kernel only 1% of the instructions have RaW dependencies, while for the columns kernel this is significantly higher, 22%.


Figure 34 Percentage of Instruction types for the rows and Columns kernel

Table 8 Instruction types for the rows kernel

Arithm Dep   Arithm   shm load   shm store   glm store   glm load   Total
1            78       40         9           9           8          145

Table 9 Instruction types for the columns kernel

Arithm Dep   Arithm   shm load   shm store   glm store   glm load   Total
21           57       7          3           2           5          95
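As a rough illustration (ours, not part of the thesis), the instruction mixes of Table 8 and Table 9 can be combined with the CPI values of Table 7. The snippet below computes only the deterministic part of the model, i.e. everything except the global-memory loads and their latency hiding; the model values reported in the tables below additionally contain that global-load term.

CPI = {"arith_raw": 18, "arith": 6, "shm_load": 26, "shm_store": 30, "glm_store": 34}
rows    = {"arith_raw": 1,  "arith": 78, "shm_load": 40, "shm_store": 9, "glm_store": 9}
columns = {"arith_raw": 21, "arith": 57, "shm_load": 7,  "shm_store": 3, "glm_store": 2}

for name, mix in (("rows", rows), ("columns", columns)):
    print(name, sum(CPI[k] * mix[k] for k in mix))   # rows: 2102, columns: 1060 cycles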


For the comparison of the experiments and the model we use the best, average and worst case execution time (BC, AC and WC). In order to determine the BC, AC and WC exactly, WCET analysis tools would be needed; to our knowledge such tools are not available for GPUs, and specifically not for the Fermi architecture. Therefore, the BC, AC and WC are determined by executing the same experiments multiple times.

For the BC of the model, the load to the global memory takes the minimum amount of time (64 cycles, which is the same as a store to the write buffer). For the WC of the model the load takes the maximum value, because we assume that no memory hiding can be applied.

Table 10 and Table 11 show the error between the model and the experiments for the rows and the columns kernel. The best, average and worst case execution times of the experiments and the model are compared. First, for the rows kernel, the average case has an error of 2%, although the best case is underestimated by -15% and the worst case is overestimated by 38%. The results show that the average case can be predicted precisely, while the best case and the worst case are too optimistic and too pessimistic respectively. The reason is that we do not perform any detailed analysis of the global memory accesses: we assume for the best case that all memory accesses can hide the memory latency and, similarly, for the worst case that none of the memory accesses can hide the memory latency.

Table 10 Error of the model and the experiments for the rows kernel

             BC     AC     WC
Experiment   3365   3410   4004
Model        2856   3510   5544
Error        -15%   2%     38%

Table 11 Error of the model and the experiments for the columns kernel

             BC     AC     WC
Experiment   3106   3127   3426
Model        1428   3228   3458
Error        -54%   3%     1%

The error for the columns kernel is similar for the average case, but the best and the worst case are significantly different: the best case is considerably underestimated, while the worst case is predicted more precisely. This behavior can be explained by the fact that the best, average and worst case of the experiments are close together. For this kernel the compiler does not find enough instructions to put between the load from global memory and the use of the data, so the memory latency cannot be hidden efficiently.

The prediction ranges are presented graphically in Figure 35 to make the results more intuitive. The figure shows that the prediction range is approximately the same for both kernels, while the distributions of the execution times of the two kernels differ. For the columns kernel, due to the limited distribution, the BC, AC and WC are closer together than for the rows kernel.


Figure 35 The best, average and worst case of the experiments and the prediction range of the model. The predicted value for the average case is also indicated.

7.2. Comparison

Furthermore, the model proposed in [28] is compared with our model. For simplicity we call it the “MWP model” (memory warp parallelism), after the new metric proposed in that paper. The paper models the Tesla architecture; however, since the model is based on warp-level parallelism for the memory accesses and the computation, it does not make significant assumptions about the architecture. The model was implemented in Python and its parameters were tuned to match the Fermi architecture. The parameters that were used for the two kernels are presented in Table 12. The results are presented in Table 13 for the rows and Table 14 for the columns kernel. For the rows kernel the results of the two models are very similar. For the columns kernel the difference between the MWP model and the measured execution time is significant (18%). As mentioned before, this kernel has many RaW dependencies (Figure 34); the MWP model does not take into account the extra cost of these dependencies, which results in an underestimation of the computation part of the kernel.


Table 12 Values for the parameters of the MWP model

Parameters                     Rows Kernel   Columns Kernel
#Threads_per_warp              32            32
#Issue_cycles                  2             2
Clock Frequency (MHz)          607           607
Mem_Bandwidth                  133.9 GB/s    133.9 GB/s
Mem_LD                         400           470
Departure_del_uncoal           10            10
Departure_del_coal             4             4
#Threads_per_block             32            32
#Blocks                        1             1
#Active_SMs                    1             1
#Active_blocks_per_SM          1             1
#Active_warps_per_SM           1             1
#Total_insts                   146           95
#Comp_insts                    138           90
#Mem_insts                     8             5
#Uncoal_Mem_insts              0             0
#Coal_Mem_insts                8             5
#Synch_insts                   1             1
#Coal_per_mw                   1             1
#Uncoal_per_mw                 32            32
Load_bytes_per_warp (Bytes)    4*32          4*32

Table 13 Rows kernel, comparison of the models

Rows Kernel    AC
Experiments    3410
Model          3510
MWP model      3492
Model Error    2%
MWP error      2%

Table 14 Columns kernel, comparison of the models

Columns kernel   AC
Experiments      3127
Model            3228
MWP model        2540
Model Error      3%
MWP error        -18%

7.3. Effect of the memory latency on the models

Figure 36 and Figure 37 show how the latency of a load to the global memory influences the two models. For both models the execution time increases linearly with a linear increase of the memory latency. This is expected, since in the formulas the global memory latency is part of a linear combination for the calculation of the execution time. Note that this is the average value of the global memory latency, since it can change from access to access. In addition, the MWP model seems to be more sensitive to changes of the latency, since it has a higher slope for both kernels.


Figure 36 Rows kernel – Error of the two models for different values of the global memory latency.


Figure 37 Columns kernel - Error of the two models for different values of the global memory latency.

7.4. Analysis of the benchmark for time variation

Until now, the average case of the execution time was discussed. Now we analyze the distribution of the different execution times in more detail. Figure 38 shows the distribution of the execution times of 200 experiments for the rows kernel. The horizontal axis shows the different execution times from experiment to experiment and the vertical axis the number of experiments. The bars show the number of experiments that have the same execution time, and the accumulated number of experiments is presented with the red line. The dashed lines indicate the average case and the execution time calculated with our model. We notice that the value calculated by the model gives an upper bound for 96.5% of the experiments, with 2% error from the AC. The remaining 3.5% of the experiments are the ones that determine the WCET, so the kernel has to be analyzed in more detail.


Figure 38 Distribution of the execution times of the rows kernel.

We split the kernel into three parts: (1) load from the global memory, (2) computation and (3) store to the global memory. Figure 39, Figure 40 and Figure 41 show the distribution of the instruction types of the three parts, and Table 15, Table 16 and Table 17 give the instruction counts in absolute values.

If we sum the instructions of the different parts (Table 15, Table 16 and Table 17), the total does not match the total instruction count of the kernel (Table 8), because of the measurement overhead. More precisely, the load part contains 4 extra instructions (2 for the starting time and 2 for the finishing time), since the compiler places these instructions together. In addition, the computation part contains 4 extra instructions for each of its 8 measurements (32 instructions); similarly to the computation part, the store part contains 32 extra instructions. Furthermore, the measured values have to be stored to memory, so these store instructions have to be counted as well. However, the compiler reorders the instructions, so they can end up inside or outside the measured part.

The load part loads the data, which initially reside in the global memory, into the shared memory to minimize the cost of future memory accesses. There is also a store to the global memory instruction (ST.E) in this part: this instruction stores the starting time of the measurement. The compiler reorders instructions to minimize RaW dependencies, so it was not possible to exclude this instruction from this part. Next, in the computation part the data are read from the shared memory and the actual computation is performed. Finally, in the store part the data are stored to the global memory.


Figure 39 Percentage of instructions for the load part. Figure 40 Percentage of instructions for the computation part.


Figure 41 Percentage of instructions for store part

Table 15 Instructions for load part.

Arithm Dep   Arithm   glm load   glm store   shm store   Total
4            26       10         1           10          51

Table 16 Instructions for computation part.

Arithm Dep   Arithm   shm store   Total
24           48       40          112

Table 17 Instructions for store part.

Arithm Dep   Arithm   glm store   Total
21           16       8           45


We executed the different parts of the kernel 500 times in order to determine the variation of the execution times. The results are presented in Figure 42, Figure 43 and Figure 44 for the load, computation and store part respectively. All parts show variation in the execution times from execution to execution.

First we present the results for the load part. There are 5 main categories of execution times, and the biggest difference between these main categories is 10 clock cycles. We have shown in Section 5 (Figure 24) that the execution time of load instructions to the global memory varies from execution to execution, so this difference is expected. Furthermore, we see that the difference between the best case and the worst case is 324 clock cycles.


Figure 42 Distribution of execution time for the load part of rows kernel


Figure 43 Distribution of execution time for the computation part of rows kernel


The results for the computation part are presented in Figure 43. The execution times are distributed over 3 main categories, and the largest difference between these categories is 4 clock cycles. However, the difference between the best case and the worst case is 328 clock cycles. We relate this difference to the variation that we found for the execution of the mad-add instructions with a RaW dependency (Figure 20), where 6% of the experiments showed a difference of 23% compared to the average execution time of the different warps. The main difference with the case of Figure 20 is that there 16 warps were used, while here we use only one.


Figure 44 distribution of execution time for the store part of rows kernel

The results for the store part are presented in Figure 44: 98% of the experiments have the same execution time, 616 clock cycles. We also note that the average case is the same as the best case. This behavior shows that the write buffer is used efficiently. The worst case is 300 clock cycles slower than the average case (48%), which is significantly larger than for the load and the computation part. This difference is consistent with the experiments of Section 5 for the store instruction to the global memory, where the store instruction also showed the biggest difference: for one store instruction the variation was 100 clock cycles. So if 3 of the 8 stores cannot use the write buffer efficiently, this would lead to the 300 clock cycle difference that we see for this case.

7.5. Model scalability in terms of warps

The model is analyzed for its scalability in terms of the number of warps it can support. We maintain the same model as in section 6.3. Our hypothesis is that the execution time would remain the same until it reaches a thread block size of 6 warps and afterwards execution time will increase. The time behavior for a thread block size larger than 6 warps requires further research.

Figure 45 presents the results of the experiments for different numbers of warps, together with the execution time predicted by the model. The horizontal axis shows the number of warps used by the kernel and the vertical axis the execution time in clock cycles. Initially, for a thread block size of up to 2 warps the execution time remains almost the same (1% increase for 2 warps) for the BC, AC and WC. Next, from 2 to 4 warps the execution time decreases by 5% for the BC and AC; this behavior was unexpected, although the execution time of the WC remains almost the same (1.5% increase). It is assumed that the difference between the WC and the BC and AC is due to the variation in the global memory accesses; a more detailed analysis is required to explain this behavior. Finally, for more than 6 warps the execution time increases, as expected.


Figure 45 Scalability of the model in terms of number of warps for Rows kernel.

7.6. Conclusion

The timing model was tested with the convolution separable benchmark from the CUDA SDK. The experiments show that for the configuration with a single warp the execution time can be predicted with 2% and 3% error for the rows and columns kernels respectively for the average case. For the BC and WC the error is -15% and 38% for the rows kernel and -54% and 1% for the columns kernel. In other words, the model can provide an upper bound for 96.5% of the experiments with 2% and 3% error.

In addition, the model was compared with the Hong and Kim model [28]. The two models match for the rows kernel, but for the columns kernel the MWP model has an error of -18%. This difference is caused by the fact that their model does not take into account the extra cost of RaW dependencies. The two models were also compared for different global memory latency values; for both models the execution time increases linearly with a linear increase of the memory latency. Finally, the scalability of the model was investigated and the model was tested for 1, 2, 4, 6 and 8 warps. The results show that the model is scalable up to 6 warps (192 threads), as expected. For more than 6 warps further study is required.


8. Variable rate warp scheduling

As described previously, the Fermi warp scheduler uses the loose round-robin (LRR) algorithm. Warps are scheduled based on their warp ID: a warp with a smaller ID will be scheduled before a warp with a higher warp ID. When there is a load to global memory, the scheduler will dynamically find the next available warp to schedule. The scheduler is fair to all the warps, so the warps reach long-latency instructions at close points in time. To improve the time predictability of the system it is required to schedule the warps in a more controlled way; it is easier to analyze the behavior of the GPU if the warps have a constant scheduling frequency.

First, the motivation for using the variable-rate warp scheduler is described and the problems that the algorithm solves are presented. Finally, experimental evidence is given to verify the correctness of the algorithm.

8.1. Motivation for the variable rate warp scheduler

Without any restriction to the warp scheduler we cannot always determine which warp is going to be scheduled next. If we set a fixed scheduling rate for all the warps then it is easier to determine the execution time [16].


Figure 46 The typical execution of the warps on a GPU is presented with the blue line. All warps are scheduled with the same rate (RR), so the execution times of all warps are represented by the blue line. We compare the typical execution with the case where the same number of warps is used but one warp is scheduled faster (orange line); the rest of the warps are scheduled in RR fashion.

A typical execution of a program on a GPU is presented with the blue line in Figure 46. On the horizontal axis the time in clock cycles is presented and on the vertical axis the instruction address; the figure shows how the program advances during its execution. If all the warps are scheduled with the same rate, the scheduling algorithm is RR and all the warps have the same execution time, so in this diagram all warps would lie on top of each other and reduce to one line.

From the diagram it can be seen that there are two types of execution. First we have the memory part. The program has to wait for the memory to get the data; since the program is not advancing the line is horizontal. Afterwards, it starts the computation part. For simplicity, it is assumed that the instruction address increases linearly with time, which means that all the instructions have the same execution time.

Since this algorithm restricts the options of the scheduler, we expect a degradation in performance. With the variable-rate algorithm the performance of the system can be improved by setting one warp to be scheduled more frequently than the other warps. In this way the long-latency instruction, in this case the load from the global memory, is reached at different points in time. The warp that is scheduled more frequently reaches the memory instruction earlier. Since warps with close warp IDs have spatial locality, the fast warp will also bring in the data for the other warps, taking advantage of the spatial locality of the data. In this way the latency can be hidden more efficiently. Graphically, this behavior is presented in Figure 46: we have the same number of warps as in the case with the blue line, but the orange line is the one warp that is scheduled more frequently and the purple line represents the rest of the warps, which are scheduled with the same frequency.

The warp that is scheduled more frequently is executed faster, so its computation part has a larger slope than the initial execution (blue line). The memory part remains the same, since the memory latency does not depend on the scheduling frequency.

The rest of the warps (purple line) are scheduled less frequently than in the initial case (blue line): they are scheduled only in the slots where the fast warp is not scheduled. Consequently, the slope of the purple line is smaller than in the initial case. However, when the fast warp reaches the load instruction, it has to wait for the results; it is then not scheduled, and the rest of the warps can use all the slots. The slope of the purple line increases and becomes larger than that of the blue line, since fewer warps have to be scheduled.

With this scheduling scheme the performance of the GPU can be improved since the fast warp can reach the load to global memory instruction earlier, while the other warps are still working on previous instructions.

8.2. Implementation and Results

We have implemented the variable-rate scheduler in GPGPU-Sim. The implementation follows the same structure as the other schedulers in GPGPU-Sim, and the scheduling algorithm can be chosen from the configuration file.

Figure 47 shows an example execution for the even scheduler; the behavior of the odd scheduler is identical. For this example we configure GPGPU-Sim with perfect memories in order to make the execution of the different warps clearer. The time is presented on the horizontal axis and the vertical axis shows the instruction address. Only arithmetic instructions are executed in this example. W0 is scheduled with rate = 2, i.e. in every second issue slot (e.g. W0 W2 W0 W4 …). As a result of the higher scheduling rate, W0 (and W1 on the odd scheduler) finishes the execution of the kernel faster. The rest of the warps are executed in round-robin fashion.
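The issue pattern of this example can be generated with a few lines of code. The sketch below only illustrates the intended ordering (one fast warp in every second slot, the rest in round robin); the actual implementation in this work is part of GPGPU-Sim's scheduler code and is not reproduced here.

from itertools import cycle

def variable_rate_order(fast_warp, other_warps, n_slots):
    # One warp gets every second issue slot; the others share the remaining slots.
    others = cycle(other_warps)
    return [fast_warp if slot % 2 == 0 else next(others) for slot in range(n_slots)]

print(variable_rate_order(0, [2, 4, 6, 8, 10, 12, 14], 8))   # [0, 2, 0, 4, 0, 6, 0, 8]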


Figure 47 Example execution with perfect memories for variable rate scheduler.

The execution was also tested with a microbenchmark that uses only arithmetic instructions, this time with the memories enabled, in contrast with the previous example. The results are presented in Figure 48 and Figure 49 for the even and the odd scheduler. From the two graphs we see that there is a memory latency every 16 instructions. Since we do not perform any accesses to the global memory, instruction cache misses introduce this memory latency; on real hardware we did not notice so many cache misses.


Figure 48 Even scheduler, the warps are executed with variable rate scheduling.


Figure 49 Odd scheduler, the warps are executed with variable rate scheduling.

The results are compared with the schedulers available in GPGPU-Sim (Table 18): the round-robin (LRR) scheduler; the greedy-then-oldest (GTO) scheduler, which keeps issuing from the same warp until it blocks (e.g. on a global memory access) and then continues with the oldest warp; and the “two level” scheduler [21], a two-level hierarchical round-robin scheduler.

Table 18 Comparison of the variable-rate scheduler with the different schedulers from GPGPU-Sim.

Scheduler       IPC
LRR             14.6
Two Level       13.4
GTO             15.7
Variable rate   15.5

8.3. Conclusion

Dynamic scheduling of the warps can influence the time predictability of the GPU: different scheduling decisions at run time can lead to different execution times. We show that the variable-rate warp scheduler can be used to improve time predictability by assigning the warps to fixed slots. In addition, initial results show that when one warp is assigned a higher scheduling rate than the other warps, the performance can be improved, due to the inter-warp spatial locality of the data.

9. Conclusions and future work

First a summary of the findings is presented in subsection 9.1. Section 9.2 presents the future work and some options about how to use the timing model and the variable rate scheduler. Finally, section 9.3 concludes the report.

9.1. Summary

Nowadays GPUs gain popularity among the general purpose parallel architectures. Programming models like CUDA and OpenCL are used for GPGPU programming in order to accelerate general purpose applications. The timing behavior of the Fermi GPU architecture was examined through latency analysis. Microbenchmarks implemented in assembly that expose the timing information of the architecture were used. The experiments are compared with the GPGPU-Sim cycle level simulator. The main contributions of this project are:

• We found that the warp scheduler shows different timing behavior for 4% of the experiments for a microbenchmark with one “MAD” and one “ADD” instruction with a RaW dependency. We show that possible reasons for this behavior are the unavailability of the execution units or register file bank conflicts.

• The scratchpad memory was analyzed and we confirmed that it has time-predictable behavior when the number of bank conflicts is known. In addition, the variation of the execution time of loads to the global memory was measured.

• An instruction dependency with a cost of 6 cycles between the instructions of the same warp was observed, and the cost of the RaW dependency (18 clock cycles) was quantified. Consequently, we found the number of pipeline stages for arithmetic instructions. Furthermore, the latency of the shared and global memory instructions was measured. For the global memory we take into account the memory hiding for one warp.

• A timing model for the warp scheduler was introduced. The model was tested on the convolution separable benchmark from the CUDA SDK. The model was accurate for the average case, with 2% and 3% error for the rows and columns kernel respectively. For the best case the model had -15% and -54% error, and for the worst case 38% and 1% error, for the rows and columns kernel. Finally, we show that the model is straightforwardly scalable up to 6 warps.

• Finally, the variable-rate warp scheduler was proposed and implemented in GPGPU-Sim, based on the thread scheduler of the FlexPret architecture [16]. The algorithm improves the time predictability of the system by scheduling the warps in fixed slots. In addition, initial results show that a configuration in which one warp is scheduled with a higher rate than the others can also improve performance.

9.2. Future work

There is plenty of room to improve the Fermi GPU architecture with respect to time-predictable behavior. Below we present some directions for future work:

• Implementation of a tool that uses the model. It would take assembly code as input and produce an estimate of the execution time.

• The size of the write buffer has to be determined. To achieve that, we would use microbenchmarks with store instructions to the global memory: by gradually increasing the number of warps, the size of the buffer could be determined. In this way it would be possible to estimate when the write buffer becomes full and, as a result, to predict the WCET more precisely. Consequently, we would have a better estimation for the WCET by replacing equation 10 with equation 13.

Furthermore, the model can be modified to also include the time variation of the load instruction to the global memory. More precisely, equation 11 can be replaced by equation 14, where the function “f” represents the variation of the execution times from execution to execution.

 The scalability of the model can be extended. First it is necessary to define the behavior of the warp scheduler for more than 6 warps. Based on the behavior of the warp scheduler the model should be adapted accordingly.

• The model can be extended by using the level 1 cache model proposed in [37]. With this model we could predict the execution time of the memory accesses more precisely. In addition, in order to include the L2 cache memory in the analysis, we would design the L2 cache in a different way: it is very complex to analyze, since it is shared between the SMs and is unified for data and instructions. Some initial design decisions would be:

o The separation of the instruction and data L2 cache memories. This simplifies the analysis, since the data and instructions can be analyzed independently.
o A random replacement policy. With a random replacement policy we can analyze the behavior of the cache with probabilistic methods.

 The model can be extended in order to take into account the bank conflicts in the shared memory.

 Design space exploration can be performed for the different scheduling rates of the warps to find the optimal tradeoff between performance and time predictability. Finally, the variable-rate warp scheduler can be modified in GPGPU-Sim in order to be able to change the scheduling rates of the warps at run time. In this way the warp scheduler can have a more adaptive behavior. Therefore, the scheduler would be able to schedule warps with mixed criticality.
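
As a starting point for such an exploration, the sketch below shows one possible way to build a fixed slot table for a variable-rate warp scheduler. The slot pattern (one hypothetical "fast" warp in every second slot, the remaining warps sharing the other slots in round-robin order) and all names are illustrative assumptions; the actual slot assignment in the GPGPU-Sim implementation may differ.

#include <cstdio>
#include <vector>

// Sketch of a fixed-slot schedule for a variable-rate warp scheduler.
// Assumption (hypothetical): warp fastWarp gets every second slot, the
// remaining warps share the other slots in round-robin order.
std::vector<int> buildSlotTable(int numWarps, int fastWarp, int numSlots)
{
    std::vector<int> slots(numSlots);
    int rr = 0;                                    // round-robin pointer over the other warps
    for (int s = 0; s < numSlots; ++s) {
        if (s % 2 == 0) {
            slots[s] = fastWarp;                   // high-rate warp: fixed, frequent slots
        } else {
            if (rr == fastWarp) rr = (rr + 1) % numWarps;
            slots[s] = rr;                         // other warps: low, but fixed, rate
            rr = (rr + 1) % numWarps;
        }
    }
    return slots;
}

int main()
{
    // 6 warps, warp 0 scheduled at a higher rate, 12 slots per scheduling round
    std::vector<int> table = buildSlotTable(6, 0, 12);
    for (int s = 0; s < (int)table.size(); ++s)
        std::printf("slot %2d -> warp %d\n", s, table[s]);
    return 0;
}

Because every warp keeps the same slots in every scheduling round, the issue times of its instructions are known in advance, which is what makes this type of scheduler attractive for time predictability.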

9.3. Conclusion

The Fermi GPU architecture was analyzed to get a better insight into its time behavior. Several characteristics of the GPU architecture are advantageous for time predictability.

The SIMT model provides a hierarchy of threads that are executed in parallel. Thus, the resources of the system are assigned in a hierarchical way and can be analyzed more easily. In addition, Fermi uses configurable scratchpad - cache memories. The scratchpad memory is controlled by software and so it can be used for time predictable memory accesses. Furthermore, the control flow mechanism does not use speculation mechanisms (e.g. branch prediction), which would make the analysis much more complicated. Due to the instruction dependency cost, the warp scheduler has time predictable behavior for up to 6 warps. The main source of variation was the out-of-chip global memory. There are known solutions that improve the time predictability of the out-of-chip memory accesses. Finally, GPGPUs are nowadays a hot research topic, and so changes and improvements in the hardware architecture that improve time predictability can be adopted more easily.

In conclusion, the Fermi GPU architecture is suitable for real-time applications with soft deadlines. The initial results show that with a few changes in the hardware architecture it can have time predictable behavior, and so it can also be suitable for real-time applications with stricter timing constraints.


References

[1] Leischner, Nikolaj, Vitaly Osipov, and Peter Sanders. "Fermi architecture white paper." (2009).

[2] Wong, Henry, M-M. Papadopoulou, Maryam Sadooghi-Alvandi, and Andreas Moshovos. "Demystifying GPU microarchitecture through microbenchmarking." In Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on, pp. 235-246. IEEE, 2010.

[3] Bakhoda, Ali, George L. Yuan, Wilson WL Fung, Henry Wong, and Tor M. Aamodt. "Analyzing CUDA workloads using a detailed GPU simulator." In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pp. 163-174. IEEE, 2009.

[4] http://www.gpgpu-sim.org/

[5] Lai, Junjie, and André Seznec. "Performance upper bound analysis and optimization of sgemm on fermi and kepler gpus." In Code Generation and Optimization (CGO), 2013 IEEE/ACM International Symposium on, pp. 1-10. IEEE, 2013.

[6] NVIDIA, Nvidia's next generation compute architecture: Kepler GK110, Whitepaper, NVIDIA Corp., 2012.

[7] Wilhelm, Reinhard, Jakob Engblom, Andreas Ermedahl, Niklas Holsti, Stephan Thesing, David Whalley, Guillem Bernat et al. "The worst-case execution-time problem—overview of methods and survey of tools." ACM Transactions on Embedded Computing Systems (TECS) 7, no. 3 (2008): 36

[8] Buttazzo, Giorgio C. "Hard real-time computing systems." Predictable scheduling algorithms and applications (3rd edition).

[9] Thiele, Lothar, and Reinhard Wilhelm. "Design for timing predictability." Real-Time Systems 28, no. 2-3 (2004): 157-177

[10] Kirner, Raimund, and Peter Puschner. "Time-predictable computing." In Software Technologies for Embedded and Ubiquitous Systems, pp. 23-34. Springer Berlin Heidelberg, 2011.

[11] Grund, Daniel, Jan Reineke, and Reinhard Wilhelm. "A Template for Predictability Definitions with Supporting Evidence." In PPES, pp. 22-31. 2011.

[12] Zhang, Wei, and Yiqiang Ding. "Standard deviation of CPI: A new metric to evaluate architectural time predictability." In Application-Specific Systems, Architectures and Processors (ASAP), 2013 IEEE 24th International Conference on, pp. 111-112. IEEE, 2013.

[13] Schoeberl, Martin. "Is time predictability quantifiable?." In Embedded Computer Systems (SAMOS), 2012 International Conference on, pp. 333-338. IEEE, 2012.


[14] Lickly, Ben, Isaac Liu, Sungjun Kim, Hiren D. Patel, Stephen A. Edwards, and Edward A. Lee. "Predictable programming on a precision timed architecture." In Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems, pp. 137-146. ACM, 2008.

[15] Schoeberl, Martin. "Time-predictable computer architecture." EURASIP Journal on Embedded Systems 2009 (2009): 2.

[16] Zimmer, Michael, David Broman, Christopher Shaver, and Edward A. Lee. "FlexPRET: A Processor Platform for Mixed-Criticality Systems." In Proceedings of the 20th IEEE Real-Time and Embedded Technology and Application Symposium (RTAS). IEEE. 2014.

[17] Prakash, Aayush, and Hiren D. Patel. "An instruction scratchpad memory allocation for the precision timed architecture." In Proceedings of the Conference on Design, Automation and Test in Europe, pp. 659-664. EDA Consortium, 2012.

[18] Burns, Alan, and David Griffin. "Predictability as an emergent behaviour." In Robert I. Davis and Linh T.X. Phan, editors, 4th Workshop on Compositional Theory and Technology for Real-Time Embedded Systems, pp. 27-29. 2011.

[19] Heckmann, Reinhold, Marc Langenbach, Stephan Thesing, and Reinhard Wilhelm. "The influence of processor architecture on the design and the results of WCET tools." Proceedings of the IEEE 91, no. 7 (2003): 1038-1054.

[20] Reineke, Jan, Daniel Grund, Christoph Berg, and Reinhard Wilhelm. "Timing predictability of cache replacement policies." Real-Time Systems 37, no. 2 (2007): 99-122.

[21] Narasiman, Veynu, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt. "Improving GPU performance via large warps and two-level warp scheduling." In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 308-317. ACM, 2011.

[22] Jog, Adwait, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. "Orchestrated scheduling and prefetching for gpgpus." In Proceedings of the 40th Annual International Symposium on Computer Architecture, pp. 332-343. ACM, 2013.

[23] Schoeberl, Martin. Jop: A java optimized processor for embedded real-time systems. VDM Publishing, 2008.

[24] Betts, Adam, and Alastair Donaldson. "Estimating the WCET of GPU-accelerated applications using hybrid analysis." Real-Time Systems (ECRTS), 2013 25th Euromicro Conference on. IEEE, 2013.

[25] Wegener, Joachim, and Frank Mueller. "A comparison of static analysis and evolutionary testing for the verification of timing constraints." Real-Time Systems 21.3 (2001): 241-268.


[26] Hirvisalo, Vesa. "On Static Timing Analysis of GPU Kernels." 14th International Workshop on Worst-Case Execution Time Analysis. Ed. Heiko Falk. Vol. 39. Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, 2014.

[27] Chattopadhyay, Sudipta, Lee Kee Chong, Abhik Roychoudhury, Timon Kelter, Peter Marwedel, and Heiko Falk. "A unified WCET analysis framework for multicore platforms." ACM Transactions on Embedded Computing Systems (TECS) 13, no. 4s (2014): 124.

[28] Hong, Sunpyo, and Hyesoon Kim. "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness." ACM SIGARCH Computer Architecture News. Vol. 37. No. 3. ACM, 2009.

[29] Mushtaq, Hamid, Zaid Al-Ars, and Koen Bertels. "Accurate and efficient identification of worst-case execution time for multicore processors: A survey." IDT, 2013.

[30] van den Braak, Gert-Jan, Juan Gómez-Luna, Henk Corporaal, Jose Maria Gonzalez-Linares, and Nicolas Guil. "Simulation and architecture improvements of atomic operations on GPU scratchpad memory." In Computer Design (ICCD), 2013 IEEE 31st International Conference on, pp. 357-362. IEEE, 2013.

[31] Lundqvist, Thomas, and Per Stenstrom. "Timing anomalies in dynamically scheduled microprocessors." Real-Time Systems Symposium, 1999. Proceedings. The 20th IEEE. IEEE, 1999.

[32] Wilhelm, Reinhard, et al. "Memory hierarchies, pipelines, and buses for future architectures in time-critical embedded systems." Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 28.7 (2009): 966-978.

[33] Akesson, Benny, Kees Goossens, and Markus Ringhofer. "Predator: a predictable SDRAM memory controller." Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis. ACM, 2007.

[34] http://docs.nvidia.com/cuda/cuda-c-programming-guide/#axzz3FIq0pVOb

[35] Patterson, David. "The top 10 innovations in the new NVIDIA Fermi architecture, and the top 3 next challenges." NVIDIA Whitepaper (2009).

[36] https://code.google.com/p/asfermi/

[37] Nugteren, Cedric, Gert-Jan van den Braak, Henk Corporaal, and Henri Bal. "A Detailed GPU Cache Model Based on Reuse Distance Theory." In HPCA-20: International Symposium on High Performance Computer Architecture. IEEE. 2014.

[38] Hennessy, John L., and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2012.


Appendix A: Detailed experiments

For completeness the experiments that were performed to identify and quantify the sources of timing variation are presented. The experiments presented in this section have the expected behavior. The experiments that expose the time variation are presented in section 5. First, the experiments for the arithmetic instructions are shown. Next the experiments for the shared memory are shown. Afterwards, the global memory is analyzed. Finally, the experiments for the different warp schedulers in GPGPU-Sim are shown.

A.1. Arithmetic instruction

For the arithmetic instructions the operations addition (ADD), multiplication (MUL), cosine (COS) and multiply-add (MAD) were used. In Table 19 the execution time of the micro-benchmarks is presented for different numbers of operations with no dependencies. Table 19 shows that all the instructions consume the same amount of time. This means that the latency of the instructions is adjusted to match the slowest one. This behavior favors throughput and makes the pipeline easier to implement. The first instruction is measured as 30 clock cycles because the cost of the measurement (one shift operation in this case) has to be taken into account as well. For every further instruction, a delay of 6 clock cycles is required for the decoding and scheduling steps that are necessary for every instruction. So, every 6 cycles a new instruction starts.

Table 19 Execution time of arithmetic Instructions in clock cycles - No dependencies

#OP  ADD  MUL  COS  MAD
1    30   30   30   30
2    36   36   36   36
3    42   42   42   42
4    48   48   48   48
5    54   54   54   54
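
For illustration, a minimal CUDA-level analogue of such a micro-benchmark is sketched below. The thesis micro-benchmarks are written directly in Fermi assembly (using asfermi), so the compiler-generated code, and therefore the exact measured values, will generally differ from the table above; the kernel and variable names are illustrative assumptions, and the compiler may reorder or merge the timed instructions.

#include <cstdio>
#include <cuda_runtime.h>

// Sketch of an arithmetic micro-benchmark: time a short sequence of
// independent additions with the per-SM clock register (%clock).
__global__ void addNoDependency(unsigned int *cycles, int *sink, int a)
{
    int t0 = threadIdx.x, t1 = t0 + 1, t2 = t0 + 2, t3 = t0 + 3, t4 = t0 + 4;

    clock_t start = clock();        // first clock read
    t0 += a;                        // independent additions: no RaW dependencies
    t1 += a;
    t2 += a;
    t3 += a;
    t4 += a;
    clock_t stop = clock();         // second clock read

    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if ((threadIdx.x & 31) == 0)    // one thread per warp records the elapsed cycles
        cycles[warpId] = (unsigned int)(stop - start);
    sink[blockIdx.x * blockDim.x + threadIdx.x] = t0 + t1 + t2 + t3 + t4;  // keep the additions live
}

int main()
{
    const int threads = 32;         // one warp, as in the experiments above
    unsigned int *dCycles, hCycles;
    int *dSink;
    cudaMalloc(&dCycles, sizeof(unsigned int));
    cudaMalloc(&dSink, threads * sizeof(int));
    addNoDependency<<<1, threads>>>(dCycles, dSink, 7);
    cudaMemcpy(&hCycles, dCycles, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    std::printf("warp 0: %u clock cycles\n", hCycles);
    cudaFree(dCycles);
    cudaFree(dSink);
    return 0;
}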

Similarly, Table 20 shows the results from the experiment where all the instructions have a RaW dependency. The first line is the same as in the previous experiment, since it contains only one instruction. For the next instructions, it can be noticed that a delay of 18 clock cycles is required for “ADD”, “MUL” and “MAD”. So, for these instructions the cost of waiting for the dependency is 18 clock cycles. The delay for the data dependency of the “COS” instruction is larger (22 clock cycles). The “COS” instruction is implemented in the special function unit (SFU).

The SFU is a more complex functional unit for special operations like cosine and sine. The implementation details of this functional unit are not available, but we can assume that the SFU has a deeper pipeline than the SPs, so a stall may influence more stages in the pipeline. In addition, there are only 4 SFUs, so more steps are required compared to the instructions that are implemented in the SPs.


Table 20 Execution time of arithmetic Instructions in clock cycles - RaW dependency.

#OP  ADD  MUL  COS  MAD
1    30   30   30   30
2    48   48   52   48
3    66   66   74   66
4    84   84   96   84
5    102  102  118  102
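
The two tables suggest a simple linear cost model for a single warp: the first instruction costs about 30 cycles (including the measurement overhead), every further independent instruction adds 6 cycles, and every further RaW-dependent ADD/MUL/MAD adds 18 cycles (22 for COS). The sketch below is a hypothetical helper in the spirit of the tool proposed in the future work; the constants are taken directly from Tables 19 and 20 and are only meaningful for this single-warp, arithmetic-only scenario.

#include <cstdio>
#include <vector>

enum class Step { Independent, RawAlu, RawCos };   // cost of an instruction relative to the previous one

// Hypothetical single-warp estimate for a straight-line arithmetic sequence,
// using the constants observed in Tables 19 and 20: first instruction ~30
// cycles (incl. measurement cost), +6 per independent instruction, +18 per
// RaW-dependent ADD/MUL/MAD, +22 per RaW-dependent COS.
unsigned int estimateCycles(const std::vector<Step> &followingSteps)
{
    unsigned int cycles = 30;                       // first instruction
    for (Step s : followingSteps) {
        switch (s) {
            case Step::Independent: cycles += 6;  break;
            case Step::RawAlu:      cycles += 18; break;
            case Step::RawCos:      cycles += 22; break;
        }
    }
    return cycles;
}

int main()
{
    // 5 independent ADDs -> 54 cycles (Table 19), 5 RaW-chained MADs -> 102 cycles (Table 20)
    std::vector<Step> indep(4, Step::Independent);
    std::vector<Step> chained(4, Step::RawAlu);
    std::printf("5 independent: %u cycles\n", estimateCycles(indep));
    std::printf("5 RaW-chained: %u cycles\n", estimateCycles(chained));
    return 0;
}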

In Table 21 the execution time of mixed-type arithmetic instructions is presented; namely “MUL” with “ADD”, “MAD” with “ADD”, and “COS” with “ADD”. With this experiment we want to check whether the instructions are composable. In this experiment there is no dependency between the instructions. The experiment was performed to identify whether the interaction of different functional units can influence the execution time. We can observe that the results in the table are the same as the results of the experiments where only instructions of the same type were used. The difference between consecutive rows (two additional instructions) is 12 clock cycles (2*6).

Table 21 Execution time of mixed type arithmetic Instructions in clock cycles – No dependencies.

#OP  MUL+ADD  MAD+ADD  COS+ADD
2    36       36       36
4    48       48       48
6    60       60       60
8    72       72       72

Similarly, Table 22 shows the results from the experiment where all the instructions have a RaW dependency. For example, in the case of 4 operations there are 3 RaW dependencies between the operations. As previously, the results from the experiments are similar to the ones with instructions of the same type. In this work we focus on the instructions that are commonly used; we are not going to examine in more detail the instructions that are implemented in the SFU.

Table 22 Execution time of mixed arithmetic Instructions in clock cycles – RaW dependencies.

#OP  MUL+ADD  MAD+ADD  COS+ADD
2    48       48       52
4    84       86       92
6    120      122      132
8    156      160      172

As mentioned before, the experiments presented so far used a thread block size of 32. The following graphs in this subsection present the behavior of the GPU for different thread block sizes. The experiments present the execution time of instructions for thread block sizes of 32, 64, 128, 256 and 512. On the vertical axis is the execution time in clock cycles. On the horizontal axis is the warp ID of the warps in the different block sizes. A group of 32 threads forms a warp, which executes the same instruction if no divergence occurs.


Figure 50 shows the execution time of one add instruction for thread blocks of sizes 32, 64, 128, 256 and 512. It can be seen that up to a thread block size of 128 the execution time is the same for all warps. However, the execution time for the block sizes of 256 and 512 is increased. By increasing the number of warps, the execution time of every warp increases as well, since more warps have to be scheduled. The instructions of the warps are executed based on the LRR algorithm. So, by increasing the number of warps, the distance between two instructions from the same warp increases, and as a consequence the execution time of the warp increases as well.


Figure 50 one add instruction for different thread block sizes.

We executed the experiment 8 times for a thread block size of 512 threads in order to determine whether the procedure is time predictable. The results are presented in Figure 51, and it can be noticed that for all the experiments the execution time of the different warps is the same.


Figure 51 multiple experiments for one add instruction executed by 512 threads (16 warps).


Similar experiments were performed for the mad instruction. The results are presented in Figure 52 and Figure 53 and follow the same pattern as the add instruction. However, for thread block sizes of 256 and 512 it can be seen that the execution times of the warps differ from the execution times of the add instruction. For instance, in the case of a block size of 512, for add warp 1 is the fastest one, while for mad warp 7 is the fastest one. This level of detail could not be analyzed without the appropriate documentation, so we were not able to identify the reason for the different behavior of “MAD” and “ADD” for thread block sizes of 256 and 512.


Figure 52 One mad instruction for different thread block sizes.


Figure 53 multiple experiments for one mad instruction executed by 512 threads (16 warps).

The next experiments use a mad and an add instruction for the same thread block sizes as used before. Figure 54 shows the results for the experiment where no data dependencies are used. The results are similar to the previous experiments; however, it can be noticed that the execution time for a thread block size of 256 has a bigger variation from warp to warp compared to the separate execution of mad and add.


Figure 54 mad-add instructions with no dependency for different thread block sizes

Similar to the previous cases, the experiments were performed multiple times to identify whether the execution times of the two instructions differ from experiment to experiment. In Figure 55 it can be seen that all the experiments have the same execution time, so no variation in the execution time was observed.


Figure 55 Multiple experiments for mad add instruction with no dependency executed by 512 threads (16 warps).

A.2. Shared memory

In this section the instructions for the scratchpad shared memory are presented. The behavior is similar to the arithmetic instructions. Again, as before, we first perform experiments with a few instructions. Afterwards, the experiments are executed multiple times to check whether the results are the same from execution to execution.



Figure 56 one load to the shared memory instruction for different thread block sizes

Figure 56 shows the execution time of one load instruction to the shared memory. The behavior is the same as with the arithmetic instructions. The horizontal axis shows the warp ID for the different warps and the vertical axis shows the execution time in clock cycles. For thread block sizes 32, 64 and 128 the execution time remains the same due to the restrictions of the scheduler that we explained in sections 5 and 6. Similarly to the arithmetic instructions, the execution time for thread block sizes 256 and 512 is increased since more than 6 warps have to be scheduled.
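
A CUDA-level analogue of this micro-benchmark is sketched below; as before, it only approximates the hand-written assembly benchmarks and all names are illustrative assumptions. In this sketch all threads read the same shared memory address, so the access is a broadcast and causes no bank conflicts, and the dependent addition after the load forces the warp to wait for the load result before the second clock read.

#include <cstdio>
#include <cuda_runtime.h>

// Sketch of a shared memory load micro-benchmark: one load from the on-chip
// scratchpad, timed per warp with the per-SM clock register.
__global__ void sharedLoad(unsigned int *cycles, int *sink)
{
    __shared__ int buf[32];
    if (threadIdx.x < 32)
        buf[threadIdx.x] = threadIdx.x;   // initialize the scratchpad
    __syncthreads();

    clock_t start = clock();              // first clock read
    int v = buf[0];                       // one load from shared memory (broadcast)
    v = v + 1;                            // RaW dependency: wait for the load result
    clock_t stop = clock();               // second clock read

    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if ((threadIdx.x & 31) == 0)
        cycles[warpId] = (unsigned int)(stop - start);
    sink[blockIdx.x * blockDim.x + threadIdx.x] = v;   // keep the load live
}

int main()
{
    const int threads = 512;              // 16 warps, as in Figure 57
    unsigned int hCycles[16], *dCycles;
    int *dSink;
    cudaMalloc(&dCycles, 16 * sizeof(unsigned int));
    cudaMalloc(&dSink, threads * sizeof(int));
    sharedLoad<<<1, threads>>>(dCycles, dSink);
    cudaMemcpy(hCycles, dCycles, sizeof(hCycles), cudaMemcpyDeviceToHost);
    for (int w = 0; w < 16; ++w)
        std::printf("warp %2d: %u clock cycles\n", w, hCycles[w]);
    cudaFree(dCycles);
    cudaFree(dSink);
    return 0;
}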


Figure 57 multiple experiments for one load to shared memory instruction executed by 512 threads (16 warps)


We performed the same experiment multiple times. For all 8 experiments the execution times of the different warps were the same. This behavior is expected since the scratchpad memory is a software controlled, on-chip memory and there is no interference with any other component.

In the same way the experiments for the store instruction to the shared memory were performed. In Figure 58 the execution time of one store instruction is presented for block sizes of 32, 64, 128, 256 and 512. Again, the execution time for thread block sizes 32 to 128 is the same as executing only one warp. For thread block sizes 256 and 512 the execution time of one store to the shared memory increases for the same reasons that we mentioned before.


Figure 58 one store to the shared memory instruction for different thread block sizes

The experiment for one store instruction to the on-chip scratchpad shared memory was performed multiple times in order to identify whether the execution time differs from experiment to experiment.



Figure 59 multiple experiments for one store to shared memory instruction executed by 512 threads (16 warps)

From the experiments we can see that they have the same execution time, so the scratchpad memory has time predictable behavior also for the store instruction.

Next, the shared memory is analyzed for multiple instructions. For the same thread block sizes as used before, we check how the execution time changes for an increasing number of loads and stores to the shared memory. Figure 60 and Figure 61 show the latency of loads and stores to the shared memory for an increasing number of instructions. The horizontal axis presents the number of operations and the vertical axis the execution time in clock cycles. We present only the slowest warp, since the slowest warp determines the overall execution time.

From the two diagrams it can be seen that the latency of stores and loads to the shared memory increases linearly with the linear increase of the number of instructions for all thread block sizes. Moreover, for thread block sizes 32, 64, 128 and 256 the execution times of the slowest warp are the same; the lines are on top of each other. For the case of 512 threads the execution time increases, since more warps have to be scheduled.


[Plot: load to shared memory, same address, thread block sizes 32 to 512, slowest warp; execution time in clock cycles versus number of LDS instructions]

Figure 60 Latency of the load to shared memory instruction for increasing number of instructions


Figure 61 Latency of the store to shared memory instruction for increasing number of instructions

A.3. Global memory

In this section the execution time of loads and stores to the global memory is discussed. The experiments were performed in a similar way as before. The horizontal axis shows the number of operations and the vertical axis shows the execution time in clock cycles. Figure 62 shows the latency of up to 32 load instructions to the global memory for the slowest warp of thread block sizes 32, 64, 128, 256 and 512. For all the cases the latency of the load instructions increases linearly with a linear increase of the number of instructions. Furthermore, the slope is the same for all thread block sizes, but the lines start from different initial points.



Figure 62 Thirty two load instructions to the global memory for thread block sizes of 32, 64, 128, 256 and 512.

[Plot: store to global memory, same address, thread block sizes 32 to 512, slowest warp; execution time in clock cycles versus number of ST instructions]

Figure 63 Thirty two store instructions to the global memory for thread block sizes of 32, 64, 128, 256 and 512.

Figure 63 shows up to 32 stores to the global memory for the slowest warp of thread block sizes 32, 64, 128, 256 and 512. For thread block sizes of 32, 64 and 128 the execution time is the same, and for these three cases the execution time increases linearly with the linear increase of the number of instructions. Compared to the load instruction, the execution time is smaller for these thread block sizes, but for thread block sizes of 256 and 512 it is much larger.
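
For reference, a CUDA-level sketch of a global memory load micro-benchmark in the spirit of Figure 62 is given below (an approximation of the assembly benchmarks; the constant N_LOADS, the volatile qualifier used to keep the repeated loads, and all names are illustrative assumptions).

#include <cstdio>
#include <cuda_runtime.h>

#define N_LOADS 8   // number of load instructions per thread (the experiments go up to 32)

// Sketch of a global memory load micro-benchmark: N_LOADS loads from the same
// global address, timed per warp with the per-SM clock register. The volatile
// qualifier prevents the compiler from merging the repeated loads.
__global__ void globalLoads(const volatile int *src, unsigned int *cycles, int *sink)
{
    int acc = 0;

    clock_t start = clock();                  // first clock read
    #pragma unroll
    for (int i = 0; i < N_LOADS; ++i)
        acc += src[0];                        // repeated loads from the same address
    clock_t stop = clock();                   // second clock read

    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if ((threadIdx.x & 31) == 0)
        cycles[warpId] = (unsigned int)(stop - start);
    sink[blockIdx.x * blockDim.x + threadIdx.x] = acc;   // keep the loads live
}

int main()
{
    const int threads = 512;                  // 16 warps, as in Figure 62
    int *dSrc, *dSink;
    unsigned int *dCycles, hCycles[16];
    cudaMalloc(&dSrc, sizeof(int));
    cudaMemset(dSrc, 0, sizeof(int));
    cudaMalloc(&dSink, threads * sizeof(int));
    cudaMalloc(&dCycles, 16 * sizeof(unsigned int));
    globalLoads<<<1, threads>>>(dSrc, dCycles, dSink);
    cudaMemcpy(hCycles, dCycles, sizeof(hCycles), cudaMemcpyDeviceToHost);
    unsigned int slowest = 0;
    for (int w = 0; w < 16; ++w)
        if (hCycles[w] > slowest) slowest = hCycles[w];
    std::printf("slowest warp: %u clock cycles for %d loads\n", slowest, N_LOADS);
    cudaFree(dSrc);
    cudaFree(dSink);
    cudaFree(dCycles);
    return 0;
}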


Appendix B: Detailed GPGPU-Sim diagram

In this section the architecture of GPGPU-Sim is analyzed in more detail. Figure 64 shows a detailed architecture block diagram of GPGPU-Sim [3]. It was made based on Figure 10; the components of Figure 10 were extracted. The detailed functionality can be found in [ref].

The diagram is separated into two parts. First there is the SIMT front-end. The instructions are fetched from the instruction cache, decoded and stored in the instruction buffer. The RaW and WaW dependencies are checked in the scoreboard. From the implementation of GPGPU-Sim we found that there is only one scoreboard for the two schedulers, in contrast with Figure 10, which suggests that there is one for every scheduler. The scheduler chooses which warp is going to be issued by the dispatch unit. Furthermore, the SIMT stack is used to handle branch divergence. It stores the divergence points (PC) and reconvergence points (RPC) of each warp.
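
To illustrate the role of the scoreboard described above, a much simplified per-warp scoreboard check is sketched below: an instruction may only issue when none of its source or destination registers are still pending as the destination of an earlier, unfinished instruction. This is only a conceptual illustration with invented class and method names; the actual scoreboard implementation in GPGPU-Sim differs in its details.

#include <cstdio>
#include <set>
#include <vector>

// Much simplified per-warp scoreboard: tracks destination registers of
// instructions that have issued but not yet written back. An instruction may
// issue only if none of its source or destination registers are pending
// (RaW and WaW hazards).
class Scoreboard {
    std::vector<std::set<int>> pending;   // per-warp set of pending destination registers
public:
    explicit Scoreboard(int numWarps) : pending(numWarps) {}

    bool canIssue(int warp, const std::vector<int> &srcRegs, int dstReg) const {
        for (int r : srcRegs)
            if (pending[warp].count(r)) return false;   // RaW hazard
        return pending[warp].count(dstReg) == 0;        // WaW hazard
    }
    void issue(int warp, int dstReg)     { pending[warp].insert(dstReg); }
    void writeback(int warp, int dstReg) { pending[warp].erase(dstReg); }
};

int main()
{
    Scoreboard sb(2);
    sb.issue(0, 5);                                        // r5 issued as destination for warp 0
    std::printf("%d\n", (int)sb.canIssue(0, {5}, 6));      // 0: RaW hazard on r5
    sb.writeback(0, 5);
    std::printf("%d\n", (int)sb.canIssue(0, {5}, 6));      // 1: hazard cleared
    return 0;
}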

The SIMD datapath consists of the operand collector, the shared memory (scratchpad and level 1 data cache) and the SIMD execution units (SP and SFU). In practice the register file is a small SRAM memory with 4 banks and arbitration logic that is responsible for avoiding bank conflicts.

The blue parts show the memories (instruction and data). The green parts are the buffers of the system: the scoreboard, the PCs for all the threads, the instruction buffer, the SIMT stack and the collector units. The orange parts are the arbiters of the architecture. The brown color shows the register file with the four banks. The purple color shows the SIMD execution units and the load/store units.


Figure 64 detailed GPGPU-SIM architecture diagram


Appendix C: Program trace from GPGPU-Sim

The tables show the three different states that a warp can be in. The first column presents the starting time in clock cycles. The remaining columns show, for the odd and the even scheduler, the warp ID (WID) and the instruction that is executed. Color meaning:

 Green: The warp is executed.
 Red: The warp is stalled.
 Yellow: The warp is stalled in the scoreboard, so it has to wait for the dependencies to be resolved.
 Purple: The execution units are not available.

Active

Starting time   WID   Instruction (odd scheduler)   WID   Instruction (even scheduler)
85              9     add.u32 t16, t16, 10          0     add.u32 t15, t15, 10
86              11    add.u32 t16, t16, 10          2     mov.u32 time1, %clock
87              13    add.u32 t16, t16, 10          4     mov.u32 time1, %clock
88              15    add.u32 t16, t16, 10          6     mov.u32 time1, %clock

Idle because of RaW dependencies.

Starting time   WID       Instruction (odd scheduler)   WID       Instruction (even scheduler)
250             15        add.u64 %rd4, %rd3, %rd2      4,10,12   Scoreboard delay
251             7,9,11    Scoreboard delay              6         st.global.u32 [%rd6+0], %r9
252             9,11,15   Scoreboard delay              8         ld.param.u64 %rd5
253             11,15     Scoreboard delay              10,12,4   Scoreboard delay
254             9,11,15   Scoreboard delay              14        ld.param.u64 %rd5
255             9,11,15   Scoreboard delay              0         ld.param.u64 %rd5

Idle because the execution units are not available.

Starting time   WID            Instruction (odd scheduler)   WID            Instruction (even scheduler)
108             1              mov.u32 time1, %clock         6              add.u32 t16, t15, t14
109             3,5,7,9,13     N/A SP&SFU                    8              add.u32 t13, t12, t11
110             1,3,5,7,9,11   N/A SP&SFU                    10             add.u32 t13, t12, t11
111             1,3,5,7,9,11   N/A SP&SFU                    12             add.u32 t13, t12, t11
112             3              add.u32 t13, t12, t11         14             shl.b32 time1, time1, 1
113             5              shl.b32 time1, time1, 1       0              shl.b32 time1, time1, 1
114             7              add.u32 t13, t12, t11         2              mov.u32 time2, %clock
115             9              add.u32 t13, t12, t11         4              add.u32 t16, t15, t14
116             11             add.u32 t13, t12, t11         6              mov.u32 time2, %clock
117             1,3,7,9,11,5   N/A SP&SFU                    8              add.u32 t16, t15, t14
118             1,3,5,7,11,5   N/A SP&SFU                    0,4,10,12,14   N/A SP&SFU


Appendix D: Acronyms and Abbreviations

Abbreviation   Meaning

ACET       Average Case Execution Time
ADD        Addition
BCET       Best Case Execution Time
CC         Clock Cycles
CPI        Cycles Per Instruction
CPU        Central Process Unit
CTA        Cooperative Thread Array
CUBIN      CUDA Binary
CUDA       Compute Unified Device Architecture
D2H        Device to Host
DMA        Direct Memory Access
GDDR       Graphics Double Data Rate
GDRAM      Graphic Dynamic Random Access Memory
GPGPU      General Purpose Graphics Process Unit
GPGPU-Sim  General Purpose Graphics Process Unit Simulator
GPU        Graphics Process Unit
GTO        Greedy Then Oldest
H2D        Host to Device
ISA        Instruction Set Architecture
KB         Kilo Byte
L1 $D      Level 1 Data cache
L1 $I      Level 1 Instruction cache
LB         Lower Bound
LD.E       Load External
LD/ST      Load/Store
LDS        Load Shared
LRR        Loose Round Robin
LRU        Least Recently Used
MAD        Multiply Addition
MWP        Memory Warp Parallelism
NVCC       Nvidia CUDA Compiler
OoO        Out of Order
OpenCL     Open Computing Language
PCI        Peripheral Component Interconnect
PTX        Parallel Thread Execution
RAM        Random Access Memory
RaW        Read after Write
S2R        Store special register to general-purpose register
SDK        Software Development Kit
SFU        Special Function Unit
SHL        Shift Left
SIMD       Single Instruction Multiple Data
SIMT       Single Instruction Multiple Threads
SM         Streaming Multiprocessor
SP         Streaming Processor
SPARC      Scalable Processor ARChitecture
ST.E       Store External
STS        Store Shared
TBi        Thread Block with ID i
Ti         Thread with ID i
TL         Two Level
TLP        Thread Level Parallelism
UB         Upper Bound
WaW        Write after Write
WCET       Worst Case Execution Time
Wi         Warp with ID i
WID        Warp ID
