Optimization of Stencil Computations on GPUs

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Prashant Singh Rawat, M.Tech.
Graduate Program in Computer Science and Engineering

The Ohio State University
2018

Dissertation Committee:

Prof. P. Sadayappan, Advisor
Prof. Atanas Rountev
Prof. Gagan Agrawal

© Copyright by

Prashant Singh Rawat

2018

ABSTRACT

Stencil computations form the compute-intensive core of many scientific application domains, such as CT and MRI image processing, computational electromagnetics, seismic processing, and climate modeling. A stencil computation involves an element-wise update of an output domain based on a fixed set of neighboring points from the input domain. Such stencil computations are either time-iterated, or require successive application of multiple stencil operators on the input domains.

Stencil optimization on multi- and many-core architectures has been an active research topic for the past two decades. Stencil computations traditionally have low arithmetic intensity, with only a few floating-point operations performed relative to the data transferred per output point, and are therefore memory bandwidth-bound. Since the data movement cost consistently dominates the computational cost on modern architectures, most of these research efforts focus on reducing the data movement in stencils to tackle the bandwidth bottleneck. Consequently, several tiling techniques have been proposed over the years to exploit spatial and temporal reuse across a sequence of stencils or across multiple time steps of a time-iterated stencil.

With the ever-increasing use of GPUs for general purpose computing, application developers have started exploring the acceleration of data-parallel stencils on GPUs.

GPUs have lower data movement costs than multi-core CPU architectures, and hence are an attractive target for accelerating memory bandwidth-bound stencil computations. At the same time, GPUs are compute-intensive, with a significantly higher number of registers per thread, and are therefore also suitable for accelerating stencil computations with high arithmetic intensity. The arithmetic intensity of a stencil is proportional to its order, which is the number of input elements read from the center along each dimension. In many scientific applications, high-order stencils provide better computational accuracy with less data movement than their low-order counterparts. However, the main performance bottleneck for high-order stencils on GPUs is the high register pressure, which causes excessive register spills or a steep drop in achieved parallelism, resulting in a subsequent performance loss.

This dissertation proposes novel GPU-centric optimization strategies that address the performance bottlenecks for stencils with different arithmetic intensities: tiling and fusion heuristics for bandwidth-bound stencils with low arithmetic intensity, and register optimizations for high-order stencils with high arithmetic intensity. The proposed optimizations have been implemented in a DSL-based stencil optimization framework, StencilGen, that can automatically generate high-performance CUDA code from an input DSL specification of the stencil computation. The efficacy of the proposed optimizations is demonstrated via empirical evaluation on a variety of 2D and 3D stencil kernels extracted from PDE solvers, image processing pipelines, and proxy DOE applications.

To my mother and my sisters, for their unbounded love and unconditional support.

ACKNOWLEDGMENTS

Now that I look back, this part of life was, in measures, exciting, overwhelming and exhausting. I experienced everything, from first-year jitters, second-year blues, and homesickness, to working tirelessly in burst mode, frustration over rejected papers, and the thrill of publishing; I will cherish these experiences for life. Thankfully, I was not alone in this journey. There were many people who treaded along, making things easier, and they deserve a shout-out.

I could not have found a better advisor than Prof. P. Sadayappan. I have improved on many aspects by listening to his discussions, critiques, and invaluable advice over the past five years. He steered my research when I needed it, and gave me free rein at other times to hone the independent researcher in me. He provided me with internship opportunities, supported me through tough times, and, I am certain, helped in many more behind-the-scenes situations that I would not even know about. He is an exceptional mentor to all his students.

I am grateful to Prof. Atanas Rountev for his constant help during the past five years, and especially during the first two years of grad school. Most of my papers made it to publication because of his insightful advice, careful edits, and excellent suggestions on how to translate ideas into words.

I am thankful to Louis-Noël Pouchet and Fabrice Rastello for their contributions to various parts of the dissertation. Louis-Noël has especially been instrumental in shaping the experimental sections in all my papers. Fabrice provided several insights on the reordering framework and the related work on register allocation in general.

The years I spent at IIT Bombay gave me a new direction. I had the most patient M.Tech. advisor, Prof. Supratim Biswas. I am eternally grateful to Prof. Uday Khedker for introducing me to compilers during my Bachelors, letting me be a part of the GCC Resource Center (GRC) during my Masters, introducing me to Prof. Biswas and Saday, and bridging my path from IITB to OSU. I spent five years working with him, and I could spend another five talking about how amazing the days with him in GRC were.

I want to express my gratitude to all the present and past fellow lab mates at OSU: Venmugil, Israt, Aravind, Mahesh, Sanket, Shashank, Ankur, Martin, Tom, Kevin, Naznin, Naser, Samyam, Changwan, Jinsung, Wenlei, Miheer, Vineeth, Rohit, Kunal, Emre, Gordon. Also, my friends at IITB: Nisha, Akshar, Salil, Subhajit, Harbaksh, Prajakta, Vini, Jubi, Ishita, Ashish, Prachi, Pritam, Swati, Swaroop, Shubhangi. Thanks for the interesting conversations! I am grateful to my close friends Bharat, Prashal, Santosh, and Ravi, who knew me when I was a confused teen, and helped me transition into what I am today.

And finally, my family back in India – my mother, sisters, brothers-in-law, the whole entourage of nieces and nephews – and my family here, Umesh. Their unconditional love and support is what keeps me going. Without their encouragement, I could not have reached this far. People rightly say that distance makes the heart grow fonder. I wish I had the vocabulary to do justice to my love for them. This work is as much theirs as it is mine.

VITA

June 2003 – May 2007 ...... Bachelor of Engineering, CSE, Mumbai University, Mumbai, India.
June 2007 – May 2009 ...... Research Assistant, IIT Bombay, Mumbai, India.
June 2009 – July 2012 ...... Master of Technology, CSE, IIT Bombay, Mumbai, India.
August 2012 – Present ...... Graduate Research Assistant, The Ohio State University, Columbus, Ohio.
May 2015 – August 2015 ...... Intern, NVIDIA Corporation, Redmond, Washington.
May 2016 – June 2016 ...... Intern, Lawrence Livermore National Lab., Livermore, California.
May 2017 – August 2017 ...... Intern, NVIDIA Corporation, Redmond, Washington.

PUBLICATIONS

Changwan Hong, Aravind Sukumaran-Rajam, Jinsung Kim, Prashant Singh Rawat, Sriram Krishnamoorthy, Louis-Noël Pouchet, Fabrice Rastello, P. Sadayappan. GPU Code Optimization via Abstract Kernel Emulation and Sensitivity Analysis. To appear in ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2018.

Prashant Singh Rawat, Aravind Sukumaran-Rajam, Atanas Rountev, Louis-Noël Pouchet, Fabrice Rastello, P. Sadayappan. Register Optimizations for Stencils on GPUs. In ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), February 2018.

Changwan Hong, Aravind Sukumaran-Rajam, Jinsung Kim, Prashant Singh Rawat, Sriram Krishnamoorthy, Louis-Noël Pouchet, Fabrice Rastello, P. Sadayappan. POSTER: Performance Modeling for GPUs using Abstract Kernel Emulation. In ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), February 2018.

Prashant Singh Rawat, Aravind Sukumaran-Rajam, Atanas Rountev, Louis-Noël Pouchet, Fabrice Rastello, P. Sadayappan. POSTER: Statement Reordering to Alleviate Register Pressure for Stencils on GPUs. In International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017.

Prashant Singh Rawat, Changwan Hong, Mahesh Ravishankar, Vinod Grover, Louis-Noël Pouchet, Atanas Rountev, P. Sadayappan. Resource Conscious Reuse-Driven Tiling for GPUs. In International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2016.

Prashant Singh Rawat, Changwan Hong, Mahesh Ravishankar, Vinod Grover, Louis-Noël Pouchet, Atanas Rountev, P. Sadayappan. Effective resource management for enhancing performance of 2D and 3D stencils on GPUs. In 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit (GPGPU), March 2016.

Prashant Singh Rawat, Martin Kong, Thomas Henretty, Justin Holewinski, Kevin Stock, Louis-Noël Pouchet, J. Ramanujam, Atanas Rountev, P. Sadayappan. SDSLc: a multi-target domain-specific compiler for stencil computations. In 5th International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), November 2015.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

Studies in High Performance Computing: Prof. P. Sadayappan

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Figures
List of Tables
List of Algorithms
List of Listings

Chapters:

1. Introduction
   1.1 Key Contributions
   1.2 Dissertation Outline
2. Overview of StencilGen Language
3. Bandwidth Optimizations for Stencils on GPU
   3.1 Introduction
   3.2 GPU Constraints for Stencil Computation
       3.2.1 Constraints on block and grid size
       3.2.2 Partitioning threads across thread block dimension
   3.3 Overlapped Tiling with Streaming
       3.3.1 Overlapped Tiling
       3.3.2 Streaming
   3.4 Resource Optimization Strategies
       3.4.1 Overlapped Tiling for 2D stencils
       3.4.2 Overlapped Tiling for 3D stencils
   3.5 Fusion Heuristics
       3.5.1 Overview of a Greedy Fusion Algorithm
       3.5.2 Computing Resource Requirements of a Stencil Operator
       3.5.3 Identifying Fusion Candidates
       3.5.4 Computing the Post-Fusion Resource Map
       3.5.5 Computing the Profitability Metric
       3.5.6 Constructing the Objective Function
       3.5.7 Imposing Hardware Constraints
       3.5.8 Modeling Computation and Communication Redundancy
       3.5.9 Controlling recompilation
   3.6 Case Study: Hypterm
   3.7 Experimental Evaluation
       3.7.1 Evaluation on GPUs
       3.7.2 Stencil Computations on CPUs
       3.7.3 Energy Efficiency
   3.8 Summary
4. Register Optimizations for Stencils on GPU
   4.1 Introduction
   4.2 Background and Motivation
   4.3 Scheduling DAG of Expression Trees
       4.3.1 Sethi-Ullman Scheduling
       4.3.2 Scheduling a Tree with Data Sharing
       4.3.3 Heuristics for Tractability
       4.3.4 Scheduling a DAG of expression trees
   4.4 Interleaving Expressions
   4.5 Experimental Evaluation
   4.6 Summary
5. A Generalized Associative Instruction Reordering Scheme to Alleviate Register Pressure in Straight-Line Codes
   5.1 Introduction
   5.2 Background and Motivation
   5.3 Instruction Reordering With LARS
       5.3.1 Preprocessing Steps
       5.3.2 Creating Multiple Initial Schedules
       5.3.3 Overview of the Reordering Strategy
       5.3.4 Adding Instructions to Reordered Schedule
       5.3.5 Adding Labels to Live Set
       5.3.6 Adaptivity in LARS
       5.3.7 Putting it all together
       5.3.8 Example: LARS in action
   5.4 Experimental Evaluation
   5.5 Summary
6. Related Work
   6.1 Optimizations for bandwidth-bound stencils
   6.2 Register-Level Optimizations
7. Conclusion and Future Work
   7.1 Conclusion
   7.2 Ongoing Research
   7.3 Future Research Directions

Appendices:

A. StencilGen Grammar
B. Code Generation with StencilGen

Bibliography

LIST OF FIGURES

3.1 Different thread block configurations for a fixed thread block size
3.2 Different tiling schemes for 3-point 1D Jacobi computation
3.3 Overlapped tiling for 2D stencils
3.4 Various grid dimensionalities for a 3D input
3.5 Streaming along z dimension for 7-point 3D Jacobi
3.6 Associative reordering to scatter contributions from an input plane to output accumulation registers for 7-point 3D Jacobi
3.7 Performance of 12 time steps of a 13-point Jacobi stencil (k=2) for varied degree of fusion
3.8 The stencil DAG of hypterm routine. cons and q are the input arrays, flux is the output array.
3.9 Performance of hypterm for varied degree of fusion
3.10 Performance of benchmarks on Tesla K20c device
3.11 Performance of benchmarks on Pascal Titan X GPU device
3.12 Performance of benchmarks on Skylake
3.13 Performance of benchmarks on Xeon processor
3.14 Performance of benchmarks on Xeon Phi 7250
4.1 Comparing the performance of the same stencil computation with different sweeping order
4.2 Expression tree example
4.3 Scheduling a tree with data sharing
4.4 Example: computing su′(π)
4.5 Example: interleaving to reduce MAXLIVE
4.6 Performance on Tesla K40c with the register-limited stencil benchmarks tuned for tile size and register limit
5.1 Different contribution schemes for the 3-way unrolled version of 2D 9-point Jacobi computation
5.2 Example: statements lowered down to sequence of instructions, and corresponding CDAG
5.3 Applying LARS reordering strategy to the input of Figure 5.2b
5.4 Example: Using shadows to avoid interlocks
5.5 LARS in action: steps in instruction reordering with LARS for a 4-way unrolled 2D 9-point Jacobi stencil
5.6 Performance of convolution kernels on different architectures
5.7 Performance of smoother kernels on different architectures
5.8 Performance of high-order stencils on different architectures
5.9 Performance of Cloverleaf 3D kernels on different architectures
5.10 Performance of Cloverleaf 2D kernels on different architectures
5.11 Performance of CCSD(T) kernels on different architectures
A.1 Grammar for valid input to StencilGen
A.2 Grammar for the lowered IR for instruction reordering

LIST OF TABLES

3.1 Equations for per-block data transfers and computational volume for different overlapped tiling schemes. W is the warp size.
3.2 Choosing degree of fusion for varying k
3.3 Characteristics of bandwidth-bound stencil benchmarks
3.4 Benchmarking hardware for bandwidth-bound stencils
3.5 Compilation flags for different compilers and different benchmarking hardware
3.6 Energy and performance characteristics of benchmarks on GPU devices
3.7 Energy and performance characteristics of benchmarks on multi-core CPU
3.8 Energy and performance characteristics of benchmarks on Xeon Phi
4.1 Characteristics of register-limited stencil benchmarks
4.2 Metrics for 3d125pt for tuned configurations
4.3 Metrics for reordered max-fuse vs. split versions for high-order stencil DAGs
4.4 Spill metrics and performance in GFLOPS for register-limited stencils when compiled with NVCC on K40c device for different register configurations
5.1 Benchmarking hardware to evaluate LARS
5.2 Compilation flags for GCC and LLVM on different benchmarking hardware
5.3 Characteristics of straight-line scientific benchmarks
5.4 Assembly code analysis for LARS on Xeon Phi 7250

LIST OF ALGORITHMS

3.1 Stream (IN, T): Streaming along the slowest-varying dimension, and using registers for storage in lieu of shared memory for a 3D order-1 stencil
3.2 Associative-Stream (IN, T): Streaming along the slowest-varying dimension, and scattering the contributions from an input plane to output accumulation registers via associative reordering for a 3D order-1 stencil
3.3 Create-Sorted-Tuple-List (D): From the input DAG D, create the 5-tuple for each fusion candidate, and sort them based on the objective function
4.1 Schedule-Tree (n, in, out): Optimal evaluation of an expression tree with data sharing under the atomic evaluation constraint
4.2 Schedule-DAG (D, R): Evaluating a DAG of expression trees under the atomic evaluation constraint
4.3 Slice-and-Interleave (T, in, out): Performing interleaving within an expression tree
5.1 Reorder (P, k): Reorder the input schedule P using the greedy list-based scheduling of LARS
5.2 Create-Label-Cost-Tuple (l, L, Sub): Create a cost tuple for each non-live label that captures its interaction with the live labels and unblocked instructions
5.3 End-to-end algorithm of LARS

LIST OF LISTINGS

1.1 3D 7-point stencil operator extracted from HPGMG
2.1 A 2D Jacobi written in C language
2.2 Expressing the Jacobi stencil of Listing 2.1 in StencilGen language
3.1 An illustrative 7-point 2D associative stencil that accesses 2 points from planes along y dimension
3.2 Rewriting the stencil of Listing 3.1: reading one input plane at a time, and accumulating its contribution at different output points
3.3 Two time steps of an illustrative 7-point 3D Jacobi stencil. The stencil is written using associativity of + to scatter contributions from an input plane to different output points
4.1 The input representation in the StencilGen DSL
4.2 SASS snippet for the unrolled code
4.3 SASS snippet for the reordered code
5.1 2D 9-point Jacobi computation
5.2 An unrolled computation with reuse in inputs
5.3 A reordering for the computation of Listing 5.2 that leverages associativity of + to reduce register pressure
B.1 j2d5pt.idsl: two time steps of a 2D 5-point Jacobi stencil
B.2 StencilGen-generated device code with shared memory, without streaming
B.3 StencilGen-generated device code with shared memory, registers, and serial streaming
B.4 StencilGen-generated device code with shared memory, registers, and concurrent streaming
B.5 StencilGen-generated device code with shared memory, registers, unrolling, and serial streaming

CHAPTER 1

Introduction

Stencil computations on structured grids are ubiquitous in high-performance computing. A stencil computation involves updating the elements of the output grid by performing computations on the elements belonging to a fixed neighborhood of the input grid. Listing 1.1 shows an example of a 3D 7-point stencil computation extracted from the HPGMG benchmark [45]. The output grid B is updated using the input grid A along with some coefficients (a, b, and h2inv). Lines 4-6 represent the core computation that is performed to obtain the updated value at each point of B. This is an order-1 stencil, where the order refers to the extent of elements read along each dimension.

Listing 1.1: 3D 7-point stencil operator extracted from HPGMG
1 for (int k=1; k<N-1; k++) {
2   for (int j=1; j<N-1; j++) {
3     for (int i=1; i<N-1; i++) {
4       B[k][j][i] = a*A[k][j][i] - b*h2inv*(
5           A[k-1][j][i] + A[k][j-1][i] + A[k][j][i-1] + A[k][j][i+1]
6           + A[k][j+1][i] + A[k+1][j][i] - 6.0*A[k][j][i]);
7     }
8   }
9 }

Stencils similar to the example in Listing 1.1 are a commonplace computational motif in applications ranging from partial differential equation (PDE) solvers [95, 45] and computational electromagnetics [105] to image processing [22, 23] and scientific simulation for climate modeling and seismic imaging [65, 104]. Such applications spend a significant fraction of the execution time performing the stencil computation. For example, on a Pascal Titan X device, the 3D seismic wave modeling application [104] spends more time executing a single stencil kernel than the time it takes to copy data from host to device. Given that stencils can be a compute-intensive core of scientific applications, optimizing them for performance on different architectures is critical.

Accelerating Stencils on GPUs Since stencils have a regular access pattern with a parallel outermost loop and an innermost loop amenable to vectorization, there have been several research efforts targeting stencil optimization on multi- and many-core CPU architectures [18, 106, 42, 7, 85, 72, 84]. Over recent years, GPUs have been increasingly used for general-purpose computing. With higher computational power, a massive degree of parallelism, higher bandwidth, and better power efficiency than multi-core architectures, GPUs have steadily emerged as a better alternative for offloading and accelerating regular, and some irregular, applications that were traditionally executed on CPUs.

Data-parallel stencil computations are well-suited for the GPU paradigm for a number of reasons: (a) different GPU threads can operate on different values of the output grid simultaneously to exploit the massive parallelism provided by GPUs; (b) the regular data accesses in stencils are naturally amenable to coalescence on GPUs; (c) the higher bandwidth of GPUs helps reduce the memory latency; (d) the number of registers per thread in GPUs is large, and therefore one can exploit better reuse in registers; and (e) the programmer has explicit control over data placement in scratchpad (shared) memory, unlike the L1 cache in multi-core architectures. A number of research efforts in the last decade have focused on offloading and optimizing stencil computations on GPUs [113, 67, 127, 64, 38, 43, 37, 36, 88, 87, 89, 9, 108, 63].

Memory Bandwidth-Bound Stencils Arithmetic intensity is defined as the number of floating-point operations performed relative to the memory accesses needed to perform those operations. For example, the 7-point stencil from Listing 1.1 performs 10 floating-point operations per output point at the amortized cost of 2 memory accesses (one read for A, one write for B; the three reads for coefficients a, b, and h2inv can come from constant memory). For a double-precision computation, its arithmetic intensity will be 0.625 FLOPs/byte. For current-generation GPU devices with around 300 GBytes/sec global memory bandwidth and a peak double-precision performance of around 1.5 TFLOPS, the operational intensity required to be compute-bound rather than memory bandwidth-bound is around 5 FLOPs/byte. The 7-point stencil will therefore be memory bandwidth-bound on such architectures, effectively under-utilizing the compute resources. In fact, stencil computations traditionally have low arithmetic intensities, and therefore achieve only a fraction of the peak computational power of the underlying GPU architecture.
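Concretely, the numbers above work out as follows: the arithmetic intensity is 10 FLOPs / (2 accesses × 8 bytes) = 0.625 FLOPs/byte, while the machine balance of such a device is roughly 1500 GFLOP/s ÷ 300 GB/s = 5 FLOPs/byte, so the 7-point stencil falls short of the compute-bound threshold by a factor of 8.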

Tiling of Time-Iterated Bandwidth-Bound Stencils Tiling is a key transformation to enhance the temporal reuse of data, and thus reduce the amount of data transfers from/to global memory on a GPU. In a time-tiled execution, operations from several consecutive time steps of the stencil computation are combined to form a tile, which is executed atomically to exploit data reuse in scratchpad memory. With time tiling, simply slicing the input domain and assigning disjoint slices (data tiles) to different thread blocks is not feasible. For every tile, computing the boundary values at time step t requires some data computed at time step t − 1 from neighboring tiles. This data is referred to as the halo region or the ghost zone. Different tiling schemes have been proposed to achieve concurrent execution of time-tiled blocks: overlapped tiling [43, 85, 55, 72], split tiling [37, 42], hexagonal tiling [36], etc.

Many previous efforts towards optimizing bandwidth-bound 2D stencils on GPUs have demonstrated the effectiveness of time tiling, along with the utilization of shared memory [43, 85, 37]. However, tiling 3D stencils has proved to be a challenge for all previously reported code generators. The difficulty arises from the cubic versus quadratic increase in shared memory requirement for the 3D case, and the very limited (usually under 100 KBytes) shared memory available per streaming multiprocessor (SM) on GPUs. Consequently, the use of overlapped tiles [55] with 3D stencils results in a very rapid increase in redundant computations, even for very small values of the time-tile size.

Operator Fusion in Bandwidth-Bound Stencil DAGs Multi-statement stencils arising in many scientific simulations and image processing pipelines apply a sequence of bandwidth-bound stencil operators on a set of input domains [71, 85]. Such stencil computations can be represented by a directed acyclic graph (DAG) in which the nodes represent stencil operators and the incoming edges represent inputs to these operators. Time-iterated stencils can also be expressed as a DAG by explicit unrolling of the time loop. Temporal reuse can be exploited in these stencil DAGs via operator fusion and spatial tiling. Apart from the constraints imposed by the GPU hardware resources, stencil DAGs present an additional optimization challenge: determination of the tile shape/size and the amount of fusion across multiple stencils, while making effective use of a combination of shared memory and registers to enable a high degree of reuse. For example, one may prefer to fuse stencil operators with a true dependence between them in order to exploit the temporal reuse, but the fusion may result in an increase in the halo region. Therefore, one needs to be cognizant of the ratio between the data movement reduction and the increase in recomputation when fusing stencil operators. Similarly, while fusing stencil operators with no data dependence might not increase the recomputation, it may lead to higher register pressure and shared memory usage in the fused node.

For stencil DAGs with dozens of stencil operators, manually exploring the space of possible fusion candidates may be difficult. Even for time-iterated stencils, which translate to linear chain DAGs and where fusion is restricted to successive stencil operators, many previously proposed code generators [85, 43] circumvent the navigation of the fusion space by expecting the user to specify the degree of fusion. Frameworks like MODESTO [38] navigate the fusion search space for stencil DAGs by exploring all possible fusion candidates, which may be time consuming for stencils like hypterm [29] that accumulate the results in multiple output arrays, and hence have statements with no ordering constraints.

Register Pressure in Compute-Bound Stencils Unlike bandwidth-bound stencils that have low arithmetic intensity, compute-bound stencils lie at the other end of the spectrum with high arithmetic intensity. For example, consider an order-2 3D 125-point stencil with constant coefficients [8]. The stencil performs 134 floating-point operations with 3 memory accesses per output point. Thus, for double-precision computation, its arithmetic intensity is 5.58 FLOPs/byte, making it compute-bound on most GPU devices.

For box stencils, the arithmetic intensity increases with the stencil order (lower-order time-iterated stencils can be converted to high-order stencils by unrolling the time loop). In many mathematical stencil operators, such as those arising in the numerical solution of PDEs, the solution accuracy increases with an increase in the stencil order. With the advent of compute-intensive GPU architectures, mathematicians and application developers have started exploring high-order stencils to achieve greater accuracy at the same data movement cost.

Theoretically, the performance of such high-order stencils should be limited by the compute resources on the GPU, and therefore applying bandwidth optimizations like operator fusion will not improve the performance. One might simply perform spatial tiling followed by loop unrolling to exploit reuse in registers and obtain a performance close to the computational peak. However, in practice, the performance of such high-order stencil codes is impacted by the instruction scheduling and register allocation passes of the underlying compiler. These two are the most crucial backend passes in a compiler, and their objectives are often antagonistic: instruction scheduling prefers independent instructions scheduled in proximity to increase ILP, whereas register allocation prefers data-dependent instructions to be scheduled in proximity to shorten live ranges. Register allocation is generally considered a practically solved problem. For most applications, the register allocation strategies in production compilers are very effective in controlling the number of loads/stores and register spills. However, existing register allocation strategies are not effective for computation patterns with a high degree of many-to-many data reuse, which is the case with high-order stencils. The excessive register pressure in such cases manifests as either register spills, or a performance loss due to a drop in occupancy, and consequently the inability to hide memory latency. For example, the rhs4th3fort kernel in SW4 [104] performs 687 floating-point operations relative to 64 bytes of data transfer, which gives it an arithmetic intensity of 10.73 FLOPs/byte. However, spills are only avoided by allocating the hardware maximum of 255 registers per thread on a Tesla K40c device, and the achieved performance is merely 182 GFLOPS, i.e., 13% of the computational peak.

The inability of the register allocators in production compilers can be attributed to two factors: (a) the order of contributions from input points to an output point in a stencil computation is semantically irrelevant, and Stock et al. [99] have demonstrated that associative reordering of these contributions can help alleviate register pressure in the context of multi-core CPUs; however, compilers do not fully exploit associative reordering to reduce register pressure; and (b) compilers do not have a global perspective of the stencil DAG and its access patterns to effectively reduce live range interferences.
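To make the role of associativity concrete, consider two adjacent outputs of a 1D 3-point stencil. This is a hypothetical illustration, not taken from the benchmarks above: in the usual gather form the compiler must keep four input values live across both statements, whereas an associatively reordered scatter form lets each loaded input contribute to every pending accumulator and then die.

/* Gather form:
 *   y[i]   = a*x[i-1] + b*x[i]   + c*x[i+1];
 *   y[i+1] = a*x[i]   + b*x[i+1] + c*x[i+2];
 * keeps x[i-1..i+2] live simultaneously. The scatter form below is obtained by
 * associative reordering of the + chains. */
static void two_points_scatter(const double *x, double *y, int i,
                               double a, double b, double c) {
  double acc0 = 0.0, acc1 = 0.0, v;
  v = x[i-1]; acc0 += a*v;               /* v dies right after its last use  */
  v = x[i];   acc0 += b*v; acc1 += a*v;
  v = x[i+1]; acc0 += c*v; acc1 += b*v;
  y[i] = acc0;                           /* acc0's register is now free      */
  v = x[i+2]; acc1 += c*v;
  y[i+1] = acc1;
}

At any point in the scatter form, at most one input value and two accumulators are live; the gap between the two forms widens rapidly with the stencil order and the unroll factor, which is precisely the many-to-many reuse pattern that production register allocators handle poorly.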

1.1 Key Contributions

In this dissertation, we develop optimization strategies for both bandwidth- and compute-bound stencil computations on GPUs. These optimizations are integrated into a domain-specific language (DSL) based code generator named StencilGen. We advance the state of the art in stencil optimizations on GPUs with the following contributions.

• Our first contribution is the development of efficient tiling schemes for memory bandwidth-bound 2D and 3D stencil computations on GPUs. Unlike several previously proposed GPU code generators [85, 43, 113] that use symmetric 3D tiles for time tiling, we use streaming along one spatial dimension and symmetric overlapped 2D tiling in the other two spatial dimensions to enable significantly greater data reuse than 3D overlapped tiling. The effectiveness of streaming with spatial tiling was demonstrated by Micikevicius [67], and adapted by Nguyen et al. [72] for temporal tiling. However, their efforts were manual, and their strategy was only applicable to stencils with a specific access pattern. We exploit operator associativity to apply the streaming optimization to a wider class of stencil computations.

• Our second contribution is the development of a model-driven approach to determine which stencil operators to fuse in a stencil DAG, since excessive fusion leads to increased redundant computation, increased data traffic to/from global memory, and increased overhead from register spilling. Unlike the few models for fusion of stencil DAGs on GPUs [38, 116] that tend to explore a vast search space of fusion candidates and model the performance and register pressure of the fused kernel, our heuristic quickly prunes the search space by making greedy fusion choices, and recompiles some fusion candidates to get an accurate measure of register pressure.

• Our third contribution is the development of an effective pattern-driven global optimization strategy for instruction reordering to address the problem of register pressure in high-order stencils. The key idea behind the proposed instruction reordering approach is to model reuse in high-order stencil computations by using an abstraction of a DAG of trees with shared nodes/leaves, and to exploit the fact that optimal scheduling to minimize registers for a single tree with distinct operands at the leaves is well known [93]. The DAG replaces the syntactic chain of contributions with an n-ary accumulation node. We thus devise a global statement reordering strategy for a DAG of trees with shared nodes that enables a reduction of register pressure to improve performance.

• Our fourth contribution is to generalize the instruction reordering to generate efficient reordering schedules for multi-core CPUs. A significant difference between register allocation for CPUs and GPUs stems from the fact that the register pressure is reconfigurable at compile time on a GPU: GPUs have massive thread-level parallelism (TLP) that can be traded off against instruction-level parallelism (ILP). However, the TLP in CPUs is severely restricted, so improving ILP is crucial to performance. We develop a list-based reordering strategy that uses multiple criteria, including affinities of non-live variables to those live in registers, the potential of variables to fire operations and to release registers, as well as the potential of operations to increase ILP in the reordered schedule. We demonstrate the effectiveness of the reordering strategy for multi-core CPUs and the Xeon Phi architecture.

While the techniques and optimizations discussed in this dissertation use CUDA [73] terminology and use NVIDIA GPUs as the reference hardware, they are not limited to CUDA or NVIDIA GPUs. These optimizations can be implemented using any language that allows mapping computations to GPU threads, e.g., OpenCL [100], and are applicable to GPUs from other vendors, e.g., AMD.

1.2 Dissertation Outline

The dissertation is organized as follows. Chapter 2 gives a brief overview of the StencilGen language for specifying stencil computations. Chapter 3 describes the tiling strategies and fusion heuristics to optimize a memory bandwidth-bound stencil DAG. Chapter 4 describes the DAG-based reordering framework that leverages operator associativity to perform register optimizations for high-order stencils on GPUs. Chapter 5 generalizes some of the ideas from Chapters 3 and 4 to develop a list-based instruction reordering framework named LARS. LARS can be applied to straight-line scientific codes to generate CPU codes with appropriate SIMD intrinsics, and its reordering heuristics can be adapted for different stencil types and different optimization objectives. Chapter 6 surveys related work. Finally, we discuss some of the future research directions in Chapter 7.

CHAPTER 2

Overview of StencilGen Language

This chapter describes the StencilGen domain-specific language (DSL), which is used to specify stencil computations. The use of a DSL enables easy identification of the stencil patterns so that stencil-specific optimizations may be performed. Implementing high-performance GPU code for stencil computations requires both domain-specific and target-specific optimization strategies. For a compiler to automatically apply these strategies, the computation of interest must be analyzable to identify parts that represent stencil computations. Limitations in precise compiler analysis for programs in a general-purpose language, such as memory aliasing, loop trip count computation, induction variable analysis, etc., can prevent aggressive optimizations from being performed for safety/correctness reasons. The use of a domain-specific language helps avoid these problems. The StencilGen language ensures that:

• the entire stencil computation (this includes time iterations, boundary conditions, and the operations applied on each data element) can be modeled;

• the data space accessed by stencils is fully described at the StencilGen level;

• general-purpose programming is allowed outside the stencil regions, i.e., one can use StencilGen both as a stand-alone or an embedded DSL.

Listing 2.1: A 2D Jacobi written in C language
 1 double out[M][N], in[M][N], a, ONE_FIFTH = 0.2;
 2 for (int i=1; i<=M-2; i++) {
 3   for (int j=1; j<=N-2; j++) {
 4     out[i][j] = ONE_FIFTH*(in[i-1][j]
 5       + in[i][j-1]+in[i][j]+in[i][j+1]+in[i+1][j]);
 6   }
 7 }
 8 for (int j=0; j<=N-1; j++)
 9   out[0][j] = a * in[0][j];
10 for (int j=0; j<=N-1; j++)
11   out[M-1][j] = a * in[M-1][j];
12 for (int i=0; i<=M-1; i++)
13   out[i][0] = a * in[i][0];
14 for (int i=0; i<=M-1; i++)
15   out[i][N-1] = a * in[i][N-1];

An illustrative example We illustrate the main concepts of the StencilGen language using the example in Listing 2.2, which expresses the 2D Jacobi stencil of Listing 2.1 in the StencilGen language.

Listing 2.2: Expressing the Jacobi stencil of Listing 2.1 in StencilGen language
 1 parameter M,N;
 2 iterator i, j;
 3 double in[M,N], out[M,N], a;
 4
 5 copyin in;
 6
 7 stencil five_point_avg (A, B) {
 8   double ONE_FIFTH = 0.2;
 9   B[i][j] = ONE_FIFTH*(A[i-1][j]
10     + A[i][j-1]+A[i][j]+A[i][j+1]+A[i+1][j]);
11 }
12
13 stencil boundary (A, B, a){
14   B[i][j] = a * A[i][j];
15 }
16
17 [0 : 0][0 : N-1]     : boundary (in, out, a);
18 [M-1 : M-1][0 : N-1] : boundary (in, out, a);
19 [0 : M-1][0 : 0]     : boundary (in, out, a);
20 [0 : M-1][N-1 : N-1] : boundary (in, out, a);
21 [0 : M-1][0 : N-1]   : five_point_avg (in, out);
22
23 copyout out;

The DSL program begins with a declaration of two integer parameters, M and N, on line 1. These parameters are used to define the dimensions of the input and output arrays (line 3). Line 2 declares iterators, each of which will be mapped to a unique dimension of the computational loop nest. Note that the loops are explicit in the C code of Listing 2.1, but are not explicitly defined in the DSL. This is because some of the loop nests will become implicitly data-parallel in the generated CUDA code. However, we assume that the iterators are only incremented by unit stride in the increment expression of the loops in the equivalent C program. The iterators must also be immutable within the body of the computational loop nest. All the arrays and scalars declared (e.g., in and out on line 3) will be passed as arguments to the host function. Line 5 specifies the arrays and scalars that need to be copied from host to device. Lines 7–11 define a 5-point stencil named five_point_avg. This stencil takes as arguments the input and output arrays/scalars used in the computation. five_point_avg averages the 5 neighboring data elements from array A and writes the result to an element of array B. Lines 13–15 define a stencil named boundary that does a point-wise copy of the elements from array A to array B at the boundary. Lines 17–21 invoke the stencil functions over subsets of the problem domain. Finally, line 23 defines the arrays and scalars that need to be copied back from device to host.
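To give a sense of the mapping, the sketch below shows the kind of data-parallel CUDA kernel that the interior five_point_avg invocation corresponds to conceptually. It is an illustrative hand-written kernel, not actual StencilGen output (Appendix B shows real generated code), and the kernel name and guard structure are assumptions.

__global__ void five_point_avg_interior(const double *in, double *out,
                                        int M, int N) {
  // One thread per output point; only the interior [1:M-2][1:N-2] is updated.
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  const double ONE_FIFTH = 0.2;
  if (i >= 1 && i <= M-2 && j >= 1 && j <= N-2)
    out[i*N + j] = ONE_FIFTH * (in[(i-1)*N + j] + in[i*N + (j-1)] + in[i*N + j]
                              + in[i*N + (j+1)] + in[(i+1)*N + j]);
}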

Embedding StencilGen in C/C++ Every embedded section of StencilGen begins with #pragma dsl begin and is terminated by #pragma dsl end. Additional arguments can be supplied to the begin pragma to control the grid and block shape in the generated code. The main arguments are listed below; an illustrative embedded section follows the list.

• block: a comma-separated list of integers specifying the GPU thread block shape/size for the generated code. There must be one integer in the list for each spatial dimension of the grid in the StencilGen code.

• unroll: a comma-separated list of integers specifying the coarsening along each dimension. There must be one integer in the list for each spatial dimension of the grid in the StencilGen code.

• stream: a single string value specifying the dimension along which spatial streaming is to be performed in the generated code. In the absence of this argument, streaming is not performed.

• shmem: a single literal value specifying the size, in KB, of the shared memory available in the GPU architecture. In the absence of this argument, shared memory is not used in the generated code.
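As an illustration, the Jacobi example of Listing 2.2 could be embedded in a host file roughly as follows. The argument spelling used for block and shmem below is an assumption made for readability (Appendix A defines the actual grammar), and the boundary invocations of Listing 2.2 are elided.

#pragma dsl begin block:64,16: shmem:48:
parameter M,N;
iterator i, j;
double in[M,N], out[M,N], a;
copyin in;
stencil five_point_avg (A, B) {
  double ONE_FIFTH = 0.2;
  B[i][j] = ONE_FIFTH*(A[i-1][j]
    + A[i][j-1]+A[i][j]+A[i][j+1]+A[i+1][j]);
}
[0 : M-1][0 : N-1] : five_point_avg (in, out);
copyout out;
#pragma dsl end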

In order to allow easy integration of StencilGen into C/C++ programs, restrictions are imposed on how the data structures passed between the rest of the program and the StencilGen segment are declared. Contiguous memory regions of scalar types (e.g., float, double, etc.) are required for the fields storing the data, and integer types are required for parameters such as the grid size or the number of time iterations. It is also expected that the right-hand-side expressions of the assignment statements in the stencil body are side-effect-free. This implies that the right-hand side of an assignment statement can contain standard side-effect-free math functions like sin, sqrt, etc.

More details about the grammar of the StencilGen language are provided in Appendix A. Appendix B describes the compilation flags, and provides examples of the code generated using StencilGen.

CHAPTER 3

Bandwidth Optimizations for Stencils on GPU

3.1 Introduction

Stencil computations form the compute-intensive core of scientific applications in many domains, making their performance critical to the scientific community. Many recent efforts target the optimization of stencils, including several domain-specific languages and frameworks [30, 18, 85, 43, 37, 108, 64, 38, 106, 42, 7, 36, 116, 87, 88]. Stencil computations exhibit a high degree of data parallelism, thereby making them attractive for execution on GPUs. However, many stencil computations are memory-bandwidth limited, unless temporal reuse is exploited across a sequence of stencils or across multiple time steps for time-iterated stencils.

Tiling is a key transformation to enhance temporal reuse of data and thus reduce the amount of data transfer from/to global memory on a GPU. Time-tiled execution of stencil code on GPUs has been pursued by several efforts [72, 67, 86, 113, 84, 127, 108, 64, 38, 18, 85, 43, 37, 36, 87, 116, 24, 66, 19, 79, 126, 9, 112, 114, 118, 117, 63], advancing the state of the art to the point that very high performance can now be realized using stencil-specific and GPU-specific optimization techniques. In this chapter, we present a detailed analysis and description of domain-specific and GPU-target-specific optimization strategies that enable high performance with stencil computations.

Several considerations are important in achieving high performance on GPUs: movement of contiguous chunks of data (coalesced accesses) from/to global memory; reuse of data in registers and shared memory; sufficient concurrency, orders of magnitude larger than the number of physical cores; and minimization of control-flow divergence among threads. The total amount of on-chip shared memory in each SM (Streaming Multiprocessor) of a GPU is very limited, typically under 100 KBytes. As elaborated later, the low shared-memory capacity limits the maximum number of time steps in a time tile and also limits the number of concurrently active thread blocks in each SM. Low occupancy can lead to low performance due to the inability to effectively overlap memory access latency with sufficient operations to execute across the active warps. However, judicious exploitation of the data access pattern of stencil computations, combined with associative reordering of the stencil operations where appropriate, can enable the use of GPU registers to offload data from shared memory, thereby enhancing data reuse and/or occupancy. Since the register file capacity per SM is larger than the shared memory capacity on most modern GPUs (a trend that is expected to continue), this alleviates the constraints on occupancy imposed by shared memory usage. The low access latency of registers further improves the performance of the code. In this chapter, we present details on the optimization of register and shared-memory resources for stencil computations. In combination with streamed execution (elaborated later) along an arbitrarily long tile in one of the spatial dimensions, it is possible to achieve higher GPU performance than with uniform tile sizes along all spatial dimensions.

The rest of the chapter is organized as follows. Section 3.2 details the impact of the GPU hardware constraints on the stencil optimization strategy. An overview of the overlapped-tiling strategy and the GPU code generation algorithms for overlapped tiling of stencil computations are described in Sections 3.3 and 3.4. Section 3.5 describes fusion across multiple stencils. Section 3.7 presents an experimental evaluation of the described stencil code generation approach, comparing it with several current general-purpose and special-purpose code generators and optimizers for multicore processors, manycore processors, and GPUs. We summarize in Section 3.8.

3.2 GPU Constraints for Stencil Computation

This section discusses architectural constraints to be considered when optimizing stencil computations on GPUs. The basic computational unit of a GPU is a thread.

Going up the hierarchy, threads are grouped into thread blocks, and thread blocks are grouped into a grid. A program explicitly specifies this hierarchy of grids and thread blocks as kernel launch parameters. Threads within a thread block are executed on the same streaming multiprocessor (SM); they can exchange data via shared memory, and synchronize among themselves.

In order to be efficient, tiling for GPUs must address the following:

• Access global memory in a coalesced manner,

• Achieve sufficient concurrency to tolerate memory access latency, and

• Judiciously use faster storage, like shared memory and registers, for caching data.

Shared memory has lower access latency than global memory, and is well suited to cache data that is accessed by multiple threads in a thread block. Registers are local to a thread, and are therefore well suited to cache data that is accessed by a single thread. Registers have lower access latency than shared/global memory, and are more plentiful than shared memory in most GPU architectures. Therefore, it is often beneficial to offload some storage from shared memory to registers. However, excessive register usage may result in either lower occupancy or expensive spills.

Since these aspects are tightly coupled to the underlying hardware, the kernel launch parameters need to be tuned for the specific GPU architecture on which the program is executed. In this section, we present some general constraints on thread block and grid size for spatial tiling of a d-dimensional stencil computation over an N^d input domain. Typically, each point of the iteration space is executed by a thread on the GPU. We assume that the stencil is of order k along each dimension.

3.2.1 Constraints on block and grid size

Every GPU has a hardware limit on the maximum number of threads per SM (Tsm), the maximum number of threads per thread block (Tb), the maximum shared memory per SM (Msm), the maximum number of concurrently active thread blocks per SM (Bsm), and the register file size per SM (Rsm). For instance, Tsm = 2048, Tb = 1024, Msm = 48 KB, Bsm = 16, and Rsm = 65536 for the NVIDIA Kepler K20c. The threads in an SM can be grouped in various ways (e.g., 2 blocks of 1024 threads, 16 blocks of 128 threads, etc.). The threads in a block are scheduled in groups of 32, called a warp.

When a thread in a warp accesses global memory, the warp stalls for hundreds of clock cycles due to the high memory latency. In order to hide the latency, the SM can switch to another warp in the ready state. There must be a sufficient number of instructions across all active warps to keep the SM busy until the memory request of the stalled warp is served. One extreme is to ensure that the maximum possible number of threads in an SM are active. If we reach the hardware limit of Tsm/32 warps per SM, then we achieve maximum occupancy, the ratio of active warps to the maximum theoretical active warps per SM.

Stencil computations can benefit from spatial reuse if the data is cached in shared memory. The access latency of shared memory is orders of magnitude lower than that of global memory, but its use puts an additional constraint on the number of blocks that can be concurrently active on an SM. If each block of size szb uses c·szb bytes of shared memory, then we can have no more than Msm/(c·szb) active blocks per SM. If c is large, then fewer blocks can be active per SM, which can potentially reduce occupancy.

Registers represent the fastest storage resource available to a thread. The maximum number of registers usable by a thread is constrained by the hardware. If all the threads in an SM are to be active, the number of registers used per thread must be bounded by Rsm/Tsm. In general, if a thread uses treg registers, then the maximum number of active threads per SM is upper-bounded by min(Tsm, Rsm/treg).

Taking all these factors into account, the maximum number of active thread blocks per SM, maxb, is upper-bounded by maxb ≤ min(Bsm, Tsm/szb, Msm/(c·szb), Rsm/(treg·szb)). Unless thread coarsening is utilized to increase instruction-level parallelism (ILP), the block size szb must be chosen to maximize occupancy, and consequently the thread-level parallelism (TLP), i.e., szb × maxb must be as close to Tsm as possible. If there are K SMs in the GPU, then the grid size must be at least K × maxb to maximize concurrency.

Figure 3.1: Different thread block configurations for a fixed thread block size
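The bound above is easy to evaluate mechanically. The following small C helper is an illustrative sketch, not part of StencilGen; the hardware limits are the K20c values quoted earlier, while the per-block parameters (szb, c, treg) are made-up example values.

#include <stdio.h>

/* Upper bound on concurrently active thread blocks per SM:
 * maxb <= min(Bsm, Tsm/szb, Msm/(c*szb), Rsm/(treg*szb)). */
static int min4(int a, int b, int c, int d) {
  int m = a < b ? a : b;
  m = m < c ? m : c;
  return m < d ? m : d;
}

int main(void) {
  const int Tsm = 2048, Bsm = 16, Rsm = 65536;
  const int Msm = 48 * 1024;      /* bytes of shared memory per SM (K20c)   */
  int szb  = 256;                 /* threads per block (example value)      */
  int c    = 48;                  /* bytes of shared memory per thread      */
  int treg = 32;                  /* registers per thread                   */
  int maxb = min4(Bsm, Tsm / szb, Msm / (c * szb), Rsm / (treg * szb));
  printf("maxb = %d, active threads per SM = %d (occupancy %.0f%%)\n",
         maxb, maxb * szb, 100.0 * maxb * szb / Tsm);
  return 0;
}

For these example values the shared memory term is the binding constraint (maxb = 4, i.e., 50% occupancy), which is exactly the kind of interplay the later tiling choices must navigate.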

Coalescing of global memory accesses is important since it reduces the total number of memory transactions, thereby minimizing the required DRAM bandwidth. To benefit from coalescing, the fastest-varying dimension of the thread block is aligned to the fastest-varying dimension of the input domain, and is usually chosen to be a multiple (or factor) of the warp size, i.e., mod(szb, 32) = 0 or mod(32, szb) = 0.

3.2.2 Partitioning threads across thread block dimension

Given a 2D thread block of size szb, we can vary the number of threads along the x and y dimensions to get different thread block configurations. The performance of a stencil computation can vary depending on the chosen configuration. To illustrate, let szb = bx × by. Two possible configurations for szb = 1024 are shown in Figure 3.1. For configuration (a), bx = by = 32. For configuration (b), bx = 64, by = 16. For both configurations, bx is a multiple of the warp size to benefit from global memory coalescing.

Figure 3.2: Different tiling schemes for 3-point 1D Jacobi computation

If an order-k stencil over an N^2 domain uses configuration (a), the number of global reads is N^2(32 + 2k)^2/1024. For configuration (b), the number of global reads is N^2(64 + 2k)(16 + 2k)/1024. For any value of k, we see that configuration (a) requires fewer global read transactions for performing the same number of arithmetic operations.
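For instance, with an order-1 stencil (k = 1), configuration (a) incurs N^2 · 34^2/1024 ≈ 1.13 N^2 global reads, while configuration (b) incurs N^2 · 66 · 18/1024 ≈ 1.16 N^2 reads; the square configuration already saves about 3% of the read traffic, and its advantage grows with k, since a square tile has a smaller halo perimeter for the same area.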

3.3 Overlapped Tiling with Streaming

With stencil computations, both input values and values computed at each time step are used multiple times due to the stencil access pattern. In order to achieve high performance, it is critical that these values are retained in faster shared memory and registers instead of being repeatedly accessed from slower global memory. Tiling is a key transformation to address this issue. However, simply slicing the input domain and assigning disjoint slices (data tiles) to different thread blocks is not feasible for time-tiled stencil computations. For every tile, computing the boundary values at time step t requires some data computed at time step t − 1 from neighboring tiles. This data is referred to as the halo region or the ghost zone.

3.3.1 Overlapped Tiling

A tiling technique that has been shown to be effective for GPUs is overlapped tiling [55, 43, 86], illustrated by Figure 3.2a. It eliminates inter-tile dependences by using redundant “overlapped” computation and data loads. The extent of input data read per tile is increased so that no inter-tile synchronization or communication is needed to compute the boundary values at each time step in the time tile.

The extent of redundant computation depends on the block size, the number of overlapping dimensions, the order of the stencil, and the time tile size. For example, blocks TB0 and TB1 in Figure 3.2a compute three output values. There are four redundant reads and two redundant computations. The fraction of redundant reads and computations can be decreased by increasing the spatial tile size. Conversely, if the time tile size is increased, the number of redundant reads and computations increases.

The redundancy increases with an increase in stencil order as well. While the amount of redundant computation is often tolerable for low-order 2D stencils, it can be excessive for 3D stencils. Consider a thread block that computes an 8^3 = 512 output block for two time steps of a 3D first-order stencil. Each thread block would need to read a 12^3 = 1728 block of the input domain, out of which almost half are read redundantly by another thread block as well. The number of points at the intermediate time step computed by the block would be 10^3 = 1000. The amount of redundant computation would be 1000 − 512 = 488, i.e., half of the computation performed by each block is performed by another block as well. This problem of redundancy for 3D stencils gets worse with an increase in stencil order and time tile size.
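The example generalizes as follows (for the symmetric overlapped scheme sketched above): a thread block producing a B^3 output block over a time tile of T steps must read a (B + 2kT)^3 input block, and at time step t (1 ≤ t ≤ T) it computes a (B + 2k(T − t))^3 block of points. With B = 8, k = 1, and T = 2 these expressions reproduce the 12^3 = 1728 reads and 10^3 = 1000 intermediate points above. The per-output read volume thus grows as ((B + 2kT)/B)^3 relative to the disjoint-tile case, which is why even small values of k or T quickly become untenable in 3D.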

3.3.2 Streaming

Since overlapped tiling for all the dimensions of a 3D stencil is not an effective approach, an alternative is to tile two of the dimensions, and stream along one dimension of the computation. The discussion below first describes the idea of streaming along a dimension of the computation and then explains how it is used in combination with overlapped tiling to address the problem of redundancy for 3D stencils.

Figure 3.2b describes streaming through the x dimension for a 1D Jacobi stencil with a time tile of 3 time steps. t = 0 is the initial state from which the input values are read at t = 1. At each time step, intermediate output values are computed using the data from the previous time step. The final output is computed at t = 3. Every time step requires a different set of buffers. In the figure, different colored dots at time steps 0–2 represent distinct shared memory buffers. The global memory is represented by green dots. For a 1D Jacobi computation, three distinct memory buffers are required at each time step, shown by different shades of black, blue, and orange dots. In general, an order-k stencil needs 2k + 1 distinct buffers per time step.

After an initial prologue, an iteration of the computation loads a value from global memory at t = 0, computes intermediate results at each time step, and finally writes out an output value to global memory. To observe an instance of this computation, assume that the input values at x = {4, 5, 6} are cached in the shared memory buffers {blue_0, orange_0, black_0}, respectively, before their use (each buffer is labeled as color_t). At time step 1, x = 5 can be computed by reading from these buffers, and stored in blue_1. Assuming that the values computed at x = {3, 4} are available in orange_1 and black_1, x = 4 can be computed at time step 2 and the result can be stored in the orange_2 buffer. Once again, assuming that the values computed at x = {2, 3} are available in black_2 and blue_2, x = 3 can be computed at time step 3, and the result can be written out to global memory.

In the subsequent iteration, the input values at x = {5, 6} are already buffered due to the data overlap from the stencil point on the left. However, the value at x = 4 is no longer required in the computation, and its buffer can be reused. Thus, we can store the next input value from x = 7 in blue0. This reuse holds for each time step.

Two observations can be made from the computation: (1) for each input point x = i read in iteration i, the output point x = i − 3 can be computed; (2) only 3 buffers are needed per time step for the entire computation, irrespective of the memory footprint of the tile size. In general, for an order-k stencil, 2k + 1 buffers are needed per time step. The buffer storing the oldest value in iteration i can always be reused to store the newest value at iteration i + 1. The scheme is called streaming since a sliding window is used to minimize the shared memory footprint, and the input is consumed at one point per iteration.
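To make the buffer reuse concrete, the following is a minimal scalar sketch of the scheme in Figure 3.2b for a 1D order-1 Jacobi with a time tile of 3; the function name, coefficients, and prologue handling are illustrative assumptions rather than StencilGen output.

// One circular buffer of 2k+1 = 3 values per time step; the slot holding the
// oldest value is reused for the newest value in every iteration.
void stream_1d_jacobi(const float *in, float *out, int N) {
  float buf[3][3];                     // buf[t][.] holds the 3 most recent values of time step t
  for (int i = 0; i < N; ++i) {
    buf[0][i % 3] = in[i];             // newest input overwrites the slot of the oldest value
    if (i >= 2)                        // time step 1 needs inputs at x = i-2, i-1, i
      buf[1][(i - 1) % 3] = (buf[0][(i - 2) % 3] + buf[0][(i - 1) % 3] + buf[0][i % 3]) / 3.0f;
    if (i >= 4)                        // time step 2 trails by one more point
      buf[2][(i - 2) % 3] = (buf[1][(i - 3) % 3] + buf[1][(i - 2) % 3] + buf[1][(i - 1) % 3]) / 3.0f;
    if (i >= 6)                        // time step 3: the output point x = i-3 is now complete
      out[i - 3] = (buf[2][(i - 4) % 3] + buf[2][(i - 3) % 3] + buf[2][(i - 2) % 3]) / 3.0f;
  }
}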

When a thread block computes all the points along the x axis serially in a streaming fashion, we have serial-streaming. An alternative would be to block the iteration space along the x axis, and assign each partition to a different thread block. Each thread block will still serially stream through its share of the iteration space, but all the blocks can concurrently stream through their shares of the iteration space. We will refer to such streaming as concurrent-streaming.

Streaming can be extended to 2D or 3D Jacobi stencils by interpreting a point in Figure 3.2b as a line or a plane, respectively. For large problem sizes, fitting 2k + 1 lines (or planes) in their entirety in shared memory may be infeasible. The line (or plane) should then be blocked, with the block size carefully chosen depending on the time tile size and hardware constraints. If the time tile size is 1, then only spatial reuse is exploited. For an N³ domain, a point in Figure 3.2b can be interpreted as a blocked plane of size B², and serial-streaming can be performed through N blocked planes in the third dimension. Such spatial blocking is known as 2.5-D blocking, and the corresponding time tiled blocking is known as 3.5-D blocking [72].

Since streaming does not perform any redundant computation and uses a very limited amount of shared memory, it is an attractive option to combine with overlapped tiling. For a 3D computation, using overlapped tiling along two dimensions and streaming along the third reduces the amount of redundant operations (loads and computations). The downside of streaming is that it serializes the computation along one dimension of the problem. If the problem size is not big enough to keep all the SMs busy, this results in a performance degradation. In such cases concurrent streaming can be used to increase the degree of concurrency in the problem to keep all the SMs busy. Section 3.4 describes a systematic approach to decide the best tiling strategy for a given problem.

3.4 Resource Optimization Strategies

This section discusses different factors to be considered in determining a tiling strategy that is best suited for a given stencil computation. Section 3.4.1 describes time tiled implementations of overlapped tiling for 2D stencils. Section 3.4.2 characterizes 3D stencils based on their access patterns, and describes optimization techniques for them. There is also a discussion of implementation details that might not be directly related to the tiling scheme, but are important to achieve good performance. Further, some of these techniques use explicit registers for storage, thereby increasing the per-thread register pressure and necessitating a trade-off between occupancy and per-thread register usage. The NVCC compiler provides a compile-time flag -maxrregcount=n that limits the number of registers per thread to n. This can be used to balance the trade-off between occupancy and register spills. Let us consider a stencil of order k, with time tile size T, and all operations in single precision. A tiled code must be tuned for different architectures; the discussion assumes the parameters of an Nvidia Kepler K20c as the target GPU.

Figure 3.3: Overlapped tiling for 2D stencils. (a) Overlapped tiling along x for a 2D block. (b) A block's computation in the optimized version of streaming along y.
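As an illustration of the -maxrregcount flag mentioned above, a generated kernel could be compiled as follows (the source file name and the register cap of 32 are assumptions; the flags themselves also appear in Table 3.5):

nvcc -O3 -arch=sm_35 -maxrregcount=32 -Xptxas -v stencil.cu -o stencil

The additional -Xptxas -v flag makes ptxas report the per-kernel register and spill usage, which is the information the fusibility check in Section 3.5.7 relies on.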

3.4.1 Overlapped Tiling for 2D stencils

Overlapped tiling described in Section 3.3.1 can be extended to any d-dimensional

domain. Figure 3.3a shows overlapped tiling for a 2D thread block along the x dimension.

Overlapped tiling + serial streaming

Given an N² input domain, streaming can eliminate redundant computations along the y dimension, and overlapped tiling along the x dimension can achieve concurrency in execution. The input is partitioned into overlapping strips of size Bx × N. In the steady state, a Bx × By thread block reads By input lines along the y dimension per iteration to compute By points of the iteration space. Each thread systematically executes the iteration space at a stride of By in the y dimension. In this mode, each thread block has to store T·(By + 2k) lines in shared memory. With this information, it is possible to make choices for the block dimensionality. To simplify the computations, let k = 1 and T = 4.

• By = 1, i.e., a 1D block: To achieve maximum occupancy, the block size must be 2048/16 = 128. In each iteration, a block reads one line from the input at t = 0 to compute one line of output at t = 4. For this, it needs 6144 bytes of shared memory. From Section 3.2.1, it follows that an SM can have at most min(16, 2048/128, 48KB/6144B) = 8 blocks, a 50% loss of occupancy.

• By = 32, i.e., a 2D block: Two blocks of size 32 × 32 can theoretically be active per SM. Since a block now operates on a chunk of 32 lines instead of 1, it needs 18KB of shared memory. Each SM can practically have min(16, 2048/1024, 48KB/18KB) = 2 active blocks, which implies maximum occupancy. (The small helper following this list reproduces these active-block calculations.)
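The following host-side helper is an illustrative sketch (not part of StencilGen) of the active-block calculation used above; the per-SM limits of 16 blocks, 2048 threads, and 48 KB of shared memory are the K20c values assumed in this section.

static int active_blocks_per_sm(int threads_per_block, int smem_bytes_per_block) {
  int by_blocks  = 16;                               // hardware limit on resident blocks
  int by_threads = 2048 / threads_per_block;         // limit imposed by resident threads
  int by_smem    = (48 * 1024) / smem_bytes_per_block; // limit imposed by shared memory
  int active = by_blocks;
  if (by_threads < active) active = by_threads;
  if (by_smem < active)    active = by_smem;
  return active;
}
// active_blocks_per_sm(128, 6144)   == 8  (the 1D-block case: 50% occupancy)
// active_blocks_per_sm(1024, 18432) == 2  (the 32x32-block case: full occupancy)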

Clearly, better occupancy is achieved with 2D blocks. For a Bx × By block, the computation proceeds as shown in Figure 3.2b: in each iteration after the prologue, By lines are computed at time step t by reading By + 2k lines from the shared memory buffer of time step t − 1. A sliding-window approach is used where, after each iteration, the By oldest lines in the buffer can be reused to cache the new lines. The buffer itself can be implemented as a circular array, with modulo operations on the row index to locate the top and bottom k rows.

Modulo operations are costly on GPUs since they compile to multiple instructions in the assembly code. By making By + 2k a power of 2, it is possible to replace the modulo operator by a bitwise AND (a mod b ≡ a & (b − 1) if b is a power of 2), which has a very high throughput.
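For example, the circular row index can be wrapped as in the sketch below; BUF_ROWS is an assumed name standing for By + 2k rounded up to a power of 2.

#define BUF_ROWS 32
__device__ __forceinline__ int wrap_row(int r) {
  return r & (BUF_ROWS - 1);   // identical to r % BUF_ROWS when BUF_ROWS is a power of 2
}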

Ping-Pong approach with 2D block + streaming

A modification of the sliding-window approach is the ping-pong buffer approach, as illustrated in Figure 3.3b. It requires two shared memory buffers A0 and A1. All even time steps read from buffer A0 and write to buffer A1, and all odd time steps read from A1 and write to A0. In the prologue of Figure 3.3b, at time step t = 0, a block reads By lines to fill the first half of the buffer A0. The prologue computes By − 2tk lines at each time step t > 0. In the next streaming step, the same block reads the next By lines to fill the second half of A0, computing By + 2k lines at each time step. This continues, with the even streaming steps using the first half of A0, and the odd steps using the second half.

This approach not only avoids the use of the expensive modulo operation by setting By to a power of 2, but also reduces the shared memory requirement for a Bx × By block from T·(By + 2k)·Bx for the sliding-window approach to 4·By·Bx. Further, the amount of shared memory required is independent of the time tile size. If thread blocks of size 32 × 16 are created, then each block will use 8192 bytes of shared memory. Each SM can concurrently schedule at most min(16, 2048/512, 48KB/8192B) = 4 blocks, utilizing all the available threads per SM. Assuming that there is enough concurrency to keep all SMs busy, this optimized version outperforms the traditional 2D overlapped tiling scheme.

Overlapped tiling with 1D block + registers + concurrent streaming

Overlapped tiling + streaming with 1D thread blocks was not optimal because the high shared memory usage lowered the occupancy. A stencil is diagonal-access free if the access offset (x0, y0, z0) from one plane along the z dimension to another plane is strictly of the form (0, 0, z0). If the stencil is diagonal-access free, then the shared memory requirement can be lowered by using registers to cache the 2k accesses to lines [67]. For a thread block of 128 threads, k = 1 and T = 4, a block now needs 8 registers and 2048 bytes of shared memory. With this optimization, the number of feasible active blocks per SM is min(16, 2048/128, 48KB/2048B) = 16 blocks. While the occupancy is maximized, this strategy suffers from low concurrency. To keep all 13 SMs of the K20c busy, at least 13 × 16 = 208 thread blocks are needed. However, even a large input domain of size 8192² can only be partitioned into 8192/128 = 64 blocks. This was not a problem with the ping-pong approach, since it needed only 13 × 4 = 52 blocks to achieve full concurrency, and the input domain for it was partitioned into 256 blocks.

The simplest way to increase concurrency is to overlap-tile the input domain along the y dimension as well. It is only necessary to partition the domain into ⌈208/64⌉ = 4 blocks along the y dimension. Since the thread block is 1D and each tile in the input grid is 2D, streaming is performed within a tile.

Figure 3.4: Various grid dimensionalities for a 3D input

This version incurs some volume of redundant computation and bandwidth along y dimension when compared to the ping-pong approach, but the overhead of that is negligible compared to the reduction in redundancy and bandwidth along x dimension due to larger blockDim.x (128 vs. 32). This implementation also benefits from the low access latency of the registers.

3.4.2 Overlapped Tiling for 3D stencils

Efficient 3D tiling must strive to reduce the extra data transfers involved in reloading the ghost region across tiles. Figure 3.4 shows the possible grid dimensionalities for a 3D domain. A 3D grid requires extra memory bandwidth for redundantly loading the same halo regions for neighboring blocks along the z dimension. A 2D grid entails streaming through the non-partitioned dimension. For a 1D grid, streaming is done through one non-partitioned dimension, and tiling is used for the other non-partitioned dimension using parallelogram tiling. However, a 1D grid might not provide enough concurrency to achieve good performance.

For higher-dimensional domains, streaming simplifies the tiling algorithm and code generation. Since the streamed dimension is not tiled, a d-dimensional domain can be tiled using the same algorithm that tiles d − 1 dimensions. Our tiling implementation for 3D stencils partitions the input domain as a 2D grid, and assigns a 2D thread block to each partition that streams through the unpartitioned dimension. Concurrency is achieved by using 2D overlapped tiling (Section 3.4.1) on the partitioned dimensions.

Note that streaming cannot be done through the x dimension for 3D stencils, since this would entail non-coalesced accesses while loading data from a yz plane. Without loss of generality, consider z to be the streaming dimension in this subsection.
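An illustrative launch configuration for this strategy is sketched below; the tile sizes and the kernel name are assumptions. The x and y dimensions are partitioned across a 2D grid of blocks, each tile overlapped by a halo of width T·k, and every block streams over the full z extent.

void launch_2d_grid_stream(const float *in, float *out, int nx, int ny, int nz) {
  const int BX = 32, BY = 16, T = 4, K = 1;      // assumed tile and time-tile parameters
  dim3 block(BX, BY);
  dim3 grid((nx + (BX - 2 * T * K) - 1) / (BX - 2 * T * K),   // ceil(nx / useful tile width)
            (ny + (BY - 2 * T * K) - 1) / (BY - 2 * T * K));
  // stencil_kernel<<<grid, block>>>(in, out, nx, ny, nz);    // generated kernel (placeholder)
}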

Streaming in a time tiled 3D stencil requires T.(2k + 1) planes to be in shared memory. When T = 4 and k = 1, a thread block of size 32 × 32 will need 48KB shared memory. Since this is the total available shared memory, an SM can have no more than one active block, resulting in 50% occupancy. If k = 2, then the required shared memory exceeds the hardware limit. Micikevicius [67] and Nguyen et al. [72] use registers to offload the caching of some planes from shared memory to registers for spatial and time tiling. We discuss the details of implementing a time tiled code with shared memory and registers. The discussion will be independent of the tiling scheme used across thread blocks to achieve concurrency, as it is orthogonal to the streaming optimizations that are local to a block.

Overlapped tiling with 2D blocks + streaming + registers

In the scenario above with k = 1 and block size 32 × 32, if the stencil is diagonal-access free, then the per-thread register pressure can be increased by 2Tk, which reduces the shared memory requirement to 16KB. With this trade-off, an SM can have two active blocks, achieving maximum occupancy. If there are register spills due to the increased register pressure, then the value of T can be reduced to alleviate the register pressure. Figure 3.5 shows one time step of the time tiled 7-point 3D stencil using registers.

Figure 3.5: Streaming along z dimension for 7-point 3D Jacobi

For the 7-point stencil, the resources involved in the computation are the shared memory buffer A[T], and registers rp[T], rm[T]. In each iteration i, threads in a block read a point from the input plane at z = i+1 into rp[0]. The block computes the plane at z = i at time step 1 using A[0], rp[0], and rm[0], and stores this computed value in rp[1]. Using A[1], rp[1], and rm[1], the block computes the plane at z = i−1 at time step 2. Proceeding this way, at time step T, the plane at z = i−T of the output array is computed. After each iteration, a data shift (rp → A → rm) is performed at each time step, freeing rp to store new values.

An implementation sketch of the scheme is presented in Algorithm 3.1. The algorithm is presented independently of the tiling scheme for the other two dimensions. The initializations at Line 1 depend on the tiling schemes for the x and y dimensions. Function compute_stencil() returns the output of applying the stencil computation to a point.

Algorithm 3.1: Stream (IN, T): Streaming along the slowest-varying dimension, and using registers for storage in lieu of shared memory for a 3D order-1 stencil
Input: IN: input array, T: time tile size
Output: OUT: output array
 1  A[T]: shared memory buffer for each time step;
 2  rp[T], rm[T]: registers storing the z + 1 and z − 1 planes at each time step;
    // Initialization
 3  A[0] ← load plane IN[0][...][...];
 4  rp[0] ← load plane IN[1][...][...];
 5  for each z from 2 to N − 2 do
      // Shift data
 6    rm[0] = A[0];
 7    A[0] = rp[0];
 8    rp[0] = load plane IN[z][...][...];
 9    __syncthreads();
10    for t from 1 to T do
        // Shift data, put result in rp
11      rm[t] = A[t];
12      A[t] = rp[t];
13      rp[t] = compute_stencil(A[t−1], rp[t−1], rm[t−1]);
14      __syncthreads();
15    OUT[z − kT][...][...] = rp[T];

Algorithm 3.1 can be generalized to all stencils where each plane accesses only one point per plane from other planes along z dimension. If such a stencil accesses a diagonal point (x0, y0, z0), then appropriate shuffle intrinsics will have to be inserted in the code, so that each thread maintains a correct rp and rm value.
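The CUDA sketch below illustrates the register-shift streaming of Algorithm 3.1, specialized to a single time step (T = 1) of a 7-point order-1 Jacobi; the coefficients, the assumption that one block covers the whole xy plane (launch with blockDim = (nx, ny), nx ≤ BX, ny ≤ BY), and the omitted halo handling are illustrative simplifications, not StencilGen output.

template <int BX, int BY>
__global__ void jacobi7_stream(const float *in, float *out, int nx, int ny, int nz) {
  __shared__ float A[BY][BX];          // plane z: x/y neighbours are needed, so shared memory
  float rp, rm;                        // planes z+1 and z-1: only the centre point is needed
  const int tx = threadIdx.x, ty = threadIdx.y;
  const long plane = (long)nx * ny, idx = (long)ty * nx + tx;
  A[ty][tx] = in[idx];                 // initialization: planes 0 and 1
  rp = in[plane + idx];
  for (int z = 2; z < nz; ++z) {
    rm = A[ty][tx];                    // shift data: rp -> A -> rm
    A[ty][tx] = rp;
    rp = in[(long)z * plane + idx];    // load the next input plane
    __syncthreads();
    if (tx > 0 && tx < nx - 1 && ty > 0 && ty < ny - 1)
      out[(long)(z - 1) * plane + idx] =
          0.4f * A[ty][tx] + 0.1f * (A[ty][tx - 1] + A[ty][tx + 1] +
                                     A[ty - 1][tx] + A[ty + 1][tx] + rp + rm);
    __syncthreads();                   // A must not be overwritten while neighbours still read it
  }
}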

Optimization for associative stencils

For stencils that access more than one point per plane from other planes along the z dimension, the streaming + registers version will incur high register pressure. There is also some redundancy in the values stored in the registers of neighboring threads.

34 Listing 3.1: An illustrative 7-point 2D associative stencil that accesses 2 points from planes along y dimension 1 for (y=1; y

Listing 3.2: Rewriting the stencil of Listing 3.1: reading one input plane at a time, and accumulating its contribution at different output points 1 for (y=0; y

For a 27-point 3D stencil with T = 2, the number of registers needed per thread is 36, which will inadvertently lead to register spills if one intends to maximize occupancy. If such a stencil is associative, an optimization can be applied to reduce the number of registers required at each time step to just 1 for each plane accessed, bringing down the total number of registers per thread to 2Tk + 1. A similar optimization strategy is used by Stock et al. [99] in the context of high-order stencils on multi-core CPUs. An example of such a stencil is shown in Listing 3.1. The associativity of addition and multiplication can be exploited to convert the stencil into an accumulation stencil, as shown in Listing 3.2.
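To make the transformation concrete, the sketch below shows a gather-form 2D stencil and its accumulation-form rewrite. The loop bounds, array names, and coefficients are illustrative assumptions and not the exact code of Listings 3.1 and 3.2.

#define N 512                                              // assumed domain size
static const float c0 = 0.5f, c1 = 0.25f, c2 = 0.125f;     // illustrative coefficients

// Gather form: each output point reads one point from the rows above and below it.
void gather_form(const float A[N][N], float B[N][N]) {
  for (int y = 1; y < N - 1; y++)
    for (int x = 1; x < N - 1; x++)
      B[y][x] = c1 * (A[y - 1][x] + A[y + 1][x])
              + c0 * A[y][x] + c2 * (A[y][x - 1] + A[y][x + 1]);
}

// Accumulation (scatter) form: + is reassociated so that each input row is read once
// and its contribution scattered to the output rows it affects. B is assumed to be
// zero-initialized; boundary rows receive only partial sums and are handled separately.
void scatter_form(const float A[N][N], float B[N][N]) {
  for (int y = 0; y < N; y++)
    for (int x = 1; x < N - 1; x++) {
      if (y + 1 < N) B[y + 1][x] += c1 * A[y][x];
      B[y][x] += c0 * A[y][x] + c2 * (A[y][x - 1] + A[y][x + 1]);
      if (y > 0)     B[y - 1][x] += c1 * A[y][x];
    }
}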

Figure 3.6: Associative reordering to scatter contributions from an input plane to output accumulation registers for 7-point 3D Jacobi

Figure 3.6 shows the time tiling using accumulation registers for associative stencils. Unlike the tiling scheme of Figure 3.5, which used registers to cache input values, the tiling scheme shown in Figure 3.6 uses registers to accumulate the output values. The output values for planes z0 + 1 and z0 are accumulated in registers rp[T] and rc[T], and the output value for plane z0 − 1 is accumulated in the shared memory buffer A[T]. In each iteration i, the input plane at z = i is read into A[0]. From it, the contributions to the output planes at z = i+1, i, and i−1 are accumulated in rp[1], rc[1], and A[1], respectively, at time step 1. All remaining time steps repeat the same process of accumulating output values using A[t − 1] as the input plane. After T time steps, all contributions to the plane at z = i−T would have been accumulated in A[T]. After each iteration, the data is shifted (rp → rc → A), freeing rp to accumulate the contributions for the next output plane.

Algorithm 3.2: Associative-Stream (IN, T): Streaming along the slowest-varying dimension, and scattering the contributions from an input plane to output accumulation registers via associative reordering for a 3D order-1 stencil
Input: IN: input array, T: time tile size
Output: OUT: output array
 1  A[T]: shared memory buffer for each time step;
 2  rp[T], rc[T]: registers storing the z + 1 and z planes at each time step;
 3  for each z from 0 to N − 1 do
 4    A[0] ← load plane IN[z][...][...];
 5    __syncthreads();
 6    for t from 1 to T do
        // Scatter contributions from A[t − 1]
 7      rp[t] ← bottom_plane_contrib(A[t−1]);
 8      rc[t] += mid_plane_contrib(A[t−1]);
 9      A[t] += top_plane_contrib(A[t−1]);
10      __syncthreads();
11    OUT[z − kT][...][...] = A[t];
12    __syncthreads();
      // Shift data
13    for t from 1 to T do
14      A[t] ← rc[t];
15      rc[t] ← rp[t];

Algorithm 3.2 presents an implementation sketch for this approach. An optimized implementation of the algorithm uses only 2T k + 1 registers and T shared memory buffers.

3.5 Fusion Heuristics

Multi-statement stencils and image processing pipelines apply a sequence of stencil operators, identified by the stencil statements, on a set of input domains[71, 85].

Such stencil computations can be represented by a directed acyclic graph (DAG) in which nodes represent stencil operators and incoming edges represent inputs to these operators. Time iterated stencils can also be rewritten as a DAG of stencil operators via explicit unrolling of the time loop. Fusion of stencil operators in a DAG can

37 enhance data reuse. The key idea behind such fusion is to fuse the nodes in the DAG

with common predecessors (i.e., stencil operators that access common input arrays),

or a chain of nodes in the DAG (i.e., stencil operators that have producer-consumer

dependences) so that temporal locality and the per-load data reuse is maximized.

This section describes a simple, yet effective greedy fusion heuristic to fuse stencil

operators in a DAG.

Fusion of any two operators in the stencil DAG is valid if it does not violate

any data dependences. Without violating dependences, different nodes in a DAG

can be fused to get different valid schedules. The performance of each schedule

can vary depending on the profitability of fusion. The space of all valid execution

schedules can be vast and exploring this space presents a significant challenge for

any fusion algorithm. To avoid such exploration, a greedy algorithm can be used

to fuse nodes in the DAG with the intent to minimize a chosen objective function.

From the initial stencil DAG Dg = (V,E), the goal is to create a resultant fused directed graph Df = (Vf ,Ef ) such that each vf ∈ Vf is a convex partition of nodes from V . A convex partition is a “macro-node” comprising nodes from V , with no dependence chain leaving and re-entering any macro-node. Just as convex-shaped tiles of an iteration space can execute atomically, the macro-nodes computed by the fusion algorithm can be executed atomically without violating dependences. The convex partitioning therefore guarantees that Df is acyclic. Only nodes whose fusion is deemed profitable by the objective function are combined together in vf . A single

GPU kernel is generated for each node in vf . No global memory transactions are needed for domains whose definition and use do not cross partition boundaries.

3.5.1 Overview of a Greedy Fusion Algorithm

The fusion algorithm consists of the following steps:

Step 1: For each stencil operator, compute the amount of shared memory and

registers that will be used to cache the data (Section 3.5.2).

Step 2: Identify pair of stencil operators that can be fused without violating dependences (Section 3.5.3).

Step 3: For each stencil operator pair, compute the shared memory and regis-

ters used post fusion, based on the individual resource usage computed in step 1

(Section 3.5.4).

Step 4: Based on the resource usage of the fused node computed in step 3, create

a profitability metric that encodes the impact of fusion on the GPU resources, and

the data movement volume (Section 3.5.5).

Step 5: Define a custom sort to order the profitability metrics of various stencil operator pairs, and choose to fuse the operators of the most profitable pair into a macro-node (Section 3.5.6). Update the stencil DAG and the dependence graph, and

repeat again from Step 1, until no more stencil operator pairs can be fused

Once the fusion algorithm creates the fused stencil DAG Df , an optimized CUDA

kernel can be generated for each node of Df using the algorithms described in Section

3.4.

3.5.2 Computing Resource Requirements of a Stencil Operator

Without loss of generality, assume that streaming is done through the outermost

dimension of a 3D domain. As discussed in Section 3.4, to determine the resource

39 requirement of a stencil operator that is order-k along the streaming dimension, it

is necessary to reason about the storage type of the 2k + 1 accessed planes. The

following chain of reasoning can be used:

(a) If the computation of an output element at (z0, y0, x0) only uses a single input

element at (zr, y0, x0) from an input plane zr, then that input element is stored

in an explicit register (i.e., storage type of zr is register). Otherwise, zr is cached

in shared memory

(b) An output element is written to an explicit register (i.e., its storage type is

register) if it is an accumulation, or it is not the last assignment to a copy-out

array. For the last assignment to any copy-out array, the storage type is global

memory.

Here, an explicit register refers to a scalar temporary variable created to hold an array element instead of using shared memory. It is expected that the compiler will place such scalars in registers. These explicit registers are distinguished from implicit registers that are used internally by NVCC, to hold elements of arrays and

intermediate results in computing expressions. Note that these terms are used just

for the purpose of explanation; in the final code, the explicit registers just appear as

thread scalars.

Listing 3.3 shows two time steps of an illustrative 3D 7-point Jacobi stencil defined in the StencilGen language. From rule (a), the statement at line 7 of Listing 3.3 should use a shared memory buffer to store the input plane A[0, ∗, ∗]. This is because five different values are read from that plane and contribute to different output points.

Listing 3.3: Two time steps of an illustrative 7-point 3D Jacobi stencil. The stencil is written using associativity of + to scatter contributions from an input plane to different output points
 1  parameter L,M,N;
 2  iterator k,j,i;
 3  float A[L,M,N],B[L,M,N],C[L,M,N];
 4  ...
 5  stencil j3d7pt (A, B, C) {
 6    B[k+1][j][i] = (A[k][j][i]);
 7    B[k][j][i] += (A[k][j-1][i] + A[k][j][i-1] + A[k][j][i]
 8                   + A[k][j][i+1] + A[k][j+1][i]);
 9    B[k-1][j][i] += (A[k][j][i]);
10
11    C[k][j][i] = (B[k-1][j][i]);
12    C[k-1][j][i] += (B[k-1][j-1][i] + B[k-1][j][i-1] +
13                     B[k-1][j][i] + B[k-1][j][i+1] + B[k-1][j+1][i]);
14    C[k-2][j][i] += (B[k-1][j][i]);
15  }
16  ...

From rule (b), it follows that three explicit registers must be used to store the output

values written to B[1, ∗, ∗], B[0, ∗, ∗] and B[−1, ∗, ∗] at lines 2, 3, and 5 respectively.

The resource requirement for each stencil operator can be represented by a 3-tuple

(Nreg,Nshm,Macc→res), where Nreg and Nshm are the number of explicit registers and

shared memory buffers used by the operator, and Macc→res is a resource map from

each accessed data plane to a handle that represents the storage used for that plane.

3.5.3 Identifying Fusion Candidates

The dependences in the stencil DAG are captured by the dependence graph Gdep = (V, Edep). From Gdep, consider a transitive dependence graph Gtrans = (V, Etrans), such that

si → sj ∈ Etrans iff there exists sk ∈ V with sk ≠ si and sk ≠ sj, such that there is a path from si to sk (denoted si ⇝ sk) and a path from sk to sj (i.e., sk ⇝ sj).

Gtrans is used to ensure the validity of fusion, i.e., that all macro-nodes generated by fusion are convex. If si → sj ∈ Etrans, then fusing si and sj will violate convexity unless all nodes along any path si ⇝ sj are also included in the fused macro-node.

The definition above does not preclude fusion of a stencil operator with its immediate predecessor as long as there are no intervening operators along all paths from si to

sj.

For each distinct stencil operator pair (si, sj) such that si → sj ̸∈ Etrans, the approach first computes the total GPU resources required should the two stencil operators be fused, and then constructs a fusion profitability metric that will quantify the benefit of fusing si and sj into a macro-node, in terms of the GPU resource consumption, and data movement reduction post fusion.

3.5.4 Computing the Post-Fusion Resource Map

To assess the impact of fusing two stencil operators, it is necessary to find the

resources that would be needed to execute the fused node. Let Mi and Mj be the maps

from accessed array planes to GPU resources (Macc→res) for the fusion candidates si

and sj, respectively. Let Mfused be the resource map for the fused node. If there is

no dependence between si and sj, then Mfused is simply a union of the resource map

of the individual nodes, where the rules of union are as defined in this section.

42 If there is a true dependence between the two stencil operators, it is necessary to

first create an interleaving of the instances of those operators such that the depen-

dence between them is preserved post fusion. Once such a schedule is created, the

resource map for the target of the dependence may need to be suitably adjusted by

the dependence distance along the streaming dimension. For example, in Listing 3.3,

after the contribution at line 5 the values of plane B[−1, ∗, ∗] can be used as input for the next time step (lines 7–10). There is a true dependence from the write to plane

B[−1, ∗, ∗] at line 5 to the subsequent reads at lines 7–10. Notice that the accesses along the streaming dimension are shifted by the dependence distance in lines 7–10 to ensure that after fusing lines 2–10, dependences are preserved.

Once the schedules are adjusted to preserve dependences post fusion, there might be a non-empty intersection between the planes accessed by stencil operators si and sj. For example, in Listing 3.3, B[−1, ∗, ∗] is mapped to a register in line 5, and to

shared memory in lines 8–9. In the presence of such conflicts for a plane, its storage

in the fused node is always demoted to the lower memory in the hierarchy, i.e, shared

memory in this case.

3.5.5 Computing the Profitability Metric

Once the resource map for the fused node has been computed, it is possible to

quantify the profitability of fusing operators si and sj using a 5-tuple (Dm,Treg,Tshm,

Sreg,Sshm), where

− Dm represents the savings in data movement after fusing si and sj. For every true dependence between si and sj, there is an implicit reduction in data movement for the array carrying the dependence. To capture this, Dm is incremented by 2, indicating a saving in a store and subsequent load of the array. There is no explicit data movement saving for arrays carrying a WAR dependence. If there is a WAW (RAR) dependence between si and sj but no RAW dependence, Dm is incremented by 1 to capture the reduction in multiple writes (reads) to (from) the same location in global memory.

− Treg represents the total number of explicit registers in the fused node.

− Tshm represents the total amount of shared memory used in the fused node.

− Sreg represents the total number of accesses that are mapped to the same explicit registers in si and sj post fusion. Sreg = num_registers(Mi) + num_registers(Mj) − num_registers(Mfused).

− Sshm represents the total number of accesses that are mapped to the same shared memory buffer in si and sj post fusion. Sshm = num_shared(Mi) + num_shared(Mj) − num_shared(Mfused).

3.5.6 Constructing the Objective Function

Once the 5-tuple has been computed for all stencil operator pairs, they should be

ordered based on their fusion profitability. This can be done by defining a custom-sort

relation ≺ as a total order over the list of 5-tuples (Ltuple). The set of sorting rules that can be applied to order two 5-tuples ci, cj ∈ Ltuple is as follows:

a. (Dm)ci < (Dm)cj ⇒ ci ≺ cj

b. (Treg + Tshm)cj < (Treg + Tshm)ci ⇒ ci ≺ cj

c. (Tshm)cj < (Tshm)ci ⇒ ci ≺ cj

d. (Treg)cj < (Treg)ci ⇒ ci ≺ cj

e. (Sreg + Sshm)ci < (Sreg + Sshm)cj ⇒ ci ≺ cj

f. (Sshm)ci < (Sshm)cj ⇒ ci ≺ cj

g. (Sreg)ci < (Sreg)cj ⇒ ci ≺ cj

h. i < j ⇒ ci ≺ cj

where rule h is only used to ensure that a total order can be found in case of tuples with strictly identical metrics.

The order in which the sorting rules are applied to break a tie depends on the minimization objective. For example, if the objective is to minimize the data movement, ≺ follows the rule sequence a → b → c → d → e → f → g → h, where a is the primary sorting rule, and ties are settled by following the remaining rules in the sequence. Algorithm 3.3 describes the process of creating the 5-tuple for each stencil operator pair, and then sorting them based on the chosen objective (data movement minimization in the algorithm).
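The comparator below is a sketch of the custom-sort relation for the data-movement objective; the struct layout and field types are assumptions, and the rules a–h from above are applied in order until one of them discriminates.

struct FusionTuple { int id; long Dm, Treg, Tshm, Sreg, Sshm; };

// Returns true iff ci ≺ cj under the rule sequence a -> b -> ... -> h.
bool precedes(const FusionTuple &ci, const FusionTuple &cj) {
  if (ci.Dm != cj.Dm)                         return ci.Dm < cj.Dm;                          // rule a
  if (cj.Treg + cj.Tshm != ci.Treg + ci.Tshm) return cj.Treg + cj.Tshm < ci.Treg + ci.Tshm;  // rule b
  if (cj.Tshm != ci.Tshm)                     return cj.Tshm < ci.Tshm;                      // rule c
  if (cj.Treg != ci.Treg)                     return cj.Treg < ci.Treg;                      // rule d
  if (ci.Sreg + ci.Sshm != cj.Sreg + cj.Sshm) return ci.Sreg + ci.Sshm < cj.Sreg + cj.Sshm;  // rule e
  if (ci.Sshm != cj.Sshm)                     return ci.Sshm < cj.Sshm;                      // rule f
  if (ci.Sreg != cj.Sreg)                     return ci.Sreg < cj.Sreg;                      // rule g
  return ci.id < cj.id;                                                                      // rule h
}
// e.g. std::sort(Ltuple.begin(), Ltuple.end(), precedes);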

Once sorted, the topmost tuple of Ltuple represents the most profitable candidate for fusion. If the fusion (1) does not exceed the hardware limit on shared memory, and (2) does not result in a prohibitively large halo region (and consequently volume of redundant computations) in case of a true dependence between si and sj, then the

nodes corresponding to it are fused. The stencil DAG is updated to include the fused

node, and the original operators are removed after updating the dependence graph

appropriately. The whole process is repeated for this new DAG, till no further fusion

is feasible.

Algorithm 3.3: Create-Sorted-Tuple-List (D): From the input DAG D, create the 5-tuple for each fusion candidate, and sort them based on the objective function
Input: IN: input DAG D with n statements
Output: OUT: a sorted list of 5-tuples, Ltuple
 1  Ltuple ← ∅;
 2  compute_resource_3-tuple (D);
 3  {Graw, Gwar, Gwaw} ← create_dependence_graphs (D);
 4  Gdep ← Graw ∪ Gwar ∪ Gwaw;
 5  Gtrans ← transitive_closure (Gdep) \ Gdep;
 6  for each distinct stmt pair (i, j), i, j ≤ n do
 7    Dm ← 0;
 8    if (si → sj ∉ Gtrans) then
 9      Mi ← resource_usage (si);
10      Mj ← resource_usage (sj);
11      Mfused ← fused_resource_map (si, sj, Mi, Mj);
12      Dm += 2 ∗ RAW_dependence_count (si, sj);
13      Dm += WAW_dependence_count (si, sj);
14      Dm += RAR_dependence_count (si, sj);
15      Sreg ← num_registers (Mi) + num_registers (Mj) − num_registers (Mfused);
16      Sshm ← num_shared (Mi) + num_shared (Mj) − num_shared (Mfused);
17      {Tshm, Treg} ← compute_total_resources (Mi, Mj);
18      Ltuple.append (make_tuple (Dm, Treg, Tshm, Sreg, Sshm));
19  sort (Ltuple, minimize_data_movement);
20  return Ltuple;

3.5.7 Imposing Hardware Constraints

In this section, we describe how the fusion algorithm accounts for the constraints of shared memory available per SM on the device (Kmax), as well as the number of registers available for each thread (Rmax).

The amount of shared memory used by each node can be estimated from the resource map (Macc→res) associated with each node, the size of the thread block used, and the data type of each buffer mapped to shared memory. The number of registers per thread is usually supplied at compile time on GPUs, and we assume that the limit is known to the fusion algorithm. Given these limits, the tuples in the list where the fused node would require (a) more than Rmax registers during compilation, and (b) more than (Kmax/NB) bytes of shared memory per block, are ignored (NB indicates the number of active blocks per SM). The amount of register spills can be determined by using the "-Xptxas -v" flag during compilation. Setting the threshold of register spills to 0 might seem too conservative. However, modeling the relationship between spills, occupancy, and performance is currently beyond the scope of our framework. We allow the user to set the threshold on the number of spills that can be tolerated.

Table 3.1: Equations for per-block data transfers and computational volume for different overlapped tiling schemes. W is the warp size.

Traditional 3D Tiling:
  Load Inst.: B³    Load Transactions: ⌈B/W⌉·B²
  Store Inst.: (B − 2Tk)³    Store Transactions: ⌈(B − 2Tk)/W⌉·(B − 2Tk)²
  Total Computation: Σ_{i=1..T} (B − 2ik)³
  Redundant Computation: Σ_{i=1..T−1} ((B − 2ik)³ − (B − 2Tk)³)

Sliding Window with 2D Tiling:
  Load Inst.: N·B²    Load Transactions: ⌈B/W⌉·N·B
  Store Inst.: (N − 2Tk)(B − 2Tk)²    Store Transactions: ⌈(B − 2Tk)/W⌉·(N − 2Tk)(B − 2Tk)
  Total Computation: Σ_{i=1..T} (N − 2ik)(B − 2ik)²
  Redundant Computation: Σ_{i=1..T−1} ((N − 2ik)(B − 2ik)² − (N − 2Tk)(B − 2Tk)²)

3.5.8 Modeling Computation and Communication Redundancy

Table 3.1 lists the per-block data transfer and the computation volume for traditional 3D tiling, as well as the streaming strategy, assuming that all global memory transactions are perfectly aligned. Multiplying them by the number of blocks launched gives us a rough estimate of the total data transfers and computation volume. Based on these equations, we stop fusing nodes when the volume of recomputation in the fused node starts to dominate the data transfer volume. For illustration, assume that there are four fusion-amenable stencil operators. We can explore several fusion choices – fuse all four to get a single operator, fuse two to get two resultant operators, or fuse none. For sliding-window tiling, Table 3.2 explores the profitability of different fusion choices for stencils of various orders.

k   Fusion degree   Load Transactions   Store Transactions   Redundant Computation
1   1               18939904            17686800             0
1   2               10786022            9364037              39978854
1   4               7225344             5334336              144831456
2   1               21572044            18728073             0
2   2               14450688            10668672             95227776
2   4               15745024            7626496              487849728
4   1               28901376            21337344             0
4   2               31490048            15252992             313916416

Table 3.2: Choosing degree of fusion for varying k

We seek to minimize global memory transactions via fusion, while accounting for the increase in redundant computation as more statements are fused together. The hardware constraints used to eliminate tuples of the sorted list Ltuple in Section 3.5.7 usually eliminate high degrees of fusion. The profitability of fusion for the configurations in Table 3.2 is determined by computing the estimated time for performing global memory transactions (Tmem) and the estimated time for performing the computation (Tcomp). To compute these estimates, we use the peak memory bandwidth and the peak performance achieved by a set of synthetic stencil benchmarks on the underlying device. The comparison metric Tmax = max(Tcomp, Tmem), inspired by the roofline model [121], is used to quantify the efficiency of the generated sliding-window overlap-tiled code. If a tuple of the list Ltuple has a value of Tmax for the fused node that is greater than the sum of the values of Tmax for the unfused nodes, that fusion choice is eliminated from consideration. The configurations with lowest Tmax for different values of k are highlighted in Table 3.2. For experimental validation, we plot the performance for 12 time steps of a 13-point order-2 3D Jacobi stencil sequence with varying degree of fusion. The results are shown in Figure 3.7. The best performance is achieved when the degree of fusion is 2, as predicted by the model.

Figure 3.7: Performance of 12 time steps of a 13-point Jacobi stencil (k = 2) for varied degree of fusion (Tesla K20c and GeForce GTX Titan)
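The comparison metric described above can be sketched as follows; the parameter names are assumptions, with peak_bw and peak_flops standing for the values measured by the synthetic benchmarks mentioned in the text.

double t_max(double load_trans, double store_trans, double bytes_per_trans,
             double flops, double peak_bw, double peak_flops) {
  double t_mem  = (load_trans + store_trans) * bytes_per_trans / peak_bw;  // estimated memory time
  double t_comp = flops / peak_flops;                                      // estimated compute time
  return t_mem > t_comp ? t_mem : t_comp;                                  // roofline-style maximum
}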

3.5.9 Controlling recompilation

Since we need to compile the code generated for each fusion candidate to determine register usage/spills, excessive recompilation can add significant overhead to the code generation time for complex stencils. However, this overhead is mitigated in part due

49 to the choice of a greedy algorithm over one that explores all valid schedules. The

number of recompilations is controlled by the following choices:

• Fusion always results in an increase in the resources used by the fused cluster.

We only recompile the code of the fused cluster when the fusibility constraints

are not already violated by the constraints on shared memory, explicit registers,

and redundant computation.

• If fusing any two nodes results in register spills, then we rule out, without

recompilation, fusion of any clusters involving either of these two nodes. For

example, if fusing nodes A and B results in spills, then we will not consider

fusing clusters such as {A, C}, {B,D}, etc.

• We provide the user an option to specify a recompilation-bypass size, and only

recompile for clusters exceeding that size. For example, if the recompilation-

bypass size is 2, then we only recompile when the fusion results in a cluster of

size ≥ 3.

With these choices, we prune away a large number of fusion candidates requiring recompilation, thus controlling the total code generation time.

3.6 Case Study: Hypterm

We evaluate the impact of applying the fusion heuristics on the hypterm routine of the ExpCNS mini-application from DoE that integrates the Compressible Navier-Stokes (CNS) equations [29]. For the discussion, we set the fusion objective to be minimization of data movement, and assume that the compilation restricts the registers per thread to 32. The hypterm routine reads from a set of input arrays (cons and q) to update a set of flux arrays. In the stencil DAG of hypterm shown in Figure 3.8, the intermediate and output nodes (in shades of black) are order-4 stencil operators. There are 15 accumulation statements in the stencil computation, which are labeled in the figure. Different colored edges in Figure 3.8 represent 1D stencils along different dimensions. The hypterm routine is not time-iterated, and thus efficient data reuse across multiple stencils is crucial for performance. An untiled implementation of hypterm which does not use shared memory, and where each stencil operator is in a separate kernel, achieves only 35.65 GFlops on a Tesla K20c device.

Figure 3.8: The stencil DAG of hypterm routine. cons and q are the input arrays, flux is the output array.

The fusion approach is applied to this DAG as follows:

− We begin by noting that s12 and s15 (s6 and s9) apply an order-4 1D stencil along z to three (two) input arrays. The number of registers used for spatial tiling would be ≥ 24 (≥ 16), which exceeds Mreg. Therefore, these statements will never be fused with any other statements.

− Statements s13 and s14 accumulate data to the same array, and have two common input arrays. Since their fusion would result in the maximum data movement saving among all nodes, the fusion algorithm merges them.

− The next candidates for fusion are clusters {s10, s11} and {s4, s5}. Both clusters result in a data movement saving of 2. However, the algorithm chooses to fuse s10 and s11 first, since the resulting node requires fewer total shared memory buffers. Statements s4 and s5 are fused in the next step.

− Fusing clusters {s4, s5} and {s13, s14} would result in a data movement saving of 3. However, the fusion results in register spills in the generated code, violating the fusibility constraints. Hence this cluster is not fused.

− Statements s7 and s8 have a common input, and contribute to the same output. Hence they are fused next.

− Clusters {s4, s5} and {s7, s8} are fused next, since fusing them saves on data movement without register spills. Statements s1 and s2 are fused as they contribute to the same array.

− Statement s3 and cluster {s10, s11} are fused next, since they have a common input. Clusters {s1, s2} and {s13, s14} are fused to minimize the number of kernel launches.

Any further fusion violates the fusibility constraints, since the shared memory usage of the fused cluster would exceed Mshm. The final nodes in the optimized DAG are: {s1, s2, s13, s14}, {s3, s10, s11}, {s4, s5, s7, s8}, s6, s9, s12, and s15. As mentioned earlier, statements s6, s9, s12, and s15 apply an order-4 stencil along the streaming dimension z on two or more input arrays.

52 Due to the high register pressure with our tiling scheme, we opt for naïve tiling with no use of shared memory or explicit registers for s13 and s14 in the generated code. The optimized code for the post-fusion DAG achieves 128.98 GFlops on a K20c, a 3.6× speedup over the base unfused version.

Recompilations If we set the recompilation-bypass size to 1, then we perform nine recompilations to detect any violation of the fusibility constraints – five for clusters of size 2, two for clusters of size 3, and two for clusters of size 4. If the recompilation- bypass size is set to 2, then only four recompilations are needed.

Effect of implicit registers on degree of fusion Since all the statements in the hypterm stencil are accumulations, a fusion model that does not account for implicit registers will fuse too many statements together. This will create high register pressure in the generated kernel, leading to register spills. In Figure 3.9, we plot the performance of the generated code for various degrees of fusion. max-fuse refers to the version with maximum fusion. It requires a 264 byte stack frame, generating 648 bytes of spill stores and 488 bytes of spill loads. fuse-8 and fuse-6 also generate spills. fuse-8 requires 72 bytes of stack frame (76 bytes spill stores, 92 bytes spill loads), and fuse-6 requires 16 bytes of stack frame (20 bytes spill stores, 20 bytes spill load).

There are no spills generated for fuse-4 and fuse-2. Any advantages from reduction in data movement for a degree of fusion higher than 4 is offset by the increase in global memory traffic from register spills. Since our fusion approach checks for register spills in the fusibility constraints, we generate a code with degree of fusion 4, thereby achieving the highest performance amongst these variants.

Figure 3.9: Performance of hypterm for varied degree of fusion (Tesla K20c and GeForce GTX Titan)

3.7 Experimental Evaluation

3.7.1 Evaluation on GPUs

Experimental setup We have implemented the tiling schemes and fusion heuristics described in Sections 3.4 and 3.5 into the StencilGen code generator, and we compare below the performance of the StencilGen-generated code against PPCG-0.07 [113], OpenACC-17.4 [78], and the auto-scheduler branch of Halide [70]. The benchmarks used in the evaluation are listed in Table 3.3. All the benchmarks are single precision, and the GPU results are obtained on Tesla K20c and Pascal Titan X devices; the hardware is detailed in Table 3.4. For PPCG and StencilGen, the generated code was compiled using NVCC 8.0 [75]. The Halide-generated code was compiled with LLVM 3.7 [57]. The compilation flags are listed in Table 3.5.

Code generation PPCG performs classical time tiling along with default thread coarsening. Mapping multiple iterations to a thread exposes instruction-level parallelism. Coarsening within the sustainable per-thread register pressure aids register-level reuse, and helps hide memory access latency by exposing instruction-level parallelism [115]. We tuned the tile sizes for PPCG, and report results for the version with the best performance. StencilGen does not have support for thread coarsening at present. For each benchmark, we leverage operator associativity to alleviate pressure on GPU resources. The fusion heuristic restricts the degree of fusion to two time steps for the time iterated benchmarks. The K20c and Titan X GPU devices can load read-only data through the cache used by the texture pipeline. To enable this feature, the read-only data in StencilGen-generated code is automatically annotated with the __restrict__ keyword.

Table 3.3: Characteristics of bandwidth-bound stencil benchmarks (N: Domain Size, T: Time Tile Size, k: Stencil Order, FPP: FLOPs per Point)

Benchmark    N      T  k  FPP    |  Benchmark  N     T  k  FPP
j2d5pt       8192²  4  1  10     |  j3d27pt    512³  4  1  54
j2d9pt-gol   8192²  4  1  18     |  curl       450³  1  1  36
j2d9pt       8192²  4  2  18     |  heat       512³  4  1  15
gaussian     8192²  4  2  50     |  poisson    512³  4  1  21
gradient     8192²  4  1  18     |  cheby      512³  4  1  39
j3d7pt       512³   4  1  13     |  denoise    512³  4  2  62
j3d13pt      512³   4  2  25     |  hypterm    300³  1  4  358
j3d17pt      512³   4  1  28     |

Table 3.4: Benchmarking hardware for bandwidth-bound stencils

Tesla K20c GPU: 13 MPs, 192 cores/MP, 208 GB/s peak global memory bandwidth, 3.52 TFLOPS peak single-precision performance, 5 GB global memory, 48K shared memory, 64K registers/SM
Pascal Titan X GPU: 28 MPs, 128 cores/MP, 480 GB/s peak global memory bandwidth, 10.9 TFLOPS peak single-precision performance, 12 GB global memory, 48K shared memory, 64K registers/SM
Intel Skylake CPU: Intel Core i7-6700K (4 cores, 2 threads/core, 4.00 GHz, 512 GFLOPS peak single-precision performance)
Intel Xeon CPU: Intel Xeon E5-2680 (14 cores, 2 threads/core, 2.40 GHz, 1.05 TFLOPS peak single-precision performance)
KNL Phi: Intel Xeon Phi 7250 (68 cores, 4 threads/core, 1.40 GHz, 6.092 TFLOPS peak single-precision performance)

Table 3.5: Compilation flags for different compilers and different benchmarking hardware (GPU, multi-core CPU, and Xeon Phi)

nvcc: -O3 -maxrregcount=n -ccbin=g++ -std=c++11 -Xcompiler "-fPIC -fopenmp -O3 -fno-strict-aliasing" --use_fast_math -Xptxas "-v -dlcm=ca" -arch=sm_35/60
gcc: -Ofast -fopenmp -ffast-math -fstrict-aliasing -march=core-avx2/{knl -mavx512f -mavx512er -mavx512cd -mavx512pf} -ffp-contract=fast -mfma
llvm: -Ofast -fopenmp=libomp -mfma -ffast-math -fstrict-aliasing -march=core-avx2/knl -mllvm -ffp-contract=fast
icc: -Ofast -qopenmp -no-prec-div -march=native -fma -xMIC-AVX512 -fp-model fast=2 -complex-limited-range -override-limits

GPU performance results Figures 3.10 and 3.11 plot the performance of the StencilGen-generated code against different code generators on the two GPU devices. StencilGen systematically outperforms the other code generators in our experiments, by a factor of up to 7×. Our optimization scheme for associative stencils helps achieve 868 GFLOPS (3.81 TFLOPS) for the gaussian stencil, and 730 GFLOPS (2.94 TFLOPS) for the j3d27pt stencil on the K20c (Titan X) device. For the 2D stencils, the best performance is achieved by overlapped tiling along both the x and y dimensions, and concurrent-streaming along the y dimension. For the 3D stencils, overlapped tiling along the x and y dimensions, and serial-streaming along the z dimension, gives the best performance.

After careful analysis of this experimental data, we obtained the following conclusions. (1) One has to choose a combination of varying optimizations that overcome the performance-limiting hardware constraints depending on the dimensionality of the problem. (2) Using registers along with shared memory as buffers is crucial to performance irrespective of the dimensionality. (3) Serial-streaming for 3D stencils and concurrent-streaming for 2D stencils reduces the extra bandwidth consumption without constraining concurrency. (4) Fusion of time steps for time iterated stencils, and of stencil operators for a stencil DAG, reduces the global memory transactions, which is crucial in optimizing bandwidth-bound stencils.

Figure 3.10: Performance of benchmarks on Tesla K20c device (PPCG, StencilGen, Halide, OpenACC)

Table 3.6 lists the percentage of the peak single-precision FLOPs achieved by the code generated by different code generators on the two GPU devices. Despite applying a plethora of optimizations, none of the codes achieve more than 25% (35%) of the peak performance on K20c (Titan X). This serves to highlight the bandwidth-bound nature of stencil computations.

Figure 3.11: Performance of benchmarks on Pascal Titan X GPU device (PPCG, StencilGen, Halide, OpenACC)

Figure 3.12: Performance of benchmarks on Skylake processor (OpenMP, Diamond Tiling, Halide)

Figure 3.13: Performance of benchmarks on Xeon processor (OpenMP, Diamond Tiling, Halide)

3.7.2 Stencil Computations on CPUs

Experimental setup Many frameworks explicitly target stencil optimizations on CPUs [7, 84, 125, 18, 42]. We evaluate the CPU code generated by diamond tiling [7] and Halide with auto-scheduling support [70] against a baseline OpenMP code on two different multi-core CPU processors, and a Xeon Phi processor. We also evaluate Intel's latest stencil optimization framework, YASK [125], on the Xeon Phi processor. Due to the limitations of the framework, we could only generate YASK code for 3D stencils. We use three different compilers, namely ICC-17.0 [46], LLVM-5.0 [57], and GCC-7.2 [98], to compile the generated C/C++ stencil code, and report the highest performance.

Figure 3.14: Performance of benchmarks on Xeon Phi 7250 (OpenMP, Diamond Tiling, Halide, YASK)

CPU performance results The performance is plotted in Figures 3.12, 3.13, and 3.14. One can observe that diamond tiling outperforms both Halide and the baseline code with its time tiling and concurrent start optimizations. On Xeon Phi, the performance of YASK is far superior to that of the other frameworks. YASK performs optimizations like vector folding, cache blocking, and temporal wave-front tiling that are specifically tuned for the Xeon Phi architecture.

For each benchmark, Tables 3.7 and 3.8 report the compiler that was used in compiling the best-performing version, and the percentage of the machine peak achieved by that version. For the multi-core CPUs, only diamond tiling is able to achieve nearly 30%-45% of the machine peak for the j3d27pt stencil. Despite the high performance, YASK is able to achieve only 15% of the machine peak.

3.7.3 Energy Efficiency

Tables 3.6, 3.7, and 3.8 also tabulate the energy expended in a single floating-point operation for the GPUs, multi-core CPUs, and Xeon Phi, respectively. For the stencil codes compiled with NVCC, we use the NVidia Management Library (NVML, https://developer.nvidia.com/nvidia-management-library-nvml) to measure the Joules/FLOP ratio. For the multi-core CPUs and Xeon Phi, we use RAPL (http://web.eece.maine.edu/~vweaver/projects/rapl/) to obtain the measurement. Although typically measurements indicate that faster code versions have a lower Joules per FLOP ratio, the data shows situations where a faster code (higher PP) has a lower energy efficiency (JPF), as when comparing the Halide and OpenMP versions of j2d9pt-gol. We conjecture the use of more power-hungry instructions (e.g., FMA) and additional in-processor data traffic as possible causes for this slight decrease in energy efficiency, as unfortunately obtaining fine-grain power measurements to determine the root cause is not feasible with our energy measurement setup.

Note that these benchmarks running on GPU devices have a much lower (1× – 0.1×) Joules/FLOP ratio than the multi-core CPUs and Xeon Phi, while also achieving slightly higher peak performance. This indicates that the current generation of GPU devices appears more energy efficient than these CPU processors in executing such bandwidth-bound stencil computations.

3.8 Summary

Several factors are important in enhancing the performance of a 3D stencil computation on a GPU: (a) maximizing use of the processing resources; (b) minimizing global memory accesses; (c) effective use of shared memory and registers; (d) accounting for the computational redundancy introduced by the tiling scheme. In this chapter, we described algorithms to generate optimized code for stencil computations by managing the allocation of GPU resources, using fusion along with 2D overlapped tiling, and streaming through one unpartitioned spatial dimension. We described a greedy fusion heuristic that seeks to optimize the use of shared memory and registers. Experimental results on two GPU devices on a set of resource-intensive benchmarks demonstrated significant performance improvement over other available code generators.

Bench.      | Tesla K20c: PPCG PP, PPCG JPF, StencilGen PP, StencilGen JPF, Halide PP, OpenACC PP | Pascal Titan X: PPCG PP, PPCG JPF, StencilGen PP, StencilGen JPF, Halide PP, OpenACC PP
j2d5pt      | 3.75  6.17E-10  11.66  2.60E-10  3.23  7.60 | 3.70  3.28E-10  12.37  1.37E-10  3.05  4.54
j2d9pt-gol  | 3.54  5.47E-10  19.43  1.57E-10  3.31  6.01 | 5.46  2.41E-10  23.06  7.86E-11  3.49  3.39
j2d9pt      | 3.73  5.20E-10  14.83  2.10E-10  3.24  5.76 | 5.56  2.40E-10  19.19  8.47E-11  3.44  3.34
gaussian    | 3.44  4.58E-10  24.64  1.34E-10  3.40  4.30 | 7.78  1.90E-10  34.71  5.17E-11  3.77  2.50
gradient    | 4.90  4.94E-10  14.60  2.24E-10  4.34  3.06 | 4.86  2.71E-10  15.64  1.16E-10  4.55  0.62
j3d7pt      | 3.43  6.67E-10  7.89   3.56E-10  2.52  2.98 | 1.84  5.18E-10  9.91   1.78E-10  2.03  1.83
j3d13pt     | 3.57  5.68E-10  8.17   2.71E-10  2.97  3.06 | 2.20  3.98E-10  10.89  1.58E-10  3.01  2.04
j3d17pt     | 2.77  7.08E-10  10.70  2.34E-10  2.69  4.24 | 2.29  4.08E-10  13.42  1.29E-10  2.97  2.67
j3d27pt     | 3.45  4.02E-10  20.70  1.29E-10  3.61  1.69 | 3.25  2.27E-10  26.86  5.34E-11  3.60  0.32
curl        | 0.56  2.69E-09  3.54   6.95E-10  2.57  0.73 | 0.30  2.33E-09  2.52   4.96E-10  1.44  0.50
heat        | 3.43  5.27E-10  7.47   3.69E-10  2.59  3.15 | 1.62  4.35E-10  10.13  1.67E-10  2.04  1.76
poisson     | 1.98  7.74E-10  7.96   3.24E-10  1.82  1.91 | 1.05  6.03E-10  8.91   2.04E-10  2.01  1.38
cheby       | 2.30  8.83E-10  6.90   3.61E-10  2.09  2.20 | 2.66  4.14E-10  6.83   1.63E-10  2.56  3.27
denoise     | 4.22  5.58E-10  10.39  2.74E-10  3.39  3.36 | 2.23  3.85E-10  10.26  1.38E-10  2.48  1.94
hypterm     | 1.83  1.06E-09  4.49   4.24E-10  2.31  1.92 | 1.68  8.01E-10  8.29   1.73E-10  1.81  1.91
PP: Percentage of peak GFLOPS, JPF: Joules per FLOP

Table 3.6: Energy and performance characteristics of benchmarks on GPU devices

Table 3.6: Energy and performance characteristics of benchmarks on GPU devices Bench. i7-6700K Xeon E5-2680 OpenMP Diamond Tile Halide OpenMP Diamond Tile Halide C PP JPF C PP JPF C PP JPF C PP JPF C PP JPF C PP JPF j2d5pt gcc 4.49 1.6E-09 icc 8.931.0E-09 icc 6.274.5E-09 icc 6.793.9E-09 icc 12.14.8E-09 llvm9.451.9E-08 j2d9pt- icc 8.06 9.8E-10 icc 14.98.0E-10 icc 11.12.6E-09 llvm11.82.8E-09 icc 24.01.3E-08 gcc 15.81.1E-08 gol j2d9pt icc 7.90 1.0E-09 icc 14.97.0E-10 gcc 11.32.6E-09 icc 11.52.1E-09 icc 21.81.1E-08 icc 15.41.1E-08 gaussian gcc 21.9 4.8E-10 icc 33.73.4E-10 icc 26.11.1E-09 icc 31.19.3E-10 icc 42.04.4E-09 llvm31.34.8E-09 gradient llvm8.10 1.1E-09 icc 8.171.2E-08 icc 11.12.6E-09 llvm7.423.8E-09 icc 9.451.5E-08 icc 14.41.1E-08 j3d7pt gcc 3.52 2.1E-09 icc 9.958.0E-10 icc 4.633.3E-09 llvm6.954.7E-09 gcc 16.33.4E-09 llvm9.571.1E-08 j3d13pt llvm4.48 1.6E-09 icc 11.67.1E-10 gcc 6.282.2E-09 icc 9.803.5E-09 llvm18.62.7E-09 icc 12.96.6E-09 64 j3d17pt llvm7.61 1.0E-09 icc 18.25.2E-10 gcc 8.831.9E-09 icc 15.52.2E-09 gcc 27.11.8E-09 llvm16.46.3E-09 j3d27pt llvm14.7 6.0E-10 icc 32.83.5E-10 llvm15.21.1E-09 llvm27.91.3E-09 icc 45.13.7E-09 gcc 20.03.7E-09 curl gcc 3.82 4.2E-09 icc 2.741.7E-09 icc 2.709.0E-09 icc 3.965.7E-09 llvm3.8 1.6E-08 icc 4.1 3.6E-08 heat llvm4.04 1.7E-09 icc 9.908.2E-10 icc 5.372.9E-09 icc 8.204.1E-09 gcc 15.75.3E-09 gcc 10.99.3E-09 poisson gcc 5.73 1.3E-09 icc 13.47.1E-10 llvm6.602.5E-09 icc 10.12.8E-09 gcc 19.34.0E-09 gcc 12.97.9E-09 cheby icc 6.46 9.1E-10 icc 12.26.1E-10 icc 4.232.7E-09 icc 8.503.6E-09 icc 21.27.0E-09 llvm8.0 8.0E-09 denoise llvm7.73 9.4E-10 gcc 14.36.2E-10 icc 6.012.0E-09 icc 10.43.0E-09 icc 20.64.0E-09 gcc 13.35.4E-09 hypterm icc 3.61 2.2E-09 icc 11.39.8E-10 icc 9.224.2E-09 llvm3.916.6E-09 icc 13.66.6E-09 gcc 16.82.0E-08 C: Compiler with highest GFLOPS, PP: Percentage of peak GFLOPS, JPF: Joules per FLOP

Table 3.7: Energy and performance characteristics of benchmarks on multi-core CPU

Bench.     | OpenMP (C, PP, JPF)  | Diamond Tile (C, PP, JPF) | Halide (C, PP, JPF)   | YASK (C, PP, JPF)
j2d5pt     | icc, 0.79, 4.15E-09  | icc, 0.64, 5.77E-09       | llvm, 0.76, 1.03E-07  | -
j2d9pt-gol | icc, 1.44, 2.47E-09  | icc, 1.15, 9.40E-08       | icc, 1.34, 6.12E-08   | -
j2d9pt     | icc, 0.97, 3.26E-09  | icc, 0.97, 9.36E-08       | icc, 1.33, 6.43E-08   | -
gaussian   | llvm, 2.72, 1.16E-09 | icc, 1.96, 1.78E-09       | icc, 3.40, 2.72E-08   | -
gradient   | icc, 1.40, 2.27E-09  | icc, 1.09, 9.22E-08       | llvm, 1.18, 5.98E-08  | -
j3d7pt     | llvm, 0.89, 3.61E-09 | icc, 2.20, 1.69E-09       | icc, 0.55, 5.65E-08   | icc, 4.49, 8.95E-10
j3d13pt    | icc, 1.16, 3.18E-09  | icc, 2.42, 1.48E-09       | icc, 0.83, 3.41E-08   | icc, 8.39, 5.32E-10
j3d17pt    | icc, 1.74, 1.97E-09  | icc, 3.72, 9.54E-10       | llvm, 0.97, 3.15E-08  | icc, 8.08, 5.95E-10
j3d27pt    | icc, 3.08, 1.19E-09  | icc, 7.82, 4.48E-10       | llvm, 1.28, 1.88E-08  | icc, 14.23, 3.31E-10
curl       | icc, 0.82, 4.98E-09  | icc, 1.05, 1.20E-08       | llvm, 0.64, 1.42E-07  | icc, 2.79, 8.31E-09
heat       | icc, 1.03, 3.15E-09  | icc, 2.03, 1.52E-09       | icc, 0.62, 4.87E-08   | icc, 5.18, 7.78E-10
poisson    | icc, 1.12, 2.87E-09  | icc, 2.66, 1.32E-09       | llvm, 0.74, 3.97E-08  | icc, 5.79, 8.64E-10
cheby      | icc, 1.27, 2.22E-09  | icc, 2.00, 1.79E-09       | icc, 0.53, 3.31E-08   | icc, 6.90, 6.85E-10
denoise    | icc, 1.74, 2.06E-09  | icc, 1.96, 2.08E-09       | icc, 0.57, 2.35E-08   | icc, 7.03, 5.39E-10
hypterm    | llvm, 0.90, 3.27E-09 | icc, 1.87, 3.57E-10       | icc, 1.60, 9.70E-08   | icc, 13.44, 6.89E-09
C: Compiler with highest GFLOPS, PP: Percentage of peak GFLOPS, JPF: Joules per FLOP

Table 3.8: Energy and performance characteristics of benchmarks on Xeon Phi

CHAPTER 4

Register Optimizations for Stencils on GPU

4.1 Introduction

The footprint of a stencil is determined by its order, which is the number of input elements read from the center along each dimension. Stencils with high (low) order have high (low) arithmetic intensity. In many scientific applications, the stencil order determines the computational accuracy. For this reason, high-order stencils have been gaining popularity. However, the inherent data reuse within or across statements in such high-order stencils exposes performance challenges that are not addressed by current stencil optimizers.

A significant focus in optimizing stencil computations has been to fuse operations across time steps or across a sequence of stencils in a pipeline [7, 36, 38, 71, 87,

116, 124]. With high-order stencils, the operational intensity is sufficiently high so that even with just a simple spatial tiling, the computation should theoretically not be memory-bandwidth bound. Consider a GPU with around 300 GBytes/sec global memory bandwidth and a peak double-precision performance of around 1.5 TFLOPS.

The required operational intensity to be compute-bound rather than memory-bandwidth bound is around 5 FLOPs/byte, or 40 FLOPs per double-word. Many high-order stencil computations have much higher arithmetic intensities than 40. For such stencils, achieving a high degree of reuse in cache is very feasible, but high performance is not realized on GPUs. The main hindrance to performance is the high register pressure with such codes, resulting in excessive register spilling and a subsequent loss of performance. As we elaborate in the next section, existing register management techniques in production compilers are not well equipped to address the problem of register pressure for high-order stencils. Addressing this problem in the context of GPUs is even more challenging, since most of the widely used GPU compilers like NVCC [75] are closed-source. Even the recent effort by Google (gpucc [122]) only exposes the front-end to the user, and uses the NVCC backend as a black box to perform instruction scheduling and register allocation.
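The break-even intensity quoted above follows directly from the machine balance; using the illustrative round numbers given in this paragraph (1.5 TFLOPS peak, 300 GB/s bandwidth), the threshold works out as

\[
I_{\min} \;=\; \frac{1.5\times10^{12}\ \text{FLOP/s}}{300\times10^{9}\ \text{bytes/s}} \;=\; 5\ \text{FLOPs/byte} \;=\; 5 \times 8 \;=\; 40\ \text{FLOPs per 8-byte double-word.}
\]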

In this chapter, we develop an effective pattern-driven global optimization strategy

for instruction reordering to address this problem. The key idea behind the instruction

reordering approach is to model reuse in high-order stencil computations by using an

abstraction of a DAG of trees with shared nodes/leaves, and exploit the fact that

optimal scheduling to minimize registers for a single tree with distinct operands at

the leaves is well known [93]. We thus devise a statement reordering strategy for a

DAG of trees with shared nodes that enables reduction of register pressure to improve

performance. The chapter makes the following contributions:

• It proposes a framework for multi-statement stencils that reduces register pressure by reordering instructions across statements.

• It describes novel heuristics to schedule a DAG of trees that reuse data using a minimal number of registers.

• It demonstrates the effectiveness of the proposed framework on a number of register-constrained stencil kernels.

4.2 Background and Motivation

Register Allocation and Instruction Scheduling A compiler has several optimization passes, register allocation and instruction scheduling being two of them. Passes before register allocation manipulate an intermediate representation with an unbounded number of temporary variables. The goal of register allocation is to assign those temporaries to physical storage locations, favoring the few but fast registers over the slower but larger memory.

For a fixed schedule, a common approach to perform register allocation is to build an interference graph of the program, which captures the intersection of the live-ranges of temporaries at any program point. Register assignment is then reduced to coloring the interference graph, where each color represents a distinct register [15, 16]. Interfering nodes in the interference graph are assigned different colors due to their adjacency. The number of registers needed by the coloring algorithm is lower-bounded by the maximum number of intersecting live-ranges at any program point (MAXLIVE). If MAXLIVE exceeds the number of physical registers, spilling of registers and the consequent load/store operations from/to memory are unavoidable.

Register pressure can sometimes be alleviated by reordering the schedule of dependent instructions to reduce the MAXLIVE. Reordering independent instructions is often used to enhance the amount of instruction-level parallelism (ILP), for hiding memory access latency. Thus, there is a complex interplay between instruction scheduling and register allocation, affecting instruction-level parallelism and register pressure, and the associated optimization problem is highly combinatorial. Production compilers generally use heuristics for increasing ILP, with a best-effort greedy control on register pressure. For typical application codes, the negative effect on register pressure is not very significant. However, for high-order stencil codes with a large number of operations and a lot of potential register-level reuse, the impact can be very high, as illustrated by an example below.

Illustrative Example Consider an unrolled version of the double-precision 2D Jacobi stencil computation (Figure 4.1a) from [99]. NVCC interleaves the contribution from each input point to different output points to increase instruction level parallelism (ILP). The interleaving performed to increase ILP also has the serendipitous effect of reducing the live range of the register data, and a consequent reduction in register pressure. Nvprof [76] profiling data on a Tesla K40c device shows that under maximum occupancy, this version performs 3.73E+06 spill transactions, achieving 467 GFLOPS.

Figure 4.1b shows the same stencil computation after changing the order of accumulation. Exactly the same contributions are made to each result array element, but the order of the contributions has been reversed. With this access pattern for the code in Figure 4.1b, NVCC fails to perform the same interleaving despite allowing reassociation via appropriate compilation flags. In fact, the register pressure is now exacerbated by the consecutive scheduling of independent operations to increase ILP. For this version, 1.58E+08 spill transactions were measured, with performance dropping to 51 GFLOPS.

for (i = 2; i < N-2; i++)
  for (j = 2; j < N-2; j++)
    for (ii = -2; ii <= 2; ii++)
      for (jj = -2; jj <= 2; jj++)
        out[i][j] += in[i+ii][j+jj] * w[ii+2][jj+2];

(a) Stencil with lexicographical sweeps

for (i = 2; i < N-2; i++)
  for (j = 2; j < N-2; j++)
    for (ii = 2; ii >= -2; ii--)
      for (jj = -2; jj <= 2; jj++)
        out[i][j] += in[i+ii][j+jj] * w[ii+2][jj+2];

(b) Stencil with reverse-lexicographical sweeps

Figure 4.1: Comparing the performance of the same stencil computation with different sweeping orders

This example illustrates a problem with register allocation when the computation has a specific reuse pattern, characteristic of high-order stencil computations. The problem stems from the fact that for most compilers the register allocation and instruction scheduling algorithms that operate at a basic-block level have a peephole view of the computation – they make scheduling/allocation decisions without a global perspective, and thus sometimes work antagonistically. Meanwhile, stencil computations typically have a very regular access pattern. With a better understanding of the pattern, and a global perspective on the computation, it is feasible to devise an instruction reordering strategy to alleviate register pressure.

Solution Approach In this chapter, we circumvent the complexity of the general optimization problem of instruction reordering and register allocation by devising a pattern-specific optimization strategy. Stencil computations involve accumulation of contributions from array data elements in a small neighborhood around each element.

The additive contributions to a data element may be viewed as an expression tree.

Thus, for multi-statement stencils, we have a DAG of expression trees. Due to the fact that an element may contribute to several result elements, the trees within the

DAG can have many shared leaves.

Given a single tree without any shared leaves, it is well known [93] how to schedule its operations in order to minimize the number of registers needed. We use this as the basis for developing heuristics to schedule the operations from the DAG of trees with shared leaves. In contrast to the problem of reordering an arbitrary sequence of instructions to minimize register pressure, a structured approach of adapting the optimal schedule for isolated trees to the case of DAG of trees with shared leaves results in an efficient and effective algorithm that we develop in the next two sections.

4.3 Scheduling DAG of Expression Trees

Stencil computations are often succinctly represented using a domain-specific language (DSL). Listing 4.1 shows an illustrative 7-point Jacobi stencil expressed in the

StencilGen DSL. The core computation is shown in lines 7–9. As with similar

DSLs, the user can specify unroll factors for loop iterators (line 11). Loop unrolling, or thread coarsening on GPUs, is often used to exploit register-level reuse in the code.

The computation is automatically unrolled as a preprocessing step, before the code is generated and optimized.

Listing 4.1: The input representation in the StencilGen DSL
1  parameter L,M,N;
2  iterator k, j, i;
3  double in[L][M][N], out[L][M][N], a, b, c;
4  copyin in;
5
6  function j3d7pt (out, in, a, b, c){
7    out[k][j][i] = a*(in[k+1][j][i]) + b*(in[k][j-1][i] +
8      in[k][j][i-1] + in[k][j][i] + in[k][j][i+1] +
9      in[k][j+1][i]) + c*(in[k-1][j][i]);
10 }
11 unroll k=2, j=2;
12 j3d7pt (out, in, a, b, c);
13 copyout out;

It is important to note that using a DSL is not a prerequisite for using the schedul-

ing techniques proposed in this work. As described shortly, our approach works on a

DAG of expression trees. This DAG can be automatically extracted either from the

DSL representation or from C/Fortran code.

A stencil statement can be defined by the stencil shape (as in lines 7–9) and the

input/output data (as in line 3). Each such stencil statement can be represented by

a labeled expression tree. For example, the tree corresponding to the computation

in Listing 4.1 has array element out[k, j, i] as its root, scalars a, b, c and accesses to

elements of array in as its leaves, and arithmetic operators ∗ and + as inner nodes.

An expression tree for a stencil computation has three types of nodes: (1) nodes

n ∈ Nmem representing accesses to memory locations, (2) nodes n ∈ Nop representing

binary/unary arithmetic operators, and (3) leaf nodes representing constants. All leaf

nodes in Nmem correspond to reads of array elements (e.g., in[k + 1, j, i]) or scalars.

The root of the expression tree is also in Nmem and corresponds to a write to an

out = a + (b * c[i]) + d[i] + ((e[i] * f) / 2.3);
(a) Illustrative stencil statement

(b) Expression tree    (c) Expression tree with accumulations

Figure 4.2: Expression tree example

array element (e.g., out[k, j, i]) or a scalar. We associate a unique label with each

read/written memory location, and assign to each node in Nmem the corresponding

label. The remaining tree nodes are in Nop. Figure 4.2b shows the expression tree for

an illustrative expression.

In a preprocessing step, we introduce k-ary nodes for associative operators. For example, for the tree in Figure 4.2b, the chain of + nodes is replaced with a single

“accumulation” + node. Figure 4.2c shows the resulting expression tree; the numbers

on the nodes will be described shortly. The semantics of an accumulation node is

as expected: the value is initialized as appropriate (e.g., 0 for +, 1 for ∗) and the

contributions of the children are accumulated in arbitrary order.

We often consider a sequence of stencil computations—for example, in image

processing pipelines [85]. Each computation in the sequence will be represented by

a separate expression tree. Similarly, unrolling will result in distinct expression trees

for each unrolled instance. For example, after unrolling along dimensions k and j in Listing 4.1, there will be a sequence of four expression trees. In some cases the

output of a tree is used as an input to a later tree in the sequence. In such a case,

there is a flow dependence: the root of the producer tree has the same label as some

leaf node of the consumer tree (without an in-between tree that writes to that label).

In the input to our analysis, this flow is represented by a dependence edge from the

root node to the leaf node. Thus, the entire computation is represented as a DAG of

expression trees.
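For concreteness, one possible in-memory encoding of this DAG of labeled expression trees is sketched below; the type and field names are illustrative, not those of StencilGen's actual implementation.

#include <string>
#include <utility>
#include <vector>

// Node kinds used in the text: memory accesses (Nmem), arithmetic operators
// (Nop), and constant leaves.
enum class Kind { Mem, Op, Const };

struct Node {
    Kind kind;
    std::string label;            // unique label for Kind::Mem, e.g. "in[k+1][j][i]"
    char op = 0;                  // '+', '*', ... for Kind::Op (k-ary after preprocessing)
    double value = 0.0;           // payload for Kind::Const
    std::vector<Node*> children;  // empty for leaves
};

// One stencil statement: the root is a write to an array element or scalar.
struct Tree {
    Node* root;
};

// The whole multi-statement computation: trees in statement order, plus
// flow-dependence edges from a producer tree (whose root label is read by a
// later tree) to each of its consumers.
struct Dag {
    std::vector<Tree> trees;
    std::vector<std::pair<int, int>> flowEdges;  // (producer index, consumer index)
};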

Throughout the chapter, we make the following two assumptions: (1) the assembly

instructions generated for the DAG of trees after register allocation are of the form

r1 ← r2 op r3, where r1 and r2 can be the same; (2) each operand/result requires

exactly one register. This condition is only enforced to simplify the presentation of

the next two sections, and can be very easily relaxed [5]. Our objective is to schedule

the computations in the DAG so that the register pressure is reduced.

4.3.1 Sethi-Ullman Scheduling

We will use “data sharing” to refer to cases where the same memory location is

accessed at multiple places. There are two types of data sharing: (1) within a tree:

several nodes from Nmem have the same label; and (2) across trees: in a DAG of trees,

nodes from distinct trees have the same label.

A classic result, due to Sethi and Ullman [93], applies to a single expression tree without data sharing (i.e., each n ∈ Nmem has a unique label), and with binary/unary operators. They present a scheduling algorithm that minimizes the number of

registers needed to evaluate such an expression tree under a spill-free model.4 Each tree node n is assigned an Ershov number [1]; we will refer to them as "Sethi-Ullman numbers" and denote them by su(n). They are defined as

\[
su(n) =
\begin{cases}
1 & n \text{ is a leaf} \\
su(n_1) & n \text{ has one child } n_1 \\
\max(su(n_1), su(n_2)) & su(n_1) \neq su(n_2) \\
1 + su(n_1) & su(n_1) = su(n_2)
\end{cases}
\tag{4.1}
\]

The last two cases apply to a binary op node n with children n1 and n2. Intuitively, su(n) is the smallest possible number of registers used for the evaluation of the subtree rooted at n. The first two cases are self-explanatory. For a binary op node n, if one child n′ has a higher register requirement (case 3), this "big" child should be evaluated first. The result of n′ will be stored in a register, which will be alive while the second ("small") child is being evaluated. The remaining su(n′) − 1 registers used for n′ are available (and enough) to evaluate the small child. Finally, the register of n′ can be used to store the result for n, meaning that su(n) is equal to su(n′).

If the order of evaluation were reversed, the result of the small child would have to be kept in a register while n′ is being evaluated, which would lead to sub-optimal su(n) = 1 + su(n′). In the last case in equation (4.1), both children have the same register needs; thus, their relative order of evaluation is irrelevant and one extra register is needed for n. Of course, under the definitions in equation (4.1), su(n) is the same as MAXLIVE for the tree rooted at n.

It is straightforward to generalize Sethi-Ullman numbering to trees containing accumulation nodes (as in Figure 4.2c). Each such accumulation node n has children ni for 1 ≤ i ≤ k. Let mx = maxi{su(ni)}. If there is a single child nj with su(nj) = mx,

4In a spill-free model of the computation, a data element is loaded only once into a register for all its uses/defs.

this child is scheduled for evaluation first, and therefore su(n) = mx. If two or more children nj have su(nj) = mx, one of them is scheduled first; however, in this case su(n) = 1 + mx. In both cases, the order of evaluation of the remaining children is irrelevant. Figure 4.2c shows the Sethi-Ullman numbers for the sample expression tree.
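Equation 4.1 and its k-ary extension translate directly into a short recursive routine. The following is a minimal sketch (not the framework's implementation), written over a bare node type that records only the child list, since labels are irrelevant when there is no data sharing.

#include <vector>

// Minimal tree node: only the child list matters for computing su(n).
struct SuNode {
    std::vector<SuNode*> children;
};

// Sethi-Ullman number for a tree *without* data sharing (equation 4.1),
// generalized to k-ary accumulation nodes as described in the text.
int su(const SuNode* n) {
    if (n->children.empty()) return 1;                       // leaf
    if (n->children.size() == 1) return su(n->children[0]);  // unary operator
    int mx = 0, ties = 0;
    for (const SuNode* c : n->children) {
        int s = su(c);
        if (s > mx)       { mx = s; ties = 1; }
        else if (s == mx) ++ties;
    }
    // Exactly one child needs mx registers: evaluate it first and reuse its
    // register for the result.  Two or more children tie at mx: one extra
    // register is required (cases 3/4 of equation 4.1 and the k-ary rule).
    return (ties == 1) ? mx : mx + 1;
}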

Note that the schedules produced by this approach perform atomic evaluation of subexpressions: one of the children is evaluated completely before the other ones are considered. For a tree without data sharing, this restriction does not affect the optimality of the result. In the presence of data sharing, atomic evaluation may not be optimal.

Since stencils read values from a limited spatial neighborhood, data sharing often manifests in the DAG of expression trees. For example, in Listing 4.1, in[k][j][i] will be an input to all four expression trees corresponding to the unrolled stencil statements. One can also find other nodes in Listing 4.1 that will be shared across multiple expression trees. For such DAGs, the Sethi-Ullman algorithm cannot be directly applied to obtain an optimal schedule. In Section 4.3.2, we present an approach to compute an optimal atomic schedule for a DAG of expression trees with data sharing. In cases when finding an optimal evaluation can be prohibitively expensive, Section 4.3.3 presents heuristics to trade off optimality in favor of pruning the exploration space. Finally, restricting the evaluation to be atomic can generate sub-optimal schedules. Section 4.4 presents a remedial slice-and-interleave algorithm that performs interleaving on the output schedule generated by the approach presented in

Section 4.3.2.

(a) Tree with data sharing    (b) Scheduling cost

Figure 4.3: Scheduling a tree with data sharing

4.3.2 Scheduling a Tree with Data Sharing

Figure 4.3a shows an expression tree with data sharing. For illustration, nodes with the same label are connected in the figure. Recall that we assume a spill-free model, therefore a shared label loaded once into a register will remain live for all its uses. With data sharing, there is a possibility that (1) a label is already live before we begin the recursive evaluation of a subtree that has its subsequent use, and/or (2) a label must remain live even after the evaluation of the subtree in which it is used. The optimal schedule of a subtree is affected by the labels that are live before and after the evaluation of the subtree. Therefore, we need to add live-in/out states as parameters to the computation of the optimal schedule of a subtree. In this section, we present an approach to optimally schedule a tree with data sharing, under the model of atomic evaluation of children; we defer the interleaving of computation across subtrees to Section 4.4.

For a node n, let uses(n) be the set of labels used in the subtree rooted at n.

Figure 4.3a shows uses(n) for each internal node n. The live-in set for a node n, denoted by in(n), contains all labels that are live before the subtree rooted at n is

evaluated. The live-out set is

out(n) = (in(n) ∪ uses(n)) \ kill(n) (4.2)

where kill(n) is the set of labels that have their last uses in the subtree rooted at n.

Note that kill(n) is context-dependent, i.e., the set will vary depending on the order in which the node is evaluated. The kill sets can be computed on the fly by maintaining the number of occurrences of each label l in the current schedule, and comparing it with the total number of occurrences of l in the entire DAG.
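The on-the-fly bookkeeping described above can be sketched as follows (an illustrative fragment, with hypothetical names): a per-label occurrence counter identifies last uses, and out(n) is then assembled according to equation 4.2.

#include <set>
#include <string>
#include <unordered_map>

// A label is killed by the subtree containing its last use, i.e., when the
// number of occurrences emitted so far reaches the label's total occurrence
// count in the entire DAG.
struct OccurrenceTracker {
    std::unordered_map<std::string, int> total;  // occurrences of each label in the DAG
    std::unordered_map<std::string, int> seen;   // occurrences scheduled so far

    // Record one use of `label` in the current schedule; returns true if this
    // was its last use (the label belongs to kill(n) of the subtree emitted).
    bool use(const std::string& label) {
        return ++seen[label] == total.at(label);
    }
};

// out(n) = (in(n) ∪ uses(n)) \ kill(n)   -- equation 4.2
std::set<std::string> liveOut(const std::set<std::string>& in,
                              const std::set<std::string>& uses,
                              const std::set<std::string>& kill) {
    std::set<std::string> out;
    for (const auto& l : in)   if (!kill.count(l)) out.insert(l);
    for (const auto& l : uses) if (!kill.count(l)) out.insert(l);
    return out;
}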

We now show how to compute a modified Sethi-Ullman number, su′ for each

node n, when provided with an “evaluation context” in terms of live-in and live-out

labels5. Consider a node n with some in and out state. Just before the evaluation of n begins, |in(n)| registers are live. Similarly, just after the evaluation of n finishes,

|out(n)| registers will be live. During the evaluation of n, additional registers may become live, while some of the other live registers may be released. Now su′(n, in, out)

represents the maximum number of registers that were simultaneously live at any

point during the evaluation of n. We also define su′(π, in, out), where π is a sequence

of the children nodes of n. This value will represent the maximum number of registers

that were simultaneously live at any point during the evaluation of n with its children

ordered in the sequence described by π.

For simplicity we will use su′(n) instead of su′(n, in, out), but the definitions will

use the live-in/out sets in(n) and out(n).

5In [51], the context is expressed in terms of import and export nodes

For a leaf node n ∈ Nmem with |in(n)| = α,

\[
su'(n) =
\begin{cases}
\alpha + 1 & label(n) \notin in(n) \\
\alpha & label(n) \in in(n)
\end{cases}
\]

The first case implies that a new register must be reserved for the labelof n if it is not already live before the evaluation of n. The second case is self-explanatory.

To compute su′ for a k-ary (binary or accumulation) node n with children n1 . . . nk, we need to explore all k! evaluation orders of the children. Let π be any permutation of the children of n representing their evaluation order. Then su′(n) = min_π su′(π).

For the purpose of explanation, suppose the permutation π = ⟨n1, n2⟩ is one

particular evaluation order for a binary node n with children n1, n2. To compute

su′(π), first we determine the live-in and live-out sets for nodes in π as follows:

in(n1) = in(n), in(n2) = out(n1), and out(n2) = out(n); here out(n1) is as defined in

equation 4.2. This provides the required context to compute su′(n1) and su′(n2). Let

mx = maxi{su′(ni)}, so that mx equals the maximum number of simultaneously live

registers at any time during the evaluation of π. Then,

\[
su'(\pi) =
\begin{cases}
1 + mx & n_i \in N_{mem} \ \&\ label(n_i) \in out(n),\ i = 1, 2 \\
mx & \text{otherwise}
\end{cases}
\]

In case 2, if n1 ∈ Nop, or if n1 ∈ Nmem but label(n1) ̸∈ out(n), then the result of

the computation, identified by the label of the node n, can reuse the register of n1

(similarly for n2). However, in case 1, where both n1 and n2 are leaf nodes in Nmem

and both must be live after evaluating n, we need an additional register to hold the

result.

For an accumulation node with k children, consider permutation π = ⟨n1, n2, . . . , nk⟩.

Suppose we have computed all su′(ni) and let mx = maxi{su′(ni)}. Then,

\[
su'(\pi) =
\begin{cases}
mx & su'(n_1) = mx \ \&\ n_1 \in N_{mem} \ \&\ label(n_1) \notin out(n_1) \ \&\ su'(n_j) \neq mx,\ 2 \le j \le k \\
mx & su'(n_1) = mx \ \&\ n_1 \in N_{op} \ \&\ su'(n_j) \neq mx,\ 2 \le j \le k \\
1 + mx & \text{otherwise}
\end{cases}
\]

Just like the generalization of su(n) for accumulation nodes in Section 4.3.1, su′ = mx when the following two conditions hold: (1) n1 requires the maximum number of simultaneously live registers, and the rest of the nodes in π can be completely evaluated using the registers released by n1, and (2) the register holding n1 can be reused by n, i.e., either n1 ∈ Nop (case 2), or n1 is a leaf node that is not live beyond this point (case 1). In all other scenarios, we need mx + 1 registers (case 3).

The computation of su′(n) for a tree without an evaluation context is shown in

Figure 4.3b, and with an evaluation context is shown in Figure 4.4a. For the same tree,

Figure 4.4b shows the permutation with minimum su′. In all three figures, the children of a node are ordered left-to-right, which defines the corresponding permutation.

In some cases, exhaustively exploring all permutations of the children may be unnecessary. In the tree of Figure 4.4a, there are two subtree operands of the accumulation node that share no data (the first and the third subtree operands in the left-to-right order). Therefore, even though the scheduling within those two subtrees may be influenced by the evaluation context, they do not influence each other's evaluation. Let passthrough denote the labels that are live both before and after the evaluation of node n: passthrough(n) = in(n) ∩ out(n). Then, for a k-ary node n, any two of its children ni and nj do not influence each other's evaluation if

(uses(ni) ∩ uses(nj)) \ passthrough(n) = ∅ (4.3)

(a) Tree with context at root    (b) Optimal schedule

Figure 4.4: Example: computing su′(π)

In such a scenario, for the node n, we can create maximal clusters of its children that share data, but the only data shared among clusters is the passthrough labels of n. For example, if children t1 and t2 share label l1, and children t2 and t3 share label l2, where l1, l2 are non-passthrough labels of the parent node, then {t1, t2, t3} must

belong to the same cluster. We extend the intuition behind Sethi-Ullman scheduling

algorithm to establish that different clusters cannot influence each other’s evaluation.

Then, for each cluster ci, we can independently compute su′(ci) with in(ci) = in(n).

We only need to explore all permutations within the non-singleton clusters. We

propose the following theorems to establish an evaluation order for the clusters.

Theorem 1. For k clusters ci, 1 ≤ i ≤ k, such that |in(ci)| ≤ |out(ci)|, the one with larger su′(ci) − |out(ci)| will be prioritized for evaluation over others in the optimal schedule. In the special case where all the clusters have the same su′(ci) − |out(ci)|, they can be evaluated in any order without affecting MAXLIVE.

This result is a direct consequence of the Sethi-Ullman algorithm. The cluster with larger su′(ci) − |out(ci)| will release more registers, which can be reused by the next cluster. The special case too is a direct consequence of the Sethi-Ullman algorithm, where two sibling nodes with the same su can be evaluated in any order (case 4 of equation 4.1).

Theorem 2. For two clusters c1 and c2 such that |in(c1)| > |out(c1)| and |in(c2)| ≤

|out(c2)|, c1 must be evaluated before c2 in the optimal schedule.

We prove the result by contradiction. Suppose that c2 is evaluated before c1 in the optimal schedule. Since the schedule is optimal, su′(c2) ≥ su′(c1). Now we change this optimal schedule by moving the evaluation of c1 before c2. Evaluating c1 earlier will release |in(c1)| − |out(c1)| (i.e., ≥ 1) registers, which can then be used in the evaluation of c2. Based on the previous equations, su′(c2) will either decrease or remain the same, depending on whether the number of registers released by c1 is greater than, or equal to, 1. This modified schedule therefore either has the same, or has lower su′ than the optimal schedule, making it an optimal schedule.

Theorem 3. For two clusters c1 and c2 such that |in(c1)| > |out(c1)| and |in(c2)| > |out(c2)|, the one with smaller su′ must be prioritized for evaluation in the optimal schedule.

Again, we prove the result by contradiction. Suppose that su′(c1) < su′(c2), and c2 is scheduled before c1 in the optimal schedule. We change this schedule by moving the evaluation of c1 before that of c2. From Theorem 2, su′(c2) after this change will either remain the same, or decrease. Thus, su′ for the new schedule will either be the same, or reduce if su′(c2) was the maximum, making it an optimal schedule.

Algorithm 4.1: Schedule-Tree (n, in, out): Optimal evaluation of an expression tree with data sharing under atomic evaluation constraint
Input : A tree rooted at n with live-in/out contexts in and out
Output: An optimal schedule S for the tree
1  sched_cost ← ∅, S ← ∅;
2  C ← create_maximal_clusters(n);   (Sec. 4.3.2)
3  for each cluster c in C do
4    if |c| is 1 then
5      in(c) ← in(n);
6      out(c) ← computed using equation 4.2;
7      sched_cost[c] ← su′(c);
8    else
9      compute in and out for each tree in c;   (Sec. 4.3.2)
10     π ← all permutations of the trees in c;
11     sched_cost[c] ← su′(π);
12 P ← sequence clusters using sched_cost and Thms. 1, 2, 3;
13 for each subtree ts in the sequence described by P do
14   append the schedule for ts in S;
15 return S;

Based on these theorems, Algorithm 4.1 summarizes the evaluation of an optimal schedule for a tree with data sharing.

4.3.3 Heuristics for Tractability

For a non-singleton cluster c, the algorithm presented in Section 4.3.2 can become prohibitively expensive if |c| is large. For example, when |c| changes from 7 to 8, the permutations explored increase from 5040 to 40320. We now present some heuristics that trade off optimality for tractability, and a caching technique to further speed up the algorithm.

Pruning Heuristics We begin by establishing, for any node n, the bounds on su′(n). When n is evaluated with non-empty contexts in and out, the bounds are:

su′(n, ∅, ∅) ≤ su′(n, in, out) ≤ su′(n, ∅, ∅) + |in ∪ out|

Simply put, the bounds imply that su′(n) with context is never smaller than su′(n) without context, but exceeds it by at most the number of registers required to maintain the context. We prove both the lower and upper bounds by contradiction.

su′(n, ∅, ∅) ≤ su′(n, in, out): Assume to the contrary that su′(n, in, out) < su′(n, ∅, ∅). We will modify the schedule S corresponding to su′(n, in, out) as follows: prepend a stage to S that loads the labels ∈ in(n) into |in| registers, and make in(n) = ∅. Append a stage to S that stores all the labels ∈ out(n) from the respective registers into memory, and make out(n) = ∅. This modified schedule corresponds to su′(n, ∅, ∅), and hence su′(n, ∅, ∅) ≤ su′(n, in, out), contradicting the assumption.

su′(n, in, out) ≤ su′(n, ∅, ∅) + |in ∪ out|: Assume to the contrary that su′(n, ∅, ∅) + |in ∪ out| < su′(n, in, out). Let S be a schedule corresponding to su′(n, ∅, ∅). We modify S as follows: we assign a unique register rk to each label lk ∈ (in ∪ out). Now we prepend a stage to S that loads each label li ∈ in into ri. After every use of a label lj ∈ out in S, we add an assignment statement rj ← value(lj). This modified schedule uses exactly |in ∪ out| extra registers, and is a schedule corresponding to su′(n, in, out). Hence su′(n, in, out) ≤ su′(n, ∅, ∅) + |in ∪ out|, contradicting the assumption.

With the bounds established, instead of exploring all permutations, we can sacrifice optimality and stop further exploration when we are close to the optimal schedule.

We use a tunable parameter d, and stop trying the permutations any further when

su′(n, in, out) − su′(n, ∅, ∅) ≤ d. For the experimental evaluation in Section 4.5, we set d to 1.

For a cluster c with |c| > 8, we also apply a partitioning heuristic, which recursively

partitions the subtrees in c into sub-partitions where each sub-partition can be of a

maximum size p, with p < 8. The partitioning is based on either of the two criteria:

− on “label affinity”: the subtrees that share the maximum labels are greedily

assigned to the same sub-partition as long as the size of the sub-partition is

less than p. Such partitioning is based on the notion that evaluating subtrees

with maximum uses together will potentially reduce passthrough labels, and

MAXLIVE.

− on “release potential”: the subtrees that have the last uses of some labels are

placed in a sub-partition, and that sub-partition is eagerly evaluated. This

partitioning is based on the notion that the released registers can be reused by

the next partition.

Once the sub-partitions are created, we only exhaustively explore all permutations of subtrees within a sub-partition. If the number of sub-partitions created is less than

8, then we also try all the permutations of the sub-partitions themselves. For example, if |c| = 8, and the partitioning heuristic creates two sub-partitions p1 and p2 of size 4

each, then our exploration space will be {p1, p2} and {p2, p1}, while exploring all 4!

permutations of subtrees within p1 and p2 each – a total of 2 × 4! × 4! permutations

instead of 8! permutations.

We also let the user externally specify a threshold that upper-bounds the total

number of permutations for a tree.
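A possible greedy realization of the "label affinity" criterion is sketched below; it is illustrative only, and the exact tie-breaking and data structures used by the framework may differ.

#include <cstddef>
#include <set>
#include <string>
#include <vector>

using Labels = std::set<std::string>;

// Greedy "label affinity" partitioning: each subtree is placed into the
// sub-partition with which it shares the most labels, subject to a maximum
// sub-partition size p.  Subtrees sharing no labels start new sub-partitions.
std::vector<std::vector<int>> partitionByAffinity(
        const std::vector<Labels>& subtreeLabels, std::size_t p) {
    std::vector<std::vector<int>> parts;   // each holds subtree indices
    std::vector<Labels> partLabels;        // union of labels per sub-partition

    for (std::size_t t = 0; t < subtreeLabels.size(); ++t) {
        int best = -1;
        std::size_t bestShared = 0;
        for (std::size_t i = 0; i < parts.size(); ++i) {
            if (parts[i].size() >= p) continue;        // sub-partition is full
            std::size_t shared = 0;
            for (const auto& l : subtreeLabels[t])
                if (partLabels[i].count(l)) ++shared;
            if (shared > bestShared) { bestShared = shared; best = (int)i; }
        }
        if (best < 0) {                                // start a new sub-partition
            parts.push_back({});
            partLabels.push_back({});
            best = (int)parts.size() - 1;
        }
        parts[best].push_back((int)t);
        partLabels[best].insert(subtreeLabels[t].begin(), subtreeLabels[t].end());
    }
    return parts;
}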

Memoization For a node n, a lot of permutations of its children will differ in only a few positions. In such cases, we end up recomputing su′ for a child multiple times, even when the live-in/out context for the child remains unchanged.

These recomputations can be avoided by a simple memoization, where for a node n, we map su′(n) as a function of a minimal context. The minimal context strips away labels that are not in uses(n), but are in passthrough(n). The su′(n) with

minimal context can be suitably adjusted to get su′(n) with a different context that

has some passthrough labels added to the minimal live-in/out. For example, suppose

that su′(n) is 3 when the minimal in(n) = {a,b} and the minimal out(n) = ∅. Then

su′(n) when evaluating it with in(n) = {a, b, c}, out(n) = {c} and c ̸∈ uses(n) will

be 2+1=3, and the optimal schedule will remain unchanged. Memoization greatly

reduces the total evaluation time, thereby enabling the exploration of a large number

of permutations.
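The memoization table can be sketched as follows; the key is the node together with its minimal live-in/out context, and the caller adjusts the cached value for any stripped passthrough labels as described above. The names are illustrative, not the framework's own.

#include <map>
#include <set>
#include <string>
#include <tuple>

using Labels = std::set<std::string>;

// Cache for su′(node, in, out), keyed on the *minimal* context: labels that
// the subtree never uses are stripped, so contexts differing only in such
// passthrough labels share one entry.
struct SuCache {
    std::map<std::tuple<int, Labels, Labels>, int> table;

    static Labels minimal(const Labels& ctx, const Labels& uses) {
        Labels m;
        for (const auto& l : ctx)
            if (uses.count(l)) m.insert(l);
        return m;
    }

    bool lookup(int node, const Labels& in, const Labels& out,
                const Labels& uses, int& su) const {
        auto key = std::make_tuple(node, minimal(in, uses), minimal(out, uses));
        auto it = table.find(key);
        if (it == table.end()) return false;
        su = it->second;           // caller adjusts for stripped passthrough labels
        return true;
    }

    void store(int node, const Labels& in, const Labels& out,
               const Labels& uses, int su) {
        table[std::make_tuple(node, minimal(in, uses), minimal(out, uses))] = su;
    }
};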

4.3.4 Scheduling a DAG of expression trees

For each stencil statement that is mapped to an expression tree, Section 4.3.2

described a way to schedule it. This section ties everything together for a multi-

statement stencil by describing how to schedule a DAG of expression trees. For

optimal scheduling, one needs to explore all topological orders for the trees in the

DAG, and then evaluate all the trees independently for each topological order. This

may be practical if the size of the DAG is small. Otherwise, we must sacrifice optimality for tractability, and fix the evaluation order of the trees in the DAG before the trees are individually evaluated.

We use a version of the greedy heuristic proposed in Chapter 3 to fix the evaluation

order of trees in the DAG. At each step, the heuristic tries to fix the evaluation order of

two nodes in the DAG. We begin by computing, for each pair of transitive dependence-

free trees pi in the DAG, a metric Mi that encodes: (a) the number of labels shared between them, and (b) the number of common input arrays read by them. Among the computed Mi, we choose the one that has the highest non-zero value, and fix the evaluation order of its tree pair to be contiguous to enhance reuse proximity in the final schedule. The DAG is updated by fusing the nodes corresponding to the two trees into a "macro node". Post fusion, we update the dependence edges to and from the macro node, and recompute the metrics for the next step. The algorithm terminates when no more nodes can be fused.
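One greedy step of this pairing can be sketched as follows; how the two components of the metric are weighted is an implementation detail of the framework, so they are simply summed here, and the names are illustrative.

#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct TreeInfo {
    std::set<std::string> labels;   // labels read/written by the tree
    std::set<std::string> arrays;   // input arrays it reads
};

// Affinity metric M for a pair of trees: shared labels plus common input arrays.
int metric(const TreeInfo& a, const TreeInfo& b) {
    int m = 0;
    for (const auto& l : a.labels) if (b.labels.count(l)) ++m;
    for (const auto& r : a.arrays) if (b.arrays.count(r)) ++m;
    return m;
}

// Among all pairs free of transitive dependences (as reported by `independent`),
// pick the pair with the largest non-zero metric; the caller fuses the pair
// into a macro node and repeats until no pair qualifies.
template <typename Indep>
std::pair<int, int> bestPair(const std::vector<TreeInfo>& trees, Indep independent) {
    std::pair<int, int> best{-1, -1};
    int bestM = 0;
    for (std::size_t i = 0; i < trees.size(); ++i)
        for (std::size_t j = i + 1; j < trees.size(); ++j) {
            if (!independent(i, j)) continue;
            int m = metric(trees[i], trees[j]);
            if (m > bestM) { bestM = m; best = {(int)i, (int)j}; }
        }
    return best;   // {-1, -1} means no fusable pair with non-zero affinity
}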

Once the algorithm terminates, we perform a topological sort of the final DAG,

and expand the DAG macro nodes to their tree sequences. For these ordered trees,

we can generate code versions with different degrees of splitting. One extreme would be a version where all the trees are in a single kernel (max-fuse), and another extreme would be a version where each tree is a distinct kernel (max-split) [13, 117].

For compute-intensive stencils with many-to-many reuse, a single kernel can have

extremely high register pressure, sometimes causing spills despite allowing for the

maximum permitted registers per thread. For such cases, performing kernel fission

instead of generating a single kernel for the entire computation might improve performance. The split kernels will incur additional data transfers from global memory, but

the register pressure per kernel will be much lower, giving the user an opportunity to

further enhance register-level reuse via unrolling. Note that none of the production

GPU compilers are capable of performing kernel fusion/fission optimizations. For

Algorithm 4.2: Schedule-DAG (D, R): evaluating a DAG of expression trees under the atomic evaluation constraint
Input : D: DAG of expression trees, R: Per-thread register limit
Output: An optimal schedule S for D
1  D′ ← D; fusion_feasible ← true; tree_order ← ∅; S ← ∅;
2  while fusion_feasible do
3    for each pair of transitive dep-free nodes ti, tj in D′ do
4      M ∪= compute_metric(D′, ti, tj);   (Sec. 4.3.4)
5    sort_descending (M);
6    (tp, tq, fusion_feasible) ← find_fusion_candidate(M);
7    fuse tp and tq;
8    update_dependence_edges (D′, tp, tq);
9  for each node d in D′ do
10   append the tree sequence of d in tree_order;
11 split_versions ← create_split_versions(tree_order);
12 for each split in split_versions do
13   S′ ← ∅;
14   for each kernel k in split do
15     for each tree t in k do
16       compute in and out for t;
17       append output of Schedule-Tree(t, in, out) to S′;
18   execute S′ after compiling it with register limit R;
19   S ← S′ if S′ is a faster schedule than S, or if S is ∅;
20 return S;

each split version created, the tree sequence in it is evaluated using Algorithm 4.1.

The returned schedule is the one that gives maximum performance. Algorithm 4.2 outlines the entire process.

4.4 Interleaving Expressions

At this point, we have a schedule for the entire DAG of trees, but with atomic evaluation enforced. However, interleaving within/across trees can be instrumental in reducing MAXLIVE. For example, in the unrolled stencil of Listing 4.1, there is no reuse within a stencil statement, but plenty of reuse across stencil statements. We will see later in Section 4.5 that relaxing the constraint of atomic evaluation, and

performing interleaving is imperative for performance in such stencils. A compiler optimization that performs some interleaving is common subexpression elimination (CSE). However, we require a more general interleaving that works at the granularity of common labels instead of common subexpressions. For example, Figure 4.5a shows an expression tree where su′(S) is the largest, and the operands of the accumulation node are evaluated in order from left to right in the final schedule. Also,

{c[i], b} ̸∈ uses(S). The fact that {c[i],b} ∈ passthrough(S) adds to su′(S). By

slicing the expression (e[i] ∗ b)/c[i] and placing it after the expression b ∗ c[i] as shown

in Figure 4.5b, c[i] and b will no longer be in in(S). Instead of those two labels, a temporary label holding the value of the sliced expression will be added to in(S), and hence su′(S) will reduce by 1. Note that this is not CSE, but a more general optimization aimed to reduce MAXLIVE. This slice-and-interleave optimization slices a target expression, and interleaves it with a source expression, so that su′ at a program point reduces. It subsumes CSE if the source and target expressions are the same.
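In straight-line form, the transformation on the expressions of Figure 4.5 looks roughly as follows; the helper evaluate_S() and the temporaries t0, t1 are hypothetical names standing in for the large subtree S and the intermediate results.

double evaluate_S(void);   // stands in for the large subtree S (illustrative)

double before(double b, const double* c, const double* e, int i) {
    double t0 = b * c[i];           // source expression
    double s  = evaluate_S();       // b and c[i] must stay live across S
    double t1 = (e[i] * b) / c[i];  // last uses of b and c[i]
    return t0 + s + t1;
}

double after(double b, const double* c, const double* e, int i) {
    double t0 = b * c[i];
    double t1 = (e[i] * b) / c[i];  // sliced target placed after the source:
                                    // b and c[i] die before S is evaluated
    double s  = evaluate_S();       // only the temporary t1 is carried across S
    return t0 + s + t1;
}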

We perform the slice-and-interleave optimization at two levels: (a) within an

expression tree, where the source and target expressions belong to the same tree;

and (b) across the expression trees in the DAG, where source and target expressions

belong to different trees. For a chosen source expression es rooted at node n, we

compute a set of labels, Lilv, which is a union of all the labels that were observed in

the schedule till now, with the labels that have a single occurrence in the DAG.

We now try to find a set of target expressions operating on just the labels from

Lilv. To find the target expressions, we start with minimal expressions, i.e., the

simplest expressions whose operands are leaf nodes ∈ Nmem. Once we find a minimal

expression em that operates only on the labels ∈ Lilv, we find the root node r of em, and

(a) Original tree    (b) Interleaving expressions

Figure 4.5: Example: interleaving to reduce MAXLIVE

Algorithm 4.3: slice-and-interleave (T, in, out): Performing interleaving within an expression tree
Input : T: an input tree with schedule S and contexts in and out
Output: S: The schedule after applying slice-and-interleave
1  Lilv ← ∅;
2  min_exprs ← sequence of minimal expressions extracted from S whose operands are leaf nodes;
3  for each expression s in min_exprs do
4    Lilv ← all the labels seen in the schedule till s ∪ labels with single occurrence in T;
5    for each expression t appearing after s in min_exprs do
6      if t only operates on the labels in Lilv then
7        t′ ← maximal expression obtained by growing t until it operates on the labels in Lilv;
8        if live ranges reduced by placing t′ after s then
9          slice and place t′ after s in S;
10         move t after s in min_exprs;
11 return S;

grow em to the expression rooted at parent(r). We continue to grow the expression

till we have a maximal expression that only operates on the labels ∈ Lilv. For each

target expression thus discovered, we check if slicing and placing it between the

source expression es and subtree ts immediately following es in the schedule decreases

|in(ts)|. If it does, then slice-and-interleave is performed.

Illustrative Example Let b∗c[i] be the source expression in the tree of figure 4.5a.

One of the explored target expressions will be e[i] ∗ b, since it only uses nodes ∈ Lilv.

Now we try to grow the target expression by changing the root from ∗ to /, making

(e[i] ∗ b)/c[i] the new target expression. All the labels used in the grown target

expression also belong to Lilv. A further attempt to grow will change the target

expression to ((e[i] ∗ b)/c[i]) ∗ f[i]. However, f[i] ̸∈ Lilv. Therefore, we backtrack

and finalize (e[i] ∗ b)/c[i] as a target expression, since it is the maximal expression

with all the labels in Lilv. Placing the target expression after the source expression

decreases in(S) by 1. Therefore, we perform the slice-and-interleave optimization.

Algorithm 4.3 outlines the slice-and-interleave algorithm that tries out different source

expressions, and continuously finds the target expressions within the tree to interleave

in order to reduce the live ranges. The slice-and-interleave across the trees in a DAG

is similar.

4.5 Experimental Evaluation

Our framework parses stencil statements written in a subset of C: the array access

indices in the stencil statements must be an affine function of the surrounding loop

iterators, program parameters, and constants; loop iterators and parameters must be

immutable in the stencil statements. The framework supports auto-unrolling along

different dimensions to expose spatial reuse across stencil statements.

Benchmark   | N     | UF | k | F   | R  | A   | U
2d25pt      | 8192² | 4  | 2 | 33  | 2  | 104 | 44
2d64pt      | 8192² | 4  | 4 | 73  | 2  | 260 | 92
2d81pt      | 8192² | 4  | 4 | 95  | 2  | 328 | 112
3d27pt      | 512³  | 4  | 1 | 30  | 2  | 112 | 58
3d125pt     | 512³  | 4  | 2 | 130 | 2  | 504 | 204
hypterm     | 300³  | 1  | 4 | 358 | 13 | 310 | 152
rhs4th3fort | 300³  | 1  | 2 | 687 | 7  | 696 | 179
derivative  | 300³  | 1  | 2 | 486 | 10 | 493 | 165
N: Domain Size, UF: Unrolling Factor, k: Stencil Order, F: FLOPs per Point, R: #Arrays, A: Total Elements Accessed per Thread, U: Unique Elements Accessed per Thread

Table 4.1: Characteristics of register-limited stencil benchmarks

To ensure a tight coupling, several prior efforts on guiding register allocation or instruction scheduling were implemented as a compiler pass in research/prototype compilers [11, 27, 35, 81, 90], or open-source production compilers [52, 91]. However, like some other recent efforts [8, 50, 99], we implement our reordering optimization at source level for the following reasons: (1) it allows external optimizations for closed-source compilers like NVCC; (2) it allows us to perform transformations like exposing FMAs using operator distributivity, and performing kernel fusion/fission, which can be performed more effectively and efficiently at source level; and (3) it is input-dependent, not machine- or compiler-dependent – with an implementation coupled to compiler passes, it would have to be re-implemented across compilers with different intermediate representations. Our framework massages the input to a form that is more amenable to further optimizations by any GPU compiler, and we use appropriate compilation flags whenever possible to ensure that our reordering optimization is not undone by the compiler passes.

We evaluate our framework for the benchmarks listed in Table 4.1 on a Tesla K40c

GPU (peak double-precision performance 1.43 TFLOPS, peak bandwidth 288 GB/s)

with NVCC-8.0 [75] and LLVM-5.0.0 compiler (previously gpucc [122]). The first five

benchmarks are stencils typically used in iterative processes such as solving partial

differential equations [45]. The remaining three are representative of complex stencil

operations extracted from applications. hypterm is a routine from the ExpCNS Compressible Navier-Stokes mini-application from DoE [29]; the last two stencils are from

the Geodynamics Seismic Wave SW4 application code [104]. For each benchmark, the

original version is as written by application developers without any loop unrolling;

the unrolled version has the loops unrolled explicitly; and the reordered version is the

output from our code generator. On an Intel i7-4770K processor, the code generator

generated each reordered version in under 4 seconds. When the net unrolling factor

is limited to 4, the size of each reordered version is under 600 lines of code. The

read-only arrays are annotated with the restrict keyword in all the versions to allow

efficient loads via the texture pipeline. All the stencils are double-precision, compiled with NVCC flags '--use_fast_math -Xptxas "-dlcm=ca"', and LLVM flags '-O3 -ffast-math -ffp-contract=fast'. Since none of the versions use shared memory, using '-dlcm=ca' for NVCC enhances performance by caching the global memory accesses at L1. However, we notice no discernible performance difference when compiling without the '--use_fast_math' flag in NVCC. We explore different instruction schedulers implemented in LLVM (default, list-hybrid, and list-burr) for the original and unrolled code, and report the best performance. To minimize instruction reordering for our reordered code, we use LLVM's default instruction scheduler, and do not use the -ffast-math option during compilation. We test all the versions against a standard C implementation of the benchmarks for correctness: the difference in each computed output value with that of the C implementation must be less than a tolerance of 1E-5.
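The check amounts to an element-wise comparison against the reference output; a minimal sketch, with the 1E-5 tolerance mentioned above:

#include <cmath>
#include <cstdio>
#include <vector>

// Element-wise validation of a generated-code output against the reference
// C implementation.
bool validate(const std::vector<double>& result, const std::vector<double>& ref,
              double tol = 1e-5) {
    for (std::size_t i = 0; i < result.size(); ++i) {
        if (std::fabs(result[i] - ref[i]) > tol) {
            std::fprintf(stderr, "mismatch at %zu: %g vs %g\n", i, result[i], ref[i]);
            return false;
        }
    }
    return true;
}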

Loop Unrolling For the experiments, the iterative kernels were unrolled along

a single dimension to expose spatial reuse. Loop unrolling offers the compiler an

opportunity to exploit ILP, but scheduling independent instructions contiguously may

increase register pressure. Consider an unrolled version of 2d25pt, compiled with 32

registers. From Table 4.1, it is clear that the unrolled code has a high degree of reuse.

Listing 4.2 shows the SASS (Shader ASSembler) snippet generated using NVCC for

the unrolled version of 2d25pt after register allocation. The instructions not relevant

to the discussion are omitted in Listing 4.2 (and 4.3), leading to non-contiguous

line numbers. The lines highlighted in red show the instructions involving the same

memory location – line 1 loads a value from global memory into register R4, and spills

it in line 2 without using R4 in any of the intermediate instructions. Such wasteful spills are a characteristic of register-constrained codes. The same value is reloaded from local memory into R4 in line 4, and R4 is subsequently used in lines 5 and 8. The uses of R4 are placed far apart in SASS, adding to the register pressure. Interspersed with these instructions are the load (line 3) and subsequent uses of register R12. The interleaving increases ILP, but the uses of R12 are also placed very far apart. A better schedule can perhaps achieve the same ILP with less register pressure and fewer spills.

Listing 4.3 shows the SASS snippet for the reordered code generated by our code generator. Using operator distributivity, the multiplication of the coefficient with the additive contributions is converted by our preprocessing pass into fused multiply-adds. Notice that all the uses of register R20 (highlighted in red) are tightly coupled.

Listing 4.2: SASS snippet for the unrolled code
1  106 /*0328*/ @P0 LDG.E.64 R4, [R24];
2  144 /*0458*/ @P0 STL [R1+0x10], R4;
3  332 /*0a38*/ @P0 LDG.E.64 R12, [R8];
4  350 /*0ac8*/ @P0 LDL.LU R4, [R1+0x10];
5  354 /*0ae8*/ @P0 DADD R16, R16, R4;
6  358 /*0b08*/ @P0 DADD R16, R12, R10;
7  376 /*0b98*/ @P0 DFMA R6, R12, c[0x2][0x40], R14;
8  374 /*0b88*/ @P0 DADD R16, R6, R4;
9  436 /*0d78*/ @P0 DADD R12, R12, R24;

Listing 4.3: SASS snippet for the reordered code
1  163 /*04f0*/ @P0 DFMA R14, R22, c[0x2][0x8], R14;
2  164 /*04f8*/ @P0 DFMA R8, R22, c[0x2][0x18], R8;
3  166 /*0508*/ @P0 DFMA R12, R22, c[0x2][0x30], R12;
4  175 /*0550*/ @P0 DFMA R8, R20, c[0x2][0x30], R8;
5  176 /*0558*/ @P0 DFMA R12, R20, c[0x2][0x18], R12;
6  178 /*0568*/ @P0 DFMA R8, R30, c[0x2][0x38], R8;
7  183 /*0590*/ @P0 DFMA R22, R20, c[0x2][0x8], R10;
8  184 /*0598*/ @P0 DFMA R10, R20, c[0x2][0x18], R14;
9  187 /*05b0*/ @P0 DFMA R16, R30, c[0x2][0x20], R12;
10 191 /*05d0*/ @P0 DFMA R10, R30, c[0x2][0x20], R10;

The same holds for registers R22, R30, and the remaining instructions. Independent FMAs are scheduled together without increasing the MAXLIVE. This reduces register pressure without compromising ILP. Therefore, even though the unrolled version performs fewer FLOPs than the reordered version, we incur fewer spill LDL/STL instructions per thread (101 for unrolled vs. 7 for reordered).

For the 3d125pt stencil, Table 4.2 shows some profiling metrics gathered by Nvprof

with NVCC. The texture throughput for the original code indicates that the stencil

performance is limited by the texture cache bandwidth. Loop unrolling halves the

accesses to texture cache and the executed load instructions, but results in a significant

drop in IPC due to lowered occupancy. To better expose reordering opportunities

Version   | reg | IPC  | inst. exec. | ld/st  | FLOP    | L2 read | tex txn | tex GB/s
Original  | 128 | 1.76 | 2.7E+9      | 5.3E+8 | 1.7E+10 | 5.3E+8  | 4.2E+9  | 899.5
Unrolled  | 255 | 1.12 | 1.4E+9      | 2.1E+8 | 1.7E+10 | 2.9E+8  | 1.7E+9  | 457.2
Reordered | 64  | 2.00 | 1.4E+9      | 2.1E+8 | 3.3E+10 | 1.5E+8  | 1.7E+9  | 791.2

Table 4.2: Metrics for 3d125pt for tuned configurations

after unrolling, the preprocessing pass of the reordering framework exploits operator

distributivity and converts all the contributions in an individual statement to FMA

operations. Therefore, instead of 130 FLOPs per stencil point, the reordered version

performs 250 FLOPs. As measured by Nvprof, we incur a 2× increase in floating

point operations, but achieve significant reuse in registers at a higher occupancy,

which consequently improves the IPC and execution time.
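The distributivity transformation mentioned above can be illustrated on a single accumulation; the function name is illustrative, and the FLOP counts in the comments refer to this small example rather than to any benchmark.

#include <cmath>

// Before the preprocessing pass (as written in the DSL):
//     out = c0 * (x0 + x1 + x2 + x3);          // 3 adds + 1 multiply
// After distributing the coefficient, every contribution becomes an
// independent fused multiply-add on the same accumulator (1 multiply +
// 3 FMAs), roughly doubling the FLOP count but allowing the FMAs to be
// scheduled back-to-back without stretching any live range.
double contract(double c0, double x0, double x1, double x2, double x3) {
    double acc = c0 * x0;
    acc = std::fma(c0, x1, acc);
    acc = std::fma(c0, x2, acc);
    acc = std::fma(c0, x3, acc);
    return acc;
}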

Register Pressure Sensitivity In GPUs, the number of registers per thread can

be varied at compile time by trading off the occupancy. Many auto-tuning efforts have

recently been proposed to that end [41, 58]. Table 4.4 shows the performance with

NVCC compiler, and the local memory transactions reported by Nvprof by varying

register pressure. We make the following observations: (a) our optimization strategy

reduces the register pressure for all the thread configurations; (b) increasing registers

per thread for codes exhibiting very high spills results in better performance, e.g., 8×

for rhs4th3fort; and (c) for low spills, better performance can be achieved by either

increasing occupancy (e.g., reordered code for 3d125pt and hypterm), or maximizing

registers per thread (e.g., all the codes for rhs4th3fort).

Finding the right balance between register pressure and occupancy is non-trivial, and is an active research field [109, 123, 41, 58]. We perform a simple auto-tuning by

NVCC and LLVM (two bar charts of performance in GFLOPS for the Original, Unrolled, and Reordered versions of 2d25pt, 2d64pt, 2d81pt, 3d27pt, 3d125pt, hypterm, rhs4th3fort, and derivative)

Figure 4.6: Performance on Tesla K40c with the register-limited stencil benchmarks tuned for tile size and register limit

varying the tile sizes by powers of 2, and varying registers per thread [58]. The best performance in GFLOPS for the auto-tuned code with NVCC and LLVM compilers is shown in figure 4.6. Unlike the case with 32 and 64 registers per thread, the unrolled

Metrics      | rhs4th3fort maxfuse | rhs4th3fort split-3 | hypterm maxfuse | hypterm split-3 | derivative maxfuse | derivative split-2
Inst. Exec.  | 8.52E+9  | 8.25E+8  | 7.48E+8 | 7.71E+8 | 8.79E+8  | 8.96E+8
IPC          | 1.07     | 1.11     | 0.97    | 1.06    | 1.02     | 1.14
DRAM reads   | 9.07E+7  | 1.65E+8  | 1.57E+8 | 1.77E+8 | 1.34E+8  | 2.47E+8
ldst exec.   | 1.55E+8  | 1.08E+8  | 1.27E+8 | 1.46E+8 | 1.45E+8  | 1.30E+8
FLOPs        | 1.73E+10 | 1.81E+10 | 9.66E+9 | 9.36E+9 | 1.28E+10 | 1.34E+10
tex txn.     | 1.11E+9  | 8.24E+8  | 9.72E+8 | 1.06E+9 | 1.14E+9  | 1.01E+9
l2 read txn. | 4.64E+8  | 3.79E+8  | 6.52E+8 | 5.90E+8 | 4.97E+8  | 4.51E+8
GFLOPS       | 237.16   | 274.52   | 140.71  | 155.02  | 168.27   | 182.83

Table 4.3: Metrics for reordered max-fuse vs. split versions for high-order stencil DAGs

code outperforms the original code for all benchmarks, highlighting the importance of loop unrolling and register-level reuse. Our reordering optimization improves the performance by (a) producing a code version that uses fewer registers, and hence can achieve higher occupancy; and (b) helping expose and schedule independent FMAs together for simple accumulation stencils, thereby hiding latency.

Kernel fission From Table 4.1, we select the last three multi-statement, compute-intensive stencils, for which we anticipate a high volume of spills in the max-fuse form, and expect kernel fission to be beneficial. For these three stencils, we generate versions with varying degrees of splitting (Section 4.3.4). Some splits require promoting the storage from scalars to global arrays, while others require recomputations due to dependence edges in the DAG. Table 4.3 shows the Nvprof metrics with NVCC for two reordered codes: a version with maximum fusion (max-fuse), and a version with split kernels.

Note that in each case, even though the DRAM reads increase going from max-fuse to split version, the IPC also increases. This is because the register pressure per

kernel is much lower in the split version, and hence we can unroll the computation to further exploit register-level reuse. This increase in register-level reuse is reflected in the reduced L2 read transactions. We observe nearly 10% performance improvement for the split version over the max-fuse version for all three stencils. Prior works have noted the importance of kernel fusion for bandwidth-bound stencils [116, 38, 87], and trivial kernel fission to aid fusion by reducing shared memory usage [117]. Such a kernel fission has very limited applicability in stencils with significant many-to-many reuse across statements. However, our motivation for kernel fission is orthogonal to prior efforts – we use kernel fission as a means to reduce the register usage of the max-fuse kernel, and then improve the register reuse for split kernels by ample unrolling and instruction reordering.

With our reordering optimizations applied to the benchmarks, we achieve speedups in the range of 1.22×–2.34× for NVCC, and 1.15×–2.08× for LLVM. We finally discuss the effect of the optimizations described in Section 4.3.3. The benchmark derivative has a large number of independent trees; each tree is an accumulation. The framework takes 3.42 seconds to generate code for it, and the memoization function is invoked 1.42E+05 times. Without memoization, the code generation time increases to 5.71 seconds. Our framework is well suited to enhance the performance of "optimize once, execute multiple times" stencils found in production codes, where the compilation/optimization time is amortized over the stencil execution.

4.6 Summary

Despite a rich literature on register allocation and instruction scheduling, current production compilers are not sufficiently effective in reducing register pressure for

Bench.      | Reg | Original (LMT, GFLOPS) | Unrolled (LMT, GFLOPS) | Reordered (LMT, GFLOPS)
2d25pt      | 32  | 1.83E+7, 144.56 | 1.18E+8, 57.34   | 4.21E+6, 302.12
            | 48  | 0, 127.35       | 0, 289.03        | 0, 357.76
            | 64  | 0, 115.03       | 0, 261.43        | 0, 369.09
2d64pt      | 32  | 3.39E+7, 111.75 | 7.17E+8, 18.07   | 6.74E+6, 315.49
            | 48  | 0, 191.17       | 4.04E+8, 26.03   | 0, 393.90
            | 64  | 0, 198.95       | 2.76E+8, 38.89   | 0, 420.61
            | 128 | 0, 146.64       | 1.31E+7, 231.72  | 0, 303.02
2d81pt      | 32  | 3.94E+7, 101.72 | 8.2E+8, 19.49    | 4.62E+6, 426.87
            | 48  | 0, 202.73       | 5.13E+8, 25.12   | 0, 466.03
            | 64  | 0, 204.73       | 3.96E+8, 30.67   | 0, 478.95
            | 128 | 0, 151.38       | 7.11E+07, 161.00 | 0, 415.30
            | 255 | 0, 83.58        | 0, 223.17        | 0, 340.90
3d27pt      | 32  | 4.28E+7, 126.07 | 2.24E+8, 37.93   | 1.65E+7, 182.53
            | 48  | 0, 167.23       | 2.01E+7, 149.51  | 0, 229.01
            | 64  | 0, 160.55       | 0, 172.37        | 0, 269.69
            | 128 | 0, 160.93       | 0, 180.49        | 0, 211.78
3d125pt     | 32  | 5.04E+7, 99.77  | 2.13E+9, 19.06   | 1.85E+7, 327.64
            | 48  | 0, 96.26        | 1.42E+9, 21.65   | 0, 339.16
            | 64  | 0, 107.61       | 1.35E+9, 25.62   | 0, 336.77
            | 128 | 0, 143.12       | 5.25E+8, 64.00   | 0, 282.68
            | 255 | 0, 93.64        | 0, 188.36        | 0, 173.738
hypterm     | 32  | 9.14E+8, 21.82  | -                | 4.51E+7, 83.83
            | 48  | 1.04E+8, 74.33  | -                | 0, 100.63
            | 64  | 0, 100.66       | -                | 0, 138.12
            | 128 | 0, 102.27       | -                | 0, 155.02
            | 255 | 0, 74.92        | -                | 0, 141.52
rhs4th3fort | 32  | 2.10E+9, 19.93  | -                | 1.37E+9, 31.47
            | 48  | 1.21E+9, 28.76  | -                | 4.30E+8, 87.82
            | 64  | 8.73E+8, 38.26  | -                | 9.99E+7, 171.01
            | 128 | 1.60E+8, 166.59 | -                | 0, 241.16
            | 255 | 0, 182.67       | -                | 0, 274.52
derivative  | 32  | 1.63E+9, 13.54  | -                | 6.56E+8, 34.25
            | 48  | 1.16E+9, 17.04  | -                | 1.29E+8, 84.69
            | 64  | 8.90E+8, 20.91  | -                | 0, 116.02
            | 128 | 3.60E+8, 53.15  | -                | 0, 153.61
            | 255 | 0, 149.95       | -                | 0, 182.83
LMT: Local Memory Transactions

Table 4.4: Spill metrics and performance in GFLOPS for register-limited stencils when compiled with NVCC on K40c device for different register configurations

100 compute-intensive high-order stencil codes. For such codes, register spills are a ma- jor performance limiter. Unfortunately, the compiler fails to perform an instruction reordering that can relieve register pressure, and the reordering it does perform to increase ILP often increases register pressure.

This chapter presents a register optimization framework for multi-statement, high- order stencils, which views such computations as a collection of trees with significant data reuse across nodes, and systematically attempts to reduce register pressure by decreasing the simultaneously live ranges. Just as pattern-specific optimization tech- niques have demonstrably been more beneficial over traditional compiler optimiza- tions for stencil computations, we show through this work that a specialized register management framework can be highly beneficial for stencil computations.

101 CHAPTER 5

A Generalized Associative Instruction Reordering Scheme to Alleviate Register Pressure in Straight-Line Codes

5.1 Introduction

Register allocation is generally considered a practically solved problem. For most programs, the register allocation and instruction scheduling strategies in production compilers are very satisfactory, and the number of register spills is well controlled.

However, this is not the case for many compute-intensive array-based computations applications in computational science and machine-learning/data-science that fea- ture multiple inter-related reuses. With such computations, a number of variables in a basic block of instructions exhibit many coupled reuses, with different groups of in- structions referencing different combinations of the set of reused variables. As shown with quantitative data later in the chapter, when a large number of data elements is involved in the many-to-many reuses, existing register management strategies in production compilers are unable to effectively control the number of register spills.

These many-to-many reuse patterns typically involve associative accumulation of mul- tiple contributions into destination variables. In this chapter, we develop an effective

102 instruction reordering strategy that exploits the flexibility offered by associative op-

erations. We demonstrate significant alleviation of register pressure and spilling and

enhancement of performance.

We use a high-order “box” stencil computation to elaborate on the addressed

problem. First, let us consider a simple 2D 9-point Jacobi computation shown in

Listing 5.1. Listing 5.1: 2D 9-point Jacobi computation 1 for(i=1;i

For computing each element A[i][j], the corresponding element B[i][j] and all 8

neighboring elements are read. For two adjacent result elements, B[i][j] and B[i][j+1],

six of the needed elements from A are common. By explicitly unrolling the inner j

loop, these potential reuses are exposed in the resulting basic block of code in the

unrolled loop. Additional ILP (Instruction Level Parallelism) is also exposed for the

compiler to exploit. Figure 5.1a shows the output and input data elements accessed

by the code for 3-way unrolling of the j loop. The order of accesses to elements of A

for such an unrolled code is shown in Figure 5.1a: each output point is computed in

entirety before the computation of the next output point begins. As the order of the

stencil increases, the number of simultaneously live variables grows quadratically for

the “natural” order of access for elements of A shown in 5.1a. For a 4-way unrolled

version of an 81-point convolution stencil, with GCC-7.2.0 as the base compiler on

an Intel i7-6700K processor, we observe the number of memory accesses per iteration

increase from 269 to 664 in the generated assembly code, and the performance drop

103 (a) gather-gather scheme (b) scatter-gather scheme The number on the edges represents the order of contributions

Figure 5.1: Different contribution schemes for the 3-way unrolled version of 2D 9-point Jacobi computation

from 72 GFLOPS to 70 GFLOPS when going from the original non-unrolled form to the unrolled version.

For such high-order stencils, the problem has been recognized in prior work [99] and a solution provided - exploit associativity to reorder operations and create a different equivalent stencil pattern. With this changed access pattern, it hasbeen shown that the maximum number of concurrently live registers reduces from O(k2) to just O(k) for an order-k box stencil [99]. Rearranging the contributions for the 81- point convolution stencil in a manner as shown in Figure 5.1b, where the contribution from an input value is “scattered” to different output points, brings down the memory accesses per iteration in the unrolled code to 206, and the performance increases from

70 GFLOPS to 137 GFLOPS.

Accumulator expansion is performed in compilers, but is done in a local manner without considering impact on register pressure. For simple single-statement high- order stencils, a retiming base solution to alleviate register pressure was proposed

104 by Stock et al. [99]. However, the problem of high register pressure and subsequent

register spills arises in more complex settings in many scientific applications, e.g.,

multiple inter-related stencils, tensor contractions, etc. In this chapter we develop a

general solution for the associative reordering problem, without using any specialized

abstractions such as stencil patterns as was used by Stock et al. [99]. We devise

an instruction reordering strategy that simultaneously considers the flexibility with

associative reordering of accumulations and the impact of instruction order on the

maximum live values.

We develop a List-based, Adaptive, Register-reuse-driven Scheduler (LARS) that uses multiple criteria, including affinities of non-live variables to those live in registers, as well as potentials of variables to fire operations and to release registers. We choseto implement the instruction scheduler as a source-to-source pass outside the compiler, to be executed before the standard compiler passes. By doing so, we are able to evaluate its impact with two production compilers, LLVM and GCC. We present experimental results on a large collection of benchmarks that exhibit significant potential register- level reuse for array elements. We demonstrate significant benefits from use of LARS.

The chapter makes the following contributions:

• It develops a framework to reduce register pressure by exploiting the flexibility

of associative reordering for computational kernels with multiple inter-related

reuses.

• It develops a flexible multi-criteria instruction reordering heuristic that can be

adapted across different architectures.

105 • It demonstrates the effectiveness of the proposed framework for a number of

scientific kernels when compiled with different compilers and on multiple archi-

tectures.

5.2 Background and Motivation

Register Allocation and Instruction Scheduling Register allocation and in- struction scheduling are among the most important backend passes in a compiler.

Register allocation assigns physical registers to the variables used in the intermedi- ate representation (IR). All variables that are live at a given program point must be assigned to distinct registers if register spills are to be avoided. We denote as

MAXLIVE the maximum number of variables that are simultaneously live at any program point through the execution of the program. A data reuse refers to multiple accesses to the same variable. Once a variable is loaded from memory, it is desirable to keep it in a register throughout its live range. However, that may not be possible due to the limited number of available physical registers. Spill instructions are gen- erated when the allocator runs out of physical registers: some registers are freed after their contents are stored into memory or simply discarded. The spilled contents must be reloaded into registers before their subsequent use. Since memory accesses have higher latency than other instructions, minimizing spills is important for performance.

Several techniques have been proposed to perform register allocation. Most compilers use versions of graph coloring [16] or linear scan [81]. Both these techniques perform register allocation on a fixed instruction schedule that is obtained after performing instruction scheduling on the IR. The instruction scheduler orders the instructions

106 to minimize the schedule length and simultaneously increase ILP, so that the func- tional units and pipelines of the underlying processor are effectively utilized. Clearly, the objectives of instruction scheduling and register allocation can be antagonistic: instruction scheduling may prefer independent instructions scheduled in proximity to increase ILP, whereas register allocation may prefer data-dependent instructions to be scheduled in proximity to shorten the live ranges. This problem is more pro- nounced for applications that exhibit complex, many-to-many data reuse pattern: the instruction scheduler does not consider the reuse pattern while generating the initial instruction ordering, resulting in increased live ranges for variables.

Associative Instruction Reordering Sometimes, a better instruction reordering can be achieved if one leverages associativity of operations. Although floating-point additions are not strictly associative, it is generally acceptable to perform associa- tive reordering of accumulations in most applications that do not rely on rounding behavior. In fact, many scientific computations are compiled using -ffast-math -ffp- contract=fast flags6, which allows a compiler to exploit associativity of floating-point operations to improve performance at the expense of IEEE compliance. Many recent efforts have leveraged operator associativity to drive code optimization strategies

[99, 8, 103].

Consider the input program of Listing 5.2. The two statements read from four common input values. If the computation of the first statement entirely precedes the second statement, then these four values must be kept alive in registers. However, one can leverage the associativity of addition to reorder the computation as shown in

Listing 5.3. In the reordered computation, the evaluation of the two output points is

6https://gcc.gnu.org/wiki/FloatingPointMath

107 Listing 5.2: An unrolled computation with reuse in inputs 1 for (int j=2; j

Listing 5.3: A reordering for the computation of Listing 5.2 that leverages associativ- ity of + to reduce register pressure 1 for (int j=2; j

interleaved, so that all the uses of an input value are brought closer, and consequently,

its live range is shortened. However, most production compilers perform instruction

scheduling and register allocation on an IR that is closer in spirit to the Static Single

Assignment (SSA) IR. In SSA, the accumulation operations are converted to a use-def

chain of contributions. Lifting the SSA IR back to a higher abstraction to leverage

the operator associativity is difficult, and therefore most compilers fail to fully utilize

such a transformation to relieve register pressure.

Solution Approach Many prior works have studied the implications of phase or-

dering between register allocation and instruction scheduling on the generated code

108 [80, 74, 11]. These works proposed an integrated register allocation and instruction scheduler for VLIW or in-order issue superscalar processors. However, none of these works leverage properties of the operators involved in the computations to improve the instruction reordering. We show in Section 5.4 that for many scientific applications executed on out-of-order (OoO) processors, register spills are a performance bottle- neck. To this end, we propose LARS – a greedy instruction reordering framework for straight-line code, that a) operates at source-level to fully exploit the associativ- ity of operations, b) has a much better global perspective of the computational reuse pattern, and c) reorders the instructions in the computation to minimize MAXLIVE.

5.3 Instruction Reordering With LARS

5.3.1 Preprocessing Steps

LARS parses statements within a computational loop nest that are expressed in a subset of C with the following restrictions: (1) the loop iterators and the program parameters must be immutable in the statements; (2) the right-hand side expression of a statement must be side-effect free. Side-effect-free mathematical functions like sine, sqrt, etc. are allowed in the RHS expression; and (3) the array index expres- sions in each statement must be an affine function of the loop iterators, program parameters, and literals. LARS therefore is capable of parsing the stencils expressed in StencilGen language.

Based on the unroll factors specified in the StencilGen input to LARS, a pre- processing pass performs loop unrolling, and then lowers each statement in the op- timization region into a sequence of instructions using operator associativity and/or distributivity. The instructions are somewhat similar in spirit to the three-address

109 t1 = ((d*c) + (b/c) + (b*e) + (d/f)) * g + k * g; t2 = (n*p) + ((f+e) * p); (a) Illustrative computation

1. a = d * c; 2. a += b / c; 3. a += b * e; 4. a += d / f; 5. t1 = a * g; 6. t1 += k * g; 7. m = n * p; 8. q = f + e; 9. r = q * p; 10. t2 = m + r; (b) Initial Schedule (c) Dependence Graph

(d) Input CDAG (e) CDAG with accumulations

Figure 5.2: Example: statements lowered down to sequence of instructions, and cor- responding CDAG

GIMPLE IR of GCC, where the right-hand side of an instruction has at most two operands, and the operator is either an assignment or an accumulation (Appendix

A). In terms of assembly language, these instructions are synonymous to the register- register instructions (r1 ← r2 op r3 where r1 and r2 can be the same register). Figure

5.2a shows an illustrative computation which is lowered into the instructions shown in

110 Figure 5.2b using the associativity and distributivity of +, ∗. Note that even though

all the operands in the illustrative example are scalars, in practice, the operands can

be a mix of array accesses, scalars, and literals. An abstraction commonly used to

represent such computations is a computational DAG (CDAG) [33, 93], where the leaf

nodes represent the storage locations read, the root represents the output, and the

internal nodes represent the operators. For example, Figure 5.2d shows the CDAG

corresponding to the computation of Figure 5.2b, whereas Figure 5.2e shows the

CDAG corresponding to the instructions after converting the additive contributions

in the original DAG to accumulations, represented in the figure by orange ”accumula-

tion +” nodes. Throughout the text, we will shift from instruction sequence to their

CDAG abstraction for ease of explanation.

In order to reduce register pressure, LARS must gauge the data reuse between the

instructions in the original schedule. To recognize the common uses of a value, the

accessed storage locations are assigned a label. All the accesses to the same storage

location within a statement will have the same label. Across statements, the accesses

to a storage location M will be identified by the same label if there is no writeto M in between their execution.

5.3.2 Creating Multiple Initial Schedules

The time taken to reorder instructions is significantly lower with a greedy heuristic

than techniques like dynamic programming [32] or integer linear programming [17].

We use this to our advantage by reordering multiple versions of the input program

instead of just one, and choosing the reordering with the best performance.

111 One can rewrite the pragma-demarcated computation as different dependence- preserving permutations of input statements; each permutation can produce a re- ordering schedule with different register requirement. A brute-force approach will examine all the dependence-preserving permutations of the input statements, and apply LARS to each permutation. However, for a program with n independent state- ments, this would imply exploring n! statement sequences, a formidable task as n increases. Instead, we use a simple algorithm to generate a few different permuta- tions that cluster the statements with reuses together, and then apply LARS to only these few statement permutations.

The algorithm operates on the dependence graph G of the statements. For each pair of nodes in G, the algorithm computes the number of common labels between them. The statement pair with the highest reuse must be closer in the permutations.

The algorithm ensures this by merging the two nodes with the highest reuse into a “macro-node” in G. Merging is only valid if it is dependence-preserving. After merging nodes si and sj, G is modified appropriately - the incoming (outgoing) edges to (from) si and sj become the incoming (outgoing) edges to (from) the merged node, and the edges between si and sj are removed. This process is repeated till only a single macro-node remains. The permutations are obtained by expanding the final macro node into statements: we expand a macro-node comprising two statements {si, sj} as [si, sj] in one permutation, and [sj, si] in another, depending on the dependence between si and sj. We restrict the number of the generated statement permutations to a user-specified limit. Once we have the permutations, we choose one permutation at a time, apply the preprocessing step described in Section 5.3.1 to it, and then use

LARS to reorder the version. The final reordered version is the one that ismore

112 efficient in execution. For the rest of the discussion, we will assume that theinputto

LARS is a valid permutation P .

5.3.3 Overview of the Reordering Strategy

Given the initial schedule P , the main objective of LARS is to compute a re- ordered schedule that preserves the dependences, and reduces register pressure while maintaining sufficient ILP. We use the initial schedule ofFigure 5.2b as an example

to give a brief overview of the reordering strategy implemented in LARS.

Prior to reordering, we construct a dependence graph (DG) for the initial schedule.

Since we allow the contributions to an accumulation node to be in arbitrary order, we

do not include the true/output dependences on the accumulation node in DG. Figure

5.2c shows the dependence graph for the schedule of Figure 5.2b.

Next, we compute the labels occurring in the initial schedule. Figure 5.3a shows

the mapping from the labels to the instructions in which they appear. We maintain a

set L of labels that are currently live in pseudo registers, assuming a spill-free7 model

of computation. An instruction Ij in the schedule can fall in one of the following

three categories:

− Blocked: If Ij has a true dependence on another instruction Ik in the DG, and

Ik has not yet been fired

− Unblocked: If Ij is not blocked

− Fireable: If Ij is unblocked, and all the labels of Ij are in L.

Initially, only the instructions with no incoming dependence edges in DG are

unblocked (i.e., instructions 1, 2, 3, 4, 6, 7, and 8 in the schedule of Figure 5.2b),

and none of the labels are live. The reordering algorithm can be summed up as

7a value once loaded in a register will remain so for all its def/uses

113 two iterative steps: make the labels live in some order, so that instructions become

fireable, and append the fired instructions into the final reordered schedule. The

order in which the labels are made live and the fired instructions are appended to the

reordered schedule will affect the register pressure; the heuristic used to determine

the order forms the core of this section.

In order to determine the first seed label that becomes live, we compute an initial

priority metric for each label; the computation is described in details in Section

5.3.5. Simply put, the labels occurring in instructions that are closer to the source

of the longest (critical) path in the CDAG are assigned a higher initial priority.

This conforms to the priority assignment in the instruction scheduling phase of most

production compilers, like GCC. For example, the critical path in the schedule of

Figure 5.2b involves the accumulation into a. Since division operation has the highest latency, we assign high initial priority to labels {a,b,c,d,f }. Among these, let us

assume that LARS chooses b as the seed label (Figure 5.3b).

Once a label is made live, the algorithm checks if any unblocked instruction be- comes fireable. If so, the algorithm fires/executes it, and appends it to thefinal reordered schedule. Otherwise, a cost tuple is constructed for each label, which captures the interaction of the label with the already live labels, and the effect of making that label live on the current unblocked statements (Section 5.3.5). The label

deemed most profitable by the cost tuple is iteratively made live till one ormoreof

the unblocked instructions becomes fireable. In our example, none of the unblocked

instructions become fireable when the seed label b is made live, more labels need to be made live in order to fire any instruction. Figure 5.3b shows the interaction between

the live label b, and the non-live labels {a,c,e}: they appear with b in instructions 1,

114 (a) Initial schedule (b) Step 1 (c) Step 2

(d) Step 3 (e) Step 4

Figure 5.3: Applying LARS reordering strategy to the input of Figure 5.2b

2, and 3 respectively, and hence, their live ranges interfere with b. This interaction is captured by the cost tuple of labels a, c and e, giving them a higher priority over other non-live labels. Based on the priority of cost tuples, one of the three labels will be added to the live set; let us assume that LARS picks c to be added to the live set.

Once c is live, instruction 2 can become fireable if a is made live next. When recom- puting the cost tuple for the non-live labels, the firing potential of a is accounted for in its cost tuple, giving it higher priority over other non-live labels. Therefore a is made live next, and instruction 2 is fired (Figure 5.3c).

115 Once the computation reaches the state of Figure 5.3d, making label q live will

enable firing of instruction 8. Instruction 8 has the last uses oflabels f and e: once

fired, the pseudo-register assigned to these two labels can be reused to store other

labels. This release potential of q is accounted for in its cost tuple, resulting in it

getting a higher priority over other non-live labels, and becoming live in the next

step (Figure 5.3e).

Once an instruction is fired, its dependence edges are removed from DG, which

can possibly unblock some instructions. The cost tuple for each non-live label is

recomputed with a change in the set of live labels, or the set of unblocked statements.

The algorithm terminates when all the instructions in the initial schedule have been

fired. The final reordered schedule is then printed out using appropriate intrinsics for

multi-core CPUs. Algorithm 5.1 sketches out the high level reordering algorithm.

5.3.4 Adding Instructions to Reordered Schedule

At any step, if there are more than one fireable instructions, then priority is given

to one that has minimum interlocks8 with the recently scheduled instructions in the

final reordered schedule. For example, at the stage of Figure 5.3e, if labels t1 and g

are made live, then instructions 5 and 6 become fireable simultaneously. However,

the previously fired instruction 8 interlocks with instruction 6 on label k. Therefore,

instruction 5 is scheduled before instruction 6 in the reordered schedule (Figure 5.4a).

If all the fireable instructions have the same degree of interlock with the instructions

in the reordered schedule, then we fire them in order of their appearance inthe

8 an instruction Ia interlocks with a previously fired instruction Ib if there is a true dependence from Ib to Ia, or both Ib and Ia contribute to the same accumulator [33].

116 Algorithm 5.1: Reorder (P, k): Reorder the input schedule P using the greedy list-based scheduling of LARS Input : P : Initial schedule, k: Interlock window size Output: R: Reordered schedule 1 DG ← dependence graph for P ; 2 L ← ∅; // set of live labels 3 Sub ← instructions that have no incoming dependences in DG; 4 Sfire ← ∅; // set of fireable instructions 5 compute EST(ni) for each instruction ni using equation 5.1; 6 for each label l in P do 7 compute initial priority of l using EST of instructions (Sec. 5.3.5) 8 while not all instructions in P are fired do 9 M ← ∅; // list of cost tuple for each non-live label 10 for each label l ̸∈ L such that l occurs in an instruction in Sub do 11 M ← M ∪ Create-Tuple (l, L, Sub); (Algo. 5.2) 12 Adaptive-Custom-Sort (M) (Sec. 5.3.6); 13 add the most profitable label in sorted M to L; 14 for each instruction s in Sub do 15 if all the labels of s are in L then 16 Sfire ← Sfire ∪ s; 17 Sub ← Sub \ s; 18 while Sfire ≠ ∅ do 19 remove an instruction s ∈ Sfire that has the least interlocks with the instructions in R; 20 for each label l in s do 21 if this is the last use of l, remove l from L; 22 if s contributes to an accumulation node t then 23 create a shadow accumulation output if any of the last k instructions in R also contribute to t; 24 append s to R; 25 remove the dependence edges from s in DG, add instructions with no incoming dependences in DG to Sub;

original schedule. This provides the scheduling phase of the backend compiler with opportunities to exploit ILP [33].

117 If instructions Ia and Ib interlock due to the same accumulation output, we can resolve the interlock by register renaming [96], a technique often used by both com- pilers and the hardware to remove false dependences and improve ILP. When firing the instruction Ia, we check if it interlocks due to the accumulation output with any of the k previously fired instructions, where k is a code generator parameter, which we term interlock window size. If it does, then we replace the accumulator in Ia with its shadow, which will act as a partial accumulator. The shadow accumulator is appropriately initialized to the identity of the accumulation operator (e.g., 0 for addition/subtraction, 1 for multiplication), and its will be added to the actual ac- cumulator before the subsequent use of the accumulation output. There will be at most k shadows per accumulator, after which they will be cyclically reused. By reg- ister renaming, we remove the dependence between instructions that are less that k hops apart in the final schedule, thereby improving ILP. For example, in the reordered schedule of Figure 5.4a, the true dependences due to the accumulation output a in in- structions 1–4 limits the achieved ILP. We can increase the ILP by creating a shadow accumulator a1, and collecting partial accumulations from alternate instructions into it, as shown in Figure 5.4b. The partial contribution from the shadow a1 is added to a before its subsequent use in instruction 6.

5.3.5 Adding Labels to Live Set Assigning initial priority to labels

None of the labels are initially live. To start the reordering, we have to choose the first non-live seed label, and make it live. However, this choice is not arbitrary. In order to determine the seed label, we assign an initial priority to all the labels. This initial priority based on the concept of earliest starting time. For each instruction n

118 1. a = b / c; 1. a = b / c; 2. a += d * c; 2. a1 += d * c; 3. a += b * e; 3. a += b * e; 4. a += d / f; 4. a1 += d / f; 5. q = f + e; 5. q = f + e; 6. t1 = a * g; 6. t1 = (a + a1) * g; 7. t1 += k * g; 7. t1 += k * g; 8. r = q * p; 8. r = q * p; 9. m = n * p; 9. m = n * p; 10. t2 = m * r; 10. t2 = m * r;

(a) Final Schedule (b) Final schedule with shadows

Figure 5.4: Example: Using shadows to avoid interlocks

in the initial schedule, the earliest starting time, EST(n), can be computed based on

the pipeline latencies of the architecture [92].

EST (n) = max(EST (pi) + latency(pi)) (5.1) i

th where pi is the i predecessor of n in DG, and latency(pi) is the latency of the

dependence edge from pi to n.

A source in the DG is a node with no predecessors, and a sink is a node with

no successors. For a path p in DG starting at source node ns and ending at sink

node nt, we define the path execution time, PET(p) = EST(nt).A critical path in DG corresponding to V is defined as the source-to-sink path with the highest

PET. There can be more than one critical paths in the program. In most instruction scheduling algorithms, the first scheduled instruction (seed instruction) is the source instruction in the critical path [92, 33].

If an instruction n occurs in k source-to-sink paths in the DG, then we can compute

a cumulative cost for n, similar to the one defined in33 [ ], as max (PET(pr)−EST(n)). 1≤r≤k

119 Thus, the source in the critical path will have the highest cumulative cost, and the instructions closer to the source instruction in relatively longer paths in the DG will have higher cumulative costs than other instructions. The labels occurring in an instruction with higher cumulative cost must be assigned a higher priority, so that such instructions are scheduled with urgency. The initial priority Pl of a label l is therefore set to max{cumulative cost(ni)}, where ni represents all the instructions in i which the label l appears.

Creating a cost tuple for each label

In order to model the profitability of making a non-live label l live, we capture its impact on the current set of unblocked instructions and its interaction with the live labels ∈ L using the following metrics:

− Fire potential (Fpot): It indicates the number of instructions that become fire-

able if label l is made live. If there are k unblocked instructions that only need

l to become live in order to become fireable, then the fire potential of l is k.

− Release potential (Rpot): Once the label l is made live, some instructions may

become fireable. The release potential of l refers to the number of labels that

have their last uses in the instructions that become fireable when l is made live.

− Cumulative Primary affinity (Paff): A non-live label l has primary affinity of

strength p with an already live label t if both l and t occur simultaneously as the

labels in exactly p instructions. The cumulative primary affinity of a non-live

label l is the sum of its primary affinity with all live labels.

− Cumulative Secondary affinity (Saff): A non-live label l has secondary affinity

of strength s with an already live label t if they both have a non-zero primary

120 Algorithm 5.2: Create-Label-Cost-Tuple (l, L, Sub): Create a cost tuple for each non-live label that captures its interaction with the live labels and un- blocked instructions Input : L: Set of live labels, l: a label ̸∈ L, Sub: Unblocked instructions Output: T : Cost tuple for l 1 Rpot ← 0; Fpot ← 0; Nlk ← 0; 2 for each s ∈ Sub that becomes fireable when l ∈ L do 3 increment Fpot; 4 for each t in L that has the last use in s do 5 increment Rpot; 6 Pl ← the label priority of l based on its initial priority, and the current leading statement; 7 Nnpaff ← number of labels ̸∈ L that have non-zero primary affinity with l; 8 Paff ← cumulative primary affinity of l to the labels ∈ L; 9 Saff ← cumulative secondary affinity of l to the labels ∈ L; 10 T ← ⟨Rpot,Fpot,Paff,Saff, −Nnpaff,Pl⟩; 11 return T ;

affinity with exactly s common labels. The cumulative secondary affinity of a

non-live label l is the sum of its secondary affinity with all live labels.

Rationale: The release potential of a label reflects its ability to reduce register

pressure directly. Therefore, the label with highest release potential must always be

made live before others. Making a label with high fire potential live can provide the

compiler with independent instructions to schedule. Also, even if the fired instructions

do not release registers, they may still be a step in decreasing the remaining uses of

their operands, and towards a consequent release of the registers occupied by their

operands. If the label l has a non-zero primary affinity with an already-live label

t, then the live range of l and t definitely interfere, and t cannot be released till l

becomes live. Therefore, if label l has a high cumulative primary affinity with the already-live labels, then we eagerly make l live, so that we get a step closer to releasing

121 l, and the labels it has primary affinity with. Similarly, if a label l has a non-zero

secondary affinity with an already-live label t due to a label m, then the live range of

both l and t will interfere with m. Making l live will shrink the live range of m and

bring us one step closer to releasing m.

To help us determine the seed label, we had initialized the initial priority Pl to

each label based on the cumulative cost of the instructions. Multiple labels along

disjoint critical paths could be initialized with the same priority if those paths have

the same PET. However, when an instruction n is fired by LARS, the priority of the

labels in the instructions dependent on n must be increased to favor depth-first traver-

sal of the CDAG containing n. Otherwise, the instructions along different CDAGs

may get unnecessarily interleaved, increasing the register pressure [93]. Whenever an

instruction is fired, the priority of each label is updated based on the current leading

statement. A statement Si is currently leading if maximum number of the instruc-

tions that it was lowered down to have been fired. All the non-live labels occurring

in the unblocked instructions corresponding to Si are assigned a higher Pl than other

non-live labels.

The primary affinity of a non-live label l to other non-live labels is also an im-

portant metric. Suppose the label l has a non-zero primary affinity with r non-live

labels. Then, if l is made live at the current step, it needs to be live until all the r labels are live as well. For each label l, Nnpaff(l) denotes the number of non-live labels with which l has non-zero primary affinity. A label with higher Nnpaff must be

penalized, as making it live may increase live-range interference in future.

The resultant combined metric for each label can be thought of as an 6-tuple:

⟨Rpot, Fpot, Paff, Saff, Pl, −Nnpaff⟩. The collection of tuples for all the labels is then

122 lexicographically sorted in descending order to rank-order the labels based on their

profitability to be made live. The label that is deemed most profitable after therank-

ordering is chosen to be added to the set L. Algorithm 5.2 depicts the creation of the

cost tuple.

5.3.6 Adaptivity in LARS

The relative order of the metrics in the tuple affects the reordering objective.

For example, assigning the highest priority to Rpot followed by Pl would allow an

interleaved execution of two statements only when the interleaving results in release

of registers. This provides us the flexibility to define multiple sort functions, each

acting as an objective function to determine the sequence in which the labels are

added to the set L. In the implementation, we choose the objective function based

on the nature of the computation. For the computations where the intra-statement

reuse is less than half the inter-statement reuse (e.g., computation of Figure 5.5e), we

order the metrics as ⟨Rpot, Fpot, Paff, Saff, −Nnpaff, Pl⟩. This facilitates interleaving

across the statements to reduce register pressure. However, if the inter-statement

reuse is less than one-third the intra-statement reuse, we order the metrics as ⟨Rpot,

Pl, Fpot, Paff, Saff, −Nnpaff⟩, so that the interleaving across CDAGs is minimized.

5.3.7 Putting it all together

Algorithm 5.3 recapitulates the broad steps applied by LARS to reorder an input

program M. The unrolling factors are applied to M to obtain an unrolled version

M ′. The dependence graph is then computed for M ′ (line 4) using which, multiple

permutations {V } of M ′ are created (line 5). The reordering heuristic is then applied

123 Algorithm 5.3: End-to-end algorithm of LARS Input : M: Input program, uf: Unrolling factors Output: R: Reordered output 1 Tm ← ∞; 2 k ← user-specified interlock window size; ′ 3 M ← Unroll (M, uf); ′ ′ 4 G ← Dependence-Graph (M ); ′ ′ 5 {V } ← Create-Permutations (M ,G ); (Section 5.3.2) 6 for P ∈ {V } do 7 r ← Reorder (P, k); 8 t ← execution time of r; 9 Tm ← min (Tm, t); 10 R ← faster version between r and the one corresponding to Tm; 11 return R;

to each of these versions (line 7). The end goal is to find the fastest amongst all the

versions, which is then returned as the final reordered code (lines 8–11).

5.3.8 Example: LARS in action

We demonstrate the power of LARS by using it to reorder a Jacobi stencil com-

putation [99]. We assume that the interlock window size is 1. The computation has

only inter-statement reuse, and the cost-tuple for labels are sorted as described in

Section 5.3.6.

Figure 5.5e shows a 4-way unrolled version of the 2D 9-point Jacobi stencil. We assume that the coefficients in the stencil are literals, and all the input contributions to outputs are additive. There are 4 expression trees, each corresponding to the computation of a single output point, and there is significant reuse between these trees: two consecutive output points reuse 6 input values, and the output points two hops away reuse 3 input values. For the ease of description, let us assume that each input (output) point in Figure 5.5e is assigned an integer identifier which starts from

124 (a) Step 1 (b) Step 2 (c) Step 3

(d) Step 4 (e) Step 5

Figure 5.5: LARS in action: steps in instruction reordering with LARS for a 4-way unrolled 2D 9-point Jacobi stencil

1, and is incremented during the lexicographical scan, going from left-to-right, top- to-bottom. We identify an input (output) label as iv (ov), where v is its identifier.

All the labels have the same initial priority.

LARS starts by scheduling the solo contributions from the inputs i1, i7 and i13 that contribute to o1, since these labels can be immediately released (Figure 5.5a).

Next, i2 is made live, since it has firing potential of 1, and has the lowest primary affinity to other non-live labels. Now, making o2 live will release i2 (Figure 5.5b). i14 is made live next, as it will be released after making its contributions to o1 and o2. i8 will be made live next, forcing o3 to become live in the next step. At this point, we make i3 live, and it contributes to three already-live output labels (Figure 5.5c).

LARS follows this pattern to progress to the schedule shown in Figure 5.5d, and ultimately that of 5.5e. At each step, we had at most three outputs points, that had high secondary affinity to each other, live consecutively. Also, we made exactly one

125 input point live, which contributed to all the live output points, and was immediately

released. Therefore, irrespective of the degree of unrolling, the maximum register

pressure for the schedule by LARS will be 4.

Obtaining such a schedule is difficult with the existing instruction schedulers.

The existing schedulers may interleave the computation across trees to increase ILP.

However, among multiple interleavings with the same degree of ILP, there will be only

a few that will simultaneously reduce the register pressure. Finding such interleavings

requires more than just a greedy decision based on a local perspective of register

pressure. LARS is able to find such an interleaving by having a broader perspective

of register pressure based on release/fire potential, and cumulative primary/secondary

affinities.

5.4 Experimental Evaluation

Experimental Setup The experimental results presented here were obtained on

Intel Xeon Phi, and a skylake i7-6770K processor. The hardware details are listed in

Table 5.1. All the benchmark codes are compiled with GCC-7.2.0, and the version of LLVM from the svn (llvm/trunk 311831). The compilation flags for both the compilers are listed in Table 5.2. Both GCC and LLVM provide some fine-grain

control over the instruction scheduling passes via the compilation flags: GCC allows

the user to enable or disable the prepass and postpass instruction scheduler, whereas

LLVM provides multiple implementations for the prepass scheduler. With the fine-

grain access to the compiler passes, we can control the effect of the scheduling passes

on the LARS reordered schedule.

126 Resource Details Intel multi-core Intel core i7-6700K (4 cores, 2 threads/core 4.00 GHz, 8192K L3 CPU cache) Intel Xeon phi Intel Xeon Phi 7250 (68 cores, 4 threads/core 1.40GHz, 1024K L2 cache)

Table 5.1: Benchmarking hardware to evaluate LARS

Compiler Flags gcc -Ofast/Os -fopenmp -ffast-math -fstrict-aliasing -march=core-avx2/{knl -mavx512f -mavx512er -mavx512cd -mavx512pf} -mfma -f(no)schedule- insns -fschedule-insns2 -fsched-pressure -ffp-contract=fast llvm -Ofast/Os -fopenmp=libomp -mfma -ffast-math -fstrict-aliasing -march=core-avx2/knl -mllvm -ffp-contract=fast -pre-RA-sched=”source/list-ilp/list-hybrid/list-burr” Flags for multi-core CPU and Xeon Phi

Table 5.2: Compilation flags for GCC and LLVM on different benchmarking hard- wares

Benchmarks We evaluate the efficacy of LARS on a wide variety of benchmarks, that can be grouped into five sets: the first set comprises generalized versions of computations typically used in iterative processes such as solving partial differential equations and convolutions [67]; the second set comprises smoothers used in HPGMG

benchmark suite [45]; the third set comprises high-order stencil computations from

the ExpCNS Compressible Navier-Stokes mini application from DoE [29], and the

Geodynamics Seismic Wave SW4 application code [104]; the fourth set comprises

the kernels from Cloverleaf benchmark suite [62]; the fifth set comprises the tensor

contraction kernels CCSD(T) from the NWChem suite [77]. These benchmarks are

listed in Table 5.3. All the benchmarks are double-precision. Note that we evaluate all

127 Benchmark N FPP R Benchmark N FPP R 2d25pt 81922 49 2 cell-advec 3D 2563 49 16 2d49pt 81922 97 2 mom-advec 3D 2563 55 15 2d64pt 81922 127 2 acceleration 3D 2563 57 13 2d81pt 81922 161 2 ideal-gas 3D 2563 12 4 2d121pt 81922 241 2 PdV 3D 2563 98 16 3d27pt 5123 53 2 calc-dt 3D 2563 67 11 3d125pt 5123 249 2 fluxes 3D 2563 30 12 chebyshev 5123 39 6 viscosity 3D 2563 139 9 7-point 5123 11 2 cell-advec 2D 40962 47 14 poisson 5123 21 2 mom-advec 2D 40962 41 13 helmholtz-v2 5123 22 7 acceleration 2D 40962 38 10 helmholtz-v4 5123 115 7 ideal-gas 2D 40962 12 4 27-point 5123 30 2 PdV 2D 40962 51 13 hypterm 3003 358 13 calc-dt 2D 40962 36 9 diffterm 3003 415 11 fluxes 2D 40962 12 8 rhs4th3fort 3003 687 11 viscosity 2D 40962 58 7 derivative 3003 486 10 sd-t-d1-{1:9} 248 2 3 N: Domain Size, FPP: FLOPs per Point R: Arrays Accessed

Table 5.3: Characteristics of straight-line scientific benchmarks

the compute kernels from Cloverleaf benchmark suite, leaving only the trivial kernels

that copy data from one array to another.

Code Generation The original version (Original) for each benchmark is as written by application developers, parallelized and vectorized by adding appropriate OpenMP pragma to the outermost parallel loop, and SIMD pragma to the innermost vector- izable loop, but without any explicit loop unrolling. We generate multiple unrolled versions (Unrolled) for each benchmark by explicitly unrolling all but the innermost loop by powers of 2, restricting the maximum unroll factor to 8. For each unrolled version, we generate an accumulation version with all the additive contributions con- verted to accumulations (Accumulation). Corresponding to each unrolled version, we

128 create a LARS-optimized version (Reordered) as well. All the benchmarks evaluated are control flow-free and conform to the restrictions described in Section 5.3.1, and hence LARS is able to parse them without any code modification. Except for the

CCSD(T) kernels, LARS generates a reordered schedule with appropriate SIMD in- trinsics for multi-core CPU and Xeon Phi. We do not generate accumulation version for CCSD(T) benchmarks, since their unrolled versions are already in accumulation form. No other optimization (e.g. tiling) is applied to the reordered code to get a fair performance comparison with respect to the unrolled versions. For code genera- tion, the interlock window size, k, is set to 2. For all the benchmarks, the reordered versions were generated under 15 seconds by LARS. We check the correctness of each reordered code against the original version.

Performance Results Figures 5.6 – 5.11 plot the performance of the original, un-

rolled, accumulation, and reordered code for the different benchmark sets. For the

convolution kernels (Figure 5.6), the reordered version significantly outperforms the

original and unrolled versions over all compilers/architectures, especially when the

order of the computation increases. The order here refers to the extent of elements

read from the center. For example, the 2d21pt kernel benefits the most from reorder-

ing, and it has the highest order, 5. This observation is consistent with that of Stock

et al. [99]. The volume of data read per point usually increases with an increase in

the order for most of the dense computations, and this can exacerbate the spills in the

unrolled code. LARS exploits associativity to judiciously reorder the instructions by

interleaving computations across the unrolled statements, thereby reducing the spill

volume. The reordered version also outperforms the accumulation version, indicating

129 150 150

100 100

50 50 Performance (GFLOPS) . . . . 0 2d25pt 2d49pt 2d64pt 2d81pt 2d121pt3d27pt 3d125pt 0 2d25pt 2d49pt 2d64pt 2d81pt 2d121pt3d27pt 3d125pt

GCC on i7-6700K LLVM on i7-6700K

. Original. . Unrolled. . Accumulation. . Reordered.

400 400

300 300

200 200

Performance (GFLOPS) 100 100 . . . . 2d25pt 2d49pt 2d64pt 2d81pt 2d121pt3d27pt 3d125pt 2d25pt 2d49pt 2d64pt 2d81pt 2d121pt3d27pt 3d125pt

GCC on Xeon Phi 7250 LLVM on Xeon Phi 7250

Figure 5.6: Performance of convolution kernels on different architectures

that reordering the instructions after rewriting the computation in accumulation form is crucial to performance.

LARS outperforms both original and unrolled versions for the HPGMG smoothers

(Figure 5.7) as well, but the performance gains are lower when compared to the convolution kernels due to two reasons: (a) LARS leverages operator distributivity in order to exploit associativity for smoothers, thereby increasing the computation;

(b) the increase in arithmetic intensity (defined as the FLOPs relative to the memory

130 30 30

20 20

10 10 Performance (GFLOPS) . . . . 0 chebyshev7-point poisson helmholtz-v2helmholtz-v427-point 0 chebyshev7-point poisson helmholtz-v2helmholtz-v427-point

GCC on i7-6700K LLVM on i7-6700K

. Original. . Unrolled. . Accumulation. . Reordered.

60 80

40 60

40

Performance (GFLOPS) 20 20 . . . . chebyshev7-point poisson helmholtz-v2helmholtz-v427-point chebyshev7-point poisson helmholtz-v2helmholtz-v427-point

GCC on Xeon Phi 7250 LLVM on Xeon Phi 7250

Figure 5.7: Performance of smoother kernels on different architectures

accesses) with unrolling is less significant for smoothers, which implies that they are more bandwidth-bound than convolutions.

For Cloverleaf 3D and 2D benchmarks, the performance of LARS-reordered code is almost the same (1.0× – 1.13× for multi-core CPU), or slightly better (1.14× –

1.7× with GCC on Xeon Phi). This can be mainly attributed to the nature of the computations in the Cloverleaf benchmark suite. As observed from Table 5.3, each

131 60 60

40 40

20 20 Performance (GFLOPS) . . . . 0 hypterm diffterm rhs4th3fort derivative0 hypterm diffterm rhs4th3fort derivative

GCC on i7-6700K LLVM on i7-6700K

. Original. . Unrolled. . Accumulation.

150 150

100 100

50

Performance (GFLOPS) 50 . . . . hypterm diffterm rhs4th3fort derivative hypterm diffterm rhs4th3fort derivative

GCC on Xeon Phi 7250 LLVM on Xeon Phi 7250

Figure 5.8: Performance of high-order stencils on different architectures kernel in the Cloverleaf benchmark reads from a multitude of arrays, but the com- putations in each kernel is scarce. Thus, there very little reuse to be exploited via unrolling, and the computation is severely bandwidth-bound. Both these factors re- duce the benefits of unrolling and reordering. However, the performance results serve to demonstrate that the reordering done by LARS does not degrade the performance for such benchmarks.

132 40 40

20 20 Performance (GFLOPS) . . . . 0 cell-advecmom-advecaccelerationideal-gasPdV calc-dtfluxes viscosity 0 cell-advecmom-advecaccelerationideal-gasPdV calc-dtfluxes viscosity

GCC on i7-6700K LLVM on i7-6700K

. Original. . Unrolled. . Accumulation. . Reordered.

150 150 100 100

50 50 Performance (GFLOPS) . . . . 0 cell-advecmom-advecaccelerationideal-gasPdV calc-dtfluxes viscosity 0 cell-advecmom-advecaccelerationideal-gasPdV calc-dtfluxes viscosity

GCC on Xeon Phi 7250 LLVM on Xeon Phi 7250

Figure 5.9: Performance of Cloverleaf 3D kernels on different architectures

High-order stencils are slowly becoming commonplace in scientific simulations, and optimizing them is the current focus of the HPC community. These high-order stencils can be thought of as a forest of CDAGs, with high input data reuse across the CDAGs. This abstraction is particularly challenging for a traditional compiler to optimize, since most of the integrated instruction schedulers are designed to schedule a single CDAG. With LARS, we were able to reduce the total memory accesses per

133 40 40

30 30

20 20

10 10 Performance (GFLOPS) . . . . 0 cell-advecmom-advecaccelerationideal-gasPdV calc-dtfluxes viscosity 0 cell-advecmom-advecaccelerationideal-gasPdV calc-dtfluxes viscosity

GCC on i7-6700K LLVM on i7-6700K

. Original. . Unrolled. . Accumulation. . Reordered.

80

60 60

40 40

20 20 Performance (GFLOPS) . . . . 0 cell-advecmom-advecaccelerationideal-gasPdV calc-dtfluxes viscosity 0 cell-advecmom-advecaccelerationideal-gasPdV calc-dtfluxes viscosity

GCC on Xeon Phi 7250 LLVM on Xeon Phi 7250

Figure 5.10: Performance of Cloverleaf 2D kernels on different architectures

point even with the non-unrolled version, and achieve a 1.22× – 3× speedup over the base version (figure 5.8).

A similar performance trend is observed for CCSD(T) tensor contractions. Since each contraction has a loop nest of depth 7 and loop unrolling in tensor contractions is a research problem in itself [61], we fix two loops as the unrolling candidates.

Since the computation is an accumulation, the unrolling candidates are chosen to

134 80

60 60

40 40

20 20 Performance (GFLOPS) . . . . 0 sd-t-d1-1sd-t-d1-2sd-t-d1-3sd-t-d1-4sd-t-d1-5sd-t-d1-6sd-t-d1-7sd-t-d1-8sd-t-d1-9 0 sd-t-d1-1sd-t-d1-2sd-t-d1-3sd-t-d1-4sd-t-d1-5sd-t-d1-6sd-t-d1-7sd-t-d1-8sd-t-d1-9

GCC on i7-6700K LLVM on i7-6700K . Original. . Unrolled. . Reordered.

200 200

150

100 100

Performance (GFLOPS) 50 . . . . sd-t-d1-1sd-t-d1-2sd-t-d1-3sd-t-d1-4sd-t-d1-5sd-t-d1-6sd-t-d1-7sd-t-d1-8sd-t-d1-9 sd-t-d1-1sd-t-d1-2sd-t-d1-3sd-t-d1-4sd-t-d1-5sd-t-d1-6sd-t-d1-7sd-t-d1-8sd-t-d1-9

GCC on Xeon Phi 7250 LLVM on Xeon Phi 7250

Figure 5.11: Performance of CCSD(T) kernels on different architectures

favor the reuse of the output element. From Figure 5.11, we observe that unrolling greatly improves the performance, and we get a 1.09× – 2.26× speedup with simple reordering of the instructions to reduce the memory accesses.

We also perform an analysis of the generated assembly code for the various ver- sions, and count the number memory accesses via loads into the ymm/zmm registers in the computational loop. The numbers for Xeon Phi 7250 are presented in Table 5.4.

One can observe from the data that the reordered version incurs far lesser memory

135 accesses than the unrolled version, indicating that the reordering with LARS reduces the number of spills and reloads.

5.5 Summary

Register spills and low performance are a problem when compiling scientific codes with a high degree of may-to-many reuse. This chapter presents LARS, a list-based, adaptive, register pressure driven source-level scheduler, that leverages operator as- sociativity to reorder instructions to reduce register pressure. The metrics used by

LARS are powerful enough to exploit patterns in the computation, and general enough to benefit a variety of input applications. We demonstrate the usability of LARS,and the effectiveness of its reordering heuristics over several benchmarks using multiple compilers and architectures.

136 Bench. GCC LLVM Orig. Unrolled LARS Orig. Unrolled LARS A A UF A UF A A UF A UF 2d25pt 55 77 {2,1} 58 {2,1} 67 207 {4,1} 70 {4,1} 2d49pt 141 344 {4,1} 144 {4,1} 153 399 {4,1} 138 {4,1} 2d64pt 201 494 {4,1} 174 {4,1} 213 519 {4,1} 165 {4,1} 2d81pt 269 664 {4,1} 206 {4,1} 281 655 {4,1} 194 {4,1} 2d121pt 429 1064 {4,1} 294 {4,1} 286 2056 {8,1} 763 {8,1} 3d27pt 65 143 {8,1,1} 93 {8,1,1} 71 223 {4,1,1} 88 {4,1,1} 3d125pt 468 1133 {2,2,1} 361 {2,2,1} 277 1019 {4,1,1} 497 {4,1,1} chebyshev 42 131 {2,2,1} 73 {2,2,1} 50 297 {4,2,1} 127 {4,2,1} 7-point 15 67 {8,1,1} 62 {8,1,1} 26 34 {4,1,1} 32 {4,1,1} poisson 27 43 {2,1,1} 35 {2,1,1} 36 166 {8,1,1} 102 {8,1,1} helmholtz-v2 21 62 {4,1,1} 60 {4,1,1} 33 75 {4,1,1} 60 {4,1,1} helmholtz-v4 74 151 {2,1,1} 111 {2,1,1} 76 266 {4,1,1} 233 {4,1,1} 27-point 36 228 {4,2,1} 99 {4,2,1} 48 131 {4,1,1} 67 {4,1,1} hypterm 317 - - 268 {1,1,1} 307 - - 214 {1,1,1} diffterm 345 - - 328 {1,1,1} 440 - - 386 {1,1,1} rhs4th3fort 632 - - 426 {1,1,1} 327 - - 287 {1,1,1} derivative 299 - - 223 {1,1,1} 311 - - 290 {1,1,1} cell-advec 3D 49 92 {2,1,1} 84 {2,1,1} 50 94 {2,1,1} 79 {2,1,1} mom-advec 3D 61 114 {2,1,1} 60 {1,1,1} 84 118 {2,1,1} 55 {1,1,1} accelerate 3D 76 150 {2,1,1} 75 {1,1,1} 75 156 {2,1,1} 51 {1,1,1} ideal-gas 3D 7 12 {2,1,1} 5 {1,1,1} 6 13 {2,1,1} 6 {1,1,1} PdV 3D 47 86 {2,1,1} 46 {1,1,1} 41 86 {2,1,1} 70 {2,1,1} calc-dt 3D 44 82 {2,1,1} 71 {2,1,1} 42 77 {2,1,1} 58 {2,1,1} fluxes 3D 33 120 {2,2,1} 95 {2,2,1} 32 120 {2,2,1} 92 {2,2,1} viscosity 3D 47 84 {2,1,1} 80 {2,1,1} 50 100 {2,1,1} 80 {2,1,1} cell-advec 2D 76 143 {2,1} 132 {2,1} 72 139 {2,1} 127 {2,1} mom-advec 2D 47 164 {4,1} 141 {4,1} 24 163 {4,1} 85 {4,1} accelerate 2D 42 162 {4,1} 99 {4,1} 40 170 {4,1} 97 {4,1} ideal-gas 2D 7 12 {2,1} 7 {1,1} 6 13 {2,1} 6 {1,1} PdV 2D 28 48 {2,1} 42 {2,1} 22 42 {2,1} 38 {2,1} calc-dt 2D 23 42 {2,1} 39 {2,1} 24 43 {2,1} 34 {2,1} fluxes 2D 16 50 {4,1} 44 {4,1} 14 50 {4,1} 43 {4,1} viscosity 2D 23 42 {2,1} 38 {2,1} 28 47 {2,1} 36 {2,1} sd-t-d1-1 13 76 {4,8} 48 {4,8} 13 76 {4,8} 50 {4,8} sd-t-d1-2 13 76 {4,8} 48 {4,8} 13 76 {4,8} 48 {4,8} sd-t-d1-3 15 76 {4,8} 50 {4,8} 16 69 {4,8} 57 {4,8} sd-t-d1-4 13 82 {4,8} 54 {4,8} 13 76 {4,8} 50 {4,8} sd-t-d1-5 13 126 {4,8} 93 {4,8} 13 104 {4,8} 62 {4,8} sd-t-d1-6 15 141 {4,8} 72 {4,8} 16 106 {4,8} 57 {4,8} sd-t-d1-7 13 82 {4,8} 60 {4,8} 13 76 {4,8} 48 {4,8} sd-t-d1-8 13 126 {4,8} 93 {4,8} 13 104 {4,8} 73 {4,8} sd-t-d1-9 12 121 {4,8} 97 {4,8} 18 126 {4,8} 71 {4,8} A: total number of memory accesses, UF: unrolling factor along loop dimensions {k,j,i}

Table 5.4: Assembly code analysis for LARS on Xeon Phi 7250 137 CHAPTER 6

Related Work

Automatic high-performance GPU code generation for stencils has been a topic of active research for both CPUs [18, 72, 84, 106, 7, 71, 42, 125, 99, 8, 24, 32] and

GPUs [113, 86, 36, 43, 37, 108, 64, 116, 9, 127]. The main issues discussed in this dissertation about developing high-performance code for stencil computations have also been addressed in similar ways in these efforts. In this section, we provide abrief discussion of the related work: Section 6.1 discusses the related work in context of

Chapter 3, and Section 6.2 discusses the related work in context of Chapters 4 and 5.

6.1 Optimizations for bandwidth-bound stencils

Tiling Strategies Tiling is a well-known loop transformation that partitions the computation into smaller tiles that are executed atomically. Spatial and temporal tiling have been recognized by many prior efforts [37, 43, 85, 36, 55, 106, 18, 125, 128] as the key to realizing high performance for bandwidth-bound stencils. Work on optimizing stencils from the bandwidth perspective has proposed a range of tiling approaches: overlapped tiling [55, 85, 43, 86], split tiling [37, 42], hexagonal tiling [36], hierarchical overlapped tiling [128], diamond tiling [7, 12], and cache-oblivious tiling [106, 101], all of which enable concurrent execution of tiles, which is essential for coarse-grained parallelization of the computation. Of these, only overlapped tiling [43, 85], split tiling [37], and hexagonal tiling [36] have been used to enable concurrent execution of thread blocks on GPUs. In our experience, memory coalescing can pose a major challenge for split tiling when reading the halo region from the boundary of neighboring tiles. Overlapped tiling, on the other hand, avoids this problem with redundant initial loads and recomputations. Therefore, many manual [72, 63] and automated [43, 85, 116] optimization efforts have used overlapped tiling for concurrent execution on GPUs.

Code Generators for GPUs PPCG [113] is a polyhedral source-to-source compiler that generates classically time-tiled OpenCL and CUDA code from an annotated sequential program. Patus [18] is a code generation and auto-tuning framework for stencil computations that can generate spatially tiled CUDA code, without shared memory usage, from an input DSL stencil specification. Mint [108] is a pragma-based source-to-source translator implemented in the ROSE compiler [47] that generates spatially tiled CUDA code from traditional C code. Unlike these approaches that generate code for a single GPU device, Physis [64] translates user-written structured grid code into CUDA+MPI code for GPU-equipped clusters. Zhang and Mueller [127] develop an auto-tuning strategy for 3D stencils on GPUs. Their framework uses thread coarsening along different dimensions with judicious use of shared memory and registers to improve the performance of spatially tiled 3D stencils.

Overtile [43] and Forma [85] are DSL compilers that generate time-tiled CUDA code from an input stencil DSL specification. PolyMage [71] is a DSL-based code generator for automatic optimization of image processing pipelines. All these approaches use overlapped tiling to fuse the stencil operators in the computation, with the intermediate arrays stored in shared memory. While PolyMage is restricted to 2D domains, the code generated for 3D stencils by both Overtile and Forma suffers from sub-optimal performance due to the volume of redundant computations. Grosser et al. [37] implement the split tiling approach of [42], and hexagonal tiling [36], for temporal tiling on GPUs.

Code Generators for CPUs The Pochoir [106] stencil compiler uses a DSL approach to generate high-performance Cilk code that uses an efficient parallel cache-oblivious algorithm to exploit reuse. The Pochoir system provides a C++ template library that allows the stencil specification to be executed directly in C++ without the Pochoir compiler, which aids debugging. Henretty et al. [42] propose hybrid split-tiling and nested split-tiling to achieve parallelism without incurring redundant computations. Halide [84] is a DSL for image processing pipelines. It decouples the algorithm specification from its execution schedule. The advantage of this separation is that one can write multiple schedules for the same computation without rewriting the entire computation. Halide schedules can express loop nesting, parallelization, loop unrolling, and vector instructions. The performance of the optimized code depends on the efficacy of the schedule. The schedule can either be written manually by a domain expert, or generated by extensive auto-tuning (e.g., with OpenTuner [4]); these approaches either require some degree of expertise with the Halide DSL, or are time consuming. Recently, Mullapudi et al. [70] extended the scheduling strategy of PolyMage [71] to automatically generate schedules for Halide.

Bandishti et al. [7] propose diamond tiling to ensure concurrent start-up as well as perfect load balance whenever possible. Yount et al. describe the YASK framework [125], which simplifies the task of defining stencil functions and generating high-performance code targeted at the Intel Xeon Phi architecture. Chapter 3 presented experimental results comparing the performance of StencilGen with several of these stencil optimizers.

Manual Implementations Micikevicius [67] presents a CUDA implementation of 3D finite difference computation that performs spatial tiling, and uses registers to alleviate shared memory pressure. Nguyen et al. [72] extend this implementation to a 3.5D-blocking algorithm for 3D stencils: streaming along one dimension, and temporal tiling along the other two dimensions. For CPUs, their approach reduces the cache footprint, and for GPUs, it reduces the shared memory required to time-tile a class of cross stencils. Datta et al. [24] analyze the performance of a 7-point 3D stencil (as implemented by Nguyen et al. [72]) on an Nvidia GTX280. Maruyama and Aoki [63] present techniques to optimize a 3D 7-point Jacobi stencil on Fermi M2075 and Kepler K20X GPUs. They achieve best performance on the K20X with a combination of temporal tiling, streaming, use of shared memory and registers for storage, and warp specialization [10]. Barring warp specialization, their strategy to achieve best performance is similar to 3.5D-blocking.

Performance Modeling of Stencils on GPUs Another category of research addresses the development of analytical models for the performance modeling of stencil computations on GPUs. These models are intended for integration into code generators to guide kernel optimizations. Hong and Kim [44] propose an analytical model to predict the performance of GPU kernels. However, their model relies on manual analysis of low-level information about the kernel. Lai et al. [56] analyze a performance upper bound for SGEMM on GPUs, and use register blocking assuming maximum register reuse. Their model only provides performance upper bounds for compute-bound kernels. Su et al. [102] propose a model that estimates the execution time based on the data traffic between different memory hierarchies. Prajapati et al. [82] propose an analytical model that predicts the execution time of code generated with hexagonal tiling. A common challenge for all models developed to date is the precision of prediction, stemming from the complexity of the underlying hardware. Wahib and Maruyama [116] extend the analytical model of Lai et al. [56]. They pose kernel fusion as an optimization problem, and use a code-less performance model to choose a near-optimal fusion configuration amongst other possible variants. The space of feasible solutions is pruned using a search heuristic based on a hybrid grouping genetic algorithm. Gysi et al. [38] also propose a model-driven stencil optimization approach. Like Wahib and Maruyama [116], they use a code-less performance model to find the best fusion configuration among all valid topological sorts of the stencil DAG. The model has been used to guide kernel fusion in the Stella library [39].

6.2 Register-Level Optimizations

Register allocation has been extensively studied: from the seminal work of Chaitin [16] on graph coloring, to more recent SSA-based schemes that exploit the possible decoupling of the allocation phase from the assignment phase [21, 68]. Many extensions and improvements have been proposed [60, 97, 83], but almost all existing work is restricted to register allocation for a fixed schedule. However, it is folk knowledge that improvements to register allocation alone do not bring much performance gain. The interplay of register allocation and instruction scheduling has been studied by a body of prior research [69, 80, 74, 11, 54, 33], and becomes quite important for architectures with ILP, since there is a tradeoff between increasing ILP and exposing locality. Hence, prior work on hyper/super-block list or modulo pre-pass scheduling [20, 120, 69] was extended to account for register pressure. Other works proposed a reverse scheme where register allocation was made sensitive to not changing the minimum initiation interval, so that the scheduler can expose sufficient ILP [107].

Most of the current mainstream open-source compilers [98, 57] have adopted the first approach: when the register pressure is too high, the pre-pass list scheduling heuristics prioritize scheduling instructions that reduce the register pressure. However, such algorithms lack a global view, focusing only on the local register pressure at the current scheduling point. The associated optimization problem is NP-hard, and it is known that the heuristics implemented in production compilers perform quite poorly on long straight-line code [53], such as the loop body of highly unrolled loops.

This observation motivated developers of auto-tuned libraries [31, 34] to consider specific properties of the computational DAG to generate codelets that expose good locality for register reuse at the source level: for certain domain-specific applications like FFT, a scheduling that minimizes the spill volume is well understood [49]. The problem addressed in Chapters 4 and 5 belongs to a similar category, where one has to optimize register reuse for long straight-line code arising from domain-specific kernels. The main difference is that for the computational DAGs considered by our framework, an optimal or nearly optimal scheduling is unknown. Our proposed heuristic is a solution to address the optimization of such DAGs. Prior work on code generation for expression trees [2, 3, 93] was discussed in detail in Section 4.3. We briefly discuss selected works on combined register allocation and instruction scheduling, and register optimizations for high-order stencils.

Integrated Register Allocation and Instruction Scheduling Sethi et al. [93] propose an algorithm to translate an expression tree into machine code using the optimal number of registers. The algorithm does not extend to CDAGs though. Goodman and Hsu [33] present a prepass scheduler that is register-pressure sensitive. Motwani et al. [69] show that integrated register allocation and instruction scheduling is NP-hard. They propose a combined heuristic that provides relative weights for controlling register pressure and instruction parallelism. For instructions where the register of an operand can be used for the result, Govindarajan et al. [35] try to generate an instruction sequence from the data dependence graph that is optimal in register usage. Berson et al. [11] use register reuse DAGs to identify instructions whose parallel scheduling will require more resources than available, and optimize them to reduce their resource demands. Pinter [80] describes an algorithm that colors a parallel interference graph to obtain a register allocation that does not introduce false dependences, and therefore exploits maximal parallelism. Norris et al. [74] propose an algorithm that constructs an interference graph with all feasible schedules, and then removes interferences for schedules that are least likely to be followed. All these approaches are designed for in-order or VLIW processors, and an experimental evaluation by Valluri et al. [110] shows that integrated efforts rarely benefit OoO processors.

While these efforts consider integrated register management and instruction scheduling, the goals are very different and the contexts quite dissimilar. Prior work in this category has focused on maximizing parallelism without significantly increasing MAXLIVE. In our context, the main reason for reordering instructions is to effectively exploit the significant potential for many reuses of values held in registers, while reducing MAXLIVE.

In the context of OoO processors, Barany and Krall [54] propose an optimistic integrated approach that performs prepass scheduling, and then rearranges instructions to mitigate register pressure during register allocation. However, the rescheduling is done from the local perspective of a single instruction, and tends to reduce ILP. Silvera et al. [94] propose an instruction reordering framework for dynamic-issue hardware that takes the output schedule after prepass scheduling, and reorders the instructions within the instruction window to reduce register pressure. With such constraints on reordering, their method cannot perform the aggressive reordering that reduces register pressure while maintaining ILP, as done by LARS. While Barany and Krall [54] note that their rescheduling may not reduce register pressure as much as a register-pressure-driven scheduler, both they and Silvera et al. [94] emphasize the importance of reducing register spills in OoO processors. Domagala et al. [27] present a register allocation strategy that is cognizant of loop unrolling and instruction scheduling. However, it does not consider associative reordering to improve register allocation/instruction scheduling.

Register Pressure and Occupancy Trade-offs on GPUs The performance implications of the trade-off between registers per thread and the achievable occupancy are an under-studied topic. The work by Volkov and Demmel [115] was the first to dispel the myth that higher occupancy (and hence higher thread-level parallelism) translates to high performance; sometimes better performance can be achieved by increasing instruction-level parallelism at the cost of reduced occupancy. Since then, a few research efforts have been directed towards finding the right balance of thread-level parallelism on GPUs [59, 123, 58, 41]. Xie et al. propose CRAT, a framework that coordinates register allocation and thread-level parallelism. It simulates register allocation on the PTX code for a few code versions with varying degrees of thread-level parallelism, executes them on GPGPU-sim [6], and reports the best-performing version. Li et al. [58] propose an auto-tuning strategy that explores the performance at critical points in the occupancy chart of a Kepler GPU, thereby exploring the various trade-offs between register pressure and occupancy. Hayes et al. [41] propose Orion, a framework that auto-tunes for occupancy in a GPU application. It uses a performance model to navigate the occupancy space by either monotonically increasing or decreasing the occupancy from the initial value.

Register Optimizations for High-Order Stencils Stock et al. [99] identify a performance issue with register reuse for iterative stencils by noting that even though the computation becomes less memory-bound with an increase in the stencil order, the register pressure worsens. They use a generalized version of the semi-stencil algorithm by De La Cruz and Araya-Polo [25] to interleave the additive contributions in an unrolled computation. Their approach uses retiming to homogenize stencil accesses and reduce register pressure, whereas our approach presented in Chapters 4 and 5 is based on a very different computational abstraction, and is more broadly applicable. Basu et al. [8] propose a partial-sum optimization implemented within the CHiLL compiler [40]. The partial sums are computed over planes for 3D stencils, and redundant computation is eliminated by performing array common subexpression elimination (CSE) [26]. However, this optimization is only applicable to stencils with constant and symmetrical coefficients. While their work does not claim to reduce register pressure, it may do so as a consequence of array CSE. Based on the concept of rematerialization [14], Jin et al. [50] propose a code generation framework that trades off recomputation for a reduction in register pressure. It uses dynamic programming to iteratively determine the minimum amount of recomputation required to reduce the register consumption by one. This approach is limited to the stencils described in [50], where the recomputation of an expression does not increase the live ranges of the values involved in it. In summary, existing approaches targeting register optimization for high-order stencils do not generalize well. Unlike these approaches, our frameworks described in Chapters 4 and 5 optimize both iterative and more general multi-statement stencils.

CHAPTER 7

Conclusion and Future Work

7.1 Conclusion

GPUs provide massive parallelism, higher computing power, and higher bandwidth compared to their multi-core CPU counterparts. As a result, they have steadily emerged as a favored accelerator for many regular and irregular scientific computations, including stencils. However, accelerating stencil computations on GPUs is challenging, and requires devising optimization strategies based on the nature of the computation and the resource constraints. With the increase in the complexity of the stencils and of the optimizations required to achieve high performance, performing such optimizations manually can be a daunting task for an application developer.

To address the productivity challenge for stencil code optimization on GPUs, many DSL code generators have been proposed in the past, including, but not limited to, Stella [39], Halide [84], PolyMage [71], Forma [85], SDSL [42], and Overtile [43].

From an outsider's perspective, stencil optimization is a practically solved problem, given the existence of popular code generators like Halide and PolyMage. However, the set of optimizations implemented in these code generators does not generalize well to a broader class of 2D/3D stencil computations. For example, PolyMage is only evaluated on 2D imaging pipelines, and Halide requires expertise in writing efficient schedules.

This dissertation was conceived with the objective of automatically accelerating 3D stencil computations on GPUs. During its inception, Overtile [43] and Halide [84] were the state-of-the-art DSL code generators for stencil computations. Overtile overlap-tiled all the dimensions of the input domain, incurring a high volume of recomputation for 3D stencils. Halide required writing the schedules manually; auto-tuning using OpenTuner [4] was not feasible for GPU codes. As a first step towards controlling the volume of recomputation, we created several manual implementations that used split tiling instead of overlapped tiling. However, we soon realized that guaranteeing coalesced reads and writes for the halo region was extremely challenging with split tiling, and that code generation would be complicated by the growth in tile shapes with increasing stencil dimensionality. We then automated a variation of the 2.5D (3.5D) streaming strategy proposed by Micikevicius [67] (Nguyen et al. [72]). Unlike their manual implementations, our auto-generated variation could optimize a wider class of stencil computations by leveraging associative reordering of the stencil operator [88, 87].

For multi-statement stencil computations that can be expressed as a DAG, fusion is the equivalent of time tiling. Such stencil DAGs are commonplace in image processing pipelines, climate modeling, and seismic wave applications. Several efforts proposed fusion heuristics that search a vast space of possible fusion candidates [38, 116]. Instead, we proposed a simple and adaptive greedy fusion heuristic that finds the most profitable fusion candidate pair, and greedily fuses the nodes in the pair if the resulting code does not exceed the GPU hardware resource limits or incur a large volume of recomputation [87]. Since then, several works have explored similar greedy heuristics, demonstrating the applicability of greedy algorithms to fusing stencil operators in a DAG [48, 111].
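To make the flavor of such a greedy heuristic concrete, the following C++-style sketch shows one way the fusion loop could be organized. The types and helpers here (Candidate, the resource estimates, the enumerate/fuse callbacks) are hypothetical placeholders introduced for illustration, not StencilGen's actual data structures or API.

#include <algorithm>
#include <vector>

// Hypothetical per-kernel resource estimate (illustrative only).
struct Resources { int regsPerThread; int shmemBytes; double redundantFlops; };

struct Candidate {
  int producer, consumer;   // DAG nodes considered for fusion
  double profit;            // estimated global-memory traffic saved by fusing
  Resources cost;           // estimated resources of the fused kernel
};

// Greedily fuse the most profitable legal pair until no pair fits the limits.
// enumerateCandidates() and fuse() are assumed to be supplied by the caller.
template <typename EnumerateFn, typename FuseFn>
void greedyFuse(EnumerateFn enumerateCandidates, FuseFn fuse, Resources limit) {
  for (;;) {
    std::vector<Candidate> cands = enumerateCandidates();
    std::sort(cands.begin(), cands.end(),
              [](const Candidate& a, const Candidate& b) { return a.profit > b.profit; });
    bool fused = false;
    for (const Candidate& c : cands) {
      if (c.cost.regsPerThread <= limit.regsPerThread &&
          c.cost.shmemBytes <= limit.shmemBytes &&
          c.cost.redundantFlops <= limit.redundantFlops) {
        fuse(c);            // fuse the most profitable pair that stays within limits
        fused = true;
        break;              // profits change after fusion; re-enumerate candidates
      }
    }
    if (!fused) return;     // no remaining candidate fits the resource budget
  }
}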

While trying to optimize the high-order, compute-bound stencils from the SW4 benchmark [104], we realized that the bandwidth optimizations were ineffective on them: the real performance bottleneck for those stencils was not the memory bandwidth, but the memory access latency combined with high register pressure and low achievable occupancy. Prior efforts on optimizing compute-bound stencils on GPUs involved either reducing the computation via common subexpression elimination (CSE) [8], or introducing recomputation via rematerialization [50]. However, the high-order stencils in SW4 are not symmetrical enough to benefit from CSE, and exhibit too complex a many-to-many reuse pattern to benefit from rematerialization. To optimize such stencils, we developed novel reordering strategies aimed at alleviating register pressure. For compute-bound, register-limited stencil kernels, our optimization can be adapted as a tuning strategy that splits the kernel into compute-bound, but slightly less register-limited, sub-kernels, increases the unrolling factor in each sub-kernel, and then performs associative reordering on each sub-kernel to exploit register-level reuse without spilling.
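As a small, simplified illustration of the register-level effect that associative reordering targets (a constructed 1D example, not code emitted by StencilGen; R and UNROLL are assumed compile-time constants), computing each unrolled output to completion keeps all of its inputs live at once, whereas folding each loaded input immediately into every unrolled output it feeds shortens the inputs' live ranges:

/* Simplified 1D illustration of associative reordering (constructed example). */
#define R 4        /* stencil radius */
#define UNROLL 2   /* outputs computed per iteration */

/* Output-centric order: all 2*R+1 inputs of each out[i+u] are live at once. */
void unrolled_output_centric(const double *in, const double *c, double *out, int i) {
  for (int u = 0; u < UNROLL; ++u) {
    double acc = 0.0;
    for (int r = -R; r <= R; ++r)
      acc += c[r + R] * in[i + u + r];
    out[i + u] = acc;
  }
}

/* Input-centric (reordered) order: each input is loaded once and immediately
   folded into every unrolled output it feeds, shortening its live range. */
void unrolled_input_centric(const double *in, const double *c, double *out, int i) {
  double acc[UNROLL] = {0.0};
  for (int r = -R; r <= R + UNROLL - 1; ++r) {
    double v = in[i + r];                     /* single load of this input */
    for (int u = 0; u < UNROLL; ++u) {
      int k = r - u;                          /* coefficient index for out[i+u] */
      if (k >= -R && k <= R)
        acc[u] += c[k + R] * v;
    }
  }
  for (int u = 0; u < UNROLL; ++u)
    out[i + u] = acc[u];
}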

All the above-mentioned optimization strategies are integrated in StencilGen, which is the central contribution of this dissertation. StencilGen is a DSL code generator that can auto-generate optimized CUDA code for both bandwidth- and compute-bound stencils. StencilGen incorporates many code optimization strategies, namely overlapped tiling, serial and concurrent streaming, efficient data placement in GPU resources, loop unrolling, double buffering, fusion of stencil statements, and associative reordering. For a bandwidth-bound stencil DAG, StencilGen aims to (a) efficiently balance the use of GPU resources like registers and shared memory; (b) minimize the data movement by using spatial/temporal streaming and fusing stencil operators; and (c) account for the redundancy introduced by overlapped tiling in the fusion heuristic. To assist the underlying compiler in efficient register allocation for high-order stencils, StencilGen leverages operator associativity and distributivity for register-pressure-aware reordering. To the best of our knowledge, StencilGen is the first framework to automate such diverse optimization strategies in a single tool. The work in this dissertation advances the state of the art by improving programmer productivity and addressing the performance challenges of a wide variety of stencil computations. The effectiveness of the proposed optimizations is demonstrated through experimental evaluation on a range of stencils extracted from application codes, comparing the performance with several state-of-the-art stencil optimizers.

7.2 Ongoing Research

Implementing optimizations to hide memory latency Many register-limited stencils perform best when the number of registers per thread is set to the hardware maximum (255). However, this configuration results in only 12.5% occupancy. With fewer active threads per SM, memory latency can easily become a performance bottleneck. Double buffering is often used to overlap computation with communication, thereby hiding memory access latency. Although double buffering increases the register usage, it may lead to performance improvements in bandwidth-bound, and possibly some compute-bound, stencils that are not register-limited. Implementing double buffering in StencilGen is a currently ongoing engineering effort.
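To illustrate the idea, a streaming kernel can prefetch the next input plane into registers while the current plane, already staged in shared memory, is being computed on. The following is a minimal hand-written sketch of this pattern (the kernel name, the trivial placeholder computation, and the indexing scheme are assumptions for illustration, not StencilGen output):

// Minimal double-buffering sketch for a streaming stencil kernel.
// Launch with blockDim.x threads and blockDim.x * sizeof(double) shared memory.
__global__ void stream_double_buffered(const double *in, double *out,
                                       int nplanes, int N) {
  extern __shared__ double plane[];          // current plane staged in shared memory
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  bool active = (tid < N);

  double next = active ? in[tid] : 0.0;      // prefetch plane 0 into a register
  for (int k = 0; k < nplanes; ++k) {
    plane[threadIdx.x] = next;               // publish the prefetched value
    __syncthreads();
    if (active && k + 1 < nplanes)
      next = in[(size_t)(k + 1) * N + tid];  // issue the load for plane k+1 early,
                                             // overlapping it with the compute below
    if (active)
      out[(size_t)k * N + tid] = 0.5 * plane[threadIdx.x];  // placeholder computation
    __syncthreads();                         // protect 'plane' before the next store
  }
}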

Auto-tuning with OpenTuner OpenTuner [4] is a framework for building domain-specific, multi-objective program auto-tuners. OpenTuner provides an easy interface to communicate with the tuned program, and a customizable configuration that allows the inclusion of custom search strategies. OpenTuner uses an ensemble of disparate search techniques to tune for performance.

The performance of a stencil code can vary dramatically with variations in the block shape and size, the unrolling factors, the per-thread register configuration, and the tiling strategy. We are currently exploring the use of OpenTuner to navigate this configuration space, and converge on the best code version that StencilGen can generate for an input stencil computation.

7.3 Future Research Directions

Despite the rich and vast literature dedicated to the optimization of stencil computations on GPUs, the growing complexity of stencil computations and the evolution of GPU architectures present opportunities to extend and enhance the contributions of this dissertation. Below, we discuss three avenues for future research.

Extending StencilGen to generate multi-GPU code GPU clusters are often used to accelerate stencil computations beyond the size that can be held on a single GPU, or stencil computations with a mesh hierarchy [28, 118]. In a multi-GPU stencil computation, the computational domain is decomposed and the subdomains are assigned to individual GPU nodes. One has to carefully orchestrate the communication of the halo regions between the different GPU nodes so that it overlaps with the computation. Manually writing multi-GPU code can be tedious and error-prone. Even though recent advances like CUDA-aware MPI9 and NVIDIA GPUDirect10 [119] simplify programmability and code generation, there are still several engineering and research challenges, from deciding what the input specification should be, to the tuning methodology for the subdomain size configuration.

9 https://devblogs.nvidia.com/introduction-cuda-aware-mpi/
10 https://developer.nvidia.com/gpudirect
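For concreteness, a typical halo exchange in such generated code might look like the following hand-written sketch. It assumes a 1D domain decomposition along the outermost dimension, a CUDA-aware MPI build (so device pointers can be passed directly to MPI), and hypothetical kernel-launch wrappers; it is not code produced by StencilGen.

#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical kernel-launch wrappers, assumed to be generated elsewhere.
void launch_interior_stencil(double *d_grid, int plane, int nz, cudaStream_t s);
void launch_boundary_stencil(double *d_grid, int plane, int nz, cudaStream_t s);

// Exchange one halo plane with each neighbor while the interior is computed.
// d_grid holds nz interior planes plus one halo plane on each side;
// 'plane' is the number of points per plane.
void exchange_halos(double *d_grid, int plane, int nz,
                    int up, int down, MPI_Comm comm, cudaStream_t interior_stream) {
  MPI_Request req[4];
  // Device pointers are passed directly: requires a CUDA-aware MPI build.
  MPI_Irecv(d_grid,                            plane, MPI_DOUBLE, down, 0, comm, &req[0]);
  MPI_Irecv(d_grid + (size_t)(nz + 1) * plane, plane, MPI_DOUBLE, up,   1, comm, &req[1]);
  MPI_Isend(d_grid + plane,                    plane, MPI_DOUBLE, down, 1, comm, &req[2]);
  MPI_Isend(d_grid + (size_t)nz * plane,       plane, MPI_DOUBLE, up,   0, comm, &req[3]);

  launch_interior_stencil(d_grid, plane, nz, interior_stream);  // overlaps with halo transfer
  MPI_Waitall(4, req, MPI_STATUSES_IGNORE);                     // halo planes are now valid
  launch_boundary_stencil(d_grid, plane, nz, interior_stream);  // compute the boundary planes
}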

Implementing the proposed optimizations in a source-to-source compiler Most of the existing work on optimizing stencil computations requires the input to be expressed in a domain-specific language (DSL). Some of these DSLs are stand-alone (e.g., Patus [18], Forma [85], PolyMage [71]), and some are embedded (e.g., SDSL [42], Halide [84], Stella [39]). The expressive power and user-friendliness of these DSLs vary, which makes choosing one amongst them difficult. In using these DSL compilers, one has to rewrite the computation in the domain-specific language. This may prove challenging for large-scale applications, which have a mix of stencil computations interspersed with general, non-compute-intensive code. Therefore, even though these DSL compilers are powerful for kernels that are completely expressible in the DSL, their use can be limited in such large-scale applications. On the other hand, polyhedral model-based compilers like PPCG offer a more natural way of expressing the computation, but do not offer aggressive optimizations like the DSL compilers do.

A solution to this problem can be formulated by taking inspiration from existing retargetable compiler generation frameworks. These frameworks parse different input languages into a common Intermediate Representation (IR), so that the compiler optimizations are language and platform agnostic. We can choose a compiler framework with the capability of handling the input languages that are most commonly used in real-world stencil applications (usually C, C++, and Fortran). We can then add the functionality to detect the regions of code that are data-parallel (e.g., SCoP detection in PolyOpt11) and amenable to acceleration on a GPU using our optimization framework. The code corresponding to these IR nodes can be redirected to any transformation framework, and replaced by a simple invocation of the accelerated function in the original code. A separate global optimization pass over the original program can further optimize the data transfers between the accelerated code and the unoptimized code.

Revisiting register allocation from the perspective of the interference graph Register pressure in a program is dominated by the size of the maximum clique in the interference graph. Optimizing other parts of the code may not help if this bottleneck is not addressed first. A possible research direction is to explore a more general low-level transformation implemented within GCC or LLVM, which would iteratively find the maximum clique in the interference graph, and try to reorder instructions so that the size of the maximum clique is reduced. The reduction would be achieved by adjusting live ranges so that all the uses of a chosen variable m are moved before any def of another variable n, effectively eliminating any conflicts between m and n.
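A small constructed example (not taken from the dissertation's benchmarks) illustrates the intent: if every use of m is hoisted above the first def of n, the two live ranges become disjoint, the variables no longer interfere, and they may share a register.

/* Constructed fragment: before reordering, m and n are simultaneously live and interfere. */
static void before(const double *a, const double *b, int i,
                   double c0, double c1, double *s1, double *s2) {
  double m = a[i] * c0;   /* def m */
  double n = b[i] * c1;   /* def n while m is still live: m and n conflict */
  *s1 += m;               /* use m */
  *s2 += n;               /* use n */
}

/* After reordering: every use of m precedes the first def of n, so the two
   live ranges are disjoint and the allocator may assign them one register. */
static void after(const double *a, const double *b, int i,
                  double c0, double c1, double *s1, double *s2) {
  double m = a[i] * c0;   /* def m */
  *s1 += m;               /* last use of m */
  double n = b[i] * c1;   /* def n */
  *s2 += n;               /* use n */
}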

An easier, but more expensive, way is to record all the possible statement reorderings that reduce the size of the maximum clique, and select a subset of reorderings that preserve the convexity of the dependence graph. Instead, we would like to derive a set of rules that, at best, produces an instruction ordering in which the number of live ranges at each program point is nearly uniform. This research direction is complex, and requires further concretization of the ideas.

11 http://web.cse.ohio-state.edu/~pouchet/software/polyopt/

APPENDIX A

StencilGen Grammar

StencilGen accepts as input a description of the stencil computation consistent with the grammar described in Figure A.1. The input to the StencilGen code generator is a .idsl file containing the stencil specification. The parser parses the input and populates the internal AST data structure. A lightweight preprocessing phase verifies the syntactic correctness of the parsed data, and matches the data types of the stencil call arguments to the stencil function parameters. For register optimizations, the stencil statements have to be further lowered into three-address instructions with accumulations. The lowered instructions must conform to the grammar described in Figure A.2.

In the grammar, Params represents program parameters that are used to define the array and stencil domains (M, N in Listing 2.2), and Iterators represents loop iterators (i, j in Listing 2.2). The parameters are invariant with respect to the stencil loop nest, i.e., their values do not change during the execution of all the stencil functions in the .idsl file. The computational loops are not explicitly defined in the DSL, as some of the loop nests will be implicit due to data parallelism in the CUDA code. However, we assume that (a) each computational loop has a unique iterator assigned to it; (b) an iterator is only modified in unit steps by the increment expression of the loop; and (c) the iterators are immutable within the stencil statements.

⟨Program⟩ ::= ⟨Params⟩⟨Iters⟩⟨Decls⟩⟨CopyIn⟩⟨StencilDefns⟩⟨StencilCalls⟩⟨CopyOut⟩
⟨Params⟩ ::= parameter ⟨ParamList⟩;
⟨ParamList⟩ ::= ⟨Parameter⟩,⟨ParamList⟩ | ⟨Parameter⟩=⟨Const⟩,⟨ParamList⟩ | ε
⟨Iters⟩ ::= iterator ⟨IterList⟩;
⟨IterList⟩ ::= ⟨Iterator⟩,⟨IterList⟩ | ⟨Iterator⟩
⟨Decls⟩ ::= ⟨DataType⟩⟨AccessesList⟩;⟨Decls⟩ | ε
⟨AccessesList⟩ ::= ⟨ArrayDecl⟩,⟨AccessesList⟩ | ⟨ID⟩,⟨AccessesList⟩ | ε
⟨ArrayDecl⟩ ::= ⟨ID⟩⟨ArrayDoms⟩
⟨ArrayDoms⟩ ::= [⟨DomExpr⟩]⟨ArrayDoms⟩ | [⟨DomExpr⟩]
⟨CopyIn⟩ ::= copyin ⟨IDList⟩;
⟨IDList⟩ ::= ⟨ID⟩,⟨IDList⟩ | ⟨ID⟩
⟨StencilDefns⟩ ::= ⟨StencilDefn⟩⟨StencilDefns⟩ | ⟨StencilDefn⟩
⟨StencilDefn⟩ ::= function ⟨StencilName⟩ (⟨IDList⟩){ ⟨StencilBody⟩ }
⟨StencilBody⟩ ::= ⟨Decls⟩⟨DataPlacements⟩⟨StmtList⟩;
⟨DataPlacements⟩ ::= shmem ⟨IDList⟩ | gmem ⟨IDList⟩ | ε
⟨StmtList⟩ ::= ⟨Assignment⟩;⟨StmtList⟩ | ⟨Assignment⟩
⟨Assignment⟩ ::= ⟨BaseExpr⟩⟨AssignOp⟩⟨Expr⟩
⟨Expr⟩ ::= Side-effect-free expression of ⟨BaseExpr⟩
⟨BaseExpr⟩ ::= ⟨ID⟩ | ⟨ArrayExpr⟩
⟨ArrayExpr⟩ ::= ⟨ID⟩⟨ArrayDims⟩
⟨ArrayDims⟩ ::= [⟨AccessExpr⟩]⟨ArrayDims⟩ | [⟨AccessExpr⟩]
⟨AccessExpr⟩ ::= Side-effect-free affine function of ⟨AccessElement⟩
⟨AccessElement⟩ ::= ⟨Iterator⟩ | ⟨Const⟩
⟨StencilCalls⟩ ::= ⟨StencilCall⟩⟨StencilCalls⟩ | ⟨StencilCall⟩
⟨StencilCall⟩ ::= ⟨StencilDomain⟩ : ⟨FuncName⟩ (⟨IDList⟩);
⟨StencilDomain⟩ ::= [⟨DomExpr⟩:⟨DomExpr⟩]⟨StencilDomain⟩ | [⟨DomExpr⟩:⟨DomExpr⟩]
⟨DomExpr⟩ ::= ⟨Const⟩ | ⟨Parameter⟩ | ⟨Parameter⟩ + ⟨Const⟩ | ⟨Parameter⟩ - ⟨Const⟩
⟨CopyOut⟩ ::= copyout ⟨IDList⟩;
⟨AssignOp⟩ ::= = | += | -= | *= | ...
⟨DataType⟩ ::= double | float | int | bool | ...

Figure A.1: Grammar for valid input to StencilGen.

⟨CodeBody⟩ ::= ⟨StmtListl⟩;

⟨StmtListl⟩ ::= ⟨Assignmentl⟩;⟨StmtListl⟩ | ⟨Assignmentl⟩

⟨Assignmentl⟩ ::= ⟨BaseExpr⟩⟨AssignOp⟩⟨Exprl⟩

⟨Exprl⟩ ::= ⟨BaseExpr⟩⟨BinaryOpl⟩⟨BaseExpr⟩ | ⟨BaseExpr⟩

⟨BinaryOpl⟩ ::= + | - | * | / | | | & | ...

Figure A.2: Grammar for the lowered IR for instruction reordering
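For illustration, the first statement of the 5-point Jacobi stencil in Listing B.1 might be lowered into a sequence of the following form, where each right-hand side contains at most one binary operation and reuse is expressed through accumulation. This particular lowering is a constructed example (it assumes constants may appear as operands, and r0 is a hypothetical temporary); it is not necessarily the exact sequence StencilGen produces.

t[j][i]  = 0.1 * in[j-1][i];
r0       = in[j][i-1] + in[j][i];
r0       = r0 + in[j][i+1];
t[j][i] += 0.2 * r0;
t[j][i] += 0.3 * in[j+1][i];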

The AccessesList in Decls represents all the arrays and scalars that will be used in the stencil computation. The declarations inside the stencil function, represented by StencilBody, have local scope; the declarations in Program have global scope, and represent the formal parameters to the host function. The scalars are represented by ID. They could be invariants in the computation, but unlike the parameters represented by Params, they are not used to define an array or stencil domain. For Listing 2.2, the AccessesList will have arrays out[M][N] and in[M][N], and scalar a.

The arrays and scalars belonging to CopyIn and CopyOut will be copied in from host to device, and copied out from device to host, respectively; these arrays and scalars must be present in the AccessesList. No data layout transformations are performed at present while copying data from host to device or vice versa.

An .idsl file can have multiple stencil functions, for example, a central stencil function to compute the inner core of the output domain, and several boundary functions to compute the output domain at the boundary. These stencil function definitions (StencilDefns) are followed by the stencil calls (StencilCalls). The arrays and scalars in the stencil call arguments must appear in the AccessesList of Decls. Irrespective of whether the formal parameters in the stencil function appear in AccessesList or not, their data types and domain shapes are inferred by matching the data types and domain shapes of the corresponding arguments in the stencil call. For example, in Listing 2.2, the data types of A and B in line 7 and line 13 will be matched to those of arrays in and out, respectively. The preprocessing step verifies that the formal parameters of a stencil function are assigned the same data type, and mapped exclusively to either an array or a scalar, across all the stencil calls to that function. Currently, we do not support type casting.

The input domain on which a stencil has to be applied is defined by the StencilDomain in the stencil call; the output domain is automatically computed as the convex hull over which the stencil application will produce a valid result. The iterators are mapped onto these stencil domains in sequence. For example, in the five_point_average stencil called at line 21 of Listing 2.2, iterator j will loop from 1 to M-2, and iterator i will loop from 1 to N-2.

The StencilBody comprises a sequence of stencil statements (StmtList). Each statement has an LHS BaseExpr and an RHS Expr; the operator can be an assignment operator or an update operator (e.g., +=, -=, *=). BaseExpr refers to an array expression or a variable. The RHS is a side-effect-free expression comprising BaseExprs. This allows the RHS to have simple, side-effect-free mathematical functions like sine, sqrt, min, etc., with BaseExprs as arguments.

For the arrays in BaseExpr, the index expressions, represented by AccessExpr, are themselves side-effect-free, affine functions of the iterators, parameters, and constants. Thus, we allow A[i+2], but not A[i*j+2].

In summary, the following restrictions are imposed on the input:

• The stencil statements (lines 8–10 and line 14 in Listing 2.2) must not modify the iterators (i and j) or the parameters (M and N).

• The statements must be side-effect-free, and can only use mathematical functions like sine, sqrt, etc.

• The memory accesses in the stencil statements are either scalars or array elements. The array index expressions are affine functions of the loop iterators, parameters, and constants.

APPENDIX B

Code Generation with StencilGen

Command Line Options

• --out-file: accepts string, the output file name. Defaults to out.cu

• --blockdim: accepts input in the form id=val, where id represents the block dimension, and val represents the block size along that dimension

• --unroll: accepts input in the form id=val, where id represents the iterator, and val represents the unroll factor along that iterator

• --serial-stream: accepts string, the dimension along which to stream in serial. Defaults to the outermost iterator

• --concurrent-stream: accepts string, the dimension along which to tile and stream in parallel. Defaults to the outermost iterator

• --shared-mem: accepts bool, whether to use shared memory or not. Defaults to true

• --cregs: accepts int, the number of registers that will be used in compilation

Listing B.1: j2d5pt.idsl: two time steps of a 2D 5-point Jacobi stencil

parameter M=512, N=512;
iterator j, i;
double in[M,N], out[M,N];
copyin in, out;

stencil j2d5pt (out, in) {
    double t[M,N];
    t[j][i] = 0.1*in[j-1][i] +
              0.2*(in[j][i-1] + in[j][i] + in[j][i+1]) +
              0.3*in[j+1][i];

    out[j][i] = 0.1*t[j-1][i] +
                0.2*(t[j][i-1] + t[j][i] + t[j][i+1]) +
                0.3*t[j+1][i];
}

j2d5pt (out, in);
copyout out;

Invocation Example Given an input DSL file sw4.idsl, one can generate CUDA code that uses shared memory and serial-streams along the k dimension with:

$ ./stencilgen sw4.idsl --out-file sw4.cu --serial-stream k --cregs 255 --shared-mem true --blockdim x=16,y=8,z=1 --unroll k=2,j=1,i=1

and a version that only uses global memory, and no streaming, with:

$ ./stencilgen sw4.idsl --out-file sw4.cu --shared-mem false --cregs 255 --blockdim x=8,y=8,z=4 --unroll k=2,j=1,i=1

We demonstrate the different types of code generated by StencilGen on the example of Listing B.1, which computes two time steps of a 2D 5-point Jacobi stencil. The device functions for the generated code are shown in Listings B.2–B.5.

Listing B.2: StencilGen-generated device code with shared memory, without streaming

#define MM 512
#define NN 512
__global__ void j2d5pt (double * __restrict__ d_out, double * __restrict__ d_in, int M, int N) {
    int i0 = (int)(blockIdx.x)*((int)blockDim.x-4);
    int i = i0 + (int)(threadIdx.x);
    int j0 = (int)(blockIdx.y)*((int)blockDim.y-4);
    int j = j0 + (int)(threadIdx.y);

    double (*out)[NN] = (double (*)[NN]) d_out;
    double (*in)[NN] = (double (*)[NN]) d_in;

    double __shared__ in_shm[8][16];
    double __shared__ t_shm[8][16];

    if (j<=min(j0+blockDim.y-1, M-1) && i<=min(i0+blockDim.x-1, N-1)) {
        in_shm[j-j0][i-i0] = in[j][i];
    }

    __syncthreads ();
    if (j>=max(1, j0+1) && j<=min(j0+blockDim.y-2, M-2) && i>=max(1, i0+1) && i<=min(i0+blockDim.x-2, N-2)) {
        t_shm[j-j0][i-i0] = 0.1 * in_shm[j-j0-1][i-i0] + 0.2 * (in_shm[j-j0][i-i0-1] + in_shm[j-j0][i-i0] + in_shm[j-j0][i-i0+1]) + 0.3 * in_shm[j-j0+1][i-i0];
    }
    __syncthreads ();
    if (j>=max(2, j0+2) && j<=min(j0+blockDim.y-3, M-3) && i>=max(2, i0+2) && i<=min(i0+blockDim.x-3, N-3)) {
        out[j][i] = 0.1 * t_shm[j-j0-1][i-i0] + 0.2 * (t_shm[j-j0][i-i0-1] + t_shm[j-j0][i-i0] + t_shm[j-j0][i-i0+1]) + 0.3 * t_shm[j-j0+1][i-i0];
    }
}

Listing B.3: StencilGen-generated device code with shared memory, registers, and serial streaming

__global__ void j2d5pt (double * __restrict__ d_out, double * __restrict__ d_in, int M, int N) {
    int i0 = (int)(blockIdx.x)*((int)blockDim.x-4);
    int i = i0 + (int)(threadIdx.x);

    double (*out)[NN] = (double (*)[NN]) d_out;
    double (*in)[NN] = (double (*)[NN]) d_in;

    double in_reg_m1, __shared__ in_shm_c0[16], in_reg_p1;
    double t_reg_m2, __shared__ t_shm_m1[16], t_reg_c0;

    if (i<=min(i0+blockDim.x-1, N-1)) {
        in_shm_c0[i-i0] = in[1][i];
        in_reg_m1 = in[0][i];
        in_reg_p1 = in[2][i];
    }

    for (int j=1; j<=M-2; j++) {
        __syncthreads ();
        if (i>=max(1, i0+1) && i<=min(i0+blockDim.x-2, N-2)) {
            t_reg_c0 = 0.1 * in_reg_m1 + 0.2 * (in_shm_c0[i-i0-1] + in_shm_c0[i-i0] + in_shm_c0[i-i0+1]) + 0.3 * in_reg_p1;
        }
        __syncthreads ();
        if (i>=max(2, i0+2) && i<=min(i0+blockDim.x-3, N-3)) {
            out[j-1][i] = 0.1 * t_reg_m2 + 0.2 * (t_shm_m1[i-i0-1] + t_shm_m1[i-i0] + t_shm_m1[i-i0+1]) + 0.3 * t_reg_c0;
        }
        __syncthreads ();

        if (i<=min(i0+blockDim.x-1, N-1)) {
            in_reg_m1 = in_shm_c0[i-i0];
            in_shm_c0[i-i0] = in_reg_p1;
            t_reg_m2 = t_shm_m1[i-i0];
            t_shm_m1[i-i0] = t_reg_c0;
            in_reg_p1 = in[min(j+2, M-1)][i];
        }
    }
}

Listing B.4: StencilGen-generated device code with shared memory, registers, and concurrent streaming

__global__ void j2d5pt (double * __restrict__ d_out, double * __restrict__ d_in, int M, int N) {
    int i0 = (int)(blockIdx.x)*((int)blockDim.x-4);
    int i = i0 + (int)(threadIdx.x);
    int j0 = (int)(blockIdx.y)*(8);

    double (*out)[NN] = (double (*)[NN]) d_out;
    double (*in)[NN] = (double (*)[NN]) d_in;

    double in_reg_m1, __shared__ in_shm_c0[16], in_reg_p1;
    double t_reg_m2, __shared__ t_shm_m1[16], t_reg_c0;

    if (i<=min(i0+blockDim.x-1, N-1)) {
        in_shm_c0[i-i0] = in[j0+1][i];
        in_reg_m1 = in[j0][i];
        in_reg_p1 = in[j0+2][i];
    }

    for (int j=j0+1; j<=min (M-2, j0+10); j++) {
        __syncthreads ();
        if (i>=max(1, i0+1) && i<=min(i0+blockDim.x-2, N-2)) {
            t_reg_c0 = 0.1 * in_reg_m1 + 0.2 * (in_shm_c0[i-i0-1] + in_shm_c0[i-i0] + in_shm_c0[i-i0+1]) + 0.3 * in_reg_p1;
        }
        __syncthreads ();
        if (i>=max(2, i0+2) && i<=min(i0+blockDim.x-3, N-3)) {
            out[max (j0+2, j-1)][i] = 0.1 * t_reg_m2 + 0.2 * (t_shm_m1[i-i0-1] + t_shm_m1[i-i0] + t_shm_m1[i-i0+1]) + 0.3 * t_reg_c0;
        }
        __syncthreads ();

        if (i<=min(i0+blockDim.x-1, N-1)) {
            in_reg_m1 = in_shm_c0[i-i0];
            in_shm_c0[i-i0] = in_reg_p1;
            t_reg_m2 = t_shm_m1[i-i0];
            t_shm_m1[i-i0] = t_reg_c0;
            in_reg_p1 = in[min(j+2, M-1)][i];
        }
    }
}

Listing B.5: StencilGen-generated device code with shared memory, registers, unrolling, and serial streaming

__global__ void j2d5pt (double * __restrict__ d_out, double * __restrict__ d_in, int M, int N) {
    int i0 = (int)(blockIdx.x)*(2*(int)blockDim.x-4);
    int i = i0 + (int)(threadIdx.x);
    ...
    double in_reg_m1[2], __shared__ in_shm_c0[32], in_reg_p1[2];
    double t_reg_m2[2], __shared__ t_shm_m1[32], t_reg_c0[2];
    #pragma unroll 2
    for (int l1_i2=0,i2=i; l1_i2<2; l1_i2++, i2 += blockDim.x)
        if (i2<=min(i0+2*blockDim.x-1, N-1)) {
            in_shm_c0[i2-i0] = in[1][i2];
            in_reg_m1[l1_i2] = in[0][i2];
            in_reg_p1[l1_i2] = in[2][i2];
        }
    for (int j=1; j<=M-2; j++) {
        __syncthreads ();
        #pragma unroll 2
        for (int r0_i2=0,i2=i; r0_i2<2; r0_i2++, i2+=blockDim.x)
            if (i2>=max(1, i0+1) && i2<=min(i0+2*blockDim.x-2, N-2)) {
                t_reg_c0[r0_i2] = 0.1 * in_reg_m1[r0_i2] + 0.2 * (in_shm_c0[i2-i0-1] + in_shm_c0[i2-i0] + in_shm_c0[i2-i0+1]) + 0.3 * in_reg_p1[r0_i2];
            }
        __syncthreads ();
        #pragma unroll 2
        for (int r1_i2=0,i2=i; r1_i2<2; r1_i2++, i2+=blockDim.x)
            if (i2>=max(2, i0+2) && i2<=min(i0+2*blockDim.x-3, N-3)) {
                out[j-1][i2] = 0.1 * t_reg_m2[r1_i2] + 0.2 * (t_shm_m1[i2-i0-1] + t_shm_m1[i2-i0] + t_shm_m1[i2-i0+1]) + 0.3 * t_reg_c0[r1_i2];
            }
        __syncthreads ();
        #pragma unroll 2
        for (int l2_i2=0,i2=i; l2_i2<2; l2_i2++, i2 += blockDim.x)
            if (i2<=min(i0+2*blockDim.x-1, N-1)) {
                in_reg_m1[l2_i2] = in_shm_c0[i2-i0];
                in_shm_c0[i2-i0] = in_reg_p1[l2_i2];
                t_reg_m2[l2_i2] = t_shm_m1[i2-i0];
                t_shm_m1[i2-i0] = t_reg_c0[l2_i2];
                in_reg_p1[l2_i2] = in[min(j+2, M-1)][i2];
            }
    }
}

BIBLIOGRAPHY

[1] Aho, A., Lam, M., Sethi, R., and Ullman, J. Compilers: Principles, Techniques, and Tools (2nd ed). Pearson, 2007.

[2] Aho, A. V., and Johnson, S. C. Optimal code generation for expression trees. In Proceedings of Seventh Annual ACM Symposium on Theory of Com- puting (New York, NY, USA, 1975), STOC ’75, ACM, pp. 207–217.

[3] Aho, A. V., Johnson, S. C., and Ullman, J. D. Code generation for expressions with common subexpressions. J. ACM 24, 1 (Jan. 1977), 146–160.

[4] Ansel, J., Kamil, S., Veeramachaneni, K., Ragan-Kelley, J., Bos- boom, J., O’Reilly, U.-M., and Amarasinghe, S. OpenTuner: An ex- tensible framework for program autotuning. In Proceedings of the 23rd Interna- tional Conference on Parallel Architectures and Compilation (New York, NY, USA, 2014), PACT ’14, ACM, pp. 303–316.

[5] Appel, A. W., and Supowit, K. J. Generalization of the sethi-ullman algorithm for register allocation. Softw. Pract. Exper. 17, 6 (June 1987), 417– 421.

[6] Bakhoda, A., Yuan, G. L., Fung, W. W. L., Wong, H., and Aamodt, T. M. Analyzing workloads using a detailed gpu simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software (April 2009), pp. 163–174.

[7] Bandishti, V., Pananilath, I., and Bondhugula, U. Tiling stencil com- putations to maximize parallelism. In High Performance Computing, Network- ing, Storage and Analysis (SC), 2012 International Conference for (Nov 2012), pp. 1–11.

[8] Basu, P., Hall, M., Williams, S., Straalen, B. V., Oliker, L., and Colella, P. Compiler-directed transformation for higher-order stencils. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE Interna- tional (May 2015), pp. 313–323.

[9] Basu, P., Williams, S., Straalen, B. V., Oliker, L., Colella, P., and Hall, M. Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers. Parallel Computing 64 (2017), 50–64. High-End Computing for Next-Generation Scientific Discovery.

[10] Bauer, M., Cook, H., and Khailany, B. Cudadma: Optimizing gpu memory bandwidth via warp specialization. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (New York, NY, USA, 2011), SC ’11, ACM, pp. 12:1–12:11.

[11] Berson, D. A., Gupta, R., and Soffa, M. L. Integrated instruction scheduling and register allocation techniques. In Proceedings of the 11th Inter- national Workshop on Languages and Compilers for Parallel Computing (Lon- don, UK, UK, 1999), LCPC ’98, Springer-Verlag, pp. 247–262.

[12] Bertolacci, I. J., Olschanowsky, C., Harshbarger, B., Chamber- lain, B. L., Wonnacott, D. G., and Strout, M. M. Parameterized diamond tiling for stencil computations with chapel parallel iterators. In Pro- ceedings of the 29th ACM on International Conference on Supercomputing (New York, NY, USA, 2015), ICS ’15, ACM, pp. 197–206.

[13] Bondhugula, U., Hartono, A., Ramanujam, J., and Sadayappan, P. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (New York, NY, USA, 2008), PLDI ’08, ACM, pp. 101–113.

[14] Briggs, P., Cooper, K. D., and Torczon, L. Rematerialization. In Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation (New York, NY, USA, 1992), PLDI ’92, ACM, pp. 311–321.

[15] Briggs, P., Cooper, K. D., and Torczon, L. Improvements to graph coloring register allocation. ACM Trans. Program. Lang. Syst. 16, 3 (May 1994), 428–455.

[16] Chaitin, G. J. Register allocation & spilling via graph coloring. In Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction (New York, NY, USA, 1982), SIGPLAN ’82, ACM, pp. 98–105.

[17] Chang, C.-M., Chen, C.-M., and King, C.-T. Using integer linear pro- gramming for instruction scheduling and register allocation in multi-issue pro- cessors. Computers and Mathematics with Applications (1997).

[18] Christen, M., Schenk, O., and Burkhart, H. PATUS: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium (2011), IPDPS '11, IEEE Computer Society, pp. 676–687.

[19] Christen, M., Schenk, O., Neufeld, E., Messmer, P., and Burkhart, H. Parallel data-locality aware stencil computations on modern micro- architectures. In 2009 IEEE International Symposium on Parallel Distributed Processing (May 2009), pp. 1–10.

[20] Codina, J. M., Sanchez, J., and Gonzalez, A. A unified modulo schedul- ing and register allocation technique for clustered processors. In Proceedings 2001 International Conference on Parallel Architectures and Compilation Tech- niques (2001), pp. 175–184.

[21] Colombet, Q., Boissinot, B., Brisk, P., Hack, S., and Rastello, F. Graph-coloring and treescan register allocation using repairing. In 2011 Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES) (Oct 2011), pp. 45–54.

[22] Cong, J., Huang, M., and Zou, Y. Accelerating fluid registration algorithm on multi-fpga platforms. In Proc. of Intl. Conference on Field Programmable Logic and Applications (2011), FPL. to appear.

[23] Cong, J., and Zou, Y. Lithographic aerial image simulation with fpga-based hardware acceleration. In ACM/SIGDA symp. on Field programmable gate arrays (2008), FPGA, pp. 67–76.

[24] Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J., and Yelick, K. Stencil computa- tion optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (2008), SC ’08, IEEE Press, pp. 4:1–4:12.

[25] De La Cruz, R., Araya-Polo, M., and Cela, J. M. Introducing the semi-stencil algorithm. In Proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics: Part I (Berlin, Heidelberg, 2010), PPAM’09, Springer-Verlag, pp. 496–506.

[26] Deitz, S. J., Chamberlain, B. L., and Snyder, L. Eliminating re- dundancies in sum-of-product array computations. In Proceedings of the 15th International Conference on Supercomputing (New York, NY, USA, 2001), ICS ’01, ACM, pp. 65–77.

[27] Domagala, L., van Amstel, D., Rastello, F., and Sadayappan, P. Register allocation and promotion through combined instruction scheduling and loop unrolling. In Proceedings of the 25th International Conference on Compiler Construction (New York, NY, USA, 2016), CC 2016, ACM, pp. 143–151.

[28] Dubey, A. Stencils in scientific computations. In Proceedings of the Second Workshop on Optimizing Stencil Computations (New York, NY, USA, 2014), WOSC '14, ACM, pp. 57–57.

[29] ExaCT: Center for Exascale Simulation of Combustion in Turbulence: Proxy App Software. https://exactcodesign.org/proxy-app-software/, 2013.

[30] Advanced Stencil-Code Engineering (ExaStencils). http://www.exastencils.org/.

[31] Frigo, M., and Johnson, S. G. The design and implementation of FFTW3. Proceedings of the IEEE 93, 2 (Feb 2005), 216–231.

[32] Gan, L., Fu, H., Xue, W., Xu, Y., Yang, C., Wang, X., Lv, Z., You, Y., Yang, G., and Ou, K. Scaling and analyzing the stencil performance on multi-core and many-core architectures. In 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS) (Dec 2014).

[33] Goodman, J. R., and Hsu, W.-C. Code scheduling and register allocation in large basic blocks. In Proceedings of the 2nd International Conference on Supercomputing (New York, NY, USA, 1988), ICS '88, ACM.

[34] Goto, K., and Geijn, R. A. v. d. Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw. 34, 3 (May 2008), 12:1–12:25.

[35] Govindarajan, R., Yang, H., Zhang, C., Amaral, J. N., and Gao, G. R. Minimum register instruction sequence problem: Revisiting optimal code generation for DAGs. In Proceedings of the 15th International Parallel & Distributed Processing Symposium (Washington, DC, USA, 2001), IPDPS '01, IEEE Computer Society, pp. 26–33.

[36] Grosser, T., Cohen, A., Holewinski, J., Sadayappan, P., and Verdoolaege, S. Hybrid hexagonal/classical tiling for GPUs. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (2014), CGO '14, ACM, pp. 66:66–66:75.

[37] Grosser, T., Cohen, A., Kelly, P. H. J., Ramanujam, J., Sadayappan, P., and Verdoolaege, S. Split tiling for GPUs: Automatic parallelization using trapezoidal tiles. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (2013), GPGPU-6, ACM, pp. 24–31.

[38] Gysi, T., Grosser, T., and Hoefler, T. MODESTO: Data-centric analytic optimization of complex stencil programs on heterogeneous architectures. In Proceedings of the 29th ACM on International Conference on Supercomputing (New York, NY, USA, 2015), ICS '15, ACM, pp. 177–186.

[39] Gysi, T., Osuna, C., Fuhrer, O., Bianco, M., and Schulthess, T. C. STELLA: a domain-specific tool for structured grid methods in weather and climate models. In Proceedings of the International Conference for High Per- formance Computing, Networking, Storage and Analysis (2015), SC ’15, ACM, pp. 41:1–41:12.

[40] Hall, M., Chame, J., Chen, C., Shin, J., Rudy, G., and Khan, M. M. Loop transformation recipes for code generation and auto-tuning. In Proceedings of the 22Nd International Conference on Languages and Compilers for Parallel Computing (Berlin, Heidelberg, 2010), LCPC’09, Springer-Verlag, pp. 50–64.

[41] Hayes, A. B., Li, L., Chavarría-Miranda, D., Song, S. L., and Zhang, E. Z. Orion: A framework for GPU occupancy tuning. In Proceedings of the 17th International Middleware Conference (New York, NY, USA, 2016), Middleware ’16, ACM, pp. 18:1–18:13.

[42] Henretty, T., Veras, R., Franchetti, F., Pouchet, L.-N., Ramanu- jam, J., and Sadayappan, P. A stencil compiler for short-vector SIMD architectures. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (New York, NY, USA, 2013), ICS ’13, ACM, pp. 13–24.

[43] Holewinski, J., Pouchet, L.-N., and Sadayappan, P. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM International Conference on Supercomputing (2012), ICS ’12, ACM, pp. 311–320.

[44] Hong, S., and Kim, H. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture (New York, NY, USA, 2009), ISCA ’09, ACM, pp. 152–163.

[45] High-Performance Geometric Multigrid. https://hpgmg.org/, 2016.

[46] Intel C++ compiler. https://software.intel.com/en-us, 2017.

[47] Quinlan, D. J. ROSE: Compiler support for object-oriented frameworks. 215–226.

[48] Jangda, A., and Bondhugula, U. An effective fusion and tile size model for optimizing image processing pipelines. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (New York, NY, USA, 2018), PPoPP '18, ACM, pp. 261–275.

[49] Jia-Wei, H., and Kung, H. T. I/O complexity: The red-blue pebble game. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Com- puting (New York, NY, USA, 1981), STOC ’81, ACM, pp. 326–333.

[50] Jin, M., Fu, H., Lv, Z., and Yang, G. Libra: An automated code generation and tuning framework for register-limited stencils on GPUs. In Proceedings of the ACM International Conference on Computing Frontiers (New York, NY, USA, 2016), CF ’16, ACM, pp. 92–99.

[51] Keßler, C. W., and Rauber, T. Generating optimal contiguous evaluations for expression dags. Computer Languages 21, 2 (1995), 113–127.

[52] Koes, D. R., and Goldstein, S. C. A global progressive register alloca- tor. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation (New York, NY, USA, 2006), PLDI ’06, ACM, pp. 204–215.

[53] Kral, S., Franchetti, F., Lorenz, J., Ueberhuber, C. W., and Wurzinger, P. FFT compiler techniques. In Compiler Construction: 13th International Conference, CC 2004 (2004), Springer Berlin Heidelberg, pp. 217– 231.

[54] Krall, A., and Barany, G. Optimistic integrated instruction scheduling and register allocation. CPC ’10, John Wiley and Sons, Ltd.

[55] Krishnamoorthy, S., Baskaran, M., Bondhugula, U., Ramanujam, J., Rountev, A., and Sadayappan, P. Effective automatic parallelization of stencil computations. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation (2007), PLDI ’07, ACM, pp. 235–244.

[56] Lai, J., and Seznec, A. Performance upper bound analysis and optimization of SGEMM on fermi and kepler GPUs. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (Wash- ington, DC, USA, 2013), CGO ’13, IEEE Computer Society, pp. 1–10.

[57] Lattner, C., and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization (Washington, DC, USA, 2004), CGO '04, IEEE Computer Society, pp. 75–.

[58] Li, A., Song, S. L., Kumar, A., Zhang, E. Z., Chavarría-Miranda, D., and Corporaal, H. Critical points based register-concurrency autotuning for GPUs. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE) (March 2016), pp. 1273–1278.

[59] Li, A., van den Braak, G.-J., Kumar, A., and Corporaal, H. Adaptive and transparent cache bypassing for gpus. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (New York, NY, USA, 2015), SC ’15, ACM, pp. 17:1–17:12.

[60] Lueh, G.-Y., Gross, T., and Adl-Tabatabai, A.-R. Fusion-based register allocation. ACM Trans. Program. Lang. Syst. 22, 3 (May 2000), 431–470.

[61] Ma, W., Krishnamoorthy, S., Villa, O., Kowalski, K., and Agrawal, G. Optimizing tensor contraction expressions for hybrid CPU-GPU execution. Cluster Computing 16 (2013).

[62] Mallinson, A. C., Beckingsale, D. A., Gaudin, W. P., Herdman, J. A., Levesque, J. M., and Jarvis, S. A. CloverLeaf: Preparing hydrodynamics codes for exascale.

[63] Maruyama, N., and Aoki, T. Optimizing stencil computations for NVIDIA Kepler GPUs.

[64] Maruyama, N., Nomura, T., Sato, K., and Matsuoka, S. Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (New York, NY, USA, 2011), SC ’11, ACM, pp. 11:1–11:12.

[65] McMechan, G. A. Migration by extrapolation of time-dependent boundary values. Geophysical Prospecting 31, 3 (1983), 413–420.

[66] Meng, J., and Skadron, K. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs. In Proceedings of the 23rd International Conference on Supercomputing (New York, NY, USA, 2009), ICS ’09, ACM, pp. 256–265.

[67] Micikevicius, P. 3D finite difference computation on GPUs using CUDA. In Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units (2009), GPGPU-2, ACM, pp. 79–84.

[68] Mössenböck, H., and Pfeiffer, M. Linear Scan Register Allocation in the Context of SSA Form and Register Constraints. Springer Berlin Heidelberg, 2002, pp. 229–246.

[69] Motwani, R., Palem, K. V., Sarkar, V., and Reyen, S. Combining register allocation and instruction scheduling. Tech. rep., Stanford, CA, USA, 1995.

[70] Mullapudi, R. T., Adams, A., Sharlet, D., Ragan-Kelley, J., and Fatahalian, K. Automatically scheduling Halide image processing pipelines. ACM Trans. Graph. 35, 4 (July 2016), 83:1–83:11.

[71] Mullapudi, R. T., Vasista, V., and Bondhugula, U. PolyMage: Automatic optimization for image processing pipelines. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (2015), ASPLOS ’15, ACM, pp. 429–443.

[72] Nguyen, A., Satish, N., Chhugani, J., Kim, C., and Dubey, P. 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (2010), SC ’10, IEEE Computer Society, pp. 1–13.

[73] Nickolls, J., Buck, I., Garland, M., and Skadron, K. Scalable parallel programming with CUDA. Queue 6, 2 (Mar. 2008), 40–53.

[74] Norris, C., and Pollock, L. L. A scheduler-sensitive global register allocator. In Supercomputing ’93. Proceedings (Nov 1993), pp. 804–813.

[75] NVCC. https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc, 2017.

[76] Nvidia profiler. http://docs.nvidia.com/cuda/profiler-users-guide, 2017.

[77] NWChem Download, 2017.

[78] OpenACC compiler. https://www.pgroup.com/doc/openacc_gs.pdf, 2017.

[79] Pereira, A. D., Castro, M., Dantas, M. A. R., Rocha, R. C. O., and Góes, L. F. W. Extending OpenACC for efficient stencil code generation and execution by skeleton frameworks. In 2017 International Conference on High Performance Computing Simulation (HPCS) (July 2017), pp. 719–726.

[80] Pinter, S. S. Register allocation with instruction scheduling. In Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation (New York, NY, USA, 1993), PLDI ’93, ACM, pp. 248–257.

[81] Poletto, M., and Sarkar, V. Linear scan register allocation. ACM Trans. Program. Lang. Syst. 21, 5 (Sept. 1999), 895–913.

[82] Prajapati, N., Ranasinghe, W., Rajopadhye, S., Andonov, R., Djidjev, H., and Grosser, T. Simple, accurate, analytical time modeling and optimal tile size selection for GPGPU stencils. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (New York, NY, USA, 2017), PPoPP ’17, ACM, pp. 163–177.

[83] Quintão Pereira, F. M., and Palsberg, J. Register allocation by puzzle solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (New York, NY, USA, 2008), PLDI ’08, ACM, pp. 216–226.

[84] Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., and Amarasinghe, S. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (2013), PLDI ’13, ACM, pp. 519–530.

[85] Ravishankar, M., Holewinski, J., and Grover, V. Forma: A DSL for image processing applications to target GPUs and multi-core CPUs. In Proceedings of the 8th Workshop on General Purpose Processing Using GPUs (2015), GPGPU-8, ACM, pp. 109–120.

[86] Ravishankar, M., Micikevicius, P., and Grover, V. Fusing convolution kernels through tiling. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (2015), ARRAY 2015, ACM, pp. 43–48.

[87] Rawat, P. S., Hong, C., Ravishankar, M., Grover, V., Pouchet, L.-N., Rountev, A., and Sadayappan, P. Resource conscious reuse-driven tiling for GPUs. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (2016), PACT ’16, ACM, pp. 99–111.

[88] Rawat, P. S., Hong, C., Ravishankar, M., Grover, V., Pouchet, L.-N., and Sadayappan, P. Effective resource management for enhancing performance of 2D and 3D stencils on GPUs. In Proceedings of the 9th Annual Workshop on General Purpose Processing Using Graphics Processing Unit (2016), GPGPU ’16, ACM, pp. 92–102.

[89] Rawat, P. S., Sukumaran-Rajam, A., Rountev, A., Rastello, F., Pouchet, L.-N., and Sadayappan, P. Register optimizations for stencils on GPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2018), PPoPP ’18, ACM.

[90] Rong, H. Tree register allocation. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (New York, NY, USA, 2009), MICRO 42, ACM, pp. 67–77.

[91] Sarkar, V., and Barik, R. Extended linear scan: An alternate foundation for global register allocation. In Proceedings of the 16th International Conference on Compiler Construction (Berlin, Heidelberg, 2007), CC’07, Springer-Verlag, pp. 141–155.

[92] Sarkar, V., Serrano, M. J., and Simons, B. B. Register-sensitive selection, duplication, and sequencing of instructions. In Proceedings of the 15th International Conference on Supercomputing (New York, NY, USA, 2001), ICS ’01, ACM.

[93] Sethi, R., and Ullman, J. D. The generation of optimal code for arithmetic expressions. J. ACM 17, 4 (Oct. 1970), 715–728.

[94] Silvera, R., Wang, J., Gao, G. R., and Govindarajan, R. A register pressure sensitive instruction scheduler for dynamic issue processors. In Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques (Nov 1997).

[95] Smith, G. Numerical Solution of Partial Differential Equations: Finite Difference Methods. Oxford University Press, 2004.

[96] Smith, J. E., and Pleszkun, A. R. Implementing precise interrupts in pipelined processors. IEEE Trans. Comput. (May 1988).

[97] Smith, M. D., Ramsey, N., and Holloway, G. A generalized algorithm for graph-coloring register allocation. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation (New York, NY, USA, 2004), PLDI ’04, ACM, pp. 277–288.

[98] Stallman, R. M., and the GCC Developer Community. Using The GNU Compiler Collection: A GNU Manual For GCC Version 4.3.3. CreateSpace, Paramount, CA, 2009.

[99] Stock, K., Kong, M., Grosser, T., Pouchet, L.-N., Rastello, F., Ramanujam, J., and Sadayappan, P. A framework for enhancing data reuse via associative reordering. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (New York, NY, USA, 2014), PLDI ’14, ACM, pp. 65–76.

[100] Stone, J. E., Gohara, D., and Shi, G. OpenCL: A parallel programming standard for heterogeneous computing systems. IEEE Des. Test 12, 3 (May 2010), 66–73.

[101] Strzodka, R., Shaheen, M., Pajak, D., and Seidel, H.-P. Cache oblivious parallelograms in iterative stencil computations. In Proceedings of the 24th ACM International Conference on Supercomputing (New York, NY, USA, 2010), ICS ’10, ACM, pp. 49–59.

[102] Su, H., Cai, X., Wen, M., and Zhang, C. An analytical GPU performance model for 3D stencil computations from the angle of data traffic. J. Supercomput. 71, 7 (July 2015), 2433–2453.

[103] Suriana, P., Adams, A., and Kamil, S. Parallel associative reductions in Halide. In Proceedings of the 2017 International Symposium on Code Generation and Optimization (Piscataway, NJ, USA, 2017), CGO ’17, IEEE Press.

[104] Seismic Wave Modeling (SW4) - Computational Infrastructure for Geodynamics. https://geodynamics.org/cig/software/sw4/, 2014.

[105] Taflove, A. Computational Electrodynamics: The Finite-Difference Time-Domain Method. Artech House, 1995.

[106] Tang, Y., Chowdhury, R. A., Kuszmaul, B. C., Luk, C.-K., and Leiserson, C. E. The Pochoir stencil compiler. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures (New York, NY, USA, 2011), SPAA ’11, ACM, pp. 117–128.

[107] Touati, S., and Eisenbeis, C. Early Periodic Register Allocation on ILP Processors. Parallel Processing Letters 14, 2 (June 2004), 287–313.

[108] Unat, D., Cai, X., and Baden, S. B. Mint: Realizing CUDA performance in 3D stencil methods with annotated C. In Proceedings of the International Conference on Supercomputing (New York, NY, USA, 2011), ICS ’11, ACM, pp. 214–224.

[109] Unkule, S., Shaltz, C., and Qasem, A. Automatic restructuring of GPU kernels for exploiting inter-thread data locality. In Proceedings of the 21st International Conference on Compiler Construction (Berlin, Heidelberg, 2012), CC’12, Springer-Verlag, pp. 21–40.

[110] Valluri, M. G., and Govindarajan, R. Evaluating register allocation and instruction scheduling techniques in out-of-order issue processors. In 1999 International Conference on Parallel Architectures and Compilation Techniques (1999).

[111] Vasista, V., Narasimhan, K., Bhat, S., and Bondhugula, U. Optimizing geometric multigrid method computation using a DSL approach. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (New York, NY, USA, 2017), SC ’17, ACM, pp. 15:1–15:13.

[112] Venkatasubramanian, S., and Vuduc, R. W. Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems. In Proceedings of the 23rd International Conference on Supercomputing (New York, NY, USA, 2009), ICS ’09, ACM, pp. 244–255.

[113] Verdoolaege, S., Juega, J. C., Cohen, A., Gómez, J. I., Tenllado, C., and Catthoor, F. Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim. 9, 4 (Jan. 2013), 54:1–54:23.

[114] Vizitiu, A., Itu, L., Niţă, C., and Suciu, C. Optimized three-dimensional stencil computation on Fermi and Kepler GPUs. In 2014 IEEE High Performance Extreme Computing Conference (HPEC) (Sept 2014), pp. 1–6.

[115] Volkov, V., and Demmel, J. W. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (2008), SC ’08, IEEE Press, pp. 31:1–31:11.

[116] Wahib, M., and Maruyama, N. Scalable kernel fusion for memory-bound GPU applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2014), SC ’14, IEEE Press, pp. 191–202.

[117] Wahib, M., and Maruyama, N. Automated GPU kernel transformations in large-scale production stencil applications. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (New York, NY, USA, 2015), HPDC ’15, ACM, pp. 259–270.

[118] Wahib, M., Maruyama, N., and Aoki, T. Daino: A high-level framework for parallel and efficient AMR on GPUs. In SC16: International Conference for High Performance Computing, Networking, Storage and Analysis (Nov 2016), pp. 621–632.

[119] Wang, H., Potluri, S., Luo, M., Singh, A. K., Sur, S., and Panda, D. K. MVAPICH2-GPU: Optimized GPU to GPU communication for InfiniBand clusters. Computer Science - Research and Development 26 (Apr 2011).

[120] Wang, J., Krall, A., Ertl, M. A., and Eisenbeis, C. Software pipelining with register allocation and spilling. In Proceedings of the 27th Annual International Symposium on Microarchitecture (New York, NY, USA, 1994), MICRO 27, ACM, pp. 95–99.

[121] Williams, S., Waterman, A., and Patterson, D. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (Apr. 2009), 65–76.

[122] Wu, J., Belevich, A., Bendersky, E., Heffernan, M., Leary, C., Pienaar, J., Roune, B., Springer, R., Weng, X., and Hundt, R. gpucc: An open-source GPGPU compiler. In Proceedings of the 2016 International Symposium on Code Generation and Optimization (2016), CGO ’16, pp. 105–116.

[123] Xie, X., Liang, Y., Li, X., Wu, Y., Sun, G., Wang, T., and Fan, D. Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. In Proceedings of the 48th International Symposium on Microarchitecture (New York, NY, USA, 2015), MICRO-48, ACM, pp. 395–406.

[124] Xue, J. On tiling as a loop transformation. Parallel Processing Letters 07, 04 (1997), 409–424.

[125] Yount, C., Tobin, J., Breuer, A., and Duran, A. YASK-yet another stencil kernel: A framework for HPC stencil code-generation and tuning. In Proceedings of the Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for HPC (Piscataway, NJ, USA, 2016), WOLFHPC ’16, IEEE Press, pp. 30–39.

[126] Zhang, N., Driscoll, M., Markley, C., Williams, S., Basu, P., and Fox, A. Snowflake: A lightweight portable stencil DSL. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (May 2017), pp. 795–804.

[127] Zhang, Y., and Mueller, F. Auto-generation and auto-tuning of 3D stencil codes on GPU clusters. In Proceedings of the Tenth International Symposium on Code Generation and Optimization (New York, NY, USA, 2012), CGO ’12, ACM, pp. 155–164.

[128] Zhou, X., Giacalone, J.-P., Garzarán, M. J., Kuhn, R. H., Ni, Y., and Padua, D. Hierarchical overlapped tiling. In Proceedings of the Tenth International Symposium on Code Generation and Optimization (New York, NY, USA, 2012), CGO ’12, ACM, pp. 207–218.
