
Techniques for Characterizing the Data Movement Complexity of Computations

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Venmugil Elango, Graduate Program in Computer Science and Engineering

The Ohio State University 2016

Dissertation Committee:

Prof. P. Sadayappan, Advisor

Dr. Fabrice Rastello, Co-Advisor

Prof. Atanas Rountev

Prof. Radu Teodorescu

© Copyright by Venmugil Elango, 2016

ABSTRACT

The execution cost of a program, both in terms of time and energy, comprises computational cost and data movement cost (e.g., the cost of transferring data between CPU and memory devices, between parallel processors, etc.). Technology trends will cause data movement to account for the majority of energy expenditure and execution time on emerging computers. Therefore, computational complexity alone will no longer be a sufficient metric for comparing algorithms, and a fundamental characterization of data movement complexity will be increasingly important.

In their seminal work, Hong & Kung proposed the red-blue pebble game to model the data movement complexity of algorithms. Using the pebble game abstraction, Hong & Kung proved tight asymptotic lower bounds for the data movement complexity of several algorithms by reformulating the problem as a graph partitioning problem. In this dissertation, we develop a novel alternate graph min-cut based lower bounding technique. Using our technique, we derive tight lower bounds for different algorithms, with upper bounds matching within a constant factor. Further, we develop a dynamic analysis based automated heuristic for our technique, which enables automatic analysis of arbitrary computations. We provide several use cases for our automated approach.

This dissertation also presents a technique, built upon the ideas of Christ et al. [15], to derive asymptotic parametric lower bounds for a sub-class of computations, called affine computations. A static analysis based heuristic to automatically derive parametric lower bounds for affine parts of computations is also presented.

Motivated by the emerging interest in large scale parallel systems with interconnection networks and hierarchical caches with varying bandwidths at different levels, we extend the pebble game model to parallel system architectures to characterize the data movement requirements of large scale parallel computers. We provide interesting insights on architectural bottlenecks that limit the performance of algorithms on these parallel machines.

Finally, using data movement complexity analysis in conjunction with the roofline model for performance bounds, we perform an algorithm-architecture codesign exploration across an architectural design space. We model the maximal achievable performance and energy efficiency of different algorithms for a given VLSI technology, considering different architectural parameters.

ACKNOWLEDGMENTS

I would like to start by thanking my advisor, Prof. P. Sadayappan, without whose guidance this work would have been impossible. I am always amazed by the enthusiasm and energy with which he approaches and discusses new ideas. The suggestions and ideas that he provided during our discussions over the past five years have proved invaluable to me.

I am grateful to my co-advisor, Dr. Fabrice Rastello. I learned a lot from his novel ideas and insightful comments. I admire his close attention to minor technical details and correctness of the ideas.

I extend my gratitude to Dr. Louis-Noël Pouchet for his helpful suggestions on various parts of this work. Interactions with him, especially with regard to experiments and the presentation of ideas, have been invaluable.

I would like to thank Prof. J. Ramanujam for the help he provided over the course of this work. Useful discussions with him, especially when we were close to various paper deadlines, were of immense help.

I extend my thanks to Prof. Nasko Rountev for his helpful suggestions and feedback on our work related to dynamic analysis, and on the presentation of our work in various papers and talks.

I will also remember the interesting conversations I had with my labmates Martin Kong, Naser Sedaghati, Mahesh Ravishankar, Kevin Stock, Naznin Fauzia, Sanket Tavargeri, Samyam Rajbhandari, Prashant Rawat, Wenlei Bao and Changwan Hong during my period of stay at DL574.

My friends Nithin, Manojprasadh, Saktheesh, Aravind, Ashok and Nikilesh extended their help and support during the times I needed them the most. I will always cherish the memorable experiences I had with them.

Finally, I am thankful to my parents and my family for all their support.

VITA

Jul 2003 – May 2008 ...... Bachelor of Engineering, Mechanical Engineering (Sandwich), PSG College of Technology, Coimbatore, India.
Jun 2008 – Aug 2010 ...... Engineer, GE Aviation, Bangalore, India.
Dec 2010 – Present ...... Graduate Research Assistant, The Ohio State University, Columbus, Ohio.
May 2013 – Aug 2013 ...... Intern, Los Alamos National Laboratory, Los Alamos, New Mexico.
May 2015 – Aug 2015 ...... Intern, Lawrence Livermore National Laboratory, Livermore, California.

PUBLICATIONS

Venmugil Elango, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, P. Sadayappan. On Characterizing the Data Access Complexity of Programs. In ACM Symposium on Principles of Programming Languages, January 2015.

Venmugil Elango, Naser Sedaghati, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, Radu Teodorescu, P. Sadayappan. On Using the Roofline Model with Lower Bounds on Data Movement. In ACM Transactions on Architecture and Code Optimization, Volume 11, Issue 4, December 2014.

Venmugil Elango, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, P. Sadayappan. On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution. In ACM Symposium on Parallelism in Algorithms and Architectures, June 2014.

Naznin Fauzia, Venmugil Elango, Mahesh Ravishankar, J. Ramanujam, Fabrice Rastello, Atanas Rountev, Louis-Noël Pouchet, P. Sadayappan. Beyond reuse distance analysis: Dynamic analysis for characterization of data locality potential. In ACM Transactions on Architecture and Code Optimization, Volume 10, Issue 4, December 2013.

Mahesh Ravishankar, Roshan Dathathri, Venmugil Elango, Louis-Noël Pouchet, J. Ramanujam, Atanas Rountev, P. Sadayappan. Distributed memory code generation for mixed Irregular/Regular computations. In ACM Symposium on Principles and Practice of Parallel Programming, February 2015.

Pai-Wei Lai, Humayun Arafat, Venmugil Elango, P. Sadayappan. Accelerating Strassen-Winograd's matrix multiplication algorithm on GPUs. In IEEE High Performance Computing, December 2013.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

Studies in High Performance Computing: Prof. P. Sadayappan

TABLE OF CONTENTS

Page

Abstract ...... ii

Acknowledgments ...... iv

Vita ...... vi

List of Figures ...... xi

List of Tables ...... xiv

List of Algorithms ...... xv

List of Listings ...... xvi

Chapters:

1. Introduction ...... 1

1.1 Data Movement Complexity ...... 2
1.2 Computation Directed Acyclic Graph ...... 4
1.3 Characterizing the Data Movement Complexity of Computations ...... 6
1.4 Outline ...... 9

2. Red-Blue Pebble Game ...... 11

2.1 Background ...... 12
2.2 Red-Blue Pebble Game Without Recomputation ...... 14
2.3 Decomposition ...... 15
2.3.1 Disjoint Decomposition ...... 15
2.3.2 Non-disjoint Decomposition ...... 18
2.4 Input/Output Tagging/Untagging ...... 22
2.5 Related Work ...... 24

3. Lower Bounding Technique Based on Graph Min-cuts ...... 25

3.1 Background ...... 25
3.1.1 S-partitioning technique ...... 25
3.1.2 Definitions from graph theory ...... 27
3.2 Min-cut Based Lower Bounding Technique ...... 28
3.3 Analytical Lower Bounds Using Min-cut Approach ...... 32
3.3.1 Diamond DAG ...... 32
3.3.2 Fast Fourier transform (FFT) ...... 35
3.3.3 9-point Jacobi 2D (J2D) ...... 36
3.3.4 Conjugate Gradient (CG) ...... 39
3.3.5 Summary ...... 43
3.4 Related Work ...... 44

4. Automatic Min-cut Based Lower Bounds ...... 46

4.1 Algorithm for Convex Vertex-Min-Cut ...... 46
4.1.1 Overview ...... 48
4.1.2 Construction of transformed graph ...... 50
4.1.3 Correctness ...... 52
4.2 Heuristic for Min-Cut Based Lower Bounds ...... 55
4.2.1 CDAG extraction ...... 56
4.2.2 Decomposition ...... 58
4.2.3 Finding maximum min-cut ...... 58
4.3 Experimental Results ...... 59
4.3.1 Assessment of effectiveness of automated analysis ...... 59
4.3.2 Parametric bounds through function fitting ...... 63
4.3.3 Use case 1: Algorithm comparison and optimization ...... 64
4.3.4 Use case 2: Machine balance requirements ...... 67
4.4 Loop Fusion and Tileability Potential ...... 69
4.4.1 CDAG sampling ...... 73
4.4.2 Experimental results ...... 76
4.5 Related Work ...... 77

5. S-Partitioning for Lower Bounds Without Recomputation ...... 79

5.1 SNR-Partition ...... 79
5.2 ILP Formulation for SNR-Partitioning Based Lower Bounds ...... 82
5.3 Experimental Results ...... 83
5.3.1 Use case: Compiler evaluation ...... 84

6. Parametric Lower Bounds for Affine Computations Using Static Analysis ...... 87

6.1 Background ...... 88
6.1.1 Polyhedron representation for affine codes ...... 88
6.1.2 Geometric inequalities ...... 91

6.2 Brascamp-Lieb Inequality and 2SNR-Partitioning Based Lower Bound ...... 95
6.3 Automated Lower Bound Computation Through Static Analysis ...... 98
6.3.1 Illustrative example 1 ...... 98
6.3.2 Illustrative example 2 ...... 106
6.3.3 Putting it all together ...... 110
6.4 Related Work ...... 115

7. Extension to Parallel Case ...... 118

7.1 Parallel Machine Model ...... 119
7.2 Parallel Red-Blue Pebble Game ...... 120
7.3 Lower Bounds for Parallel Data Movement Complexity ...... 123
7.4 Illustration of Use ...... 126
7.4.1 Conjugate Gradient (CG) ...... 128
7.4.2 Jacobi computation ...... 130
7.5 Related Work ...... 134

8. Algorithm-Architecture Codesign Exploration ...... 136

8.1 Background ...... 137
8.1.1 The roofline model ...... 137
8.2 Operational Intensities of Algorithms ...... 140
8.3 Architecture Design Space Exploration ...... 142
8.3.1 Notations ...... 142
8.3.2 Architectural Parameters ...... 144
8.4 Algorithm-Architecture Codesign: Bounds on Performance ...... 147
8.5 Algorithm-Architecture Codesign: Bounds on Energy Efficiency ...... 154
8.6 Related Work ...... 161

9. Conclusion and Future Work ...... 162

Bibliography ...... 165

LIST OF FIGURES

Figure Page

1.1 Energy trends ...... 2

1.2 CDAG of single-sweep seidel algorithm. Input vertices are represented in black...... 5

2.1 Example illustrating decomposition of a composite program. Figure 2.1(a) shows an example code composed of four loop nests. Figure 2.1(b) shows the full CDAG corresponding to the four loop nests. Figure 2.1(c) shows a decomposed CDAG, where each sub-CDAG corresponds to two of the four loop nests. ...... 16

3.1 Diamond DAG. Vertices shaded in black and grey represent input and output vertices, respectively...... 28

3.2 Diamond DAG with bow-tie shaped sub-DAG ...... 33

4.1 Lower bound for Diamond DAG: Analytical versus automated CDAG analysis...... 61

4.2 Diamond-Chain CDAG [9]...... 61

4.3 Lower bound for Diamond-Chain: Analytical versus automated CDAG analysis. ...... 62

4.4 Lower bound for two-dimensional Jacobi computation: Analytical versus automated CDAG analysis. ...... 63

4.5 Quicksort versus odd-even sort: Normalized lower bound and cost for sequential execution order...... 66

4.6 Execution time for original and tiled odd-even sort relative to quicksort. ...... 67

4.7 Lower bounds of memory bandwidth requirement per FLOP for Jacobi and CG; and machine balance numbers for different processors. ... 69

5.1 Lower bound for matmult: Analytical versus automated CDAG analysis. 83

5.2 Comparison of data movement cost of Pluto’s register-tiled schedule against the lower bound...... 84

5.3 Performance of register-tiled codes with Pluto’s original schedule and Pluto’s schedule with manual permutation applied...... 86

6.1 Jacobi 1D computation, its CDAG and flow-dependence graph. ... 92

6.2 Illustration of Loomis-Whitney inequality in three-dimensional space. Here, the volume of the red sphere is bounded from above by the product of the square roots of the areas of the three projected circles. ...... 93

6.3 Illustration of Brascamp-Lieb inequality...... 94

6.4 Example to illustrate the relationship between geometric inequality and data movement lower bounds...... 96

6.5 Set of disjoint paths in the CDAG corresponding to the injective circuit K1. ...... 102

6.6 Iteration domain space for Jacobi 1D. Blue and red circles: points representing dynamic instances of statements S2 and S3, respectively; black arrows: relation RK1 of the injective circuit (e7, e8); black diamonds: frontier F1; gray box: subset of points and corresponding vertex-set U. ...... 104

6.7 Original and transformed spaces for Jacobi 1D example...... 105

6.8 Broadcast path P1 and its corresponding sub-CDAG...... 108

7.1 Parallel machine model...... 120

8.1 Roofline model: Performance and Energy rooflines ...... 138

8.2 Operational intensity upper bounds as a function of α. ...... 149

8.3 Upper bounds on performance for different choices of α...... 150

8.4 Upper bounds on performance...... 151

8.5 Upper bounds on energy efficiency...... 155

8.6 Upper bounds on energy efficiency ...... 160

LIST OF TABLES

Table Page

3.1 Min-cut based lower bounds and corresponding upper bounds for different algorithms. (Lower order terms are ignored in the summary.) ...... 43

4.1 Decomposability factors for different algorithms...... 64

4.2 Ratios of lower bounds for unrestricted and restricted schedules for two SPEC benchmarks...... 77

7.1 Specifications of computing systems...... 128

8.1 List of symbols and their definitions...... 138

8.2 Notations used in architecture design space exploration...... 144

8.3 Nehalem-EX processor spec...... 145

8.4 Nehalem-EX die dimensions...... 145

8.5 Modeled LLC Slices using CACTI [38] ...... 146

8.6 Nehalem-EX: effect of changing voltage and frequency on core/chip power ...... 147

LIST OF ALGORITHMS

Algorithm Page

3.1 Conjugate Gradient method...... 40

4.1 Algorithm to compute min-cut based LB of a sub-CDAG...... 59

4.2 Algorithm to compute min-cut based LB of a CDAG through recursive bisection...... 60

4.3 Checks fuseability and tileability potential of loop nests...... 72

4.4 Checks fuseability of two loop nests...... 73

4.5 Checks tileability of a loop nest...... 74

6.1 For each vertex v in a flow-dependence graph GF , finds a set of paths and computes the corresponding complexity...... 112

6.2 For a vertex v, try to add path p to some other paths. Return true if a good bound is found...... 113

6.3 For a vertex v, selects a set of paths and computes the associated complexity. ...... 113

6.4 For a domain D and a set of compatible subspaces, writes the linear program, and returns the data movement lower bound (with cases). .. 114

LIST OF LISTINGS

Listing Page

1.1 Untiled implementation of single-sweep seidel ...... 3

1.2 Tiled implementation of single-sweep seidel ...... 3

5.1 Register-tiled code generated by Pluto ...... 85

5.2 Register-tiled code generated by Pluto ...... 86

6.1 Matrix-matrix multiplication...... 107

6.2 Matrix-matrix multiplication ...... 115

6.3 Code with same array accesses as matrix-matrix multiplication .... 115

CHAPTER 1

Introduction

The execution cost of a program, both in terms of time and energy, comprises computational cost and data movement cost (e.g., the cost of transferring data between CPU and memory devices, between parallel processors, etc.). Advances in technology over the last few decades have yielded significantly different rates of improvement in the computational power of processors relative to the cost of memory access. The Intel 80286 processor introduced in 1982 had a computational latency of 320 nanoseconds (ns) and a memory access time of 225 ns [40]. A recent Intel Core i7 processor has a computational latency of 4 ns and a memory access latency of 37 ns, illustrating an order of magnitude shift in the ratio of computational cost to data movement cost. Further, the energy required for data movement is much higher than that required for performing an operation, as summarized in Figure 1.1.

The data shows the high energy cost of data movement relative to computation; this imbalance will only get worse with future technology generations [7, 22, 50].

Therefore, optimizing data movement costs will become ever more critical in the coming years. Given this crucial importance, it is of significant practical interest to characterize the inherent data movement complexity (a.k.a. I/O complexity) of algorithms.

[Figure omitted: energy per operation, in picojoules, at 45 nm and projected 11 nm technology; the decrease in energy for a FLOP is much greater than the decrease for data movement. Source: Jim Demmel, John Shalf.]

Figure 1.1: Energy trends

1.1 Data Movement Complexity

A problem may be computed using different algorithms. For instance, the standard O(N^3) algorithm and Strassen's algorithm are two alternative algorithms for multiplying two matrices. An algorithm for a problem could have different implementations (e.g., untiled and tiled implementations of standard matrix multiplication). While different implementations of a particular algorithm have the same computational cost, their data movement costs may vary. This is illustrated using a simple example below.

Listing 1.1 shows an implementation of the single-sweep Seidel algorithm. Its computational cost can be simply stated as (N − 2)^2 arithmetic operations. Listing 1.2 shows a functionally equivalent form of the same algorithm, after a tiling transformation of the code in Listing 1.1. The tiled form too has exactly the same computational cost of (N − 2)^2 arithmetic operations.

    /* in-place relaxation update over the interior of an N x N grid */
    for (i = 1; i < N-1; i++)
      for (j = 1; j < N-1; j++)
        A[i][j] = (A[i-1][j] + A[i][j-1] + A[i][j+1] + A[i+1][j]) / 4.0;

Listing 1.1: Untiled implementation of single-sweep seidel

    /* T is the tile size; min(a, b) denotes the smaller of a and b */
    for (it = 1; it < N-1; it += T)
      for (jt = 1; jt < N-1; jt += T)
        for (i = it; i < min(it+T, N-1); i++)
          for (j = jt; j < min(jt+T, N-1); j++)
            A[i][j] = (A[i-1][j] + A[i][j-1] + A[i][j+1] + A[i+1][j]) / 4.0;

Listing 1.2: Tiled implementation of single-sweep seidel

Next, let us consider the data movement cost for execution of these two implementations on a processor with a single level of cache. If the problem size N is larger than the cache size S, the number of cache misses would be higher for the untiled version (Listing 1.1) than for the tiled version (Listing 1.2). But if the cache size were sufficiently large, the tiled version would not offer any benefits in reducing cache misses. Thus, unlike the computational complexity of an algorithm, which stays unchanged for different implementations and is also independent of machine parameters like cache size, the data movement cost depends both on the cache capacity and on the order of execution of the operations (i.e., the implementation) of the algorithm. The order of execution of the operations in an implementation is often referred to as the schedule of the implementation. A fundamental question therefore is:

Given an algorithm and the amount of storage at different levels of the cache/memory hierarchy, what is the minimum possible number of data transfers at the different levels, among all valid schedules that perform the operations?

The data movement complexity, or I/O complexity, of an algorithm is defined as the minimum data movement cost over all possible valid schedules of the algorithm. A formal definition of data movement complexity is provided in Section 2.1. Valid schedules of an algorithm are the ones that respect the data dependence constraints of the algorithm. While, in general, it is intractable to precisely determine the absolute minimum number of data movements among all valid schedules of an algorithm, it is feasible to develop lower bounds on the optimal number of data movements, which is the focus of this dissertation.

The single-sweep Seidel algorithm, used as our example above, has matching lower and upper bounds, giving an asymptotic data movement complexity of Θ(N^2/S), which depends on the parameters: problem size N and cache size S. Its computational complexity, on the other hand, which depends only on the problem size, is Θ(N^2).
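Put side by side, the two measures for this example are as follows (our restatement of the bounds just stated; the symbol W(N) for the operation count is introduced here only for illustration):

    % Computational complexity vs. data movement complexity of single-sweep Seidel:
    \[
      W(N) = \Theta(N^2), \qquad Q(N, S) = \Theta\!\left(\frac{N^2}{S}\right),
    \]
    % so, for a schedule attaining the bound, on average \Theta(S) operations are
    % performed per word moved between slow and fast memory.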

1.2 Computation Directed Acyclic Graph

In order to model the range of valid schedules for the operations of an algorithm for a given input, it is common to use the schedule-invariant abstraction of the computation directed acyclic graph (CDAG). Each vertex of a CDAG corresponds to either an input or an instance of a computation. Edges of a CDAG capture true data dependences from producer instances to consumer instances. We use a notation similar to that of Bilardi & Peserico [8] to formally describe a CDAG.

Definition 1 (Computation Directed Acyclic Graph (CDAG)). A computation directed acyclic graph (CDAG) is a 4-tuple C = (I,V,E,O) of finite sets such that:

1. I ⊂ V is the input set with no incoming edges.

2. E ⊆ V × V is the set of edges.

3. G = (V,E) is a directed acyclic graph with no isolated vertices.

4. V \ I is called the operation set.

5. O ⊆ V is called the output set.

Two important characteristics of this abstract form of representing a computation are that (1) there is no specification of any particular order of execution of the operations: the CDAG abstracts the schedule of operations by only specifying partial ordering constraints as edges in the graph; (2) there is no association of memory locations with the source operands or result of any operation.

Figure 1.2: CDAG of single-sweep seidel algorithm. Input vertices are represented in black.

Figure 1.2 shows the CDAG for the codes in Listing 1.1 and Listing 1.2, for input N = 6. Although the schedules for the tiled and untiled versions are different, the set of computational instances and the producer-consumer relationships for the flow of data are exactly the same.
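To make the abstraction concrete, the following minimal C sketch (our illustration; the edge-list encoding and the tiny example graph are assumptions, not taken from the dissertation) stores a CDAG and checks two of the structural conditions of Definition 1: input vertices have no incoming edges, and no vertex is isolated. Acyclicity is assumed rather than checked.

    #include <stdio.h>
    #include <stdbool.h>

    /* A CDAG C = (I, V, E, O) over vertices 0..n-1, stored as an edge list.
     * is_input[v] marks membership in the input set I. */
    typedef struct { int src, dst; } Edge;

    bool check_cdag(int n, const Edge *E, int m, const bool *is_input) {
        bool touched[64] = { false };               /* sketch assumes n <= 64 */
        for (int e = 0; e < m; e++) {
            touched[E[e].src] = touched[E[e].dst] = true;
            if (is_input[E[e].dst]) return false;   /* input with an incoming edge */
        }
        for (int v = 0; v < n; v++)
            if (!touched[v]) return false;          /* isolated vertex */
        return true;
    }

    int main(void) {
        /* Tiny illustrative CDAG: two inputs (0, 1) feeding one operation (2). */
        Edge E[] = { {0, 2}, {1, 2} };
        bool is_input[3] = { true, true, false };
        printf("valid CDAG: %d\n", check_cdag(3, E, 2, is_input));
        return 0;
    }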

1.3 Characterizing the Data Movement Complexity of Computations

In their seminal work [27], Hong & Kung formalized the problem of data movement complexity and proposed the red-blue pebble game to model the data movement complexity of algorithms. Hong & Kung formulated the problem of finding lower bounds for data movement complexity as a graph partitioning problem, called S-partitioning. In this dissertation, we develop new manual and automated techniques to derive lower bounds for data movement complexity of computations and present applications and use cases for our analysis.

CDAG decomposition and input/output tagging: Many practical applications are composed of several sub-computations. Using manual reasoning to directly analyze the data movement complexity of such composite applications is very difficult and may not even be feasible. It becomes much easier to manually derive lower bounds on the data movement complexity of such algorithms if there exists a way to reason on each sub-computation individually and obtain a valid consolidated result for the full computation. The first contribution of this dissertation is a technique called CDAG decomposition, which allows us to decompose a CDAG into sub-CDAGs (each corresponding to a sub-computation), derive lower bounds on each sub-CDAG separately, and finally consolidate the results from the sub-CDAGs to obtain valid lower bounds for the full CDAG.

Certain lower bounding techniques, such as Hong & Kung's S-partitioning technique, rely on the presence of inputs in the (sub-)CDAGs to obtain meaningful lower bounds. However, performing CDAG decomposition generally leaves sub-CDAGs with few inputs, leading to trivial bounds. In order to overcome this problem, we next develop a technique called input/output tagging/untagging, which allows us to "tag" some vertices of a (sub-)CDAG as inputs, derive meaningful bounds, and finally make adjustments to the derived bound to account for the input/output tagging.

Graph min-cut based lower bounding technique: Following Hong & Kung's work on S-partitioning to derive lower bounds, several methods have been proposed to derive tight bounds for various algorithms. Our next contribution is a novel lower bounding technique based on graph min-cuts. We use our proposed technique to derive tight lower bounds on data movement complexity for different algorithms. Our method provides tight lower bounds for certain computations for which Hong & Kung's technique provides trivial bounds. Our technique has the added benefit of being automatable. Using dynamic analysis, we develop methods to automate the process of deriving data movement lower bounds for arbitrary computations with our newly developed lower bounding method.

Static analysis for affine computations: Christ et al. [15] developed a novel approach based on geometric reasoning to automatically derive parametric lower bounds for affine computations under the geometric model abstraction. The geometric model is an alternative to the CDAG abstraction, but it is a weaker model than the CDAG model in the sense that it ignores data dependences between operation instances. This gives rise to a few difficulties when using the geometric model to derive bounds for affine computations. In this dissertation, we build upon the work of Christ et al. [15] to develop methods to apply similar geometric reasoning to the CDAGs corresponding to affine parts of computations. This allows us to overcome some of the limitations associated with the analysis under the geometric model. We also provide a static analysis based heuristic to automate this process of deriving parametric lower bounds for affine computations. A description of this work also appeared in [19].

Parallel memory model: We next extend the red-blue pebble game model to characterize data movement in large scale parallel machines. A similar extension has been proposed in the past for different parallel environments [47, 53]. In this work, our extension targets multi-node/multi-core machines. With the help of our extended model, we provide analysis and interesting insights on architectural bottlenecks that limit the performance of algorithms. This work was also published in [18].

Application to algorithm-architecture codesign: Our final contribution is an application of data movement lower bounds to perform algorithm-architecture codesign exploration across an architectural design space using the roofline model. Previous attempts [55, 16, 13, 14] to analyze the performance bottlenecks of an architecture using the roofline model have performed the analysis considering specific implementations of algorithms. With the help of data movement lower bounds, we are able to derive conclusions for an algorithm as a whole, independent of any specific implementation. We model the maximal achievable performance and energy efficiency of different algorithms for a given VLSI technology, considering different architectural parameters. A description of this work appeared in [20].

1.4 Outline

In Chapter 2, we introduce the red-blue pebble game abstraction proposed by Hong & Kung to characterize the data movement complexity of algorithms. We present a modification to Hong & Kung's pebble game model that restricts recomputation, and derive results needed to obtain tighter lower bounds in subsequent chapters when recomputation is disallowed.

In Chapter 3, we develop a novel graph min-cut based lower bounding technique. We derive lower bounds (that match the corresponding upper bounds within a constant factor) for different algorithms using our proposed technique.

Chapter 4 presents an automated approach, based on the lower bounding technique developed in Chapter 3, to automatically derive bounds for arbitrary CDAGs. We demonstrate the effectiveness and several uses of our automated approach in Section 4.3.

Hong & Kung allowed recomputation in their model and developed the S-partitioning technique for lower bounds using their pebble game model. In Chapter 5, we present an adaptation of Hong & Kung's S-partitioning technique for the model that disallows recomputation. We also provide an integer linear programming based heuristic for automatically computing lower bounds based on S-partitions of a CDAG when recomputation is disallowed.

In Chapter 6, we build upon the ideas of [15] to derive parametric lower bounds for CDAGs, which allows us to analyze a larger subset of algorithms than in [15]. We also provide a static analysis based heuristic to automate this process of deriving parametric lower bounds for affine parts of computations.

Chapter 7 details our extension of the pebble game model to parallel machines and its applications.

Application of the data movement complexity analysis to algorithm-architecture codesign exploration is presented in Chapter 8.

Finally, we draw conclusions from our current work and provide some possible directions for future work in Chapter 9.

CHAPTER 2

Red-Blue Pebble Game

Hong & Kung, in their seminal work [27], proposed the red-blue pebble game to model the data movement complexity of algorithms. Using the pebble game formulation, Hong & Kung developed a graph partitioning based lower bounding technique, called S-partitioning, and derived a number of lower bound results for the data movement requirements of several algorithms. In this chapter, we provide an overview of Hong & Kung's red-blue pebble game in Section 2.1. Hong & Kung's pebble game model implicitly allows recomputation, i.e., a single operation can be evaluated multiple times. However, several prior works have focused on deriving lower bounds in a restricted model that disallows recomputation, in the interest of tractability for deriving tighter bounds. In Section 2.2, we describe a restricted model of the red-blue pebble game that disallows recomputation. In Section 2.3, we introduce the notion of decomposition of CDAGs. We make heavy use of this decomposition property (formally described in Theorem 1) in the subsequent chapters to derive tight lower bounds for different algorithms. Section 2.4 presents results that are specific to the model that prohibits recomputation. These results are necessary to obtain non-trivial lower bounds under certain circumstances.

2.1 Background

The red-blue pebble game, proposed in [27], is played on a CDAG. We are provided with two kinds of pebbles, namely, red and blue pebbles. The supply of blue pebbles is considered to be unlimited, while only a limited number, S, of red pebbles is available. Thus, blue and red pebbles are analogous to slow and fast memories, respectively, in a serial system with two levels of memory hierarchy. The rules of the red-blue pebble game closely depict the data movement operations in a two-level memory system.

Given a CDAG C = (I,V,E,O) and S red pebbles, the rules of the red-blue

pebble game are as follows:

R1 (Input) A red pebble may be placed on any vertex that has a blue pebble (load

from slow to fast memory).

R2 (Output) A blue pebble may be placed on any vertex that has a red pebble

(store from fast to slow memory).

R3 (Compute) If all predecessors of a vertex v ∈ V \ I have red pebbles, a red

pebble may be placed on v (execution of a computational operation).

R4 (Delete) A red pebble may be removed from any vertex (reuse storage).

Rules of Hong & Kung’s pebble game allow for recomputation, i.e., the rule R3 can

be applied to a vertex v multiple times.

Definition 2 (Configuration). A configuration is defined as a pair (V1, V2), where V1, V2 ⊆ V, such that V1 comprises the vertices with red pebbles and V2 comprises the vertices with blue pebbles. Vertices in V1 ∩ V2 contain both red and blue pebbles.

An initial configuration of a CDAG C = (I,V,E,O) is a configuration in which all the input vertices I have blue pebbles. Similarly, a terminal configuration is one in which all the output vertices O have blue pebbles on them.

Definition 3 (Pebbling). Given a configuration X1, pebbling is defined as application of one of the rules of a pebble game on a vertex v of a CDAG, causing a change in the configuration X1, leading to a new configuration X2.

A pebbling by application of a rule Rx on a vertex v is denoted by the notation P(Rx, v). Pebbling a vertex v using the rule R3 (i.e., P(R3, v)) is also informally referred to as firing or computing v.

Definition 4 (Pebbling sequence). Given a CDAG C = (I,V,E,O) and S red pebbles, a pebbling sequence is a sequence of pebblings that begins with the initial configuration and ends with the terminal configuration.

A subsequence of pebblings of a pebbling sequence is referred to as a pebbling subsequence.

Definition 5 (Data movement cost of a pebbling sequence). The data movement cost of a pebbling sequence P played with S red pebbles, q(P,S), is defined as the number of pebblings in P according to the rules R1 or R2.

Definition 6 (Optimal pebbling sequence). Given a CDAG C and S red pebbles, an optimal pebbling sequence is a pebbling sequence that achieves minimum data movement cost on C when played with S red pebbles.

Definition 7 (Data movement complexity). Data movement complexity of a CDAG

C, Q(C,S), is defined as the data movement cost of an optimal pebbling sequence for

C played with S red pebbles.

When either C or S is clear from the context, we use the notation Q(S) or Q(C), respectively, to refer to Q(C,S). Similarly, when both C and S are clear from the context, we simply use the notation Q to refer to Q(C,S).

Remark 1. Since a CDAG C = (I,V,E,O) would have no isolated vertices and inputs have no incoming edges (see Definition 1), each vertex in the input set I would

have at least one successor. Hence, any pebbling sequence for C will necessarily have at least one pebbling with rule R1 for each vertex in I. Similarly, any pebbling sequence for C will have at least one pebbling with rule R2 for each vertex in the output set O.
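As a concrete illustration of Definitions 3–5, the following minimal C sketch (our illustration; the rule encoding and the toy sequence are assumptions, not from the dissertation) replays a pebbling sequence and counts its data movement cost, i.e., the number of R1 and R2 pebblings.

    #include <stdio.h>

    /* Rules of the red-blue pebble game (Section 2.1). */
    enum Rule { R1_LOAD, R2_STORE, R3_COMPUTE, R4_DELETE };

    /* One pebbling: application of a rule to a vertex (Definition 3). */
    typedef struct { enum Rule rule; int vertex; } Pebbling;

    /* Data movement cost of a pebbling sequence (Definition 5):
     * the number of pebblings that use rule R1 or rule R2. */
    int data_movement_cost(const Pebbling *seq, int len) {
        int q = 0;
        for (int i = 0; i < len; i++)
            if (seq[i].rule == R1_LOAD || seq[i].rule == R2_STORE)
                q++;
        return q;
    }

    int main(void) {
        /* Illustrative sequence for a tiny CDAG with inputs {0, 1} and operation 2:
         * load both inputs, compute vertex 2, store the result. */
        Pebbling seq[] = {
            { R1_LOAD, 0 }, { R1_LOAD, 1 }, { R3_COMPUTE, 2 }, { R2_STORE, 2 }
        };
        printf("q(P, S) = %d\n", data_movement_cost(seq, 4));   /* prints 4 */
        return 0;
    }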

2.2 Red-Blue Pebble Game Without Recomputation

A pebble game model that disallows recomputation can be formalized by changing

the rule R3 of the red-blue pebble game to R3NR (NR denotes No-Recomputation)

as follows:

R3NR (Compute) If all predecessors of a vertex v ∈ V \ I have red pebbles and a red pebble has not been previously placed on v, a red pebble may be placed on (or moved to)¹ v (execution of a computational operation without recomputation).

Enforcing restriction on recomputation allows us to derive some results in Section 2.4

(which are invalid under the model with recomputation) that help us obtain tighter

lower bounds in the later chapters. We also use the abbreviations RB and RBNR to

refer to red-blue pebble games with and without recomputation, respectively.

¹ The original red-blue pebble game in [27] does not allow moving/sliding a red pebble from a predecessor vertex to a successor; we chose to allow it since it reflects real instruction set architectures. Others [45] have also considered a similar modification. All our proofs hold for both variants.

2.3 Decomposition

Applications are typically constructed from a number of sub-computations using the fundamental composition mechanisms of sequencing, iteration and recursion. In the case of computational complexity, the complexity of the whole program can be obtained through simple addition of the computational complexities of its sub-computations. A similar mechanism for the data movement complexity would be of interest to simplify the process of deriving data movement lower bounds for composite applications. Figure 2.1 illustrates this point.

Further, certain lower bounding techniques, such as our min-cut based technique introduced in Chapter 3, heavily rely on the ability to decompose a CDAG to derive tighter bounds.

This section derives the necessary results that enable us to obtain data movement lower bounds of a composite program through simple addition of the lower bounds of its sub-computations. We would like to note that the decomposition results derived in this section are applicable to the pebble game models both with and without recomputation.

2.3.1 Disjoint Decomposition

Theorem 1 allows us to obtain the lower bound for a CDAG through direct addition

of lower bounds of its decomposed sub-CDAGs.

    for (i = 0; i < 4; i++)
      c[i] = a[i] + b[i];   // S1
    for (i = 0; i < 4; i++)
      d[i] = c[i] * c[i];   // S2
    for (i = 0; i < 4; i++)
      e[i] = c[i] + d[i];   // S3
    for (i = 0; i < 4; i++)
      f[i] = d[i] * e[i];   // S4

(a) Original code; (b) full CDAG over the inputs a[0..3], b[0..3] and the values c[i], d[i], e[i], f[i]; (c) CDAG partitioning, with sub-CDAG #1 containing S1 and S2 only and sub-CDAG #2 containing S3 and S4 only. [CDAG drawings omitted.]

Figure 2.1: Example illustrating decomposition of a composite program. Figure 2.1(a) shows an example code composed of four loop nests. Figure 2.1(b) shows the full CDAG corresponding to the four loop nests. Figure 2.1(c) shows a decomposed CDAG, where each sub-CDAG corresponds to two of the four loop nests.

Theorem 1 (Decomposition). Given a CDAG C = (I,V,E,O), let {V1, V2, ..., Vn} be an arbitrary (not necessarily acyclic) pairwise-disjoint partitioning of V (i.e., i ≠ j ⇒ Vi ∩ Vj = ∅ and V1 ∪ V2 ∪ · · · ∪ Vn = V) and let C1, C2, ..., Cn be the induced sub-CDAGs (i.e., Ii = I ∩ Vi, Ei = E ∩ Vi × Vi, Oi = O ∩ Vi). If Q(C) is the data movement complexity of C and Q(Ci) is the data movement complexity of Ci, then Q(C1) + Q(C2) + · · · + Q(Cn) ≤ Q(C). In particular, if LB(Ci) is a data movement lower bound for Ci, then LB(C1) + LB(C2) + · · · + LB(Cn) is a valid data movement lower bound for C.

ble game with or without recomputation) for C, with cost Q(C). We define the cost P P of restricted to Vi, denoted as Q(C)|Vi , as the number of R1 or R2 pebblings in ∑ n that involve a vertex of Vi. Clearly Q(C) = i=1 Q(C)|Vi . We will show that we can

16 P P build from , a pebbling sequence i for Ci, of cost Q(C)|Vi . This will prove that ∑ ∑ ≤ n ≤ n Q(Ci) Q(C)|Vi , and thus i=1 Q(Ci) i=1 Q(C)|Vi = Q(C). P P P P i, with cost Q(C)|Vi , is built from as follows: (1) Start with i = ; (2) for

any pebbling in Pi that involves a vertex v ∈ Vi, leave this pebbling as such in Pi; (3)

delete all other pebblings in Pi. Pebblings in Pi with rules R1, R2, and R4 trivially

satisfy the rules of the game. Whenever a pebbling with rule R3 (or R3NR) on a

vertex v is performed in P, all the predecessors of v must have had a red pebble on

them. Since all pebblings of P on the vertices of Vi are maintained in Pi, when v is

fired in Pi, all its predecessor vertices will have red pebbles, satisfying the conditions

of rule R3 (or R3NR). □
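When applying Theorem 1 in practice, the mechanical step is forming the induced sub-CDAGs from a vertex partition. The following minimal C sketch (our illustration; the edge-list encoding and the toy example derived from Figure 2.1 are assumptions) keeps exactly the edges internal to a part, i.e., Ei = E ∩ Vi × Vi, and reports how many edges the decomposition drops.

    #include <stdio.h>

    typedef struct { int src, dst; } Edge;

    /* part[v] assigns each vertex to one of the disjoint parts V_1, ..., V_k.
     * edge_part[e] receives the part index of each kept (internal) edge,
     * or -1 for an edge cut by the decomposition. Returns #edges kept. */
    int induce_subcdags(const Edge *E, int m, const int *part, int *edge_part) {
        int kept = 0;
        for (int e = 0; e < m; e++) {
            if (part[E[e].src] == part[E[e].dst]) {
                edge_part[e] = part[E[e].src];
                kept++;
            } else {
                edge_part[e] = -1;   /* dependence edge dropped by decomposition */
            }
        }
        return kept;
    }

    int main(void) {
        /* Figure 2.1 with the arrays collapsed to a single index i = 0:
         * vertices a, b, c, d, e, f numbered 0..5; part 0 = {a,b,c,d}, part 1 = {e,f}. */
        Edge E[] = { {0,2}, {1,2}, {2,3}, {2,4}, {3,4}, {3,5}, {4,5} };
        int part[6] = { 0, 0, 0, 0, 1, 1 };
        int edge_part[7];
        int kept = induce_subcdags(E, 7, part, edge_part);
        printf("edges kept: %d, edges cut: %d\n", kept, 7 - kept);
        return 0;
    }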

The following corollary directly follows from the decomposition theorem.

Corollary 1 (Input/output separation). Given a CDAG C = (I,V,E,O) and some

set dI ⊆ I and dO ⊆ O, let I′ = I\dI, O′ = O\dO, V ′ = V \(dI∪dO), E′ = E∩V ′×V ′

and C′ = (I′,V ′,E′,O′). If Q(C) and Q(C′) are the data movement complexities of

C and C′, respectively, then,

Q(C) ≥ Q(C′) + |dI| + |dO|

Proof. Let C1 = (dI, dI, ∅, ∅) and C2 = (∅, dO, E ∩ dO × dO, dO). Note that the set {V′, dI, dO} forms a pairwise disjoint partition of V and that C′, C1 and C2 are the induced CDAGs of C. Let Q(C1) and Q(C2) be the data movement complexities of C1 and C2, respectively. Clearly Q(C1) ≥ |dI| and Q(C2) ≥ |dO| (see Remark 1). Hence, from Theorem 1,

Q(C) ≥ Q(C′) + Q(C1) + Q(C2) ≥ Q(C′) + |dI| + |dO|. □

2.3.2 Non-disjoint Decomposition

The decomposition theorem in the previous subsection is applicable only when the decomposed sub-CDAGs of a CDAG C are pairwise disjoint. Since some edges are removed during decomposition, in some cases the resulting lower bound obtained using Theorem 1 might be weak. It might be possible to reduce this loss through careful non-disjoint decomposition (while maintaining the validity of derived bounds), so that certain critical dependences needed for deriving tighter bounds are maintained in the sub-CDAGs. As seen later in Section 3.3.4, there are many practical applications that benefit from a careful non-disjoint decomposition, which allows us to derive bounds with tighter multiplicative constants. Theorem 2 derives results for a specific kind of non-disjoint decomposition of CDAGs.

Before we state the non-disjoint decomposition theorem, we introduce the needed definitions below.

definitions below.

Definition 8 (Ancestor-set). Ancestor-set of a vertex v ∈ V , Anc(v), in a DAG

G = (V,E) is a set of vertices that can reach v through a non-empty directed path in

G.

Definition 9 (Descendant-set). Descendant-set of a vertex v ∈ V , Desc(v), in a

DAG G = (V,E) is a set of vertices that can be reached from v through a non-empty

directed path in G.

Definition 10 (Non-disjoint bisection). Given a CDAG C = (I,V,E,O) and any vertex x ∈ V, we say (C1 = (I1,V1,E1,O1), C2 = (I2,V2,E2,O2)) is a non-disjoint bisection of C w.r.t. x if:

1. V2 = Desc(x) and C2 is the induced sub-CDAG of C, i.e., I2 = I ∩ V2, E2 = E ∩ V2 × V2 and O2 = O ∩ V2.

2. V1 = Vrem ∪ Vout, where Vrem = V \ V2 and Vout = {v | v ∈ V2 ∧ (∃u ∈ Vrem, (u, v) ∈ E)}; E1 = Erem ∪ Eout, where Erem = E ∩ Vrem × Vrem and Eout = {x} × Vout; I1 = I ∩ V1; and O1 = O ∩ V1.

Note that the vertices in Vout are part of both the sub-CDAGs C1 and C2, and any

dependence edges to (resp. from) vertices in Vout in C are maintained in C1 (resp.

C2) after decomposition.
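A small C sketch of the bisection step may help; it computes Desc(x) by a depth-first traversal and then marks Vout as the vertices of V2 with a predecessor outside V2, following Definitions 9 and 10 (the edge-list encoding and the toy DAG are our illustrative assumptions).

    #include <stdio.h>
    #include <stdbool.h>

    #define MAXV 64   /* sketch limit on the number of vertices */

    typedef struct { int src, dst; } Edge;

    /* Mark all descendants of x (vertices reachable by a non-empty path),
     * per Definition 9, with a depth-first traversal over the edge list. */
    static void mark_desc(int x, const Edge *E, int m, bool *desc) {
        for (int e = 0; e < m; e++)
            if (E[e].src == x && !desc[E[e].dst]) {
                desc[E[e].dst] = true;
                mark_desc(E[e].dst, E, m, desc);
            }
    }

    /* Non-disjoint bisection w.r.t. x (Definition 10):
     * V2 = Desc(x); Vout = vertices of V2 with a predecessor in Vrem = V \ V2.
     * Vertices in Vout belong to both sub-CDAGs. */
    void bisect(int n, const Edge *E, int m, int x, bool *in_v2, bool *in_vout) {
        for (int v = 0; v < n; v++) in_v2[v] = in_vout[v] = false;
        mark_desc(x, E, m, in_v2);
        for (int e = 0; e < m; e++)
            if (!in_v2[E[e].src] && in_v2[E[e].dst])
                in_vout[E[e].dst] = true;
    }

    int main(void) {
        /* Tiny DAG 0 -> 1 -> 2 with an extra edge 0 -> 2; bisect w.r.t. x = 1. */
        Edge E[] = { {0,1}, {1,2}, {0,2} };
        bool in_v2[MAXV], in_vout[MAXV];
        bisect(3, E, 3, 1, in_v2, in_vout);
        for (int v = 0; v < 3; v++)
            printf("v%d: V2=%d Vout=%d\n", v, in_v2[v], in_vout[v]);
        return 0;
    }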

Definition 11 (Non-disjoint decomposition). Given a CDAG C = (I,V,E,O) and a set of vertices X = {x1, x2, ..., xn} ⊂ V such that xi ∈ Desc(xi−1) for i = 2, 3, ..., n, we say that (C1, C2, ..., Cn+1) is a non-disjoint decomposition of C if it satisfies the following:

1. (C1,Cx1 ) is the non-disjoint bisection of C w.r.t. x1.

2. For i = 2, 3, . . . , n, (Ci,Cxi ) is the non-disjoint bisection of Cxi−1 w.r.t. xi.

3. Cn+1 = Cxn .

The following Lemma allows us to bound the data movement complexity of a

CDAG C in terms of the data movement complexities of its non-disjoint bisection

(C1,C2).

Lemma 1 (Non-disjoint bisection). Given a CDAG C = (I,V,E,O) and any vertex x ∈ V, let (C1 = (I1,V1,E1,O1), C2 = (I2,V2,E2,O2)) be the non-disjoint bisection of C w.r.t. x as defined in Definition 10. If Q(C,S) is the data movement complexity of C with S red pebbles, and Q(C1,S+1) and Q(C2,S) are the data movement complexities of C1 and C2 with S+1 and S red pebbles, respectively, then Q(C1,S+1) + Q(C2,S) ≤ Q(C,S). In particular, if LB(C1,S+1) and LB(C2,S) are data movement lower bounds for C1 and C2 with S+1 and S red pebbles, respectively, then LB(C1,S+1) + LB(C2,S) is a valid data movement lower bound for C with S red pebbles.

Proof. As in Definition 10, let Vrem = V \ V2, Erem = E ∩ Vrem × Vrem, Vout = {v | v ∈ V2 ∧ (∃u ∈ Vrem, (u, v) ∈ E)} and Eout = {x} × Vout.

Consider an optimal pebbling sequence P (following the rules of the red-blue pebble game with or without recomputation) for C, with cost Q(C,S). We define the cost of P restricted to Vi when played with S red pebbles, denoted Q(C,S)|Vi, as the number of R1 or R2 pebblings in P that involve a vertex of Vi. Clearly, Q(C,S+1)|Vrem + Q(C,S)|V2 = Q(C,S). We will show that we can build from P a pebbling sequence P1 for C1 with S+1 red pebbles, of cost Q(C,S+1)|Vrem, and a pebbling sequence P2 for C2 with S red pebbles, of cost Q(C,S)|V2. This will prove that Q(C1,S+1) + Q(C2,S) ≤ Q(C,S+1)|Vrem + Q(C,S)|V2 = Q(C,S).

P1, of cost Q(C,S+1)|Vrem, is built from P as follows: (1) start by setting P1 equal to the pebbling subsequence of P restricted to Vrem; (2) for each vertex v ∈ Vout, successively append the pebbling P(R3, v) to P1. This process of appending P(R3, v) to P1 is valid since, from the construction of C1, all the vertices in Vout have exactly one predecessor x, and we have one additional red pebble s. Hence, at the end of step (1), the additional pebble s can be fixed on the vertex x, and rule R3 (or R3NR) can be successively applied to each vertex v ∈ Vout to fire v with no additional data movement cost.

P2, of cost Q(C,S)|V2, is built from P as follows: (1) start with P2 = P; (2) for any pebbling in P2 that involves a vertex v ∈ V2, leave this pebbling as such in P2; (3) delete all other pebblings in P2. Pebblings in P2 with rules R1, R2, and R4 trivially satisfy the rules of the game. Whenever a pebbling with rule R3 (or R3NR) on a vertex v is performed in P, all the predecessors of v must have had red pebbles on them. Since all pebblings of P on the vertices of V2 are maintained in P2, when v is fired in P2, all its predecessor vertices will have red pebbles, satisfying the conditions of rule R3 (or R3NR). □

Finally, we prove the non-disjoint decomposition theorem below.

Theorem 2 (Non-disjoint decomposition). Given a CDAG C = (I,V,E,O) and a set of vertices {x1, x2, ..., xn} ⊂ V such that xi ∈ Desc(xi−1) for i = 2, 3, ..., n, let (C1, C2, ..., Cn+1) be the non-disjoint decomposition of C as defined in Definition 11. If Q(C,S) is the data movement complexity of C with S red pebbles, and Q(Ci,S+1) is the data movement complexity of Ci, for i = 1, ..., n+1, with S+1 red pebbles, then Q(C1,S+1) + Q(C2,S+1) + · · · + Q(Cn+1,S+1) ≤ Q(C,S). In particular, if LB(Ci,S+1) is a data movement lower bound for Ci, for i = 1, ..., n+1, then LB(C1,S+1) + · · · + LB(Cn+1,S+1) is a valid data movement lower bound for C with S red pebbles.

Proof. This can be proved inductively as follows. For the base case, consider (C1, Cx1), the non-disjoint bisection of C w.r.t. x1. From Lemma 1, Q(C1,S+1) + Q(Cx1,S) ≤ Q(C,S). Assume, as the induction hypothesis, that the inequality

Q(C1,S+1) + · · · + Q(Cm,S+1) + Q(Cxm,S) ≤ Q(C,S)    (2.1)

holds after m steps, i.e., after application of m non-disjoint bisectionings. At step m+1, Cxm is bisected into (Cm+1, Cxm+1) through non-disjoint bisectioning. From Lemma 1, we get

Q(Cm+1,S+1) + Q(Cxm+1,S) ≤ Q(Cxm,S).    (2.2)

From inequalities (2.1) and (2.2), we have Q(C1,S+1) + · · · + Q(Cm+1,S+1) + Q(Cxm+1,S) ≤ Q(C,S). Finally, at step n, Cxn is bisected into (Cn, Cn+1). By induction, at the end of n steps, we have

Q(C1,S+1) + · · · + Q(Cn+1,S+1) ≤ Q(C1,S+1) + · · · + Q(Cn,S+1) + Q(Cn+1,S) ≤ Q(C,S). □

2.4 Input/Output Tagging/Untagging

When a CDAG is decomposed into sub-CDAGs, some of the sub-CDAGs might end up with very few or no input vertices (e.g., see Figure 2.1(c)). Some lower bounding techniques (for example, Hong & Kung's S-partitioning technique) rely on the presence of input vertices in a (sub-)CDAG to derive tighter bounds. Hence, the absence of input vertices in sub-CDAGs, as a result of decomposition, might lead to very weak bounds. This section shows ways to overcome this problem using a technique called input/output tagging. The idea is to tag the predecessor-free non-input vertices (dI) of a (sub-)CDAG as inputs, compute the lower bound for this "tagged" (sub-)CDAG, and make adjustments to the result to obtain a valid lower bound for the original (sub-)CDAG.

The techniques of input/output (un)tagging presented in this section are applicable only to the restricted pebble game model that disallows recomputation.

Theorem 3 (Input/Output (Un)Tagging). Let C and C′ be two CDAGs with the same DAG G = (V,E) such that C = (I,V,E,O) and C′ = (I ∪ dI, V, E, O ∪ dO), where dI and dO are some subsets of V and dI ∩ dO = ∅. If Q(C) is the data movement complexity of C and Q(C′) is the data movement complexity of C′, then Q(C) can be bounded by Q(C′) as follows (tagging):

Q(C′) − |dI| − |dO| ≤ Q(C)    (2.3)

Reciprocally, Q(C′) can be bounded by Q(C) as follows (untagging):

Q(C) ≤ Q(C′)    (2.4)

Proof. Consider an optimal pebbling sequence P for C, of cost Q(C). To prove inequality (2.3), we will build a pebbling sequence P′ for C′ of cost no more than Q(C) + |dI| + |dO|. This will prove that Q(C′) ≤ Q(C) + |dI| + |dO|. We build P′ from P as follows: (1) start by copying all the pebblings in P to P′, i.e., P′ ← P; (2) for any tagged input vertex v ∈ dI, the (only) firing operation involving v, P(R3NR, v), in P is replaced with P(R1, v) in P′, with a total additional cost of |dI|; (3) for any tagged output vertex v ∈ dO, the (only) firing operation involving v, P(R3NR, v), in P is supplemented with the pebbling P(R2, v) in P′, with a total additional cost of |dO|.

We now prove inequality (2.4) by building a pebbling sequence P for C from an optimal pebbling sequence P′ for C′, with no additional cost. Let the cost of P′ be Q(C′). P, with cost Q(C′), is built from P′ as follows: (1) start by copying all the pebblings in P′ to P, i.e., P ← P′; (2) for any input vertex v ∈ dI, the first pebbling of v with the rule R1 in P′ is replaced in P with P(R3NR, v) followed by P(R2, v). This produces a valid pebbling sequence P for C with cost Q(C′). □
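In practice, the theorem is typically used in the tagging direction: a lower bound obtained on the tagged CDAG C′ is converted into one for C. A short symbolic illustration follows (LB′ is our notation for a bound on C′, introduced only for this example):

    % Tagging direction of Theorem 3: if a lower bounding technique applied to the
    % tagged CDAG C' yields Q(C') >= LB', then for the original CDAG C
    \[
      Q(C) \;\ge\; Q(C') - |dI| - |dO| \;\ge\; LB' - |dI| - |dO|,
    \]
    % so LB' - |dI| - |dO| is a valid data movement lower bound for C.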

2.5 Related Work

As discussed earlier, several works have focused on deriving data movement lower bounds for the restricted model that prohibits recomputation. Bilardi et al. [9] defined, under the relaxed model that allows recomputation, a related quantity and related it to the problem of data movement complexity. Bilardi and Peserico proposed the width decomposition framework in [8] to provide a machine-independent characterization of the temporal locality of computations for both recompute and no-recompute based models.

The notion of decomposition has been used previously to obtain tighter lower bounds. Bilardi and Preparata [12] showed, for their specific lower bounding technique based on close-dichotomy-size, that tighter results can be obtained through disjoint decomposition of DAGs. Our decomposition theorem is more general in the sense that it is independent of the lower bounding technique used, and is applicable to both the pebble game models that allow or disallow recomputation. Further, we extend the idea to a special case of non-disjoint decomposition in order to derive tighter bounds for certain algorithms.

The idea of tagging/untagging inputs and outputs was utilized by Ballard et al. [4] to obtain non-trivial bounds for various linear algebra algorithms. However, their technique is based on the geometric model abstraction, which is very different from the CDAG based abstraction that is used in our work.

CHAPTER 3

Lower Bounding Technique Based on Graph Min-cuts

The red-blue pebble game serves as a nice abstraction to model the problem of data movement complexity of algorithms. However, the combinatorial nature of the pebble game model poses difficulties in directly using this abstraction for reasoning about tight lower bounds for CDAGs. Hong & Kung hence formulated the problem of finding lower bounds as a graph partitioning problem, called S-partitioning, and derived tight lower bounds for several algorithms. Section 3.1.1 provides a summary of Hong & Kung's S-partitioning technique. In this chapter, we propose a novel min-cut based lower bounding technique to bound the data movement complexity of CDAGs. Section 3.3 derives data movement lower bounds for different algorithms using our newly developed technique.

3.1 Background

3.1.1 S-partitioning technique

This subsection describes Hong & Kung’s S-partitioning technique for the RB pebble game model.

Definition 12 (Dominator set). Given a CDAG C = (I,V,E,O) and a vertex-set Vs ⊆ V, a subset Ds ⊂ V is a dominator set for Vs if Ds contains at least one vertex of every path from I to Vs.

A vertex-set Vs could have several possible valid dominator sets.

Definition 13 (Minimum set). Given a CDAG C = (I,V,E,O) and a vertex-set

Vs ⊆ V , minimum set of Vs, Ms, is the set of vertices in Vs with all its successors

outside Vs.

Definition 14 (S-partition). Given a CDAG C = (I,V,E,O), an S-partition of C is a sequence of h subsets of V, (V1, ..., Vh), that satisfies the following properties:

P1 i ≠ j ⇒ Vi ∩ Vj = ∅, and V1 ∪ V2 ∪ · · · ∪ Vh = V.

P2 There is no cyclic dependence between the subsets.

P3 For every subset Vi, 1 ≤ i ≤ h, there exists a dominator set Di such that |Di| ≤ S.

P4 For every subset Vi, 1 ≤ i ≤ h, the size of its minimum set satisfies |Mi| ≤ S.

Hong & Kung showed that corresponding to any pebbling sequence P, obtained by playing the RB pebble game with S red pebbles, a 2S-partition can be constructed with a tight relationship between the size of the 2S-partition and the cost of P.

Theorem 4 ([27]). Any pebbling sequence P obtained by playing the RB pebble game

on a CDAG C using at most S red pebbles is associated with a 2S-partition H of C

such that S × h ≥ q(P,S) ≥ S × (h − 1), where q(P,S) is the cost of P and h = |H|.

Lemma 2 ([27]). Let H(2S) be the minimum number of vertex sets of any 2S-

partition of a CDAG C. Then the data movement complexity of C with recomputation,

Q ≥ S × (H(2S) − 1).
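Spelled out, the step from Theorem 4 to Lemma 2 is the following one-line argument (our phrasing of the standard reasoning):

    % Let P be an optimal pebbling sequence for C with S red pebbles, so q(P, S) = Q.
    % By Theorem 4 it is associated with a 2S-partition with some number h of vertex
    % sets, and h >= H(2S) by minimality of H(2S). Hence
    \[
      Q \;=\; q(\mathcal{P}, S) \;\ge\; S \times (h - 1) \;\ge\; S \times (H(2S) - 1).
    \]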

This key lemma was used by Hong & Kung in [27] to derive tight lower bounds for several algorithms.

3.1.2 Definitions from graph theory

In this subsection, we summarize a list of relevant terminologies and definitions

from graph theory.

Definition 15 (Cut). Given a DAG G = (V,E), a cut, X = (X,Y ), of G is defined

as any partition of V into two disjoint subsets X and Y = V \ X such that X and Y

are not empty sets.

Definition 16 (s-t cut). Given a DAG G = (V,E) and two vertices s, t ∈ V , an s-t cut is a cut X = (X,Y ) of G such that s ∈ X and t ∈ Y .

Definition 17 (Cut-set). Given a cut X = (X,Y ) of a DAG G = (V,E), edge-

cut-set of X is the set EX = {(u, v)|u ∈ X ∧ v ∈ Y } ⊆ E; and vertex-cut-set of

X is the set VX = {u|∃v[(u, v) ∈ EX ]} ⊆ X. Any vertex v ∈ VX is referred to as a

cut-vertex.

Definition 18 (Vertex-cut-size). Vertex-cut-size of a cut of an unweighted DAG is

defined as the cardinality of its vertex-cut-set.

Definition 19 (Min-cut). Given an unweighted DAG G = (V,E), a vertex-min-cut of G is a cut (or an s-t cut) with minimum vertex-cut-size. Given an unweighted DAG G = (V,E), an edge-min-cut of G is a cut (or an s-t cut) with minimum edge-cut-size.

3.2 Min-cut Based Lower Bounding Technique

Our approach of using min-cuts for determining lower bounds was motivated by the observation that Hong & Kung's 2S-partitioning approach does not account for the internal structure of a CDAG, but essentially focuses only on the boundaries of the vertex sets of a partition. In contrast, our min-cut based approach captures internal space requirements using the abstraction of wavefronts. This is illustrated in the following example. Consider the diamond DAG C = ({v1},V,E,{v2}) in Figure 3.1. For any value of S ≥ 1, the set V \ {v1} of all vertices except the one input vertex forms a valid vertex set of a 2SNR-partition of C, providing a 2SNR-partition of size H(2S) = 1, hence leading to a trivial lower bound of zero. In general, if any CDAG contains very few inputs and outputs (specifically, fewer than 2S), a trivial 2S-partition with just a single vertex set can be formed, leading to a lower bound of zero.

Figure 3.1: Diamond DAG. Vertices shaded in black and grey represent input and output vertices, respectively.

We describe our approach below and derive a tight bound for the diamond DAG in Section 3.3.

Definition 20 (Convex cut). Given a DAG G = (V,E) and a vertex v ∈ V , a convex

cut of G w.r.t. v is a cut Xv = (Xv,Yv) with the following properties:

1. Xv ∪ Yv = V and Xv ∩ Yv = ∅.

2. {v} ∪ Anc(v) ⊆ Xv.

3. Desc(v) ⊆ Yv.

4. E ∩ (Yv × Xv) = ∅.

Definition 21 (Convex vertex-min-cut). Given an unweighted DAG G = (V,E) and a vertex v ∈ V, a convex vertex-min-cut of G w.r.t. v, denoted Xv(G), is a convex cut of G with minimum vertex-cut-size.

Definition 22 (Convex edge-min-cut). Given an unweighted DAG G = (V,E) and a vertex v ∈ V , a convex edge-min-cut of G w.r.t. v is a convex cut of G with minimum

edge-cut-size.

Consider a pebbling sequence P for a CDAG C = (I,V,E,O) under the RBNR model, and consider the time-stamp tx just after the firing of a vertex x ∈ V in P. Let X denote the set of vertices that were fired at or before time tx in P (note that x ∈ X). If any vertex v ∈ X has at least one of its successors s not yet fired at time tx, then v is necessarily needed at a later time, when s is fired. Since the RBNR game allows a vertex to be fired only once, the computation v, which was computed before time tx, has to be "saved" for later use (for computing s). This set of "alive" nodes, which were computed before tx and will be needed later, gives rise to a wavefront. We call this wavefront the schedule wavefront, which is formally defined below.

Definition 23 (Schedule wavefront). Given a pebbling sequence P for a CDAG C =

(I,V,E,O) and a vertex x ∈ V , a schedule wavefront of P w.r.t. x, Wx(P), is the

vertex x and the set of vertices that were fired before x in P with at least one of their

successors fired after x, i.e., if tx represents the time-stamp at which a vertex x is

fired in P, then

Wx(P) = {u|tu ≤ tx ∧ ∃v[(u, v) ∈ E ∧ tv ≥ tx]}
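As a concrete illustration of Definition 23, the short Python sketch below computes Wx(P) from a given firing order; the toy DAG and the firing order in the driver are assumptions of the example only.

```python
def schedule_wavefront(edges, order, x):
    """Return W_x(P): the vertex x together with every vertex fired at or before x
    that has at least one successor fired after x (time-stamps are positions in order)."""
    t = {v: i for i, v in enumerate(order)}          # firing time-stamp of each vertex
    tx = t[x]
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    wavefront = {x}
    for u in order[: tx + 1]:                        # vertices fired at or before x
        if any(t[v] > tx for v in succ.get(u, [])):
            wavefront.add(u)
    return wavefront

if __name__ == "__main__":
    # A 2x2 diamond (cf. Figure 3.1): a -> b, a -> c, b -> d, c -> d
    edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
    order = ["a", "b", "c", "d"]                     # one valid firing order
    print(schedule_wavefront(edges, order, "b"))     # {'a', 'b'}: both are live after b fires
```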

Size of a schedule wavefront, |Wx(P)|, essentially signifies the space requirement

for the program just after the operation x has been executed. Note the following close

association of a schedule wavefront in a pebbling sequence with the vertex-cut-set of

a convex cut in a DAG:

Proposition 1. Let C = (I,V,E,O) be a CDAG with its DAG G = (V,E). Let P be a pebbling sequence for C. If Xx ⊆ V is the set of vertices that were fired at or before a vertex x ∈ V and Yx = V \ Xx is the set of vertices that are fired after x in

P, then the partition Xx = (Xx,Yx) forms a valid convex cut w.r.t. x for G whose

vertex-cut-set is the set Wx(P).

Let W(G, x) denote the vertex-cut-set of the convex vertex-min-cut 𝒳_x(G). Since a convex vertex-min-cut is a cut with minimum vertex-cut-size over all possible convex cuts

of G = (V,E) w.r.t. x, we have |W(G, x)| ≤ |Wx(P)|. If W(G) = max_{v∈V} |W(G, v)| denotes the minimum vertex-cut-size maximized over all vertices in G, then the fol-

lowing lemma provides a relationship between the convex vertex-min-cut and the

data movement complexity of a CDAG.

Lemma 3. Given a CDAG C = (∅,V,E,O) with DAG G = (V,E) and no input vertices, for any vertex x ∈ V , Q ≥ 2 × (|W(G, x)| − S). In particular, Q ≥ 2 ×

(W(G) − S), where, S is the number of red pebbles.

Proof. Consider an optimal pebbling sequence P for C of cost Q. Let the schedule wavefront induced by the vertex x in P be Wx(P). Consider the timestep tx when x is fired in P. Since every vertex in Wx(P) has a successor that has not been fired yet, each of them must have either a red or a blue pebble on it at time tx. Since we have only S red pebbles, at least |Wx(P)| − S of them must have a blue pebble on them. Let VB denote the set of these vertices with blue pebbles. Each vertex in

VB will have to be red-pebbled with rule R1 at some point in the future, incurring at least |Wx(P)| − S loads after time tx. In addition, as C has no input vertices, the vertices in VB belong to the operation set of C. A red pebble was placed on each vertex in VB when it was fired in P, and it was later blue-pebbled using rule R2, leading to at least |Wx(P)| − S stores before timestep tx. Hence,

Q ≥ 2 × (|Wx(P)| − S) ≥ 2 × (|W(G, x)| − S). □

The above lemma bounds the schedule wavefront size, and thus the space require- ment, at a single time-stamp during an execution. It is of interest to capture the space requirements at multiple time-stamps so as to obtain tighter lower bounds.

The following theorem makes use of the decomposition theorem to bound the size of the schedule wavefront at multiple locations.

Theorem 5. Given a CDAG C = (I,V,E,O) and sets dI ⊆ I and dO ⊆ O, let I′ = I \ dI, O′ = O \ dO, V′ = V \ (dI ∪ dO), E′ = E ∩ (V′ × V′), and C′ = (I′,V′,E′,O′). Let V′_1, V′_2, ..., V′_n be an arbitrary (not necessarily acyclic) pairwise-disjoint partitioning of V′ (i.e., i ≠ j ⇒ V′_i ∩ V′_j = ∅ and ⋃_{i=1}^{n} V′_i = V′) and C′_1, C′_2, ..., C′_n be the induced sub-CDAGs of C′ (i.e., I′_i = I′ ∩ V′_i, E′_i = E′ ∩ (V′_i × V′_i), O′_i = O′ ∩ V′_i). Then the data movement complexity of C, Q(C), satisfies

Q(C) ≥ ∑_{i=1}^{n} 2 × (W(G′_i) − S) + |dI| + |dO|

where G′_i represents the DAG of the sub-CDAG C′_i, for i = 1, ..., n.

Proof. From Corollary 1, we have,

Q(C) ≥ Q(C′) + |dI| + |dO|. (3.1)

From Theorem 1, we have,

Q(C′) ≥ ∑_{i=1}^{n} Q(C′_i). (3.2)

From Lemma 3,

Q(C′_i) ≥ 2 × (W(G′_i) − S). (3.3)

Hence from inequalities (3.1), (3.2) and (3.3),

Q(C) ≥ ∑_{i=1}^{n} 2 × (W(G′_i) − S) + |dI| + |dO|. □

3.3 Analytical Lower Bounds Using Min-cut Approach

In this section, we provide lower bounds for a few algorithms derived using our

min-cut based approach.

3.3.1 Diamond DAG

This subsection derives a tight lower bound for Diamond DAG using our min-cut

based lower bounding technique.

Theorem 6 (Lower bound for data movement complexity of diamond DAG). Given

CDAG C = (I,V,E,O) for an n × n diamond DAG, the data movement complexity of C, Q, satisfies Q ≥ (n − 2S)²/S = Ω(n²/S), where, S is the number of red pebbles.

Proof. Consider an optimal pebbling sequence P using RBNR pebble game for C with cost Q. Consider a vertex x ∈ V at location (r, c) (i.e., in row-r and column-c).

Consider a (rotated) bow-tie shaped induced sub-DAG Gx = (Vx,Ex) around x, as shown in Figure 3.2, that consists of vertices that lie inside the two m × m squares, whose corners are at [(r, c − m + 1), (r + m − 1, c)] and [(r − m + 1, c), (r, c + m − 1)].

Let Cx be the induced sub-CDAG of Vx in C.

Figure 3.2: Diamond DAG of size n × n with a bow-tie shaped sub-DAG of width m around the vertex x at location (r, c).

There are 2m vertex-disjoint paths from the ancestor set of x to its descendant set. Any convex cut of Gx w.r.t. x should have at least one vertex from each of these

2m disjoint paths in its vertex-cut-set. Hence, the vertex-cut-size of a convex vertex-min-cut of Gx w.r.t. x satisfies |W(Gx, x)| ≥ 2m. From Lemma 3, we have, the data movement

complexity of sub-CDAG Cx, Q(Cx) ≥ 2 × (2m − S).

A rectangle of size n × 2m can be partitioned into ⌊(n − m)/m⌋ disjoint bow-ties.

An n×n diamond DAG can itself be partitioned into ⌊n/2m⌋ disjoint rectangles of size n×2m. Hence, an n×n diamond DAG can be partitioned into ⌊(n − m)/m⌋ × ⌊n/2m⌋ disjoint bow-tie shaped sub-CDAGs. Hence, using Theorem 5, we get

Q ≥ 2 × (2m − S) × ⌊(n − m)/m⌋ × ⌊n/2m⌋.

By setting m = S, we get:

Q ≥ 2S × ⌊(n − S)/S⌋ × ⌊n/2S⌋
  ≥ 2S × ((n − S − S + 1)/S) × ((n − 2S + 1)/2S)
  ≥ (n − 2S)²/S.

This bound is tight within a factor of 2, as the following theorem shows.

Theorem 7 (Upper bound for data movement complexity of diamond DAG). There

exists an implementation for the n × n diamond DAG with a data movement cost of

2n2/S.

Proof. To prove the upper bound, we consider the following pebbling implementation,

where Rx(i, j) represents application of rule Rx on the vertex at location (i, j).

for b = 1 : n step m
    for j = 1 : n
        for i = b : min(b + m − 1, n)
            if (i = b and i ≠ 1): R1(i − 1, j)
            R3NR(i, j)
            if (i = b and i ≠ 1): R4(i − 1, j)
            if (j ≠ 1): R4(i, j − 1)
            if (i = b + m − 1 and i ≠ n): R2(i, j)

This implementation uses exactly S = m + 2 red pebbles. For each of the upper

⌈n/m⌉ − 1 bands of width m, it performs n loads; for each of the lower ⌈n/m⌉ − 1

bands of width m, it performs n stores. This gives an overall data movement cost of 2n(⌈n/(S − 2)⌉ − 1). □
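As a sanity check on this count, the following Python sketch replays the banded schedule above and tallies the R1 (load) and R2 (store) applications; the particular values of n and S in the driver are arbitrary illustrative choices.

```python
import math

def diamond_banded_io(n, S):
    """Count the R1 and R2 applications of the banded pebbling in the proof of
    Theorem 7, using bands of width m = S - 2 rows."""
    m = S - 2
    loads = stores = 0
    for b in range(1, n + 1, m):                 # bands of rows
        for j in range(1, n + 1):                # columns, left to right
            for i in range(b, min(b + m - 1, n) + 1):
                if i == b and i != 1:
                    loads += 1                   # R1: re-load last row of previous band
                if i == b + m - 1 and i != n:
                    stores += 1                  # R2: save last row of this band
    return loads + stores

if __name__ == "__main__":
    n, S = 200, 22
    print(diamond_banded_io(n, S),               # 3600
          2 * n * (math.ceil(n / (S - 2)) - 1))  # 3600: matches the closed form above
```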

3.3.2 Fast Fourier transform (FFT)

In this subsection, we derive a tight lower bound that matches the upper bound

for FFT.

Theorem 8 (Lower bound for data movement complexity of FFT). For the n-point

FFT, its data movement complexity Q satisfies

Q ≥ (2n log(n)/log(S)) × (1 − ϵ),

where, S is the number of red pebbles, and ϵ tends to 0 for large values of n, S, and

n/S.

Proof. Consider the CDAG Cm for an m-point FFT, with the input vertices removed.

Let Gm = (Vm,Em) be the DAG of Cm. Vm contains a set of m predecessor-free

vertices A and m successor-free vertices B. A DAG of an FFT computation has the

following properties: There is a non-empty vertex-disjoint path from each predecessor-

free vertex to a distinct successor-free vertex. Also, each successor-free vertex is

dependent on (i.e., has a non-empty path from) all the predecessor-free vertices.

Consider any optimal pebbling sequence Pm for Cm of cost Qm. Let t_o be the time-stamp at which the first successor-free vertex o ∈ B is fired in Pm. Let X be the set of vertices fired strictly before t_o, and Y = Vm \ X be the set of vertices fired during or after t_o. Since each successor-free vertex is dependent on the m predecessor-free vertices, A ⊆ X. By construction, B ⊆ Y. Since Gm contains m vertex-disjoint paths from A to B, the size of the corresponding schedule wavefront satisfies |Wo(P)| ≥ m. Hence, m is a lower bound on the size of the schedule wavefront at the time of firing the first successor-free vertex in any pebbling sequence for Cm. Hence, Qm ≥ 2 × (m − S).

A CDAG for the n-point FFT (of height log(n), with input vertices removed) can be decomposed into ⌊n/m⌋ × ⌊log(n)/log(m)⌋ disjoint sub-CDAGs, each equivalent to Cm. By setting m = S log(S), we get:

Q ≥ ⌊n/(S log(S))⌋ × ⌊log(n)/log(S log(S))⌋ × 2 (S log(S) − S)
  ≥ (n/(S log(S)) − 1) × (log(n)/log(S log(S)) − 1) × 2 (S log(S) − S)
  ≥ (2n log(n)/log(S)) × (1 − ϵ).

This bound is tight as a well-known tiled version of FFT consists of decomposing it into log(n)/ log(S) stages of n/S sub-FFTs of size S. For each sub-FFT the S inputs are loaded, the computation is spill-free, and the S outputs are stored, leading to a data movement cost of 2n log(n)/ log(S).
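The following small Python helper works through this arithmetic for one illustrative choice of n and S, comparing the cost of the tiled schedule with the leading term of the bound of Theorem 8.

```python
import math

def fft_tiled_io(n, S):
    """Data movement of the tiled FFT sketched above: log(n)/log(S) stages, each with
    n/S spill-free sub-FFTs of size S that load S inputs and store S outputs."""
    return math.log(n, S) * (n / S) * 2 * S      # = 2 n log(n) / log(S)

def fft_lower_bound(n, S):
    """Leading term of Theorem 8 (the (1 - eps) factor is dropped)."""
    return 2 * n * math.log(n) / math.log(S)

if __name__ == "__main__":
    n, S = 1 << 20, 1 << 10
    print(fft_tiled_io(n, S), fft_lower_bound(n, S))   # both 4194304.0 here
```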

3.3.3 9-point Jacobi 2D (J2D)

We consider the Jacobi computation that computes, at time step t, the value of every point of an n × n array A in terms of the nine points of its 3 × 3 neighborhood computed at time step t − 1, as follows:

A[i][j] = (B[i-1][j-1] + B[i-1][j] + B[i-1][j+1] +
           B[i][j-1]   + B[i][j]   + B[i][j+1]   +
           B[i+1][j-1] + B[i+1][j] + B[i+1][j+1]) * cnst;

Theorem 9 (Lower bound for data movement complexity of J2D). For the 9-point

Jacobi computation of size n × n with T time steps, its data movement complexity,

Q, satisfies

Q ≥ 0.75 n²T/√S,

where S is the number of red pebbles.

Proof. Jacobi 2D has a three-dimensional iteration space of size n² × T, not counting the inputs from the initial time step. Consider a partition of the iteration space

(and the corresponding CDAG) into cubes (i.e., into sub-CDAGs) of size m3. We

have ⌊n/m⌋2 × ⌊T /m⌋ such cubes. For each cube, let x be the vertex at the center of

the cube. The ancestor set of x contains the lower face of the cube. The descendant

set of x contains the upper face. We have m2 disjoint vertical paths connecting each

point in the lower face with a distinct point in the upper face of the cube. Vertex-cut-

set, W(Gx, x) (where Gx is the DAG of the current sub-CDAG), of any convex cut

w.r.t. x must have at least one vertex from each of those m2 disjoint paths. Hence,

|W(Gx, x)| ≥ m². Hence, from Lemma 3, the data movement complexity of each

sub-CDAG is at least 2 (m² − S).

Considering all ⌊n/m⌋² × ⌊T/m⌋ sub-CDAGs, using Theorem 5, we get the data

movement complexity of the full CDAG,

Q ≥ ⌊n/m⌋² × ⌊T/m⌋ × 2 (m² − S) + |I| + |O|.

We choose an m that maximizes the right-hand side of the above inequality. By setting m = √(3S),

Q ≥ ⌊n/√(3S)⌋² × ⌊T/√(3S)⌋ × 2 ((√(3S))² − S) + n² + n²
  ≥ (n/√(3S))² × (T/√(3S)) × 4S
  = 4n²T/(3√(3S))
  ≥ 0.75 n²T/√S.

Theorem 10 (Upper bound for data movement complexity of J2D). There exists an

implementation for 9 point Jacobi computation of size n × n with T time steps with

a data movement cost of

(8n²T/√S) × (1 + ϵ) + 2n²,

where ϵ tends to zero for n ≫ √S.

Proof. Jacobi 2D computation contains a three-dimensional iteration space of size

n2 ×T , with dependencies along directions (1, 0, 0)T , (1, −1, 0)T , (1, 1, 0)T , (1, 0, −1)T ,

(1, 0, 1)T , (1, −1, −1)T , (1, −1, 1)T , (1, 1, −1)T and (1, 1, 1)T . Consider a skewed ver-

sion of this iteration space such that dependencies are along directions (1, 1, 1)T ,

(1, 0, 1)T , (1, 2, 1)T , (1, 1, 0)T , (1, 1, 2)T , (1, 0, 0)T , (1, 0, 2)T , (1, 2, 0)T and (1, 2, 2)T .

This skewed iteration space can be partitioned into vertical 3D columns, each of height at most T and cross-section √S × √S. (Columns near the boundaries are shorter and progressively get taller until they reach a height of T as we move towards the center.) There are ⌈(n + T)/√S⌉² such columns. Since the dependencies are only along for-

ward directions in this skewed iteration space, starting from the first column at corner

(0, 0, 0)T , each column can be computed consecutively in lexicographic order. We refer to each column using its index (i, j). Each column needs to load two planes of data

computed by column (i − 1, j) and two planes of data computed by column (i, j − 1).

Similarly, each column needs to store four planes of data whose sizes are the same as those of the loads. The size of the plane loaded from column (i − 1, j) is min(T√S, i√S, n − i√S) and the size of the plane loaded from column (i, j − 1) is min(T√S, j√S, n − j√S). (Boundary element reads are also captured here.) Due to symmetry of the skewed space, it can be shown that this leads to total loads and stores of 8 ⌈n/√S⌉² T√S.

There are n² inputs and outputs which need to be loaded/stored just once. Hence,

the data movement complexity Q for J2D satisfies

Q ≤ 8 × ⌈n/√S⌉² × T√S + 2n²
  ≤ 8 × (n/√S + 1)² × T√S + 2n²
  ≤ (8n²T/√S) × (1 + ϵ) + 2n²,

where ϵ tends to zero for n ≫ √S. □

3.3.4 Conjugate Gradient (CG)

Many compute intensive applications involve the numerical solution of partial

differential equations (PDEs), e.g., solving a heat equation on a two-dimensional plate.

The first step in numerical solution of such problems involves discretization, where

the continuous domain is reduced to discrete points at regular intervals. The problem

is then solved only at these discrete points, which form the computational grid. Solution

to such discretized problems involves solving a system of linear equations at each

timestep till convergence. These linear systems, of the form Ax = b, are typically

made of a banded sparse matrix A with a repetitive pattern. Hence, in practice, the elements of A are not explicitly stored. Instead, their values are directly embedded in the program as constants, thus eliminating the space requirement and the associated

I/O cost for the matrix.

The conjugate gradient method [26] is one of several popular methods to iteratively solve such linear systems. CG maintains three vectors at each timestep: the approximate solution x, its residual r = b − Ax, and a search direction p. At each step, x is

improved by searching for a better solution in the direction p. Each iteration of CG

involves one sparse matrix-vector product, three vector updates, and three vector

dot-products. The pseudo-code is shown in Algorithm 3.1.

1  Function CG(x0)
   Input : Initial guess x0
   Output: x
2    x ← x0;
3    p ← r ← b − Ax;
4    repeat
5      v ← Ap;           // SpMV
6      b ← (r.r);        // Dot product
7      a ← b/(p.v);      // Dot product
8      x ← x + ap;       // AXPY
9      r ← r − av;       // AXPY
10     g ← (r.r)/b;      // Dot product
11     p ← r + gp;       // AXPY
12   until (r.r) is small;
13   return x;

Algorithm 3.1: Conjugate Gradient method.
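For concreteness, a NumPy transcription of Algorithm 3.1 is sketched below. The dense matrix-vector product and the small tridiagonal test system are illustrative assumptions only; as discussed above, in practice A is a banded operator whose entries are hard-coded rather than stored.

```python
import numpy as np

def cg(A, b, x0, tol=1e-12, max_iters=1000):
    """NumPy sketch of Algorithm 3.1 (variable names follow the pseudocode; the
    scalar beta stands in for the reused name b of Line 6)."""
    x = x0.copy()
    r = b - A @ x                      # Line 3
    p = r.copy()
    for _ in range(max_iters):
        v = A @ p                      # Line 5:  SpMV (dense here, for illustration)
        beta = r @ r                   # Line 6:  dot product
        a = beta / (p @ v)             # Line 7:  dot product
        x = x + a * p                  # Line 8:  AXPY
        r = r - a * v                  # Line 9:  AXPY
        g = (r @ r) / beta             # Line 10: dot product
        p = r + g * p                  # Line 11: AXPY
        if r @ r < tol:                # Line 12
            break
    return x

if __name__ == "__main__":
    # Small SPD system standing in for a discretized PDE operator (illustrative only).
    n = 50
    A = (np.diag(np.full(n, 2.0))
         + np.diag(np.full(n - 1, -1.0), 1)
         + np.diag(np.full(n - 1, -1.0), -1))
    b = np.ones(n)
    x = cg(A, b, np.zeros(n))
    print(np.linalg.norm(A @ x - b))   # small residual, e.g. ~1e-6 or below
```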

Theorem 11 (Lower bound for data movement complexity of CG). For a d-dimensional

grid of size n^d, the minimum data movement cost to solve the linear system using CG

with S red pebbles, Q(S), satisfies Q(S) ≥ 6n^dT, when n^d ≫ S; where, T represents the number of outer loop iterations.

Proof. Let C and G = (V,E) denote the CDAG and its DAG, respectively, of CG with the inputs removed. We will use non-disjoint decomposition (refer Definition 11)

to partition C into 2T +1 sub-CDAGs. Let v(s, t) denote the vertex in V correspond-

ing to computation of a scalar value s at timestep t in Algorithm 3.1. For example,

v(a, 1) denotes computation of the scalar a at Line 7 during the first timestep. Con-

sider the subset V ′ = {v(a, 1), v(g, 1), v(a, 2), v(g, 2), . . . , v(a, T ), v(g, T )} ⊂ V . Let

{C′(a, 1), C′(g, 1), C′(a, 2), C′(g, 2), ..., C′(a, T), C′(g, T), C′′} be the partition obtained through nondisjoint decomposition of C w.r.t. V′. Note that the n^d vertices corresponding to the computation of the n^d elements of the vector r at Line 9 are shared between sub-CDAGs C′(a, t) and C′(g, t), for t = 1, ..., T. Also, n^d vertices corresponding to the computation of the vector p at Line 11 are shared between sub-CDAGs C′(g, t) and C′(a, t + 1), for t = 1, ..., T − 1.

Consider any sub-CDAG C′(a, t), for t = 1, ..., T, and the vertex x = v(a, t) in C′(a, t). The 2n^d predecessor vertices of x, corresponding to elements of the vectors

p and v, have vertex-disjoint paths to Desc(x) (to the vertices corresponding to

computations in Lines 8 and 9). This gives us a convex vertex-min-cut of vertex-

cut-size |W(G′(a, t), x)| ≥ 2n^d. Hence, the data movement complexity of the sub-CDAG C′(a, t) with S + 1 red pebbles satisfies Q(C′(a, t), S + 1) ≥ 2 × (2n^d − S − 1).

Similarly, considering any sub-CDAG C′(g, t), for t = 1,...,T , and the ver-

tex y = v(g, t) in C′(g, t), we obtain a convex vertex-min-cut with vertex-cut-size

|W(G′(g, t), y)| ≥ n^d, due to the vertex-disjoint paths from the predecessors corresponding to the computation of r to Desc(y) (to the vertices corresponding to the computation of p at Line 11). Hence, Q(C′(g, t), S + 1) ≥ 2 × (n^d − S − 1).

The original CDAG has n^d inputs and n^d outputs. By applying Theorem 2 on the

sub-CDAGs, we obtain:

Q(S) ≥ T × (2(2n^d − S − 1)) + T × (2(n^d − S − 1)) + n^d + n^d
     = 2T × (3n^d − 2S − 2) + 2n^d,

which tends to 6n^dT when n^d ≫ S. □

Theorem 12 (Upper bound for data movement complexity of CG). Algorithm 3.1

incurs a data movement cost of 12n^dT + 5n^d.

Proof. Lines 2 and 3 incur a cost of 5n^d to load/store the vectors x0, x, b, r and p. Line 6 incurs n^d data movements per time-step to load r. The SpMV operation at Line 5 is equivalent to a single time-step of a stencil computation, which can be computed using a spatially tiled implementation with a data movement cost of n^d (to load p) + n^d (to store v) + n^d(d − 1)/S^(1/(d−1)) (for inter-tile spills), which tends to 2n^d for large values of S. Hence, Line 5 requires 2n^d data movements per time-step. In addition, the computation of Line 7 can be interleaved with Line 5, providing a cost of 2n^d to compute both Lines 5 and 7. Line 8 requires 3n^d data movements. Lines 9 and 10 (and Line 12) can be interleaved, providing a cost of 3n^d per time-step. Finally, Line 11 requires 3n^d data movements. So, in total, Algorithm 3.1 has a data movement cost of 12n^dT + 5n^d. □

3.3.5 Summary

Table 3.1 summarizes lower and upper bound results derived in this section.

Algorithm              Lower bound            Upper bound
Diamond DAG            n²/S                   2n²/S
FFT                    2n log(n)/log(S)       2n log(n)/log(S)
9-point Jacobi 2D      0.75 n²T/√S            8 n²T/√S
Conjugate gradient     6n^dT                  12n^dT

Table 3.1: Min-cut based lower bounds and corresponding upper bounds for different algorithms. (Lower order terms are ignored in the summary.)

Compared to Hong & Kung’s S-partitioning technique, our method is simpler to

apply since, in the case of the S-partitioning technique, careful reasoning over all possible

S-partitions has to be made to obtain valid tighter lower bounds. In contrast, in

our min-cut based technique, it is sufficient to consider min-cuts with respect to any

vertex to obtain a valid bound, and carefully choosing a few vertices would provide a

tighter bound. We noted earlier that Hong & Kung’s original S-partitioning technique

would only provide trivial lower bounds for CDAGs with small input and output set.

We would also like to note that our min-cut based technique developed in this chapter

does not perform well on CDAGs that have high fan-out input vertices and very lit-

tle reuse of vertices in the operation set. This includes matrix-matrix multiplication

and many other numerical linear algebra applications. Hong & Kung’s S-partitioning

technique and geometric reasoning based techniques [29, 4, 15, 19] are well suited for

these applications.

3.4 Related Work

Hong & Kung provided the first characterization of the data movement complexity

problem using the red-blue pebble game and 2S-partitioning of CDAGs [27]. Several

works followed Hong & Kung’s work on data movement complexity in deriving lower

bounds on data accesses. Aggarwal et al. [2] provided several lower bounds for sorting algorithms. Savage [45, 46] presented a version of Hong & Kung’s results using the

notion of S-span to simplify the proofs and extend it to hierarchical memories. This

model has been used in several works [43, 48]. Irony et al. [29] provided a new geomet-

ric reasoning based proof of the Hong & Kung result on data movement complexity of

matrix multiplication and developed lower bounds on communication for sequential

and parallel matrix multiplication. Demmel et al. [4, 3, 17, 51, 15, 33] generalized

the technique in [29] and developed lower bounds as well as optimal algorithms for several linear algebra computations including QR and LU decomposition and all-pairs shortest paths problem. Ranjan et al. [42] develop tighter bounds for binomial graph

and FFT using a new technique called boundary flow technique that is related to the

vertex isoperimetric parameter of graphs. Bilardi et al. [9, 8] developed the notion of

space complexity and related it to data access complexity.

Bilardi and Preparata [12] developed the notion of the closed-dichotomy size, β(w)

of a DAG G that is used to provide a lower bound on the data movement complexity in

those cases where recomputation is not allowed. The closed-dichotomy technique can

be summarized as follows: A subset of vertices W ⊆ V in a DAG G = (V,E) is called

a closed set if, whenever a vertex v ∈ W, its predecessors also belong to W.

Bout(W ) is defined as the set of vertices in W with at least one successor in V \W . Fi- nally, the closed-dichotomy size, β(w) = min{|Bout(W )| : W ⊆ V,W is closed, |W | =

w}. This notion is very closely related to our schedule wavefronts and graph min-cuts.

For a given pebbling sequence P, we define a schedule wavefront w.r.t. a vertex x as

the set of vertices that were fired before x and have at least one successor fired after x. In [12], the set of vertices W fired before x form a closed set and Bout(W ) can be considered to be equivalent to our schedule wavefront. (Note that Bout(W ) is not defined w.r.t. a particular vertex, but w.r.t. a closed set W .) The two techniques have the following key difference: We define our schedule wavefront Wx(P) w.r.t. any

vertex x and seek to find a lower bound on |Wx(P)|; while in [12], an integer value w is fixed and Bout(W ) is defined over all possible closed subsets W of size w and they

seek a minimum over |Bout(W )|. The need to find a minimum over all closed sets

W such that |W | = w makes their formulation combinatorial in nature. In contrast, a close relationship between our schedule wavefront and graph min-cuts allows us to develop a polynomial-time heuristic (in the following chapter) to automatically analyze lower bounds of CDAGs.

The purposes of the two techniques are quite different. We derived data movement lower bounds of computations using our technique in a two-level serial machine model and intend to extend it to develop a practical heuristic (in Chapter 4) in order to be

able to automatically derive lower bounds of computations. Bilardi and Preparata [12]

extended their technique to parallel machine model and successfully employed their

technique to analytically derive lower bounds on the slowdown of simulation of DAGs on

an n-processor array by a p-processor array (for p ≤ n).

CHAPTER 4

Automatic Min-cut Based Lower Bounds

In Chapter 3, we developed a min-cut based lower bounding technique, where convex vertex-min-cuts were used to bound the data movement complexity of CDAGs.

In this chapter, we intend to develop techniques to automate the process of finding

(non-parametric) lower bounds for any arbitrary CDAG.

In Section 4.1, we develop a method to automatically compute convex vertex-min-cuts of DAGs using a standard max-flow algorithm. Section 4.2 details techniques to automatically extract a CDAG from any input program, and presents a heuristic for automatic analysis of arbitrary CDAGs. Effectiveness and uses of our automated approach are shown in Section 4.3.

4.1 Algorithm for Convex Vertex-Min-Cut

The problem of finding a convex vertex-min-cut in a DAG can be represented as an integer programming problem with the property that its constraint matrix is totally-unimodular. This allows us to solve the problem in polynomial time using linear programming (LP). Let C = (I,V,E,O) be a CDAG. Then finding a

convex vertex-min-cut in the DAG G = (V,E) can be formulated as follows:

Minimize    ∑_{v∈V} cv
subject to  ∀(v, w) ∈ E:  tw ≥ tv
            ∀(v, w) ∈ E:  cv ≥ tw − tv
            tx = 0                                (4.1)
            ∀v ∈ Anc(x):  tv = 0
            ∀v ∈ Desc(x): tv = 1
            ∀v ∈ V:       0 ≤ tv, cv ≤ 1

The above formulation minimizes the vertex-cut-size of a convex cut (Xx,Yx) w.r.t. a vertex x ∈ V . Each vertex v ∈ V is associated with two non-negative variables cv and tv. Variable cv captures the vertex-cut-size, and variable tv determines if v belongs to the set Xx or Yx, depending on whether tv = 0 or tv = 1, respectively.

Convexity is enforced by the first constraint, where, given an edge (v, w) ∈ E, if v is in Yx, then w is also forced to be in Yx. The second constraint, cv ≥ tw − tv, forces a vertex v ∈ Xx to be counted as a cut-vertex whenever v has at least one successor in Yx. Finally, vertices v ∈ {x} ∪ Anc(x) (resp. v ∈ Desc(x)) are restricted to the set Xx (resp. Yx) by setting tv = 0 (resp. tv = 1). The minimization objective min ∑_{v∈V} cv implicitly ensures that cv ≤ 1 and tv ≤ 1. Even though the above LP based formulation of convex vertex-min-cut can be solved in polynomial time, its running time might be long in practice, especially on CDAGs with millions of vertices.
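For small DAGs, formulation (4.1) can nonetheless be prototyped directly. A minimal sketch is given below, assuming the PuLP modeling library as the LP front end (any LP solver would do); the toy diamond DAG in the driver is only an illustration.

```python
import pulp

def convex_vertex_min_cut_lp(vertices, edges, x):
    """Solve LP (4.1) for a small DAG and return the convex vertex-min-cut size."""
    succ = {v: [] for v in vertices}
    pred = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    def reachable(start, nbrs):                      # Anc(x)/Desc(x), excluding x itself
        seen, stack = set(), [start]
        while stack:
            for w in nbrs[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    anc, desc = reachable(x, pred), reachable(x, succ)

    prob = pulp.LpProblem("convex_vertex_min_cut", pulp.LpMinimize)
    t = {v: pulp.LpVariable(f"t_{v}", 0, 1) for v in vertices}
    c = {v: pulp.LpVariable(f"c_{v}", 0, 1) for v in vertices}
    prob += pulp.lpSum(c.values())                   # objective: vertex-cut-size
    for u, v in edges:
        prob += t[v] >= t[u]                         # convexity
        prob += c[u] >= t[v] - t[u]                  # u is a cut-vertex if a successor crosses
    prob += t[x] == 0
    for v in anc:
        prob += t[v] == 0                            # ancestors stay on the X side
    for v in desc:
        prob += t[v] == 1                            # descendants stay on the Y side
    prob.solve()
    return sum(c[v].varValue for v in vertices)

if __name__ == "__main__":
    # 2x2 diamond: the convex vertex-min-cut w.r.t. the vertex b has size 2.
    V = ["a", "b", "c", "d"]
    E = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
    print(convex_vertex_min_cut_lp(V, E, "b"))       # 2.0
```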

The well-known max-flow min-cut theorem from optimization theory allows us to solve the standard min-cut problem (that minimizes the edge-cut-size of an s-t cut) as the dual problem of finding the maximum amount of flow in a DAG, called max-flow. (Note: We refer to the problem of finding an s-t cut in a graph that minimizes the edge-cut-size as the standard min-cut problem.)

are specialized algorithms, such as Ford-Fulkerson method, Push-relabel algorithm,

etc., to efficiently solve the max-flow problem. These specialized max-flow algorithms

are several times faster in practice compared to their corresponding LP formulations.

Hence, it is of interest to somehow utilize these techniques to efficiently solve our

related problem of finding convex cuts. In Subsection 4.1.2, we will show a

construction for a modified graph, corresponding to the DAG of a CDAG, that enables

us to directly use standard max-flow algorithms to obtain convex vertex-min-cut of

the original DAG. In Subsection 4.1.3, we prove the correctness of our method.

4.1.1 Overview

The standard min-cut problem finds an s-t cut with minimum edge-cut-size in a graph. A convex-cut (Xx,Yx) differs from a standard cut in the following two ways:

C1 A convex-cut is formed w.r.t. a vertex x s.t. Anc(x) ⊆ Xx and Desc(x) ⊆ Yx.

C2 There are no edges from vertices in the set Yx to vertices in Xx.

In this subsection, we will show how to use a standard min-cut algorithm to obtain

a valid convex-cut with minimum edge-cut-size for a DAG. For our lower bounding

purposes, we are interested in finding a convex-cut with minimum vertex-cut-size.

However, it is easier to describe the idea for the case of minimizing edge-cut-size.

Details for obtaining convex vertex-min-cut using standard min-cut algorithm are

presented in the next subsection.

We will first show how to enforce constraint C1 in a standard min-cut problem.

We will transform the original graph G into a new graph G′ as follows: Let x be the

vertex w.r.t. which a convex-cut is required. (1) The vertices {s} ∪ Anc(x) in G are

contracted into a single source vertex s′ in G′; (2) The vertices {t} ∪ Desc(x) in G are contracted into a single target vertex t′ in G′. Now, from the cut (X′,Y′), returned by a standard min-cut algorithm on G′, we obtain the cut (X,Y) for G by replacing s′ ∈ X′ with {s} ∪ Anc(x) and t′ ∈ Y′ with {t} ∪ Desc(x). Note that for the case of convex vertex-min-cut, this step is a bit more involved, since directly contracting

{s} ∪ Anc(x) into s′ might lead to a convex-cut with non-minimum cut-vertex-size.

Details for convex vertex-min-cut are presented in Subsection 4.1.2.

Now, we will show how to enforce constraint C2, i.e., given a graph G′, obtained after applying the above transformation to the original graph G, we will transform G′ into another graph G′′ such that a standard min-cut on G′′ provides us a cut (X′,Y ′) for G′ with no edge crossing from Y ′ to X′. Recall that the objective of a standard min-cut problem is to find a partition (X,Y ) for a graph such that the source s ∈ X and target t ∈ Y and the cut-cost, i.e., sum of weights of the edges from X to Y , is minimized. Any number of edges (with any weight) are allowed to cross from Y to X and their weights are not accounted for in the cut-cost. However, a convex cut disallows any edge from Y to X. The idea is to enforce this constraint by modifying

G′ = (V ′,E′) to a weighted (non-acyclic) digraph G′′ = (V ′,E′′,W ′′) by adding an edge (v′, u′) with weight ∞ for every edge (u′, v′) ∈ E′. Hence, from G′ = (V ′,E′), a new graph G′′ = (V ′,E′′,W ′′) is created, where E′′ = E′ ∪ {(v′, u′)|(u′, v′) ∈ E′} and

W ′′(e) = 1 if e ∈ E′, and W ′′(e) = ∞ otherwise. If any cut (X′′,Y ′′) for G′′ contains a vertex u′ ∈ Y ′′ and v′ ∈ X′′ such that (u′, v′) ∈ E′, then we would have the edge

(v′, u′) ∈ E′′ that crosses from X′′ to Y ′′, leading to a cut-cost of ∞. It is easy to see that there exists a cut in G′′ with finite cut-cost. In fact, by removing all the out-edges of the source vertex s′ ∈ V ′, we would obtain a valid cut ({s′},V ′ \{s′})

with finite cut-cost. Hence, by using a standard min-cut algorithm to find a cut in

G′′, it is impossible to have an edge e′ ∈ E′′ with e′ ∉ E′ as a cut-edge.

It can be shown that by applying these two transformations to a DAG G, we would obtain a valid convex-cut with minimum edge-cut-size using a standard min- cut algorithm.

The problem of interest for us is to find a convex vertex-min-cut in a graph.

There are well known techniques to convert a standard vertex-min-cut problem to a standard edge-min-cut problem. It involves transforming an unweighted graph G, whose (non-convex) vertex-min-cut is to be found, into a new unweighted graph G′, as detailed below, such that a standard edge-min-cut algorithm run on G′ returns a vertex-min-cut for G. G′ = (V ′,E′) is created from G = (V,E) as follows: (1) each vertex v ∈ V is replaced with two vertices vh and vt and an edge (vh, vt) is created; (2) for each edge (u, v) ∈ E an edge (ut, vh) is created in the transformed graph. A transformation similar to this, in addition to the transformations detailed above earlier can be used to convert the problem of finding convex vertex-min-cut to a standard edge-min-cut problem.

In the following subsection, we formalize the method to transform an unweighted

DAG G to a weighted (non-acyclic) digraph Gm, in order to find a convex vertex-min-cut for G using a standard edge-min-cut on Gm.

4.1.2 Construction of transformed graph

Let G = (V,E) be a DAG and x ∈ V be the vertex with respect to which a convex- cut is required. We will create a weighted (non-acyclic) digraph Gm = (Vm,Em,Wm) corresponding to G. We define dead ancestors of a vertex v as follows:

Definition 24 (Dead ancestor set). Given a DAG G = (V,E), dead ancestor set,

DAnc(v), of a vertex v ∈ V is the set {u|u ∈ Anc(v) ∧ ∀w :(u, w) ∈ E[w ∈ Anc(v)]}.

Informally, the dead ancestor set is the set of ancestors of a vertex whose successors are also within the ancestor set. It can be seen that DAnc(x) cannot be part of the cut-vertex-set of any convex-cut for G w.r.t. x. DAnc(x) will be contracted to s and Desc(x) will be contracted to t in Gm, and the remaining vertices v ∈ V

will be split into vh and vt. The process is formally detailed below. Let V′ =

V \ (DAnc(x) ∪ Desc(x)). Gm is created as follows:

1. Add source and target vertices s and t to Vm.

2. Corresponding to each vertex v ∈ V′, add a pair of vertices vh and vt to Vm

and add an edge e = (vh, vt) of weight one to Em. We call these edges type-1

edges. For simplicity, in the remainder of this chapter, any vertex v referenced

with a subscript h (resp. t) denotes the vertex vh ∈ Vm (resp. vt ∈ Vm) that

was created corresponding to v ∈ V ′.

3. Create a set of edges {(s, vh) | v ∈ Anc(x) ∩ V′} of weight one and add them to

Em.

4. Corresponding to any edge e ∈ {(u, v)|(u, v) ∈ E ∧ u, v ∈ V ′}, add an edge

(ut, vh) of weight one to Em.

5. Corresponding to any edge e ∈ {(u, v)|(u, v) ∈ E ∧ u ∈ V ′ ∧ v ∈ Desc(x)}, add

an edge (ut, t) of weight one to Em. We will call the set of edges created above

in steps 3, 4 and 5 as type-2 edges.

6. For any edge e ∈ {(u, v) | (u, v) ∈ E ∧ u, v ∈ V′}, create an edge (vh, uh) of weight

∞ and add it to Em. We will call these edges type-3 edges.

In summary, we have, Vm = {s, t} ∪ {vh, vt | v ∈ V′};

Em = {(s, vh) | v ∈ Anc(x) ∩ V′} ∪ {(vh, vt) | v ∈ V′} ∪ {(ut, vh) | u, v ∈ V′ ∧ (u, v) ∈ E} ∪ {(vh, uh) | u, v ∈ V′ ∧ (u, v) ∈ E} ∪ {(vt, t) | v ∈ V′ ∧ (v, Desc(x)) ∈ E};

and edge weight function, Wm : Em → Z, where, Wm(e) = ∞ if e is of type-3, and

Wm(e) = 1 otherwise. The graph Gm is passed to a standard min-cut algorithm to

obtain a convex vertex-min-cut for G. In the following sections, the notation Wm(E′), where E′ ⊆ Em is a subset of edges, denotes Wm(E′) = ∑_{e∈E′} Wm(e).
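The following Python sketch, assuming the networkx library, builds Gm as described above and obtains the convex vertex-min-cut value with a standard max-flow/min-cut routine. Edges left without a capacity attribute are treated by networkx as having infinite capacity, which plays the role of the type-3 weights; the sketch also adds a source edge for x itself (in addition to those for Anc(x) ∩ V′), which is how we read the requirement {x} ∪ Anc(x) ⊆ Xx of Definition 20 into the construction, and it assumes x has at least one descendant.

```python
import networkx as nx

def convex_vertex_min_cut(G, x):
    """G: a networkx DiGraph (the DAG of a CDAG). Returns |W(G, x)|."""
    anc = nx.ancestors(G, x)
    desc = nx.descendants(G, x)
    # Dead ancestors: ancestors all of whose successors are again ancestors of x.
    danc = {u for u in anc if all(w in anc for w in G.successors(u))}
    keep = set(G.nodes()) - danc - desc                  # V' in the construction

    Gm = nx.DiGraph()
    for v in keep:                                       # step 2: split v into v_h -> v_t
        Gm.add_edge(("h", v), ("t", v), capacity=1)      #   type-1 edge
    for v in (anc | {x}) & keep:                         # step 3 (plus x, see lead-in)
        Gm.add_edge("s", ("h", v), capacity=1)           #   type-2 edge
    for u, v in G.edges():
        if u in keep and v in keep:                      # step 4
            Gm.add_edge(("t", u), ("h", v), capacity=1)  #   type-2 edge
            Gm.add_edge(("h", v), ("h", u))              # step 6: type-3, infinite capacity
        elif u in keep and v in desc:                    # step 5
            Gm.add_edge(("t", u), "t", capacity=1)       #   type-2 edge

    cut_value, _ = nx.minimum_cut(Gm, "s", "t")
    return cut_value

if __name__ == "__main__":
    D = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
    print(convex_vertex_min_cut(D, "b"))                 # 2 for the 2x2 diamond
```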

4.1.3 Correctness

In this subsection, we will prove the correctness of our method.

Theorem 13. Given a DAG G = (V,E) and a vertex x ∈ V , let Gm = (Vm,Em,Wm) be the transformed digraph constructed w.r.t. x as shown in Subsection 4.1.2. Let

E_𝒳 be the edge-cut-set of a standard min-cut 𝒳 = (X,Y) of Gm, and let V_{𝒳_x} be the vertex-cut-set of a convex vertex-min-cut 𝒳_x = (Xx,Yx) of G w.r.t. any vertex x ∈ V.

Then, Wm(E_𝒳) = |V_{𝒳_x}|.

We will first prove that Wm(E_𝒳) ≤ |V_{𝒳_x}| by showing that, corresponding to 𝒳_x = (Xx, Yx), there exists a cut 𝒳′ = (X′, Y′) for Gm with edge-cut-set E_{𝒳′} consisting of edges of weight one and of size at most |V_{𝒳_x}|. Let V′ = V \ (DAnc(x) ∪ Desc(x)). From (Xx, Yx), form the sets X′ and Y′ as follows:

• X′ = {s} ∪ {uh, ut | u ∈ (Xx \ V_{𝒳_x}) ∩ V′} ∪ {uh | u ∈ V_{𝒳_x}}.

• Y′ = {t} ∪ {uh, ut | u ∈ Yx ∩ V′} ∪ {ut | u ∈ V_{𝒳_x}}.

We will show that (X′, Y′) is a valid edge-cut for Gm of cost |V_{𝒳_x}|. Intuitively, (X′, Y′) has been built in such a way that there are only type-1 cut-edges (uh, ut) ∈ E_{𝒳′}; and there is an edge (uh, ut) ∈ E_{𝒳′} iff u ∈ V_{𝒳_x}. From the construction of X′ and Y′, we have X′ ∪ Y′ = Vm and X′ ∩ Y′ = ∅. Further,

• There are no edges from s to Y ′:

This is because s is constructed by contracting DAnc(x). Specifically, from the

construction of X′, we have, if v ∈ Xx then vh ∈ X′. Also, Anc(x) ⊆ Xx and

we have no edges from DAnc(x) to outside Anc(x). Hence, there are no edges

from the source vertex s ∈ Vm to Y′.

• There are no edges from X′ to t:

This is because t is constructed by contracting Desc(x) and for any cut-vertex v

of 𝒳_x, vt is added to Y′. Hence, since Desc(x) ⊆ Yx and (v ∈ V_{𝒳_x} ⇒ vt ∈ Y′),

we have no edges from X′ to the target vertex t ∈ Vm.

• There are no edges from {uh, ut | u ∈ (Xx \ V_{𝒳_x}) ∩ V′} to Y′, or from X′ to

{uh, ut | u ∈ Yx ∩ V′}:

From construction of Gm, for any two pairs p1 =< uh, ut > and p2 =< vh, vt >,

that correspond to u, v ∈ V , there are edges between vertices in p1 and p2 iff

there is an edge (u, v) ∈ E. Hence, it follows from the fact that there are no

edges from vertices in the set (Xx \ V_{𝒳_x}) to Yx and vice-versa.

• There are exactly |V_{𝒳_x}| edges of weight one from {uh | u ∈ V_{𝒳_x}} to {ut | u ∈ V_{𝒳_x}}:

Corresponding to each cut-vertex v ∈ V_{𝒳_x} we have a type-1 cut-edge (vh, vt) in (X′, Y′).

Hence, (X′, Y′) forms a valid cut for Gm of cost |V_{𝒳_x}|.

Now, we will show that |V_{𝒳_x}| ≤ Wm(E_𝒳) by constructing a convex-cut 𝒳_x = (X′x, Y′x) for G from the edge-min-cut 𝒳 = (X, Y) with vertex-cut-size |V_{𝒳_x}| at most Wm(E_𝒳). We will do this in two parts: (1) As shown later, from 𝒳 = (X, Y), we will first create an equivalent cut 𝒳′ = (X′, Y′) for Gm, of cost at most Wm(E_𝒳), such that its edge-cut-set E_{𝒳′} contains only type-1 edges. (2) From 𝒳′ it is easy to build a convex-cut 𝒳_x = (X′x, Y′x) as follows:

• X′x = DAnc(x) ∪ {u | uh ∈ X′}.

• Y′x = Desc(x) ∪ {u | uh ∈ Y′}.

Note that there are no edges from Y′x to X′x, since if there were such an edge (v, u) with v ∈ Y′x ∧ u ∈ X′x, there would have been an edge (uh, vh) of weight ∞ from uh ∈ X′ to vh ∈ Y′, which is not possible as described earlier. From the construction of Gm, X′x and Y′x, it is easy to see that if there is an edge (u, v) with u ∈ X′x ∧ v ∈ Y′x, then there

X ′ = (X′,Y ′) is built from X = (X,Y ) as follows: (Note that the edge-cut-set

EX will have no type-3 edges. So we only need to worry about removing type-2 edges

from EX .)

• Start with X′ = X and Y ′ = Y .

′ ′ • If EX has an edge (s, uh) from the source, then move uh from Y to X , thus

removing a type-2 edge from EX ′ and instead introducing at most one type-1

edge. Note that this step does not introduce any type-2 edge, since all the out-

edges from a head node uh are either of type-1 or type-3. Also, no type-3 edges

54 are introduced into EX ′ during this step, since we have a type-3 edge (uh, vh)

only if (v, u) ∈ E; since we have an edge from s to uh, we have u ∈ Anc(x) and

′ thus its predecessor v ∈ Anc(x). Hence, vh was either already in X or vh will

be eventually moved to X′ during this step.

• If EX has any other type-2 edge which is of form (ut, vh) or (ut, t), then ut can

′ ′ be moved from X to Y , thus adding single type-1 edge (uh, ut) to EX ′ .

′ ′ Hence, (Xx,Yx) forms a valid convex-cut for G with vertex-cut-size at most Wm(EX ). □

4.2 Heuristic for Min-Cut Based Lower Bounds

In this section, we will provide details on how a lower bound for an algorithm is

obtained from a program in a completely automatic fashion. This process involves

the following steps:

• Extracting CDAG from the input program.

• Decomposing the CDAG into multiple sub-CDAGs.

• Determining convex vertex-min-cut value (maximized over a set of vertices) of

each sub-CDAG and finally adding them up (using Theorem 5) to obtain lower

bound for the full CDAG.

Each of these steps are explained in detail below.

4.2.1 CDAG extraction

The very first step in our automated approach is to extract the implementation-independent CDAG from an input program. The automatically extracted CDAG contains a vertex for each instance of a floating-point operation executed in the program during that particular run, and an edge for each read-after-write dependence between these

floating-point operations. This process is done through LLVM [36] based dynamic analysis. It involves three major phases:

Instrumentation: Instrumenting the input program with function calls that allow

us to capture runtime information.

Execution: Executing the instrumented program to obtain a trace file containing

runtime details.

Analysis: Analyzing the trace file to generate CDAG.

All these steps are completely automatable using LLVM.

• The input program is initially compiled to LLVM intermediate representation

(LLVM-IR). Our instrumentation function calls are inserted into this LLVM-IR

through an optimization pass, and the instrumented LLVM-IR is compiled to a

binary file.

• Instrumented binary file is executed to capture runtime details into a trace

file. The generated trace file contains information about runtime program flow,

i.e., information on which instructions were executed and in what order, which

instruction instance wrote to which memory address, addresses from which the

operands were read, etc. This provides us sufficient details to build a CDAG

for the algorithm corresponding to a program input.

• During analysis phase, events in the trace file are read in sequence. A vertex

is created in CDAG for every floating-point event read from the trace file. In

addition, a last-writer information is maintained to aid in creating edges.

Last-writer information is a map from a memory address A to the CDAG vertex

v that corresponds to the operation that recently wrote to A. Any subsequent

floating-point operation (represented by vertex w in CDAG) that reads from A

is data dependent on v. Hence, an edge is created from v to w in the CDAG (a minimal sketch of this bookkeeping appears below).
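A minimal Python sketch of this last-writer bookkeeping is given below (the trace format, one event per floating-point operation with its read addresses and write address, is a simplification of the actual LLVM-generated trace).

```python
def build_cdag(trace):
    """trace: iterable of (op_id, read_addrs, write_addr) events in execution order.
    Returns the vertex list and edge list of the CDAG."""
    last_writer = {}                        # memory address -> vertex that last wrote it
    vertices, edges = [], []
    for op_id, reads, write in trace:
        vertices.append(op_id)              # one vertex per floating-point operation
        for addr in reads:
            if addr in last_writer:         # read-after-write dependence
                edges.append((last_writer[addr], op_id))
        last_writer[write] = op_id          # this operation becomes the last writer
    return vertices, edges

if __name__ == "__main__":
    # Tiny made-up trace: t0 = a*b; t1 = t0 + c (addresses are symbolic here).
    trace = [(0, ["a", "b"], "t0"), (1, ["t0", "c"], "t1")]
    print(build_cdag(trace))                # ([0, 1], [(0, 1)])
```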

The actual implementation has several optimizations to minimize the generated trace file size and the memory footprint required for automatic extraction of CDAG.

For instance, the trace file records events at basic block level instead of an individ- ual instruction level since the instructions inside a basic block can be determined using static information once we know that the basic block was reached during run- time. These implementation details are not presented here. Complete details about instrumented function calls and trace file information can be found in [39, Chapter

2].

Note that the program execution order is not captured in the CDAG, thus mak-

ing it independent of the schedule of the implementation from which we build the

CDAG. We would also like to note that executed instructions and read/write memory

addresses are dependent on problem size and program input, as is usually the case

with any dynamic analysis based techniques. Hence, the extracted CDAG might vary

in size and shape for different problem sizes and program inputs. We will show in

Section 4.3 that, in spite of this, a fully automated approach for determining lower

57 bounds provides us insightful information about an algorithm, allowing us to derive useful conclusions. Also, for “regular” programs such as matrix-matrix multiplica- tion, stencil computation, and several linear-algebra applications, graph structure and properties of the CDAG do not vary much between different inputs, allowing us to extrapolate and derive parametric lower bounds for such algorithms. Section 4.3.2 presents parametric lower bound results (in terms of decomposability factor) obtained through extrapolation for such algorithms.

4.2.2 Decomposition

The min-cut based lower bounding technique requires the CDAG to be decomposed and the pieces individually analyzed to obtain tighter bounds. Our implementation uses the METIS [31] graph partitioner to partition the DAG of the extracted CDAG. The CDAG extracted in the previous step is recursively bisected and the min-cut heuristic is executed at each level of recursion, as explained in the next subsection and in Algorithm 4.2.

4.2.3 Finding maximum min-cut

As shown in Section 4.1, a standard max-flow algorithm can be used to find a convex min-cut in our DAG. We use the push-relabel based max-flow implementation provided by the Lemon graph library [34], whose worst case time complexity per min-cut is O(n²√m), where n is the number of vertices and m is the number of edges in the graph. In order to obtain tighter bounds, the convex min-cut is run with respect to all the vertices in a sub-CDAG and the maximum of those is taken. When the CDAG is too huge, the convex min-cut is computed with respect to a subset of vertices, evenly sampled throughout the CDAG. Since the convex-cut with respect to each vertex is independent of the others, they can be executed in parallel to reduce the running time of the heuristic. Also,

since the convex vertex-min-cut is independent of the red pebble count S, it is possible

to run the heuristic just once per input problem size and extract lower bound results

for different sizes of S. Algorithms 4.1 and 4.2 contain the pseudo-codes for finding the lower bound of a CDAG using the min-cut based approach. The function TransformDAG

(at Line 7 in Algorithm 4.1) transforms the DAG of a (sub-)CDAG into a digraph as

detailed in Section 4.1. The function Maxflow calls the standard max-flow algorithm

on a transformed graph.

1  Function SubCDAGLB(C)
   Input : sub-CDAG C = (I,V,E,O)
   Input : Red pebble count S
   Output: Lower bound of sub-CDAG Q′
2    V′ ← V \ (I ∪ O);
3    E′ ← E ∩ (V′ × V′);
4    W ← 0;
5    Q′ ← 0;
6    foreach x ∈ V′ do
7      Gm ← TransformDAG((V′, E′), x);
8      Wx ← MaxFlow(Gm);
9      W ← max(W, Wx);
10   Q′ ← max(0, 2 × (W − S)) + |I| + |O|;
11   return Q′;

Algorithm 4.1: Algorithm to compute min-cut based LB of a sub-CDAG.

4.3 Experimental Results

4.3.1 Assessment of effectiveness of automated analysis

In this and the next subsection, we show the effectiveness of our automated tool

by running the tool on the algorithms for which tight analytical lower bounds are

1  Function MincutLB(C)
   Input : (sub-)CDAG C = (I,V,E,O)
   Input : Red pebble count S
   Output: Lower bound Q
2    Q ← 0;
3    if |V| ≥ 2S then
4      (G1 = (V1,E1), G2 = (V2,E2)) ← Bisect((V,E));
5      C1 ← (I ∩ V1, V1, E1, O ∩ V1);
6      C2 ← (I ∩ V2, V2, E2, O ∩ V2);
7      Q1 ← MincutLB(C1, S);
8      Q2 ← MincutLB(C2, S);
9      Q ← max(Q, Q1 + Q2);
10   Q′ ← SubCDAGLB(C, S);
11   Q ← max(Q, Q′);
12   return Q;

Algorithm 4.2: Algorithm to compute min-cut based LB of a CDAG through recursive bisection.

already known, and comparing the results from the tool against the analytically derived bounds.

Diamond DAG: Tight analytical lower bound for diamond DAG was derived in

Section 3.3.1. We obtained a lower bound of (N − 2S)²/S ≈ N²/S.

Figure 4.1 shows the lower bound generated by our automated heuristic for dif- ferent values of S, for a diamond DAG of size 200 × 200. The data movement lower bound from the above analytical expression is also plotted. The bound generated by the automated CDAG analysis is less tight than the lower bound obtained using analytical reasoning, but shows the same trend as a function of S.

Diamond-Chain: Figure 4.2 shows a CDAG analyzed by Bilardi et al. [9]. It is a diamond DAG with N² nodes composed with a linear chain graph of N² nodes, with

Figure 4.1: Lower bound for Diamond DAG: Analytical versus automated CDAG analysis (lower bound versus the number of red pebbles, S, for N = 200).

a dependence edge from each diamond node to a distinct chain node. The dependence edges are in reverse order: the last diamond vertex connects to the first chain vertex, the first diamond vertex connects to the last chain vertex, etc. It can be seen that the vertex-cut-size of the convex min-cut with respect to the terminal vertex of the diamond is N², thus providing an analytical lower bound of 2 × (N² − S) for the diamond-chain.

Figure 4.2: Diamond-Chain CDAG [9].

Figure 4.3: Lower bound for Diamond-Chain: Analytical versus automated CDAG analysis (lower bound versus the number of red pebbles, S, for N = 200).

Figure 4.3 shows the lower bound generated by automated CDAG analysis, that

exactly matches the analytical bound, for different values of S for a diamond-chain

of size N = 200.

Jacobi computation: We obtained an analytical lower bound for two-dimensional

9-point Jacobi computation in Subsection 3.3.3. We obtained a lower bound of 0.75n²T/√S for an n × n problem with T time steps.

Figure 4.4 shows the lower bound generated by our automated heuristic for differ- ent values of S for the problem size of 20×20 for 20 time-steps. The bound generated

by the automated analysis follows the same pattern as that of the analytical bounds

and in fact, provides a tighter bound than the analytical bounds. This is because we

make conservative arguments to simplify the analysis when deriving bounds through

analytical reasoning.

Figure 4.4: Lower bound for two-dimensional Jacobi computation: Analytical versus automated CDAG analysis (lower bound versus the number of red pebbles, S, for N = 20, T = 20).

4.3.2 Parametric bounds through function fitting

The decomposability factor λ(S) of an algorithm is defined as the ratio between the sequential time of the algorithm, |V |, and the minimum data movement cost, Q

when S red pebbles are used. For many algorithms whose decomposability factor [27] has been analytically characterized, it takes the form λ(S) = k × S^α. The automated

has been analytically characterized, it takes the form λ(S) = k × Sα. The automated tool might be helpful in determining the values of k and α for an algorithm. The tool was run on CDAGs from different algorithms and a linear regression function fit was done for log(S) against log(λ(S)). The slope and the intercept obtained from this function fit were then used to obtain the values of α and k, respectively. The Table 4.1

shows the λ(S) values for different algorithms. The first column names the algorithm,

the second column represents the known asymptotic analytical lower bound and the

last column shows the values obtained through the automated tool. It may be seen

that the form obtained through empirical function fitting is very close in most cases

to the known analytical exponent on S for the decomposability factor. Thus, the tool

holds promise for characterizing algorithms even if it is dependent on the problem

input and value of S, if accurate parametric function fitting via extrapolation is possible.

Algorithm        Analytical bounds   Experimental bounds
Jacobi 3D        O(S^0.33)           14.6 × S^0.3
Jacobi 2D        O(S^0.5)            8.76 × S^0.48
ADI              O(1)                1.55 × S^0.04
Diamond DAG      O(S)                3.46 × S^0.87
Diamond chain    O(1)                0.99 × S^0.004

Table 4.1: Decomposability factors for different algorithms.
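A small NumPy sketch of this function fit is shown below; the sample points are synthetic, chosen only to resemble the Jacobi 2D row of Table 4.1.

```python
import numpy as np

def fit_decomposability(S_values, lam_values):
    """Fit log(lambda(S)) against log(S) and return (k, alpha) in lambda(S) = k * S^alpha."""
    slope, intercept = np.polyfit(np.log(S_values), np.log(lam_values), 1)
    return np.exp(intercept), slope

if __name__ == "__main__":
    S = np.array([16.0, 32.0, 64.0, 128.0, 256.0])
    lam = 8.8 * S ** 0.48                   # synthetic data resembling Jacobi 2D
    k, alpha = fit_decomposability(S, lam)
    print(round(k, 2), round(alpha, 2))     # ~8.8 and ~0.48
```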

4.3.3 Use case 1: Algorithm comparison and optimization

In this subsection, we present an illustrative case study to describe one approach

to using lower bounds analysis to identify potential data movement bottlenecks in al-

gorithms/applications. Given an implementation of an algorithm, the broad approach

is as follows:

• For various values of S, generate a) lower bounds on data movement costs,

using the approach described in the previous section, and b) number of data

movements required for the actual sequential schedule for the computation,

assuming an oracular (furthest-in-future) memory management strategy for the

S fast memory locations. If the lower bounds are close to the amount of data

movement required for the given schedule, the conclusion is that the algorithm

is fundamentally limited by its minimal inherent data movement requirements,

and efforts to improve the implementation (e.g., via loop transformations like

tiling) will be futile. The only option in this scenario is to seek new algorithms

with lower data movement complexity.

• If the analysis from the first step reveals a significant gap between the lower

bound curve and the data movement cost for the actual schedule, it is suggestive

that improvements to the algorithm implementation might be feasible. At this

point, better alternative valid schedules for the CDAG are sought, through

careful manual analysis of the algorithm.
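The replay in part b) of the first step can be approximated by a sketch like the one below: the firing order is processed with S fast-memory slots managed by the oracular furthest-in-future (Belady) policy and misses (loads) are counted. Stores are omitted for brevity, and the trace format is an assumption of the sketch rather than the tool's actual input.

```python
import bisect
from collections import defaultdict

def belady_misses(schedule, S):
    """schedule: list of (result, [operands]) in firing order; S: fast-memory slots."""
    uses = defaultdict(list)                 # value -> positions at which it is read
    for pos, (_, operands) in enumerate(schedule):
        for v in operands:
            uses[v].append(pos)

    def next_use(v, pos):                    # next read of v strictly after position pos
        i = bisect.bisect_right(uses[v], pos)
        return uses[v][i] if i < len(uses[v]) else float("inf")

    cache, misses = set(), 0
    for pos, (result, operands) in enumerate(schedule):
        for v in operands:
            if v not in cache:               # operand not in fast memory: count a load
                misses += 1
                cache.add(v)
        cache.add(result)                    # the result is produced in fast memory
        while len(cache) > S:                # evict the value reused furthest in the future
            cache.remove(max(cache, key=lambda v: next_use(v, pos)))
    return misses

if __name__ == "__main__":
    # Toy chain: c0 = a0 + b0, c1 = c0 + b1, ...; each fresh input costs one miss.
    sched = [(f"c{i}", [f"c{i-1}" if i else "a0", f"b{i}"]) for i in range(6)]
    print(belady_misses(sched, S=3))         # 7
```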

We applied the above methodology to different comparison-based sorting algorithms.

In this subsection we report on results for two sorting algorithms: quicksort and odd-even sort. Quicksort has an average computational complexity of O(n log₂ n) to sort n elements, while odd-even sort has an asymptotically higher complexity of O(n²),

but a lower scaling coefficient for the quadratic complexity term. So efficient sorting

routines often use a combination of an O(n log₂ n) algorithm like quicksort, along

between 16 and 20) for switching between the two algorithms in a recursive execution.

We were interested in comparing the data movement complexities of the two sorting

algorithms. Figure 4.5 compares the data movement complexity of quicksort and odd-

even sort as a function of S, normalized to the number of comparisons performed.

On the same graph, we also see the normalized number of data movements required in executing the sorting algorithms – by playing the red-blue pebble game using the code’s sequential execution order for firing of CDAG vertices (therefore providing an upper bound on the data movement complexity). It may be seen that while odd-even sort has a higher normalized data access cost than quicksort for the code’s sequential execution order, the lower bounds from automated CDAG analysis reveal

Figure 4.5: Quicksort versus odd-even sort: Normalized lower bound and cost for sequential execution order (I/O cost normalized by the number of comparisons, versus the number of red pebbles S, for N = 256).

a significantly lower bound for odd-even sort than quicksort. Upon examining the

CDAG for odd-even sort more carefully, it became apparent that it was amenable to skewing followed by loop tiling to significantly enhance data locality. In contrast, for the recursive quicksort algorithm, we could not identify any opportunity for loop transformations that would improve data locality. We implemented a register-tiled version of odd-even sort and evaluated performance for different register-tile sizes.

Figure 4.6 displays relative time for odd-even sort over quicksort as a function of array size (using Intel ICC, for int data type), for the standard version as well as the modified register-tiled version. For the original version, the cross-over was around

16, i.e., odd-even sort was more efficient for sorting fewer than 16 elements, while quicksort was more efficient for sorting more than 16 elements. But with the tiled version of odd-even sort, the cross-over point was much higher - around 110 elements.

Since our automated approach is based on dynamic analysis, the results obtained through our analysis depend on the input used for each run. Hence the user has to

be cautious before making generalized conclusions from the results of the automated analysis. The experiment in this section was run multiple times with different random inputs. Boundary cases like fully sorted and reverse sorted inputs were also tested and we found similar trends.

Figure 4.6: Execution time for original and tiled odd-even sort relative to quicksort (Intel ICC, int data type; relative execution time versus problem size N).

4.3.4 Use case 2: Machine balance requirements

A processor’s machine balance is the ratio of the peak memory bandwidth to the peak floating-point performance. Lower bounds analysis can help identify whether some algorithms are inherently memory-bandwidth bound irrespective of all possible optimizations.

Jacobi iteration and Conjugate Gradient (CG) are two alternate iterative methods for solving a system of simultaneous equations that generally arise from discretization of steady-state problems. In Figure 4.7, we show the lower bounds of the memory bandwidth required per floating-point operation by Jacobi and CG methods to solve a regular 2D problem for two different problem sizes, as a function of S. The horizontal

lines in the figure represent the machine balance numbers (peak memory bandwidth

over theoretical peak double-precision GFLOPS) of several widely used current pro-

cessors. As can be seen from the figure, the lower bound characterization for CG stays

above the machine balance line, irrespective of the value of S. In contrast, the profile for Jacobi crosses and drops below the machine balance lines for high values of S.

This shows that irrespective of any transformations for locality enhancement, the CG method will be unavoidably memory-bandwidth bound for problem sizes that cannot

fit in cache. The lower bound data also shows that increasing cache capacity will not provide any benefit. The only way to improve performance will be to increase the memory bandwidth. On the other hand, for the Jacobi algorithm, the profile shows that the algorithm is not constrained to be memory-bandwidth bound. With a sufficiently large cache, locality enhancing transformations could be successful in enhancing performance. In practice, the Jacobi method is generally not a particu- larly attractive iterative method due to its very slow convergence property. However,

Jacobi iterations can be effectively used for the smoothing steps in multigrid solvers.

While a few steps of CG could also be used instead of a Jacobi smoother, and the computational cost of a CG is not much higher than a Jacobi step, the data move- ment complexity is considerably higher. The most effective algorithms for future architectures could be very different from the preferred algorithms on current sys- tems, and lower bounds analysis can be very helpful in anticipating trade-offs for future systems. The development of new methods suited for future systems can thus be usefully guided by modeling the data movement complexity of algorithms through lower bounds analysis.

[Figure: "Jacobi vs. Conjugate Gradient". y-axis: Bytes/FLOP; x-axis: number of red pebbles S; curves: Jacobi and CG for N=10 and N=15; horizontal lines: machine balance of Ivy Bridge, Sandy Bridge, and Nehalem.]

Figure 4.7: Lower bounds of memory bandwidth requirement per FLOP for Jacobi and CG; and machine balance numbers for different processors.

4.4 Loop Fusion and Tileability Potential

Applications spend most of their running time (performing both computation and

data movement) in nested loops. Loop tiling is a common technique used to improve

data locality and in turn minimize data movement in a program. Manually determin-

ing tileability of complicated loop nests (with function calls in their loop bodies)

is a tedious process. Automatic loop transformation techniques such as polyhedral

techniques [41] are usually limited to affine programs, rendering them ineffective on non-affine real-world applications. In this section, we use dynamic analysis techniques and our automated lower bounding tool to aid in determining the tileability potential of arbitrary loop nests.

For illustration, consider a perfectly nested loop of depth two with corresponding

CDAG C = (I,V,E,O). Consider any untiled execution of this loop nest. In other

words, any valid reordering of operations within a single iteration of the outer loop

is permitted, while no reordering across outer loop iterations is allowed. Let Pu be

the pebbling sequence of C for this untiled execution. In Pu, vertices corresponding to the execution of any outer loop iteration i are fired before the vertices corresponding to any outer loop iteration j > i. Let i be any outer loop iteration, and let Tbi and Tei be the time-stamps at the beginning and end of the execution of iteration i, respectively. Let

S denote the size of fast memory. If Lbi ⊂ V represents the (live) set of vertices fired before Tbi with at least one successor fired between time-stamps Tbi and Tei, then Pu necessarily incurs a data movement cost of at least |Lbi| − S during the execution of iteration i. Similarly, if Lei ⊂ V represents the set of vertices fired during iteration i (i.e., between time-stamps Tbi and Tei) and have successors fired after Tei, then Pu necessarily incurs a data movement cost of at least |Lei| − S after time Tei. Further, within iteration i, Pu incurs a data movement cost of at least LB(Ci,S), where Ci

is the induced CDAG of C containing vertices fired during iteration i and LB(Ci,S)

represents a data movement lower bound for Ci with S red pebbles. Hence, the

data movement cost q(Pu,S) of any “untiled” pebbling sequence Pu where tiling is

disallowed satisfies

$$q(\mathcal{P}_u, S) \;\ge\; \sum_{i=1}^{T} \Big( (|L_{b_i}| - S) + (|L_{e_i}| - S) + LB(C_i, S) \Big),$$

where, T represents the number of outer loop iterations.

Now, let Pt represent any unrestricted pebbling sequence for C (where tiling

is permitted). LB(C, S) is a data movement lower bound for Pt. If
$$LB(C,S) \;\ge\; \epsilon \times \sum_{i=1}^{T} \Big( (|L_{b_i}| - S) + (|L_{e_i}| - S) + LB(C_i, S) \Big),$$
for some empirically chosen value of ϵ, such as ϵ ≥ 0.2, then we can assert with high probability that tiling is not possible, or will not have much impact on the data movement cost of the loop nest. However, if
$$LB(C,S) \;\ll\; \sum_{i=1}^{T} \Big( (|L_{b_i}| - S) + (|L_{e_i}| - S) + LB(C_i, S) \Big),$$
then there is a chance that tiling

would be beneficial for the loop nest, suggesting that it is worth spending manual

effort looking for tiling possibilities.
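For illustration, suppose the automated tool reported the following purely hypothetical values for a loop nest, with ϵ = 0.2:
$$LB(C,S) = 900, \qquad \sum_{i=1}^{T}\Big((|L_{b_i}|-S)+(|L_{e_i}|-S)+LB(C_i,S)\Big) = 4000.$$
Since 900 ≥ 0.2 × 4000 = 800, the unrestricted lower bound is not much smaller than the untiled cost, so tiling is unlikely to pay off. Had the unrestricted bound instead been, say, 200 ≪ 4000, the loop nest would be worth examining manually for tiling opportunities.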

This technique is also applicable to determine the potential of fusing two loop nests.

Let C be the CDAG of a program segment containing two consecutive loop nests.

Let C1 = (I1,V1,E1,O1) and C2 = (I2,V2,E2,O2) be the sub-CDAGs of the two

loop nests separately. Let Pu be any pebbling sequence where fusion of the two

loops is disallowed, i.e., vertices in V1 are fired before vertices in V2 in Pu. Let

T1 denote the time-stamp at the end of execution of first loop nest in Pu. If L1 ⊆

V1 represents the set of vertices that were fired before T1 and have at least one

successor fired after T1, then using the technique described above, if LB(C,S) ≥

ϵ × (LB(C1,S) + (|L1| − S) + LB(C2,S)) , then it is safer to assume that fusing the

two loops will not provide significant benefits; otherwise, a closer look at the code is necessary

to determine if loop fusion is possible.

In general, for any imperfectly nested loops of any depth, this process is me-

thodically performed as shown in Algorithm 4.3. Algorithm 4.4 returns True if two

consecutive loop nests have potential for fusion. If Algorithm 4.4 returns True for a sequence of consecutive loops, there is a possibility to fuse them all together. For example, if L1, L2 and L3 are three consecutive loop nests, and if Algorithm 4.4 finds

potential in fusing L1 and L2, and L2 and L3, then it is possible to fuse L1, L2 and L3

together. Algorithm 4.5 returns True if there is a potential in tiling a loop L and its

immediate inner loops. If Algorithm 4.5 finds potential in a sequence of inner loops,

then there is a possibility to tile them together. For example, if a program segment

contains a loop nest of depth three containing loops L1, L2 and L3 nested in this order, and if Algorithm 4.5 returns True for L2, it indicates that loops L2 and L3 can

possibly be tiled using two-dimensional tiles. Further, if Algorithm 4.5 returns True for L1, then all of L1, L2 and L3 can possibly be tiled using three-dimensional tiles. The function GetCDAG(L, [a, b]) returns the (sub-)CDAG corresponding to execution of iterations a to b (inclusive) of loop L.

Input : Loop nest L; CDAG of L, C = (I, V, E, O); Maximum loop depth D; Red pebble count S
Data: L.numIters is the number of runtime iterations of loop L
Data: L.id is a unique ID identifying loop L
Function CanFuseTile(L, C, S)
    for d = D − 1 down to 1 do
        foreach loop L at depth d in the input nest do
            (L1, ..., Ln) ← sequence of loop nests nested immediately inside L
            for i = 2 to n do
                C1 = (I1, V1, E1, O1) ← GetCDAG(L_{i−1}, [1, L_{i−1}.numIters])
                C2 = (I2, V2, E2, O2) ← GetCDAG(L_i, [1, L_i.numIters])
                C ← (I1 ∪ I2, V1 ∪ V2, (V1 × V2) ∩ E, O1 ∪ O2)
                if CheckFuseable(C, C1, C2, S) = True then
                    Print("Loops " + L_{i−1}.id + " and " + L_i.id + " are fuseable.")
            if CheckTileable(L, S) = True then
                Print("Loop " + L.id + " and its immediate inner loops are tileable.")

Algorithm 4.3: Checks fuseability and tileability potential of loop nests.

Our automated approach requires full CDAG of a program to be held in the mem-

ory for analysis. CDAGs extracted from real world applications might contain several

millions of vertices, making it infeasible to build the complete CDAG. Further, the

trace file generated from the execution of an instrumented code can grow into several

GBs when recording the events of the applications running for several minutes. How-

ever, many scientific computing applications are fairly regular in the sense that their

Input : CDAGs C = (I, V, E, O), C1 = (I1, V1, E1, O1), C2 = (I2, V2, E2, O2); Red pebble count S
Output: True if C1 and C2 are fuseable, False otherwise
Function CheckFuseable(C, C1, C2, S)
    LB ← GetLB(C, S)
    LB1 ← GetLB(C1, S)
    LB2 ← GetLB(C2, S)
    V′ ← {v | v ∈ V1 ∧ ∃w [(v, w) ∈ E ∧ w ∈ V2]}
    LB′ ← LB1 + LB2 + |V′| − S
    if LB ≥ ϵ × LB′ then
        return False
    else
        return True

Algorithm 4.4: Checks fuseability of two loop nests.

runtime data dependences do not show high variations between iterations. Hence, by analyzing a smaller sample of iteration domain, we will be able to draw useful con- clusions about the complete execution. The following subsection details this process of sampling, where only a portion of execution details (based on memory availabil- ity) are recorded into the trace file and a sub-CDAG corresponding to this portion is extracted for analysis.

4.4.1 CDAG sampling

In this subsection, we are interested in extracting sub-CDAG corresponding to a portion of execution of a program, without having to record events for full execution.

Particularly, we are interested in extracting sub-CDAG corresponding to first few iterations of each loop nest. For example, consider a simple perfectly nested two- dimensional loop nest, whose runtime trip-counts are 100 each, i.e., the iteration

Input : Loop nest L; Red pebble count S
Output: True if L is tileable, False otherwise
Data: L.numIters is the number of runtime iterations of loop L
Function CheckTileable(L, S)
    n ← L.numIters
    C ← GetCDAG(L, [1, n])
    LB ← GetLB(C, S)
    LB′ ← 0
    for i = 1 to n do
        C′ = (I′, V′, E′, O′) ← GetCDAG(L, [i, i])
        LB′ ← LB′ + GetLB(C′, S)
        C1 = (I1, V1, E1, O1) ← GetCDAG(L, [1, i − 1])
        C2 = (I2, V2, E2, O2) ← GetCDAG(L, [i + 1, n])
        V″ ← {v | v ∈ V1 ∧ ∃w [(v, w) ∈ E ∧ w ∈ V′]}
        LB′ ← LB′ + |V″| − S
        V″ ← {v | v ∈ V′ ∧ ∃w [(v, w) ∈ E ∧ w ∈ V2]}
        LB′ ← LB′ + |V″| − S
    if LB ≥ ϵ × LB′ then
        return False
    else
        return True

Algorithm 4.5: Checks tileability of a loop nest.

domain of this loop nest is {(i, j)|1 ≤ i, j ≤ 100}. We would like to extract sub-

CDAG corresponding to the sub-domain {(i, j)|1 ≤ i, j ≤ 10}.

It is valid to under-approximate the sub-CDAG by ignoring some of the dependence edges for our lower bounding purposes, but care should be taken not to include any spurious edges. In Subsection 4.2.1, we noted that extraction of a CDAG involves three phases: instrumentation, execution, and analysis. To perform sampling, the following modifications are made to these three phases.

Instrumentation: Trip counters are added to loop headers of each loop to keep track of iteration count during runtime. At the beginning of each iteration of a loop

during runtime, we need to detect whether we are inside or outside the sub-domain

of interest. This is done by inserting a boolean variable and necessary conditional

statements to loop headers. At runtime, this boolean variable is set (resp. unset) if

the execution is inside (resp. outside) the sample sub-domain. In case of function

calls, these states have to be properly propagated forward. For simplicity, these

implementation details are not presented here.
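As an illustration of the kind of code the instrumentation produces, the following C sketch shows a loop header with a trip counter and an in-sample flag. The names (trip_count, in_sample, SAMPLE_ITERS, record_event) are hypothetical placeholders, not the actual generated code.

#define SAMPLE_ITERS 10          /* record only the first 10 iterations of this loop */

extern void record_event(long iter);   /* hypothetical helper: appends the operation to the trace */

void instrumented_loop(int n) {
    long trip_count = 0;         /* runtime trip counter inserted into the loop header */
    int in_sample = 0;           /* set while execution is inside the sample sub-domain */

    for (int i = 0; i < n; ++i) {
        trip_count++;
        in_sample = (trip_count <= SAMPLE_ITERS);
        if (in_sample)
            record_event(trip_count);   /* only in-sample operations are traced */
        /* original loop body follows here */
    }
}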

Execution & analysis: As we can recall from Subsection 4.2.1, last-writer infor-

mation is used during analysis phase to create CDAG edges. Hence, it is not sufficient

to just record the events within the sample sub-domain, as this may result in spurious

edges being created in a sub-CDAG. For instance, ignoring any write to a memory

address outside the sample region might lead to an outdated last-writer map, in-

troducing wrong data dependence edges. To prevent this, any address A written

outside sample region is recorded as invalid in trace file. During analysis phase, when

an invalid-address event for A is read from trace file, the entry for address A in last-

writer map is correspondingly marked as invalid and any operand that is read from an

invalid address is ignored; no edge is created corresponding to this data dependence.

Within one continuous interval of execution outside sample region, it is sufficient to

mark an address as invalid just once. Following a write to an address A inside sam- ple region, for the very first time an operation outside sample region overwrites A,

address A is recorded as invalid in trace file. A is not recorded as invalid again until another operation inside sample region overwrites A.
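The corresponding analysis-phase handling can be sketched as follows. The event and vertex types, the direct-mapped last-writer table, and the helper functions are simplified, hypothetical stand-ins for the tool's internal data structures (in particular, the toy hash table ignores collisions).

#include <stddef.h>

typedef struct { int id; } vertex_t;

enum ev_kind { EV_OP, EV_INVALIDATE };
typedef struct {
    enum ev_kind kind;
    size_t write_addr;      /* address written by this operation (or invalidated address) */
    size_t read_addr[3];    /* addresses read (up to three operands in this sketch) */
    int num_reads;
} trace_event_t;

#define MEM_SLOTS 1024
static vertex_t *last_writer[MEM_SLOTS];   /* NULL means "invalid" / unknown writer */

extern vertex_t *new_vertex(const trace_event_t *ev);   /* assumed helpers */
extern void add_edge(vertex_t *src, vertex_t *dst);

void process_event(const trace_event_t *ev) {
    if (ev->kind == EV_INVALIDATE) {
        last_writer[ev->write_addr % MEM_SLOTS] = NULL;  /* mark entry invalid */
        return;
    }
    /* Only operations executed inside the sample region appear in the trace. */
    vertex_t *v = new_vertex(ev);
    for (int k = 0; k < ev->num_reads; ++k) {
        vertex_t *w = last_writer[ev->read_addr[k] % MEM_SLOTS];
        if (w != NULL)            /* valid last writer inside the sample */
            add_edge(w, v);       /* otherwise the read is ignored: no edge is created */
    }
    last_writer[ev->write_addr % MEM_SLOTS] = v;   /* v becomes the last writer */
}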

We claim that this trace file information is sufficient to prevent any spurious edge from being created by our algorithm for extracting sub-CDAGs. For the sake of contradiction,

assume that a spurious edge e = (u, v) is created in a CDAG. Our algorithm creates a

vertex only for operations executed inside sample region, since no operation executed

outside the sample region is stored in the trace file. Also, an edge (u′, v′) is created

if and only if last-writer has an entry mapping a memory address A′ to u′, and v′

corresponds to an operation inside sample region and reads from address A′. Further

an entry A′ → u′ is made in last-writer only if u′ stores its computed value into A′. Let

A be the memory address associated with data dependence captured by e. Presence

of edge (u, v) indicates that u and v are vertices for operations executed inside sample

region, v reads from A, and A is mapped to u and marked as valid during the time

of creation of vertex v. e being spurious indicates that there is another operation

o between u and v that writes to A. If o was inside sample region, then a vertex

w would have been created in CDAG and entry A → w would have been placed

overwriting A → u in last-writer map, thus preventing any edge from being created

from u to v. Hence o has to be outside the sample region. Since o writes to A, there

has to be at least one entry in trace file invalidating A after operation associated

with u was executed. When this entry was read from trace file, entry A → u would

have been marked as invalid, contradicting our assumption that entry for A is valid in last-writer when vertex v is created.

4.4.2 Experimental results

We used our automated approach for analysis of SPEC benchmarks [25]. The

benchmarks were profiled using HPCToolkit [28] to detect hot-loops. Table 4.2 shows

the ϵ values, i.e. ratio of the lower bounds for unrestricted and restricted schedules,

of the outermost loops (i.e., time dimension) of the hot-loop nests for two of the

SPEC benchmarks. 410.bwaves is a computational fluid dynamics application whose

Red pebbles, S | 410.bwaves | 459.GemsFDTD
          512  |   0.437    |   0.037
         1024  |   0.437    |   0.032
         2048  |   0.437    |   0.027
         4096  |   0.436    |   0.019
         8192  |   0.435    |   0.012

Table 4.2: Ratios of lower bounds for unrestricted and restricted schedules for two SPEC benchmarks.

implementation consists of an unfactored solver for the implicit solution of the Navier-Stokes equations using the Bi-CGstab algorithm. 459.GemsFDTD is a computational electromagnetics benchmark which solves the 3D Maxwell equations using the finite-difference time-domain (FDTD) method. The two benchmarks show opposite trends. The ratio for 410.bwaves is much higher than that of 459.GemsFDTD. Further, the ratio remains almost constant for 410.bwaves with increasing pebble count, while we see a steady decrease in the ratio for 459.GemsFDTD. This suggests that 459.GemsFDTD might be time-tileable, whereas there might not be a possibility to time-tile the main loop nest of 410.bwaves.

4.5 Related Work

Much of the past work has focused on deriving data movement lower bounds using problem-specific analytical reasoning through manual application of the lower bounding techniques. Christ et al. [15] developed an approach to automatically derive lower bounds for a sub-class of computations containing an indivisible loop body of affine statements within a perfectly nested loop. However, the techniques developed in this chapter can be applied to arbitrary computations. We are unaware of any previous work that permits automatic handling of arbitrary computations for lower bounds analysis.

CHAPTER 5

S-Partitioning for Lower Bounds Without Recomputation

In Chapter 3 we developed a graph min-cut based lower bounding technique for

RBNR model and presented an automated heuristic to compute lower bounds. Here, we will present an adaptation of Hong & Kung's S-partitioning technique for the RBNR

model. This modification allows us to formulate the problem of S-partitioning as an

integer linear programming problem, thus enabling us to derive automated bounds

for different algorithms using S-partitioning technique. Further, we will make use of

some of the results developed in this chapter to derive parametric lower bounds for a

sub-class of computations, called affine computations, in Chapter 6.

5.1 SNR-Partition

In this section, we introduce SNR-partition, which is an adaptation of Hong &

Kung’s S-partition for RBNR model.

Definition 25 (In-set). Given a CDAG C = (I,V,E,O) and a vertex-set Vs ⊆ V , in-set of Vs, In(Vs), is the set of vertices in V \ Vs that have at least one successor in

Vs.

Definition 26 (Out-set). Given a CDAG C = (I,V,E,O) and a vertex-set Vs ⊆ V , out-set of Vs, Out(Vs), is the set of vertices in Vs that have at least one successor outside Vs.

Definition 27 (SNR-partition). Given a CDAG C = (I,V,E,O), an SNR-partition of

C is a sequence of h subsets of V, (V1, . . . , Vh), that satisfies the following properties:

P1 i ≠ j ⇒ Vi ∩ Vj = ∅, and ∪_{i=1}^{h} Vi = V.

P2 There is no cyclic dependence between the subsets.

P3 For every subset Vi, 1 ≤ i ≤ h, size of its in-set, |In(Vi)| ≤ S.

P4 For every subset Vi, 1 ≤ i ≤ h, size of its out-set, |Out(Vi)| ≤ S.

We provide a proof for Theorem 4 under RBNR game model below.

Theorem 14. Any pebbling sequence P obtained by playing the RBNR pebble game

on a CDAG C = (I,V,E,O) using at most S red pebbles is associated with a 2SNR-

partition H of C such that S × h ≥ q(P,S) ≥ S × (h − 1), where q(P,S) is the cost

of P and h = |H|.

Proof. Consider a pebbling sequence P of RBNR pebble game containing q = q(P,S) pebblings with rules R1 or R2. Let (P1, P2,..., Ph) correspond to the sequence of pebbling subsequences obtained by partitioning P into h = ⌈q/S⌉ consecutive subse-

quences, such that each Pi, for i = 1, . . . , h − 1, contains exactly S pebblings with

rules R1 or R2 and Ph contains no more than S pebblings with rules R1 or R2.

Let Vi ⊆ V , i = 1, . . . , h, be the set of vertices fired in the pebbling subsequence

Pi (using rule R3NR). We will show that H = (V1,...,Vh) is a 2SNR-partition of C

such that S × h ≥ q(P,S) ≥ S × (h − 1).

Since every vertex v ∈ V \ I is fired exactly once in P, property P1 of Definition 27

is trivially satisfied for H.

Since firing of any vertex v ∈ Vi in P is possible only after all its predecessors have been fired (so as to satisfy rule R3NR), any predecessor, p, of v should have been necessarily fired in some pebbling subsequence Pj, j ≤ i, and hence, p ∈ Vj, j ≤ i,

satisfying property P2.

To prove satisfiability of property P3, for a given Vi, consider the following two sets: (1) Vi^{r0}: the set of vertices that have a red pebble on them just before the beginning of the pebbling subsequence Pi in P; (2) Vi^{br}: the set of vertices on which a red pebble is placed according to rule R1 in Pi. We have In(Vi) ⊆ Vi^{r0} ∪ Vi^{br}, and thus |In(Vi)| ≤ |Vi^{r0}| + |Vi^{br}|. As there are only S red pebbles, |Vi^{r0}| ≤ S. Also, by construction of Pi, |Vi^{br}| ≤ S. Hence, |In(Vi)| ≤ 2S.

Satisfiability of property P4 by H is proved in a similar way: let Vi^{r1} be the set of vertices that have a red pebble on them at the end of the pebbling subsequence Pi in P, and Vi^{rb} be the set of vertices of Vi on which a blue pebble is placed in Pi according to rule R2. We have Out(Vi) ⊆ Vi^{r1} ∪ Vi^{rb}, and thus |Out(Vi)| ≤ |Vi^{r1}| + |Vi^{rb}|. As there are only S red pebbles, |Vi^{r1}| ≤ S. Also, by construction of Pi, |Vi^{rb}| ≤ S. Hence, |Out(Vi)| ≤ 2S. □

Lemma 4. Let H(2S) be the minimal number of vertex sets of any 2SNR-partition

of a CDAG C. Then the data movement complexity of C without recomputation,

Q ≥ S × (H(2S) − 1)

5.2 ILP Formulation for SNR-Partitioning Based Lower Bounds

The size of the smallest 2SNR-partition of a CDAG C = (I, V, E, O) can be lower bounded using the size of the largest vertex set in any 2SNR-partition of C, as follows. Let H denote the number of vertex sets in a smallest 2SNR-partition, H = (V1, V2, . . . , VH), of C. Let Vx be the largest vertex set in H, i.e., x = arg max_i |Vi|, for i = 1, . . . , H. Then, since H · |Vx| ≥ Σ_{i=1}^{H} |Vi| ≥ |V|, we have
$$\left\lceil \frac{|V|}{|V_x|} \right\rceil \;\le\; H.$$

Hence, if U is an over-approximation of Vx (i.e., |U| ≥ |Vx|), then from Lemma 4 we have
$$Q \;\ge\; S \times \left( \left\lceil \frac{|V|}{|U|} \right\rceil - 1 \right) \qquad (5.1)$$

For a given CDAG C = (I,V,E,O), the following integer linear programming

(ILP) formulation provides an upper bound on the size |U| of the largest vertex set:
$$\begin{aligned}
\text{Maximize} \quad & \sum_{v \in V} x_v \\
\text{subject to} \quad & \forall (u,v) \in E,\; x_u + y_u \ge x_v \\
& \forall v \in I,\; x_v = 0 \\
& \sum_{v \in V} y_v \le 2S \\
& \forall v \in V,\; 0 \le x_v, y_v \le 1
\end{aligned} \qquad (5.2)$$

Each vertex v ∈ V is associated with two boolean variables xv and yv. xv determines

if v belongs to the over-approximated vertex set U or not. yv determines whether v belongs to the in-set of U. The maximization is performed under the constraints that if a vertex v belongs to U, then its predecessor u should either be in U or be in the in-set of U. The second constraint prevents inputs from being part of U and the third constraint restricts the size of the in-set to be less than or equal to 2S.
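To make the role of the variables concrete, the following is a small hand-worked instance on a hypothetical three-vertex chain CDAG (not one of the benchmark CDAGs analyzed later): let I = {v1} and let the edges be v1 → v2 → v3. The ILP (5.2) becomes
$$\max\; x_{v_2} + x_{v_3} \quad \text{s.t.}\quad x_{v_1} = 0,\;\; x_{v_1} + y_{v_1} \ge x_{v_2},\;\; x_{v_2} + y_{v_2} \ge x_{v_3},\;\; y_{v_1} + y_{v_2} + y_{v_3} \le 2S.$$
For any S ≥ 1, setting y_{v1} = 1 and x_{v2} = x_{v3} = 1 is feasible and optimal, giving |U| ≤ 2: the two non-input vertices can indeed form one vertex set whose in-set is just {v1}.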

The above ILP formulation can be used to automatically determine a lower bound of any arbitrary CDAG. The process of automatically extracting the CDAG of an algorithm was described in Subsection 4.2.1.

5.3 Experimental Results

The combinatorial nature of integer programming problem restricts the size of the CDAGs that could be analyzed using S-partitioning based automated approach.

However, in certain cases, especially when CDAGs have "regular" structure, it is possible to draw useful conclusions based on automated analysis of smaller problem sizes. As in Section 4.3, we will first show the effectiveness of our automated approach by comparing the results against previously known analytical lower bounds.

Irony et al. [29] showed that the data movement complexity Q of N × N matrix-matrix multiplication (matmult) satisfies
$$Q \;\ge\; \frac{N^3}{2\sqrt{2S}} - S.$$
Figure 5.1 shows the automated analysis results for matmult for a problem size of N = 30. The bound

[Figure: "Matrix-Matrix Multiplication; N=30". y-axis: lower bound; x-axis: number of red pebbles S; curves: analytical (Irony et al.) and automated analysis.]

Figure 5.1: Lower bound for matmult: Analytical versus automated CDAG analysis.

generated by the automated analysis follows the same trend as that of the analytical bound. As seen for the case of Jacobi 2D in Subsection 4.3.1, the automated analysis for matmult provides a tighter bound.
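As a rough sanity check of the analytical curve (the arithmetic below is ours and only indicative), at N = 30 and S = 20 the analytical bound evaluates to
$$\frac{N^3}{2\sqrt{2S}} - S \;=\; \frac{27000}{2\sqrt{40}} - 20 \;\approx\; 2135 - 20 \;\approx\; 2.1 \times 10^{3},$$
which is the order of magnitude of the analytical curve in Figure 5.1 at that pebble count.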

5.3.1 Use case: Compiler evaluation

A potential use of obtaining lower bounds on data movement complexity of com- putations is in evaluating the effectiveness of automatic transformations generated by compilers. It can help compiler writers identify segments of code that may not have been optimally transformed by the compiler. We demonstrate this below using matrix-matrix multiplication as an example. Pluto [41] is a fully automatic polyhedral based optimizing compiler. Figure 5.2 shows the data movement lower bound from our automated analysis against the data movement cost of the schedule produced by

Pluto-tiled code for a 10 × 10 matrix for various red pebble counts that correspond to different register-file sizes. (Different tile sizes were tried to find the best tile size for the Pluto-tiled code.)

[Figure: "Matrix-Matrix Multiplication; N=10". y-axis: data movement cost; x-axis: number of red pebbles S; curves: automated analysis lower bound, Pluto-tiled code, and untiled code.]

Figure 5.2: Comparison of data movement cost of Pluto’s register-tiled schedule against the lower bound.

Based on the sizeable gap between the two curves in Figure 5.2, the generated

tiled code from Pluto was studied. Listing 5.1 shows the register-tiled code generated

by Pluto. From a closer look at the code, it was clear that a changed loop order

register int ra,rb,rc,rd,re,rf,rg,rh,ri,rj,rk,rl,rm;
for(it = 0; it < N; it+=3)
  for(jt = 0; jt < N; jt+=3)
    for(kt = 0; kt < N; kt+=3){
      ra = B[kt][jt];   rb = B[kt][jt+1];   rc = B[kt][jt+2];
      rd = B[kt+1][jt]; re = B[kt+1][jt+1]; rf = B[kt+1][jt+2];
      rg = B[kt+2][jt]; rh = B[kt+2][jt+1]; ri = B[kt+2][jt+2];
      for(i = it; i < min(it+3,N); i++){
        rj = A[i][kt]; rk = A[i][kt+1]; rl = A[i][kt+2];

        rm = C[i][jt];
        rm += (rj*ra); rm += (rk*rd); rm += (rl*rg);
        C[i][jt] = rm;

        rm = C[i][jt+1];
        rm += (rj*rb); rm += (rk*re); rm += (rl*rh);
        C[i][jt+1] = rm;

        rm = C[i][jt+2];
        rm += (rj*rc); rm += (rk*rf); rm += (rl*ri);
        C[i][jt+2] = rm;
      }
    }

Listing 5.1: Register-tiled code generated by Pluto

should improve performance. Pluto’s register-tiled code with manual permutation

applied is shown in Listing 5.2. Figure 5.3 shows the improvement in performance

from the modified code. This analysis provides clues to the compiler developer on the

optimizations that were missed by the compiler.

register int ra,rb,rc,rd,re,rf,rg,rh,ri,rj,rk,rl,rm;
for(jt = 0; jt < N; jt+=3)
  for(kt = 0; kt < N; kt+=3){
    ra = B[kt][jt];   rb = B[kt][jt+1];   rc = B[kt][jt+2];
    rd = B[kt+1][jt]; re = B[kt+1][jt+1]; rf = B[kt+1][jt+2];
    rg = B[kt+2][jt]; rh = B[kt+2][jt+1]; ri = B[kt+2][jt+2];
    for(it = 0; it < N; it+=3)
      for(i = it; i < min(it+3,N); i++){
        rj = A[i][kt]; rk = A[i][kt+1]; rl = A[i][kt+2];

        rm = C[i][jt];
        rm += (rj*ra); rm += (rk*rd); rm += (rl*rg);
        C[i][jt] = rm;

        rm = C[i][jt+1];
        rm += (rj*rb); rm += (rk*re); rm += (rl*rh);
        C[i][jt+1] = rm;

        rm = C[i][jt+2];
        rm += (rj*rc); rm += (rk*rf); rm += (rl*ri);
        C[i][jt+2] = rm;
      }
  }

Listing 5.2: Pluto's register-tiled code with the loop permutation manually applied

[Figure: "Matrix-Matrix Multiplication: Performance" (Compiler: ICC; Datatype: Int). y-axis: GFLOPS; x-axis: problem size N; curves: Pluto's original schedule and the modified (permuted) schedule.]

Figure 5.3: Performance of register-tiled codes with Pluto’s original schedule and Pluto’s schedule with manual permutation applied.

CHAPTER 6

Parametric Lower Bounds for Affine Computations Using Static Analysis

In this chapter, we develop a static analysis approach to derive asymptotic para- metric data movement lower bounds as a function of cache size and problem size for affine computations. Affine computations can be modeled using (union of) convex sets of integer points, and (union of) relations between these sets. The motivation is twofold. First, there exists an important class of affine computations whose con- trol and data flow can be modeled exactly at compile-time using only affine forms of the loop iterators surrounding the computation statements, and program parameters

(constants whose values are unknown at compile-time). Many dense linear algebra computations, image processing algorithms, finite difference methods, etc., belong to this class of programs [23]. Second, there exist readily available tools to perform com-

plex geometric operations on such sets and relations. We use the Integer Set Library

(ISL) [54] for our analysis.

6.1 Background

6.1.1 Polyhedron representation for affine codes

In the following, we use ISL terminology [54] and syntax to describe sets and

relations. We now recall some key concepts to represent program features.

Definition 28 (Polyhedron). A polyhedron is an intersection of a finite number of

half-spaces.

Definition 29 (Z-polyhedron). A Z-polyhedron is the intersection of a polyhedron

with the integer lattice. In other words, a Z-polyhedron is the set of integer points

inside a polyhedron.

Iteration domain: A computation vertex in a CDAG represents a dynamic in-

stance of some operation in the input program. For example, given a statement

S1 : A[i] += B[i+1] surrounded by one loop for(i = 0; i < n; ++i), the

operation += will be executed n times, and each such dynamic instance of the state- ment corresponds to a vertex in the CDAG. For affine programs, this set of dynamic instances can be compactly represented as a (union of) Z-polyhedra, i.e., a set of

integer points bounded by affine inequalities intersected with an affine integer lat-

tice [24]. Using ISL notation, the iteration domain of statement S1, DS1, is denoted:

[n]->{S1[i]:0<=i, [n] in the example, is the list of

all parameters needed to define the set. S1[i] models a set with one dimension

([i]) named i, and the set space is named S1. Presburger formulae are used on the

right-hand side of “:” to model the points belonging to the set. In ISL, these sets

are disjunctions of conjunctions of Presburger formulae, thereby modeling unions of

convex and strided integer sets. The dimension of a set S is denoted as dim(S). dim(S1) = 1 in the example above. The cardinality of set S is denoted as |S|.

|S1| = n for the example. Standard operations on sets, such as union, intersection, projection along certain dimensions, are available. In addition, key operations for analysis, such as building counting polynomials for the set (i.e., polynomials of the program parameters that model how many integer points are contained in a set; n in our example) [5], and parametric (integer) linear programming [21] are possible on

such sets. These operations are available in ISL.

We remark that although our analysis relies on integer sets and their associated

operations, it is not limited to programs that can be exactly captured using such sets

(e.g., purely affine programs). Since we are interested in computing lower bounds on

data movement complexity, an under-approximation of the statement domain and/or

the set of dependences is acceptable, since a lower bound for the approximated system

is a valid lower bound for the actual system. For instance, if the iteration domain DS

of a statement S is not described exactly using Presburger formulae, we can under-

approximate this set by taking the largest convex polyhedron D̲S ⊆ DS. Such a polyhedron can be obtained, for instance, by first computing the convex hull D̄S ⊇ DS and then shifting its faces until they are strictly included in DS. We also remark that

such sets can be extracted from an arbitrary CDAG (again using approximations)

by means of trace analysis, and especially trace compression techniques for vertices

modeling the same computation [32].

Relations: In the graph G = (V,E) of a CDAG C = (I,V,E,O), vertices are connected by producer-consumer edges capturing the data flow between operations.

89 Similar to iteration domains, affine forms are used to model the relations between

the points in two sets. Such relations capture which data is accessed by a dynamic

instance of a statement, as in classical data-flow analysis. In the example above,

elements of array B are read in statement S1, and the relation R1 describing this

access is: [n] -> { S1[i] -> B[i+1] : 0 <= i < n }. This relation maps each iteration i to the array element it reads, i.e., i → i + 1. Several operations on relations are available, such as domain(R), which computes the domain (e.g., input set) of the relation (domain(R1) = [n] -> { [i] : 0 <= i < n }), and image(R), which computes the image (e.g., range, or output set) of R (image(R1) = [n] -> { [i] : 1 <= i <= n }). Standard operations, such as union ∪, intersection ∩, difference \, and the transitive closure R+ of a relation, are available.

All these operations are supported by ISL.
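For illustration, the relation R1 above and its domain and image can be manipulated directly through ISL's C interface. The following is a minimal sketch using only basic ISL entry points (error handling omitted); it is not part of the tool described in this dissertation.

#include <stdio.h>
#include <isl/ctx.h>
#include <isl/map.h>
#include <isl/set.h>

int main(void) {
    isl_ctx *ctx = isl_ctx_alloc();

    /* The access relation R1 of statement S1 reading B[i+1]. */
    isl_map *R1 = isl_map_read_from_str(ctx,
        "[n] -> { S1[i] -> B[i+1] : 0 <= i < n }");

    isl_set *dom = isl_map_domain(isl_map_copy(R1));  /* iterations performing the read */
    isl_set *img = isl_map_range(isl_map_copy(R1));   /* array elements being read */

    isl_set_dump(dom);   /* prints the domain of R1 */
    isl_set_dump(img);   /* prints the image (range) of R1 */

    isl_set_free(dom);
    isl_set_free(img);
    isl_map_free(R1);
    isl_ctx_free(ctx);
    return 0;
}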

Relations can also be used to directly capture the connections between computa-

tion vertices. For instance, given two statements S1 and S2 with a producer-consumer relationship, the edges connecting each dynamic instance of S1 and S2 in a CDAG can be expressed using relations. For example, [n]->{S1[i,j] -> S2[i,j-1,k] :

...} models a relation between a 2D statement and a 3D statement. Each point in

S1 is connected to several points in S2 along the k-dimension.

We note that in a similar manner to iteration domains for vertices, relations can also be extracted from non-affine programs via convex under-approximation or from the CDAG via trace analysis. Again, care must be taken to always properly under-approximate the relations capturing data dependences: it is safe to ignore a dependence (it can only lead to under-approximation of the data flow and therefore

the data movement requirement), and therefore we only consider must-dependences

in our analysis framework.

CDAG and Z-polyhedron: The CDAG of an affine loop nest (i.e., a sequence of (im)perfectly nested loops whose loop bounds and array accesses are affine functions of surrounding loop variables and program parameters) is closely related to the

(union of) Z-polyhedra representing the (union of) domains of its statements. This relationship is shown in Figure 6.1. Let Z denote the union of Z-polyhedra repre-

senting the union of statement domains in Zd space, where d is the maximum

loop depth of the affine loop nest. Each vertex in the CDAG C is associated with

an integer point in Z as follows: a vertex v in C that is computed during iteration

(i, j) is associated with the point (i, j)T ∈ Z. The edges in C are aligned along the

dependence directions in Z.

Flow-dependence graph: A flow-dependence graph of a computation is a directed multi-graph. There is a vertex corresponding to each input and each statement of the program. Each vertex v has an associated domain Dv, representing the domain of the

corresponding statement. Edges between the vertices capture the flow-dependences

between the statements. Each edge e has an associated relation, Re, representing the affine dependence relation between the source and target vertices of the edge.

6.1.2 Geometric inequalities

Loomis-Whitney inequality [37]: Loomis-Whitney inequality is a geometric re-

sult that allows one to estimate the size of a d-dimensional set in terms of the sizes of its

(d−1)-dimensional orthogonal projections onto the coordinate hyperplanes. If E ⊆ Rd

[Figure 6.1 panels: (a) Jacobi 1D source code (parameters N, T; input I[N]; output A[N]; statements S1, S2, S3); (b) its CDAG drawn over the (i, t) iteration space; (c) its flow-dependence graph with vertices I, S1, S2, S3 and edges e1–e10.]

Figure 6.1: Jacobi 1D computation, its CDAG and flow-dependence graph.

is some measurable set, and $\phi_j : \mathbb{R}^d \to \mathbb{R}^{d-1}$, for $1 \le j \le d$, are the $d$ orthogonal projections onto the coordinate hyperplanes, i.e., $\phi_j(x_1, \dots, x_d) = (x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_d)$, then, by the Loomis-Whitney inequality, the size (Lebesgue measure) of the set $E$ satisfies
$$|E| \;\le\; \prod_{j=1}^{d} |\phi_j(E)|^{1/(d-1)}.$$
For example, if $E$ is some set in $\mathbb{R}^3$ (refer to Figure 6.2) and $E_i$, $E_j$ and $E_k$ are the projections of $E$ along directions $i$, $j$ and $k$, respectively, then by the Loomis-Whitney inequality the volume of $E$ satisfies $|E| \le \sqrt{|E_i| \times |E_j| \times |E_k|}$.
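As a quick illustration of tightness (a standard example, included here only for intuition), consider the axis-aligned cube $E = [0, n]^3$. Each of the three projections is a square of area $n^2$, so the bound gives
$$|E| \;\le\; \sqrt{n^2 \cdot n^2 \cdot n^2} \;=\; n^3,$$
which equals the volume of the cube; the inequality is attained with equality on boxes.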


Figure 6.2: Illustration of Loomis-Whitney inequality in three-dimensional space. Here, volume of the red sphere is bounded from above by the product of square-root of the area of the three projected circles.

Brascamp-Lieb inequality: Brascamp-Lieb inequality is the generalization of the

Loomis-Whitney inequality, where the orthogonal projections onto the coordinate hy-

perplanes are replaced by more general linear maps, not necessarily all mapping to

the same dimension. (Refer Figure 6.3.) Christ et al. [15, Theorem 3.2] extended

a discrete case of a Brascamp-Lieb inequality [6, Theorem 2.4] for deriving com-

munication lower bounds under geometric model. The extension of the continuous

Brascamp-Lieb inequality, in the restricted case of orthogonal projections and using

Lebesgue measure for the sizes of the sets, is stated below. We use the notation

H ≤ Rd to denote that H is a linear subspace of Rd.

Theorem 15. Let $\phi_j : \mathbb{R}^d \to \mathbb{R}^{d_j}$ be an orthogonal projection for $j \in \{1, 2, \dots, m\}$ such that $\phi_j(x_1, \dots, x_d) = (y_1, \dots, y_{d_j})$ where $\{y_1, \dots, y_{d_j}\} \subseteq \{x_1, \dots, x_d\}$. Then, for $(s_1, \dots, s_m) \in [0, 1]^m$:
$$\forall H \le \mathbb{R}^d,\quad \dim(H) \;\le\; \sum_{j=1}^{m} s_j \dim(\phi_j(H)) \qquad (6.1)$$
$$\iff\quad \forall E \subseteq \mathbb{R}^d,\quad |E| \;\le\; \prod_{j=1}^{m} |\phi_j(E)|^{s_j} \qquad (6.2)$$


Figure 6.3: Illustration of Brascamp-Lieb inequality.

In our restricted case, since the linear maps ϕj are orthogonal projections, the following Theorem enables us to limit the number of inequalities of (6.1) required for

Theorem 15 to hold. Only one inequality per subspace Hi, defined as the linear span

94 of the canonical vector ei, is required (⟨ei⟩ represents the subspace spanned by the vector with a non-zero only in the ith coordinate). The following theorem is equivalent to [15, Theorem 6.6]. (See also [6, Proposition 7.1].)

Theorem 16. Let $\phi_j : \mathbb{R}^d \to \mathbb{R}^{d_j}$ be an orthogonal projection for $j \in \{1, 2, \dots, m\}$ such that $\phi_j(x_1, \dots, x_d) = (y_1, \dots, y_{d_j})$ where $\{y_1, \dots, y_{d_j}\} \subseteq \{x_1, \dots, x_d\}$. Then, for $(s_1, \dots, s_m) \in [0, 1]^m$:
$$\forall H \le \mathbb{R}^d,\quad \dim(H) \;\le\; \sum_{j=1}^{m} s_j \dim(\phi_j(H)) \qquad (6.3)$$
$$\iff\quad \forall H_i = \langle e_i \rangle,\quad 1 = \dim(H_i) \;\le\; \sum_{j=1}^{m} s_j \delta_{i,j} \qquad (6.4)$$
where $\delta_{i,j} = \dim(\phi_j(H_i))$.
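For instance (this check is included purely for illustration), take $m = d$, let the $\phi_j$ be the $d$ coordinate-hyperplane projections, and choose the uniform exponents $s_j = 1/(d-1)$. Then $\delta_{i,j} = 1$ for $j \neq i$ and $\delta_{i,i} = 0$, so
$$\sum_{j=1}^{d} s_j \,\delta_{i,j} \;=\; \frac{d-1}{d-1} \;=\; 1 \;\ge\; \dim(H_i) \quad \text{for every } i,$$
and (6.2) specializes to $|E| \le \prod_{j=1}^{d} |\phi_j(E)|^{1/(d-1)}$, recovering the Loomis-Whitney inequality.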

6.2 Brascamp-Lieb Inequality and 2SNR-Partitioning Based Lower Bound

Figure 6.4 shows a simple loop nest with a single statement and its corresponding

CDAG as a Z-polyhedron. The integer points on the coordinate line {i = 0} represent

the input vertices of array B and the integer points on the coordinate line {j = 0}

represent the input vertices of array A. An integer point (i, j) ∈ Z2 corresponds to the

vertex in the CDAG computed at iteration (i, j). For convenience, in the descriptions

below, a vertex in a CDAG is sometimes directly referred to by its corresponding

integer point in the Z-polyhedron. Since an operation during iteration (i, j) reads

values from elements A[i] and B[j], images of the projection of a point p ∈ DS1 along

the dependence directions (1, 0)T and (0, 1)T provide vertices in its in-set, In({p}).

95 Hence, if E ⊆ DS1, ϕ1 :(i, j) → i and ϕ2 :(i, j) → j, then ϕ1(E) ⊆ In(E) and

ϕ2(E) ⊆ In(E).

[Figure 6.4: a single-statement loop nest over (i, j) that reads A[i] and B[j], and its CDAG drawn as a Z-polyhedron; the input vertices of A lie on the line {j = 0} and those of B on the line {i = 0}.]

Figure 6.4: Example to illustrate the relationship between geometric inequality and data movement lower bounds.

From the definition of a SNR-partition, any vertex-set Vi in a 2SNR-partition has

the property |In(Vi)| ≤ 2S. Hence, in our example, we have |ϕ1(Vi)| ≤ 2S and

|ϕ2(Vi)| ≤ 2S. The Brascamp-Lieb inequality allows us to upper bound the size of

any set E ⊆ Rd in terms of the sizes of the images of its projections. Hence, size of

the largest vertex-set U of any 2SNR-partition, in our example, is bounded as follows:
$$|U| \;\le\; \prod_{j=1}^{2} |\phi_j(U)|^{s_j} \;\le\; \prod_{j=1}^{2} (2S)^{s_j} \;=\; (2S)^{\sum_{j=1}^{2} s_j},$$
for $s = (s_1, s_2) \in [0,1]^2$, subject to the constraint $\forall H_i,\; 1 \le \sum_{j=1}^{2} s_j \delta_{i,j}$. In general, the size of the largest vertex-set satisfies $|U| \le \prod_{j=1}^{m} (2S)^{s_j} = (2S)^{\sum_{j=1}^{m} s_j}$, for $s = (s_1, \dots, s_m) \in [0,1]^m$, subject to the constraint $\forall H_i,\; 1 \le \sum_{j=1}^{m} s_j \delta_{i,j}$. In order to obtain as tight an asymptotic bound as possible, we seek $s$ such that $|U|$ is as small as possible. This corresponds to finding $s_j$ such that $\sum_{j=1}^{m} s_j$ is minimized, or equivalently, $(2S)^{\sum_j s_j}$ is minimized.

For this purpose, if ∀i, ∃j, s.t., δi,j = 1, we solve:
$$\text{Minimize} \;\sum_{j=1}^{m} s_j, \quad \text{s.t.}\quad \forall i,\; 1 \le \sum_{j=1}^{m} s_j \delta_{i,j} \qquad (6.5)$$

We can instead solve the following dual problem, whose solution gives an indication of the shape of the optimal “cube.”

$$\text{Maximize} \;\sum_{i=1}^{d} x_i, \quad \text{s.t.}\quad \forall j,\; \sum_{i=1}^{d} x_i \delta_{i,j} \le 1 \qquad (6.6)$$

In our example, we have:

ϕ1 :(i, j) → (i); ϕ2 :(i, j) → (j)

Let H1 and H2 denote the two subspaces spanned by the vectors (1, 0)^T and (0, 1)^T, respectively. We have δ1,1 = dim(ϕ1(H1)) = 1 and δ2,1 = dim(ϕ1(H2)) = 0. This provides the first constraint of (6.6) as x1·1 + x2·0 ≤ 1, i.e., x1 ≤ 1. Similarly,

δ1,2 = 0 and δ2,2 = 1 providing us the second constraint x2 ≤ 1. This results in the following linear programming problem:

Maximize x1 + x2 (6.7)

s.t. x1 ≤ 1; x2 ≤ 1

Solving (6.7) provides the solution x1 = x2 = 1, i.e., x1 + x2 = 2. Hence, the size of the largest vertex-set is |U| = O(S^{x1+x2}) = O(S^2), and we obtain a data movement lower bound of Q = Ω(N^2/S) (using Inequality (5.1)) for our example.

In the next section, we detail methods to obtain valid projection directions auto-

matically through static analysis of an input affine code.

6.3 Automated Lower Bound Computation Through Static Analysis

We present a static analysis algorithm for automated derivation of expressions for parametric asymptotic data movement lower bounds for algorithms. We use two illus- trative examples to explain various steps in the algorithm before providing detailed pseudo-code for the algorithm.

6.3.1 Illustrative example 1

Consider the example of Jacobi 1D stencil computation in Figure 6.1.

Statements and domains: Figure 6.1(c) shows the static flow-dependence graph

GF = (VF ,EF ) for Jacobi 1D. GF contains a vertex for each statement in the program.

The input array I is also explicitly represented in GF by node I (shaded in black in

Figure 6.1(c)). Each vertex has an associated domain as shown below:

• DI = [N] -> { I[i] : 0 <= i < N }

• DS1 = [N] -> { S1[i] : 0 <= i < N }

• DS2 = [T,N] -> { S2[t,i] : 1 <= t < T and 1 <= i < N-1 }

• DS3 = [T,N] -> { S3[t,i] : 1 <= t < T and 1 <= i < N-1 }

Edges and relations: The edges represent true (read-after-write) data depen- dences between the statements. Each edge has an associated affine dependence rela- tion as shown below:

• Edge e1: This edge corresponds to the dependence due to copying the inputs I to array A at statement S1 and has the following relation.
  Re1 = [N] -> { I[i] -> S1[i] : 0 <= i < N }

• Edges e2, e3 and e4: The use of array elements A[i-1], A[i] and A[i+1] at

statement S2 are captured by edges e2, e3 and e4, respectively.
  Re2 = [T,N] -> { S1[i] -> S2[1,i+1] : 1 <= i < N-2 }
  Re3 = [T,N] -> { S1[i] -> S2[1,i] : 1 <= i < N-1 }
  Re4 = [T,N] -> { S1[i] -> S2[1,i-1] : 2 <= i < N-1 }

• Edges e5 and e6: Multiple uses of the boundary elements I[0] and I[N-1]

by A[t][1] and A[t][N-2], respectively, for 1 <= t < T, are captured by e5 and e6 with the following relations.
  Re5 = [T,N] -> { S1[0] -> S2[t,1] : 1 <= t < T }
  Re6 = [T,N] -> { S1[N-1] -> S2[t,N-2] : 1 <= t < T }

• Edge e7: The use of array B in statement S3 corresponds to edge e7 with the

following relation.
  Re7 = [T,N] -> { S2[t,i] -> S3[t,i] : 1 <= t < T and 1 <= i < N-1 }

• Edges e8, e9 and e10: The uses of array A in statement S2 from S3 are repre-

sented by these edges with the following relations.
  Re8 = [T,N] -> { S3[t,i] -> S2[t+1,i+1] : 1 <= t < T-1 and 1 <= i < N-2 }
  Re9 = [T,N] -> { S3[t,i] -> S2[t+1,i] : 1 <= t < T-1 and 1 <= i < N-1 }
  Re10 = [T,N] -> { S3[t,i] -> S2[t+1,i-1] : 1 <= t < T-1 and 2 <= i < N }

Domain and image of a relation associated with an edge ei represent the domain

associated with the source and target vertices of ei. For example, the domain of Re2 is the subset of DS1 whose values are read by S2, and its image is the subset of DS2 at t = 1 that reads a value produced by S1.

Paths: Given a path P = (e1, . . . , el) in GF with associated edge relations (Re1, . . . , Rel), the relation associated with P can be computed by composing the relations of its edges, i.e., RP = Rel ∘ · · · ∘ Re1.

For instance, the relation for the path (e7, e8) in the example, obtained through the composition Re8 ∘ Re7, is given by RP = [T,N] -> { S2[t,i] -> S2[t+1,i+1] }.

Further, the domain and image of a composition are restricted to the points for which the composition can apply, i.e.,

  domain(Rej ∘ Rei) = Rei^{-1}(image(Rei) ∩ domain(Rej))
  image(Rej ∘ Rei) = Rej(image(Rei) ∩ domain(Rej)).

Hence, domain(RP) = [T,N] -> { S2[t,i] : 1 <= t < T-1 and 1 <= i < N-2 } and image(RP) = [T,N] -> { S2[t,i] : 2 <= t < T and 2 <= i < N-1 }.

Two kinds of paths, namely, injective circuit and broadcast path, defined below, are of specific importance to our analysis.

Definition 30 (Injective edge and circuit). An injective edge e ∈ EF is an edge of

a flow-dependence graph GF = (VF ,EF ) whose associated relation Re is both affine and injective, i.e., Re = Ae.xe + be, where Ae is an invertible matrix. An injective

100 circuit is a circuit P of a flow-dependence graph such that every edge p ∈ P is an injective edge and its associated relation RP = I.xP + bP, where I is an identity matrix. The vector bP is addressed as the dependence vector of the path P .

Note that it is sufficient for the matrix Ae of the affine relation Re of an injective edge e to be any invertible matrix, however, the matrix of the composed affine relation of an injective circuit P has to be an identity matrix.

Definition 31 (Broadcast edge and path). A broadcast edge b ∈ EF is an edge of a flow-dependence graph GF = (VF ,EF ) whose associated relation Rb is affine and dim(domain(Rb)) < dim(image(Rb)).A broadcast path is a path (e1, . . . , en) in a flow-dependence graph such that e1 is a broadcast edge and e2, . . . , en are injective edges.

Injective circuits and broadcast paths in a flow-dependence graph essentially in- dicate multiple uses of same data, and therefore are good candidates for lower bound analysis. Hence, only paths of these two kinds are considered in the analysis. The current example of Jacobi 1D computation illustrates the use of injective circuits to derive data movement lower bounds, while the use of broadcast paths for lower bound analysis is explained in another example that follows.

In the Jacobi example, we have three circuits between vertices S2 and S3. The relation for each circuit is computed by composing the relations of its edges as ex- plained earlier. The relations, and the dependence vectors they represent, are listed below.

• Circuit K1 = (e7, e8):
  RK1 = [T,N] -> { S2[t,i] -> S2[t+1,i+1] : 1 <= t < T-1 and 1 <= i < N-2 }
  b1 = (1, 1)^T

• Circuit K2 = (e7, e9):
  RK2 = [T,N] -> { S2[t,i] -> S2[t+1,i] : 1 <= t < T-1 and 1 <= i < N-1 }
  b2 = (1, 0)^T

• Circuit K3 = (e7, e10):
  RK3 = [T,N] -> { S2[t,i] -> S2[t+1,i-1] : 1 <= t < T-1 and 2 <= i < N-1 }
  b3 = (1, −1)^T

Remark 2. Each injective circuit in a flow-dependence graph gives rise to a set of vertex-disjoint directed paths in the corresponding CDAG. This is a direct consequence of injective property of the relations of injective circuits.

This fact is exemplified in Figure 6.5.

[Figure 6.5: the (i, t) iteration space of Jacobi 1D, with the vertex-disjoint directed paths induced by circuit K1 = (e7, e8) drawn as diagonal chains through the S2 and S3 vertices.]

Figure 6.5: Set of disjoint paths in the CDAG corresponding to the injective circuit K1.

Frontiers: As in Figure 6.5, the sub-CDAG corresponding to injective circuits of a flow-dependence graph might have no input vertices. As discussed earlier, this might yield trivial bounds with the 2S-partitioning technique. Hence, predecessor-

free vertices have to be tagged as inputs (using Theorem 3) to obtain meaningful

bounds. This set of predecessor-free vertices, called frontier of an injective circuit,

can be obtained using the following set operation from a flow-dependence graph: A

frontier, F, of an injective circuit K, with associated relation RK, can be calculated using the set operation F = domain(RK) \ image(RK). The frontiers F1, F2 and F3 of the three circuits K1, K2 and K3, respectively, are listed below.

• F1 = [T,N]->{S2[1,i]:2<=i

• F2 = [T,N]->{S2[1,i]:1<=i

• F3 = [T,N]->{S2[1,i]:2<=i

Injective circuits and lower bounds: Let U ⊆ V be a vertex-set of a 2SNR-

partition of the CDAG C = (I,V,E,O) of Jacobi 1D computation. Corresponding to U, there are a set of points in the domain Z-polyhedron Z (e.g., the set of points

inside the gray colored box in Figure 6.6). For simplicity, we just refer to the vertices

in C by their corresponding points in Z. As described in Remark 2, each injective

circuit Ki, 1 ≤ i ≤ 3, provides a set of vertex-disjoint paths Pi, 1 ≤ i ≤ 3, in C. In

Z, the points outside the set U with an edge to a point in U correspond to In(U).

Since there is no cyclic dependence between the vertex-sets of a 2SNR-partition and

the paths in set Pi, 1 ≤ i ≤ 3, are vertex-disjoint, by starting from the vertices

of In(U) and tracing backwards along any path Pi, 1 ≤ i ≤ 3, we would reach

|In(U)| ≤ 2S distinct frontier vertices. This process of tracing backwards is equivalent


Figure 6.6: Iteration domain space for Jacobi 1D. Blue and red circles: Points represent- ing dynamic instances of statements S2 and S3, respectively; Black arrows: Relation RK1 of injective circuit (e7, e8); Black diamonds: Frontier F1; Gray box: subset of points and corresponding vertex-set U;

to projecting the set U along each dependence direction bi, 1 ≤ i ≤ 3, onto the frontier Fi, 1 ≤ i ≤ 3. Hence, we have |U↓bi| ≤ 2S (here, V↓bi denotes the projection of V along

the direction bi). We ensure that the selected set U is inside the sub-CDAG under consideration by requiring U ⊆ DU, where DU = domain(RK1) ∩ image(RK1) ∩ domain(RK2) ∩ image(RK2) ∩ domain(RK3) ∩ image(RK3). Hence, injective circuits of a

flow-dependence graph essentially provide us linear maps for valid projections required

for application of geometric reasoning for lower bounds as discussed in Section 6.2.

Since, in our example, dim(DS2) = 2, it is sufficient to consider any two of the three

linearly independent dependence directions bi, 1 ≤ i ≤ 3 as projection directions.

Theorem 16 is applicable only for cases of orthogonal projections. If the pro- jection directions are not parallel to the canonical directions, as in our example, a simple change of basis operation is sufficient to transform it to a new space where the projection directions are along the directions of the canonical bases. In the example,

if we consider the vectors b1/|b1| and b3/|b3| as the projection directions in the original space, then the linear map $\big(\, b_1/|b_1| \;\; b_3/|b_3| \,\big)^{-1}$ will transform the Z-polyhedron

to a new space where the projection directions are the canonical bases. In the ex-

ample, after such transformation, the projection vectors are (1, 0)T and (0, 1)T , and

hence we have the following two linear maps: ϕ1 :(i, j) → (i); ϕ2 :(i, j) → (j). We

obtain the following inequalities for the dual problem (6.6): x1 ≤ 1; x2 ≤ 1. This process is illustrated in Figure 6.7.

[Figure 6.7 panels: (a) the original (i, t) iteration space; (b) the transformed (i′, t′) space in which the projection directions b1 and b3 become the canonical axes.]

Figure 6.7: Original and transformed spaces for Jacobi 1D example.
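Concretely, for the Jacobi 1D example (a small worked instance of the change of basis described above; the scaling by $\sqrt{2}$ is immaterial for the asymptotic bounds): with $b_1 = (1,1)^T$ and $b_3 = (1,-1)^T$,
$$B = \begin{pmatrix} \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \\[2pt] \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}} \end{pmatrix}, \qquad B^{-1} = B, \qquad B^{-1}\begin{pmatrix} t \\ i \end{pmatrix} = \frac{1}{\sqrt{2}}\begin{pmatrix} t + i \\ t - i \end{pmatrix},$$
so in the transformed coordinates $b_1$ maps to a multiple of $(1,0)^T$ and $b_3$ to a multiple of $(0,1)^T$, and the two projections become the orthogonal coordinate projections required by Theorem 16.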

In addition, we also include constraints for the degenerate cases where the problem

size considered may be small relative to the cache size, S. Hence, we have the following

additional constraints for the example: |ϕ1(U)| ≤ (N + T) = S^{log_S(N+T)} and |ϕ2(U)| ≤ (N + T) = S^{log_S(N+T)}, or, Σ_{i=1}^{d} x_i δ_{i,1} ≤ log_S(N + T) and Σ_{i=1}^{d} x_i δ_{i,2} ≤ log_S(N + T). Hence, we obtain the constraints x1 ≤ log_S(N + T) and x2 ≤ log_S(N + T). Thus, we

solve the following parametric linear programming problem.

$$\begin{aligned}
\text{Maximize} \quad & \Theta = x_1 + x_2 \\
\text{s.t.} \quad & x_1 \le \min(1, \log_S(N+T)) \\
& x_2 \le \min(1, \log_S(N+T))
\end{aligned} \qquad (6.8)$$

Solving (6.8) using PIP [21] provides the following solution: if $\log_S(N+T) \ge 1$ then $x_1 = x_2 = 1$, else $x_1 = x_2 = \log_S(N+T)$; i.e., when $N + T = \Omega(S)$, $|U| = O(S^2)$, else $|U| = O((N+T)^2)$. Hence, the data movement complexity of the Jacobi 1D computation satisfies
$$Q = \begin{cases} \Omega\!\left( \dfrac{NT}{S} - (N+T) \right), & \text{if } N + T = \Omega(S) \\[8pt] \Omega\!\left( \dfrac{NTS}{(N+T)^2} - (N+T) \right), & \text{otherwise.} \end{cases}$$

Here, (N + T ) is subtracted from the lower bound to account for input tagging.

In general, the change of basis operation that we performed earlier may not be

unimodular, leading to a difference in the size of the set in the original and trans-

formed spaces. Since we focus only on asymptotic parametric bounds, any constant

multiplicative factors that arise due to the non-unimodular transformation are ig-

nored.

6.3.2 Illustrative example 2

We use the example of matrix-matrix multiplication (matmult) to illustrate the

use of broadcast paths for obtaining lower bounds. The code for matmult is presented

in Listing 6.1 for reference.

Domains and Relations: The domain corresponding to the inputs and the state-

ment S1 are listed below:

• DA = [N] -> { A[i,j] : 0 <= i < N and 0 <= j < N }

//Parameters: N
//Inputs: A[N][N], B[N][N], C[N][N]
//Outputs: C[N][N]
for(i=0; i<N; i++)
  for(j=0; j<N; j++)
    for(k=0; k<N; k++)
S1:   C[i][j] += A[i][k] * B[k][j];

Listing 6.1: Matrix-matrix multiplication.

• DB = [N] -> { B[i,j] : 0 <= i < N and 0 <= j < N }

• DC = [N] -> { C[i,j] : 0 <= i < N and 0 <= j < N }

• DS1 = [N] -> { S1[i,j,k] : 0 <= i < N and 0 <= j < N and 0 <= k < N }

Edges e1, e2 and e3 represent the data dependence of statement S1 on the inputs A,

B and C, respectively. Edge e4 represents the circuit from statement S1 to itself. The

relations of the edges are as follows:

• Re1 = [N] -> { A[i,k] -> S1[i,j,k] : 0 <= i,j,k < N }

• Re2 = [N] -> { B[k,j] -> S1[i,j,k] : 0 <= i,j,k < N }

• Re3 = [N] -> { C[i,j] -> S1[i,j,0] : 0 <= i,j < N }

• Re4 = [N] -> { S1[i,j,k-1] -> S1[i,j,k] : 0 <= i,j < N and 1 <= k < N }

Broadcast paths: Let e be any broadcast edge with associated relation Re. It can be seen that once the points in domain(Re) are tagged as inputs, the set of points I ⊆ domain(Re) obtained by projecting a set Z ⊆ image(Re) onto domain(Re) using the inverse relation Re^{-1} provides us with In(Z), i.e., I = In(Z) = Re^{-1}(Z), where e is a broadcast edge. Equivalently, by projecting a set Z ⊆ image(Re) along the direction of the kernel of Re, ker(Re), we obtain In(Z). This property, in conjunc-

tion with Remark 2, allows us to use the kernel of the inverse relation of a broadcast

path as a valid projection path for geometric reasoning. In our example, edges e1 and

e2 are broadcast edges. Hence the paths P1 = (e1) and P2 = (e2) (consisting of just

single edge each) are broadcast paths. Figure 6.8 shows the broadcast path P1 and

its corresponding sub-CDAG. We are specifically interested in inverse relations of the


Figure 6.8: Broadcast path P1 and its corresponding sub-CDAG.

broadcast paths. The inverse relations of paths P1 and P2 are shown below.
$$R_{P_1}^{-1} : \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad
R_{P_2}^{-1} : \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

The kernels of the relations R_{P1}^{-1} and R_{P2}^{-1} are b1 = (0, 1, 0)^T and b2 = (1, 0, 0)^T, re-

spectively. Further, we have the injective circuit (e4) with the dependence vector

b3 = (0, 0, 1)^T. Vectors b1, b2 and b3 provide three linearly independent projec-

tion directions that span the whole three-dimensional domain space.

Next, we calculate frontiers of the paths to find the set of points to be tagged as inputs. We calculate frontiers of broadcast paths using the same technique as before, by taking the set-difference of the domain and image of the relations. For example, the frontier of the broadcast path P1 is given by F1 = domain(R_{P1}) \ R_{P1}(domain(R_{P1})).

The three frontiers corresponding to the three paths are listed below.

• F1 = [N] -> { A[i,k] : 0 <= i < N and 0 <= k < N }

• F2 = [N] -> { B[k,j] : 0 <= k < N and 0 <= j < N }

• F3 = [N] -> { S1[i,j,0] : 0 <= i < N and 0 <= j < N }

From the vectors b1, b2 and b3, we obtain the following linear maps: ϕ1 :(i, j, k) →

(i, k); ϕ2 :(i, j, k) → (j, k); ϕ3 :(i, j, k) → (i, j). This provides the following inequalities: x1 + x3 ≤ 1; x2 + x3 ≤ 1; x1 + x2 ≤ 1. Further, to handle degenerate cases, we have additional constraints that force the sizes of the projections onto the subspaces {i}, {j} and {k} to not exceed N, and the sizes of the projections onto the subspaces {i, j}, {j, k} and {i, k} to not exceed N^2. Hence, we obtain the following parametric linear programming problem.

$$\begin{aligned}
\text{Maximize} \quad & \Theta = x_1 + x_2 + x_3 \\
\text{s.t.} \quad & x_1 + x_2 \le \min(1, 2\log_S(N)) \\
& x_2 + x_3 \le \min(1, 2\log_S(N)) \\
& x_1 + x_3 \le \min(1, 2\log_S(N)) \\
& x_1, x_2, x_3 \le \log_S(N)
\end{aligned} \qquad (6.9)$$

By solving Equation (6.9) using PIP [21], we obtain the following solution: if $2\log_S(N) \ge 1$ then $x_1 = x_2 = x_3 = 1/2$, else $x_1 = x_2 = x_3 = \log_S(N)$. Hence, the data movement complexity Q of matmult satisfies
$$Q = \begin{cases} \Omega\!\left( \dfrac{N^3}{\sqrt{S}} - N^2 \right), & \text{if } N = \Omega(\sqrt{S}) \\[8pt] \Omega(1), & \text{otherwise.} \end{cases}$$

6.3.3 Putting it all together

Algorithm 6.1 provides a pseudo-code for our algorithm. Since the number of

possible paths in a graph is highly combinatorial, several choices are made to limit the

overall practical complexity of the algorithm. First, only edges of interest, i.e., those

that correspond to relations whose image is representative of the iteration domain, are

kept. Second, paths are considered in the order of decreasing expected profitability.

One criterion detailed here corresponds to favoring injective circuits over broadcast

paths with one-dimensional kernel (to reduce the potential span), and then broadcast

paths with decreasing kernel dimension (the higher the kernel dimension, the more the reuse and the lower

the constraint). For a given vertex v, once the directions associated with the set of

paths chosen so far span the complete space of the domain of v, no more paths are considered.

try(): The role of the function try() (on lines 19, 24 and 29 in Algorithm 6.1) amounts

to finding a set of paths that are linearly independent, compatible (i.e., a base can be

associated to them), and representative. The function try() is shown in Algorithm 6.2.

best(): The function best(v) (shown in Algorithm 6.3) selects a set of paths for

a vertex v and computes the associated complexity. best(v) is called only if the

directions of the set of paths chosen for v do not span the whole space.

solve(): The function solve() (shown in Algorithm 6.4) writes the parametric linear program and returns the data movement lower bound (with cases) for a domain D and a set of compatible subspaces.

Various operations used in the pseudo-code are detailed below.

• Given a relation R, domain(R) and image(R) return the domain and image of

R, respectively.

• For an edge e, the operation relation(e) provides its associated relation. If Re =

relation(e) has acceptable number of disjunctions, then the edge can be split

into multiple edges with count equal to the number of disjunctions, otherwise,

a convex under-approximation can be done.

• For a given path p = (e1, e2, . . . , el) with associated relations (Re1, Re2, . . . , Rel), we can compute the associated relation for p by composing the relations of its edges, i.e., relation(p) computes Rel ∘ · · · ∘ Re2 ∘ Re1. Note that the domain of the composition of two relations is restricted to the points for which the

composition can apply, i.e., domain(Ri ∘ Rj) = Rj^{-1}(image(Rj) ∩ domain(Ri))

and image(Ri ◦ Rj) = Ri(image(Rj) ∩ domain(Ri)).

• For a given domain D, dim(D) returns its dimension. If the cardinality of D

(i.e., number of points in D) is represented in terms of the program parameters,

its dimension can be obtained by setting the values of the parameters to a fixed big value (say B), computing logB(|D|), and rounding the result to the nearest integer. For example, if |D| = C(n, m) = nm + n + 3, setting B = 10^3, we get dim(D) = round(logB(C(B, B))) = 2.
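The dimension-from-cardinality trick described above is easy to mechanize. The following small sketch (the helper dim_of is a hypothetical name of ours, not part of the implementation) substitutes a large value B into the parametric cardinality and rounds logB of the result.

import math

def dim_of(cardinality, num_params, B=10**3):
    """Estimate dim(D) from a parametric cardinality |D| by evaluating it at a
    large value B for every parameter and rounding log_B of the result."""
    return round(math.log(cardinality(*([B] * num_params)), B))

# |D| = C(n, m) = n*m + n + 3  ->  dim(D) = 2, as in the example above
print(dim_of(lambda n, m: n * m + n + 3, num_params=2))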

Input : Flow-dependence graph GF = (VF, EF)
Output: Lower bound Q
Data  : v.complexity is an asymptotic complexity (with cases)
 1 Function FindLB(GF)
 2   FI ← ∅;                               // Set of injective edges
 3   FB ← ∅;                               // Set of broadcast edges with 1D kernel
 4   FBB ← ∅;                              // Set of broadcast edges with kernel dim > 1
 5   foreach e = (u, v) ∈ EF do
 6     R ← relation(e);
 7     if dim(image(R)) < dim(Dv) then next e;      // Dv: Domain of v
 8     if R is invertible then FI ← FI ∪ {e};
 9     if dim(domain(R)) = dim(image(R)) − 1 then
10       FB ← FB ∪ {e}
11     if dim(domain(R)) < dim(image(R)) − 1 then
12       FBB ← FBB ∪ {e}
13   foreach v ∈ VF do
14     d ← dim(Dv);
15     foreach circuit p from v to v in FI do
16       R ← relation(p);
17       if (b ← ray(R)) = ⊥ then next p;
18       if dim(image(R)) < d then next p;
19       if try(v, subspace(b), p) then next v;
20     foreach cycle-free path p to v in FB FI* do
21       R ← relation(p);
22       if (k ← rkernel(R)) = ⊥ then next p;
23       if dim(image(R)) < d then next p;
24       if try(v, k, p) then next v;
25     foreach cycle-free path p to v in FBB FI* do
26       R ← relation(p);
27       if (k ← rkernel(R)) = ⊥ then next p;
28       if dim(image(R)) < d then next p;
29       if try(v, k, p) then next v;
30     best(v);
31   simplify(Σ_{v∈VF} v.complexity);

Algorithm 6.1: For each vertex v in a flow-dependence graph GF, finds a set of paths and computes the corresponding complexity.

Input : Vertex v; Subspace k; Path p
Data  : v.clique is a set of tuples (b, K, D) where:
        - b is a basis,
        - K is a set of subspaces,
        - D is a domain
Data  : v.complexity is an asymptotic complexity (with cases)
 1 Function try(v, k, p)
 2   foreach (b, K, D) ∈ v.clique ∪ {(⊥, ⊥, ⊥)} do
 3     D′ = image(relation(p)) ∩ D;
 4     if dim(D′) < dim(D) then return false;
 5     if ((b′ ← base(b, k)) ≠ ⊥) then
 6       K′ ← K ∪ {k};
 7       v.clique ← v.clique ∪ {(b′, K′, D′)};
 8       if dimension(b′) ≥ dim(domain(v)) then
 9         Q ← solve(D′, K′, b′);
10         v.complexity ← Q;
11         return true;
12   return false;

Algorithm 6.2: For a vertex v, try to add path p to some other paths. Return true if a good bound is found.

Input : Vertex v
 1 Function best(v)
 2   let (b, K, D) ∈ v.clique with maximum lexicographic value of
       (dim(D), dimension(b), −Σ_{ki∈K} dimension(ki));
 3   Q ← solve(D, K, b);
 4   v.complexity ← Q;

Algorithm 6.3: For a vertex v, selects a set of paths and computes the associated complexity.

• If a relation R is injective and can be expressed as an affine map of the form

I.x + b, where I is an identity matrix, then the operation ray(R) computes b,

otherwise, returns ⊥.

Input : Domain D; Set of subspaces K; Basis b
 1 Function solve(D, K, b)
 2   b ← base(b, ker(b));                          // Expand the basis to span full space
 3   LP ← objective(maximize Θ = Σ_{bi∈b} αi);
 4   foreach k ∈ K do LP ← LP.constraint(Σ_{bi∉k} αi ≤ 1);
 5   foreach b′ ⊂ b do                             // Degenerate cases
 6     Db′ ← projection(subspace(b \ b′), D);
 7     LP ← LP.constraint(Σ_{bi∉b′} αi ≤ logS(|Db′|));
 8   F ← Σ_{k∈K} |projection(¬k, D)|;              // Frontier
 9   Θ ← solution(LP);
10   U ← S^Θ;
11   Q ← Ω(|D|S/U − F);
12   return Q;

Algorithm 6.4: For a domain D and a set of compatible subspaces, writes the linear program, and returns the data movement lower bound (with cases).

• For a relation R, if its inverse can be expressed as an affine relation A.x + b,

rkernel(R) computes the kernel of the matrix A (and returns ⊥ otherwise).

• For a set of vectors b = {b1, . . . , bl}, subspace(b) provides the linear subspace

spanned by those vectors.

• For a given basis b′ and a linear subspace k, base(b′, k) expands the basis b′ to include k, i.e., it gives a set of linearly independent vectors b = {b1, . . . , bd} such that b′ ⊂ b and, for any ki, there exists bi ⊆ b s.t. ki = subspace(bi). If such a

set could not be computed, it returns ⊥.

• Given an expression X, the operation simplify(X) simplifies the expression by

eliminating the lower order terms. For example, simplify(NT + N^2 − N + T) returns NT + N^2.

• Given a set of points P and a subspace k, projection(k, P) returns the set of

points obtained through orthogonal projection of P onto the subspace k.

6.4 Related Work

Irony et al. [29] used Loomis-Whitney inequality to present lower bounds on the

amount of data movement necessary for execution of standard matrix-matrix multipli-

cation on a distributed-memory parallel system. They also showed a trade-off between

local memory usage and the amount of communication that must be performed. Bal-

lard et al. [4] generalized the technique in [29] and provided data movement lower

bound results for various numerical linear algebra algorithms.

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      C[i][j] = C[i][j] + A[i][k] * B[k][j];

Listing 6.2: Matrix-matrix multiplication

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++) {
      C[i][j] = C[i][j] + 1;
      A[i][k] = A[i][k] + 1;
      B[k][j] = B[k][j] + 1;
    }

Listing 6.3: Code with same array accesses as matrix-matrix multiplication

As mentioned earlier, our approach in this chapter was inspired by the work of

Christ et al. [15]. The two techniques differ in the following significant ways:

• The models of computation are different. Our work is based on the CDAG and

pebbling formalism of Hong & Kung, while the lower bound results of Christ et

al. [15] are based on the geometric model.

• The work of Christ et al. [15] does not model data dependencies between state-

ment instances, and can therefore produce weak lower bounds. In contrast, the

approach developed in this chapter is based on using precise data dependence

information as the basis for geometric reasoning in the iteration space to derive

the data movement lower bounds.

• Relation to data movement cost in our model is through the CDAG vertex-sets

and their in-sets, while in [15], the relation is through statement instances and

array access functions. For example, under their model, the lower bounds de-

rived for codes in Listing 6.2 (standard matrix multiplication) and Listing 6.3 √ would be exactly the same – Ω(N 3/ S) – since the analysis is based only on

the array accesses in the computation. In contrast, with the red-blue pebble

game model, the CDAGs for the two codes are very different, with the matrix-

multiplication code in Listing 6.2 representing a connected CDAG, while the

code in Listing 6.3 represents a CDAG with three disconnected parts corre-

sponding to the three statements. The computation in Listing 6.3 has a much

lower data movement complexity of Ω(N^2). In order to avoid obtaining such

invalid bounds using the model in [15], a few additional assumptions have to

be made on the algorithm that is being analyzed and its implementations, as stated in [15, Theorem 4.1].

CHAPTER 7

Extension to Parallel Case

Multi-core/multi-node processors have become mainstream. These parallel com-

puting environments come with hierarchical caches and interconnection networks with

varying bandwidth at different levels. This has raised interest in extending the data

movement complexity analysis to parallel systems, in order to determine the data

movement bottlenecks between processors, and between different levels of memory

within a processor. In this chapter, we develop an extension to RBNR pebble game

model to capture the data movement requirements in large scale parallel systems.

We consider a multi-node, multi-core parallel system, with the total physical memory distributed across the nodes (connected through an interconnection network). Section 7.2 presents the extended pebble-game model for parallel systems. In Section 7.3, we develop results required to characterize the intra-node and inter-node data movement costs. Interesting insights on architectural bottlenecks that limit the performance of algorithms are provided in Section 7.4 using the techniques developed

in this chapter.

7.1 Parallel Machine Model

Figure 7.1 shows our abstraction of a parallel computer. The model seeks to cap- ture the essential characteristics of large-scale parallel systems, which exhibit multi- level parallelism and a hierarchical storage structure. The parallel computer has a set of multi-core nodes connected by an interconnection network. Each node has a number of cores and a hierarchy of storage elements: a set of private registers (at level 1) for each core, a private L1 cache per core (at level 2), and a hierarchy of zero or more additional levels of cache (through level L − 1), and a shared main memory

(at level L). The total number of storage entities at level l is denoted Nl, and the capacity of each entity at level l is Sl words; that is, all caches at a given level l are assumed to have the same capacity of Sl words. The hierarchical structure means that each storage entity at level l is connected to a unique storage entity at level l + 1, and to an integral multiple (usually a power of 2) of entities at level l − 1. The total number of main memory modules NL equals the number of nodes in the system. In summary, the following notation is used for our parallel machine model, for 1 ≤ l ≤ L.

Nl: Number of storage entities at level-l.

ci,l: Storage entity i at level-l.

pi,l: Set of processors that share a storage entity ci,l.

Sl: Capacity of a single storage entity at level-l.

Nl × Sl: Total storage capacity at level-l.

Pa(ci,l): Function that returns the parent cache of ci,l in the memory hierarchy.

Ch(ci,l): Function that returns the set of child caches of ci,l in the memory hierarchy.
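To make the notation concrete, the following sketch (illustrative only; names such as Entity and build_hierarchy are ours, and the example sizes are assumptions) builds the tree of storage entities and exposes Pa and Ch as lookups.

class Entity:
    """One storage entity c_{i,l}: index i at level l, with capacity S_l words."""
    def __init__(self, level, index, capacity):
        self.level, self.index, self.capacity = level, index, capacity
        self.parent, self.children = None, []

def build_hierarchy(counts, capacities):
    """counts[l-1] = N_l and capacities[l-1] = S_l for l = 1..L (level L = memory).
    Each level-l entity is attached to exactly one level-(l+1) parent."""
    levels = [[Entity(l + 1, i, capacities[l]) for i in range(counts[l])]
              for l in range(len(counts))]
    for l in range(len(levels) - 1):
        fanout = counts[l] // counts[l + 1]
        for i, child in enumerate(levels[l]):
            parent = levels[l + 1][i // fanout]
            child.parent = parent
            parent.children.append(child)
    return levels

Pa = lambda c: c.parent      # Pa(c_{i,l}): parent entity at level l+1
Ch = lambda c: c.children    # Ch(c_{i,l}): child entities at level l-1

# Example: 2 nodes, 4 cores each -> 8 level-1 caches and 2 main memories (level 2)
levels = build_hierarchy(counts=[8, 2], capacities=[32 * 1024, 2 * 10**9])
assert Pa(levels[0][5]) is levels[1][1] and len(Ch(levels[1][0])) == 4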

Figure 7.1: Parallel machine model.

7.2 Parallel Red-Blue Pebble Game

In this section, we present the framework for developing data movement lower

bounds for parallel execution. In particular, we consider two types of data movement:

• Movement across the levels of the storage hierarchy within a node, called ver-

tical data movement

• Movement between nodes, called horizontal data movement.

With the Parallel red-blue (P-RBNR) pebble game, a different set of red pebbles is associated with each storage entity in the parallel system – we can consider there to be different shades of red, one per distinct storage entity. Associated with the storage entities at a level l in the hierarchy, we have Nl distinct shades of red pebbles, each associated with one of the Nl distinct storage entities in the system at that level. The rules of the P-RBNR are stated below, and encode the constraints on movement of data in the parallel computer: i) vertical data movement can occur between physically connected entities in the storage hierarchy (Rules R4 and R5 below), and ii) data can

be moved via the interconnection network between the memories of any pair of nodes

(Rule R3).

Let C = (I, V, E, O) be a CDAG. Given, for each level l with 1 ≤ l ≤ L, Nl × Sl red pebbles of different shades associated with the different storage entities ci,l,

1 ≤ i ≤ Nl, and unlimited blue pebbles, the rules of the parallel red-blue (P-RBNR) pebble game without recomputation are as follows:

R1 (Input) A level-L red pebble can be placed on any vertex that has a blue pebble.

R2 (Output) A blue pebble can be placed on any vertex that has a level-L shade

red pebble on it.

R3 (Remote get) A level-L red pebble can be placed on any vertex that has another

level-L red pebble.

R4 (Move up) For 1 ≤ l < L, a level-l red pebble associated with ci,l can be placed

on any vertex that has a level-(l +1) red pebble associated with cj,l+1, such that

1 ≤ i ≤ Nl, 1 ≤ j ≤ Nl+1 and ci,l = Pa(cj,l+1).

R5 (Move down) For 1 < l ≤ L, a level-l red pebble associated with ci,l can be

placed on any vertex that has a level-(l − 1) pebble associated with cj,l−1, such

that 1 ≤ i ≤ Nl, 1 ≤ j ≤ Nl−1 and ci,l ∈ Ch(cj,l−1).

R6 (Compute) If all predecessors of a vertex v ∈ V \ I have a level-1 red pebble of

shade p, and a level-1 red pebble of any shade has not been previously placed on

v, then a level-1 red pebble may be placed on (or moved to) v. (Here p denotes

the index of the processor that computes vertex v.)

R7 (Delete) Any shade of red pebble may be removed from any vertex.

The goal of the game is to start with blue pebbles on all input vertices and end with

blue pebbles on all output vertices using the rules of the game. Hence, any pebbling

sequence for P-RBNR game contains a firing rule for each vertex in the operation set

in the CDAG, where each vertex is fired at most once by any of the processors. The

definitions of data movement cost, adapted for the parallel case, are given below.

Definition 32 (Vertical data movement cost). For 1 < l ≤ L, the vertical data movement cost vqi,l(P, Sl−1) for a storage entity ci,l, of a pebbling sequence P played

using rules of P-RBNR pebble game with Nl−1 × Sl−1 level-(l − 1) red pebbles on a

CDAG C, is the sum of (1) the number of pebblings in P according to the rule R4

applied on the vertices of C holding red pebbles associated with ci,l, and (2) the number

of pebblings in P according to the rule R5 applied on the vertices holding red pebbles

associated with ci,l−1 such that ci,l−1 ∈ Ch(ci,l).

Informally, vertical data movement cost of a cache ci,l is the number of data

transfers happening between ci,l and all its children.

Definition 33 (l-optimal pebbling sequence). An l-optimal pebbling sequence for a

CDAG is a pebbling sequence that minimizes the maximum data movement cost at level-l. In other words, if vqi,l(P, Sl−1) denotes the cost of a pebbling sequence P for a storage entity ci,l, then an l-optimal pebbling sequence

Popt = argmin_P  max_{i ∈ [1, Nl]}  vqi,l(P, Sl−1).

Definition 34 (Vertical level-l data movement complexity). Vertical level-l data movement complexity of a CDAG C, Ql(C,Sl−1) is defined as the data movement cost of any l-optimal pebbling sequence.

The notation Ql(Sl−1), or just Ql when Sl−1 is clear from the context, is used to

denote the cost of any l-optimal pebbling sequence for a CDAG C.

Definition 35 (Horizontal data movement cost of a pebbling sequence). The horizontal data movement cost hqi(P, SL) for ci,L, of a pebbling sequence P played using

rules of P-RBNR pebble game and with NL × SL level-L red pebbles on a CDAG C,

is the number of pebblings in P according to the rule R3 applied on the vertices of C

resulting in a red pebble associated with ci,L.

Definition 36 (H-optimal pebbling sequence). A h-optimal pebbling sequence for a

CDAG is a pebbling sequence that minimizes the maximum horizontal data movement

cost.

Definition 37 (Horizontal data movement complexity). Horizontal data movement

complexity of a CDAG C, QH (C,SL) is defined as the data movement cost of any

H-optimal pebbling sequence.

7.3 Lower Bounds for Parallel Data Movement Complexity

The hierarchical memory can enforce either the inclusion or exclusion policy. In

case of inclusive hierarchical memory, when a copy of a value is present at a level-l, it

is also maintained at all the levels l + 1 and higher. These values may or may not be

consistent with the values held at the lower levels. The exclusive cache, on the other

hand, does not guarantee that a value present in the cache at level-l will be available at the higher levels. The following results are derived for the inclusive case, but they also hold for the exclusive case, where the difference lies only in the number of red pebbles considered in the corresponding two-level pebble game.

Theorem 17 (Vertical data movement lower bound). Given a CDAG C = (I, V, E, O), and a level l, the vertical level-l data movement complexity, Ql(Sl−1), satisfies:

Ql(Sl−1) ≥ Q(Nl−1 × Sl−1) / Nl,

where Q(S) denotes the serial data movement complexity of C under RBNR pebble

game model with S red pebbles, Nl−1 is the total number of storage entities at level

l − 1, and Sl−1 is the capacity of a single storage entity at level l − 1.

Proof. Consider an l-optimal pebbling sequence Popt of cost Ql(Sl−1). Let Ql = Σ_{i=1}^{Nl} vqi,l(Popt, Sl−1) denote the total vertical data movement cost of Popt at level-l.

By Definition 33, we have, Nl × Ql(Sl−1) ≥ Ql. From Popt, we can build a pebbling

sequence P for RBNR model of cost at most Ql with Sl−1 × Nl−1 red pebbles. P is

constructed by successively copying the pebblings from Popt to P with the following

changes: (1) Pebblings with rules R1, R2 and R3 are ignored, (2) pebblings with rule

R4 applied on a vertex containing any level-(l + 1) red pebble are copied and changed

as pebblings with rule R1 of RBNR game into P,(Note: Since we are assuming the

caches to be inclusive, set of vertices with level-k, k ≤ l, red pebbles are subset of

vertices with level-l red pebbles. Hence, there are at most Nl−1 × Sl−1 simultaneous

vertices with level-l and lower red pebbles.) (3) pebblings with rule R5 applied on a vertex containing any level-l red pebble are copied and changed as pebblings with rule R2 into P, (4) pebblings with rule R6 are copied and changed as pebblings with rule R3 in P, and (5) pebblings with rule R7 involving level-k, k ≠ l, red pebbles are

ignored. Pebblings with rule R7 involving level-l red pebbles are copied and changed

as pebblings with rule R4 in P.

P is a valid pebbling sequence for the RBNR pebble game. Rules R1, R2 and R4 are trivially satisfied in P. When a vertex v is fired in Popt, all its predecessors must hold level-1 and higher, including level-l, red pebbles. Hence all the predecessors of v in P would have a red pebble when v is fired in P. Hence,

Ql(Sl−1) ≥ Ql / Nl ≥ Q(Nl−1 × Sl−1) / Nl.   □
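As a quick numeric illustration of Theorem 17, the per-node vertical bound is the serial bound for an aggregated fast memory of Nl−1 × Sl−1 words divided by Nl. The sketch below uses the matrix-multiplication serial bound quoted in Section 8.2; the parameter values are assumptions.

import math

def matmul_serial_Q(N, S):
    """Serial data-movement lower bound for N x N matmul: N^3 / (2*sqrt(2S)) words."""
    return N**3 / (2.0 * math.sqrt(2.0 * S))

def vertical_lb(serial_Q, N_lm1, S_lm1, N_l):
    """Theorem 17: Q_l(S_{l-1}) >= Q(N_{l-1} * S_{l-1}) / N_l."""
    return serial_Q(N_lm1 * S_lm1) / N_l

# One node (N_l = 1) with 16 cores, each with a 32 KWord private cache (assumed values)
print(vertical_lb(lambda S: matmul_serial_Q(4096, S), N_lm1=16, S_lm1=32 * 1024, N_l=1))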

It is possible to obtain tighter lower bounds when using S-partitioning technique

for deriving bounds. This is shown in the following theorem.

Theorem 18. Given a CDAG C = (I,V,E,O), and a level l, the vertical level-l data

movement complexity, Ql(Sl−1), satisfies:

Ql(Sl−1) ≥ ( |V| / (|U| × Nl) − Nl−1/Nl ) × Sl−1

where U is the largest vertex set of C in any 2Sl−1-partition (under the RBNR model),

Nl−1 and Nl are the total number of storage entities at levels l − 1 and l, respectively, and Sl−1 is the capacity of a single storage entity at level l − 1.

The following theorem extends the S-partitioning technique to the horizontal case.

Theorem 19. Given a CDAG C = (I,V,E,O), the horizontal data movement com-

plexity, QH(SL), satisfies:

QH(SL) ≥ ( |V| / (|U| × NL) − 1 ) × SL

where U is the largest vertex set in any 2SL-partition of C, NL is the total number of storage entities at level L, and SL is the capacity of a single storage entity at level L.

7.4 Illustration of Use

Lower and upper bound analysis of algorithms can help us identify whether an

algorithm is bandwidth bound at different levels of the memory hierarchy. The exe-

cution time, T , of an implementation on a computer is the maximum of its compu-

tational time and data movement time. Hence, if Tcomp denotes the computational

time and Tl denotes the data transfer time at level l, then, T ≥ max(Tcomp,Tl).

Computational time for an algorithm on a parallel computer system depends on

the systems’ peak computational rate and the operation count of the algorithm. Let

C = (I,V,E,O) be the CDAG of an algorithm being analyzed. Then the operation

count W = |V \ I|. If F denotes the peak computational rate of a single processor,

i.e., maximum number of floating-point arithmetic operations per second per core,

then Tcomp ≥ W /(P × F ), where P is the total number of processors. Similarly,

the data movement cost of an algorithm depends on the memory bandwidth and the

data movement complexity of the algorithm. Hence, if Ql denotes the data movement

complexity at a level l and Bl is the aggregate memory bandwidth between any level-l

storage entity and its children, then the data movement time at level-l, Tl ≥ Ql/Bl.

For the algorithm to be not bound by memory bandwidth at level l, its data

movement time at level l has to be less than the computational time of the algorithm.

Hence, we need to have

Ql/Bl ≤ W/(P × F) = W/(Pl × Nl × F),

where Pl is the number of processors that share a storage entity at level-l. By rewriting the above equation, we have the following equation that relates the machine balance of a system with the data movement complexity of an algorithm:

(Ql × Nl)/W ≤ Bl/(Pl × F).

The right-hand side of the above equation represents the machine balance parameter of the system. If LBl denotes the lower bound on the level-l data movement complexity

Ql, then any algorithm that fails to satisfy the below equation will be invariably

bandwidth bound at level-l.

(LBl × Nl)/W ≤ Bl/(Pl × F).          (7.1)

Similarly, if UBl is the upper bound on the level-l data movement complexity for an algorithm, then we can show that if the algorithm is bandwidth bound, then it would definitely satisfy the condition,

(UBl × Nl)/W ≥ Bl/(Pl × F)           (7.2)

Hence, if an algorithm fails to satisfy Equation (7.2), we can safely conclude that

there is at least one execution order of C that is not constrained by the memory bandwidth at level l.

Particularly, we are interested in understanding the memory bandwidth require- ments (1) between the main memory and last level cache (LLC) within each node, and, (2) between different nodes, for various algorithms. For simplicity, we assume that the LLC is shared by all the cores within a node, which is common in practice.

Considering the particular case of data movement between LLC and the main memory, Equation (7.1) becomes,

(LBL × NL)/W ≤ BL/(PL × F)           (7.3)

where PL = P/NL. Similarly, for the inter-node data transfer, Equation (7.2) becomes

(UBH × NL)/W ≥ BH/(NL × F)           (7.4)

where, UBH and BH represent the upper bound on the horizontal data movement

complexity and the inter-node communication bandwidth, respectively.

Specifications for some representative computing systems are shown in Table 7.1.

Machine     NL     Mem. (GB)   LLC (MB)   Vertical balance   Horizontal balance
                                          (words/FLOP)       (words/FLOP)
IBM BG/Q    2048   16          32         0.052              0.006
Cray XT5    9408   16          6          0.0256             0.005

Table 7.1: Specifications of computing systems.
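The comparison against the machine balance values in Table 7.1 is mechanical; the sketch below (helper names are ours) applies Inequalities (7.1) and (7.2) to a given ratio. The 0.25 words/FLOP used in the example is CG's vertical lower-bound ratio derived in Section 7.4.1 below.

def necessarily_bandwidth_bound(lb_ratio, machine_balance):
    """Inequality (7.1): if LB_l*N_l/W (words/FLOP) exceeds B_l/(P_l*F),
    the algorithm is unavoidably bandwidth bound at level l."""
    return lb_ratio > machine_balance

def possibly_compute_bound(ub_ratio, machine_balance):
    """Negation of Inequality (7.2): if UB_l*N_l/W is below the balance, at least
    one execution order is not constrained by the level-l bandwidth."""
    return ub_ratio < machine_balance

# Vertical balance figures from Table 7.1
for machine, vertical_balance in [("IBM BG/Q", 0.052), ("Cray XT5", 0.0256)]:
    print(machine, necessarily_bandwidth_bound(0.25, vertical_balance))   # True for both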

7.4.1 Conjugate Gradient (CG)

Vertical lower bound: The pseudo-code for CG was provided in Algorithm 3.1.

From Subsection 3.3.4, we obtained the data movement complexity of CG for serial

case, Q ≥ 6n^d T, where n is the problem size. By applying Theorem 17, we obtain the vertical data movement lower bound for the data transfer between LLC and DRAM, QL ≥ 6n^d T / NL.

Horizontal upper bound: Here, we derive the data movement cost for an imple-

mentation of CG where the input problem grid is block partitioned among processors

and each processor performs computation on the sub-grid assigned to it. Consider

the block partitioning of the input grid among the processors, with block size along each dimension B = n/NL^(1/d). Each processor holds the input data corresponding to its local grid points and computes the data needed by those grid points. Computation of the sparse matrix-vector product, at Line 5 in Algorithm 3.1, requires sending and receiving the ghost-cell values, (B + 2)^d − B^d of them per time-step, with the neighboring processors. Hence, the horizontal data movement complexity, QH, satisfies

QH ≤ 2 × ((B + 2)^d − B^d) × T
   = 2 × (B^d + C(d,1) B^(d−1) 2^1 + C(d,2) B^(d−2) 2^2 + · · · + C(d,d−1) B^1 2^(d−1) + B^0 2^d − B^d) × T
   ≤ 4dB^(d−1) T,

where C(d, i) denotes the binomial coefficient.

Analysis: Equations (7.3) and (7.4) provided us conditions to determine the vertical and horizontal memory constraints of the algorithms. We will use them to show that the running time of CG is mainly constrained by the vertical data movement. We consider a three-dimensional problem (d = 3) for the analysis.

The vector dot-product at Line 7 requires 2n^d operations. The computation of (r.r), at lines 10, 12 and 6, requires a single vector dot-product with operation count 2n^d. The vector update operations at lines 8, 9 and 11 have an operation count of 2n^d each. The SpMV operation at line 5 is a stencil computation with a single time-step that requires 2(2d + 1)n^d operations. This gives a total operation count of 24n^3 T for a three-dimensional problem. The vertical I/O lower bound per node is LBL = 6n^3 T / NL. Hence,

(LBL × NL)/W = ((6n^3 T / NL) × NL) / (24n^3 T) = 6/24 = 0.25

This value is higher than the vertical machine balance value of various machines (refer

Table 7.1), leaving Inequality (7.3) unsatisfied. This shows that CG will be unavoid- ably bandwidth bound along the vertical direction for the problems that cannot fit into the cache. The only way to improve the performance would be to increase the main memory bandwidth.

On the other hand, let us consider the horizontal case.

(UBH × NL)/W = (12B^2 T × NL) / (24n^3 T) = (12 × (n/NL^(1/3))^2 × NL) / (24n^3) = NL^(1/3) / (2n)

This value is much lower than the horizontal machine balance values of various ma- chines for the problem size typically encountered in practice, indicating that it is the intra-node memory bandwidth that is much more of a fundamental bottleneck than inter-node communication for CG.
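The two ratios derived above can be packaged as small helpers to explore how the CG conclusions depend on the problem size n and node count NL. This is only a sketch; the example values of n and NL are assumptions.

def cg_vertical_ratio():
    """LB_L*N_L/W for 3D CG: (6*n^3*T/N_L)*N_L / (24*n^3*T) = 0.25 words/FLOP."""
    return 6.0 / 24.0

def cg_horizontal_ratio(n, N_L):
    """UB_H*N_L/W for 3D CG: (12*B^2*T*N_L)/(24*n^3*T) with B = n/N_L**(1/3)."""
    return N_L ** (1.0 / 3.0) / (2.0 * n)

print(cg_vertical_ratio())                      # 0.25, above both vertical balances in Table 7.1
print(cg_horizontal_ratio(n=8192, N_L=2048))    # ~8e-4, far below the horizontal balances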

7.4.2 Jacobi computation

Vertical lower bound: In Subsection 3.3.3, we derived a min-cut based serial lower bound for two-dimensional Jacobi computation. As shown in Section 7.3, it is possible to derive tighter parallel lower bounds using S-partitioning based tech- nique. We derive an S-partitioning based parallel data movement lower bound for a d-dimensional Jacobi computation in this subsection. The proof of Theorem 20 below is closely related to the information speed function introduced in [27, Section 5].

Theorem 20 (S-partitioning based lower bound for d-dimensional Jacobi computation). For the d-dimensional Jacobi computation of size n^d with T time steps, the data movement complexity, Q, under the two-level RBNR pebble game model satisfies Q ≥ n^d T / (4(2S)^(1/d)), where S is the number of red pebbles and n ≫ S.

Proof. A d-dimensional Jacobi method performs computations on the points of a d

dimensional grid of size n along each dimension, whose iteration domain is d+1 dimen-

sions, containing d space dimensions and one time dimension. Let C = (I,V,E,O)

be the CDAG corresponding to Jacobi computation. We will denote the vertex

corresponding to the computation at a grid point (p1, . . . , pd) at time-step t by

v(t, p1, . . . , pd) ∈ V . Jacobi computation has the property that computation of

any point in the d-dimensional grid during a time-step t requires the value computed for the same point at time-step t − 1, i.e., we have edges (v(t, p1, . . . , pd), v(t + 1, p1, . . . , pd)) ∈ E for 1 ≤ pi ≤ n and 1 ≤ i ≤ d. This set of edges provides n^d vertex-disjoint paths (v(1, p1, . . . , pd), . . . , v(T, p1, . . . , pd)) for 1 ≤ pi ≤ n and 1 ≤ i ≤ d. We will call these vertex-disjoint paths lines for simplicity. Further, the computation for a point depends on its nearest neighbors, i.e., each vertex v(t, p1, . . . , pd) depends on v(t−1, p1 ± 1, p2, . . . , pd), v(t−1, p1, p2 ± 1, p3, . . . , pd), . . . , v(t−1, p1, p2, . . . , pd ± 1). Note that the vertices away from the boundaries depend on nearest neighbors

along both + and − directions, while boundary vertices depend only on one direc-

tion. In either case, each vertex depends on at least one neighboring vertex along

each dimension. Hence, with a path of length m, each vertex v(t, p1, . . . , pd) can be reached by vertices v(t−m, p1 ± m, p2, . . . , pd), v(t−m, p1, p2 ± m, p3, . . . , pd), . . . , v(t−m, p1, p2, . . . , pd ± m). Similarly, in the forward direction, each vertex v(t, p1, . . . , pd) can reach vertices v(t + m, p1 ± m, p2, . . . , pd), v(t + m, p1, p2 ± m, p3, . . . , pd), . . . , v(t + m, p1, p2, . . . , pd ± m) using paths of length m.

Consider any two vertices v1 = v(t, p1, . . . , pd) and v2 = v(t + 2m, p1, . . . , pd) in the same line at a distance 2m. With paths of length m, v1 can reach the set of vertices W = {v(t + m, p1 ± m, p2, . . . , pd), v(t + m, p1, p2 ± m, p3, . . . , pd), . . . , v(t + m, p1, p2, . . . , pd ± m)}, where we have |W| ≥ m^d, and the vertices in the set W can reach v2. Also, each vertex in the set W belongs to a different line. Hence, if a vertex-set E in a 2S-partition contains vertices v1 and v2, then it also necessarily contains the vertices of W, and since the vertices in W are from at least m^d lines (i.e., vertex-disjoint paths), we have |In(E)| ≥ m^d. Choosing m ≥ (2S)^(1/d) thus leads to a vertex-set with its in-set size at least 2S. Hence, any vertex-set of any 2SNR-partition will have vertices separated by a distance of at most (2S)^(1/d) from each line. Hence, the largest vertex-set U in a 2SNR-partition can have at most 2 × (2S)^(1/d) vertices from each line. Also, U can only contain vertices from at most 2S lines (since the lines are all vertex-disjoint). Hence, we have |U| ≤ 2 × (2S)^(1/d) × 2S = 2(2S)^((d+1)/d). Hence, the data movement complexity Q satisfies Q ≥ (n^d T / (2(2S)^((d+1)/d)) − 1) × S, which tends to Q ≥ n^d T / (4(2S)^(1/d)) for n ≫ S.   □

Theorem 21 (Vertical data movement complexity for d-dimensional Jacobi compu-

tation). For the d-dimensional Jacobi computation of size n^d with T time steps, the vertical level-l data movement complexity, Ql, satisfies Ql ≥ n^d T / (4Nl(2Sl−1)^(1/d)), where Sl−1 is the capacity of a single storage entity at level l − 1, Nl is the number of storage entities at level-l and n ≫ Sl−1.

Proof. In the proof of Theorem 20, we showed that the size of the largest vertex-set U of any 2SNR-partition (under the RBNR model) of the CDAG for the d-dimensional Jacobi computation satisfies |U| ≤ 2(2S)^((d+1)/d). Hence, for S = Sl−1, from Theorem 18, the vertical level-l data movement complexity Ql satisfies Ql ≥ (n^d T / (2Nl(2Sl−1)^((d+1)/d)) − Nl−1/Nl) × Sl−1, which tends to Ql ≥ n^d T / (4Nl(2Sl−1)^(1/d)) for n ≫ Sl−1.   □
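For concreteness, the per-node vertical bound of Theorem 21 can be evaluated directly. This is only a sketch; the problem sizes below are arbitrary assumptions.

def jacobi_vertical_lb(n, T, d, S_lm1, N_l):
    """Theorem 21: Q_l >= n^d * T / (4 * N_l * (2*S_{l-1})**(1/d)), for n >> S_{l-1}."""
    return n**d * T / (4.0 * N_l * (2.0 * S_lm1) ** (1.0 / d))

# 3D Jacobi, 6 MB LLC (0.75 MWords), one LLC per node (assumed values)
print(jacobi_vertical_lb(n=2048, T=100, d=3, S_lm1=0.75e6, N_l=1))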

Horizontal upper bound: The well-known distributed memory tiled Jacobi im-

plementation incurs data communication cost at the boundaries due to the exchange

of ghost cell values, similar to the Conjugate Gradient method. Hence, the upper

bound on the horizontal data movement cost for the Jacobi method is QH ≤ 4dB^(d−1) T, where B = n/NL^(1/d) is the block size along each dimension.

Analysis: We consider a three-dimensional problem (d = 3) for the analysis in this section. The operation count for a three-dimensional 7-point Jacobi method is 7n^3 T. The vertical data movement lower bound at the last level is LBL = n^3 T / (4(2S)^(1/3) NL). Hence,

(LBL × NL)/W = (n^3 T / (4(2S)^(1/3) NL) × NL) / (7n^3 T) = 1 / (28(2S)^(1/3)) ≈ 1 / (35.3 × S^(1/3))

For an LLC of size Sl−1 = 6 MB (i.e., 0.75 MWords), this gives LBL × NL/W = 3 × 10^−4 words/FLOP. This value falls below the vertical machine balance parameter of various architectures. Now, considering the vertical data movement upper bound per node, UBL = 14n^d T / (S^(1/d) NL),

(UBL × NL)/W = (14n^3 T / (S^(1/3) NL) × NL) / (7n^3 T) = 2 / S^(1/3)

For S = 0.75 MWords, this gives UBL × NL/W = 0.022 words/FLOP. This value falls slightly below

Similarly, for the horizontal case,

(UBH × NL)/W = (12B^2 T × NL) / (7n^3 T) = 1.714 × NL^(1/3) / n

As in the case of CG, this value is much lower than the horizontal machine bal- ance values of different machines for the common problem sizes, indicating that the horizontal data movement is not a bottleneck.

This shows that even though Jacobi method might require more iterations to converge, its lower ratio of I/O cost to the computational cost along both vertical and horizontal directions might make methods similar to it a more attractive alternative to CG for solving large sparse systems of linear equations on future machines that have more skewed machine balance parameters than current systems.

7.5 Related Work

Extending the scope of the Hong & Kung model to more complex memory hierar- chies has been the subject of some research. Savage provided an extension together with results for some classes of computations that were considered by Hong & Kung, providing optimal lower bounds for data movement on a single processor machine with memory hierarchies [45]. Valiant proposed a hierarchical computational model [53] that offers the possibility to reason in an arbitrarily complex parameterized memory hierarchy model. The P-RBNR game developed in this dissertation extends the paral- lel model for shared-memory architectures by Savage and Zubair [47] to also include

134 the distributed-memory parallelism present in all scalable parallel architectures. The

works of Irony et al. [29] and Ballard et al. [4] model communication across nodes of a distributed-memory system. Solomonik et al. [52] provide lower bounds on parallel

execution of algorithms independent of number of processors, while deriving a trade-

off between data movement, synchronization and computational costs. Scquizzato et

al. [49] provided tight lower bounds on data movement complexity for computation of

various algorithms on a distributed-memory parallel machines. Bilardi and Preperata

[12] develop lower bound results for communication in a distributed-memory model

specialized for multi-dimensional mesh topologies. Our model in this dissertation dif-

fers from the above efforts in defining a new integrated pebble game to model both

horizontal communication across nodes in a parallel machine, as well as vertical data

movement through a multi-level shared cache hierarchy within a multi-core node.

CHAPTER 8

Algorithm-Architecture Codesign Exploration

The roofline model is an insightful visual “bound and bottleneck” model that focuses on the performance-limiting impact of off-chip memory bandwidth. In its most basic form, a performance roofline plot (explained in greater detail with Figure 8.1

in the Section 8.1.1) consists of two straight lines that represent upper bounds on

performance due to the maximum computational rate of processor cores and memory

bandwidth, respectively. The horizontal axis is the operational intensity (OI)

of the computation, defined as the number of computational operations performed per byte of data moved between main memory and the processor. A code will

be memory-bandwidth limited unless OI is sufficiently high, greater than a “critical

intensity” corresponding to the point of intersection of the two rooflines.

Modeled or measured data volume between main-memory and last-level on-chip

cache (LLC) for the execution of a code can be used to compute its OI and thus

determine whether or not it is in the bandwidth-limited region of the roofline model

for a machine. However, a significant issue is that the OI of an algorithm is not a

fixed quantity, but depends on the on-chip LLC capacity and the problem size, and

is affected by how the algorithm is implemented as well as the semantics-preserving

code transformations that may be performed. Thus, it is only straightforward to use

a roofline model to determine whether a particular implementation of an algorithm is bandwidth bound on a particular machine. But we are often interested in broader insights on performance bottlenecks, where we wish to take into consideration potential transformations into semantically equivalent forms, or consider architectural variations in a design space. But this is not easy. The standard approach to doing so requires performance modeling and analysis of a number of alternative implementation scenarios.

In this chapter, we present a new approach to the use of the roofline model by deriving upper bounds on OI from lower bounds on data movement complexity. Since data movement lower bounds are schedule-independent, their use avoids the need to explicitly model and analyze OI over a large number of mapping/scheduling scenarios.

8.1 Background

8.1.1 The roofline model

The roofline model [55] is a popular approach to bottleneck-bounds analysis that focuses on the critical importance of the memory bandwidth in limiting performance.

Performance roofline: The performance roofline sets an upper bound on perfor- mance of an implementation depending on its operational intensity. Let W be the number of operations (typically floating-point computations) that a particular imple- mentation of an algorithm performs and let q be the total number of bytes it transfers between main memory and the processor. (Table 8.1 summarizes the parameters.)

The operational intensity (OI) of a computation is defined as the ratio between W and q, i.e., OI = W/q. The performance roofline model ties together the

[Figure 8.1: Roofline model: Performance and Energy rooflines — (a) performance roofline (GFLOP/s vs. operational intensity in FLOP/Byte), (b) energy roofline (GFLOP/J vs. operational intensity in FLOP/Byte).]

Symbol    Description
S         Size of the fast memory, e.g., cache (in Words)
W         No. of arithmetic operations of an algorithm (in FLOP)
q         No. of data transfers for an implementation of an algorithm (in Bytes)
Q         Minimum number of data transfers of an algorithm (in Words)
OI        Operational intensity (in FLOP/Byte)
Fflops    Peak computational rate (in GFLOPs)
Bmem      Peak memory bandwidth to LLC (GBytes/sec)

Table 8.1: List of symbols and their definitions.

operational intensity, peak floating-point performance and memory performance in a two-dimensional graph. The horizontal axis (in log scale) represents the operational intensity and the vertical axis (in log scale) represents the attainable performance.

The attainable performance for an architecture depends on its peak floating-point performance and the memory bandwidth, and is given by:

Upper bound on attainable performance = Min(Fflops, Bmem × OI).
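The bound is a one-liner; the following sketch evaluates it for a few OI values. The peak rate and bandwidth are placeholder assumptions, not measurements.

def attainable_gflops(oi, peak_gflops, mem_bw_gbs):
    """Roofline bound: min(peak compute rate, memory bandwidth * OI)."""
    return min(peak_gflops, mem_bw_gbs * oi)

for oi in (0.42, 2.0, 32.0):                      # e.g., CG-, FFT- and MM-like OIs
    print(oi, attainable_gflops(oi, peak_gflops=226.0, mem_bw_gbs=40.0))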

In its most basic form, a performance roofline model contains two intersecting straight

lines representing performance limits: a horizontal line at a height corresponding to

the peak computational performance of the core(s), and an inclined line with y-

intercept corresponding to the peak memory-to-LLC bandwidth. This is shown in

Figure 8.1(a).

The OI of a given implementation of an algorithm is typically computed using

performance counters by measuring the actual number of operations and amount of

memory traffic during execution. In some cases, it may be possible to calculate OI

through manual reasoning. If we draw a vertical line (red and green lines in Fig-

ure 8.1(a)) at this value of OI, it intersects either the inclined line indicating that the implementation is bandwidth-bound, or the horizontal line indicating that the imple- mentation is compute-bound. For lower OI, the memory bandwidth is a fundamental limiting factor, and achievable computational throughput cannot exceed the product of OI and the memory bandwidth. As OI increases from zero, the upper bound on achievable performance steadily increases, up to the point of intersection of the two lines. To the right of the intersection point, OI is sufficiently high that the mem-

ory bandwidth is no longer a fundamental constraint on achievable computational

throughput.

Energy roofline: The roofline model has recently been adapted [13, 14] to capture

upper bounds on energy efficiency, i.e., operations/joule, as a function of OI. For a

particular OI, the energy cost of at least one memory access must be expended for every OI computational operations. The minimal energy cost for performing a set

of W operations is the sum of the energy cost for actually performing the W arith-

metic operations and the energy cost for W /OI memory operations. Figure 8.1(b)

shows the energy roofline for the same architecture as the performance roofline (in

Figure 8.1(a)). The energy roofline is smooth since the total energy is the sum of com- pute and data movement energies, in contrast to execution time, which is bounded by the larger of the data movement time and compute time (since data movement and computation may be overlapped).

Given a particular algorithm, e.g., the standard O(N 3) algorithm for matrix-

matrix multiplication, there exist many semantically equivalent implementations of

the algorithm with different values of OI. In general, finding the schedule with the

highest possible OI would require analysis of a combinatorially explosive number of

schedules. In contrast, with the approach we develop in this chapter using the data

movement lower bounds, a single parametric expression can be developed for an upper

bound on OI for a regular CDAG as a function of the size of LLC. The upper bound

captures all possible valid schedules for the CDAG, including all tiled versions of the

code, considering all possible tile sizes.

8.2 Operational Intensities of Algorithms

In this section, we develop expressions for upper bounds on OI for four algorithms as a function of storage capacity – matrix multiplication (MM), Fast Fourier transform

(FFT), conjugate gradient (CG) and 9-point 2D Jacobi computation (J2D). The OI

upper bounds are used in the following sections for architecture-algorithm co-design

exploration. The upper bounds on OI are obtained using data movement lower

bounds. In modeling OI in operations per byte, we assume the data to be double

precision occupying 8 bytes per word and each red/blue pebble to have a capacity of

one word.

Matrix-Matrix Multiplication (MM): The standard matrix-matrix multipli-

cation algorithm has an operation count WMM = 2N^3 for the product of N × N matrices. Irony et al. [29] showed that the data movement complexity for MM is QMM ≥ N^3/(2√(2S)), for S red pebbles. Hence, the operational intensity OIMM satisfies

OIMM ≤ 2N^3 / (N^3/(2√(2S))) = 4√(2S) FLOP/Word, or 0.5√(2S) FLOP/Byte.

Fast Fourier Transform (FFT): In Subsection 3.3.2, we developed a data move-

ment lower bound for FFT, where we showed that the data movement complexity

of an N-point FFT, QFFT ≥ 2N log(N)/ log(S). Further, an N-point FFT has an

operation count of WFFT = 2N log(N). Hence, the operational intensity for FFT,

OIFFT ≤ 2N log(N) / (2N log(N)/log(S)) = log(S) FLOP/Word, or 0.125 log(S) FLOP/Byte.

Conjugate Gradient (CG): The pseudo-code for CG was provided in Algo-

rithm 3.1. For the analysis in this chapter, we consider a two-dimensional input problem being solved with CG. For a two-dimensional problem, the operation count of CG is WCG = 20N^2 T. Further, from Subsection 3.3.4, we obtained the data movement complexity of CG, QCG ≥ 6N^2 T for d = 2. Hence, the operational intensity for CG satisfies

OICG ≤ 20/6 FLOP/Word, or 0.417 FLOP/Byte.

9-point Jacobi 2D (J2D): From Subsection 3.3.3, we have the data movement complexity for J2D, QJ2D ≥ 0.75N^2 T/√S. The 9-point Jacobi has an operation count of WJ2D = 9N^2 T. Hence, we obtain the operational intensity for J2D,

OIJ2D ≤ 12√S FLOP/Word, or 1.5√S FLOP/Byte.
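The four OI upper bounds above depend only on the cache capacity S; the following sketch evaluates them together (log base 2 is assumed for the FFT bound, and the function name is ours).

import math

def oi_upper_bounds(S_words):
    """Upper bounds on OI (FLOP/Byte) from Section 8.2 for a cache of S words."""
    return {"MM": 0.5 * math.sqrt(2 * S_words),
            "FFT": 0.125 * math.log2(S_words),
            "CG": 0.417,
            "J2D": 1.5 * math.sqrt(S_words)}

print(oi_upper_bounds(S_words=512 * 1024 // 8))   # a 512 KB LLC holds 64K double words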

8.3 Architecture Design Space Exploration

In the next two sections, we show how an analysis of upper bounds on OI of an algorithm as a function of cache size can be used to model the limits of achievable performance as well as energy efficiency. We perform two types of analysis:

• For a particular set of technology parameters, what are the upper bounds on

achievable per-chip performance (GFLOP per second) for different algorithms,

considering a range of possible ways of dividing the chip’s area among processor

cores versus cache memory (Section 8.4)?

• For a given set of technology parameters, what are the upper bounds on energy

efficiency (Giga-operations per Joule) for different algorithms (Section 8.5)?

In this section, we provide details on how various architectural parameters were chosen for the analysis.

8.3.1 Notations

We consider a chip with a fixed total area A, where a portion α.A of the total area is used for the last level cache and the remaining area, (1 − α).A, is allocated to cores. While such an analysis could be directly extended to model constraints with respect to multiple levels of cache, we only model the impact of the last level

142 of cache (LLC) in this analysis. Also, we do not account for the area required for

interconnects and other needed logic. It is important to note that since our modeling

is purely an upper bound analysis on performance and energy efficiency, the results

obtained are of course valid under these simplifications: including more details of the

architecture can allow tighter bounds, but all results presented later in this section

are assertable upper limits on performance and energy efficiency for the modeled

technology. For example, when the analysis shows that large-FFT (size too large

to fit in LLC) performance cannot exceed 60 GFLOPs aggregate performance for

a multi-core chip built using the modeled technology, it represents an upper bound

that cannot be exceeded by any possible actual implementation with those technology

parameters. A more detailed model that accounts for additional levels of cache, on-

chip interconnect, control logic, etc., may serve to tighten or reduce the upper bound

to a lower value than 60 GFLOPs, but cannot invalidate the conclusions drawn from

the more simplified analysis.

The following valid simplifications are made to make the analysis easier:

• Un-core and I/O clock frequencies are assumed to be fixed.

• Interconnects are not considered in the analysis.

• LLC is considered to be shared by all the cores.

In our architectural design space exploration, the number of cores, P, and the

total number of locations in the LLC, S, are directly related to the total die area

A and the parameter α trades off one for the other. Fflops(f) varies with the clock

frequency f.

A            Total chip area
α.A          Area occupied by LLC
(1 − α).A    Area occupied by cores
f            Clock frequency (GHz)
P            Number of cores
S            Last level cache size in words
πcpu         Static leakage power per core
πcache       Static leakage power for cache of size S
ϵflop        Energy per arithmetic operation at frequency f
ϵmem         Energy per byte of data access from/to memory

Table 8.2: Notations used in architecture design space exploration.

Performance: The total computation time T follows the roofline model and can be expressed as:

T = max( W/(P.Fflops(f)), 8Q/Bmem ) = max( W/(P.Fflops(f)), W/(Bmem.OI(S)) )        (8.1)

Energy: The total energy consumption is modeled as:

E = W.ϵflop(f) + 8Q.ϵmem + (P.πcpu + πcache(S)).T, (8.2) where the quantities used are defined in Table 8.2.
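Equations (8.1) and (8.2) translate directly into the small model below. This is a sketch only; the parameter values would come from Table 8.2 and Section 8.3.2, and the function names are ours.

def exec_time(W, Q, P, F_flops, B_mem):
    """Equation (8.1): T = max(W/(P*F_flops), 8*Q/B_mem); Q in words, B_mem in bytes/s."""
    return max(W / (P * F_flops), 8.0 * Q / B_mem)

def energy(W, Q, T, P, eps_flop, eps_mem, pi_cpu, pi_cache):
    """Equation (8.2): E = W*eps_flop + 8*Q*eps_mem + (P*pi_cpu + pi_cache)*T."""
    return W * eps_flop + 8.0 * Q * eps_mem + (P * pi_cpu + pi_cache) * T

# Example usage with assumed values: W in FLOPs, Q in words moved between memory and LLC
T = exec_time(W=1e12, Q=2e10, P=8, F_flops=9.04e9, B_mem=40e9)
E = energy(W=1e12, Q=2e10, T=T, P=8, eps_flop=1.3e-9, eps_mem=0.63e-9, pi_cpu=0.819, pi_cache=2.0)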

8.3.2 Architectural Parameters

We demonstrate the utility of modeling upper bounds on operational intensity

by performing an analysis over possible architectural variants for a given technology.

We use architectural parameters for an enterprise Intel Xeon processor (codename

Nehalem-EX) [44] and use the published and available statistics to estimate area,

power and energy for different compute and memory units.

Parameter                Value
Die Size                 684 mm2
Technology Node          45 nm
Num of Cores             8
Num of LLC Slices        8
LLC Slice Size           3 MB
LLC Size (total)         24 MB
DRAM Channels            4
Core Voltage             0.85 – 1.1 V
Core Max Clock           2.26 GHz
TDP                      130 W
Threads per core (SMT)   2
Leakage (total)          21 W
L1/L2                    32 KB / 256 KB

Table 8.3: Nehalem-EX processor spec.

Table 8.3 shows the physical specifications for a testbed CPU. Using the die di- mensions, and by processing the die photo, we computed the area (in mm2) for the chip-level units of interest. For this study, we model core area, which includes private

L1 and L2 caches. We also model a range of values for the shared last level cache

(LLC).

Unit        Width (mm)   Height (mm)   Area (mm2)   Area (%)
Die         31.9         21.4          684          100
Core        7.2          3.7           26.7         3.9
LLC Slice   7.8          3.7           29.3         4.3

Table 8.4: Nehalem-EX die dimensions.

Size (KB)   H (mm)   W (mm)   Area (mm2)
4           0.341    0.912    0.311
8           0.368    0.912    0.336
16          0.472    0.966    0.456
32          0.580    1.010    0.586
64          0.815    1.015    0.827
128         0.848    2.032    1.723
256         0.870    4.060    3.532
384         1.096    4.066    4.456
512         2.798    2.778    7.773
1024        4.632    2.800    12.970
2048        5.080    5.138    26.101
3072        7.446    5.353    39.858
4096        5.859    9.173    53.745
8192        9.518    9.791    93.191

Table 8.5: Modeled LLC Slices using CACTI [38]

Table 8.4 shows the extracted dimensions for the Nehalem-EX Core and LLC. In order to explore LLC with different sizes, we used CACTI [38]. We fixed LLC design parameters based on the chip specification, varying only the size of the cache.

Table 8.5 shows modeled cache sizes and their corresponding area (from CACTI).

Using the area percentage for each unit, and the reported total leakage power for the chip (21W, or 16%), we modeled the static power consumed by each core on an area-proportional basis.

In order to model dynamic power per core, we used McPAT [35], an analytical tool to model the CPU pipeline and other structures. To estimate core parameters, we extended the available Xeon model to allow for a greater number cores. All mod- ifications were based on real parameters published by Intel [44]. In order to model

146 the impact of voltage/frequency scaling on energy efficiency, we extracted the maxi- mum and minimum operating voltage range for the processor, and the corresponding frequencies at which the processor can operate. Using those V/F pairs, different

“power states” were modeled for the processor using McPAT [35]. Table 8.6 shows how changing voltage and frequency effects total chip and core power.

Voltage (V)   Clock (MHz)   Peak Chip Dyn (W)   Peak Core Dyn (W)
1.10          2260          111.14              102.68
1.05          2060          95.94               87.49
1.00          1860          81.82               73.35
0.95          1660          62.23               60.78
0.90          1460          58.12               49.66
0.85          1260          48.39               39.93

Table 8.6: Nehalem-EX: effect of changing voltage and frequency on core/chip power

Finally, we summarize the parameters for the testbed architecture used for the modeling in the next sections: Fflops(f) = 9.04 GFLOPS @ 2.26 GHz, Bmem = 40 GB/s, A = 684 mm^2, Acore = 26.738 mm^2, ϵflop(f) = 1.3 nJ/flop, ϵmem = 0.63 nJ/byte, πcpu = 0.819 W. Each core, which includes private L1/L2 caches, occupies around 4% of the total die area. For a very small value of α, a maximum of 25 cores would fit on the chip. At the other extreme, for α of 0.95, only one core can be put on the chip and an LLC cache of 64 MBytes could be accommodated.
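The α-to-core-count mapping used in the rest of this chapter can be sketched as follows. The rounding convention and the per-core peak of 9.04 GFLOP/s are our assumptions based on the parameters just listed.

def cores_for_alpha(alpha, die_area=684.0, core_area=26.738):
    """Number of cores that fit in the (1 - alpha) fraction of the die."""
    return max(1, round((1.0 - alpha) * die_area / core_area))

for alpha in (0.01, 0.25, 0.50, 0.95):
    p = cores_for_alpha(alpha)
    print(alpha, p, "cores, peak", round(p * 9.04, 2), "GFLOP/s")

With these assumptions the four configurations evaluate to 25, 19, 13 and 1 cores, i.e., peaks of 226, 171.76, 117.52 and 9.04 GFLOP/s.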

8.4 Algorithm-Architecture Codesign: Bounds on Performance

In this section, for different algorithms, we consider the implications of upper bounds on OI on the maximal performance (operations per second) achievable on

a chip. As described in the previous section, we assume that the chip area can be

partitioned as desired among processor cores or cache. As shown below for four

demonstration benchmarks, the upper bound on OI for an algorithm can be modeled

as a monotonic non-decreasing function of cache size. As the fractional chip area

occupied by LLC increases, the upper bound on OI increases. But simultaneously, the fractional area usable for cores decreases, so that fewer cores may be placed on the chip. As shown next, a collection of rooflines can be used to capture the architecture design space. Alternatively, we show that the collective data can be captured in a single multi-roofline plot.

The size of the LLC was varied from a tiny 4 Kbyte size (representing a fraction of under 0.1% of the chip of approximately 700 mm2 area) to 64 Mbytes (filling almost the entire chip area). For demonstration purposes, we analyze four algorithms: matrix multiplication (MM), fast Fourier transform (FFT), conjugate gradient (CG) and 9- point 2D Jacobi (J2D) iterative linear system solvers. In all these cases, the problem sizes were assumed to be much larger than LLC, when OI is essentially independent of the problem size, as per the analysis in Section 8.2. The following upper bounds

were obtained for OI, in FLOP/Byte, as a function of cache capacity (S words), for

double-precision data occupying 8 bytes per word:

• MM: 0.5√(2S)

• FFT: 0.125 log(S)

• CG: 0.417 (it is independent of cache capacity)

• J2D: 1.5√S

Figure 8.2 shows the variation of upper bounds on OI as a function of the fractional chip real estate used for cache, assuming double-precision data occupying 8 bytes per word. It may be seen that the trends for MM and J2D are similar, while those of the remaining two are very different. MM and J2D show a significant increase in OI as

the fraction of chip area occupied by cache is increased (the plot is logarithmic on

the y-axis). FFT shows a very mild rise in OI as the cache fraction is increased, and

the OI is between 0.5 and 2 over the entire range. CG has a flat and very low value of OI (0.42), irrespective of the amount of cache deployed.

[Figure 8.2: Operational intensity upper bounds as a function of α — OI (FLOP/Byte, log scale) vs. percentage of die area used for cache, with curves for MM, FFT, CG and J2D.]

Figure 8.3 shows four roofline plots for four specific values of α: 0.01, 0.25, 0.5

and 0.95, respectively. In each case, the vertical lines are placed at the upper bound

on OI for the four algorithms, and either intersect the inclined bandwidth roofline or

the horizontal processor-peak roofline.

[Figure 8.3: Upper bounds on performance for different choices of α — four roofline plots (GFLOP/s vs. OI in FLOP/Byte): α = 1% (512 KB LLC, 25 cores, peak 226 GFLOP/s); α = 25% (16 MB LLC, 19 cores, peak 171.76 GFLOP/s); α = 50% (32 MB LLC, 13 cores, peak 117.52 GFLOP/s); α = 95% (64 MB LLC, 1 core, peak 9.04 GFLOP/s). Vertical lines mark the OI upper bounds of MM, FFT, CG and J2D.]

At a very small value of α of 0.01, the size of LLC is very small and a maximal

number of cores (25) can be put on the die. So the horizontal roofline is at a perfor-

mance of 226 GFLOPs (9.04*25). The OI upper bounds of the four benchmarks – MM, FFT, CG and J2D – with this configuration are 181.02, 2.0, 0.41 and 384.0, respectively. CG and FFT are bandwidth-bound, although not to the same extent, while MM and J2D are not bandwidth-bound.

150 When α is increased to 0.25, the number of cores that can be placed on chip decreases, causing a lowering of the horizontal roofline. The OI values increase for

FFT, MM and J2D, while they are unchanged for CG. Compared to α of 0.01, the performance upper bound for FFT increases because the intersection with the band- width roofline occurs further to the right. But for MM and J2D, the performance upper bounds decrease despite a higher OI, since the horizontal roofline has dropped due to fewer cores. It is interesting to note that the trends as a function of α are in opposite directions for FFT and MM.

[Figure 8.4: Upper bounds on performance — combined roofline plot (GFLOP/s vs. OI in FLOP/Byte) with horizontal rooflines for α = 1% (P = 25), α = 75% (P = 6), α = 85% (P = 3) and α = 95% (P = 1), together with the OI ranges of MM, FFT, CG and J2D.]

Figure 8.4 shows a single combined roofline plot that captures the variation of

upper bounds on performance for the entire design space of configurations of the

chip, i.e., over the range of possible values of α. The plot represents a consolidated

analysis that takes into account the inter-dependence between the size of LLC and the

maximum number of cores — the larger the LLC, the less the remaining area on the

chip for cores. The value of α determines how many cores can be placed on the chip.

With the parameters of the chosen technology detailed in the previous section, each core occupies a little under 4% of the chip area. The horizontal rooflines corresponding to four values of α are shown in the figure, intersecting with the bandwidth roofline

(corresponding to 40 Gbytes/sec). Four instances of horizontal rooflines are shown in the figure, tagged with the value of α and corresponding number of cores.

The results for the three benchmarks – MM, FFT and CG – show disjoint ranges of OI. CG has no variation in OI as a function of S. Its OI of 0.417 leads to an upper bound of 16.7 GFLOPs (0.417*40Gbytes/sec). Since each core has a peak performance of 9.04 GFLOPs for the modeled technology, CG is bandwidth-bound for any number of cores greater than one. But if α is 95%, we can only put one core on the chip, and the upper bound is 9 GFLOPs. Thus, the two red dots in the multi- roofline plot capture the entire range of possibilities as alpha is varied: 9 GFLOPs when P = 1 (α is above 0.93) and 16.7 GFLOPs when P ≥ 2.

For FFT, the upper bound on OI ranges from 1.125 to 2.875 over the range of α from 0.1% to 95%. At the low end of this range, OI = 1.125, the performance upper bound is 45 GFLOP/s, and the computation is severely bandwidth-bound; the 25 cores that could be put on chip for this value of α would be heavily under-utilized. As α is increased, the upper bound on performance improves and hugs the bandwidth roofline. FFT's performance reaches its peak when α = 50% and the number of cores is 13. But when α goes to or above 75% and the number of cores drops to 6 or fewer, the algorithm becomes compute-bound because the peak computational performance is lower than 40 × OI(S).

MM has a range (as the size of the LLC is varied) that is always in the compute-bound region of the roofline. But as the LLC size is increased, the number of cores on chip must decrease, and the performance upper bound drops at very high values of OI.

Jacobi 2D follows the same trend as that of MM and is always in the compute-bound region. The performance of J2D for different values of α is exactly equal to that of MM, even though J2D has higher OI than MM. Hence, MM and J2D perform best with the maximum number of cores, P = 25, which is available for α ≤ 1%.

The analysis shows that two currently widely used algorithms, FFT and CG, will not be well suited for solving problems that are very large relative to the cache capacity, unless the bandwidth between main memory and cache is substantially increased relative to representative parameters of current systems.

Analysis of lower bounds on data movement and upper bounds on OI can be similarly carried out for any level of the storage hierarchy, including data movement between cache and registers, or between disk and main memory. The bounds for any level can be obtained by appropriately setting the value of S. Here, we have focused on the data movement between memory and the on-chip LLC to aid in deriving bounds on performance for algorithm-architecture codesign, because off-chip memory bandwidth is often a critical performance bottleneck. The presented methodology can, however, be similarly applied to identify potential bottlenecks at other levels of the cache hierarchy.

The lower bounds on data movement and the corresponding upper bounds on OI are inherent fundamental limits that cannot be overcome. We note that although the above analysis has made several simplifying assumptions, the upper bounds on performance can be asserted to hold even when any simplifying assumptions are removed and more accurate information is used. For example, if information about the area occupied by interconnects is added, the total number of cores and the total LLC cache size will go down but not increase. Hence all observations on upper bounds on performance will still be valid. Interconnects may add latencies and also bandwidth constraints, but they will not invalidate any of the above conclusions on performance upper bounds for the modeled technology parameters. Similarly, although in reality full overlap of computation and communication is not feasible, and perfect scaling of performance with increase in the number of processors is also generally infeasible, those best-case assumptions do not invalidate any of the conclusions on upper limits of achievable performance over the architectural configuration space. For example, for the modeled architectural parameters, the analysis indicates that achievable performance for large CG and FFT computations (that cannot fit within LLC) will necessarily be far below system peak, irrespective of the amount of effort put into the implementation, because the fundamental inherent data movement limits of those computations cannot be overcome.

8.5 Algorithm-Architecture Codesign: Bounds on Energy Efficiency

An important metric is energy efficiency, defined as the ratio of the number of executed operations to the energy expended in the execution. The upper bounds on OI can also be used to bound the maximal achievable energy efficiency. The total energy of execution is modeled as the sum of the energy for performing the operations, the energy for data movement due to DRAM accesses, and an additional term for the static leakage energy in the cores and cache. Figure 8.5 shows the variation in the upper bounds on energy efficiency for MM, FFT, CG and J2D as a function of: 1) the number of cores used, 2) the voltage and clock frequency used, and 3) the capacity of the LLC.

[Figure 8.5 plots: energy efficiency (GFLOP/J) for MM, FFT, CG, and J2D; each panel shows energy efficiency versus the number of cores (1, 2, 4, 8, 16) at frequencies of 2260 MHz, 1860 MHz, and 1460 MHz, with separate curves for cache sizes of 4 KB, 64 KB, and 512 KB.]

Figure 8.5: Upper bounds on energy efficiency.

The horizontal axis marks three different clock frequencies (2260 MHz, 1860 MHz, and 1460 MHz), representing voltage/frequency scaling, and for each of the frequencies, five choices of processor count (1, 2, 4, 8, and 16). Different curves correspond to different capacities of the LLC. The overall trends are as follows.

• For all four benchmarks, the groups of clustered lines move upwards as we go from left to right on the charts, representing a decrease in the voltage/frequency. For a fixed number of cores and LLC capacity, lower frequencies lead to higher bounds on attainable energy efficiency. This is because the core's energy per operation decreases super-linearly as voltage/frequency are decreased (dynamic power is proportional to V²f; since voltage and frequency are linearly related, power scales as f³ and energy per operation as f²). There is also an increase in the total static leakage energy since the computation will take longer to complete, but there is an overall reduction in the lower bounds for energy, i.e., an increase in the upper bounds on energy efficiency.

• Increasing the number of cores (with fixed frequency and LLC) is detrimental to energy efficiency, especially for bandwidth-bound computations. This is seen clearly for FFT and CG, with each of the lines curving downwards as the number of processors is increased. This effect is mainly due to the increased static energy for the active cores. While additional cores can divide the parallel work among themselves, the computation rate is limited by the rate at which data is delivered from memory to cache, so that there is no reduction in the lower bound for execution time. For MM and J2D, the curves are relatively flat since the computation is compute-bound. Using more cores enables the work to be done faster, and there is no increase in the total static leakage energy aggregated over all cores: using twice as many cores halves the lower bound on execution time and doubles the static leakage power of the cores (see the short derivation after this list).

• Increasing LLC capacity has two competing effects: (i) a potential for improving energy efficiency by increasing OI due to the larger cache, but (ii) decreased energy efficiency due to higher static leakage energy from the larger cache. There is no useful benefit for CG since its OI is independent of cache capacity; at larger cache sizes there is only a detrimental effect (not seen in the charts since the cache sizes used were quite small). For FFT, the benefits from improved OI clearly outweigh the increased static leakage energy. For MM and J2D, although OI increases with cache size, it is already so high at the smallest cache size that the incremental benefits in reducing data transfer energy from main memory are very small. Hence the curves representing different cache sizes for fixed frequency do not have much separation.
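A short calculation makes the core-count effect described in the second bullet precise; this is our restatement of the argument in the chapter's notation, not an equation taken from it. For a compute-bound computation with W operations, the lower bound on execution time is W/(p·Fflops(f)), whereas for a bandwidth-bound one it is W/(Bmem·OI(S)), independent of the core count:

\[
E_{\mathrm{static}}^{\mathrm{compute\text{-}bound}} = P\,\pi_{\mathrm{cpu}} \cdot \frac{W}{P\,F_{\mathrm{flops}}(f)} = \frac{W\,\pi_{\mathrm{cpu}}}{F_{\mathrm{flops}}(f)},
\qquad
E_{\mathrm{static}}^{\mathrm{bandwidth\text{-}bound}} = P\,\pi_{\mathrm{cpu}} \cdot \frac{W}{B_{\mathrm{mem}}\,OI(S)}.
\]

The first expression is independent of P, which is why the MM and J2D curves stay flat, while the second grows linearly with P, which is why the FFT and CG curves bend downwards as cores are added.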

We next characterize the maximal possible energy efficiency for the four benchmarks, if we have the flexibility to choose (i) the number of cores to be turned on, (ii) the amount of cache area to be used, and (iii) the voltage/frequency at which the cores are to be run. From Equations (8.1) and (8.2), we have the energy efficiency

\[
E_{\mathrm{eff}} = \frac{W}{E} = \left( \epsilon_{\mathrm{flop}}(f) + \frac{\epsilon_{\mathrm{mem}}}{OI(S)} + \frac{P \cdot \pi_{\mathrm{cpu}} + \pi_{\mathrm{cache}}(S)}{\min\left(P \cdot F_{\mathrm{flops}}(f),\; B_{\mathrm{mem}} \cdot OI(S)\right)} \right)^{-1}. \tag{8.3}
\]
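As a concrete illustration of Equation (8.3), the following Python sketch evaluates the expression directly. The callables eps_flop, pi_cache, and F_flops stand in for the chapter's calibrated technology models; the function name energy_efficiency and the argument layout are our own, and the caller must pick mutually consistent units:

```python
def energy_efficiency(P, f, S, oi, eps_flop, eps_mem, pi_cpu, pi_cache,
                      F_flops, B_mem):
    """Upper bound on energy efficiency, following Equation (8.3).

    P, f, S  : number of active cores, clock frequency, LLC capacity
    oi       : upper bound on operational intensity OI(S)
    eps_flop : callable f -> energy per operation at frequency f
    eps_mem  : energy per unit of data moved to/from memory
    pi_cpu   : static leakage power of one core
    pi_cache : callable S -> static leakage power of an LLC of capacity S
    F_flops  : callable f -> per-core peak performance at frequency f
    B_mem    : memory bandwidth
    """
    throughput = min(P * F_flops(f), B_mem * oi)                 # attainable performance
    energy_per_op = (eps_flop(f)                                 # compute energy per op
                     + eps_mem / oi                              # memory-traffic energy per op
                     + (P * pi_cpu + pi_cache(S)) / throughput)  # leakage energy per op
    return 1.0 / energy_per_op
```

Maximizing this quantity over the feasible (P, f, S) configurations is exactly what the case analysis below condenses.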

Depending on the upper bound on the OI of the algorithm, we analyze the different cases below and determine the best architectural configuration for maximum energy efficiency.

Case I: The algorithm is completely bandwidth-bound.

This corresponds to the case where the maximal OI(S) of the algorithm for a given cache size S is so low that the algorithm is bandwidth-bound even on a single core at its lowest frequency. From the performance roofline viewpoint, this leads to the condition Bmem·OI(S) < Fflops(f). With increasing frequency, ϵflop(f) increases and thus the energy efficiency deteriorates. Hence, the highest energy efficiency is achieved when P = 1 and f is set at its minimum, and Equation (8.3) reduces to

\[
E_{\mathrm{eff}} = \left( \epsilon_{\mathrm{flop}}(f_{\min}) + \frac{\epsilon_{\mathrm{mem}}}{OI(S)} + \frac{\pi_{\mathrm{cpu}} + \pi_{\mathrm{cache}}(S)}{B_{\mathrm{mem}} \cdot OI(S)} \right)^{-1},
\]

where fmin is the minimum allowable frequency for the architecture.

Case II: The algorithm is compute-bound with p or fewer cores and at all frequencies.

From Equation (8.3), we can see that at any given frequency f, increasing P improves Eeff. Hence, P = p is the best choice irrespective of the frequency. Also, with increasing f, Fflops(f) increases linearly, while ϵflop(f) increases super-linearly. For a fixed value of p, the optimal energy efficiency therefore depends on machine parameters such as πcache(S) and ϵflop(f). The maximal energy efficiency obtained is

\[
E_{\mathrm{eff}} = \max_{f \in [f_{\min},\, f_{\max}]} \left( \epsilon_{\mathrm{flop}}(f) + \frac{\epsilon_{\mathrm{mem}}}{OI(S)} + \frac{p \cdot \pi_{\mathrm{cpu}} + \pi_{\mathrm{cache}}(S)}{p \cdot F_{\mathrm{flops}}(f)} \right)^{-1},
\]

where fmin and fmax are the minimum and maximum allowable frequencies for the architecture, respectively.

Case III: The algorithm is compute-bound with p cores at lower frequencies and becomes bandwidth-bound with p cores at higher frequencies.

Let fcutoff be the frequency at which the algorithm transitions from the compute-bound to the bandwidth-bound region. For the region where f ≥ fcutoff, we know from Case I that once the algorithm becomes bandwidth-bound, increasing the frequency further has a detrimental effect on the energy efficiency. Hence, in that region the best energy efficiency is achieved when f = fcutoff, where p·Fflops(fcutoff) = Bmem·OI(S). When f < fcutoff, the analysis in Case II showed that in the compute-bound region, with the number of cores p held constant, the optimal frequency depends on the machine parameters; also, we then have p·Fflops(f) < Bmem·OI(S). Hence,

\[
E_{\mathrm{eff}} = \max_{f \in [f_{\min},\, f_{\mathrm{cutoff}}]} \left( \epsilon_{\mathrm{flop}}(f) + \frac{\epsilon_{\mathrm{mem}}}{OI(S)} + \frac{p \cdot \pi_{\mathrm{cpu}} + \pi_{\mathrm{cache}}(S)}{p \cdot F_{\mathrm{flops}}(f)} \right)^{-1}.
\]
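If the per-core peak is modeled as linear in frequency, Fflops(f) = c·f with c the number of flops per cycle (an assumption made here purely for illustration), the cutoff frequency has a closed form; the helper name below is ours:

```python
def cutoff_frequency(p, oi, B_mem, flops_per_cycle):
    """Frequency at which p cores shift from compute-bound to bandwidth-bound,
    i.e. the solution of p * F_flops(f) = B_mem * oi under the linear model
    F_flops(f) = flops_per_cycle * f."""
    return B_mem * oi / (p * flops_per_cycle)
```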

Case IV: The algorithm is compute-bound at all frequencies with p cores, and becomes bandwidth-bound with q = p + 1 or more cores.

This case gives rise to two scenarios: (1) The performance at higher frequencies (fp ∈ [f1, fmax]) with p cores overlaps with the performance at lower frequencies (fq ∈ [fmin, f2]) with q cores, and hence the algorithm becomes bandwidth-bound at frequencies f > f2 with q cores. (2) There is a gap between the maximum performance achievable with p cores and the minimum performance achievable with q cores, i.e., q·Fflops(fmin) − p·Fflops(fmax) > 0.

In both these scenarios, the maximum achievable energy efficiency depends on the machine parameters.
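Since the design space explored here is small, the case analysis above can also be checked by brute force: enumerate all feasible (S, P, f) triples and keep the best. The sketch below does this, reusing the energy_efficiency helper sketched earlier; oi_of and cores_for_cache are hypothetical callables supplied by the user (e.g., an OI(S) table and the area-budget rule of Section 8.4):

```python
from itertools import product

def best_configuration(cache_sizes, frequencies, oi_of, cores_for_cache,
                       eps_flop, eps_mem, pi_cpu, pi_cache, F_flops, B_mem):
    """Exhaustive search for the (S, P, f) configuration maximizing Equation (8.3).

    oi_of(S)          : upper bound on OI for an LLC of capacity S
    cores_for_cache(S): maximum number of cores that fit beside an LLC of capacity S
    """
    best = None
    for S, f in product(cache_sizes, frequencies):
        for P in range(1, cores_for_cache(S) + 1):
            eff = energy_efficiency(P, f, S, oi_of(S), eps_flop, eps_mem,
                                    pi_cpu, pi_cache, F_flops, B_mem)
            if best is None or eff > best[0]:
                best = (eff, S, P, f)
    return best  # (efficiency, LLC capacity, core count, frequency)
```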

Similar to the performance multi-roofline in Figure 8.4, an energy multi-roofline has been plotted for the four benchmarks in Figure 8.6, with a range of OI corresponding to the range of LLC sizes for our testbed architecture. Unlike the case of the performance multi-roofline, the actual set of energy roofline curves is not shown in the figure because there are too many of them and they would clutter the graph: in addition to different values of α, we also consider different core frequencies, resulting in a different energy roofline for each combination.

[Figure 8.6 plot: energy efficiency (GFLOP/J) versus operational intensity (FLOP/Byte) for MM, FFT, CG, and J2D.]

Figure 8.6: Upper bounds on energy efficiency.

Figure 8.6 shows a similar trend for the energy efficiency as for the performance in Section 8.4. MM and J2D are always compute-bound and hence fall under Case II for all values of S. For MM and J2D, optimal energy efficiency was generally achieved at the lowest frequency value on the testbed; but as the number of active cores approached one (as the size of the LLC increases), maximal energy efficiency was achieved at higher frequencies. On the other hand, CG becomes bandwidth-bound once the number of active cores for the given LLC size exceeds 3. Hence, CG starts at Case IV with P = 3, and as the LLC size increases (and the number of allowable cores goes below 3), it enters Case II. CG has the best upper bound on energy efficiency of 0.373 GFLOP/J for a cache size of 4 KB with 3 cores at 1.26 GHz. The upper bound on energy efficiency for FFT initially increases with increasing cache size and then starts decreasing, for reasons similar to those for the performance upper bounds. FFT achieves its maximal energy efficiency for a cache size of 8 MB at P = 10 and f = 1.26 GHz.

8.6 Related Work

Williams et al. [55] developed the roofline model, which analyzes performance bottlenecks of an architecture due to memory bandwidth limits. Choi et al. [13, 14] developed the energy version of the roofline model. Czechowski et al. [16] have developed balance principles for algorithm-architecture co-design. These models characterize algorithms using the operational intensity of a particular implementation, which, however, is not a fixed quantity for an algorithm-architecture combination. Our work complements the above efforts. Demmel et al. [30] considered energy efficiency at the algorithmic level and proved that a region of perfect strong scaling in energy exists for matrix multiplication and the direct n-body problem. They also address the problem of finding the minimum energy required for a computation given a maximum allowed runtime, and vice versa. Bilardi et al. [10] developed an analysis of the tradeoffs between bandwidth, memory and processing for Quantum Chromodynamics (QCD) computations, and derived guidelines to design a machine for QCD computation.

CHAPTER 9

Conclusion and Future Work

In this dissertation, we developed new lower bounding techniques for characterizing the data movement complexity of computations, presented heuristics for automatic analysis of computations for their data movement lower bounds, and demonstrated the effectiveness of our techniques. We also presented an extension of the pebble game model to capture the data movement needs in inter-core/inter-node parallel environments. Several use cases and an application to algorithm-architecture codesign space exploration were also presented.

In Chapter 3, we developed a min-cut based lower bounding approach and provided tight lower bounds for different algorithms. Several of the past studies mainly focused on analytical reasoning and manual application of various techniques to derive lower bounds on data movement complexity. We showed that our technique lends itself to a more practical automatic heuristic. This also raises many interesting directions for future research. For instance, our automated min-cut based heuristic relies on good CDAG decomposition in order to obtain tight bounds. Our current implementation uses general-purpose graph partitioning algorithms to decompose CDAGs. However, a more sophisticated graph partitioning method, guided by the lower bounding heuristic to produce high-quality decompositions, would be of interest to pursue further.

The vast majority of existing application codes do not perform any redundant recomputation of operations. But with data movement costs becoming increasingly dominant over operation execution costs, both in terms of energy and performance, there is significant interest in devising implementations of algorithms where redundant recomputation of values may be used to trade off additional inexpensive operations for a reduction in expensive data movements to/from off-chip memory. It is therefore of interest to develop automated techniques for data movement lower bounds under the original model of Hong & Kung, which permits recomputation of CDAG vertices. Having lower bounds under both models can offer a mechanism to identify which algorithms have a potential for a trade-off between extra computations and reduced data movement and which do not. If the CDAG representing a computation has matching and tight data movement lower bounds under both the general model and the restricted model, the algorithm does not have potential for such a trade-off. On the other hand, if a lower bound under the restricted model is higher than a tight lower bound under the general model, the computation has potential for trading off extra computations for a reduction in the volume of data movement. This raises an interesting question: Is it possible to develop necessary and/or sufficient conditions on properties of the computation, for example on the nature of the data dependencies, which would guarantee matching (or differing) lower bounds under the models allowing/prohibiting recomputation?

Chapter 6 built upon the work of [15] to develop a technique to derive asymptotic parametric lower bounds for CDAGs of affine parts of computations. Many practical applications would benefit from having tight constants for the leading terms in the asymptotic parametric expressions of the order complexity of data movement lower bounds. Methods for finding tight constants for the parametric bounds are not addressed in this dissertation and would be an interesting topic for future research.

Deriving upper bounds that match the lower bounds is of significant importance, both to determine the tightness of derived bounds and to develop actual implementations that achieve these bounds. Tightness of lower bounds is in many cases shown using an actual implementation whose data movement cost matches the lower bound.

There have been a few studies to develop techniques with matching lower and upper bounds [1, 9, 11, 12, 33]. An open question is whether any automatic tool can be designed to systematically explore the space of valid schedules to generate good upper bounds based on models and/or heuristics that minimize data movement cost.

BIBLIOGRAPHY

[1] Aggarwal, A., Alpern, B., Chandra, A. K., and Snir, M. A model for hierarchical memory. In 19th STOC (1987), pp. 305–314.

[2] Aggarwal, A., and Vitter, J. S. The input/output complexity of sorting and related problems. Commun. ACM 31 (1988), 1116–1127.

[3] Ballard, G., Demmel, J., Holtz, O., and Schwartz, O. Graph expansion and communication costs of fast matrix multiplication: regular submission. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures (New York, NY, USA, 2011), SPAA ’11, ACM, pp. 1–12.

[4] Ballard, G., Demmel, J., Holtz, O., and Schwartz, O. Minimizing communication in numerical linear algebra. SIAM J. Matrix Analysis Applications 32, 3 (2011), 866–901.

[5] Barvinok, A. Computing the Ehrhart polynomial of a convex lattice polytope. Discrete and Computational Geometry 12 (1994), 35–48.

[6] Bennett, J., Carbery, A., Christ, M., and Tao, T. Finite bounds for Hölder-Brascamp-Lieb multilinear inequalities. Mathematical Research Letters 55, 4 (2010), 647–666.

[7] Bergman, K., Borkar, S., Campbell, D., Carlson, W., Dally, W., Denneau, M., Franzon, P., Harrod, W., Hill, K., Hiller, J., et al. Exascale computing study: Technology challenges in achieving exascale systems. Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep 15 (2008).

[8] Bilardi, G., and Peserico, E. A characterization of temporal locality and its portability across memory hierarchies. Automata, Languages and Programming (2001), 128–139.

[9] Bilardi, G., Pietracaprina, A., and D’Alberto, P. On the space and access complexity of computation DAGs. In Graph-Theoretic Concepts in Computer Science, LNCS. 2000, pp. 81–92.

[10] Bilardi, G., Pietracaprina, A., Pucci, G., Schifano, F., and Tripiccione, R. The potential of on-chip multiprocessing for QCD machines. In High Performance Computing – HiPC 2005, D. Bader, M. Parashar, V. Sridhar, and V. Prasanna, Eds., vol. 3769 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2005, pp. 386–397.

[11] Bilardi, G., and Preparata, F. P. Upper bounds to processor-time tradeoffs under bounded-speed message propagation. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures (New York, NY, USA, 1995), SPAA ’95, ACM, pp. 185–194.

[12] Bilardi, G., and Preparata, F. P. Processor - Time Tradeoffs under Bounded-Speed Message Propagation: Part II, Lower Bounds. Theory Com- put. Syst. 32, 5 (1999), 531–559.

[13] Choi, J. W., Bedard, D., Fowler, R., and Vuduc, R. A roofline model of energy. In Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (2013), IPDPS ’13, pp. 661–672.

[14] Choi, J. W., Dukhan, M., Liu, X., and Vuduc, R. Algorithmic time, energy, and power on candidate HPC compute building blocks. In Proceedings of the 2014 IEEE 28th International Symposium on Parallel and Distributed Processing (2014), IPDPS ’14, pp. 1–11.

[15] Christ, M., Demmel, J., Knight, N., Scanlon, T., and Yelick, K. Communication Lower Bounds and Optimal Algorithms for Programs That Reference Arrays – Part 1. EECS Technical Report EECS–2013-61, UC Berkeley, May 2013.

[16] Czechowski, K., Battaglino, C., McClanahan, C., Chandramowlishwaran, A., and Vuduc, R. Balance principles for algorithm-architecture co-design. In Proceedings of the 3rd USENIX Conference on Hot Topics in Parallelism (2011), HotPar’11, pp. 9–9.

[17] Demmel, J., Grigori, L., Hoemmen, M., and Langou, J. Communication- optimal parallel and sequential QR and LU factorizations. SIAM J. Scientific Computing 34, 1 (2012).

[18] Elango, V., Rastello, F., Pouchet, L.-N., Ramanujam, J., and Sa- dayappan, P. On characterizing the data movement complexity of computa- tional dags for parallel execution. In Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures (June 2014), ACM, pp. 296–306.

[19] Elango, V., Rastello, F., Pouchet, L.-N., Ramanujam, J., and Sa- dayappan, P. On characterizing the data access complexity of programs. In

Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (New York, NY, USA, 2015), POPL ’15, ACM, pp. 567–580.

[20] Elango, V., Sedaghati, N., Rastello, F., Pouchet, L., Ramanujam, J., Teodorescu, R., and Sadayappan, P. On using the roofline model with lower bounds on data movement. TACO 11, 4 (2014), 67:1–67:23.

[21] Feautrier, P. Parametric integer programming. RAIRO Recherche Opérationnelle 22, 3 (1988), 243–268.

[22] Fuller, S. H., and Millett, L. I. The Future of Computing Performance: Game Over or Next Level? The National Academies Press, 2011.

[23] Girbal, S., Vasilache, N., Bastoul, C., Cohen, A., Parello, D., Sigler, M., and Temam, O. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. Intl. J. of Parallel Programming 34, 3 (2006).

[24] Gupta, G., and Rajopadhye, S. The Z-polyhedral model. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’07) (2007), ACM, pp. 237–248.

[25] Henning, J. L. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News 34, 4 (Sept. 2006), 1–17.

[26] Hestenes, M. R., and Stiefel, E. Methods of conjugate gradients for solving linear systems, 1952.

[27] Hong, J.-W., and Kung, H. T. I/O complexity: The red-blue pebble game. In Proc. of the 13th Annual ACM Symposium on Theory of Computing (STOC’81) (1981), ACM, pp. 326–333.

[28] HPCToolkit. www.hpctoolkit.org.

[29] Irony, D., Toledo, S., and Tiskin, A. Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput. 64, 9 (2004), 1017–1026.

[30] Demmel, J., Gearhart, A., Lipshitz, B., and Schwartz, O. Perfect strong scaling using no additional energy. In Parallel Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on (May 2013), pp. 649–660.

[31] Karypis, G., and Kumar, V. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20, 1 (1998), 359–392.

[32] Ketterlin, A., and Clauss, P. Profiling data-dependence to assist parallelization: Framework, scope, and optimization. In MICRO (2012), pp. 437–448.

[33] Knight, N. S. Communication-optimal loop nests.

[34] Lemon Graph Library. https://lemon.cs.elte.hu.

[35] Li, S., Ahn, J. H., Strong, R. D., Brockman, J. B., Tullsen, D. M., and Jouppi, N. P. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In MICRO (December 2009), pp. 469–480.

[36] Low-Level Virtual Machine. www.llvm.org.

[37] Loomis, L. H., and Whitney, H. An inequality related to the isoperimetric inequality. Bull. Amer. Math. Soc. 55, 10 (10 1949), 961–962.

[38] Muralimanohar, N., Balasubramonian, R., and Jouppi, N. P. CACTI 6.0: A tool to model large caches. Tech. Rep. HPL-2009-85, HP Labs, 2009.

[39] Nagapattinam Ramamurthi, R. Dynamic Trace-based Analysis of Vectorization Potential of Programs. PhD thesis, The Ohio State University, 2012.

[40] Patterson, D., and Hennessy, J. L. Computer architecture: a quantitative approach. Morgan Kaufmann, 2011.

[41] PLUTO: A polyhedral automatic parallelizer and locality optimizer for multi- cores. http://pluto-compiler.sourceforge.net.

[42] Ranjan, D., Savage, J., and Zubair, M. Strong I/O lower bounds for binomial and FFT computation graphs. In Computing and Combinatorics, vol. 6842 of LNCS. Springer, 2011, pp. 134–145.

[43] Ranjan, D., Savage, J. E., and Zubair, M. Upper and lower I/O bounds for pebbling r-pyramids. J. Discrete Algorithms 14 (2012), 2–12.

[44] Rusu, S., Tam, S., Muljono, H., Stinson, J., Ayers, D., Chang, J., Varada, R., Ratta, M., and Kottapalli, S. A 45nm 8-core enterprise Xeon processor. In Solid-State Circuits Conference, 2009. A-SSCC 2009. IEEE Asian (Nov 2009), pp. 9–12.

[45] Savage, J. E. Extending the Hong-Kung model to memory hierarchies. In Computing and Combinatorics, vol. 959 of LNCS. 1995, pp. 270–281.

[46] Savage, J. E. Models of computation. Addison-Wesley, 1998.

[47] Savage, J. E., and Zubair, M. A unified model for multicore architectures. In Proceedings of the 1st international forum on Next-generation multicore/many- core technologies (2008), ACM, p. 9.

[48] Savage, J. E., and Zubair, M. Cache-optimal algorithms for option pricing. ACM Trans. Math. Softw. 37, 1 (2010).

[49] Scquizzato, M., and Silvestri, F. Communication lower bounds for distributed-memory computations. CoRR abs/1307.1805 (2013).

[50] Shalf, J., Dosanjh, S., and Morrison, J. Exascale computing technology challenges. High Performance Computing for Computational Science–VECPAR 2010 (2011), 1–25.

[51] Solomonik, E., Buluc, A., and Demmel, J. Minimizing communication in all-pairs shortest paths. In IPDPS (2013).

[52] Solomonik, E., Carson, E., Knight, N., and Demmel, J. Tradeoffs between synchronization, communication, and computation in parallel linear algebra computations. In Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures (New York, NY, USA, 2014), SPAA ’14, ACM, pp. 307–318.

[53] Valiant, L. G. A bridging model for multi-core computing. J. Comput. Syst. Sci. 77 (Jan. 2011), 154–166.

[54] Verdoolaege, S. isl: An integer set library for the polyhedral model. In Mathematical Software–ICMS 2010. Springer, 2010, pp. 299–302.

[55] Williams, S., Waterman, A., and Patterson, D. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (Apr. 2009), 65–76.
