A Framework for Performance Optimization of Tensor Contraction Expressions
A Framework for Performance Optimization of Tensor Contraction Expressions

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Pai-Wei Lai, M.S.
Graduate Program in Computer Science and Engineering

The Ohio State University
2014

Dissertation Committee:
P. Sadayappan, Advisor
Gagan Agrawal
Atanas Rountev

© Copyright by Pai-Wei Lai 2014

ABSTRACT

Attaining high performance and productivity in the evaluation of scientific applications is a challenging task for computer scientists, and is often critical to the advancement of many scientific disciplines. In this dissertation, we focus on the development of high-performance, scalable parallel programs for a class of scientific computations in quantum chemistry: tensor contraction expressions.

Tensor contraction expressions are generalized forms of multi-dimensional matrix-matrix operations, which form the fundamental computational constructs in electronic structure modeling. Tensors in these computations exhibit various types of symmetry and sparsity. Contractions on such tensors are highly irregular, and incur significant computation and communication costs if data locality is not considered in the implementation. Prior efforts have focused on implementing tensor contractions using block-sparse representations. Many parallel programs for tensor contractions have been implemented successfully; however, their performance is unsatisfactory on emerging computer systems.

In this work, we investigate several performance bottlenecks of previous approaches, and present corresponding techniques to optimize operations, parallelism, workload balance, and data locality. We exploit the symmetry properties of tensors to minimize the operation count of tensor contraction expressions through algebraic transformation.
Rules are formulated to discover the symmetry properties of intermediate tensors; cost models and algorithms are developed to reduce operation counts. Our approaches yield significant reductions in operation count compared to many state-of-the-art computational chemistry packages, evaluated on real-world tensor contraction expressions from coupled cluster methods.

To achieve high performance and scalability, multiple programming models are often used in a single application. We design a domain-specific framework that uses the partitioned global address space programming model for data management and inter-node communication, and employs the task-parallel execution model for dynamic load balancing. Tensor contraction expressions are decomposed into collections of computational tasks operating on tensor tiles. We eliminate most synchronization steps by executing independent tensor contractions concurrently, and present mechanisms to improve their data locality. Our framework shows improved performance and scalability on tensor contraction expressions from representative coupled cluster methods.

To Li-Lun, for the laughter every day;
To Aaron, for the crying every night;
And to Ola, for the purring once in a while.

ACKNOWLEDGMENTS

I have relied on the help of many people in this effort. I would like to express my greatest appreciation to my advisor, Dr. P. Sadayappan, for his continuous guidance and support over the past five years. I feel very fortunate to have worked with him and to have learned a great deal from him. I am grateful to my dissertation committee, Dr. Gagan Agrawal and Dr. Atanas Rountev, and the graduate faculty representative, Dr. Christopher Miller, for their valuable suggestions to improve my work.

I am sincerely grateful to have had the opportunity to work so closely with many of the best minds in computer science and chemistry. I thank my mentor, Dr.
Sriram Krishnamoorthy, for two memorable summer internships at Pacific Northwest National Laboratory, and Dr. Karol Kowalski, Dr. Edward Valeev, Dr. Marcel Nooijen, and Dr. Dmitry Lyakh, for helping me understand the chemistry aspects of my research. Special thanks to Dr. Albert Hartono and Dr. Huaijian Zhang, for their assistance in the enhancement of OpMin, and to Dr. Wenjing Ma, for providing important insights into the development of DLTC.

I am thankful to all my friends in Columbus for making my life here truly amazing and enjoyable: Qingpeng Niu, Humayun Arafat, Naznin Fauzia, Kevin Stock, Mahesh Ravishankar, Sanket Tavarageri, Martin Kong, Justin Holewinski, Tom Henretty, Samyam Rajbhandari, Akshay Nikam, Venmugil Elango, and many others. I am also particularly grateful for the constant support of Yu-Keng Shih, Chun-Ming Chen, En-Hsiang Tseng, Debbie Lee, Ko-Chih Wang, Kang-Che Lee, Tzu-Hsuan Wei, my Saturday morning pickup game friends, and countless friends from the Taiwanese Student Association.

Finally, I am deeply grateful for the love and support of my family: my inspiring parents, Feng-Wei and Mei-Chen; my lovely wife, Li-Lun; my adorable son, Aaron; my brother, Pai-Ching; and my cat, Ola. Thank you for always backing me up. Words cannot express how much I love you all.

Pai-Wei Lai
Columbus, Ohio
August 25, 2014

VITA

January 9, 1984 . . . . . . Born: Taipei, Taiwan
June 2001 . . . . . . . . . B.S. Computer Science, National Tsing Hua University, Hsinchu, Taiwan
June 2005 . . . . . . . . . M.S. Computer Science, National Tsing Hua University, Hsinchu, Taiwan
2009 – present . . . . . . Graduate Research Associate, The Ohio State University, Columbus, OH, USA
Summer 2011 . . . . . . . Ph.D. Intern, Pacific Northwest National Lab, Richland, WA, USA
Summer 2012 . . . . . . . Ph.D. Intern, Pacific Northwest National Lab, Richland, WA, USA

PUBLICATIONS

Qingpeng Niu, Pai-Wei Lai, S.M. Faisal, Srinivasan Parthasarathy, and P. Sadayappan: "A Fast Implementation of MLR-MCL Algorithm on Multi-core Processors".
To appear in the International Conference on High Performance Computing (HiPC'14), Goa, India, December 17–20, 2014.

Samyam Rajbhandari, Akshay Nikam, Pai-Wei Lai, Kevin Stock, Sriram Krishnamoorthy, and P. Sadayappan: "Communication-Optimal Framework for Contracting Distributed Tensors". To appear in Supercomputing (SC'14), New Orleans, LA, USA, November 16–21, 2014.

Samyam Rajbhandari, Akshay Nikam, Pai-Wei Lai, Kevin Stock, Sriram Krishnamoorthy, and P. Sadayappan: "CAST: Contraction Algorithms for Symmetric Tensors". To appear in the International Conference on Parallel Processing (ICPP'14), Minneapolis, MN, USA, September 9–12, 2014.

Pai-Wei Lai, Humayun Arafat, Venmugil Elango, and P. Sadayappan: "Accelerating Strassen-Winograd's Algorithm on GPUs". In the International Conference on High Performance Computing (HiPC'13), Bengaluru (Bangalore), India, December 18–21, 2013.

Pai-Wei Lai, Kevin Stock, Samyam Rajbhandari, Sriram Krishnamoorthy, and P. Sadayappan: "A Framework for Load Balancing of Tensor Contraction Expressions via Dynamic Task Partitioning". In Supercomputing (SC'13), Denver, CO, USA, November 17–22, 2013.

Pai-Wei Lai, Huaijian Zhang, Samyam Rajbhandari, Edward Valeev, Karol Kowalski, and P. Sadayappan: "Effective Utilization of Tensor Symmetry in Operation Optimization of Tensor Contraction Expressions". In the International Conference on Computational Science (ICCS'12), Omaha, NE, USA, June 4–6, 2012.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

Studies in:
High Performance Computing . . . . . Prof. P. Sadayappan
Software Engineering . . . . . . . . Prof. Atanas Rountev
Artificial Intelligence . . . . . . . Prof. Eric Fosler-Lussier

TABLE OF CONTENTS

Page
Abstract ......................................... ii
Dedication ........................................ iv
Acknowledgments ................................... v
Vita ............................................ vii
List of Figures ..................................... xiv
List of Tables ...................................... xvii
List of Algorithms .................................. xix
List of Listings ..................................... xx

Chapters:

1. Introduction .................................... 1

2. Background .................................... 6
   2.1 Coupled Cluster Theory ......................... 6
   2.2 Tensor Contraction Expressions ................... 9
   2.3 Tensor Contraction Engine ....................... 11
   2.4 Partitioned Global Address Space Programming Models ... 13
   2.5 Task Parallel Programming Models ................. 15
   2.6 Domain Specific Languages ...................... 17

3. Overview ..................................... 19
   3.1 Operation Minimizer (OpMin) .................... 21
       3.1.1 Language parser ......................... 21
       3.1.2 Operation optimizer ...................... 22
       3.1.3 Code generator .......................... 23
   3.2 Dynamic Load-balanced Tensor Contractions (DLTC) .... 23
       3.2.1 Dynamic task partitioning .................. 24
       3.2.2 Dynamic task execution .................... 25

4. Operation Minimization on Symmetric Tensors ............. 27
   4.1 Introduction ................................ 27
   4.2 Symmetry Properties of Tensors ................... 29
       4.2.1 Antisymmetry ........................... 29
       4.2.2 Vertex symmetry ......................... 30
   4.3 Methods ................................... 31
       4.3.1 Derivation rules ......................... 32
       4.3.2 Cost models ............................ 34
       4.3.3 Operation minimization algorithms ............ 36
   4.4 Results and Discussion ......................... 42
       4.4.1 Experimental setup ....................... 42
       4.4.2 Importance of Symmetry Properties ............ 43
       4.4.3 Performance evaluation of OpMin algorithms .......