High-Performance Algorithms and Software for Large-Scale Molecular Simulation
HIGH-PERFORMANCE ALGORITHMS AND SOFTWARE FOR LARGE-SCALE MOLECULAR SIMULATION

A Thesis Presented to The Academic Faculty

by

Xing Liu

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the School of Computational Science and Engineering

Georgia Institute of Technology
May 2015

Copyright © 2015 by Xing Liu

Approved by:

Professor Edmond Chow, Committee Chair
School of Computational Science and Engineering
Georgia Institute of Technology

Professor Edmond Chow, Advisor
School of Computational Science and Engineering
Georgia Institute of Technology

Professor David A. Bader
School of Computational Science and Engineering
Georgia Institute of Technology

Professor Richard Vuduc
School of Computational Science and Engineering
Georgia Institute of Technology

Professor C. David Sherrill
School of Chemistry and Biochemistry
Georgia Institute of Technology

Professor Jeffrey Skolnick
Center for the Study of Systems Biology
Georgia Institute of Technology

Date Approved: 10 December 2014

To my wife, Ying Huang, the woman of my life.

ACKNOWLEDGEMENTS

I would like to first extend my deepest gratitude to my advisor, Dr. Edmond Chow, for his expertise, valuable time, and unwavering support throughout my PhD study. I would also like to sincerely thank Dr. David A. Bader for recruiting me to Georgia Tech and inviting me to join this interesting research area. My appreciation also extends to my committee members, Dr. Richard Vuduc, Dr. C. David Sherrill, and Dr. Jeffrey Skolnick, for their advice and helpful discussions during my research. Similarly, I want to thank all of the faculty and staff in the School of Computational Science and Engineering at Georgia Tech. I am also thankful for the support and guidance of Dr. Mikhail Smelyanskiy and Dr. Pradeep Dubey during my internship at Intel Labs.

To my friends at Georgia Tech, Jee Choi, Kent Czechowski, Marat Dukhan, Oded Green, Adam McLaughlin, Robert McColl, Lluis Miquel Munguia, Aftab Patel, Piyush Sao, Zhichen Xia, and Zhaoming Yin: thank you for helping me in many ways and for many inspiring discussions on research (and other things).

Finally, and most importantly, my recognition goes out to my family, especially to my beloved wife, Ying Huang, for their support, encouragement, and patience during my pursuit of the PhD. I thank them for their faith in me and for allowing me to be as ambitious as I wanted. Without their constant support and encouragement, I would not have completed the PhD program.

CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
SUMMARY

I INTRODUCTION
  1.1 Background
  1.2 Motivation

II SCALABLE DISTRIBUTED PARALLEL ALGORITHMS FOR FOCK MATRIX CONSTRUCTION
  2.1 Background: Hartree-Fock Method and Fock Matrix Construction
    2.1.1 Hartree-Fock Equations
    2.1.2 Hartree-Fock Algorithm
    2.1.3 Electron Repulsion Integrals
    2.1.4 Screening
  2.2 Challenges of Parallelizing Fock Matrix Construction
  2.3 Limitations of Previous Work
  2.4 New Algorithm for Parallel Fock Matrix Construction
    2.4.1 Overview
    2.4.2 Task Description
    2.4.3 Initial Static Partitioning
    2.4.4 Shell Reordering
    2.4.5 Algorithm
    2.4.6 Work-Stealing Scheduler
    2.4.7 Performance Model and Analysis
  2.5 Heterogeneous Fock Matrix Construction
  2.6 Experimental Results
    2.6.1 Experimental Setup
    2.6.2 Performance of Heterogeneous Fock Matrix Construction
    2.6.3 Performance of Distributed Fock Matrix Construction
    2.6.4 Analysis of Parallel Overhead
    2.6.5 Load Balance Results
  2.7 Summary

III HARTREE-FOCK CALCULATIONS ON LARGE-SCALE DISTRIBUTED SYSTEMS
  3.1 Current State-of-the-Art
  3.2 Improving Parallel Scalability of Fock Matrix Construction
  3.3 Optimization of Integral Calculations
  3.4 Computation of the Density Matrix
  3.5 Experimental Results
    3.5.1 Experimental Setup
    3.5.2 Scaling Results
    3.5.3 Comparison to NWChem
    3.5.4 HF Strong Scaling Results
    3.5.5 HF Weak Scaling Results
    3.5.6 Flop Rate
  3.6 Summary

IV "MATRIX-FREE" ALGORITHM FOR HYDRODYNAMIC BROWNIAN SIMULATIONS
  4.1 Background: Conventional Ewald BD Algorithm
    4.1.1 Brownian Dynamics with Hydrodynamic Interactions
    4.1.2 Ewald Summation of the RPY Tensor
    4.1.3 Brownian Displacements
    4.1.4 Ewald BD Algorithm
  4.2 Motivation
  4.3 Related Work
  4.4 Matrix-Free BD Algorithm
    4.4.1 Particle-Mesh Ewald for the RPY Tensor
    4.4.2 Computing Brownian Displacements with PME
    4.4.3 Matrix-Free BD Algorithm
  4.5 Hybrid Implementation of PME
    4.5.1 Reformulating the Reciprocal-Space Calculation
    4.5.2 Optimizing the Reciprocal-Space Calculation
    4.5.3 Computation of Real-Space Terms
    4.5.4 Performance Modelling and Analysis
    4.5.5 Hybrid Implementation on Intel Xeon Phi
  4.6 Experimental Results
    4.6.1 Experimental Setup
    4.6.2 Accuracy of the Matrix-Free BD Algorithm
    4.6.3 Simulation Configurations
    4.6.4 Performance of PME
    4.6.5 Performance of BD Simulations
  4.7 Summary

V IMPROVING THE PERFORMANCE OF STOKESIAN DYNAMICS SIMULATIONS VIA MULTIPLE RIGHT-HAND SIDES
  5.1 Motivation
  5.2 Background: Stokesian Dynamics
    5.2.1 Governing Equations
    5.2.2 Resistance Matrix
    5.2.3 Brownian Forces
    5.2.4 SD Algorithm
  5.3 Exploiting Multiple Right-Hand Sides in Stokesian Dynamics
  5.4 Generalized Sparse Matrix-Vector Products with Multiple Vectors
    5.4.1 Performance Optimizations for SpMV
    5.4.2 Performance Model
    5.4.3 Experimental Setup
    5.4.4 Experimental Results
  5.5 Stokesian Dynamics Results
    5.5.1 Simulation Setup
    5.5.2 Experimental Results
  5.6 Summary

VI EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON X86-BASED MANY-CORE PROCESSORS
  6.1 Related Work
  6.2 Understanding the Performance of SpMV on Intel Xeon Phi
    6.2.1 Test Matrices and Platform
    6.2.2 Overview of CSR Kernel
    6.2.3 Performance Bounds
    6.2.4 Performance Bottlenecks
  6.3 Ellpack Sparse Block Format
    6.3.1 Motivation
    6.3.2 Proposed Matrix Format
    6.3.3 SpMV Kernel with ESB Format
    6.3.4 Selecting c and w
  6.4 Load Balancers for SpMV on Intel Xeon Phi
    6.4.1 Static Partitioning of Cache Misses
    6.4.2 Hybrid Dynamic Scheduler
    6.4.3 Adaptive Load Balancer
  6.5 Experimental Results
    6.5.1 Load Balancing Results
    6.5.2 ESB Results
    6.5.3 Performance Comparison
  6.6 Summary
VII CONCLUSIONS

Appendix A: OVERVIEW OF INTEL XEON PHI
Appendix B: GTFOCK DISTRIBUTED FRAMEWORK
Appendix C: STOKESDT TOOLKIT

REFERENCES
VITA

LIST OF TABLES

1. Common computational problems in molecular simulation.
2. Test molecules.
3. Speedup compared to a single-socket Westmere (WSM) processor.
4. Fock matrix construction time (in seconds) for GTFock and NWChem on four test cases. Although NWChem is faster at smaller core counts, GTFock is faster at larger core counts.
5. Speedup in Fock matrix construction for GTFock and NWChem on four test cases, using the data in the previous table. Speedup for both GTFock and NWChem is computed using the fastest 12-core running time, which is from NWChem. GTFock has better speedup at 3888 cores.
6. Average time, $t_{int}$, for computing each ERI for GTFock (using the ERD library) and NWChem.
7. Average Global Arrays communication volume (MB) per MPI process for GTFock and NWChem.
8. Average number of calls to Global Arrays communication functions per MPI process for GTFock and NWChem.
9. Load balance ratio $l = T_{fock,max}/T_{fock,avg}$ for four test molecules. A value of 1.000 indicates perfect load balance.
10. Test molecules.
11. Timings (seconds) for 1hsg 180 on Tianhe-2. The top half of the table is CPU-only mode; the bottom half is heterogeneous mode.
12. Timing data (seconds) used to approximate weak scaling on Tianhe-2. Top half.