AUTOMATIC DISTRIBUTED PROGRAMMING USING SEQUENCEL

by

Bryant K. Nelson, B.S.

A Dissertation In COMPUTER SCIENCE

Submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Submitted to:

Dr. Nelson Rushton
Chair of Committee

Dr. Richard Watson
Dr. Bradley Nemanich
Dr. Yong Chen

Dr. Mark Sheridan
Dean of the Graduate School

August, 2016

© Bryant K. Nelson, 2016

Texas Tech University, Bryant Nelson, August 2016

For Dr. Dan Cooke, in memoriam.

ACKNOWLEDGMENTS

There are many people who made getting to where I am today possible. While I can never truly express the extent of my gratitude for these people, here I would like to take some time to acknowledge them and their contributions.

First, I would like to acknowledge Dr. Nelson Rushton, the chair of my committee, my academic advisor, and my friend. Four years ago Dr. Rushton took me on as a PhD student, fresh out of my bachelor's programs. Over these past four years he has been there to guide me and offer advice as both a mentor and a friend, from collaborating on papers and guiding me down research paths to helping me get my grouping tighter and draw faster. Without Dr. Rushton's guidance and insight, I doubt that I would have made it this far in my academic career.

Second, I would like to acknowledge my friend and colleague, Josh Archer. Josh and I actually started our undergraduate degrees at the same time and went through the same program, dual majoring in Computer Science and Mathematics. However, it was not until graduate school that Josh and I became friends, and I am thankful we did. We have collaborated on numerous projects, papers, and endeavors, both academic and entertaining. Josh is often able to provide an alternate opinion or view on a subject, which has influenced the way I view the world.

Third, I would like to acknowledge Dr.
Bradley Nemanich, my colleague, mentor, and friend. When I first heard of Brad I knew of him only as the person that Dr. Cooke would communicate with to get me a copy of SequenceL. Over the years Brad has become somewhat of an intellectual role model, providing me with invaluable advice and insight. He was always available to listen and offer suggestions whenever I would encounter one of the numerous roadblocks along the way. It is Brad's research that laid the groundwork for the work I have done; without his shoulders to stand on I would never have made it this far.

I would like to thank my friend, JD Davenport, for providing me access to the HPC cluster used to test the work done in this dissertation. This made the testing process much simpler and less painful.

My family has provided encouragement and support throughout the years. They keep me grounded and are always there when I need them. I would like to specifically thank my brother, Tyler Nelson, who was there for me when I was going through a particularly difficult time.

Last, and far from least, I would like to acknowledge my girlfriend, Amber Crandall. She stood by me through the toughest parts of my research. Through the late nights and long days, she was there to keep me focused and was patient with me when I was distant. For that I am thankful.

Bryant K. Nelson
Texas Tech University
August 11, 2016

CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
NOMENCLATURE
I INTRODUCTION
  1.1 Motivation
  1.2 Problem Statement
  1.3 Dissertation Overview
II BACKGROUND
  2.1 SequenceL
    2.1.1 Normalize Transpose (NT)
    2.1.2 Indexed Functions
    2.1.3 Consume Simplify Produce (CSP)
    2.1.4 Entry Points and Intended Uses
    2.1.5 SequenceL Compiler Overview
    2.1.6 SequenceL Runtime Library
    2.1.7 SequenceL C++ Driver Library
  2.2 Parallel Computing Architectures
    2.2.1 Shared Memory
      2.2.1.1 Pthreads
      2.2.1.2 OpenMP
      2.2.1.3 TBB
    2.2.2 Distributed Memory
      2.2.2.1 MPI
III RELATED WORK
  3.1 Parallel Programming Languages
    3.1.1 NESL
    3.1.2 Sisal
  3.2 Manual Heterogeneous Computing
  3.3 Automatic Distributed Computing
IV EXPERIMENTAL DESIGN
  4.1 Environment
  4.2 Metrics
  4.3 Test Programs
V PHASE 1
  5.1 Introduction
  5.2 Framework Design
    5.2.1 Target Programs
    5.2.2 C++ Driver Library Additions
      5.2.2.1 Distributed Sequence Class
      5.2.2.2 Utility Functions
    5.2.3 Distributed Execution
  5.3 Experimental Design
    5.3.1 Test Programs
      5.3.1.1 First Class of Test Problems
      5.3.1.2 Second Class of Test Problems
    5.3.2 Experimental Results
      5.3.2.1 Matrix Multiply
      5.3.2.2 Monte Carlo Mandelbrot Area
      5.3.2.3 Pi Approximation
  5.4 Conclusion and Next Steps
    5.4.1 Next Steps
      5.4.1.1 Improvements in the Runtime
      5.4.1.2 Additional Distribution Targets
VI PHASE 2
  6.1 Introduction
  6.2 Compiler Modifications
    6.2.1 Program Targets
      6.2.1.1 Parallelization Targets
      6.2.1.2 Excluded Programs
    6.2.2 Runtime Additions
    6.2.3 Generated Code Additions
  6.3 Experimental Design
    6.3.1 Test Programs
    6.3.2 Experimental Results
      6.3.2.1 Monte Carlo Mandelbrot Area
      6.3.2.2 Matrix Multiply
      6.3.2.3 LU Factorization
      6.3.2.4 PI Approximation
VII CONCLUSIONS & FUTURE WORK
  7.1 Conclusions
    7.1.1 Contributions
  7.2 Future Work
    7.2.1 Optimizations
    7.2.2 Improvements
INDEX
BIBLIOGRAPHY
APPENDIX
  A. SequenceL Grammar
  B. Experimental Data
  C. Test Programs

ABSTRACT

Hybrid parallel programming, consisting of a distributed memory model for internode communication in combination with a shared-memory model to manage intranode parallelism, is now a common method of achieving scalable parallel performance. Such a model burdens developers with the complexity of managing two parallel programming systems in the same program. I have hypothesized that it is possible to specify heuristics which, on average, allow scalable across-node (distributed memory) and across-core (shared memory) hybrid parallel C++ to be generated from a program written in a high-level functional language. Scalable here means a distributed core-speedup that is no more than an order of magnitude worse than shared memory core-speedup. This dissertation reports the results of testing this hypothesis by extending the SequenceL compiler to automatically generate C++ which uses a combination of MPI and Intel's TBB to achieve scalable distributed and shared memory parallelizations.

LIST OF FIGURES

2.1 Normalize-Transpose Illustration [Nelson & Rushton, 2013]
2.2 Consume Simplify Produce Illustration [Nelson & Rushton, 2013]
2.3 Shared Memory System Illustration
2.4 Distributed Memory System Illustration
5.1 Illustration of the Distributed Sequence Distribution Structure
5.2 Phase 1: MM on HPC Cluster Performance Graphs
5.3 Phase 1: MAN on HPC Cluster Performance Graphs
5.4 Phase 1: PI on HPC Cluster Performance Graphs
6.1 Generic Call Graph
6.2 quicksort Operation Tree
6.3 Extended Parallel For-Loop Illustration
6.4 Illustration of the Phase 2 Distribution Scheme
6.5 Phase 2: MAN on HPC Cluster Performance Graphs
6.6 Phase 2: MM on HPC Cluster Performance Graphs
6.7 Phase 2: LU on HPC Cluster Performance Graphs
6.8 Phase 2: PI on HPC Cluster Performance Graphs
B.1 Phase 1: 2DFFT on Virtual Network Performance Graphs
B.2 Phase 1: BHUT on Virtual Network Performance Graphs
B.3 Phase 1: GOL on Virtual Network Performance Graphs
B.4 Phase 1: LU on Virtual Network Performance Graphs
B.5 Phase 1: 2DFFT on Networked PCs Performance Graphs
B.6 Phase 1: BHUT on Networked PCs Performance Graphs
B.7 Phase 1: GOL on Networked PCs Performance Graphs
B.8 Phase 1: LU on Networked PCs Performance Graphs
B.9 Phase 1: 2DFFT on HPC Cluster Performance Graphs
B.10 Phase 1: BHUT on HPC Cluster Performance Graphs
B.11 Phase 1: GOL on HPC Cluster Performance Graphs
B.12 Phase 1: LU on HPC Cluster Performance Graphs
B.13 Phase 2: 2DFFT on Networked PCs Performance Graphs
B.14 Phase 2: BHUT on Networked PCs Performance Graphs
B.15 Phase 2: GOL on Networked PCs Performance Graphs