Da Li July 2016
Total Page:16
File Type:pdf, Size:1020Kb
FACILITATING EMERGING APPLICATIONS ON MANY-CORE PROCESSORS ________________________________________________ A Dissertation Presented to the Faculty of the Graduate School at the University of Missouri-Columbia ___________________________________________________ In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy ______________________________________________ by DA LI Dr. Michela Becchi, Supervisor JULY 2016 The undersigned, appointed by the dean of the Graduate School, have examined the dissertation entitled FACILITATING EMERGING APPLICATIONS ON MANY-CORE PROCESSORS presented by Da Li, a candidate for the degree of doctor of philosophy and hereby certify that, in their opinion, it is worthy of acceptance. Dr. Michela Becchi Dr. Tony Han Dr. Zhihai He Dr. William Harrison Dr. Jason Xu ACKNOWLEDGEMENTS First and foremost, I would like to deeply thank my adviser, Dr. Michela Becchi, for her valuable guidance and advice and for her vast reserve of patience and knowledge. I am fortunate and grateful to have worked with her during my studies. I also appreciate her generosity and flexibility in offering me opportunities to interact with other scientists and explore various projects through internships from industries. I would like also to express my sincere gratitude to Dr. Tony Han, Dr. Zhihai He, Dr. William Harrison and Dr. Jason Xu for their teaching, mentoring and assistance. I want to thank Dr. Ziliang Zong and Xinbo from Texas State University for their expertise, feedback and overall support. I would like to thank all the current and previous members of the Networking and Parallel System laboratory, especially Kittisak, Huan, Ruidong and Henry who helped me with my research. Finally, I would like to acknowledge my family and friends for their encouragement and devotion. I am very grateful to have a supportive family who puts my education as the first priority. I would express my deepest appreciation and thanks to my wife, Xin Tong, who is better at science than me, for her love and support. This work has been supported by National Science Foundation awards CNS-1216756, CCF-1452454, CNS-1305359, and by gifts and equipment donations from NEC Labs America and Nvidia Corporation. ii Table of Contents ACKNOWLEDGEMENTS .......................................................................................... ii LIST OF FIGURES .................................................................................................... vii LIST OF TABLES ....................................................................................................... xi ABSTRACT ................................................................................................................ xii Chapter 1 Introduction .................................................................................................. 1 1.1 Introduction ......................................................................................................... 1 1.2 Contributions ...................................................................................................... 4 1.3 Dissertation Organization ................................................................................... 5 Chapter 2 Background .................................................................................................. 6 2.1 GPU Architecture and Programming Model ...................................................... 6 2.2 Needleman-Wunsch Algorithm .......................................................................... 9 2.3 Irregular applications ........................................................................................ 11 2.4 Convolutional neural networks ......................................................................... 12 Chapter 3 Bioinformatics Applications on CPU-GPU Clusters ................................. 15 3.1 Motivation ......................................................................................................... 15 3.2 Related Work .................................................................................................... 16 3.2.1 Early Works on Sequence Alignments ...................................................... 16 3.2.2 Acceleration of Bioinformatics Application on GPUs .............................. 17 iii 3.2.3 Sequence Alignments on GPUs ................................................................. 18 3.3.4 Rodinia-NW: Needleman-Wunsch on GPU .............................................. 19 3.3 Design of GPU-workers .................................................................................... 21 3.3.1 TiledDScan-mNW: Multiple alignments with tiling ................................. 21 3.3.2 DScan-mNW: Single-kernel diagonal scan ............................................... 22 3.3.3 RScan-mNW: Row scan via single CUDA core ........................................ 24 3.4 Experimental Evaluation ................................................................................... 28 3.4.1 Experimental setup ..................................................................................... 28 3.4.2 Performance on single GPU ....................................................................... 28 3.5 Other Applications with Similar Computation Pattern ..................................... 35 Chapter 4 Irregular Applications on Many-Core Processors ...................................... 40 4.1 Related Work .................................................................................................... 41 4.2 Unified Programming Interface ........................................................................ 44 4.2.1 General Design & Methodology ................................................................ 46 4.2.2 Programming API ...................................................................................... 47 4.2.3 Case Study ................................................................................................. 50 4.2.4 Compiler Design ........................................................................................ 54 4.2.5 Runtime Library Design ............................................................................ 58 4.2.6 Performance Evaluation ............................................................................. 61 4.3 Adaptive Computing ......................................................................................... 68 iv 4.3.1 Motivation .................................................................................................. 69 4.3.2 Exploration Space ...................................................................................... 74 4.3.3 Implementation .......................................................................................... 81 4.3.4 Adaptive Runtime ...................................................................................... 86 4.3.5 Experimental Evaluation ............................................................................ 91 4.4 Parallelization Templates .................................................................................. 99 4.4.1 Motivation ................................................................................................ 100 4.4.2 Irregular Nested Loop .............................................................................. 101 4.4.3 Experimental Evaluation .......................................................................... 105 4.5 Workload Consolidation ................................................................................. 113 4.5.1 Dynamic Parallelism ................................................................................ 114 4.5.2 Application Characterization ................................................................... 115 4.5.3 Motivation ................................................................................................ 116 4.5.4 Methodology ............................................................................................ 121 4.6.4 Experimental Evaluation .......................................................................... 132 Chapter 5 Deep Neural Network on CPU-GPU ....................................................... 141 5.1 Related Work .................................................................................................. 141 5.2 Energy Efficiency ........................................................................................... 142 5.2.1 Motivation ................................................................................................ 142 5.2.2 Methodology ............................................................................................ 144 v 5.2.3 Overall Results on CPU & GPU .............................................................. 146 5.2.4 Effect of NN and Batch Size Configuration ............................................ 151 5.2.5 Effect of Hardware Settings ..................................................................... 153 5.3 CNN on CPU-GPU heterogeneous architecture ............................................. 158 5.3.1 Motivation ................................................................................................ 158 5.3.2 HetNet ...................................................................................................... 159 5.3.3 HybNet ..................................................................................................... 161 5.3.4 Experimental Evaluation .......................................................................... 162