Msc THESIS Customizing Vector Instruction Set Architectures
Total Page:16
File Type:pdf, Size:1020Kb
Computer Engineering 2007 Mekelweg 4, 2628 CD Delft The Netherlands http://ce.et.tudelft.nl/ MSc THESIS Customizing Vector Instruction Set Architectures C¸at¸alinBogdan CIOBANU Abstract Data Level Parallelism(DLP) can be exploited in order to improve the performance of processors for certain workload types. There are two main application ¯elds that rely on DLP, multimedia and scienti¯c computing. Most of the existing multimedia vector ex- tensions use sub-word parallelism and wide data paths for process- ing independent, mainly integer, values in parallel. On the other hand, classic vector supercomputers rely on e±cient processing of large arrays of floating point numbers typically found in scienti¯c applications. In both cases, the selection of an appropriate instruc- tion set architecture(ISA) is crucial in exploiting the potential DLP to gain high performance. The main objective of this thesis is to develop a methodology for synthesizing customized vector ISAs for various application domains targeting high performance program execution. In order to accomplish this objective, a number of applications from the telecommunication and linear algebra do- mains have been studied, and custom vector instructions sets have CE-MS-2007-04 been synthesized. Three algorithms that compute the shortest paths in a directed graph (Dijkstra, Floyd and Bellman-Ford) have been analyzed, along with the widely used Linpack floating point bench- mark. The framework used to customize the ISAs included the use of the Gnu C Compiler versions 4.1.2 and 2.7.2.3 and the SimpleScalar-3.0d tool set extended to simulate customized vector units. The modi¯cations applied to the simulator include the addition of a vector reg- ister ¯le, vector functional units and speci¯c vector instructions. The main results of this thesis can be summarized as follows: overall applications speedups of 24.88X for Dijkstra (after both code optimization and vectorization), 4.99X for Floyd, 9.27X for Bellman-Ford and 4.33X for the C version of Linpack. The above results suggest a consistent improvement in execution times due to the customized vector instruction sets. Faculty of Electrical Engineering, Mathematics and Computer Science Customizing Vector Instruction Set Architectures THESIS submitted in partial ful¯llment of the requirements for the degree of MASTER OF SCIENCE in COMPUTER ENGINEERING by C¸at¸alinBogdan CIOBANU born in Bra»sov, Rom^ania Computer Engineering Department of Electrical Engineering Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology Customizing Vector Instruction Set Architectures by C¸at¸alinBogdan CIOBANU Abstract Data Level Parallelism(DLP) can be exploited in order to improve the performance of processors for certain workload types. There are two main application ¯elds that rely on DLP, multimedia and scienti¯c computing. Most of the existing multimedia vector extensions use sub-word parallelism and wide data paths for processing independent, mainly integer, values in parallel. On the other hand, classic vector supercomputers rely on e±cient processing of large arrays of floating point numbers typically found in scienti¯c applications. In both cases, the selection of an appropriate instruction set architecture(ISA) is crucial in exploiting the potential DLP to gain high performance. The main objective of this thesis is to develop a methodology for synthesizing customized vector ISAs for various application domains targeting high performance program execution. In order to accomplish this objective, a number of applications from the telecommunication and linear algebra domains have been studied, and custom vec- tor instructions sets have been synthesized. Three algorithms that compute the shortest paths in a directed graph (Dijkstra, Floyd and Bellman-Ford) have been analyzed, along with the widely used Linpack floating point benchmark. The framework used to cus- tomize the ISAs included the use of the Gnu C Compiler versions 4.1.2 and 2.7.2.3 and the SimpleScalar-3.0d tool set extended to simulate customized vector units. The mod- i¯cations applied to the simulator include the addition of a vector register ¯le, vector functional units and speci¯c vector instructions. The main results of this thesis can be summarized as follows: overall applications speedups of 24.88X for Dijkstra (after both code optimization and vectorization), 4.99X for Floyd, 9.27X for Bellman-Ford and 4.33X for the C version of Linpack. The above results suggest a consistent improvement in execution times due to the customized vector instruction sets. i Laboratory : Computer Engineering Codenumber : CE-MS-2007-04 Committee Members : Advisor: Georgi Gaydadjiev, CE, TU Delft Advisor: Georgi Kuzmanov, CE, TU Delft Chairperson: Stamatis Vassiliadis, CE, TU Delft Member: Stephan Wong, CE, TU Delft Member: Ren¶evan Leuken, CAS, TU Delft ii I dedicate this thesis to my brother, Cristian iii iv Contents List of Figures vii List of Tables ix Acknowledgements xi 1 Introduction 1 1.1 The need for Vector ISA extensions ...................... 1 1.2 Thesis objectives ................................ 3 1.3 Results Summary ................................ 4 1.4 Thesis organization ............................... 4 2 Multimedia extensions examples 5 2.1 The Intel extensions .............................. 5 2.1.1 MMXTM ................................ 5 2.1.2 Internet Streaming SIMD Extensions - SSE ............. 7 2.2 MIPS V and MIPS digital media extensions (MDMX) ........... 11 2.3 Sun's Visual Instruction Set (VIS) ...................... 13 2.4 AltiVecTM .................................... 16 2.5 Matrix Oriented Multimedia (MOM) ..................... 17 2.6 Complex Streamed Instructions (CSI) .................... 30 2.7 Chapter summary and conclusions ...................... 42 3 ISA customization framework 45 3.1 Steps for customizing the vector ISA ..................... 45 3.2 Considered architecture ............................ 46 3.3 Simulation enviroment ............................. 46 3.3.1 Inline assembly and GCC ....................... 46 3.3.2 The benchmarking process ...................... 50 3.4 Simulator extensions .............................. 51 3.4.1 The vector register ¯le ......................... 51 3.4.2 The vector functional units ...................... 51 3.4.3 The vector instructions ........................ 52 3.5 Default parameters used for performance simulations ............ 54 3.6 Chapter Summary ............................... 54 4 Applications description 55 4.1 Dijkstra ..................................... 55 4.1.1 Pro¯le information ........................... 56 4.1.2 Dijkstra kernel ............................. 56 v 4.2 Floyd ...................................... 57 4.2.1 Pro¯ling information .......................... 57 4.2.2 Floyd kernel ............................... 57 4.3 Bellman-Ford .................................. 58 4.3.1 Pro¯ling information .......................... 58 4.3.2 Bellman-Ford kernel .......................... 58 4.4 Linpack ..................................... 60 4.4.1 Pro¯ling information .......................... 60 4.4.2 Linpack kernel ............................. 61 4.5 Chapter summary ............................... 61 5 Experimental results 63 5.1 The customized vector instruction sets .................... 63 5.1.1 General purpose vector instructions ................. 63 5.1.2 Application speci¯c instructions ................... 68 5.2 Simulation results ............................... 75 5.2.1 Dijkstra ................................. 75 5.2.2 Floyd .................................. 78 5.2.3 Bellman-Ford .............................. 82 5.2.4 Linpack ................................. 86 5.3 Summary of the results ............................ 91 6 Conclusions 99 Bibliography 105 A SimpleScalar 107 A.1 A short introduction to SimpleScalar ..................... 107 A.2 Default SimpleScalar con¯guration ...................... 108 vi List of Figures 2.1 CSI data path .................................. 33 2.2 CSI Memory Interface Unit .......................... 34 2.3 CSI:Gate implementation for the ¯rst step of SAD ............. 41 3.1 General Vector architecture .......................... 47 3.2 Vector Register File architecture ....................... 48 5.1 Dijkstra execution time when varying the memory latency ......... 78 5.2 Dijkstra speedup when varying the memory latency 1 ........... 79 5.3 Dijkstra speedup when varying the memory latency 2 ........... 80 5.4 Dijkstra execution time when varying the section size ........... 81 5.5 Dijkstra speedup when varying the section size 1 .............. 82 5.6 Dijkstra speedup when varying the section size 2 .............. 83 5.7 Floyd execution time when varying the memory latency .......... 86 5.8 Floyd speedup when varying the memory latency .............. 87 5.9 Floyd execution time when varying the section size ............. 88 5.10 Floyd speedup when varying the section size ................. 89 5.11 Bellman-Ford execution time when varying the memory latency ...... 90 5.12 Bellman-Ford speedup when varying the memory latency ......... 91 5.13 Bellman-Ford execution time when varying the section size ........ 92 5.14 Bellman-Ford speedup when varying the section size ............ 93 5.15 Linpack execution time when varying the memory latency ......... 94 5.16 Linpack speedup when varying the memory latency ............. 95 5.17 Linpack execution time when varying the section size ............ 96 5.18 Linpack speedup