A Transparent and Energy Aware Reconfigurable Multiprocessor Platform for Efficient ILP and TLP Exploitation

UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL
INSTITUTO DE INFORMÁTICA
PROGRAMA DE PÓS-GRADUAÇÃO EM COMPUTAÇÃO

MATEUS BECK RUTZIG

A Transparent and Energy Aware Reconfigurable Multiprocessor Platform for Efficient ILP and TLP Exploitation

Thesis presented in partial fulfillment of the requirements for the degree of Doctor of Computer Science

Prof. Dr. Luigi Carro
Advisor

Porto Alegre, January 2012

CIP – CATALOGING-IN-PUBLICATION

Beck Rutzig, Mateus
A Transparent and Energy Aware Reconfigurable Multiprocessor Platform for Efficient ILP and TLP Exploitation / Mateus Beck Rutzig – Porto Alegre: Programa de Pós-Graduação em Computação, 2012.
119 p.: il.
Thesis (doctorate) – Universidade Federal do Rio Grande do Sul. Programa de Pós-Graduação em Computação. Porto Alegre, BR – RS, 2012. Advisor: Luigi Carro.
1. Multiprocessor Systems 2. Reconfigurable Architectures 3. Embedded Systems I. Carro, Luigi II. Title

UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL
Rector: Prof. Carlos Alexandre Netto
Vice-Rector: Prof. Rui Vicente Oppermann
Dean of Graduate Studies: Prof. Aldo Bolten Lucion
Director of the Instituto de Informática: Prof. Luís da Cunha Lamb
Coordinator of the PPGC: Prof. Alvaro Freitas Moreira
Chief Librarian of the Instituto de Informática: Beatriz Regina Bastos Haro

TABLE OF CONTENTS

1 INTRODUCTION
  1.1 Contributions
2 RELATED WORK
  2.1 Single-Threaded Reconfigurable Systems
  2.2 Multiprocessing Systems
  2.3 Multi-Threaded Reconfigurable Systems
  2.4 The Proposed Approach
3 ANALYTICAL MODEL
  3.1 Performance Comparison
    3.1.1 Low-End Single Processor
    3.1.2 High-End Single Processor
    3.1.3 High-End Single Processor versus Homogeneous Multiprocessor Chip
    3.1.4 Applying the Performance Modeling in Real Processors
    3.1.5 Communication Modeling in Multiprocessing Systems
    3.1.6 Applying the Performance Modeling in Real Processors considering the Communication Overhead
  3.2 Energy Comparison
    3.2.1 Applying the Energy Modeling in Real Processors
    3.2.2 Communication Modeling in Energy of Multiprocessing Systems
    3.2.3 Applying the Energy Modeling in Real Processors considering the Communication Overhead for Multiprocessing Systems
  3.3 Example of an Application Parallelization Process in a Multiprocessing System
4 CREAMS
  4.1 Dynamic Adaptive Processor (DAP)
    4.1.1 Processor Pipeline (Block 2)
    4.1.2 Reconfigurable Data Path Structure (Block 1)
    4.1.3 Dynamic Detection Hardware (Block 4)
    4.1.4 Storage Components (Block 3)
5 RESULTS
  5.1 Methodology
    5.1.1 Benchmarks
    5.1.2 Simulation Environment
    5.1.3 VHDL descriptions
    5.1.4 How does the thread synchronization work?
    5.1.5 Organization of this Chapter
  5.2 The Potential of CReAMS
    5.2.1 Considering the Same Chip Area
    5.2.2 Considering the Power Budget
    5.2.3 Energy-Delay Product
  5.3 The impact of Inter-thread Communication
    5.3.1 Considering the Same Chip Area
    5.3.2 Considering the Power Budget
    5.3.3 Energy-Delay Product
  5.4 Heterogeneous Organization CReAMS
    5.4.1 Methodology
  5.5 CReAMS versus Out-Of-Order Superscalar SparcV8
6 CONCLUSIONS AND FUTURE WORKS
  6.1 Future Works
    6.1.1 Scheduling Algorithm
    6.1.2 Studies over TLP and ILP considering the Operating System
    6.1.3 Behavior of CReAMS on a Multitask Environment
    6.1.4 Automatic CReAMS generation
    6.1.5 Area reductions by applying the Data Path Virtualization Strategy
    6.1.6 Boosting TLP performance with Heterogeneous Multithread CReAMS
7 PUBLICATIONS
  7.1 Book Chapters
  7.2 Journals
  7.3 Conferences
APPENDIX A
  Introduction
  Objectives
  CReAMS
  DAP
  Methodology
  Results
  Conclusions

LIST OF ABBREVIATIONS AND ACRONYMS

ALU   Arithmetic and Logic Unit
ARM   Advanced RISC Machine
ASIC  Application Specific Integrated Circuits
BT    Binary Translation
CAD   Computer Aided Design
CCA   Configurable Compute Array
DIM   Dynamic Instruction Merging
DSP   Digital Signal Processor
FIFO  First In, First Out
FPGA  Field Programmable Gate Array
ILP   Instruction Level Parallelism
TLP   Thread Level Parallelism
IPC   Instructions per Cycle
RAW   Read After Write
RFU   Reconfigurable Functional Unit
RPU   Reconfigurable Processor Unit
SIMD  Single Instruction – Multiple Data

LIST OF FIGURES

Figure 1. Different Architectures and Organizations
Figure 2. Speedup of homogeneous multiprocessing systems on embedded applications
Figure 3. Coupling setups (HAUCK and COMPTON, 2002)
