UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL INSTITUTO DE INFORMÁTICA PROGRAMA DE PÓS-GRADUAÇÃO EM COMPUTAÇÃO

MATEUS BECK RUTZIG

A Transparent and Energy Aware Reconfigurable Multiprocessor Platform for Efficient ILP and TLP Exploitation

Thesis presented in partial fulfillment of the requirements for the degree of Doctor of Computer Science

Prof. Dr. Luigi Carro Advisor

Porto Alegre
January 2012

CIP – CATALOGAÇÃO NA PUBLICAÇÃO

Beck Rutzig, Mateus

A Transparent and Energy Aware Reconfigurable Multiprocessor Platform for Efficient ILP and TLP Exploitation/Mateus Beck Rutzig – Porto Alegre: Programa de Pós-Graduação em Computação, 2012.

119 p.:il.

Tese (doutorado) – Universidade Federal do Rio Grande do Sul. Programa de Pós-Graduação em Computação. Porto Alegre, BR – RS, 2012. Orientador: Luigi Carro.

1.Sistemas Multiprocessados 2.Arquiteturas Reconfiguráveis 3.Sistemas Embarcados I. Carro, Luigi II. Título

UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL Reitor: Prof. Carlos Alexandre Netto Vice-Reitor: Prof. Rui Vicente Oppermann Pró-Reitora de Pós-Graduação: Prof. Aldo Bolten Lucion Diretor do Instituto de Informática: Prof. Luís da Cunha Lamb Coordenador do PPGC: Prof. Alvaro Freitas Moreira Bibliotecária-Chefe do Instituto de Informática: Beatriz Regina Bastos Haro


TABLE OF CONTENTS

1 INTRODUCTION ...... 13
1.1 Contributions ...... 19
2 RELATED WORK ...... 21
2.1 Single-Threaded Reconfigurable Systems ...... 21
2.2 Multiprocessing Systems ...... 25
2.3 Multi-Threaded Reconfigurable Systems ...... 30
2.4 The Proposed Approach ...... 41
3 ANALYTICAL MODEL ...... 43
3.1 Performance Comparison ...... 44
3.1.1 Low-End Single Processor ...... 44
3.1.2 High-End Single Processor ...... 44
3.1.3 High-End Single Processor versus Homogeneous Multiprocessor Chip ...... 45
3.1.4 Applying the Performance Modeling in Real Processors ...... 47
3.1.5 Communication Modeling in Multiprocessing Systems ...... 48
3.1.6 Applying the Performance Modeling in Real Processors considering the Communication Overhead ...... 50
3.2 Energy Comparison ...... 53
3.2.1 Applying the Energy Modeling in Real Processors ...... 53
3.2.2 Communication Modeling in Energy of Multiprocessing Systems ...... 54
3.2.3 Applying the Energy Modeling in Real Processors considering the Communication Overhead for Multiprocessing Systems ...... 55
3.3 Example of an Application Parallelization Process in a Multiprocessing System ...... 57
4 CREAMS ...... 61
4.1 Dynamic Adaptive Processor (DAP) ...... 61
4.1.1 Processor Pipeline (Block 2) ...... 61
4.1.2 Reconfigurable Data Path Structure (Block 1) ...... 61
4.1.3 Dynamic Detection Hardware (Block 4) ...... 63
4.1.4 Storage Components (Block 3) ...... 67
5 RESULTS ...... 69
5.1 Methodology ...... 69
5.1.1 Benchmarks ...... 69
5.1.2 Simulation Environment ...... 70
5.1.3 VHDL Descriptions ...... 71


5.1.4 How does the thread synchronization work? ...... 72
5.1.5 Organization of this Chapter ...... 73
5.2 The Potential of CReAMS ...... 74
5.2.1 Considering the Same Chip Area ...... 75
5.2.2 Considering the Power Budget ...... 78
5.2.3 Energy-Delay Product ...... 79
5.3 The Impact of Inter-thread Communication ...... 80
5.3.1 Considering the Same Chip Area ...... 81
5.3.2 Considering the Power Budget ...... 86
5.3.3 Energy-Delay Product ...... 87
5.4 Heterogeneous Organization CReAMS ...... 89
5.4.1 Methodology ...... 89
5.5 CReAMS versus Out-Of-Order Superscalar SparcV8 ...... 97
6 CONCLUSIONS AND FUTURE WORKS ...... 101
6.1 Future Works ...... 101
6.1.1 Scheduling Algorithm ...... 101
6.1.2 Studies over TLP and ILP considering the Operating System ...... 102
6.1.3 Behavior of CReAMS in a Multitask Environment ...... 102
6.1.4 Automatic CReAMS Generation ...... 102
6.1.5 Area Reductions by Applying the Data Path Virtualization Strategy ...... 102
6.1.6 Boosting TLP Performance with Heterogeneous Multithread CReAMS ...... 103
7 PUBLICATIONS ...... 105
7.1 Book Chapters ...... 105
7.2 Journals ...... 105
7.3 Conferences ...... 105
APPENDIX A ...... 111
Introdução ...... 111
Objetivos ...... 112
CReAMS ...... 113
DAP ...... 113
Metodologia ...... 117
Resultados ...... 118
Conclusões ...... 119


LIST OF ABBREVIATIONS AND ACRONYMS

ALU    Arithmetic and Logic Unit
ARM    Advanced RISC Machine
ASIC   Application Specific Integrated Circuit
BT     Binary Translation
CAD    Computer Aided Design
CCA    Configurable Compute Array
DIM    Dynamic Instruction Merging
DSP    Digital Signal Processor
FIFO   First In, First Out
FPGA   Field Programmable Gate Array
ILP    Instruction Level Parallelism
TLP    Thread Level Parallelism
IPC    Instructions per Cycle
RAW    Read After Write
RFU    Reconfigurable Functional Unit
RPU    Reconfigurable Processor Unit
SIMD   Single Instruction – Multiple Data


LIST OF FIGURES

Figure 1. Different Architectures and Organizations ...... 16
Figure 2. Speedup of homogeneous multiprocessing systems on embedded applications ...... 17
Figure 3. Coupling setups (HAUCK e COMPTON, 2002) ...... 22
Figure 4. Virtualization process of Piperench (GOLDSTEIN, SCHMIT, et al., 2000) ...... 23
Figure 5. How the DIM system works (BECK, RUTZIG, et al., 2008) ...... 24
Figure 6. KAHRISMA architecture overview (KOENIG, BAUER, et al., 2010) ...... 32
Figure 7. Overview of Thread Warping execution process (STITT e VAHID, 2007) ...... 33
Figure 8. Blocks of the Reconfigurable Architecture (YAN, WU, et al., 2010) ...... 34
Figure 9. Block Diagram of Annabelle SoC (SMIT, 2008) ...... 35
Figure 10. The Montium Core Architecture ...... 36
Figure 11. Fabric utilization considering many architecture organizations (WATKINS, CIANCHETTI e ALBONESI, 2008) ...... 37
Figure 12. (a) SPL cell architecture (b) Interconnection strategy (WATKINS, CIANCHETTI e ALBONESI, 2008) ...... 38
Figure 13. (a) Spatial sharing (b) Temporal sharing (WATKINS, CIANCHETTI e ALBONESI, 2008) ...... 38
Figure 14. Thread Intercommunication steps (ALBONESI e WATKINS, 2010) ...... 39
Figure 15. Modeling of the (a) Multiprocessor System and the (b) High-End Single-Processor ...... 44
Figure 16. Multiprocessor system and Superscalar performance regarding a power budget using different ILP and TLP; α = δ is assumed ...... 48
Figure 17. Execution time of different designs considering 0.16 ...... 51
Figure 18. Execution time of different designs considering 0.33 ...... 51
Figure 19. Execution time of different designs considering 0.66 ...... 52
Figure 20. Execution time of different designs considering 0.99 ...... 52
Figure 21. Multiprocessing Systems and High-end single processor energy consumption; α = δ is assumed ...... 54
Figure 22. Energy consumption of different designs considering 0.16 ...... 55
Figure 23. Energy consumption of different designs considering 0.33 ...... 56
Figure 24. Energy consumption of different designs considering 0.66 ...... 56
Figure 25. Energy consumption of different designs considering 0.99 ...... 56
Figure 26. Speedup provided in 18-tap FIR filter execution for Superscalar, MPSoC and a mix of both approaches ...... 58
Figure 27. C-like FIR Filter ...... 59
Figure 28. (a) CReAMS architecture (b) DAP blocks ...... 62
Figure 29. Interconnection mechanism ...... 63
Figure 30. Example of an allocation of a code region inside of the data path ...... 65
Figure 31. DAP acceleration process ...... 66
Figure 32. Activity Diagram of DIM process ...... 67
Figure 33. (a) Simulation Flow (b) How the synchronization process is done ...... 72


Figure 34. How the simulation handles synchronization from the software point of view ...... 73
Figure 35. Example of Same Area #1 (left) and Same Area #2 (right) comparison schemes ...... 90
Figure 36. Relative Performance of HeteroLarge over HomoSmall CReAMS considering the Same Area #1 scheme ...... 91
Figure 37. Relative Energy Consumption of HeteroLarge over HomoSmall CReAMS considering the Same Area #1 scheme ...... 93
Figure 38. Relative Performance of HeteroLarge over HomoSmall CReAMS considering the Same Area #2 scheme ...... 94
Figure 39. Relative Energy Consumption of HeteroLarge over HomoSmall CReAMS considering the Same Area #2 scheme ...... 95
Figure 40. Relative Energy-Delay Product of HeteroLarge over HomoSmall CReAMS considering the Same Area #1 and #2 schemes ...... 95
Figure 41. Relative Speedup of Dynamic over the Static Thread Scheduling considering the Same Area #1 scheme ...... 96
Figure 42. Relative Speedup of Dynamic over the Static Thread Scheduling considering the Same Area #2 scheme ...... 97


LIST OF TABLES

Table 1. Summarized Commercial Multiprocessing Systems ...... 30
Table 2. Load balancing and mean basic block size of the selected applications ...... 70
Table 3. The configuration of both basic processors ...... 74
Table 4. Area (um2) of DAP and SparcV8 components ...... 74
Table 5. Area, in um2, of: (a) Same Area Chip scheme #1 (b) Same Area Chip scheme #2 ...... 75
Table 6. Speedup provided by MPSparcV8 and CReAMS over a standalone single SparcV8 processor ...... 76
Table 7. Execution time of MPSparcV8 and CReAMS ...... 77
Table 8. Average Power consumption of MPSparcV8 and CReAMS ...... 78
Table 9. Energy consumption of MPSparcV8 and CReAMS ...... 79
Table 10. Energy-Delay product of MPSparcV8 and CReAMS ...... 80
Table 11. Average number of hops for different multiprocessing systems ...... 81
Table 12. (a) Area of DAP and SparcV8 components (b) Same Area #1 scheme (c) Same Area #2 scheme ...... 82
Table 13. Execution time (in ms) considering the Same Area #1 scheme for CReAMS and MPSparcV8 ...... 82
Table 14. Execution time (in ms) considering the Same Area #2 scheme for CReAMS and MPSparcV8 ...... 83
Table 15. Energy (in mJoules) considering the Same Area #1 for CReAMS and MPSparcV8 ...... 84
Table 16. Energy (in mJoules) considering the Same Area #2 for CReAMS and MPSparcV8 ...... 85
Table 17. Execution time (in ms) of both CReAMS and MPSparcV8 considering a power budget ...... 86
Table 18. Energy (in mJoules) of both CReAMS and MPSparcV8 considering a power budget ...... 87
Table 19. Energy-Delay product of MPSparcV8 and CReAMS considering the Same Area #2 scheme ...... 88
Table 20. Energy-Delay product of MPSparcV8 and CReAMS considering the power budget ...... 88
Table 21. (a) Different DAP sizes (b) Percentage of DAPs that compose each Heterogeneous CReAMS ...... 89
Table 22. (a) Area of the components of the different DAP sizes (b) Area of the Homogeneous and Heterogeneous CReAMS setups ...... 90
Table 23. (a) Thermal Design Power (TDP) of DAP configurations (b) TDP of heterogeneous and homogeneous CReAMS ...... 92


Table 24. TDP of 4-issue Out-Of-Order SparcV8 multiprocessing system ...... 98
Table 25. Execution time of 4-issue OOO MPSparcV8 and CReAMS ...... 99

ABSTRACT

As the number of embedded applications increases, the current strategy of several companies is to launch a new platform within short periods, to execute the application set more efficiently and with low energy consumption. However, for each new platform deployment, a new tool chain must come along, with additional libraries, debuggers and compilers. This strategy implies high hardware redesign costs, breaks binary compatibility and results in a high overhead in the software development process. Therefore, focusing on area savings, low energy consumption, binary compatibility maintenance and, mainly, software productivity improvement, we propose the exploitation of Custom Reconfigurable Arrays for Multiprocessor System (CReAMS). CReAMS is composed of multiple adaptive reconfigurable systems to efficiently exploit Instruction and Thread Level Parallelism (ILP and TLP) at hardware level, in a totally transparent fashion. Conceived as a homogeneous organization, CReAMS shows a 37% reduction in energy-delay product (EDP) compared to an ordinary multiprocessing platform when the same chip area is assumed. When processors with different ILP exploitation capabilities are coupled in a single die, conceiving CReAMS as a heterogeneous organization, performance improvements of up to 57% and energy savings of up to 36% are shown in comparison with the homogeneous platform. In addition, the efficiency of the adaptability provided by CReAMS is demonstrated in a comparison to a multiprocessing system composed of 4-issue Out-of-Order SparcV8 processors: a 28% performance improvement is shown considering a power budget scenario.

Keywords: Multiprocessors, Reconfigurable Architectures, Instruction and thread level parallelism.


1 INTRODUCTION

Industry competition in the current wide and expanding embedded market makes the design of a device increasingly complex. Nowadays, embedded systems are in a transition process from closed devices to a world in which products have to run applications, unforeseen at design time, during their whole life cycle. Thus, companies are always enhancing their repository of applications to sustain their profit even after the product has been sold. Current cell phones, a clear example of devices that explore today's convergence, are capable of downloading applications during the product life cycle. Android, Google's software framework, offered 380,297 applications for download in less than three years of existence, while Apple's platform, iOS, has three times as many applications as Android. Apple became the most valuable company in the United States four years after the launch of the iPhone. Customers are attracted by devices that carry more and more applications, such as games, text editors and VoIP communication interfaces. However, most embedded products are mobile and hence battery-powered, so hardware designers must cope with well-known design constraints such as energy consumption, chip area, process costs and processing capability. The strategy of embedding different applications during the product life cycle produces new design challenges, which makes embedded platform development even more difficult. Thus, the current embedded system design is not only constrained by the requirements of the existing applications. To reach a wider market, one should carefully conceive the design to cope with the requirements of the wide software repository that will be developed even after the product deployment. The fast deployment of embedded applications dynamically enlarges the range of different types of code that the platform should execute. Consequently, the life cycle of modern embedded products is getting increasingly short, since the hardware platform was not originally built to handle such software heterogeneity. A few years ago, cell phone manufacturers launched a major product line per year, which was enough to supply the performance required by the new applications launched during this period. The life cycle of a cell phone has shortened to meet the requirements of the new applications (HENKEL, 2003), which implies less revenue per new design due to the reduced product lifetime. However, companies try hard to stretch their product lines to amortize the costs and to increase the profit per design. Typically, companies use the natural life cycle of the applications in the market as a strategy to stretch the product life cycle and to avoid the costs of frequent hardware redesigns. The application life cycle is divided into three phases (BRANDAO e WYNN, 2008): introduction, growth and

maturity. The introduction phase reflects the time when the application is launched in the market (e.g., a new video decoding standard such as H.264). Due to doubts about consumer acceptance, the logical behavior of a new application is described using well-known high-level software languages (e.g., C++, Java and .NET) supported by the platform tool chain, which could possibly overload some parts of the underlying platform. The general-purpose processor is responsible for executing the new application. In this life cycle step, companies still avoid hardware costs, since the target platform is the very same as that of the previous product, or very close to it. After market consolidation, the growth and maturity phases start, thanks to the widespread use of the application in different products. At this time, a redesign of the hardware platform is mandatory to shrink the gap between the application and the hardware, achieving better energy/performance execution. Generally, two approaches are used to supply the efficient execution of the latest embedded applications. In the first approach, new instructions are added to the original instruction set architecture (ISA) of the platform. This approach aims at solving performance bottlenecks created by the massive execution of certain application parts with well-defined behavior. For instance, this used to be the scenario of the last generation of embedded systems: after the profiling and evaluation phase, parts of applications that contain a similar behavior are implemented as specialized hardwired instructions. These instructions extend the processor ISA to assist a delimited range of applications (GONZALEZ, 2000). Since multimedia applications and digital filters are massively used in the embedded systems field, current ARM processors implement DSP extensions to efficiently execute, in terms of energy and performance, these kinds of applications. A second technique uses a more radical approach to close the gap between the hardware and the embedded application. Application Specific Instruction Set Processors (ASIPs) implement the entire logic behavior of the application in hardware. ASIP development can be considered a better design solution than ISA extensions, since it provides higher energy savings and performance. Nowadays, such an approach is widely explored by the leading companies of the market. The Open Multimedia Application Platform (OMAP), designed by Texas Instruments, comprises one or two ARM processors surrounded by several ASIPs (for communication, graphics and audio standards), each one of them with its particular architectural characteristics to efficiently execute a restricted type of software. Companies usually employ ASIPs to meet the demand for new applications within the shortened design time and to reach the performance requirements imposed by the market. However, the use of ASIPs causes frequent platform redesigns that, besides increasing hardware deployment costs, also affect the software development process. It is already difficult to develop applications for current platforms, where one can find up to 21 ASIPs; such difficulty will increase in the coming decade, since it is expected that 600 different ASIPs will be needed to cover the growing convergence of applications for embedded devices (SEMICONDUCTORS, 2009). To soften such complexity, hardware companies (e.g., Texas Instruments, with OMAP, and Nvidia) provide particular tool chains to support the software development process.
These tool chains make the implementation details of the platform transparent to the software designers, even in the presence of a great number of ASIPs. However, each release of a platform relies on tool chain modifications, since the tool chain must be aware of the existence of the underlying ASIPs. Thus, changes on both software and hardware are mandatory when ASIPs are employed

to supply the energy and performance efficiency demanded by current embedded platforms. Despite the great advantages shown by the employment of ISA extensions and ASIPs, such approaches rely on frequent hardware and software redesigns, which go against the current market trend of stretching the life cycle of a product line. These strategies attack only a very specific application class, failing to deliver the required performance when executing applications whose behaviors have not been considered at design time. In addition, both the ISA extensions and the ASIPs employed in current platforms only explore instruction level parallelism (ILP). Aggressive ILP exploitation techniques no longer provide an advantageous tradeoff between the amount of transistors added and the extra speedup obtained (MAK, 1991). Due to the aforementioned reasons, the foreseen scenario dictates the need for changes in the paradigm of hardware platform development for embedded systems. Many advantages can be obtained by combining different processing elements into a single die. The execution time can clearly benefit, since several different parts of the program can be executed concurrently on the processing elements. In addition, the flexibility to combine different processing elements, in terms of performance, appears as a solution to the heterogeneous software execution problem. The hardware developers can select the set of processing elements that best fits the software heterogeneity of their designs. Multiprocessing systems provide several advantages, and three of them stand out: performance, energy consumption and validation time. The life cycle of these devices has halved in comparison with products of the last decade. Validation time appears as an important consumer electronics constraint that should be carefully handled. Research shows that 70% of the design time is spent in platform validation (ANANTARAMAN, SETH, et al., 2003), which makes it an attractive point for time-to-market optimization. Considering this subject, the use of a multiprocessing system softens the hard task of shrinking time-to-market. Commonly, such an approach is built by combining already validated processing elements that are aggregated into a single die like puzzle pieces. Since each puzzle piece reflects a validated processing element, the remaining design challenge is to assemble the different pieces. Actually, the designers should select a communication mechanism to connect the entire system, which eases the design process through the use of standard communication mechanisms such as buses or networks on chip (NoC). Multiprocessing systems introduce a new parallel execution paradigm aiming to overcome the performance barrier created by the limits of instruction level parallelism. Nowadays, the software team must manually detect the parts of the program that could be executed in parallel, while the hardware team is only responsible for the encapsulation of the processing elements and for the communication infrastructure. In contrast to ILP exploitation, the complexity of extracting the parallelism moves to the software team in multiprocessing chips, since there is no strong methodology that can support automatic software parallelization. The software team is responsible for the non-trivial task of spawning and distributing the code among the processing elements.
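As an illustration only (this code is not part of the thesis), a minimal C/OpenMP sketch of how a software team spawns and distributes a loop among the available processing elements; the array size and the workload are arbitrary:

    #include <omp.h>
    #include <stdio.h>

    #define N 1024

    int main(void) {
        static double in[N], out[N];

        /* The programmer only annotates the loop; the OpenMP runtime creates
         * the threads, distributes the iterations among the cores and joins
         * the threads at the end of the parallel region. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            out[i] = in[i] * in[i];

        printf("executed with up to %d threads\n", omp_get_max_threads());
        return 0;
    }

Even this trivial example hides non-trivial decisions (loop granularity, data placement, synchronization), which is precisely why software productivity is pointed out below as the hardest challenge.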
For this reason, software productivity arises as the hardest challenge in a multiprocessing system design, since applications should be launched as fast as possible to supply the demand of the market. The binary code of these applications should be as generic as possible to provide compatibility among different products and

platforms. In addition, the communication infrastructure should be efficient enough to hide the latency of inter-thread communication. The four quadrants plotted in Figure 1 show the strengths and the weaknesses of the existing hardware strategies used to design a multiprocessor platform. This figure considers both the organization and the architecture of the multiprocessing platforms. The main strategy used by the leading companies in the market is to build embedded platforms as illustrated in the lower left quadrant of Figure 1. Such a strategy can be area inefficient, since it relies on the employment of a particular ASIP to efficiently cover the execution of software with a restricted behavior in terms of ILP and TLP. Each release of a platform will not be transparent to the software developers, since together with a new platform, a new version of its tool chain with particular libraries and compilers must be provided. Besides the obvious deleterious effects on software productivity and compatibility for any new hardware upgrade, there will also be intrinsic costs of new hardware and software developments for every new product.

Figure 1. Different Architectures and Organizations (the quadrants contrast homogeneous/heterogeneous architectures and organizations, from general-purpose MPSoCs to embedded MPSoCs, listing the strengths and weaknesses of each in terms of SW behavior coverage, SW productivity, energy consumption, chip area and design time)

On the other hand, the upper right quadrant of Figure 1 illustrates the multiprocessing systems that are composed of multiple copies of the same processor, in terms of both architecture and organization. Typically, such a strategy is employed in general-purpose platforms where performance is mandatory. However, energy consumption is also becoming relevant in this domain (e.g., it is necessary to reduce energy costs in datacenters). In order to cope with this drawback, the homogeneous architecture with heterogeneous organization, shown in the upper left quadrant of Figure 1, has been emerging to provide better energy and area efficiency than the other two aforementioned platforms. This approach brings the cost of higher design validation time, since many different processor organizations are used. However, it has the advantage of implementing a single ISA, so the software development process is not

penalized. It is possible to generate assembly code using the very same tool chain for any platform version, maintaining full binary compatibility for the already developed applications. However, thread scheduling appears as an additional challenge when the heterogeneous organization approach is used: threads that have different levels of instruction level parallelism should be assigned to processors with different performance capabilities. Software partitioning is a key feature in multiprocessing environments. A computationally powerful multiprocessing platform becomes useless if the threads of a certain application show a significant load unbalance ratio, which is usually caused by the poor quality of the software partitioning, or by the nature of an application that does not provide a minimum of thread level parallelism to be explored. Amdahl's law shows that the speedup of a certain application is limited by its sequential part. Therefore, if an application needs 1 hour to execute, 5 minutes of it being sequential (almost 9% of the entire application code), the maximum speedup provided by a multiprocessing system is 12 times, no matter how many processing elements are available. Figure 2 shows the performance of some well-known embedded applications that were split into threads using a traditional shared-memory parallel programming language. As can be seen, the performance of these applications does not scale as the number of processors increases, when executed on a multiprocessor system composed of multiple copies of five-stage pipeline RISC processors. In the best case of the examples, even overlooking inter-thread communication costs, a speedup of nine times is achieved when 64 processors are used. Clearly, these embedded applications are good examples of Amdahl's law, demonstrating that multiprocessing systems can fail to accelerate applications that have a meaningful sequential part. Since there is a limit of TLP for most applications (BLAKE, DRESLINSKI, et al., 2010), standalone TLP exploitation does not provide the energy and performance optimization demanded by current embedded designs.
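The 12-times bound quoted above follows directly from Amdahl's law. Writing s for the sequential fraction of the execution time (here s = 5/60 ≈ 0.083) and N for the number of processing elements:

\[
S(N) = \frac{1}{s + \frac{1-s}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{s} = \frac{60}{5} = 12.
\]

With N = 64, the same expression already saturates near the asymptote (S(64) ≈ 10.2 in this example), consistent with the limited scaling observed in Figure 2.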

Figure 2. Speedup of homogeneous multiprocessing systems on embedded applications (susan corners, susan edges and fft; the bars compare ILP-only exploitation by a single dynamic accelerator with TLP-only exploitation on 4, 16 and 64 processors)

The ideal platform would combine the hardware benefits of the third quadrant of Figure 1 with the ease of software development of the second quadrant, without any cost associated with new hardware development. This means that the physical structure of the hardware could be homogeneous, since chip area is no longer a drawback for billion-transistor technologies. Nevertheless, it is mandatory that the costs, such as power and energy consumption, virtually behave as in a heterogeneous organization. However, this can only be achieved if the available hardware has the ability to be tuned for each different application, or even program phase, on the fly. Dynamic reconfigurable

architectures have already been shown to be very attractive for embedded platforms, since they can adapt fine-grained parallelism exploitation (i.e., at the instruction level) to the application requirements at run time (CLARK, KUDLUR, et al., 2004) (LYSECKY, STITT e VAHID, 2004). However, besides having restricted thread level parallelism, embedded applications also exhibit limits of instruction level parallelism. Thus, performance gains from such exploitation tend to stagnate, even if a huge amount of resources is available in the reconfigurable accelerator. The "Single Dynamic Accelerator" bars of Figure 2 illustrate this claim. This assumption considers the performance of a dynamic reconfigurable architecture with an area equivalent to sixteen five-stage pipeline RISC processors. Although it is faster than a RISC processor executing a single thread, even outperforming four processors when running the Fast Fourier Transform (FFT), the standalone ILP exploitation of the single dynamic accelerator does not provide an advantageous trade-off between area and performance if compared to the multithreaded versions of the remaining benchmarks. Summarizing, this figure indicates that neither ILP nor TLP alone provides a meaningful area-performance tradeoff when a heterogeneous software environment is considered. The state of the art of multiprocessing systems with both ILP and TLP exploitation is very divergent if one considers the complexity of the processing element. At one side of the spectrum, there are multiprocessing systems composed of multiple copies of simple cores to better explore the coarse-grained parallelism of highly thread-based applications (HAMMOND, HUBBERT, et al., 2000) (ANDRE, BARROSO, et al., 2000). At the other side, there are multiprocessor chips assembled with a few complex superscalar/SMT processing elements, to explore applications where ILP exploitation is mandatory. There is no consensus on the hardware logic distribution in a multiprocessing environment to explore the best of ILP and TLP together over a wide range of application classes. Considering the wide range of instruction level parallelism that current applications exhibit, there is a large design space to explore by creating platforms composed of processors with different ILP exploitation capabilities. Although technology allows the encapsulation of billions of transistors in a single chip, area could be saved and the performance of the homogeneous platform could be maintained by exploiting the diversity of computational capabilities of a heterogeneous organization. Such a strategy relies on a scheduling algorithm that correlates the intrinsic characteristics of the threads, such as load unbalance and ILP, with the computational capability of the available processors (KUMAR, FARKAS, et al., 2003) (KUMAR, JOUPPI e TULLSEN, 2006). Summarizing, an ideal multiprocessing system for embedded devices should be composed of replicated generic processing elements that can adapt to the particularities of the applications throughout the product life cycle. This platform should emulate the behavior, in terms of performance and energy, of the ASIPs that are successfully employed in current embedded platforms. At the same time, in contrast to such platforms, the use of the same ISA for all processing elements is mandatory to increase software productivity, by avoiding time spent on tool chain modifications, and to maintain binary compatibility with the already developed applications.
This ideal platform would be able to efficiently attack the whole spectrum of application behaviors: those that contain dominant thread level parallelism and those that are single-threaded. However, the platform should be conceived as a heterogeneous organization to provide the best fit between the heterogeneous characteristics that the applications exhibit and the processing capability necessary to execute them. Moreover,

the number of processing elements should be carefully investigated, since the overall system performance can be affected by inter-thread communication costs as the number of processing elements grows. This way, the hypothesis is that, by using such a strategy, one could reach a satisfactory tradeoff in terms of energy, performance and area, without extra software and hardware costs.

1.1 Contributions
Considering all the motivations discussed before, the first goal of this work is to reinforce, by the use of an analytical model, that the employment of a standalone level of parallelism exploitation does not provide a meaningful energy-performance tradeoff when a heterogeneous application environment is handled. In addition, this study gives some clues about the ratio of hardware deployment in multiprocessing chips, in terms of fine- and coarse-grained parallelism exploitation, to achieve a balanced architecture in terms of area and performance. A Network-on-Chip is also modeled to investigate the impact of inter-thread communication latency on the gains obtained by thread level parallelism exploitation. In this scenario, we propose a platform based on Custom Reconfigurable Arrays for Multiprocessor System (CReAMS), merging two different architectural concepts: reconfigurable architectures and multiprocessing systems. In the first step of this work, CReAMS is built as homogeneous in both architecture and organization. However, it virtually behaves as a homogeneous architecture with a heterogeneous organization. Thanks to its dynamic adaptive hardware, coupled to each basic processor, CReAMS takes advantage of the flexibility provided by the reconfigurable architecture. This system is capable of transparently exploring (no changes in the binary code are necessary at all) the fine-grained parallelism of the individual threads, offering much greater ability to adapt to the ILP demands of the applications, while at the same time it makes the most of the available thread parallelism. The coarse-grained parallelism exploitation does not rely on the employment of special tools, since it is explored by well-known application programming interfaces (e.g., OpenMP and POSIX threads), making CReAMS execution independent of any particular software partitioning process. Thus, dynamically and in a transparent fashion, it is possible to balance the best of both thread and instruction parallelism levels. This way, any kind of code is accelerated, from code that presents high TLP and low ILP to code that is exactly the opposite. CReAMS achieves performance improvements and lower energy consumption, but with the software productivity of a multiprocessor device based on a homogeneous architecture. In addition, a single tool chain is used for the whole platform and for any new version launched, with full binary compatibility. Aiming at showing the potential of the CReAMS platform in adapting to a wide range of software behaviors, we selected applications from general-purpose (e.g., SPEC OMP2001), parallel (e.g., Splash2) and embedded benchmark suites (e.g., MiBench). The experimental setup was supported by simulation using the SparcV8 ISA model supplied by the Simics instruction set accurate simulator (MAGNUSSON, CHRISTENSSON, et al., 2002). CReAMS measurements were obtained through the replication of cycle-accurate simulators that model the behavior of the basic processing element of CReAMS, named Dynamic Adaptive Processor (DAP). The cycle-accurate simulators precisely model thread synchronization, such as barriers and locks. CReAMS implements thread communication through a shared-memory mechanism and, as already mentioned, supports the well-known application programming interfaces, which makes the thread spawning

process transparent to the hardware. Performance improvements and energy savings were demonstrated when comparing CReAMS to an ordinary multiprocessing system composed of multiple copies of pipelined SparcV8 processors, considering the same chip area for both designs. Since very interesting performance and energy results were obtained with CReAMS as a homogeneous organization platform, and aiming at reducing the area occupied by CReAMS, we investigated the advantages of using DAPs with different processing capabilities, taking advantage of the heterogeneous organization. One of the motivations for such a design space exploration is the diversity of instruction level parallelism available in a heterogeneous application workload. Some threads may have a larger amount of instruction level parallelism than others, which can be exploited by a DAP that can issue many instructions per cycle. However, the powerful DAP could be assigned to execute a certain thread that requires little ILP exploitation, consuming more power than a simpler core that would be better matched to the characteristics of such a thread. This wrong thread assignment could cause load-unbalanced execution, significantly affecting the overall execution time. Thus, DAPs with different processing capabilities bring a diversity of ILP opportunities to explore, opening room to achieve larger area savings and less power consumption than the homogeneous organization strategy. However, it also brings the need for a thread scheduling strategy that matches the performance requirements of a certain thread, in order to maintain the performance shown by the homogeneous DAPs. Thus, we developed a simple thread scheduling algorithm only to prove the need for a dynamic thread scheduling strategy when heterogeneous organizations are employed. The scheduling strategy assigns threads to DAPs with different ILP exploitation capabilities considering the number of executed instructions (a rough sketch of this idea is given at the end of this chapter). As all DAPs have the same instruction set in the heterogeneous environment, the transparency offered by the homogeneous CReAMS in the software development process is not affected, which maintains the same software productivity. Chapter 2 presents the related work, discussing issues related to reconfigurable architectures and multiprocessing systems based on heterogeneous architectures. We also describe the contribution and main novelty of this work in comparison with these other studies. In Chapter 3, we first discuss, using an analytical model, the potential of standalone exploitation of instruction and thread level parallelism. We model a multiprocessing architecture composed of several simple and homogeneous cores, and we compare it to the model of a superscalar architecture in terms of performance and energy. The impact of the communication infrastructure is also analytically modeled. After that, in Chapter 4, we present the structure of the CReAMS platform. Chapter 5 presents the methodology and tools employed to gather the results. The performance, energy and area results regarding the homogeneous organization of CReAMS are demonstrated in that chapter. Afterwards, results considering CReAMS conceived as a heterogeneous organization are shown. Finally, the performance of CReAMS is compared to a 4-issue Out-Of-Order SparcV8 multiprocessor. Chapter 6 discusses future work and concludes this thesis.
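The sketch below is only an illustration of the scheduling idea mentioned above; the identifiers are hypothetical, and the mechanism actually evaluated in this thesis is detailed in Chapter 5. The intuition is simply that threads that retired more instructions in the last monitoring window are mapped onto the DAPs with wider ILP exploitation capability.

    /* Hypothetical sketch: map the most instruction-hungry threads onto the
     * most capable DAPs. retired[i] would come from per-thread instruction
     * counters; DAP 0 is assumed to be the widest-issue configuration.
     * Assumes n <= 64 threads. */
    void schedule_threads(int n, const long retired[], int dap_of_thread[]) {
        int order[64];
        for (int i = 0; i < n; i++)
            order[i] = i;

        /* sort thread ids by retired instruction count, descending */
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (retired[order[j]] > retired[order[i]]) {
                    int t = order[i]; order[i] = order[j]; order[j] = t;
                }

        /* the thread with the most executed instructions gets the widest DAP */
        for (int k = 0; k < n; k++)
            dap_of_thread[order[k]] = k;
    }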


2 RELATED WORK

In this chapter, we review traditional works that exploit reconfigurable fabrics to accelerate single-threaded applications. Then, we present approaches that use multiprocessing systems in the commercial and academic fields. Finally, the characteristics of several studies that employ reconfigurable architectures in a multiprocessing environment are shown. At the end of the chapter, we analyze our approach, linking its similarities and dissimilarities with the other works that use the same strategy.

2.1 Single-Threaded Reconfigurable Systems
Although there are no common criteria for classifying single-threaded reconfigurable systems, a careful study with respect to coupling, granularity and reconfiguration type is presented in (HAUCK e COMPTON, 2002). In a reconfigurable architecture design, the choice of the coupling between the reconfigurable data path and the basic processor is crucial for performance. As can be seen in Figure 3, tightly coupled is the classification given to a reconfigurable fabric implemented as an additional functional unit (FU) of the processor. As the communication between both elements occurs only inside the chip, its high throughput is a benefit over loosely coupled reconfigurable fabrics. There are several sub-classifications of loosely coupled fabrics. When the fabric is classified as a co-processor, the data path is implemented outside the chip, as shown in Figure 3. Design constraints guide the choice of coupling: when there is not enough silicon area to hold the reconfigurable fabric, loosely coupled architectures are used, where an external bus is responsible for the communication between the processor and the reconfigurable fabric. Attached is the coupling strategy that connects the reconfigurable fabric between the cache memory and the I/O interface; its communication cost is high, but lower than that of the standalone strategy, which connects the reconfigurable data path to the I/O interface. The size and complexity of the basic reconfigurable elements is referred to as the block's granularity. For example, one could build a reconfigurable fabric from replications of one-bit-wide adders as the basic reconfigurable element. Alternatively, a 32-bit-wide adder could be encapsulated as a black box, building a coarser basic reconfigurable element. The latter design provides lower reconfiguration flexibility than the former, since operations that need fewer than 32 bits would still occupy an entire basic element. On the other hand, a simpler controller is required as the granularity becomes coarser, and fewer bits are used to reconfigure the whole fabric.
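A back-of-the-envelope sketch of the last point, with purely illustrative bit counts that are not taken from any of the cited fabrics: configuring a 32-bit addition on a fine-grained fabric means programming every 1-bit cell, whereas a coarse-grained fabric only selects an ALU operation and its operands.

    #include <stdio.h>

    int main(void) {
        /* fine-grained: 32 one-bit cells, each needing LUT contents plus
         * routing bits (assumed values) */
        int cells = 32, bits_per_cell = 4 + 8;
        int fine_grained = cells * bits_per_cell;   /* 384 configuration bits */

        /* coarse-grained: one 32-bit ALU, needing an opcode and two operand
         * selectors (assumed widths) */
        int coarse_grained = 5 + 2 * 6;             /* 17 configuration bits  */

        printf("fine-grained: %d bits, coarse-grained: %d bits\n",
               fine_grained, coarse_grained);
        return 0;
    }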


Figure 3. Coupling setups (HAUCK e COMPTON, 2002)

Static reconfiguration is exploited by several works as a strategy to extract, at compile time, the most suitable parts of the application code to be efficiently executed on the reconfigurable fabric. This strategy avoids any kind of run-time task by adding a compilation phase that discovers the suitable parts of the application code. However, it breaks binary compatibility, since it relies on some kind of source code modification. In addition, the time-to-market constraint can be affected, as a new compilation phase is inserted. Many successful reconfigurable fabrics employ static reconfiguration. Processors like Chimaera (HAUCK, FRY, et al., 2004) have a tightly coupled reconfigurable array in the processor core, working as an additional functional unit, limited to combinational logic only. This simplifies the control logic and diminishes the communication overhead between the reconfigurable array and the rest of the system. Look-up tables are used as the basic reconfigurable block, which leads to high reconfiguration costs, both in memory footprint and in reconfiguration time. The GARP machine (WAWRZYNEK, 1997) is a MIPS-compatible processor with a loosely coupled reconfigurable array. Communication is done using dedicated move instructions; since GARP also employs look-up tables as basic reconfigurable blocks, it incurs the same design costs as Chimaera. Piperench (GOLDSTEIN, SCHMIT, et al., 2000) proposes a pipeline-based reconfigurable fabric attached to the processor to reduce the reconfiguration/execution time of FPGAs. This approach uses a technique, named virtualization, to reduce the area costs of the reconfigurable fabric. The upper side of Figure 4 (Figure 4(a)) shows an example of a Piperench execution without the virtualization technique. In this case, the application was divided into 5 parts and takes 7 cycles to be configured and executed, since no parallelism in the configuration/execution process is provided. The lower side of Figure 4 (Figure 4(b)) shows the virtualization technique, where 3 cycles are needed to execute the same application. The reuse of the same data path stage at different periods is the key factor to achieve high performance with low area. More recently, new reconfigurable architectures, very similar to dataflow approaches, were proposed. For instance, TRIPS is based on a hybrid von Neumann/dataflow architecture that combines an instance of coarse-grained, polymorphous grid processor cores with an adaptive on-chip memory system (SANKARALINGAM, NAGARAJAN, et al., 2004). To better explore the application parallelism and utilize the available resources, TRIPS uses three different modes of execution, focusing on instruction-, data- or thread-level parallelism. Wavescalar (SWANSON, 2007), in turn, totally abandons the program counter and the linear von Neumann execution model that could limit the amount of exploited parallelism. The major difference between this approach and conventional systems is that there is no

centralized processing structure at all, as it is replaced by many distributed processing nodes. In agreement with the previous examples, one can also refer to Molen (VASSILIADIS, WONG, et al., 2004). All the cited approaches still rely on static reconfiguration to achieve code optimization and better resource utilization when applying reconfigurable logic.

Figure 4. Virtualization process of Piperench (GOLDSTEIN, SCHMIT, et al., 2000)

Concerned about the overheads created by the static reconfiguration process, Stitt et al. (LYSECKY, STITT e VAHID, 2004) did pioneering work in proposing a dynamic detection strategy for reconfigurable fabrics. The employment of dynamic detection techniques does not rely on code recompilation, providing software compatibility and maintaining the device time-to-market. Stitt et al. (LYSECKY, STITT e VAHID, 2004) presented Warp Processing, a system that does dynamic partitioning using reconfigurable logic. Performance improvements are shown when applying such a technique to a set of popular embedded system benchmarks. It is composed of a microprocessor to execute the application software, another microprocessor where a simplified CAD algorithm runs, local memory and a dedicated simplified FPGA. In (CLARK, KUDLUR, et al., 2004) the Configurable Compute Array (CCA), a coarse-grained array tightly coupled to an ARM processor, is proposed. The feeding process of the CCA involves two steps: the discovery of which subgraphs are suitable for running on the CCA, and their replacement by microops in the instruction stream. Two alternative approaches are presented: static, where the subgraphs for the CCA are found at compile time, and dynamic. Dynamic discovery assumes the use of a trace cache to perform subgraph discovery on the retiring instruction stream at run time. Even applying dynamic techniques, Warp Processing and CCA present some drawbacks, though. First, significant memory resources are required for the kernel transformations. In the case of Warp Processing, the use of an FPGA presents long latency and large consumed area, being power inefficient. In the case of the CCA, some operations, such as memory accesses and shifts, are not supported at all. Then, usually

just the very critical parts of the software are optimized, limiting their field of application. In (BECK, RUTZIG, et al., 2008), Beck proposes the coupling of a reconfigurable system with a special binary translation (BT) technique implemented in hardware, named Dynamic Instruction Merging (DIM). DIM is designed to detect and transform instruction groups for reconfigurable hardware execution. Therefore, this work proposes a reconfigurable array of a completely dynamic nature: besides being dynamically reconfigurable, the sequences of instructions to be executed on it are also detected and transformed into a data path configuration at run time. As can be observed in Figure 5, this is done concurrently while the main processor fetches other instructions (Step 1). When a sequence of instructions is found, a binary translation is applied to it (Step 2). Thereafter, this configuration is saved in a special cache and indexed by the memory address of the first detected instruction (Step 3).

Figure 5. How the DIM system works (BECK, RUTZIG, et al., 2008)

The next time the saved sequence is found (Step 4), the dependence analysis and the translation are no longer necessary: the BT mechanism loads the previously stored configuration from the special cache and the operands from the register file and memory (Step 5), and activates the reconfigurable hardware as a functional unit (Step 6). Then, the array executes that configuration in hardware, including the write-back of the results (Step 7), instead of the ordinary (not translated) processor instructions. Finally, the PC is updated in order to continue the execution. This way, repetitive dependence analysis for the same sequence of instructions throughout program execution is avoided. The reconfigurable data path is tightly coupled to the processor, working as another ordinary functional unit in the pipeline. It is composed of coarse-grained functional units, such as arithmetic and logic units and multipliers. A set of multiplexers is responsible for the routing. Because of the small context size and simple structure, the use of a coarse-grained data path is more suitable for this kind of dynamic technique. In this technique, both the DIM engine and the reconfigurable data path are designed to work in parallel with the processor and do not introduce any delay overhead or penalties to the critical path of the pipeline structure.
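A highly simplified software model of the detect-translate-reuse flow described above may help fix the idea. It is our illustration only (the real DIM engine is a hardware unit, and its structures are presented in Chapter 4), and all identifiers below are hypothetical.

    #include <stdio.h>

    typedef struct { int ops[8]; } Config;      /* stand-in for a data path configuration */
    typedef struct { unsigned pc; Config cfg; int valid; } Entry;

    static Entry reconf_cache[256];              /* "special cache", indexed by the start PC */

    static Config translate(unsigned pc) {       /* Steps 1-2: detection + binary translation */
        Config c = { .ops = { (int)pc } };       /* placeholder for the dependence analysis   */
        return c;
    }

    static void run_on_array(const Config *c) {  /* Steps 5-7: operand fetch, execution, write-back */
        printf("array executes configuration built at PC 0x%x\n", (unsigned)c->ops[0]);
    }

    void execute_region(unsigned pc) {
        Entry *e = &reconf_cache[pc % 256];
        if (e->valid && e->pc == pc) {           /* Step 4: sequence was seen before   */
            run_on_array(&e->cfg);               /* no dependence analysis is repeated */
        } else {
            e->cfg = translate(pc);              /* Step 3: save the configuration     */
            e->pc = pc;
            e->valid = 1;
            /* the first occurrence still runs on the ordinary processor pipeline */
        }
    }

    int main(void) {
        execute_region(0x4000);                  /* first visit: detect and translate        */
        execute_region(0x4000);                  /* second visit: reuse stored configuration */
        return 0;
    }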


All the works explained in this subsection show the potential of moving parts of the software to reconfigurable logic execution. However, both dynamic and static approaches are still limited to optimizing single-threaded applications, which narrows their field of application since, due to the limited instruction level parallelism (MAK, 1991), performance does not increase at the same pace as the number of functional units.

2.2 Multiprocessing Systems
In the nineties, sophisticated architectural features that exploit instruction level parallelism, like aggressive out-of-order instruction execution, increased the overall circuit complexity more than they improved performance. Therefore, as technology reached an integration of almost a billion transistors in a single die in the following decade, researchers started to explore thread level parallelism by integrating many processors in a single die. In the academic field, several works address the chip-multiprocessing subject. Hydra (HAMMOND, HUBBERT, et al., 2000) was one of the pioneering designs that integrated many processors within a single die. The authors argue that the hardware cost of extracting parallelism from a single-threaded application is becoming prohibitive, and advocate the use of software support to extract thread level parallelism, allowing the hardware to be simple and fast. In addition, they discourage a complex single-processor implementation in a billion-transistor design, since wire delay increases with technology scaling, which makes the handling of long wires complex in pipeline-based designs. For instance, in the Pentium 4 design, the long wire distances add two pipeline stages to the floating-point pipeline, so the FPU has to wait two whole clock cycles for the operands to arrive from the register file (PATTERSON e HENNESSY, 2010). The Hydra Chip Multiprocessor is composed of multiple copies of the same processor, being homogeneous from both the architecture and the organization points of view. The Hydra implementation contains eight processors, each of them capable of issuing two instructions per cycle. The choice of a simple processor organization provides advantages over multiprocessing systems composed of complex processors since, besides allowing a higher operating frequency of the chip, it fits a larger number of processors in the same area. A performance comparison among the Hydra design, a 12-issue processor and an 8-thread 12-issue simultaneous multithreading processor showed promising results for applications that can be parallelized into multiple threads, since Hydra uses relatively simpler hardware than the compared architectures. However, disadvantages appear when applications contain code that cannot be multithreaded: Hydra is then slower than the compared architectures, because only one processor can be targeted to the task, and this processor does not have a strong ability to extract instruction level parallelism. Piranha (ANDRE, BARROSO, et al., 2000), like Hydra, invests in the coupling of many simple single-issue in-order processors to massively explore the thread level parallelism of commercial and web server applications. The project makes available a complete platform composed of eight simple processor cores along with a complete cache hierarchy, memory controllers, coherence hardware and a network router, all on a single chip running at 500 MHz. Results on web server applications show that Piranha outperforms an aggressive out-of-order processor running at 1 GHz by over a factor of three. Like Hydra, the authors explicitly state that Piranha is a wrong design choice if the goal is to achieve performance improvements in applications

that lack sufficient thread-level parallelism, due to the simple organization of their processors. Tullsen (KUMAR, TULLSEN, et al., 2004) demonstrates that there can be great advantage in providing a diversity of processing capabilities within a multiprocessing chip, allowing the architecture to adapt to the application requirements. A heterogeneous organization and homogeneous ISA multiprocessing chip is assembled with four different processor organizations, each one with its particular power consumption and instruction level parallelism exploitation capability. To motivate the use of such an approach, a study over the SPEC2000 benchmark suite was done. It shows that applications have different execution phases and require different amounts of resources in these phases. On that account, several dynamic switching algorithms are employed to examine the limits of power and performance improvements possible in a heterogeneous multiprocessing organization environment. Huge energy reductions with little performance penalty are obtained simply by moving applications to a better-matched processor. For almost ten years now, multiprocessing systems have been increasingly taking over the general-purpose processor marketplace. Intel and AMD have been using this approach to speed up their high-end processors. In 2006, Intel shipped its multiprocessor chip based on the homogeneous architecture strategy. The Intel Core Duo is composed of two processing elements that communicate among themselves through an on-chip cache memory. In this project, Intel thought beyond the benefits of such a system and created an approach to increase the process yield. A new processor market line, named Intel Core Solo, was created aiming to increase the process yield by selling even Core Duo dies with manufacturing defects. In this way, the Intel Core Solo has the very same two-core die as the Core Duo, but only one core is defect free. Recently, embedded processors have been following the trend of high-end general-purpose processors, coupling many processing elements, with the same architecture, on a single die. Earlier, due to the hard constraints of these designs and the few parallel applications that would benefit from several GPPs, homogeneous multiprocessors were not suitable for this domain. However, the embedded software scenario is getting similar to the personal computer one, due to the convergence of applications to embedded devices already discussed in the beginning of this work. The ARM Cortex-A9 processor is the pioneer in employing the homogeneous multiprocessing approach in the embedded domain, coupling up to four Cortex-A9 cores into a single die. Each processing element uses powerful techniques for ILP exploitation, such as superscalar execution and SIMD instruction set extensions, which closes the gap between embedded processor designs and high-end general-purpose processors. The Texas Instruments strategy better illustrates the embedded domain trend to use multiprocessor systems. This heterogeneous architecture handles in hardware the most widely used applications on embedded devices, like multimedia and digital signal processing. In 2002, Texas Instruments launched in the market an Innovator Development Kit (IDK) targeting high performance and low power consumption for multimedia applications. The IDK provides easy design development, with open software, based on a customized hardware platform called Open Multimedia Applications Processor (OMAP).
Since its launch, OMAP has been a successful platform, used by embedded market leaders such as Nokia with its N90 cell phone series, the Samsung OMNIA HD and the Sony Ericsson IDOU. Currently, due to the large diversity found in the embedded consumer market, Texas Instruments has divided the OMAP family into two different lines, covering different aspects. The high-end OMAP line supports the current sophisticated smart phone and powerful cell phone models, providing pre-integrated connectivity solutions for the latest technologies (3G, 4G, WLAN, Bluetooth and GPS) and audio and video applications (WUXGA), including high definition television. The low-end OMAP platforms cover down-market products, providing older connectivity technologies (GSM/GPRS/EDGE) and low definition displays (QVGA).
Recently, Texas Instruments released one of its latest high-end products, the OMAP4440, which covers connectivity besides high-quality video, image and audio support. This mobile platform came to supply the need created by the increasing convergence of multimedia applications into a single embedded device. The platform incorporates the dual-core ARM Cortex A9 MPCore, providing higher mobile general-purpose computing performance. The power management technique available in the ARM Cortex A9 MPCore balances power consumption with the performance requirements, activating only the cores that are needed for a particular execution. In addition, due to the high performance requirements of today's smart phones, up to eight threads can be concurrently fired in the MPCore, since each core is composed of four single-core Cortex A9 processors. The single-core ARM Cortex A9 implements superscalar execution, SIMD instruction set and DSP extensions, bringing almost the same processing power as a personal computer to an embedded mobile platform. Excluding the ARM Cortex MPCore, the remaining processing elements are dedicated to multimedia execution.
In 2011, NVIDIA introduced the mobile processor project named Kal-el (NVIDIA, 2011). This project is the first to encapsulate four processors in a single die for mobile computation. The main novelty introduced by this project is the Variable Symmetric Multiprocessing (vSMP) technology. vSMP introduces a fifth processor, named "Companion Core", that executes tasks at a low frequency during active standby mode, as mobile systems tend to stay in this mode most of the time. All five processors are ARM Cortex-A9, but the companion core is built in a special low-power silicon process. In addition, all cores can be enabled/disabled individually, and when the active standby mode is on, only the "Companion Core" works, so battery life can be significantly improved. NVIDIA reports that the switching from the "Companion Core" to the regular cores is supported only by hardware and takes less than 2 milliseconds, being imperceptible to the end users. In comparison with the Tegra 2 platform, vSMP achieves up to 61% of energy savings when running HD video playback.
Like OMAP, Samsung designs are focused on multimedia-based development. Their projects are very similar due to the increasing market demand for powerful multimedia platforms, which stimulates the designers to take the same decisions to achieve efficient multimedia execution. Commonly, the integration of specific accelerators is used, since this reduces the design time by avoiding validation and testing time. In 2008, Samsung launched the most powerful member of its Mobile MPSoC family. At first, the S3C6410 was a multimedia MPSoC like the OMAP4440. However, after its deployment in the Apple iPhone 3G, it became one of the most popular MPSoCs, shipping 3 million units during its first month of life. Later, Apple developed the iPhone 3GS, which assures better performance with lower power consumption.
These benefits were supplied by the replacement of the S3C6410 architecture with the high-end S5PC100 version. Following the multimedia-based multiprocessor trend, Samsung platforms are composed of several application-specific accelerators, building heterogeneous multiprocessor architectures. The S3C6410 and the S5PC100 have a central general-purpose processing element, in both cases ARM-based, surrounded by several multimedia accelerators tightly targeted to DSP processing. Both platform skeletons follow the same execution strategy, changing only the processing capability of their IP cores. Small platform changes were made from the S3C6410 to the S5PC100 aiming to increase performance. More specifically, the 9-stage pipelined ARM1176JZF-S core with SIMD extensions was replaced by a 13-stage superscalar-pipelined ARM Cortex A8, providing greater computation capability for general-purpose applications. Besides an L1 cache twice the size of the ARM1176JZF-S one, the ARM Cortex A8 also includes a 256KB L2 cache, avoiding external memory accesses due to L1 cache misses. ARM NEON technology is included in the ARM Cortex A8 to provide flexible and powerful acceleration for intensive multimedia applications. Its SIMD-based execution accelerates multimedia and signal-processing algorithms such as video encode/decode, 2D/3D graphics, speech processing and image processing at least twice as much as the previous SIMD technology. However, these hardware changes force mandatory tool chain modifications to support the use of the new dedicated hardware, which consequently breaks binary compatibility, since the software developers must change and recompile the application code. Regarding multimedia accelerators, both systems are able to provide suitable performance for any high-end mobile device. However, the S5PC100 includes support for the latest multimedia codecs using powerful accelerators. This strategy of changing some platform elements from the S3C6410 to the S5PC100 illustrates the growth and maturity phases of the functionality lifecycle discussed in the beginning of this work. In these phases, the electronic consumer market has already absorbed these functionalities, and their hard-wired execution is mandatory for energy and performance efficiency.
Other multiprocessing systems have already been released in the market with goals different from those of the architectures discussed before. Sony, IBM and Toshiba worked together to design the Cell Broadband Engine Architecture (CHEN, RAGHAVAN, et al., 2007). The Cell architecture combines a powerful central processor with eight SIMD-based processing elements. Aiming to accelerate a large range of application behaviors, the IBM PowerPC architecture is used as the general-purpose processor. In addition, this processor has the responsibility of managing the processing elements surrounding it. These processing elements, called synergistic processing elements (SPE), are built to support streaming applications with SIMD execution. Each SPE has a local memory that can only be accessed by explicit and particular software directives. These facts make software development for the Cell processor even more difficult, since the software team must be aware of this local memory and manage it at the software level to better exploit the SPE execution. Despite its high processing capability, the Cell processor does not yet have a large market acceptance because of the intrinsic difficulty of coding software that uses the SPEs. When it was launched, the PlayStation console did not capture a great part of the gaming entertainment marketplace: the game developers did not have enough knowledge of the tool chain libraries to efficiently exploit the complex Cell architecture, which resulted in a restricted number of games available in the market.
Homogeneous multiprocessing system organization is also explored in the market, mainly for personal computers with general-purpose processors, because of the huge number of different applications that these processors have to face, and hence due to the difficulty of defining specialized hardware accelerators. In 2005, Sun announced its first homogeneous multiprocessor design, composed of up to 8 processing elements executing the SPARC V9 instruction set. The UltraSparc T1, also called Niagara (JOHNSON e NAWATHE, 2007), is the first multithreaded homogeneous multiprocessor: each processing element is able to execute four threads concurrently. In this way, Niagara can handle up to 32 threads at the same time. Recently, with the deployment of the UltraSparc T2, this number has grown to 64 concurrent threads. The Niagara family targets massive data computation with distributed tasks, like the market for web servers, database servers and network file systems.
Intel has announced its first multiprocessing system based on a homogeneous organization, prototyped with 80 cores, which is capable of executing 1 trillion floating-point operations per second while consuming 62 Watts (VANGAL, HOWARD, et al., 2007). The company expects to launch this chip in the market within the next 5 years. Hence, the x86 instruction set architecture era could be broken, since its processing elements are based on the very long instruction word (VLIW) approach, leaving to the compiler the responsibility for the parallelism exploration. The 80-core design uses a mesh network as the interconnection mechanism among its processing elements. However, even with the mesh, communication turns out to be difficult due to the great number of processing elements. Hence, this ambitious project uses a 20-Mbyte stacked on-chip SRAM memory to improve the communication bandwidth among the processing elements.
The graphics processing unit (GPU) is another multiprocessing system approach, aimed at graphics-based software acceleration. However, this approach has been arising as a promising architecture also for improving general-purpose software. Intel Larrabee (SEILER, CARMEAN, et al., 2008) attacks both application domains thanks to its CPU- and GPU-like architecture. In this project, Intel relied on the assumption of achieving energy efficiency by replicating simple cores. Larrabee uses several P54C-based cores to handle general-purpose applications. The P54C was shipped in 1994 in a 0.6um CMOS technology, reaching up to 100 MHz, and does not include out-of-order superscalar execution. However, some modifications were made in the P54C architecture, like support for SIMD execution, aiming to provide more powerful graphics-based software execution. The Larrabee SIMD execution is similar to, but more powerful than, the SSE technology available in modern x86 processors. Each P54C is coupled to a 512-bit vector pipeline unit (VPU), capable of executing 16 single-precision floating-point operations in one processor cycle. In addition, Larrabee employs fixed-function graphics hardware that performs texture-sampling tasks like anisotropic filtering and texture decompression. However, in 2009, Intel discontinued the Larrabee project.
NVIDIA Tesla (LINDHOLM, NICKOLLS, et al., 2008) is another example of a multiprocessing system based on the concept of a general-purpose graphics processing unit. Its massively parallel computing architecture provides support for the Compute Unified Device Architecture (CUDA) technology. CUDA, NVIDIA's computing engine, eases the parallel software development process by providing software extensions in its framework. In addition, CUDA gives access to the native instruction set and memory of the processing elements, turning the NVIDIA Tesla into a CPU-like architecture.
Tesla architecture incorporates up to four multithreaded cores that communicate through a GDDR3 bus, which provides a huge data communication bandwidth.


Table 1. Summarized Commercial Multiprocessing Systems

Platform                     Architecture    Organization    Cores                                   Multithreaded Cores       Interconnection
OMAP4440                     Heterogeneous   Heterogeneous   2 ARM Cortex A9, 1 PowerVR graphics     No                        Integrated Bus
                                                             accelerator, 1 Image Signal Processor
Samsung S3C6410/S5PC100      Heterogeneous   Heterogeneous   1 ARM1176JZF-S, 5 Multimedia            No                        Integrated Bus
                                                             Accelerators
Cell                         Heterogeneous   Heterogeneous   1 PowerPC, 8 SPE                        No                        Integrated Bus
Niagara                      Homogeneous     Homogeneous     8 SPARC V9 ISA                          Yes (4 threads)           Crossbar
Intel 80-Cores               Homogeneous     Homogeneous     80 VLIW                                 No                        Mesh
Intel Larrabee               Homogeneous     Homogeneous     n P54C x86 cores with SIMD execution    No                        Integrated Bus
NVIDIA Tesla (GeForce 8800)  Homogeneous     Homogeneous     128 Stream Processors                   Yes (up to 768 threads)   Network

As discussed in the beginning of this work, the employment of multiprocessing systems is a consensus for the current and next generations of both general-purpose and embedded processors, since aggressive exploitation of the instruction level parallelism of single-threaded applications does not provide an advantageous tradeoff between extra transistor usage and performance improvement. All multiprocessing system designs mentioned in this section somehow explore thread level parallelism. Summarizing all commercial multiprocessing systems discussed before, Table 1 compares their main characteristics, showing their differences depending on the target market domain. Heterogeneous architectures, like OMAP, Samsung and Cell, incorporate several specialized processing elements to attack specific applications for highly constrained mobile or portable devices. These architectures have multimedia-based processing elements, following the trend of embedded systems. However, as mentioned before, software productivity is affected when such a strategy is used: each new platform launch implies tool chain modifications, like library descriptions, to exploit the execution of the coupled specialized hardware. In addition, although this approach can be optimized for performance and area, such platforms are costly to design and not programmable, making upgradability a difficult task, and they bring no benefit outside the targeted applications.
Unlike heterogeneous architectures, homogeneous ones aim at the general-purpose processing market, handling a wide range of application behaviors by replicating general-purpose processors. Commercial homogeneous architectures still use only homogeneous organizations, coupling several processing elements with the same ISA and the same processing capability. Heterogeneous organizations have not been used on homogeneous architectures, since power management techniques, like DVFS, provide variable processing capability. However, most of these techniques are restricted to reducing only dynamic power; the circuit still consumes leakage power, which is increasing with technology scaling. Even supposing a perfect power management scheme that addresses both dynamic and leakage power, the platform with homogeneous architecture and organization still relies on a huge area overhead, which supports the need for a strategy with homogeneous architecture and heterogeneous organization.

2.3 Multi-Threaded Reconfigurable Systems
As the scope of this work is motivated by multiprocessing systems that use some kind of adaptability when exploiting instruction level parallelism, this sub-section only covers state-of-the-art research that employs multiprocessing systems together with reconfigurable architectures.
In (KOENIG, BAUER, et al., 2010), the authors propose KAHRISMA, a platform with heterogeneous organization and heterogeneous architecture. Figure 6 shows an overview of the KAHRISMA architecture; the coupling of multiple instruction sets (RISC, 2- and 6-issue VLIW, and EPIC) with fine- and coarse-grained reconfigurable encapsulated data path elements (EDPE) is the main novelty of this research. The resource allocation task is totally supported by a flexible software framework that, at compile time, analyzes the high-level C/C++ source code and builds an internal code representation. This code representation goes through an optimization process that performs dead code elimination and constant propagation. Afterwards, the internal representation is used to identify and select the parts of code that will implement custom instructions (CIs) to be executed in the reconfigurable arrays (FG- and CG-EDPE). The entire process considers that the amount of free hardware resources can vary at run time, since some parts of code could present a greater number of concurrently executing threads than others, so multiple implementations of each custom instruction are provided. The runtime system is responsible for selecting the best CI implementation, which depends on the loading state of the architecture. Thus, the execution of a certain part of code can vary from a RISC implementation (low performance) to a custom instruction implementation using FG- as well as CG-EDPEs (high performance). Speedups are shown in the execution of a very compute-intensive kernel from the H.264 video encode/decode standard when multiple ISAs are explored and the multithreaded scenario is considered. However, this approach fails at several crucial constraints of embedded systems. High memory usage is caused by generating multiple assembly versions of the same part of code, which may not always offer speedups due to the restricted amount of hardware resources available at a certain time. KAHRISMA is able to optimize multi-threaded applications; however, it also relies on compiler support, static profiling and a tool to associate the code or custom instructions with the different hardware components at design time. Despite inserting reconfigurable components in its platform, KAHRISMA maintains the main drawbacks of current embedded multiprocessing systems (e.g. OMAP), since it carries a mandatory time overhead on each platform change to produce custom instructions, affecting software productivity by breaking binary compatibility.


Figure 6. KAHRISMA architecture overview (KOENIG, BAUER, et al., 2010)
Considering a system with homogeneous architecture and heterogeneous organization, one can find Thread Warping (TW) (STITT e VAHID, 2007), which extends the aforementioned Warp Processing system shown in Section 2.1. Prior work had developed a CAD algorithm that dynamically remaps critical code regions of single-threaded applications from processor instructions to FPGA circuits using runtime synthesis. The contribution of TW consists of integrating the existing CAD algorithm into a framework capable of dynamically synthesizing many thread accelerators. Figure 7 overviews the TW architecture and shows how the acceleration process occurs. As can be seen, TW is composed of four ARM11 cores, a Xilinx Virtex IV FPGA and on-chip CAD hardware used for the synthesis process. The thread creation process shown in Step 1 of Figure 7 is totally supported by an Application Programming Interface (API), so no source code modification is needed. However, changes in the operating system are mandatory to support the scheduling process. The operating system scheduler maintains a queue that stores the threads ready for execution. In addition, a structure named schedulable resource list (SRL) holds the list of free resources. Thus, to trigger the execution of a thread, the operating system checks whether the resource requirements of a certain ready thread match the free resources in the SRL. One ARM11 is totally dedicated to running the operating system tasks needed to synchronize threads and to schedule their kernels in the FPGA (Step 2 of Figure 7). The framework, implemented in hardware, analyzes waiting threads and utilizes on-chip CAD tools to create custom accelerator circuits for execution in the FPGA (Step 3). After some time, on average 22 minutes, the CAD tool finishes mapping the accelerators onto the FPGA and stores the custom accelerator circuits in a non-volatile library, named AccLib in Figure 7, for future executions. Assuming that the application has not finished during these 22 minutes, the operating system (OS) begins scheduling threads onto both FPGA accelerators and microprocessor cores (Step 4). Since the area requirements of the existing accelerators could exceed the FPGA capacity, a greedy knapsack heuristic is used to generate a solution for the instantiation of the accelerators in the FPGA.


Figure 7. Overview of the Thread Warping execution process (STITT e VAHID, 2007)
Despite its dynamic nature, which provides binary compatibility, there are several drawbacks in the Thread Warping proposal. First, there is the unacceptable latency of the CAD flow for applications that run for less than 22 minutes. TW shows good speedups (502 times) when the initial execution of the applications is not considered. In other words, these results do not consider the period when the CAD tool is working to create the custom accelerator circuits. When the custom instruction creation overhead is taken into account, all but one of the ten algorithms show performance loss. Summarizing, Thread Warping presents the same deficiency as the original work shown in Section 2.1: only critical code regions are optimized, due to the high overhead in time and memory imposed by the dynamic detection hardware. Thus, TW only optimizes applications with few and well-defined kernels, which narrows its field of application. The optimization of a few kernels will very likely not satisfy the performance requirements of future embedded systems, where a high concentration of different software behaviors is foreseen (SEMICONDUCTORS, 2009).
In (YAN, WU, et al., 2010), Yan proposes the coupling of many FPGA-based reconfigurable processing units to SparcV9 general-purpose processors. ISA extensions are made to support the execution on the reconfigurable processing units. However, the system can also work without using the accelerators, in a backward-compatible manner. The reconfigurable architecture overview is shown in Figure 8. As can be seen, a crossbar is employed to connect the reconfigurable processing units to the homogeneous SparcV9-based processors, which provides low-latency parallel communication.


Figure 8. Blocks of the Reconfigurable Architecture (YAN, WU, et al., 2010)
The Reconfigurable Processing Unit (RPU) is a data-driven computing system, based on a fine-grained reconfigurable logic structure similar to the Xilinx Virtex-5. The RPU is composed of configurable logic block arrays to synthesize the logic; a local buffer responsible for the communication between the RPU and the SparcV9 processors; a configuration context that stores the already implemented custom instructions; and a configuration selection multiplexer that selects the fetched custom instructions from the configuration context. As in Thread Warping, this approach also employs an extra circuit to provide consistency and synchronization of data memory accesses. A software-hardware cooperative implementation is used to support the triggering of the reconfigurable executions. The execution is divided into four phases: configuring, pre-load, processing and post-store. The configuring phase starts when a special instruction, which requests an RPU execution, arrives at the execution stage of the SparcV9 processor. If the custom instruction is available in the configuration context, the pre-load phase starts and an interrupt is generated to notify the operating system scheduler to configure the RPU with the configuration context. In this phase, the data required for the computation are also loaded into the local buffer of the respective RPU. In the processing phase the data-driven computing is done. Finally, some special instructions are fired to fetch the results from the local buffer and to return the execution to the SparcV9 processor. This approach improves the performance over software-only execution by, on average, 2.4 times in an application environment composed of an encryption standard and an image encoding algorithm. However, some implementation aspects make such an approach not viable for the embedded domain: binary compatibility is broken, since a compilation phase is used to extend the original SparcV9 instruction set to support the RPU execution. Moreover, the fine-grained reconfigurable structure relies on a high reconfiguration overhead, which narrows the scope of such an approach to applications where very few kernels cover almost the whole execution time.
Differently from the other approaches, (SMIT, 2008) presents a multiprocessing reconfigurable architecture focused on accelerating streaming DSP applications. The authors argue that it is easier to control the reconfigurable architecture when handling this kind of application, since most of them can be specified as a data flow graph with streams of data items (the edges) flowing between computation kernels (the nodes). The Annabelle SoC is presented in Figure 9; its heterogeneous architecture and organization aggregates a traditional ARM926 that is surrounded by ASIC blocks (e.g. Viterbi Decoder and DDC) and four domain-specific coarse-grained reconfigurable data paths, named Montium cores. A network-on-chip infrastructure supports inter-Montium communication with high bandwidth and multiple concurrent transmissions. The communication among the rest of the system elements is done through a 5-layer AMBA bus. As each processor operates independently, they need to be controlled separately, so the ARM926 processor controls the other cores by sending configuration messages to their network interfaces. Since the cores might not be running at the same clock speed as the NoC, the network interface synchronizes the data transfers.

Figure 9. Block Diagram of the Annabelle SoC (SMIT, 2008)
Figure 10 depicts the architecture of a single Montium Core, which has five 16-bit-wide arithmetic and logic units interconnected to 10 local memories due to the high bandwidth required by DSP applications. An interesting point considered in this work is the locality of reference. In other words, accessing a small and local memory is much more energy efficient than accessing a big and distant memory, because of the increasing wire capacitance in recent nano-technologies. There is a communication and configuration unit that provides the functionality to configure the Montium, to manage the memories by means of direct memory access (DMA), and to start/wait/reset the computation of the configured algorithm. Since the Montium core is based on a coarse-grained reconfigurable architecture, the configuration memory is relatively small: on average, it occupies only 2.6 Kbytes. Because the configuration memory can be accessed as a RAM, the system allows dynamic partial reconfiguration. Results show that energy savings can be achieved just by exploiting locality of reference. In addition, this work supports the use of coarse-grained reconfigurable architectures by demonstrating a lower reconfiguration time overhead. Despite the fact that Annabelle explores a reconfigurable fabric to accelerate streaming applications, this system still relies on a heterogeneous ISA implementation by coupling ASICs to provide energy- and performance-efficient execution. Like OMAP, such an approach affects software productivity, since each new platform requires tool chain modifications.


Figure 10. The Montium Core Architecture
Studies on sharing a reconfigurable fabric among general-purpose processors are shown in (WATKINS, CIANCHETTI e ALBONESI, 2008) (GARCIA e COMPTON, 2008). These strategies are motivated by the huge area overhead and the non-concurrent utilization of the reconfigurable units by multiple processors. In (GARCIA e COMPTON, 2008), a reconfigurable fabric sharing approach focused on accelerating multithreaded applications is presented. This work exploits a type of parallelism named single program multiple data (SPMD), where each thread instantiation runs the same set of operations on different data. Multiple instantiations of the Xvid encoder are used to emulate this type of parallelism, acting as a digital video recorder that encodes multiple video streams from different channels simultaneously. To avoid low utilization of the reconfigurable hardware kernels, different threads share the already configured reconfigurable hardware kernels. For example, if two instances of Xvid are executing, a single physical copy of each reconfigurable hardware kernel could be shared, so both instances of Xvid can benefit from it. Although the work does not specify any particular reconfigurable hardware design, the Xvid encoder instantiations are synthesized on a Xilinx Virtex-4 FPGA. First experiments show that sharing a single physical copy of each reconfigurable hardware kernel among all Xvid instances performs very poorly due to the frequent contention when accessing the kernels. Thus, the authors conclude that not all kernels can be effectively shared, so they created a modified strategy to provide better kernel allocation. This approach uses the concept of virtual kernels to control the physical kernel allocation. The algorithm uses the following strategy: when an application attempts to access a virtual kernel, the controller first checks whether any instance of the corresponding virtual kernel is already mapped to a physical kernel and whether any other physical kernel is free. If multiple physical kernels are available, one of them will be reserved to execute the virtual kernel, even if another physical kernel is already executing the same virtual kernel. This strategy eliminates the waiting for a busy shared physical kernel, increasing the combined throughput of the Xvid encoder in a multiprocessor system by 95-130% over the software-only execution.
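The allocation rule described above can be summarized with a short sketch. The snippet below is only an illustration of the decision policy as it is described in (GARCIA e COMPTON, 2008); the class, attribute and function names are hypothetical and do not come from the original work, whose controller is implemented in hardware.

class PhysicalKernel:
    def __init__(self):
        self.loaded = None  # name of the virtual kernel currently configured, or None

def allocate(virtual_kernel, physical_kernels):
    """Return the physical kernel a thread should use, or None if it must wait."""
    # Physical kernels already configured with the requested virtual kernel.
    mapped = [pk for pk in physical_kernels if pk.loaded == virtual_kernel]
    # Physical kernels currently holding no configuration (free).
    free = [pk for pk in physical_kernels if pk.loaded is None]

    if free:
        # Prefer reserving a free physical kernel, even if another copy of the
        # same virtual kernel is already running: this avoids waiting on a busy
        # shared kernel and raises the combined throughput.
        pk = free[0]
        pk.loaded = virtual_kernel
        return pk
    if mapped:
        # Otherwise fall back to sharing an already configured copy
        # (the requesting thread may have to wait for it to become idle).
        return mapped[0]
    return None  # no resource available: the request is queued

Under this sketch, a request only blocks when every physical kernel is busy with some other configuration, which mirrors the contention-avoidance argument made by the authors.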


Watkins (WATKINS, CIANCHETTI e ALBONESI, 2008) proposes, as a first work, a shared specialized programmable logic (SPL) to decrease the large power and area costs of an FPGA when a multiprocessing environment is considered. The main motivation for applying such an approach in multiprocessing systems is the intermittent use of the reconfigurable fabric: there are inevitably periods where one fabric is highly utilized while another lies largely or even completely idle. The motivation is supported by experiments that show the poor utilization of the SPL rows when running applications of different domains in a multiprocessing system composed of eight cores. These data are depicted in Figure 11. The leftmost bars for the individual benchmarks show the utilization of a fabric with a 26-row configuration, which corresponds to twice the area of the core to which the SPL is coupled. The utilization of seven of the SPL fabrics is less than 10%, and the average SPL utilization is only 7%. As can be seen in Figure 11, reducing each SPL to 12 rows (roughly the same area as the coupled core) increases the SPL utilization for some benchmarks and greatly reduces the occupied area. However, this comes at a high cost: an 18% overall performance loss, since all benchmarks use more than 12 rows. The two rightmost bars of Figure 11 show a spatially shared SPL organization with a naive control policy that equally divides the rows of the SPL among all cores at all times. Thus, an SPL fabric configuration composed of 24 rows shared among four cores (fourth bar of AvgUtilization in Figure 11) produces, on average, a utilization improvement of the fabrics, delivers the same performance as the 26-row private configuration, and still reduces the area and peak power cost by over four times.

Figure 11. Fabric utilization considering many architecture organizations (WATKINS, CIANCHETTI e ALBONESI, 2008)
The fine-grained reconfigurable cell of this approach is shown in Figure 12 (a). The SPL fabric is tightly integrated to the processor, working as an additional functional unit. The main components of an SPL cell are: a 4-input look-up table (4-LUT); a set of two 2-LUTs plus a fast carry chain to compute carry bits, or other logic functions if carry calculation is not needed; barrel shifters to align data as necessary; flip-flops to store the results of computations; and an interconnect network between each row. These b-bit cells are arranged in a row to form a c×b-bit row, as shown in Figure 12 (b). Each cell in a row can perform a different operation, and a number of these rows are grouped together to execute an application function. Each row completes its operation in a single SPL clock cycle.


Figure 12. (a) SPL cell architecture (b) Interconnection strategy (WATKINS, CIANCHETTI e ALBONESI, 2008)
As explained before, each SPL row is dynamically shared among the cores. Two sharing strategies are used: spatial, where the shared fabric is physically partitioned among multiple cores (Figure 13(a)); or temporal, where the fabric is shared in a time-multiplexed manner (Figure 13(b)). The spatial and temporal control policies bind the cores to particular SPL partitions, or pipeline time slots, based on runtime statistics. ISA extensions support the proposed sharing strategy.

Figure 13. (a) Spatial sharing (b) Temporal sharing (WATKINS, CIANCHETTI e ALBONESI, 2008)
The grain of the sharing is a challenge that arises when the spatial approach is considered. The finest grain (a row) requires a large number of intermediate multiplexers, but provides the highest flexibility for the sharing allocation mechanism, since each sharer can vary its number of rows by the finest grain. The authors argue, after an investigation, that by splitting the fabric in powers of two one can achieve a good flexibility/utilization tradeoff. If, for instance, there are 9-16 sharers, the SPL will be split into sixteen partitions. The authors also propose a policy for merging SPL partitions, based on an idle cycle counter and an idle count threshold value, which in the current implementation is 1000. For temporal sharing, the SPL scheduling strategy uses a cycle-by-cycle round-robin algorithm to allocate the SPL fabric among the cores. The coupling of a single private 26-row SPL to a single in-order core shows interesting speedups when running a mixed application workload, demonstrating that the adaptability provided by the reconfigurable architecture is suitable for single-threaded applications. In addition, a CMP environment composed of eight copies of one-way out-of-order cores with 26-row private SPLs outperforms an 8-core 4-way out-of-order chip multiprocessor, and the latter consumes far more area and power than the former setup. When the spatial sharing policy is applied, the 26-row shared SPL outperforms the 6-row private SPL in most of the benchmarks, reducing the energy-delay product by up to 33% with little performance degradation. In particular, when running the crypt application, which requires a large number of rows (precisely 298 rows), the spatial sharing approach presents a 100% performance slowdown and a larger energy-delay penalty. The authors report that temporal sharing outperforms spatial sharing because all benchmarks but crypt need a maximum of 26 rows for all functions, and because there are no significant periods where the benchmarks make concurrent accesses to the SPL.
In (ALBONESI e WATKINS, 2010) the previously explained work is extended, proposing hardware-based fine-grained inter-core communication and barrier synchronization. Now, the entire system is named Reconfigurable Multicore Architecture for Parallel Processing (ReMAPP). The inter-core communication is established by the use of queues: the producing thread places data into the queue and the consuming thread reads data from the queue. Figure 14 summarizes the inter-thread communication. In the first step (Figure 14(a)), the producing thread places data into its input queue. Once all necessary data is loaded (Figure 14(b)), the consuming thread starts the execution in the SPL. In this example, since the execution of the function does not occupy all SPL rows, the results are bypassed to the output queue of the consuming thread (Figure 14(c)). Finally, the consuming thread fetches data from the queue and stores it in memory. A special table is used to maintain a mapping of the threads, storing, for each computation, its corresponding destination core.

Figure 14. Thread Intercommunication steps (ALBONESI e WATKINS, 2010)


The barrier synchronization mechanism is also based on tables. To determine that all threads have arrived at the barrier, each SPL maintains a table that contains information related to each activated barrier. Each table contains as many entries as there are cores attached to ReMAPP, as each thread could be participating in a different barrier. The table keeps track of the total number of threads, the number of threads that have already arrived, and the number of cores that are participating in that part of the code execution. Special instructions, named SPL barrier instructions, are implemented to provide the synchronization. SPL barrier instructions must not be issued to the fabric until all participating cores have arrived at the respective barrier. To achieve this, all participating threads compare the number of arrived threads with the number of participating cores; when these numbers become equal, all participants have arrived at the corresponding barrier and the execution can proceed. When compared to the single-threaded SPL implementation, the SPL computation and communication mechanism using two threads improves the performance by 2 times and still provides a better energy-delay product. In addition, performing barriers via ReMAPP improves performance over software barriers by 9%, while achieving an up to 62% better energy-delay product.
Summarizing, there are several works exploring the adaptability provided by a reconfigurable fabric to accelerate multithreaded applications. However, their implementations bring particular aspects that affect, in some way, the development process of embedded systems. These aspects are the following:
 Despite its heterogeneous architecture fashion that accelerates multithreaded applications, KAHRISMA relies on special tools to generate the binary code, which breaks binary compatibility and affects software productivity when platform changes are needed.
 Despite its dynamic nature in detecting/accelerating parts of the application code, Thread Warping relies on an unacceptable latency to perform this task, which restricts its employment to optimizing applications with few and well-defined kernels.
 Despite the good speedups shown when applying the strategy proposed in (YAN, WU, et al., 2010), such an implementation breaks binary compatibility and affects software productivity when platform changes are needed. Moreover, such an approach relies on a high reconfiguration overhead, which makes it feasible only to accelerate applications with few kernels.
 Despite Annabelle demonstrating a lower reconfiguration time, this work explores a reconfigurable fabric to accelerate only streaming applications and still relies on a heterogeneous ISA implementation by coupling ASICs to provide energy- and performance-efficient execution. Like commercial strategies such as OMAP, this approach affects software productivity, since each new platform forces tool chain modifications.
 Despite its great area savings with the employment of a shared reconfigurable fabric strategy, ReMAPP relies on compiler support, static profiling and a tool to associate the code or custom instructions with the different hardware components at design time, not maintaining binary compatibility and affecting software productivity.


2.4 The Proposed Approach In this work, we address the particular drawbacks of the aforementioned approaches by creating Custom Reconfigurable Arrays for Multiprocessor System (CReAMS) that:

 Unlike all strategies presented in Section 2.1, CReAMS explores the adaptability of a reconfigurable system to achieve performance in a multithreaded environment. The Kal-el project provides dynamic performance changes when switches from the "Companion Core" to the regular cores occur. However, that work provides neither ILP nor TLP adaptability, since all cores have the same architecture and organization.
 Unlike (KOENIG, BAUER, et al., 2010) (SMIT, 2008), CReAMS builds a platform that is homogeneous in both architecture and organization, which eases the software development process since a single tool chain is provided for any new version launched. Neither source code modifications nor a new library learning process is necessary to explore the processing capabilities of the newest inserted processing elements.
 Unlike (KOENIG, BAUER, et al., 2010) (YAN, WU, et al., 2010) (SMIT, 2008) (ALBONESI e WATKINS, 2010), CReAMS does not rely on a special and particular tool chain to extract thread-level parallelism and to prepare the platform for execution. Our approach employs well-known application programming interfaces (e.g. OpenMP), which, besides automatically extracting TLP through a friendly interface, are coupled to most commercial and academic compilers (e.g. gcc and icc), which makes the software development and binary generation process easier than in the aforementioned approaches.
 Unlike (ALBONESI e WATKINS, 2010) (STITT e VAHID, 2007) (YAN, WU, et al., 2010), instead of exploring the flexibility of fine-grained architectures, CReAMS employs a coarse-grained reconfigurable fabric that reduces the reconfiguration time and memory footprint due to the low context overhead. This grain choice widens the field of applications, since it opens room to accelerate the entire application code. Fine-grained architectures provide high acceleration levels, but their scope is narrowed to applications that have few kernels responsible for a large part of the execution time.
 Unlike (WATKINS, CIANCHETTI e ALBONESI, 2008) (ALBONESI e WATKINS, 2010), instead of employing a complex hardware design to share the reconfigurable fabric among several processors to reduce area and power costs, CReAMS proposes a heterogeneous organization platform, which can also achieve the same area savings and power consumption as the sharing policies. However, as will be shown, some applications rely on a thread allocation strategy to achieve the same performance as the homogeneous organization, since the best match between the performance requirements of a certain thread and the different processing capabilities of the available processors must be found.
Table 2 summarizes the characteristics of the single- and multi-threaded reconfigurable architectures considering the requirements of current embedded system designs. CReAMS provides benefits in all characteristics; its energy consumption will be explored in the rest of this work.


Table 2. Characteristics of Multi and Single Threaded Reconfigurable Architectures
(V = requirement met, X = requirement not met, ? = investigated in the rest of this work)

                                     Embedded Systems Requirements
                         SW Behavior    SW             Design    Energy
                         Coverage       Productivity   Time      Consumption
Single Threaded / Static
  Piperench              X              X              X         V
  Chimaera               X              X              X         V
  GARP                   X              X              X         V
  TRIPS                  X              X              X         V
  Wavescalar             X              X              X         V
  Molen                  X              X              X         V
Single Threaded / Dynamic
  CCA                    X              V              V         V
  Warp                   X              V              V         V
  DIM                    X              V              V         V
Multi Threaded / Static
  KAHRISMA               V              X              X         V
  Annabelle              V              X              X         V
  ReMAPP                 V              X              X         V
Multi Threaded / Dynamic
  Thread Warping         V              X              X         V
  CReAMS                 V              V              V         ?


3 ANALYTICAL MODEL

In this chapter, we investigate the potential of each kind of parallelism exploitation by modeling a multiprocessing architecture (MP – Multi-Processor). The considered architecture is composed of many simple and homogeneous cores without any capability of exploiting instruction level parallelism (ILP). This way we can elucidate the advantages of thread level parallelism (TLP) exploitation. We also compare its execution time (ET) to a high-end single processor (SHE – Single High-End) model, which is able to exploit only the ILP available in applications. First, we consider different amounts of fine-grained (instruction) and coarse-grained (thread) parallelism available in the application code, without modeling any latency of the interconnection infrastructure. This approach aims at investigating the performance potential of both aforementioned architectures. Afterwards, we create a latency model of a Network-on-Chip to verify the impact of inter-thread communication on the multiprocessing systems. Considering a portion of a certain application code, we classify it in four different ways:
 α – the fraction of instructions that can be executed in parallel in a single processor;
 β – the fraction of instructions that cannot be executed in parallel in a single processor;
 δ – the fraction of instructions that can be distributed among the processors of the multiprocessor environment;
 γ – the fraction of instructions that cannot be split and, therefore, must be executed in one of the processors of the multiprocessor environment.

Figure 15 exemplifies how the previously stated classification would be applied to a certain application "A". In the example shown, when the application is executed on the multiprocessor system (Figure 15(a)), 70% of the application code can run in parallel to some degree (i.e., be divided into threads) and be executed on different cores at the same time, so δ = 0.7 and γ = 0.3. On the other hand, when the very same application A is executed on the high-end single processor (Figure 15(b)), 64% of the instructions can be executed in parallel to some degree, so α = 0.64 and β = 0.36.
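As a concrete illustration of this classification, the small sketch below derives the four fractions from hypothetical instruction counts; the numbers are made up to match the example of Figure 15 and are not taken from a real profiling run.

# Hypothetical instruction counts for application "A" (illustrative only).
total_instructions = 100

# Thread-level view (multiprocessor model): instructions that can be
# distributed among cores versus those that must stay on a single core.
distributable = 70
delta = distributable / total_instructions            # 0.70
gamma = 1.0 - delta                                    # 0.30

# Instruction-level view (high-end single-processor model): instructions
# that can be issued in parallel versus strictly sequential ones.
parallel_issuable = 64
alpha = parallel_issuable / total_instructions         # 0.64
beta = 1.0 - alpha                                      # 0.36

print(f"delta={delta}, gamma={gamma}, alpha={alpha}, beta={beta}")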


Figure 15. Modeling of the (a) Multiprocessor System and the (b) High-End Single-Processor. (In the figure, the multiprocessor view splits application "A" into threads T1–T7, labeling the parallel code fraction δ and the sequential fraction γ; the single-processor view splits the same instruction stream into parallel code (α) and sequential code (β), with sequential segments S1–S3.)

3.1 Performance Comparison
Let us start with the basic equation relating execution time (ET) to the number of instructions:

$ET = \#Instructions \times CPI \times CycleTime \quad (1)$

where CPI is the mean number of cycles necessary to execute an instruction, and CycleTime is the clock period of the processor. This model does not consider information about cache accesses or disk performance. However, although simple, it can provide interesting performance clues on the potential of multiprocessing architectures and of aggressive instruction level parallelism exploitation for a wide range of different application classes.
3.1.1 Low End Single Processor
Based on equation (1), for a Low-End Single processor (SLE – Single Low End), the execution time can be written as:

$ET_{SLE} = \#Instructions \times (\alpha + \beta) \times CPI_{SLE} \times CycleTime_{SLE} \quad (2)$

Since the low-end processor is a single-issue processor, it cannot exploit ILP. Therefore, classifying instructions as either α or β as previously stated does not make sense: in this case, α is zero and β is equal to one, but we will keep the notation and their meaning for comparison purposes.
3.1.2 High End Single Processor
In the case of a high-end ILP exploitation architecture, based on equations (1) and (2), one can state that the Execution Time of the High-End Single Processor (ET_SHE) is given by the following equation:

$ET_{SHE} = \#Instructions \times (\alpha + \beta) \times CPI_{SHE} \times CycleTime_{SHE} \quad (3)$

The coefficients α and β refer to the percentage of instructions that can be executed in parallel or not, respectively (this way, α + β = 1). CycleTime_SHE represents the clock cycle time of the high-end single processor.
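To make the use of equation (3) concrete, consider an illustrative workload of 10^9 instructions with α = 0.64 and β = 0.36, a CPI_SHE of 0.62 and a 1 GHz clock (CycleTime_SHE = 1 ns); these values are assumptions chosen only for the sake of the example:

$ET_{SHE} = 10^{9} \times (0.64 + 0.36) \times 0.62 \times 1\,\text{ns} = 0.62\,\text{s}$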


The CPI_SHE is usually smaller than 1, because a single high-end processor can exploit high levels of ILP, thanks to the replication of functional units, branch prediction, mechanisms to handle false data dependencies and so on. A typical value of CPI_SHE for a current high-end single processor is 0.62 (GUTHAUS, RINGENBERG, et al., 2002), which shows that more than one instruction can be issued and executed per cycle. The CPI_SHE can also be expressed in terms of α/issue, where issue is the number of instructions that can be issued in parallel to the functional units, when considering the average situation (i.e., a High-End Single processor would have the same CPI as the CPI of a Low-End Processor divided by the mean number of instructions issued per cycle). Thus, based on equation (3), one gets:

$ET_{SHE} = \#Instructions \times \left(\frac{\alpha}{issue} + \beta\right) \times CPI_{SLE} \times CycleTime_{SHE} \quad (4)$

Having stated the equations to calculate the performance of both the high-end and the low-end single processor models, the potential of using a homogeneous multiprocessing architecture to exploit TLP is now studied. We consider that such an architecture is built by the replication of low-end processors (which do not exploit ILP), so that a large number of them can be integrated within the same die. If one considers that each application has a certain number of sequences of instructions that can be split (transformed into threads) to be executed on several processors, one can write the following equation, based on equations (1) and (2):

$ET_{MP} = \#Instructions \times \left(\frac{\delta}{P} + \gamma\right) \times (\alpha + \beta) \times CPI_{SLE} \times CycleTime_{SLE} \quad (5)$

where δ is the amount of sequential code that can run in parallel (i.e. be transformed into multithreaded code), while γ is the part of the code that must be executed sequentially (so no TLP is exploited). P is the number of low-end processors available in the chip. As can be observed in the second term of equation (5), because the single low-end processor is considered, the multiprocessor architecture does not exploit ILP (α = 0 and β = 1). Therefore, when one increases the number of processors P, only the portion of code that presents TLP (δ) benefits from the extra processors.
3.1.3 High-End Single Processor versus Homogeneous Multiprocessor Chip
First, we compare the performance of the high-end single processor to the multiprocessor architecture disregarding the communication overhead among the threads. This scenario demonstrates the potential results of the multiprocessing systems against a superscalar processor. Since power is crucial in an embedded system design, we have chosen a given total power budget as a fair factor under which to compare the performance of both designs. Thus, based on equations (3) and (5), one can consider the following equation:

$\frac{ET_{SHE}}{ET_{MP}} = \frac{\#Instructions \times \left(\frac{\alpha}{issue} + \beta\right) \times CPI_{SLE} \times CycleTime_{SHE}}{\#Instructions \times \left(\frac{\delta}{P} + \gamma\right) \times (\alpha + \beta) \times CPI_{SLE} \times CycleTime_{SLE}} \quad (6)$

If one considers that, in the model of the multiprocessor environment, a single low-end processor is not capable of exploiting instruction level parallelism, and therefore α = 0 and β = 1 in the denominator, one can reduce equation (6) to:


$\frac{ET_{SHE}}{ET_{MP}} = \frac{\left(\frac{\alpha}{issue} + \beta\right) \times CycleTime_{SHE}}{\left(\frac{\delta}{P} + \gamma\right) \times (0 + 1) \times CycleTime_{SLE}} \quad (7)$

and, by simplifying (7), one obtains

$\frac{ET_{SHE}}{ET_{MP}} = \frac{\left(\frac{\alpha}{issue} + \beta\right) \times CycleTime_{SHE}}{\left(\frac{\delta}{P} + \gamma\right) \times CycleTime_{SLE}} \quad (8)$

We are also considering that, as a homogeneous multiprocessor design is composed of several low-end processors with a very simple organization, those processors could run at much higher frequencies than a single and complex high-end processor. Therefore, we will assume that

$\frac{1}{CycleTime_{SLE}} = K \times \frac{1}{CycleTime_{SHE}}, \quad (9)$

where K is the frequency adjustment factor used to equalize the power consumption of the homogeneous multiprocessor with that of the high-end single processor. By merging and simplifying equations (8) and (9), one gets:

$\frac{ET_{SHE}}{ET_{MP}} = K \times \left[\frac{\frac{\alpha}{issue} + \beta}{\frac{\delta}{P} + \gamma}\right] \quad (10)$

According to equation (10), a machine based on a high-end single core will be faster than a multiprocessor-based machine if the right-hand side of equation (10) is smaller than 1. This equation also shows that, although the multiprocessor architecture with low-end simple processors could have a faster cycle time (by a factor of K), that factor alone is not enough to attain performance, as demonstrated by the second term, in brackets, of equation (10). Since the high-end processor can execute many instructions in parallel, better performance improvements can be obtained as long as ILP, instead of TLP, is the dominant factor. To better illustrate this point, let us imagine the extreme case P = ∞, meaning that infinite processors are available. In addition, recall that, since the multiprocessor design is composed of low-end processors that do not exploit ILP, its α is always zero and has already been removed from the equation. Therefore, equation (10) reduces to:

$\frac{ET_{SHE}}{ET_{MP}} = K \times \left[\frac{\frac{\alpha}{issue} + \beta}{\gamma}\right] \quad (11)$

Let us consider that the execution of the very same application on both the multiprocessor and the single high-end architectures presents exactly the same amount of sequential code, so β = γ. In this case, the operating frequency (given by the K factor) will determine which architecture runs faster if the issue width of the high-end superscalar processor also tends to infinity. In another example, if one applies equation (11) in a scenario where an application presents 10% of sequential code (β = γ = 0.1) and it is executing on a four-issue high-end single processor, the operating frequency of the four-issue high-end single processor should be 3.2 times (K = 0.31) greater than that of the multiprocessor to achieve the same execution time. On the other hand, if that application now presents 90% of sequential code (β = γ = 0.9), the high-end single processor should run only 20% (K = 0.8) faster than the multiprocessor design. With these corner cases, one can conclude that when the applications are massively sequential, both architectures, operating at the same frequency, will present almost the same performance, regardless of the number of processors in the multiprocessor system. For applications with a huge amount of parallel code, complex single processors must run at higher frequencies than multiprocessor systems.
3.1.4 Applying the Performance Modeling in Real Processors
Given the analytical model, one can briefly experiment with it using numbers based on real data. Let us consider a high-end single core, a 4-issue SPARC64 superscalar processor with CPI equal to 0.6 (GUTHAUS, RINGENBERG, et al., 2002), and a multiprocessor design composed of low-end single-issue TurboSPARC processors with CPI equal to 1.3 (GUTHAUS, RINGENBERG, et al., 2002). A comparison between both architectures is done using the equations of the aforementioned analytical model. In addition, we consider that the TurboSPARC has 5,200,000 transistors (MICROELECTRONICS), and that the SPARC64 V design (DIEFENDORFF, 1999) requires 180,000,000 transistors to be implemented. For the multiprocessing design we add 37% of area overhead due to the intercommunication mechanism (INTEL, 2007). Therefore, aiming to make a fair performance comparison between the high-end single core and the multiprocessor system, we have devised an 18-Core design composed of low-end processors that has the same area as the 4-issue superscalar processor and consumes the same amount of power. Figure 16 shows, in logarithmic scale, the performance of the superscalar processor, when the parameters α and β change, and the performance of the many in-order TurboSPARC cores, when δ and γ and the number of processors (from 8 to 128) vary. The x-axis of Figure 16 represents the amount of instruction- and thread-level parallelism in the application, considering that the α factor is only valid for the superscalar processor, while δ is valid for all the multiprocessing systems' setups. The goal of this comparison is to demonstrate which technique better explores its particular parallelism type at different levels, considering six values for both ILP and TLP. For instance, δ = 0.01 means that a hypothetical application only shows 1% of TLP available within its code (in the case of the multiprocessing systems). In the same way, when α = 0.01, it is assumed that only 1% of the total number of instructions can be executed in parallel on the superscalar processor. In these experiments, we considered the same power budget for the high-end single core and the multiprocessor approaches. In order to normalize the power budget of both approaches, we tuned the adjustment factor K of equation (9). For that, we fixed the power consumption of the 4-issue superscalar to use it as the reference, changing the operating frequency (K factor) of the remaining approaches to achieve the same power consumption. Thus, the operating frequency of the 8-Core multiprocessing system must be 3 times higher than that of the 4-issue superscalar processor.
For the 18-Core setup, the operating frequency must be 25% higher than the reference value. Since a considerable number of cores is employed in the 48-Core setup, it must run 2 times slower than the superscalar processor to operate under the same power budget. Finally, the operating frequency of the 128-Core design must be 5.3 times lower than that of the superscalar.
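A minimal sketch of this performance model is given below. It implements equations (4), (5) and (9) and sweeps the parallelism fraction as in Figure 16. The CPI value and the K factors come from the text above; the instruction count and the α = δ sweep points are illustrative assumptions, not the exact values used to generate the thesis results.

def et_she(n_instr, alpha, issue, cpi_sle, cycle_she):
    """Equation (4): high-end single processor execution time."""
    beta = 1.0 - alpha
    return n_instr * (alpha / issue + beta) * cpi_sle * cycle_she

def et_mp(n_instr, delta, p, cpi_sle, cycle_she, k):
    """Equations (5) and (9): homogeneous multiprocessor execution time.
    The low-end cores do not exploit ILP (alpha = 0, beta = 1), and their
    cycle time is the high-end cycle time divided by the factor K."""
    gamma = 1.0 - delta
    cycle_sle = cycle_she / k
    return n_instr * (delta / p + gamma) * cpi_sle * cycle_sle

# Illustrative parameters (assumptions for the sketch only).
N = 1_000_000_000                                   # instructions
CPI_SLE = 1.3                                       # single-issue TurboSPARC
CYCLE_SHE = 1.0                                     # normalized high-end clock period
SETUPS = {8: 3.0, 18: 1.25, 48: 0.5, 128: 1 / 5.3}  # cores -> K factor (from the text)

for par in (0.01, 0.1, 0.25, 0.5, 0.75, 0.99):      # alpha = delta sweep
    she = et_she(N, alpha=par, issue=4, cpi_sle=CPI_SLE, cycle_she=CYCLE_SHE)
    mps = {p: et_mp(N, delta=par, p=p, cpi_sle=CPI_SLE,
                    cycle_she=CYCLE_SHE, k=k) for p, k in SETUPS.items()}
    print(par, she, mps)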

Figure 16. Multiprocessor system and Superscalar performance regarding a power budget using different ILP and TLP; α = δ is assumed. (The plot shows execution time, on a logarithmic scale, versus the parallelism percentage (α or δ) ranging from 0.01 to 0.99, for the 4-issue superscalar and for the 8-, 18-, 48- and 128-core MPSoC designs.)

In the leftmost side of Figure 16, one considers any application that has a minimum amount of instruction (α=0.01) and thread (δ =0.01) level parallelism. In this case, the superscalar processor is slower than the 8- and 18- Core designs since the parallelism is insignificant, the higher operating frequency of both multiprocessing system is responsible for faster execution. Moreover, when the application shows higher parallelism levels (α>0.25 and δ>0.25), the 18- and 8-Core better handles the extra TLP available than the superscalar does with the ILP, presenting more performance improvements. So, considering only the 18- Core design, the multiprocessing system achieve better performance with the same area and power budget in the whole spectrum of parallelism available. However, as more cores are added in a multiprocessor design, the overall clock frequency tends to decrease, since the adjustment factor K must be decreased to respect the power budget. Therefore, the performance of applications that present low TLP (small δ) worsens when the number of cores increases. Applications with δ =0.01 in Figure 16 are good examples of this case: performance is significantly affected as the number of cores increases. As another representative example, even when almost the whole application presents high TLP (δ > 0.99), the 128-Core design takes longer than the other multiprocessor designs. Figure 16 concludes that the increasing on the number of cores not always produces a satisfactory tradeoff among energy, performance and area. 3.1.5 Communication Modeling in Multiprocessing Systems As stated the heavy task to outperform the multiprocessing systems’ performance when considering a scenario where there is no communication among threads. Now, we

introduce this issue into the modeling to bring the analytical model closer to a real multiprocessor design. For this purpose, we use the communication strategy shown in (CHEN, LU, et al., 2009). The employment of a 2D-mesh NoC is supported by its better scalability, energy efficiency and area overhead when compared to buses and crossbar interconnections. According to (CHEN, LU, et al., 2009), the communication latency in a 2D-mesh NoC is divided into two parts: minimal latency and contention latency. The minimal latency is calculated from the hop count, which reflects the distance between the two communicating tasks within the NoC. The contention latency depends on the arbitration mechanism and on the way the routing is implemented.

As the amount of data stored in each processor varies for each application, we modeled the corner cases to capture the best and worst cases of the communication overhead. The best case occurs when the data traffic is uniformly distributed among the NoC nodes; the worst case occurs when the data traffic is concentrated from/to a specific node. The latter strategy is named centralized traffic and the former distributed traffic. The average number of hops performed by a single inter-thread communication, denoted H_avg, is used to model the minimal latency of both communication strategies: Equation 12 gives H_avg for distributed traffic and Equation 13 for centralized traffic, both written in terms of the mesh dimension k = √N, where N is the number of processor nodes.

Having the average number of hops for a single communication, we can now state the communication latency (CL) as

CL = H_avg × #Communications × T_NoC    (14)

where T_NoC is the clock period of the Network-on-Chip and #Communications is the total number of communications performed by the processors during the whole application execution. The communications that occur in the execution of a certain application can be divided into:

• local data transfers – the portion of instructions that are data transfers (e.g., load and store instructions) and access local storage. It depends on δ, the fraction of instructions that can be distributed among the processors of the multiprocessor environment. Thus, a value of 0.25 means that 25% of the instructions executed in parallel are data transfers that access local storage.

• remote data transfers – the portion of instructions that are data transfers (e.g., load and store instructions) and access remote storage. It also depends on δ. Thus, a value of 0.25 means that 25% of the instructions executed in parallel are data transfers to remote storage, and 75% of the data transfers access local storage.

To model the number of communications we only need the remote portion, since local transfers do not produce traffic in the NoC. In this way, one gets

#Communications = #Instructions × δ × (fraction of remote data transfers)    (15)
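The sketch below puts Equations 14 and 15 together. The average hop counts are obtained here by enumerating source/destination pairs on a k×k mesh, which is one way to compute H_avg for the two traffic patterns described above (the thesis uses closed-form expressions); the workload numbers, NoC clock period and the choice of a central hot node are illustrative assumptions.

    #include <stdio.h>
    #include <stdlib.h>

    /* Mean Manhattan distance for uniformly distributed traffic on a k x k mesh. */
    static double avg_hops_distributed(int k)
    {
        long hops = 0, pairs = 0;
        for (int sx = 0; sx < k; sx++) for (int sy = 0; sy < k; sy++)
            for (int dx = 0; dx < k; dx++) for (int dy = 0; dy < k; dy++) {
                hops += abs(sx - dx) + abs(sy - dy);
                pairs++;
            }
        return (double)hops / pairs;
    }

    /* Mean distance when all traffic goes to a single node (assumed central). */
    static double avg_hops_centralized(int k)
    {
        int cx = k / 2, cy = k / 2;
        long hops = 0;
        for (int sx = 0; sx < k; sx++) for (int sy = 0; sy < k; sy++)
            hops += abs(sx - cx) + abs(sy - cy);
        return (double)hops / (k * k);
    }

    int main(void)
    {
        int    k            = 4;              /* 16-node mesh (illustrative)      */
        double instructions = 1e9;            /* assumed workload size            */
        double delta        = 0.75;           /* parallel fraction                */
        double remote_frac  = 0.16;           /* remote data-transfer fraction    */
        double t_noc        = 1.0 / 600e6;    /* assumed NoC clock period (s)     */

        double comms = instructions * delta * remote_frac;          /* Eq. 15 */
        printf("CL distributed: %.4f s\n", avg_hops_distributed(k) * comms * t_noc);
        printf("CL centralized: %.4f s\n", avg_hops_centralized(k) * comms * t_noc);
        return 0;
    }

As expected from the discussion above, the centralized pattern yields a larger average hop count and therefore a larger communication latency for the same number of communications.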


To keep the communication model simple, we overlook the contention latency and assume that each router takes one clock cycle to route data from its input to its output. Now, considering the communication latency (CL) in the multiprocessing system, one obtains, based on Equations (5) and (15), the multiprocessor execution time including the communication term (Equation 16).

3.1.6 Applying the Performance Modeling in Real Processors considering the Communication Overhead

We apply the same data presented in Section 3.1.4 to the analytical model considering the communication overhead. We created four different scenarios to expose the communication overhead, with remote data-transfer fractions of 0.16, 0.33, 0.66 and 0.99, meaning that 16%, 33%, 66% and 99% of the instructions executed in parallel produce data traffic in the NoC, respectively. We also considered the distributed and centralized traffic models to calculate H_avg. In these experiments, we considered the same power budget for the high-end single core and the multiprocessor systems. In order to normalize the power budget of both approaches, we had to tune again the operating frequency (K of Equation 9 in Section 3.1.4) of the multiprocessing systems to achieve the same power consumption, since the power of the NoC must now be considered. Thus, the operating frequency of the 8-Core multiprocessing system changes from 3 times higher than that of the 4-issue superscalar processor to 2.7 times. For the 18-Core setup, the operating frequency changes from 25% higher to 13% higher than the superscalar processor. The operating frequency of the 48-Core setup changes from 2 times slower than the superscalar processor to 2.2 times. Finally, the operating frequency of the 128-Core design changes from 5.3 times to 5.9 times slower than the superscalar processor.

Figure 17 shows the execution time of all designs presented in Section 3.1.4 considering a communication rate of 0.16. When applications provide low levels of thread level parallelism (0.01 to 0.25), the communication overhead does not affect the execution time, as can be noticed by comparing Figure 16, where no communication is considered, with Figure 17. However, when the TLP increases (above 0.25), the impact of the communication becomes more evident: the curves of the multiprocessing designs do not fall at the same pace as the curves shown in Figure 16, since more TLP means more communication. As expected, when the data is centralized in one processor, the impact on the execution time is greater than when it is distributed, since the average number of hops is greater in the former case. Even with such a small communication overhead, some conclusions presented in Section 3.1.4 change: the 4-issue superscalar processor outperforms the 18-, 48- and 128-Core designs when the parallelism reaches 0.95 and the application presents centralized data.


Figure 17. Execution time of the different designs considering a communication rate of 0.16 (centralized and distributed traffic)

The multiprocessing designs start losing steam when the communication overhead increases to 33%. The gains obtained by exploiting higher thread level parallelism are hidden by the communication overhead. As can be seen in Figure 18, the 8-Core design provides the same execution time regardless of the available thread level parallelism (δ). Considering this degree of communication, the 4-issue superscalar processor achieves a better execution time than all multiprocessing systems when the parallelism reaches 0.95 and data are centralized. The superscalar processor outperforms the 18-Core design (both designs have the same area and power budget) once 50% of the application can be parallelized, either in instructions or in threads. When data is uniformly distributed among the cores, with 33% of communication overhead, the 8-Core design still competes with the 4-issue superscalar processor.

Figure 18. Execution time of the different designs considering a communication rate of 0.33

When the communication reaches 66% of the parallel instructions, multiprocessing systems with many cores (48 and 128) no longer show gains with increasing thread level parallelism (Figure 19): the curves become almost flat, meaning that applications with a high communication degree are not suitable even for many-core designs. The scenario is more critical for the 8- and 18-Core designs: at this communication degree, increasing the thread level parallelism actually produces performance losses. When 25% of the application can be parallelized, either in instructions or in threads, the 4-issue superscalar processor already outperforms the 18-Core design, and when 65% of parallelism is available at both levels, the superscalar processor also outperforms the 8-Core design.

Figure 19. Execution time of the different designs considering a communication rate of 0.66

Figure 20 shows a scenario in which 99% of the parallel instructions produce communication in the NoC. Although this scenario is unlikely, it was built to show the potential of the superscalar over the multiprocessing designs. As concluded above, multiprocessing systems composed of a huge number of cores are not feasible, under the power budget assumption, for applications with a high communication overhead. Considering the 8-Core design, if an application can have its code parallelized, either in instructions or in threads, up to 50%, the designer should choose the 8-Core design over the 4-issue superscalar processor, regardless of the communication overhead; for parallelism higher than 50%, the designer should select the latter. However, if one considers designs with the same area and power budget, that is, the 4-issue superscalar versus the 18-Core design, the former outperforms the latter from 15% of available parallelism onwards.

Figure 20. Execution time of the different designs considering a communication rate of 0.99


3.2 Energy Comparison

While superscalar processors may present better performance than multiprocessing systems when communication is considered, the scenario is completely different when one measures energy consumption: power in CMOS circuits is proportional to the switching capacitance, to the operating frequency and to the square of the supply voltage. In the simple energy model presented herein, we assume that power is dissipated only in the data path. This is overly optimistic with respect to the power dissipated by a superscalar processor, but it gives an idea of the lower bound of energy dissipation in high-end single processors. The power dissipated by a high-end single processor can be written as

P_SHE = (1 / CPI_SHE) × C × V_SHE^2 × F_SHE    (17)

where C is the switching capacitance of the single-issue processor, V_SHE is the supply voltage the processor is operating at, and F_SHE is its operating frequency. The 1/CPI_SHE term is included to account for the extra power required by the speculation process to sustain performance with a CPI smaller than 1. The energy of the high-end single processor is given by:

E_SHE = P_SHE × T_SHE    (18)

and the power consumed by a homogeneous multiprocessing system is given by

P_MP = (1 / CPI_LE) × C × V_LE^2 × F_LE    (19)

As in the case of the superscalar processor, the term that accounts for the CPI of the single low-end processor has also been included. The energy of the single low-end processor is given by

E_MP = P_MP × T_MP    (20)

Expanding E_SHE and E_MP with Equations 17 to 20 and simplifying, one obtains the ratio between both energies, shown in Equation 21. This expression demonstrates that both approaches unnecessarily spend power when there is no ILP or TLP available, since no power management technique is modeled to reduce the supply voltages (V_SHE and V_LE).

3.2.1 Applying the Energy Modeling in Real Processors

Figure 21 shows the energy results considering the same power budget, as was already done in the performance model. For this first experiment, we do not consider the communication overhead of the multiprocessing environment, which will be modeled later. In addition, we only show the energy of the 8- and 18-Core designs, since the

conclusions for these setups are also valid for the remaining ones. The high-end single processor organization spends more energy than the 18-Core multiprocessor at all levels of available parallelism, since the latter is faster than the former in all cases (Figure 16) while both dissipate the same power. To obey the given power budget, the 8-Core multiprocessor runs 3 times faster than the 4-issue superscalar and the 18-Core multiprocessor. Thus, as the 8-Core design presents a 3 times lower execution time than the 4-issue superscalar, the former spends 3 times less energy. When more parallelism is exposed, the superscalar approaches the 8-Core design, since its execution time decreases. Multiprocessors composed of a significant number of cores present worse performance for applications with low/medium TLP (Figure 16). Consequently, in those cases, and if no power management techniques are considered (e.g., cores being turned off when not used), the energy consumption of such multiprocessor designs tends to be higher than that of designs with fewer cores. As can be seen in Figure 21, the 8-Core multiprocessor consumes less energy than the 18-Core one for low/medium TLP values (δ < 0.75). However, when applications present greater thread level parallelism (δ > 0.9), the energy consumed by the 18-Core multiprocessor reaches the same values as the 8-Core design, thanks to the better usage of the available processors.
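Because all designs are tuned (through K) to dissipate the same power, the energy figures of this subsection reduce to power × execution time, so the energy curves of Figure 21 follow the execution-time curves of Figure 16. The sketch below makes this explicit; it reuses the simplified execution-time model sketched in Section 3.1.4 and all values are illustrative assumptions, not the thesis's measured results.

    #include <stdio.h>

    /* Simplified Amdahl-style execution time of an n-core MPSoC (see 3.1.4). */
    static double mp_exec_time(double delta, int n, double cpi_le, double k)
    {
        return ((1.0 - delta) + delta / n) * cpi_le / k;
    }

    int main(void)
    {
        const double p_budget = 1.0;     /* same normalized power for all designs */
        const double delta    = 0.5;     /* assumed parallel fraction              */

        double t_ss   = 0.6;             /* 4-issue superscalar CPI, K = 1         */
        double t_mp8  = mp_exec_time(delta,  8, 1.3, 3.00);
        double t_mp18 = mp_exec_time(delta, 18, 1.3, 1.25);

        /* energy = power x time under a common power budget */
        printf("E superscalar: %.4f\n", p_budget * t_ss);
        printf("E 8 cores    : %.4f\n", p_budget * t_mp8);
        printf("E 18 cores   : %.4f\n", p_budget * t_mp18);
        return 0;
    }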

Figure 21. Multiprocessing systems and high-end single processor energy consumption; α = δ is assumed.

3.2.2 Communication Modeling in Energy of Multiprocessing Systems

The energy results shown in Figure 21 do not consider the communication produced by data transfers. The power dissipated by the Network-on-Chip is proportional to its capacitance, operating frequency and the square of the supply voltage, as well as to the power of one router and the number of routers. With these variables one gets

P_NoC = #Routers × C_router × V_NoC^2 × F_NoC    (22)

The energy of the NoC can be modeled as

E_NoC = P_NoC × CL    (23)


where CL is the communication latency modeled in Equation 14. The total energy of the multiprocessing system, based on Equations 20 and 23, is given by

E_MPtotal = P_MP × T_MP + P_NoC × CL    (24)

3.2.3 Applying the Energy Modeling in Real Processors considering the Communication Overhead for Multiprocessing Systems

The energy consumption shown in Section 3.2.1 does not consider communication costs. Thus, we apply the equations shown in Section 3.2.2 to the same scenarios presented in Section 3.1.5, where the data traffic in the NoC is produced by 16%, 33%, 66% and 99% of the instructions executed in parallel. Figure 22 shows the energy consumption of the multiprocessing systems and of the 4-issue superscalar processor. As can be seen, the curve of the superscalar processor stays above those of the 18-Core and 8-Core designs up to 85% of parallelism, since the multiprocessing systems provide lower execution times than the superscalar processor (refer to Figure 17). As the parallelism grows, the superscalar decreases its energy consumption at a higher rate than the 18-Core design, due to its more efficient exploitation. In both the centralized and the distributed data schemes, the superscalar processor loses both in performance and in energy consumption to the 8-Core design when 16% of communication is considered.
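The NoC energy terms of Equations 22-24 can be sketched numerically as follows; the router power, communication latency and processing energy used below are illustrative assumptions, not synthesis results.

    #include <stdio.h>

    int main(void)
    {
        int    n_routers = 18;          /* one router per core (assumption)        */
        double p_router  = 0.05;        /* W, assumed per-router power (C*V^2*F)   */
        double cl        = 0.002;       /* s, communication latency from Eq. 14    */
        double e_proc    = 1.5;         /* J, processing energy (P_MP x T_MP)      */

        double p_noc = n_routers * p_router;            /* Equation 22 (sketch)    */
        double e_noc = p_noc * cl;                      /* Equation 23             */
        printf("Total MP energy: %.4f J\n", e_proc + e_noc);   /* Equation 24      */
        return 0;
    }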

Figure 22. Energy consumption of the different designs considering a communication rate of 0.16

The scenario changes when the communication increases to 33% (Figure 23): the superscalar processor outperforms the 18-Core design and achieves better energy consumption when the parallelism reaches 50%. When the data traffic increases to 66% (Figure 24), the superscalar processor shows better performance and energy consumption than the 18-Core design beyond 25% of parallelism. Thus, we can conclude that, for a same-area design, employing a 4-issue superscalar processor is more energy and performance efficient than an 18-Core design in the following scenario: an application that exposes 25% of TLP or ILP and in which more than 66% of the instructions executed in parallel produce inter-thread communication. In comparison with the 8-Core design, however, employing the superscalar processor is worthwhile only when the application provides more than 65% and 75% of parallelism, for the centralized and distributed communication approaches, respectively.


Figure 23. Energy consumption of the different designs considering a communication rate of 0.33

Figure 24. Energy consumption of the different designs considering a communication rate of 0.66

Figure 25 shows the worst scenario for the multiprocessing system, in which 99% of the parallel instructions produce data traffic in the NoC. Here, one can conclude that, for a same-chip-area design, if an application can be parallelized up to 20%, employing the 18-Core design is more worthwhile than the superscalar in both performance and energy consumption when the communication is centralized. This percentage increases to almost 25% when the data are uniformly distributed among the processors. The 8-Core design is more competitive when an application can be parallelized up to 50%: besides occupying less area, it produces better performance and energy consumption than the superscalar processor when the communication is centralized. This percentage increases to 55% when the data is distributed uniformly among the processors.

Figure 25. Energy consumption of the different designs considering a communication rate of 0.99

Summarizing, the best scenario for TLP exploitation (0% of communication, Figure 16) shows that the 8-Core and 18-Core designs outperform the superscalar processor over the whole spectrum of parallelism. On the other hand, when the worst scenario for TLP exploitation is applied (99% of communication), the superscalar processor provides better performance and energy when the parallelism is higher than 25% and 50%, for the 18-Core and 8-Core designs, respectively. Thus, the analytical model shows that the

performance of the multiprocessing systems, besides relying on the TLP available in the application, also heavily depends on the communication rate. One can conclude that the ideal approach would be a heterogeneous multiprocessor system that exploits both TLP and ILP, so that it would be possible to balance performance and energy over a wide range of application domains and to avoid the huge communication costs of designs based on many cores. In addition, there is a tradeoff that must be considered: when more cores are included in the chip, the multiprocessor performance tends to worsen, since the operating frequency must be decreased to respect the power budget limits. For instance, the 128-Core design takes a longer execution time than the other multiprocessor designs at all levels of available parallelism (Figure 16), since its operating frequency is very low.

Considering real applications, thread level parallelism exploitation is widely employed to accelerate most multimedia applications used in embedded devices, thanks to their data-independent loop iterations. However, even applications with high TLP could still obtain some performance improvement by also exploiting ILP. Hence, in a multiprocessor design, ILP techniques should also be investigated to conclude what best fits the particular design requirements. The analytical modeling thus indicates that a heterogeneous multiprocessor system is necessary to balance the performance and energy of a wide range of application classes. Section 3.3 reinforces this trend by running a real embedded application on a multiprocessor environment that only exploits TLP, on a superscalar processor, and on a multiprocessor environment that exploits both TLP and ILP.

3.3 Example of an Application Parallelization Process in a Multiprocessing System

We evaluate the performance of both the superscalar and the multiprocessor environments on an actual application execution. An 18-tap FIR filter is used as the benchmark for this evaluation; the C-like description of the FIR filter employed in this experiment is illustrated in Figure 27. Superscalar machines explore the instruction level parallelism of such an application in a transparent way, working on its original binary code. Unlike the superscalar approach, exploring the potential of the multiprocessor architecture requires manual source code annotations in order to split the application code among many processors. Accordingly, some code highlights are shown in Figure 27 to simulate these annotations, indicating the number of cores needed to explore the ideal thread level parallelism of each part of the FIR filter code. For instance, the first annotation considers a loop controlled by the IMP_SIZE value, which depends on the number of FIR taps; in this case, 54 loop iterations are performed, since the experiment regards an 18-tap FIR filter. The OpenMP (MENON, 1998) programming interface provides specific code directives to easily split loop iterations among processors. Using OpenMP directives, the ideal exploration of this loop is achieved with a 54-core multiprocessor design, each core being responsible for executing one loop iteration. However, when the number of processors is lower than the number of loop iterations, OpenMP combines the iterations into groups and distributes the tasks among the available resources. Hence, for the execution of 54 loop iterations on an 18-core multiprocessor design, OpenMP creates 18 groups, each composed of 3 iterations. Since the FIR code is made up of several

loops, almost the entire application can be parallelized, as shown in Figure 27. In general, DSP applications (e.g., FFT and DCT) are loop-based, which makes them suitable for OpenMP usage. However, despite the use of OpenMP, some loops still execute sequentially due to data dependencies among iterations. This fact is illustrated by the last loop of the FIR filter description, since the iterations of the loop that shifts the delay-line array depend on each other. We evaluated the 18-tap FIR execution on three different architectures, aiming to illustrate the performance impact of applying TLP and ILP exploration: a 4-issue superscalar SPARC V (SS); 6-, 18- and 54-core multiprocessor designs based on in-order single-issue TurboSPARC cores, with no ILP exploration capabilities (MPIOC); and finally, in order to have a glimpse of the future, we imagined 6-, 18- and 54-core MPs based on a 4-issue superscalar processor, able to explore both ILP and TLP (MPSS). We have gathered performance data with a cycle-accurate simulator (RUTZIG, BECK e CARRO, 2009). The execution time is measured in order to obtain the speedup over the baseline processor. It is important to point out that instruction and thread communication overhead has not been considered in this experiment.

The results shown in Figure 26 reflect the speedup over a single in-order core running the sequential version of the C-like description of the 18-tap FIR filter presented in Figure 27. The leftmost bar shows the speedup provided by the ILP exploration of a 4-issue superscalar processor. In this case, the speedup of the superscalar processor over an in-order core is only 2.2 times, showing that the FIR filter has neither high nor low ILP, since a 4-issue superscalar processor could theoretically achieve up to 4 times the performance of a single-issue in-order core.
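As discussed above, the loops of Figure 27 are annotated so that OpenMP can split their iterations among the available cores. The sketch below shows what such an annotation looks like for the first loop; it is an illustration of the manual annotation step under the assumptions stated in the comments, not the exact code used in the thesis experiments.

    #include <omp.h>

    #define NTAPS    18
    #define IMP_SIZE (3 * NTAPS)

    static double imp[IMP_SIZE];

    /* The 54 iterations are divided among however many cores the OpenMP
     * runtime reports, so an 18-core machine receives 18 chunks of 3
     * iterations, as described in the text. */
    void make_impulse(void)
    {
        #pragma omp parallel for
        for (int ii = 0; ii < IMP_SIZE; ii++)
            imp[ii] = 0;
        imp[5] = 1.0;                  /* single impulse, as in Figure 27 */
    }

    int main(void) { make_impulse(); return 0; }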

[Figure 26 bar chart: speedups of 2.2 (SS), 6.0 (6-MPIOC), 17.3 (18-MPIOC), 44.8 (54-MPIOC), 13.3 (6-MPSS), 37.1 (18-MPSS) and 74.9 (54-MPSS).]

Figure 26. Speedup provided in 18-tap FIR filter execution for Superscalar, MPSoC and a mix of both approaches


    #define NTAPS 18
    #define IMP_SIZE (3 * NTAPS)

    static const double h[NTAPS] = { 1.0, 2.0, 3.0, 4.0, 5.0, 6.0 };
    static double h2[2 * NTAPS], z[2 * NTAPS], imp[IMP_SIZE];
    double output;
    int ii, state;

    /* make impulse input signal */
    for (ii = 0; ii < IMP_SIZE; ii++) {          /* 54 Cores */
        imp[ii] = 0;
    }
    imp[5] = 1.0;

    /* create a SAMPLEd h */
    for (ii = 0; ii < NTAPS; ii++) {             /* 18 Cores */
        h2[ii] = h2[ii + NTAPS] = h[ii];
    }

    /* clear Z */
    for (ii = 0; ii < NTAPS; ii++) {             /* 18 Cores */
        z[ii] = 0;
    }

    for (ii = 0; ii < IMP_SIZE; ii++) {          /* 54 Cores */
        z[0] = imp[ii];
        output = 0;

        /* calc FIR */
        for (ii = 0; ii < IMP_SIZE; ii++) {      /* 54 Cores */
            output += h[ii] * z[ii];
        }

        /* shift delay line */
        for (ii = IMP_SIZE - 2; ii >= 0; ii--) {
            z[ii + 1] = z[ii];
        }
    }

Figure 27. C-like FIR Filter

Regarding the multiprocessor designs composed of in-order cores, the 6-core machine provides an almost linear speedup, decreasing the single in-order core execution time by 5.46 times. This behavior is maintained as more in-order cores are inserted. However, when the TLP of the 18-tap FIR filter is aggressively explored (54-MPIOC), a speedup of only 44.8 times is achieved, showing that even applications that are potentially suitable for TLP exploration suffer from the presence of sequential code parts. Amdahl's Law shows that it is not sufficient to build architectures with a large number of processors, since most applications contain a certain amount of sequential code (WOO e LEE, 2008). Hence, there is a need to balance the number of processors


with a suitable ILP exploration approach to achieve greater performance. The MPSS approach combines TLP with the ILP exploration of a 4-issue superscalar, aiming to show that simple TLP extraction is not enough to achieve linear speedups even for applications with high TLP. Figure 26 illustrates the speedup of the MPSS approach. As can be seen, the performance of 6-MPSS when running the 18-tap FIR filter is twice that of 6-MPIOC. The 6-MPSS takes advantage of the large room for ILP exploitation that exists when the application is split into only 6 threads: when the first loop of Figure 27 is split among the multiprocessor machine, each core receives 9 loop iterations that execute sequentially. However, as the number of cores increases, the amount of sequential code per core decreases, reducing the room for ILP optimization. Nevertheless, the 18-tap FIR filter execution on 54-MPSS is 66% faster than the 54-MPIOC execution.

Summarizing, some real applications can benefit from thread level parallelism exploration thanks to their loop-based behavior. However, even applications with high TLP can still obtain some performance improvement by also exploiting ILP. Hence, in a multiprocessor design, ILP techniques should also be investigated to conclude what best fits the design requirements and constraints. Finally, one can conclude that the replication of simple processing elements leaves a significant optimization opportunity unexplored, indicating that mixed parallelism exploitation is a possible solution to balance performance when applications with different behaviors are considered.
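Applying Amdahl's law to the 54-MPIOC result above gives a feel for how little sequential code is needed to produce this effect; this is an illustrative back-calculation from the reported speedup, not a measurement from the thesis:

    S(n) = 1 / (f + (1 - f)/n)   =>   44.8 = 1 / (f + (1 - f)/54)   =>   f ≈ 0.004

That is, a sequential fraction of only about 0.4% already limits a 54-core machine to the observed 44.8-fold speedup.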


4 CREAMS

A general overview of the CReAMS platform is given in Figure 28(a). Thread level parallelism is explored by replicating Dynamic Adaptive Processors (DAPs) (four DAPs in the example of Figure 28(a)). The communication among DAPs is done through an on-chip unified 512 KB 8-way set associative L2 shared cache. As mixed parallelism exploitation is mandatory when a heterogeneous software environment is considered, we extend the single-thread reconfigurable architecture presented in (BECK, RUTZIG, et al., 2008) to handle multithreaded applications in the CReAMS platform.

4.1 Dynamic Adaptive Processor (DAP)

We divide the DAP into four blocks to better explain it, as illustrated in Figure 28(b). These blocks are discussed in the following sections.

4.1.1 Processor Pipeline (Block 2)

A SparcV8-based architecture is used as the baseline processor that works together with the reconfigurable system. Its five-stage pipeline reflects a traditional RISC execution flow (instruction fetch, decode, execution, data fetch and write back), which supports its employment in embedded systems. In addition, its similarity to the processors used in well-known embedded platforms (e.g., OMAP) supports the employment of the SparcV8 processor in this work, since all of them are based on RISC architectures (e.g., MIPS, ARM).

4.1.2 Reconfigurable Data Path Structure (Block 1)

Following the classification shown in (HAUCK e COMPTON, 2002), the reconfigurable data path is coarse-grained and tightly coupled to the SparcV8 pipeline, avoiding external accesses to memory, saving power and reducing the reconfiguration time. Because it is coarse-grained, the size of the memory necessary to keep each configuration is smaller than for fine-grained data paths (e.g., FPGAs), since the basic processing elements are functional units that work at the word level (arithmetic and logic, memory access and multiplier). As illustrated in Figure 28(b), the data path is organized as a matrix of rows and columns. The number of rows dictates the maximum instruction level parallelism that can be exploited, since instructions located in the same column are executed in parallel. For example, the illustrated data path (Block 1 of Figure 28(b)) is able to execute up to four arithmetic and logic operations, two memory accesses (two memory ports are available in the L1 data cache) and one multiplication without true (read after write) dependences. The number of columns determines the maximum number of data dependent instructions that can be stored in one configuration. Three columns of arithmetic and logic units (ALUs) compose a level. A level does not affect the SparcV8 critical path (which, in this case, is given by the multiplier circuit). Therefore, up to three ALU instructions can be executed in the reconfigurable data path within one SparcV8 cycle, without affecting its original frequency (600 MHz). Memory accesses and multiplications take one equivalent SparcV8 cycle to perform their operations.

We have coupled sleep transistors (SHI e HOWARD, 2006) to switch the power of each functional unit in the reconfigurable data path on and off. The dynamic reconfiguration process is responsible for managing the sleep transistors. Their states are stored in the reconfiguration memory, together with the reconfiguration data. Thus, for a given configuration, idle functional units are set to the off state, avoiding leakage and dynamic power dissipation, since the incoming bits do not produce switching activity in the disconnected circuit. Although the sleep transistors are bigger than and in series with the regular transistors used to implement the data path circuit, they have been designed so that their delays do not significantly impact the critical path or the reconfiguration time.


Figure 28. (a) CReAMS architecture (b) DAP blocks

The entire structure of the reconfigurable data path is combinational: there is no temporal barrier among the functional units. The only exceptions are the entry and exit points. The entry point keeps the input context and the exit point stores the results; both structures are connected to the processor register file. Feeding the input context with the necessary data is the first step in configuring the data path, before firing its execution. After that, results are stored in the output context registers through the exit point of the data path. The values stored in the output context are sent to the SparcV8 register file on demand: if a value is produced at a given data path level (a cycle of the SparcV8 processor) and will not be changed in the subsequent levels, this value is written back in the cycle after it was produced. In the current implementation, the SparcV8 register file has two write/read ports.



Figure 29. Interconnection mechanism

The interconnection structure of the reconfigurable data path is shown in Figure 29. The data that comes from the SparcV8 register file is stored in the registers of the input context. For each register, a bus line propagates the value to the functional units. These bus lines are connected to the functional units through multiplexers that are responsible for choosing the correct value. Each functional unit has two multiplexers at its inputs that select the issued operands; we call them input multiplexers. After the operation is completed, there is a multiplexer for each bus line that chooses which result will be bypassed through that bus line; these are the output multiplexers. The input and output context sizes limit the number of instructions allocated in a single data path configuration: when all registers of the input context are already allocated, a new configuration must be created to hold the following instructions. A small input context therefore restricts performance, since a configuration is broken even when functional units are still available. On the other hand, increasing the size of the input context adds a huge overhead to the data path area, since each input register entails one output multiplexer per data path column. For instance, in the data path presented in Figure 28(b), each additional input register adds nine output multiplexers to the data path structure. Moreover, each additional input register adds one input port to all input multiplexers of the reconfigurable data path. The interconnection structure also has deleterious effects on energy consumption: sleep transistors do not act on these components, meaning that data is propagated over the whole interconnection structure due to the combinational nature of the data path, spending power even when the functional units are not being used.

4.1.3 Dynamic Detection Hardware (Block 4)

The hardware responsible for code detection, named Dynamic Detection Hardware (DDH), is implemented as a 4-stage pipelined circuit to avoid increasing the original critical path of the SparcV8 processor. These four stages are the following:

• Instruction Decode (ID) – the instruction is broken into operation, source operands and target operand.

• Dependence Verification (DV) – each data path column has a bitmap, named Write Bitmap (Figure 30), responsible for storing the target operands of the instructions already allocated in that column. Thus, the source operands of each incoming instruction are compared to the target operands stored in the


bitmaps of previously detected instructions to verify in which column the current instruction should be allocated, according to its data dependencies. In this way, the dependence detection hardware spends only one 8-bit-wide XOR gate (supposing that the input context contains 8 registers) per data path column. To better explain the process, the left side of Figure 30 presents an example of a code region detected by the DDH, and the right side of the same figure shows its allocation inside the reconfigurable data path. The first incoming instruction is always allocated in the highest functional unit of the leftmost data path column. In this process, the seventh bit of the write bitmap of that column is set, since R7 is the target operand of this instruction. The dependence detection starts from the second instruction. In our example, instruction number two reads register R7, which is written by the previous instruction, creating a read after write (RAW) dependence. The DDH detects it with a simple XOR operation and allocates instruction number two in the column after that of instruction number one. In this process, the eighth bit of the second write bitmap is set, since this instruction has register R8 as its target operand. There is a true dependence between the third and the second instructions, so the third instruction is allocated in the third column, setting the sixth bit of the write bitmap of that column. Instruction number four, in turn, does not present any data dependence with instructions two and three, but it has a RAW dependence with instruction number one. Thus, it can be allocated in the column after that of instruction number one, executing in parallel with instruction number two and temporally before instruction number three. The write bitmap of the second column is updated, since the target operand of instruction number four is register R1. Instruction number five, a memory access operation, does not produce any data dependence with the previous instructions, so it is allocated in the leftmost load functional unit. As this kind of operation takes an entire processor cycle and covers three data path columns, the third bit of the write bitmaps of columns 1, 2 and 3 must be set to maintain allocation consistency. Instruction number six depends on the result of the previous memory access (instruction number five), so it must be allocated in the fourth column. As its target operand is register R4, the corresponding bit of the fourth write bitmap is set. Instruction number seven stores the value of register R4 in memory; as the previous instruction has this register as its target, instruction number seven must be allocated in a later column that has a memory access functional unit available. Finally, instruction number eight does not have any data dependence with the previous instructions, so it is allocated in the first data path column. As this operation takes an entire processor cycle and covers three data path columns, the second bit (the bit corresponding to its target operand) of the write bitmaps of columns 1, 2 and 3 is set.


[Figure 30 contents: the example code region — (1) ADD r5,r8,r7; (2) SUB r7,r6,r8; (3) ADD r8,r6,r6; (4) ADDU r2,r7,r1; (5) LW r3,8; (6) SUBU r3,r7,r4; (7) SW r4,4; (8) MUL r3,r5,r2 — and its allocation in the data path columns, with the resulting per-column write bitmaps.]

Figure 30. Example of the allocation of a code region inside the data path

• Resource Allocation (RA) – in this stage, the data dependence is already solved and the correct data path column is known. Hence, the RA stage is responsible for verifying the availability of resources in that column, linking the instruction operation to the correct type of functional unit. If there is no functional unit available in this column, the next column to the right is checked. This process is repeated until a free functional unit is found.

• Update Tables (UT) – this stage configures the interconnection components of the reconfigurable data path to feed that functional unit with the correct source operands from the input context and to write the result into the correct register of the output context. After that, the bitmaps and tables are updated and the configuration is finished: its configuration bits are sent to the reconfiguration memory and the address cache is updated with the memory address of the first instruction detected in the configuration.
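The column-allocation policy performed by the DV and UT stages can be sketched in software as follows. This is a simplified illustration of the write-bitmap mechanism described above: resource checking, memory/multiplier placement and the multi-column span of memory operations are omitted, and the register numbering is an assumption made only for this example.

    #include <stdint.h>
    #include <stdio.h>

    #define COLUMNS 9

    /* one write bitmap per data path column; one bit per context register */
    static uint8_t write_bitmap[COLUMNS];

    /* places an instruction rs1,rs2 -> rd and returns the chosen column */
    int allocate(int rs1, int rs2, int rd)
    {
        int col = 0;
        for (int c = 0; c < COLUMNS; c++)            /* DV stage              */
            if (write_bitmap[c] & ((1u << rs1) | (1u << rs2)))
                col = c + 1;                         /* after the producer    */
        if (col >= COLUMNS)
            return -1;                               /* break configuration   */
        write_bitmap[col] |= (1u << rd);             /* UT stage              */
        return col;
    }

    int main(void)
    {
        /* first three instructions of the Figure 30 example
         * (registers 0-7 stand for r1-r8 of the text) */
        printf("ADD r5,r8,r7 -> column %d\n", allocate(4, 7, 6));
        printf("SUB r7,r6,r8 -> column %d\n", allocate(6, 5, 7));
        printf("ADD r8,r6,r6 -> column %d\n", allocate(7, 5, 5));
        return 0;
    }

Running the sketch reproduces the placement of the first three instructions of the example: columns 1, 2 and 3, respectively (printed here as 0, 1 and 2).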


[Figure 31 contents: the loop for (i=0; i<100; i++) a[i] = b[i] + 1; shown along a time bar, with the DAP passing through the probing, detecting, reconfiguring and accelerating modes across iterations i=0, i=1, i=2, ..., i=99.]

Figure 31. DAP acceleration process

Figure 31 shows a simple example of how a DAP can dynamically accelerate a code segment of a single thread. The DAP works in four modes: probing, detecting, reconfiguring and accelerating. The flow is as follows. At the beginning of the time bar shown in Figure 31, the DAP searches for an already translated code segment to accelerate, comparing the address cache entries with the content of the program counter register. When the first loop iteration appears (i=0), the DDH detects that there is a new code segment to translate and changes to detecting mode. In detecting mode, concomitantly with the instruction execution in the SparcV8 pipeline, the instructions are also translated into a configuration by the DDH pipeline. The process does not stop when a branch instruction is found, since speculative execution is used; thus, up to three basic blocks can compose a single configuration. When the second loop iteration is found (i=1), the DDH is still finishing the detection process that started at i=0: it takes a few cycles to store the configuration bits into the reconfiguration memory and to update the address cache with the memory address of the first detected instruction. Then, when the first instruction of the third loop iteration reaches the fetch stage of the SparcV8 pipeline (i=2), the probing mode detects a valid configuration in the reconfiguration memory: the program counter content is found in an address cache entry. After that, the DAP enters the reconfiguring mode, where it feeds the reconfigurable data path with the necessary operands. For example, if 8 operands are needed, 4 cycles are necessary, since 2 read ports are available in the register file. In parallel with the operand fetch, the reconfiguration bits are loaded from the reconfiguration memory. The reconfiguration memory is accessed on demand: at each clock cycle, only the bits necessary to configure one data path level are fetched, instead of fetching all the reconfiguration bits at once. This approach decreases the port width of the reconfiguration memory, which is one of the main sources of power consumption in memories (BERTICELLI LO, BECK, et al., 2010). Finally, the accelerating mode is activated and the next loop iterations (until the 99th) are efficiently executed, taking advantage of the reconfigurable logic.

Figure 32 summarizes, with an activity diagram, the whole DDH process of creating a configuration. The first step is the execution support verification: if there is no compatible functional unit to execute a given operation (e.g., division), the configuration is finished and the next instruction becomes a candidate to start a new configuration. On the other hand, if there is support, the data dependency with previously allocated instructions is verified (DV stage) and the correct functional unit within the chosen column is defined. Then, the current configuration is sent to the reconfiguration memory.

Figure 32. Activity Diagram of the DIM process

4.1.4 Storage Components (Block 3)

Two storage components are part of the DAP acceleration process: the address cache and the reconfiguration memory. The address cache holds the memory address of the first instruction of every configuration built by the dynamic detection hardware. It is used to verify the existence of a configuration in the reconfiguration memory: an address cache hit indicates that a configuration was found. The address cache is implemented as a 4-way set associative table containing 64 entries. The reconfiguration memory stores the routing bits and the information necessary to fire a configuration, such as the input and output contexts and the immediate values. Besides these two storage components, the current DAP implementation has a private 32 KB 4-way set associative L1 data cache and a private 8 KB 4-way set associative L1 instruction cache. According to our experiments, the SparcV8 within the DAP achieves, with an instruction cache a quarter of the size of the standalone SparcV8's 32 KB L1 instruction cache, the same hit rate as the standalone processor. This happens because, as translated instructions are stored in the reconfiguration memory, the SparcV8 within the DAP performs fewer accesses to the L1 instruction cache than the standalone SparcV8 processor. Thus, the impact of the additional area of the address cache and of the reconfiguration memory in the DAP design is amortized.
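The probing-mode lookup in the address cache can be sketched as follows. The index/tag split, the replacement policy and the contents of each entry are assumptions made only for this illustration; the thesis specifies only the 64-entry, 4-way set associative organization.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define WAYS 4
    #define SETS (64 / WAYS)

    struct entry { bool valid; uint32_t pc; uint32_t config_addr; };
    static struct entry address_cache[SETS][WAYS];

    /* returns true (hit) when a configuration exists for this PC */
    bool probe(uint32_t pc, uint32_t *config_addr)
    {
        unsigned set = (pc >> 2) % SETS;                 /* word-aligned PCs   */
        for (int w = 0; w < WAYS; w++) {
            if (address_cache[set][w].valid && address_cache[set][w].pc == pc) {
                *config_addr = address_cache[set][w].config_addr;
                return true;                             /* reconfiguring mode */
            }
        }
        return false;                                    /* keep probing       */
    }

    int main(void)
    {
        uint32_t cfg;
        /* detection finished: record a configuration starting at PC 0x4000 */
        unsigned set = (0x4000u >> 2) % SETS;
        address_cache[set][0] = (struct entry){ true, 0x4000u, 0u };
        printf("probe 0x4000: %s\n", probe(0x4000u, &cfg) ? "hit" : "miss");
        return 0;
    }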


5 RESULTS

This section presents the methodology and the results regarding the proposed approach. Concerning the results, we first compare CReAMS with a multiprocessing system composed of in-order scalar SparcV8 processors, which demonstrates the potential of the proposed approach. Next, the impact of inter-thread communication on both CReAMS and multiprocessing systems composed of different numbers of processors is verified. In addition, a subsection is dedicated to the results where CReAMS is conceived as a heterogeneous organization. Finally, we compare the adaptability of CReAMS in exploiting ILP and TLP with a multiprocessing system composed of 4-issue out-of-order superscalar SparcV8 processors.

5.1 Methodology

5.1.1 Benchmarks

In order to measure the performance and energy efficiency of CReAMS in a heterogeneous environment, benchmarks from different suites were selected to cover a wide range of behaviors in terms of type (i.e., TLP and ILP) and degree of existing parallelism. The scope is to mimic the future complex embedded applications that will run on portable devices. From the parallel suites (WOO, OHARA, et al., 1995) (BIENIA, KUMAR, et al., 2008) (DORTA, RODRIGUEZ, et al., 2005), we have selected md, jacobi and lu, which are, due to their nature, applications where TLP is dominant. Three SPEC OMPM2001 (DIXIT, 1993) applications (apsi, equake and ammp) were chosen to evaluate the CReAMS efficiency on originally single-threaded applications that were parallelized to take advantage of multiprocessing environments. Finally, we have selected four applications (susan edges, susan smoothing, susan corners and patricia) from the MiBench suite (GUTHAUS, RINGENBERG, et al., 2002), which reflects a traditional embedded scenario. The benchmarks were parallelized using OpenMP and POSIX threads. These libraries provide methods that discover, at run time, the number of processors of the underlying multiprocessor architecture, so they can take full advantage of the available resources even when the platform changes (e.g., processors are added), with no need for source code modifications and recompilation.

We have conducted a study of the selected applications to characterize their potential for performance improvement when TLP or ILP exploration is applied. The mean basic block size gives some clues about the limits of the instruction level parallelism that the selected applications provide. In addition, the percentage of the entire application code that is executed in parallel, when a multithreaded environment is considered, is an important metric to obtain the actual thread level parallelism available in the applications; this metric is also called load balancing. Considering the mean size of the basic blocks, Table 3 shows that some applications, such as ammp, susan edges, susan corners and susan smoothing, provide a wide room for ILP exploitation. However, equake, jacobi, patricia and apsi do not show great potential for performance improvement through ILP exploitation due to their small mean basic block sizes. In addition, the data provided in Table 3 demonstrate that the selected workload is very heterogeneous in terms of load balancing: it contains applications with perfect load balancing, such as susan smoothing, jacobi and md, which are suitable for TLP exploitation. However, TLP exploitation is not appropriate for applications such as equake, ammp and apsi, since their instructions are poorly load balanced even when few threads are considered (4 threads).

Table 3. Load balancing and mean basic block size of the selected applications

Benchmark      Mean BB size (#instr)   Load Balancing (%)
                                       4 threads   8 threads   16 threads   64 threads
equake         4.80                    18.49       10.32        5.10         0.92
apsi           6.86                    17.45        9.20        4.80         1.10
ammp           14.56                   27.35       12.40        6.20         1.09
susan_e        16.60                   39.80       24.90        4.80         0.90
patricia       5.04                    22.75       13.45        6.41         1.12
susan_c        17.36                   67.58       49.18       34.94        12.50
susan_s        12.10                   88.20       77.13       83.16        74.52
swaptions      5.92                    99.00       99.00       99.00        98.00
blackscholes   4.83                    99.00       99.00       99.00        98.00
md             6.51                    95.04       83.24       88.92        89.87
jacobi         6.94                    97.02       97.02       92.07        93.12
lu             8.32                    81.20       56.77       29.35         7.03

5.1.2 Simulation Environment

For performance evaluation, we have used the scheme presented in Figure 33(a). The following subsections detail the tools and the whole simulation process.

5.1.2.1 Simics Simulator

The base platform is Simics (MAGNUSSON, CHRISTENSSON, et al., 2002), an instruction level simulator. A new Simics environment was created, comprising a Linux Ubuntu operating system running on a single SparcV8 processor. The applications shown above were compiled inside this environment to enable the simulation process. As OpenMP and Pthreads provide procedures that allow choosing the number of spawned threads at run time, even with a single SparcV8 in the platform we obtain simulations with different numbers of threads. Simics produces a sequential trace that comprises the instructions and data accesses of all threads. As these instructions are mixed in a single sequential trace, there are marks at the beginning and at the end of each portion of code indicating which instructions belong to which thread. The whole Simics process corresponds to the "Thread Trace/Tracker" box of Figure 33(a).


5.1.2.2 Splitter.py

The simulator explained above sends the sequential trace to a Python script named Splitter. This script is responsible for recognizing the marks that inform which instructions belong to which thread, splitting these instructions into several buffers, each containing the instructions of one thread. In addition, this script is responsible for the dynamic thread scheduling shown in Section 5.4.

5.1.2.3 Mkfifo

As Simics and the Splitter communicate in a producer-consumer fashion, a first-in first-out (FIFO) structure was inserted to implement this communication. Mkfifo is a UNIX mechanism that provides FIFO behavior through named pipes. Thus, when a FIFO is full the producer (Simics) stalls, and when a FIFO is empty the consumer (splitter.py) stalls, allowing a proper simulation. In addition, the buffers referred to in the Splitter subsection are also mkfifo pipes, since their behavior also follows the producer-consumer model.

5.1.2.4 Dynamic Adaptive Processor

There is one timing simulator for each DAP (in the case of the CReAMS simulation) or for each standalone SparcV8 processor (when simulating the MPSparcV8). Each DAP consumes the instructions sent by the Splitter to its corresponding buffer; thus, each DAP simulates the instructions of a single thread (in the case of static scheduling). The DAP simulator implements synchronization mechanisms, such as locks and barriers, so the time spent with blocking synchronization and memory transfers is precisely calculated. Each DAP holds, in a plain text file, partial results about the performance, the communications among threads, and the energy and power consumption of the instructions executed between each pair of barriers. In addition, the DAP simulator is written in C++ and comprises more than 5,000 lines of code.

5.1.2.5 Backward

This process is activated at the end of the simulation. As the plain text file of each DAP contains, for each thread, partial results about performance, energy and power consumption between barriers, the Backward process is responsible for joining these results, providing overall performance, energy and power consumption figures for the whole application simulation.
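The producer-consumer mechanism described above can be sketched as follows. The actual splitter is a Python script; this C sketch only illustrates the idea of demultiplexing a sequential trace into per-thread named pipes created with mkfifo. The mark syntax, file names and line-oriented trace format are assumptions made for this illustration.

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    #define MAX_THREADS 4

    int main(void)
    {
        FILE *out[MAX_THREADS];
        char  name[32], line[256];
        int   current = 0;

        for (int t = 0; t < MAX_THREADS; t++) {
            snprintf(name, sizeof name, "/tmp/thread_%d.fifo", t);
            mkfifo(name, 0600);                 /* named pipe: full blocks the   */
            out[t] = fopen(name, "w");          /* producer, empty the consumer  */
        }
        while (fgets(line, sizeof line, stdin)) {   /* sequential trace on stdin */
            if (sscanf(line, "BEGIN_THREAD %d", &current) == 1)
                continue;                           /* mark: switch destination  */
            if (current >= 0 && current < MAX_THREADS)
                fputs(line, out[current]);          /* instruction or data access */
        }
        for (int t = 0; t < MAX_THREADS; t++)
            fclose(out[t]);
        return 0;
    }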

5.1.3 VHDL descriptions

We have described the entire CReAMS architecture in VHDL, including the power management technique using sleep transistors. The MPSparcV8 VHDL description was obtained from (GAISLER, 2006). The Synopsys Design and Power Compiler tools, using a CMOS 90nm technology, were employed to synthesize the VHDL descriptions to standard cells and to gather data about power, operating frequency, critical path and area. We use the data gathered from the VHDL descriptions to calibrate the DAP cycle-accurate simulators and obtain the overall energy/power consumption. We assume in all experiments a perfect switching off/on of the processing elements of both CReAMS and MPSparcV8, meaning that the processing elements do not consume energy during idle times. The power consumption of the reconfiguration memory, the address cache, and the L1 and L2 caches was obtained with the CACTI 6.5 tool (WILTON e JOUPPI, 1996).


Figure 33. (a) Simulation Flow (b) How the synchronization process is done

5.1.4 How does the thread synchronization work?

Figure 33(b) shows how the DAP cycle-accurate simulators handle the synchronization among the running threads. Let us suppose a dual-DAP CReAMS platform executing two threads, where barriers are responsible for DAP synchronization. In our cycle-accurate simulator, barriers are represented by "magic instructions" that are inserted in the trace when software synchronization points are executed in the binary code. Figure 33(b) depicts the execution of two threads, where a white box represents a regular instruction and a gray box represents a barrier in the trace. As can be seen, there are three synchronization points in the example of Figure 33. Let us suppose that each regular instruction takes one DAP cycle to execute. Thread #1 takes four DAP cycles to reach the first barrier, while thread #2 takes only one DAP cycle. After that, thread #1 takes 6 cycles to reach synchronization point two, while thread #2 takes 4 DAP cycles. To reach the third synchronization point, four and twelve cycles are taken by thread #1 and thread #2, respectively. These partial performance results are stored by each DAP cycle-accurate simulator and joined when the execution of both threads ends. Thus, for the overall results, the simulation environment only takes into account the longest thread execution time between barriers. In the example of Figure 33(b), the longest execution time until synchronization point one belongs to thread #1, which is also true for the second synchronization point; however, thread #2 has the longest execution time in the third synchronization segment. The overall performance of the dual-DAP CReAMS when executing the two threads depicted in Figure 33(b) is therefore 22 cycles.
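The accounting rule described above is simple enough to be shown directly; the sketch below uses the cycle counts of the Figure 33(b) example and sums, segment by segment, the longest per-thread time between barriers.

    #include <stdio.h>

    int main(void)
    {
        const int segments = 3, threads = 2;
        const int cycles[3][2] = { { 4,  1 },     /* up to barrier 1 */
                                   { 6,  4 },     /* up to barrier 2 */
                                   { 4, 12 } };   /* up to barrier 3 */
        int total = 0;
        for (int s = 0; s < segments; s++) {
            int longest = 0;
            for (int t = 0; t < threads; t++)
                if (cycles[s][t] > longest)
                    longest = cycles[s][t];
            total += longest;                     /* 4 + 6 + 12 */
        }
        printf("overall execution: %d cycles\n", total);   /* prints 22 */
        return 0;
    }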



(1) #include (2) #include (3) int main (int argc, char *argv[]) { (4) int th_id, nthreads; (5) #pragma omp parallel private(th_id) { (6) th_id = omp_get_thread_num(); (7) printf("Hello World from thread %d\n", th_id); (8) #pragma omp barrier Modified OpenMP library (9) if ( th_id == 0 ) { (10) nthreads = omp_get_num_threads(); (11) printf("There are %d threads\n",nthreads); (12) } (13) } (14) return 0; (15) } Instruction Barrier Figure 34. How the simulation handles synchronization from the software point of view

Figure 34 depicts how the simulation handles synchronization from the software point of view. A simple example of how to parallelize an application with OpenMP (MENON, 1998) is shown on the left side of the same figure. The example works as follows. First, when the parallel region starts (line 5), each thread gets its thread identification number (line 6). Then, each thread prints a message on the screen showing its thread identification number (line 7). Finally, the master thread, th_id=0, prints the number of threads that have participated in the application execution. Before this last printing, a barrier was inserted (line 8) to prevent the master thread from printing this last message before the other threads have printed their identification messages. In this case, threads that have already reached the barrier stay halted until all threads arrive at this synchronization point. On the right side of Figure 34 one can see how our simulation environment handles the software barrier explained above. When a certain thread reaches a barrier, the OpenMP library is accessed and the procedure that processes the respective barrier is called. We have modified this procedure by inserting an assembly instruction, via inline assembly in the C code, which we call a "magic instruction" (a sketch of this idea is given below). When the DAP cycle-accurate simulator reaches this "magic instruction", a barrier is recognized and the synchronization among threads can be accomplished. It is important to point out that we consider speculative execution, which means that a single configuration can contain up to 3 basic blocks. The speculation policy is based on a bimodal predictor [6]: for each level of the tree of basic blocks, the counter must reach its maximum or minimum value (indicating the direction of the branch). In addition, interrupts and traps are not considered, since our experiments do not run operating system code.
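A minimal sketch of how such a marker could be injected into the barrier routine is given below; the specific no-op opcode chosen as the marker and the wrapper name are assumptions for illustration, not the actual modification made to the library.

#include <stdio.h>

/* A SPARC instruction with no architectural effect (it writes to %g0) that is
 * reserved by convention as a marker; the trace-driven simulator recognizes it
 * and treats it as a barrier.  The chosen immediate is arbitrary here. */
static inline void simulator_barrier_mark(void)
{
#if defined(__sparc__)
    __asm__ __volatile__("sethi %hi(0x40000), %g0");
#else
    puts("barrier mark");   /* placeholder when compiling for another ISA */
#endif
}

/* Hypothetical wrapper around the barrier routine of the modified OpenMP library. */
void traced_omp_barrier(void)
{
    simulator_barrier_mark();   /* emit the marker so the simulator sees the barrier */
    /* ... the original barrier implementation would continue here ... */
}

int main(void)
{
    traced_omp_barrier();
    return 0;
}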

5.1.5 Organization of this Chapter
The rest of this chapter is divided into four subsections that explore the main topics discussed in this work. In the first subsection, we show the potential of CReAMS by comparing it against a multiprocessing system composed of simple processors, named MPSparcV8; in this part, we do not consider the communication overhead imposed by the applications. In the second subsection, we apply the Network-on-Chip modeling shown in Section 3.1.5 to both CReAMS and MPSparcV8, so we are able to explore different latencies of the communication infrastructure and highlight the need for mixed and adaptable parallelism exploitation when this aspect is considered. In the third subsection, we present the heterogeneous CReAMS, in which DAPs with different capabilities for exploiting instruction level parallelism are encapsulated in a single die. Finally, we present some studies comparing CReAMS against a multiprocessing system composed of 4-issue out-of-order superscalar SparcV8 processors.

5.2 The Potential of CReAMS

In all experiments of this subsection, we have compared CReAMS to a multiprocessing platform built by replicating standalone SparcV8 processors, named MPSparcV8. The configurations of the DAP and of the standalone SparcV8 processor used in this evaluation are shown in Table 4. Both CReAMS and MPSparcV8 have an on-chip unified 512 KB 8-way set associative shared L2 cache.

Table 4. The configuration of both basic processors
                      L1 I-Cache   L1 D-Cache   Pipeline   Frequency
DAP                   8KB 4-Way    32KB 4-Way   5-Stages   600 MHz
Standalone SparcV8    32KB 4-Way   32KB 4-Way   5-Stages   600 MHz

For the experiments, we have considered CReAMS setups with different numbers of DAPs (4, 8, 16 and 64 DAPs). The same was done with the MPSparcV8 system (4, 8, 16 and 64 SparcV8s). Each DAP has a reconfigurable data path (Block 1 of Section 4.1.2) composed of 6 arithmetic and logic units, 4 load/store units and 2 multipliers per column. The entire reconfigurable data path has 24 columns. A 48 KB reconfiguration memory, implemented as a DRAM memory, is able to store up to 64 configurations. The configurations are indexed by a 64-entry 4-way set associative Address Cache. As already discussed in Section 4.1.4, the DAP performs roughly a quarter of the L1 I-Cache accesses of the SparcV8 processor. Thus, to prevent the memory hierarchy from biasing the results in our favor, we reduced the DAP L1 I-Cache to 8KB so that it reaches the same main memory miss ratio as the 32KB L1 I-Cache of the standalone SparcV8. In addition, in all experiments the number of spawned threads is equal to the number of DAPs available in the CReAMS platform. In this subsection, since we want to present the potential of the proposed approach, the inter-thread communication latency is disregarded. As the baseline processors of both multiprocessing systems could be connected through any communication infrastructure, in this first part of the results we ignore it to avoid biasing the comparison towards a particular approach. The next subsection presents results considering a mesh Network-on-Chip as the communication infrastructure, with different latencies.

Table 5. (a) Area (um2) of DAP and SparcV8 components
                                      Area (um2)
Processors                         DAP          SparcV8
SparcV8 processor                  247,615      247,615
Reconfigurable Data Path           3,298,602    -
32KB 4-Way L1 I-Cache              -            483,303
8KB 4-Way L1 I-Cache               130,980      -
32KB 4-Way L1 D-Cache              483,303      483,303
48 KB Reconfiguration Memory       688,255      -
128 Entries 4-Way Address Cache    19,473       -
Total                              4,868,228    1,214,221


5.2.1 Considering the Same Chip Area
Table 5 shows the area, in um2, of a standalone SparcV8 and of a DAP, including memories. As can be seen, a DAP is almost four times bigger than the standalone SparcV8 processor. For the first experiment, we wanted to compare the performance and energy of MPSparcV8 and CReAMS considering the same chip area. Hence, we have built two comparison schemes: "Same Area #1" (Table 6 (a)), composed of 4 DAPs and 16 SparcV8 processors; and "Same Area #2" (Table 6 (b)), composed of 16 DAPs and 64 SparcV8 processors.

Table 6. Area, in um2, of: (a) Same Area Chip scheme #1 (b) Same Area Chip scheme #2
(a)
System Area (um2)     CReAMS 4 DAPs    MPSparcV8 16 SparcV8
Processors            19,472,912       19,427,537
512KB L2 Cache        15,118,865       15,118,865
Total                 34,591,777       34,546,402

(b)
System Area (um2)     CReAMS 16 DAPs   MPSparcV8 64 SparcV8
Processors            77,891,647       77,710,146
512KB L2 Cache        15,118,865       15,118,865
Total                 93,010,512       92,829,011

Table 7 shows the speedup obtained by MPSparcV8 and CReAMS over a standalone SparcV8 processor for all applications. The data in this table elucidate the heterogeneity of the selected applications in terms of instruction and thread level parallelism. Focusing only on the MPSparcV8 speedup, where only TLP is exploited, one can conclude that md, jacobi, susan_e and susan_s contain massive thread level parallelism, since the speedup increases linearly as the number of cores increases. However, when instruction level parallelism exploitation is added by applying CReAMS, further performance improvements are obtained even in such applications. These results reinforce the conclusion drawn from the analytical model in Section 3: even for applications with high TLP, finer-grain parallelism exploitation is needed to complement the TLP gains. When a huge number of processors is available, MPSparcV8 presents only small performance improvements on the rest of the applications, which supports the employment of CReAMS in a heterogeneous application environment with a diversity of thread and instruction level parallelism. Table 8 shows the execution time, in milliseconds, of all benchmarks on both platforms. Below this table, one can find the arrows linking the setups with the same area. As can be seen, considering the "Same Area #1" scheme, CReAMS outperforms MPSparcV8 in six benchmarks (equake, apsi, ammp, susan edges, patricia and lu), even for applications that contain massive TLP, such as lu and susan edges. ammp is the one that benefits most from being executed on CReAMS: its execution time is 41% smaller than on MPSparcV8. This application contains several synchronization points required to maintain data consistency among the execution threads. In addition, as can be seen in Table 3, these synchronization points are very load unbalanced, which limits the speedup when thread level parallelism exploitation becomes more aggressive. In contrast, this application provides wider room for performance improvements when instruction level parallelism exploitation is applied. As shown in Table 3, ammp has a mean basic block size of 14.56 instructions, which elucidates its data flow nature. In addition, this application also has few and well-defined kernels, which contributes to the acceleration offered by DAP.


Table 7. Speedup provided by MPSparcV8 and CReAMS over a standalone SparcV8 processor

             MPSparcV8                                     CReAMS
Speedup      4SparcV8  8SparcV8  16SparcV8  64SparcV8      4DAPs   8DAPs   16DAPs  64DAPs
equake       1.57      1.74      1.86       1.93           2.11    2.34    2.48    2.58
apsi         1.59      1.77      1.87       1.92           2.02    2.25    2.37    2.43
ammp         1.42      1.51      1.58       2.02           2.69    2.95    3.15    4.38
susan_e      2.15      2.69      3.06       3.39           3.38    4.67    5.76    6.84
patricia     1.24      1.47      1.28       1.23           1.94    2.87    2.49    2.39
susan_c      2.98      4.60      6.29       8.77           4.59    7.77    11.90   19.88
susan_s      3.93      7.81      14.98      48.93          7.69    15.04   28.65   92.84
md           3.91      7.54      14.06      38.22          10.91   20.22   34.83   74.91
jacobi       3.92      7.78      15.24      50.97          5.57    11.00   21.41   69.74
lu           3.08      4.86      4.18       3.51           6.28    10.55   8.30    6.85
Average      2.58      4.18      6.44       16.09          4.72    7.97    12.13   28.28

[Arrows under the table link the setups compared under the Same Area #1, Same Area #2 and Same Peak Power Budget schemes.]

While ammp is the best application example for ILP exploitation and the worst for TLP, jacobi is the opposite. Its execution time decreases linearly as the number of processors increases when only TLP is exploited, which means that its threads are completely synchronized, providing the perfect load balancing shown in Table 3. Considering instruction level parallelism exploitation, the small mean size of its basic blocks restricts the room for optimization when this strategy is applied. Hence, since the number of SparcV8 processors in the "Same Area #1" scheme is four times the number of DAPs, the MPSparcV8 design outperforms CReAMS by 2.7 times on this benchmark. Nevertheless, considering all applications, CReAMS reduces the execution time by, on average, 36% in the "Same Area #1" scheme. Considering the "Same Area #2" scheme, the performance and energy gains provided by CReAMS over MPSparcV8 are more evident than in the "Same Area #1" setup. As the number of cores increases, the level of TLP tends to stagnate. In this case, CReAMS outperforms MPSparcV8 in seven benchmarks (equake, apsi, ammp, susan_e, susan_c, patricia and lu); for instance, lu executes 1.3 times faster on CReAMS than on MPSparcV8. However, as already explained, since susan smoothing, md and jacobi have perfect load balancing among their threads (Table 3), MPSparcV8 is faster than CReAMS for them. Even so, for applications with massive TLP and perfect load balance, the employment of CReAMS can be considered satisfactory since, as will be shown later, energy savings are obtained.


Table 8. Execution time of MPSparcV8 and CReAMS

                    MPSparcV8                                     CReAMS
Execution Time (ms) 4SparcV8  8SparcV8  16SparcV8  64SparcV8      4DAPs   8DAPs   16DAPs  64DAPs
equake              1501      1349      1267       1216           1113    1006    947     910
apsi                9070      8149      7693       7502           7111    6412    6066    5914
ammp                13632     12794     12206      9589           7197    6545    6139    4418
susan_e             218.2     174.4     153.0      138.3          138.6   100.3   81.4    68.5
patricia            138.0     116.2     134.0      139.8          88.3    59.7    68.8    71.7
susan_c             68.95     44.70     32.68      23.44          44.77   26.43   17.27   10.34
susan_s             815.8     410.3     214.0      65.5           416.8   213.2   111.9   34.5
md                  1.001     0.519     0.278      0.102          0.359   0.194   0.112   0.052
jacobi              354.1     178.6     91.2       27.3           249.5   126.3   64.9    19.9
lu                  0.741     0.470     0.545      0.650          0.363   0.216   0.275   0.333

[Arrows under the table link the setups compared under the Same Area #1, Same Area #2 and Same Peak Power Budget schemes.]

Table 9 and Table 10 show, respectively, the average power dissipation and the energy consumption of the MPSparcV8 and CReAMS architectures. As can be seen in Table 10, besides providing a 41% lower execution time than MPSparcV8 in the "Same Area #1" scheme when running ammp, CReAMS also spends 32% less energy. As already explained, MPSparcV8 is 2.7 times faster than CReAMS when executing jacobi in the "Same Area #1" scheme, but it spends more energy: in this case, as can be seen in Table 9, CReAMS reduces the average power by a larger factor than the factor by which MPSparcV8 reduces the execution time. In all applications except patricia, CReAMS dissipates less average power than MPSparcV8 in the "Same Area #1" scheme. In patricia, the ILP exploitation process of CReAMS is not so energy efficient: a huge number of configurations is fetched from the reconfiguration memory, but these configurations do not deliver performance gains significant enough to dilute the energy spent on accessing the reconfiguration memory. The main sources of CReAMS energy savings are:
• although more power is spent because of the DDH hardware and the reconfigurable data path, the total average power is reduced (refer to Table 9), since there are fewer instruction fetches from the L1 instruction cache: once instructions are translated into a data path configuration, they reside in the reconfiguration memory. In addition, energy is also saved on each single instruction memory access since, as already discussed, CReAMS has an instruction memory four times smaller than the MPSparcV8;
• the reconfiguration strategy: considering the loop example of Figure 9, the data path is reconfigured only once to execute 98 loop iterations, thus avoiding several accesses to the reconfiguration memory;


• shorter execution time: as CReAMS accelerates the code, it consumes less energy;
• the use of sleep transistors prevents idle functional units from dissipating unnecessary power.

Table 9. Average Power consumption of MPSparcV8 and CReAMS

                    MPSparcV8                                     CReAMS
Average Power (mW)  4SparcV8  8SparcV8  16SparcV8  64SparcV8      4DAPs   8DAPs   16DAPs  64DAPs
equake              130.0     167.5     178.5      231.9          136.5   172.8   180.7   179.2
apsi                130.9     156.4     171.8      182.7          125.8   149.4   163.7   153.1
ammp                106.0     135.3     139.4      139.4          105.3   146.6   152.9   152.9
susan_e             167.5     234.9     280.0      319.9          179.9   283.3   366.7   451.6
patricia            152.6     145.0     141.4      135.7          188.7   217.0   215.4   206.9
susan_c             216.6     407.2     587.6      841.7          257.4   538.3   871.5   1498.2
susan_s             266.2     608.5     1305.6     4350.1         239.9   548.7   1170.7  3917.5
md                  281.3     633.3     1266.1     3671.8         516.0   1116.4  2063.7  4749.0
jacobi              279.9     647.9     1363.3     4855.5         279.1   643.8   1345.9  4671.7
lu                  225.7     413.2     426.3      507.5          312.6   608.8   572.5   683.7
Average             195.7     354.9     586.0      1523.6         234.1   442.5   710.4   1666.4

[Arrows under the table link the setups compared under the Same Area #1, Same Area #2 and Same Peak Power Budget schemes.]

5.2.2 Considering the Power Budget
Current embedded systems have severe power constraints, since most of them depend on batteries. Therefore, we have also evaluated the performance and energy of both platforms considering a peak power budget of 3 Watts, which is the limit foreseen for batteries in the coming years (SEMICONDUCTORS, 2009). The peak power of the standalone SparcV8 is 385.14 mWatts, while the DAP dissipates 699.33 mWatts. Therefore, we have compared the 8-SparcV8 MPSparcV8 setup against the 4-DAP CReAMS setup, since both reach nearly 3 Watts of peak power. The performance, power and energy results of both platforms under the peak power budget scheme are linked by arrows under Table 7, Table 8, Table 9 and Table 10. As can be seen in Table 8, CReAMS outperforms MPSparcV8 in seven benchmarks (equake, apsi, ammp, susan_e, patricia, jacobi and lu). As in the Same Area schemes, ammp is the application that benefits the most from CReAMS when the same peak power budget is considered, achieving a 43% shorter execution time and consuming 30% less energy (Table 10) than the MPSparcV8. Although susan smoothing runs only 1% slower on CReAMS, CReAMS spends 59% less energy than the MPSparcV8 for that application. Energy savings are possible because, although the peak power is the same in both architectures, CReAMS dissipates a lower average power (refer to Table 9), mainly thanks to the use of sleep transistors to turn off idle functional units of the reconfigurable data path. In addition, its energy-efficient method of joining several ordinary instructions into a single data path configuration produces an

advantageous tradeoff in replacing several instruction memory accesses by a single reconfiguration memory access.

Table 10. Energy consumption of MPSparcV8 and CReAMS

                    MPSparcV8                                     CReAMS
Energy (mJ)         4SparcV8  8SparcV8  16SparcV8  64SparcV8      4DAPs   8DAPs   16DAPs  64DAPs
equake              195.1     226.0     226.2      281.9          151.9   173.9   171.2   163.1
apsi                1187      1275      1322       1371           894     958     993     905
ammp                1445      1732      1701       1336           758     960     939     675
susan_e             36.55     40.98     42.83      44.25          24.94   28.41   29.85   30.95
patricia            21.05     16.85     18.95      18.97          16.67   12.94   14.81   14.83
susan_c             14.93     18.20     19.20      19.73          11.53   14.23   15.05   15.48
susan_s             217.1     249.7     279.5      285.0          100.0   117.0   131.0   135.3
md                  0.282     0.329     0.353      0.376          0.185   0.216   0.232   0.248
jacobi              99.1      115.7     124.3      132.3          69.6    81.3    87.3    93.1
lu                  0.167     0.194     0.232      0.330          0.113   0.132   0.157   0.227
Average             321.7     367.4     373.4      349.0          202.7   234.6   238.1   203.4

[Arrows under the table link the setups compared under the Same Area #1, Same Area #2 and Same Peak Power Budget schemes.]

5.2.3 Energy-Delay Product
We correlate the energy and performance results of both platforms to make the efficiency of CReAMS more evident. The energy-delay product is shown in Table 11; the schemes with the same area are linked under this table. As explained in Section 5.2.1, when running ammp CReAMS saves 32% of the energy and improves the performance by 41% compared to the MPSparcV8 when the "Same Area #1" scheme is considered. Thus, CReAMS reduces the energy-delay product of ammp by a factor of almost four. The gains in energy and performance provided by the "Same Area #2" scheme are greater than in the first comparison shown above, since the performance improvements obtained by exploiting only TLP lose steam as the number of processors increases. In this case, CReAMS outperforms MPSparcV8 in seven benchmarks (equake, apsi, ammp, susan_e, susan_c, patricia and lu); for instance, lu executes 57% faster on CReAMS than on MPSparcV8 and consumes 50% less energy when the "Same Area #2" scheme is considered, which translates into an energy-delay product reduction by a factor of five. Overall, the average reduction achieved by CReAMS in the energy-delay product is about 76% considering the same chip area schemes.
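As an illustration of how the entries of Table 11 relate to the previous tables, the short sketch below multiplies energy by execution time for ammp under the Same Area #1 scheme; the values are taken from Table 8 and Table 10 and reproduce the corresponding Table 11 entries up to rounding.

#include <stdio.h>

int main(void)
{
    /* ammp, Same Area #1: 16-SparcV8 MPSparcV8 vs 4-DAP CReAMS            */
    double e_mp = 1701.0, t_mp = 12206.0;   /* energy (mJ), time (ms)      */
    double e_cr = 758.0,  t_cr = 7197.0;

    double edp_mp = e_mp * t_mp / 1000.0;   /* expressed in J*1e-3 s, as in Table 11 */
    double edp_cr = e_cr * t_cr / 1000.0;

    printf("EDP MPSparcV8 = %.0f, EDP CReAMS = %.0f, ratio = %.1f\n",
           edp_mp, edp_cr, edp_mp / edp_cr);   /* ratio close to the factor of four above */
    return 0;
}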


Table 11. Energy-Delay product of MPSparcV8 and CReAMS

                               MPSparcV8                                            CReAMS
Energy-Delay Product (J*1e-3s) 4SparcV8   8SparcV8   16SparcV8  64SparcV8           4DAPs      8DAPs      16DAPs     64DAPs
equake                         293        305        287        343                 169        175        162        149
apsi                           10768      10386      10169      10282               6361       6143       6023       5355
ammp                           19698      22155      20764      12814               5452       6282       5761       2984
susan_e                        7.97       7.15       6.55       6.12                3.46       2.85       2.43       2.12
patricia                       2.91       1.96       2.54       2.65                1.47       0.77       1.02       1.06
susan_c                        1.0296     0.8134     0.6274     0.4624              0.5160     0.3761     0.2600     0.1600
susan_s                        177.2      102.5      59.8       18.7                41.7       24.9       14.7       4.7
md                             0.000282   0.000171   0.000098   0.000039            0.000066   0.000042   0.000026   0.000013
jacobi                         35.1       20.7       11.3       3.6                 17.4       10.3       5.7        1.9
lu                             0.000124   0.000091   0.000127   0.000215            0.000041   0.000028   0.000043   0.000076

[Arrows under the table link the setups compared under the Same Area #1, Same Area #2 and Same Peak Power Budget schemes.]

As already explained, all but three (susan smoothing, md and jacobi) of the ten benchmarks present better performance on CReAMS than on MPSparcV8 considering the same chip area schemes. As can be noticed in Table 3, the threads of these three exceptions have perfect load balancing, which produces an almost linear speedup with the increase in the number of processors. However, even showing worse performance than MPSparcV8, CReAMS provides reductions in the energy-delay product of 26% and 33% when executing susan smoothing and md, respectively. Due to its massive TLP and perfect load balancing, jacobi is the only application for which CReAMS provides gains neither in performance nor in energy considering the same chip area schemes. Using the energy-delay product evaluation, we demonstrated the efficiency of CReAMS in dynamically adapting to applications with different levels of parallelism, providing gains in performance and/or in energy consumption.

5.3 The impact of Inter-thread Communication
Up to now, the results shown in the previous subsection overlook the performance and energy overhead of inter-thread communication. This subsection includes a Network-on-Chip infrastructure in both multiprocessing systems to demonstrate the impact of the communication latency on the results shown in the previous subsection. For that, we use the mesh-NoC modeling presented in Section 3.1.5 to model the latency of a single inter-thread communication. Since threads can be arbitrarily allocated to the processors within the NoC, we applied the distributed and the centralized approaches to explore the impact of different data distributions in the infrastructure. In addition, to mimic a perfect communication scenario, where there are only communications among neighboring processors, we created the Ideal scenario, where the number of hops is equal to one. We also assume that one hop takes one clock cycle and that the NoC runs at the same frequency as the processors (600 MHz). The

average number of hops of the Distributed and Centralized schemes was calculated as follows:

Distributed: hops = (2 * sqrt(N)) / 3
Centralized: hops = N / (sqrt(N) + 1)
Ideal: hops = 1

where N is the number of processor nodes and sqrt(N) is the side length of the square mesh. Table 12 depicts the average number of hops for a single communication considering the methodology presented above.

Table 12. Average number of hops for different multiprocessing systems
Number of Hops   4 Proc   8 Proc   16 Proc   32 Proc   64 Proc
Distributed      1.33     1.88     2.66      3.77      5.33
Centralized      1.33     2.09     3.20      4.81      7.11
Ideal            1        1        1         1         1
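A small sketch that reproduces the values of Table 12 (up to rounding) from the expressions above is shown below; the program itself is merely illustrative.

#include <stdio.h>
#include <math.h>

int main(void)
{
    const int nodes[] = {4, 8, 16, 32, 64};
    for (int i = 0; i < 5; i++) {
        double k = sqrt((double)nodes[i]);          /* side length of the square mesh */
        double distributed = 2.0 * k / 3.0;         /* uniform (distributed) traffic  */
        double centralized = (k * k) / (k + 1.0);   /* all traffic towards one hub    */
        double ideal = 1.0;                         /* neighbor-only traffic          */
        printf("%2d procs: distributed=%.2f centralized=%.2f ideal=%.2f\n",
               nodes[i], distributed, centralized, ideal);
    }
    return 0;
}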

5.3.1 Considering the Same Chip Area
In the previous subsection, we compared CReAMS with the MPSparcV8 using two different scenarios: same area and same power budget. In this section, we follow the same strategy, but now we take into account the area occupied by the NoC and its power consumption. Hence, we have used the same comparison schemes: "Same Area #1" (Table 13 (b)), composed of 4 DAPs and 16 SparcV8 processors; and "Same Area #2" (Table 13 (c)), composed of 16 DAPs and 64 SparcV8 processors. Small changes were made in the DAP configuration to respect the same chip area: now, each DAP is composed of 24 columns, and each column has 5 arithmetic and logic units, 3 load/store units and 2 multipliers. A 92 KB reconfiguration memory, implemented as a DRAM memory, is able to store up to 128 configurations. The areas of the CReAMS and MPSparcV8 components are shown in Table 13 (a). Table 14 shows the execution time of both the CReAMS and MPSparcV8 platforms when the designs belonging to the Same Area #1 scheme are exposed to the communication latencies presented above. As can be seen, even when considering all communication latency schemes (Ideal, Distributed and Centralized), CReAMS still outperforms MPSparcV8 in six applications (equake, apsi, ammp, susan edges, patricia and lu). Moreover, the performance gains of CReAMS over MPSparcV8 increase with respect to the previous subsection. When no communication latency is considered, ammp presents a 41% smaller execution time on CReAMS than on MPSparcV8; considering the Same Area #1 and the Centralized schemes, the performance improvement of CReAMS over MPSparcV8 grows to 45%. For the other benchmarks mentioned above, the performance gains provided by CReAMS increase from 13%, 8%, 10% and 50% to 36%, 27%, 13% and 60% in equake, apsi, susan edges and lu, respectively.


Table 13. (a) Area of DAP and SparcV8 components (b) Same Area #1 scheme (c) Same Area #2 scheme

(a)
                                      Area (um2)
Processors                         DAP          SparcV8
SparcV8 processor                  247,615      247,615
Reconfigurable Data Path           2,800,914    -
32KB 4-Way L1 I-Cache              -            483,303
8KB 4-Way L1 I-Cache               130,980      -
32KB 4-Way L1 D-Cache              483,303      483,303
92 KB Reconfiguration Memory       1,376,510    -
128 Entries 4-Way Address Cache    19,473       -
Total                              5,058,795    1,214,221

(b)
System Area (um2)       CReAMS 4 DAPs    MPSparcV8 16 SparcV8
Processors              20,235,180       19,427,537
512KB 8-Way L2 Cache    15,118,865       15,118,865
NoC Routers             113,884          455,536
Total                   35,467,929       35,001,938

(c)
System Area (um2)       CReAMS 16 DAPs   MPSparcV8 64 SparcV8
Processors              80,940,718       77,710,146
512KB 8-Way L2 Cache    15,118,865       15,118,865
NoC Routers             455,536          1,822,144
Total                   96,515,119       94,651,155

Table 14. Execution time (in ms) considering the Same Area #1 scheme for CReAMS and MPSparcV8
                     MPSparcV8                                  CReAMS
Execution Time (ms)  16SparcV8   16SparcV8   16SparcV8          4DAPs     4DAPs     4DAPs
                     Ideal       Distrib.    Central.           Ideal     Distrib.  Central.
equake               1366.09     1530.97     1583.73            1041.21   1079.33   1079.33
apsi                 8350.71     9446.73     9797.45            7398.40   7667.47   7667.47
ammp                 12897.11    14384.21    14860.09           7942.10   8246.90   8246.90
susan_e              159.88      171.54      175.27             149.56    154.65    154.65
patricia             196.83      212.15      217.05             109.08    112.76    112.76
susan_c              34.84       38.43       39.59              51.19     53.54     53.54
susan_s              237.29      276.03      288.43             492.79    519.68    519.68
md                   0.30        0.34        0.36               0.44      0.47      0.47
jacobi               102.96      122.67      128.97             290.52    304.57    304.57
lu                   0.61        0.71        0.74               0.43      0.46      0.46
Total Ex. Time       23346.61    26183.78    27091.68           17475.73  18139.85  18139.85

In the cases where MPSparcV8 outperforms CReAMS (susan corners, susan smoothing, md and jacobi), the communication latency affects the former more than the latter. The increase in the number of processors makes communication more significant in the execution time. When the Centralized communication latency is considered, the execution time of CReAMS approaches that of MPSparcV8. This is most evident in jacobi, where the performance gain of MPSparcV8 over CReAMS falls from 2.76 to 2.32 times solely due to the communication overhead. jacobi spends 41.5% of its execution time on data communication on the 16-core MPSparcV8, while on the 4-DAP CReAMS this percentage falls to 22.6%. Thus, for some applications, as shown in the previous section, TLP exploitation can provide performance improvements when the number of processors increases, but the gains can be sorely affected by the communication overhead. On average, the performance improvement of CReAMS over


MPSparcV8 increases from 30% to 34% when the communication overhead is included in the Same Area #1 scheme.

Table 15. Execution time (in ms) considering the Same Area #2 scheme for CReAMS and MPSparcV8
                     MPSparcV8                                  CReAMS
Execution Time (ms)  64SparcV8   64SparcV8   64SparcV8          16DAPs    16DAPs    16DAPs
                     Ideal       Distrib.    Central.           Ideal     Distrib.  Central.
equake               1309.44     1715.24     1881.72            860.26    1025.12   1077.87
apsi                 8146.53     10940.09    12086.19           6277.66   7373.69   7724.42
ammp                 12770.80    16750.42    18383.12           6718.75   8205.81   8681.68
susan_e              143.28      164.80      173.62             84.54     96.20     99.93
patricia             205.32      246.82      263.85             85.51     100.83    105.74
susan_c              24.35       28.47       30.29              19.30     22.90     24.05
susan_s              72.64       103.47      116.12             131.56    170.30    182.69
md                   0.11        0.16        0.17               0.13      0.17      0.19
jacobi               32.07       53.05       61.66              75.70     95.41     101.72
lu                   0.71        0.99        1.11               0.32      0.43      0.46
Total Ex. Time       22705.25    30003.51    32997.85           14253.74  17090.85  17998.75

Table 15 shows the execution time under the Same Area #2 scheme when the communication latency is considered for both CReAMS and MPSparcV8. Since the number of cores is larger in this comparison scheme, the communication overhead is more evident than in the Same Area #1 scheme. Therefore, CReAMS presents higher performance improvements over the MPSparcV8 in those applications where the TLP is limited and the share of communication in the execution time is significant. Those are the cases of equake, ammp, apsi, susan edges, susan corners and lu, where CReAMS provides lower execution times than MPSparcV8 by 75%, 56%, 1.1 times, 73%, 25% and 1.4 times, respectively. susan corners is the most appealing case: in the Same Area #1 scheme MPSparcV8 outperforms CReAMS by 25%, but due to the communication latency of the 64 cores of MPSparcV8, CReAMS turns this scenario around and outperforms MPSparcV8 by 25%. As can be noticed by comparing Table 14 and Table 15, the execution times of equake, apsi, ammp, patricia and lu grow when the number of MPSparcV8 processors increases from 16 to 64. This fact supports the conclusions presented by the Analytical Model in Section 3.1.6: for some applications, the performance gains of increasing the number of processors (exploiting TLP) are smaller than the overhead caused by inter-thread communication for both the Distributed and Centralized schemes. This means that there are cases where it is not worthwhile to increase the number of processors; it is worth investing in more aggressive ILP exploitation instead.


Table 16. Energy (in mJoules) considering the Same Area #1 scheme for CReAMS and MPSparcV8
                     MPSparcV8                                  CReAMS
Energy (mJ)          16SparcV8   16SparcV8   16SparcV8          4DAPs     4DAPs     4DAPs
                     Ideal       Distrib.    Central.           Ideal     Distrib.  Central.
equake               226.72      227.54      227.80             123.73    123.85    123.85
apsi                 1324.78     1329.83     1331.44            775.34    776.16    776.16
ammp                 1690.02     1695.68     1697.49            658.87    659.68    659.68
susan_e              42.93       43.09       43.14              28.95     28.98     28.98
patricia             19.22       19.26       19.27              10.74     10.75     10.75
susan_c              19.25       19.34       19.37              12.91     12.93     12.93
susan_s              280.14      281.30      281.67             115.20    115.39    115.39
md                   0.35        0.35        0.35               0.18      0.18      0.18
jacobi               124.63      125.20      125.39             80.06     80.15     80.15
lu                   0.24        0.24        0.24               0.11      0.11      0.11
Total Energy         3728.29     3741.83     3746.17            1806.08   1808.17   1808.17

Table 16 and Table 17 show the energy consumption of CReAMS and MPSparcV8 when the Same Area #1 and #2 schemes are considered, respectively. The energy gains provided by CReAMS over the MPSparcV8 are due to the same reasons mentioned in the previous subsection. The power overhead introduced by the Network-on-Chip is almost negligible in comparison with the power dissipated by the computation: while a NoC router spends 11.7 mWatts (MATOS, CONCATTO, et al., 2011) to transfer one packet from a certain input to the target output, a SparcV8 dissipates 385.14 mWatts and a DAP dissipates 696.75 mWatts. The power consumption of a DAP shows a small variation with respect to the previous subsection due to the changes in its configuration: as mentioned earlier, we decreased the number of ALUs and increased the number of slots in the reconfiguration memory to respect the Same Area schemes. When the Same Area #1 scheme is considered, CReAMS increases its energy gains over the MPSparcV8 in comparison with the results in which communication is overlooked. Those are the cases of equake, apsi, ammp, patricia, md and lu, where the energy gains increase from 48.92%, 47.77%, 124.54%, 13.72%, 90.44% and 104.81% to 83.93%, 71.54%, 157.32%, 30.87%, 93.57% and 116.61%, respectively. CReAMS reduces the overall energy consumption by 52% when running all benchmarks in the Same Area #1 scheme.


Table 17. Energy (in mJoules) considering the Same Area #2 scheme for CReAMS and MPSparcV8
                     MPSparcV8                                  CReAMS
Energy (mJ)          64SparcV8   64SparcV8   64SparcV8          16DAPs    16DAPs    16DAPs
                     Ideal       Distrib.    Central.           Ideal     Distrib.  Central.
equake               282.63      285.64      286.87             140.69    141.51    141.77
apsi                 1373.18     1387.57     1393.47            863.41    868.45    870.07
ammp                 1750.25     1766.24     1772.80            810.10    815.76    817.57
susan_e              44.36       44.83       45.03              27.92     28.08     28.13
patricia             19.25       19.35       19.39              9.64      9.67      9.68
susan_c              19.79       20.03       20.12              12.89     12.98     13.01
susan_s              285.78      288.99      290.30             112.90    114.06    114.43
md                   0.38        0.39        0.39               0.17      0.17      0.17
jacobi               132.83      135.01      135.90             76.51     77.08     77.27
lu                   0.33        0.34        0.34               0.12      0.12      0.12
Total Energy         3908.79     3948.37     3964.61            2054.35   2067.89   2072.22

The communication/computation ratio remains constant as the number of processors increases for the following benchmarks: susan smoothing, patricia, md, equake and apsi. This ratio increases slightly, by at most 8%, in ammp and jacobi, and falls, by at most 7%, in susan corners, susan edges and lu. However, as a single communication becomes more expensive in our experiments under the different traffic schemes (Ideal, Distributed and Centralized), the overall communication latency tends to produce a greater impact on the execution time. Moreover, as shown in Table 12, the cost of a single communication also grows as the number of processors increases. Thus, the Same Area #2 scheme highlights the energy savings provided by CReAMS, since the energy spent by the routers becomes much more significant in the 64-processor MPSparcV8 than in the 16-DAP CReAMS. jacobi is the benchmark most affected by inter-thread communication: the computation/communication ratio in this benchmark is 17%, regardless of the number of processors. However, when the latency of a single communication increases (emulated by the Ideal, Distributed and Centralized schemes), this factor becomes significant in the overall execution time. For instance, for the jacobi benchmark under the Distributed traffic scheme applied to the 64-processor MPSparcV8, the inter-thread communication produces an overhead of 94% over its original execution time (disregarding communication), and of a factor of 1.2 under the Centralized scheme. On the other hand, when this application runs on the 16-DAP CReAMS platform, these percentages fall to 48% and 58%, respectively. This means that CReAMS spends 75% less energy than MPSparcV8: 51% of the energy savings come from faster computation and 24% from communication avoidance. On average, CReAMS achieves 96% of energy savings over MPSparcV8 when the Centralized and Same Area #2 schemes are considered, with 59% coming from faster computation and 37% from communication avoidance.


5.3.2 Considering the Power Budget
Similarly to the previous subsection, a power budget is used to compare the performance and energy of CReAMS and MPSparcV8, but now we consider the power consumption of the Network-on-Chip. We built setups for both platforms that dissipate 3 Watts. As the power consumed by a router (11.7 mW) is almost negligible in comparison with the processors (696 mW for a DAP and 385 mW for a SparcV8), the number of processors of the previous comparison was maintained. Table 18 shows the execution time of both CReAMS and MPSparcV8 when the power budget of 3 Watts is considered. As can be seen by comparing Table 8 and Table 18, the gains provided by CReAMS over MPSparcV8 increase when the inter-thread communication is introduced. The execution time of equake grows 25% on both CReAMS and MPSparcV8 due to the inter-thread communication. However, this increase has a greater impact on the execution time of MPSparcV8 than on that of CReAMS: when the inter-thread communication latency is disregarded, the latter achieves a 21% smaller execution time than the former, and when inter-thread communication is considered, the performance gain of CReAMS over MPSparcV8 increases to 44%. On average, CReAMS outperforms MPSparcV8 by 29% when inter-thread communication is overlooked and by 33% when this characteristic is introduced in the Same Power Budget scheme.

Table 18. Execution time (in ms) of both CReAMS and MPSparcV8 considering a power budget
                     MPSparcV8                                  CReAMS
Execution Time (ms)  8SparcV8    8SparcV8    8SparcV8           4DAPs     4DAPs     4DAPs
                     Ideal       Distrib.    Central.           Ideal     Distrib.  Central.
equake               1451.45     1542.00     1562.85            1041.21   1079.33   1079.33
apsi                 8854.87     9480.46     9624.57            7398.40   7667.47   7667.47
ammp                 13726.69    14552.44    14742.66           7942.10   8246.90   8246.90
susan_e              184.20      192.88      194.88             149.56    154.65    154.65
patricia             170.49      177.43      179.03             109.08    112.76    112.76
susan_c              48.49       51.86       52.63              51.19     53.54     53.54
susan_s              453.58      491.88      500.70             492.79    519.68    519.68
md                   0.57        0.61        0.62               0.44      0.47      0.47
jacobi               200.50      219.88      224.35             290.52    304.57    304.57
lu                   0.61        0.67        0.68               0.43      0.46      0.46
Total Ex. Time       25091.46    26710.10    27082.97           17475.73  18139.85  18139.85

Table 19 depicts the energy consumption of CReAMS and MPSparcV8 under the Same Power Budget scheme. For some benchmarks, inter-thread communication enlarges the gains of CReAMS over MPSparcV8. Those cases are equake, apsi, ammp, patricia, md and lu, where CReAMS increases its energy savings over the MPSparcV8 from 48.77%, 42.50%, 128.57%, 1.07%, 77.64% and 31.10% to 83.28%, 64.98%, 163.55%, 16.16%, 80.16% and 102.71%, respectively. In the other cases, the energy savings of CReAMS over the MPSparcV8 decrease when the communication is taken into account. However, this does not occur due to the inter-thread communication itself:

the reason for this reduction is the modified size of the reconfiguration memory. For the inter-thread communication comparison, a DAP has a reconfiguration memory twice as large as in the previous subsection, where the results did not consider this characteristic. As those benchmarks produce a large number of accesses to the reconfiguration memory, and these accesses do not produce performance improvements large enough to fully amortize the increase in the power spent by the new reconfiguration cache size, CReAMS becomes less energy efficient. Nevertheless, on average, when the inter-thread communication is considered, CReAMS spends 31% less energy than when this characteristic is disregarded, which means 51.5% less energy consumption than MPSparcV8.

Table 19. Energy (in mJoules) of both CReAMS and MPSparcV8 considering a power budget
                     MPSparcV8                                  CReAMS
Energy (mJ)          8SparcV8    8SparcV8    8SparcV8           4DAPs     4DAPs     4DAPs
                     Ideal       Distrib.    Central.           Ideal     Distrib.  Central.
equake               226.48      226.90      227.00             123.73    123.85    123.85
apsi                 1277.42     1279.92     1280.49            775.34    776.16    776.16
ammp                 1734.87     1737.86     1738.55            658.87    659.68    659.68
susan_e              41.06       41.14       41.16              28.95     28.98     28.98
patricia             17.08       17.10       17.10              10.74     10.75     10.75
susan_c              19.77       19.81       19.82              12.91     12.93     12.93
susan_s              250.26      250.77      250.88             115.20    115.39    115.39
md                   0.33        0.33        0.33               0.18      0.18      0.18
jacobi               116.02      116.28      116.34             80.06     80.15     80.15
lu                   0.22        0.22        0.22               0.11      0.11      0.11
Total Energy         3683.52     3690.32     3691.89            1806.08   1808.17   1808.17

5.3.3 Energy-Delay Product
Table 20 shows the energy-delay product of both the CReAMS and MPSparcV8 platforms for the Same Area #2 scheme. As can be seen, the conclusions remain the same as in Table 11: CReAMS outperforms MPSparcV8 in all but one application, due to the perfect load balancing shown by jacobi. However, when the communication is considered, CReAMS diminishes its energy-delay product losses on this application from 36% to only 6%. For the remaining applications, CReAMS also provides a better energy-delay product, which improves, on average, from 81% to 88% better than the MPSparcV8 for the Same Area #2 scheme.


Table 20. Energy-Delay product of MPSparcV8 and CReAMS considering the Same Area #2 scheme
                     MPSparcV8                                  CReAMS
Energy-Delay Product 64SparcV8   64SparcV8   64SparcV8          16DAPs    16DAPs    16DAPs
                     Ideal       Distrib.    Central.           Ideal     Distrib.  Central.
equake               370         490         540                121       145       153
apsi                 11187       15180       16842              5420      6404      6721
ammp                 22352       29585       32590              5443      6694      7098
susan_e              6.36        7.39        7.82               2.36      2.70      2.81
patricia             3.95        4.78        5.12               0.82      0.98      1.02
susan_c              0.48        0.57        0.61               0.25      0.30      0.31
susan_s              20.76       29.90       33.71              14.85     19.42     20.90
md                   0.00004     0.00006     0.00007            0.00002   0.00003   0.00003
jacobi               4.26        7.16        8.38               5.79      7.35      7.86
lu                   0.00024     0.00033     0.00037            0.00004   0.00005   0.00006
Total EDP            33944.64    45305.14    50026.81           11008.12  13273.46  14004.35

Table 21. Energy-Delay product of MPSparcV8 and CReAMS considering the power budget
                     MPSparcV8                                  CReAMS
Energy-Delay Product 8SparcV8    8SparcV8    8SparcV8           4DAPs     4DAPs     4DAPs
                     Ideal       Distrib.    Central.           Ideal     Distrib.  Central.
equake               329         350         355                129       134       134
apsi                 11311       12134       12324              5736      5951      5951
ammp                 23814       25290       25631              5233      5440      5440
susan_e              7.56        7.93        8.02               4.33      4.48      4.48
patricia             2.91        3.03        3.06               1.17      1.21      1.21
susan_c              0.96        1.03        1.04               0.66      0.69      0.69
susan_s              113.51      123.35      125.62             56.77     59.97     59.97
md                   0.00019     0.00020     0.00020            0.00008   0.00009   0.00009
jacobi               23.26       25.57       26.10              23.26     24.41     24.41
lu                   0.00014     0.00015     0.00015            0.00005   0.00005   0.00005
Total EDP            35602.35    37935.06    38473.57           11184.04  11615.88  11615.88

Table 21 shows the energy-delay product when the power budget scheme is applied to both CReAMS and MPSparcV8, considering the inter-thread communication latency. Comparing Table 21 and Table 11, one can see that inter-thread communication changes the conclusions drawn from the energy-delay product measurements: when the inter-thread communication is considered, CReAMS achieves better results in all benchmarks. As mentioned before, inter-thread communication sorely affects the execution time of jacobi, so CReAMS achieves a better energy-delay product by 2% and 6% when considering the Distributed and Centralized schemes, respectively. Hence, one can

conclude that, for the whole spectrum of application behaviors, even for highly parallel applications with almost perfect load balancing, CReAMS achieves either better performance or lower energy consumption, and produces a lower energy-delay product than MPSparcV8 when a power budget of 3 Watts is applied.

5.4 Heterogeneous Organization CReAMS
In this subsection we show the results considering CReAMS as a heterogeneous organization. First, the methodology used to gather data about heterogeneity is shown. Afterwards, the same schemes applied in the previous sections are employed to compare the performance, area, power and energy consumption of homogeneous versus heterogeneous CReAMS.

5.4.1 Methodology
We built three DAP configurations, named Small, Medium and Large, aiming at comparing the performance, area, energy and power consumption of both homogeneous and heterogeneous CReAMS; Table 22 (a) shows the number of components that compose each DAP. For the sake of the comparison, we refer to a homogeneous CReAMS design composed of Small DAPs as HomoSmall CReAMS. The same terminology is used for the Medium and Large versions, which are referred to as HomoMedium and HomoLarge, respectively. For instance, a CReAMS composed of four Small DAPs is referred to as 4-HomoSmall CReAMS. To explore the design space of the heterogeneous organization, we encapsulate the three DAP configurations in the same chip to provide different levels of instruction level parallelism exploitation. Table 22 (b) shows the percentage of Small, Medium and Large DAPs that compose each heterogeneous CReAMS. For instance, 50% of the DAPs that compose the HeteroSmall CReAMS are SmallDAPs, 25% are MediumDAPs and 25% are LargeDAPs. Thus, if one builds a HeteroSmall CReAMS composed of eight DAPs, four of them are SmallDAPs, two are MediumDAPs and two are LargeDAPs; this configuration is named 8-HeteroSmall CReAMS.

Table 22. (a) Different DAP sizes (b) Percentage of DAPs that compose each Heterogeneous CReAMS

(a)
                                   SmallDAP   MediumDAP   LargeDAP
Number of Columns                  9          15          24
Number of ALUs per Column          3          4           5
Number of LD/ST per Column         2          2           3
Number of Multipliers per Column   1          2           2
Reconfiguration Memory Slots       32         64          128
Input Context Size                 8          12          24

(b)
                        %SmallDAP   %MediumDAP   %LargeDAP
HeteroSmall CReAMS      50          25           25
HeteroMedium CReAMS     25          50           25
HeteroLarge CReAMS      25          25           50

Table 23 (a) depicts the area occupied by each component of the three DAP configurations. As can be seen, the data path and the reconfiguration memory are the largest components. In particular, the size of the reconfiguration memory grows due to the increase in the number of functional units (refer to Table 22(a)) and to the larger number of slots to store configurations. Table 23 (b) shows the area occupied by the homogeneous and heterogeneous CReAMS designs. As can be seen, the area of the N-DAP HeteroLarge CReAMS is almost the same as that of the 2N-DAP HomoSmall CReAMS, where N is the baseline number of DAPs. Thus, these configurations form the Same Area #1 scheme. With this scheme, we wanted to provide some clues about which multiprocessing system is worthwhile: one composed of a larger number of DAPs that only slightly exploit ILP in a homogeneous fashion, or one composed of

a smaller number of DAPs that exploit ILP in a heterogeneous fashion. On the other hand, one can notice in Table 23 (b) that the 2N-DAP HeteroSmall CReAMS has an area similar to that of the N-DAP HomoLarge CReAMS; these configurations form the Same Area #2 scheme. This scheme reflects quite the opposite of the Same Area #1: now we wanted to verify whether it is more efficient to couple a large number of DAPs with heterogeneous ILP exploitation or a small number of DAPs with aggressive ILP exploitation. Figure 35 depicts an example of the Same Area #1 and Same Area #2 comparison schemes with N equal to four.

Table 23. (a) Area of the components of the different DAP sizes (b) Area of the Homogeneous and Heterogeneous CReAMS setups

(a)
Processors                   Area (um2)
                             SmallDAP    MediumDAP   LargeDAP
SparcV8 processor            247,615     247,615     247,615
Reconfigurable Data Path     618,488     1,439,516   2,800,914
8KB 4-Way L1 I-D-Cache       614,283     614,283     614,283
Reconfiguration Memory       185,997     430,159     1,389,695
4-Way Address Cache          4,868       9,737       19,473
Total                        1,671,251   2,741,310   5,071,979

(b)
Area (um2)       4 DAPs        8 DAPs        16 DAPs       32 DAPs        64 DAPs
HomoSmall        6,671,311     13,342,621    26,685,242    53,370,485     106,740,969
HomoMedium       11,118,070    22,236,140    44,472,281    88,944,561     177,889,122
HomoLarge        20,401,802    40,803,603    81,607,206    163,214,413    326,428,826
HeteroLarge      14,648,246    29,296,492    58,592,984    117,185,968    234,371,936
HeteroMedium     12,327,313    24,654,626    49,309,253    98,618,505     197,237,010
HeteroSmall      11,215,623    22,431,246    44,862,493    89,724,986     179,449,972


Figure 35. Example of Same Area #1 (left) and Same Area #2 (right) comparison schemes

Figure 36 shows the performance of the N-DAP HeteroLarge CReAMS under the Same Area #1 scheme. The data in this figure are normalized to the performance of the 2N-DAP HomoSmall CReAMS; thus, speedups and slowdowns of the heterogeneous CReAMS over its homogeneous counterpart are given by numbers greater and smaller than one, respectively. As can be seen in this figure, the heterogeneous CReAMS outperforms the homogeneous platform in those applications where there is a lack of thread level parallelism and room for instruction level parallelism exploitation. equake and apsi, which present such behaviors, benefit the most from heterogeneity, achieving 22% and 10% performance improvements considering the same-area designs and N equal to four. Moreover, as N increases this improvement grows as well, due to the increase in the number of Large DAPs that aggressively exploit ILP. When N is equal to 32, the heterogeneous CReAMS outperforms the homogeneous one by 48% and 27% when running equake and apsi, respectively. On the other hand, applications that provide massive thread level parallelism do not perform better on the heterogeneous CReAMS since, to respect the same-area schemes, such designs contain half the number of DAPs. The heterogeneous CReAMS shows a slowdown of 53% over the homogeneous platform when running jacobi, susan smoothing and md. As already explained, these applications have perfect load balancing and improve their performance linearly with the increase in the number of DAPs; it is difficult to reach a linear performance improvement through instruction level parallelism exploitation alone while incurring an area overhead of a factor of two.


susan corners shows an irregular behavior when running on the heterogeneous CReAMS organization. Up to N equal to sixteen, the homogeneous CReAMS outperforms the heterogeneous one when the Same Area #1 scheme is applied. However, when N is equal to 32, the heterogeneity achieves a 20% performance improvement over the homogeneous CReAMS. This gain comes from the sudden drop in load balancing that occurs when going from 32 to 64 threads: as some DAPs of the 64-core HomoSmall CReAMS become idle and the working DAPs cannot push up the performance by exploiting ILP, the 32-core HeteroLarge achieves a better execution time. The performance of patricia and lu follows the same behavior as susan corners, but the sudden drop in load balancing starts with a smaller number of threads. When N is equal to 8, the heterogeneous organization shows performance improvements of 27% and 23% over the homogeneous CReAMS when running lu and patricia. The performance gains provided by heterogeneity increase to 58% in lu when N grows to 32, since its load balancing keeps dropping at the same pace as the number of DAPs grows. However, quite the opposite occurs in the patricia execution: the gain provided by the heterogeneous CReAMS falls to 5% when N is equal to 32. The performance improvement provided by exploiting only TLP is irregular in this benchmark: when the number of threads increases from 4 to 8, performance gains are obtained. Thus, when the 4-DAP HeteroLarge is compared to the 8-core HomoSmall, the TLP exploitation of 8 DAPs provides larger gains than ILP. However, when the number of threads lies between 8 and 64, a significant load imbalance occurs and the execution time increases in comparison with 8 threads. The performance losses occur at the same pace as the number of threads grows. Thus, TLP loses steam and the heterogeneous ILP exploitation provided by CReAMS achieves larger performance improvements.


Figure 36. Relative Performance of HeteroLarge over HomoSmall CReAMS considering the Same Area #1 scheme

Figure 37 shows the relative energy consumption of the N-DAP HeteroLarge over the 2N-DAP HomoSmall. As can be seen, the heterogeneous setups spend less energy than the homogeneous CReAMS. Table 24 (a) shows the Thermal Design Power (TDP) of the

different DAP configurations, and Table 24 (b) presents the TDP of CReAMS conceived as heterogeneous and homogeneous organizations. Considering the Same Area #1 scheme, Table 24 (b) shows that the TDP of the N-DAP HeteroLarge CReAMS is 78% lower than that of the 2N-DAP HomoSmall design. Thus, even showing a 45% larger execution time than the homogeneous CReAMS when running jacobi, the heterogeneous design achieves 5% energy savings due to its lower power consumption.

Table 24. (a) Thermal Design Power (TDP) of DAP configurations (b) TDP of heterogeneous and homogeneous CReAMS

(a)
DAP      TDP (mWatts)
Small    532.86
Medium   604.78
Large    696.75

(b)
TDP (mWatts)    4 DAPs   8 DAPs   16 DAPs   32 DAPs   64 DAPs
HomoSmall       2,131    4,263    8,526     17,051    34,103
HomoMedium      2,419    4,838    9,676     19,353    38,706
HomoLarge       2,787    5,574    11,148    22,296    44,592
HeteroLarge     2,531    5,062    10,125    20,249    40,498
HeteroMedium    2,439    4,878    9,757     19,513    39,027
HeteroSmall     2,367    4,734    9,469     18,938    37,876

As shown in Section 5.2.1, CReAMS obtains energy savings by avoiding instruction fetches from the main memory. Thus, as the DAP becomes larger, more instructions are packed into a single configuration and more energy is saved. lu benefits the most from this characteristic: when N is equal to 4, the heterogeneous design spends 13% less energy than the homogeneous CReAMS. Besides showing a 59% lower execution time when running lu with N equal to 32, the heterogeneous organization also reduces the energy consumption by 37% in comparison with the homogeneous CReAMS. On the other hand, md is the application most harmed by the heterogeneous CReAMS: its execution time increases by 44%. Even so, 22% of energy savings are obtained by replacing the homogeneous CReAMS with the heterogeneous one considering designs with the same area, which reinforces the feasibility of employing heterogeneous CReAMS in the embedded domain. For the Same Area #1 scheme and considering all applications, the N-DAP HeteroLarge CReAMS outperforms the 2N-DAP HomoSmall CReAMS by 4% and spends 5% less energy.



Figure 37. Relative Energy Consumption of HeteroLarge over HomoSmall CReAMS considering the Same Area #1 scheme

Figure 38 shows the relative performance of the 2N-DAP HeteroSmall over the N-DAP HomoLarge, reflecting the Same Area #2 scheme. This scheme proposes the reverse of the comparison done in the Same Area #1: now we wanted to verify the feasibility of a heterogeneity in which most DAPs are not able to aggressively exploit instruction level parallelism (refer to Figure 35). Thus, in comparison with the Same Area #1 scheme, here we diminished the ILP exploitation capability and increased the TLP of the heterogeneous CReAMS, while decreasing the TLP and increasing the ILP of the homogeneous CReAMS targeted by the comparison. As can be seen in Figure 38, the TLP-oriented applications, such as jacobi and susan smoothing, which in the Same Area #1 scheme do not benefit from heterogeneity, here show the opposite: the 2N-DAP HeteroSmall outperforms the N-DAP HomoLarge in these applications, since the larger number of DAPs in the heterogeneous design provides better performance than the aggressive ILP exploitation available in the HomoLarge. The applications from SPEC OMP2001 (apsi, equake and ammp) present neither TLP nor ILP in a massive degree. The behavior of the SPEC OMP2001 applications represents the huge amount of sequential applications already available in the market; they reflect the difficulty of splitting already written code into threads, since they were parallelized over existing sequential code. Heterogeneity provides better performance for those applications in both the Same Area #1 and Same Area #2 schemes. This means that neither a few cores with aggressive ILP exploitation nor many cores with tiny ILP exploitation in a homogeneous fashion are suitable for these applications: they demand a heterogeneous organization that gives aggressive ILP exploitation to the threads that request it, and some cores to execute the parallel portion of their code.



Figure 38. Relative Performance of HeteroSmall over HomoLarge CReAMS considering the Same Area #2 scheme

Figure 39 depicts the energy consumption of the 2N-DAP HeteroSmall relative to the N-DAP HomoLarge. Comparing this figure with Figure 37, one can notice that the energy savings provided by heterogeneity in the Same Area #1 scheme turn into losses when the Same Area #2 scheme is considered. The TDP of the 2N-DAP HeteroSmall is 69% higher than that of the N-DAP HomoLarge (refer to Table 24 (b)). Thus, even achieving, on average, a 52% lower execution time when running jacobi, the heterogeneous CReAMS spends 10% more energy than the homogeneous design due to its higher TDP. Over all applications, the heterogeneous CReAMS design decreases the execution time by 4% while dissipating 7% more energy. Figure 40 shows the relative energy-delay product of the heterogeneous CReAMS organization over the homogeneous design when the Same Area #1 and #2 schemes are considered. Analyzing this figure, one can notice that the heterogeneity of the Same Area #1 scheme provides gains in the EDP metric. For instance, heterogeneity reduces the EDP by factors of 2.2 and 1.9 when running lu and equake under the Same Area #1 scheme. On average, considering all setups and applications, the heterogeneous CReAMS reduces the EDP by 10% under the Same Area #1 scheme. However, in the Same Area #2 scheme the heterogeneity increases the EDP by 4%. Thus, one can conclude that, for the applications considered in our experiments, building CReAMS as a heterogeneous organization is mandatory to achieve a better energy-delay product, but most DAPs should provide aggressive ILP exploitation (Same Area #1).



Relative Energy 0.6 lu md apsi jacobi ammp equake patricia susan_s susan_c susan_e Figure 39. Relative Energy Consumption of HeteroLarge over HomoSmall CReAMS considering the Same Area #2 scheme

Figure 40. Relative Energy-Delay Product of Heterogeneous over Homogeneous CReAMS considering the Same Area #1 and #2 schemes (Same Area #1: 4/8/16/32-Core HeteroLarge vs 8/16/32/64-Core HomoSmall; Same Area #2: 8/16/32/64-Core HeteroSmall vs 4/8/16/32-Core HomoLarge)

Up to now, all results shown in this subsection assume that threads are statically scheduled to the DAPs, meaning that once a thread starts running on a DAP it finishes its execution on that same DAP. However, such an approach is not suitable when a heterogeneous organization is applied, since a thread that needs aggressive ILP exploitation could be scheduled to a Small DAP, hurting the overall performance. Dynamic scheduling is therefore employed to verify whether the performance degradation observed when heterogeneity is applied (Figure 36 and Figure 38) is caused by poor thread scheduling. A lightweight context switch when the schedule changes is an advantage provided by CReAMS: the configurations stored in the reconfiguration memory of a DAP do not need to migrate, since threads share the same memory address space and configurations are indexed by the memory address of their first instruction, so a rescheduled thread can take advantage of the configurations built by other threads.


The scheduling algorithm maintains an instruction counter for each thread and decides, based on this data, the best thread-to-DAP assignment. The schedule is changed when all threads reach a certain barrier. At that moment, the algorithm assigns to the DAPs with aggressive ILP exploitation the threads that executed the largest number of instructions since the last barrier. The main goal of this experiment is not to employ the best possible scheduling algorithm, but to verify whether the performance losses of the heterogeneous CReAMS are caused by wrong thread scheduling.
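As an illustration only, the fragment below sketches in C how such a barrier-based policy could be organized; the names, counts and data structures are hypothetical and do not describe the actual software implementation.

#include <stdlib.h>

#define NUM_THREADS     8
#define NUM_LARGE_DAPS  2   /* hypothetical number of DAPs with aggressive ILP exploitation */

typedef struct {
    int           thread_id;
    unsigned long insns_since_barrier;  /* instruction counter kept per thread */
} thread_stat_t;

/* Sort threads by instruction count, highest first. */
static int cmp_desc(const void *a, const void *b)
{
    const thread_stat_t *x = a, *y = b;
    if (x->insns_since_barrier == y->insns_since_barrier) return 0;
    return (x->insns_since_barrier < y->insns_since_barrier) ? 1 : -1;
}

/* Called when every thread reaches the barrier: the threads that executed the
 * most instructions since the last barrier are mapped to the Large DAPs
 * (DAPs 0..NUM_LARGE_DAPS-1), the remaining ones to the Small DAPs. */
void reschedule_at_barrier(thread_stat_t stats[NUM_THREADS], int dap_of[NUM_THREADS])
{
    qsort(stats, NUM_THREADS, sizeof(thread_stat_t), cmp_desc);
    for (int rank = 0; rank < NUM_THREADS; rank++) {
        dap_of[stats[rank].thread_id] = rank;   /* one thread per DAP, Large DAPs first */
        stats[rank].insns_since_barrier = 0;    /* restart counting for the next interval */
    }
}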

Figure 41. Relative Speedup of Dynamic over Static Thread Scheduling considering the Same Area #1 scheme (4-, 8-, 16- and 32-Core HeteroLarge)

Figure 41 shows the relative performance of the dynamic scheduling scheme in comparison with the static approach when the Same Area #1 scheme is considered. The main purpose of using OpenMP is to parallelize loop iterations; in general, when OpenMP is applied, each loop iteration (or chunk of iterations) is assigned to a thread. For instance, assume an application whose loops are completely data flow, meaning that no branch is used in the loop body: all threads will execute the same code. The implementation of the FIR filter (refer to Figure 27) reflects this behavior (an illustrative fragment is also sketched right after this paragraph). In this scenario, if the DAP with the lowest ILP capability already achieves the maximum gains when exploiting the parallelism of the loop body, dynamic scheduling will not produce any performance improvement. As can be seen, susan smoothing, equake and apsi do not present any change in performance when the dynamic scheduling algorithm is applied, for exactly this reason. On the other hand, one can notice that susan corners achieves a performance improvement with dynamic scheduling. Analyzing Figure 36, one can conclude that there is no reason for the 16-DAP HeteroLarge to perform worse than the 32-DAP HomoSmall when running susan corners, since, as shown in Table 3, this application has large room for ILP exploitation and small room for TLP exploitation. Figure 41 shows that the performance losses of the heterogeneous CReAMS on this application (Figure 36) are indeed caused by wrong thread scheduling. Performance losses of 3% when running this application on the 8-DAP HeteroLarge CReAMS become a performance improvement of 13% over the 16-DAP HomoSmall CReAMS when dynamic scheduling is applied. The 36% slowdown of the 32-DAP HeteroLarge when running md is reduced to 30% simply by scheduling its threads dynamically. As the algorithm used for dynamic scheduling is naïve, it produces performance losses for some applications, such as lu and susan edges, which reinforces the need for a well-designed dynamic scheduler when heterogeneous organizations execute a mixed application workload.
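As a purely illustrative fragment (the array sizes and names below are made up and are not taken from the benchmark code), the OpenMP loop sketched here has the branch-free, data-flow kind of body discussed above: the fixed-trip-count inner loop can be fully unrolled, and every thread executes the same instruction sequence, therefore placing the same ILP demand on whichever DAP it is assigned to.

#include <omp.h>

#define N     4096
#define TAPS  16

/* FIR-like kernel: each outer iteration computes one output sample; the body
 * contains no data-dependent branches, so all OpenMP threads run the same code. */
void fir(const float x[N + TAPS], const float h[TAPS], float y[N])
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        float acc = 0.0f;
        for (int k = 0; k < TAPS; k++)   /* fixed trip count: typically fully unrolled */
            acc += h[k] * x[i + k];
        y[i] = acc;
    }
}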

Figure 42. Relative Speedup of Dynamic over Static Thread Scheduling considering the Same Area #2 scheme (8-, 16-, 32- and 64-Core HeteroSmall)

Figure 42 shows the impact of dynamic scheduling in the Same Area #2 scheme. As in the Same Area #1 scheme, dynamic scheduling does not produce any change for some applications, for the same reasons mentioned before. susan corners and md also benefit from dynamic scheduling in the Same Area #2 scheme, reinforcing the need for it. The behavior of lu, ammp and susan edges in this figure also supports our point: as the naïve dynamic algorithm makes some wrong scheduling decisions, the performance of these applications is severely affected, since their threads are very heterogeneous in terms of their need for ILP exploitation.

5.5 CReAMS versus Out-Of-Order Superscalar SparcV8

Up to now, we have compared CReAMS against single-issue SparcV8 multiprocessing systems. Here, we wanted to verify how CReAMS fares against a state-of-the-art processor in terms of ILP exploitation. Therefore, we coupled a simulator of an out-of-order SparcV8 processor (KAVVADIAS, 2001) to the framework presented in Figure 33(a). To do so, some modifications were necessary in the way instructions are decoded and results are generated. As there is no hardware implementation of a 4-issue OOO SparcV8 processor available in the market, we gathered the performance data in our framework based on the organization of the MIPS R10000 processor, due to its similarities with the SparcV8 ISA.


Besides the issue width, other characteristics, such as the sizes of the reservation stations and the reorder buffer, are modeled in the simulator based on the organization of the MIPS R10000. To gather power consumption data, we also use data from the MIPS R10000. As this processor is only available in CMOS 0.35 µm, a technology scaling to 90 nm (the technology used in the CReAMS implementation) is necessary to make a fair comparison. According to (HEINRICH, 1997), the 4-issue out-of-order MIPS R10000, implemented in CMOS 0.35 µm, consumes 30 Watts at 3.3 Volts operating at 250 MHz. CReAMS was synthesized in CMOS 90 nm, supplied with 1.0 Volt and running at 600 MHz. Let us scale the CMOS 0.35 µm design to CMOS 90 nm by applying the power supply of the newer technology. As the dynamic power dissipated by a circuit can be written as $P = C \cdot V_{dd}^{2} \cdot f$, scaling the power supply from 3.3 Volts to 1.0 Volt results in a power consumption of 2.75 Watts at 250 MHz. Hence, to normalize the power consumption of both CReAMS and the 4-issue OOO SparcV8 superscalar processor to 600 MHz, a factor of 2.4 must be applied to the OOO SparcV8 processor, resulting in 6.61 Watts. Table 25 shows the TDP of different multiprocessing systems composed of 4-issue out-of-order SparcV8 processors.

Table 25. TDP of 4-issue Out-Of-Order SparcV8 multiprocessing system

                        4-OOO     8-OOO     16-OOO     32-OOO     64-OOO
4-issue TDP (mWatts)    26,494    52,987    105,975    211,950    423,900

For comparison purposes, we created two scenarios that consider power budgets of 27 Watts and 53 Watts. As can be seen in Table 25 and Table 24(b), the power consumption of the 4-OOO SparcV8 multiprocessing system and that of the 32-DAP HomoLarge CReAMS are similar, so these configurations compose the Power Budget #1 scheme. For the Power Budget #2 scheme, we compare the 8-OOO SparcV8 multiprocessing system with the 64-DAP HomoLarge CReAMS, since their power consumptions are also very similar. The pairing of these configurations remains valid even if an error of 19% occurred in our scaling process from CMOS 0.35 µm to 90 nm. For these experiments, we chose a subset of the applications presented in the previous section. This subset covers the whole spectrum of opportunities to exploit TLP and ILP. As Table 3 shows, blackscholes, swaptions and jacobi have perfect load balancing, so they are suitable for TLP exploitation; on the other hand, they have little room for ILP exploitation, since their mean basic block sizes are small. susan corners shows quite the opposite, since it has significant load imbalance and a larger mean basic block size. lu was selected to represent applications where there is little room for either ILP or TLP exploitation.
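For clarity, the voltage and frequency normalization applied above can be summarized as a back-of-the-envelope calculation (assuming only dynamic power and an unchanged switched capacitance):

\[
P \propto C\,V_{dd}^{2}\,f, \qquad
30\,\mathrm{W}\times\left(\frac{1.0}{3.3}\right)^{2}\approx 2.75\,\mathrm{W}\ \text{at 250 MHz}, \qquad
2.75\,\mathrm{W}\times\frac{600}{250}\approx 6.61\,\mathrm{W}\ \text{at 600 MHz}.
\]

Multiplying this per-core estimate by 4, 8, 16, 32 and 64 cores yields the TDP row of Table 25.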


Table 26. Execution time of 4-issue OOO MPSparcV8 and CReAMS

Execution Time (ms)   4-issue OOO MPSparcV8     Homogeneous CReAMS      Heterogeneous CReAMS
                      4SparcV8     8SparcV8     32 DAPs     64 DAPs     32 DAPs     64 DAPs
susan_c               16.448       11.221       12.838      10.344      13.783      10.916
swaptions             7.094        3.551        1.572       0.792       2.212       1.107
blackscholes          3.408        1.710        1.048       0.535       1.578       0.812
jacobi                123.841      62.245       33.665      19.233      41.864      23.299
lu                    0.352        0.265        0.279       0.310       0.315       0.367
Total Ex. Time        151.142      78.992       49.402      31.214      59.750      36.500

Table 26 shows the execution time of both CReAMS and the 4-issue OOO MPSparcV8 under the Power Budget #1 and #2 schemes. As can be seen, CReAMS outperforms the 4-issue OOO MPSparcV8 in all TLP-oriented applications under both power budgets. The 32-DAP HomoLarge CReAMS is 4.51, 3.26 and 3.68 times faster than the 4-core 4-issue OOO MPSparcV8 when running swaptions, blackscholes and jacobi, respectively. When the power budget grows to 53 Watts, the gains of the 64-DAP HomoLarge CReAMS over the 8-core 4-issue OOO MPSparcV8 remain almost at the same factors. lu runs 26% faster on CReAMS than on the 4-issue OOO MPSparcV8 under the 27-Watt power budget. However, when the power budget grows to 53 Watts, the multiprocessing system composed of 4-issue OOO processors outperforms CReAMS by 24%. This is due to the huge load imbalance that arises when the number of DAPs increases from 32 to 64 for this application, which hurts the performance of CReAMS; such imbalance does not occur when the number of processors increases from 4 to 8, so the gains of the 4-issue OOO MPSparcV8 come from both TLP and ILP exploitation. Even when wide room for ILP exploitation is available, CReAMS outperforms the 4-issue OOO MPSparcV8; this is the case of susan corners, whose basic blocks have 17 instructions on average. The 32-DAP CReAMS outperforms the 4-core 4-issue OOO MPSparcV8 by 28% under the 27-Watt power budget. When the power budget increases to 53 Watts, CReAMS still performs better than the OOO MPSparcV8, showing an execution time 8.5% smaller. Table 26 also presents the performance of the 32-DAP and 64-DAP HeteroLarge CReAMS when static scheduling is considered. We apply the same methodology used above to compare the heterogeneous CReAMS with the 4-issue OOO MPSparcV8. In comparison with the homogeneous setups, the heterogeneous CReAMS presents a smaller chip area (refer to Table 23(b)) and lower power consumption (refer to Table 24). Nevertheless, the heterogeneous CReAMS maintains almost the same gains over the 4-issue OOO MPSparcV8 as the homogeneous setups, while reducing the power cap from 27 Watts and 53 Watts to 20 Watts and 40 Watts for the Power Budget #1 and #2 schemes, respectively. Summarizing, homogeneous CReAMS shows performance improvements of 28% and 8.5% under the Power Budget #1 and #2 schemes, which shows that it delivers higher performance per watt than a multiprocessing system based on 4-issue OOO superscalar processors. Moreover, the heterogeneous CReAMS outperforms the OOO superscalar multiprocessing system by a factor of 2.34, on average, while providing a power cap 33% lower.
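As a sanity check, the speedup factors quoted above follow directly from the execution times in Table 26; for example, for swaptions under the Power Budget #1 scheme:

\[
\text{speedup}=\frac{T_{4\text{-core OOO}}}{T_{32\text{-DAP HomoLarge}}}=\frac{7.094\ \mathrm{ms}}{1.572\ \mathrm{ms}}\approx 4.51 .
\]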


6 CONCLUSIONS AND FUTURE WORKS

In this work, the design space around thread and instruction level parallelism exploitation is investigated. In Chapter 2 we presented the related work, which elucidates and motivates the main goals of this proposal. In Chapter 3, we discussed, using an analytical model, the limits of TLP and ILP exploitation, showing the necessity of mixed parallelism exploitation; in addition, the inter-thread communication cost was studied analytically. To cope with that, we extended to a multiprocessing environment an already proposed reconfigurable architecture that transparently exploits the instruction level parallelism of single-threaded applications. The employment of this reconfigurable architecture in a homogeneous multiprocessor organization shows that adaptable ILP exploitation is one requisite to achieve performance improvements with energy savings. However, the strategy of replicating identical processors produces a disadvantageous tradeoff between energy, area and performance, since most of the processors stay idle for large periods of the application execution, waiting for the thread with the longest sequential portion. Hence, we showed performance improvements and energy savings by combining, in the same chip, DAPs with different capabilities of exploiting instruction level parallelism. Under that hypothesis, we demonstrated that the heterogeneous organization strategy can substantially reduce the power consumption of the system while maintaining the performance of the homogeneous approach, at the mandatory cost of a dynamic thread scheduler to match the thread requirements with the DAP capabilities. Finally, we presented performance improvements of CReAMS over a multiprocessing system composed of 4-issue out-of-order processors when a power budget is considered.

6.1 Future Works

6.1.1 Scheduling Algorithm
This item concerns the scheduling algorithm used to assign threads in a heterogeneous CReAMS organization. In this work, the scheduling process is implemented in software. A hardware implementation, however, has several advantages: it is motivated not only by better performance than software, but also by offering transparency to the operating system. The whole processor scheduling process would be supported by the DAP's hardware, which is responsible for the fine-grain parallelism detection. Once a thread executes, the DDH stores, as intrinsic information, the instruction level parallelism achieved and the number of instructions executed by the thread. This information is useful for the next thread assignments done by the processor appointer hardware. The identification of threads in hardware is done through a special register within each DAP that holds the thread ID generated by the operating system. A special table, indexed by the thread ID, stores the appointment metrics gathered during the first execution of each thread. For subsequent executions, the core allocator hardware fetches these metrics from this table and allocates a DAP to execute the thread, considering the best match between processing capability and the required metrics. The main novelty of such an approach is the complete transparency offered to the operating system. The implementation of the thread allocation is supported by the DAP's hardware, named Dynamic Detection Hardware (DDH), which is responsible for the fine-grain parallelism detection.

6.1.2 Studies over TLP and ILP considering the Operating System
Operating systems, such as Android and iOS, are already present in embedded systems. The software layer added over the hardware cannot be overlooked, since it brings a significant overhead to system performance. However, as operating system code is split into threads, one can exploit CReAMS to achieve performance improvements. In this work, only applications were explored, but one can investigate the efficiency of CReAMS in accelerating both application and operating system code. This measurement is important to verify how many DAPs are necessary to serve, concurrently, the threads of the operating system and of the application. Moreover, one important question to be answered is: is heterogeneity also efficient in an environment where the operating system is present?

6.1.3 Behavioral of CReAMS on a Multitask Environment
In a multitask environment, many threads of different applications run at the same time on the chip multiprocessor. As shown in this work, applications behave in their own way and require different amounts of resources to achieve efficient execution. This work can be extended by considering a multitask environment, where many applications run over a homogeneous/heterogeneous CReAMS. Such an investigation approaches the real scenario of current embedded systems, where operating systems provide multitasking support.

6.1.4 Automatic CReAMS generation
In (RUTZIG, 2009), the authors propose an automatic tool, named ARISE, for generating a reconfigurable data path optimized for the application being executed. When many processors are encapsulated in a single die, an investigation of the amount of reconfigurable resources needed to produce the best performance when exploiting TLP and ILP is mandatory to reduce the occupied area. Moreover, this work shows that the heterogeneous organization is efficient when area, power and performance are taken into account. However, there is no investigation of the best heterogeneity mix for a multiprocessor environment, which could be achieved by extending ARISE to CReAMS.

6.1.5 Area reductions by applying the Data Path Virtualization Strategy
Reductions in chip area will also be explored using the data path virtualization technique proposed in (BERTICELLI LO, BECK, et al., 2010). The authors propose virtual execution using the DIM technique. Currently, the number of replicated levels dictates the limit of sequential execution in the reconfigurable data path. The results shown in Chapter 5 employ a reconfigurable data path composed of eight levels.
However, to achieve the same performance results shown in Chapter 5, only three data path levels are needed when the virtualization technique is used. In this technique, on each execution cycle a data path level works in a specific mode, which can be: reconfiguring, running, or propagating results. To support the virtualization, a finite state machine is necessary to control the switching of modes among the three levels. In return, the chip area is greatly reduced, since there is no need to replicate large functional units and interconnections to execute configurations longer than three cycles. The employment of data path virtualization does not affect the original performance, since the number of execution cycles is dictated by the width of a reconfiguration memory slot. Despite the huge area savings from avoiding the replication of functional units, virtualization does not produce any savings in the area of the reconfiguration memory. Currently, this component is responsible for 15% of the total area of the DAP implementation; as the area of the data path is drastically reduced by the virtualization process, this share becomes more significant, reaching almost 32% of the total chip area, which makes it an important optimization target when heterogeneous CReAMS is considered.

6.1.6 Boosting TLP performance with Heterogeneous Multithread CReAMS
Simultaneous Multithreading (SMT) processors have already shown an efficient strategy to obtain an advantageous tradeoff between performance improvement and area overhead. However, there are few works applying the simultaneous multithreading strategy to reconfigurable computing, and those works use it in a homogeneous fashion, meaning that all processors have the same degree of ILP and TLP exploitation. Heterogeneous multithreaded DAPs in the same CReAMS platform are a point to be investigated. In a heterogeneous CReAMS platform, some DAPs have more functional units than others, so a higher degree of SMT could be applied to those DAPs, achieving performance improvements with a low area overhead. Such overhead comes from supporting the concurrent execution of multiple threads in the DAP's data path.


7 PUBLICATIONS

7.1 Book Chapters 1. Wong, Stephan, Carro, Luigi, RUTZIG, Mateus Beck, MATOS, D. M., GIORGI, R., Puzovic, N, Kaxiras, S., DESOLI, G., GAI, P., CINTRA, M., MCKEE, S. A., ZAKS, A. ERA – Embedded Reconfigurable Architectures In: Reconfigurable Computing: From FPGAs to Hardware/Software Codesign.1 ed.New York : Springer, 2011, v.1, p. 239-260.

2. CARRO, L., RUTZIG, Mateus Beck Multi-Core System on Chip In: Handbook of Signal Processing Systems.1 ed.New York : Springer, 2010, v.1, p. 485-514.

7.2 Journals 1. Rutzig, Mateus B., Beck, Antonio C. S., Madruga, Felipe, Alves, Marco A., Freitas, Henrique C., Maillard, Nicolas, Navaux, Philippe O. A., Carro, Luigi, RUTZIG, Mateus Beck Boosting Parallel Applications Performance on Applying DIM Technique in a Multiprocessing Environment. International Journal of Reconfigurable Computing (Print). , v.2011, p.1 - 13, 2011.

7.3 Conferences 1. RUTZIG, Mateus Beck, BECK FILHO, Antonio Carlos Schneider, CARRO, L. CReAMS: An Embedded Multiprocessor Platform In: International Symposium on Applied Reconfigurable Computing, 2011, Belfast.

2. FAJARDO JUNIOR, J., RUTZIG, Mateus Beck, BECK FILHO, Antonio Carlos Schneider, CARRO, L. A Dynamically Reconfigurable Architecture with a Two-Level Binary Translation Mechanism In: 5th Workshop on Reconfigurable Computing, 2011, Heraklion.

3. JUNQUEIRA, A., RUTZIG, Mateus Beck, ITTURRIET, F. P., CARRO, L. A Reconfigurable Fabric Supporting Full C/C++ Input In: International Workshop on Reconfigurable Communication-centric Systems-on-Chip, 2011, Montpellier.


4. FAJARDO JUNIOR, J., RUTZIG, Mateus Beck, BECK FILHO, Antonio Carlos Schneider, CARRO, L. A Transparent and Adaptable Multiple-ISA Embedded System In: Engineering of Reconfigurable Systems and Algorithms, 2011, Las Vegas.

5. FAJARDO JUNIOR, J., RUTZIG, Mateus Beck, BECK FILHO, Antonio Carlos Schneider, CARRO, L. Towards an Adaptable Multiple-ISA Reconfigurable Processor In: International Symposium on Applied Reconfigurable Computing, 2011, Belfast.

6. LO, T.B., BECK FILHO, A.C.S., RUTZIG, Mateus Beck, CARRO, L. A Low-Energy Approach for Context Memory in Reconfigurable Systems In: IEEE International Parallel And Distributed Processing Symposium (IPDPS) - Reconfigurable Architectures Workshop (RAW), 2010, 2010, Atlanta.

7. LO, T.B., BECK FILHO, Antonio Carlos Schneider, RUTZIG, Mateus Beck Decreasing the Impact of the Context Memory on Reconfigurable Architectures In: HiPEAC Workshop on Reconfigurable Computing, 2010, Pisa.

8. SILVA, M.G., HECKTHEUER, B., MATTOS, J. C. B., BECK FILHO, A.C.S., RUTZIG, Mateus Beck, CARRO, L. Floating point unit implementation for a reconfigurable architecture In: South Symposium on Microelectronics, 2010, Porto Alegre.

9. SILVA, M.G., HECKTHEUER, B., MATTOS, J. C. B., BECK FILHO, A.C.S., RUTZIG, Mateus Beck, CARRO, L. Implementação de uma Unidade de Ponto Flutuante para uma Arquitetura Reconfigurável. In: XVI IBERCHIP Workshop, 2010, Foz do Iguaçu.

10. RUTZIG, Mateus Beck, MADRUGA, F.L., COSTA, H., BECK FILHO, A.C.S., MAILLARD, N, B., NAVAUX, P.O.A. TLP and ILP exploitation through Reconfigurable Multiprocessing System In: IEEE International Parallel And Distributed Processing Symposium (IPDPS) - Reconfigurable Architectures Workshop (RAW), 2010, Atlanta.

11. FERREIRA, R.S., LAURE, M., BECK FILHO, A.C.S., LO, T.B., RUTZIG, Mateus Beck, CARRO, L. A Low Cost and Adaptable Routing Network for Reconfigurable Systems In: IEEE International Parallel And Distributed Processing Symposium (IPDPS) - Reconfigurable Architectures Workshop (RAW), 2009,, 2009, Roma.

12. RUTZIG, Mateus Beck, BECK FILHO, A.C.S., CARRO, L. Dynamically Adapted Low Power ASIPs In: International Workshop on Reconfigurable Computing, 2009, 2009, Karlsruhe.


REFERENCES

ALBONESI, D.; WATKINS, M. A. ReMAP: A Reconfigurable Heterogeneous Multicore Architecture. PROCEEDINGS OF 43RD ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE. Washington: ACM. 2010. p. 497-508.
ANANTARAMAN, A. et al. Virtual simple architecture (VISA): exceeding the complexity limit in safe real-time systems. PROCEEDINGS OF 30TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE. New York: ACM. 2003. p. 350-361.
ANDRE, L. et al. Piranha: a scalable architecture based on single-chip multiprocessing. SIGARCH Comput. Archit. News 28. New York: ACM. 2000. p. 282-293.
BECK, A. C. S. et al. Transparent reconfigurable acceleration for heterogeneous embedded applications. PROCEEDINGS OF DESIGN, AUTOMATION AND TEST IN EUROPE. New York: ACM. 2008. p. 1208-1213.
BERTICELLI LO, T. et al. A low-energy approach for context memory in reconfigurable systems. PROCEEDINGS OF 17TH RECONFIGURABLE ARCHITECTURES WORKSHOP. [S.l.]: [s.n.]. 2010. p. 19-23.
BIENIA, C. et al. The PARSEC benchmark suite: characterization and architectural implications. PROCEEDINGS OF 17TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES. New York: ACM. 2008. p. 72-81.
BLAKE, G. et al. Evolution of thread-level parallelism in desktop applications. SIGARCH Comput. Archit. News 38. New York: ACM. 2010. p. 302-313.
BRANDAO, R.; WYNN, M. Product Lifecycle Management Systems and Business Process Improvement – A Report on Case Study Research. [S.l.]: [s.n.]. 2008. p. 113-118.
CHEN, T. et al. Cell broadband engine architecture and its first implementation: a performance view. IBM J. Res. Dev. 51. [S.l.]: [s.n.]. 2007. p. 559-572.
CHEN, X. et al. Speedup analysis of data-parallel applications on Multi-core NoCs. PROCEEDINGS OF INTERNATIONAL CONFERENCE ON ASIC. [S.l.]: [s.n.]. 2009. p. 105-108.
CLARK, N. et al. Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization. PROCEEDINGS OF 37TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE. Washington: ACM. 2004. p. 30-40.
DIEFENDORFF, K. Hal Makes Sparcs Fly. [S.l.]. 1999.
DIXIT, K. M. The SPEC benchmarks. In: Computer benchmarks. [S.l.]: Elsevier. 1993. p. 149-163.
DORTA, A. J. et al. The OpenMP Source Code Repository. PROCEEDINGS OF 13TH EUROMICRO CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING. Washington: IEEE Computer Society. 2005. p. 244-250.
FUJITSU MICROELECTRONICS, I. New TurboSPARC Processor Sets New Performance Level For Low-End, Mid-Range Workstations. [S.l.].
GAISLER. Gaisler, 2006. Available at: . Accessed on: 12 December 2010.
GARCIA, P.; COMPTON, K. Kernel sharing on reconfigurable multiprocessor systems. PROCEEDINGS OF INTERNATIONAL CONFERENCE ON ICECE TECHNOLOGY. [S.l.]: IEEE. 2008. p. 225.
GOLDSTEIN, S. C. et al. PipeRench: A Reconfigurable Architecture and Compiler. Computer 33, Los Alamitos, April 2000. 70-77.
GONZALEZ, R. E. Xtensa: a configurable and extensible processor. IEEE Micro, v. 20, n. 2, p. 60-70, March 2000.
GUTHAUS, M. R. et al. MiBench: A free, commercially representative embedded benchmark suite. PROCEEDINGS OF IEEE INTERNATIONAL WORKSHOP ON WORKLOAD CHARACTERIZATION. [S.l.]: [s.n.]. 2002.
HAMMOND, L. et al. The Stanford Hydra CMP. IEEE Micro 20, Los Alamitos, March 2000. 71-84.
HAUCK, K.; COMPTON, K. Reconfigurable computing: a survey of systems and software. ACM Comput. Surv. 34. New York: ACM. 2002. p. 171-210.
HAUCK, S. et al. The chimaera reconfigurable functional unit. IEEE Trans. Very Large Scale Integr. Syst. Piscataway: ACM. 2004. p. 206-217.
HEINRICH, J. MIPS R10000 User Manual, 1997. Available at: .
HENKEL, J. Closing the SoC design gap. Computer, v. 36, p. 119-121, 2003.
INTEL. Inside an 80-core chip: the on-chip communication and memory bandwidth solution, 2007. Available at: . Accessed on: 15 March 2011.
JOHNSON, T.; NAWATHE, U. An 8-core, 64-thread, 64-bit power efficient soc (niagara2). PROCEEDINGS OF 2007 INTERNATIONAL SYMPOSIUM ON PHYSICAL DESIGN. New York: ACM. 2007. p. 2-7.
KAVADIAS, S. Limited Priority-Thread Based Sharing of Simultaneous MultiThreaded Processor Resources. Master dissertation. [S.l.]: [s.n.].
KAVVADIAS, S. A Trace Driven Configurable SparcV8 ISA Simulator, 2001. Available at: . Accessed on: 6 May 2010.
KOENIG, R. et al. KAHRISMA: A Novel Hypermorphic Reconfigurable-Instruction-Set Multi-grained-Array Architecture. PROCEEDINGS OF DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE. [S.l.]: [s.n.]. 2010. p. 819-824.
KUMAR, R. et al. Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction. PROCEEDINGS OF 36TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE. [S.l.]: [s.n.]. 2003. p. 81-92.
KUMAR, R. et al. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. SIGARCH Comput. Archit. News 32. New York: ACM. 2004. p. 64-80.
KUMAR, R.; JOUPPI, N. P.; TULLSEN, D. M. Core architecture optimization for heterogeneous chip multiprocessors. PROCEEDINGS OF THE 15TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES. [S.l.]: [s.n.]. 2006.
LINDHOLM, E. et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro 28. [S.l.]: IEEE. 2008. p. 39-55.
LYSECKY, R.; STITT, G.; VAHID, F. Warp Processors. PROCEEDINGS OF 41ST ANNUAL DESIGN AUTOMATION CONFERENCE. New York: ACM. 2004. p. 659-681.
MAGNUSSON, P. S. et al. Simics: A Full System Simulation Platform. Computer 35. Los Alamitos: ACM. 2002. p. 50-58.
MAK, J. Limits of instruction-level parallelism. PROCEEDINGS OF INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS. New York: ACM. 1991.
MATOS, D. et al. A NOC closed-loop performance monitor and adapter. Microprocessors and Microsystems, 2011. 1-10.
MENON, L. OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Comput. Sci. Eng. 5. Los Alamitos: IEEE. 1998. p. 46-55.
NVIDIA. Whitepaper of Variable SMP, 2011. Available at: . Accessed on: 15 December 2011.
PATTERSON, D.; HENNESSY, J. Computer Organization and Design. [S.l.]: Elsevier, 2010.
RUTZIG, M. B.; BECK, A. C.; CARRO, L. Dynamically Adapted Low Power ASIPs. PROCEEDINGS OF 5TH INTERNATIONAL WORKSHOP ON RECONFIGURABLE COMPUTING: ARCHITECTURES, TOOLS AND APPLICATIONS. Berlin: Springer-Verlag. 2009. p. 110-122.
SANKARALINGAM, K. et al. TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP. ACM Trans. Archit. Code Optim., New York, March 2004. 62-93.
SEILER, L. et al. Larrabee: a many-core x86 architecture for visual computing. ACM SIGGRAPH 2008 Papers. Los Angeles: ACM. 2008. p. 1-15.
SEMICONDUCTORS, I. T. R. F. ITRS. ITRS, 2009. Available at: . Accessed on: 12 May 2010.
SHI, K.; HOWARD, D. Challenges in Sleep Transistor Design and Implementation in Low-Power Designs. [S.l.]: [s.n.]. 2006. p. 113-116.
SMIT, G. J. M. Multi-core Architectures and Streaming Applications. PROCEEDINGS OF INTERNATIONAL WORKSHOP ON SYSTEM LEVEL INTERCONNECT PREDICTION. New York: ACM. 2008. p. 35-42.
STITT, G.; VAHID, F. Thread warping: a framework for dynamic synthesis of thread accelerators. PROCEEDINGS OF ACM INTERNATIONAL CONFERENCE ON HARDWARE/SOFTWARE CODESIGN AND SYSTEM SYNTHESIS. New York: ACM. 2007. p. 93-98.
SWANSON, S. The WaveScalar architecture. ACM Trans. Comput. Syst., New York, May 2007. 54.
VANGAL, S. et al. An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS. PROCEEDINGS OF SOLID-STATE CIRCUITS CONFERENCE. [S.l.]: IEEE. 2007. p. 98-589.
VASSILIADIS, S. et al. The MOLEN Polymorphic Processor. IEEE Trans. Comput., Washington, November 2004. 1363-1375.
WATKINS, M. A.; CIANCHETTI, M. J.; ALBONESI, D. H. Shared reconfigurable architectures for CMPS. PROCEEDINGS OF INTERNATIONAL CONFERENCE ON FIELD PROGRAMMABLE LOGIC AND APPLICATION. [S.l.]: IEEE. 2008. p. 299-304.
WAWRZYNEK, J. R. H. Garp: a MIPS processor with a reconfigurable coprocessor. PROCEEDINGS OF IEEE SYMPOSIUM ON FPGA-BASED CUSTOM COMPUTING MACHINES. Washington: ACM. 1997.
WILTON, S. J. E.; JOUPPI, N. P. CACTI: an enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, May 1996. 677-688.
WOO, D. H.; LEE, H. Extending Amdahl's Law for Energy-Efficient Computing in the Many-Core Era. Computer, December 2008. 24-31.
WOO, S. C. et al. The SPLASH-2 programs: characterization and methodological considerations. PROCEEDINGS OF ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE. New York: ACM. 1995. p. 24-36.
YAN, L. et al. A Reconfigurable Processor Architecture Combining Multi-core and Reconfigurable Processing Unit. PROCEEDINGS OF IEEE INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY. West Yorkshire: [s.n.]. 2010. p. 2897-2903.


APPENDIX A

Introduction
Nowadays, the large market demand for embedded systems makes the design of these devices increasingly complex. Cell phones are able to embed new applications during their lifetime, which makes the hardware design of a current device a great challenge, since it must be able to efficiently execute all software behaviors with low energy consumption. In the current embedded scenario, two approaches are used to increase the performance of embedded processors: instruction set extensions and the deployment of ASICs. The first approach aims to solve bottlenecks created by the massive execution of certain applications. Thus, when a set of embedded applications is observed to share the same behavior, instruction set extensions are created so that this behavior can be executed efficiently in terms of performance and energy consumption. ASICs are frequently used in embedded platforms to solve performance issues of applications that are already strongly embedded and consume a large execution time in software. However, both instruction set extensions and ASICs affect software productivity, since each new platform requires changes in the code generation tools and, consequently, the recompilation of all applications, so that these applications can exploit the new platform resources. This affects the time-to-market and increases the design time of the embedded device, requirements that are strongly demanded in this domain. For the reasons explained above, a change in the hardware design paradigm for the embedded market is necessary. The use of multiprocessing systems provides several advantages: the execution time clearly benefits, since several parts of the program can be executed concurrently on the several processors of the platform, and the validation time of these platforms is greatly reduced, since a single processor must be validated for later replication. In a multiprocessing system, the hardware designer is responsible for encapsulating the maximum number of processing elements in a single chip. On the other hand, the software designer is responsible for the hard task of distributing the software among the several processing elements. Thus, software productivity emerges as the main challenge when multiprocessing environments are considered, since applications must be launched to the market quickly and their binary code must be generic enough to be executed on any platform launched in the future. Additionally, the interconnection infrastructure among the processors must be efficient enough so that the performance gains of parallel execution are not offset by the inter-thread communication process. Thus, an ideal multiprocessing system for embedded devices should be composed of replicated processing elements, where each of these elements can adapt to the particularities of the applications, and this adaptation must occur even after fabrication. The platform must emulate the behavior, in terms of performance and energy, of the ASICs that are being successfully deployed in current embedded platforms. At the same time, the use of the same instruction set for all processing elements is mandatory to maintain software productivity and to avoid modifications in the code generation tools and, consequently, the recompilation of the source code for each new platform version. This platform must efficiently handle the whole spectrum of application behaviors: those that contain massive thread level parallelism and those where this parallelism is nonexistent. Moreover, the platform should be conceived with a heterogeneous organization, to provide the best tradeoff between the heterogeneity of the applications and the chip area occupied by the multiprocessing system. Finally, the exact number of processors must be investigated, since the performance of a system conceived with many processors can be affected by the costs of inter-thread communication.

Objectives
Considering all the motivations presented above, the first goal of this work is to reinforce, through an analytical model, that exploiting parallelism at a single level does not provide an advantageous tradeoff between the performance obtained and the energy consumed by the system. In addition, this study provides some directions on the share of hardware that should be employed to exploit instruction and thread level parallelism. A communication model of a network-on-chip is created to investigate the impact of inter-thread communication. In this scenario, a platform based on Custom Reconfigurable Arrays for Multiprocessor Systems (CReAMS) is proposed, coupling two different concepts: reconfigurable architectures and multiprocessing systems. In a first step, CReAMS is conceived as a homogeneous organization; however, this platform virtually behaves as a heterogeneous organization due to its capability of adapting at run time. The system is able to transparently exploit, that is, without modifications in the original binary code, the instruction level parallelism of the running threads, offering a high ability to adapt to the parallelism demands of different applications. Thread level parallelism exploitation does not depend on any design-time code analysis tool, since it is exploited through the well-known programming interfaces (OpenMP and Pthreads) supported by the compilers available in the market, making execution on CReAMS independent of any proprietary process. Dynamically, it is possible to balance instruction and thread level parallelism exploitation. Thus, any kind of code is accelerated, both code that presents high TLP and low ILP and code with the opposite characteristics. CReAMS provides lower energy consumption and maintains the software productivity of homogeneous organizations. A single toolchain is needed for the whole platform, and any modification in the hardware does not require any modification in the binary code.

CReAMS
An overview of CReAMS is presented in Figure 1. Thread level parallelism is exploited through the replication of DAPs (Dynamic Adaptive Processors). Communication among the DAPs is done through a NoC with mesh topology.

Figure 1. (a) CReAMS organization (b) DAP

DAP
A Dynamic Adaptive Processor is divided into four blocks, illustrated in Figure 1. These blocks are discussed in the following sections.

Processor Pipeline
A processor based on the SparcV8 architecture is used as the baseline processor of the platform. This processor contains five pipeline stages, reflecting a traditional RISC processor.

Dynamic Detection Hardware
One of the constraints imposed on the design of an embedded device is the design time. The competition among companies to be the first to deliver a device with new functionalities to the market forces embedded design cycles to become ever shorter. Thus, designers must use techniques that help model and reuse the software already written for the previous device. Binary translation is widely used in general-purpose processors to provide software compatibility. In (BECK, RUTZIG, et al., 2008), it was demonstrated that the binary translation technique also achieves very good performance and energy results in a processor that executes the PISA instruction set (BURGER, 1997). Thus, this study inherits the technique applied in that work. The basic idea of the mechanism is to provide software compatibility through the binary translation of instruction sequences, at run time, so that they can later be executed on a more efficient mechanism, in this case a reconfigurable functional unit. Basically, the mechanism evaluates each instruction executed by the processor, grouping instructions into blocks called configurations of the reconfigurable functional unit (RFU). For each instruction executed by the processor, the possibility of executing it on the reconfigurable unit is verified; if so, it is allocated in the current RFU configuration. At the end of the construction of an RFU configuration, it is stored in a reconfiguration cache, indexed by the program counter value of the first instruction. When this value is reached again by the program counter, the reconfigurable mechanism is activated, following these steps:
• the configuration of the instruction sequence in question is fetched from the reconfiguration cache;
• the RFU is configured with the bits provided by the cache;
• the values of the registers needed to execute the configuration are loaded into the RFU input context;
• the execution is performed on the RFU.
After the execution on the RFU finishes, the program counter value is updated, as well as the values of the registers modified in the register file; writes to the data memory are performed in the same way. It is important to highlight that while the reconfigurable mechanism is active, the processor does not perform any operation. Unlike superscalar processors, this approach does not repeatedly verify the dependences among the same sequence of instructions: the use of the trace reuse technique avoids repeating this task. Therefore, if there is no miss in the reconfiguration cache, the BT mechanism is applied only once to each instruction sequence. Compared with mechanisms used in other reconfigurable systems, which apply reconfiguration techniques only to the most executed parts of the application, the BT approach described in this work provides greater flexibility: the whole application is analyzed by the BT, not only specific parts of it. The BT hardware is basically composed of tables and bitmaps that temporarily store data of the current configuration. The algorithm uses six tables and two bitmaps, specified below:
• Write Maps – store the destination register number of each instruction allocated in the RFU. This map is used to verify the dependences among instructions at the moment they are allocated in the architecture. Each RFU row has a write map, whose number of bits equals the number of registers in the input context. This bitmap is not stored in the final configuration, since it is not needed at execution time.
• Resource Maps – manage the allocation of RFU resources. When a new instruction is inserted in a configuration, an idle functional unit is searched for in this map. As with the write map, the data of the resource map is not inserted in the configuration.
• Current Context Table – stores a reference to all registers that will be used by the instructions executed by a configuration.
• Current Context Marker Table – characterizes the registers of the current context table, distinguishing between read and write registers.
• Initial Read Context Table – stores references to all read operands held in the current context marker table. The insertion is done at the position corresponding to the operand position in the current context table. The purpose of this table is to indicate which registers must be loaded into the input context at the start of execution.
• Immediate Table – type-I and type-J instructions carry immediate operands; for these instructions to be executed in the RFU, the values of these operands must be stored. Thus, this table stores the immediate values so that, at execution time, they can be loaded into the input context.
• Read Table – stores which registers are the inputs of each functional unit. Specifically, the data of this table is used, at execution time, as the control bits of the input multiplexers of each functional unit. This table references registers by their position in the initial context table.
• Write Table – analogously to the read table, the write table stores the reference to the column where the resource was allocated in the RFU. The write position in this table corresponds to the position of the operand in the context table. These values control the demultiplexers placed after the functional units, in order to write to the registers of the output context.
Figure 1 shows the binary translation hardware coupled to the SparcV8 processor. The hardware was divided into four stages so as not to affect the critical path of the processor; the dark-gray stages belong to the processor and the light-gray stages belong to the BT hardware. Each instruction executed by the processor is analyzed by the BT, which allocates it in the tables and bitmaps that compose the detection hardware. The operation of each stage of the algorithm is described below.
• Decode (ID) – in this stage the instruction fields are decoded; this information identifies characteristics such as:
  o Instruction group: in which group of the reconfigurable unit the instruction must be allocated.
  o Instruction type and function: in which functional unit of the group the instruction must be allocated, and which function that unit must perform to execute the instruction.
  o Read and write registers: identify the read and write registers of the instruction.
  o Immediate operands: identifies immediate operands, if the instruction has any.
• Dependence (DV) – after the instruction has been decoded and its fields identified, this stage verifies the data dependences among the instructions. This work is useful for the next stage, which allocates the instructions to functional units. True data dependences are quite common among instructions in an execution flow, so they must be handled carefully to guarantee data consistency.
• Resources (RA) – the previous stage detects the true dependences in the execution flow; in this stage, the availability of functional units in the RFU is verified and one of them is allocated so that the instruction can be executed.
• Update of Tables and Bitmaps (UT) – in this stage all tables and maps described above are updated. Thus, for each instruction added to the RFU, the tables are updated with the information needed to execute it, guaranteeing that, at execution time, the instruction is allocated, executed, and its result is stored in the destination register.
At the end of the construction of a configuration, all the data of the required tables is packed, and the configuration is stored in the reconfiguration cache. The detection of an instruction block for building a configuration occurs sequentially, at run time. However, some factors may interrupt the formation of this block and, consequently, the construction of a configuration. The first factor is instructions that are not supported by the RFU, such as instructions that perform division operations. When this kind of instruction is found by the BT, the current configuration is finished and stored in the reconfiguration cache; later, when an instruction supported by the RFU is found, a new configuration is started. Another factor that interrupts the formation of a configuration is branch instructions: for each branch instruction executed by the processor, the current configuration is finished and a new one is started. In (BECK, RUTZIG, et al., 2008), a branch speculation mechanism was proposed, which builds execution trees that are used in the formation of the RFU configurations.


The use of speculation makes it possible to build configurations that go beyond branch instructions, directly improving application execution performance by including a larger number of instructions in a single configuration. The number of branches executed in a single RFU configuration can be defined by the designer. However, the misspeculation penalty grows with the speculation depth.

Reconfiguration Memory
The reconfigurable system fundamentally exploits three techniques that are widespread in the scientific community: reconfigurability, binary translation, and reuse. The last one exploits the execution nature of the Von Neumann model, where the program counter drives the execution flow: the existence of a loop or of branches implies the repetition of a certain portion of application code. The proposed approach exploits this characteristic of the applications, performing binary translation of these code portions and storing the built configurations in a cache, avoiding the repetitive code analysis performed by superscalar processors. Each position of the reconfiguration cache stores the data needed to execute one configuration in the RFU. As already explained, each configuration stores the bits needed to control the multiplexers and demultiplexers, the control bits of the functions of each functional unit, and the immediate data contained in the instructions. In the results of (BECK, RUTZIG, et al., 2008), the FIFO (first in, first out) replacement algorithm was used. As a contribution of this work, this characteristic was explored further: traditional replacement algorithms were implemented, namely LRU (least recently used), LFU (least frequently used), and a random algorithm. In some cases the LRU and FIFO algorithms obtained the same number of cache misses. However, for most applications the FIFO algorithm showed better results, that is, this algorithm is able to cover the exact region of the temporal execution of the configurations.
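As an illustration only (the sizes, names and data layout below are hypothetical and do not correspond to the VHDL implementation), the sketch shows in C how a reconfiguration cache indexed by the program counter of the first translated instruction could be organized with the FIFO replacement policy used in the experiments.

#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS  64          /* hypothetical number of cache positions  */
#define CONFIG_WORDS 256         /* hypothetical width of one configuration */

typedef struct {
    uint32_t start_pc;               /* PC of the first translated instruction        */
    uint32_t bits[CONFIG_WORDS];     /* mux/demux and FU control bits, immediates     */
    int      valid;
} config_slot_t;

static config_slot_t cache[CACHE_SLOTS];
static int fifo_head = 0;            /* next slot to be replaced (FIFO order)         */

/* Returns the configuration built for this PC, or NULL on a miss. */
config_slot_t *reconfig_cache_lookup(uint32_t pc)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].start_pc == pc)
            return &cache[i];
    return NULL;
}

/* On a miss, the configuration just built by the BT hardware replaces the
 * oldest entry, mimicking the FIFO policy evaluated in this work. */
void reconfig_cache_insert(uint32_t pc, const uint32_t *bits)
{
    config_slot_t *slot = &cache[fifo_head];
    slot->start_pc = pc;
    memcpy(slot->bits, bits, sizeof(slot->bits));
    slot->valid = 1;
    fifo_head = (fifo_head + 1) % CACHE_SLOTS;
}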

Methodology
To extract performance, energy, and power results for the proposed system, applications that reflect distinct behaviors regarding instruction and thread level parallelism were selected. All applications were parallelized with the OpenMP and Pthreads programming interfaces. A simulation environment was created combining the Simics simulator, scripts developed for this work, and a simulator that emulates the behavior of a DAP. Synchronization points are precisely computed by the simulation environment, which works with cycle accuracy. To extract power, area, and critical path data, all DAP components were described in VHDL. The power consumption and area of the memory components were obtained with the CACTI 6.5 tool.


Results
The first results compare CReAMS with a multiprocessing system composed of SparcV8 processors, called MPSparcV8. A DAP occupies the area of four SparcV8 processors. Thus, a same-area comparison scenario was created with a 4-DAP CReAMS and a 16-SparcV8 MPSparcV8. The results show that CReAMS produces a shorter execution time in 6 of the 10 simulated applications. CReAMS yields a longer execution time in the applications where TLP is massive, since in the same-area scenario CReAMS has 4 times fewer processors than MPSparcV8. When a same-area scenario with more processors is considered (16-DAP CReAMS and 64-SparcV8), CReAMS produces a shorter execution time than MPSparcV8 in 7 applications, losing only in the applications where TLP is extremely massive. The energy consumption of CReAMS is lower in all applications, since the architecture employs several mechanisms that avoid energy waste: fewer instruction memory accesses, a smart reconfiguration scheme for the reconfigurable functional unit, and the use of sleep transistors. On average, the execution time is 30% shorter in CReAMS, consuming 32% less energy than MPSparcV8. In the results above, the inter-thread communication cost is not considered. Thus, the latency of a mesh network-on-chip was modeled to verify the impact of communication on CReAMS and on MPSparcV8. The results show that the interconnection affected MPSparcV8 more than CReAMS, since the former, in a same-area scenario, has more processors than CReAMS. Without communication, CReAMS produces a 20% shorter execution time than MPSparcV8 for the apsi application; when communication is considered, the gains increase to 37%. The energy consumption of CReAMS also benefits relative to MPSparcV8: without communication the gains are 32%, and when communication is included they rise to 49%. This is explained by the higher communication latency of a multiprocessing system that contains more processors. Although interesting results were obtained by conceiving CReAMS as a homogeneous organization, CReAMS was also conceived as a heterogeneous organization with the goal of further reducing energy consumption. Three different DAP configurations were created and coupled in the same chip. The results obtained in a same-area scenario show that heterogeneous organizations with less TLP and more ILP exploitation than the homogeneous ones provide gains in the applications where ILP is predominant and losses where TLP dominates. Energy gains are obtained by the heterogeneous organizations, since their Thermal Design Power (TDP) is always lower than that of the homogeneous organizations. However, in another same-area comparison scenario, where the heterogeneous organizations have higher TLP and lower ILP exploitation than the homogeneous ones, heterogeneity proved more efficient in applications where TLP is massive and also in applications where ILP is massive, leading to the conclusion that the few processors with high ILP capability already exploit this kind of parallelism significantly. In this scenario, the energy consumption of the heterogeneous organizations is higher than that of the homogeneous ones, due to their higher TDP.


From the results obtained in the exploration of the heterogeneous organizations, the need to include a dynamic thread scheduler in the multiprocessing system became clear. Some threads that needed aggressive ILP exploitation were allocated to processors that provided weak exploitation of this level of parallelism, affecting the execution time of some applications. Thus, a scheduler that detects, at run time, the thread allocation mismatch and reallocates those threads to a processor with a high degree of ILP exploitation can reduce the losses of the heterogeneous organizations. The results show that even a simple dynamic thread scheduler can reduce the execution time by up to 15%, which proves the need for scheduling when this kind of organization is used. Finally, a comparison of CReAMS with a multiprocessing system composed of superscalar processors with out-of-order execution was performed. Two comparison scenarios were created, both with the same power budget. In the first scenario, a 4-core 4-issue OOO MPSparcV8 was compared with a 32-DAP CReAMS; CReAMS proved more efficient in all applications, with an average gain of 3 times over the platform composed of four superscalar processors. In the second scenario, an 8-core 4-issue OOO MPSparcV8 was compared with a 64-DAP CReAMS; CReAMS was also more efficient in all applications, achieving, on average, an execution time 2.5 times shorter. These data show that CReAMS delivers more performance per watt than a multiprocessing system composed of 4-issue processors with out-of-order execution.

Conclusions
In this work, the design space of instruction and thread level parallelism exploitation is explored. The use of an analytical model demonstrated the limits of both levels of parallelism and the need to exploit them jointly. The same analytical model showed the great impact that inter-thread communication produces in a system where several processors are encapsulated in the same chip. Taking the conclusions of the analytical model into account, a multiprocessing system was proposed that combines the transparent ILP exploitation of a reconfigurable architecture with the software productivity provided by the OpenMP and Pthreads programming interfaces. This homogeneously organized multiprocessing system showed performance and energy gains compared with a traditional multiprocessing system occupying the same chip area. Afterwards, the heterogeneous organization of this system proved even more efficient in terms of performance and energy, achieving even more significant results. The adaptability of the proposed system in exploiting instruction level parallelism also proved more efficient, in terms of performance, than a multiprocessing system composed of superscalar processors with out-of-order execution, considering the same power budget for both platforms.