UC San Diego Electronic Theses and Dissertations
Title: GUSTO: general architecture design utility and synthesis tool for optimization
Permalink: https://escholarship.org/uc/item/49j6x25v
Author: İrtürk, Ali Umut
Publication Date: 2009
Peer reviewed | Thesis/dissertation

UNIVERSITY OF CALIFORNIA, SAN DIEGO

GUSTO: General architecture design Utility and Synthesis Tool for Optimization

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science

by

Ali Umut İrtürk

Committee in charge:
    Ryan Kastner, Chair
    Jung Uk Cho
    Bhaskar Rao
    Timothy Sherwood
    Steven Swanson
    Dean Tullsen

2009

Copyright
Ali Umut İrtürk, 2009
All rights reserved.

The dissertation of Ali Umut İrtürk is approved, and it is acceptable in quality and form for publication on microfilm and electronically:

Chair

University of California, San Diego
2009

DEDICATION

To my family (Jale, Ömer, Bülent, Topsi)

EPIGRAPH

Let nothing perturb you, nothing frighten you. All things pass. God does not change. Patience achieves everything.
    Mother Teresa

TABLE OF CONTENTS

Signature Page
Dedication
Epigraph
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita and Publications
Abstract of the Dissertation

Chapter 1 Introduction
    1.1 Motivation
    1.2 Research Overview
    1.3 Organization of Dissertation

Chapter 2 Parallel Platforms for Matrix Computation Algorithms
    2.1 Graphic Processing Units (GPUs)
    2.2 Massively Parallel Processor Arrays (MPPAs)
        2.2.1 Ambric Family Overview
        2.2.2 Ambric AM2045 Architecture
    2.3 Field Programmable Gate Arrays (FPGAs)
        2.3.1 Xilinx Virtex-4 Family Overview
        2.3.2 Xilinx Virtex-4 Architecture
    2.4 Learning from Existing Parallel Platforms
        2.4.1 Comparison of Parallel Platforms
        2.4.2 Roadmap for the Future Many-Core Platform

Chapter 3 Overview of Design Tools
    3.1 System Generator for DSP
    3.2 AccelDSP Synthesis Tool
    3.3 Simulink HDL Coder
    3.4 C-based High Level Design Tools
    3.5 A Case Study for Filter Designs using Domain Specific High Level Design Tools
        3.5.1 Filter Design HDL Coder Toolbox
        3.5.2 Xquasher
        3.5.3 Comparison: Filter HDL Toolbox versus Xquasher
    3.6 Roadmap for the Future Single/Many-Core Platform Generator Tool

Chapter 4 Matrix Computations: Matrix Multiplication, Matrix Decomposition and Matrix Inversion
    4.1 Building Blocks of Matrix Computations
    4.2 Matrix Decomposition and its Methods
        4.2.1 QR Decomposition
        4.2.2 LU Decomposition
        4.2.3 Cholesky Decomposition
    4.3 Matrix Inversion and its Methods
        4.3.1 Matrix Inversion of Triangular Matrices
        4.3.2 QR Decomposition Based Matrix Inversion
        4.3.3 LU Decomposition Based Matrix Inversion
        4.3.4 Cholesky Decomposition Based Matrix Inversion
        4.3.5 Matrix Inversion using Analytic Method

Chapter 5 GUSTO: General architecture design Utility and Synthesis Tool for Optimization
    5.1 Flow of GUSTO
        5.1.1 Error Analysis
    5.2 Matrix Decomposition Architectures
        5.2.1 Inflection Point Analysis
        5.2.2 Architectural Design Alternatives for Matrix Decomposition Algorithms
    5.3 Adaptive Weight Calculation Core using QRD-RLS Algorithm
        5.3.1 Comparison
    5.4 Matrix Inversion Architectures
        5.4.1 Inflection Point Analysis
        5.4.2 Architectural Design Alternatives for Matrix Inversion Architectures
        5.4.3 Comparison
    5.5 Conclusion

Chapter 6 GUSTO’s Single Processing Core Architecture
    6.1 Related Work
    6.2 Automatic Generation and Optimization of Matrix Computation Architectures
        6.2.1 Flow of Operation
        6.2.2 Designing the General Purpose Processing Core
        6.2.3 Designing the Application Specific processing core
    6.3 Designing a Multi-Core Architecture
        6.3.1 Partitioning
        6.3.2 Generation of the Connectivity Between Cores
    6.4 Conclusion

Chapter 7 Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths
    7.1 Hierarchical Datapaths Implementation and Heterogeneous Architecture Generation using GUSTO
        7.1.1 Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths
        7.1.2 Flow of GUSTO for Multi-Core Designs
    7.2 Architectural Implementation Results of Different Matrix Computation Algorithms
        7.2.1 Matrix Multiplication
        7.2.2 Matrix Inversion
    7.3 Conclusion

Chapter 8 FPGA Acceleration of Mean Variance Framework for Optimal Asset Allocation
    8.1 The Mean Variance Framework for Optimal Asset Allocation
        8.1.1 Computation of the Required Inputs
        8.1.2 Mean Variance Framework Step 1: Computation of the Efficient Frontier
        8.1.3 Mean Variance Framework Step 2: Computing the Optimal Allocation
    8.2 Implementation of the Mean Variance Framework
        8.2.1 Implementation Motivation
        8.2.2 Hardware/Software Interface
        8.2.3 Generation of Required Inputs - Phase 5
        8.2.4 Hardware Architecture for Mean Variance Framework Step 1
        8.2.5 Hardware Architecture for Mean Variance Framework Step 2
    8.3 Results
    8.4 Conclusions

Chapter 9 Future Research Directions

Appendix A Matrix Computations
    A.1 Matrix Decomposition Methods
        A.1.1 QR Decomposition
        A.1.2 LU Decomposition
        A.1.3 Cholesky Decomposition

Bibliography

LIST OF FIGURES

Figure 1.1: Design Flow of GUSTO.
Figure 1.2: Flow of GUSTO’s trimming feature.
Figure 1.3: Hardware implementation of matrix multiplication architectures with different design methods using GUSTO.
Figure 2.1: GPU Architecture.
Figure 2.2: The design flow using CUDA for NVIDIA GPUs.
Figure 2.3: Multi-core CPU architecture.
Figure 2.4: Structural object programming model for Ambric Architecture.
Figure 2.5: Design Flow of Ambric MPPA and its steps: Structure, Code, Reuse, Verify, Realize and Test.
Figure 2.6: Ambric Architecture.
Figure 2.7: An FPGA architecture and its resources: I/O cells, logic blocks (CLBs) and interconnects.
Figure 2.8: Xilinx ISE design flow and its steps: design entry, design synthesis, design implementation and Xilinx device programming. Design verification occurs at different steps during the design flow.
Figure 3.1: Two simple design examples using System Generator for DSP are presented: multiply & accumulate and FIR filter.
Figure 3.2: Example blocks from System Generators’ library.
Figure 3.3: Design and implementation of the Matching Pursuits algorithm for channel estimation using System Generator. Even a small change in the architecture affects blocks inside the design and requires a large amount of manual synchronization efforts.
Figure 3.4: The design flow for AccelDSP.
Figure 3.5: Comparison of AccelDSP/AccelWare and Hand-Code/Coregen implementations for various signal processing algorithms: FFT, 10×10 matrix multiplication, FIR filter, CORDIC and Constant False Alarm Rate (CFAR) [156]. Results are presented in terms of area and required calculation time.
Figure 3.6: C-based high level design tools are divided into 5 different families including Open Standard: System-C; tools producing generic HDL that can target multiple platforms: Catapult-C, Impulse-C and Mitrion-C; tools producing generic HDL that are optimized for manufacturer’s hardware: DIME-C, Handel-C; tools that target a specific platform and/or configuration: Carte, SA-C, Streams-C; tool targeting RISC/FPGA hybrid architectures: Napa-C.
Figure 3.7: