Institutionen För Systemteknik Department of Electrical Engineering

Institutionen för systemteknik Department of Electrical Engineering Examensarbete Algorithm adaptation and optimization of a novel DSP vector co-processor Examensarbete utfört i Datorteknik vid Tekniska högskolan i Linköping av Andréas Karlsson LiTH-ISY-EX--10/4372--SE Linköping 2010 Department of Electrical Engineering Linköpings tekniska högskola Linköpings universitet Linköpings universitet SE-581 83 Linköping, Sweden 581 83 Linköping Algorithm adaptation and optimization of a novel DSP vector co-processor Examensarbete utfört i Datorteknik vid Tekniska högskolan vid Linköpings universitet av Andréas Karlsson LiTH-ISY-EX--10/4372--SE Handledare: Olof Kraigher isy, Linköpings universitet Examinator: Dake Liu isy, Linköpings universitet Linköping, 18 June, 2010 Avdelning, Institution Datum Division, Department Date Division of Computer Engineering Department of Electrical Engineering 2010-06-18 Linköpings universitet SE-581 83 Linköping, Sweden Språk Rapporttyp ISBN Language Report category — Svenska/Swedish Licentiatavhandling ISRN Engelska/English Examensarbete LiTH-ISY-EX--10/4372--SE C-uppsats Serietitel och serienummer ISSN D-uppsats Title of series, numbering — Övrig rapport URL för elektronisk version http://www.da.isy.liu.se http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-57427 Titel Title Algorithm adaptation and optimization of a novel DSP vector co-processor Författare Andréas Karlsson Author Sammanfattning Abstract The Division of Computer Engineering at Linköping’s university is currently researching the possibility to create a highly parallel DSP platform, that can keep up with the computational needs of upcoming standards for various applications, at low cost and low power consumption. The architecture is called ePUMA and it combines a general RISC DSP master processor with eight SIMD co-processors on a single chip. The master processor will act as the main processor for general tasks and execution control, while the co-processors will accelerate computing intensive and parallel DSP kernels. This thesis investigates the performance potential of the co-processors by implementing matrix algebra kernels for QR decomposition, LU decomposition, matrix determinant and matrix inverse, that run on a single co-processor. The kernels will then be evaluated to find possible problems with the co-processors’ microarchitecture and suggest solutions to the problems that might exist. The evaluation shows that the performance potential is very good, but a few problems have been identified, that causes significant overhead in the kernels. Pipeline mismatches, that occurs due to different pipeline lengths for different instructions, causes pipeline hazards and the current solution to this, doesn’t allow effective use of the pipeline. In some cases, the single port memories will cause bottlenecks, but the thesis suggests that the situation could be greatly improved by using buffered memory write-back. Also, the lack of register forwarding makes kernels with many data dependencies run unnecessarily slow. Nyckelord Keywords DSP, SIMD, ePUMA, real-time, embedded, matrix, QR, LU, inverse, determinant, master-multi-SIMD, parallel algorithms, matrix algebra Abstract The Division of Computer Engineering at Linköping’s university is currently researching the possibility to create a highly parallel DSP platform, that can keep up with the computational needs of upcoming standards for various applications, at low cost and low power consumption. The architecture is called ePUMA and it combines a general RISC DSP master processor with eight SIMD co-processors on a single chip. The master processor will act as the main processor for general tasks and execution control, while the co-processors will accelerate computing intensive and parallel DSP kernels. This thesis investigates the performance potential of the co-processors by implementing matrix algebra kernels for QR decomposition, LU decomposition, matrix determinant and matrix inverse, that run on a single co-processor. The kernels will then be evaluated to find possible problems with the co-processors’ microarchitecture and suggest solutions to the problems that might exist. The evaluation shows that the performance potential is very good, but a few problems have been identified, that causes significant overhead in the kernels. Pipeline mismatches, that occurs due to different pipeline lengths for different instructions, causes pipeline hazards and the current solution to this, doesn’t allow effective use of the pipeline. In some cases, the single port memories will cause bottlenecks, but the thesis suggests that the situation could be greatly improved by using buffered memory write-back. Also, the lack of register forwarding makes kernels with many data dependencies run unnecessarily slow. v Acknowledgments First of all, I would like to thank professor Dake Liu for the opportunity to do this master thesis. I would also like to thank my supervisor Olof Kraigher for many discussions and supervision of my work. Finally, I would like to thank the rest of the research team, as well as the staff and master students at the division of Computer Engineering, that in one way or another has contributed to this work or just kept me company. The work has been a fun and rewarding experience. vii Contents 1 Introduction 3 1.1 Background . .3 1.2 Purpose . .4 1.3 Scope . .5 1.4 Outline . .5 2 The ePUMA architecture 7 2.1 Architecture overview . .8 2.2 Master DSP processor . .8 2.3 On-chip Network . .9 2.4 Direct Memory Access . 10 3 Sleipnir SIMD co-processor 11 3.1 Overview . 11 3.2 Data vectors . 12 3.3 The datapath . 12 3.4 Memory subsystem . 15 3.4.1 Program memory . 15 3.4.2 Constant memory . 15 3.4.3 Local vector memories . 16 3.4.3.1 Memory organization . 16 3.4.3.2 Data permutations . 17 3.4.3.3 Addressing . 18 3.4.3.4 Access modes . 19 3.5 Assembly instruction set . 19 3.5.1 Instruction format . 19 3.5.1.1 Single instruction iterations . 20 3.5.1.2 Conditions . 20 3.5.1.3 Instruction options . 21 3.6 The Sleipnir pipeline . 21 4 Matrix algebra and algorithm derivation 25 4.1 QR Decomposition . 25 4.1.1 Methods for QR Decomposition . 25 4.1.2 Gram-Schmidt orthogonalization . 26 ix x Contents 4.1.3 Rectangular matrices . 27 4.1.4 Algorithm . 28 4.2 LU Decomposition . 29 4.2.1 Calculation . 29 4.2.2 Memory storage . 30 4.2.3 Algorithm . 31 4.3 Matrix determinant . 32 4.3.1 Algorithm . 32 4.4 Matrix inversion . 33 4.4.1 Triangular matrices . 33 4.4.2 General matrices . 35 5 Algorithm evaluation 37 5.1 Fixed-point calculations . 37 5.2 Input data scaling . .√ . 38 5.3 Precision of 1/x and 1/ x ....................... 39 5.4 Matrix condition number . 39 5.5 Test data generation . 40 5.6 QR decomposition . 40 5.6.1 Performance . 40 5.6.2 Data formats . 42 5.6.3 Error analysis . 42 5.7 LU decomposition . 45 5.7.1 Performance . 45 5.7.2 Data formats . 46 5.7.3 Error analysis . 47 5.8 Matrix determinant . 47 5.8.1 Dynamic range handling . 47 5.8.2 Error analysis . 48 5.9 Matrix inverse . 49 5.9.1 Performance . 49 5.9.2 Data formats . 49 5.9.3 Error analysis . 51 6 Implementation 53 6.1 Assembly code generation . 53 6.2 Memory storage . 54 6.3 QR decomposition . 54 6.3.1 Small-size QRD . 55 6.3.2 Large-size QRD . 56 6.3.3 Loop unrolling . 57 6.3.4 Removing control-overhead . 58 6.3.5 Using a memory ping-pong approach . 59 6.4 LU decomposition . 59 6.4.1 Row pivoting . 60 6.4.2 Datapath lane masking . 60 Contents xi 6.4.3 2D addressing . 60 6.5 Matrix determinant . 61 6.6 Matrix inverse . 61 6.6.1 Triangular matrix inversion . 61 6.6.2 General matrix inversion . 62 6.7 Verification . 62 7 Results 63 7.1 QR decomposition . 63 7.2 LU decomposition . 64 7.3 Matrix determinant . 66 7.4 Matrix inverse . 66 8 Architecture Evaluation 69 8.1 Data dependency issues . 69 8.2 Pipeline hazards . 69 8.3 Buffered memory write-back . 70 8.4 Vector register file size . 72 8.5 Operations on datapath-unaligned vectors . 72 8.6 Hardware accelerated searching . 73 8.7 Register dependent iterations . 73 8.8 Complex division . 73 8.9 Simultaneous constant memory accesses . 74 9 Conclusions 75 9.1 Future work . 75 Bibliography 77 List of abbreviations AGU Address generation unit ALU Arithmetic logic unit CPU Central processing unit DCT Discrete cosine transform DMA Direct memory access DSP Digital signal processor/processing FIFO First in first out FSM Finite state machine LUD LU decomposition MAC Multiply and accumulate MGS Modified Gram-Schmidt OCN On-chip network PC Program counter PM Program memory RISC Reduced instruction set computer RTL Register transfer level SIMD Single instruction multiple data SVD Singular value decomposition VLIW Very long instruction word 1 Chapter 1 Introduction 1.1 Background Since the very first day a computer became a useful tool, there has been an increasing demand for more computational power. The reason is simple: more power makes it possible to do more things, solve larger problems and simplify many common tasks that is very tedious or even impossible to do by hand. From the very first computers that filled large rooms, we have now progressed to a stage where it is possible to build a computer that can be hand-held or even smaller, that has far greater performance than was possible in the early days. A key invention for the rapid progress of computers is the transistor, which has made it possible to build integrated circuits, that are absolutely essential for the miniaturization and performance of a modern computer. The “brain” of the computer, the CPU, is a key component in a computer and its.

Institutionen För Systemteknik Department of Electrical Engineering

Triangular Factorization

An Efficient Implementation of the Thomas-Algorithm for Block Penta

Pivoting for LU Factorization

Mergesort / Quicksort Steven Skiena

4. Linear Equations with Sparse Matrices 4.1 General Properties of Sparse Matrices

7 Gaussian Elimination and LU Factorization

Mixing LU and QR Factorization Algorithms to Design High-Performance Dense Linear Algebra Solvers✩

Maintaining LU Factors of a General Sparse Matrix*+ Department of Operations Research Stanford Uniuersity Stanfmd, Califmiu 9430

2 Partial Pivoting, LU Factorization

Some P-RAM Algorithms for Sparse Linear Systems

The Cholesky Factorization in Interior Point Methods

Why Sparse Matrix?