Institutionen för systemteknik / Department of Electrical Engineering


Algorithm adaptation and optimization of a novel DSP vector co-processor

Master's thesis carried out in Computer Engineering at the Institute of Technology, Linköping University, by

Andréas Karlsson

LiTH-ISY-EX--10/4372--SE

Supervisor: Olof Kraigher, ISY, Linköpings universitet
Examiner: Dake Liu, ISY, Linköpings universitet

Linköping, 18 June, 2010


Abstract

The Division of Computer Engineering at Linköping University is currently researching the possibility of creating a highly parallel DSP platform that can keep up with the computational needs of upcoming standards for various applications, at low cost and low power consumption. The architecture is called ePUMA and it combines a general RISC DSP master processor with eight SIMD co-processors on a single chip. The master processor will act as the main processor for general tasks and execution control, while the co-processors will accelerate computing-intensive and parallel DSP kernels. This thesis investigates the performance potential of the co-processors by implementing matrix algebra kernels for QR decomposition, LU decomposition, matrix determinant and matrix inverse that run on a single co-processor. The kernels will then be evaluated to find possible problems with the co-processors' microarchitecture and to suggest solutions to the problems that might exist. The evaluation shows that the performance potential is very good, but a few problems have been identified that cause significant overhead in the kernels. Pipeline mismatches, which occur because different instructions have different pipeline lengths, cause pipeline hazards, and the current solution to this does not allow effective use of the pipeline. In some cases the single-port memories will cause bottlenecks, but the thesis suggests that the situation could be greatly improved by using buffered memory write-back. Also, the lack of register forwarding makes kernels with many data dependencies run unnecessarily slowly.

Keywords: DSP, SIMD, ePUMA, real-time, embedded, matrix, QR, LU, inverse, determinant, master-multi-SIMD, parallel, matrix algebra


Acknowledgments

First of all, I would like to thank Professor Dake Liu for the opportunity to do this master's thesis. I would also like to thank my supervisor Olof Kraigher for many discussions and for the supervision of my work. Finally, I would like to thank the rest of the research team, as well as the staff and master students at the Division of Computer Engineering, who in one way or another have contributed to this work or just kept me company. The work has been a fun and rewarding experience.


Contents

1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Scope
  1.4 Outline

2 The ePUMA architecture
  2.1 Architecture overview
  2.2 Master DSP processor
  2.3 On-chip Network
  2.4 Direct Memory Access

3 Sleipnir SIMD co-processor
  3.1 Overview
  3.2 Data vectors
  3.3 The datapath
  3.4 Memory subsystem
    3.4.1 Program memory
    3.4.2 Constant memory
    3.4.3 Local vector memories
      3.4.3.1 Memory organization
      3.4.3.2 Data permutations
      3.4.3.3 Addressing
      3.4.3.4 Access modes
  3.5 Assembly instruction set
    3.5.1 Instruction format
      3.5.1.1 Single instruction iterations
      3.5.1.2 Conditions
      3.5.1.3 Instruction options
  3.6 The Sleipnir pipeline

4 Matrix algebra and algorithm derivation
  4.1 QR Decomposition
    4.1.1 Methods for QR Decomposition
    4.1.2 Gram-Schmidt orthogonalization
    4.1.3 Rectangular matrices
    4.1.4 Algorithm
  4.2 LU Decomposition
    4.2.1 Calculation
    4.2.2 Memory storage
    4.2.3 Algorithm
  4.3 Matrix determinant
    4.3.1 Algorithm
  4.4 Matrix inversion
    4.4.1 Triangular matrices
    4.4.2 General matrices

5 Algorithm evaluation
  5.1 Fixed-point calculations
  5.2 Input data scaling
  5.3 Precision of 1/x and 1/√x
  5.4 Matrix condition number
  5.5 Test data generation
  5.6 QR decomposition
    5.6.1 Performance
    5.6.2 Data formats
    5.6.3 Error analysis
  5.7 LU decomposition
    5.7.1 Performance
    5.7.2 Data formats
    5.7.3 Error analysis
  5.8 Matrix determinant
    5.8.1 Dynamic range handling
    5.8.2 Error analysis
  5.9 Matrix inverse
    5.9.1 Performance
    5.9.2 Data formats
    5.9.3 Error analysis

6 Implementation
  6.1 Assembly code generation
  6.2 Memory storage
  6.3 QR decomposition
    6.3.1 Small-size QRD
    6.3.2 Large-size QRD
    6.3.3 Loop unrolling
    6.3.4 Removing control-overhead
    6.3.5 Using a memory ping-pong approach
  6.4 LU decomposition
    6.4.1 Row pivoting
    6.4.2 Datapath lane masking
    6.4.3 2D addressing
  6.5 Matrix determinant
  6.6 Matrix inverse
    6.6.1 Triangular matrix inversion
    6.6.2 General matrix inversion
  6.7 Verification

7 Results
  7.1 QR decomposition
  7.2 LU decomposition
  7.3 Matrix determinant
  7.4 Matrix inverse

8 Architecture Evaluation
  8.1 Data dependency issues
  8.2 Pipeline hazards
  8.3 Buffered memory write-back
  8.4 Vector register file size
  8.5 Operations on datapath-unaligned vectors
  8.6 Hardware accelerated searching
  8.7 Register dependent iterations
  8.8 Complex division
  8.9 Simultaneous constant memory accesses

9 Conclusions
  9.1 Future work

Bibliography

List of abbreviations

AGU Address generation unit
ALU Arithmetic logic unit
CPU Central processing unit
DCT Discrete cosine transform
DMA Direct memory access
DSP Digital signal processor/processing
FIFO First in first out
FSM Finite state machine
LUD LU decomposition
MAC Multiply and accumulate
MGS Modified Gram-Schmidt
OCN On-chip network
PC Program counter
PM Program memory
RISC Reduced instruction set computer
RTL Register transfer level
SIMD Single instruction multiple data
SVD Singular value decomposition
VLIW Very long instruction word


Chapter 1

Introduction

1.1 Background

Since the very first day a computer became a useful tool, there has been an increasing demand for more computational power. The reason is simple: more power makes it possible to do more things, solve larger problems and simplify many common tasks that are very tedious or even impossible to do by hand. From the very first computers that filled large rooms, we have now progressed to a stage where it is possible to build a computer that can be hand-held or even smaller, yet has far greater performance than was possible in the early days. A key invention for the rapid progress of computers is the transistor, which has made it possible to build integrated circuits, which are absolutely essential for the miniaturization and performance of a modern computer.

The "brain" of the computer, the CPU, is a key component and its performance greatly affects the overall performance of the computer, since the CPU performs most of the actual computations. Most CPUs have for a very long time been designed using a synchronous approach, in which a global clock signal synchronizes the events of the entire CPU chip. In each clock cycle the state of the CPU is advanced and computations are performed. An easy way to improve the performance of a CPU is therefore to increase the clock frequency. There are unfortunately several problems with this approach. First, there is a limit on how high the clock frequency can be, depending on the technology and the design, since electronic circuits have a certain delay. Secondly, the chip will draw more power, both due to the higher clock frequency and due to the higher voltage typically required to operate at a higher frequency [12]. Higher power dissipation might lead to problems regarding cooling of the chips and, in the case of battery-powered devices, also to shorter battery life.

Since clock frequency ramping has turned out not to be the final answer to the increasing computational need, people have searched for other solutions. A very effective method is to utilize different kinds of parallelism. One such solution is SIMD extensions, which pack data into vectors and operate on all data in the vector in parallel. This solution has good performance potential, as long as

it is possible to formulate a program as a sequence of vector operations. Two other common solutions are superscalar and VLIW processors, both exploiting instruction-level parallelism to dispatch data-independent instructions to different computational blocks [15]. There is however a limit on the scaling of such an approach, and such a processor tends to require quite a lot of power. A third approach worth mentioning is the multi-core approach, where several tasks can be run in parallel on independent processor cores. This approach is very effective if several independent tasks should be performed, or if a task can be divided into several more or less independent sub-tasks.

The Division of Computer Engineering at Linköping University has for some time been researching the possibility to create a highly parallel DSP platform, ePUMA, suitable for real-time embedded applications. It should be able to handle the high computational demand of next-generation products, but at the same time come at a low cost and with low power requirements. Efficient hardware utilization is of key importance, and a major problem that ePUMA addresses is how to effectively utilize the available memory bandwidth. Investigations have shown that many processors spend a significant amount of time just waiting for input data from memory, which results in poor utilization of the datapaths and a slower computer. ePUMA's potential for high computational power is a result of combining the SIMD and multi-core approaches with a unique memory access design and a unique parallel programming methodology. The intent is to hide memory access latency behind computations, by performing computations and data access in parallel [8].

1.2 Purpose

The ePUMA overall architecture has been decided and the project is currently in an instruction set benchmarking and final microarchitecture decision phase. A pipeline-accurate simulator of the entire architecture has been implemented, but no RTL code has been written. The main reason for this is that final design decisions have not yet been made and some aspects of the architecture have not been fixed. Before making any final decisions, the architecture needs to be tested. This is done by running typical applications on the architecture simulator and evaluating the performance. By doing this, weak spots can be identified and the architecture can be improved in several iterations until satisfactory results are achieved. Testing is also of great help to evaluate the CPU instruction set. The instruction set greatly affects the performance of the CPU and, since the architecture is not yet at the RTL stage, it is easy to add new instructions and evaluate what benefits new instructions bring to the architecture.

To improve the architecture it is very important to specify the intended use of the processor and identify the operations that need to be performed. Intended application areas for ePUMA include for example baseband and radar signal processing, video codecs and video games, applications which rely heavily on matrix algebra routines. To make ePUMA a serious alternative for these areas, it is very important that the architecture can handle matrix manipulations effectively, for example conjugate transpose, as well as more advanced decompositions like QR and LU decomposition. Evaluation of such matrix algebra routines is the main focus of this thesis. The purpose of this thesis can be summarized as:

• Adapting real and complex integer matrix algebra algorithms for parallel and predictable execution.

• Implementing matrix algebra routines for the ePUMA processor to evaluate the performance and identify possible weak spots of the architecture.

• Suggesting and possibly implementing improvements to the ePUMA architecture and toolchain.

1.3 Scope

The number of possible matrix manipulations is very large. This thesis therefore concentrates on a few selected and well-used manipulations, namely general QR and LU decompositions for both square and rectangular matrices, as well as matrix determinants and matrix inverses for square matrices. This is done for both real-valued and complex-valued data. Like many other DSP processors, ePUMA focuses mainly on fixed-point calculations. There is currently no hardware floating-point support in ePUMA. Although ePUMA is intended to support 8-, 16- and 32-bit fixed-point data, this thesis focuses on 16-bit data. The reason is mainly that this is what is typically used in our target applications and that it presents a good compromise between precision, speed and hardware cost.

1.4 Outline

This section briefly describes the contents of the chapters in this thesis.

• Chapter 1 - Introduction: Describes the background, purpose and scope of this thesis.

• Chapter 2 - The ePUMA architecture: A brief overview of the ePUMA architecture will be presented, to bring the work into context.

• Chapter 3 - Sleipnir SIMD co-processor: This chapter will give a more detailed description of the Sleipnir SIMD co-processor, that is in focus in this thesis. The hardware will be described and an introduction to assembly programming for Sleipnir will be presented.

• Chapter 4 - Matrix algebra and algorithm derivation: Some matrix algebra theory will be presented, mainly focused on the matrix operations that will be investigated. Algorithms for performing these operations will be presented.

• Chapter 5 - Algorithm evaluation: This chapter will investigate the algorithms under 16-bit fixed-point precision. We will both look at theoretical lower boundaries on computation time in Sleipnir and investigate what precision we can expect on the computed results.

• Chapter 6 - Implementation: The chapter focuses on the implementation strategies for the assembly kernels, both straightforward implementations and the performance-enhancing techniques that can be applied. We will also discuss where possible sources of overhead can show up.

• Chapter 7 - Results: Here we will have a look at the final execution time results of the implemented kernels and compare these results with the numbers that were derived in chapter 5.

• Chapter 8 - Architecture evaluation: The chapter will evaluate the Sleipnir architecture for matrix algebra. We will mainly focus on the weaknesses and suggest improvements that would decrease the overhead.

• Chapter 9 - Conclusions and future work: The thesis will be concluded with some final discussion and suggestions for future work.

The target audience for this thesis is master students with a background in electrical engineering or computer science.

Chapter 2

The ePUMA architecture

The ePUMA architecture is a master-multi-SIMD DSP processor architecture. The general concept in such an architecture is to divide the computational tasks between a single master processor and several SIMD co-processors. The SIMD co-processors are specialized for execution of arithmetic operations on vector data and can therefore achieve very high throughput. The master processor is usually more of a general purpose processor and controls the execution of the entire processor, by handing out tasks to the SIMD co-processors and maintaining a control flow for the whole processor. The master processor can also be used to run code that cannot easily be mapped to vector operations and therefore is not suitable to run on the SIMDs. This can for example be the sequential part of an algorithm.

The ePUMA architecture is an attempt to use the master-multi-SIMD concept for DSP applications. It does so by combining one RISC DSP processor and up to eight SIMD co-processors on a single chip. Another notable example that uses the master-multi-SIMD approach is the Cell Broadband Engine architecture from Sony, IBM and Toshiba (STI), best known for being the main CPU of the PlayStation 3 game console [7]. The Cell processor, however, is aimed more at being a general purpose processor and has a less strict power budget. Since ePUMA is intended for real-time embedded applications, power and cost are much more of an issue. Traditionally, a main contributor to power consumption is that the available memory bandwidth is not used efficiently, which is compensated for by an increased clock frequency. The ePUMA project aims at delivering both an architecture and a parallel programming methodology to deal with this issue, so that the highest possible ratio of arithmetic computing to overall cost can be achieved. The key is to hide the cost of data accesses and control overhead behind the arithmetic computations, so that high datapath utilization can be achieved [8].

This chapter will present a brief overview of the entire architecture. The focus of this thesis is to implement computing kernels for the SIMD co-processors, so these are instead presented in more detail in Chapter 3. It is still important, however, to have some knowledge about the entire architecture, in order to be able to create an efficient computing kernel and put everything into context.


2.1 Architecture overview

A master-multi-SIMD architecture is a quite modular architecture with many different parts. In focus are of course the master processor and the co-processors, which perform all the actual computational work. The processor also needs some main storage space, which is provided by off-chip main memory. In order for the architecture to be flexible and efficient, it is also extremely important to have a good on-chip network with sufficient flexibility and throughput not to become a bottleneck of the whole design. A schematic overview can be found in figure 2.1. The main on-chip components include [4]:

• Master RISC DSP core A 16-bit dual-MAC DSP processor, containing two local data memories and a local instruction cache.

• Eight Sleipnir SIMD co-processors Sleipnir is an 8-way 16-bit SIMD DSP processor, meaning that it can operate on vectors with up to eight data elements simultaneously. Each SIMD co- processor has its own set of data memories and program memory. Sleipnir will be further described in chapter 3.

• DMA controller Used to handle block transfers of data between off-chip main memory and the internal memories.

• On-chip network A packet-based on-chip network that connects the SIMDs and the DMA controller.

A behavioral model of the whole architecture has been implemented as a cycle-true and pipeline-accurate software simulator. It is currently possible to simulate the whole architecture, as well as the master processor or a single SIMD co-processor separately.

2.2 Master DSP processor

The master processor in ePUMA is a RISC DSP processor and will in a typical environment be used for many different tasks, for example coordinating the SIMDs, executing smaller computational tasks, executing tasks that cannot be executed efficiently on parallel hardware, and more. Because of this, it is very important that this processor is very flexible and can execute many different kinds of tasks efficiently.

The master runs code that resides in main memory. The SIMD co-processors however only run code that resides in their own local program memories. It is the master's job to put a program, a computing kernel, in the co-processor's program memory. The master issues tasks to the SIMDs by transferring computing kernels and input data from main memory to the SIMDs' local program and data memories.

Figure 2.1. ePUMA overview

A potential bottleneck of the architecture is that the master becomes overloaded and therefore slow at issuing new tasks to the SIMDs, leaving the SIMDs idle. This must of course be avoided. This transferring of tasks is also the reason why the master is more efficient to use for smaller tasks, since the overhead of setting up transfers could be larger than the cost of executing the task directly on the master. Also, if a task cannot be adapted to use the vector processing capabilities of the SIMDs, it is probably just a waste of resources to run it on a SIMD processor.

Currently the master is a single-issue DSP processor, meaning that it issues one machine instruction in every clock cycle. Since the performance of the master could be critical in some situations, it might be implemented as a superscalar processor in the future.

Currently the only way of programming the master processor is assembly programming. Programming large control flows, in combination with small computing tasks and other things, will surely be a very tedious task in assembly language. There are plans for creating a compiler for a higher-level language in the future.

2.3 On-chip Network

The on-chip network allows direct transfer of data from one SIMD to another. This allows for powerful computational chains, where one SIMD can do some computations and transfer the results to another SIMD that performs the next computational step, and while this transfer is being done, the first SIMD can continue with a new set of input data that has previously been loaded. By using this approach, we can achieve efficient data streaming among the SIMD co-processors and keep the SIMDs running all the time. The network is monitored and controlled by the master processor, which does so by reading and writing to network node registers.

2.4 Direct Memory Access

As described earlier, the master is responsible for handing out tasks to the SIMDs. To avoid unnecessary load on the master, this is done using DMA. The machine code for a SIMD program is initially kept in main memory. The master sets up a DMA transaction to transfer that program from main memory to the SIMD's local program memory. Typically, input data is also needed, which is also transferred using DMA into the SIMD's local data memories. When the program code and input data are in place, execution in the SIMD can commence. Once finished, DMA is also used to read the output data from the SIMD's local data memories. When this is done, it is possible to either load a new task to the SIMD or just supply new input data and run the same task again.

Chapter 3

Sleipnir SIMD co-processor

The SIMD co-processor in ePUMA is called Sleipnir, and eight of these are intended to give ePUMA a huge amount of parallel processing power. The majority of the work presented in this thesis is about adapting matrix algebra algorithms to efficiently utilize the vector capabilities of Sleipnir. The performance is evaluated by simulating a single co-processor as a stand-alone processor. The assumption is then that the master processor has already transferred the computing kernel and input data to Sleipnir's local memories, so that the computations can start right away. This chapter will both present a hardware overview of Sleipnir and give an introduction to assembly programming for Sleipnir.

3.1 Overview

Sleipnir’s internal structure is depicted in figure 3.1. Here we find an 8-way datapath, that supports advanced manipulations of vectors containing 8, 16 or 32- bit real-valued or complex-valued fixed-point data. We also find several memories. The program memory (PM) contains the computing kernel at execution. The three local vector memories (LVMs) are the main data memories where all input, intermediate and output data is stored. The constant memory (CM) is not writable and can be used to store coefficients and other constant data needed by the computing kernel. We also find a number of registers, including the vector register file (VRF), that can contain eight data vectors, the vector flag register (VFLAG), containing flags set by computations, the vector accumulator register (VACR), used for MAC and MAC-like operations and a special register file (SRF). The SRF contains address registers (AR) for addressing the LVMs and constant address registers (CAR) for addressing the constant memory, with corresponding top and bottom registers for modulo addressing. In Sleipnir there is also hardware for supporting advanced data access patterns for the LVMs, as well as hardware for communicating with other Sleipnir co-processors and other parts of the ePUMA architecture. A main focus of Sleipnir is orthogonality. The idea is that computations, control and data access should be separated. This allows for greater freedom

while writing code and the possibility to run more things in parallel.

Figure 3.1. Sleipnir overview

3.2 Data vectors

Sleipnir’s parallel computing power is a result of working on data vectors, which means that several scalar operations are performed in parallel. A data vector is simply several scalars packed together in a bundle. The scalar sizes that can operated on are either 8 bits (bytes), 16 bits (words) or 32 bits (double words or simply doubles). The data vector length in Sleipnir is 128 bits. This means that we can pack different amounts of scalars in a vector, depending on what scalar size we use. Also, in the case of complex numbers, we need to store both the real and the imaginary part of the scalar. A few examples of data vector formats are shown in figure 3.2.

3.3 The datapath

The datapath is the main computational block of the Sleipnir co-processor. It consists of eight 16-bit lanes, making it an 8-way processor. It is intended to handle both simple arithmetic and logical operations as well as more complicated instructions. Supported operations can be classified as:

• Scalar-scalar operations: E.g. adding two scalars to produce a scalar.

• Vector-scalar operations: E.g. subtracting a scalar value from every element in a vector.

• Vector-vector operations: E.g. multiplying all elements of a vector element-wise with the elements of another vector.

• Triangular operations: In contrast to the three variants above, triangular operations have dependencies between the lanes, e.g. multiply two vectors element-wise and then accumulate all results with an adder tree, i.e. several consecutive MAC operations in one instruction.

• Special function operations: Butterflies, Taylor series acceleration, DCT etc.

• Custom micro-coded operations: Custom operations controlled by micro- code programs.

In a typical kernel, we of course want to use the more elaborate operations as much as possible, since we then perform more operations per cycle. Scalar-scalar operations especially should be avoided, since they use only one lane (or two if we use doubles) and poorly utilize the datapath hardware.

To avoid making the datapath a critical delay path in the processor, the datapath is divided into three pipeline stages according to figure 3.3. First comes a multiplier stage, which consists of 16 multipliers (two for each lane), each capable of multiplying two 16-bit two's complement numbers, either signed or unsigned. If we have double words as input data, each number will occupy two datapath lanes and 32 by 32 multiplication will be achieved using several 16 by 16 multipliers. Complex words can also be multiplied directly by recognizing that

(a + bi) · (c + di) = ac − bd + i(ad + bc).

Figure 3.3. The Sleipnir datapath

This translates the complex multiplication into four real multiplications and two additions/subtractions that can be performed in the next pipeline stage. Element-wise multiplication of two vectors of complex words would therefore utilize all 16 multipliers simultaneously. The two stages following the multiplier stage are two ALU stages. The first ALU stage contains mostly adders that are required for e.g. triangular operations and complex multiplications. The second ALU stage contains adders, but also logical units, shifters and other typical DSP post-processing operations like saturation and rounding. This stage also has direct access to the vector accumulator register, in order for it to be able to perform MAC and MAC-like operations.

Of course, not all instructions use all the datapath hardware, and Sleipnir therefore classifies operations as either long or short datapath operations. Long operations use all datapath stages, while short operations bypass both the multiplier stage and the first ALU stage. This provides the possibility of getting lower latency for instructions that, for example, do not need to use the multipliers.

To exemplify the configurability of the datapath, we shall here look at a specific example, namely the triangular MAC (TMAC) operation. For two real-valued word vectors a and b the operation performs

$$t = a_0b_0 + a_1b_1 + a_2b_2 + a_3b_3 + a_4b_4 + a_5b_5 + a_6b_6 + a_7b_7$$

The result t can be written to the vector accumulator register, in order to be able to multiply and accumulate two or more additional vectors to the result. Otherwise it can be written back to the vector register file or any of the local vector memories. The configuration of the datapath during the execution of this operation will conceptually look like in figure 3.4.

Currently there is no direct hardware support for computing divisions or transcendental functions. These kinds of functions are however frequently used in many DSP applications. The intention is to support such functions by the use of Taylor approximations, which will rely heavily on the multipliers. This is however still ongoing research and no final decisions regarding this have yet been made.

Figure 3.4. Datapath configured for execution of triangular MAC
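To make the accumulation behaviour concrete, the following Matlab sketch models how a long real-valued dot product could be split into 8-element chunks, where each loop iteration corresponds to one TMAC issue and the variable acc plays the role of the vector accumulator register. The chunking and variable names are our own illustration, not the actual instruction semantics.

% Model of a long dot product computed with repeated 8-wide triangular MACs.
a = randn(1, 64);                    % example input vectors (length divisible by 8)
b = randn(1, 64);
acc = 0;                             % models the vector accumulator register (VACR)
for k = 1:8:numel(a)
    products = a(k:k+7) .* b(k:k+7); % eight parallel multiplications
    acc = acc + sum(products);       % adder tree followed by accumulation
end
% acc now equals dot(a, b)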

3.4 Memory subsystem

To be able to use all the potential parallel performance of the datapath, effective storage and retrieval of data is very important. For low-latency access to data, a vector register file is provided that supports two reads and one write per clock cycle. It is possible to retrieve vectors, but also individual words, doubles or half-vectors, from the VRF. Since register files are typically quite hardware expensive, the VRF has a quite limited size of eight 128-bit vectors. For larger amounts of data, memories must be used. Also, the actual program uses its own separate memory. The program memory, the CM and the LVMs will be further described in the following three sections.

3.4.1 Program memory

The program memory is loaded using DMA before execution in the SIMD can commence. Since this memory has to be loaded every time a new task should be run, it is beneficial if the computing kernel is quite small, so that less program data has to be transferred. If the same task should be run several times however, the same program data can of course be reused. Little has been decided on the actual layout and size of the program memory. The layout will be influenced a lot by the final instruction set of Sleipnir, which has not been decided either. It is of course very important to minimize the amount of memory a specific task requires, since task loading times are directly proportional to the code size.

3.4.2 Constant memory

The constant memory can be used for data that is intended to remain constant during SIMD kernel execution and is only written at task loading. Typical contents can be algorithm coefficients and look-up tables, but also addressing patterns for the LVMs, which will be further described in section 3.4.3.2. Currently, the constant memory can hold 256 128-bit vectors, but this is subject to change.

The constant memory is for now only vector addressable, meaning that individual scalars cannot be accessed directly from the CM. The main motivation for this is that it reduces hardware cost. The CM can be addressed in the assembly code either using absolute addresses or by using either of the two constant memory address registers (CAR0/CAR1), which belong to the special register file. CAR addressing also supports post-increment of the CAR register and modulo addressing, by providing top and bottom registers in the special register file. A few examples of assembly syntax:

- cm[ABSOLUTE_ADDRESS] - absolute addressing

- cm[car0] - address register addressing

- cm[car0 + CONSTANT] - address register added with a constant

- cm[car0+=1] - address register addressing with post-increment to continue to the next vector

- cm[car0+=1%] - address register addressing with modulo post-increment

3.4.3 Local vector memories

The LVMs provide the main storage space for data and each SIMD has three LVMs. Only two of the LVMs are however accessible to the computing kernel during execution. The third memory is instead used for data transfers to and from the OCN. While executing on the SIMD, DMA can be used to fill the third LVM with new input data. When the SIMD finishes execution of a kernel, the LVMs can switch roles, so that the LVM that has just been filled with new input data now is accessible to the kernel. From the SIMD's point of view the data loading time is therefore virtually zero, since the loading is overlapped with computing. One of the previously accessible memories then has to be switched out. The choice is of course to switch out the memory that contains the output data from the previous computation round, so that the output data can be read while the SIMD is already running the next iteration. When the output data has been read, new input can be supplied again and the situation repeats. The handling of memory switches is performed by only letting the kernel program know of two memory ports, m0 and m1, and letting a switch decide at each time which memory is connected to which port.

3.4.3.1 Memory organization

The final size of the LVMs has not been decided, but they are expected to be able to hold around 5000 128-bit vectors. An LVM is however not implemented as one large memory, but rather as eight scratchpad memories, each with a width of 16 bits. This gives a lot of freedom in how to access data. The LVMs are word addressable, meaning that any single 16-bit word can be addressed directly. By combining the data from several addresses we can create doubles, half-vectors and full data vectors. For example, if we read a vector with start address 3, we would then receive a data vector with the contents from address 3 and the seven consecutive addresses. The situation is depicted in figure 3.5.

Figure 3.5. Generation of a 128-bit data vector (memory addresses in hexadecimal)

3.4.3.2 Data permutations

Some algorithms operate on long data vectors, but need to access the data in an order other than linear. Since we want to work on data vectors as much as possible to achieve more parallelism, it would be beneficial if it were possible to generate data vectors from more complex patterns. The constant memory, in combination with the permutation switch depicted in figure 3.5, provides that possibility. By storing an access pattern in the constant memory, we can access an LVM with this pattern, instead of the usual eight consecutive addresses.

Figure 3.6 illustrates the storage of an 8x8 matrix in an LVM. The matrix is stored in row-major order, which means that rows are stored consecutively in linear memory (in contrast to column-major order, where columns are stored after each other). Any row can be read from memory using the standard way of generating data vectors, since row elements are stored consecutively. Using the data permutation capability of Sleipnir, it would also be possible to access the entire diagonal of the matrix in parallel, by storing the access pattern [0,9,18,27,36,45,54,63] in the constant memory and then using this to do a look-up in the LVM. However, it will not be possible to access an entire column in parallel using this storage scheme. The reason is that all column elements are stored in the same physical scratchpad memory. Each scratchpad memory only has a single port, which allows a single read or write per clock cycle. Any access pattern that requires two or more simultaneous accesses to the same scratchpad memory will therefore not work. If a column still needs to be accessed, it has to be accessed sequentially over several clock cycles, which will degrade performance. The conclusion is that the programmer must carefully choose how to distribute data, so that conflict-free parallel memory access can be achieved. Otherwise we will get a significant reduction of memory access performance.

Figure 3.6. Storage of an 8x8 matrix
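Since each scratchpad memory serves one access per clock cycle, an access pattern is conflict free exactly when its eight addresses fall in eight different memories. Assuming the straightforward mapping bank = mod(address, 8), which is our simplification rather than a documented hardware mapping, the following Matlab sketch checks the row, diagonal and column patterns for the 8x8 row-major example above.

% An 8-element access pattern is conflict free if it hits 8 distinct banks.
conflict_free = @(addrs) numel(unique(mod(addrs, 8))) == numel(addrs);

row      = 0:7;        % one matrix row: consecutive addresses
diagonal = 0:9:63;     % the main diagonal: [0 9 18 27 36 45 54 63]
column   = 0:8:56;     % one matrix column in row-major storage

conflict_free(row)       % true  - one element in each scratchpad memory
conflict_free(diagonal)  % true  - also spread over all eight memories
conflict_free(column)    % false - all eight elements in the same memory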

3.4.3.3 Addressing

The LVMs can be addressed in many ways. Like the constant memory, they can be addressed with an absolute address or with address registers (AR) from the special register file. There are four address registers that can be used for any LVM, and each has its own top and bottom registers to support modulo addressing. Each address register also has a corresponding step register, which makes it possible to post-increment the address register by the value present in the step register. As described in the previous section, it is also possible to supply an access pattern from the constant memory. The VRF can be used to provide an offset or an access pattern in the memory. Here follow a few examples of assembly syntax for different addressing modes:

- m0[ABSOLUTE_ADDRESS] - absolute addressing.

- m0[ar0+=C] - address register addressing with post-increment. C can be either 1, 2, 4 or 8 for progressing a word, double, half-vector or vector respectively.

- m0[ar0-=C] - same as above with decrement, instead of increment.

- m0[ar0+=S%] - address register addressing with modulo post-increment. S is the value contained in the corresponding step register.

- m0[cm[car0]] - addressing with pattern from constant memory.

- m0[ar0 + cm[car0 + CONSTANT]] - index from address register is added to the pattern from the constant memory.

- m0[vr0] - vector register addressing with pattern from a vector register.

By using combinations of the various addressing possibilities, and storing a table of access patterns in the constant memory, very complicated access patterns can be achieved, for example m0[ar0+=S% + cm[car0+=1% + CONSTANT]]. There are a lot of possibilities for inventive use of these features.
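As a sketch of what modulo post-increment does, the Matlab helper below returns the address used by the current access and the updated address register, wrapped into the window given by the bottom and top registers. The exact wrap-around semantics (inclusive bounds and wrap amount) are our assumption for illustration, not a specification of the hardware.

% Illustration of modulo post-increment addressing with top/bottom registers.
function [addr, ar_next] = modulo_post_inc(ar, step, bottom, top)
    addr = ar;                                          % address used by this access
    range = top - bottom + 1;                           % size of the circular buffer
    ar_next = bottom + mod(ar - bottom + step, range);  % post-incremented, wrapped register
end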

3.4.3.4 Access modes

The LVMs support a number of access modes that control address generation and how much data is actually retrieved from the LVMs. This is indicated by adding a suffix to the memory operands. A few examples are:

- m0[0].sw - Access a scalar word from address 0 (16 bits).

- m0[0].sd - Access a scalar double from addresses 0 and 1 (32 bits).

- m0[0].vw - Access a vector from addresses 0 to 7 (128 bits).

The vector register file also uses a similar scheme, where vr0.0 will retrieve a scalar word, vr0.0d will retrieve a scalar double and vr0 will access the whole vector register.

3.5 Assembly instruction set

Writing programs for Sleipnir is done using the Sleipnir assembly language. Since today's compiler technology is not sophisticated enough to handle vectorization of high-level language code, compiled code would probably not be able to utilize all the parallel processing capabilities of Sleipnir. It is therefore very unlikely that a compiler will be made available for Sleipnir. Computing kernels should typically be quite small though, making them manageable in assembly language. Table 3.1 lists a small selection of the instruction set [5].

3.5.1 Instruction format

The general instruction format looks like

iter * instr.cdt dst src0 src1

Not all fields apply to all instructions, of course. Here follows a short description of the different fields:

• instr - instruction mnemonic
• dst - data destination
• src0/src1 - source operands
• iter - perform the same instruction more than once (3.5.1.1)
• cdt - make instruction condition dependent (3.5.1.2)
• options - additional options to the instruction (3.5.1.3)

The destination can be either the VRF, the VACR, any of the LVMs or any special purpose register, depending on the instruction. Source operands can come from the VRF, the VACR, any of the LVMs, any special purpose register or the CM. Iterations, conditions and options are very useful in many situations and will be described in more detail in the three following subsections.

Table 3.1. A small excerpt of the Sleipnir instruction set

Mnemonic    Description
SCOPYW      Copy scalar word
VCOPY       Copy vector
JMPQ        Jump to immediate address
INTQ        Interrupt master
CALLQ       Call immediate address
RET         Return from subroutine
STOP        Stop execution
R4BF        Radix-4 butterfly
REPEAT      Hardware looping
SADDW       Scalar word addition
SMAXD       Scalar maximum double word
SORW        Scalar logic OR word
SCMUL       Scalar complex-complex multiplication
SMACW       Scalar multiply and accumulate
TCMAC       Triangular complex multiply and accumulate
VADDD       Vector double word addition
VABSW       Vector absolute word
VLSLW       Vector logical left shift word
VCMAC       Vector complex multiply and accumulate
VSCMAC      Vector-scalar complex multiply and accumulate

3.5.1.1 Single instruction iterations

Sleipnir supports issuing the same instruction many times consecutively, by specifying the number of iterations in the assembly code. Consider for example a case where a vector of 1024 elements is stored in m0 and we want to add to that vector a 128-bit vector constant stored in the VRF. Using the vector add instruction, that operation could then be written in a single assembly line as

128 * vadd m1[ar1+=8].vw m0[ar0+=8].vw vr0

This feature can greatly decrease code size and removes the overhead associated with creating a hardware or software loop.
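In Matlab terms, the single iterated instruction above computes the following, where v is the 8-element constant held in vr0. The loop and variable names are our own and are only meant to show what is computed, not how the hardware executes it.

% What the iterated vector add computes: a 1024-element array in m0 plus
% an 8-element constant vector, written to m1 in 8-element chunks.
x = randn(1, 1024);      % contents of m0
v = randn(1, 8);         % constant vector in vr0
y = zeros(1, 1024);      % destination buffer in m1
for k = 1:8:1024         % 128 iterations of the vector add
    y(k:k+7) = x(k:k+7) + v;
end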

3.5.1.2 Conditions

Traditionally, conditions are something associated with conditional jumps. An arithmetic instruction, for example, sets the flag register according to the result of a computation. A following conditional branch instruction is then either taken or not taken, depending on the contents of the flag register. In the case of jumps, this concept does of course not apply to vectors.

In Sleipnir, the concept is generalized to other instruction types as well. The condition then works as a write mask, as illustrated in figure 3.7. An arithmetic instruction, for example, sets the flags according to the result of its computation. Following instructions can then use conditions to write their results or not, depending on the flags set by the first instruction. The flags are typical processor flags, namely zero, overflow, negative and carry, which makes it possible to check conditions like equal, greater than etc.

Figure 3.7. Conceptual view of the write mask for arithmetic instructions using conditions
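The write-mask behaviour can be sketched in Matlab as follows: a first vector operation produces per-lane flags, and a following operation only commits its result in the lanes where the chosen condition holds. The flag handling is simplified to a plain logical mask here; the values are arbitrary.

% Sketch of per-lane conditional write-back using flags as a write mask.
a = [3 -1 4 -1 5 -9 2 -6];   % result of a previous vector operation
negative = a < 0;            % per-lane negative flags set by that operation

dst = zeros(1, 8);           % destination vector
res = 2 * a;                 % result of the following, conditional instruction
mask = ~negative;            % condition: only write lanes that were not negative
dst(mask) = res(mask);       % lanes failing the condition keep their old contents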

3.5.1.3 Instruction options

The options field lets you use a variety of options for the instructions. Of course, not all options apply to all instructions. Some possible options are:

• noflag - the instruction does not change any flags

• rnd/sat - perform rounding/saturation

• ss/us/su/uu - specifies if source 0/1 should be interpreted as signed or unsigned

• clr - clear VACR after data has been outputted

• conj0/conj1 - used to conjugate source operand 0/1 for complex data instructions

• scale=xx - used for scaling output data

3.6 The Sleipnir pipeline

To achieve high parallelism, Sleipnir has a quite long pipeline. With a long pipeline, many instructions are simultaneously in various stages of execution. Since many instructions do not need all pipeline stages, unneeded stages can be bypassed and the number of pipeline stages used varies a lot. The number of pipeline steps taken also depends on what inputs are used by the instructions. LVM accesses are performed using several pipeline steps, while the register file can be accessed in a single cycle. All in all, it can take anywhere from 2 to 13 cycles before an instruction is completely executed, depending on what instruction it is and what destination and source operands are used. Figure 3.8 depicts a simplified overview of the Sleipnir pipeline. Note that after stage D4, write-back to for example an LVM or the VRF might occur. This is done using the same pipeline stages A1-A4 that are also used for doing data reads.

As described earlier, datapath operations are classified as either long or short, and inputs and outputs are likewise classified as long or short. LVM accesses are long inputs/outputs, since several steps are required in order to calculate memory addresses etc. Register accesses are in contrast short. If an operation is using one source operand from memory and one from a register, the access type will be long to accommodate the memory access. Figure 3.9 depicts the usage of the pipeline under different conditions. Note that stages A1-A4 have been greyed where they are used for write-back.

The Sleipnir micro-architecture currently has very few hazard detection mechanisms. No automatic stall will be performed if either a data hazard or a structural hazard occurs. It is up to the programmer to write code that avoids hazards. In some cases this can be done by rearranging code, while sometimes no-operations (NOPs) have to be inserted in the code. The various pipeline lengths make it extra tricky to avoid hazards. A future simulator environment could possibly warn for these situations, or hardware detection could be added.

Figure 3.8. Sleipnir pipeline overview

Figure 3.9. Pipeline usage during different execution conditions

Chapter 4

Matrix algebra and algorithm derivation

This chapter will be devoted to presenting the matrix operations that are in focus in this thesis. A specific operation can typically be performed in many ways, which differ a lot in terms of the arithmetic operations required and in how well they map to parallel hardware. Since we want to utilize the parallel datapath of Sleipnir as much as possible, it is very important that the chosen algorithms are inherently parallel. In this chapter, such algorithms will be presented and discussed.

4.1 QR Decomposition

The QR decomposition is the process of decomposing an $m \times n$ matrix $A$ according to

$$A = QR$$

where $Q$ is an orthogonal $m \times m$ matrix and $R$ is an upper triangular $m \times n$ matrix. An orthogonal matrix is a square matrix whose columns are orthogonal unit vectors. This means that

$$Q^T Q = Q Q^T = I \quad \text{or} \quad Q^T = Q^{-1}$$

where $Q^T$ is the matrix transpose of $Q$ and $I$ is the identity matrix. The relations also hold for the complex case, if $Q^T$ is changed into $Q^H$, where $Q^H$ is the Hermitian or conjugate transpose of $Q$. The QR decomposition has various uses, but one of the most common uses is finding the least-squares solution to an overdetermined equation system [6].

4.1.1 Methods for QR Decomposition

The three main methods for computing the QR decomposition are

• Gram-Schmidt orthogonalization


• Householder reflections

• Givens rotations

The Gram-Schmidt orthogonalization process creates the orthogonal Q-matrix by iteratively applying orthogonal projections. Each iteration produces an additional vector in the Q and R matrices. A Householder reflection is defined as

$$P = I - \frac{2}{v^T v} v v^T$$

where $v$ is called the Householder vector. If $P$ is multiplied by a vector $x$, $x$ will be reflected in the hyperplane that is orthogonal to $v$. QR decomposition using Householder reflections involves calculating Householder vectors and applying reflections to gradually transform a matrix into its Q and R matrices. Givens rotations find rotation matrices that, when multiplied with the input matrix, zero one element below the main diagonal. This will yield the R matrix, and Q is then obtained as the product of all transposed rotation matrices. All three methods are thoroughly discussed in [6].

If $A$ is an $n \times n$ matrix, all three methods have a time complexity of $O(n^3)$. The Gram-Schmidt method involves the manipulation of equal-length vectors, while the Householder method manipulates vectors with different lengths in each iteration and requires some more execution-time decisions. This indicates that the Gram-Schmidt method would run faster on Sleipnir. Decision-making in a computer program is typically a sequential part, operating on scalars, which would reduce datapath utilization. Generation of rotation matrices in the Givens rotation method typically requires either trigonometric function evaluation or several divisions, which are computing-intensive operations that we want to avoid. Except for simple operations, like additions and multiplications, the Gram-Schmidt method on the other hand requires only a single $1/\sqrt{x}$ operation per iteration.

The downside with the Gram-Schmidt method is that it is typically less numerically stable than the Householder method. This could be a problem, especially if the matrix to be decomposed is very large. Sleipnir's current LVM size of 5000 vectors however limits the largest matrix that could be decomposed to around 140×140, unless several LVM loads are performed. The precision needed is of course application dependent. What precision we can get from the Gram-Schmidt method will be investigated in chapter 5.
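For comparison with the Gram-Schmidt approach, the short Matlab sketch below builds one Householder reflection that zeroes all but the first element of a column vector, which is the basic step of Householder-based QR. The example vector is arbitrary and the snippet is only meant to illustrate the kind of per-column work the method requires.

% One Householder reflection step: reflect x onto a multiple of e1.
x = [4; 3; 0; 5];
v = x;
v(1) = v(1) + sign(x(1)) * norm(x);               % Householder vector
P = eye(length(x)) - (2 / (v' * v)) * (v * v');   % P = I - 2*v*v'/(v'*v)
P * x                                             % equals [-norm(x); 0; 0; 0] up to rounding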

4.1.2 Gram-Schmidt orthogonalization

If we let $(q|a)$ denote the dot product between vectors $q$ and $a$ and let $a_1$ to $a_n$ and $q_1$ to $q_n$ symbolize the column vectors of $A$ and $Q$ respectively, the QR decomposition can be written as ([16], [13]):

$$
A = QR \Leftrightarrow
\begin{pmatrix} a_1 & a_2 & \dots & a_n \end{pmatrix} =
\begin{pmatrix} q_1 & q_2 & \dots & q_n \end{pmatrix}
\begin{pmatrix}
(q_1|a_1) & \cdots & (q_1|a_n) \\
0 & \ddots & \vdots \\
0 & 0 & (q_m|a_n)
\end{pmatrix}
\quad (4.1)
$$

Since we want the column vectors of $Q$ to be of unit length, we can with our dot product notation write the orthogonal projection of a vector $a_i$ onto $q_j$ as $(q_j|a_i)q_j$. The entire QR decomposition process can then be written as

$$
\begin{aligned}
v_1 &= a_1, & q_1 &= \frac{v_1}{\|v_1\|} \\
v_2 &= a_2 - (q_1|a_2)q_1, & q_2 &= \frac{v_2}{\|v_2\|} \\
v_3 &= a_3 - (q_1|a_3)q_1 - (q_2|a_3)q_2, & q_3 &= \frac{v_3}{\|v_3\|} \\
&\;\;\vdots \\
v_n &= a_n - \sum_{k=1}^{n-1} (q_k|a_n)q_k, & q_n &= \frac{v_n}{\|v_n\|}
\end{aligned}
$$

Notice how the first column vector of $A$ becomes the first unit vector of $Q$ by simply dividing the $a_1$-vector by its own length. The rest of the $q$-vectors are produced by subtracting all projections of the corresponding $a$-vectors on the previously calculated $q$-vectors. By using this procedure, we ensure that all $q$-vectors are orthogonal to each other, and the norming transforms all vectors into unit vectors. By multiplying the $Q$ and $R$ matrices as they are shown in equation 4.1 and using that $\|v_i\| = (q_i|a_i)$, we get the same equations as those above, by just slightly rearranging the equations. Except for the norming operations, which require calculation of $1/\sqrt{x}$, we just use simple additions, subtractions and multiplications, which Sleipnir should be well suited for.

4.1.3 Rectangular matrices

As described in section 4.1, $Q$ should be an orthogonal matrix. In the case of $m > n$, the Gram-Schmidt method will however return an $m \times n$ Q-matrix and an $n \times n$ R-matrix. This is commonly known as the thin or skinny QR decomposition. This means that the relation $Q^T = Q^{-1}$ no longer holds, since $Q$ is not square and therefore not invertible. Whether this is a problem or not of course depends on the application. $Q$ will still be a matrix of orthonormal column vectors. If $m > n$, $Q$ as an $m \times m$ matrix is not uniquely defined. This can be seen if we write the decomposition as

$$
A = QR = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix} \begin{pmatrix} R_1 \\ 0 \end{pmatrix} = Q_1 R_1
$$

where $Q$ is an $m \times m$ matrix, $Q_1$ is $m \times n$, $Q_2$ is $m \times (m - n)$, $R$ is $m \times n$ and $R_1$ is $n \times n$. Gram-Schmidt's method would return $Q_1$ and $R_1$. If we instead want the entire Q-matrix, we see that $Q_2$ could be chosen as anything that makes $Q$ orthogonal, since $Q_2$ does not affect the actual decomposition.

The skinny QR decomposition introduces considerable savings in memory space. Consider for example the amount of space required for storing $Q$ as $m \times n$ instead of $m \times m$ if $m \gg n$. As we soon shall see, computation time will also decrease, since the number of computational iterations needed will be fewer.

4.1.4 Algorithm

The Gram-Schmidt process described in section 4.1.2 is commonly known as the classical Gram-Schmidt orthogonalization process. It can be shown that this process is not numerically stable in finite precision arithmetic. By just slightly rearranging the computations, we end up with a variation of the classical Gram-Schmidt method, commonly known as the modified Gram-Schmidt (MGS) orthogonalization process. This method is stable and will therefore be chosen. In [1], both methods are thoroughly explained and compared. The MGS process can in Matlab code be written as in listing 4.1 [11].

Listing 4.1. Modified Gram-Schmidt orthogonalization algorithm

 1  function [Q,R] = qrd(A);
 2
 3  [m, n] = size(A);
 4  Q = zeros(m, min(m, n));
 5  R = zeros(min(m, n), n);
 6
 7  for i = 1:min(m, n)
 8
 9      R(i,i) = sqrt(sum(abs(A(1:m,i).^2)));
10      Q(1:m,i) = A(1:m,i) / R(i,i);
11
12      for j = (i+1):n
13          R(i,j) = Q(1:m,i)' * A(1:m,j);
14          A(1:m,j) = A(1:m,j) - R(i,j) * Q(1:m,i);
15      end
16  end

The Matlab code in listing 4.1 has been written to support rectangular matrices of any size, both real and complex. The differences between the real and complex cases are very few. In the case of real data, the absolute operation on line 9 is redundant, since any real number squared will anyway turn out positive. On line 13 it is important to note that the dot product requires Q to be conjugated in the complex case. The ' operator in Matlab however performs a conjugate transpose, which is what we want.

As can be seen in listing 4.1, the MGS orthogonalization algorithm has data dependencies from one step to the next in every step. It is therefore not possible to parallelize the algorithm by starting computations from several places in the code. Each individual step however has inherent parallelism. The dot product on line 13, for example, can be split in vectors of eight elements and be performed using the TMAC instruction (in the real case). Matrix multiplications are also easy to parallelize.
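A quick way to sanity check listing 4.1 in Matlab is to decompose a random complex matrix and verify the two defining properties of the decomposition; both norms should be close to zero. The test matrix below is arbitrary.

% Usage example for the qrd function in listing 4.1.
A = randn(6, 4) + 1i * randn(6, 4);   % random complex 6x4 matrix
[Q, R] = qrd(A);
norm(Q * R - A)                       % A should be reproduced by Q*R
norm(Q' * Q - eye(4))                 % the columns of Q should be orthonormal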

4.2 LU Decomposition

The LU decomposition (LUD) is the decomposition of an m × n matrix A into a lower triangular matrix L and an upper triangular matrix U, according to

A = LU

If m ≥ n, L will be m × n and U will be n × n, otherwise L will be m × m and U will be m × n. The LU decomposition closely resembles the procedure of, and can be used for, solving systems of linear equations [3]. As we shall see later, the LU decomposition can also be used to quickly find the determinant and inverse of a square matrix. With the only requirement that both matrices should be triangular, we can generally find an infinite number of possible LU decompositions. If one however makes the additional requirement that one of the two matrices should be uni-triangular (only 1:s in the main diagonal), we get a unique decomposition. Deciding which matrix should be uni-triangular can be done arbitrarily. Using the Doolittle approach, which has been chosen for implementation, L will be uni-triangular. The final result will then take the following form:

    1 0 ··· 0 u11 u12 ··· u1n  .. .  .. .   l21 1 . .  0 u22 . .  A = LU =      . .. ..   ......   . . . 0  . . . .  lm1 ··· lm−1n 1 0 ··· 0 unn

4.2.1 Calculation

Calculation of the LU decomposition will here be exemplified using an A-matrix of size 3 × 3. The example reveals the pattern by which we can calculate the LU decomposition of a matrix of any size. We start by writing the matrices as

A = LU \;\Longleftrightarrow\; \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix}

By multiplying L with U, we get the resulting equation system:

\begin{cases} a_{11} = u_{11} \\ a_{12} = u_{12} \\ a_{13} = u_{13} \\ a_{21} = l_{21}u_{11} \\ a_{22} = l_{21}u_{12} + u_{22} \\ a_{23} = l_{21}u_{13} + u_{23} \\ a_{31} = l_{31}u_{11} \\ a_{32} = l_{31}u_{12} + l_{32}u_{22} \\ a_{33} = l_{31}u_{13} + l_{32}u_{23} + u_{33} \end{cases}
\;\Longleftrightarrow\;
\begin{cases} u_{11} = a_{11} \\ u_{12} = a_{12} \\ u_{13} = a_{13} \\ l_{21} = a_{21}/u_{11} \\ u_{22} = a_{22} - l_{21}u_{12} \\ u_{23} = a_{23} - l_{21}u_{13} \\ l_{31} = a_{31}/u_{11} \\ l_{32} = (a_{32} - l_{31}u_{12})/u_{22} \\ u_{33} = a_{33} - (l_{31}u_{13} + l_{32}u_{23}) \end{cases}

By looking at the equation system above, it's not hard to realize that the elements generally follow the following two equations:

l_{ij} = \frac{a_{ij} - \sum_{k=1}^{j-1} l_{ik}u_{kj}}{u_{jj}} \qquad \text{and} \qquad u_{ij} = a_{ij} - \sum_{k=1}^{i-1} l_{ik}u_{kj}

It's worth noting that the equations are data dependent. In our 3 × 3 example above, we can for example see that calculation of u_{22} and u_{23} requires us to have previously calculated u_{12}, u_{13} and l_{21} (which in turn depends on u_{11}). An algorithm that calculates the LU decomposition must ensure that all data dependencies are resolved. Implementing LU decomposition as described above will result in a numerically unstable implementation. To ensure stability, we interchange rows in the A-matrix [6]. This operation is commonly known as row pivoting. The resulting LU decomposition will then be of the form PA = LU, where P is a permutation matrix which corresponds to the row interchanges in A. The row pivoting process will be described in more detail in section 4.2.3.

4.2.2 Memory storage

Before presenting the algorithm for LU decomposition, we make a quick note on memory storage. Since both the L and U matrices contain many zeroes, we can store them as

\begin{pmatrix} 1 & 0 & \cdots & 0 \\ l_{21} & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ l_{m1} & \cdots & l_{m,n-1} & 1 \end{pmatrix}, \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ 0 & u_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & u_{nn} \end{pmatrix} \;\Longrightarrow\; \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ l_{21} & u_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ l_{m1} & \cdots & l_{m,n-1} & u_{nn} \end{pmatrix}

This reduces the memory space required by 50%. The matrix P can also be reduced. Instead of storing it as an m × m matrix, we can store it as a permutation vector of length m, which lists the permutation order of the rows of A. The combined L and U matrices will from now on be referred to as the LU matrix. A small example of how this packed storage is used is given below.
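As a plain Matlab illustration of the packed storage and the permutation vector (using lud from listing 4.2, which in turn uses pivot from listing 4.3), the factors can be unpacked and the factorization verified as follows; the test matrix is just an example.

% Unpack the combined LU matrix and check that L*U reproduces the
% pivoted input, i.e. A(p,:) = L*U.
A = rand(8);
[LUp,p] = lud(A);
L = tril(LUp,-1) + eye(8);          % the unit diagonal of L is implicit
U = triu(LUp);
err = max(max(abs(A(p,:) - L*U)));  % should be on the order of eps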

4.2.3 Algorithm

LU decomposition with row pivoting can be performed as in listing 4.2. The computations are organized so that A is gradually overwritten by L and U, which are stored as described in the previous section. Every iteration finalizes another row in the LU matrix [3].

Listing 4.2. LU decomposition algorithm

 1  function [A,p] = lud(A)

 3  [m,n] = size(A);
 4  p = 1:m;

 6  for k = 1:min(m-1,n)
 7      [A,p] = pivot(A,p,k);

 9      i = (k+1):m;
10      j = (k+1):n;
11      A(i,k) = A(i,k)/A(k,k);
12      A(i,j) = A(i,j) - A(i,k)*A(k,j);
13  end

As in the QR decomposition case, we have data dependencies between the different lines of code in the Matlab algorithm. The individual operations can however be parallelized. Row number 7 performs the row pivoting. In the kth iteration, row pivoting involves searching for the maximum absolute value of the elements a_{kk} to a_{mk}. The row that has the element with the maximum absolute value will switch place with the kth row. The procedure is listed in listing 4.3.

Listing 4.3. Row pivoting procedure

 1  function [A,p] = pivot(A,p,k)
 2  [m,n] = size(A);
 3  [y,idx] = max(abs(A(k:m,k)));
 4  idx = idx + k - 1;

 6  row = A(k,:);
 7  A(k,:) = A(idx,:);
 8  A(idx,:) = row;

10  t = p(k);
11  p(k) = p(idx);
12  p(idx) = t;

4.3 Matrix determinant

The determinant of a matrix, here written as det(A), is a common operation in many applications. The determinant of a square n × n matrix can be defined according to [2]:

1. If A is a 1 × 1 matrix with element a, then det(A) = a.

2. If A is an n × n matrix, the determinant of the (n − 1) × (n − 1) sub-matrix, with the ith row and jth column removed, is called the minor M_{ij}. The determinant can then be calculated as

det(A) = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} M_{ij} \qquad \text{for any } i = 1, \ldots, n

Direct calculation of the determinant using this definition has a time complexity of O(n!), due to the recursive nature of the definition. There are also other definitions, but they share the same time complexity problem. With an O(n!) time complexity, the computation time grows very rapidly with the size of the matrix. Remembering that det(AB) = det(A) · det(B) however allows us to use the LU decomposition to obtain the determinant in O(n^3) time. This is a result of the fact that the determinant of a triangular matrix is the product of all elements in the main diagonal. Also, since L is uni-triangular, all diagonal elements of L are 1 and therefore det(L) = 1. In short,

det(A) = det(L) \cdot det(U) = det(U) = \prod_{i=1}^{n} u_{ii}

The result above holds if we do not perform row pivoting, which we in the case of LU decomposition typically do. For each row interchange in matrix A, the value of the determinant however only changes its sign. Therefore we can write the calculation with row pivoting as

det(A) = (-1)^s \cdot det(U) = (-1)^s \cdot \prod_{i=1}^{n} u_{ii}

where s is the number of row interchanges in the LU decomposition.

4.3.1 Algorithm

Since we use the LU decomposition for finding the determinant, the algorithm in section 4.2.3 still holds. We simply modify that algorithm slightly, so that it keeps track of the number of row interchanges performed. Normally rows will be interchanged, but it can happen that the current row is the row with the largest absolute a_{ik} value (i = k, ..., m), in which case the current row should be kept. After the LU decomposition has been performed, we compute the product of all main diagonal elements. Finally we sign compensate for the row pivoting and we have then found the determinant value. The algorithm is presented in listing 4.4.

Listing 4.4. Determinant algorithm

 1  function d = det(A)

 3  d = 1;
 4  [m,n] = size(A);
 5  [A,s] = luds(A);

 7  for i = 1:n
 8      d = d * A(i,i);
 9  end

11  d = d * (-1)^s;
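The luds function called on line 5 is the LU decomposition from listing 4.2 modified to also return the number of actual row interchanges; it is not listed in the text, so the sketch below is my own reconstruction under that assumption.

function [A,s] = luds(A)
% LU decomposition with row pivoting that also counts the number s of
% actual row interchanges (rows are only swapped when the pivot row
% differs from the current row).
[m,n] = size(A);
s = 0;
for k = 1:min(m-1,n)
    [y,idx] = max(abs(A(k:m,k)));
    idx = idx + k - 1;
    if idx ~= k
        A([k idx],:) = A([idx k],:);   % interchange the rows
        s = s + 1;                     % and count it
    end
    i = (k+1):m;
    j = (k+1):n;
    A(i,k) = A(i,k)/A(k,k);
    A(i,j) = A(i,j) - A(i,k)*A(k,j);
end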

4.4 Matrix inversion

Matrix inversion is the last operation that will be investigated in this thesis. In general, calculation of the inverse has very poor numerical properties and the results can be very far from ideal. Calculating the inverse explicitly should therefore be avoided as far as possible. Fortunately, it usually can be. A common example is solving a system of linear equations, which in textbooks often is written as

Ax = b \;\Longleftrightarrow\; x = A^{-1}b

To calculate the solution x by inverting A and multiplying it with b would not only be less accurate, but also slower than, for example, solving by using LU decomposition. In fact, every time A^{-1}b shows up in a formula, a better approach is to solve Ax = b, as the small example below illustrates. In some situations however, inverse calculation can't be avoided. For example, it could be that the elements of the inverse have some special physical interpretation that is of interest. One should then keep in mind that the result might be quite inaccurate, especially if the matrix is large. Also, the result will become much worse if the matrix is close to singular.
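A small Matlab comparison of the two approaches (illustration only, with an arbitrary random system):

% Solving versus explicitly inverting.
A = rand(64);  b = rand(64,1);
x1 = A \ b;              % LU-based solve
x2 = inv(A) * b;         % explicit inverse, generally slower and less accurate
res1 = norm(A*x1 - b);
res2 = norm(A*x2 - b);   % typically larger than res1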

4.4.1 Triangular matrices

We start out by considering the special case when the matrix to be inverted is triangular. The reason for this is that triangular matrices are more easily invertible. A property of the inverse of a triangular matrix is that it is also triangular, and the inverse of a uni-triangular matrix is uni-triangular [6]. We derive the method by considering an upper triangular matrix. We want to find the solution B = A^{-1} to the following system:

BA = I \;\Longleftrightarrow\; \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ 0 & b_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & b_{nn} \end{pmatrix} \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ 0 & a_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & a_{nn} \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{pmatrix}

By multiplying B with A we get the following equation system:

\begin{cases}
b_{11}a_{11} = 1 \\
b_{22}a_{22} = 1 \\
b_{11}a_{12} + b_{12}a_{22} = 0 \\
b_{33}a_{33} = 1 \\
b_{22}a_{23} + b_{23}a_{33} = 0 \\
b_{11}a_{13} + b_{12}a_{23} + b_{13}a_{33} = 0 \\
\;\;\vdots \\
b_{nn}a_{nn} = 1 \\
\sum_{i=j}^{n} b_{ji}a_{in} = 0 \quad \text{for } j = 1, \ldots, n-1
\end{cases}

By solving for the unknown b elements, we end up with the following expressions (for k = 1, ..., n):

b_{kk} = \frac{1}{a_{kk}}, \qquad b_{jk} = -\frac{\sum_{i=j}^{k-1} b_{ji}a_{ik}}{a_{kk}} \quad \text{for } j = 1, \ldots, k-1

Except for the required division, the operations are simple. Matlab code for the algorithm is shown in listing 4.5.

Listing 4.5. Inverse of upper triangular matrix

 1  function Uinv = invu(U)

 3  [m,n] = size(U);
 4  Uinv = zeros(m,n);

 6  for k = 1:n
 7      i = 1:(k-1);
 8      Uinv(i,k) = -Uinv(i,i)*U(i,k)/U(k,k);
 9      Uinv(k,k) = 1/U(k,k);
10  end

The algorithm for finding the inverse of a lower triangular matrix can be derived in a similar manner. We can note from listing 4.5 that if the matrix were uni-triangular, all Uinv(i,i) elements would be 1 and we then wouldn't need to perform any divisions. Listing 4.6 lists the algorithm for a lower uni-triangular matrix. Since divisions typically need many clock cycles, we can save a lot of cycles by removing them.

Listing 4.6. Inverse of lower uni-triangular matrix

 1  function Linv = invl(L)

 3  [m,n] = size(L);
 4  Linv = zeros(m,n);

 6  for k = 1:n
 7      i = 1:(k-1);
 8      Linv(k,i) = -L(k,i)*Linv(i,i);
 9      Linv(k,k) = 1;
10  end

4.4.2 General matrices

The inverse of a general matrix could be calculated with traditional Gauss-Jordan elimination. Another, more manageable, method is to break up the inversion into smaller steps. This can be done using either the QR or the LU decomposition [14], [3]. Starting with PA = LU, we can invert A by using

(PA)^{-1} = (LU)^{-1} \;\Longleftrightarrow\; A^{-1}P^{-1} = U^{-1}L^{-1} \;\Longleftrightarrow\; A^{-1} = U^{-1}L^{-1}P.

With A = QR we can use that

A^{-1} = R^{-1}Q^{-1} = R^{-1}Q^H, \qquad \text{since } Q^{-1} = Q^H.

This reduces the problem to finding the inverse of triangular matrices and performing matrix multiplication. Both the QR and the LU way of performing inversion will be further investigated, both in terms of execution time and numerical accuracy. A Matlab sketch of the LU-based variant is given below.
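Combining lud (listing 4.2), invl (listing 4.6) and invu (listing 4.5), the LU-based inversion can be sketched as follows. The function name invlu and the way the permutation vector p is applied are my own; the listings only provide the building blocks.

function Ainv = invlu(A)
% General (square) matrix inversion via PA = LU:
% A^-1 = U^-1 * L^-1 * P, where P is represented by the vector p.
n = size(A,1);
[LU,p] = lud(A);               % packed LU factors and permutation vector
L = tril(LU,-1) + eye(n);      % unpack uni-triangular L
U = triu(LU);                  % unpack U
M = invu(U) * invl(L);         % U^-1 * L^-1
Ainv = zeros(n);
Ainv(:,p) = M;                 % right-multiplication by P as a column permutation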

Chapter 5

Algorithm evaluation

This chapter will be devoted to evaluating the algorithms that were presented in chapter 4. The algorithms will be evaluated using Matlab, where we simulate the conditions and precision that we would get in Sleipnir. We will derive theoretical lower boundaries on computation time, so that we later can compare this with the actual computation times achieved. A significant part of this chapter will be devoted to investigating what precision can be achieved from the different algorithms and what limitations the algorithms have.

5.1 Fixed-point calculations

In Sleipnir, all calculations will be performed using fixed-point arithmetic. Fixed-point arithmetic has several advantages over floating-point, for example lower hardware cost and lower power consumption. Fixed-point however has several disadvantages. In floating-point arithmetic, the data is automatically scaled so that a maximum number of significant bits is always stored. In fixed-point, scaling has to be done manually and choosing good scalings is an involved task. It requires us to thoroughly analyze the code and determine what magnitudes can be expected on the numbers in different situations. We then choose a scaling which trades off supported magnitudes against precision [10]. An important initial step is to decide what format you expect your input data to have. In this work, the data elements are expected to be signed 16-bit numbers, with one sign bit and 15 fractional bits. This gives us a number range from −1 to 1 − 2^{-15}. The resolution is 2^{-15} ≈ 0.00003. We adopt the notation from [9] to describe what number format we have, which is QD(m.f). D is the base, m is the combined total of the sign and integer bits and f is the number of fractional bits. Our input data is therefore in Q2(1.15) format. In order to be able to create kernels that utilize the available bits well, it is important to know the characteristics of the operations performed. Additions and subtractions can be performed without considering the number format. If the representable range is exceeded, this is typically handled with saturation, which means that we set the result to the closest representable number in the number


range. Multiplication of two Q2(m.f) numbers results in a Q2(2m.2f) number, which means that the number of bits has doubled [9]. If this result is an output from an algorithm, bits have to be removed, since we want our results to also be 16-bit numbers. If we multiply two Q2(1.15) numbers, x and y, we can note that since |x|, |y| ≤ 1 (due to the number range), |x · y| ≤ 1. The result can therefore also be represented in Q2(1.15) format, without overflow. The 15 least significant bits will be truncated, which will introduce a truncation error (or rounding error if rounding is used). Sleipnir will not support division in the sense that it performs y/x directly. Instead there will be support for computing 1/x, and that result can then be multiplied by y to produce y/x. A consequence of using the Q2(1.15) format is that the 1/x operation will overflow. Since |x| ≤ 1, |1/x| will be ≥ 1, and saturation is of course not a good solution. The number range instead has to be extended somehow. One way is to scale the result to another number format, for example Q2(8.8). It would then be possible to represent numbers from −128 to 128 − 2^{-8}, but we would lose precision. Another possibility would be to temporarily use a 32-bit format, for example Q2(17.15). We then keep the same resolution, but allow a much larger number range. This approach is very useful when we know that the data will temporarily be out of range, but following operations will put it back in range again. To evaluate the algorithms, Matlab will be used. Matlab uses a 64-bit floating-point format, which makes it possible to perform the decompositions with very high precision. Since the 64-bit floating-point format is a much more precise format than 16-bit fixed-point, we consider the 64-bit floating-point result to be the "exact" correct result. We can also simulate the algorithms as if they were performed in 16-bit fixed-point, by truncating or rounding to 16-bit precision after each step in the Matlab code. This allows us to compare the "exact" result with a 16-bit fixed-point calculated result. This is the main method used in this work for evaluating the possible precision of the algorithms. A minimal quantization helper of this kind is sketched below.
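A minimal sketch of such a quantization helper, assuming round-to-nearest and saturation (the exact rounding mode is an assumption here):

function y = q15(x)
% Quantize to Q2(1.15): 15 fractional bits, saturate to [-1, 1 - 2^-15].
y = round(x * 2^15) / 2^15;
y = min(max(y, -1), 1 - 2^-15);

% Usage: insert after each operation in the Matlab algorithms, e.g.
% R(i,j) = q15(Q(1:m,i)' * A(1:m,j));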

5.2 Input data scaling

When performing computations in fixed-point, it's important to use well-scaled input data. This is important in order to avoid overflow/underflow and to be able to keep as many significant bits as possible. The situation will here be explained by using the QR decomposition algorithm as an example. As can be seen in listing 4.1, the first step of the QR decomposition is to find the length of the first column vector (a_1) in matrix A. This is done as

|a_1| = \sqrt{a_{11}^2 + a_{21}^2 + \ldots + a_{m1}^2} = \sqrt{\sum_{i=1}^{m} a_{i1}^2}.

Since |a_1| will be r_{11} in the R matrix, it is important that \sum_{i=1}^{m} a_{i1}^2 doesn't overflow the number range chosen for the elements in the R matrix (which has been chosen to Q2(1.15)). If it happens anyway, saturation will be used.

Squaring the a_{i1} elements results in numbers in Q2(2.30) format. All elements can be accumulated in any of the 40-bit accumulator registers that Sleipnir has. When the accumulation is done and the data should be read and rounded to Q2(1.15) format, it can happen that the result is zero, because the vector elements were too small. Even if the result doesn't underflow, a very small result is still bad, since the relative rounding error can be much larger for numbers with a small magnitude. The small example below illustrates the effect. The implications of input data scaling will be further investigated later.
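A small numerical illustration of that effect (plain Matlab, example values only):

% Squares in Q2(2.30) accumulate exactly in the 40-bit accumulator,
% but rounding the sum back to Q2(1.15) can lose most of it.
a = 2e-3 * ones(8,1);              % small column vector
s = sum(a.^2);                     % 3.2e-5, no problem for the accumulator
s_q15 = round(s * 2^15) / 2^15;    % rounds to 2^-15, roughly a 5% relative error
% with a = 1e-4 * ones(8,1) the rounded sum would already be zero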

5.3 Precision of 1/x and 1/√x

The implementation of functions like 1/x and 1/√x is currently under investigation. One possible solution is to use Taylor approximations, and the precision of the computed result will then depend, for example, on how many terms are used in the Taylor expansion. The investigations that have been done so far indicate that we can get very good precision using Taylor approximations, possibly in combination with Goldsmith iterations and software floating-point. What the final precision will be, and what the cycle time of such instructions will be, however remains to be seen. In order to be able to do precision analysis, I have therefore made the assumption that these operations do not introduce any error, except for the rounding needed to get a 16 or 32 bit result.

5.4 Matrix condition number

The 2-norm matrix condition number can be calculated as

\kappa(A) = \frac{\sigma_{max}(A)}{\sigma_{min}(A)}

where σ_max(A) and σ_min(A) are the maximum and minimum singular values of A, which can be found by using singular value decomposition (SVD) [16]. The details of how to calculate the condition number are not important here. Instead we are interested in what we can use it for. The condition number is a measure of the sensitivity to errors in data when solving a system of linear equations. The value of the condition number can therefore be used as an indication of how accurate the result will be if, for example, matrix inversion is performed on the matrix. The condition number will be a value in the range 1 < κ(A) < ∞. If κ(A) is "close" to 1, the matrix is said to be well-conditioned, while a large condition number means an ill-conditioned matrix. Well- and ill-conditioned are of course relative terms. If we were to change the data representation from, for example, 16 to 32 bits, we would expect to be able to invert far more ill-conditioned matrices with satisfactory results. The matrix condition number will be important when analysing the precision of the decompositions. In Matlab it can be computed directly, as shown below.
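For reference, this is a one-liner in Matlab (the built-in cond(A) computes the same quantity):

A = rand(8);                 % any test matrix
s = svd(A);                  % singular values
kappa = max(s) / min(s);     % 2-norm condition number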

5.5 Test data generation

Since no real-world test data has been available during the work on this thesis, test matrices have been generated using Matlab's built-in functions. The matrix elements are randomly generated with a uniform distribution on the interval [−c, c]. In the case of complex test matrices, the real and imaginary parts have been generated separately, as sketched below. The value c relates to the input data scaling previously discussed in section 5.2. c is chosen depending on the algorithm and matrix size, so that overflows are avoided in most cases and we get good utilization of the available number range.
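A sketch of that generation (the sizes and the value of c are example values):

m = 8;  n = 8;  c = 0.25;
A_real = c * (2*rand(m,n) - 1);                              % uniform on [-c, c]
A_cplx = c * ((2*rand(m,n) - 1) + 1i*(2*rand(m,n) - 1));     % real and imaginary parts drawn separately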

5.6 QR decomposition

We have now discussed most of what we need to know about Sleipnir and fixed-point calculations in order to be able to evaluate the algorithms. For QR decomposition we will derive lower bounds on computation time, discuss the chosen data formats and scaling issues and present an error analysis.

5.6.1 Performance

The best clock cycle count achievable for an algorithm is lower bounded by the time it takes to perform all arithmetic operations. The number of arithmetic operations is therefore an interesting number to compare with the actual computation time, to see how much of the time useful operations are performed. The extra cycles spent over this lower bound are overhead that we want to remove as much as possible. Possible sources of overhead are updating loop counters, performing jumps, setting address registers and much more. The number of arithmetic operations in the QR decomposition algorithm is shown for a few matrix sizes in table 5.1. Note that the number of operations for the complex case counts real operations, which means that a complex multiplication has been broken down into four real multiplications and two additions, for example. If we consider a case where we have a processor that can perform a single arithmetic operation per cycle, the real 64 × 64 QRD could take no less than 260064 + 266240 + 64 = 526368 clock cycles. Since Sleipnir uses vector operations, we can significantly improve on that. By looking at the operation counts, we see that the most used operation is multiplication. We could therefore assume that multiplication would be the bottleneck of the datapath and use that as a rough lower estimate. For 16-bit data we can perform 8 multiplications per cycle for real data and 16 multiplications per cycle for complex data. If we only concern ourselves with the multipliers, 64 × 64 QRD could therefore take no less than 266240/8 = 33280 clock cycles in the real case and 1048576/16 = 65536 clock cycles in the complex case. The method of looking at the multipliers as the narrowest resource could possibly be too simplified. Fortunately, we get a lot of additions and multiplications done simultaneously if we use more advanced instructions, like the TMAC. Some operations, however, are more complicated. One such operation is the operation

Table 5.1. Number of arithmetic operations for QRD algorithm

(a) Real data
m × n     +/-       *         1/√x
2x2       5         12        2
4x4       54        80        4
8x8       476       576       8
16x16     3960      4352      16
24x24     13524     14400     24
32x32     32240     33792     32
48x48     109416    112896    48
64x64     260064    266240    64

(b) Complex data
m × n     +/-       *         1/√x
2x2       21        32        2
4x4       214       256       4
8x8       1884      2048      8
16x16     15736     16384     16
24x24     53844     55296     24
32x32     128496    131072    32
48x48     436584    442368    48
64x64     1038304   1048576   64

A(1:m,j) = A(1:m,j) − R(i,j) ∗ Q(1:m,i) in the QRD algorithm. We have three source operands and one destination operand. No single instruction in Sleipnir can support three source operands (except for the case where the accumulator register is an implicit source operand). This operation therefore has to be broken down into two operations, first the multiplication and then the subtraction. While performing the subtraction, we cannot get any multiplications done, thus reducing the utilization of the multipliers. An exact calculation of the lower bound on the computation time has been performed with Matlab, by considering which operations could possibly be performed in parallel and which cannot. The results are shown in table 5.2. These numbers will be the reference for comparison with the cycle times actually achieved. With this theoretical number we would get a multiplier utilization of 33280/49536 ≈ 67% in the 64 × 64 real case.

Table 5.2. Lower bounds on cycle times for QRD

(a) Real data
m × n     Min. cycle time
2x2       11
4x4       34
8x8       116
16x16     816
24x24     2676
32x32     6272
48x48     20976
64x64     49536

(b) Complex data
m × n     Min. cycle time
2x2       11
4x4       34
8x8       216
16x16     1600
24x24     5304
32x32     12480
48x48     41856
64x64     98944

5.6.2 Data formats

The input data matrix A as well as the output matrices Q and R have all been chosen to use the Q2(1.15) format. Since Q consists of a set of normed base vectors, those elements will always be in the range [−1, 1] and the Q2(1.15) format is therefore excellently suited for this matrix. If we study the algorithm carefully, we realize that if we were to decompose cA instead of A (where c is a scalar constant), the results would be Q and cR. With proper scaling of A, we can therefore keep R within the bounds of the Q2(1.15) format. Intermediate results can in most cases be kept in Q2(1.15) format. One exception is the 1/√x computation, which will overflow in Q2(1.15) format. Fortunately we are saved by the fact that this value is afterwards multiplied by values that are guaranteed to bring the result back into range. Therefore we can use an extended intermediary 32-bit format for this computation. This can be done without any performance loss in both the real and complex case.

5.6.3 Error analysis

A Sleipnir-adapted version of the QRD algorithm has been implemented in Matlab and will here be compared with double precision floating-point results. The error will be computed as the element-wise difference between the resulting Q and R matrices according to

Q_{error} = |Q - Q_{ref}|

R_{error} = |R - R_{ref}|

This computation results in two error matrices. An interesting value is the mean error, which can be calculated as the mean value of all elements in the Q_{error} and R_{error} matrices. By generating many test matrices and calculating the errors, we can get an idea of how the algorithm behaves. The results of such a test run are shown in figure 5.1, where the error is plotted against the condition numbers of the test matrices A. The test matrices are here real-valued with elements randomly generated in the interval [−0.25, 0.25]. The error is plotted on a logarithmic scale as 2^{-x}. We can see that most test matrices give a Q with an average error of around 2^{-11} to 2^{-14} and an R with errors from around 2^{-15} to 2^{-16}. However, if the condition number of the A matrix is large, we can get significantly more inaccurate results. Also the maximum error could be of interest. With the same kind of test matrices we get a result according to figure 5.2. Sometimes it could be of interest to estimate how accurate the computed result might be. Under real-time operating conditions, you of course don't have the luxury of comparing your result with Matlab. We have up until now resorted to the condition number, but calculation of the condition number is a complex and lengthy task and it is therefore not suited for this. Calculation of the condition number would probably take more time than the actual QR decomposition itself. Investigations have shown that the minimum diagonal value in the computed R matrix

r_{min} = min(r_{11}, r_{22}, \ldots, r_{pp}) \quad \text{with } p = min(m,n) \qquad (5.1)

[Figure 5.1: two scatter plots, "Mean error of Q in 8x8 QRD" and "Mean error of R in 8x8 QRD"; matrix condition number plotted against average error (2^-x).]

Figure 5.1. Average error for QRD of 8 × 8 matrices. Data range [−0.25, 0.25].

[Figure 5.2: two scatter plots, "Maximum error of Q in 8x8 QRD" and "Maximum error of R in 8x8 QRD"; matrix condition number plotted against the error (2^-x).]

Figure 5.2. Maximum error for QRD of 8 × 8 matrices. Data range [−0.25, 0.25]

gives us a rough estimate. This is a result of the fact that these values are the ones which are a direct result of the 1/√x operations, and small values of x introduce the possibility of large relative round-off errors. Small values of x are typically the result of bad scaling or more linearly dependent columns in A. Since the condition number partly is a measure of linear dependence, we realize that the condition number and r_min are at least to some extent related. Finding the value of r_min is of course a much easier task. With the same conditions as in figure 5.1, we generate a new plot where the error is instead plotted against r_min. The result is shown in figure 5.3. In figure 5.4 we do the same thing for a 64 × 64 matrix. The random data is here instead generated in the range [−0.125, 0.125]. We get a decrease in precision compared to the 8 × 8 case, as expected, but the errors are probably still manageable for many situations. Finally we investigate the impact of how the input data is scaled. We start by

[Figure 5.3: two scatter plots, "Mean error of Q in 8x8 QRD" and "Mean error of R in 8x8 QRD"; r_min plotted against average error (2^-x).]

Figure 5.3. Average error for QRD of 8 × 8 matrices. Data range [−0.25, 0.25]

[Figure 5.4: two scatter plots, "Mean error of Q in 64x64 QRD" and "Mean error of R in 64x64 QRD"; r_min plotted against average error (2^-x).]

Figure 5.4. Average error for QRD of 64 × 64 matrices. Data range [−0.125, 0.125]

generating a matrix A with data elements in the range [−1, 1]. We then decompose cA for different values of c in the range 0 < c ≤ 1. This compresses the values of A to the range [−c, c]. The average errors for Q and R are plotted in figure 5.5. We can see that for larger data ranges, we have a rapid decline in precision for both matrices. This is due to overflows in the R matrix, which result in saturation. If the overflow happens early in the decomposition process (which it often does), the introduced average error will of course be larger, since the saturated value will be used in following iterations. Too small data ranges instead have problems with rounding and loss of significant bits, which introduces larger relative rounding errors, but this is a more manageable problem. We can conclude that a good input data scaling strategy should avoid overflows, since they introduce very large errors.

[Figure 5.5: two plots, "Impact of scaling in Q-matrix (8x8)" and "Impact of scaling in R-matrix (8x8)"; average error (2^-y) plotted against data range [−x, x].]

Figure 5.5. Impact of scaling the input matrix A on the precision of the resulting matrices. The input matrix is an 8 × 8 real matrix.

5.7 LU decomposition

The evaluation of the LU decomposition will here be presented in much the same way that the QR decomposition just was.

5.7.1 Performance

We are once again interested in the lower bound on computation time. The number of operations for real and complex LU decomposition is shown in table 5.3. Since multiplication again turns out to be the limiting operation, we can calculate the minimum possible cycle time, considering only the multipliers, as 87360/8 = 10920 clock cycles for the real 64 × 64 case and 349566/16 ≈ 21848 clock cycles for the complex 64 × 64 case.

Table 5.3. Number of operations for LUD algorithm

(a) Real data
m × n     +/-       *         1/x
2x2       1         2         1
4x4       14        20        3
8x8       140       168       7
16x16     1240      1360      15
24x24     4324      4600      23
32x32     10416     10912     31
48x48     35720     36848     47
64x64     85344     87360     63

(b) Complex data
m × n     +/-       *         1/x
2x2       6         10        1
4x4       68        86        3
8x8       616       686       7
16x16     5200      5470      15
24x24     17848     18446     23
32x32     42656     43710     31
48x48     145136    147486    47
64x64     345408    349566    63

By doing an exact calculation of the minimum possible cycle time, we end up with the result shown in table 5.4. Here we have assumed that the 1/x operation can be performed in a single cycle like the rest of the operations, which is quite unrealistic. However, since we don't know anything about the actual cycle time of this operation, we leave it as it is for now.

Table 5.4. Lower bounds on cycle times for LUD

(a) Real data
m × n     Min. cycle time
2x2       4
4x4       18
8x8       70
16x16     445
24x24     1380
32x32     3131
48x48     10105
64x64     23415

(b) Complex data
m × n     Min. cycle time
2x2       7
4x4       27
8x8       130
16x16     808
24x24     2542
32x32     5844
48x48     19200
64x64     44972

The multiplier utilization for the real 64 × 64 case turns out to be no more than 10920/23415 ≈ 47%. This is much lower than in the QRD case, where we had a possible 67% utilization. This is a result of the fact that the LU decomposition algorithm operates on vectors whose lengths decrease by one for every iteration. An example could be that we want to perform a dot product between two real vectors of length five. We still have to use the TMAC instruction, which operates on vectors of eight elements. This means that three multiplications and three additions are performed without any impact on the result. This decreases the useful utilization of the multipliers.

5.7.2 Data formats

The input data matrix A and the combined LU result matrix have both been chosen to use the Q2(1.15) format. The L matrix is excellently suited to be stored in this format. The main diagonal is all 1's (which we anyway do not store) and the elements below the diagonal are the sizes of the corresponding elements of the A matrix relative to the elements on the main diagonal of the A matrix. Since we perform row pivoting, the value with the largest absolute value will end up on the main diagonal of A. Therefore the values below the main diagonal will end up having a relative size ≤ 1. We conclude that this would not at all be true without row pivoting, and we would then face the problem of not knowing how large the relative sizes can be and what data format to choose for L. In much the same way as with QRD, the decomposition of cA (with a constant c) will produce L and cU. It is therefore possible to keep the matrix U in range by correctly scaling A. Intermediate results are kept in Q2(1.15) format most of the time. An exception to that is the result of the 1/x operation. As in the QRD case, the following multiplications will bring the data back into Q2(1.15) range, so an extended format can be used as an intermediary.

5.7.3 Error analysis

The obtainable precision of the combined LU matrix will here be investigated. We define the element-wise error matrix as

LU_{error} = |LU - LU_{ref}|

where LU is the 16-bit computed result and LU_{ref} is the 64-bit floating-point result. Again we are interested in the average error. The results are shown in figure 5.6. The figure depicts decomposition of many real-valued test matrices of sizes 8 × 8 and 64 × 64. Matrix elements have been randomly generated in the ranges [−0.25, 0.25] and [−0.125, 0.125] respectively. The precision is typically better than for QR decomposition. As can be seen from the figure, the average error seems less dependent on the condition number.

[Figure 5.6: two scatter plots, "Mean error of LU matrix in 8x8 LUD" and "Mean error of LU matrix in 64x64 LUD"; matrix condition number plotted against average error (2^-x).]

Figure 5.6. Average precision for LUD of 8 × 8 and 64 × 64 real matrices

5.8 Matrix determinant

The calculation of the matrix determinant is performed using LU decomposition, followed by a calculation of the product of the diagonal elements of LU and sign compensation. Since the final multiplications and the sign compensation constitute a very small task, they are negligible compared to the LU decomposition. The performance numbers from the LUD section (5.7) therefore also apply to matrix determinant calculation.

5.8.1 Dynamic range handling

The main difficulty when evaluating the matrix determinant is the data representation of the actual determinant value. What we want to do is to perform

det(A) = (-1)^s \prod_{i=1}^{n} u_{ii}.

Since |u_{ii}| ≤ 1, we can quickly underflow the representable data range, leading to a zero result. Except for zero, the range of |u_{ii}| can be anywhere from approximately 0.00003 to 1. If we calculate the determinant of an 8 × 8 matrix with |u_{ii}| ≠ 0, we can end up with 7.52 · 10^{-37} ≤ |det(A)| ≤ 1. For larger matrices the situation would get even worse. What we really would want is to use some kind of floating-point representation instead, but we have no direct hardware support for this. For many applications, however, the determinant value could represent some physical quantity whose possible magnitudes we have additional knowledge about. This allows us to use a scaling scheme specific to that application. For now, we end up with a few options:

• Use a 16 bit result and accept that values smaller than 0.00003/2 will be rounded to zero.

• Use a more precise 32-bit result to get a much larger dynamic range. The smallest representable number will then instead be approximately 4.66 · 10^{-10}.

• Use a scaling scheme that is fixed beforehand and chosen according to the expected determinant values for your specific application. We can then represent values as small as we would like, but can instead not represent values all the way up to 1.

• Check the size of the determinant value at regular intervals during the product accumulation and scale it if needed. The amount of scaling then has to be stored somewhere.

For the implementation, the second method has been chosen. There is no loss in performance associated with evaluating the result as 32 bits compared to 16 bits. The third method is application dependent and has therefore been omitted. The fourth method would certainly be good from a dynamic-range point of view. However, the more often we perform scalings, the closer we get to a software floating-point system, and the operations involved could take much time. Since this product accumulation is such a small part of the overall algorithm, the added time might be manageable, especially for larger matrices. It has however been decided that implementing software floating-point falls outside the scope of this thesis. A rough sketch of what the fourth option could look like is shown below.
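As an indication only, the fourth option could look something like the Matlab sketch below (my own, not part of the implementation): the running product is kept as a mantissa/exponent pair, where U and s come from the LU decomposition.

function [d,e] = detscaled(U,s)
% Determinant represented as d * 2^e, with |d| renormalized into [0.5, 1).
n = size(U,1);
d = 1;  e = 0;
for i = 1:n
    d = d * U(i,i);
    while d ~= 0 && abs(d) < 0.5
        d = d * 2;                 % rescale the mantissa
        e = e - 1;                 % and record it in the exponent
    end
end
d = (-1)^s * d;                    % sign compensation for s row interchanges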

5.8.2 Error analysis

The determinant algorithm has been implemented using the 16-bit LU decomposition described earlier. The calculated determinant value has been compared with the result from Matlab's det function. The result is shown in figure 5.7 for 8 × 8 real matrices. The randomly generated matrices will typically end up with a non-zero determinant value for matrices smaller than around 16 × 16. For larger matrices, 32 bits is not enough.

[Figure 5.7: scatter plot "Determinant value error (8x8 matrix)"; matrix condition number plotted against average error (2^-x).]

Figure 5.7. Average precision for determinant of 8 × 8 matrices

5.9 Matrix inverse

General matrices will be translated into triangular matrices by using either LUD or QRD. We will therefore mainly concentrate on obtaining lower bounds on the computation time for inversion of triangular matrices. We will investigate which data formats are suitable and investigate the precision of triangular matrix inversion. Finally we will investigate whether LUD inversion or QRD inversion gives the best precision when inverting general matrices.

5.9.1 Performance

The number of arithmetic operations for inverting a lower or upper triangular matrix is shown in table 5.5. If the matrix is instead uni-triangular, we can remove the divisions and some other operations. The number of operations for uni-triangular matrices is shown in table 5.6. As we can see, the difference in the number of operations is very small. Since divisions could take many cycles, the performance difference could however be a lot bigger, at least for small matrices. Since the differences between the triangular and uni-triangular cases are so small, we only present the calculated lower bound on cycle time for the uni-triangular case. The results for real matrices are shown in table 5.7. For the complex case, the lower bound would roughly double.

5.9.2 Data formats

In contrast to QRD and LUD, we cannot in general choose the Q2(1.15) format for the inverted matrix. Even in the simplest case, uni-triangular inversion, the Q2(1.15) format will typically not suffice. In the regular triangular case, the diagonal elements of the inverted matrix will be results of 1/x operations on Q2(1.15) data, which will definitely overflow. Since we still want our inverted matrix to

Table 5.5. Number of operations for lower/upper triangular inversion

(a) Real data
m × n     +/-       *         1/x
2x2       2         2         2
4x4       16        16        4
8x8       112       112       8
16x16     800       800       16
24x24     2576      2576      24
32x32     5952      5952      32
48x48     19552     19552     48
64x64     45696     45696     64

(b) Complex data
m × n     +/-       *         1/x
2x2       8         8         2
4x4       64        64        4
8x8       448       448       8
16x16     3200      3200      16
24x24     10304     10304     24
32x32     23808     23808     32
48x48     78208     78208     48
64x64     182784    182784    64

Table 5.6. Number of operations for lower/upper uni-triangular inversion

(a) Real data
m × n     +/-       *
2x2       2         1
4x4       16        10
8x8       112       84
16x16     800       680
24x24     2576      2300
32x32     5952      5456
48x48     19552     18424
64x64     45696     43680

(b) Complex data
m × n     +/-       *
2x2       6         4
4x4       52        40
8x8       392       336
16x16     2960      2720
24x24     9752      9200
32x32     22816     21824
48x48     75952     73696
64x64     178752    174720

Table 5.7. Lower bounds on cycle times for real uni-triangular matrices

m × n     Min. cycle time
2x2       1
4x4       6
8x8       28
16x16     148
24x24     424
32x32     920
48x48     2828
64x64     6368

be stored in a 16-bit format, we must choose another format, with more integer bits. There is of course no correct way to do this. The chosen format is a trade-off between precision and how large values you want to be able to represent. We will investigate what impact this has in the next section. In general, however, we can say that a regular triangular matrix will typically need more integer bits, due to the 1/x operations.

5.9.3 Error analysis

We will start by investigating the precision of triangular inversion. As before, we are interested in the element-wise error and then compute the mean error. Figure 5.8 depicts the results for inverted triangular and uni-triangular matrices respectively. The inverted matrices are 8 × 8 real matrices. The output format for the triangular case has been chosen to Q2(8.8), while the uni-triangular output is in Q2(4.12) format. The uni-triangular case has high precision, while the triangular case is much worse. This again has to do with the 1/x operations.

[Figure 5.8: two scatter plots, "Mean error for triangular inversion" and "Mean error for unitriangular inversion"; matrix condition number plotted against average error (2^-x).]

Figure 5.8. Triangular matrix inversion results

It is now time to compare the precision of LUD inversion versus QRD inversion. The results are shown in figure 5.9. The same set of real 8 × 8 test matrices has here been inverted with both methods. The upper triangular matrices U and R both use the Q2(8.8) output format, while the lower uni-triangular matrix L uses the Q2(4.12) format. While both methods have approximately the same precision for very well-conditioned matrices, the LUD inversion method turns out to be much better as soon as the matrix condition number gets larger. If we want better precision, we should therefore use the LUD inversion method. Finally we will investigate what impact the output formats of the triangular inversions have on the final result of general matrix inversion. We use the LUD inversion method for this comparison. The U matrix is stored with either 2, 4, 6 or 8 integer bits. The number of integer bits for L is half the number of integer bits in U for each case. The result of such a test is shown in figure 5.10. For a low number of integer bits we get good precision for some matrices, but many matrices fail to invert properly due to overflows. As we increase the number of integer bits, more matrices can be properly inverted, at the cost of accuracy.

[Figure 5.9: two scatter plots, "Mean error 8x8 A^-1 using MGS-QRD" and "Mean error 8x8 A^-1 using LUD"; matrix condition number plotted against average error (2^-x).]

Figure 5.9. Comparison between LUD and QRD inversion of general matrices

[Figure 5.10: four scatter plots for the output formats Q2(2.14), Q2(4.12), Q2(6.10) and Q2(8.8); matrix condition number plotted against average error (2^-x).]

Figure 5.10. Comparison between different output formats

Chapter 6

Implementation

This chapter will discuss the implementation of the previously presented algorithms. Rather than providing a single piece of assembly code for each decomposition that takes the size of the matrix as an input, we will instead provide a way to generate assembly code for a specific size. The generated assembly code will therefore only be capable of running for the specific matrix size that was chosen at generation. The reasoning behind this approach is that we expect a typical application to only use a single (or a few) matrix sizes. It is therefore more important that the generated code is efficient for the chosen size than having a computation kernel that could work for any size. By removing the matrix size as an input parameter to the kernel, we can remove some overhead and possibly use different approaches depending on the matrix size. Rather than presenting the whole code for a particular decomposition, we will instead discuss some of the strategies used and what difficulties might show up. This will help us to identify possible weak spots of the Sleipnir micro-architecture. Straight-forward methods of implementation will be discussed, as well as what performance enhancing techniques can be applied.

6.1 Assembly code generation

What we want to do is to generate the assembly code for a specific matrix decomposition, for a matrix of a specific size. There are of course many ways this could be done. The method chosen here is to use Python scripts. A typical script will generate the code for a specific decomposition. The only input needed is the requested matrix size. The script will then return a text string with the final assembly code. The total result of the implementation phase will therefore be a library of Python scripts for the different decompositions.


6.2 Memory storage

Before execution starts in Sleipnir, the kernel will be loaded to the program memory and constants will be loaded to the constant memory. These memories will remain completely unchanged after a kernel computation has finished; a new input matrix can then be supplied and the kernel rerun. In order for the kernel to operate correctly, the input data matrix must be stored in a correct way and at the correct position. The kernels will assume that the input matrix is stored from address 0 in the LVM connected to memory port m0. From address 0 the matrix will then be stored in row-major or column-major order, depending on the algorithm. In the case of row-major order, the first row should be stored at address 0 and consecutive addresses. The next row should then start at the next free address that is a multiple of 8. The empty space that shows up at the end of a row, if the row length is not a multiple of 8, should be filled with zeros. This will avoid some problems which we will see later. A small sketch of the assumed address layout is given below.
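A small sketch of the resulting address arithmetic for the row-major case (plain Matlab, my own illustration):

% Each row is padded to a multiple of eight 16-bit words.
n = 13;                                  % example row length
row_stride = 8 * ceil(n/8);              % words per stored row, here 16
row_start  = @(r) (r-1) * row_stride;    % LVM address of row r (rows from 1, addresses from 0)
% e.g. row_start(3) = 32; word offsets 13..15 of every row are zero-filled padding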

6.3 QR decomposition

The starting point of the Sleipnir implementation of QRD is the algorithm that was presented in listing 4.1. The approach will be to translate the Matlab code into the corresponding assembly code. One of the huge advantages of high-level language code like Matlab is that it hides a lot of the complexity of how the processor actually performs the functions, which saves a lot of development time. It turns out that a lot of the development time spent during assembly programming is spent on figuring out how to store and access data, a problem that is completely hidden in the Matlab code. Some design decisions will depend on what matrix sizes we want to support. Mostly it has to do with how we can utilize the vector register file. The largest result of a QRD that could be kept in LVM memory would be around 140 × 140 for real matrices or around 100 × 100 for complex matrices. It has therefore been decided to limit the implementation to matrices with a maximum size of 64 × 64. This could however easily be extended. The size of the vector register file will greatly affect the performance and what techniques can be applied. We realize that the largest square matrix that could be kept entirely in the vector register file would be an 8 × 8 matrix in the real case and a 4 × 4 in the complex case. I have therefore decided to implement special assembly generation scripts for these small sizes. These kernels have the advantage of being able to fit intermediate results of whole computational steps entirely in the VRF. The VRF provides low-latency access to data, which should increase performance. How these smaller QR decompositions are implemented will be described in the following section, and section 6.3.2 will deal with QRD of larger matrix sizes.

6.3.1 Small-size QRD

By looking at the code from listing 4.1, we realize that we want to do the same kind of operations over and over again, but with different lengths and input data. For a general case it would be very beneficial if we could create small pieces of code that could be used over and over again. For cases where the matrix is small, we could instead unroll the whole algorithm as a long sequential program. By doing this, we can remove some overhead at the cost of some program size. This has been the approach tested here. For small matrices, we can fit many intermediate results completely in the VRF. The flow chosen for the QRD algorithm is shown in figure 6.1. The figure indicates which data is used to produce the next set of data and also where data is fetched and stored. Some steps just indicate data copies. Each sub-task has possible parallelism, but we have data dependencies between the different steps. Since data dependencies will typically degrade performance by requiring stalls, it is important that each sub-task is sufficiently large.

[Figure 6.1 shows the computation flow for the small-size QRD. In order, the steps are:]

Start
A(1:m,i)[m0] → R(i,i)[vrf]
A(1:m,i)[m0], R(i,i)[vrf] → Q(1:m,i)[m1]
R(i,i)[vrf] → R(i,i)[m1]
A(1:m,j)[m0], Q(1:m,i)[m1] → R(i,j)[vrf]   (1)
R(i,j)[vrf], Q(1:m,i)[m1] → Tmp[vrf]       (2)
A(1:m,j)[m0], Tmp[vrf] → Tmp2[vrf]         (3)
R(i,j)[vrf] → R(i,j)[m1]                   (4)
Tmp2[vrf] → A(1:m,j)[m0]                   (5)
Done? If not, repeat; otherwise End.

Figure 6.1. Computation flow-chart for small QRD

Since the processor currently does not stall the pipeline to resolve data dependencies, no-operation instructions (NOP) have to be inserted in the code. Listing 6.1 shows a part of the code generated for iteration 7 out of 8 for QRD of an 8 × 8 matrix. The commented numbers correspond to the computational steps in figure 6.1. Since this is the second to last iteration of the algorithm, there is very little to do. Each step uses only a single instruction. We write intermediate results to the low-latency VRF, but still need to insert many NOPs. Even though this is something of a worst case scenario, it will be a problem for decompositions of smaller matrix sizes. The problems described here will be further discussed in chapter 8.

Listing 6.1. Example code for 7th iteration of an 8 × 8 real QRD

 1  tmac vr0.7 m1[ar1].vw m0[ar0+=8].vw    // 1
 2  vcopy vsr0 cm[car0+=1]                 // Update register ar0
 3  5 * nop

 5  vsmul vr7 m1[ar1+=8].vw vr0.7          // 2
 6  6 * nop

 8  vsubw vr7 m0[ar0+=8%].vw vr7           // 3
 9  4 * nop

11  vcopy m1[ar2+=8].vw vr0                // 4
12  vcopy m0[ar0+=8%].vw vr7               // 5

6.3.2 Large-size QRD

When moving to larger matrix sizes, intermediate results can no longer fit in the register file. Unrolling all code into one long sequence would also produce a lot of code, which would increase task loading time to Sleipnir and probably not even fit in the program memory. Listing 6.2 shows how eight consecutive dot products could be performed, each operating on two 64-element vectors. The repeat instruction tells Sleipnir to perform the following four instructions eight times. A problem with the algorithm however is that the number of times an operation should be performed changes from step to step, so the exact same code can't be reused. This could be solved using a counting register and then doing conditional jumps depending on the contents of that particular register, but this kind of software looping would introduce much overhead. Instead we have implemented a hardware loop scheme, where the number of iterations is determined by a special loop register. This is provided by the repeatr instruction. Before entering the hardware loop, we just set the loop register to a value of our choice and the hardware keeps track of how many times it has performed the loop by itself. Considering that the TMAC instructions each perform eight multiplications and eight additions, this hardware loop will perform an average of 7.52 arithmetic operations per cycle.

Listing 6.2. Eight consecutive dot products of vectors with 64 elements by the use of the repeat instruction

1  repeat 8 4
2  7 * tmaca vacr.0 m0[ar0+=8%].vw m1[ar1+=8%].vw
3  tmaco vr0.0 m0[ar0+=8%].vw m1[ar1+=8%].vw
4  8 * nop
5  scopyw m1[ar2+=1%].sw vr0.0

Just by looking at the Matlab algorithm in listing 4.1, one could expect that the most intensive operation in the algorithm is the A(1:m,j) = A(1:m,j) − R(i,j) ∗ Q(1:m,i) operation. It is very important that this operation can be performed efficiently. Listing 6.3 shows the generated code performing this operation for a 64 × 64 real matrix. One iteration of the loop takes care of one column of the A matrix. There are a few things worth noting here. The NOP on line 3 is used for resolving the data dependency between lines 2 and 4. The NOPs on line 12, however, are not related to data dependencies. Instead they have to do with the fact that the vsmul and vsubw instructions use different pipeline lengths. Both have long data inputs and short data outputs, but vsmul is a long datapath operation and vsubw is short. Without the NOPs, some vsubw instructions would catch up with some vsmul instructions and would want to use the last ALU stage simultaneously. Another source of problems is that a single instruction may not use the same memory twice, since this would cause memory conflicts without more expensive dual-port memories. This forces the vsubw instructions to write their results back to the register file, followed later by a copy-back.

Listing 6.3. Calculation of new A(1:m,j) values

 1  repeatr 27
 2  scopyw vr7.0 m1[ar2+=1%].sw
 3  1 * nop
 4  vsmul vr0 m1[ar1+=8%].vw vr7.0
 5  vsmul vr1 m1[ar1+=8%].vw vr7.0
 6  vsmul vr2 m1[ar1+=8%].vw vr7.0
 7  vsmul vr3 m1[ar1+=8%].vw vr7.0
 8  vsmul vr4 m1[ar1+=8%].vw vr7.0
 9  vsmul vr5 m1[ar1+=8%].vw vr7.0
10  vsmul vr6 m1[ar1+=8%].vw vr7.0
11  vsmul vr7 m1[ar1+=8%].vw vr7.0
12  3 * nop
13  vsubw vr0 m0[ar0+=8%].vw vr0
14  vsubw vr1 m0[ar0+=8%].vw vr1
15  vsubw vr2 m0[ar0+=8%].vw vr2
16  vsubw vr3 m0[ar0+=8%].vw vr3
17  vsubw vr4 m0[ar0+=8%].vw vr4
18  vsubw vr5 m0[ar0+=8%].vw vr5
19  vsubw vr6 m0[ar0+=8%].vw vr6
20  vsubw vr7 m0[ar0+=8%].vw vr7
21  vcopy m0[ar0+=8%].vw vr0
22  vcopy m0[ar0+=8%].vw vr1
23  vcopy m0[ar0+=8%].vw vr2
24  vcopy m0[ar0+=8%].vw vr3
25  vcopy m0[ar0+=8%].vw vr4
26  vcopy m0[ar0+=8%].vw vr5
27  vcopy m0[ar0+=8%].vw vr6
28  vcopy m0[ar0+=8%].vw vr7

6.3.3 Loop unrolling

After a straight-forward implementation has been made, it is time to look at different techniques to increase performance. One very common and effective approach is loop unrolling. The code from listing 6.2 could for example be changed into that of listing 6.4. Each iteration of the loop now processes eight dot products instead of one. By doing this we decrease the percentage of overhead from the NOPs and the copy-back. The average number of arithmetic operations per cycle will increase from 7.52 to 14.03. A combination of this unrolled loop and that from listing 6.2 can be used when the number of dot products to compute is not a multiple of eight.

Listing 6.4. Unrolled dot product accumulation

 1  repeatr 18
 2  7 * tmaca vacr.0 m0[ar0+=8%].vw m1[ar1+=8%].vw
 3  tmaco vr0.0 m0[ar0+=8%].vw m1[ar1+=8%].vw
 4  7 * tmaca vacr.0 m0[ar0+=8%].vw m1[ar1+=8%].vw
 5  tmaco vr0.1 m0[ar0+=8%].vw m1[ar1+=8%].vw
 6  7 * tmaca vacr.0 m0[ar0+=8%].vw m1[ar1+=8%].vw
 7  tmaco vr0.2 m0[ar0+=8%].vw m1[ar1+=8%].vw
 8  7 * tmaca vacr.0 m0[ar0+=8%].vw m1[ar1+=8%].vw
 9  tmaco vr0.3 m0[ar0+=8%].vw m1[ar1+=8%].vw
10  7 * tmaca vacr.0 m0[ar0+=8%].vw m1[ar1+=8%].vw
11  tmaco vr0.4 m0[ar0+=8%].vw m1[ar1+=8%].vw
12  7 * tmaca vacr.0 m0[ar0+=8%].vw m1[ar1+=8%].vw
13  tmaco vr0.5 m0[ar0+=8%].vw m1[ar1+=8%].vw
14  7 * tmaca vacr.0 m0[ar0+=8%].vw m1[ar1+=8%].vw
15  tmaco vr0.6 m0[ar0+=8%].vw m1[ar1+=8%].vw
16  7 * tmaca vacr.0 m0[ar0+=8%].vw m1[ar1+=8%].vw
17  tmaco vr0.7 m0[ar0+=8%].vw m1[ar1+=8%].vw
18  8 * nop
19  vcopy m1[ar2+=8%].sw vr0

6.3.4 Removing control-overhead

All small code loops have to be controlled so that they perform the correct function. Address registers must be set before entering the loops, so that they operate on the correct data, and loop counters must be set to their correct values. Calculating the correct values is in most cases a sequential task, which will not use the available hardware efficiently. Since the number of iterations in each step is determined by the algorithm, we can precalculate these values and put all the data in a look-up table. This will greatly reduce the clock cycle count, since transferring a few extra vectors at task loading takes much less time than doing run-time calculations. The look-up table will be placed in the constant memory, so that it can be retained between different kernel runs. A drawback is that the constant memory is not word addressable, which makes it more difficult to read out single words. A solution is to transfer the look-up table to an LVM memory at the beginning of the algorithm program code.
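As an illustration of the idea, the control data for each major QRD iteration could be generated off-line and downloaded together with the kernel. The fields and layout below are assumptions made for the example, not the actual table format used in the kernels:

import numpy as np

def build_control_table(m, n, lanes=8):
    # One entry per major iteration i: the number of columns left to update,
    # the number of eight-element vectors per column and the column start offset.
    table = []
    for i in range(n):
        cols_left = n - i - 1                  # loop counter for the column update
        vecs_per_col = (m + lanes - 1) // lanes
        col_offset = i * vecs_per_col          # start address of column i, in vectors
        table.append((cols_left, vecs_per_col, col_offset))
    return np.asarray(table, dtype=np.int16)

# Example: build_control_table(64, 64) gives one small entry per iteration,
# transferred once at task loading instead of being recalculated at run-time.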

6.3.5 Using a memory ping-pong approach

In the straight-forward implementation, the input matrix comes from memory port m0 and the matrices Q and R are stored via memory port m1. By analyzing the code from listing 6.3, we realize that if matrix A instead were written back to m1, this could be done directly in the vsubw instruction, since m1 is not otherwise used in that instruction. All the following vcopy instructions could then be removed. In the next major iteration the remaining part of A would then reside in m1. To manage the next iteration efficiently, the next generated Q vector should then be stored in m0, so that the dot products with the vectors of A can be performed efficiently. This procedure can be arranged so that matrices A and Q use shared memory spaces in both memories without overwriting any wanted data. The situation is depicted in figure 6.2. As can be seen from the figure, the drawback is that the result Q will be distributed over both m0 and m1. The matrix will have to be merged in the end. This merging operation is however much faster than the overhead incurred by not ping-ponging.

Figure 6.2. Matrix storage distribution when using the memory ping-ponging technique
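The bookkeeping needed for the ping-pong scheme amounts to swapping the roles of the two memories after every major iteration. The sketch below models the two LVMs as NumPy arrays purely to illustrate the buffer swapping; the actual storage layout of Q and R in the kernel is different:

import numpy as np

def gram_schmidt_pingpong(A, m, n):
    mem = {0: A.astype(float).copy(), 1: np.zeros((m, n))}   # the two memories
    src, dst = 0, 1                    # read A from src, write the updated A to dst
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        R[i, i] = np.linalg.norm(mem[src][:m, i])
        Q[:m, i] = mem[src][:m, i] / R[i, i]
        for j in range(i + 1, n):
            R[i, j] = Q[:m, i] @ mem[src][:m, j]
            # The subtraction writes directly to the other memory, which
            # removes the copy-back step of listing 6.3.
            mem[dst][:m, j] = mem[src][:m, j] - R[i, j] * Q[:m, i]
        src, dst = dst, src            # ping-pong: swap roles for the next iteration
    return Q, R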

6.4 LU decomposition

For LU decomposition, we will use the same implementation approach as in the QRD case just described. Many of the same concepts will be applied here as well. LU decomposition does however introduce more challenges that will have to be taken care of. Rather than repeating much of the previous discussion, we will only focus on the additional issues that show up.

6.4.1 Row pivoting

As described in section 4.2.3, row pivoting involves searching the m × n matrix A for

max(|a_kk|, |a_(k+1)k|, ..., |a_mk|)

where k is the current iteration in the algorithm. The row with the maximum value should then be interchanged with the kth row. The actual row swapping could be done in several ways. One way is of course to physically move the data. Another way would be to store a list of memory pointers to each row and just switch the pointers. This list would then have to be stored in memory somewhere, and each time we want to access a particular row we have to look in this list, requiring an extra memory read. Since the rows in the matrix are not very long and we can move eight elements at a time, the first method has been adopted.

In order for the row pivoting procedure to be efficient, an efficient way of finding the maximum pivot element must be implemented. We must also keep track of what index this value has, so that we can interchange the correct rows. It turns out that the conditions concept that was described in section 3.5.1.2 can be used for this. We keep a vector of the eight maximum values found so far and compare them element-wise with a new vector. We also keep a vector with the corresponding indexes of the eight maximum candidates. We can then conditionally overwrite the index vector with new indexes in those positions where the new values are larger. When we are done and left with eight candidates, we find the maximum of all of these, which is partly solved by the addition of a new maximum instruction.
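The lane-wise search can be sketched as follows. This is an illustration of the idea only; the real kernel operates on eight 16-bit lanes and uses the flag register for the conditional overwrite, and the padding and names here are assumptions:

import numpy as np

def find_pivot(col, lanes=8):
    # col holds |a_kk| ... |a_mk| for the current column (absolute values).
    n = len(col)
    pad = (-n) % lanes
    data = np.concatenate([col, np.full(pad, -1.0)])  # pad to a whole number of vectors
    best_val = data[:lanes].copy()                    # eight running maximum candidates
    best_idx = np.arange(lanes)                       # their positions in the column
    for base in range(lanes, len(data), lanes):
        chunk = data[base:base + lanes]
        larger = chunk > best_val                     # element-wise compare, like the flags
        best_val = np.where(larger, chunk, best_val)  # conditional overwrite of the values
        best_idx = np.where(larger, base + np.arange(lanes), best_idx)
    lane = int(np.argmax(best_val))                   # final reduction over eight candidates
    return best_idx[lane], best_val[lane]

# Example: row, value = find_pivot(np.abs(A[k:, k])) gives the pivot row offset.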

6.4.2 Datapath lane masking

Sleipnir will, in a typical real-data case, operate on eight 16-bit elements at a time. A problem then arises when we want to operate on vectors whose length is not a multiple of eight. We must make sure that we don't read unwanted data and don't overwrite data that belongs to something else. In the QRD case, this was easily avoided by just placing zeros outside the matrix where this could be a problem. In the LUD case, the access patterns of the A(i, j) = A(i, j) − A(i, k) ∗ A(k, j) operation further complicate this task. It can however be solved using the conditions concept that was described in section 3.5.1.2. By loading a masking pattern from the constant memory, we can avoid writing back data in unwanted positions, using conditional write-back with the flag register. This however introduces an additional source of overhead, which could possibly be improved.
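Conceptually, the masked write-back does nothing more than the following sketch. In the kernel the mask is a precomputed pattern in the constant memory and the selection is made by the flag register; the names here are only for illustration:

import numpy as np

def masked_store(memory, offset, new_vec, valid, lanes=8):
    # Write only the first `valid` lanes of new_vec; the remaining lanes
    # keep their old contents instead of being overwritten.
    mask = np.arange(lanes) < valid
    old = memory[offset:offset + lanes]
    memory[offset:offset + lanes] = np.where(mask, new_vec, old)

# Example: masked_store(lvm, row_start, result, valid=5) writes five of the
# eight elements and leaves the last three untouched.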

6.4.3 2D addressing

One major problem with the LUD algorithm is that it requires vector access to both rows and columns. With a straight-forward storage scheme, like the one presented in section 3.4.3.2, this will not be possible due to memory bank conflicts. There are however storage patterns that allow for both parallel row and column reads. These could be used in combination with permutation patterns in the constant memory. However, it turns out that this requires complicated addressing calculations and setting of address registers and constant address registers, especially when row pivoting must be taken into account. The choice has instead fallen on the straight-forward storage scheme.

The choice between row-major and column-major storage is not at all clear. Both methods have therefore been tested and it turns out that row-major is preferable. Part of the reason is that the row accesses required by the algorithm use the whole row, while the column accesses needed decrease by one element per algorithm iteration. The most intensive operation however, the A(i, j) = A(i, j) − A(i, k) ∗ A(k, j) operation, can be arranged efficiently for both row-major and column-major storage.

Since the problem of wanting to retrieve both rows and columns is very common in many applications, it has been decided that hardware support for general 2D accessing should be investigated. The hardware would then perform permutations completely transparently to the program. This functionality will be added to the next simulator version and has therefore not yet been tried.
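To see why a straight-forward storage scheme cannot serve both rows and columns in parallel, consider how consecutive elements map onto the eight memory banks. The sketch below assumes row-major storage with addresses interleaved one element per bank, which is an assumption made for the illustration:

def bank_of(row, col, n, banks=8):
    # Row-major storage: address = row * n + col, interleaved over the banks.
    return (row * n + col) % banks

n = 64
row_banks = [bank_of(0, c, n) for c in range(8)]  # eight consecutive row elements
col_banks = [bank_of(r, 0, n) for r in range(8)]  # eight consecutive column elements
print(row_banks)  # [0, 1, 2, ..., 7] -> one element per bank, conflict-free read
print(col_banks)  # [0, 0, 0, ..., 0] -> all in the same bank, eight sequential reads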

6.5 Matrix determinant

The matrix determinant algorithm is practically a slightly modified LUD algorithm. The LUD algorithm has been modified so that it keeps track of how many row interchanges are actually performed, followed by the final product accumulation and sign compensation as described in section 4.3.1. The main issue here is that forming a product of many numbers has to be done at least partly sequentially, which requires several NOPs to resolve the data dependencies.
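The final step amounts to multiplying the pivots together and compensating for the row interchanges. The sketch below shows that computation, together with one way of doing more multiplications in parallel by forming the product as a tree, which shortens the sequential dependency chain from n − 1 to roughly log2(n) multiplication steps; this is an illustration, not the kernel code:

import numpy as np

def determinant_from_lu(pivots, n_swaps):
    # det(A) = (-1)^(number of row interchanges) * product of the pivots.
    sign = -1.0 if n_swaps % 2 else 1.0
    return sign * tree_product(pivots)

def tree_product(values):
    v = np.asarray(values, dtype=float)
    while len(v) > 1:
        if len(v) % 2:                                  # odd length: carry the last element
            v = np.append(v[:-1].reshape(-1, 2).prod(axis=1), v[-1])
        else:
            v = v.reshape(-1, 2).prod(axis=1)
    return v[0]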

6.6 Matrix inverse

Like the algorithms described earlier, the matrix inversion algorithm has been evaluated for both smaller and larger matrices. We will start by discussing the triangular matrix case, before we move on to the general case.

6.6.1 Triangular matrix inversion

Inversion of small triangular matrices, especially uni-triangular ones, turns out to be quite simple. Conceptual code for inversion of a uni-triangular 8 × 8 matrix (with some addressing details removed) is shown in listing 6.5. Vector register vr4 is loaded with the constant 1, which will be inserted into the main diagonal of the resulting matrix as the algorithm progresses. The matrix multiplication is performed with the vsmac instruction. A problem here is that we don't really have to perform eight MAC operations every time. We rather want to do two vsmacs in the first loop iteration, three in the second and so forth. A solution would be to introduce the possibility to iterate a single instruction a number of times determined by the value of a register. Without this kind of solution, we will introduce a lot of overhead, especially for larger matrices. Due to time limits, this has however not yet been tested.

Listing 6.5. Conceptual uni-triangular inversion for a real 8 × 8 matrix

 1  vcopy vr4 cm[ONE]
 2
 3  // First iteration
 4  scopyw m0[ar3+=s].sw vr4.0
 5
 6  // Second and following iterations
 7  repeat 7 4
 8  8 * vsmac vacr m0[ar1+=8%].vw m1[ar0+=1].sw
 9  6 * nop
10  macoutw m0[ar2+=8].vw
11  scopyw m0[ar3+=s].sw vr4.0
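As a functional reference for what the kernel computes, the inverse of a uni-triangular matrix can be built one column at a time by substitution. The sketch below assumes a unit lower-triangular matrix; the kernel's traversal order and storage differ, so this only illustrates the arithmetic involved:

import numpy as np

def invert_unit_lower(L):
    # Solve L * X = I column by column with forward substitution.
    # The diagonal is known to be 1, so no divisions are needed.
    n = L.shape[0]
    X = np.eye(n)
    for j in range(n):
        for i in range(j + 1, n):
            X[i, j] = -(L[i, j:i] @ X[j:i, j])   # a MAC-style accumulation
    return X

# Sanity check:
# L = np.tril(np.random.rand(8, 8), -1) + np.eye(8)
# assert np.allclose(invert_unit_lower(L) @ L, np.eye(8))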

6.6.2 General matrix inversion

For general matrix inversion, the LUD approach has been tested. The process is performed as depicted in figure 6.3. The main issue here is to make sure that each step stores its results in such a way that the next step can be performed without too much overhead. As can be seen in the figure, the chosen scheme results in a memory copy stage, so that the following matrix multiplication can be performed efficiently. One can note that the matrix multiplication involves two triangular matrices, which means that it can be simplified compared to a general matrix multiplication. Finally we need to perform some column interchanges, due to the fact that we have performed row pivoting in the LU decomposition.

A[m0] → PA = LU [m0] → L⁻¹ [m0], U⁻¹ [m0] → copy U⁻¹ to m1 → U⁻¹[m1] · L⁻¹[m0] = (PA)⁻¹ [m0] → column interchanges → A⁻¹ [m1]

Figure 6.3. Flow for general matrix inversion using the LUD approach
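Numerically, the flow corresponds to the identity A⁻¹ = (PA)⁻¹P = U⁻¹L⁻¹P, where the multiplication by P from the right is exactly the final column interchange step. The sketch below uses library routines for the individual steps purely to show the data flow, not the kernel implementations (note that scipy's lu returns factors such that A = P·L·U, so its P is the transpose of the permutation used above):

import numpy as np
from scipy.linalg import lu

def inverse_via_lu(A):
    P, L, U = lu(A)                                  # A = P @ L @ U
    PA_inv = np.linalg.inv(U) @ np.linalg.inv(L)     # the two triangular inverses
    return PA_inv @ P.T                              # column interchanges via the permutation

# Sanity check:
# A = np.random.rand(8, 8)
# assert np.allclose(inverse_via_lu(A) @ A, np.eye(8))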

6.7 Verification

All implemented algorithms have been tested in simulation, to verify that they give the correct result. This has been done by generating assembly code for many different matrix sizes, providing test matrices from Matlab and then comparing the simulated computation results with the Matlab-calculated results. All implemented algorithms have been proven to work.

Chapter 7

Results

After successful implementation of the algorithms, it is now time to look at the results. The cycle costs presented here will be compared to the theoretical lower boundaries that were derived in chapter 5. To see what parts of the algorithms take the most time, we will also analyze the code with dynamic code profiling. An important note is that the performance numbers for the 1/x and 1/√x operations are unknown. They have, for the purpose of testing, been implemented like any other long datapath instruction. In a final implementation, they will probably take considerably more time, possibly 10 to 20 clock cycles.

7.1 QR decomposition

The execution time results for small-sized QRD are shown in table 7.1. Compared to the theoretical numbers from table 5.2, we can see that we have a lot of overhead. This is not at all surprising though, since the data dependencies impact performance a lot here. Another problem is that we in many cases cannot use the entire width of the datapath, which reduces useful utilization. Compared to the theoretical numbers, addressing will also consume extra cycles. It is interesting to note that the complex data case doesn't take any extra time at all compared to the real case, which is due to the fact that we only utilize a small fraction of the datapath for real matrices with such small sizes.

When moving to larger matrix sizes, the benefits of the wide datapath really show up. Table 7.2 shows the results for real matrices and table 7.3 the same for complex matrices. In the 64 × 64 case, we perform on average 6.59 and 14.02 arithmetic operations per cycle for the real and complex case respectively. Compared to the lower limits on execution time, 49536 and 98944 cycles for the 64 × 64 case, the kernels are quite efficient. Still, we have some amount of overhead. The data dependency issue is less of a problem when the matrices get larger, but we still have to insert NOPs to avoid pipeline hazards. Addressing and program flow control also use quite a lot of resources.

We conclude the QRD results by looking at some profiling results. Table 7.4 shows how much time is spent on the different operations in the algorithm. The


Table 7.1. Clock cycle count for small-size QRD

(a) Real data

m × n   n=2   n=3   n=4   n=6   n=8
m=2      78    89    94   104   114
m=3      78   126   141   159   177
m=4      78   126   178   210   236
m=6      78   126   178   294   342
m=8      78   126   178   294   426

(b) Complex data

m × n   n=2   n=3   n=4
m=2      78    89    94
m=3      78   126   141
m=4      78   126   178

Table 7.2. Clock cycle count for large real QRD

m × n   n=16   n=24   n=32   n=48   n=64
m=16    3053   4222   5380   7696  10012
m=24    3543   6714   9459  14931  20403
m=32    3943   7602  12389  21254  30110
m=48    5308  10851  18322  39048  61872
m=64    6108  12627  21458  46056  79902

Table 7.3. Clock cycle count for large complex QRD

m × n    n=8   n=16   n=24   n=32   n=48    n=64
m=8     1125   1762   2392   3022   4282    5542
m=16    1307   3801   6122   8438  13070   17702
m=24    1588   5166  10744  16588  28268   39948
m=32    1796   5966  12520  21458  40166   58870
m=48    2464   8646  18556  32194  70654  114786
m=64    2880  10246  22108  38466  84670  148858

final A matrix update turns out to be the major contributor to execution time. The loop that performs this operation contains several NOPs, which becomes a quite large source of overhead. In fact, it turns out that 29% of the instructions issued in the entire decomposition are NOPs. For smaller matrices this of course gets even worse. The 8 × 8 real decomposition kernel, for example, issues NOPs 62% of the time.

7.2 LU decomposition

We will now turn our attention to the LU decomposition results. Cycle costs for small-sized LU decomposition are shown in table 7.5. If we compare tables 5.2 and 5.4, which list the theoretical lower boundaries on execution time for QRD and LUD, we would have expected the LU decomposition to be performed faster.

Table 7.4. Code profiling for 64 × 64 complex QRD. Percentages of total run-time.

Operation                                    Run-time %
R(i,i) = sqrt(sum(abs(A(1:m,i).^2)))              1.38%
Q(1:m,i) = A(1:m,i) / R(i,i)                      0.99%
R(i,j) = Q(1:m,i)' * A(1:m,j)                    25.74%
A(1:m,j) = A(1:m,j) - R(i,j) * Q(1:m,j)          70.76%

In reality, however, it is not. One reason is that row pivoting is not considered in the theoretical numbers, which is one source of overhead. The vector search algorithm contains several data dependencies, which degrade performance. Also not considered in the theoretical numbers is the fact that we cannot do vector accesses of both rows and columns.

Table 7.5. Clock cycle count for small-size LUD

(a) Real data

m × n   n=2   n=3   n=4   n=6   n=8
m=2      80    80    80    80    80
m=3     156   156   156   156   156
m=4     168   238   238   238   238
m=6     192   274   350   420   420
m=8     216   310   398   556   626

(b) Complex data

m × n   n=2   n=3   n=4
m=2      94    94    94
m=3     184   184   184
m=4     196   280   280

When moving to larger matrices, the row pivoting procedure becomes a smaller and smaller part of the overall execution time. The A(i, j) = A(i, j) − A(i, k) ∗ A(k, j) operation instead starts to dominate. The execution time results are shown in tables 7.6 and 7.7, for the real and complex cases respectively. Compared to the theoretical numbers from table 5.4 we are still a long way from good utilization. We will of course never come really close to the theoretical numbers, since row pivoting is not included in them. Still, there is room for many optimizations. General 2D addressing, for example, could remove an estimated 6000 clock cycles in the 64 × 64 real case.

Table 7.6. Clock cycle count for large real LUD

m × n    n=16   n=24   n=32   n=48   n=64
m=16     3554   4538   5522   7490   9458
m=24     5032   7642   9456  13084  16712
m=32     6435  10216  13950  19742  25534
m=48     9136  14950  21308  34591  45971
m=64    11900  19768  28621  48421  69320

Table 7.7. Clock cycle count for large complex LUD

m × n    n=8   n=16   n=24   n=32   n=48    n=64
m=8     1435   2177   2919   3661   5145    6629
m=16    2382   5154   7152   9150  13146   17142
m=24    3210   7361  11893  15567  22915   30263
m=32    4101   9584  16011  22807  34515   46223
m=48    5778  13841  23751  35046  59869   82817
m=64    7518  18224  31659  47382  83907  124026

Finally we present some profiling results, shown in table 7.8 for the 64 × 64 complex case. The final update of the matrix A turns out to be the most demanding task. It is interesting to note how similar this operation is to the most intensive task of the QR decomposition.

Table 7.8. Code profiling for 64 × 64 complex LUD. Percentages of total run-time.

Operation                            Run-time %
Vector searching                          9.00%
Row swap                                  5.28%
A(i,k) = A(i,k)/A(k,k)                    8.37%
A(i,j) = A(i,j) - A(i,k)*A(k,j)          76.17%

7.3 Matrix determinant

The clock cycle count for the real matrix determinant kernels is shown in table 7.9. These results don't bring much new information. Keeping track of the number of row interchanges adds a few cycles to the LU decomposition time, and the final product accumulation and sign compensation are the only things added on top of this. The product accumulation could be further optimized by doing more multiplications in parallel, which hasn't been done here.

Table 7.9. Clock cycle count for real-valued determinant calculation

Size         2x2   4x4   6x6   8x8   16x16   32x32   64x64
Cycle time   117   305   519   755    3842   14528   70468

7.4 Matrix inverse

Finally we study the results of the matrix inversion kernels. We have previously identified some issues that prevented efficient implementation of the triangular matrix inversion kernels, especially for larger matrices. More work has to be done here to obtain reasonable results. We will therefore concentrate on the small-sized kernels. Results are shown in tables 7.10 and 7.11 for the triangular and uni-triangular cases respectively. Removal of the divisions and some of the multiplications in the uni-triangular case significantly speeds things up. We can note that the complex triangular inversion is slower than the real inversion even for small matrix sizes, which is due to the complex denominator in the reciprocal calculation.

Table 7.10. Clock cycle count for triangular inversion

(a) Real data

Size   Cycle time
2x2            44
3x3            66
4x4            90
6x6           144
8x8           206

(b) Complex data

Size   Cycle time
2x2            70
3x3           117
4x4           166

Table 7.11. Clock cycle count for uni-triangular inversion

(a) Real data

Size   Cycle time
2x2            27
3x3            39
4x4            53
6x6            87
8x8           129

(b) Complex data

Size   Cycle time
2x2            27
3x3            39
4x4            53

We conclude this chapter by looking at the results for general matrix inversion. The execution time results for the real case are shown in table 7.12. From an evaluation point of view these results don't add much. The results are what could be expected: the time for the LUD and the two triangular inversions, plus some additional time for matrix multiplication and column swapping. Since the LUD turned out to be slower than QRD for small matrices, we can conclude that QRD-based inversion would be faster, especially since we would only have to do one triangular matrix inversion. Precision would however suffer, as we concluded in section 5.9.3.

Table 7.12. Clock cycle count for general real-valued inversion using LUD

Size         2x2   3x3   4x4   6x6   8x8
Cycle time   202   336   476   774  1096

Chapter 8

Architecture Evaluation

One of the main motivations for investigating these matrix operations has been to evaluate the Sleipnir micro-architecture and identify possible problems that degrade performance. Sleipnir has many strengths, for example its highly configurable and parallel datapath as well as its flexible features for accessing memory. However, we have identified some issues that turn out to be severe bottlenecks in matrix computations. This chapter will discuss these issues and what could possibly be done about them.

8.1 Data dependency issues

One of the larger overheads, especially for operations on smaller matrices, is due to data dependencies. Operations for program flow and addressing also tend to have issues with data dependencies. The problem with data dependencies is that we have to wait several clock cycles before an operation has completed and the result has been written to the register file. Only then can the result be used as an input to another instruction. A technique that is used in many processors to remove this limitation is to feed the computed result directly back into the datapath, so that it can be used before it has been written to the register file. This is commonly known as register forwarding. If Sleipnir is going to be an efficient processor for smaller-sized matrix computations, register forwarding is one thing that should be considered. This addition could possibly make a huge impact on other applications as well.

8.2 Pipeline hazards

A second major source of overhead occurs when the innermost loops use instructions with different pipeline lengths. This was the reason that NOPs had to be inserted in the code, even when the matrix sizes grew larger. We will here demonstrate the issue with a simple example. Figure 8.1 depicts the execution of two long datapath instructions, followed by two short datapath instructions. The

instructions use short inputs and outputs, meaning that operands and results are read from and written to the register file. If the instructions are issued consecutively, we end up with a structural hazard, which occurs because the shorter datapath instructions catch up with the long datapath instructions. The figure also depicts two possible solutions. The first solution (i) is to insert NOPs, which delays the issuing of the short datapath instructions. This is the solution we have resorted to in the implementations. A better approach is to let the hardware automatically detect the hazard and stall the pipeline after the short datapath instructions have progressed so far that they are about to enter the datapath (ii). An alternative way to achieve the same result would be to give the programmer the possibility to execute short instructions as long instructions.


Figure 8.1. Issuing of long datapath instructions followed by short, causing structural hazard. The hazard can be resolved either by inserting NOPs in the code (i) or by letting the hardware stall the pipeline (ii).

Looking only at figure 8.1, the benefit of approach (ii) is not apparent. If we loop several times over this code, however, the benefit shows up. This is depicted in figure 8.2. When approach (ii) is used, it is possible to issue the long datapath instructions belonging to the next loop iteration earlier. This benefit does not only show up in loops, of course, but anywhere a mix of instructions with different pipeline lengths is executed back to back.

8.3 Buffered memory write-back

By looking at listing 6.2 we can identify another source of overhead. Since we cannot use the same memory operand twice in the same instruction, we are forced to temporarily store the result in the vector register file. The waiting overhead could of course be removed by starting the computations of the next iteration and writing the result to memory later, once it has reached the register file. Since the number of dot products we want to compute varies between iterations, handling this would introduce other overheads. What we really would


Figure 8.2. Benefit of stalling pipeline (ii) over inserting NOPs (i).

want to be able to do is to write the code as in listing 8.1. Memory port m1 would then require two accesses in some cycles, to accommodate both one read and one write.

Listing 8.1. Optimized dot product loop requiring simultaneous read and write access to the same memory

1  repeat r 2
2  7 * tmaca vacr.0 m0[ar0+=8%].vw m1[ar1+=8%].vw
3  tmaco<...> m1[ar2+=1%].sw m0[ar0+=8%].vw m1[ar1+=8%].vw

A solution to the problem would of course be to use a dual-port memory, but this is not a hardware-efficient solution. We can however note that we don't really care when the memory writes are performed, just that they happen eventually. Memory writes could then instead be queued in a FIFO buffer and performed whenever there is no read access in progress to that particular memory. If the buffer gets full, we stall the pipeline to write to memory.

An additional issue remains with the buffered approach. Line 3 in listing 8.1 would issue a scalar memory write to the write-back buffer. Compared to an unrolled case, where we perform several dot products and then write a complete vector to memory, this would be inefficient. A better write buffer would however be able to check memory bank dependencies between the different memory writes issued and write as much data as it possibly can simultaneously.

We conclude the discussion with some performance estimates. To compute 64 dot products in a way similar to that of listing 6.2, we would need 17 ∗ 64 = 1088 clock cycles. The unrolled code from listing 6.4 could do the same thing in 73 ∗ 8 = 584 clock cycles. The unrolled code could however not handle cases where the number of dot products is not a multiple of eight. The code from listing 8.1 with scalar buffered write-back would consume 8 ∗ 64 + 64 = 576 clock cycles, and this code could efficiently handle any other number of dot products as well. In the "smart" buffer case the cycle time could be further reduced to 8 ∗ 64 + 8 = 520 clock cycles, if the buffer is large enough to group together eight consecutive scalar writes into a single vector write. An additional benefit with the buffered write is that if the buffer is large enough, we have some writes left in the buffer at the end of the loop. The write-back of these can be performed simultaneously with the code following the loop, provided that this code does not use the same memory. If the code following the loop depends on the results from the loop, we must make sure that the buffer has time to empty. This has to be done by executing code not dependent on that memory or by inserting NOPs.

In an attempt to avoid the addition of a write-back buffer, other options have been investigated. The investigations have shown that in many situations it is possible to schedule the issuing of instructions in such a way that gaps are left in the instruction issuing, and memory write-back can be performed in these gaps. If we allow a single instruction to use the same memory operand both for read and write-back, we just have to make sure that a gap is inserted when it is time for memory write-back. The gap will be filled with a NOP instruction. This would allow us to get the same performance boost as with the write-back buffer, without the extra hardware cost. In some cases it might be hard to schedule the instruction issuing in such a way that this is possible. It would then be possible to instead stall the pipeline to perform the write-back and afterwards perform the read.
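To make the buffering idea concrete, the following is a small behavioural model of such a write-back FIFO. It assumes one access per memory per cycle and that queued writes are drained only on cycles without a read to that memory; the class and its size are made up for the example:

from collections import deque

class WriteBackBuffer:
    def __init__(self, memory, depth=8):
        self.memory = memory            # the LVM being written, e.g. a list
        self.fifo = deque()
        self.depth = depth

    def queue_write(self, addr, value):
        # Called where the instruction would have written memory directly.
        # In hardware a full buffer stalls the pipeline; here we just report it.
        stall = len(self.fifo) >= self.depth
        self.fifo.append((addr, value))
        return stall

    def cycle(self, read_in_progress):
        # Drain one queued write on any cycle where the memory port is idle.
        if not read_in_progress and self.fifo:
            addr, value = self.fifo.popleft()
            self.memory[addr] = value

A "smart" buffer as described above would additionally merge queued scalar writes to different banks of the same vector into a single write.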

8.4 Vector register file size

To minimize the overhead of data-dependent operations, we typically perform loop unrolling. This means that we store intermediate values in the vector register file. Removing this overhead completely could require several unrolled iterations, which would fill up a significant portion of the register file, leaving little space for other data. A suggestion would be to at least double the register file size. This would also bring other benefits, since we could have low-latency access to more data. The benefit would of course have to be verified with more benchmarking.

8.5 Operations on datapath-unaligned vectors

Operations on vectors with a length that is not a multiple of the datapath width turn out to be another source of overhead. They require us to first issue an instruction that sets the flag register and then use this as a mask. A faster solution could be to be able to set the flag register directly. This would decrease the time from when the flag-setting operation is issued until the flags are actually set. An even better way could be to specify the actual data width in the instruction and let the hardware auto-mask the write-back. This is an issue that could use some further investigation.

When operating on data vectors that are wider than the datapath, with the vector register file involved as either operand or destination, we end up with an additional source of overhead. This overhead is however not related to performance, but rather to program memory requirements. The code that was presented in listing 6.3 demonstrates the issue quite clearly. For memory operands we have the ability to progress to the next vector elements by using post-increment. For register file operands, this can currently not be done. Instead we have to insert a new line of code for each individual access to the vector register file. If the register file instead were treated more like linear memory, we could conceptually rewrite rows 13 to 20 from listing 6.3 as just

8 * vsubw vr0+ m0[ar0+=8%].vw vr0+

where vr0+ would instruct the hardware to start from vr0 and, in the following iterations, continue with vr1, vr2, etc. This could potentially make the program size much smaller.

The concepts just described could be further combined and generalized. Instead of instructing the hardware to perform a vector operation a specific number of times, we could instruct it to perform a scalar operation on y data elements. The hardware would then automatically vectorize the execution and possibly auto-mask the last iteration, if y is not a multiple of the datapath width.
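The behaviour such a generalized instruction would implement is essentially the strip-mining loop below, reusing the masking idea from section 6.4.2. The sketch assumes the backing buffers are padded to a whole number of vectors:

import numpy as np

def auto_vectorized_sub(dst, src_a, src_b, y, lanes=8):
    # dst[i] = src_a[i] - src_b[i] for i = 0 .. y-1, issued as full-width
    # vector operations with the last one auto-masked.
    for base in range(0, y, lanes):
        valid = min(lanes, y - base)               # lanes that carry real data
        mask = np.arange(lanes) < valid
        a = src_a[base:base + lanes]
        b = src_b[base:base + lanes]
        old = dst[base:base + lanes]
        dst[base:base + lanes] = np.where(mask, a - b, old)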

8.6 Hardware accelerated searching

A set of instructions for accelerating different search patterns could definitely speed things up. Further investigation is needed to identify the needs of the targeted application areas. The LUD kernel has demonstrated the need for searching a vector for its maximum absolute value together with the corresponding relative index.

8.7 Register dependent iterations

The single cycle iterations concept could be generalized, so that it accepts a special register value as the basis for how many iterations to perform. The number of iterations we want to perform in the matrix inversion kernels decreases by one with every iteration. To further speed this up, post-decrement on this register could be used.

8.8 Complex division

Complex division, which the complex LUD and inversion kernels rely on, takes additional time compared to real division. A complex division can be written as

(a + bi)/(c + di) = (a + bi)(c − di)/|c + di|² = (a + bi) · ((c − di) · 1/|c + di|²) = (a + bi) · γ

By writing the computation like this, we realize that we have to perform one absolute-squared operation, one real reciprocal calculation and two multiplications. In the LUD and inversion kernels, a + bi is typically not a scalar but rather a long vector of values. Before this vector can be multiplied with γ, the value of γ must be calculated. If all operands and results are read from and written to the vector register file, it takes 20 cycles plus the time of the real reciprocal calculation to compute the value of γ, which is due to data dependencies. Register forwarding would of course reduce the overhead, but it could be beneficial to investigate if this could be accelerated by some custom microcode.
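The same factorization expressed in NumPy, just to make the sequence of operations explicit; this is a sketch, whereas the kernel works on fixed-point data and computes the reciprocal with a dedicated instruction:

import numpy as np

def complex_divide_vector(num, c, d):
    # num is a vector of numerators a + bi; (c + di) is the common denominator.
    abs_sq = c * c + d * d              # |c + di|^2, one absolute-squared operation
    recip = 1.0 / abs_sq                # one real reciprocal
    gamma = complex(c, -d) * recip      # first multiplication: gamma = (c - di) / |c + di|^2
    return num * gamma                  # second multiplication, applied to the whole vector

# Example:
# v = np.array([1 + 2j, 3 - 1j])
# assert np.allclose(complex_divide_vector(v, 2.0, 1.0), v / (2 + 1j))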

8.9 Simultaneous constant memory accesses

During the work with this thesis there has been some discussion about whether it should be possible for a single instruction to use the constant memory more than once, something the simulator currently allows. In hardware this would require two constant memories with identical contents, or a dual-port memory. The decision will be made according to the needs of the investigated applications. The implementations made in this thesis have not required any simultaneous accesses to the constant memory.

Chapter 9

Conclusions

The ePUMA architecture is indeed a very interesting concept. With eight highly configurable parallel co-processors, the performance potential for DSP applications is very good. One of the key challenges will be to make sure that all processing power can be effectively utilized. This will require innovations in many areas, and one of the most important parts will be the tool-chain. Efficient scheduling of memory transfers and kernel execution will be one of the most important aspects of programming for the architecture.

Sleipnir has been proven to be an efficient design for many of the things that have been investigated prior to this thesis. The raw computational power of the datapath is exceptional. The matrix operations however put additional strain on Sleipnir's micro-architecture. The major problems involve data dependencies and pipeline hazards that occur when instructions with different pipeline lengths are issued consecutively. The investigations done as a part of this thesis indicate that some additional architectural modifications can be made to improve the situation significantly. These modifications should be seriously considered if matrix operations are ever going to be executed effectively.

Another issue involves the efficiency of kernels for operations on small matrices. If all that is needed is a single operation on a single small matrix, it won't be efficient to transfer this computation to a SIMD co-processor. We would have to transfer a quite large computation kernel and a small matrix, and then only compute for a couple of hundred cycles. The associated overheads would just be too great. A more reasonable scenario is that we want many small-sized matrix operations to be performed consecutively. We could then create kernels that perform several matrix operations at a time. This would definitely reduce a lot of the data dependency overhead that has been identified in the kernels.

9.1 Future work

There is a lot of possible future work surrounding the ePUMA project. An updated Sleipnir simulator is planned, with added features that will be

investigated. Work is also being performed on the tool-chain. A few suggestions for future work that directly relate to this thesis are:

• Further improve Sleipnir's micro-architecture and investigate the benefits this delivers to the performance of the matrix operation kernels.

• Investigate possible alternative algorithms for matrix operations, especially for small matrices.

• Further evaluate kernels for other matrix operations.

• Investigate typical usage areas for matrix computations and adapt the kernels to meet the needs of these applications.

• Implement and benchmark a complete application that relies on matrix computations and other kernels, to evaluate ePUMA's overall performance.


Copyright The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Andréas Karlsson