
DESIGN AND IMPLEMENTATION OF A MULTITHREADED ASSOCIATIVE SIMD PROCESSOR

A dissertation submitted to Kent State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

by

Kevin Schaffer

December, 2011

Dissertation written by
Kevin Schaffer
B.S., Kent State University, 2001
M.S., Kent State University, 2003
Ph.D., Kent State University, 2011

Approved by

Robert A. Walker, Chair, Doctoral Dissertation Committee

Johnnie W. Baker, Members, Doctoral Dissertation Committee

Kenneth E. Batcher,

Eugene C. Gartland,

Accepted by

John R. D. Stalvey, Administrator, Department of Computer Science

Timothy Moerland, Dean, College of Arts and Sciences


TABLE OF CONTENTS

LIST OF FIGURES ...... viii

LIST OF TABLES ...... xi

CHAPTER 1 INTRODUCTION ...... 1

1.1. Architectural Trends ...... 1

1.1.1. Wide-Issue Superscalar Processors ...... 2
1.1.2. Chip Multiprocessors (CMPs) ...... 2

1.2. An Alternative Approach: SIMD ...... 3
1.3. MTASC Processor ...... 5
1.4. Dissertation Organization ...... 5

CHAPTER 2 ASSOCIATIVE COMPUTING ...... 7

2.1. Background ...... 7

2.1.1. Associative Memories ...... 8
2.1.2. Associative Processors ...... 10
2.1.3. STARAN ...... 11
2.1.4. Massively Parallel Processor (MPP) ...... 12

2.2. The Associative Computing Model (ASC) ...... 12

2.2.1. Associative Search ...... 14
2.2.2. Responder Detection ...... 15
2.2.3. Responder Selection/Iteration ...... 15
2.2.4. Maximum/Minimum Search ...... 16
2.2.5. PE Interconnection Network ...... 17
2.2.6. MASC ...... 17

2.3. ASC Processor Prototypes ...... 19

2.3.1. First Processor ...... 19
2.3.2. Scalable Processor ...... 19
2.3.3. Pipelined Processor ...... 20
2.3.4. MASC Processor ...... 20
2.3.5. MTASC and Previous ASC Processors ...... 21

2.4. Summary ...... 21

CHAPTER 3 PIPELINING ...... 22

3.1. Background ...... 22

3.1.1. Pipelining Basics ...... 23
3.1.2. Hazards ...... 24

3.2. Pipelining a SIMD Processor ...... 27

3.2.1. SIMD Instruction Types ...... 27
3.2.2. Pipelining Instruction Execution ...... 28
3.2.3. Pipelining the Broadcast and Reduction Networks ...... 29
3.2.4. SIMD-Specific Hazards ...... 30

3.3. MTASC Pipeline ...... 31

3.3.1. Unified Pipeline ...... 32
3.3.2. Hazards in the Unified Pipeline ...... 32
3.3.3. Diversified Pipeline ...... 34
3.3.4. Hazards in the Diversified Pipeline ...... 35
3.3.5. Comparison of Unified and Diversified Pipelines ...... 36

3.4. Summary ...... 37

CHAPTER 4 MULTITHREADING ...... 39

4.1. Background ...... 39
4.2. Scheduling ...... 42

4.2.1. Simple Scheduler ...... 42
4.2.2. Dependency-Aware Scheduler ...... 44
4.2.3. Semaphore-Aware Scheduler ...... 47

4.3. Summary ...... 48

CHAPTER 5 MTASC: A MULTITHREADED ASSOCIATIVE SIMD PROCESSOR 50

5.1. Instruction Set Architecture ...... 50

5.1.1. Registers ...... 51
5.1.2. Instruction Formats ...... 53
5.1.3. ALU Instructions ...... 54
5.1.4. Load/Store Instructions ...... 54
5.1.5. Branch and Jump Instructions ...... 55

5.1.6. Enable Stack Instructions ...... 55
5.1.7. Reduction Instructions ...... 56
5.1.8. Semaphore Instructions ...... 57

5.2. Organization ...... 57

5.2.1. Control Unit ...... 57
5.2.2. Scalar ...... 59
5.2.3. Processing Elements (PEs) ...... 60
5.2.4. Broadcast/Reduction Network ...... 63
5.2.5. PE Interconnection Network ...... 64

5.3. MTASC Assembler ...... 64
5.4. MTASC Simulator ...... 65
5.5. Summary ...... 66

CHAPTER 6 MULTITHREADED ASSOCIATIVE BENCHMARKS ...... 67

6.1. Algorithmic Conventions ...... 67
6.2. Associativity ...... 68
6.3. Benchmarks ...... 69

6.3.1. Jarvis March ...... 69
6.3.2. Minimum Spanning Tree ...... 71
6.3.3. Matrix-Vector Multiplication ...... 74
6.3.4. QuickHull ...... 75
6.3.5. VLDC String Matching ...... 77

6.4. Summary ...... 80

CHAPTER 7 SIMULATION RESULTS ...... 82

7.1. Comparison of Thread Scheduling Policies ...... 82
7.2. Single-Threaded Execution vs. Multithreaded Execution ...... 86
7.3. Summary ...... 88

CHAPTER 8 RELATED WORK ...... 89

8.1. ClearSpeed CSX ...... 89

8.1.1. Architecture ...... 89
8.1.2. Associative Computing ...... 92
8.1.3. Comparison with MTASC ...... 93

8.2. Nvidia Tesla ...... 95


8.2.1. CUDA ...... 95
8.2.2. Architecture ...... 97
8.2.3. Associative Computing ...... 101
8.2.4. Comparison with MTASC ...... 102

8.3. Summary ...... 103

CHAPTER 9 CONCLUSIONS AND FUTURE WORK ...... 104

9.1. Summary ...... 104
9.2. Contributions ...... 105
9.3. Future Work ...... 107

BIBLIOGRAPHY ...... 109

APPENDIX A MTASC INSTRUCTION SET ...... 114

A.1. ALU Instructions ...... 114
A.2. Load/Store Instructions ...... 115
A.3. Branch and Jump Instructions ...... 115
A.4. Enable Stack Instructions ...... 115
A.5. Reduction Instructions ...... 115
A.6. Semaphore Instructions ...... 115
A.7. Miscellaneous Instructions ...... 116

APPENDIX B MTASC ASSEMBLER AND SIMULATOR ...... 117

B.1. MTASCAssembly.g ...... 117
B.2. MTASCAssemblyWriter.g ...... 126
B.3. Simulator.cpp ...... 135
B.4. Machine.h ...... 136
B.5. Machine.cpp ...... 138
B.6. Instruction.h ...... 145
B.7. Instruction.cpp ...... 147

APPENDIX C MULTITHREADED ASSOCIATIVE BENCHMARKS ...... 165

C.1. Jarvis March ...... 165
C.2. Minimal Spanning Tree (MST) ...... 167
C.3. Matrix-Vector Multiplication ...... 168
C.4. QuickHull ...... 169
C.5. VLDC String Matching ...... 173


APPENDIX D ASC LIBRARY FOR CN ...... 177

D.1. asc.h ...... 177
D.2. asc.cn ...... 185


LIST OF FIGURES

Figure 2.1 An associative memory...... 8

Figure 2.2 STARAN array module [Batcher 1974]...... 11

Figure 2.3 Conceptual organization of an ASC computer...... 13

Figure 2.4 Conceptual organization of a MASC computer...... 18

Figure 3.1 Classic six-stage RISC pipeline...... 23

Figure 3.2 Pipeline stall...... 25

Figure 3.3 Pipelined broadcast network...... 29

Figure 3.4 Pipelined reduction network...... 30

Figure 3.5 Unified pipeline...... 32

Figure 3.6 Hazards in the unified pipeline (from top to bottom): broadcast, reduction, broadcast/reduction...... 33

Figure 3.7 Diversified pipeline...... 34

Figure 3.8 Hazards in the diversified pipeline (from top to bottom): broadcast, reduction, and broadcast/reduction...... 35

Figure 3.9 Forwarding in the unified pipeline...... 36

Figure 4.1 Types of hardware multithreading [Patterson 2009]...... 40

Figure 4.2 Simple scheduler...... 44

Figure 4.3 Breakdown of execution time into useful cycles, stall cycles and wait cycles...... 45

Figure 4.4 Dependency-aware scheduler...... 46

Figure 4.5 Semaphore-aware scheduler...... 48

Figure 5.1 Registers in the MTASC architecture...... 51


Figure 5.2 Instruction formats...... 53

Figure 5.3 MTASC organization...... 58

Figure 5.4 Control unit organization...... 59

Figure 5.5 PE organization...... 62

Figure 5.6 Reduction network node...... 64

Figure 6.1 Associativity ratings for the benchmarks...... 69

Figure 6.2 Associative Jarvis March algorithm...... 71

Figure 6.3 Associative Minimum Spanning Tree (MST) algorithm...... 72

Figure 6.4 Associative Matrix-Vector Multiplication algorithm...... 74

Figure 6.5 Associative QuickHull algorithm...... 76

Figure 6.6 Associative VLDC String Matching algorithm...... 79

Figure 7.1 Comparison of thread schedulers executing Jarvis March algorithm...... 83

Figure 7.2 Comparison of thread schedulers executing MST algorithm...... 83

Figure 7.3 Comparison of thread schedulers executing Matrix-Vector Mult algorithm. . 84

Figure 7.4 Comparison of thread schedulers executing QuickHull algorithm...... 84

Figure 7.5 Comparison of thread schedulers executing VLDC String Match algorithm. 85

Figure 7.6 Comparison of multithreaded execution vs. single-threaded execution...... 87

Figure 7.7 Comparison of performance improvement for different benchmarks...... 87

Figure 8.1 ClearSpeed CSX700 [ClearSpeed 2011]...... 90

Figure 8.2 ClearSpeed Multithreaded Array Processor (MTAP) core [ClearSpeed 2011]...... 91

Figure 8.3 CUDA thread hierarchy [Nvidia 2010a]...... 96

Figure 8.4 CUDA [Nvidia 2010a]...... 98

Figure 8.5 Tesla GPU architecture [Nvidia 2009]...... 99


Figure 8.6 Tesla Streaming Multiprocessor [Nvidia 2009]...... 100

Figure 8.7 Warp execution in a Tesla SM [Nvidia 2009]...... 101


LIST OF TABLES

Table 8.1 Execution times for Maximum/Minimum Functions on ClearSpeed CSX...... 94


CHAPTER 1

Introduction

Single Instruction Multiple Data (SIMD) computers [Flynn 1972] are often used to solve computationally intensive problems that exhibit a high degree of fine-grain data parallelism. The appeal of the SIMD model is its simplicity, both for software and hardware. Software for SIMD computers is generally simpler than for other types of parallel computers because SIMDs maintain the same implicit instruction-level synchronization as a sequential computer. Since the programmer does not need to handle synchronization explicitly, development time is reduced. At the hardware level, a SIMD computer is built up from relatively simple pieces, making it easy to adapt the SIMD model to new technologies, such as systems-on-a-chip. In fact, commercial single-chip SIMD processor arrays are already available for a variety of applications [ClearSpeed 2011].

1.1. Architectural Trends

Until recently, single-core processor performance has doubled approximately every 18 months due in large part to increases in clock speeds. However, this trend is quickly coming to an end due to voltage leakage and heat dissipation limits [Sodan 2011]. While clock speeds are unlikely to improve much in the future, Moore's Law predicts that the number of transistors that can be placed on an integrated circuit will continue to grow at an exponential rate. Thus, the challenge for computer architects is to translate those additional transistors into additional performance. Two popular approaches to this challenge are the wide-issue superscalar processor and the chip multiprocessor (CMP) [Burger 2004].

1.1.1. Wide-Issue Superscalar Processors

A superscalar processor contains multiple functional units, each of which is capable of executing a single instruction [Johnson 1991]. At each cycle, the processor attempts to find multiple instructions which can be executed in parallel and then issues them to the functional units for execution. In a wide-issue superscalar processor, with a large number of functional units, it is often difficult to find enough independent instructions to keep all of the functional units busy each cycle. These processors make use of special hardware structures such as instruction windows and reorder buffers in order to extract a sufficient amount of parallelism from the instruction stream. However, the size of these structures grows quadratically with the number of functional units. Thus, wide-issue superscalars make inefficient use of available transistors, devoting a large percentage to hardware that does not actually perform any computation.

1.1.2. Chip Multiprocessors (CMPs)

Chip multiprocessors (CMPs) place multiple processor cores on a single chip, connected by an on-chip network [Culler 1998]. Since CMPs are typically constructed from simple processor cores, they make much more efficient use of available transistors than wide-issue superscalar processors. However, CMPs suffer from two major limitations: it is difficult to write software that can take advantage of the additional parallelism, and their scalability is limited by the scalability of the on-chip network.

Despite their limitations, one major advantage of wide-issue superscalar processors is that they can improve performance of existing single-threaded code. This is because the hardware itself extracts the parallelism from the code. CMPs require the compiler or (more often) the programmer to extract the parallelism from the code in order to improve performance. Unfortunately, writing parallel software for CMPs is no easy task, typically requiring manual mapping of tasks to processor cores and explicit synchronization.

The scalability of a CMP is highly dependent on the scalability of the on-chip network that connects the processor cores together. Many modern CMPs employ crossbar interconnects, which provide high performance but do not scale well. As CMPs scale up to take advantage of ever increasing transistor counts, it will be necessary to move to simpler interconnects such as meshes, making these architectures even more difficult to program.

1.2. An Alternative Approach: SIMD

SIMD processors represent an alternative approach which overcomes many of the limitations of both wide-issue superscalar processors and chip multiprocessors (CMPs).

Similar to a superscalar processor's functional units, a SIMD processor contains multiple processing elements (PEs), which all operate in parallel. However, the PEs in a SIMD processor all execute the same instruction, performing the same operation across multiple data values. Thus, the control unit in a SIMD is no more complex than that of a single-issue processor and its size is independent of the number of PEs. Virtually all of the transistors in a SIMD processor are devoted to executing instructions rather than finding instructions to execute.

SIMDs also expose the parallelism of the hardware to the software, requiring existing sequential software to be rewritten. However, programming a SIMD is much simpler than programming a CMP. A SIMD has a single instruction stream which is implicitly synchronized at the instruction level, just like a single-core processor or a superscalar. Thus, there is no need for the programmer to be concerned about multiple instruction streams or synchronization between them.

However, SIMDs often have a lower clock speed than other parallel computers built with the same technology. The problem is the broadcast/reduction bottleneck [Allen 1996]. As the number of processing elements (PEs) increases, the wires connecting the control unit and the PEs become longer and signals take longer to propagate. Since each instruction must be broadcast, this delay increases the clock cycle time, limiting system performance. Compounding the problem, most SIMD systems support reduction operations, which also act as a bottleneck. Pipelining the broadcast/reduction network helps to keep the clock cycle time short, but it introduces additional latency that, if not dealt with, can severely limit system performance. This effect is even more pronounced in associative SIMD processors, which make extensive use of the reduction network in order to simulate the associative searching capabilities of an associative memory.


1.3. MTASC Processor

This dissertation presents the design and implementation of the MTASC Processor, a multithreaded associative SIMD processor. The MTASC Processor uses a combination of pipelined broadcast and reduction and fine-grain hardware multithreading to overcome the broadcast/reduction bottleneck. In addition to the design and implementation of the MTASC Processor, the contributions of this dissertation research include the evaluation of three different hardware thread scheduling algorithms, the development of a cycle-accurate instruction-level simulator for the MTASC architecture, the adaptation of five associative algorithms to serve as benchmarks for the MTASC architecture, and the comparison of the MTASC architecture with two other similar, commercially available multithreaded data-parallel architectures.

1.4. Dissertation Organization

Chapter 2 provides an overview of the Associative Computing Model (ASC), the computational model that the MTASC Processor is based on. Chapter 3 introduces pipelining as one part of a solution to dealing with the broadcast/reduction bottleneck. Chapter 4 then discusses hardware multithreading, an implementation technique that, combined with pipelining, is able to overcome the broadcast/reduction bottleneck. Chapter 5 presents the MTASC Processor, including the instruction set architecture (ISA), the hardware organization, and the software simulator that was developed to measure the processor's performance and explore various architectural alternatives. Chapter 6 introduces a set of multithreaded associative benchmarks that were used to evaluate the MTASC Processor's performance. Chapter 7 presents the results from executing the benchmarks on the software simulator. Chapter 8 compares the MTASC Processor with two similar architectures. Chapter 9 presents the conclusions of this dissertation and discusses possible future work.

Appendix A contains a complete listing of the MTASC Processor's instruction set. Appendix B contains the source code for the MTASC Assembler and Simulator. Appendix C contains the MTASC assembly language code for the benchmarks discussed in Chapter 7. Appendix D contains the source code for the ASC Library for Cn.

CHAPTER 2

Associative Computing

Associative computing [Potter 1994] is a model that simplifies the task of writing software that exploits the high degree of data parallelism inherent in massively parallel SIMD arrays. Associative computing accomplishes this task by providing a set of primitive data-parallel operations that take full advantage of the parallelism available at the hardware level. Thus, all the programmer needs to do is use these primitive operations and his or her software will automatically be able to utilize whatever parallelism the hardware can provide. Associative computing has applications in bioinformatics [Steinfadt 2009], image processing [Wang 2004], parsing [Chitalia 2004], real-time command and control [Meilander 2003], and string matching [Singh 2006].

This chapter provides an overview of associative computing. It will discuss the historical basis of associative computing, from associative memories to notable associative processors. It will also describe ASC, a formalized model of associative computing.

Finally, it will discuss some of the previous work that has been done to implement the ASC model in hardware.

2.1. Background

Associative computing has its origin in associative memories, devices that locate data based on its content rather than its address. Most of the capabilities found in a modern associative computer can be directly traced back to these devices and their successors, associative processors.

Figure 2.1 An associative memory.

2.1.1. Associative Memories

Associative memory, also known as content-addressable memory (CAM), locates data based on its content rather than its address, in contrast to traditional random-access memory (RAM), which locates data by address [Chisvin 1989; Grosspietsch 1992]. As shown in Figure 2.1, an associative memory is organized as an array of words, each of which is typically hundreds or thousands of bits wide. To perform a search, the memory is presented with a comparand and a mask. The comparand is the value that is being sought and the mask indicates which bits in each word should be matched against the comparand. By using a mask, a memory word can be split into fields, like the columns of a table, and the search can be performed only on those fields which are of interest.

The result of the search is a response vector, containing one bit for each word, where a one in the response vector indicates that the corresponding word matched the comparand and a zero indicates a mismatch. The response store contains the result from the last search, allowing the results from multiple searches to be combined together. Typically, an associative memory also provides logic in the form of a global response operations unit to detect whether any bits in the response register are set or, if multiple bits are set, to allow each one to be selected in turn.
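To make the comparand-and-mask search concrete, the following C++ sketch models a word-parallel exact-match search in software. The 64-bit word width, the field layout, and the example values are assumptions made only for illustration; they do not describe any particular associative memory.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Sketch of an exact-match associative search over an array of (assumed) 64-bit
// words. Each word is compared with the comparand only in the bit positions
// selected by the mask; the result is one response bit per word.
std::vector<bool> associativeSearch(const std::vector<uint64_t>& words,
                                    uint64_t comparand, uint64_t mask) {
    std::vector<bool> response(words.size());
    for (std::size_t i = 0; i < words.size(); ++i) {
        // A word responds if it matches the comparand in every masked bit.
        response[i] = ((words[i] ^ comparand) & mask) == 0;
    }
    return response;
}

int main() {
    // Hypothetical layout: the low 16 bits of each word hold a key field.
    std::vector<uint64_t> memory = {0xAAAA0005, 0xBBBB0005, 0xCCCC0009};
    uint64_t comparand = 0x0005;   // value being sought in the key field
    uint64_t mask      = 0xFFFF;   // search only the key field

    std::vector<bool> resp = associativeSearch(memory, comparand, mask);
    for (std::size_t i = 0; i < resp.size(); ++i)
        std::cout << "word " << i << (resp[i] ? " responds" : " does not respond") << "\n";
}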

Associative memories of the type shown above are actually capable of much more than simple exact-match searches. The global response operations unit can be used to perform bit-serial global searches such as identifying the word with the maximum or minimum value, finding a word whose value is nearest to the comparand, etc. [Falkoff 1962; Parhami 1997]. In fact, it is this capability to quickly perform global searches that is one of the key characteristics of associative computing.

Associative memories can be classified into one of three general categories based on their implementation: fully associative, bit-serial, or word-serial. A fully associative memory has a comparator attached to each bit cell, allowing the entire memory to be searched in a single step. Unfortunately this design does not scale well, making large fully associative memories prohibitively expensive. Bit-serial associative memories reduce the cost of the hardware by sharing a single comparator between all the bits of a word. To perform a search, the bits of each word are cycled through the comparator one at a time, with all the words searched in parallel. Thus, the tradeoff for reducing the cost of the hardware is an increase in search time, which is now proportional to the number of bits per word. Finally, word-serial associative memories contain a single word-wide comparator which is shared by all the words in the memory. This reduces the hardware cost even more than a bit-serial memory, but now the time necessary to perform a search is proportional to the number of words. Hybrid implementations are also possible, such as byte-serial or k-bit-serial; such memories function the same as bit-serial, but 8 or k bits at a time instead of one.

2.1.2. Associative Processors

In practice, bit-serial associative memories represent the best tradeoff between hardware cost and performance. Even more important, however, is that bit-serial associative memories paved the way for associative processors [Parhami 1973; Weems 1997; Yau 1977]. An associative processor is similar in structure to a bit-serial associative memory except that the comparators are replaced by processing elements (PEs), which can perform a variety of arithmetic, logic, and comparison operations. Furthermore, the PEs can not only read data from their associated memory words, but they can write data back as well.

An associative memory can quickly locate data, but updating that data is still an inherently sequential task since the data must be transferred from the memory to the CPU and then back to the memory. An associative processor, however, can not only locate data quickly, but also update the data in parallel without having to transfer it to the CPU.

Figure 2.2 STARAN array module [Batcher 1974].

2.1.3. STARAN

The first commercially successful associative processor was the STARAN, developed by Goodyear Aerospace and completed in 1972 [Batcher 1974; Batcher 1982]. The heart of the STARAN architecture is the array module, shown in Figure 2.2, which consists of an array of 1-bit PEs, a multidimensional access (MDA) memory, a flip (permutation) network that connects the PEs to the memory, and a resolver unit that functions as the global response operations unit. The MDA memory is a 256 word by 256 bit memory designed so that it can be accessed in either bitslice mode or word mode. In bitslice mode, a single bit of each word can be accessed at once, allowing the 256 element PE array to operate on the data in parallel. In word mode, all 256 bits of a single word can be accessed for efficient input or output. The flip network makes it possible to move data between PEs in parallel and is particularly well-suited to fast Fourier transforms (FFT) and sorting.

2.1.4. Massively Parallel Processor (MPP)

The Massively Parallel Processor (MPP), also developed by Goodyear Aerospace, is based on a simpler architecture, but achieves an exceptionally high level of performance by exploiting massive parallelism [Batcher 1980; Batcher 1982]. The MPP contains 16,384 1-bit PEs organized as a 128 PE by 128 PE square. The PEs contain specialized hardware for performing multiplication and floating-point arithmetic and can communicate through a 2D mesh interconnection network. A sum-or tree, capable of combining together the outputs from all 16,384 PEs, serves as the global response operations unit.

2.2. The Associative Computing Model (ASC)

During the 1990s, several researchers from the Department of Computer Science at Kent State University developed a computational model based on associative processors like the STARAN and MPP [Potter 1992; Potter 1994]. The Associative Computing Model (ASC) formalizes the capabilities generally found in associative processors, making it possible to develop algorithms that take advantage of these capabilities and to analyze their performance. In this way, ASC is similar to the Parallel Random-Access Machine (PRAM) and other parallel computational models.

Figure 2.3 Conceptual organization of an ASC computer.

Figure 2.3 shows the conceptual model of a computer in the ASC model. There is a single control unit, also known as an instruction stream (IS), and multiple processing elements (PEs), each with its own local memory. Taken together, a PE and its local memory are sometimes referred to as a cell. The control unit and PE array are connected through a broadcast/reduction network and the PEs are connected together through a PE interconnection network.

ASC is a SIMD architecture. The control unit fetches and decodes program instructions and broadcasts control signals to the PEs. The PEs, under the direction of the control unit, execute these instructions using their own local data. All PEs execute instructions in a lock-step manner, with an implicit synchronization between every instruction, much like a sequential processor. SIMD computers are particularly well suited for fine-grain data parallel programming, where the same operation is applied to each element of a data set.

ASC extends the basic capabilities of a SIMD computer by providing a set of primitive high-speed data-parallel operations. These operations can be grouped into four general categories: associative search, responder detection, responder selection/iteration, and maximum/minimum search.

2.2.1. Associative Search

The basic operation in an associative algorithm is the associative search. The associative search operation in ASC is very similar to, and in fact modeled on, the search operation of an associative memory. An associative search simultaneously locates all the PEs whose local data matches a given search key. The PEs with matching data are called responders and those with non-matching data are called non-responders. The search key can be compared with the PEs' local data using any basic comparison operator, such as equals, greater than, less than, etc.

Typically, multiple associative searches are performed and their results are combined together in order to determine the final set of responders. The state of each PE (responder or non-responder) is encoded as a bit stored within a PE register. Logical operations on these responder bits can then be used to perform various set operations: logical AND for intersection, logical OR for union, etc. After the final set of responders is known, the algorithm can restrict further processing to only affect the responders by deactivating the non-responders (or vice versa).
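As a rough illustration of how responder bits are combined and then used to deactivate the non-responders, the C++ sketch below keeps one responder bit per PE; the record fields and the two searches are invented for the example and are not taken from any MTASC benchmark.

#include <cstddef>
#include <iostream>
#include <vector>

// Toy per-PE record; the fields are illustrative only.
struct Cell { int weight; int color; bool active; };

int main() {
    std::vector<Cell> pes = {{3, 1, true}, {7, 1, true}, {7, 2, true}, {5, 1, true}};

    // Two associative searches, each producing one responder bit per PE.
    std::vector<bool> r1(pes.size()), r2(pes.size());
    for (std::size_t i = 0; i < pes.size(); ++i) {
        r1[i] = (pes[i].weight == 7);   // search 1: weight equal to 7
        r2[i] = (pes[i].color  == 1);   // search 2: color equal to 1
    }

    // Combine the responder sets: logical AND corresponds to set intersection.
    // Deactivating the non-responders restricts later data-parallel steps to
    // the final responder set.
    for (std::size_t i = 0; i < pes.size(); ++i)
        pes[i].active = r1[i] && r2[i];

    for (std::size_t i = 0; i < pes.size(); ++i)
        std::cout << "PE " << i << (pes[i].active ? ": responder" : ": deactivated") << "\n";
}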


2.2.2. Responder Detection

After performing a search, an algorithm will often want to know whether there were any responders or, sometimes, exactly how many responders there were. Typically this is used as an optimization: if there are no responders to the search, then there is little point in performing the steps associated with processing them. Other times, the lack of responders indicates an error condition and the algorithm must take an appropriate action.

Most associative processors only provide a binary responder count: some responders or no responders. This is the simplest approach in terms of hardware implementation as it only requires a logical OR of the responder bits from each PE. In spite of its simplicity, the some/none response is all that is required by the ASC model, where it is called the AnyResponders function, as it is sufficient for most associative algorithms [Jin 2001].

In addition to the binary some/none indicator, some associative processors provide a ternary response count: no responders, exactly one responder, or more than one responder; still other processors can provide an exact count of the number of responders [Parhami 1973]. The one/multiple/none indicator can be useful when implementing maximum/minimum search, if this capability is not provided by the hardware and must be emulated in software [Falkoff 1962]. An exact count of responders can be used to implement efficient rank-based selection [Parhami 1997].
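The C++ sketch below shows how the some/none, one/multiple/none, and exact-count indicators can all be derived from the responder bits with an OR and a population count; the function names are illustrative and are not drawn from any particular instruction set.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Some/none indicator (AnyResponders): a logical OR of all responder bits.
bool anyResponders(const std::vector<bool>& resp) {
    return std::any_of(resp.begin(), resp.end(), [](bool b) { return b; });
}

// Exact responder count: a population count of the responder bits.
std::size_t countResponders(const std::vector<bool>& resp) {
    return static_cast<std::size_t>(std::count(resp.begin(), resp.end(), true));
}

// Ternary (none/one/multiple) indicator derived from the exact count.
enum class Responders { None, One, Multiple };
Responders classifyResponders(const std::vector<bool>& resp) {
    std::size_t n = countResponders(resp);
    if (n == 0) return Responders::None;
    return (n == 1) ? Responders::One : Responders::Multiple;
}

int main() {
    std::vector<bool> resp = {false, true, false, true};
    std::cout << "any: " << anyResponders(resp)
              << "  count: " << countResponders(resp)
              << "  multiple: " << (classifyResponders(resp) == Responders::Multiple) << "\n";
}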

2.2.3. Responder Selection/Iteration

If, after performing an associative search, an algorithm determines that there are responders to process, it can process those responders in one of three different modes: parallel, sequential, or single selection. Parallel responder processing performs the same set of operations on each responder simultaneously by deactivating the non-responders and then broadcasting instructions to the PE array. Sequential responder processing selects each responder individually, one at a time. This is typically used when the algorithm needs to handle each responder differently or when performing inherently sequential operations such as input and output. Finally, single responder selection selects one, arbitrarily chosen, responder to undergo processing. This is often used after a maximum or minimum search, where all responders have the same data.

The basic function that enables both sequential and single selection modes of operation is called PickOne. The PickOne function selects a single responder from a set. For single selection mode, this function is applied once. For sequential mode, the PickOne function is invoked repeatedly to select each responder and continues until there are no responders left. The PickOne function can be implemented directly in hardware using a multiple response resolver, like the STARAN, or it can be implemented in software by performing a maximum or minimum search on a unique PE identification number.
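A simple software emulation of PickOne, along the lines suggested above, selects the responder with the smallest PE identification number. The C++ sketch below uses that approach to drive a sequential responder-processing loop; the responder vector and the processing step are placeholders for illustration.

#include <cstddef>
#include <iostream>
#include <vector>

// PickOne emulated in software: select the responder with the smallest PE id.
// Returns the selected PE index, or -1 if there are no responders.
int pickOne(const std::vector<bool>& resp) {
    for (std::size_t id = 0; id < resp.size(); ++id)
        if (resp[id]) return static_cast<int>(id);
    return -1;
}

int main() {
    std::vector<bool> responders = {false, true, false, true, true};

    // Sequential responder processing: repeatedly pick a responder, process it,
    // then clear its bit, until no responders remain.
    for (int id = pickOne(responders); id != -1; id = pickOne(responders)) {
        std::cout << "processing responder PE " << id << "\n";   // placeholder work
        responders[static_cast<std::size_t>(id)] = false;        // remove it from the set
    }
}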

2.2.4. Maximum/Minimum Search

In addition to simple searches, where each PE compares its local data against a search key, an associative computer can also perform global searches, where data from the entire PE array is combined together to determine the set of responders. The most common type of global search is the maximum/minimum search. The responders to a maximum/minimum search are those PEs whose data is the maximum or minimum value across the entire PE array.
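One common way to realize a maximum search on a bit-serial associative machine, in the spirit of the Falkoff-style global searches mentioned in Section 2.1.1, is to scan the candidates' bits from most significant to least significant, eliminating candidates that lack a 1 whenever some candidate has one. The C++ sketch below mimics that bit-serial elimination on ordinary integers; it is purely illustrative and does not reflect the MTASC hardware.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Bit-serial maximum search: scan bit positions from MSB to LSB and, whenever
// at least one remaining candidate has a 1 in the current position, eliminate
// the candidates that have a 0 there. The PEs still marked as candidates at
// the end hold the maximum value.
std::vector<bool> maxSearch(const std::vector<uint32_t>& values, int bits = 32) {
    std::vector<bool> candidate(values.size(), true);
    for (int b = bits - 1; b >= 0; --b) {
        bool anyOne = false;
        for (std::size_t i = 0; i < values.size(); ++i)
            anyOne = anyOne || (candidate[i] && ((values[i] >> b) & 1u));
        if (!anyOne) continue;                  // no candidate has a 1 here; keep them all
        for (std::size_t i = 0; i < values.size(); ++i)
            candidate[i] = candidate[i] && ((values[i] >> b) & 1u);
    }
    return candidate;                           // responder bits for the maximum value
}

int main() {
    std::vector<uint32_t> data = {12, 45, 7, 45, 30};
    std::vector<bool> resp = maxSearch(data);
    for (std::size_t i = 0; i < resp.size(); ++i)
        if (resp[i]) std::cout << "PE " << i << " holds the maximum (" << data[i] << ")\n";
}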

2.2.5. PE Interconnection Network

Most associative processors include some type of PE interconnection network to allow parallel data movement within the array. The ASC model itself does not specify any particular interconnection network and, in fact, many useful associative algorithms do not require one. Typically associative processors implement simple networks such as 1D linear arrays or 2D meshes. These networks are simple to implement and allow data to be transferred quickly in a synchronous manner. For associative algorithms that require more powerful PE interconnections, the coterie network has been proposed [Herbordt 1997]. A coterie network allows the PEs in an associative processor to dynamically group themselves together into arbitrary patterns.

2.2.6. MASC

If both the responders and the non-responders of a search need to be processed (but in different ways), then this processing has to be serialized. The responders are selected and processed first, then the non-responders. The serialization is necessary because a single control unit can broadcast only one sequence of instructions at a time.


Figure 2.4 Conceptual organization of a MASC computer.

The MASC model is an extension of ASC that overcomes this limitation by allowing for multiple control units. Figure 2.4 shows the organization of a MASC computer. In addition to the extra control units, there is a control unit interconnection network to enable communication and synchronization between the control units. Each control unit in a MASC computer controls a dynamic partition of the PEs. After performing a search, the responders are assigned to one control unit while the non-responders are assigned to another. Both control units can then broadcast separate instruction sequences simultaneously to their respective PEs [Chantamas 2005; Scherger 2003].

For this dissertation I will only be considering the single control unit ASC model. However, most of this research could easily be applied to the MASC model as well. This may be explored in future work.


2.3. ASC Processor Prototypes

Several students at Kent State have already developed prototype processors that implement the associative computing model. While the processor presented in this dissertation is not a direct extension of any of these previous processors, it nevertheless borrows ideas from these earlier works.

2.3.1. First Processor

Meiduo Wu and Yanping Wang [Walker 2001; Wu 2002] developed the first version of the ASC processor. This first processor had an 8-bit control unit, four 8-bit PEs, and a broadcast/reduction network capable of maximum/minimum search, responder detection, and all three modes of multiple responder resolution. The instruction set was modeled after RISC processors such as MIPS and DLX, with associative features similar to the STARAN. The processor was designed in VHDL and targeted at an Altera FLEX 10K70 FPGA. However, the final design did not fit on the FPGA and so it was never fully implemented.

2.3.2. Scalable Processor

Hong Wang [2003] later improved on the first processor by making it scalable, so that the number of PEs could be varied easily. In order to make the processor scalable, the broadcast/reduction network had to be redesigned, as the network in the first processor could only handle four PEs. There were no major changes to the instruction set architecture, so that the scalable processor could run most software written for the first one. The new scalable processor was then implemented in a much larger Altera APEX 20K1000 FPGA, which could hold the control unit and up to 50 PEs.

Later, Hong Wang [2004] and Lei Xie added a PE interconnection network to the processor. The network could operate either as a 1D linear array or as a 2D mesh. The processor with the network was then tested on a number of applications, including string matching and image processing.

2.3.3. Pipelined Processor

In order to improve performance, Hong Wang [2005] further modified the scalable processor by pipelining instruction execution. Wang used a classic RISC pipeline with five stages: instruction fetch, instruction decode, execution, memory access, and write back. While this pipeline did improve performance over the original non-pipelined implementation, it still suffered from the broadcast/reduction bottleneck, because the broadcast and reduction operations were not pipelined. The new processor also added a reconfigurable network that was capable of skipping over PEs that were disabled.

2.3.4. MASC Processor

Hong Wang [2006] also completed the first processor to implement the MASC model by adding multiple control units to the ASC processor. This processor is able to dynamically partition the PE array and assign control of different PEs to different control units. This allows the processor to exploit both control parallelism and data parallelism. The MASC Processor was implemented on the Altera APEX 20K1000 FPGA with 3 control units and 9 PEs.


2.3.5. MTASC and Previous ASC Processors

These previous ASC processors focused primarily on providing an implementation of the ASC/MASC model in hardware, something which did not exist previously. With the Pipelined ASC Processor, Wang did attempt to address performance specifically. However, his processor only pipelines instruction execution within the PEs, assuming that the latencies of the broadcast and reduction networks are relatively short. The MTASC Processor goes one step further and attempts to improve performance for processors with many PEs, where broadcast and reduction latencies are relatively high.

2.4. Summary

This chapter presented an overview of associative computing. It discussed associative memories and processors, showing how associative computing developed from these earlier technologies. It introduced the ASC model and discussed the primitive operations that a processor must support to implement the model. Finally, it discussed the earlier ASC processors developed by other Kent State students.

CHAPTER 3

Pipelining

As discussed in Chapter 1, the broadcast/reduction bottleneck can severely limit the performance of a SIMD processor, especially an associative SIMD processor. Pipelining the broadcast and reduction networks is one technique that can be used to mitigate the effects of the bottleneck.

This chapter will discuss the issues involved in designing an efficient pipeline for an associative SIMD processor. It will provide some background information on pipelining and pipeline hazards, discuss the specific issues that arise in designing a pipelined SIMD processor, and conclude with a comparison of two possible pipeline organizations for the MTASC Processor. Some of the material in this chapter was previously published in [Schaffer 2007].

3.1. Background

Pipelining is an implementation technique used in virtually every modern processor to improve performance by exploiting instruction-level parallelism (ILP) [Hennessy 2007]. By dividing instruction execution into multiple stages, it is possible to have several instructions executing simultaneously within the processor, each one in a different stage. Since the clock cycle time is now defined by the time to complete a single stage, instead of the time to complete an entire instruction, the processor can operate at a much higher clock rate. Assuming that the processor can begin execution of a new instruction every cycle, the number of instructions completed per unit time (instruction throughput) will be improved significantly.

Figure 3.1 Classic six-stage RISC pipeline.

3.1.1. Pipelining Basics

In theory, a pipelined processor that divides instruction execution into k stages can achieve a speedup of k. In practice, however, the processor will not always be able to begin execution of a new instruction every cycle, and this reduces the potential performance gain from implementing pipelining. Thus the key to designing high-performance pipelined processors is to identify those situations that prevent the processor from beginning execution of a new instruction (hazards) and either eliminate them completely or, at the very least, mitigate their effects as much as possible.
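As a rough quantitative illustration, using the usual textbook approximation rather than measurements of any particular machine, the effective speedup of a k-stage pipeline is approximately k divided by one plus the average number of stall cycles per instruction. The short C++ snippet below evaluates that relation for a few assumed stall rates.

#include <initializer_list>
#include <iostream>

// Approximate speedup of a k-stage pipeline, degraded by stalls:
//   speedup ~= k / (1 + average stall cycles per instruction)
double pipelineSpeedup(int k, double stallsPerInstruction) {
    return k / (1.0 + stallsPerInstruction);
}

int main() {
    const int k = 6;  // assumed six-stage pipeline, as in Figure 3.1
    for (double stalls : {0.0, 0.5, 1.0, 2.0})
        std::cout << "stalls/instruction = " << stalls
                  << "  ->  speedup ~ " << pipelineSpeedup(k, stalls) << "\n";
}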

Though the history of pipelined processors goes all the way back to the Stretch computer (IBM 7030), completed in 1961, it was the introduction of the early reduced instruction set computers (RISCs), such as MIPS, during the 1980s that brought pipelining to the desktop computer [Patterson 2009]. These early RISC processors typically featured a five- or six-stage pipeline.

An example of a six-stage pipeline is shown in Figure 3.1. The instruction fetch (IF) stage fetches the next instruction from the instruction cache. The instruction decode (ID) stage decodes the fetched instruction, generates the control signals necessary to execute that instruction, and checks for hazards. The register read (RD) stage reads the instruction's source operands (if any) from the register file. The execute (EX) stage executes ALU instructions and computes the effective address for load or store instructions. The memory access (MEM) stage performs a read from or write to the data cache for load or store instructions. Finally, the write back (WB) stage writes the instruction's result (if any) back into the register file.

3.1.2. Hazards

A pipeline hazard is a condition that prevents the processor from issuing the next instruction. For the purposes of this dissertation, instruction issue is defined as the process by which an instruction exits the decode stage. If nothing is done to eliminate the hazard, then the processor must stall the pipeline in order to ensure correct operation. Stalling the pipeline prevents the instructions in the decode stage and all earlier stages from advancing, while allowing instructions later in the pipeline to continue. This is accomplished by inserting a bubble or no-op into the pipeline to take the place of the stalled instruction.

Figure 3.2 illustrates this process: in cycle 3 the processor detects a hazard between the lw (load word) instruction and the and instruction. The lw instruction produces its result in the write back (WB) stage, which it will reach during cycle 6. However, the and instruction consumes the result of the lw instruction during the execute (EX) stage, which, under normal circumstances, it would reach during cycle 5. To resolve the hazard, the processor stalls the execution of the and instruction (and all subsequent instructions) by one cycle. Now, the and instruction consumes the result of the lw instruction when it is first available in cycle 6. This is an example of a data hazard, one of three types of pipeline hazards: data hazards, control hazards, and structural hazards.

Figure 3.2 Pipeline stall.

Data hazards occur when there is a data dependency between instructions, as shown in Figure 3.2. Data hazards can be further subdivided into read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) hazards. RAW hazards result from a true data dependency between instructions, where a later instruction uses the result of an earlier instruction that has not yet completed. In most cases, the result of the earlier instruction has been computed, but has yet to be written back into the register file. The solution in this case is forwarding (or bypassing), where the result of the earlier instruction is transferred directly to the later instruction, bypassing the register file [Hennessy 2007]. WAR and WAW hazards are not true data dependencies, but anti-dependencies and output dependencies, respectively; as such, these hazards can almost always be eliminated architecturally. For example, by having a single pipeline stage dedicated to writing results to the register file and ensuring that instructions enter this stage in a strictly first-in, first-out order, WAR and WAW hazards can be avoided completely.
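To make the RAW-hazard and forwarding discussion concrete, the fragment below sketches the kind of comparison a pipeline might make between a consumer's source register and the destination registers of instructions still in flight. The structure fields, stage names, and register numbers are invented for the example and do not describe the MTASC control logic.

#include <iostream>

// Minimal model of an older instruction still in flight (e.g., in EX or MEM).
struct InFlight {
    bool writesReg;   // does this instruction write a register?
    int  destReg;     // which register it writes
};

// A RAW hazard on srcReg exists if an older, uncommitted instruction writes
// that register. With forwarding, the value is taken from the producer's
// pipeline latch instead of stalling until it reaches write back.
bool needsForwarding(int srcReg, const InFlight& producer) {
    return producer.writesReg && producer.destReg == srcReg;
}

int main() {
    InFlight inEX  = {true, 5};   // e.g., "add r5, r1, r2" currently in EX
    InFlight inMEM = {true, 8};   // e.g., "lw  r8, 0(r3)"  currently in MEM

    int src = 5;                  // the next instruction reads r5
    if (needsForwarding(src, inEX))
        std::cout << "forward r" << src << " from the EX/MEM latch\n";
    else if (needsForwarding(src, inMEM))
        std::cout << "forward r" << src << " from the MEM/WB latch\n";
    else
        std::cout << "read r" << src << " from the register file\n";
}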

Control hazards (or branch hazards) occur when the processor does not know which instruction to fetch next because of an earlier branch instruction that has not completed. This can occur either because the branch target address is unknown or, in the case of a conditional branch, because the branch outcome (taken or not taken) is unknown. Note that, technically, control hazards prevent an instruction from exiting the fetch stage, as opposed to the decode stage.

Most pipelined processors employ delayed branches, where the instruction following a branch is always executed, in order to avoid designing a pipeline that can stall at two different points [Lilja 1988]. Control hazards caused by an unknown branch target address can typically be resolved by moving the branch target calculation earlier in the pipeline. Control hazards due to conditional branches are commonly dealt with by branch prediction. If the processor predicts the branch correctly, then there is no performance penalty; but, if the prediction is wrong, the processor must flush the pipeline, cancelling the instructions that should not have been executed. Branch prediction schemes vary from simple, static schemes (e.g., predict not taken) to complex, dynamic schemes (e.g., branch history tables) [Lilja 1988].

Structural hazards occur when two instructions in different pipeline stages require the same resource at the same time. For example, attempting to fetch an instruction and read a word of data from the same cache simultaneously results in a structural hazard. The cache has a single read port, so it cannot perform an instruction fetch and a data read in the same cycle. Structural hazards can generally be solved by adding more resources to the processor, such as using separate instruction and data caches or building a cache with multiple read ports.

3.2. Pipelining a SIMD Processor

In designing a pipelined SIMD processor, there are some SIMD-specific issues that we must deal with in addition to the issues previously discussed with regard to pipelining a scalar processor. First, there are actually two ways to pipeline a SIMD processor: pipelining instruction execution and pipelining the broadcast and reduction networks [Allen 1995]. Second, there are additional types of hazards which are SIMD-specific.

3.2.1. SIMD Instruction Types

Before discussing the issues involved in designing a SIMD pipeline, it is helpful to first discuss the types of instructions that a SIMD processor must be able to handle. Instructions in a SIMD processor can be classified into one of four categories based on which subsystems they use.

Scalar instructions execute entirely within the scalar execution unit. This includes the types of instructions found in a typical scalar processor, such as ALU instructions, load/store instructions, branches, etc. Scalar instructions consume scalar operands and (optionally) produce scalar results.

Parallel instructions execute on the Processing Element (PE) array. The instructions originate in the control unit and, as such, must be distributed to the PE array via the broadcast network. Once each PE receives a copy of the instruction, the PEs can then execute those instructions using data stored locally within each PE. Parallel instructions consume parallel operands and produce parallel results.

Broadcast instructions transfer data from the scalar execution unit to the PE array. Both the instruction and its scalar operand must traverse the broadcast network before executing in the PEs. Broadcast instructions consume scalar and parallel operands and produce parallel results. Since the data to be broadcast in a broadcast instruction follows the same path as the instruction itself, there is practically no difference between a parallel instruction and a broadcast instruction, at least as far as pipeline design is concerned. Therefore parallel and broadcast instructions are usually grouped together.

Reduction instructions transfer data from the PE array to the scalar execution unit. As with parallel and broadcast instructions, the instruction itself must be distributed, through the broadcast network, to the PEs. The PEs then send their data to the reduction network, where it is combined into a scalar value and then sent to the scalar execution unit. Reduction instructions consume parallel operands and produce scalar results.

3.2.2. Pipelining Instruction Execution

The first step in pipelining a SIMD processor is pipelining instruction execution [Allen 1995]. Pipelining instruction execution is much the same for a SIMD processor as it is for a scalar processor, and so we start off with the same pipeline as we would for a scalar processor: instruction fetch (IF), instruction decode (ID), register read (RD), execute (EX), memory access (MEM), and write back (WB). The fetch and decode stages occur within the control unit, while the remaining stages occur in the PE array for parallel instructions or in the scalar execution unit for scalar instructions. Broadcast and reduction operations are not pipelined and therefore complete in a single pipeline stage: broadcast in the decode stage and reduction in the execute stage. This type of pipelining has already been implemented in both a general SIMD processor [ClearSpeed 2011] and an associative SIMD processor [Wang 2005].

Figure 3.3 Pipelined broadcast network.

3.2.3. Pipelining the Broadcast and Reduction Networks

While pipelining instruction execution does improve performance in a SIMD processor, it does nothing to mitigate the effects of the broadcast/reduction bottleneck. As the number of PEs increases, the performance of a SIMD processor will be limited more by the time it takes to perform broadcast and reduction operations than anything else. To overcome the broadcast/reduction bottleneck, it is necessary to pipeline the broadcast and reduction networks. A pipelined broadcast network, as shown in Figure 3.3, is a k-ary tree with a register at each node. It can accept a new instruction each clock cycle and it delivers the instruction to the PE array after a latency of log_k n cycles, for a machine with n PEs.

Figure 3.4 Pipelined reduction network.

A pipelined reduction network, as shown in Figure 3.4, is similar except that data flows in the opposite direction and at each node there is a functional unit that combines k values together before storing the result in a register. Since the distance that signals must propagate in a pipelined broadcast or reduction network is shorter, the network can operate at a much higher clock rate than a non-pipelined network. The downside to this approach, however, is that the pipelined networks introduce additional hazards into the pipeline.
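The latency relation stated above can be put in concrete terms: a k-ary broadcast or reduction tree spanning n PEs has on the order of log_k n register stages, so its latency grows logarithmically with the number of PEs. The C++ snippet below evaluates this for a few array sizes; the fan-in/fan-out of 4 is an arbitrary example, not an MTASC parameter.

#include <initializer_list>
#include <iostream>

// Number of register stages (cycles of latency) for a k-ary broadcast or
// reduction tree spanning n PEs: the smallest s such that k^s >= n.
int treeLatency(int n, int k) {
    int stages = 0;
    for (long long span = 1; span < n; span *= k)  // each stage multiplies coverage by k
        ++stages;
    return stages;
}

int main() {
    const int k = 4;  // assumed fan-in/fan-out of each tree node
    for (int n : {16, 64, 256, 1024, 4096})
        std::cout << n << " PEs -> " << treeLatency(n, k)
                  << "-cycle broadcast/reduction latency\n";
}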

3.2.4. SIMD-Specific Pipeline Hazards

In addition to the hazards typically found in scalar processor pipelines, there are also hazards which are specific to SIMD processors. Broadcast hazards occur when a parallel instruction uses the result of an earlier scalar instruction. The parallel instruction must wait for the scalar result to traverse the broadcast network before it is available for use. Reduction hazards occur when a scalar instruction uses the result of an earlier reduction instruction and the scalar instruction must wait for the reduction network to produce the result. Broadcast-reduction hazards are, as the name implies, a combination of a broadcast hazard and a reduction hazard. They occur when a parallel instruction uses the result of an earlier reduction instruction. The parallel instruction must wait both for the reduction network to compute the result and for the broadcast network to then distribute that result to the PE array.

Broadcast, reduction, and broadcast-reduction hazards are all types of data hazards. As mentioned earlier, the typical solution to a data hazard is forwarding. Unfortunately, the distance between the leading instruction (which produces the result) and the trailing instruction (which consumes the result) is a function of the number of PEs. For any processor with a reasonable number of PEs, the distance is far too great to be solved with forwarding alone. The second commonly used approach to dealing with data hazards, static scheduling, also fails in this instance due to the large gap between when the data is computed and when it is used.

3.3. MTASC Pipeline

During the course of this dissertation research, two different pipeline designs were considered: a unified pipeline, which sends all instructions through the same pipeline stages, and a diversified pipeline, which only sends an instruction through those pipeline stages that are necessary for its execution.

Figure 3.5 Unified pipeline.

3.3.1. Unified Pipeline

The unified pipeline is the classic RISC six-stage pipeline with additional stages for the broadcast and reduction networks. Figure 3.5 shows the pipeline organization. The instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM), and write back (WB) stages are basically the same as in a scalar pipeline. The register read stage is split into a scalar register read stage (SR) and a parallel register read stage (PR). Separate stages are made necessary by broadcast instructions, which can use both a scalar operand, which must be broadcast along with the instruction, and a parallel operand. The number of broadcast (Bi) and reduction (Ri) stages is variable and depends on the number of PEs and the organization of the broadcast and reduction networks. Note that the first two reduction stages (R1 and R2) overlap with the EX and MEM stages and are therefore not shown in the figure.

3.3.2. Hazards in the Unified Pipeline

In the unified pipeline, broadcast hazards, which occur when a broadcast instruction uses the result of a scalar instruction, can stall the pipeline for b + 1 clock cycles in the worst case, where b is the latency of the broadcast network. As shown in the first example of Figure 3.6, the result of the scalar instruction, add, is not available until it enters the memory access stage (MEM), while the broadcast instruction, sub, needs it by the first broadcast stage (B1).

Figure 3.6 Hazards in the unified pipeline (from top to bottom): broadcast, reduction, broadcast/reduction.

Reduction hazards, which occur when a scalar instruction uses the result of a reduction instruction, can stall the pipeline up to r – 1 clock cycles, where r is the latency of the reduction network. As shown in the second example of Figure 3.6, the scalar instruction, add, consumes its operands in the execute stage (EX), but the result of the reduction instruction, max, is not available until its write back stage (WB).

Broadcast-reduction hazards, which occur when a broadcast instruction uses the result of a reduction instruction, can stall the pipeline up to b + r clock cycles. The third example in Figure 3.6 illustrates this type of hazard in the unified pipeline. The broadcast instruction, sub, needs its scalar operand by the first broadcast stage (B1), but the reduction instruction, max, does not produce its result until the write back stage (WB).
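To put rough numbers on these worst-case penalties, the snippet below simply evaluates the stall expressions given in this section for an assumed broadcast depth b and reduction depth r; the values are arbitrary examples, not MTASC parameters.

#include <iostream>

int main() {
    const int b = 5;  // assumed broadcast network latency in cycles
    const int r = 5;  // assumed reduction network latency in cycles

    // Worst-case stall cycles in the unified pipeline, per the text above.
    std::cout << "broadcast hazard:           up to " << (b + 1) << " stall cycles\n";
    std::cout << "reduction hazard:           up to " << (r - 1) << " stall cycles\n";
    std::cout << "broadcast-reduction hazard: up to " << (b + r) << " stall cycles\n";
}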


Figure 3.7 Diversified pipeline.

3.3.3. Diversified Pipeline

The diversified pipeline differs from the unified pipeline in that it only sends an instruction through those stages which are necessary for its execution. For example, a parallel instruction does not use the reduction network, so sending the instruction through the reduction stages (R1–Rn) unnecessarily delays its completion. Similarly, scalar instructions use neither the broadcast nor the reduction networks and, therefore, need not go through either the broadcast (B1–Bn) or reduction (R1–Rn) stages. Thus, the diversified pipeline provides customized paths for each instruction type.

Figure 3.7 shows the organization of the diversified pipeline. On the left are the instruction fetch (IF), instruction decode (ID), and scalar register read (SR) stages, which are common to all instruction types. However, after the scalar register read stage (SR) the pipeline splits, with scalar instructions moving on to the execute stage (EX), and broadcast, parallel, and reduction instructions moving on to the first broadcast stage (B1). The parallel pipeline then splits once again after the parallel register read stage (PR), with broadcast and parallel instructions moving on to the execute stage (EX) and reduction instructions moving on to the first reduction stage (R1).

Figure 3.8 Hazards in the diversified pipeline (from top to bottom): broadcast, reduction, and broadcast/reduction.

3.3.4. Hazards in the Diversified Pipeline

The diversified pipeline eliminates stalls due to broadcast hazards. The scalar instruction no longer has to go through the broadcast stages (B1–Bn), so its result is available earlier. Thus, forwarding alone is able to compensate for the hazard, as shown in the first example of Figure 3.8.

Reduction hazards can stall for up to b + r clock cycles, as shown in the second example of Figure 3.8. This appears to be an increase over the worst-case penalty for reduction hazards in the unified pipeline, which stalled for at most r – 1 clock cycles. However, if you compare Figure 3.6 with Figure 3.8, you will see that the reason for the unified pipeline's apparently better performance is that the scalar instruction must go through the broadcast stages (B1–Bn) and the parallel register read stage (PR), even though it does nothing in those stages. Thus the unified pipeline, by forcing the scalar instruction through stages it does not need, is merely hiding the stall cycles that would otherwise be there.

Figure 3.9 Forwarding in the unified pipeline.

Broadcast-reduction hazards can stall the pipeline for up to b + r clock cycles, the same as the unified pipeline. This is illustrated in the third example of Figure 3.8.

3.3.5. Comparison of Unified and Diversified Pipelines

At first it may appear that the two pipeline organizations would have similar performance, since the number of stall cycles is the same, just distributed differently between the different types of hazards. However, as discussed in the previous section, this is only because the unified pipeline hides some stall cycles with unused pipeline stages; the end result is the same. Taking this into consideration, the unified pipeline actually stalls more often than the diversified pipeline.

However, another major disadvantage of the unified pipeline that is not evident from the previous figures is that it requires significantly more forwarding hardware than the diversified pipeline. This is a direct result of the fact that the unified pipeline forces each instruction to go through the same sequence of pipeline stages, even when the instruction does not require them. Thus, most instructions compute their results early in the pipeline, but must wait until the final stage to write those results back to the register file. To avoid stalling later instructions waiting on those results, multiple forwarding paths must be available to send those results to the dependent instructions. This is illustrated in Figure 3.9.

The diversified pipeline is clearly the better choice since it stalls less often than the unified pipeline and requires less hardware to implement forwarding. Furthermore, another advantage of the diversified pipeline is that it can issue two instructions simultaneously, one to the broadcast/parallel/reduction pipelines and one to the scalar pipeline. Though not implemented in the version of the MTASC Processor described in this dissertation, this could be used in a future version of the processor to significantly improve performance.

3.4. Summary

This chapter examined pipelining as a potential solution to the broadcast/reduction bottleneck. It provided some background information on pipelining in general as well as a discussion of the specific issues that arise when pipelining a SIMD processor. Two potential pipeline organizations for the MTASC Processor were considered: a unified pipeline and a diversified pipeline. The diversified pipeline is the better choice as it stalls less often than the unified pipeline and it requires less hardware to implement forwarding. Additionally, the diversified pipeline supports simultaneous issue of instructions to the PE array and the scalar execution unit.

CHAPTER 4

Multithreading

As discussed in Chapter 3, pipelining the broadcast and reduction networks of a SIMD processor can help to alleviate the broadcast/reduction bottleneck. Unfortunately, pipelining the networks also introduces additional hazards, which if not dealt with, can easily cancel out any potential gains in performance. The typical techniques for dealing with hazards in a pipelined processor, such as forwarding and static scheduling, can help reduce the number of stalls, but cannot eliminate them completely. In a pipelined associative SIMD processor, the latencies of the broadcast/reduction networks are too large for these techniques alone to be effective.

This chapter will discuss hardware multithreading as a means of further reducing stalls in a pipelined associative SIMD processor. It will provide a brief introduction to hardware multithreading and then discuss the issues involved in adding multithreading to an associative SIMD processor, such as the MTASC Processor. Some of the material pre- sented in this chapter was previously published in [Schaffer 2008a] and [Schaffer 2008b].

4.1. Background

Hardware multithreading is a technique for improving hardware utilization and instruction throughput in the face of pipeline stalls and high latency operations. A multithreaded processor dynamically schedules instructions from multiple threads in order to fill stall cycles [Ungerer 2002]. Multithreading exploits thread-level parallelism (TLP), which scales much better than ILP. As shown in Figure 4.1, multithreading can be implemented in hardware at one of three levels: coarse-grain multithreading, fine-grain multithreading, or simultaneous multithreading (SMT).

Figure 4.1 Types of hardware multithreading [Patterson 2009].

Coarse-grain multithreading allows a thread to run until it encounters an operation that would cause it to stall for a significant amount of time, at which point the processor switches to a different thread [Agarwal 1992]. The primary advantage of this approach is that it requires the least hardware of all three multithreading approaches. The downside, however, is that coarse-grain multithreading can only help maintain throughput in the face of stalls caused by high latency operations; it is unable to help with shorter stalls. Additionally, coarse-grain multithreaded processors, in order to simplify the pipeline control logic, typically require that all the instructions in the processor's pipeline be from a single thread. Thus, a thread switch requires a pipeline flush, incurring a start-up penalty while the pipeline is filled again.

Fine-grain multithreading switches threads on every clock cycle [Agarwal 1992]. The threads are typically selected in a round-robin fashion, skipping threads that are unable to execute that cycle. Because fine-grain multithreading switches threads every clock cycle, it is able to deal with short stalls, such as those due to branch mispredictions or data hazards, as well as longer stalls, such as those caused by cache misses. The main disadvantage of fine-grain multithreading is that it requires more hardware to implement than coarse-grain multithreading. In order to be able to switch threads on a cycle-by-cycle basis, all thread state, including register files and program counters (PCs), must be fully replicated. Additionally, fine-grain multithreading requires additional control logic to deal with the fact that, at any given time, the processor's pipeline will contain instructions from multiple threads.

Simultaneous multithreading (SMT) issues instructions from multiple threads in the same clock cycle [Levy 1995]. SMT takes advantage of the fact that a single thread is typically unable to utilize all the functional units in a multiple-issue superscalar processor every cycle. To increase the functional unit utilization, the processor looks for instructions from multiple threads in order to attempt to fill the issue slots that would otherwise go empty. Like fine-grain multithreading, SMT is able to hide both short and long stalls and, in general, it outperforms both fine-grain multithreading and coarse-grain multithreading for processors that can issue multiple instructions in a single cycle. This makes it the most commonly used approach in modern superscalar processors.

The latency of broadcast and reduction operations depends on the number of PEs and the design of the broadcast/reduction network, but typically varies from a few cycles for a small machine to tens of cycles for a larger one. However, even for a processor with hundreds or thousands of PEs, the latency is not large enough for coarse-grain multithreading to be effective. Therefore, fine-grain or simultaneous multithreading is necessary to effectively eliminate stalls in a pipelined associative SIMD processor. Since the version of the MTASC Processor described in this dissertation can only issue a single instruction per cycle, it uses fine-grain multithreading.

4.2. Thread Scheduling

Every cycle the control unit‘s scheduler must select an instruction from one of the active threads to issue to the execution pipeline. The choice of scheduling policy can have a significant impact on both performance and implementation complexity.

The key performance metric for a scheduler is utilization. Utilization is the ratio of cycles spent doing useful work to the total number of cycles. Wasted cycles can be di- vided into two types: stall cycles and wait cycles. Stall cycles are caused by dependencies between instructions in the pipeline. Wait cycles are caused by thread synchronization operations. In order to achieve the best possible utilization, a scheduler must address both of these.

For my dissertation research, I considered three different scheduler implementations: the simple scheduler, the dependency-aware scheduler, and the semaphore-aware scheduler. Each scheduler is described in this chapter, and then compared through simulation in Chapter 7.

4.2.1. Simple Scheduler

The simple scheduler uses a strict round-robin selection policy. The scheduler maintains a counter to identify the next thread to execute. Every cycle, this counter is incremented, unless the counter's value exceeds the number of threads, in which case the counter is reset to zero.
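To make the round-robin selection concrete, the following C++ fragment sketches the counter logic just described, in the spirit of the cycle-accurate simulator presented in Chapter 5. The class and member names are illustrative assumptions, not the actual MTASC source code.

```cpp
#include <cstddef>

// Minimal sketch of the simple scheduler's thread-selection counter
// (hypothetical names; the number of threads is fixed at construction).
class SimpleScheduler {
public:
    explicit SimpleScheduler(std::size_t numThreads)
        : numThreads_(numThreads), counter_(0) {}

    // Called once per cycle: returns the thread whose instruction buffer
    // will be examined this cycle, then advances the counter, wrapping
    // back to zero after the last thread.
    std::size_t selectThread() {
        std::size_t selected = counter_;
        counter_ = (counter_ + 1) % numThreads_;
        return selected;
    }

private:
    std::size_t numThreads_;
    std::size_t counter_;
};
```

Note that the selection does not depend on the instruction itself, which is exactly why this logic can sit in front of a single decode unit.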

The next instruction for the selected thread is retrieved from the instruction buffer and sent to the decode unit. If the instruction is not a semaphore wait (wait), the decode unit performs a dependency check between the instruction and any preceding instructions from the same thread still in the pipeline. If the instruction is a wait, the decode unit checks the value of the specified semaphore register to see if the thread is blocked.

If the decode unit determines that the instruction can execute, then it moves on to the appropriate pipeline stage according to the instruction type and the next instruction for that thread is fetched into the instruction buffer. However, if the decode unit determines that the instruction cannot execute, due to either a pipeline dependency or a blocked semaphore wait operation, nothing is issued to the next pipeline stage and the current instruction remains in the buffer so that it can be attempted again in the future.

The organization of a control unit based on this scheduler is shown in Figure 4.2.

Note that unlike the other schedulers discussed later, the simple scheduler does not need to examine the instructions to make its selection, so the scheduling logic (which is just a counter and a multiplexer) can be placed before the decode unit, thereby requiring only a single decode unit.


Figure 4.2 Simple scheduler.

The primary advantage of the simple scheduler is that it requires few hardware re- sources to implement, and is able to select a thread quickly. Unfortunately, the simplistic selection policy leads to very poor utilization. If the instruction from the selected thread is unable to execute due to either a pipeline dependency or a blocked semaphore wait op- eration, that cycle is wasted. The simple scheduler also performs poorly if the software does not use all of the threads implemented in the hardware. The unused threads will still be given an execution slot, which will always be wasted. One possible solution is to al- low software to change the maximum counter value. This solution works well if the soft- ware knows in advance how many threads it will need, but is unable to deal with dynamic thread allocation and deallocation.

4.2.2. Dependency-Aware Scheduler

The dependency-aware scheduler improves upon the simple scheduler by taking into account any dependencies between the instructions waiting to issue and the instructions already in the pipeline. The scheduler employs multiple decode units to perform a dependency check on the next instruction from every thread simultaneously. It then selects a thread with an instruction that is able to execute. If the selected thread's instruction is a wait, then the semaphore register is checked to determine whether the thread is blocked. If the instruction is not a wait or if the thread is not blocked, the instruction is moved on to the appropriate pipeline stage.

Figure 4.3 Breakdown of execution time into useful cycles, stall cycles and wait cycles.

Since it is possible for multiple threads to have instructions ready to execute in a given cycle, the dependency-aware scheduler needs some mechanism for choosing a thread. To do this, each thread is given a priority and the scheduler selects the highest priority thread with an instruction that is ready to execute. If fairness is not a concern, thread priorities can be assigned statically; otherwise priorities must be rotated on a regular basis to ensure that every thread with an executable instruction gets a chance to issue.
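As a rough illustration of this rotating-priority selection, the following C++ sketch picks the highest-priority thread whose instruction passed its dependency check. The names and the specific rotation scheme are assumptions made for the example, not the hardware's exact implementation.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the dependency-aware selection step (illustrative only).
// ready[t] is true when thread t's next instruction passed its
// dependency check this cycle; priorities rotate to keep selection fair.
class DependencyAwareScheduler {
public:
    explicit DependencyAwareScheduler(std::size_t numThreads)
        : numThreads_(numThreads), highestPriority_(0) {}

    // Returns the index of the highest-priority ready thread, or -1 if
    // no thread can issue this cycle (the cycle is then wasted).
    int selectThread(const std::vector<bool>& ready) {
        for (std::size_t i = 0; i < numThreads_; ++i) {
            std::size_t t = (highestPriority_ + i) % numThreads_;
            if (ready[t]) {
                // Rotate priorities so the selected thread becomes the
                // lowest priority on the next cycle.
                highestPriority_ = (t + 1) % numThreads_;
                return static_cast<int>(t);
            }
        }
        return -1;
    }

private:
    std::size_t numThreads_;
    std::size_t highestPriority_;
};
```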

Given a sufficient number of threads, the dependency-aware scheduler is able to completely eliminate stall cycles. However, it is unable to reduce wait cycles, which tend to dominate as the number of threads increases. Figure 4.3 demonstrates this effect for a sample associative algorithm (Associative QuickHull, which is described in Chapter 6).

Figure 4.4 Dependency-aware scheduler.

Figure 4.4 shows the organization of a control unit that implements the dependen- cy-aware scheduler. Unlike the simple scheduler, the instructions must be decoded before they can be scheduled, so there must be one decode unit per thread. Each decode unit must have access to the instruction status table, which tracks information about instruc- tions currently being executed, in order to determine if its thread's next instruction will be able to execute. Additionally, the scheduler requires a priority encoder to select one thread to execute if more than one is ready. If fairness between threads is a requirement, logic to rotate priority assignments is also necessary.

If the number of hardware threads is small, the addition of extra decode units is not likely to impact performance significantly. However, for a heavily multithreaded processor, the additional cost in area and/or performance may be unacceptable. In such a processor, a hybrid approach may be the best solution. Threads would be divided into clusters, with a single decode unit per cluster. Within a cluster threads are scheduled in round-robin order, but between clusters threads are scheduled using the dependency-aware technique. In this way, the number of decode units is limited while still providing some of the benefit of the dependency-aware approach.

4.2.3. Semaphore-Aware Scheduler

The semaphore-aware scheduler goes one step beyond the dependency-aware scheduler by attempting to also reduce those cycles lost to thread synchronization. In addition to checking each thread to see if it can issue without causing a pipeline stall, the semaphore-aware scheduler also checks to see if the thread will be blocked due to a semaphore wait operation. Like the dependency-aware scheduler, the semaphore-aware scheduler uses multiple decode units to simultaneously check every thread to see if its next instruction is able to execute. But in addition to checking for pipeline dependencies, the decode units in the semaphore-aware scheduler also read the value of the semaphore register if the instruction is a wait. If it is a wait and the thread is blocked, the decode unit reports that the instruction is unable to execute, so that the cycle is not wasted. In this way, the semaphore-aware scheduler is able to further increase utilization, especially in heavily multithreaded code. Whereas the simple and dependency-aware schedulers can perform worse than a non-multithreaded processor if there are too many threads, the semaphore-aware scheduler maintains the same performance independent of the number of threads.

Figure 4.5 Semaphore-aware scheduler.

In order to implement the semaphore-aware scheduler, it is necessary to read the semaphore registers during instruction decode and since threads are decoded in parallel, each decode unit will need its own read port to the semaphore register file, as shown in Figure 4.5. Unfortunately, this change can have a negative effect on performance. For a semaphore register file with n registers, each read port requires an n-input multiplexer. Wide multiplexers can consume a large number of logic resources and add a significant amount of delay in an FPGA-based design. In order to make the implementation efficient, it may be necessary to limit the number of semaphore registers or employ a hybrid design like the one proposed in the previous section.
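The per-thread readiness test performed by the semaphore-aware scheduler's decode units can be summarized by the following sketch. The 16 binary semaphore registers follow the ISA description in Chapter 5; the structure and field names are otherwise illustrative.

```cpp
#include <bitset>
#include <cstddef>

// Sketch of the readiness check performed for each thread by the
// semaphore-aware scheduler (hypothetical names, not the hardware RTL).
struct DecodedInstruction {
    bool isWait;            // true for a semaphore wait instruction
    std::size_t semaphore;  // semaphore register index (valid when isWait)
    bool hasDependency;     // true if a pipeline dependency was detected
};

constexpr std::size_t kNumSemaphores = 16;  // binary semaphores, per Chapter 5

// A thread is ready to issue only if its next instruction has no pipeline
// dependency and, when it is a wait, the named semaphore is currently free.
bool threadReady(const DecodedInstruction& inst,
                 const std::bitset<kNumSemaphores>& semaphoreLocked) {
    if (inst.hasDependency)
        return false;
    if (inst.isWait && semaphoreLocked.test(inst.semaphore))
        return false;  // the semaphore is held: issuing would only block
    return true;
}
```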

4.3. Summary

This chapter provided an overview of hardware multithreading, including a discussion of three types of hardware multithreading: fine-grain multithreading, coarse-grain multithreading, and simultaneous multithreading (SMT). Three different thread scheduling policies, and their hardware realizations, were discussed and compared. The simple scheduler requires very little hardware to implement, but it performs poorly at maintaining utilization. The dependency-aware scheduler performs well at reducing pipeline stalls, but cannot cope with wait cycles, which tend to dominate in heavily multithreaded programs. The semaphore-aware scheduler performs best out of the three, as it is able to reduce both pipeline stalls and wait cycles.

CHAPTER 5

MTASC: A Multithreaded Associative SIMD Processor

This chapter describes the implementation of the MTASC Processor described in the previous chapters of this dissertation. It also contains a description of the assembler and the cycle-accurate instruction-level simulator developed to evaluate the MTASC Pro- cessor. Chapter 7 contains the results generated by the simulator executing the multith- readed associative benchmarks from Chapter 6. Some of the material presented in this chapter was previously published in [Schaffer 2011].

5.1. Instruction Set Architecture

The core of the MTASC Processor's instruction set architecture (ISA) is based on MIPS and other scalar RISC processors. These processors' ISAs were designed for pipelined implementation and, as such, avoid many hazards and other pitfalls that make pipelining difficult or inefficient [Patterson 2009]. This basic RISC ISA is extended with instructions to enable SIMD parallelism and implement associative computing. These extensions are based on classic SIMD architectures such as the STARAN [Batcher 1974] and MPP [Batcher 1980] as well as more modern SIMDs like the ClearSpeed CSX700 [ClearSpeed 2008]. Finally, the ISA was extended to support fine-grain multithreaded execution.


Figure 5.1 Registers in the MTASC architecture.

5.1.1. Registers

As shown in Figure 5.1, there are three register sets in the MTASC Processor: scalar registers, parallel registers, and semaphore registers. In addition, each thread has a program counter (PC) register and each PE has an enable stack.

The scalar register file, located within the scalar execution unit, provides operands for scalar and broadcast instructions and receives results from scalar and reduction in- structions. In order to support fine-grain multithreading, the scalar register file is fully multithreaded with a separate set of scalar registers for each hardware thread context.


Each set of scalar registers contains 16 16-bit registers. In MTASC assembly language, these registers are named s0 through s15. Register s0 always contains the value zero, though this is a software convention and not enforced by the hardware.

The parallel register file provides operands for parallel, broadcast, and reduction instructions and receives results from parallel and broadcast instructions. The parallel register file is split between both thread contexts and PEs. Thus, there is a full set of parallel registers for each thread in each PE, with each set containing 16 16-bit registers.

In MTASC assembly language, the parallel registers are named p0 through p15 and, like the scalar registers, p0 is always zero.

The semaphore registers are used for inter-thread synchronization and are only accessible by the semaphore instructions: signal and wait. There is a single, global set of semaphore registers that is shared by all threads. The original ISA called for 16 16-bit semaphore registers, such that each semaphore register could function as a counting se- maphore. However, during development it became apparent that this could not be imple- mented efficiently in an FPGA-based processor. Thus, the ISA was revised and the se- maphore registers were reduced to 1-bit binary semaphores. In the MTASC assembly language, the semaphore registers are named sem0 through sem15.

Each thread context contains a 16-bit program counter (PC) register that contains the address of the next instruction for that thread. The PC register is not architecturally visible and can only be modified through branch and jump instructions.

Each PE also contains an enable register, which is organized as a 16-element stack of enable bits. The enable stack records the activity state of each PE. An associative search, which identifies all the PEs whose data matches a search key, pushes a new enable bit onto the enable stack of each PE: a 1 if the PE was a responder or a 0 if the PE was a non-responder. After processing, the enable bit is popped off the stack, returning the PEs to their previous state. The use of a stack makes it possible to perform nested searches. A PE is only active when all the bits in its enable stack are 1s. The enable stack is also replicated for each thread context.

Figure 5.2 Instruction formats.

5.1.2. Instruction Formats

All instructions in the MTASC Processor are encoded in one of two formats: the register-register format or the register-immediate format, both of which are illustrated in Figure 5.2. The op field specifies the instruction's operation code (or op code). For some instructions, the mode field modifies the op code and specifies how to interpret the register fields. The rd, ra, and rb fields specify the destination and source registers. The immed field is an immediate operand.
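The following sketch shows how these fields might be unpacked in software, for example inside a simulator's decoder. The field widths used here (a 32-bit instruction word, 6-bit op, 2-bit mode, 4-bit register fields, and a 16-bit immediate) are assumptions chosen only to make the example concrete; they are not the definitive MTASC encoding.

```cpp
#include <cstdint>

// Illustrative decode of the two MTASC instruction formats.
// All widths below are assumptions made for this sketch.
struct RegisterRegister {  // op | mode | rd | ra | rb (low bits unused)
    uint8_t op, mode, rd, ra, rb;
};

struct RegisterImmediate {  // op | mode | rd | ra | immed
    uint8_t op, mode, rd, ra;
    uint16_t immed;  // immediate operand
};

RegisterImmediate decodeRegImm(uint32_t word) {
    RegisterImmediate f;
    f.op    = (word >> 26) & 0x3F;  // 6-bit operation code
    f.mode  = (word >> 24) & 0x03;  // 2-bit mode field
    f.rd    = (word >> 20) & 0x0F;  // 4-bit destination register
    f.ra    = (word >> 16) & 0x0F;  // 4-bit source register
    f.immed =  word        & 0xFFFF;
    return f;
}
```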


5.1.3. ALU Instructions

The MTASC Processor contains a full set of arithmetic, logic, and comparison instructions, collectively referred to as ALU instructions. All register-register ALU instructions can operate in one of three modes: scalar, broadcast, and parallel. In scalar mode, both operands are scalar registers and the instruction executes within the scalar execution unit. In broadcast mode, one operand is a scalar register and the other is a parallel register. The scalar operand is broadcast and the instruction executes within the PE array. In parallel mode, both operands are parallel registers and the instruction executes within the PE array. The broadcast mode is further divided into two sub-modes based on whether the scalar operand is the left operand (common left) or the right operand (common right).

Most of the ALU instructions also have a register-immediate form where the second operand is an immediate value encoded in the instruction. These instructions can operate in either scalar mode or parallel mode.

5.1.4. Load/Store Instructions

Load and store instructions either read (load) data from or write (store) data to memory. In adherence to RISC design principles, load and store instructions are the only instructions which can access memory. Like MIPS, the only addressing mode directly supported by the ISA is displacement addressing, though absolute and register indirect addressing can be realized by using s0 or p0 as the base register or by using 0 for the displacement.

Similar to the ALU instructions, load and store instructions can also operate in scalar, broadcast, or parallel mode. In scalar mode, the base register is a scalar register and the data is read from or written to scalar memory. In broadcast mode, the base register is also a scalar register, but the data is read from or written to parallel memory (the PEs' local memories). Thus, the data transfer is parallel but each PE uses the same address. In parallel mode, the base register is a parallel register and the instruction accesses parallel memory. Thus, the PEs can be accessing different addresses. There is no separate common left or common right mode for broadcast since there is a single base register.

5.1.5. Branch and Jump Instructions

The MTASC Processor supports conditional branches, unconditional jumps and subroutine call and return instructions similar to those found in scalar RISC processors. There are no associative-computing-specific branch instructions (e.g., "branch if any responders"). Instead, these types of branches are realized by using a reduction instruction followed by an ordinary conditional branch instruction. Thus, to implement "branch if any responders," one first executes a response count (count), then branches if that result is not zero (bnez). This approach simplifies the control logic and reduces the total number of instructions.
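The count/bnez idiom can be modeled in a few lines of C++; the code below is only an illustration of the semantics, with peEnabled standing in for the per-PE enable state.

```cpp
#include <cstddef>
#include <vector>

// Illustrative model of the count/bnez idiom: instead of a dedicated
// "branch if any responders" instruction, the program counts active PEs
// with a count reduction and then branches on a non-zero result.
std::size_t responderCount(const std::vector<bool>& peEnabled) {
    std::size_t count = 0;
    for (bool enabled : peEnabled)  // models the count reduction
        if (enabled) ++count;
    return count;
}

bool anyResponders(const std::vector<bool>& peEnabled) {
    return responderCount(peEnabled) != 0;  // models count followed by bnez
}
```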

5.1.6. Enable Stack Instructions

As discussed previously, the enable stack controls whether a given PE is enabled (it executes the instructions it receives from the control unit) or disabled (it ignores most instructions it receives). A PE is enabled when all the bits in its enable stack are 1s; otherwise it is disabled. The enable stack instructions are those instructions which modify the enable stack. Enable stack instructions are unique among parallel instructions in that they are always executed by all PEs, regardless of whether they are currently enabled or not. In this way, all the enable stacks in the PE array remain synchronized.

The ifeqz ("if equal to zero") and ifnez ("if not equal to zero") instructions compare the contents of a parallel register to zero and then push an enable bit onto the stack. These instructions are typically used in combination with one of the comparison instructions (which set the value of a register to 0 or 1) in order to implement basic comparisons such as equal to, greater than, etc. The else instruction inverts the value of the bit at the top of each PE's enable stack. The endif instruction pops a bit off of each PE's enable stack. The names of the instructions come from the fact that they are used to implement data-parallel if-then-else statements in high-level languages.
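A minimal software model of one PE's 16-element enable stack and of the ifnez, else, and endif operations is sketched below. The class is an illustration only (it assumes the stack never overflows its 16 entries) and is not the hardware description.

```cpp
#include <bitset>
#include <cstdint>

// Sketch of one PE's enable stack.  The 16-bit register starts with all
// bits set, so the PE is initially active.
class EnableStack {
public:
    // ifnez: push a 1 if the tested register value is non-zero
    // (responder), otherwise push a 0 (non-responder).
    void ifnez(int16_t value) { bits_ <<= 1; bits_.set(0, value != 0); }

    // else: invert the most recently pushed enable bit.
    void elseBit() { bits_.flip(0); }

    // endif: pop the top enable bit, restoring the previous PE state
    // (the vacated position is refilled with a 1).
    void endif() { bits_ >>= 1; bits_.set(15, true); }

    // A PE is active only when every bit in its enable stack is 1.
    bool active() const { return bits_.all(); }

private:
    std::bitset<16> bits_{0xFFFF};
};
```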

5.1.7. Reduction Instructions

Reduction instructions take a value (stored in a parallel register) from each PE and combine them together producing a scalar result, which is then stored in a scalar reg- ister. All reduction instructions respect the enable state of the PEs so that only those PEs which are enabled contribute to the result of the reduction instruction.

The MTASC Processor implements four reduction operators: count, max/maxu (signed and unsigned maximum), min/minu (signed and unsigned minimum), and sum. The count instruction is unique in that it does not have a proper operand; instead it operates based solely on the enable state of the PEs, producing a count of the number of active PEs. MTASC does not implement the PickOne operation directly in hardware. Instead it is implemented by having all the active PEs load their IDs into a register and then performing a max or min on that register.
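The PickOne idiom described above can be modeled as a max reduction over the IDs of the active PEs, as in the following illustrative sketch.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Sketch of the PickOne idiom: every active PE contributes its ID to a
// max reduction, and the PE whose ID equals the result is the one picked.
std::optional<std::size_t> pickOne(const std::vector<bool>& peActive) {
    std::optional<std::size_t> picked;
    for (std::size_t id = 0; id < peActive.size(); ++id)
        if (peActive[id])
            picked = id;  // models max(id) over the active PEs
    return picked;        // empty when there are no responders
}
```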

5.1.8. Semaphore Instructions

Inter-thread synchronization in the MTASC Processor is accomplished with the semaphore instructions, signal and wait. Both instructions take a single semaphore register as their sole operand. signal releases a locked semaphore. wait locks an avail- able semaphore or, if the semaphore is already locked, blocks the thread from executing.

5.2. Organization

The MTASC Processor has five main subsystems: the control unit, the scalar ex- ecution unit, the PE array, the broadcast/reduction network, and the PE interconnection network. These are illustrated in Figure 5.3.

5.2.1. Control Unit

The control unit is responsible for fetching, decoding, and issuing instructions. As discussed in Chapter 4, after exploring three different thread schedulers, the semaphore- aware scheduler was shown to perform best. The control unit organization based on this scheduler is shown in Figure 5.4. There are three major components: the fetch unit, the decode unit, and the scheduler.


Figure 5.3 MTASC organization.

The fetch unit contains a program counter (PC) and an instruction buffer for each thread context. When the fetch unit detects that an instruction buffer is empty, it fetches the next instruction for that thread from the instruction cache. The fetch unit also receives signals from the scalar execution unit (described below) when a branch or jump instruc- tion alters a thread‘s PC.

The decode unit, one for each thread, examines the instruction in its associated instruction buffer, determines whether the instruction is able to issue and generates the control signals necessary to execute the instruction. The decode unit performs three checks to determine if an instruction can issue: (a) if the instruction uses a scalar register as an operand, the decode unit checks the instruction status table to see if there is a reduction instruction in the pipeline that will write into that register; (b) if the instruction writes to a scalar register, the decode unit also checks the instruction status table for a structural hazard on the scalar register file write port; and (c) if the instruction is a wait instruction, the decode unit examines the semaphore registers to see if the thread is blocked. Assuming that the instruction passes all three checks, the decode unit signals the scheduler that the instruction is ready to issue.

Figure 5.4 Control unit organization.

The scheduler selects one of the ready instructions and issues it to the scalar execution unit. In order to ensure fairness between hardware threads, the scheduler uses a rotating priority scheme to select instructions when there is more than one ready to issue. The scheduler also updates the instruction status table with the instruction that was issued (if any).

5.2.2. Scalar Execution Unit

The scalar execution unit is responsible for executing scalar instructions. The organization of the scalar execution unit is similar to that of the PEs, described below. The scalar execution unit contains a general-purpose register file (scalar registers), an arithmetic/logic unit (ALU), and memory (scalar memory). The main difference between the scalar execution unit and a PE is that the scalar execution unit lacks an enable stack and contains extra logic for resolving branches.

Although the scalar execution unit's primary responsibility is executing scalar instructions, all instructions pass through its first stage (scalar register read, SR). This is necessary to support broadcast instructions, which execute in the PE array but use a scalar register as an operand. The types of instructions actually executed by the scalar execution unit include ALU and load/store instructions in scalar mode, and branch and jump instructions.

5.2.3. Processing Elements (PEs)

The PE array is responsible for executing broadcast and parallel instructions. Re- duction instructions also pass through the first two PE pipeline stages in order to fetch the value of their parallel register operand (if any) and the enable state of the PE. Each PE in the array is identical and they receive the exact same sequence of instructions from the control unit, which they process in lock step. However, each PE can be independently enabled or disabled, causing it to either execute or ignore the instructions it receives.

Figure 5.5 shows the organization of a PE along with the mapping between hard- ware and pipeline stages. Each PE contains a general-purpose register file (parallel regis- ters), an enable stack, an ALU, and local memory (parallel memory).


The general-purpose register file provides operands to and receives results from most instructions executed by the PE. In order to support fine-grain multithreading, the register file is split between threads at the hardware level. This is accomplished by providing a large number of registers and then attaching the current thread ID to the register address before passing it on to the register file. For example, a machine with 16 thread contexts would need a register file with 256 registers (16 threads × 16 registers per thread). Data is never transferred between registers from different threads; all inter-thread communication must be done through the memory.
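A sketch of the address calculation implied by this scheme is shown below; the helper name is hypothetical, and the 16 registers per thread come from the register description earlier in this chapter.

```cpp
#include <cstdint>

// Illustrative computation of the physical register-file index: the
// current thread ID is concatenated with the architectural register
// number.  Assumes 16 registers per thread, so 16 threads need a
// 256-entry register file.
constexpr uint32_t kRegsPerThread = 16;

uint32_t physicalRegisterIndex(uint32_t threadId, uint32_t regNum) {
    return threadId * kRegsPerThread + regNum;  // e.g. thread 3, p5 -> 53
}
```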

The enable stack controls whether a given PE is enabled (it executes the instruc- tions it receives from the control unit) or disabled (it ignores most instructions it rece- ives). Each enable stack is implemented as a 16-bit register organized as a 16-element stack of enable bits. A PE is enabled only when every element of the stack is a 1. If any element is 0, the PE is disabled. By organizing the enable bits as a stack, it is possible to efficiently support nested conditional statements. To support fine-grain multithreading, there is a separate enable stack register for each thread context.

The ALU supports a standard set of arithmetic, logic, and comparison operations for both signed and unsigned integers. Forwarding paths are provided so that the results of an ALU operation can be sent back to the ALU before they are written into the register file. The ALU also provides a zero/non-zero status bit to the enable stack logic in order to implement the ifeqz and ifnez instructions.


Figure 5.5 PE organization.

Each PE also contains a small local memory that acts as a programmer- or compi- ler-controlled cache. The memory is not divided among threads at the hardware level, allowing threads to share data within the memory. If the memory is to be divided, it must be done at the software level.


5.2.4. Broadcast/Reduction Network

The broadcast network is a fully pipelined tree network that takes decoded instructions from the control unit and scalar data from the scalar execution unit and distributes them to the PE array. The broadcast network performs no computation, so each node in the network is simply a register. For this reason, the arity (k) of the network can vary easily and should be chosen so as to optimize the network for the particular implementation technology in use. The latency of the broadcast network is log_k p cycles, where p is the number of PEs. For the sake of simplicity, a binary tree broadcast network is used for this dissertation.

The reduction network is responsible for executing reduction instructions. The network receives, from each PE, the value of a parallel register and the PE's current enable state. These values are then combined together through a fully pipelined binary tree network to produce a scalar result, which is then provided to the scalar execution unit to be stored in a scalar register. Each node in the reduction network contains a reduction functional unit and a pipeline register. The latency of the reduction network is lg p cycles, where p is the number of PEs. Unlike the broadcast network, increasing the arity of the tree is not likely to improve performance. The increased cost of a k-ary functional unit (in both hardware and latency) tends to overcome the benefit of a shorter network.
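The stage counts of the two tree networks follow directly from their arity, as the following illustrative helper shows; the function names are assumptions made for the example.

```cpp
#include <cstdio>

// Number of tree stages needed to span p PEs with arity k, i.e.
// ceil(log_k p), computed with integer arithmetic.
unsigned treeStages(unsigned p, unsigned k) {
    unsigned stages = 0;
    unsigned long long span = 1;  // PEs reachable with 'stages' levels
    while (span < p) {
        span *= k;
        ++stages;
    }
    return stages;
}

int main() {
    // For the 256-PE, binary-tree configuration simulated in Chapter 7,
    // broadcast and reduction are each 8 stages, or 16 cycles combined.
    unsigned b = treeStages(256, 2);  // broadcast latency
    unsigned r = treeStages(256, 2);  // reduction latency
    std::printf("broadcast = %u, reduction = %u, total = %u\n", b, r, b + r);
    return 0;
}
```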

Figure 5.6 shows the organization of a node in the reduction network. The prepro- cessing logic exists only for the nodes in the first stage. The postprocessing logic is only present in the final stage. The internal stages contain only the functional unit.


Figure 5.6 Reduction network node.

5.2.5. PE Interconnection Network

The PE interconnection network connects the PEs together and enables parallel data transfers between them. The version of the MTASC Processor described in this dissertation uses a simple linear array network, as this is all that is needed to run the associative benchmarks described in Chapter 6. In this network, the PEs are arranged in a 1D array and each PE has connections to the previous and next PEs in the array. The hardware does not support wraparound at the edges of the array, but this can be emulated in software if required.

5.3. MTASC Assembler

The MTASC Simulator executes a program encoded in MTASC machine language. In order to simplify the process of writing software for the simulator, an assembler was also developed that translates a program written in MTASC assembly language into machine language.

MTASC assembly language is based on MIPS assembly language as described in [Patterson 2009]. Each statement is either an instruction or an assembler directive. Assembler directives are available to select the current program section and to write data in various formats. In addition to the text section (.text), where program instructions are written, there are two data sections: one for scalar data (.scalar) and one for parallel data (.parallel).

The MTASC assembler was implemented using the ANTLR parser generator [Parr 2011] and operates in two passes. The first pass translates the assembly source code into a parse tree and generates the symbol table. The second pass takes the parse tree and symbol table and generates the machine language program.

5.4. MTASC Simulator

The MTASC Simulator is a custom cycle-accurate instruction-level simulator written in C++. The use of a software-based simulator reduced development time and made it possible to explore various architectural options. In particular, the simulator was used to compare the performance of the three thread scheduling policies discussed in Chapter 4. The MTASC Simulator is able to easily vary important system parameters such as number of PEs, number of threads, and broadcast/reduction latency.

The design of the MTASC Simulator is based on a modified Interpreter Pattern [Gamma 1995]. The Machine class encapsulates the machine state, including program counters, registers, and memories; this class also implements the fetch-decode-execute loop. Decoding an instruction results in an instance of the Instruction class, which is then used to execute the instruction by modifying the machine state.

The simulator searches through all the threads and attempts to find one that is ready to issue. A thread must pass two checks to determine whether it can issue: a dependency check and a semaphore check. The dependency check looks for dependencies between the next instruction and any instructions awaiting completion in the pending results queue. The semaphore check only applies if the thread's next instruction is a semaphore wait instruction, in which case the thread is blocked if the semaphore is unavailable. Once a thread is located that passes both checks, the instruction is executed. If the instruction has a result latency greater than a single cycle, such as a reduction instruction, then it is placed in the pending results queue. These checks mirror those performed in the control unit hardware.
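A simplified version of this per-cycle search is sketched below. The structures and field names are stand-ins invented for the illustration; the actual simulator classes (Machine, Instruction) are organized as described earlier in this chapter.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Illustrative stand-ins for the simulator's instruction and pending-result
// records (not the real class definitions).
struct Instruction {
    std::vector<int> sourceRegs;  // scalar registers read by the instruction
    bool isWait = false;          // semaphore wait instruction?
    int semaphore = -1;           // semaphore index (valid when isWait)
};

struct PendingResult {
    int destReg;          // scalar register awaiting a long-latency result
    int cyclesRemaining;  // e.g. the reduction network latency
};

// Dependency check: the instruction may not read a register that a pending
// (not yet completed) instruction will write.
bool hasDependency(const Instruction& inst,
                   const std::deque<PendingResult>& pending) {
    for (const PendingResult& p : pending)
        if (std::find(inst.sourceRegs.begin(), inst.sourceRegs.end(),
                      p.destReg) != inst.sourceRegs.end())
            return true;
    return false;
}

// Returns the index of the first thread that passes both the dependency
// and semaphore checks, or -1 if no thread can issue this cycle.
int findIssuableThread(const std::vector<Instruction>& nextInstruction,
                       const std::deque<PendingResult>& pending,
                       const std::vector<bool>& semaphoreLocked) {
    for (std::size_t t = 0; t < nextInstruction.size(); ++t) {
        const Instruction& inst = nextInstruction[t];
        if (hasDependency(inst, pending))
            continue;  // would stall the pipeline
        if (inst.isWait && semaphoreLocked[inst.semaphore])
            continue;  // thread is blocked on the semaphore
        return static_cast<int>(t);
    }
    return -1;
}
```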

5.5. Summary

This chapter presented an overview of the MTASC Processor as it is implemented in this dissertation. In particular, it discussed noteworthy aspects of the ISA and illustrated the organization of the major subsystems. It also presented a description of the MTASC Assembler and MTASC Simulator, tools which were used to evaluate the MTASC Processor and to explore architectural options during its development. The full source code for both tools is listed in Appendix B.

CHAPTER 6

Multithreaded Associative Benchmarks

In order to measure the effect of multithreading on associative programs, a set of five associative algorithms was selected as benchmarks for the MTASC Processor. Each of these benchmarks was implemented for the MTASC architecture and run on the instruction-level simulator discussed in Chapter 5. These particular benchmarks were selected because they span a range from highly associative to minimally associative. The expectation was that the more highly associative algorithms would show a greater improvement in performance from the addition of multithreading than the less associative algorithms.

This chapter will give an overview of each of the algorithms selected. It will dis- cuss how the algorithms had to be adapted, if at all, to run on the MTASC Processor. It will also provide information on their dynamic behavior, such as the frequency of reduc- tion instructions. Some of the material presented in this chapter was previously published in [Schaffer 2011].

6.1. Algorithmic Conventions

The pseudocode used to describe the algorithms in the following sections should be relatively easy to understand. However, there are some conventions specific to asso- ciative computing that deserve special attention.

Select statements are the algorithmic implementation of the associative search. A select statement evaluates a parallel conditional expression on each PE, disables those PEs for which the conditional expression evaluates false (the non-responders), and then executes a sequence of statements. Once the end of the select statement is reached, the disabled PEs are re-enabled before continuing on to the next statement. Many parallel programming languages use a parallel if statement for this operation; however this dissertation uses the select statement so as to avoid any ambiguity between a standard (scalar) if statement and a parallel if statement. Note that, regardless of whether any PEs are enabled, the statements inside the select statement are always executed.

The select one statement is similar to the basic select statement except that after evaluating the parallel conditional expression, it then selects one PE from the set of responders. Thus, at most one PE will be active when executing the statements inside the select one statement. This makes use of the PickOne function in the ASC model.

Finally, the any modifier can be applied to if and while statements in order to detect whether there are any active PEs or whether there would be any active PEs after evaluating a parallel conditional expression. This uses the AnyResponders function in the ASC model.

6.2. Associativity

It was expected that the algorithms which make more extensive use of the broadcast/reduction network would benefit more from the addition of multithreading. In order to quantify this, each algorithm was assigned an associativity rating. The associativity rating was calculated as the percentage of reduction instructions executed relative to the total number of instructions executed. Figure 6.1 shows the associativity ratings for each of the selected algorithms.

Figure 6.1 Associativity ratings for the benchmarks.

6.3. Benchmarks

The algorithms selected as benchmarks for the MTASC Processor are: Jarvis March [Atwah 1996], Minimum Spanning Tree (MST) [Potter 1994], QuickHull [Atwah 1996], Matrix-Vector Multiplication, and VLDC String Matching [Esenwein 1996]. Each of these algorithms is explored in detail below.

6.3.1. Jarvis March

The Associative Jarvis March algorithm [Atwah 1996] computes the upper convex hull of a set of points. As the name implies, it is based on the sequential Jarvis March algorithm.

Figure 6.2 is a listing of the algorithm in pseudocode. currentX and currentY are scalar variables (stored in the scalar execution unit) and x, y, hull, and slope are parallel variables (stored in the PEs). currentX and currentY store the coordinates of the last hull point that was selected. Each PE stores a single point. x and y are the coordinates of that point and hull is a Boolean variable indicating whether the point is on the convex hull. slope contains the slope of the line between the last selected hull point and the PE‘s point.

The algorithm starts by locating the point with the smallest x-coordinate (which is assumed to be unique) and marking it as a hull point. The algorithm then processes the remaining points in left to right order. For each point remaining to be processed, the algorithm calculates the slope of the line to that point from the previously selected hull point. The point whose line has the greatest slope is then selected as the next hull point.

The inner loop is relatively small, but it performs a reduction (max) followed immediately by a broadcast and then a PickOne, which is itself implemented as a reduction and broadcast. This is why the Jarvis March algorithm has the highest associativity rating (17%) of all the benchmarks considered.

The Associative Jarvis March algorithm was designed for the ASC model and, as such, cannot natively take advantage of multiple threads. In order to make this algorithm suitable as a multithreaded benchmark, we run the algorithm multiple times. Each run is called a job and the jobs to be run are collected into a shared queue. When executed with a single thread, the jobs are removed from the queue and run in sequential order with minimal overhead. However, when executed with multiple threads, each thread can be executing a different job at the same time as the other threads. Since the job queue is shared, there will be some overhead as the threads contend for its use. This overhead will be proportional to the number of threads.

Associative Jarvis March

1. Select one PE where x = min(x) and do
2.     currentX ← x
3.     currentY ← y
4.     hull ← true
5. End select
6.
7. While currentX < max(x) do
8.     Select PEs where x > currentX and do
9.         slope ← (y – currentY) / (x – currentX)
10.
11.        Select one PE where slope = max(slope) and do
12.            currentX ← x
13.            currentY ← y
14.            hull ← true
15.        End select
16.    End select
17. End while

Figure 6.2 Associative Jarvis March algorithm.

6.3.2. Minimum Spanning Tree

The associative Minimum Spanning Tree (MST) algorithm [Potter 1994] finds a spanning tree within a weighted graph that has the smallest total weight. It is based on Prim's sequential MST algorithm.


Associative Minimum Spanning Tree (MST)

1. candidate ← WAITING
2.
3. Select PEs where weight[root] < INFINITY and do
4.     candidate ← YES
5.     currentBest ← weight[root]
6.     parent ← root
7. End select
8.
9. Select one PE where id = root
10.    candidate ← NO
11. End select
12.
13. While there are any PEs where candidate = YES do
14.    Select PEs where candidate = YES and do
15.        Select one PE where currentBest = min(currentBest) and do
16.            candidate ← NO
17.            nextNode ← id
18.        End select
19.
20.        Select PEs where currentBest < weight[nextNode] and do
21.            currentBest ← weight[nextNode]
22.            parent ← nextNode
23.        End select
24.    End select
25.
26.    Select PEs where candidate = WAITING and do
27.        Select PEs where weight[nextNode] < INFINITY and do
28.            candidate ← YES
29.            currentBest ← weight[nextNode]
30.            parent ← nextNode
31.        End select
32.    End select
33. End while

Figure 6.3 Associative Minimum Spanning Tree (MST) algorithm.

Figure 6.3 shows a listing of the algorithm in pseudocode. The parallel variables are candidate, currentBest, id, parent, and weight (which is an array) and the scalar variables are nextNode and root. During the course of the algorithm, each PE tracks its state in the candidate variable, which is either WAITING (the PE's vertex is not reachable from the current spanning tree), YES (the PE's vertex is reachable from the current spanning tree), or NO (the PE's vertex has been added to the spanning tree). For PEs whose candidate value is YES, currentBest is the weight of the minimum weight edge connecting the PE's vertex to the current spanning tree and parent is the ID of the PE containing the vertex attached to that edge. Each PE is assigned a unique id (typically by the hardware) to track connections between vertices. weight is an array of edge weights, indexed by PE ID, for all the edges that reference the PE's vertex. nextNode is the ID of the PE whose vertex was last added to the spanning tree. root is the ID of the PE whose vertex is the spanning tree root.

The algorithm starts by locating the PEs whose vertices are directly connected to the root and marking them as candidates. The algorithm then adds the root to the spanning tree. During each iteration of the main loop, the algorithm chooses the candidate PE whose vertex has the edge with the smallest weight and adds that vertex to the spanning tree. The selected vertex then becomes the new reference point and those PEs whose vertices are directly connected to it are added to the set of candidates. This process continues until there are no more candidates.

As in the Jarvis March algorithm, the small inner loop combined with multiple reductions and broadcasts results in a high associativity rating (10%).

The Associative MST algorithm is designed for the ASC model. Therefore, like the Associative Jarvis March algorithm, the multithreaded benchmark runs multiple jobs in order to simulate a multithreaded operation.

Associative Matrix-Vector Multiplication

1. For i ← 0 to n – 1 do
2.     temp ← sum(A[i] × x)
3.
4.     Select PEs where i = id and do
5.         y ← temp
6.     End select
7. End for

Figure 6.4 Associative Matrix-Vector Multiplication algorithm.

6.3.3. Matrix-Vector Multiplication

The Associative Matrix-Vector Multiplication algorithm multiplies a matrix, distributed by column to the PEs, with a vector, also distributed among the PEs.

Figure 6.4 contains the pseudocode listing of this algorithm. i is the loop counter variable and contains the index of the current matrix row being multiplied. The matrix is distributed among the PEs with each PE containing a column of the matrix in A. The vectors x and y are also distributed among the PEs, with each PE containing a single element of the vectors. temp is a scalar variable that holds the result of the sum reduction. Once again we make use of the unique ID number assigned to each PE, represented in the algorithm by the id variable.

During each iteration of the main loop, the algorithm computes one element of the result vector. It does this by multiplying the elements of the current matrix row with the elements of the vector x in parallel. A sum-reduce operation then combines the partial sums stored in each PE, producing one element of the result vector.


Though this algorithm only performs a single reduction for each iteration of the inner loop, the inner loop is very short, resulting in an associativity rating of 7%.

Unlike the other algorithms, which store only a small, fixed amount of data per PE, the associative matrix-vector multiplication algorithm must store an entire column vector in each PE's local memory. PE local memories are generally very small, typically 1–64 KB, so this will limit the size of the matrix that can be multiplied in a single pass. For the purposes of the benchmark, it is assumed that there is sufficient local memory to store all the data.

Similar to the previous algorithms, the Associative Matrix-Vector Multiplication algorithm is designed for the single-threaded ASC model. In order to gain a benefit from multithreading, the algorithm is run multiple times.

6.3.4. QuickHull

The Associative QuickHull algorithm [Atwah 1996] also computes the upper convex hull of a set of points. Unlike Jarvis March, this algorithm is based on the sequential QuickHull algorithm.

Figure 6.5 gives the pseudocode for the Associative QuickHull algorithm. Each PE contains a point. x and y are the coordinates of that point and hull is a Boolean variable indicating whether the point is on the convex hull. area is a temporary parallel variable that stores twice the area of the triangle formed by the PE's point and the reference line. The scalar variables, px, py, qx, and qy, store the coordinates of the endpoints of the reference line.


Associative QuickHull

1. Select one PE where x = min(x) and do
2.     hull ← true
3.     px ← x
4.     py ← y
5. End select
6.
7. Select one PE where x = max(x) and do
8.     hull ← true
9.     qx ← x
10.    qy ← y
11. End select
12.
13. Add line (px, py), (qx, qy) to the queue
14.
15. While the queue is not empty do
16.    Remove line (px, py), (qx, qy) from the queue
17.    area ← (qx – px) × (y – py) – (x – px) × (qy – py)
18.
19.    Select PEs where area ≥ 0 and do
20.        If any PEs are active then
21.            Select one PE where area = max(area) and do
22.                hull ← true
23.                Add line (px, py), (x, y) to the queue
24.                Add line (x, y), (qx, qy) to the queue
25.            End select
26.        End if
27.    End select
28. End while

Figure 6.5 Associative QuickHull algorithm.

The algorithm starts by establishing an initial reference line, from the point with the smallest x-coordinate to the point with the greatest x-coordinate (both of which are assumed to be unique). This line is then placed in a queue. The main loop repeatedly pulls reference lines from the queue until the queue is empty. For each line, the algorithm calculates the areas of the triangles formed from the points stored in the PEs and the current reference line. The point that forms the triangle with the greatest area is chosen as a hull point and the points inside that triangle are then excluded from further consideration.

The points above the triangle's left and right lines could still be hull points, so the two lines are placed in the queue to be processed later.

The Associative QuickHull algorithm is unique among the algorithms considered because it was designed for both the ASC and MASC models. Even though the multiple control unit MASC model is not technically a multithreaded model, it is similar enough that the MASC version of Associative QuickHull can be easily adapted to run in multith- readed mode. The key is that the task of processing a reference line is independent of all of the other reference lines. Thus, the queue can be made into a shared queue and the threads can pull lines from it and process them in parallel. Note that this is different from the job queue used in the multithreaded version of the other benchmarks. In those benchmarks, the queue is initialized with a static set of jobs and the threads only take jobs from the queue. In the multithreaded version of Associative QuickHull, threads not only take jobs from the queue but they also produce new jobs to be executed and place them into the queue.

6.3.5. VLDC String Matching

The Associative Variable-Length-Don't-Care (VLDC) String Matching algorithm [Esenwein 1996] attempts to match a pattern string containing wildcard characters against a text string. Each wildcard character in the pattern string can match any number of characters in the text string.


Figure 6.6 contains the pseudocode listing of this algorithm. Each PE contains a single character of the text string. The parallel variables are text, which holds the PE's character of the text string; match, which is a Boolean variable indicating whether a match begins with this character; counter, which indicates the number of matched characters prior to this character; and segment, which is an array containing the number of characters matched prior to this character for each pattern segment. The remaining variables are all scalars. pattern is an array containing the pattern string. patternCounter counts the number of pattern characters that have been processed and patternLength counts the number that remain. maxCell contains the ID of the last PE whose character matched the last pattern segment. m and n are the lengths of the pattern string and text string, respectively, and i and k are loop counters.

Associative VLDC String Matching Algorithm

1. patternLength ← m
2. maxCell ← n + 1
3.
4. If pattern[m – 1] = '*' then
5.     Select PEs where id > 0 and do
6.         segment[0] ← 1
7.     End select
8.
9.     patternCounter ← 1
10.    k ← 1
11. End if
12.
13. While patternLength – patternCounter > 0 and maxCell > 0 do
14.    patternLength ← patternLength – patternCounter
15.    patternCounter ← 0
16.    i ← patternLength – 1
17.
18.    While i ≥ 0 and pattern[i] ≠ '*' do
19.        Select PEs where next(text) = pattern[i]
20.            and next(counter) = patternCounter
21.            and next(id) < maxCell, then do
22.            counter ← next(counter) + 1
23.        End select
24.
25.        patternCounter ← patternCounter + 1
26.        i ← i – 1
27.    End while
28.
29.    Select PEs where previous(counter) = patternCounter and do
30.        segment[k] ← patternCounter
31.    End select
32.
33.    Select PEs where segment[k] > 0 and do
34.        If any PEs are active then
35.            maxCell ← max(id)
36.        Else
37.            maxCell ← 0
38.        End if
39.    End select
40.
41.    counter ← 0
42.    patternCounter ← patternCounter + 1
43.    k ← k + 1
44. End while
45.
46. k ← k – 1
47.
48. Select PEs where segment[k] > 0 and do
49.    match ← true
50. End select
51.
52. If pattern[0] = '*' then
53.    Select PEs where id < maxCell and id > 0 and do
54.        match ← true
55.    End select
56. End if

Figure 6.6 Associative VLDC String Matching algorithm.


The algorithm starts by performing some special processing in the case that the pattern string ends with a wildcard character. The outer while loop processes the pattern string by segments, from right to left. The inner while loop processes the pattern characters within a segment, also from right to left. The algorithm uses the PE interconnection network functions next and previous to communicate match information between neighboring PEs. After processing the entire pattern string, the algorithm checks to see if any PEs contain a match.

This algorithm is the least associative of all those considered, performing only a single reduction per pattern segment, resulting in an associativity rating of only 3%.

It is also notable that this is the only algorithm selected that requires the use of a PE interconnection network. Since the required network is simple, only having connections to two neighboring PEs, the simulator assumes only a single cycle of latency for network transfers. In the future, however, it may be interesting to measure the effect of multithreading on networks with more complex interconnection patterns and higher transfer latency.

The Associative VLDC String Matching algorithm is also an ASC algorithm and so it is necessary to run multiple jobs in order to take advantage of multiple threads.

6.4. Summary

This chapter presented a set of associative algorithms to use as benchmarks for the MTASC Processor. It described each algorithm in detail and discussed issues related to running these algorithms on a multithreaded associative SIMD processor. Chapter 7 will show the results of running these benchmarks on the instruction-level simulator described in Chapter 5.

CHAPTER 7

Simulation Results

This chapter will present the results obtained from running the multithreaded associative benchmarks from Chapter 6 using the MTASC Simulator described in Chapter 5. Some of the material presented in this chapter was previously published in [Schaffer 2008a], [Schaffer 2008b] and [Schaffer 2011].

7.1. Comparison of Thread Scheduling Policies

Figure 7.1–Figure 7.5 show the results of comparing the three thread scheduling policies discussed in Chapter 4: the simple scheduler, the dependency-aware scheduler, and the semaphore-aware scheduler. The performance of each scheduler was measured for each of the benchmarks described in Chapter 6. The graphs show the utilization as the number of threads increases. Utilization is defined as the percentage of processor cycles in which the scheduler was able to issue an instruction to one of the execution units. The simulated MTASC Processor had 256 PEs and used fully pipelined, binary tree broadcast and reduction networks, resulting in a broadcast/reduction latency of 16 cycles.
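As a concrete illustration of this metric, the short C++ sketch below computes utilization from two counters. It is a minimal sketch rather than code from the MTASC Simulator; the counters are assumed to be maintained by the simulator's main loop.

// Utilization: the percentage of simulated cycles in which the scheduler
// issued an instruction to one of the execution units.
double utilization(unsigned long long issueCycles, unsigned long long totalCycles) {
    return totalCycles == 0 ? 0.0 : 100.0 * issueCycles / totalCycles;
}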


Figure 7.1 Comparison of thread schedulers executing Jarvis March algorithm.

Figure 7.2 Comparison of thread schedulers executing MST algorithm.


Figure 7.3 Comparison of thread schedulers executing Matrix-Vector Mult algorithm.

Figure 7.4 Comparison of thread schedulers executing QuickHull algorithm.


Figure 7.5 Comparison of thread schedulers executing VLDC String Match algorithm.

For all five benchmarks, the semaphore-aware scheduler has the highest utilization, the dependency-aware scheduler has the second highest, and the simple scheduler has the lowest. However, the difference in performance between the three thread scheduling policies is most pronounced in the results for the QuickHull benchmark. The simple scheduler and the dependency-aware scheduler actually perform worse with multiple threads than with a single thread. This is due to the dynamic nature of the QuickHull benchmark. In the other benchmarks, the set of tasks is known in advance, so threads are always able to pick up a new task until all the tasks are exhausted (right before the benchmark terminates). However, in the QuickHull benchmark, there is initially only a single task. As threads complete tasks, they produce additional tasks which are then added to the queue. Thus, there is a limit to the number of threads that can perform useful work at any particular time, with the remaining threads forced to wait. The simple and dependency-aware schedulers are unable to schedule around these unused threads, reducing overall utilization.

Given that the semaphore-aware scheduler performs best in all of the benchmarks, and considering that it is necessary to achieve any performance improvement at all for the QuickHull benchmark, it was the obvious choice for the MTASC Processor. The semaphore-aware scheduler was used for the simulations of single-threaded vs. multithreaded execution in the next section.

7.2. Single-Threaded Execution vs. Multithreaded Execution

Figure 7.6 shows the results of executing each of the benchmarks from Chapter 6 while varying the number of threads. The graph shows the relative improvement in utilization from using multiple threads vs. using a single thread. Results are shown for up to eight threads, as none of the benchmarks showed any additional improvement beyond eight threads. The simulated MTASC Processor had 256 PEs and used fully pipelined, binary tree broadcast and reduction networks, resulting in a broadcast/reduction latency of 16 cycles.


Figure 7.6 Comparison of multithreaded execution vs. single-threaded execution.

Figure 7.7 Comparison of performance improvement for different benchmarks.


Adding only three or four threads improves performance for all the benchmarks, and that performance improvement is maintained as the number of threads increases. Furthermore, the amount of improvement in performance is influenced by the associativity of the benchmark. As can be seen in Figure 7.7, the highly associative benchmarks, like Jarvis March and MST, show a greater improvement when run in multithreaded mode than the less associative benchmarks, like QuickHull and VLDC String Matching. It is also worth noting that relatively few threads are required to maximize the performance improvement.

7.3. Summary

This chapter presented the results of running the multithreaded associative benchmarks on the MTASC Simulator. It presented results comparing the three thread scheduling policies previously discussed in Chapter 4, showing that the semaphore-aware scheduler performs best. It also presented results comparing multithreaded execution vs. single-threaded execution, showing that three or four threads are generally sufficient to optimize performance for the benchmarks considered.

CHAPTER 8

Related Work

This chapter will discuss two multithreaded data-parallel processors, the ClearSpeed CSX and the Nvidia Tesla, and compare them to the MTASC Processor described in this dissertation. Some of the material presented in this chapter was previously published in [Steinfadt 2009].

8.1. ClearSpeed CSX

The ClearSpeed CSX700, developed by ClearSpeed Technology, is the latest in the CSX family of processors. It is a floating-point SIMD processor designed to accelerate data-parallel portions of application code [ClearSpeed 2011].

8.1.1. Architecture

The ClearSpeed CSX700, as shown in Figure 8.1, contains two Multithreaded Array Processor (MTAP) cores connected to a shared interconnect (the ClearConnect bus) along with external memory and I/O interfaces. Each MTAP core functions as an independent SIMD processor; thus the CSX700 is organized much like a symmetric multiprocessor (SMP), but with SIMD cores instead of scalar cores. As shown in Figure 8.2, each MTAP core consists of a control unit, a mono (scalar) execution unit, and a poly (parallel) execution unit.


Figure 8.1 ClearSpeed CSX700 [ClearSpeed 2011].

The control unit fetches and decodes instructions and then issues them to either the mono execution unit or the poly execution unit, as appropriate. The control unit supports coarse-grain multithreading in order to allow overlap between execution and I/O operations. Thread scheduling is priority-based; the control unit always selects the ready-to-run thread with the highest priority. Semaphores are used to synchronize threads, both with each other and with the hardware (such as the I/O controller).
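A minimal sketch of this selection rule is shown below. The Thread structure and the notion of readiness are simplifications introduced here for illustration; they are not the CSX hardware interface.

#include <cstddef>
#include <vector>

struct Thread {
    int  priority;   // larger value means higher priority
    bool ready;      // not blocked on a semaphore or an I/O operation
};

// Returns the index of the highest-priority ready thread, or -1 if none is ready.
int selectThread(const std::vector<Thread>& threads) {
    int best = -1;
    for (std::size_t i = 0; i < threads.size(); ++i) {
        if (!threads[i].ready)
            continue;
        if (best < 0 || threads[i].priority > threads[best].priority)
            best = static_cast<int>(i);
    }
    return best;
}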


Figure 8.2 ClearSpeed Multithreaded Array Processor (MTAP) core [ClearSpeed 2011].

The mono execution unit is a 16-bit RISC processor with an ALU, multiply-accumulate (MAC) unit, 64-bit floating-point unit (FPU), and a general-purpose register file. The mono execution unit executes all scalar instructions, branch and jump instructions, and semaphore instructions. The mono execution unit is pipelined and replicates registers and other state in order to improve multithreading performance.

The poly execution unit is an array of 96 8-bit PEs. Each PE consists of an ALU, a MAC unit, a 64-bit FPU, a general-purpose register file, an enable stack, local memory, and an I/O buffer for programmed I/O. The PEs are connected together through a swazzle network, a 1D linear array network with nearest-neighbor connections. The poly execution unit is neither pipelined nor multithreaded. When a thread switch occurs, the state of each PE must be explicitly saved and restored.

The broadcast/reduction network is not pipelined, but operates asynchronously with respect to the rest of the processor (as does the swazzle network). This allows execution to continue without having to wait for a reduction to complete, but synchronizing the execution pipeline with the reduction network adds overhead to reduction operations. The reduction network itself is relatively simple. The only operation that the network performs natively is "all enables off" (AEO), which produces a 1 if every PE is disabled or a 0 if at least one PE is enabled. Software can use this operation repeatedly to implement a logical AND reduction of 8-bit values. However, anything more sophisticated than that must be implemented in software.

8.1.2. Associative Computing

As a SIMD processor, the CSX already fulfills most of the requirements of the ASC model: it can perform associative searches, it can enable and disable PEs, and it can detect whether any PEs are enabled (AnyResponders). However, it does not natively support either the PickOne operation or maximum/minimum search. PickOne can itself be implemented using the maximum/minimum search operation along with a unique PE identification number (which is available on the CSX). Thus, the key to enabling the CSX to execute associative code is implementing maximum/minimum search.

As mentioned in the previous section, the CSX's reduction network is only capable of one function, which produces a single bit indicating whether all the PEs are disabled. This is similar to the operation of many early associative memories and processors, which could only produce a some/none response indicator. Thus, we can employ Falkoff's [1962] algorithm to perform the maximum/minimum operation bit-serially. Appendix D contains the source code of a library for performing maximum/minimum search on the CSX.
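To illustrate the bit-serial technique, the following C++ code finds the maximum of a set of unsigned values using nothing more than one masked search and one some/none test per bit. This is a minimal software sketch, not the Appendix D library; the some_responders helper stands in for the hardware some/none indicator, which on the CSX would be obtained from the AEO primitive by enabling only the PEs under test.

#include <cstdint>
#include <iostream>
#include <vector>

// Some/none test: does at least one PE in the mask respond?
static bool some_responders(const std::vector<bool>& mask) {
    for (bool m : mask)
        if (m) return true;
    return false;
}

// Bit-serial maximum search over the PEs marked in 'candidates'.
uint32_t bit_serial_max(const std::vector<uint32_t>& pe_values,
                        std::vector<bool> candidates) {
    uint32_t result = 0;
    for (int bit = 31; bit >= 0; --bit) {
        // Tentatively keep only candidates whose current bit is 1.
        std::vector<bool> trial(candidates.size(), false);
        for (std::size_t i = 0; i < candidates.size(); ++i)
            trial[i] = candidates[i] && ((pe_values[i] >> bit) & 1u);

        if (some_responders(trial)) {
            candidates = trial;            // the maximum must have a 1 in this bit
            result |= (1u << bit);
        }
        // Otherwise every remaining candidate has a 0 here; keep them all.
    }
    return result;                         // 'candidates' now marks the maximum PEs
}

int main() {
    std::vector<uint32_t> values = {17, 42, 42, 5};
    std::vector<bool> enabled(values.size(), true);
    std::cout << bit_serial_max(values, enabled) << "\n";   // prints 42
    return 0;
}

The cost is one search plus one some/none test per bit of the operand, which is consistent with the roughly width-proportional execution times reported in Table 8.1.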

8.1.3. Comparison with MTASC

As a multithreaded SIMD processor, the CSX shares many similarities with the MTASC Processor. However, the CSX differs from the MTASC Processor in three key areas: pipelined broadcast/reduction, coarse-grain vs. fine-grain multithreading, and support for associative computing.

The broadcast/reduction network in the CSX is not pipelined, but asynchronous. An asynchronous broadcast/reduction network is able to operate without slowing down the rest of the processor, but it performs poorly in the face of many back-to-back broadcast and reduction operations, as is typical in associative code. The MTASC Processor's pipelined broadcast/reduction network is capable of initiating and completing a broadcast and reduction operation every clock cycle.


Table 8.1 Execution times for Maximum/Minimum Functions on ClearSpeed CSX.

Function               Execution Time (cycles)
max_char               792
max_short              1322
max_int                2329
max_unsigned_char      812
max_unsigned_short     1301
max_unsigned_int       2390
max_float              3073
max_double             5788
min_char               774
min_short              1324
min_int                2271
min_unsigned_char      784
min_unsigned_short     1346
min_unsigned_int       2356
min_float              3109
min_double             5908


The multithreading in the CSX is coarse-grain. A thread is allowed to run until it encounters a high-latency operation, typically an I/O operation, and then the processor switches to a new thread. In this way, the processor is able to continue execution while I/O operations complete asynchronously. The MTASC Processor employs fine-grain multithreading, which switches threads every clock cycle. With fine-grain multithreading, it is possible to tolerate not only high-latency operations, like I/O, but also lower-latency operations like broadcasts and reductions.

Finally, the CSX's native support for associative computing is limited. While it is possible, as explained in the previous section, to emulate the missing features in software, there is a significant difference in performance. Table 8.1 shows the execution times for the software versions of the maximum and minimum functions on the CSX. (Note that these results were generated on a CSX600, which shares the same internal architecture as the CSX700.) It takes several hundred to several thousand cycles to complete a single maximum or minimum reduction on the CSX. An MTASC Processor built with similar technology would likely be able to perform the same operations with 96 PEs in tens of cycles.

8.2. Nvidia Tesla

Due to the demand for real-time, high-quality 3D graphics, graphics processing units (GPUs) have advanced considerably over the last decade. They have evolved from special-purpose devices for accelerating 3D graphics into general-purpose, multithreaded parallel processors. The Tesla processor is Nvidia's top-of-the-line GPU, designed specifically for high-performance, data-parallel computing.

8.2.1. CUDA

Nvidia's Compute Unified Device Architecture (CUDA) is a parallel programming model that simplifies the task of writing general-purpose software for GPUs like the Tesla [Nvidia 2010a]. CUDA is implemented as a set of extensions to existing sequential programming languages, like C and Fortran. A CUDA program can contain both host code (which executes on the host CPU) and device code (which executes on the GPU). The CUDA compiler splits the code, compiling the device code into PTX assembly (discussed below) and passing the host code on to a host compiler.


Figure 8.3 CUDA thread hierarchy [Nvidia 2010a].

In the CUDA programming model, the programmer writes a sequential program that calls one or more parallel kernels. A kernel is a function that executes in parallel across a set of parallel threads. The threads that execute a kernel are grouped into thread blocks, and these thread blocks are themselves grouped into grids. The threads within a thread block execute concurrently and can cooperate by means of barrier synchronization and shared memory. The thread blocks within a grid are independent of each other and can be executed in parallel or sequentially. The relationship between threads, thread blocks, and grids is illustrated in Figure 8.3.

CUDA threads can access memory in one of three different memory spaces. Each thread has access to a private local memory. Each thread block has a shared memory, which is accessible by all the threads in that block. Finally, there is a global memory that all threads can access. The per-block shared memory is typically used for intra-block communication. The threads within a block must coordinate their access to the shared memory through barrier synchronization. The global memory is used for both intra-grid communication and passing data between kernel executions. Access to the global memory is synchronized through atomic read-modify-write operations. Figure 8.4 shows the relationship between the thread hierarchy and the memory hierarchy.

8.2.2. Architecture

The Nvidia Tesla 20-series GPU is based on the Fermi architecture, shown in Figure 8.5. The Tesla GPU contains 448 Scalar Processor (SP) cores, organized as 14 Streaming Multiprocessors (SMs), a GigaThread thread scheduler, caches, and memory and I/O interfaces. The thread blocks of a CUDA kernel grid are distributed by the thread scheduler to the SMs, one thread block per SM. The threads within the thread block are then executed by the SPs.


Figure 8.4 CUDA memory hierarchy [Nvidia 2010a].

Figure 8.6 shows the organization of a streaming multiprocessor (SM). Each SM contains two scheduler/dispatch units, a register file, 32 SP cores, 16 load/store units, 4 special function units (SFUs), and instruction and data caches. Each SP core contains an integer ALU and an FPU. The SFUs execute complex functions such as reciprocal, square root, sine, cosine, etc.


Figure 8.5 Tesla GPU architecture [Nvidia 2009].

Streaming multiprocessors execute thread blocks using the single-instruction multiple-thread (SIMT) execution model [Nvidia 2010b]. The multiprocessor manages threads in groups of 32 parallel threads called warps. All threads in a warp start at the same instruction address, but each one has its own program counter and can branch independently of the other threads. The SM issues a single stream of instructions for a warp. If all the threads in the warp are at the same instruction address, then execution proceeds at full speed. However, when the threads diverge, due to a data-dependent branch, the SM must execute each branch path sequentially, disabling threads that are not on that branch path. After executing each branch path, the threads converge again.
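The following simplified software model illustrates the divergence behavior just described: the two sides of a data-dependent branch execute one after the other under an active mask, and the lanes reconverge afterward. It is an illustration only, not Nvidia's implementation; the warp size constant and the lane operations are arbitrary.

#include <array>
#include <cstdint>

constexpr int WARP_SIZE = 32;

void divergentBranch(std::array<int32_t, WARP_SIZE>& x) {
    std::array<bool, WARP_SIZE> active;
    active.fill(true);                       // all lanes active when the branch is reached

    // Evaluate the data-dependent condition once per lane to form the branch mask.
    std::array<bool, WARP_SIZE> takeThen;
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        takeThen[lane] = active[lane] && (x[lane] > 0);

    // "then" path: executed only by lanes whose condition holds.
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        if (takeThen[lane])
            x[lane] *= 2;

    // "else" path: executed next by the remaining active lanes.
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        if (active[lane] && !takeThen[lane])
            x[lane] = -x[lane];

    // After both paths have been issued, all lanes reconverge under the original mask.
}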


Figure 8.6 Tesla Streaming Multiprocessor [Nvidia 2009].

The instruction streams from different warps are interleaved and then executed on the SPs. In this way, stalls due to intra-warp dependencies can often be avoided. The streaming multiprocessors in the Tesla GPU contain dual warp schedulers, allowing a single SM to issue instructions from two warps in a single cycle, as shown in Figure 8.7.


Figure 8.7 Warp execution in a Tesla SM [Nvidia 2009].

8.2.3. Associative Computing

The ASC model can be implemented on the Tesla by mapping PEs to threads. PEs execute a sequence of instructions broadcast from a control unit, in the same way that threads execute a sequence of instructions issued by a scheduler. Both PEs and threads have a private local memory, and both implement a form of flow control based on enabling and disabling the PEs/threads.

However, the Tesla provides no hardware support for reduction operations. Even the AnyResponders operation, which is available on any SIMD processor, does not exist on the Tesla. While threads can be disabled, there is no way to query the hardware to determine whether any threads are currently enabled. Furthermore, the process of performing a reduction in software on the Tesla is complicated by the thread hierarchy. To perform a reduction, it is necessary to first perform the reduction within the thread blocks using shared memory and barrier synchronization. Then the results from each thread block must be combined using global memory and atomic read-modify-write operations.

8.2.4. Comparison with MTASC

While both the MTASC Processor and the Tesla are multithreaded data-parallel processors, their architectures differ significantly. The MTASC Processor, being based on a classic SIMD architecture, uses a large PE array to expose data parallelism and then uses threads to overcome pipeline hazards caused by broadcast and reduction operations. The Tesla GPU, on the other hand, uses threads to expose data parallelism. Thus, the Tesla must support a large number of threads, many more than the MTASC Processor.

The MTASC Processor uses a pipelined broadcast/reduction network to support high-speed broadcast and reduction operations in hardware, even for configurations with thousands of PEs. The Tesla GPU, however, does not provide any support for broadcast or reduction operations at the hardware level. The Tesla maintains scalability by partitioning: executing threads on SPs, thread blocks on SMs, and (in the future) grids on separate GPUs. However, this makes collective operations like reductions difficult to implement efficiently in software.

Finally, the MTASC Processor is designed specifically to support the ASC model. As mentioned in the previous section, while it is possible to map the ASC model to a Tesla GPU, the lack of hardware reduction support makes the process difficult and inefficient.


8.3. Summary

This chapter discussed two multithreaded data-parallel processors: the ClearSpeed CSX and the Nvidia Tesla. For each processor, an overview of the architecture was given, including a discussion of how it could implement the ASC model. The processors were also compared to the MTASC Processor.

CHAPTER 9

Conclusions and Future Work

9.1. Summary

This dissertation has presented the design and implementation of the MTASC Processor, a multithreaded associative SIMD processor that uses a combination of pipelined broadcast and reduction and fine-grain hardware multithreading to overcome the broadcast/reduction bottleneck.

Chapter 2 introduced the Associative Computing Model (ASC), the computational model on which the MTASC Processor is based. Chapter 3 discussed the effect of pipelining the broadcast and reduction networks in an associative SIMD processor and showed how pipelining, while effective in reducing clock cycle time, is not a complete solution to the broadcast/reduction bottleneck. Chapter 4 gave an overview of hardware multithreading and showed how a combination of pipelined broadcast and reduction and fine-grain hardware multithreading could be used to solve the problem of the broadcast/reduction bottleneck.

Chapter 5 provided a detailed overview of the instruction set architecture and organization of the MTASC Processor, which implements the pipelining and multithreading enhancements previously mentioned. It also discussed the development of a cycle-accurate instruction-level simulator that could be used to evaluate the MTASC Processor.

Chapter 6 presented a set of five associative algorithms to serve as benchmarks and discussed how those algorithms could be adapted to run on the MTASC Processor.

Chapter 7 then provided the results of running those benchmarks on the aforementioned simulator, which showed that the MTASC Processor was not only able to improve the performance of those benchmarks over a single-threaded implementation, but also able to maintain that improvement as the broadcast/reduction latency increased. Furthermore, the MTASC Processor was able to accomplish this using relatively few hardware threads.

Finally, Chapter 8 introduced two similar multithreaded, data-parallel processors, the ClearSpeed CSX and the Nvidia Tesla, and compared them to the MTASC Processor.

9.2. Contributions

The primary contributions of this dissertation are:

1. the evaluation of two different pipeline organizations for a SIMD processor [Schaffer 2007];

2. the evaluation of three different hardware thread scheduling algorithms for a multithreaded SIMD processor [Schaffer 2008a, Schaffer 2008b];

3. the design of a novel multithreaded associative SIMD processor [Schaffer 2011];

4. the development of a cycle-accurate instruction-level simulator for the MTASC Processor [Schaffer 2008b];

5. the adaptation of five associative algorithms to serve as benchmarks for the MTASC Processor [Schaffer 2011]; and

6. the comparison of the MTASC Processor with two other similar, commercially available multithreaded data-parallel processors.

While previous research has considered the effects of pipelining SIMD instruction broadcast and SIMD instruction execution separately, this dissertation is the first to explore the design of an end-to-end pipeline that includes broadcast, execution, and reduction. Chapter 3 presents two such pipeline organizations: a unified pipeline and a diversified pipeline. These pipeline organizations are then compared based on the types of hazards that are likely to occur in an associative SIMD processor.

There is much existing research on the topic of hardware multithreading for sequential processors, but very little for SIMD processors. In particular, this dissertation is the first to consider the use of fine-grained multithreading with a SIMD processor. Chapter 4 presents three possible implementations of this technique and Chapter 7 presents the results of comparing those implementations using a set of multithreaded associative benchmarks.

Chapter 5 presents the design of the MTASC Processor, a novel fine-grain multithreaded associative SIMD processor. This chapter also presents the design of a fully custom, cycle-accurate, instruction-level simulator for the MTASC Processor.

While there are existing algorithms for both the single-instruction-stream ASC model and the multiple-instruction-stream MASC model, there are no existing algorithms for a multithreaded ASC model. Three ASC algorithms (Jarvis March, Minimum Spanning Tree, and VLDC String Matching) and one MASC algorithm (QuickHull) were adapted so that they could be executed on the MTASC Processor. A fifth algorithm, Matrix-Vector Multiplication, was developed specifically for this dissertation. These algorithms are all described in Chapter 6.

Finally, Chapter 8 presents a comparison between the MTASC Processor and two commercially available multithreaded data-parallel processors: the ClearSpeed CSX700 and the Nvidia Tesla. For both of these processors, it is possible to emulate the operations necessary to execute associative code. However, their lack of hardware support for reductions would severely limit their performance. An MTASC Processor built with similar technology would likely outperform both of them when executing associative code.

9.3. Future Work

The focus of this dissertation is improving the performance of SIMD processors and, in particular, associative SIMD processors by addressing the broadcast/reduction bottleneck. However, the broadcast/reduction bottleneck is not the only issue which limits the potential performance of SIMD processors. Off-chip I/O and inter-PE communication also represent potential bottlenecks that must be addressed. Future work could look into the possibility of using pipelining, multithreading, and/or other techniques to address these bottlenecks as well.

The set of multithreaded associative SIMD benchmarks could be expanded. In particular, it would be beneficial to adapt more MASC algorithms to the MTASC architecture since these algorithms, once adapted, tend to expose more inter-thread communication and synchronization.

At this point in time, the only way to write software for the MTASC Processor is to write it in MTASC assembly. Future work could include the development of a high-level language compiler for the MTASC architecture to make writing software easier. The ability to compile programs written in an existing high-level data-parallel language, like ASC or Cn, would be beneficial.

Finally, a limitation of this dissertation work is that the MTASC Processor has not been implemented in hardware. All of the results presented were derived from software-based simulations of the architecture. Future work could address this limitation by implementing the MTASC Processor in hardware and verifying the results from the software simulations.

BIBLIOGRAPHY

Agarwal, Anant, ―Performance Tradeoffs in Multithreaded Processors,‖ IEEE Transactions on Parallel and Distributed Systems, vol. 3, no. 5, Sept. 1992, pp. 525–539.

Allen, James D. and David E. Schimmel, ―The Impact of Pipelining on SIMD Architectures,‖ Proceedings of the International Parallel Processing Symposium (IPPS), 1995, pp. 380–387.

Allen, James D. and David E. Schimmel, ―Issues in the Design of High-Performance SIMD Architectures,‖ IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 8, Aug. 1996, pp. 818–829.

Atwah, Maher M., Johnnie W. Baker, and Selim Akl, ―An Associative Implementation of Classical Convex Hull Algorithms,‖ Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), 1996, pp. 435–438.

Batcher, Kenneth E., ―STARAN Parallel Processor System Hardware,‖ Proceedings of the AFIPS National Computer Conference and Exposition, 1974, pp. 405–410.

Batcher, Kenneth E., ―Design of a Massively Parallel Processor,‖ IEEE Transactions on Computers, vol. C-29, no. 9, 1980, pp. 836–840.

Batcher, Kenneth E., ―Bit-Serial Parallel Processing Systems,‖ IEEE Transactions on Computers, vol. C-31, no. 5, 1982, pp. 377–384.

Burger, Doug and James R. Goodman, ―Billion-Transistor Architectures: There and Back Again,‖ Computer, vol. 39, no. 3, Mar. 2004, p. 22.

Chantamas, Wittaya and Johnnie W. Baker, ―A Multiple Associative Model to Support Branches in Data Parallel Applications using the Manager-Worker Paradigm,‖ Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS)—Workshop on Massively Parallel Processing (WMPP), 2005, p. 266.

Chisvin, Lawrence and R. James Duckworth, ―Content-Addressable and Associative Memory: Alternatives to the Ubiquitous RAM,‖ Computer, vol. 22, no. 7, 1989, pp. 51–64.


Chitalia, Jalpesh K. and Robert A. Walker, ―Efficient Associative SIMD Processing for Non-Tabular Structured Data,‖ Proceedings of the Northeast Workshop on Circuits and Systems (NEWCAS), 2004, pp. 265–268.

ClearSpeed Technology, CSX600/CSX700 Instruction Set Reference Manual, Aug. 2008.

ClearSpeed Technology, CSX700 Floating Point Processor Datasheet, Jan. 2011.

Culler, David, J.P. Singh, and Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998.

Esenwein, Mary C. and Johnnie W. Baker, ―VLDC String Matching for Associative Computing and Multiple Broadcast Mesh,‖ Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), 1996, pp. 69‒74.

Falkoff, Adin D., ―Algorithms for Parallel Search Memories,‖ Journal of the ACM, vol. 9, no. 4, Oct. 1962, pp. 488–511.

Flynn, Michael J., ―Some Computer Organizations and Their Effectiveness,‖ IEEE Transactions on Computers, vol. C-21, no. 9, 1972, pp. 948–960.

Gamma, Erich, Richard Helm, Ralph Johnson, and John Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, 1995.

Grosspietch, Karl E., ―Associative Processors and Memories: A Survey,‖ IEEE Micro, vol. 12, no. 3, May/June 1992, pp. 12–19.

Hennessy, John L. and David A. Patterson, Computer Architecture: A Quantitative Approach, 4th ed., Morgan Kaufmann, 2007.

Herbordt, Martin C. and Charles C. Weems, ―Associative, Multiassociative, and Hybrid Processing,‖ Associative Processing and Processors, Anargyros Krikelis and Charles C. Weems, eds., IEEE Computer Society Press, 1997, pp. 26–49.

Jin, Mingxian, Johnnie W. Baker, and Kenneth E. Batcher, ―Timings for Associative Operations on the MASC Model,‖ Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS)—Workshop on Massively Parallel Processing (WMPP), 2001, p. 193.

Johnson, Michael, Superscalar Design, Prentice Hall, 1990.

Levy, Henry M., Susan J. Eggers, and Dean M. Tullsen, ―Simultaneous Multithreading: Maximizing On-Chip Parallelism,‖ Proceedings of the International Symposium on Computer Architecture (ISCA), 1995, p. 392.


Lilja, David J., ―Reducing the Branch Penalty in Pipelined Processors,‖ Computer, vol. 21, no. 7, July 1988, pp. 47–55.

Meilander, Will, Johnnie W. Baker, and Mingxian Jin, ―Importance of SIMD Computation Reconsidered,‖ Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS)—Workshop on Massively Parallel Processing (WMPP), 2003, p. 266a.

Nvidia Corporation, Fermi Compute Architecture, 2009.

Nvidia Corporation, CUDA C Programming Guide, Nov. 2010.

Nvidia Corporation, PTX: Parallel Thread Execution ISA, Oct. 2010.

Parhami, Behrooz, ―Associative Memories and Processors: An Overview and Selected Bibliography,‖ Proceedings of the IEEE, vol. 61, no. 6, 1973, pp. 722–730.

Parhami, Behrooz, ―Search and Data Selection Algorithms for Associative Processors,‖ Associative Processing and Processors, Anargyros Krikelis and Charles C. Weems, eds., IEEE Computer Society Press, 1997, pp. 10–25.

Parr, Terence, ―ANTLR Parser Generator,‖ Mar. 2011; http://www.antlr.org.

Patterson, David A. and John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 4th ed., Morgan Kaufmann, 2009.

Potter, Jerry, Associative Computing: A Programming Paradigm for Massively Parallel Computers, Plenum Press, 1992.

Potter, Jerry, Johnnie W. Baker, Stephen Scott, Arvind Bansal, Chokchai Leangsuksun, and Chandra Asthagiri, ―ASC: An Associative-Computing Paradigm,‖ Computer, vol. 27, no. 11, Nov. 1994, pp. 19-25.

Schaffer, Kevin and Robert A. Walker, ―A Prototype Multithreaded Associative SIMD Processor,‖ Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS)—Workshop on Advances in Parallel and Distributed Computing Models (APDCM), 2007, p. 228.

Schaffer, Kevin and Robert A. Walker, ―Using Hardware Multithreading to Overcome Broadcast/Reduction Latency in an Associative SIMD Processor,‖ Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS)— Workshop on Large-Scale Parallel Processing (LSPP), 2008, p. 289.

Schaffer, Kevin and Robert A. Walker, ―Using Hardware Multithreading to Overcome Broadcast/Reduction Latency in an Associative SIMD Processor (Extended Version),‖ Parallel Processing Letters, vol. 18, no. 4, 2008, pp. 491–509.


Schaffer, Kevin and Robert A. Walker, ―MTASC: A Multithreaded Associative SIMD Processor,‖ to appear in Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS)—Workshop on Large-Scale Parallel Processing (LSPP), 2011.

Scherger, Michael, Johnnie W. Baker, and Jerry Potter, ―Multiple Instruction Stream Control for an Associative Model of Parallel Computation,‖ Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS)— Workshop on Massively Parallel Processing (WMPP), 2003, p. 266b.

Singh, Virdi Sabegh, Hong Wang, and Robert A. Walker, ―Solving the Longest Common Subsequence (LCS) Problem using the Associative ASC Processor with Reconfigurable 2D Mesh,‖ Proceedings of the International Conference on Parallel and Distributed Computing and Systems (PDCS), 1996, pp. 454–459.

Sodan, A.C., Jacob Machina, Arash Deshmeh, Kevin Macnaughton, and Bryan Esbaugh, ―Parallelism via Multithreaded and Multicore CPUs,‖ Computer, vol. PP, no. 99, 2011, p. 1.

Steinfadt, Shannon and Kevin Schaffer, ―Parallel Approaches for SWAMP Sequence Alignment,‖ Proceedings of the Ohio Collaborative Conference for Bioinformatics (OCCBIO), 2009.

Ungerer, Theo, Borut Robic, and Jurij Šilc, ―Multithreaded Processors,‖ Computer Journal, vol. 45, no. 3, 2002, pp. 320–348.

Walker, Robert A., Jerry Potter, Yanping Wang, and Meiduo Wu, ―Implementing Associative Processing: Rethinking Earlier Architectural Decisions,‖ Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS)— Workshop on Massively Parallel Processing (WMPP), 2001, p. 195.

Wang, Hong and Robert A. Walker, ―Implementing a Scalable ASC Processor,‖ Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS)—Workshop on Massively Parallel Processing (WMPP), 2003, p. 267a.

Wang, Hong, Lei Xie, Meiduo Wu, and Robert A. Walker, ―A Scalable Associative Processor with Applications in Database and Image Processing,‖ Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS)— Workshop on Massively Parallel Processing (WMPP), 2004, p. 259b.

Wang, Hong and Robert A. Walker, ―A Scalable Pipelined Associative SIMD Array with Reconfigurable PE Interconnection Network for Embedded Applications,‖ Proceedings of the International Conference on Parallel and Distributed Computing and Systems (PDCS), 2005, pp. 667–673.


Wang, Hong and Robert A. Walker, ―Implementing a Multiple-Instruction-Stream Associative MASC Processor,‖ Proceedings of the International Conference on Parallel and Distributed Computing Systems (PDCS), 2006, pp. 460–465.

Weems, Charles C. and Anargyros Krikelis, ―Associative Processing and Processors,‖ Associative Processing and Processors, Anargyros Krikelis and Charles C. Weems, eds., IEEE Computer Society Press, 1997, pp. 2–9.

Wu, Meiduo, Robert A. Walker, and Jerry Potter, ―Implementing Associative Search and Responder Resolution,‖ Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS)—Workshop on Massively Parallel Processing (WMPP), 2002, p. 246.

Yau, S.S. and H.S. Fung, ―Associative Processor Architecture: A Survey,‖ ACM Computing Surveys, vol. 9, no. 1, Mar. 1977, pp. 3–27.

APPENDIX A

MTASC Instruction Set

A.1. ALU Instructions

add rd, ra, rb            Add
addi rd, ra, immed        Add Immediate
and rd, ra, rb            Bitwise Logical AND
andi rd, ra, immed        Bitwise Logical AND Immediate
cmpeq rd, ra, rb          Compare Equal
cmpeqi rd, ra, immed      Compare Equal Immediate
cmple rd, ra, rb          Compare Less Than or Equal
cmplei rd, ra, immed      Compare Less Than or Equal Immediate
cmpleiu rd, ra, immed     Compare Less Than or Equal Immediate Unsigned
cmpleu rd, ra, rb         Compare Less Than or Equal Unsigned
cmplt rd, ra, rb          Compare Less Than
cmplti rd, ra, immed      Compare Less Than Immediate
cmpltiu rd, ra, immed     Compare Less Than Immediate Unsigned
cmpltu rd, ra, rb         Compare Less Than Unsigned
div rd, ra, rb            Divide
divu rd, ra, rb           Divide Unsigned
mul rd, ra, rb            Multiply
mulu rd, ra, rb           Multiply Unsigned
nor rd, ra, rb            Bitwise Logical NOR
or rd, ra, rb             Bitwise Logical OR
ori rd, ra, immed         Bitwise Logical OR Immediate
sll rd, ra, rb            Shift Left Logical
slli rd, ra, immed        Shift Left Logical Immediate
sra rd, ra, rb            Shift Right Arithmetic
srai rd, ra, immed        Shift Right Arithmetic Immediate
srl rd, ra, rb            Shift Right Logical
srli rd, ra, immed        Shift Right Logical Immediate
sub rd, ra, rb            Subtract
xor rd, ra, rb            Bitwise Logical XOR
xori rd, ra, immed        Bitwise Logical XOR Immediate


A.2. Load/Store Instructions

lb rd, immed(ra)          Load Byte
lbu rd, immed(ra)         Load Byte Unsigned
lh rd, immed(ra)          Load Halfword
lhu rd, immed(ra)         Load Halfword Unsigned
sb rd, immed(ra)          Store Byte
sh rd, immed(ra)          Store Halfword

A.3. Branch and Jump Instructions

beqz ra, immed            Branch If Equal to Zero
bnez ra, immed            Branch If Not Equal to Zero
j immed                   Jump
jal immed                 Jump and Link
jalr ra                   Jump and Link Register
jr ra                     Jump Register

A.4. Enable Stack Instructions

ifeqz ra                  Enable If Equal to Zero
ifnez ra                  Enable If Not Equal to Zero
else                      Invert Enable State
endif                     Restore Previous Enable State

A.5. Reduction Instructions

count rd                  Count PEs
max rd, ra                Maximum
maxu rd, ra               Maximum Unsigned
min rd, ra                Minimum
minu rd, ra               Minimum Unsigned
sum rd, ra                Sum

A.6. Semaphore Instructions

signal rd                 Signal Semaphore
wait rd                   Wait Semaphore


A.7. Miscellaneous Instructions

halt                      Halt (Simulation)
ldpid rd                  Load PE ID
ldtid rd                  Load Thread ID

APPENDIX B

MTASC Assembler and Simulator

B.1. MTASCAssembly.g grammar MTASCAssembly; options { ASTLabelType = CommonTree; output = AST; }

@header { import java.util.HashMap; import java.util.Map; }

@members { // Symbol table private Map symbolTable = new HashMap();

/** * Returns the symbol table generated during parsing. */ public Map getSymbolTable() { return symbolTable; }

// Sections private final int SECTION_PARALLEL = 0; private final int SECTION_SCALAR = 1; private final int SECTION_TEXT = 2;

// Per-section address counters private int[] counters = new int[3];

// Current section private int section = SECTION_TEXT; }

// Program program : (label_definition | statement)* ;


label_definition : ID ':' { symbolTable.put($ID.text, new Integer(counters[section])); } -> ; statement : assembler_directive EOL | instruction EOL { counters[section] += 4; } | EOL ;

// Assembler Directive assembler_directive : ASCII s=STRING { counters[section] += $s.text.length() - 2; } -> ^(ASCII $s) | ASCIIZ s=STRING { counters[section] += $s.text.length() - 1; } -> ^(ASCIIZ $s) | BYTE i=INT { counters[section] += 1; } -> ^(BYTE $i) | HALF i=INT { counters[section] += 2; } -> ^(HALF $i) | PARALLEL { section = SECTION_PARALLEL; } | SCALAR { section = SECTION_SCALAR; } | SPACE i=INT { counters[section] += Integer.parseInt($i.text); } -> ^(SPACE $i) | TEXT { section = SECTION_TEXT; } | WORD i=INT { counters[section] += 4; } -> ^(WORD $i) ;

// Instructions instruction : alu_instruction_pp | alu_instruction_ps | alu_instruction_sp | alu_instruction_ss | alu_immediate_instruction_p | alu_immediate_instruction_s | branch_instruction | if_instruction | load_store_instruction_pp | load_store_instruction_ps | load_store_instruction_ss | jump_instruction | jump_register_instruction | network_instruction | no_operand_instruction | reduce_instruction | semaphore_instruction | special_parallel_instruction | special_scalar_instruction ; alu_instruction_pp : o=alu_op d=p_reg ',' a=p_reg ',' b=p_reg -> ^($o $d $a $b) ;


alu_instruction_ps : o=alu_op d=p_reg ',' a=p_reg ',' b=s_reg -> ^($o $d $a $b) ; alu_instruction_sp : o=alu_op d=p_reg ',' a=s_reg ',' b=p_reg -> ^($o $d $a $b) ; alu_instruction_ss : o=alu_op d=s_reg ',' a=s_reg ',' b=s_reg -> ^($o $d $a $b) ; alu_op : ADD | AND | CMPEQ | CMPLE | CMPLEU | CMPLT | CMPLTU | DIV | DIVU | OR | MUL | MULU | NOR | SLL | SRA | SRL | SUB | XOR ; alu_immediate_instruction_p : o=alu_immediate_op d=p_reg ',' a=p_reg ',' i=immediate -> ^($o $d $a $i) ; alu_immediate_instruction_s : o=alu_immediate_op d=s_reg ',' a=s_reg ',' i=immediate -> ^($o $d $a $i) ; alu_immediate_op : ADDI | ANDI | CMPEQI | CMPLEI | CMPLEIU | CMPLTI | CMPLTIU | ORI | SLLI | SRAI | SRLI | XORI ; branch_instruction : o=branch_op a=s_reg ',' i=immediate -> ^($o $a $i) ;


branch_op : BEQZ | BNEZ ; if_instruction : o=if_op a=p_reg -> ^($o $a) ; if_op : IFEQZ | IFNEZ ; load_store_instruction_pp : o=load_store_op d=p_reg ',' i=immediate '(' a=p_reg ')' -> ^($o $d $a $i) ; load_store_instruction_ps : o=load_store_op d=p_reg ',' i=immediate '(' a=s_reg ')' -> ^($o $d $a $i) ; load_store_instruction_ss : o=load_store_op d=s_reg ',' i=immediate '(' a=s_reg ')' -> ^($o $d $a $i) ; load_store_op : LB | LBU | LH | LHU | SB | SH ; jump_instruction : o=jump_op i=immediate -> ^($o $i) ; jump_op : J | JAL ; jump_register_instruction : o=jump_register_op a=s_reg -> ^($o $a) ; jump_register_op : JALR | JR ; network_instruction : o=network_op d=p_reg ',' a=p_reg -> ^($o $d $a) ; network_op : NNEXT | NPREV


; no_operand_instruction : no_operand_op ; no_operand_op : ELSE | ENDIF | HALT | SYSCALL ; reduce_instruction : o=reduce_op d=s_reg ',' a=p_reg -> ^($o $d $a) ; reduce_op : MAX | MAXU | MIN | MINU | SUM ; semaphore_instruction : o=semaphore_op d=g_reg -> ^($o $d) ; semaphore_op : SIGNAL | WAIT ; special_parallel_instruction : o=special_parallel_op d=p_reg -> ^($o $d) ; special_parallel_op : LDPID ; special_scalar_instruction : o=special_scalar_op d=s_reg -> ^($o $d) ; special_scalar_op : COUNT | LDTID ;

// Global (Semaphore) Registers g_reg returns [int number] : G00 { $number = 0; } | G01 { $number = 1; } | G02 { $number = 2; } | G03 { $number = 3; } | G04 { $number = 4; } | G05 { $number = 5; } | G06 { $number = 6; }


| G07 { $number = 7; } | G08 { $number = 8; } | G09 { $number = 9; } | G10 { $number = 10; } | G11 { $number = 11; } | G12 { $number = 12; } | G13 { $number = 13; } | G14 { $number = 14; } | G15 { $number = 15; } ;

// Parallel Registers p_reg returns [int number] : P00 { $number = 0; } | P01 { $number = 1; } | P02 { $number = 2; } | P03 { $number = 3; } | P04 { $number = 4; } | P05 { $number = 5; } | P06 { $number = 6; } | P07 { $number = 7; } | P08 { $number = 8; } | P09 { $number = 9; } | P10 { $number = 10; } | P11 { $number = 11; } | P12 { $number = 12; } | P13 { $number = 13; } | P14 { $number = 14; } | P15 { $number = 15; } ;

// Scalar Registers s_reg returns [int number] : S00 { $number = 0; } | S01 { $number = 1; } | S02 { $number = 2; } | S03 { $number = 3; } | S04 { $number = 4; } | S05 { $number = 5; } | S06 { $number = 6; } | S07 { $number = 7; } | S08 { $number = 8; } | S09 { $number = 9; } | S10 { $number = 10; } | S11 { $number = 11; } | S12 { $number = 12; } | S13 { $number = 13; } | S14 { $number = 14; } | S15 { $number = 15; } ;

// Immediate immediate : ID | INT ;

// Assembler Directives ASCII : '.' ('A'|'a') ('S'|'s') ('C'|'c') ('I'|'i') ('I'|'i') ; ASCIIZ : '.' ('A'|'a') ('S'|'s') ('C'|'c') ('I'|'i') ('I'|'i') ('Z'|'z') ;


BYTE : '.' ('B'|'b') ('Y'|'y') ('T'|'t') ('E'|'e') ; HALF : '.' ('H'|'h') ('A'|'a') ('L'|'l') ('F'|'f') ; PARALLEL : '.' ('P'|'p') ('A'|'a') ('R'|'r') ('A'|'a') ('L'|'l') ('L'|'l') ('E'|'e') ('L'|'l') ; SCALAR : '.' ('S'|'s') ('C'|'c') ('A'|'a') ('L'|'l') ('A'|'a') ('R'|'r') ; SPACE : '.' ('S'|'s') ('P'|'p') ('A'|'a') ('C'|'c') ('E'|'e') ; TEXT : '.' ('T'|'t') ('E'|'e') ('X'|'x') ('T'|'t') ; WORD : '.' ('W'|'w') ('O'|'o') ('R'|'r') ('D'|'d') ;

// Instructions ADD : ('A'|'a') ('D'|'d') ('D'|'d') ; ADDI : ('A'|'a') ('D'|'d') ('D'|'d') ('I'|'i') ; AND : ('A'|'a') ('N'|'n') ('D'|'d') ; ANDI : ('A'|'a') ('N'|'n') ('D'|'d') ('I'|'i') ; BEQZ : ('B'|'b') ('E'|'e') ('Q'|'q') ('Z'|'z') ; BNEZ : ('B'|'b') ('N'|'n') ('E'|'e') ('Z'|'z') ; CMPEQ : ('C'|'c') ('M'|'m') ('P'|'p') ('E'|'e') ('Q'|'q') ; CMPEQI : ('C'|'c') ('M'|'m') ('P'|'p') ('E'|'e') ('Q'|'q') ('I'|'i') ; CMPLE : ('C'|'c') ('M'|'m') ('P'|'p') ('L'|'l') ('E'|'e') ; CMPLEI : ('C'|'c') ('M'|'m') ('P'|'p') ('L'|'l') ('E'|'e') ('I'|'i') ; CMPLEIU : ('C'|'c') ('M'|'m') ('P'|'p') ('L'|'l') ('E'|'e') ('I'|'i') ('U'|'u') ; CMPLEU : ('C'|'c') ('M'|'m') ('P'|'p') ('L'|'l') ('E'|'e') ('U'|'u') ; CMPLT : ('C'|'c') ('M'|'m') ('P'|'p') ('L'|'l') ('T'|'t') ; CMPLTI : ('C'|'c') ('M'|'m') ('P'|'p') ('L'|'l') ('T'|'t') ('I'|'i') ; CMPLTIU : ('C'|'c') ('M'|'m') ('P'|'p') ('L'|'l') ('T'|'t') ('I'|'i') ('U'|'u') ; CMPLTU : ('C'|'c') ('M'|'m') ('P'|'p') ('L'|'l') ('T'|'t') ('U'|'u') ; COUNT : ('C'|'c') ('O'|'o') ('U'|'u') ('N'|'n') ('T'|'t') ; DIV : ('D'|'d') ('I'|'i') ('V'|'v') ; DIVU : ('D'|'d') ('I'|'i') ('V'|'v') ('U'|'u') ; ELSE : ('E'|'e') ('L'|'l') ('S'|'s') ('E'|'e') ; ENDIF : ('E'|'e') ('N'|'n') ('D'|'d') ('I'|'i') ('F'|'f') ; HALT : ('H'|'h') ('A'|'a') ('L'|'l') ('T'|'t') ; IFEQZ : ('I'|'i') ('F'|'f') ('E'|'e') ('Q'|'q') ('Z'|'z') ; IFNEZ : ('I'|'i') ('F'|'f') ('N'|'n') ('E'|'e') ('Z'|'z') ; J : ('J'|'j') ; JAL : ('J'|'j') ('A'|'a') ('L'|'l'); JALR : ('J'|'j') ('A'|'a') ('L'|'l') ('R'|'r') ; JR : ('J'|'j') ('R'|'r') ; LB : ('L'|'l') ('B'|'b') ; LBU : ('L'|'l') ('B'|'b') ('U'|'u') ; LDPID : ('L'|'l') ('D'|'d') ('P'|'p') ('I'|'i') ('D'|'d') ; LDTID : ('L'|'l') ('D'|'d') ('T'|'t') ('I'|'i') ('D'|'d') ; LH : ('L'|'l') ('H'|'h') ; LHU : ('L'|'l') ('H'|'h') ('U'|'u') ; MAX : ('M'|'m') ('A'|'a') ('X'|'x') ; MAXU : ('M'|'m') ('A'|'a') ('X'|'x') ('U'|'u') ; MIN : ('M'|'m') ('I'|'i') ('N'|'n') ; MINU : ('M'|'m') ('I'|'i') ('N'|'n') ('U'|'u') ; MUL : ('M'|'m') ('U'|'u') ('L'|'l') ; MULU : ('M'|'m') ('U'|'u') ('L'|'l') ('U'|'u') ; NNEXT : ('N'|'n') ('N'|'n') ('E'|'e') ('X'|'x') ('T'|'t') ; NOR : ('N'|'n') ('O'|'o') ('R'|'r') ; NPREV : ('N'|'n') ('P'|'p') ('R'|'r') ('E'|'e') ('V'|'v') ; OR : ('O'|'o') ('R'|'r') ; ORI : ('O'|'o') ('R'|'r') ('I'|'i') ; SB : ('S'|'s') ('B'|'b') ; SH : ('S'|'s') ('H'|'h') ; SIGNAL : ('S'|'s') ('I'|'i') ('G'|'g') ('N'|'n') ('A'|'a') ('L'|'l') ;


SLL : ('S'|'s') ('L'|'l') ('L'|'l') ; SLLI : ('S'|'s') ('L'|'l') ('L'|'l') ('I'|'i') ; SRA : ('S'|'s') ('R'|'r') ('A'|'a') ; SRAI : ('S'|'s') ('R'|'r') ('A'|'a') ('I'|'i') ; SRL : ('S'|'s') ('R'|'r') ('L'|'l') ; SRLI : ('S'|'s') ('R'|'r') ('L'|'l') ('I'|'i') ; SUB : ('S'|'s') ('U'|'u') ('B'|'b') ; SUM : ('S'|'s') ('U'|'u') ('M'|'m') ; SYSCALL : ('S'|'s') ('Y'|'y') ('S'|'s') ('C'|'c') ('A'|'a') ('L'|'l') ('L'|'l') ; WAIT : ('W'|'w') ('A'|'a') ('I'|'i') ('T'|'t') ; XOR : ('X'|'x') ('O'|'o') ('R'|'r') ; XORI : ('X'|'x') ('O'|'o') ('R'|'r') ('I'|'i') ;

// Global (Semaphore) Registers G00 : ('S'|'s') ('E'|'e') ('M'|'m') '0' ; G01 : ('S'|'s') ('E'|'e') ('M'|'m') '1' ; G02 : ('S'|'s') ('E'|'e') ('M'|'m') '2' ; G03 : ('S'|'s') ('E'|'e') ('M'|'m') '3' ; G04 : ('S'|'s') ('E'|'e') ('M'|'m') '4' ; G05 : ('S'|'s') ('E'|'e') ('M'|'m') '5' ; G06 : ('S'|'s') ('E'|'e') ('M'|'m') '6' ; G07 : ('S'|'s') ('E'|'e') ('M'|'m') '7' ; G08 : ('S'|'s') ('E'|'e') ('M'|'m') '8' ; G09 : ('S'|'s') ('E'|'e') ('M'|'m') '9' ; G10 : ('S'|'s') ('E'|'e') ('M'|'m') '10' ; G11 : ('S'|'s') ('E'|'e') ('M'|'m') '11' ; G12 : ('S'|'s') ('E'|'e') ('M'|'m') '12' ; G13 : ('S'|'s') ('E'|'e') ('M'|'m') '13' ; G14 : ('S'|'s') ('E'|'e') ('M'|'m') '14' ; G15 : ('S'|'s') ('E'|'e') ('M'|'m') '15' ;

// Parallel Registers P00 : ('P'|'p') '0' ; P01 : ('P'|'p') '1' ; P02 : ('P'|'p') '2' ; P03 : ('P'|'p') '3' ; P04 : ('P'|'p') '4' ; P05 : ('P'|'p') '5' ; P06 : ('P'|'p') '6' ; P07 : ('P'|'p') '7' ; P08 : ('P'|'p') '8' ; P09 : ('P'|'p') '9' ; P10 : ('P'|'p') '10' ; P11 : ('P'|'p') '11' ; P12 : ('P'|'p') '12' ; P13 : ('P'|'p') '13' ; P14 : ('P'|'p') '14' ; P15 : ('P'|'p') '15' ;

// Scalar Registers S00 : ('S'|'s') '0' ; S01 : ('S'|'s') '1' ; S02 : ('S'|'s') '2' ; S03 : ('S'|'s') '3' ; S04 : ('S'|'s') '4' ; S05 : ('S'|'s') '5' ; S06 : ('S'|'s') '6' ; S07 : ('S'|'s') '7' ; S08 : ('S'|'s') '8' ;


S09 : ('S'|'s') '9' ; S10 : ('S'|'s') '10' ; S11 : ('S'|'s') '11' ; S12 : ('S'|'s') '12' ; S13 : ('S'|'s') '13' ; S14 : ('S'|'s') '14' ; S15 : ('S'|'s') '15' ;

// Identifier ID : IdentifierStart IdentifierPart* ; fragment IdentifierStart : '$' | '.' | '_' | 'A'..'Z' | 'a'..'z' ; fragment IdentifierPart : '$' | '.' | '_' | '0'..'9' | 'A'..'Z' | 'a'..'z' ;

// Integer Literal INT : Sign? DecimalDigit+ ; fragment Sign : '+' | '-' ; fragment DecimalDigit : '0'..'9' ;

// String Literal STRING : '\"' ~('\"')* '\"' ;

// End of Line EOL : '\r' '\n' | '\r' | '\n' ;


// Comment COMMENT : '#' ~('\r' | '\n')* { $channel = HIDDEN; } ;

// Whitespace WS : ('\t' | '\f' | ' ')+ { $channel = HIDDEN; } ;

B.2. MTASCAssemblyWriter.g tree grammar MTASCAssemblyWriter; options { ASTLabelType = CommonTree; tokenVocab = MTASCAssembly; }

@header { import java.io.*; import java.util.HashMap; import java.util.Map; }

@members { // Sections private final int SECTION_PARALLEL = 0; private final int SECTION_SCALAR = 1; private final int SECTION_TEXT = 2;

// Per-section output streams OutputStream parallelStream; OutputStream scalarStream; OutputStream textStream;

// Current section private int section = SECTION_TEXT;

// Symbol table private Map symbolTable;

// Modes private final int MODE_PARALLEL = 1; private final int MODE_SCALAR = 0;


/** * Opens output files. */ public void openFiles(String baseName) throws IOException { parallelStream = new FileOutputStream(baseName + ".parallel"); scalarStream = new FileOutputStream(baseName + ".scalar"); textStream = new FileOutputStream(baseName + ".text"); }

/** * Closes output files. */ public void closeFiles() throws IOException { parallelStream.close(); scalarStream.close(); textStream.close(); }

/** * Sets the symbol table to use for resolving label references. */ public void setSymbolTable(Map symbolTable) { this.symbolTable = symbolTable; }

/** * Writes a byte to the currect section. */ public void writeByte(int value) { try { switch (section) { case SECTION_PARALLEL: parallelStream.write(value); break;

case SECTION_SCALAR: scalarStream.write(value); break;

case SECTION_TEXT: textStream.write(value); break; } } catch (IOException ex) { ex.printStackTrace(); } }


/** * Writes a halfword to the current section. */ public void writeHalfword(int value) { writeByte((value & 0x000000ff) >> 0); writeByte((value & 0x0000ff00) >> 8); }

/** * Writes an I-type instruction to the current section. * * @param op operation code * @param modeD mode for destination register * @param modeA mode for source register * @param regDest destination register * @param regA source register * @param immediate immediate value */ public void writeITypeInstruction(int op, int modeD, int modeA, int regDest, int regA, int immediate) { writeWord(((op & 0x003f) << 26) | ((modeA & 0x0001) << 25) | ((modeD & 0x0001) << 24) | ((regDest & 0x000f) << 20) | ((regA & 0x000f) << 16) | ((immediate & 0xffff) << 0)); }

/** * Writes an R-type instruction to the current section. * * @param op operation code * @param modeA mode for source register A * @param modeB mode for source register B * @param regDest destination register * @param regA source register A * @param regB source register B */ public void writeRTypeInstruction(int op, int modeA, int modeB, int regDest, int regA, int regB) { writeWord(((op & 0x3f) << 26) | ((modeA & 0x01) << 25) | ((modeB & 0x01) << 24) | ((regDest & 0x0f) << 20) | ((regA & 0x0f) << 16) | ((regB & 0x0f) << 12)); }

/** * Writes a word to the current section. */ public void writeWord(int value) { writeByte((value & 0x000000ff) >> 0); writeByte((value & 0x0000ff00) >> 8); writeByte((value & 0x00ff0000) >> 16); writeByte((value & 0xff000000) >> 24);


} }

// Start program : statement* ; statement : assembler_directive | instruction | EOL ;

// Assebler Directives assembler_directive : ^(ASCII STRING) { for (int i = 1; i < $STRING.text.length() - 1; i++) writeByte((int) $STRING.text.charAt(i)); } | ^(ASCIIZ STRING) { for (int i = 1; i < $STRING.text.length() - 1; i++) writeByte((int) $STRING.text.charAt(i));

writeByte(0); } | ^(BYTE INT) { writeByte(Integer.parseInt($INT.text)); } | ^(HALF INT) { writeHalfword(Integer.parseInt($INT.text)); } | PARALLEL { section = SECTION_PARALLEL; } | SCALAR { section = SECTION_SCALAR; } | ^(SPACE INT) { for (int i = 0; i < Integer.parseInt($INT.text); i++) writeByte(0); } | TEXT { section = SECTION_TEXT; } | ^(WORD INT) { writeWord(Integer.parseInt($INT.text)); } ;


// Instructions instruction : alu_instruction_pp | alu_instruction_ps | alu_instruction_sp | alu_instruction_ss | alu_immediate_instruction_p | alu_immediate_instruction_s | branch_instruction | if_instruction | load_store_instruction_pp | load_store_instruction_ps | load_store_instruction_ss | jump_instruction | jump_register_instruction | network_instruction | no_operand_instruction | reduce_instruction | semaphore_instruction | special_parallel_instruction | special_scalar_instruction ; alu_instruction_pp : ^(o=alu_op d=p_reg a=p_reg b=p_reg) { writeRTypeInstruction($o.op, MODE_PARALLEL, MODE_PARALLEL, $d.number, $a.number, $b.number); } ; alu_instruction_ps : ^(o=alu_op d=p_reg a=p_reg b=s_reg) { writeRTypeInstruction($o.op, MODE_PARALLEL, MODE_SCALAR, $d.number, $a.number, $b.number); } ; alu_instruction_sp : ^(o=alu_op d=p_reg a=s_reg b=p_reg) { writeRTypeInstruction($o.op, MODE_SCALAR, MODE_PARALLEL, $d.number, $a.number, $b.number); } ; alu_instruction_ss : ^(o=alu_op d=s_reg a=s_reg b=s_reg) { writeRTypeInstruction($o.op, MODE_SCALAR, MODE_SCALAR, $d.number, $a.number, $b.number); } ; alu_op returns [int op] : ADD { $op = 0x00; } | AND { $op = 0x02; } | CMPEQ { $op = 0x06; } | CMPLE { $op = 0x08; }


| CMPLEU { $op = 0x0b; } | CMPLT { $op = 0x0c; } | CMPLTU { $op = 0x0f; } | DIV { $op = 0x11; } | DIVU { $op = 0x12; } | MUL { $op = 0x26; } | MULU { $op = 0x27; } | NOR { $op = 0x29; } | OR { $op = 0x2b; } | SLL { $op = 0x30; } | SRA { $op = 0x32; } | SRL { $op = 0x34; } | SUB { $op = 0x36; } | XOR { $op = 0x3a; } ; alu_immediate_instruction_p : ^(o=alu_immediate_op d=p_reg a=p_reg i=immediate) { writeITypeInstruction($o.op, MODE_PARALLEL, MODE_PARALLEL, $d.number, $a.number, $i.value); } ; alu_immediate_instruction_s : ^(o=alu_immediate_op d=s_reg a=s_reg i=immediate) { writeITypeInstruction($o.op, MODE_SCALAR, MODE_SCALAR, $d.number, $a.number, $i.value); } ; alu_immediate_op returns [int op] : ADDI { $op = 0x01; } | ANDI { $op = 0x03; } | CMPEQI { $op = 0x07; } | CMPLEI { $op = 0x09; } | CMPLEIU { $op = 0x0a; } | CMPLTI { $op = 0x0d; } | CMPLTIU { $op = 0x0e; } | ORI { $op = 0x2c; } | SLLI { $op = 0x31; } | SRAI { $op = 0x33; } | SRLI { $op = 0x35; } | XORI { $op = 0x3b; } ; branch_instruction : ^(o=branch_op a=s_reg i=immediate) { writeITypeInstruction($o.op, 0, 0, 0, $a.number, $i.value); } ; branch_op returns [int op] : BEQZ { $op = 0x04; } | BNEZ { $op = 0x05; } ;


if_instruction : ^(o=if_op a=p_reg) { writeRTypeInstruction($o.op, MODE_PARALLEL, MODE_PARALLEL, 0, $a.number, 0); } ; if_op returns [int op] : IFEQZ { $op = 0x16; } | IFNEZ { $op = 0x17; } ; load_store_instruction_pp : ^(o=load_store_op d=p_reg a=p_reg i=immediate) { writeITypeInstruction($o.op, MODE_PARALLEL, MODE_PARALLEL, $d.number, $a.number, $i.value); } ; load_store_instruction_ps : ^(o=load_store_op d=p_reg a=s_reg i=immediate) { writeITypeInstruction($o.op, MODE_PARALLEL, MODE_SCALAR, $d.number, $a.number, $i.value); } ; load_store_instruction_ss : ^(o=load_store_op d=s_reg a=s_reg i=immediate) { writeITypeInstruction($o.op, MODE_SCALAR, MODE_SCALAR, $d.number, $a.number, $i.value); } ; load_store_op returns [int op] : LB { $op = 0x1c; } | LBU { $op = 0x1d; } | LH { $op = 0x20; } | LHU { $op = 0x21; } | SB { $op = 0x2d; } | SH { $op = 0x2e; } ; jump_instruction : ^(o=jump_op i=immediate) { writeITypeInstruction($o.op, 0, 0, 0, 0, $i.value); } ; jump_op returns [int op] : J { $op = 0x18; } | JAL { $op = 0x19; } ;


jump_register_instruction : ^(o=jump_register_op a=s_reg) { writeRTypeInstruction($o.op, 0, 0, 0, $a.number, 0); } ; jump_register_op returns [int op] : JALR { $op = 0x1a; } | JR { $op = 0x1b; } ; network_instruction : ^(o=network_op d=p_reg a=p_reg) { writeRTypeInstruction($o.op, MODE_PARALLEL, MODE_PARALLEL, $d.number, $a.number, 0); } ; network_op returns [int op] : NNEXT { $op = 0x28; } | NPREV { $op = 0x2a; } ; no_operand_instruction : o=no_operand_op { writeRTypeInstruction($o.op, 0, 0, 0, 0, 0); } ; no_operand_op returns [int op] : ELSE { $op = 0x13; } | ENDIF { $op = 0x14; } | HALT { $op = 0x15; } | SYSCALL { $op = 0x38; } ; reduce_instruction : ^(o=reduce_op d=s_reg a=p_reg) { writeRTypeInstruction($o.op, 0, 0, $d.number, $a.number, 0); } ; reduce_op returns [int op] : MAX { $op = 0x22; } | MAXU { $op = 0x23; } | MIN { $op = 0x24; } | MINU { $op = 0x25; } | SUM { $op = 0x37; } ; semaphore_instruction : ^(o=semaphore_op d=g_reg) { writeRTypeInstruction($o.op, 0, 0, $d.number, 0, 0); } ;


semaphore_op returns [int op] : SIGNAL { $op = 0x2f; } | WAIT { $op = 0x39; } ; special_parallel_instruction : ^(o=special_parallel_op d=p_reg) { writeRTypeInstruction($o.op, 0, 0, $d.number, 0, 0); } ; special_parallel_op returns [int op] : LDPID { $op = 0x1e; } ; special_scalar_instruction : ^(o=special_scalar_op d=s_reg) { writeRTypeInstruction($o.op, 0, 0, $d.number, 0, 0); } ; special_scalar_op returns [int op] : COUNT { $op = 0x10; } | LDTID { $op = 0x1f; } ;

// Global (Semaphore) Registers g_reg returns [int number] : G00 { $number = 0; } | G01 { $number = 1; } | G02 { $number = 2; } | G03 { $number = 3; } | G04 { $number = 4; } | G05 { $number = 5; } | G06 { $number = 6; } | G07 { $number = 7; } | G08 { $number = 8; } | G09 { $number = 9; } | G10 { $number = 10; } | G11 { $number = 11; } | G12 { $number = 12; } | G13 { $number = 13; } | G14 { $number = 14; } | G15 { $number = 15; } ;

// Parallel Registers p_reg returns [int number] : P00 { $number = 0; } | P01 { $number = 1; } | P02 { $number = 2; } | P03 { $number = 3; } | P04 { $number = 4; } | P05 { $number = 5; } | P06 { $number = 6; } | P07 { $number = 7; } | P08 { $number = 8; }


| P09 { $number = 9; } | P10 { $number = 10; } | P11 { $number = 11; } | P12 { $number = 12; } | P13 { $number = 13; } | P14 { $number = 14; } | P15 { $number = 15; } ;

// Scalar Registers s_reg returns [int number] : S00 { $number = 0; } | S01 { $number = 1; } | S02 { $number = 2; } | S03 { $number = 3; } | S04 { $number = 4; } | S05 { $number = 5; } | S06 { $number = 6; } | S07 { $number = 7; } | S08 { $number = 8; } | S09 { $number = 9; } | S10 { $number = 10; } | S11 { $number = 11; } | S12 { $number = 12; } | S13 { $number = 13; } | S14 { $number = 14; } | S15 { $number = 15; } ; immediate returns [int value] : ID { Integer v = symbolTable.get($ID.text);

if (v != null) { $value = v.intValue(); } else { System.err.println("Undefined symbol: " + $ID.text); $value = 0; } } | INT { $value = Integer.parseInt($INT.text); } ;

B.3. Simulator.cpp

#include <iostream>
#include <string>
#include "Machine.h"
using namespace std;


int main(int argc, char *argv[]) { Machine machine;

if (argc != 4) { cerr << "Usage: " << argv[0] << " <program file> <parallel data file> <scalar data file>\n"; return 1; }

machine.loadProgram(argv[1]); machine.loadParallelData(argv[2]); machine.loadScalarData(argv[3]); machine.run();

return 0; }
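For reference, the simulator is driven entirely from the command line: main expects a program image followed by a parallel-memory image and a scalar-memory image, which are passed to loadProgram, loadParallelData, and loadScalarData before Machine::run sweeps the thread count from MIN_THREADS to MAX_THREADS. A minimal invocation sketch, using purely hypothetical file names, would be:

    simulator program.bin pdata.bin sdata.bin

where the three binary files hold the encoded program and the initial parallel and scalar memory images, respectively.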

B.4. Machine.h

#if !defined(MACHINE_H)
#define MACHINE_H

#include <stdint.h>
#include <string>
#include "Instruction.h"
const uint32_t PROGRAM_SIZE = 1024; const uint32_t MEMORY_SIZE = 8196; const uint32_t NUM_REGISTERS = 16; const uint32_t NUM_PES = 256; const uint32_t MIN_THREADS = 1; const uint32_t MAX_THREADS = 8; const uint32_t REDUCTION_LATENCY = 16; struct PendingInstruction { uint32_t thread; uint32_t regD; bool valid; }; class Machine { public: Machine();

void loadProgram(const std::string& fileName); void loadParallelData(const std::string& fileName); void loadScalarData(const std::string& fileName);

uint8_t readByte(uint32_t address) const; uint16_t readHalfword(uint32_t address) const; uint32_t readWord(uint32_t address) const; uint8_t readParallelByte(uint32_t pe, uint32_t address) const; uint16_t readParallelHalfword(uint32_t pe, uint32_t address) const; uint32_t readParallelWord(uint32_t pe, uint32_t address) const;


std::string readString(uint32_t address) const;

void writeByte(uint32_t address, uint8_t data); void writeHalfword(uint32_t address, uint16_t data); void writeWord(uint32_t address, uint32_t data); void writeParallelByte(uint32_t pe, uint32_t address, uint8_t data); void writeParallelHalfword(uint32_t pe, uint32_t address, uint16_t data); void writeParallelWord(uint32_t pe, uint32_t address, uint32_t data);

uint16_t getGlobalRegister(uint32_t index) const; uint16_t getRegister(uint32_t thread, uint32_t index) const; uint16_t getParallelRegister(uint32_t thread, uint32_t pe, uint32_t index) const; uint16_t getPC(uint32_t thread) const; bool getTopEnableState(uint32_t thread, uint32_t pe) const; void incrementPC(uint32_t thread); bool isDeadlocked(uint32_t numThreads) const; bool isEnabled(uint32_t thread, uint32_t pe) const; bool isRegisterReady(uint32_t thread, uint32_t index) const; void popEnableState(uint32_t thread, uint32_t pe); void pushEnableState(uint32_t thread, uint32_t pe, bool value); void putGlobalRegister(uint32_t index, uint16_t value); void putRegister(uint32_t thread, uint32_t index, uint16_t value); void putParallelRegister(uint32_t thread, uint32_t pe, uint32_t index, uint16_t value); void putPC(uint32_t thread, uint16_t value); void putTopEnableState(uint32_t thread, uint32_t pe, bool value);

bool anyPending() const;

uint32_t scheduleSimple(uint32_t numThreads, uint32_t priorityThread) const; uint32_t scheduleCheckStalls(uint32_t numThreads, uint32_t priorityThread) const; uint32_t schedule(uint32_t numThreads, uint32_t priorityThread) const;

void reset(); void run(); private:

Instruction program[PROGRAM_SIZE]; uint8_t initialParallelMemory[MEMORY_SIZE]; uint8_t parallelMemory[NUM_PES][MEMORY_SIZE]; uint8_t initialMemory[MEMORY_SIZE]; uint8_t memory[MEMORY_SIZE]; uint16_t enableStack[MAX_THREADS][NUM_PES]; uint16_t globalRegisters[NUM_REGISTERS]; uint16_t registers[MAX_THREADS][NUM_REGISTERS]; uint16_t parallelRegisters[MAX_THREADS][NUM_PES][NUM_REGISTERS]; uint16_t pc[MAX_THREADS]; PendingInstruction pendingInstructions[REDUCTION_LATENCY]; };

#endif


B.5. Machine.cpp

#include <fstream>
#include <iostream>
#include <stdint.h>
#include <string>
#include <string.h>

#include "Instruction.h" #include "Machine.h" using namespace std;

Machine::Machine() { reset(); } uint16_t Machine::getGlobalRegister(uint32_t index) const { return globalRegisters[index]; } uint16_t Machine::getRegister(uint32_t thread, uint32_t index) const { return registers[thread][index]; } uint16_t Machine::getParallelRegister(uint32_t thread, uint32_t pe, uint32_t index) const { return parallelRegisters[thread][pe][index]; } uint16_t Machine::getPC(uint32_t thread) const { return pc[thread] * 4; } bool Machine::getTopEnableState(uint32_t thread, uint32_t pe) const { return ((enableStack[thread][pe] & 1) == 1); } void Machine::incrementPC(uint32_t thread) { pc[thread]++; } bool Machine::isDeadlocked(uint32_t numThreads) const { for (uint32_t thread = 0; thread < numThreads; thread++) { if (!program[pc[thread]].willWait(thread, *this)) return false; }

return true; }


bool Machine::isEnabled(uint32_t thread, uint32_t pe) const { return (enableStack[thread][pe] == 0xffff); } bool Machine::isRegisterReady(uint32_t thread, uint32_t index) const { for (uint32_t i = 0; i < REDUCTION_LATENCY; i++) { if (pendingInstructions[i].thread == thread && pendingInstructions[i].regD == index && pendingInstructions[i].valid) { return false; } }

return true; } void Machine::popEnableState(uint32_t thread, uint32_t pe) { enableStack[thread][pe] = (enableStack[thread][pe] >> 1) | 0x8000; } void Machine::pushEnableState(uint32_t thread, uint32_t pe, bool value) { enableStack[thread][pe] = (enableStack[thread][pe] << 1) | (value ? 1 : 0); } void Machine::putGlobalRegister(uint32_t index, uint16_t value) { globalRegisters[index] = value; } void Machine::putRegister(uint32_t thread, uint32_t index, uint16_t value) { registers[thread][index] = value; } void Machine::putParallelRegister(uint32_t thread, uint32_t pe, uint32_t index, uint16_t value) { parallelRegisters[thread][pe][index] = value; } void Machine::putPC(uint32_t thread, uint16_t value) { pc[thread] = value / 4; } void Machine::putTopEnableState(uint32_t thread, uint32_t pe, bool value) { enableStack[thread][pe] = (enableStack[thread][pe] & 0xfffe) | (value ? 1 : 0); }
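// Descriptive note on the enable-stack helpers above: each PE's enable stack
// is modeled as a single 16-bit word. pushEnableState shifts the word left and
// inserts the new enable bit at bit 0, popEnableState shifts right and refills
// the top with 1, and isEnabled treats a PE as active only when every level is
// 1 (enableStack == 0xffff), so a PE disabled at any if/else/endif nesting
// depth remains disabled until the corresponding endif pops that level.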


bool Machine::anyPending() const { for (uint32_t i = 0; i < REDUCTION_LATENCY; i++) { if (pendingInstructions[i].valid) return true; }

return false; } uint32_t Machine::scheduleSimple(uint32_t numThreads, uint32_t priorityThread) const { return priorityThread; } uint32_t Machine::scheduleCheckStalls(uint32_t numThreads, uint32_t priorityThread) const { // Schedule a thread to run, skip threads that are stalled for (uint32_t i = 0; i < numThreads; i++) { uint32_t thread = (priorityThread + i) % numThreads; Instruction instruction = program[pc[thread]];

// Check for a dependency if (instruction.willStall(thread, *this)) { // Try next one continue; }

// Thread ready to execute return thread; }

// All threads are stalled return priorityThread; } uint32_t Machine::schedule(uint32_t numThreads, uint32_t priorityThread) const { // Schedule a thread to run, skip threads that are stalled or waiting on // a semaphore for (uint32_t i = 0; i < numThreads; i++) { uint32_t thread = (priorityThread + i) % numThreads; //uint32_t thread = i; Instruction instruction = program[pc[thread]];

// Check for a dependency or a wait if (instruction.willStall(thread, *this) || instruction.willWait(thread, *this)) { // Try next one continue; }


// Thread ready to execute return thread; }

// All threads are stalled or waiting on a semaphore return priorityThread; } void Machine::reset() { for (uint32_t i = 0; i < NUM_PES; i++) memcpy(parallelMemory[i], initialParallelMemory, MEMORY_SIZE);

memcpy(memory, initialMemory, MEMORY_SIZE);

for (uint32_t i = 0; i < MAX_THREADS; i++) for (uint32_t j = 0; j < NUM_PES; j++) enableStack[i][j] = 0xffff;

for (uint32_t i = 0; i < NUM_REGISTERS; i++) globalRegisters[i] = 0;

for (uint32_t i = 0; i < MAX_THREADS; i++) for (uint32_t j = 0; j < NUM_REGISTERS; j++) registers[i][j] = 0;

for (uint32_t i = 0; i < MAX_THREADS; i++) for (uint32_t j = 0; j < NUM_PES; j++) for (uint32_t k = 0; k < NUM_REGISTERS; k++) parallelRegisters[i][j][k] = 0;

for (uint32_t i = 0; i < MAX_THREADS; i++) pc[i] = 0;

for (uint32_t i = 0; i < REDUCTION_LATENCY; i++) pendingInstructions[i].valid = false; } void Machine::run() { for (uint32_t numThreads = MIN_THREADS; numThreads <= MAX_THREADS; numThreads++) { ofstream traceStream; traceStream.open("instr_trace.txt");

reset();

uint32_t priorityThread = 0; int numCycles = 0; int numStalls = 0; int numWaits = 0; int numInstructions = 0; int numReductions = 0;

while (!isDeadlocked(numThreads)) { // Advance the pipeline for (int i = REDUCTION_LATENCY - 1; i > 0; i--) pendingInstructions[i] = pendingInstructions[i - 1];


pendingInstructions[0].valid = false;

// Schedule a thread uint32_t thread = schedule(numThreads, priorityThread);

// Fetch instruction Instruction instruction = program[pc[thread]]; instruction.address = pc[thread];

if (instruction.willStall(thread, *this)) { // Stall traceStream << thread << " Stall\n"; numStalls++; } else if (instruction.willWait(thread, *this)) { // Wait on a semaphore traceStream << thread << " Wait\n"; numWaits++; } else { // Execute traceStream << thread << " Execute: " << instruction.toString() << "\n";

instruction.execute(thread, *this); numInstructions++;

// Place reduction instructions in the pending ops queue if (instruction.isReduction()) { pendingInstructions[0].thread = thread; pendingInstructions[0].regD = instruction.regD; pendingInstructions[0].valid = true; numReductions++; } }

numCycles++; priorityThread = (priorityThread + 1) % numThreads; } cout << NUM_PES << "," << numThreads << "," << numCycles << "," << numStalls << "," << numWaits << "," << numReductions << "\n";

/* cout << "\n"; cout << "Number of PEs: " << NUM_PES << "\n"; cout << "Number of Threads: " << numThreads << "\n"; cout << "Execution Time: " << numCycles << " cycles\n"; cout << "Stalls: " << numStalls << "\n"; cout << "Wait Cycles: " << numWaits << "\n"; cout << "Instruction Count: " << numInstructions << " (" << (numReductions * 100.0f / numInstructions) << "% reductions)\n\n"; */


} } void Machine::loadProgram(const string& fileName) { ifstream fileStream(fileName.c_str(), ios::binary);

if (!fileStream) { cerr << "Could not load program file: " << fileName << "\n"; return; }

uint32_t encodedProgram[PROGRAM_SIZE]; fileStream.read((char *) encodedProgram, PROGRAM_SIZE * sizeof(uint32_t));

for (uint32_t i = 0; i < PROGRAM_SIZE; i++) program[i] = Instruction(encodedProgram[i]); } void Machine::loadParallelData(const string& fileName) { ifstream fileStream(fileName.c_str(), ios::binary);

if (!fileStream) { cerr << "Could not load parallel data file: " << fileName << "\n"; return; }

fileStream.read((char *) initialParallelMemory, MEMORY_SIZE); } void Machine::loadScalarData(const string& fileName) { ifstream fileStream(fileName.c_str(), ios::binary);

if (!fileStream) { cerr << "Could not load data file: " << fileName << "\n"; return; }

fileStream.read((char *) initialMemory, MEMORY_SIZE); } uint8_t Machine::readByte(uint32_t address) const { return memory[address]; } uint16_t Machine::readHalfword(uint32_t address) const { return (memory[address + 0] << 0) | (memory[address + 1] << 8); }


uint32_t Machine::readWord(uint32_t address) const { return (memory[address + 0] << 0) | (memory[address + 1] << 8) | (memory[address + 2] << 16) | (memory[address + 3] << 24); } uint8_t Machine::readParallelByte(uint32_t pe, uint32_t address) const { return parallelMemory[pe][address]; } uint16_t Machine::readParallelHalfword(uint32_t pe, uint32_t address) const { return (parallelMemory[pe][address + 0] << 0) | (parallelMemory[pe][address + 1] << 8); } uint32_t Machine::readParallelWord(uint32_t pe, uint32_t address) const { return (parallelMemory[pe][address + 0] << 0) | (parallelMemory[pe][address + 1] << 8) | (parallelMemory[pe][address + 2] << 16) | (parallelMemory[pe][address + 3] << 24); } string Machine::readString(uint32_t address) const { string result;

for (uint32_t i = address; i < MEMORY_SIZE; i++) { char ch = (char) readByte(i);

if (ch == 0) break;

result.append(1, ch); }

return result; } void Machine::writeByte(uint32_t address, uint8_t data) { memory[address] = data; } void Machine::writeHalfword(uint32_t address, uint16_t data) { memory[address + 0] = (data & 0x00ff) >> 0; memory[address + 1] = (data & 0xff00) >> 8; }


void Machine::writeWord(uint32_t address, uint32_t data) { memory[address + 0] = (data & 0x000000ff) >> 0; memory[address + 1] = (data & 0x0000ff00) >> 8; memory[address + 2] = (data & 0x00ff0000) >> 16; memory[address + 3] = (data & 0xff000000) >> 24; } void Machine::writeParallelByte(uint32_t pe, uint32_t address, uint8_t data) { parallelMemory[pe][address] = data; } void Machine::writeParallelHalfword(uint32_t pe, uint32_t address, uint16_t data) { parallelMemory[pe][address + 0] = (data & 0x00ff) >> 0; parallelMemory[pe][address + 1] = (data & 0xff00) >> 8; } void Machine::writeParallelWord(uint32_t pe, uint32_t address, uint32_t data) { parallelMemory[pe][address + 0] = (data & 0x000000ff) >> 0; parallelMemory[pe][address + 1] = (data & 0x0000ff00) >> 8; parallelMemory[pe][address + 2] = (data & 0x00ff0000) >> 16; parallelMemory[pe][address + 3] = (data & 0xff000000) >> 24; }

B.6. Instruction.h

#if !defined(INSTRUCTION_H)
#define INSTRUCTION_H

#include <stdint.h>
#include <string>
class Machine; enum { OP_ADD = 0x00, OP_ADDI = 0x01, OP_AND = 0x02, OP_ANDI = 0x03, OP_BEQZ = 0x04, OP_BNEZ = 0x05, OP_CMPEQ = 0x06, OP_CMPEQI = 0x07, OP_CMPLE = 0x08, OP_CMPLEI = 0x09, OP_CMPLEIU = 0x0a, OP_CMPLEU = 0x0b, OP_CMPLT = 0x0c, OP_CMPLTI = 0x0d, OP_CMPLTIU = 0x0e, OP_CMPLTU = 0x0f, OP_COUNT = 0x10, OP_DIV = 0x11,


OP_DIVU = 0x12, OP_ELSE = 0x13, OP_ENDIF = 0x14, OP_HALT = 0x15, OP_IFEQZ = 0x16, OP_IFNEZ = 0x17, OP_J = 0x18, OP_JAL = 0x19, OP_JALR = 0x1a, OP_JR = 0x1b, OP_LB = 0x1c, OP_LBU = 0x1d, OP_LDPID = 0x1e, OP_LDTID = 0x1f, OP_LH = 0x20, OP_LHU = 0x21, OP_MAX = 0x22, OP_MAXU = 0x23, OP_MIN = 0x24, OP_MINU = 0x25, OP_MUL = 0x26, OP_MULU = 0x27, OP_NNEXT = 0x28, OP_NOR = 0x29, OP_NPREV = 0x2a, OP_OR = 0x2b, OP_ORI = 0x2c, OP_SB = 0x2d, OP_SH = 0x2e, OP_SIGNAL = 0x2f, OP_SLL = 0x30, OP_SLLI = 0x31, OP_SRA = 0x32, OP_SRAI = 0x33, OP_SRL = 0x34, OP_SRLI = 0x35, OP_SUB = 0x36, OP_SUM = 0x37, OP_SYSCALL = 0x38, OP_WAIT = 0x39, OP_XOR = 0x3a, OP_XORI = 0x3b }; enum { MODE_SS = 0, MODE_SP = 1, MODE_PS = 2, MODE_PP = 3 }; enum { SYSCALL_PRINTF, SYSCALL_PARALLEL_PRINTF };


class Instruction { public: uint32_t address; uint32_t op; uint32_t mode; uint32_t regD; uint32_t regA; uint32_t regB; uint16_t immediate;

Instruction(); Instruction(uint32_t encodedInstruction);

bool execute(int thread, Machine& machine) const; uint16_t executeALUOp(uint16_t a, uint16_t b) const; uint16_t executeReductionOp(uint16_t total, uint16_t a) const; std::string getOpName() const; bool hasImmediate() const; bool isReduction() const; std::string toString() const; bool willStall(unsigned thread, const Machine& machine) const; bool willWait(unsigned thread, const Machine& machine) const; };

#endif

B.7. Instruction.cpp

#define __STDC_LIMIT_MACROS

#include <iostream>
#include <sstream>
#include <stdint.h>
#include <stdio.h>
#include <string>

#include "Instruction.h" #include "Machine.h" using namespace std;

Instruction::Instruction() { address = 0; op = 0; mode = 0; regD = 0; regA = 0; regB = 0; immediate = 0; }
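// Instruction word layout decoded by the constructor below (derived from its
// masks and shifts): bits 31-26 opcode, bits 25-24 mode, bits 23-20 regD,
// bits 19-16 regA, bits 15-12 regB, bits 15-0 immediate. The immediate field
// overlaps regB and the low bits and is only meaningful for I-type
// instructions (see hasImmediate).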

Instruction::Instruction(uint32_t encodedInstruction) { address = 0; op = (encodedInstruction & 0xfc000000) >> 26; mode = (encodedInstruction & 0x03000000) >> 24;


regD = (encodedInstruction & 0x00f00000) >> 20; regA = (encodedInstruction & 0x000f0000) >> 16; regB = (encodedInstruction & 0x0000f000) >> 12; immediate = static_cast<uint16_t>(encodedInstruction & 0x0000ffff); } bool Instruction::execute(int thread, Machine& machine) const { uint16_t a; uint16_t b; uint16_t result;

char buffer[1024];

switch (op) { case OP_ADD: case OP_ADDI: case OP_AND: case OP_ANDI: case OP_CMPEQ: case OP_CMPEQI: case OP_CMPLE: case OP_CMPLEI: case OP_CMPLEIU: case OP_CMPLEU: case OP_CMPLT: case OP_CMPLTI: case OP_CMPLTIU: case OP_CMPLTU: case OP_DIV: case OP_DIVU: case OP_MUL: case OP_MULU: case OP_NOR: case OP_OR: case OP_ORI: case OP_SLL: case OP_SLLI: case OP_SRA: case OP_SRAI: case OP_SRL: case OP_SRLI: case OP_SUB: case OP_XOR: case OP_XORI: // ALU operation switch (mode) { case MODE_SS: // Scalar a = machine.getRegister(thread, regA); b = hasImmediate() ? immediate : machine.getRegister(thread, regB); machine.putRegister(thread, regD, executeALUOp(a, b)); break;

case MODE_SP: // Parallel with common left operand for (unsigned pe = 0; pe < NUM_PES; pe++) {


a = machine.getRegister(thread, regA); b = machine.getParallelRegister(thread, pe, regB);

if (machine.isEnabled(thread, pe)) machine.putParallelRegister(thread, pe, regD, executeALUOp(a, b)); }

break;

case MODE_PS: // Parallel with common right operand for (unsigned pe = 0; pe < NUM_PES; pe++) { a = machine.getParallelRegister(thread, pe, regA); b = machine.getRegister(thread, regB);

if (machine.isEnabled(thread, pe)) machine.putParallelRegister(thread, pe, regD, executeALUOp(a, b)); }

break;

case MODE_PP: // Parallel for (unsigned pe = 0; pe < NUM_PES; pe++) { a = machine.getParallelRegister(thread, pe, regA); b = hasImmediate() ? immediate : machine.getParallelRegister(thread, pe, regB);

if (machine.isEnabled(thread, pe)) machine.putParallelRegister(thread, pe, regD, executeALUOp(a, b)); }

break; }

machine.incrementPC(thread); return true; case OP_BEQZ: // Branch if equal to zero if (machine.getRegister(thread, regA) == 0) machine.putPC(thread, immediate); else machine.incrementPC(thread);

return true; case OP_BNEZ: // Branch if not equal to zero if (machine.getRegister(thread, regA) != 0) machine.putPC(thread, immediate); else machine.incrementPC(thread);


return true; case OP_COUNT: case OP_MAX: case OP_MAXU: case OP_MIN: case OP_MINU: case OP_SUM: // Reduction switch (op) { case OP_COUNT: result = 0; break; case OP_MAX: result = 0x8000u; break; case OP_MAXU: result = 0; break; case OP_MIN: result = 0x7FFFu; break; case OP_MINU: result = 0xFFFFu; break; case OP_SUM: result = 0; break; }

for (uint32_t pe = 0; pe < NUM_PES; pe++) { if (machine.isEnabled(thread, pe)) result = executeReductionOp(result, machine.getParallelRegister(thread, pe, regA)); }

machine.putRegister(thread, regD, result); machine.incrementPC(thread); return true; case OP_ELSE: // Invert top of enable stack for (uint32_t pe = 0; pe < NUM_PES; pe++) machine.putTopEnableState(thread, pe, !machine.getTopEnableState(thread, pe));

machine.incrementPC(thread); return true; case OP_ENDIF: // Pop enable stack for (uint32_t pe = 0; pe < NUM_PES; pe++) machine.popEnableState(thread, pe);

machine.incrementPC(thread); return true; case OP_HALT: // Halt return false; case OP_IFEQZ: // Compare equal to zero and push result on enable stack for (uint32_t pe = 0; pe < NUM_PES; pe++) { machine.pushEnableState(thread, pe, machine.getParallelRegister(thread, pe, regA) == 0); }

machine.incrementPC(thread);


return true; case OP_IFNEZ: // Compare not equal to zero and push result on enable stack for (uint32_t pe = 0; pe < NUM_PES; pe++) { machine.pushEnableState(thread, pe, machine.getParallelRegister(thread, pe, regA) != 0); }

machine.incrementPC(thread); return true; case OP_J: // Jump machine.putPC(thread, immediate); return true; case OP_JAL: // Jump and link machine.putRegister(thread, 15, machine.getPC(thread) + 4); machine.putPC(thread, immediate); return true; case OP_JALR: // Jump and link register machine.putRegister(thread, 15, machine.getPC(thread) + 4); machine.putPC(thread, machine.getRegister(thread, regA)); return true; case OP_JR: // Jump register machine.putPC(thread, machine.getRegister(thread, regA)); return true; case OP_LB: case OP_LBU: // Load byte switch (mode) { case MODE_SS: // Scalar machine.putRegister(thread, regD, machine.readByte( machine.getRegister(thread, regA) + immediate));

break;

case MODE_SP: // Parallel with common address for (uint32_t pe = 0; pe < NUM_PES; pe++) { if (machine.isEnabled(thread, pe)) { machine.putParallelRegister(thread, pe, regD, machine.readParallelByte(pe, machine.getRegister(thread, regA) + immediate)); } }


break;

case MODE_PP: // Parallel for (uint32_t pe = 0; pe < NUM_PES; pe++) { if (machine.isEnabled(thread, pe)) { machine.putParallelRegister(thread, pe, regD, machine.readParallelByte(pe, machine.getParallelRegister(thread, pe, regA) + immediate)); } }

break; }

machine.incrementPC(thread); return true; case OP_LDPID: // Load processor ID for (uint32_t pe = 0; pe < NUM_PES; pe++) { if (machine.isEnabled(thread, pe)) machine.putParallelRegister(thread, pe, regD, pe); }

machine.incrementPC(thread); return true; case OP_LDTID: // Load thread ID machine.putRegister(thread, regD, thread); machine.incrementPC(thread); return true; case OP_LH: case OP_LHU: // Load halfword switch (mode) { case MODE_SS: // Scalar machine.putRegister(thread, regD, machine.readHalfword( machine.getRegister(thread, regA) + immediate));

break;

case MODE_SP: // Parallel with common address for (uint32_t pe = 0; pe < NUM_PES; pe++) { if (machine.isEnabled(thread, pe)) { machine.putParallelRegister(thread, pe, regD, machine.readParallelHalfword(pe, machine.getRegister(thread, regA) + immediate));


} }

break;

case MODE_PP: // Parallel for (uint32_t pe = 0; pe < NUM_PES; pe++) { if (machine.isEnabled(thread, pe)) { machine.putParallelRegister(thread, pe, regD, machine.readParallelHalfword(pe, machine.getParallelRegister(thread, pe, regA) + immediate)); } }

break; }

machine.incrementPC(thread); return true; case OP_NNEXT: // Network move from next for (uint32_t pe = 0; pe < NUM_PES; pe++) { if (machine.isEnabled(thread, pe)) { uint16_t data = 0;

if (pe < NUM_PES - 1) data = machine.getParallelRegister(thread, pe + 1, regA);

machine.putParallelRegister(thread, pe, regD, data); } }

machine.incrementPC(thread); return true; case OP_NPREV: // Network move from previous for (uint32_t pe = 0; pe < NUM_PES; pe++) { if (machine.isEnabled(thread, pe)) { uint16_t data = 0;

if (pe > 0) data = machine.getParallelRegister(thread, pe - 1, regA);

machine.putParallelRegister(thread, pe, regD, data); } }

machine.incrementPC(thread); return true;


case OP_SB: // Store byte switch (mode) { case MODE_SS: // Scalar machine.writeByte( machine.getRegister(thread, regA) + immediate, (unsigned char) machine.getRegister(thread, regD));

break;

case MODE_SP: // Parallel with common address for (uint32_t pe = 0; pe < NUM_PES; pe++) { if (machine.isEnabled(thread, pe)) { machine.writeParallelByte(pe, machine.getRegister(thread, regA) + immediate, (unsigned char) machine.getParallelRegister(thread, pe, regD)); } }

break;

case MODE_PP: // Parallel for (uint32_t pe = 0; pe < NUM_PES; pe++) { if (machine.isEnabled(thread, pe)) { machine.writeParallelByte(pe, machine.getParallelRegister(thread, pe, regA) + immediate, (unsigned char) machine.getParallelRegister(thread, pe, regD)); } }

break; }

machine.incrementPC(thread); return true; case OP_SH: // Store halfword switch (mode) { case MODE_SS: // Scalar machine.writeHalfword( machine.getRegister(thread, regA) + immediate, machine.getRegister(thread, regD));

break;

case MODE_SP:


// Parallel with common address for (uint32_t pe = 0; pe < NUM_PES; pe++) { if (machine.isEnabled(thread, pe)) { machine.writeParallelHalfword(pe, machine.getRegister(thread, regA) + immediate, machine.getParallelRegister(thread, pe, regD)); } }

break;

case MODE_PP: // Parallel for (uint32_t pe = 0; pe < NUM_PES; pe++) { if (machine.isEnabled(thread, pe)) { machine.writeParallelHalfword(pe, machine.getParallelRegister(thread, pe, regA) + immediate, machine.getParallelRegister(thread, pe, regD)); } }

break; }

machine.incrementPC(thread); return true; case OP_SIGNAL: // Signal semaphore machine.putGlobalRegister(regD, 0); machine.incrementPC(thread); return true; case OP_SYSCALL: // System call switch (machine.getRegister(thread, 11)) { case SYSCALL_PRINTF: sprintf(buffer, machine.readString(machine.getRegister(thread, 12)).c_str(), machine.getRegister(thread, 13), machine.getRegister(thread, 14));

cout << buffer << "\n"; break;

case SYSCALL_PARALLEL_PRINTF: for (uint32_t pe = 0; pe < NUM_PES; pe++) { if (machine.isEnabled(thread, pe)) { sprintf(buffer, machine.readString( machine.getRegister(thread, 12)).c_str(), machine.getParallelRegister(thread, pe, 11),


machine.getParallelRegister(thread, pe, 12), machine.getParallelRegister(thread, pe, 13), machine.getParallelRegister(thread, pe, 14));

cout << buffer << "\n"; } }

break; }

machine.incrementPC(thread); return true;

case OP_WAIT: // Wait for semaphore if (machine.getGlobalRegister(regD) == 0) { machine.putGlobalRegister(regD, 1); machine.incrementPC(thread); return true; }

return false; }

cerr << "Attempted to execute undefined instruction at address " << address << " on thread " << thread << "\n";

return false; } uint16_t Instruction::executeALUOp(uint16_t a, uint16_t b) const { switch (op) { case OP_ADD: case OP_ADDI: return a + b;

case OP_AND: case OP_ANDI: return a & b;

case OP_CMPEQ: case OP_CMPEQI: return (a == b) ? UINT16_MAX : 0;

case OP_CMPLE: case OP_CMPLEI: return (static_cast<int16_t>(a) <= static_cast<int16_t>(b)) ? UINT16_MAX : 0;

case OP_CMPLEIU: case OP_CMPLEU: return (a <= b) ? UINT16_MAX : 0;

case OP_CMPLT: case OP_CMPLTI:


return (static_cast<int16_t>(a) < static_cast<int16_t>(b)) ? UINT16_MAX : 0;

case OP_CMPLTIU: case OP_CMPLTU: return (a < b) ? UINT16_MAX : 0;

case OP_DIV: return static_cast<uint16_t>(static_cast<int16_t>(a) / static_cast<int16_t>(b));

case OP_DIVU: return a / b;

case OP_MUL: return static_cast<uint16_t>(static_cast<int16_t>(a) * static_cast<int16_t>(b));

case OP_MULU: return a * b;

case OP_NOR: return ~(a | b);

case OP_OR: case OP_ORI: return a | b;

case OP_SLL: case OP_SLLI: return a << b;

case OP_SRA: case OP_SRAI: return static_cast<uint16_t>(static_cast<int16_t>(a) >> static_cast<int16_t>(b));

case OP_SRL: case OP_SRLI: return a >> b;

case OP_SUB: return a - b;

case OP_XOR: case OP_XORI: return a ^ b; }

cerr << "Undefined operation in executeALUOp: " << op << "\n"; return 0; } uint16_t Instruction::executeReductionOp(uint16_t total, uint16_t a) const { switch (op) { case OP_COUNT: return (total + 1);


case OP_MAX: return (static_cast<int16_t>(a) > static_cast<int16_t>(total)) ? a : total;

case OP_MAXU: return (a > total) ? a : total;

case OP_MIN: return (static_cast<int16_t>(a) < static_cast<int16_t>(total)) ? a : total;

case OP_MINU: return (a < total) ? a : total;

case OP_SUM: return (a + total); }

cerr << "Undefined operation in executeReductionOp: " << op << "\n"; return 0; } string Instruction::getOpName() const { switch (op) { case OP_ADD: return "add"; case OP_ADDI: return "addi"; case OP_AND: return "and"; case OP_ANDI: return "andi"; case OP_BEQZ: return "beqz"; case OP_BNEZ: return "bnez"; case OP_CMPEQ: return "cmpeq"; case OP_CMPEQI: return "cmpeqi"; case OP_CMPLE: return "cmple"; case OP_CMPLEI: return "cmplei"; case OP_CMPLEIU: return "cmpleiu"; case OP_CMPLEU: return "cmpleu"; case OP_CMPLT: return "cmplt"; case OP_CMPLTI: return "cmplti"; case OP_CMPLTIU: return "cmpltiu"; case OP_CMPLTU: return "cmpltu"; case OP_COUNT: return "count"; case OP_DIV: return "div"; case OP_DIVU: return "divu"; case OP_ELSE: return "else"; case OP_ENDIF: return "endif"; case OP_HALT: return "halt"; case OP_IFEQZ: return "ifeqz"; case OP_IFNEZ: return "ifnez"; case OP_J: return "j"; case OP_JAL: return "jal"; case OP_JALR: return "jalr"; case OP_JR: return "jr"; case OP_LB: return "lb"; case OP_LBU: return "lbu"; case OP_LDPID: return "ldpid";


case OP_LDTID: return "ldtid"; case OP_LH: return "lh"; case OP_LHU: return "lhu"; case OP_MAX: return "max"; case OP_MAXU: return "maxu"; case OP_MIN: return "min"; case OP_MINU: return "minu"; case OP_MUL: return "mul"; case OP_MULU: return "mulu"; case OP_NNEXT: return "nnext"; case OP_NOR: return "nor"; case OP_NPREV: return "nprev"; case OP_OR: return "or"; case OP_ORI: return "ori"; case OP_SB: return "sb"; case OP_SH: return "sh"; case OP_SIGNAL: return "signal"; case OP_SLL: return "sll"; case OP_SLLI: return "slli"; case OP_SRA: return "sra"; case OP_SRAI: return "srai"; case OP_SRL: return "srl"; case OP_SRLI: return "srli"; case OP_SUB: return "sub"; case OP_SUM: return "sum"; case OP_SYSCALL: return "syscall"; case OP_WAIT: return "wait"; case OP_XOR: return "xor"; case OP_XORI: return "xori"; }

return "undefined"; } bool Instruction::hasImmediate() const { switch (op) { case OP_ADDI: case OP_ANDI: case OP_CMPEQI: case OP_CMPLEI: case OP_CMPLEIU: case OP_CMPLTI: case OP_CMPLTIU: case OP_ORI: case OP_SLLI: case OP_SRAI: case OP_SRLI: case OP_XORI: return true; }

return false; }


bool Instruction::isReduction() const { switch (op) { case OP_COUNT: case OP_MAX: case OP_MAXU: case OP_MIN: case OP_MINU: case OP_SUM: return true; }

return false; } string Instruction::toString() const { ostringstream answer; answer << getOpName();

switch (op) { case OP_ADD: case OP_AND: case OP_CMPEQ: case OP_CMPLE: case OP_CMPLEU: case OP_CMPLT: case OP_CMPLTU: case OP_DIV: case OP_DIVU: case OP_MUL: case OP_MULU: case OP_NOR: case OP_OR: case OP_SLL: case OP_SRA: case OP_SRL: case OP_SUB: case OP_XOR: switch (mode) { case MODE_SS: answer << " s" << regD << ", s" << regA << ", s" << regB; break;

case MODE_SP: answer << " p" << regD << ", s" << regA << ", p" << regB; break;

case MODE_PS: answer << " p" << regD << ", p" << regA << ", s" << regB; break;

case MODE_PP: answer << " p" << regD << ", p" << regA << ", p" << regB; break; }


break; case OP_ADDI: case OP_ANDI: case OP_CMPEQI: case OP_CMPLEI: case OP_CMPLEIU: case OP_CMPLTI: case OP_CMPLTIU: case OP_ORI: case OP_SLLI: case OP_SRAI: case OP_SRLI: case OP_XORI: switch (mode) { case MODE_SS: answer << " s" << regD << ", s" << regA << ", " << (short) immediate; break;

case MODE_PP: answer << " p" << regD << ", p" << regA << ", " << (short) immediate; break; }

break; case OP_BEQZ: case OP_BNEZ: answer << " s" << regA << ", " << immediate; break; case OP_COUNT: case OP_LDTID: answer << " s" << regD; break; case OP_ELSE: case OP_ENDIF: case OP_HALT: case OP_SYSCALL: break; case OP_IFEQZ: case OP_IFNEZ: answer << " p" << regA; break; case OP_J: case OP_JAL: answer << " " << immediate; break; case OP_JALR: case OP_JR: answer << " s" << regA; break;


case OP_LB: case OP_LBU: case OP_LH: case OP_LHU: case OP_SB: case OP_SH: switch (mode) { case MODE_SS: answer << " s" << regD << ", " << (short) immediate << "(s" << regA << ")"; break;

case MODE_SP: answer << " p" << regD << ", " << (short) immediate << "(s" << regA << ")"; break;

case MODE_PP: answer << " p" << regD << ", " << (short) immediate << "(p" << regA << ")"; break; }

break;

case OP_LDPID: answer << " p" << regD; break;

case OP_MAX: case OP_MAXU: case OP_MIN: case OP_MINU: case OP_SUM: answer << " s" << regD << ", p" << regA; break;

case OP_NNEXT: case OP_NPREV: answer << " p" << regD << ", p" << regA; break;

case OP_SIGNAL: case OP_WAIT: answer << " g" << regD; break; }

return answer.str(); } bool Instruction::willStall(unsigned thread, const Machine& machine) const { switch (op) { case OP_ADD: case OP_ADDI: case OP_AND: case OP_ANDI:


case OP_BEQZ: case OP_BNEZ: case OP_CMPEQ: case OP_CMPEQI: case OP_CMPLE: case OP_CMPLEI: case OP_CMPLEIU: case OP_CMPLEU: case OP_CMPLT: case OP_CMPLTI: case OP_CMPLTIU: case OP_CMPLTU: case OP_DIV: case OP_DIVU: case OP_JALR: case OP_JR: case OP_LB: case OP_LBU: case OP_LH: case OP_LHU: case OP_MUL: case OP_MULU: case OP_NOR: case OP_OR: case OP_ORI: case OP_SB: case OP_SH: case OP_SLL: case OP_SLLI: case OP_SRA: case OP_SRAI: case OP_SRL: case OP_SRLI: case OP_SUB: case OP_XOR: case OP_XORI: if ((mode == MODE_SS || mode == MODE_SP) && !machine.isRegisterReady(thread, regA)) return true; } switch (op) { case OP_ADD: case OP_AND: case OP_CMPEQ: case OP_CMPLE: case OP_CMPLEU: case OP_CMPLT: case OP_CMPLTU: case OP_DIV: case OP_DIVU: case OP_MUL: case OP_MULU: case OP_NOR: case OP_OR: case OP_SLL: case OP_SRA: case OP_SRL: case OP_SUB:


case OP_XOR: if ((mode == MODE_SS || mode == MODE_PS) && !machine.isRegisterReady(thread, regB)) return true; }

return false; } bool Instruction::willWait(unsigned thread, const Machine& machine) const { return ((op == OP_HALT) || (op == OP_WAIT && machine.getGlobalRegister(regD) != 0)); }
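// Summary of the stall/wait predicates above (descriptive, derived from the
// code): willStall reports a read-after-write hazard on a scalar source
// register whose value is still being produced by an in-flight reduction
// (Machine::isRegisterReady scans the pendingInstructions queue), and willWait
// reports that a thread cannot advance because it has halted or is blocked on
// a busy semaphore register.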

APPENDIX C

Multithreaded Associative Benchmarks

C.1. Jarvis March

.scalar points: .include remaining_jobs: .half 37

.parallel x: .half 0 y: .half 0

.text

ldtid s2 bnez s2, thread_loop wait sem0

# Scatter points addi s2, s0, points add s3, s0, s0 count s4 ldpid p15 scatter_data: cmplt s5, s3, s4 beqz s5, end_scatter_data cmpeq p2, s3, p15 ifnez p2 lh s5, 0(s2) lh s6, 2(s2) add p2, p0, s5 add p3, p0, s6 sh p2, x(p0) sh p3, y(p0) endif addi s2, s2, 4 addi s3, s3, 1 j scatter_data end_scatter_data: signal sem0



thread_loop: wait sem0 lh s2, remaining_jobs(s0) beqz s2, thread_exit addi s2, s2, -1 sh s2, remaining_jobs(s0) signal sem0 j job_start thread_exit: signal sem0 halt job_start: lh p2, x(p0) lh p3, y(p0) ldpid p15

min s2, p2 max s3, p2

# Locate leftmost point and mark it as a hull point cmpeq p6, p2, s2 ifnez p6 addi p5, p0, -1 min s4, p2 min s5, p3 endif loop: cmplt s7, s4, s3 beqz s7, end_loop

# Compute slope cmple p6, p2, s4 ifeqz p6 sub p6, p3, s5 sub p7, p2, s4 div p4, p6, p7 max s6, p4

# Locate point with greatest slope and mark it as a hull point cmpeq p6, p4, s6 ifnez p6 min s7, p15 cmpeq p6, s7, p15 ifnez p6 addi p5, p0, -1 min s4, p2 min s5, p3 endif endif

endif j loop end_loop: j thread_loop


C.2. Minimum Spanning Tree (MST)

.scalar remaining_jobs: .half 37

.parallel weight: .include

.text

ldtid s2 bnez s2, thread_loop thread_loop: wait sem0 lh s2, remaining_jobs(s0) beqz s2, thread_exit addi s2, s2, -1 sh s2, remaining_jobs(s0) signal sem0 j job_start thread_exit: signal sem0 halt job_start: ldpid p4 addi p1, p0, 2

lh p5, weight(s2) cmpeqi p6, p5, -1 ifeqz p6 addi p1, p0, 1 add p3, p0, s2 add p2, p0, p5 endif

cmpeq p5, p4, s2 ifnez p5 addi p1, p0, 0 endif loop: cmpeqi p5, p1, 1 ifnez p5 count s3 endif beqz s3, end_loop

cmpeqi p5, p1, 1 ifnez p5 min s3, p2


cmpeq p5, s3, p2 ifnez p5 min s3, p4 cmpeq p5, s3, p4 ifnez p5 min s1, p4 addi p1, p0, 0 endif endif endif

cmpeqi p5, p1, 1 ifnez p5 lh p5, weight(s1) cmplt p6, p2, p5 ifnez p6 add p2, p0, p5 add p3, p0, s1 endif endif

cmpeqi p5, p1, 2 ifnez p5 lh p5, weight(s1) cmpeqi p6, p5, -1 ifeqz p6 addi p1, p0, 1 add p3, p0, s1 add p2, p0, p5 endif endif

j loop end_loop: j thread_loop

C.3. Matrix-Vector Multiplication

.scalar remaining_jobs: .half 37

.parallel

A: .space 32 x: .half 0 y: .half 0

.text

ldtid s2 bnez s2, thread_loop thread_loop:


wait sem0 lh s2, remaining_jobs(s0) beqz s2, thread_exit addi s2, s2, -1 sh s2, remaining_jobs(s0) signal sem0 j job_start thread_exit: signal sem0 halt job_start: ldpid p15 lh p2, x(p0) add s2, s0, s0 loop: cmplti s3, s2, 16 beqz s3, end_loop slli s3, s2, 1 lh p3, A(s3) mul p3, p3, p2 sum s3, p3 add p3, p0, s3 cmpeq p4, s2, p15 ifnez p4 sh p3, y(p0) endif addi s2, s2, 1 j loop end_loop: j thread_loop

C.4. QuickHull

.scalar points: .include head: .half 0 tail: .half 0 px_queue: .space 256 py_queue: .space 256 qx_queue: .space 256 qy_queue: .space 256 remaining_points: .half 101


queue_size: .half 0

.parallel x: .half 0 y: .half 0 hull: .half 0

.text

ldtid s2 bnez s2, thread_loop

wait sem0

# Scatter points addi s2, s0, points add s3, s0, s0 count s4 ldpid p15 scatter_data: cmplt s5, s3, s4 beqz s5, end_scatter_data cmpeq p2, s3, p15 ifnez p2 lh s5, 0(s2) lh s6, 2(s2) add p2, p0, s5 add p3, p0, s6 sh p2, x(p0) sh p3, y(p0) endif addi s2, s2, 4 addi s3, s3, 1 j scatter_data end_scatter_data: lh p2, x(p0) lh p3, y(p0)

# Find westernmost point min s6, p2 cmpeq p1, s6, p2 ifnez p1 ldpid p1 # PickOne min s1, p1 cmpeq p1, s1, p1 ifnez p1 addi p4, p0, -1 sh p4, hull(p0) min s2, p2 min s3, p3 endif


endif

# Find easternmost point max s6, p2 cmpeq p1, s6, p2 ifnez p1 ldpid p1 # PickOne min s1, p1 cmpeq p1, s1, p1 ifnez p1 addi p4, p0, -1 sh p4, hull(p0) min s4, p2 min s5, p3 endif endif

# Add line segment (w, e) to queue lh s6, tail(s0) sh s2, px_queue(s6) sh s3, py_queue(s6) sh s4, qx_queue(s6) sh s5, qy_queue(s6) addi s6, s6, 2 andi s6, s6, 255 sh s6, tail(s0)

signal sem0 thread_loop: wait sem0 lh s2, remaining_points(s0) signal sem0 beqz s2, thread_exit j job_start thread_exit: halt job_start: lh p2, x(p0) lh p3, y(p0) loop: # Remove next line segment from the queue wait sem0 lh s6, head(s0) lh s7, tail(s0) cmpeq s8, s6, s7 bnez s8, empty_queue lh s2, px_queue(s6) lh s3, py_queue(s6) lh s4, qx_queue(s6) lh s5, qy_queue(s6) addi s6, s6, 2 andi s6, s6, 255 sh s6, head(s0) j release


empty_queue: release: signal sem0 bnez s8, end_loop

# Compute area sub s6, s4, s2 sub p6, p3, s3 mul p6, s6, p6 sub p7, p2, s2 sub s6, s5, s3 mul p7, p7, s6 sub p6, p6, p7

# Select points above the line segment cmple p1, p2, s2 ifeqz p1 cmplt p1, p2, s4 ifnez p1 cmplti p1, p6, 0 ifeqz p1 count s1 # AnyResponders beqz s1, end_any

# Find point with greatest area max s6, p6 cmpeq p1, s6, p6 ifnez p1 ldpid p1 # PickOne min s1, p1 cmpeq p1, s1, p1 ifnez p1 min s6, p2 min s7, p3

# Mark it as a hull point wait sem0 addi p4, p0, -1 sh p4, hull(p0)

# Add line segment (p, r) to queue lh s8, tail(s0) sh s2, px_queue(s8) sh s3, py_queue(s8) sh s6, qx_queue(s8) sh s7, qy_queue(s8) addi s8, s8, 2 andi s8, s8, 255 sh s8, tail(s0)

# Add line segment (r, q) to queue lh s8, tail(s0) sh s6, px_queue(s8) sh s7, py_queue(s8) sh s4, qx_queue(s8) sh s5, qy_queue(s8) addi s8, s8, 2 andi s8, s8, 255 sh s8, tail(s0)


# Decrement point counter lh s8, remaining_points(s0) addi s8, s8, -1 sh s8, remaining_points(s0) signal sem0 end_any: endif endif endif endif endif

j loop end_loop: j thread_loop

C.5. VLDC String Matching

.scalar remaining_jobs: .half 37 m: .half 0 n: .half 0 pattern: .asciiz "A*A*A*A*A*A*A*A*A*A*A*A*A*A*A*A*A" scalar_text: .asciiz "@ABBBABBBABA"

.parallel segment: .space 256 text: .byte 0

.text

ldtid s2 bnez s2, thread_loop wait sem0

# Compute length of pattern string add s2, s0, s0 get_pattern_length: lbu s3, pattern(s2) beqz s3, end_get_pattern_length addi s2, s2, 1 j get_pattern_length


end_get_pattern_length: sh s2, m(s0)

# Scatter text string (and compute length) add s2, s0, s0 ldpid p15 scatter_text: lbu s3, scalar_text(s2) beqz s3, end_scatter_text cmpeq p2, s2, p15 ifnez p2 add p2, p0, s3 sb p2, text(p0) endif addi s2, s2, 1 j scatter_text end_scatter_text: addi s2, s2, -1 sh s2, n(s0)

signal sem0 thread_loop: wait sem0 lh s2, remaining_jobs(s0) beqz s2, thread_exit addi s2, s2, -1 sh s2, remaining_jobs(s0) signal sem0 j job_start thread_exit: signal sem0 halt

# Job start job_start: lh s2, m(s0) lh s3, n(s0) lbu p2, text(p0) ldpid p15

add s4, s0, s2 addi s5, s3, 1

# Special handling for pattern with trailing wildcard addi s9, s2, -1 lbu s9, pattern(s9) cmpeqi s9, s9, 42 beqz s9, segment_loop ifnez p15 addi p5, p0, 1 sh p5, segment(p0) endif addi s6, s0, 1 addi s7, s0, 1


segment_loop: sub s4, s4, s6 cmplei s9, s4, 0 bnez s9, end_segment_loop cmplei s9, s5, 0 bnez s9, end_segment_loop

add s6, s0, s0 addi s8, s4, -1 character_loop: cmplti s9, s8, 0 bnez s9, end_character_loop lbu s9, pattern(s8) cmpeqi s9, s9, 42 bnez s9, end_character_loop

lbu s9, pattern(s8) cmpeq p5, p2, s9 cmpeq p6, p3, s6 and p5, p5, p6 cmplt p6, p15, s5 and p5, p5, p6 nnext p5, p5 addi p6, p3, 1 ifnez p5 nnext p3, p6 endif

addi s6, s6, 1 addi s8, s8, -1 j character_loop end_character_loop: cmpeq p5, p3, s6 add p6, p0, p0 nprev p6, p5 ifnez p6 add p7, p0, s6 sh p7, segment(s7) endif

lh p5, segment(s7) cmplei p5, p5, 0 ifeqz p5 count s9 beqz s9, no_responders max s5, p15 j end_where no_responders: add s5, s0, s0 end_where: endif

add p3, p0, p0 addi s6, s6, 1 addi s7, s7, 1 j segment_loop


end_segment_loop: addi s7, s7, -1

lh p5, segment(s7) cmplei p5, p5, 0 ifeqz p5 addi p4, p0, -1 endif

# Special handling for pattern with leading wildcard lbu s9, pattern(s0) cmpeqi s9, s9, 42 bnez s9, end cmplt p5, p15, s5 ifnez p5 ifnez p15 addi p4, p0, -1 endif endif end: j thread_loop

APPENDIX D

ASC Library for Cn

D.1. asc.h

/* * ASC Library 2.1 * * Author: Kevin Schaffer * Last updated: April 4, 2010 */

#if !defined(ASC_H)
#define ASC_H

/** * Type to represent Boolean values. */ typedef enum bool { false, true } bool;

/** * Returns the number of nonzero components in a poly bool. */ short count(poly bool condition);

/** * Converts a poly char into a mono char. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ char get_char(poly char value);

/** * Converts a poly short into a mono short. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ short get_short(poly short value);



/** * Converts a poly int into a mono int. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ int get_int(poly int value);

/** * Converts a poly long into a mono long. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ long get_long(poly long value);

/** * Converts a poly unsigned char into a mono unsigned char. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ unsigned char get_unsigned_char(poly unsigned char value);

/** * Converts a poly unsigned short into a mono unsigned short. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ unsigned short get_unsigned_short(poly unsigned short value);

/** * Converts a poly unsigned int into a mono unsigned int. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ unsigned int get_unsigned_int(poly unsigned int value);

/** * Converts a poly unsigned long into a mono unsigned long. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ unsigned long get_unsigned_long(poly unsigned long value);

/** * Converts a poly float into a mono float. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ float get_float(poly float value);


/** * Converts a poly double into a mono double. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ double get_double(poly double value);

/** * Copies a poly string into a mono buffer. * * Exactly one PE must be active when calling this function otherwise * the results are undefined. * * Returns the length of the string copied into the buffer. */ size_t get_string(char *buffer, size_t buffer_len, poly const char *value);

/** * Returns the largest component of a poly char. * * If there are no active PEs, returns the smallest possible * char value. */ char max_char(poly char value);

/** * Returns the largest component of a poly short. * * If there are no active PEs, returns the smallest possible * short value. */ short max_short(poly short value);

/** * Returns the largest component of a poly int. * * If there are no active PEs, returns the smallest possible * int value. */ int max_int(poly int value);

/** * Returns the largest component of a poly long. * * If there are no active PEs, returns the smallest possible * long value. */ long max_long(poly long value);

/** * Returns the largest component of a poly unsigned char. * * If there are no active PEs, returns the smallest possible * unsigned char value. */ unsigned char max_unsigned_char(poly unsigned char value);


/** * Returns the largest component of a poly unsigned short. * * If there are no active PEs, returns the smallest possible * unsigned short value. */ unsigned short max_unsigned_short(poly unsigned short value);

/** * Returns the largest component of a poly unsigned int. * * If there are no active PEs, returns the smallest possible * unsigned int value. */ unsigned int max_unsigned_int(poly unsigned int value);

/** * Returns the largest component of a poly unsigned long. * * If there are no active PEs, returns the smallest possible * unsigned long value. */ unsigned long max_unsigned_long(poly unsigned long value);

/** * Returns the largest component of a poly float. * * If there are no active PEs, returns negative infinity. */ float max_float(poly float value);

/** * Returns the largest component of a poly double. * * If there are no active PEs, returns negative infinity. */ double max_double(poly double value);

/** * Locates the component of a poly string that sorts last lexicographically * and copies it into the supplied buffer. * * If there are no active PEs, copies an empty string into the buffer. * * Returns the length of the string copied into the buffer. */ size_t max_string(char *buffer, size_t buffer_len, poly const char *value);

/** * Returns the smallest component of a poly char. * * If there are no active PEs, returns the largest possible * char value. */ char min_char(poly char value);


/** * Returns the smallest component of a poly short. * * If there are no active PEs, returns the largest possible * short value. */ short min_short(poly short value);

/** * Returns the smallest component of a poly int. * * If there are no active PEs, returns the largest possible * int value. */ int min_int(poly int value);

/** * Returns the smallest component of a poly long. * * If there are no active PEs, returns the largest possible * long value. */ long min_long(poly long value);

/** * Returns the smallest component of a poly unsigned char. * * If there are no active PEs, returns the largest possible * unsigned char value. */ unsigned char min_unsigned_char(poly unsigned char value);

/** * Returns the smallest component of a poly unsigned short. * * If there are no active PEs, returns the largest possible * unsigned short value. */ unsigned short min_unsigned_short(poly unsigned short value);

/** * Returns the smallest component of a poly unsigned int. * * If there are no active PEs, returns the largest possible * unsigned int value. */ unsigned int min_unsigned_int(poly unsigned int value);

/** * Returns the smallest component of a poly unsigned long. * * If there are no active PEs, returns the largest possible * unsigned long value. */ unsigned long min_unsigned_long(poly unsigned long value);


/** * Returns the smallest component of a poly float. * * If there are no active PEs, returns positive infinity. */ float min_float(poly float value);

/** * Returns the smallest component of a poly double. * * If there are no active PEs, returns positive infinity. */ double min_double(poly double value);

/** * Locates the component of a poly string that sorts first lexicographically * and copies it into the supplied buffer. * * If there are no active PEs, copies an empty string into the buffer. * * Returns the length of the string copied into the buffer. */ size_t min_string(char *buffer, size_t buffer_len, poly const char *value);

/** * Returns a poly bool that is nonzero for at most one PE and zero for * all other PEs. */ poly bool select_one(void);

/** * Returns a poly bool that is nonzero for PEs that contain the largest * char and zero for all others. */ poly bool select_max_char(poly char value);

/** * Returns a poly bool that is nonzero for PEs that contain the largest * short and zero for all others. */ poly bool select_max_short(poly short value);

/** * Returns a poly bool that is nonzero for PEs that contain the largest * int and zero for all others. */ poly bool select_max_int(poly int value);

/** * Returns a poly bool that is nonzero for PEs that contain the largest * long and zero for all others. */ poly bool select_max_long(poly long value);

/** * Returns a poly bool that is nonzero for PEs that contain the largest * unsigned char and zero for all others. */ poly bool select_max_unsigned_char(poly unsigned char value);


/** * Returns a poly bool that is nonzero for PEs that contain the largest * unsigned short and zero for all others. */ poly bool select_max_unsigned_short(poly unsigned short value);

/** * Returns a poly bool that is nonzero for PEs that contain the largest * unsigned int and zero for all others. */ poly bool select_max_unsigned_int(poly unsigned int value);

/** * Returns a poly bool that is nonzero for PEs that contain the largest * unsigned long and zero for all others. */ poly bool select_max_unsigned_long(poly unsigned long value);

/** * Returns a poly bool that is nonzero for PEs that contain the largest * float and zero for all others. The tolerance parameter specifies how * close a PE's value must be to the largest for that PE to be selected. */ poly bool select_max_float(poly float value, float tolerance);

/** * Returns a poly bool that is nonzero for PEs that contain the largest * double and zero for all others. The tolerance parameter specifies how * close a PE's value must be to the largest for that PE to be selected. */ poly bool select_max_double(poly double value, double tolerance);

/** * Returns a poly bool that is nonzero for PEs that contain the string * which sorts last lexicographically. */ poly bool select_max_string(poly const char *value);

/** * Returns a poly bool that is nonzero for PEs that contain the smallest * char and zero for all others. */ poly bool select_min_char(poly char value);

/** * Returns a poly bool that is nonzero for PEs that contain the smallest * short and zero for all others. */ poly bool select_min_short(poly short value);

/** * Returns a poly bool that is nonzero for PEs that contain the smallest * int and zero for all others. */ poly bool select_min_int(poly int value);


/** * Returns a poly bool that is nonzero for PEs that contain the smallest * long and zero for all others. */ poly bool select_min_long(poly long value);

/** * Returns a poly bool that is nonzero for PEs that contain the smallest * unsigned char and zero for all others. */ poly bool select_min_unsigned_char(poly unsigned char value);

/** * Returns a poly bool that is nonzero for PEs that contain the smallest * unsigned short and zero for all others. */ poly bool select_min_unsigned_short(poly unsigned short value);

/** * Returns a poly bool that is nonzero for PEs that contain the smallest * unsigned int and zero for all others. */ poly bool select_min_unsigned_int(poly unsigned int value);

/** * Returns a poly bool that is nonzero for PEs that contain the smallest * unsigned long and zero for all others. */ poly bool select_min_unsigned_long(poly unsigned long value);

/** * Returns a poly bool that is nonzero for PEs that contain the smallest * float and zero for all others. The tolerance parameter specifies how * close a PE's value must be to the smallest for that PE to be selected. */ poly bool select_min_float(poly float value, float tolerance);

/** * Returns a poly bool that is nonzero for PEs that contain the smallest * double and zero for all others. The tolerance parameter specifies how * close a PE's value must be to the smallest for that PE to be selected. */ poly bool select_min_double(poly double value, double tolerance);

/** * Returns a poly bool that is nonzero for PEs that contain the string * which sorts first lexicographically. */ poly bool select_min_string(poly const char *value);

#endif


D.2. asc.cn

/* * ASC Library 2.1 * * Author: Kevin Schaffer * Last updated: April 4, 2010 */

#include <assert.h>
#include <math.h>
#include <stddef.h>
#include "asc.h"

// Sign bits #define CHAR_SIGN 0x80 #define SHORT_SIGN 0x8000 #define INT_SIGN 0x80000000u

// Non-sign bits #define CHAR_REST 0x7f #define SHORT_REST 0x7fff #define INT_REST 0x7fffffffu

// Structure that allows access to double as two ints typedef union double_bits { double d; struct { unsigned int lo; unsigned int hi; } i; } double_bits;

// Inline assembly functions unsigned char max_unsigned_char_ila(poly unsigned char value); unsigned short max_unsigned_short_ila(poly unsigned short value); unsigned int max_unsigned_int_ila(poly unsigned int value); double max_double_ila(poly double value);

/** * Returns the number of nonzero components in a poly bool. */ short count(poly bool condition) { short result = 0;

while (condition) { if (select_one()) { result++; break; } }


return result; }

/** * Converts a poly char into a mono char. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ char get_char(poly char value) { #if defined(ASC_SAFE) assert(count(true) == 1); #endif

return p2m_and_char(value); }

/** * Converts a poly short into a mono short. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ short get_short(poly short value) { #if defined(ASC_SAFE) assert(count(true) == 1); #endif

return p2m_and_short(value); }

/** * Converts a poly int into a mono int. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ int get_int(poly int value) { #if defined(ASC_SAFE) assert(count(true) == 1); #endif

return p2m_and_int(value); }

/** * Converts a poly long into a mono long. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ long get_long(poly long value) { #if defined(ASC_SAFE) assert(count(true) == 1); #endif


return (long) p2m_and_int((poly int) value); }

/** * Converts a poly unsigned char into a mono unsigned char. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ unsigned char get_unsigned_char(poly unsigned char value) { #if defined(ASC_SAFE) assert(count(true) == 1); #endif

return (unsigned char) p2m_and_char((poly char) value); }

/** * Converts a poly unsigned short into a mono unsigned short. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ unsigned short get_unsigned_short(poly unsigned short value) { #if defined(ASC_SAFE) assert(count(true) == 1); #endif

return (unsigned short) p2m_and_short((poly short) value); }

/** * Converts a poly unsigned int into a mono unsigned int. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ unsigned int get_unsigned_int(poly unsigned int value) { #if defined(ASC_SAFE) assert(count(true) == 1); #endif

return (unsigned int) p2m_and_int((poly int) value); }

/** * Converts a poly unsigned long into a mono unsigned long. * * Exactly one PE must be active when calling this function otherwise * the return value is undefined. */ unsigned long get_unsigned_long(poly unsigned long value) { #if defined(ASC_SAFE) assert(count(true) == 1); #endif


return (unsigned long) p2m_and_int((poly int) value); }

/**
 * Converts a poly float into a mono float.
 *
 * Exactly one PE must be active when calling this function; otherwise
 * the return value is undefined.
 */
float get_float(poly float value)
{
    int result;

#if defined(ASC_SAFE)
    assert(count(true) == 1);
#endif

    result = p2m_and_int(*((poly int *) &value));
    return *((float *) &result);
}

/**
 * Converts a poly double into a mono double.
 *
 * Exactly one PE must be active when calling this function; otherwise
 * the return value is undefined.
 */
double get_double(poly double value)
{
#if defined(ASC_SAFE)
    assert(count(true) == 1);
#endif

    return p2m_and_8byte(value);
}

/**
 * Copies a poly string into a mono buffer.
 *
 * Exactly one PE must be active when calling this function; otherwise
 * the results are undefined.
 *
 * Returns the length of the string copied into the buffer.
 */
size_t get_string(char *buffer, size_t buffer_len, poly const char *value)
{
    size_t i;
    size_t len = 0;

#if defined(ASC_SAFE)
    assert(count(true) == 1);
#endif

    for (i = 0; value[i] != '\0'; i++) {
        if (len < buffer_len - 1) {
            buffer[len++] = p2m_and_char(value[i]);
        }
    }

    if (len < buffer_len) {
        buffer[len] = '\0';
    }

    return len;
}

/**
 * Returns the largest component of a poly char.
 *
 * If there are no active PEs, returns the smallest possible
 * char value.
 */
char max_char(poly char value)
{
    // Invert sign bit
    poly unsigned char temp = (poly unsigned char) (value ^ CHAR_SIGN);
    return (char) (max_unsigned_char_ila(temp) ^ CHAR_SIGN);
}
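
/*
 * Why the XOR works (illustrative values only): for 8-bit chars,
 * x ^ 0x80 maps the signed range onto the unsigned range in an
 * order-preserving way:
 *
 *     -128 (0x80) -> 0x00      -1 (0xff) -> 0x7f
 *        0 (0x00) -> 0x80     127 (0x7f) -> 0xff
 *
 * so the unsigned maximum of the transformed values, XORed back,
 * is the signed maximum.  The short, int, and long variants below
 * use the same trick with wider sign-bit masks.
 */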

/**
 * Returns the largest component of a poly short.
 *
 * If there are no active PEs, returns the smallest possible
 * short value.
 */
short max_short(poly short value)
{
    // Invert sign bit
    poly unsigned short temp = (poly unsigned short) (value ^ SHORT_SIGN);
    return (short) (max_unsigned_short_ila(temp) ^ SHORT_SIGN);
}

/**
 * Returns the largest component of a poly int.
 *
 * If there are no active PEs, returns the smallest possible
 * int value.
 */
int max_int(poly int value)
{
    // Invert sign bit
    poly unsigned int temp = (poly unsigned int) (value ^ INT_SIGN);
    return (int) (max_unsigned_int_ila(temp) ^ INT_SIGN);
}

/**
 * Returns the largest component of a poly long.
 *
 * If there are no active PEs, returns the smallest possible
 * long value.
 */
long max_long(poly long value)
{
    // Invert sign bit
    poly unsigned int temp = (poly unsigned int) (value ^ INT_SIGN);
    return (long) (max_unsigned_int_ila(temp) ^ INT_SIGN);
}

/**
 * Returns the largest component of a poly unsigned char.
 *
 * If there are no active PEs, returns the smallest possible
 * unsigned char value.
 */
unsigned char max_unsigned_char(poly unsigned char value)
{
    return max_unsigned_char_ila(value);
}

/**
 * Returns the largest component of a poly unsigned short.
 *
 * If there are no active PEs, returns the smallest possible
 * unsigned short value.
 */
unsigned short max_unsigned_short(poly unsigned short value)
{
    return max_unsigned_short_ila(value);
}

/**
 * Returns the largest component of a poly unsigned int.
 *
 * If there are no active PEs, returns the smallest possible
 * unsigned int value.
 */
unsigned int max_unsigned_int(poly unsigned int value)
{
    return max_unsigned_int_ila(value);
}

/**
 * Returns the largest component of a poly unsigned long.
 *
 * If there are no active PEs, returns the smallest possible
 * unsigned long value.
 */
unsigned long max_unsigned_long(poly unsigned long value)
{
    return (unsigned long) max_unsigned_int_ila((poly unsigned int) value);
}

/**
 * Returns the largest component of a poly float.
 *
 * If there are no active PEs, returns negative infinity.
 */
float max_float(poly float value)
{
    poly unsigned int temp;
    unsigned int result;

    if (!any(true)) {
        // If no active PEs, return negative infinity
        return -infinityf();
    }

    if (any(isnanp(value))) {
        // If any component is a NaN, return a NaN
        return nanf(NULL);
    }

    temp = *((poly unsigned int *) &value);

    if ((temp & INT_SIGN) == INT_SIGN) {
        // If value is negative, invert all bits except sign
        temp ^= INT_REST;
    }

    // Invert sign bit
    temp ^= INT_SIGN;

    result = max_unsigned_int_ila(temp);

    // Invert sign bit
    result ^= INT_SIGN;

    if ((result & INT_SIGN) == INT_SIGN) {
        // If result is negative, invert all bits except sign
        result ^= INT_REST;
    }

    return *((float *) &result);
}
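
/*
 * Why this transform works (illustrative, assuming IEEE 754 single
 * precision): setting the sign bit of a non-negative float, or inverting
 * every bit of a negative float, produces unsigned integers whose order
 * matches the floats' numeric order, e.g.
 *
 *     -2.0f (0xc0000000) -> 0x3fffffff
 *     -0.0f (0x80000000) -> 0x7fffffff
 *      0.0f (0x00000000) -> 0x80000000
 *      2.0f (0x40000000) -> 0xc0000000
 *
 * so the unsigned-int maximum routine can be reused for floats (and, with
 * two words, for doubles below).
 */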

/**
 * Returns the largest component of a poly double.
 *
 * If there are no active PEs, returns negative infinity.
 */
double max_double(poly double value)
{
    poly double_bits temp;
    double_bits result;

    if (!any(true)) {
        // If no active PEs, return negative infinity
        return -infinity();
    }

    if (any(isnanp(value))) {
        // If any component is a NaN, return a NaN
        return nan(NULL);
    }

    temp.d = value;

    if ((temp.i.hi & INT_SIGN) == INT_SIGN) {
        // If value is negative, invert all bits except sign
        temp.i.hi ^= INT_REST;
        temp.i.lo = ~temp.i.lo;
    }

    // Invert sign bit
    temp.i.hi ^= INT_SIGN;

    result.d = max_double_ila(temp.d);

    // Invert sign bit
    result.i.hi ^= INT_SIGN;

    if ((result.i.hi & INT_SIGN) == INT_SIGN) {
        // If result is negative, invert all bits except sign
        result.i.hi ^= INT_REST;
        result.i.lo = ~result.i.lo;
    }

    return result.d;
}

/**
 * Locates the component of a poly string that sorts last lexicographically
 * and copies it into the supplied buffer.
 *
 * If there are no active PEs, copies an empty string into the buffer.
 *
 * Returns the length of the string copied into the buffer.
 */
size_t max_string(char *buffer, size_t buffer_len, poly const char *value)
{
    if (select_max_string(value)) {
        if (select_one()) {
            return get_string(buffer, buffer_len, value);
        }
    }

    // No responders: copy an empty string into the buffer
    if (buffer_len > 0) {
        buffer[0] = '\0';
    }
    return 0;
}

/**
 * Returns the smallest component of a poly char.
 *
 * If there are no active PEs, returns the largest possible
 * char value.
 */
char min_char(poly char value)
{
    // Invert all bits except for sign
    poly unsigned char temp = (poly unsigned char) (value ^ CHAR_REST);
    return (char) (max_unsigned_char_ila(temp) ^ CHAR_REST);
}
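
/*
 * Why this finds the minimum (illustrative values only): XORing with 0x7f
 * flips every bit except the sign, which reverses the signed order within
 * the unsigned range:
 *
 *     -128 (0x80) -> 0xff      -1 (0xff) -> 0x80
 *        0 (0x00) -> 0x7f     127 (0x7f) -> 0x00
 *
 * so the unsigned maximum of the transformed values corresponds to the
 * signed minimum, and XORing back recovers it.  The short, int, and long
 * variants below apply the same idea with wider masks.
 */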

/**
 * Returns the smallest component of a poly short.
 *
 * If there are no active PEs, returns the largest possible
 * short value.
 */
short min_short(poly short value)
{
    // Invert all bits except for sign
    poly unsigned short temp = (poly unsigned short) (value ^ SHORT_REST);
    return (short) (max_unsigned_short_ila(temp) ^ SHORT_REST);
}

/**
 * Returns the smallest component of a poly int.
 *
 * If there are no active PEs, returns the largest possible
 * int value.
 */
int min_int(poly int value)
{
    // Invert all bits except for sign
    poly unsigned int temp = (poly unsigned int) (value ^ INT_REST);
    return (int) (max_unsigned_int_ila(temp) ^ INT_REST);
}

/**
 * Returns the smallest component of a poly long.
 *
 * If there are no active PEs, returns the largest possible
 * long value.
 */
long min_long(poly long value)
{
    // Invert all bits except for sign
    poly unsigned int temp = (poly unsigned int) (value ^ INT_REST);
    return (long) (max_unsigned_int_ila(temp) ^ INT_REST);
}

/**
 * Returns the smallest component of a poly unsigned char.
 *
 * If there are no active PEs, returns the largest possible
 * unsigned char value.
 */
unsigned char min_unsigned_char(poly unsigned char value)
{
    // Invert all bits
    poly unsigned char temp = (poly unsigned char) ~value;
    return (unsigned char) ~max_unsigned_char_ila(temp);
}

/**
 * Returns the smallest component of a poly unsigned short.
 *
 * If there are no active PEs, returns the largest possible
 * unsigned short value.
 */
unsigned short min_unsigned_short(poly unsigned short value)
{
    // Invert all bits
    poly unsigned short temp = (poly unsigned short) ~value;
    return (unsigned short) ~max_unsigned_short_ila(temp);
}

/**
 * Returns the smallest component of a poly unsigned int.
 *
 * If there are no active PEs, returns the largest possible
 * unsigned int value.
 */
unsigned int min_unsigned_int(poly unsigned int value)
{
    // Invert all bits
    poly unsigned int temp = (poly unsigned int) ~value;
    return (unsigned int) ~max_unsigned_int_ila(temp);
}

/**
 * Returns the smallest component of a poly unsigned long.
 *
 * If there are no active PEs, returns the largest possible
 * unsigned long value.
 */
unsigned long min_unsigned_long(poly unsigned long value)
{
    // Invert all bits
    poly unsigned int temp = (poly unsigned int) ~value;
    return (unsigned long) ~max_unsigned_int_ila(temp);
}

/**
 * Returns the smallest component of a poly float.
 *
 * If there are no active PEs, returns positive infinity.
 */
float min_float(poly float value)
{
    poly unsigned int temp;
    unsigned int result;

    if (!any(true)) {
        // If no active PEs, return positive infinity
        return infinityf();
    }

    if (any(isnanp(value))) {
        // If any component is a NaN, return a NaN
        return nanf(NULL);
    }

    temp = *((poly unsigned int *) &value);

    if ((temp & INT_SIGN) == 0) {
        // If value is positive, invert all bits except sign
        temp ^= INT_REST;
    }

    result = max_unsigned_int_ila(temp);

    if ((result & INT_SIGN) == 0) {
        // If result is positive, invert all bits except sign
        result ^= INT_REST;
    }

    return *((float *) &result);
}

/**
 * Returns the smallest component of a poly double.
 *
 * If there are no active PEs, returns positive infinity.
 */
double min_double(poly double value)
{
    poly double_bits temp;
    double_bits result;

    if (!any(true)) {
        // If no active PEs, return positive infinity
        return infinity();
    }

    if (any(isnanp(value))) {
        // If any component is a NaN, return a NaN
        return nan(NULL);
    }

    temp.d = value;

    if ((temp.i.hi & INT_SIGN) == 0) {
        // If value is positive, invert all bits except sign
        temp.i.hi ^= INT_REST;
        temp.i.lo = ~temp.i.lo;
    }

    result.d = max_double_ila(temp.d);

    if ((result.i.hi & INT_SIGN) == 0) {
        // If result is positive, invert all bits except sign
        result.i.hi ^= INT_REST;
        result.i.lo = ~result.i.lo;
    }

    return result.d;
}

/**
 * Locates the component of a poly string that sorts first lexicographically
 * and copies it into the supplied buffer.
 *
 * If there are no active PEs, copies an empty string into the buffer.
 *
 * Returns the length of the string copied into the buffer.
 */
size_t min_string(char *buffer, size_t buffer_len, poly const char *value)
{
    if (select_min_string(value)) {
        if (select_one()) {
            return get_string(buffer, buffer_len, value);
        }
    }

    // No responders: copy an empty string into the buffer
    if (buffer_len > 0) {
        buffer[0] = '\0';
    }
    return 0;
}

/**
 * Returns a poly bool that is nonzero for at most one PE and zero for
 * all other PEs.
 */
poly bool select_one(void)
{
    return select_min_short(get_penum());
}
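
/*
 * Illustrative usage (a sketch, not library code): select_one() is the
 * building block of the usual ASC responder loop, which processes one
 * responder per iteration in the same way count() does above.  The names
 * key and target below are hypothetical:
 *
 *     poly bool responders = (key == target);
 *     while (responders) {
 *         if (select_one()) {
 *             short pe = get_short(get_penum());   // exactly one PE active
 *             // ... scalar processing of that responder ...
 *             break;                               // retire it, keep the rest
 *         }
 *     }
 */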

/**
 * Returns a poly bool that is nonzero for PEs that contain the largest
 * char and zero for all others.
 */
poly bool select_max_char(poly char value)
{
    return (value == max_char(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the largest
 * short and zero for all others.
 */
poly bool select_max_short(poly short value)
{
    return (value == max_short(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the largest
 * int and zero for all others.
 */
poly bool select_max_int(poly int value)
{
    return (value == max_int(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the largest
 * long and zero for all others.
 */
poly bool select_max_long(poly long value)
{
    return (value == max_long(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the largest
 * unsigned char and zero for all others.
 */
poly bool select_max_unsigned_char(poly unsigned char value)
{
    return (value == max_unsigned_char(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the largest
 * unsigned short and zero for all others.
 */
poly bool select_max_unsigned_short(poly unsigned short value)
{
    return (value == max_unsigned_short(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the largest
 * unsigned int and zero for all others.
 */
poly bool select_max_unsigned_int(poly unsigned int value)
{
    return (value == max_unsigned_int(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the largest
 * unsigned long and zero for all others.
 */
poly bool select_max_unsigned_long(poly unsigned long value)
{
    return (value == max_unsigned_long(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the largest
 * float and zero for all others. The tolerance parameter specifies how
 * close a PE's value must be to the largest for that PE to be selected.
 */
poly bool select_max_float(poly float value, float tolerance)
{
    return (max_float(value) - value <= tolerance);
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the largest
 * double and zero for all others. The tolerance parameter specifies how
 * close a PE's value must be to the largest for that PE to be selected.
 */
poly bool select_max_double(poly double value, double tolerance)
{
    return (max_double(value) - value <= tolerance);
}
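
/*
 * Usage note (illustrative): with a tolerance of 0.0f only PEs holding the
 * exact maximum respond; a small positive tolerance, e.g.
 * select_max_float(dist, 1e-6f) for some hypothetical poly float dist,
 * also selects near-ties, which can matter after floating-point arithmetic.
 */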

/**
 * Returns a poly bool that is nonzero for PEs that contain the string
 * which sorts last lexicographically.
 */
poly bool select_max_string(poly const char *value)
{
    size_t i;

    for (i = 0; (poly bool) true; i++) {
        if (value[i] != max_char(value[i])) {
            return false;
        } else if (value[i] == '\0') {
            break;
        }
    }

    return true;
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the smallest
 * char and zero for all others.
 */
poly bool select_min_char(poly char value)
{
    return (value == min_char(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the smallest
 * short and zero for all others.
 */
poly bool select_min_short(poly short value)
{
    return (value == min_short(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the smallest
 * int and zero for all others.
 */
poly bool select_min_int(poly int value)
{
    return (value == min_int(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the smallest
 * long and zero for all others.
 */
poly bool select_min_long(poly long value)
{
    return (value == min_long(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the smallest
 * unsigned char and zero for all others.
 */
poly bool select_min_unsigned_char(poly unsigned char value)
{
    return (value == min_unsigned_char(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the smallest
 * unsigned short and zero for all others.
 */
poly bool select_min_unsigned_short(poly unsigned short value)
{
    return (value == min_unsigned_short(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the smallest
 * unsigned int and zero for all others.
 */
poly bool select_min_unsigned_int(poly unsigned int value)
{
    return (value == min_unsigned_int(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the smallest
 * unsigned long and zero for all others.
 */
poly bool select_min_unsigned_long(poly unsigned long value)
{
    return (value == min_unsigned_long(value));
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the smallest
 * float and zero for all others. The tolerance parameter specifies how
 * close a PE's value must be to the smallest for that PE to be selected.
 */
poly bool select_min_float(poly float value, float tolerance)
{
    return (value - min_float(value) <= tolerance);
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the smallest
 * double and zero for all others. The tolerance parameter specifies how
 * close a PE's value must be to the smallest for that PE to be selected.
 */
poly bool select_min_double(poly double value, double tolerance)
{
    return (value - min_double(value) <= tolerance);
}

/**
 * Returns a poly bool that is nonzero for PEs that contain the string
 * which sorts first lexicographically.
 */
poly bool select_min_string(poly const char *value)
{
    size_t i;

    for (i = 0; (poly bool) true; i++) {
        if (value[i] != min_char(value[i])) {
            return false;
        } else if (value[i] == '\0') {
            break;
        }
    }

    return true;
}
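
/*
 * The inline assembly routines below compute an associative maximum with a
 * bit-serial algorithm: working from the most significant bit down, each
 * step ORs the current bit across the active PEs (synthesized from
 * poly.to.mono.and, since there is no poly.to.mono.or), appends that bit to
 * the running result, and switches off PEs whose bit disagrees with it.
 * A minimal sequential C sketch of the same idea (illustrative only; the
 * array, its length, the 32-bit width, and an active[] array initialized
 * to 1 for every participating element are assumptions):
 *
 *     unsigned int bit_serial_max(const unsigned int *v, int n, int *active)
 *     {
 *         unsigned int result = 0;
 *         int bit, i;
 *
 *         for (bit = 31; bit >= 0; bit--) {
 *             unsigned int or_bit = 0;
 *             for (i = 0; i < n; i++)        // OR the bit over active "PEs"
 *                 if (active[i]) or_bit |= (v[i] >> bit) & 1u;
 *             result = (result << 1) | or_bit;
 *             for (i = 0; i < n; i++)        // drop PEs whose bit lost
 *                 if (active[i] && ((v[i] >> bit) & 1u) != or_bit)
 *                     active[i] = 0;
 *         }
 *         return result;
 *     }
 */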

//--------------------------------------------------------------------------
// Inline assembly functions
//--------------------------------------------------------------------------

unsigned char __asm max_unsigned_char_ila(poly unsigned char value)
{
    @[
        .clobber value
        .poly .scratch value_bit
        .scratch result
        .scratch result_bit
        .restrict value_bit:preg
        .restrict result:mwreg
        .restrict result_bit:mwreg
    ]

    // Clear the result and open an if context containing every active PE
    // (the self-comparison is always true)
    mov @{result}, 0;
    if.eq @{value_bit}, @{value_bit};

    .define process_bit_char(value, value_bit, result, result_bit) {
        // Shift value left and save the high bit in value_bit
        mov value_bit, 0;
        lsl value, value, 1;
        addc value_bit, value_bit, value_bit;

        // Since there is no poly.to.mono.or instruction, we synthesize it
        // by inverting the input into poly.to.mono.and and then inverting
        // the result
        not value_bit, value_bit;
        poly.to.mono.and result_bit, value_bit;
        not value_bit, value_bit;
        not result_bit, result_bit;

        // Shift result left, placing result_bit into the vacated spot
        lsl result, result, 1;
        and result_bit, result_bit, 1;
        or result, result, result_bit;

        // PEs that contributed to the result continue, while the others
        // are switched off
        andif.eq value_bit, result_bit;
    }

    // One invocation per bit, most significant bit first (8 bits)
    process_bit_char @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_char @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_char @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_char @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_char @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_char @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_char @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_char @{value}, @{value_bit}, @{result}, @{result_bit};

    // Return the accumulated maximum
    mov @{}, @{result};
    endif;
}

unsigned short __asm max_unsigned_short_ila(poly unsigned short value)
{
    @[
        .clobber value
        .poly .scratch value_bit
        .scratch result
        .scratch result_bit
        .restrict value_bit:preg
        .restrict result:mwreg
        .restrict result_bit:mwreg
    ]

    // Clear the result and open an if context containing every active PE
    mov @{result}, 0;
    if.eq @{value_bit}, @{value_bit};

    .define process_bit_short(value, value_bit, result, result_bit) {
        // Shift value left and save the high bit in value_bit
        mov value_bit, 0;
        lsl value, value, 1;
        addc value_bit, value_bit, value_bit;

        // Since there is no poly.to.mono.or instruction, we synthesize it
        // by inverting the input into poly.to.mono.and and then inverting
        // the result
        not value_bit, value_bit;
        poly.to.mono.and result_bit, value_bit;
        not value_bit, value_bit;
        not result_bit, result_bit;

        // Shift result left, placing result_bit into the vacated spot
        lsl result, result, 1;
        and result_bit, result_bit, 1;
        or result, result, result_bit;

        // PEs that contributed to the result continue, while the others
        // are switched off
        andif.eq value_bit, result_bit;
    }

    // One invocation per bit, most significant bit first (16 bits)
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_short @{value}, @{value_bit}, @{result}, @{result_bit};

    // Return the accumulated maximum
    mov @{}, @{result};
    endif;
}

unsigned int __asm max_unsigned_int_ila(poly unsigned int value)
{
    @[
        .clobber value
        .poly .scratch value_bit
        .scratch result
        .scratch result_bit
        .restrict value_bit:preg
        .restrict result:mdreg
        .restrict result_bit:mwreg
    ]

    // Clear the result and open an if context containing every active PE
    mov @{result}, 0;
    if.eq @{value_bit}, @{value_bit};

    .define process_bit_int(value, value_bit, result, result_bit) {
        // Shift value left and save the high bit in value_bit
        mov value_bit, 0;
        lsl value, value, 1;
        addc value_bit, value_bit, value_bit;

        // Since there is no poly.to.mono.or instruction, we synthesize it
        // by inverting the input into poly.to.mono.and and then inverting
        // the result
        not value_bit, value_bit;
        poly.to.mono.and result_bit, value_bit;
        not value_bit, value_bit;
        not result_bit, result_bit;

        // Shift result left, placing result_bit into the vacated spot
        lsl .lsbytes(result,2), .lsbytes(result,2), 1;
        lslc .msbytes(result,2), .msbytes(result,2), 1;
        and result_bit, result_bit, 1;
        or .lsbytes(result,2), .lsbytes(result,2), result_bit;

        // PEs that contributed to the result continue, while the others
        // are switched off
        andif.eq value_bit, result_bit;
    }

    // One invocation per bit, most significant bit first (32 bits)
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_int @{value}, @{value_bit}, @{result}, @{result_bit};

    // Return the accumulated maximum
    mov @{}, @{result};
    endif;
}

double __asm max_double_ila(poly double value)
{
    @[
        .clobber value
        .poly .scratch value_bit
        .scratch result
        .scratch result_bit
        .restrict value:pddreg
        .restrict value_bit:preg
        .restrict result:mddreg
        .restrict result_bit:mwreg
    ]

    // Clear the result and open an if context containing every active PE
    mov @{result}, 0;
    if.eq @{value_bit}, @{value_bit};

    .define process_bit_double(value, value_bit, result, result_bit) {
        // Shift the 64-bit value left and save the high bit in value_bit
        mov value_bit, 0;
        lsl .msbytes(value,4), .msbytes(value,4), 1;
        addc value_bit, value_bit, value_bit;
        lsl .lsbytes(value,4), .lsbytes(value,4), 1;
        addc .msbytes(value,4), .msbytes(value,4), 0;

        // Since there is no poly.to.mono.or instruction, we synthesize it
        // by inverting the input into poly.to.mono.and and then inverting
        // the result
        not value_bit, value_bit;
        poly.to.mono.and result_bit, value_bit;
        not value_bit, value_bit;
        not result_bit, result_bit;

        // Shift result left, placing result_bit into the vacated spot
        lsl .lsbytes(.lsbytes(result,4),2), .lsbytes(.lsbytes(result,4),2), 1;
        lslc .msbytes(.lsbytes(result,4),2), .msbytes(.lsbytes(result,4),2), 1;
        lslc .lsbytes(.msbytes(result,4),2), .lsbytes(.msbytes(result,4),2), 1;
        lslc .msbytes(.msbytes(result,4),2), .msbytes(.msbytes(result,4),2), 1;
        and result_bit, result_bit, 1;
        or .lsbytes(result,2), .lsbytes(result,2), result_bit;

        // PEs that contributed to the result continue, while the others
        // are switched off
        andif.eq value_bit, result_bit;
    }

    // One invocation per bit, most significant bit first (64 bits)
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};
    process_bit_double @{value}, @{value_bit}, @{result}, @{result_bit};

    // Return the accumulated maximum
    mov @{}, @{result};
    endif;
}