Appears in the 31st Annual International Symposium on Computer Architecture (ISCA-31), Munich, Germany, June 2004

The Vector-Thread Architecture

Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, and Krste Asanović
MIT Computer Science and Artificial Intelligence Laboratory, 32 Vassar Street, Cambridge, MA 02139
{ronny, cbatten, krste}@csail.mit.edu

Abstract

The vector-thread (VT) architectural paradigm unifies the vector and multithreaded compute models. The VT abstraction provides the programmer with a control processor and a vector of virtual processors (VPs). The control processor can use vector-fetch commands to broadcast instructions to all the VPs or each VP can use thread-fetches to direct its own control flow. A seamless intermixing of the vector and threaded control mechanisms allows a VT architecture to flexibly and compactly encode application parallelism and locality, and a VT machine exploits these to improve performance and efficiency. We present SCALE, an instantiation of the VT architecture designed for low-power and high-performance embedded systems. We evaluate the SCALE prototype design using detailed simulation of a broad range of embedded applications and show that its performance is competitive with larger and more complex processors.

1. Introduction

Parallelism and locality are the key application characteristics exploited by computer architects to make productive use of increasing transistor counts while coping with wire delay and power dissipation. Conventional sequential ISAs provide minimal support for encoding parallelism or locality, so high-performance implementations are forced to devote considerable area and power to on-chip structures that extract parallelism or that support arbitrary global communication. The large area and power overheads are justified by the demand for even small improvements in performance on legacy codes for popular ISAs. Many important applications have abundant parallelism, however, with dependencies and communication patterns that can be statically determined. ISAs that expose more parallelism reduce the need for area- and power-intensive structures to extract dependencies dynamically. Similarly, ISAs that allow locality to be expressed reduce the need for long-range communication and complex interconnect. The challenge is to develop an efficient encoding of an application's parallel dependency graph and to reduce the area and power consumption of the microarchitecture that will execute this dependency graph.

In this paper, we unify the vector and multithreaded execution models with the vector-thread (VT) architectural paradigm. VT allows large amounts of structured parallelism to be compactly encoded in a form that allows a simple microarchitecture to attain high performance at low power by avoiding complex control and datapath structures and by reducing activity on long wires. The VT programmer's model extends a conventional scalar control processor with an array of slave virtual processors (VPs). VPs execute strings of RISC-like instructions packaged into atomic instruction blocks (AIBs). To execute data-parallel code, the control processor broadcasts AIBs to all the slave VPs. To execute thread-parallel code, each VP directs its own control flow by fetching its own AIBs. Implementations of the VT architecture can also exploit instruction-level parallelism within AIBs.

In this way, the VT architecture supports a modeless intermingling of all forms of application parallelism. This flexibility provides new ways to parallelize codes that are difficult to vectorize or that incur excessive synchronization costs when threaded. Instruction locality is improved by allowing common code to be factored out and executed only once on the control processor, and by executing the same AIB multiple times on each VP in turn. Data locality is improved as most operand communication is isolated to within an individual VP.

We are developing a prototype processor, SCALE, which is an instantiation of the vector-thread architecture designed for low-power and high-performance embedded systems. As transistors have become cheaper and faster, embedded applications have evolved from simple control functions to cellphones that run multitasking networked operating systems with realtime video, three-dimensional graphics, and dynamic compilation of garbage-collected languages. Many other embedded applications require sophisticated high-performance information processing, including streaming media devices, network routers, and wireless base stations. In this paper, we show how benchmarks taken from these embedded domains can be mapped efficiently to the SCALE vector-thread architecture. In many cases, the codes exploit multiple types of parallelism simultaneously for greater efficiency.

The paper is structured as follows. Section 2 introduces the vector-thread architectural paradigm. Section 3 then describes the SCALE processor, which contains many features that extend the basic VT architecture. Section 4 presents an evaluation of the SCALE processor using a range of embedded benchmarks and describes how SCALE efficiently executes various types of code. Finally, Section 5 reviews related work and Section 6 concludes.

2. The VT Architectural Paradigm

An architectural paradigm consists of the programmer's model for a class of machines plus the expected structure of implementations of these machines. This section first describes the abstraction a VT architecture provides to a programmer, then gives an overview of the physical model for a VT machine.

2.1 VT Abstract Model

The vector-thread architecture is a hybrid of the vector and multithreaded models. A conventional control processor interacts with a virtual processor vector (VPV), as shown in Figure 1. The programming model consists of two interacting instruction sets, one for the control processor and one for the VPs. Applications can be mapped to the VT architecture in a variety of ways but it is especially well suited to executing loops; each VP executes a single iteration of the loop and the control processor is responsible for managing the execution.

Figure 1: Abstract model of a vector-thread architecture. A control processor interacts with a virtual processor vector (an ordered sequence of VPs).

Figure 2: Vector-fetch commands. For simple data-parallel loops, the control processor can use a vector-fetch command to send an atomic instruction block (AIB) to all the VPs in parallel. In this vector-vector add example, we assume that r0 has been loaded with each VP's index number, and r1, r2, and r3 contain the base addresses of the input and output vectors. The instruction notation places the destination registers after the "->".

Figure 3: Cross-VP data transfers. For loops with cross-iteration dependencies, the control processor can vector-fetch an AIB that contains cross-VP data transfers. In this saturating parallel prefix sum example, we assume that r0 has been loaded with each VP's index number, r1 and r2 contain the base addresses of the input and output vectors, and r3 and r4 contain the min and max values of the saturation range. The instruction notation uses "(p)" to indicate predication.

Figure 4: VP threads. Thread-fetches allow a VP to request its own AIBs and thereby direct its own control flow. In this pointer-chase example, we assume that r0 contains a pointer to a linked list, r1 contains the address of the AIB, and r2 contains a count of the number of links traversed.

A virtual processor contains a set of registers and has the ability to execute RISC-like instructions with virtual register specifiers. VP instructions are grouped into atomic instruction blocks (AIBs), the unit of work issued to a VP at one time. There is no automatic program counter or implicit instruction fetch mechanism for VPs; all instruction blocks must be explicitly requested by either the control processor or the VP itself.

The control processor can direct the VPs' execution using a vector-fetch command to issue an AIB to all the VPs in parallel, or a VP-fetch to target an individual VP. Vector-fetch commands provide a programming model similar to conventional vector machines except that a large block of instructions can be issued at once. As a simple example, Figure 2 shows the mapping for a data-parallel vector-vector add loop. The AIB for one iteration of the loop contains two loads, an add, and a store. A vector-fetch command sends this AIB to all the VPs in parallel and thus initiates vl loop iterations, where vl is the length of the VPV (i.e., the vector length). Every VP executes the same instructions but operates on distinct data elements as determined by its index number. As a more efficient alternative to the individual VP loads and stores shown in the example, a VT architecture can also provide vector-memory commands issued by the control processor which move a vector of elements between memory and one register in each VP.

The VT abstract model connects VPs in a unidirectional ring topology and allows a sending instruction on VP(i) to transfer data directly to a receiving instruction on VP(i+1). These cross-VP data transfers are dynamically scheduled and resolve when the data becomes available. Cross-VP data transfers allow loops with cross-iteration dependencies to be efficiently mapped to the vector-thread architecture, as shown by the example in Figure 3. A single vector-fetch command can introduce a chain of prevVP receives and nextVP sends that spans the VPV. The control processor can push an initial value into the cross-VP start/stop queue (shown in Figure 1) before executing the vector-fetch command. After the chain executes, the final cross-VP data value from the last VP wraps around and is written into this same queue. It can then be popped by the control processor or consumed by a subsequent prevVP receive on VP0 during stripmined loop execution.

The VT architecture also allows VPs to direct their own control flow. A VP executes a thread-fetch to request an AIB to execute after it completes its active AIB, as shown in Figure 4. Fetch instructions may be predicated to provide conditional branching. A VP thread persists as long as each AIB contains an executed fetch instruction, but halts once the VP stops issuing thread-fetches. Once a VP thread is launched, it executes to completion before the next command from the control processor takes effect. The control processor and VPs all operate concurrently in the same address space. Memory dependencies between these processors are preserved via explicit memory fence and synchronization operations or atomic read-modify-write operations.

The ability to freely intermix vector-fetches and thread-fetches allows a VT architecture to combine the best attributes of the vector and multithreaded execution paradigms. As shown in Figure 5, the control processor can issue a vector-fetch command to launch a vector of VP threads, each of which continues to execute as long as it issues thread-fetches. These thread-fetches break the rigid control flow of traditional vector machines, enabling the VP threads to follow independent control paths. Thread-fetches broaden the range of loops which can be mapped efficiently to VT, allowing the VPs to execute data-parallel loop iterations with conditionals or even inner-loops. Apart from loops, the VPs can also be used as free-running threads, where they operate independently from the control processor and retrieve tasks from a shared work queue.
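For reference, the scalar loop that the cross-VP example in Figure 3 parallelizes looks roughly like the sketch below. The array and bound names (in, out, lo, hi) are illustrative stand-ins for the registers described in the caption (r1/r2 base addresses, r3/r4 saturation bounds); only the structure is taken from the figure.

    #include <stdint.h>

    /* Saturating prefix sum: the carried value corresponds to the
       prevVP/nextVP chain in Figure 3. */
    void sat_prefix_sum(int n, const int8_t *in, int8_t *out, int lo, int hi)
    {
        int sum = 0;                 /* initial value pushed into the cross-VP
                                        start/stop queue by the control processor */
        for (int i = 0; i < n; i++) {
            sum += in[i];            /* add prevVP, r5 -> r5 */
            if (sum < lo) sum = lo;  /* (p)copy r3 -> r5 when below the min bound */
            if (hi < sum) sum = hi;  /* (p)copy r4 -> r5 when above the max bound */
            out[i] = (int8_t)sum;    /* sb r5, r0(r2) */
        }
    }

On a VT machine each iteration becomes one VP's execution of the AIB, and only the accumulation is serialized through the cross-VP chain; the loads, stores, and comparisons of different iterations can proceed in parallel.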

Figure 6: Physical model of a VT machine. The implementation shown has four parallel lanes in the vector-thread unit (VTU), and VPs are striped across the lane array with the low-order bits of a VP index indicating the lane to which it is mapped. The configuration shown uses VPs with five virtual registers, and with twenty physical registers each lane is able to support four VPs. Each lane is divided into a command management unit (CMU) and an execution cluster, and the execution cluster has an associated cross-VP start-stop queue.
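As a concrete reading of the striping described in this caption, the mapping from a VP index to a lane and to a slot in that lane's register file can be sketched as follows. This is only a sketch: the constants and helper names are illustrative, using the four lanes and five virtual registers per VP shown in the figure.

    #define NUM_LANES    4
    #define VREGS_PER_VP 5   /* virtual registers per VP, as configured in Figure 6 */

    static inline int vp_lane(int vp) { return vp % NUM_LANES; }  /* low-order bits pick the lane */
    static inline int vp_slot(int vp) { return vp / NUM_LANES; }  /* which of the lane's VPs */

    /* Physical register index for virtual register vreg of virtual processor vp,
       within its lane's register file. */
    static inline int phys_reg(int vp, int vreg)
    {
        return vp_slot(vp) * VREGS_PER_VP + vreg;
    }

With twenty physical registers per lane, vp_slot ranges over 0 through 3, so lane 0 holds VP0, VP4, VP8, and VP12, matching the figure.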

Figure 5: The control processor can use a vector-fetch command to send an AIB to all the VPs, after which each VP can use thread-fetches to fetch its own AIBs.

Figure 7: Lane time-multiplexing. Both vector-fetched and thread-fetched AIBs are time-multiplexed on the physical lanes.

The VT architecture allows software to efficiently expose structured parallelism and locality at a fine granularity. Compared to a conventional threaded architecture, the VT model allows common bookkeeping code to be factored out and executed once on the control processor rather than redundantly in each thread. AIBs enable a VT machine to efficiently amortize instruction fetch overhead, and provide a framework for cleanly handling temporary state. Vector-fetch commands explicitly encode parallelism and instruction locality, allowing a VT machine to attain high performance while amortizing control overhead. Vector-memory commands avoid separate load and store requests for each element, and can be used to exploit memory data-parallelism even in loops with non-data-parallel compute. For loops with cross-iteration dependencies, cross-VP data transfers explicitly encode fine-grain communication and synchronization, avoiding heavyweight inter-thread memory coherence and synchronization primitives.

2.2 VT Physical Model

An architectural paradigm's physical model is the expected structure for efficient implementations of the abstract model. The VT physical model contains a conventional scalar unit for the control processor together with a vector-thread unit (VTU) that executes the VP code. To exploit the parallelism exposed by the VT abstract model, the VTU contains a parallel array of processing lanes as shown in Figure 6. Lanes are the physical processors which VPs map onto, and the VPV is striped across the lane array. Each lane contains physical registers, which hold the state of VPs mapped to the lane, and functional units, which are time-multiplexed across the VPs. In contrast to traditional vector machines, the lanes in a VT machine execute decoupled from each other. Figure 7 shows an abstract view of how VP execution is time-multiplexed on the lanes for both vector-fetched and thread-fetched AIBs. This fine-grain interleaving helps VT machines hide functional unit, memory, and thread-fetch latencies.

As shown in Figure 6, each lane contains both a command management unit (CMU) and an execution cluster. An execution cluster consists of a register file, functional units, and a small AIB cache. The lane's CMU buffers commands from the control processor in a queue (cmd-Q) and holds pending thread-fetch addresses for the lane's VPs. The CMU also holds the tags for the lane's AIB cache. The AIB cache can hold one or more AIBs and must be at least large enough to hold an AIB of the maximum size defined in the VT architecture.

The CMU chooses a vector-fetch, VP-fetch, or thread-fetch command to process. The fetch command contains an address which is looked up in the AIB tags. If there is a miss, a request is sent to the fill unit which retrieves the requested AIB from the primary cache. The fill unit handles one lane's AIB miss at a time, except if lanes are executing vector-fetch commands when refill overhead is amortized by broadcasting the AIB to all lanes simultaneously.

After a fetch command hits in the AIB cache or after a miss refill has been processed, the CMU generates an execute directive which contains an index into the AIB cache. For a vector-fetch command the execute directive indicates that the AIB should be executed by all VPs mapped to the lane, while for a VP-fetch or thread-fetch command it identifies a single VP to execute the AIB. The execute directive is sent to a queue in the execution cluster, leaving the CMU free to begin processing the next command. The CMU is able to overlap the AIB cache refill for new fetch commands with the execution of previous ones, but must track which AIBs have outstanding execute directives to avoid overwriting their entries in the AIB cache. The CMU must also ensure that the VP threads execute to completion before initiating a subsequent vector-fetch.
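The command-processing steps just described can be summarized in a short sketch. The types and helper names are illustrative only; this is not the SCALE hardware interface.

    typedef enum { VECTOR_FETCH, VP_FETCH, THREAD_FETCH } fetch_kind;

    typedef struct { fetch_kind kind; unsigned aib_addr; int vp; } fetch_cmd;
    typedef struct { int aib_index; int all_vps; int vp; } execute_directive;

    extern int  aib_tags_lookup(unsigned addr);              /* returns -1 on a miss */
    extern int  aib_fill_request(unsigned addr);             /* fill unit refills from the primary cache */
    extern void execute_directive_enqueue(execute_directive ed);

    void cmu_process(fetch_cmd cmd)
    {
        int idx = aib_tags_lookup(cmd.aib_addr);
        if (idx < 0)
            idx = aib_fill_request(cmd.aib_addr);

        execute_directive ed;
        ed.aib_index = idx;
        ed.all_vps   = (cmd.kind == VECTOR_FETCH);  /* run the AIB for every VP on the lane */
        ed.vp        = cmd.vp;                      /* single VP for a VP-fetch or thread-fetch */
        execute_directive_enqueue(ed);              /* CMU can move on to the next command */
    }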

To process an execute directive, the cluster reads VP instructions one by one from the AIB cache and executes them for the appropriate VP. When processing an execute directive from a vector-fetch command, all of the instructions in the AIB are executed for one VP before moving on to the next. The virtual register indices in the VP instructions are combined with the active VP number to create an index into the physical register file. To execute a fetch instruction, the cluster sends the requested AIB address to the CMU where the VP's associated pending thread-fetch register is updated.

The lanes in the VTU are inter-connected with a unidirectional ring network to implement the cross-VP data transfers. When a cluster encounters an instruction with a prevVP receive, it stalls until the data is available from its predecessor lane. When the VT architecture allows multiple cross-VP instructions in a single AIB, with some sends preceding some receives, the hardware implementation must provide sufficient buffering of send data to allow all the receives in an AIB to execute. By induction, deadlock is avoided if each lane ensures that its predecessor can never be blocked trying to send it cross-VP data.

3. The SCALE VT Architecture

SCALE is an instance of the VT architectural paradigm designed for embedded systems. The SCALE architecture has a MIPS-based control processor extended with a VTU. The SCALE VTU aims to provide high performance at low power for a wide range of applications while using only a small area. This section describes the SCALE VT architecture, presents a simple code example implemented on SCALE, and gives an overview of the SCALE microarchitecture and SCALE processor prototype.

3.1 SCALE Extensions to VT

Clusters

To improve performance while reducing area, energy, and circuit delay, SCALE extends the single-cluster VT model (shown in Figure 1) by partitioning VPs into multiple execution clusters with independent register sets. VP instructions target an individual cluster and perform RISC-like operations. Source operands must be local to the cluster, but results can be written to any cluster in the VP, and an instruction can write its result to multiple destinations. Each cluster within a VP has a separate predicate register, and instructions can be positively or negatively predicated.

The SCALE clusters are heterogeneous, but all clusters support integer operations. Cluster 0 additionally supports memory access instructions, cluster 1 supports fetch instructions, and cluster 3 supports integer multiply and divide. Though not used in this paper, the SCALE architecture allows clusters to be enhanced with layers of additional functionality (e.g., floating-point operations, fixed-point operations, and sub-word SIMD operations), or new clusters to be added to perform specialized operations.

Registers and VP Configuration

The general registers in each cluster of a VP are categorized as either private registers (pr's) or shared registers (sr's). Both private and shared registers can be read and written by VP instructions and by commands from the control processor. The main difference is that private registers preserve their values between AIBs, while shared registers may be overwritten by a different VP. Shared registers can be used as temporary state within an AIB to increase the number of VPs that can be supported by a fixed number of physical registers. The control processor can also vector-write the shared registers to broadcast scalar values and constants used by all VPs.

In addition to the general registers, each cluster also has programmer-visible chain registers (cr0 and cr1) associated with the two ALU input operands. These can be used as sources and destinations to avoid reading and writing the register files. Like shared registers, chain registers may be overwritten between AIBs, and they are also implicitly overwritten when a VP instruction uses their associated operand position. Cluster 0 has a special chain register called the store-data (sd) register through which all data for VP stores must pass.

In the SCALE architecture, the control processor configures the VPs by indicating how many shared and private registers are required in each cluster. The length of the virtual processor vector changes with each re-configuration to reflect the maximum number of VPs that can be supported. This operation is typically done once outside each loop, and state in the VPs is undefined across re-configurations. Within a lane, the VTU maps shared VP registers to shared physical registers. Control processor vector-writes to a shared register are broadcast to each lane, but individual VP writes to a shared register are not coherent across lanes, i.e., the shared registers are not global registers.

Vector-Memory Commands

In addition to VP load and store instructions, SCALE defines vector-memory commands issued by the control processor for efficient structured memory accesses. Like vector-fetch commands, these operate across the virtual processor vector; a vector-load writes the load data to a private register in each VP, while a vector-store reads the store data from a private register in each VP. SCALE also supports vector-load commands which target shared registers to retrieve values used by all VPs. In addition to the typical unit-stride and strided vector-memory access patterns, SCALE provides vector segment accesses where each VP loads or stores several contiguous memory elements, to support "array-of-structures" data layouts efficiently.

3.2 SCALE Code Example

This section presents a simple code example to show how SCALE is programmed. The code in Figure 8 implements a simplified version of the ADPCM speech decoder. Input is read from a unit-stride byte stream and output is written to a unit-stride halfword stream. The loop is non-vectorizable because it contains two loop-carried dependencies: the index and valpred variables are accumulated from one iteration to the next with saturation. The loop also contains two table lookups.

The SCALE code to implement the example decoder function is shown in Figure 9. The code is divided into two sections with the MIPS control processor code in the .text section and SCALE VP code in the .sisa (SCALE ISA) section. The SCALE VP code implements one iteration of the loop with a single AIB; cluster 0 accesses memory, cluster 1 accumulates index, cluster 2 accumulates valpred, and cluster 3 does the multiply.

The control processor first configures the VPs using the vcfgvl command to indicate the register requirements for each cluster. In this example, c0 uses one private register to hold the input data and two shared registers to hold the table pointers; c1 and c2 each use three shared registers to hold the min and max saturation values and a temporary; c2 also uses a private register to hold the output value; and c3 uses only chain registers so it does not need any shared or private registers. The configuration indirectly sets vlmax, the maximum vector length. In a SCALE implementation with 32 physical registers per cluster and four lanes, vlmax would be 4 lanes x floor((32 - 3)/1) = 116, limited by the register demands of cluster 2. The vcfgvl command also sets vl, the active vector-length, to the minimum of vlmax and the length argument provided; the resulting length is returned as a result.
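The vlmax arithmetic above can be made explicit with a small sketch. The constants assume 32 physical registers per cluster per lane and four lanes, as stated in the text; the function itself is illustrative and is not part of the SCALE ISA.

    #include <limits.h>

    #define NUM_LANES    4
    #define NUM_CLUSTERS 4
    #define PHYS_REGS    32   /* physical registers per cluster, per lane */

    int compute_vlmax(const int priv[NUM_CLUSTERS], const int shared[NUM_CLUSTERS])
    {
        int vlmax = INT_MAX;
        for (int c = 0; c < NUM_CLUSTERS; c++) {
            if (priv[c] == 0)
                continue;                       /* no per-VP register demand in this cluster */
            int vps_per_lane = (PHYS_REGS - shared[c]) / priv[c];
            int limit = NUM_LANES * vps_per_lane;
            if (limit < vlmax)
                vlmax = limit;
        }
        return vlmax;
    }

    /* Decoder example: c0 = 1 private + 2 shared (limit 120), c1 = 0 + 3,
       c2 = 1 private + 3 shared (limit 116), c3 = 0 + 0  ->  vlmax = 116. */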

void decode_ex(int len, u_int8_t* in, int16_t* out) {
  int i;
  int index = 0;
  int valpred = 0;
  for(i = 0; i < len; i++) {
    u_int8_t delta = in[i];
    index += indexTable[delta];
    index = index < IX_MIN ? IX_MIN : index;
    index = IX_MAX < index ? IX_MAX : index;
    valpred += stepsizeTable[index] * delta;
    valpred = valpred < VALP_MIN ? VALP_MIN : valpred;
    valpred = VALP_MAX < valpred ? VALP_MAX : valpred;
    out[i] = valpred;
  }
}

Figure 8: C code for decoder example.

        .text                        # MIPS control processor code
decode_ex:                           # a0=len, a1=in, a2=out
        # configure VPs:   c0:p,s  c1:p,s  c2:p,s  c3:p,s
        vcfgvl  t1, a0,    1,2,    0,3,    1,3,    0,0
                                     # (vl,t1) = min(a0,vlmax)
        sll     t1, t1, 1            # output stride
        la      t0, indexTable
        vwrsh   t0, c0/sr0           # indexTable addr.
        la      t0, stepsizeTable
        vwrsh   t0, c0/sr1           # stepsizeTable addr.
        vwrsh   IX_MIN, c1/sr0       # index min
        vwrsh   IX_MAX, c1/sr1       # index max
        vwrsh   VALP_MIN, c2/sr0     # valpred min
        vwrsh   VALP_MAX, c2/sr1     # valpred max
        xvppush $0, c1               # push initial index = 0
        xvppush $0, c2               # push initial valpred = 0
stripmineloop:
        setvl   t2, a0               # (vl,t2) = min(a0,vlmax)
        vlbuai  a1, t2, c0/pr0       # vector-load input, inc ptr
        vf      vtu_decode_ex        # vector-fetch AIB
        vshai   a2, t1, c2/pr0       # vector-store output, inc ptr
        subu    a0, t2               # decrement count
        bnez    a0, stripmineloop    # loop until done
        xvppop  $0, c1               # pop final index, discard
        xvppop  $0, c2               # pop final valpred, discard
        vsync                        # wait until VPs are done
        jr      ra                   # return

        .sisa                        # SCALE VP code
vtu_decode_ex:
        .aib begin
        c0 sll     pr0, 2      -> cr1             # word offset
        c0 lw      cr1(sr0)    -> c1/cr0          # load index
        c0 copy    pr0         -> c3/cr0          # copy delta
        c1 addu    cr0, prevVP -> cr0             # accum. index
        c1 slt     cr0, sr0    -> p               # index min
        c1 psel    cr0, sr0    -> sr2             # index min
        c1 slt     sr1, sr2    -> p               # index max
        c1 psel    sr2, sr1    -> c0/cr0, nextVP  # index max
        c0 sll     cr0, 2      -> cr1             # word offset
        c0 lw      cr1(sr1)    -> c3/cr1          # load step
        c3 mult.lo cr0, cr1    -> c2/cr0          # step*delta
        c2 addu    cr0, prevVP -> cr0             # accum. valpred
        c2 slt     cr0, sr0    -> p               # valpred min
        c2 psel    cr0, sr0    -> sr2             # valpred min
        c2 slt     sr1, sr2    -> p               # valpred max
        c2 psel    sr2, sr1    -> pr0, nextVP     # valpred max
        .aib end

Figure 9: SCALE code implementing decoder example from Figure 8.

The control processor next vector-writes several shared VP registers with constants using the vwrsh command, then uses the xvppush command to push the initial index and valpred values into the cross-VP start/stop queues for clusters 1 and 2.

The ISA for a VT architecture is defined so that code can be written to work with any number of VPs, allowing the same object code to run on implementations with varying or configurable resources. To manage the execution of the loop, the control processor uses stripmining to repeatedly launch a vector of loop iterations. For each iteration of the stripmine loop, the control processor uses the setvl command which sets the vector-length to the minimum of vlmax and the length argument provided (i.e., the number of iterations remaining for the loop); the resulting vector-length is also returned as a result. In the decoder example, the control processor then loads the input using an auto-incrementing vector-load-byte-unsigned command (vlbuai), vector-fetches the AIB to compute the decode, and stores the output using an auto-incrementing vector-store-halfword command (vshai). The cross-iteration dependencies are passed from one stripmine vector to the next through the cross-VP start/stop queues. At the end of the function the control processor uses the xvppop command to pop and discard the final values.

The SCALE VP code implements one iteration of the loop in a straightforward manner with no cross-iteration static scheduling. Cluster 0 holds the delta input value in pr0 from the previous vector-load. It uses a VP load to perform the indexTable lookup and sends the result to cluster 1. Cluster 1 uses five instructions to accumulate and saturate index, using prevVP and nextVP to receive and send the cross-iteration value, and the psel (predicate-select) instruction to optimize the saturation. Cluster 0 then performs the stepsizeTable lookup using the index value, and sends the result to cluster 3 where it is multiplied by delta. Cluster 2 uses five instructions to accumulate and saturate valpred, writing the result to pr0 for the subsequent vector-store.

3.3 SCALE Microarchitecture

The SCALE microarchitecture is an extension of the general VT physical model shown in Figure 6. A lane has a single CMU and one physical execution cluster per VP cluster. Each cluster has a dedicated output bus which broadcasts data to the other clusters in the lane, and it also connects to its sibling clusters in neighboring lanes to support cross-VP data transfers. An overview of the SCALE lane microarchitecture is shown in Figure 10.

Micro-Ops and Cluster Decoupling

The SCALE software ISA is portable across multiple SCALE implementations, but is designed to be easy to translate into implementation-specific micro-operations, or micro-ops. The assembler translates the SCALE software ISA into the native hardware ISA at compile time. There are three categories of hardware micro-ops: a compute-op performs the main RISC-like operation of a VP instruction; a transport-op sends data to another cluster; and a writeback-op receives data sent from an external cluster. The assembler reorganizes micro-ops derived from an AIB into micro-op bundles which target a single cluster and do not access other clusters' registers. Figure 11 shows how the SCALE VP instructions from the decoder example are translated into micro-op bundles. All inter-cluster data dependencies are encoded by the transport-ops and writeback-ops which are added to the sending and receiving cluster respectively. This allows the micro-op bundles for each cluster to be packed together independently from the micro-op bundles for other clusters.

Partitioning inter-cluster data transfers into separate transport and writeback operations enables decoupled execution between clusters. In SCALE, a cluster's AIB cache contains micro-op bundles, and each cluster has a local execute directive queue and local control. Each cluster processes its transport-ops in order, broadcasting result values onto its dedicated output data bus; and each cluster processes its writeback-ops in order, writing the values from external clusters to its local registers. The inter-cluster data dependencies are synchronized with handshake signals which extend between the clusters, and a transaction only completes when both the sender and the receiver are ready.
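A rough data-structure view of the micro-op bundles just described is sketched below; the field names and widths are illustrative, and the real encoding is implementation specific.

    /* One entry of a cluster's AIB cache: a writeback-op that receives data
       sent by another cluster, a compute-op that performs the local RISC-like
       operation, and a transport-op that broadcasts the result. */
    typedef struct {
        struct { int src_cluster; int dst_reg; }        writeback_op;
        struct { int opcode; int src0, src1, dst; }     compute_op;
        struct { unsigned dst_mask; /* destination clusters, possibly nextVP */ } transport_op;
    } micro_op_bundle;

Because inter-cluster dependencies live entirely in the transport-op/writeback-op pair, the bundles for each cluster can be packed and fetched independently, which is what makes the cluster decoupling described above possible.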


Figure 11: Cluster micro-op bundles for the AIB in Figure 9. The writeback-op field is labeled as ’wb-op’ and the transport-op field is labeled as ’tp-op’. A writeback-op is marked with ’ ’ when the dependency order is writeback-op followed by compute-op. The prevVP and nextVP identifiers are abbreviated as ’pVP’ and ’nVP’.

Although compute-ops execute in order, each cluster contains a transport queue to allow execution to proceed without waiting for external destination clusters to receive the data, and a writeback queue to allow execution to proceed without waiting for data from external clusters (until it is needed by a compute-op). These queues make inter-cluster synchronization more flexible, and thereby enhance cluster decoupling.

Figure 10: SCALE lane microarchitecture. The AIB caches in SCALE hold micro-op bundles. The compute-op is a local RISC operation on the cluster, the transport-op sends data to external clusters, and the writeback-op receives data from external clusters. Clusters 1, 2, and 3 are basic cluster designs with writeback-op and transport-op decoupling resources (cluster 1 is shown in detail, clusters 2 and 3 are shown in abstract). Cluster 0 connects to memory and includes memory access decoupling resources.

A schematic diagram of the example decoder loop executing on SCALE (extracted from simulation trace output) is shown in Figure 12. Each cluster executes the vector-fetched AIB for each VP mapped to its lane, and decoupling allows each cluster to advance to the next VP independently. Execution automatically adapts to the software critical path as each cluster's local data dependencies resolve. In the example loop, the accumulations of index and valpred must execute serially, but all of the other instructions are not on the software critical path. Furthermore, the two accumulations can execute in parallel, so the cross-iteration serialization penalty is paid only once. Each loop iteration (i.e., VP) executes over a period of 30 cycles, but the combination of multiple lanes and cluster decoupling within each lane leads to as many as six loop iterations executing simultaneously.

Figure 12: Execution of decoder example on SCALE. Each cluster executes in-order, but cluster and lane decoupling allows the execution to automatically adapt to the software critical path. Critical dependencies are shown with arrows (solid for inter-cluster within a lane, dotted for cross-VP).

Memory Access Decoupling

All VP loads and stores execute on cluster 0 (c0), and it is specially designed to enable access-execute decoupling [11]. Typically, c0 loads data values from memory and sends them to other clusters, computation is performed on the data, and results are returned to c0 and stored to memory. With basic cluster decoupling, c0 can continue execution after a load without waiting for the other clusters to receive the data. Cluster 0 is further enhanced to hide memory latencies by continuing execution after a load misses in the cache, and therefore it may retrieve load data from the cache out of order. However, like other instructions, load operations on cluster 0 use transport-ops to deliver data to other clusters in order, and c0 uses a load data queue to buffer the data and preserve the correct ordering.

Importantly, when cluster 0 encounters a store, it does not stall to wait for the store data.
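A minimal sketch of the cluster-0 load data queue behavior described above, assuming a simple circular buffer; capacity checks and the actual SCALE queue sizes are omitted, and the names are illustrative.

    #include <stdbool.h>

    #define LDQ_ENTRIES 8

    typedef struct { bool valid; int data; } ldq_entry;

    static ldq_entry ldq[LDQ_ENTRIES];
    static int ldq_head;   /* oldest outstanding load (next to transport) */
    static int ldq_tail;   /* next entry to allocate */

    /* A load allocates its queue entry in program order. */
    int ldq_allocate(void)
    {
        int i = ldq_tail;
        ldq_tail = (ldq_tail + 1) % LDQ_ENTRIES;
        ldq[i].valid = false;
        return i;
    }

    /* The cache may return data out of order, e.g. when an earlier load missed. */
    void ldq_fill(int idx, int data)
    {
        ldq[idx].data  = data;
        ldq[idx].valid = true;
    }

    /* Transport-ops drain strictly in order: the oldest load's data must be
       present before anything is sent to the destination cluster. */
    bool ldq_drain(int *data)
    {
        if (!ldq[ldq_head].valid)
            return false;
        *data = ldq[ldq_head].data;
        ldq_head = (ldq_head + 1) % LDQ_ENTRIES;
        return true;
    }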

[Table 1 (SCALE prototype and simulation parameters: clock frequencies, cache organization, and the DDR DRAM interface) and the SCALE chip plan (roughly 2.5 mm by 4 mm: four lanes of clusters, each with a register file, ALU, shifter, and AIB cache; the MIPS control processor; four 8 KB cache banks with tags and a crossbar; and the memory interface and cache control) appear here.]

Inter-cluster data transfers within a lane incur a one-cycle latency (a bubble between the producing instruction and the dependent instruction). Cross-VP transports are able to have zero cycle latency because the clusters are physically closer together and there is less fan-in for the receive operation. Cache accesses have a two cycle latency (two bubble cycles between load and use), and accesses which miss in the cache have a minimum latency of 35 cycles.

To show scaling effects, we model four SCALE configurations with one, two, four, and eight lanes. The one, two, and four lane configurations each include four cache banks and one 64-bit DRAM port. For eight lanes, the memory system was doubled to eight cache banks and two 64-bit memory ports to appropriately match the increased compute bandwidth.

4.2 Benchmarks and Results

We have implemented a selection of benchmarks (Table 2) to illustrate the key features of SCALE, including examples from network processing, image processing, cryptography, and audio processing. The majority of these benchmarks come from the EEMBC benchmark suite. The EEMBC benchmarks may either be run "out-of-the-box" (OTB) as compiled unmodified C code, or they may be optimized (OPT) using assembly coding and arbitrary hand-tuning. This enables a direct comparison of SCALE running hand-written assembly code to optimized results from industry processors. Although OPT results match the typical way in which these processors are used, one drawback of this form of evaluation is that performance depends greatly on programmer effort, especially as EEMBC permits algorithmic and data-structure changes to many of the benchmark kernels, and optimizations used for the reported results are often unpublished. Also, not all of the EEMBC results are available for all processors, as results are often submitted for only one of the domain-specific suites (e.g., telecom).

We made algorithmic changes to several of the EEMBC benchmarks: rotate blocks the algorithm to enable rotating an 8-bit block completely in registers, pktflow implements the packet descriptor queue using an array instead of a linked list, fir optimizes the default algorithm to avoid copying and exploit reuse, fbital uses a binary search to optimize the bit allocation, conven uses bit-packed input data to enable multiple bit-level operations to be performed in parallel, and fft uses a radix-2 hybrid Stockham algorithm to eliminate bit-reversal and increase vector lengths.

Figure 14 shows the simulated performance of the various SCALE processor configurations relative to several reasonable competitors from among those with the best published EEMBC benchmark scores in each domain. For each of the different benchmarks, Table 3 shows VP configuration and vector-length statistics, and Tables 4 and 5 give statistics showing the effectiveness of the SCALE control and data hierarchies. These are discussed further in the following sections.

The AMD Au1100 was included to validate the SCALE control processor OTB performance, as it has a similar structure and clock frequency, and also uses gcc. The Philips TriMedia TM1300 is a five-issue VLIW processor with a 32-bit datapath. It runs at 166 MHz and has a 32 KB L1 instruction cache and 16 KB L1 data cache, with a 32-bit memory port running at 125 MHz. The Motorola PowerPC (MPC7447) is a four-issue out-of-order superscalar processor which runs at 1.3 GHz and has 32 KB separate L1 instruction and data caches, a 512 KB L2 cache, and a 64-bit memory port running at 133 MHz. The OPT results for this processor use its Altivec SIMD unit which has a 128-bit datapath and four execution units. The VIRAM processor [4] is a research vector processor with four 64-bit lanes. VIRAM runs at 200 MHz and includes 13 MB of embedded DRAM supporting up to 256 bits each of load and store data, and four independent addresses per cycle. The BOPS Manta is a clustered VLIW DSP with four clusters each capable of executing up to five instructions per cycle on 64-bit datapaths. The Manta 2.0 runs at 136 MHz with 128 KB of on-chip memory connected to a 32-bit memory port running at 136 MHz. The TI TMS320C6416 is a clustered VLIW DSP with two clusters each capable of executing up to four instructions per cycle. It runs at 720 MHz and has a 16 KB instruction cache and a 16 KB data cache together with 1 MB of on-chip SRAM. The TI has a 64-bit memory interface running at 720 MHz. Apart from the Au1100 and SCALE, all other processors implement SIMD operations on packed subword values and these are widely exploited throughout the benchmark set.

Overall, the results show that SCALE can flexibly provide competitive performance with larger and more complex processors on a wide range of codes from different domains, and that performance generally scales well when adding new lanes. The results also illustrate the large speedups possible when algorithms are extensively tuned for a highly parallel processor versus compiled from standard reference code. SCALE results for fft and viterbi are not as competitive with the DSPs. This is partly due to these being preliminary versions of the code with further scope for tuning (e.g., moving the current radix-2 FFT to radix-4 and using outer-loop vectorization for viterbi) and partly due to the DSPs having special support for these operations (e.g., complex multiply on BOPS). We expect SCALE performance to increase significantly with the addition of subword operations and with improvements to the microarchitecture driven by these early results.

4.3 Mapping Parallelism to SCALE

The SCALE VT architecture allows software to explicitly encode the parallelism and locality available in an application. This section evaluates the architecture's expressiveness in mapping different types of code.

Data-Parallel Loops with No Control Flow

Data-parallel loops with no internal control flow, i.e., simple vectorizable loops, may be ported to the VT architecture in a similar manner as other vector architectures. Vector-fetch commands encode the cross-iteration parallelism between blocks of instructions, while vector-memory commands encode data locality and enable optimized memory access. The EEMBC image processing benchmarks (rgbcmy, rgbyiq, hpg) are examples of streaming vectorizable code for which SCALE is able to achieve high performance that scales with the number of lanes in the VTU. A 4-lane SCALE achieves performance competitive with VIRAM for rgbyiq and rgbcmy despite having half the main memory bandwidth, primarily because VIRAM is limited by strided accesses while SCALE refills the cache with unit-stride bursts and then has higher strided bandwidth into the cache. For the unit-stride hpg benchmark, performance follows memory bandwidth with the 8-lane SCALE approximately matching VIRAM.

Data-Parallel Loops with Conditionals

Traditional vector machines handle conditional code with predication (masking), but the VT architecture adds the ability to conditionally branch. Predication can be less overhead for small conditionals, but branching results in less work when conditional blocks are large. EEMBC dither is an example of a large conditional block in a data-parallel loop. This benchmark converts a grey-scale image to black and white, and the dithering algorithm handles white pixels as a special case. In the SCALE code, each VP executes a conditional fetch for each pixel, executing only 18 SCALE instructions for white pixels versus 49 for non-white pixels.
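The per-pixel control structure of the dither mapping amounts to the following sketch. The threshold test and helper name are illustrative; the instruction counts quoted above come from the paper's mapping, not from this sketch.

    extern void propagate_dither_error(int pixel);  /* hypothetical helper for the non-white case */

    /* Work done by one VP for one pixel: a short common-case AIB, plus a
       predicated thread-fetch of a second AIB only for non-white pixels. */
    void dither_pixel(int pixel, int white_threshold)
    {
        int is_white = (pixel >= white_threshold);  /* executed for every pixel; white pixels
                                                       stop here (about 18 SCALE instructions) */
        if (!is_white) {
            /* corresponds to the conditional fetch of the larger AIB that
               diffuses quantization error (the remaining ~31 instructions) */
            propagate_dither_error(pixel);
        }
    }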

EEMBC benchmark (suite) | data set | OTB Itr/Sec | OPT Itr/Sec | kernel speedup | ops/cycle | mem B/cycle | description
rgbcmy (consumer) | - | 126 | 1505 | 11.9 | 6.1 | 3.2 | RGB to CMYK color conversion
rgbyiq (consumer) | - | 56 | 1777 | 31.7 | 9.9 | 3.1 | RGB to YIQ color conversion
hpg (consumer) | - | 108 | 3317 | 30.6 | 9.5 | 2.0 | High pass grey-scale filter
text (office) | - | 299 | 435 | 1.5 | 0.3 | 0.0 | Printer language parsing
dither (office) | - | 149 | 653 | 4.4 | 5.0 | 0.2 | Floyd-Steinberg grey-scale dithering
rotate (office) | - | 704 | 10112 | 14.4 | 7.5 | 0.0 | Binary image 90 degree rotation
lookup (network) | - | 1663 | 8850 | 5.3 | 6.3 | 0.0 | IP route lookup using Patricia Trie
ospf (network) | - | 6346 | 7044 | 1.1 | 1.3 | 0.0 | Dijkstra shortest path first
pktflow (network) | 512KB | 6694 | 127677 | 19.1 | 7.8 | 0.6 | IP packet processing
pktflow (network) | 1MB | 2330 | 25609 | 11.0 | 3.0 | 3.6 | IP packet processing
pktflow (network) | 2MB | 1189 | 13473 | 11.3 | 3.1 | 3.7 | IP packet processing
pntrch (auto) | - | 8771 | 38744 | 4.4 | 2.3 | 0.0 | Pointer chasing, searching linked list
fir (auto) | - | 56724 | 6105006 | 107.6 | 8.7 | 0.3 | Finite impulse response filter
fbital (telecom) | typ | 860 | 20897 | 24.3 | 4.0 | 0.0 | Bit allocation for DSL modems
fbital (telecom) | step | 12523 | 281938 | 22.5 | 2.5 | 0.0 | Bit allocation for DSL modems
fbital (telecom) | pent | 1304 | 60958 | 46.7 | 3.6 | 0.0 | Bit allocation for DSL modems
fft (telecom) | all | 6572 | 89713 | 13.6 | 6.1 | 0.0 | 256-pt fixed-point complex FFT
viterb (telecom) | all | 1541 | 7522 | 4.9 | 4.2 | 0.0 | Soft decision Viterbi decoder
autocor (telecom) | data1 | 279339 | 3131115 | 11.2 | 4.8 | 0.2 | Fixed-point autocorrelation
autocor (telecom) | data2 | 1888 | 64148 | 34.0 | 11.2 | 0.0 | Fixed-point autocorrelation
autocor (telecom) | data3 | 1980 | 78751 | 39.8 | 13.0 | 0.0 | Fixed-point autocorrelation
conven (telecom) | data1 | 2899 | 2447980 | 844.3 | 9.8 | 0.0 | Convolutional encoder
conven (telecom) | data2 | 3361 | 3085229 | 917.8 | 10.4 | 0.0 | Convolutional encoder
conven (telecom) | data3 | 4259 | 3703703 | 869.4 | 9.5 | 0.1 | Convolutional encoder

Other benchmark (suite) | data set | OTB total cycles | OPT total cycles | kernel speedup | ops/cycle | mem B/cycle | description
rijndael (MiBench) | large | 420.8M | 219.0M | 2.4 | 2.5 | 0.0 | Advanced Encryption Standard
sha (MiBench) | large | 141.3M | 64.8M | 2.2 | 1.8 | 0.0 | Secure hash algorithm
qsort (MiBench) | small | 35.0M | 21.4M | 3.5 | 2.3 | 2.7 | Quick sort of strings
adpcm enc (Mediabench) | - | 7.7M | 4.3M | 1.8 | 2.3 | 0.0 | Adaptive Differential PCM encode
adpcm dec (Mediabench) | - | 6.3M | 1.0M | 7.9 | 6.7 | 0.0 | Adaptive Differential PCM decode
li (SpecInt95) | test | 1,340.0M | 1,151.7M | 5.5 | 2.8 | 2.7 | Lisp

Table 2: Benchmark statistics and characterization. All numbers are for the default SCALE configuration with four lanes. Results for multiple input data sets are shown separately if there was significant variation, otherwise an "all" data set indicates results were similar across inputs. As is standard practice, EEMBC statistics are for the kernel only. Total cycle numbers for non-EEMBC benchmarks are for the entire application, while the remaining statistics are for the kernel of the benchmark only (the kernel excludes benchmark overhead code, and for li the kernel consists of the garbage collector only). The Mem B/Cycle column shows the DRAM bandwidth in bytes per cycle. The Loop Type column indicates the categories of loops which were parallelized when mapping the benchmark to SCALE: [DP] data-parallel loop with no control flow, [DC] data-parallel loop with conditional thread-fetches, [XI] loop with cross-iteration dependencies, [DI] data-parallel loop with inner-loop, [DE] loop with data-dependent exit condition, and [FT] free-running threads. The Mem column indicates the types of memory accesses performed: [VM] for vector-memory accesses and [VP] for individual VP loads and stores.

Benchmarks (left to right): rgbcmy, rgbyiq, hpg, text, dither, rotate, lookup, ospf, pktflw, pntrch, fir, fbital, fft, viterb, autcor, conven, rijnd, sha, qsort, adpcm.e, adpcm.d, li.gc, avg.

VP config: private regs   2.0   1.0   5.0   2.7  10.0  16.0   8.0   1.0   5.0   7.0   4.0   3.0   9.0   3.6   3.0   6.0  13.0   1.0  26.0   4.0   1.0   4.4   6.2
VP config: shared regs   10.0  18.0   3.0   3.6  16.0   3.0   9.0   5.0  12.0  14.5   2.0   8.0   1.0   3.9   2.0   7.0   5.0   3.8  20.0  19.0  17.0   5.1   8.5
vlmax                    52.0 120.0  60.0  90.8  28.1  24.0  40.0 108.0  56.0  40.0  64.0 116.0  36.0  49.8 124.0  40.0  28.0 113.5  12.0  48.0  96.0 112.7  66.3
vl                       52.0 120.0  53.0   6.7  24.4  18.5  40.0   1.0  52.2  12.0  60.0 100.0  25.6  16.6  32.0  31.7   4.0   5.5  12.0  47.6  90.9  62.7  39.5

Table 3: VP configuration and vector-length statistics as averages of data recorded at each vector-fetch command. The VP configuration register counts represent totals across all four clusters, vlmax indicates the average maximum vector length, and vl indicates the average vector length.

Loops with Cross-Iteration Dependencies

Many loops are non-vectorizable because they contain loop-carried data dependencies from one iteration to the next. Nevertheless, there may be ample loop parallelism available when there are operations in the loop which are not on the critical path of the cross-iteration dependency. The vector-thread architecture allows the parallelism to be exposed by making the cross-iteration (cross-VP) data transfers explicit. In contrast to software pipelining for a VLIW architecture, the vector-thread code need only schedule instructions locally in one loop iteration. As the code executes on a vector-thread machine, the dependencies between iterations resolve dynamically and the performance automatically adapts to the software critical path and the available hardware resources.

Mediabench ADPCM contains one such loop (similar to Figure 8) with two loop-carried dependencies that can propagate in parallel. The loop is mapped to a single SCALE AIB with 35 VP instructions. Cross-iteration dependencies limit the initiation interval to 5 cycles, yielding a maximum SCALE IPC of 7. SCALE sustains an average of 6.7 compute-ops per cycle and achieves a speedup of 7.9 compared to the control processor (Table 2).

The two MiBench cryptographic kernels, sha and rijndael, have many loop-carried dependences. The sha mapping uses 5 cross-VP data transfers, while the rijndael mapping vectorizes a short four-iteration inner loop. SCALE is able to exploit instruction-level parallelism within each iteration of these kernels by using multiple clusters, but, as shown in Figure 14, performance also improves as more lanes are added.
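As a hypothetical illustration of the pattern (not the ADPCM kernel itself), consider a loop in which a single accumulator carries a value between iterations while the rest of the work is independent; in a VT mapping the accumulator update would become an explicit cross-VP transfer, and the independent operations from different iterations overlap freely.

/* Sketch of a loop-carried dependence with off-critical-path work
 * (illustrative only). Only `pred` carries a value from one iteration
 * to the next; the per-element filtering is independent work. In a VT
 * mapping, the `pred` update becomes an explicit cross-VP transfer
 * while each VP's independent operations execute in parallel. */
void decode(const int *x, int *y, int n)
{
    int pred = 0;                      /* cross-iteration (cross-VP) value */
    for (int i = 0; i < n; i++) {
        int t = (x[i] * 3 + 2) >> 2;   /* independent per-iteration work   */
        pred = pred + t;               /* the only loop-carried update     */
        y[i] = pred;
    }
}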

[Figure 14 appears here: bar charts of speedup relative to the SCALE MIPS control processor (OTB) for rgbcmy, rgbyiq, hpg, text, dither, rotate, lookup, ospf, pktflow/2MB, pntrch, fir, fbital/pent, fft, viterb, autocor/data3, conven/data3, rijndael/large, sha/large, qsort/small, adpcm_enc, adpcm_dec, and li/test (GC). Depending on the benchmark, the bars compare the AMD Au1100 396 MHz (OTB), PowerPC 1.3 GHz (OTB and OPT), TM1300 166 MHz (OPT), VIRAM 200 MHz (OPT), TI TMS320C6416 720 MHz (OPT), BOPS Manta v2.0 136 MHz (OPT), and SCALE with 1, 2, 4, and 8 lanes at 400 MHz (OPT).]

Figure 14: Performance Results. Twenty-two benchmarks illustrate the performance of four SCALE configurations (1 Lane, 2 Lanes, 4 Lanes, 8 Lanes) compared to various industry architectures. Speedup is relative to the SCALE MIPS control processor. The EEMBC benchmarks are compared in terms of iterations per second, while the non-EEMBC benchmarks are compared in terms of cycles to complete the benchmark kernel. These numbers correspond to the Kernel Speedup column in Table 2. For benchmarks with multiple input data sets, results for a single representative data set are shown with the data set name indicated after a forward slash.

Data-Parallel Loops with Inner-Loops

Often an inner loop has little or no available parallelism, but the outer loop iterations can run concurrently. For example, the EEMBC lookup code models a router using a Patricia Trie to perform IP route lookup. The benchmark searches the trie for each IP address in an input vector, with each lookup chasing pointers through around 5–12 nodes of the trie. Very little parallelism is available in each lookup, but many lookups can run simultaneously. In the SCALE implementation, each VP handles one IP lookup, using thread-fetches to traverse the trie. The ample thread parallelism keeps the lanes busy executing 6.3 ops/cycle by interleaving the execution of multiple VPs to hide memory latency. Vector-fetches provide an advantage over a pure multithreaded machine by efficiently distributing work to the VPs, avoiding contention for a shared work queue. Additionally, vector-load commands optimize the loading of IP addresses before the VP threads are launched.
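To make the loop structure concrete, here is a minimal C sketch (illustrative only: the node layout, field names, and the assumption that internal trie nodes always have two children are placeholders, not the EEMBC lookup code). The outer loop over addresses is what SCALE distributes across VPs; the inner pointer chase is the serial, thread-fetched part of each lookup.

#include <stdint.h>

/* Hypothetical trie node: internal nodes have both children,
 * leaves have none (a simplification of a Patricia trie). */
struct node {
    struct node *child[2];
    uint8_t      bit;      /* which bit of the key to test next */
    uint32_t     route;    /* result stored at a leaf           */
};

void lookup_all(const struct node *root, const uint32_t *addr,
                uint32_t *route, int n)
{
    for (int i = 0; i < n; i++) {              /* parallel: one VP per lookup */
        const struct node *p = root;
        while (p->child[0] != NULL) {          /* serial 5-12 node chase      */
            int dir = (addr[i] >> p->bit) & 1;
            p = p->child[dir];
        }
        route[i] = p->route;
    }
}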
Reductions and Data-Dependent Loop Exit Conditions

SCALE provides efficient support for arbitrary reduction operations by using shared registers to accumulate partial reduction results from multiple VPs on each lane. The shared registers are then combined across all lanes at the end of the loop using the cross-VP network. The pktflow code uses reductions to count the number of packets processed.

Loops with data-dependent exit conditions ("while" loops) are difficult to parallelize because the number of iterations is not known in advance. For example, the strcmp and strcpy standard C library routines used in the text benchmark loop until the string termination character is seen. The cross-VP network can be used to communicate exit status across VPs, but this serializes execution. Alternatively, iterations can be executed speculatively in parallel and then nullified after the correct exit iteration is determined. The check to find the exit condition is coded as a cross-iteration reduction operation. The text benchmark executes most of its code on the control processor, but uses this technique for the string routines to attain a 1.5× overall speedup.

Free-Running Threads

When structured loop parallelism is not available, VPs can be used to exploit arbitrary thread parallelism. With free-running threads, the control processor interaction is eliminated. Each VP thread runs in a continuous loop getting tasks from a work-queue accessed using atomic memory operations. An advantage of this method is that it achieves good load-balancing between the VPs and can keep the VTU constantly utilized.

Three benchmarks were mapped with free-running threads. The pntrch benchmark searches for tokens in a doubly-linked list, and allows up to five searches to execute in parallel. The qsort benchmark uses quick-sort to alphabetize a list of words. The SCALE mapping recursively divides the input set and assigns VP threads to sort partitions, using VP function calls to implement the compare routine. The benchmark achieves 2.3 ops/cycle despite a high cache miss rate. The ospf benchmark has little available parallelism, and the SCALE implementation maps the code to a single VP to exploit ILP for a small speedup.

Mixed Parallelism

Some codes exploit a mixture of parallelism types to accelerate performance and improve efficiency. The garbage collection portion of the lisp interpreter (li) is split into two phases: mark, which traverses a tree of currently live lisp nodes and sets a flag bit in every visited node, and sweep, which scans through the array of nodes and returns a linked list containing all of the unmarked nodes. During mark, the SCALE code sets up a queue of nodes to be processed and uses a stripmine loop to examine the nodes, mark them, and enqueue their children. In the sweep phase, VPs are assigned segments of the allocation array and then each constructs a list of unmarked nodes within its segment in parallel. Once the VP threads terminate, the control processor vector-fetches an AIB that stitches the individual lists together using cross-VP data transfers, thus producing the intended structure. Although the garbage collector has a high cache miss rate, the high degree of parallelism exposed in this way allows SCALE to sustain 2.8 operations/cycle and attain a speedup of 5.5 over the control processor alone.
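The free-running-thread work-queue pattern described above can be sketched in C11 as follows. This is an illustrative analogy only: SCALE VPs use the VTU's atomic memory operations rather than C11 atomics, and NTASKS and task() are hypothetical placeholders for the per-task work (for example, one quick-sort partition or one queued node in the mark phase).

#include <stdatomic.h>

#define NTASKS 1024                  /* hypothetical number of queued tasks */

static atomic_int next_task;         /* shared work-queue index (zero-initialized) */
static int result[NTASKS];

static void task(int id)             /* stand-in for the real per-task work */
{
    result[id] = id * id;
}

void vp_thread(void)                 /* body executed by every free-running VP */
{
    for (;;) {
        int id = atomic_fetch_add(&next_task, 1);   /* atomically claim a task */
        if (id >= NTASKS)
            break;                                   /* queue drained: thread exits */
        task(id);
    }
}

Because each thread claims its next task only when it finishes the previous one, the slower and faster VPs balance themselves without any control processor involvement, which is the load-balancing property noted above.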

Benchmarks (left to right): rgbcmy, rgbyiq, hpg, text, dither, rotate, lookup, ospf, pktflw, pntrch, fir, fbital, fft, viterb, autcor, conven, rijnd, sha, qsort, adpcm.e, adpcm.d, li.gc

compute-ops / AIB                  21.0  29.0   3.7   4.9   8.6  19.7   5.1  16.5   4.2   7.0   4.0   7.4   3.0   8.8   3.0   7.3  13.4  14.1   9.1  61.0  35.0   8.9
compute-ops / AIB tag-check       273.0 870.0  48.6   8.2  10.4  91.1   5.3  18.5  21.5   7.0  59.6  14.2  19.4  36.8  24.0  57.5  13.4  25.4   9.1 726.2 795.3  12.2
compute-ops / ctrl. proc. instr.  136.0 431.9  44.7   0.2  23.7  85.7 639.2 857.8 152.3 3189.6 18.1  62.8   5.8   4.9  23.6  19.7   8.9   3.9   5.8 229.2 186.0 122.7
thread-fetches / VP thread          0.0   0.0   0.0   0.0   3.8   0.0  26.7 3969.0  0.2 3023.7  0.0   1.0   0.0   0.0   0.0   0.0   0.9   0.0 113597.9 0.0  0.0   2.4
AIB cache miss percent              0.0   0.0   0.0   0.0   0.0  33.2   0.0  22.5   0.0   1.5   1.6   0.0   0.2   0.0   0.0   0.0   0.0   0.0   4.3   0.0   0.1   0.4

Table 4: Control hierarchy statistics. The first three rows show the average number of compute-ops per executed AIB, per AIB tag-check (caused by either a vector-fetch, VP-fetch, or thread-fetch), and per executed control processor instruction. The next row shows the average number of thread-fetches issued by each dynamic VP thread (launched by a vector-fetch or VP-fetch). The last row shows the miss rate for AIB tag-checks.

Benchmarks (left to right): rgbcmy, rgbyiq, hpg, text, dither, rotate, lookup, ospf, pktflw, pntrch, fir, fbital, fft, viterb, autcor, conven, rijnd, sha, qsort, adpcm.e, adpcm.d, li.gc, avg.

sources: chain register   75.6  92.9  40.0  31.2  41.3   5.8  21.0  13.1  62.7  31.0  38.8  30.5  31.9  37.1  48.4  46.8  81.5 115.8  32.6  20.3  34.1  39.4  44.2
sources: register file    99.3  86.0 106.7  75.3  94.2 109.8 113.6 127.0  84.7 115.0 113.3 114.4  84.5  87.3  96.9  90.6  72.1  27.9 102.0  97.4 110.1  77.6  94.8
sources: immediate        28.4  31.0   6.7  13.1  27.2  64.0  21.8  52.9  45.7  38.8   2.6  30.2   7.5  13.9   0.0  50.0  23.1  38.9  66.6  35.4  38.7  71.7  32.2
dests: chain register     56.7  58.5  40.0  18.2  43.8   5.8  21.8  18.5  77.9  38.5  38.8  22.8  31.9  37.1  48.4  40.6  81.1  84.5  32.9  12.8  24.8  39.2  39.8
dests: register file      33.8  31.2  60.0  52.8  46.0  59.6  48.2  87.3  52.6  23.1  60.7  83.0  51.2  64.9  51.5  75.0  22.0  15.5  43.1  81.8  44.2  26.8  50.6
ext. cluster transports   52.0  51.6  53.3  43.0  45.9  34.5  36.6  53.5  57.6  30.9  38.8  90.7  31.9  74.1  48.5  40.6  29.5  68.3  13.9  72.7  21.7  56.3  47.5
load elements             14.2  10.3  20.0  14.6  14.7   5.7  14.8  21.8  22.0  15.4  19.0  15.1  20.7  13.9  25.0  12.5  28.4  15.4  18.1   6.4   9.3  13.0  15.9
load addresses            14.2  10.3   5.3   3.7   8.1   1.7  14.2  21.8  20.8  15.4   5.4   9.4   8.0   5.3   7.4   7.9  25.7  12.3  18.1   4.8   9.3  11.6  10.9
load bytes                14.2  10.3  20.0  14.6  30.2   5.7  59.1  87.4  54.2  61.6  75.9  30.2  41.3  38.4  49.9  25.1 113.4  61.7  64.9  21.4  27.9  52.0  43.6
load bytes from DRAM      14.2  10.3   7.5   0.0   2.9   0.3   0.2   0.0 115.3   0.0   0.4   0.0   0.0   0.0   0.1   0.0   0.0   0.0  59.1   0.0   0.0  39.2  11.3
store elements             4.7  10.3   6.7   4.8   3.5   5.8   0.0  10.4   1.7   0.3   1.0   0.5  16.9   9.3   0.0   3.1   3.3   3.5  15.5   1.1   3.1   9.5   5.2
store addresses            4.7  10.3   1.8   1.4   1.8   5.8   0.0  10.4   0.4   0.3   0.5   0.1   4.2   6.8   0.0   0.8   1.6   3.0  15.5   1.1   3.1   9.5   3.8
store bytes               18.9  10.3   6.7   4.8   6.0   5.8   0.0  41.5   6.8   1.2   4.2   1.0  33.8  29.1   0.1  12.5  13.2  14.0  62.1   1.1   6.2  38.1  14.4
store bytes to DRAM       18.9  10.3   6.7   0.5   0.7   0.0   0.0   0.0   8.4   0.0   0.1   0.0   0.0   0.0   0.0   0.0   0.0   0.0  50.5   0.0   0.0  46.6   6.5

Table 5: Data hierarchy statistics. The counts are scaled to reflect averages per 100 compute-ops executed in each benchmark, and the average (avg) column gives equal weight to all the benchmarks. Compute-op sources are broken down as coming from chain registers, the register file, or immediates; and compute-op and writeback-op destinations are broken down as targeting chain registers or the register file. The ext. cluster transports row reflects the number of results sent to external clusters. The load elements row reflects the number of elements accessed by either VP loads or vector-loads, while the load addresses row reflects the number of cache accesses. The load bytes row reflects the total number of bytes for the VP loads and vector-loads, while the load bytes from DRAM row reflects the DRAM bandwidth used to retrieve this data. The breakdown for stores corresponds to the breakdown for loads.

4.4 Locality and Efficiency

The strength of the SCALE VT architecture is its ability to capture a wide variety of parallelism in applications while using simple microarchitectural mechanisms that exploit locality in both control and data hierarchies.

A VT machine amortizes control overhead by exploiting the locality exposed by AIBs and vector-fetch commands, and by factoring out common control code to run on the control processor. A vector-fetch broadcasts an AIB address to all lanes, and each lane performs a single tag-check to determine if the AIB is cached. On a hit, an execute directive is sent to the clusters, which then retrieve the instructions within the AIB using a short (5-bit) index into the small AIB cache. The cost of each instruction fetch is on par with a register file read. On an AIB miss, a vector-fetch will broadcast AIBs to refill all lanes simultaneously. The vector-fetch ensures an AIB will be reused by each VP in a lane before any eviction is possible. When an AIB contains only a single instruction on a cluster, a vector-fetch will keep the ALU control lines fixed while each VP executes its operation, further reducing control energy.

As an example of amortizing control overhead, rgbyiq runs on SCALE with a vector-length of 120 and vector-fetches an AIB with 29 VP instructions. Thus, each vector-fetch executes 3,480 instructions on the VTU, or 870 instructions per tag-check in each lane. This is an extreme example, but vector-fetches commonly execute 10s–100s of instructions per tag-check, even for non-vectorizable loops such as adpcm (Table 4).
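Spelling out the arithmetic behind that example (using the vector length and AIB size quoted above, on the default four-lane configuration):

\[
  120\ \text{VPs} \times 29\ \frac{\text{VP instructions}}{\text{AIB}}
  = 3480\ \text{VTU instructions per vector-fetch},
  \qquad
  \frac{3480}{4\ \text{lanes}} = 870\ \text{instructions per AIB tag-check in each lane}.
\]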
AIBs also help in the data hierarchy by allowing the use of chain registers, which reduces register file energy, and the sharing of temporary registers, which reduces the register file size needed for a large number of VPs. Table 5 shows that chain registers comprise around 32% of all register sources and 44% of all register destinations. Table 3 shows that across all benchmarks, VP configurations use an average of 8.5 shared and 6.2 private registers, with an average maximum vector length above 64 (16 VPs per lane). The significant variability in register requirements for different kernels stresses the importance of allowing software to configure VPs with just enough of each register type.

Vector-memory commands enforce spatial locality by moving data between memory and the VP registers in groups. This improves performance and saves memory system energy by avoiding the additional arbitration, tag-checks, and bank conflicts that would occur if each VP requested elements individually. Table 5 shows the reduction in memory addresses from vector-memory commands. The maximum improvement is a factor of four, when each vector cache access loads or stores one element per lane. The VT architecture can exploit memory data-parallelism even in loops with non-data-parallel compute. For example, the fbital, text, and adpcm enc benchmarks use vector-memory commands to access data for vector-fetched AIBs with cross-VP dependencies.

Table 5 shows that the SCALE data cache is effective at reducing DRAM bandwidth for most of the benchmarks. Two exceptions are the pktflow and li benchmarks, for which the DRAM bytes transferred exceed the total bytes accessed. The current design always transfers 32-byte lines on misses, but support for non-allocating loads and stores could help reduce the bandwidth for these benchmarks.

Clustering in SCALE is area and energy efficient, and cluster decoupling improves performance. The clusters each contain only a subset of all possible functional units and a small register file with few ports, reducing size and wiring energy. Each cluster executes compute-ops and inter-cluster transport operations in order, requiring only simple interlock logic with no inter-thread arbitration or dynamic inter-cluster bypass detection. Independent control on each cluster enables decoupled cluster execution to hide large inter-cluster or memory latencies. This provides a very cheap form of SMT where each cluster can be executing code for different VPs on the same cycle (Figure 12).

5. Related Work

The VT architecture draws from earlier vector architectures [9], and like vector microprocessors [14, 6, 3] the SCALE VT implementation provides high throughput at low complexity. Similar to CODE [5], SCALE uses decoupled clusters to simplify chaining control and to reduce the cost of a large vector register file supporting many functional units. However, whereas CODE uses register renaming to hide clusters from software, SCALE reduces hardware complexity by exposing clustering and statically partitioning inter-cluster transport and writeback operations.

The Imagine [8] stream processor is similar to vector machines, with the main enhancement being the addition of stream load and store instructions that pack and unpack arrays of multi-field records stored in DRAM into multiple vector registers, one per field. In comparison, SCALE uses a conventional cache to enable unit-stride transfers from DRAM, and provides segment vector-memory commands to transfer arrays of multi-field records between the cache and VP registers. Like SCALE, Imagine improves register file locality compared with traditional vector machines by executing all operations for one loop iteration before moving to the next. However, Imagine instructions use a low-level VLIW ISA that exposes machine details such as the number of physical registers and lanes, whereas SCALE provides a higher-level abstraction based on VPs and AIBs.

VT enhances the traditional vector model to support loops with cross-iteration dependencies and arbitrary internal control flow. Chiueh's multi-threaded vectorization [1] extends a vector machine to handle loop-carried dependencies, but is limited to a single lane and requires the compiler to have detailed knowledge of all functional unit latencies. Jesshope's micro-threading [2] uses a vector-fetch to launch micro-threads which each execute one loop iteration, but whose execution is dynamically scheduled on a per-instruction basis. In contrast to VT's low-overhead direct cross-VP data transfers, cross-iteration synchronization is done using full/empty bits on shared global registers. Like VT, Multiscalar [12] statically determines loop-carried register dependencies and uses a ring to pass cross-iteration values. But Multiscalar uses speculative execution with dynamic checks for memory dependencies, while VT dispatches multiple non-speculative iterations simultaneously. Multiscalar can execute a wider range of loops in parallel, but VT can execute many common parallel loop types with much simpler logic while using vector-memory operations.

Several other projects are developing processors capable of exploiting multiple forms of application parallelism. The Raw [13] project connects a tiled array of simple processors. In contrast to SCALE's direct inter-cluster data transfers and cluster decoupling, inter-tile communication on Raw is controlled by programmed switch processors and must be statically scheduled to tolerate latencies. The Smart Memories [7] project has developed an architecture with configurable processing tiles which support different types of parallelism, but it has different instruction sets for each type and requires a reconfiguration step to switch modes. The TRIPS processor [10] similarly must explicitly morph between instruction, thread, and data parallelism modes. These mode switches limit the ability to exploit multiple forms of parallelism at a fine grain, in contrast to SCALE, which seamlessly combines vector and threaded execution while also exploiting local instruction-level parallelism.
6. Conclusion

The vector-thread architectural paradigm allows software to more efficiently encode the parallelism and locality present in many applications, while the structure provided in the hardware-software interface enables high-performance implementations that are efficient in area and power. The VT architecture unifies support for all types of parallelism, and this flexibility enables new ways of parallelizing codes, for example, by allowing vector-memory operations to feed directly into threaded code. The SCALE prototype demonstrates that the VT paradigm is well-suited to embedded applications, allowing a single relatively small design to provide competitive performance across a range of application domains. Although this paper has focused on applying VT to the embedded domain, we anticipate that the vector-thread model will be widely applicable in other domains, including scientific computing, high-performance graphics processing, and machine learning.

7. Acknowledgments

This work was funded in part by DARPA PAC/C award F30602-00-2-0562, NSF CAREER award CCR-0093354, an NSF graduate fellowship, donations from Infineon Corporation, and an equipment donation from Intel Corporation.
8. References

[1] T.-C. Chiueh. Multi-threaded vectorization. In ISCA-18, May 1991.
[2] C. R. Jesshope. Implementing an efficient vector instruction set in a chip multi-processor using micro-threaded pipelines. Australian Computer Science Communications, 23(4):80–88, 2001.
[3] K. Kitagawa, S. Tagaya, Y. Hagihara, and Y. Kanoh. A hardware overview of SX-6 and SX-7 supercomputer. NEC Research & Development Journal, 44(1):2–7, Jan 2003.
[4] C. Kozyrakis. Scalable vector media-processors for embedded systems. PhD thesis, University of California at Berkeley, May 2002.
[5] C. Kozyrakis and D. Patterson. Overcoming the limitations of conventional vector processors. In ISCA-30, June 2003.
[6] C. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanović, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, and K. Yelick. Scalable Processors in the Billion-Transistor Era: IRAM. IEEE Computer, 30(9):75–78, Sept 1997.
[7] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, and M. Horowitz. Smart Memories: A modular reconfigurable architecture. In ISCA-27, pages 161–171, June 2000.
[8] S. Rixner, W. Dally, U. Kapasi, B. Khailany, A. Lopez-Lagunas, P. Mattson, and J. Owens. A bandwidth-efficient architecture for media processing. In MICRO-31, Nov 1998.
[9] R. M. Russell. The CRAY-1 computer system. Communications of the ACM, 21(1):63–72, Jan 1978.
[10] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. Moore. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In ISCA-30, June 2003.
[11] J. E. Smith. Dynamic instruction scheduling and the Astronautics ZS-1. IEEE Computer, 22(7):21–35, July 1989.
[12] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar processors. In ISCA-22, pages 414–425, June 1995.
[13] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring it all to software: Raw machines. IEEE Computer, 30(9):86–93, Sept 1997.
[14] J. Wawrzynek, K. Asanović, B. Kingsbury, J. Beck, D. Johnson, and N. Morgan. Spert-II: A vector microprocessor system. IEEE Computer, 29(3):79–86, Mar 1996.
[15] M. Zhang and K. Asanović. Highly-associative caches for low-power processors. In Kool Chips Workshop, MICRO-33, Dec 2000.
