
SINGLE CHIP DSP ARRAY PROCESSOR: 100 MILLION + TRANSISTORS WITH MULTITHREADING APPROACH

Radovan Sernec (1), Matej Zajc (2), Jurij F. Tasič (2)
(1) BIA Ltd., Teslova 30, 1000 Ljubljana, Slovenia, [email protected]
(2) Fakulteta za elektrotehniko, Tržaška, Ljubljana, Slovenia, [email protected]

ABSTRACT

We propose an efficient programmable parallel architecture for DSP and matrix algebra applications that can exploit parallelism at the array (topology) level, via systolic/SIMD array processing, and at the instruction level, via a multiple-issue control processor capable of multithreading. Our premise is »One array – one chip«: integration of systolic/SIMD processing on the same processor array, together with the required data storage. Multithreading on systolic/SIMD arrays is analysed on examples, which show that substantial speedups are possible (100% - 800%) when up to four threads are interleaved on a cycle-by-cycle basis. We are targeting processor element (PE) granularities in the word range and provide support for floating-point operations. Furthermore, the array integrates data memory at two levels of hierarchy: local per PE (SIMD) and global for the whole processing array (systolic). The complexity of such a system is explored in detail and it is shown that a 32 PE array can be implemented on a 120 million transistor chip.

1. INTRODUCTION

Systolic arrays are efficient parallel architectures used in digital signal processing, for solving linear algebra algorithms and for other problems. They exploit regularity in data flows and in the processor interconnection network topology, local processor cell communication, synchronous operation and a single instruction stream applied to many data elements (instruction systolic arrays are excluded from this definition) [1]. To properly synchronise data flows within a systolic array, unit delays driven by a global clock cycle are inserted on the PE communication data paths. The net throughput of the array can be low compared to the processor cell cycle time and is inherently limited by data availability constraints.

Multithreading is a technique usually used to mask long memory or interprocessor communication latencies and also to prevent data hazards within pipelines [2]. These problems have become more severe recently, since memory cycle times are not decreasing at the same pace as processor cycle times. The idea is straightforward: when any kind of latency or interrupt occurs, switch to an independent thread of execution and switch back to the previous instruction stream when the relevant data becomes available. In this paper we apply multithreading principles to systolic/SIMD arrays in order to increase their net throughput, and we explore the architectural details of such a processor ensemble. The target applications and algorithms that can be run on the proposed architecture come from two domains: DSP and matrix algebra. The first includes a set of systolised convolution, FIR, IIR, DFT, DCT and similar algorithms in 1D and 2D variants; the second covers vector and matrix algebraic operations, linear equation solving and a modified Faddeev algorithm (FA) [3], which itself offers a multitude of combinations of matrix operations. Additionally, every algorithm that can be systolised to a 1D or 2D processing array can be executed on this PE array. The paper is organised as follows: first a case study of systolic 1D convolution and its transformation to the multithreaded case is presented, followed by an overview of the processor architecture and a possible implementation combining the processor array with data memories on the same chip.

2. MULTITHREADING IN SYSTOLIC/SIMD ARRAYS

Let us analyse why systolic arrays cannot be fully pipelined at the processor cycle level and thus achieve a throughput of 1 result/processor clock cycle. By systolic cycle we mean the global clock cycle used to move data elements through the systolic array; all interprocessor element communication registers are synchronised to this clock. The throughput can be limited for four different reasons:
1. Lower than 100 % efficiency of a given systolic algorithm, when input data are spaced with dummy values in order to satisfy proper data synchronisation within the array. If a dummy element follows each input data element, efficiency drops to 50 %. The 1D convolution algorithm studied below is such a case.
2. Read-after-write hazards within the processing cell.
3. Long latency of operations; the notation of systolic algorithms assumes that operations inside each processing cell take zero time. To achieve proper synchronisation, unit delays are inserted. Input of data values is bounded by the systolic cycle, which is equal to multiple processor clock cycles.
4. Slow memory cycle times compared to processor cycle times; many DSP systolic algorithms are very simple and thus exhibit short operation latencies, but a new data value cannot be fetched due to memory cycle constraints, or is not available from the cell above because it has not been calculated yet.
We can see that the common factor limiting throughput is the availability of data.

The obvious solution is thus the introduction of interleaved independent data sets into the systolic array. Let us take a look at a simple case of systolic 1D convolution, whose processing cell is shown in Figure 1.

Figure 1: Systolic cell for 1D convolution used in the examples below: inputs on channels CH1 and CH2 are latched in r1 and r2, the multiplier forms r4 = r1 × C (the constant C is held in r3), the adder forms r5 = r2 + r4, and the results leave on channels CH3 and CH5. Note the register names included on the data paths.

Execution of the algorithms is presented with Gantt charts. Cycles are processor cycles within each processing element. Note that this particular algorithm has 50 % efficiency, meaning that a dummy value follows each data element.

Figure 2 shows the assembly-coded algorithm running in each processing cell. All instructions are of the form: Mnemonic DST, SRC1, SRC2. The instructions rc2f (read two floating-point values from channels) and wc2f (write two floating-point values to channels) take care of input and output, respectively. The convolution constant is stored in r3. We see that the systolic cycle cannot be shorter than 8 processor cycles, assuming the latencies presented. Dummy data can thus be input on cycle 9 (50 % efficiency) and the next data value on cycle 17. Assume for the moment that our processor works at 100 MHz; the throughput of the whole systolic array is then only 6.25 million data values/s. We next apply interleaving of independent data values - multithreading - to this problem.

Figure 2: Gantt chart for 1D convolution on a linear systolic array (instruction sequence rc2f r1,r2; fmul r4,r1,r3; fadd r5,r2,r4; wc2f r1,r5); the length of the systolic cycle is 8 clock cycles, and the second iteration, i.e. the next data input (although a dummy one), can start on clock cycle 9; note the inherent read-after-write hazard between the multiplication and addition instructions.
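These cycle counts can be reproduced with a short, purely illustrative C sketch; it assumes the instruction latencies implied by the Gantt chart of Figure 2 (rc2f 1 cycle, fmul 3, fadd 3, wc2f 1) and the 100 MHz clock used above.

    #include <stdio.h>

    int main(void)
    {
        const int lat_rc2f = 1, lat_fmul = 3, lat_fadd = 3, lat_wc2f = 1;
        const double f_clk = 100e6;                /* assumed 100 MHz clock */

        /* RAW dependences rc2f -> fmul -> fadd -> wc2f serialise the cell,
           so the systolic cycle is the sum of the four latencies.          */
        int systolic_cycle = lat_rc2f + lat_fmul + lat_fadd + lat_wc2f; /* 8 */

        /* 50 % efficiency: a dummy element follows every real one, so a
           new data value is accepted only every second systolic cycle.    */
        int cycles_per_value = 2 * systolic_cycle;                     /* 16 */

        printf("systolic cycle: %d cycles\n", systolic_cycle);
        printf("throughput    : %.2f Mvalues/s\n",
               f_clk / cycles_per_value / 1e6);                      /* 6.25 */
        return 0;
    }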
Each thread can have its own register file, or alternatively a larger common one can be used, with appended tag bits differentiating the threads within the single register file. We assume that four independent data threads are available, which can be freely interleaved to achieve the minimal initiation interval. Observe in Figure 3 how instructions are issued every cycle, unlike in the single-threaded case, where instruction issue was dictated by the RAW hazards. Note that the processor still has the same architecture and can issue only one instruction per cycle. Instruction groups of four, which belong to the same systolic cycle, take as operands data belonging to different independent threads. The fadd operation can thus start immediately after fmul, since their operands come from independent data sets. Throughput is greatly enhanced: four data values are processed within 16 processor cycles, whereas in the classic case the same number of cycles is needed for a single data value. The lower inherent algorithm efficiency is eliminated completely, since the dummy values are treated as belonging to a different data set. For the same implementation assumptions we get a net throughput of 4 data values per 16 cycles at 100 MHz, i.e. 25 million data values/s, which is 300% higher than in the previous case. There is another point worth mentioning: pipelining of functional units. This feature is not desirable in classic systolic arrays, since it lengthens the latency and thus the systolic cycle (Figure 2). Multithreading, on the other hand, favours pipelined functional units, which shorten the processor clock cycle and increase the net throughput.

Figure 3: Gantt chart of four-threaded 1D convolution on a linear systolic array; the letters x, y, z, u denote data inputs from different threads; the systolic cycle is four clock cycles long; note that there are no data dependencies among successive arithmetic operations, since the data come from independent threads.

Contemporary RISC processors are able to issue and execute several instructions simultaneously. Can multithreaded systolic arrays benefit from multi-instruction issue? In order to study these effects our implementation must allow the issue of more than one instruction every processor cycle. We decided to set a limit of two instructions issued every cycle; the number of functional units stays the same. Let us take a look at a different arrangement of multithreaded execution of the same algorithm in Figure 4. Instructions bound for different functional units are issued simultaneously on cycles 2, 3, 4, 5 and 8. We also assume that each functional unit has its own port to the common register file. On cycle 7, for example, the adder and the multiplier both finish execution, and if only one result bus is available the store of one of the results must be stalled for a cycle. Alternatively, each thread can be provided with its own register file, as already mentioned. This kind of multithreading, coupled with multiple instruction issue from different threads, is also called simultaneous multithreading. On each cycle, two instructions from different threads are issued simultaneously, although there are as many as four instructions in different execution stages at the same time (cycles 4, 5, 6, 7, 8). What is the net performance gain? Four data values are processed every 11 cycles, which gives a net throughput of 36.36 million data values/s, a 481 % rise compared to the classic case and a 45 % rise compared to the single-issue multithreaded case. Note that multi-instruction issue does not gain anything in the classic systolic case, due to the inherent RAW hazards among the instructions (functional units).

Figure 4: Gantt chart for an alternative four-threaded 1D convolution on a linear systolic array; the letters x, y, z, u denote data inputs from different threads; the average systolic cycle is 1.8 clock cycles (initiations 1,1,1,1,5); note also the multi-issue operation from cycle two onwards. This is the case of simultaneous multithreading [4].
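As a cross-check of the three throughput figures, the following illustrative C fragment derives them from the cycle counts stated above (16 cycles per value for the classic array, 16 cycles per four values with four single-issue threads, 11 cycles per four values with two-issue simultaneous multithreading), again assuming a 100 MHz clock.

    #include <stdio.h>

    int main(void)
    {
        const double f_clk = 100e6;               /* assumed 100 MHz clock    */

        double classic = f_clk * 1.0 / 16.0;      /* 1 value  / 16 cycles     */
        double mt1     = f_clk * 4.0 / 16.0;      /* 4 threads, single issue  */
        double smt2    = f_clk * 4.0 / 11.0;      /* 4 threads, two-issue SMT */

        printf("classic           : %6.2f Mvalues/s\n", classic / 1e6);
        printf("4 threads, 1-issue: %6.2f Mvalues/s (+%.0f %%)\n",
               mt1 / 1e6, (mt1 / classic - 1.0) * 100.0);   /* +300 %         */
        printf("4 threads, 2-issue: %6.2f Mvalues/s (+%.1f %% / +%.1f %%)\n",
               smt2 / 1e6,
               (smt2 / classic - 1.0) * 100.0,   /* about 481 % vs classic    */
               (smt2 / mt1 - 1.0) * 100.0);      /* about 45 % vs single issue */
        return 0;
    }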

Factors not considered here are proper instruction scheduling and data arrangement. In systolic arrays the data flow must be precisely synchronised, as required by the algorithm. The instruction issuing mechanism must therefore never schedule an instruction for which no data values are currently available. It must be deterministic, in the sense that it guarantees that instructions from different threads will always be scheduled and issued in the same order. This strictly applies to the multi-issue multithreading cases. The data flows, on the other hand, must be 'prepared' in advance in the same manner, i.e. data values from different threads are positioned contiguously, in accordance with the instruction issuing.
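One way to picture such a deterministic issue mechanism is a fixed round-robin schedule known at compile time; the fragment below is only an illustrative sketch (the function and its naming are not part of the proposed design) that always emits the same thread/instruction interleaving, so the corresponding data can be laid out contiguously in exactly the order in which they will be consumed.

    #include <stdio.h>

    #define N_THREADS 4
    static const char *ins[] = { "rc2f", "fmul", "fadd", "wc2f" };

    /* Deterministic round-robin issue: every call produces the same
       thread/instruction order, so no run-time dependency resolution
       is needed and the data streams can be pre-arranged to match it. */
    static void issue_systolic_cycle(int first_thread)
    {
        for (int step = 0; step < 4; ++step) {
            int t = (first_thread + step) % N_THREADS;
            printf("issue %s from thread %d\n", ins[step], t);
        }
    }

    int main(void)
    {
        for (int cycle = 0; cycle < 2; ++cycle)
            issue_systolic_cycle(cycle % N_THREADS);
        return 0;
    }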

3. PROCESSOR ARRAY ARCHITECTURE: VeMUS2DAP

The VeMUS2DAP array processor consists of two parts: a multi-issue controller and a mesh-connected PE array. The multi-issue controller steps through the program, executes all scalar operations and branches, and issues decoded array instructions (scalar or vector) to the PEs. Employing a multi-issue controller solves two problems. First, most matrix algebra systolic algorithms require the execution of two different programs on the same PE array; the multi-issue controller is capable of issuing instructions from different instruction streams and of directing control to two different parts of the PE array, as required by the systolic algorithm. Second, the throughput of the PE array can be enhanced by multithreading several independent data streams of the same algorithm on one array. It is worth mentioning that the same multi-issue controller is also used to run different algorithms (threads) simultaneously; they then compete for the resources within the scalar and PE portions of the system. Figure 5 shows the instruction slots for the two cases.

Figure 5: Multi-instruction issue slot, either issuing two instructions from two different threads for different portions of the PE array (control of PE1,1-PE4,4 and PE1,5-PE4,8), or interleaving four threads within the issue slot for the whole PE array.

Inter-PE communication is done via four bi-directional 32-bit buses, which are directly tied to the four neighbouring PEs and to four additional r/w register file ports. The local memory is connected to the register file through a 128-bit wide bus via the same four r/w ports. Each of the register files thus has eight ports and 16 32-bit registers. Figure 6 shows the PE. More information on the internals of the VeMUS2DAP can be found in [5].

Figure 6: Internal PE structure (banked local memory with a 128-bit interface, inter-PE connect, 16 × 32-bit integer and floating-point register files and a common result bus); note the replicated register files, one set for each thread, which can be run simultaneously on the PE array, and the combined integer & floating-point multiply/add fused unit (MAF).
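A rough data-structure view of one PE, as described above, is sketched below; the field names and the exact grouping are illustrative assumptions, while the sizes (four thread contexts, 16 32-bit integer and 16 32-bit floating-point registers per context, four 32-bit neighbour channels, a 128-bit local-memory port) follow the text and Figure 6. The maf() helper performs the per-cell operation of the 1D convolution example.

    #include <stdint.h>
    #include <stdio.h>

    #define N_THREADS 4                 /* hardware thread contexts per PE      */
    #define N_REGS   16                 /* registers per file, 32 bits each     */

    struct reg_file {                   /* one I RF + FP RF pair per thread     */
        uint32_t i_reg[N_REGS];
        float    fp_reg[N_REGS];
    };

    struct pe {
        struct reg_file rf[N_THREADS];  /* replicated per-thread register files */
        uint32_t ch[4];                 /* four 32-bit neighbour channels       */
        uint32_t lm_port[4];            /* 128-bit wide local-memory interface  */
    };

    /* Fused multiply/add, r5 = r2 + r1*r3: the 1D convolution cell operation. */
    static inline float maf(float r1, float r2, float r3)
    {
        return r2 + r1 * r3;
    }

    int main(void)
    {
        struct pe pe0 = {0};
        pe0.rf[0].fp_reg[3] = 2.0f;                    /* convolution constant  */
        printf("r5 = %g\n", maf(1.5f, 0.5f, pe0.rf[0].fp_reg[3]));    /* 3.5    */
        return 0;
    }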


4. SINGLE-CHIP IMPLEMENTATION

Recent studies of semiconductor integration trends have shown that by the year 2003 more than 100 million transistors will be available for single-chip implementations, and by 2010 even around a billion [6]. This has initiated various design projects on how to employ such huge resources most effectively. There seems to be a consensus among processor designers that a major part of future Si dies will be devoted to different kinds of memories, but the proposed processor granularities vary widely. Single-processor implementations are not interesting to us, but there are similar multiprocessor projects under way, such as IRAM, chip multiprocessors, computational RAM, PPRAM, etc. [7]. There are technology trade-offs that have to be faced with these approaches, namely the cohesion of rather slow, but highly dense, DRAM with very fast logic whose process offers a 3-5 times lower integration density. It can be expected that such processor designs will run at lower speeds, but the on-chip parallelism will provide performance gains well beyond single-processor implementations.

Our project proposes a single-chip integration of the whole systolic/SIMD array processor with large amounts of memory, which either acts as a global data pool (systolic mode) or is locally available to each PE (SIMD mode). The main obstacle to 2D integration of an expandable systolic or SIMD array onto a single chip is its high pin count demand. Consider a 2D 8-by-8 PE array with 32-bit inter-PE paths and the required expandability through interconnection of multiple such chips to form larger PE arrays. The pin count in this case reaches 4 sides × 8 PE/side × 32 pins/port = 1024 pins. Note that we have disregarded the additional connection requirements to a large memory pool, which is necessary for systolic arrays. At least 200 additional pins must be added for the controller connection to program memory and for the various handshake signals used in communication with the host. Power connections to such a chip would add a further 30-40% to this number, for a grand total of approximately 1700 pins. This is unachievable at present and in the near future, or would at best present a very costly solution.

Our goal is to design a single-chip array processor system complete with controller, 2D PE array and enough data memory. Figure 7 outlines a possible layout. The PE array is organised as an 8-by-4 mesh. The reason for the non-square array is the requirement of several matrix algorithms, like FA and QR, for a non-square topology in order to run efficiently. FA requires two seamlessly connected arrays: one rectangular n by n and one triangular n by n. Observe the two global memory (GM) arrays (although they reside in the same address space), which are again beneficial for problems larger than the PE array size and facilitate efficient problem partitioning and buffering of intermediate results. Dynamic RAM is used for the LM to reduce its size, and its longer cycle times are largely compensated by the 128-bit wide access paths and interleaving. The performance penalty is not likely to be high due to the vector nature of the data sets. The GM has a capacity of 84 Mb and is also dynamic, partitioned into two large arrays, which are subdivided into four and eight 128-bit wide banks, respectively. The GM capacity is large enough to accept 64-bit data in the form of four 512×512 input matrices and one 512×512 output matrix, as required by FA. For fast refill of data to the GM from larger off-chip memories there is a 16-bit wide RAMBUS interface cycling at 500 MHz. The whole GM can theoretically be refilled in approximately 0.01 s with a throughput of 8*10^9 b/s, versus the 8.2*10^11 b/s needed by the PEs, which run at 200 MHz.

Figure 7: Layout of the 32 PE array (8-by-4 PE mesh, four-issue VLIW/RISC controller, PE ↔ GM routing) with the two global memory arrays, having 28 Mb and 56 Mb capacity, respectively; each bank has 7 Mb with a 128-bit interface. The controller is augmented with 32 KB split I and D caches and a large (1 MB) local memory.
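The pin-count and memory-bandwidth estimates can be checked with a few lines of arithmetic; the sketch below is illustrative only and simply re-evaluates the assumptions stated above (an 8-by-8 expandable array with 32-bit ports, roughly 200 extra controller pins, a 30-40 % power-pin overhead, an 84 Mb GM, a 16-bit RAMBUS channel at 500 MHz and 32 PEs each fetching 128 bits at 200 MHz).

    #include <stdio.h>

    int main(void)
    {
        /* Pin count of an expandable 8-by-8 PE array chip.                    */
        int signal_pins = 4 * 8 * 32;               /* 4 sides x 8 PE x 32 bits */
        int ctrl_pins   = 200;                      /* program memory, handshake */
        printf("pins: %d signal, ~%.0f-%.0f total\n", signal_pins,
               (signal_pins + ctrl_pins) * 1.30,    /* +30 % power pins */
               (signal_pins + ctrl_pins) * 1.40);   /* +40 % power pins */

        /* Global memory refill over the 16-bit, 500 MHz RAMBUS channel.       */
        double gm_bits   = 84e6;                    /* 84 Mb GM capacity        */
        double rdram_bps = 16.0 * 500e6;            /* 8e9  b/s supply          */
        double pe_bps    = 32.0 * 128.0 * 200e6;    /* 8.2e11 b/s demand        */
        printf("GM refill: %.4f s (supply %.1e b/s, demand %.1e b/s)\n",
               gm_bits / rdram_bps, rdram_bps, pe_bps);
        return 0;
    }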

Table 1 summarises the transistor requirements for such a system. The chip contains 124 million transistors, of which less than 10 million are used for logic and the rest for the memory arrays.

ITEM & COMPLEXITY                     # of transistors   # of items   Sum
Controller core                       5*10^5             1            5*10^5
Controller I cache                    1.8*10^6           1            1.8*10^6
Controller D cache                    1.8*10^6           1            1.8*10^6
Controller local I&D memory (1 MB)    8.5*10^6           1            8.5*10^6
PE with 4 register files              2.5*10^5           32           8*10^6
PE MAF unit logic                     6*10^5             32           19.2*10^6
PE LM + GM + mux logic                84*10^6            1            84*10^6
I/O interfaces                        2*10^5             1            2*10^5
Grand total                                                           124 million transistors

Table 1: Complexity of the proposed system; note that 83% of the transistors are taken up by memories.

5. CONCLUSION

The proposed integration of a large PE array with sufficient amounts of data memory is a possible answer to the question of how to efficiently employ the millions of transistors that will become available in a couple of years. Furthermore, its design is much simpler than that of contemporary wide-issue superscalar RISC processors, due to the replicated data paths and the simple controller architecture. Although multi-issue logic is required within the controller, only independent threads of control are issued in one slot, which greatly reduces the dependency-resolution logic. DSP algorithms and tasks require large data vectors or matrices, which can be manipulated simply with the FA-amenable PE array. Expansion of the PE array is possible if the bus of issued (but not yet decoded) instructions is brought out to the I/O pins. The instruction bus can then be connected to other chips working in a synchronised lock-step fashion, although the operating frequency would be lower in this case [8].

6. REFERENCES

[1] S. Y. Kung, VLSI Array Processors, Prentice Hall, 1988.
[2] Multithreaded Computer Architecture: A Summary of the State of the Art, ed. R. Iannucci, Kluwer Academic Publishers, 1994.
[3] J. G. Nash, S. Hansen, "Modified Faddeeva Algorithm for Concurrent Execution of Linear Algebraic Operations", IEEE Trans. on Computers, 37 (2), pp. 129-136, 1988.
[4] J. L. Lo, "Compilation issues for simultaneous multithreaded processor", 1st SUIF Compiler Workshop, Jan. 1996, pp. 146-147.
[5] R. Sernec, M. Zajc, J. Tasič, "Design trade-offs of a parallel DSP architecture: Combining instruction and data level parallelism", sent for publication to WDTA'98, Dubrovnik, Croatia.
[6] The National Technology Roadmap for Semiconductors, SIA, San Jose, USA, 1997.
[7] IEEE Computer, The Future of …, Sept. 1997.
[8] http://www.bops.com/