
SINGLE CHIP DSP ARRAY PROCESSOR: 100 MILLION + TRANSISTORS WITH MULTITHREADING APPROACH

Radovan Sernec (1), Matej Zajc (2), Jurij F. Tasič (2)
(1) BIA Ltd., Teslova 30, 1000 Ljubljana, Slovenia, [email protected]
(2) Fakulteta za elektrotehniko, Tržaška, Ljubljana, Slovenia, [email protected]

ABSTRACT

We propose an efficient programmable parallel architecture for DSP and matrix algebra applications that can exploit parallelism at the array (topology) level, via systolic/SIMD array processing, and at the instruction level, via a multiple-issue control processor capable of multithreading. Our premise is »One array – one chip«: integration of systolic/SIMD processing on the same processor array, together with the required data storage. Multithreading on systolic/SIMD arrays is analysed on examples, which show that substantial speedups are possible (100% - 800%) when up to four threads are interleaved on a cycle-by-cycle basis. We are targeting processor element (PE) granularities in the word range and provide support for floating-point operations. Furthermore, the array integrates data memory at two levels of hierarchy: local per PE (SIMD) and global for the whole processing array (systolic). The complexity of such a system is explored in detail and it is shown that a 32 PE array can be implemented on a 120 million transistor chip.

1. INTRODUCTION

Systolic arrays are efficient parallel architectures used in digital signal processing, for solving linear algebra algorithms and for other problems. They exploit regularity in data flows and in the processor interconnection network topology, local processor cell communication, synchronous operation and a single instruction stream applied to many data elements (instruction systolic arrays are excluded from this definition) [1]. To properly synchronise data flows within a systolic array, unit delays driven by a global clock cycle are inserted on the PE communication data paths. The net throughput of the array can be low compared to the processor cell cycle time and is inherently limited by data availability constraints.

Multithreading is a technique usually used to mask long memory or interprocessor communication latencies and also to prevent data hazards within pipelines [2]. These problems have become more severe recently, since memory cycle times are not decreasing at the same pace as processor cycle times. The idea is straightforward: when any kind of latency or interrupt occurs, switch to an independent thread of execution and switch back to the previous instruction stream when the relevant data becomes available. In this paper we apply multithreading principles to systolic/SIMD arrays in order to increase their net throughput, and we explore the architectural details of such a processor ensemble. The target applications and algorithms that can be run on the proposed architecture come from two domains: DSP and matrix algebra. The first includes a set of systolised convolution, FIR, IIR, DFT, DCT and similar algorithms in 1D and 2D variants; the second covers vector and matrix algebraic operations, linear equation solving and a modified Faddeev algorithm (FA) [3], which itself offers a multitude of combinations of matrix operations. Additionally, every algorithm that can be systolised to a 1D or 2D processing array can be executed on this PE array. The paper is organised as follows: first a case study of systolic 1D convolution and its transformation to the multithreaded case is presented, followed by an overview of the processor architecture and a possible implementation combining the processor array with data memories on the same chip.

2. MULTITHREADING IN SYSTOLIC/SIMD ARRAYS

Let us analyse why systolic arrays cannot be fully pipelined at the processor cycle level and thus achieve a throughput of 1 result/processor clock cycle. By systolic cycle we mean the global clock cycle used to move data elements through the systolic array; all interprocessor element communication registers are synchronised to this clock. The throughput can be limited for four different reasons:
1. Lower than 100 % efficiency of a given systolic algorithm, when input data are spaced with dummy values in order to satisfy proper data synchronisation within the array. If a dummy element follows each input data element, efficiency drops to 50 %. The 1D convolution algorithm studied below is such a case.
2. Read-after-write hazards within the processing cell.
3. Long latency of operations; the notation of systolic algorithms assumes that operations inside each processing cell take zero time. To achieve proper synchronisation, unit delays are inserted. Input of data values is bounded by the systolic cycle, which is equal to multiple processor clock cycles.
4. Slow memory cycle times compared to processor cycle times; many DSP systolic algorithms are very simple and thus exhibit short operation latencies, but a new data value cannot be fetched due to memory cycle constraints, or is not available from the cell above because it has not been calculated yet.
We can see that the common factor limiting throughput is the availability of data.

The obvious solution is thus the introduction of interleaved independent data sets into the systolic array. Let us take a look at a simple case of systolic 1D convolution, whose processing cell is shown in Figure 1.

Figure 1: Systolic cell for 1D convolution used in the examples below: inputs on channels CH1 and CH2 are latched in r1 and r2, the multiplier forms r4 = r1 × C (the constant C is held in r3), the adder forms r5 = r2 + r4, and the results leave on channels CH3 and CH5. Note the register names included on the data paths.

Execution of the algorithms is presented with Gantt charts. Cycles are processor cycles within each processing element. Note that this particular algorithm has 50 % efficiency, meaning that a dummy value follows each data element.

Figure 2 shows the assembly-coded algorithm running in each processing cell. All instructions are of the form: Mnemonic DST, SRC1, SRC2. The instructions rc2f (read two floating-point values from channels) and wc2f (write two floating-point values to channels) take care of input and output, respectively. The convolution constant is stored in r3. We see that the systolic cycle cannot be shorter than 8 processor cycles, assuming the latencies presented. Dummy data can thus be input on cycle 9 (50 % efficiency) and the next data value on cycle 17. Assume for the moment that our processor works at 100 MHz; the throughput of the whole systolic array is then only 6.25 million data values/s. We next apply interleaving of independent data values - multithreading - to this problem.

Figure 2: Gantt chart for 1D convolution on a linear systolic array (instruction sequence rc2f r1,r2; fmul r4,r1,r3; fadd r5,r2,r4; wc2f r1,r5); the length of the systolic cycle is 8 clock cycles, and the second iteration, i.e. the next data input (although a dummy one), can start on clock cycle 9; note the inherent read-after-write hazard between the multiplication and addition instructions.
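These cycle counts can be reproduced with a short, purely illustrative C sketch; it assumes the instruction latencies implied by the Gantt chart of Figure 2 (rc2f 1 cycle, fmul 3, fadd 3, wc2f 1) and the 100 MHz clock used above.

    #include <stdio.h>

    int main(void)
    {
        const int lat_rc2f = 1, lat_fmul = 3, lat_fadd = 3, lat_wc2f = 1;
        const double f_clk = 100e6;                /* assumed 100 MHz clock */

        /* RAW dependences rc2f -> fmul -> fadd -> wc2f serialise the cell,
           so the systolic cycle is the sum of the four latencies.          */
        int systolic_cycle = lat_rc2f + lat_fmul + lat_fadd + lat_wc2f; /* 8 */

        /* 50 % efficiency: a dummy element follows every real one, so a
           new data value is accepted only every second systolic cycle.    */
        int cycles_per_value = 2 * systolic_cycle;                     /* 16 */

        printf("systolic cycle: %d cycles\n", systolic_cycle);
        printf("throughput    : %.2f Mvalues/s\n",
               f_clk / cycles_per_value / 1e6);                      /* 6.25 */
        return 0;
    }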
Each thread can have its own register file, or alternatively a larger common one can be used, with appended tag bits differentiating the threads within the single register file. We assume that four independent data threads are available, which can be freely interleaved to achieve the minimal initiation interval. Observe in Figure 3 how instructions are issued every cycle, unlike in the single-threaded case, where instruction issue was dictated by the RAW hazards. Note that the processor still has the same architecture and can issue only one instruction per cycle. Instruction groups of four, which belong to the same systolic cycle, take as operands data belonging to different independent threads. The fadd operation can thus start immediately after fmul, since their operands come from independent data sets. Throughput is greatly enhanced: four data values are processed within 16 processor cycles, whereas in the classic case the same number of cycles is needed for a single data value. The lower inherent algorithm efficiency is eliminated completely, since the dummy values are treated as belonging to a different data set. For the same implementation assumptions we get a net throughput of 4 data values per 16 cycles at 100 MHz, i.e. 25 million data values/s, which is 300% higher than in the previous case. There is another point worth mentioning: pipelining of functional units. This feature is not desirable in classic systolic arrays, since it lengthens the latency and thus the systolic cycle (Figure 2). Multithreading, on the other hand, favours pipelined functional units, which shorten the processor clock cycle and increase the net throughput.

Figure 3: Gantt chart of four-threaded 1D convolution on a linear systolic array; the letters x, y, z, u denote data inputs from different threads; the systolic cycle is four clock cycles long; note that there are no data dependencies among successive arithmetic operations, since the data come from independent threads.

Contemporary RISC processors are able to issue and execute several instructions simultaneously. Can multithreaded systolic arrays benefit from multi-instruction issue? In order to study these effects our implementation must allow the issue of more than one instruction every processor cycle. We decided to set a limit of two instructions issued every cycle; the number of functional units stays the same. Let us take a look at a different arrangement of multithreaded execution of the same algorithm in Figure 4. Instructions bound for different functional units are issued simultaneously on cycles 2, 3, 4, 5 and 8. We also assume that each functional unit has its own port to the common register file. On cycle 7, for example, the adder and the multiplier both finish execution, and if only one result bus is available the store of one of the results must be stalled for a cycle. Alternatively, each thread can be provided with its own register file, as already mentioned. This kind of multithreading, coupled with multiple instruction issue from different threads, is also called simultaneous multithreading. On each cycle, two instructions from different threads are issued simultaneously, although there are as many as four instructions in different execution stages at the same time (cycles 4, 5, 6, 7, 8). What is the net performance gain? Four data values are processed every 11 cycles, which gives a net throughput of 36.36 million data values/s, a 481 % rise compared to the classic case and a 45 % rise compared to the single-issue multithreaded case. Note that multi-instruction issue does not gain anything in the classic systolic case, due to the inherent RAW hazards among the instructions (functional units).

Figure 4: Gantt chart for an alternative four-threaded 1D convolution on a linear systolic array; the letters x, y, z, u denote data inputs from different threads; the average systolic cycle is 1.8 clock cycles (initiations 1,1,1,1,5); note also the multi-issue operation from cycle two onwards. This is the case of simultaneous multithreading [4].
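As a cross-check of the three throughput figures, the following illustrative C fragment derives them from the cycle counts stated above (16 cycles per value for the classic array, 16 cycles per four values with four single-issue threads, 11 cycles per four values with two-issue simultaneous multithreading), again assuming a 100 MHz clock.

    #include <stdio.h>

    int main(void)
    {
        const double f_clk = 100e6;               /* assumed 100 MHz clock    */

        double classic = f_clk * 1.0 / 16.0;      /* 1 value  / 16 cycles     */
        double mt1     = f_clk * 4.0 / 16.0;      /* 4 threads, single issue  */
        double smt2    = f_clk * 4.0 / 11.0;      /* 4 threads, two-issue SMT */

        printf("classic           : %6.2f Mvalues/s\n", classic / 1e6);
        printf("4 threads, 1-issue: %6.2f Mvalues/s (+%.0f %%)\n",
               mt1 / 1e6, (mt1 / classic - 1.0) * 100.0);   /* +300 %         */
        printf("4 threads, 2-issue: %6.2f Mvalues/s (+%.1f %% / +%.1f %%)\n",
               smt2 / 1e6,
               (smt2 / classic - 1.0) * 100.0,   /* about 481 % vs classic    */
               (smt2 / mt1 - 1.0) * 100.0);      /* about 45 % vs single issue */
        return 0;
    }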

Factors not considered here are proper instruction scheduling and data arrangement. In systolic arrays the data flow must be precisely synchronised, as required by the algorithm. The instruction issuing mechanism must therefore never schedule an instruction for which no data values are currently available. It must be deterministic, in the sense that it guarantees that instructions from different threads will always be scheduled and issued in the same order. This strictly applies to the multi-issue multithreading cases. The data flows, on the other hand, must be 'prepared' in advance in the same manner, i.e. data values from different threads are positioned contiguously, in accordance with the instruction issuing.
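One way to picture such a deterministic issue mechanism is a fixed round-robin schedule known at compile time; the fragment below is only an illustrative sketch (the function and its naming are not part of the proposed design) that always emits the same thread/instruction interleaving, so the corresponding data can be laid out contiguously in exactly the order in which they will be consumed.

    #include <stdio.h>

    #define N_THREADS 4
    static const char *ins[] = { "rc2f", "fmul", "fadd", "wc2f" };

    /* Deterministic round-robin issue: every call produces the same
       thread/instruction order, so no run-time dependency resolution
       is needed and the data streams can be pre-arranged to match it. */
    static void issue_systolic_cycle(int first_thread)
    {
        for (int step = 0; step < 4; ++step) {
            int t = (first_thread + step) % N_THREADS;
            printf("issue %s from thread %d\n", ins[step], t);
        }
    }

    int main(void)
    {
        for (int cycle = 0; cycle < 2; ++cycle)
            issue_systolic_cycle(cycle % N_THREADS);
        return 0;
    }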

3. PROCESSOR ARRAY ARCHITECTURE: VeMUS2DAP

The VeMUS2DAP array processor consists of two parts: a multi-issue controller and a mesh-connected PE array. The multi-issue controller steps through the program, executes all scalar operations and branches, and issues decoded array instructions (scalar or vector) to the PEs. Employing a multi-issue controller solves two problems. First, most matrix algebra systolic algorithms require the execution of two different programs on the same PE array; the multi-issue controller is capable of issuing instructions from different instruction streams and of directing control to two different parts of the PE array, as required by the systolic algorithm. Second, the throughput of the PE array can be enhanced by multithreading several independent data streams of the same algorithm on one array. It is worth mentioning that the same multi-issue controller is also used to run different algorithms (threads) simultaneously; they then compete for the resources within the scalar and PE portions of the system. Figure 5 shows the instruction slots for the two cases.

Figure 5: Multi-instruction issue slot, either issuing two instructions from two different threads for different portions of the PE array (control of PE1,1-PE4,4 and PE1,5-PE4,8), or interleaving four threads within the issue slot for the whole PE array.

Inter-PE communication is done via four bi-directional 32-bit buses, which are directly tied to the four neighbouring PEs and to four additional r/w register file ports. The local memory is connected to the register file through a 128-bit wide bus via the same four r/w ports. Each of the register files thus has eight ports and 16 32-bit registers. Figure 6 shows the PE. More information on the internals of the VeMUS2DAP can be found in [5].

Figure 6: Internal PE structure (banked local memory with a 128-bit interface, inter-PE connect, 16 × 32-bit integer and floating-point register files and a common result bus); note the replicated register files, one set for each thread, which can be run simultaneously on the PE array, and the combined integer & floating-point multiply/add fused unit (MAF).
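A rough data-structure view of one PE, as described above, is sketched below; the field names and the exact grouping are illustrative assumptions, while the sizes (four thread contexts, 16 32-bit integer and 16 32-bit floating-point registers per context, four 32-bit neighbour channels, a 128-bit local-memory port) follow the text and Figure 6. The maf() helper performs the per-cell operation of the 1D convolution example.

    #include <stdint.h>
    #include <stdio.h>

    #define N_THREADS 4                 /* hardware thread contexts per PE      */
    #define N_REGS   16                 /* registers per file, 32 bits each     */

    struct reg_file {                   /* one I RF + FP RF pair per thread     */
        uint32_t i_reg[N_REGS];
        float    fp_reg[N_REGS];
    };

    struct pe {
        struct reg_file rf[N_THREADS];  /* replicated per-thread register files */
        uint32_t ch[4];                 /* four 32-bit neighbour channels       */
        uint32_t lm_port[4];            /* 128-bit wide local-memory interface  */
    };

    /* Fused multiply/add, r5 = r2 + r1*r3: the 1D convolution cell operation. */
    static inline float maf(float r1, float r2, float r3)
    {
        return r2 + r1 * r3;
    }

    int main(void)
    {
        struct pe pe0 = {0};
        pe0.rf[0].fp_reg[3] = 2.0f;                    /* convolution constant  */
        printf("r5 = %g\n", maf(1.5f, 0.5f, pe0.rf[0].fp_reg[3]));    /* 3.5    */
        return 0;
    }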


4. SINGLE-CHIP IMPLEMENTATION

Recent studies of semiconductor integration trends have shown that by the year 2003 more than 100 million transistors will be available for single-chip implementations, and by 2010 even around a billion [6]. This has initiated various design projects on how to employ such huge resources most effectively. There seems to be a consensus among processor designers that a major part of future Si dies will be devoted to different kinds of memories, but the proposed processor granularities vary widely. Single-processor implementations are not interesting to us, but there are similar multiprocessor projects under way, such as IRAM, chip multiprocessors, computational RAM, PPRAM, etc. [7]. There are technology trade-offs that have to be faced with these approaches, namely the cohesion of rather slow, but highly dense, DRAM with very fast logic whose process offers a 3-5 times lower integration density. It can be expected that such processor designs will run at lower speeds, but the on-chip parallelism will provide performance gains well beyond single-processor implementations.

Our project proposes a single-chip integration of the whole systolic/SIMD array processor with large amounts of memory, which either acts as a global data pool (systolic mode) or is locally available to each PE (SIMD mode). The main obstacle to 2D integration of an expandable systolic or SIMD array onto a single chip is its high pin count demand. Consider a 2D 8-by-8 PE array with 32-bit inter-PE paths and the required expandability through interconnection of multiple such chips to form larger PE arrays. The pin count in this case reaches 4 sides × 8 PE/side × 32 pins/port = 1024 pins. Note that we have disregarded the additional connection requirements to a large memory pool, which is necessary for systolic arrays. At least 200 additional pins must be added for the controller connection to program memory and for the various handshake signals used in communication with the host. Power connections to such a chip would add a further 30-40% to this number, for a grand total of approximately 1700 pins. This is unachievable at present and in the near future, or would at best present a very costly solution.

Our goal is to design a single-chip array processor system complete with controller, 2D PE array and enough data memory. Figure 7 outlines a possible layout. The PE array is organised as an 8-by-4 mesh. The reason for the non-square array is the requirement of several matrix algorithms, like FA and QR, for a non-square topology in order to run efficiently. FA requires two seamlessly connected arrays: one rectangular n by n and one triangular n by n. Observe the two global memory (GM) arrays (although they reside in the same address space), which are again beneficial for problems larger than the PE array size and facilitate efficient problem partitioning and buffering of intermediate results. Dynamic RAM is used for the LM to reduce its size, and its longer cycle times are largely compensated by the 128-bit wide access paths and interleaving. The performance penalty is not likely to be high due to the vector nature of the data sets. The GM has a capacity of 84 Mb and is also dynamic, partitioned into two large arrays, which are subdivided into four and eight 128-bit wide banks, respectively. The GM capacity is large enough to accept 64-bit data in the form of four 512×512 input matrices and one 512×512 output matrix, as required by FA. For fast refill of data to the GM from larger off-chip memories there is a 16-bit wide RAMBUS interface cycling at 500 MHz. The whole GM can theoretically be refilled in approximately 0.01 s with a throughput of 8*10^9 b/s, versus the 8.2*10^11 b/s needed by the PEs, which run at 200 MHz.

Figure 7: Layout of the 32 PE array (8-by-4 PE mesh, four-issue VLIW/RISC controller, PE ↔ GM routing) with the two global memory arrays, having 28 Mb and 56 Mb capacity, respectively; each bank has 7 Mb with a 128-bit interface. The controller is augmented with 32 KB split I and D caches and a large (1 MB) local memory.
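The pin-count and memory-bandwidth estimates can be checked with a few lines of arithmetic; the sketch below is illustrative only and simply re-evaluates the assumptions stated above (an 8-by-8 expandable array with 32-bit ports, roughly 200 extra controller pins, a 30-40 % power-pin overhead, an 84 Mb GM, a 16-bit RAMBUS channel at 500 MHz and 32 PEs each fetching 128 bits at 200 MHz).

    #include <stdio.h>

    int main(void)
    {
        /* Pin count of an expandable 8-by-8 PE array chip.                    */
        int signal_pins = 4 * 8 * 32;               /* 4 sides x 8 PE x 32 bits */
        int ctrl_pins   = 200;                      /* program memory, handshake */
        printf("pins: %d signal, ~%.0f-%.0f total\n", signal_pins,
               (signal_pins + ctrl_pins) * 1.30,    /* +30 % power pins */
               (signal_pins + ctrl_pins) * 1.40);   /* +40 % power pins */

        /* Global memory refill over the 16-bit, 500 MHz RAMBUS channel.       */
        double gm_bits   = 84e6;                    /* 84 Mb GM capacity        */
        double rdram_bps = 16.0 * 500e6;            /* 8e9  b/s supply          */
        double pe_bps    = 32.0 * 128.0 * 200e6;    /* 8.2e11 b/s demand        */
        printf("GM refill: %.4f s (supply %.1e b/s, demand %.1e b/s)\n",
               gm_bits / rdram_bps, rdram_bps, pe_bps);
        return 0;
    }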

Table 1 summarises the transistor requirements for such a system. The chip contains 124 million transistors, of which less than 10 million are used for logic and the rest for the memory arrays.

ITEM & COMPLEXITY                     # of transistors   # of items   Sum
Controller core                       5*10^5             1            5*10^5
Controller I cache                    1.8*10^6           1            1.8*10^6
Controller D cache                    1.8*10^6           1            1.8*10^6
Controller local I&D memory (1 MB)    8.5*10^6           1            8.5*10^6
PE with 4 register files              2.5*10^5           32           8*10^6
PE MAF unit logic                     6*10^5             32           19.2*10^6
PE LM + GM + mux logic                84*10^6            1            84*10^6
I/O interfaces                        2*10^5             1            2*10^5
Grand total                                                           124 million transistors

Table 1: Complexity of the proposed system; note that 83% of the transistors are taken up by memories.

5. CONCLUSION

The proposed integration of a large PE array with sufficient amounts of data memory is a possible answer to the question of how to efficiently employ the millions of transistors that will become available in a couple of years. Furthermore, its design is much simpler than that of contemporary wide-issue superscalar RISC processors, due to the replicated data paths and the simple controller architecture. Although multi-issue logic is required within the controller, only independent threads of control are issued in one slot, which greatly reduces the dependency-resolution logic. DSP algorithms and tasks require large data vectors or matrices, which can be manipulated simply with the FA-amenable PE array. Expansion of the PE array is possible if the bus of issued (but not yet decoded) instructions is brought out to the I/O pins. The instruction bus can then be connected to other chips working in a synchronised lock-step fashion, although the operating frequency would be lower in this case [8].

6. REFERENCES

[1] S. Y. Kung, VLSI Array Processors, Prentice Hall, 1988.
[2] Multithreaded Computer Architecture: A Summary of the State of the Art, ed. R. Iannucci, Kluwer Academic Publishers, 1994.
[3] J. G. Nash, S. Hansen, "Modified Faddeeva Algorithm for Concurrent Execution of Linear Algebraic Operations", IEEE Trans. on Computers, 37 (2), pp. 129-136, 1988.
[4] J. L. Lo, "Compilation issues for simultaneous multithreaded processor", 1st SUIF Compiler Workshop, Jan. 1996, pp. 146-147.
[5] R. Sernec, M. Zajc, J. Tasič, "Design trade-offs of a parallel DSP architecture: Combining instruction and data level parallelism", sent for publication to WDTA'98, Dubrovnik, Croatia.
[6] The National Technology Roadmap for Semiconductors, SIA, San Jose, USA, 1997.
[7] IEEE Computer, The Future of …, Sept. 1997.
[8] http://www.bops.com/