FOCUS: DSPs

Stream Processors: Programmability with Efficiency
Will this new kid on the block muscle out ASIC and DSP?

WILLIAM J. DALLY, UJVAL J. KAPASI, BRUCEK KHAILANY, JUNG HO AHN, AND ABHISHEK DAS

Many signal processing applications require both efficiency and programmability. Baseband signal processing in 3G cellular base stations, for example, requires hundreds of GOPS (giga, or billions, of operations per second) with a power budget of a few watts, an efficiency of about 100 GOPS/W (GOPS per watt), or 10 pJ/op (picojoules per operation). At the same time, programmability is needed to follow evolving standards, to support multiple air interfaces, and to dynamically provision processing resources over different air interfaces. Digital television, surveillance video processing, automated optical inspection, and mobile cameras, camcorders, and 3G cellular handsets have similar needs.

Conventional signal processing solutions can provide high efficiency or programmability, but are unable to provide both at the same time. In applications that demand efficiency, a hardwired application-specific processor—an ASIC (application-specific integrated circuit) or ASSP (application-specific standard part)—has an efficiency of 50 to 500 GOPS/W, but offers little if any flexibility. At the other extreme, microprocessors and DSPs (digital signal processors) are completely programmable but have efficiencies of less than 10 GOPS/W. DSP arrays and FPGAs (field-programmable gate arrays) offer higher performance than individual DSPs, but have roughly the same efficiency. Moreover, these solutions are difficult to program—requiring parallelization, partitioning, and, for FPGAs, hardware design.

Applications today must choose between efficiency and programmability. Where power budgets are tight, efficiency is the choice, and the signal processing is implemented with an ASIC or ASSP, giving up programmability. With wireless communications systems, for example, this means that only a single air interface can be supported or that a separate ASIC is needed for each air interface, with a static partitioning of resources (ASICs) between interfaces.

Stream processors are signal and image processors that offer both efficiency and programmability. Stream processors have efficiency comparable to ASICs (200 GOPS/W), while being programmable in a high-level language.




EXPOSING PARALLELISM AND LOCALITY

A stream program (sometimes called a synchronous data-flow program) expresses a computation as a signal flow graph with streams of records (the edges) flowing between computation kernels (the nodes). Most signal-processing applications are naturally expressed in this style. For example, figure 1 shows a stream program that performs stereo depth extraction based on the algorithm of Kanade.1 In this application, a stereo pair of images are first filtered and then compared with each other to extract the depth at each pixel of the image. Along the top path, a stream of pixels from the left image is filtered by a Gaussian kernel to reject high-frequency noise, generating a stream of smoothed pixels. This stream is filtered by a Laplacian kernel to highlight edges, generating a stream of edge-enhanced pixels. The right image follows a similar filtering path. The SAD (sum of absolute differences) kernel then compares a 7x7 sub-image about each pixel in the filtered left image with a row of 7x7 sub-images in the filtered right image to find the best match. The position of the best match gives the disparity between the two images at that pixel, from which we can derive the depth of the pixel.

[Figure 1: The stereo depth extraction stream program. Each image is filtered by Gaussian and Laplacian kernels; the SAD kernel combines the two filtered streams into a depth map.]
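To make the signal-flow-graph style concrete, here is the figure 1 pipeline as a minimal plain C++ sketch. The kernels are simplified one-dimensional stand-ins (the real kernels operate on 2D images with 7x7 windows), and all names, types, and the disparity window are illustrative assumptions, not the actual Imagine tool-chain API.

```cpp
// Sketch of the figure 1 signal flow graph. Simplified 1-D stand-in
// kernels; names, types, and the disparity window are assumptions.
#include <cmath>
#include <cstddef>
#include <vector>

using Stream = std::vector<float>;  // stand-in for a hardware-managed stream

// Gaussian kernel: smooth to reject high-frequency noise (3-tap stand-in).
Stream gaussian(const Stream& in) {
  Stream out(in.size(), 0.0f);
  for (std::size_t i = 1; i + 1 < in.size(); ++i)
    out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
  return out;
}

// Laplacian kernel: highlight edges.
Stream laplacian(const Stream& in) {
  Stream out(in.size(), 0.0f);
  for (std::size_t i = 1; i + 1 < in.size(); ++i)
    out[i] = in[i - 1] - 2.0f * in[i] + in[i + 1];
  return out;
}

// SAD kernel: for each left-image element, find the offset in the
// right image with the minimum absolute difference; emit the disparity.
Stream sad(const Stream& left, const Stream& right, int window = 8) {
  Stream disparity(left.size(), 0.0f);
  for (std::size_t i = 0; i < left.size(); ++i) {
    float best = 1e30f;
    for (int d = 0; d < window && i + d < right.size(); ++d) {
      float diff = std::fabs(left[i] - right[i + d]);
      if (diff < best) { best = diff; disparity[i] = float(d); }
    }
  }
  return disparity;
}

// The graph itself: two filter paths feeding the SAD comparison.
Stream depth_extract(const Stream& img0, const Stream& img1) {
  return sad(laplacian(gaussian(img0)), laplacian(gaussian(img1)));
}
```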
The stream program of figure 1 exposes both parallelism and locality. Each element of each input stream (all of the image pixels) can be processed simultaneously, exposing large amounts of data parallelism. This parallelism is particularly easy to identify in a stream program because the structure of the program makes the dependencies between kernels explicit. The complex disambiguation required when intermediate data is passed through memory arrays is not needed. Within each kernel, instruction-level parallelism is exposed since many independent operations can execute in parallel. Finally, the kernels can operate in parallel, operating on pixels or frames in a pipelined manner to expose thread-level parallelism.

The stream program also exposes two types of locality: kernel and producer-consumer. During the execution of a kernel, all references are to variables local to the kernel except for values read from the input stream(s) and written to the output stream(s). This is kernel locality.

QUEUE QUEUE one implementation of the 7x7 convolution kernel. microprogram broadcasting VLIW (very-long instruction The inputs and outputs for the operations in the kernel word) instructions across the clusters until the kernel is require 176 references to values in local register fi les for completed for all records in the current block. every word accessed from the SRF (stream register fi le). A large number, C×A, of arithmetic units in a stream Thus, because of kernel locality, we are able to ensure that processor exploit the parallelism of a stream program. A one out of every 177 references is local to the kernel. stream processor exploits data parallelism by operating Producer-consumer locality is exposed by the streams on C stream elements in parallel, one on each cluster, fl owing between kernels. As one kernel produces stream under SIMD (single-instruction, multiple-data) control of elements, the next kernel consumes these elements in the . The instruction-level parallelism of sequence. By appropriately sequencing kernels, the ele- a kernel is exploited by the multiple arithmetic units in ments of an intermediate stream can be kept local—val- each cluster that are controlled by the VLIW instructions ues are consumed soon after they are produced. Each issued by the microcontroller. If needed, thread-level time the Gaussian kernel in fi gure 1 generates a block of parallelism can be exploited by operating multiple stream the smoothed pixel stream, for example, this block can execution units in parallel. Research has shown that be consumed by the Laplacian kernel. Only a block of the typical stream programs have suffi cient data and instruc- intermediate stream exists at any point in time, and this tion-level parallelism for media applications to keep more block can be kept in local storage. than 1,000 arithmetic units productively employed.2 The exposed register hierarchy of the stream processor exploits the locality of a stream program. Kernel local- As shown in the block diagram of fi gure 2, a stream ity is exploited by keeping almost all kernel variables in processor consists of a , a stream memory local register fi les immediately adjacent to the arithmetic system,EXPLOITING an I/O PARALLELISM system, and aAND stream LOCALITY , units in which they are to be used. These local register which consists of a microcontroller and an array of C fi les provide very high bandwidth and very low power arithmetic clusters. Each cluster contains a portion of the for accessing local variables. Producer-consumer local- SRF, a collection of A arith- metic units, a set of local register fi les, and a local switch. A local register fi le scalar processor microcontroller is associated with each arithmetic unit. A global switch allows the clusters to exchange data. LRF A stream processor executes an instruction DRAM SRF CL set extended with kernel bank lane SW execution and stream LRF load and store instruc- tions. The scalar processor fetches all instructions. It executes scalar instructions itself, dispatches kernel

EXPLOITING PARALLELISM AND LOCALITY

As shown in the block diagram of figure 2, a stream processor consists of a scalar processor, a stream memory system, an I/O system, and a stream execution unit, which consists of a microcontroller and an array of C arithmetic clusters. Each cluster contains a portion of the SRF, a collection of A arithmetic units, a set of local register files, and a local switch. A local register file is associated with each arithmetic unit. A global switch allows the clusters to exchange data.

[Figure 2: Block diagram of a stream processor: a scalar processor and microcontroller; DRAM banks and an I/O system feeding SRF lanes; and arithmetic clusters, each with local register files (LRF) and a local switch, joined by a global switch.]

A stream processor executes an instruction set extended with kernel execution and stream load and store instructions. The scalar processor fetches all instructions. It executes scalar instructions itself, dispatches kernel execution instructions to the microcontroller and arithmetic clusters, and dispatches stream load and store instructions to the memory or I/O system. For each kernel execution instruction, the microcontroller starts execution of a microprogram, broadcasting VLIW (very long instruction word) instructions across the clusters until the kernel is completed for all records in the current block.
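That division of labor can be sketched as a dispatch loop over the extended instruction set. This is schematic only; the opcode names and fields are assumptions, not the actual Imagine ISA.

```cpp
// Schematic dispatch of the extended scalar instruction set.
#include <cstdint>

enum class Op : std::uint8_t {
  Scalar,       // ordinary instruction: executed by the scalar processor
  KernelExec,   // microcontroller broadcasts the kernel's VLIW microcode
                // to all clusters until the block is consumed
  StreamLoad,   // memory/I/O system moves a stream block into the SRF
  StreamStore,  // ...or from the SRF back out to memory or I/O
};

struct Instr {
  Op op;
  std::uint32_t arg;  // kernel id or stream descriptor (assumed encoding)
};

void dispatch(const Instr& instr) {
  switch (instr.op) {
    case Op::Scalar:      /* execute locally in the scalar pipeline */ break;
    case Op::KernelExec:  /* hand instr.arg to the microcontroller;
                             clusters proceed under SIMD control */    break;
    case Op::StreamLoad:
    case Op::StreamStore: /* queue a DMA-like block transfer between
                             memory/I/O and the SRF */                 break;
  }
}
```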

A large number, C×A, of arithmetic units in a stream processor exploit the parallelism of a stream program. A stream processor exploits data parallelism by operating on C stream elements in parallel, one on each cluster, under SIMD (single-instruction, multiple-data) control of the microcontroller. The instruction-level parallelism of a kernel is exploited by the multiple arithmetic units in each cluster that are controlled by the VLIW instructions issued by the microcontroller. If needed, thread-level parallelism can be exploited by operating multiple stream execution units in parallel. Research has shown that typical media applications have sufficient data and instruction-level parallelism to keep more than 1,000 arithmetic units productively employed.2

The exposed register hierarchy of the stream processor exploits the locality of a stream program. Kernel locality is exploited by keeping almost all kernel variables in local register files immediately adjacent to the arithmetic units in which they are to be used. These local register files provide very high bandwidth and very low power for accessing local variables.


Producer-consumer locality is exploited via the SRF. A producing kernel, such as the Gaussian kernel in figure 1, generates a block of an intermediate stream into the SRF, each cluster writing to its local portion of the SRF. A consuming kernel, such as the Laplacian kernel in figure 1, then consumes the block of stream elements directly from the SRF.

To see how parallelism and locality are exploited in practice, figure 3 shows how the depth extraction program of figure 1 is mapped to the stream processor of figure 2. The input images are read from external memory or an I/O device into the SRF one block at a time, then the Gaussian kernel is run. Each cluster performs the kernel on a different pixel of the input, reading each pixel of the input block from the SRF and writing each pixel of the output block to the SRF. Most local variables for the kernels are kept in the local register files, with some partial products cycled through the SRF. Overall, for each word accessed from memory or I/O, 23 words are referenced from the SRF, and 317 are referenced from local registers.

[Figure 3: Mapping the depth extractor to the stream processor: rows of pixels and partial sums cycle between the SRF and the Gaussian, Laplacian, and SAD kernels to produce a row segment of the depth map.]

This high fraction of references from local registers is not unique to the depth extractor. Figure 4 shows that kernel locality and producer-consumer locality exist in a broad range of applications and that a stream processor can successfully exploit this locality. The figure shows the bandwidth from main memory, the SRF, and local register files for six applications. The first column (depth) shows the bandwidth for the depth extractor of figure 1. MPEG is an MPEG2 encoder including motion estimation. QRD is a QR decomposition using the Householder method. STAP (space-time adaptive processing) is an adaptive beam-forming application. Render is an OpenGL 3D graphics-rendering program. RTSL (realtime shading language) is a renderer with a programmable shader.3 For all of these programs, more than 95 percent of all references are from the local registers and less than 0.5 percent are from external memory.

[Figure 4: Bandwidth from main memory, the SRF, and local register files for the six applications.]

While a conventional microprocessor or DSP can benefit from the locality and parallelism exposed by a stream program, it is unable to fully realize the parallelism and locality of streaming. A conventional processor has only a few arithmetic units (typically fewer than four, compared with hundreds for a stream processor) and thus is unable to exploit much of the parallelism exposed by a stream program. A conventional processor is unable to realize much kernel locality because it has too few processor registers (typically fewer than 32, compared with thousands for a stream processor) to capture the working set of a kernel. A processor's cache memory is unable to exploit much of the producer-consumer locality because there is little reuse of consumed data (the data is read once and discarded). Also, a cache is reactive, waiting for the data to be requested before fetching it. In contrast, data is proactively fetched into an SRF so it is ready when needed. Finally, a cache replaces data without regard to its liveness (using a least-recently-used or random replacement strategy) and often discards data that is still needed. In contrast, an SRF is managed by a compiler in such a manner that only dead data (data that is no longer of interest) is replaced to make room for new data.

EFFICIENCY

Most of the energy consumed by a modern microprocessor or DSP is consumed by data and instruction movement, not by performing arithmetic. As illustrated in Table 1, for a 0.13µm (micrometer) process operating from a 1.2V supply, a simple 32-bit RISC processor consumes 500 pJ to perform an instruction,4 whereas a single 32-bit arithmetic operation requires only 5 pJ. Only 1 percent of the energy consumed by the instruction is used to perform arithmetic. The remaining 99 percent goes to overhead. This overhead is divided between instruction overhead (reading the instruction from a cache, updating the program counter, instruction decode, transmitting the instruction through a series of pipeline registers, etc.) and data overhead (reading data from a cache, reading and writing a multiport register file, transmitting operands and intermediate results through pipeline registers and bypass networks, etc.).

Table 1: Energy per operation (0.13µm, 1.2V)

  Operation                      Energy
  32-bit arithmetic operation    5 pJ
  32-bit register read           10 pJ
  32-bit 8KB RAM read            50 pJ
  32-bit traverse 10mm wire      100 pJ
  Execute instruction            500 pJ

A stream processor exploits data and instruction locality to reduce this overhead so that approximately 30 percent of the energy is consumed by arithmetic operations. On the data side, the locality shown in figure 4 keeps most data movements over short wires, consuming little energy. The distributed register organization, with a number of small local register files connected by a cluster switch, is significantly more efficient than a single global register file.5 Also, the SRF is accessed only once every 20 operations on average, compared with a data cache that is accessed once every three operations on average, greatly reducing memory access energy. On the instruction side, the energy required to read a microinstruction from the microcode memory is amortized across the data-parallel clusters of a stream processor. Also, kernel microinstructions are simpler and hence have less control overhead than the RISC instructions executed by the scalar processor.
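The arithmetic behind the 1 percent figure, and the effect of amortizing one instruction fetch across many SIMD clusters, is simple enough to spell out. The numbers below come from Table 1; the cluster count is Imagine's, and the overhead split is deliberately crude, for illustration only.

```cpp
// Crude energy accounting from Table 1 (0.13um, 1.2V).
#include <cstdio>

int main() {
  const double kOpPj = 5.0;      // 32-bit arithmetic operation
  const double kInstrPj = 500.0; // full RISC instruction, fetch to retire

  // Fraction of instruction energy doing useful arithmetic: 5/500 = 1%.
  std::printf("useful: %.0f%%  overhead: %.0f%%\n",
              100.0 * kOpPj / kInstrPj,
              100.0 * (1.0 - kOpPj / kInstrPj));

  // If one microinstruction fetch is shared by 8 SIMD clusters, the
  // per-operation share of that overhead drops by 8x (schematically).
  const int kClusters = 8;
  std::printf("overhead per op across %d clusters: %.1f pJ\n",
              kClusters, (kInstrPj - kOpPj) / kClusters);
  return 0;
}
```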

TIME VERSUS SPACE MULTIPLEXING

A stream processor time-multiplexes its hardware over the kernels of an application. All of the clusters work together on one kernel—each operating on different data—then they all proceed to the next kernel, and so on. This is shown on the left side of figure 5. In contrast, many tiled architectures (DSP arrays) are space-multiplexed. Each kernel runs continuously on a different tile, processing the data stream in sequence, as shown on the right side of figure 5. The clusters of a stream processor exploit data parallelism, whereas the tiles of a DSP array exploit thread-level parallelism.

[Figure 5: Time multiplexing (left) runs kernels K1 through K4 on all clusters in turn; space multiplexing (right) pins each kernel to its own tile.]

Time multiplexing has two significant advantages over space multiplexing: load balance and instruction efficiency. As shown in figure 5, with time multiplexing the load is perfectly balanced across the clusters—all of the clusters are busy all of the time. With space multiplexing, on the other hand, the tiles that perform shorter kernels are idle much of the time as they wait for the tile running the longest kernel to finish. The load is not balanced across the tiles: Tile 0 (the bottleneck tile) is busy all of the time, while the other tiles are idle much of the time. Particularly when kernel execution time is data dependent (as with many compression algorithms), load balancing a space-multiplexed architecture is impossible. A time-multiplexed architecture, on the other hand, is always perfectly balanced. This often results in a 2x to 3x improvement in efficiency.

Exploiting data parallelism rather than thread-level parallelism, a time-multiplexed architecture uses its instruction bandwidth more efficiently. Fetching an instruction is costly in terms of energy. The instruction pointer is incremented, an instruction cache is accessed, and the instruction must be decoded. The energy required to perform these operations often exceeds the energy consumed by the arithmetic carried out by the instruction. On a space-multiplexed architecture, each instruction is used exactly once, and thus this instruction cost is added directly to the cost of each instruction. On a time-multiplexed architecture, however, the energy cost of an instruction is amortized across the parallel clusters that all execute the same instruction in parallel. This results in an additional 2x to 3x improvement in efficiency.
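The contrast is visible in loop structure alone. In the sketch below (trivial stand-in kernels, schematic only), the time-multiplexed loop runs one kernel at a time across the whole block, as the clusters do; the space-multiplexed loop pins one kernel per tile, so a slow kernel stalls the pipeline behind it.

```cpp
// Schematic loop structures for the two multiplexing styles.
#include <vector>

using Element = float;
using Kernel = void (*)(Element&);

void k1(Element& e) { e += 1.0f; }  // stand-in kernels of unequal cost
void k2(Element& e) { e *= 2.0f; }
void k3(Element& e) { e -= 3.0f; }

// Time multiplexing: every cluster runs the SAME kernel on different
// elements (SIMD), then all advance together -- balanced by construction.
void time_multiplexed(std::vector<Element>& block) {
  const Kernel kernels[] = {k1, k2, k3};
  for (Kernel k : kernels)           // one kernel at a time
    for (Element& e : block)         // elements spread across clusters
      k(e);
}

// Space multiplexing: each kernel is pinned to its own tile and the
// data flows through; the longest-running tile sets the overall rate.
void space_multiplexed(std::vector<Element>& block) {
  const Kernel tiles[] = {k1, k2, k3};
  for (Element& e : block)           // each element flows tile to tile
    for (Kernel k : tiles)
      k(e);
}
```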

STREAM PROGRAMMING TOOLS

Mapping an application to a stream processor involves two steps: kernel scheduling, in which the operations of each kernel are scheduled on the arithmetic units of a cluster; and stream scheduling, in which kernel executions and data transfers are scheduled to use the SRF efficiently and to maximize data locality. We have developed a set of programming tools that automate both of these tasks so that a stream processor can be programmed entirely in C without sacrificing efficiency.

Our kernel scheduler takes a kernel described in kernel C and compiles it to a VLIW microprogram. This compilation uses communication scheduling6 to map each operation to a cycle number and arithmetic unit, and simultaneously schedules the data movement necessary to provide operands. The compiler software-pipelines inner loops, converting data parallelism to instruction-level parallelism where it is required to keep all operation units busy. To handle conditional (if-then-else) structures across the SIMD clusters, the compiler uses predication and conditional streams.7 Figure 6 shows the schedule for the 7x7 convolution kernel from the depth extractor of figure 1, compiled to the Imagine stream processor (described later). Time is shown on the vertical axis and function units on the horizontal axis.
[Figure 6: Compiled schedule of the 7x7 convolution kernel on Imagine: the single-iteration schedule and the software-pipelined schedule across the function units.]

The kernel scheduler is able to keep the multipliers (columns 4 and 5) busy nearly every cycle.

The stream scheduler schedules not only the transfers of blocks of streams between memory, I/O devices, and the SRF, but also the execution of kernels. This task is comparable to scheduling the DMA (direct memory access) transfers between off-chip memory and I/O that must be performed manually for most conventional DSPs. The stream scheduler accepts a C++ program and outputs machine code for the scalar processor, including stream load and store, I/O, and kernel execution instructions. The stream scheduler optimizes the block size so that the largest possible streams are transferred and operated on at a time, without overflowing the capacity of the SRF. This optimization is similar to the use of cache blocking on conventional processors and to stripmining of loops on vector processors. Figure 7 shows a stream schedule for an OpenGL polygon renderer. Time is shown vertically and space in the SRF horizontally.

[Figure 7: Stream schedule for the OpenGL polygon renderer: kernels such as transform, sort, project, merge, shader, rasterize, compact, and Z-compare allocated in the SRF over time, ending in the depth and image buffers.]
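A sketch of that blocking optimization at the stream level: strip-mine the input so that each block, together with its intermediates, fits in the SRF. The SRF size matches Imagine's; the live-stream count, kernel bodies, and all names are assumptions rather than output of the actual tools.

```cpp
// Strip-mined stream program: block size chosen so all live streams
// fit in the SRF at once (cache-blocking / vector strip-mining analog).
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kSrfWords    = 32 * 1024;  // Imagine-sized SRF
constexpr std::size_t kLiveStreams = 4;          // input, temps, output (assumed)
constexpr std::size_t kBlock = kSrfWords / kLiveStreams;

// Trivial stand-in kernels; the real ones are compiled kernel C.
void smooth(std::vector<float>& b) { for (float& x : b) x *= 0.5f; }
void edges(std::vector<float>& b)  { for (float& x : b) x -= 1.0f; }

void process(const std::vector<float>& image, std::vector<float>& out) {
  out.resize(image.size());
  for (std::size_t base = 0; base < image.size(); base += kBlock) {
    const std::size_t n = std::min(kBlock, image.size() - base);
    // "Stream load": one block moves from memory into the SRF.
    std::vector<float> block(image.begin() + base,
                             image.begin() + base + n);
    smooth(block);  // producer writes the block into the SRF...
    edges(block);   // ...consumer reads it back without touching memory
    // "Stream store": the finished block returns to memory.
    std::copy(block.begin(), block.end(), out.begin() + base);
  }
}
```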

THE IMAGINE STREAM PROCESSOR

Imagine, shown in figure 8, is a prototype stream processor fabricated in a 0.18µm CMOS process.8 Imagine contains eight arithmetic clusters, each with six 32-bit floating-point arithmetic units: three adders, two multipliers, and one divide-square root (DSQ) unit. With the exception of the DSQ unit, all units are fully pipelined and support 8-, 16-, and 32-bit integer operations, as well as 32-bit floating-point operations. Each input of each arithmetic unit has a separate local register file of sixteen or thirty-two 32-bit words. The SRF has a capacity of 32K 32-bit words (128KB) and can read 16 words per cycle (two words per cluster). The clusters are controlled by a 576-bit microinstruction. The microcontrol store holds 2K such instructions. The memory system interfaces to four 32-bit-wide SDRAM banks and reorders memory references to optimize bandwidth. Imagine also includes a network interface and router for connection to I/O devices and to combine multiple Imagines for larger signal-processing tasks.

[Figure 8: Floorplan of the Imagine prototype: host, stream-controller, and network interfaces; the microcontroller; four memory banks; the SRF; and eight arithmetic clusters.]
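Those specifications pin down Imagine's headline rates up to the clock frequency, which is not quoted here, so the sketch below leaves it as a parameter (200 MHz is an assumed value, for illustration only).

```cpp
// Back-of-the-envelope peak rates from the Imagine specs above.
#include <cstdio>

int main() {
  const int kClusters = 8;
  const int kAlusPerCluster = 6;   // 3 adders, 2 multipliers, 1 DSQ
  const int kOpsPerCycle = kClusters * kAlusPerCluster;   // 48
  const int kSrfWordsPerCycle = 16;                       // 2 per cluster
  const double kMhz = 200.0;       // assumed clock, for illustration only

  std::printf("peak: %d ops/cycle = %.1f GOPS at %.0f MHz\n",
              kOpsPerCycle, kOpsPerCycle * kMhz / 1000.0, kMhz);
  std::printf("SRF read bandwidth: %.1f GB/s at %.0f MHz\n",
              kSrfWordsPerCycle * 4 * kMhz / 1000.0, kMhz);
  return 0;
}
```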

CHALLENGES

Stream processors depend on parallelism and locality for their efficiency. For an application to stream well, there must be sufficient parallel work to keep all of the arithmetic units in all of the clusters busy. The parallelism need not be regular, and the work performed on each stream element need not be of the same type or even the same amount. If there is not enough work to go around, however, many of the stream processor's resources will idle and efficiency will suffer. For this reason, stream processors cannot efficiently handle some control-intensive applications that are dominated by a single sequential thread of control with little data parallelism. A streaming application must also have sufficient kernel and producer-consumer locality to keep global bandwidth from becoming a bottleneck. A program that makes random memory references and does little work with each result fetched, for example, would be limited by global bandwidth and not benefit from streaming. Happily, most signal processing applications have adequate data parallelism and locality.


Even for those applications that do stream well, inertia represents a significant barrier to the adoption of stream processors. Though it is easy to program a stream processor in C, learning to use the stream programming tools and writing a complex streaming application still represents a significant effort. For evolutionary applications, it is often easier to reuse the existing code base for a conventional DSP, or the existing netlist for an ASIC, than to develop new streaming code. An application must require both efficiency and flexibility to overcome this inertia.

THE FUTURE IS STREAMS

Figure 9 shows a roadmap that illustrates how we expect stream processors to evolve with improving semiconductor process technology. The figure shows two lines of evolution. The top line represents floating-point stream processors (that, like Imagine, support 32-bit floating-point operations), and the bottom line represents fixed-point processors that support just 8-, 16-, and 32-bit integer operations. Integer operations are sufficient for most signal processing operations and, as the figure indicates, are significantly more efficient in terms of both area and power. Each point in the roadmap represents a stream processor (integer or floating point) implemented in a particular technology with a nominal die size of 1 cm2 (1.3 cm2 for the floating-point processors). For each point, the figure shows the performance and power at full voltage and the performance and power at reduced voltage (for more efficient operation).


QUEUE QUEUE varying the number of clusters in the stream processor.9 machine for video-rate dense depth mapping and its new The main competition for stream processors are fi xed- applications. Proceedings of the 15th Computer Vision and function (ASIC or ASSP) processors. Though ASICs have Pattern Recognition Conference (June 1996), 196−202. effi ciency as good as or better than stream processors, 2. Khailany B., Dally, W. J., Rixner, S., Kapasi, U. J., they are costly to design and lack fl exibility. It takes about Owens, J. D., Towles, B. Exploring the VLSI of $15 million and 18 months to design a high-performance stream processors. Proceedings of the Ninth International signal-processing ASIC for each application, and this cost Symposium on High Performance Computer Architecture (Feb. is increasing as semiconductor technology advances. In 2003), 153−164. contrast, a single stream processor can be reused across many applications with no incremental design cost, and Imagine: Prototype Stream Processor software for a typical application can be developed in about six months for about $4 million.10 In addition, this fl exibility improves effi ciency in applications where mul- HI SC NI tiple modes must be supported. The same resources can microcontroller be reused across the modes, rather than requiring dedi- cated resources for each mode that remain idle when the cluster 7

system is operating in a different mode. Also, fl exibility 3 cluster 6 permits new algorithms and functions to be easily imple- Mbank SRF cluster 5

mented. Often the performance and effi ciency advantage 2 cluster 4 of a new algorithm greatly outweighs the small advantage Mbank cluster 3

of an ASIC over a stream processor. 1 cluster 2 FPGAs are fl exible, but lack effi ciency and program- Mbank mability. Because of the overhead of gate-level confi gu- cluster 1 0

rability, processors implemented with FPGAs have an Mbank cluster 0 effi ciency of 2-10 MOPS per megawatt, comparable to that of conventional processors and DSPs. Newer FPGAs include large function blocks such as multipliers and microprocessors to partly address this effi ciency issue. Also, though FPGAs are fl exible, they are not program- mable in a high-level FIGFIG 8 language. Manual design to the register-transfer level is required for an FPGA, just as with an ASIC. Advanced floating floating floating point point point compilers may someday ease the programming burden of FPGAs. With competitive Imagine energy effi ciency, lower recurring costs, and the advantages of fl exibility, fixed fixed fixed we expect stream proces- 9 point point point

sors to replace ASICs in the most demanding of signal- processing applications. G Q

REFERENCES

1. Kanade, T., Yoshida, A., Oda, K., Kano, H., and Tanaka, M. A stereo machine for video-rate dense depth mapping and its new applications. Proceedings of the 15th Computer Vision and Pattern Recognition Conference (June 1996), 196–202.
2. Khailany, B., Dally, W. J., Rixner, S., Kapasi, U. J., Owens, J. D., and Towles, B. Exploring the VLSI scalability of stream processors. Proceedings of the Ninth International Symposium on High Performance Computer Architecture (Feb. 2003), 153–164.
3. Owens, J. D., Khailany, B., Towles, B., and Dally, W. J. Comparing Reyes and OpenGL on a stream architecture. Siggraph/Eurographics Workshop on Graphics Hardware (Sept. 2002), 47–56.
4. A high-end microprocessor may consume 10 nJ or more per instruction because of the much greater overheads required for acceleration techniques such as branch prediction, speculation, and out-of-order execution.
5. Rixner, S., Dally, W. J., Khailany, B., Mattson, P., Kapasi, U. J., and Owens, J. D. Register organization for media processing. Proceedings of the Sixth International Symposium on High Performance Computer Architecture (Jan. 2000), 375–387.
6. Mattson, P., Dally, W. J., Rixner, S., Kapasi, U. J., and Owens, J. D. Communication scheduling. Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (Nov. 2000), 82–92.
7. Kapasi, U. J., Dally, W. J., Rixner, S., Mattson, P. R., Owens, J. D., and Khailany, B. Efficient conditional operations for data-parallel architectures. Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec. 2000), 159–170.
8. Khailany, B., Dally, W. J., Rixner, S., Kapasi, U. J., Mattson, P., Namkoong, J., Owens, J. D., Towles, B., and Chang, A. Imagine: Media processing with streams. IEEE Micro 21, 2 (Mar./Apr. 2001), 35–46.
9. See reference 2.
10. Robles, R. The cost/benefit ratio of ASICs in wireless baseband modems. Communications Design Conference (Oct. 2003); http://www.commdesignconference.com/archive/papers/2003/P212.htm.

LOVE IT, HATE IT? LET US KNOW
[email protected] or www.acmqueue.com/forums

WILLIAM J. DALLY is the Willard R. and Inez Kerr Bell Professor of Engineering at Stanford University. Dally has done pioneering development work at Bell Telephone Laboratories, Caltech, and the Massachusetts Institute of Technology, where he was a professor of electrical engineering and computer science. At Stanford University, his group has developed the Imagine processor, which introduced the concepts of stream processing and partitioned register organizations. Dally has worked with Cray Research and Intel to incorporate many of these innovations in commercial parallel computers, worked with Avici Systems to incorporate this technology into Internet routers, cofounded Velio Communications to commercialize high-speed signaling technology, and cofounded Stream Processors to commercialize stream processor technology. He is a fellow of the IEEE and the ACM, and has received numerous honors, including the ACM Maurice Wilkes Award. He has published more than 150 papers in these areas and is an author of the textbooks Digital Systems Engineering (Cambridge University Press, 1998) and Principles and Practices of Interconnection Networks (Morgan Kaufmann, 2003).

UJVAL J. KAPASI expects to receive his doctorate from Stanford University in February 2004. His research interests include computer architecture, scientific computing, and language and compiler design. While at Stanford, he was an architect of the Imagine stream processor and contributed to the VLSI implementation of an Imagine prototype. Recently, he cofounded Stream Processors, which is commercializing the stream-processing technology developed at Stanford.

BRUCEK KHAILANY received his Ph.D. from Stanford University in 2002, where he was the principal VLSI designer of the Imagine stream processor. Recently, as a cofounder of Stream Processors, he has been working on the commercialization of stream processors in a variety of application areas. He is a member of the IEEE and the ACM, was an Intel Foundation fellowship recipient at Stanford, and received a BSEE from the University of Michigan in 1997.

JUNG HO AHN is a Ph.D. candidate in electrical engineering at Stanford University. His research interests include computer architecture, stream architecture, advanced memory design, and compiler design. Ahn received an MS in electrical engineering from Stanford. He is a student member of the ACM and the IEEE.

ABHISHEK DAS is a Ph.D. candidate in electrical engineering at Stanford University. He received his B.Tech in computer science and engineering from IIT Kharagpur, India. His research interests include compiler and language design, computer systems software, and computer architecture. He has worked on making TCP/IP feasible for Bluetooth technology at the IBM Research Center, India. He is currently involved in the development and enhancement of the compiler and architecture for the Merrimac and Imagine architectures.

© 2004 ACM 1542-7730/04/0300 $5.00
