Imagine: Media Processing with Streams

The power-efficient Imagine achieves performance densities comparable to those of special-purpose embedded processors. Executing programs mapped to streams and kernels, a single Imagine processor is expected to have a peak performance of 20 Gflops and sustain 18.3 GOPS on MPEG-2 encoding.

Brucek Khailany, William J. Dally, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong, John D. Owens, Brian Towles, and Andrew Chang, Stanford University
Scott Rixner, Rice University

Media-processing applications, such as signal processing, 2D- and 3D-graphics rendering, and image and audio compression and decompression, are the dominant workloads in many systems today. The real-time constraints of media applications demand large amounts of absolute performance and high performance densities (performance per unit area and per unit power). Therefore, media-processing applications often use special-purpose (custom), fixed-function hardware. General-purpose solutions, such as programmable digital signal processors (DSPs), offer increased flexibility but achieve performance density levels two or three orders of magnitude worse than special-purpose systems.

One reason for this performance density gap is that conventional general-purpose architectures are poorly matched to the specific properties of media applications. These applications share three key characteristics. First, operations on one data element are largely independent of operations on other elements, resulting in a large amount of data parallelism and high latency tolerance. Second, there is little global data reuse. Finally, the applications are computationally intensive, often performing 100 to 200 arithmetic operations for each element read from off-chip memory.

Conventional general-purpose architectures don't efficiently exploit the available data parallelism in media applications. Their memory systems depend on caches optimized for reducing latency and data reuse. Finally, they don't scale to the numbers of arithmetic units or registers required to support a high ratio of computation to memory access. In contrast, special-purpose architectures take advantage of these characteristics because they effectively exploit data parallelism and computational intensity with a large number of arithmetic units. Also, special-purpose processors directly map the algorithm's dataflow graph into hardware rather than relying on memory systems to capture locality.

Another reason for the performance density gap is the constraints of modern technology. Modern VLSI computing systems are limited by communication bandwidth rather than arithmetic. For example, in a contemporary 0.15-micron CMOS technology, a 32-bit integer adder requires less than 0.05 mm² of chip area. Hundreds to thousands of these arithmetic units fit on an inexpensive 1-cm² chip. The challenge is supplying them with instructions and data.


[Figure 1. Stereo depth extraction. Input data from the left and right camera images flows as streams through Convolve 7×7 and Convolve 3×3 kernels and then through a SAD kernel, producing the output depth map.]

General-purpose processors that rely on global structures such as large multiported register files to provide data bandwidth cannot scale to the number of arithmetic logic units (ALUs) required for high performance rates in media applications. In contrast, special-purpose processors have many ALUs connected by dedicated wires and buffers to provide needed data and control bandwidth. However, special-purpose solutions lack the flexibility to work effectively on a wide application space.

To provide programmability yet maintain high performance densities, we developed the Imagine stream processor at Stanford University. Imagine consists of a programming model, software tools, and an architecture, all designed to operate directly on streams. The stream programming model exposes the locality and concurrency in media applications so they can be exploited by the stream architecture. This architecture uses a novel register file organization that supports 48 floating-point arithmetic units. We expect a prototype Imagine processor to achieve 18.3 giga operations per second (GOPS) in MPEG-2 encoding applications, corresponding to 105 frames per second on a 720 × 480-pixel, 24-bit color image while dissipating 2.2 W.

Stream programming model

The stream programming model allows simple control, makes communication explicit, and exposes the inherent parallelism of media applications. A stream program organizes data as streams and expresses all computation as kernels. A stream is a sequence of similar elements. Each stream element is a record, such as 21-word triangles or 8-bit pixels. A kernel consumes a set of input streams, performs a computation, and produces a set of output streams. Streams passing among multiple computation kernels form a stream program.

Figure 1 shows how we map stereo depth extraction,1 a typical image-processing application, to the stream programming model. Stereo depth extraction calculates the disparity between objects in a left and right grayscale camera image pair. It outputs a depth map corresponding to each object's distance from the cameras (in the example depth map, brighter pixels are closer to the camera). The input camera images are formatted as streams; in the example, a single stream contains a row of pixels from the image. The row streams flow through two convolution kernels, which produce a filtered stream. The filtered streams then pass through a sum-of-absolute-differences (SAD) kernel, which produces a stream containing a row of the final depth map. Each kernel also outputs a stream of partial sums (represented by the circular arrows in the diagram) needed by the same kernel as it processes future rows.
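To make the model concrete, the fragment below sketches the stream/kernel idea in ordinary C++ rather than the C-like pseudocode the article itself uses in Figure 3. It is an illustration only: a stream is modeled as a sequence of records, and a kernel as a function that consumes input streams and produces output streams. The Pixel record, the 3-tap filter, and the coefficient values are assumptions, not taken from the article.

    // Illustrative sketch only: the stream programming model in plain C++.
    // A stream is a sequence of similar records; a kernel reads every element
    // of its input streams and appends results to its output streams.
    #include <cstddef>
    #include <vector>

    struct Pixel { float gray; };                 // one stream element (a record)
    using Stream = std::vector<Pixel>;            // a stream of such records

    // A kernel: loops over all input elements and produces an output stream.
    Stream convolve3(const Stream& in, const float (&taps)[3]) {
        Stream out;
        for (std::size_t i = 0; i + 2 < in.size(); ++i)
            out.push_back({taps[0] * in[i].gray +
                           taps[1] * in[i + 1].gray +
                           taps[2] * in[i + 2].gray});
        return out;
    }

    // A two-kernel stream program: streams flow from one kernel to the next.
    Stream filterRow(const Stream& row) {
        const float smooth[3] = {0.25f, 0.5f, 0.25f};   // assumed coefficients
        Stream tmp = convolve3(row, smooth);            // kernel 1
        return convolve3(tmp, smooth);                  // kernel 2
    }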

[Figure 2. Imagine stream architecture. The Imagine processor contains a stream controller driven by the host, a streaming memory system with four SDRAM channels, the stream register file, eight arithmetic clusters with local register files, and a network interface to other Imagines or I/O devices. The annotated bandwidths are 2.67 Gbytes/s between memory and the stream register file, 32 Gbytes/s between the stream register file and the clusters, and 544 Gbytes/s within the clusters.]

Stereo depth extraction possesses important characteristics common to all media-processing applications:

• Little data reuse. Pixels are read once from memory and are not revisited.
• High data parallelism. The same set of operations independently computes all pixels in the output image.
• Computationally intensive. The application requires 60 arithmetic operations per memory reference.

Mapping an application to streams and kernels exposes these characteristics so that hardware can easily exploit them. Streams expose an application's inherent data parallelism. The flow of streams between kernels exposes the application's communication requirements.

Stream architecture

Imagine is a programmable processor that directly executes applications mapped to streams and kernels. Figure 2 diagrams the Imagine stream architecture. The processor consists of a 128-Kbyte stream register file (SRF), 48 floating-point arithmetic units in eight arithmetic clusters controlled by a microcontroller, a network interface, a streaming memory system with four SDRAM channels, and a stream controller. Imagine is controlled by a host processor, which sends it stream instructions. The following are the main stream instructions:

• Load transfers streams from off-chip DRAM to the SRF.
• Store transfers streams from the SRF to off-chip DRAM.
• Receive transfers streams from the network to the SRF.
• Send transfers streams from the SRF to the network.
• Cluster op executes a kernel in the arithmetic clusters that reads input streams from the SRF, computes output streams, and writes the output streams to the SRF.
• Load microcode loads streams consisting of kernel microcode—576-bit very long instruction word (VLIW) instructions—from the SRF into the microcontroller instruction store (a total of 2,048 instructions).

Imagine supports streams up to 32K (32,768) words long. The host processor sends stream instructions to the stream controller. When instructions are ready for issue, the stream controller decodes them and issues the necessary commands to other on-chip modules.

Each arithmetic cluster, detailed on the right side of Figure 2, contains eight functional units. A small two-ported local register file (LRF) connects to each input of each functional unit. An intracluster switch connects the outputs of the functional units to the inputs of the LRFs. During kernel execution, the same VLIW instruction is broadcast to all eight clusters in single-instruction, multiple-data (SIMD) fashion.
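As a rough illustration of the host's view, the sketch below writes the stream instructions listed above as a C++ descriptor. Only the opcode list, the 128-Kbyte SRF, and the 32K-word stream limit come from the article; the field names, widths, and the idea of a flat descriptor are assumptions.

    // Illustrative only: the main stream instructions as a host-side descriptor.
    #include <cstdint>

    enum class StreamOp : std::uint8_t {
        Load,          // off-chip DRAM -> SRF
        Store,         // SRF -> off-chip DRAM
        Receive,       // network -> SRF
        Send,          // SRF -> network
        ClusterOp,     // run a kernel on the eight arithmetic clusters
        LoadMicrocode  // kernel VLIW microcode, SRF -> microcontroller store
    };

    struct StreamInstruction {
        StreamOp      op;
        std::uint32_t srfWordOffset;   // position of the stream in the 128-Kbyte SRF
        std::uint32_t lengthWords;     // streams may be up to 32K (32,768) words
        std::uint64_t dramAddress;     // meaningful for Load and Store only
    };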


Kernels typically loop through all input-stream elements, performing a compound stream operation on each element. A compound stream operation reads an element from its input stream(s) in the SRF and computes a series of arithmetic operations (for example, matrix multiplication or convolution). Then it appends the results to output streams that are transferred back to the SRF. A compound stream operation stores all temporary data in the LRFs. It sends only final results to the SRF.

Compound stream operations are compound in the sense that they perform multiple arithmetic operations on each stream element. In contrast, conventional vector operations perform a single arithmetic operation on each vector element and store the results back in the vector register file after each operation.

Stream application example

The stereo depth extraction application presented in Figure 1 demonstrates how the Imagine stream architecture functions. The convolution stage performs a pair of 2D convolutions on both the left and right camera images, first using a 7 × 7 filter and then a 3 × 3 filter. (The application uses two separate filters to reduce the operation count.) Figure 3a shows the application-level pseudocode for the convolution of one image. Figure 3b shows how the commands in pseudocode lines 3 through 6 map to stream operations on the Imagine architecture (circled numbers indicate pseudocode lines).

The stream controller dynamically schedules each stream instruction. When the required resources are ready and interinstruction dependencies are satisfied, the stream controller issues the following operations:

Load operation. Line 3 translates into a stream Load instruction that causes a stream transfer from off-chip DRAM to the SRF. In this case, the stream being transferred contains a row of the image. Each element in the stream is a pixel from the source image, and the stream length corresponds to the image width. During the Load operation, the streaming memory system generates addresses corresponding to the pixel elements' locations, which are laid out sequentially in memory. The memory system then reorders these accesses to maximize DRAM bandwidth. As the Load operation executes, elements are written back to the SRF in their original order. At completion of the operation, the SRF contains a stream consisting of all the pixels in a row.

Convolve7×7 kernel execution. Next, the stream controller issues pseudocode line 4, causing the convolve7×7 kernel (Figure 3c) to execute. This kernel has two input streams: the pixel stream just loaded from memory (src) and a stream containing the partial sums produced earlier in the application (old_partials7). During kernel execution, the microcontroller fetches VLIW microcode instructions from the instruction store every cycle. Then the microcontroller sends these instructions to all eight arithmetic clusters in parallel.

As the kernel executes, it loops through all input-stream elements. During each loop iteration, the clusters read eight new stream elements (one new element to each cluster) in parallel from the SRF into their LRFs. Then each cluster performs the same compound stream operation on its stream element.

In the convolve7×7 kernel, the compound stream operation must first perform six intercluster communications that broadcast the current data. As a result, the first cluster has values src[n − 6: n], the second has src[n − 5: n + 1], and so on. Edge clusters use data buffered from the previous iteration if necessary. Each cluster then multiplies these seven values with a 7 × 7 matrix of filter coefficients by executing seven dot products. The kernel uses the seven results to generate a new output element and to update the partial sums for future rows.

The convolve computation occurs in parallel on all eight arithmetic clusters. In each loop iteration, the LRFs temporarily store intermediate data between operations. At the end of the loop, the output stream elements are written back eight at a time (one from each cluster) to the SRF. In this manner, six values from each cluster are written to the partials7 output stream, and one new value per cluster is written to the tmp output stream. Upon completion, the SRF contains a stream consisting of a new row of convolved pixels (tmp) and a stream consisting of partial sums (partials7) for later use.
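The scalar C++ sketch below models the data movement of one such loop iteration: each of the eight clusters reads one new element, and the six intercluster communications then leave cluster c holding the window src[8t + c − 6 .. 8t + c], with left-edge values taken from the previous iteration's buffered elements. The caller is assumed to prime prevVal (for example, with zeros) before the first iteration; this is a simulation of the data movement only, not Imagine's kernel code.

    // Scalar model of one SIMD loop iteration of a 7-tap kernel on eight clusters.
    // newVal[c] is the element cluster c just read; prevVal holds the previous
    // iteration's eight elements (the "buffered" data the edge clusters reuse).
    #include <array>

    constexpr int kClusters = 8;
    constexpr int kTaps = 7;

    using Lane = std::array<float, kTaps>;

    std::array<Lane, kClusters> communicateWindows(
            const std::array<float, kClusters>& newVal,
            const std::array<float, kClusters>& prevVal) {
        std::array<Lane, kClusters> window{};
        for (int c = 0; c < kClusters; ++c) {
            window[c][kTaps - 1] = newVal[c];          // the element this cluster read
            for (int i = 0; i < kTaps - 1; ++i) {      // six intercluster transfers
                int src = c - (kTaps - 1) + i;         // which cluster produced it
                window[c][i] = (src >= 0) ? newVal[src]
                                          : prevVal[src + kClusters];  // buffered
            }
        }
        return window;                                  // window[c] = src[8t+c-6 .. 8t+c]
    }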

1 prime partials with src_rows[0..5]
2 for (n = 6; n < numRows; n++) {
3     src = load( src_rows[n] );                      // 'src' stream gets one row of the source image
4     convolve7x7( src, old_partials7, &tmp, &partials7 );
5     convolve3x3( tmp, old_partials3, &cnv, &partials3 );
6     convolved_rows[n-6] = store( cnv );             // store 'cnv' stream to memory
7     swap pointers to old_partials and partials for the next time through the loop
8 }
9 drain partials to get convolved_rows[numRows-6..numRows-1]

(a)

[Figure 3b. Streams and kernels for pseudocode lines 3 through 6, arranged in three columns: memory (or I/O), the stream register file, and the arithmetic clusters. The src stream (one row of the source image) is loaded into the stream register file (line 3); the Convolve7x7 kernel consumes src and old_partials7 and produces tmp and partials7 (line 4); the Convolve3x3 kernel consumes tmp and old_partials3 and produces cnv and partials3 (line 5); and the convolved-image stream cnv is stored back to memory (line 6).]

(b)

convolve7x7(in, partials_in, out, partials_out) {
    while (!in.empty()) {
        in >> curr[6];                               // input stream element
        // Communicate values to neighboring clusters
        // (edge clusters get buffered data from prev iteration)
        for (i = 0; i < 6; i++)
            curr[i] = communicate(curr[6], perm[i]);
        for (i = 0; i < 7; i++)
            rowsum[i] = dotproduct(curr, fltr[6-i]);
        partials_in >> p;
        out << p[0] + rowsum[0];
        for (i = 0; i < 5; i++)
            p[i] = p[i+1] + rowsum[i+1];
        p[5] = rowsum[6];
        partials_out << p;
    }
}

(c)

Figure 3. The convolution stage of stereo depth extraction: application-level pseudocode (a); Imagine execution of pseudocode lines 3 through 6 (b); kernel-level pseudocode for 7 × 7 convolution (c).

Convolve3×3 kernel execution. Now the stream controller issues the convolve3x3 kernel on line 5 of Figure 3a. The output stream of the previous convolve7x7 kernel is used as one of the input streams. Once the convolve3×3 kernel completes, the SRF contains the final convolved row of pixels.

Storage. Finally, line 6 transfers the final output stream to the memory system for storage and use in later stages of the application.

The rest of the stereo depth extraction application and other stream applications execute similarly.


Table 1. Application bandwidth requirements versus Imagine’s bandwidth hierarchy.

                Memory       SRF          Clusters
                (Gbytes/s)   (Gbytes/s)   (Gbytes/s)   Performance
Imagine peak    2.67         32           544          20 Gflops (40 16-bit GOPS)
Depth           0.83         21.08        263.02       12.1 GOPS (16-bit)
MPEG2           0.45         2.21         214.31       18.3 GOPS (16- and 8-bit)
QR              0.37         3.99         294.82       13.1 Gflops
Render          1.61         8.59         205.05       5.1 GOPS (16-bit and floating-point)
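The following throwaway program recomputes the per-level ratios implied by Table 1 (values copied verbatim from the table above); it adds no new data and simply makes the hierarchy-level jumps discussed in the text below explicit.

    // Recompute bandwidth-hierarchy ratios from Table 1. Nothing here is new data.
    #include <cstdio>

    int main() {
        struct Row { const char* name; double mem, srf, clusters; };
        const Row rows[] = {
            {"Imagine peak", 2.67, 32.00, 544.00},
            {"Depth",        0.83, 21.08, 263.02},
            {"MPEG2",        0.45,  2.21, 214.31},
            {"QR",           0.37,  3.99, 294.82},
            {"Render",       1.61,  8.59, 205.05},
        };
        for (const Row& r : rows)
            std::printf("%-12s  SRF/memory %6.1fx   clusters/SRF %6.1fx\n",
                        r.name, r.srf / r.mem, r.clusters / r.srf);
        return 0;
    }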

That is, kernels execute on the arithmetic clusters and streams pass between kernels through the SRF. In this manner, the stream programming model maps directly to the Imagine architecture. As a result, Imagine's 48 on-chip arithmetic units can take advantage of the high ratio of arithmetic operations to memory accesses. The SIMD nature of the arithmetic clusters and compound stream operations enables Imagine to exploit data parallelism.

Finally, multiprocessor solutions can take advantage of even more parallelism by using the network interface, which supports a peak bandwidth of 4 Gbytes per second in and out. For example, the stereo depth extraction application could use different Imagine processors to work on different portions of the image to be convolved. The application could exploit even more parallelism with control partitioning: One set of processors could work on the convolution stage and send the result streams to another set working on the SAD stage.

Bandwidth hierarchy

The stream programming model also exposes the application's bandwidth requirements to the hardware. Imagine exploits this by providing a three-level bandwidth hierarchy:2 off-chip memory bandwidth (2.67 Gbytes per second), SRF bandwidth (32 Gbytes/s), and intracluster bandwidth (544 Gbytes/s). The three levels of the bandwidth hierarchy correspond to the three columns of Figure 3b. Stream programs' communication patterns match this bandwidth hierarchy. Stream programs use memory bandwidth only for application input and output and when intermediate streams cannot fit in the SRF and must spill to memory. SRF bandwidth serves only when streams pass between kernels. Finally, intracluster bandwidth into and out of the LRFs handles the bulk of data during kernel execution.

Table 1 shows how four stream applications use Imagine's peak bandwidths. The four applications are the stereo depth extractor (Depth) described earlier, a video-encoding application (MPEG2), a QR matrix decomposition (QR), and polygon rendering on the SPECViewperf 6.1 ADVS-1 benchmark (Render). Each application shows roughly an order-of-magnitude increase in bandwidth requirements across each hierarchy level. By keeping the 48 arithmetic units supplied with data, the bandwidth hierarchy enables Imagine to sustain a large amount of its peak performance in these applications.

Imagine's bandwidth hierarchy also matches the capabilities of modern VLSI technology. The off-chip memory bandwidth is roughly the maximum processor-to-DRAM bandwidth available in today's technology. The SRF bandwidth is roughly the maximum bandwidth available from a single on-chip SRAM. The intracluster bandwidth is approximately the maximum bandwidth attainable from on-chip interconnect.

Streaming memory system. The memory system supports data stream transfers between the SRF and off-chip DRAM. Imagine makes all memory references using stream Load and Store instructions that transfer an entire stream between memory and the SRF. In the convolution stage example, stream elements were laid out sequentially in memory. However, the memory system also supports strided, indexed, and bit-reversed record addressing, for which records must be contiguous. In addition, because memory-reference granularity is streams, users can optimize the memory system for stream throughput rather than the reference latency of individual stream elements. A stream consumer, such as a kernel executing in the arithmetic clusters, cannot begin until the entire stream is available in the SRF. Therefore, the time it takes to load the entire stream is far more important than the latency of any particular reference. Moreover, these references and computation can easily overlap. During the current set of computations, the memory system can load streams for the next set and store streams from the previous set.

Imagine uses memory access scheduling to reorder the DRAM operations (bank precharge, row activation, and column access) necessary to complete the set of currently pending memory references. When scheduling memory accesses, the memory system can delay references to let other references access the DRAM, improving memory bandwidth. Furthermore, a good reordering reduces the average latency of memory references by using the DRAM's internal resources more efficiently. Imagine's memory access scheduling has achieved a 30 percent performance improvement on a set of representative media-processing applications.3
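A minimal sketch of the reordering idea, in C++: drain pending references that hit a bank's already-open row before paying for a precharge and a new activation. This illustrates the policy described above, not Imagine's actual scheduler (see reference 3); the request and bank-state structures are assumptions.

    // Hedged sketch of memory access scheduling: prefer column accesses that
    // hit the already-open row of a bank before closing it. Illustrative only.
    #include <cstdint>
    #include <deque>
    #include <unordered_map>
    #include <vector>

    struct Ref { std::uint32_t bank; std::uint32_t row; std::uint32_t column; };
    struct BankState { bool open = false; std::uint32_t openRow = 0; };

    std::vector<Ref> schedule(std::deque<Ref> pending,
                              std::unordered_map<std::uint32_t, BankState>& banks) {
        std::vector<Ref> order;
        while (!pending.empty()) {
            bool issued = false;
            // First look for a row hit anywhere in the pending queue.
            for (auto it = pending.begin(); it != pending.end(); ++it) {
                BankState& b = banks[it->bank];
                if (b.open && b.openRow == it->row) {   // row hit: column access only
                    order.push_back(*it);
                    pending.erase(it);
                    issued = true;
                    break;
                }
            }
            if (issued) continue;
            // Otherwise take the oldest reference: precharge + activate its row.
            Ref r = pending.front();
            pending.pop_front();
            BankState& b = banks[r.bank];
            b.open = true;
            b.openRow = r.row;                          // row activation
            order.push_back(r);
        }
        return order;
    }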

40 IEEE MICRO Stream Y: Stream X: 2 blocks 3 blocks 2 blocks Y31 X95 X63 X31 Y31 From memory system Y33

32 Y33 words Y32 Y0

22

32 words stream buffers 32 words To cluster 7 X63 X31 LRFs X15 X7

X48 X16 To cluster 0 X47 X15 LRFs 32 X8 X0 words X32 X0 Y0X64 X32 X0

1,024 blocks

SRAM Stream buffers Clients

Transfers between SRAM and stream buffers Transfers between stream buffers and clients Space allocated for stream Y but not yet written

Stream register file. The SRF contains a 128-Kbyte SRAM organized as 1,024 blocks of 32 words of 32 bits each. Figure 4 shows two concurrent SRF accesses. Stream Y (length 64 words) is being written from the memory system to the SRF, and stream X (length 96 words) is being read from the SRF by the arithmetic clusters. All client stream accesses go through stream buffers, which can hold two blocks of a stream each. Imagine has a total of 22 stream buffers, each with different-size ports:

• eight cluster stream buffers (eight words per cycle),
• eight network stream buffers (two words per cycle),
• four memory system stream buffers (one word per cycle),
• one microcontroller stream buffer (one word per cycle), and
• one stream buffer that interfaces to the host processor (one word per cycle).
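The sketch below models the two-block buffering just described, together with the prefetch behavior the following paragraphs explain: the client drains words through its own port while a drained 32-word block is replaced by the next block from the single-ported SRF SRAM. The block size and two-block depth come from the article; the interface is an assumed simplification, not the actual hardware.

    // Hedged sketch of a stream buffer: two resident 32-word blocks,
    // drained at the client's port width and refilled a block at a time.
    #include <array>
    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kBlockWords = 32;

    struct StreamBuffer {
        std::array<std::uint32_t, 2 * kBlockWords> words{};  // two resident blocks
        std::size_t readPos = 0;                              // next word for the client

        // Client side: read up to portWidth words per access (eight for a cluster).
        std::size_t drain(std::uint32_t* dst, std::size_t portWidth) {
            std::size_t n = 0;
            while (n < portWidth && readPos < words.size())
                dst[n++] = words[readPos++];
            return n;
        }

        // One block has been fully consumed, so the SRAM port can supply the next.
        bool needsNextBlock() const { return readPos >= kBlockWords; }

        void acceptBlock(const std::uint32_t* nextBlock) {
            for (std::size_t i = 0; i < kBlockWords; ++i)
                words[i] = words[i + kBlockWords];            // slide block 1 down
            for (std::size_t i = 0; i < kBlockWords; ++i)
                words[kBlockWords + i] = nextBlock[i];        // new block from SRAM
            readPos -= kBlockWords;
        }
    };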


Different clients have different numbers of stream buffers with various-size ports to provide the bandwidths and number of concurrent streams necessary for a typical application.

During execution, active stream buffers contend for access to a SRAM block. In Figure 4, the second block of stream X is currently being read from the SRAM and written into the bottom stream buffer. The clusters read eight words out of their stream buffers at a time (one word to each cluster). In this example, the clusters have already completed two reads (X0-X7 and X8-X15) on previous cycles from the bottom stream buffer. Once the clusters have read elements X16-X31, the third block of stream X (X64-X95) replaces X0-X31 in the stream buffer.

Concurrently, the memory system is writing element Y33 of stream Y into its stream buffer. Once elements Y34-Y63 are written into the stream buffer, the second block of stream Y (Y32-Y63) is transferred to the SRAM, and the SRF stream write ends. The first block of the stream (Y0-Y31) was transferred into the SRF on a previous cycle.

Because all data and accesses to the SRF are organized as streams, stream buffers can take advantage of the sequential access pattern of streams. Stream buffers prefetch data one block at a time from the SRF, while clients read data from stream buffers at lower bandwidths. As a result, stream buffers effectively time-multiplex the single physical port of the SRF SRAM into 22 logical ports that can be accessed simultaneously. In contrast with building large multiported structures to provide the necessary bandwidth, stream buffers lead to an area- and power-efficient VLSI implementation.4 Moreover, they provide a simple, extensible mechanism by which clients can execute stream transfers with the SRF.

Arithmetic clusters. Each arithmetic cluster contains eight functional units: three adders, two multipliers, a divide and square-root unit, a scratch-pad register file unit, and an intercluster communication unit. The adders and multipliers execute floating-point and integer arithmetic with a throughput of one instruction per cycle at varying latencies. They also support parallel subword operations such as those in MMX5 and other multimedia extensions.6 The divide and square-root unit executes floating-point and integer operations and is pipelined to accept two instructions every 13 cycles. The scratch-pad unit is a 256-word register file that executes indexed read and write instructions useful for small table lookups. The intercluster communication unit, a software-controlled crossbar between the clusters, is used when kernels are not fully data parallel. The communication unit is also used by conditional streams,7 a mechanism for handling data-dependent conditionals in a stream architecture.

In front of each functional unit is a small, 16-word, two-ported LRF. (Multipliers have 32-word LRFs because many kernels require a larger number of registers in the multiplier than in other units.) The intracluster switch transfers data between functional units or cluster stream buffers and LRFs. The switch consists of a multiplexer in front of every LRF input; the multiplexer selects a functional unit or cluster stream buffer output to write from. The LRFs are a much more area- and power-efficient structure for providing local bandwidth and storage than the equivalent multiported global register file. A multiported register file has, in effect, an implicit switch that is replicated many times inside each register cell. The distributed LRFs replace this structure with an explicit switch outside the LRFs.
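A structural sketch of one cluster as just described. The functional-unit mix, LRF depths, scratch-pad size, and divide/square-root throughput noted in the comments come from the article; the C++ types, and the assumption of two data inputs (hence two LRFs) per unit, are illustrative.

    // Hedged structural sketch of one arithmetic cluster.
    #include <array>
    #include <cstddef>
    #include <cstdint>

    template <std::size_t Words>
    struct LocalRegisterFile {              // small, two-ported LRF per unit input
        std::array<std::uint32_t, Words> regs{};
    };

    struct FunctionalUnit {                 // adders, divide/sqrt, comm unit: 16-word LRFs
        LocalRegisterFile<16> inputA;
        LocalRegisterFile<16> inputB;
    };

    struct MultiplierUnit {                 // multipliers get deeper, 32-word LRFs
        LocalRegisterFile<32> inputA;
        LocalRegisterFile<32> inputB;
    };

    struct ArithmeticCluster {
        std::array<FunctionalUnit, 3> adders;
        std::array<MultiplierUnit, 2> multipliers;
        FunctionalUnit divideSqrt;                     // two instructions per 13 cycles
        std::array<std::uint32_t, 256> scratchPad{};   // indexed read/write table lookups
        FunctionalUnit interclusterComm;               // software-controlled crossbar port
    };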

Evaluation

The Imagine stream architecture directly runs media applications written in the stream programming model and expressed in two programming languages: StreamC and KernelC. A StreamC program describes the interactions of streams and kernels and corresponds to the pseudocode in Figure 3a. A kernel is written in KernelC and corresponds to the pseudocode in Figure 3c. Both languages use C-like syntax. StreamC is compiled by a stream scheduler, which handles SRF block allocation and memory management. The stream scheduler runs on the host processor and uses a combination of static and runtime techniques to send stream instructions to the Imagine processor's stream controller. KernelC is compiled statically by an optimizing VLIW kernel scheduler that handles the communication required by the distributed register file architecture of the arithmetic clusters.8

Using these programming tools, we wrote kernels and applications for the Imagine processor. We simulated the four applications listed in Table 2 on a cycle-accurate C++ simulator, assuming a 500-MHz clock rate and a 167-MHz SDRAM. Individual kernels are usually part of a larger application, so we simulated the kernels with their working sets already in the SRF.

Table 2 shows that Imagine sustained over half its peak performance on a range of applications and kernels. The performance differences between kernels and applications reflect the overhead of accessing off-chip memory and the start-up cost of stream transfers. An example is the performance degradation between the discrete cosine transform kernel and the rest of the MPEG-2 encoding application.

Sidebar: Imagine and related work

Prior industry and research media processors fall into four categories:

• VLIW media processors and DSPs,
• SIMD extensions,
• hardwired stream processors, and
• vector processors.

VLIW DSPs, such as the Texas Instruments C6x,1 and VLIW media processors, such as Trimedia,2 efficiently exploit the instruction-level parallelism in media applications. SIMD extensions of instruction sets enable many general-purpose processors3,4 to perform low-precision operations and exploit fine-grained data parallelism. Imagine's VLIW arithmetic clusters and subword arithmetic operations build on this previous work. However, the stream architecture enables Imagine to scale to much larger numbers of ALUs. Moreover, because both the VLIW and the SIMD approaches focus on architecture, neither addresses the data bandwidth problem solved by Imagine's register hierarchy.

Hardwired stream processors attempt to map the stream programming model directly into hardware. Cheops,5 for example, consists of a set of specialized stream processors. Each processor accepts one or two data streams as input and produces one or two data streams as output. Data streams are either forwarded directly from one stream processor to the next according to the application's dataflow graph or transferred between memory and the stream processors.

Programmable stream processors like Imagine are closest in spirit to vector processors. Vector supercomputers,6 SIMD processor arrays, and vector microprocessors such as the T0,7 Vector IRAM,8 and others efficiently use vectors to exploit data parallelism. Vector memory systems are suitable for media processing because they are optimized for bandwidth instead of latency. Vector processors perform simple arithmetic operations on vectors stored in a vector register file, using three words of vector register bandwidth per arithmetic operation. In contrast, Imagine performs compound stream operations, using local register files to stage intermediate results. This reduces bandwidth demand on the stream register file by an order of magnitude or more. Imagine also differs from vector processors by using stream buffers to take advantage of the sequential data access patterns of streams and by supporting data records. Finally, vector processors execute instructions from one sequencer that manages tightly coupled vector and scalar units. In contrast, Imagine works directly on the stream programming model. Therefore, Imagine maintains an explicit division between stream-sequencing instructions (written in StreamC) executed on the host processor and VLIW instructions (written in KernelC) executed on Imagine's arithmetic clusters.

Sidebar references

1. T. Halfhill, "TI Cores Accelerate DSP Arms Race," Microprocessor Report, Mar. 2000, pp. 22-28.
2. S. Rathnam and G. Slavenburg, "An Architectural Overview of the Programmable Multimedia Processor, TM-1," Proc. Compcon, IEEE Computer Society Press, Los Alamitos, Calif., 1996, pp. 319-326.
3. A. Peleg and U. Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro, vol. 17, no. 4, Aug. 1996, pp. 42-50.
4. M. Tremblay et al., "VIS Speeds New Media Processing," IEEE Micro, vol. 16, no. 4, Aug. 1996, pp. 10-20.
5. V.M. Bove Jr. and J.A. Watlington, "Cheops: A Reconfigurable Data-Flow System for Video Processing," IEEE Trans. Circuits and Systems for Video Technology, vol. 5, no. 2, Apr. 1995, pp. 140-149.
6. R. Russell, "The Cray-1 Computer System," Comm. ACM, vol. 21, no. 1, Jan. 1978, pp. 63-72.
7. J. Wawrzynek et al., "Spert-II: A Vector Microprocessor System," Computer, vol. 29, no. 3, Mar. 1996, pp. 79-86.
8. C. Kozyrakis, A Media-Enhanced Vector Architecture for Embedded Memory Systems, tech. report UCB/CSD-99-1059, Univ. of California at Berkeley, Computer Science Div., Berkeley, Calif., 1999.


Table 2. Application and kernel performance.

                              Arithmetic                            Power
                              bandwidth                             estimate (W)   Application performance
Applications
  Depth                       12.1 GOPS (16-bit)                    2.9            320 × 240 8-bit gray scale at 212 frames/s
  MPEG2                       18.3 GOPS (16- and 8-bit)             2.2            720 × 480 24-bit color at 105 frames/s
  QR                          13.1 Gflops                           3.6            192 × 96 matrix decomposition in 1.1 ms
  Render                      5.1 GOPS (16-bit and floating-point)  2.9            14.9 million vertices/s (16.8 million pixels/s)
Kernels
  Discrete cosine transform   22.6 GOPS (16-bit)                    2.6            34.8 ns per 8 × 8 block (16-bit)
  Convolve7×7                 25.6 GOPS (16-bit)                    3.0            1.5 µs per row of 320 16-bit pixels
  Fast Fourier transform      6.9 Gflops                            3.8            7.4 µs per 1,024-point floating-point complex FFT

Table 2 also shows Imagine's estimated power dissipation. We used trial layouts, netlists, and SRAM data sheets in conjunction with average unit occupancies from the cycle-accurate simulator to estimate the total capacitance switched at runtime. We expect Imagine to dissipate less than 4 W over a range of applications, providing power efficiencies of more than 2 Gflops/W. Imagine's power efficiency stems from its efficient register file organization. For architectures with many ALUs, the SRF and LRF structures dissipate much less power in inter-ALU communications than do conventional register file organizations.4

Imagine's performance and power efficiency across a range of applications are roughly an order of magnitude higher than those of today's programmable processors such as the Texas Instruments TMS320C67x. For comparison, a 167-MHz TI TMS320C6701 executes a 1,024-point floating-point complex FFT in 124.3 µs with an estimated 1.9-W power dissipation (0.41 Gflops and 0.22 Gflops/W).9 Imagine's performance is also comparable to that of special-purpose processors on complex applications such as polygon rendering. A study compared Imagine polygon-rendering simulations with a 120-MHz NVIDIA Quadro with DDR SDRAM on an Intel AGP 4X system.10 In rendering a rasterization-limited scene, the NVIDIA Quadro provided a 5.3-times speedup over Imagine. For geometry-limited scenes (such as the SPECViewperf 6.1 ADVS-1 polygon-rendering scene with point-sampled texture listed in Table 2), Imagine outperformed the NVIDIA Quadro.
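The FFT comparison above works out as follows, assuming the conventional 5N log2 N operation count for an N-point complex FFT (that formula is an assumption; the execution times and power figures are the article's and Table 2's):

    // Recompute the Gflops and Gflops/W figures quoted for the 1,024-point FFT.
    #include <cstdio>

    int main() {
        const double flops = 5.0 * 1024 * 10;            // 5*N*log2(N), N = 1,024
        const double c6701 = flops / 124.3e-6;           // TMS320C6701: 124.3 microseconds
        const double imagine = flops / 7.4e-6;           // Imagine (Table 2): 7.4 microseconds
        std::printf("C6701:   %.2f Gflops, %.2f Gflops/W\n",
                    c6701 / 1e9, c6701 / 1e9 / 1.9);      // about 0.41 and 0.22
        std::printf("Imagine: %.2f Gflops, %.2f Gflops/W\n",
                    imagine / 1e9, imagine / 1e9 / 3.8);  // about 6.9 and 1.8
        return 0;
    }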

VLSI implementation

We are currently designing a prototype Imagine stream processor. Our goal is to validate architectural studies and provide an experimental prototype for future research. The processor consists of approximately 21 million transistors: 6 million in the SRF SRAM, 6 million in the microcode store, 6 million in the arithmetic clusters, and 3 million elsewhere. We have targeted the prototype to operate at 500 MHz in a 0.15-µm, 1.5-V static CMOS standard-cell technology. As Figure 5 shows, Imagine has a die size of 1.44 cm². It is also area-efficient. Nearly 40 percent of the die area is devoted to the arithmetic clusters.

[Figure 5. Floor plan of the Imagine prototype. The 12 mm × 12 mm die contains the host interface/stream controller, network interface, microcontroller, eight ALU clusters, the stream register file, and the memory system.]

In addition to providing high performance density, the Imagine stream architecture meets the challenges of modern VLSI design complexity. The architecture's regularity lets designers heavily optimize one module and then instantiate it in many parts of the design. The statically scheduled arithmetic clusters contain explicit communication with only local data forwarding, eliminating the need for complicated global control structures. Data and control communication between different storage hierarchy levels (LRFs, SRF, and streaming memory system) is local, mitigating the effect of long wire delays.

The Imagine design team consists of eight graduate students. Two to three people work on architecture and logic design, two to three on software tools, and the others on VLSI implementation. With this limited staff, we need a CAD flow that minimizes design time with minimal performance degradation. We use a typical application-specific integrated circuit synthesis flow with automatic placement and routing for control blocks. For regular data path blocks such as the multiplier array, we avoid synthesis, instead using data-path-style placement of standard cells with automatic routing. We synthesize other, less regular blocks, such as the floating-point adder, and use data-path-style standard-cell placement. This CAD flow provides a good trade-off between design time and performance and helps the team build a high-performance processor in spite of the limited resources.

We completed many design tools and studies to verify Imagine's performance and VLSI feasibility. We developed a cycle-accurate C++ simulator, which we used to generate all performance numbers given here. We also completed a synthesizable register-transfer-level model of Imagine and mapped the design to logic gates. Placement and routing are in progress, and we expect to have prototype processors available in August 2001.

To gain more experience with stream programming, we plan to build Imagine systems ranging from a single-processor, 20-Gflops system to a 64-processor, Tflops system for demanding applications. Stream processing is a promising technology now in its infancy. Our future research will focus on generalizing the stream model for a wider range of applications and refining the stream processor architecture. MICRO

Acknowledgments

We thank all the past and present members of the Imagine team. We also thank our industrial partners Intel and Texas Instruments. Our research was supported by the Defense Advanced Research Projects Agency under ARPA order E254 and monitored by the Army Intelligence Center under contract DABT63-96-C-0037. Several of the authors received support from the National Science Foundation and/or Stanford graduate fellowships.

References

1. T. Kanade et al., "Development of a Video-Rate Stereo Machine," Proc. Int'l Robotics and Systems Conf., IEEE Press, Piscataway, N.J., 1995, pp. 95-100.
2. S. Rixner et al., "A Bandwidth-Efficient Architecture for Media Processing," Proc. 31st Int'l Symp. Microarchitecture, IEEE Computer Society Press, Los Alamitos, Calif., 1998, pp. 3-13.
3. S. Rixner et al., "Memory Access Scheduling," Proc. 27th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, 2000, pp. 128-138.
4. S. Rixner et al., "Register Organization for Media Processing," Proc. Sixth Int'l Symp. High-Performance Computer Architecture, IEEE CS Press, 2000, pp. 375-387.
5. A. Peleg and U. Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro, vol. 17, no. 4, Aug. 1996, pp. 42-50.
6. M. Tremblay et al., "VIS Speeds New Media Processing," IEEE Micro, vol. 16, no. 4, Aug. 1996, pp. 10-20.
7. U. Kapasi et al., "Efficient Conditional Operations for Data-Parallel Architectures," Proc. 33rd Int'l Symp. Microarchitecture, IEEE CS Press, 2000, pp. 159-170.
8. P. Mattson et al., "Communication Scheduling," Proc. Ninth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 2000, pp. 82-92.
9. T. Halfhill, "TI Cores Accelerate DSP Arms Race," Microprocessor Report, Mar. 2000, pp. 22-28.
10. J. Owens et al., "Polygon Rendering on a Stream Architecture," Proc. SIGGRAPH/Eurographics Workshop on Graphics Hardware, ACM Press, New York, 2000, pp. 23-32.


Brucek Khailany is a PhD candidate in the Computer Systems Laboratory at Stanford University. He leads the VLSI implementation of the Imagine processor. Khailany has an MS in electrical engineering from Stanford University and a BS in electrical engineering from the University of Michigan. He is a student member of the IEEE and the ACM.

William J. Dally is a professor of electrical engineering and computer science at Stanford University, where his group developed the Imagine processor. He currently leads projects on high-speed signaling, computer architecture, and network architecture. Dally has a BS in electrical engineering from Virginia Polytechnic Institute, an MS in electrical engineering from Stanford University, and a PhD in computer science from Caltech. He is a member of the Computer Society, the Solid-State Circuits Society, the IEEE, and the ACM.

Ujval J. Kapasi is a PhD candidate in the Department of Electrical Engineering at Stanford University. He is one of the architects of the Imagine stream processor, and his research interests lie in parallel computer systems and software support for these systems. Kapasi holds an MS from Stanford University and a BS from Brown University.

Peter Mattson is a PhD candidate at Stanford University. His research interests include programming languages and compilers for high-performance processors. He has a BS in electrical engineering from the University of Washington and an MS in electrical engineering from Stanford University. He is a member of the IEEE and the ACM.

Jinyung Namkoong is working toward the PhD in electrical engineering at Stanford University. His research interests include VLSI circuit techniques, systems architecture, and interconnection networks. Namkoong has a BS in electrical engineering and computer science from the University of California at Berkeley and an MS in electrical engineering from Stanford University.

John D. Owens is a PhD candidate in electrical engineering at Stanford University, where he is an architect of the Imagine stream processor. His research interests are computer architecture and computer graphics. Owens has a BS in electrical engineering and computer science from the University of California at Berkeley and an MS in electrical engineering from Stanford.

Brian Towles is a master's student in electrical engineering at Stanford. He is a system architect for the Imagine processor, with research interests in computer architecture and on-chip interconnection networks. Towles has a BS in computer engineering from the Georgia Institute of Technology.

Andrew Chang is a PhD candidate in electrical engineering at Stanford. His research interests include VLSI CAD, computer architecture, and circuit design. Chang received BS and MS degrees from the Massachusetts Institute of Technology. He is a student member of the IEEE and the ACM.

Scott Rixner is an assistant professor of computer science and electrical and computer engineering at Rice University. Previously, he was the lead architect for the Imagine stream processor. His interests include computer architecture and media processing. Rixner has a BS and an MS in computer science and engineering and a PhD in electrical engineering from the Massachusetts Institute of Technology. He is a member of Tau Beta Pi and the ACM.

Direct questions and comments about this article to Brucek Khailany, Stanford University, Computer Systems Laboratory, Stanford, CA 94305; [email protected].
