Imagine: Media Processing with Streams
The power-efficient Imagine stream processor achieves performance densities comparable to those of special-purpose embedded processors. Executing programs mapped to streams and kernels, a single Imagine processor is expected to have a peak performance of 20 GFLOPS and sustain 18.3 GOPS on MPEG-2 encoding.

Brucek Khailany, William J. Dally, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong, John D. Owens, Brian Towles, and Andrew Chang, Stanford University; Scott Rixner, Rice University

0272-1732/01/$10.00 © 2001 IEEE

Media-processing applications, such as signal processing, 2D- and 3D-graphics rendering, and image and audio compression and decompression, are the dominant workloads in many systems today. The real-time constraints of media applications demand large amounts of absolute performance and high performance densities (performance per unit area and per unit power). Therefore, media-processing applications often use special-purpose (custom), fixed-function hardware. General-purpose solutions, such as programmable digital signal processors (DSPs), offer increased flexibility but achieve performance density levels two or three orders of magnitude worse than special-purpose systems.

One reason for this performance density gap is that conventional general-purpose architectures are poorly matched to the specific properties of media applications. These applications share three key characteristics. First, operations on one data element are largely independent of operations on other elements, resulting in a large amount of data parallelism and high latency tolerance. Second, there is little global data reuse. Finally, the applications are computationally intensive, often performing 100 to 200 arithmetic operations for each element read from off-chip memory.

Conventional general-purpose architectures don't efficiently exploit the available data parallelism in media applications. Their memory systems depend on caches optimized for reducing latency and data reuse. Finally, they don't scale to the numbers of arithmetic units or registers required to support a high ratio of computation to memory access. In contrast, special-purpose architectures take advantage of these characteristics because they effectively exploit data parallelism and computational intensity with a large number of arithmetic units. Also, special-purpose processors directly map the algorithm's dataflow graph into hardware rather than relying on memory systems to capture locality.

Another reason for the performance density gap is the constraints of modern technology. Modern VLSI computing systems are limited by communication bandwidth rather than arithmetic. For example, in a contemporary 0.15-micron CMOS technology, a 32-bit integer adder requires less than 0.05 mm² of chip area. Hundreds to thousands of these arithmetic units fit on an inexpensive 1-cm² chip. The challenge is supplying them with instructions and data. General-purpose processors that rely on global structures such as large multiported register files to provide data bandwidth cannot scale to the number of arithmetic logic units (ALUs) required for high performance rates in media applications. In contrast, special-purpose processors have many ALUs connected by dedicated wires and buffers to provide needed data and control bandwidth.

[Figure 1. Stereo depth extraction: left and right camera image streams each pass through convolve 7×7 and convolve 3×3 kernels; a SAD kernel combines the filtered streams into the output depth map.]
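The adder-count claim in the technology discussion can be sanity-checked with straightforward arithmetic. This sketch uses only the figures quoted in the text (0.05 mm² per 32-bit adder, a 1-cm² die); the pure area-division packing model is a deliberate simplification that ignores wiring, register, and control overhead.

```python
# Rough check of how many 32-bit adders fit on an inexpensive die,
# using the article's figures for a 0.15-micron CMOS process.
adder_area_mm2 = 0.05      # upper bound on one 32-bit adder's area
die_area_mm2 = 100.0       # 1 cm^2 die = 100 mm^2

# Naive packing: area division only, no interconnect or control overhead.
max_adders = int(die_area_mm2 / adder_area_mm2)
print(max_adders)          # consistent with "hundreds to thousands"
```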
However, special-purpose solutions lack the flexibility to work effectively on a wide application space.

To provide programmability yet maintain high performance densities, we developed the Imagine stream processor at Stanford University. Imagine consists of a programming model, software tools, and an architecture, all designed to operate directly on streams. The stream programming model exposes the locality and concurrency in media applications so they can be exploited by the stream architecture. This architecture uses a novel register file organization that supports 48 floating-point arithmetic units. We expect a prototype Imagine processor to achieve 18.3 giga operations per second (GOPS) in MPEG-2 encoding applications, corresponding to 105 frames per second on a 720 × 480-pixel, 24-bit color image while dissipating 2.2 W.

Stream programming model

The stream programming model allows simple control, makes communication explicit, and exposes the inherent parallelism of media applications. A stream program organizes data as streams and expresses all computation as kernels. A stream is a sequence of similar elements. Each stream element is a record, such as 21-word triangles or 8-bit pixels. A kernel consumes a set of input streams, performs a computation, and produces a set of output streams. Streams passing among multiple computation kernels form a stream program.

Figure 1 shows how we map stereo depth extraction [1], a typical image-processing application, to the stream programming model. Stereo depth extraction calculates the disparity between objects in a left and right grayscale camera image pair. It outputs a depth map corresponding to each object's distance from the cameras (in the example depth map, brighter pixels are closer to the camera). The input camera images are formatted as streams; in the example, a single stream contains a row of pixels from the image. The row streams flow through two convolution kernels, which produce a filtered stream. The filtered streams then pass through a sum-of-absolute-differences (SAD) kernel, which produces a stream containing a row of the final depth map. Each kernel also outputs a stream of partial sums (represented by the circular arrows in the diagram) needed by the same kernel as it processes future rows.

Stereo depth extraction possesses important characteristics common to all media-processing applications:

• Little data reuse. Pixels are read once from memory and are not revisited.
• High data parallelism. The same set of operations independently computes all pixels in the output image.
• Computationally intensive. The application requires 60 arithmetic operations per memory reference.

Mapping an application to streams and kernels exposes these characteristics so that hardware can easily exploit them. Streams expose an application's inherent data parallelism. The flow of streams between kernels exposes the application's communication requirements.

Stream architecture

Imagine is a programmable processor that directly executes applications mapped to streams and kernels. Figure 2 diagrams the Imagine stream architecture.

[Figure 2. Imagine stream architecture: host processor, stream controller, microcontroller, eight arithmetic clusters with local register files, stream register file, streaming memory system with four SDRAM channels, and a network interface to other Imagines or I/O devices; labeled bandwidths of 2.67 Gbytes/s, 32 Gbytes/s, and 544 Gbytes/s.]

The processor consists of a 128-Kbyte stream register file (SRF), 48 floating-point arithmetic units in eight arithmetic clusters controlled by a microcontroller, a network interface, a streaming memory system with four SDRAM channels, and a stream controller. Imagine is controlled by a host processor, which sends it stream instructions. The following are the main stream instructions:

• Load transfers streams from off-chip DRAM to the SRF.
• Store transfers streams from the SRF to off-chip DRAM.
• Receive transfers streams from the network to the SRF.
• Send transfers streams from the SRF to the network.
• Cluster op executes a kernel in the arithmetic clusters that reads input streams from the SRF, computes output streams, and writes the output streams to the SRF.
• Load microcode loads streams consisting of kernel microcode (576-bit very long instruction word, or VLIW, instructions) from the SRF into the microcontroller instruction store (a total of 2,048 instructions).

Imagine supports streams up to 32K (32,768) words long. The host processor sends stream instructions to the stream controller. When instructions are ready for issue, the stream controller decodes them and issues the necessary commands to other on-chip modules.

Each arithmetic cluster, detailed on the right side of Figure 2, contains eight functional units. A small two-ported local register file (LRF) connects to each input of each functional unit. An intracluster switch connects the outputs of the functional units to the inputs of the LRFs. During kernel execution, the same VLIW instruction is broadcast to all eight clusters in single-instruction, multiple-data (SIMD) fashion.

Kernels typically loop through all input-stream elements, performing a compound stream operation on each element. A compound stream operation reads an element from its input stream(s) in the SRF and computes a series of arithmetic operations (for example, matrix multiplication or convolution). Then it appends the results to output streams that are transferred back to the SRF. A compound stream operation stores all temporary data in the LRFs. It sends only final results to the SRF.
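The stream-and-kernel structure described in this section can be mimicked in ordinary software. The sketch below is purely illustrative: it models Figure 1's row pipeline with a toy 1D filter and an element-wise SAD, and the function names, filter taps, and data are invented for the example, not taken from Imagine's kernels or toolchain.

```python
# Toy model of the stream programming model: a stream is a sequence of
# similar elements; a kernel consumes input streams and produces output
# streams, communicating only through those streams.

def convolve3(src, taps):
    """Kernel: 1D convolution over a row stream (zero-padded at the edges)."""
    r = len(taps) // 2
    padded = [0] * r + list(src) + [0] * r
    return [sum(taps[k] * padded[i + k] for k in range(len(taps)))
            for i in range(len(src))]

def sad(left, right):
    """Kernel: element-wise sum-of-absolute-differences of two row streams."""
    return [abs(a - b) for a, b in zip(left, right)]

# A stream program is just kernels connected by streams (as in Figure 1).
left_row  = [10, 20, 30, 40]     # one row of the left camera image
right_row = [12, 18, 33, 39]     # one row of the right camera image
taps = [1, 2, 1]                 # toy low-pass filter, not Imagine's 7x7

depth_row = sad(convolve3(left_row, taps), convolve3(right_row, taps))
print(depth_row)
```

Note that each kernel keeps its temporaries internal and exchanges data only through whole streams, which is the property the text credits with exposing parallelism and making communication explicit.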
Compound stream operations are compound in the sense that they perform multiple arithmetic operations on each stream element. In contrast, conventional vector operations perform a single arithmetic operation on each vector element and store the

to the pixel elements' locations, which are laid out sequentially in memory. The memory system then reorders these accesses to maximize DRAM bandwidth. As the Load operation executes, elements are written back to the SRF in their original order. At completion of the operation, the SRF contains a stream consisting of all the pixels in a row.

Convolve7×7 kernel execution. Next, the stream controller issues pseudocode line 4, causing the convolve7×7 kernel (Figure 3c) to execute. This kernel has two input streams: the pixel stream just loaded from memory (src) and a stream containing the partial sums produced earlier in the application (old_partials7). During kernel execution, the microcontroller fetches VLIW microcode instructions from the instruction store every cycle.
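The contrast drawn above between compound stream operations and conventional vector operations can be sketched in a few lines. The arithmetic chain below is an invented example, with Python lists standing in for streams, LRF temporaries, and vector registers; the point is where intermediate values live, not the particular math.

```python
# Compound stream operation: for each element, perform a chain of
# arithmetic operations, keeping temporaries local (as the LRFs do),
# and append only the final result to the output stream.
def compound_op(in_stream):
    out_stream = []
    for x in in_stream:
        t1 = x * 3          # temporaries t1, t2 never leave the "cluster"
        t2 = t1 + 7
        out_stream.append(t2 * t2)
    return out_stream

# Conventional vector style: one arithmetic operation per pass, with a
# full intermediate vector materialized after every step.
def vector_ops(in_vec):
    v1 = [x * 3 for x in in_vec]     # whole intermediate vector stored
    v2 = [x + 7 for x in v1]         # another full intermediate
    return [x * x for x in v2]

data = [1, 2, 3]
assert compound_op(data) == vector_ops(data)  # same math, different locality
print(compound_op(data))
```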