COVER FEATURE

Programmable Stream Processors

Stream processing promises to bridge the gap between inflexible special-purpose solutions and current programmable architectures that cannot meet the computational demands of media-processing applications.

Ujval J. Kapasi, Stanford University
Scott Rixner, Rice University
William J. Dally, Stanford University
Brucek Khailany, Stanford University
Jung Ho Ahn, Stanford University
Peter Mattson, Reservoir Labs
John D. Owens, University of California, Davis

The complexity of modern media processing, including 3D graphics, image compression, and signal processing, requires tens to hundreds of billions of computations per second. To achieve these computation rates, current media processors use special-purpose architectures tailored to one specific application. Such processors require significant design effort and are thus difficult to change as media-processing applications and algorithms evolve.

The demand for flexibility in media processing motivates the use of programmable processors. However, very large-scale integration constraints limit the performance of traditional programmable architectures. In modern VLSI technology, computation is relatively cheap—thousands of arithmetic logic units that operate at multigigahertz rates can fit on a modestly sized 1-cm2 die. The problem is that delivering instructions and data to those ALUs is prohibitively expensive. For example, only 6.5 percent of the Itanium 2 die is devoted to the 12 integer and two floating-point ALUs and their register files1; communication, control, and storage overhead consume the remaining die area. In contrast, the more efficient communication and control structures of a special-purpose graphics chip, such as the GeForce4, enable the use of many hundreds of floating-point and integer ALUs to render 3D images.

STREAM PROCESSING

In part, such special-purpose media processors are successful because media applications have abundant parallelism—enabling thousands of computations to occur in parallel—and require minimal global communication and storage—enabling data to pass directly from one ALU to the next. A stream architecture exploits this locality and concurrency by partitioning the communication and storage structures to support many ALUs efficiently:

• operands for arithmetic operations reside in local register files (LRFs) near the ALUs, in much the same way that special-purpose architectures store and communicate data locally;
• streams of data capture coarse-grained locality and are stored in a stream register file (SRF), which can efficiently transfer data to and from the LRFs between major computations; and
• global data is stored off-chip only when necessary.

These three explicit levels of storage form a data bandwidth hierarchy, with the LRFs providing an order of magnitude more bandwidth than the SRF and the SRF providing an order of magnitude more bandwidth than off-chip storage.

This bandwidth hierarchy is well matched to the characteristics of modern VLSI technology, as each level provides successively more storage and less bandwidth. By exploiting the locality inherent in media-processing applications, this hierarchy stores the data at the appropriate level, enabling hundreds of ALUs to operate at close to their peak rate. Moreover, a stream architecture can support such a large number of ALUs in an area- and power-efficient manner.

54 Computer Published by the IEEE Computer Society 0018-9162/03/$17.00 © 2003 IEEE

Figure 1. Streams and a kernel from an MPEG-2 video encoder. (a) The Convert kernel translates a stream of macroblocks containing RGB pixels into streams of blocks containing luminance and chrominance pixels. (b) A StreamC program expresses the flow of streams through kernels textually.

(a) The Convert kernel (shown in the figure consuming the Input_Image stream and producing the Luminance and Chrominance streams):

    while ( !Input_Image.end() ) {
      // input next macroblock
      in = Input_Image.pop();

      // generate Luminance and
      // Chrominance blocks
      outY[0..3] = gen_L_blocks(in);
      outC[0..1] = gen_C_blocks(in);

      // output new blocks
      Luminance.push(outY[0..3]);
      Chrominance.push(outC[0..1]);
    }

(b) The application-level StreamC code:

    struct RGB_pixel {
      byte r, g, b;
    }
    stream struct MACROBLOCK {
      RGB_pixel pixels[16][16];
    }
    stream struct BLOCK {
      byte intensity[8][8];
    }

    stream Input_Image(NUM_MB);
    stream Luminance(NUM_MB*4),
           Chrominance(NUM_MB*2);

    Input_Image = Video_Feed.get_macroblocks(currpos, NUM_MB);
    currpos += NUM_MB;
    Convert(Input_Image, Luminance, Chrominance);

Modern high-performance microprocessors and digital signal processors continue to rely on global storage and communication structures to deliver data to the ALUs; these structures use more area and consume more power per ALU than a stream processor.

STREAMS AND KERNELS

The central idea behind stream processing is to organize an application into streams and kernels to expose the inherent locality and concurrency in media-processing applications. In most cases, not only do streams and kernels expose desirable properties of media applications, but they are also a natural way of expressing the application. This leads to an intuitive programming model that maps directly to stream architectures with tens to hundreds of ALUs.

Example application

Figure 1 illustrates input and output streams and a kernel taken from an MPEG-2 video encoder. Figure 1a shows how a kernel operates on streams graphically, while Figure 1b shows this in a simplified form of StreamC, a stream-programming language.

Input_Image is a stream that consists of image data from a camera. Elements of Input_Image are 16 × 16 pixel regions, or macroblocks, on which the Convert kernel operates. The kernel applies the same computation to the macroblocks in Input_Image, decomposing each one into six 8 × 8 blocks—four luminance blocks and two 4:1 subsampled chrominance blocks—and appends them to the Luminance and Chrominance output streams, respectively.

Streams. As Figure 1a shows, streams contain a set of elements of the same type. Stream elements can be simple, such as a single number, or complex, such as the coordinates of a triangle in 3D space. Streams need not be the same length—for example, the Luminance stream has four times as many elements as the input stream. Further, Input_Image could contain all of the macroblocks in an entire video frame, only a row of macroblocks from the frame, or even a subset of a single row. In the stream code in Figure 1b, the value of NUM_MB controls the length of the input stream.

Kernels. The Convert kernel consists of a loop that processes each element from the input stream. The body of the loop first pops an element from its input stream, performs some computation on that element, and then pushes the results onto the two output streams.

Kernels can have one or more input and output streams and perform complex calculations ranging from a few to thousands of operations per input element—one Convert implementation requires 6,464 operations per input macroblock to produce the six output blocks. The only external data that a kernel can access are its input and output streams. For example, Convert cannot directly access the data from the video feed; instead, the data must first be organized into a stream.
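The kernel discipline described above can be mimicked in a few lines. The following Python sketch is hypothetical (an analogue of the StreamC code, not StreamC itself), with macroblocks stubbed as integers and gen_L_blocks/gen_C_blocks standing in for the real colorspace arithmetic; it preserves the two properties the text emphasizes: the kernel touches only its streams, and the output lengths are fixed multiples of the input length.

```python
from collections import deque

# Hypothetical Python analogue of the Convert kernel (not StreamC).
# Macroblocks are stubbed as integers; gen_L_blocks and gen_C_blocks
# stand in for the real colorspace math, which the article does not give.

def gen_L_blocks(mb):
    return [("L", mb, i) for i in range(4)]   # four 8 x 8 luminance blocks

def gen_C_blocks(mb):
    return [("C", mb, i) for i in range(2)]   # two subsampled chrominance blocks

def convert(input_image):
    # A kernel touches only its input and output streams.
    luminance, chrominance = deque(), deque()
    while input_image:                  # while ( !Input_Image.end() )
        mb = input_image.popleft()      # in = Input_Image.pop()
        luminance.extend(gen_L_blocks(mb))
        chrominance.extend(gen_C_blocks(mb))
    return luminance, chrominance

NUM_MB = 8
lum, chrom = convert(deque(range(NUM_MB)))
assert len(lum) == 4 * NUM_MB       # Luminance is four times the input length
assert len(chrom) == 2 * NUM_MB     # Chrominance is twice the input length
```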

August 2003 55

[Figure 2: Stream-and-kernel graph of the I-frame encoder. The Input_Image stream flows into the Convert kernel, which produces luminance and chrominance streams. Each passes through a DCTQ kernel (with rate-control feedback) and an IQDCT kernel that regenerates the luminance and chrominance reference frames; run-level encoding and variable-length coding kernels then produce the compressed bitstream. A legend distinguishes stream, scalar, kernel, and global elements; the video frames and reference frames reside in global storage.]

Figure 2. MPEG-2 I-frame encoder mapped to streams and kernels. The encoder receives a stream of macroblocks from a frame in the video feed as input, and the first kernel (Convert) processes this stream. The discrete cosine transform (DCT) kernels then operate on the output streams produced by Convert. Q = quantization; IQ = inverse quantization.
Full application. A full application, such as the MPEG-2 encoder, is composed of multiple streams and kernels. This application compresses a sequence of video images into a single bitstream consisting of three types of frames: intracoded, predicted, and bidirectional. The encoder compresses I-frames using only information contained in the current frame, and it compresses P- and B-frames using information from the current frame as well as additional reference frames.

For example, Figure 2 shows one possible mapping of the portion of the MPEG-2 encoder application that encodes only I-frames into the stream-processing model. Solid arrows represent data streams, and ovals represent computation kernels. The encoder receives a stream of macroblocks from the video feed as input (Input_Image), and the first kernel (Convert) processes this input. The discrete cosine transform and quantization (DCTQ) kernels then operate on the output streams produced by Convert. Upon execution of all the computation kernels, the application either transmits the compressed bitstream over a network or saves it for later use. The application uses the two reference streams to compress future frames. The representation of this entire application in a language such as StreamC is a straightforward extension of the code shown in Figure 1b.

Locality and concurrency

By making communication between computation kernels explicit, the stream-processing model exposes both the locality and concurrency inherent in media-processing applications.

The model exposes locality by organizing communication into three distinct levels:

• Local. Temporary results that only need to be transferred between scalar operations within a kernel use local communication mechanisms. For example, temporary values in gen_L_blocks are only referenced within Convert. This type of communication cannot be used to transfer data between two different kernels.
• Stream. Data are communicated between computation kernels explicitly as data streams. In the MPEG-2 encoder, for example, the Luminance and Chrominance streams use this type of communication.
• Global. This level of communication is only for truly global data. This is necessary for communicating data to and from I/O devices, as well as for data that must persist throughout the application. For example, the MPEG-2 I-frame encoder uses this level of communication for the original input data from a video feed and for the reference frames, which must persist in the off-chip dynamic RAM (DRAM) throughout the processing of multiple video frames.

By requiring programmers to explicitly use the appropriate type of communication for each data element, the stream model expresses the application's inherent locality. For example, the model only uses streams to transfer data between kernels and does not burden them with temporary values generated within kernels. Likewise, the model does not use global communication for temporary streams.

The stream model also exposes concurrency in media-processing applications at multiple levels:

• Instruction-level parallelism (ILP). As in traditional processing models, the stream model can exploit parallelism between the scalar operations in a kernel function. For example, the operations in gen_L_blocks and gen_C_blocks can occur in parallel.
• Data parallelism. Because kernels apply the same computation to each element of an input stream, the stream model can exploit data parallelism by operating on several stream elements at the same time. For example, the model parallelizes the main loop in Convert so that multiple computation elements can each decompose a different macroblock.

• Task parallelism. Multiple computation tasks, including kernel execution and stream data transfers, can execute concurrently as long as they obey dependencies in the stream graph. For example, in the MPEG-2 I-frame application, the two DCTQ kernels could run in parallel.

[Figure 3: Block diagram of the baseline architecture. The application processor issues stream-level instructions to the clients of the stream register file: the kernel execution unit, the DRAM interface backed by off-chip DRAM, and other stream clients.]

Figure 3. Baseline programmable stream processor. The SRF streams data between clients—in this case the KEU, which is responsible for executing kernels and providing local communication for operations within the kernels, and the off-chip DRAM interface, which provides access to storage for global data.
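The dependency rule behind task parallelism fits in a few lines. This sketch is only an illustration, not Imagine's scheduler; the kernel and stream names follow Figure 2, while the *_coeffs output stream names are invented for the example.

```python
# Sketch of the stream-graph dependency rule (an illustration, not
# Imagine's scheduler). Kernel names follow Figure 2; the *_coeffs output
# stream names are invented for this example.

KERNELS = {
    "Convert": {"in": {"Input_Image"}, "out": {"Luminance", "Chrominance"}},
    "DCTQ_L":  {"in": {"Luminance"},   "out": {"Luminance_coeffs"}},
    "DCTQ_C":  {"in": {"Chrominance"}, "out": {"Chrominance_coeffs"}},
}

def independent(a, b):
    # Two kernels may overlap if neither consumes a stream the other produces.
    ka, kb = KERNELS[a], KERNELS[b]
    return not (ka["out"] & kb["in"]) and not (kb["out"] & ka["in"])

assert not independent("Convert", "DCTQ_L")   # DCTQ_L must wait for Convert
assert independent("DCTQ_L", "DCTQ_C")        # the two DCTQ kernels can overlap
```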

Languages such as StreamC and StreamIt2 that implement the stream model let programmers express communication explicitly at each level. A compiler can easily extract and optimize the locality and concurrency, as it does not need to discover them on its own.3

STREAM INSTRUCTION-SET ARCHITECTURE

In a conventional scalar instruction set, communication is implicit, making it difficult to exploit locality, and computation is expressed sequentially, making it difficult to exploit concurrency. In contrast, a stream instruction set explicitly communicates to the hardware the locality and concurrency that the stream model exposes.

Figure 3 shows the architecture of a baseline programmable stream processor, which consists of an application processor, a stream register file, and stream clients. The SRF serves as a communication link by streaming data between clients, as long as the data does not exceed its storage capacity. The two stream clients in the baseline stream processor—a programmable kernel execution unit and an off-chip DRAM interface—either consume data streams from the SRF or produce data streams to the SRF. The KEU executes kernels and provides local communication for operations within the kernels, while the off-chip DRAM interface provides access to global data storage.

The architecture can also support other stream clients, including interfaces to I/O devices, an interface to an off-chip interconnection network, or specialized kernel units such as a variable-length Huffman encoder. Multiple instances of any particular stream client are also possible.

Application processor ISA

The application processor executes application code like that in Figure 1b. The application processor's RISC-like instruction set is augmented with stream-level instructions to control the flow of data streams through the system. The application processor sequences these instructions and issues them to the stream clients, including the DRAM interface and the KEU.

The DRAM interface supports two stream-level instructions—load_stream and store_stream—that transfer an entire stream between off-chip DRAM and the SRF. Additional DRAM interface arguments can specify noncontiguous access patterns, such as nonunit strides and indirect access. The DRAM interface client also supports optional caches to increase access bandwidth.

The programmable KEU supports two stream-level instructions as well. The load_kernel instruction loads a compiled kernel function into local instruction storage within the KEU. This typically occurs only the first time a kernel is executed; on subsequent kernel invocations, the code is already available in the KEU. The run_kernel instruction causes the KEU to start executing instructions that are encoded in its own instruction-set architecture, which is distinct from the application processor ISA.

Kernel execution unit ISA

Instructions in the KEU ISA control the functional units and register storage within the KEU, similar to typical RISC instructions. However, unlike a RISC ISA,

• KEU instructions do not have access to arbitrary memory locations, as all external data must be read from or written to streams; and
• special communication instructions explicitly handle data dependencies between the computations of different output elements.

The first constraint preserves locality, while the communication mechanisms make it easier to exploit concurrency in applications that would otherwise need to reorganize data through the memory system. Such an ISA maximizes stream architecture performance.

Stream ISA

A stream processor's ISA encapsulates both the application processor ISA and the KEU ISA. Accordingly, a compiler translates high-level stream languages directly to the stream processor's ISA.

[Figure 4: Block diagram of Imagine. An application processor drives the on-chip stream controller, which sequences the stream register file's clients: the kernel execution unit (eight ALU clusters with local register files, plus an instruction store), the DRAM interface backed by SDRAM banks, and a network interface connecting to other Imagines or I/O devices. Peak bandwidths: 16.9 Gbps to DRAM, 92.2 Gbps to the SRF, and 1,570 Gbps to the LRFs.]

Figure 4. Imagine stream processor. The on-chip bandwidths shown correspond to an operating frequency of 180 MHz.

For example, it compiles the Convert kernel code shown in Figure 1a to the KEU ISA offline. The compiler then compiles the application-level code in Figure 1b to the application processor ISA. The application processor executes this code and moves the precompiled kernel code to the KEU instruction memory during execution.

Assuming that the portion of the video feed the processor is encoding currently resides in external DRAM, the three lines of application code in Figure 1b would result in the following sequence of operations (ii = Input_Image, lum = Luminance, and chrom = Chrominance):

    load_stream ii_SRF_loc, video_DRAM_loc, LEN

    add video_DRAM_loc, video_DRAM_loc, LEN

    run_kernel convert_KEU_loc, ii_SRF_loc, LEN,
               lum_SRF_loc, chrom_SRF_loc

The first instruction loads a portion of the raw video feed into the SRF. The second instruction is a regular scalar operation that only updates local state in the application processor. The final instruction causes the KEU to start executing the operations for the Convert kernel. Similar instructions are required to complete the rest of the MPEG-2 I-frame application.

Interestingly, a traditional compiler can only perform manipulations on simple scalar operations that tend to benefit local performance only. A stream compiler, on the other hand, can manipulate instructions that operate on entire streams, potentially leading to greater performance benefits.

To be a good compiler target for high-level stream languages, the stream processor's ISA must express the original program's locality and concurrency. To this end, three distinct address spaces in the stream ISA support the three types of communication that the stream model uses. The ISA maps local communication within kernels to the address space of the local registers within the KEU, stream communication to the SRF address space, and global communication to the off-chip DRAM address space.

To preserve the stream model's locality and concurrency, the processor's ISA also exploits the three types of parallelism. To exploit ILP, the processor provides multiple functional units within the KEU. Because a kernel applies the same function to all stream elements, the compiler can exploit data parallelism using loop unrolling and software pipelining on kernel code. The KEU can exploit additional data parallelism using parallel hardware. To exploit task parallelism, the processor supports multiple stream clients connected to the SRF.
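A toy interpreter makes the division of labor in the instruction sequence above concrete. The Python sketch below is purely illustrative (the real operations are hardware instructions, and the convert stub merely reproduces the kernel's four-luminance, two-chrominance output ratio):

```python
# Toy interpreter for the three stream-level operations above. Purely
# illustrative: DRAM and the SRF become dicts of named locations, and the
# convert stub only mimics the 4:2 output ratio of the real kernel.

DRAM = {"video": list(range(100))}    # raw video feed, stubbed as integers
SRF = {}
app_regs = {"currpos": 0}             # scalar state in the application processor
LEN = 16

def load_stream(srf_loc, dram_loc, start, length):
    SRF[srf_loc] = DRAM[dram_loc][start:start + length]

def run_kernel(kernel, in_loc, out_locs):
    kernel(SRF[in_loc], *[SRF.setdefault(o, []) for o in out_locs])

def convert(input_image, lum, chrom):
    for mb in input_image:
        lum.extend([mb] * 4)      # four luminance blocks per macroblock
        chrom.extend([mb] * 2)    # two chrominance blocks per macroblock

# load_stream ii_SRF_loc, video_DRAM_loc, LEN
load_stream("ii", "video", app_regs["currpos"], LEN)
# add video_DRAM_loc, video_DRAM_loc, LEN  (scalar add in the app processor)
app_regs["currpos"] += LEN
# run_kernel convert_KEU_loc, ii_SRF_loc, LEN, lum_SRF_loc, chrom_SRF_loc
run_kernel(convert, "ii", ["lum", "chrom"])

assert len(SRF["lum"]) == 4 * LEN and len(SRF["chrom"]) == 2 * LEN
```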

PUTTING IT ALL TOGETHER: IMAGINE

Imagine (http://cva.stanford.edu/imagine/), designed and prototyped at Stanford University, is the first programmable streaming-media processor that implements a stream instruction set.4 At a controlled voltage and temperature, the chips operate at 288 MHz, at which speed the peak performance is 11.8 billion 32-bit floating-point operations per second, or 23.0 billion 16-bit operations per second. However, in its current experimental setup the processor is seated in a socket to aid debugging, which limits performance to 180 MHz. At the most energy-efficient operating voltage, 1.2 V, the prototype achieves 4.7 billion 16-bit operations per joule.

The architecture, shown in Figure 4, contains all the basic stream hardware of the baseline processor, including an application processor, an SRF, and stream clients. The stream clients include a KEU and DRAM interface, as well as a network interface. The KEU consists of eight arithmetic clusters that are controlled by a single microcontroller in a single-instruction, multiple-data (SIMD) fashion. Imagine supports 48 ALUs on a 2.6-cm2 die in a 1.5-V, 0.15-µm complementary metal-oxide semiconductor process using a full standard-cell design flow. The 32-bit ALUs can perform both integer and floating-point operations.

Figure 5 shows the layout of the 21-million-transistor die. The ALU clusters consume 26 percent of the die, the SRF consumes 16 percent, and the memory interface, the network interface, and control and routing consume the rest of the chip area.

Figure 5. Imagine die layout. The ALU clusters consume 26 percent of the die area, the SRF consumes 16 percent, and the memory interface, the network interface, and control and routing consume the rest. HI = host interface; SC = stream controller.

To simplify the design and implementation of the Imagine architecture, the prototype system splits the functionality of the application processor into two portions. The on-chip stream controller (SC) only handles the sequencing of operations to the stream clients. An external processor executes the application-level code and hands off any operations destined for stream clients to the stream controller through the host interface (HI).

Bandwidth hierarchy

The stream model exploits locality and concurrency while satisfying VLSI technology constraints. In modern VLSI, global communication is far more expensive than local communication. For example, the maximum data bandwidth provided by a global register file is an order of magnitude higher than the bandwidth provided by off-chip memory. Further, the total bandwidth provided by tens to hundreds of local memories, each servicing a few nearby ALUs, is another one to two orders of magnitude greater.

Accordingly, as Figure 4 shows, Imagine provides three levels of storage that form a data bandwidth hierarchy. The top tier consists of 9.5 Kbytes in a set of distributed local register files (LRFs) in the KEU (1,570 Gbps), followed by the 128-Kbyte SRF (92.2 Gbps) and the off-chip DRAM interface (16.9 Gbps). The stream ISA maps the three types of communication in the stream model—local, stream, and global—to these three tiers. The locality in media applications ensures that most of the data bandwidth demand will be for the LRFs, much less will be for the SRF, and only the accesses to persistent data structures will access off-chip DRAM. Thus, the bandwidth hierarchy on Imagine is well suited to exploit the application locality captured by the stream model.

The bandwidths achieved for the three levels of the hierarchy for the MPEG-2 encoder demonstrate the amount of locality Imagine exploits: 592 Gbps via the LRFs, 6.47 Gbps via the SRF, and 1.26 Gbps via DRAM. This ratio of 470:5:1 means that 98.7 percent of all data accesses are captured locally in the LRFs, and only 0.21 percent of the accesses use off-chip DRAM.

Further, Imagine does not require any extra transistors or waste any energy to dynamically capture this locality, such as with a cache or bypass network. The original stream program expressed the locality and the ISA encoded it.

Note that these results are for the full MPEG-2 encoding application, which includes encoding both I- and P-frames, but without variable-length Huffman encoding. Huffman encoding's sequential nature makes it better suited for the application processor.

SIMD clusters

Imagine also successfully exploits the concurrency of stream programs at several levels. As Figure 4 shows, a single microcontroller controls multiple clusters in a SIMD fashion. The SIMD clusters are particularly effective at capturing data parallelism. For example, Imagine executes the Convert kernel by processing a different macroblock in each cluster. Again, the processor does not require extra hardware or energy to discover this parallelism, as it is present in the stream program. By structuring the original computation as a kernel, the programmer makes it explicit that the processor can produce each output element in parallel.
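Returning to the bandwidth measurements: the quoted ratio and percentages follow directly from the three measured figures, as a quick arithmetic check shows.

```python
# The locality figures quoted for the MPEG-2 encoder follow from the
# measured bandwidths: 592, 6.47, and 1.26 Gbps of LRF, SRF, and DRAM traffic.
lrf, srf, dram = 592.0, 6.47, 1.26
total = lrf + srf + dram

assert round(lrf / dram) == 470              # the 470:5:1 ratio
assert round(srf / dram) == 5
assert round(100 * lrf / total, 1) == 98.7   # accesses captured in the LRFs
assert round(100 * dram / total, 2) == 0.21  # accesses that reach off-chip DRAM
```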

Table 1. Performance and power dissipation of the Imagine stream processor.

                                                                 ------ 180 MHz, 2.0 V ------    ------ 96 MHz, 1.2 V -------
  Application                      Operation types               Performance      Core power     Performance      Core power
                                                                 (giga ops/sec)   (watts)        (giga ops/sec)   (watts)
  MPEG-2 encode                    32-, 16-, and 8-bit integer   6.2              8.0            3.5              1.6
  QR decomposition                 32-bit floating point         4.2              9.8            2.2              1.9
  Coherent side-lobe cancellation  32-bit floating point         2.8             10.0            1.4              1.9
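Dividing performance by core power in Table 1's 96-MHz column gives the processor's energy efficiency in giga operations per joule; a few lines of arithmetic recover the range the article reports.

```python
# Energy efficiency at the 96-MHz, 1.2-V operating point, computed from
# Table 1: performance in Gops divided by core power in watts.
table = {   # application: (Gops, watts) at 96 MHz, 1.2 V
    "MPEG-2 encode": (3.5, 1.6),
    "QR decomposition": (2.2, 1.9),
    "Coherent side-lobe cancellation": (1.4, 1.9),
}
eff = {app: gops / watts for app, (gops, watts) in table.items()}

assert round(max(eff.values()), 1) == 2.2   # MPEG-2 encode
assert round(min(eff.values()), 1) == 0.7   # coherent side-lobe cancellation
```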

A good measure of the amount of data parallelism that Imagine can successfully exploit is the speedup of an application running on an eight-cluster Imagine compared to an identical stream processor with a single cluster. To take an example from the MPEG-2 application, an 8 × 8 DCT kernel runs 7.7 times faster. Imagine can also exploit data parallelism via 8- and 16-bit parallel subword operations.

To exploit instruction-level parallelism, each cluster contains six ALUs controlled by a compiled very long instruction word (VLIW) kernel program. The processor's ILP can be measured by the number of instructions executed every cycle within each cluster. The MPEG-2 application achieves 4.4 operations per cycle per cluster.

Finally, Imagine exploits task parallelism because it can transfer two streams via the DRAM interface and execute a kernel concurrently. Compared to a stream processor that does not support this concurrency, Imagine executes the MPEG-2 application 1.2 times faster and a coherent side-lobe cancellation application 1.7 times faster.

Processor performance

Table 1 demonstrates the performance of the Imagine stream processor on three typical media applications. Performance was measured at an operating frequency of 180 MHz, and at a more power-efficient operating point of 1.2 V at 96 MHz. At 180 MHz, Imagine achieves between 2.8 and 6.2 billion operations per second on actual applications. For floating-point applications, this is approximately 40 to 50 percent of the peak performance of 7.4 Gflops. At the more efficient operating point, the processor achieves a power efficiency between 0.7 and 2.2 giga operations per joule.

Imagine achieves a maximum power efficiency of 2.4 Gflops per watt, measured on an application exercising peak performance, which compares favorably to other programmable floating-point processors. General-purpose microprocessors rely on deep pipelining to achieve high clock frequencies and high performance rates, resulting in low power efficiencies. For example, at 3.08 GHz, the Pentium 4 achieves a peak performance of 12 Gflops at 80 watts,5 more than an order of magnitude worse than Imagine when normalized to the same technology.

Digital signal processors (DSPs), on the other hand, have kept clock rates and pipeline lengths small for power efficiency, but this limits their peak performance. For example, the 225-MHz Texas Instruments C67x provides a peak of 1.35 Gflops6 and, when normalized to the same process technology, is two to three times less power efficient than Imagine. Moreover, using more aggressive circuit designs and power-saving techniques would significantly improve Imagine's power efficiency.

Finally, not only does a stream architecture efficiently support 48 ALUs in current VLSI technology, it will also be able to effectively scale to much higher levels of performance in the near future. This is especially true compared to DSP and even vector architectures, as the "Stream versus Vector Processors" sidebar describes. A recent study shows that programmable stream processors with a bandwidth hierarchy similar to Imagine's will be able to support more than 1,000 ALUs by 2007.7 Such a processor would have a peak performance of more than 1 trillion operations per second while dissipating less than 10 watts.

The Imagine media processor validates the hypothesis that careful management of bandwidth and parallelism, from the programming language to the hardware, results in both high performance and high performance per unit of power. We envision that the future study of stream processors will follow three major directions: increasing the number of application domains that benefit from stream architectures' high performance and power efficiency, automatic mapping of the stream model to multinode systems, and improving high-level stream languages and compilers.

Research into advanced compiler technology for stream languages and architectures is being conducted as part of Reservoir Labs' R-Stream compiler project (www.reservoir.com/r-stream.php). Further research that extends Imagine's processing capabilities is also under way as part of Stanford University's Merrimac project (http://merrimac.stanford.edu/).

Stream versus Vector Processors

Stream processors share with vector processors the ability to hide latency, amortize instruction overhead, and expose data parallelism by operating on large aggregates of data. Vector processors such as the Cray family (the Cray-1¹ through the X-1) and the recent VIRAM,² an implementation of the Berkeley Intelligent RAM project, use vector instructions to load, store, and operate on a fixed-length vector. The length of each vector is VL data words.

Loading a vector with a single instruction hides memory latency with parallelism by loading the vector's remaining words while waiting for the first to return from memory. Amortizing the costs of instruction fetch, issue, and decode over VL operations reduces overhead. Having each instruction specify VL operations, some of which may be performed in parallel on multiple vector pipes, exposes data parallelism.

In a similar manner, a stream processor such as Imagine hides memory latency by fetching a stream of records with a single stream load instruction. It also executes a kernel on entire streams of records, amortizing the overhead of the run_kernel instruction and exposing data parallelism that can be exploited by operating on records in parallel using multiple processing clusters.

Stream processors extend vector processors' capabilities in two ways. First, they add a layer to the register hierarchy by splitting the functions of a vector processor's vector register file between the local register files and the stream register file. The LRFs require only a modest amount of capacity, allowing them to be built with high aggregate bandwidth to support a large number of arithmetic logic units. Because the SRF is relieved of the task of forwarding data to and from the ALUs, its bandwidth is an order of magnitude less than the LRFs', making it economical to build SRFs large enough to exploit coarse-grained locality.

Second, a stream-processing kernel performs all of the operations for one record of the input before moving on to the next record. This reduces the number of intermediate variables stored at any single moment in the LRFs by a factor of VL—strictly speaking, VL divided by the number of clusters.

A stream processor starts with the latency hiding and instruction efficiency of a vector processor. By adding a level of register hierarchy and a level of instruction sequencing, it can reduce bandwidth demands on the memory system by an order of magnitude for many applications and thus can support an order of magnitude more ALUs than a vector processor with the same memory bandwidth.

References
1. R.M. Russell, "The Cray-1 Computer System," Comm. ACM, vol. 21, no. 1, 1978, pp. 63-72.
2. C. Kozyrakis, A Media-Enhanced Vector Architecture for Embedded Memory Systems, tech. report UCB-CSD-99-1059, Computer Science Division, Univ. of California, Berkeley, 1999.

The goals of this streaming supercomputer8 are to demonstrate that stream processing is well suited for scientific computing applications, that stream-processing hardware and software systems can effectively scale to 16,384 nodes, and that a streaming architecture substrate for large-scale scientific machines will provide more than an order of magnitude improvement in performance per cost over what today's symmetric multiprocessor clusters deliver. This new research promises to improve the state of the art in high-end scientific computing, as the Imagine processor has done in the media-processing domain.

References
1. S.D. Naffziger et al., "The Implementation of the Itanium 2 Microprocessor," IEEE J. Solid-State Circuits, Nov. 2002, pp. 1448-1460.
2. W. Thies, M. Karczmarek, and S. Amarasinghe, "StreamIt: A Language for Streaming Applications," Proc. 11th Int'l Conf. Compiler Construction, LNCS 2304, Springer-Verlag, 2002, pp. 179-196.
3. U.J. Kapasi et al., Stream Scheduling, Concurrent VLSI Architecture Tech. Report 122, Computer Systems Laboratory, Stanford Univ., Stanford, Calif., 2002.
4. S. Rixner et al., "A Bandwidth-Efficient Architecture for Media Processing," Proc. 31st Ann. ACM/IEEE Int'l Symp. Microarchitecture, IEEE CS Press, 1998, pp. 3-13.
5. D. Sager et al., "A 0.18µm CMOS IA32 Microprocessor with a 4-GHz Integer Execution Unit," 2001 IEEE Int'l Solid-State Circuits Conf. Digest of Technical Papers, IEEE Standards Office, 2001, pp. 324-325.
6. Texas Instruments, TMS320C6713 Floating-Point Digital Signal Processor, Datasheet SPRS186D, Dec. 2001, rev. May 2003; http://focus.ti.com/lit/ds/symlink/tms320c6713.pdf.
7. B. Khailany et al., "Exploring the VLSI Scalability of Stream Processors," Proc. 9th Int'l Symp. High-Performance Computer Architecture, IEEE CS Press, 2003, pp. 153-164.
8. W.J. Dally, P. Hanrahan, and R. Fedkiw, A Streaming Supercomputer, white paper, Computer Systems Laboratory, Stanford Univ., Stanford, Calif., 2001.

Ujval J. Kapasi is a PhD candidate in electrical engineering at Stanford University. His research interests include computer architecture, scientific computing, and language and compiler design. Kapasi received an MS in electrical engineering from Stanford University. He is a member of the IEEE and the ACM. Contact him at [email protected].

Scott Rixner is an assistant professor of computer science and electrical and computer engineering at Rice University, where he is a co-leader of the Rice

Computer Architecture Group. His research interests include media, network, and communications processing; memory system architecture; and the interaction between operating systems and network server architectures. Rixner received a PhD in electrical engineering from the Massachusetts Institute of Technology. He is a member of the ACM. Contact him at [email protected].

William J. Dally is a professor of electrical engineering and computer science and a member of the Computer Systems Laboratory at Stanford University, where he heads the Concurrent VLSI Architecture Group. His research interests include network architecture, multicomputer architecture, media-processor architecture, and high-speed CMOS signaling. Dally received a PhD in computer science from the California Institute of Technology. He is a Fellow of the IEEE, a Fellow of the ACM, and a recipient of the ACM Maurice Wilkes Award. Contact him at [email protected].

Brucek Khailany is a postdoctoral scholar in electrical engineering at Stanford University. His research interests include power-efficient computer architecture, media processing, circuits, and computer arithmetic. Khailany received a PhD in electrical engineering from Stanford University. He is a member of the IEEE and the ACM. Contact him at [email protected].

Jung Ho Ahn is a PhD candidate in electrical engineering at Stanford University. His research interests include computer architecture, stream computing, compiler design, and advanced memory design. Ahn received an MS in electrical engineering from Stanford University. He is a student member of the IEEE. Contact him at [email protected].

Peter Mattson is a managing engineer at Reservoir Labs, an advanced technology services company with offices in New York City and Portland, Ore. His research interests include high-performance languages and compilers. Mattson received a PhD in electrical engineering from Stanford University. He is a member of the IEEE and the ACM. Contact him at [email protected].

John D. Owens is an assistant professor of electrical and computer engineering at the University of California, Davis. His research interests include sensor networks, graphics and graphics architectures, stream computing, and computer architecture. Owens received a PhD in electrical engineering from Stanford University. Contact him at [email protected].

August 2003 61