COVER FEATURE

Programmable Stream Processors

Stream processing promises to bridge the gap between inflexible special-purpose solutions and current programmable architectures that cannot meet the computational demands of media-processing applications.

Ujval J. Kapasi, Stanford University
Scott Rixner, Rice University
William J. Dally, Stanford University
Brucek Khailany, Stanford University
Jung Ho Ahn, Stanford University
Peter Mattson, Reservoir Labs
John D. Owens, University of California, Davis

The complexity of modern media processing, including 3D graphics, image compression, and signal processing, requires tens to hundreds of billions of computations per second. To achieve these computation rates, current media processors use special-purpose architectures tailored to one specific application. Such processors require significant design effort and are thus difficult to change as media-processing applications and algorithms evolve.

The demand for flexibility in media processing motivates the use of programmable processors. However, very large-scale integration constraints limit the performance of traditional programmable architectures. In modern VLSI technology, computation is relatively cheap—thousands of arithmetic logic units that operate at multigigahertz rates can fit on a modestly sized 1-cm2 die. The problem is that delivering instructions and data to those ALUs is prohibitively expensive. For example, only 6.5 percent of the Itanium 2 die is devoted to the 12 integer and two floating-point ALUs and their register files1; communication, control, and storage overhead consume the remaining die area. In contrast, the more efficient communication and control structures of a special-purpose graphics chip, such as the GeForce4, enable the use of many hundreds of floating-point and integer ALUs to render 3D images.

STREAM PROCESSING

In part, such special-purpose media processors are successful because media applications have abundant parallelism—enabling thousands of computations to occur in parallel—and require minimal global communication and storage—enabling data to pass directly from one ALU to the next. A stream architecture exploits this locality and concurrency by partitioning the communication and storage structures to support many ALUs efficiently:

• operands for arithmetic operations reside in local register files (LRFs) near the ALUs, in much the same way that special-purpose architectures store and communicate data locally;
• streams of data capture coarse-grained locality and are stored in a stream register file (SRF), which can efficiently transfer data to and from the LRFs between major computations; and
• global data is stored off-chip only when necessary.

These three explicit levels of storage form a data bandwidth hierarchy, with the LRFs providing an order of magnitude more bandwidth than the SRF and the SRF providing an order of magnitude more bandwidth than off-chip storage.

This bandwidth hierarchy is well matched to the characteristics of modern VLSI technology, as each level provides successively more storage and less bandwidth. By exploiting the locality inherent in media-processing applications, this hierarchy stores the data at the appropriate level, enabling hundreds of ALUs to operate at close to their peak rate. Moreover, a stream architecture can support such a large number of ALUs in an area- and power-efficient manner.

54 Computer Published by the IEEE Computer Society 0018-9162/03/$17.00 © 2003 IEEE

Figure 1. Streams and a kernel from an MPEG-2 video encoder. (a) The Convert kernel translates a stream of macroblocks containing RGB pixels into streams of blocks containing luminance and chrominance pixels. (b) A StreamC program expresses the flow of streams through kernels textually.

(a) The Convert kernel (shown in the figure consuming the Input_Image stream and producing the Luminance and Chrominance streams):

    while ( !Input_Image.end() ) {
      // input next macroblock
      in = Input_Image.pop();

      // generate Luminance and
      // Chrominance blocks
      outY[0..3] = gen_L_blocks(in);
      outC[0..1] = gen_C_blocks(in);

      // output new blocks
      Luminance.push(outY[0..3]);
      Chrominance.push(outC[0..1]);
    }

(b) The application-level StreamC code:

    struct RGB_pixel {
      byte r, g, b;
    }
    stream struct MACROBLOCK {
      RGB_pixel pixels[16][16];
    }
    stream struct BLOCK {
      byte intensity[8][8];
    }

    stream Input_Image(NUM_MB);
    stream Luminance(NUM_MB*4),
           Chrominance(NUM_MB*2);

    Input_Image = Video_Feed.get_macroblocks(currpos, NUM_MB);
    currpos += NUM_MB;
    Convert(Input_Image, Luminance, Chrominance);

Modern high-performance microprocessors and digital signal processors continue to rely on global storage and communication structures to deliver data to the ALUs; these structures use more area and consume more power per ALU than a stream processor.

STREAMS AND KERNELS

The central idea behind stream processing is to organize an application into streams and kernels to expose the inherent locality and concurrency in media-processing applications. In most cases, not only do streams and kernels expose desirable properties of media applications, but they are also a natural way of expressing the application. This leads to an intuitive programming model that maps directly to stream architectures with tens to hundreds of ALUs.

Example application

Figure 1 illustrates input and output streams and a kernel taken from an MPEG-2 video encoder. Figure 1a shows how a kernel operates on streams graphically, while Figure 1b shows this in a simplified form of StreamC, a stream-programming language.

Input_Image is a stream that consists of image data from a camera. Elements of Input_Image are 16 × 16 pixel regions, or macroblocks, on which the Convert kernel operates. The kernel applies the same computation to the macroblocks in Input_Image, decomposing each one into six 8 × 8 blocks—four luminance blocks and two 4:1 subsampled chrominance blocks—and appends them to the Luminance and Chrominance output streams, respectively.

Streams. As Figure 1a shows, streams contain a set of elements of the same type. Stream elements can be simple, such as a single number, or complex, such as the coordinates of a triangle in 3D space. Streams need not be the same length—for example, the Luminance stream has four times as many elements as the input stream. Further, Input_Image could contain all of the macroblocks in an entire video frame, only a row of macroblocks from the frame, or even a subset of a single row. In the stream code in Figure 1b, the value of NUM_MB controls the length of the input stream.

Kernels. The Convert kernel consists of a loop that processes each element from the input stream. The body of the loop first pops an element from its input stream, performs some computation on that element, and then pushes the results onto the two output streams.

Kernels can have one or more input and output streams and perform complex calculations ranging from a few to thousands of operations per input element—one Convert implementation requires 6,464 operations per input macroblock to produce the six output blocks. The only external data that a kernel can access are its input and output streams. For example, Convert cannot directly access the data from the video feed; instead, the data must first be organized into a stream.
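The kernel discipline described above can be mimicked in a few lines. The following Python sketch is hypothetical (an analogue of the StreamC code, not StreamC itself), with macroblocks stubbed as integers and gen_L_blocks/gen_C_blocks standing in for the real colorspace arithmetic; it preserves the two properties the text emphasizes: the kernel touches only its streams, and the output lengths are fixed multiples of the input length.

```python
from collections import deque

# Hypothetical Python analogue of the Convert kernel (not StreamC).
# Macroblocks are stubbed as integers; gen_L_blocks and gen_C_blocks
# stand in for the real colorspace math, which the article does not give.

def gen_L_blocks(mb):
    return [("L", mb, i) for i in range(4)]   # four 8 x 8 luminance blocks

def gen_C_blocks(mb):
    return [("C", mb, i) for i in range(2)]   # two subsampled chrominance blocks

def convert(input_image):
    # A kernel touches only its input and output streams.
    luminance, chrominance = deque(), deque()
    while input_image:                  # while ( !Input_Image.end() )
        mb = input_image.popleft()      # in = Input_Image.pop()
        luminance.extend(gen_L_blocks(mb))
        chrominance.extend(gen_C_blocks(mb))
    return luminance, chrominance

NUM_MB = 8
lum, chrom = convert(deque(range(NUM_MB)))
assert len(lum) == 4 * NUM_MB       # Luminance is four times the input length
assert len(chrom) == 2 * NUM_MB     # Chrominance is twice the input length
```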

August 2003 55

[Figure 2: Stream-and-kernel graph of the I-frame encoder. The Input_Image stream flows into the Convert kernel, which produces luminance and chrominance streams. Each passes through a DCTQ kernel (with rate-control feedback) and an IQDCT kernel that regenerates the luminance and chrominance reference frames; run-level encoding and variable-length coding kernels then produce the compressed bitstream. A legend distinguishes stream, scalar, kernel, and global elements; the video frames and reference frames reside in global storage.]

Figure 2. MPEG-2 I-frame encoder mapped to streams and kernels. The encoder receives a stream of macroblocks from a frame in the video feed as input, and the first kernel (Convert) processes this stream. The discrete cosine transform (DCT) kernels then operate on the output streams produced by Convert. Q = quantization; IQ = inverse quantization.
Full application. A full application, such as the MPEG-2 encoder, is composed of multiple streams and kernels. This application compresses a sequence of video images into a single bitstream consisting of three types of frames: intracoded, predicted, and bidirectional. The encoder compresses I-frames using only information contained in the current frame, and it compresses P- and B-frames using information from the current frame as well as additional reference frames.

For example, Figure 2 shows one possible mapping of the portion of the MPEG-2 encoder application that encodes only I-frames into the stream-processing model. Solid arrows represent data streams, and ovals represent computation kernels. The encoder receives a stream of macroblocks from the video feed as input (Input_Image), and the first kernel (Convert) processes this input. The discrete cosine transform and quantization (DCTQ) kernels then operate on the output streams produced by Convert. Upon execution of all the computation kernels, the application either transmits the compressed bitstream over a network or saves it for later use. The application uses the two reference streams to compress future frames. The representation of this entire application in a language such as StreamC is a straightforward extension of the code shown in Figure 1b.

Locality and concurrency

By making communication between computation kernels explicit, the stream-processing model exposes both the locality and concurrency inherent in media-processing applications.

The model exposes locality by organizing communication into three distinct levels:

• Local. Temporary results that only need to be transferred between scalar operations within a kernel use local communication mechanisms. For example, temporary values in gen_L_blocks are only referenced within Convert. This type of communication cannot be used to transfer data between two different kernels.
• Stream. Data are communicated between computation kernels explicitly as data streams. In the MPEG-2 encoder, for example, the Luminance and Chrominance streams use this type of communication.
• Global. This level of communication is only for truly global data. This is necessary for communicating data to and from I/O devices, as well as for data that must persist throughout the application. For example, the MPEG-2 I-frame encoder uses this level of communication for the original input data from a video feed and for the reference frames, which must persist in the off-chip dynamic RAM (DRAM) throughout the processing of multiple video frames.

By requiring programmers to explicitly use the appropriate type of communication for each data element, the stream model expresses the application's inherent locality. For example, the model only uses streams to transfer data between kernels and does not burden them with temporary values generated within kernels. Likewise, the model does not use global communication for temporary streams.

The stream model also exposes concurrency in media-processing applications at multiple levels:

• Instruction-level parallelism (ILP). As in traditional processing models, the stream model can exploit parallelism between the scalar operations in a kernel function. For example, the operations in gen_L_blocks and gen_C_blocks can occur in parallel.
• Data parallelism. Because kernels apply the same computation to each element of an input stream, the stream model can exploit data parallelism by operating on several stream elements at the same time. For example, the model parallelizes the main loop in Convert so that multiple computation elements can each decompose a different macroblock.

• Task parallelism. Multiple computation tasks, including kernel execution and stream data transfers, can execute concurrently as long as they obey dependencies in the stream graph. For example, in the MPEG-2 I-frame application, the two DCTQ kernels could run in parallel.

[Figure 3: Block diagram of the baseline architecture. The application processor issues stream-level instructions to the clients of the stream register file: the kernel execution unit, the DRAM interface backed by off-chip DRAM, and other stream clients.]

Figure 3. Baseline programmable stream processor. The SRF streams data between clients—in this case the KEU, which is responsible for executing kernels and providing local communication for operations within the kernels, and the off-chip DRAM interface, which provides access to storage for global data.
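The dependency rule behind task parallelism fits in a few lines. This sketch is only an illustration, not Imagine's scheduler; the kernel and stream names follow Figure 2, while the *_coeffs output stream names are invented for the example.

```python
# Sketch of the stream-graph dependency rule (an illustration, not
# Imagine's scheduler). Kernel names follow Figure 2; the *_coeffs output
# stream names are invented for this example.

KERNELS = {
    "Convert": {"in": {"Input_Image"}, "out": {"Luminance", "Chrominance"}},
    "DCTQ_L":  {"in": {"Luminance"},   "out": {"Luminance_coeffs"}},
    "DCTQ_C":  {"in": {"Chrominance"}, "out": {"Chrominance_coeffs"}},
}

def independent(a, b):
    # Two kernels may overlap if neither consumes a stream the other produces.
    ka, kb = KERNELS[a], KERNELS[b]
    return not (ka["out"] & kb["in"]) and not (kb["out"] & ka["in"])

assert not independent("Convert", "DCTQ_L")   # DCTQ_L must wait for Convert
assert independent("DCTQ_L", "DCTQ_C")        # the two DCTQ kernels can overlap
```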

Languages such as StreamC and StreamIt2 that implement the stream model let programmers express communication explicitly at each level. A compiler can easily extract and optimize the locality and concurrency, as it does not need to discover them on its own.3

STREAM INSTRUCTION-SET ARCHITECTURE

In a conventional scalar instruction set, communication is implicit, making it difficult to exploit locality, and computation is expressed sequentially, making it difficult to exploit concurrency. In contrast, a stream instruction set explicitly communicates to the hardware the locality and concurrency that the stream model exposes.

Figure 3 shows the architecture of a baseline programmable stream processor, which consists of an application processor, a stream register file, and stream clients. The SRF serves as a communication link by streaming data between clients, as long as the data does not exceed its storage capacity. The two stream clients in the baseline stream processor—a programmable kernel execution unit and an off-chip DRAM interface—either consume data streams from the SRF or produce data streams to the SRF. The KEU executes kernels and provides local communication for operations within the kernels, while the off-chip DRAM interface provides access to global data storage.

The architecture can also support other stream clients, including interfaces to I/O devices, an interface to an off-chip interconnection network, or specialized kernel units such as a variable-length Huffman encoder. Multiple instances of any particular stream client are also possible.

Application processor ISA

The application processor executes application code like that in Figure 1b. The application processor's RISC-like instruction set is augmented with stream-level instructions to control the flow of data streams through the system. The application processor sequences these instructions and issues them to the stream clients, including the DRAM interface and the KEU.

The DRAM interface supports two stream-level instructions—load_stream and store_stream—that transfer an entire stream between off-chip DRAM and the SRF. Additional DRAM interface arguments can specify noncontiguous access patterns, such as nonunit strides and indirect access. The DRAM interface client also supports optional caches to increase access bandwidth.

The programmable KEU supports two stream-level instructions as well. The load_kernel instruction loads a compiled kernel function into local instruction storage within the KEU. This typically occurs only the first time a kernel is executed; on subsequent kernel invocations, the code is already available in the KEU. The run_kernel instruction causes the KEU to start executing instructions that are encoded in its own instruction-set architecture, which is distinct from the application processor ISA.

Kernel execution unit ISA

Instructions in the KEU ISA control the functional units and register storage within the KEU, similar to typical RISC instructions. However, unlike a RISC ISA,

• KEU instructions do not have access to arbitrary memory locations, as all external data must be read from or written to streams; and
• special communication instructions explicitly handle data dependencies between the computations of different output elements.

The first constraint preserves locality, while the communication mechanisms make it easier to exploit concurrency in applications that would otherwise need to reorganize data through the memory system. Such an ISA maximizes stream architecture performance.

Stream ISA

A stream processor's ISA encapsulates both the application processor ISA and the KEU ISA. Accordingly, a compiler translates high-level stream languages directly to the stream processor's ISA.

[Figure 4: Block diagram of Imagine. An application processor drives the on-chip stream controller, which sequences the stream register file's clients: the kernel execution unit (eight ALU clusters with local register files, plus an instruction store), the DRAM interface backed by SDRAM banks, and a network interface connecting to other Imagines or I/O devices. Peak bandwidths: 16.9 Gbps to DRAM, 92.2 Gbps to the SRF, and 1,570 Gbps to the LRFs.]

Figure 4. Imagine stream processor. The on-chip bandwidths shown correspond to an operating frequency of 180 MHz.

For example, it compiles the Convert kernel code shown in Figure 1a to the KEU ISA offline. The compiler then compiles the application-level code in Figure 1b to the application processor ISA. The application processor executes this code and moves the precompiled kernel code to the KEU instruction memory during execution.

Assuming that the portion of the video feed the processor is encoding currently resides in external DRAM, the three lines of application code in Figure 1b would result in the following sequence of operations (ii = Input_Image, lum = Luminance, and chrom = Chrominance):

    load_stream ii_SRF_loc, video_DRAM_loc, LEN

    add video_DRAM_loc, video_DRAM_loc, LEN

    run_kernel convert_KEU_loc, ii_SRF_loc, LEN,
               lum_SRF_loc, chrom_SRF_loc

The first instruction loads a portion of the raw video feed into the SRF. The second instruction is a regular scalar operation that only updates local state in the application processor. The final instruction causes the KEU to start executing the operations for the Convert kernel. Similar instructions are required to complete the rest of the MPEG-2 I-frame application.

Interestingly, a traditional compiler can only perform manipulations on simple scalar operations that tend to benefit local performance only. A stream compiler, on the other hand, can manipulate instructions that operate on entire streams, potentially leading to greater performance benefits.

To be a good compiler target for high-level stream languages, the stream processor's ISA must express the original program's locality and concurrency. To this end, three distinct address spaces in the stream ISA support the three types of communication that the stream model uses. The ISA maps local communication within kernels to the address space of the local registers within the KEU, stream communication to the SRF address space, and global communication to the off-chip DRAM address space.

To preserve the stream model's locality and concurrency, the processor's ISA also exploits the three types of parallelism. To exploit ILP, the processor provides multiple functional units within the KEU. Because a kernel applies the same function to all stream elements, the compiler can exploit data parallelism using loop unrolling and software pipelining on kernel code. The KEU can exploit additional data parallelism using parallel hardware. To exploit task parallelism, the processor supports multiple stream clients connected to the SRF.
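A toy interpreter makes the division of labor in the instruction sequence above concrete. The Python sketch below is purely illustrative (the real operations are hardware instructions, and the convert stub merely reproduces the kernel's four-luminance, two-chrominance output ratio):

```python
# Toy interpreter for the three stream-level operations above. Purely
# illustrative: DRAM and the SRF become dicts of named locations, and the
# convert stub only mimics the 4:2 output ratio of the real kernel.

DRAM = {"video": list(range(100))}    # raw video feed, stubbed as integers
SRF = {}
app_regs = {"currpos": 0}             # scalar state in the application processor
LEN = 16

def load_stream(srf_loc, dram_loc, start, length):
    SRF[srf_loc] = DRAM[dram_loc][start:start + length]

def run_kernel(kernel, in_loc, out_locs):
    kernel(SRF[in_loc], *[SRF.setdefault(o, []) for o in out_locs])

def convert(input_image, lum, chrom):
    for mb in input_image:
        lum.extend([mb] * 4)      # four luminance blocks per macroblock
        chrom.extend([mb] * 2)    # two chrominance blocks per macroblock

# load_stream ii_SRF_loc, video_DRAM_loc, LEN
load_stream("ii", "video", app_regs["currpos"], LEN)
# add video_DRAM_loc, video_DRAM_loc, LEN  (scalar add in the app processor)
app_regs["currpos"] += LEN
# run_kernel convert_KEU_loc, ii_SRF_loc, LEN, lum_SRF_loc, chrom_SRF_loc
run_kernel(convert, "ii", ["lum", "chrom"])

assert len(SRF["lum"]) == 4 * LEN and len(SRF["chrom"]) == 2 * LEN
```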

PUTTING IT ALL TOGETHER: IMAGINE

Imagine (http://cva.stanford.edu/imagine/), designed and prototyped at Stanford University, is the first programmable streaming-media processor that implements a stream instruction set.4 At a controlled voltage and temperature, the chips operate at 288 MHz, at which speed the peak performance is 11.8 billion 32-bit floating-point operations per second, or 23.0 billion 16-bit operations per second. However, in its current experimental setup the processor is seated in a socket to aid debugging, which limits performance to 180 MHz. At the most energy-efficient operating voltage, 1.2 V, the prototype achieves 4.7 billion 16-bit operations per joule.

The architecture, shown in Figure 4, contains all the basic stream hardware of the baseline processor, including an application processor, an SRF, and stream clients. The stream clients include a KEU and DRAM interface, as well as a network interface. The KEU consists of eight arithmetic clusters that are controlled by a single microcontroller in a single-instruction, multiple-data (SIMD) fashion. Imagine supports 48 ALUs on a 2.6-cm2 die in a 1.5-V, 0.15-µm complementary metal-oxide semiconductor process using a full standard-cell design flow. The 32-bit ALUs can perform both integer and floating-point operations.

Figure 5 shows the layout of the 21-million-transistor die. The ALU clusters consume 26 percent of the die, the SRF consumes 16 percent, and the memory interface, the network interface, and control and routing consume the rest of the chip area.

Figure 5. Imagine die layout. The ALU clusters consume 26 percent of the die area, the SRF consumes 16 percent, and the memory interface, the network interface, and control and routing consume the rest. HI = host interface; SC = stream controller.

To simplify the design and implementation of the Imagine architecture, the prototype system splits the functionality of the application processor into two portions. The on-chip stream controller (SC) only handles the sequencing of operations to the stream clients. An external processor executes the application-level code and hands off any operations destined for stream clients to the stream controller through the host interface (HI).

Bandwidth hierarchy

The stream model exploits locality and concurrency while satisfying VLSI technology constraints. In modern VLSI, global communication is far more expensive than local communication. For example, the maximum data bandwidth provided by a global register file is an order of magnitude higher than the bandwidth provided by off-chip memory. Further, the total bandwidth provided by tens to hundreds of local memories, each servicing a few nearby ALUs, is another one to two orders of magnitude greater.

Accordingly, as Figure 4 shows, Imagine provides three levels of storage that form a data bandwidth hierarchy. The top tier consists of 9.5 Kbytes in a set of distributed local register files (LRFs) in the KEU (1,570 Gbps), followed by the 128-Kbyte SRF (92.2 Gbps) and the off-chip DRAM interface (16.9 Gbps). The stream ISA maps the three types of communication in the stream model—local, stream, and global—to these three tiers. The locality in media applications ensures that most of the data bandwidth demand will be for the LRFs, much less will be for the SRF, and only the accesses to persistent data structures will access off-chip DRAM. Thus, the bandwidth hierarchy on Imagine is well suited to exploit the application locality captured by the stream model.

The bandwidths achieved for the three levels of the hierarchy for the MPEG-2 encoder demonstrate the amount of locality Imagine exploits: 592 Gbps via the LRFs, 6.47 Gbps via the SRF, and 1.26 Gbps via DRAM. This ratio of 470:5:1 means that 98.7 percent of all data accesses are captured locally in the LRFs, and only 0.21 percent of the accesses use off-chip DRAM.

Further, Imagine does not require any extra transistors or waste any energy to dynamically capture this locality, such as with a cache or bypass network. The original stream program expressed the locality and the ISA encoded it.

Note that these results are for the full MPEG-2 encoding application, which includes encoding both I- and P-frames, but without variable-length Huffman encoding. Huffman encoding's sequential nature makes it better suited for the application processor.

SIMD clusters

Imagine also successfully exploits the concurrency of stream programs at several levels. As Figure 4 shows, a single microcontroller controls multiple clusters in a SIMD fashion. The SIMD clusters are particularly effective at capturing data parallelism. For example, Imagine executes the Convert kernel by processing a different macroblock in each cluster. Again, the processor does not require extra hardware or energy to discover this parallelism, as it is present in the stream program. By structuring the original computation as a kernel, the programmer makes it explicit that the processor can produce each output element in parallel.
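Returning to the bandwidth measurements: the quoted ratio and percentages follow directly from the three measured figures, as a quick arithmetic check shows.

```python
# The locality figures quoted for the MPEG-2 encoder follow from the
# measured bandwidths: 592, 6.47, and 1.26 Gbps of LRF, SRF, and DRAM traffic.
lrf, srf, dram = 592.0, 6.47, 1.26
total = lrf + srf + dram

assert round(lrf / dram) == 470              # the 470:5:1 ratio
assert round(srf / dram) == 5
assert round(100 * lrf / total, 1) == 98.7   # accesses captured in the LRFs
assert round(100 * dram / total, 2) == 0.21  # accesses that reach off-chip DRAM
```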

Table 1. Performance and power dissipation of the Imagine stream processor.

                                                                 ------ 180 MHz, 2.0 V ------    ------ 96 MHz, 1.2 V -------
  Application                      Operation types               Performance      Core power     Performance      Core power
                                                                 (giga ops/sec)   (watts)        (giga ops/sec)   (watts)
  MPEG-2 encode                    32-, 16-, and 8-bit integer   6.2              8.0            3.5              1.6
  QR decomposition                 32-bit floating point         4.2              9.8            2.2              1.9
  Coherent side-lobe cancellation  32-bit floating point         2.8             10.0            1.4              1.9
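Dividing performance by core power in Table 1's 96-MHz column gives the processor's energy efficiency in giga operations per joule; a few lines of arithmetic recover the range the article reports.

```python
# Energy efficiency at the 96-MHz, 1.2-V operating point, computed from
# Table 1: performance in Gops divided by core power in watts.
table = {   # application: (Gops, watts) at 96 MHz, 1.2 V
    "MPEG-2 encode": (3.5, 1.6),
    "QR decomposition": (2.2, 1.9),
    "Coherent side-lobe cancellation": (1.4, 1.9),
}
eff = {app: gops / watts for app, (gops, watts) in table.items()}

assert round(max(eff.values()), 1) == 2.2   # MPEG-2 encode
assert round(min(eff.values()), 1) == 0.7   # coherent side-lobe cancellation
```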

A good measure of the amount of data parallelism that Imagine can successfully exploit is the speedup of an application running on an eight-cluster Imagine compared to an identical stream processor with a single cluster. To take an example from the MPEG-2 application, an 8 × 8 DCT kernel runs 7.7 times faster. Imagine can also exploit data parallelism via 8- and 16-bit parallel subword operations.

To exploit instruction-level parallelism, each cluster contains six ALUs controlled by a compiled very long instruction word (VLIW) kernel program. The processor's ILP can be measured by the number of instructions executed every cycle within each cluster. The MPEG-2 application achieves 4.4 operations per cycle per cluster.

Finally, Imagine exploits task parallelism because it can transfer two streams via the DRAM interface and execute a kernel concurrently. Compared to a stream processor that does not support this concurrency, Imagine executes the MPEG-2 application 1.2 times faster and a coherent side-lobe cancellation application 1.7 times faster.

Processor performance

Table 1 demonstrates the performance of the Imagine stream processor on three typical media applications. Performance was measured at an operating frequency of 180 MHz, and at a more power-efficient operating point of 1.2 V at 96 MHz. At 180 MHz, Imagine achieves between 2.8 and 6.2 billion operations per second on actual applications. For floating-point applications, this is approximately 40 to 50 percent of the peak performance of 7.4 Gflops. At the more efficient operating point, the processor achieves a power efficiency between 0.7 and 2.2 giga operations per joule.

Imagine achieves a maximum power efficiency of 2.4 Gflops per watt, measured on an application exercising peak performance, which compares favorably to other programmable floating-point processors. General-purpose microprocessors rely on deep pipelining to achieve high clock frequencies and high performance rates, resulting in low power efficiencies. For example, at 3.08 GHz, the Pentium 4 achieves a peak performance of 12 Gflops at 80 watts,5 more than an order of magnitude worse than Imagine when normalized to the same technology.

Digital signal processors (DSPs), on the other hand, have kept clock rates and pipeline lengths small for power efficiency, but this limits their peak performance. For example, the 225-MHz Texas Instruments C67x provides a peak of 1.35 Gflops6 and, when normalized to the same process technology, is two to three times less power efficient than Imagine. Moreover, using more aggressive circuit designs and power-saving techniques would significantly improve Imagine's power efficiency.

Finally, not only does a stream architecture efficiently support 48 ALUs in current VLSI technology, it will also be able to effectively scale to much higher levels of performance in the near future. This is especially true compared to DSP and even vector architectures, as the "Stream versus Vector Processors" sidebar describes. A recent study shows that programmable stream processors with a bandwidth hierarchy similar to Imagine's will be able to support more than 1,000 ALUs by 2007.7 Such a processor would have a peak performance of more than 1 trillion operations per second while dissipating less than 10 watts.

The Imagine media processor validates the hypothesis that careful management of bandwidth and parallelism, from the programming language to the hardware, results in both high performance and high performance per unit of power. We envision that the future study of stream processors will follow three major directions: increasing the number of application domains that benefit from stream architectures' high performance and power efficiency, automatic mapping of the stream model to multinode systems, and improving high-level stream languages and compilers.

Research into advanced compiler technology for stream languages and architectures is being conducted as part of Reservoir Labs' R-Stream compiler project (www.reservoir.com/r-stream.php). Further research that extends Imagine's processing capabilities is also under way as part of Stanford University's Merrimac project (http://merrimac.stanford.edu/).

Stream versus Vector Processors

Stream processors share with vector processors the ability to hide latency, amortize instruction overhead, and expose data parallelism by operating on large aggregates of data. Vector processors such as the Cray family (the Cray-1¹ through the X-1) and the recent VIRAM,² an implementation of the Berkeley Intelligent RAM project, use vector instructions to load, store, and operate on a fixed-length vector. The length of each vector is VL data words.

Loading a vector with a single instruction hides memory latency with parallelism by loading the vector's remaining words while waiting for the first to return from memory. Amortizing the costs of instruction fetch, issue, and decode over VL operations reduces overhead. Having each instruction specify VL operations, some of which may be performed in parallel on multiple vector pipes, exposes data parallelism.

In a similar manner, a stream processor such as Imagine hides memory latency by fetching a stream of records with a single stream load instruction. It also executes a kernel on entire streams of records, amortizing the overhead of the run_kernel instruction and exposing data parallelism that can be exploited by operating on records in parallel using multiple processing clusters.

Stream processors extend vector processors' capabilities in two ways. First, they add a layer to the register hierarchy by splitting the functions of a vector processor's vector register file between the local register files and the stream register file. The LRFs require only a modest amount of capacity, allowing them to be built with high aggregate bandwidth to support a large number of arithmetic logic units. Because the SRF is relieved of the task of forwarding data to and from the ALUs, its bandwidth is an order of magnitude less than the LRFs', making it economical to build SRFs large enough to exploit coarse-grained locality.

Second, a stream-processing kernel performs all of the operations for one record of the input before moving on to the next record. This reduces the number of intermediate variables stored at any single moment in the LRFs by a factor of VL—strictly speaking, VL divided by the number of clusters.

A stream processor starts with the latency hiding and instruction efficiency of a vector processor. By adding a level of register hierarchy and a level of instruction sequencing, it can reduce bandwidth demands on the memory system by an order of magnitude for many applications and thus can support an order of magnitude more ALUs than a vector processor with the same memory bandwidth.

References
1. R.M. Russell, "The Cray-1 Computer System," Comm. ACM, vol. 21, no. 1, 1978, pp. 63-72.
2. C. Kozyrakis, A Media-Enhanced Vector Architecture for Embedded Memory Systems, tech. report UCB-CSD-99-1059, Computer Science Division, Univ. of California, Berkeley, 1999.

The goals of this streaming supercomputer8 are to demonstrate that stream processing is well suited for scientific computing applications, that stream-processing hardware and software systems can effectively scale to 16,384 nodes, and that a streaming architecture substrate for large-scale scientific machines will provide more than an order of magnitude improvement in performance per cost over what today's symmetric multiprocessor clusters deliver. This new research promises to improve the state of the art in high-end scientific computing, as the Imagine processor has done in the media-processing domain.

References
1. S.D. Naffziger et al., "The Implementation of the Itanium 2 Microprocessor," IEEE J. Solid-State Circuits, Nov. 2002, pp. 1448-1460.
2. W. Thies, M. Karczmarek, and S. Amarasinghe, "StreamIt: A Language for Streaming Applications," Proc. 11th Int'l Conf. Compiler Construction, LNCS 2304, Springer-Verlag, 2002, pp. 179-196.
3. U.J. Kapasi et al., Stream Scheduling, Concurrent VLSI Architecture Tech. Report 122, Computer Systems Laboratory, Stanford Univ., Stanford, Calif., 2002.
4. S. Rixner et al., "A Bandwidth-Efficient Architecture for Media Processing," Proc. 31st Ann. ACM/IEEE Int'l Symp. Microarchitecture, IEEE CS Press, 1998, pp. 3-13.
5. D. Sager et al., "A 0.18µm CMOS IA32 Microprocessor with a 4-GHz Integer Execution Unit," 2001 IEEE Int'l Solid-State Circuits Conf. Digest of Technical Papers, IEEE Standards Office, 2001, pp. 324-325.
6. Texas Instruments, TMS320C6713 Floating-Point Digital Signal Processor, Datasheet SPRS186D, Dec. 2001, rev. May 2003; http://focus.ti.com/lit/ds/symlink/tms320c6713.pdf.
7. B. Khailany et al., "Exploring the VLSI Scalability of Stream Processors," Proc. 9th Int'l Symp. High-Performance Computer Architecture, IEEE CS Press, 2003, pp. 153-164.
8. W.J. Dally, P. Hanrahan, and R. Fedkiw, A Streaming Supercomputer, white paper, Computer Systems Laboratory, Stanford Univ., Stanford, Calif., 2001.

Ujval J. Kapasi is a PhD candidate in electrical engineering at Stanford University. His research interests include computer architecture, scientific computing, and language and compiler design. Kapasi received an MS in electrical engineering from Stanford University. He is a member of the IEEE and the ACM. Contact him at [email protected].

Scott Rixner is an assistant professor of computer science and electrical and computer engineering at Rice University, where he is a co-leader of the Rice

Computer Architecture Group. His research interests include media, network, and communications processing; memory system architecture; and the interaction between operating systems and network server architectures. Rixner received a PhD in electrical engineering from the Massachusetts Institute of Technology. He is a member of the ACM. Contact him at [email protected].

William J. Dally is a professor of electrical engineering and computer science and a member of the Computer Systems Laboratory at Stanford University, where he heads the Concurrent VLSI Architecture Group. His research interests include network architecture, multicomputer architecture, media-processor architecture, and high-speed CMOS signaling. Dally received a PhD in computer science from the California Institute of Technology. He is a Fellow of the IEEE, a Fellow of the ACM, and a recipient of the ACM Maurice Wilkes Award. Contact him at [email protected].

Brucek Khailany is a postdoctoral scholar in electrical engineering at Stanford University. His research interests include power-efficient computer architecture, media processing, circuits, and computer arithmetic. Khailany received a PhD in electrical engineering from Stanford University. He is a member of the IEEE and the ACM. Contact him at [email protected].

Jung Ho Ahn is a PhD candidate in electrical engineering at Stanford University. His research interests include computer architecture, stream computing, compiler design, and advanced memory design. Ahn received an MS in electrical engineering from Stanford University. He is a student member of the IEEE. Contact him at [email protected].

Peter Mattson is a managing engineer at Reservoir Labs, an advanced technology services company with offices in New York City and Portland, Ore. His research interests include high-performance languages and compilers. Mattson received a PhD in electrical engineering from Stanford University. He is a member of the IEEE and the ACM. Contact him at [email protected].

John D. Owens is an assistant professor of electrical and computer engineering at the University of California, Davis. His research interests include sensor networks, graphics and graphics architectures, stream computing, and computer architecture. Owens received a PhD in electrical engineering from Stanford University. Contact him at [email protected].

August 2003 61