Guest Editors' Introduction

ACCELERATOR ARCHITECTURES

Sanjay Patel
Wen-mei W. Hwu
University of Illinois at Urbana-Champaign

We are entering the golden age of the computational accelerator. The commercial accelerator space is vibrant with activity from semiconductor vendors, large and small, that are designing accelerators for graphics, physics, network processing, and a variety of other applications. System vendors are introducing tools and programming systems to lower the barriers to entry for software development for their platforms. We are already seeing the initial stream of applications that benefit from these accelerators, and there are definite signs that more are yet to come. The research space is blossoming with very broad, multidisciplinary activity in advanced research and development for new classes of accelerator architecture and applications to tap into their power.

It is our honor to serve as coeditors of this special issue of IEEE Micro on accelerator architectures. There is much to be said about this area, and the authors of our articles have provided a good sampling of the commercial and academic work in this emerging area.

As with any emerging area, the technical boundaries of the field are not well established. The challenges and problems associated with acceleration are still formative. The commercial potential is not generally accepted. There is still the question, "If we build it, who will come?"

In this introductory article, let us spend some time trying to define the area of computational acceleration, discuss some of the architectural trade-offs, clarify some of the issues, drawbacks, and advantages of applications development on accelerator architectures, and try to articulate who might come if we build it.

What is an accelerator?
Let us first attempt to better define the notion of an accelerator. An accelerator is a separate architectural substructure (on the same chip, or on a different die) that is architected using a different set of objectives than the base processor, where these objectives are derived from the needs of a special class of applications. Through this manner of design, the accelerator is tuned to provide higher performance at lower cost, or at lower power, or with less development effort than with the general-purpose base hardware. Depending on the domain, accelerators often bring a greater than 1,000× advantage in performance, or cost, or power over a general-purpose processor. It's worth noting that this definition is quite broad, covering everything from special-purpose function units added to a base processor, to computational offload units, to separate, special processors added to the base platform. Examples of accelerators include floating-point coprocessors, graphics processing units (GPUs) that accelerate the rendering of a vertex-based 3D model into a 2D viewing plane, and accelerators for the motion estimation step of a video codec.

We view an accelerator as an augmentation of a base-class, general-purpose, or commodity system. As such, the accelerator is added to the system to achieve greater functionality or performance. Figure 1 shows the general system architecture of an accelerator, where the accelerator is attached to the base processor via an interconnect. Many variations of this model are possible, including the accelerator connected via a system bus such as PCI Express or HyperTransport. This is a relatively low-cost path, given the commodity

Published by the IEEE Computer Society 0272-1732/08/$20.00 © 2008 IEEE

Figure 1. Several generalized accelerator system architectures.

support for these protocols, but, because these buses are intended to support a wide variety of devices, they are typically of modest performance and capability. PCI Express 2.0, for example, provides up to 16 Gbytes/s of bandwidth at microseconds of latency.

Some accelerator domains require tighter coupling between the accelerator and the processor, and for such domains other models may be more appropriate. The accelerator can be attached via the processor bus, such as a front-side bus, where the accelerator would sit in a processor-like socket in the system, as is the case with the AMD Torrenza initiative.1 This model can provide higher bandwidth at lower latency, and with tighter integration than the system bus (for example, direct access to processor memory, possibly with coherence activity), but at a higher cost, which might not be feasible for cost-sensitive markets.

Even tighter coupling is possible with the accelerator directly on the same die as the processor, as is used with the CellBE processor. The choice of a particular integration model depends on the nature of the domain, the volume of chips its market can support, and the general cost of solution it can bear. In this view, the accelerator can be thought of as a heterogeneous extension to the base platform.

Accelerators can have macroarchitectures that span from fixed-function, special-purpose chips (early generations of graphics chips were of this variety) to highly programmable engines tuned to the needs of a particular domain (latest-generation graphics chips are of this variety).
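To make these bandwidth and latency figures concrete, a back-of-the-envelope model can estimate when offloading over a system bus pays off. This is a sketch with illustrative numbers only; `offload_wins` and its parameters are our own invention (the defaults loosely echo the PCI Express 2.0 figures cited here), not any vendor's API:

```python
# Back-of-the-envelope offload model: an offload wins only when the
# accelerator's compute-time savings exceed the cost of moving data
# across the interconnect. All numbers are illustrative assumptions.

def offload_wins(bytes_moved, cpu_time_s, accel_speedup,
                 link_latency_s=1e-6, link_bw_bytes_s=16e9):
    """Return True if offloading beats running entirely on the CPU."""
    transfer_s = 2 * (link_latency_s + bytes_moved / link_bw_bytes_s)  # there and back
    accel_time_s = cpu_time_s / accel_speedup
    return accel_time_s + transfer_s < cpu_time_s

# A tiny kernel (10 microseconds of CPU work) does not amortize the round trip:
print(offload_wins(bytes_moved=1e6, cpu_time_s=10e-6, accel_speedup=100))

# A large kernel (100 milliseconds of CPU work) clearly does:
print(offload_wins(bytes_moved=1e6, cpu_time_s=100e-3, accel_speedup=100))
```

The model makes the point quantitatively: work units shipped across a commodity bus must be large enough to amortize two traversals of the link, a theme we return to below.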

JULY–AUGUST 2008 5


The choices of macroarchitecture are driven primarily by the diversity of computation in the domain requiring the accelerator. The more varied the computation, the more programmable the accelerator will need to be.

Generally speaking, accelerator architectures maximize throughput per unit area of silicon (or, depending on product and technology constraints, throughput per watt), invariably by exploiting parallelism via fine-grained data-parallel hardware. Almost all exploit some form of multiword single-instruction, multiple-data (SIMD) operation, as SIMD hardware can have better throughput per area than a more general parallelism model; however, this passes some optimization cost on to the developer. Various memory models are employed: some accelerators have software-managed memories, some have hardware-managed caches, some have direct hardware support for interprocessor communication, and some have very high-bandwidth channels to external memory. Accelerators also tend to use specialized, fixed-function hardware for frequent, regular computation. When contrasted with general-purpose CPU architectures, which are optimized for low latency and a richer application programming model, the fine-grained parallel accelerator architectures appear more akin to digital signal processors (DSPs) and early RISC processors that moved the performance burden out of the hardware and into the software layers.

Viewed from a more market-driven perspective, accelerators arise because small sets of economically relevant applications demand more performance or more functionality than the base platform can provide. The economics of the situation justify the inclusion of additional hardware. History has taught us that this is a precarious path, though. Many ideas for computational acceleration have been proposed; many companies have attempted commercial solutions and found little success. People often fail to include the solution's longevity in their feasibility assessment of the technology. Eventually, the general-purpose device usurps the accelerator, as Moore's law reduces the cost of performance in the CPU. The market for the accelerator then erodes, and the accelerator dies. Examples include floating-point coprocessors, audio DSPs for PCs, and video decode acceleration chips. A sustainable accelerator model requires an application domain where "too much performance is never enough."

Application domains
Although it would be highly presumptuous of us to attempt to articulate future applications that will demand acceleration, we can examine a few domains that demand it today. Graphics, gaming, network processing (which includes TCP offload, encryption, deep packet inspection, intelligent routing, IPTV, and XML processing), and video encoding are the well-known spaces in which commercial accelerator chips for improving system performance are generally commercially viable. Markets exist for specialized chips for image processing and specialized functional blocks for SoCs for mobile devices (to improve overall performance per watt). Scientific computing, oil and gas exploration, and financial modeling have also been strong markets in which the accelerator model has provided value, particularly as more computation is done on an interactive, client-side basis in order to drastically reduce delay to discovery or decision making.

Why does the acceleration model work well for these domains? Primarily, these domains fit the mold in which "too much performance is never enough." Additional performance provided by the base platform is too costly (in dollars or in watts) or not presently possible; thus, a customized solution makes economic sense.

From a more technical perspective, these domains are amenable to an accelerator-based solution, for which a combination of parallelism, pipelining, and regularity of computation is necessary. Accelerators use parallelism to gain their performance advantage over general-purpose processors; thus, the key computation in the domain of interest must exhibit substantial parallelism to take advantage of the accelerator. Since the accelerator works in concert with the CPU, which runs the bulk of the application code, the application must be amenable to offloading via a relatively long-latency and low-bandwidth interconnect to the accelerator. For this, the application must deliver computational work units of sufficient size to the accelerator and, in the general case, send them in a decoupled, pipelined fashion, without stalling on the return of the results.

Authorized licensed use limited to: Politecnico di Milano. Downloaded on January 29, 2009 at 09:01 from IEEE Xplore. Restrictions apply. the CPU, which is running the bulk of the portions of the computation to offload, application code, the application must be isolate the data structures required by these amenable to offloading via a relatively long- portions, manage the transfer of these data latency and low-bandwidth interconnect to structures to the accelerator memory (if the the accelerator. For this, the application accelerator has a separate address space from must deliver computational work units of the main processor), synchronize between sufficient size to the accelerator, and in the the main processor and the accelerator, general case, send them in a decoupled, transfer the result data back to the main pipelined fashion to the accelerator, without memory, and integrate the results into the stalling on the return of the results. Some original data structures. These offloading accelerators exploit the regularity of the tasks are in addition to the original computation or communication patterns of development effort required for the non- a particular domain to gain their advantage accelerated software. What makes this over the general-purpose solution. Embed- process even more difficult is that data ding specialized hardware paths and logic transfer between the CPU and the acceler- into the accelerator results in computation- ator is often a major performance overhead, ally denser substructures than using general- and so needs to be tuned and optimized. To purpose paths. If a substantial amount of achieve the desired performance benefits, the computation can be mapped onto the the application developer must overlap the special paths, the architecture can realize a computation with data transfer in a pipe- net gain. 
lined fashion, which often requires major changes to the program structure surround- Application development ing those components to be offloaded. One An important criterion for the viability of architectural approach to reduce this com- a particular accelerator technology is the plexity is to have a shared memory space cost of development for applications to tap (potentially cache coherent) between the into the technology. Traditionally, most main processor and the accelerator, and this accelerator studies have focused on the level is the option provided by the AMD of speedup one can achieve with accelera- Torrenza and CSI bus models for tors, with little or no attention paid to the accelerator integration. Gelado et al. pro- development effort required for achieving posed architectural enhancements that allow the speedup. There are few domains that entire data structures to be hosted by either require performance (or lower power) at any the CPU memory or the accelerator cost. In a typical application scenario, memory on a dynamic basis to relieve the however, engineering cost can quickly erode application developers from the burden of the performance benefit. A sensible vice manually having to manage this process.2 president of engineering will always ques- The second source of complexity arises tion the cost involved in achieving the from the current lack of a standard speedup. If the costs are too high, the architectural model (de facto or otherwise) accelerator platform will be correctly reject- for accelerators. General-purpose processing ed over the lower-performing, lower-cost, has benefited greatly from standardization general-purpose solution. It is therefore around the PC architecture and its various important that one understands the major flavors. Software developers can develop sources of cost in using accelerators in code knowing that their code will have a application development. 
significant current and future base of There are, in general, four major con- installed hardware on which it can run, tributors to this cost. First, the nature of an and that their software can be ported to accelerator as a separate architectural entity similar platforms with well-understood working in concert with the general-purpose costs. Some acceleration domains, such as CPU makes the application development raster-based graphics, have well-defined intrinsically more complex. As we described standards for applications development, earlier, the application must be offloaded. which in return has provided a commercial An application developer must identify the explosion of applications that use the ......



Most accelerator spaces are, however, less mature, much more fragmented, and currently highly dynamic. An application developer considering acceleration of a computationally intensive kernel has a variety of off-the-shelf architectures to choose from (such as GPUs, floating-point arrays, and FPGA architectures), each with its own application development environment and source language! For each of these hardware choices, the longevity of the programming interface is uncertain: it is unclear whether future chips from the same vendor will continue to support the same interface. In the worst case, an application developer needs to craft a different version of the application for each accelerator. In the GPU space, attempts have been made to take advantage of standard graphics libraries such as OpenGL to provide a uniform programming model across different GPU implementations, such as the RapidMind API or the emerging OpenCL standard. Such approaches, however, are limited by the restrictions imposed by the underlying low-level graphics interfaces, which has prompted several vendors to introduce models of their own that map directly down to their hardware. AMD has provided support using the Brook+ streaming model for acceleration on its ATI GPUs. Nvidia's CUDA (discussed in another article in this issue) is a shared-memory-like data-parallel programming model for Nvidia GPUs. We are even seeing generalizations of these vendor-specific models, such as the recent MCUDA tool3 that enables CUDA source code to be efficiently targeted to both Nvidia GPUs and multicore CPUs, using compiler transformations to automatically remap the application with threading granularities and memory types appropriate for the particular target architecture. For many accelerators, however, the development tools are less mature, less stable, and offer fewer features than those of the PC-based environment, adding risk to the accelerator-based development effort.

The third source of complexity arises from the fact that accelerators are designed to maximize computation throughput, which is often achieved at the expense of ease of programmability. Accelerators often have software-managed memories, special-purpose hardware, or raw hardware interfaces, all of which require additional software complexity for management or tuning in return for higher performance. For example, software-managed SRAMs offer a sizable increase in on-chip storage density over hardware-managed caches, but require additional work on the part of the developer to manage the SRAMs and to remap data structures to the address space of the local memories. Several proposed programming models attempt to abstract away some of these details. For example, the Sequoia model offers a virtualized, self-managed set of on-chip memory types that can be automatically targeted to different hardware architectures, insulating developers from managing their own on-chip memory.4

The fourth source of complexity comes from optimization. Those seeking to accelerate their application are fundamentally looking at a multivariable optimization problem that includes parallelization. Getting good performance on an accelerator often requires sophisticated code and data structure transformations to expose the right variety and amount of parallelism (SIMD parallelism, in most cases) to effectively utilize chip resources and to adhere to the idiosyncrasies of the underlying hardware. One typically needs to perform simultaneous optimization of the algorithm, data structure selection, threading granularity, data-tiling dimensions, register usage, data prefetch distance, and loop unrolling. These parameters are often not orthogonal to one another: changing one may require changing another due to limited resources, such as the threading degree and local register file size in a CUDA program. The performance effect of varying these parameters is often nonintuitive and requires actual code development and execution-time measurements to quantify. The Spiral project seeks to automate the tuning process for digital signal processing algorithms.5 Ryoo et al. use a parameterized programming style and an automated parameter-space search and pruning methodology to reduce the programming effort required to achieve an optimal combination of parameter values.6
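An exhaustive search with pruning, in the spirit of the Ryoo et al. methodology, can be sketched as follows. The cost model in `measure` is entirely synthetic (a stand-in for compiling and timing a real kernel configuration), and the register-budget pruning rule is an assumed heuristic, not a property of any real device:

```python
# Sketch of automated parameter-space search with pruning. measure() is a
# made-up stand-in for compiling and timing a real kernel on hardware;
# its shape and the register-budget rule are assumptions for illustration.
from itertools import product

def measure(tile, unroll, threads_per_block):
    # Synthetic cost model: favors mid-sized tiles and moderate unrolling,
    # and penalizes block sizes far from an assumed sweet spot.
    return abs(tile - 32) + 2 * abs(unroll - 4) + abs(threads_per_block - 128) / 32

def search(tiles, unrolls, block_sizes, register_budget=64):
    """Exhaustively search the space, pruning configurations whose
    estimated per-thread register use exceeds the assumed budget."""
    best_cfg, best_cost = None, float("inf")
    for tile, unroll, tpb in product(tiles, unrolls, block_sizes):
        if unroll * 8 > register_budget:      # prune: would spill registers
            continue
        cost = measure(tile, unroll, tpb)
        if cost < best_cost:
            best_cfg, best_cost = (tile, unroll, tpb), cost
    return best_cfg

print(search(tiles=[8, 16, 32, 64], unrolls=[1, 2, 4, 8, 16],
             block_sizes=[64, 128, 256]))  # (32, 4, 128)
```

Pruning matters because the full cross-product of tuning parameters grows multiplicatively; discarding configurations that provably violate a resource limit before measuring them keeps the search tractable.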

Authorized licensed use limited to: Politecnico di Milano. Downloaded on January 29, 2009 at 09:01 from IEEE Xplore. Restrictions apply. mal combination of parameter values.6 On a gy efficiency and general purpose macroscopic level, Amdahl’s law must be programmability to meet the demands contended with: the nonaccelerated portion of visual computing and other work- of the application can start to dominate loads that are inherently parallel in performance, requiring optimization or nature. acceleration of its own. Often the scope of the acceleration effort increases as new The CellBE was similarly envisioned to bottlenecks become exposed, and a larger be a general-purpose accelerator rooted in fraction of the original code base requires the broader consumer market. IBM, in optimization. partnership with Sony and Toshiba, devel- oped the architecture as a general comput- A quick survey of accelerator projects ing accelerator for consumer applications, The vibrancy of accelerators can be such as the Playstation 3 game console, but strongly felt in the amount of commercial also to span into specialized acceleration for activity in the area. Major semiconductor (for example, Roadrunner, vendors are introducing fine-grained data- http://www.lanl.gov/orgs/hpc/roadrunner). parallel architectures for general-purpose Looking past the major vendors, we also acceleration, with their products commer- notice a flurry of activity in smaller cially anchored in the graphics acceleration companies hoping to tap into an emerging space. accelerator market. XMOS, Ambric, Plu- Nvidia and AMD both have announced rality, Movidia, Stream Processors, Tilera, general-purpose computing products (hard- Element CXi, Clearspeed, AGEIA, Rap- ware and software) that use their 3D port, and Cavium are a few of the graphics accelerator architectures. 
Products companies that have publicly announced from both companies were formative for the chips that provide acceleration for particular general-purpose GPU (GPGPU) movement application spaces—nearly all of them that spawned much of the commercial GPU providing ASIC-based solutions that involve computing activity. Nvidia recently an- general-purpose acceleration using tens to nounced the G280, a 240-core chip that hundreds of cores. For areas where the supports nearly 1 Tflops of peak single- acceleration demand is high, but for which precision performance and is programmable no ASIC solution is yet available, developers via the CUDA model. AMD broke the 1- are opting to use FPGA-based solutions. Tflops mark with its ATI HD 4800 FPGAs developed by and are series GPU, which has 1.2-Tflops peak integrated by system integrators such as performance. Intel has a project code- Nallatech, XtremeData, and Alpha Data named Larrabee that is intended to be an into accelerator boards for commodity accelerator for visual computing. According systems. In a few cases, owing to the to an official Intel statement, extreme regularity of computation and communication, FPGA-based solutions The Larrabee architecture will be Intel’s can provide substantial value over even next step in evolving the visual comput- fine-grained general-purpose accelerators ing platform. The Larrabee architecture such as GPUs. For example, FPGA solu- includes a high-performance, wide tions for accelerating logic simulation are SIMD vector processing unit (VPU) particularly profitable because individual bit along with a new set of vector instruc- operations in the logic to be simulated can tions including integer and floating point be mapped directly into data paths in arithmetic, vector memory operations hardware. and conditional instructions. 
In addition, In the research sphere, several academic Larrabee includes a major new hard- projects in the recent past have led to ware coherent cache design enabling commercial impact in the accelerator area. the many-core architecture. The archi- Such seminal projects include the Imagine tecture and instructions have been at Stanford7 and Raw at MIT.8 These designed to deliver performance, ener- projects laid the groundwork for current ......



Presently, Intel has ongoing projects in its TeraScale Computing initiative, involving algorithms, applications, tools, and architectures, including the 80-core, 1-Tflops Polaris chip announced in 2007.9 Intel has codified a vision for future accelerator applications as those that involve recognition, mining, and synthesis. At the University of California, Davis, Baas et al. are working on high-density computational structures configured in an asynchronous array.10 At the University of Illinois at Urbana-Champaign, we are leading an effort called the Rigel project, a 2,000+-core, fine-grained, multiple-instruction-stream, multiple-data-stream (MIMD) accelerator for visual computing applications, such as real-time computer vision and imaging.11 Rigel pushes the envelope on throughput per unit area, making trade-offs that promote effective programmability for parallel applications with minimal compromise to performance. Rigel will be capable of multiple Tflops of peak performance and will be programmable via a bulk synchronous programming model.

Toward the general-purpose accelerator
It is an exciting time for computational accelerators, as evidence mounts that we are nearing transformative thresholds in computer performance that will enable breakthroughs in many fields, including interactive, intelligent visual computing applications. This, perhaps, is the driver behind the GPU's transformation into a general-purpose accelerator, and it may make the accelerator the dominant silicon component in commodity computing systems. One possible grand unified architecture contains a base general-purpose processor that is relatively small, necessary for legacy and control applications (which are not worth parallelizing onto the accelerator), and an accelerator that provides the computing horsepower and commands the bulk of the silicon. The accelerator architecture will most likely consist of a fine-grained array of cores, architected to optimize performance per area or watt. As silicon architects, we largely understand how to provide value here.

The larger and mostly open question is how to reduce the cost of software development to utilize this grand unified architecture of base processor and accelerator. For non-performance-sensitive applications (of which there are many), the easiest path to market is to run on the base processor. Performance-sensitive applications will need to be optimized to take advantage of the fine-grained array; this, as we have articulated, is a path with many pitfalls, all of which erode the value proposition to software developers.

As advanced developers and system researchers, we must undertake the task of charting the roadmap of feasible solutions. Some issues will require attention from the commercial sphere: a lack of standards and a shifting architectural roadmap will impede application development en masse, but this can be solved with foresight and commercial collaboration. Other issues require more research: effective parallelism models, tools, languages, APIs, and frameworks for application development are needed to lower the cost of developing applications for accelerators. Researchers should continue to work to understand the problems, and they should build prototype solutions.

In this issue
It is our pleasure to introduce this IEEE Micro special issue on accelerator architectures. In this issue, we have included an invited article from Garland et al. of Nvidia that describes some of the experiences, optimization strategies, pitfalls, and successes in porting applications to its GPU-based computing accelerators using the CUDA programming environment.

We also received a good number of submissions to the open call for papers for this issue, from which we selected four articles that provide a good sampling of advances in the broad space of accelerator applications and architectures. Woo et al. describe the POD architecture, an application accelerator that can attach to a base processor using 3D integration, such as die stacking. Bougard et al. apply accelerator architecture principles to the domain of software-defined radios, and demonstrate a chip design that achieves substantial instruction throughput at very low power, which is particularly important for wireless devices.

Wen et al. describe a hybrid on-chip dual-mode memory system, which can function as a purely software-managed SRAM or as a hardware-managed cache when the application's reference stream is irregular. Last, we include an article outside the norm, on a potentially interesting specialized application domain involving bioimplantable devices. In this article, Jin and Cheng describe a set of benchmarks that represent the base computation for applications that could potentially improve human health and viability.

We hope you enjoy this issue! MICRO

References
1. M. Hummel, M. Krause, and D. O'Flaherty, "AMD and HP: Protocol Enhancements for Tightly Coupled Accelerators," white paper, AMD, 2007; http://www.hp.com/techservers/hpccn/hpccollaboration/ADCatalyst/downloads/AMD_HPTCAWP.pdf.
2. I. Gelado et al., "CUBA: An Architecture for Efficient CPU/Co-processor Data Communication," Proc. 22nd Ann. Int'l Conf. Supercomputing (ICS 08), ACM Press, 2008, pp. 299-308.
3. J. Stratton, S. Stone, and W.-m. Hwu, "MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores," tech. report IMPACT-08-01, Univ. of Illinois at Urbana-Champaign, 2008.
4. K. Fatahalian et al., "Sequoia: Programming the Memory Hierarchy," Proc. 2006 ACM/IEEE Conf. Supercomputing (SC 06), ACM Press, 2006; http://doi.acm.org/10.1145/1188455.1188543.
5. M. Püschel et al., "SPIRAL: Code Generation for DSP Transforms," Proc. IEEE, vol. 93, no. 2, Feb. 2005, pp. 232-275.
6. S. Ryoo et al., "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA," Proc. 13th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP 08), ACM Press, 2008, pp. 73-82.
7. U. Kapasi et al., "The Imagine Stream Processor," Proc. 2002 IEEE Int'l Conf. Computer Design (ICCD 02), IEEE CS Press, 2002, pp. 282-288.
8. M.B. Taylor et al., "The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs," IEEE Micro, vol. 22, no. 2, Mar./Apr. 2002, pp. 25-35.
9. J. Held, J. Bautista, and S. Koehl, "From a Few Cores to Many: A Tera-scale Computing Research Overview," white paper, Intel, 2006; ftp://download.intel.com/research/platform/terascale/terascale_overview_paper.pdf.
10. Z. Yu et al., "An Asynchronous Array of Simple Processors for DSP Applications," Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC 06), IEEE Press, 2006, pp. 428-429.
11. J. Kelm et al., "Rigel: A Scalable Processor Architecture for 1000+ Core Computing," tech. report IMPACT-08-02, Univ. of Illinois at Urbana-Champaign, 2008.

Sanjay J. Patel is an associate professor of electrical and computer engineering and Willett Faculty Scholar at the University of Illinois at Urbana-Champaign. His research interests include high-performance and massively parallel chip architectures, parallel programming models, and visual computing applications. He has considerable commercial chip design experience, working and consulting for a number of companies, including Digital Equipment Corporation and Intel. From 2005 to 2008, he was the CTO and Chief Architect of AGEIA Technologies, a fabless semiconductor company that developed chips for accelerating physical simulation for video games. Patel earned his BS, MS, and PhD in computer science and engineering from the University of Michigan, Ann Arbor. He is a member of the IEEE.

Wen-mei W. Hwu is the Sanders-AMD Endowed Chair Professor in the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. He also directs the IMPACT research group (Illinois Microarchitecture Project utilizing Advanced Compiler Technology, www.crhc.uiuc.edu/Impact) and is hardware lead of the NSF Petascale Computer Project awarded to the University of Illinois and IBM in 2007.



His research interests are in architecture, implementation, and programming tools for parallel computer systems, in particular, novel computer architectures and the compiler techniques they require. For his contributions in research and teaching, he received the Eta Kappa Nu Holmes MacDonald Outstanding Teaching Award, the ACM SIGARCH Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the Tau Beta Pi Daniel C. Drucker Eminent Faculty Award, and the ACM/IEEE ISCA Most Influential Paper Award. Hwu has a PhD in computer science from the University of California, Berkeley. He is a fellow of both the IEEE and the ACM.

Direct questions and comments about this special issue to Sanjay Patel, 262 Coordinated Science Laboratory, MC 228, 1308 West Main St., Urbana, IL 61801-2307; [email protected].

For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/publications/dlib.

