Guest Editors' Introduction

ACCELERATOR ARCHITECTURES

Sanjay Patel
Wen-mei W. Hwu
University of Illinois at Urbana-Champaign

We are entering the golden age of the computational accelerator. The commercial accelerator space is vibrant with activity from semiconductor vendors, large and small, that are designing accelerators for graphics, physics, network processing, and a variety of other applications. System vendors are introducing tools and programming systems to lower the barriers to entry for software development for their platforms. We are already seeing the initial stream of applications that benefit from these accelerators, and there are definite signs that more are yet to come. The research space is blossoming with very broad, multidisciplinary activity in advanced research and development for new classes of accelerator architecture and applications to tap into their power.

It is our honor to serve as coeditors of this special issue of IEEE Micro on accelerator architectures. There is much to be said about this area, and the authors of our articles have provided a good sampling of the commercial and academic work in this emerging area.

As with any emerging area, the technical boundaries of the field are not well established. The challenges and problems associated with acceleration are still formative. The commercial potential is not generally accepted. There is still the question, "If we build it, who will come?"

In this introductory article, let us spend some time trying to define the area of computational acceleration, discuss some of the architectural trade-offs, clarify some of the issues, drawbacks, and advantages of applications development on accelerator architectures, and try to articulate who might come if we build it.

What is an accelerator?
Let us first attempt to better define the notion of an accelerator. An accelerator is a separate architectural substructure (on the same chip, or on a different die) that is architected using a different set of objectives than the base processor, where these objectives are derived from the needs of a special class of applications. Through this manner of design, the accelerator is tuned to provide higher performance at lower cost, or at lower power, or with less development effort than with the general-purpose base hardware. Depending on the domain, accelerators often bring a greater than 1,000× advantage in performance, or cost, or power over a general-purpose processor. It's worth noting that this definition is quite broad, covering everything from special-purpose function units added to a base processor, to computational offload units, to separate, special processors added to the base platform. Examples of accelerators include floating-point coprocessors, graphics processing units (GPUs) that accelerate the rendering of a vertex-based 3D model into a 2D viewing plane, and accelerators for the motion estimation step of a video codec.

We view an accelerator as an augmentation of a base-class, general-purpose, or commodity system. As such, the accelerator is added to the system to achieve greater functionality or performance. Figure 1 shows the general system architecture of an accelerator, where the accelerator is attached to the base processor via an interconnect. Many variations of this model are possible, including the accelerator connected via a system bus such as PCI Express or HyperTransport. This is a relatively low-cost path, given the commodity

Published by the IEEE Computer Society 0272-1732/08/$20.00 © 2008 IEEE

Figure 1. Several generalized accelerator system architectures.

support for these protocols, but, because these buses are intended to support a wide variety of devices, they are typically of modest performance and capability. PCI Express 2.0, for example, provides up to 16 Gbytes/s of bandwidth at microseconds of latency.

Some accelerator domains require tighter coupling between the accelerator and the processor, and for such domains other models may be more appropriate. The accelerator can be attached via the processor bus, such as a front-side bus, where the accelerator would sit in a processor-like socket in the system, as is the case with the AMD Torrenza initiative.1 This model can provide higher bandwidth at lower latency, and with tighter integration than the system bus (for example, direct access to processor memory, possibly with coherence activity), but at a higher cost, which might not be feasible for cost-sensitive markets.

Even tighter coupling is possible with the accelerator directly on the same die as the processor, as is used with the CellBE processor. The choice of a particular integration model depends on the nature of the domain, the volume of chips its market can support, and the general cost of solution it can bear. In this view, the accelerator can be thought of as a heterogeneous extension to the base platform.

Accelerators can have macroarchitectures that span from fixed-function, special-purpose chips (early generations of graphics chips were of this variety) to highly programmable engines tuned to the needs of a particular domain (latest-generation graphics chips are of this variety).
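To make these bandwidth and latency figures concrete, a back-of-the-envelope model can estimate when offloading over a system bus pays off. This is a sketch with illustrative numbers only; `offload_wins` and its parameters are our own invention (the defaults loosely echo the PCI Express 2.0 figures cited here), not any vendor's API:

```python
# Back-of-the-envelope offload model: an offload wins only when the
# accelerator's compute-time savings exceed the cost of moving data
# across the interconnect. All numbers are illustrative assumptions.

def offload_wins(bytes_moved, cpu_time_s, accel_speedup,
                 link_latency_s=1e-6, link_bw_bytes_s=16e9):
    """Return True if offloading beats running entirely on the CPU."""
    transfer_s = 2 * (link_latency_s + bytes_moved / link_bw_bytes_s)  # there and back
    accel_time_s = cpu_time_s / accel_speedup
    return accel_time_s + transfer_s < cpu_time_s

# A tiny kernel (10 microseconds of CPU work) does not amortize the round trip:
print(offload_wins(bytes_moved=1e6, cpu_time_s=10e-6, accel_speedup=100))

# A large kernel (100 milliseconds of CPU work) clearly does:
print(offload_wins(bytes_moved=1e6, cpu_time_s=100e-3, accel_speedup=100))
```

The model makes the point quantitatively: work units shipped across a commodity bus must be large enough to amortize two traversals of the link, a theme we return to below.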

JULY–AUGUST 2008 5


The choices of macroarchitecture are driven primarily by the diversity of computation in the domain requiring the accelerator. The more varied the computation, the more programmable the accelerator will need to be.

Generally speaking, accelerator architectures maximize throughput per unit area of silicon (or, depending on product and technology constraints, throughput per watt), invariably by exploiting parallelism via fine-grained data-parallel hardware. Almost all exploit some form of multiword single-instruction, multiple-data (SIMD) operation, as SIMD hardware can have better throughput per area than a more general parallelism model; however, this passes some optimization cost on to the developer. Various memory models are employed: some accelerators have software-managed memories, some have hardware-managed caches, some have direct hardware support for interprocessor communication, and some have very high-bandwidth channels to external memory. Accelerators also tend to use specialized, fixed-function hardware for frequent, regular computation. When contrasted with general-purpose CPU architectures, which are optimized for low latency and a richer application programming model, the fine-grained parallel accelerator architectures appear more akin to digital signal processors (DSPs) and early RISC processors that moved the performance burden out of the hardware and into the software layers.

Viewed from a more market-driven perspective, accelerators arise because small sets of economically relevant applications demand more performance or more functionality than the base platform can provide. The economics of the situation justify the inclusion of additional hardware. History has taught us that this is a precarious path, though. Many ideas for computational acceleration have been proposed; many companies have attempted commercial solutions and found little success. People often fail to include the solution's longevity in their feasibility assessment of the technology. Eventually, the general-purpose device usurps the accelerator, as Moore's law reduces the cost of performance in the CPU. The market for the accelerator then erodes, and the accelerator dies. Examples include floating-point coprocessors, audio DSPs for PCs, and video decode acceleration chips. A sustainable accelerator model requires an application domain where "too much performance is never enough."

Application domains
Although it would be highly presumptuous of us to attempt to articulate future applications that will demand acceleration, we can examine a few domains that demand it today. Graphics, gaming, network processing (which includes TCP offload, encryption, deep packet inspection, intelligent routing, IPTV, and XML processing), and video encoding are the well-known spaces in which commercial accelerator chips for improving system performance are generally commercially viable. Markets exist for specialized chips for image processing and specialized functional blocks for SoCs for mobile devices (to improve overall performance per watt). Scientific computing, oil and gas exploration, and financial modeling have also been strong markets in which the accelerator model has provided value, particularly as more computation is done on an interactive, client-side basis in order to drastically reduce delay to discovery or decision making.

Why does the acceleration model work well for these domains? Primarily, these domains fit the mold in which "too much performance is never enough." Additional performance provided by the base platform is too costly (in dollars or in watts) or not presently possible; thus, a customized solution makes economic sense.

From a more technical perspective, these domains are amenable to an accelerator-based solution, for which a combination of parallelism, pipelining, and regularity of computation is necessary. Accelerators use parallelism to gain their performance advantage over general-purpose processors; thus, the key computation in the domain of interest must exhibit substantial parallelism to take advantage of the accelerator. Since the accelerator works in concert with the CPU, which runs the bulk of the application code, the application must be amenable to offloading via a relatively long-latency and low-bandwidth interconnect to the accelerator. For this, the application must deliver computational work units of sufficient size to the accelerator and, in the general case, send them in a decoupled, pipelined fashion, without stalling on the return of the results.

Authorized licensed use limited to: Politecnico di Milano. Downloaded on January 29, 2009 at 09:01 from IEEE Xplore. Restrictions apply. the CPU, which is running the bulk of the portions of the computation to offload, application code, the application must be isolate the data structures required by these amenable to offloading via a relatively long- portions, manage the transfer of these data latency and low-bandwidth interconnect to structures to the accelerator memory (if the the accelerator. For this, the application accelerator has a separate address space from must deliver computational work units of the main processor), synchronize between sufficient size to the accelerator, and in the the main processor and the accelerator, general case, send them in a decoupled, transfer the result data back to the main pipelined fashion to the accelerator, without memory, and integrate the results into the stalling on the return of the results. Some original data structures. These offloading accelerators exploit the regularity of the tasks are in addition to the original computation or communication patterns of development effort required for the non- a particular domain to gain their advantage accelerated software. What makes this over the general-purpose solution. Embed- process even more difficult is that data ding specialized hardware paths and logic transfer between the CPU and the acceler- into the accelerator results in computation- ator is often a major performance overhead, ally denser substructures than using general- and so needs to be tuned and optimized. To purpose paths. If a substantial amount of achieve the desired performance benefits, the computation can be mapped onto the the application developer must overlap the special paths, the architecture can realize a computation with data transfer in a pipe- net gain. 
lined fashion, which often requires major changes to the program structure surround- Application development ing those components to be offloaded. One An important criterion for the viability of architectural approach to reduce this com- a particular accelerator technology is the plexity is to have a shared memory space cost of development for applications to tap (potentially cache coherent) between the into the technology. Traditionally, most main processor and the accelerator, and this accelerator studies have focused on the level is the option provided by the AMD of speedup one can achieve with accelera- Torrenza and CSI bus models for tors, with little or no attention paid to the accelerator integration. Gelado et al. pro- development effort required for achieving posed architectural enhancements that allow the speedup. There are few domains that entire data structures to be hosted by either require performance (or lower power) at any the CPU memory or the accelerator cost. In a typical application scenario, memory on a dynamic basis to relieve the however, engineering cost can quickly erode application developers from the burden of the performance benefit. A sensible vice manually having to manage this process.2 president of engineering will always ques- The second source of complexity arises tion the cost involved in achieving the from the current lack of a standard speedup. If the costs are too high, the architectural model (de facto or otherwise) accelerator platform will be correctly reject- for accelerators. General-purpose processing ed over the lower-performing, lower-cost, has benefited greatly from standardization general-purpose solution. It is therefore around the PC architecture and its various important that one understands the major flavors. Software developers can develop sources of cost in using accelerators in code knowing that their code will have a application development. 
significant current and future base of There are, in general, four major con- installed hardware on which it can run, tributors to this cost. First, the nature of an and that their software can be ported to accelerator as a separate architectural entity similar platforms with well-understood working in concert with the general-purpose costs. Some acceleration domains, such as CPU makes the application development raster-based graphics, have well-defined intrinsically more complex. As we described standards for applications development, earlier, the application must be offloaded. which in return has provided a commercial An application developer must identify the explosion of applications that use the ......



Most accelerator spaces are, however, less mature, much more fragmented, and currently highly dynamic. An application developer considering acceleration of a computationally intensive kernel has a variety of off-the-shelf architectures to choose from (such as GPUs, floating-point arrays, and FPGA architectures), each with its own application development environment and source language! For each of these hardware choices, the longevity of the programming interface is uncertain: it is unclear whether future chips from the same vendor will continue to support the same interface. In the worst case, an application developer needs to craft a different version of the application for each accelerator. In the GPU space, attempts have been made to take advantage of standard graphics libraries such as OpenGL to provide a uniform programming model across different GPU implementations, such as the RapidMind API or the emerging OpenCL standard. Such approaches, however, are limited by the restrictions imposed by the underlying low-level graphics interfaces, which has prompted several vendors to introduce models of their own that map directly down to their hardware. AMD has provided support using the Brook+ streaming model for acceleration on its ATI GPUs. Nvidia's CUDA (discussed in another article in this issue) is a shared-memory-like data-parallel programming model for Nvidia GPUs. We are even seeing generalizations of these vendor-specific models, such as the recent MCUDA tool3 that enables CUDA source code to be efficiently targeted to both Nvidia GPUs and multicore CPUs, using compiler transformations to automatically remap the application with threading granularities and memory types appropriate for the particular target architecture. For many accelerators, however, the development tools are less mature, less stable, and offer fewer features than those of the PC-based environment, adding risk to the accelerator-based development effort.

The third source of complexity arises from the fact that accelerators are designed to maximize computation throughput, which is often achieved at the expense of ease of programmability. Accelerators often have software-managed memories, special-purpose hardware, or raw hardware interfaces, all of which require additional software complexity for management or tuning in return for higher performance. For example, software-managed SRAMs offer a sizable increase in on-chip storage density over hardware-managed caches, but require additional work on the part of the developer to manage the SRAMs and to remap data structures to the address space of the local memories. Several proposed programming models attempt to abstract away some of these details. For example, the Sequoia model offers a virtualized, self-managed set of on-chip memory types that can be automatically targeted to different hardware architectures, insulating developers from managing their own on-chip memory.4

The fourth source of complexity comes from optimization. Those seeking to accelerate their application are fundamentally looking at a multivariable optimization problem that includes parallelization. Getting good performance on an accelerator often requires sophisticated code and data structure transformations to expose the right variety and amount of parallelism (SIMD parallelism, in most cases) to effectively utilize chip resources and to adhere to the idiosyncrasies of the underlying hardware. One typically needs to perform simultaneous optimization of the algorithm, data structure selection, threading granularity, data-tiling dimensions, register usage, data prefetch distance, and loop unrolling. These parameters are often not orthogonal to one another: changing one may require changing another due to limited resources, such as the threading degree and local register file size in a CUDA program. The performance effect of varying these parameters is often nonintuitive and requires actual code development and execution-time measurements to quantify. The Spiral project seeks to automate the tuning process for digital signal processing algorithms.5 Ryoo et al. use a parameterized programming style and an automated parameter-space search and pruning methodology to reduce the programming effort required to achieve an optimal combination of parameter values.6
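An exhaustive search with pruning, in the spirit of the Ryoo et al. methodology, can be sketched as follows. The cost model in `measure` is entirely synthetic (a stand-in for compiling and timing a real kernel configuration), and the register-budget pruning rule is an assumed heuristic, not a property of any real device:

```python
# Sketch of automated parameter-space search with pruning. measure() is a
# made-up stand-in for compiling and timing a real kernel on hardware;
# its shape and the register-budget rule are assumptions for illustration.
from itertools import product

def measure(tile, unroll, threads_per_block):
    # Synthetic cost model: favors mid-sized tiles and moderate unrolling,
    # and penalizes block sizes far from an assumed sweet spot.
    return abs(tile - 32) + 2 * abs(unroll - 4) + abs(threads_per_block - 128) / 32

def search(tiles, unrolls, block_sizes, register_budget=64):
    """Exhaustively search the space, pruning configurations whose
    estimated per-thread register use exceeds the assumed budget."""
    best_cfg, best_cost = None, float("inf")
    for tile, unroll, tpb in product(tiles, unrolls, block_sizes):
        if unroll * 8 > register_budget:      # prune: would spill registers
            continue
        cost = measure(tile, unroll, tpb)
        if cost < best_cost:
            best_cfg, best_cost = (tile, unroll, tpb), cost
    return best_cfg

print(search(tiles=[8, 16, 32, 64], unrolls=[1, 2, 4, 8, 16],
             block_sizes=[64, 128, 256]))  # (32, 4, 128)
```

Pruning matters because the full cross-product of tuning parameters grows multiplicatively; discarding configurations that provably violate a resource limit before measuring them keeps the search tractable.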

Authorized licensed use limited to: Politecnico di Milano. Downloaded on January 29, 2009 at 09:01 from IEEE Xplore. Restrictions apply. mal combination of parameter values.6 On a gy efficiency and general purpose macroscopic level, Amdahl’s law must be programmability to meet the demands contended with: the nonaccelerated portion of visual computing and other work- of the application can start to dominate loads that are inherently parallel in performance, requiring optimization or nature. acceleration of its own. Often the scope of the acceleration effort increases as new The CellBE was similarly envisioned to bottlenecks become exposed, and a larger be a general-purpose accelerator rooted in fraction of the original code base requires the broader consumer market. IBM, in optimization. partnership with Sony and Toshiba, devel- oped the architecture as a general comput- A quick survey of accelerator projects ing accelerator for consumer applications, The vibrancy of accelerators can be such as the Playstation 3 game console, but strongly felt in the amount of commercial also to span into specialized acceleration for activity in the area. Major semiconductor (for example, Roadrunner, vendors are introducing fine-grained data- http://www.lanl.gov/orgs/hpc/roadrunner). parallel architectures for general-purpose Looking past the major vendors, we also acceleration, with their products commer- notice a flurry of activity in smaller cially anchored in the graphics acceleration companies hoping to tap into an emerging space. accelerator market. XMOS, Ambric, Plu- Nvidia and AMD both have announced rality, Movidia, Stream Processors, Tilera, general-purpose computing products (hard- Element CXi, Clearspeed, AGEIA, Rap- ware and software) that use their 3D port, and Cavium are a few of the graphics accelerator architectures. 
Products companies that have publicly announced from both companies were formative for the chips that provide acceleration for particular general-purpose GPU (GPGPU) movement application spaces—nearly all of them that spawned much of the commercial GPU providing ASIC-based solutions that involve computing activity. Nvidia recently an- general-purpose acceleration using tens to nounced the G280, a 240-core chip that hundreds of cores. For areas where the supports nearly 1 Tflops of peak single- acceleration demand is high, but for which precision performance and is programmable no ASIC solution is yet available, developers via the CUDA model. AMD broke the 1- are opting to use FPGA-based solutions. Tflops mark with its ATI HD 4800 FPGAs developed by and are series GPU, which has 1.2-Tflops peak integrated by system integrators such as performance. Intel has a project code- Nallatech, XtremeData, and Alpha Data named Larrabee that is intended to be an into accelerator boards for commodity accelerator for visual computing. According systems. In a few cases, owing to the to an official Intel statement, extreme regularity of computation and communication, FPGA-based solutions The Larrabee architecture will be Intel’s can provide substantial value over even next step in evolving the visual comput- fine-grained general-purpose accelerators ing platform. The Larrabee architecture such as GPUs. For example, FPGA solu- includes a high-performance, wide tions for accelerating logic simulation are SIMD vector processing unit (VPU) particularly profitable because individual bit along with a new set of vector instruc- operations in the logic to be simulated can tions including integer and floating point be mapped directly into data paths in arithmetic, vector memory operations hardware. and conditional instructions. 
In addition, In the research sphere, several academic Larrabee includes a major new hard- projects in the recent past have led to ware coherent cache design enabling commercial impact in the accelerator area. the many-core architecture. The archi- Such seminal projects include the Imagine tecture and instructions have been at Stanford7 and Raw at MIT.8 These designed to deliver performance, ener- projects laid the groundwork for current ......



Presently, Intel has ongoing projects in its TeraScale Computing initiative, involving algorithms, applications, tools, and architectures, including the 80-core, 1-Tflops Polaris chip announced in 2007.9 Intel has codified a vision for future accelerator applications as those that involve recognition, mining, and synthesis. At the University of California, Davis, Baas et al. are working on high-density computational structures configured in an asynchronous array.10 At the University of Illinois at Urbana-Champaign, we are leading an effort called the Rigel project, a 2,000+-core, fine-grained, multiple-instruction-stream, multiple-data-stream (MIMD) accelerator for visual computing applications, such as real-time computer vision and imaging.11 Rigel pushes the envelope on throughput per unit area, making trade-offs that promote effective programmability for parallel applications with minimal compromise to performance. Rigel will be capable of multiple Tflops of peak performance and will be programmable via a bulk synchronous programming model.

Toward the general-purpose accelerator
It is an exciting time for computational accelerators, as evidence mounts that we are nearing transformative thresholds in computer performance that will enable breakthroughs in many fields, including interactive, intelligent visual computing applications. This, perhaps, is the driver behind the GPU's transformation into a general-purpose accelerator, and it may make the accelerator the dominant silicon component in commodity computing systems. One possible grand unified architecture contains a base general-purpose processor that is relatively small, necessary for legacy and control applications (which are not worth parallelizing onto the accelerator), and an accelerator that provides the computing horsepower and commands the bulk of the silicon. The accelerator architecture will most likely consist of a fine-grained array of cores, architected to optimize performance per area or watt. As silicon architects, we largely understand how to provide value here.

The larger and mostly open question is how to reduce the cost of software development to utilize this grand unified architecture of base processor and accelerator. For non-performance-sensitive applications (of which there are many), the easiest path to market is to run on the base processor. Performance-sensitive applications will need to be optimized to take advantage of the fine-grained array; this, as we have articulated, is a path with many pitfalls, all of which erode the value proposition to software developers.

As advanced developers and system researchers, we must undertake the task of charting the roadmap of feasible solutions. Some issues will require attention from the commercial sphere: a lack of standards and a shifting architectural roadmap will impede application development en masse, but this can be solved with foresight and commercial collaboration. Other issues require more research: effective parallelism models, tools, languages, APIs, and frameworks for application development are needed to lower the cost of developing applications for accelerators. Researchers should continue to work to understand the problems, and they should build prototype solutions.

In this issue
It is our pleasure to introduce this IEEE Micro special issue on accelerator architectures. In this issue, we have included an invited article from Garland et al. of Nvidia that describes some of the experiences, optimization strategies, pitfalls, and successes in porting applications to its GPU-based computing accelerators using the CUDA programming environment.

We also received a good number of submissions to the open call for papers for this issue, from which we selected four articles that provide a good sampling of advances in the broad space of accelerator applications and architectures. Woo et al. describe the POD architecture, an application accelerator that can attach to a base processor using 3D integration, such as die stacking. Bougard et al. apply accelerator architecture principles to the domain of software-defined radios, and demonstrate a chip design that achieves substantial instruction throughput at very low power, which is particularly important for wireless devices.

Wen et al. describe a hybrid on-chip dual-mode memory system, which can function as a purely software-managed SRAM or as a hardware-managed cache when the application's reference stream is irregular. Last, we include an article outside the norm, on a potentially interesting specialized application domain involving bioimplantable devices. In this article, Jin and Cheng describe a set of benchmarks that represent the base computation for applications that could potentially improve human health and viability.

We hope you enjoy this issue! MICRO

References
1. M. Hummel, M. Krause, and D. O'Flaherty, "AMD and HP: Protocol Enhancements for Tightly Coupled Accelerators," white paper, AMD, 2007; http://www.hp.com/techservers/hpccn/hpccollaboration/ADCatalyst/downloads/AMD_HPTCAWP.pdf.
2. I. Gelado et al., "CUBA: An Architecture for Efficient CPU/Co-processor Data Communication," Proc. 22nd Ann. Int'l Conf. Supercomputing (ICS 08), ACM Press, 2008, pp. 299-308.
3. J. Stratton, S. Stone, and W.-m. Hwu, "MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores," tech. report IMPACT-08-01, Univ. of Illinois at Urbana-Champaign, 2008.
4. K. Fatahalian et al., "Sequoia: Programming the Memory Hierarchy," Proc. 2006 ACM/IEEE Conf. Supercomputing (SC 06), ACM Press, 2006; http://doi.acm.org/10.1145/1188455.1188543.
5. M. Püschel et al., "SPIRAL: Code Generation for DSP Transforms," Proc. IEEE, vol. 93, no. 2, Feb. 2005, pp. 232-275.
6. S. Ryoo et al., "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA," Proc. 13th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP 08), ACM Press, 2008, pp. 73-82.
7. U. Kapasi et al., "The Imagine Stream Processor," Proc. 2002 IEEE Int'l Conf. Computer Design (ICCD 02), IEEE CS Press, 2002, pp. 282-288.
8. M.B. Taylor et al., "The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs," IEEE Micro, vol. 22, no. 2, Mar./Apr. 2002, pp. 25-35.
9. J. Held, J. Bautista, and S. Koehl, "From a Few Cores to Many: A Tera-scale Computing Research Overview," white paper, Intel, 2006; ftp://download.intel.com/research/platform/terascale/terascale_overview_paper.pdf.
10. Z. Yu et al., "An Asynchronous Array of Simple Processors for DSP Applications," Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC 06), IEEE Press, 2006, pp. 428-429.
11. J. Kelm et al., "Rigel: A Scalable Processor Architecture for 1000+ Core Computing," tech. report IMPACT-08-02, Univ. of Illinois at Urbana-Champaign, 2008.

Sanjay J. Patel is an associate professor of electrical and computer engineering and Willett Faculty Scholar at the University of Illinois at Urbana-Champaign. His research interests include high-performance and massively parallel chip architectures, parallel programming models, and visual computing applications. He has considerable commercial chip design experience, working and consulting for a number of companies, including Digital Equipment Corporation and Intel. From 2005 to 2008, he was the CTO and Chief Architect of AGEIA Technologies, a fabless semiconductor company that developed chips for accelerating physical simulation for video games. Patel earned his BS, MS, and PhD in computer science and engineering from the University of Michigan, Ann Arbor. He is a member of the IEEE.

Wen-mei W. Hwu is the Sanders-AMD Endowed Chair Professor in the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. He also directs the IMPACT research group (Illinois Microarchitecture Project utilizing Advanced Compiler Technology, www.crhc.uiuc.edu/Impact) and is hardware lead of the NSF Petascale Computer Project awarded to the University of Illinois and IBM in 2007.



His research interests are in architecture, implementation, and programming tools for parallel computer systems, in particular, novel computer architectures and the compiler techniques they require. For his contributions in research and teaching, he received the Eta Kappa Nu Holmes MacDonald Outstanding Teaching Award, the ACM SIGARCH Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the Tau Beta Pi Daniel C. Drucker Eminent Faculty Award, and the ACM/IEEE ISCA Most Influential Paper Award. Hwu has a PhD in computer science from the University of California, Berkeley. He is a fellow of both the IEEE and the ACM.

Direct questions and comments about this special issue to Sanjay Patel, 262 Coordinated Science Laboratory, MC 228, 1308 West Main St., Urbana, IL 61801-2307; [email protected].

For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/publications/dlib.

