RAMP: RESEARCH ACCELERATOR FOR MULTIPLE PROCESSORS

THE RAMP PROJECT'S GOAL IS TO ENABLE THE INTENSIVE, MULTIDISCIPLINARY INNOVATION THAT THE COMPUTING INDUSTRY WILL NEED TO TACKLE THE PROBLEMS OF PARALLEL PROCESSING. RAMP ITSELF IS AN OPEN-SOURCE, COMMUNITY-DEVELOPED, FPGA-BASED EMULATOR OF PARALLEL ARCHITECTURES. ITS DESIGN FRAMEWORK LETS A LARGE, COLLABORATIVE COMMUNITY DEVELOP AND CONTRIBUTE REUSABLE, COMPOSABLE DESIGN MODULES. THREE COMPLETE DESIGNS—FOR TRANSACTIONAL MEMORY, DISTRIBUTED SYSTEMS, AND DISTRIBUTED-SHARED MEMORY—DEMONSTRATE THE PLATFORM'S POTENTIAL.

John Wawrzynek and David Patterson, University of California, Berkeley
Mark Oskin, University of Washington
Shih-Lien Lu, Intel
Christoforos Kozyrakis, Stanford University
James C. Hoe, Carnegie Mellon University
Derek Chiou, University of Texas at Austin
Krste Asanović, Massachusetts Institute of Technology

In 2005, the computer hardware industry took a historic change of direction: The major microprocessor companies all announced that their future products would be single-chip multiprocessors, and that future performance improvements would rely on software-specified parallelism rather than additional software-transparent parallelism extracted automatically by the microarchitecture. Several of us discussed this milestone at the International Symposium on Computer Architecture (ISCA) in June 2005. We were struck that a multibillion-dollar industry would bet its future on solving the general-purpose parallel computing problem, when so many have previously attempted but failed to provide a satisfactory approach.

To tackle the parallel processing problem, our industry urgently needs innovative solutions, which in turn require extensive codevelopment of hardware and software. However, this type of innovation currently gets bogged down in the traditional development cycle:

- Prototyping a new architecture in hardware takes approximately four years and many millions of dollars, even at only research quality.
- Software engineers are ineffective until the new hardware actually shows up, because simulators are too slow to support serious software development activities. Software engineers tend to innovate only after hardware arrives.
- Feedback from software engineers on the current production hardware cannot help the immediate next generation because of overlapped hardware development cycles. Instead, the feedback loop can take several hardware generations to close fully.

Hence, we conspired on how to create an inexpensive, reconfigurable, highly parallel platform that would attract researchers from many disciplines—architectures, compilers, operating systems, applications, and others—to work together on perhaps the greatest challenge facing computing in the past 50 years. Because our industry desperately needs solutions, our goal is to develop a platform that allows far more rapid evolution than traditional approaches.

RAMP vision

Our hallway conversations led us to the idea of using field-programmable gate arrays (FPGAs) to emulate highly parallel architectures at hardware speeds. FPGAs enable very rapid turnaround for new hardware: You can tape out an FPGA design every day and have a new system fabricated overnight. Another key advantage of FPGAs is that they easily exploit Moore's law. As the number of cores per microprocessor die grows, FPGA density will grow at about the same rate. Today we can map about 16 simple processors onto a single FPGA, which means we can construct a 1,000-processor system in just 64 FPGAs. Such a system is cheaper and consumes less power than a custom multiprocessor, at about $100 and 1 W per processor.
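The arithmetic behind these figures is easy to check. The following back-of-envelope sketch (Python) uses only the round numbers quoted above; they are the article's own estimates, not a measured bill of materials:

```python
# Scaling and cost estimates from the text, checked mechanically.
cores_per_fpga = 16      # simple processors that fit on one FPGA today
target_cores   = 1000    # size of the emulated multiprocessor
cost_per_core  = 100     # dollars per emulated processor (approximate)
watts_per_core = 1.0     # watts per emulated processor (approximate)

fpgas_needed = -(-target_cores // cores_per_fpga)   # ceiling division
print(f"FPGAs needed: {fpgas_needed}")              # 63; the text rounds up to 64
print(f"System cost:  ${target_cores * cost_per_core / 1e6:.1f}M")    # ~$0.1M
print(f"System power: {target_cores * watts_per_core / 1000:.1f} kW") # ~1 kW
```

These are roughly the cost and power figures that reappear in the RAMP column of Table 1.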
Because our goal is to ramp up the rate of innovation in hardware and software multiprocessor research, we named this project RAMP (Research Accelerator for Multiple Processors). RAMP is an open-source project to develop and share the hardware and software necessary to create parallel architectures. RAMP is not just a hardware architecture project, however. Perhaps our most important goal is to support the software community as it struggles to take advantage of the potential capabilities of parallel microprocessors, by providing a malleable platform through which the software community can collaborate with the hardware community.

Unlike commercial multiprocessor hardware, RAMP is designed as a research platform. We plan to include research features that are impossible to include in real hardware systems owing to speed, cost, or practicality issues. For example, the FPGA design can incorporate additional hardware to monitor any event in the system. Being able to add arbitrary event probes, including arbitrary computation on those events, provides visibility formerly available only in software simulators, but without the inevitable slowdown software simulators face when introducing such visibility.

A second example of how RAMP differs from real hardware is reproducibility. Using the RAMP Description Language (RDL) framework, different researchers can construct the same deterministic parallel computing system that will perform exactly the same way every time, clock cycle for clock cycle. By using processor designs donated by industry, RAMP users will start with familiar architectures and operating systems, which will provide far more credibility than software simulations that model idealized processors or that ignore operating-system effects. RDL is designed to make it easy to construct a full computer out of RDL-compatible modules. Our target speeds of 100 to 200 MHz are slower than real hardware, but they are fast enough to run standard operating systems and large-scale applications, and orders of magnitude faster than software simulators. Finally, because of the similarities in the design flow of logic for FPGAs and custom hardware, we believe RAMP is realistic enough to convince software developers to start aggressive development on innovative architectures and programming models, and to convince hardware and software companies that RAMP results are relevant.

This combination of cost, power, speed, flexibility, observability, reproducibility, and credibility will make the platform attractive to software and hardware researchers interested in the parallel challenge. In particular, it allows the research community to revive the 1980s culture of building experimental hardware and software systems, which today has been almost entirely lost because of the higher cost and difficulty of building hardware.

Table 1 compares five options for pursuing parallel-systems research in academia: a conventional shared-memory multiprocessor (SMP), a cluster, a simulator, a custom-built chip and system, and RAMP. The rows are the features of interest, with a grade for each alternative and quantification where appropriate. Cost rules out a large SMP for most academics. The costs of both purchase and ownership make a large cluster too expensive for most academics as well. Our only alternative thus far has been software simulation, and indeed that has been the vehicle for most architecture research in the past decade.


Table 1. Relative comparison of five options for parallel research. From the architect's perspective, the most surprising aspect of this table is that performance is not only not the top concern, it is at the bottom of the list. The platform just needs to be fast enough to run the entire software stack.

Feature                  | SMP         | Cluster      | Simulator     | Custom      | RAMP
-------------------------|-------------|--------------|---------------|-------------|------------------
Scalability (1,000 CPUs) | C           | A            | A             | A           | A
Cost (1,000 CPUs)        | F ($40M)    | C ($2M-$3M)  | A+ ($0M)      | F ($20M)    | A ($0.1M-$0.2M)
Cost of ownership        | A           | D            | A             | D           | A
Power/space (kW, racks)  | D (120, 12) | D (120, 12)  | A+ (0.1, 0.1) | A           | B (1.5, 0.3)
Development community    | D           | A            | A             | F           | B
Observability            | D           | C            | A+            | A           | B+
Reproducibility          | B           | D            | A+            | A           | A+
Reconfigurability        | D           | C            | A+            | C           | A+
Credibility of result    | A+          | A+           | D             | A+          | B+/A
Performance (clock)      | A (2 GHz)   | A (3 GHz)    | F (0 GHz)     | B (0.4 GHz) | C (0.1 GHz)
Modeling flexibility     | D           | D            | A             | B           | A
Overall grade            | C           | C+           | B             | B-          | A

As mentioned, software developers rarely use software simulators, because they run too slowly and the results might not be credible. In particular, it's unclear how credible results will be to industry when they are based on simulations of 1,000 processors running small snippets of applications.

The RAMP option is a compromise among these alternatives. It is so much cheaper than custom hardware that it will make highly scalable systems affordable to academics. It is as flexible as simulators, allowing rapid evolution of the state of the art in parallel computing. And it is so much faster than simulators that it could actually tempt software people to try out a new hardware idea.

This speed also lets architects explore a much larger space in their research and thus do a more thorough evaluation of their proposals. Although architects can achieve high batch simulation throughput using multiple independent software simulations distributed over a large computing cluster, this does not reduce the latency of obtaining the single key result that can move the research forward. Nor does it help an application developer trying to debug the port of an application to the new target system (the emulated machine is called the target; the underlying FPGA hardware is the host). Worse, for multiprocessor targets, simulation speed, in both instructions per second per core and total instructions per second, drops as more cores are simulated and as operating-system effects are included, and the amount of memory required for each node in the host compute cluster rises rapidly.

RAMP is obviously attractive to a broad set of hardware and software researchers in parallelism. Some representative research projects that we believe could benefit from using RAMP are

- testing the robustness of multiprocessor hardware and software under fault insertion;
- developing thread scheduling and data allocation and migration techniques for large-scale multiprocessors;
- developing and evaluating ISAs for large-scale multiprocessors;
- creating an environment to emulate a geographically distributed computer, with realistic delays, packet loss, and so on (Internet in a box);
- evaluating the impact of 128-bit and other floating-point representations on convergence of parallel programs;
- developing and testing hardware and software schemes for improved security;
- recording traces of complex programs running on a large-scale multiprocessor;
- evaluating the design of multiprocessor switches (serial point-to-point, distributed torus, fat trees);
- developing data-flow architectures for conventional programming languages;
- developing parallel file systems;
- testing dedicated enhancements to standard processors; and
- compiling software directly into FPGAs.

We believe that RAMP's upside potential is so compelling that the platform will create a "watering hole" effect in academic departments as people from many disciplines use RAMP in their research. As researchers from such diverse fields begin using RAMP, conversations between disciplines that rarely communicate may result, ultimately helping to more quickly develop multiprocessor systems that are easy to program efficiently. Indeed, to help industry win its bet on parallelism, we will need the help of many people, for the parallel future is not just an architecture change, but likely a change to the entire software ecosystem.

RAMP design framework

From the earliest stages of the RAMP project, it was clear that we needed a standardized design framework to enable a large community of users to cooperate and build a useful library of interoperable hardware models. This design framework has several challenging goals. It must support both cycle-accurate emulation of detailed parameterized machine models and rapid functional-only emulations. It should hide the details of the underlying FPGA emulation substrate from the module designer as much as possible, so that groups with different FPGA emulation hardware can share designs, and so that modules remain reusable after FPGA emulation hardware upgrades. In addition, the design framework should not dictate the hardware design language (HDL) that developers choose. Our approach was to develop a decoupled machine model and design discipline. This discipline is enforced by the RDL and a compiler that automates the difficult task of providing cycle-accurate emulation of distributed communicating components.1

The RAMP design framework is based on a few central concepts. A RAMP target model is a collection of loosely coupled target units communicating with latency-insensitive protocols over well-defined target channels. Figure 1 gives a simple schematic example of two connected units. In practice, a unit will be a large component corresponding to tens of thousands of gates of emulated hardware—for example, a processor with an L1 cache, a DRAM controller, or a network router stage. All communication between units is via messages sent over unidirectional point-to-point interunit channels, where each channel is buffered to allow units to execute decoupled from one another.

Figure 1. Basic RAMP communication model.

Partitioning of target models is far simpler than the classic circuit-partitioning problem associated with traditional FPGA-based circuit emulation. Although units will be large, we expect them to be relatively small compared to the FPGA capacity, so they will never be partitioned across multiple FPGAs. A target model is partitioned only at the channel interfaces, leaving units intact. Channels connecting units that map to separate FPGAs are implemented using FPGA-to-FPGA physical links. Currently, partitioning is driven by user annotations in RDL, but eventually we expect to build automatic partitioning tools.

Each unit faithfully models the behavior of each target clock cycle in the component. The target unit models can be developed either as register-transfer-level (RTL) code in a standard HDL (currently Verilog, VHDL, and Bluespec are supported) for compilation onto the FPGA fabric, or as software models that execute either on attached workstations or on hard or soft processor cores embedded within the FPGA fabric. Many target units taken from existing RTL code will execute a single target clock cycle in one FPGA physical clock cycle, giving a high simulation speed.
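To make these concepts concrete, the following Python sketch models the design discipline just described: units connected by buffered, unidirectional point-to-point channels, wired up from a structural netlist. It is only a software analogue for illustration, not actual RDL syntax or generated gateware, and all names in it are invented.

```python
from collections import deque

class Channel:
    """Unidirectional point-to-point FIFO connecting two units.
    Buffering lets the endpoint units execute decoupled from each other."""
    def __init__(self, latency=1, buffering=4):
        self.latency = latency      # per-channel target latency, set at configuration
        self.buffering = buffering  # target cycles of buffering (FIFO depth)
        self.fifo = deque()

    def full(self):
        return len(self.fifo) >= self.buffering

class Unit:
    """A target unit, e.g. a processor with its L1 cache or a DRAM controller."""
    def __init__(self, name):
        self.name = name
        self.inputs = {}    # channels feeding this unit
        self.outputs = {}   # channels this unit drives

    def can_advance(self):
        # A unit may advance one target cycle only when every input channel
        # holds a cycle's worth of activity (one queued message stands in for
        # that here) and every output channel has room for another cycle's
        # worth -- the synchronization rule the generated wrappers enforce.
        return (all(ch.fifo for ch in self.inputs.values()) and
                all(not ch.full() for ch in self.outputs.values()))

# An RDL description includes a structural netlist naming unit instances and
# their channel connections (see below); a plain dictionary stands in for it.
netlist = {
    ("cpu0", "dram0"): Channel(latency=2),   # request channel
    ("dram0", "cpu0"): Channel(latency=2),   # response channel
}
units = {name: Unit(name) for pair in netlist for name in pair}
for (src, dst), ch in netlist.items():       # the RDL compiler's wiring job
    units[src].outputs[dst] = ch
    units[dst].inputs[src] = ch
```

In the real framework, the RDL compiler performs the wiring shown in the final loop and also generates the per-HDL unit wrappers, so module authors never hand-code this plumbing.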
However, to save FPGA resources, a unit model can be designed to take multiple physical host clock cycles on the FPGA to emulate one target clock cycle, or it might even use a varying number of physical clock cycles. Initially, the whole RAMP host system uses the same physical clock rate (nominally around 100 MHz), with some higher physical clock rates in off-chip I/O drivers.
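Continuing the sketch above, this host/target clock decoupling amounts to a wrapper that reports a completed target cycle only every k physical host cycles (illustrative Python with invented names; a real unit is gateware, and the advance rule it obeys is detailed in the next section):

```python
class MultiCycleUnit:
    """Emulates one target clock cycle in k physical host clock cycles.
    Target-visible timing is unchanged; only emulation wall-clock speed
    and FPGA resource usage vary with k (which may itself vary per cycle)."""
    def __init__(self, unit, host_cycles_per_target_cycle):
        self.unit = unit                       # a Unit from the sketch above
        self.k = host_cycles_per_target_cycle
        self.phase = 0

    def host_cycle(self):
        """Called once per physical host clock; returns True when one
        complete target cycle has been emulated."""
        if not self.unit.can_advance():
            return False          # stall: stay synchronized with neighbors
        self.phase += 1
        if self.phase < self.k:
            return False          # still working on this target cycle
        self.phase = 0
        return True               # one target cycle's work committed
```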


Table 2. RAMP timeline.

Date          | Milestone
--------------|----------------------------------------------------------------
6 June 2005   | Hallway discussions lead to RAMP vision
13 June 2005  | The name "RAMP" coined; BEE2 selected as RAMP-1; a dozen people identified to develop RAMP
January 2006  | RAMP retreat and RDL tutorial at Berkeley
March 2006    | NSF infrastructure grant awarded
June 2006     | RAMP retreat at Massachusetts Institute of Technology; RAMP Red running with eight processors on RAMP-1 boards
January 2007  | RAMP Blue running with 256 processors on eight RAMP-1 boards
August 2007   | RAMP Red, White, and Blue running with 128 to 256 processors on 16 RAMP-1 boards; accurate clock cycle accounting and I/O model
December 2007 | RAMP-2 boards redesigned based on Virtex-5 and available for purchase; RAMP Web site has downloadable designs

Unit models are synchronized only through the point-to-point channels. The basic principle is that a unit cannot advance by a target clock cycle until it has received a target clock cycle's worth of activity on each input channel, and until the output channels are ready to receive another target cycle's worth of activity. This scheme forms a distributed concurrent-event simulator, in which the buffering in the channels lets units run at various physical speeds on the host while remaining logically synchronized in terms of target clock cycles. Unit model designers must produce the RTL code (or gateware) of each unit in their chosen HDL and specify the range of message sizes that each input or output channel can carry. For each supported HDL, the RAMP design framework provides tools that automatically generate a unit wrapper that interfaces to the channels and provides target cycle synchronization. The RTL code for the channels is generated automatically by the RDL compiler from an RDL description, which includes a structural netlist specifying the instances of each unit and how they are connected by channels.

The benefit of enforcing a standard channel-based communication strategy between units is that the RDL compiler and runtime system can provide many features automatically. Users can vary the target latency, target bandwidth, and target buffering on each channel at configuration time. The RAMP configuration tools will also provide the option of having channels run as fast as the underlying physical hardware will allow, thus supporting fast, functional-only emulation. We are also exploring the option of allowing these parameters to be changed dynamically at target-system boot time, to avoid rerunning the FPGA synthesis flow when varying parameters for performance studies.

The configuration tool will include support for interunit channels to be tapped and controlled to provide monitoring and debugging facilities. For example, by controlling the stall signals from the channels, a unit can be single-stepped. Using a separate, automatically inserted debugging network that is invisible to target system software, messages can be inserted into and read out from the channels entering and leaving any unit, and all significant events can be logged. These monitoring and debugging facilities will provide significant advantages over running applications on commercial hardware.

RAMP prototypes

Although most of the participants in the project are volunteers, we are on a fairly aggressive schedule. Table 2 shows the RAMP project timeline. We began RAMP development using preexisting FPGA boards; see the "RAMP hardware" sidebar. To seed the collaborative effort, we are developing three prototype systems: RAMP Red, RAMP Blue, and RAMP White.

Sidebar: RAMP hardware

Rather than begin the RAMP project by designing yet another FPGA board, for the RAMP-1 system we adopted the Berkeley Emulation Engine (BEE2).1 BEE2 boards serve as the platform for the first RAMP machine prototypes and help us understand our wish list of features for the next-generation board. That next-generation RAMP hardware platform, currently in design, will be based on a new board employing the recently announced Virtex-5 FPGA architecture.

Figure A shows the BEE2 compute module. Each compute module consists of five Xilinx Virtex-2 Pro-70 FPGA chips, each directly connected to four DDR2 240-pin DRAM dual in-line memory modules (DIMMs), with a maximum capacity of 4 Gbytes per FPGA. The four DIMMs are organized into four independent DRAM channels, each running at 200 MHz (400 DDR) with a 72-bit data interface. Peak aggregate memory bandwidth is therefore 12.8 Gbytes per second for each FPGA. The five FPGAs on the module are organized into four compute FPGAs and one control FPGA. The control FPGA has additional global interconnect interfaces and control signals to the secondary system components.

The connectivity on the compute module falls into two classes: on-board LVCMOS connections and off-board multigigabit transceiver (MGT) connections. The local mesh connects the four compute FPGAs in a 2 x 2 2D grid. Each link between adjacent FPGAs on the grid provides over 40 Gbps of data throughput. The four down links from the control FPGA to the computing FPGAs provide up to 20 Gbps per link. These direct FPGA-to-FPGA mesh links form a high-bandwidth, low-latency mesh network for the FPGAs on the same compute module, so all five FPGAs can be aggregated to form a virtual FPGA with five times the capacity.

All off-module connections use the MGTs on the FPGA. Each individual MGT channel is configured in software to run at 2.5 Gbps or 3.125 Gbps using 8B/10B encoding. Every four MGTs are channel bonded into a physical Infiniband 4X (IB4X) electrical connector to form a 10-Gbps, full-duplex (20 Gbps total) interface.
The IB4X connections are AC coupled on the receiving end to comply with the Infiniband and 10GBase-CX4 specification.

Using the 4X Infiniband physical connections, the compute modules can be wired into many network topologies, such as a 3D mesh. For applications requiring high-bisection-bandwidth random communication among many compute modules, the BEE2 system is designed to take advantage of commercial network switch technology, such as Infiniband or 10G Ethernet. The regular 10/100Base-T Ethernet connection available on the control FPGA provides an out-of-band communication network for user interface, low-speed system control, monitoring, and archival. The compute module runs the Linux OS on the control FPGA with a full IP network stack.

In our preliminary work developing the first RAMP prototypes, we have made extensive use of the Xilinx University Program (XUP) Virtex-II Pro Development System (http://www.xilinx.com/univ/xupv2p.html). As with the BEE2 board, the XUP board uses Xilinx Virtex-II Pro FPGA technology—in this case, a single XC2VP30 instead of five XC2VP70s. The XUP board also includes an FPGA-SDRAM interface (DDR instead of DDR2) and several I/O interfaces, such as video, USB2, and Ethernet. Despite its reduced capacity, the XUP board has been a convenient development platform for key gateware blocks before moving them to the BEE2 system.

References

1. C. Chang, J. Wawrzynek, and R.W. Brodersen, "BEE2: A High-End Reconfigurable Computing System," IEEE Design & Test, vol. 22, no. 2, Mar.-Apr. 2005, pp. 114-125.

Figure A. BEE2 module photograph (a) and architecture diagram (b).



Each of our initial prototypes contains a complete gateware and software configuration of a scalable multiprocessor populated with standard processor cores, switches, and operating systems. Once the base system is assembled and the software installed, users will be able to easily run complex system benchmarks and then modify this working system as desired. Or they can start from the ground up, using the basic components to build a new system. We expect users to release back to the community any enhancements and new gateware and software modules. A similar usage model has led to the proliferation of the SimpleScalar framework, which now covers a range of instruction sets and processor designs.

RAMP Red

RAMP Red is the first multiprocessor system with hardware support for transactional memory. Transactional memory transfers the responsibility for concurrency control from the programmer to the system.3 It introduces database semantics to the shared memory in a parallel system, which allows software tasks (transactions) to execute atomically and in isolation without the use of locks. Hardware support for transactional memory reduces the overhead of detecting and enforcing atomicity violations between concurrently executing transactions and guarantees correct execution in all cases.

RAMP Red implements the Stanford Transactional Coherence and Consistency (TCC) architecture for transactional memory.4 The design uses nine PowerPC 405 hard cores (embedded in the Xilinx Virtex-II Pro FPGAs) connected to a shared main-memory system through a packet-switched network. The built-in data cache in each PowerPC 405 core is disabled and replaced by a custom cache (emulated in the FPGA) with transactional-memory support. Each 32-Kbyte cache buffers the memory locations that are read and written by a transaction during its execution and detects atomicity violations with other ongoing transactions. An interesting feature of RAMP Red is its use of a transaction completion mechanism that eliminates the need for a conventional cache coherence protocol.

From an application's perspective, RAMP Red is a fully featured Linux workstation. The operating system actually runs on just one of the cores, while the remaining eight cores execute applications. A lightweight kernel in each application core forwards exceptions and system calls to the operating-system core. The programming model is multithreaded C or Java with locks replaced by transactional constructs. RAMP Red includes an extensive hardware and software framework for debugging, bottleneck identification, and performance tuning.

The RAMP Red design has been fully operational since June 2006. It runs at 100 MHz on RAMP-1, which is 100 times faster than the same architecture simulated in software on a 2-GHz workstation. Wee et al. provide more details of the RAMP Red design.5 Early experiments with enterprise, scientific, and artificial-intelligence applications have demonstrated the simplicity of parallel programming with transactional memory, and that RAMP Red achieves scalable performance. In the future, RAMP Red will be the basis for further research in transactional memory, focusing mostly on software productivity and system software support.6
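The mechanism that RAMP Red's transactional caches emulate can be illustrated with a toy model. The Python below is a deliberately simplified sketch of read/write-set tracking with commit-time violation detection; it is not the RAMP Red gateware or the full TCC protocol, and all names are invented:

```python
class Transaction:
    def __init__(self, tid):
        self.tid = tid
        self.read_set, self.write_set = set(), set()
        self.writes = {}                 # buffered stores, invisible until commit

    def load(self, mem, addr):
        self.read_set.add(addr)          # track every location read
        return self.writes.get(addr, mem.get(addr, 0))

    def store(self, addr, value):
        self.write_set.add(addr)         # track every location written
        self.writes[addr] = value        # buffer the store until commit

def commit(tx, mem, live_transactions):
    """TCC-style commit: the committing write set is broadcast; any other
    live transaction that has read an address we wrote must abort and retry."""
    for other in live_transactions:
        if other is not tx and tx.write_set & other.read_set:
            other.aborted = True         # atomicity violation detected
    mem.update(tx.writes)                # writes become visible atomically

# Two transactions racing on a shared counter:
mem = {"counter": 0}
t1, t2 = Transaction(1), Transaction(2)
t1.store("counter", t1.load(mem, "counter") + 1)
t2.store("counter", t2.load(mem, "counter") + 1)
commit(t1, mem, [t1, t2])                # t2 read what t1 wrote -> t2 aborts
assert getattr(t2, "aborted", False) and mem["counter"] == 1
```

Here t2 read a location in t1's committed write set, so t2 must abort and re-execute; this is exactly the conflict that lock-based code would have had to serialize pessimistically.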
RAMP Blue

RAMP Blue is a family of emulated message-passing machines that can run parallel applications written for the Message-Passing Interface (MPI) standard or for partitioned global-address-space languages such as Unified Parallel C (UPC). RAMP Blue can also model a networked server cluster.

The first RAMP Blue prototype was developed at the University of California, Berkeley; Figure 2 shows its hardware platform. It comprises a collection of BEE2 boards housed in 2U chassis and assembled in a standard 19-inch rack. Physical connection among the eight boards is through 10-Gbps Infiniband cables (the light-colored cables in Figure 2). The BEE2 boards are wired in an all-to-all configuration, with a direct connection from each board to all others through 10-Gbps links. System configuration, debugging, and monitoring take place through a 100-Mbps Ethernet switch with a connection to each board's control FPGA (the dark wires at the top of Figure 2). For system management and control, each board runs a full-featured Linux kernel on one PowerPC 405 hard core embedded in the control FPGA. Our initial target applications are the UPC versions of the NASA Advanced Supercomputing (NAS) Parallel Benchmarks.

Figure 2. Photograph of RAMP Blue prototype.

The four user FPGAs per BEE2 board are configured to hold a collection of 100-MHz Xilinx MicroBlaze soft processor cores running uClinux. We have mapped eight processor cores per FPGA, so the first prototype, with 32 user FPGAs, emulates a 256-way cluster system. In the future, the number of processor cores can be scaled up through several means. We will add more BEE2 boards—the simple all-to-all wiring configuration will accommodate up to 17 boards. We will also add more cores per FPGA—the current configuration of eight processor cores per FPGA consumes only 40 percent of the FPGA's logic resources. RAMP Blue implements all necessary multiprocessor components within the user FPGAs. In addition to the soft processor cores, each FPGA holds a packet network switch (one per core) for connection to cores on the same and other FPGAs, shared memory controllers, shared double-precision floating-point units, and a shared "console" switch for connection to the control FPGA.

In RAMP Blue, each processor is assigned its own DRAM memory space (at least 250 Mbytes per processor). The external memory interface of the MicroBlaze L1 cache connects to external DDR2 (double-data-rate 2) DRAM through a memory arbiter, as each DRAM channel is shared among a set of MicroBlaze cores. Because each BEE2 user FPGA has four independent DRAM memory channels, four processor cores would share one channel in the maximum-sized configuration (16 processor cores per FPGA). With each processor running at 100 MHz and each memory channel running a 200-MHz DDR 72-bit data interface, each processor can transfer 72 bits of data at 100 MHz, which is more than each processor core can consume even in our maximum-sized configuration. A simple round-robin scheme arbitrates among the cores.

The processor-to-processor network switch currently uses a simple interrupt-driven programmed-I/O approach. A Linux driver provides an Ethernet interface so that applications can access the processor network via traditional socket interfaces. We are planning a next-generation network interface with direct memory access through special ports in the memory controller.

A 256-core (eight per FPGA) version of the RAMP Blue prototype has been fully operational, running the NAS Parallel Benchmark suite, since December 2006. This initial prototype was not implemented using RDL, but a newer version based on RDL has been operational since February 2007. We are currently measuring and tuning the prototype's performance.
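A quick calculation using only the clock rates and widths quoted above shows why memory bandwidth still balances even in the maximum-sized, 16-core-per-FPGA configuration:

```python
# One DRAM channel: 72-bit interface at 200 MHz, double data rate.
channel_bw = 72 * 200e6 * 2            # 28.8 Gbps per channel
cores_per_channel = 16 // 4            # 16 cores sharing 4 channels per FPGA
share_per_core = channel_bw / cores_per_channel   # 7.2 Gbps per core

# A 100-MHz MicroBlaze moving at most 72 bits per cycle can demand:
core_demand = 72 * 100e6               # 7.2 Gbps
assert share_per_core >= core_demand   # supply just covers peak demand
```

Each core's share of a channel exactly matches its peak demand, and real cores consume less than that, so the round-robin arbiter described above suffices.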


Figure 3. High-level view of RAMP White.

RAMP White

RAMP White is a distributed-shared-memory machine that will demonstrate RAMP's open-component nature by integrating modules from RAMP Red, RAMP Blue, and contributions from other RAMP participants. The initial version is being designed and integrated at the University of Texas at Austin. The RAMP White effort began in the summer of 2006, somewhat after Red and Blue, and will be implemented in the following phases:

1. Global distributed-shared memory without caches. All requests to remote global memory will be serviced directly from remote memory. Communication will take place over a ring network.
2. Ring-based snoopy coherency. The basic infrastructure of the cache-less system will be expanded to include a snoopy cache that snoops the ring.
3. Directory-based coherency. A directory-based coherence engine eliminates the need for each cache to snoop all transactions but will use the same snoopy cache.

RAMP White will eventually be composed of processor units from the University of Washington and RAMP Blue teams, connected through a simple ring network (Figure 3). For expediency, the initial RAMP White will use embedded PowerPC processors. Each processor unit will contain one processor connected to an intersection unit that provides connections to a memory controller (MCU), a network interface unit (NIU), and I/O if the processor unit supports it. The NIU will be connected to a simple ring network.

The intersection unit switches requests and replies between the processor, local memory, I/O units, and the network. The initial intersection unit is very simple. Memory requests from the processor are divided into local memory requests and global memory requests (both handled by memory), I/O requests (handled by the I/O module), and remote requests (handled by the NIU). Remote requests arriving from the NIU are forwarded to the memory. Because the initial version of RAMP White does not cache global locations, incoming remote requests need not be snooped by the processor.

I/O will be handled by a centralized I/O subsystem mapped into the global address space. Each processor will run a separate SMP-capable Linux that will take locks to access I/O. The global memory support then transparently handles shared I/O. Later versions will add coherency support using a soft cache (emulated in the FPGA). RAMP White's first snoopy cache will be based on RAMP Red's snoopy cache. It is possible that some or all of the data in the emulated cache will actually reside in DRAM if there is not sufficient space in the FPGA itself. In the coherent versions of RAMP White, the intersection unit passes all incoming remote requests to the coherent cache for snooping before allowing the remote request to proceed to the next stage.
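The initial intersection unit's routing policy is simple enough to state as code. The Python below sketches the request classification described above; the address-map boundaries and the home-node function are invented for illustration, since the article does not specify a memory map:

```python
# Invented address-map boundaries, for illustration only.
LOCAL_TOP  = 0x4000_0000     # below this: this node's private DRAM
GLOBAL_TOP = 0x8000_0000     # LOCAL_TOP..GLOBAL_TOP: shared global space
IO_TOP     = 0x9000_0000     # GLOBAL_TOP..IO_TOP: centralized I/O subsystem
NUM_NODES  = 8

def home_node(addr):
    """Hypothetical home assignment: global space interleaved across nodes."""
    return (addr >> 12) % NUM_NODES

def route(addr, node_id):
    """Initial (cache-less) RAMP White intersection unit: classify a
    processor request and pick the module that services it."""
    if addr < LOCAL_TOP:
        return "MCU"                     # local memory request
    if addr < GLOBAL_TOP:                # global memory request
        return "MCU" if home_node(addr) == node_id else "NIU"
    if addr < IO_TOP:
        return "IO"                      # handled by the I/O module
    raise ValueError("unmapped address")

# Remote requests arriving from the ring go straight to memory; with no
# caching of global locations, the processor need not snoop them.
def route_incoming(addr):
    return "MCU"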
Sidebar: Simulation and emulation technologies

Early computer architecture research relied on convincing argument or simple analytical models to justify design decisions. Beginning in the early 1980s, computers became fast enough that simple simulations of architectural ideas could be performed. By the 1990s and onward, computer architecture research relied extensively on software simulation. Many sophisticated software simulation frameworks exist, including SimpleScalar,1 SimOS,2 RSIM,3 Simics (http://www.virtutech.com), ASIM,4 and M5.5 As our field's research focus shifts to multicore, multithreading systems, a new crop of multiprocessor full-system simulators with accurate operating-system and I/O support (for example, see http://www.ece.cmu.edu/simflex/flexus.html) has more recently emerged.6,7 Software simulation has significantly changed the computer architecture research field because it is comparably easy to use, and it can be parallelized effectively by using separate program instances to simultaneously explore the design space of architectural choices.

Nevertheless, even for studying single-core architectures, software simulation is slow to generate a single data point. Detailed simulations of out-of-order microprocessors typically execute thousands of instructions per second. Multiprocessor simulation tightens the performance bottleneck because the simulators slow down commensurately as the number of studied cores continues to rise. Several researchers have explored mechanisms to speed up simulation. The first of these techniques relied on modifying the inputs to the benchmarks used, to reduce their total running time.8 Later, researchers recognized that the repetitive nature of program execution could be exploited to reduce the amount of time for which a detailed microarchitectural model is exercised. The first technique to exploit this was basic block vectors.9 Later researchers proposed techniques that continuously sample program execution to find demonstrably accurate subsets.10

But the challenges facing our field will find solutions only by innovating in both hardware and software. To engage software researchers, proposed new architectures must be usable for real software development. The possibility of FPGA prototyping and simulation acceleration has garnered the interest of computer architects for as long as the technology has existed. Unfortunately, until recently, this avenue has met only limited success, because of the restrictive capacity of earlier-generation FPGAs and the relative ease of simulating uniprocessor systems in software. An example of a large-scale FPGA prototyping effort is the Rapid Prototype Engine for Multiprocessors (RPM).11 The RPM system enabled flexible evaluation of the memory subsystem, but it was limited in scalability (eight processors) and did not execute operating-system code. With current FPGA capacity, RAMP and similar efforts stand to provide a much-needed, scalable research vehicle for full-system multiprocessor research.

References

1. D. Burger and T.M. Austin, The SimpleScalar Tool Set, Version 2.0, tech. report 1342, Computer Sciences Dept., Univ. of Wisconsin, Madison, 1997.
2. M. Rosenblum et al., "Complete Computer System Simulation: The SimOS Approach," IEEE Parallel and Distributed Technology, vol. 3, no. 4, Winter 1995, pp. 34-43.
3. V.S. Pai, P. Ranganathan, and S.V. Adve, RSIM Reference Manual, Version 1.0, tech. report 9705, Electrical and Computer Engineering Dept., Rice Univ., July 1997.
4. J. Emer et al., "ASIM: A Performance Model Framework," Computer, vol. 35, no. 2, Feb. 2002, pp. 68-76.
5. N.L. Binkert et al., "The M5 Simulator: Modeling Networked Systems," IEEE Micro, vol. 26, no. 4, Jul.-Aug. 2006, pp. 52-60.
6. M.M.K. Martin et al., "Multifacet's General Execution-Driven Multiprocessor Simulator (GEMS) Toolset," ACM SIGARCH Computer Architecture News, vol. 33, no. 4, Nov. 2005, pp. 92-99.
7. N.L. Binkert, E.G. Hallnor, and S.K. Reinhardt, "Network-Oriented Full-System Simulation Using M5," Sixth Workshop on Computer Architecture Evaluation Using Commercial Workloads (CAECW 03), Feb. 2003; http://www.eecs.umich.edu/~stever/pubs/caecw03.pdf.
8. A. KleinOsowski and D. Lilja, MinneSpec: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research, tech. report 02-08, ARCTiC Labs, Univ. of Minnesota, 2002; http://www.arctic.umn.edu/papers/minnespec-cal-v2.pdf.
9. T. Sherwood, E. Perelman, and B. Calder, "Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications," Proc. Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 01), IEEE CS Press, 2001, pp. 3-14.
10. R. Wunderlich et al., "SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling," Proc. 30th Ann. Int'l Symp. Computer Architecture (ISCA 03), IEEE CS Press, 2003, pp. 84-95.
11. K. Oner et al., "The Design of RPM: An FPGA-Based Multiprocessor Emulator," Proc. 3rd ACM Int'l Symp. Field-Programmable Gate Arrays (FPGA 95), ACM Press, 1995, pp. 60-66.

RAMP represents the research community's return to building hardware and software systems. RAMP is designed to embody the right trade-offs of cost, performance, density, and visibility for systems research. Moreover, since the system is not frozen, we can use it to both rapidly evolve and spread successful ideas across the community. Research in hardware architecture, operating systems, compilers, applications, and programming models will all benefit. We are planning a full public release of the RAMP infrastructure in 2007.


Acknowledgments

This work was funded in part by the National Science Foundation, grant CNS-0551739. Special thanks to Xilinx for its continuing financial support and donation of FPGAs and development tools. We appreciate the financial support provided by the Gigascale Systems Research Center. Thanks to IBM for its financial support through faculty fellowships and donation of processor cores, and to Sun Microsystems for processor cores. There is an extensive list of industry and academic friends who have given valuable feedback and guidance. We especially thank Arvind (Massachusetts Institute of Technology) and Jan Rabaey (University of California, Berkeley) for their advice. The work presented in this article is the effort of the RAMP students and staff: Hari Angepat, Dan Burke, Jared Casper, Chen Chang, Pierre-Yves Droz, Greg Gibeling, Alex Krasnov, Ken Lutz, Martha Mercaldi, Nju Njoroge, Andrew Putnam, Andrew Schultz, and Sewook Wee.

References

1. G. Gibeling, A. Schultz, and K. Asanovic, "RAMP Architecture and Description Language," 2nd Workshop on Architecture Research Using FPGA Platforms, 2006; http://cag.csail.mit.edu/~krste/papers/rampgateware-warfp2006.pdf.
2. C. Chang, J. Wawrzynek, and R.W. Brodersen, "BEE2: A High-End Reconfigurable Computing System," IEEE Design & Test, vol. 22, no. 2, Mar.-Apr. 2005, pp. 114-125.
3. A.-R. Adl-Tabatabai, C. Kozyrakis, and B. Saha, "Unlocking Concurrency: Multicore Programming with Transactional Memory," ACM Queue, vol. 4, no. 10, Dec. 2006; http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=444.
4. L. Hammond et al., "Programming with Transactional Coherence and Consistency (TCC)," Proc. 11th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 04), ACM Press, 2004, pp. 1-13.
5. S. Wee et al., "A Practical FPGA-Based Framework for Novel CMP Research," Proc. 15th ACM/SIGDA Int'l Symp. Field-Programmable Gate Arrays (FPGA 07), ACM Press, 2007, pp. 116-125.
6. B.D. Carlstrom et al., "The Software Stack for Transactional Memory: Challenges and Opportunities," First Workshop on Software Tools for Multicore Systems (STMCS 06), 2006; http://ogun.stanford.edu/~kunle/publications/tcc_stmcs2006.pdf.

John Wawrzynek is a professor of electrical engineering and computer sciences at the University of California, Berkeley. He currently teaches courses in computer architecture, VLSI system design, and reconfigurable computing. He is codirector of the Berkeley Wireless Research Center and principal investigator of the Research Accelerator for Multiple Processors (RAMP) project. He holds a PhD and an MS in computer science from the California Institute of Technology and an MS in electrical engineering from the University of Illinois, Urbana-Champaign. He is a member of the IEEE and the ACM.

David Patterson is the Pardee Professor and director of the RAD Lab at the University of California, Berkeley. His research interests are in design and automatic management of datacenters and in hardware-software architectures of highly parallel microprocessors. He is a fellow of the IEEE; recipient of the IEEE von Neumann and IEEE Mulligan Education medals; fellow and past president of the ACM; and a member of the American Academy of Arts and Sciences, the National Academy of Engineering, the National Academy of Sciences, and the Silicon Valley Engineering Hall of Fame.

Mark Oskin is an assistant professor at the University of Washington. His research interests include computer systems architecture, performance modeling, and architectures for emerging technologies. With a large team of graduate students, he is the coinventor of WaveScalar, the first dataflow machine capable of executing applications written in imperative languages, and of brick and mortar, a low-cost technique for manufacturing SoC devices. He received his PhD in computer science from the University of California, Davis.
Shih-Lien Lu has a BS in electrical engineering and computer sciences from the University of California, Berkeley, and an MS and a PhD, both in computer science and engineering, from the University of California, Los Angeles. He served on the faculty of the Electrical and Computer Sciences Department at Oregon State University from 1991 to 2001. In 1999, he took a two-year leave from OSU and joined Intel, where he is currently a principal research scientist in the microarchitecture lab of Intel's Microprocessor Technology Lab in Oregon. His research interests include computer microarchitecture, circuits, and FPGA systems design.

Christoforos Kozyrakis is an assistant professor of electrical engineering and computer science at Stanford University. His research focuses on architectural support for parallel computing, system security, and energy management. He is currently working on transactional memory techniques that can greatly simplify parallel programming for the average developer. He has a PhD in computer science from the University of California, Berkeley.

James C. Hoe is an associate professor of electrical and computer engineering at Carnegie Mellon University. His research interests include computer architecture and high-level hardware description and synthesis. He has a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology. He is a member of the IEEE and the ACM.

Derek Chiou is an assistant professor at the University of Texas at Austin. His research interests include computer system simulation, computer architecture, parallel computer architecture, and Internet router architecture. He received his PhD, SM, and SB degrees in electrical engineering and computer science from the Massachusetts Institute of Technology. He is a senior member of the IEEE and a member of the ACM.

Krste Asanović is an associate professor in the Department of Electrical Engineering and Computer Science at MIT and a member of the MIT Computer Science and Artificial Intelligence Laboratory. His research interests include computer architecture and VLSI design. Asanović has a BA in electrical and information sciences from the University of Cambridge and a PhD in computer science from the University of California, Berkeley. He is a member of the IEEE and the ACM.

Direct questions and comments about this article to John Wawrzynek, 631 Soda Hall, Computer Science Division, University of California, Berkeley, CA 94720-1776; [email protected].

For further information on this or any other computing topic, please visit our Digital Library at http://www.computer.org/publications/dlib.

