CoRAM: An In-Fabric Memory Architecture for FPGA-based Computing

Eric S. Chung, James C. Hoe, and Ken Mai

Carnegie Mellon University 5000 Forbes Ave. Pittsburgh, PA 15213 {echung, jhoe, kenmai}@ece.cmu.edu

ABSTRACT

FPGAs have been used in many applications to achieve orders-of-magnitude improvement in absolute performance and energy efficiency relative to conventional microprocessors. Despite their promise in both processing performance and efficiency, FPGAs have not yet gained widespread acceptance as mainstream computing devices. A fundamental obstacle to FPGA-based computing today is the FPGA's lack of a common, scalable memory architecture. When developing applications for FPGAs, designers are often directly responsible for crafting the application-specific infrastructure logic that manages and transports data to and from the processing kernels. This infrastructure not only increases design time and effort but will frequently lock a design to a particular FPGA product line, hindering scalability and portability. We propose a new FPGA memory architecture called Connected RAM (CoRAM) to serve as a portable bridge between the distributed computation kernels and the external memory interfaces. In addition to improving performance and efficiency, the CoRAM architecture provides a virtualized memory environment as seen by the hardware kernels to simplify development and to improve an application's portability and scalability.

Categories and Subject Descriptors

C.0 [Computer System Organization]: [System architectures]

General Terms

Design, standardization

Keywords

FPGA, abstraction, memory, reconfigurable computing

1. INTRODUCTION

With power becoming a first-class architectural constraint, future computing devices will need to look beyond general-purpose processors. Among the computing alternatives today, Field Programmable Gate Arrays (FPGA) have been applied to many applications to achieve orders-of-magnitude improvement in absolute performance and energy efficiency relative to conventional processors (e.g., [11, 6, 5]). A recent study [6] further showed that FPGA fabrics can be an effective computing substrate for floating-point intensive numerical applications.

While the accumulated VLSI advances have steadily improved the FPGA fabric's processing capability, FPGAs have not yet gained widespread acceptance as mainstream computing devices. A commonly cited obstacle is the difficulty in programming FPGAs using low-level hardware development flows. A fundamental problem lies in the FPGA's lack of a common, scalable memory architecture for application designers. When developing for an FPGA, a designer has to create from bare fabric not only the application kernel itself but also the application-specific infrastructure logic to support and optimize the transfer of data to and from external memory interfaces. Very often, creating or using this infrastructure logic not only increases design time and effort but will frequently lock a design to a particular FPGA product line, hindering scalability and portability. Further, the support mechanisms which users are directly responsible for will be increasingly difficult to manage in the future as: (1) memory resources (both on- and off-chip) increase in number and become more distributed across the fabric, and (2) long-distance interconnect delays become more difficult to tolerate in larger fabric designs.

Current FPGAs lack essential abstractions that one comes to expect in a general purpose computer—i.e., an Instruction Set Architecture (ISA) that defines a standard agreement between hardware and software. From a computing perspective, a common architectural definition is a critical ingredient for programmability and for application portability. To specifically address the challenges related to memory on FPGAs, the central goal of this work is to create a shared, scalable memory architecture suitable for future FPGA-based computing devices. Such a memory architecture would be used in a way that is analogous to how general purpose programs universally access main memory through standard "loads" and "stores" as defined by an ISA—without any knowledge of hierarchy details such as caches, memory controllers, etc. At the same time, the FPGA memory architecture definition cannot simply adopt what exists for general purpose processors and instead, should reflect the spatially distributed nature of today's FPGAs—consisting of up to millions of interconnected LUTs and thousands of embedded SRAMs [24]. Working under the above premises, the guiding principles for the desired FPGA memory architecture are:

- The architecture should present to the user a common, virtualized appearance of the FPGA fabric, which encompasses reconfigurable logic, its external memory interfaces, and the multitude of SRAMs—while freeing designers from details irrelevant to the application itself.

- The architecture should provide a standard, easy-to-use mechanism for controlling the transport of data between memory interfaces and the SRAMs used by the application throughout the course of computation.

- The abstraction should be amenable to scalable FPGA microarchitectures without affecting the architectural view presented to existing applications.

Outline. Section 2 presents an overview of the CoRAM memory architecture. Section 3 discusses a possible microarchitectural space for implementing CoRAM. Section 4 demonstrates concrete usage of CoRAM for three example application kernels—Black-Scholes, Matrix-Matrix Multiplication, and Sparse Matrix-Vector Multiplication. Section 5 presents an evaluation of various microarchitectural approaches. We discuss related work in Section 6 and offer conclusions in Section 7.

2. CORAM ARCHITECTURE

2.1 System Context

The CoRAM memory architecture assumes the co-existence of FPGA-based computing devices along with general-purpose processors in the context of a shared memory multiprocessor system (see Figure 1). The CoRAM architecture assumes that reconfigurable logic resources will exist either as stand-alone FPGAs on a multiprocessor memory bus or integrated as fabric into a single-chip heterogeneous multicore. Regardless of the configuration, it is assumed that memory interfaces for loading from and storing to a linear address space will exist at the boundaries of the reconfigurable logic (referred to as edge memory in this paper). These implementation-specific edge memory interfaces could be realized as dedicated memory/bus controllers or even coherent cache interfaces. Like commercial systems available today (e.g., Convey Computer [8]), reconfigurable logic devices can directly access the same virtual address space of general purpose processors (e.g., by introducing MMUs at the boundaries of the fabric). The combined integration of virtual memory and direct access to the memory bus allows applications to be efficiently and easily partitioned across general-purpose processors and FPGAs, while leveraging the unique strengths of each respective device. A nearby processor is useful for handling tasks not well-suited to FPGAs—e.g., providing the OS environment, executing system calls, and initializing the memory contents of an application prior to its execution on the FPGA.

Figure 1: Assumed System Context. (Panel (a): CPUs and stand-alone FPGAs sharing a memory interconnect and DRAM; panel (b): a single-chip heterogeneous processor combining cores and reconfigurable logic.)

2.2 Architectural Overview

Within the boundaries of reconfigurable logic, the CoRAM architecture defines a portable application environment that enforces a separation of concerns between computation and on-chip memory management. Figure 2 offers a conceptual view of how applications are decomposed when mapped into reconfigurable logic with CoRAM support. The reconfigurable logic component shown in Figure 2a is a collection of LUT-based resources used to host the algorithmic kernels of a user application. It is important to note that the CoRAM architecture places no restriction on the synthesis language used or the internal complexity of the user application. For portability reasons, the only requirement is that user logic is never permitted to directly interface with off-chip I/O pins, access memory interfaces, or be aware of platform-specific details. Instead, applications can only interact with the external environment through a collection of specialized, distributed SRAMs called CoRAMs that provide on-chip storage for application data (see Figure 2b).

Figure 2: CoRAM Memory Architecture.

CoRAMs. Much like current FPGA memory architectures, CoRAMs preserve the desirable characteristics of conventional fabric-embedded SRAM [16]—they present a simple, wire-level SRAM interface to the user logic with local address spaces and deterministic access times (see Figure 2b), are spatially distributed, and provide high aggregate on-chip bandwidth. They can be further composed and configured with flexible aspect ratios.
CoRAMs, however, deviate drastically from conventional embedded SRAMs in the sense that the data contents of individual CoRAMs are actively managed by finite state machines called "control threads," as shown in Figure 2c.

Control threads. Control threads form a distributed collection of logical, asynchronous finite state machines for mediating data transfers between CoRAMs and the edge memory interface. Each CoRAM is managed by at most a single control thread, although a control thread could manage multiple CoRAMs. Under the CoRAM architectural paradigm, user logic relies solely on control threads to access external main memory over the course of computation. Control threads and user logic are peer entities that interact over predefined, two-way asynchronous channels (see Figure 2e). A control thread maintains local state to facilitate its activities and will typically issue address requests to the edge memory interface on behalf of the application; upon completion, the control thread informs the user logic through channels when the data within specific CoRAMs are ready to be accessed through their locally-addressed SRAM interfaces. Conversely, the user logic can also write its computational results into CoRAMs and issue "commands" to the control threads via channels to write the results to edge memory.

Control actions. To express the memory access requirements of an FPGA application, control threads can only invoke a predefined set of memory and communication primitives called control actions. Control actions describe logical memory transfer commands between specific embedded CoRAMs and the edge memory interface. A control thread at the most basic level comprises a sequence of control actions to be executed over the course of a program. In general, a control thread issues control actions along a dynamic sequence that can include cycles and conditional paths.

Software versus RTL. An important issue that merits early discussion is in determining what the "proper" level of abstraction should be for expressing control threads and control actions. The most straightforward approach to exposing control actions to user logic is to distribute standard, wire-level interfaces throughout the fabric. In this case, the application designer would be directly responsible for constructing hardware control threads (i.e., FSMs) that generate memory address requests on behalf of the user logic and issue control actions through the standard interfaces.

In this paper, we make the key observation that from a performance perspective, expressing control threads in a low-level abstraction such as RTL is not a critical requirement. In many cases, the process of generating address requests to main memory is not a limiting factor to FPGA application performance since most time is either spent waiting for memory responses or for computation to progress. For the remainder of this paper, it is assumed that control threads are expressed in a high-level C-based language to facilitate the dynamic sequencing of control actions.

Our selection of a C-based language affords an application developer not only simpler but also more natural expressions of control flow and memory pointer manipulations. The control threads implemented in software would be limited to a subset of the C language and would exclude "software" features such as dynamic memory allocation. Any high-level language used must be synthesizable to finite state machines or even compiled to hard microcontrollers that can execute control thread programs directly if available in an FPGA. Generally, the overall inefficiencies of executing a high-level language would not directly impede the overall computation throughput because the control threads do not "compute" in any usual sense and are used only to generate and sequence the control actions required by an application.

2.3 CoRAM Architecture in Detail

In this section, we describe in detail the standard memory management interface exported by control actions, and how they are invoked within control threads. Figure 3 illustrates the set of control actions available to an application developer. The control actions shown have the appearance of a memory management API, and abstract away the details of the underlying hardware support—similar to the role served by the Instruction Set Architecture (ISA) between software and evolving hardware implementations. As will be demonstrated later in Section 4, the basic set of control actions defined can be used to compose more sophisticated memory "personalities" such as scratchpads, caches, and FIFOs—each of which is tailored to the memory patterns and desired interfaces of specific applications.

/*** CoRAM handle definition and acquisition ***/
struct {int n; int width; int depth; ...} coh;
coh  get_coram(instance_name, ...);
coh  append_coram(coh coram, bool interleave, ...);

/*** Singleton control actions ***/
void coram_read(coh coram, void *offset, void *memaddr, int bytes);
tag  coram_read_nb(coh coram, ...);
void coram_write(coh coram, void *offset, void *memaddr, int bytes);
tag  coram_write_nb(coh coram, ...);
void coram_copy(coh src, coh dst, void *srcoffset, void *dstoffset, int bytes);
tag  coram_copy_nb(coh src, coh dst, ...);
bool check_coram_done(coh coram, tag, ...);
void coram_membar();

/*** Collective control actions ***/
void collective_write(coh coram, void *offset, void *memaddr, int bytes);
void collective_read(coh coram, void *offset, void *memaddr, int bytes);

/*** Channel control actions ***/
void fifo_write(fifo f, Data din);
Data fifo_read(fifo f);
void ioreg_write(reg r, Data din);
Data ioreg_read(reg r);

Figure 3: Control action definitions.

The first argument to a control action is typically a program-level identifier (called a co-handle) for an individual CoRAM or for a collection of CoRAMs that are functioning as a single logical structure (in the same way embedded SRAMs can be composed). The co-handle encapsulates both static and runtime information of the basic width and depth of the logical CoRAM structure and the binding to physical CoRAM blocks in the fabric. (A co-handle can also be used to represent a channel resource such as a FIFO or I/O register, although only the appropriate communication control actions are compatible with them.) The CoRAM definitions in Figure 3 comprise basic memory block transfer operations (coram_read, coram_write) as well as on-chip CoRAM-to-CoRAM data transfers (coram_copy). In addition to memory operations, control actions support asynchronous two-way communication between user logic and control threads via FIFOs (e.g., fifo_write) and I/O registers (e.g., ioreg_write). Although not listed in the definitions, control threads can also communicate with each other through ordered message-passing primitives.
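To make the co-handle concept concrete, the fragment below sketches one plausible way to compose several CoRAMs into a single logical buffer and fill it with one collective transfer. This is only an illustrative sketch: Figure 3 leaves the variadic arguments of get_coram and append_coram unspecified, so the instance names ("buf0"–"buf3"), the assumption that append_coram accepts the handle to be appended, and the transfer size are all hypothetical.

/* Illustrative sketch only -- argument conventions beyond Figure 3 are assumed. */
void fill_composed_buffer(void *src)
{
  coh b0 = get_coram("buf0");
  coh b1 = get_coram("buf1");
  coh b2 = get_coram("buf2");
  coh b3 = get_coram("buf3");

  /* Compose a static list of CoRAMs; 'true' requests word-interleaved striping. */
  coh bufs = append_coram(b0, true, b1);
  bufs     = append_coram(bufs, true, b2);
  bufs     = append_coram(bufs, true, b3);

  /* A single control action fills the whole logical buffer: sequential data
     arriving from edge memory is striped word-by-word across the four CoRAMs. */
  collective_write(bufs, 0 /* local offset */, src /* edge address */, 4096 /* bytes */);
}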

Example. To illustrate how control actions are used, the example in Figure 4 shows how a user would (1) instantiate a CoRAM as a black-box module in their application, and (2) program a corresponding control thread to read a single data word from edge memory into the CoRAM. The control thread program shown in Figure 4 (right) first acquires a co-handle (L2), and passes it into a coram_write control action (L3), which performs a 4-byte memory transfer from the edge memory address space to the CoRAM blocks named by the co-handle. To inform the application when the data is ready to be accessed for computation, the control thread passes a token to the user logic using the fifo_write control action (the acquisition of the channel FIFO's co-handle is omitted for brevity).

(1) Application (Verilog)
1 module top(clk, rst, ...);
2   coram c0(..., Address,
3            WrData,
4            RdData);
5   fifo f0(..., FifoData);
6   ...
7 endmodule

(2) Control thread program
1 read_data(void *src) {
2   coh c0 = get_coram("c0");
3   coram_write(c0,   // handle
                0xA,  // local address
                src,  // edge address
                4);   // size in bytes
4   fifo_write(f0, 0xdeadbeef);
5 }

Figure 4: Example Usage of CoRAMs.

Advanced control actions. Control threads can employ more advanced control actions to either increase memory-level parallelism or to customize data partitioning. The non-blocking control actions (e.g., coram_write_nb) explicitly allow for concurrent memory operations and return a tag that must be monitored using check_coram_done. A control thread can also invoke coram_membar, which serializes on all previously executed non-blocking control actions by that control thread. For parallel transfers to a large number of CoRAMs, a collective form of read and write control actions is also supported. In the collective form, append_coram is a helper function that can be used to compose a static list of CoRAMs. The newly returned co-handle can then be used with collective control actions to perform transfers to the aggregated CoRAMs as a single logical unit. When operating upon the composed handle, sequential data arriving from memory can be laid out across the CoRAMs' local addresses in either a concatenated or a word-interleaved pattern. Such features allow the user to customize the partitioning of application data across the multiple distributed CoRAMs within the reconfigurable logic.
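The fragment below sketches how the non-blocking variants might be combined to overlap two transfers and then synchronize before signaling the user logic. It is a hypothetical illustration built only from the signatures in Figure 3; the co-handles ping and pong, the FIFO f, and the transfer sizes are assumptions rather than code from the paper.

/* Hypothetical sketch: overlap two block fills using non-blocking control
   actions, then synchronize before notifying the user logic. */
void fill_two_buffers(coh ping, coh pong, fifo f, void *src0, void *src1)
{
  tag t0 = coram_write_nb(ping, 0, src0, 1024);  /* issue first fill, do not wait  */
  tag t1 = coram_write_nb(pong, 0, src1, 1024);  /* second fill proceeds in parallel */

  /* Wait for both transfers: poll the individual tags, or serialize on all
     outstanding non-blocking actions with a memory barrier. */
  while (!check_coram_done(ping, t0)) ;
  while (!check_coram_done(pong, t1)) ;
  coram_membar();   /* redundant after the polls above; shown for completeness */

  fifo_write(f, 1); /* notify the user logic that both buffers are ready */
}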
Future enhancements. It is not difficult to imagine that many variants of the above control actions could be added to support more sophisticated patterns or optimizations (e.g., broadcast from one CoRAM to many, prefetch, strided access, etc.). In a commercial production setting, control actions—like instructions in an ISA—must be carefully defined and preserved to achieve the value of portability and compatibility. Although beyond the scope of this work, compilers could play a significant role in static optimization of control thread programs. Analysis could be used, for example, to identify non-conflicting control actions that are logically executed in sequence but can actually be executed concurrently without affecting correctness.

3. CORAM MICROARCHITECTURE

The CoRAM architecture presented thus far has deliberately omitted the details of how control threads are actually executed and how data is physically transported between the CoRAMs and the edge memory interfaces. The CoRAM architecture definitions (i.e., control actions) form a contractual agreement between applications and hardware implementations. In the ideal case, a good implementation of CoRAM should provide robust hardware performance across a range of applications without affecting correctness and without significant tuning required by the user.

Figure 5: Conceptual Microarchitecture Sketch of Reconfigurable Logic with CoRAM Support.

Hardware overview. By construction, the CoRAM architecture naturally lends itself to highly distributed hardware designs. The microarchitectural "sketch" shown in Figure 5 is architected to scale up to thousands of embedded CoRAMs based on current FPGA design trends [24]. CoRAMs, like embedded SRAMs in modern FPGAs, are arranged into vertical columns [26] and organized into localized clusters. Each cluster is managed by an attached Control Unit, which is a physical host responsible for executing the control programs that run within the cluster. Control programs can be realized by direct synthesis into reconfigurable logic (e.g., using high-level synthesis flows) or can be compiled and executed on dedicated multithreaded microcontrollers (e.g., a multithreaded RISC pipeline). Control Units must also maintain internal queues and scoreboarding logic to track multiple outstanding control actions, and to allow querying of internal state (e.g., check_coram_done).

Data distribution. An integral component of the Control Unit is the network-on-chip, which is responsible for routing memory address requests on behalf of Control Units to a multitude of distributed memory interfaces and delivering data responses accordingly. Within each cluster, multiple CoRAMs share a single network-on-chip router for communication and data transport. At the macroscale level, multiple routers are arranged in a 2D mesh to provide global connectivity between clusters and memory interfaces. Internal to each cluster, queues and local interconnect provide connectivity between the CoRAMs and the shared router interface. The local interconnect internal to the cluster also contains marshalling logic to break large data transfers from the network into individual words and to steer them accordingly to the constituent CoRAMs based on the data partitioning a user desires (e.g., a collective write).

Soft versus hard logic. To implement the CoRAM architecture, the most convenient approach in the short term would be to layer all the required CoRAM functionality on top of conventional FPGAs. In the long term, FPGAs developed with dedicated CoRAM architectural support in mind can become more economical if certain features become popularly used. From the perspective of a fabric designed to support computing, we contend that a hardwired network-on-chip (NoC) offers significant advantages, especially if it reduces or eliminates the need for long-distance routing tracks in today's fabrics. Under the CoRAM architectural paradigm, global bulk communications are restricted to CoRAM-to-CoRAM or CoRAM-to-edge transfers. Such a usage model would be better served by the high performance (bandwidth and latency) and the reduced power and energy of a dedicated hardwired NoC that connects the CoRAMs and the edge memory interfaces. With a hardwired NoC, it is also more cost-effective in area and energy to over-provision network bandwidth and latency to deliver robust performance across different applications. Similarly, the Control Units used to host control threads could also support "hardened" control actions most commonly used by applications. The microarchitecture shown in Figure 5, with its range of parameters, will be the subject of a quantitative evaluation in Section 5. We next present three case studies that demonstrate concrete usage of CoRAM.

4. CORAM IN USAGE

The CoRAM architecture offers a rich abstraction for expressing the memory access requirements of a broad range of applications while hiding memory system details unrelated to the application itself. This section presents our experiences in developing three non-trivial application kernels using the CoRAM architecture. Below, we explain each kernel's unique memory access pattern requirements and discuss key insights learned during our development efforts. The control thread examples presented in this section are excerpts from actual applications running in our CoRAM simulator, which models a microarchitecture based on Figure 5. For our applications, the compute portions of the designs were placed-and-routed on a Virtex-6 LX760 FPGA to determine the peak fabric processing throughput.

4.1 Black-Scholes

The first FPGA application example, Black-Scholes, is widely used in the field of quantitative finance for option pricing [21]. Black-Scholes employs a rich mixture of arithmetic floating-point operators but exhibits a very simple memory access pattern. The fully pipelined processing element (PE) shown in Figure 6 consumes a sequential input data stream from memory and produces its output stream similarly. The application's performance is highly scalable; one could increase performance by instantiating multiple PEs that consume and produce independent input and output data streams. Performance continues to scale until either the reconfigurable logic capacity is exhausted or the available external memory bandwidth is saturated. The characteristics of our Black-Scholes PE are shown in Figure 6 (bottom).

Figure 6: Black-Scholes Processing Element. (The PE streams in the stock price S, option price X, expiration time T, risk-free rate R, and volatility V, and streams out the put value P and call value C.)
  PE resources (single-precision): ~30 KLUTs, 57 DSPs
  Clock frequency: 300 MHz
  Peak PE throughput: 600 Moptions/s
  PE memory bandwidth: 8.4 GB/s

Supporting Streams with CoRAM. To support the sequential memory access requirements of Black-Scholes, we develop the concept of a stream FIFO "memory personality," which presents a simple FIFO interface to user logic (i.e., data, ready, pop). The stream FIFO employs CoRAMs and a control thread to bridge a single Black-Scholes pipeline to the edge memory interface. Figure 7 illustrates the stream FIFO module, which instantiates a single CoRAM to be used as a circular buffer, with nearby head and tail registers instantiated within reconfigurable logic to track the buffer occupancy. Unlike a typical FIFO implemented within reconfigurable logic, the producer of the stream FIFO is not an entity hosted in logic but is managed by an associated control thread.

Figure 7: Stream FIFO memory personality.

1  coh ram = get_coram(...);  // handle to FIFO buffer
2  char *src = ...;           // initialized to Black-Scholes data
3  int src_word = 0, head = 0, words_left = ...;  // size
4  while(words_left > 0) {
5    int tail = ioreg_read(ram);
6    int free_words = ram->depth - (head - tail);
7    int bsize_words = MIN(free_words, words_left);
8
9    if(bsize_words != 0) {
10     coram_write(ram, head, src +
11                 src_word * ram->wdsize, bsize_words);
12     ioreg_write(ram, head + bsize_words);
13     src_word += bsize_words;
14     words_left -= bsize_words;
15     head += bsize_words;
16   }
17 }

Figure 8: Control program for Input Stream FIFO.

Figure 7 (right) illustrates a simple control thread used to fulfill the FIFO producer role (the corresponding code is shown in Figure 8). The event highlighted in Step 1 of Figure 7 first initializes a source pointer to the location in memory where the Black-Scholes data resides (L2 in Figure 8). In Step 2, the control thread samples the head and tail pointers to compute how much available space is left within the FIFO (L5-L7 in Figure 8). If sufficient space exists, the event in Step 3 performs a multi-word byte transfer from the edge memory interface into the CoRAM using the coram_write control action (L10 in Figure 8). The event in Step 4 completes the FIFO production by having the control thread update the head pointer using the ioreg_write control action to inform the reconfigurable logic within the stream FIFO module when new data has arrived (L12 in Figure 8). Finally, L13-L15 show updates to internal state maintained by the control thread. (Note: we do not show the control thread program for the corresponding output stream FIFO; a sketch of what it might look like follows.)
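Although the paper omits the output-side program, a control thread for the corresponding output stream FIFO would plausibly mirror Figure 8 with the transfer direction reversed. The sketch below is our own hypothetical illustration—the handle, pointer conventions, and sizes are assumptions—using coram_read to drain the circular buffer back to edge memory.

// Hypothetical sketch of the output stream FIFO control thread (not from the paper);
// it mirrors Figure 8 with the transfer direction reversed.
coh ram = get_coram(...);               // handle to output FIFO buffer
char *dst = ...;                        // destination of results in edge memory
int dst_word = 0, tail = 0, words_left = ...;
while (words_left > 0) {
  int head = ioreg_read(ram);           // how far the user logic has produced
  int avail_words = head - tail;        // occupied entries ready to drain
  int bsize_words = MIN(avail_words, words_left);

  if (bsize_words != 0) {
    coram_read(ram, tail, dst + dst_word * ram->wdsize, bsize_words);
    ioreg_write(ram, tail + bsize_words);  // free the drained entries
    dst_word   += bsize_words;
    words_left -= bsize_words;
    tail       += bsize_words;
  }
}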

Discussion. The use of CoRAM simplified the overall development effort for Black-Scholes by allowing us to express the application's memory access pattern at a high level of abstraction relative to conventional RTL flows. Our control thread program described the sequential memory transactions required by Black-Scholes as a sequence of untimed steps using a simple, C-based language. This abstraction was simple to work with and allowed us to iterate on changes quickly and conveniently. An important idea that emerged during our development efforts was the memory personality concept. A memory personality "wraps" CoRAMs and control threads within reconfigurable logic to provide an even higher level of abstraction (i.e., interface and memory semantics) best suited to the application at hand. As will be discussed later, many other kinds of memory personalities can also be constructed and further combined to form a re-usable shared library for CoRAM.

4.2 Matrix-Matrix Multiplication

The next example, Matrix-Matrix Multiplication (MMM), is a widely-used computation kernel that multiplies two matrices encoded in dense, row-major format [18]. For multiplications where the input and output matrices are too large to fit in on-chip SRAMs, a commonly used blocked algorithm decomposes the large calculation into repeated multiplications of sub-matrices sized to fit within the on-chip SRAMs (see the reference C code in Figure 9). This strategy improves the arithmetic intensity of MMM by increasing the average number of floating-point operations performed for each external memory byte transferred.

1  void mmm(Data* A, Data* B, Data *C, int N, int NB) {
2    int j, i, k;
3    for (j = 0; j < N; j += NB) {
4      for (i = 0; i < N; i += NB) {
5        for (k = 0; k < N; k += NB) {
6          block_mmm_kernel(A + i*N + k,
7                           B + k*N + j,
8                           C + i*N + j, N, NB);
9        }
10     }
11   }
12 }

Figure 9: Reference C code for Blocked MMM [4].

Figure 10 illustrates a parameterized hardware kernel developed for single-precision blocked MMM. The design employs p identical dot-product processing elements (PEs) and assumes that the large input matrices A, B, and the output matrix C are stored in external memory. In each iteration: (1) different sub-matrices subA, subB, and subC are read in from external memory, and (2) subA and subB are multiplied to produce intermediate sums accumulated into sub-matrix subC. The sub-matrices are sized to utilize available SRAM storage. (Square sub-matrices of size 4x4 are assumed in this explanation for the sake of simplicity.) The row slices of subB and subC are divided evenly among the p PEs and held in per-PE local CoRAM buffers. The column slices of subA are also divided evenly and stored similarly. A complete iteration repeats the following steps p times: (1) each PE performs dot-products of its local slices of subA and subB to calculate intermediate sums to be accumulated into subC, and (2) each PE passes its local column slice of subA to its right neighbor cyclically (note: as an optimization, step 2 can be overlapped with step 1 in the background).

Figure 10: FPGA design for Blocked MMM (4x4 block size).
  Single PE resources: ~1 KLUTs, 2 DSPs
  Clock frequency: 300 MHz
  Peak PE throughput: 600 MFLOP/s
  PE memory bandwidth: 2.4/nPEs + 2.4/wdsPerRAM (GB/s)

MMM control thread. To populate the CoRAM buffers of each PE, a single control thread is used to access all of the necessary per-row and per-column slices of subA, subB, and subC. The pseudo-code of the control thread program used to implement the required memory operations is shown in Figure 11 (for brevity, the reading and writing of subC is omitted).

1.  void mmm_control(Data *A, Data *B, ...) {
2.    coh ramsA = ..;  // 'a' CoRAMs, word-interleaved
3.    coh ramsB = ..;  // 'b' CoRAMs, concatenated
4.    for (j = 0; j < N; j += NB) {
5.      for (i = 0; i < N; i += NB) {
6.        for (k = 0; k < N; k += NB) {
7.          fifo_read(...);
8.          for (m = 0; m < NB; m++) {
9.            collective_write(ramsA, m*NB, A + i*N+k + m*N, NB*dsz);
10.           collective_write(ramsB, m*NB, B + k*N+j + m*N, NB*dsz);
11.         }
12.         fifo_write(...);
13.       }
14.     }
15.   }

Figure 11: MMM control thread code example.

In L2, ramsA is a co-handle that represents an aggregation of all the 'a' CoRAMs belonging to the PEs in word-interleaved order (ramsA can be constructed from multiple CoRAMs using append_coram).
If passed into a collective_write control action, matrix data arriving sequentially from edge memory would be striped and written across the multiple 'a' CoRAMs in a word-by-word interleaved fashion. This operation is necessary because the 'a' CoRAMs expect the data slices in column-major format whereas all matrix data is encoded in row-major. The co-handle ramsB expects data in a row-major format and is simply a concatenation of all of the 'b' CoRAMs. Within the body of the inner loop, the control thread first waits for a token from the user logic (L7) before executing the collective_write control actions used to populate the CoRAMs (L9-L10). Upon completion, the control thread informs the user logic when the data is ready to be accessed (L12). The control thread terminates after iterating over all the blocks of matrix C.

Discussion. Our overall experience with developing MMM highlights more sophisticated uses of CoRAM. In particular, the collective control actions allowed us to precisely express the data transfers for a large collection of CoRAMs in very few lines of code. Control actions also allowed us to customize the partitioning of data to meet the on-chip memory layout requirements of our MMM design. It is worth noting how the code in Figure 11 appears similar to the C reference code, with the exception that the inner-most loop now consists of memory control actions rather than computation. One insight we developed from this example is that control threads allow us to easily express the re-assignment of FPGA kernels to different regions of the external memory over the course of a large computation. This feature of the CoRAM architecture could potentially be used to simplify the task of building out-of-core FPGA-based applications that support inputs much larger than the total on-chip memory capacity.
4.3 Sparse Matrix-Vector Multiplication

Our last example, Sparse Matrix-Vector Multiplication (SpMV), is another widely-used scientific kernel that multiplies a sparse matrix A by a dense vector x [18]. Figure 12 (top) gives an example of a sparse matrix A in Compressed Sparse Row (CSR) format. The non-zero values in A are stored in row order as a linear array of values in external memory. The column number of each entry in values is stored in a corresponding entry in the column array (cols). The i'th entry of another array (rows) holds the index of the first entry in values (and cols) belonging to row i of A. Figure 12 (bottom) gives the reference C code for computing A×x. Of all the kernels we studied, SpMV presented the most difficult design challenge for us due to a large external memory footprint and an irregular memory access pattern.

1. void spmv_csr(int n_rows, int *cols,
2.               int *rows, Data *vals, Data *x, Data *y) {
3.   for (int r = 0; r < n_rows; r++) {
4.     Data sum = 0;
5.     for (int i = rows[r]; i < rows[r+1]; i++)
6.       sum += vals[i] * x[cols[i]];
7.     y[r] = sum;
8.   }
9. }

Figure 12: Sparse Matrix-Vector Multiplication (CSR example and reference C code).

Figure 13 illustrates an FPGA design for SpMV, where multiple processing elements (PEs) operate concurrently on distinct rows of A. The contents of the rows array are streamed in from edge memory to a centralized work scheduler that assigns rows to different PEs. For each assigned row, a PE employs two stream FIFOs to sequentially read in data blocks from the values and cols arrays, respectively. To configure the memory pointers for each of the two stream FIFOs, the PE logic must continuously pass row assignment information (via channels) to the control threads belonging to the stream FIFOs. To calculate each term of a dot-product, a PE must make an indirect reference to vector x (L6). This type of memory access poses a particularly difficult challenge because the memory addresses (i.e., cols) are not necessarily sequential and could be accessed randomly. Furthermore, x can be a very large data structure that cannot fit into aggregate on-chip memory. Unlike the MMM example from earlier, the performance of SpMV is highly dependent on efficient bandwidth utilization due to its low arithmetic intensity. An optimization to reduce memory bandwidth is to exploit any reuse of the elements of x across different rows through caching.

Figure 13: FPGA design for SpMV.
  Single PE resources: ~2 KLUTs
  Clock frequency: 300 MHz
  Peak PE throughput: 600 MFLOP/s
  PE memory bandwidth: >=3.6 GB/s

To implement caching within each PE, Figure 14 illustrates a read-only cache memory personality built using the CoRAM architecture. Within the cache, CoRAMs are composed to form the data and tag arrays while conventional reconfigurable logic implements the bulk of the cache controller logic. A single control thread is used to implement cache fills to the CoRAM data arrays. When a miss is detected, the address is enqueued to the control thread through an asynchronous FIFO (Step 1 in Figure 14). Upon a pending request, Step 2 of the control thread transfers a cache block's worth of data to the data array using the coram_write control action. In Step 3, the control thread acknowledges the cache controller using the fifo_write control action.

Figure 14: SpMV cache personality.
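A control thread implementing the three steps above might look like the following sketch. It is a hypothetical reconstruction from the description of Figure 14, not code from the paper; the handles (data_array, miss_fifo, fill_done), the BLOCK_BYTES size, and the LINE_OFFSET address-to-offset mapping are all assumed.

/* Hypothetical sketch of the cache-fill control thread described for Figure 14;
   names, block size, and indexing policy are assumptions. */
void cache_fill_thread(coh data_array, fifo miss_fifo, fifo fill_done)
{
  while (1) {
    /* Step 1: block until the cache controller enqueues a miss address. */
    Data miss_addr = fifo_read(miss_fifo);

    /* Step 2: fill one block into the CoRAM data array. BLOCK_BYTES and the
       address-to-local-offset mapping LINE_OFFSET are assumed for this sketch. */
    void *edge_addr    = (void *)(miss_addr & ~(BLOCK_BYTES - 1));
    void *local_offset = (void *)LINE_OFFSET(miss_addr);
    coram_write(data_array, local_offset, edge_addr, BLOCK_BYTES);

    /* Step 3: acknowledge the cache controller so it can update the tag array
       and replay the missed access. */
    fifo_write(fill_done, miss_addr);
  }
}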
Discussion. The SpMV example illustrates how different memory personalities built out of CoRAM can be employed in a single design to support multiple memory access patterns. Caches, in particular, were used to support the random access pattern of the x vector, whereas the stream FIFOs from Black-Scholes were re-used for the remaining sequential accesses. In our development of SpMV, the instantiation of CoRAM-based memory personalities along with spatially-distributed PEs allowed us to quickly instantiate "virtual taps" to external memory wherever needed. This level of abstraction was especially convenient as it allowed us to concentrate our efforts on developing only the processing components of SpMV.

4.4 Case Study Remarks

The overall experience in developing applications with CoRAM reveals significant promise in improving the programmability of FPGA-based applications on the whole. While the CoRAM architecture does not eliminate the effort needed to develop optimized stand-alone processing kernels, it does free designers from having to explicitly manage memory and data distribution in a low-level RTL abstraction. Specifically, control threads gave us a general method for dynamically orchestrating the memory management of applications, but did not force us to over-specify the sequencing details at the RTL level. Control threads also did not limit our ability to support fine-grained interactions between processing components and memory. Fundamentally, the high-level abstraction of CoRAM is what would enable portability across multiple hardware implementations.

The memory personalities developed in our examples (stream FIFO, read-only cache) are by no means sufficient for all possible applications and only highlight a subset of the features possible. For example, for all of the memory personalities described, one could replace the coram_write control action with coram_copy, which would permit transfers from other CoRAMs within the fabric and not necessarily from the memory edge. Such control actions could, for instance, be used to efficiently compose multi-level cache hierarchies out of CoRAMs (e.g., providing the SpMV PEs with a shared L2 cache) or be used to set up streams between multiple application kernels. It is conceived that in the future, soft libraries consisting of many types of memory personalities would be a valuable addition to the CoRAM architecture. This concept is attractive because hardware vendors supporting the CoRAM abstraction would not have to be aware of all the possible personality interfaces (e.g., caches, FIFOs) but would only be implementing the "low-level" control actions, which would automatically facilitate compatibility with existing, high-level libraries.

5. EVALUATION

This section presents our initial investigations to understand the performance, area, and power implications of introducing the CoRAM architecture to a conventional FPGA fabric. Given the large design space of CoRAM, the goal of this evaluation is to perform an early exploration of design points for future investigations.

Methodology. To model an FPGA with CoRAM support, a cycle-level software simulator was developed in Bluespec System Verilog [3] to simulate instances of the design illustrated in Figure 5. Table 1 summarizes the key configuration parameters of the simulated fabric design. The simulator assumes a baseline reconfigurable fabric with 474 KLUTs and 720 CoRAMs (4KB each) to reflect the capacity of the Virtex-6 LX760, the largest FPGA from Xilinx today. In the performance simulations, we assume the user logic within the fabric can operate at 300MHz based on placed-and-routed results of the processing kernels from Section 4. For design space exploration, we considered different points based on CoRAMs assumed to operate at 300MHz to 1.2GHz to study the relative merits between soft- versus hard-logic implementations.

Table 1: FPGA Model Parameters.
  FPGA resources: 474 KLUTs, 720 CoRAMs, 180 threads
  Clock rates: User logic @ 300MHz, CoRAM @ 0.3-1.2GHz
  Off-chip DRAM: 128GB/sec, 60ns latency
  Control Unit: 8-thread, 5-stage MIPS core
  Network-on-Chip: 2D-mesh packet-switched NoC, 3-cycle hop latency, 300MHz-1.2GHz
  Topology: {16, 32, 64} nodes x {8, 16} memports
  Router datapath: {32, 64, 128}-bit router datapaths

The simulator models a 2D-mesh packet-switched network-on-chip based on an existing full-system simulator [22]. The NoC simulator models the cycle-by-cycle traffic flow among the CoRAMs and the edge memory traffic resulting from the control actions issued by control threads. The bandwidth available at each network endpoint is shared by a concentrated cluster of locally connected CoRAMs. For design space exploration, we varied the NoC performance along three dimensions: (1) operating frequency, between 300MHz and 1.2GHz (in sync with the CoRAM frequency); (2) network link and data width (32, 64, or 128 bits per cycle); and (3) number of network end-points (16, 32, or 64), and thus the number of CoRAMs sharing an end-point.

To incorporate memory bandwidth constraints, our simulator models an aggressive external memory interface with four GDDR5 controllers, providing an aggregate peak bandwidth of 128GB/sec at 60ns latency. This external memory interface performance is motivated by what can be achieved by GPUs and CPUs today. Even then, the performance simulations of the Sparse Matrix-Vector Multiplication (SpMV) kernel reported in this section are all memory bandwidth bound, never able to completely consume the reconfigurable logic resources available in the fabric.

The Control Unit from Figure 5 is modeled as a multithreaded microcontroller that executes control thread programs directly. The model assumes that the fabric has 740kB of SRAM (beyond CoRAMs) needed to hold the 180 control thread contexts (code and data). The performance simulator does not model the multithreaded pipelines explicitly. However, the control threads are compiled and executed concurrently as Pthreads within the simulation process and are throttled with synchronization each simulated cycle to mimic execution delay (varied between 300MHz and 1.2GHz for design space exploration).

RTL Synthesis. For each of the design points considered by the performance simulation, the power and area are also estimated by synthesizing RTL-level designs of the CoRAM mechanisms using Synopsys Design Compiler v2009 targeting a commercial 65nm standard cell library. The power and area of the NoC are estimated by synthesizing an open-sourced router design (http://nocs.stanford.edu/cgi-bin/trac.cgi/wiki/Resources/Router). The power and area of the Control Unit are estimated by synthesizing an 8-way multithreaded 5-stage MIPS core automatically generated from the T-piper tool [17]. The power and area of the SRAMs for the control thread contexts are estimated using CACTI 4.0 [20]. Table 3 summarizes the resulting area and power characteristics of various components. The total power and area overhead for a particular design point is calculated using the formulas in the bottom of Table 2 by setting the parameters in the top portion of the table.

Table 3: 65nm Synthesis Results.
  Component                           | Area (mm^2) | Power (mW/GHz)
  MIPS core (8 threads)               | 0.08        | 59
  Per-thread state (SRAM)             | 0.04        | 8
  32-bit router                       | 0.07        | 38
  64-bit router                       | 0.11        | 48
  128-bit router                      | 0.30        | 64
  Other (queues, logic, etc.) (est'd) | 0.40        | -

Table 2: Area and Power Overhead Formulas.
  Parameters: C = CoRAMs, T = control threads, N = nodes = clusters, M = mbanks,
  K = C/N = CoRAMs per cluster, X = T/N = control threads per cluster,
  P = watts, p = watts/GHz, freq = CoRAM clock frequency.

  A_cluster = A_router + X * A_costate + (X/8) * (A_mips + A_other)
  A_total   = N * A_cluster + M * A_router
  P_cluster = freq * (p_router + X * p_costate + (X/8) * (p_mips + p_other))
  P_total   = N * P_cluster + freq * M * p_router
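As a sanity check on how Table 2 and Table 3 combine, the short program below evaluates the formulas for one design point (16 nodes, 16 memory banks, 128-bit routers, 180 threads at 0.9GHz). The numbers it prints are our own back-of-the-envelope evaluation, not results reported by the paper, and the contribution of the unestimated "Other" power entry is assumed to be zero.

/* Back-of-the-envelope evaluation of the Table 2 formulas with Table 3 values.
   Our own illustration; the paper does not report these exact numbers. */
#include <stdio.h>

int main(void)
{
  /* Table 3 (65nm synthesis results) */
  const double A_mips = 0.08, A_costate = 0.04, A_router = 0.30, A_other = 0.40; /* mm^2 */
  const double p_mips = 59.0, p_costate = 8.0,  p_router = 64.0, p_other = 0.0;  /* mW/GHz; "Other" not estimated */

  /* Design point: 4x4 nodes (N=16), M=16 memory banks, 180 threads, 0.9 GHz */
  const double N = 16, M = 16, T = 180, freq = 0.9;
  const double X = T / N;                    /* control threads per cluster */

  double A_cluster = A_router + X * A_costate + (X / 8.0) * (A_mips + A_other);
  double A_total   = N * A_cluster + M * A_router;                         /* mm^2 */

  double P_cluster = freq * (p_router + X * p_costate + (X / 8.0) * (p_mips + p_other));
  double P_total   = (N * P_cluster + freq * M * p_router) / 1000.0;       /* W */

  printf("Area overhead : %.1f mm^2\n", A_total);  /* about 27.6 mm^2 */
  printf("Power overhead: %.2f W\n", P_total);     /* about 4.3 W     */
  return 0;
}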
It is important to keep in mind that, despite the efforts taken, the reported power and area estimates are only approximate. However, by just comparing relative magnitudes, the estimates give an adequate indication that the total cost of adding even an aggressive, pure hard-logic CoRAM implementation is small relative to the inherent cost of a reconfigurable fabric like the Virtex-6 LX760. Our estimates also have a large conservative margin built in since they are based on a 65nm standard cell library that is several years old.

5.1 Design Space Exploration Results

From the design explorations, we report results for the Sparse Matrix-Vector Multiplication kernel, which was our most memory-intensive application (results for Matrix-Matrix Multiplication and Black-Scholes are discussed qualitatively at the end). We simulated the execution of the SpMV kernel running on design points generated from an exhaustive sweep of the parameters given in Table 1.

Figure 15: Estimated area and power overhead. (Top: CoRAM area overhead (mm^2) vs. performance (GFLOP/s); bottom: CoRAM power overhead (W) vs. performance (GFLOP/s); design points at 300MHz, 600MHz, 900MHz, and 1200MHz.)
For each design point, we report the SpMV kernel's execution performance averaged over test input matrices from [9]. The graphs in Figure 15 plot the performance (GFLOP/sec) achieved by each design point (on the x-axis) against its area and power overheads (on the y-axis) from adding the CoRAM architecture support. The data points represented by the same markers correspond to design points with CoRAM mechanisms at the same frequency (300MHz, 600MHz, 900MHz, or 1.2GHz). All of the design points incur a fixed 18mm^2 from SRAM storage and the MIPS cores for the required 180 control thread contexts; this can be a very significant portion of the total area overhead for some design points. Nevertheless, the total area overhead is small in comparison to the hundreds of mm^2 typical of even small FPGAs today [1].

In Figure 15 (top), points in the lower-right corner correspond to higher performance and lower area overhead. For all of the frequencies, a nearly minimal-area design point achieves almost the best performance possible at that frequency. This suggests that the operating frequency of the CoRAM mechanisms has a first-order impact on the overall application performance, beyond the impact of microarchitectural choices. This result suggests that it may be difficult for soft implementations of the CoRAM mechanisms to perform comparably well to hard implementations in the future, even when reconfigurable logic resources are made plentiful.

The power-overhead-vs-performance plot in Figure 15 (bottom) exhibits a cleaner Pareto front. In this plot, the higher-performance design points tend to require a greater power overhead. It should be pointed out that the microarchitecture design point 4x4nodes-16mbanks-128bit appears consistently on the Pareto front for all frequencies (it is also optimal in the area-overhead-vs-performance plot). This suggests that it is possible to select this point to achieve minimum area overhead and apply frequency scaling to span the different positions on the power-performance Pareto-optimal front. However, this conclusion is based only on the results of the SpMV application kernel. Further study including a much greater range of application kernels is needed. Although performance results were not presented for MMM or Black-Scholes, our experiments showed that less aggressive designs (e.g., 600MHz-4x4nodes-16mbanks-128bit) were sufficient for these compute-bound applications to reach peak performance running on the CoRAM architecture.
6. RELATED WORK

A large body of work has explored specialized VLSI designs for reconfigurable computing. GARP [12] is an example that fabricates a MIPS core and cache hierarchy along with a collection of reconfigurable processing elements (PEs). The PEs share access to the processor cache but only through a centralized access queue at the boundary of the reconfigurable logic. Tiled architectures (e.g., Tilera [23]) consist of a large array of simple von Neumann processors instead of fine-grained lookup tables. The memory accesses by the cores are supported through per-core private caches interconnected by an on-chip network. Smart Memories [14], on the other hand, employs reconfigurable memory tiles that selectively act as caches, scratchpad memory, or FIFOs.

The idea of decoupling memory management from computation in CoRAM has been explored previously for general-purpose processors [19, 7]. Existing work has also examined soft memory hierarchies for FPGAs (e.g., [25, 10, 13, 15]). The most closely related work to CoRAM is LEAP [2], which shares the objective of providing a standard abstraction. LEAP abstracts away the details of memory management by exporting a set of timing-insensitive, request-response interfaces to local client address spaces. Underlying details such as multi-level caching and data movement are hidden from the user. The CoRAM architecture differs from LEAP by allowing explicit user control over the lower-level details of data movement between global memory interfaces and the on-die embedded SRAMs; the CoRAM architecture could itself be used to support the data movement operations required in a LEAP abstraction.

7. CONCLUSIONS

Processing and memory are inseparable aspects of any real-world computing problem. A proper memory architecture is a critical requirement for FPGAs to succeed as a computing technology. In this paper, we investigated a new, portable memory architecture called CoRAM to provide deliberate support for memory accesses from within the fabric of a future FPGA engineered to be a computing device. CoRAM is designed to match the requirements of highly concurrent, spatially distributed processing kernels that consume and produce memory data from within the fabric. The paper demonstrated the ease of use in managing the memory access requirements of three non-trivial applications, while allowing the designers to focus exclusively on application development without sacrificing portability or performance. This paper also suggested a possible microarchitecture space for supporting the CoRAM architecture in future reconfigurable fabrics. An investigation of the trade-offs between performance, power, and area suggests that adding support for the CoRAM architecture in future devices requires only a modest overhead in power and area relative to the reconfigurable fabric.

8. ACKNOWLEDGEMENTS

Funding for this work was provided by NSF CCF-1012851. We thank the anonymous reviewers and members of CALCM for their comments and feedback. We thank Xilinx for their FPGA and tool donations. We thank Bluespec for their tool donations and support.

9. REFERENCES

[1] Under the Hood: has company at 65 nm. http://maltiel-consulting.com/Intel_leads_65-nanometer_technology_race_other-Texas_Instruments_Xilinx_AMD_catching_up.htm.
[2] M. Adler, K. E. Fleming, A. Parashar, M. Pellauer, and J. Emer. LEAP Scratchpads: Automatic Memory and Cache Management for Reconfigurable Logic. In FPGA'11: Proceedings of the 2011 ACM/SIGDA 19th International Symposium on Field Programmable Gate Arrays, 2011.
[3] Bluespec, Inc. http://www.bluespec.com/products/bsc.htm.
[4] F. Brewer and J. C. Hoe. MEMOCODE 2007 Co-Design Contest. In Fifth ACM-IEEE International Conference on Formal Methods and Models for Codesign, 2007.
[5] S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach. Accelerating Compute-Intensive Applications with GPUs and FPGAs. In SASP'08: Proceedings of the 2008 Symposium on Application Specific Processors, pages 101-107, Washington, DC, USA, 2008. IEEE Computer Society.
[6] E. S. Chung, P. A. Milder, J. C. Hoe, and K. Mai. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In MICRO-43: Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010.
[7] E. Cohler and J. Storer. Functionally parallel architecture for array processors. Computer, 14:28-36, 1981.
[8] Convey, Inc. http://www.convey.com.
[9] T. A. Davis. University of Florida Sparse Matrix Collection. NA Digest, 92, 1994.
[10] H. Devos, J. V. Campenhout, and D. Stroobandt. Building an Application-specific Memory Hierarchy on FPGA. 2nd HiPEAC Workshop on Reconfigurable Computing, 2008.
[11] T. El-Ghazawi, E. El-Araby, M. Huang, K. Gaj, V. Kindratenko, and D. Buell. The Promise of High-Performance Reconfigurable Computing. Computer, 41(2):69-76, Feb. 2008.
[12] J. R. Hauser and J. Wawrzynek. Garp: a MIPS processor with a reconfigurable coprocessor. In FCCM'97: Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing Machines, page 12, Washington, DC, USA, 1997. IEEE Computer Society.
[13] G. Kalokerinos, V. Papaefstathiou, G. Nikiforos, S. Kavadias, M. Katevenis, D. Pnevmatikatos, and X. Yang. FPGA implementation of a configurable cache/scratchpad memory with virtualized user-level RDMA capability. In Proceedings of the 9th International Conference on Systems, Architectures, Modeling and Simulation, SAMOS'09, pages 149-156, Piscataway, NJ, USA, 2009. IEEE Press.
[14] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz. Smart Memories: A Modular Reconfigurable Architecture. In ISCA'00: Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 161-171, New York, NY, USA, 2000. ACM.
[15] P. Nalabalapu and R. Sass. Bandwidth Management with a Reconfigurable Data Cache. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 3 - Volume 04, pages 159.1-, Washington, DC, USA, 2005. IEEE Computer Society.
[16] T. Ngai, J. Rose, and S. Wilton. An SRAM-programmable field-configurable memory. In Proceedings of the IEEE 1995 Custom Integrated Circuits Conference, May 1995.
[17] E. Nurvitadhi, J. C. Hoe, T. Kam, and S.-L. Lu. Automatic Pipelining from Transactional Datapath Specifications. In Design, Automation, and Test in Europe (DATE), 2010.
[18] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1988.
[19] J. E. Smith. Decoupled access/execute computer architectures. SIGARCH Comput. Archit. News, 10:112-119, April 1982.
[20] D. Tarjan, S. Thoziyoor, and N. P. Jouppi. CACTI 4.0. Technical Report HPL-2006-86, HP Labs, 2006.
[21] V. Podlozhnyuk. Black-Scholes Option Pricing, 2007.
[22] T. F. Wenisch, R. E. Wunderlich, M. Ferdman, A. Ailamaki, B. Falsafi, and J. C. Hoe. SimFlex: Statistical Sampling of Computer System Simulation. IEEE Micro, July 2006.
[23] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. F. Brown III, and A. Agarwal. On-Chip Interconnection Architecture of the Tile Processor. IEEE Micro, 27(5):15-31, 2007.
[24] Xilinx, Inc. Virtex-7 Series Overview, 2010.
[25] P. Yiannacouras and J. Rose. A parameterized automatic cache generator for FPGAs. In Proceedings of Field-Programmable Technology (FPT), pages 324-327, 2003.
[26] S. P. Young. FPGA architecture having RAM blocks with programmable word length and width and dedicated address and data lines. United States Patent No. 5,933,023, 1996.

Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai
Computer Architecture Laboratory (CALCM)
Carnegie Mellon University, Pittsburgh, PA 15213
{echung, pam, jhoe, kenmai}@ece.cmu.edu

Abstract

To extend the exponential performance scaling of future chip multiprocessors, improving energy efficiency has become a first-class priority. Single-chip heterogeneous computing has the potential to achieve greater energy efficiency by combining traditional processors with unconventional cores (U-cores) such as custom logic, FPGAs, or GPGPUs. Although U-cores are effective at increasing performance, their benefits can also diminish given the scarcity of projected bandwidth in the future. To understand the relative merits between different approaches in the face of technology constraints, this work builds on prior modeling of heterogeneous multicores to support U-cores. Unlike prior models that trade performance, power, and area using well-known relationships between simple and complex processors, our model must consider the less-obvious relationships between conventional processors and a diverse set of U-cores. Further, our model supports speculation of future designs from scaling trends predicted by the ITRS road map. The predictive power of our model depends upon U-core-specific parameters derived by measuring performance and power of tuned applications on today's state-of-the-art multicores, GPUs, FPGAs, and ASICs. Our results reinforce some current-day understandings of the potential and limitations of U-cores and also provide new insights on their relative merits.

1. Introduction

Although transistor densities have grown exponentially from continued innovations in process technology, the power needed to switch each transistor has not decreased as expected due to slowed reductions in the threshold voltage (Vth) [1, 2, 3]. Therefore, the primary determinant of performance today is often power efficiency, not necessarily insufficient area or frequency. In addition to power concerns, off-chip bandwidth trends are also expected to have a major impact on the scalability of future designs [4]. In particular, pin counts have historically grown at about 10% per year—in contrast to the doubling of transistor densities every 18 months [5]. In this era of bandwidth- and power-constrained multicores, architects must explore new and unconventional ways to improve scalability.

The use of "unconventional" architectures today offers a promising path towards improved energy efficiency. One class of solutions dedicates highly efficient custom-designed cores for important computing tasks (e.g., [6, 7]). Another solution includes general-purpose graphics processing units (GPGPUs) where programmable SIMD engines provide accelerated computing performance [8, 9]. Lastly, applications on Field Programmable Gate Arrays (FPGA) can also achieve high energy efficiency relative to conventional processors [10].

Although energy efficiency plays a crucial role in increasing peak theoretical performance, the projected scarcity of bandwidth can also diminish the impact of simply maxing out the energy efficiency. For designers of future heterogeneous multicores, the selection of which unconventional approaches to incorporate, if any, is a daunting task. They must address questions such as: Are unconventional approaches worth the gains? Will bandwidth constraints limit these approaches regardless? When are (expensive) custom logic solutions more favorable than flexible (but less-efficient) solutions? With the wide range of choices and degrees of freedom in this vast and little-explored design space, it is crucial to investigate which approaches merit further study.

Building on the work of Hill and Marty [11], this paper extends their model to consider the scalability, power, and bandwidth implications of unconventional computing cores, referred to throughout this paper as U-cores. In the application of our model, U-cores based on GPUs, FPGAs, and custom logic are used to speed up parallelizable sections of an application, while a conventional processor executes the remaining serial sections. In a similar spirit to Hill and Marty, the objectives of our model are not to pinpoint exact design numbers but to identify important trends worthy of future investigation and discussion.

To use our model effectively, various U-core-specific parameters must be calibrated beforehand. Unlike previous models that relied upon better understood performance and power relationships between simple and complex cores (e.g., [11, 12, 13]), the relationships between conventional cores and U-cores are not as obvious. To obtain our U-core parameters, we measured the performance and power of tuned workloads targeting state-of-the-art CPUs, GPUs, and FPGAs. We also obtained estimates based on synthesized ASIC cores. Using our model, we study the trajectories of various hypothetical heterogeneous multicores under the constraints of area, power, and bandwidth as predicted by the International Technology Roadmap for Semiconductors (ITRS) [5]. Our results support several conclusions:

• U-cores in general do provide a significant benefit in performance from improved energy efficiency despite the scarcity of bandwidth. However, in all cases, effectively exploiting the performance gain of U-cores requires sufficient parallelism in excess of 90%.
• Off-chip bandwidth trends have a first-order impact on the relative merits of U-cores. Custom logic, in particular, reaches the bandwidth ceiling very quickly unless an application possesses sufficient arithmetic intensity. This allows more flexible U-cores such as FPGAs and GPUs to keep up in these cases.
• Even when bandwidth is not a concern, U-cores such as FPGAs and GPGPUs can still be competitive with custom logic when parallelism is moderate to high (90%-99%).
• All U-cores, especially those based on custom logic, are more broadly useful when power or energy reduction is the goal rather than increased performance.

Outline. Section 2 provides background for the reader. Section 3 presents our modeling of U-cores. Section 4 covers the methods used for collecting performance and power. Baseline results are presented in Section 5. Using a calibrated model under ITRS scaling trends, the projections and key results are given in Section 6. We offer conclusions and directions for future work in Section 7.

2. Background

Our study of heterogeneous computing extends the analytical modeling for chip multiprocessors by Hill and Marty [11] to include U-cores based on unconventional computing paradigms such as custom logic, FPGAs, or GPUs. Figure 1 illustrates the chip models used in our study. The symmetric multicore in (a) resembles the architecture of commercial multicores today, consisting of identical high-performance microprocessors coupled with private and lower-level shared caches [14, 15, 16]. In (b), an asymmetric multicore consists of a single fast core for sequential execution, while nearby minimally sized baseline cores (we refer to these as Base-Core-Equivalents, or BCEs, as termed in [11]) are used to execute parallel sections of code. Finally, Figure 1 (c) illustrates a hypothetical heterogeneous multicore containing a conventional, sequential microprocessor coupled with a sea of U-cores. Although the illustration shows U-cores as a distinct fabric, they can also be viewed as being tightly coupled with the processors themselves—e.g., in the form of specialized functional units. Since our studies focus on measurable technologies available today, we do not present results for the "dynamic" multicore machine model [11], which can hypothetically utilize all resources for both serial and parallel sections of an application. This effect is captured by our model if the resource in question is power or bandwidth.

Figure 1: Chip models: (a) Symmetric, (b) Asymmetric, (c) Heterogeneous.

2.1. Review of "Amdahl's Law for Multicore"

Before presenting the extensions of our model for single-chip heterogeneous computing, we first review Amdahl's law [17] and the analytical multicore model presented by Hill and Marty. Amdahl's law calculates speedup based on the original fraction of time f spent in sections of the program that can be sped up by a factor S:

  Speedup = \frac{1}{\frac{f}{S} + (1 - f)}

The extended forms of Amdahl's Law for symmetric and asymmetric multicore chip models by Hill and Marty are reprinted below, which introduce two additional parameters, n and r, to represent the total resources available and those dedicated towards sequential processing, respectively—in units of BCE cores. (Note: all speedups are relative to the performance of a single BCE core.)

  Speedup_{symmetric}(f, n, r) = \frac{1}{\frac{1-f}{perf_{seq}(r)} + \frac{f}{(n/r) \cdot perf_{seq}(r)}}

  Speedup_{asymmetric}(f, n, r) = \frac{1}{\frac{1-f}{perf_{seq}(r)} + \frac{f}{perf_{seq}(r) + n - r}}

For all chip models, parallel work is assumed to be uniform, infinitely divisible, and perfectly scheduled. In their initial investigations [11], Hill and Marty use Pollack's Law [12] as input to their model, which observes that sequential processing performance from microarchitecture alone grows roughly with the square root of the transistors used (perf_seq(r) = √r).
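To make the reviewed formulas concrete, the short Python sketch below evaluates the symmetric and asymmetric speedups under Pollack's Law. This is our own illustration rather than code from the paper, and the function names are ours.

    import math

    def perf_seq(r):
        # Pollack's Law: sequential performance grows with the square root
        # of the BCE resources devoted to the sequential core.
        return math.sqrt(r)

    def speedup_symmetric(f, n, r):
        # n/r cores, each built from r BCEs and running at perf_seq(r).
        return 1.0 / ((1 - f) / perf_seq(r) + f / ((n / r) * perf_seq(r)))

    def speedup_asymmetric(f, n, r):
        # One r-sized core plus (n - r) BCEs; all contribute in parallel sections.
        return 1.0 / ((1 - f) / perf_seq(r) + f / (perf_seq(r) + n - r))

    # Hypothetical example: f = 0.99, a 256-BCE budget, a 16-BCE sequential core.
    print(speedup_symmetric(0.99, 256, 16), speedup_asymmetric(0.99, 256, 16))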
2.2. U-cores in Heterogeneous Computing

The focus of our study is the modeling of unconventional computing cores (U-cores) which were not considered previously. Our study covers three non-von-Neumann computing approaches that have shown significant promise in terms of energy-efficiency and performance: (1) custom logic, (2) GPGPUs, and (3) FPGAs. Custom logic, in particular, can provide the most energy-efficient form of computation (typically 100-1000X improvement in either efficiency or performance) through application-specific integrated circuits (ASICs) customized to a specific task [18]. However, custom logic is costly to develop and cannot be easily re-purposed for new applications. Several proposals have suggested the co-existence of custom logic along with general-purpose processors in a single processing die (e.g., [6, 7]).

Flexible programmable accelerators such as GPGPUs have also been shown to significantly outperform conventional microprocessors in target applications [8, 9]. GPGPUs mainly derive their capabilities through SIMD vectorization and through multithreading to hide long latencies. In the space of flexible accelerators, programmable fabrics such as field-programmable gate arrays (FPGA) represent yet another potential candidate in single-chip heterogeneous computing. Unlike custom logic, FPGAs enable flexibility through programmable lookup tables (LUTs) that can be used to implement arbitrary logic circuits. In exchange for this flexibility, a typical 10-100× gap in area and power exists between custom logic and FPGAs [19].

2.3. Related Work

A substantial body of work has compared heterogeneous options such as custom logic, GPUs, FPGAs, and multicores (e.g., [20, 21, 22, 9]). However, most comparisons do not factor out area/technology differences or platform-specific limitations. Numerous projection studies based on ITRS have been previously carried out, focusing mainly on the scalability of chip multiprocessors (e.g., [23, 24, 25, 26, 4]). The growing body of research in single-chip asymmetric multicores is also highly relevant to this work (e.g., [27, 28, 29]).

In addition to multicores, many other types of heterogeneous designs exist such as domain-specific processors for digital signal processing [30], network off-loading [31], and physics simulations [32]. Notable products and prototypes include the IBM Cell processor [33] that integrates a PowerPC pipeline with multiple vector coprocessors, domain-specific computing prototypes [34, 35], and emerging integrated CPU-GPU architectures [36, 37].

Customization can also be facilitated through application-specific instruction processors (ASIPs). For example, the Xtensa tool from Tensilica [38] can generate custom ISAs, register files, functional units, and other datapath components that best "fit" an application. In the domain of reconfigurable computing, PipeRench [39] and RaPiD [40] were early projects that developed reconfigurable fabrics for specialized streaming pipelined computations, while Garp [41], Chameleon [42], PRISC [43], and Chimaera [44] combined reconfigurable fabrics with conventional processor cores. Tartan [45], CHIMPS [46], and other high-level synthesis tools can also target high-level code to FPGAs.

Many extensions to Amdahl's Law [17] have been proposed. Gustafson's model [47] revisits the fixed size input assumption of Amdahl's Law. Moncrieff et al. model varying levels of parallelism and note the advantages of heterogeneous systems [48]. Paul and Meyer revisit Amdahl's assumptions and develop sophisticated models for single chip systems [49]. Morad et al. study the power efficiency and scalability of asymmetric chip multiprocessors [13]. Hill and Marty derive corollaries for single-chip multicores and argue for the importance of sequential processor performance [11]. Eyerman et al.'s multicore model introduced critical sections [50]. Woo and Lee [51] consider energy, power, and other metrics like performance per watt, while Cho and Melhem focus on energy minimization and power-saving features [52].
3. Extending Hill and Marty

In this section, we present extensions to the model by Hill and Marty to consider power and performance of single-chip heterogeneous multicore designs based on U-cores. In similar spirit to their investigations, our model is kept deliberately simple and omits details such as the memory hierarchy/interconnect and how U-cores communicate.

3.1. Extending the Model for Power

To extend our models to include the effects of power limitations, we require a cost model that includes a power budget that cannot be exceeded in either serial or parallel phases of a workload. We first assume that a BCE core consumes an active power of 1 while fully executing, which includes both leakage and dynamic power. We assume that unused cores can be powered off entirely without the overhead of static power. To model a power budget, we introduce the parameter P, which is defined relative to the active power of a BCE core. To model the power consumption of a sequential microprocessor relative to a BCE core, a simple power law is used to capture the super-linear relationship between performance and power: power_seq(perf) = perf^α, where α was estimated to be 1.75 in [53]. Assuming Pollack's Law (perf = √r) [12], the power dissipation of the sequential core is related to the area consumed through power_seq = perf^α = (√r)^α = r^{α/2}.

Due to the higher power requirements of a sequential core, the speedup formula we use in our study later is a modified asymmetric model called asymmetric-offload, which considers the case when the power-hungry sequential core is powered off during parallel sections.

  Speedup_{asymmetric-offload}(f, n, r) = \frac{1}{\frac{1-f}{perf_{seq}(r)} + \frac{f}{n - r}}

With our new parameters and assumptions in place, the first three rows of Table 1 summarize how the original variables from Hill and Marty, n and r, are now bounded together by an area budget A (in units of BCE) and by a power budget P. The interpretation of a "bounded" n is simply the maximum number of BCE resources that usefully contribute to overall speedup. For example, if the n bounded by power is less than the n bounded by area, power alone limits the number of resources that can contribute to overall speedup. Similarly, the serial power bounds shown in Table 1 also limit the amount of resources (r) that can be allocated for sequential performance.

Table 1: Bounds on area, power, and bandwidth.

                                Symmetric           Asym-offload    Heterogeneous
  Area constraints              n ≤ A               n ≤ A           n ≤ A
  Parallel power bounds         n ≤ P·r^{1−α/2}     n ≤ P + r       n ≤ P/φ + r
  Serial power bounds           r^{α/2} ≤ P         r^{α/2} ≤ P     r^{α/2} ≤ P
  Parallel bandwidth bounds     n ≤ B·√r            n ≤ B + r       n ≤ B/µ + r
  Serial bandwidth bounds       r ≤ B²              r ≤ B²          r ≤ B²

3.2. Model Extension for Bandwidth

Apart from power and area, limited off-chip bandwidth is another concern that limits multicore scalability. We assume that a BCE core (at least) consumes the compulsory bandwidth of an application. We define this to be 1 for a particular workload. This assumption optimistically assumes that the working set size of an application or kernel can fit within on-chip memories and ignores details such as memory hierarchy organization. However, as we show later in Section 5, tuned algorithms with inputs that fit within on-chip memories will typically achieve compulsory bandwidth (as measured in our workloads in Section 5). (The bounds in Table 1 can be easily modified if the application's bandwidth can be characterized as a function or constant factor of the compulsory bandwidth.)

For each workload with its own bandwidth requirements, a parameter B, in units of BCE compulsory bandwidth, defines the maximum off-chip memory bandwidth available. For bandwidth estimation of non-BCE processing cores, we assume that bandwidth scales linearly with respect to BCE performance—for example, a core twice as fast as a BCE core consumes a bandwidth of 2. Table 1 (bottom) summarizes how n and r are bounded by the maximum compulsory bandwidth supported by the machine.

3.3. Modeling U-cores

With power and bandwidth in place, we now introduce U-cores into our model. Recall from earlier that a U-core represents an unconventional form of computing such as custom logic, GPUs, or FPGAs. To model this, we begin with the assumption that a BCE-sized U-core can be used to execute exploitable parallel sections of code at a relative performance of µ compared to a BCE core. In addition, an actively executing U-core consumes a power of φ, in units of BCE power. Together, µ and φ characterize a design space for U-cores. For instance, a U-core with µ > 1 and φ = 1 can be viewed as an accelerator that consumes the same level of power as before. Similarly, if µ = 1 but φ < 1, the same level of performance is achieved as before but with lower power. For heterogeneous multicores based on U-cores, the new speedup formula is shown below:

  Speedup_{heterogeneous}(f, n, r) = \frac{1}{\frac{1-f}{perf_{seq}(r)} + \frac{f}{µ(n - r)}}

We assume that the conventional microprocessor does not contribute to speedup during parallel sections of execution. Similar to Hill and Marty [11], we assume that parallel performance scales linearly with the resources used to execute parallel sections of code.

Table 1 (last column) shows how the n variable must also be similarly constrained by the power and bandwidth requirements of U-cores. (Note how lower φ values diminish the impact of the power budget P, whereas higher µ values increase bandwidth consumption.) With our model in place, the next section describes the methodology used to obtain accurate parameters in order to perform our single-chip heterogeneous study using U-cores based on ASICs, FPGAs, and GPUs.
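As a worked illustration of how the heterogeneous speedup interacts with the budgets, the sketch below (our own, with hypothetical helper names) clips n to the area, power, and bandwidth bounds of the heterogeneous column of Table 1 as reconstructed above, then evaluates the speedup. The numeric budgets in the example are illustrative, not values from the paper.

    import math

    def perf_seq(r):
        return math.sqrt(r)  # Pollack's Law

    def speedup_heterogeneous(f, n, r, mu):
        # (n - r) BCE-sized U-cores, each mu times a BCE's performance.
        return 1.0 / ((1 - f) / perf_seq(r) + f / (mu * (n - r)))

    def bounded_n_heterogeneous(A, P, B, r, mu, phi, alpha=1.75):
        # Area bound, parallel power bound ((n - r) U-cores at phi each), and
        # parallel bandwidth bound ((n - r) U-cores at mu units each).
        if r ** (alpha / 2) > P or r > B ** 2:
            return None  # serial power or serial bandwidth bound violated
        return min(A, P / phi + r, B / mu + r)

    # Hypothetical example: A = 75 BCEs, P = 100, B = 40, r = 4, and an
    # FPGA-like U-core with mu = 2.0 and phi = 0.3.
    n = bounded_n_heterogeneous(75, 100, 40, 4, 2.0, 0.3)
    print(n, speedup_heterogeneous(0.99, n, 4, 2.0))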
4. Methodology

To obtain reasonable U-core parameters (φ, µ), we measure the performance and power of real systems. To satisfy the assumption in our model that U-core performance scales linearly with the amount of resources used, we ensured that all measured applications on a given system are compute-bound, such that performance increases would not be possible without more chip area. Table 2 summarizes the various devices selected in our study.

The Core i7-960 is a 4-way multicore from Intel [54] that serves as a baseline for calibrating BCE parameters. The GTX285 and GTX480 are two successive generations of high-performance, programmable GPUs from Nvidia [55], while the R5870 is a similarly-capable GPU from AMD [56]. The Virtex-6 LX760 is the largest FPGA available from Xilinx [57], and 65nm commercial synthesis flows are used to synthesize custom logic for each application.

In Table 2 we show the estimated area contribution from core- and cache-only components, which will be used later to estimate U-core parameters. For devices with available die photos, we subtracted non-compute areas such as on-die memory controllers and I/O. For the R5870, which lacks a die photo, we based our estimates on an assumed non-compute overhead of 25%. For the FPGA, we estimated the area per LUT and associated overhead to be roughly 0.00191mm², including the amortized overhead of flip-flops, RAMs, multipliers, and interconnect. To estimate areas for the ASIC cores, we used Synopsys Design Compiler [58] targeting a commercial 65nm standard cell library.

Table 2: Summary of devices.

               Core i7-960    GTX285       GTX480        R5870         V6-LX760           ASIC
  Technology
    Year       2009           2008         2010          2009          2009               2007
    Node       Intel/45nm     TSMC/55nm    TSMC/40nm     TSMC/40nm     UMC/Samsung/40nm   65nm
    Die area   263mm²         470mm²       529mm²        334mm²        -                  -
    Core area  193mm²         338mm²       422mm²        -             -                  -
    Clock rate 3.2GHz         1.476GHz     1.4GHz        1.476GHz      -                  -
    Voltage    0.8-1.375V     1.05-1.18V   0.96-1.025V   0.95-1.174V   0.9-1.0V           1.1V
  Platform
    System     ASUS P6T       XFX 285      eVGA 480      XFX 5870      -                  -
    Memory     3GB DDR3       1GB GDDR3    1.5GB GDDR5   1GB GDDR5     -                  -
    Bandwidth  32GB/sec       159GB/sec    177.4GB/sec   153.6GB/sec   -                  -

Table 3: Summary of workloads.

                                       Core i7-960         GTX285                  GTX480               R5870   LX760 / ASIC
  Dense Matrix Multiplication (MMM)    MKL 10.2.3          CUBLAS 2.3              CUBLAS 3.0/3.1beta   CAL++   Bluespec (by hand)
  Fast Fourier Transform (FFT)         Spiral              CUFFT 2.3/3.0/3.1beta   CUFFT 3.0/3.1beta    –       Verilog (Spiral-generated)
  Black-Scholes (BS)                   PARSEC (modified)   CUDA 2.3                CUDA 3.1 ref.        –       Verilog (generated)

4.1. Overview of Workloads

The algorithms and workloads we study are shown in Table 3. Matrix-Matrix Multiplication (MMM), in particular, is a kernel with high arithmetic intensity and simple memory requirements. Fast Fourier Transform (FFT), on the other hand, possesses complex dataflow and memory requirements, while Black-Scholes (BS) was selected due to its rich mixture of arithmetic operators. To satisfy compute-bound requirements, all kernels are assumed to be throughput-driven, i.e., many independent inputs are being computed. In all cases, we employed single-precision, IEEE-compliant floating point.

For the CPUs and GPUs, we used various tuned math libraries for Matrix-Matrix Multiplication (MMM). To obtain optimized FFT implementations for the Core i7, we utilized the Spiral framework [59]. For the Nvidia-based GPUs, we used the best results across three of the latest versions of CUFFT [60]. For Black-Scholes on the GPU, we used reference code from Nvidia. For the CPU, we used the multithreaded version from PARSEC [61] and added additional SSE optimizations. We were unable to obtain optimized FFT/BS for the R5870 and BS for the GTX480.

For the FPGA and ASIC, we developed an optimized implementation of MMM by hand. For FFTs, we used the Spiral framework to generate optimized synthesizable RTL [62]. Finally, for Black-Scholes, we developed a software tool to automatically create hardware pipelines from a high-level description of math operators. To ensure a fully-utilized FPGA, designs were scaled until timing could no longer be met. For the ASIC, we synthesized the same designs and estimated results using Synopsys Design Compiler 2009-06-SP5 configured with commercial 65nm standard cells. We used Cacti [63] to model SRAM area and power for the ASIC.

4.2. Power Methodology

To collect power data, a current probe was used to measure various devices while running applications in steady state. For the Core i7, we measured the EATX12V power rail, which only supplies power to the cores and private L1/L2 caches [64]. At the time of publication, we did not possess a platform with the LX760 FPGA, and instead relied on a smaller FPGA (LX240T/ML605) to perform our power measurements.

Although beyond the scope of this work, a significant amount of effort was placed into measuring GPU power consumption, due to the numerous non-computing related components (e.g., RAM). To achieve this, a set of microbenchmarks was designed to measure and subtract out non-compute power dissipation from on-die memory controllers and off-chip GDDR memory. Power results for the ASIC were estimated from synthesis tools.

5. Baseline Performance and Power

In this section we present baseline performance and power results. To explain how U-core parameters are derived, we use FFT as a case study and present the remaining results for other workloads at the end. Starting with Figure 2 (top), all device performances are plotted for input sizes 2^4 to 2^20. However, without normalizing for die area, comparing any two devices at face value would not be fair in the context of designing a single-chip multicore. Instead, Figure 2 (bottom) normalizes all performances to die area in 40nm/45nm. As expected, the ASIC FFT cores achieve nearly 100X improvement over the flexible cores (FPGA, GPU), and nearly 1000X improvement over the Core i7. Figure 3 shows the total power consumption of the devices. A similar normalization step for comparing power is illustrated in Figure 4 (top). As expected, the ASIC achieves nearly two orders of magnitude improvement in power and 10X over the GPUs/FPGA.

A final step we took was to verify that the FFTs were compute-bound. To achieve this, we used CPU and GPU performance counters to measure off-chip bandwidth. A notable result for the GPU is shown in Figure 4 (bottom), where the GTX285 consumes compulsory bandwidth for input sizes until 2^12, when the data no longer fits in on-chip memory. When data fits on-chip, the GTX285 does not reach the peak bandwidth of 159GB/sec, which suggests compute-bound performance. Interestingly, the GTX285 is still compute-bound for input sizes ≥ 2^12 due to increasing arithmetic intensity and the efficient out-of-core algorithms being employed. For the GTX480, we were unable to measure the bandwidth counters—thus, our results may not necessarily reflect compute-bound performance.
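The area normalization used for Figure 2 (bottom) is not spelled out step by step in the text; one plausible reading, sketched below in our own code, is to scale each device's core area to a common 40nm node assuming ideal shrinkage with the square of the feature size (the Core i7 appears to be left at its native 45nm, hence "40nm/45nm"). With this assumption the GTX285's MMM result lands near the 2.40 (GFLOP/s)/mm² reported later in Table 4.

    def area_at_40nm(core_area_mm2, node_nm):
        # Assume core area shrinks with the square of the feature size.
        return core_area_mm2 * (40.0 / node_nm) ** 2

    def perf_per_mm2_at_40nm(gflops, core_area_mm2, node_nm):
        return gflops / area_at_40nm(core_area_mm2, node_nm)

    # Hypothetical check using the GTX285 (55nm node, 338mm2 core area, 425 GFLOP/s MMM).
    print(perf_per_mm2_at_40nm(425.0, 338.0, 55.0))  # ~2.4 (GFLOP/s)/mm^2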

Figure 2: FFT performance in pseudo-GFLOP/s (# FLOPS = 5N log2 N); top: non-normalized, bottom: area-normalized to 40nm. X-axis is log2(N) from 4 to 20.

Figure 3: FFT power consumption breakdown (non-normalized); x-axis is FFT input size in log2 N. Bars separate core dynamic, uncore dynamic, uncore static, core leakage, and unknown power.

Figure 4: FFT energy efficiency at 40nm (top, pseudo-GFLOP per J) and FFT bandwidth (bottom, GB/sec), showing compulsory and measured bandwidth for the GTX285 and compulsory bandwidth for the GTX480.

Summary of Results. The absolute performance and normalized results for the remaining workloads (MMM and BS) are summarized in Table 4. For MMM, the R5870 GPU performed the best, achieving nearly 1.5 TeraFLOPs of performance. With a small die size, the R5870 achieves a high area-normalized performance almost comparable to the dedicated ASIC core (although still less energy-efficient). We were surprised that the GTX480 only achieved a 27% improvement in MMM over the GTX285 using the CUBLAS library (we expected a doubling of throughput); it is possible that further software tuning is needed.

5.1. Deriving U-core Parameters

Recalling from Section 3.3, φ and µ are parameters that characterize a BCE-sized U-core, where φ is the relative power efficiency to a BCE core, and µ is the relative performance. To determine the parameters of the BCE core, we treat the Core i7 as a fast core and derive BCE core parameters. To determine "r" for the Core i7, we base our estimates on an Intel Atom, which is a 26mm² in-order processor in the 45nm process [65] (we subtract 10% for non-compute area). An r value of 2 roughly gives the equivalent size of a single Core i7 core (about 193mm²/(4 cores)). Based on formulas that relate performance and power to area (perf = √r, power = r^{α/2}), we derive perf/mm² and perf/W for the BCE core (we assume α = 1.75 [53]). These derived metrics are then used to compute φ and µ¹ for various U-cores in Table 5.

1. µ = x_ucore / (x_corei7 · √r), where x_i = perf/mm²; φ = (µ · e_corei7) / (e_ucore · r^{(1−α)/2}), where e_i = perf/watt.
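A minimal sketch of the parameter derivation in footnote 1, using our own function names. Plugging in the Core i7 and GTX285 MMM entries from Table 4 reproduces the corresponding GTX285 MMM row of Table 5 (µ ≈ 3.41, φ ≈ 0.74), which suggests the reconstruction of the footnote is consistent.

    import math

    ALPHA = 1.75  # from [53]

    def derive_mu_phi(x_ucore, e_ucore, x_corei7, e_corei7, r=2.0):
        # x_* = perf per mm^2 (area-normalized), e_* = perf per watt.
        mu = x_ucore / (x_corei7 * math.sqrt(r))                 # relative performance
        phi = (mu * e_corei7) / (e_ucore * r ** ((1 - ALPHA) / 2))  # relative power
        return mu, phi

    # Core i7 MMM: 0.50 (GFLOP/s)/mm^2, 1.14 GFLOP/J; GTX285 MMM: 2.40, 6.78.
    print(derive_mu_phi(2.40, 6.78, 0.50, 1.14))  # approx (3.4, 0.74)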

Table 4: Summary of results for MMM and BS.

                           GFLOP/s    (GFLOP/s)/mm²   GFLOP/J
  MMM      Core i7         96         0.50            1.14
           GTX285          425        2.40            6.78
           GTX480          541        1.28            3.52
           R5870           1491       5.95            9.87
           LX760           204        0.53            3.62
           ASIC            694        19.28           50.73

                           Mopts/s    (Mopts/s)/mm²   Mopts/J
  Black-   Core i7         487        2.52            4.88
  Scholes  GTX285          10756      60.72           189
           GTX480          -          -               -
           R5870           -          -               -
           LX760           7800       20.26           138
           ASIC            25532      1719            642.5

Table 5: U-core parameters (φ = relative BCE power efficiency, µ = relative BCE performance).

                MMM     BS      FFT-64   FFT-1024   FFT-16384
  GTX285   φ    0.74    0.57    0.59     0.63       0.89
           µ    3.41    17.0    2.42     2.88       3.75
  GTX480   φ    0.77    -       0.39     0.47       0.66
           µ    1.83    -       1.56     2.20       2.83
  R5870    φ    1.27    -       -        -          -
           µ    8.47    -       -        -          -
  LX760    φ    0.31    0.26    0.29     0.29       0.37
           µ    0.75    5.68    2.81     2.02       3.02
  ASIC     φ    0.79    4.75    5.34     4.96       6.38
           µ    27.4    482     733      489        689

6. Scaling Projections

In this section, we apply our model using U-core parameters from Table 5 to project the performance of single-chip heterogeneous computing devices based on the ITRS 2009 road map [5]. Figure 5 shows the long-term expected trends for pin count, Vdd, and gate capacitance. Assuming that clock frequencies do not increase, the power per transistor is expected to drop only by a factor of 5X over the next fifteen years (in contrast to the doubling of transistor density with each additional technology node). An even more serious challenge is the slowing of bandwidth, as evidenced by the gradual increases in projected pin counts (< 1.5X over fifteen years).

With these challenges in mind, we pose several critical questions: (1) under a bandwidth- and power-limited existence, would future single-chip multicores benefit significantly from having custom logic, FPGAs, and GPUs? (2) for the best possible performance, will custom logic always be the right answer under a power- and bandwidth-dominated regime? and (3) would our conclusions change if lowering total energy is the primary objective instead of maximizing performance?

To investigate these questions, we scale the performance of heterogeneous and non-heterogeneous multicores based on ITRS predictions. Table 6 summarizes various parameters and assumptions. We allow a maximum die budget of 576mm² (e.g., Power7 [66]) and reserve 25% of that area for non-compute components (e.g., on-die memory controllers) [26]. A 100W power budget was selected for core- and cache-only components (not including power reserved for the non-compute components). We assume that clock frequencies do not scale further after 40nm.

Table 6: Parameters assumed in technology scaling.

  Year                        2011    2013    2016    2019    2022
  Technology node             40nm    32nm    22nm    16nm    11nm
  Core die budget (mm²)       432     432     432     432     432
  Core power budget (W)       100     100     100     100     100
  Bandwidth (GB/s)            180     198     234     234     252
  Max area (BCE units)        19      37      75      149     298
  Rel. power per transistor   1X      0.75X   0.5X    0.36X   0.25X
  Rel. bandwidth              1X      1.1X    1.3X    1.3X    1.4X

For bandwidth, we assume that 180 GB/s can be optimistically reached in 2011 based on state-of-the-art high-end devices today (rounding up the GTX480's 177GB/s). To estimate the compulsory bandwidth for FFT², we assume an input size of 1024, which gives 0.32 bytes/flop. For Black-Scholes, the compulsory bandwidth is 10 bytes/option. To compute the compulsory bandwidth for MMM³, we assume square matrix inputs blocked at N = 128, which gives 0.0313 bytes/flop. We exclude the ASIC-based MMM core from this constraint, since our design at 40nm can support N ≥ 2048.
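For reference, the Table 6 assumptions can be captured as a small lookup table. The dictionary below is our own encoding of those rows; the text does not spell out how the fixed 100W and 432mm² budgets are converted to BCE-unit values of P and B at each node, so those conversions are left out here.

    # Our own encoding of Table 6: per-node ITRS-derived scaling assumptions.
    ITRS = {
        "40nm": {"year": 2011, "max_area_bce": 19,  "bandwidth_gbs": 180, "rel_power": 1.00},
        "32nm": {"year": 2013, "max_area_bce": 37,  "bandwidth_gbs": 198, "rel_power": 0.75},
        "22nm": {"year": 2016, "max_area_bce": 75,  "bandwidth_gbs": 234, "rel_power": 0.50},
        "16nm": {"year": 2019, "max_area_bce": 149, "bandwidth_gbs": 234, "rel_power": 0.36},
        "11nm": {"year": 2022, "max_area_bce": 298, "bandwidth_gbs": 252, "rel_power": 0.25},
    }

    print(ITRS["22nm"]["max_area_bce"])  # 75 BCE units of core area at 22nm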
6.1. Overview of Results

The scaling projections for all U-core-based heterogeneous multicores (referred to as HET) and non-heterogeneous multicores (referred to as CMP) are shown starting with Figure 6. In each figure, we vary the f parameter, which indicates the fraction of time a particular kernel (e.g., FFT, MMM, BS) is being sped up by U-cores. To determine the optimal size of the sequential core, we sweep all values of r (sequential core size) up to 16 for each particular design point and report the maximum speedup. All speedups are reported relative to 1 BCE core. To interpret the graphs, data points that are connected by dashed lines indicate designs that cannot be scaled any further due to power (i.e., the full area budget cannot be utilized). Data points that are connected by solid lines indicate bandwidth-limited designs. Finally, data points that are not connected by lines are limited only by area.

Figure 5: ITRS 2009 Scaling Projections, normalized to 2011: package pins, Vdd, gate capacitance, and combined technology power reduction (high-performance MPUs and ASICs [5]).

2. FFT arithmetic intensity (32-bit) = 5N·log2(N) flops / 16N bytes = 0.3125·log2(N) flops/byte.
3. MMM arithmetic intensity (32-bit) = 2N³ flops / (2·4N² bytes) = N/4 flops/byte.
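Footnotes 2 and 3 can be checked with a few lines of arithmetic; the helpers below are our own and simply invert the arithmetic intensities into the bytes/flop figures quoted in the text.

    import math

    def fft_flops_per_byte(n):
        # 5*N*log2(N) flops over 16*N bytes, per footnote 2.
        return 5 * n * math.log2(n) / (16 * n)

    def mmm_flops_per_byte(n_block):
        # 2*N^3 flops over 2*4*N^2 bytes for an N x N blocked MMM, per footnote 3.
        return 2 * n_block ** 3 / (2 * 4 * n_block ** 2)

    print(1 / fft_flops_per_byte(1024))  # ~0.32 bytes/flop for FFT-1024
    print(1 / mmm_flops_per_byte(128))   # ~0.031 bytes/flop for MMM blocked at N = 128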


Figure 6: FFT-1024 projection. (Dashed lines = power-limited, solid = bandwidth-limited.)

Figure 7: MMM projection. (Dashed lines = power-limited, solid = bandwidth-limited.)

FFT. Beginning with FFT-1024 in Figure 6, four different graphs are plotted with varying amounts of parallelism (f). At all values of f, the ASIC achieves the highest level of performance but cannot scale further due to bandwidth limitations. At f = 0.5, the lack of sufficient parallelism results in none of the HETs providing a significant performance gain over the CMPs—despite being more energy-efficient. The pronounced differences emerge when f ≥ 0.90. Here, the HETs achieve more performance per unit power, which only becomes beneficial with sufficient parallelism. However, even with sufficient parallelism beyond f ≥ 0.99, the lack of memory bandwidth limits any further speedup. It is interesting to note that the FPGA design reaches ASIC-like bandwidth-limited performance as early as 32nm—and similarly for the GPU designs, around 22nm and 16nm.

MMM. Results for MMM are shown in Figure 7. Unlike FFT, MMM possesses high arithmetic intensity and consumes relatively low memory bandwidth. For all values of f, the ASIC achieves the highest performance without becoming bandwidth-limited. However, unless f ≥ 0.999, less-efficient approaches based on GPUs or FPGAs can still achieve speedups within a factor of two to five of the ASIC. It is also interesting to see how most designs are initially area-limited in 40nm and 32nm, but transition to becoming power-limited at 22nm and after.
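The sweep described above, which tries every sequential core size r up to 16 for each design point and keeps the best speedup, can be sketched as follows. This is our own illustration (not the authors' code); the power and bandwidth budgets in the example are illustrative BCE-unit values, while µ and φ come from the LX760 FFT-1024 row of Table 5.

    import math

    def perf_seq(r):
        return math.sqrt(r)

    def speedup_het(f, n, r, mu):
        return 1.0 / ((1 - f) / perf_seq(r) + f / (mu * (n - r)))

    def best_speedup(f, A, P, B, mu, phi, alpha=1.75, r_max=16):
        # Sweep sequential core sizes r and keep the best speedup that
        # satisfies the heterogeneous-column bounds of Table 1.
        best = 0.0
        for r in range(1, r_max + 1):
            if r ** (alpha / 2) > P or r > B ** 2:
                continue                              # serial bounds violated
            n = min(A, P / phi + r, B / mu + r)       # area / power / bandwidth
            if n > r:
                best = max(best, speedup_het(f, n, r, mu))
        return best

    # Hypothetical 22nm FFT-1024 point with LX760 parameters (mu = 2.02, phi = 0.29).
    print(best_speedup(0.99, A=75, P=100, B=50, mu=2.02, phi=0.29))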

Figure 8: Black-Scholes projection. (Dashed lines = power-limited, solid = bandwidth-limited.)

Black-Scholes. The results for Black-Scholes in Figure 8 are similar to FFT, where most HET designs quickly converge towards becoming bandwidth-limited. However, without sufficient parallelism (f ≤ 0.5), even the conventional CMPs achieve speedups within a factor of two of the ASIC. Just as we observed in FFT, sufficient parallelism must exist (f ≥ 0.9) before the HETs can offer significant performance benefits over the CMPs. It is interesting to note again that the FPGAs and GPUs are also able to attain ASIC-like bandwidth-limited performance.

6.2. Projections under Alternative Scenarios

To investigate other scaling scenarios, we vary the following inputs to our model: (1) reduce initial bandwidth (in 40nm) to 90 GB/s, (2) increase bandwidth to 1 TB/s, (3) halve the core-only area budget to 216mm², (4) double the power budget to 200W, (5) decrease power to 10W, and (6) increase core sequential power by setting α to 2.25. Below, we explain the rationale for each scenario and discuss results qualitatively.

Scenarios 1 and 2 (bandwidth). A starting bandwidth of 90 GB/s approximates a reduction in off-chip bandwidth costs (i.e., about half of what can be built into the highest-end designs in 40nm). We observe that FFT and BS on the FPGAs and GPUs converge even faster to the ASIC's performance in earlier technology nodes (i.e., becoming bandwidth-limited at 32nm). However, because of the low bandwidth ceiling in FFT, the CMPs also achieve within a factor of two of the ASICs at around 22nm (at any f). In BS, the large gap between HETs and CMPs still exists because the CMPs are unable to achieve close to bandwidth-limited performance. To approximate disruptive technologies such as embedded DRAMs or 3D-stacked memories [67], our next scenario assumes a starting bandwidth of 1 TB/sec. For FFT, Figure 9 shows how most designs transition to becoming power-limited, with the ASIC still being bandwidth-limited from the start. At f = 0.9, all of the HETs provide further gains over the CMPs (around 2-3X speedup). However, the ASIC can only provide a significant speedup (about 2X) over the other HET approaches when f ≥ 0.999.

Figure 9: FFT-1024 projection given 1 TB/sec bandwidth. (Dashed lines = power-limited, solid = bandwidth-limited.)

Scenario 3 (area). Reducing the area budget approximates lower-cost manufacturing constraints (e.g., increasing yield). In this scenario, HETs in the earlier nodes (40nm, 32nm) become area-limited and provide less speedup over the CMPs. Surprisingly, in the later nodes (≤ 22nm), most designs achieve similar performance to what was attained under the original area budget. The explanation is that most designs in the later nodes are limited by power to begin with, not because of insufficient area.

Scenarios 4 and 5 (power budget). Doubling the power budget to 200W approximates high-end systems with more capable cooling and power solutions. Under a higher power budget, the relative benefit of having energy-efficient HETs diminishes since the less efficient CMPs are able to close the gap in performance; this is especially noticeable in FFT when most HETs become bandwidth-limited. On the other hand, a 10W power budget approximates power-constrained devices such as laptops and mobiles. Here, we observe that only the ASIC-based HETs can ever approach bandwidth-limited performance, giving them a significant advantage over the power-constrained HETs and CMPs.
Scenario 6 (serial power). Our last scenario approximates a sequential processor that consumes high power when scaled in performance (setting α = 2.25). The trends are similar to the original projections with one notable exception. At low to moderate parallelism (f ≤ 0.9), the speedups decrease significantly. Because the sequential processor's size is constrained by the serial power budget (r^{α/2} ≤ P), the optimal size of the core cannot be attained. Our results agree with Hill and Marty [11]—that is, improving sequential performance remains important. However, such performance cannot be achieved without accompanying improvements in sequential power efficiency.
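Each of these scenarios amounts to re-running the same sweep with one model input overridden. For instance, for scenario 2 (1 TB/s), reusing the hypothetical best_speedup helper and the illustrative BCE-unit budgets from the previous sketch, with B scaled up by roughly the same 5.6X factor as 180 GB/s to 1 TB/s:

    # Scenario 2 (1 TB/s starting bandwidth) versus the baseline, all else equal.
    baseline = best_speedup(0.99, A=75, P=100, B=50,  mu=2.02, phi=0.29)
    high_bw  = best_speedup(0.99, A=75, P=100, B=280, mu=2.02, phi=0.29)
    print(baseline, high_bw)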
6.3. Discussion

Our results show that the complex interplay between bandwidth, power, and parallelism has tremendous implications for various heterogeneous and non-heterogeneous computing approaches. To answer the first question—would future multicores gain from having custom logic, FPGAs, and GPGPUs?—our projections reveal that heterogeneous multicores based on U-cores do achieve a significant performance gain over asymmetric and symmetric multicores—despite the scarcity of bandwidth. This performance is achieved with the increased power efficiency of U-cores and when sufficient parallelism exists (f ≥ 0.90).

To answer the second question—whether custom logic is always the right answer—a recurring theme we observed throughout was the impact of scarce bandwidth. Although custom logic provided the best performance under all circumstances, less efficient solutions such as GPUs and FPGAs were also able to reach bandwidth-limited performance for FFT and BS. Even in the case of MMM, which possessed the highest arithmetic intensity, the ASIC did not show significant benefits over the less efficient solutions unless f > 0.99.

Recalling the last question—whether energy should be prioritized over performance—we briefly show an example that justifies why custom logic may be desirable even if parallelism is low or if memory bandwidth is a limitation. In Figure 10, the total energy consumed (normalized to BCE energy at 40nm) for various devices is plotted for MMM. Note that the energy decreases across generations are partially attributed to circuit improvements. At low levels of parallelism (f = 0.5), the opportunity to reduce the energy consumed is limited by the sequential core. However, at even moderate levels of parallelism (f = 0.9-0.99), the ASIC still achieves a significant reduction in energy relative to the other U-cores.

Figure 10: MMM Energy Projections (normalized to BCE).

Although we do not quantify this in our study, custom logic and other low-power U-cores could potentially be used to reduce sequential power consumption or to efficiently improve sequential processing performance. First, if the goal is to achieve the same level of performance as a baseline system with processors, a U-core can be used to speed up parallel sections of an application while allowing the sequential processor to slow down with a significant reduction in power [1, 2]. Another previously proposed method [6] allows a power-hungry sequential processor to offload sections of serial code to custom logic. While custom logic can satisfy this role easily, reconfigurable fabrics such as FPGAs could also be used to fulfill this objective (e.g., [45, 46]). Finally, given the ability of custom logic and FPGA-like fabrics to extract multiple degrees of parallelism, U-cores based on these techniques may also be suitable for increasing sequential processor performance at a lower energy cost.

Relative U-core merits. Although our studies compared designs based on different U-cores, more compelling approaches would allow for mixing and matching. In our study, a simplifying assumption we made is that all U-cores and processors can expose identical levels of parallelism. As seen in practice (e.g., [22, 20, 21]), GPGPUs have typically shown their strength for homogeneous data-parallel tasks, while custom logic and FPGAs can be used to pipeline irregular data flows as well as data-parallel tasks. Furthermore, the FPGA results we present should be treated conservatively given our selection of single-precision floating point applications (FPGAs lack dedicated FPUs). With the abundance of area (but shortage of power) in the future, a compelling prospect is to fabricate different U-cores that are powered on demand for suitable tasks. In similar spirit to suggestions made by others such as Hardavellas [24] and Venkatesh et al. [6], specialized cores could be mixed and matched towards specific applications. For example, a high arithmetic intensity kernel such as MMM could be fabricated as custom logic alongside GPU- or FPGA-based U-cores used to accelerate bandwidth-limited kernels such as FFTs.

Model validity and concerns. By no means is our model infallible, nor are our workloads universally representative. We remind the reader that the objectives of this work are not to pinpoint exact design results, but to identify trends worthy of further investigation. Our selection of workloads, although limited, captures a spread of different characteristics. Our model, although simple, captures the salient constraints that any designer must be conscious of. The quality of our predictions will depend on two assumptions: (1) microarchitectures based on conventional CPUs, FPGAs, and GPGPUs do not change substantially in the future, and (2) ITRS predictions are correct. Obviously, the further we predict, the higher the chance that some predictions will go askew. To check the quality of our predictions, we are pursuing further studies using older devices; data already collected from 55nm/65nm devices support the same conclusions from Section 6. Lastly, although programmability is an often-raised concern, we deliberately excluded it from consideration. From our experience in developing high performance FFT libraries [59], the tuning of high-valued kernels on GPGPUs, FPGAs, or even CPUs will always require substantial investment.

7. Conclusions and Future Directions

To answer the question posed in the title of this paper—whether the future should include custom logic, FPGAs, and GPGPUs—the evidence from our measurements and from the projections made in our model shows a strong case that the answer is yes. Our findings showed that: (1) sufficient parallelism must exist before U-cores offer significant performance gains, (2) off-chip bandwidth trends have a first-order effect on the relative merits of various U-cores (i.e., flexible U-cores can keep up with U-cores based on custom logic if bandwidth is a limitation), (3) even when bandwidth is not a concern, less efficient but more flexible U-cores based on GPUs and FPGAs are competitive with custom logic if parallelism is moderate to high (90%-99%), and (4) U-cores, especially those based on custom logic, are more broadly useful if reducing energy or power is the primary goal.

There are numerous future directions worth pursuing. First, improved models and further case studies will be essential to sorting out the complex maze of choices a heterogeneous multicore designer must face. Models in the future should attempt to incorporate varying degrees of parallelism in an application, in order to capture how "suitable" certain types of U-cores might be under a given parallelism profile. Second, although significant research is underway for GPUs, little work has been done on integrating FPGAs with conventional multicores. Many issues require investigation, such as how FPGAs should be integrated into existing software stacks and memory hierarchies. Finally, the most immediate challenge on the horizon is how to attack memory bandwidth limitations. While U-cores are effective at scaling the power wall, their long-term impact will increase even further if the bandwidth ceiling can be lifted through future innovations.

Acknowledgments

We thank the anonymous reviewers, members of CALCM, and John Davis for their feedback and suggestions. We thank Nikos Hardavellas for his influential work on CMP scaling projections. Eric Chung is supported by a Microsoft Research Fellowship. We thank Spiral for providing FFT libraries and support. We thank Xilinx for their FPGA and tool donations.

References

[1] T. Mudge, "Power: A First-Class Architectural Design Constraint," Computer, vol. 34, no. 4, pp. 52-58, 2001.
[2] S. Borkar, "Getting Gigascale Chips: Challenges and Opportunities in Continuing Moore's Law," Queue, vol. 1, no. 7, pp. 26-33, 2003.
[3] M. Horowitz, "Scaling, Power and the Future of CMOS," in Proceedings of the 20th International Conference on VLSI Design, 2007.
[4] B. M. Rogers et al., "Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling," in ISCA'09, 2009.
[5] International Technology Roadmap for Semiconductors. http://www.itrs.net.
[6] G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations," in ASPLOS'10, 2010.
[7] S. Yehia et al., "Reconciling Specialization and Flexibility Through Compound Circuits," in HPCA'09, 2009.
[8] D. Luebke et al., "GPGPU: General Purpose Computation on Graphics Hardware," in SIGGRAPH 2004 Course Notes, 2004.
[9] V. W. Lee et al., "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," in ISCA'10, 2010.
[10] S. Hauck et al., Reconfigurable Computing: The Theory and Practice of FPGA-Based Computing. Morgan Kaufmann, 2008.
[11] M. D. Hill et al., "Amdahl's Law in the Multicore Era," Computer, vol. 41, pp. 33-38, 2008.
[12] F. J. Pollack, "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies (keynote address)," in MICRO'99, 1999.
[13] T. Y. Morad et al., "Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors," IEEE Comput. Archit. Lett., vol. 5, no. 1, 2006.
[14] Intel Inc., "Intel Microarchitecture, Codenamed Nehalem," 2009. http://www.intel.com/technology/architecture-silicon/nextgen.
[15] C. N. Keltcher et al., "The AMD Opteron Processor for Multiprocessor Servers," IEEE Micro, vol. 23, no. 2, pp. 66-76, 2003.
[16] D. Wentzlaff et al., "On-Chip Interconnection Architecture of the Tile Processor," IEEE Micro, vol. 27, no. 5, pp. 15-31, 2007.
[17] G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," in AFIPS'67 (Spring), 1967.
[18] W. J. Dally et al., "Efficient Embedded Computing," Computer, vol. 41, no. 7, pp. 27-32, 2008.
[19] I. Kuon et al., "Measuring the Gap Between FPGAs and ASICs," in FPGA'06, 2006.
[20] S. Che et al., "Accelerating Compute-Intensive Applications with GPUs and FPGAs," in SASP'08, 2008.
[21] D. B. Thomas et al., "A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation," in FPGA'09, 2009.
[22] N. Kapre et al., "Accelerating SPICE Model-Evaluation using FPGAs," in FCCM'09, 2009.
[23] N. Hardavellas, "Chip-Multiprocessors for Server Workloads," Ph.D. dissertation, 2009.
[24] N. Hardavellas, "When Core Multiplicity Doesn't Add Up (Keynote)," in International Symposium on Parallel and Distributed Computing, 2010.
[25] S. Li et al., "McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures," in MICRO'09, 2009.
[26] J. D. Davis et al., "Maximizing CMP Throughput with Mediocre Cores," in PACT'05, 2005.
[27] R. Kumar et al., "Heterogeneous Chip Multiprocessors," Computer, vol. 38, no. 11, pp. 32-38, 2005.
[28] E. Ipek et al., "Core Fusion: Accommodating Software Diversity in Chip Multiprocessors," in ISCA'07, 2007.
[29] M. A. Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multicore Architectures," IEEE Micro, vol. 30, no. 1, pp. 60-70, 2010.
[30] Analog Devices. http://www.analog.com.
[31] Big Foot Networks, Inc. http://www.bigfootnetworks.com.
[32] Ageia. http://www.ageia.com.
[33] J. Kahle, "The Cell Processor Architecture," in MICRO'05, 2005.
[34] W. J. Dally et al., "Merrimac: Supercomputing with Streams," in SC'03, 2003.
[35] J. H. Ahn et al., "Evaluating the Imagine Stream Architecture," in ISCA'04, 2004.
[36] L. Seiler et al., "Larrabee: A Many-core Architecture for Visual Computing," ACM Trans. Graph., vol. 27, no. 3, pp. 1-15, 2008.
[37] AMD, Inc. http://www.fusion.amd.com.
[38] Tensilica, Inc. http://www.tensilica.com.
[39] S. C. Goldstein et al., "PipeRench: A Coprocessor for Streaming Multimedia Acceleration," in ISCA'99, 1999.
[40] C. Ebeling et al., "RaPiD - Reconfigurable Pipelined Datapath," in FPL'96, 1996.
[41] J. R. Hauser et al., "Garp: a MIPS processor with a reconfigurable coprocessor," in FCCM'97, 1997.
[42] D. Wilson, "Chameleon takes on FPGAs, ASICs," Oct. 2000. http://www.edn.com/article/CA50551.html?partner=enews.
[43] R. Razdan et al., "A High-Performance Microarchitecture with Hardware-Programmable Functional Units," in MICRO'94, 1994.
[44] Z. A. Ye et al., "CHIMAERA: A High-Performance Architecture with a Tightly-Coupled Reconfigurable Functional Unit," in ISCA'00, 2000.
[45] M. Mishra et al., "Tartan: Evaluating Spatial Computation for Whole Program Execution," in ASPLOS'06, 2006.
[46] A. Putnam et al., "Performance and Power of Cache-Based Reconfigurable Computing," in ISCA'09, 2009.
[47] J. L. Gustafson, "Reevaluating Amdahl's Law," Commun. ACM, vol. 31, no. 5, pp. 532-533, 1988.
[48] D. Moncrieff et al., "Heterogeneous computing machines and Amdahl's law," Parallel Computing, vol. 22, no. 3, pp. 407-413, 1996.
[49] J. M. Paul et al., "Amdahl's Law Revisited for Single Chip Systems," Int. J. Parallel Program., vol. 35, no. 2, pp. 101-123, 2007.
[50] S. Eyerman et al., "Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design," in ISCA'10, 2010.
[51] D. H. Woo et al., "Extending Amdahl's Law for Energy-Efficient Computing in the Many-Core Era," Computer, vol. 41, no. 12, pp. 24-31, 2008.
[52] S. Cho et al., "On the Interplay of Parallelization, Program Performance, and Energy Consumption," IEEE Transactions on Parallel and Distributed Systems, vol. 21, pp. 342-353, 2010.
[53] E. Grochowski et al., "Energy per Instruction Trends in Intel Microprocessors," in Technology@Intel Magazine, 2006.
[54] Intel, Inc. http://www.intel.com/products/processor/corei7/index.htm.
[55] Nvidia, Inc. http://www.nvidia.com.
[56] AMD, Inc. http://www.amd.com.
[57] Xilinx, Inc. http://www.xilinx.com.
[58] Synopsys, Inc. http://www.synopsys.com.
[59] M. Puschel et al., "SPIRAL: Code Generation for DSP Transforms," Proceedings of the IEEE, vol. 93, no. 2, pp. 232-275, Feb. 2005.
[60] Nvidia. http://www.nvidia.com/object/cuda_home_new.html.
[61] C. Bienia et al., "The PARSEC Benchmark Suite: Characterization and Architectural Implications," in PACT'08, 2008.
[62] P. A. Milder et al., "Formal Datapath Representation and Manipulation for Implementing DSP Transforms," in Proc. Design Automation Conference, 2008, pp. 385-390.
[63] D. T. S. Thoziyoor et al., "Cacti 4.0," HP Labs, Tech. Rep. HPL-2006-86, 2006.
[64] M. Schuette, "Core i7 Power Plays," 2008. http://www.lostcircuits.com/mamb.
[65] Intel, Inc. http://www.ark.intel.com/Product.aspx?id=35635.
[66] R. Kalla et al., "Power7: IBM's Next-Generation Server Processor," IEEE Micro, vol. 30, no. 2, pp. 7-15, 2010.
[67] L. A. Polka et al., "Package Technology to Address the Memory Bandwidth Challenge for Tera-scale Computing," Intel Technology Journal, vol. 11, no. 3, 2007.

PROTOFLEX: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs

ERIC S. CHUNG, MICHAEL K. PAPAMICHAEL, ERIKO NURVITADHI, JAMES C. HOE, and KEN MAI, Computer Architecture Laboratory at Carnegie Mellon, and BABAK FALSAFI, Parallel Systems Architecture Laboratory, École Polytechnique Fédérale de Lausanne

Functional full-system simulators are powerful and versatile research tools for accelerating architectural exploration and advanced software development. Their main shortcoming is limited throughput when simulating large multiprocessor systems with hundreds or thousands of processors or when instrumentation is introduced. We propose the PROTOFLEX simulation architecture, which uses FPGAs to accelerate full-system multiprocessor simulation and to facilitate high-performance instrumentation. Prior FPGA approaches that prototype a complete system in hardware are either too complex when scaling to large-scale configurations or require significant effort to provide full-system support. In contrast, PROTOFLEX virtualizes the execution of many logical processors onto a consolidated number of multiple-context execution engines on the FPGA. Through virtualization, the number of engines can be judiciously scaled, as needed, to deliver on necessary simulation performance at a large savings in complexity. Further, to achieve low-complexity full-system support, a hybrid simulation technique called transplanting allows implementing in the FPGA only the frequently encountered behaviors, while a software simulator preserves the abstraction of a complete system. We have created a first instance of the PROTOFLEX simulation architecture, which is an FPGA-based, full-system functional simulator for a 16-way UltraSPARC III symmetric multiprocessor server, hosted on a single Xilinx Virtex-II XC2VP70 FPGA. On average, the simulator achieves a 38x speedup (and as high as 49×) over comparable software simulation across a suite of applications, including OLTP on a commercial database server. We also demonstrate the advantages of minimal-overhead FPGA-accelerated instrumentation through a CMP cache simulation technique that runs orders-of-magnitude faster than software.

Funding for this work was provided in part by grants from the C2S2 Marco Center, NSF CCF-0811702, NSF CNS-0509356, and SUN. This work was also supported by donations from Xilinx and Bluespec. Authors' addresses: E. S. Chung, M. K. Papamichael, E. Nurvitadhi, J. C. Hoe, and K. Mai, Computer Architecture Laboratory at Carnegie Mellon, 5000 Forbes Ave., Pittsburgh, PA 15213; email: [email protected]; B. Falsafi, Parallel Systems Architecture Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.

Categories and Subject Descriptors: C.4 [Computer Systems Organization]: Performance of Systems—Modeling techniques
General Terms: Design, Experimentation, Performance
Additional Key Words and Phrases: FPGA, simulator, emulator, prototype, multicore, multiprocessor
ACM Reference Format: Chung, E. S., Papamichael, M. K., Nurvitadhi, E., Hoe, J. C., Mai, K., and Falsafi, B. 2009. PROTOFLEX: Towards scalable, full-system multiprocessor simulations using FPGAs. ACM Trans. Reconfig. Techn. Syst. 2, 2, Article 15 (June 2009), 32 pages. DOI = 10.1145/1534916.1534925.

1. INTRODUCTION
The rapid adoption of parallel computing in the form of multicore chip multiprocessors has re-energized computer architecture research. The research focus has now shifted toward architectural mechanisms for enhancing programmability and scalability in future, highly concurrent computing platforms. Architectural research of this kind will require close collaborations between hardware and software researchers. To codevelop new systems successfully, two conflicting dependencies must be resolved. First, software researchers cannot afford to wait until the hardware development cycle is complete. Second, developing new hardware requires feedback from software research, which is difficult to attain before a design can be finalized. Fast and flexible functional full-system simulators will play a vital role in helping to resolve these dependencies.
In the past, and particularly in the uniprocessor setting, full-system functional simulators (e.g., Rosenblum et al. [1995], Magnusson et al. [2002], Emer et al. [2002], Bohrer et al. [2004], Bellard [2005], Martin et al. [2005], Wenisch et al. [2006], Binkert et al. [2006], Yourst [2007], Over et al. [2007], AMD [2008]) have stayed within a 100× slowdown relative to real systems. Until recently, this slowdown remained relatively unchanged and was deemed acceptable by many hardware and software researchers. However, when simulating ever-larger multiprocessor systems, the slowdown increases at least linearly in proportion to the number of simulated processors. As a result, existing functional simulators are used to simulate only tens of processors at speeds practical for functional exploration.
Prior attempts to parallelize multiprocessor simulation have shown that the scalability of a distributed simulation is limited if the simulated components must interact at an interval below the communication granularity supported by the underlying host system [Reinhardt et al. 1993; Mukherjee et al. 2000; Legedza and Weihl 1996; Chidester and George 2002; Penry et al. 2006]. In the particular case of parallelizing functional simulators, a fundamental tension exists between global synchronization overhead and the instruction interleaving granularity of multiple simulated processors

during round-robin scheduling [Over et al. 2007; Lantz 2008]. While it is desirable to increase the interleaving granularity, which improves cache utilization and parallelism on the host system, increasing it beyond thousands of cycles introduces unrealistic interleavings of memory events [Over et al. 2007].
In addition to the scalability challenges, software-based simulators experience dramatic slowdowns when introducing any form of detailed instrumentation. While stand-alone functional simulators are useful for positioning and setup of workloads (e.g., booting the OS), the lack of timing fidelity makes them unsuitable for architectural studies. To address this, functional simulators are often augmented with more detail, which can range from first-order single-cycle cache simulators to full-blown microarchitectural timing models. Unfortunately, even simple models can introduce significant overheads (up to 10× or more slowdown as observed by Nussbaum et al. [2004]).

1.1 The PROTOFLEX Simulation Architecture
In this article, we present the PROTOFLEX simulation architecture, which aims to leverage fine-grained parallelism in the form of FPGA acceleration to overcome both the scalability and instrumentation performance bottlenecks. While FPGAs potentially overcome many of the aforementioned challenges, they also introduce problems of their own, such as extended development time and high cost of ownership. The goal of the PROTOFLEX simulation architecture is to simulate the functional execution of a multiprocessor system using FPGAs at a level of development effort and cost justifiable in a computer architecture research setting. This goal is distinct from prototyping the accurate structure or timing of a multiprocessor system.
Specifically, the requirements for the architecture are to: (1) functionally simulate large-scale multiprocessor systems with an acceptable slowdown (<100x), (2) model full-system fidelity to execute realistic workloads including operating systems, and (3) provide fast, low-overhead instrumentation for architectural and software research. In what follows, we briefly explain how PROTOFLEX achieves these goals.

Addressing Large-Scale System Complexity. Prior FPGA-based simulators, even functional simulators (e.g., Wee et al. [2007]), implemented their models in a way that maintains structural correspondence to the simulated target system and the underlying host.1 For example, 16 independent cores would be instantiated and integrated together to deliver the functionality of a 16-way multiprocessor. While evidently feasible with small-scale systems, building an FPGA-based simulator with adherence to structural correspondence becomes difficult to justify when studying configurations with hundreds or thousands of processors.
We observe that in the case of providing fast, functional simulation, implementing the target system as-is in an FPGA is unnecessary. In the

1We refer to “target” as the simulated computer system being studied while the term “host” refers to the underlying software or hardware simulation infrastructure that carries out the behavior of the target system.


PROTOFLEX simulation architecture, the logical execution and resources of many processors can be virtualized onto a consolidated set of host execution engines on the FPGA. Within each engine, multiple processor contexts are interleaved onto a multiple-context, high-throughput instruction pipeline. Through virtualization, the simulator architect is able to judiciously scale or consolidate the number of engines in the FPGA host, as needed, to deliver on necessary simulation performance.

Addressing Full-System Complexity. Apart from limitations in scale, prior FPGA efforts also typically forgo full-system fidelity and are limited to user-level custom-compiled applications (e.g., Öner et al. [1995]). Full-system fidelity enables a simulator to model real machines to a level of functional detail (e.g., devices) such that software, including the BIOS and operating systems, cannot make the distinction between the simulator and a real machine. This capability is supported in software-based simulators today (e.g., Rosenblum et al. [1995], Magnusson et al. [2002], Emer et al. [2002], Bohrer et al. [2004], Bellard [2005], Martin et al. [2005], Wenisch et al. [2006], Binkert et al. [2006], Yourst [2007], Over et al. [2007], AMD [2008]) and is especially important for evaluating applications on real operating systems. Implementing the same capabilities entirely in FPGAs would require detailed hardware design knowledge and effort typically beyond the resources of the simulator architect.
To address this challenge, we observe that the great majority of dynamically encountered behaviors in a full-system simulation make up a very small subset of total system behaviors. It is this small subset of behaviors that determines the overall simulation performance. To exploit this observation, the PROTOFLEX simulation architecture incorporates a hybrid simulation technique called transplanting, which allows a simulated entity such as a processor to be dynamically reassigned from FPGA hardware into software-based simulation. The simulator architect is now given the option to select only a subset of common-case behaviors for acceleration on the FPGA while still retaining the abstraction of a complete system.

Accelerated Instrumentation Using FPGAs. A major advantage of FPGA-accelerated simulation is the ability to dynamically instrument internal execution state with minimal overheads in performance. As opposed to software-based simulators, which must divide their time between the functional simulator and the instrumentation code, multiple FPGA-resident instrumentation components can be attached to an FPGA-based simulator while operating in parallel. The ability to carry out fast functional instrumentation is beneficial for a wide range of hardware research activities such as trace generation, checkpoint creation, workload characterization, and architectural studies.
In addition to assisting with hardware research, FPGAs are a promising vehicle for fast dynamic monitoring and debugging of software programs. Conventional software techniques that either rely on functional simulators or dynamic binary instrumentation [Srivastava and Eustace 1994; Patil et al. 2004; Nethercote and Seward 2007] are forced to make a delicate trade-off between

the slowdown of the instrumented program versus the level of instrumentation detail. In fact, the extreme overheads of runtime instrumentation mechanisms have led to hardware-based instrumentation proposals such as MemTracker for debugging [Venkataramani et al. 2007], Raksha for security [Dalton et al. 2007], and Log-based Architectures for general-purpose instruction-grain monitoring [Chen et al. 2008].
As a demonstration of high-performance FPGA instrumentation, we developed a functional FPGA-based simulator of a CMP cache hierarchy that runs orders-of-magnitude faster than a comparable software-based simulator. As we describe in Section 4, the fast cache model is used in tandem with the SMARTS [Wenisch et al. 2006] statistical sampling technique to significantly reduce the turnaround time of microarchitectural timing studies.

Contributions. We developed an instantiation of a functional full-system, FPGA-based simulator called BlueSPARC that incorporates the two key concepts of the proposed PROTOFLEX simulation architecture: (1) time-multiplexed interleaving and (2) hybrid simulation with transplanting. BlueSPARC is our first step toward simulating large-scale multiprocessor systems and currently models the architectural behavior of a 16-CPU UltraSPARC III SMP server. BlueSPARC is hosted on a single 16-way multithreaded instruction-interleaved pipeline running on the BEE2 FPGA platform [Chang et al. 2005], while Simics [Magnusson et al. 2002] provides the backing software substrate during hybrid simulation. In the future, we envision combining multiple BlueSPARC pipelines to simulate even larger systems. BlueSPARC currently can execute real applications on unmodified Solaris 8, including Online Transaction Processing (OLTP) on Oracle. Our performance evaluation shows that we achieve a 38x average speedup over comparable software simulation using Simics. In addition, our FPGA-Accelerated Cache Simulator (FACS) operates in tandem with BlueSPARC to provide fast, functional simulation of a CMP cache hierarchy with negligible overheads in performance (less than 4% on average).

Outline. Section 6 describes work related to FPGA-based simulation and prototyping. Section 2 presents the PROTOFLEX simulation architecture in detail. Section 3 presents an instantiation of the PROTOFLEX simulation architecture called the BlueSPARC simulator. Section 4 describes an example of fast hardware instrumentation through the design and implementation of an FPGA-accelerated CMP cache simulator. Section 5 presents an evaluation and a performance model for the BlueSPARC simulator. We conclude in Section 7.

2. PROTOFLEX
In this section, we expand on the PROTOFLEX simulation architecture for FPGA acceleration of functional full-system simulation. Note that PROTOFLEX is not a specific instance of simulation infrastructure but a set of practical approaches for developing FPGA-accelerated simulators.


Fig. 1. Partitioning a simulated target system across an FPGA and software simulator during hybrid simulation.

2.1 Hybrid Simulation
To reduce the complexity of full-system support, we developed a new technique called hybrid simulation that aims to reduce the implementation effort of FPGA-based simulators. In hybrid simulation, components in a simulated system are selectively partitioned across both FPGA and software hosts. This technique is motivated by the observation that the great majority of behaviors encountered dynamically in a simulation are contained in a small subset of total system behaviors. It is this small subset of behaviors that determines the overall simulation performance. Thus, to improve software simulation performance while minimizing the hardware development, one should apply FPGA acceleration only to components that exhibit the most frequently encountered behaviors.
Figure 1 offers a high-level example of the hybrid simulation of a multiprocessor system. In the figure, all components are either hosted on the FPGA or simulated by the reference simulator; specifically, the main memory and CPUs are hosted in an FPGA while the remaining components are hosted in a software-based simulator (e.g., disks and network interfaces). When a software-simulated DMA-capable I/O device accesses memory, it accesses a hardware memory module on the FPGA platform. In a simulation, both the FPGA-hosted and software-simulated components are advanced concurrently to model the progress of a complete system (e.g., simulating disk activity in parallel with processors running on the FPGA).

Transplanting. A component such as the CPU can be partitioned into a small core set of frequent behaviors and an extensive set of complicated or rare behaviors. Assigning the complete set of CPU behaviors statically to software simulation or to FPGA results in either the simulation being too slow or the FPGA development being too complicated. The conflicting goals can be reconciled by supporting "transplantable" components which can be reassigned to the FPGA or software simulation at runtime. In the CPU example, the FPGA would only implement the subset of the most frequently encountered instructions. When the partially implemented CPU encounters an unimplemented


behavior (e.g., a page table walk following a TLB miss), the FPGA-hosted CPU component is suspended. Immediately after, the CPU state is "transplanted" to its corresponding software-simulated component in the reference simulator (number 1 in Figure 1). The software-simulated CPU component is activated to carry out the unimplemented behavior and deactivated again afterwards. Finally, the CPU state is transplanted back to the suspended FPGA-hosted CPU component to resume acceleration on the FPGA (number 2 in Figure 1).

Fig. 2. Improving performance with hierarchical transplanting.
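The hand-off just described can be viewed as a fallback path in the simulation loop: the fast host executes everything it implements, and anything else is deferred to the reference simulator before execution resumes on the fast host. The following Python sketch is only an illustration of that control flow under simplified assumptions (a single CPU; fpga_step and reference_step are hypothetical callbacks standing in for the FPGA engine and the reference simulator), not the actual PROTOFLEX implementation.

    class UnimplementedBehavior(Exception):
        """Raised by the fast host when it reaches a behavior it does not model."""

    def simulate(cpu_state, fpga_step, reference_step, n_instructions):
        # Hybrid simulation loop: run on the fast (FPGA-like) host and transplant
        # to the reference simulator whenever an unimplemented behavior is hit.
        # fpga_step / reference_step are hypothetical callbacks that advance the
        # CPU state by one instruction and return the updated state.
        transplants = 0
        for _ in range(n_instructions):
            try:
                cpu_state = fpga_step(cpu_state)       # common case: stay on the fast host
            except UnimplementedBehavior:
                transplants += 1                       # rare case: hand off ("transplant")
                cpu_state = reference_step(cpu_state)  # reference simulator performs the
                                                       # behavior, then execution resumes
        return cpu_state, transplants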

Hierarchical transplanting. While transplanting is a straightforward way to achieve full behavior coverage of a complex component like the CPU, there is a significant performance overhead associated with the hand-off between FPGA and software simulation. This includes both the time to transfer data between the FPGA and the software hosts, and the time for the software simulator to carry out the unimplemented behavior on behalf of the FPGA. This exchange could take several milliseconds if the FPGA and the software simulation hosts are separated far apart (e.g., over Ethernet). The effective hybrid simulation throughput for a 1-CPU pipeline with transplanting is

$$\mathrm{MIPS_{effective}} \approx \frac{\mathrm{MIPS_{raw}}}{1 + \mathrm{MIPS_{raw}} \cdot \mathrm{rate_{tplant\text{-}per\text{-}million}} \cdot L_{\mathrm{tplant}}},$$
where $\mathrm{MIPS_{raw}}$ is the FPGA-accelerated simulation throughput discounting transplant cost, $\mathrm{rate_{tplant\text{-}per\text{-}million}}$ is the number of transplants per million instructions, and $L_{\mathrm{tplant}}$ is the latency cost of a transplant in seconds. Figure 2 (left) illustrates an example calculation assuming a 100 MHz FPGA-hosted CPU with CPI=1. Even if the FPGA-hosted CPU transplants just once every one million instructions, 50% of the 100 MIPS raw throughput potential would be lost due to the high transplant penalty (10 ms = 1 million cycles at 100 MHz). Because $\mathrm{MIPS_{raw}}$ also appears in the denominator, attempting to improve performance by increasing $\mathrm{MIPS_{raw}}$ would yield diminishing returns. Attempts to raise

Fig. 3. Simulation slowdown versus system size (assuming a single real processor performs at 1 GIPS).

performance by reducing $\mathrm{rate_{tplant}}$ face a different form of diminishing returns, because increasingly many instructions would have to be implemented in the FPGA-hosted CPU to reduce the transplant rate further.
A better solution is to introduce a hierarchy of CPU transplant hosts with staggered, increasing instruction coverage and performance costs. For example, an embedded processor on the FPGA (e.g., the embedded PPC405 in the Xilinx Virtex architecture, or even a synthesized soft core) can support a simple software-based ISA simulation kernel. The simple software-based simulation kernel would still be slow relative to the FPGA-hosted CPU, but would be much faster than a full transplant to the software simulator. The embedded simulation kernel should also be able to reach, with relative ease, a higher instruction coverage in software than in the FPGA-hosted CPU. This reduces the rate of transplants to the reference software simulator.
Returning to the example in Figure 2, suppose we introduce an embedded software-based simulation kernel as an intermediate CPU transplant host with CPI=1000. Further suppose that the intermediate simulation kernel requires a transplant to the full reference simulator only once every 10 million instructions. Then, 90% of the 100 MIPS raw throughput potential would be preserved.
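To make the arithmetic concrete, the sketch below evaluates the effective-throughput expression for both cases, reproducing the 50% and roughly 90% figures above. It assumes, as a simplification, that transplants absorbed by the embedded kernel cost a negligible amount of time, so hierarchical transplanting is modeled simply as a tenfold reduction in the rate of expensive transplants to the reference simulator; the numbers are the illustrative values from the text, not measurements.

    def effective_mips(mips_raw, tplants_per_million, latency_s):
        # MIPS_effective ~= MIPS_raw / (1 + MIPS_raw * rate_tplant * L_tplant)
        return mips_raw / (1.0 + mips_raw * tplants_per_million * latency_s)

    MIPS_RAW = 100.0   # 100 MHz FPGA-hosted CPU with CPI = 1
    L_TPLANT = 10e-3   # assumed 10 ms per transplant to the reference simulator

    # Single-level transplanting: 1 transplant per million instructions.
    print(effective_mips(MIPS_RAW, 1.0, L_TPLANT))   # -> 50.0 MIPS (50% of raw)

    # Hierarchical transplanting: the embedded kernel absorbs most transplants
    # (its per-transplant cost is treated as negligible here), so the expensive
    # reference-simulator transplants occur only once per 10 million instructions.
    print(effective_mips(MIPS_RAW, 0.1, L_TPLANT))   # -> ~90.9 MIPS (~90% of raw)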

2.2 Interleaved Multiprocessor Simulation
In our experience, interactive software research cannot tolerate more than 100x slowdown. Batched performance simulation studies or instrumented functional simulation activities can tolerate as much as 1000x to 10,000x slowdown. Assuming three simulators with aggregate simulation throughputs of 10, 100, and 1000 MIPS, respectively, Figure 3 plots the user-perceived simulation slowdown (y-axis) of the given simulator relative to a real system. As


the system size increases, the aggregate simulation throughput is divided by the number of processors being simulated (x-axis). We assume a "real" multiprocessor system consists of 1000-MIPS processors. As seen in Figure 3, a 100-MIPS simulator represents only a 10x perceived slowdown in a uniprocessor simulation, but is only practical for studying up to tens of processors before the slowdown is no longer acceptable. Figure 3 also suggests that developing a simulator that is faster than 1000 MIPS would exceed the performance needed for effective hardware and software research of a 100-way or even 1000-way multiprocessor system.

Fig. 4. Large-scale multiprocessor simulation using a small number of interleaved engines.
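The slowdown curves in Figure 3 follow from simple arithmetic: the simulator's aggregate throughput is divided evenly among the simulated processors and compared against an assumed 1000-MIPS real processor. A minimal Python sketch of that model (with an arbitrary choice of system sizes to tabulate) is shown below.

    def perceived_slowdown(aggregate_mips, n_procs, real_mips_per_proc=1000.0):
        # Aggregate throughput is shared evenly among the simulated processors.
        per_proc_sim_mips = aggregate_mips / n_procs
        return real_mips_per_proc / per_proc_sim_mips

    for engine_mips in (10, 100, 1000):
        for n in (1, 16, 64, 256, 1024):
            print(f"{engine_mips:4d}-MIPS simulator, {n:4d}-way target: "
                  f"{perceived_slowdown(engine_mips, n):9.0f}x slowdown")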

A performance-driven approach. The straightforward approach to constructing a large N-way multiprocessor FPGA-based simulator is to replicate N cores and integrate them together in a large-scale interconnection substrate. While this meets the requirement of simulating a large-scale system, developing an FPGA-based simulator with up to hundreds or even thousands of processors would be prohibitive with respect to development effort and required resources. The final aggregate throughput would also be 100 to 1000x faster than needed. Instead, a performance-driven approach would trade the excess simulation performance for a more tractable hardware development effort and cost.
The PROTOFLEX simulation architecture advocates virtualizing the simulation of multiple processor contexts onto a single fast engine through time-multiplexing. Virtualization decouples the scale of the simulated system from the required scale of the FPGA platform and the hardware development effort. The scale of the FPGA platform is only a function of the desired throughput (i.e., achieved by scaling up the number of engines). Figure 4 illustrates the high-level block diagram of a large-scale multiprocessor simulator using a small number of interleaved engines. In the figure, multiple simulated processors in a large-scale target system are shown mapped to share a small number of engines.

Selecting the engine design. When selecting the architecture for a multiple-context engine, there are two desirable requirements: (1) to exploit the concurrency between multiple threads efficiently, and (2) to maximize the accuracy of the functional simulation by providing fine-grained interleaving of memory events. These requirements led us towards selecting an instruction-interleaved processor pipeline, which is well suited for executing multiple

instruction streams efficiently. An instruction-interleaved pipeline switches processor contexts on every fetch cycle, allowing multiple processor contexts to share a single pipeline. This design enjoys several implementation advantages, as exemplified in well-known multithreaded architectures such as the CDC Cyber [Thornton 1995] and the HEP barrel processor [Smith 1985]. First and foremost, the logic associated with stalling and internal forwarding can be eliminated entirely if at most one instruction from each context is allowed in the pipeline at any given time. When doing so, longer pipelines can be used to mitigate critical timing paths, which enables a better overall clock frequency. Second, an interleaved pipeline is highly tolerant of long-latency events such as cache misses and transplants, since additional threads can execute in the shadow of a long-latency event. Finally, in an implementation with caches, multiple threads running on a single engine are automatically kept coherent in shared memory as they interleave onto a single L1 cache; in larger implementations with multiple engines, this helps to consolidate the number of nodes in a cache-coherent design.
Apart from simplifying the design in FPGA, a properly scheduled instruction-interleaved pipeline is able to provide single-cycle round-robin interleaving of memory events from multiple simulated processors without incurring a performance overhead. Software-based multiprocessor functional simulators typically face an inherent tension between the instruction interleaving granularity (called the time-slice quantum) and the performance of the simulator. In the case of single-threaded simulators, increasing the time-slice quantum improves performance by allowing the simulation to benefit from warm cache state in the host's memory hierarchy before switching to another context [Witchel and Rosenblum 1996]. In the case of parallel simulators, the time-slice quantum also dictates the interval at which multiple simulated threads are required to synchronize [Reinhardt et al. 1993; Mukherjee et al. 2000; Over et al. 2007; Lantz 2008].
Prior work has shown that parallelized simulators are scalable only when the quantum is set to sufficiently large sizes [Reinhardt et al. 1993; Mukherjee et al. 2000; Over et al. 2007; Lantz 2008]. While this still produces a functionally correct simulation and can be suitable for workload positioning, various studies [Witchel and Rosenblum 1996; Over et al. 2007] have reported unrealistic interleavings of memory events when the time-slice quantum exceeds hundreds or thousands of instructions. For example, a large quantum may result in a simulated processor acquiring a lock for an unrealistic amount of time. Unfortunately, while single-cycle interleaving is desirable, reducing the quantum limits the scalability of software-based parallelized functional simulators as well as the performance of single-threaded simulators. In the case of FPGAs, increasing the amount of instruction interleaving actually improves the efficiency of the hardware as well as overall throughput.

Scaling to multiple engines. The integration of multiple interleaved engines will be necessary to achieve scalable simulation throughput. To support shared-memory architectures, implementing cache coherency and interconnect mechanisms across multiple engines and possibly across multiple FPGAs is


necessary for modeling of large-scale configurations. From the estimations shown earlier, we anticipate that tens of interleaved engines running at 100 MIPS would be sufficient to achieve a minimal slowdown practical for 100-way or 1000-way architecture research. Depending on available FPGA host and topology options, bus-based coherency mechanisms may be sufficient within a single FPGA. For simulation configurations requiring a large number of FPGAs, a distributed shared-memory or hierarchical bus-based architecture would be more appropriate.
In a similar vein to parallelized software-based simulators, care must be taken when scaling to multiple engines to avoid having distributed engines execute unrealistic memory interleavings. As opposed to software, FPGAs allow for extremely fine-grained communication mechanisms (e.g., sending or processing a message using a state machine). This could be used to facilitate low-overhead synchronization between multiple engines. Furthermore, while multiple engines may gradually drift out of sync (for example, if different FPGAs derive their clocks from independent clock crystals), resynchronization would be relatively inexpensive and infrequent since FPGAs do not suffer the same types of workload imbalance issues that parallel software-based simulators have.

Table I. BlueSPARC Interleaved Engine Characteristics
  Processing nodes:          16 64-bit UltraSPARC III contexts
  Pipeline:                  14-stage interleaved pipeline
  Caches:                    64KB I-cache, 64KB D-cache, 64B blocks, direct-mapped;
                             writeback, non-blocking loads/stores, allocate-on-write;
                             16 outstanding misses, 4-entry store buffer
  Clock frequency:           90 MHz
  Main memory:               4GB total memory
  Resources used on
  Xilinx V2P70:              33,508 LUTs (50%), 222 BRAMs (67%)
  Resources used with
  debugging+monitors:        42,206 LUTs (65%), 238 BRAMs (72%)
  EDA tools:                 Bluespec System Verilog, Xilinx EDK 9.2i, XST 9.2i
  Statistics:                25K lines of Bluespec, 511 rules, 89 module types

3. THE BLUESPARC SIMULATOR

3.1 Overview
In this section, we describe BlueSPARC, our first instantiation of the PROTOFLEX simulation architecture. BlueSPARC is one of our first steps towards simulating large-scale multiprocessor systems. BlueSPARC models a 16-way symmetric multiprocessor (SMP) UltraSPARC III server (Sun Fire 3800). The BlueSPARC simulator combines Virtutech Simics [Magnusson et al. 2002] on a standard PC and a single 16-context interleaved engine on an FPGA for acceleration. Table I summarizes the high-level characteristics of the BlueSPARC interleaved engine (which we explain in more detail in the coming pages). The FPGA portion is hosted on a Berkeley Emulation Engine 2


(BEE2) FPGA platform [Chang et al. 2005]. The BlueSPARC simulator embodies the two key concepts of the PROTOFLEX simulation architecture: (1) hybrid simulation to achieve both performance and full-system fidelity with reasonable effort and (2) decoupling the logical system size from the physical design complexity. Future work will develop the cache coherency support needed for combining multiple engines to support simulation of larger multiprocessor systems.
The BlueSPARC simulator can serve the same functionality currently provided by simulators such as Simics. In fact, the BlueSPARC simulator uses the same simulated console as Simics, so a user can even "interact" with the simulated server. Furthermore, BlueSPARC is fully capable of generating and loading Simics-compatible checkpoints of the entire simulated system state.

Fig. 5. Allocating components for hybrid simulation in the BlueSPARC simulator.

3.2 High-Level Organization
Figure 5 shows a high-level block diagram of how we map the functionality of the target 16-way SMP server onto software simulation and hardware hosts. First, the main memory system is hosted directly by DDR2 DRAM on the BEE2. Next, all 16 target SPARC processors are mapped onto a single interleaved engine contained on one Xilinx Virtex-II XC2VP70 FPGA. The SPARC processors are transplant-capable components. In addition to the interleaved engine, the PPC405 embedded processors serve as an intermediate transplant host for the target SPARC processors when they encounter unimplemented behaviors in the interleaved engine. The reference Simics simulator running on a PC provides the third hosting option for the target SPARC processors. Last but not least, all remaining components of the simulated system are hosted by the reference Simics simulator.

3.3 Implementation on BEE2
Figure 6 shows a structural view of our entire system mapped onto the BEE2 FPGA platform. The BEE2 board holds five Xilinx V2P70 FPGAs, which are connected to each other using low-latency, high-bandwidth 200 MHz links. The FPGA portion of the BlueSPARC simulator is hosted on two FPGAs. The center FPGA is used to host the BlueSPARC interleaved engine. This pipeline is connected over a high-speed interchip link to another FPGA, which hosts four DDR2 controllers for up to 4GBytes of main memory. In the center FPGA, the PPC405 interacts with the BlueSPARC engine over the Processor Local Bus (PLB) for servicing both transplants and I/O transactions. In addition to serving as an intermediate transplant host for the target SPARC processors,


the PPC405 embedded processor also manages external communication with the simulation PC over Ethernet. The PC and the BEE2 are connected over a crossover Ethernet cable.

Fig. 6. High-level organization of the BlueSPARC simulator.

3.4 Hybrid Simulation and Transplanting
The BlueSPARC simulator fully incorporates hybrid simulation and transplanting to reduce implementation complexity. The BlueSPARC engine only supports the selected subset of the UltraSPARC III ISA required to achieve acceptable performance. We select instructions for inclusion in the FPGA based on a quantitative profile of our commercial and integer benchmarks. A general rule of thumb in the selection was to implement in the FPGA the set of behaviors needed to limit the rate of transplants to no more than 5 times per 10,000 instructions in any of the software workloads. This rule of thumb was determined by first-order modeling of an interleaved pipeline augmented with transplanting.
In the current BlueSPARC simulator, the in-FPGA software-based simulation kernel can capture the complete behavior of the UltraSPARC III processor. The only time execution is transplanted to Virtutech Simics on the remote PC is when the CPUs interact with I/O devices. For clarity, we refer to a transplant to the in-FPGA software-based simulation kernel as a microtransplant, and we refer to a transplant to Simics as an I/O-transplant. Table II shows a detailed breakdown of how we partitioned the required behaviors of an UltraSPARC III processor for direct simulation in FPGA, microtransplant, and I/O-transplant.
As shown in Table II, the BlueSPARC engine is designed to support all of the user-level instructions as well as a subset of privileged and implementation-dependent behaviors. We found that the frequent behaviors that must be included tend also to be the simpler ones, leaving the most complicated instructions to the embedded software-based simulation kernel. It is worth mentioning that some complex behaviors do occur sufficiently frequently to


warrant implementation in the simulation engine. Despite being a RISC ISA, the UltraSPARC III superset of SPARC V9 has a large number of complex implementation-dependent and legacy behaviors (e.g., SPARC ASIs).2 Despite these challenges, we found that our implementation was still much simpler than a full-blown prototype. Without microtransplants or I/O-transplants, implementing all of the behaviors in Table II in hardware would have required a full design team (instead of just one graduate student in one year).

Table II. Selected Partitioning in Hybrid Simulation
  BlueSPARC (FPGA):
    add/sub/shift/logical/multiply/divide instructions
    register windows and associated traps
    38/103 SPARC ASI instructions
    interprocessor interrupt cross-calls
    device and software interrupts
    I- and D-MMU (including software TLB)
    loads, stores, atomics (plus inverse-endian modes)
    VIS block memory instructions
  Micro-transplant (on-chip simulation):
    65/103 SPARC ASI instructions
    VIS I/II multimedia instructions
    floating-point add/sub/mul/div and associated traps
    floating-point to/from integer conversion
    alignment instructions
    fixed-point arithmetic instructions
    TLB/cache diagnostic instructions
    TLB de-mapping operations
  Transplant (off-chip simulation):
    PCI bus, ISP2200 Fibre Channel, i21152 PCI Bridge
    IRQ bus, text console, SBBC PCI device
    Serengeti IO PROM, Cheerio-hme NIC
    Fibre Channel SCSI disk, SCSI cdrom, SCSI bus
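The rule of thumb above amounts to covering enough of the dynamic instruction mix that the residual, transplanted fraction stays below 5 per 10,000 instructions. The sketch below illustrates that kind of selection with a greedy loop over a made-up opcode profile; the frequencies are invented for illustration and do not come from the paper's benchmark data.

    # Illustrative only: pick the most frequent opcode classes for FPGA implementation
    # until the residual (transplanted) fraction drops below 5 per 10,000 instructions.
    profile = {"add": 0.21, "load": 0.19, "branch": 0.17, "store": 0.11, "logical": 0.09,
               "shift": 0.08, "sethi": 0.07, "save/restore": 0.05, "asi_ops": 0.02,
               "fp_ops": 0.0098, "vis": 0.0002}   # made-up dynamic frequencies

    TARGET = 5 / 10_000                  # maximum tolerated transplant rate
    selected, residual = [], 1.0
    for op, freq in sorted(profile.items(), key=lambda kv: -kv[1]):
        if residual <= TARGET:
            break
        selected.append(op)              # implement this opcode class in the FPGA
        residual -= freq                 # everything not selected gets transplanted

    print("implement in FPGA:", selected)
    print(f"expected transplant rate: {residual * 1e4:.1f} per 10,000 instructions")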

3.5 Interleaved Pipeline
The BlueSPARC engine is a 14-stage, instruction-interleaved pipeline that supports the multithreaded execution of up to 16 UltraSPARC III processor contexts (see Figure 7). The maximum retirement rate of our pipeline is one instruction per cycle, which in combination with the clock frequency dictates the peak simulation throughput. The primary goals in developing the BlueSPARC engine are: (1) to ensure correctness, (2) to maximize maintainability for future design exploration, and (3) to minimize effort. In many cases we forgo complex performance optimizations in favor of a simpler, more maintainable design.
Pipeline operation. In Figure 7, instruction processing begins at the Context Select Unit (left), which is a simple fair scheduler that issues in round-robin order from different processor contexts. The scheduler is responsible for ensuring that all processor contexts make equal progress on average. This

2ASIs, or "Address Space Identifiers," are used to modify the behavior of SPARC memory instructions. For example, ASIs can alter the endianness of a memory access or convert a normal memory instruction into a block transfer. ASI instructions are also used to read and write registers in the Memory Management Unit for software TLB operations.


requirement is important in order to avoid unrealistic system behaviors (e.g., a processor holding a lock indefinitely). After being selected, each issued instruction is tagged with a unique context identifier used to index the various structures and registers replicated for each processor context throughout the pipeline (see shaded components). Once the instruction is selected, a subset of its architectural registers (e.g., pstate, current window pointer) are scanned from a state register file and passed into the pipeline. After fetch, the decoding stage determines whether an instruction requires transplanting. Supported instructions in the common case proceed through the pipeline stages until reaching the writeback stage. At writeback, the processor context re-enters the Context Select Unit for the next round of scheduling. A processor context can only have one instruction in the pipeline at a time; therefore, register file read-after-write hazards cannot arise.

Fig. 7. The BlueSPARC engine: a 14-stage instruction-interleaved pipeline.
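A toy model of the Context Select Unit's scheduling policy is sketched below: contexts issue in round-robin order, and a context becomes eligible again only after its previous instruction retires, which is what makes intra-context hazards impossible. The parameters follow the BlueSPARC numbers (16 contexts, 14 stages), but the model is purely illustrative and ignores cache misses, transplants, and retries.

    from collections import deque

    class ContextSelectUnit:
        # Toy model of round-robin context selection with at most one in-flight
        # instruction per context; not the actual BlueSPARC hardware.
        def __init__(self, n_contexts=16, pipeline_depth=14):
            self.ready = deque(range(n_contexts))        # contexts eligible to issue
            self.pipeline = deque([None] * pipeline_depth)

        def cycle(self):
            retired = self.pipeline.pop()                # instruction leaving writeback
            if retired is not None:
                self.ready.append(retired)               # context re-enters scheduling
            issued = self.ready.popleft() if self.ready else None
            self.pipeline.appendleft(issued)             # issue one context (or a bubble)
            return issued

    csu = ContextSelectUnit()
    trace = [csu.cycle() for _ in range(20)]
    # With 16 contexts and 14 stages there are always ready contexts, so the
    # pipeline stays full; bubbles would appear only if contexts were held back
    # (e.g., while being transplanted).
    print(trace)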

Microtransplants and I/O-transplants. In the event a transplant operation is identified (either through the SPARC decoder or through I/O accesses in the MMU), the unimplemented instruction word and the processor context ID are queued into the Transplant unit (see Figure 7) for service by the embedded PPC405 processor running a small UltraSPARC III ISA simulation kernel. With 16 processor contexts and 14 pipeline stages, up to 2 transplanted contexts can be serviced while the remaining contexts keep the pipeline fully utilized. Only one outstanding microtransplant and one outstanding I/O-transplant are currently supported. Any additional I/O- or microtransplant requests from other contexts would be queued in the Transplant unit.
Figure 8 illustrates step-by-step how the interleaved pipeline interacts with the embedded PPC405 during a microtransplant. After an unimplemented instruction is detected and queued in the Transplant unit, the PPC405 is interrupted (1) and begins to read a minimum amount of state from the requesting context to decode the unimplemented instruction (2). After decoding, the PPC405 continues to read from the pipeline the state required to complete the unimplemented instruction. Reading and writing processor state (e.g.,


register file and cache reads) are carried out by issuing synthetic "helper" instructions into the engine. Once the input state is acquired, the ISA simulation kernel computes the state updates (3) and writes the changed state values back to the engine for the requesting context (4), again with the help of helper instructions, before returning the blocked processor context back to scheduling.
Figure 8 also shows how I/O-transplants are handled. During simulation, Simics continuously polls the embedded PPC405 to check for I/O activity queued in the Transplant unit (1)/(2). If an I/O transaction (e.g., a memory-mapped load or store instruction) is pending, the PPC405 acquires the necessary state information from the engine and forwards the state to Simics in response to the polling (3)/(4). For I/O-transplants, Simics simulates the memory-mapped I/O bus transaction to the target simulated devices (5) and returns the acknowledgement (6)/(7).

Fig. 8. Detailed operation of microtransplants and I/O-transplants.

Other sources of serialization and latency. In addition to transplants, a variety of long-latency events can deter the progress of an instruction in the pipeline. The most frequently encountered cases are instruction and data cache misses (due to the high degree of cache sharing among 16 processor contexts). In our memory system, the caches are nonblocking, and up to sixteen misses to independent cache sets are allowed at any given time. Other contexts may continue to utilize the pipeline while memory accesses are pending. To avoid conflicts between multiple contexts for cache sets in a transient state, each cache set is augmented with a pending bit in the tag array. Any subsequent context that attempts to access a pending cache set is removed from the pipeline and forced to retry from the scheduler at a later time. To hide store misses, a 4-entry store buffer allows processor contexts to retire past a store miss; we currently do not implement store-to-load forwarding, and thus any conflicting loads will stall until pending stores are drained into memory.
A number of special instructions cannot finish in a single pass through the pipeline. For example, atomics (e.g., ldstub) require a load followed by a store.


To prevent access to a locked cache line during this operation, any processor context that attempts to load or store to the same cache line is cancelled and re-enters the context select unit for another round of scheduling. Other examples of multipass instructions are Quad LDDs (atomic loads of two doublewords) and block load/store instructions (64B loads into 8 consecutive registers). We implement a helper engine (see Figure 7) that issues helper instructions to facilitate these rare multipass instructions. As noted earlier, the helper engine also facilitates state access during microtransplanting. Lastly, cache flushes and pipeline "freezes" can also introduce serialization into the pipeline. Because the i- and d-caches are incoherent, we conservatively flush the caches on any possibility of staleness in the i-cache (i.e., when FLUSH instructions are encountered). The caches are also flushed when DMA operations are carried out. Pipeline "freezes" occur when the entire pipeline is stalled for a temporarily owned resource (e.g., during TLB replacement).
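The pending-bit mechanism can be summarized as a small per-set state machine: a reference to a set with an outstanding miss is squashed and its context retries from the scheduler later. The sketch below captures just that idea for a single cache set; it is a simplification (no tags, no store buffer) rather than a description of the actual BlueSPARC cache controller.

    class CacheSetState:
        # Sketch of the per-set pending bit used to serialize accesses to a cache
        # set with an outstanding miss. "retry" stands for cancelling the
        # instruction and sending its context back to the Context Select Unit.
        def __init__(self):
            self.pending = False      # set has an outstanding fill in flight

        def access(self, hit):
            if self.pending:
                return "retry"        # transient state: squash and reschedule the context
            if hit:
                return "hit"
            self.pending = True       # start a non-blocking miss; other contexts keep running
            return "miss-issued"

        def fill_complete(self):
            self.pending = False      # memory returned the line; the set is usable again

    s = CacheSetState()
    print(s.access(hit=False))   # miss-issued
    print(s.access(hit=True))    # retry (set still pending)
    s.fill_complete()
    print(s.access(hit=True))    # hit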

Nondeterminism. In the current instance, BlueSPARC is a nondeterministic simulator (i.e., repeated multithreaded simulations do not necessarily yield the same interleaving order of threads). In particular, the interhost communication mechanism for hybrid simulation uses standard Ethernet between the BEE2 and a PC workstation, which is unable to provide deterministic delivery of messages. This has wide-ranging nondeterministic effects, such as the unpredictable delivery of interrupts and DMA. Future work will examine mechanisms that can provide deterministic delivery of interhost communication.

Clock frequency and area. As shown in Table I, BlueSPARC runs at a peak clock frequency of 90 MHz on a Virtex-II Pro 70. In our current design, the major critical timing paths arise from long FPGA interconnect delays when accessing the large architecturally defined I- and D-TLB structures. At present, over 100 BRAMs alone are involved during TLB lookups, which makes routing at higher clock frequencies a major challenge. In the future, we will examine techniques that reduce the on-chip storage requirements by simply caching the TLB entries of all processor contexts within a small, shared host TLB cache. We also intend to migrate towards newer FPGA families such as the Virtex-5, which offer higher logic density and lower interconnect delay using 6-input LUTs.
At present, BlueSPARC consumes a relatively large number of LUTs compared to most other available FPGA soft cores. As mentioned previously, our initial emphasis in design was correctness, and in some cases, we did not exploit area-efficient FPGA design methods. With additional effort in the future, we expect to reduce our logic consumption by a factor of two or more.

3.6 Validation
Our validation methodology involves a combination of mixed hardware/software cosimulation and testing directly on the BEE2 system. We rely on Simics to provide us with a complete and correct reference model for the UltraSPARC III server, including the prevalidated device models. In general, all of our testing (either in RTL simulation or running directly on FPGAs) must pass

validation by matching up exactly with the architectural state that Simics generates. Our validation test suite comprises a subset of ported diagnostics from the OpenSPARC framework [Vahia and Hartke 2007], the SPARC International V9 compliance tests, autogenerated stress-test programs, hand-written test cases, and real applications/benchmarks.
We relied on a variety of strategies to validate our design in the presence of nondeterminism in the interleaved pipeline. In order to match architectural state with Simics, we implement a hardware "event recorder" that records the global commit order of instructions within the pipeline for each processor context. The event recorder also captures the timing of asynchronous events during runtime (e.g., the exact point of delivery of an interrupt). This ordering is scanned from the FPGA on the BEE2 and replayed in Simics to attain identical architectural and memory state for comparison. Another validation strategy we adopted was to implement over 200 synthesizable assertion instances in our design. These assertions were monitored using the Chipscope built-in logic analyzer and were responsible for detecting over 50% of the design bugs.
We have successfully passed all of our test suites and have been able to validate billions of instructions in all of our workloads using the event recorder. We have also been able to interact with the hardware using a simulated console in Solaris and have been able to run console applications correctly without assertions firing or software crashes occurring. In all, we have high confidence that BlueSPARC is validated to a level suitable for active use by end-users. In the following section, we extend the BlueSPARC implementation to support low-overhead instrumentation to accelerate microarchitectural exploration.

4. FPGA-ACCELERATED INSTRUMENTATION
A key advantage of FPGA-accelerated simulation over conventional software simulation is the support for very low overhead instrumentation, which can range from a simple set of passive counters that monitor interesting processor- or instruction-level events up to simulating a complete multiprocessor cache model. The BlueSPARC design presented previously is particularly well suited for trace-based instrumentation, where the processor pipeline generates a stream of trace information that can be processed by an independent instrumentation component. While software simulators experience dramatic slowdown during the trace generation process, an FPGA-accelerated simulator such as BlueSPARC is able to deliver traces to another hardware instrumentation component, possibly residing on a separate FPGA, with minimal overhead. It is worth noting that if the FPGA instrumentation component is unable to handle traces at full rate, the functional simulator must be appropriately backpressured and stalled to avoid losing trace information; this necessitates careful design of the instrumentation component to avoid a performance bottleneck.
In the following section, we present an example application of FPGA-accelerated instrumentation that uses FPGA-based cache simulation to reduce the turnaround time of the simulation sampling methodology.


4.1 Accelerating Simulation Sampling Using FPGAs
While fast functional simulation is an essential ingredient for accelerating architectural studies, improving microarchitectural simulation performance is also of major importance. As opposed to ongoing work such as Chiou et al. [2007] and Pellauer et al. [2008] that implements cycle-accurate timing models directly in FPGAs, we present an alternative approach that combines software simulation sampling methodology with FPGA-accelerated instrumentation, which gives accurate performance results for homogeneous multithreaded workloads. For brevity, we refer the reader to Wenisch et al. [2006] for a more detailed description of the sampling methodology and its limitations.
The sampling methodology uses statistical sampling theory to estimate the performance of applications running in software-based microarchitecture simulators. In a sampling approach, an instrumented functional simulator executes a benchmark from beginning to end while generating checkpoints of machine state at selected intervals. From each of these checkpoints, independent cycle-accurate simulations are launched in parallel and execute for a short duration. By combining together a sufficiently large number of samples, it is possible to attain an accurate mean value estimation of a chosen metric (e.g., IPC).
The main disadvantage of the sampling approach is the long turnaround time associated with instrumented functional simulation. During checkpoint creation, a functional simulator must also be instrumented to carry out warming of long-term microarchitectural structures such as caches and branch predictors. When instrumented for warming activity, full-system functional simulators typically run at around 1–2 MIPS [Wenisch and Wunderlich 2005]. For benchmarks consisting of hundreds of billions or trillions of instructions, this process can take tens of hours or up to days.
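For reference, the statistical combination step is ordinary sample-mean estimation; the sketch below computes a mean IPC and a normal-approximation confidence interval from made-up per-sample values standing in for the short cycle-accurate simulations (see Wenisch et al. [2006] for the actual SMARTS formulation, which this sketch does not reproduce).

    import math, random

    def estimate_ipc(samples, z=1.96):
        # Mean and 95% confidence half-width under a normal approximation.
        n = len(samples)
        mean = sum(samples) / n
        var = sum((x - mean) ** 2 for x in samples) / (n - 1)
        return mean, z * math.sqrt(var / n)

    # Made-up sample values standing in for short cycle-accurate simulations
    # launched from functionally generated checkpoints.
    random.seed(0)
    samples = [random.gauss(1.2, 0.15) for _ in range(1000)]
    mean, hw = estimate_ipc(samples)
    print(f"IPC ~= {mean:.3f} +/- {hw:.3f} (95% confidence)")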

Supporting fast functional warming with FACS. To demonstrate FPGA-accelerated functional warming, we have developed an FPGA-Accelerated Cache Simulator (FACS), which functionally models a two-level Piranha-like chip multiprocessor cache with private L1 I&D caches and a shared non-inclusive L2 [Barroso et al. 2000]. This particular cache model is identical to the software-based one used for functional warming simulation in the SimFlex toolkit [Wenisch and Wunderlich 2005], which uses the aforementioned sampling-based methodology. FACS is an alternative to software-based CMP cache warming and is able to execute at nearly one or two orders of magnitude faster than its software counterpart (tens of MIPS as opposed to approximately one MIPS).
A key advantage of functional simulation is that it is only required to replicate the higher-level behaviors of the simulated target structure instead of trying to mimic low-level timing details. The implementation details of FACS that follow demonstrate how this less-constrained setting allows for simpler and more optimized designs that take full advantage of hardware-specific features, such as concurrent lookup of multiple memory blocks. Through these practical techniques, FACS is able to run on a separate BEE2 FPGA and can receive and process dynamically generated memory address


Table III. FACS Implementation Characteristics L1 Caches 16 private 2-way split I&D (64B blocks) (2 stage pipeline, 1 cycle/ref) L2Cache 8-waysinglesharedvictimcache(64Bblocks) (4 stage pipeline, 2 cycles/ref) Clock frequency 100MHz Resources used 7483 LUTs (11%), 134 BRAMs (40%) (when configured with 64KB L1s, 4MB L2) 7277 LUTs (11%), 292 BRAMS (89%) (when configured with 128KB L1s, 16MB L2) Statistics 2500linesofVerilog,8modules traces over a high-bandwidth inter-FPGA link from BlueSPARC at nearly full speed. For each incoming memory reference in the trace, FACS updates its L1 and L2 cache tags and maintains coherence among the private L1 caches. It is important to note that FACS has no actual functionality (i.e., there is no data stored in the caches) but simply tracks the long-term coherence state of each simulated cache block (to later export into a software cycle-accurate simulator as per the sampling methodology).

Architecture and Implementation. Before presenting the actual implementation of FACS, we first discuss additional details relevant to the target cache model. In the Piranha-like cache hierarchy, processors are connected to private split L1 I&D caches and share a common L2, which acts as a large victim cache; that is, only blocks that are evicted from an L1 get inserted into the L2. Blocks are also inserted into the L2 when transferring blocks between private L1s. In addition to storing its own cache blocks, the L2 maintains a directory for all of the private L1 caches to preserve cache coherence. The L1 and L2 use the same pseudo-LRU replacement strategies implemented in the software-based reference model.

The cache architecture described here belongs to the target simulation model and does not reflect the actual FACS implementation. FACS is a pure functional cache model, which means that it only stores and updates tag and status bits for each cache line. Not storing the actual data in each cache line greatly reduces the amount of required memory. For instance, modeling 16 64KB L1 caches along with a single shared 4MB L2 cache requires less than 300KB of memory, which easily fits in on-chip FPGA memory resources (e.g., BlockRAMs). Moreover, FACS is fully parameterizable in the number of nodes, L1/L2 block sizes, L1/L2 cache sizes, and L2 associativity. The default configuration of FACS, which was also used to generate the experimental results presented in Section 5, consists of 16 private 64KB split L1 I&D caches and a single shared 4MB L2 cache. Table III summarizes the high-level characteristics of FACS.

FACS is structured as a 6-stage pipeline, which processes the memory address trace generated from BlueSPARC through an inter-FPGA FIFO interface. The architecture of FACS is depicted in Figure 9. Internally, it consists of two core modules: one that simulates the set of all private L1s and one that simulates the larger shared L2, implemented as a 2-stage and a 4-stage pipeline, respectively. The L1 module can sustain a processing rate of one memory reference per cycle, while the L2 module is limited to accepting one memory reference once every two cycles.
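As a rough cross-check of the storage estimate above, the following sketch (Python, for illustration only) computes the tag-plus-state footprint of a data-less cache model for the default FACS configuration. The physical-address width and per-line state bits are assumptions made for this sketch; the exact FACS field layout is not specified here.

```python
def metadata_kb(num_caches, cache_bytes, block_bytes, assoc,
                paddr_bits=40, state_bits=2):
    """Rough tag+state storage for a data-less (functional) cache model.

    paddr_bits and state_bits are illustrative assumptions; the actual
    FACS field widths may differ."""
    lines = cache_bytes // block_bytes
    sets = lines // assoc
    offset_bits = block_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    tag_bits = paddr_bits - index_bits - offset_bits
    total_bits = num_caches * lines * (tag_bits + state_bits)
    return total_bits / 8 / 1024

l1 = metadata_kb(num_caches=16, cache_bytes=64 * 1024, block_bytes=64, assoc=2)
l2 = metadata_kb(num_caches=1, cache_bytes=4 * 1024 * 1024, block_bytes=64, assoc=8)
print(f"L1 metadata ~{l1:.0f} KB, L2 metadata ~{l2:.0f} KB, total ~{l1 + l2:.0f} KB")
```

Under these assumptions the estimate comes out to roughly 240KB, consistent with the "less than 300KB" figure quoted above.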


Fig. 9. FACS internal pipeline organization.

Each memory reference that is processed by the L1 may produce up to two corresponding memory references that are forwarded to the L2: one for the actual reference and one for the potentially evicted victim cache line. However, as we show in Section 5, the rate difference between the L1 and L2 modules as well as the additionally generated L2-specific memory references have negligible impact on overall simulation performance. On the occasion that the input FIFO to FACS is full, inter-FPGA flow control measures ensure that the BlueSPARC pipeline stalls to avoid dropping any memory references.

In the target Piranha-like cache model, processors can issue multiple concurrent independent accesses to their respective private L1s, while coherence is maintained through a shared directory that resides in the L2. By taking advantage of the memory reference serialization that occurs due to instruction interleaving in the BlueSPARC simulation engine, FACS follows a different approach that is simpler, achieves high performance, and still retains functional correctness. Instead of preserving coherence by maintaining a directory in the L2, FACS concurrently accesses all of the private L1s for each memory reference, in a manner similar to snoop-based coherence protocols. This allows coherence conflicts to be resolved instantaneously, without requiring propagation to the L2 and potential subsequent coherence messages (e.g., invalidations) that travel back to one or more L1s. Although this approach limits the throughput of FACS, it greatly simplifies the design and completely decouples the L1s from the L2 by eliminating any potential L2-to-L1 feedback.
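A minimal sketch of this snoop-like lookup, written in Python purely for illustration (capacity, associativity, victim handling, and the actual 2-/4-stage Verilog pipelines are omitted; all names are hypothetical):

```python
from collections import deque

class TagOnlyL1:
    """Data-less private L1: block address -> state ('S' shared or 'M' modified)."""
    def __init__(self):
        self.blocks = {}

def process_reference(l1s, l2_fifo, cpu, block_addr, is_write):
    """Illustration of the snoop-like scheme described above: each incoming
    reference probes every private L1 at once, so coherence conflicts are
    resolved immediately and no L2-to-L1 feedback is needed."""
    hit = block_addr in l1s[cpu].blocks
    if is_write:
        # Invalidate any other copy; the writer becomes the sole owner.
        for i, l1 in enumerate(l1s):
            if i != cpu:
                l1.blocks.pop(block_addr, None)
        l1s[cpu].blocks[block_addr] = 'M'
    else:
        l1s[cpu].blocks.setdefault(block_addr, 'S')
    if not hit:
        # Misses are forwarded to the (slower) L2 module through a FIFO.
        l2_fifo.append(block_addr)

# Toy usage: 16 private L1s, a shared FIFO toward the L2 model.
l1s = [TagOnlyL1() for _ in range(16)]
l2_fifo = deque()
process_reference(l1s, l2_fifo, cpu=3, block_addr=0x40, is_write=False)
process_reference(l1s, l2_fifo, cpu=7, block_addr=0x40, is_write=True)
print(l1s[3].blocks, l1s[7].blocks, list(l2_fifo))
```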


The L1s can process references at double the rate of the L2 module; thus, more than half of the total memory references would have to miss in the L1s on average for the L2 module to become the throughput bottleneck. In all of our workloads, the average L1 miss rate is well below this 50% threshold; additionally, the FIFO connecting the L1 and L2 modules can smooth out short bursts of consecutive L1 misses.
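For concreteness, the first-order condition implied above can be written as follows (treating victim-induced L2 references as second-order, i.e., assuming roughly one forwarded L2 reference per L1 miss with rate $m$):

\[
\underbrace{m \cdot 2\ \tfrac{\text{cycles}}{\text{ref}}}_{\text{L2 service demand}} \;>\; \underbrace{1\ \tfrac{\text{cycle}}{\text{ref}}}_{\text{L1 arrival rate}} \quad\Longleftrightarrow\quad m > 50\%.
\]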

5. EVALUATION

This section reports an evaluation of the 16-way BlueSPARC simulator described in Section 3 as well as the FACS component described in Section 4. The performance evaluation includes a comparison to the Simics simulator. Furthermore, we present a first-order model to understand the performance of multithreaded interleaving.

5.1 Simulation Throughput

Benchmarks. The evaluations in this section are based on simulating a 16-way symmetric multiprocessing (SMP) UltraSPARC III server (Sun Fire 3800) using BlueSPARC (with and without the FACS CMP cache simulation). The simulated server runs an unmodified Solaris 8 operating system. We exercise the simulated server using 16-way software workloads to saturate the pipeline. These consist of six SPEC2000 integer benchmarks (bzip2, crafty, gcc, gzip, parser, vortex) and a TPC-C Online Transaction Processing (OLTP) benchmark. For the SPECint workloads, 16 independent copies of the benchmark program are executed concurrently on the simulated 16-way server; we measure simulation throughput for 100 billion aggregate instructions (after the program initialization phase). Due to limited physical memory in our target configuration, it was necessary to run with test inputs and to omit benchmarks that exhibited excessive disk paging activity after program initialization (gap, mcf). Other SPECint benchmarks (eon, perlbmk, twolf, and vpr) were omitted due to a high rate (>1 in 1000) of unimplemented instructions (e.g., floating-point instructions). For the OLTP benchmark, the simulated server runs a commercial multithreaded database server (Oracle 10g Enterprise Database Server) configured with 100 warehouses (10GB), 16 clients, and a 1.4GB SGA as in Wenisch et al. [2006]. We measure simulation throughput for 100 billion aggregate instructions in a steady-state region (where database transactions are committing steadily). Due to nondeterminism of performance in our current implementation,³ each bar we report is the average over four samples for each workload. We omit all of the standard error bars, which have values of 1% or less. As we will see next, different workload behaviors have a large impact on the simulation throughput of both Simics and BlueSPARC.

³Several components in our current implementation of BlueSPARC are nondeterministic. One example is the Ethernet component, which is used to facilitate I/O-transplants and DMA transfers.


Fig. 10. Performance comparison of BlueSPARC. The height of the line shown inside each Oracle TPC-C bar indicates the performance of Oracle TPC-C in User MIPS. The listed percentages also indicate the overall fraction of user instructions. For all remaining SPECint workloads in both BlueSPARC and Simics, the rate of privileged-mode instructions is less than 5%.

Simics performance. When Simics is invoked with the default "fast" option, it can achieve many tens of MIPS in simulation throughput. However, the fast mode is purely a black-box system; that is, it does not support instrumentation (e.g., memory address tracing) or augmentation of behavior (e.g., adding a new opcode). There is at least a factor of 10 reduction in simulation throughput when Simics is enabled with trace callbacks for instrumentation (this observation is also noted by Nussbaum et al. [2004]). In Figure 10, the bars labeled Simics-fast and Simics-trace report our measured Simics simulation throughput (total simulated instructions per second) for the simulated 16-way SMP server. Simics simulations were run on a Linux PC with a 2.0 GHz Core 2 Duo and 8GB of memory. The performance more relevant to architecture research activities is represented by Simics-trace.

BlueSPARC performance. The simulation throughput of the BlueSPARC simulator is reported in the leftmost bars in Figure 10. The BlueSPARC simulator, with acceleration from a single Virtex-II Pro XC2VP70 FPGA clocked at 90 MHz, achieves speed comparable to Simics-fast on the SPECint and Oracle TPC-C workloads. In comparison to the more relevant Simics-trace performance, the speedup is more dramatic: an average speedup of 38x and as high as 49x on GCC. It is worth noting that some applications such as vortex and bzip2 run slower under BlueSPARC than in Simics-fast. These applications in particular experience a high rate of memory misses under BlueSPARC. It is also worth noting that the user-to-privileged instruction ratio changes significantly in Oracle TPC-C when comparing BlueSPARC to Simics (as noted in the percentages above the Oracle TPC-C bars in Figure 10). In Section 5.2, we explain these results in detail and also discuss the major sources of performance overhead that prevent BlueSPARC from attaining ideal performance.


FACS performance. As discussed in Section 4.1, fast functional instrumentation is a key capability in accelerating cycle-accurate simulation sampling activities. To demonstrate this, the second-to-leftmost bars in Figure 10 report the performance of BlueSPARC executing with FACS running in parallel on a second FPGA at 100 MHz. Overall, BlueSPARC+FACS achieves a speedup of 37x over Simics-trace. The results we report here are lower-bound speedups because we do not include a software cache simulator when measuring the performance of Simics-trace. In practice, the introduction of a CMP cache simulator (not shown in this article) further reduces the performance of Simics-trace by over a factor of two, which means BlueSPARC+FACS potentially achieves up to nearly two orders of magnitude of performance improvement.

In the configuration described in Section 4, FACS simulates 16 independent 64KB L1 I&D caches and a shared 4MB L2 cache. At 100 MHz, FACS is able to consume up to 100M memory references per second. Figure 10 illustrates that, for the majority of our applications, FACS does not backpressure BlueSPARC's rate of progress. In the special case of GCC, a 14% reduction in performance is observed. When BlueSPARC without FACS executes GCC, the IPC of the BlueSPARC pipeline reaches 0.95 during long phases of program execution (billions of instructions). In addition, nearly 40% of all instructions are memory accesses. As a result, BlueSPARC generates 0.95 × 90 MHz × 1.4 ≈ 120M memory references per second during these phases (the factor of 1.4 accounts for one instruction-fetch reference plus roughly 0.4 data references per instruction), which exceeds the peak processing rate of FACS. When FACS is introduced, BlueSPARC is backpressured during these phases and is unable to reach the original peak IPC of 0.95, resulting in the overall 14% reduction in performance. In Vortex, BlueSPARC+FACS appears to perform slightly better than BlueSPARC alone. However, the error bars (not shown in the figure) for BlueSPARC and BlueSPARC+FACS overlap due to variability in runtime, and no significant difference in performance is observable.

5.2 BlueSPARC Performance Analysis

The performance model we present is based on a simple accounting of pipeline utilization by the interleaved processor contexts. Under ideal conditions, the instructions from multiple processor contexts would be issued into the pipeline in round-robin order, without stall or interference. Let N be the number of interleaved processor contexts and P be the depth of the interleaved pipeline. (In the case of the BlueSPARC design, N = 16 and P = 14.) Let I be the average interval (in cycles) between the instructions issued from one processor context into the pipeline. The pipeline utilization by one context is then 1/I, and the total utilization by N contexts is N/I. When the pipeline is 100% utilized (IPC = 1), I is equal to N. With full pipeline utilization, the average instruction interval I can be broken into P + S cycles. The first component (P) represents the number of cycles traversed in a P-stage pipeline. The second component (S) is the number of cycles spent queued up in the scheduler. The S component is nonzero because only a subset of the contexts can occupy the pipeline at any given moment when N > P.
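In the ideal, fully utilized case these relations can be summarized as follows (a restatement of the model above, with S written out for the case N > P):

\[
U \;=\; \frac{N}{I}, \qquad I \;=\; P + S, \qquad S \;=\; N - P \quad \text{(fully utilized pipeline, } N > P\text{)}.
\]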


Table IV. Long-Latency Event Class Summary
  Flush          Pipeline is halted while the caches are being flushed
  Retry          A context is suspended due to a resource conflict
  I/O stall      A context waits until a memory-mapped I/O operation completes
  Micro-tplant   A context blocks until a micro-transplant is carried out
  Load stall     A context misses in the D-cache & waits until the fill is completed
  I-Fetch stall  A context misses in the I-cache & waits until the fill is completed
  Scheduler      A context queueing up & being re-scheduled for another round of execution
  Pipeline       The minimum amount of time needed for any instruction to complete

Consider the example of BlueSPARC. If only a single processor context is executing, the average instruction interval I is equal to 14 and the average number of cycles queued up in the scheduler S is zero. The pipeline utilization is thus 1/I = 1/14. When 16 contexts are running, each context's average instruction interval increases to I = 16, since each context must wait its turn for two cycles in the scheduler before running in the pipeline again. With 16 contexts, the pipeline utilization is N/I = N/(P+S) = 16/(14+2) = 1.

Under real conditions, events such as cache misses, microtransplants, and I/O-transplants increase the average interval I between the instructions issued from one processor context. We can express this as Iavg = P + S + r1·L1 + r2·L2 + ... + ri·Li, where ri is the rate (events per instruction) of a specific long-latency event class that delays the completion of an instruction by Li cycles. During long-latency events, the blocked processor context is removed from the pipeline and does not consume pipeline issue slots. Thus, a high pipeline utilization can be sustained even in the presence of these long-latency events, provided: (1) N > P and (2) the number of ready-to-execute contexts is at least P. A summary of long-latency event classes can be found in Table IV.

Table V gives the rate ri (events per instruction) and average latency Li (in 90 MHz cycles) for the event classes that impact Iavg. All of these statistics are collected using performance counters built into the BlueSPARC pipeline. Using the measured data as input, the Iavg model agrees well with wall-clock measurements and is able to predict pipeline utilization for all of the workloads within 1% error.

Figure 11 illustrates the measured contribution of each event class to the overall Iavg. The contribution of each component from an event class is computed as ri·Li. The components also include contributions from time spent in the pipeline (14 cycles) and time in the scheduler (S). Note that for workloads such as bzip2 or gcc, the time spent in the scheduler can be less than two cycles per instruction. In an otherwise perfectly scheduled pipeline with no long-latency events, this would not be possible; however, in the case of real workloads, processor contexts experiencing long-latency events can free up scheduling opportunities for others. It is also possible for contexts to spend more than two cycles per instruction in the scheduler due to stolen scheduling opportunities for retried instructions or when pipeline freezes occur (e.g., when resource hazards occur).

Based on Figure 11, the three most dominant sources of performance overhead arise from fetch and load misses, retries, and memory-mapped I/O. For the majority of the workloads, fetch and load misses contribute a significant fraction of the average instruction interval; this is expected as we interleave 16 processor contexts onto a small, direct-mapped cache.
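To make the accounting concrete, the sketch below (Python, for illustration only) evaluates the Iavg model with made-up event rates and latencies; the measured values actually used in our analysis are those in Table V:

```python
def predict_utilization(P, S, events, N=16):
    """Evaluate the Iavg model: Iavg = P + S + sum(r_i * L_i).

    events is a list of (rate_per_instruction, latency_cycles) pairs.
    The numbers in the example below are placeholders, not the
    measured Table V values."""
    i_avg = P + S + sum(r * L for r, L in events)
    utilization = min(1.0, N / i_avg)  # pipeline utilization (= IPC) cannot exceed 1
    return i_avg, utilization

# Hypothetical example: load misses, fetch misses, retries.
events = [(0.02, 110), (0.015, 120), (0.07, 100)]
i_avg, util = predict_utilization(P=14, S=2, events=events)
print(f"Iavg = {i_avg:.1f} cycles, predicted utilization = {util:.2f}")
```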

Table V. Long-Latency Event Rate and Latency. For each workload, the table lists the rate of each long-latency event class (percentages represent events per instruction) and, in parentheses, the average latency per event in units of 90 MHz cycles.


Fig. 11. Average contribution of each long-latency event class to the average instruction interval (Iavg). Each subcomponent within each bar is the product of the event rate (events per instruction) and the average measured latency of the event. An explanation of each of the event classes can be found in Table IV. Table V also lists the actual rates and latencies used to generate the subcomponents in each bar.

Vortex and crafty experience the highest rate of cache misses and are also additionally penalized by increased queuing latency in the memory system. Recall that our memory controllers are hosted on a second FPGA; the inter-chip link between the two FPGAs limits the rate at which multiple outstanding memory requests can be serviced.

In all of the workloads, retried instructions (including discarded computation) contribute a noticeable fraction to the average instruction interval. In our measurements, the majority of retries are caused by conflicting accesses to L1 data cache sets in a transient state (e.g., a pending cache fill). Due to the direct-mapped configuration of the caches and a large window of vulnerability (e.g., long-latency cache fills), this event occurs frequently enough to contribute a significant overhead. Note in Figure 11 that workloads with higher load and fetch stalls also experience a larger amount of retries.

Microtransplants contribute little if any overhead in almost all of the workloads. This outcome is due to the fact that we were conservative in our initially chosen instruction coverage in the FPGA; these results show that more opportunity exists to omit even further behaviors from the hardware. As expected, none of the user-dominated SPECint benchmarks experiences any noticeable I/O. In contrast, Oracle TPC-C experiences a high level of disk activity and increased latency of I/O-transplants due to queuing of multiple outstanding I/O requests (note the higher latency in Table V).

Performance analysis of Oracle TPC-C. Despite a large I/O overhead, Oracle TPC-C running in BlueSPARC appears to have a significantly higher MIPS rate than Simics (50.5 MIPS compared to 12.5 MIPS).

Further investigation into this unexpected result reveals that the instruction-level behavior of Oracle changes dramatically when BlueSPARC is used instead of Simics (as shown above the bars in Figure 10). In particular, we observe that the rate of privileged-mode instructions when running under BlueSPARC increases to nearly 85% of all instructions. In contrast, Simics spends only up to 35% of all instructions in privileged mode. In prior work by Hankins et al. [2003], the user IPC is directly correlated to the rate at which database transactions are committed.

These divergences in Oracle's behavior are due to the fact that BlueSPARC introduces positive delay into the system with respect to memory-mapped I/O and DMAs. It is worth noting that Simics does not model any notion of accurate timing: all system-level events such as I/O, interrupts, or DMA occur in zero cycles. In the case of BlueSPARC, the introduction of positive I/O delay causes additional queuing for device resources in the kernel and results in increased lock contention. The increased amount of spinning in the kernel improves the IPC of Oracle running under BlueSPARC despite a lower overall database transaction throughput. Nevertheless, the verdict on whether BlueSPARC or Simics provides the more "representative" behavior of Oracle TPC-C remains unclear. In future work, we plan to compare our results against real multiprocessor systems, which will offer better insight into how representative our simulators are with respect to Oracle's runtime behavior.

6. RELATED WORK

FPGA-based simulators. Recent FPGA efforts such as Lu et al. [2007] and Vahia and Hartke [2007] have produced full-system prototypes involving a CPU implemented in an FPGA, while real motherboards and PC components are used to provide the CPU-external full-system environment. In both examples, the CPU designs are commandeered from existing RTL models developed originally for standard-cell technologies. While RTL produces the most accurate timing model, the source code is generally more difficult to modify than simulation models crafted with design exploration in mind.

Other approaches such as Öner et al. [1995] and Wee et al. [2007] prototype functional multiprocessor models with approximated timing fidelity; that is, they lack cycle-accuracy but retain timing similarities to the target system. These approaches forgo full-system fidelity and also maintain structural correspondence to the target system in the number of processors implemented in the FPGA.

In a similar spirit toward reducing FPGA implementation complexity, UT-FAST [Chiou et al. 2007] is a form of hybrid simulation that uses FPGAs to accelerate cycle-accurate simulation of uniprocessor microarchitectures. UT-FAST is currently divided into a software functional partition [Bellard 2005] and a timing model implemented in the FPGA. Similar work by HASIM [Pellauer et al. 2008] implements both the functional and the timing partitions within the FPGA.


In addition, there have been a number of mature efforts within the RAMP project [Wawrzynek et al. 2007], including RAMP Blue [Krasnov et al. 2007] and RAMP Red [Wee et al. 2007], both of which follow a prototyping-like methodology for modeling large-scale message-passing machines and transactional memory, respectively. More recently, RAMP Gold is a new effort to develop a large-scale FPGA-based simulator for manycore and data center research [Tan et al. 2008]. Currently, the RAMP Gold strategy adopts the interleaving virtualization technique demonstrated in Chung et al. [2008] and primarily focuses on efficiently maximizing thread and core count per FPGA.

Software-based functional simulators. The need for full-system simulators has led to a large number of available implementations today, including SimOS [Rosenblum et al. 1995], Virtutech Simics [Magnusson et al. 2002], ASIM [Emer et al. 2002], Mambo [Bohrer et al. 2004], QEMU [Bellard 2005], GEMS [Martin et al. 2005], SimFlex [Wenisch et al. 2006], M5 [Binkert et al. 2006], PTLSim [Yourst 2007], Sulima [Over et al. 2007], and AMD SimNow [AMD 2008].

There have been a number of approaches for parallelizing machine simulators by Reinhardt et al. [1993], Mukherjee et al. [2000], Legedza and Weihl [1996], Chidester and George [2002], Penry et al. [2006], Over et al. [2007], Lantz [2008], and Wang et al. [2008]. The Wisconsin Wind Tunnel [Reinhardt et al. 1993; Mukherjee et al. 2000], in particular, uses a time-quantum approach to synchronize simulated processors across multiple host processors. The original work on Embra in Rosenblum et al. [1995] includes a parallel mode which allows translated code to execute across multiple hosts. The recent work on Parallel Embra [Lantz 2008] extends Embra to support multiple simulated processors per host processor and also improves the overall scalability of the simulator by parallelizing internal data structures. The Sparc-Sulima work [Over et al. 2007] introduces two styles of parallel simulation: an "active backplane" mode for parallelized cycle-accurate simulation and a "passive backplane," which supports a functional cache simulation using nondeterministic execution bounded by periodic global synchronization.

In general, the parallel approaches presented here all face a delicate trade-off between scalability and accuracy. Furthermore, as mentioned previously, any software-based approach to instrumentation introduces significant slowdowns, regardless of how much parallelism is exploited between simulated processors. In terms of scalability and low-overhead instrumentation, we believe that FPGAs offer unique capabilities that overcome these common limitations.

7. CONCLUSIONS

In this article, we presented the PROTOFLEX simulation architecture for FPGA-accelerated functional full-system multiprocessor simulation. The BlueSPARC simulator is the first instantiation of this architecture that simulates a 16-way UltraSPARC III SMP server. The resulting simulator embodies the two key concepts of PROTOFLEX: hybrid simulation with transplanting and multiprocessor interleaving.

Our evaluation of BlueSPARC showed a simulation throughput as high as 62 MIPS. The evaluation further showed a significant speedup, 38x on average and 49x in the best case, relative to Simics with instrumentation enabled. We also demonstrated the FACS architecture, which functionally simulates a Piranha-like CMP cache model that operates in tandem with BlueSPARC while introducing less than 4% performance overhead on average. FACS provides accelerated checkpoint generation for cycle-accurate simulation sampling methodologies and can also generate real-time cache statistics for the end user.

In the long run, we plan to scale up the BlueSPARC simulator by combining tens of interleaved engines in order to achieve an aggregate of 1000 MIPS to support the simulation of large (>100-way) multiprocessor systems. To succeed at that scale, we also need to extend the PROTOFLEX simulation architecture with means to virtualize the required main memory capacity expected of a multiprocessor system at that scale. We are investigating the feasibility and performance impact of employing demand paging against a larger but slower backing store (e.g., disk or FLASH memory) to provide the illusion of the required main memory capacity. When combining multiple interleaved engines, coherence mechanisms must also be introduced to maintain the abstraction of shared memory. We are currently investigating bus-based and distributed cache coherence protocols to achieve this.

ACKNOWLEDGMENTS We thank our colleagues in the RAMP and TRUSS projects for their interaction and feedback. We also thank SPARC International for the use of the SPARCV9 compliance test.

REFERENCES

AMD. 2008. Advanced Micro Devices, SimNow Simulator 4.4.3. User's manual.
BARROSO, L. A., GHARACHORLOO, K., MCNAMARA, R., NOWATZYK, A., QADEER, S., SANO, B., SMITH, S., STETS, R., AND VERGHESE, B. 2000. Piranha: A scalable architecture based on single-chip multiprocessing. SIGARCH Comput. Archit. News 28, 2, 282–293.
BELLARD, F. 2005. QEMU, a fast and portable dynamic translator. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (ATEC'05). USENIX Association, 41–41.
BINKERT, N. L., DRESLINSKI, R. G., HSU, L. R., LIM, K. T., SAIDI, A. G., AND REINHARDT, S. K. 2006. The M5 simulator: Modeling networked systems. IEEE Micro 26, 4, 52–60.
BOHRER, P., PETERSON, J., ELNOZAHY, M., RAJAMONY, R., GHEITH, A., ROCKHOLD, R., LEFURGY, C., SHAFI, H., NAKRA, T., SIMPSON, R., SPEIGHT, E., SUDEEP, K., HENSBERGEN, E., AND ZHANG, L. 2004. Mambo: A full system simulator for the PowerPC architecture. ACM SIGMETRICS Perform. Eval. Rev. 31, 4, 8–12.
CHANG, C., WAWRZYNEK, J., AND BRODERSEN, R. W. 2005. BEE2: A high-end reconfigurable computing system. IEEE Des. Test Comput. 22, 2, 114–125.
CHEN, S., KOZUCH, M., STRIGKOS, T., FALSAFI, B., GIBBONS, P. B., MOWRY, T. C., RAMACHANDRAN, V., RUWASE, O., RYAN, M., AND VLACHOS, E. 2008. Flexible hardware acceleration for instruction-grain program monitoring. In Proceedings of the 35th International Symposium on Computer Architecture (ISCA'08). IEEE Computer Society, 377–388.
CHIDESTER, M. AND GEORGE, A. 2002. Parallel simulation of chip-multiprocessor architectures. ACM Trans. Model. Comput. Simul. 12, 3, 176–200.
CHIOU, D., SUNWOO, D., KIM, J., PATIL, N. A., REINHART, W., JOHNSON, D. E., KEEFE, J., AND ANGEPAT, H. 2007. FPGA-Accelerated simulation technologies (FAST): Fast, full-system, cycle-accurate simulators. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'07). IEEE Computer Society, 249–261.
CHUNG, E. S., NURVITADHI, E., HOE, J. C., FALSAFI, B., AND MAI, K. 2008. A complexity-effective architecture for accelerating full-system multiprocessor simulations using FPGAs. In Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays (FPGA'08). ACM, New York, 77–86.
DALTON, M., KANNAN, H., AND KOZYRAKIS, C. 2007. Raksha: A flexible information flow architecture for software security. SIGARCH Comput. Archit. News 35, 2, 482–493.
EMER, J., AHUJA, P., BORCH, E., KLAUSER, A., LUK, C.-K., MANNE, S., MUKHERJEE, S. S., PATIL, H., WALLACE, S., BINKERT, N., ESPASA, R., AND JUAN, T. 2002. Asim: A performance model framework. Comput. 35, 2, 68–76.
HANKINS, R., DIEP, T., ANNAVARAM, M., HIRANO, B., ERI, H., NUECKEL, H., AND SHEN, J. 2003. Scaling and characterizing database workloads: Bridging the gap between research and practice. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture. 151–162.
KRASNOV, A., SCHULTZ, A., WAWRZYNEK, J., GIBELING, G., AND DROZ, P. 2007. RAMP Blue: A message-passing manycore system in FPGAs. In Proceedings of the Conference on Field Programmable Logic and Applications.
LANTZ, R. 2008. Fast functional simulation with parallel Embra. In Proceedings of the 4th Annual Workshop on Modeling, Benchmarking and Simulation.
LEGEDZA, U. AND WEIHL, W. E. 1996. Reducing synchronization overhead in parallel simulation. SIGSIM Simul. Digest 26, 1, 86–95.
LU, S.-L. L., YIANNACOURAS, P., KASSA, R., KONOW, M., AND SUH, T. 2007. An FPGA-based Pentium in a complete desktop system. In Proceedings of the ACM/SIGDA 15th International Symposium on Field Programmable Gate Arrays (FPGA'07). ACM, New York, 53–59.
MAGNUSSON, P., CHRISTENSSON, M., ESKILSON, J., FORSGREN, D., HALLBERG, G., HOGBERG, J., LARSSON, F., MOESTEDT, A., AND WERNER, B. 2002. Simics: A full system simulation platform. Comput. 35, 2, 50–58.
MARTIN, M. M. K., SORIN, D. J., BECKMANN, B. M., MARTY, M. R., XU, M., ALAMELDEEN, A. R., MOORE, K. E., HILL, M. D., AND WOOD, D. A. 2005. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Comput. Archit. News 33, 4, 92–99.
MUKHERJEE, S., REINHARDT, S., FALSAFI, B., LITZKOW, M., HILL, M., WOOD, D., HUSS-LEDERMAN, S., AND LARUS, J. 2000. Wisconsin Wind Tunnel II: A fast, portable parallel architecture simulator. Concurr. IEEE 8, 4, 12–20.
NETHERCOTE, N. AND SEWARD, J. 2007. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'07). ACM, New York, 89–100.
NUSSBAUM, F., FEDOROVA, A., AND SMALL, C. 2004. An overview of the Sam CMT simulator kit. Tech. rep. TR-2004-133, Sun Microsystems Research Labs.
ÖNER, K., BARROSO, L. A., IMAN, S., JEONG, J., RAMAMURTHY, K., AND DUBOIS, M. 1995. The design of RPM: An FPGA-based multiprocessor emulator. In Proceedings of the ACM 3rd International Symposium on Field Programmable Gate Arrays (FPGA'95). ACM, New York, 60–66.
OVER, A., CLARKE, B., AND STRAZDINS, P. 2007. A comparison of two approaches to parallel simulation of multiprocessors. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 12–22.
PATIL, H., COHN, R., CHARNEY, M., KAPOOR, R., SUN, A., AND KARUNANIDHI, A. 2004. Pinpointing representative portions of large Intel Itanium programs with dynamic instrumentation. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'04). IEEE Computer Society, 81–92.
PELLAUER, M., VIJAYARAGHAVAN, M., ADLER, M., AND EMER, J. 2008. Quick performance models quickly: Timing-directed simulation on FPGAs. In Proceedings of the International Symposium on Performance Analysis of Systems and Software.
PENRY, D., FAY, D., HODGDON, D., WELLS, R., SCHELLE, G., AUGUST, D., AND CONNORS, D. 2006. Exploiting parallelism and structure to accelerate the simulation of chip multi-processors. In Proceedings of the 12th International Symposium on High-Performance Computer Architecture. 29–40.
REINHARDT, S. K., HILL, M. D., LARUS, J. R., LEBECK, A. R., LEWIS, J. C., AND WOOD, D. A. 1993. The Wisconsin Wind Tunnel: Virtual prototyping of parallel computers. ACM SIGMETRICS Perform. Eval. Rev. 21, 1, 48–60.
ROSENBLUM, M., HERROD, S. A., WITCHEL, E., AND GUPTA, A. 1995. Complete computer system simulation: The SimOS approach. IEEE Parallel Distrib. Technol. 3, 4, 34–43.
SMITH, B. 1985. The architecture of HEP. In Parallel MIMD Computation: HEP Supercomputer and its Applications. Massachusetts Institute of Technology, Cambridge, MA, 41–55.
SRIVASTAVA, A. AND EUSTACE, A. 1994. ATOM: A system for building customized program analysis tools. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'94). ACM, New York, 196–205.
TAN, Z., ASANOVIĆ, K., AND PATTERSON, D. 2008. An FPGA host-multithreaded functional model for SPARC v8. In Proceedings of the 3rd Workshop on Architectural Research Prototyping.
THORNTON, J. E. 1995. Parallel operation in the Control Data 6600. 5–12.
VAHIA, D. AND HARTKE, P. 2007. OpenSPARC T1 on Xilinx FPGAs: Updates. June 2007 RAMP Retreat.
VENKATARAMANI, G., ROEMER, B., SOLIHIN, Y., AND PRVULOVIC, M. 2007. MemTracker: Efficient and programmable support for memory access monitoring and debugging. In Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture (HPCA'07). IEEE Computer Society, 273–284.
WANG, K., ZHANG, Y., WANG, H., AND SHEN, X. 2008. Parallelization of IBM Mambo system simulator in functional modes. SIGOPS Oper. Syst. Rev. 42, 1, 71–76.
WAWRZYNEK, J., PATTERSON, D., OSKIN, M., LU, S.-L., KOZYRAKIS, C., HOE, J. C., CHIOU, D., AND ASANOVIĆ, K. 2007. RAMP: Research accelerator for multiple processors. IEEE Micro 27, 2, 46–57.
WEE, S., CASPER, J., NJOROGE, N., TESYLAR, Y., GE, D., KOZYRAKIS, C., AND OLUKOTUN, K. 2007. A practical FPGA-based framework for novel CMP research. In Proceedings of the ACM/SIGDA 15th International Symposium on Field Programmable Gate Arrays (FPGA'07). ACM, New York, 116–125.
WENISCH, T. AND WUNDERLICH, R. 2005. SimFlex: Fast, accurate and flexible simulation of computer systems. Tutorial in the International Symposium on Microarchitecture (MICRO-38).
WENISCH, T. F., WUNDERLICH, R. E., FERDMAN, M., AILAMAKI, A., FALSAFI, B., AND HOE, J. C. 2006. SimFlex: Statistical sampling of computer system simulation. IEEE Micro 26, 4, 18–31.
WITCHEL, E. AND ROSENBLUM, M. 1996. Embra: Fast and flexible machine simulation. ACM SIGMETRICS Perform. Eval. Rev. 24, 1, 68–79.
YOURST, M. 2007. PTLsim: A cycle accurate full system x86-64 microarchitectural simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. 23–34.

Received June 2008; revised September 2008; accepted November 2008
