
Heracles: A Tool for Fast RTL-Based Design Space Exploration of Multicore Processors

Michel A. Kinsy, Srinivas Devadas
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
mkinsy, [email protected]

Michael Pellauer
Intel Corporation, VSSAD Group
[email protected]

ABSTRACT

This paper presents Heracles, an open-source, functional, parameterized, synthesizable multicore system toolkit. Such a multi/many-core design platform is a powerful and versatile research and teaching tool for architectural exploration and hardware-software co-design. The Heracles toolkit comprises the soft hardware (HDL) modules, application compiler, and graphical user interface. It is designed with a high degree of modularity to support fast exploration of future multicore processors of different topologies, routing schemes, processing elements (cores), and memory system organizations. It is a component-based framework with parameterized interfaces and a strong emphasis on module reusability. The compiler toolchain is used to map C or C++ based applications onto the processing units. The GUI allows the user to quickly configure and launch a system instance for easy factorial development and evaluation. Hardware modules are implemented in synthesizable Verilog and are FPGA platform independent. The Heracles tool is freely available under the open-source MIT license at: http://projects.csail.mit.edu/heracles.

Categories and Subject Descriptors

C.1.2 [Computer Systems Organization]: Processor Architecture - Single-instruction-stream, multiple-data-stream processors (SIMD); B.5.1 [Hardware]: Register-Transfer-Level Implementation - Design.

General Terms

Tool, Design, Experimentation, Performance

Keywords

Multicore Architecture Design, RTL-Based Design, FPGA, Shared Memory, Distributed Shared Memory, Network-on-Chip, RISC, MIPS, Hardware Migration, Hardware Multithreading, Virtual Channel, Wormhole Router, NoC Routing

1. INTRODUCTION

The ability to integrate various computation components such as processing cores, memories, custom hardware units, and complex network-on-chip (NoC) communication protocols onto a single chip has significantly enlarged the design space of multi/many-core systems. The design of these systems requires tuning of a large number of parameters in order to find the most suitable hardware configuration, in terms of performance, area, and energy consumption, for a target application domain. This increasing complexity makes the need for efficient and accurate design tools more acute.

There are two main approaches currently used in the design space exploration of multi/many-core systems. One approach consists of building software routines for the different system components and simulating them to analyze system behavior. Software simulation has many advantages: i) large programming tool support; ii) internal states of all system modules can be easily accessed and altered; iii) compilation/re-compilation is fast; and iv) it is less constraining in terms of the number of components (e.g., number of cores) to simulate. Some of the most stable and widely used software simulators are Simics [14]–a commercially available full-system simulator–GEMS [21], Hornet [12], and Graphite [15]. However, software simulation of many-core architectures with cycle- and bit-level accuracy is time-prohibitive, and many of these systems have to trade off evaluation accuracy for execution speed. Although such a tradeoff is fair and even desirable in the early phase of design exploration, making final micro-architecture decisions based on these software models over truncated applications or application traces leads to inaccurate or misleading system characterization.

The second approach, often preceded by software simulation, is register-transfer level (RTL) simulation or emulation.
This level of accuracy considerably reduces system behavior mis-characterization and helps avoid late discovery of system performance problems. The primary disadvantage of RTL simulation/emulation is that as the design size increases, so does the simulation time. However, this problem can be circumvented by adopting synthesizable RTL and using hardware-assisted accelerators–field programmable gate arrays (FPGAs)–to speed up system execution. Although FPGA resources constrain the size of design one can implement, recent advances in FPGA-based design methodologies have shown that such constraints can be overcome. HAsim [18], for example, has shown using its time multiplexing technique how one can model a shared-memory multicore system, including detailed core pipelines, cache hierarchy, and on-chip network, on a single FPGA. RAMP Gold [20] is able to simulate a 64-core shared-memory target machine capable of booting real operating systems on a single Xilinx Virtex-5 FPGA board. Fleming et al [7] propose a mechanism by which complex designs can be efficiently and automatically partitioned among multiple FPGAs.

RTL design exploration for multi/many-core systems nonetheless remains unattractive to most researchers, because it is still a time-consuming endeavor to build such large designs from the ground up and to ensure correctness at all levels. Furthermore, researchers are generally interested in one key system area, such as the processing core and/or memory organization, network interface, interconnect network, or operating system and/or application mapping. Therefore, we believe that if there is a platform-independent design framework–more specifically, a general hardware toolkit–which allows designers to compose their systems and modify them at will, with very little effort or knowledge of other parts of the system, the speed versus accuracy dilemma in design space exploration of many-core systems can be further mitigated.

To that end we present Heracles, a functional, modular, synthesizable, parameterized multicore system toolkit. It is a powerful and versatile research and teaching tool for architectural exploration and hardware-software co-design. Without loss in timing accuracy and logic, complete systems can be constructed, simulated, and/or synthesized onto FPGA with minimal effort. The initial framework is presented in [10]. Heracles is designed with a high degree of modularity to support fast exploration of future multicore processors–different topologies, routing schemes, processing elements or cores, and memory system organizations–by using a library of components and reusing user-defined hardware blocks between different system configurations or projects. It has a compiler toolchain for mapping applications written in C or C++ onto the core units. The graphical user interface (GUI) allows the user to quickly configure and launch a system instance for easily-factored development and evaluation. Hardware modules are implemented in synthesizable Verilog and are FPGA platform independent.

2. RELATED WORK

In [6] Del Valle et al present an FPGA-based emulation framework for multiprocessor system-on-chip (MPSoC) architectures. LEON3, a synthesizable VHDL model of a 32-bit processor compliant with the SPARC V8 architecture, has been used in implementing multiprocessor systems on FPGAs. Andersson et al [1], for example, use the LEON4FT microprocessor to build their Next Generation Multipurpose Microprocessor (NGMP) architecture, which is prototyped on the Xilinx XC5VFX130T FPGA board. However, the LEON architecture is fairly complex, and it is difficult to instantiate more than two or three cores on a medium-sized FPGA. Clack et al [4] investigate the use of FPGAs as a prototyping platform for developing multicore system applications. They use the Xilinx MicroBlaze processor for the core, and a bus protocol for the inter-core communication. Some designs focus primarily on the Network-on-Chip (NoC). Lusala et al [13], for example, propose a scalable implementation of a NoC on FPGA using a torus topology. Genko et al [8] also present an FPGA-based flexible emulation environment for exploring different NoC features. A VHDL-based cycle-accurate RTL model for evaluating power and performance of NoC architectures is presented in Banerjee et al [2]. Other designs make use of multiple FPGAs. H-Scale [19], by Saint-Jean et al, is a multi-FPGA based homogeneous SoC, with RISC processors and an asynchronous NoC. The S-Scale version supports a multi-threaded sequential programming model with dedicated communication primitives handled at run-time by a simple operating system.
3. HERACLES HARDWARE SYSTEM

Heracles presents designers with a global and complete view of the inner workings of the multi/many-core system at cycle-level granularity, from instruction fetches at the processing core in each node to the flit arbitration at the routers. It enables designers to explore different implementation parameters–core micro-architecture, levels of caches, cache sizes, routing algorithm, router micro-architecture, distributed or shared memory, or network interface–and to quickly evaluate their impact on the overall system performance. It is implemented with user-enabled performance counters and probes.

3.1 System overview

Figure 1: Heracles-based design flow. (Diagram: an application, single- or multi-threaded C or C++, passes through the software environment–MIPS-based Linux GNU GCC cross compiler and hardware-config-aware application mapping–onto the component-based hardware design–processing elements selection, memory organization configuration, and network-on-chip topology and routing settings–and is then evaluated through RTL-level simulation or FPGA-based emulation.)

Figure 1 illustrates the general Heracles-based design flow. Full applications–written in single or multithreaded C or C++–can be directly compiled onto a given system instance using the Heracles MIPS-based GCC cross compiler. The detailed compilation process and application examples are presented in Section 5. For a multi/many-core system, we take a component-based approach by providing clear interfaces to all modules for easy composition and substitutions. The system has multiple default settings to allow users to quickly get a system running and focus only on their area of interest. The system and application binary can be executed in an RTL simulated environment and/or on an FPGA. Figure 2 shows two different views of a typical network node structure in Heracles.

Figure 2: Network node structure. ((a) Expanded view of the local memory structure: processing core, caches, address translation logic, local main memory, and packetizer; (b) expanded view of the routing structure: router and network interface.)
3.2 Processing Units

In the current version of the Heracles design framework, users can instantiate four different types of processor cores, or any combination thereof, depending on the programming model adopted and the architectural evaluation goals.

3.2.1 Injector Core

The injector core (iCore) is the simplest processing unit. It emits and/or collects from the network user-defined data streams and traffic patterns. Although it does not do any useful computation, this type of core is useful when the user is focusing only on the network-on-chip behavior. It is useful for generating network traffic and enabling the evaluation of network congestion. Often, applications running on real cores fail to produce enough data traffic to saturate the network.

3.2.2 Single Hardware-Threaded MIPS Core

This is an integer 7-stage 32-bit MIPS–Microprocessor without Interlocked Pipeline Stages–core (sCore). This RISC architecture is widely used in commercial products and for teaching purposes [17]. Most users are very familiar with this architecture and its operation, and will be able to easily modify it when necessary. Our implementation is generally standard, with some modifications for FPGAs; for example, the adoption of a 7-stage pipeline, due to block RAM access time on the FPGA. The architecture is fully bypassed, with no branch prediction table or branch delay slot, running the MIPS-III instruction set architecture (ISA) without floating point. Instruction and data caches are implemented using block RAMs, and instruction fetch and data memory access take two cycles. Stall and bypass signals are modified to support the extended pipeline. Instructions are issued and executed in-order, and the data memory accesses are also in-order.

3.2.3 Two-way Hardware-Threaded MIPS Core

The dCore is a fully functional, fine-grain hardware multithreaded MIPS core. There are two hardware threads in the core. The execution datapath for each thread is similar to that of the single-threaded core above. Each of the two threads has its own context, which includes a program counter (PC), a set of 32 data registers, and one 32-bit state register. The core can dispatch instructions from any one of the hardware contexts, and supports precise interrupts (doorbell type) with limited state saving. A single hardware thread is active on any given cycle, and pipeline stages must be drained between context switches to avoid state corruption. The user has the ability to control the context switching conditions, e.g., the minimum number of cycles to allocate to each hardware thread at a time, or instruction or data cache misses.

3.2.4 Two-way Hardware-Threaded MIPS Core with Migration

The fourth type of core (mCore) is also a two-way hardware-threaded processor, but enhanced to support hardware-level thread migration and evictions. It is the user's responsibility to guarantee deadlock-freedom under this core configuration. One approach is to allocate local memory to contexts, so that on migration they are removed from the network. Another approach, which requires no additional hardware modification to the core, is to use the deadlock-free thread migration scheme of Cho et al [3].

3.2.5 FPGA Synthesis

All the cores have the same interface; they are self-contained and oblivious to the rest of the system, and therefore easily interchangeable. The cores are synthesized using Xilinx ISE Design Suite 11.5, targeting the Virtex-6 LX550T (package ff1760, speed grade -2) FPGA board. The numbers of slice registers and slice lookup tables (LUTs) on the board are 687360 and 343680, respectively. Table 1 shows the register and LUT utilization of the different cores. The two-way hardware-threaded core with migration consumes the most resources, and even it uses less than 0.5% of the board. Table 1 also shows the clocking speed of the cores. The injector core, which does no useful computation, runs the fastest at 500.92MHz, whereas the two-way hardware-threaded core runs the slowest at 118.66MHz.

Table 1: FPGA resource utilization per core type.

Core type     iCore    sCore    dCore    mCore
Registers     227      1660     2875     3484
LUTs          243      3661     5481     6293
Speed (MHz)   500.92   172.02   118.66   127.4
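To make the per-thread state described in Section 3.2.3 concrete, the following minimal C model sketches what one dCore hardware context holds; the struct and field names are our own illustration, not identifiers from the Heracles Verilog source.

    #include <stdint.h>

    #define NUM_GPRS 32  /* MIPS general-purpose registers */

    /* One dCore hardware thread context: a program counter, a set of
     * 32 data registers, and one 32-bit state register (Section 3.2.3). */
    typedef struct {
        uint32_t pc;             /* program counter */
        uint32_t gpr[NUM_GPRS];  /* per-thread register file */
        uint32_t state;          /* 32-bit state register */
    } hw_context_t;

    /* The dCore keeps two such contexts; a single hardware thread is
     * active on any given cycle, and the pipeline is drained on a switch. */
    typedef struct {
        hw_context_t ctx[2];
        int active;              /* index of the currently running thread */
    } dcore_state_t;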
3.3 Memory System Organization

The memory system in Heracles is parameterized and can be set up in various ways, independent of the rest of the system. The key components are the main memory, the caching system, and the network interface.

3.3.1 Main Memory Configuration

The main memory is constructed to allow different memory space configurations. For a Centralized Shared Memory (CSM) implementation, all processors share a single large main memory block; the local memory size (shown in Figure 2) is simply set to zero at all nodes except one. In Distributed Shared Memory (DSM), where each processing element has a local memory, the local memory is parameterized and has two very important attributes: the size can be changed on a per-core basis, providing support for both uniform and non-uniform distributed memory, and it can service a variable number of caches in a round-robin fashion. Figure 3 illustrates these physical memory partitions. The fact that the local memory is parameterized to handle requests from a variable number of caches allows traffic coming into a node from other cores through the network to be presented to the local memory as just another cache communication. This illusion is created through the network packetizer. Local memory can also be viewed as a memory controller. Figure 4 illustrates the local structure of the memory sub-system. The LOCAL_ADDR_BITS parameter is used to set the size of the local memory. The Address Translation Logic performs the virtual-to-physical address lookup using the high-order bits, and directs cache traffic to the local memory or to the network.
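The decision made by the Address Translation Logic can be modeled in a few lines of C. In this sketch, the low-order LOCAL_ADDR_BITS bits of an address index into a node's local memory and the remaining high-order bits name the home node; the constant value and helper names are our own assumptions for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    #define LOCAL_ADDR_BITS 18  /* example setting: 2^18-entry local memory */

    /* High-order bits name the home node of an address. */
    static inline uint32_t home_node(uint32_t addr) {
        return addr >> LOCAL_ADDR_BITS;
    }

    /* Low-order bits index into that node's local memory. */
    static inline uint32_t local_offset(uint32_t addr) {
        return addr & ((1u << LOCAL_ADDR_BITS) - 1u);
    }

    /* Cache traffic is steered to the local memory when the home node
     * is this node, and to the network (via the packetizer) otherwise. */
    static inline bool serve_locally(uint32_t addr, uint32_t my_node_id) {
        return home_node(addr) == my_node_id;
    }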

Figure 3: Possible physical memory configurations. (Diagram: uniform distributed memory, non-uniform distributed memory, and non-distributed memory, shown as different placements and sizes of local memory (LM) across nodes.)

Figure 4: Local memory sub-system structure. (Diagram: instruction and data caches–status, tag, data block–feeding the Address Translation Logic; a packetizer connecting to the router and network interface; and the local main memory with a directory–data block status and PE IDs–managed by a directory manager.)

For cache coherence, a directory is attached to each local memory, and the MESI protocol is implemented as the default coherence mechanism. Remote access (RA) is also supported. In RA mode, the network packetizer sends network traffic directly to the caches. Memory structures are implemented on the FPGA using block RAMs. There are 632 block RAMs on the Virtex-6 LX550T. A local memory of 0.26MB uses 64 block RAMs, or 10%. Table 2 shows the FPGA resources used to provide the two cache coherence mechanisms. The RA scheme uses fewer hardware resources than the coherence-free structure, since no cache-line buffering is needed. The directory-based coherence is far more complex, resulting in more resource utilization. The SHARERS parameter is used to set the number of sharers per data block; it also dictates the overall size of the local memory directory. When a directory entry cannot handle all sharers, other sharers are evicted.

Table 2: FPGA resource utilization per coherence mechanism.

Coherence     None     RA       Directory
Registers     2917     2424     11482
LUTs          5285     4826     17460
Speed (MHz)   238.04   217.34   171.75

3.3.2 Caching System

The user can instantiate direct-mapped Level 1 caches, or Levels 1 and 2, with the option of making Level 2 an inclusive cache. The INDEX_BITS parameter defines the number of blocks or cache-lines in the cache, while the OFFSET_BITS parameter defines the block size. By default, cache and memory structures are implemented on the FPGA using block RAMs, but the user can instruct Heracles to use LUTs for caches, or some combination of LUTs and block RAMs. A single 2KB cache uses 4 FPGA block RAMs, 462 slice registers, and 1106 slice LUTs, and runs at 228.8MHz. If the cache size is increased to 8KB by changing the INDEX_BITS parameter from 6 to 8, resource utilization and speed remain identical. Meanwhile, if the cache size is increased to 8KB by changing the OFFSET_BITS parameter from 3 to 5, resource utilization increases dramatically: 15 FPGA block RAMs, 1232 slice registers, 3397 slice LUTs, and a speed of 226.8MHz. FPGA-based cache design therefore favors a large number of small blocks over a small number of large blocks. (Cache-line size also has traffic implications at the network level.)
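The capacity figures above follow directly from the two parameters once a word size is fixed: the cache holds 2^INDEX_BITS lines of 2^OFFSET_BITS words each. The short C check below reproduces the quoted numbers assuming 32-bit (4-byte) words; it is our own illustration, not part of the toolkit.

    #include <stdio.h>

    /* Cache capacity in bytes implied by the Heracles cache parameters,
     * assuming 4-byte words: 2^INDEX_BITS lines x 2^OFFSET_BITS words. */
    static unsigned cache_bytes(unsigned index_bits, unsigned offset_bits) {
        return (1u << index_bits) * (1u << offset_bits) * 4u;
    }

    int main(void) {
        printf("%u\n", cache_bytes(6, 3)); /* 2048: the 2KB cache above      */
        printf("%u\n", cache_bytes(8, 3)); /* 8192: 8KB via INDEX_BITS 6->8  */
        printf("%u\n", cache_bytes(6, 5)); /* 8192: 8KB via OFFSET_BITS 3->5 */
        return 0;
    }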

3.3.3 Network Interface

The Address Resolution Logic works with the Packetizer module, shown in Figure 2, to allow the caches and the local memory to interact with the rest of the system. All cache traffic goes through the Address Resolution Logic, which determines whether a request can be served at the local memory or needs to be sent over the network. The Packetizer is responsible for converting data traffic, such as a load, coming from the local memory and the cache system into packets or flits that can be routed inside the Network-on-Chip (NoC), and for reconstructing packets or flits into data traffic on the other side when exiting the NoC.
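To illustrate the Packetizer's role, the sketch below models the wrapping of a cache request into a network packet; the header fields are plausible ones chosen for illustration, not the actual Heracles flit format.

    #include <stdint.h>

    /* Hypothetical memory-request packet; the real flit layout is
     * defined by the Verilog packetizer module. */
    typedef struct {
        uint16_t src_node;  /* requesting node               */
        uint16_t dst_node;  /* home node of the address      */
        uint8_t  is_write;  /* load vs. store                */
        uint32_t addr;      /* target memory address         */
        uint32_t payload;   /* store data (unused for loads) */
    } mem_packet_t;

    /* Wrap an outgoing load for the NoC; the packetizer at the far end
     * performs the inverse conversion, so remote traffic appears to the
     * local memory as just another cache communication. */
    static mem_packet_t packetize_load(uint16_t src, uint16_t dst,
                                       uint32_t addr) {
        mem_packet_t p = { src, dst, 0u, addr, 0u };
        return p;
    }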

3.3.4 Hardware multithreading and caching

In this section, we examine the effect of hardware multithreading (HMT) on system performance. We run the 197.parser application from the SPEC CINT2000 benchmarks on a single node with the dCore as the processing unit, using two different inputs–one per thread–with five different execution interleaving policies:

• setup 1: threads take turns executing every 32 cycles; on a context switch, the pipeline is drained before the execution of the other thread begins;
• setup 2: thread switching happens every 1024 cycles;
• setup 3: thread context swapping is initiated on an instruction or a data miss at the Level 1 cache;
• setup 4: thread interleaving occurs only when there is a data miss at the Level 1 cache;
• setup 5: thread switching happens when there is a data miss at the Level 2 cache.

Figure 5 shows the total completion time of the two threads (in number of cycles). It is worth noting that even with fast fine-grain hardware context switching, multithreading is most beneficial for large miss-penalty events like Level 2 cache misses or remote data accesses.

Figure 5: Effects of hardware multithreading and caching. (Bar chart: execution cycles, x10^9, of Thread1 and Thread2 under the five setups: 32, 1024, miss (L1), d-miss (L1), d-miss (L2).)
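The five interleaving policies amount to a single switch decision per cycle. The C model below is our own summary of them (the names are ours), not the dCore's actual control logic; once the decision fires, the pipeline is drained before the other thread runs.

    /* Context-switch policies of Section 3.3.4 (setups 1-5). */
    typedef enum { EVERY_32, EVERY_1024, ANY_L1_MISS,
                   DATA_L1_MISS, DATA_L2_MISS } hmt_policy_t;

    /* Returns nonzero when the active thread should be swapped out. */
    static int should_switch(hmt_policy_t policy, unsigned cycles_on_thread,
                             int i_miss_l1, int d_miss_l1, int d_miss_l2) {
        switch (policy) {
        case EVERY_32:     return cycles_on_thread >= 32u;   /* setup 1 */
        case EVERY_1024:   return cycles_on_thread >= 1024u; /* setup 2 */
        case ANY_L1_MISS:  return i_miss_l1 || d_miss_l1;    /* setup 3 */
        case DATA_L1_MISS: return d_miss_l1;                 /* setup 4 */
        case DATA_L2_MISS: return d_miss_l2;                 /* setup 5 */
        }
        return 0;
    }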
3.4 Network-on-Chip (NoC)

To provide scalability, Heracles uses a network-on-chip (NoC) architecture for its data communication infrastructure. A NoC architecture is defined by its topology (the physical organization of nodes in the network), its flow control mechanism (which establishes the data formatting, the switching protocol, and the buffer allocation), and its routing algorithm (which determines the path selected by a packet to reach its destination under a given application).

3.4.1 Flow control

Routing in Heracles can be done using either bufferless or buffered routers. Bufferless routing is generally used to reduce the area and power overhead associated with buffered routing. Contention for physical link access is resolved either by dropping and retransmitting flits or by temporarily misrouting (deflecting) them. With flit dropping, an acknowledgment mechanism is needed to enable retransmission of lost flits. With flit deflection, a priority-based arbitration, e.g., age-based, is needed to avoid livelock. In Heracles, to mitigate some of the problems associated with lossy bufferless routing, namely retransmission and slow arbitration logic, we supplement the arbiter with a routing table that can be statically configured off-line on a per-application basis.

The system's default virtual-channel router conforms in its architecture and operation to conventional virtual-channel routers [5]. It has input buffers to store flits while they are waiting to be routed to the next hop in the network. The router is modular enough to allow the user to substitute different arbitration schemes. The routing operation takes four steps or phases, namely routing (RC), virtual-channel allocation (VA), switch allocation (SA), and switch traversal (ST), where each phase corresponds to a pipeline stage in our router. Figure 6 depicts the general structure of the buffered router. In this router, the number of virtual channels per port and their sizes are controlled through the VC_PER_PORT and VC_DEPTH parameters. Table 3 shows the register and LUT utilization of the bufferless router and of different buffer configurations of the buffered router. It also shows the effect of virtual channels on router clocking speed. The key takeaway is that a larger number of VCs at the router increases both the router resource utilization and the critical path.

Figure 6: Virtual channel based router architecture. (Diagram: per-port virtual channels feeding a crossbar (Xbar), with routing logic and table, an arbiter, and the network interface.)

Table 3: FPGA resource utilization per router configuration.

Number of VCs   Bufferless   2 VCs    4 VCs    8 VCs
Registers       175          4081     7260     13374
LUTs            328          7251     12733    23585
Speed (MHz)     817.18       111.83   94.8     80.02

3.4.2 Routing algorithm

Routing algorithms used to compute routes in network-on-chip (NoC) architectures generally fall into two categories: oblivious and dynamic [16]. The default routers in Heracles primarily support oblivious routing algorithms, using either fixed logic or routing tables. Fixed logic is provided for dimension-order routing (DOR) algorithms, which are widely used and have many desirable properties. Table-based routing, on the other hand, provides greater programmability and flexibility, since routes can be pre-computed and stored in the routing tables before execution. Both buffered and bufferless routers can make use of the routing tables. Heracles provides support for both static and dynamic virtual channel allocation.
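As an illustration of the fixed-logic option, dimension-order (XY) routing on a 2D mesh reduces to a few comparisons: a packet is routed fully in the X dimension before moving in Y. The port names in this C sketch are ours, not the Heracles identifiers.

    /* Output ports of a 2D-mesh router (illustrative naming). */
    typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST,
                   PORT_NORTH, PORT_SOUTH } out_port_t;

    /* Dimension-order routing: resolve X first, then Y. */
    static out_port_t dor_xy_route(int cur_x, int cur_y,
                                   int dst_x, int dst_y) {
        if (dst_x > cur_x) return PORT_EAST;
        if (dst_x < cur_x) return PORT_WEST;
        if (dst_y > cur_y) return PORT_NORTH;
        if (dst_y < cur_y) return PORT_SOUTH;
        return PORT_LOCAL;  /* arrived: deliver to the core */
    }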

3.4.3 Network Topology Configuration

The parameterization of the number of input and output ports on the router, together with the table-based routing capability, gives Heracles a great amount of flexibility and the ability to metamorphose into different network topologies; for example, k-ary n-cube, 2D-mesh, 3D-mesh, hypercube, ring, or tree. A new topology is constructed by changing the IN_PORTS, OUT_PORTS, and SWITCH_TO_SWITCH parameters and reconnecting the routers. Table 4 shows the clocking speed of a bufferless router, a buffered router with strict round-robin arbitration (Arbiter1), a buffered router with weak round-robin arbitration (Arbiter2), and a buffered router with 7 ports for a 3D-mesh network. The bufferless router runs the fastest, at 817.2MHz; Arbiter1 and Arbiter2 run at about the same speed (~112MHz), although the arbitration scheme in Arbiter2 is more complex. The 7-port router runs the slowest, due to its more complex arbitration logic.

Table 4: Clocking speed of different router types.

Router type   Bufferless   Arbiter1   Arbiter2   7-Port
Speed (MHz)   817.18       111.83     112.15     101.5

Figure 7 shows a 3x3 2D-mesh with all identical routers. Figure 8 depicts an unbalanced fat-tree topology. For a fat-tree [11] topology, routers at different levels of the tree have different sizes, in terms of crossbar and arbitration logic. The root node contains the largest router, and controls the clock frequency of the system.

Figure 7: 2D mesh topology. (Diagram: 3x3 grid of nodes, each holding a core, a starting PC, a routing table, and data.)

Figure 8: Unbalanced fat-tree topology. (Diagram: tree of nodes with the largest router at the root.)
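The port-count arithmetic behind such reconfiguration is simple, as the sketch below shows for the mesh family; the 7-port router used for the 3D mesh in Table 4 corresponds to dims = 3. (The helper is our own illustration of how values for IN_PORTS and OUT_PORTS would be chosen.)

    /* Router ports for an n-dimensional mesh node: two neighbors per
     * dimension plus the local core port. A 2D mesh needs 5 ports;
     * the 3D-mesh router of Table 4 needs 2*3 + 1 = 7. */
    static unsigned mesh_router_ports(unsigned dims) {
        return 2u * dims + 1u;
    }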

4. HERACLES PROGRAMMING MODELS

A programming model is inherently tied to the underlying hardware architecture. The Heracles design tool has no directly deployable operating system, but it supports both sequential and parallel programming models through various application programming interface (API) protocols. There are three memory spaces associated with each program: instruction memory, data memory, and stack memory. Dynamic memory allocation is not supported in the current version of the tool.

4.1 Sequential programming model

In the sequential programming model, a program has a single entry point (starting program counter–PC) and a single execution thread. Under this model, a program may exhibit any of the following behaviors, or a combination thereof:

• the local memory of the executing core holds the instruction binary;
• the PC points to another core's local memory, where the instruction binary resides;
• the stack frame pointer (SP) points to the local memory of the executing core;
• the SP points to another core's local memory for the storage of the stack frame;
• the program data is stored in the local memory;
• the program data is mapped to another core's local memory.

Figure 9: Examples of memory space management for sequential programs. (Diagram, cores B, A, C with A executing: (a) PC, SP, and DATA all on core A; (b) PC on B, SP and DATA on A; (c) PC on B, SP on C, DATA on A.)

Figure 9 gives illustrating examples of the memory space management when dealing with sequential programs. These techniques provide the programming flexibility needed to support the different physical memory configurations. They also allow users:

• to run the same program on multiple cores (program inputs can be different); in this setup the program binary is loaded to one core, and the program counter at the other cores points to the core with the binary;
• to execute one program across multiple cores by migrating the program from one core to another.
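To make the placement options concrete, the fragment below models the three Figure 9 scenarios as data. The struct is purely illustrative; in the toolkit, the actual assignment is done through the linker stage and the GUI, not through a C API.

    /* Illustrative per-program memory-space placement: each of the
     * binary (PC), stack (SP), and data sections can live on the
     * executing core or on another core's local memory. */
    enum core_id { A, B, C };

    typedef struct {
        enum core_id exec_core;  /* core that executes the program      */
        enum core_id pc_core;    /* core holding the instruction binary */
        enum core_id sp_core;    /* core holding the stack frame        */
        enum core_id data_core;  /* core holding the program data       */
    } placement_t;

    /* The three scenarios of Figure 9: */
    static const placement_t fig9_a = { A, A, A, A }; /* all local to A   */
    static const placement_t fig9_b = { A, B, A, A }; /* binary on core B */
    static const placement_t fig9_c = { A, B, C, A }; /* stack on core C  */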

4.2 Parallel programming model

The Heracles design platform supports both hardware multithreading and software multithreading. The keywords HHThread1 and HHThread2 are provided to specify the part of the program to execute on each hardware thread. Multiple programs can also be executed on the same core using the same approach. An example is shown below:

...
int HHThread1(int *array, int item, int size) {
    sort(array, size);
    int output = search(array, item, size);
    return output;
}

int HHThread2(int input, int src, int aux, int dest) {
    int output = Hanoi(input, src, aux, dest);
    return output;
}
...

Below is the associated binary outline:

@e2 //
27bdffd8 // 00000388 addiu sp,sp,-40
...
0c00004c // 000003c0 jal 130
...

@fa //
27bdffd8 // 000003e8 addiu sp,sp,-40
...
0c0000ab // 00000418 jal 2ac
...

@110 //
27bdffa8 // 00000440 addiu sp,sp,-88
...
0c0000e2 // 000004bc jal 388
...

@15e //
27bdffa8 // 00000440 addiu sp,sp,-88
...
0c0000fa // 000004d8 jal 3e8
...

Heracles uses OpenMP-style pragmas to allow users to directly compile multi-threaded programs onto cores. Users specify programs, or parts of a program, that can be executed in parallel, and on which cores. The keywords HLock and HBarrier are provided for synchronization, and shared variables are encoded with the keyword HGlobal. An example is shown below:

#pragma Heracles core 0 {
    // Synchronizers
    HLock lock1, lock2;
    HBarrier bar1, bar2;

    // Variables
    HGlobal int arg1, arg2, arg3;
    HGlobal int Arr[16][16];
    HGlobal int Arr0[16][16] = { { 1, 12, 7, 0,...
    HGlobal int Arr1[16][16] = { { 2, 45, 63, 89,...

    // Workers
    #pragma Heracles core 1 {
        start_check(50);
        check_lock(&lock1, 1000);
        matrix_add(Arr, Arr0, Arr1, arg1);
        clear_barrier(&bar1);
    }

    #pragma Heracles core 2 {
        start_check(50);
        check_lock(&lock1, 1000);
        matrix_add(Arr, Arr0, Arr1, arg2);
        clear_barrier(&bar2);
    }
}
...
Below is the intermediate C representation:

...
int core_0_lock1, core_0_lock2;
int core_0_bar1, core_0_bar2;
int core_0_arg1, core_0_arg2, core_0_arg3;
int core_0_Arr[16][16];
int core_0_Arr0[16][16] = { { 1, 12, 7, 0,...
int core_0_Arr1[16][16] = { { 2, 45, 63, 89,...

void core_0_work (void) {
    // Synchronizers
    // Variables
    // Workers
    // Main function
    main();
}

void core_1_work (void)
{
    start_check(50);
    check_lock(&core_0_lock1, 1000);
    matrix_add(core_0_Arr, core_0_Arr0, core_0_Arr1,
               core_0_arg1);
    clear_barrier(&core_0_bar1);
}
...

Below is the associated binary outline:

@12b //
27bdffe8 // 000004ac addiu sp,sp,-24
...
0c000113 // 000004bc jal 44c
...

5. PROGRAMMING TOOLCHAIN

5.1 Program compilation flow

The Heracles environment has an open-source compiler toolchain to assist in developing software for different system configurations. The toolchain is built around the GCC MIPS cross-compiler, using GNU C version 3.2.1. Figure 10 depicts the software flow for compiling a C program into compatible MIPS instruction code that can be executed on the system. The compilation process consists of a series of six steps:

• First, the user invokes mips-gcc to translate the C code into assembly code (e.g., ./mips-gcc -S fibonacci.c).
• In step 2, the assembly code is run through the isa-checker (e.g., ./checker fibonacci.s). The checker's role is to: (1) remove all memory space primitives, (2) replace all pseudo-instructions, and (3) check for floating point instructions. Its output is a .asm file.
• For this release, there is no direct high-level operating system support. Therefore, in the third compilation stage, a small kernel-like assembly code is added to the application assembly code for memory space management and workload distribution (e.g., ./linker fibonacci.asm). Users can modify the linker.cpp file provided in the toolchain to reconfigure the memory space and workload.
• In step 4, the user compiles the assembly file into an object file using the cross-compiler. This is accomplished by executing mips-as on the .asm file (e.g., ./mips-as fibonacci.asm).
• In step 5, the object file is disassembled using the mips-objdump command (e.g., ./mips-objdump fibonacci.o). Its output is a .dump file.
• Finally, the constructor script is called to transform the dump file into a Verilog memory (.vmh) file format (e.g., ./dump2vmh fibonacci.dump).

If the program is specified in the Heracles multi-threading format (.hc or .hcc), c-cplus-generator (e.g., ./c-cplus-generator fibonacci.hc) is called first to produce the C or C++ program file before executing the steps listed above. The software toolchain is still evolving. All these steps are also automated through the GUI.
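Collected in one place, the six steps for the fibonacci example used above form the following command sequence (a .hc or .hcc source would first be passed through the generator, e.g., ./c-cplus-generator fibonacci.hc):

    ./mips-gcc -S fibonacci.c      # step 1: C to assembly (fibonacci.s)
    ./checker fibonacci.s          # step 2: ISA check (fibonacci.asm)
    ./linker fibonacci.asm         # step 3: add kernel-like memory/workload code
    ./mips-as fibonacci.asm        # step 4: assemble (fibonacci.o)
    ./mips-objdump fibonacci.o     # step 5: disassemble (fibonacci.dump)
    ./dump2vmh fibonacci.dump      # step 6: emit Verilog memory file (.vmh)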

Figure 10: Software toolchain flow. (Diagram: source code file, e.g., fibonacci.c -> compiler [mips-gcc] -> assembly code file, e.g., fibonacci.s -> architecture compatibility checker [isa-checker] -> compatible assembly code, e.g., fibonacci.asm -> workload distribution / memory-aware assembly code (automatic for .hc or .hcc, starting from a Heracles code file, e.g., fibonacci.hc, via the c-cplus-generator) -> assembler [mips-as] -> object code file, e.g., fibonacci.o -> disassembler [mips-objdump] -> dump file, e.g., fibonacci.dump -> constructor [dump2vmh] -> Verilog hex memory file, e.g., fibonacci.mem.)

5.2 Heracles graphical user interface

The graphical user interface (GUI) is called Heracles Designer. It helps to quickly configure and launch system configurations. Figure 11 shows a screenshot of the GUI. On the core tab, the user can select: (1) the type of core to generate, (2) the network topology of the system instance to generate, (3) the number of cores to generate, (4) the traffic type, injection rate, and simulation cycles in the case of an injector core, or (5) different pre-configured settings.

Figure 11: Heracles Designer graphical user interface. (Screenshot.)

The Generate and Run buttons on this tab are used to automatically generate the Verilog files and to launch the synthesis process or the specified simulation environment. The second tab–the memory system tab–allows the user to set: (1) the main memory configuration (e.g., Uniform Distributed), (2) the total main memory size, (3) the instruction and data cache sizes, (4) the Level 2 cache, and (5) the FPGA-favored resource (LUT or block RAM) for cache structures. The on-chip network tab covers all the major aspects of the system interconnect: (1) routing algorithm, (2) number of virtual channels (VCs) per port, (3) VC depth, (4) core and switch bandwidths, (5) routing table programming, by selecting source/destination pair or flow ID, router ID, output port, and VC (allowing user-defined routing paths), and (6) number of flits per packet for injector-based traffic. The programming tab is updated when the user changes the number of cores in the system; the user can: (1) load a binary file onto a core, (2) load a binary onto a core and set the starting address of another core to point to that binary, (3) select where to place the data section or stack pointer of a core (local, on the same core as the binary, or on another core), and (4) select which cores to start.

6. EXPERIMENTAL RESULTS

6.1 Full 2D-mesh systems

The synthesis results of five multicore systems of sizes 2x2, 3x3, 4x4, 5x5, and 6x6, arranged in a 2D-mesh topology, are summarized below. Table 5 gives the key architectural characteristics of the multicore system. All five systems run at 105.5MHz, which is the clock frequency of the router, regardless of the size of the mesh.

Table 5: 2D-mesh system architecture details.

Core
  ISA                           32-Bit MIPS
  Hardware threads              1
  Pipeline Stages               7
  Bypassing                     Full
  Branch policy                 Always non-Taken
  Outstanding memory requests   1
Level 1 Instruction/Data Caches
  Associativity                 Direct
  Size                          variable
  Outstanding Misses            1
On-Chip Network
  Topology                      2D-Mesh
  Routing Policy                DOR and Table-based
  Virtual Channels              2
  Buffers per channel           8

Figure 12 summarizes the FPGA resource utilization of the different systems in terms of registers, lookup tables, and block RAMs. In the 2x2 and 3x3 configurations, the local memory is set to 260KB per core. The 3x3 configuration uses 99% of the block RAM resources at 260KB of local memory per core. For the 4x4 configuration, the local memory is reduced to 64KB per core, and the local memory in the 5x5 configuration is set to 32KB. The 6x6 configuration, with 16KB of local memory per core, fails during the mapping and routing synthesis steps, due to the lack of LUTs.

Figure 12: Percentage of FPGA resource utilization (registers, LUTs, block RAMs) per mesh size, 2x2 through 6x6 (the 6x6 configuration fails synthesis).

6.2 Evaluation results

We examine the performance of two SPEC CINT2000 benchmarks, namely 197.parser and 256.bzip2, on Heracles. We modify and parallelize these benchmarks to fit into our evaluation framework. For the 197.parser benchmark, we identify three functional units: file reading and parameter setting as one unit, actual parsing as a second unit, and error reporting as the third unit.

Figure 13: Effect of memory organization on performance for the different applications. ((a) 197.parser, (b) 256.bzip2, (c) Fibonacci.)

Figure 14: Effect of routing algorithm on performance in the 2D-mesh for the different applications. ((a) 197.parser, (b) 256.bzip2, (c) Fibonacci.)

When there are more than three cores, all additional cores are used in the parsing unit. Similarly, 256.bzip2 is divided into three functional units: file reading and cyclic redundancy check, compression, and output file writing. The compression unit exhibits a high degree of data-parallelism; therefore, we apply all additional cores to this unit for core counts greater than three.

We also present a brief analysis of a simple Fibonacci number calculation program. Figures 13 (a), (b), and (c) show the 197.parser, 256.bzip2, and Fibonacci benchmarks under single shared-memory (SSM) and distributed shared-memory (DSM), using XY-Ordered routing. Increasing the number of cores improves performance for both benchmarks; it also exposes the memory bottleneck encountered in the single shared-memory scheme.

Figures 14 (a), (b), and (c) highlight the impact of the routing algorithm on the overall system performance, by comparing completion cycles under XY-Ordered routing and BSOR [9]. BSOR, which stands for Bandwidth-Sensitive Oblivious Routing, is a table-based routing algorithm that minimizes the maximum channel load (MCL) (or maximum traffic) across all network links, in an effort to maximize application throughput. The routing algorithm has little or no effect on the performance of the 197.parser and 256.bzip2 benchmarks, because of the traffic patterns in these applications. For the Fibonacci application, Figure 14 (c), BSOR routing does improve performance, particularly with 5 or more cores. To show the multithreading and scalability properties of the system, Figure 15 presents the execution times for matrix multiplication given matrices of different sizes. The Heracles multicore programming format is used to automate the distribution of the workload onto the cores.

Figure 15: Matrix multiplication acceleration. (Execution cycles, x10^8, for 4, 9, and 16 cores on matrix sizes 16x16, 32x32, 64x64, 80x80, 100x100, and 128x128.)

7. CONCLUSION

In this work, we present the new Heracles design toolkit, which comprises soft hardware (HDL) modules, an application compiler toolchain, and a graphical user interface. It is a component-based framework that gives researchers the ability to create complete, realistic, synthesizable multi/many-core architectures for fast, high-accuracy design space exploration. In this environment, the user can explore design tradeoffs at the processing unit level, the memory organization and access level, and the network-on-chip level. The Heracles tool is open-source and can be downloaded at http://projects.csail.mit.edu/heracles. In the current release, RTL hardware modules can be simulated on all operating systems, the MIPS GCC cross-compiler runs in a Linux environment, and the graphical user interface has a Windows installer.

8. ACKNOWLEDGMENTS

We thank the students from the MIT 6.S918 IAP 2012 class who worked with early versions of the tool and made many excellent suggestions for improvement. We thank Joel Emer, Li-Shiuan Peh, and Omer Khan for interesting discussions throughout the course of this work.

9. REFERENCES

[1] J. Andersson, J. Gaisler, and R. Weigand. Next generation multipurpose microprocessor. 2010.
[2] N. Banerjee, P. Vellanki, and K. Chatha. A power and performance model for network-on-chip architectures. In Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings, volume 2, pages 1250-1255, Feb. 2004.
[3] M. H. Cho, K. S. Shim, M. Lis, O. Khan, and S. Devadas. Deadlock-free fine-grained thread migration. In Networks on Chip (NoCS), 2011 Fifth IEEE/ACM International Symposium on, pages 33-40, May 2011.
[4] C. R. Clack, R. Nathuji, and H.-H. S. Lee. Using an FPGA as a prototyping platform for multi-core processor applications. In WARFP-2005: Workshop on Architecture Research using FPGA Platforms, Cambridge, MA, USA, Feb. 2005.
[5] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2003.
[6] P. Del Valle, D. Atienza, I. Magan, J. Flores, E. Perez, J. Mendias, L. Benini, and G. De Micheli. A complete multi-processor system-on-chip FPGA-based emulation framework. In Very Large Scale Integration, 2006 IFIP International Conference on, pages 140-145, Oct. 2006.
[7] K. E. Fleming, M. Adler, M. Pellauer, A. Parashar, A. Mithal, and J. Emer. Leveraging latency-insensitivity to ease multiple FPGA design. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA '12, pages 175-184, New York, NY, USA, 2012.
[8] N. Genko, D. Atienza, G. De Micheli, J. Mendias, R. Hermida, and F. Catthoor. A complete network-on-chip emulation framework. In Design, Automation and Test in Europe, 2005. Proceedings, pages 246-251 Vol. 1, March 2005.
[9] M. Kinsy, M. H. Cho, T. Wen, E. Suh, M. van Dijk, and S. Devadas. Application-aware deadlock-free oblivious routing. In Proceedings of the Int'l Symposium on Computer Architecture, June 2009.
[10] M. Kinsy, M. Pellauer, and S. Devadas. Heracles: Fully synthesizable parameterized MIPS-based multicore system. In Field Programmable Logic and Applications (FPL), 2011 International Conference on, pages 356-362, Sept. 2011.
[11] C. E. Leiserson. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Trans. Comput., 34(10):892-901, 1985.
[12] M. Lis, P. Ren, M. H. Cho, K. S. Shim, C. Fletcher, O. Khan, and S. Devadas. Scalable, accurate multicore simulation in the 1000-core era. In Performance Analysis of Systems and Software (ISPASS), 2011 IEEE International Symposium on, pages 175-185, April 2011.
[13] A. Lusala, P. Manet, B. Rousseau, and J.-D. Legat. NoC implementation in FPGA using torus topology. In Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on, pages 778-781, Aug. 2007.
[14] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50-58, Feb. 2002.
[15] J. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal. Graphite: A distributed parallel simulator for multicores. In High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on, pages 1-12, Jan. 2010.
[16] L. M. Ni and P. K. McKinley. A survey of wormhole routing techniques in direct networks. Computer, 26(2):62-76, 1993.
[17] D. Patterson and J. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, 2005.
[18] M. Pellauer, M. Adler, M. Kinsy, A. Parashar, and J. Emer. HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing. In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pages 406-417, Feb. 2011.
[19] N. Saint-Jean, G. Sassatelli, P. Benoit, L. Torres, and M. Robert. HS-Scale: A hardware-software scalable MP-SoC architecture for embedded systems. In VLSI, 2007. ISVLSI '07. IEEE Computer Society Annual Symposium on, pages 21-28, March 2007.
[20] Z. Tan, A. Waterman, R. Avizienis, Y. Lee, H. Cook, D. Patterson, and K. Asanović. RAMP Gold: An FPGA-based architecture simulator for multiprocessors. In Design Automation Conference (DAC), 2010 47th ACM/IEEE, pages 463-468, June 2010.
[21] W. Yu. GEMS: A high performance EM simulation tool. In Electrical Design of Advanced Packaging Systems Symposium, 2009 (EDAPS 2009). IEEE, pages 1-4, Dec. 2009.
H˚allberg, J. H¨ogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system 9. REFERENCES simulation platform. Computer, 35(2):50–58, feb 2002. [1] J. Andersson, J. Gaisler, and R. Weigand. Next [15] J. Miller, H. Kasture, G. Kurian, C. Gruenwald, generation multipurpose microprocessor. 2010. N. Beckmann, C. Celio, J. Eastep, and A. Agarwal. [2] N. Banerjee, P. Vellanki, and K. Chatha. A power and Graphite: A distributed parallel simulator for performance model for network-on-chip architectures. multicores. In High Performance Computer In Design, Automation and Test in Europe Conference Architecture (HPCA), 2010 IEEE 16th International and Exhibition, 2004. Proceedings, volume 2, pages Symposium on, pages 1–12, jan. 2010. 1250 – 1255 Vol.2, feb. 2004. [16] L. M. Ni and P. K. McKinley. A survey of wormhole [3] M. H. Cho, K. S. Shim, M. Lis, O. Khan, and routing techniques in direct networks. Computer, S. Devadas. Deadlock-free fine-grained thread 26(2):62–76, 1993. migration. In Networks on Chip (NoCS), 2011 Fifth [17] D. Patterson and J. Hennessy. Computer Organization IEEE/ACM International Symposium on, pages 33 and Design: The Hardware/software Interface. –40, may 2011. Morgan Kaufmann, 2005. [4] C. R. Clack, R. Nathuji, and H.-H. S. Lee. Using an [18] M. Pellauer, M. Adler, M. Kinsy, A. Parashar, and fpga as a prototyping platform for multi-core J. Emer. Hasim: Fpga-based high-detail multicore processor applications. In WARFP-2005: Workshop simulation using time-division multiplexing. In High on Architecture Research using FPGA Platforms, Performance Computer Architecture (HPCA), 2011 Cambridge, MA, USA, feb. 2005. IEEE 17th International Symposium on, pages [5] W. J. Dally and B. Towles. Principles and Practices of 406–417, feb. 2011. Interconnection Networks. Morgan Kaufmann, 2003. [19] N. Saint-Jean, G. Sassatelli, P. Benoit, L. Torres, and [6] P. Del valle, D. Atienza, I. Magan, J. Flores, E. Perez, M. Robert. Hs-scale: a hardware-software scalable J. Mendias, L. Benini, and G. Micheli. A complete mp-soc architecture for embedded systems. In VLSI, multi-processor system-on-chip fpga-based emulation 2007. ISVLSI ’07. IEEE Computer Society Annual framework. In Very Large Scale Integration, 2006 IFIP Symposium on, pages 21–28, march 2007. International Conference on, pages 140–145, oct. 2006. [20] Z. Tan, A. Waterman, R. Avizienis, Y. Lee, H. Cook, [7] K. E. Fleming, M. Adler, M. Pellauer, A. Parashar, D. Patterson, and K. Asanovict’ and. Ramp gold: An A. Mithal, and J. Emer. Leveraging fpga-based architecture simulator for multiprocessors. latency-insensitivity to ease multiple fpga design. In In Design Automation Conference (DAC), 2010 47th Proceedings of the ACM/SIGDA international ACM/IEEE, pages 463–468, june 2010. symposium on Field Programmable Gate Arrays, [21] W. Yu. Gems a high performance em simulation tool. FPGA ’12, pages 175–184, New York, NY, USA, 2012. In Electrical Design of Advanced Packaging Systems [8] N. Genko, D. Atienza, G. De Micheli, J. Mendias, Symposium, 2009. (EDAPS 2009). IEEE, pages 1–4, R. Hermida, and F. Catthoor. A complete dec. 2009.