
Heracles: A Tool for Fast RTL-Based Design Space Exploration of Multicore Processors

Michel A. Kinsy, Srinivas Devadas
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
mkinsy, [email protected]

Michael Pellauer
Intel Corporation, VSSAD Group
[email protected]

ABSTRACT

This paper presents Heracles, an open-source, functional, parameterized, synthesizable multicore system toolkit. Such a multi/many-core design platform is a powerful and versatile research and teaching tool for architectural exploration and hardware-software co-design. The Heracles toolkit comprises the soft hardware (HDL) modules, application compiler, and graphical user interface. It is designed with a high degree of modularity to support fast exploration of future multicore processors of different topologies, routing schemes, processing elements (cores), and memory system organizations. It is a component-based framework with parameterized interfaces and a strong emphasis on module reusability. The compiler toolchain is used to map C or C++ based applications onto the processing units. The GUI allows the user to quickly configure and launch a system instance for easy factorial development and evaluation. Hardware modules are implemented in synthesizable Verilog and are FPGA platform independent. The Heracles tool is freely available under the open-source MIT license at: http://projects.csail.mit.edu/heracles.

Categories and Subject Descriptors

C.1.2 [Computer Systems Organization]: Processor Architecture - Single-instruction-stream, multiple-data-stream processors (SIMD); B.5.1 [Hardware]: Register-Transfer-Level Implementation - Design.

General Terms

Tool, Design, Experimentation, Performance

Keywords

Multicore Architecture Design, RTL-Based Design, FPGA, Shared Memory, Distributed Shared Memory, Network-on-Chip, RISC, MIPS, Hardware Migration, Hardware Multithreading, Virtual Channel, Wormhole Router, NoC Routing

1. INTRODUCTION

The ability to integrate various computation components such as processing cores, memories, custom hardware units, and complex network-on-chip (NoC) communication protocols onto a single chip has significantly enlarged the design space of multi/many-core systems. The design of these systems requires tuning of a large number of parameters in order to find the most suitable hardware configuration, in terms of performance, area, and energy consumption, for a target application domain. This increasing complexity makes the need for efficient and accurate design tools more acute.

There are two main approaches currently used in the design space exploration of multi/many-core systems. One approach consists of building software routines for the different system components and simulating them to analyze system behavior. Software simulation has many advantages: i) large programming tool support; ii) internal states of all system modules can be easily accessed and altered; iii) compilation/re-compilation is fast; and iv) it is less constraining in terms of the number of components (e.g., number of cores) to simulate. Some of the most stable and widely used software simulators are Simics [14]–a commercially available full-system simulator–GEMS [21], Hornet [12], and Graphite [15]. However, software simulation of many-core architectures with cycle- and bit-level accuracy is time-prohibitive, and many of these systems have to trade off evaluation accuracy for execution speed. Although such a tradeoff is fair and even desirable in the early phase of design exploration, making final micro-architecture decisions based on these software models over truncated applications or application traces leads to inaccurate or misleading system characterization.

The second approach, often preceded by software simulation, is register-transfer level (RTL) simulation or emulation.
This level of accuracy considerably reduces system behavior mis-characterization and helps avoid late discovery of system performance problems. The primary disadvantage of RTL simulation/emulation is that as the design size increases, so does the simulation time. However, this problem can be circumvented by adopting synthesizable RTL and using hardware-assisted accelerators–field programmable gate arrays (FPGAs)–to speed up system execution. Although FPGA resources constrain the size of design one can implement, recent advances in FPGA-based design methodologies have shown that such constraints can be overcome. HAsim [18], for example, has shown using its time multiplexing technique how one can model a shared-memory multicore system, including detailed core pipelines, cache hierarchy, and on-chip network, on a single FPGA. RAMP Gold [20] is able to simulate a 64-core shared-memory target machine capable of booting real operating systems on a single Xilinx Virtex-5 FPGA board. Fleming et al [7] propose a mechanism by which complex designs can be efficiently and automatically partitioned among multiple FPGAs.

RTL design exploration for multi/many-core systems nonetheless remains unattractive to most researchers, because it is still a time-consuming endeavor to build such large designs from the ground up and to ensure correctness at all levels. Furthermore, researchers are generally interested in one key system area, such as the processing core and/or memory organization, network interface, interconnect network, or operating system and/or application mapping. Therefore, we believe that if there is a platform-independent design framework–more specifically, a general hardware toolkit–which allows designers to compose their systems and modify them at will, with very little effort or knowledge of other parts of the system, the speed versus accuracy dilemma in design space exploration of many-core systems can be further mitigated.

To that end we present Heracles, a functional, modular, synthesizable, parameterized multicore system toolkit. It is a powerful and versatile research and teaching tool for architectural exploration and hardware-software co-design. Without loss in timing accuracy and logic, complete systems can be constructed, simulated, and/or synthesized onto FPGA with minimal effort. The initial framework is presented in [10]. Heracles is designed with a high degree of modularity to support fast exploration of future multicore processors–different topologies, routing schemes, processing elements or cores, and memory system organizations–by using a library of components and reusing user-defined hardware blocks between different system configurations or projects. It has a compiler toolchain for mapping applications written in C or C++ onto the core units. The graphical user interface (GUI) allows the user to quickly configure and launch a system instance for easily-factored development and evaluation. Hardware modules are implemented in synthesizable Verilog and are FPGA platform independent.

2. RELATED WORK

In [6] Del Valle et al present an FPGA-based emulation framework for multiprocessor system-on-chip (MPSoC) architectures. LEON3, a synthesizable VHDL model of a 32-bit processor compliant with the SPARC V8 architecture, has been used in implementing multiprocessor systems on FPGAs. Andersson et al [1], for example, use the LEON4FT microprocessor to build their Next Generation Multipurpose Microprocessor (NGMP) architecture, which is prototyped on the Xilinx XC5VFX130T FPGA board. However, the LEON architecture is fairly complex, and it is difficult to instantiate more than two or three cores on a medium-sized FPGA. Clack et al [4] investigate the use of FPGAs as a prototyping platform for developing multicore system applications. They use the Xilinx MicroBlaze processor for the core, and a bus protocol for the inter-core communication. Some designs focus primarily on the Network-on-Chip (NoC). Lusala et al [13], for example, propose a scalable implementation of a NoC on FPGA using a torus topology. Genko et al [8] also present an FPGA-based flexible emulation environment for exploring different NoC features. A VHDL-based cycle-accurate RTL model for evaluating power and performance of NoC architectures is presented in Banerjee et al [2]. Other designs make use of multiple FPGAs. H-Scale [19], by Saint-Jean et al, is a multi-FPGA based homogeneous SoC, with RISC processors and an asynchronous NoC. The S-Scale version supports a multi-threaded sequential programming model with dedicated communication primitives handled at run-time by a simple operating system.
3. HERACLES HARDWARE SYSTEM

Heracles presents designers with a global and complete view of the inner workings of the multi/many-core system at cycle-level granularity, from instruction fetches at the processing core in each node to the flit arbitration at the routers. It enables designers to explore different implementation parameters–core micro-architecture, levels of caches, cache sizes, routing algorithm, router micro-architecture, distributed or shared memory, or network interface–and to quickly evaluate their impact on the overall system performance. It is implemented with user-enabled performance counters and probes.

3.1 System overview

Figure 1: Heracles-based design flow. (Diagram: an application, single- or multi-threaded C or C++, passes through the software environment–MIPS-based Linux GNU GCC cross compiler and hardware-config-aware application mapping–onto the component-based hardware design–processing elements selection, memory organization configuration, and network-on-chip topology and routing settings–and is then evaluated through RTL-level simulation or FPGA-based emulation.)

Figure 1 illustrates the general Heracles-based design flow. Full applications–written in single or multithreaded C or C++–can be directly compiled onto a given system instance using the Heracles MIPS-based GCC cross compiler. The detailed compilation process and application examples are presented in Section 5. For a multi/many-core system, we take a component-based approach by providing clear interfaces to all modules for easy composition and substitutions. The system has multiple default settings to allow users to quickly get a system running and focus only on their area of interest. The system and application binary can be executed in an RTL simulated environment and/or on an FPGA. Figure 2 shows two different views of a typical network node structure in Heracles.

Figure 2: Network node structure. ((a) Expanded view of the local memory structure: processing core, caches, address translation logic, local main memory, and packetizer; (b) expanded view of the routing structure: router and network interface.)
3.2 Processing Units

In the current version of the Heracles design framework, users can instantiate four different types of processor cores, or any combination thereof, depending on the programming model adopted and the architectural evaluation goals.

3.2.1 Injector Core

The injector core (iCore) is the simplest processing unit. It emits and/or collects from the network user-defined data streams and traffic patterns. Although it does not do any useful computation, this type of core is useful when the user is focusing only on the network-on-chip behavior. It is useful for generating network traffic and enabling the evaluation of network congestion. Often, applications running on real cores fail to produce enough data traffic to saturate the network.

3.2.2 Single Hardware-Threaded MIPS Core

This is an integer 7-stage 32-bit MIPS–Microprocessor without Interlocked Pipeline Stages–core (sCore). This RISC architecture is widely used in commercial products and for teaching purposes [17]. Most users are very familiar with this architecture and its operation, and will be able to easily modify it when necessary. Our implementation is generally standard, with some modifications for FPGAs; for example, the adoption of a 7-stage pipeline, due to block RAM access time on the FPGA. The architecture is fully bypassed, with no branch prediction table or branch delay slot, running the MIPS-III instruction set architecture (ISA) without floating point. Instruction and data caches are implemented using block RAMs, and instruction fetch and data memory access take two cycles. Stall and bypass signals are modified to support the extended pipeline. Instructions are issued and executed in-order, and the data memory accesses are also in-order.

3.2.3 Two-way Hardware-Threaded MIPS Core

The dCore is a fully functional, fine-grain hardware multithreaded MIPS core. There are two hardware threads in the core. The execution datapath for each thread is similar to that of the single-threaded core above. Each of the two threads has its own context, which includes a program counter (PC), a set of 32 data registers, and one 32-bit state register. The core can dispatch instructions from any one of the hardware contexts, and supports precise interrupts (doorbell type) with limited state saving. A single hardware thread is active on any given cycle, and pipeline stages must be drained between context switches to avoid state corruption. The user has the ability to control the context switching conditions, e.g., the minimum number of cycles to allocate to each hardware thread at a time, or instruction or data cache misses.

3.2.4 Two-way Hardware-Threaded MIPS Core with Migration

The fourth type of core (mCore) is also a two-way hardware-threaded processor, but enhanced to support hardware-level thread migration and evictions. It is the user's responsibility to guarantee deadlock-freedom under this core configuration. One approach is to allocate local memory to contexts, so that on migration they are removed from the network. Another approach, which requires no additional hardware modification to the core, is to use the deadlock-free thread migration scheme of Cho et al [3].

3.2.5 FPGA Synthesis

All the cores have the same interface; they are self-contained and oblivious to the rest of the system, and therefore easily interchangeable. The cores are synthesized using Xilinx ISE Design Suite 11.5, targeting the Virtex-6 LX550T (package ff1760, speed grade -2) FPGA board. The numbers of slice registers and slice lookup tables (LUTs) on the board are 687360 and 343680, respectively. Table 1 shows the register and LUT utilization of the different cores. The two-way hardware-threaded core with migration consumes the most resources, and even it uses less than 0.5% of the board. Table 1 also shows the clocking speed of the cores. The injector core, which does no useful computation, runs the fastest at 500.92MHz, whereas the two-way hardware-threaded core runs the slowest at 118.66MHz.

Table 1: FPGA resource utilization per core type.

Core type     iCore    sCore    dCore    mCore
Registers     227      1660     2875     3484
LUTs          243      3661     5481     6293
Speed (MHz)   500.92   172.02   118.66   127.4
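To make the per-thread state described in Section 3.2.3 concrete, the following minimal C model sketches what one dCore hardware context holds; the struct and field names are our own illustration, not identifiers from the Heracles Verilog source.

    #include <stdint.h>

    #define NUM_GPRS 32  /* MIPS general-purpose registers */

    /* One dCore hardware thread context: a program counter, a set of
     * 32 data registers, and one 32-bit state register (Section 3.2.3). */
    typedef struct {
        uint32_t pc;             /* program counter */
        uint32_t gpr[NUM_GPRS];  /* per-thread register file */
        uint32_t state;          /* 32-bit state register */
    } hw_context_t;

    /* The dCore keeps two such contexts; a single hardware thread is
     * active on any given cycle, and the pipeline is drained on a switch. */
    typedef struct {
        hw_context_t ctx[2];
        int active;              /* index of the currently running thread */
    } dcore_state_t;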
3.3 Memory System Organization

The memory system in Heracles is parameterized and can be set up in various ways, independent of the rest of the system. The key components are the main memory, the caching system, and the network interface.

3.3.1 Main Memory Configuration

The main memory is constructed to allow different memory space configurations. For a Centralized Shared Memory (CSM) implementation, all processors share a single large main memory block; the local memory size (shown in Figure 2) is simply set to zero at all nodes except one. In Distributed Shared Memory (DSM), where each processing element has a local memory, the local memory is parameterized and has two very important attributes: the size can be changed on a per-core basis, providing support for both uniform and non-uniform distributed memory, and it can service a variable number of caches in a round-robin fashion. Figure 3 illustrates these physical memory partitions. The fact that the local memory is parameterized to handle requests from a variable number of caches allows traffic coming into a node from other cores through the network to be presented to the local memory as just another cache communication. This illusion is created through the network packetizer. Local memory can also be viewed as a memory controller. Figure 4 illustrates the local structure of the memory sub-system. The LOCAL_ADDR_BITS parameter is used to set the size of the local memory. The Address Translation Logic performs the virtual-to-physical address lookup using the high-order bits, and directs cache traffic to the local memory or to the network.
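The decision made by the Address Translation Logic can be modeled in a few lines of C. In this sketch, the low-order LOCAL_ADDR_BITS bits of an address index into a node's local memory and the remaining high-order bits name the home node; the constant value and helper names are our own assumptions for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    #define LOCAL_ADDR_BITS 18  /* example setting: 2^18-entry local memory */

    /* High-order bits name the home node of an address. */
    static inline uint32_t home_node(uint32_t addr) {
        return addr >> LOCAL_ADDR_BITS;
    }

    /* Low-order bits index into that node's local memory. */
    static inline uint32_t local_offset(uint32_t addr) {
        return addr & ((1u << LOCAL_ADDR_BITS) - 1u);
    }

    /* Cache traffic is steered to the local memory when the home node
     * is this node, and to the network (via the packetizer) otherwise. */
    static inline bool serve_locally(uint32_t addr, uint32_t my_node_id) {
        return home_node(addr) == my_node_id;
    }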

Figure 3: Possible physical memory configurations. (Diagram: uniform distributed memory, non-uniform distributed memory, and non-distributed memory, shown as different placements and sizes of local memory (LM) across nodes.)

Figure 4: Local memory sub-system structure. (Diagram: instruction and data caches–status, tag, data block–feeding the Address Translation Logic; a packetizer connecting to the router and network interface; and the local main memory with a directory–data block status and PE IDs–managed by a directory manager.)

For cache coherence, a directory is attached to each local memory, and the MESI protocol is implemented as the default coherence mechanism. Remote access (RA) is also supported. In RA mode, the network packetizer sends network traffic directly to the caches. Memory structures are implemented on the FPGA using block RAMs. There are 632 block RAMs on the Virtex-6 LX550T. A local memory of 0.26MB uses 64 block RAMs, or 10%. Table 2 shows the FPGA resources used to provide the two cache coherence mechanisms. The RA scheme uses fewer hardware resources than the coherence-free structure, since no cache-line buffering is needed. The directory-based coherence is far more complex, resulting in more resource utilization. The SHARERS parameter is used to set the number of sharers per data block; it also dictates the overall size of the local memory directory. When a directory entry cannot handle all sharers, other sharers are evicted.

Table 2: FPGA resource utilization per coherence mechanism.

Coherence     None     RA       Directory
Registers     2917     2424     11482
LUTs          5285     4826     17460
Speed (MHz)   238.04   217.34   171.75

3.3.2 Caching System

The user can instantiate direct-mapped Level 1 caches, or Levels 1 and 2, with the option of making Level 2 an inclusive cache. The INDEX_BITS parameter defines the number of blocks or cache-lines in the cache, while the OFFSET_BITS parameter defines the block size. By default, cache and memory structures are implemented on the FPGA using block RAMs, but the user can instruct Heracles to use LUTs for caches, or some combination of LUTs and block RAMs. A single 2KB cache uses 4 FPGA block RAMs, 462 slice registers, and 1106 slice LUTs, and runs at 228.8MHz. If the cache size is increased to 8KB by changing the INDEX_BITS parameter from 6 to 8, resource utilization and speed remain identical. Meanwhile, if the cache size is increased to 8KB by changing the OFFSET_BITS parameter from 3 to 5, resource utilization increases dramatically: 15 FPGA block RAMs, 1232 slice registers, 3397 slice LUTs, and a speed of 226.8MHz. FPGA-based cache design therefore favors a large number of small blocks over a small number of large blocks. (Cache-line size also has traffic implications at the network level.)
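The capacity figures above follow directly from the two parameters once a word size is fixed: the cache holds 2^INDEX_BITS lines of 2^OFFSET_BITS words each. The short C check below reproduces the quoted numbers assuming 32-bit (4-byte) words; it is our own illustration, not part of the toolkit.

    #include <stdio.h>

    /* Cache capacity in bytes implied by the Heracles cache parameters,
     * assuming 4-byte words: 2^INDEX_BITS lines x 2^OFFSET_BITS words. */
    static unsigned cache_bytes(unsigned index_bits, unsigned offset_bits) {
        return (1u << index_bits) * (1u << offset_bits) * 4u;
    }

    int main(void) {
        printf("%u\n", cache_bytes(6, 3)); /* 2048: the 2KB cache above      */
        printf("%u\n", cache_bytes(8, 3)); /* 8192: 8KB via INDEX_BITS 6->8  */
        printf("%u\n", cache_bytes(6, 5)); /* 8192: 8KB via OFFSET_BITS 3->5 */
        return 0;
    }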

3.3.3 Network Interface

The Address Resolution Logic works with the Packetizer module, shown in Figure 2, to allow the caches and the local memory to interact with the rest of the system. All cache traffic goes through the Address Resolution Logic, which determines whether a request can be served at the local memory or needs to be sent over the network. The Packetizer is responsible for converting data traffic, such as a load, coming from the local memory and the cache system into packets or flits that can be routed inside the Network-on-Chip (NoC), and for reconstructing packets or flits into data traffic on the other side when exiting the NoC.
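To illustrate the Packetizer's role, the sketch below models the wrapping of a cache request into a network packet; the header fields are plausible ones chosen for illustration, not the actual Heracles flit format.

    #include <stdint.h>

    /* Hypothetical memory-request packet; the real flit layout is
     * defined by the Verilog packetizer module. */
    typedef struct {
        uint16_t src_node;  /* requesting node               */
        uint16_t dst_node;  /* home node of the address      */
        uint8_t  is_write;  /* load vs. store                */
        uint32_t addr;      /* target memory address         */
        uint32_t payload;   /* store data (unused for loads) */
    } mem_packet_t;

    /* Wrap an outgoing load for the NoC; the packetizer at the far end
     * performs the inverse conversion, so remote traffic appears to the
     * local memory as just another cache communication. */
    static mem_packet_t packetize_load(uint16_t src, uint16_t dst,
                                       uint32_t addr) {
        mem_packet_t p = { src, dst, 0u, addr, 0u };
        return p;
    }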

3.3.4 Hardware multithreading and caching

In this section, we examine the effect of hardware multithreading (HMT) on system performance. We run the 197.parser application from the SPEC CINT2000 benchmarks on a single node with the dCore as the processing unit, using two different inputs–one per thread–with five different execution interleaving policies:

• setup 1: threads take turns executing every 32 cycles; on a context switch, the pipeline is drained before the execution of the other thread begins;
• setup 2: thread switching happens every 1024 cycles;
• setup 3: thread context swapping is initiated on an instruction or a data miss at the Level 1 cache;
• setup 4: thread interleaving occurs only when there is a data miss at the Level 1 cache;
• setup 5: thread switching happens when there is a data miss at the Level 2 cache.

Figure 5 shows the total completion time of the two threads (in number of cycles). It is worth noting that even with fast fine-grain hardware context switching, multithreading is most beneficial for large miss-penalty events like Level 2 cache misses or remote data accesses.

Figure 5: Effects of hardware multithreading and caching. (Bar chart: execution cycles, x10^9, of Thread1 and Thread2 under the five setups: 32, 1024, miss (L1), d-miss (L1), d-miss (L2).)
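The five interleaving policies amount to a single switch decision per cycle. The C model below is our own summary of them (the names are ours), not the dCore's actual control logic; once the decision fires, the pipeline is drained before the other thread runs.

    /* Context-switch policies of Section 3.3.4 (setups 1-5). */
    typedef enum { EVERY_32, EVERY_1024, ANY_L1_MISS,
                   DATA_L1_MISS, DATA_L2_MISS } hmt_policy_t;

    /* Returns nonzero when the active thread should be swapped out. */
    static int should_switch(hmt_policy_t policy, unsigned cycles_on_thread,
                             int i_miss_l1, int d_miss_l1, int d_miss_l2) {
        switch (policy) {
        case EVERY_32:     return cycles_on_thread >= 32u;   /* setup 1 */
        case EVERY_1024:   return cycles_on_thread >= 1024u; /* setup 2 */
        case ANY_L1_MISS:  return i_miss_l1 || d_miss_l1;    /* setup 3 */
        case DATA_L1_MISS: return d_miss_l1;                 /* setup 4 */
        case DATA_L2_MISS: return d_miss_l2;                 /* setup 5 */
        }
        return 0;
    }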
3.4 Network-on-Chip (NoC)

To provide scalability, Heracles uses a network-on-chip (NoC) architecture for its data communication infrastructure. A NoC architecture is defined by its topology (the physical organization of nodes in the network), its flow control mechanism (which establishes the data formatting, the switching protocol, and the buffer allocation), and its routing algorithm (which determines the path selected by a packet to reach its destination under a given application).

3.4.1 Flow control

Routing in Heracles can be done using either bufferless or buffered routers. Bufferless routing is generally used to reduce the area and power overhead associated with buffered routing. Contention for physical link access is resolved either by dropping and retransmitting flits or by temporarily misrouting (deflecting) them. With flit dropping, an acknowledgment mechanism is needed to enable retransmission of lost flits. With flit deflection, a priority-based arbitration, e.g., age-based, is needed to avoid livelock. In Heracles, to mitigate some of the problems associated with lossy bufferless routing, namely retransmission and slow arbitration logic, we supplement the arbiter with a routing table that can be statically configured off-line on a per-application basis.

The system's default virtual-channel router conforms in its architecture and operation to conventional virtual-channel routers [5]. It has input buffers to store flits while they are waiting to be routed to the next hop in the network. The router is modular enough to allow the user to substitute different arbitration schemes. The routing operation takes four steps or phases, namely routing (RC), virtual-channel allocation (VA), switch allocation (SA), and switch traversal (ST), where each phase corresponds to a pipeline stage in our router. Figure 6 depicts the general structure of the buffered router. In this router, the number of virtual channels per port and their sizes are controlled through the VC_PER_PORT and VC_DEPTH parameters. Table 3 shows the register and LUT utilization of the bufferless router and of different buffer configurations of the buffered router. It also shows the effect of virtual channels on router clocking speed. The key takeaway is that a larger number of VCs at the router increases both the router resource utilization and the critical path.

Figure 6: Virtual channel based router architecture. (Diagram: per-port virtual channels feeding a crossbar (Xbar), with routing logic and table, an arbiter, and the network interface.)

Table 3: FPGA resource utilization per router configuration.

Number of VCs   Bufferless   2 VCs    4 VCs    8 VCs
Registers       175          4081     7260     13374
LUTs            328          7251     12733    23585
Speed (MHz)     817.18       111.83   94.8     80.02

3.4.2 Routing algorithm

Routing algorithms used to compute routes in network-on-chip (NoC) architectures generally fall into two categories: oblivious and dynamic [16]. The default routers in Heracles primarily support oblivious routing algorithms, using either fixed logic or routing tables. Fixed logic is provided for dimension-order routing (DOR) algorithms, which are widely used and have many desirable properties. Table-based routing, on the other hand, provides greater programmability and flexibility, since routes can be pre-computed and stored in the routing tables before execution. Both buffered and bufferless routers can make use of the routing tables. Heracles provides support for both static and dynamic virtual channel allocation.
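As an illustration of the fixed-logic option, dimension-order (XY) routing on a 2D mesh reduces to a few comparisons: a packet is routed fully in the X dimension before moving in Y. The port names in this C sketch are ours, not the Heracles identifiers.

    /* Output ports of a 2D-mesh router (illustrative naming). */
    typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST,
                   PORT_NORTH, PORT_SOUTH } out_port_t;

    /* Dimension-order routing: resolve X first, then Y. */
    static out_port_t dor_xy_route(int cur_x, int cur_y,
                                   int dst_x, int dst_y) {
        if (dst_x > cur_x) return PORT_EAST;
        if (dst_x < cur_x) return PORT_WEST;
        if (dst_y > cur_y) return PORT_NORTH;
        if (dst_y < cur_y) return PORT_SOUTH;
        return PORT_LOCAL;  /* arrived: deliver to the core */
    }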

3.4.3 Network Topology Configuration

The parameterization of the number of input and output ports on the router, together with the table-based routing capability, gives Heracles a great amount of flexibility and the ability to metamorphose into different network topologies; for example, k-ary n-cube, 2D-mesh, 3D-mesh, hypercube, ring, or tree. A new topology is constructed by changing the IN_PORTS, OUT_PORTS, and SWITCH_TO_SWITCH parameters and reconnecting the routers. Table 4 shows the clocking speed of a bufferless router, a buffered router with strict round-robin arbitration (Arbiter1), a buffered router with weak round-robin arbitration (Arbiter2), and a buffered router with 7 ports for a 3D-mesh network. The bufferless router runs the fastest, at 817.2MHz; Arbiter1 and Arbiter2 run at about the same speed (~112MHz), although the arbitration scheme in Arbiter2 is more complex. The 7-port router runs the slowest, due to its more complex arbitration logic.

Table 4: Clocking speed of different router types.

Router type   Bufferless   Arbiter1   Arbiter2   7-Port
Speed (MHz)   817.18       111.83     112.15     101.5

Figure 7 shows a 3x3 2D-mesh with all identical routers. Figure 8 depicts an unbalanced fat-tree topology. For a fat-tree [11] topology, routers at different levels of the tree have different sizes, in terms of crossbar and arbitration logic. The root node contains the largest router, and controls the clock frequency of the system.

Figure 7: 2D mesh topology. (Diagram: 3x3 grid of nodes, each holding a core, a starting PC, a routing table, and data.)

Figure 8: Unbalanced fat-tree topology. (Diagram: tree of nodes with the largest router at the root.)
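The port-count arithmetic behind such reconfiguration is simple, as the sketch below shows for the mesh family; the 7-port router used for the 3D mesh in Table 4 corresponds to dims = 3. (The helper is our own illustration of how values for IN_PORTS and OUT_PORTS would be chosen.)

    /* Router ports for an n-dimensional mesh node: two neighbors per
     * dimension plus the local core port. A 2D mesh needs 5 ports;
     * the 3D-mesh router of Table 4 needs 2*3 + 1 = 7. */
    static unsigned mesh_router_ports(unsigned dims) {
        return 2u * dims + 1u;
    }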

4. HERACLES PROGRAMMING MODELS

A programming model is inherently tied to the underlying hardware architecture. The Heracles design tool has no directly deployable operating system, but it supports both sequential and parallel programming models through various application programming interface (API) protocols. There are three memory spaces associated with each program: instruction memory, data memory, and stack memory. Dynamic memory allocation is not supported in the current version of the tool.

4.1 Sequential programming model

In the sequential programming model, a program has a single entry point (starting program counter–PC) and a single execution thread. Under this model, a program may exhibit any of the following behaviors, or a combination thereof:

• the local memory of the executing core holds the instruction binary;
• the PC points to another core's local memory, where the instruction binary resides;
• the stack frame pointer (SP) points to the local memory of the executing core;
• the SP points to another core's local memory for the storage of the stack frame;
• the program data is stored in the local memory;
• the program data is mapped to another core's local memory.

Figure 9: Examples of memory space management for sequential programs. (Diagram, cores B, A, C with A executing: (a) PC, SP, and DATA all on core A; (b) PC on B, SP and DATA on A; (c) PC on B, SP on C, DATA on A.)

Figure 9 gives illustrating examples of the memory space management when dealing with sequential programs. These techniques provide the programming flexibility needed to support the different physical memory configurations. They also allow users:

• to run the same program on multiple cores (program inputs can be different); in this setup the program binary is loaded to one core, and the program counter at the other cores points to the core with the binary;
• to execute one program across multiple cores by migrating the program from one core to another.
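To make the placement options concrete, the fragment below models the three Figure 9 scenarios as data. The struct is purely illustrative; in the toolkit, the actual assignment is done through the linker stage and the GUI, not through a C API.

    /* Illustrative per-program memory-space placement: each of the
     * binary (PC), stack (SP), and data sections can live on the
     * executing core or on another core's local memory. */
    enum core_id { A, B, C };

    typedef struct {
        enum core_id exec_core;  /* core that executes the program      */
        enum core_id pc_core;    /* core holding the instruction binary */
        enum core_id sp_core;    /* core holding the stack frame        */
        enum core_id data_core;  /* core holding the program data       */
    } placement_t;

    /* The three scenarios of Figure 9: */
    static const placement_t fig9_a = { A, A, A, A }; /* all local to A   */
    static const placement_t fig9_b = { A, B, A, A }; /* binary on core B */
    static const placement_t fig9_c = { A, B, C, A }; /* stack on core C  */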

4.2 Parallel programming model

The Heracles design platform supports both hardware multithreading and software multithreading. The keywords HHThread1 and HHThread2 are provided to specify the part of the program to execute on each hardware thread. Multiple programs can also be executed on the same core using the same approach. An example is shown below:

...
int HHThread1(int *array, int item, int size) {
    sort(array, size);
    int output = search(array, item, size);
    return output;
}

int HHThread2(int input, int src, int aux, int dest) {
    int output = Hanoi(input, src, aux, dest);
    return output;
}
...

Below is the associated binary outline:

@e2 //
27bdffd8 // 00000388 addiu sp,sp,-40
...
0c00004c // 000003c0 jal 130
...

@fa //
27bdffd8 // 000003e8 addiu sp,sp,-40
...
0c0000ab // 00000418 jal 2ac
...

@110 //
27bdffa8 // 00000440 addiu sp,sp,-88
...
0c0000e2 // 000004bc jal 388
...

@15e //
27bdffa8 // 00000440 addiu sp,sp,-88
...
0c0000fa // 000004d8 jal 3e8
...

Heracles uses OpenMP-style pragmas to allow users to directly compile multi-threaded programs onto cores. Users specify programs, or parts of a program, that can be executed in parallel, and on which cores. The keywords HLock and HBarrier are provided for synchronization, and shared variables are encoded with the keyword HGlobal. An example is shown below:

#pragma Heracles core 0 {
    // Synchronizers
    HLock lock1, lock2;
    HBarrier bar1, bar2;

    // Variables
    HGlobal int arg1, arg2, arg3;
    HGlobal int Arr[16][16];
    HGlobal int Arr0[16][16] = { { 1, 12, 7, 0,...
    HGlobal int Arr1[16][16] = { { 2, 45, 63, 89,...

    // Workers
    #pragma Heracles core 1 {
        start_check(50);
        check_lock(&lock1, 1000);
        matrix_add(Arr, Arr0, Arr1, arg1);
        clear_barrier(&bar1);
    }

    #pragma Heracles core 2 {
        start_check(50);
        check_lock(&lock1, 1000);
        matrix_add(Arr, Arr0, Arr1, arg2);
        clear_barrier(&bar2);
    }
}
...
Below is the intermediate C representation:

...
int core_0_lock1, core_0_lock2;
int core_0_bar1, core_0_bar2;
int core_0_arg1, core_0_arg2, core_0_arg3;
int core_0_Arr[16][16];
int core_0_Arr0[16][16] = { { 1, 12, 7, 0,...
int core_0_Arr1[16][16] = { { 2, 45, 63, 89,...

void core_0_work (void) {
    // Synchronizers
    // Variables
    // Workers
    // Main function
    main();
}

void core_1_work (void)
{
    start_check(50);
    check_lock(&core_0_lock1, 1000);
    matrix_add(core_0_Arr, core_0_Arr0, core_0_Arr1,
               core_0_arg1);
    clear_barrier(&core_0_bar1);
}
...

Below is the associated binary outline:

@12b //
27bdffe8 // 000004ac addiu sp,sp,-24
...
0c000113 // 000004bc jal 44c
...

5. PROGRAMMING TOOLCHAIN

5.1 Program compilation flow

The Heracles environment has an open-source compiler toolchain to assist in developing software for different system configurations. The toolchain is built around the GCC MIPS cross-compiler, using GNU C version 3.2.1. Figure 10 depicts the software flow for compiling a C program into compatible MIPS instruction code that can be executed on the system. The compilation process consists of a series of six steps:

• First, the user invokes mips-gcc to translate the C code into assembly code (e.g., ./mips-gcc -S fibonacci.c).
• In step 2, the assembly code is run through the isa-checker (e.g., ./checker fibonacci.s). The checker's role is to: (1) remove all memory space primitives, (2) replace all pseudo-instructions, and (3) check for floating point instructions. Its output is a .asm file.
• For this release, there is no direct high-level operating system support. Therefore, in the third compilation stage, a small kernel-like assembly code is added to the application assembly code for memory space management and workload distribution (e.g., ./linker fibonacci.asm). Users can modify the linker.cpp file provided in the toolchain to reconfigure the memory space and workload.
• In step 4, the user compiles the assembly file into an object file using the cross-compiler. This is accomplished by executing mips-as on the .asm file (e.g., ./mips-as fibonacci.asm).
• In step 5, the object file is disassembled using the mips-objdump command (e.g., ./mips-objdump fibonacci.o). Its output is a .dump file.
• Finally, the constructor script is called to transform the dump file into a Verilog memory (.vmh) file format (e.g., ./dump2vmh fibonacci.dump).

If the program is specified in the Heracles multi-threading format (.hc or .hcc), c-cplus-generator (e.g., ./c-cplus-generator fibonacci.hc) is called first to produce the C or C++ program file before executing the steps listed above. The software toolchain is still evolving. All these steps are also automated through the GUI.
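Collected in one place, the six steps for the fibonacci example used above form the following command sequence (a .hc or .hcc source would first be passed through the generator, e.g., ./c-cplus-generator fibonacci.hc):

    ./mips-gcc -S fibonacci.c      # step 1: C to assembly (fibonacci.s)
    ./checker fibonacci.s          # step 2: ISA check (fibonacci.asm)
    ./linker fibonacci.asm         # step 3: add kernel-like memory/workload code
    ./mips-as fibonacci.asm        # step 4: assemble (fibonacci.o)
    ./mips-objdump fibonacci.o     # step 5: disassemble (fibonacci.dump)
    ./dump2vmh fibonacci.dump      # step 6: emit Verilog memory file (.vmh)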

Figure 10: Software toolchain flow. (Diagram: source code file, e.g., fibonacci.c -> compiler [mips-gcc] -> assembly code file, e.g., fibonacci.s -> architecture compatibility checker [isa-checker] -> compatible assembly code, e.g., fibonacci.asm -> workload distribution / memory-aware assembly code (automatic for .hc or .hcc, starting from a Heracles code file, e.g., fibonacci.hc, via the c-cplus-generator) -> assembler [mips-as] -> object code file, e.g., fibonacci.o -> disassembler [mips-objdump] -> dump file, e.g., fibonacci.dump -> constructor [dump2vmh] -> Verilog hex memory file, e.g., fibonacci.mem.)

5.2 Heracles graphical user interface

The graphical user interface (GUI) is called Heracles Designer. It helps to quickly configure and launch system configurations. Figure 11 shows a screenshot of the GUI. On the core tab, the user can select: (1) the type of core to generate, (2) the network topology of the system instance to generate, (3) the number of cores to generate, (4) the traffic type, injection rate, and simulation cycles in the case of an injector core, or (5) different pre-configured settings.

Figure 11: Heracles Designer graphical user interface. (Screenshot.)

The Generate and Run buttons on this tab are used to automatically generate the Verilog files and to launch the synthesis process or the specified simulation environment. The second tab–the memory system tab–allows the user to set: (1) the main memory configuration (e.g., Uniform Distributed), (2) the total main memory size, (3) the instruction and data cache sizes, (4) the Level 2 cache, and (5) the FPGA-favored resource (LUT or block RAM) for cache structures. The on-chip network tab covers all the major aspects of the system interconnect: (1) routing algorithm, (2) number of virtual channels (VCs) per port, (3) VC depth, (4) core and switch bandwidths, (5) routing table programming, by selecting source/destination pair or flow ID, router ID, output port, and VC (allowing user-defined routing paths), and (6) number of flits per packet for injector-based traffic. The programming tab is updated when the user changes the number of cores in the system; the user can: (1) load a binary file onto a core, (2) load a binary onto a core and set the starting address of another core to point to that binary, (3) select where to place the data section or stack pointer of a core (local, on the same core as the binary, or on another core), and (4) select which cores to start.

6. EXPERIMENTAL RESULTS

6.1 Full 2D-mesh systems

The synthesis results of five multicore systems of sizes 2x2, 3x3, 4x4, 5x5, and 6x6, arranged in a 2D-mesh topology, are summarized below. Table 5 gives the key architectural characteristics of the multicore system. All five systems run at 105.5MHz, which is the clock frequency of the router, regardless of the size of the mesh.

Table 5: 2D-mesh system architecture details.

Core
  ISA                           32-Bit MIPS
  Hardware threads              1
  Pipeline Stages               7
  Bypassing                     Full
  Branch policy                 Always non-Taken
  Outstanding memory requests   1
Level 1 Instruction/Data Caches
  Associativity                 Direct
  Size                          variable
  Outstanding Misses            1
On-Chip Network
  Topology                      2D-Mesh
  Routing Policy                DOR and Table-based
  Virtual Channels              2
  Buffers per channel           8

Figure 12 summarizes the FPGA resource utilization of the different systems in terms of registers, lookup tables, and block RAMs. In the 2x2 and 3x3 configurations, the local memory is set to 260KB per core. The 3x3 configuration uses 99% of the block RAM resources at 260KB of local memory per core. For the 4x4 configuration, the local memory is reduced to 64KB per core, and the local memory in the 5x5 configuration is set to 32KB. The 6x6 configuration, with 16KB of local memory per core, fails during the mapping and routing synthesis steps, due to the lack of LUTs.

Figure 12: Percentage of FPGA resource utilization (registers, LUTs, block RAMs) per mesh size, 2x2 through 6x6 (the 6x6 configuration fails synthesis).

6.2 Evaluation results

We examine the performance of two SPEC CINT2000 benchmarks, namely 197.parser and 256.bzip2, on Heracles. We modify and parallelize these benchmarks to fit into our evaluation framework. For the 197.parser benchmark, we identify three functional units: file reading and parameter setting as one unit, actual parsing as a second unit, and error reporting as the third unit.

Figure 13: Effect of memory organization on performance for the different applications. ((a) 197.parser, (b) 256.bzip2, (c) Fibonacci.)

Figure 14: Effect of routing algorithm on performance in the 2D-mesh for the different applications. ((a) 197.parser, (b) 256.bzip2, (c) Fibonacci.)

When there are more than three cores, all additional cores are used in the parsing unit. Similarly, 256.bzip2 is divided into three functional units: file reading and cyclic redundancy check, compression, and output file writing. The compression unit exhibits a high degree of data-parallelism; therefore, we apply all additional cores to this unit for core counts greater than three.

We also present a brief analysis of a simple Fibonacci number calculation program. Figures 13 (a), (b), and (c) show the 197.parser, 256.bzip2, and Fibonacci benchmarks under single shared-memory (SSM) and distributed shared-memory (DSM), using XY-Ordered routing. Increasing the number of cores improves performance for both benchmarks; it also exposes the memory bottleneck encountered in the single shared-memory scheme.

Figures 14 (a), (b), and (c) highlight the impact of the routing algorithm on the overall system performance, by comparing completion cycles under XY-Ordered routing and BSOR [9]. BSOR, which stands for Bandwidth-Sensitive Oblivious Routing, is a table-based routing algorithm that minimizes the maximum channel load (MCL) (or maximum traffic) across all network links, in an effort to maximize application throughput. The routing algorithm has little or no effect on the performance of the 197.parser and 256.bzip2 benchmarks, because of the traffic patterns in these applications. For the Fibonacci application, Figure 14 (c), BSOR routing does improve performance, particularly with 5 or more cores. To show the multithreading and scalability properties of the system, Figure 15 presents the execution times for matrix multiplication given matrices of different sizes. The Heracles multicore programming format is used to automate the distribution of the workload onto the cores.

Figure 15: Matrix multiplication acceleration. (Execution cycles, x10^8, for 4, 9, and 16 cores on matrix sizes 16x16, 32x32, 64x64, 80x80, 100x100, and 128x128.)

7. CONCLUSION

In this work, we present the new Heracles design toolkit, which comprises soft hardware (HDL) modules, an application compiler toolchain, and a graphical user interface. It is a component-based framework that gives researchers the ability to create complete, realistic, synthesizable multi/many-core architectures for fast, high-accuracy design space exploration. In this environment, the user can explore design tradeoffs at the processing unit level, the memory organization and access level, and the network-on-chip level. The Heracles tool is open-source and can be downloaded at http://projects.csail.mit.edu/heracles. In the current release, RTL hardware modules can be simulated on all operating systems, the MIPS GCC cross-compiler runs in a Linux environment, and the graphical user interface has a Windows installer.

8. ACKNOWLEDGMENTS

We thank the students from the MIT 6.S918 IAP 2012 class who worked with early versions of the tool and made many excellent suggestions for improvement. We thank Joel Emer, Li-Shiuan Peh, and Omer Khan for interesting discussions throughout the course of this work.

9. REFERENCES

[1] J. Andersson, J. Gaisler, and R. Weigand. Next generation multipurpose microprocessor. 2010.
[2] N. Banerjee, P. Vellanki, and K. Chatha. A power and performance model for network-on-chip architectures. In Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings, volume 2, pages 1250-1255, Feb. 2004.
[3] M. H. Cho, K. S. Shim, M. Lis, O. Khan, and S. Devadas. Deadlock-free fine-grained thread migration. In Networks on Chip (NoCS), 2011 Fifth IEEE/ACM International Symposium on, pages 33-40, May 2011.
[4] C. R. Clack, R. Nathuji, and H.-H. S. Lee. Using an FPGA as a prototyping platform for multi-core processor applications. In WARFP-2005: Workshop on Architecture Research using FPGA Platforms, Cambridge, MA, USA, Feb. 2005.
[5] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2003.
[6] P. Del Valle, D. Atienza, I. Magan, J. Flores, E. Perez, J. Mendias, L. Benini, and G. De Micheli. A complete multi-processor system-on-chip FPGA-based emulation framework. In Very Large Scale Integration, 2006 IFIP International Conference on, pages 140-145, Oct. 2006.
[7] K. E. Fleming, M. Adler, M. Pellauer, A. Parashar, A. Mithal, and J. Emer. Leveraging latency-insensitivity to ease multiple FPGA design. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA '12, pages 175-184, New York, NY, USA, 2012.
[8] N. Genko, D. Atienza, G. De Micheli, J. Mendias, R. Hermida, and F. Catthoor. A complete network-on-chip emulation framework. In Design, Automation and Test in Europe, 2005. Proceedings, pages 246-251 Vol. 1, March 2005.
[9] M. Kinsy, M. H. Cho, T. Wen, E. Suh, M. van Dijk, and S. Devadas. Application-aware deadlock-free oblivious routing. In Proceedings of the Int'l Symposium on Computer Architecture, June 2009.
[10] M. Kinsy, M. Pellauer, and S. Devadas. Heracles: Fully synthesizable parameterized MIPS-based multicore system. In Field Programmable Logic and Applications (FPL), 2011 International Conference on, pages 356-362, Sept. 2011.
[11] C. E. Leiserson. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Trans. Comput., 34(10):892-901, 1985.
[12] M. Lis, P. Ren, M. H. Cho, K. S. Shim, C. Fletcher, O. Khan, and S. Devadas. Scalable, accurate multicore simulation in the 1000-core era. In Performance Analysis of Systems and Software (ISPASS), 2011 IEEE International Symposium on, pages 175-185, April 2011.
[13] A. Lusala, P. Manet, B. Rousseau, and J.-D. Legat. NoC implementation in FPGA using torus topology. In Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on, pages 778-781, Aug. 2007.
[14] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50-58, Feb. 2002.
[15] J. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal. Graphite: A distributed parallel simulator for multicores. In High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on, pages 1-12, Jan. 2010.
[16] L. M. Ni and P. K. McKinley. A survey of wormhole routing techniques in direct networks. Computer, 26(2):62-76, 1993.
[17] D. Patterson and J. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, 2005.
[18] M. Pellauer, M. Adler, M. Kinsy, A. Parashar, and J. Emer. HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing. In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pages 406-417, Feb. 2011.
[19] N. Saint-Jean, G. Sassatelli, P. Benoit, L. Torres, and M. Robert. HS-Scale: A hardware-software scalable MP-SoC architecture for embedded systems. In VLSI, 2007. ISVLSI '07. IEEE Computer Society Annual Symposium on, pages 21-28, March 2007.
[20] Z. Tan, A. Waterman, R. Avizienis, Y. Lee, H. Cook, D. Patterson, and K. Asanović. RAMP Gold: An FPGA-based architecture simulator for multiprocessors. In Design Automation Conference (DAC), 2010 47th ACM/IEEE, pages 463-468, June 2010.
[21] W. Yu. GEMS: A high performance EM simulation tool. In Electrical Design of Advanced Packaging Systems Symposium, 2009 (EDAPS 2009). IEEE, pages 1-4, Dec. 2009.
H˚allberg, J. H¨ogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system 9. REFERENCES simulation platform. Computer, 35(2):50–58, feb 2002. [1] J. Andersson, J. Gaisler, and R. Weigand. Next [15] J. Miller, H. Kasture, G. Kurian, C. Gruenwald, generation multipurpose microprocessor. 2010. N. Beckmann, C. Celio, J. Eastep, and A. Agarwal. [2] N. Banerjee, P. Vellanki, and K. Chatha. A power and Graphite: A distributed parallel simulator for performance model for network-on-chip architectures. multicores. In High Performance Computer In Design, Automation and Test in Europe Conference Architecture (HPCA), 2010 IEEE 16th International and Exhibition, 2004. Proceedings, volume 2, pages Symposium on, pages 1–12, jan. 2010. 1250 – 1255 Vol.2, feb. 2004. [16] L. M. Ni and P. K. McKinley. A survey of wormhole [3] M. H. Cho, K. S. Shim, M. Lis, O. Khan, and routing techniques in direct networks. Computer, S. Devadas. Deadlock-free fine-grained thread 26(2):62–76, 1993. migration. In Networks on Chip (NoCS), 2011 Fifth [17] D. Patterson and J. Hennessy. Computer Organization IEEE/ACM International Symposium on, pages 33 and Design: The Hardware/software Interface. –40, may 2011. Morgan Kaufmann, 2005. [4] C. R. Clack, R. Nathuji, and H.-H. S. Lee. Using an [18] M. Pellauer, M. Adler, M. Kinsy, A. Parashar, and fpga as a prototyping platform for multi-core J. Emer. Hasim: Fpga-based high-detail multicore processor applications. In WARFP-2005: Workshop simulation using time-division multiplexing. In High on Architecture Research using FPGA Platforms, Performance Computer Architecture (HPCA), 2011 Cambridge, MA, USA, feb. 2005. IEEE 17th International Symposium on, pages [5] W. J. Dally and B. Towles. Principles and Practices of 406–417, feb. 2011. Interconnection Networks. Morgan Kaufmann, 2003. [19] N. Saint-Jean, G. Sassatelli, P. Benoit, L. Torres, and [6] P. Del valle, D. Atienza, I. Magan, J. Flores, E. Perez, M. Robert. Hs-scale: a hardware-software scalable J. Mendias, L. Benini, and G. Micheli. A complete mp-soc architecture for embedded systems. In VLSI, multi-processor system-on-chip fpga-based emulation 2007. ISVLSI ’07. IEEE Computer Society Annual framework. In Very Large Scale Integration, 2006 IFIP Symposium on, pages 21–28, march 2007. International Conference on, pages 140–145, oct. 2006. [20] Z. Tan, A. Waterman, R. Avizienis, Y. Lee, H. Cook, [7] K. E. Fleming, M. Adler, M. Pellauer, A. Parashar, D. Patterson, and K. Asanovict’ and. Ramp gold: An A. Mithal, and J. Emer. Leveraging fpga-based architecture simulator for multiprocessors. latency-insensitivity to ease multiple fpga design. In In Design Automation Conference (DAC), 2010 47th Proceedings of the ACM/SIGDA international ACM/IEEE, pages 463–468, june 2010. symposium on Field Programmable Gate Arrays, [21] W. Yu. Gems a high performance em simulation tool. FPGA ’12, pages 175–184, New York, NY, USA, 2012. In Electrical Design of Advanced Packaging Systems [8] N. Genko, D. Atienza, G. De Micheli, J. Mendias, Symposium, 2009. (EDAPS 2009). IEEE, pages 1–4, R. Hermida, and F. Catthoor. A complete dec. 2009.