Hardware Design, Synthesis, and Verification of a Multicore Communication API

Ben Meakin and Ganesh Gopalakrishnan
University of Utah
{meakin, ganesh}@cs.utah.edu

Abstract—Modern trends in computer architecture and semiconductor scaling are leading towards the design of chips with more and more processor cores. Highly concurrent hardware and software architectures are inevitable in future systems. One of the greatest problems in these systems is communication. Providing coherence, consistency, synchronization, and sharing of data in a multicore system requires that communication overhead be minimal. This paper describes the design of tightly integrated hardware/software mechanisms for providing extremely efficient on-chip communication. A multicore communication API (MCAPI), developed by the Multicore Association, is combined with network-on-chip (NoC) technology to provide software developers with a scalable, hardware-accelerated transport layer. The designs presented follow the methodology prescribed by MCAPI and target embedded multicore systems-on-chip (SoC). The key hardware, synthesis, and verification mechanisms are described. Actual and simulated results suggest that the proposed solution offers significant advantages over existing communication architectures in terms of performance, power, scalability, and software development efficiency.

INTRODUCTION

It is widely accepted that modern and future computing systems will see performance improvements primarily through exploiting increased process/thread level parallelism. One of the main prerequisites to exploiting this parallelism efficiently is the availability of APIs that are well matched with the communication and synchronization needs of the target domain. Clearly, "one size does not fit all." In the area of large-scale cluster computing, the Message Passing Interface (MPI) – a very sophisticated API with over 300 functions – is the lingua franca. MPI is used to program cluster computers with up to hundreds of thousands of processing nodes. In other realms, such as embedded systems using commodity microprocessors, various real-time primitives and shared memory threads serve the needs of communication and synchronization. However, in the rapidly exploding area of embedded systems based on multiple cores, chips not only contain multiple general purpose computing cores, but for cost effectiveness and performance also contain application specific accelerators, I/O interfaces, and memory controllers. All of these on-chip devices need low overhead communication. As semiconductors continue to scale, more and more of these devices will be found on the same chip.

Instead of "reinventing the wheel," it is imperative that semiconductor companies agree on a standard software API that, on one hand, offers high efficiency, and on the other hand offers the ability to build and reuse software. The need for such a standardized API is underscored by the emergence of on-chip networks (as opposed to busses) as the physical communication mechanism [2]. Thus the standard API must be able to mimic the functionality offered by MPI and threads – but in a light-weight manner, and in a manner that meshes well with the existing (bus-based) and emerging (network-based) hardware transport mechanisms.

This paper describes an effort to merge these two trends – sophisticated transport mechanisms in hardware, driven by reusable, standard-API-based high performance software. As the title suggests, there are three major components of this work.

The hardware design component consists of creating HDL models of the key pieces of a scalable on-chip communication fabric and of a processing core which implements an instruction set architecture capable of seamless integration with the chosen communication API.

The synthesis component involves both logic synthesis and architectural synthesis of an application specific network-on-chip (NoC) topology using an internally developed tool. This tool also performs verification of deadlock-free routing functions for the generated custom topology, which is part of the verification component of this project. IBM's Sixthsense semi-formal verification tool is also incorporated to verify the key hardware modules at a high level [3]. The semi-formal nature of this tool promises significant coverage and an ability to verify designs at a high enough level of abstraction to make quality-of-service (QoS) guarantees for the resulting communication fabric – a task that is quite difficult with simulation. Portions of this project are currently a work in progress, and more substantial results will be provided in future publications.

MULTICORE API IMPLEMENTATION

A. MCAPI Overview

The Multicore Communication API is a message passing interface similar to MPI. However, it is designed primarily for embedded devices, where broad functionality is not as important as high performance in a few types of communication. MCAPI provides communication primitives that can be used by operating systems, libraries, and applications to improve code portability across different hardware generations. Since the API is designed for on-chip communication, it makes few assumptions about the hardware architecture and leaves the implementation considerable freedom to take advantage of whatever optimizations the architecture permits. It aspires to provide good code portability across hardware generations; at the same time, its simplicity enables implementations with high performance communication, lower power consumption, and a lower memory footprint [1].

B. MIPS ISA Extension

The core functionality of the implementation described in this paper is controlled via a set of RISC-style assembly instructions added as an extension to the MIPS instruction set. These instructions are given in Fig. 1. The decision to make the control of the communication hardware programmable reflects an effort to avoid over complicating the hardware.

Using these instructions, all of the communication functionality of MCAPI can be implemented. The send header instruction builds the packet header and sends it on the network. The header includes source and destination node identifiers, as well as a packet class that indicates the type of data being sent (i.e., pointer/buffer, short, integer, or long). Note that zero-copy data transfer through pointer passing is possible only if both communication endpoints reside in the same shared memory domain. The receive header instruction subsequently gets the packet header and writes it to a register. The header data can then be parsed using bit mask and shift instructions (standard with the MIPS ISA). The send/receive word instructions send or receive word-length chunks of data. The get ID and flag instructions write the local node ID and the specified network status flag to registers, respectively.

Fig. 2. Send / Receive Implementation

ON-CHIP NETWORK DESIGN

A. Topology Generation

MCAPI is intended to be used on embedded multicore platforms. These types of systems are typically geared towards performing one or more specific types of tasks, such as network packet processing, graphics, or some type of multimedia processing. In these applications, regular patterns of communication are observed. Therefore, a general purpose communication fabric is not optimal. As part of this work, a tool has been developed which synthesizes a custom network topology for a given workload specification. The generated topology is optimized to reduce the average network hops per packet in an attempt to reduce power consumption and communication latency. The tool also attempts to find a placement of the network nodes such that wire lengths are minimized. The algorithms used by this tool are presented in detail in [4].
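As noted above, the receive header instruction deposits the packet header in a register, and software extracts the fields with ordinary mask and shift operations. The following C sketch illustrates the idiom; the field widths and positions here are assumptions for illustration only (the actual packing is fixed by the hardware):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field layout for a 32-bit MCAPI packet header.
 * The real layout is defined by the hardware; the widths and
 * positions below are illustrative assumptions. */
#define HDR_DEST_NODE_SHIFT 24   /* bits 31..24: destination node ID */
#define HDR_DEST_PORT_SHIFT 20   /* bits 23..20: destination port ID */
#define HDR_CLASS_SHIFT     16   /* bits 19..16: packet class        */
#define HDR_SENDER_SHIFT     8   /* bits 15..8 : sender node ID      */

enum pkt_class { PKT_PTR = 0, PKT_SHORT = 1, PKT_INT = 2, PKT_LONG = 3 };

static uint32_t build_header(uint32_t node, uint32_t port,
                             uint32_t cls, uint32_t sender)
{
    return (node   << HDR_DEST_NODE_SHIFT) |
           (port   << HDR_DEST_PORT_SHIFT) |
           (cls    << HDR_CLASS_SHIFT)     |
           (sender << HDR_SENDER_SHIFT);
}

/* Parse the fields back out with the same mask-and-shift idiom
 * the standard MIPS bit manipulation instructions would use. */
static uint32_t hdr_dest_node(uint32_t h) { return (h >> HDR_DEST_NODE_SHIFT) & 0xFF; }
static uint32_t hdr_dest_port(uint32_t h) { return (h >> HDR_DEST_PORT_SHIFT) & 0xF;  }
static uint32_t hdr_class(uint32_t h)     { return (h >> HDR_CLASS_SHIFT)     & 0xF;  }
static uint32_t hdr_sender(uint32_t h)    { return (h >> HDR_SENDER_SHIFT)    & 0xFF; }
```

In the actual library these few instructions are all that is needed between receiving a header and dispatching on its packet class.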

Fig. 1. Communication Instructions

1) Example Send/Receive Implementation: With these added instructions, MCAPI can be implemented as a library with in-line assembly code. An example implementation of the MCAPI message send function is given in Fig. 2. For brevity, some error checking code has been omitted. Note that the code size of these functions is relatively small. This is critical to minimizing the memory footprint of the library implementation so that it is suitable for embedded applications.

An example topology generated by the tool is given in Fig. 4. Here, a network node is a tile consisting of a router, a network interface unit (NIU), a shared resource such as a processor, and an optional private resource such as an L1 cache.

Fig. 4. Example Topology

B. Asynchronous Router Design

In systems such as this, it is also essential that the communication fabric support different clock domains. Each heterogeneous processing element will often have a different ideal clock rate. Therefore, the bridging of these different clock domains must be provided by the network. This support is provided by designing the on-chip routers with asynchronous pipelines, which has been accomplished by specifying the pipeline controllers as extended burst mode finite state machines and synthesizing a circuit implementation with 3D [5].

The general architecture of the routers is given in Fig. 5. The router uses wormhole flow control: packets are divided into flits and pass through the network in a pipelined fashion. There are two stages at each router. The first stage consists of input buffering and computing the routing function. A flit is identified as either a header, tail, or body flit. If it is a header, the routing path is looked up and a request is sent to the arbiter. If it is a body flit, the routing path has already been computed and mutual exclusion for the output port has already been obtained, so it is simply forwarded through the crossbar. If it is a tail flit, the input and output buffers that have been reserved for the packet are released. The second stage is the arbitration stage. Multiple requests for a single output port are resolved in a round-robin fashion, and flits are buffered at the output port until they are granted access to the input port of the next router.

The logic delay, area, and power consumption grow proportionally to the router radix. Fig. 5 depicts a 3 by 3 router. The topology synthesis tool generates networks with this in consideration. The user enters a maximum router radix based on the constraints of the system, and a network is produced where no router is larger than the maximum size and the average router size is minimized. This results in better average case performance that an asynchronous network can exploit.

C. Packet Structure

MCAPI packets are divided into flits as shown in Fig. 6. The head flit consists of a destination node and port ID, a packet class, and a sender ID. Different packet classes are used to implement the functionality of MCAPI. Since data can be sent as scalar values of various bit widths and as pointers, these packet classes tell the receiver how to interpret the data. Since network resources along a communication path are reserved when a header flit is seen and released when the associated tail flit is seen, opening and closing a channel is as simple as sending a header/tail flit. Therefore, packet and scalar channels are implemented with the same instructions used to send connectionless messages; the difference is that channel communication can consist of an arbitrary number of packets.
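Fig. 2 is not reproduced in this text, but the send path it describes can be sketched behaviorally in C. The helpers hw_send_header(), hw_send_word(), and hw_send_tail() below are hypothetical stand-ins for the custom instructions – in the real library each would be a single in-line assembly instruction handing its operands to the NIU; here they append to a software FIFO so the sketch is runnable, and the header packing is an assumption:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Behavioral stand-ins for the custom MIPS instructions.  In the
 * real library these are single in-line assembly instructions that
 * hand their operands to the network interface unit; modeling them
 * as a software FIFO makes this sketch executable. */
#define FIFO_DEPTH 64
static uint32_t fifo[FIFO_DEPTH];
static size_t   fifo_len = 0;

static void hw_send_header(uint32_t hdr) { fifo[fifo_len++] = hdr; }
static void hw_send_word(uint32_t w)     { fifo[fifo_len++] = w;   }
static void hw_send_tail(void)           { fifo[fifo_len++] = 0;   } /* tail flit, no payload */

/* Hypothetical MCAPI-style message send: emit a header flit, then
 * the payload one word at a time, then a tail flit that releases
 * the reserved path.  Error checking is omitted for brevity, as in
 * the paper's Fig. 2. */
static int mcapi_msg_send(uint32_t dest_node, const uint32_t *buf, size_t nwords)
{
    hw_send_header(dest_node << 24);      /* assumed header packing */
    for (size_t i = 0; i < nwords; i++)
        hw_send_word(buf[i]);
    hw_send_tail();
    return 0;
}
```

A matching receive function would drain the other side with the receive header and receive word instructions in the same order.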

Fig. 6. Packet Structure
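The division of a packet into head, body, and tail flits described above can be sketched as follows; the flit tag values and structure are illustrative assumptions, not the actual hardware format of Fig. 6:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Tag values for flit types in the wormhole network; the encoding
 * is an assumption for illustration. */
enum flit_type { FLIT_HEAD = 0, FLIT_BODY = 1, FLIT_TAIL = 2 };

typedef struct { enum flit_type type; uint32_t payload; } flit;

/* Split a message into flits: one head flit carrying the header
 * (which reserves the path through each router), body flits carrying
 * the data words, and a tail flit that releases the reserved path.
 * Returns the number of flits produced. */
static size_t packetize(uint32_t header, const uint32_t *data, size_t n,
                        flit *out)
{
    size_t k = 0;
    out[k++] = (flit){ FLIT_HEAD, header };
    for (size_t i = 0; i < n; i++)
        out[k++] = (flit){ FLIT_BODY, data[i] };
    out[k++] = (flit){ FLIT_TAIL, 0 };
    return k;
}
```

A channel, as described above, is simply this sequence with an arbitrary number of body flits between the head and the tail.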

Fig. 5. Asynchronous Router Block Diagram

D. Network Interface Unit

To integrate the modified MIPS processor with the globally asynchronous network, a network interface unit is necessary. It performs handshaking with the network and provides more substantial buffering, since the routers only have buffer space for two flits on any input/output path. It is also responsible for building the packets given two input operands from the processor and an opcode. Operational flags are also maintained and provided to the processor via the instruction set. These flags provide information about communication errors, buffer overflow, network usage, and data availability.

SYNTHESIS

A. Cost of Instruction Set Extension

To evaluate the practicality of extending the MIPS instruction set, it is important to have an idea of the additional hardware cost of the extension. Table 1 compares the synthesis results of a VHDL MIPS core (with only integer operations) for a Xilinx Virtex 5 FPGA, with and without the extension.

Table 1. ISA Extension Cost

             Baseline ISA   Extended ISA
Registers    246            246
LUTs         474            493
Clock Rate   157 MHz        156 MHz

B. Cost of Network Modules

The network hardware modules can vary in size and complexity, depending on the requirements of the generated topology. Network interface units also vary in their functionality depending on what type of core they interface with the network. Synthesis results for a router of radix five and an NIU for a MIPS core are given in Table 2.

Table 2. Cost of Network Hardware

                     NIU       Router
Registers            54        155
LUTs                 226       366
Clock rate / delay   327 MHz   3.6 ns

SIMULATED RESULTS

A. Custom vs. Mesh Topology

To illustrate the advantages provided by the custom application specific network generated by our tool, it is compared with a standard mesh topology executing the same workload. Table 3 compares the average network hops per flit, the average wire distance traveled by a flit, and the minimum hardware cost of the network to ensure correct routing, in terms of links, virtual channels, and average router radix. The test case is the 16 core heterogeneous system represented in Fig. 4, where the smallest tile is 20 x 40 microns (io1, etc.) and the largest is 120 x 40 microns (cache1, etc.).

Table 3. Mesh vs Custom Topology

                   Mesh          Custom
Avg. Hops / flit   2.62          1.16
Avg. Distance      154 microns   55.4 microns
Tot. Links         48            46
Tot. VCs           48            59
Avg. Radix         4             3.88

B. gpNoCSim Simulator

To obtain an accurate estimation of potential MCAPI communication performance, the network-on-chip simulator described in [6] was extended to handle custom topologies and integrated with our network generation tool. Continuing the topology comparison, simulation results are provided in Table 4. Note that the computation of performance metrics follows the equations given in [6]. Both topologies were simulated with the same input parameters and the same workload used previously.

Table 4. Simulator Results

                    Mesh   Custom
Avg. Hops / flit    2.83   1.23
Avg. Packet Delay   595    412
Throughput          0.23   0.27

CONCLUSION

An implementation of an embedded multicore communication API based on NoC technology has been presented. It has been shown that the proposed NoC communication fabric is an efficient mechanism for inter-core communication and provides a platform for a powerful MCAPI implementation. The integration of MCAPI with this communication architecture provides a useful design flow for implementing parallel algorithms written with MCAPI in programmable logic.

The synthesis results indicate that the proposed ISA extension has a minimal cost. The Virtex 5 FPGA has over 19000 slice registers and LUTs, so a significant number of these nodes could be implemented on a single chip. The simulation results indicate that the use of a custom topology is a worthwhile investment and provides the best communication solution for embedded applications.

Future work on this project includes implementing the entire MCAPI specification with the proposed instruction set extension and developing the necessary infrastructure so that a complete MCAPI program can execute on an FPGA for co-processing of parallel algorithms. The incorporation of the Sixthsense tool as described in the introduction is also a work in progress.

ACKNOWLEDGEMENTS

This work has been funded by NSF award CCF-0811429 and SRC task ID 1847.001.

REFERENCES

[1] Multicore Association, "Multicore Communications API Specification Version 1," www.multicore-association.org.
[2] W. Dally and B. Towles, "Route Packets, not Wires: On-Chip Interconnection Networks," Proceedings of the 38th DAC, 2001.
[3] J. Baumgartner, V. Paruthi, R. Kanzelman, and H. Mony, "Semi-Formal Verification at IBM," HLDVT, Nov. 2006.
[4] B. Meakin and G. Gopalakrishnan, "Workload Driven Synthesis of On-Chip Networks for Embedded Multicores," under submission.
[5] K. Yun and D. Dill, "Automatic Synthesis of Extended Burst Mode Circuits, Part I," IEEE Transactions on CAD of Integrated Circuits and Systems, 1998.
[6] H. Hossain, M. Ahmed, A. Al-Nayeem, T. Islam, and M. Akbar, "gpNoCSim – A General Purpose Simulator for Network-on-Chip," ICICT 2007.