Hardware Design, Synthesis, and Verification of a Multicore Communication API
Ben Meakin and Ganesh Gopalakrishnan
University of Utah
{meakin, ganesh}@cs.utah.edu

Abstract: Modern trends in computer architecture and semiconductor scaling are leading towards the design of chips with more and more processor cores. Highly concurrent hardware and software architectures are inevitable in future systems. One of the greatest problems in these systems is communication. Providing coherence, consistency, synchronization, and sharing of data in a multicore system requires that communication overhead be minimal. This paper describes the design of tightly integrated hardware/software mechanisms for providing extremely efficient on-chip communication. A multicore communication API (MCAPI), developed by the Multicore Association, is combined with network-on-chip (NoC) technology to provide software developers with a scalable, hardware-accelerated transport layer. The designs follow the methodology presented by MCAPI and target embedded multicore systems-on-chip (SoC). The key hardware, synthesis, and verification mechanisms are described. Actual and simulated results suggest that the proposed solution offers significant advantages over existing communication architectures in terms of performance, power, scalability, and software development efficiency.

INTRODUCTION

It is widely accepted that modern and future computing systems will see performance improvements primarily through exploiting increased process/thread level parallelism. One of the main prerequisites to exploiting this parallelism efficiently is the availability of APIs that are well matched with the communication/synchronization needs of this area. Clearly, "one size does not fit all." In the area of large-scale cluster computing, the Message Passing Interface (MPI), a very sophisticated API with over 300 functions, is the lingua franca. MPI is used to program cluster computers with up to hundreds of thousands of processing nodes. In other realms, such as embedded systems using commodity microprocessors, various real-time operating system primitives and shared memory threads serve the needs of communication and synchronization. However, in the rapidly exploding area of embedded systems based on multiple cores, chips contain not only multiple general purpose computing cores but also, for cost effectiveness and performance, application specific accelerators, I/O interfaces, and memory controllers. All of these on-chip devices need low overhead communication. As semiconductors continue to scale, more and more of these devices will be found on the same chip.

Instead of 'reinventing the wheel,' it is imperative that semiconductor companies agree on a standard software API that, on one hand, offers high efficiency, but on the other hand offers the ability to build and re-use software. The need for such a standardized API is underscored by the emergence of on-chip networks as the physical communication mechanism (as opposed to busses) [2]. Thus the standard API must be able to mimic the functionality offered by MPI and threads, but in a lightweight manner, and in a manner that meshes well with the existing (bus-based) and emerging (network-based) hardware transport mechanisms.

This paper describes an effort to merge these two trends: sophisticated transport mechanisms in hardware, driven by reusable, standard API based, high performance software. As the title suggests, there are three major components of this work. The hardware design component consists of creating HDL models of the key pieces of a scalable on-chip communication fabric and a processing core which implements an instruction set architecture capable of seamless integration with the chosen communication API. The synthesis component involves both logic synthesis and architectural synthesis of an application specific network-on-chip (NoC) topology using an internally developed tool. This tool also performs verification of deadlock-free routing functions for the generated custom topology, which is part of the verification component of this project. IBM's Sixthsense semi-formal verification tool is also incorporated to provide verification of the key hardware modules at a high level [3]. The semi-formal nature of this tool promises to provide significant coverage and an ability to verify designs at a high enough level of abstraction to make quality-of-service (QoS) guarantees for the resulting communication fabric, a task that is quite difficult with simulation. Portions of this project are currently a work in progress and more substantial results will be provided in future publications.

MULTICORE API IMPLEMENTATION

A. MCAPI Overview

The Multicore Communication API is a message passing interface that is similar to MPI. However, it is designed primarily for embedded devices where broad functionality is not as important as high performance in a few types of communication. MCAPI provides the communication primitives that can be used by operating systems, libraries, and applications to improve code portability across different hardware generations. Since the API is designed for on-chip communication, it makes few assumptions about the hardware architecture and leaves a lot of freedom for the implementation to take advantage of whatever optimizations may be permitted through the architecture. It aspires to provide good code portability across hardware generations. However, its simplicity also enables implementations with high performance communication, lower power consumption, and a lower memory footprint [1].

B. MIPS ISA Extension

The core functionality of the implementation described in this paper is controllable via a set of RISC type assembly instructions added as an extension to the MIPS instruction set. These instructions are given in Fig. 1. The decision to make the control of the communication hardware programmable is based on an effort to avoid overcomplicating the hardware.

Using these instructions, all of the communication functionality of MCAPI can be implemented. The send header instruction builds the packet header and sends it on the network. It includes source and destination node identifiers, as well as a packet class that indicates the type of data being sent (i.e., pointer/buffer, short, integer, or long). Note that zero copy data transfer through pointer passing is possible only if both communication endpoints reside in the same shared memory domain. The receive header instruction subsequently gets the packet header and writes it to a register. The header data can then be parsed using bit mask and shift instructions (standard with the MIPS ISA). The send/receive word instructions send or receive word length chunks of data. The get ID and flag instructions write the local node ID and the specified network status flag to registers, respectively.

Fig. 1. Communication Instructions

1) Example Send/Receive Implementation: With these added instructions, MCAPI can be implemented as a C library with in-line assembly code. An example implementation of the MCAPI message send function is given in Fig. 2. For brevity, some error checking code has been omitted. Note that the code size of these functions is relatively small. This is critical to minimizing the memory footprint of the library implementation such that it is suitable for embedded applications.

Fig. 2. Send / Receive Implementation

ON-CHIP NETWORK DESIGN

A. Topology Generation

MCAPI is intended to be used on embedded multicore platforms. These types of systems are typically geared towards performing one or more specific types of tasks such as network packet processing, graphics, or some type of multimedia processing. In these applications, regular patterns of communication are observed. Therefore, a general purpose communication fabric is not optimal. As part of this work, a tool has been developed which synthesizes a custom network topology for a given workload specification. The generated topology is optimized to reduce the average network hops per packet in an attempt to reduce power consumption and communication latency. The tool also attempts to find a placement of the network nodes such that wire lengths are minimized. The algorithms used by this tool are presented in detail in [4].

Fig. 4. Example Topology

An example topology generated by the tool is given in Fig. 4. Here, a network node is a tile consisting of a router, a network interface unit (NIU), a shared resource such as a processor, and an optional private resource such as an L1 cache.

B. Asynchronous Router Design

In systems such as this, it is also essential that the communication fabric supports different clock domains. Each heterogeneous processing element will often have a different ideal clock rate. Therefore, the bridging of these different clock domains must be provided by the network. This support is provided by designing the on-chip [...]

[...] proportional to the router radix. Fig. 5 depicts a 3 by 3 router. The topology synthesis tool generates networks with this in consideration. The user enters a maximum router radix based on the constraints of the system, and a network is produced where no router is larger than the maximum size and the average router size is minimized. This results in better average case performance that an asynchronous network can exploit.

C. Packet Structure

MCAPI packets are divided into flits as shown in Fig. 6. The head flit consists of a destination node and port ID, packet class, and sender ID. Different packet classes are used to implement