An Architecture and Compiler for Scalable On-Chip Communication

Jian Liang, Student Member, IEEE, Andrew Laffely, Sriram Srinivasan, and Russell Tessier, Member, IEEE

Abstract— A dramatic increase in single chip capacity has led to a revolution in on-chip integration. Design reuse and ease of implementation have become important aspects of the design process. This paper describes a new scalable single-chip communication architecture for heterogeneous resources, adaptive System-On-a-Chip (aSOC), and supporting software for application mapping. This architecture exhibits hardware simplicity and optimized support for compile-time scheduled communication. To illustrate the benefits of the architecture, four high-bandwidth signal processing applications, including an MPEG-2 video encoder and a Doppler radar processor, have been mapped to a prototype aSOC device using our design mapping technology. Through experimentation it is shown that aSOC communication outperforms a hierarchical bus-based system-on-chip (SoC) approach by up to a factor of five. A VLSI implementation of the communication architecture indicates clock rates of 400 MHz in 0.18 micron technology for sustained on-chip communication. In comparison to previously-published results for an MPEG-2 decoder, our on-chip interconnect shows a run-time improvement of over a factor of four.

This work was supported in part by the National Science Foundation under grants CCR-0081405 and CCR-9988238. J. Liang and R. Tessier are with the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003 USA. E-mail: [email protected]. A. Laffely is with the US Air Force, Hanscom AFB, MA 01731 USA. S. Srinivasan is with Advanced Micro Devices, Austin, TX 78741 USA.

I. INTRODUCTION

Recent advances in VLSI transistor capacity have led to dramatic increases in the amount of computation that can be performed on a single chip. Current industry estimates [1] indicate mass production of silicon devices containing over one billion transistors by 2012. This proliferation of resources enables the integration of complex system-on-a-chip (SoC) designs containing a wide range of intellectual property cores. To provide high performance, SoC integrators must consider the design of individual intellectual property (IP) cores, their on-chip interconnection, and application mapping approaches. In this paper, we address the latter two design issues through the introduction of a new on-chip communications architecture and supporting application mapping software. Our communications architecture is scalable to tens of cores and can be customized on a per-application basis.

Recent studies [1] have indicated that on-chip communication has become the limiting factor in SoC performance. As die sizes increase, the performance effect of lengthy, cross-chip communication becomes prohibitive. Future SoCs will require a communication substrate that can support a variety of diverse IP cores. Many contemporary bus-based architectures are limited in terms of physical scope by the need for dynamic arbitration of communication resources. Significant amounts of arbitration across even a small number of components can quickly form a performance bottleneck, especially for data-intensive, stream-based computation. This issue is made more complex by the need to compile high-level representations of applications to SoC environments. The heterogeneous nature of cores in terms of clock speed, resources, and processing capability makes cost modeling difficult. Additionally, communication modeling for interconnection with long wires and variable arbitration protocols limits the performance predictability required by computation scheduling.

Our platform for on-chip interconnect, adaptive System-On-a-Chip (aSOC), is a modular communications architecture. As shown in Fig. 1, an aSOC device contains a two-dimensional mesh of computational tiles. Each tile consists of a core and an associated communication interface. The interface design can be customized based on core datawidths and operating frequencies to allow for efficient use of resources. Communication between nodes takes place via pipelined, point-to-point connections. By limiting inter-core communication to short wires with predictable performance, high-speed communication can be achieved. A novel aspect of the architecture is its support for both compile-time scheduled and run-time dynamic transfer of data. While scheduled data transfer has been directly optimized, a software-based mechanism for run-time dynamic routing has also been included.

To support the aSOC architecture, an application mapping tool, AppMapper, has been developed to translate high-level language application representations to aSOC devices. Mapping steps, including code optimization, code partitioning, communication scheduling, and core-dependent compilation, are part of the AppMapper flow. Although each step has been fully automated, design interfaces for manual intervention are provided to improve mapping efficiency. Mapping algorithms have been developed for both partitioning and scheduling based on heuristic techniques. A system-level simulator allows for performance evaluation prior to design implementation.

Key components of the aSOC architecture, including the communication interface architecture, have been simulated and implemented in 0.18 micron technology. Experimentation shows a communication network speed of 400 MHz with an overhead added to on-chip IP cores that is similar to on-chip bus overhead. The AppMapper tool has been fully implemented and has been applied to four signal processing applications including MPEG-2 encoding. These applications have been mapped to aSOC devices containing up to 45 cores via the AppMapper tool. Performance comparisons between aSOC implementations and other more traditional on-chip communication substrates, such as an on-chip bus, show an aSOC performance improvement of up to a factor of five.

This paper presents a review of SoC-related communication architectures in Section II. Section III reveals the design philosophy of our communication approach and describes the technique at a high level. The details of our communication architecture are explained in Section IV. Section V demonstrates the application mapping methodology and describes component algorithms. Section VI describes the benchmarks used to evaluate our approach and our validation methodology. In Section VII, experimental results for aSOC devices with up to 49 cores are presented. These results are compared against the performance of alternative interconnect approaches and previously-published results. Section VIII summarizes our work and offers suggestions for future work.

Fig. 1. Adaptive System-on-a-Chip (aSOC)

II. RELATED WORK

Numerous on-chip interconnect approaches have been proposed commercially as a means to connect intellectual property cores. These approaches include arbitrated buses [2], [3], [4] and hierarchical buses connected via bridges [5], [6], [7]. In general, all of these architectures have similar arbitration characteristics to master/slave off-chip buses with several new features including data pipelining [2], replacement of tri-state drivers with multiplexers [3], and separated address/data buses due to the elimination of off-chip pin constraints. These approaches, while flexible, have limited scalability due to the arbitrated and capacitive nature of their interconnection. Other notable, common threads through on-chip interconnect architectures include the simplicity of the logic needed on a per-node basis to support communication, their diverse support for numerous master/slave interconnection topologies [8], and their integrated support for on-chip testing. Several current on-chip interconnects [3], [7] support the connection of multiple buses in variable topologies (e.g. partial crossbar, tree). This support provides users flexibility in coordinating on-chip data paths amongst heterogeneous components.

Recently, several network-on-chip communication architectures have been suggested. Researchers at Stanford University propose an SoC interconnect using packet-switching [9]. The idea of performing on-chip dynamic routing is described although not yet implemented. MicroNetwork [10] provides on-chip communication via a pipelined interconnect. A rotating resource arbitration scheme is used to coordinate inter-node transfer for dynamic requests. This mechanism is limited by the need for extensive user interaction in design mapping.

Packet-switched interconnect based on both compile-time static and run-time dynamic routing has been used effectively for multiprocessor communication for over 25 years. For iWarp [11], inter-processor communication patterns were statically determined during program compilation and implemented with the aid of programmable, inter-processor buffers. This concept has been extended by the NuMesh project [12] to include collections of heterogeneous processing elements interconnected in a mesh topology. Although pre-scheduled routing is appropriate for static data flows with predictable communication paths, most applications rely on at least minimal run-time support for data-dependent data transfer. Often, this support takes the form of complicated per-node dynamic routing hardware embedded within a communication fabric. A recent example of this approach can be found in the Reconfigurable Architecture Workstation (RAW) project [13]. In our system, we minimize hardware support for run-time dynamic routing through the use of software.

III. ASOC DESIGN PHILOSOPHY

Successful deployment of aSOC requires the architectural development of an inter-node communication interface, the creation of supporting design mapping software, and the successful translation of target applications. Before discussing these issues, the basic operating model of aSOC interconnect is presented.

A. Design Overview

As shown in Fig. 1, a standardized communication structure provides a convenient framework for the use of intellectual property cores. A simple core interface protocol, joining the core to the communication network, creates architectural modularity. By limiting inter-core communication to short wires exhibiting predictable performance, high-speed point-to-point transfer is achieved. Since heterogeneous cores can operate at a variety of clock frequencies, the communication interface provides both data transport and synchronization between processing and communication clock domains.

Inter-core communication using aSOC takes place in the form of data streams [12] which connect data sources to destinations. To achieve the highest possible bandwidth, our architecture is targeted towards applications, such as video, communications, and signal processing, that allow most inter-core communication patterns to be extracted at compile time. By using the mapping tools described in Section V, it is possible to determine how much bandwidth each inter-core data stream requires relative to available communication channel bandwidth. Since stream communication can generally be determined at compile time [12], our system can take advantage of minimized network congestion by scheduling data transfer in available data channels.

Fig. 2. Multi-core data streams 1 and 2. This example shows data streams from Tile A to Tile E and from Tile D to Tile F. Fractional bandwidth usage is indicated in italics.

As seen in Fig. 2, each stream requires a specific fraction of overall communication link bandwidth. For this example, Stream 1 consumes 0.5 of the available bandwidth along the links it uses and Stream 2 requires 0.25. This bandwidth is reserved for a stream even if it is not used at all times to transfer valid data. At specific times during the computation, data can be injected into the network at a lower rate than the reserved bandwidth, leaving some bandwidth unused. In general, the path taken by a stream may require data transfer on multiple consecutive clock cycles. On each clock cycle, a different stream can use the same communication resource. The assignment of streams to clock cycles is performed by a communication scheduler based on required stream bandwidth. Global communication is broken into a series of step-by-step hops that is coordinated by a distributed set of individual tile communication schedules. During communication scheduling, near-neighbor communication is coordinated between neighboring tiles. As a result of this bandwidth allocation, the dynamic timing of the core computation is decoupled from the scheduled timing of communications.

The cycle-by-cycle behavior of the two example data streams in Fig. 2 is shown in Fig. 3. For Stream 2, data from the core of Tile D is sent to the left (West) edge of Tile E during communication clock cycle 0 of a four-cycle schedule. During cycle 1, connectivity is enabled to transfer data from Tile E to the West edge of Tile F. Finally, in cycle 2 the data is moved to its destination, the core of Tile F. During four consecutive clock cycles, two data values are transmitted from Tile A to Tile E in a pipelined fashion, forming Stream 1. Note that the data stream is pipelined and the physical link between Tile D and Tile E is shared between the two streams at different points in time. Stream transfer schedules are iterative. At the conclusion of the fourth cycle, the four-cycle sequence re-starts at cycle 0 for new pieces of data. The communication interface serves as a cycle-by-cycle switch for stream data. Switch settings for the four-cycle transfer in Fig. 3 are shown in Table I.

Fig. 3. Pipelined stream communication across multiple communication interfaces

TABLE I
COMMUNICATION SCHEDULES FOR TILES IN FIG. 3

Cycle   Tile A          Tile D          Tile E          Tile F
0       core to South   core to East    -               -
1       core to South   North to East   West to East    -
2       -               North to East   West to core    West to core
3       -               -               West to core    -

Stream-based routing differs from previous static routing networks [11]. Static networks demand that all communication patterns be known at compile time along with the exact time of all data transfers between cores and the communication network. Unlike static routing, stream-based routing requires that bandwidth be allocated but not necessarily used during a specific invocation of the transfer schedule. Communication is set up as a pipeline from source to destination cores. This approach does not require the exact timing of all transfers; rather, data only needs to be inserted into the correct stream by the core interface at a communication cycle allocated to the stream. Computation can be overlapped with communication in this approach since the injection of stream data into the network is decoupled from the arrival of stream data.

B. Flow Control

Since cores may operate asynchronously to each other, individual stream data values must be tagged to indicate validity. When a valid stream data value is inserted into the network by a source core at the time slot allocated for the stream, it is tagged with a single valid bit. As a result of communication scheduling, the allocated communication cycle for stream data arrival at a destination is predefined. The data valid bit can be examined during the scheduled cycle to determine if valid data has been received by the destination. If data production for a stream source temporarily runs ahead of data consumption at a destination, data for a specific stream can temporarily back up in the communication network.
To avoid deadlock, data buffer storage is required in each intermediate communication interface for each stream passing through the interface. With buffering, if a single stream is temporarily blocked, other streams which use the affected communication interfaces can continue to operate unimpeded. A data buffer location for each stream is also used at each core-communication interface boundary for intermediate storage and clock synchronization.

The use of flow control bits and local communication interface buffering ensures data transfer with the following characteristics:
• All data in a stream follows the same source-destination path.
• All stream data is guaranteed to be transferred in order.
• In the absence of congestion, all stream data requires the same amount of time to be transferred from source to destination.
• Computation is overlapped with communication.
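To make the cycle-by-cycle switching concrete, the following C sketch steps through the four-cycle schedule of Table I for the two streams of Fig. 2. The port encoding, variable names, and data values are our own illustration for this example and are not taken from the aSOC implementation.

/* Minimal sketch of the four-cycle schedule in Table I (our own encoding,
 * not the real aSOC instruction format).  Stream 1: Tile A -> Tile E,
 * Stream 2: Tile D -> Tile F.  Each inter-tile hop holds one word per cycle. */
#include <stdio.h>

#define EMPTY (-1)

int main(void) {
    /* Pipeline registers on tile edges: value currently sitting on each hop. */
    int a_to_d = EMPTY;   /* South output of Tile A -> North input of Tile D */
    int d_to_e = EMPTY;   /* East output of Tile D  -> West input of Tile E  */
    int e_to_f = EMPTY;   /* East output of Tile E  -> West input of Tile F  */

    int s1_next = 100;    /* next datum produced by Tile A's core (Stream 1) */
    int s2_next = 200;    /* next datum produced by Tile D's core (Stream 2) */

    for (int cycle = 0; cycle < 4; cycle++) {
        int new_a_to_d = a_to_d, new_d_to_e = d_to_e, new_e_to_f = e_to_f;

        switch (cycle) {
        case 0: /* A: core->South, D: core->East */
            new_a_to_d = s1_next++;
            new_d_to_e = s2_next++;
            break;
        case 1: /* A: core->South, D: North->East, E: West->East */
            new_e_to_f = d_to_e;          /* Stream 2 advances          */
            new_d_to_e = a_to_d;          /* Stream 1 reuses D->E link  */
            new_a_to_d = s1_next++;
            break;
        case 2: /* D: North->East, E: West->core, F: West->core */
            printf("cycle 2: Tile F core receives %d (Stream 2)\n", e_to_f);
            printf("cycle 2: Tile E core receives %d (Stream 1)\n", d_to_e);
            new_d_to_e = a_to_d;
            new_e_to_f = EMPTY;
            new_a_to_d = EMPTY;
            break;
        case 3: /* E: West->core */
            printf("cycle 3: Tile E core receives %d (Stream 1)\n", d_to_e);
            new_d_to_e = EMPTY;
            break;
        }
        a_to_d = new_a_to_d;
        d_to_e = new_d_to_e;
        e_to_f = new_e_to_f;
    }
    return 0;
}

Over one schedule iteration the sketch delivers two Stream 1 values to Tile E and one Stream 2 value to Tile F, matching the 0.5 and 0.25 bandwidth fractions reserved for the two streams.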

C. Run-time Stream Management

For a number of real-time applications, inter-core communication patterns may vary over time. This requirement necessitates the capability to invoke and terminate streams at various points during application execution and, in some cases, to dynamically vary stream source and destination cores at run-time. In developing our architecture, we consider support for the following two situations: (1) all necessary streams required for execution are known at compile-time but are not all active simultaneously at run-time and, (2) some stream source-destination pairs can only be determined at run-time.

1) Asynchronous Global Branching: For some applications, it is desirable to execute a specific schedule of stream communication for a period of time, and then, in response to the arrival of a data value at the communication interface, switch to a communication schedule for a different set of streams. This type of communication behavior has the following characteristics:
• All stream schedules are known at compile time.
• The order of stream invocation and termination is known, but the time at which switches are made is determined in a data-dependent fashion.
• A data value traverses all affected communication interfaces (tiles) over a series of communication cycles to allow for a global change in communication patterns.

This asynchronous global branching technique [14] across predetermined stream schedules has been shown [15] to support many stream-based applications that exhibit time-varying communication patterns. The aSOC communication interface architecture supports these requirements by allowing local stream schedule changes based on the arrival of a specific data value at the communication interface. Depending on the value of the data, which is examined on a specific communication cycle, the previous schedule can be repeated or a new schedule, already stored in the communication interface, can be used. This technique does not require the loading of new schedules into the communication interface at run-time. Although our architecture supports run-time update of the schedule memory, our software does not currently exploit this capability. As a result, all required stream schedules must be loaded into the interface prior to run-time via an external interface and a shift chain.

The use of these branching mechanisms can be illustrated through a data transfer example. Consider a transfer pattern in which Tile D in Fig. 4 is required to first send a fixed set of data to Tile A and then send a different fixed set of data to Tile E. To indicate the need for a change in data destination, the Tile D core iteratively sends a value to its communication interface. When this value is decremented by the core to a value of 0, control for the communication schedule is changed to reflect a change in data destination. The two communication interface schedules for Tile D which support this behavior are shown in Table II. Each communication cycle is represented in the interface with a specific communication instruction. For cycles when data-dependent schedule branching can take place, the target instruction for a taken branch is listed second under the "next possible instr." heading. In these cycles, data is examined by the interface control to determine if branching should occur. The Tile D - Tile A stream schedule uses instructions 0 and 1. The Tile D - Tile E stream schedule uses instructions 2 and 3.

Fig. 4. Example of distinct stream paths for two communication schedules which send data from a source to different destinations.

TABLE II
DATA-DEPENDENT COMMUNICATION CONTROL BRANCHING FOR TILE D IN FIG. 4

Instr.   Interface connection   Next possible instr.   Branch?   Comment
0x0      core to North          0x1/-                  N         data to North
0x1      core to interface      0x0/0x2                Y         test count
0x2      core to East           0x3/-                  N         data to East
0x3      core to interface      0x2/0x0                Y         test count

2) Run-time Stream Creation: Given the simplicity of routing nodes and our goal to primarily support stream-based routing, communication hardware resources are not provided to route data from stream sources to destinations that have not been explicitly extracted at compile time (dynamic data). However, as we will show in Section VII, the streams extracted from the user program often require only a fraction of the overall available stream bandwidth. As a result, a series of low-bandwidth streams between all nodes can be allocated at compile time via scheduling in a round-robin fashion in otherwise unused bandwidth. Cores can take advantage of these out-of-band streams at run time by inserting dynamic data into a stream at the appropriate time so that data is transmitted to the desired destination core.
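The asynchronous global branching behavior of Table II can be mimicked with a small schedule interpreter. The instruction record and the count-down protocol below are a hedged sketch of the mechanism described above; the real communication interface encodes this information in instruction memory bits rather than a C structure.

/* Illustrative model of asynchronous global branching (Table II).
 * The instruction encoding below is our own sketch, not the actual
 * aSOC communication interface format. */
#include <stdio.h>

typedef struct {
    const char *connection;  /* crossbar setting for this cycle            */
    int next;                /* next instruction when no branch is taken   */
    int branch_target;       /* next instruction when the branch is taken  */
    int can_branch;          /* is the routed value examined this cycle?   */
} comm_instr;

/* Tile D schedules: instructions 0-1 send to Tile A (North),
 * instructions 2-3 send to Tile E (East). */
static const comm_instr imem[4] = {
    { "core to North",     1, -1, 0 },   /* 0x0: data to North       */
    { "core to interface", 0,  2, 1 },   /* 0x1: test count, branch? */
    { "core to East",      3, -1, 0 },   /* 0x2: data to East        */
    { "core to interface", 2,  0, 1 },   /* 0x3: test count, branch? */
};

int main(void) {
    /* The core sends a count-down value; when it reaches 0 the schedule branches. */
    int count = 3;
    int pc = 0;

    for (int cycle = 0; cycle < 16; cycle++) {
        const comm_instr *in = &imem[pc];
        printf("cycle %2d: instr 0x%x  %s\n", cycle, pc, in->connection);

        if (in->can_branch) {
            /* The value routed to the interface controller is compared to 0. */
            int value_from_core = count--;
            pc = (value_from_core == 0) ? in->branch_target : in->next;
            if (count < 0) count = 3;   /* core restarts its count */
        } else {
            pc = in->next;
        }
    }
    return 0;
}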

Fig. 5. Core and communication interface

IV. ASOC ARCHITECTURE

The aSOC architecture augments each IP core with communication hardware to form a computational tile. As seen in Fig. 5, tile resources are partitioned into an IP core and a communication interface (CI) to coordinate communication with neighboring tiles. The high-level view of the communication interface reveals the five components responsible for aSOC communications functionality:
• Interface Crossbar - allows for inter-tile and tile-core transfer.
• Instruction Memory - contains schedule instructions to configure the interface crossbar on a cycle-by-cycle basis.
• Interface Controller - control circuitry to select an instruction from the instruction memory.
• Coreport - data interface and storage for transfers to/from the tile IP core.
• Communication Data Memory (CDM) - buffer storage for inter-tile data transfer.

The interface crossbar allows for data transfer from any input port (North, South, East, West, and Coreport) to any output port (five input directions and the port into the controller). The crossbar is configured to change connectivity every clock cycle under the control of the interface controller. The controller contains a program counter and operates as a microsequencer. If, due to flow control signals, it is not possible to transfer a data word on a specific clock cycle, data is stored in a communication data memory (CDM). For local transfers between the local IP core and its communication interface, the coreport provides data storage and clock synchronization.

Fig. 6. Detailed communication interface

A. Communication Interface

A detailed view of the communication interface appears in Fig. 6. The programmable component of the interface is a 32-word SRAM-based instruction memory that dynamically configures the connectivity of the local interface crossbar on a cycle-by-cycle basis based on a pre-compiled schedule. This programmable memory holds binary code that is created by application mapping tools. Instruction memory bits are used to select the source port for each of the six interface destination ports (Nout, Sout, Eout, Wout, Cout for the core, and Iout for the interface control). CDM Addr indicates the buffer location in the communication data memory, which is used to store intermediate routed values as described in Section IV-C. A program counter (PC) is used to control the instruction sequence. Branch control signals from the instruction memory determine when data-dependent schedule branching should occur. This control can include a comparison of the Iout crossbar output to a fixed value of 0 to initiate branching.
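The description above maps naturally onto a per-cycle decode of one instruction word. The following C sketch models the source-select fields, CDM address, and branch control named in this subsection; the field widths, the simple PC+1 sequencing, and all identifiers are assumptions for illustration, not the actual aSOC instruction encoding.

/* Sketch of one communication interface instruction and its decode.
 * Field names follow the text (Nout, Sout, Eout, Wout, Cout, Iout source
 * selects, CDM address, branch control); the layout is assumed, not real. */
#include <stdint.h>
#include <stdio.h>

enum port { P_NORTH, P_SOUTH, P_EAST, P_WEST, P_CORE, P_NONE };

typedef struct {
    uint8_t src_nout;   /* source port driven onto the North output  */
    uint8_t src_sout;   /* ... South output                          */
    uint8_t src_eout;   /* ... East output                           */
    uint8_t src_wout;   /* ... West output                           */
    uint8_t src_cout;   /* ... Coreport (to the local IP core)       */
    uint8_t src_iout;   /* ... interface controller (branch compare) */
    uint8_t cdm_addr;   /* per-stream buffer slot in the CDM         */
    uint8_t branch_en;  /* examine Iout and branch if it equals 0    */
    uint8_t branch_pc;  /* target instruction for a taken branch     */
} ci_instr;

#define IMEM_DEPTH 32            /* 32-word SRAM instruction memory */

typedef struct {
    ci_instr imem[IMEM_DEPTH];
    uint8_t  pc;
} comm_interface;

/* One communication clock cycle: configure the crossbar from the current
 * instruction and advance the program counter. */
static void ci_step(comm_interface *ci, const int32_t in[5], int32_t out[6])
{
    const ci_instr *i = &ci->imem[ci->pc];
    const uint8_t sel[6] = { i->src_nout, i->src_sout, i->src_eout,
                             i->src_wout, i->src_cout, i->src_iout };

    for (int d = 0; d < 6; d++)
        out[d] = (sel[d] == P_NONE) ? 0 : in[sel[d]];

    if (i->branch_en && out[5] == 0)        /* Iout compared against 0 */
        ci->pc = i->branch_pc;
    else
        ci->pc = (uint8_t)((ci->pc + 1) % IMEM_DEPTH);
}

int main(void)
{
    comm_interface ci = {0};
    /* Instruction 0: route the West input to the core and to the East output. */
    ci.imem[0].src_cout = P_WEST;
    ci.imem[0].src_eout = P_WEST;
    ci.imem[0].src_nout = ci.imem[0].src_sout =
    ci.imem[0].src_wout = ci.imem[0].src_iout = P_NONE;

    int32_t in[5] = { 0, 0, 0, 42, 0 };   /* N, S, E, W, Core inputs */
    int32_t out[6];
    ci_step(&ci, in, out);
    printf("Cout=%d Eout=%d next pc=%d\n", (int)out[4], (int)out[2], (int)ci.pc);
    return 0;
}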
B. Coreports: Connecting Cores to the Network

The aSOC coreport architecture is designed to permit interfacing to a broad range of cores with a minimum amount of additional hardware, much like a bus interface. Both core-to-interface and interface-to-core transfers are performed using asynchronous handshaking to provide support for differing computation and communication clock rates. Both input and output coreports for a core contain dual-port memories (one input port, one output port). Each memory contains an addressable storage location for each individual stream, allowing multiple streams to be targeted to each core for both input and output.

The portion of the coreport closest to the core has been designed to be simple and flexible, like a traditional bus interface. This interface can easily be adapted to interact with a variety of IP cores. Since coreport reads and writes occur independently, the network can operate at a rate that is different than that of individual cores. Specific core interfacing depends on the core. For example, as described in Section VI-B, an R4000 microprocessor can be interfaced to the coreport via a microprocessor bus. For simpler cores, a state machine can control coreport/core interaction.
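A behavioral sketch of an output coreport follows: per-stream addressable storage with a valid bit, written by the core and drained by the communication interface at the stream's scheduled cycle. The function names and the simple retry-based handshake are our own illustration of the asynchronous handshaking described above, not the aSOC RTL.

/* Behavioral sketch of an output coreport: one addressable slot per stream,
 * tagged with a valid bit.  The core writes in its own clock domain; the
 * communication interface drains the slot at the stream's scheduled cycle. */
#include <stdint.h>
#include <stdio.h>

#define MAX_STREAMS 4          /* coreport memory depth used in the prototype */

typedef struct {
    int32_t data[MAX_STREAMS];
    int     valid[MAX_STREAMS];
} coreport;

/* Core side: returns 0 until the previous value for this stream has been
 * accepted by the network (simple handshake; the core retries or stalls). */
int coreport_write(coreport *cp, int stream, int32_t value)
{
    if (cp->valid[stream])
        return 0;              /* still occupied */
    cp->data[stream] = value;
    cp->valid[stream] = 1;
    return 1;
}

/* Network side: called by the communication interface on the cycle allocated
 * to `stream`; the valid bit tags whether real data is present. */
int coreport_read(coreport *cp, int stream, int32_t *value)
{
    if (!cp->valid[stream])
        return 0;              /* reserved slot unused this iteration */
    *value = cp->data[stream];
    cp->valid[stream] = 0;
    return 1;
}

int main(void)
{
    coreport cp = {0};
    int32_t v;

    coreport_write(&cp, 2, 17);            /* core produces a value for stream 2 */
    if (coreport_read(&cp, 2, &v))         /* CI drains it at the scheduled slot */
        printf("stream 2 carried %d\n", (int)v);
    if (!coreport_read(&cp, 2, &v))
        printf("stream 2 slot empty on the next iteration\n");
    return 0;
}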


Fig. 7. Flow control between neighboring tiles

C. Communication Data Memory

As described in Section III-B, due to network congestion or uneven source and destination core data rates, it may be necessary to buffer data at intermediate communication interfaces. As shown in Fig. 7, the communication data memory (CDM) provides one storage location for each stream that passes through a port of the communication interface. To facilitate interface layout, the memory is physically distributed across the N, S, E, W ports. On a given communication clock cycle, if a data value cannot be transferred successfully, it is stored in the CDM. The flow control bits that are transferred with the data can be used to indicate valid data storage.

In aSOC devices, near-neighbor flow control and buffering is used. Fig. 7 indicates the location of the communication data memory in relation to inter-tile data paths. On a given communication clock cycle, the stream address for each port indicates the stream that is to be transferred in the next cycle. Concurrently, the crossbar is configured by instruction memory signals (N..C) to transfer the value stored in the crossbar register. This value is transferred to the receiver at the same time the receiver valid bit is sent to the transmitter. This bit indicates if the CDM at the receiver already has a buffered value for the transmitted stream. If a previous value is present at the receiver, the transmitted value is stored in the transmitter CDM using the write signals shown entering the CDM on the left in Fig. 7. A multiplexer at the input to the crossbar register determines if the transmitted or previously-stored value is loaded into the crossbar register on subsequent transfer cycles.
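The near-neighbor flow control of Fig. 7 can be summarized by a single rule: a word that cannot be accepted by the receiver is parked in the transmitter's CDM slot for that stream and retried on the stream's next scheduled cycle. The C sketch below illustrates this rule under assumed data structures; it is not the aSOC register-transfer design.

/* Sketch of near-neighbor flow control between two communication interfaces.
 * Data layout and function names are our own illustration. */
#include <stdint.h>
#include <stdio.h>

#define STREAMS 8                     /* one CDM slot per stream and port */

typedef struct {
    int32_t cdm[STREAMS];             /* communication data memory        */
    int     cdm_valid[STREAMS];
} ci_port;

/* One scheduled transfer attempt for `stream` from tx to rx.
 * Returns 1 if a word crossed the link, 0 if it was buffered locally. */
int transfer(ci_port *tx, ci_port *rx, int stream, int32_t word, int word_valid)
{
    /* Prefer a previously parked word so in-order delivery is preserved. */
    if (tx->cdm_valid[stream]) {
        int32_t parked = tx->cdm[stream];
        if (!rx->cdm_valid[stream]) {
            rx->cdm[stream] = parked;
            rx->cdm_valid[stream] = 1;
            tx->cdm[stream] = word;            /* new word takes the slot   */
            tx->cdm_valid[stream] = word_valid;
            return 1;
        }
        return 0;                              /* receiver still blocked    */
    }
    if (!word_valid)
        return 1;                              /* reserved cycle, no data   */
    if (!rx->cdm_valid[stream]) {
        rx->cdm[stream] = word;
        rx->cdm_valid[stream] = 1;
        return 1;
    }
    tx->cdm[stream] = word;                    /* back-pressure: park it    */
    tx->cdm_valid[stream] = 1;
    return 0;
}

int main(void)
{
    ci_port tx = {0}, rx = {0};
    rx.cdm_valid[3] = 1;                       /* receiver busy for stream 3 */
    printf("first attempt delivered: %d\n", transfer(&tx, &rx, 3, 42, 1));
    rx.cdm_valid[3] = 0;                       /* receiver drains its slot   */
    printf("retry delivered: %d\n", transfer(&tx, &rx, 3, 0, 0));
    printf("rx now holds %d\n", (int)rx.cdm[3]);
    return 0;
}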
V. ASOC APPLICATION MAPPING TOOLS

The aSOC application mapping environment, AppMapper, builds upon existing compiler infrastructure and takes advantage of user interaction and communication estimation during the compilation process. AppMapper tools and methodology follow the flow shown in Fig. 8. Individual steps include:
• Preprocessing/conversion to intermediate format - Following parsing, high-level C constructs are translated to a unified abstract syntax tree (AST) format. After property annotation, AST representations are converted to the graph-based Stanford University Intermediate Format (SUIF) [16] that represents functions at both high and low levels.
• Basic block partitioning and assignment - An annealing-based partitioner operates on basic blocks based on core computation and communication cost models. The partitioner isolates intermediate-form structures to locate inter-core communication. The result of this phase is a refined task graph where the nodes are clustered branches of the syntax tree assigned to available aSOC cores and the inter-node arcs represent communication. The number and the type of nodes in this task graph match the number and type of cores found in the device. Following partitioning and assignment to core types, core tasks are allocated to individual cores located in the aSOC substrate so that computation load is balanced.
• Inter-core synchronization - Once computation is assigned to core resources, communication points are determined. The blocking points allow for synchronization of stream-based communication and predictable bandwidth.
• Communication scheduling - Inter-core communication streams are determined through a heuristic scheduler. This list-scheduling approach minimizes the overall critical path while avoiding communication congestion. Individual instruction memory binaries are generated following communication scheduling.
• Core compilation - Core compilation and communication scheduling are analyzed in tandem through the use of feedback. Core functionality is determined by native core compilation technology (e.g. field-programmable gate array (FPGA) synthesis, reduced instruction set computer (RISC) compiler). Communication calls between cores are provided through fine-grained send/receive operations.
• Code generation - As a final step, binary code for each core and communication interface is created.
These steps are presented in greater detail in subsequent subsections.

A. SUIF Preprocessing

The AppMapper front-end is built upon the SUIF compiler infrastructure [16]. SUIF provides a kernel of optimizations and intermediate representations for high-level C code structures. High-level representations are first translated into a language-independent abstract syntax tree format. This approach allows for object-oriented representation of loops, conditionals, and array accesses. Prior to partitioning, AppMapper takes advantage of several scalar SUIF optimization passes including constant propagation, forward propagation, constant folding, and scalar privatization [16]. The interprocedural representation supported in SUIF facilitates subsequent AppMapper partitioning, placement, and scheduling passes. SUIF supports interprocedural analysis rather than using procedural inlining. This representation allows for rapid evaluation of partitioning and communication cost and the invocation of dead-code elimination. Data references are tracked across procedures.

Fig. 8. aSOC application mapping flow

B. Basic Block Partitioning and Assignment

Following conversion to intermediate form, high-level code is presented as a series of basic blocks. These blocks represent sequential code, loop-level parallelism, and subroutine functions. Based on calling patterns, dataflow dependency between blocks is determined through both forward and reverse tracing of inter-block paths [17]. As a result of this dependence analysis, coarse-grained blocks can be scheduled to promote parallel computation. As shown in Fig. 9(b) for an infinite impulse response (IIR) filter, subfunction dependency forms a flowgraph of computation that can be scheduled. The most difficult part of determining this dependency is estimating the computation time of basic blocks across a range of cores to determine the core best suited for evaluation. The overall run time attributed to each basic block is determined by parameters of the computation. These include:
• β - run time - the execution time of a single invocation of a basic block on a specific core. Value β is based on the number of clock cycles, the speed of the core clock, and the amount of available parallelism.
• λ - invocation frequency - the number of times each basic block is invoked.
These parameters lead to an overall core run time of β × λ for each function. Core run-time estimates, β, are determined through instruction counts or through simulation, prior to compilation, using techniques described in Section VI-B. Clock rates, which vary from core to core, are taken into account during this determination. A goal of design mapping is to maximize the throughput of stream computation while minimizing ctotal, the inter-core transport time for basic block data. For a specific core, this value measures the shortest distance to another core of a different type.

Assignment of basic blocks to specific cores requires a cost model which takes both computation and communication into account. For AppMapper, this cost is represented as:

cost = x × Tcompute + y × (1 / Toverlap) + z × ctotal        (1)

where Tcompute indicates the combined computation time of all streams, Toverlap indicates computational parallelism, ctotal indicates the combined stream communication time, and x, y, and z are scaling constants. Minimization of this cost function forms the basis for basic block assignment. The value Tcompute is determined from the β parameters for each core. Prior to basic block assignment, small code blocks are clustered using Equation 1 in an effort to minimize inter-core transfer. To support placement, a set of N bins is created, one per target core. During the clustering phase, communication time is estimated by the distance of the shortest path between the two types of target cores. At the end of clustering, a collection of N block-based clusters remains. Dataflow dependency is tracked through the creation of basic block data predecessor and successor lists.

For core assignment, clustered blocks are assigned to unoccupied cores so that the cost expressed in Equation 1 is minimized. AppMapper provides a file-based interface for users to manually assign basic blocks to specific cores, if desired. Following greedy basic block assignment to cores, a swapping step is used to exchange tasks across different types of cores subject to the cost function in Equation 1. This step attempts to minimize system cost and critical path length by load balancing parallel computation across cores. Load balancing is supported by the second term in Equation 1. Basic block assignment is complicated by the presence of multiple cores with the same functionality in an aSOC device. Following basic block assignment to a specific type of core, it is necessary to match the block to a specific core at a fixed location. Given the small number of each type of core available (typically less than 5), a full enumeration of all core assignments is possible. For later-generation devices it may be possible to integrate this search with the basic block to core assignment phase.
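As a rough illustration of how Equation 1 drives clustering, greedy assignment, and the later swap step, the following C sketch evaluates the cost for a candidate partition. The data structures and the scaling constants x, y, and z are placeholders; the actual AppMapper cost models are more detailed.

/* Sketch of the Equation 1 cost used to compare candidate partitions.
 * All values and constants below are placeholders for illustration. */
#include <stdio.h>

typedef struct {
    double beta;      /* run time of one invocation on the assigned core */
    double lambda;    /* invocation frequency                            */
} block_cost;

typedef struct {
    double t_compute; /* combined computation time of all streams        */
    double t_overlap; /* computational parallelism (overlapped time)     */
    double c_total;   /* combined stream communication time              */
} partition_stats;

/* cost = x * Tcompute + y * (1 / Toverlap) + z * ctotal          (Eq. 1) */
double partition_cost(const partition_stats *p, double x, double y, double z)
{
    return x * p->t_compute + y * (1.0 / p->t_overlap) + z * p->c_total;
}

int main(void)
{
    /* Core run time of a block is beta * lambda (Section V-B). */
    block_cost blk = { .beta = 120.0, .lambda = 64.0 };
    printf("block run time: %.0f cycles\n", blk.beta * blk.lambda);

    partition_stats before = { .t_compute = 9000.0, .t_overlap = 2.0, .c_total = 700.0 };
    partition_stats after  = { .t_compute = 9200.0, .t_overlap = 3.5, .c_total = 450.0 };
    double x = 1.0, y = 2000.0, z = 1.0;       /* placeholder scaling constants */

    printf("cost before swap: %.1f\n", partition_cost(&before, x, y, z));
    printf("cost after swap:  %.1f\n", partition_cost(&after, x, y, z));
    return 0;
}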
C. Inter-core Synchronization

Synchronization between cores is required to ensure that data is consumed based on computational dependencies. Once basic blocks have been assigned to specific cores in the aSOC device, communication primitives are inserted into the intermediate form to indicate when communication should occur. These communications are blocking based on the transfer rate of the communication network. As shown in Fig. 9(a), the data transfer call to multiply-accumulate unit MAC1 follows an assignment to y. As a result, R4000 processing can be overlapped with MAC1 processing. Each inter-core communication represents a data stream, as indicated by w-labeled arcs in Fig. 9(b).
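The overlap between core computation and communication can be seen in the shape of a partitioned block after AppMapper inserts its blocking primitives, in the spirit of Fig. 9(a). The stream names and the send()/receive() signatures below are placeholders rather than the real AppMapper-generated interface; the stubs simply let the sketch run.

/* Illustrative shape of a partitioned R4000 block with blocking send/receive
 * primitives.  Stream identifiers and primitive signatures are placeholders. */
#include <stdio.h>

enum stream_id { MEM1_TO_R4000, R4000_TO_MAC1 };

/* Stand-in coreport primitives, not the AppMapper runtime. */
static int sample = 0;

static int receive(enum stream_id s)          { (void)s; return sample++; }
static void send(enum stream_id s, int value)
{
    (void)s;
    printf("to MAC1: %d\n", value);           /* MAC1 accumulates elsewhere */
}

/* R4000 basic block */
static void r4000_block(int n)
{
    for (int i = 0; i < n; i++) {
        int data = receive(MEM1_TO_R4000);    /* stream w0: input sample     */
        int y = 3 * data + 1;                 /* local computation           */
        send(R4000_TO_MAC1, y);               /* stream w3: y handed to MAC1 */
        /* The send follows the assignment to y, so the R4000 can start the
         * next iteration while MAC1 consumes y: computation overlaps with
         * communication. */
    }
}

int main(void) { r4000_block(4); return 0; }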
D. Communication Scheduling

Following basic block assignment, the number of streams and their associated sources and destinations are known. Given a set of streams, communication scheduling assigns streams to communication links based on a fixed schedule. Inter-tile communication is broken into a series of time steps, each of which is represented by a specific instruction in a communication interface instruction memory. Schedule cycle assignment is made so that the schedule length does not exceed the instruction storage capacity of each communication interface instruction memory (32 instructions). Each unidirectional inter-tile channel can transmit one data value on each communication clock cycle. Only one stream can use a channel during a specific clock cycle. In general, the length of a schedule must be at least as long as the longest stream Manhattan path. During schedule execution, multiple source-destination data transfers may take place per stream. For example, two stream transfers take place per schedule in the example shown in Fig. 3. To allow for flow control, all transfers for a stream must follow the same source-destination path.

Our communication scheduling algorithm forms multi-tile connections for all source-destination pairs in space through the creation of multi-tile routing paths. Sequencing in time is made by the assignment of data transfers to specific communication clock cycles. This space-time scheduling problem has been analyzed previously [14] in terms of static, but not stream-based, scheduling. For our scheduler, the schedule length L is set to the longest Manhattan source-destination path in terms of tiles. Streams are ordered by required stream bandwidth per tile. The following set of operations is performed to create a source-destination path for each stream prior to scheduling transfers along the paths:
• The shortest source-destination path for each stream is determined via maze routing using a per-tile cost function of gi = gi−1 + ci. In this equation, ci is the cost of using a tile communication channel, gi−1 is the cost of the route from the path source to tile i, and gi is the total cost of the path including tile i. The ci cost value represents a combination of the amount of channel bandwidth required for the path in relation to the bandwidth available and the distance from the channel to the stream destination.
• For multi-fanout streams, a Steiner tree approximation is used to complete routing. After an initial destination is reached, maze routes to additional destinations are started from previously-determined connections.

Following the assignment of streams to specific paths, the assignment of stream data transfers to specific communication clock cycles is performed. Each transfer must be scheduled separately within the communication schedule. Scheduling is performed via the following algorithm (a C sketch of this loop appears at the end of this subsection):
1) Set the length of the schedule to the length of the longest Manhattan path distance, L. Specific schedule time slots range from 0 to L − 1.
2) Order streams by required channel bandwidth.
3) For each stream:
   a) Set start time slot s to 0.
   b) For each transfer:
      i) Determine if inter-tile channels along the source-destination path are available during n consecutive communication clock cycles, where n is the stream path length.
      ii) If bandwidth is available, schedule the transfer communication, increment start time s, and go to step 3.b to schedule the next transfer.
      iii) Else increment start time s and go to step 3.b.i.
If any stream cannot fit into the length of the stream schedule L, the schedule length is incremented by one and the scheduling process is restarted.

In Section III-C.1, a technique is described which allows run-time switching between multiple communication schedules. To support multiple schedules, the communication scheduling algorithm must be invoked multiple times, once per schedule, and the length of the combined schedules must fit within the communication interface instruction memory.
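The scheduling loop of steps 1-3 can be sketched compactly in C. Streams are reduced to a path length, a list of channel identifiers, and a transfer count, and channel occupancy is a boolean table; the maze routing and bandwidth ordering performed by AppMapper are assumed to have happened already.

/* Compact sketch of the slot-assignment loop; simplified, not AppMapper. */
#include <stdio.h>
#include <string.h>

#define MAX_LEN   32      /* instruction memory capacity per interface */
#define MAX_CHAN  64
#define MAX_HOPS  8

typedef struct {
    int hops;                 /* path length n (number of channels used)  */
    int chan[MAX_HOPS];       /* channel id used at hop h                 */
    int transfers;            /* data transfers required per schedule     */
} stream;

static int busy[MAX_CHAN][MAX_LEN];   /* channel c occupied at slot t?    */

static int slot_free(const stream *s, int start, int L)
{
    for (int h = 0; h < s->hops; h++)
        if (busy[s->chan[h]][(start + h) % L]) return 0;
    return 1;
}

static void claim(const stream *s, int start, int L)
{
    for (int h = 0; h < s->hops; h++)
        busy[s->chan[h]][(start + h) % L] = 1;
}

/* Returns the final schedule length, or -1 if it exceeds the 32-entry memory. */
int schedule(const stream *streams, int n_streams, int longest_path)
{
    for (int L = longest_path; L <= MAX_LEN; L++) {        /* step 1 (+ restart) */
        memset(busy, 0, sizeof busy);
        int ok = 1;
        for (int i = 0; i < n_streams && ok; i++) {        /* streams pre-ordered */
            int s = 0;                                     /* step 3.a            */
            for (int t = 0; t < streams[i].transfers; t++) {
                while (s < L && !slot_free(&streams[i], s, L)) s++;  /* 3.b.i/iii */
                if (s == L) { ok = 0; break; }             /* does not fit in L   */
                claim(&streams[i], s, L);                  /* 3.b.ii              */
                s++;
            }
        }
        if (ok) return L;
    }
    return -1;
}

int main(void)
{
    /* Two example streams sharing one channel (channel ids are arbitrary). */
    stream all[2] = {
        { 2, { 0, 1 }, 2 },   /* two transfers per schedule iteration */
        { 2, { 1, 2 }, 1 }    /* one transfer, shares channel 1       */
    };
    printf("schedule length = %d cycles\n", schedule(all, 2, 2));
    return 0;
}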
E. Core Compilation and Code Generation

Following assignment of basic blocks to cores and scheduling, basic block intermediate form code is converted to representations that can be compiled by tools for each core. Back-end formats include assembly-level code (R4000 processor) and Verilog (FPGA, multiplier). These tools also provide an interface to the simulation environment described in Section VI-B. The back-end step in AppMapper involves the generation of instructions for the R4000 and bitstreams for the FPGA. Each communication interface is configured through the generation of communication instructions.

F. Comparison to Previous Mapping Tools

To date, few integrated compilation environments have been created for heterogeneous systems-on-a-chip. The MESCAL system [18] provides a high-level programming interface for embedded SoCs. Though flexible, this system is based on a communication protocol stack which may not be appropriate for data stream-based communication. Several projects [19], [20] have adapted embedded system compilers to SoC environments. These compilers target bus-based interconnect rather than a point-to-point network. Cost-based tradeoffs between on-chip hardware, software, and communication were evaluated by Wan et al. [8]. In Dick and Jha [21], on-chip task partitioning was followed by a hill-climbing based task placement stage.

Our mapping system and these previous efforts have similarities to software systems which map applications to a small number of processors and custom devices (hardware/software co-design [22]), and to parallel compilers which target a uniform collection of interconnected processors. Most codesign efforts [22] involve the migration of operational or basic block tasks from a single processor to custom hardware. The two primary operations performed in hardware/software codesign are the partitioning of operations and tasks between hardware and software and the scheduling of operations and communications [22]. The small number of devices involved (usually one or two processors and a small number of custom devices) allows for precise calculation of communication and timing requirements, facilitating partitioning and scheduling.

Our partitioning approach, which is based on task profiling and simulated annealing, extends earlier task-based codesign partitioning approaches [23], [24] to larger numbers of tasks and accurately models target processors and custom chips. Although all of these efforts require modeling of execution time, our approach addresses a larger number of target models and requires tradeoffs between numerous hardware/software partitions. This requires high-level modeling of both performance and partition size for a variety of different cores. Our approach to partition assignment of basic block tasks is slightly more complicated than typical codesign assignment. In general, the bus structure employed by codesign systems [23] limits the need for cost-based assignments. In contrast, our swap-based assignment approach for heterogeneous targets is simpler than the annealing-based technique used to assign basic blocks to a large homogeneous array of processors [13]. Since blocks are assigned to specific target cores during partitioning, the assignment search is significantly more constrained and can be simplified.

Our stream-based scheduling differs from previous codesign processor/custom hardware communication scheduling [22]. These scheduling techniques attempt to identify exact communication latency between processors and custom devices to ensure worst-case performance across a variety of bus transfer modes (e.g. burst/non-burst) [19]. Often instruction scheduling is overlapped with communication scheduling to validate timing assumptions [22]. In contrast, our communication scheduling approach ensures stream throughput over a period of computation.

Fig. 10. aSOC topologies: 9 and 16 cores

TABLE III
ASOC DEVICE CONFIGURATIONS

Array Configuration   R4000   FPGA   Mem   MAC
3 × 3                   2       3      2     2
4 × 4                   4       4      2     6
5 × 5                   9       2      4    10
6 × 6                  13       2      6    15
7 × 7                  18       3      8    20
VI. EXPERIMENTAL METHODOLOGY

To validate the aSOC approach, target applications have been mapped to implemented aSOC devices and to architectural models containing up to 49 cores. Parameters associated with the models are justified via a prototype aSOC device layout, described in Section VII. Examples of the 9 and 16 core models are shown in Fig. 10. The models consist of R4000 microprocessors [25], FPGA blocks, 32Kx8 SRAM blocks (MEM), and multiply-accumulate (MAC) cores. The same core configurations were used for all benchmarks. The FPGA core contains 121 logic clusters, each of which consists of four 4-input look-up tables (LUTs) and flip flops [26]. The core population of all aSOC configurations is shown in Table III.

A. Target aSOC Applications

Four applications from the communications, multimedia, and image processing domains have been mapped to the aSOC device models using the AppMapper flow described in Section V. Mapped applications include MPEG-2 encoding [27], orthogonal frequency division multiplexing (OFDM) [28], Doppler radar signal analysis [29], and image smoothing (IMG). An IIR filter kernel was used for initial analysis.

Fig. 11. Partitioning of MPEG-2 encoder to a 4×4 aSOC configuration

1) MPEG-2 Encoder: An MPEG-2 encoder was parallelized from sequential code [27] to take advantage of concurrent processing available in aSOC. Three 128×128 pixel frames, distributed with the benchmark, were used for aSOC evaluation. For the 4×4 aSOC configuration, MPEG-2 computation is partitioned as shown in Fig. 11. Thick arrows indicate video data flow and thin arrows illustrate control signal flow. Frame data blocks (16×16 pixels in size) in the Input Buffer core are compared against similarly-sized data blocks stored in the Reference Buffer and streamed into a series of multiply-accumulate cores via an R4000. These cores perform motion estimation by determining the accumulated difference across source and reconstructed block pixels, which is encoded by the discrete cosine transform (DCT) quantizer, implemented in an adjacent R4000. The data is then sent to the controller and Huffman coding is performed in preparation for transfer via a communication channel. Another copy of the DCT encoded data is transferred to the inverse discrete cosine transform (IDCT) circuit, implemented in an R4000 core. The reconstructed data from the IDCT core is then stored in the Reference Buffer core for later use. Data transfer is scheduled so that all computation and storage is pipelined.

2) Orthogonal Frequency Division Multiplexing: OFDM is a wireless communication protocol that allows data to be transmitted over a series of carrier frequencies [28]. OFDM provides high communication bandwidth and is resilient to RF interference. Multi-frequency transmission using OFDM requires multiple processing stages including inverse fast Fourier transform (IFFT), normalization, and noise-tolerance guard value insertion. As shown in Fig. 12, a 2048 complex-valued OFDM transmitter has been implemented on an aSOC model. The IFFT portion of the computation is performed using four R4000 and four FPGA cores. Resulting complex values are normalized with four MAC and four R4000 cores. A total of 512 guard values are determined by R4000 cores and stored along with normalized data in memory. The OFDM application exhibits communication patterns which change during application execution, as shown in the four stages of computation illustrated in Fig. 12. The run-time branching mechanism of the communication interface is used to coordinate communication branching for the four stages.

Fig. 12. OFDM mapped to 16 core aSOC model

3) Doppler Radar Signal Analysis: A stream-based Doppler radar receiver [29] was implemented and tested using an aSOC device. In a typical Doppler radar system, a sinusoidal signal is transmitted by an antenna, reflects off a target object, and returns to the antenna. As a result of the reflection, the received signal exhibits a frequency shift. This shift can be used to determine the speed and distance of the target through the use of a Fourier analysis unit. The main components of the analysis include a fast Fourier transform (FFT) of complex input values, a magnitude calculation of FFT results, and the selection of the largest frequency magnitude value. For the 16 core aSOC model, a 1024 point FFT, magnitude calculation, and frequency selection were performed by four R4000 and four FPGA cores. All calculation was performed on 64 bit complex values. Like OFDM, the Doppler receiver requires communication patterns which change during application execution. The run-time branching mechanism of the communication interface is used to coordinate communication branching for the four stages.

4) Image Smoothing: A linear smoothing filter was implemented in multi-core aSOC devices for images of size 800×600 pixels. The linear filter is applied to the image pixel matrix in a row-by-row fashion. The scalar value of each pixel is replaced by the average of the current value and its neighbors, resulting in local smoothing of the image and reducing the effects of noise. To take advantage of parallelism, each image is partitioned into horizontal slices and processed in separate pipelines. Data streams are sent from memory (MEM) cores to multiple R4000s, each accepting a single data stream. Inside each MAC, each pixel value is averaged with its eight neighbor values, resulting in nine intermediate values. Later in the stream, an FPGA-based circuit sums the values to generate averaged results. These results are buffered in a memory core. This application was mapped to aSOC models ranging in size from 9 to 49 cores by varying the number of slices processed in parallel.
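The per-pixel operation performed by each IMG pipeline is a 3×3 neighborhood average. The following C sketch shows the arithmetic for a small image; the slice partitioning and the MEM/R4000/MAC/FPGA pipeline structure described above are omitted.

/* Sketch of the per-pixel smoothing used by the IMG benchmark: every pixel
 * is replaced by the average of itself and its neighbors. */
#include <stdio.h>

#define W 8
#define H 6

void smooth(const unsigned char in[H][W], unsigned char out[H][W])
{
    for (int r = 0; r < H; r++) {
        for (int c = 0; c < W; c++) {
            int sum = 0, count = 0;
            for (int dr = -1; dr <= 1; dr++) {
                for (int dc = -1; dc <= 1; dc++) {
                    int rr = r + dr, cc = c + dc;
                    if (rr >= 0 && rr < H && cc >= 0 && cc < W) {
                        sum += in[rr][cc];   /* nine intermediate values   */
                        count++;             /* fewer at the image border  */
                    }
                }
            }
            out[r][c] = (unsigned char)(sum / count);
        }
    }
}

int main(void)
{
    unsigned char in[H][W] = {0}, out[H][W];
    in[2][3] = 90;                           /* isolated bright pixel      */
    smooth(in, out);
    printf("smoothed value at (2,3): %d\n", out[2][3]);  /* 90/9 = 10      */
    return 0;
}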

5) IIR Filter: A three-stage and a six-stage IIR filter were implemented using the 9 and 16 core aSOC models, respectively. The data distribution and collection stages of the filter use R4000s. MACs and FPGA cores are used to execute the middle stages of multiplication and accumulation. SRAM cores (MEM) buffer both source data and computed results. The overall application data rate is limited by aSOC communication speed.

B. Simulation Environment

To compare aSOC to a broad range of alternative on-chip interconnect approaches, including flat and hierarchical buses, a timing-accurate interconnect simulator was developed. This simulator is integrated with several IP core simulators to provide a complete simulation environment. The interaction between the computation and communication simulators provides a timing-accurate aSOC model that can be used to verify a spectrum of SoC architectures.

A flowchart of the simulator structure appears in Fig. 13. For our modeling environment, simulation takes place in two phases. In phase 1, simulation determines the exact number of core clock cycles between data exchanges with the communication network coreport interface. In phase 2, core computation time is determined between send and receive operations via core simulation which takes core cycle time into account. Data communication time is simultaneously calculated based on data availability and network congestion. Both computation and communication times are subsequently combined to determine overall run time.

Fig. 13. aSOC system simulator

During the first simulation phase, core computation is represented by C files created by AppMapper or user-created library files in C or Verilog. This compute information is used to determine the transfer times of core-network interaction. The execution times of core basic blocks are determined by invocation of individual core simulators. Cycle count results of these simulators are scaled based on the operating frequency of the cores. Specific simulators include:
• Simplescalar - This processor simulator [30] models instruction-level execution for the R4000 architecture. The simulator takes C code as input and determines the access order and number of execution cycles between coreport accesses. Cycle counts are measured through the use of breakpoint status information.
• FPGA block simulation - Unlike other cores, FPGA core logic is first created by the designer at the register-transfer level. The Verilog-XL simulator is then used to determine cycle counts between coreport transfers. To verify timing accuracy, all cores have been synthesized to four-input LUTs and flip flops using Synplicity Synplify.
• Multiply-accumulate - The multiply-accumulate core is modeled using a C-based simulator. Given the frequency of an input stream, the simulator determines the number of cycles between coreport interactions.
• SRAM memory cores (MEM) - SRAM cores are modeled using a C-based cycle-accurate simulator.
Layouts, described in Section VII, were used to determine per-cycle performance parameters of the FPGA, multiply-accumulate, and memory cores.

TABLE IV
COMPONENT PARAMETERS

Core                 Speed     Area (λ²)
Comm. interface      2.5 ns    2500 × 3500
R4000 (w/o cache)    5 ns      4.3 × 10^7 [25]
MAC                  5 ns      1500 × 1000
FPGA                 10 ns     27500 × 26500
MEM                  5 ns      7500 × 6500

The second stage of the simulator determines communication delay based on core compute times and instruction memory instructions. Following core timing determination, aSOC network communication ordering and delay is evaluated via the communication simulator. The instruction memory instructions are used to perform simulation of each tile's communication interface. This part of the simulator takes in multiple interconnect memory instruction files. High-level C code represents core and communication interfaces. As shown in Fig. 9(a), core compute delay is replaced with compute cycle (CompBlock) delays determined from the first simulation stage. The second input file to the simulator is a configuration file, previously generated by AppMapper. This file contains core location and speed information, the details of the inter-core topology, and the interconnection memory instructions for each communication interface. These files are linked with a simulator library to generate an executable. When the simulator is run, multiple core and communication interface processes are invoked in an event-driven fashion based on data movement, production, and consumption. CDM and coreport storage is modeled to allow for accurate evaluation of inter-tile storage.

TABLE V
BENCHMARK STATISTICS USED TO DETERMINE ASOC PARAMETERS

Design    No.     No.       Max CI      Max. Streams   Ave. Streams   Max. CPort    Ave. CPort
          cores   Streams   Instruct.   per CI         per CI         Mem. Depth    Mem. Depth
IIR       9       11        2           5              3.5            2             2.0
IIR       16      20        2           5              3.5            4             2.7
IMG       9       8         2           3              2.0            2             1.5
IMG       16      15        4           4              3.5            4             1.9
IMG       25      20        4           7              2.0            4             2.1
IMG       36      28        4           7              1.8            4             1.5
IMG       49      36        4           7              2.6            4             1.5
Doppler   16      32        8           6              2.1            4             1.6
OFDM      16      39        9           6              2.2            4             1.5
MPEG      16      19        4           8              3.6            4             2.3
MPEG      25      37        5           8              3.2            4             3.0
MPEG      36      55        5           8              2.1            4             3.1
MPEG      49      73        5           8              5.3            4             3.0

CDM and coreport storage is modeled to allow for accurate evaluation of inter-tile storage.

The simulator can model a variety of communications architectures based on the input parameter file. Architectures include the aSOC interconnect substrate, the IBM CoreConnect on-chip bus [7], a hierarchical CoreConnect bus, and a dynamic router. Parameters associated with aSOC, such as the core type, location, speed, and the communication interface configuration, can be set by the designer to explore aSOC performance on applications.
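As an illustration only, the kinds of per-tile parameters described above (core type, location, speed, and communication interface configuration, plus the selected interconnect model) could be captured in a structure such as the following. All field names are assumptions; the actual configuration-file format generated by AppMapper is not specified here.

```c
/* Hypothetical sketch of the simulator parameters described above.
 * Field names are illustrative only. */

typedef enum { ARCH_ASOC, ARCH_CORECONNECT, ARCH_HIER_CORECONNECT,
               ARCH_DYNAMIC_ROUTER } InterconnectArch;

typedef enum { CORE_R4000, CORE_FPGA, CORE_MAC, CORE_MEM } CoreType;

typedef struct {
    CoreType    core_type;
    int         row, col;         /* tile location in the mesh            */
    double      core_mhz;         /* core operating frequency             */
    int         ci_instr_depth;   /* CI instruction-memory depth (e.g. 32)*/
    int         cdm_depth;        /* depth of each CDM buffer             */
    int         coreport_depth;   /* coreport memory depth                */
    const char *instr_file;       /* interconnect memory instructions     */
} TileConfig;

typedef struct {
    InterconnectArch arch;        /* which interconnect model to simulate */
    int              rows, cols;  /* mesh dimensions                      */
    TileConfig      *tiles;       /* rows * cols entries                  */
} SimConfig;
```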

VII. RESULTS

To evaluate the benefits of aSOC versus other on-chip communication technologies, design mapping, simulation, and layout were performed. Benchmark simulation of aSOC models was used to determine architectural parameters. Core model assumptions were subsequently validated via layout. Our aSOC benchmark implementations were compared to implementations using alternative on-chip interconnect approaches and assessed against previously-published work. As a final step, the communication scalability of aSOC was evaluated.

A. aSOC Parameter Evaluation and Layout

The benchmarks described in Section VI were evaluated using the aSOC simulator to determine aSOC parameters such as the required number of instructions per instruction memory. The cores listed in Table IV were used in the configurations described in Section VI. R4000 performance and area were obtained from MIPS [25]. Multiply-accumulate, memory, and FPGA performance numbers were determined through core layout using TSMC 0.18um library parameters [31].

Benchmark run-time statistics determined via simulation are summarized in Table V. These statistics illustrate the usage of various CI resources across the set of applications. The values were determined with parameters set to the values that led to the best-performing application mapping. The statistics used for CI architectural choices are the maximum values in Table V. Although the maximum number of instructions per CI was relatively small for these designs (9), a depth of 32 was allocated in the aSOC prototype to accommodate expansion for future applications. Since the maximum total number of streams per CI is 8, each of the four CDM buffers per CI could be restricted to a depth of 2 in the prototype. The coreport memory depth was set to four, the maximum value in terms of streams across all benchmarks.

A prototype SoC device, including aSOC interconnect, was designed and implemented based on the experimentally-determined parameters. The 9-tile device layout in a 3 × 3 core configuration contains lookup-table based FPGA cores with 121 clusters of 4 four-input LUTs, a complete communication interface, and clock and power distribution. Each tile fits within 30,000 × 30,000 λ², with 2,500 × 3,500 λ² assigned to the communication interface and associated control and clock circuitry (about 6% of device area).

TABLE VI
COMPARISON OF ASOC AND CORECONNECT PERFORMANCE

                              9-Core Model           16-Core Model
Execution Time (mS)          IIR      IMG      IIR      IMG     MPEG    Doppler   OFDM
R4000                        0.049    327.0    0.350    327     152     0.80      4.40
CoreConnect                  0.012    22.0     0.016    30.5    173     0.13      0.21
CoreConnect (burst)          0.012    18.9     0.015    24.3    172     0.13      0.21
aSOC                         0.006    9.6      0.006    7.3     83      0.11      0.18
aSOC Speed-up vs. burst      2.0      2.0      2.5      3.3     2.1     1.2       1.2
Used aSOC links              8        8        33       27      41      26        45
aSOC max. link usage         10%      8%       37%      28%     25%     2%        4%
aSOC ave. link usage         7%       7%       22%      25%     5%      2%        3%
CoreConnect busy (burst)     91%      100%     100%     99%     67%     32%       37%

An H-tree clock distribution network is used to reduce clock skew between tiles. Layout was implemented using TSMC 0.18um library parameters, resulting in a communication clock speed of 400 MHz. The critical path of 2.5 ns in the communication interface involves the transfer of a flow control bit from a CDM buffer to the read control circuitry of a neighboring CDM buffer, as shown in the right-to-left path in Fig. 7. A layout snapshot of a communication interface, coreport, and a single FPGA cluster appears in Fig. 14.

Fig. 14. Layout of FPGA core and communication interface

The layouts of the communication interface and associated cores support the creation of a non-uniform mesh structure which is populated to optimize space consumption. As shown in Fig. 15, tile sizes range from 10,000 × 10,000 λ² to 30,000 × 30,000 λ². From the data in Table IV it can be determined that the communication interface incurs about a 20% area overhead for the R4000 processor. For comparison, an embedded Nios processor core [32] and its associated AMBA bus interface [6] were synthesized. A total of 206 out of 2904 logic cells (7%) were required for the AMBA interface, with additional area required for bus wiring. This result indicates that the aSOC communication interface is competitive with on-chip bus architectures in terms of core overhead.

Fig. 15. Non-uniform aSOC core configuration

B. Performance Comparison with Alternative On-Chip Interconnects

A series of experiments was performed to compare aSOC performance against three alternative on-chip communication architectures: a standard CoreConnect on-chip bus, a hierarchical CoreConnect bus, and a hypothetical network based on run-time dynamic routing [33]. Performance was evaluated using the aSOC simulator described in Section VI-B. In these experiments, the IP cores with the parameters shown in Table IV were aligned in the 9- and 16-core configurations shown in Fig. 10. For each interconnect approach, the relative placement of cores and the application partitioning were kept intact. Only the communication architecture connecting the cores was changed to obtain comparative results.

To evaluate aSOC bandwidth capabilities, a benchmark-based comparison is made between aSOC and the IBM CoreConnect processor local bus (PLB) [7]. The PLB architecture allows for simultaneous 32-bit read and write operations at 133 MHz. When necessary, bus arbitration is overlapped with data transfer. The architecture requires two cycles to complete a data transfer: one cycle to submit the address and a second cycle to transport the data. CoreConnect PLB supports burst transfers of up to 16 words. The maximum possible speedup for a burst transfer versus multiple single-word transfers is about 2x.
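As a quick check of this bound (our arithmetic, using only the PLB timing stated above: one address cycle plus one data cycle per transfer, and bursts of up to 16 words):

\[
\mathrm{speedup}_{\max} \;=\; \frac{16 \times 2\ \text{cycles (single-word)}}{(1 + 16)\ \text{cycles (burst)}} \;=\; \frac{32}{17} \;\approx\; 1.9 \;\approx\; 2\times .
\]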
It can be seen in Table VI that the aSOC performance improvement over CoreConnect increases with a larger number of cores. Run times on a single 200 MHz R4000 are provided for reference. The relative aSOC improvement over CoreConnect burst transfer is indicated in the row labelled aSOC Speed-up vs. burst. For most designs the CoreConnect implementation leads to saturated or nearly-saturated bus usage (as indicated by the row labelled CoreConnect busy).

A limiting factor for shared on-chip buses is scalability. To provide a fairer comparison to aSOC, a set of experiments was performed using a hierarchical version of the CoreConnect bus. Three separate CoreConnect PLBs connect rows of cores shown in Fig. 10. A CoreConnect OPB bridge [7] joins three subbuses (for 9 cores) or four subbuses (for 16 cores). When a cross-subbus transfer request is made, the OPB bridge serves as a bus slave on the source subbus and as a master on the destination subbus.

TABLE VII
COMPARISON OF ASOC AND HIERARCHICAL CORECONNECT PERFORMANCE

                              9-Core Model           16-Core Model
Execution Time (mS)          IIR      IMG      IIR      IMG     MPEG    Doppler   OFDM
Hier. CoreConnect            0.013    26.0     15.7     37.4    178     0.15      0.22
aSOC                         0.006    9.6      7.0      7.3     83      0.11      0.18
aSOC Speed-up                2.1      2.7      2.2      5.1     2.2     1.4       1.2
subbus 0 busy                85%      97%      99%      100%    94%     30%       30%
subbus 1 busy                72%      83%      99%      61%     94%     12%       27%
OPB bridge busy              40%      65%      81%      60%     93%     16%       36%

As shown in Tables VI and VII, for all but one design, the aSOC speedup versus the hierarchical CoreConnect bus is larger than the speedup versus the standard CoreConnect bus. This effect is due to the overhead of setting up cross-bus data transfers.

In a third set of experiments, the aSOC interconnect approach was compared to a hypothetical on-chip dynamic routing approach. This dynamic routing model applies oblivious dynamic routing [33] with one 400 MHz router allocated per IP core. The tile topology for the near-neighbor dynamic network is the same as shown in Fig. 10. For each transmitted piece of data, a header indicating the coordinates of the target node is injected into the network, followed by up to 20 data packets. To allow for a fair comparison to aSOC flow control, the routing buffer in each dynamic router is set to the maximum size required by an application. The results in Table VIII indicate that aSOC is up to 2 times faster than the dynamic model. The performance improvements are based on the removal of header processing and on compile-time congestion avoidance through scheduling.

TABLE VIII
COMPARISON OF ASOC AND DYNAMIC NETWORK PERFORMANCE

                    9-Core Model        16-Core Model
Time (mS)          IIR      IMG      IIR      IMG      MPEG     OFDM
Dynamic            0.008    14.4     8.7      9.7      162.0    0.19
aSOC               0.006    9.6      7.0      7.3      82.5     0.18
Speedup            1.3      1.5      1.3      1.3      2.0      1.1

C. Comparison to Published Results

Several experiments were performed to compare the results of aSOC interconnect against previously-published on-chip interconnect results. An MPEG-2 decoder, developed from four Motorola PowerPC 750 cores interconnected with a CoreConnect bus, was reported in [34]. The four 83 MHz compute nodes require communication arbitration and contain associated on-chip data and instruction caches. During decoding, frames of 16x16 pixels are distributed to all processors and results are collected by a single processor. To provide a fair performance comparison to this MPEG-2 decoder, our aSOC simulator was supplemented with SimpleScalar 3.0 for PowerPC [30] and applied to four PowerPC 750 core tiles interconnected with communication interfaces. The partitioning of computation was derived from [34], following consultation with the authors. In the experiment, the PowerPC cores run at 83 MHz and the aSOC communication network runs at 400 MHz. Table IX compares our results to the previously published work. Unlike the 64-bit, 133 MHz CoreConnect model, aSOC avoids communication congestion by avoiding arbitration and by providing a faster transfer rate (32 bits at 400 MHz) through point-to-point transfers.

In previously published work [35], OFDM was also implemented using four 83 MHz PowerPC 750 cores interconnected with a 64-bit, 133 MHz CoreConnect bus. Each packet of OFDM data contains 2048 complex-valued samples and a 512-sample complex-valued guard signal. This application was partitioned into four stages: initiation, inverse FFT, normalization, and guard signal insertion. Each stage was mapped onto a separate processor core. Like the MPEG-2 decoder described above, the same mapping of computation to 83 MHz PowerPC 750 cores was applied to aSOC and modeled using the SimpleScalar and aSOC interconnect simulators. Results are shown in Table IX. The aSOC implementation achieves improved performance for this application by providing high bandwidth and pipelined transfer.

Unlike the results for MPEG-2 and OFDM shown in Tables V through VIII, communication is not overlapped with computation during execution of these applications. This approach is consistent with the method used to obtain the previously-published results [34], [35].

TABLE IX
COMPARISON TO PUBLISHED WORK

MPEG-2 Decoder       Throughput (Mbps)
CoreConnect [34]     0.68
aSOC                 2.88

OFDM                 Throughput (Mbps)
CoreConnect [35]     2.19
aSOC                 5.67
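Reading the ratios directly from Table IX (our arithmetic, not figures quoted from [34], [35]):

\[
\text{MPEG-2: } \frac{2.88\ \mathrm{Mbps}}{0.68\ \mathrm{Mbps}} \approx 4.2\times, \qquad
\text{OFDM: } \frac{5.67\ \mathrm{Mbps}}{2.19\ \mathrm{Mbps}} \approx 2.6\times ,
\]

i.e., the aSOC interconnect provides roughly a four-fold throughput improvement for the MPEG-2 decoder and better than a 2.5-fold improvement for OFDM over the published CoreConnect-based implementations.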
D. Architectural Scalability

An important aspect of a communication architecture is scalability. For interconnect architectures, a scalable interconnect can be defined as one that provides scalable bandwidth with a reasonable (e.g., linear) latency increase as the number of processing nodes increases and as the computing problem size increases [36].
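As a rough check of this definition against the MPEG-2 encoder threading results reported in Table X (our arithmetic from the table entries):

\[
\frac{2.23\ \text{pixel}/\mu\mathrm{s}}{0.60\ \text{pixel}/\mu\mathrm{s}} \approx 3.7 \;\;\text{(throughput)}, \qquad
\frac{45}{12} = 3.75 \;\;\text{(used cores)}, \qquad
\frac{35{,}311{,}402}{33{,}002{,}480} \approx 1.07 \;\;\text{(comm. cycles)},
\]

that is, throughput grows nearly linearly with the number of cores used while total communication cycles grow by only about 7%.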

Under this definition, aSOC provides scalable bandwidth for many applications, including MPEG-2 encoding and image smoothing.

The MPEG-2 encoder in Fig. 11 can be scaled by replicating core functionality, allowing multiple frames to be processed simultaneously in separate threads. A bottleneck of this approach is the common Input Buffer and data collection buffer at the input and output of the encoder. Although the communication delay of distributing the data to the threads can be overlapped with computation, communication congestion and data buffer contention can still lead to performance degradation as the design size scales. Table X illustrates the scalable performance improvement for multiple MPEG-2 threads implemented on aSOC. Device sizes ranging between 16 and 49 cores were considered. Total communication cycles increased only marginally to accommodate routing and Input Buffer contention.

TABLE X
SCALABILITY OF THE MPEG-2 ENCODER ON ASOC

Threads   Core Config.   Used Cores   Comm. Cycles   Throughput (pixel/uS)   Throughput (Mbps)
1         4 x 4          12           33,002,480     0.60                    7.15
2         5 x 5          23           34,906,796     1.13                    13.51
3         6 x 6          34           34,916,246     1.69                    20.27
4         7 x 7          45           35,311,402     2.23                    26.72

Using a similar multiprocessing technique, the image smoothing application was parallelized across a scaled number of cores using multiple threads applied to a fixed image size. Each 3-pixel-high slice is handled by an R4000, a MAC, and an FPGA. Table XI illustrates the scalability of the application across multiple simultaneously-processed slices. The image source and destination storage buffers are shared across slices. Application execution time scales down with increased core count until contention inside the storage buffers eliminates further improvement.

TABLE XI
SCALABILITY OF IMAGE SMOOTHING FOR AN 800x600 PIXEL IMAGE

Slices   Core Configuration   Used Cores   Execution Time (mS)
2        3 x 3                8            9.61
4        4 x 4                14           7.27
6        5 x 5                20           4.84
9        7 x 7                38           4.75

In a final demonstration of architectural scalability, a number of multi-point Doppler evaluations were implemented on a 16-core aSOC model. Execution time results of the Doppler application using the CoreConnect and aSOC interconnect approaches are shown in Table XII. The benefits of aSOC over CoreConnect are due to the elimination of bus arbitration.

TABLE XII
DOPPLER RUN TIME FOR N POINTS (TIMES IN uS)

N             32     64     128     256    512    1024
R4000         50.0   89.0   176.0   263    792    1900
CoreConnect   3.7    7.4    28.2    60     130    340
aSOC          2.7    6.6    18.3    46     110    260

VIII. CONCLUSIONS AND FUTURE WORK

A new communication substrate for on-chip communication (aSOC) has been designed and implemented. The distributed nature of the aSOC interconnect allows for scalable bandwidth. Supporting mapping tools have been developed to aid in design translation to aSOC devices. The compiler accepts high-level design representations, isolates code basic blocks, and assigns blocks to specific cores. Data transfer times between cores are determined through heuristic scheduling. An integrated core and interconnect simulation environment allows for accurate system modeling prior to device fabrication. To validate aSOC, experimentation was performed with four benchmark circuits. It was found that the aSOC interconnect approach outperforms the standard IBM CoreConnect on-chip bus protocol by up to a factor of five and compares favorably to previously-published work. A nine-core prototype aSOC chip including both FPGA cores and associated communication interfaces was designed and constructed.

We plan to extend this work by considering the addition of some dynamic routing hardware to the communication interface. The use of stream-based programming languages for aSOC also provides an opportunity for further investigation.

REFERENCES

[1] The International Technology Roadmap for Semiconductors, Semiconductor Industry Association, 2001.
[2] IDT Peripheral Bus: Intermodule Connection Technology Enables Broad Range of System-Level Integration, IDT, Inc., 2000.
[3] System-on-Chip (SoC) Interconnect Architecture for Portable IP Cores, Revision B.3, Silicore, Inc., Sept. 2002.
[4] Silicon Micronetworks Technical Overview, Sonics, Inc., Jan. 2002.
[5] P. J. Aldworth, "System-on-a-Chip bus architecture for embedded applications," in Proc. IEEE Int. Conf. on Computer Design, Austin, TX, Oct. 1999, pp. 297-298.
[6] D. Flynn, "AMBA: Enabling reusable on-chip design," IEEE Micro, vol. 17, no. 1, pp. 20-27, July 1997.
[7] The CoreConnect Bus Architecture, International Business Machines, Inc., Sept. 1999.
[8] M. Wan, Y. Ichikawa, D. Lidsky, and J. Rabaey, "An energy-conscious exploration methodology for heterogeneous DSPs," in Proc. IEEE Custom Integrated Circuits Conf., Santa Clara, CA, May 1998, pp. 111-117.
[9] W. J. Dally and B. Towles, "Route packets, not wires: On-chip interconnection networks," in Proc. ACM/IEEE Design Automation Conference, Las Vegas, NV, June 2001, pp. 684-689.
[10] D. Wingard, "MicroNetwork-based integration for SOCs," in Proc. ACM/IEEE Design Automation Conference, Las Vegas, NV, June 2001, pp. 673-677.
[11] S. Borkar, R. Cohn, G. Cox, T. Gross, H. T. Kung, M. Lam, M. Levine, B. Moore, W. Moore, C. Peterson, J. Susman, J. Sutton, J. Urbanski, and J. Webb, "Supporting systolic and memory communication in iWarp," in Proc. 17th Int. Symp. Computer Architecture, June 1990, pp. 70-81.
[12] D. Shoemaker, C. Metcalf, and S. Ward, "NuMesh: An architecture optimized for scheduled communication," Journal of Supercomputing, vol. 10, no. 3, pp. 285-302, Aug. 1996.
[13] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal, "Baring it all to software: Raw Machines," IEEE Computer, vol. 30, no. 9, pp. 86-93, Sept. 1997.
[14] W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and S. Amarasinghe, "Space-time scheduling of instruction-level parallelism on a RAW machine," in Proc. Eighth ACM Conf. on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, Oct. 1998, pp. 46-57.
[15] J. Babb, M. Rinard, C. A. Moritz, W. Lee, M. Frank, R. Barua, and S. Amarasinghe, "Parallelizing applications to silicon," in Proc. IEEE Symp. on Field-Programmable Custom Computing Machines, Napa, CA, Apr. 1999, pp. 70-80.
[16] R. Wilson, R. French, C. Wilson, S. Amarasinghe, J. Anderson, S. Tjiang, S. Liao, C. W. Tseng, M. Hall, M. Lam, and J. Hennessy, "SUIF: An infrastructure for research on parallelizing and optimizing compilers," ACM SIGPLAN Notices, vol. 29, no. 12, pp. 31-37, Dec. 1994.
[17] J. Babb, R. Tessier, M. Dahl, S. Hanono, and A. Agarwal, "Logic emulation with Virtual Wires," IEEE Trans. Computer-Aided Design, vol. 16, no. 6, pp. 609-626, June 1997.
[18] K. Keutzer, S. Malik, R. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli, "System-level design: Orthogonalization of concerns and platform-based design," IEEE Trans. Computer-Aided Design, vol. 19, no. 12, pp. 1523-1543, Dec. 2000.
[19] P. Knudsen and J. Madsen, "Integrating communication protocol selection with hardware/software codesign," IEEE Trans. Computer-Aided Design, vol. 18, no. 8, pp. 1077-1095, Aug. 1999.
[20] K. Lahiri, A. Raghunathan, and S. Dey, "Performance analysis of systems with multi-channel communication," in Proc. Int. Conf. on VLSI Design, Calcutta, India, Jan. 2000, pp. 530-537.
[21] R. Dick and N. K. Jha, "MOCSYN: Multiobjective core-based single-chip system synthesis," in Proc. European Conf. Design, Automation and Test, Munich, Germany, Mar. 1999, pp. 263-270.
[22] G. DeMicheli and R. Gupta, "Hardware/software codesign," Proc. IEEE, vol. 85, no. 3, pp. 349-365, Mar. 1997.
[23] P. Eles, Z. Peng, K. Kuchcinski, and A. Doboli, "System level hardware/software partitioning based on simulated annealing and tabu search," Design Automation for Embedded Systems, vol. 2, no. 1, pp. 5-32, Jan. 1997.
[24] R. Ernst, J. Henkel, and T. Benner, "Hardware-software cosynthesis for microcontrollers," IEEE Design and Test of Computers, vol. 10, no. 4, pp. 64-75, Dec. 1993.
[25] MIPS R4000 Product Specification, MIPS Corporation, 2000.
[26] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, Kluwer Academic Publishers, Norwell, MA, 1999.
[27] M. Ghanbari, Video Coding: An Introduction to Standard Codecs, The Institution of Electrical Engineers, London, England, 1999.
[28] D. Kim and G. Stuber, "Performance of multiresolution OFDM on frequency-selective fading channels," IEEE Trans. Vehicular Technology, vol. 48, no. 5, pp. 1740-1746, Sept. 1999.
[29] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Prentice-Hall, Inc., Upper Saddle River, NJ, 1999.
[30] D. Burger and T. M. Austin, "The SimpleScalar Tool Set, Version 2.0," University of Wisconsin, Madison, Department of Computer Science Technical Report 1342, June 1997.
[31] T. Schaffer, A. Stanaski, A. Glaser, and P. Franzon, "The NCSU design kit for IC fabrication through MOSIS," in Int. Cadence User Group Conf., Austin, TX, Sept. 1998, pp. 71-80.
[32] Altera Nios Processor Handbook, Altera Corporation, 2003.
[33] W. Dally and H. Aoki, "Deadlock-free adaptive routing in multicomputer networks using virtual channels," IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 4, pp. 466-475, Apr. 1993.
[34] K. Ryu, E. Shin, and V. Mooney, "A comparison of five different multiprocessor SoC bus architectures," in Proc. EUROMICRO Symp. on Digital Systems Design, Warsaw, Poland, Sept. 2001, pp. 202-209.
[35] K. Ryu and V. Mooney, "Automated bus generation for multiprocessor SoC design," in Proc. European Conf. Design, Automation and Test, Munich, Germany, Mar. 2003, pp. 282-287.
[36] D. Culler, J. Pal Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, San Francisco, CA, 1999.

Jian Liang received the B.S. and M.S. degrees in electrical engineering from Tsinghua University, Beijing, China, in 1996 and 1999, respectively. He is currently pursuing the Ph.D. degree at the University of Massachusetts, Amherst. His research interests include reconfigurable computing, system-on-a-chip design, digital signal processing, and communication theory.

Andrew Laffely received the M.S. degree in electrical engineering from the University of Maine and the Ph.D. degree in electrical and computer engineering from the University of Massachusetts, Amherst. Dr. Laffely is currently involved in the operational test of various aircraft platforms for the United States Air Force. He has previously taught at the United States Air Force Academy.

Sriram Srinivasan received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Madras, India, in 2000 and the M.S. degree from the University of Massachusetts, Amherst, in 2002. He is currently working as a design engineer for Advanced Micro Devices (AMD), involved in next-generation design. His current research interests include VLSI circuit design for low-power and high-speed applications.

Russell Tessier is an assistant professor of electrical and computer engineering at the University of Massachusetts, Amherst. He received the B.S. degree in computer engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1989 and the S.M. and Ph.D. degrees in electrical engineering from the Massachusetts Institute of Technology in 1992 and 1999, respectively. Dr. Tessier was a founder of Virtual Machine Works, a logic emulation company currently owned by Mentor Graphics. Prof. Tessier leads the Reconfigurable Computing Group at UMass. His research interests include computer architecture, field-programmable gate arrays, and system verification.