Parameters of an On-Chip Network for Multi-Processor System-on-Chip Environments

Sang Kyun Kim, Wook-Jin Chung

1. INTRODUCTION

Networks-on-chip (NoCs) are replacing conventional on-chip communication via buses (global wires) as we head into the billion transistor era. Such a transition is inevitable, as the relative speed of a wire of given length does not improve alongside the technology scaling of transistors [4]. In addition, inserting regular switching fabrics is relatively straightforward within a modular system-on-chip (SoC) paradigm. This permits systematic network scaling and alleviates the VLSI design complexity problem [5][8].

As single processor performance approaches an inevitable wall, imposed by a multitude of factors, the exploitation of thread level parallelism (TLP) using chip multi-processors (CMP) is an attractive option [6]. In conjunction with the billion transistor era, this motivates multi-processor systems-on-chip (MPSoC), where computational cores, memory and I/O can all reside on the same silicon. The network that connects these components thus becomes a critical factor in building these chips.

Multi-processor interconnect architecture has been well studied and key design parameters have been identified (for example, Cray supercomputers' interconnection networks) [7]. However, those results cannot be directly applied to MPSoC environments, as these environments present additional constraints while also granting new resources. More importantly, because the optimality of a network architecture is highly sensitive to the particular task, it is difficult to generate a good solution by tweaking variables alone.

With this paper, we attempt to identify the important parameters involved in constructing on-chip networks for MPSoCs. We have chosen three previous works [1], [2], [3], each of which captures a portion of the answer to our question. By analytically merging the boundaries of the papers, we hope to develop a higher level view that can be helpful in gaining a more complete picture of the design space.

The remainder of this paper is organized as follows. In Section 2, we describe in detail the difficulties of formulating the design space. Sections 3, 4 and 5 are each dedicated to one of the papers, where we discuss its contributions to solving the problem. We combine and analyze the knowledge gained from the papers in Section 6 and then follow up with future research directions (Section 7). Section 8 concludes the paper.

2. THE PROBLEM

There seems to be a virtual consensus that NoCs should resemble the interconnect architectures of high-performance systems [2]. Thus the conventional performance metrics of throughput and latency are still applicable. However, due to the implications of the on-chip multi-processor environment (detailed in this section), the optimization goals are unclear and the set of metrics and parameters is incomplete.

2.1 Network on Silicon

In today's integrated circuit (IC) design, virtually all engineers must consider the power budget, whereas transistors are practically free. This constraint is especially prominent in architecting processors [10]. Adding an entire network to a chip can only worsen the already tight power limitation, and thus energy-efficient NoCs are strongly favored. Similarly, the additional silicon area required for implementing the network should not be overlooked, as it relates strongly to manufacturability and packaging.

On the other hand, there are also benefits to conceiving a network entirely on silicon. The on-chip wires are much shorter (compared to wires connecting neighboring chips), drastically reducing the propagation latency. Moreover, the number of pins on a node is no longer a concern, which allows vastly wider channels.

2.2 Multi-Processor Implications

Restricting our NoC scope to MPSoC interconnects gives us a general direction to focus on, because homogeneous tiled processors should exhibit a particular traffic distribution, both in space and time. However, having processing cores as the network's clients also complicates the problem, as the performance of the processors running applications must now be considered when evaluating that of the network. It is not just that one should not be a bottleneck of the other; rather, the two should be designed in concert to attain optimal performance for a finite set of resources.

2.3 Interdependencies

As seen in the relationship between the processors and the on-chip network, the two are not independent: they contend for the same resource (silicon area) and share common or similar parameters (packet size). It is these interdependencies within the design spectrum that make the problem of identifying key parameters hard. Due to this complex nature, all the papers of our selection ([1], [2], [3]) had to cull the exploration space in order to extract meaningful results or insights. In the following sections, our critique touches on whether such narrowing of scope is justifiable and how the results should be interpreted.

3. BALFOUR AND DALLY [1]

3.1 Overview

In this paper, the authors develop comprehensive analytical area and energy models specifically for MPSoC interconnection networks. With these, they then explore how parameters such as topology, routing, and buffer size affect throughput, area and energy. Using the insights acquired from the experimental results, a new NoC architecture is proposed.

A 64 tile CMP is considered for the on-chip target environment. Each tile contains a processing core as well as a portion of the memory (256KB). The communication protocol between nodes is abstracted to requests and replies, where packets are of two lengths: 64 and 576 bits.

Balfour and Dally model the area and energy of the network components as a set of equations, parameterized in terms of the configuration that defines the architecture being evaluated. This configuration includes channel width, router buffer depth, and also technology dependent parameters such as metal pitch and transistor capacitance. The optimal set of values is found by sweeping these variables and measuring the performance of the network. For a systematic comparison, performance is defined in terms of area efficiency (work completion time × chip area) and energy efficiency (work completion time × energy dissipated).

Figure 1. CMesh topology on 64 nodes. The express channels at the boundary routers permit equal degrees.

[...] exhibited the fastest completion time with the smallest chip area, as well as having relatively low power dissipation.

3.2 Main Contributions

Architecting the CMeshX2 is a large contribution of this paper. However, it is equally important to know what makes the design so efficient. Indeed, the authors identify two critical characteristics that allow CMeshX2 to excel: concentration and having an independent second network.

Concentration of nodes provides a more compact layout and reduces wiring, allowing wider channels. Also, fewer routers permit a lower hop count without increasing wiring complexity (as opposed to tree-based designs). Combined, these drastically reduce the overall latency. The added second network, not possible with all topologies, further reduces completion time by doubling peak throughput. The cost of the second network is higher power (but not area).

Knowing why CMeshX2 shines in turn reveals the weakness of tree-based networks. Trees too are effective in reducing the average hop count, and results indicate that they have better efficiencies than MeshX2. However, this gain comes at a high price in wiring complexity. Not only does complex wiring inhibit wider channels (which may be acceptable because on-chip environment provi