Parameters of a On-Chip Network for Multi-Processor System-On-Chip Environments

Sang Kyun Kim, Wook-Jin Chung

1. INTRODUCTION
Networks-on-chip (NoCs) are replacing the conventional way of on-chip communication via buses (global wires) as we head into the billion-transistor era. Such a transition is inevitable because the relative speed of wires does not improve alongside the technology scaling of transistors [4]. In addition, inserting regular switching fabrics is relatively straightforward within a modular system-on-chip (SoC) paradigm. This permits systematic network scaling and alleviates the VLSI design complexity problem [5][8].

As single-processor performance approaches an inevitable wall, imposed by a multitude of factors, the exploitation of thread-level parallelism (TLP) using chip multi-processors (CMP) is an attractive option [6]. In conjunction with the billion-transistor era, this motivates multi-processor systems-on-chip (MPSoC) where computational cores, memory and I/O can all reside on the same silicon. The network that connects these components thus becomes a critical factor in building these chips.

Multi-processor interconnect architecture has been well studied and key design parameters have been identified (for example, Cray supercomputers’ interconnection networks) [7]. However, those results cannot be directly applied to MPSoC environments, as these present additional constraints as well as new resources. More importantly, because the optimality of a network architecture is highly sensitive to the particular task, it is difficult to generate a good solution by tweaking variables alone.

With this paper, we attempt to identify the important parameters involved in constructing on-chip networks for MPSoCs. We have chosen three previous works [1], [2], [3], each of which captures a portion of the answer to our question. By analytically merging the boundaries of the papers, we hope to develop a higher-level view that can be helpful in gaining a more complete picture of the design space.

The remainder of this paper is organized as follows. In Section 2, we describe in detail the difficulties of formulating the design space. Sections 3, 4 and 5 are each dedicated to one of the papers, where we discuss its contributions to solving the problem. We combine and analyze the knowledge gained from the papers in Section 6 and then follow up with future research directions (Section 7). Section 8 concludes the paper.

2. THE PROBLEM
There seems to be a virtual consensus that NoCs should resemble the interconnect architectures of high-performance parallel computing systems [2]. Thus the conventional performance metrics of throughput and latency are still applicable. However, due to the implications of the on-chip multi-processor environment (detailed in this section), the optimization goals are unclear and the set of metrics and parameters is incomplete.

2.1 Network on Silicon
In today’s integrated circuit (IC) design, virtually all engineers must consider the power budget, whereas transistors are practically free. This constraint is especially prominent in architecting processors [10]. Adding an entire network to a chip can only worsen the already tight limitation in power, and thus energy-efficient NoCs are highly favored. Similarly, the additional silicon area required for implementing the network should not be overlooked, as it relates strongly to manufacturability and packaging.

On the other hand, there are also benefits to conceiving a network entirely on silicon. The on-chip wires are much shorter (compared to wires connecting neighboring chips), drastically reducing the propagation latency. Moreover, the number of pins on a node is no longer a concern, which allows vastly wider channels.

2.2 Multi-Processor Implications
Restricting our NoC scope to MPSoC interconnects gives us a general direction to focus on, because homogeneous tiled processors should exhibit a particular traffic distribution, both in space and time. However, having processing cores as the network’s clients also complicates the problem, as the performance of the processors running applications must now be considered when evaluating that of the network. It is not just that one should not be a bottleneck of the other; rather, the two should be designed in concert to attain optimal performance for a finite set of resources.

2.3 Interdependencies
As seen with the relationship between the processors and the on-chip network, they are not independent because they contend for the same resource (silicon area) and share common or similar parameters (packet size). It is these interdependencies within the design spectrum that make the problem of identifying key parameters hard.

Due to this complex nature, all the papers of our selection ([1], [2], [3]) had to cull the exploration space in order to extract meaningful results or insights. In the following sections, our critique touches on whether such narrowing of scope is justifiable and how the results should be interpreted.

3. BALFOUR AND DALLY [1]
3.1 Overview
In this paper, the authors develop comprehensive analytical area and energy models specifically for MPSoC interconnection networks. With these, they then explore how parameters such as topology, routing algorithm, and router buffer size affect throughput, area and energy. Using the insights acquired from the experimental results, a new NoC architecture is proposed.

A 64-tile CMP is considered for the on-chip target environment. Each tile contains a processing core as well as a portion of the memory (256KB). The communication protocol between nodes is abstracted to requests and replies, where packets are of two lengths: 64 and 576 bits.

Balfour and Dally model the area and energy of the network components as a set of equations, parameterized in terms of the configuration that defines the architecture being evaluated. This configuration includes channel width, router buffer depth, and also technology-dependent parameters such as metal pitch and transistor capacitance. The optimal set of values is found by sweeping these variables and measuring the performance of the network. For a systematic comparison, performance is defined with area efficiency (work completion time × chip area) and energy efficiency (work completion time × energy dissipated).

The paper evaluates the following networks: Mesh/MeshX2, Torus, Concentrated Mesh (CMesh)/CMeshX2, Fat-Tree (FTree), and Tapered Fat-Tree (TTree) [11]. CMesh is the new architecture proposed by the authors. As illustrated in Figure 1, it is a radix-4 mesh where each router serves 4 tiles. MeshX2 and CMeshX2 contain two complete sets of networks in parallel. This design is reasonable as both Mesh and CMesh have enough unused silicon area for the second network.

Figure 1. CMesh topology on 64 nodes. The express channels at the boundary routers permit equal degrees.

The experimental results strongly favor the CMeshX2 architecture as it has both the best area and energy efficiency. Compared to all other topologies, CMeshX2 exhibited the fastest completion time with the smallest chip area, as well as having relatively low power dissipation.

3.2 Main Contributions
Architecting the CMeshX2 is a large contribution of this paper. However, it is equally important to know what makes the design so efficient. Indeed, the authors identify two critical characteristics that allow CMeshX2 to excel: concentration and having an independent second network. Concentration of nodes provides a more compact layout and reduces wiring, allowing wider channels. Also, fewer routers permit a lower hop count without increasing wiring complexity (as opposed to tree-based designs). Combined, these drastically reduce the overall latency. The added second network, not possible with all topologies, further reduces completion time by doubling peak throughput. The cost of the second network is higher power (but not area).

Knowing why CMeshX2 shines in turn reveals the weakness of tree-based networks. Trees too are effective in reducing the average hop count, and results indicate that they have better efficiencies than MeshX2. However, this gain comes at a high price in wiring complexity. Not only does complex wiring inhibit wider channels (which may be acceptable because the on-chip environment provides wider channels), it also lengthens wires on average, increasing energy dissipation. In addition, the irregularity of the wiring layout makes the insertion of the second network more difficult.

3.3 Critique
The paper compares NoC architectures with two primary performance metrics, area and energy efficiency. Since the workload completion time is a factor in both, it is indeed the most pivotal measure in the evaluation methodology. The workload used for this purpose was a mixture of common synthetic traffic patterns, and CMeshX2 performed relatively well on all (to varying degrees). In spite of this, we are concerned with the fact that all synthetic traffic patterns were weighted equally. Clearly uniform and taper traffic are more relevant to a CMP environment than others. However, large time savings from less relevant patterns also

on-chip network topologies of the time (2005), the authors objectively sweep multiple parameters to determine their effect on performance. The findings of this paper

Figure 2. Number of virtual channels swept on all topologies. (a) shows the throughput, (b) the average message latency (cycles), and (c) graphs the average energy dissipation per packet (nJ).
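
As a rough illustration of two points from Section 3 (that concentration lowers the hop count, and that the comparison metrics are products of completion time with area or energy), the Python sketch below may help. It is not taken from [1]. It assumes minimal routing and uniform random traffic, reads the 64-tile CMesh as a 4 x 4 grid of routers, ignores the express channels, and uses a hypothetical 128-bit channel width. All function names are illustrative.

```python
from itertools import product
from math import ceil


def mean_hops(k: int) -> float:
    """Mean Manhattan distance between router pairs on a k x k mesh.

    Averages over all ordered (source, destination) pairs, including
    pairs where source equals destination (uniform random traffic).
    """
    nodes = list(product(range(k), repeat=2))
    total = sum(abs(sx - dx) + abs(sy - dy)
                for (sx, sy), (dx, dy) in product(nodes, repeat=2))
    return total / len(nodes) ** 2


def flits_per_packet(packet_bits: int, channel_bits: int) -> int:
    """Channel-width transfers needed to serialize one packet."""
    return ceil(packet_bits / channel_bits)


def efficiency_products(completion_time: float, chip_area: float,
                        energy: float) -> tuple[float, float]:
    """Return (time x area, time x energy); smaller is better for both."""
    return completion_time * chip_area, completion_time * energy


# Assumption: plain mesh has one router per tile (8 x 8), while the
# concentrated mesh has one router per four tiles (4 x 4).
mesh_hops = mean_hops(8)
cmesh_hops = mean_hops(4)
print(mesh_hops, cmesh_hops)

# Assumption: 128-bit channels (hypothetical); packet sizes from Section 3.1.
print(flits_per_packet(64, 128), flits_per_packet(576, 128))
```

Under these assumptions the mean router-to-router distance falls from 5.25 hops on the 8 x 8 mesh to 2.5 on the 4 x 4 concentrated grid, and a 576-bit packet needs five transfers on a 128-bit channel. The efficiency products are only meaningful for comparing configurations evaluated on the same workload.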
