
A HIGHLY MODULAR ROUTER MICROARCHITECTURE FOR NETWORKS-ON-CHIP

by

Wo-Tak Wu

Copyright © Wo-Tak Wu 2019

A Dissertation Submitted to the Faculty of the

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

In Partial Fulfillment of the Requirements

For the Degree of

DOCTOR OF PHILOSOPHY

In the Graduate College

THE UNIVERSITY OF ARIZONA

2019

THE UNIVERSITY OF ARIZONA
GRADUATE COLLEGE

As members of the Dissertation Committee, we certify that we have read the dissertation prepared by Wo-Tak Wu, titled A HIGHLY MODULAR ROUTER MICROARCHITECTURE FOR NETWORKS-ON-CHIP and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.

Dr. Linda Powers

Dr. Roman Lysecky    Date: August 7, 2018

Final approval and acceptance of this dissertation is contingent upon the candidate's submission of the final copies of the dissertation to the Graduate College.

I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.

Dissertation Director: Dr. Janet Roveda    Date: August 7, 2018

Acknowledgements

I would like to express my gratitude to Prof. Ahmed Louri for introducing me to network-on-chip, an exciting new area in computer architecture research. Under his guidance and support, I was able to learn a great deal in this area, and we were able to publish our research results [1]. Unfortunately, Prof. Louri moved on to George Washington University, but I decided to stay here at UA, and I picked up a totally different research direction in network-on-chip. This dissertation represents the results of the second half of my research career at UA.

Most importantly, I would like to thank my current advisors, Profs. Janet Roveda and Linda Powers. They are truly great teachers and mentors. Without their guidance, support and encouragement, it would not have been possible to finish my graduate studies at UA. I would also like to thank Prof. Roman Lysecky for serving on the dissertation committee and on the written comprehensive exam committee a few years ago.

On a personal note, I must thank my family, especially my wife Katie, for their unconditional love and support. Without them, this long journey would never have even started.

Contents

List of Figures 8

List of Tables 10

Abstract 11

Chapter 1 Introduction 13
1.1 Chip Multiprocessor ...... 13
1.2 Network-on-Chip ...... 16
1.3 Motivation ...... 17
1.4 Contributions ...... 21
1.5 Dissertation Outline ...... 21

Chapter 2 Network-on-Chip Basics 23
2.1 Bus-Based Interconnect ...... 23
2.2 NoC ...... 25
2.2.1 Wires ...... 26
2.2.2 Router ...... 26
2.2.3 Latency ...... 27
2.2.4 Power ...... 27
2.2.5 Area ...... 28
2.3 Features ...... 28
2.3.1 Bandwidth ...... 28
2.3.2 Scalability ...... 28
2.3.3 Parallel Communications ...... 29
2.3.4 Clock Frequency ...... 30
2.3.5 Fault Tolerance ...... 30
2.3.6 Power Consumption ...... 31
2.3.7 System Integration ...... 31
2.3.8 Chip Layout ...... 31
2.3.9 Clock Distribution ...... 32
2.3.10 Packet Switching ...... 32
2.3.11 Summary ...... 32
2.4 Key NoC Characteristics ...... 33
2.4.1 Topology ...... 34
2.4.2 Routing Algorithm ...... 36
2.4.3 Flow Control ...... 37
2.5 Challenges ...... 38

Chapter 3 Omega Router 40
3.1 Conventional Router ...... 40
3.1.1 Microarchitecture ...... 41
3.1.2 Router Pipeline ...... 43
3.2 Omega Microarchitecture ...... 44
3.2.1 Top-Level Design ...... 46
3.2.2 Exchange ...... 47
3.2.3 Datapath ...... 50
3.2.4 Timing ...... 52
3.2.5 Routing ...... 53
3.3 Evaluations ...... 56
3.3.1 Simulator ...... 56
3.3.2 VLSI Design Tools ...... 58
3.3.3 Network Configurations ...... 58
3.3.4 Network Traffic ...... 60
3.3.5 Simulation Platform ...... 64
3.3.6 Running Simulations ...... 65
3.4 Experiments and Results ...... 66
3.4.1 Synthetic Traffic ...... 67
3.4.2 PARSEC Applications ...... 68
3.4.3 Circuit Synthesis ...... 68
3.5 Analyses ...... 73
3.5.1 Network Latency ...... 74
3.5.2 Network Saturation ...... 77
3.5.3 Network Throughput ...... 79
3.5.4 Area and Power ...... 80
3.5.5 Critical Path Delay ...... 81
3.5.6 Summary ...... 82
3.6 Related Work ...... 82
3.7 Discussion ...... 84

Chapter 4 Circuit Implementation 86
4.1 Route Computation ...... 86
4.1.1 Inter-Router ...... 87
4.1.2 Inter-Exchange ...... 87
4.2 Buffer ...... 89
4.2.1 Write ...... 89
4.2.2 Read ...... 91
4.3 Output Arbiter ...... 91
4.4 Buffer Arbiter ...... 94
4.5 Summary ...... 95

Chapter 5 Buffer and Link Utilization Improvement 96
5.1 Motivation ...... 96
5.2 Microarchitecture Enhancement ...... 97
5.2.1 Merging ...... 98
5.2.2 Splitting ...... 99
5.3 Evaluations ...... 100
5.4 Results and Analysis ...... 102
5.4.1 Network Latency ...... 102
5.4.2 Network Saturation ...... 102
5.4.3 Network Throughput ...... 103
5.4.4 Area, Power and Critical Path Delay ...... 104
5.5 Summary ...... 107

Chapter 6 Conclusion 108

Bibliography 112

List of Figures

1.1 42 Years of Microprocessor Trend Data...... 15

1.2 Generic CMP System Configuration...... 16

1.3 A network-on-chip connecting CMP and off-chip memory...... 18

1.4 Link/Buffer utilization at various widths...... 19

2.1 A single bus connecting all processing cores...... 24

2.2 A simple RC model of a circuit...... 24

2.3 A 4 × 4 mesh network with four concurrent communication paths...... 29

2.4 Message format...... 34

2.5 Network topology examples...... 35

2.6 (a) Packet traverses from Node 7 to Node 2. (b) Routing tables ...... 37

3.1 Conventional router microarchitecture...... 41

3.2 Router pipeline...... 44

3.3 Conventional router microarchitecture datapath...... 45

3.4 Omega router microarchitecture with a ring network of exchanges...... 46

3.5 (a) Exchange interface. (b) Exchange internal design...... 48

3.6 Exchange buffer arbiters...... 50

3.7 Omega router datapath...... 51

3.8 Omega router buffer read and write operations ...... 53

3.9 Time-space diagram showing how a flit traverses a series of exchanges with five types of operations ...... 54
3.10 Average network latencies from 6 synthetic traffic patterns, full range ...... 69
3.11 Average network latencies from 6 synthetic traffic patterns, up to saturation ...... 70
3.12 Average network throughputs in all synthetic traffic ...... 71
3.13 Average network latencies from PARSEC applications ...... 72
3.14 Average network latencies from 6 traffic patterns ...... 74
3.15 Saturation points from 6 traffic patterns ...... 77
3.16 Performances normalized to base-1-8 ...... 83

4.1 High level view of buffer ...... 90
4.2 Write operation of buffer ...... 90
4.3 Read operation of buffer ...... 91
4.4 Output arbiter high-level view ...... 92
4.5 Buffer arbiter high-level view ...... 95

5.1 Datapath of merging flit at the input of a virtual channel ...... 98
5.2 Datapath of splitting flit at the front-end of an exchange ...... 100
5.3 Average network latencies from synthetic uniform traffic, up to saturation ...... 103
5.4 Average network latencies from synthetic uniform traffic, full range ...... 104
5.5 Flit merging percentage in the router ...... 105
5.6 Average packet throughputs from synthetic uniform traffic ...... 107

List of Tables

2.1 Summary of features of NoCs...... 33

3.1 Routing inside a router ...... 55
3.2 Circuit synthesis parameters ...... 60
3.3 Network parameters ...... 60
3.4 PARSEC simulation completion status on Omega router ...... 73
3.5 Router area and power consumptions and slacks ...... 73
3.6 Network latency reduction over the baseline design (base-1-8) ...... 75
3.7 Number of virtual channels in router designs ...... 75
3.8 Network latency reduction over the baseline design (base-1-8) ...... 76
3.9 Network saturation increase over the baseline design (base-1-8) ...... 78
3.10 Router area and power consumptions over the baseline design (base-1-8) ...... 80
3.11 Critical path delays of various router designs ...... 81

5.1 Flit segment redirecting during merging ...... 99
5.2 Flit segment redirecting during splitting ...... 101
5.3 Network parameters for evaluating flit merging ...... 101
5.4 Network saturation comparisons ...... 106
5.5 Router area and power consumptions and slacks ...... 106
5.6 Increases in area, power and critical path delay of omega-2-m over omega-2 ...... 106

Abstract

Advances in semiconductor process technology in the past several decades have brought about an abundance of transistors that can be fabricated on a single silicon die. Microprocessor designers have been integrating more and more processing cores on-chip by taking advantage of such abundance. Network-on-Chip (NoC) has become a popular choice for connecting a large number of processing cores in chip multiprocessor designs. NoC provides many advantages over the traditional bus-based approach in terms of bandwidth, scalability, latency, etc.

The central part of an NoC is the router. In a conventional NoC design, most of the router area is occupied by the buffers and the crossbar switch. Not surprisingly, these two components also consume the majority of the router's power. Most NoC research has been based on the conventional router microarchitecture, in areas such as routing algorithms, resource allocation/arbitration and buffer design. There has not been much work on drastic router microarchitecture redesign.

In this dissertation, a novel router microarchitecture is proposed, which we call Omega, that treats the router itself as a small network with a ring topology. Omega is highly modular and much simpler than the conventional design. It does not use a large crossbar switch as in the conventional design; packet switching is done with simple muxes. Furthermore, the network packet latency is greatly reduced. Simulation and circuit synthesis show that the Omega microarchitecture can reduce latency, area and power by 53%, 34% and 27%, respectively, compared to the conventional design.

The Omega microarchitecture also provides opportunities to implement features that do not exist or are difficult to realize in the conventional design. To demonstrate this, we implement a new feature on the Omega router to merge packets together in the buffer. The merged packets traverse the network together as long as their routes to their destinations do not diverge. This greatly improves buffer and link utilization. As a result, the effective network capacity can be substantially increased.

This dissertation presents one of the first efforts toward a new router microarchitecture that supports packet merging. In future work, additional characterizations can be done to better understand its potential for various applications, and perhaps its shortcomings, to push performance even further.

Chapter 1

Introduction

This dissertation proposes a novel router microarchitecture for on-chip networks that performs better than the conventional design in terms of network latency, circuit area and power. The new design is basically a small network inside the router that is highly modular and much simpler than the conventional design. This chapter describes how, over the years, the computer architecture research community arrived at today's multicore microprocessor architecture and at the network approach to connecting the large number of processing cores integrated on a single die.

1.1 Chip Multiprocessor

Thanks to continual advances in semiconductor fabrication technology over the past several decades, Moore's Law has continued to hold quite well. Transistor size has been shrinking steadily, and Figure 1.1 shows how a number of key microprocessor characteristics have changed over the years. One aspect that stands out is the number of transistors integrated in a microprocessor, which has been increasing exponentially. As processor designers took advantage of the high availability of transistors and used higher and higher clock rates, a physical limitation in power was reached [2]. It was getting more difficult to dissipate the tremendous amount of heat generated by the circuitry. In the past 20 years, the clock frequency could no longer be increased beyond the 3-GHz range without using exotic cooling techniques. To further improve system performance, microprocessor architects placed multiple simpler processing cores on the silicon die and ran them at a more moderate frequency. In addition, parallel programming was more widely adopted in application development to take advantage of the multiple processing cores. In general, the more cores are used, the better the achievable performance. Therefore, the number of processing cores that can be put on a single die has been increasing. Nowadays, it is not uncommon for an ordinary personal computer or a smartphone to have two to four cores, while commercial servers have 12 to 28 cores.

The aforementioned processors with multiple cores are commonly referred to as chip multiprocessors (CMPs). A typical CMP contains only a small amount of on-chip high-speed memory, mostly used as caches, which are implemented as SRAM (Static Random-Access Memory). Many modern applications require gigabytes of memory to run efficiently. Such a large amount of memory can only be implemented cost-effectively as DRAM (Dynamic Random-Access Memory) [4,5]. The process technologies of SRAM and DRAM are quite different and cannot easily co-exist on the same silicon die; SRAM is optimized for performance, whereas DRAM is optimized for cost. As a result, DRAM modules are integrated externally to CMPs in a system. SRAM and logic use the same process technology. Therefore, to improve the run-time performance of applications, processor designers make use of the abundance of transistors to integrate more and more processing cores and the appropriate amount of SRAM on a die. Such a situation can be observed in Figure 1.1, where the "Number of Logical Cores" plot starts to increase when the "Frequency" plot begins to level off. Today, in some special implementations, the number of cores in a CMP has even exceeded one thousand [6–8].

Figure 1.1: 42 Years of Microprocessor Trend Data. Source: [3]. Notice that the vertical axis is in log scale. Original data up to the year 2010 were collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten. Data for 2010-2017 were collected and plotted by K. Rupp.

There are basically two types of CMP systems. In a heterogeneous system, the processing cores are different; they are dedicated to specific types of computations. For example, it may have a few cores for general-purpose computing, a core for FFT, a core for encryption/decryption and a core for matrix multiplication. In a homogeneous system, all the processing cores are identical and work together in parallel to do the computation. In general, heterogeneous systems are targeted at embedded applications, where workloads are few and well defined, and real-time response is important; homogeneous systems are geared toward compute-intensive applications, like weather forecasting and circuit simulation, where workloads are dynamic and may take many hours to finish. This dissertation focuses on homogeneous systems that have from a few tens of cores to over a thousand. Each core has its own program to execute independently under a global clock.

Figure 1.2: Generic CMP System Configuration.

Figure 1.2 shows the general framework of the system of interest. The CMP connects to a bank of off-chip memory modules, and it also communicates with the outside world through a subsystem of I/O units. In this work, we only target the interactions between the CMP and the memory, ignoring the I/O aspect for the time being.

1.2 Network-on-Chip

Modern chip die sizes are in the few hundred mm² range [9]. The die size of the latest high-end Intel Skylake processor is 698 mm², and that of the AMD Epyc processor is 756 mm². Banks of gigabytes of off-chip memory are connected to the processor chip through a high-speed bus. With such small die areas, there is a severe physical limitation on the number of contacts that can be built onto the chip to bring signals into and out of it. The "pin count" of a modern processor chip ranges from a few hundred to a few thousand [10]. The Intel Skylake processor has 2066 pins, while the AMD Epyc processor has 4096. Many of the contacts are for power and ground connections, and the data bus width is usually at least 64 bits. Consequently, not many data channels can be established at the chip interface. Given the large number of cores (many tens or hundreds) inside a chip, the interface to the off-chip memory can be a communication bottleneck. Since many modern applications require vast memory to execute efficiently, it is critical to have a connection infrastructure that can bring a large amount of data to a large number of processing cores efficiently.

The traditional bus approach to connecting a large number of cores on-chip would not work because of bandwidth and latency issues. In the early 2000s, the very promising approach of building a packet-switching network on-chip (NoC) was proposed [11], and it has gained much traction over the years. Since the identical cores are laid out on a 2-D plane on the die, a regular planar mesh network is a sensible infrastructure to connect the cores and the off-chip resources together. It can achieve high bandwidth and low latency in communication. (We will address NoC in greater detail in Chapter 2.) Components within the network communicate with each other using packets. For example, a core sends a command packet to the memory controller to request a cache line, and some time later, it receives a data packet containing the required cache line. Figure 1.3 shows an example of such an infrastructure. The CMP has 16 processing cores on-chip connected by a 2-D network of routers. Each router has at most five connections or links: one to the core and four to the other routers on-chip or to memory off-chip.

1.3 Motivation

In a router, buffers are used to hold packets as they traverse the network. Links are physical channels that connect routers, cores and other components in the network. Buffers and links usually have the same width for simplicity of circuit design. With an abundance of transistors and wires on a die, NoC designers can adopt very wide buffers and links to achieve high bandwidth; typical widths are 64 and 128 bits.

Figure 1.3: A network-on-chip connecting CMP and off-chip memory.

There are essentially two types of packets in the network: control and data. Control packets carry command or control signals, and they can be quite short and may not occupy the entire buffer or link width. On the other hand, data packets (carrying a cache line, for example) usually take up the entire width. In some applications, traffic in the network can be dominated by short packets. We looked at one particular parallel financial application, blackscholes. We used a cycle-accurate multicore simulator, gem5 [12], to run the application on an 8 × 8 mesh network and examined the packets injected into the network. The percentage of control packets was 75% of all packets.

Figure 1.4: Link/Buffer utilization at various widths.

Assuming short packets are 32 bits long, Figure 1.4 shows the link/buffer utilization at various widths. It shows that links and buffers of the typical 64- and 128-bit widths take full advantage of the available bandwidth only 66% and 44% of the time, respectively. That is, much of the time the buffers are holding, and the links are transferring, bits that carry no useful information.
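As a back-of-envelope illustration of where such a utilization curve comes from, consider the following minimal sketch. It assumes a mix of 75% 32-bit control packets and 25% full-width data packets, each occupying one full-width slot; this is a simplification, not the exact methodology behind Figure 1.4, which uses the measured blackscholes traffic.

```python
# Back-of-envelope link/buffer utilization model (assumptions: 75% of
# packets are 32-bit control packets, the rest fill the whole width,
# and each packet occupies one full-width buffer/link slot).

CONTROL_FRACTION = 0.75
CONTROL_BITS = 32

def utilization(width_bits: int) -> float:
    """Average fraction of the slot width carrying useful bits."""
    used = (CONTROL_FRACTION * min(CONTROL_BITS, width_bits)
            + (1 - CONTROL_FRACTION) * width_bits)
    return used / width_bits

for w in (32, 64, 128, 256):
    print(f"{w:3d}-bit width: {utilization(w):.0%} utilized")
# -> 32-bit: 100%, 64-bit: ~62%, 128-bit: ~44%, 256-bit: ~34%
# The 128-bit figure matches the 44% quoted above; the measured 64-bit
# figure (66%) differs slightly because real packet mixes are not this simple.
```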

Improving the link/buffer utilization can be beneficial in several ways. The effective capacity of the network increases because more packets can be held in the network at a given time, which is very helpful under heavy traffic. Packets can enter the network sooner and thus may arrive at their destinations earlier than otherwise; in other words, the network packet latency can be reduced. Moreover, if we increase the buffer utilization, we can use fewer buffers in the router to achieve the same level of performance as before. Buffers occupy the most area and are the most power-consuming components in a router, so having fewer buffers can make the chip smaller and more power-efficient.

One obvious way to improve the buffer utilization is to make a buffer able to hold more than one packet. If a short packet is 32 bits long, a 64-bit-wide buffer can hold two packets, and a 128-bit one can hold four. The potential to improve buffer utilization is quite large. Furthermore, if the packets held in the same buffer are going in the same direction while traversing the network, they can travel together through links and buffers until they need to be split up. Therefore, putting together packets that are going in the same direction also improves link utilization. In our target system configuration, a packet entering a router can exit at five different ports (core, north, east, south and west). If we do not allow packet loopback, the number of exit points drops to four. That is, two separate packets entering a router have a one-in-four probability of going in the same direction, and a smarter routing algorithm can improve these odds. Therefore, if we can merge packets going in the same direction, we should be able to improve the link/buffer utilization. (A short sanity check of this arithmetic is sketched at the end of this section.)

In this work, we devised a scheme to allow short packets to be merged when they are in the same router and going in the same direction. Obviously, the scheme must also include a method to split merged packets when their routes start to diverge. After a careful examination, we determined that the conventional router microarchitecture does not facilitate such a packet merging/splitting scheme without making the circuitry much more complex. We propose a novel router microarchitecture, one that is drastically different from and simpler than the conventional design, to realize the feature we wanted. We call this new router design Omega. An Omega router is basically a small network, a subnetwork of a bigger network, that adopts a ring topology. Omega comes with an important benefit:

It eliminates the crossbar switch, a major component in the conventional design. This elimination saves much area and reduces power consumption. Omega also reduces the network packet latency and can use fewer buffers than the conventional design to achieve the same level of performance. We synthesize the conventional and Omega router circuits with Cadence's VLSI design tools. The area and power consumption of Omega are shown to be much lower than those of the conventional design.
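As a quick sanity check of the capacity and merging-odds arithmetic above (a sketch only; the 32-bit short-packet size and the uniform-exit assumption are simplifications taken from the text):

```python
# How many 32-bit short packets fit in one buffer/link slot:
for width in (64, 128):
    print(f"{width}-bit slot holds {width // 32} short packets")

# With five ports and no loopback, a packet leaves through one of four
# exits. If two packets choose exits independently and uniformly, the
# probability they pick the same exit (and could merge) is:
p_same = sum((1 / 4) * (1 / 4) for _ in range(4))  # 4 * (1/16) = 1/4
print(f"P(same direction) = {p_same}")  # -> 0.25, the one-in-four odds above
```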

1.4 Contributions

The contributions of this dissertation are as follows:

1. The advantages of NoCs over the traditional bus-based approach for CMPs with high core count are addressed.

2. We propose the Omega router design, a novel router microarchitecture, for networks- on-chip. Simulation shows it reduces network packet latency and consumes less area and power compared to the conventional design.

3. A packet merging/splitting feature on the Omega router microarchitecture is realized to improve buffer/link utilization. Thus, the effective network capacity is increased.

4. The router circuitry is synthesized with commercial VLSI design tools to validate the Omega router design. The tools provide accurate circuit area and power estimations.

1.5 Dissertation Outline

The chapters of this dissertation are described below.

• NoC is a relatively new area in computer architecture; it has been in existence for about 16 years. Chapter 2 provides the fundamentals of the technology for those who are not familiar with it and explains in detail why NoC has become the interconnect of choice for high-core-count CMPs.

• The Omega router is introduced in Chapter 3. We first describe in detail the conventional router microarchitecture; then the Omega microarchitecture is presented to highlight the major differences. How packets traverse inside the router and in the network is explained next. Finally, simulation results are provided to show the superiority of Omega over the conventional design.

• Chapter 4 provides the detailed implementation of the Omega router circuitry. In addition, operations of the major components (route computation unit, buffer, output arbiter and buffer arbiter) are explained.

• Chapter 5 describes the packet merging/splitting feature on the Omega router. We show the circuit implementation and present simulation results that demonstrate a drastic improvement in network capacity. We also present area and power measurements to show that they are better than those of the conventional design.

• Chapter 6 draws some conclusions and describes future work that can be pursued.

Chapter 2

Network-on-Chip Basics

In this chapter, we describe the fundamentals of NoC. We also explain why NoC is a better approach than the traditional bus for CMPs, both at the architectural level and in circuit implementation. Being a relatively new technology, NoC is not without its challenges, which will also be discussed.

2.1 Bus-Based Interconnect

To connect a number of processing cores together, one can adopt the traditional bus approach shown in Figure 2.1. While this approach is well established and well understood, it only works in a system with up to a few components and is unsuitable for CMPs with high core count.

Figure 2.1: A single bus connecting all processing cores.

Each electronic component presents a certain capacitance at its interface to the bus circuitry, which can be modeled roughly as an RC circuit, shown in Figure 2.2. A voltage source (driver) provides a voltage, Vi, to a collectively resistive load, R, and a collectively capacitive load, C. The total resistance, R, is primarily due to the conducting wire. The total capacitance, C, is the sum of the capacitance of the conducting wire and all the

attached components. In a microelectronic circuit, the conducting wire is extremely short, in the µm range. As more components are added to the circuit (the bus), the increase in resistance due to the increased length of the conducting wire is so small that it can be ignored to first order; R therefore remains relatively constant. However, C increases linearly as more components are added, e.g., doubling the number of components doubles C. How fast a signal can rise or fall in a circuit is proportional to the time constant of the circuit, τ = RC [13]. The sum of the rise and fall times determines the minimum period of a clock signal that can be applied to a digital circuit. In other words, the maximum frequency of the clock signal that can drive the bus falls roughly in inverse proportion to the number of cores attached to it. As a result, as more cores are added to the bus, the clock rate may need to be lowered, degrading the performance of the system.

Figure 2.2: A simple RC model of a circuit.
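To make the scaling concrete, here is a minimal numerical sketch of the first-order model above; all component values are hypothetical, chosen only to show the trend:

```python
# First-order RC model of a shared bus. tau = R * C; assume a usable clock
# cycle must span a few time constants. All values below are hypothetical.

R = 100.0            # ohms: wire resistance, roughly constant
C_WIRE = 50e-15      # farads: wire capacitance
C_PER_CORE = 25e-15  # farads: capacitance added by each attached core

def max_clock_hz(num_cores: int, taus_per_cycle: float = 5.0) -> float:
    """Crude upper bound on the bus clock frequency."""
    tau = R * (C_WIRE + num_cores * C_PER_CORE)
    return 1.0 / (taus_per_cycle * tau)

for n in (2, 4, 16, 64):
    print(f"{n:3d} cores -> max clock ~ {max_clock_hz(n) / 1e9:5.2f} GHz")
# C (and hence tau) grows linearly with the number of cores, so the
# attainable clock frequency falls roughly as 1/n: the bus does not scale.
```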

When the circuit load is increased, the circuitry to drive the entire load may need to be augmented to deliver the additional current required to maintain the same performance level. Consequently, the circuit becomes bigger and consumes more power.

Furthermore, a bus shared by multiple components requires an arbitration process to ensure that only one component drives the bus at any given time. This takes additional clock cycle(s) to complete. This extra step adds to the total bus transaction time, increasing the access latency of the resources attached to the bus.

In a modern microprocessor, where multi-GHz clock signals are commonly used, a bus-based interconnect does not scale well with the number of cores. An alternative is needed for CMPs with high core count.

2.2 NoC

High-performance computing systems with multiple processors have been built since the 1960s [14]. Successful commercial implementations included systems from IBM. Those early systems were called mainframes. Their designs have evolved over the years, and many are still in service today. With tremendous advances in microprocessor technology since the early 1980s, most multiprocessor systems have been built with off-the-shelf microprocessors. The number of microprocessors integrated in a system has been increasing steadily; currently, systems with thousands of microprocessors are quite common, and many of them can be found in research institutes and data centers. System architects have been using network approaches to connect all the microprocessors. As more processors were integrated on-chip, it was prudent to adopt the same network approach to build an on-chip connection infrastructure.

On-Chip vs. Off-Chip

On the surface, an NoC may look very similar to an ordinary computer network. A computer node in an ordinary network corresponds to a core in the NoC; both use packet switching as the communication mode, and both connect to the network through a router. In reality, however, the characteristics of off-chip networks and NoCs are quite different. In this discussion, off-chip networks include implementations at the board and chassis levels. Typically, at the board level, the connections are implemented as conducting wires etched on a printed circuit board. At the chassis level, the connections can be optical fibers or copper wires.

2.2.1 Wires

Wires in a microelectronic circuit can be constructed microscopically thin and narrow [15]. With wire widths in the tens of nanometers (nm), we can connect components on a silicon die with many wires, i.e., with parallel buses of very wide widths, like 64 or 128 bits. Wide widths mean high bandwidths because a great deal of information can be transferred in parallel in a single clock cycle. In the off-chip situation, as explained in Section 1.2, the number of chip contacts is quite limited. In addition, off-chip wire widths, at least in the range of tens of micrometers (µm), are much greater than those on-chip. It is impractical to build many wide parallel buses off-chip; consequently, the off-chip communication bandwidth is much lower than that of an NoC.

2.2.2 Router

At the board level, the number of connections among router chips and processor chips is quite limited due to the physical constraints explained in Section 1.2. A CMP with a built-in NoC has abundant wires available on-chip; thus, a single router can have high-bandwidth connections with the core and other routers. At the chassis level, a router itself is a standalone switching unit that has many ports available for connections to many processor boards. The physical connecting media are usually optical fibers and gigabit cables. Because of the flexible fiber/cable connections, unlike an NoC that is restricted to a planar environment, there exists an extra dimension of freedom, and many choices of network topologies are available. The fibers/cables can be manually rerouted, or the router can be reprogrammed, to form a different network topology. In the case of NoC, once the chip layout is determined, the connections among cores and routers are fixed, so the network topology typically cannot be changed.

2.2.3 Latency

Modern processing cores run on multi-GHz clocks, and on-chip routers can run at the same rates. The communication latency must be in the range of ns in order to feed data to the cores as fast as possible. Once a signal goes off-chip, it becomes very difficult to run at a very high clock rate. Thus, the communication latency increases drastically, to at least a few hundred ns for on-board situations, and can even reach the ms range in a local area network.

2.2.4 Power

With CMPs running at very high clock rates and the associated NoC providing very high bandwidth, the chip can reach very high temperatures. Activities on the NoC can consume a significant portion of the chip's power budget; Kahng et al. [16] reported that of the total power consumed by the cores and routers in an 80-core CMP, an impressive 28% is spent on communication alone. NoC reduces the need for long wires, as explained in Section 2.3.8, which helps reduce power and area consumption. In the off-chip situation, since the bandwidth and clock rate are lower, power consumption due to communication does not have a major impact on the power budget.

2.2.5 Area

Silicon die real estate is quite precious. It has a direct impact on chip manufacturing cost, so silicon chips are made as small as possible. Therefore, routing on-chip wires is restricted to a very small area. Off-chip networks usually have much more space to work with, either at the board or chassis level. As a result, area is not a concern in such situations.

2.3 Features

NoC has become a popular choice of interconnect for CMPs. It has evolved and borrowed certain implementation techniques from its off-chip counterpart and also has some important advantages over the bus-based interconnect. This section describes what they are and how they help improve system performance.

2.3.1 Bandwidth

Figure 2.3 is an example of an NoC with a 4 × 4 mesh. Each core has an independent connection to a router. Each connection is essentially a high-bandwidth dedicated bus connecting only two components, and the operations of each bus are not affected by any other "buses." The same applies to the router-router connections. The aggregate bandwidth of the infrastructure is much higher than that of the single-bus approach (shown in Figure 2.1), where there is only one connection shared by all the cores.

2.3.2 Scalability

As more cores are integrated on-chip, the NoC can expand in either dimension without affecting the performance of existing connections. This increases the aggregate bandwidth of the network. Therefore, the scalability of NoC is much better than that of a bus, which can only be effective for systems with up to a few cores. This is advantageous for NoC when at least a few more cores are to be added for a new CMP design; the connection infrastructure does not need to be redesigned. If a bus were used, some redesign and verification effort would be warranted to ensure that it still functions properly.

Figure 2.3: A 4 × 4 mesh network with four concurrent communication paths. The paths are marked by the red, blue, purple and green lines.

2.3.3 Parallel Communications

NoC essentially provides many independent buses in the system. Communications among cores can happen simultaneously. Figure 2.3 shows four transactions happening in parallel:

Core8 sends a packet to Core2 (red line in the figure), Core3 to Core13 (blue), Core4 to Core7 (purple) and Core12 to Core10 (green). Unlike the bus approach, there is no arbitration process involved, which would incur additional delay. Therefore, NoC is an infrastructure well suited to running parallel applications.

2.3.4 Clock Frequency

In an NoC, as shown in Figure 2.3, the connections are all point-to-point, linking only two components. In the actual implementation, they are unidirectional, i.e., one component is always the driver, while the other is always the receiver. On the die, the connections are very short in almost all situations, and there are only two components in the circuit. In other words, the resistance and capacitance of the circuit are low. As explained in Section 2.1, the time constant, τ, would be very small; that is, the circuit can be driven by a very high-frequency clock. Therefore, it is not uncommon for the clock driving an NoC to be in the multi-GHz range, similar to the processing cores.

2.3.5 Fault Tolerance

In an NoC, one core can have multiple paths to another core. As shown in Figure 2.3, for example, a packet from Core0 can take many different paths to Core15 by going through different sequences of routers. One possibility is 0-1-2-3-7-11-15 and another is 0-4-8-12-13-14-15; such path diversity can make the network fault-tolerant. Since there can be billions of transistors integrated on a die, the chance of device failure can be high, but an NoC can be made to deal with failures. If a path to the desired destination is no longer available, the network controller can select a functional path to transmit the packet. Fault tolerance is becoming a must-have feature on modern large microprocessors, and NoC facilitates its realization.

2.3.6 Power Consumption

Because the point-to-point connections in an NoC have low resistance and capacitance, the driver circuitry does not need to deliver much current. At the component interface, the fan-out is just one. Unlike in the bus-based interconnect, where adding more components requires a stronger driver, drivers in an NoC can be kept very small and consume little power.

2.3.7 System Integration

An NoC, like the one presented in Figure 2.3, can be visualized as a regular juxtaposition of rectangular tiles. Each tile consists of the core, the attached router, and the set of wires linking it to the other tiles. Each tile has the identical interface to the network. From a system development point of view, debugging, verification, optimization and customization of a CMP become easier because there is only one interface to deal with. For example, as the network expands in size, there is no need to re-verify the interface because its electrical characteristics remain exactly the same as before. This is unlike a bus-based interconnect, where adding or removing components changes the electrical characteristics, warranting new effort to ensure functionality.

2.3.8 Chip Layout

NoC is a highly modular and regular structure. Component placement and wire routing can follow a certain pattern and be repeated many times in the chip layout process. Also, each tile in an NoC is only connected to its surrounding neighbors, meaning there are far fewer long wires connecting components compared to the bus-based counterpart. All these factors make the chip layout easier and cut down the chip development time.

2.3.9 Clock Distribution

All the activities in a microprocessor are synchronized by a system clock, whose signal is typically distributed in a tree structure to cover the entire chip; the drivers and wires can occupy a lot of chip real estate and take up a significant portion of the power budget [17]. NoC is amenable to a non-traditional synchronization scheme, Globally Asynchronous Locally Synchronous (GALS), which can greatly reduce the power consumed by the clock signal [18]. GALS allows operations among tiles in an NoC to be orchestrated asynchronously (globally), while within each tile everything is synchronized by a much smaller clock tree (locally). As a result, the total clock structure is not as big as that in a traditional design. Clock skew challenges are also less severe, and potentially an even higher clock rate can be used.

2.3.10 Packet Switching

As mentioned before, NoC borrowed a few implementation techniques from off-chip networks. A major one is packet switching, which allows designers to implement high-level communication concepts at the hardware level that make usage of the network more efficient. For example, virtual channels, which are commonly used in NoCs, are implemented in routers to permit physical network paths to be shared by packets in a time-division manner. This greatly improves the utilization of the network bandwidth.

2.3.11 Summary

In this section, we described a number of important features of NoC, which indeed has many advantages over the bus-based approach for CMPs, especially with high core count. The advantages, which cover a wide range of aspects from basic electrical characteristics to system development, are summarized in Table 2.1.

Table 2.1: Summary of features of NoCs.

Bandwidth: Very high aggregate network bandwidth.
Scalability: Highly scalable.
Parallel communication: Ideal for parallel applications.
Clock frequency: Short point-to-point connections; low capacitance; can use a very high clock rate.
Fault tolerance: More than one path from source to destination.
Power consumption: Low; fan-out of one at the tile interface.
System integration: All tiles use the identical interface; easier debugging, verification, optimization and customization.
Chip layout: Highly modular and regular structure; very ordered component placement and wire routing; fewer long wires.
Clock distribution: Can use the Globally Asynchronous Locally Synchronous (GALS) scheme; much reduced clock skew issues.
Packet switching: Facilitates high-level communication concepts.

2.4 Key NoC Characteristics

In this section, we describe the important high-level characteristics of NoC. Topology, routing algorithm and flow control are the three key factors that determine network performance [19]. They are in many ways similar to those of the off-chip networks.

Packet Format

In an ordinary network, computer systems talk to each other by sending messages. At the application level, a message can be very large; for example, it can be the contents of a gigabyte file. As we go down the architectural layers toward the hardware, the message needs to be broken into smaller, more manageable pieces. A message is split up into a sequence of packets, and NoC deals with information at this granularity. Each packet is further broken down into a sequence of flits (flow control units) [20]; a flit contains all the information that can be transferred through the network in a single clock cycle. Typical flit sizes are 64 and 128 bits, and they are usually the same width as the buffers holding them in the router and the links connecting the routers. A packet consists of one or more flits. As depicted in Figure 2.4, there are three types of flits: the head flit contains the control signals about the packet; the body flit contains the actual message contents from the application; and the tail flit indicates the end of the packet. Note that the flit of a single-flit packet assumes all three identities. As described in Section 1.3, there are control and data packets. Control packets do not carry any message contents from the applications; they are strictly used for carrying out operations in the network. Therefore, control packets are usually just one flit long and may not occupy the entire buffer or link width.

Figure 2.4: Message format. A message is split into packets; a packet consists of a head flit, body flit(s) and a tail flit.

2.4.1 Topology

Network topology, usually a static characteristic determined at system design time, has a significant impact on network cost and latency. Topology decides the possible paths that can be taken by a packet traveling from its source to its destination; the actual path is then selected by the route computation unit of the router at run time. Topology determines how many links a router has with other routers, i.e., the radix. The higher the radix, the more complex, and thus the more costly, the router circuitry becomes. Topology also determines the link width and its bit rate (how fast the link can be driven). All these factors affect the network's cost and bandwidth. Figure 2.5 shows a few common examples.

Figure 2.5: Network topology examples: mesh, torus, ring and fat tree.

2.4.2 Routing Algorithm

Once a packet reaches a router, the routing unit uses a built-in algorithm to determine where to send the packet. There are numerous routing algorithms; in this section, we use two basic, commonly used ones to illustrate how routing may work in a 2-D mesh network. The head flit typically contains the coordinates of the destination router. There are five ports from which a packet can exit a router: core, north, east, south and west. Given the destination coordinates, the routing algorithm decides the next router the packet needs to go to, which in turn determines the exit port.

Dimension-Ordered Routing (DOR)

Dimension-Ordered Routing [21] is the most popular routing algorithm in NoC. It is specifically designed for mesh networks and very simple to implement. A packet traverses one network dimension first before moving on to the other dimension (if necessary) to reach its destination; that is, there is at most one turn in the route. For example, in Figure 2.6(a), a packet wants to go from Node 7 to Node 2, so it first takes the path along the X dimension, 7-8. When it reaches Node 8, it takes the path along the Y dimension, 8-5-2. The path from the source to the destination selected by DOR is fixed; it cannot be changed once the circuit is built.
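The algorithm is simple enough to sketch in a few lines. This is an illustrative model only; the coordinate convention and port names are assumptions, not the dissertation's circuit.

```python
# XY dimension-ordered routing: exhaust the X dimension, then Y.

def dor_xy(cur, dst):
    """Return the exit port for a packet at 'cur' headed to 'dst' (x, y)."""
    if dst[0] > cur[0]: return "east"
    if dst[0] < cur[0]: return "west"
    if dst[1] > cur[1]: return "north"
    if dst[1] < cur[1]: return "south"
    return "core"  # arrived: deliver to the local core

# Replay the Node 7 -> Node 2 example on the 3x3 mesh of Figure 2.6,
# with node n at coordinates (n % 3, 2 - n // 3):
coords = lambda n: (n % 3, 2 - n // 3)
step = {"east": (1, 0), "west": (-1, 0), "north": (0, 1), "south": (0, -1)}
cur, dst, hops = coords(7), coords(2), []
while cur != dst:
    port = dor_xy(cur, dst)
    hops.append(port)
    cur = (cur[0] + step[port][0], cur[1] + step[port][1])
print(hops)  # -> ['east', 'north', 'north'], i.e., 7-8 then 8-5-2
```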

Node-Table Routing

A routing table, which is a lookup table containing the exit port identities for all destination nodes in the network, is established in each router and can be preloaded during system startup. When a packet reaches a router, the route computation unit looks up the exit port using the destination node contained in the packet header. Using the same example as above, when the packet reaches Node 7, as illustrated in Figure 2.6(b), the routing table there indicates taking the east exit port to reach Node 2; at Node 8, the exit port is north, and so on. Since there are many possible paths from one node to another, a routing table provides maximum flexibility in which path can be taken. It is also a common routing scheme in ordinary computer networks.

Figure 2.6: (a) Packet traverses from Node 7 to Node 2. (b) Routing tables at Node 7 and Node 8.
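A sketch of the lookup follows; the table contents here are hypothetical and only cover the example path.

```python
# Node-table routing: one preloaded table per router, one lookup per hop.
ROUTING_TABLES = {
    7: {2: "east"},   # at Node 7, packets for Node 2 exit east (toward Node 8)
    8: {2: "north"},  # at Node 8, they exit north (toward Node 5)
    5: {2: "north"},  # at Node 5, one more hop north reaches Node 2
}

def exit_port(node: int, dest: int) -> str:
    return ROUTING_TABLES[node][dest]

for node in (7, 8, 5):
    print(f"Node {node}: packet for Node 2 exits {exit_port(node, 2)}")
# Because the tables can be preloaded with any ports, they can encode any
# path, which is the flexibility advantage over the hard-wired DOR above.
```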

2.4.3 Flow Control

Packets can come into a router from all directions and may try to exit the router simultaneously. The packets compete for limited resources in the router as they travel from source node to destination node. The resources include the buffers for holding packets temporarily in the router and the bandwidth for transferring packets to the next router. Flow control consists of three basic mechanisms to manage these resources: arbitration to resolve contention, allocation to select resources properly, and throttling to prevent packets from overrunning the router downstream.

For example, when packets from different input buffers are ready to exit the router from the same port, a designer may adopt a round-robin priority arbitration scheme so that all ports have fair access to the exit port. Once a port is granted access, it has the lowest priority in the next clock cycle to regain access. Virtual channels are commonly used to prevent earlier-arriving packets from blocking later ones that want to go in a different direction, an issue commonly referred to as Head-of-Line blocking [20]. These channels allow a physical path in the router to be shared by different packets simultaneously. A buffer contains a number of virtual channels, and as a packet traverses different routers, it may be allocated different virtual channels, depending on router conditions. A credit-based throttling scheme is commonly used to make a router aware of the buffer conditions downstream [20]. A router will only send a flit down the link when it has enough credits, and it decrements its credit count upon sending a flit. The downstream router returns a credit when it has finished processing a flit and is ready to accept another one; upon receiving the credit, the current router increments its credit count.
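The credit-based throttling just described can be sketched as follows. This is a minimal model of the upstream side of one link; the buffer depth and method names are illustrative assumptions.

```python
# Credit-based throttling: one credit per free flit slot downstream.
class CreditedLink:
    def __init__(self, downstream_slots: int):
        self.credits = downstream_slots

    def try_send(self, flit) -> bool:
        if self.credits == 0:
            return False      # downstream buffer full: flit must wait
        self.credits -= 1     # a slot is now occupied downstream
        # ... drive the flit onto the physical link here ...
        return True

    def on_credit_return(self):
        self.credits += 1     # downstream freed a slot and returned a credit

link = CreditedLink(downstream_slots=2)
print(link.try_send("f0"), link.try_send("f1"), link.try_send("f2"))
# -> True True False: the third flit stalls until a credit comes back.
link.on_credit_return()
print(link.try_send("f2"))  # -> True
```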

2.5 Challenges

As seen in Section 2.3, NoC provides many benefits for CMPs. However, this relatively new technology does not come without its own challenges [22]. Management of the network is becoming a big issue as the size of NoC increases: how would one manage a thousand or more cores and routers effectively without sacrificing performance? More components mean a higher probability of device failure, so fault tolerance is no longer a design afterthought. What kind of fault-tolerance features should be implemented for the NoC?

With so many cores on-chip and a limited number of I/O pins, it is challenging to give the cores efficient access to off-chip resources. The off-chip resources are attached to some edge router(s) in the network, so there exists an asymmetry in access latency between cores physically close to the off-chip resources and cores far away from them.

Design tools are still lacking. They are essential for exploring the design space to come up with the proper network topology, router microarchitecture, routing algorithm, etc. Standards are also lacking; benchmark suites specifically designed for evaluating NoCs practically do not exist. IP blocks are still evolving, slowing down the adoption rate of the new technology.

Chapter 3

Omega Router

It is well known that the area and power consumption of an NoC are dominated by the buffers and the crossbar switch [23]. In this work, we propose a novel router microarchitecture tailored for 2-D mesh networks, which we name Omega. The Omega design is drastically different from the conventional design: it has no crossbar switch, and it can use fewer buffers to achieve the same or better performance. In this chapter, we describe the Omega router microarchitecture in detail and present simulation and circuit synthesis results to show its superior performance.

3.1 Conventional Router

NoC is a relatively new technology in the area of computer architecture research. The concept started in the early 2000s as more processing cores were integrated on a silicon die [11, 24, 25]. As illustrated in Section 1.1, the number of cores has been increasing rapidly, and an efficient infrastructure was warranted to provide high-throughput and low-latency communication.

3.1.1 Microarchitecture

Figure 3.1 shows a conventional NoC router microarchitecture design. It evolved from its off-chip counterpart [19], which is a standalone chip implementation, not tightly integrated with the processing cores on the same die. There are five major functional units in the design: input unit, route computation unit, virtual channel allocator, switch allocator and crossbar switch. It is a low-latency design; a flit can go through the router in four clock cycles, typically just a few nanoseconds. The throughput is high; flits can enter and leave the router simultaneously in every clock cycle.

Figure 3.1: Conventional router microarchitecture.

Input Unit

Figure 3.1 shows the conventional router design with five input/output ports for a 2-D mesh network. One port connects to the processing core, and the other four ports connect to the routers in the north, east, south and west directions. The figure shows a generic design that can be applied to various topologies with different numbers of input/output ports, i.e., different radixes. Flits entering an input port are received by a dedicated input unit and stored in a buffer. The buffer is organized into virtual channels, which allow a network path in the router to be shared by a number of packets going in the same or different directions. An incoming flit contains a virtual channel number assigned by the router or core upstream, and the flit is stored in that virtual channel. The input unit has only one exit point shared by all virtual channels in the buffer; flits in different virtual channels compete for access to the exit point when they are ready to be sent to the router downstream.

Route Computation Unit

Before an incoming flit is written into the buffer, the route computation unit uses the destination node coordinates embedded in the flit to determine which exit port the flit will be sent to. Each input unit has its own route computation unit so that multiple incoming flits can be processed simultaneously in the same clock cycle.

Virtual Channel Allocator

A virtual channel allocator keeps track of the availability of the virtual channels in the router downstream. Pending flits in the current router compete for those available virtual channels, and the allocator assigns them to the pending flits according to an arbitration algorithm. The allocator uses a credit scheme for assignment: the number of credits corresponds to the number of available virtual channels in the router downstream. When an assignment is made, the allocator decrements the credit count. After the flit reaches the router downstream and is processed, a credit is returned to the current router to indicate new availability, and the credit count is incremented.

Crossbar Switch

To direct flits from the five input ports to any one of the five output ports, a crossbar switch is used. The switch is designed to provide maximum throughput: as long as there is no competition for the same output port, five transfers can take place simultaneously in the same clock cycle.

Switch Allocator

Pending flits with virtual channels allocated compete within their input unit for access to the crossbar switch to leave the router. At the same time, they also compete with pending flits in other input units that happen to want to leave the router from the same port, i.e., flits entering different input ports that desire to go to the same router downstream. To resolve all these conflicts, the switch allocator determines, using an arbitration algorithm, which virtual channel from which input unit gets access to the crossbar switch.
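As an illustration of the arbitration such an allocator might use, here is a round-robin arbiter for a single output port. This is a sketch; the dissertation's actual allocator logic may differ.

```python
# Round-robin arbiter: the winner gets the lowest priority next cycle.
class RoundRobinArbiter:
    def __init__(self, num_requesters: int):
        self.n = num_requesters
        self.last = self.n - 1   # most recently granted requester

    def grant(self, requests):
        """Grant the first requester after the last winner; None if idle."""
        for offset in range(1, self.n + 1):
            i = (self.last + offset) % self.n
            if requests[i]:
                self.last = i
                return i
        return None

arb = RoundRobinArbiter(5)           # five input units, one output port
print(arb.grant([1, 0, 1, 0, 1]))    # -> 0
print(arb.grant([1, 0, 1, 0, 1]))    # -> 2 (input 0 now has lowest priority)
print(arb.grant([1, 0, 1, 0, 1]))    # -> 4
```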

3.1.2 Router Pipeline

In order to achieve the maximum throughput of one flit per clock cycle at each input or output port, a pipelining scheme is used in the router, as shown in Figure 3.2. The sequential operations described above are organized into four separate stages, each taking one clock cycle to complete. All four stages execute simultaneously but on different flits. The latency of the router is four clock cycles, but the throughput is one flit per cycle. A maximum of 20 flits (5 × 4, from five inputs and four stages) can move through the router simultaneously and in a sustainable fashion.

Only the head flit of a packet needs to go through all four stages. For the body and tail flits, the virtual channel and switch allocation stages can be skipped, and in the first stage, only buffering is done. There is no need to recompute the route for the non-head flits because all the flits in the packet must go to the same next router. The same applies to the virtual channel: all flits of the same packet must be kept in the same virtual channel, so virtual channel allocation can be skipped for the non-head flits. Furthermore, the packet cannot be broken up in time; the body and tail flits must follow the head flit in consecutive clock cycles. Once the head flit is moving through the crossbar switch, the switch is kept "open" for the entire packet until the tail flit has passed through it.

Figure 3.2: Router pipeline: route computation and buffering, virtual channel allocation, switch allocation, and crossbar traversal.

3.2 Omega Microarchitecture

We mentioned in Section 1.3 that the conventional microarchitecture is ill-suited for our intention: to improve the buffer and link utilization by merging packets or flits. For clarity, Figure 3.3 shows just the datapath of packets in the conventional router. It is basically redrawn from Figure 3.1 with just the two major units: buffer and crossbar switch. A flit entering the router is stored in a buffer. It then competes with other pending flits (if any), in other input units as well as its own, for access to the crossbar switch to leave the router.

Notice that the buffers are placed at the entry points of the router. Flits have the opportunity to merge there if the buffer has "spare space" and they are going in the same direction. The second condition is necessary because merged flits heading in different directions would have to be split up upon leaving the buffer, wasting bandwidth in the crossbar switch that immediately follows. The only opportunity to meet flits from other input units is after the crossbar switch. (Meeting inside the switch is impractical as it has no storage space.) Consequently, the buffers placed at the front end of the router have roughly four times fewer opportunities to do any merging, unless much more complex switching is introduced into the circuitry to handle the other four sources. Furthermore, additional buffers would need to be added after the crossbar switch if we wanted to do flit merging there, which would increase the size of the router significantly. Therefore, a drastically different design is warranted to serve our purposes.

Figure 3.3: Conventional router microarchitecture datapath.

3.2.1 Top-Level Design

While the conventional router microarchitecture is designed for various types of networks, our proposed microarchitecture specifically targets planar mesh networks, which typically do not have a large number of input/output ports. At the top level of this highly modular design, there are five identical modules, which we call exchanges, linked together. The design is simpler than conventional ones because the router is basically a small ring network of five exchanges: four for the neighboring routers and one for the core. Figure 3.4 shows the high-level design of the microarchitecture. We name this router Omega mainly because of the ring topology. However, it is not a perfect ring; it has a logical disjoint at the core exchange, which is explained in Section 3.2.3. Thus, the network shape resembles the Greek letter Omega, Ω, which looks like a circle with a small gap.

Figure 3.4: Omega router microarchitecture with a ring network of exchanges.

A flit entering the router goes through a series of exchanges before leaving it. The arrangement of exchanges in the ring network minimizes the hop count (the total number of exchanges a flit must traverse) within the router for DOR, the prevailing routing algorithm in mesh networks. Packets traveling vertically or horizontally in the mesh network traverse only two exchanges, the minimum number of hops inside the router: vertical traffic involves only the north and south exchanges, and horizontal traffic only the west and east exchanges. Under DOR, packets go through the router either vertically or horizontally most of the time; they make a turn when a change of dimension is needed, which happens at most once in the entire route.

3.2.2 Exchange

Each exchange has only three bidirectional ports, as shown in Figure 3.5(a). Ports A and B connect to the other exchanges in the router, and Port E (E stands for external) connects to another router or the core. Figure 3.5(b) shows a block diagram of the exchange design. A packet entering an exchange can only leave from one of the other two ports; no loopback is allowed within the exchange. The no-loopback constraint is reasonable to impose, as there is really no good reason (other than hardware diagnostics) for a flit to enter an exchange and immediately return from it. By imposing this additional constraint, we obtain an overall simpler design than the conventional router. There is no large crossbar switch, only simple 2:1 muxes. Thus, chip area and power consumption are reduced.

The exchange consists of three sections, and each section contains a buffer, a route computation unit, an output arbiter and a buffer arbiter. Each input port connects, through a mux (a very simple switch), to a set of two buffers. For example, the input of Port A connects to the buffers of Ports B and E.

Figure 3.5: (a) Exchange interface. (b) Exchange internal design.

Route Computation Unit

Each input port has its own route computation unit. Similar to its counterpart in the conventional router, the route computation unit determines the exit port in an exchange based on the destination node coordinates carried in the incoming head flit. Since there are only two choices, as opposed to five in the conventional router, the logic of the computation unit is much simpler, which helps reduce chip area and power consumption.
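As a sketch of how simple the two-way decision can be, consider the C++ fragment below. The coordinate convention follows the node numbering of Figure 2.3 (y grows downward); the function names and the port convention are our own illustrative assumptions.

enum Dir { NORTH, SOUTH, EAST, WEST, LOCAL };

// Dimension-order routing: correct the X coordinate first, then Y,
// then eject at the local core.
Dir dorDirection(int curX, int curY, int dstX, int dstY) {
  if (dstX > curX) return EAST;
  if (dstX < curX) return WEST;
  if (dstY > curY) return SOUTH;  // y grows downward in the node numbering
  if (dstY < curY) return NORTH;
  return LOCAL;
}

// Two-way exit decision made on behalf of the downstream exchange: leave
// through Port E if that exchange's external link points where DOR wants
// to go; otherwise continue through the other internal port.
bool exitThroughPortE(Dir downstreamExternalDir, Dir wanted) {
  return downstreamExternalDir == wanted;
}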

Buffer

The design of the buffers is the same as in a conventional router: each consists of a number of virtual channels, which function as described in Section 3.1.1. The main difference is their location in the module. In the conventional router, the buffers are at the front end of the router, in the input units; flits enter the buffers before passing through the crossbar switch. In the exchange of an Omega router, however, the buffers are at the back end of the module; flits pass through the muxes (essentially very simple switches) before entering the buffers. As we will see in Chapter 5, placing the buffers at the back end facilitates flit merging.

Buffer Arbiter

Each buffer in the exchange is associated with a buffer arbiter. As shown in Figure 3.5(b), there are two flit sources for each buffer. For example, the buffer in Port B can be written with flits from either Port A or Port E. The purpose of the arbiter is to determine which port is granted access to the buffer in case of contention; the buffer arbiter essentially acts as a flow controller. As shown in Figure 3.6, a bus is shared among the arbiter and the two competing neighboring exchanges. (In the exchange attached to the core, one of the two competing sources is the core itself.) For example, when an upstream exchange wants to write to the buffer in Port E, it makes a request through the shared bus to Arbiter E. The arbiter determines whether Port A or B is granted access and asserts the corresponding grant signal on the shared bus. When the upstream module receives its grant signal, it sends out the flit; otherwise, it waits.
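A two-requester arbiter of this kind is tiny. The sketch below uses round-robin priority to break ties, matching the scheme adopted in Section 3.2.5; the names are illustrative.

class BufferArbiter {
 public:
  // reqA/reqB: write requests from the two competing sources on the shared bus.
  // Returns 0 if A is granted, 1 if B is granted, -1 if neither requested.
  int arbitrate(bool reqA, bool reqB) {
    int grant = -1;
    if (reqA && reqB)  grant = (lastGranted_ == 0) ? 1 : 0;  // alternate on conflict
    else if (reqA)     grant = 0;
    else if (reqB)     grant = 1;
    if (grant >= 0) lastGranted_ = grant;
    return grant;
  }

 private:
  int lastGranted_ = 1;  // so that source A wins the first tie
};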

Output Arbiter

Each buffer in the exchange is also associated with an output arbiter, which determines when a flit in the buffer can be sent to the exchange (or core, if the exchange is attached to the core) downstream. The arbiter looks at the shared arbitration bus to see if there is an asserted grant signal. If so, it selects the appropriate virtual channel in the buffer to output the flit.

Figure 3.6: Exchange buffer arbiters.

3.2.3 Datapath

We expand Figure 3.4 into Figure 3.7 by adding the buffers in each exchange to show the datapath of the Omega router. Since all the links are unidirectional, there are actually two circular paths in the ring network: the red lines indicate a clockwise path, while the blue lines indicate a counter-clockwise path. Notice that there is an intentional logical gap built in at the core exchange. That is, packets entering the core exchange must go to the core; they are not routed on to other exchanges. The purpose of the gap is to prevent a cyclic dependency from forming (Section 3.2.5 explains why this is necessary). As a result, neither circular path is complete; there is a small gap, as in the Greek letter Omega, Ω.

As mentioned in Section 3.2.2, a loopback within an exchange is not allowed structurally. However, one case of loopback is allowed at a higher level. Since the disjoint described above is situated in the core exchange, flits from the core can be looped back to the core itself.

Figure 3.7: Omega router datapath.

However, such a loopback takes six hops in the router: a flit enters the core exchange, then goes to the north exchange, followed by the south, east and west exchanges, and finally returns to the core exchange.

As seen in the figure, there is no large central crossbar switch as in the conventional router (Figure 3.3). Without such a large component, an Omega router can be made more area and power efficient. All the buffers are evenly spread out across the entire router.

With the ring topology in the router, flit movements are much more dynamic than in the conventional router. Flits go from buffer to buffer when traversing the router, thus freeing up buffer space constantly, creating more opportunities for new flits to enter the network or router. In other words, the network traffic load would be more spread out among exchanges in the router.

3.2.4 Timing

Before a flit is written into a buffer, the route computation unit determines which exit port to take in the next exchange, not in the current one (the buffer location already determines the exit port from the current exchange). Immediately after the route computation is done, a write request is made to the proper buffer arbiter in the downstream exchange, with the hope that write access is granted for the next clock cycle.

The buffers store flits on a first-in, first-out (FIFO) basis. They have separate read and write ports, and we use both edges of the clock: the rising edge for reads and the falling edge for writes. In the clock cycle after a flit is stored in a buffer, if buffer access has been granted downstream, the buffer is read in the first half of the cycle and the flit is output from the current exchange; the write into the downstream buffer happens in the second half of the cycle. Figure 3.8 shows how the operations on a flit are spread over three connected exchanges; it shows the datapath of a flit from Exchange 0 (XC0) to Exchange 2 (XC2). The series of operations is done in two clock cycles. Assuming the "output arbitration" and "buffer read" operations were done in the previous clock cycle, the flit appears at the input of XC1 at the beginning of the current cycle. "Route computation" is carried out, then a write request is made to XC2 downstream, where "buffer arbitration" is done for the next cycle. In the second half of the clock cycle (on the falling edge of the clock), "buffer write" is carried out to store the flit in the buffer of XC1.
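In simulator terms, the dual-edge scheme amounts to a two-phase update per cycle, roughly as in the following C++ sketch; Exchange is a placeholder type, and the phase split simply mirrors the description above.

#include <vector>

struct Exchange {
  void risingEdge()  { /* output arbitration, buffer read, route computation,
                          write requests to downstream buffer arbiters */ }
  void fallingEdge() { /* buffer writes for flits granted in the previous cycle */ }
};

void simulateCycle(std::vector<Exchange>& ring) {
  for (auto& xc : ring) xc.risingEdge();   // first half of the clock cycle
  for (auto& xc : ring) xc.fallingEdge();  // second half of the clock cycle
}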

Figure 3.8: Omega router buffer read and write operations.

Figure 3.9 adds a time dimension to show when and where the operations take place. Only “buffer write” occurs in the second half of a clock cycle, whereas all others happen in the first half.

3.2.5 Routing

To reduce network latency, lookahead routing is adopted for the Omega router. That is, when a flit enters an exchange, which of the two buffers it enters has already been determined in the upstream exchange; an exchange's route computation unit decides the exit for the exchange downstream, not its own. As explained in Section 2.4.3, a flit needs resources in each router to traverse the network. The same concept applies to traversal within the router, since the ring of exchanges is itself a small network. Routing flits from one exchange to another depends on resource availability in the exchanges downstream, and a cyclic dependency may form under certain traffic conditions. For example, an exchange holds a buffer and needs a buffer downstream to move a flit forward; however, the buffer it holds is also needed by another exchange downstream. As a result, a deadlock occurs: the flit in the buffer cannot progress on its route to the destination [21].


Figure 3.9: Time-space diagram showing how a flit traverses a series of exchanges with five types of operations. Five operations: Route Computation (RC) determines the output port in the downstream XC; Buffer Write (BW) writes flit into buffer; Buffer Arbitration (BA) determines which upstream XC can gain access in the next cycle; Output Arbitration (OA) determines which virtual channel to output; Buffer Read (BR) reads flit from buffer. The flit is assumed to already be in the buffer of XC0 prior to Cycle 0. 0+ denotes first half of a clock, 0- second half, and so on.

To prevent a cyclic dependency from forming among the exchanges, there is a logical disjoint (created by the routing mechanism) at the core exchange. Consequently, only flits injected at the core port can be looped back to the same port in the router. For the DOR algorithm, which changes direction at most once during the entire route, this disjoint imposes minimal impact on performance. Furthermore, the disjoint also eliminates the occurrence of livelock [21], where a flit circles around indefinitely in a loop without actually moving forward along the desired route. Livelock cannot happen because there is no complete circular path inside the router. The consequence of the disjoint at the core exchange is that the shortest path between the north and west exchanges through the core exchange does not exist. Fortunately, the alternate path takes only one extra hop by going through the south and east exchanges. Since this occurs on a change of dimension, which happens at most once in the route, the impact on latency is quite small. Table 3.1 shows how flits are routed through exchanges inside a router (a sketch of the underlying rule follows the table). Each hop takes one cycle. The minimum hop count is two, and most hop counts are either 2 or 3. Only the north-west path takes 4 hops, and the loopback from the core takes 6 hops.

Table 3.1: Routing inside a router.

Source   Destination   Stops                      Number of Hops
Core     Core          North, South, East, West   6
Core     North         none                       2
Core     East          West                       3
Core     South         North                      3
Core     West          none                       2
North    East          South                      3
North    South         none                       2
North    West          South, East                4
East     South         none                       2
East     West          none                       2
South    West          East                       3
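Table 3.1 is consistent with a simple rule: take the shorter arc around the ring Core-North-South-East-West unless that arc would transit the core exchange, which the disjoint forbids. The C++ sketch below encodes that rule; the ring order is inferred from the table, and the code is our own illustration, not the dissertation's implementation.

// Ring order inferred from Table 3.1: Core -> North -> South -> East -> West -> Core.
enum XC { CORE = 0, NORTH, SOUTH, EAST, WEST };

// Hop count moving "forward" around the ring from a to b.
int fwd(int a, int b) { return (b - a + 5) % 5; }

// True if the forward arc from a to b passes through CORE as an
// intermediate stop, which the logical disjoint forbids.
bool transitsCore(int a, int b) {
  int d = fwd(a, CORE);
  return d > 0 && d < fwd(a, b);
}

// The exchange entered next on the route from src to dst.
int nextHop(int src, int dst) {
  if (src == dst) return (src + 1) % 5;               // core loopback: full circle
  int f = fwd(src, dst);
  bool useFwd;
  if (transitsCore(src, dst))      useFwd = false;    // forward arc blocked
  else if (transitsCore(dst, src)) useFwd = true;     // reverse arc blocked
  else                             useFwd = (f <= 5 - f);  // take the shorter arc
  return useFwd ? (src + 1) % 5 : (src + 4) % 5;
}

For example, nextHop(NORTH, WEST) returns SOUTH: the shorter arc through the core is blocked, so the flit takes the four-hop North-South-East-West path listed in the table.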

Another adverse effect that can occur in the network is starvation [21]. It can occur when the resource arbitration scheme favors one requester over another. Under heavy traffic, the requests from the input port of one exchange can be so overwhelming that the buffer arbiter grants access only to that input and completely ignores requests from the other. A round-robin arbitration scheme is therefore adopted for gaining access to the buffers in the downstream exchange. Thus, starvation of any input to the router or to the exchanges will not occur.

3.3 Evaluations

In this section, we describe the tools and methodology to evaluate the performance of the Omega router. We augment a cycle-accurate NoC simulator, Booksim [26], to support the Omega router microarchitecture. We use various network configurations and different synthetic traffic loads as well as network traces from real-world applications. We measure the network latency, saturation and throughput using the simulator. Based on the conventional router microarchitecture [27], we implement various router circuitries to estimate the area and power consumptions and circuit delays using Cadence’s VLSI design tools. Besides the conventional router, we also add a more advanced conventional router for comparison.

3.3.1 Simulator

Booksim is a much enhanced version of the original simulator used to generate the data for the seminal book Principles and Practices of Interconnection Networks by Dally and Towles [19]. The original version was designed for multiprocessor networks, where the networks are off-chip. The new version adds support for NoCs and includes many state-of-the-art NoC features. Booksim is a cycle-accurate simulator; it can model the behaviors of network components down to single-clock-cycle precision. Its accuracy has been validated against an actual router implementation written in the hardware description language Verilog. The software design is quite flexible in that users can build any network topology, and it supports the conventional four-stage router pipeline.

Booksim is open-source software written in C++. It runs in the Linux environment. Since its release to the public in 2013, computer architecture researchers have been updating the software; the latest update was June 27, 2017.

Booksim only models the network. It does not model the instruction executions of the core, unlike full-system simulators like gem5 [12] and Graphite [28]. Consequently, Booksim cannot execute an actual computer program, but users can inject network packets into the network in certain patterns. Alternatively, it can replay the sequence of network packet injections captured from running real-world applications on a full-system simulator or a real machine. Without the need to model the core precisely, the simulation execution can be quite fast, allowing NoC designers to explore the design space more thoroughly.

Booksim also models the area and power consumption of an NoC, but at a high level, so those estimates are not as accurate as the timing measurements. Parameters from a 32-nm process technology model are used. An analytical model estimates the area and static power from the number of buffers, the size of the crossbar switch and the number of connections. During simulation, activities in the various components are recorded and then used to estimate the dynamic power consumption.

We have made several major additions to the source code of Booksim for our work. We added new C++ classes for the Omega router and the exchanges. The conventional router uses a credit-based flow control mechanism (see Section 2.4.3), so we added a new shared arbitration bus design for the same purpose for the Omega router. We also created a new route computation unit as well as buffer and output arbiters for the Omega router.
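The overall shape of these additions is sketched below. The class and method names are ours for illustration only; they do not reproduce Booksim's actual interfaces.

// One of the five identical ring nodes inside an Omega router.
class Exchange {
 public:
  void RouteCompute();     // two-way exit decision for the downstream exchange
  void BufferArbitrate();  // grant one of the two competing writers per buffer
  void OutputArbitrate();  // pick a virtual channel to read out
  // ...three buffers, per-port state, shared-bus request/grant lines...
};

// The router is little more than five exchanges on a shared arbitration bus.
class OmegaRouter {
 public:
  void Step();             // advance all exchanges by one cycle
 private:
  Exchange xc_[5];         // north, south, east, west, core
};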

Booksim is a single-threaded application that takes its inputs from a configuration file. At the end of a simulation, it outputs a number of statistics. Listing 3.1 shows an example of the configuration file; we describe some important parameters here. Booksim configures the network to have two dimensions (n = 2), with each dimension consisting of four nodes (k = 4). The traffic pattern used is uniform (traffic = uniform), and the injection rate is 0.001 flit per clock cycle per node (injection_rate = 0.001).

3.3.2 VLSI Design Tools

Even though Booksim provides estimates of the area and power consumption of a router, they are rather high-level estimates nonetheless. In addition, the parameters used for the process technology are at least several years old. To obtain more accurate estimates, we implement the actual router circuitry with current VLSI design tools and measure the area and power consumption. We also measure the circuit delays to gauge how fast the circuits can be driven. The conventional router microarchitecture, which we use as our baseline design, was implemented in Verilog by Becker [27]. We keep the buffer circuitry from the baseline design and enhance it for the Omega router. We create new circuitry for the route computation unit and the output and buffer arbiters, implement new control logic to glue all the major components together, and create new test benches to verify the functionality of the Omega router. We use Cadence's Genus Synthesis Solution product [29] to compile the Verilog code and synthesize the router with a 45-nm process technology cell library; the tool also provides accurate estimates of the area and power consumption of the router. For simulation to validate the designs, we use Cadence's Incisive Enterprise Simulator product [30]. Table 3.2 shows the primary parameters used in circuit synthesis.

3.3.3 Network Configurations

Since the Omega router is specifically designed for planar mesh networks, we run simulations on only those types of networks. We use a number of network configurations. Table 3.3 shows the primary network parameters used in simulations. Note that because of the drastic

Listing 3.1: Booksim configuration file example.

// Topology
topology = mesh;
k = 4;
n = 2;

// Routing
routing_function = dor;

// Flow control
num_vcs = 8;
vc_buf_size = 8;

// Router architecture
channel_width = 128;
router = iq;
vc_allocator = islip;
sw_allocator = islip;

// Traffic
traffic = uniform;
injection_rate = 0.001;

// Simulation
sim_type = latency;

// Power estimation
sim_power = 1;

Table 3.2: Circuit synthesis parameters.

Parameter                Value
Technology node          CMOS 45 nm
VDD                      1.2 V
Cell library             Cadence GSCLIB045
Number of metal layers   11
Clock frequency          1 GHz

difference in architecture, the Omega routers use two or three virtual channels, whereas the baseline routers use eight or nine.

Table 3.3: Network parameters.

Parameter                    Value
Dimensions                   8 × 8
Packet length                one flit
Buffer width                 128 bits
Link width                   128 bits
Number of virtual channels   2, 3, 8, 9
Virtual channel depth        8 flits
Routing algorithm            DOR

3.3.4 Network Traffic

We want to measure performance accurately under the target conditions, so we collect data only when those conditions are reached. There are basically three phases in a simulation: warm-up, steady state and draining. During the warm-up phase, packets begin to be injected into the network; it may take some time to reach the target conditions. In the steady-state phase, the target conditions are reached, and we start collecting data. After many millions of simulated clock cycles, once enough packets have been injected for measurement, we stop collecting data. Finally, the simulation enters the draining phase to drain out the packets that still remain in the network. In this phase, we keep injecting packets in the same way, to ensure that the packets injected during the steady-state phase still experience the same network conditions; after the last packet to be measured has reached its destination, we stop injecting. For every packet injected into the network in all three phases, we verify that it reaches its desired destination, to guarantee the proper functioning of the network and the simulator. In terms of how packets are injected into the network (the network traffic load), we primarily use two methods: synthetic traffic and network traces. Synthetic traffic generates a well-defined injection pattern, and a network trace recreates the injection pattern collected from running an application program.

Synthetic Traffic

Each core injects packets into its attached router to reach some destination cores. Which destination cores to send to is determined by a (typically simple) algorithm. The flit injection rate is an input parameter to the simulator, and each core uses the same algorithm to realize that rate. Injection times follow a Poisson distribution, which is widely used to model the arrival of packets at routers in computer networks [31]. A general traffic pattern thus evolves across the entire network. The pattern is synthetically created, not the natural result of running any real-world application; however, some of the patterns closely mimic how cores communicate in real-world applications. The main purpose of synthetic traffic is to stress the network so that designers can characterize the router and the network. It is also a good way to debug and test the network, as bugs can be reproduced consistently. The algorithm that generates the destination of a packet is the primary factor producing the traffic pattern, so we name the traffic patterns after the algorithms used.

The way we number the network nodes is illustrated in Figure 2.3: Core0 is at the upper-left corner, and the number increases from left to right within a row and from top to bottom from row to row. To evaluate the Omega router, out of the large number of synthetic patterns, we choose six that represent a diverse range and are commonly used: asymmetric, bitcomp, hotspot, shuffle, transpose and uniform. They are described below (a sketch of three of the destination functions follows the list).

asymmetric: There are only two destination cores, one at a mesh corner and the other in the middle section of the network. The network traffic is skewed towards those two nodes.

bitcomp: The destination node is determined by complementing the bit pattern of the source node number. For example, Core0 always sends packets to Core15.

hotspot: Destinations can be assigned to specific nodes. Thus, heavy-traffic hotspots form, and network congestion occurs around those nodes.

shuffle: The bit pattern of the source node number is rotated by one bit to create the destination node. This pattern arises in some sorting applications.

transpose: The destination node address is derived from the source node address by transposing elements of a matrix. This pattern arises in fast Fourier transform (FFT) operations.

uniform: This is the most widely used pattern [19]: each core generates the same amount of network traffic over time, so the spatial distribution of packets is uniform across the network.
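To make the bit-level patterns concrete, the following C++ sketch gives destination functions for bitcomp, shuffle and transpose on the 16-node (4 × 4) configuration of Listing 3.1. The function and constant names are ours.

constexpr int kNodes = 16, kBits = 4, kSide = 4;  // 4 x 4 mesh, 4-bit node IDs

int bitcomp(int src) {                 // complement the bits: Core0 -> Core15
  return ~src & (kNodes - 1);
}

int shuffle(int src) {                 // rotate the node number left by one bit
  return ((src << 1) | (src >> (kBits - 1))) & (kNodes - 1);
}

int transpose(int src) {               // swap row and column, as in a matrix transpose
  int row = src / kSide, col = src % kSide;
  return col * kSide + row;
}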

Network Trace Traffic

While synthetic traffic stresses the network and provides insight into the system under design, it may not show the performance that would be seen when running real-world applications. Ideally, we want to run the applications that will eventually be used on the systems we are building. Unfortunately, at the moment, such an effort is quite impractical: running real applications on a full-system simulator is extremely time consuming, with run times on the order of many weeks [32]. Current technology simply does not permit this method to be effective.

If our main focus is on the NoC alone, some important compromises can be made to speed up the simulation of real-world applications. We can concentrate on the network activities and ignore the detailed behaviors of the cores that execute the program instructions. We adopted the Netrace methodology [32], which allows running simulations on an NoC with just the network traces (activities) generated by the application programs. The network traces are captured from executing the applications on a full system. The traces record when a packet is injected into the network, the size of the packet and the source core. They also contain dependency information among packets, for some packets can be issued only after certain other packets are received. This is a common scenario for memory access through a network: for example, a core receives some data from main memory, does some processing, then requests more data, producing a sequence of packets in the network. The order of packet injections must be maintained to achieve accurate simulations (a small sketch of this replay check follows the application list).

Netrace provided us with a set of network traces from the PARSEC application benchmark suite [33], a diverse set of real-world parallel applications designed for multicore systems. The traces were captured from running the applications on a 64-core system in a Linux environment. We used Booksim to replay those network traces on different router designs. Typically, hundreds of millions of packet-injection records are generated during the execution of an application; the simulator injects a packet into the network according to the injection timestamp (in clock cycles) in the record and its dependencies on other packets. The following nine applications are in the PARSEC suite.

blackscholes: A financial application that computes the prices of a set of stocks using the Black-Scholes partial differential equation (PDE).

bodytrack: A computer vision application that tracks a human body with inputs from multiple cameras.

canneal: Used in chip design to reduce routing cost using a cache-aware simulated annealing (SA) method.

dedup: A computing kernel that compresses a data stream at the global and local levels, a process called deduplication.

ferret: A similarity search application based on the Ferret toolkit.

fluidanimate: A simulation of incompressible fluid for animation purposes.

swaptions: A financial application that computes the prices of a set of swaptions, which are options giving the right but not the obligation to engage in a swap.

vips: An image processing application that uses affine transformations and convolution.

x264: An application for lossy video compression based on the H.264 compression standard.
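A minimal sketch of the dependency-aware replay check is given below; the record layout and the names are illustrative assumptions, not Netrace's actual format or API.

#include <cstdint>
#include <vector>
#include <unordered_set>

struct TraceRecord {
  uint64_t id;                  // this packet's identifier
  uint64_t cycle;               // earliest injection time, in clock cycles
  int src, dst, size;           // endpoints and packet size
  std::vector<uint64_t> deps;   // packets that must be received first
};

// Inject only once the timestamp has passed and every dependency is delivered.
bool readyToInject(const TraceRecord& r, uint64_t now,
                   const std::unordered_set<uint64_t>& delivered) {
  if (now < r.cycle) return false;
  for (uint64_t d : r.deps)
    if (!delivered.count(d)) return false;  // an upstream packet is still in flight
  return true;
}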

3.3.5 Simulation Platform

We run the simulations on the computer systems at the University of Arizona's High Performance Computing (UA HPC) research center. Of the two major computing clusters available, Ocelote and El Gato, we use Ocelote, the more powerful one. It has 400 compute nodes connected by InfiniBand [34], a high-throughput, low-latency interconnect technology. Each node has two 14-core Intel Xeon Haswell E5-2695 or Broadwell E5-2695 processors running at 2.3 GHz, so each compute node has 28 physical cores, and each node has 192 GB of memory. The operating system is CentOS [35], an enterprise-class Linux distribution.

Ocelote is shared by a community of researchers and students. Typically, a user submits a computing job to the system, waits for a completion notification, then checks the results stored in files. The workloads are managed by software called Portable Batch System (PBS) [36], which schedules jobs on the compute nodes and monitors their execution. Users submit a script to PBS to initiate the work. Listing 3.2 shows a simple example of the submission script, which asks for one core and 4 GB of memory (select=1:ncpus=1:mem=4gb) to execute the program booksim (booksim config_baseline). It also limits the total run time to 12 hours (walltime=12:00:00). All screen output is captured in a text file.

Listing 3.2: PBS job submission script example.

#!/bin/bash
#PBS -N sim_job
#PBS -M [email protected]
#PBS -m bae
#PBS -W group_list=meilingw
#PBS -q windfall
#PBS -j oe
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -l walltime=12:00:00
#PBS -l cput=12:00:00

cd ~/projects/booksim
booksim config_baseline

3.3.6 Running Simulations

We are mostly interested in the performance of the NoC measured over a range of injection rates, which are directly related to the traffic density in the network. One simulation session (one PBS job) produces one data point for a particular injection rate, so collecting one set of data (typically 30 to 50 points) for a particular network configuration requires running the simulator multiple times, and each data point may take hours of simulation to produce. The large number of available compute nodes at UA HPC allows us to run these simulations simultaneously because they are independent of each other; the vast amount of resources helps greatly in reducing the data collection time. We submit a large number of jobs with different inputs (configuration files) to PBS simultaneously, in an attempt to run them in parallel as much as possible. To streamline the submission process, we wrote a Python script, submit_sims.py, that generates the configuration files and submission scripts from a few command-line parameters and then submits the jobs to PBS. For example, the following command submits nine jobs, each using a different injection rate (-i), on an 8 × 8 mesh network (-c 64), with a conventional router microarchitecture (-l) and eight virtual channels (-v 8). The results are stored in a subfolder (-p BASELINE).

python submit_sims.py -i "0.001,0.002,0.005,0.01,0.02,0.05,0.1,0.2,0.5" -c 64 -p BASELINE -l -v 8

3.4 Experiments and Results

We use three router designs in our experiments. The first is a conventional design without any enhancements (the baseline). The second is still based on the conventional design but enhanced with lookahead routing and speculative resource allocation [37]; these two features help reduce delay in the router. They had previously been used in standalone routers in multiprocessor systems and were implemented in Booksim for NoCs a few years ago. We found this combination to produce the smallest network latency among the various conventional router configurations available in Booksim. The third is our Omega router. For reference, we denote the conventional design without enhancements as base-1, the conventional design with advanced features as base-2 and the Omega router as omega.

The buffer is a critical resource in a router, and its size has a significant impact on several aspects of performance. The buffer size is primarily determined by the number of virtual channels it contains. To make a fair comparison among the different routers, we attempted to keep the overall buffer resources the same; we also varied the buffer size to see how the changes impact performance. In the conventional router, we used eight and nine virtual channels per buffer in the input units. Since there are five input units, the total number of virtual channels is 40 (5 × 8) or 45 (5 × 9), respectively. We therefore further denote the conventional router having eight virtual channels as base-1-8, and nine virtual channels as base-1-9. An exchange has three buffers, one per port, and there are five exchanges in the Omega router. We use two or three virtual channels per buffer, i.e., the total number of virtual channels in the Omega router is either 30 (5 × 3 × 2) or 45 (5 × 3 × 3). Similarly, we denote the Omega router with two virtual channels per buffer as omega-2, and with three as omega-3. Notice that base-1-9, base-2-9 and omega-3 have exactly the same amount of buffer resources.

3.4.1 Synthetic Traffic

For synthetic traffic workloads, we varied the injection rates to measure the number of clock cycles it takes for each packet to travel from the source node to the destination node. We gathered data from six different routers: base-1-8, base-1-9, base-2-8, base-2-9, omega-2 and omega-3.

Figure 3.10 shows the average network latency results from all traffic patterns over a wide range of injection rates. Note that the vertical axis is in log scale. Network latency is measured from the time a packet is ready to be injected into the network from the source core to the time it is received in the destination core, so the latency includes the queuing delay before the packet enters the network.

Figure 3.11 shows the same results but highlights the range up to the saturation point, where there is significant queuing delay, i.e., the network becomes saturated with packets.

Figure 3.12 shows the average network throughput results from all traffic patterns over the same range of injection rates as for the latency measurements. Network throughput is measured as the number of packets ejected by the network to the destination core per cycle after the packets have traversed the network starting from the source core.

3.4.2 PARSEC Applications

For application traffic workloads, we rely on data from the external source [32] mentioned above. We measure only the network latency, not saturation, because we cannot vary the injection rates; they are predetermined by the network traces. We inject packets into the network according to the sequences and dependencies in the traces and measure the number of clock cycles it takes each packet to travel from the source node to the destination node. Three router designs are used: base-1-8, base-2-8 and omega-2. Figure 3.13 shows the average network latencies across all applications. They are the results of running simulations for at most 240 hours, the maximum allowable run time on UA HPC. Six of the applications running on the Omega router did not complete the entire simulation; as shown in Table 3.4, not all packets in those traces were injected into the network. Still, at least tens of millions of packets were successfully injected and processed in each simulation. Even though these simulations did not process all the packets from the network traces, we observed that the latencies had held at a steady value for many hours; therefore, we believe the data are reliable.

3.4.3 Circuit Synthesis

We measure the area and power consumptions as well as the timing slack from just one single router circuitry, not the entire network. The area is the total area occupied by all the instances of library cells and wires in the router. The power is the sum of the leakage power and the dynamic power. The leakage power is static because it is not related to any activities in the circuitry; it is caused by the leakage

Figure 3.10: Average network latencies from 6 synthetic traffic patterns, full range: (a) asymmetric, (b) bitcomp, (c) hotspot, (d) shuffle, (e) transpose, (f) uniform. Y-axis: Latency in cycles. X-axis: Injection rate in packet/cycle.

Figure 3.11: Average network latencies from 6 synthetic traffic patterns, up to saturation: (a) asymmetric, (b) bitcomp, (c) hotspot, (d) shuffle, (e) transpose, (f) uniform. Y-axis: Latency in cycles. X-axis: Injection rate in packet/cycle.

Figure 3.12: Average network throughputs in all synthetic traffic: (a) asymmetric, (b) bitcomp, (c) hotspot, (d) shuffle, (e) transpose, (f) uniform. Y-axis: Throughput in packet/cycle. X-axis: Injection rate in packet/cycle.

Figure 3.13: Average network latencies from PARSEC applications. Note: The last set of columns represents the average for all applications.

current in the devices. The dynamic power is due to the constantly running clock applied to the circuitry for synchronization purposes, and it is directly proportional to the clock frequency. Notice that the dynamic power is measured when the circuit is totally idling, i.e., no packets are active in the router. The circuit synthesis tool (Cadence’s Genus) can only provide the power measurement under “static” conditions as it cannot be used to simulate any packet movements.

The timing slack is how much margin the critical path of the circuit has with respect to the clock rate (specified in Table 3.2) used during circuit synthesis. A negative number means the timing requirement is not satisfied, i.e., the delay of the critical path of the circuit is too long given the clock rate.

Table 3.4: PARSEC simulation completion status on Omega router.

Application    Number of packets in traces   Percentage injected
blackscholes   22,660,091                    100%
bodytrack      385,863,891                   94%
canneal        372,046,797                   12%
dedup          431,833,996                   65%
ferret         287,425,404                   47%
fluidanimate   71,607,952                    100%
swaptions      310,331,287                   100%
vips           334,870,995                   57%
x264           135,412,156                   50%

We gathered data from six different routers: base-1-8, base-1-9, base-2-8, base-2-9, omega-2 and omega-3. Table 3.5 shows the results for the various router designs.

Table 3.5: Router area and power consumptions and slacks.

             base-1-8      base-1-9      base-2-8      base-2-9      omega-2       omega-3
Area (μm²)   746,032       865,927       751,394       872,155       488,890       692,445
Power (nW)   183,919,185   207,345,977   186,439,423   211,316,883   134,561,900   182,706,326
Slack (ps)   257           243           45            18            84            21

3.5 Analyses

In this section, we compare various performance metrics on different router designs from the data collected in the previous section. We focus on network latency, saturation and throughput as well as router power and area. We will explain or conjecture the reasons behind what we observed.

3.5.1 Network Latency

The horizontal part of a latency-load plot (as in Figure 3.11), extended to intersect the vertical axis, shows the no-load network packet latency. The no-load latency is essentially the delay a packet would experience if there were no other packets traversing the network at the same time. We use the data in Figure 3.11 to compute the no-load latency by averaging the first 10 points on the horizontal line. Figure 3.14 shows the computed latencies for all synthetic traffic patterns and the overall average. From those results, we compute the latency reductions of all router designs over the baseline design (base-1-8); the results are shown in Table 3.6.


Figure 3.14: Average network latencies from 6 traffic patterns. Note: The last set of columns represents the average over all patterns.

As shown in Table 3.6, we do not see any latency reduction from base-1-9 over the baseline (the entire column is zeros), which shows that having more buffers does not help reduce the latency.

Table 3.6: Network latency reduction over the baseline design (base-1-8). Data are computed from Figure 3.14. Overall averages are also listed.

Traffic      base-1-9   base-2-8   base-2-9   omega-2   omega-3
asymmetric   0%         35%        35%        44%       44%
bitcomp      0%         38%        38%        55%       55%
hotspot      0%         36%        36%        52%       52%
shuffle      0%         37%        37%        53%       53%
transpose    0%         37%        37%        52%       52%
uniform      0%         38%        38%        54%       54%
average      0%         37%        37%        53%       53%

Table 3.7: Number of virtual channels in router designs.

                           base-1-8   base-1-9   base-2-8   base-2-9   omega-2   omega-3
Total number of VCs        40         45         40         45         30        45
% Increase over baseline   0%         12.5%      0%         12.5%      -25%      12.5%

That is because under the no-load condition, almost all the buffers are idle anyway; adding more buffers just adds more idle components. We observe the same situation for base-2-9 over base-2-8 and omega-3 over omega-2; the numbers for each pair are identical.

The reductions are quite even from pattern to pattern. For base-2, the range is 35% to 38% (average 37%); the maximum difference is only 3%. For omega, the range is 44% to 55% (average 53%); however, the 44% (from asymmetric) is somewhat of an outlier, and without it the range is 52% to 55%, a maximum difference identical to that of base-2. In general, the no-load latency is not sensitive to the traffic pattern, which is expected: by definition, under no load there are no other packets involved, i.e., there is effectively no specific traffic pattern.

Omega outperforms base-1 and base-2 in all categories. Note that omega-2 has only 30 virtual channels, whereas base-1-9 has 45 (see Table 3.7). This shows that the Omega microarchitecture has a significant positive impact on the latency. Overall, it outperforms the baseline design by 53% and the baseline design with advanced features by 16% (53 − 37).

Using the data in Figure 3.13 from the PARSEC applications, we compute the latency improvements of the routers over the baseline design; the results are shown in Table 3.8. They are very similar to those for synthetic traffic, with no large differences from application to application. For base-2, the range is 33.2% to 34.8% (average 34.0%); for omega, 46.3% to 50.1% (average 48.6%). Omega outperforms base-1 and base-2 in all applications: it outperforms the baseline design by 48.6% and the baseline design with advanced features by 14.6% (48.6 − 34.0). We observed that the average packet injection rates of all applications were actually quite low, so the results match those for synthetic traffic quite well.

Table 3.8: Network latency reduction over the baseline design (base-1-8). Data are computed from Figure 3.13.

Application    base-2-8   omega-2
blackscholes   33.6%      48.5%
bodytrack      33.7%      48.8%
canneal        35.0%      50.0%
dedup          33.9%      47.6%
ferret         33.6%      49.2%
fluidanimate   33.5%      48.2%
swaptions      34.8%      50.1%
vips           33.2%      46.3%
x264           34.3%      48.9%
average        34.0%      48.6%

3.5.2 Network Saturation

The sharp rising part of a latency-load plot (as in Figure 3.11) indicates where the network saturates. The saturation point is defined as the injection rate at which the network packet latency reaches two or more times the no-load latency. From the data in Figure 3.11, we extract the injection rates at the saturation points for the various traffic patterns; the results are shown in Figure 3.15. From those results, we compute the network saturation increase over the baseline, shown in Table 3.9.
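Extracting a saturation point from a latency-load sweep is mechanical. A small C++ sketch of the procedure might look like this; the function name is ours, and the ten-point average follows the no-load computation of Section 3.5.1.

#include <vector>
#include <utility>

// sweep: (injection rate, average latency) pairs, sorted by increasing rate.
double saturationPoint(const std::vector<std::pair<double, double>>& sweep) {
  if (sweep.empty()) return -1.0;
  size_t n = sweep.size() < 10 ? sweep.size() : 10;
  double noLoad = 0.0;
  for (size_t i = 0; i < n; ++i) noLoad += sweep[i].second;
  noLoad /= n;                                     // no-load latency estimate
  for (const auto& p : sweep)
    if (p.second >= 2.0 * noLoad) return p.first;  // latency has doubled: saturated
  return -1.0;                                     // never saturated in this sweep
}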


Figure 3.15: Saturation points from 6 traffic patterns. Data are extracted from Figure 3.11. Note: The last set of columns represents the average over all patterns.

The number of virtual channels, i.e., the total buffer size, has a direct impact on the saturation metric. All things being equal, across all router designs, a larger buffer means a higher saturation point. We see in Figure 3.11 that the vertical part of a curve gets pushed out

Table 3.9: Network saturation increase over the baseline design (base-1-8). Notes: Data are computed from Figure 3.15. The change in the number of virtual channels (VCs) relative to the baseline design is also listed for ease of comparison.

Traffic         base-1-9   base-2-8   base-2-9   omega-2   omega-3
asymmetric      10%        10%        10%        -10%      -5%
bitcomp         10%        10%        10%        20%       20%
hotspot         0%         0%         0%         0%        0%
shuffle         10%        0%         0%         0%        0%
transpose       17%        17%        17%        17%       17%
uniform         12%        6%         12%        -6%       6%
average         11%        8%         9%         0%        5%
% VC increase   12.5%      0%         12.5%      -25%      12.5%

further to the right. From Table 3.9, base-1-9 outperforms base-1-8 on average by 11%, base-2-9 outperforms base-2-8 by 1% (9 − 8) and omega-3 outperforms omega-2 by 5% (5 − 0). Looking at individual traffic patterns, a larger buffer consistently produces performance that is at least no worse. These results should come as no surprise: a larger buffer allows more packets to be contained in the network, increasing the capacity of the network, which makes the network saturate later rather than sooner.

It is worth noting that the performance increase is not proportional to the buffer size increase. Looking at the averages in Table 3.9, for base-1 and base-2 the same increase (12.5%) produces gains of 11% and 1% (9 − 8), respectively; for omega, a 50% increase (from 30 to 45 virtual channels) produces only a 5% gain. Furthermore, with exactly the same buffer size, base-1-9, base-2-9 and omega-3 produce 11%, 9% and 5% performance increases, respectively.

Although buffer size is a major factor influencing saturation, it is not the only one. As shown in Table 3.9, the saturation point is sensitive to the traffic pattern; within the same design, the performance increase can vary considerably from pattern to pattern. Base-1-9, base-2-8 and base-2-9 all range from 0% to 17%, omega-2 from -10% to 20%, and omega-3 from -5% to 20%. Base-1-9 actually performs best on average and relatively more consistently across the patterns. The hotspot traffic pattern seems immune to buffer size as far as saturation is concerned; there is no change at all across the designs. However, because of the very low injection rate at its saturation point, the resolution of the injection rate may not be high enough to show a meaningful difference.

Omega-2 has the smallest number of virtual channels, 25% fewer than the baseline design (see Table 3.9), yet on average it performs as well as the baseline. This can be attributed to architectural differences that largely offset the large reduction in buffer size. With the ring topology in the router, flit movement is much more dynamic than in the conventional router: packets go from buffer to buffer when traversing the router, freeing up buffer space constantly and creating more opportunities for new packets to enter the network or router. In other words, the traffic load is more spread out among the exchanges in the router. All this suggests that the architectural aspects of the design play a bigger role in performance than buffer size; they can overcome a vast difference in resources to deliver equal or better performance. Exactly how this correlation works needs to be better understood.

3.5.3 Network Throughput

Combining Figures 3.11 and 3.12, which show saturation and throughput, respectively, we can see that all router designs maintain matching input and output throughputs prior to their saturation points. Beyond the saturation point, there is no clear winning design; it depends on the traffic pattern. For example, omega-2 is the second-worst performer on uniform (the worst, by a small margin, is base-1-8), but it is one of the top two performers on shuffle (the other being omega-3). The relationship between network behavior

and traffic patterns at high-traffic conditions appears to be a complicated one and warrants further investigation.

3.5.4 Area and Power

We compute the area and power increases of various routers over the baseline design. Table 3.10 shows the increases in percentage.

Table 3.10: Router area and power consumption relative to the baseline design (base-1-8). Note: Data are computed from Table 3.5. The change in the number of virtual channels (VCs) relative to the baseline design is also listed for reference.

                base-1-9   base-2-8   base-2-9   omega-2   omega-3
Area            16.1%      0.7%       16.9%      -34.5%    -7.2%
Power           12.7%      1.4%       14.9%      -26.8%    -0.7%
% VC increase   12.5%      0%         12.5%      -25%      12.5%

As the total buffer size increases, the area and power consumptions increase almost proportionally. As shown in Table 3.10, a 12.5% increase in buffer size produces similar increases in base-1-9 and base-2-9 : 12.7% and 14.9% in power, respectively; and 16.1% and 16.9% in area, respectively. The area and power percentages are slightly higher than the buffer size because extra auxiliary components (mainly for control purposes) are also needed for larger buffers. These data also support the fact that the power and area consumptions are dominated by the buffers in the router.

Omega-2 produces the biggest reductions in area and power, for it has the smallest buffer size (25% less): area is reduced by 34.5% and power by 26.8%. The reductions are much larger than the buffer size reduction because there is also no large crossbar switch, which takes up area and consumes power.

3.5.5 Critical Path Delay

The critical path delay, which is the delay of the longest timing path in the circuit, limits the clock rate that can be applied to the circuitry. We can compute the delay from the slack time reported by the circuit synthesis tool: in general, it is the difference between the clock period and the slack time. This holds for the baseline router design. However, since the Omega router uses both the rising and falling edges of the clock, its critical path delay must be computed from half of the clock period. Using the slack data in Table 3.5 and the clock period of 1000 ps, we compute the critical path delays of the various router designs. The results are shown in Table 3.11, along with the percentage increase of each delay over the baseline router design, base-1-8.
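As a concrete check against the slack values in Table 3.5, with the 1000 ps clock period of Table 3.2:

base-1-8: delay = 1000 − 257 = 743 ps   (the full clock period applies)
omega-2:  delay = 1000/2 − 84 = 416 ps  (dual-edge clocking leaves half the period)

Both values match the corresponding entries in Table 3.11.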

Table 3.11: Critical path delays of various router designs.

                           base-1-8   base-1-9   base-2-8   base-2-9   omega-2   omega-3
Critical path delay (ps)   743        757        955        982        416       479
Compared to base-1-8       —          1.9%       28.5%      32.2%      -44.0%    -35.5%

Since more circuit logic requires more gates to implement, it incurs a longer timing delay. Thus, the critical path delay can be considered an indicator of the complexity of the circuitry. Table 3.11 clearly shows that the Omega design is simpler than the baseline design; the critical path delay of omega-2 is 44% less than that of base-1-8.

Note that since the falling edge of the clock is also used in the Omega design, one can adjust the duty cycle of the clock to gain more "slack" in some parts of the circuit. Of course, increasing the time on one side of the clock signal decreases the time on the other side. Nevertheless, the designer has one more parameter with which to fine-tune the circuitry.

3.5.6 Summary

We summarize the analyses of the previous sections here. We normalize four performance metrics with respect to the baseline design, base-1-8; Figure 3.16 shows the results. Omega-2 has the best overall performance in latency, power and area: it has the smallest latency (a 53% improvement over base-1-8), takes up the least area (34% less) and consumes the least power (27% less). As we can see from the figure, its advantages over the closest competitor are substantial. These gains come from the elimination of the crossbar switch and the use of fewer virtual channels, a 25% smaller total buffer size. Although omega-2 has the smallest total buffer size, it has the same saturation point as the baseline design, 11% below the best performer, base-1-9. However, base-1-9 performs much worse than omega-2 on the other two metrics: 50 percentage points more in area and 40 percentage points more in power (normalized to base-1-8).

3.6 Related Work

NoC is a relatively new area in computer architecture; it has been around for less than twenty years and developed as a consequence of advances in semiconductor fabrication technology. Much of the research in this area has addressed routing algorithms, arbitration schemes, flow control and buffer design/management, all based on the conventional router microarchitecture. There have been some architectural enhancements, like [38–41], but most of them remain within the conventional design; there has not been much work resulting in a drastic router microarchitecture redesign. Kim [23] acknowledges that NoC research has typically been based on router microarchitectures inherited from their off-chip counterparts. He argues that, as a result, the design complexity has a negative impact on the scalability of on-chip networks. He proposes a low-cost router microarchitecture that is quite different from the conventional design. He lowers


the cost by simplifying the router design and creating a dimension-sliced router partitioned into two parts, one for each dimension. His design targets 2-D mesh networks. The crossbar switch remains, though it is smaller than the one in the conventional design, and there are still two, one for each dimension. The arbitration scheme is simpler and the number of buffers is smaller, compared to the conventional design. Packets are buffered at the front end, as in the conventional design. His work provided inspiration for how to arrange the exchanges in our work. We believe our Omega design departs more drastically from the conventional design than Kim's proposed design.

Figure 3.16: Performances normalized to base-1-8. Note: Except for saturation, the lower the number, the better the performance.

Abad et al. [42] also adopted a ring topology in their router, whose microarchitecture is radically different from the conventional design and targets general network topologies. There are likewise two circular paths, clockwise and counter-clockwise. Their design does not use any virtual channels to avoid head-of-line blocking [20] (see Section 2.4.3); instead, a packet keeps circulating in the ring network if no exit port (to get out of the router) is available yet. This continuous movement could result in substantial power consumption. Deadlock and livelock are avoided with a special adaptive Bubble flow control mechanism [43], which guarantees that there is always an empty slot (a “hole”) in the buffering structure inside a ring. Thus, the control logic is more complicated than in our Omega design, and not all buffer space can be fully utilized. There is no crossbar switch as in the conventional design. Except for the high-level ring topology, the internal implementation is drastically different from our Omega design.

3.7 Discussion

The Omega microarchitecture is a drastically different design from the conventional router microarchitecture. It targets the prevailing 2-D mesh networks and adopts a ring-topology network inside the router. The exchanges (network nodes) in the ring are identical and simple, and their arrangement is optimized for the popular dimension-order routing algorithm for mesh networks.

Omega is a much simpler design than the conventional one. There is no large crossbar switch; buffers are situated at the exit point rather than at the entry point; the router itself is constructed from five identical modules rather than from many different ones; and the buffer access arbitration logic can be very simple, since it deals with just two inputs, not five as in the conventional design. All these factors yield a very simple exchange design. A flit can traverse an exchange in one clock cycle. Simulations and circuit synthesis show superior performance over the conventional design in latency, area, and power.

With fewer buffer resources and no large crossbar switch, Omega also outperforms the conventional router with advanced enhancements, except for saturation.

The buffer resources are more spread out in the Omega router. In contrast to the conventional router, flit movements are more dynamic in the Omega router, which allows more flits to enter the router; as a result, the buffers are utilized more efficiently. The elimination of the crossbar switch and the adoption of a smaller overall buffer size clearly help reduce the circuit area, which is where we find the greatest improvement, and also reduce power consumption.

A simple round-robin algorithm is used in arbitration. More sophisticated algorithms could be adopted; this should help defer the saturation point further, delaying the onset of traffic congestion.

The Omega microarchitecture provides additional path diversity in the router, which is absent in the conventional microarchitecture. The injection of a flit into the core exchange can take either a clockwise or a counter-clockwise path. If the disjoint at the core exchange can be eliminated, this path diversity becomes available to all entry ports in the router, which will facilitate adaptive routing algorithms and fault tolerance.

The highly modular design of the Omega microarchitecture would also help in chip layout and wiring. Cost should be lower, too, because fewer components are needed.

The future of the Omega microarchitecture looks quite promising. However, this work is an initial effort on its design, and we believe many enhancements can be added to this new microarchitecture. For example, more sophisticated and popular schemes, like datapath-bypassing logic and shared buffer management, can be considered to further improve performance.

Chapter 4

Circuit Implementation

The Booksim simulator is built with C++, a general-purpose programming language, and models the behavior of a router only at a high level. A successful simulation therefore does not guarantee a hardware realization of the new router design; many timing aspects of the circuitry cannot be modeled accurately with the simulator. In this chapter, we describe in detail the circuit designs of the major functional units of an exchange (described in Section 3.2.2) in the Omega router. The hardware description language Verilog is used to build the router circuitry, and we use Verilog snippets and circuit schematics to describe the design details.

4.1 Route Computation

At each stop in the router along the route, a packet is evaluated to determine which direction it must take to move forward. There are actually two levels of routing in the Omega router: at the higher level, we determine how a packet is routed from router to router; at the lower level, within the router, how it moves from one exchange to another.

4.1.1 Inter-Router

The Dimension-Order Routing (DOR) algorithm (see Section 2.4.2) is quite simple to implement; Listing 4.1 describes it in Verilog. The inputs are the coordinates of the current and destination routers, and the output is the port (north, east, south, or west) that a packet must take to exit the current router. The listing illustrates the simplicity of the algorithm: the circuit can be implemented as combinational logic, without any clock and with little delay.

Listing 4.1: Routing algorithm from current router to the next router.

    // (curr_x, curr_y) is the Cartesian coordinate of the current router.
    // (dest_x, dest_y) is the Cartesian coordinate of the destination router.
    // output_port is the desired exit port in the router.
    if (curr_x == dest_x)
        if (curr_y > dest_y)
            output_port = `XC_SOUTH;
        else
            output_port = `XC_NORTH;
    else if (curr_x > dest_x)
        output_port = `XC_WEST;
    else
        output_port = `XC_EAST;

4.1.2 Inter-Exchange

As mentioned in Section 3.2.5, we use a lookahead routing scheme: routing is computed for the exchange downstream, not the current one. Listing 4.2 is a Verilog snippet showing the routing algorithm. The output port is explicitly expressed for each situation, essentially determined by a series of if-then-else operations. Again, the algorithm can be implemented as combinational logic and can operate at very high speed, in tens of picoseconds (ps) with the adopted process technology.

Listing 4.2: Routing algorithm from exchange to exchange.

    // Inputs:
    //   router_port is the location of the exchange in the router.
    //   exit_port is the exit port from the router.
    //   xc_port is the location of the port in the exchange.
    // Output:
    //   output_port is the exit port at the exchange downstream.
    case (router_port)
        `XC_CORE:
            case (exit_port)
                `XC_NORTH: output_port = (xc_port == `PORT_A) ? `PORT_E : `PORT_B;
                `XC_EAST:  output_port = (xc_port == `PORT_A) ? `PORT_A : `PORT_B;
                `XC_SOUTH: output_port = (xc_port == `PORT_A) ? `PORT_A : `PORT_B;
                `XC_WEST:  output_port = (xc_port == `PORT_A) ? `PORT_A : `PORT_E;
                `XC_CORE:  output_port = (xc_port == `PORT_A) ? `PORT_A : `PORT_B;
            endcase
        `XC_NORTH:
            case (exit_port)
                `XC_NORTH: output_port = (xc_port == `PORT_A) ? `PORT_A : `PORT_B;
                `XC_EAST:  output_port = (xc_port == `PORT_A) ? `PORT_A : `PORT_B;
                `XC_SOUTH: output_port = (xc_port == `PORT_A) ? `PORT_E : `PORT_B;
                `XC_WEST:  output_port = (xc_port == `PORT_A) ? `PORT_A : `PORT_B;
                `XC_CORE:  output_port = (xc_port == `PORT_A) ? `PORT_A : `PORT_E;
            endcase
        `XC_EAST:
            case (exit_port)
                `XC_NORTH: output_port = `PORT_B;
                `XC_SOUTH: output_port = `PORT_E;
                `XC_WEST:  output_port = `PORT_E;
                `XC_EAST:  output_port = (xc_port == `PORT_A) ? `PORT_A : `PORT_B;
                `XC_CORE:  output_port = (xc_port == `PORT_A) ? `PORT_A : `PORT_B;
            endcase
        `XC_SOUTH:
            case (exit_port)
                `XC_NORTH: output_port = `PORT_E;
                `XC_EAST:  output_port = `PORT_E;
                `XC_WEST:  output_port = `PORT_A;
                `XC_SOUTH: output_port = (xc_port == `PORT_A) ? `PORT_B : `PORT_B;
                `XC_CORE:  output_port = (xc_port == `PORT_A) ? `PORT_A : `PORT_B;
            endcase
        `XC_WEST:
            case (exit_port)
                `XC_NORTH: output_port = `PORT_B;
                `XC_EAST:  output_port = `PORT_E;
                `XC_SOUTH: output_port = `PORT_B;
                `XC_WEST:  output_port = (xc_port == `PORT_A) ? `PORT_A : `PORT_B;
                `XC_CORE:  output_port = (xc_port == `PORT_A) ? `PORT_E : `PORT_B;
            endcase
    endcase

4.2 Buffer

Each exchange in the Omega router has three buffers, as shown in Figure 3.5(b). Each buffer contains a series of virtual channels implemented as first-in-first-out (FIFO) devices. Figure 4.1 shows a high-level view of the buffer. At the input side, a single flit enters the buffer along with some control signals (primarily routing information); at the output side, one flit can be read out along with its associated control signals. The module is driven by the system clock. A buffer write and a buffer read can take place in the same clock cycle; the two operations are asynchronous and independently controlled. Each FIFO is eight flits deep (one of the parameters in Table 3.3) and is implemented as a 2-D array of registers for highest performance. Alternatively, it could be constructed with SRAM to save area at the expense of speed.

[Figure 4.1: High-level view of the buffer.]

4.2.1 Write

Figure 4.2 shows the primary signals relevant to the write operation of the buffer. Note that writing to the FIFOs occurs on the negative edge of the system clock. Control signals VC_wr0−N determine which FIFO to write to, one dedicated signal per device. The output signals Empty0−N from the FIFOs provide other control circuitry with the emptiness status of each FIFO. For example, after a write operation completes, an Empty signal may go from active to inactive, indicating that there is something in the FIFO. Note that control signals are also stored along with the flit; for example, for buffer operations, the FIFO stores the output port ID (produced by the route computation unit) for the exchange downstream.

[Figure 4.2: Write operation of the buffer.]

4.2.2 Read

Figure 4.3 shows the primary signals relevant to the read operation of the buffer. Reading from the FIFOs occurs on the positive edge of the system clock. Control signals VC_rd0−N determine which FIFO to read, one dedicated signal per device. The output signals Empty0−N (the same signals as in Figure 4.2) indicate whether the FIFOs are empty. The flit at the front of a FIFO always appears at its output for flow-control purposes, and the stored output port IDs (produced by the route computation unit) Out_port0−N are always available for output arbitration and other control purposes. The output arbiter produces the VC_sel signal to select which FIFO's data are routed forward to the exchange downstream.

[Figure 4.3: Read operation of the buffer.]

4.3 Output Arbiter

Figure 4.4 shows the primary signals relevant to the output operations of the buffer. A flit can only be sent to the exchange downstream if buffer space is available there and write access has been granted by the buffer arbiter downstream.

[Figure 4.4: Output arbiter, high-level view.]

Two sets of signals are received from the buffer arbiter downstream. Since the outgoing flit may go to either of the two buffers downstream, one set of signals corresponds to each buffer. Grnt0,1 are the two grant signals for the buffer access requests made in prior cycles (in most cases, the previous cycle), and VC0,1 are the corresponding granted virtual channels available in the exchange downstream. The Arbiter module produces the signal VC_sel for the output mux (in Figure 4.3) to select which flit in the current exchange moves forward. It also produces the signal Port_sel for the exchange downstream to set up the two muxes that write to the target buffers (in Figure 3.5(b)). If both grant signals are asserted in the same clock cycle, the Arbiter must pick one of them to produce VC_sel and Port_sel, based on a round-robin scheme that gives fair treatment to all virtual channels: the last granted virtual channel has the lowest priority in the next arbitration cycle. A small storage element, Last_Grant, stores the last port that was granted access downstream and serves as one input to the arbitration circuitry. Listing 4.3 shows the Verilog code of the round-robin scheme.

Listing 4.3: Output arbitration.

    assign output_en = Grnt[0] | Grnt[1];

    always @(output_en)
        if (output_en) begin                    // At least one grant received
            if (Grnt[0] && ~Grnt[1]) begin
                VC_sel     = VC[0];
                Port_sel   = 0;
                Last_grant = 0;
                retry      = 0;
            end
            else if (~Grnt[0] && Grnt[1]) begin
                VC_sel     = VC[1];
                Port_sel   = 1;
                Last_grant = 1;
                retry      = 0;
            end
            else begin                          // Both grants received
                if (Last_grant == 0) begin
                    VC_sel     = VC[1];
                    Port_sel   = 1;
                    Last_grant = 1;
                    retry_port = 0;
                end
                else begin
                    VC_sel     = VC[0];
                    Port_sel   = 0;
                    Last_grant = 0;
                    retry_port = 1;
                end
                retry = 1;
            end
        end
        else begin                              // Nothing happens; everything remains the same
            retry = 0;
        end

Notice that the logic in Listing 4.3 may look repetitive in several places. We write it this way so that the control flow is clearly expressed, helping readers understand exactly how the circuit works. We rely on the Verilog compiler and the circuit synthesis tool to optimize the actual circuitry produced, which should minimize the number of logic gates used.

For the loser in the arbitration, another request to access the buffer downstream must be made; this is handled by the Buffer Requester module. In fact, there are three sources of input to the Buffer Requester:

1. Two grant signals simultaneously received from the exchange downstream. Given the Arbiter's decision, the Buffer Requester makes another request for the virtual channel being dropped (the discarded grant). The Arbiter produces the signals Retry and Retry_port for this purpose, indicating whether a retry is necessary and, if so, which port to retry.

2. Pending flits in the buffer. There may be multiple flits in the current buffer waiting to be output. The Buffer Requester monitors the buffer status and makes the appropriate request at the proper time, based on the signals Out_port0−N and Empty0−N.

3. An incoming flit to the exchange. After route computation is done for a newly arrived flit, the Buffer Requester issues a request for it based on the signal Output_port.

The priority of the three request types follows the order listed above; in particular, flits already pending in the buffer have higher priority than newly arriving flits, so as not to delay them further. A minimal sketch of this priority logic is given below.
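The sketch assumes one request source of each type is active at a time and a two-bit port encoding; pending and pending_port are hypothetical condensations of the Out_port0−N and Empty0−N monitoring described in item 2, and the other names follow the signals above only loosely.

    module buffer_requester (
        input            Retry,        // source 1: a grant was discarded in arbitration
        input      [1:0] Retry_port,   //   which port to retry
        input            pending,      // source 2: a buffered flit is waiting
        input      [1:0] pending_port, //   its stored output port (hypothetical)
        input            newcomer,     // source 3: a newly arrived flit finished route computation
        input      [1:0] Output_port,
        output reg       req_valid,
        output reg [1:0] req_port      // request sent to the buffer arbiter downstream
    );
        always @(*) begin
            req_valid = Retry | pending | newcomer;
            if (Retry)         req_port = Retry_port;   // highest priority
            else if (pending)  req_port = pending_port; // do not delay buffered flits
            else if (newcomer) req_port = Output_port;
            else               req_port = 2'b00;        // idle
        end
    endmodule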

4.4 Buffer Arbiter

Each buffer in the exchange is associated with an arbiter. Two separate exchanges upstream may attempt to write to the buffer through separate input ports; the arbiter decides which exchange upstream gets access in the next clock cycle. Figure 4.5 shows a high-level view of the arbiter with its primary signals. There are two sources and types of input to the arbiter: Port_req0,1 are the write request signals from the two separate exchanges upstream, and Empty0−N are the status signals from the FIFOs in the buffer indicating which FIFOs are available (empty) to accept a new packet. The arbiter produces two signals for each requesting exchange upstream: Port_grnt0,1 indicate which exchange upstream is granted access, and Port_vc0,1 indicate which virtual channel can be used. If the two requests arrive in the same clock cycle, the arbiter uses the same round-robin scheme as the Output Arbiter module described above to decide who wins; the small storage element Last_Grant remembers the last granted input port.

[Figure 4.5: Buffer arbiter, high-level view.]

4.5 Summary

We have looked at the circuit implementation of the major components of an exchange in some detail. The buffers are constructed with dual-ported FIFOs whose read and write operations are carried out independently. The FIFOs are implemented with registers, which can operate at very high speed. Since each arbiter has only two inputs, the complexity of the circuitry is quite low. With the help of advanced synthesis tools, the circuits can be implemented effectively: optimization automatically removes circuit redundancies expressed by the designer, reducing the logic gate count. As a result, the synthesized circuitry incurs a small delay and does not need a large number of logic gates, which helps reduce router latency, chip area, and power.

Chapter 5

Buffer and Link Utilization Improvement

The Omega router microarchitecture provides opportunities to push router performance further. In this chapter, we propose a feature that improves buffer and link utilization by merging short flits from different packets in a buffer. This allows the network to contain more flits, i.e., to have higher capacity, and pushes the saturation point on the latency-load curves further to the right.

5.1 Motivation

NoC designers take advantage of the abundance of wires available on a silicon die, so the typical link width adopted is quite wide. Also, because transistors are plentiful, they match the buffer width to the link width to simplify the circuit design. However, the majority of packets traversing the network are short, so they do not fill the entire buffer and link width. For example, with the wide buffer/link width of 128 bits that we use in our experiments, shown in Figure 1.4, the buffer/link utilization can be quite low: less than 50%. This means that, most of the time, much of the buffer width holds no useful information and much of the link width transfers none. As mentioned in Section 1.3, improving buffer and link utilization was the original impetus for devising the Omega router microarchitecture, for we found it very difficult to add a merging scheme to the conventional router design. With the buffers located at the back end of an exchange, as in the Omega router, merging flits becomes practical.

5.2 Microarchitecture Enhancement

In an Omega router, packets entering the same buffer will have to leave the exchange through the same link anyway. This may not be the case in the conventional router design, where the crossbar switch can divert them to different exit points. Thus, in the Omega router, we can use the “empty” spaces in a buffer without any concern about different exit points.

To demonstrate this enhancement to the Omega router microarchitecture more easily, we assume that the buffer is 128 bits wide and that short packets are at most 32 bits long. Recall that a buffer contains a number of virtual channels (see Section 4.2), each implemented as a 128-bit-wide FIFO. Short packets consist of a single flit, so one entry in the FIFO can hold up to four short flits together.

To make the merging scheme more effective, we also allow merging with multi-flit packets in certain situations. In those situations, we assume that the head flit of a packet holds only a few pieces of information, so that it can be treated as a short flit. However, because a FIFO is filled serially, if a multi-flit packet enters a FIFO first, its head flit cannot merge with incoming short flits; the body flits already in the FIFO block the short flits from reaching it. As a result, a multi-flit packet can merge with a short packet, but not vice versa.

We make two small enhancements to the Omega router microarchitecture to facilitate merging flits in the FIFO (i.e., in the same virtual channel): one performs the merging, and the other performs splitting, since merged packets need to be split up when their routes start to diverge in a router.

5.2.1 Merging

We add some switching circuitry at the input of the FIFO to facilitate merging; essentially, we augment a standard FIFO design with additional switches. We partition the FIFO into four parallel, narrower FIFOs, as shown in Figure 5.1, so that a short flit can be written into part of the buffer without affecting the flits already residing there. In the figure, Flit_seg0 denotes the most significant segment of a flit, where real data are stored first. During merging, the most significant unoccupied segment is always filled first. Three muxes redirect incoming flit segments to different parts of the buffer.

[Figure 5.1: Datapath of merging a flit at the input of a virtual channel. A 128-bit-wide virtual channel is organized as four segments, or FIFOs, each 32 bits wide. An incoming short flit can be directed into different segments of the virtual channel. The subscripts of the FIFOs indicate the bit range of each segment.]

Table 5.1 shows where the segments of an incoming flit can be relocated in the FIFO when merging takes place. The position of an incoming segment is shifted according to how many flits are already in the buffer entry. For example, incoming segment Flit_seg0 is mapped to Segment 0 in the buffer if that segment is not yet occupied, and to Segment 2 if the entry already holds two flits. Notice that Flit_seg3 can only go to Segment 3: if it carries data, the incoming flit already has four short flits merged into it, so the entire flit must enter the buffer unchanged. A sketch of this redirect datapath follows Table 5.1.

Table 5.1: Flit segment redirecting during merging.

    Incoming segment    Possible segment in buffer
    0                   0, 1, 2, 3
    1                   1, 2, 3
    2                   2, 3
    3                   3
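As mentioned above, the redirect logic of Figure 5.1 and Table 5.1 amounts to shifting the incoming segments down by the number of short flits already occupying the target entry. The following is a minimal sketch of that datapath, assuming a two-bit occupancy count; the names are illustrative, and the per-segment write enables that select which narrow FIFOs actually latch new data are omitted.

    module merge_redirect (
        input      [1:0]   occ,      // short flits already in the target FIFO entry
        input      [127:0] flit_in,  // incoming flit; segment 0 is bits [127:96]
        output reg [127:0] seg_wr    // data presented to the four 32-bit FIFOs
    );
        wire [31:0] seg0 = flit_in[127:96];
        wire [31:0] seg1 = flit_in[95:64];
        wire [31:0] seg2 = flit_in[63:32];
        wire [31:0] seg3 = flit_in[31:0];

        // Shift the incoming segments down by occ positions (Table 5.1).
        always @(*)
            case (occ)
                2'd0: seg_wr = flit_in;                   // entry empty: flit unchanged
                2'd1: seg_wr = {32'b0, seg0, seg1, seg2}; // segment 0 lands in Segment 1
                2'd2: seg_wr = {64'b0, seg0, seg1};       // segment 0 lands in Segment 2
                2'd3: seg_wr = {96'b0, seg0};             // segment 0 lands in Segment 3
            endcase
    endmodule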

5.2.2 Splitting

After route computation is done on a merged flit, if not all exit ports are the same, i.e., the routes of the embedded flits start to diverge, the merged flit needs to be split up so that those flits can go to different buffers. The splitting is done before the mux in front of the buffer (see Figure 3.5). We place two sets of muxes to split up a merged flit and direct the resulting flits to different buffers, as shown in Figure 5.2; two sets are needed because two different flits may occupy the same segment position after splitting, but in different buffers. Table 5.2 shows where the segments of an incoming flit can be relocated when splitting takes place. The position of an incoming segment may be shifted according to how many flits are separated out. For example, incoming segment Flit_seg1 is mapped to output Segment 0 if it goes to a buffer different from Flit_seg0's; however, it stays in Segment 1 if it does not need to be separated from Flit_seg0. Notice that Flit_seg0 always goes to Segment 0 in the output because it is the most significant segment; it retains the same segment position no matter which buffer it enters. A sketch of this realignment follows the figure below.

[Figure 5.2: Datapath of splitting a flit at the front end of an exchange. The red dashed box encloses the two sets of muxes required to do the splitting.]
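The realignment of Table 5.2 is the mirror image of the merging shift: the segments departing for a different buffer are shifted up so that the first of them lands in output Segment 0. A minimal sketch of one such realignment path follows, assuming a two-bit shift amount derived from how many leading segments stay behind; the names are illustrative.

    module split_redirect (
        input      [1:0]   shift,    // number of leading segments that stay behind
        input      [127:0] flit_in,  // merged flit; segment 0 is bits [127:96]
        output reg [127:0] flit_out  // departing segments, realigned to Segment 0
    );
        wire [31:0] seg1 = flit_in[95:64];
        wire [31:0] seg2 = flit_in[63:32];
        wire [31:0] seg3 = flit_in[31:0];

        // Shift the departing segments up by shift positions (Table 5.2).
        always @(*)
            case (shift)
                2'd0: flit_out = flit_in;                   // no split needed
                2'd1: flit_out = {seg1, seg2, seg3, 32'b0}; // segment 1 becomes Segment 0
                2'd2: flit_out = {seg2, seg3, 64'b0};       // segment 2 becomes Segment 0
                2'd3: flit_out = {seg3, 96'b0};             // segment 3 becomes Segment 0
            endcase
    endmodule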

5.3 Evaluations

We use essentially the same evaluation methodology as for the Omega microarchitecture, described in Section 3.3. We enhance the Booksim simulator with the new merging feature and use synthetic uniform traffic for simulations. The baseline router configuration is base-1-8 (as in Section 3.4). The flit-merging feature is implemented on top of the Omega router omega-2 (also as in Section 3.4); we denote this new router design omega-2-m (m stands for merge). Table 5.3 shows the primary network parameters used in simulations. We assume that short packets are 32 bits long, so one entry in a virtual channel can hold up to four short packets.

Table 5.2: Flit segment redirecting during splitting.

    Incoming segment    Possible reassigned output segment
    0                   0
    1                   0, 1
    2                   0, 1, 2
    3                   0, 1, 2, 3

Table 5.3: Network parameters for evaluating flit merging.

    Parameter                     Value
    Dimensions                    8 × 8
    Packet length                 one flit
    Buffer width                  128 bits
    Link width                    128 bits
    Number of virtual channels    2, 8
    Virtual channel depth         8 flits
    Routing algorithm             DOR

We enhance the circuitry of the Omega router omega-2 with the flit-merging and flit-splitting logic to implement omega-2-m. We use the same Cadence tools and methodology described in Section 3.3.2 to measure the area, power consumption, and critical path delay of omega-2-m.

5.4 Results and Analysis

5.4.1 Network Latency

Figure 5.3 shows the simulation results up to saturation, and Figure 5.4 shows the same results over the entire injection-rate range. The no-load latency of omega-2-m is the same as that of omega-2, as indicated by the horizontal parts of the two curves coinciding at light traffic loads. This is expected because flit merging does not affect latency under those conditions: the probability of flits meeting up in the router is actually quite low, as illustrated in Figure 5.5. Below an injection rate of 0.2, the percentage of flits that merge with others in the router is essentially zero, and the percentage only gradually increases as the injection rate approaches the saturation point. Therefore, there is really not much, if any, merging going on at light traffic loads.

5.4.2 Network Saturation

In Figure 5.3, the vertical part of the omega-2-m curve is clearly separated from the other two; that is, with the merging feature the network saturates at a much later point. Table 5.4 shows that omega-2 is only 6% worse in saturation than base-1-8 even though it has substantially (25%) fewer buffers. With the merging feature, saturation improves dramatically: omega-2-m is 41% better than base-1-8. The merging feature more than overcomes the shortage in buffering.

In theory, the network capacity of omega-2-m is four times that of base-1-8, so one might expect the gain in saturation to be even greater than the measured one. However, the relationship between saturation and network capacity is not linear; otherwise, omega-2-m would have reached a saturation point of around 1.28 (0.32 × 4), which is theoretically impossible, as the load on the network cannot exceed 1, the maximum injection rate of one flit per cycle. Moreover, no matter how heavy the traffic load is, it is not always the case that the entire buffer width can be packed with four flits: as illustrated in Figure 5.5, no more than 6% of flits are merged even at high traffic loads, and even beyond the saturation point. One probable reason for such a low level of merging is that when a flit is ready to move forward toward its destination, it cannot afford to wait for other flits to merge with it. More work is needed to better understand the correlation between network capacity and saturation.

[Figure 5.3: Average network latencies from synthetic uniform traffic, up to saturation. Keys: base-1-8: baseline router with 8 VCs/buffer; omega-2: Omega router with 2 VCs/buffer; omega-2-m: same as omega-2 with the flit-merging feature.]

5.4.3 Network Throughput

Figure 5.6 shows the throughput of the three router designs. As in Section 3.5.3, omega-2-m maintains the same input and output throughputs prior to the saturation points. This is expected because the merging feature adds no benefit when the traffic load is low. When the traffic load is high, similar to the effect of the merging feature on saturation, we also observe a significant improvement in throughput: after the saturation point, the throughput of both base-1-8 and omega-2 is about 0.325, while that of omega-2-m is about 0.45. The merging feature therefore brings an improvement of 38.5%.

[Figure 5.4: Average network latencies from synthetic uniform traffic, full range. Same as Figure 5.3 but over the entire range of packet injection rates; the vertical axis is in log scale.]

5.4.4 Area, Power and Critical Path Delay

Table 5.5 shows the measurements from Cadence's circuit synthesis tool. Data for base-1-8 and omega-2 are reproduced here from Tables 3.5 and 3.11 for comparison purposes. The critical path delays are computed from slacks in the same way as in Section 3.5.5. Even with the added merging and splitting circuitry, omega-2-m still performs better than the baseline design base-1-8 in all metrics.

[Figure 5.5: Flit-merging percentage in the router.]

The enhancements of omega-2-m over omega-2 certainly add overhead in terms of area, power, and delay; Table 5.6 shows the percentage increases. The biggest increase is in area. As shown in Table 5.3, the normal flit width is 128 bits. The buffer width of omega-2 is 135 bits: 128 bits of flit payload plus 7 bits of routing control. The buffer width of omega-2-m, however, increases drastically to 160 bits, because each 32-bit segment must now hold the routing control information for its own short flit: 32 bits of short-flit payload plus 8 bits of routing control per segment, or 4 × (32 + 8) = 160 bits in total. The buffer width therefore grows by 18.5% (from 135 bits to 160). Furthermore, there is a large amount of switching logic and wiring in the merging and splitting circuitry. All these factors contribute to a large increase in circuit area.

Table 5.4: Network saturation comparisons. Note: data are computed from Figure 5.3; the change in the number of virtual channels (VCs) relative to the baseline design (base-1-8) is listed for reference.

                              base-1-8    omega-2    omega-2-m
    Saturation                0.34        0.32       0.48
    Normalized saturation     1           0.94       1.41
    % VC increase             0%          -25%       -25%

Table 5.5: Router area, power consumption, and slack. Note: critical path delays are computed from slacks.

                                  base-1-8       omega-2        omega-2-m
    Area (μm²)                    746,032        488,890        686,445
    Power (nW)                    183,919,185    134,561,900    149,292,339
    Slack (ps)                    257            84             56
    Critical path delay (ps)      743            416            444

The power increase is smaller than the area increase, probably because the large amount of additional wiring consumes little power. The increase in critical path delay is moderate, considering that all flits now pass through two extra layers of logic for merging and splitting.

Table 5.6: Increases in area, power and critical path delay of omega-2-m over omega-2.

                            Percentage of increase
    Area                    40.4%
    Power                   10.9%
    Critical path delay     6.7%

[Figure 5.6: Average packet throughputs from synthetic uniform traffic.]

5.5 Summary

In this chapter, we propose a feature that is well suited to the Omega router microarchitecture. Each exchange has to deal with only two input sources, and the buffer is located at the back end of the module. These factors make the flit-merging and flit-splitting features quite simple to implement, without imposing significant additional delay. Even with far fewer buffer resources, the new router feature increases the effective network capacity substantially compared to the conventional router design, while the area and power consumption remain lower than the conventional design's. As a result, the performance of the Omega router is pushed even further.

Chapter 6

Conclusion

We propose the novel Omega router microarchitecture for networks-on-chip. It is highly modular and a much simpler design than the conventional router microarchitecture. It adopts a ring-topology network in the router. The Omega router consists of five identical exchanges (network nodes) in the ring network.

Omega targets the prevailing 2-D mesh on-chip networks. The arrangement of exchanges in the ring network is optimized for the popular dimension-order routing algorithm. Packet loopback is not allowed except at the core exchange. Because of these additional constraints, the design of the exchange is quite simple. There is no large crossbar switch, which in the conventional design consumes considerable area and power. Packets can traverse an exchange in one clock cycle and the entire router in two cycles.

We use a cycle-accurate network-on-chip simulator to validate the Omega router design. We also use commercial VLSI design tools to estimate accurately the power and area consumptions of the router circuitry. Experimental results show the superior performance of Omega over the conventional design on latency, area and power, while using fewer resources.

The buffers of an Omega router are located at the back end of a module, opposite to where they are found in the conventional design. This property provides new opportunities to improve router performance, which we demonstrate with one case study: short control flits heading in the same direction can be merged in the router to make use of idle bandwidth. We enhance the Omega design with a feature that allows multiple short flits to share the same buffer space, which greatly increases the effective capacity of the network. Additional enhancements can be made to the new microarchitecture to push the performance even further.

The future of the Omega microarchitecture looks quite promising. We believe Omega can have a significant impact on CMP designs, owing to its large reductions in latency, area, and power over the conventional router design. The improvement in latency would benefit high-performance computing applications like weather forecasting, circuit simulation, image recognition, and data analytics, where compute performance is paramount; it would speed up communication among processors and reduce memory access latency. The improvements in area and power would especially benefit embedded applications like mobile devices, where area and power are the top factors in product design; new products could be made smaller and use less battery power.

Future Work

This dissertation is an initial effort on the Omega microarchitecture; there is still much more to do. We would like to learn more about Omega, to better characterize it in different aspects. We can implement new features to push its performance further.

In our synthetic traffic simulations, we use a fixed packet size, but in reality the size varies with the message type a packet carries. For example, a packet may carry an entire cache line, which can be 64 bytes long; that involves five flits, each 128 bits long. We can run more simulations with different packet sizes to learn more about the performance of Omega under those conditions.

As the number of processing cores integrated on a single die steadily increases, the NoC size is expected to increase accordingly. We should study the performance of Omega on larger network sizes, like 16 × 16 and even 32 × 32.

Our simulations focus on measuring network latency, which is probably the most important performance metric. Another closely related metric is throughput. We have confirmed that Omega performs optimally, at the same level as the conventional design, at light traffic loads. At high traffic loads, however, the performance appears to be quite inconsistent across a wide range of traffic patterns. We should investigate how traffic patterns affect throughput under such conditions. The stability of the network should also be studied, so we can learn how it behaves under adverse or abnormal conditions.

Omega provides additional path diversity in the router, which is absent in the conventional architecture. The injection of a flit into the core exchange can take either a clockwise or a counter-clockwise path; the current design always takes the clockwise path. We can make this flexible, so that either path can be taken depending on network or router conditions, which should improve latency and saturation. Furthermore, if the disjoint at the core exchange can be eliminated, this path diversity becomes available at all entry ports in the router. Fault tolerance can be built on that property, allowing flits to be routed around non-functioning components or broken links. The Omega design facilitates adaptive routing algorithms and can make the network more fault tolerant.

A simple round-robin algorithm is currently used in arbitration. More sophisticated algorithms can probably be adopted, which could help defer the saturation point. More sophisticated and popular schemes, like datapath-bypassing logic and shared buffer management, can be considered to further improve performance. The current design takes a minimum of two clock cycles to traverse the router; it is conceivable that a more clever design could achieve the ideal one-cycle time.

The probability of flit merging at high traffic loads seems low; we should investigate why that is the case. New routing algorithms can be devised to promote and maintain merging as flits traverse the network, keeping them merged as long as possible for better performance. The combined buffering overhead in handling short flits seems high. To reduce it, we can devise new encoding schemes for the routing control information and adopt suitable advanced buffer management techniques; some form of dynamic buffer management will probably help.

Bibliography

[1] W. T. Wu and A. Louri, “A methodology for cognitive NoC design,” IEEE Computer Architecture Letters, vol. 15, no. 1, pp. 1–4, Jan. 2016.

[2] K. Olukotun and L. Hammond, “The future of microprocessors,” Queue, vol. 3, no. 7, pp. 26–29, Sep. 2005. [Online]. Available: http://doi.acm.org/10.1145/1095408.1095418

[3] K. Rupp, “42 years of microprocessor trend data,” February 2018. [Online]. Available: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/

[4] B. Keeth, R. J. Baker, B. Johnson, and F. Lin, DRAM circuit design: fundamental and high-speed topics. John Wiley & Sons, 2007, vol. 13.

[5] B. Jacob, S. Ng, and D. Wang, Memory systems: cache, DRAM, disk. Morgan Kaufmann, 2010.

[6] A. Olofsson, “Epiphany-V: A 1024-core 64-bit RISC processor,” October 2016. [Online]. Available: https://www.parallella.org/2016/10/05/epiphany-v-a-1024-core-64-bit-risc-processor

[7] B. Bohnenstiehl, A. Stillmaker, J. Pimentel, T. Andreas, B. Liu, A. Tran, E. Adeagbo, and B. Baas, “A 5.8 pJ/op 115 billion ops/sec, to 1.78 trillion ops/sec 32nm 1000-processor array,” in 2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits), June 2016, pp. 1–2.

[8] “Pezy super computer,” December 2017. [Online]. Available: https://en.wikichip.org/wiki/pezy/pezy-scx/pezy-sc

[9] E. Howell. (2017, June) Intel Skylake-X vs AMD Threadripper: Die size comparison. [Online]. Available: http://digiworthy.com/2017/06/22/intel-skylake-x-vs-threadripper-die-size

[10] (2018, May) CPU socket. [Online]. Available: https://en.wikipedia.org/wiki/CPU_socket

[11] S. Kumar, A. Jantsch, J. P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, and A. Hemani, “A network on chip architecture and design methodology,” in Proceedings IEEE Computer Society Annual Symposium on VLSI. New Paradigms for VLSI Systems Design. ISVLSI 2002, 2002, pp. 105–112.

[12] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., “The gem5 simulator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.

[13] J. M. Rabaey, A. P. Chandrakasan, and B. Nikolic, Digital integrated circuits. Prentice Hall, 2002, vol. 2.

[14] J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach. Elsevier, 2011.

[15] W. J. Dally and B. Towles, “Route packets, not wires: On-chip interconnection net- works,” in Design Automation Conference, 2001. Proceedings. IEEE, 2001, pp. 684– 689.

[16] A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi, “Orion 2.0: A fast and accurate NoC power and area model for early-stage design space exploration,” in Proceedings of the Conference on Design, Automation and Test in Europe. European Design and Automation Association, 2009, pp. 423–428.

[17] V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel, and F. Baez, “Reducing power in high-performance microprocessors,” in Proceedings of the 35th annual Design Automation Conference. ACM, 1998, pp. 732–737.

[18] A. Hemani, T. Meincke, S. Kumar, A. Postula, T. Olsson, P. Nilsson, J. Oberg, P. Ellervee, and D. Lundqvist, “Lowering power consumption in clock by using globally asynchronous locally synchronous design style,” in Proceedings of the 36th annual ACM/IEEE Design Automation Conference. ACM, 1999, pp. 873–878.

[19] W. J. Dally and B. P. Towles, Principles and practices of interconnection networks. Elsevier, 2004.

[20] N. E. Jerger and L.-S. Peh, “On-chip networks,” Synthesis Lectures on Computer Architecture, vol. 4, no. 1, pp. 1–141, 2009.

[21] J. Duato, S. Yalamanchili, and L. M. Ni, Interconnection networks: an engineering approach. Morgan Kaufmann, 2003.

[22] J. D. Owens, W. J. Dally, R. Ho, D. Jayasimha, S. W. Keckler, and L.-S. Peh, “Research challenges for on-chip interconnection networks,” IEEE micro, vol. 27, no. 5, pp. 96–108, 2007.

[23] J. Kim, “Low-cost router microarchitecture for on-chip networks,” in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2009, pp. 255–266.

[24] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Oberg, M. Millberg, and D. Lindqvist, “Network on chip: An architecture for billion transistor era,” in Proceeding of the IEEE NorChip Conference, vol. 31, 2000, p. 11.

[25] L. Benini and G. De Micheli, “Networks on chips: A new SoC paradigm,” Computer, vol. 35, no. 1, pp. 70–78, 2002.

[26] N. Jiang, J. Balfour, D. U. Becker, B. Towles, W. J. Dally, G. Michelogiannakis, and J. Kim, “A detailed and flexible cycle-accurate network-on-chip simulator,” in Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on. IEEE, 2013, pp. 86–96.

[27] D. U. Becker, “Efficient microarchitecture for network-on-chip routers,” Ph.D. dissertation, Stanford University, 2012.

[28] J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal, “Graphite: A distributed parallel simulator for multicores,” in High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on. IEEE, 2010, pp. 1–12.

[29] [Online]. Available: https://www.cadence.com/content/cadence-www/global/en_US/home/tools/digital-design-and-signoff/synthesis/genus-synthesis-solution.html

[30] [Online]. Available: https://www.cadence.com/content/cadence-www/global/en_US/home/tools/system-design-and-verification/simulation-and-testbench-verification/incisive-enterprise-simulator.html

[31] A. Tanenbaum and D. Wetherall, Computer networks, 5th ed. Pearson Prentice Hall, 2011.

[32] J. Hestness, B. Grot, and S. W. Keckler, “Netrace: dependency-driven trace-based network-on-chip simulation,” in Proceedings of the Third International Workshop on Network on Chip Architectures. ACM, 2010, pp. 31–36.

[33] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark suite: Characterization and architectural implications,” in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2008, pp. 72–81.

[34] G. F. Pfister, “An introduction to the infiniband architecture,” High Performance Mass Storage and Parallel I/O, vol. 42, pp. 617–632, 2001.

[35] C. Sicam, R. Baclit, P. Membrey, and J. Newbigin, Foundations of CentOS Linux: Enterprise Linux On the Cheap. Springer, 2010.

[36] [Online]. Available: https://pbsworks.com/PBSProduct.aspx?n=PBS-Professional&c=Overview-and-Capabilities

[37] L.-S. Peh and W. J. Dally, “A delay model and speculative architecture for pipelined routers,” in High-Performance Computer Architecture, 2001. HPCA. The Seventh International Symposium on. IEEE, 2001, pp. 255–266.

[38] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha, “Express virtual channels: towards the ideal interconnection fabric,” in ACM SIGARCH Computer Architecture News, vol. 35, no. 2. ACM, 2007, pp. 150–161.

[39] N. E. Jerger, L.-S. Peh, and M. Lipasti, “Virtual circuit tree multicasting: A case for on-chip hardware multicast support,” in Computer Architecture, 2008. ISCA’08. 35th International Symposium on. IEEE, 2008, pp. 229–240.

[40] T. Krishna and L.-S. Peh, “Single-cycle collective communication over a shared network fabric,” in Networks-on-Chip (NoCS), 2014 Eighth IEEE/ACM International Symposium on. IEEE, 2014, pp. 1–8.

[41] B. K. Daya, L.-S. Peh, and A. P. Chandrakasan, “Quest for high-performance bufferless nocs with single-cycle express paths and self-learning throttling,” in Proceedings of the 53rd Annual Design Automation Conference. ACM, 2016, p. 36.

[42] P. Abad, V. Puente, J. A. Gregorio, and P. Prieto, “Rotary router: An efficient architecture for CMP interconnection networks,” ACM SIGARCH Computer Architecture News, vol. 35, no. 2, pp. 116–125, 2007.

[43] V. Puente, R. Beivide, J. A. Gregorio, J. Prellezo, J. Duato, and C. Izu, “Adaptive bubble router: A design to improve performance in torus networks,” in Parallel Processing, 1999. Proceedings. 1999 International Conference on. IEEE, 1999, pp. 58–67.