Designing Low Power and High Performance Network-on-Chip Communication Architectures for Nanometer SoCs
DISSERTATION
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By
Gursharan Reehal, B.S., M.S.
Graduate Program in Electrical and Computer Engineering
The Ohio State University
2012
Dissertation Committee:
Prof. Mohammed Ismail El-Naggar, Advisor
Prof. Steve Bibyk
Prof. Joanne DeGroat

© Copyright by
Gursharan Reehal
2012

ABSTRACT
Network-on-Chip (NoC) communication architectures have been recognized as the most scalable and efficient solution to on-chip communication challenges in the multi-core era. Diverse, demanding applications, coupled with the ability to integrate billions of transistors on a single chip, are among the main driving forces pushing performance requirements toward the level of several tens to over a hundred cores per chip, with aggregate performance exceeding one trillion operations per second. Such tera-scale many-core processors will be highly integrated System-on-Chip (SoC) designs containing a variety of on-chip storage elements, memory controllers, and input/output (I/O) functional blocks. Small-scale multicore processors have so far been a great commercial success and have found applicability in high-bandwidth, compute-intensive applications, including high-performance, throughput-oriented scientific computing, high-performance graphics and 3-D immersive visual interfaces, as well as decision and support systems. Systems using multi-core processors are now the norm rather than the exception.
As the number of cores or components integrated into a single system keeps increasing, the design of the on-chip communication architecture is becoming more challenging. The increasing number of components in a system translates into more inter-component communication that must be handled by the on-chip communication infrastructure. It is not surprising to see leading-edge design teams searching for better solutions as multi-core SoCs continue to evolve. Future System-on-Chip (SoC) designs require predictable, scalable, and reusable on-chip communication architectures to increase reliability and productivity. Current bus-based interconnect architectures are inherently non-scalable, less adaptable for reuse, and their reliability decreases with system size.
NoC communication guarantees scalability and high-speed, high-bandwidth communication with minimal wiring overhead and routing issues. NoCs are layered, packet-based on-chip communication networks integrated onto a single chip, and their operation is based on the operating principles of macro networks. A NoC consists of resources and switches that are directly connected in such a way that resources are able to communicate with each other by sending messages. The proficiency of a NoC in meeting its design goals and budget requirements for the target application depends on its design. Often, these design goals conflict and trade off with each other. The multi-dimensional pull of design constraints, in addition to technology scaling, complicates the process of NoC design in many aspects, as NoCs are expected to support high performance and reliability along with low cost, smaller area, shorter time-to-market, and lower power consumption. To aid the process, this research presents design methodologies to achieve low power and high performance NoC communication architectures for nanometer SoCs.
In NoCs, interconnects play a crucial role in overall system performance and can have a large impact on the total power consumption, wiring area, and achievable system performance. The effect of technology scaling on NoC interconnects is studied, and an improved design flow is presented. The influence of technology node, die size, and number of components on the power consumed by the NoC interconnects is analyzed.
The success of a NoC heavily depends on its power budget. As CMOS technology continues to scale, power-aware design is more important than ever before, especially for designs targeted at low power applications; in large scale NoCs, however, the power consumption can increase beyond acceptable limits. Designing a low power NoC is therefore extremely important, especially for larger SoC designs. The elevation of power to a first-class design constraint requires that power estimations be done at the same time as the performance studies in the design flow. In a NoC, one method to achieve a power-aware design is to consider the impact of the architectural choices on power in the early stages of the design process. In this research, an efficient design methodology based on layout and power models is presented to obtain rough power estimates in the early stages of the design cycle. The impact of die size and number of IPs on the power consumed by different NoC architectures is evaluated.
Additionally, as multi-core SoCs continue to evolve, Globally Asynchronous Locally Synchronous (GALS) design techniques have been suggested as a potential solution in larger and faster SoC designs to avoid the problems related to synchronization and clock skew. These multi-core SoCs will operate using the GALS paradigm, where each core can operate in a separate clock domain. In this research, a study of the power efficiency of Synchronous versus Asynchronous NoC architectures is presented. The Asynchronous NoC architecture is shown to consume less power when the activity factor of data transfers between two switches is within a certain range. Asynchronous designs are more power efficient, as the need for clock distribution is eliminated.
Dedicated to my mother, for her love, support and encouragement. . .
ACKNOWLEDGMENTS
I would like to thank the many remarkable people who have supported and encouraged me during my time at The Ohio State University. First and foremost, I would like to thank my advisor, Prof. Mohammed Ismail, for his continued guidance, support, and sustained encouragement throughout my graduate study. He always encouraged me to think independently and to define my own research agenda, allowing me to develop the skills necessary to do research. His significant effort in bringing Synopsys software, an industry-standard EDA tool, to research in the area of digital VLSI, thereby enriching the quality of education and the student research experience here at The Ohio State University, is highly remarkable. Without his commitment and encouragement, this dissertation would not have been possible. The experience with Prof. Ismail will always be highly regarded.
I would like to thank Prof. Steve Bibyk for his early guidance, mentorship, encouragement, and support in pursuing the PhD program. I am thankful for his valuable discussions throughout my time in the ECE department and for being my MS advisor. It truly has been a great experience working with him. I am also very grateful to Prof. Joanne DeGroat for her support, guidance, and for kindly serving on my PhD exam committees. I am thankful to her for treating me as a member of her own group.
I am thankful for the opportunity to do research with Mohamed Abd Ghany, who is affiliated with the German University in Cairo, Egypt. He has been a very helpful friend and a great source of guidance in the work on Asynchronous NoCs. I am grateful for his insight into asynchronous digital design and for his support. I enjoyed working with him.
I am honored to call myself a member of the VLSI Lab. I want to thank my fellow colleagues Amneh Akour, Sleiman Bou Sleiman, John Hu, Sidharth Balasubramanian, Yiqiao Lin, Feiran Lei, and Samantha Yoder for their friendship, discussions, and valuable guidance. In particular, I would like to thank Amneh Akour, who has been a great source of sage advice at the times when I needed it most. It truly has been a privilege for me to work with them in the VLSI lab, and I have always enjoyed and cherished their company.
I would also like to thank many people in the ECE department: Stephanie Muldrow, Carol Duhigg, Tricia Toothman, Vincent Juodvalkis, Aaron Aufderheide, Don Gibb, and Edwin Lim, who work very hard and diligently behind the scenes to make this department a wonderful place for graduate students. Their friendly and helpful nature eases the stress of graduate life, and they are always willing to help and guide the students in our department.
Life as a graduate student would not be possible without the help of family and friends. My deepest regards go to my mother, to whom I dedicate this work. She always encouraged and supported me in pursuing higher education. She worked very hard to make sure education was a priority. She always emphasized the importance of education, and encouraged me to pursue the PhD program. She has been a great source of inspiration in my life and education. I am deeply thankful to her for her endless love, care, support, wisdom...and basically everything. Your love and faith in me has made all the difference...love you Mom!
My final thanks goes to the reason I am here and able to do this work, my creator.
I am thankful for the talent and opportunities I have been given and the strength to undertake this task and see it to completion!
VITA
1996 ...... B.S. Electrical Engineering
1998 ...... M.S. Electrical Engineering
2007-2009 ...... Graduate Teaching Associate, The Ohio State University
2010 ...... Graduate Technical Intern, Intel Corporation
2010 ...... Graduate Technical Research Intern, Intel Labs
FIELDS OF STUDY
Major Field: Electrical and Computer Engineering
TABLE OF CONTENTS
Page
Abstract...... ii
Dedication...... v
Acknowledgments...... vi
Vita...... ix
List of Tables...... xiii
List of Figures...... xv
Chapters:
1. Introduction...... 1
1.1 Bus Based On-Chip Communication Architectures...... 3
1.1.1 AMBA Bus...... 4
1.1.2 CoreConnect Bus...... 5
1.1.3 WishBone Bus...... 7
1.1.4 SiliconBackplane MicroNetwork...... 8
1.2 Limitations of the Bus based Communication Approach...... 9
1.3 Why NoC ?...... 12
1.4 NoC Design Considerations and Challenges...... 15
1.5 Organization of this Thesis...... 21
2. NoC Overview : Architecture, Performance and Cost...... 22
2.1 NoC Building Blocks...... 23
2.1.1 Network Interfaces...... 24
2.1.2 Switches...... 24
2.1.3 Links...... 25
2.2 NoC Architectures...... 25
2.2.1 CLICHÉ...... 27
2.2.2 TORUS...... 28
2.2.3 BFT...... 29
2.2.4 SPIN...... 30
2.2.5 OCTAGON...... 30
2.3 NoC Flow Control Protocols...... 31
2.4 NoC Switching Techniques...... 33
2.5 NoC Routing...... 36
2.6 NoC Performance and Cost...... 38
2.6.1 NoC Power Dissipation...... 39
2.6.2 NoC Area Overhead...... 40
2.6.3 NoC Message Latency...... 40
2.6.4 NoC Throughput...... 41
2.7 High-level Physical Characteristics of NoC Architectures...... 43
2.8 NoC Design Flow...... 45
2.9 Summary...... 47
3. NoC Router Architecture Design and Cost...... 48
3.1 Main Parts of NoC Router...... 49
3.1.1 Input/Output Ports...... 50
3.1.2 Virtual Channels...... 51
3.1.3 Buffers...... 52
3.1.4 Crossbar Logic...... 52
3.1.5 Input/Output Arbiter...... 54
3.1.6 Control Logic...... 57
3.2 Packet Format...... 59
3.3 NoC Router Design and Cost...... 60
3.3.1 Router Design-I...... 61
3.3.2 Router Design-II using ASIC Design Flow...... 70
3.4 Summary...... 80
4. High Performance NoC Interconnects...... 81
4.1 NoC Interconnects...... 84
4.1.1 Performance Optimization Using Intrinsic RC Model...... 86
4.1.2 Performance Optimization using Repeater Insertion...... 92
4.2 NoC Power Consumption In Physical Links...... 95
4.3 A Layout-Aware NoC Design Methodology...... 97
4.4 Summary...... 97
5. Layout Aware NoC Design Methodology...... 99
5.1 CMOS Power Dissipation...... 100
5.2 Power Analysis for NoC-based Systems...... 103
5.2.1 Cliche Architecture Power Model...... 104
5.2.2 BFT Architecture Power Model...... 105
5.2.3 SPIN Architecture Power Model...... 107
5.2.4 Octagon Architecture Power Model...... 108
5.3 IP Based Design Methodology for NoC...... 110
5.4 Network Power Analysis...... 112
5.5 Summary...... 117
6. Power Efficient Asynchronous Network on Chip Architecture...... 118
6.1 Asynchronous NoC Architecture...... 120
6.2 Synchronous NoC Architecture...... 126
6.3 Power Dissipation...... 128
6.4 Simulation Results...... 131
6.5 Comparison...... 135
6.6 Summary...... 140
7. Conclusion and Future Work...... 142
7.1 Thesis Summary and Contributions...... 143
7.2 Future Work...... 147
Appendices:
A. NoC Examples...... 150
B. Abbreviation...... 162
Bibliography...... 164
LIST OF TABLES
Table Page
1.1 Bus Architecture Specification [1]...... 10
3.1 Percentage Increase in Throughput for Different NoC Architectures. 65
3.2 Power Consumption for Different NoC Architectures...... 67
3.3 Power Reduction Per Component using Sleep Transistors...... 69
3.4 Power Reduction of a Switch for Different NoC Architectures using Sleep Transistors...... 69
3.5 Input and Output Ports of NoC Router...... 70
3.6 Power Overhead of Routers for Different NoC Architectures in RVT Process (f = 200 MHz, α = 0.1)...... 75
3.7 Power Consumption of 4-, 5-, 6-, 7-, and 8-Port NoC Routers at Various Operating Frequencies...... 76
3.8 A Guide for Leakage Power Considerations...... 78
3.9 Power Dissipation for a Network of 64 IPs at 200 MHz and α = 0.1...... 79
4.1 Technology and Circuit Model Parameters from ITRS Reports (2001-2010)...... 82
4.2 Bulk Resistivity of Pure Metals at 22 °C...... 88
4.3 Relative Permittivity εr of Some Dielectric Materials...... 91
4.4 Interconnect Power and Area Consumption: Intrinsic Case, f = 400 MHz and α = 1...... 96
4.5 Interconnect Power and Area Consumption: Width and Space Optimization, f = 400 MHz and α = 1...... 96
4.6 Total Power and Area Consumption, f = 400 MHz and α = 1...... 96
5.1 Power Consumption for 16 IPs...... 115
5.2 Power Consumption for 64 IPs...... 115
5.3 Power Consumption for 256 IPs...... 115
6.1 Total Metal Resources Required for BFT Architecture...... 135
6.2 Power Consumption For BFT Architecture...... 136
6.3 Power Consumption For Cliche Architecture...... 139
6.4 Power Consumption For Octagon Architecture...... 139
6.5 Power Consumption For SPIN Architecture...... 139
6.6 Total metal resources...... 140
A.1 Intel’s 80-Core Tera Scale Processor Specifications...... 152
A.2 Intel’s 48-Core Single-Chip Cloud Computer Processor Data..... 153
A.3 Tilera’s Multicore Processor Data...... 155
LIST OF FIGURES
Figure Page
1.1 Evolution of the IC Integration Level (a) First IC with 4 transistors by Fairchild Semiconductor [2]. (c) Intel Pentium 4 microprocessor with 50 million transistors...... 1
1.2 AMBA Bus...... 4
1.3 CoreConnect Bus...... 6
1.4 WishBone Bus Interconnection Architectures (Silicore 2002) [3]...7
1.5 Silicon Backplane Micro-Network Bus Architecture...... 8
1.6 Bus Layout Schemes...... 11
1.7 A SoC Design...... 12
1.8 SoC-based consumer portable design complexity trends [4]...... 13
1.9 SoC Design Space...... 14
1.10 Power Density in Intel’s Microprocessors...... 17
1.11 Gate and Wiring Delay Vs. Future Technology Nodes [5]...... 19
2.1 Conceptual view of Network-on-Chip [6]...... 22
2.2 Network-on-Chip...... 23
2.3 Eleven Standard NoC Topologies...... 26
2.4 CLICHÉ Architecture...... 27
2.5 Torus Architecture...... 28
2.6 BFT Architecture...... 29
2.7 SPIN Architecture...... 30
2.8 Octagon Architecture...... 31
2.9 NoC Switching Techniques...... 33
2.10 Store & Forward Routing Vs. Cut-Through Routing...... 36
3.1 A Generic Router Design...... 48
3.2 An Input Port of the Switch...... 50
3.3 An Input Port of the Switch with Virtual Channels...... 51
3.4 A 3x3 Crossbar Implemented using a Multiplexer for each Output.. 53
3.5 2-D Implementation of a 3x3 Multiplexer based Crossbar...... 54
3.6 A Matrix Arbiter Design...... 56
3.7 State Diagram of Port Controller...... 57
3.8 Control Flow Algorithm for Input Virtual Channels...... 58
3.9 Packet Format...... 59
3.10 NoC Port I/O...... 61
3.11 Power per Component...... 62
3.12 High Throughput Arbiter Design...... 63
3.13 Max. Frequency of Switch with different number of Virtual Channels 64
3.14 Throughput vs. Virtual Channels for Different NoC Topologies.... 65
3.15 Latency of NoC Topologies with Different Number of Virtual Channels 66
3.16 NoC Port Design for Reducing Leakage Power...... 68
3.17 Power Consumption for Different NoC Architectures...... 68
3.18 Power Analysis Requirements...... 71
3.19 Power Measurement Flowchart for NoC Routers using Synopsys Tools 72
3.20 Power per Component...... 73
3.21 Total Power Consumed by Routers for Different Number of IPs... 74
3.22 Leakage Power vs Technology Nodes [5]...... 77
3.23 Difference in Leakage Power for a 6-Port Router Design using Different Vt Cells...... 78
3.24 Frequency vs. Area of the Switch...... 79
4.1 Metal Layers in different technology nodes...... 81
4.2 Gate Delay vs. Wire Delay in Different Technology Nodes...... 83
4.3 NoC Interconnects...... 85
4.4 One Clock Cycle Requirement for High Performance NoC Designs.. 86
4.5 Intrinsic RC delay and 15FO4 limit...... 87
4.6 Interconnect Resistance...... 88
4.7 Cross-Sectional View of Semi-Global Layer Interconnects...... 89
4.8 Interconnect with Repeaters...... 93
4.9 An Improved ASIC Design Flow for NoC in Deep Nanometer Regime 98
5.1 Diminishing Returns of Power...... 102
5.2 Layout of Cliche architecture...... 104
5.3 Layout of BFT architecture...... 106
5.4 Layout of SPIN architecture...... 108
5.5 Layout of Octagon architecture...... 109
5.6 Number of cores with technology scaling...... 111
5.7 Length of longest interconnect with increasing number of IPs..... 112
5.8 A Methodology for Power Efficient NoC Design...... 113
5.9 Total Power of the Network...... 114
5.10 Distribution of NoC Power Consumption...... 116
6.1 Port Interface (a) Asynchronous Design (b) Synchronous Design...... 119
6.2 Asynchronous NoC Architecture...... 120
6.3 Asynchronous Port Architecture...... 121
6.4 Asynchronous FIFO Cell...... 122
6.5 PTC Circuit...... 123
6.6 GTC Circuit...... 124
6.7 Burst Mode Specification of PTC and GTC...... 125
6.8 DSC Circuit...... 125
6.9 Synchronous Switch Port Design...... 126
6.10 Clock Tree Network for Synchronous BFT Architecture...... 127
6.11 Power Dissipation in Syn. and Asyn. BFT Architecture...... 133
6.12 Power Dissipation in Syn. and Asyn. BFT Architecture...... 134
6.13 Power Dissipation of Syn. and Asyn. Architectures, αclk = 0.5...... 137
6.14 Power Dissipation of Syn. and Asyn. Architectures, αclk = 0.5 and αcs = αdata/64...... 138
A.1 Intel’s 80 Core Tera Flop Processor...... 151
A.2 Intel’s 48-Core (24 tiles with two IA cores per tile) SCC Processor...... 152
A.3 Tilera’s Multicore Processor...... 154
A.4 The Blue Gene/Q SoC integrates 18 homogenous cores...... 156
A.5 BONE Evolution...... 158
A.6 FAUST Chip Architecture...... 160
CHAPTER 1
Introduction
It all began in 1959, with the first silicon integrated circuit. Since then, the world of Integrated Circuits (ICs) has become more and more complex, as shown in Figure 1.1. Every two years, with each new technology node, the number of transistors that can fit in the same area doubles, a trend roughly following Moore’s Law.
Figure 1.1: Evolution of the IC Integration Level. (a) First silicon IC in history (1959), with 4 transistors, by Fairchild Semiconductor [2]. (b) Intel 4004, the first microprocessor (1971). (c) Intel Pentium 4 microprocessor with 50 million transistors (2000). (d) Intel Core i7 microprocessor (2009).
With technology scaling, it is now possible to integrate more than two billion transistors onto a single chip, and the capacity is still growing with ever smaller technology nodes. Market demand for smaller, higher performance devices keeps pushing semiconductor technology to smaller process nodes and packing more functionality into a single die than ever before. Consumers demand high-quality, multi-functional, and feature-rich electronic products at a low price. As a result, product differentiation matters more now than ever before, and is being achieved through increased functionality, higher performance, improved power efficiency, and more application-specific features. Over the past ten years, as integrated circuits have become increasingly complex, the industry began to embrace new design and reuse methodologies that are collectively referred to as System-on-Chip (SoC) design.
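The doubling trend described above can be sketched with a few lines of arithmetic. The baseline below (roughly 42 million transistors for the Pentium 4 in 2000) and the strict two-year doubling period are illustrative assumptions, not data from this chapter:

```python
# Illustrative sketch of Moore's Law scaling: transistor count doubling
# every two years. The baseline count and years are assumptions chosen
# only to show the order of magnitude of the trend.

def projected_transistors(base_count: int, base_year: int, year: int) -> int:
    """Project a transistor count assuming one doubling every two years."""
    periods = (year - base_year) / 2.0
    return int(base_count * 2 ** periods)

if __name__ == "__main__":
    # Starting from ~42 million transistors in 2000, ten years of scaling
    # (five doublings) projects to roughly 1.3 billion transistors.
    print(projected_transistors(42_000_000, 2000, 2010))
```

Five doublings over a decade multiply the count by 32, which is consistent with the chapter's observation that chips crossed the two-billion-transistor mark shortly after 2010.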
The term SoC is fairly new in the semiconductor industry, but it is very rapidly replacing the more popular acronyms of the past, such as VLSI (Very Large Scale Integration) and ULSI (Ultra Large Scale Integration). The change in name reflects a paradigm shift: a shift in focus from chip design to system design. Before the SoC, semiconductor technology and the circuits themselves played the central role as a discipline, as an industry, and as a research focus. In the SoC era, however, the focus is shifting to the system beyond the chip design.
SoCs at present are used in a limited set of applications, such as mobile phones, smart phones, digital cameras, HDTVs, and gaming consoles, but many more applications will use them in the near future as they become more powerful and easier to develop. Recently, Intel’s CEO Paul Otellini mentioned that he can easily see a time when Intel will ship more SoCs than standard microprocessors. This statement is truly remarkable in the sense that it clearly outlines the near-term focus of a big semiconductor company like Intel.
An SoC is a system-level solution, integrating many different components connected together on a single chip to achieve a common goal, with a final application in mind. The big advantage of any SoC design is that it offers tremendous computation power as a complete system on a single chip. In the earlier concepts of SoCs, the main goal was to copy the system implemented on a PCB with discrete components onto a single silicon chip by adopting the same bus architectures as those used on the PCB. Previously these components were few and were interconnected using a shared bus architecture, but now, with more complex SoC designs, the number of IP blocks keeps increasing, and as a result the performance of this shared bus approach is unfit for designs with larger numbers of IPs. In the shared bus approach, arbitration is used among several requesters, and as more and more components are attached to the same single bus, the load on the bus increases and its speed drops. To solve this problem, new design approaches are crucial for the vitality of future SoC designs with thousands of IPs. Before diving into the discussion of solutions to this problem, some of the main shared bus schemes used in contemporary SoC designs are discussed in the next section.
1.1 Bus Based On-Chip Communication Architectures
A bus is a group of lines that serves as a communication path for several devices. In addition to the lines that carry the data, the bus also has lines for address and control signals. A shared bus, or simply a bus, is still the most common way to move on-chip data in SoC designs with fewer IPs and is commonly found in many commercial SoCs of the present. In this scheme, several masters and slaves can be connected to a shared bus. A bus arbiter periodically examines accumulated requests from the multiple master interfaces and grants access to a master using arbitration mechanisms specified by the bus protocol. A shared bus communication architecture has many advantages: simple topology, extensibility, low area cost, and ease of construction and implementation. However, the increased load on the global bus lines limits the bus bandwidth, resulting in longer delays for data transfers and larger energy consumption. Some of the main bus architecture designs are discussed below.
1.1.1 AMBA Bus
The AMBA (Advanced Microcontroller Bus Architecture) bus standard was developed by ARM with the aim of supporting efficient on-chip communication among ARM processor cores. Nowadays, AMBA is one of the leading on-chip bus systems used in high performance SoC designs. A typical AMBA configuration is shown in Figure 1.2.
AMBA is hierarchically organized into two bus segments, System and Peripheral
Figure 1.2: AMBA Bus
bus, mutually connected via a bridge that buffers data and operations between them. The AMBA specifications define standard bus protocols for connecting on-chip components, generalized for different SoC structures and independent of processor type.
AMBA does not define methods for arbitration; instead, it allows the arbiter to be designed to suit the application’s needs. There are three distinct buses specified within the AMBA bus for different applications, namely (i) ASB (Advanced System Bus), (ii) AHB (Advanced High Performance Bus), and (iii) APB (Advanced Peripheral Bus). Recently, two new specifications for the AMBA bus, Multi-Layer AHB and AMBA AXI, were defined. Multi-Layer AHB provides a more flexible interconnect architecture (a matrix which enables parallel access paths between multiple masters and slaves) with respect to AMBA AHB, and keeps the AHB protocol unchanged. AMBA AXI is based on the concept of point-to-point connection [7][8][9].
1.1.2 CoreConnect Bus
CoreConnect is an on-chip bus architecture from IBM for System-on-Chip (SoC) designs. Initially developed in 1999, it is a macro-based design platform for efficiently integrating complex SoC designs consisting of processors, system blocks, and peripheral cores. Macro-based design provides numerous benefits during logic entry and verification, but the ability to reuse intellectual property for a standard or custom SoC design is often the most significant one. By using common or generic macros, from serial ports to complex memory controllers and processor cores, the design of a complex SoC can be easily accomplished. A typical connection scheme for the CoreConnect bus is shown in Figure 1.3. IBM’s CoreConnect is a hierarchically organized architecture consisting of three buses for interconnecting cores, library macros, and custom logic: (i) the Processor Local Bus (PLB), (ii) the On-chip Peripheral Bus (OPB), with a bus bridge, and (iii) the Device Control Register Bus (DCRB). The PLB and OPB buses provide the primary means of data
Figure 1.3: CoreConnect Bus
flow among macro elements. The PLB is the main system bus for high performance peripherals. It is a synchronous, multi-master, centrally arbitrated bus designed for achieving high-performance, low-latency on-chip communication. Slower peripheral cores connect to the OPB bus, a secondary bus architected to alleviate system performance bottlenecks by reducing capacitive loading on the PLB. Peripherals suitable for attachment to the OPB include serial ports, parallel ports, UARTs, GPIO, timers, and other low-bandwidth devices. The PLB masters gain access to the peripherals on the OPB bus through the OPB bridge macro. The DCRB, the third bus in the system, is a single-master bus mainly used as an alternative path for relatively low speed data. For example, lower performance status and configuration registers are typically read and written through the DCRB. The DCRB provides a maximum throughput of one read or write transfer every two cycles and is fully synchronous. It is typically implemented as a distributed multiplexer across the chip.
CoreConnect implements arbitration based on static priority with programmable priority fairness. Through this configuration, CoreConnect can provide an efficient interconnection of cores, library macros, and custom logic for any SoC design.
1.1.3 WishBone Bus
The WishBone System-on-Chip (SoC) interconnect bus architecture was developed by Silicore Corporation [3]. It is an open-core architecture with a portable interface for use with semiconductor IP cores. It employs 8-bit to 64-bit standard buses to interconnect portable IP cores such as CPUs, processors, DSPs, and other peripheral cores. It defines two types of interfaces, called master and slave. Master interfaces are IPs capable of initiating bus cycles, whereas slave interfaces are capable of accepting bus cycles. As shown in Figure 1.4, the hardware implementation
(a) Point-to-Point (b) Crossbar (switch) (c) Shared bus
Figure 1.4: WishBone Bus Interconnection Architectures (Silicore 2002) [3]
for the WishBone bus supports various types of interconnection topologies: (i) point-to-point interconnection, for a direct connection between two components; (ii) shared bus, typical for MPSoCs organized around a single system bus; and (iii) crossbar switch interconnection, usually used in MPSoCs when more than one master can simultaneously access several different slaves. These different bus architectures, along with a good arbitration mechanism (such as a priority bus, TDMA bus, round-robin bus, or the relatively new lottery bus) and a QoS mechanism, provide a generic backbone for efficient interconnection between system components. In applications where two buses are required, one slow and one fast, two separate WishBone interfaces can be used. The designer can also choose the arbitration mechanism and implement it to fit the needs of the application.
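As an illustration of the arbitration mechanisms mentioned above, the following is a minimal behavioral sketch of a round-robin arbiter in Python. It models only the grant decision, not any particular bus protocol; the class and method names are hypothetical:

```python
# Behavioral sketch of a round-robin bus arbiter: each cycle it grants the
# first requesting master at or after the position following the last
# grant, so no master can starve the others.

class RoundRobinArbiter:
    def __init__(self, num_masters: int):
        self.num_masters = num_masters
        self.last_grant = num_masters - 1  # so master 0 has priority first

    def grant(self, requests):
        """Return the index of the granted master, or None if nobody requests."""
        for offset in range(1, self.num_masters + 1):
            candidate = (self.last_grant + offset) % self.num_masters
            if requests[candidate]:
                self.last_grant = candidate
                return candidate
        return None

arb = RoundRobinArbiter(4)
print(arb.grant([True, False, True, False]))  # master 0 granted first
print(arb.grant([True, False, True, False]))  # priority rotates: master 2 next
```

A priority arbiter would instead always scan from master 0, and a lottery arbiter would draw the grant from a weighted random distribution; the surrounding grant/request bookkeeping stays the same.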
1.1.4 SiliconBackplane MicroNetwork
Sonics’s SiliconBackplane MicroNetwork is a quasi on-chip bus to which users attach intellectual-property blocks to create system-on-chip designs. SiliconBackplane MicroNetwork is a heterogeneous, integrated network that unifies, decouples, and manages all of the communication between processors, memories, and input/output devices. Figure 1.5 shows an SoC design using the MicroNetwork architecture. The MicroNetwork
Figure 1.5: Silicon Backplane Micro-Network Bus Architecture
isolates the IP blocks from the network by requiring all blocks to use a single bus interface protocol known as the Open Core Protocol (OCP). The OCP defines a comprehensive, bus-independent, high-performance, and configurable interface between IP cores and on-chip communication subsystems. OCP enables SoC designers to integrate IP cores in a plug-and-play fashion. Every IP block communicates through OCP via a wrapper, which MicroNetwork calls an agent. To accommodate changing system requirements, MicroNetwork supports modification of many bus parameters in real time, such as the selection of the arbitration scheme and the definition of the address space. A new agent is generated using the Fast Forward Development Environment tool, developed by Sonics Inc. When compared to a traditional bus architecture, the Sonics SiliconBackplane has the advantages of higher efficiency, flexible configuration, guaranteed bandwidth and latency, and integrated arbitration.
1.2 Limitations of the Bus based Communication Approach
In the earlier concepts of SoCs, the main goal was to copy the system implemented on a PCB with discrete components onto a single silicon chip by adopting the same bus architectures as those used on the PCB. Previously these components were few and the shared bus approach was sufficient, but now, with more complex SoC designs, the number of IP blocks keeps increasing, and as a result the performance of this shared bus approach is unfit for designs with larger numbers of IPs. Some of the published data related to AMBA, CoreConnect, WishBone, and MicroNetwork is shown in Table 1.1.
In a shared bus approach, arbitration is used among several requesters, and when more and more components are attached to the same single bus, the load on the bus increases and its speed drops. Additionally, large SoC designs usually have large die sizes, and in a bus-based communication scheme some control signals need to traverse the whole bus length several times within a single clock cycle; with technology shrinking into the deep nanometer regime, interconnect delay is
Table 1.1: Bus Architecture Specification [1]
Technology        AMBA       CoreConnect   WishBone               MicroNetwork
Company           ARM        IBM           Silicore Corporation   Sonics
Core Type         Soft/Hard  Soft          Soft                   Soft
Bus Width (bits)  8-1024     32/64/128     8-64                   16
Frequency         200 MHz    100-400 MHz   55-203 MHz             300 MHz
Max Bandwidth     3 GB/s     2.5-24 GB/s   0.1-0.4 GB/s           4.8 GB/s
Min Latency       5 us       15 ns         n/a                    n/a
exponentially increasing, making it nearly impossible to achieve this target in one clock cycle. To explain this in more detail, an example is worked out below.
• An Example - In a bus based communication system, even if the arbitration
is pipelined and takes place in an earlier cycle, the request for the bus still
must be or’ed between the components and then fanned out to all the receiving
components. The intended receiver must decode the request, decide if it is tar-
geted to it, and then issue an acknowledgment that must be registered by all
the components on the bus. This is a typical scenario of a bus based communi-
cation. A possible bus layout scheme for two different designs and technology
nodes is shown in Figure 1.6. A longer bus takes more time and is slower.
Additionally, as the number of components on the bus increases, the speed
of the bus drops further, offsetting the advantages of the added functionality.
For calculation purposes, a die size of 10mm is considered. The first case shows
4 cores in a 65nm process and the second 8 cores in a 45nm process; the larger
core count accounts for the scaling effects. In the first case the bus must extend at least 10mm in order
Figure 1.6: Bus Layout Schemes
to provide connectivity to all the cores on the chip, and with technology scaling
alone, the bus length needed to reach all 8 cores on the same die size is now 28mm.
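The bus-length arithmetic in this example can be sketched as a quick estimate. The serpentine layout model below (one horizontal run per row of cores plus vertical hops between rows) is an illustrative assumption, not the layout behind the 28mm figure above:

```python
# Illustrative estimate of shared-bus wire length on a square die.
# The serpentine model is an assumption for illustration only.
def bus_length_mm(die_mm, rows):
    """Bus spanning `rows` rows of cores: one horizontal run of the full
    die width per row, plus vertical hops between adjacent rows."""
    horizontal = rows * die_mm
    vertical = (rows - 1) * (die_mm / rows)
    return horizontal + vertical

print(bus_length_mm(10, 1))  # 4 cores in one row: 10 mm
print(bus_length_mm(10, 2))  # 8 cores in two rows: noticeably longer
```

Even this crude model shows the trend the text describes: for a fixed die size, packing in more cores forces the shared bus to grow well beyond the die edge length.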
Due to process technology improvements and restrictions, newer buses do not use tristate buffers; instead they combine inputs with muxes or OR gates and then fan out the result. Buses have many wires and create congestion, as these wires must converge at the mux block. To overcome the congestion, muxes are usually implemented in a distributed manner, but this increases the number of logic levels in the design and may lead to increased delays or a lower operating frequency.
Moreover, as the number of IPs grows, the load on the bus increases further and speed becomes an issue. To address this on-chip communication problem, a new approach, Network-on-Chip (NoC), has been proposed. A
Network-on-Chip (NoC) is an efficient on-chip communication architecture for System-on-Chip designs. It enables the integration of a large number of computational and storage blocks on a single chip.
1.3 Why NoC?
As discussed earlier, constantly shrinking process technologies and increasing design sizes have led to highly complex billion-transistor ICs. We long ago passed the point at which even a large team could design an entire chip from scratch in a reasonable amount of time. Faced with this challenge, designing systems using Intellectual
Property (IP) modules has become the dominant mode of chip design today. An early form of IP was the standard cell, which dates back to the early 1970s. Today IP components span the entire range of modules, from standard cells to processors, accelerators, memories, and I/O devices. An example SoC design is shown in Figure 1.7. The same IP-style design trend holds for processors: with the ever-
Figure 1.7: A SoC Design
increasing demand for performance countered by the power wall of the 1990s, designing processors with many cores has become widely accepted in industry. The number of cores in a general-purpose processor is expected to scale to several tens and possibly over a hundred by the end of the decade, leading toward aggregate
performance in trillions of operations per second. A typical implementation of such a processor will include multiple levels of cache memory hierarchy and interfaces to off-chip memory and I/O devices, in addition to tens or hundreds of general-purpose cores. The future trend for the number of cores in SoC-type designs projected by
ITRS 2009 is shown in Figure 1.9. In SoC designs, much of the added value comes
Figure 1.8: SoC-based consumer portable design complexity trends [4]
from the ability to identify the right combination of components to be put on the chip.
Many of those components are standardized: either they are based on open standards or they are licensed from IP providers who own a standard. Large productivity gains can be achieved using this SoC/IP approach. The complexity of these designs can range anywhere from homogeneous to heterogeneous in nature, as shown in Figure 1.8.
Homogeneous topologies are typically the design choice for High Performance Chip
Figure 1.9: SoC Design Space
Multiprocessors. Traditionally, bus-based architectures have been used to interconnect a small number of IP cores in SoC designs. However, with an increased number of components or cores on a single die, the bandwidth requirement between the cores in
SoCs is increasing as well. To meet the increased communication demands, bus-based architectures have evolved over time from a single shared bus to multiple bridged buses and, to an extent, to crossbar-based designs. However, despite these improvements, the shared bus approach is not suitable for SoC designs containing thousands of cores. The problems arise from non-scalable global wiring delays, failure to achieve global synchronization, and difficulties associated with non-scalable bus-based functional interconnects. Bus-based architectures are inherently non-scalable: more and more components introduce an increased load on the bus, and as a result speed drops drastically. The interconnect complexity of current and future SoC designs requires not only scalability but also reliability, performance, and reusable interconnect architectures to increase productivity. Thus a network-based interconnect architecture, or NoC, is needed to overcome this communication bottleneck and the
other associated challenges. NoC has been shown to improve on-chip communication
through the aid of specific interconnection topologies and packet based communica-
tion. NoC is scalable by nature and has huge potential to handle growing complexity
of SoC designs.
1.4 NoC Design Considerations and Challenges
The concept of NoC is inspired from the success of computer networks. The main
idea behind NoC is to route packets, not wires, to ease on-chip communication challenges. Although the idea is borrowed from the well-established domain of computer networking, it is not possible to reuse all the features of classical networks for on-chip implementation. In particular, NoC switches should be small, energy efficient, and fast, in contrast with routers used in computer networks. Tremendous research efforts, by both industry and academia, are being put in this direction to properly model and enhance NoCs to be practical and feasible in future
SoCs or MPSoC (Multi-Processor System-on-Chip) designs containing hundreds or thousands of cores. This thesis is one such effort toward the same goal.
As one can imagine, there are many design considerations and challenges involved in the design process for NoCs. The design issues span several abstraction levels, ranging from high-level application modeling to physical layout level implementation.
For example, one of the main challenges at the architecture level is to find the most suitable interconnect topology to satisfy, meet, or exceed the design goals set forth by the SoC system. The design choices made at any level in the design process, beginning with the architecture, can have a strong impact on the feasibility of the
network, timing closure, and overall system performance. Some of the main design considerations and issues in NoC modeling are as follows:
• System Level Design Considerations Designing an efficient NoC archi-
tecture, while satisfying the application performance constraints is a complex
process. For NoC communication, many different topologies or configuration
schemes are possible for interconnecting network switches with the cores and
with each other. The choice of NoC architecture and its design can have a large
impact on the performance, power consumption, throughput, latency and effi-
cient usage of area on the silicon chip. At architecture level, the main task thus
is to identify an appropriate topology based on the design needs and constraints.
Many possible solutions have been proposed and implemented recently, but too
often on-chip networks are built using mesh and ring topologies. In fact, most
of the NoC implementations to date have utilized a mesh topology or its derivatives due
to its simplicity and scalability. However, other topologies must be investigated
for a wide range of applications with different design goals in terms of area,
power and bandwidth.
• Power Budget The performance of any NoC design is highly bounded by the
power consumption. NoC communication will not only be applied to high end
designs like servers and desktop applications, but also to very small devices
like mobiles and/or other wireless communications devices. The NoCs for high-
end applications, such as supercomputers and home entertainment servers, need
low power design because of the associated thermal issues, which may require
expensive packaging and cooling equipment. Similarly, NoCs for mobile and
wireless applications may have even more stringent low power requirements to guarantee a reasonable operation time on a limited battery, because ever more powerful applications, such as 3D graphics games, navigation, and image recording and processing, which are communication-intensive and power-hungry, are being implemented in handheld devices. Today some of the most powerful microprocessor chips can dissipate as much as 100-150 watts, for an average power density of 50-75 watts per square centimeter. Local hot spots on the die can be several times higher than this. Figure 1.10 shows the growing power density trend in Intel microprocessors. As mentioned earlier, the in-
Figure 1.10: Power Density in Intel's Microprocessors
creased power density not only presents packaging and cooling challenges, but can also pose serious reliability problems. The mean time to failure decreases exponentially with temperature: every increase of 10C in operating temperature cuts product lifetimes in half. In addition, performance of the chip
degrades with temperature, and leakage power increases with temperature. Un-
til recently power has been a second order concern in chip design, following first
order issues such as cost, area and timing. Today, for most SoC designs, the
power budget is one of the most important design goals of the project. Exceed-
ing the power budget can be fatal to a project, whether it means moving from
a cheap plastic package to an expensive ceramic one, or failing to meet the
required battery life. For virtually all applications, reducing power consumed
by SoCs at every level of the design, including the NoC, is essential in order to
continue adding performance and features.
• Interconnect-dominated Nanometer Design Interconnects play an impor-
tant role in any on-chip communication including NoC. The success of NoC
greatly depends on the performance of its interconnects. With ever shrink-
ing geometries, gate delays are decreasing, but global interconnects are becoming
the principal performance bottleneck in terms of communication latency, cost
and power. Interconnects in deep nanometer regime pose severe challenges to
meet targeted system performance and reliability. Figure 1.11 shows wiring
delay vs. gate delay (from the ITRS 2007 report). In technologies at 90nm or be-
low, wiring capacitance dominates gate capacitance, thus rapidly leading to in-
creased interconnect-induced delays. Moreover, coupling capacitance becomes
significant between adjacent wires due to tighter geometries and must be
accounted for in advance. Interconnect optimization must be considered at all
levels of design abstraction in NoCs. In conventional IC design flow much em-
phasis is given to device and logic optimization, while the interconnects are left
for automatic layout tools. As a consequence, the traditional top-down approach
Figure 1.11: Gate and Wiring Delay Vs. Future Technology Nodes [5]
taken in any traditional VLSI design may not be an effective approach for NoC
designs. In an interconnect-centric design such as NoC, a careful modeling in-
cluding interconnect planning, interconnect synthesis, and interconnect layout
with a focus on interconnect optimization is essential.
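The quadratic growth of unbuffered wire delay, and the linear scaling that repeater insertion restores, can be illustrated with the standard Elmore delay approximation for a distributed RC line. The per-mm resistance and capacitance values and the repeater delay below are assumed ballpark numbers for illustration, not figures from a specific process:

```python
# Why global wires dominate: unbuffered distributed RC delay grows as L^2,
# while inserting repeaters restores roughly linear scaling with length.
# r, c, and t_rep values are illustrative assumptions.
def wire_delay_ps(length_mm, r_ohm_per_mm=2000.0, c_ff_per_mm=200.0):
    # Elmore delay of an unbuffered distributed RC line: ~0.38 * R_total * C_total
    R = r_ohm_per_mm * length_mm            # total resistance (ohm)
    C = c_ff_per_mm * length_mm * 1e-15     # total capacitance (F)
    return 0.38 * R * C * 1e12              # delay in picoseconds

def repeated_wire_delay_ps(length_mm, seg_mm=1.0, t_rep_ps=20.0, **kw):
    # Break the line into 1 mm segments, each driven by a repeater with an
    # assumed fixed delay; total delay now scales ~linearly in length.
    segs = max(1, round(length_mm / seg_mm))
    return segs * (wire_delay_ps(length_mm / segs, **kw) + t_rep_ps)

print(wire_delay_ps(1))             # per-mm delay
print(wire_delay_ps(10))            # 100x the 1 mm delay (quadratic in L)
print(repeated_wire_delay_ps(10))   # far smaller with repeaters
```

This is exactly the effect the paragraph describes: past a modest length, the unbuffered global wire, not the gate, sets the cycle time, which is why interconnect planning must come early in a NoC flow.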
• NoC Quality-of-Service (QoS) Challenges and Cost The challenge of
designing a NoC lies in finding a balance between the NoC services and their
implementation complexity and cost. In NoC, QoS refers to the level of commit-
ment for packet delivery; such a commitment could be in the form of correctness
of the transfer, completion of the transaction, or bounds on performance. In
most cases, however, QoS refers to bounds on performance (bandwidth,
delay, latency, etc.), since correctness and completion are the basic require-
ment of on-chip packet transfers. Correctness is concerned with packet integrity
(corruption-less) and in-order transfer of packets from source to intended des-
tination. This is achieved by techniques such as error correction and by en-
suring in-order packet delivery through reordering. Completion requires
that packets are not dropped or lost when being transferred from the source
to destination. In terms of bounds on performance, QoS can be classified into
three basic categories; best effort (BE), guaranteed service (GS), and differen-
tiated service (DS). In best effort service, only correctness and completions of
communication are guaranteed, and no other commitment is provided. Pack-
ets are delivered as quickly as possible over a connectionless (packet switching)
network, but worst-case times are not known or provided, and can be orders
of magnitude worse than average case. A GS, such as guaranteed through-
put (GT), makes a tangible guarantee on performance, in addition to the basic
guarantees of completion and correctness. GS is typically implemented using
connection-oriented (circuit) switching. A DS prioritizes communi-
cation according to different categories, typically through NoC switches, which
can employ priority based scheduling and allocation policies. All these solutions
require pre-planning as per design constraints, as most of them increase power
consumption and the cost of NoC [10].
• Lack of Tools and Benchmarks - The NoC design space is enormous, with nu-
merous topologies and protocols/parameter choices, switching strategies, flow
control, congestion control schemes, buffer sizing, packet sizing, and link sizing
etc. Because the area is still in its early research stages, it lacks design space
exploration and implementation tools [10]. The NoC design flow needs to be
integrated with the industry standard automation tool flows. There is a need
for open benchmarks [11] to compare among different NoC designs for perfor-
mance, cost, reliability and many other features. Computer-aided synthesis of
NoCs is particularly important to design and select the best performing design
in a reasonable amount of time.
1.5 Organization of this Thesis
The rest of this thesis is organized as follows. In Chapter 2, an overview of NoC architectures along with their performance and cost models is presented. Chapter 3 presents NoC router design and its impact on overall power consumption, area, and system performance. In addition, some low power design techniques applicable to
NoC router design are discussed and evaluated to achieve low power router design.
In Chapter 4, the impact of technology scaling on NoC interconnects is discussed and some schemes to optimize performance are presented. Additionally, an efficient design flow based on interconnect modeling is presented to achieve high performance and low power NoC interconnects. In Chapter 5, high-level power models for different
NoC architectures are presented to estimate the power budget in the early phase of the design cycle. Chapter 6 presents a study of low power NoC design based on activity numbers; asynchronous vs. synchronous design is evaluated.
CHAPTER 2
NoC Overview : Architecture, Performance and Cost
NoCs are now considered by many to be a viable alternative for designing scalable
communication architectures for present and future generations of SoC designs [32]. In
multimedia processors, inter-core communication demands often can scale up in the
range of GB/s and this demand is expected to peak with the increasing integration
of many heterogeneous/homogeneous high performance cores into a single chip. To
Figure 2.1: Conceptual view of Network-on-Chip [6]
meet such increasing bandwidth demands, state-of-the-art buses such as AMBA and
CoreConnect have been instantiated using multiple buses operating in parallel, thereby providing a crossbar-like architecture, which still remains inherently non-scalable with
low performance. To effectively tackle the interconnect complexity of modern SoC designs, a scalable and high performance interconnect architecture is needed; hence, NoCs [12].
2.1 NoC Building Blocks
The most important components forming the NoC architecture are the Network
Interfaces, the Routers and the Links. An NoC is formed by interconnecting these network elements in different configurations to form a topology. A sample topology is shown in Figure 2.2. The topology may either be specific, such as a mesh or a ring, or arbitrarily connected to match the requirements of the target application.
A network interface includes packetizing and de-packetizing logic for packet based
Figure 2.2: Network-on-Chip
communication. The arbitration for different flows happens at the routers, which decide which master/source gets priority on the downstream links. These basic building blocks of a NoC are explained in more detail in the following subsections.
2.1.1 Network Interfaces
A Network Interface (NI) is needed at each node to connect the IP (core) to the
NoC. Network Interfaces convert transaction requests into packets for injection into the network and receive packets in response to other transactions in the network.
When transmitting, packets are split into a sequence of flits (Flow Control
Units) to minimize physical wiring in the network. The flit width can be static or configurable based on the system requirements. For example, a flit can span from 4 wires to 200 or more wires, including data and control lines, depending on the needs of the system. Network interfaces also provide buffering at the interface to improve network performance.
2.1.2 Switches
Packets are transported through the NoC architecture by the switches, which route packets from sources to destinations. Switches are fully parameterizable in the number of input and output ports and can be connected arbitrarily; hence any topology, standard or custom, can be configured. A crossbar connects the input and output ports of a switch. The switches are also equipped with an arbiter to resolve conflicts among packets from different sources when they overlap in time and request access to the same output link. An arbiter is most often implemented using either a round-robin or a fixed-priority scheduling policy.
In switches, input and output buffering is used to avoid deadlock, lower congestion, and improve performance [13]. The buffering resources are instantiated depending on the desired flow control protocol; if credit-based flow control is chosen, only input buffering is necessary. Output buffers can still be deployed to decouple the
propagation delays within the switch and along the downstream link. The downside is a second cycle of latency and additional area and power overhead.
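The round-robin policy mentioned above can be sketched in a few lines. The `RoundRobinArbiter` class and its interface are illustrative assumptions, not a hardware description from this design; the key property is that the grant pointer rotates, so no requesting port is starved:

```python
# Minimal round-robin arbiter sketch for one switch output port.
# `requests` is one boolean per input port; the priority pointer rotates
# past each granted port so every requester is eventually served.
class RoundRobinArbiter:
    def __init__(self, ports):
        self.ports = ports
        self.last = ports - 1          # start so port 0 has first priority

    def grant(self, requests):
        for i in range(1, self.ports + 1):
            port = (self.last + i) % self.ports
            if requests[port]:
                self.last = port       # rotate priority past the winner
                return port
        return None                    # no requester this cycle

arb = RoundRobinArbiter(4)
print(arb.grant([True, False, True, True]))   # 0
print(arb.grant([True, False, True, True]))   # 2 (pointer moved past 0)
print(arb.grant([True, False, True, True]))   # 3
print(arb.grant([True, False, True, True]))   # 0 (wraps around)
```

A fixed-priority arbiter would instead always scan from port 0, which is smaller in hardware but can starve high-numbered ports under contention.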
2.1.3 Links
In a NoC there are two major signal paths: router-to-router and router-to-processing-element (PE). Compared to the short local wires inside
PEs and routers, the global wires between routers and the semi-global wires between a router and a PE form a critical part of NoCs. As semiconductor technologies shrink, NoC links play a major role in overall system performance and are an area of active research today.
2.2 NoC Architectures
An NoC architecture (or topology) specifies the physical arrangement of the interconnection network. It defines how nodes (IPs), switches (routers), and links are connected to each other. The nodes may be of the same type, e.g. processing cores, or of different types, e.g. audio cores, video cores, wireless transceivers, memory banks, etc., as shown in Figure 2.2. Each IP is connected to a local router through a Network Interface (NI) module. The NI module packetizes/depacketizes the data into and from the interconnection network. The PE together with its NI forms a network node.
Nodes communicate with each other by injecting data packets into the network. The packets traverse the network to their destinations based on various routing algorithms and flow control mechanisms. NoC architectures can be classified into three broad categories: direct networks, indirect networks, and irregular networks [10]. Eleven standard topologies applicable to NoC are shown in Figure 2.3. For this research, only one
(a) Cliche (b) Torus (c) Folded Torus
(d) Octagon (e) Ring (f) Spidergon
(g) Binary Tree (h) BFT (i) SPIN
(j) Hypercube (Hcube) (k) Star
Figure 2.3: Eleven Standard NoC Topologies
or two topologies from each group type are selected to compare and contrast and are
discussed in more detail below.
2.2.1 CLICHÉ
Kumar et al. [14] proposed the Chip-Level Integration of Communicating Heterogeneous Elements (CLICHÉ) topology. It is a 2D mesh consisting of an m × n array of
switches interconnecting Intellectual Property (IP) elements. A mesh consisting of
16 IPs is shown in Figure 2.4. Every switch except those at the edges is connected
Figure 2.4: CLICHÉ Architecture
to four neighboring switches and one IP block. In a 2D mesh the number of switches is
equal to the number of IPs. The switches and the IPs are connected through communi-
cation channels. A channel consists of two unidirectional links between two switches
or between a switch and a resource. The CLICHÉ topology is widely used in NoC designs because of its simplicity, regular structure, and short inter-switch wires, making it well suited to tile-based architectures. In this topology, links may be underutilized in the event of localized traffic, because in some particular cases
not all PEs may have the same communication requirement. This leads to mapping inefficiencies and waste of resources.
2.2.2 TORUS
Dally and Towles [15] proposed the 2D torus, shown in Figure 2.5, for NoC architectures. A 2D torus is basically the same as a regular mesh, except that the switches at the edges are connected to the switches at the opposite edge through wrap-around channels. Every switch has five ports, one connected to the local resource and the others connected to the closest neighboring switches. The number of switches in a torus topology is equal to the number of PEs. The main drawback associated with this topology is the
Figure 2.5: Torus Architecture
long wrap-around connections, as they can yield excessive delays. However, this can be avoided by folding the torus, which leads to a more suitable VLSI implementation.
Some of its advantages over a mesh-based architecture are: (i) smaller hop count,
(ii) higher bandwidth, (iii) decreased contention, and (iv) optimized chip space usage.
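The smaller hop count of the torus (advantage (i) above) is easy to verify numerically. The sketch below computes the average router-to-router hop count on an n × n mesh versus a torus, using Manhattan distance with wrap-around for the torus; it is a purely topological count, not a full network model:

```python
# Average hop count over all ordered node pairs (including self-pairs)
# on an n x n mesh vs. torus. Torus distance wraps around each dimension.
from itertools import product

def avg_hops(n, torus=False):
    def d(a, b):
        diff = abs(a - b)
        return min(diff, n - diff) if torus else diff
    nodes = list(product(range(n), repeat=2))
    total = sum(d(x1, x2) + d(y1, y2)
                for (x1, y1), (x2, y2) in product(nodes, repeat=2))
    return total / (len(nodes) ** 2)

print(avg_hops(4))              # mesh: 2.5
print(avg_hops(4, torus=True))  # torus: 2.0, fewer average hops
```

For a 4 × 4 network the torus cuts the average distance from 2.5 to 2.0 hops, and the gap widens with n, which is the bandwidth and contention advantage claimed above.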
2.2.3 BFT
Pande et al. [16] proposed the Butterfly Fat Tree (BFT) as a NoC topology, as shown
in Figure 2.6. The BFT architecture is a modification of the fat-tree architecture. In
this network, the IPs are placed at the leaves and switches at the vertices.
Each switch has four child ports and two parent ports. The IPs are connected to N/4
Figure 2.6: BFT Architecture
switches at the first level. The number of levels depends on the total number of IPs,
i.e. for N IPs the number of levels will be log4(N). At the jth level of the tree there
are N/2^(j+1) switches. The total number of switches in the butterfly fat tree architecture converges to a constant fraction of N, independent of the number of levels. If we consider a 4-ary tree, as shown in Figure 2.6, with four down links connected to child ports and two up links connected to parent ports, then the total number of switches at level 1 is N/4.
The most common drawback of a tree-based topology in general is that the root node
or nodes close to it become a bottleneck. However, the bottleneck can be removed
by allocating a higher channel bandwidth to channels located close to root nodes.
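The level-by-level switch count of a BFT follows directly from the formulas above. The helper below is a sketch assuming N is a power of 4:

```python
# Switch count per level of a Butterfly Fat Tree with N IPs (N a power of 4):
# log4(N) levels, with N / 2**(j+1) switches at level j (level 1 = N/4).
import math

def bft_switches(n_ips):
    levels = round(math.log(n_ips, 4))   # round() guards against float error
    return [n_ips // 2 ** (j + 1) for j in range(1, levels + 1)]

print(bft_switches(64))       # [16, 8, 4]
print(sum(bft_switches(64)))  # 28, which stays below N/2 = 32
```

The per-level counts form a geometric series N/4 + N/8 + ..., so the total switch count is always bounded by N/2 regardless of how many levels the tree has, which is the convergence property stated above.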
2.2.4 SPIN
Guerrier and Greiner [17] proposed a generic architecture called the Scalable, Programmable, Integrated Network (SPIN) for NoC communication. SPIN makes use of a fat-tree topology. Every node has four children, and the parent node is replicated four times at any level of the fat tree. A basic SPIN architecture with 16 nodes (IPs) is shown in Figure 2.7. This topology has some redundant paths and there-
Figure 2.7: SPIN Architecture
fore offers higher throughput at the cost of added area. It is scalable and uses a small number of routers for a given number of IPs. It has a natural hierarchical structure which may be suitable for some particular applications.
2.2.5 OCTAGON
Karim et al. [18] proposed the Octagon architecture for NoC. A basic Octagon configuration includes eight nodes and 12 bidirectional links, as shown in Figure 2.8. Each node is associated with an IP and two neighboring switches. Communication between
any pair of nodes takes at most two hops within a basic Octagon unit. It is a scalable architecture; for a system containing more than eight nodes, the Octagon architecture is extended by interconnecting multiple basic Octagon units with a single node in common. The main disadvantage of this topology is that, if only one more
Figure 2.8: Octagon Architecture
node is needed beyond an 8-node cluster, 8 more nodes are added to the design to make a complete unit, instead of just one or a few nodes. Still, in some cases this topology may prove to be useful.
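The at-most-two-hops property of the basic Octagon unit can be checked with a short breadth-first search over the assumed connectivity (a ring of 8 nodes plus a link from each node to the opposite node, giving the 8 + 4 = 12 bidirectional links mentioned above):

```python
# BFS verification that any two nodes of a basic Octagon unit are
# at most two hops apart. Connectivity assumed: ring + opposite-node links.
from collections import deque

def octagon_neighbors(i):
    return {(i + 1) % 8, (i - 1) % 8, (i + 4) % 8}

def hops(src, dst):
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == dst:
            return d
        for n in octagon_neighbors(node):
            if n not in seen:
                seen.add(n)
                frontier.append((n, d + 1))

print(max(hops(s, t) for s in range(8) for t in range(8)))  # 2
```

Every node reaches three others directly (its two ring neighbors and the opposite node) and all remaining nodes in exactly two hops, confirming the diameter-2 claim.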
2.3 NoC Flow Control Protocols
Flow control allocates network resources to the packets traversing the network and provides a solution to network contention. In NoC, flow control is important as it determines: (a) the number of buffering resources required in the system (an efficient flow control will minimize the number of required buffers and their idle time), and (b) the latency that packets incur while traversing the network, which matters under heavy traffic conditions, where fast packet propagation with optimum resource utilization is the key for time-sensitive data in the network. The flow control may
be buffered or bufferless. Buffered flow control is more advantageous in terms of lower latency and higher throughput. Four different buffer-based flow control protocols for NoC are:
• CREDIT Based is an availability-based flow control where an upstream node
keeps count of data transfers; the available free slots are termed credits.
Once a transmitted data packet is either consumed or further transmitted, a
credit is sent back. Bolotin et al. [19] used Credit Based Flow Control in QNOC,
a QoS based hardware efficient SoC integration mechanism for NoC.
• ACK/NACK is a retransmission-based flow control where a copy of each transmitted
flit is kept in a buffer until an ACK/NACK signal is received. If an ACK signal
is received, the flit is deleted from the buffer; if a NACK signal is received, the
flit is re-transmitted. Bertozzi et al. [20] used this retransmission-based flow control
in Xpipes, a NoC implementation.
• STALL/GO is a simple variant of credit-based flow control, where a STALL
is issued based on the status of the buffer downstream when there is no buffer
space available, else a GO signal is issued, indicating availability of buffer space
to accept the next transaction. In this scheme two wires are used for the flow
control between each pair of sender and receiver.
• HANDSHAKING Signals is a message-based flow control where a VALID signal
is sent whenever a sender transmits any flit. The receiver acknowledges by
asserting a VALID signal after consuming the data flit. Zeferino et al. [21] used
handshaking signals in their SoCIN NoC implementation.
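The credit-based scheme described in the first bullet can be sketched as follows. The class names and buffer size are illustrative assumptions; the point the sketch demonstrates is that the sender stalls exactly when it holds no credits, so the downstream buffer can never overflow:

```python
# Minimal sketch of credit-based flow control between one upstream node
# and one downstream buffer (names and sizes are illustrative only).
class DownstreamBuffer:
    def __init__(self, slots):
        self.slots = slots
        self.fifo = []

    def accept(self, flit):
        assert len(self.fifo) < self.slots, "sender violated credit protocol"
        self.fifo.append(flit)

    def consume(self):            # freeing a slot returns one credit upstream
        return self.fifo.pop(0)

class Upstream:
    def __init__(self, downstream):
        self.credits = downstream.slots   # one initial credit per buffer slot
        self.down = downstream

    def try_send(self, flit):
        if self.credits == 0:             # no free slot downstream: stall
            return False
        self.credits -= 1
        self.down.accept(flit)
        return True

    def credit_return(self):              # arrives when downstream consumes
        self.credits += 1

buf = DownstreamBuffer(slots=2)
tx = Upstream(buf)
print(tx.try_send("f0"), tx.try_send("f1"), tx.try_send("f2"))  # True True False
buf.consume(); tx.credit_return()
print(tx.try_send("f2"))  # True, a credit came back
```

Because the credit counter mirrors the free downstream slots, no separate STALL wire or retransmission buffer is needed, which is why credit-based schemes pair naturally with input-buffered switches.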
2.4 NoC Switching Techniques
In a NoC, the switching technique determines how data flows through the routers in the network. It defines the granularity of data transfer and the mechanism applied to it.
Messages generated by the source node are partitioned into several data packets. A packet is further divided into multiple flits (flow control units). A flit is an elementary unit on which link flow control operations are performed, and is essentially a synchronization unit between routers for each data transfer. Each flit is made up of one or more phits (physical units). A phit is the unit of data transferred on a link in a single clock cycle; typically, the size of a phit is the width (in bits) of the communication link [10]. Different NoC architectures use different phit, flit, and packet sizes. The choice of size can have a significant impact on cost, performance, and power for NoC fabrics. As shown in Figure 2.9, the two main modes
Figure 2.9: NoC Switching Techniques
of transporting flits in a NoC are Circuit Switching and Packet Switching. These two techniques are discussed in more detail next.
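The message/packet/flit/phit hierarchy above can be made concrete with a small partitioning sketch. All sizes below are illustrative assumptions; real NoCs choose them per design:

```python
# Sketch of transfer granularity: a message splits into packets, a packet
# into flits, and a flit into phits. Sizes here are illustrative assumptions.
def partition(message_bits, packet_bits=512, flit_bits=64, phit_bits=32):
    packets = -(-message_bits // packet_bits)        # ceiling division
    flits_per_packet = -(-packet_bits // flit_bits)
    phits_per_flit = -(-flit_bits // phit_bits)
    return packets, flits_per_packet, phits_per_flit

print(partition(2000))  # (4, 8, 2): 4 packets, 8 flits each, 2 phits per flit
```

With a 32-bit link, each 64-bit flit here takes two cycles to cross a hop, which is exactly the phit-per-cycle behavior described in the text.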
1. In Circuit Switching, a physical path between the source and the destination
is reserved prior to the transmission of data. The physical path consists of a
series of links and routers, and the message is sent in its entirety to the receiver
once a path(circuit) is established. A message header flit traverses the network
from the source to the destination, reserving links along the way through the
routers. If the header flit reaches the destination without any conflict, all the
links in the path are available and an acknowledgment is sent back to the sender
from the receiver. Upon receiving confirmation, the sender sends out the data on
the reserved path. The path is held until all the data has been transmitted. At
the end, a tail flit frees the resources for other transmissions. If a link is busy,
however, a negative acknowledgment is sent to the sender for further action.
The main advantage of this approach is that the full link bandwidth is available
to the circuit once it has been setup, which results in low latency for data
transfer. On the other hand, the main drawback of this approach is that it is
not scalable with the size of the network, as several links can be occupied for
the duration of the transfer. Circuit Switching is implemented in the SoCBUS
NoC architecture [22].
2. In Packet Switching, no path is reserved before sending any data; the pack-
ets are transmitted from the source and they make their way independently to
the receiver. In circuit switching, there is a start-up waiting time for the data
for path reservation followed by fixed minimal latency in the routers, whereas
Packet switching has zero start-up time followed by variable latency due to con-
tentions in the routers. There are three different packet switching techniques:
(i) Store and Forward (SAF), (ii) Virtual Cut Through (VCT), and (iii) Wormhole (WH) switching. These are discussed in more detail below:
(i) In Store and Forward (SAF) Switching, a packet is sent from one
router to the next only if the receiving router has buffer space for the
entire packet. Hence, packet transmission cannot stall, and there is no
concept of a flit (flits are equal to packets). Routers forward a packet only
when it has been received in its entirety. The buffer size in the router
must therefore be at least as large as the packet size. Because of this large
buffer requirement, the technique is not commonly used in NoCs; however,
the Nostrum [14] NoC makes use of it.
(ii) In Virtual Cut Through (VCT) Switching, a flit of a packet is for-
warded as soon as space for the entire packet is available in the next router,
thereby reducing the per-router latency. The other flits then follow with-
out delay. If no space is available, the whole packet is buffered. The buffering
requirements of SAF and VCT switching are the same.
(iii) In Wormhole (WH) Switching, buffer requirements are reduced to one
flit instead of an entire packet. A flit from a packet is forwarded to
the receiving router if space for that flit is available there. In
this scheme, a packet in transit can be distributed among two or more
routers at a given time. If the packet blocks, the whole chain of links it
occupies is blocked with it, which results in higher congestion than with
SAF and VCT switching. However, link blocking can be alleviated by
multiplexing virtual links (or virtual channels). WH is also more susceptible
to deadlock, due to interdependencies among routers, but this too can be
avoided with virtual channels and suitable routing schemes.
Figure 2.10 shows the major differences between SAF and WH routing in terms
of buffer and timing needs. Due to their lower buffering requirements, almost all
NoCs use WH switching.
Figure 2.10: Store & Forward Routing Vs. Cut-Through Routing
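The buffering and timing contrast in Figure 2.10 can be captured with a simple zero-load sketch. The unit flit time and function names below are illustrative assumptions, not taken from the text:

```python
def saf_latency(hops, flits_per_packet, t_flit=1):
    """Store-and-forward: each router waits for the whole packet
    before forwarding, so the full packet time is paid at every hop."""
    return hops * flits_per_packet * t_flit

def wh_latency(hops, flits_per_packet, t_flit=1):
    """Wormhole: only the header flit pays the per-hop latency;
    the body flits pipeline behind it."""
    return hops * t_flit + (flits_per_packet - 1) * t_flit

# A 4-flit packet over 5 hops: wormhole pipelines the body flits.
print(saf_latency(5, 4))  # 20 cycles
print(wh_latency(5, 4))   # 8 cycles
```

The gap grows with both hop count and packet length, which is why the figure shows wormhole finishing much earlier for the same transfer.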
2.5 NoC Routing
NoC routing is responsible for the correct and efficient routing of packets (or circuit flows) traversing the network from sources to destinations. It determines the paths over which data flows through the network. There are several different routing schemes in the research literature; some of the major classifications are (i)
Static or Dynamic, (ii) Distributed or Source, and (iii) Minimal or Non-Minimal routing. These are discussed in more detail below.
(i) In a Static Routing (also known as oblivious or deterministic) scheme, perma-
nent paths from a source to a destination are defined and are used regardless
of the current state of the network. This routing scheme does not take into
account the current load of network links and routers when making routing de-
cisions. In a Dynamic Routing scheme, by contrast, routing decisions are made
according to the current state of the network (load, available links). As a result,
the traffic between a source and a destination may change its route with time.
Static routing is simpler to implement in terms of router logic and interactions
between routers. A major advantage of single-path static routing is that all
packets with the same source and destination are routed over the same path
and can be kept in order; there is then no need to number and reorder
the packets at the network interface. Static routing is more appropriate when
traffic requirements are steady and known, while dynamic routing is appro-
priate when traffic conditions are unpredictable [23]. Both
static and dynamic routing techniques can be further classified based on where
the routing information is held and where routing decisions are made.
(ii) In Distributed Routing, each packet carries the destination information. The
routing decisions are implemented in each router, either by looking up routing
tables or by executing a hardware routing function. In this method, each router
in the network contains a predefined routing function whose input is the destination
address from the packet and whose output is the routing decision. When a packet arrives
at the input port of the router, its output port is looked up in the table or cal-
culated by the routing logic according to the destination address [23]. A table-based
implementation, however, is only practical for very small systems, because table
size increases linearly with network size. In a Source Routing scheme, source
nodes predetermine the complete routing path before injecting packets into the
network. The pre-computed path is stored in the message header, and switches
simply read the routing information. The implementation is by means
of routing tables (or look-up tables) stored at the end nodes. This solution
allows messages to be routed in any irregular topology configuration, as the header in-
cludes the entire path. However, this scheme consumes network bandwidth,
since the full path is transmitted through the network with every packet. Additionally,
the size of the look-up table at every end node grows linearly with the system size, so the
aggregate table storage grows quadratically with the NoC size [Flich]. Examples of real
NoCs using source-based routing implementations are xpipes [24] and the Intel Polaris chip [25].
(iii) Based on the number of hops a message takes, routing schemes can further be
classified as Minimal and Non-Minimal routing. A route is minimal
if the length of the path from the source to the destination is the short-
est possible length between the two nodes. In a Minimal Routing scheme,
a source does not start sending packets if a minimal path is not available. In
contrast, in a Non-Minimal Routing scheme there is no such constraint, and
messages can take longer paths if a minimal path is not available. By allowing
non-minimal paths, the number of alternative paths is increased, which can be
useful for avoiding congestion and hot spots, and for fault tolerance. However,
non-minimal routing can have an undesirable overhead of additional power con-
sumption, and it can be prohibitively expensive for a large-scale NoC design [10].
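To make the classifications above concrete, the sketch below combines a static, minimal scheme (dimension-ordered XY routing on a 2-D mesh) with a source-routing wrapper that precomputes the whole port list at the sender. The coordinate convention and port names are illustrative assumptions:

```python
def xy_route(cur, dst):
    """Static, minimal, dimension-ordered routing for a 2-D mesh:
    correct X first, then Y. The same source/destination pair always
    yields the same path, so packets stay in order."""
    cx, cy = cur
    dx, dy = dst
    if cx != dx:
        return "EAST" if cx < dx else "WEST"
    if cy != dy:
        return "NORTH" if cy < dy else "SOUTH"
    return "LOCAL"

def source_route(cur, dst):
    """Source routing on top of xy_route: the sender precomputes the
    entire port list; switches then just consume one entry per hop."""
    step = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}
    path = []
    while cur != dst:
        port = xy_route(cur, dst)
        path.append(port)
        cur = (cur[0] + step[port][0], cur[1] + step[port][1])
    return path

print(source_route((0, 0), (2, 1)))  # ['EAST', 'EAST', 'NORTH']
```

Because XY is deterministic, no router state is needed; the trade-off against a dynamic scheme is that congested links cannot be avoided.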
2.6 NoC Performance and Cost
The main function of a NoC is to transfer information from any source node to any desired destination node. It should accomplish this task in as little time as possible, and it should allow a large number of such transfers to take place concurrently. Thus, it is highly desirable for a NoC design to exhibit high performance (high throughput and low latency) at low cost (low power and small area overhead). As with any digital design, most of these metrics trade off against each other and require a careful balance among all. A more detailed discussion of these metrics is presented below.
2.6.1 NoC Power Dissipation
Until recently, power has been a second-order concern in chip design, following
first-order issues such as cost, area, and timing. Today, for most SoC-based designs, the power budget is one of the most important design goals of the project. Exceeding the power budget can be fatal to a project, whether it means moving from a cheap plastic package to an expensive ceramic one, causing unacceptably poor reliability due to excessive power density, or failing to meet the required battery life. These problems are expected to become worse for process geometries of 90nm and below.
For example, leakage power alone has become a significant part of the total chip power, reaching almost 40% in a generic 65nm technology. Reducing power wherever possible is essential for any design, including NoCs. The NoC concept is expected to be applied not only in high-end SoC designs but also in very small devices such as mobile and wireless communication devices. For virtually all applications, reducing the power consumed by the NoC is essential for it to be feasible and successful. In a
NoC, power is dissipated when flits travel through the network: both the inter-switch wires and the logic in the switches toggle. High-level power models and an IP-based design methodology for low power NoC design are presented in Chapter 5.
2.6.2 NoC Area Overhead
Area consumed on silicon directly relates to the associated cost of the design. In a NoC design, the silicon area overhead arises from the switches, the large number of global interconnects, and the repeaters. For longer interconnects, repeater insertion is necessary to keep the inter-switch delay within one clock cycle. The total number of repeaters required depends on the length and the total number of interconnects in the network. Additionally, the spacing between interconnects is usually optimized for signal integrity in deep nanometer technologies. Repeater insertion, in addition to optimized interconnect spacing, may result in a large area overhead for the NoC. Similarly, NoC switches have two main area-sensitive components: the storage buffers and the control logic that implements routing and flow control. The storage buffers are the FIFOs (First In, First Out) needed to maintain network performance, and they trade off directly with area. As SoC designs scale, the area overhead may become very large or impractical to implement. It has been reported that future SoCs may have as many as 1000
IPs on a single chip. Keeping this in mind, area minimization techniques are highly desirable at the circuit level of design. A detailed discussion of performance optimization of interconnects for an area-limited NoC design is presented in Chapter 4.
2.6.3 NoC Message Latency
Message latency is the time elapsed from when a message is generated at its source node until it is delivered at its destination node. The unloaded, or zero-load, latency of a network is the latency when only one packet traverses the network; this model does not consider contention among packets. The zero-load latency of a NoC with wormhole switching is

T_{network} = N_{hops} \cdot t_{sw} + t_{links} + L_p / B    (2.1)

where the first term represents the routing delay, t_{links} is the total propagation delay of the communication channels (inter-switch links), and the third term is the serialization delay of the packet [26]. In more detail, N_{hops} is the average number of hops a packet traverses to reach the destination node, and t_{sw} is the switch delay, calculated as

t_{sw} = N_{sw-cycles} / f_{sw}    (2.2)

where N_{sw-cycles} is the number of clock cycles required for packet processing by the switch and f_{sw} is the switch frequency. L_p is the length of the packet in bits, and B is the bandwidth of the communication channel, defined as B = w_c \cdot f_c, where w_c is the channel width in bits and f_c is the channel frequency. In almost all NoCs f_c = f_{sw} = f, and with this assumption the NoC latency can be written as

T_{network} = N_{hops} \cdot N_{sw-cycles} / f + t_{links} + L_p / (w_c \cdot f)    (2.3)

Zero-load network latency describes the effect of the network topology on timing performance. In a multi-processor SoC (MPSoC) design, message latency directly affects processor idle time and memory access time.
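Equation (2.3) can be evaluated directly. The sketch below plugs in illustrative numbers only (4 hops, 3 cycles per switch, 2 ns of total link delay, a 256-bit packet on 32-bit channels at 1 GHz); none of these values come from the text:

```python
def zero_load_latency(n_hops, n_sw_cycles, t_links, l_p, w_c, f):
    """Zero-load wormhole latency per Eq. (2.3):
    per-hop routing delay + link propagation delay + serialization."""
    t_routing = n_hops * n_sw_cycles / f
    t_serialization = l_p / (w_c * f)
    return t_routing + t_links + t_serialization

# 4 hops, 3 cycles per switch, 2 ns total link delay,
# a 256-bit packet on 32-bit channels at 1 GHz:
# 12 ns routing + 2 ns links + 8 ns serialization = 22 ns.
print(zero_load_latency(4, 3, 2e-9, 256, 32, 1e9))  # ≈ 2.2e-08 s
```

Note that for short packets the routing term dominates, while for long packets the serialization term L_p/(w_c·f) takes over, which is why wider channels help latency as well as bandwidth.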
2.6.4 NoC Throughput
Typically, the performance of a digital communication network is characterized by
its aggregate bandwidth in bytes/sec, which is a static measure of network capacity.
For a NoC, however, the rate at which messages can be sent and completed by the
network matters more, so throughput is the more appropriate metric. The
throughput of a network is defined as the total number of messages handled by the network per unit time [26]. It is the average rate of successful message delivery per unit time, and can be defined as

Throughput = L_{tot-msg} / (N_{IP} \cdot t_{total})    (2.4)

where

t_{total} = N_{cycles} / f    (2.5)

thus

Throughput = (L_{tot-msg} / (N_{IP} \cdot N_{cycles})) \times f    (2.6)

Here L_{tot-msg} is the total length, in bits, of the messages that have successfully reached their destinations, N_{IP} is the total number of network IPs involved in the process, and t_{total} is the total time elapsed between the generation of the first message and the reception of the last message; it depends on N_{cycles} (the number of cycles consumed) and the frequency f of the switches. If the frequency of the switches is increased, the throughput of the network increases as well. The virtual channels mentioned previously help in achieving higher throughput by increasing channel utilization. Thus, NoC throughput can be improved by making use of virtual channels and by operating the switches at higher frequencies. These factors are discussed and explored in more detail in Chapters 3 and 4.
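Equation (2.6) can likewise be evaluated directly; the numbers below are illustrative assumptions only:

```python
def throughput(l_tot_msg, n_ip, n_cycles, f):
    """NoC throughput per Eq. (2.6): bits delivered per IP per second,
    given the total message bits received and the cycles consumed."""
    return l_tot_msg * f / (n_ip * n_cycles)

# Example: 16 IPs deliver 2**20 bits in 10,000 cycles at 500 MHz.
bps = throughput(2**20, 16, 10_000, 500e6)
print(bps)  # bits per second per IP
# Doubling the switch frequency doubles the throughput, as noted above.
assert throughput(2**20, 16, 10_000, 1e9) == 2 * bps
```

The linear dependence on f is visible directly in the formula; virtual channels improve the numerator instead, by letting more bits complete in the same N_cycles.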
2.7 High-level Physical Characteristics of NoC Architectures
The performance and associated cost of a NoC architecture are directly related to its physical interconnection structure, i.e., its topology. Each topology offers a unique combination of performance (throughput and latency) and cost (power and area) characteristics. As the number of topologies that can be considered for a NoC design increases, so does the need to predict the capabilities of each topology. To choose an appropriate topology from theoretical information, it is necessary to view a topology as a graph of nodes and links and to understand the physical characteristics it offers. Some network concepts from macro networks, such as switch degree, diameter, link cost, and average distance, are also applicable to NoC architectures. These physical attributes or properties of the networks can be used to perform a high-level comparison among different NoC topologies. The parameters are discussed in more detail as follows.
• Number of Switches: is defined as the total number of switches required to
fully interconnect all the nodes with a particular topology. Different topologies
require different numbers of switches for the same number of nodes.
• Switch Degree: is defined as the total number of input/output ports of a
switch. The operating frequency of a switch and its area requirements are
strongly related to this property: the higher the switch degree, the lower the
switch maximum operating frequency and the higher its area cost.
• Network Diameter: is defined as the minimum distance between the farthest
nodes in the network. It defines the maximum routing distance in a minimal
routing scheme. This property, however, is completely dependent on the
physical implementation of the topology; to be more precise for on-chip
networks, it should be defined as the maximum number of cycles between
two cores. The higher the value of this property, the longer messages take
to reach their destinations.
• Link Cost: is the minimum number of links required to fully interconnect
all the nodes in the topology. It defines the wiring cost associated with the
topology. The number of links in a NoC is not a critical resource, but the
associated delay of a link is a critical factor: longer interconnects may require
several pipeline stages to meet a given target system frequency. Since the real
link delay depends on the layout and the technology library used, it is very
difficult to provide an accurate estimate of delay based on link cost alone. On
the other hand, if a layout mapping is available, this parameter can be used
to obtain a good estimate of implementation cost in terms of metal resources
and the number of repeaters required for a given topology.
• Bisection Bandwidth: is defined as the smallest aggregated bandwidth ob-
tained by dividing the topology into two equal halves. It is a common measure
of theoretical network performance: the higher the bisection bandwidth, the
better the topology is suited to cope with high traffic loads. Since the primary
goal is to judge bandwidth and the resulting wiring demand; only data lines are
considered in the estimation process [27].
• Average Distance: is the average distance, measured in hops, from each node
to every other node in the network. The higher the value of this parameter,
the higher the communication cost in terms of power, since more power is
consumed as messages traverse longer distances.
• Symmetry: a topology is symmetric when the network looks the same from
every switch. A symmetric topology offers more communication paths, but at
the same time some areas may be under-utilized if the traffic is not evenly
distributed.
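Several of these parameters (diameter, average distance) can be cross-checked for a concrete topology by breadth-first search over its graph. The sketch below does this for a 4×4 mesh; the node numbering convention is an assumption:

```python
from collections import deque

def mesh_neighbors(k):
    """Adjacency list of a k x k 2-D mesh, node id = x + k*y."""
    adj = {x + k * y: [] for y in range(k) for x in range(k)}
    for y in range(k):
        for x in range(k):
            n = x + k * y
            if x + 1 < k:
                adj[n].append(n + 1); adj[n + 1].append(n)
            if y + 1 < k:
                adj[n].append(n + k); adj[n + k].append(n)
    return adj

def distances(adj, src):
    """Hop count from src to every node, by breadth-first search."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

adj = mesh_neighbors(4)
all_d = [d for s in adj for d in distances(adj, s).values()]
print(max(all_d))               # diameter of a 4x4 mesh: 6 hops
print(sum(all_d) / len(all_d))  # average distance over all pairs
```

For a k×k mesh the BFS result agrees with the closed form 2(k-1) for the diameter; the same two functions apply unchanged to any other topology expressed as an adjacency list.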
2.8 NoC Design Flow
The NoC design flow consists of a sequence of design activities [28][29][30]. The exact set and order of activities is always product specific. Designing NoCs so that they meet SoC design requirements is a complex process and requires a careful design strategy. The design choices made at each step have a significant impact on overall system performance and cost. Some of the most important phases in designing a
NoC include:
• Application Description - This phase is responsible for defining the communi-
cation needs of the system, such as frequencies or bandwidths. The general
characterization is done by means of a graph, where each vertex represents a
computation module or IP in the application (referred to as a task) and the
edges denote the inter-dependencies between the tasks. Alternatively, a table
can also be used to represent the application's communication requirements.
• Topology Selection - This phase of the design cycle involves exploring various NoC
topologies against design objectives such as communication delay, area, and power
consumption. The design choices span from standard regular topologies to
full custom topologies. The designer could even adopt a hierarchical or a mixed
topology scheme to satisfy the system requirements [6].
• IP Mapping - It is the process of mapping a given set of IP cores onto the tiles
of the selected communication architecture to satisfy the design requirements.
Many different types of algorithms have been proposed to achieve efficient
mapping of IP cores onto NoC architectures, based on bandwidth, latency,
and power awareness.
• Architecture Configuration - It involves selecting routing and switching schemes,
and fixing buffer sizes. Many heuristic-based design techniques exist to select
the values that best suit the architecture's communication needs and result
in a near-optimal solution.
• Design Synthesis and Validation - This involves describing the network components
in a Hardware Description Language (HDL) and synthesizing them with standard
synthesis tools. In this phase, standard component libraries for switches and
network interfaces can be used. The cost and performance numbers are obtained
from simulations and are dependent on the selected network components and their
corresponding configurations. Design validation of the NoC implementation is an
important step to verify the design against the initial requirements in terms of
communication latencies, throughput, area, and power.
In order to handle the design complexity and meet tight time-to-market constraints, it is important to automate most of these NoC design phases. To achieve design closure, the different phases should also be integrated in a seamless manner.
2.9 Summary
Developing a communication system with several tens or hundreds of processor-like resources is a formidable task and involves careful design considerations. Many factors may affect the choice of an appropriate interconnection network topology/architecture for an SoC design. At the architectural level, design space exploration with modeling and synthesis tasks is a must to fully understand the impact of the selected architecture on the overall performance and cost of the design. Many parameters in the architectural design phase can affect the key trade-offs between performance and power dissipation, such as the length of the physical wires, the switching techniques employed, buffer allocation, routing algorithms, the type of service level (guaranteed/not guaranteed), and the implementation of the topology itself.
In this chapter, four main NoC architectures (Cliche, BFT, SPIN, and Octagon) are discussed along with their performance and cost models. Some high-level architectural parameters for various NoC topologies are presented for a generic comparison only.
CHAPTER 3
NoC Router Architecture Design and Cost
A router (or switch) is one of the main building blocks of a NoC. Its main function is to make routing decisions and to forward packets arriving on the incoming links to the proper outgoing links. The high-level design of a generic switch required to implement packet-based communication is shown in
Figure 3.1. It mainly consists of input/output FIFO buffers, input/output arbiters,
Figure 3.1: A Generic Router Design
MUX units and control logic [31]. A switch in the network is connected to other switches and to an IP node through interconnects or links, forming the channels in
the network. The router design critically affects the performance and cost of the whole network in terms of throughput, latency, power, and area [32][33]. The number of input and output ports is generally small and depends on the architecture type. The time gap between when information enters an input port and when it leaves the switch through an output port is called the switch delay.
A detailed design of the switch architecture is presented in the following sections.
3.1 Main Parts of NoC Router
A baseline architecture for a NoC switch is shown in Figure 3.1. In this configuration, incoming flits are received by the Link Controller (LC) and stored in input buffers. The flow control logic is responsible for communicating buffer availability among neighboring switches, while the routing logic determines the output destination based on the information in the header flits [34]. Each incoming packet is directed to the Header Decoder Unit (HDU) to determine its destination. The inputs that are allowed to send data over the crossbar are determined by the switch arbiter, which resolves all conflicting requests for the same output ports. Optionally, the flits that cross the crossbar may be stored in output buffers. The operation of the switch thus consists of one or more of the following processes, depending on the nature of the flit.
For a header flit, the processing sequence is: 1) input arbitration, 2) routing, and 3) output arbitration. For body flits, switch traversal replaces the routing process, since the routing decision based on the header information is maintained for the subsequent body flits. A switch designed with Virtual Channels (VCs) also requires a VC allocator to select which output VC the input flits will use when leaving the switch. The design of the router is largely determined by the switching technique
supported: the majority of modern commercial routers found in high-performance multiprocessor architectures utilize some form of virtual cut-through switching or a variant such as wormhole switching. Some of the main components of a wormhole switch are as follows.
3.1.1 Input/Output Ports
Routers for different network topologies require different numbers of ports. For example, a router for a two-dimensional mesh-based topology consists of five input/output ports: four ports communicate with neighboring routers, and a fifth port is connected to a processing or storage unit through the network interface block.
Each port of the router is composed of input virtual channels, output virtual channels, a header decoder, a crossbar, an input arbiter, and an output arbiter. In most
Figure 3.2: An Input Port of the Switch
implementations there are five input and output ports: four from the four cardinal directions (North, East, South, and West) and one from the Network
Interface (NI). To increase channel utilization, Partha et al. [13] proposed switches with virtual channels. In a virtual channel switch, each port of the switch has multiple parallel channels made up of buffers, which helps increase switch throughput.
Virtual channels are discussed in more detail in the next section.
3.1.2 Virtual Channels
The design of virtual channels (VCs) is an important aspect of NoC. A virtual channel splits a single physical channel into multiple channels, virtually providing more paths over which packets can be routed. A physical channel may be split into anywhere from two to sixteen virtual channels. The use of VCs reduces network latency at the expense of area, power consumption, and production cost. However, VCs offer various other advantages. Since VCs provide more than one output path per channel,
Figure 3.3: An Input Port of the Switch with Virtual Channels
there is a lower probability that the network will suffer from deadlock, and the probability of network livelock is eliminated (this deadlock and livelock are different from architectural deadlock and livelock, which are due to violations in inter-process
communications). Virtual channels have been shown to improve switch throughput, but they may increase latency when too many virtual channels are added. The router design was analyzed to find the optimum number of virtual channels for the given design.
3.1.3 Buffers
These are FIFO buffers, used for storing messages in transit. FIFO, an acronym for First In, First Out, describes the behavior of the buffer:
it works according to the first-come, first-served principle. A FIFO fundamentally consists of storage elements from which data can be read and to which data can be written. The storage element can be an SRAM, a DRAM, a set of registers (D
flip-flops), or any other form of storage. In most NoC implementations, register-based buffers are used for their small size and low power consumption. In general, irrespective of the type of memory used, buffering helps in managing data traffic during congestion of packets (increased traffic) and during contention for resources (a situation where two or more sources compete for the same resource at the same time). In a buffered router model, a buffer is associated with both the input physical channels and the output physical channels. In this chapter, a FIFO-buffer-based router design is analyzed for power consumption.
3.1.4 Crossbar Logic
The crossbar connects the inputs of the switch to its outputs. It is a non-blocking network in the sense that an unused input can always be connected to an unused output without destroying the connections of other input/output pairs. The connection realized by the crossbar is determined by the switch controller. The crossbar
is commonly built with a single multiplexer per output, as shown in Figure 3.4. The
Figure 3.4: A 3x3 Crossbar Implemented using a Multiplexer for each Output
multiplexers are controlled by their select lines, which are connected to the grant signals from the arbiters. There are many ways to implement the multiplexers for a crossbar design: (i) using gates only, (ii) using pass transistors, (iii) using tristate inverters, or (iv) using smaller multiplexers instead of a single large one. A crossbar design is often represented as a 2-D array of transistors that short the input and output lines in order to achieve a connection. As shown in Figure 3.5, this is a similar approach; the only difference is the layout representation. The signals S[i,j] are used to control the tristate buffers for each column of the crossbar. In general, depending on the characteristics of the technology and the logic family being used, different multiplexer implementations may be best [35]. However, in a standard-cell, logic-synthesis-based design environment, some of the multiplexer implementations may not be possible.
Figure 3.5: 2-D Implementation of a 3x3 Multiplexer based Crossbar
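Behaviorally, the multiplexer-per-output crossbar of Figure 3.4 reduces to selecting, for each output, the input named by that output's grant signal. A minimal sketch (the None-means-idle convention is an assumption of this model):

```python
def crossbar(inputs, selects):
    """A non-blocking N x N crossbar modeled as one multiplexer per
    output: output j carries the input chosen by selects[j], which
    corresponds to the grant signal from output j's arbiter.
    None means that output received no grant this cycle."""
    return [inputs[s] if s is not None else None for s in selects]

# 3x3 crossbar: output0 <- input2, output1 <- input0, output2 idle.
print(crossbar(["A", "B", "C"], [2, 0, None]))  # ['C', 'A', None]
```

The non-blocking property is visible in the model: any permutation of selects is legal, since each output chooses independently.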
3.1.5 Input/Output Arbiter
An arbiter is required when multiple inputs use shared resources in the switch.
The arbiter resolves conflicts among requests for the same resource and grants access to only one of them. Besides this basic functionality, the arbiter should grant access in a fair manner, according to a criterion set by the design that provides equal service to the different requesters. An arbiter can use fixed-priority or variable-priority arbitration; however, to allow fair allocation of system resources and to achieve high system performance, an arbiter should be able to change the priority of the inputs dynamically. The way priority changes is part of the policy employed by the system's scheduling algorithm. For example, the widely employed round-robin policy states that the request served in the current cycle gets the lowest priority in the next cycle. A general Variable Priority Arbiter (VPA) can be implemented in many ways, and some of the mainstream designs are:
• A Priority-Encoding based VPA uses a Fixed Priority Arbiter (FPA) as its
main block, with additional control logic to make it a VPA. The FPA block does
not allow any dynamic priority change: position 0 always has the highest
priority and position n−1 the lowest. The FPA produces a grant signal G and an
additional flag AG to indicate that at least one input request was granted [35].
A grant is given to the i-th request when Ri = 1 and no request exists at
any position with index smaller than i. In order to support variable priorities,
the arbiter utilizes either multiple FPAs and a selection mechanism, or additional
circuits that mimic the behavior of a variable priority arbiter. Some
of the options for this are (i) Exhaustive, (ii) Rotate-Encode-Rotate, and (iii)
Dual-Path designs [36].
• A Carry-Lookahead based VPA is built using carry-lookahead-like struc-
tures. In this case the highest priority is declared using a priority vector P
that is encoded in one-hot form. As in the FPA, a request is granted
when no higher-priority input has an active request. The main character-
istic of carry-lookahead based VPAs is that they do not require multiple copies
of the same circuit, and they inherently handle the circular nature of priority
propagation [36].
• A Matrix Arbiter implements a Least Recently Served (LRS) priority scheme.
The matrix arbiter stores more priority bits than simpler VPAs and is thus able
to handle more complex relations among the priorities of the inputs. As the
name suggests, a matrix arbiter uses an N × N matrix of bits to store the priority
values. Each matrix element M[i,j] in the i-th row and j-th column records the priority
of i over j; the symmetric element M[j,i] carries the complementary information, and the elements of the diagonal M[i,i] have no physical meaning. Thus only the n(n-1)/2 elements of the priority matrix are actually needed. A high-level design for the matrix arbiter is shown in Figure 3.6.
Figure 3.6: A Matrix Arbiter Design
According to the operation of the matrix arbiter, a requester receives a grant if no higher-priority requester is bidding for the same resource. Once the request succeeds, its priority is updated and set to the lowest among all requesters.
The grant circuit is then used to allow only one virtual channel to access a physical channel. A separate arbiter can be used for the input and output ports. In a wormhole-based switch design, when a granted input virtual channel stores one whole flit, it sends a full signal to the controller. If it is a header flit, the Header Decoder Unit (HDU) determines the destination. Based on this information, the controller checks the status of the output port; if it is available, the path between input and output is established. All subsequent flits of the corresponding packet are sent from input to output using the established path.
If more than one input port tries to access the same output port simultaneously,
an output arbiter is used to allow only one input port to access the output port.
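The least-recently-served policy of the matrix arbiter can be sketched behaviorally as follows (a software model, not an RTL design; the update rule follows the description above):

```python
class MatrixArbiter:
    """Least-recently-served arbiter: bit m[i][j] = 1 means input i
    has priority over input j. After a grant, the winner's row is
    cleared and its column set, making it lowest priority next time."""
    def __init__(self, n):
        self.n = n
        # Initially input 0 beats 1, 1 beats 2, and so on.
        self.m = [[1 if i < j else 0 for j in range(n)] for i in range(n)]

    def grant(self, requests):
        """Return the index of the winning request, or None."""
        for i in range(self.n):
            if requests[i] and all(not (requests[j] and self.m[j][i])
                                   for j in range(self.n) if j != i):
                # Winner becomes least recently served.
                for j in range(self.n):
                    if j != i:
                        self.m[i][j] = 0
                        self.m[j][i] = 1
                return i
        return None

arb = MatrixArbiter(3)
print(arb.grant([1, 1, 0]))  # input 0 wins first...
print(arb.grant([1, 1, 0]))  # ...then input 1, since 0 is now lowest
```

Two contending inputs alternate grants under this policy, which is the fairness property the text requires of a good arbiter.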
3.1.6 Control Logic
The complexity of the Control Logic (CL) depends on the routing and scheduling algorithms being implemented. The control logic determines the output port for each incoming packet and arbitrates among inputs directed at the same output. The controller keeps track of the input and output virtual channels. The operation of the controller is illustrated by the state diagram shown in Figure 3.7. When the input virtual
Figure 3.7: State Diagram of Port Controller
channel is selected by the input arbiter, a complete flit is stored in the buffers and a full signal is sent to the controller. If it is a header flit, the controller then sends the enable signal to the header decoder and then header decoder determines the destina- tion. Once the destination has been determined, the controller checks whether the needed output port is available to accept a new flit. Each switch includes a status
57 register to indicate the availability of the output ports. If one of the virtual channel of the desired output port is available, then flits from the input virtual channel are forwarded to the available output virtual channel, and if it is not available then flits wait in the input virtual channel. A brief description about the states is as follows:
S1: checking the availability of input virtual channels; S2: storing a flit in the available input virtual channel; S3: destination address calculation; S4: checking output port availability; S5: transferring data from the input to the output virtual channel; S6: ending packet transmission; and S7: updating the output status register.
The algorithm to control 4 Virtual Channels is shown in Figure 3.8.
Figure 3.8: Control Flow Algorithm for Input Virtual Channels
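The seven controller states can be sketched as a simple transition function. This is a behavioral illustration only; the signal names (`vc_free`, `is_header`, `out_free`, `is_tail`) are hypothetical stand-ins for the hardware status signals described in the text.

```python
# States of the port controller (Section 3.1.6):
# S1: check input VC availability   S2: store flit in input VC
# S3: compute destination address   S4: check output port availability
# S5: transfer flit to output VC    S6: end packet transmission
# S7: update output status register

def next_state(state, signals):
    """One transition of the port-controller FSM (illustrative)."""
    if state == "S1":
        return "S2" if signals.get("vc_free") else "S1"
    if state == "S2":
        # A header flit must be decoded first; a data flit goes
        # straight to the output-port check (an assumed simplification).
        return "S3" if signals.get("is_header") else "S4"
    if state == "S3":
        return "S4"
    if state == "S4":
        return "S5" if signals.get("out_free") else "S4"
    if state == "S5":
        return "S6" if signals.get("is_tail") else "S5"
    if state == "S6":
        return "S7"
    if state == "S7":
        return "S1"
    raise ValueError(state)

# Walk a single-flit header packet through the controller.
state = "S1"
trace = [state]
for sig in [{"vc_free": 1}, {"is_header": 1}, {}, {"out_free": 1},
            {"is_tail": 1}, {}, {}]:
    state = next_state(state, sig)
    trace.append(state)
print(trace)  # ['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S1']
```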
3.2 Packet Format
Data that needs to be transmitted between source and destination is partitioned into fixed-length packets, which are in turn broken down into flits or words. A flit is the smallest unit of data that is transferred in parallel between two routers. A packet consists of three kinds of flits: the header flit, the data flit and the tail flit, which are differentiated by two bits of control information. The header flit contains information about the destination router for each packet [37]. Figure 3.9 shows the packet format for a header and a data flit. The first field of each flit identifies the flit type - header flit or data flit.

Figure 3.9: Packet Format

For a header flit, the second field contains the length of the address field, which is variable and depends on the number of IPs in the NoC. The third field is the packet length and contains the number of flits in the corresponding packet. The next two fields provide the source and destination addresses. Usually, for a given design, the length of the flit is constant, but the total number of flits in a packet can vary. For example, for a NoC with 1024 IP blocks, 10 bits are required for encoding the source and destination addresses in binary and 4
bits are required for indicating the address field length. If k represents the packet length field, then there are 2^k flits in the packet.
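The header format of Figure 3.9 can be illustrated with a small bit-packing sketch. The field widths below are illustrative choices consistent with the 1024-IP example (2-bit flit type, 4-bit address-length field, 6-bit packet-length field k, 10-bit source and destination), summing to a 32-bit flit; the actual encodings in the design may differ.

```python
# Hedged sketch of header-flit packing for the format of Figure 3.9.
HEADER, DATA, TAIL = 0b01, 0b00, 0b10   # flit-type encodings (assumed)

def pack_header(src, dst, pkt_len_log2, addr_bits=10, len_bits=4, k_bits=6):
    """Pack a header flit into an integer, MSB-first field order."""
    flit = HEADER
    flit = (flit << len_bits) | addr_bits        # address-field length
    flit = (flit << k_bits) | pkt_len_log2       # packet has 2**k flits
    flit = (flit << addr_bits) | src
    flit = (flit << addr_bits) | dst
    return flit

def unpack_header(flit, addr_bits=10, len_bits=4, k_bits=6):
    dst = flit & ((1 << addr_bits) - 1); flit >>= addr_bits
    src = flit & ((1 << addr_bits) - 1); flit >>= addr_bits
    k = flit & ((1 << k_bits) - 1);      flit >>= k_bits
    alen = flit & ((1 << len_bits) - 1); flit >>= len_bits
    return {"type": flit, "addr_len": alen, "flits": 2 ** k,
            "src": src, "dst": dst}

h = pack_header(src=17, dst=1000, pkt_len_log2=3)
print(unpack_header(h))
```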
3.3 NoC Router Design and Cost
In a NoC, routers are major sources of power consumption and area cost. To study their effect on the complete network, a set of router designs (with different numbers of ports and queue sizes) is implemented at the transistor level and at Register Transfer Level (RTL) using two different technology nodes. All the links between tiles are 32 bits wide, which is also the flit size of the design. The router implements wormhole routing, and every link can transport one flit per clock cycle. For the NoC design, it is possible to pre-calculate the lowest operating frequency that allows the NoC to meet all the bandwidth requirements for a given mapping. This is done by computing the aggregate bandwidth requirement of all communication flows overlapping on every individual inter-tile link and then dividing it by the link width, as shown in Equation 3.1.
Minimum Frequency = Aggregate Bandwidth / Link Width    (3.1)
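Equation 3.1 can be sketched numerically as follows. The per-link flow demands are hypothetical; the link width of 32 bits is taken from the design.

```python
# Sketch of Equation 3.1: the lowest clock frequency that still meets
# all bandwidth requirements for a given mapping.
LINK_WIDTH_BITS = 32

def min_frequency(flow_demands_bps, link_width=LINK_WIDTH_BITS):
    """flow_demands_bps: one list per inter-tile link, holding the
    bandwidths (bit/s) of the flows that overlap on that link."""
    worst_aggregate = max(sum(flows) for flows in flow_demands_bps)
    return worst_aggregate / link_width   # Hz (one flit per cycle)

# Three links carrying overlapping flows (bit/s), chosen for illustration:
links = [[800e6, 400e6], [1.6e9], [600e6, 600e6, 600e6]]
f_min = min_frequency(links)
print(f"minimum frequency: {f_min / 1e6:.2f} MHz")
```

The busiest link (1.8 Gbit/s aggregate) sets the floor, so the network must run at 56.25 MHz or faster to sustain this hypothetical mapping.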
The Design-I of the router (NoC port) is implemented in 90nm CMOS technology using
Cadence Design Tools at transistor level. Design-II of the router is implemented
using ASIC flow and ARM’s standard cell library in TSMC 65nm process with 1.0
Volt in Synopsys Design Environment. Design-I is basically a custom design and
Design-II is ASIC design. Most of the SoC designs for mobile application are designed
using ASIC flow, whereas SoC designs for high-end processors are developed using
a custom flow to achieve efficiency in terms of power, performance and area. The two designs provide different perspectives, but are not compared against each other. Design-I is discussed next.
3.3.1 Router Design-I
The NoC port of the router is implemented using Cadence Design Tools at the transistor level in 90nm CMOS technology. The power dissipation of the NoC port is determined at 200 MHz for the worst-case data input patterns. Considering a die size of 20mm × 20mm and a supply voltage (Vdd) of 1.2 V, the total power dissipation for different NoC topologies is calculated for different numbers of
IP blocks. A top level description of the inputs and outputs for the NoC port is shown in Figure 3.10. The implemented router design is composed of input FIFO
Figure 3.10: NoC Port I/O
buffers, output FIFO buffers, input/output arbiters and control logic. Both input and output buffering are common choices in wormhole routers. The components are not completely optimized but simplified to measure the energy consumed to transfer a flit through the router. The choices for the router design are in fact consistent with many other NoC designs. Virtual channels consist of several buffers controlled by an arbiter and a multiplexer, which grant access to only one virtual channel at a time according to the request priority. Increasing the number of virtual channels increases the complexity of the switch, especially of the arbiter component, as shown in Figure 3.11. When the number of virtual channels is eight,
Figure 3.11: Power per Component
the area of the arbiter and multiplexer is 35.5% of the total area of the port. For an 8-VC design, an 8x8 input arbiter and an 8x1 multiplexer are needed to control the input virtual channels. The 8x8 input arbiter consists of an 8x8 grant circuit and an 8x8 priority matrix. Similarly, a 4x4 input arbiter consists of a 4x4 grant circuit and a 4x4 priority matrix. The values of the grant signals are determined by the priority matrix. The number of grant signals equals the number of requests and the number of selection signals of the multiplexer. The area of two 4x4 input arbiters is smaller than the area of an 8x8 input arbiter, as shown in Figure 3.11. In this architecture, rather than using one multiplexer and one arbiter to control the virtual channels, two multiplexers and two arbiters are employed, as shown in Figure 3.12. Consequently, the area required
Figure 3.12: High Throughput Arbiter Design
to implement the switch with this architecture is less than the area consumed by the switch without these considerations. In the design, the virtual channels are divided into two groups, each controlled by one multiplexer and one arbiter. Each group of virtual channels is supported by one interconnect bus. This port architecture has a great influence on the switch frequency and the throughput of the network in comparison to the original switch. Also, the area of two 4x1 multiplexers is smaller than the area of an 8x1 multiplexer. The frequency of the switch design is characterized for different numbers of virtual channels and different network topologies, as shown in Figure 3.13. When the number of virtual channels is increased beyond four, the
Figure 3.13: Max. Frequency of Switch with different number of Virtual Channels
maximum frequency of the switch decreases for the BFT architecture. Since the complexity of the switch is reduced by the division, the frequency of the improved HT-design is better than that of the original design. Increasing the number of virtual channels has a direct effect on the traffic on the interconnects. Increased traffic raises contention on the bus and therefore increases the latency. The throughput is still increased, as more links are in the channels. Using the throughput equations and the frequency of the switch, the throughput for the various NoC architectures is calculated.
The variation of the throughput with the number of virtual channels for the various NoC architectures is shown in Figure 3.14. The High Throughput (HT) architecture increases the throughput of the network by 46% for the BFT architecture. The increase in throughput is minimum for the SPIN architecture. The throughput decreases when the number of virtual channels exceeds six in the Cliche and Octagon architectures. The increase in throughput for different HT architectures is presented in Table 3.1. The latency of the network depends on the frequency of the switch and the number of
Figure 3.14: Throughput vs. Virtual Channels for Different NoC Topologies
Table 3.1: Percentage Increase in Throughput for Different NoC Architectures
Architecture   Increase in Throughput (%)
HT-Cliche      23
HT-BFT         46
HT-Octagon     31
HT-SPIN        11
virtual channels. With the circuit optimization described earlier, the latency of the various NoC architectures is calculated and is shown in Figure 3.15. The latency of the BFT architecture is reduced by up to 59% for eight virtual channels. However, a large increase in the number of virtual channels can cause a severe increase in the latency of the network. The latency of HT-Octagon, HT-Cliche and HT-SPIN with six virtual channels is reduced by 42%, 37% and 10% respectively. Considering a die size of 20mm x 20mm and a system of 256 IPs, the power consumption for
Figure 3.15: Latency of NoC Topologies with Different Number of Virtual Channels
different NoC topologies is shown in Table 3.2. Since the interswitch links are short in Cliche, there is no need for repeaters within the interconnects. The BFT topology consumes the minimum area and power as compared to the other NoC topologies.
Power dissipation of the network increases rapidly with the number of IPs on the die. Therefore, finding ways to lower power dissipation is a primary concern in high speed, high complexity SoC designs. In a NoC, the switch has a different number of ports for different architectures. The switch of the BFT has six ports: four children ports and two parent ports. Each port of the switch includes input virtual channels, output virtual channels, a header decoder, a controller, an input arbiter and an output arbiter.
Each port can be used as either an input port or an output port. If the port is used as an input port, the input virtual channels, header decoder and crossbar are active. If the port is used as an output port, the output virtual channels are active. This, in conjunction with sleep transistors, can be used to lower the power consumption in the switch, as shown in Figure 3.16. The sleep transistors (M1) disconnect the input
Table 3.2: Power Consumption for Different NoC Architectures
Architecture   No. of Reps.   Switch Power (mW)   Reps+Links Power (mW)   Overhead (%)   Total Power (mW)
Cliche                    0   24448                1398                    5.4           25846
HT-Cliche                 0   23715                2796                   10.5           26511
BFT                     960   15664                1458                    8.5           17122
HT-BFT                 1920   15194                2916                   16.1           18110
Octagon                3840   19861                1094                    5.2           20955
HT-Octagon             7680   19265                2188                   10.2           21453
SPIN                  12288   32264               10613                   24.8           42877
HT-SPIN               24576   31296               21226                   40.4           52522
circuit from the supply voltage during the output mode. The sleep transistors (M2) disconnect the output circuit from the supply voltage during the input mode. The acknowledgment signals (Ackin and Ackout) provided by the control unit are used to control the stand-by transistors M1 and M2 respectively. According to the received values of the request signals (Reqin and Reqout), the control unit generates the acknowledgment signals to determine the operating mode of the port: input mode or output mode. Using the Cadence tools and TSMC 90nm CMOS technology, the NoC port with sleep transistors is implemented at the transistor level [38]. Given a die size of 20mm x 20mm and a supply voltage (Vdd) of 1.2V, the total power dissipation and the power dissipation per component are calculated and presented in Table 3.3 and Table 3.4 respectively.
Figure 3.16: NoC Port Design for Reducing Leakage Power
The change in the power consumption with the number of IP blocks for different network topologies is shown in Figure 3.17. The power consumption for different NoC topologies increases at different rates with the number of IP blocks. The SPIN and
Figure 3.17: Power Consumption for Different NoC Architectures
Table 3.3: Power Reduction Per Component using Sleep Transistors
Component        Power (mW)   Power with Sleep Mode (mW)   Reduction (%)
Input FIFO       3.618        0.1029                       97.2
Header Decoder   0.955        0.2157                       77.4
Crossbar         0.473        0.1274                       73.1
Output FIFO      3.562        0.1003                       97.2
Table 3.4: Power Reduction of a Switch for Different NoC Architectures using Sleep Transistors
Architecture   No. of Ports   Switch Power (mW)   With Sleep Transistors (mW)   Reduction (%)
Cliche         5              32.03               20.34                         36.5
BFT            6              41.29               29.60                         28.3
Octagon        3              18.66               10.87                         41.8
SPIN           8              54.66               39.07                         28.5
Octagon architectures have much higher rates of power dissipation increase. The power dissipation of SPIN increases by almost two orders of magnitude when the number of IPs is increased from 16 to 1024. The BFT architecture consumes the minimum power compared to the other NoC topologies, making BFT more attractive as a power-efficient
NoC topology.
3.3.2 Router Design-II using ASIC Design Flow
Design-II is the full router design implemented in an ASIC flow using 65nm technology. To analyze the power consumption of the router, a set of router designs (with different numbers of ports and VCs) is modeled in VHDL. Table 3.5 shows the main input and output ports of a one-virtual-channel router design.
Table 3.5: Input and Output Ports of NoC Router
Signal                              Direction   Width
Clk                                 in          1
Reset                               in          1
req_put (1 to 6)                    in          1
req_get (1 to 6)                    in          1
dest_read[2:0] (1 to 6)             in          3
data_in_input_fifo[7:0] (1 to 6)    in          8
full_input_fifo (1 to 6)            out         1
data_out_output_fifo[7:0] (1 to 6)  out         8
The router design and its basic building blocks were synthesized using ARM's standard cell library for the TSMC 65nm CMOS process with Vdd = 1.0 V in a typical corner using the Synopsys Design Compiler (DC) tool. The gate-level netlist obtained from this step was imported into Synopsys PrimeTime PX (PTPX), a tool for power calculation. Timing was verified prior to the power calculations. To calculate the average power consumption based on switching activity, different experiments were performed using PTPX to measure the power consumption of the design at various operating frequencies. In order to analyze the power consumption of a design, it is important to consider all the factors which contribute to both the static and dynamic power. These are shown in Figure 3.18. The design netlist is required to
Figure 3.18: Power Analysis Requirements
determine the design activity and the types of cells used in the design, and to accurately compute the capacitance on the drivers. Cell library models are necessary in order to compute the internal power of the CMOS cells used in the design, and the signal activity of the design affects both the static and dynamic power consumption. The static power (cell leakage) is often state dependent, and the dynamic power is directly proportional to the toggle rate of the pins. Net parasitics (or capacitances) affect the dynamic power of the design. Switching power is directly proportional to the net capacitance. Internal power depends on both the input signal transition times, which are determined by the net parasitics, and the output load, which is a combination of the net parasitics and the input pin capacitances of the fanout. Figure
3.19 shows the methodology used to measure the power consumption of the router design using the standard switching activity file format.
In this flow, the analyze and elaborate commands read the RTL design into memory and convert it to a technology-independent format called the GTECH design. This is done using the Design Compiler tool, which is the core of the Synopsys synthesis software.

Figure 3.19: Power Measurement Flowchart for NoC Routers using Synopsys Tools

Then, a forward-annotated Switching Activity Interchange Format (SAIF) file is generated using the rtl2saif command of Synopsys. This forward-annotated file contains directives that determine which design elements are to be traced during simulation. The forward-annotated SAIF file is fed into the simulator with the
VHDL test bench and technology files to generate a back-annotated SAIF file. The back-annotated SAIF file contains information about the switching activity of the synthesis-invariant elements in the design. Then, the back-annotated SAIF file is used with the gate-level netlist (Data-Base (DB) file) produced by the Design Compiler to calculate the power consumption of the router. PTPX is used for calculating the power consumption of the switches.
Power dissipation in a NoC arises from two different sources: 1) Switches and 2)
Interswitch Links. A comparative analysis of the power consumed by the different components of the router is shown in Figure 3.20, together with area estimates. For Design-II, an average
Figure 3.20: Power per Component
power was measured using a wireload model for the synthesis and static switching activity on all input ports. In relative terms, the arbiter's contribution to total power is lower than 8%, while the input and output buffers consume approximately 70% of the total power. This breakdown reflects the typical power consumption percentages for wormhole switches at the individual level. Router power is a direct function of the number of ports required. Different NoC architectures require different numbers of switches with different numbers of ports. For example, a 16-IP Cliche architecture will have 4 switches with 5 ports, 8 switches with 4 ports and 4 switches with 3 ports. At the network level, the aggregated power consumption of the switches plays a larger role in determining the total power consumption. The total power consumed by the different NoC architectures is shown in Figure 3.21 and is also listed in Table 3.6. The BFT architecture consumes the minimum power and SPIN the maximum.

Figure 3.21: Total Power Consumed by Routers for Different Number of IPs

The rate of increase is more pronounced with the increase in the number of IPs.
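The network-level aggregation for the 16-IP Cliche example can be sketched numerically. The per-router powers at 200 MHz are taken from Table 3.7; the 3-port router power is not listed there and is linearly extrapolated below, which is an assumption of this sketch, not a measured value.

```python
# Aggregate router power for a 16-IP Cliche (4x4 mesh) at 200 MHz.
power_mw = {4: 2.56, 5: 3.30, 6: 3.99, 8: 5.33}   # Table 3.7, 200 MHz
# 3-port value extrapolated from the 4- and 5-port entries (assumed).
power_mw[3] = power_mw[4] - (power_mw[5] - power_mw[4])

def mesh_router_counts(n):
    """Router port counts for an n x n mesh: 4 corner (3-port),
    4*(n-2) edge (4-port), (n-2)**2 interior (5-port) routers."""
    return {3: 4, 4: 4 * (n - 2), 5: (n - 2) ** 2}

total = sum(count * power_mw[ports]
            for ports, count in mesh_router_counts(4).items())
print(f"16-IP mesh router power: {total:.2f} mW")
```

The result (roughly 41 mW) is close to the 41.64 mW reported for the 16-IP Cliche in Table 3.6, which suggests the extrapolation is in the right range.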
Low Power Router Design: The total power consumed by a NoC router is the sum of the dynamic and leakage powers. Dynamic power, or switching power, is primarily the power consumed when the device is active - that is, when signals are changing values.
Static power is the power consumed when the device is powered up but no signals are changing value.

Table 3.6: Power Overhead of Routers for Different NoC Architectures in RVT process (f = 200 MHz, α = 0.1)

               Power Consumption (mW)
Architecture   16 IPs    64 IPs    256 IPs   1024 IPs
Cliche         41.64     188.20    798.12    3285.16
BFT            21.08     106.00    489.04    2021.20
SPIN           31.56     211.52    846.08    3384.32
Octagon        42.44     178.64    732.32    2929.28

The first and primary source of dynamic power consumption in
a CMOS logic gate is switching power - the power required to charge and discharge
the output capacitance on a gate. Dynamic power dissipated by a CMOS design is
largely described by the equation:
Pdyn = α · Cl · Vdd² · fclock
where Cl is the capacitance (a function of fanout, wire length and transistor size), Vdd is the supply voltage, α is the activity factor (how often a wire switches) and fclock is the clock frequency. Because dynamic power is linearly proportional to the switching frequency, using a lower frequency (if possible) is a primary way to reduce power. Table 3.7 lists the power consumption of Design-II NoC routers at various operating frequencies for α = 0.1. As the data show, the power consumed by any switch is a direct linear function of frequency.
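The linear dependence on frequency can be checked directly from the equation. The capacitance and voltage below are illustrative values chosen for the sketch, not measurements from the design.

```python
# Sketch of the dynamic-power equation Pdyn = alpha * Cl * Vdd^2 * fclock.
def dynamic_power(alpha, c_load_farads, vdd_volts, f_hz):
    """Dynamic (switching) power in watts."""
    return alpha * c_load_farads * vdd_volts ** 2 * f_hz

# Example: alpha = 0.1, 200 pF of switched capacitance, Vdd = 1.0 V.
for f_mhz in (100, 200, 400):
    p = dynamic_power(0.1, 200e-12, 1.0, f_mhz * 1e6)
    print(f"{f_mhz} MHz -> {p * 1e3:.2f} mW")
```

Doubling the frequency doubles the power, while halving Vdd would cut the power by a factor of four, which is why voltage reduction (when timing allows) is the most effective knob.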
The rate of increase for a 6-port router over the frequencies shown is almost 2 mW per 100 MHz.

Table 3.7: Power Consumption of 4-, 5-, 6-, 7- and 8-port NoC routers at various operating frequencies

            Power Consumption (mW)
Frequency   4-Port   5-Port   6-Port   8-Port
500 MHz     6.38     8.22     10.05    13.96
450 MHz     5.73     7.40      9.00    11.86
400 MHz     5.07     6.57      7.95    10.84
350 MHz     4.46     5.72      6.95     9.42
300 MHz     3.80     4.91      5.96     8.51
250 MHz     3.18     4.12      4.97     6.86
200 MHz     2.56     3.30      3.99     5.33
150 MHz     1.94     2.50      3.02     4.38
100 MHz     1.30     1.68      1.94     2.92

The fact that dynamic power is linearly proportional to the capacitance being switched is more of a design and implementation constraint, and is improved primarily by reducing the length of the inter-router interconnects driven
and design complexity or by area. The voltage term has the greatest effect on power,
and if the frequency can be reduced to allow a reduction in the voltage, the power is
reduced quadratically. Leakage power (or static power) is a function of the supply voltage (Vdd), the threshold voltage (Vt), and the transistor sizes. The scaling of threshold voltages has been a large factor in the increasing leakage currents in smaller technology generations. A trend for leakage power vs. technology nodes in Intel processors is shown in Figure 3.22. Basically, at 90nm and below, leakage power management is essential in the design process. Leakage power can consume up to half of the total power dissipated by the transistors. There are four main sources of leakage currents in a CMOS gate: (i) Sub-threshold Leakage - the current which flows from
Figure 3.22: Leakage Power vs Technology Nodes [5]
the drain to the source of a transistor operating in the weak-inversion region,
(ii) Gate Leakage - the current which flows directly from the gate through the oxide to the substrate due to gate oxide tunneling and hot carrier injection, (iii) Gate Induced
Drain Leakage - the current which flows from the drain to the substrate, induced by a high field in the MOSFET drain caused by a high VDG, and (iv) Reverse Bias Junction
Leakage - caused by minority carrier drift and generation of electron/hole pairs in
depletion regions. A guide to when each type of leakage should be considered is provided in Table 3.8. There are several approaches to minimizing leakage current. One
technique is known as Multi-VT: using high-VT cells wherever performance goals allow and low-VT cells where they are necessary to meet timing. In an ASIC design flow, there are three types of library cells: HVT, LVT and RVT. The router Design-II was evaluated at different frequencies to examine this tradeoff. The leakage power savings in percent for a 6-port router design are shown in Figure 3.23.
Table 3.8: A Guide for Leakage Power Considerations
Parameter            L (µm)    Tox (nm)   Isub   Igate   Ijunc
Long Channel         > 1       > 3        x      x       x
Short Channel        > 0.18    > 3        y      x       x
Very Short Channel   > 0.090   > 3        y      y       x
Nanometer Channel    < 0.090   < 2        y      y       y
(y = should be considered, x = can be neglected)
The LVT process consumes the highest leakage current, whereas
Figure 3.23: Difference in Leakage Power for a 6-Port Router Design using Different Vt Cells
the HVT process the minimum. The design was functional over a wide range of frequencies, with the major tradeoff being area. Figure 3.24 shows the increase in area with frequency for all three processes. To meet timing and compensate for the
Figure 3.24: Frequency vs. Area of the Switch
delay in the HVT process, smaller cells are exchanged for bigger cells. Thus the HVT process consumes the maximum area to meet the timing requirements. The routers were implemented in all the processes, and the power consumed by the different NoC architectures is listed in Table 3.9. Even though the LVT process has higher leakage
Table 3.9: Power Dissipation for a Network of 64 IPs at 200 MHz and α = 0.1
Power Consumption (mw) Area (mm2) Architecture LVT HVT Reduction (%) LVT HVT Reduction (%) Cliche 198.52 450.44 55.93 44.92 45.04 0.27 BFT 111.84 207.76 46.17 25.36 25.48 0.47 SPIN 235.52 406.4 42.05 49.29 49.39 0.20 Octagon 188.4 352.92 46.62 42.79 42.89 0.23
power, the total power consumed by the architectures in the LVT process is much less than that in the HVT process. By using the LVT process for a system of 64 IPs, as much as 56% power savings can be achieved for the Cliche architecture. Again, BFT is the more power-efficient topology in both processes.
3.4 Summary
In a NoC design, routers consume a significant amount of power and present a significant area cost. In this chapter, the NoC router design is discussed in detail and is evaluated for power consumption, area and performance at various operating frequencies. Based on the router power, the total power consumption of different NoC architectures is evaluated for different numbers of IPs. The main goal of a NoC design is to consume low power while delivering the same performance. To achieve a low power NoC design, it is essential to implement power saving techniques at each level of design abstraction. Within the router design, different approaches to saving dynamic and static power are discussed and evaluated in this chapter.
CHAPTER 4
High Performance NoC Interconnects
Early CMOS processes had a single metal layer, and for many years only two to three metal layers were available, but with advances in chemical-mechanical polishing and other semiconductor processes, it is now far more practical to manufacture many more layers in silicon technology than ever before. As shown in Figure 4.1,
Figure 4.1: Metal Layers in different technology nodes
a 180 nm process has 6 metal layers, and the layer count has been increasing at a rate of about one per technology generation. Table 4.1 lists the technology parameters and equivalent circuit model parameters from the ITRS reports (2001-2010) for technology nodes from 130nm to 11nm. The minimum interconnect spacing Smin is assumed to be equal to the minimum interconnect width Wmin in all technologies, εr is the relative permittivity of the dielectric and ρ is the resistivity of the Cu material.

Table 4.1: Technology and Circuit Model Parameters from ITRS Reports (2001-2010)

Year of Production         2001   2004   2007   2010   2013   2016   2019   2022
Technology Node (nm)       130    90     65     45     32     22     16     11
Number of Metal Layers     9      10     11     12     13     13     14     15
Metal 1 Wire Wmin (nm)     130    90     65     51     31     22     15     11
Int. Wire Wmin (nm)        225    132    98     51     31     22     15     11
Global Wire Wmin (nm)      335    230    145    68     48     33     24     16
Global Wire T (nm)         670    482    319    236    112    77     56     37
Tox (µm)                   6.3    4.7    3.9    2.9    2.4    1.9    1.4    1.04
Relative Permittivity εr   3.3    3.1    3.0    2.7    2.6    2.3    2.1    1.8
Resistivity ρ (µΩ·cm)      2.2    2.2    2.2    2.2    2.06   2.06   2.06   2.06

As CMOS technology continues to scale, wiring delay is dominating gate delay. Figure 4.2 shows gate delay vs. wire delay at different technology nodes. Wiring delay doubles with each technology node and increases quadratically as a function of wire length. Even a small number of global interconnects, where the signal delay is very high, can have a significant impact on system performance and may also influence timing closure of the design. Additionally, with technology scaling, global wires are currently undergoing a reverse-scaling process resulting in wider and thicker top metal layers and an increase in the wire
Aspect Ratio (AR). This leads to an increase in the self and coupling capacitances, which causes global communication to become an increasingly power-consuming process.
Figure 4.2: Gate Delay vs. Wire Delay in Different Technology Nodes
In a NoC, the interswitch wire segments are the longest on-chip wires after the clock, power and ground wires [39]. Mostly global and semi-global wires are suitable and recommended for NoC interconnects, since one of the most critical challenges for NoC interconnects is to provide the desired system bandwidth for the SoC design. It is increasingly important for many large-scale SoC designs to have higher bandwidths to satisfy massive inter-processor communication. Bandwidth is critical, in part because higher bandwidth decreases contention, and in part because phases of a program may push a large volume of data around without waiting for the transmission of individual data items to complete. In order to achieve higher bandwidth in a NoC, it is possible to design pipelined routers such that they process one flit per cycle, but the duration of the clock cycle usually determines how fast each flit can be processed in the network. In nanometer NoCs, this cycle time is not limited by the logic in between two clocked elements, but by the links between two routers, thereby limiting the system performance.
Continued technology scaling has thus posed a set of new challenges to the design community. Interconnects in deep nanometer technologies suffer from three major problems: (1) large propagation delay due to capacitive coupling induced by tighter geometries (the delay problem), (2) high power consumption due to the increase in both self and coupling capacitances (the power problem) and (3) increased susceptibility to errors due to the increase in deep submicron noise [40]. As a result, interconnects can no longer be ignored in the design cycle and must be accounted for early in the process. In this chapter, a layout-aware analysis of NoC interconnects is presented to achieve a high performance and low power NoC design.
4.1 NoC Interconnects
In a NoC design, the wires linking two switches are called interconnects. NoC interconnects play a crucial role in communication and can have a large impact on the total power consumption, wiring area and system performance. One of the most critical challenges of a NoC design is to provide the desired bandwidth set forth by the main SoC design to maintain a certain performance threshold. However, as technology scales into the nanometer domain, achieving higher bandwidths for communication channels can be tricky and may require some mitigation schemes [41][42]. NoC links typically consist of a number of parallel signal wires of fixed width and spacing, as shown in Figure 4.3. These links can be used directly to express a number of metrics, such as Data Rate, Bandwidth Density or Bisectional Bandwidth. However, Channel Bandwidth (Data Rate) is the preferred and most appropriate metric to estimate system
Figure 4.3: NoC Interconnects
performance and can be expressed as follows
Channel Bandwidth = N / Delay    (4.1)
where N is the total number of signal wires in the link and Delay is the delay of a single wire. To achieve higher bandwidth, it is thus important that the delay is minimal. Interconnect delay is a function of wire resistance and capacitance. The delay of a distributed RC line driven by an ideal driver (zero output impedance) at the near end, with an open termination at the far end, can be expressed using the Elmore
Delay model.
tDelay = 0.4 · R · C · L²    (4.2)
where L is the wire length, R is the wire resistance per unit length, C is the capacitance per unit length and tDelay is the wiring delay. This approximation is reported to be accurate to within 5 percent for a wide range of R and C values.
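Equations 4.1 and 4.2 can be combined in a short sketch to show how quickly the quadratic wire delay erodes channel bandwidth. The per-unit-length R and C values below are illustrative placeholders, not extracted parasitics from any particular node.

```python
# Sketch combining Equations 4.1 and 4.2: channel bandwidth of an
# N-wire link whose per-wire delay follows the Elmore model.
def elmore_delay(r_per_m, c_per_m, length_m):
    """Distributed RC line delay: t = 0.4 * R * C * L**2 (seconds)."""
    return 0.4 * r_per_m * c_per_m * length_m ** 2

def channel_bandwidth(n_wires, delay_s):
    """Equation 4.1: N / Delay, in bit/s."""
    return n_wires / delay_s

R = 400e3        # ohm/m  (assumed wire resistance per unit length)
C = 200e-12      # F/m    (assumed wire capacitance per unit length)
for length_mm in (1, 2, 4):
    d = elmore_delay(R, C, length_mm * 1e-3)
    bw = channel_bandwidth(32, d)
    print(f"{length_mm} mm: delay {d * 1e12:.1f} ps, "
          f"bandwidth {bw / 1e9:.1f} Gbit/s")
```

Doubling the wire length quadruples the delay and cuts the channel bandwidth by four, which is exactly why interswitch link length matters so much in a NoC.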
In NoC design, the minimum conceivable clock cycle time of a highly pipelined design can be assumed to be equal to 15 FO4, with FO4 defined as the delay of an inverter driving four identical ones [16]. In different technology nodes, FO4 can be estimated as

FO4 ≈ 425 × Lmin    (4.3)

where Lmin is the minimum gate length of the technology node [43]. Figure 4.4 shows the requirement that one clock cycle places on the resource size. For a high performance design (operating at its maximum frequency), the interconnect delay between resources should be less than the 15 FO4 time. In long wires the intrinsic RC delay can easily
Figure 4.4: One Clock Cycle Requirement for High Performance NoC Designs
exceed this 15 FO4 limit, thereby limiting the clock cycle time of the design, and as a consequence the system bandwidth may suffer. Figure 4.5 shows the intrinsic RC delay at different technology nodes together with the 15 FO4 limits. In a NoC, the length of the interconnects is a function of the die size, the architecture and the number of IPs; thus, depending on the length of the wires, different techniques may be necessary to reduce the intrinsic RC delay.
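The 15 FO4 budget can be turned into a rough maximum link length by inverting Equation 4.2. Two assumptions are made in this sketch: the common convention that Equation 4.3 yields FO4 in picoseconds when Lmin is in micrometres, and fixed per-unit-length R and C (in reality the wire parasitics change with each node).

```python
# Sketch of the 15*FO4 timing budget versus intrinsic RC wire delay.
def fo4_ps(lmin_um):
    """Equation 4.3: FO4 delay in ps, Lmin in micrometres (assumed units)."""
    return 425 * lmin_um

def max_wire_length_mm(lmin_um, r_per_m, c_per_m):
    """Longest wire whose Elmore delay 0.4*R*C*L^2 fits within 15*FO4."""
    budget_s = 15 * fo4_ps(lmin_um) * 1e-12
    return (budget_s / (0.4 * r_per_m * c_per_m)) ** 0.5 * 1e3

R, C = 400e3, 200e-12   # assumed per-metre wire parasitics, held constant
for node_um in (0.130, 0.065, 0.032):
    print(f"{int(node_um * 1000)} nm: 15*FO4 = "
          f"{15 * fo4_ps(node_um):.0f} ps, "
          f"max link ~ {max_wire_length_mm(node_um, R, C):.1f} mm")
```

The budget shrinks with each node while die sizes do not, so an ever larger fraction of interswitch links needs repeaters or pipelining to stay within one cycle.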
4.1.1 Performance Optimization Using Intrinsic RC Model
The delay of a wire is a function of the RC time constant (R is the Resistance and C is the Capacitance). In older technologies, wires were wide and more cross
86 Figure 4.5: Intrinsic RC delay and 15FO4 limit
section-sectional area implied less resistance and more capacitance. It was acceptable to model wires with capacitance only, but as fabrication technologies scale down, the width of wires is reduced [44]. As a result, wire resistance per mm is increasing and is no longer negligible. Short wires can be modeled with a lumped RC approximation. The resistance of a wire can be defined by the following expression:

R = ρ·L / (T·W)

Where ρ is the metal resistivity, L is the length, T is the thickness, and T·W is the cross-sectional area of the wire. Resistivity depends on the material used. Aluminum, once the main material for interconnect, has long been replaced by copper for its lower resistivity (1.7 µΩ·cm). In nanometer technologies, however, even copper interconnects are becoming increasingly inadequate to meet the speed and power dissipation goals of highly scaled ICs. Table 4.2 lists
Figure 4.6: Interconnect Resistance
resistivity of some of the commonly used materials. Wire capacitance, on the other
Table 4.2: Bulk Resistivity of Pure Metals at 22 °C

Metal              Resistivity (µΩ·cm)
Silver (Ag)        1.6
Copper (Cu)        1.7
Gold (Au)          2.2
Aluminum (Al)      2.8
Tungsten (W)       5.3
Molybdenum (Mo)    5.3
Titanium (Ti)      43.0
hand, is more complex, as many of its sub-components are geometry dependent. Capacitance extraction is usually carried out by representing complex structures as a collection of simple geometric elements; the parasitic values of the elements are then combined using superposition, or by introducing scale factors, to obtain the parasitics of the complex structure. Many commonly used industrial tools extract the wire capacitance parameters for a given structure, such as FastCap, FastHenry, StarRC, and QRC. For modeling and estimation purposes, however, some simple techniques are applicable and can be used. As shown in Figure 4.7, the capacitance per
unit length of a wire can be modeled by four parallel-plate capacitors for each side
and fringing capacitance. Accurate modeling of wire capacitance is, however, a non-trivial task and remains a subject of ongoing research. The three major
Figure 4.7: Cross-Sectional View of Semi-Global Layer Interconnects
components of wire capacitance shown in Figure 4.7 are related to the geometry by
the following relation:

C_T = C_a + 2·C_b·W + C_c/S    (4.4)
Where CT is the total capacitance, Ca is the fringing capacitance, Cb is the parallel plate capacitance due to the top and bottom layers of metal and is proportional to the interconnect width, and Cc is the coupling capacitance between neighboring interconnects and is inversely proportional to the interconnect spacing S. The parallel
plate capacitance C_b can be described as

C_b = ε_ox·L / H    (4.5)

Where L is the length of the wire, H is the dielectric height, and ε_ox is the dielectric permittivity, defined as

ε_ox = ε_r × ε_0    (4.6)
Where ε_r is the relative permittivity of the dielectric material and ε_0 is the permittivity of free space. SiO2 is mostly the dielectric material of choice in integrated circuits. Recently, low-k dielectrics with lower permittivity are coming into use in newer technologies to reduce wiring capacitance. They are an attractive option because they reduce both wire delay and power consumption. Adding fluorine to the silicon dioxide creates fluorosilicate glass (FSG) with a dielectric constant of 3.6, widely used in 130nm processes. Adding carbon to the oxide can reduce the dielectric constant to 2.7-3.0. Alternatively, porous polymer-based dielectrics can deliver even lower dielectric constants. For example, SiLK, from Dow Chemical, has k = 2.6 and can be scaled to k = 1.6-2.2 by increasing the porosity. Developing low-k dielectrics that can withstand the high temperatures during processing and the forces applied during CMP is a major challenge. Table 4.3 presents the relative permittivity of several dielectrics used in integrated circuits.
Fringing and coupling capacitance, on the other hand, are more difficult to compute and require a numerical field solver for exact results. However, for modeling and estimation purposes, empirical formulas which are computationally efficient and relatively
Table 4.3: Relative Permittivity ε_r of some Dielectric Materials

Material                    Relative Permittivity ε_r
Free Space                  1
Aerogels                    1.5
Polyimides (Organic)        2-4
Silicon Dioxide             3.9
Glass-Epoxy (PC board)      5
Silicon Nitride (Si3N4)     7.5
Alumina                     9.5
Silicon                     11.7
accurate are

C_a = ε_ox·L·[ W/H + 0.77 + 1.06·(W/H)^0.25 + 1.06·(T/H)^0.5 ]    (4.7)

C_c = ε_ox·L·[ 0.03·(W/H) + 0.83·(T/H) − 0.07·(T/H)^0.222 ]·(H/S)^(4/3)    (4.8)
These empirical formulas are accurate to within 6% for processes with aspect ratios less than 3.3 [35]. From equation 4.4, it can be seen that increasing the width of the wire significantly decreases resistance, while causing only a modest increase in capacitance due to the top and bottom layers. Because this increase in capacitance is less than proportional, the overall RC delay still improves. Similarly, increasing the spacing between adjacent wires significantly reduces the coupling capacitance while leaving resistance unchanged, which also reduces RC delay. While the T and H parameters are fixed for each metal layer in a given
process technology, parameters W and S can be chosen by the link designer to achieve an acceptable delay. By allocating more metal area per wire and increasing the wire width and spacing, the overall effect is that the product of R_wire and C_wire decreases,
resulting in lower wire delays. If a design is limited by the wiring space available then
varying W and S for optimal delay will have an impact on the number of wires in
the link by the following relation
A_wire = N·W + (N − 1)·S    (4.9)
The primary difference between wires in the different types of metal layers is the wire
width and spacing (in addition to the thickness). Increasing interconnect width and
space in a limited area will reduce the number of links and thus the overall system
bandwidth. As a result, these geometric adjustments to achieve lower delay can create
an upper bound on the conceivable bandwidth.
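The resistance expression and the link-area relation of equation 4.9 can be sketched as a pair of helper functions. The copper wire dimensions and the 800/330 nm width and spacing used in the example are illustrative assumptions, not the optimal values derived elsewhere in this chapter.

```python
import math

# Sketch of the geometric link relations above: wire_resistance implements
# R = rho*L/(T*W), and wires_in_link inverts Eq. 4.9 (A = N*W + (N-1)*S)
# to find how many wires fit in a given link area width.

def wire_resistance(rho_ohm_m, length_m, thickness_m, width_m):
    """Resistance of a rectangular wire: R = rho * L / (T * W)."""
    return rho_ohm_m * length_m / (thickness_m * width_m)

def wires_in_link(area_width_nm, w_nm, s_nm):
    """Largest N with N*W + (N-1)*S <= A (from Eq. 4.9)."""
    return math.floor((area_width_nm + s_nm) / (w_nm + s_nm))

# 1 mm copper wire (rho = 1.7e-8 ohm*m), 200 nm x 200 nm cross section
r = wire_resistance(1.7e-8, 1e-3, 200e-9, 200e-9)  # 425 ohm
n = wires_in_link(10970, 800, 330)                 # 10 wires fit exactly
```

Widening W or S in `wires_in_link` directly shrinks N, which is the bandwidth-versus-delay tradeoff described above.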
4.1.2 Performance Optimization using Repeater Insertion
Widening a uniform line has a marginal impact on the overall wire delay [10].
The resistance and capacitance of a wire are both linear functions of the wire length.
Hence, the delay of a wire, which depends on the product of wire resistance and
capacitance, is a quadratic function of wire length. For longer NoC interconnects,
wire sizing and spacing alone may not be sufficient to limit this quadratic growth. A
simple and more effective strategy for reducing the delay of a long interconnect is to
strategically insert buffers along the length of the line. These buffers are typically
called repeaters and the process is called repeater insertion. In this technique, the
delay of a wire is reduced by splitting the line into multiple smaller segments of
equal lengths and by inserting a repeater between each segment to actively drive the
wire. As a result, wire delay becomes a linear function of wire length. Figure 4.8
shows an interconnect line with inserted repeaters. In repeater insertion, usually the
Figure 4.8: Interconnect with Repeaters
decreased interconnect delay is partially offset by the additional delay of the inserted
repeaters. Overall wire delay can be minimized by selecting optimal repeater sizes
and spacing between repeaters [43] and this technique is commonly employed in
modern-day processors. A number of repeater insertion methodologies for different
types of optimization exist [45][46][47]. The minimum delay of the resulting RC circuit is achieved when the delay of the repeater equals the wire segment delay. Using the methodology presented in [48], the optimal repeater size h_opt and the optimal inter-buffer segment length k_opt can be calculated using equations 4.10-4.11.

h_opt = √( R_s·C / (R·C_s) )    (4.10)

k_opt = √( 2·R_s·(C_s + C_p) / (R·C) )    (4.11)
Where R_s and C_s are the resistance and capacitance of a minimum-size inverter, C_p is its output parasitic capacitance, R is the resistance of the wire per unit length, and C is the capacitance of the wire per unit length. Similarly, the optimal width and optimal spacing are given in equations 4.12 and 4.13, respectively.
W_opt = √( (C_a·S_opt + C_c) / C_b )    (4.12)

S_opt = √( C_c·W_opt / (C_a + C_b·W_opt) )    (4.13)
If an interconnect is divided into n segments with a repeater driving each section, then the total wire delay equals the number of repeated sections multiplied by the individual section delay. Furthermore, the delay of one segment of length k_opt driven by a buffer of size h_opt is given by

τ_opt = 2·R_s·(C_o + C_p)·( 1 + √( 2·C_o / (C_o + C_p) ) )    (4.14)
The intrinsic RC delay of the longest interconnect (10mm) in 65nm technology is calculated to be 3537.6 ps, whereas the 15FO4 time is 414.375 ps in the same technology node. The achievable frequency by the technology-node definition is 2.41 GHz, whereas due to the length and associated delay of the longest interconnect, the achievable frequency is limited to only 0.28 GHz. Using equations 4.10-4.13 for buffer insertion and for width and spacing optimization, the delay can be improved to 381.94 ps (a frequency of 2.61 GHz). With optimal repeater insertion, the growth of the interconnect delay becomes almost linear with wire length. However, for a large high-performance NoC design, the total number of such repeaters can be prohibitively large, taking up a significant portion of the silicon and routing area and additionally consuming a significant amount of power. Power dissipation is discussed in the next section.
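The sizing rules of equations 4.10-4.11 and the resulting linear delay growth can be sketched as follows. The device values (R_s, C_s, C_p) and the per-mm wire parasitics in the example are illustrative assumptions; only the unit consistency matters.

```python
import math

# Sketch of optimal repeater sizing (Eqs. 4.10-4.11) and the linear delay of
# a repeated wire. All device and wire values below are illustrative.

def optimal_repeaters(rs_ohm, cs_f, cp_f, r_ohm_per_mm, c_f_per_mm):
    h_opt = math.sqrt(rs_ohm * c_f_per_mm / (r_ohm_per_mm * cs_f))  # Eq. 4.10
    k_opt = math.sqrt(2 * rs_ohm * (cs_f + cp_f)
                      / (r_ohm_per_mm * c_f_per_mm))                # Eq. 4.11, mm
    return h_opt, k_opt

def repeated_delay(total_len_mm, k_opt_mm, seg_delay_ps):
    """With repeaters, delay grows linearly: one segment delay per section."""
    return math.ceil(total_len_mm / k_opt_mm) * seg_delay_ps

h, k = optimal_repeaters(1.0e3, 1.0e-15, 1.0e-15, 442.0, 200.0e-15)
# for these assumed values: repeaters ~21x minimum size, ~0.21 mm apart
```

The repeater count implied by `total_len_mm / k_opt_mm` is exactly the area and power overhead that the next section quantifies.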
4.2 NoC Power Consumption In Physical Links
Power consumption in on-chip interconnect represents a significant portion of the
chip's total power consumption (up to 50%) [49]. In NoC, interconnect length is a function of topology, die size, and number of IPs. Considering a chip size of 20mm×20mm, a technology node of 65nm, and a system of 64 IP blocks, the number of interswitch links and the total length are obtained for different NoC architectures and presented in Table 4.4. When the interconnect length is larger than the critical length, repeater insertion is necessary. For the 65nm technology node, using the methodology presented in [48], the critical interconnect length is 1.44 mm. The links are assumed to be 16 bits wide with 2 control signal lines per channel link. Using an optimal interconnect width of 799, an optimal interconnect spacing of 329, and an optimal repeater size of 105, the power consumption of interswitch links and repeaters is calculated for different NoC architectures and presented in Table 4.6, along with the power overhead as a percentage of total power consumption. Repeater insertion, together with spacing and width optimization, may consume as much as 6 percent of the total chip area, which is significant considering that routers are not included. Similarly, repeaters alone can consume as much as 1048 mW of power in the SPIN topology. This again is quite significant, considering the size and high-end power budget of the chip. The difference between the area and power budgets of a logic design vs. a physical design can either offset a design completely or result in endless iterations to get a high-performance design to work. As mentioned previously, this problem demands an improved design
flow to comprehend the problems imposed by the interconnects in the deep nanometer regime. A much more efficient design flow is proposed and shown in Figure 4.9.
Table 4.4: Interconnect Power and Area Consumption: Intrinsic Case, f=400 MHz and α = 1

Architecture   Total Wire Length (mm)   Max. RC Delay (ps)   Bandwidth (Gbps)   Pline (mW)   Area (mm²)
Cliche         5040                     221.11               72.36              374.88       1.4616
BFT            8640                     3537.69              4.52               642.64       2.5056
SPIN           20160                    3537.69              4.52               1499.50      5.8464
Octagon        5125                     1989.95              8.04               381.20       1.4863
Table 4.5: Interconnect Power and Area Consumption: Width and Space Optimiza- tion, f=400MHz and α = 1
Architecture   Total Wire Length (mm)   Max. Length (mm)   Bandwidth (Gbps)   Pline (mW)   Area (mm²)
Cliche         5040                     2.5                507.514            231.50       5.6851
BFT            8640                     10.0               84.49              396.85       9.7459
SPIN           20160                    10.0               84.49              925.99       22.7405
Octagon        5125                     7.5                116.12             235.40       5.7810
Table 4.6: Total Power and Area Consumption, f=400MHz and α = 1
Architecture   No. of Reps   Rep. Power (mW)   Rep. Area (mm²)   Total Power (mW) (% ↑)   Total Area (mm²) (% ↑)
Cliche         2016          190.51            0.01291           422.01 (11.2)            5.6980 (74.35)
BFT            4608          435.46            0.02951           832.31 (22.8)            9.7754 (74.37)
SPIN           11520         1088.64           0.07379           2014.63 (25.6)           22.8143 (74.37)
Octagon        2592          244.94            0.01660           480.35 (20.6)            5.7976 (74.37)
4.3 A Layout-Aware NoC Design Methodology
Interconnects in the deep nanometer regime pose a real challenge to meeting system performance and require optimization to satisfy bandwidth requirements. These optimizations, along with power-hungry global interconnects, can make a big difference in the area and power cost of the total system. The difference between the budgets of logic design vs. physical design can either offset a design completely or result in endless iterations to get a design to meet its goals. As a consequence, the traditional top-down approach taken in many design processes is not sufficient to deal with this problem and to account for the interconnects accurately alongside the rest of the design flow. The challenges associated with interconnects require a new and improved design flow and innovative optimization tools that help to accurately model the complex relationships that exist at the nanometer scale. An efficient design flow based on NoC interconnect modeling is presented in [50] and is shown in Figure 4.9.
Modeling and simulation are key to accounting for interconnects in the early stages of design. The modeling and simulation capabilities can range from rough high-level estimates to detailed low-level predictions of interconnect impact on the IC layout and electrical behavior.
4.4 Summary
Wires are not the ideal connections drawn in schematic diagrams; large parasitics are associated with the interconnects, which produce undesired effects and hinder performance. The impact of parasitics is more pronounced in the deep nanometer regime. Interconnect performance analysis methodologies are thus highly important for a successful design and a shorter design cycle time. Accurate estimation of interconnect
Figure 4.9: An Improved ASIC Design Flow for NoC in Deep Nanometer Regime
delay, power, and area early in the design cycle is crucial for effective system-level optimization. The commonly used top-down design approach for digital design is no longer valid in the presence of deep nanometer effects and can lead to misleading design targets. Interconnect design in nanometer geometries is a compromise between density, RC performance, and cost. Narrow wires deliver high density but relatively poor RC performance, while wide wires have better RC performance. Managing these factors through interconnect modeling and design methods is necessary to accurately compensate power and delay estimations for physical design optimization, and therefore for a high-performance SoC design.
CHAPTER 5
Layout Aware NoC Design Methodology
In earlier generations of IC designs, the main parameters of concern were timing and area. EDA tools were designed to maximize the speed while minimizing area.
Power consumption was a lesser concern. CMOS was considered a low-power technology, with fairly low power consumption at the relatively low clock frequencies used at the time, and with negligible leakage current. In recent years, however, device densities and clock frequencies have increased dramatically in CMOS devices, increasing power consumption accordingly. At the same time, supply voltages and transistor threshold voltages have been lowered, causing leakage current to become a significant problem. As a result, power consumption levels have reached unacceptable limits, and low power design has become as important as timing or area in any design.
High power consumption can result in excessively high temperatures during operation. This means that expensive ceramic chip packaging must be used instead of plastic, and complex and expensive heat sinks and cooling systems are often required for product operation. Laptop computers and hand-held electronic devices can become uncomfortably hot to the touch. Higher operating temperatures also reduce reliability because of electromigration and other heat-related failure mechanisms.
In portable and hand-held devices, high power consumption reduces battery life.
As more and more features are added to a product, power consumption increases and battery life is reduced even further, requiring a larger, heavier battery or a shorter life between charges. Battery technology has lagged behind the increased demands for power. Another aspect of power consumption is the sheer cost of the electrical energy used to power millions of computers, servers, and other electronic devices deployed on a large scale, both to run the devices themselves and to cool the machines and buildings in which they are used. Even a small reduction in the power consumption of a microprocessor or other device used on a large scale can result in large aggregate cost savings to users and can provide significant benefits to the environment.
NoCs are being considered as a potential candidate to solve on-chip communication problems in large-scale SoC design. As technologies continue to shrink into the deep nanometer regime, these complex SoC designs face many challenges, first and foremost the goal of low power design. Low power design is essential, as the feasibility of NoC depends heavily on the power budget it requires. As power becomes one of the major limiting factors in IC design, designers need new capabilities across the entire design flow to develop a keen understanding of the sources of power consumption and the tradeoffs among different NoC topologies. To aid in this process, this chapter presents a high-level power estimation methodology for different NoC architecture designs.
5.1 CMOS Power Dissipation
Before delving into the techniques to save power, let us first look at the basics and examine what we mean by these terms and why they are important. The instantaneous power P(t) consumed or supplied by a circuit element is the product
of the current through the element and the voltage across the element [35]
P(t) = I(t)·V(t)
The total power consumed by an SoC design consists of dynamic power and static power. Dynamic power is the power consumed when the device is active, that is, when signals are changing values. In CMOS devices, dynamic power is consumed mainly because of (i) the charging and discharging of load capacitances as gates switch and (ii) short-circuit current while both the PMOS and NMOS stacks are momentarily on. The first and primary source of dynamic power consumption is switching power, the power required to charge and discharge the output load, defined as
P_switching = α·C·V_DD²·f    (5.1)
Where α is the switching activity factor, C is the load capacitance, VDD is the supply voltage and f is the clock frequency. Note that switching power is not a function of transistor size, but rather a function of switching activity and load capacitance.
Thus it is very much data dependent. Static power, on the other hand, is the power consumed when the device is powered up but no signals are changing value. Static power consumption in CMOS devices is mainly due to leakage and consists of (i) subthreshold leakage through OFF transistors, (ii) gate leakage through the gate dielectric, and (iii) junction leakage from source/drain diffusions. Putting this all together, the total power of a circuit can be defined as
P_total = P_dynamic + P_static    (5.2)
where
P_dynamic = P_switching + P_shortcircuit    (5.3)
P_static = (I_sub + I_gate + I_junc)·V_DD    (5.4)
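The decomposition in equations 5.1-5.4 can be sketched directly; the numbers in the example (1 pF switched at 1 GHz with V_DD = 1 V) are illustrative, not from any specific process.

```python
# Sketch of the CMOS power decomposition of Eqs. 5.1-5.4. All inputs are
# illustrative SI-unit values.

def switching_power(alpha, c_load_f, vdd_v, f_hz):
    """P_switching = alpha * C * Vdd^2 * f (Eq. 5.1)."""
    return alpha * c_load_f * vdd_v ** 2 * f_hz

def static_power(i_sub_a, i_gate_a, i_junc_a, vdd_v):
    """P_static = (Isub + Igate + Ijunc) * Vdd (Eq. 5.4)."""
    return (i_sub_a + i_gate_a + i_junc_a) * vdd_v

def total_power(p_switching_w, p_short_w, p_static_w):
    """P_total = P_dynamic + P_static, with P_dynamic split per Eq. 5.3."""
    return p_switching_w + p_short_w + p_static_w

# 1 pF switched at 1 GHz, Vdd = 1 V, activity factor 0.5
p_sw = switching_power(0.5, 1e-12, 1.0, 1e9)  # 0.5 mW
```

Note that only `alpha` and `c_load_f` appear in the switching term, reflecting the observation above that switching power depends on activity and load, not on transistor size.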
Power dissipation is a significant constraint in any large-scale chip design, including NoC. The elevation of power to a first-class design constraint requires that power estimates be made at the same point in the design flow as performance studies.
In NoC, the opportunity to influence power consumption varies at each level of design abstraction. The higher the level of design abstraction, the greater the influence power management techniques have on power consumption. As shown in Figure 5.1, system designers can have roughly an order of magnitude more impact by addressing power issues early in the design process. There are a number of techniques at
Figure 5.1: Diminishing Returns of Power
the architectural, logic design, and circuit design levels that can reduce the power for a particular function implemented in a given technology. In NoC, at the architecture level, power savings are achieved mainly through the different architecture choices available for the design. Since architectural decisions often cannot be reversed, power estimates must be made early in the design phase. A natural solution to this problem is to develop NoC power models at the architectural level. Power modeling at the architectural level can provide rough estimates and additionally allows tradeoffs between hardware and software partitioning. Architectural-level power modeling can influence both power and performance savings in a design. Power models for different NoC architectures are developed in the subsequent sections.
5.2 Power Analysis for NoC-based Systems
An important challenge for current and future interconnect architectures is to provide a low power solution. This demand is mainly driven by advanced applications in large SoC designs. The communication network alone can consume a significant portion of total system power in any design; in modern processors nearly one third of the power is spent in logic and wires. For instance, the MIT RAW on-chip network consumes 36% of the total chip power, and the Alpha 21364 microprocessor dissipates almost 20% of total chip power in the interconnects alone [33]. In
NoC design, power consumption is mainly due to three components: (1) routers, (2) interconnects (wires), and (3) repeaters. The total power dissipation in the network can be defined using the following equations:
P_total = P_switches + P_line + P_repeaters    (5.5)

where
P_line = α·C_L·V_DD²·f    (5.6)
P_repeaters = N_rep·(α·h_opt·C_o·V_DD²·f + I_leak-rep·V_DD + I_short-rep·V_DD)    (5.7)
where f is the clock frequency and V_DD is the supply voltage. P_switches is the total power consumed by the switches, P_line is the total power consumed in interswitch links, and P_repeaters is the total power consumed by the repeaters [51]. Repeaters are required on long interconnects to enhance system performance; the number required depends on the length of the interswitch link and the technology node used. Different NoC architectures require different numbers of switches, different lengths of interswitch interconnects, and thus different numbers of repeaters. For the same number of IPs and chip size, these differences can create large variations in the power numbers.
5.2.1 Cliche Architecture Power Model
The Cliche architecture, or 2D mesh, is the most commonly used NoC topology in the latest commercial products and industrial prototypes. As shown in Figure 5.2, it is a uniform, scalable, and, from a layout perspective, the simplest form of architecture. In
Figure 5.2: Layout of Cliche architecture
this architecture, all the interswitch wire segments are of the same length, which can be determined using the following expression:

L = √Area / √N    (5.8)
If the number of IPs is equal in the x and y directions (m = n), then the number of horizontal links equals the number of vertical links; each can be calculated as √N(√N − 1).
Depending on the technology node, the optimal segment length for repeater insertion can be obtained using equation 4.11. Thus, for the Cliche architecture, the total interconnect length and the required number of repeaters can be calculated using the following expressions:
l_Cliche = 2·√Area·(√N − 1)·N_wires    (5.9)
N_rep-Cliche = 2·√N·(√N − 1)·⌊√Area / (k_opt·√N)⌋·N_wires    (5.10)
Using the total number of switches, the total interconnect wire length, and the total number of required repeaters, the total power consumption for the Cliche architecture can be calculated using the following expression:
P_Total-Cliche = N_sw·P_switch + 2·√Area·(√N − 1)·N_wires·C·V_DD²·f
               + 2·√N·(√N − 1)·⌊√Area / (k_opt·√N)⌋·N_wires × (α·h_opt·C_o·V_DD²·f + P_leak-rep + P_short-rep)    (5.11)
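The length and repeater bookkeeping of equations 5.8-5.10 can be sketched as a short function. With a 400 mm² die, 16 IPs, 10 wires per link (8 data + 2 control), and k_opt = 1.44 mm (the inputs used later in Section 5.4), it reproduces the Cliche row of Table 5.1; the switch and repeater power terms of equation 5.11 would then be applied to these counts.

```python
import math

# Sketch of the Cliche (2D-mesh) wire-length and repeater bookkeeping
# of Eqs. 5.8-5.10. Lengths in mm, area in mm^2.

def cliche_stats(area_mm2, n_ips, n_wires, k_opt_mm):
    seg = math.sqrt(area_mm2) / math.sqrt(n_ips)              # Eq. 5.8
    n_links = 2 * math.sqrt(n_ips) * (math.sqrt(n_ips) - 1)   # horiz. + vert.
    total_len = n_links * seg * n_wires                       # Eq. 5.9
    n_reps = n_links * n_wires * math.floor(seg / k_opt_mm)   # Eq. 5.10
    return seg, total_len, n_reps

seg, wire_mm, reps = cliche_stats(400.0, 16, 10, 1.44)
# seg = 5.0 mm, wire_mm = 1200.0 mm, reps = 720 (cf. Table 5.1)
```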
5.2.2 BFT Architecture Power Model
In Butterfly Fat Tree (BFT) architecture, the IPs are placed at the leaves and switches at the vertices. At the lowest level (0), there are N IPs and IPs are connected
to N/4 switches at the first level. A floor plan for a 16-IP BFT network is shown in
Figure 5.3. The number of levels in a BFT architecture depends on the total number
Figure 5.3: Layout of BFT architecture
of IPs, and can be calculated using equation 5.13.

N_sw = (N/2)·[1 − (1/2)^levels]    (5.12)

where

levels = log2(N) − 2    (5.13)

and

Switches at the jth level = N / 2^(j+1)    (5.14)
The wire lengths between switches in the BFT architecture, based on the layout, are shown in Figure 5.3. The interswitch wire length can be calculated using the following expression [10]:

l_{a+1,a} = √Area / 2^(levels−a)    (5.15)

Where l_{a+1,a} is the length of the wire spanning the distance between level
'a' and level 'a+1' switches, where a can take integer values between 0 and (levels − 1). Thus, the total length of the interconnects and the total number of repeaters can be calculated using the following expressions:

l_total = (√Area / 2^levels)·levels·N·N_wires    (5.16)
N_rep = N·N_wires·[ ⌊l_{1,0}/k_opt⌋ + (1/2)·⌊l_{2,1}/k_opt⌋ + (1/4)·⌊l_{3,2}/k_opt⌋ + ··· + (1/2^(levels−1))·⌊l_{levels,levels−1}/k_opt⌋ ]    (5.17)
Where k_opt is the optimal length of global interconnect between repeaters. Using the total number of switches, the total wire length, and the total number of repeaters, the total power dissipation in the BFT architecture can be calculated using the following expression:

P_total = P_switch·(N/2)·[1 − (1/2)^levels] + (√Area / 2^levels)·levels·N·N_wires·C·V_DD²·f
        + N·N_wires·[ ⌊l_{1,0}/k_opt⌋ + (1/2)·⌊l_{2,1}/k_opt⌋ + ··· + (1/2^(levels−1))·⌊l_{levels,levels−1}/k_opt⌋ ] × (α·h_opt·C_o·V_DD²·f + P_leak-rep + P_short-rep)    (5.18)
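The BFT bookkeeping can be sketched as below. Note one assumption: the number of levels is taken as log2(N) − 2, the value that reproduces the 16-IP BFT entries of Table 5.1 (6 switches, 1600 mm of wire, 960 repeaters) with the same die-size and link-width inputs as the Cliche example.

```python
import math

# Sketch of the BFT switch, wire-length, and repeater bookkeeping of
# Eqs. 5.12-5.17, assuming levels = log2(N) - 2 (matches Table 5.1).

def bft_stats(area_mm2, n_ips, n_wires, k_opt_mm):
    levels = int(math.log2(n_ips)) - 2
    n_switches = (n_ips / 2) * (1 - 0.5 ** levels)           # Eq. 5.12
    seg = lambda a: math.sqrt(area_mm2) / 2 ** (levels - a)  # Eq. 5.15
    total_len = (math.sqrt(area_mm2) / 2 ** levels
                 * levels * n_ips * n_wires)                 # Eq. 5.16
    n_reps = n_ips * n_wires * sum(                          # Eq. 5.17
        math.floor(seg(a) / k_opt_mm) / 2 ** a for a in range(levels))
    return n_switches, total_len, n_reps

sw, wire_mm, reps = bft_stats(400.0, 16, 10, 1.44)
# sw = 6.0, wire_mm = 1600.0, reps = 960.0 (cf. Table 5.1)
```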
5.2.3 SPIN Architecture Power Model
As explained earlier in Chapter 2, the SPIN topology is based on the fat-tree topology;
every node has four sons and the father is replicated four times at any level of the
tree. This topology carries some redundant paths, but offers higher throughput at
the cost of added area. SPIN is scalable and uses small number of routers for a given
number of IPs. In a large SPIN (>16 IPs), the total number of switches is 3N/4 [17].
An efficient floor plan for the SPIN architecture is shown in Figure 5.4. Based on this
Figure 5.4: Layout of SPIN architecture
floor plan, the interswitch wire length can be calculated using eq. 6.14. The total wire length and number of repeaters can be calculated using the following expressions:

l_total = 0.875·√Area·N·N_wires    (5.19)

N_rep = N·N_wires·( ⌊√Area/(8·k_opt)⌋ + ⌊√Area/(4·k_opt)⌋ + ⌊√Area/(2·k_opt)⌋ )    (5.20)

The total power consumption of the SPIN architecture, using the total interconnect length and the total number of repeaters, can thus be calculated using equation
5.21.

P_total = P_switch·(2N/4) + 0.875·√Area·N·N_wires·C·V_DD²·f
        + N·N_wires·( ⌊√Area/(8·k_opt)⌋ + ⌊√Area/(4·k_opt)⌋ + ⌊√Area/(2·k_opt)⌋ ) × (α·h_opt·C_o·V_DD²·f + P_leak-rep + P_short-rep)    (5.21)
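The SPIN wire-length and repeater terms can be sketched the same way. With the same assumed inputs as before (400 mm² die, 16 IPs, 10 wires per link, k_opt = 1.44 mm), the sketch reproduces the SPIN row of Table 5.1 (2800 mm of wire, 1600 repeaters).

```python
import math

# Sketch of the SPIN wire-length and repeater bookkeeping of Eqs. 5.19-5.20.

def spin_stats(area_mm2, n_ips, n_wires, k_opt_mm):
    root = math.sqrt(area_mm2)
    total_len = 0.875 * root * n_ips * n_wires    # Eq. 5.19
    n_reps = n_ips * n_wires * (                  # Eq. 5.20
        math.floor(root / (8 * k_opt_mm))
        + math.floor(root / (4 * k_opt_mm))
        + math.floor(root / (2 * k_opt_mm)))
    return total_len, n_reps

wire_mm, reps = spin_stats(400.0, 16, 10, 1.44)
# wire_mm = 2800.0, reps = 1600 (cf. Table 5.1)
```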
5.2.4 Octagon Architecture Power Model
A basic Octagon unit consists of eight nodes and 12 bidirectional links. Each node is associated with one IP; therefore, the number of switches in an Octagon unit equals
the number of IPs. For a system containing more than eight nodes, the Octagon unit is expanded into multi-dimensional space using multiple basic Octagon units. An efficient layout scheme for the Octagon architecture is shown in Figure 5.5. Based on
Figure 5.5: Layout of Octagon architecture
this layout scheme, there are four different interswitch wire lengths in the Octagon architecture [18]. The first set connects nodes 1-5 and 4-8, the second set connects nodes 2-6 and 3-7, the third connects nodes 1-8 and 4-5, and the fourth connects nodes 1-2, 2-3, 3-4, 5-6, 6-7, and 7-8. The interswitch wire lengths can be calculated using the following expressions:
l_1 = 3L/4    (5.22)

l_2 = 13·w_l·N_wires + L/4    (5.23)

l_3 = 13·w_l·N_wires    (5.24)

l_4 = L/4    (5.25)

Where L is the length of four nodes and is equal to 4·√Area/√N, and w_l is the sum of the global interconnect width and spacing. Considering the different interswitch wire lengths, the total interconnect length and the total number of repeaters required
can be calculated using the following expressions.
l_total = ( (7/2)·L + 52·w_l·N_wires )·N_wires·N_oct-units    (5.26)
N_rep = ( 2·⌊(3L/4)/k_opt⌋ + 2·⌊(13·w_l·N_wires + L/4)/k_opt⌋ + 2·⌊(13·w_l·N_wires)/k_opt⌋ + 6·⌊(L/4)/k_opt⌋ )·N_wires·N_oct-units    (5.27)
Where N_oct-units is the number of basic Octagon units. The total power dissipation of the Octagon network can thus be calculated using the following expression:

P_total = P_switches + ( 14·√Area/√N + 52·w_l·N_wires )·N_wires·N_oct-units·C·V_DD²·f
        + N_wires·N_oct-units·( 2·⌊(3L/4)/k_opt⌋ + 2·⌊(13·w_l·N_wires + L/4)/k_opt⌋ + 2·⌊(13·w_l·N_wires)/k_opt⌋ + 6·⌊(L/4)/k_opt⌋ ) × (α·h_opt·C_o·V_DD²·f + P_rep-leak + P_rep-short)    (5.28)
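The Octagon bookkeeping can be sketched as below. One assumption: w_l of 1.128 µm corresponds to the 799 + 329 optimal width and spacing (taken here as nm) used in this chapter; with a 400 mm² die and 16 IPs (two basic units), the sketch reproduces the Octagon row of Table 5.1 (1411.73 mm of wire, 880 repeaters).

```python
import math

# Sketch of the Octagon wire-length and repeater bookkeeping of
# Eqs. 5.22-5.27. Lengths in mm; w_l_mm is wire width + spacing.

def octagon_stats(area_mm2, n_ips, n_wires, w_l_mm, k_opt_mm):
    n_units = n_ips // 8
    L = 4 * math.sqrt(area_mm2) / math.sqrt(n_ips)   # span of four nodes
    total_len = (3.5 * L + 52 * w_l_mm * n_wires) * n_wires * n_units  # Eq. 5.26
    lens = [3 * L / 4, 13 * w_l_mm * n_wires + L / 4,
            13 * w_l_mm * n_wires, L / 4]            # Eqs. 5.22-5.25
    counts = [2, 2, 2, 6]                            # links of each length
    n_reps = n_wires * n_units * sum(                # Eq. 5.27
        c * math.floor(l / k_opt_mm) for c, l in zip(counts, lens))
    return total_len, n_reps

wire_mm, reps = octagon_stats(400.0, 16, 10, 0.001128, 1.44)
# wire_mm ~ 1411.73, reps = 880 (cf. Table 5.1)
```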
5.3 IP Based Design Methodology for NoC
IP-based design is the dominant way to design a large system containing billions of transistors in a reasonable amount of time. IP-based design differs from custom design in that IPs are designed well before they are used. Therefore, in these designs most of the system requirements, such as bandwidth, area, and power consumption, are known a priori. The life cycle of finely designed IPs may stretch over many years, from the time they are first created through several generations of technology until their final retirement. However, it is natural that with every generation of technology scaling, the capacity to integrate similar types of IPs doubles, or their area halves. An example of the natural progression of the number of IPs that can be
fit on the same die due to technology scaling is shown in Figure 5.6. Functional
Figure 5.6: Number of cores with technology scaling
IP blocks are not discussed, since they are dependent on the specific application.
They are considered as a set of embedded processors. In NoC, power is a function of the number of IPs and the die size. Depending on the number of IPs and an estimated die area required by them, a low power topology can be selected using the power models presented earlier. Power dissipation varies among different NoC architectures because of differences in the total interconnect wire length, the number of switches, and the total number of repeaters required by the topology. The number of repeaters required depends on the lengths of the interconnects, which vary across topologies. As the number of IPs is increased for a given die area, some topologies scale well with shorter interconnect lengths, whereas others do not. The length of the longest interconnect for different NoC architectures, as the number of IPs is increased on a 20mm x 20mm die, is shown in Figure 5.7. Interconnects in the Cliche architecture scale well, i.e., the length of the longest interconnect decreases with an increasing number of IPs. In other topologies, some interconnects scale, but the longest interconnect does not. For a desired bandwidth requirement, longer interconnects may require optimization in
Figure 5.7: Length of longest interconnect with increasing number of IPs
terms of width and spacing, along with repeater insertion. This results in extra area and power costs. Since power is a most critical design constraint and interconnects consume a significant amount of power, it must be accounted for up front in the design process. To aid designers in the early stages of design, architectural power models were presented earlier for rough power estimation, and an efficient design flow including this step is shown in Figure 5.8. Architectural-level power estimation is extremely important in order to (1) verify that power budgets are approximately met and (2) evaluate the effect of various high-level optimizations, which have been shown to have a much more significant impact on power than low-level optimizations.
5.4 Network Power Analysis
A synchronous router designed in HDL is implemented using ARM's standard cell library in a 65nm TSMC CMOS process. Synopsys's PrimeTime PX tool is used for calculating average power dissipation. A 6-port switch consumes 9.62 mW
Figure 5.8: A Methodology for Power Efficient NoC Design
of power at 200MHz. Using [48], the critical interconnect length for the 65nm technology node is 1.44 mm. The links are assumed to be bidirectional, with 8 data lines and 2 control signal lines per link. An optimal interconnect width of 799, an optimal interconnect spacing of 329, and an optimal repeater size of 105 are used. For design space exploration using the power models and IP-based design flow presented earlier, the power variance among different NoC topologies is shown in Figure 5.9. A range of 16 - 1024 IPs and die sizes of 25mm2 to 400mm2 are used. The SPIN topology consumes the highest power, whereas BFT is the most power efficient. For a 20mm x 20mm die, the power consumed by the different architectures for 16, 64 and 256 IPs is presented in Tables 5.1 - 5.3. Power consumption by wires and repeaters is
Figure 5.9: Total Power of the Network: (a) Cliche Architecture, (b) BFT Architecture, (c) SPIN Architecture, (d) Octagon Architecture
also presented. The system overhead is evaluated against a full-chip power budget of 100 Watts [52]. A detailed analysis of power consumption helps designers save more power through whichever approaches are applicable. For 256 IPs, repeaters alone can consume as much as 1209.6 mW of power in the SPIN topology; this is quite significant, considering the size and high-end power budget of the chip. The total power consumed by the BFT architecture is the lowest in all three cases; however, it is important to see how the different components contribute to the total power consumption.
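The repeater counts in the tables that follow stem from the critical-length rule stated above: any wire longer than the 1.44 mm critical length must be broken into segments no longer than that length. A minimal sketch of that per-wire calculation (the function name is mine; the real flow also accounts for repeater sizing, width and spacing):

```python
import math

CRITICAL_LEN_MM = 1.44  # critical interconnect length at 65 nm, per [48]

def repeaters_per_wire(wire_len_mm: float, critical_mm: float = CRITICAL_LEN_MM) -> int:
    """Repeaters needed so that no unrepeated segment exceeds the
    critical length: splitting a wire into k+1 equal segments needs k
    repeaters, with k chosen as small as possible."""
    if wire_len_mm <= critical_mm:
        return 0  # short wires need no repeaters at all
    segments = math.ceil(wire_len_mm / critical_mm)
    return segments - 1

print(repeaters_per_wire(1.25))  # below 1.44 mm: 0 repeaters
print(repeaters_per_wire(5.0))   # split into 4 segments: 3 repeaters
```

This is why topologies with long global wires (such as SPIN) accumulate large repeater counts, while a mesh whose link pitch drops below the critical length needs none.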
Table 5.1: Power Consumption for 16 IPs

Architecture   Switches   Total Wire Length   Repeaters   Total Power (mW)   Power Overhead (%)
Cliche         16         1200                720         205.54             0.21
BFT            6          1600                960         160.68             0.16
SPIN           8          2800                1600        563.12             0.25
Octagon        16         1411.73             880         194.96             0.19
Table 5.2: Power Consumption for 64 IPs

Architecture   Switches   Total Wire Length   Repeaters   Total Power (mW)   Power Overhead (%)
Cliche         64         2800                1120        667.00             0.67
BFT            28         4800                2560        563.12             0.56
SPIN           48         11200               6400        1167.40            1.17
Octagon        64         2846.92             1440        580.77             0.58
Table 5.3: Power Consumption for 256 IPs

Architecture   Switches   Total Wire Length   Repeaters   Total Power (mW)   Power Overhead (%)
Cliche         256        6000                0           2269.10            2.27
BFT            124        8000                2560        1601.80            1.60
SPIN           192        44800               25600       4669.40            4.67
Octagon        256        5787.70             1280        1909.82            1.91
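The overhead column in the tables above is simply the network's total power expressed as a fraction of the 100 W full-chip budget [52]. A two-line check (the function name is mine) reproduces the reported percentages:

```python
def overhead_pct(total_power_mw: float, budget_w: float = 100.0) -> float:
    """Network power as a percentage of the full-chip power budget."""
    return round(total_power_mw / (budget_w * 1000.0) * 100.0, 2)

# Cross-checks against the tables: BFT at 64 IPs (Table 5.2) and
# SPIN at 256 IPs (Table 5.3).
print(overhead_pct(563.12))   # 0.56
print(overhead_pct(4669.40))  # 4.67
```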
Figure 5.10: Distribution of NoC Power Consumption: (a) Cliche Architecture, (b) BFT Architecture, (c) Octagon Architecture, (d) SPIN Architecture
Pie charts from the developed power models are plotted to observe the individual contributions of the various components to the total power. Figure 5.10 shows the parameterized contribution of power by the routers, interconnects and repeaters for the case of 64 IPs in the different NoC topologies. This helps designers focus their optimization efforts. In Cliche, the biggest source of power consumption is the switches. Although Cliche is second to BFT in total power consumption, it consumes less power in repeaters. The contribution of the switches to the total power is lowest in BFT among all the topologies.
5.5 Summary
Power, being a first-order design objective, must be modeled early in the design
flow. A mismatch between the power budgets of the logic design and the physical design can either derail a design completely or result in endless iterations to get the design to work. Therefore, the need for fast power estimation is a growing requirement in large-scale SoC designs, including NoCs. Power estimates in the early phases of design help designers optimize for energy consumption and efficiently map applications to achieve a low-power solution. Power can be estimated at a number of levels with varying degrees of detail. In a NoC, an accurate estimation of power at the architectural level can save ten times as much power as methods applied later in the design flow. In this chapter, an efficient design methodology to estimate NoC power at the architectural level is presented. The analysis is based on architectural layout and power models. These models can be used to accelerate the NoC design process toward a low-power solution and hence timing closure.
CHAPTER 6
Power Efficient Asynchronous Network on Chip Architecture
Most conventional SoC designs are synchronous in nature, i.e., they have a global clock signal which provides a common timing reference for the operation of all the circuitry on the chip. However, trends of increasing die sizes and rising transistor counts may soon lead to a situation in which distributing a high-frequency global clock signal with low skew will become extremely challenging in terms of design effort and power dissipation. A large part of the power is also spent in the clock tree network.
Studies show that, on average, the power consumed by the clock network can be as high as 40% of the total power consumption of the chip. This high fraction is caused by the fact that the large global wires in the clock tree are switched often. To solve these two critical issues, several methods are discussed in the research literature [53]. One of the most commonly proposed solutions is asynchronous communication between locally clocked regions, i.e., Globally Asynchronous Locally Synchronous (GALS) communication [54]. The basic idea of GALS is to partition a system into multiple independently clocked domains. Each domain is clocked synchronously, while interdomain communication is achieved through specific interconnect schemes and circuits in a self-timed fashion. Thus the functionality of each subsystem is still described and synthesized along well-established synchronous design flows, while the communication
between locally synchronous modules requires specialized asynchronous components.
Due to its flexible portability and its transparency with respect to the computational cores, the GALS interconnect scheme is a top choice for IP-based, multi-core and many-core chips.
A GALS-based design style fits nicely with the concept of Network-on-Chip (NoC) design. NoC combined with a Globally Asynchronous Locally Synchronous design is a natural enabler for easy IP integration, scalable communication and provides a clear split between different timing domains. In addition, GALS allows the possibility of
fine-grained power reduction through frequency and voltage scaling. Despite these benefits of the GALS design approach, asynchronous NoC research is still in its early stages, and only a limited body of work exists in the area. A port interface highlighting the difference between (a) a Synchronous NoC Switch and (b) an Asynchronous NoC Switch is shown in Figure 6.1.
Figure 6.1: Port Interface (a) Asynchronous Design (b) Synchronous Design
The asynchronous design approach is based on purely asynchronous clockless handshaking that uses multiple phases of exchanging control signals (request and
ack) for transferring data words across clock domains. The operation of the switch is discussed in more detail in the next section.
6.1 Asynchronous NoC Architecture
A typical NoC architecture is composed of multiple routers and network interfaces
(NI) which connect the IP blocks to the network. Figure 6.2 shows an Asynchronous
NoC communication architecture. It is similar to a Synchronous NoC architecture in that it has the same basic building blocks: (i) Switches, (ii) Links and (iii)
IP blocks.
Figure 6.2: Asynchronous NoC Architecture
The main function of the switches is to accept the incoming flow of packets, compute where to transmit, arbitrate between potentially concurrent data requests and finally transmit the selected data flow onto an appropriate output link. The
IP blocks, or nodes of the network, are connected to the switches through asynchronous wrapper units and NoC links. The whole asynchronous network is implemented as a
GALS system, i.e., the IP blocks are synchronous, while the communication network
is implemented as Quasi-Delay-Insensitive (QDI) asynchronous logic. As shown in Figure 6.2, synchronization and communication between the NoC switch and the
synchronous unit is through a pausible clock mechanism called SAS (Synchronous-to-
Asynchronous and Asynchronous-to-Synchronous Interfaces). A programmable local clock generator, using a programmable delay line, is implemented within each unit to generate a variable frequency in a predefined and programmable tuning range.
The communication between network switches is accomplished using Asynchronous-to-
Asynchronous (ASAS) interfaces. The switch has a different number of ports for different NoC topologies. Using the GALS approach, the port interface design for the asynchronous switch is shown in Figure 6.3.
Figure 6.3: Asynchronous Port Architecture
In the asynchronous design, interswitch links include request, acknowledge and data signals. Each port of the switch includes (i) an input asynchronous FIFO, (ii) a header
decoder and (iii) controller modules. Messages arrive in fixed-length flow control
units (flits); when the input FIFO stores one whole flit, it sends a full signal to the controller for the next processing step. If it is a header flit, the header decoder unit determines the destination port and the controller checks the status of that port. If the port is available, the path between the input and output is established. All subsequent flits of the corresponding packet are sent from input to output using the established path. Two-way handshaking is used for controlling the transmission. The transitions of the request and acknowledgment signals indicate the completion of the transfer [55]. The number of cells in the asynchronous FIFO is equal to the number of bits in one flit. In the asynchronous design, each cell consists of a Put
Token Controller (PTC), which handles the put operation, a Get Token Controller
(GTC), which handles the get operation, and a Data Status Controller (DSC) [56].
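The per-port control flow just described (header flit opens a path, subsequent flits reuse it, the tail flit releases it) can be sketched behaviorally. This is an illustrative model with names of my own choosing; the real switch signals these steps through the request/acknowledge handshake rather than method calls:

```python
# Hypothetical flit types; the real design encodes these in the flit header.
HEADER, BODY, TAIL = "header", "body", "tail"

class PortController:
    """Behavioral sketch of a port controller: a header flit reserves the
    decoded destination port; body flits reuse the established path; the
    tail flit tears it down."""
    def __init__(self):
        self.open_path = None  # currently reserved output port, if any

    def on_full(self, flit_type, dest_port, port_busy):
        """Called when the input FIFO raises 'full' with a complete flit."""
        if flit_type == HEADER:
            if port_busy(dest_port):
                return "wait"           # destination busy: stall the flit
            self.open_path = dest_port  # establish input -> output path
            return f"forward->{dest_port}"
        out = f"forward->{self.open_path}"
        if flit_type == TAIL:
            self.open_path = None       # release the path after the tail
        return out

ctl = PortController()
print(ctl.on_full(HEADER, 2, lambda p: False))  # forward->2
print(ctl.on_full(BODY, None, lambda p: False)) # forward->2
```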
An asynchronous FIFO cell implementation, along with its data flow details, is shown in Figure 6.4.
Figure 6.4: Asynchronous FIFO Cell
The register is split into two parts, one belonging to the put part (the write port)
and one belonging to the get part (the read port). The behavior of the cell can be understood by tracing a put operation and a get operation [57]. The cell receives input data as follows: the put token signal (put tok) is asserted after two transitions on the input write enable signal (IWE), as shown in Figure 6.5. When a put request is received (put req = 1), the output write enable (OWE) is asserted. This event causes three operations in parallel: (i) the valid signal is asserted by the DSC (the state of the cell becomes "full"), (ii) the register REG is enabled to latch the input data and
(iii) the cell starts to send the put token to the left cell and resets the put token signal
(put tok = 0). When put req is de-asserted, OWE is also de-asserted. The cell is ready to start another put operation once the data from REG is out.
Figure 6.5: PTC Circuit
The cell sends out the stored data in a similar way, as shown in Figure 6.6. After two transitions on the input read enable signal (IRE), the get token signal (GT) is asserted. The register outputs its data onto the global get data bus. When a get request
(get req = 1) is received, the output read enable signal (ORE) is de-asserted (ORE = 0),
GT is reset (GT = 0), and the state of the cell is changed to "empty" (valid = 0) by the
DSC.
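The put and get operations traced above can be summarized in a small behavioral model. This sketch abstracts away the token passing and the two-phase IWE/IRE handshakes of the PTC/GTC circuits and keeps only the data and DSC state; the class and method names are mine:

```python
class FifoCell:
    """Behavioral sketch of one asynchronous FIFO cell: a put latches data
    into REG and marks the cell full (valid = 1); a get releases the data
    and marks it empty (valid = 0)."""
    def __init__(self):
        self.reg = None
        self.valid = 0  # DSC state: 1 = "full", 0 = "empty"

    def put(self, data):
        assert self.valid == 0, "put into a full cell"
        self.reg = data    # REG latches the input data
        self.valid = 1     # DSC asserts valid: cell becomes "full"

    def get(self):
        assert self.valid == 1, "get from an empty cell"
        self.valid = 0     # DSC deasserts valid: cell becomes "empty"
        return self.reg    # data driven onto the global get-data bus

cell = FifoCell()
cell.put(0xA5)
print(hex(cell.get()))  # 0xa5
```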
Figure 6.6: GTC Circuit
Using the burst-mode specifications described in [58], [59], [60] and [61], the burst-mode specifications of the PTC and GTC are shown in Figure 6.7. A burst-mode specification is a Mealy-type finite-state machine consisting of a set of states, a set of arcs, and a unique starting state [60], [61]. An arc is labeled with an input burst (a set of transitions on the input signals), followed by an output burst (a set of transitions on the output signals, possibly empty). A burst-mode machine waits for a complete input burst to arrive; transitions may come in any order and at any time.
Once the complete input burst has arrived, the output burst is generated and the machine moves to the next specification state. For example, in the PTC specification, the machine starts in state 0 and waits for a rising transition IWE+ (where + indicates a rising transition); once this arrives, the machine simply moves to state
1, since the output burst is empty. In state 1 the machine waits for the input burst
IWE- (where - indicates a falling transition). Once IWE- has arrived, put tok is generated.
Figure 6.7: Burst Mode Specification of PTC and GTC
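A burst-mode specification is naturally table-driven: each (state, input burst) pair maps to a (next state, output burst) pair. The fragment below encodes only the PTC transitions described in the text; the signal-name strings and the return to state 0 after put tok are my assumptions for illustration, not the full specification of Figure 6.7:

```python
# Transition table for the PTC fragment: "+" is a rising transition,
# "-" a falling one, matching the burst-mode notation in the text.
PTC_SPEC = {
    (0, ("IWE+",)): (1, ()),            # IWE rises: move on, empty output burst
    (1, ("IWE-",)): (0, ("put_tok+",)), # IWE falls: generate put_tok
}

def step(spec, state, input_burst):
    """Consume one complete input burst; return (next_state, output_burst)."""
    return spec[(state, tuple(input_burst))]

state, out = step(PTC_SPEC, 0, ["IWE+"])
state, out = step(PTC_SPEC, state, ["IWE-"])
print(state, out)  # back in the start state with put_tok asserted
```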
The DSC has two inputs and an output which indicates when the cell is full. The DSC configuration is shown in Figure 6.8. The output (called busy) is 1 when OWE is 1, and it is 0 when ORE is 0 (after having been 1 previously).
Figure 6.8: DSC Circuit
The asynchronous FIFO, header decoder and port controller are implemented using a standard cell library. The clock distribution network for the synchronous architecture is discussed next.
6.2 Synchronous NoC Architecture
In the synchronous architecture, a clock signal is required by all the clocked elements in the switches. A port-to-port interface for the synchronous architecture is shown in Figure 6.9. The Write and Full signals are used in the switch for controlling the operation of the synchronous input and output FIFOs. The two most common styles
Figure 6.9: Synchronous Switch Port Design
of physical clocking network are the H-tree and the balanced tree. The H-tree is a very regular structure which allows predictable delay. The balanced tree takes the opposite approach, synthesizing a layout based on the characteristics of the circuit to be clocked. To understand the complexities associated with clock distribution,
Figure 6.10: Clock Tree Network for Synchronous BFT Architecture: (a) 16 IPs, (b) 64 IPs, (c) 256 IPs
let us consider the BFT architecture. An H-block clock distribution scheme is used, and a template for 16, 64 and 256 IPs is shown in Figure
6.10. The IP blocks are shown as white squares and the switches as gray squares. The BFT architecture is a 4-ary tree topology with switches connecting 4 down links and 2 up links. Each group of 4 leaf nodes needs one switch. At the next level, half as many switches are required (every set of 4 switches at any level requires only 2 switches at the next higher level). This halving continues at each succeeding level. A clock signal is needed across all IP blocks and switches. With the increase in the number of IP blocks, the complexity of the clock tree increases as well.
Maintaining clock tree symmetry and distributing the clock in a synchronous network is thus a difficult task.
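The halving relation above is easy to tabulate. The sketch below is my own reading of the recurrence: one switch per 4 leaf IPs, then half as many at each level up. The exact termination at the top of the tree varies with the design, so this sketch simply stops once fewer than 4 switches would be needed; for 64 IPs that yields 16 + 8 + 4 = 28 switches, matching the BFT row of Table 5.2.

```python
def bft_switch_levels(num_ips: int) -> list:
    """Switch count per BFT level, leaf level first.

    Sketch of the recurrence in the text: num_ips // 4 switches at the
    leaf level, halving at each level above. Termination (here: stop
    below 4 switches) is an assumption, not part of the source.
    """
    count = num_ips // 4
    levels = []
    while count >= 4:
        levels.append(count)
        count //= 2
    return levels

levels = bft_switch_levels(64)
print(levels, sum(levels))  # [16, 8, 4] 28
```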
6.3 Power Dissipation
Power dissipation of on-chip network is defined as
P_total = P_switches + P_line + P_rep    (6.1)