
Designing Low Power and High Performance Network-on-Chip Communication Architectures for Nanometer SoCs

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Gursharan Reehal, B.S., M.S.

Graduate Program in Electrical and Computer Engineering

The Ohio State University

2012

Dissertation Committee:

Prof. Mohammed Ismail El-Naggar, Advisor

Prof. Steve Bibyk

Prof. Joanne DeGroat

© Copyright by

Gursharan Reehal

2012

ABSTRACT

Network-on-Chip (NoC) communication architectures have been recognized as the most scalable and efficient solution to the on-chip communication challenges of the multi-core era. Diverse, demanding applications, coupled with the ability to integrate billions of transistors on a single chip, are the main driving forces pushing performance requirements to the level of several tens to over a hundred cores per chip, with aggregate performance exceeding one trillion operations per second. Such tera-scale many-core processors will be highly integrated System-on-Chip (SoC) designs containing a variety of on-chip storage elements, memory controllers, and input/output (I/O) functional blocks. Small-scale multicore processors have so far been a great commercial success and have found applicability in demanding, compute-intensive applications, including high performance scientific computing, high performance graphics and 3-D immersive visual interfaces, as well as decision-support systems. Systems using multi-core processors are now the norm rather than the exception.

As the number of cores or components integrated into a single system keeps increasing, the design of the on-chip communication architecture is becoming more challenging. The increasing number of components in a system translates into more inter-component communication that must be handled by the on-chip communication infrastructure. It is not surprising to see leading-edge design teams searching for better solutions as multi-core SoCs continue to evolve. Future System-on-Chip (SoC) designs require predictable, scalable and reusable on-chip communication architectures to increase reliability and productivity. Current bus-based interconnect architectures are inherently non-scalable, less adaptable for reuse, and their reliability decreases with system size.

NoC communication offers scalable, high-speed, high-bandwidth communication with minimal wiring overhead and routing issues. NoCs are layered, packet-based on-chip communication networks integrated onto a single chip, and their operation is based on the operating principles of macro networks. A NoC consists of resources and switches that are connected in such a way that the resources are able to communicate with each other by sending messages. The proficiency of a NoC in meeting its design goals and budget requirements for the target application depends on its design. Often, these design goals conflict and trade off with each other. The multi-dimensional pull of design constraints, in addition to technology scaling, complicates the process of NoC design in many aspects, as NoCs are expected to support high performance and reliability along with low cost, smaller area, shorter time-to-market and lower power consumption. To aid this process, this research presents design methodologies to achieve low power and high performance NoC communication architectures for nanometer SoCs.

In NoCs, interconnects play a crucial role and can have a large impact on the total power consumption, wiring area and achievable system performance. The effect of technology scaling on the NoC interconnects is studied and an improved design flow is presented. The influence of technology node, die size, and number of components on the power consumed by the NoC interconnects is analyzed.

The success of a NoC heavily depends on its power budget. As CMOS technology continues to scale, power-aware design is more important than ever before, especially for designs targeted towards low power applications; in large-scale NoCs, the power consumption can increase beyond acceptable limits. Designing low power NoCs is therefore extremely important, especially for larger SoC designs. The elevation of power to a first-class design constraint requires that power estimates be produced at the same time as the performance studies in the design flow. In NoC, one way to achieve a power-aware design is to consider the impact of architectural choices on power in the early stages of the design process. In this research, an efficient design methodology based on layout and power models is presented to obtain rough power estimates in the early stages of the design cycle. The impact of die size and number of IPs on the power consumed by different NoC architectures is evaluated.

Additionally, as multi-core SoCs continue to evolve, Globally Asynchronous Locally Synchronous (GALS) design techniques have been suggested as a potential solution in larger and faster SoC designs to avoid the problems related to synchronization and clock skew. These multi-core SoCs will operate using the GALS paradigm, where each core can operate in a separate clock domain. In this research, a study on the power efficiency of synchronous versus asynchronous NoC architectures is presented. The asynchronous NoC architecture is shown to consume less power when the activity factor of data transfers between two switches is within a certain range. Asynchronous designs are more power efficient, as the need for clock distribution is eliminated.

Dedicated to my mother, for her love, support and encouragement...

ACKNOWLEDGMENTS

I would like to thank the many remarkable people who have supported and encouraged me during my time at The Ohio State University. First and foremost, I would like to thank my advisor, Prof. Mohammed Ismail, for his continued guidance, support and sustained encouragement throughout my graduate study. He always encouraged me to think independently and to define my own research agenda, allowing me to develop the skills necessary to do research. His significant effort in bringing industry-standard EDA software for research in the area of digital VLSI, and thereby enriching the quality of education and the student research experience here at The Ohio State University, is highly remarkable. Without his commitment and encouragement, this dissertation would not have been possible. The experience with Prof. Ismail will always be highly regarded.

I would like to thank Prof. Steve Bibyk for his early guidance, mentorship, encouragement and support to pursue the PhD program. I am thankful for his valuable discussions throughout my time in the ECE department and for being my MS advisor. It truly has been a great experience working with him. I am also very grateful to Prof. Joanne DeGroat for her support, guidance and for kindly serving on my PhD exam committees. I am thankful to her for treating me as a member of her own group.

I am thankful for the opportunity to do some research with Mohamed Abd Ghany, who is affiliated with the German University in Cairo, Egypt. He has been a very helpful friend and a great source of guidance in the work on asynchronous NoCs. I am grateful for his insight into asynchronous digital design and for his support. I enjoyed working with him.

I am honored to call myself a member of the VLSI Lab. I want to thank my fellow colleagues Amneh Akour, Sleiman Bou Sleiman, John Hu, Sidharth Balasubramanian, Yiqiao Lin, Feiran Lei and Samantha Yoder for their friendship, discussions and valuable guidance. In particular, I would like to thank Amneh Akour, who has been a great source of sage advice at the times when I needed it the most. It truly has been a privilege for me to work with them in the VLSI Lab, and I have always enjoyed and cherished their company.

I would also like to thank many people in the ECE department: Stephanie Muldrow, Carol Duhigg, Tricia Toothman, Vincent Juodvalkis, Aaron Aufderheide, Don Gibb, and Edwin Lim, who work very hard and diligently behind the scenes to make this department a wonderful place for graduate students. Their friendly and helpful nature eases the stress of graduate student life, and they are always willing to help and guide students in our department.

Life as a graduate student would not be possible without the help of family and friends. My deepest regards go to my mother, to whom I dedicate this work. She always encouraged and supported me in pursuing higher education. She worked very hard to make sure education was a priority. She always emphasized the importance of education and encouraged me to pursue the PhD program. She has been a great source of inspiration in my life and education. I am deeply thankful to her for her endless love, care, support, wisdom... and basically everything. Your love and faith in me has made all the difference... love you Mom!

My final thanks go to the reason I am here and able to do this work, my creator. I am thankful for the talent and opportunities I have been given and the strength to undertake this task and see it to completion!

VITA

1996 ...... B.S. Electrical Engineering

1998 ...... M.S. Electrical Engineering

2007-2009 ...... Graduate Teaching Associate, The Ohio State University

2010 ...... Graduate Technical Intern, Intel Corporation

2010 ...... Graduate Technical Research Intern, Intel Labs

FIELDS OF STUDY

Major Field: Electrical and Computer Engineering

TABLE OF CONTENTS

Page

Abstract...... ii

Dedication...... v

Acknowledgments...... vi

Vita...... ix

List of Tables...... xiii

List of Figures...... xv

Chapters:

1. Introduction...... 1

1.1 Bus Based On-Chip Communication Architectures...... 3
1.1.1 AMBA Bus...... 4
1.1.2 CoreConnect Bus...... 5
1.1.3 WishBone Bus...... 7
1.1.4 SiliconBackplane MicroNetwork...... 8
1.2 Limitations of the Bus based Communication Approach...... 9
1.3 Why NoC ?...... 12
1.4 NoC Design Considerations and Challenges...... 15
1.5 Organization of this Thesis...... 21

2. NoC Overview : Architecture, Performance and Cost...... 22

2.1 NoC Building Blocks...... 23
2.1.1 Network Interfaces...... 24
2.1.2 Switches...... 24
2.1.3 Links...... 25
2.2 NoC Architectures...... 25
2.2.1 CLICHÉ...... 27
2.2.2 TORUS...... 28
2.2.3 BFT...... 29
2.2.4 SPIN...... 30
2.2.5 OCTAGON...... 30
2.3 NoC Flow Control Protocols...... 31
2.4 NoC Switching Techniques...... 33
2.5 NoC Routing...... 36
2.6 NoC Performance and Cost...... 38
2.6.1 NoC Power Dissipation...... 39
2.6.2 NoC Area Overhead...... 40
2.6.3 NoC Message Latency...... 40
2.6.4 NoC Throughput...... 41
2.7 High-level Physical Characteristics of NoC Architectures...... 43
2.8 NoC Design Flow...... 45
2.9 Summary...... 47

3. NoC Router Architecture Design and Cost...... 48

3.1 Main Parts of NoC Router...... 49
3.1.1 Input/Output Ports...... 50
3.1.2 Virtual Channels...... 51
3.1.3 Buffers...... 52
3.1.4 Crossbar Logic...... 52
3.1.5 Input/Output Arbiter...... 54
3.1.6 Control Logic...... 57
3.2 Packet Format...... 59
3.3 NoC Router Design and Cost...... 60
3.3.1 Router Design-I...... 61
3.3.2 Router Design-II using ASIC Design Flow...... 70
3.4 Summary...... 80

4. High Performance NoC Interconnects...... 81

4.1 NoC Interconnects...... 84
4.1.1 Performance Optimization Using Intrinsic RC Model...... 86
4.1.2 Performance Optimization using Repeater Insertion...... 92
4.2 NoC Power Consumption In Physical Links...... 95
4.3 A Layout-Aware NoC Design Methodology...... 97
4.4 Summary...... 97

5. Layout Aware NoC Design Methodology...... 99

5.1 CMOS Power Dissipation...... 100
5.2 Power Analysis for NoC-based Systems...... 103
5.2.1 Cliche Architecture Power Model...... 104
5.2.2 BFT Architecture Power Model...... 105
5.2.3 SPIN Architecture Power Model...... 107
5.2.4 Octagon Architecture Power Model...... 108
5.3 IP Based Design Methodology for NoC...... 110
5.4 Network Power Analysis...... 112
5.5 Summary...... 117

6. Power Efficient Asynchronous Network on Chip Architecture...... 118

6.1 Asynchronous NoC Architecture...... 120
6.2 Synchronous NoC Architecture...... 126
6.3 Power Dissipation...... 128
6.4 Simulation Results...... 131
6.5 Comparison...... 135
6.6 Summary...... 140

7. Conclusion and Future Work...... 142

7.1 Thesis Summary and Contributions...... 143
7.2 Future Work...... 147

Appendices:

A. NoC Examples...... 150

B. Abbreviation...... 162

Bibliography...... 164

LIST OF TABLES

Table Page

1.1 Bus Architecture Specification [1]...... 10

3.1 Percentage Increase in Throughput for Different NoC Architectures. 65

3.2 Power Consumption for Different NoC Architectures...... 67

3.3 Power Reduction Per Component using Sleep Transistors...... 69

3.4 Power Reduction of a Switch for Different NoC Architectures using Sleep Transistors...... 69

3.5 Input and Output Ports of NoC Router...... 70

3.6 Power Overhead of Routers for Different NoC Architectures in RVT process(f = 200MHz,α = 0.1)...... 75

3.7 Power Consumption of 4-, 5-, 6-,7- and 8- port NoC routers at various operating frequencies...... 76

3.8 A Guide for Leakage Power Considerations...... 78

3.9 Power Dissipation for a Network of 64 IPs at 200 MHz and α = 0.1. 79

4.1 Technology and Circuit Model Parameters from ITRS Reports (2001-2010)...... 82

4.2 Bulk Resistivity of pure metal at 22 degree C...... 88

4.3 Relative Permittivity εr of some Dielectric Materials...... 91

4.4 Interconnect Power and Area Consumption: Intrinsic Case, f = 400 MHz and α = 1...... 96

4.5 Interconnect Power and Area Consumption: Width and Space Optimization, f = 400 MHz and α = 1...... 96

4.6 Total Power and Area Consumption, f = 400 MHz and α = 1...... 96

5.1 Power Consumption for 16 IPs...... 115

5.2 Power Consumption for 64 IPs...... 115

5.3 Power Consumption for 256 IPs...... 115

6.1 Total Metal Resources Required for BFT Architecture...... 135

6.2 Power Consumption For BFT Architecture...... 136

6.3 Power Consumption For Cliche Architecture...... 139

6.4 Power Consumption For Octagon Architecture...... 139

6.5 Power Consumption For SPIN Architecture...... 139

6.6 Total metal resources...... 140

A.1 Intel’s 80-Core Tera Scale Processor Specifications...... 152

A.2 Intel’s 48-Core Single-Chip Cloud Computer Processor Data..... 153

A.3 Tilera's Multicore Processor Data...... 155

LIST OF FIGURES

Figure Page

1.1 Evolution of the IC Integration Level (a) First IC with 4 transistors by Fairchild Semiconductor [2]. (c)Intel Pentium 4 with 50 million transistors...... 1

1.2 AMBA Bus...... 4

1.3 CoreConnect Bus...... 6

1.4 WishBone Bus Interconnection Architectures (Silicore 2002) [3]...7

1.5 Silicon Backplane Micro-Network Bus Architecture...... 8

1.6 Bus Layout Schemes...... 11

1.7 A SoC Design...... 12

1.8 SoC-based consumer portable design complexity trends [4]...... 13

1.9 SoC Design Space...... 14

1.10 Power Density in Intel's Microprocessors...... 17

1.11 Gate and Wiring Delay Vs. Future Technology Nodes [5]...... 19

2.1 Conceptual view of Network-on-Chip [6]...... 22

2.2 Network-on-Chip...... 23

2.3 Eleven Standard NoC Topologies...... 26

2.4 CLICHÉ Architecture...... 27

2.5 Torus Architecture...... 28

2.6 BFT Architecture...... 29

2.7 SPIN Architecture...... 30

2.8 Octagon Architecture...... 31

2.9 NoC Switching Techniques...... 33

2.10 Store & Forward Routing Vs. Cut-Through Routing...... 36

3.1 A Generic Router Design...... 48

3.2 An Input Port of the Switch...... 50

3.3 An Input Port of the Switch with Virtual Channels...... 51

3.4 A 3x3 Crossbar Implemented using a Multiplexer for each Output.. 53

3.5 2-D Implementation of a 3x3 Multiplexer based Crossbar...... 54

3.6 A Matrix Arbiter Design...... 56

3.7 State Diagram of Port Controller...... 57

3.8 Control Flow for Input Virtual Channels...... 58

3.9 Packet Format...... 59

3.10 NoC Port I/O...... 61

3.11 Power per Component...... 62

3.12 High Throughput Arbiter Design...... 63

3.13 Max. Frequency of Switch with different number of Virtual Channels 64

3.14 Throughput vs. Virtual Channels for Different NoC Topologies.... 65

3.15 Latency of NoC Topologies with Different Number of Virtual Channels...... 66

3.16 NoC Port Design for Reducing Leakage Power...... 68

3.17 Power Consumption for Different NoC Architectures...... 68

3.18 Power Analysis Requirements...... 71

3.19 Power Measurement Flowchart for NoC Routers using Synopsys Tools 72

3.20 Power per Component...... 73

3.21 Total Power Consumed by Routers for Different Number of IPs... 74

3.22 Leakage Power vs Technology Nodes [5]...... 77

3.23 Difference in Leakage Power for a 6-Port Router Design using Different Vt Cells...... 78

3.24 Frequency vs. Area of the Switch...... 79

4.1 Metal Layers in different technology nodes...... 81

4.2 Gate Delay vs. Wire Delay in Different Technology Nodes...... 83

4.3 NoC Interconnects...... 85

4.4 One Clock Cycle Requirement for High Performance NoC Designs.. 86

4.5 Intrinsic RC delay and 15FO4 limit...... 87

4.6 Interconnect Resistance...... 88

4.7 Cross-Sectional View of Semi-Global Layer Interconnects...... 89

4.8 Interconnect with Repeaters...... 93

4.9 An Improved ASIC Design Flow for NoC in Deep Nanometer Regime 98

5.1 Diminishing Returns of Power...... 102

5.2 Layout of Cliche architecture...... 104

5.3 Layout of BFT architecture...... 106

5.4 Layout of SPIN architecture...... 108

5.5 Layout of Octagon architecture...... 109

5.6 Number of cores with technology scaling...... 111

5.7 Length of longest interconnect with increasing number of IPs..... 112

5.8 A Methodology for Power Efficient NoC Design...... 113

5.9 Total Power of the Network...... 114

5.10 Distribution of NoC Power Consumption...... 116

6.1 Port Interface: (a) Asynchronous Design, (b) Synchronous Design...... 119

6.2 Asynchronous NoC Architecture...... 120

6.3 Asynchronous Port Architecture...... 121

6.4 Asynchronous FIFO...... 122

6.5 PTC Circuit...... 123

6.6 GTC Circuit...... 124

6.7 Burst Mode Specification of PTC and GTC...... 125

6.8 DSC Circuit...... 125

6.9 Synchronous Switch Port Design...... 126

6.10 Clock Tree Network for Synchronous BFT Architecture...... 127

6.11 Power Dissipation in Syn. and Asyn. BFT Architecture...... 133

6.12 Power Dissipation in Syn. and Asyn. BFT Architecture...... 134

6.13 Power Dissipation of Syn. and Asyn. Architectures, α_clk = 0.5...... 137

6.14 Power Dissipation of Syn. and Asyn. Architectures, α_clk = 0.5 and α_cs = (1/64) α_data...... 138

A.1 Intel’s 80 Core Tera Flop Processor...... 151

A.2 Intel’s 48-Core (24 tiles with two IA cores per tile)SCC Processor.. 152

A.3 Tilera's Multicore Processor...... 154

A.4 The Blue Gene/Q SoC integrates 18 homogeneous cores...... 156

A.5 BONE Evolution...... 158

A.6 FAUST Chip Architecture...... 160

CHAPTER 1

Introduction

It all began in 1959, with the invention of the first silicon integrated circuit. Since then, the world of Integrated Circuits (ICs) has become more and more complex, as shown in Figure 1.1. Every two years, with a new technology node, the number of transistors that can be fitted in the same area doubles, a trend roughly following Moore's Law.

Figure 1.1: Evolution of the IC Integration Level: (a) first silicon IC with 4 transistors by Fairchild Semiconductor (1959) [2], (b) Intel 4004, the first microprocessor (1971), (c) Intel Pentium 4 microprocessor with 50 million transistors (2000), (d) Intel Core i7 microprocessor (2009)

With technology scaling, it is now possible to integrate more than two billion transistors onto a single chip, and the capacity is still growing with ever smaller technology nodes. Market demand for smaller, higher performance devices keeps pushing semiconductor technology to smaller process nodes and packing more functionality into a single die than ever before. Consumers demand high-quality, multi-functional and feature-rich electronic products at a low price. As a result, product differentiation matters more now than ever before, and is being achieved through increased functionality, higher performance, improved power efficiency and more application-specific features. Over the past ten years, as integrated circuits have become increasingly complex, the industry has embraced new design and reuse methodologies that are collectively referred to as System-on-Chip (SoC) design.

The term SoC is fairly new in the semiconductor industry, but it is rapidly replacing more popular acronyms of the past, such as VLSI (Very Large Scale Integration) and ULSI (Ultra Large Scale Integration). The change in name reflects a paradigm shift: a shift in focus from chip design to system design. Before the SoC era, semiconductor technology and the circuits themselves played the central role as a discipline and as an industry and research focus. In the SoC era, however, the focus is shifting towards the system beyond the chip design.

At present, SoCs are used in a limited set of applications such as mobile phones, smart phones, digital cameras, HDTVs and gaming consoles, but many more applications will use them in the near future as they become more powerful and easier to develop. Recently, Intel's CEO Paul Otellini remarked that he can easily see a time when Intel will ship more SoCs than standard microprocessors. This statement is remarkable in that it clearly outlines the near-term focus of a large semiconductor company like Intel.

An SoC is a system-level solution that integrates many different components, connected together on a single chip to achieve a common goal, with a final application in mind. The big advantage of any SoC design is that it offers tremendous computational power as a complete system on a single chip. In the earlier concepts of SoCs, the main goal was to copy the system implemented on a PCB with discrete components onto a single silicon chip by adopting the same bus architectures as those used on the PCB. Previously these components were few and were interconnected using a shared bus architecture, but with more complex SoC designs the number of IP blocks keeps increasing, and as a result the performance of this shared bus approach is unfit for designs with larger numbers of IPs. In a shared bus approach, arbitration is used among several requesters, and when more and more components are attached to the same single bus, the load on the bus increases and its speed drops. To solve this problem, new design approaches are crucial for the vitality of future SoC designs with thousands of IPs. Before diving into the discussion of solutions to this problem, some of the main shared bus schemes used in contemporary SoC designs are discussed in the next section.

1.1 Bus Based On-Chip Communication Architectures

A bus is a group of lines that serves as a communication path for several devices. In addition to the lines that carry the data, the bus also has lines for address and control signals. A shared bus, or simply a bus, is still the most common way to move on-chip data in SoC designs with fewer IPs and is commonly found in many present-day commercial SoCs. In this scheme, several masters and slaves can be connected to a shared bus. A bus arbiter periodically examines accumulated requests from the multiple master interfaces and grants access to a master using arbitration mechanisms specified by the bus protocol. Shared bus communication has many advantages, such as a simple topology, extensibility, low area cost, and ease of building and implementation. However, the increased load on global bus lines limits the bus bandwidth, and as a result data transfers incur longer delays and larger energy consumption in this approach. Some of the main bus architecture designs are discussed below.

1.1.1 AMBA Bus

The AMBA (Advanced Microcontroller Bus Architecture) bus standard was developed by ARM with the aim of supporting efficient on-chip communication among ARM processor cores. Nowadays, AMBA is one of the leading on-chip bus systems used in high performance SoC designs. A typical AMBA configuration is shown in Figure 1.2.

Figure 1.2: AMBA Bus

AMBA is hierarchically organized into two bus segments, a system bus and a peripheral bus, mutually connected via a bridge that buffers data and operations between them.

The AMBA specifications define standard bus protocols for connecting on-chip components, generalized for different SoC structures and independent of processor type.

AMBA does not define the methods for arbitration; instead, it allows the arbiter to be designed to suit the application needs. There are three distinct buses specified within AMBA for different applications, namely (i) the ASB (Advanced System Bus), (ii) the AHB (Advanced High Performance Bus) and (iii) the APB (Advanced Peripheral Bus). Recently, two new specifications for the AMBA bus, Multi-Layer AHB and AMBA AXI, have been defined. Multi-Layer AHB provides a more flexible interconnect architecture than AMBA AHB (a matrix which enables parallel access paths between multiple masters and slaves) while keeping the AHB protocol unchanged. AMBA AXI is based on the concept of point-to-point connections [7][8][9].

1.1.2 CoreConnect Bus

CoreConnect is an on-chip bus architecture from IBM for System-on-Chip (SoC) designs. Initially developed in 1999, it is a macro-based design platform for efficiently integrating complex SoC designs consisting of processors, system blocks, and peripheral cores. Macro-based design provides numerous benefits during logic entry and verification, but the ability to reuse intellectual property for a standard or custom SoC design is often the most significant one. By using common or generic macros, from serial ports to complex memory controllers and processor cores, the design of a complex SoC can be easily accomplished. A typical connection scheme for the CoreConnect bus is shown in Figure 1.3. IBM's CoreConnect is a hierarchically organized architecture consisting of three buses for interconnecting cores, macros, and custom logic: (i) the Processor Local Bus (PLB), (ii) the On-chip Peripheral Bus (OPB) with a bus bridge, and (iii) the Device Control Register Bus (DCRB). The PLB and OPB buses provide the primary means of data flow among macro elements.

Figure 1.3: CoreConnect Bus

The PLB is the main system bus for high performance peripherals. It is a synchronous, multi-master, centrally arbitrated bus designed to achieve high performance and low latency on-chip communication. Slower peripheral cores connect to the OPB, a secondary bus architected to alleviate system performance bottlenecks by reducing capacitive loading on the PLB. Peripherals suitable for attachment to the OPB include serial ports, parallel ports, UARTs, GPIO, timers and other low-bandwidth devices. The PLB masters gain access to peripherals on the OPB through the OPB bridge macro. The DCRB, the third bus in the system, is a single-master bus mainly used as an alternative path for relatively low speed data; for example, lower performance status and configuration registers are typically read and written through the DCRB. The DCRB provides a maximum throughput of one read or write transfer every two cycles and is fully synchronous. It is typically implemented as a distributed multiplexer across the chip. CoreConnect implements arbitration based on static priority with programmable priority fairness. Through this configuration, CoreConnect can provide an efficient interconnection of cores, library macros, and custom logic for any SoC design.

1.1.3 WishBone Bus

The WishBone System-on-Chip (SoC) interconnect bus architecture was developed by Silicore Corporation [3]. It is an open-core architecture that provides a portable interface for use with semiconductor IP cores. It employs 8-bit to 64-bit standard buses to interconnect portable IP cores such as CPUs, processors, DSPs and other peripheral cores. It defines two types of interfaces, called master and slave: master interfaces are IPs capable of initiating bus cycles, whereas slave interfaces are capable of accepting bus cycles.

Figure 1.4: WishBone Bus Interconnection Architectures: (a) Point-to-Point, (b) Crossbar (switch), (c) Shared bus (Silicore 2002) [3]

As shown in Figure 1.4, the hardware implementation of the WishBone bus supports various types of interconnection topologies, such as (i) point-to-point interconnection, for a direct connection between two components, (ii) shared bus, typical for MPSoCs organized around a single system bus, and (iii) crossbar switch interconnection, usually used in MPSoCs when more than one master can simultaneously access several different slaves. These bus architectures, along with a good arbitration mechanism, such as a priority bus, TDMA bus, round-robin bus, or the relatively new lottery bus, and a QoS mechanism, provide a generic backbone for efficient interconnection between system components. In applications where two buses are required, one slow and one fast, two separate WishBone interfaces can be used. The designer can also choose the arbitration mechanism and implement it to fit the needs of the application.

1.1.4 SiliconBackplane MicroNetwork

Sonics' SiliconBackplane MicroNetwork is a quasi on-chip bus to which users attach intellectual-property blocks to create system-on-chip designs. The SiliconBackplane MicroNetwork is a heterogeneous, integrated network that unifies, decouples, and manages all of the communication between processors, memories, and input/output devices. Figure 1.5 shows a SoC design using the MicroNetwork architecture.

Figure 1.5: Silicon Backplane Micro-Network Bus Architecture

The MicroNetwork isolates the system of IP blocks from the network by requiring all blocks to use a single bus interface protocol known as the Open Core Protocol (OCP). The OCP defines a comprehensive, bus-independent, high-performance, and configurable interface between IP cores and on-chip communication subsystems. OCP enables SoC designers to integrate IP cores in a plug-and-play fashion. Every IP block communicates through OCP via a wrapper, which the MicroNetwork calls an agent. To accommodate changing system requirements, such as the selection of the arbitration scheme or the definition of the address space, the MicroNetwork supports modification of many bus parameters in real time. A new agent is generated using the Fast Forward Development Environment tool developed by Sonics Inc. When compared to a traditional bus architecture, the Sonics SiliconBackplane has the advantages of higher efficiency, flexible configuration, guaranteed bandwidth and latency, and integrated arbitration.

1.2 Limitations of the Bus based Communication Approach

In the earlier concepts of SoCs, the main goal was to copy the system implemented on a PCB with discrete components onto a single silicon chip by adopting the same bus architectures as those used on the PCB. Previously these components were few and the shared bus approach was sufficient, but with more complex SoC designs the number of IP blocks keeps increasing, and as a result the performance of this shared bus approach is unfit for designs with larger numbers of IPs. Some published data related to the AMBA, CoreConnect, WishBone and MicroNetwork buses is shown in Table 1.1.

Table 1.1: Bus Architecture Specification [1]

Technology         AMBA        CoreConnect    WishBone               MicroNetwork
Company            ARM         IBM            Silicore Corporation   Sonics
Core Type          Soft/Hard   Soft           Soft                   Soft
Bus Width (bits)   8-1024      32/64/128      8-64                   16
Frequency          200 MHz     100-400 MHz    55-203 MHz             300 MHz
Max Bandwidth      3 GB/s      2.5-24 GB/s    0.1-0.4 GB/s           4.8 GB/s
Min Latency        5 us        15 ns          n/a                    n/a

In a shared bus approach, arbitration is used among several requesters, and when more and more components are attached to the same single bus, the load on the bus increases and as a result its speed drops. Additionally, large SoC designs usually have large die sizes, and in bus based communication some control signals need to traverse the whole bus length several times within a single clock cycle; with technology shrinking into the deep nanometer regime, interconnect delay is growing rapidly, making it nearly impossible to achieve this target in one clock cycle. To explain this in more detail, an example is worked out below.

• An Example - In a bus based communication system, even if the arbitration is pipelined and takes place in an earlier cycle, the request for the bus still must be OR'ed between the components and then fanned out to all the receiving components. The intended receiver must decode the request, decide if it is targeted to it, and then issue an acknowledgment that must be registered by all the components on the bus. This is a typical scenario of bus based communication. A possible bus layout scheme for two different designs and technology nodes is shown in Figure 1.6. A longer bus takes more time and is slower. Additionally, as the number of components added to the bus increases, the speed of the bus drops further, offsetting the advantages of the added functionality. For calculation purposes, a die size of 10 mm is considered. The first case shows 4 cores in a 65 nm process and the second shows 8 cores in 45 nm; the 8 cores account for the scaling effects. In the first case the bus must span at least 10 mm in order to provide connectivity to all the cores on the chip, while with technology scaling alone, the bus length needed to reach all 8 cores on the same die size is now 28 mm. A rough first-order delay estimate for these two bus lengths is sketched below.

Figure 1.6: Bus Layout Schemes
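To make the impact of bus length concrete, the following minimal Python sketch computes the first-order distributed RC (Elmore) delay of an unrepeated wire for the 10 mm and 28 mm bus lengths in the example above. The per-millimeter resistance and capacitance values are illustrative assumptions for a mid-level metal layer, not parameters taken from this work.

```python
# Minimal sketch (illustrative values, not data from this dissertation):
# first-order Elmore delay of an unrepeated bus wire of a given length.

def wire_delay_ns(length_mm: float,
                  r_per_mm: float = 250.0,    # wire resistance, ohm/mm (assumed)
                  c_per_mm: float = 0.2e-12   # wire capacitance, F/mm (assumed)
                  ) -> float:
    """Distributed RC delay of an unbuffered wire: 0.5 * R_total * C_total."""
    r_total = r_per_mm * length_mm
    c_total = c_per_mm * length_mm
    return 0.5 * r_total * c_total * 1e9    # seconds -> nanoseconds

for length in (10.0, 28.0):                 # the two bus lengths from the example
    print(f"{length:4.1f} mm bus -> ~{wire_delay_ns(length):.1f} ns unrepeated delay")
```

Because both total resistance and total capacitance grow linearly with length, the unrepeated delay grows quadratically; the 2.8x longer bus is roughly 8x slower, which is why it cannot realistically be traversed in a single clock cycle without repeaters.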

Due to process technology improvements and restrictions, newer buses do not utilize tristate buffers; instead, they usually make use of multiplexers or OR gates to combine inputs together and then fan out the result. Buses have many wires and create congestion, as these wires must converge at the multiplexer block. To overcome the congestion, multiplexers are usually implemented in a distributed manner, but this increases the number of logic levels used in the design and may lead to increased delays or a lower operating frequency. Moreover, with a larger number of IPs, the load on the bus increases further and speed becomes an issue. To solve this on-chip communication problem, a new communication approach, the Network-on-Chip (NoC), has been proposed. A Network-on-Chip is an efficient on-chip communication architecture for System-on-Chip designs that enables the integration of a large number of computational and storage blocks on a single chip.

1.3 Why NoC?

As discussed earlier, constantly shrinking process technologies and increasing design sizes have led to highly complex billion-transistor ICs. We passed the point long ago when even a large team could design an entire chip from scratch in a reasonable amount of time. Faced with this challenge, designing systems using Intellectual Property (IP) modules has become a dominant mode of chip design today. An early form of IP was the standard cell, which dates back to the early 1970s. Today IP components span the entire range of modules, from standard cells to processors, accelerators, memories, and I/O devices. An example SoC design is shown in Figure 1.7.

Figure 1.7: A SoC Design

A similar IP-style design trend holds for processors: with the ever-increasing demand for performance countered by the power wall of the 1990s, designing processors with many cores has been widely accepted in the industry. The number of cores in a general purpose processor is expected to scale to several tens and possibly over a hundred cores by the end of the decade, leading towards aggregate performance in trillions of operations per second. A typical implementation of such a processor will include multiple levels of cache memory hierarchy and interfaces to off-chip memory and I/O devices, in addition to tens or hundreds of general-purpose cores. The future trend for the number of cores in SoC-type designs projected by ITRS 2009 is shown in Figure 1.8.

Figure 1.8: SoC-based consumer portable design complexity trends [4]

In SoC designs, much of the added value comes from the ability to identify the right combination of components to be put on the chip. Many of those components are standardized: either they are based on open standards or they are licensed from IP providers who own a standard. Large productivity gains can be achieved using this SoC/IP approach. The complexity of these designs can range anywhere from homogeneous to heterogeneous in nature, as shown in Figure 1.9. Homogeneous topologies are typically the design choice for high performance chip multiprocessors.

Figure 1.9: SoC Design Space

Traditionally, bus-based architectures have been used to interconnect a small number of IP cores in SoC designs. However, with an increased number of components or cores on a single die, the bandwidth requirement between the cores in SoCs is increasing as well. To meet the increased communication demands, bus-based architectures have evolved over time from a single shared bus to multiple bridged buses and, to an extent, to crossbar-based designs. Despite these improvements, the shared bus approach is not suitable for SoC designs containing thousands of such cores. The problems arise from non-scalable global wiring delays, failure to achieve global synchronization, and difficulties associated with non-scalable bus-based functional interconnects. Bus based architectures are inherently non-scalable: more and more components introduce an increased load on the bus, and as a result the speed drops drastically. The interconnect complexity of current and future SoC designs requires not only scalability, but also reliable, high performance and reusable interconnect architectures to increase productivity. Thus a network based interconnect architecture, or NoC, is needed to overcome this communication bottleneck and the other associated challenges. NoCs have been shown to improve on-chip communication through the aid of specific interconnection topologies and packet based communication. The NoC is scalable by nature and has huge potential to handle the growing complexity of SoC designs.

1.4 NoC Design Considerations and Challenges

The concept of NoC is inspired by the success of computer networks. The main idea behind NoC is to route packets, not wires, to ease on-chip communication challenges. Although the idea is borrowed from the well-established domain of computer networking, it is not possible to reuse all the features of classical networks for on-chip implementation. In particular, NoC switches should be small, energy efficient, and fast, in contrast with the routers used in computer networks. Tremendous research effort by both industry and academic institutions is being put in this direction to properly model and enhance NoCs so that they are practical and feasible in future SoC or MPSoC (Multi-Processor System-on-Chip) designs containing hundreds or thousands of cores. This thesis is one such effort towards the same goal.

As one can imagine, there are many design considerations and challenges involved in the design process for NoCs. The design issues span several abstraction levels, ranging from high-level application modeling to physical layout level implementation. For example, one of the main challenges at the architecture level is to find the most suitable interconnect topology to satisfy, meet or exceed the design goal expectations set forth by the SoC system. The design choices made at any level in the design process, beginning with the architecture, can have a strong impact on the feasibility of the network, timing closure and overall system performance. Some of the main design considerations and issues in NoC modeling are as follows:

• System Level Design Considerations - Designing an efficient NoC architecture while satisfying the application performance constraints is a complex process. For NoC communication, many different topologies or configuration schemes are possible for interconnecting network switches with the cores and with each other. The choice of NoC architecture and its design can have a large impact on the performance, power consumption, throughput, latency and efficient usage of area on the silicon chip. At the architecture level, the main task thus is to identify an appropriate topology based on the design needs and constraints. Many possible solutions have been proposed and implemented recently, but too often on-chip networks are built using mesh and ring topologies. In fact, most of the NoC implementations to date have utilized the mesh topology or its derivatives due to its simplicity and scalability. However, other topologies must be investigated for a wide range of applications with different design goals in terms of area, power and bandwidth.

• Power Budget - The performance of any NoC design is highly bounded by its power consumption. NoC communication will be applied not only to high-end designs like servers and desktop applications, but also to very small devices like mobile phones and other wireless communication devices. NoCs for high-end applications, such as supercomputers and home entertainment servers, need low power design because of the associated thermal issues, which may require expensive packaging and cooling equipment. Similarly, NoCs for mobile and wireless applications may have even more stringent low power requirements to guarantee a reasonable operating time with a limited battery, because increasingly powerful applications such as 3D graphics games, navigation, and image recording and processing, which are communication intensive and power hungry, are being implemented in handheld devices. Today some of the most powerful microprocessor chips can dissipate as much as 100-150 Watts, for an average power density of 50-75 Watts per square centimeter; local hot spots on the die can be several times higher than this. Figure 1.10 shows the growing power density trend in Intel microprocessors.

Figure 1.10: Power Density in Intel's Microprocessors

As mentioned earlier, the increased power density not only presents packaging and cooling challenges, but can also pose serious reliability problems. The mean time to failure decreases exponentially with temperature; every increase of 10 °C in operating temperature cuts product lifetime roughly in half (a simple rule-of-thumb model is sketched at the end of this section). In addition, chip performance degrades with temperature and leakage power increases with temperature. Until recently, power has been a second order concern in chip design, following first order issues such as cost, area and timing. Today, for most SoC designs, the power budget is one of the most important design goals of the project. Exceeding the power budget can be fatal to a project, whether it means moving from a cheap plastic package to an expensive ceramic one, or failing to meet the required battery life. For virtually all applications, reducing the power consumed by SoCs at every level in the design, including the NoC, is essential in order to continue adding performance and features.

• Interconnect-dominated Nanometer Design - Interconnects play an important role in any on-chip communication scheme, including NoC. The success of NoC greatly depends on the performance of its interconnects. With ever shrinking geometries, gate delays keep decreasing, but global interconnects are becoming the principal performance bottleneck in terms of communication latency, cost and power. Interconnects in the deep nanometer regime pose severe challenges to meeting targeted system performance and reliability. Figure 1.11 shows wiring delay vs. gate delay (from the ITRS 2007 report). In technologies at 90 nm or below, wiring capacitance dominates gate capacitance, rapidly leading to increased interconnect-induced delays. Moreover, coupling capacitance between adjacent wires becomes significant due to tighter geometries and must be accounted for in advance. Interconnect optimization must be considered at all levels of design abstraction in NoCs. In the conventional IC design flow, much emphasis is given to device and logic optimization, while the interconnects are left to automatic layout tools. As a consequence, the traditional top-down approach taken in conventional VLSI design may not be an effective approach for NoC designs. In an interconnect-centric design such as NoC, careful modeling including interconnect planning, interconnect synthesis, and interconnect layout, with a focus on interconnect optimization, is essential.

Figure 1.11: Gate and Wiring Delay Vs. Future Technology Nodes [5]

• NoC Quality-of-Service (QoS) Challenges and Cost - The challenge of designing a NoC lies in finding a balance between the NoC services and their implementation complexity and cost. In NoC, QoS refers to the level of commitment for packet delivery; such a commitment could be in the form of correctness of the transfer, completion of the transaction, or bounds on performance. In most cases, however, QoS actually refers to bounds on performance (bandwidth, delay or latency, etc.), since correctness and completion are basic requirements of on-chip packet transfers. Correctness is concerned with packet integrity (no corruption) and in-order transfer of packets from source to intended destination; it is achieved by techniques such as error correction and reordering packets to ensure in-order delivery. Completion requires that packets are not dropped or lost when being transferred from source to destination. In terms of bounds on performance, QoS can be classified into three basic categories: best effort (BE), guaranteed service (GS), and differentiated service (DS). In best effort service, only correctness and completion of communication are guaranteed, and no other commitment is provided. Packets are delivered as quickly as possible over a connectionless (packet switching) network, but worst case times are not known or provided, and can be orders of magnitude worse than the average case. A GS, such as guaranteed throughput (GT), makes a tangible guarantee on performance in addition to the basic guarantees of completion and correctness; GS is typically implemented using connection-oriented (circuit switching) techniques. A DS prioritizes communication according to different categories, typically through NoC switches, which can employ priority based scheduling and allocation policies. All these solutions require pre-planning against the design constraints, as most of them increase the power consumption and cost of the NoC [10].

• Lack of Tools and Benchmarks - The NoC design space is enormous, with numerous topologies and protocol/parameter choices, switching strategies, flow control, congestion control schemes, buffer sizing, packet sizing, link sizing, etc. Because the field is still in the early stages of research, it lacks design space exploration and implementation tools [10]. The NoC design flow needs to be integrated with industry standard automation tool flows. There is a need for open benchmarks [11] to compare different NoC designs in terms of performance, cost, reliability and many other features. Computer-aided synthesis of NoCs is particularly important to design and select the best performing design in a reasonable amount of time.
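As a companion to the thermal rule of thumb and power concerns discussed under the Power Budget item above, the following minimal Python sketch encodes the lifetime-halving-per-10 °C rule together with the standard dynamic switching power expression P = α·C·V²·f. The baseline lifetime, capacitance, voltage and frequency values are illustrative assumptions, not measurements from this work.

```python
# Minimal sketch of two rules of thumb referenced above (illustrative numbers only).

def mttf_years(temp_c: float, ref_temp_c: float = 60.0,
               ref_mttf_years: float = 10.0) -> float:
    """Mean time to failure roughly halves for every 10 C rise in temperature."""
    return ref_mttf_years * 2.0 ** (-(temp_c - ref_temp_c) / 10.0)

def dynamic_power_w(alpha: float, c_farads: float, vdd_v: float, freq_hz: float) -> float:
    """Classic switching power estimate: P = alpha * C * Vdd^2 * f."""
    return alpha * c_farads * vdd_v ** 2 * freq_hz

print(f"MTTF at 70 C: ~{mttf_years(70.0):.1f} years (assuming 10 years at 60 C)")
print(f"Dynamic power: ~{dynamic_power_w(0.1, 1e-9, 1.0, 200e6) * 1e3:.0f} mW "
      f"(alpha = 0.1, C = 1 nF, Vdd = 1.0 V, f = 200 MHz, all assumed)")
```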

1.5 Organization of this Thesis

The rest of this thesis is organized as follows. In Chapter 2, an overview of NoC architectures along with their performance and cost models is presented. Chapter 3 presents NoC router design and its impact on the overall power consumption, area and system performance; in addition, some low power design techniques applicable to NoC router design are discussed and evaluated. In Chapter 4, the impact of technology scaling on NoC interconnects is discussed and some schemes to optimize performance are presented. Additionally, an efficient design flow based on interconnect modeling is presented to achieve high performance and low power NoC interconnects. In Chapter 5, high level power models for different NoC architectures are presented to estimate the power budget in the early phase of the design cycle. Chapter 6 presents a study on achieving low power NoC design based on activity factors; asynchronous vs. synchronous design is evaluated.

CHAPTER 2

NoC Overview: Architecture, Performance and Cost

NoCs are now being considered by many as a viable alternative for designing scalable communication architectures for present and future generations of SoC designs [32]. In multimedia processors, inter-core communication demands can often scale up into the range of GB/s, and this demand is expected to peak with the increasing integration of many heterogeneous/homogeneous high performance cores into a single chip. To meet such increasing bandwidth demands, state-of-the-art buses such as AMBA and CoreConnect have been instantiated as multiple buses operating in parallel, thereby providing a crossbar-like architecture, which still remains inherently non-scalable with low performance. To effectively tackle the interconnect complexity of modern SoC designs, a scalable and high performance interconnect architecture is needed, and hence NoCs [12].

Figure 2.1: Conceptual view of Network-on-Chip [6]

2.1 NoC Building Blocks

The most important components forming the NoC architecture are the network interfaces, the routers and the links. A NoC is formed by interconnecting these network elements in different configurations to form a topology. A sample topology is shown in Figure 2.2. The topology may either be standard, such as a mesh or a ring, or arbitrarily connected to match the requirements of the target application. A network interface includes packetizing and de-packetizing logic for packet based communication. The arbitration for different flows happens at the routers, which decide which master/source gets priority on the links downstream. These basic building blocks of the NoC are explained in more detail in the following sub-sections.

Figure 2.2: Network-on-Chip

2.1.1 Network Interfaces

A Network Interface (NI) is needed at each node to connect the IP (core) to the NoC. Network interfaces convert transaction requests into packets for injection into the network and receive packets in response to other transactions in the network. When packets are transmitted, they are split into a sequence of flits (flow control units) to minimize physical wiring in the network. The flit width can be static or configurable based on the system requirements; for example, a flit width can vary from 4 wires to as many as 200 wires, including data and control lines, depending on the needs of the system. Network interfaces also provide buffering at the interface to improve network performance. A small sketch of this packet-to-flit segmentation is given below.
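As an illustration of the packetization step described above, the following minimal Python sketch splits a payload into fixed-width flits and tags them with head/body/tail markers. The 4-byte flit width and the flit type encoding are illustrative assumptions, not the packet format used later in this thesis.

```python
# Minimal sketch of NI packetization (illustrative, not the thesis packet format):
# split a byte payload into fixed-width flits with head/body/tail markers.

def packetize(payload: bytes, flit_bytes: int = 4):
    """Yield (flit_type, chunk) pairs for a payload split into flit-sized chunks."""
    chunks = [payload[i:i + flit_bytes] for i in range(0, len(payload), flit_bytes)]
    for idx, chunk in enumerate(chunks):
        if idx == 0:
            flit_type = "HEAD"   # would carry routing information in a real NI
        elif idx == len(chunks) - 1:
            flit_type = "TAIL"   # closes the packet and releases network resources
        else:
            flit_type = "BODY"
        yield flit_type, chunk.ljust(flit_bytes, b"\x00")  # pad the last flit

for ftype, flit in packetize(b"example NoC payload"):
    print(ftype, flit.hex())
```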

2.1.2 Switches

Switches are the medium of transportation for packets in the NoC architecture; they route packets from sources to destinations. Switches are fully parameterizable in the number of input and output ports, and they can be connected arbitrarily, so any topology, standard or custom, can be configured. A crossbar is used to connect the input and output ports of a switch. The switches are also equipped with an arbiter to resolve conflicts among packets from different sources when they overlap in time and request access to the same output link. An arbiter is most often implemented using either a round-robin or a fixed priority scheduling policy; a small model of the round-robin case is sketched below. In switches, input and output buffering is used to avoid deadlock, lower congestion and improve performance [13]. The buffering resources are instantiated depending on the desired flow control protocol. If credit-based flow control is chosen, only input buffering is necessary. Output buffers can still be deployed to decouple the propagation delays within the switch and along the downstream link; the downside is a second cycle of latency and additional area and power overhead.
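To make the arbitration step concrete, here is a minimal behavioral Python model of a round-robin arbiter granting one requester per cycle. It is a sketch for illustration only, not the arbiter implemented later in this thesis (which uses a matrix arbiter design).

```python
# Behavioral sketch of a round-robin arbiter (illustration only):
# one grant per cycle, starting the search just after the last granted port.

class RoundRobinArbiter:
    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        self.last_grant = num_ports - 1   # so port 0 is checked first initially

    def arbitrate(self, requests):
        """requests: list of bools, one per input port. Returns granted port or None."""
        for offset in range(1, self.num_ports + 1):
            port = (self.last_grant + offset) % self.num_ports
            if requests[port]:
                self.last_grant = port
                return port
        return None   # no requests this cycle

arb = RoundRobinArbiter(4)
for cycle, req in enumerate([[True, True, False, True]] * 3):
    print(f"cycle {cycle}: grant -> port {arb.arbitrate(req)}")   # grants 0, 1, 3
```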

2.1.3 Links

In a NoC there are two major signal paths, namely router-to-router and router-to-processing-element (PE). Compared to the short local wires inside PEs and routers, the global wires between routers and the semi-global wires between a router and a PE form a critical part of NoCs. As semiconductor technologies shrink, NoC links play a major role in overall system performance, and they are an area of active research today.

2.2 NoC Architectures

A NoC architecture (or topology) specifies the physical arrangement of the interconnection network. It defines how nodes (IPs), switches (routers), and links are connected to each other. The nodes may be of the same type, e.g., processing cores, or of different types, e.g., audio cores, video cores, wireless transceivers, memory banks, etc., as shown in Figure 2.2. Each IP is connected to a local router through a Network Interface (NI) module. The NI module packetizes/depacketizes the data into and from the interconnection network. The PE together with its NI forms a network node. Nodes communicate with each other by injecting data packets into the network. The packets traverse the network to their destinations based on various routing algorithms and flow control mechanisms. NoC architectures can be classified into three broad categories: direct networks, indirect networks, and irregular networks [10]. Eleven standard topologies applicable to NoC are shown in Figure 2.3. For this research, only one or two topologies from each group are selected to compare and contrast; these are discussed in more detail below.

Figure 2.3: Eleven Standard NoC Topologies: (a) Cliche, (b) Torus, (c) Folded Torus, (d) Octagon, (e) Ring, (f) Spidergon, (g) Binary Tree, (h) BFT, (i) SPIN, (j) Hypercube (Hcube), (k) Star

2.2.1 CLICHÉ

Kumar et al. [14] proposed the Chip Level Integration of Communicating Heterogeneous Elements (CLICHÉ) topology. It is a 2D mesh consisting of an m × n array of switches interconnecting Intellectual Property (IP) elements. A mesh consisting of 16 IPs is shown in Figure 2.4. Every switch, except those at the edges, is connected to four neighboring switches and one IP block. In a 2D mesh, the number of switches is equal to the number of IPs. The switches and the IPs are connected through communication channels; a channel consists of two unidirectional links between two switches or between a switch and a resource. The CLICHÉ topology is widely used in NoC designs because of its simplicity, regular structure and short inter-switch wires, which make it well suited for tile based architectures. In this topology, under-utilization of links may result in the event of localized traffic, because in some cases not all PEs have the same communication requirements. This leads to mapping inefficiencies and wastage of resources.

Figure 2.4: CLICHÉ Architecture

2.2.2 TORUS

Dally and Towles [15] proposed the 2D torus, shown in Figure 2.5, as a NoC architecture. A 2D torus is basically the same as a regular mesh, except that the switches at the edges are connected to the switches at the opposite edge through wrap-around channels. Every switch has five ports, one connected to the local resource and the others connected to the closest neighboring switches. The number of switches in a torus topology is equal to the number of PEs. The main drawback associated with this topology is the long wrap-around connections, as they can yield excessive delays. However, this can be avoided by folding the torus, which leads to a more suitable VLSI implementation. Some of its advantages over a mesh based architecture are: (i) smaller hop count, (ii) higher bandwidth, (iii) decreased contention and (iv) optimized chip space usage; the hop count advantage is quantified in the sketch below.

Figure 2.5: Torus Architecture
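The smaller hop count of the torus can be checked with a quick calculation: in a k × k mesh the worst-case switch-to-switch distance is 2(k-1) hops, while the wrap-around channels cut each dimension's worst case to ⌊k/2⌋. The short Python sketch below compares the two; it is a simple illustration, not an analysis performed in this dissertation.

```python
# Quick illustration: worst-case hop count in a k x k mesh vs. a k x k torus
# (switch-to-switch hops only; wrap-around links halve each dimension's span).

def max_hops_mesh(k: int) -> int:
    return 2 * (k - 1)

def max_hops_torus(k: int) -> int:
    return 2 * (k // 2)

for k in (4, 8, 16):   # 16, 64 and 256 IPs respectively
    print(f"{k}x{k}: mesh worst case = {max_hops_mesh(k)} hops, "
          f"torus worst case = {max_hops_torus(k)} hops")
```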

2.2.3 BFT

Pande et al. [16] proposed the Butterfly Fat Tree (BFT) as a NoC topology, as shown in Figure 2.6. A BFT architecture is a modification of the fat-tree architecture. In this network, the IPs are placed at the leaves and the switches at the vertices. Each switch has four child ports and two parent ports. The IPs are connected to the N/4 switches at the first level.

Figure 2.6: BFT Architecture

The number of levels depends on the total number of IPs, i.e., for N IPs the number of levels will be log4(N). At the j-th level of the tree there are N/2^(j+1) switches. The total number of switches in the butterfly fat tree architecture converges to a constant independent of the number of levels. If we consider a 4-ary tree, as shown in Figure 2.6, with four down links connected to child ports and two up links connected to parent ports, then the total number of switches at level 1 is N/4. The most common drawback of a tree-based topology in general is that the root node, or the nodes close to it, becomes a bottleneck. However, the bottleneck can be removed by allocating a higher channel bandwidth to the channels located close to the root nodes.
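As a quick check of the level counts quoted above (log4(N) levels with N/2^(j+1) switches at level j), the sketch below tabulates the switches per level; it is an illustrative calculation only, assuming N is a power of four.

import math

def bft_switches(num_ips):
    """Switches per level of a Butterfly Fat Tree with num_ips leaf IPs.

    Follows the text above: log4(N) levels, with N / 2**(j+1) switches
    at level j (level 1 holds N/4 switches next to the IPs).
    """
    levels = round(math.log(num_ips, 4))
    per_level = [num_ips // 2 ** (j + 1) for j in range(1, levels + 1)]
    return per_level, sum(per_level)

# Example: 64 IPs -> [16, 8, 4] switches per level, 28 in total (bounded by N/2)
print(bft_switches(64))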

2.2.4 SPIN

Guerrier and Greiner [17] proposed a generic architecture called Scalable, Programmable, Integrated Network (SPIN) as a NoC communication network. SPIN makes use of a fat-tree topology. Every node has four children, and the parent node is replicated four times at any level of the fat tree. A basic SPIN architecture with 16 nodes (IPs) is shown in Figure 2.7.

Figure 2.7: SPIN Architecture

This topology has some redundant paths and therefore offers higher throughput at the cost of added area. It is scalable and uses a small number of routers for a given number of IPs. It has a natural hierarchical structure which may be suitable for some particular applications.

2.2.5 OCTAGON

Karim et al. [18] proposed the Octagon architecture for NoC. A basic Octagon configuration includes eight nodes and 12 bidirectional links, as shown in Figure 2.8. Each node is associated with an IP and two neighboring switches. Communication between any pair of nodes takes at most two hops within a basic Octagon unit. It is a scalable architecture: for a system containing more than eight nodes, the Octagon architecture is extended by interconnecting multiple basic Octagon units that share a single common node.

Figure 2.8: Octagon Architecture

The main disadvantage of this topology is that, if only one more node is needed beyond an 8-node cluster, eight more nodes are added to the design to complete a new basic unit, instead of just the one or few nodes actually required. Still, in some cases this topology may prove to be useful.

2.3 NoC Flow Control Protocols

Flow control allocates network resources to the packets traversing the network and provides a solution to network contention. In a NoC, flow control is important as it determines: (a) the number of buffering resources required in the system; an efficient flow control will minimize the number of required buffers and their idle time; and (b) the latency that packets incur while traversing the network, which matters under heavy traffic conditions, where fast packet propagation with optimum resource utilization is the key for time-sensitive data in the network. Flow control may be buffered or bufferless. Buffered flow control is more advantageous in terms of lower latency and higher throughput. Four different types of buffer-based flow control protocols for NoC are:

• CREDIT Based is an availability-based flow control scheme in which the upstream node keeps a count of the free buffer slots available downstream; these free slots are termed credits. Once a transmitted data packet is either consumed or forwarded further, a credit is sent back. Bolotin et al. [19] used credit-based flow control in QNoC, a QoS-based, hardware-efficient SoC integration mechanism for NoC. (A small counter sketch of this bookkeeping follows this list.)

• ACK/NACK is a retransmission-based flow control scheme in which a copy of each transmitted flit is kept in a buffer until an ACK or NACK signal is received. If an ACK signal is received, the flit is deleted from the buffer; if a NACK signal is received, the flit is re-transmitted. Bertozzi et al. [20] used this retransmission-based flow control in Xpipes, a NoC implementation.

• STALL/GO is a simple variant of credit-based flow control in which a STALL is issued, based on the status of the downstream buffer, when no buffer space is available; otherwise a GO signal is issued, indicating the availability of buffer space to accept the next transaction. In this scheme, two wires are used for flow control between each pair of sender and receiver.

• HANDSHAKING Signal is a message-based flow control scheme in which a VALID signal is asserted whenever a sender transmits a flit, and the receiver acknowledges after consuming the data flit. Zeferino et al. [21] used handshaking signals in their SoCIN NoC implementation.
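The credit bookkeeping mentioned in the first bullet can be sketched in a few lines. The class below is a simplified, hypothetical model of one sender-receiver pair and is not taken from any of the cited implementations.

class CreditLink:
    """Minimal sketch of credit-based flow control on one link.

    The upstream side holds one credit per free downstream buffer slot;
    it may send a flit only while credits remain, and a credit returns
    when the downstream router consumes or forwards a flit.
    """
    def __init__(self, buffer_slots):
        self.credits = buffer_slots        # free slots advertised downstream
        self.downstream_queue = []

    def send_flit(self, flit):
        if self.credits == 0:
            return False                   # stall: no buffer space downstream
        self.credits -= 1
        self.downstream_queue.append(flit)
        return True

    def downstream_consumes(self):
        if self.downstream_queue:
            self.downstream_queue.pop(0)
            self.credits += 1              # credit flows back upstream

link = CreditLink(buffer_slots=2)
print(link.send_flit("f0"), link.send_flit("f1"), link.send_flit("f2"))  # True True False
link.downstream_consumes()
print(link.send_flit("f2"))  # True again after a credit returns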

2.4 NoC Switching Techniques

In a NoC, the switching technique determines how data flows through the routers in the network. It defines the granularity of data transfer and the mechanism by which it is applied.

Messages generated by the source node are partitioned into several data packets. A packet is further divided into multiple flits (flow control units). A flit is the elementary unit on which link-level flow control operations are performed, and is essentially a synchronization unit between routers for each data transfer. Each flit is made up of one or more phits (physical units). A phit is the unit of data that is transferred on a link in a single clock cycle; typically, the size of a phit is the width (in bits) of the communication link [10]. Different NoC architectures use different phit, flit, and packet sizes, and the choice of sizes can have a significant impact on the cost, performance, and power of NoC fabrics. As shown in Figure 2.9, the two main modes of transporting flits in a NoC are Circuit Switching and Packet Switching.

Figure 2.9: NoC Switching Techniques

These two techniques are discussed in more detail next.

1. In Circuit Switching, a physical path between the source and the destination is reserved prior to the transmission of data. The physical path consists of a series of links and routers, and the message is sent in its entirety to the receiver once a path (circuit) is established. A message header flit traverses the network from the source to the destination, reserving links along the way through the routers. If the header flit reaches the destination without any conflict, all the links in the path are available and an acknowledgment is sent back to the sender from the receiver. Upon receiving this confirmation, the sender sends out the data on the reserved path. The path is held until all the data has been transmitted; at the end, a tail flit frees the resources for other transmissions. If a link is busy, however, a negative acknowledgment is sent back to the sender for further action. The main advantage of this approach is that the full link bandwidth is available to the circuit once it has been set up, which results in low latency for the data transfer. On the other hand, its main drawback is that it does not scale with the size of the network, as several links can be occupied for the duration of the transfer. Circuit switching is implemented in the SoCBUS NoC architecture [22].

2. In Packet Switching, no path is reserved before sending any data; the packets are transmitted from the source and make their way independently to the receiver. In circuit switching there is a start-up waiting time for path reservation followed by a fixed minimal latency in the routers, whereas packet switching has zero start-up time followed by variable latency due to contention in the routers. There are three different packet switching techniques: (i) Store and Forward (SAF), (ii) Virtual Cut Through (VCT), and (iii) Wormhole (WH) switching. These are discussed in more detail below.

(i) In Store and Forward (SAF) Switching, a packet is sent from one router to the next only if the receiving router has buffer space for the entire packet. Hence, a packet transfer across a link cannot stall mid-packet, and there is no concept of a flit (the flit is equal to the packet). Routers forward a packet only when it has been received in its entirety, so the buffer in each router is at least as large as the packet size. Because of this large buffer size requirement, the technique is not commonly used in NoCs; however, the Nostrum NoC [14] makes use of it.

(ii) In Virtual Cut Through (VCT) Switching, a flit of a packet is forwarded as soon as space for the entire packet is available in the next router, thereby reducing the per-router latency. The other flits then follow without delay. If no space is available, the whole packet is buffered. The buffering requirements of SAF and VCT switching are the same.

(iii) In Wormhole (WH) Switching, the buffer requirement is reduced to one flit instead of an entire packet. A flit from a packet is forwarded to the receiving router if space for that flit is available there. In this scheme, a packet in transit may be distributed among two or more routers at a given time; if the head of the packet is blocked, the packet stalls in place and occupies links across several routers, which results in higher congestion than with SAF and VCT switching. However, link blocking can be alleviated by multiplexing virtual links (or virtual channels). WH switching is also more susceptible to deadlocks, due to interdependencies among routers, but deadlock can likewise be avoided through virtual channels and suitable routing schemes. Figure 2.10 shows the major differences between SAF and WH routing in terms of buffer and timing needs. Due to its lower buffering requirements, almost all NoCs use WH switching.

Figure 2.10: Store & Forward Routing Vs. Cut-Through Routing
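The timing difference summarized in Figure 2.10 can be illustrated with a contention-free, first-order model; the cycle counts below are illustrative assumptions (one flit per link per cycle, a fixed per-hop routing delay), not measurements from this work.

def saf_latency(hops, packet_cycles, route_cycles=1):
    """Contention-free store-and-forward latency in cycles:
    the whole packet is received and re-sent at every hop."""
    return hops * (route_cycles + packet_cycles)

def wormhole_latency(hops, packet_cycles, route_cycles=1):
    """Contention-free wormhole/cut-through latency in cycles:
    only the header pays the per-hop cost; the body pipelines behind it."""
    return hops * route_cycles + packet_cycles

# 5 hops, 8-flit packet (1 flit per link per cycle)
print(saf_latency(5, 8), wormhole_latency(5, 8))   # 45 vs 13 cycles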

2.5 NoC Routing

NoC routing is responsible for the correct and efficient routing of packets (or circuit flows) traversing the network from sources to destinations. It determines the paths over which data flows through the network. There are several different routing schemes in the research literature; some of the major classifications are (i) Static or Dynamic, (ii) Distributed or Source, and (iii) Minimal or Non-Minimal routing. These are discussed in more detail below.

(i) In a Static Routing scheme (also known as oblivious or deterministic routing), permanent paths from a source to a destination are defined and used regardless of the current state of the network. This routing scheme does not take into account the current load of network links and routers when making routing decisions. In a Dynamic Routing scheme, by contrast, routing decisions are made according to the current state of the network (load, available links); as a result, the traffic between a source and a destination may change its route over time. Static routing is simpler to implement in terms of router logic and interactions between routers. A major advantage of single-path static routing is that all packets with the same source and destination are routed over the same path and can be kept in order, so there is no need to number and reorder the packets at the network interface. Static routing is more appropriate when traffic requirements are steady and known, and dynamic routing is appropriate when traffic conditions are variable and unpredictable [23]. Both static and dynamic routing techniques can be further classified based on where the routing information is held and where routing decisions are made. (A small sketch of a deterministic, dimension-ordered route on a mesh follows this list.)

(ii) In Distributed Routing, each packet carries the destination information, and the routing decisions are implemented in each router either by looking up routing tables or by executing a hardware routing function. In this method, each router in the network contains a predefined routing function whose input is the destination address from the packet and whose output is the routing decision. When a packet arrives at the input port of a router, its output port is looked up in the table or calculated by the routing logic according to the destination address [23]. However, a table-based implementation is only practical for relatively small systems because the table size increases linearly with network size. In a Source Routing scheme, source nodes predetermine complete routing paths before injecting packets into the network. The pre-computed path is stored in the message header and the switches simply read the routing information. The implementation is by means of routing tables (or look-up tables) stored at the end nodes. This solution allows messages to be routed over any irregular topology configuration, since the header includes the entire path. However, this scheme consumes network bandwidth, as the path information is transmitted through the network with each packet. Additionally, the size of the look-up tables at every end node grows linearly with the system size, and quadratically with respect to NoC size [flich]. Examples of real NoCs using source-based routing implementations are Xpipes [24] and the Intel Polaris chip [25].

(iii) Based on the number of hops a message takes, routing schemes can further be classified as Minimal or Non-Minimal distance routing. A route is minimal if the length of the routing path from the source to the destination is the shortest possible length between the two nodes. In a Minimal Routing scheme, a source does not start sending packets if a minimal path is not available. In contrast, in a Non-Minimal Routing scheme there is no such constraint, and messages can take longer paths if a minimal path is not available. By allowing non-minimal paths, the number of alternative paths is increased, which can be useful for avoiding congestion and hot spots and for fault tolerance. However, non-minimal routing can incur an undesirable overhead of additional power consumption, and it can be prohibitively expensive for a large-scale NoC design [10].
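As an illustration of the static/deterministic class in item (i), the sketch below computes a dimension-ordered (XY) route on a 2D mesh, a common deterministic scheme; the coordinates and port names used here are illustrative.

def xy_route(src, dst):
    """Deterministic XY routing on a 2D mesh: correct X first, then Y.

    src and dst are (column, row) switch coordinates; the function returns
    the ordered list of output ports a packet takes. This is one common
    example of the static/deterministic class described in item (i).
    """
    (x, y), (dx, dy) = src, dst
    ports = []
    while x != dx:                      # traverse the X dimension first
        ports.append("E" if dx > x else "W")
        x += 1 if dx > x else -1
    while y != dy:                      # then traverse the Y dimension
        ports.append("N" if dy > y else "S")
        y += 1 if dy > y else -1
    ports.append("LOCAL")               # eject to the destination IP
    return ports

print(xy_route((0, 0), (2, 3)))   # ['E', 'E', 'N', 'N', 'N', 'LOCAL']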

2.6 NoC Performance and Cost

The main function of a NoC is to transfer information from any source node to any desired destination node. It should be able to accomplish this task in as little time as possible, and it should allow a large number of such transfers to take place concurrently. Thus, it is highly desirable for a NoC design to exhibit high performance (high throughput and low latency) with low cost (low power and small area overhead). As with any digital design, most of these metrics trade off against each other and require a careful balance. A more detailed discussion of these metrics is presented below.

2.6.1 NoC Power Dissipation

Until recently, power has been a second-order concern in chip design, following first-order issues such as cost, area and timing. Today, for most SoC-based designs, the power budget is one of the most important design goals of the project. Exceeding the power budget can be fatal to a project, whether it means moving from a cheap plastic package to an expensive ceramic one, causing unacceptably poor reliability due to excessive power density, or failing to meet the required battery life. These problems are expected to become worse for process geometries of 90 nm and below. For example, leakage power alone has become a significant part of the total chip power, reaching almost 40% in a generic 65 nm technology. Reducing power wherever possible is therefore essential for any design, including NoCs. The NoC concept is expected to be applied not only in high-end SoC designs but also in very small devices such as mobile and wireless communication devices. For virtually all applications, reducing the power consumed by the NoC is essential for it to be feasible and successful. In a NoC, power is dissipated when flits travel through the network: both the inter-switch wires and the logic in the switches toggle. High-level power models and an IP-based design methodology to achieve low-power NoC designs are presented in Chapter 5.

2.6.2 NoC Area Overhead

Area consumed on silicon directly relates to the associated cost of the design. In a NoC design, the silicon area overhead arises from the presence of the switches, the large number of global interconnects, and repeaters. For longer interconnects, repeater insertion is necessary to keep the inter-switch delay within one clock cycle. The total number of repeaters required depends on the length and the total number of interconnects in the network. Additionally, the spacing between interconnects is usually optimized for signal integrity in deep nanometer technologies. Repeater insertion, in addition to optimized interconnect spacing, may result in a large area overhead for the NoC. Similarly, NoC switches have two main area-sensitive components: the storage buffers and the control logic that implements routing and flow control. The storage buffers are the FIFOs (First-In First-Out) necessary to maintain network performance, and they mainly trade off against area. As SoC designs scale, the area overhead may become very large or impractical to implement. It has been reported that future SoCs may have as many as 1000 IPs on a single chip. Keeping this in mind, area minimization techniques are highly desirable at the circuit level of design. A detailed discussion on performance optimization of interconnects for an area-limited NoC design is presented in Chapter 4.

2.6.3 NoC Message Latency

Message latency is the time elapsed since a message is generated at its source node until that message is delivered at its destination node. An unloaded or Zero load latency of a network is the latency where only one packet traverses the network.

This model does not consider contention among packets. The zero load latency of an

NoC with wormhole switching is

T_{network} = N_{hops} t_{sw} + t_{links} + L_p / B \qquad (2.1)

where the first term represents the routing delay, t_{links} is the total propagation delay of the communication channels (inter-switch links), and the third term is the serialization delay of the packet [26]. In more detail, N_{hops} is the average number of hops a packet traverses to reach the destination node, and t_{sw} is the switch delay, calculated as

t_{sw} = N_{sw-cycles} / f_{sw} \qquad (2.2)

where N_{sw-cycles} is the number of clock cycles required for packet processing by the switch and f_{sw} is the switch frequency. L_p is the length of the packet in bits, and B is the bandwidth of the communication channel, defined as B = w_c f_c, where w_c is the channel width in bits and f_c is the channel frequency. In almost all NoCs f_c = f_{sw} = f, and with this assumption the NoC latency can be written as

T_{network} = N_{hops} (N_{sw-cycles} / f) + t_{links} + L_p / (w_c f) \qquad (2.3)

The zero-load network latency captures the effect of the network topology on timing performance. In a Multi-processor SoC (MpSoC) design, message latency directly affects processor idle time and memory access time.
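Equation (2.3) can be evaluated directly; the short sketch below does so for an assumed set of parameters (hop count, switch cycles, link delay, packet and channel widths), purely as a worked example.

def zero_load_latency(n_hops, n_sw_cycles, t_links, packet_bits, w_c, f):
    """Zero-load wormhole latency of Equation (2.3), in seconds.

    n_hops      : average hop count
    n_sw_cycles : clock cycles a switch needs to process a flit header
    t_links     : total link propagation delay (s)
    packet_bits : packet length L_p in bits
    w_c, f      : channel width (bits) and switch/channel frequency (Hz)
    """
    routing = n_hops * n_sw_cycles / f
    serialization = packet_bits / (w_c * f)
    return routing + t_links + serialization

# e.g. 4 hops, 3 cycles/switch, 2 ns of wire delay, 256-bit packet, 32-bit links, 500 MHz
print(zero_load_latency(4, 3, 2e-9, 256, 32, 500e6))   # ~4.2e-08 s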

2.6.4 NoC Throughput

Typically, the performance of a digital communication network is characterized by its aggregate bandwidth in bytes/sec, which is a static measure of network capacity. However, in the case of a NoC it is more important at what rate messages can be sent or completed by the network; Throughput is therefore a more appropriate metric for NoC. The throughput of a network is defined as the total number of messages handled by the network per unit time [26]. It is the average rate of successful message delivery per unit time, and can be defined as

Throughput = L_{tot-msg} / (N_{IP} t_{total}) \qquad (2.4)

where

t_{total} = N_{cycles} / f \qquad (2.5)

thus

Throughput = (L_{tot-msg} / (N_{IP} N_{cycles})) × f \qquad (2.6)

where L_{tot-msg} is the total length of the messages, measured in bits, that have successfully reached their destinations, N_{IP} is the total number of network IPs involved in the process, and t_{total} is the total time elapsed between the generation of the first message and the reception of the last message. It depends on N_{cycles} (the number of cycles consumed) and the frequency f of the switch. If the frequency of the switches is increased, the throughput of the network increases as well. The notion of virtual channels mentioned previously helps in achieving higher throughput by increasing channel utilization. Thus, NoC throughput can be improved by making use of virtual channels and by operating switches at higher frequencies. These factors are discussed and explored in more detail in Chapters 3 and 4.
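Similarly, Equation (2.6) is a one-line calculation; the numbers in the example below are assumed values chosen only to show the units involved.

def noc_throughput(total_message_bits, num_ips, num_cycles, f):
    """Throughput of Equation (2.6): delivered bits per IP per second."""
    return total_message_bits / (num_ips * num_cycles) * f

# 16 IPs deliver 2,000,000 bits in 100,000 cycles at 200 MHz
print(noc_throughput(2_000_000, 16, 100_000, 200e6))   # 2.5e+08 bits/s per IP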

2.7 High-level Physical Characteristics of NoC Architectures

The performance and associated cost of a NoC architecture are directly related to its physical interconnection structure, i.e., its topology. Each topology offers a unique combination of performance (in terms of throughput and latency) and cost (power and area requirements). As the number of topologies that can be considered for a NoC design increases, so does the need to predict the capabilities of each topology. To choose an appropriate topology based on theoretical information, it is necessary to view a topology as a graph of nodes and links and to understand the physical characteristics it offers. Some of the network concepts used for macro networks, such as switch degree, diameter, link cost and average distance, are also applicable to NoC architectures, and these physical attributes can be used to perform a high-level comparison among different NoC topologies. These parameters are discussed in more detail below; a small sketch that computes several of them for a mesh follows the list.

• Number of Switches: is defined as the total number of switches required to fully interconnect all the nodes with a particular topology. Different NoC topologies require different numbers of switches for the same number of nodes.

• Switch Degree: is defined as the total number of input/output ports of a

switch. The operating frequency of a switch and its area requirements are

strongly related to this property: the higher the switch degree, the lower the

switch maximum operating frequency and the higher its area cost.

• Network Diameter: is defined as the minimum distance between the farthest nodes in the network. It gives the maximum routing distance under a minimal routing scheme. This property, however, depends on the physical implementation of the topology; to be more precise for on-chip networks, it should be defined as the maximum number of cycles between two cores. The higher the value of this property, the longer messages take to reach their destinations.

• Link Cost: is the minimum number of links required to fully interconnect all the nodes in the topology. It reflects the wiring cost associated with the topology. The number of links in a NoC is not a critical resource, but the delay associated with a link is a critical factor: longer interconnects may require several pipeline stages to meet a target system frequency. Since the real link delay depends on the layout and the technology library used, it is very difficult to provide an accurate delay estimate based on link cost alone. On the other hand, if a layout mapping is available, this parameter can be used to obtain a good estimate of the implementation cost in terms of metal resources and the number of repeaters required for a particular topology.

• Bisection Bandwidth: is defined as the smallest aggregated bandwidth ob-

tained by dividing the topology into two equal halves. It is a common measure

of theoretical network performance: the higher the bisection bandwidth, the

better the topology is suited to cope with high traffic loads. Since the primary

goal is to judge bandwidth and the resulting wiring demand; only data lines are

considered in the estimation process [27].

• Average Distance: is the average number of hops between node pairs, obtained by calculating the distance from each node to every other node in the network. The higher this value, the higher the communication cost in terms of power, since more power is consumed as messages traverse longer distances.

• Symmetry: a topology is symmetric when the network looks the same from every switch. A symmetric topology offers more communication paths, but at the same time some areas may be under-utilized if the traffic is not evenly distributed.
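Several of these graph-level metrics can be computed directly from an adjacency list. The sketch below builds a 2D mesh of switches and reports its diameter and average distance in hops; it is an illustrative calculation, not part of the evaluation flow used later in this work.

from collections import deque

def mesh_graph(rows, cols):
    """Adjacency list of a rows x cols mesh of switches (4-neighbor)."""
    adj = {(r, c): [] for r in range(rows) for c in range(cols)}
    for (r, c) in adj:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if (r + dr, c + dc) in adj:
                adj[(r, c)].append((r + dr, c + dc))
    return adj

def hop_counts(adj, src):
    """Breadth-first hop distance from src to every switch."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def diameter_and_average(adj):
    """Network diameter and average distance in hops over all switch pairs."""
    dists = [d for n in adj for d in hop_counts(adj, n).values() if d > 0]
    return max(dists), sum(dists) / len(dists)

adj = mesh_graph(4, 4)
print(diameter_and_average(adj))   # (6, 2.666...) for a 4x4 mesh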

2.8 NoC Design Flow

The NoC design flow consists of a sequence of design activities [28][29][30]. The exact set and order of activities is always product specific. Designing NoCs so that they meet SoC design requirements is a complex process and requires a careful design strategy. The design choices made at each step have a significant impact on the overall system performance and cost. Some of the most important phases in designing a NoC include:

• Application Description - This phase is responsible for defining the communication needs of the system, such as frequencies or bandwidth. The general characterization is done by means of a graph, where each vertex represents a computation module or IP in the application, referred to as a task, and the edges denote the interdependencies between the tasks. Alternatively, a table can also be used to represent the application's communication requirements.

• Topology Selection - This phase of the design cycle involves exploring various NoC topologies against design objectives such as communication delay, area and power consumption. The design choices span from standard regular topologies to fully custom topologies. The designer could even adopt a hierarchical or mixed topology scheme to satisfy the system requirements [6].

• IP Mapping - This is the process of mapping a given set of IP cores onto the tiles of the selected communication architecture so as to satisfy the design requirements. Many different types of mapping algorithms have been proposed to achieve efficient mapping of IP cores in NoC architectures, based on bandwidth, latency and power awareness.

• Architecture Configuration - This involves selecting routing and switching schemes, fixing buffer sizes, and so on. Many heuristic-based design techniques exist to select the values that best suit the architecture's communication needs and result in a near-optimal solution.

• Design Synthesis and Validation - This involves describing the network components in a Hardware Description Language (HDL) and synthesizing them using synthesis tools. In this phase, standard component libraries for switches and network interfaces can be used. The cost and performance numbers are obtained from simulations and are dependent on the selected network components and their corresponding configurations. Design validation of the NoC implementation is an important step to verify the design against the initial requirements in terms of communication latencies, throughput, area and power.

In order to handle the design complexity and meet the tight time-to-market con- straints, it is important to automate most of these NoC design phases. To achieve design closure, the different phases should also be integrated in a seamless manner.

2.9 Summary

Developing a communication system with tens or hundreds of processor-like resources is a formidable task and involves careful design considerations. There are many factors that may affect the choice of an appropriate interconnection network topology/architecture for an SoC design. At the architectural level, design space exploration with modeling and synthesis tasks is a must to fully understand the impact of the selected architecture on the overall performance and cost of the design. There are many parameters in the architectural design phase which can affect the key trade-offs between performance and power dissipation, such as the length of the physical wires, the switching techniques employed, buffer allocation, routing algorithms, the type of service level (guaranteed/not guaranteed), and the implementation of the topology itself. In this chapter, four main NoC architectures (Cliche, BFT, SPIN and Octagon) were discussed along with their performance and cost models. Some high-level architectural parameters for various NoC topologies were presented for a generic comparison only.

CHAPTER 3

NoC Router Architecture Design and Cost

A router (or switch) is one of the main building blocks of a NoC. The main function of a router is to make routing decisions and to forward packets arriving on the incoming links to the proper outgoing links. The high-level design of a generic switch required to implement packet-based communication is shown in Figure 3.1; it mainly consists of input/output FIFO buffers, input/output arbiters, MUX units and control logic [31].

Figure 3.1: A Generic Router Design

A switch in the network is connected to other switches and to an IP node through interconnects or links, which form the channels of the network. The router design critically affects the performance and cost of the whole network in terms of throughput, latency, power and area [32][33]. The number of input and output ports is generally small and depends on the architecture type. The time between when information enters an input port and when it leaves the switch through an output port is called the switch delay. A detailed design of the switch architecture is presented in the following sections.

3.1 Main Parts of NoC Router

A baseline architecture for a NoC switch is shown in Figure 3.1. In this configuration, the incoming flits are received by the Link Controller (LC) and stored in the input buffers. The flow control logic is responsible for communicating buffer availability among neighboring switches, while the routing logic determines the output destination based on the information in the header flits [34]. Each incoming packet is directed to the Header Decoder Unit (HDU) to determine its destination. The inputs that are allowed to send data over the crossbar are determined by the switch arbiter, which resolves all conflicting requests for the same output ports. Optionally, the flits that cross the crossbar may be stored in output buffers. The operation of the switch thus consists of one or more of the following processes, depending on the nature of the flit. In the case of a header flit, the processing sequence is: 1) input arbitration, 2) routing, and 3) output arbitration. In the case of body flits, switch traversal replaces the routing process, since the routing decision based on the header information is maintained for the subsequent body flits. A switch designed with Virtual Channels (VCs) requires a VC allocator to select which output VC the input flits will use when leaving the switch. The design of the router is largely determined by the switching technique supported. The majority of modern commercial routers found in high-performance multiprocessor architectures utilize some form of virtual cut-through switching or a variant such as wormhole switching. Some of the main components of a wormhole switch are as follows.

3.1.1 Input/Output Ports

Routers for different network topologies require different numbers of ports. For example, a router for a two-dimensional mesh-based topology consists of five input/output ports: four ports communicate with neighboring routers, and a fifth port is connected to a processing or storage unit through the network interface block. Each port of the router is composed of input virtual channels, output virtual channels, a header decoder, a crossbar, an input arbiter and an output arbiter.

Figure 3.2: An Input Port of the Switch

In most implementations the number of input and output ports is five: four from the four cardinal directions (North, East, South and West) and one from the Network Interface (NI). To increase channel utilization, Partha et al. [13] proposed switches with virtual channels. In a virtual channel switch, each port of the switch has multiple parallel channels made up of buffers, which helps increase switch throughput. Virtual channels are discussed in more detail in the next section.

3.1.2 Virtual Channels

The design of a virtual channel (VC) is an important aspect of NoC. A virtual channel splits a single channel into multiple channels, virtually providing more paths for the packets to be routed. The number of virtual channels per port typically ranges from two to sixteen. The use of VCs reduces the network latency at the expense of area, power consumption, and production cost. However, VCs offer various other advantages.

Figure 3.3: An Input Port of the Switch with Virtual Channels

Since VCs provide more than one output path per channel, there is a lower probability that the network will suffer from deadlock, and the network livelock probability is eliminated (these deadlock and livelock conditions are different from architectural deadlock and livelock, which are due to violations in inter-process communications). Virtual channels have been shown to improve the throughput of the switch, but they may increase latency when too many virtual channels are added. The router design was analyzed to find the optimum number of virtual channels for the given design.

3.1.3 Buffers

These are FIFO buffers used for storing messages in transit. FIFO, an acronym for First-In First-Out, describes the behavior of the buffer: as the name says, it works according to the first-come, first-served principle. A FIFO fundamentally consists of storage elements from which data can be read and to which data can be written. The storage element can be an SRAM, a DRAM, a set of registers (D flip-flops) or any other form of storage. In most NoC implementations, register-based buffers are used for their small size and lower power consumption. In general, irrespective of the type of memory used, buffering helps in managing the data traffic during congestion (increased traffic) and during contention for resources (a situation where two or more sources compete for the same resource at the same time). In a buffered router model, buffers are associated with both the input and the output physical channels. In this chapter, a router design based on FIFO buffers is analyzed for power consumption.

3.1.4 Crossbar Logic

The crossbar connects the inputs to the outputs of the switch. The crossbar is a non-blocking network in the sense that an unused input can always connect to an unused output without destroying the connections of other input/output pairs. The connection realized by the crossbar is determined by the switch controller. The crossbar is commonly built with a single multiplexer per output, as shown in Figure 3.4.

Figure 3.4: A 3x3 Crossbar Implemented using a Multiplexer for each Output

The multiplexers are controlled by their select lines, which are connected to the grant signals from the arbiters. There are many ways to implement the multiplexers for a crossbar design, such as (i) using gates only, (ii) using pass transistors, (iii) using tristate inverters, or (iv) using smaller multiplexers instead of a single large one. A crossbar design is often represented as a 2-D array of transistors that short the input and output lines in order to achieve a connection. As shown in Figure 3.5, this is a similar approach; the only difference is the layout representation. The signals S[i,j] are used for controlling the tristate buffers for each column of the crossbar. In general, depending on the characteristics of the technology and the logic family being used, different multiplexer implementations may be best [35]. However, in a standard-cell, logic-synthesis based design environment, some of the multiplexer implementations may not be possible.

Figure 3.5: 2-D Implementation of a 3x3 Multiplexer based Crossbar
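Functionally, the multiplexer-per-output organization of Figure 3.4 reduces to a simple selection step; the sketch below models it at the behavioral level, with the select values standing in for the arbiter grant signals (names are illustrative).

def crossbar(inputs, select):
    """Crossbar modeled as one multiplexer per output (cf. Figure 3.4).

    inputs : list of flits currently offered at the input ports
    select : select[j] = index of the input granted to output j
             (driven by the arbiter's grant signals), or None if idle
    """
    return [inputs[s] if s is not None else None for s in select]

# 3x3 example: output0 <- input2, output1 idle, output2 <- input0
print(crossbar(["A", "B", "C"], [2, None, 0]))   # ['C', None, 'A']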

3.1.5 Input/Output Arbiter

An arbiter is required when multiple inputs use shared resources in the switch. The arbiter resolves conflicts among requests for the same resource and grants access to only one of them. Besides its basic functionality, the arbiter should grant access in some fair manner, a criterion set by the design that provides equal service to the different requesters. An arbiter could implement fixed-priority or variable-priority arbitration; however, to allow a fair allocation of system resources and to achieve high system performance, an arbiter should be able to change the priority of the inputs dynamically. The way priority changes is part of the policy employed by the system's scheduling algorithm. For example, the widely employed round-robin policy states that the request served in the current cycle gets the lowest priority in the next cycle. A general Variable Priority Arbiter (VPA) can be implemented in many ways; some of the mainstream designs are:

• A Priority-Encoding based VPA utilizes a Fixed Priority Arbiter (FPA) as its main block plus additional control logic to make it a VPA. The FPA block does not allow any dynamic priority change: position 0 always has the highest priority and position n-1 the lowest. The FPA produces a grant signal G and an additional flag AG to indicate that at least one input request was granted [35]. A grant is given to the i-th request when R_i = 1 and no request exists at any position with an index smaller than i. In order to support variable priorities, the arbiter utilizes either more FPAs and a selection mechanism, or additional circuits that mimic the behavior of a variable priority arbiter. Some of the options for this are (i) Exhaustive, (ii) Rotate-Encode-Rotate and (iii) Dual-Path designs [36].

• A Carry-Lookahead-based VPA is built using carry-lookahead-like structures. In this case the highest priority is declared using a priority vector P that is encoded in one-hot form. As in the case of the FPA, a request is granted when no other higher-priority input has an active request. The main characteristic of carry-lookahead based VPAs is that they do not require multiple copies of the same circuit and they inherently handle the circular nature of priority propagation [36].

• A Matrix Arbiter implements a Least Recently Served (LRS) priority scheme. The matrix arbiter stores more priority bits than simpler VPAs and is thus able to handle more complex relations among the priorities of the inputs. As the name suggests, a matrix arbiter uses an N × N matrix of bits to store the priority values. Each matrix element M[i,j] in the i-th row and j-th column records the priority of i over j; the symmetric element M[j,i] is then set to zero, and the elements of the diagonal M[i,i] have no physical meaning. Thus only the n(n-1)/2 elements of the priority matrix are actually needed. A high-level design for the matrix arbiter is shown in Figure 3.6.

Figure 3.6: A Matrix Arbiter Design

According to the operation of the matrix arbiter, a requester will receive a grant if no higher-priority requester is bidding for the same resource. Once the request succeeds, its priority is updated and set to be the lowest among all requesters. The grant circuit is then used to allow only one virtual channel to access a physical channel. Separate arbiters can be used for the input and output ports. In a wormhole-based switch design, when the granted input virtual channel stores one whole flit, it sends a full signal to the controller. If it is a header flit, the Header Decoder Unit (HDU) determines the destination. Based on this information, the controller checks the status of the output port; if it is available, the path between input and output is established. All subsequent flits of the corresponding packet are sent from input to output over the established path.

If more than one input port tries to access the same output port simultaneously, an output arbiter is used to grant access to only one of them.
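The least-recently-served behavior of the matrix arbiter can be modeled behaviorally in a few lines. The class below is a simplified, hypothetical sketch of the priority-matrix update described above, not the circuit implemented in this work.

class MatrixArbiter:
    """Behavioral sketch of an N x N matrix arbiter.

    prio[i][j] == 1 means requester i currently has priority over j.
    A request is granted when no other active requester has priority
    over it; the winner is then demoted below everyone else, so the
    least recently served requester rises to the top over time.
    """
    def __init__(self, n):
        self.n = n
        # initial priority: lower index beats higher index
        self.prio = [[1 if i < j else 0 for j in range(n)] for i in range(n)]

    def arbitrate(self, requests):
        for i in range(self.n):
            if requests[i] and not any(
                requests[j] and self.prio[j][i] for j in range(self.n) if j != i
            ):
                # winner i loses priority to every other requester
                for j in range(self.n):
                    if j != i:
                        self.prio[i][j] = 0
                        self.prio[j][i] = 1
                return i
        return None   # no request asserted

arb = MatrixArbiter(4)
print(arb.arbitrate([1, 1, 0, 1]))   # 0 wins first...
print(arb.arbitrate([1, 1, 0, 1]))   # ...then 1, since 0 is now lowest priority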

3.1.6 Control Logic

The complexity of the Control Logic (CL) depends on the routing and scheduling algorithm being implemented. The control logic determines the output port for each incoming packet and arbitrates among inputs directed at the same output. The controller keeps track of the input and output virtual channels. Its operation is illustrated by the state diagram in Figure 3.7.

Figure 3.7: State Diagram of Port Controller

When an input virtual channel is selected by the input arbiter, a complete flit is stored in the buffers and a full signal is sent to the controller. If it is a header flit, the controller sends an enable signal to the header decoder, which then determines the destination. Once the destination has been determined, the controller checks whether the needed output port is available to accept a new flit; each switch includes a status register to indicate the availability of the output ports. If one of the virtual channels of the desired output port is available, flits from the input virtual channel are forwarded to the available output virtual channel; otherwise, the flits wait in the input virtual channel. A brief description of the states is as follows: S1 - checking the availability of input virtual channels, S2 - storing a flit in the available input virtual channel, S3 - destination address calculation, S4 - checking output port availability, S5 - transferring data from the input to the output virtual channel, S6 - ending packet transmission, and S7 - updating the output storage register.

The algorithm to control 4 Virtual Channels is shown in Figure 3.8.

Figure 3.8: Control Flow Algorithm for Input Virtual Channels

3.2 Packet Format

Data that needs to be transmitted between a source and a destination is partitioned into fixed-length packets, which are in turn broken down into flits or words. A flit is the smallest unit of data that is transferred in parallel between two routers. A packet consists of three kinds of flits, the header flit, the data flit and the tail flit, which are differentiated by two bits of control information. The header flit contains information about the destination router for each packet [37]. Figure 3.9 shows the packet format for a header and a data flit.

Figure 3.9: Packet Format

Each packet has a first field for the flit type: header flit or data flit. For a header flit, the second field contains the length of the address, which is variable and depends on the number of IPs in the NoC. The third field is the packet length and carries the number of flits in the corresponding packet. The next two fields provide the source and destination addresses. Usually, for a given design, the length of a flit is constant, but the total number of flits in a packet can vary. For example, for a NoC with 1024 IP blocks, 10 bits are required for encoding each of the source and destination addresses in binary and 4 bits are required for indicating the address field length. If k is the value of the packet length field, then there are 2^k flits in the packet.
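The header sizing rule in the example above can be reproduced with a short calculation; the helper names below are illustrative and the sketch simply applies the bit-width arithmetic described in the text.

import math

def header_field_sizes(num_ips):
    """Header sizing rule from the text: address width in bits for a
    given number of IPs, plus the bits needed to announce that width."""
    addr_bits = math.ceil(math.log2(num_ips))            # source/destination address
    addr_len_bits = math.ceil(math.log2(addr_bits + 1))  # field announcing the width
    return addr_bits, addr_len_bits

def flits_per_packet(k):
    """With a packet-length field value of k, the packet holds 2**k flits."""
    return 2 ** k

print(header_field_sizes(1024))   # (10, 4), as in the 1024-IP example above
print(flits_per_packet(4))        # 16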

3.3 NoC Router Design and Cost

In a NoC, the routers are major sources of power consumption and area cost. To study their effect on the complete network, a set of router designs (with different numbers of ports and queue sizes) is implemented at the transistor level and at the Register Transfer Level (RTL) using two different technology nodes. All the links between tiles are 32 bits wide, which is also the flit size of the design. The router implements wormhole routing, and every link can transport one flit per clock cycle. For the NoC design, it is possible to pre-calculate the lowest operating frequency that allows the NoC to meet all the bandwidth requirements for a given mapping. This is done by computing the aggregate bandwidth requirement of all communication flows overlapping on each individual inter-tile link and dividing it by the link width, as shown in Equation 3.1.

Minimum Frequency = Aggregate Bandwidth / Link Width \qquad (3.1)
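Equation (3.1) amounts to a one-line calculation over the flows mapped onto a link; the sketch below applies it to assumed flow bandwidths for a 32-bit link.

def minimum_frequency(flow_bandwidths_bps, link_width_bits):
    """Equation (3.1): lowest link clock that still meets the bandwidth
    requirement of all flows sharing one inter-tile link."""
    aggregate = sum(flow_bandwidths_bps)     # bits/s carried by this link
    return aggregate / link_width_bits       # Hz

# three flows of 800, 600 and 400 Mbit/s sharing a 32-bit link
print(minimum_frequency([800e6, 600e6, 400e6], 32))   # 56.25e6 Hz, i.e. ~56.3 MHz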

Design-I of the router (NoC port) is implemented at the transistor level in 90 nm CMOS technology using Cadence design tools. Design-II of the router is implemented using an ASIC flow and ARM's standard cell library in a TSMC 65 nm process at 1.0 V in the Synopsys design environment. Design-I is essentially a custom design and Design-II an ASIC design. Most SoC designs for mobile applications are designed using an ASIC flow, whereas SoC designs for high-end processors are developed using a custom flow to achieve efficiency in terms of power, performance and area. The two designs provide different perspectives but are not compared against each other. Design-I is discussed next.

3.3.1 Router Design-I

The NoC port of the router is implemented at the transistor level in 90 nm CMOS technology using the Cadence design tools. The power dissipation of the NoC port is determined at a 200 MHz frequency for the worst-case data input patterns. Considering a die size of 20 mm × 20 mm and a supply voltage (Vdd) of 1.2 V, the total power dissipation for different NoC topologies is calculated for different numbers of IP blocks. A top-level description of the inputs and outputs of the NoC port is shown in Figure 3.10.

Figure 3.10: NoC Port I/O

The implemented router design is composed of input FIFO buffers, output FIFO buffers, input/output arbiters and control logic. Both input and output buffering are common choices in wormhole routers. The components are not completely optimized but are kept simple in order to measure the energy consumed to transfer a flit through the router; the choices made are in fact consistent with many other NoC designs. Virtual channels consist of several buffers controlled by an arbiter and a multiplexer, which grant access to only one virtual channel at a time according to the request priority. Increasing the number of virtual channels increases the complexity of the switch, especially of the arbiter component, as shown in Figure 3.11. When the number of virtual channels is eight,

Figure 3.11: Power per Component

the area of the arbiter and multiplexer is 35.5% of the total area of the port. For an 8-VC design, an 8x8 input arbiter and an 8x1 multiplexer are needed to control the input virtual channels. The 8x8 input arbiter consists of an 8x8 grant circuit and an 8x8 priority matrix; similarly, a 4x4 input arbiter consists of a 4x4 grant circuit and a 4x4 priority matrix. The values of the grant signals are determined by the priority matrix. The number of grant signals equals the number of requests and the number of selection signals of the multiplexer. The area of two 4x4 input arbiters is smaller than the area of one 8x8 input arbiter, as shown in Figure 3.11. In this architecture, rather than using one multiplexer and one arbiter to control the virtual channels, two multiplexers and two arbiters are employed, as shown in Figure 3.12. Consequently, the area required

Figure 3.12: High Throughput Arbiter Design

to implement the switch with this architecture is less than the area consumed by the switch without these modifications. In this design, the virtual channels are divided into two groups, each group controlled by one multiplexer and one arbiter, and each group of virtual channels is supported by one interconnect bus. This port architecture has a great influence on the switch frequency and the throughput of the network in comparison to the original switch. Also, the area of two 4x1 multiplexers is smaller than the area of one 8x1 multiplexer. The frequency of the switch design is characterized for different numbers of virtual channels and different network topologies, as shown in Figure 3.13. When the number of virtual channels is increased beyond four, the

Figure 3.13: Max. Frequency of Switch with different number of Virtual Channels

maximum frequency of the switch decreases for the BFT architecture. Since the complexity of the switch is reduced by this division, the frequency of the improved HT design is better than that of the original design. Increasing the number of virtual channels has a direct effect on the traffic on the interconnects; the increased traffic raises contention on the bus and therefore increases the latency. The throughput is still increased, as more links are available in the channels. Using the throughput equations and the frequency of the switch, the throughput for the various NoC architectures is calculated. The variation of the throughput with the number of virtual channels for the various NoC architectures is shown in Figure 3.14. The High Throughput (HT) architecture increases the throughput of the network by 46% for the BFT architecture. The increase in throughput is smallest for the SPIN architecture. The throughput decreases when the number of VCs is more than six in the Cliche and Octagon architectures. The increase in throughput for the different HT architectures is presented in Table 3.1. The latency of the network depends on the frequency of the switch and the number of

Figure 3.14: Throughput vs. Virtual Channels for Different NoC Topologies

Table 3.1: Percentage Increase in Throughput for Different NoC Architectures

Architecture   Increase in Throughput (%)
HT-Cliche      23
HT-BFT         46
HT-Octagon     31
HT-SPIN        11

virtual channels. With the circuit optimization as described earlier, the latency of the various NoC architectures is calculated and is shown in Figure 3.15. The latency of the BFT architecture is reduced by up to 59% for eight virtual channels. However, a severe increase in the number of virtual channels could cause a severe increase in the latency of the network. The latency of HT-Octagon, HT-Cliche and HT-SPIN with six virtual channels is reduced by 42%, 37% and 10% respectively. Considering a die size of 20mm x 20mm and a system of 256 IPs, the power consumption for

Figure 3.15: Latency of NoC Topologies with Different Number of Virtual Channels

different NoC topologies is shown in Table 3.2. Since the interswitch links are short in Cliche, there is no need for repeaters within the interconnects. The BFT topology consumes the minimum area and power as compared to the other NoC topologies.

Power dissipation of the network increases rapidly with more number of IPs on the die. Therefore finding ways to lower power dissipation is a primary concern in high speed, high complexity SoC designs. In NoC the switch for different architectures has different number of ports. The switch of BFT has six ports, four children ports and two parent ports. Each port of the switch includes input virtual channels, out- put virtual channels, a header decoder, controller, input arbiter and output arbiter.

Each port can be used as either input port or output port. If the port is used as the input port, the input virtual channels, header decoder and crossbar are active.

If the port is used as output port, the output virtual channels are active. This in conjunction with sleep transistors can be used to lower the power consumption in the switch as shown in Figure 3.16. The sleep transistors (M1) disconnect the input

Table 3.2: Power Consumption for Different NoC Architectures

Architecture | Number of Reps. | Power Dissipation in Switches (mW) | Power Dissipation of Reps and Links (mW) | Power Overhead (%) | Total Power (mW)
Cliche | 0 | 24448 | 1398 | 5.4 | 25846
HT-Cliche | 0 | 23715 | 2796 | 10.5 | 26511
BFT | 960 | 15664 | 1458 | 8.5 | 17122
HT-BFT | 1920 | 15194 | 2916 | 16.1 | 18110
Octagon | 3840 | 19861 | 1094 | 5.2 | 20955
HT-Octagon | 7680 | 19265 | 2188 | 10.2 | 21453
SPIN | 12288 | 32264 | 10613 | 24.8 | 42877
HT-SPIN | 24576 | 31296 | 21226 | 40.4 | 52522

circuit from the supply voltage during the output mode. The sleep transistors (M2) disconnect the output circuit from the supply voltage during the input mode. The acknowledgment signals (Ackin and Ackout) provided by the control unit are used to control the stand-by transistors M1 and M2 respectively. According to the received values of the request signals (Reqin and Reqout), the control unit generates the ac- knowledgment signals to determine the operating mode of the port; input mode or output mode. Using the Cadence tools, TSMC 90 nm CMOS technology, the NoC port with sleep transistor is implemented on the transistor level [38]. Given a die size of 20mm x 20mm and supply voltage (Vdd ) of 1.2V, the total power dissipation and power dissipation per component is calculated and presented in Table 3.3 and Table

3.4 respectively.

Figure 3.16: NoC Port Design for Reducing Leakage Power

The change in the power consumption with the number of IP blocks for different network topologies is shown in Figure 3.17. The power consumption for different NoC topologies increases by different rates with the number of IP blocks. The SPIN and

Figure 3.17: Power Consumption for Different NoC Architectures

Table 3.3: Power Reduction Per Component using Sleep Transistors

Component | Power Consumption (mW) | Power Consumption with Sleep Mode (mW) | Percentage Reduction (%)
Input FIFO | 3.618 | 0.1029 | 97.2
Header Decoder | 0.955 | 0.2157 | 77.4
Crossbar | 0.473 | 0.1274 | 73.1
Output FIFO | 3.562 | 0.1003 | 97.2

Table 3.4: Power Reduction of a Switch for Different NoC Architectures using Sleep Transistors

Architecture | No. of Ports | Power Consumption of a Switch (mW) | Power Consumption of a Switch using Sleep Transistors (mW) | Percentage Reduction (%)
Cliche | 5 | 32.03 | 20.34 | 36.5
BFT | 6 | 41.29 | 29.60 | 28.3
Octagon | 3 | 18.66 | 10.87 | 41.8
SPIN | 8 | 54.66 | 39.07 | 28.5

Octagon architectures have much higher rates of power dissipation increase. Power dissipation of SPIN is increased by almost two orders of magnitude when the number of IPs increased from 16 to 1024. The BFT architecture consumes the minimum power as compared to other NoC topologies making BFT more attractive as a power-efficient

NoC topology.

3.3.2 Router Design-II using ASIC Design Flow

Design-II is the full router design implemented in an ASIC flow using 65 nm technology. To analyze the power consumption of the router, a set of router designs (with different numbers of ports and VCs) is modeled in VHDL. Table 3.5 shows the main input and output ports of a one-virtual-channel router design.

Table 3.5: Input and Output Ports of NoC Router

Signal | Direction | Width (bits)
Clk | in | 1
Reset | in | 1
req_put (1 to 6) | in | 1
req_get (1 to 6) | in | 1
dest_read[2:0] (1 to 6) | in | 3
data_in_input_fifo[7:0] (1 to 6) | in | 8
full_input_fifo (1 to 6) | out | 1
data_out_output_fifo[7:0] (1 to 6) | out | 8

The router design and its basic building blocks were synthesized using ARM's standard cell library for the TSMC 65 nm CMOS process with Vdd = 1.0 V at a typical corner using the Synopsys Design Compiler (DC) tool. The gate-level netlist obtained from this step was then imported into Synopsys PrimeTime-PX (PTPX), a tool for power calculation. Timing was verified prior to the power calculations. To calculate the average power consumption based on switching activity, different experiments were performed using PTPX to measure the power consumption of the design at various operating frequencies. In order to analyze the power consumption of a design,

70 it is important to consider all the factors which contribute to both the static and

dynamic power. These are shown in Figure 3.18. The design netlist is required to

Figure 3.18: Power Analysis Requirements

determine the design activity and type of cells used in the design and to accurately compute the capacitance on the drivers. Cell Library models are necessary in order to compute the internal power of the CMOS cells used in the design and the signal activity of the design affects both the static and dynamic power consumption. The static power(cell leakage) is often state dependent, and the dynamic power is directly proportional to the toggle rate of the pins. Net parasitics (or capacitances) affect the dynamic power of the design. Switching power is directly proportional to the net capacitance. Internal power depends on both the input signal transition times, which are determined by the net parasitics, as well as the output load, which is a combination of the net parasitics and input pin capacitances of the fanout. Figure

3.19 shows the methodology used to measure the power consumption of the router design using the standard switching activity file format.

In this flow, the analyze and elaborate commands read the RTL design into an active memory and convert it to a technology-independent format called the GTECH

Figure 3.19: Power Measurement Flowchart for NoC Routers using Synopsys Tools

design. This is done using Design Compiler tool, which is the core of the Synopsys synthesis software. Then, a forward-annotated Switching Activity Interchange For- mat (SAIF) file is generated using the rtl2saif command of Synopsys. This forward- annotated file contains directives that determine which design elements to be traced during simulation. The forward-annotated SAIF file is fed into the simulator with the

VHDL test bench and technology files to generate a back-annotated SAIF file. The back annotated SAIF file contains information about the switching activity of the

synthesis-invariant elements in the design. Then, the back-annotated SAIF file is used together with the gate-level netlist (Data-Base (DB) file) produced by the Design Compiler to calculate the power consumption of the router. PrimeTime-PX is used for calculating the power consumption of the switches.

Power dissipation in a NoC arises from two different sources: 1) the switches and 2) the inter-switch links. A comparative analysis of the power consumed by the different components of the router is shown in Figure 3.20, together with area estimates. For Design-II, an average

Figure 3.20: Power per Component

power was measured using a wireload model for synthesis and static switching activity on all input ports. In relative terms, the arbiter contributes less than 8% of the total power, while the input and output buffers consume approximately 70% of the total power. This breakdown reflects the typical power distribution of an individual wormhole switch. Router power is a direct function of the number of ports

required. Different NoC architectures require different numbers of switches with different numbers of ports. For example, a 16-IP Cliche architecture has 4 switches with 5 ports, 8 switches with 4 ports and 4 switches with 3 ports. At the network level, the aggregated power consumption of the switches plays a larger role in determining the total power consumption. The total power consumed by the different NoC architectures is shown in Figure 3.21 and is also listed in Table 3.6. The BFT architecture consumes

Figure 3.21: Total Power Consumed by Routers for Different Number of IPs

the minimum power and SPIN the maximum. The rate of increase becomes more pronounced as the number of IPs grows.

Low Power Router Design: The total power consumed by a NoC router is the sum of its dynamic and leakage power. Dynamic power, or switching power, is primarily the power consumed when the device is active, that is, when signals are changing values.

Static power is the power consumed when the device is powered up but no signals

Table 3.6: Power Overhead of Routers for Different NoC Architectures in the RVT Process (f = 200 MHz, α = 0.1)

                    Power Consumption (mW)
Architecture   16 IPs    64 IPs    256 IPs    1024 IPs
Cliche         41.64     188.20    798.12     3285.16
BFT            21.08     106.00    489.04     2021.20
SPIN           31.56     211.52    846.08     3384.32
Octagon        42.44     178.64    732.32     2929.28

are changing value. The first and primary source of dynamic power consumption in

a CMOS logic gate is switching power - the power required to charge and discharge

the output capacitance on a gate. Dynamic power dissipated by a CMOS design is

largely described by the equation:

P_{dyn} = \alpha\, C_l\, V_{dd}^2\, f_{clock}

Where Cl is the capacitance (a function of fanout, wire length and transistor size),

Vdd is the supply voltage, α is the activity factor (how often the wire switches) and f_{clock} is the clock frequency. Because dynamic power is linearly proportional to the switching frequency, lowering the operating frequency (where possible) is a primary way to reduce switching power. Table 3.7 lists the power consumption of Design-II NoC routers at various operating frequencies for α = 0.1. As the data show, the power consumed by any switch is a direct linear function of frequency.
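As a rough illustration of this linear dependence, the short sketch below evaluates P_dyn = αC_lV_dd²f_clock over the same frequency range. The effective switched capacitance used here is an assumed placeholder, not a value extracted from the Design-II netlist.

# Minimal sketch: dynamic (switching) power vs. clock frequency,
# P_dyn = alpha * C_l * Vdd^2 * f.  The capacitance value below is an
# illustrative assumption, not a number taken from the Design-II netlist.

def dynamic_power_mw(alpha, c_load_pf, vdd, f_mhz):
    """Switching power in mW for an effective load in pF and frequency in MHz."""
    # pF * V^2 * MHz = 1e-12 F * V^2 * 1e6 Hz = uW; divide by 1000 for mW
    return alpha * c_load_pf * vdd**2 * f_mhz / 1000.0

if __name__ == "__main__":
    alpha, vdd = 0.1, 1.0        # activity factor and supply used in this chapter
    c_eff_pf = 100.0             # assumed effective switched capacitance of a router
    for f in (100, 200, 300, 400, 500):
        print(f"{f} MHz -> {dynamic_power_mw(alpha, c_eff_pf, vdd, f):.2f} mW")
    # Doubling f doubles P_dyn, which is the linear trend seen in Table 3.7.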

The rate of increase for a 6-port router over the frequencies shown is almost 2 mW per 100 MHz. The fact that dynamic power is linearly proportional to the capacitance being switched is more of a design and implementation constraint

Table 3.7: Power Consumption of 4-, 5-, 6- and 8-Port NoC Routers at Various Operating Frequencies

                         Power Consumption (mW)
Frequency   4-Port Router   5-Port Router   6-Port Router   8-Port Router
500 MHz     6.38            8.22            10.05           13.96
450 MHz     5.73            7.40            9.00            11.86
400 MHz     5.07            6.57            7.95            10.84
350 MHz     4.46            5.72            6.95            9.42
300 MHz     3.80            4.91            5.96            8.51
250 MHz     3.18            4.12            4.97            6.86
200 MHz     2.56            3.30            3.99            5.33
150 MHz     1.94            2.50            3.02            4.38
100 MHz     1.30            1.68            1.94            2.92

and is improved primarily by reducing the length of the inter-router interconnects being driven and by reducing design complexity and area. The voltage term has the greatest effect on power; if the frequency can be reduced enough to allow a reduction in the voltage, the power is reduced quadratically. Leakage power (or static power) is a function of the supply voltage (Vdd), the threshold voltage (Vt) and transistor sizes. The scaling of threshold voltages has been a large factor in the increasing leakage currents of smaller technology generations. The trend of leakage power versus technology node in Intel processors is shown in Figure 3.22. At technology nodes of 90nm and below, leakage power management is essential in the design process, as leakage power can consume up to half of the total power dissipated by the transistors. There are four main sources of leakage current in a CMOS gate: (i) Sub-threshold Leakage, the current which flows from

Figure 3.22: Leakage Power vs Technology Nodes [5]

the drain to the source of a transistor operating in the weak-inversion region; (ii) Gate Leakage, the current which flows directly from the gate through the oxide to the substrate due to gate-oxide tunneling and hot-carrier injection; (iii) Gate-Induced Drain Leakage, the current which flows from the drain to the substrate, induced by the high field in a MOSFET drain caused by a high V_DG; and (iv) Reverse-Bias Junction Leakage, caused by minority-carrier drift and the generation of electron/hole pairs in depletion regions. A guide to when each type of leakage should be considered is provided in Table 3.8. There are several approaches to minimize leakage current. One

technique is known as multi-V_T: using high-V_T cells wherever performance goals allow and low-V_T cells where necessary to meet timing. In an ASIC design flow, there are three types of library cells: HVT, LVT and RVT. The Design-II router was evaluated at different operating frequencies to examine this tradeoff.

Table 3.8: A Guide for Leakage Power Considerations

Parameter            L (µm)    Tox (nm)   Isub   Igate   Ijunc
Long Channel         > 1       > 3        x      x       x
Short Channel        > 0.18    > 3        y      x       x
Very Short Channel   > 0.090   > 3        y      y       x
Nanometer Channel    < 0.090   < 2        y      y       y

Leakage power savings for a 6-port router design, expressed as a percentage, are shown in Figure 3.23. The LVT process has the highest leakage current, whereas

Figure 3.23: Difference in Leakage Power using Different Vt Cells for a 6-Port Router Design

HVT has the minimum. The design was functional over a wide range of frequencies, with area as the major tradeoff. Figure 3.24 shows the increase in area with frequency for all three processes. To meet timing and compensate for the

Figure 3.24: Frequency vs. Area of the Switch

delay of the HVT process, smaller cells are exchanged for bigger cells; thus the HVT process consumes the maximum area to meet the timing requirements. The routers were implemented in all three processes, and the power consumed by the different NoC architectures is listed in Table 3.9. Even though the LVT process has higher leakage

Table 3.9: Power Dissipation for a Network of 64 IPs at 200 MHZ and α = 0.1

               Power Consumption (mW)              Area (mm²)
Architecture   LVT      HVT      Reduction (%)     LVT     HVT     Reduction (%)
Cliche         198.52   450.44   55.93             44.92   45.04   0.27
BFT            111.84   207.76   46.17             25.36   25.48   0.47
SPIN           235.52   406.40   42.05             49.29   49.39   0.20
Octagon        188.40   352.92   46.62             42.79   42.89   0.23

power, the total power consumed by the architectures in the LVT process is much less than in the HVT process. By using the LVT process for a system of 64 IPs, power savings of as much as 56% can be achieved for the Cliche architecture. Again, BFT is the most power efficient in both processes.
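The percentage reductions in Table 3.9 follow directly from the LVT and HVT totals. A quick check, using the table's own numbers:

# Sketch: reproduce the power reduction column of Table 3.9 from the LVT and HVT
# totals, e.g. Cliche: (450.44 - 198.52) / 450.44 * 100 ~ 55.9 %.

table_3_9 = {           # architecture: (P_LVT [mW], P_HVT [mW]) for 64 IPs at 200 MHz
    "Cliche":  (198.52, 450.44),
    "BFT":     (111.84, 207.76),
    "SPIN":    (235.52, 406.40),
    "Octagon": (188.40, 352.92),
}

for arch, (p_lvt, p_hvt) in table_3_9.items():
    reduction = (p_hvt - p_lvt) / p_hvt * 100.0
    print(f"{arch:8s}: {reduction:5.2f} % lower total power with LVT cells")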

3.4 Summary

In a NoC design, routers consume a significant amount of power and incur a substantial area cost. In this chapter the NoC router design is discussed in detail and evaluated for power consumption, area and performance at various operating frequencies. Based on the router power, the total power consumption of different NoC architectures is evaluated for different numbers of IPs. The main goal of a NoC design is to consume low power while delivering the required performance. To achieve a low power NoC design, it is essential to apply power saving techniques at each level of design abstraction. Within the router design, different approaches for saving dynamic and static power are discussed and evaluated in this chapter.

CHAPTER 4

High Performance NoC Interconnects

Early CMOS processes had a single metal layer, and for many years only two to three metal layers were available, but with advances in chemical-mechanical polishing and other semiconductor processes, it is now far more practical to manufacture many more metal layers in silicon technology than ever before. As shown in Figure 4.1,

Figure 4.1: Metal Layers in different technology nodes

a 180 nm process has 6 metal layers, and the layer count has been increasing at a rate of about one per technology generation. Table 4.1 lists the technology parameters and equivalent circuit model parameters from the ITRS reports (2001-2010) for technology

nodes from 130nm to 11nm. The minimum interconnect spacing Smin is assumed to be equal

Table 4.1: Technology and Circuit Model Parameters from ITRS Reports (2001-2010)

Year of Production             2001   2004   2007   2010   2013   2016   2019   2022
Technology Node (nm)           130    90     65     45     32     22     16     11
Number of Metal Layers         9      10     11     12     13     13     14     15
Metal 1 Wire Wmin (nm)         130    90     65     51     31     22     15     11
Int. Wire Wmin (nm)            225    132    98     51     31     22     15     11
Global Wire Wmin (nm)          335    230    145    68     48     33     24     16
Global Wire T (nm)             670    482    319    236    112    77     56     37
Tox (µm)                       6.3    4.7    3.9    2.9    2.4    1.9    1.4    1.04
Relative Permittivity εr       3.3    3.1    3.0    2.7    2.6    2.3    2.1    1.8
Resistivity ρ (10⁻⁶ Ω·cm)      2.2    2.2    2.2    2.2    2.06   2.06   2.06   2.06

to the minimum interconnect width Wmin in all technologies; k is the dielectric constant and ρ is the resistivity of the Cu material. As CMOS technology continues to scale, wiring delay is beginning to dominate gate delay. Figure 4.2 shows gate delay versus wire delay at different technology nodes. Wiring delay doubles with each technology node and increases quadratically as a function of wire length. Even a small number of global interconnects, where the signal delay is very high, can have a significant impact on system performance and may also affect timing closure of the design. Additionally, with technology scaling, global wires are undergoing a reverse scaling process, resulting in wider and thicker top metal layers and an increase in the wire aspect ratio (AR). This leads to increased self and coupling capacitances, which makes global communication an increasingly power-consuming process.

Figure 4.2: Gate Delay vs. Wire Delay in Different Technology Nodes

In a NoC, the interswitch wire segments are the longest on-chip wires after the clock, power and ground wires [39]. Global and semi-global wires are the most suitable and recommended for NoC interconnects, since one of the most critical challenges for NoC interconnects is to provide the desired system bandwidth for the SoC design. It is increasingly important for many large scale SoC designs to have higher bandwidths to satisfy massive inter-processor communication. Bandwidth is critical, in part because higher bandwidth decreases contention, and in part because phases of a program may push a large volume of data around without waiting for the transmission of individual data items to complete. To achieve higher bandwidth in a NoC, it is possible to design pipelined routers that process one flit per cycle, but the duration of the clock cycle usually determines how fast each flit can be processed in the network. In nanometer NoCs, this cycle time is not limited by the logic between two clocked elements but by the links between two routers, thereby limiting the system performance.

Continued scaling of technology has thus posed a set of new challenges to the design community. Interconnects in deep nanometer technologies suffer from three major problems: (1) large propagation delay due to capacitive coupling induced by tighter geometries (delay problem), (2) high power consumption due to the increase in both self and coupling capacitances (power problem) and (3) increased susceptibility to errors due to deep-submicron effects (reliability problem) [40]. As a result, interconnects can no longer be ignored in the design cycle and must be accounted for early in the process. In this chapter, a layout-aware analysis of NoC interconnects is presented to achieve a high performance and low power NoC design.

4.1 NoC Interconnects

In a NoC design, the wires linking two switches are called interconnects. NoC interconnects play a crucial role in communication and can have a large impact on total power consumption, wiring area and system performance. One of the most critical challenges of a NoC design is to provide the desired bandwidth set forth by the main SoC design in order to meet a certain performance threshold. However, as technology scales into the nanometer domain, achieving higher bandwidths for communication channels becomes difficult and may require mitigation schemes [41][42]. NoC links typically consist of a number of parallel signal wires of fixed width and spacing, as shown in Figure 4.3. These links can be used directly to express a number of metrics, such as data rate, bandwidth density or bisectional bandwidth. However, Channel Bandwidth (Data Rate) is the preferred and most appropriate metric for estimating system

Figure 4.3: NoC Interconnects

performance and can be expressed as follows:

\text{Channel Bandwidth} = \frac{N}{\text{Delay}}    (4.1)

Where N is the total number of signal wires in the link and Delay is the delay of a single wire. To achieve higher bandwidth, it is thus important that the delay be minimized. Interconnect delay is a function of wire resistance and capacitance. The delay of a distributed RC line driven by an ideal driver (zero output impedance) at the near end, with an open termination at the far end, can be expressed using the Elmore delay model:

t_{Delay} = 0.4\, R\, C\, L^2    (4.2)

Where L is the wire length, R is the wire resistance per unit length, C is the capacitance per unit length and t_{Delay} is the wiring delay. This is a good approximation and is reported to be accurate to within 5 percent for a wide range of R and C values.

In NoC design, the minimum conceivable clock cycle time of a highly pipelined design can be assumed to be equal to 15 FO4, with FO4 defined as the delay of an inverter driving four identical inverters [16]. In different technology nodes, FO4 can be estimated as

FO4 = 425 \times L_{min}    (4.3)

where Lmin is the minimum gate length of the technology node [43]. Figure 4.4 shows the requirement that one clock cycle places on the resource size. For a high performance design (operating at its maximum frequency), the interconnect delay between resources should be less than the 15 FO4 time. In long wires the intrinsic RC delay can easily

Figure 4.4: One Clock Cycle Requirement for High Performance NoC Designs

exceed this 15 FO4 limit, thereby limiting the clock cycle time of the design; as a consequence, system bandwidth may suffer. Figure 4.5 shows the intrinsic RC delay at different technology nodes together with the 15 FO4 limits. In a NoC, the length of the interconnects is a function of die size, architecture and the number of IPs, so depending on the length of the wires, different techniques may be necessary to reduce the intrinsic RC delay.
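The following sketch illustrates the comparison implied by Eqs. 4.1-4.3: the Elmore delay of an unrepeated wire against the 15 FO4 cycle limit, and the resulting channel bandwidth. The per-unit-length R and C used in the example call are placeholder assumptions roughly representative of a narrow 65 nm global wire, not values taken from the ITRS tables.

# Sketch of Eqs. 4.1-4.3: Elmore delay of an unrepeated wire, the 15*FO4 cycle
# limit, and the resulting channel bandwidth.  R and C in the example are assumed.

def elmore_delay_ps(r_per_mm, c_per_mm_pf, length_mm):
    """t = 0.4 * R * C * L^2 (Eq. 4.2); R in ohm/mm, C in pF/mm, result in ps."""
    return 0.4 * r_per_mm * c_per_mm_pf * length_mm**2   # ohm * pF = ps

def fo4_ps(l_min_um):
    """FO4 ~ 425 * Lmin, with Lmin in micrometres (Eq. 4.3); result in ps."""
    return 425.0 * l_min_um

def channel_bandwidth_gbps(n_wires, delay_ps):
    """Eq. 4.1: N / Delay, converted to Gbit/s."""
    return n_wires * 1000.0 / delay_ps

cycle_limit = 15 * fo4_ps(0.065)                       # 15*FO4 in 65 nm, ~414 ps
delay = elmore_delay_ps(r_per_mm=465.0, c_per_mm_pf=0.19, length_mm=10.0)  # assumed R, C
print(f"15*FO4 limit: {cycle_limit:.1f} ps, 10 mm intrinsic delay: {delay:.1f} ps")
print(f"Channel bandwidth of a 16-wire link: {channel_bandwidth_gbps(16, delay):.2f} Gbps")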

4.1.1 Performance Optimization Using Intrinsic RC Model

The delay of a wire is a function of its RC time constant (R is the resistance and C is the capacitance). In older technologies, wires were wide, and a larger cross

Figure 4.5: Intrinsic RC delay and 15FO4 limit

sectional area implied less resistance and more capacitance. It was acceptable to model wires with capacitance only, but as fabrication technologies scale down, the width of the wires is reduced [44]. As a result, wire resistance per mm is increasing and is no longer negligible. Short wires can be modeled with a lumped RC approximation. The resistance of a wire can be defined by the following expression:

R_{wire} = \frac{\rho L}{T W}

where ρ is the metal resistivity, L is the length, T is the thickness and T·W is the cross-sectional area of the wire. The resistivity depends on the material used. Aluminum, once the main material for interconnect, has long been replaced by copper for its low resistivity (2.2 µΩ·cm). In nanometer technologies, however, even copper interconnects are becoming increasingly inadequate to meet the speed and power dissipation goals of highly scaled ICs. Table 4.2 lists the

Figure 4.6: Interconnect Resistance

resistivity of some of the commonly used materials. Wire capacitance, on the other

Table 4.2: Bulk Resistivity of Pure Metals at 22 °C

Metal             Resistivity (µΩ·cm)
Silver (Ag)       1.6
Copper (Cu)       1.7
Gold (Au)         2.2
Aluminum (Al)     2.8
Tungsten (W)      5.3
Molybdenum (Mo)   5.3
Titanium (Ti)     43.0

hand, is more complex, as many of its sub-components are geometry dependent. Its estimation is usually carried out by representing complex structures as a collection of simple geometric elements; the parasitic values of the elements are then combined using superposition, or by introducing scale factors, to obtain the parasitics of the complex structure. There are many commonly used industrial tools which extract the wire capacitance parameters for a given structure, such as FastCap, FastHenry, StarRC and QRC. For modeling and estimation purposes, however, some simple techniques are applicable. As shown in Figure 4.7, the capacitance per unit length of a wire can be modeled by four parallel-plate capacitors, one for each side, plus fringing capacitance. Accurate modeling of wire capacitance is, however, a non-trivial task and still an ongoing subject of advanced research. The three major

Figure 4.7: Cross-Sectional View of Semi-Global Layer Interconnects

components of wire capacitance shown in Figure 4.7 are related to the geometry by the following relation:

C_T = C_a + 2\, C_b\, W + \frac{C_c}{S}    (4.4)

Where CT is the total capacitance, Ca is the fringing capacitance, Cb is the parallel plate capacitance due to the top and bottom layers of metal and is proportional to the interconnect width, and Cc is the coupling capacitance between neighboring interconnects and is inversely proportional to the interconnect spacing S. The parallel

plate capacitance Cb can be described as

C_b = \epsilon_{ox} \frac{L}{H}    (4.5)

Where W is the width of the wire, L is the length of the wire, H is the dielectric height and ε_{ox} is the dielectric constant. The dielectric constant ε_{ox} can be defined as follows:

\epsilon_{ox} = \epsilon_r \times \epsilon_0    (4.6)

Where ε_r is the relative permittivity of the dielectric material and ε_0 is the permittivity of free space. SiO2 has traditionally been the dielectric material of choice in integrated circuits. More recently, low-k dielectrics with lower permittivity have come into use in newer technologies to reduce wiring capacitance; they are an attractive option because they reduce both wire delay and power consumption. Adding fluorine to the silicon dioxide creates fluorosilicate glass (FSG) with a dielectric constant of 3.6, widely used in 130nm processes. Adding carbon to the oxide can reduce the dielectric constant to 2.7-3.0. Alternatively, porous polymer-based dielectrics can deliver even lower dielectric constants. For example, SiLK, from Dow Chemical, has k = 2.6 and can be scaled to k = 1.6-2.2 by increasing the porosity. Developing low-k dielectrics that can withstand the high temperatures during processing and the forces applied during CMP is usually a major challenge. Table 4.3 presents the relative permittivity of several dielectrics used in integrated circuits.

Fringing and coupling capacitances, on the other hand, are more difficult to compute and require a numerical field solver for exact results. For modeling and estimation purposes, however, empirical formulas are available which are computationally efficient and relatively

Table 4.3: Relative Permittivity εr of some Dielectric Materials

Material                   Relative Permittivity εr
Free Space                 1
Aerogels                   1.5
Polyimides (organic)       2-4
Silicon Dioxide            3.9
Glass-epoxy (PC board)     5
Silicon Nitride (Si3N4)    7.5
Alumina                    9.5
Silicon                    11.7

accurate:

C_a = \epsilon_{ox} L \left[ \frac{W}{H} + 0.77 + 1.06\left(\frac{W}{H}\right)^{0.25} + 1.06\left(\frac{T}{H}\right)^{0.5} \right]    (4.7)

C_c = \epsilon_{ox} L \left[ 0.03\,\frac{W}{H} + 0.83\,\frac{T}{H} - 0.07\left(\frac{T}{H}\right)^{0.222} \right] \left(\frac{H}{S}\right)^{4/3}    (4.8)

These empirical formulas are accurate to within 6% for processes with an aspect ratio of less than 3.3 [35]. From equation 4.4, it can be seen that increasing the width of the wire significantly decreases resistance while resulting in only a modest increase in capacitance due to the top and bottom layers; this less than proportional increase in capacitance means the overall RC delay still improves. Similarly, increasing the spacing between adjacent wires reduces the capacitance to the adjacent wires and leaves the resistance unchanged, which also reduces the RC delay by significantly reducing the coupling capacitance. While the T and H parameters are fixed for each metal layer in a given process technology, the parameters W and S can be chosen by the link designer to achieve an acceptable delay. By allocating more metal area per wire and increasing the wire width and spacing, the overall effect is that the product of R_wire and C_wire decreases,

resulting in lower wire delays. If a design is limited by the available wiring space, then varying W and S for optimal delay will have an impact on the number of wires in the link through the following relation:

A_{wire} = N \cdot W + (N-1) \cdot S    (4.9)

The primary difference between wires in the different types of metal layers is the wire width and spacing (in addition to the thickness). Increasing the interconnect width and spacing in a limited area reduces the number of links and thus the overall system bandwidth. As a result, these geometric adjustments to achieve lower delay can create an upper bound on the achievable bandwidth.
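The sketch below inverts Eq. 4.9 to show this tradeoff: for a fixed wiring channel, wider wires and larger spacing reduce the number of wires that fit. The channel width and the width/spacing pairs are illustrative assumptions.

# Sketch of Eq. 4.9: for a fixed wiring area A_wire, widening W and S reduces the
# number of wires N that fit, N = floor((A_wire + S) / (W + S)).  Values are
# illustrative; widths and spacings are in micrometres.
import math

def wires_that_fit(area_um, width_um, spacing_um):
    # Invert A_wire = N*W + (N - 1)*S for the largest integer N
    return math.floor((area_um + spacing_um) / (width_um + spacing_um))

channel_width_um = 20.0
for w, s in [(0.145, 0.145), (0.335, 0.335), (0.799, 0.329)]:
    n = wires_that_fit(channel_width_um, w, s)
    print(f"W = {w:.3f} um, S = {s:.3f} um -> {n} wires in a {channel_width_um} um channel")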

4.1.2 Performance Optimization using Repeater Insertion

Widening a uniform line has a marginal impact on the overall wire delay [10].

The resistance and capacitance of a wire are both linear functions of the wire length.

Hence, the delay of a wire, which depends on the product of wire resistance and

capacitance, is a quadratic function of wire length. For longer NoC interconnects,

wire sizing and spacing alone may not be sufficient to limit this quadratic growth. A

simple and more effective strategy for reducing the delay of a long interconnect is to

strategically insert buffers along the length of the line. These buffers are typically

called repeaters and the process is called repeater insertion. In this technique, the

delay of a wire is reduced by splitting the line into multiple smaller segments of

equal lengths and by inserting a repeater between each segment to actively drive the

wire. As a result, wire delay becomes a linear function of wire length. Figure 4.8

shows an interconnect line with inserted repeaters. In repeater insertion, usually the

Figure 4.8: Interconnect with Repeaters

decreased interconnect delay is partially offset by the additional delay of the inserted

repeaters. Overall wire delay can be minimized by selecting optimal repeater sizes

and spacing between repeaters [43] and this technique is commonly employed in

modern-day processors. A number of repeater insertion methodologies for different

types of optimization exist [45][46][47]. The minimum delay of the resulting RC

circuit is achieved when the delay of the repeater equals the wire segment delay.

Using the methodology presented in [48], the optimal repeater size h_opt and the optimal inter-repeater segment length k_opt can be calculated using equations 4.10 and 4.11:

h_{opt} = \sqrt{\frac{R_s\, C}{R\, C_s}}    (4.10)

k_{opt} = \sqrt{\frac{2\, R_s (C_s + C_p)}{R\, C}}    (4.11)

Where R_s and C_s are the resistance and capacitance of a minimum-size inverter, R is the resistance of the wire per unit length and C is the capacitance of the wire per unit length. Similarly, the optimal width and optimal spacing are given in equations 4.12 and 4.13, respectively:

W_{opt} = \sqrt{\frac{C_a S_{opt} + C_c}{C_b}}    (4.12)

S_{opt} = \sqrt{\frac{C_c W_{opt}}{C_a + C_b W_{opt}}}    (4.13)

If an interconnect is divided into n segments with a repeater driving each segment, then the total wire delay equals the number of repeated segments multiplied by the individual segment delay. The delay of one segment of length k_opt driven by a buffer of size h_opt is given by

\tau_{opt} = 2\, R_s (C_o + C_p) \left( 1 + \sqrt{\frac{2\, C_o}{C_o + C_p}} \right)    (4.14)

The intrinsic RC delay of the longest interconnect (10 mm) in the 65nm technology is calculated to be 3537.6 ps, whereas the 15 FO4 time in the same technology node is 414.375 ps. The frequency achievable by the technology-node definition is 2.41 GHz, but due to the length and associated delay of the longest interconnect, the achievable frequency is limited to only 0.28 GHz. Using equations 4.10-4.13 for buffer insertion and width and spacing optimization, the delay can be improved to 381.94 ps (a frequency of 2.61 GHz). With optimal repeater insertion, the growth of the interconnect delay becomes almost linear with wire length. However, for a large high performance NoC design, the total number of such repeaters can be prohibitively large; they can take up a significant portion of the silicon and routing area and additionally can consume a significant amount of power. Power dissipation is discussed in the next section.
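A minimal sketch of the repeater-sizing step of Eqs. 4.10-4.11 follows. The inverter parameters (R_s, C_s, C_p) and the per-unit-length wire R and C are placeholder assumptions chosen only to land near the 65 nm values used in this chapter (a critical segment length of roughly 1.4 mm and a repeater size on the order of 100x), not extracted device data.

# Sketch of Eqs. 4.10-4.11: optimal repeater size and inter-repeater segment length.
# Rs, Cs, Cp (minimum inverter resistance, input and parasitic capacitance) and the
# wire R and C are assumed placeholder values.
import math

def repeater_sizing(r_s, c_s, c_p, r_wire, c_wire):
    """Return (h_opt, k_opt): repeater size in multiples of a minimum inverter and
    optimal segment length in metres, for wire R in ohm/m and C in F/m."""
    h_opt = math.sqrt((r_s * c_wire) / (r_wire * c_s))               # Eq. 4.10
    k_opt = math.sqrt(2.0 * r_s * (c_s + c_p) / (r_wire * c_wire))   # Eq. 4.11
    return h_opt, k_opt

Rs, Cs, Cp = 7e3, 1.5e-15, 1.5e-15      # assumed minimum-inverter parameters
Rw, Cw = 86e3, 0.25e-9                   # assumed widened global wire: 86 ohm/mm, 0.25 pF/mm
h, k = repeater_sizing(Rs, Cs, Cp, Rw, Cw)
print(f"h_opt ~ {h:.0f}x minimum inverter, k_opt ~ {k * 1e3:.2f} mm between repeaters")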

4.2 NoC Power Consumption In Physical Links

Power consumption in the on-chip interconnect represents a significant portion of the total chip power consumption (up to 50%) [49]. In a NoC, interconnect length is a function of topology, die size and the number of IPs. Considering a chip size of 20mm×20mm, a 65nm technology node and a system of 64 IP blocks, the number of interswitch links and the total wire length are obtained for the different NoC architectures and presented in Table 4.4. When the interconnect length is larger than the critical length, repeater insertion is necessary. For the 65nm technology node, using the methodology presented in [48], the critical interconnect length is 1.44 mm. The links are assumed to be 16 bits wide with 2 control signal lines per channel link. Using an optimal interconnect width of 799, an optimal interconnect spacing of 329 and an optimal repeater size of 105, the power consumption of the interswitch links and repeaters is calculated for the different NoC architectures and presented in Table 4.6, together with the power overhead as a percentage of the total power consumption. Repeater insertion, along with spacing and width optimization, may consume as much as 6 percent of the total chip area, which is significant considering that routers are not included. Similarly, repeaters alone can consume as much as 1048 mW of power in the SPIN topology; this again is quite significant, considering the size and high-end power budget of the chip. The difference between the area and power budgets of the logic design and the physical design can either upset a design completely or result in endless iterations to get a high performance design to work. As mentioned previously, this problem demands an improved design flow that comprehends the problems imposed by interconnects in the deep nanometer regime. A more efficient design flow is proposed and shown in Figure 4.9.

Table 4.4: Interconnect Power and Area Consumption: Intrinsic Case, f = 400 MHz and α = 1

Architecture   Total Wire Length (mm)   Max. RC Delay (ps)   Bandwidth (Gbps)   Pline (mW)   Area (mm²)
Cliche         5040                     221.11               72.36              374.88       1.4616
BFT            8640                     3537.69              4.52               642.64       2.5056
SPIN           20160                    3537.69              4.52               1499.50      5.8464
Octagon        5125                     1989.95              8.04               381.20       1.4863

Table 4.5: Interconnect Power and Area Consumption: Width and Space Optimization, f = 400 MHz and α = 1

Architecture   Total Wire Length (mm)   Max. Length (mm)   Bandwidth (Gbps)   Pline (mW)   Area (mm²)
Cliche         5040                     2.5                507.514            231.50       5.6851
BFT            8640                     10.0               84.49              396.85       9.7459
SPIN           20160                    10.0               84.49              925.99       22.7405
Octagon        5125                     7.5                116.12             235.40       5.7810

Table 4.6: Total Power and Area Consumption, f=400MHz and α = 1

Architecture   No. of Reps   Power (mW)   Area (mm²)   Total Power (mW) (% ↑)   Total Area (mm²) (% ↑)
Cliche         2016          190.51       0.01291      422.01 (11.2)            5.6980 (74.35)
BFT            4608          435.46       0.02951      832.31 (22.8)            9.7754 (74.37)
SPIN           11520         1088.64      0.07379      2014.63 (25.6)           22.8143 (74.37)
Octagon        2592          244.94       0.01660      480.35 (20.6)            5.7976 (74.37)

4.3 A Layout-Aware NoC Design Methodology

Interconnects in the deep nanometer regime pose a real challenge to meeting system performance and require optimization to meet bandwidth requirements. These optimizations, along with power hungry global interconnects, can make a big difference in the area and power cost of the total system. The difference between the budget of the logic design and that of the physical design can either upset a design completely or result in endless iterations to get a design to meet its goals. As a consequence, the traditional top-down approach taken in many design processes is not sufficient to deal with this problem and to account accurately for the interconnects simultaneously with the rest of the design flow. The challenges associated with interconnects require a new and improved design flow and innovative optimization tools that help to accurately model the complex relationships that exist at the nanometer scale. An efficient design flow based on NoC interconnect modeling is presented in [50] and is shown in Figure 4.9.

Modeling and simulation are key to accounting for interconnects in the early stages of design. The modeling and simulation capabilities can range from high-level predictions of the interconnect impact on the IC layout and electrical behavior to rough low-level estimates.

4.4 Summary

Wires are not as ideal as drawn in schematic diagrams; there are large parasitics associated with the interconnects which exhibit undesired effects and hinder performance. The impact of parasitics is more pronounced in the deep nanometer regime.

Interconnect performance analysis methodologies are thus highly important for a successful design and a shorter design cycle time. Accurate estimation of interconnect

Figure 4.9: An Improved ASIC Design Flow for NoC in Deep Nanometer Regime

delay, power and area early in the design cycle is crucial for effective system-level optimization. The commonly used top-down design approach for digital design is no longer valid in the presence of deep nanometer effects and can lead to misleading design targets. Interconnect design in nanometer geometries is a compromise between density, RC performance and cost. Narrow wires deliver high density but relatively poor RC performance, while wide wires have better RC performance. Managing these factors through interconnect modeling and design methods is necessary to accurately account for power and delay in physical design optimization, and therefore for a high performance SoC design.

CHAPTER 5

Layout Aware NoC Design Methodology

In earlier generations of IC designs, the main parameters of concern were timing and area. EDA tools were designed to maximize the speed while minimizing area.

Power consumption was a lesser concern. CMOS was considered a low-power technology, with fairly low power consumption at the relatively low clock frequencies used at the time, and with negligible leakage current. In recent years, however, device densities and clock frequencies have increased dramatically in CMOS devices, increasing the power consumption accordingly. At the same time, supply voltages and transistor threshold voltages have been lowered, causing leakage current to become a significant problem. As a result, power consumption has reached unacceptable levels, and low power design has become as important as timing or area in any design.

High power consumption can result in excessively high temperatures during operation. This means that expensive ceramic chip packaging must be used instead of plastic, and complex and expensive heat sinks and cooling systems are often required for product operation. Laptop computers and hand-held electronic devices can become uncomfortably hot to the touch. Higher operating temperatures also reduce reliability because of electromigration and other heat-related failure mechanisms.

In portable and hand-held devices, high power consumption reduces battery life.

As more and more features are added to a product, power consumption increases and battery life is reduced even further, requiring a larger, heavier battery or a shorter life between charges. Battery technology has lagged behind the increasing demand for power. Another aspect of power consumption is the sheer cost of the electrical energy used to power the millions of computers, servers and other electronic devices deployed on a large scale, both to run the devices themselves and to cool the machines and the buildings in which they are used. Even a small reduction in the power consumption of a microprocessor or other device used on a large scale can result in large aggregate cost savings to users and can provide significant benefits to the environment.

NoCs are being considered as a potential candidate to solve the on-chip communication problems of large scale SoC designs. As technologies continue to shrink into the deep nanometer regime, these complex SoC designs face many challenges, including the first and foremost goal of low power design. Low power design is essential, as the feasibility of a NoC depends heavily on the power budget it requires. As power becomes one of the major limiting factors in IC design, designers need new capabilities across the entire design flow to develop a keen understanding of the sources of power consumption and the tradeoffs among different NoC topologies. To aid in this process, this chapter presents a high level power estimation methodology for different NoC architecture designs.

5.1 CMOS Power Dissipation

Before delving into the techniques to save power, let us first look at the basics and examine what we mean by these terms and why they are important. The instantaneous power P(t) consumed or supplied by a circuit element is the product

of the current through the element and the voltage across the element [35]

P(t) = I(t)\, V(t)

The total power consumed by an SoC design consists of dynamic power and static power. Dynamic power is the power consumed when the device is active - that is, when signals are changing values. In CMOS devices dynamic power is consumed mainly because of (i) charging and discharging load capacitances as gates switch and

(ii) short-circuit current while both the PMOS and NMOS stacks are momentarily on. The first and primary source of dynamic power consumption is switching power, the power required to charge and discharge the output load, defined as

P_{switching} = \alpha\, C\, V_{DD}^2\, f    (5.1)

Where α is the switching activity factor, C is the load capacitance, VDD is the supply voltage and f is the clock frequency. Note that switching power is not a function of transistor size, but rather a function of switching activity and load capacitance; it is therefore very much data dependent. Static power, on the other hand, is the power consumed when the device is powered up but no signals are changing value. Static power consumption in CMOS devices is mainly due to leakage and consists of (i) subthreshold leakage through OFF transistors, (ii) gate leakage through the gate dielectric and (iii) junction leakage from source/drain diffusions. Putting this all together, the total power of a circuit can be defined as

P_{total} = P_{dynamic} + P_{static}    (5.2)

where

P_{dynamic} = P_{switching} + P_{short-circuit}    (5.3)

P_{static} = (I_{sub} + I_{gate} + I_{junc})\, V_{DD}    (5.4)
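The bookkeeping of Eqs. 5.1-5.4 can be expressed compactly as below. All numeric inputs in the example call are illustrative assumptions, not measured Design-II values.

# Sketch of Eqs. 5.1-5.4: total power as switching + short-circuit + leakage.
def total_power_mw(alpha, c_load_pf, vdd, f_mhz,
                   p_short_mw, i_sub_ua, i_gate_ua, i_junc_ua):
    p_switching = alpha * c_load_pf * vdd**2 * f_mhz / 1000.0      # Eq. 5.1, in mW
    p_dynamic = p_switching + p_short_mw                           # Eq. 5.3
    p_static = (i_sub_ua + i_gate_ua + i_junc_ua) * vdd / 1000.0   # Eq. 5.4, uA*V -> mW
    return p_dynamic + p_static                                    # Eq. 5.2

# Assumed example values
print(total_power_mw(alpha=0.1, c_load_pf=100.0, vdd=1.0, f_mhz=200,
                     p_short_mw=0.2, i_sub_ua=300.0, i_gate_ua=50.0, i_junc_ua=5.0))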

Power dissipation is a significant constraint in any large scale chip design, including NoC. The elevation of power to a first-class design constraint requires that power estimates be made at the same time as performance studies in the design flow. In a NoC, the opportunity to influence power consumption differs at each level of design abstraction: the higher the level of design abstraction, the greater the influence power management techniques have on power consumption. As shown in Figure 5.1, system designers can have roughly an order of magnitude more impact by addressing power issues early in the design process. There are a number of techniques at

Figure 5.1: Diminishing Returns of Power

the architectural, logic and circuit design levels that can reduce the power for a particular function implemented in a given technology. In a NoC, at the architecture level, power savings are achieved mainly through the different architectural choices available for the design. Since architectural decisions often cannot be reversed, power estimates must be made early in the design phase. A natural solution to this problem is to develop NoC power models at the architectural level. Power modeling at the architectural level can provide rough estimates and additionally allows tradeoffs between hardware and software partitioning. Architectural level power modeling can influence both power and performance savings in a design. Power models for the different NoC architectures are developed in the subsequent sections.

5.2 Power Analysis for NoC-based Systems

An important challenge for current and future interconnect architectures is to provide a low power solution. This demand is mainly driven by advanced applications in large SoC designs. The communication network alone can consume a significant portion of the total system power in any design; in modern processors nearly one third of the power is spent in logic and wires. For instance, the MIT RAW on-chip network consumes 36% of the total chip power, and the Alpha-21364 microprocessor dissipates almost 20% of its total chip power in the interconnects alone [33]. In a NoC design, power consumption arises mainly from three components, namely (1) routers, (2) interconnects (wires) and (3) repeaters. The total power dissipation in the network can be defined using the following equations:

P_{total} = P_{switches} + P_{line} + P_{repeaters}    (5.5)

where

P_{line} = \alpha\, C_L\, V_{DD}^2\, f    (5.6)

P_{repeaters} = N_{rep} \left( \alpha\, h_{opt}\, C_o\, V_{DD}^2\, f + I_{leak-rep} V_{DD} + I_{short-rep} V_{DD} \right)    (5.7)

where f is the clock frequency and VDD is the supply voltage; Pswitches is the total power consumed by the switches, Pline is the total power consumed in the interswitch links and Prepeaters is the total power consumed by the repeaters [51]. Repeaters are required on long interconnects to enhance system performance. The number of repeaters required depends on the length of the interswitch link and the technology node used. Different NoC architectures require different numbers of switches, different lengths of interswitch interconnects and thus different numbers of repeaters. The differences in these components, for the same number of IPs and chip size, can thus create large variations in the power numbers.

5.2.1 Cliche Architecture Power Model

The Cliche architecture, or 2D mesh, is the most commonly used NoC topology in the latest commercial products and industrial prototypes. As shown in Figure 5.2, it is a uniform, scalable and, from a layout perspective, the simplest form of architecture. In

Figure 5.2: Layout of Cliche architecture

this architecture, all the interswitch wire segments are of the same length, which can be determined using the following expression:

L = \frac{\sqrt{Area}}{\sqrt{N}}    (5.8)

If the numbers of IPs in the x and y directions are equal (m = n), then the number of horizontal links equals the number of vertical links and can be calculated as \sqrt{N}(\sqrt{N} - 1). Depending on the technology node, the optimal segment length for repeater insertion can be obtained using equation 4.11. Thus, for the Cliche architecture, the total interconnect length and the required number of repeaters can be calculated using the following expressions:

l_{Cliche} = 2 \sqrt{Area}\, (\sqrt{N} - 1)\, N_{wires}    (5.9)

N_{rep\text{-}Cliche} = 2 \sqrt{N} (\sqrt{N} - 1) \left\lfloor \frac{\sqrt{Area}}{k_{opt} \sqrt{N}} \right\rfloor N_{wires}    (5.10)

Using the total number of switches, the total interconnect wire length and the total number of required repeaters, the total power consumption of the Cliche architecture can be calculated using the following expression:

P_{Total\text{-}Cliche} = N_{sw} P_{switch} + 2 \sqrt{Area} (\sqrt{N} - 1) N_{wires}\, C\, V_{DD}^2 f + 2 \sqrt{N} (\sqrt{N} - 1) \left\lfloor \frac{\sqrt{Area}}{k_{opt} \sqrt{N}} \right\rfloor N_{wires} \times \left( \alpha h_{opt} C_o V_{DD}^2 f + P_{leak-rep} + P_{short-rep} \right)    (5.11)
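A sketch of Eq. 5.11 as a function is given below. The switch power, wire capacitance, repeater parameters and electrical constants in the example call are illustrative placeholders that the designer would normally supply from the router characterization of Chapter 3 and from Eqs. 4.10-4.11; the result is not intended to reproduce the thesis tables exactly.

# Sketch of the Cliche power model, Eqs. 5.8-5.11.
import math

def cliche_power_mw(n_ips, area_mm2, n_wires, k_opt_mm,
                    p_switch_mw, c_per_mm_pf, vdd, f_mhz,
                    alpha, h_opt, c_o_pf, p_leak_rep_mw, p_short_rep_mw):
    side = math.sqrt(area_mm2)
    root_n = math.sqrt(n_ips)
    l_total = 2.0 * side * (root_n - 1.0) * n_wires                  # Eq. 5.9, mm
    n_rep = (2.0 * root_n * (root_n - 1.0)
             * math.floor(side / (k_opt_mm * root_n)) * n_wires)     # Eq. 5.10
    p_switches = n_ips * p_switch_mw                                  # one switch per IP
    p_line = alpha * c_per_mm_pf * l_total * vdd**2 * f_mhz / 1000.0  # pF*MHz -> uW -> mW
    p_rep = n_rep * (alpha * h_opt * c_o_pf * vdd**2 * f_mhz / 1000.0
                     + p_leak_rep_mw + p_short_rep_mw)
    return p_switches + p_line + p_rep                                # Eq. 5.11

# 64 IPs on a 20 mm x 20 mm die, 10 wires per link (8 data + 2 control); other
# parameters are assumed for illustration only.
print(cliche_power_mw(64, 400.0, 10, 1.44, p_switch_mw=9.62, c_per_mm_pf=0.25,
                      vdd=1.0, f_mhz=200, alpha=0.1, h_opt=105, c_o_pf=0.0015,
                      p_leak_rep_mw=0.001, p_short_rep_mw=0.0005))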

5.2.2 BFT Architecture Power Model

In the Butterfly Fat Tree (BFT) architecture, the IPs are placed at the leaves and the switches at the vertices. At the lowest level (level 0), there are N IPs, which are connected to N/4 switches at the first level. A floor plan for a 16-IP BFT network is shown in

Figure 5.3. The number of levels in a BFT architecture depends on the total number

Figure 5.3: Layout of BFT architecture

of IPs and can be calculated using equation 5.13:

N_{sw} = \frac{N}{2} \left[ 1 - \left(\frac{1}{2}\right)^{levels} \right]    (5.12)

where

levels = \log_2(N) - 3    (5.13)

and

\text{Switches at level } j = \frac{N}{2^{\,j+1}}    (5.14)
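A quick numerical check of Eqs. 5.12-5.14 for a 64-IP BFT is shown below: three levels, 28 switches in total, distributed as 16, 8 and 4 switches per level, consistent with the switch count used later in Table 5.2.

# Quick check of Eqs. 5.12-5.14 for a 64-IP BFT.
import math

def bft_levels(n_ips):
    return int(math.log2(n_ips)) - 3                            # Eq. 5.13

def bft_switches(n_ips):
    return int(n_ips / 2 * (1 - 0.5 ** bft_levels(n_ips)))      # Eq. 5.12

def bft_switches_at_level(n_ips, j):
    return n_ips // 2 ** (j + 1)                                # Eq. 5.14

print(bft_levels(64), bft_switches(64),
      [bft_switches_at_level(64, j) for j in range(1, 4)])      # 3 28 [16, 8, 4]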

The wire lengths between switches in the BFT architecture, based on this layout, are shown in Figure 5.3. The interswitch wire length can be calculated using the following expression [10]:

l_{a+1,a} = \frac{\sqrt{Area}}{2^{\,levels-a}}    (5.15)

Where l_{a+1,a} is the length of the wire spanning the distance between a level 'a' switch and a level

'a+1' switch, where a can take integer values between 0 and (levels − 1). Thus the total length of the interconnects and the total number of repeaters can be calculated using the following expressions:

l_{total} = \frac{\sqrt{Area}}{2^{\,levels}}\, (levels \cdot N \cdot N_{wires})    (5.16)

N_{rep} = N\, N_{wires} \times \left[ \left\lfloor \frac{l_{1,0}}{k_{opt}} \right\rfloor + \frac{1}{2}\left\lfloor \frac{l_{2,1}}{k_{opt}} \right\rfloor + \frac{1}{4}\left\lfloor \frac{l_{3,2}}{k_{opt}} \right\rfloor + \cdots + \frac{1}{2^{\log_2(N)-3}}\left\lfloor \frac{l_{levels,levels-1}}{k_{opt}} \right\rfloor \right]    (5.17)

Where k_opt is the optimal length of the global interconnect between repeaters. Using the total number of switches, the total wire length and the total number of repeaters, the total power dissipation of the BFT architecture can be calculated using the following expression:

P_{total} = P_{switch}\, \frac{N}{2} \left[ 1 - \left(\frac{1}{2}\right)^{levels} \right] + \frac{\sqrt{Area}}{2^{\,levels}} (levels \cdot N \cdot N_{wires})\, C\, V_{DD}^2 f + N\, N_{wires} \times \left[ \left\lfloor \frac{l_{1,0}}{k_{opt}} \right\rfloor + \frac{1}{2}\left\lfloor \frac{l_{2,1}}{k_{opt}} \right\rfloor + \frac{1}{4}\left\lfloor \frac{l_{3,2}}{k_{opt}} \right\rfloor + \cdots + \frac{1}{2^{\log_2(N)-3}}\left\lfloor \frac{l_{levels,levels-1}}{k_{opt}} \right\rfloor \right] \times \left( \alpha h_{opt} C_o V_{DD}^2 f + P_{leak-rep} + P_{short-rep} \right)    (5.18)

5.2.3 SPIN Architecture Power Model

As explained earlier in Chapter 2, the SPIN topology is based on a fat-tree topology: every node has four children, and the parent is replicated four times at every level of the tree. This topology carries some redundant paths, but offers higher throughput at the cost of added area. SPIN is scalable and uses a small number of routers for a given number of IPs; in a large SPIN network (more than 16 IPs), the total number of switches is 3N/4 [17].

An efficient floor plan for the SPIN architecture is shown in Figure 5.4. Based on this

Figure 5.4: Layout of SPIN architecture

floor plan, the interswitch wire lengths can be determined directly from the layout. The total wire length and the number of repeaters can be calculated using the following expressions:

l_{total} = 0.875\, \sqrt{Area}\, N\, N_{wires}    (5.19)

N_{rep} = N\, N_{wires} \left( \left\lfloor \frac{\sqrt{Area}}{8\, k_{opt}} \right\rfloor + \left\lfloor \frac{\sqrt{Area}}{4\, k_{opt}} \right\rfloor + \left\lfloor \frac{\sqrt{Area}}{2\, k_{opt}} \right\rfloor \right)    (5.20)

The total power consumption of the SPIN architecture, using the total interconnect length and the total number of repeaters, can thus be calculated using equation

5.21:

P_{total} = P_{switch}\, \frac{3N}{4} + 0.875\, \sqrt{Area}\, N\, N_{wires}\, C\, V_{DD}^2 f + N\, N_{wires} \left( \left\lfloor \frac{\sqrt{Area}}{8 k_{opt}} \right\rfloor + \left\lfloor \frac{\sqrt{Area}}{4 k_{opt}} \right\rfloor + \left\lfloor \frac{\sqrt{Area}}{2 k_{opt}} \right\rfloor \right) \times \left( \alpha h_{opt} C_o V_{DD}^2 f + P_{leak-rep} + P_{short-rep} \right)    (5.21)

5.2.4 Octagon Architecture Power Model

A basic Octagon unit consists of eight nodes and 12 bidirectional links. Each node is associated with one IP; therefore the number of switches in an Octagon unit is equal to the number of IPs. For a system containing more than eight nodes, the Octagon is expanded into a multi-dimensional space using multiple basic Octagon units. An efficient layout scheme for the Octagon architecture is shown in Figure 5.5. Based on

Figure 5.5: Layout of Octagon architecture

this layout scheme, there are four different interswitch wire lengths in the Octagon architecture [18]. The first set connects nodes 1-5 and 4-8, the second set connects nodes 2-6 and 3-7, the third connects nodes 1-8 and 4-5, and the fourth connects nodes 1-2, 2-3, 3-4, 5-6, 6-7 and 7-8. The interswitch wire lengths can be calculated using the following expressions:

l_1 = \frac{3L}{4}    (5.22)

l_2 = 13\, w_l\, N_{wires} + \frac{L}{4}    (5.23)

l_3 = 13\, w_l\, N_{wires}    (5.24)

l_4 = \frac{L}{4}    (5.25)

Where L is the length spanned by four nodes and is equal to 4\sqrt{Area}/\sqrt{N}, and w_l is the sum of the global interconnect width and spacing. Considering the different interswitch wire lengths, the total interconnect length and the total number of repeaters required

can be calculated using the following expressions:

l_{total} = \left( \frac{7L}{2} + 52\, w_l\, N_{wires} \right) N_{wires}\, N_{oct\text{-}units}    (5.26)

N_{rep} = \left( 2\left\lfloor \frac{3L/4}{k_{opt}} \right\rfloor + 2\left\lfloor \frac{13 w_l N_{wires} + L/4}{k_{opt}} \right\rfloor + 2\left\lfloor \frac{13 w_l N_{wires}}{k_{opt}} \right\rfloor + 6\left\lfloor \frac{L/4}{k_{opt}} \right\rfloor \right) N_{wires}\, N_{oct\text{-}units}    (5.27)

Where N_{oct-units} is the number of basic Octagon units. The total power dissipation of the Octagon network can thus be calculated using the following expression:

P_{total} = P_{switches} + \left( 14\sqrt{\frac{Area}{N}} + 52\, w_l\, N_{wires} \right) N_{wires}\, N_{oct\text{-}units}\, C\, V_{DD}^2 f + N_{wires}\, N_{oct\text{-}units} \left( 2\left\lfloor \frac{3L/4}{k_{opt}} \right\rfloor + 2\left\lfloor \frac{13 w_l N_{wires} + L/4}{k_{opt}} \right\rfloor + 2\left\lfloor \frac{13 w_l N_{wires}}{k_{opt}} \right\rfloor + 6\left\lfloor \frac{L/4}{k_{opt}} \right\rfloor \right) \times \left( \alpha h_{opt} C_o V_{DD}^2 f + P_{rep-leak} + P_{rep-short} \right)    (5.28)
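The sketch below evaluates the structural quantities of the four models (switch count, total interswitch wire length and repeater count) for 64 IPs on a 20 mm x 20 mm die, with 10 wires per link and k_opt = 1.44 mm; the wire pitch w_l assumes that the 799 nm width and 329 nm spacing quoted later in Section 5.4 are in nanometers. Under these assumptions the printed values should agree with the structural columns of Table 5.2; the power numbers additionally require C, V_DD, f and the characterized switch power.

# Sketch: structural quantities of the four power models for N = 64, 20 mm x 20 mm.
import math

N, AREA, NW, KOPT = 64, 400.0, 10, 1.44          # IPs, mm^2, wires/link, mm
W_L = (799 + 329) * 1e-6                          # assumed wire width + spacing, mm
side, rootn = math.sqrt(AREA), math.sqrt(N)

# Cliche (Eqs. 5.9-5.10)
cliche = (N,
          2 * side * (rootn - 1) * NW,
          2 * rootn * (rootn - 1) * math.floor(side / (KOPT * rootn)) * NW)

# BFT (Eqs. 5.12-5.17); per-level coefficients halve: 1, 1/2, 1/4, ...
lev = int(math.log2(N)) - 3
seg = [side / 2 ** (lev - a) for a in range(lev)]           # l_{1,0}, l_{2,1}, l_{3,2}
bft = (int(N / 2 * (1 - 0.5 ** lev)),
       side / 2 ** lev * lev * N * NW,
       N * NW * sum(math.floor(l / KOPT) / 2 ** i for i, l in enumerate(seg)))

# SPIN (Eqs. 5.19-5.20), 3N/4 switches
spin = (3 * N // 4,
        0.875 * side * N * NW,
        N * NW * sum(math.floor(side / (d * KOPT)) for d in (8, 4, 2)))

# Octagon (Eqs. 5.22-5.27), L = 4*sqrt(Area)/sqrt(N), N/8 basic units
L = 4 * side / rootn
octagon = (N,
           (3.5 * L + 52 * W_L * NW) * NW * (N // 8),
           (2 * math.floor(0.75 * L / KOPT) + 2 * math.floor((13 * W_L * NW + L / 4) / KOPT)
            + 2 * math.floor(13 * W_L * NW / KOPT) + 6 * math.floor(L / 4 / KOPT)) * NW * (N // 8))

for name, (sw, wire, rep) in zip(("Cliche", "BFT", "SPIN", "Octagon"),
                                 (cliche, bft, spin, octagon)):
    print(f"{name:8s} switches={sw:3.0f}  wire={wire:8.2f} mm  repeaters={rep:6.0f}")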

5.3 IP Based Design Methodology for NoC

IP based design is the dominant way to design a large system containing billions of transistors in a reasonable amount of time. IP based design differs from custom design in that the IPs are designed well before they are used; therefore, in these designs, most of the system requirements, such as bandwidth, area and power consumption, are known a priori. The life cycle of a well designed IP may stretch over many years, from the time it is first created through several generations of technology until its final retirement. However, it is natural that with every generation of technology scaling, the capacity to integrate similar types of IPs doubles, or equivalently their area halves. An example showing the natural progression of the number of IPs that can fit on the same die due to technology scaling is given in Figure 5.6. Functional

Figure 5.6: Number of cores with technology scaling

IP blocks are not discussed, since they are dependent on the specific application.

They are treated as a set of embedded processors. In a NoC, power is a function of the number of IPs and the die size. Depending on the number of IPs and an estimate of the die area they require, a low power topology can be selected using the power models presented earlier. Power dissipation varies among the different NoC architectures because of differences in the total interconnect wire length, the number of switches and the total number of repeaters required by the topology. The number of repeaters required depends on the interconnect lengths, and it varies across topologies. As the number of IPs is increased for a given die area, some topologies scale well, with shorter interconnect lengths, whereas others do not. The length of the longest interconnect for the different NoC architectures, as the number of IPs is increased on a 20mm x 20mm die, is shown in Figure 5.7. The interconnects of the Cliche architecture scale well, i.e., the length of the longest interconnect decreases as the number of IPs increases. In the other topologies, some interconnects scale, but the longest interconnect does not. For a desired bandwidth requirement, longer interconnects may require optimization in

Figure 5.7: Length of longest interconnect with increasing number of IPs

terms of width and spacing, along with repeater insertion, which results in extra area and power costs. Since power is a most critical design constraint and interconnects consume a significant amount of power, it must be accounted for up front in the design process. To aid designers in the early stages of design, the architectural power models presented earlier provide rough power estimates, and an efficient design flow including this step is shown in Figure 5.8. Architectural level power estimation is extremely important in order to (1) verify that power budgets are approximately met and (2) evaluate the effect of various high-level optimizations, which have been shown to have a much more significant impact on power than low-level optimizations.

5.4 Network Power Analysis

A synchronous router described in HDL is implemented using ARM's standard cell library in a TSMC 65nm CMOS design process. Synopsys's PrimeTime PX tool is used for calculating the average power dissipation. A 6-port switch consumes 9.62 mW

Figure 5.8: A Methodology for Power Efficient NoC Design

of power at a frequency of 200MHz. Using [48], for the 65nm technology node, the critical interconnect length is 1.44 mm. The links are assumed to be bidirectional, with 8 data lines and 2 control signal lines per link. An optimal interconnect width of 799, an optimal interconnect spacing of 329 and an optimal repeater size of 105 are used. For design space exploration using the power models and the IP-based design flow presented earlier, the power variation among the different NoC topologies is shown in Figure 5.9. A range of 16-1024 IPs and die sizes from 25mm2 to 400mm2 are used. The SPIN topology consumes the highest power, whereas BFT is the most power efficient. Considering a 20mm x 20mm die size, the power consumed by the different architectures for 16, 64 and 256 IPs is presented in Tables 5.1-5.3. The power consumption of the wires and repeaters is

Figure 5.9: Total Power of the Network: (a) Cliche Architecture, (b) BFT Architecture, (c) SPIN Architecture, (d) Octagon Architecture

also presented. The overhead on the system, relative to a full power budget of 100 Watts [52], is also evaluated. A detailed analysis of the power consumption helps designers save more power through whichever approaches are applicable. For 256 IPs, repeaters alone can consume as much as 1209.6 mW of power in the SPIN topology; this is quite significant, considering the size and high-end power budget of the chip. The total power consumed by the BFT architecture is the lowest in all three cases; however, it is important to see how the different components contribute to the total power consumption.
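The overhead metric used in the tables is simply the network power expressed as a fraction of the assumed 100 W budget, e.g. the 16-IP Cliche entry: 205.54 mW / 100 W ≈ 0.21 %. A short check with values from Table 5.1:

# Sketch of the overhead metric used in Tables 5.1-5.3.
BUDGET_MW = 100_000.0       # 100 W full-chip budget from [52], in mW

def overhead_percent(network_power_mw):
    return network_power_mw / BUDGET_MW * 100.0

for arch, p in (("Cliche", 205.54), ("BFT", 160.68), ("Octagon", 194.96)):
    print(f"{arch:8s}: {overhead_percent(p):.2f} % of the 100 W budget")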

Table 5.1: Power Consumption for 16 IPs

Architecture   Number of Switches   Total Wire Length (mm)   Total No. of Reps   Total Power (mW)   Power Overhead (%)
Cliche         16                   1200                     720                 205.54             0.21
BFT            6                    1600                     960                 160.68             0.16
SPIN           8                    2800                     1600                563.12             0.25
Octagon        16                   1411.73                  880                 194.96             0.19

Table 5.2: Power Consumption for 64 IPs

Architecture   Number of Switches   Total Wire Length (mm)   Total No. of Reps   Total Power (mW)   Power Overhead (%)
Cliche         64                   2800                     1120                667.00             0.67
BFT            28                   4800                     2560                563.12             0.56
SPIN           48                   11200                    6400                1167.40            1.17
Octagon        64                   2846.92                  1440                580.77             0.58

Table 5.3: Power Consumption for 256 IPs

Architecture   Number of Switches   Total Wire Length (mm)   Total No. of Reps   Total Power (mW)   Power Overhead (%)
Cliche         256                  6000                     0                   2269.10            2.27
BFT            124                  8000                     2560                1601.80            1.60
SPIN           192                  44800                    25600               4669.40            4.67
Octagon        256                  5787.70                  1280                1909.82            1.91

Figure 5.10: Distribution of NoC Power Consumption: (a) Cliche Architecture, (b) BFT Architecture, (c) Octagon Architecture, (d) SPIN Architecture

Pie charts generated from the developed power models are plotted to observe the individual contributions of the various components to the total power. Figure 5.10 shows the parameterized contribution of the routers, interconnects and repeaters to the total power for the case of 64 IPs in the different NoC topologies. This breakdown helps designers focus their power-saving effort. In Cliche, the biggest source of power consumption is the switches; although Cliche is second to BFT in total power consumption, it consumes less power in repeaters. The contribution of the BFT switches to the total power is the lowest among all the topologies.

5.5 Summary

Power, being a first-order design objective, must be modeled early in the design flow. The difference between the power budgets of the logic design and the physical design can either upset a design completely or result in endless iterations to get the design to work. Therefore, fast power estimation is a growing requirement in large scale SoC designs, including NoCs. Power estimation in the early phases of design helps the designer optimize the design for energy consumption and efficiently map applications to achieve a low power solution. Power can be estimated at a number of levels with varying degrees of detail. In a NoC, an accurate estimation of power at the architectural level can save ten times as much power as methods applied later in the design flow. In this chapter, an efficient design methodology to estimate NoC power at the architectural level is presented. The analysis is based on architectural layouts and power models. These models can be used to accelerate the NoC design process toward a low power solution and hence faster timing closure.

CHAPTER 6

Power Efficient Asynchronous Network on Chip Architecture

Most conventional SoC designs are synchronous in nature, i.e., they have a global clock signal which provides a common timing reference for the operation of all the circuitry on the chip. However, the trends of increasing die sizes and rising transistor counts may soon lead to a situation in which distributing a high-frequency global clock signal with low skew becomes extremely challenging in terms of design effort and power dissipation. A large part of the power is also spent in the clock tree network.

Studies show that, on average, the power consumed by the clock network can be as high as 40% of the total power consumption of the chip. This high fraction is caused by the fact that the large global wires in the clock tree switch often. To address these two critical issues, several methods have been discussed in the research literature [53]. One of the most commonly proposed solutions is to use asynchronous communication between locally clocked regions, i.e., Globally Asynchronous Locally Synchronous (GALS) communication [54]. The basic idea of GALS is to partition a system into multiple independently clocked domains. Each domain is clocked synchronously, while interdomain communication is achieved through specific interconnect schemes and circuits in a self-timed fashion. Thus the functionality of each subsystem is still described and synthesized using well-established synchronous design methods, while the communication between locally synchronous modules requires specialized asynchronous components. Due to its portability and its transparency to differences among the computational cores, the GALS interconnect scheme is a top choice for IP-based, multi-core and many-core chips.

A GALS-based design style fits nicely with the concept of Network-on-Chip (NoC) design. NoC combined with a Globally Asynchronous Locally Synchronous design is a natural enabler for easy IP integration, scalable communication and provides a clear split between different timing domains. In addition, GALS allows the possibility of

fine-grained power reduction through frequency and voltage scaling. Despite these benefits of the GALS design approach, asynchronous NoC research is still in its early stages and only a limited body of research exists in the area. A port interface highlighting the difference between a synchronous NoC switch and an asynchronous NoC switch is shown in Figure 6.1.

Figure 6.1: Port Interface: (a) Asynchronous Design, (b) Synchronous Design

The asynchronous design approach is based on purely asynchronous, clockless handshaking that uses multiple phases of exchanging control signals (request and acknowledge) for transferring data words across clock domains. The operation of the switch is discussed in more detail in the next section.

6.1 Asynchronous NoC Architecture

A typical NoC architecture is composed of multiple routers and network interfaces

(NI) which connect the IP blocks to the network. Figure 6.2 shows an asynchronous NoC communication architecture. It is similar to a synchronous NoC architecture in that it has the same main building blocks: it consists of (i) switches, (ii) links and (iii)

IP blocks.

Figure 6.2: Asynchronous NoC Architecture

The main function of the switches is to accept the incoming flow of packets, compute where to transmit them, arbitrate between potentially concurrent data requests and finally transmit the selected data flow onto an appropriate output link. The IP blocks, or nodes of the network, are connected to the switches through asynchronous wrapper units and NoC links. The whole asynchronous network is implemented as a GALS system, i.e., the IP blocks are synchronous, while the communication network

is implemented as Quasi-Delay-Insensitive (QDI) asynchronous logic. As shown in Figure 6.2, synchronization and communication between the NoC switch and the synchronous unit is through a pausible-clock mechanism called SAS (Synchronous-to-Asynchronous and Asynchronous-to-Synchronous interfaces). A programmable local clock generator, using a programmable delay line, is implemented within each unit to generate a variable frequency in a predefined and programmable tuning range.

The communication between network switches is accomplished using Asynchronous-to-Asynchronous (ASAS) interfaces. The switch has a different number of ports for different NoC topologies. Using the GALS approach, the port interface design for the asynchronous switch is shown in Figure 6.3.

Figure 6.3: Asynchronous Port Architecture

In the asynchronous design, interswitch links include request, acknowledge and data signals. Each port of the switch includes (i) an input asynchronous FIFO, (ii) a header decoder and (iii) controller modules. Messages arrive in fixed-length flow control units (flits), and when the input FIFO stores one whole flit, it sends a full signal to the controller for the next processing step. If it is a header flit, the header decoder unit determines the destination port and the controller checks the status of that port. If the port is available, the path between the input and the output is established. All subsequent flits of the corresponding packet are sent from input to output using the established path. Two-way handshaking is used for controlling the transmission; the transitions of the request and acknowledgment signals indicate the completion of the transfer [55]. The number of cells in the asynchronous FIFO is equal to the number of bits in one flit. In the asynchronous design, each cell consists of a Put Token Controller (PTC) which deals with the put operation, a Get Token Controller (GTC) which deals with the get operation and a Data Status Controller (DSC) [56].

An asynchronous FIFO cell implementation, along with its data flow details, is shown in Figure 6.4.

Figure 6.4: Asynchronous FIFO Cell

The register is split into two parts, one belonging to the put part (the write port) and one belonging to the get part (the read port). The behavior of the cell can be understood by tracing a put operation and a get operation [57]. The cell receives input data as follows: the put token signal (put_tok) is asserted after two transitions on the input write enable signal (IWE), as shown in Figure 6.5. When a put request is received (put_req = 1), the output write enable (OWE) is asserted. This event causes three operations in parallel: (i) the valid signal is asserted by the DSC (the state of the cell becomes “full”), (ii) the register REG is enabled to latch the input data, and (iii) the cell starts to send the put token to the left cell and resets its own put token signal (put_tok = 0). When put_req is de-asserted, OWE is also de-asserted. The cell is ready to start another put operation once the data from REG has been read out.

Figure 6.5: PTC Circuit

The cell sends the stored data in a similar way, as shown in Figure 6.6. After two transitions on the input read enable signal (IRE), the get token signal (GT) is asserted. The register outputs its data onto the global get data bus. When a get request (get_req = 1) is received, the output read enable signal (ORE) is de-asserted (ORE = 0), GT is reset (GT = 0), and the state of the cell is changed to “empty” (valid = 0) by the DSC.

Figure 6.6: GTC Circuit
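As a software analogy of the put/get token flow just traced, the sketch below models a ring of FIFO cells in which a put token and a get token circulate: a cell latches data and becomes full on a put, and releases data and becomes empty on a get. This is only an illustrative Python model of the behavior, not the PTC/GTC/DSC circuits; all names are invented.

```python
# Illustrative software model of the token-based asynchronous FIFO described
# above: one put token and one get token circulate among the cells; a put
# latches data into the cell holding the put token (cell becomes "full"),
# a get reads the cell holding the get token (cell becomes "empty").

class AsyncFifo:
    def __init__(self, n_cells):
        self.data  = [None]  * n_cells   # REG contents per cell
        self.valid = [False] * n_cells   # DSC state: True = "full"
        self.put_i = 0                   # index of the cell holding the put token
        self.get_i = 0                   # index of the cell holding the get token
        self.n = n_cells

    def put(self, word):
        """Write one word into the cell that owns the put token."""
        if self.valid[self.put_i]:
            return False                 # cell still full: writer must wait
        self.data[self.put_i] = word     # REG latches the input data
        self.valid[self.put_i] = True    # cell state becomes "full"
        self.put_i = (self.put_i + 1) % self.n   # pass the put token along
        return True

    def get(self):
        """Read one word from the cell that owns the get token."""
        if not self.valid[self.get_i]:
            return None                  # cell empty: reader must wait
        word = self.data[self.get_i]     # drive stored data onto the get bus
        self.valid[self.get_i] = False   # cell state becomes "empty"
        self.get_i = (self.get_i + 1) % self.n   # pass the get token along
        return word

if __name__ == "__main__":
    fifo = AsyncFifo(n_cells=4)
    for flit in range(3):
        fifo.put(flit)
    print([fifo.get() for _ in range(3)])   # [0, 1, 2]
```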

Using the burst-mode specifications described in [58], [59], [60] and [61], the burst-mode specifications of the PTC and GTC are shown in Figure 6.7. A burst-mode specification is a Mealy-type finite-state machine consisting of a set of states, a set of arcs, and a unique starting state [60], [61]. An arc is labeled with an input burst (a set of transitions on the input signals), followed by an output burst (a set of transitions on the output signals, possibly empty). A burst-mode machine waits for a complete input burst to arrive; the transitions may come in any order and at any time. Once the complete input burst has arrived, the output burst is generated, and the machine moves to the next specification state. For example, in the PTC specification the machine starts in state 0 and waits for the input burst IWE+ (where + indicates a rising transition); once this arrives, the machine simply moves to state 1 since the output burst is empty. In state 1 the machine waits for the input burst IWE- (where - indicates a falling transition). Once IWE- has arrived, put_tok is generated.

Figure 6.7: Burst Mode Specification of PTC and GTC
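The PTC burst-mode behavior just traced (wait for IWE+, then for IWE-, then emit put_tok) can be captured in a few lines. The table-driven sketch below is only an illustrative encoding of that Mealy-style specification, not a synthesis result; the dictionary and function names are invented.

```python
# Table-driven sketch of the PTC burst-mode specification traced above:
# state 0 waits for a rising transition on IWE (IWE+), state 1 waits for a
# falling transition (IWE-) and then generates put_tok. Illustrative only.

# (state, input burst) -> (next state, output burst)
PTC_SPEC = {
    (0, "IWE+"): (1, []),            # input burst IWE+, empty output burst
    (1, "IWE-"): (0, ["put_tok+"]),  # input burst IWE-, output burst put_tok+
}

def run_burst_mode(spec, start_state, input_bursts):
    """Apply a sequence of complete input bursts to a burst-mode machine."""
    state, outputs = start_state, []
    for burst in input_bursts:
        state, out = spec[(state, burst)]
        outputs.extend(out)
    return state, outputs

if __name__ == "__main__":
    # Two transitions on IWE (rising then falling) produce one put token,
    # matching the waveform description around Figure 6.5.
    final_state, outs = run_burst_mode(PTC_SPEC, 0, ["IWE+", "IWE-"])
    print(final_state, outs)   # 0 ['put_tok+']
```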

The DSC has two inputs and an output which indicates when the cell is full. A DSC configuration is shown in Figure 6.8. The output (called busy) is 1 when OWE is 1, and it is 0 when ORE is 0 (after having been 1 previously).

Figure 6.8: DSC Circuit

The asynchronous FIFO, header decoder and port controller are implemented using a standard cell library. The clock distribution network for the synchronous architecture is discussed next.

6.2 Synchronous NoC Architecture

In the Synchronous architecture, a clock signal is required for all the clocked elements in the switches. A port-to-port interface for a synchronous architecture is shown in Figure 6.9. The Write and Full signals are used in the switch for controlling the operation of the synchronous input and output FIFOs. The two most common styles

Figure 6.9: Synchronous Switch Port Design

of physical clock distribution network are the H-tree and the balanced tree. The H-tree is a very regular structure which allows predictable delay. The balanced tree takes the opposite approach of synthesizing a layout based on the characteristics of the circuit to be clocked. To understand the complexities associated with clock distribution, let us consider the BFT architecture.

Figure 6.10: Clock Tree Network for Synchronous BFT Architecture: (a) 16 IPs; (b) 64 IPs; (c) 256 IPs

An H-block clock distribution scheme is used, and the template for 16, 64 and 256 IPs is shown in Figure 6.10. The IP blocks are shown as white squares and the switches as gray squares. The BFT architecture is a 4-ary tree topology with switches connecting four down links and two up links. Each group of four leaf nodes needs one switch. At the next level, half as many switches are required (each set of four switches at any level requires only two switches at the next higher level), and this reduction continues at each succeeding level. A clock signal is needed across all IP blocks and switches, and as the number of IP blocks increases, the complexity of the clock tree increases as well. Maintaining clock tree symmetry and distributing the clock in a synchronous network is therefore a difficult task.

6.3 Power Dissipation

The power dissipation of the on-chip network is defined as

$$P_{total} = P_{switches} + P_{line} + P_{rep} \quad (6.1)$$

$$P_{switches} = N_{sw}\,\alpha_{data}\,C_{l-sw}\,V_{dd}^{2}\,f + N_{sw}P_{sw-l} + N_{sw}P_{sw-s} \quad (6.2)$$

where Pswitches is the total power dissipation of the switches forming the network and αdata is the activity factor of data transfer between two switches. Nsw is the total number of switches in the network, Cl−sw is the load capacitance of the switch, f is the clock frequency and Vdd is the supply voltage. Psw−l is the leakage power of the switch and Psw−s is the internal or short-circuit power of the switch. Pline is the total power dissipation of the interswitch links and Prep is the total power dissipation of the repeaters, which are required for the longer interconnects. The number of repeaters required depends on the length of the interconnects. For a NoC topology, the interswitch wire lengths and the required numbers of repeaters can be calculated a priori. The power consumption of the interswitch links (Pline) and of the repeaters (Prep) is defined as follows:

$$P_{line} = \alpha\,C\,L\,V_{dd}^{2}\,f \quad (6.3)$$

$$P_{rep} = N_{rep}\,\alpha\,h_{opt}\,C_{o}\,V_{dd}^{2}\,f \quad (6.4)$$

where α is the activity factor of the interswitch link, C is the interconnect capacitance per unit length and L is the length of the interswitch links. Prep is the total power consumed by the repeaters inserted to minimize signal delay, Nrep is the total number of repeaters required, hopt is the optimal repeater size, and Co is the input capacitance of a minimum-size repeater. The power consumption of the interswitch links for the data transfer signals (Pdata), control signals (Pcs), clock signal (Pclk) and request/acknowledgment signals (Preq−ack) is defined as follows:

$$P_{data} = \alpha_{data}\left[N_{data}\,C\,L_{data} + N_{rep-data}\,h_{opt}\,C_{o}\right]V_{dd}^{2}\,f \quad (6.5)$$

$$P_{cs} = \alpha_{cs}\left[N_{cs}\,C\,L_{cs} + N_{rep-cs}\,h_{opt}\,C_{o}\right]V_{dd}^{2}\,f \quad (6.6)$$

$$P_{clk} = \alpha_{clk}\left[C\,L_{clk} + N_{rep-clk}\,h_{opt}\,C_{o}\right]V_{dd}^{2}\,f \quad (6.7)$$

$$P_{req-ack} = \alpha_{req-ack}\left[N_{req-ack}\,C\,L_{req-ack} + N_{rep-req-ack}\,h_{opt}\,C_{o}\right]V_{dd}^{2}\,f \quad (6.8)$$

where αcs, αclk and αreq−ack are the activity factors of the control signals (write, full), the clock signal and the request/acknowledgment signals, respectively. Nrep−data, Nrep−cs, Nrep−clk and Nrep−req−ack are the numbers of repeaters required for implementing the data transfer, control, clock and request/acknowledgment signals. In order to provide a fair comparison between the power dissipation of the Asynchronous and Synchronous designs, the switches are assumed to operate locally synchronously at the same frequency (f).

The closed-form equations for the power dissipation of the Asynchronous and Synchronous designs are as follows:

$$P_{Syn} = P_{switches-Syn} + P_{data} + P_{cs-Syn} + P_{clk} \quad (6.9)$$

$$P_{Asyn} = P_{switches-Asyn} + P_{data} + P_{cs-Asyn} + P_{req-ack} \quad (6.10)$$

$$\begin{aligned}
P_{Syn} = \big(&\alpha_{data}\left[C_{l-sw-Syn}N_{sw} + N_{data}\,C\,L_{data} + N_{rep-data}\,h_{opt}\,C_{o}\right] \\
+\,&\alpha_{cs}\left[N_{cs-Syn}\,C\,L_{cs-Syn} + N_{rep-cs-Syn}\,h_{opt}\,C_{o}\right] \\
+\,&\alpha_{clk}\left[C\,L_{clk} + N_{rep-clk}\,h_{opt}\,C_{o} + C_{clk-load}N_{sw}\right]\big)\,V_{dd}^{2}\,f \\
+\,&N_{sw}P_{Syn-sw-l} + N_{sw}P_{Syn-sw-s} \quad (6.11)
\end{aligned}$$

$$\begin{aligned}
P_{Asyn} = \big(&\alpha_{data}\left[C_{l-sw-Asyn}N_{sw} + N_{data}\,C\,L_{data} + N_{rep-data}\,h_{opt}\,C_{o}\right] \\
+\,&\alpha_{cs}\left[N_{cs-Asyn}\,C\,L_{cs-Asyn} + N_{rep-cs-Asyn}\,h_{opt}\,C_{o}\right] \\
+\,&\alpha_{req-ack}\left[N_{req-ack}\,C\,L_{req-ack} + N_{req-ack}N_{rep-req-ack}\,h_{opt}\,C_{o}\right]\big)\,V_{dd}^{2}\,f \\
+\,&N_{sw}P_{Asyn-sw-l} + N_{sw}P_{Asyn-sw-s} \quad (6.12)
\end{aligned}$$

$$\Delta p = P_{Syn} - P_{Asyn} \quad (6.13)$$

∆p is the difference between the power dissipation of the Synchronous design and that of the Asynchronous design. The suffix Syn is used on all symbols/parameters which refer to the Synchronous design and Asyn on all symbols/parameters which refer to the Asynchronous design. Under worst-case data input patterns, αreq−ack is almost equal to αdata. Therefore the relative power dissipation of the Asynchronous and Synchronous designs depends on αcs and αdata. The Asynchronous design is more power efficient when ∆p is greater than zero, which yields the condition on the activity factors given in Equation 6.14.

$$\alpha_{data} > A\,\alpha_{cs} - B\,\alpha_{clk} + C \quad (6.14)$$

where

$$A = \frac{\left(N_{cs-Asyn}\,C\,L_{cs-Asyn} + N_{rep-cs-Asyn}\,h_{opt}\,C_{o}\right) - \left(N_{cs-Syn}\,C\,L_{cs-Syn} + N_{rep-cs-Syn}\,h_{opt}\,C_{o}\right)}{\left(C_{l-Syn-sw}N_{sw}\right) - \left(C_{l-Asyn-sw}N_{sw} + N_{req-ack}\,C\,L_{req-ack} + N_{req-ack}N_{rep-req-ack}\,h_{opt}\,C_{o}\right)} \quad (6.15)$$

$$B = \frac{C\,L_{clk} + N_{rep-clk}\,h_{opt}\,C_{o} + C_{clk-load}N_{sw}}{\left(C_{l-Syn-sw}N_{sw}\right) - \left(C_{l-Asyn-sw}N_{sw} + N_{req-ack}\,C\,L_{req-ack} + N_{req-ack}N_{rep-req-ack}\,h_{opt}\,C_{o}\right)} \quad (6.16)$$

$$C = N_{sw}P_{Asyn-sw-l} + N_{sw}P_{Asyn-sw-s} - N_{sw}P_{Syn-sw-l} - N_{sw}P_{Syn-sw-s} \quad (6.17)$$

We also know that

$$\alpha_{data} \leq \alpha_{clk} \quad (6.18)$$

where αclk is the maximum activity factor of the Synchronous design; therefore, αdata cannot be larger than αclk. Simulation results for this analysis are presented in the next section.
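To show how the closed-form comparison can be exercised, the sketch below evaluates Equations 6.11 through 6.13 for a few values of αdata, with αcs = αdata/64 and αreq−ack approximated by αdata as stated above. Every numeric parameter in the script is an invented placeholder chosen only to make the formulas runnable; it does not reproduce the thesis's extracted 65nm values or its reported crossover point.

```python
# Illustrative evaluation of the closed-form power model (Eqs. 6.11-6.13).
# All parameter values below are placeholders, not the thesis's data.

def p_syn(a_data, a_cs, a_clk, p):
    """Synchronous network power, Eq. (6.11)."""
    dyn = (a_data * (p["Cl_sw_syn"] * p["Nsw"] + p["Ndata"] * p["C"] * p["Ldata"]
                     + p["Nrep_data"] * p["hopt"] * p["Co"])
           + a_cs * (p["Ncs_syn"] * p["C"] * p["Lcs_syn"]
                     + p["Nrep_cs_syn"] * p["hopt"] * p["Co"])
           + a_clk * (p["C"] * p["Lclk"] + p["Nrep_clk"] * p["hopt"] * p["Co"]
                      + p["Cclk_load"] * p["Nsw"]))
    return dyn * p["Vdd"] ** 2 * p["f"] + p["Nsw"] * (p["Psyn_sw_l"] + p["Psyn_sw_s"])

def p_asyn(a_data, a_cs, p):
    """Asynchronous network power, Eq. (6.12); a_req_ack is taken equal to a_data."""
    dyn = (a_data * (p["Cl_sw_asyn"] * p["Nsw"] + p["Ndata"] * p["C"] * p["Ldata"]
                     + p["Nrep_data"] * p["hopt"] * p["Co"])
           + a_cs * (p["Ncs_asyn"] * p["C"] * p["Lcs_asyn"]
                     + p["Nrep_cs_asyn"] * p["hopt"] * p["Co"])
           + a_data * (p["Nreq_ack"] * p["C"] * p["Lreq_ack"]
                       + p["Nreq_ack"] * p["Nrep_req_ack"] * p["hopt"] * p["Co"]))
    return dyn * p["Vdd"] ** 2 * p["f"] + p["Nsw"] * (p["Pasyn_sw_l"] + p["Pasyn_sw_s"])

if __name__ == "__main__":
    # Placeholder parameter set (arbitrary values, illustration only).
    p = dict(Nsw=28, Vdd=1.0, f=200e6, C=0.2e-12, Co=1e-15, hopt=105,
             Cl_sw_syn=6e-12, Cl_sw_asyn=4e-12, Ndata=32, Ldata=400,
             Nrep_data=400, Ncs_syn=2, Ncs_asyn=2, Lcs_syn=400, Lcs_asyn=400,
             Nrep_cs_syn=25, Nrep_cs_asyn=25, Lclk=450, Nrep_clk=418,
             Cclk_load=3e-12, Nreq_ack=2, Lreq_ack=400, Nrep_req_ack=25,
             Psyn_sw_l=1e-4, Psyn_sw_s=1e-4, Pasyn_sw_l=1e-4, Pasyn_sw_s=1e-4)

    a_clk = 1.0                          # Case 1: no clock gating
    for a_data in (0.1, 0.2, 0.4):
        a_cs = a_data / 64.0             # control toggles once per 64-bit message
        dp = p_syn(a_data, a_cs, a_clk, p) - p_asyn(a_data, a_cs, p)
        print(f"alpha_data={a_data:.2f}  delta_p={dp*1e3:+.2f} mW")
```

With these placeholder numbers the printed ∆p shrinks as αdata grows, which mirrors the qualitative trend analyzed in the next section, even though the actual crossover value depends on the real extracted parameters.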

6.4 Simulation Results

The Synchronous and Asynchronous switch designs were implemented using Synopsys EDA tools in TSMC 65nm CMOS technology. Considering a die size of 20mm×20mm, a system of 64 IP blocks, a clock frequency (f) of 200 MHz and a supply voltage (Vdd) of 1.0V, the power dissipation of the Asynchronous and Synchronous architectures is calculated for two different cases.

CASE 1: No clock gating is applied (αclk = 1). The power dissipation of the Asynchronous and Synchronous BFT architectures is shown in Figure 6.11(a). The power dissipation of the Asynchronous architecture is less than that of the Synchronous architecture when αdata > Aαcs − B. If the control signal switches once per message, the worst-case value of αcs can be calculated. The minimum message length in the design is two flits: a header flit and a data flit. Considering the worst-case data input pattern and a flit length of 32 bits, the control signals switch at most once every 64 bits; therefore, αcs is 1/64 of αdata. The power dissipation can then be expressed as a function of αdata. By applying these constraints, the power dissipation of the Asynchronous and Synchronous architectures is shown in Figure 6.11(b). The power dissipation of the Asynchronous architecture is less than that of the Synchronous architecture when αdata is less than 0.4. For αdata equal to 0.2, the power dissipation is decreased by 22.5%.

CASE 2: With the clock gating technique applied in the Synchronous design (αclk = 0.5), the power dissipation of the Asynchronous and Synchronous architectures is shown in Figure 6.12(a). Here αcs and αdata cannot be larger than αclk (0.5). Considering αcs as 1/64 of αdata, the power dissipation of the Asynchronous and Synchronous architectures is shown in Figure 6.12(b). The power dissipation of the Asynchronous architecture is less than that of the Synchronous architecture if αdata is less than 0.2. From this study it can be concluded that, even if clock gating is used, the asynchronous design can still be more power efficient than the Synchronous design. Additionally, an Asynchronous design offers more reliability and integrity, especially when clock skews are of concern in a large SoC design. For the Synchronous design, the length of the interconnect required to implement the clock network using an H-tree is:

$$l_{tot-Syn} = 22.5\,\sqrt{Area} \quad (6.19)$$

Figure 6.11: Power Dissipation in Syn. and Asyn. BFT Architecture: (a) Power Dissipation in Syn. and Asyn. BFT Architecture; (b) Power Dissipation when αcs = 1/64 of αdata

Figure 6.12: Power Dissipation in Syn. and Asyn. BFT Architecture: (a) Power Dissipation when αclk = 0.5; (b) Power Dissipation when αcs = 1/64 of αdata and αclk = 0.5

Table 6.1: Total Metal Resources Required for BFT Architecture

Architecture | Total Length of Clock Network | Total Number of Repeaters | Total Metal Resources (mm) | Reduction (%)
Syn. BFT     | 450 mm                        | 418                       | 5250                       | –
Asyn. BFT    | –                             | 0                         | 4800                       | 9%

where Area is the die size. Using a critical interconnect length of 1.44 mm and an optimal repeater size of 105 [48], the number of repeaters and the metal resources required to implement the Synchronous and Asynchronous architectures are shown in Table 6.1. The total metal resources required to implement the Asynchronous architecture are reduced by 9% as compared to the Synchronous architecture. Using the Synopsys EDA tools, the Asynchronous and Synchronous switches are implemented in VHDL. The Asynchronous design increases the area of the switch by 25% as compared to the Synchronous BFT switch design. Considering αcs as 1/64 of αdata, αdata of 0.2 and αclk = 1, the power consumption of the Asynchronous and Synchronous BFT architecture designs for different numbers of IPs is shown in Table 6.2. For 16 IPs, the Asynchronous technique reduces the power dissipation by 24.68% as compared to the power dissipated by the Synchronous design.
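As a quick check of Equation 6.19, the H-tree clock wire length for the 20mm×20mm die used here works out to 450 mm, which matches the value in Table 6.1; the snippet below reproduces that arithmetic. (The repeater count in the table comes from the thesis's more detailed analysis and is not recomputed here.)

```python
# Quick arithmetic check of Eq. (6.19): H-tree clock network length for the
# synchronous design, l_tot-Syn = 22.5 * sqrt(Area), with a 20 mm x 20 mm die.
import math

die_area_mm2 = 20.0 * 20.0                 # 400 mm^2
l_clk_mm = 22.5 * math.sqrt(die_area_mm2)  # 22.5 * 20 = 450 mm
print(f"H-tree clock wire length: {l_clk_mm:.0f} mm")   # 450 mm, as in Table 6.1
```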

6.5 Comparison

Table 6.2: Power Consumption for BFT Architecture

No. of IPs | No. of Switches | Synchronous Power (mW) | Asynchronous Power (mW) | Reduction (%)
16         | 6               | 48.94                  | 36.86                   | 24.68
64         | 28              | 169.09                 | 130.98                  | 22.54
256        | 124             | 514.39                 | 375.15                  | 27.07

Considering a die size of 20mm×20mm, a 65nm technology node, a system of 64 IPs, a clock frequency (f) of 200 MHz and a supply voltage (Vdd) of 1.0 V, Asynchronous architectures for different NoC topologies (Cliche [14], Octagon [18] and SPIN [17]) are designed and implemented. Using the same analysis as for the BFT architecture, all

the other topologies were evaluated for a similar power efficiency comparison [62]. The

resulting analysis depicting the power dissipation of Asynchronous and Synchronous

architectures for Case 1 and Case 2 is shown in Figure 6.13 and 6.14 respectively.

The power dissipation of the Asynchronous architecture is less than that of the Synchronous architecture when αdata > Aαcs − B. The BFT consumes the minimum power compared to the other NoC topologies. Considering αcs as 1/64 of αdata, the power dissipation of the Asynchronous and Synchronous architectures for Case 1 and Case 2 is shown in Figures 6.13 and 6.14, respectively. According to the values of the activity factors (αdata and αcs), the decision of whether to use the Asynchronous architecture can be made. The Asynchronous Cliche architecture consumes the minimum metal resources, as presented in Table 6.1 and Table 6.6. From Table 6.2, Table 6.3, Table 6.4 and Table 6.5, the BFT architecture requires the minimum number of switches; for this reason, the BFT architecture consumes the minimum area. The Asynchronous BFT architecture consumes the minimum power compared to the other NoC topologies when αdata is 0.2 and αclk = 1, making BFT an attractive power-efficient NoC topology.
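The reduction percentages quoted in Tables 6.2 through 6.5 follow directly from the reported synchronous and asynchronous power values; a short script of the form below reproduces them, using the BFT numbers of Table 6.2 as the example input.

```python
# Recompute the "Reduction (%)" column of Tables 6.2-6.5 from the reported
# synchronous and asynchronous power values; Table 6.2 (BFT) is shown here.

bft = {16: (48.94, 36.86), 64: (169.09, 130.98), 256: (514.39, 375.15)}  # mW

for n_ip, (p_syn, p_asyn) in bft.items():
    reduction = 100.0 * (p_syn - p_asyn) / p_syn
    print(f"{n_ip:>3} IPs: {reduction:.2f}% power reduction")
# 16 IPs: 24.68%, 64 IPs: 22.54%, 256 IPs: 27.07%  (matches Table 6.2)
```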

Figure 6.13: Power Dissipation of Syn. and Asyn. Architectures, αclk = 0.5: (a), (b) CLICHE; (c), (d) Octagon; (e), (f) SPIN

Figure 6.14: Power Dissipation of Syn. and Asyn. Architectures, αclk = 0.5 and αcs = 1/64 of αdata: (a), (b) CLICHE; (c), (d) Octagon; (e), (f) SPIN

Table 6.3: Power Consumption for Cliche Architecture

No. of IPs | No. of Switches | Synchronous Power (mW) | Asynchronous Power (mW) | Reduction (%)
16         | 16              | 56.62                  | 42.67                   | 24.63
64         | 64              | 201.59                 | 145.28                  | 27.93
256        | 256             | 737.91                 | 508.92                  | 31.03

Table 6.4: Power Consumption for Octagon Architecture

No. of IPs | No. of Switches | Synchronous Power (mW) | Asynchronous Power (mW) | Reduction (%)
16         | 16              | 53.73                  | 47.09                   | 12.37
64         | 64              | 185.08                 | 144.60                  | 21.87
256        | 256             | 667.70                 | 488.16                  | 26.82

Table 6.5: Power Consumption for SPIN Architecture

No. of IPs | No. of Switches | Synchronous Power (mW) | Asynchronous Power (mW) | Reduction (%)
16         | 8               | 69.34                  | 53.58                   | 12.62
64         | 48              | 321.23                 | 246.64                  | 9.59
256        | 192             | 1194.36                | 1041.96                 | 2.73

Table 6.6: Total Metal Resources

Architecture  | Total Length of Clock Network | Total Number of Repeaters | Total Metal Resources (mm) | Reduction (%)
Syn. Cliche   | 210.0 mm                      | 162                       | 3010.0                     | –
Asyn. Cliche  | –                             | 0                         | 2800.0                     | 7%
Syn. Octagon  | 132.2 mm                      | 98                        | 2979.1                     | –
Asyn. Octagon | –                             | 0                         | 2846.9                     | 4.44%
Syn. SPIN     | 450.0 mm                      | 418                       | 11650.0                    | –
Asyn. SPIN    | –                             | 0                         | 11200.0                    | 3.86%

6.6 Summary

In this chapter, a power analysis of Synchronous and Asynchronous NoC architectures is presented. The relation between αdata and the efficiency of the Asynchronous design is analyzed. It is shown that the Asynchronous network is more power efficient than the Synchronous architecture for the lower range of αdata (the activity factor of data transfer between two switches). The Asynchronous design can reduce the power consumption of the network if it satisfies the relation αdata > Aαcs − Bαclk. The area of the Asynchronous switch is increased by 25% as compared to the Synchronous switch; however, the power dissipation of the Asynchronous BFT architecture is decreased by 27% in comparison to the Synchronous BFT architecture when αdata is 0.2 and the activity factor of the control signal is 1/64 of αdata. The total metal resources required to implement the Asynchronous design are reduced by 10% for the case of 256 IPs. The Asynchronous BFT consumes the minimum power and area in comparison to the other NoC topologies, and a BFT topology can be even more power efficient when the number of IPs is fairly large. In conclusion, a Globally Asynchronous Locally Synchronous (GALS) system can offer a low power solution for multi-core SoC implementations.

CHAPTER 7

Conclusion and Future Work

With the continuous scaling of CMOS technology into the tens of nanometers, System-on-Chip (SoC) designs have grown in complexity and cost. Chips containing hundreds of heterogeneous IP cores with complex functionalities are now realizable. However, one of the most critical factors for an SoC's economic success in the marketplace still remains its interconnect. The interconnect has a significant impact on SoC costs because it influences four key factors of SoC design: die size, power consumption, design time, and performance. With data bandwidth demands increasing drastically, Networks-on-Chip have been identified as a scalable and efficient routing alternative, promising high communication performance within area and power bounds. NoCs are layered, packet-switched interconnection networks integrated onto a single chip. IPs and switches are connected in such a way that IPs are able to communicate through messages, with the switches routing and buffering packets between IPs. NoC architectures provide the communication infrastructure; in this way it is possible to develop the hardware of the resources independently as stand-alone blocks and create the NoC by connecting the blocks as elements in the network. A NoC is thus a scalable and configurable network that can be adapted to the needs of different workload requirements. In this thesis, many techniques have been investigated and advocated to improve the power and performance of NoC architectures. NoC design problems span the whole SoC spectrum, in all domains and at all levels; the focus of this thesis has thus remained on NoC architectures, NoC network power, and performance analysis and refinement. In the following section, a summary of the thesis is given, followed by a discussion of future work.

7.1 Thesis Summary and Contributions

Following the introduction in the first chapter, which includes a general discussion of current bus-based interconnection schemes and a brief introduction to emerging interconnection technologies such as NoC, Chapter 2 presents a detailed overview of NoC architectures, performance and associated cost. Selecting the network topology is the first step in designing a network because the routing strategy and flow-control method depend heavily on the topology. A topology is selected based on its cost and performance [13]. The cost of a NoC is largely determined by the number and complexity of the routers required to realize the network, and by the density and length of the interconnections between these routers. Performance has two components: bandwidth and latency. Both of these measures are to a large extent determined by factors other than topology, for example flow control, routing strategy, and traffic pattern. In Chapter 2, four main NoC architectures (Cliche, BFT, SPIN and Octagon) are discussed in detail along with their performance and cost models. A detailed description of the basic building blocks, such as routers, links and network interfaces, is presented. NoC switching and routing schemes are presented and their role in the router design is explained. Finally, some high-level architectural parameters for various NoC topologies are presented for static comparison.

In Chapter 3, a detailed design of the switch architecture is presented and evaluated in terms of its cost and power for different NoC topologies. A router (or switch) is one of the main building blocks of a NoC. The router is responsible for making routing decisions and for forwarding incoming packets to the correct outgoing links. The design of the router critically impacts the performance and cost of the whole network in terms of throughput, latency, power and area. A set of router designs was synthesized at two different technology nodes (90nm and 65nm) using the Cadence and Synopsys design environments. The BFT topology consumes the minimum area and power compared to the other NoC topologies, while SPIN is the most power-hungry topology of all. Low power design techniques such as power gating and multi-Vt cells have been applied to reduce the power consumption of the switch in the NoC. The improved switch design reduces the power consumption of the Butterfly Fat Tree (BFT) architecture by 28%. The effect of this technique on different NoC architectures is analyzed; the technique reduces the power consumption of the network by up to 41%.

In Chapter 4, the impact of NoC interconnects on the total power consumed by the network is presented, and the increasing impact of interconnect scaling on NoC performance and power is discussed. The interconnects of any integrated circuit form a complex geometry that introduces capacitive and resistive parasitics. Both of these parasitics have multiple effects on circuit behavior, such as an increase in propagation delay (or, equivalently, a drop in performance) and an impact on the energy dissipation as well as on the power distribution. An improved design methodology to account for interconnect power in the early stages of design and for generating high quality NoC interconnects is presented. In a NoC, the length of the interconnects is directly related to the die size and the number of IPs. The longest interconnects require repeater insertion for bandwidth improvement, and these repeaters add power and area to the cost of the NoC. The SPIN topology consumes the most significant length of metal resources among all the topologies. Width and space optimization to improve bandwidth can increase the wiring area by as much as 74%. The total power due to repeater insertion is increased by 22% in the BFT and by 25% in the SPIN architectures. The increase in power in the Cliche architecture is the minimum, only 11%. Interconnects in the Cliche architecture are the shortest and thus require fewer repeaters for performance improvement, making Cliche the most attractive topology for implementation.

In Chapter 5, an efficient design methodology based on NoC architecture and layout-based power models is presented, giving rough power estimates early in the design cycle. The impact of die size and number of IPs on the power of different NoC architectures is evaluated and presented in a systematic manner. Power-aware design is becoming increasingly important for designs targeted at low power applications. In addition to extra heat removal costs, high power consumption also reduces battery life and poses severe reliability issues related to device degradation at high temperatures. The elevation of power to a first-class design constraint requires that power estimation be done at the same time as the performance studies in the design flow. Power can be estimated at a number of levels with varying degrees of detail. In a NoC, one method to achieve a power-aware design is to consider the impact of the architectural choices on power in the early stages of the design process. In a NoC, power estimation at the architectural level can save ten times as much power as methods applied later in the design flow.

In Chapter 6, a study of the power efficiency of Synchronous versus Asynchronous NoC architectures is presented. Globally Asynchronous Locally Synchronous (GALS) design techniques have been suggested as a potential solution in larger and faster SoC designs to avoid the problems related to synchronization and clock skew. These multi-core SoCs will operate using the GALS paradigm, where each core can operate in a separate clock domain. Asynchronous logic is an important topic due to its interesting features of high performance, low noise and robustness to parameter variations. However, its performance evaluation and optimization are relatively challenging due to the dependencies between concurrent events. To study the effect of synchronous and asynchronous design techniques on the power efficiency of a NoC, an asynchronous switch is implemented at RTL. The relation between αdata and the efficiency of the Asynchronous design is analyzed. The Asynchronous network is more power efficient than the Synchronous architecture for the lower range of αdata (the activity factor of data transfer between two switches). The area of the Asynchronous switch is increased by 25% as compared to the Synchronous switch; however, the power dissipation of the Asynchronous BFT architecture is decreased by 27% in comparison to the Synchronous BFT architecture when αdata is 0.2 and the activity factor of the control signal spanning two switches is 1/64 of αdata. The total metal resources required to implement the Asynchronous design are reduced by 10% for the case of 256 IPs. In conclusion, a Globally Asynchronous Locally Synchronous (GALS) SoC design can take advantage of the additional power savings of the NoC implementation.

7.2 Future Work

The results of this thesis provide a sound foundation for continuing future research on low power NoC design. Current trends in CMOS technology scaling cannot be reliably sustained without addressing power consumption issues and, for virtually all applications, reducing the power consumed by SoCs is essential in order to continue adding performance and features. While the focus of this research has been on the design aspects of NoC, other practical issues, such as delay and robustness to noise in the nanometer regime, are also important areas to investigate.

With technology scaling into the nanometer regime, capacitive coupling becomes a major concern, leading to Signal Integrity (SI) issues such as cross-talk. Crosstalk is the electrical interaction that occurs between two or more long nets. The causes of crosstalk are long parallel nets, coupling capacitance, high frequency switching and driver strengths. If the signals in two neighboring parallel nets change in the same direction then speed-up occurs, and if they change in opposite directions then slow-down occurs. Capacitive coupling not only increases the delay on a wire but can also induce coupling noise on that wire when adjacent wires switch. Unlike fringe capacitance and area capacitance, capacitive coupling is a result of the specific wire interactions found in a design and can only be measured once the design has been routed. Coupling can lead to greater wire delays along a path, excessive power consumption due to increased capacitance and glitches, and even functional failures from coupling-induced noise causing false switching. SI issues resulting from coupling thus lead to two primary failure modes: timing and functional. In nanometer designs it is no longer sufficient just to achieve timing closure; a design must also reach SI closure. SI is normally classified into noise and electromigration, and SI closure implies that the design is free from SI-related functional problems and meets its timing goals. In the pre-nanometer design era, designers either ignored SI effects or analyzed and manually repaired them after achieving timing closure. This approach no longer works for nanometer designs because the number of potential SI violations exceeds what can be managed in a post-route analyze-and-repair methodology. SI closure needs to be managed simultaneously with timing closure and requires a design to undergo a number of concurrent optimization steps that prevent, analyze and repair SI-induced problems. Without a noise prevention strategy, more noise-fixing iterations may be required after physical implementation; otherwise it may even be necessary to respin the design and delay the tape-out schedule. In the near future, the issues and solutions related to Signal Integrity in NoC will be researched.
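One common first-order way to reason about the speed-up and slow-down just described is to scale the coupling capacitance by a switching-dependent Miller factor. The sketch below illustrates that textbook rule of thumb; the factors 0, 1 and 2 and the capacitance values are generic placeholders, not results from this thesis.

```python
# First-order crosstalk illustration: the effective load seen by a victim wire
# is its ground capacitance plus the coupling capacitance scaled by a Miller
# factor that depends on how the aggressor switches. Rule-of-thumb sketch only;
# all capacitance values are placeholders.

def effective_cap(c_ground_ff, c_couple_ff, aggressor):
    miller = {"same_direction": 0.0,   # speed-up: coupling cap effectively vanishes
              "quiet":          1.0,   # nominal case: aggressor not switching
              "opposite":       2.0}   # slow-down: coupling cap counted twice
    return c_ground_ff + miller[aggressor] * c_couple_ff

for case in ("same_direction", "quiet", "opposite"):
    print(case, effective_cap(c_ground_ff=50.0, c_couple_ff=80.0, aggressor=case), "fF")
```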

The other area of interest is NoC simulators. Because NoC technology is still an evolving area, there is presently a shortage of NoC simulators for design and analysis. Although the designer of an interconnection network should have strong intuition regarding the performance of that network, an accurate simulator is still an important tool for verifying this intuition and analyzing specific design tradeoffs. Such simulators can model the network at the flit level and can support research on multiple topologies and routing algorithms; buffering, speedup and the pipeline timing of the routers can also be incorporated. The models developed in Chapters 4 and 5 can be used for this purpose. A simulator can help advance the design in a shorter time and avoid unnecessary iterations.
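As a starting point for such a simulator, the skeleton below sketches a flit-level latency estimate for a mesh with deterministic XY routing and a fixed per-hop router delay. It is only an outline of the idea; the functions, parameters and traffic are all invented for illustration and do not describe an existing tool.

```python
# Skeleton of a flit-level NoC latency model: packets are split into flits,
# each hop costs a fixed router pipeline delay, and deterministic XY routing
# selects the next switch on a mesh. Purely illustrative outline.

def xy_route(src, dst):
    """Deterministic XY routing on a mesh whose nodes are (x, y) tuples."""
    x, y = src
    if x != dst[0]:
        return (x + (1 if dst[0] > x else -1), y)
    return (x, y + (1 if dst[1] > y else -1))

def simulate(packets, hop_delay=4, flits_per_packet=2):
    """Return per-packet latency; `packets` is a list of (src, dst) tuples."""
    latencies = {}
    for pid, (src, dst) in enumerate(packets):
        hops, node = 0, src
        while node != dst:                     # header flit establishes the path
            node = xy_route(node, dst)
            hops += 1
        # body flits follow the established path in a pipelined (wormhole) fashion
        latencies[pid] = hops * hop_delay + (flits_per_packet - 1)
    return latencies

if __name__ == "__main__":
    traffic = [((0, 0), (3, 2)), ((1, 1), (0, 3))]
    print(simulate(traffic))   # {0: 21, 1: 13} with the defaults above
```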

The third area of interest is Built-in-self-test (BIST) for NoC. The complexity of

NoC makes the application of traditional test methods completely obsolete. BIST is a testing technique in which the components of a system are used to test the system

itself. The motivation to use BIST arises particularly from the cost of test pattern generation, from the volume of test data, both of which tend to increase with circuit size, and from the long testing time. When using BIST, it is possible to test the system at working speed. The main functions of BIST are test generation, test application and response verification. BIST techniques can be classified as online BIST, which occurs during normal functional operation, and offline BIST, which is used when the system is not in its normal working mode. BIST is the main concept for testing the cores in SoC designs. Hybrid BIST, containing both hardware and software components, is probably the most promising approach to test the nodes of a NoC [39].

APPENDIX A

NoC Examples

NoC is an active research area for future high performance Multi-core SoC designs.

There have been numerous architectural and theoretical studies on NoCs, including design methodology, topology exploration, quality-of-service (QoS), and low power design. They can be grouped into two categories: academic research and industrial solutions. Industrial research shows complete chip implementations and demonstrations for specific applications, while academic approaches mainly concern design support for next-generation NoC-based products and software development to test NoCs. A few examples from both sides are listed below.

1. Intel’s 80-Core Tera-FLOP Processor The Teraflop processor chip is the

first generation silicon prototype of the tera-scale computing research initiative at

Intel. This program was launched a few years ago to handle tomorrow's advanced

applications, which would need a thousand times more processing capability

than is available in today's giga-scale processors. As shown in Figure A.1,

the chip mainly consists of 80 homogeneous tiles arranged in an 8 × 10 2-D

mesh topology operating at 5 GHz and 1.2 V [63]. Each tile contains a simple

processing engine (PE) connected to a 5-port router. Each port has two 39-

bit unidirectional point-to-point phase-tolerant mesochronous links, one in each direction.

Figure A.1: Intel's 80-Core Teraflop Processor

A detailed specification of the Teraflop processor is given in Table A.1. The total data bandwidth of the router is 80 GB/s (4 bytes × 4 GHz × 5 ports), enabling a bisection bandwidth of 256 GB/s for the network (this arithmetic is reproduced in the short check following Table A.1). A router interface block (RIB) handles message encapsulation between the PE and the router, transferring instructions and data between different tiles [36]. The router has a 5-port fully non-blocking crossbar and uses wormhole switching for messages. Each port or link supports two virtual lanes to avoid deadlock, and each lane has a 16-entry first-in first-out (FIFO) flit buffer. The router uses a 5-stage pipeline with a 2-stage round-robin arbitration scheme.

Table A.1: Intel's 80-Core Tera Scale Processor Specifications

Technology   | 65nm CMOS Process
Interconnect | 1 poly, 8 Metal (Cu)
Transistors  | 100 million
Die Area     | 275 mm²
Tile Area    | 3 mm²
Package      | 1248 pin LGA, 14 layers, 343 signal pins
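The bandwidth figures quoted above follow from simple arithmetic; the snippet below reproduces them under the stated assumptions (4-byte links clocked at 4 GHz, five ports per router, and an 8 × 10 mesh bisected across its eight rows).

```python
# Reproduce the router and bisection bandwidth figures quoted for the 80-core
# Teraflop chip: 4-byte links at 4 GHz, 5 ports per router, 8 x 10 mesh cut
# across its 8-row bisection (one bidirectional link per row crosses the cut).

bytes_per_phit = 4
link_ghz = 4                          # GB/s per unidirectional link = 4 B x 4 GHz
ports = 5
rows_cut = 8                          # links crossing the mesh bisection

link_gbs = bytes_per_phit * link_ghz                 # 16 GB/s per link direction
router_gbs = link_gbs * ports                        # 80 GB/s per router
bisection_gbs = link_gbs * rows_cut * 2              # 256 GB/s (both directions)
print(router_gbs, bisection_gbs)                     # 80 256
```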

2. Intel’s 48-Core Single-Chip Cloud Computer Processor The single-chip

cloud computer (SCC) is a prototype chip multiprocessor with 48 Intel IA-32 architecture cores organized in a 6×4 2-D mesh interconnect [64]. A two-core

cluster is connected to each of the mesh routers in the interconnect fabric and a

total of four memory controllers, two on each side, are attached to two opposite sides of the mesh. Figure A.2 shows some details of the processor. The SCC

Figure A.2: Intel's 48-Core SCC Processor (24 tiles with two IA cores per tile)

architecture supports message-passing-based rather than hardware cache-coherent memory for intercore communication. The SCC supports 8 voltage and 28 frequency domains or islands. The 2-D mesh itself is part of a single voltage and frequency domain; however, clock domain crossing FIFOs are used for clock synchronization at the mesh interface with the core clusters, which may be running at a different frequency. Additionally, the SCC router has significant improvements in implementation over the Teraflop processor chip [36]. It operates at a 2 GHz frequency as compared to 5 GHz in the Teraflop design. A detailed specification of the SCC processor is provided in Table A.2. Some

Table A.2: Intel’s 48-Core Single-Chip Cloud Computer Processor Data

Frequency | Voltage | Power | Aggregate Bandwidth | Performance
3.16 GHz  | 0.95 V  | 62 W  | 1.62 Terabits/s     | 1.01 Teraflops
5.1 GHz   | 1.2 V   | 175 W | 2.61 Terabits/s     | 1.63 Teraflops
5.7 GHz   | 1.35 V  | 265 W | 2.92 Terabits/s     | 1.81 Teraflops

of the other features of this processor's implementation include:

• 8 virtual channels (VCs): 2 VCs reserved for request/response message classes (MCs), 6 performance VCs

• Flit buffer depth: 24 flits, with dedicated storage of 3 flits per VC

• Virtual cut-through switching

• 3-cycle router pipeline plus a link traversal and buffer write stage (total 4-cycle zero-load latency per hop)

• Deterministic XY routing, implements route precomputation

3. Tilera's Tile Gx Processor TILE-Gx, the latest generation processor family

by Tilera, features devices with 16 to 100 identical processor cores (tiles) interconnected with Tilera's iMesh on-chip network. Each of the cores is a full 64-bit

computing engine with a 64-entry register file.

Figure A.3: Tilera's Multicore Processor

The processors include rich DSP and SIMD extensions, enabling both general-

purpose and multimedia processing in the same device. Each tile consists of a

complete, full-featured processor as well as L1 and L2 cache and a non-blocking

switch that connects the tiles into the mesh. Each tile can independently run a full operating system, or a group of multiple tiles can run a multiprocessing OS, like SMP Linux. The TILE-Gx family processor slashes board real es-

tate requirements and system costs by integrating a complete set of memory

and I/O controllers, eliminating the need for an external north bridge or south

bridge. TileDirect technology provides coherent I/O directly into the tile caches

to deliver ultimate low-latency packet processing performance. Tilera’s DDC

(Dynamic Distributed Cache) system for fully coherent cache across the tile ar-

ray enables scalable performance for threaded and shared memory applications.

The TILE-Gx processors are programmed in ANSI standard C and C++, en-

abling developers to leverage their existing software investment. Tiles can be

grouped in clusters to apply the appropriate amount of horsepower to each ap-

plication. Since multiple, virtualized operating system instances can be run on

the TILE-Gx simultaneously, it can replace multiple CPU and DSP subsystems

for both the data plane and the control plane. A detailed specification of the Tilera processor family is provided in Table A.3.

Table A.3: Tilera's Multicore Processor Data

Tilera Processors     | Tile Gx3036 | Tile Gx3064 | Tile Gx3100
General Purpose Cores | 36          | 64          | 100
mPipe Throughput      | 30 Mpps     | 30 Mpps     | 30 Mpps
USB Ports             | 2           | 3           | 3
Total Cache           | 12 Mbytes   | 20 Mbytes   | 32 Mbytes
Max DDR3 Speed        | 1,886       | 2,133       | 2,133

4. IBM’s Blue Gene/Q Compute Chip The Blue Gene/Q is the third gener-

ation of IBM Blue Gene line of massively parallel systems. The

aim of the Blue Gene platform remains the same, namely to build a massively

parallel high performance computing (HPC) system out of highly power-efficient

processor chips. As shown in Figure A.4, the heart of a Blue Gene/Q system

is its Compute chip, implemented as a System-on-a-Chip (SoC) design. It combines processors, memory hierarchy and network communications on a single ASIC. Integrating these functions on a single chip reduces the number of chip-to-chip interfaces, thereby reducing power while increasing performance, reliability and bandwidth. The Blue Gene/Q Compute (BQC) chip is a 19 x

Figure A.4: The Blue Gene/Q SoC integrates 18 homogeneous cores

19 mm chip in IBM's Cu-45 (45nm SOI) technology. The chip functionally contains 18 identical processor cores: 16 for user applications, 1 for operating system services, and 1 redundant core as extra insurance against yield fallout in this complex SoC. The processor core is mostly the same as IBM's A2 processor core. The cores run the 64-bit Power instruction set architecture and are operated at 1.6 GHz with a 0.8 volt supply, though the design is capable of operation at 2.4 GHz. The

design team scaled back voltage and clock frequency in order to reduce active

power consumption and leakage. Each processor core on the Blue Gene/Q has

a dedicated quad FPU (floating point unit), a 4-wide double precision SIMD

(single instruction multiple data) architecture which can support 8 concurrent

operations. The processors share a central 32MB DRAM L2 cache, which is

unique in supporting transactional memory, speculative execution, and atomic

operations. In addition, each processor core interfaces, via a sophisticated L1-

prefetching unit and a crossbar switch, to a 32 MB central L2 cache, which

uses embedded DRAM for data storage. The L2 cache allows for the storage of

multiple data versions per address. The BQC on-chip networking logic supports

10 bidirectional 2GB/s links to neighboring chips, allowing the chips to be in-

terconnected into a high-bandwidth, low-latency 5-D torus network, as well as

providing for an additional IO link. Peak performance for the chip was specified

at 204.8 GFLOPS with 55 W power dissipation. Blue Gene/Q is currently under development by IBM and is not yet generally available.

5. KAIST BONE Series For the unique purpose of realizing the new NoC tech-

nology through implementation, the BONE (Basic On-Chip Network) project was launched in 2002 at KAIST (Korea Advanced Institute of Science and Technology, Daejeon, Korea). As shown in Figure A.5, through this project different

NoC techniques have been published and demonstrated since then.

BONE-1: Prototype of On-Chip Network (PROTON) To demonstrate

feasibility of the on-chip network (OCN) architecture, a test chip using 0.38 um

technology was developed by the KAIST group. BONE-1 is designed with two

physical layer features: high speed (800 MHz) mesochronous communication,

Figure A.5: BONE Evolution

and on-chip serialization. It makes use of 4:1 serialization: 80-bit packets are transferred through 20-bit links. The 4:1 serialization reduces the network area of BONE-1 by 57 percent, which makes it suitable for SoC design. The distributed NoC blocks are not globally synchronized, and the packet transfer is performed with mesochronous communication. Because mesochronous communication eliminates the burden of global synchronization, high speed clocking at 800 MHz is possible. The implementation and successful measurement demonstrate that high-performance on-chip serialized networking with mesochronous communication is practically feasible. Table 3 shows some of the specifications of this chip.

BONE-2: Low-Power NoC and Network-in-Package (SLIM Spider) As mentioned previously, in large scale SoCs the power consumption of the communication infrastructure should be minimized for reliable, feasible, and cost-efficient chip implementations. In the BONE-2 project at KAIST, a hierarchically star-connected NoC is designed and implemented with various low-power techniques.

The fabricated chip contains heterogeneous IPs such as two RISC processors, multiple memory arrays, FPGA, off-chip network interfaces, and 1.6 GHz PLL.

The integrated OCN provides 89.6 Gbps aggregate bandwidth and consumes mW at full traffic conditions. On the other hand, the previous project, BONE-1, consumes 264 mW with 51 Gbps bandwidth. The ratio of power consumption to bandwidth is 10x less than that of BONE-1.

BONE-3: Flexible On-Chip Network (FONE) BONE-3 utilizes the wavefront-train (WAFT) scheme for high speed serialization. To stabilize the WAFT operation against power supply voltage variation, adaptive reference-voltage generation according to the supply voltage variation is realized. FONE is an FPGA-based NoC evaluation board implementation. The emulation platform offers opportunities to test a sufficient range of choices of NoC design parameters as well as IPs for various applications with a very fast execution time. The NoC evaluation board is implemented on three Altera Stratix EP160 series FPGAs to explore and evaluate a wide range of NoC solutions. The implemented system has various IPs: two masters (RISC CPUs and LCD controller) and four slaves (3-D graphics processor, SRAM, Flash and UART). The integrated NoC uses the OCS technique to reduce the network area significantly.

BONE-4: Reconfigurable OCN BONE-4 also makes use of a reconfigurable network module on chip. More specifically, it has programmable arbitration and reconfigurable switches. It offers low latency and the design is fully synthesizable.

6. Flexible Architecture of Unified System for Telecom (FAUST) FAUST was launched by eleven European industrial research institutes and universities

as a joint project named 4-MORE: 4G Multi-carrier CDMA multiple antenna

System-on-Chip for Radio Enhancements. Recently their architecture has been

published for the application of multicarrier OFDM-based baseband processing,

such as 802.11n, 802.16e, and 3GPP/LTE. They proposed the Asynchronous

Network on Chip (ANoC) with GALS (Globally Asynchronous and Locally

Synchronous) design strategy [65], [66]. Figure A.6 shows the FAUST chip architecture. The ANoC architecture uses virtual channels to provide low latency

Figure A.6: FAUST Chip Architecture

and QoS, which is implemented in quasi delay-insensitive asynchronous logic.

The FAUST chip integrates 20 asynchronous NoC routers, 23 synchronous units

including an ARM946 core, embedded memories, various IP blocks, reconfigurable datapath engines, and one clock management unit to generate the 24 distinct unit clocks. A real-time 100 Mbps SISO OFDM transceiver needs a bandwidth of 10 Gbps, which corresponds to a 10 percent network load. The 20-node NoC represents about 15 percent of the overall area, and the average complexity of the 23 connected IPs is close to 300K gates (including RAM). The specifications of the FAUST project are listed in Table 4.

APPENDIX B

Abbreviation

AMBA : Advanced Microcontroller Bus Architecture

ANoC : Asynchronous Network-on-Chip

ASIC : Application Specific Integrated Circuit

BFT : Butterfly Fat Tree

CLICHE : Chip Level Integration of Heterogeneous Elements

CMOS : Complementary Metal Oxide Semiconductor

DC : Design Compiler

EDA : Electronic Design Automation

FIFO : First in, first out

Flit : Flow Control Unit

FPA : Fixed Priority Arbiter

GALS : Globally Asynchronous and Locally Synchronous

HDU : Header Decoder Unit

HT : High Throughput

IP : Intellectual Property

LC : Link Controller

LRS : Least Recent Served

MPSoC : Multi Processor System-on-Chip

NDS : Nanometer Design Space

NoC : Network on Chip

NIC : Network Interface Controller

OCP : Open Core Protocol

PE : Processing Element

PHIT : Physical Unit

PT : Prime Time

QoS : Quality of Service

RTL : Register Transfer Level

SAF : Store and Forward

SAIF : Switching Activity Interchange Format

SDF : Standard Delay Format

SoC : System-on-Chip

SPIN : Scalable Programmable Integrated Network

ULSI : Ultra Large Scale Integration

VC : Virtual Channels

VCT : Virtual Cut Through

VLSI : Very Large Scale Integration

VPA : Variable Priority Arbiter

WH : Wormhole Switching

BIBLIOGRAPHY

[1] Sillicore Open Cores and ORSoC. “WISHBONE System-on-Chip (SoC) Inter- connection Architecture for Portable IP Cores”. Wishbone B4, Open Cores, pages 2–128, 2010.

[2] S. Augarten. “State of the art: A photographic history of the integrated circuit”. “Ticknor & Fields”, 1983.

[3] Open Cores and Sillicore. Wishbone system-on-chip (soc) interconnection archi- tecture for portable ip cores. Revision: B.3, opencores.org, pages 7–92, 2002.

[4] Semiconductor Industry Association. “The International Technology Roadmap for Semiconductors: 2009”, pages 3–10, 2009.

[5] Semiconductor Industry Association. “The International Technology Roadmap for Semiconductors”, 2007.

[6] K. Chandershaker. “Performance Validation of Networks on Chip”. MS. Thesis: Delft University of Technology), pages 105–112, 2009.

[7] M. Mitic and M. Stojcev. “An Overview of On-Chip Buses”. “Journal of Elec- trical Energy”, 19(3):405–428, December 2006. <5>

[8] ARM Inc. “ARM AMBA Specifications V2.0”. Arm Online Document Available at http://www.arm.com, 1999. <5>

[9] ARM Inc. “ARM Multi-Layer AHB Overview”. Arm Online Document Available at http://www.arm.com. <5>

[10] S. Pasricha and N. Dutt. “On-Chip Communication Architectures - System On Chip Interconnect”. Morgan Kaufmann Publishers, First edition, 2008. <20, 25, 33, 38, 92>

[11] P. Pande A. Jaantsch E. Saliminen U. Ogras C. Grecu, A. Ivanov and R. Mar- culescu. “Towards Open Network-on-Chip Benchmarks”. Proceedings of the IEEE First International Symposium on Network-on-Chips (NOCS), page 205, 2007. <21>

164 [12] C. Agarwal, A. Iskander and R. Shankar. “Survey of Network on Chip (NoC) Architectures and Contributions”. Journal of Engineering, Computing and Ar- chitecture, 2009. <23>

[13] W. Dally and B. Towels. “Principles and Practices of Interconnection Networks”. Morgan Kaufmann Publishers, First edition, 2007. <24, 143>

[14] J. Soininen M. Forsell M. Millberg J. Oberg K. Tiensyrja S. Kumar, A. Jantsch and A. Hemani. “A Network-on-Chip Architecture and Design Methodology”. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI, pages 105–112, 2002. <27, 35, 135>

[15] W. Dally and B. Towels. “Route Packets, Not Wires: On-Chip Interconnection Networks”. Proceedings of the Design Automation Conference, 38:684–689, June 2001. <28>

[16] A. Ivanov P. Pande, C. Grecu and R. Saleh. “Design of a Switch for Network- on-Chip Applications”. Proceedings of the IEEE International Symposium on Circuits and Systems, pages 217–220, May 2003. <29, 85>

[17] P. Guerrier and A. Greiner. “A Generic Architecture for On-Chip Packet Switched Interconnections”. Proceedings of the Design Automation and Test in Europe Conference and Exhibition, pages 250–256, March 2000. <30, 107, 135>

[18] A. Nguyen F. Karim and S. Dey. “An Interconnect Architecture For Networking Systems on Chips”. IEEE Micro, 22(5):36 – 45, 2002. <30, 109, 135>

[19] I. Cidon R. Ginosar E. Bolotin, A. Morgenshtein and A. Kolodny. “Automatic Hardware-Efficient SoC Integration by QoS Network on Chip”. IEEE Interna- tional conference on Circuits and Systems(ICECS), pages 479–482, 2004. <32>

[20] D. Bertozzi and L. Benini. “Xpipes: A Network-on-Chip Architecture for Gi- gascale Systems-on-Chip”. IEEE Circuits and Systems Magazine, 4(2):18 – 31, 2004. <32>

[21] C. Zeferino and A. Susin. “SoCIN: A Parametric and Scalable Network-on- Chip”. IEEE International conference on Integrated Circuits and Systems De- sign(SBCCI), pages 169–174, 2003. <32>

[22] G. Rauwerda P. Wolkotte, G. Smit and L. Smit. “An Energy Efficient Reconfig- urable Circuit Switched Network-on-Chip”. In Preceedings of the 19th IEEE In- ternational Parallel and Distributed Processing Symposium(IPDPS), 2005. <34>

[23] G. Micheli and L. Benini. “Networks on Chip: Technology and Tools (Systems on Silicon)”. Morgan Kaufmann Publishers, First edition, 2006. <37>

165 [24] S. Carta L. Raffo D. Bertozzi S. Stergiou, F. Angiolini and G. DeMicheli. “xPipes Lite: A Synthesis Oriented Design Library for Networks on Chips”. In Preceed- ings of the IEEE Design Automation and Test in Europe Conference.(DATE)), pages 1188–1193, 2005. <38>

[25] A. Singh N. Borkar Y. Hoskote, S. Vangal and S. Borkar. “A 5-GHz Mesh Interconnect for a Teraflop Processor”. IEEE Micro, 27(5):51–61, 2007. <38>

[26] M. Jones A. Ivanov P. Pande, C. Grecu and R. Saleh. “Performance Evalu- ation and Design trade-Offs for Network-on-Chip Interconnect Architectures”. Proceedings of the IEEE Transactions on Computers), 54(8):1025–1040, August 2005. <41, 42>

[27] J. Culler, D. Singh and A. Gupta. “Parallel : A Hard- ware/Software Approach”. Morgan Kaufmann Publishers Inc., First edition, 1999. <44>

[28] O. Gangwal S. Pestana A. Radulescu K. Goossens, J. Dielissen and E. Rijpkema. “A Design Flow for Application-Specific Networks on Chip with Guaranteed Performance to Accelerate SoC Design and Verification”. Proceedings of the IEEE conference on Design, Automation and Test in Europe (DATE), pages 1182–1187, 2005. <45>

[29] S. Murali R. Tamhankar S. Stergiou L. Benini D. Bertozzi, A. Jalabert and G. DeMicheli. “NoC Synthesis Flow for Customized Domain Specific Multipro- cessor Systems-on-Chip”. Proceedings of the IEEE Transaction on Parallel and Distributed Systems, 16(2):113–129, 2005. <45>

[30] J. Soininen M. Forsell M. Millberg J. Oberg K. Tiensyrja S. Kumar, A. Jantsch and A. Hemani. “A Network on Chip Architecture and Design Methodol- ogy”. Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages 105–112, 2002. <45>

[31] R. Mullins S. Moore A. Banerjee, P. Wolkotte and G. Smit. “An Energy and Performance Exploration of Network-on-Chip Architectures”. IEEE Transaction on Very Large Integartion (VLSI) Systems, 17(3):319–329, March 2009. <48>

[32] S. Lee K. Lee and H. Yoo. “Low Power Network-on-Chip for High Performance SoC Design”. Proceedings of the IEEE Transcations on Very large Scale Integra- tion (VLSI) Systems), 14(2):148–160, February 2006. <49>

[33] S. Lee and N. Bagherzadeh. “A High Level Power Model for Network-on-Chip (NoC) Router”. An International Journal on Computers and Electrical Engi- neering by Elsevier, 35(6):837–845, November 2009. <49, 103>

166 [34] P. Veanki N. Banerjee and K. S. Chatha. “A Power and Performance Model for Network-on-Chip Architectures”. Proceedings of the IEEE Design, Automation and Test Conference in Europe (DATE)), ’2. <49>

[35] Neil H. Weste and David M. Harris. CMOS VLSI DESIGN: A Circuits and Systems Perspective. Pearson Higher Education, Fourth edition, 2011. <53, 55, 91, 101>

[36] J. Flich and D. Bertozzi. “Designing Network-on-Chip Architectures in the Nanoscale Era”. CRC Press: A Chapman & Hall Book, First edition, 2011. <55, 151, 153>

[37] J. Henkel and S. Parameswaran. “Designing Embedded Processors: A Low Power Perspective”. Springer, First edition, 2004. <59>

[38] A. Devgan A. Ramalingam, B. Zhang and D. Pan. “Sleep Transistor Sizing Using Timing Criticality and Temporal Currents”. Proceedings of the IEEE Design and Automation (DAC) Conference, 2:1094–1097, January 2005. <67>

[39] A. Jantsch and H. Tenhunen. “Networks on Chip”. Kluwer Academic Publishers, First edition, 2003. <83, 149>

[40] K. Chang. “Reliable Network-on-Chip Design for Multi-Core System-on-Chip”. The Journal of Supercomputing, 55(1):86–102, 2011. <84>

[41] N. Pazos Y. Leblebici S. Murali D. Atienza I. Hartirnaz, S. Badel and G. DeMicheli. “Early Wire Characterization for Predictable Network-on-Chip Global Interconnects”. Proceedings of the 2007 International Workshop on Sys- tem Level Interconnect Prediction (SLIP’07), pages 57–64, 2007. <84>

[42] M. Carta L. Raffo F. Angiolini, P. Meloni and L. Benini. “A Layout-Aware Analysis of Network-on-Chip and Tradional Interconnects for MPSoCs”. IEEE Transaction on Computer-Aided Design of Integrated Circuits and Systems, 26 (3):421–434, March 2007. <84>

[43] A. Ivanov C. Grecu, P. Pande and R. Saleh. “Timing Analysis of Network on Chip Architectures for MP-SoC Platforms”. Elsevier Microelectronics Journal, (36):833 – 845, 2005. <86, 93>

[44] A. Balakrishnan and A. Naeemi. “Optimal Global Interconnects for Network- on-Chip in Many Core Architectures”. Proceedings of the IEEE Electron Device Letters), 31(4), April 2010. <87>

[45] K. Banerjee and A. Mehrotra. “A Power-Optimal Repeater Insertion Method- ology for Global Interconnects in Nanometer Designs”. IEEE Transaction on Electronic Devices, 49(11):2001–2007, November 2002. <93>

167 [46] A. Kolodny I. Cidon and R. Ginosar. “Low-leakage Repeaters for NoC Inter- connects”. Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 1:600–603, 2005. <93>

[47] K. Daewook and G.E. Sobelman. “Network-on-Chip Link Analysis under Power and Performance Constraints”. Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), May 2006. <93>

[48] H.-F. Huang X.-C. Li, J.-F. Mao and Y. Liu. “Global Interconnect Width and Spacing Optimization for Latency, Bandwidth and Power Dissipation”. IEEE Transactions on Electron Devices, 52(10):2272–2279, October 2005. <93, 95, 113, 135>

[49] U. Weiser N. Magen, A. Kolodny and N. Shamir. “Interconnect Power Dissipa- tion in a Microprocessor”. Proceedings of the International Workshop on System Level Interconnect Prediction (SLIP’04), pages 7–13, 2004. <95>

[50] G. Reehal and M. Ismail. “Layout-Aware High Performance Interconnects for Network-on-Chip Design in Deep Nanometer Technologies”. IEEE 6th Interna- tional Design and Test Workshop (IDT), pages 58–61, 2011. <97>

[51] W. EL-Kharashi H. Elmiligi, A. Morgan and F. Gebali. “Power Optimization for Application-Specific Network-on-Chips: A Topology-based Approach”. Jour- nalof Microprocessors & Microsystems, 33(5-6):345–355, August. 2009. <104>

[52] J. Howard-S. Dighe N. Borkar S. Vangal, A. Singh and A. Alvandpour. “A 5.1 GHz 0.34mm Router for Network-on-Chip Applications”. IEEE Symposium on VLSI Circuits Digest of Technical Papers, pages 42–43, 2007. <114>

[53] A. Valentian Y. Thonnart, E. Beigne and P. Vivet. “Power Reduction of Asyn- chronous Logic Circuits using Activity Detection”. IEEE Transaction on Very Large Scale Integration Systems, 17(7):893–906, July 2009. <118>

[54] R. Ginosar R. Dobkin and C. P. Sotiriou. “High Rate Data Synchronization in GALS SoCs”. Proceedings of the IEEE Transactions on Very Large Scale Integration Systems), 14(10), October 2006. <118>

[55] M. Greenstreet B. Quinton and S. Wilton. “Practical Asynchronous Interconnect Network Design”. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(5):579–588, May 2008. <122>

[56] M. Kim D. Kim and G.E. Sobelman. “Asynchronous FIFO Interfaces for GALS On-Chip Switched Networks”. Proceedings of the IEEE Symposium on System- on-Chip (SoC’05)), pages 186–189, 2005. <122>

168 [57] T. Chelsea and S. Nowick. “Robust Interfaces for Mixed-Timing Systems”. IEEE Transaction on Very Large Scale Integration Systems, 12(8):857–873, August 2004. <123>

[58] M. Theobald-N. Jha B. Lin R. Fuhrr, S. Nowick and L. Plana. “MINIMALIST: An Environment for Synthesis, Verification and Testability of Burst-Mode Asynchronous Machines”. Columbia University, Dept. of Computer Science, New York, CUCS-020-99, 1999. <124>

[59] R. Fuhrer and S. Nowick. “Sequential Optimization of Asynchronous and Syn- chronous Finite-State Machines: Algorithms and Tools”. Norwell, MA: Kulewer, CUCS-020-99, 2001. <124>

[60] S. Nowick and D. Dill. “Synthesis of Asynchronous State Machines using a Lo- cal Clock”. Proceedings of IEEE International Conference on Computer Design (ICCD’91), October:192–197, 1991. <124>

[61] S. Nowick. “Automatic Synthesis of Burst-Mode Asynchronous Controllers”. Proceedings of IEEE International Conference on Computer Design (ICCD’91), Stanford University Tech. Report(CSL-TR-95-686), 1993. <124>

[62] D. Korzec M. El-Ghany, G. Reehal and M. Ismail. “Power Analysis for Asyn- chronous Cliche Network-on-Chip”. Proceedings of the IEEE System-on-Chip Conference (SOCC), pages 499–504, 2010. <136>

[63] G. Ruhl S. Dighe H. Wilson J. Tschanz D. Finan A. Singh T. Jacob S. Jain V. Erraguntla. C. Roberts Y. Hoskote N. Borkar S. Vangal, J. Howard and S. Borkar. “An 80-Tile Sub-100-W TeraFLOPS Processor in 65nm CMOS”. Proceedings of the IEEE Journal of Solid State Circuits), 43(1):29–41, January 2008. <150>

[64] Y. Hoskote S. Vangal S. Finnan G. Ruhl D. Jenkins H. Wilson N. Borkar J. Howard, S. Dighe. “A 48-core IS-32 Message Passing Processor with DVFS in 45nm CMOS”. Proceedings of the IEEE International Solid State Circuit Conference, pages 58–59, 2010. <152>

[65] F. Clemidy E. Beigne C. Bernard Y. Durand J. Durupt P. Vivet, D. Lattard and D. Varreau. “A Telecom Baseband Circuit Based on an Asynchronous Network- on-Chip”. IEEE International Solid State Circuits Conference (ISSCC). Digest of Technical Papers, pages 258–601, 2007. <160>

[66] P. Vivet I. Miro-Pandes, F. Clermidy and A. Greiner. “Physical Implementation of the DSPIN Network-on-Chip in the FAUST Architecture”. ACM/IEEE Inter- national Symposium on Network-on-Chip (NoCs), pages 139–148, 2008. <160>
