HARDWARE ACCELERATORS FOR VLSI GLOBAL ROUTING

A Thesis

Presented to The Faculty of Graduate Studies

of

The University of Guelph

by

MAHDI ELGHAZALI

In partial fulfilment of requirements

for the degree of

Master of Science January, 2009

© Mahdi Elghazali, 2009

ABSTRACT

HARDWARE ACCELERATORS FOR VLSI GLOBAL ROUTING

Mahdi Elghazali
University of Guelph, 2009

Advisor: Dr. Shawki Areibi

This thesis investigates three different approaches to enhance the performance of the global routing step in the physical design process. The first approach is based on a hardware/software co-design strategy, while the second is a custom hardware implementation using Handel-C [1]. An application-specific instruction-set implementation is also developed and investigated; this approach targets the Tensilica configurable processor. The experimental results show that the three approaches produce solutions of the same quality as the pure-software implementation. However, the co-design approach achieves an average speedup of 4.3x over the pure-software approach, while the custom hardware approach achieves an average speedup of 3.9x. The configurable-processor approach achieves an average speedup of 33.6x over the pure software, corresponding to speedups of 7.81x and 8.61x over the hardware/software co-design and the custom hardware implementations, respectively.

I hereby declare that I am the sole author of this thesis.

I authorize the University of Guelph to lend this thesis to other institutions or individuals for the purpose of scholarly research.

I further authorize the University of Guelph to reproduce this thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

The University of Guelph requires the signatures of all persons using or photocopying this thesis. Please sign below, and give address and date.

Acknowledgments

I would like to take this opportunity to express my sincere appreciation and thanks to my supervisor, Professor Shawki Areibi, for his great guidance and assistance, and for the help he provided throughout this Master's program. Many thanks to Professor Radu Muresan and Professor Gary Grewal for reviewing this thesis. I would also like to thank Adam Erb and Jon Spenceley for their help in this work.

I want to especially thank my father, my mother, my brothers and sister for their continuous encouragement and support.

And finally, many thanks to all my friends. Special thanks to Ahmed Saghaier and Ahmed Elhossini, I really enjoyed the time we spent together. Thanks to all the people who helped me by any means.

To my family for their support and encouragement.

Contents

1 Introduction 1

1.1 Motivation 2

1.2 Overall Methodology 4

1.3 Contributions 5

1.4 Thesis Organization 6

2 Background 7

2.1 VLSI Design Process 8

2.1.1 VLSI Physical Design Automation 9

2.2 Global Routing 11

2.2.1 Routing Problem Definition 12

2.2.2 A Classification of Global Routing Algorithms 12

2.3 Maze Routing Algorithms 13

2.3.1 Lee's Algorithm 14

2.3.2 Limitations of Lee's Algorithm for Large Circuits 14

2.3.3 Reducing the Running Time 15

2.4 Reconfigurable Computing Systems 17

2.4.1 Hardware/Software Co-design in RCS 17

2.4.2 Field-Programmable Gate Arrays (FPGAs) 19

2.5 Application Specific Instruction-set Processors 22

2.5.1 Tensilica Configurable Processors 22

2.6 Benchmarks 23

2.7 Summary 25

3 Literature Review 26

3.1 Placement Based Hardware Accelerators 28

3.2 Accelerators for FPGA Routers 31

3.2.1 Distributed Workstations 31

3.2.2 Pure Hardware Accelerators 32

3.3 Accelerators for ASIC Routers 34

3.3.1 General Purpose Processors 34

3.3.2 ASIC-Based Implementations 37

3.3.3 FPGA-Based Implementations 41

3.4 Summary 44

4 Hardware/Software Co-design 46

4.1 Methodology 46

4.2 Design Flow of Lee's Algorithm 48

4.3 A Pure-software Based Implementation 49

4.3.1 Implementation on a MicroBlaze System 49

4.3.2 Major Software Functions 51

4.3.3 Multi-Terminal Nets Routing 57

4.3.4 Profiling 58

4.3.5 Framing Technique 58

4.4 A Hardware/Software Co-Design Implementation 59

4.4.1 Fast Simplex Link (FSL) Bus 61

4.4.2 The Hardware Accelerator Module 63

4.5 Results 67

4.5.1 FPGA Usage 68

4.5.2 Speedup 68

4.6 Summary 71

5 A Handel-C Custom RTL Implementation 72

5.1 DK Design Flow 73

5.2 Design Constraints 74

5.3 Design Details 74

5.3.1 Parallelizing Lee's Algorithm 74

5.3.2 Input/Output Data 77

5.4 The Custom Hardware vs. The MicroBlaze Based Implementations 77

5.4.1 Speedup 77

5.4.2 FPGA Usage 79

5.5 Summary 80

6 Configurable Processors Implementation 81

6.1 Tensilica Configurable Processors 82

6.1.1 Xtensa Processors 82

6.1.2 Design Flow 83

6.2 Design Details 84

6.2.1 Design Environment and Overall Architecture 85

6.2.2 Profiling 86

6.3 Results 87

6.3.1 Speed and Area 87

6.4 Overall Comparison 88

6.4.1 Speedup 88

6.4.2 Area 90

6.5 Summary 90

7 Conclusions 92

7.1 Future Work 93

Bibliography 95

A Glossary 100

B AMIRIX AP1000 FPGA PCI Development Board 102

C RC10 104

D The Netlist and the Placement Files 106

D.1 The Netlist File 106

D.2 The Placement File 106

List of Tables

2.1 Benchmarks 24

3.1 Comparison between the three placement architectures 31

3.2 PE Commands 44

4.1 The Profiling Results 59

4.2 The FPGA Usage 68

4.3 The Consumed Clock Cycles and the Maximum Operating Frequency 69

4.4 The Obtained Speed up over pure-software 70

5.1 The Consumed Clock Cycles 78

5.2 The Actual Execution Time of the Three Implementations in Milliseconds 78

5.3 The FPGA Usage 80

6.1 Xtensa Processor Configuration Detail 85

6.2 The Profiling Results 86

6.3 The Consumed Clock Cycles and the Speed up Obtained over the Pure ISA Processor 87

6.4 The Consumed Clock Cycles 89

6.5 The Actual Execution Time of the Three Approaches in Milliseconds 89

6.6 The Speed up obtained by the Tensilica Approach over the H/S and Handel-C Approaches 90

List of Figures

1.1 Interconnect and Gate Delay 3

1.2 The Overall Design Methodology 5

2.1 The VLSI Design Process 8

2.2 VLSI Physical Design Cycle 9

2.3 An Illustration of General Routing 12

2.4 The Classification of the Global Routing Algorithms 13

2.5 Lee's Algorithm: (a) The Wave Propagation Phase (b) The Retrace Phase (c) The Clean up Phase 15

2.6 Schemes to Reduce the Running Time of Lee's Algorithm. (a) Starting point selection. (b) Double fan-out. (c) Framing 16

2.7 A General FPGA Structure 20

2.8 A General Configurable Logic Block[2] 20

2.9 The Different Computing Approaches 24

3.1 Hardware Accelerators for CAD 27

3.2 The model of the partially reconfigurable dynamic system [3] 28

3.3 The Serial Architecture [3] 29

3.4 The Parallel Architecture [3] 30

3.5 The Serial Parallel Architecture [3] 30

3.6 HSRA T-Switch with Path-Search OR [4] 33

3.7 Maze Router General Architecture and Pipelined Processors 35

3.8 Basic structure of the wavefront machine 38

3.9 Block diagram of a single PE 40

3.10 L3 General Organization [5] 42

3.11 L4 Architecture [6] 43

4.1 The Design Methodology 47

4.2 The Flow Chart of Lee's Algorithm 48

4.3 The MicroBlaze System for the Pure-software Based Implementation 50

4.4 Assign the Source and the Target 51

4.5 The Wave Propagation Function 53

4.6 Retrace and Clean up Function 54

4.7 The Rip up Function 56

4.8 The Rip up Function Steps 57

4.9 Multi-Terminal Nets Routing 58

4.10 The Wave Propagation Function with Framing Technique 60

4.11 Framing 1 Technique 61

4.12 The MicroBlaze System for the Hardware/Software Co-design 62

4.13 FSL Block Diagram [7] 62

4.14 The Grid Partitioning 63

4.15 The Architecture of the Hardware Accelerator 64

4.16 The Architecture of the First Column in the Grid 65

4.17 FSM Operation Procedure 67

4.18 The Obtained Speed up 70

5.1 The DK Design Flow 73

5.2 Variable Assignment (a) 75

5.3 Variable Assignment (b) 75

6.1 Xtensa Design Flow [8] 84

B.1 Amirix AP1000 FPGA PCI Development Board [9] 102

C.l The RC10 Board [10] 104

D.1 The Netlist File 107

D.2 The Placement File 107

Chapter 1

Introduction

Since the development of Integrated Circuits (ICs) in the 1960s, fabrication technology has improved from being able to integrate just a few transistors in Small Scale Integration (SSI) to several million transistors in Very Large Scale Integration (VLSI). Transistor dimensions continue to shrink roughly every eighteen months in accordance with Moore's law [11]. Accordingly, the complexity of integrated circuits will continue to increase dramatically, and the integration of several billion transistors will soon be realized. Consequently, the fabrication process of future VLSI chips will continue to be an increasingly challenging issue. Therefore, it is important to create faster techniques to reduce the time that the VLSI physical design automation step takes to complete. VLSI physical design automation is the process of converting a specific circuit design into a layout. This process consists of floorplanning, partitioning, placement, and routing [12].

Routing is one of the most time-consuming steps in the VLSI design process, taking up roughly 30% of the design time. One of the most popular algorithms for finding a path between any two vertices on a planar graph, the primary sub-problem in VLSI routing, is Lee's algorithm [13] [12]. As circuits increase in size, algorithms such as Lee's algorithm [13] become inherently slow.

One potential technique for reducing the running time of the design process is implementing the algorithms on Application Specific Integrated Circuits (ASICs). ASICs are ideal for a particular function, taking advantage of parallelism and pipelining. However, once an ASIC chip has been fabricated, no further modifications to the chip are possible. This inflexibility of ASICs leads to high development costs.

In contrast, the hardware configuration of Reconfigurable Computing Systems (RCS) can be modified in response to changing designs, achieving potentially high performance while maintaining the flexibility of software. The main component of an RCS is the Field-Programmable Gate Array (FPGA). Present FPGAs have the capacity to accommodate complex algorithms on a single chip. One of the goals of this dissertation is to design hardware accelerators using FPGAs to speed up the routing phase in VLSI circuit layout.

1.1 Motivation

As transistor sizes shrink and the average wirelength increases due to growing design complexity, the interconnect capacitance (Cwire) becomes dominant over the gate capacitance (Cgate). Consequently, interconnect delay consumes a major part of the clock cycle today and is expected to follow the same trend in the future [14], as illustrated by Figure 1.1.

Figure 1.1: Interconnect and Gate Delay

Lee's routing algorithm [13] has proved to be an efficient technique for VLSI circuit routing, and it is widely used in many physical design automation steps [12]. As the complexity and size of digital circuits increase, algorithms such as Lee's become slow and less efficient. An efficient and fast routing algorithm can be integrated within the placement phase, leading to less congested module placement and therefore more routable designs.

Accordingly, the main goal of this research is to speed up Lee's routing algorithm. A methodology based on different hardware accelerators will enhance the performance of CAD tools in general and VLSI routing in particular.

1.2 Overall Methodology

The overall design methodology that was employed to accelerate Lee's algorithm is shown in Figure 1.2. A pure-software implementation of Lee's routing algorithm is developed (using the C language) on a general purpose processor. First, Lee's algorithm is implemented on an FPGA using two approaches: a hardware/software co-design approach, which utilizes a MicroBlaze soft core, and a custom hardware implementation using Handel-C [1]. An application specific instruction-set processor is also investigated; this approach targets the Tensilica configurable processor.

In the first approach, a pure-software implementation of Lee's algorithm that runs on a MicroBlaze is profiled to determine the bottlenecks (time-consuming parts). The bottlenecks are then implemented in hardware using VHDL, so that Lee's routing algorithm runs on an FPGA chip using a hardware/software co-design approach based on a MicroBlaze soft processor and a dedicated hardware accelerator. In the second approach, the pure-software version of Lee's routing algorithm is converted into Handel-C [1]. The initial Handel-C implementation (PSH) is purely sequential; parallelism is then exploited in the original implementation to improve its performance (PPH).

In the third approach, the pure-software implementation of Lee's algorithm that runs on a Tensilica ASIP is profiled to identify the main hot spots. The Tensilica Instruction Extension (TIE) language is used to convert the bottlenecks into a hardware accelerator integrated within the data path of the processor to enhance performance. Finally, the performance of the hardware/software co-design, the custom hardware, and the Tensilica ASIP approaches is compared and evaluated.

Figure 1.2: The Overall Design Methodology

1.3 Contributions

The main contributions of this thesis can be summarized as follows:

• Investigate several approaches with different coupling to accelerate Lee's routing algorithm.

• Explore the performance of a hardware/software co-design over a pure software implementation running on a general purpose processor.

• Explore the performance of the Tensilica ASIP processor against the hardware/software co-design based on a MicroBlaze soft core and the custom hardware using Handel-C.

• Propose a new technique for the wave propagation phase of Lee's algorithm.

• Results obtained were submitted for publication in the Reconfigurable Architectures Workshop (RAW09) [15] and the Canadian Conference on Electrical and Computer Engineering (CCECE09) [16].

1.4 Thesis Organization

The remainder of the thesis is organized as follows: Chapter 2 introduces necessary background related to the VLSI design process, physical design automation, global routing, and reconfigurable computing systems. Chapter 3 presents an overview of previous efforts in developing accelerators for VLSI physical design automation, targeting the placement and the routing phases. Chapter 4 describes in detail the hardware/software co-design approach and presents the results obtained. Chapter 5 investigates the custom hardware approach and compares the results obtained with the hardware/software co-design approach. Chapter 6 addresses the configurable processor approach and compares the results obtained with the previous two approaches. Chapter 7 concludes the thesis and presents possible future work.

Chapter 2

Background

Since the invention of Integrated Circuits (ICs) in the 1960s, the fabrication technology of ICs has improved from being able to integrate a few transistors in Small Scale Integration (SSI) to millions in VLSI technology. Present VLSI technology allows us to integrate entire systems with billions of transistors on a single VLSI chip. This could not be achieved without the assistance of computer programs during all steps of the design process. Computer Aided Design (CAD) is the process in which the VLSI chip is designed with the help of computer programs. Design Automation (DA), on the other hand, refers to a fully computerized design process with minimal human intervention.

This dissertation is concerned with the implementation of hardware accelerators using reconfigurable computing platforms to speed up the global routing phase within the VLSI physical design automation stage. Accordingly, this chapter presents necessary information related to the VLSI design process, physical design automation, global routing, and RCSs.

2.1 VLSI Design Process

Since current VLSI circuits are very dense, designing a VLSI chip is becoming an increasingly complex task. To decrease the complexity of the design process, a number of intermediate levels of abstraction are introduced [17]. As shown in Figure 2.1, the VLSI design process is broken down into five stages: System Specification, Functional Design, Logic Design, Circuit Design, and Physical Design [18]. Each design stage is further broken down into three steps: synthesis, analysis, and verification.

Figure 2.1: The VLSI Design Process

2.1.1 VLSI Physical Design Automation

One of the main goals of an IC designer is to convert a circuit description into a legal geometric description known as a layout. A layout is composed of a set of planar geometric shapes in several layers. The layout is used to make patterns called masks, which are then used in the fabrication process to pattern a silicon wafer through a sequence of photolithographic steps. VLSI physical design automation refers to the process through which the specification of an electrical circuit is converted into a layout [12]. This process is typically divided into five phases: partitioning, floorplanning, placement, routing, and compaction, as shown in Figure 2.2.

Figure 2.2: VLSI Physical Design Cycle

The input to the process is a netlist (typically described using a hardware description language), and the output is the placement of the modules on the chip and the routing of the circuit netlist.

The following is a brief description of all stages within the VLSI physical design cycle:

• Partitioning refers to the process through which a complex circuit, consisting of a huge number of components, is decomposed into a set of smaller sub-circuits (blocks). The objective of this stage is to minimize the number of nets spanning between the blocks. The partitioning phase takes into account many factors such as the number of blocks, the size of the blocks, and the number of interconnections between the blocks. The output of this stage is a set of blocks with the required interconnections.

• Floorplanning and Placement: The target of the placement stage is to obtain a minimum-area arrangement of the blocks that leads to a routable design where all connections are present and the timing constraints are satisfied. Following the partitioning stage, the area occupied by each block can be estimated and the number of required terminals (pins) is known. In order to complete the layout, the blocks need to be arranged on the layout surface according to the netlist. This arrangement is done in the placement stage: blocks are allocated a specific shape and are placed on the layout surface in such a way that enough space is left to allow the interconnections between the blocks, with no overlap between any two blocks.

• Routing: The objective of this phase is to minimize congestion, wirelength, delay, and power dissipation. Routing attempts to complete all block connections using a shortest-path heuristic. This is usually done in two phases: a global routing phase and a detailed routing phase. In global routing, connections between the appropriate blocks are completed while neglecting the precise geometric details of each wire and pin. Following the global routing phase is the detailed routing phase, in which point-to-point connections are completed between pins on the blocks. Because of the nature of the routing problem, connecting one net might prevent other nets from being routed. As a consequence, a technique called rip-up-and-reroute is used that essentially removes congested paths and reroutes them in a different order.

• Compaction tends to compress the layout in all directions such that the total area is decreased. As a result, the wirelength is reduced, which leads to faster designs. Furthermore, smaller areas translate to a better layout, and consequently the manufacturing cost is minimized.

2.2 Global Routing

Today's VLSI chips contain a billion transistors or more. Consequently, during routing, tens of thousands of nets (sets of pins that need to be connected together electrically) must be connected to complete the layout. Furthermore, each net may be routed using one of possibly hundreds or even thousands of feasible paths. This makes the routing problem computationally difficult. For this reason, the routing stage is broken down into two phases: global routing and detailed routing.

While global routing considers the entire layout to obtain a loose routing, detailed routing only considers certain areas and obtains the final routing for the layout.

2.2.1 Routing Problem Definition

The general definition of the routing problem can be stated as follows: given a set of blocks with pins on their boundaries, a set of signal nets, and the positions of the blocks on the layout, routing is the process of successfully connecting all signal nets subject to timing constraints. Figure 2.3 shows an illustration of general routing, and a small illustrative data model is sketched after the figure.

Figure 2.3: An Illustration of General Routing
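As a hedged illustration of the inputs to this problem, the C fragment below represents a signal net as a set of pin locations on the global routing grid; the type and field names are assumptions made for this sketch, not the data structures used in the thesis.

```c
#define MAX_PINS 8      /* illustrative upper bound on pins per net */

/* A pin is a location on the global routing grid. */
struct pin {
    int row;
    int col;
};

/* A signal net is a set of pins that must be connected together. */
struct net {
    int        num_pins;
    struct pin pins[MAX_PINS];
};

/* A routing instance: the grid dimensions, blocked cells, and all nets. */
struct routing_problem {
    int         rows, cols;
    const int  *obstacles;   /* rows x cols array: nonzero = blocked cell */
    int         num_nets;
    struct net *nets;
};
```

A sequential router such as Lee's algorithm (Section 2.3.1) would then process the entries of nets[] one at a time, marking the cells used by each routed net as obstacles for the nets that follow.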

2.2.2 A Classification of Global Routing Algorithms

Two different approaches have been proposed to solve the global routing problem: the concurrent approach [19] and the sequential approach [13]. The classification of the global routing algorithms is shown in Figure 2.4.

Figure 2.4: The Classification of the Global Routing Algorithms

In the concurrent approach, all nets are routed simultaneously to avoid the net ordering problem [20]. In contrast, nets are routed one by one in the sequential approach. However, routing one net at a time may prevent other nets from being routed. For this reason, the rip-up-and-reroute technique has been proposed [21]. The sequential approach includes two-terminal routing algorithms, such as maze routing algorithms [13], and multi-terminal algorithms, such as Steiner tree based algorithms [22].

2.3 Maze Routing Algorithms

Maze routing algorithms are a class of general purpose routing algorithms which use a grid model. In the grid model, the entire routing plane is represented as a rectangular array of grid cells. The size of a grid cell is scaled such that other nets can be routed through an adjacent cell while respecting the width and spacing rules of the wires.

2.3.1 Lee's Algorithm

In [13], Lee proposed an algorithm for connecting a two-terminal net on a grid. The algorithm is popular because it is guaranteed to find the shortest path between two terminals if one exists. The algorithm consists of three phases. In the first phase, a modified breadth-first search is performed. In particular, a wave starts to propagate from one terminal, called the source terminal (S), labeling all free grid cells adjacent to the source with 1s. Next, 2s are entered in all free grid cells adjacent to those containing 1s, and so on, as shown in Figure 2.5(a), until the target terminal (T) is reached. The distance between any cell and the source is indicated by the label of the cell. This phase is known as the wave propagation phase. The second phase is called the retrace phase. In this phase, the labeled cells are retraced in decreasing order from the target to the source, as shown in Figure 2.5(b). The final phase is called the clean-up phase. In this phase, all labeled cells, except those used for the path just found, are cleared, as shown in Figure 2.5(c).
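To make the three phases concrete, the following is a minimal single-layer C sketch of the wave propagation, retrace, and clean-up steps on a square grid. The grid size, the cell encoding (FREE, OBSTACLE, PATH), and the function names are assumptions made for this illustration; they are not the data structures of the thesis implementation.

```c
#include <string.h>

#define N        16       /* illustrative grid dimension              */
#define FREE     0        /* unlabeled, routable cell                 */
#define OBSTACLE (-1)     /* blocked cell                             */
#define PATH     (-2)     /* cell reserved for the routed connection  */

static int grid[N][N];    /* 0 = free, -1 = obstacle, >0 = wave label */

static const int dr[4] = { -1, 1, 0, 0 };
static const int dc[4] = { 0, 0, -1, 1 };

/* Phase 1 - wave propagation: label free cells with their distance from
 * the source (sr, sc) until the target (tr, tc) is reached.
 * Returns the path length, or -1 if the target is unreachable.         */
int propagate(int sr, int sc, int tr, int tc)
{
    int frontier[N * N][2], next[N * N][2];
    int nf = 1, label = 0;

    frontier[0][0] = sr;
    frontier[0][1] = sc;
    while (nf > 0) {
        int nn = 0;
        label++;
        for (int i = 0; i < nf; i++) {
            for (int d = 0; d < 4; d++) {
                int r = frontier[i][0] + dr[d];
                int c = frontier[i][1] + dc[d];
                if (r < 0 || r >= N || c < 0 || c >= N) continue;
                if (r == tr && c == tc) return label;   /* target reached  */
                if (grid[r][c] != FREE) continue;       /* blocked/labeled */
                if (r == sr && c == sc) continue;       /* skip the source */
                grid[r][c] = label;
                next[nn][0] = r;
                next[nn][1] = c;
                nn++;
            }
        }
        memcpy(frontier, next, (size_t)nn * 2 * sizeof(int));
        nf = nn;
    }
    return -1;
}

/* Phase 2 - retrace: walk back from the target, always stepping to a
 * neighbor whose label is one smaller, and reserve those cells.        */
void retrace(int tr, int tc, int length)
{
    int r = tr, c = tc;
    for (int want = length - 1; want >= 1; want--) {
        for (int d = 0; d < 4; d++) {
            int nr = r + dr[d], nc = c + dc[d];
            if (nr >= 0 && nr < N && nc >= 0 && nc < N && grid[nr][nc] == want) {
                grid[nr][nc] = PATH;
                r = nr;
                c = nc;
                break;
            }
        }
    }
}

/* Phase 3 - clean-up: clear all wave labels, keeping obstacles and the path. */
void cleanup(void)
{
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            if (grid[r][c] > 0)
                grid[r][c] = FREE;
}
```

A router would call propagate() for a net's source and target, call retrace() with the returned length to reserve the cells of the shortest path, and then call cleanup() before routing the next net.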

2.3.2 Limitations of Lee's Algorithm for Large Circuits

In Lee's algorithm, if L is the length of the route, then the time required for the wave propagation phase is proportional to L^2 (the wave must label on the order of L^2 cells before the target is reached), while the time required for retracing is proportional to L. Therefore, the time complexity of Lee's algorithm is O(L^2).

Figure 2.5: Lee's Algorithm: (a) The Wave Propagation Phase (b) The Retrace Phase (c) The Clean up Phase

Furthermore, the required memory for an N x N grid plane is O(N^2), and the worst-case running time is also O(N^2). Another limitation of Lee's algorithm is that it does not directly support multi-terminal routing. Several techniques have therefore been proposed to overcome these limitations. To reduce the size of the required memory, different coding techniques for labeling the grid cells have been developed [17].

2.3.3 Reducing the Running Time

The running time is proportional to the number of cells that need to be searched in the wave propagation phase of Lee's algorithm. Therefore, the following techniques can be used to reduce the number of cells filled:

• Starting point selection: In the classic Lee's algorithm, either of the two terminals can be chosen as the starting point. The number of cells filled can be reduced if the starting point is chosen close to the boundary of the grid (i.e., farthest from the grid's center). Since the starting point is closer to the boundary of the grid, the propagation area is bounded by it, as shown in Figure 2.6a. The shaded area in the figure represents the number of filled cells if either terminal is chosen as the starting point (i.e., source).

• Double fan-out: In this approach, during the wave propagation phase, two waves are propagated, one from each terminal. The labeling stops when a point of contact is reached between the two wavefronts, as illustrated in Figure 2.6b.

• Framing: In this technique, a window is constructed around the pair of terminals to be connected. The size of the window is typically 10% or 20% larger than the smallest box that contains both terminals, as shown in Figure 2.6c. If a path is not found, the size of the window is increased and the search is continued. This technique considerably speeds up the wave propagation stage [17] (a small sketch of the window computation is given after Figure 2.6).

Figure 2.6: Schemes to Reduce the Running Time of Lee's Algorithm. (a) Starting point selection. (b) Double fan-out. (c) Framing.
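As a concrete illustration of the framing technique, the hedged C fragment below computes a search window that is a fixed percentage larger than the bounding box of the two terminals; the structure and parameter names are assumptions for this sketch rather than the thesis code.

```c
/* Hypothetical search window: inclusive cell ranges on the routing grid. */
struct window {
    int rmin, rmax;
    int cmin, cmax;
};

/* Build a framing window around terminals (r1,c1) and (r2,c2).
 * margin_pct enlarges the bounding box (e.g. 10 or 20 percent), and the
 * window is clamped to a grid_size x grid_size grid.                     */
struct window make_frame(int r1, int c1, int r2, int c2,
                         int margin_pct, int grid_size)
{
    struct window w;
    int rlo = (r1 < r2) ? r1 : r2, rhi = (r1 < r2) ? r2 : r1;
    int clo = (c1 < c2) ? c1 : c2, chi = (c1 < c2) ? c2 : c1;

    /* Pad each dimension by margin_pct percent, and by at least one cell. */
    int rpad = ((rhi - rlo) * margin_pct + 99) / 100;
    int cpad = ((chi - clo) * margin_pct + 99) / 100;
    if (rpad < 1) rpad = 1;
    if (cpad < 1) cpad = 1;

    w.rmin = (rlo - rpad < 0) ? 0 : rlo - rpad;
    w.cmin = (clo - cpad < 0) ? 0 : clo - cpad;
    w.rmax = (rhi + rpad >= grid_size) ? grid_size - 1 : rhi + rpad;
    w.cmax = (chi + cpad >= grid_size) ? grid_size - 1 : chi + cpad;
    return w;
}
```

During wave propagation, the expansion step would simply skip any cell that falls outside the current window; if the retrace fails, the caller enlarges margin_pct and repeats the search.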

2.4 Reconfigurable Computing Systems

There are several methods for accelerating the execution of algorithms. Application Specific Integrated Circuits (ASICs) are mainly used to perform algorithm operations in hardware. ASICs are very efficient in terms of power dissipation and performance since they are designed specifically to execute a given computation. However, the circuit cannot be modified after fabrication. Software-programmed devices such as general-purpose processors (GPPs) perform the computation by executing a set of instructions. Although this method is a far more flexible solution, due to its sequential nature it is slower than ASICs and dissipates more power [23].

Reconfigurable Computing Systems (RCS) fill the gap between costly ASICs and GPPs. In reconfigurable computing systems, programmable logic is used to accelerate the performance of complex algorithms [2]. Reconfigurable computing systems offer considerably higher performance than software, while maintaining greater flexibility than ASICs.

2.4.1 Hardware/Software Co-design in RCS

In reconfigurable computing systems, two approaches can be used to execute the algorithms: custom RTL and hardware/software co-design. In the first approach, the entire algorithm is implemented in hardware. Although the custom RTL approach achieves excellent performance, it is not always possible to implement the entire algorithm in hardware. In hardware/software co-design, on the other hand, parts of the algorithm are implemented in software that runs on a general purpose processor, while others are implemented as hardware accelerators attached to the processor.

2.4.1.1 Hardware/Software Partitioning

For systems that contain both hardware and software components, the first step in the design is to partition the algorithm into parts to be executed on the hardware and parts to be executed on GPPs. The partitioning procedure can be achieved either manually by the designer or automatically by a compiler [24].

Amdahl's law [25], shown in Equation 2.1, helps to identify the bottlenecks that can be further improved by implementing them in hardware. Even though some sections of the program could be implemented in hardware and enhanced, if the overall speedup is not significant, then transferring that section of the program to hardware is meaningless. Accordingly, it is important to identify the main bottlenecks that consume the majority of the program's execution time. This can be determined by profiling the software implementation.

$$\mathit{Speed}_{overall} = \frac{1}{(1-a) + \frac{a}{s}} \qquad (2.1)$$

where $\mathit{Speed}_{overall}$ is the overall speedup achieved, $a$ is the fraction of the original program that can be improved by implementing it in hardware, and $s$ is the speedup obtained (from the hardware) for that particular fraction of the program.
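As a hedged numerical illustration (the figures here are examples, not measurements from the thesis): if profiling shows that the wave propagation phase accounts for a = 0.9 of the total execution time and a hardware accelerator speeds that phase up by s = 10, Equation 2.1 gives

$$\mathit{Speed}_{overall} = \frac{1}{(1-0.9) + \frac{0.9}{10}} = \frac{1}{0.19} \approx 5.3.$$

The whole program therefore runs only about 5.3 times faster even though the accelerated fraction is ten times faster, which is why the bottlenecks that dominate the execution time must be identified first.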

2.4.2 Field-Programmable Gate Arrays (FPGAs)

Field-Programmable Gate Arrays (FPGAs) are the key component of any reconfigurable computing system. An FPGA can be seen as a matrix of logic blocks surrounded by I/O cells. Different tasks can be performed by reconfiguring the logic blocks. FPGAs have several advantages over application specific integrated circuits and mask-programmable gate arrays (MPGAs). These advantages include: quick time to market, being a standard product; no non-recurring engineering costs for fabrication; pre-tested silicon for use by the designer; and reprogrammability, which allows designers to enhance or modify logic through in-system programming [2].

2.4.2.1 The General Structure of FPGAs

A typical FPGA basically consists of three types of components: Configurable Logic Blocks (CLBs), programmable routing, and Input/Output Blocks (IOBs). Figure 2.7 shows a general FPGA architecture.

The CLB, as shown in Figure 2.8, consists of a multi-input Look-Up Table (LUT), a D-type flip-flop, and some fast carry logic, which is used to reduce the area and delay cost of implementing carry functions.

The LUT can be configured to implement a specific logic function, while the D-type flip-flop allows the CLB to implement both sequential (clocked) logic and combinational (non-clocked) logic [26]. The CLBs can be connected together by configuring the programmable routing. The programmable routing area is composed of metal segments and configurable switches, which are used to connect the wires to each other and to the input and output pins of the circuit components. The IOBs are used to interface the internal logic with the I/O pins. The I/O pins are programmable for input and output functions.

Figure 2.7: A General FPGA Structure

Figure 2.8: A General Configurable Logic Block [2]

2.4.2.2 Routing in FPGAs

Routing FPGAs is a challenging task due to the constraints imposed by the routing resources, both metal segments (i.e., wires) and configurable switches. One of the most popular FPGA routing techniques, PathFinder, was proposed by McMurchie [27]. The PathFinder algorithm ensures that all signals are completely routed and that the delay of the interconnect is minimized. PathFinder consists of two parts: a signal router, in which a shortest path algorithm is used to route one signal at a time, and a global router. While the global router utilizes the signal router to connect all signals, the signal router uses a breadth-first search to find the shortest path, taking into account a congestion cost and a delay for each routing resource. PathFinder was tested on two different FPGA architectures: Triptych and the Xilinx 3000. The results on the Triptych FPGA showed that the algorithm minimized congestion while meeting the delay constraint. Furthermore, an 11% shorter critical path than that of commercial tools was achieved on the Xilinx 3000 architecture.

2.5 Application Specific Instruction-set Processors

Present complex designs for computationally demanding algorithms require high speed, flexibility, and low power dissipation. Traditionally, both ASICs and Digital Signal Processing (DSP) processors have been used as possible solutions. While ASICs are very good in terms of performance and power dissipation, they are not flexible, and no further modification can be performed after fabrication. On the other hand, DSP processors are more flexible and programmable, but they are slower than ASICs and dissipate more power. Therefore, the new flexible architectures of Application Specific Instruction-set Processors (ASIPs) can be used in place of multiple chip designs fabricated as ASICs, while retaining a flexible architecture. In addition, ASIPs are more customized to a certain application than DSPs. ASIPs are composed of programmable processor cores and customized hardware modules within the data path, which allow developers to expand the instruction set with application-specific instructions.

There are two approaches to synthesizing ASIPs. In the first approach, an available processor is customized, while in the second approach the developer has to build the instruction set and the data path from scratch.

2.5.1 Tensilica Configurable Processors

Tensilica Xtensa configurable processors [28] provide a novel approach to ASIP synthesis. The Tensilica Xtensa configurable processors are based on two principles: configurability and extensibility. The ASIP processors provided by Tensilica can be configured, and the required functional units can be specified as well. Furthermore, when the performance of the configured processor is not sufficient, it can be enhanced by extending the processor's Instruction Set Architecture (ISA). The application-specific instruction set allows the application to run faster. Moreover, fixing errors and modifying the processor can be accomplished within several hours.

Tensilica processors are offered either as off-the-shelf cores, via the Diamond Standard Series, or as fully configurable cores, through the Xtensa processor family. The Diamond series covers a range of performance scenarios with five cores, ranging from a small 32-bit controller up to high-performance audio/video engines. The Xtensa processors, on the other hand, can be customized at the micro-architectural level to meet specific application requirements. Further details about the Tensilica processors and the design flow are presented in Chapter 6.

Figure 2.9 summarizes the performance of the different approaches that have been used as computing platforms, such as GPPs, ASICs, H/S co-design, RCS, and ASIPs, against their flexibility.

Figure 2.9: The Different Computing Approaches

2.6 Benchmarks

In this thesis, six synthesized benchmarks were used to test the proposed systems. The benchmarks are modified versions of the ISPD98/IBM benchmarks [29]. The main modifications were reducing the number of nets, the number of pins, and the grid size. The benchmarks range in size from a single-layer 8 x 8 grid with 17 pins and 8 nets to a single-layer 64 x 64 grid with 512 pins and 215 nets. Moreover, each benchmark has two-terminal nets and multi-terminal nets (i.e., nets consisting of three or more terminal pins). Table 2.1 summarizes the six benchmarks.

Table 2.1: Benchmarks

The different characteristics of the synthesized benchmarks will hopefully al­ low an adequate comparison between the developed hardware accelerators in this dissertation. CHAPTER 2. BACKGROUND 25

2.7 Summary

This chapter introduced an overview of the VLSI design process, VLSI physical design automation. The routing problem is one of the most time consuming steps in VLSI physical design. Lee's algorithm is an effective routing heuristic since it can find the shortest path between two terminals if one exists. It is widely used for routing problems in ASICs and FPGAs. Reconfigurable computing systems and FPGAs were also introduced. FPGAs were briefly presented as a possible paradigm to implement digital systems and architectures. The application specific instruction-set processors and Tensilica configurable processors were also presented and their advantages highlighted as an alternative to GPPs and ASICs. Chapter 3

Literature Review

The main objective of this chapter is to present previous studies and research that have been carried out in developing accelerators for the VLSI physical design au­ tomation, particularly for global routing. In recent years there have been attempts to accelerate layout algorithms for both ASICs and FPGAs. These attempts fall under one of the following categories:

1- Pure software implementations running on pipelined or multiprocessor systems;

2- Pure hardware accelerators targeting ASICs for achieving high performance;

3- H/S co-design and pure hardware accelerators targeting FPGAs that can be re­ configured and modified while achieving speedups not attainable by conventional state-of-the art general processors.

During the last three decades, researchers have focused on designing hardware accelerators for different algorithms used in the physical design automation phase.

Figure 3.1 illustrates the three main categories of hardware accelerators targeting

CAD tools which are considered in this thesis; the main focus of the literature re-

26 CHAPTER 3. LITERATURE REVIEW 27 view will deal with ASIC routing based accelerators.

Hardware accelerators for CAD

FPGA Routers

Figure 3.1: Hardware Accelerators for CAD

Two time consuming processes in physical design automation are placement and routing. These two phases are tightly coupled, since the quality results of the routing phase depends on the placement phase results. The next two sections in­ troduce some of the attempts to accelerate the placement and routing stages for

FPGAs respectively. Section 3.3 on the other hand presents the efforts made to improve ASIC routers in terms of performance and solution quality based on three different approaches: general purpose processors, ASIC-based implementations and

FPGA-based implementations. CHAPTER 3. LITERATURE REVIEW 28

3.1 Placement Based Hardware Accelerators

Handa et. al [3] proposed three different architectures for two dimensional online placement which are serial, parallel and serial-parallel architectures. Each architec­ ture has a different trade off in terms of FPGA resources, execution time, memory requirement, host overheads and reconfiguration overheads. Figure 3.2 shows the model of the reconfigurable dynamic system; the main components of the system are the host computer and a partially reconfigurable FPGA.

Y D r

data data Shared f Memory

configuration data and control

Host Partially Reconfigurable FPGA

Figure 3.2: The model of the partially reconfigurable dynamic system [3]

The FPGA is partitioned into an operating system (OS), which is used to run the placement engine, and an application area. The host computer dynamically determines the boundary between the application area and the OS area. The

FPGA surface is modeled as a two dimensional array called an area matrix. Each combinational logic block (CLB) in the FPGA is represented as a cell in the area matrix. The serial architecture shown in Figure 3.3 consists of area matrix memory, CHAPTER 3. LITERATURE REVIEW 29 comparator, column counter (down), and address counter (up), while the parallel architecture consists of area matrix memory, reconfigurable adder, address counter

(up), parallel comparator and controller state machine as shown in Figure 3.4.

Macro Height

I) QA^GE_B L Q_Threshhold A Load Clk Clk

Column Counter (DOWN)

Q

Clk

Memory Address Counter (UP)

Figure 3.3: The Serial Architecture [3]

In the third architecture, both the serial and the parallel architectures were combined to create the serial-parallel architecture with some modifications as shown in Figure 3.5. In the three architectures, the matrix memories are used to store the area matrix.

A comparison between the three architectures is presented in Table 3.1.

RF and CF are the number of rows (height) and columns (width) of the FPGA, and HT and WT are the height and the width of the task that is needed to be placed. The work [3] did not present the speedup achieved over the pure software implementations. CHAPTER 3. LITERATURE REVIEW 30

Figure 3.4: The Parallel Architecture [3]

Figure 3.5: The Serial-Parallel Architecture [3]

                           Serial Arch.      Parallel Arch.   Serial-Parallel Arch.
Execution Time             Medium            Fast             Slow
(Clock Cycles)             RF x CF           RF x HT          RF x CF x WT
Memory Requirement         More              Less             Less
(Bits)                     RF x CF x log RF  RF x CF          RF x CF
Resource Usage             Low               High             Medium
Host Overheads             Medium            Low              Low
Reconfiguration Overheads  Low/Nil           High             Low

Table 3.1: Comparison between the three placement architectures

3.2 Accelerators for FPGA Routers

Recent research has focused on accelerating the FPGA routing algorithms. In the FPGA routing the directed graph models the FPGA's programmable intercon­ nections, replacing the grid graph in the ASIC's routing. This section presents the hardware accelerators and the parallel processing approaches that were imple­ mented to speed up the run time of the FPGA routers.

3.2.1 Distributed Workstations

In [30], a distributed version of a negotiation based router that runs on a network of workstations was implemented. In this work, two methods were proposed to accelerate the FPGA router. In the first method, the pathfinder algorithm [27] was implemented in a fairly simple FPGA and memory-based hardware. This approach achieved a 5x speedup over the pathfinder algorithm.

In the second method, the serial pathfinder algorithm was restructured into a CHAPTER 3. LITERATURE REVIEW 32 parallel routing algorithm so that it could run on a cluster of uniprocessor worksta­ tion. The implementation results showed a 3x speedup over the serial path finder algorithm.

3.2.2 Pure Hardware Accelerators

DeHon et al. [4] proposed an aggressive acceleration approach in which hard­ ware was added to an High-Speed Hierarchical Synchronous Reconfigurable Array

(HSRA) network [31] to accelerate the path finder routing algorithm [27]. The main idea of this approach is to use the network itself to perform the parallel routing search and to keep track of the congestion of the network. To obtain an avail­ able path in HSRA network, a search was started from both terminals (source and target) and then free (least cost) paths were traced from the two terminals to a crossover switch. A viable path is found when the search from the two terminals are met on one (or more) wires at the crossover switch. Next, the path for these two terminals is set in the network. At the uncomplicated conceptual level, a logic

OR is added between the two children channels of each uplink switch in the HSRA network as shown in Figure 3.6. With this addition, a route trial could be roughly performed.

More complicated architectures were designed to directly support the allocation of the routed path and the rip-up of routed path when no path could be found.

Simulation results showed that the proposed design could achieve a speedup of three orders of magnitude over the path finder algorithm with a solution quality within

5% - 25% of the state of the art routing algorithms. CHAPTER 3. LITERATURE REVIEW 33

parent

Figure 3.6: HSRA T-Switch with Path-Search OR [4]

Further work by Huang et. al [32] improved the quality of the routing results of

DeHon's approach by adding modules that take into account the congestion cost.

Furthermore, the design supports multiple terminal nets and mesh-style routing topologies. The work explored how to implement the hardware on an existing

FPGA. Although, simulations expect huge improvement in terms of routing quality and processing time, the hardware cost was very high, where 155 4-input LUTs were required for each PE for a single layer mesh topology.

Some of the approaches that have been proposed to improve the FPGA's routers were presented in this section. The proposed designs were intended to speedup the processing time taking into account the solution quality. CHAPTER 3. LITERATURE REVIEW 34

3.3 Accelerators for ASIC Routers

This section presents some previous attempts intended to improve the solution qual­ ity and speed performance of the detailed routing problem for application specific integrated circuits (ASIC). These include: general purpose processors, ASIC and

FPGA based implementations.

3.3.1 General Purpose Processors

The previous research in this area can be classified into: multi-processor and pipelined processor architectures.

3.3.1.1 Multi Processor Architectures

In [33], the study presents Lee's algorithm [13] and proposed a parallel adaptable algorithm (PAR) running on an adaptive array processor (AAP-1). The AAP-

1 consists of a PE array, a control unit, a data buffer memory, an instruction memory, and an interface unit. The proposed algorithm consisted of two classes.

The first class (PAR-1) was a controllable path quality algorithm similar to Lee's algorithm with minor modification: the expansion distance was not limited to unity, but could take any value for each step. The second class (PAR-2) was a Quasi minimum Steiner tree finding algorithm, a parallel algorithm used to support multi terminal routing. The main idea was to obtain a restricted area by doubling the wave expansion between the first two terminals, and then an attempt was made to connect the restricted area to the next terminal. The implementation results showed a speedup of lOOx for Lee's algorithm that ran on an AAP-1 and 230x for CHAPTER 3. LITERATURE REVIEW 35

PAR-2 over Lee's algorithm that ran on an MPS sequential computer for a 256 x 256 routing grid with a two layered problem.

3.3.1.2 Pipelined Processor architectures

Sahni et. al [34] proposed an accelerator consisting of a three 3-stage pipelined processors, Banked Queue Memory (BQM) and Banked Cell Memory (BCM) as shown in Figure 3.7a.

Banked Queue Memory Stage 1 Stage 2 Stage 3

Pipel Pipelined Processors

Pipe 2 tot -

Banked Cell Pipe 3 Memory -

(a) Maze Router (b) Pipelined Processors

Figure 3.7: Maze Router General Architecture and Pipelined Processors

The main purpose of using the banked memories was to avoid read/write con­ flicts. The BQM was used to store the cells that had to be expanded and the BCM was used to store the routing grid. Figure 3.7b shows the pipelined processors; the responsibility of each pipelined processor is to examine one of the neighbors of the incoming cell and independently determines which neighbors of the incoming cell should be examined. The appropriate cells are then fetched from the BQM and past to stage 1 of each of the three pipelines. While stage 1 was responsible for CHAPTER 3. LITERATURE REVIEW 36 neighbor cells fetching and some bookkeeping, stage 2 determined the position in the queue where these neighbors should be added. In stage 3, the queued cells were properly labeled and their addresses were added to the queue.

A reasonable performance improvement was provided by this design over unipro­ cessor implementations. The speed up achieved was between 1.66, for the worst case, and 5 for the best case. Although a three 3-stage pipelined processor was used there is a limit in terms of the amount of speedup achieved using this approach.

Rutenbar et. al [35] proposed a class of cellular architectures named raster pipeline subarrays (RPS) intended for physical design automation. A PRS machine is a pipeline of stages in which a large grid is processed in a serial cell stream.

The PRS consists of several pipeline stages where each stage is a single raster subarray. There are three basic components in each stage, a line buffering scheme, a subarray storage and a subarray processor. The sequence of operations that one subarray processor performs is called a subarray computation. While both design rule checking (DRC) and Lee's routing algorithm were tested in an RPS environment, a cellular DRC was implemented using an algorithm for a width of

3A. Eight subarray computations were used to run the width 3A check, and the algorithm was tested on a 64 x 64 test pattern. The proposed cellular DRC correctly detected all errors. Moreover, one layer and two layer Lee's routing algorithms were implemented in the cellular architecture. The implementation results showed that a design with 10 to 100 stages can quickly route up to a 1000 x 1000 grid and 91% of the nets were routed.

Some of the approaches that have been proposed to improve the routing prob- CHAPTER 3. LITERATURE REVIEW 37 lem solution have been presented in this section. The proposed approaches were intended to handle large routing problems and speed up the processing time. While the general purpose implementations improved the quality of the solution, they did not perform as well as ASIC implementations as will be explained in the next sec­ tion.

3.3.2 ASIC-Based Implementations

In this section ASIC hardware implementations designed to accelerate the routing problem are discussed.

In [36], the implementation of a virtual-grid architecture for a hardware ac­ celerator based on Lee's algorithm was investigated. The main approach here is mapping multiple grid points onto each processing element (PE); this was done by mapping an N x N grid of cells onto a machine with O(N) PEs called a wavefront machine. Figure 3.8 shows the basic structure of the wavefront machine, consisting of a host computer, a control unit (CU), and an array of PEs. The CU controls and synchronizes the PEs' operations and communicates the wavefront with the host computer.

As a result of mapping multiple grid points onto each PE, the number of PEs is reduced, but the complexity of the PEs increased. A prototype machine with 64

PEs was designed to implement a two layer 128 x 128 grid plane. Furthermore, the interactive rip-up and reroute was supported by the design. The implementation results showed the average of the distribution of the processor utilization was about CHAPTER 3. LITERATURE REVIEW 38

Host computer

Figure 3.8: Basic structure of the wavefront machine

50 %, much higher resource utilization than the full gird architecture. Moreover, the wavefront machine runs in 0(L116) time, almost linear running time.

Another attempt to accelerate the grid routing using an ASIC implementation was performed by Iosupovici [37]. The IAP architecture is basically an ASIC im­ plementation of Lee's algorithm using a class of two dimensional single SIMD (

Single Instruction Multiple Data) array processors. The architecture consists of a central processor and four supporting chips; each chip contains an N x N grid of processing elements and four interface controllers: IFN, IFW, IFE and IFS, which are used to interface the chip to its North, West, East and South neighbors respec­ tively. Each PE in the grid is addressed through an X-Y bus directly by the central processor. Two major advantages were achieved in this design; the first advantage was that the design was not pin limited and the second was that the computation time was O(L) where L is the path length. Moreover, more advanced machines CHAPTER 3. LITERATURE REVIEW 39 were designed, such as the MPIAP [37] which supports multi terminal net routing and the CF3IAP [37] which is a router that minimizes turns in the routes.

A family of iterative state machine arrays (ISMA) was proposed by Ryan et. al [38]. The architecture consists of a host computer, a host interface and a local control which controls the accelerator and the ISMA engine. The ISMA engine is a rectangular array of small state machines that can connect only to its neighbors.

The presenting problem is divided into small subsections named windows. The size of the window is equal to the area that the ISMA can accommodate; on the other hand, the memory module is used to store a single window. A specific ISMA was designed to accommodate Lee's algorithm in which the full grid points were divided into small subsections. These subsections were labeled as either source, target, empty or obstacle. The ISMA engine used these labels to generate a rough shortest path, similar to global routing, which was a path of windows instead of grid points. The host or local controller used the rough shortest path to detect the next window in which the memory module is stored. A 16 x 16 two layer ISMA

Lee's architecture was implemented where each point in the grid points was an array of 16 x 16 cells. The implementation results showed a speed up of 200x over the traditional Lee's algorithm and a single 40,000 gate IC was used to implement this design, while sixteen ICs, each consisting of 300,000 gate ICs, were used to implement the full grid L machine.

Mazumder et al. [39] proposed a virtual hardware accelerator for Lee's algorithm called a hexagonal mesh architecture. The architecture consists of a host computer, a C-wrapped hexagonal mesh of PEs in which each PE is connected to six neighbors, and an array control unit (ACU), which controls the mesh array and interfaces it with the host computer. Figure 3.9 shows the block diagram of a single PE, consisting of a receive unit, an update unit, a node ID, a next cell unit, an expand unit, a send unit and a local memory that stores the information about the different cells that have been mapped onto the PE.


Figure 3.9: Block diagram of a single PE

The main advantage of the hexagonal mesh architecture is that a mesh of dimension sqrt(Gk) can emulate a k-layer grid of kG^2 points, using only about 20% of the hardware needed to implement a full-grid architecture. On the other hand, the complexity of the PE increases and the system is slower than the full grid.

In summary, ASIC implementations can achieve very high performance, as in the IAP architecture, where the computation time is O(D) for single-layer routing problems, D being the length of the path. The inflexible nature and the high cost of ASIC-based approaches led research toward FPGA implementations, where designs can be more flexible than ASIC approaches and faster than pure software implementations.

3.3.3 FPGA-Based Implementations

This section introduces research that attempts to accelerate the classic Lee's algorithm using FPGA platforms.

Nestor [5], [40], [41] proposed and described an FPGA implementation of an adjusted full-grid technique named L3 for multilayer ASIC routing. The main adjustment was to time-multiplex a two-dimensional array of processing elements over multiple layers. Figure 3.10 shows the basic organization of the L3 architecture, in which commands that request routing between a source and a target are broadcast by the host processor. FIFO buffers are used between the host processor and the accelerator to decrease communication overhead.

The L3 architecture consists of a control unit and an N x N grid of Processing Elements (PEs). Each PE is a finite state machine that represents the grid point's status. Each PE is connected to the adjacent cell in the array using XO, which is true when the cell is marked during the expansion phase. A shift register was used to store the state of each PE, while a common sequencing unit was used to circulate each layer's state. This trade-off decreased the number of processing elements for the L-layer N x N grid from L x N^2 to N^2. On the other hand, the expansion execution time increased from O(d) to O(d x L).
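
As a concrete illustration of this trade-off, for the 16 x 16 x 4 array reported below the PE count drops from L x N^2 = 4 x 256 = 1,024 to N^2 = 256, while the expansion takes L = 4 times as many steps.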

The L3 implementation was compared with a pure software implementation of

Lee's algorithm that was written in ANSI C and ran on a 2.53 GHz machine. The implementation results showed speedups from 49.94x to 76.41x for an 8 x 8 x 4 array and an overall speedup of 93.62x for a 16 x 16 x 4 array. L3 did not support multi-terminal routing and provided limited support for rip-up and reroute.


Figure 3.10: L3 General Organization [5]

The work in [6] presents an improved hardware accelerator design named L4, based on L3, that is intended for detailed routing of integrated circuits. The architecture, as shown in Figure 3.11, consists of a two-dimensional array of PEs, a control unit, row and column decoders, and a FIFO buffer with a PCI target to interface the architecture with the host computer.


Figure 3.11: L4 Architecture [6]

Each PE has STATE-IN and CMD inputs; the control unit broadcasts commands to all PEs simultaneously and reads the logical AND of the STATE-OUT signals of all PEs. The commands performed by each PE are summarized in Table 3.2.

The routing of multi-terminal nets was directly supported by L4. Moreover, L4 had a new feature known as etching that accelerated rip-up and reroute when a routing failure occurred due to obstacles. A 16-layer 32 x 32 routing array was implemented, which supported etching and multi-terminal routing.

Command   Action
READ      if (RSEL and CSEL) STATE-OUT = CS
WRITE     if (RSEL and CSEL) NS = STATE-IN
CLEARX    if CS in (XE, XW, ..., XU, XD) NS = EMPTY
EXPAND    if (EI) NS = XE; elseif (WI) NS = XW; elseif (NI) NS = XN;
          elseif (SI) NS = XS; elseif (UI) NS = XU; elseif (DI) NS = XD

Table 3.2: PE Commands

The implementation results showed a speedup of 29x to 93x over the classic Lee's algorithm and

5x to 19x over the A* algorithm, both of which were written in C.

In summary, the FPGA implementations of detailed routing improved the performance of the pure software implementations. Nestor's FPGA-based multilayer maze routing accelerator managed to speed up the classic Lee's algorithm from O(L x D^2) in software to O(L x D), where L is the number of layers and D is the path's length. Moreover, the FPGA-based implementations have an edge over

ASIC based implementations due to their flexibility.

3.4 Summary

This chapter presented previous efforts made in implementing hardware accelerators for CAD algorithms, specifically routing and placement techniques for VLSI physical design automation. Three approaches were used to implement accelerators: general-purpose processors, FPGA-based implementations and ASIC-based implementations. The results of implementing the architectures on pipelined processors or multiprocessor structures showed that these platforms cannot provide high performance, but they have the highest flexibility among all approaches and they can accommodate large grid-point planes. On the other hand, ASIC implementations achieved very high performance, but their inflexible nature and high cost remain the biggest obstacles. The third approach achieved faster designs than the general-purpose processor architectures and more flexible designs than the ASIC designs. For example, the FPGA-based multilayer maze routing accelerator enhanced the performance of the classic Lee's algorithm from O(L x D^2) in software to O(L x D), where L is the number of layers and D is the length of the path. Other implementations can be proposed based on the presented approaches to improve the performance of ASIC routers. The first implementation is a hardware/software co-design using the Xilinx MicroBlaze processor. Next, a custom hardware implementation of Lee's routing algorithm will be designed using the

Handel-C approach. Finally, we will explore a novel flow based on the Tensilica tools

[28] to design a dedicated instruction-set processor for global routing based on the classic Lee's algorithm. These three implementations will be described in more detail in the next few chapters.

Chapter 4

Hardware/Software Co-design

This chapter discusses the MicroBlaze (soft core) based implementation approach employed in this thesis to speed up the performance of Lee's algorithm. In this approach, Lee's routing algorithm was implemented using both a pure software based approach and a hardware/software co-design approach. In the pure software implementation, two different search techniques were implemented. In the first technique, the entire grid is searched to find the path between the given source and target, while in the second technique a window is constructed around the pair of terminals. The pure software implementation was developed using the C programming language.

4.1 Methodology

The overall design methodology of the HW/SW co-design approach is shown in Figure 4.1. The software implementation of Lee's routing algorithm is initially developed using the C language and is then mapped onto the FPGA board and verified using several benchmarks. The software is profiled to determine the bottlenecks

(time-consuming parts). The next step involves designing a hardware accelerator using VHDL for the bottlenecks. Lee's routing algorithm is then implemented on an FPGA chip using a hardware/software co-design approach based on the MicroBlaze soft processor and dedicated hardware modules. Finally, the performance of the system is evaluated.

[Design flow: Software Implementation on PC -> Software Implementation on MicroBlaze -> Profiling -> Identify Bottlenecks -> Implement Hardware Accelerator -> H/S Co-Design -> Evaluate Performance]

Figure 4.1: The Design Methodology

4.2 Design Flow of Lee's Algorithm

The software design of Lee's algorithm is based on the flow chart shown in Figure 4.2. The details of Lee's algorithm were presented in Section 2.3.


Figure 4.2: The Flow Chart of Lee's Algorithm

4.3 A Pure-software Based Implementation

A pure-software based implementation was developed by mapping Lee's algorithm onto a MicroBlaze system. The MicroBlaze embedded soft core is a 32-bit RISC processor, which can be implemented on a single Xilinx FPGA chip [7].

4.3.1 Implementation on a MicroBlaze System

The software implementation of Lee's routing algorithm is developed on a PC, then downloaded to the FPGA chip and eventually run on the MicroBlaze system. The MicroBlaze system for the pure-software based implementation, as shown in Figure 4.3, consists of the MicroBlaze core, a local memory bus (LMB) that connects the MicroBlaze core with the block RAM through the LMB block RAM controller, an On-chip Peripheral Bus (OPB), an OPB timer, and an OPB UART. The OPB timer is used to calculate the total number of clock cycles, while the OPB UART is used to communicate between the PC and the MicroBlaze system through the HyperTerminal communication tool that runs on the PC.

The inputs to the MicroBlaze system are the netlist text file, placement text file and the C code that performs all of the functions of Lee's algorithm. The netlist file contains the grid size, the number of nets, the number of pins, and the pins'

ID of each net, while the placement file contains the position of each pin on the grid. The two text files are sent to the MicroBlaze system using HyperTerminal.

The two text files along with the C code are stored in the block RAM to avoid the delay associated with external memory.

The output is a text file that contains the total number of clock cycles and a grid showing the connection of the nets.


Figure 4.3: The MicroBlaze System for the Pure-software Based Implementation

The output is displayed on the PC monitor through the OPB UART and HyperTerminal. The Xilinx Embedded Development Kit (EDK) version 8.1 was used for design entry, synthesis and mapping the design onto the FPGA board. In this approach, a Xilinx Virtex II Pro XC2VP100

FPGA board was utilized.

4.3.2 Major Software Functions

4.3.2.1 Source / Target Assignment

After receiving the netlist file and the placement file and storing them into the block RAM, the first step is to initialize two arrays: the B array, which is used to obtain the path between the given terminals, and the P array, which is used to display the final solution of the system. The next step is to assign the source and the target for each net.

The corresponding pseudo code is shown in Figure 4.4.

FOR each Net n DO
    Assign Target
    FOR Number of Pins - 1 DO
        Assign Source
    END FOR
END FOR

Figure 4.4: Assign the Source and the Target

If the net has more than two terminals, the target remains the same, while the source becomes the next pin. The complexity of this function is O(NP), where N is the number of nets and P is the number of pins.
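
The same loop can be written as a short C sketch, which makes the O(NP) behaviour explicit; the array names, sizes and the wave() call below are placeholders rather than the actual thesis code.

    /* Hypothetical sketch of the assignment loop in Figure 4.4 (names assumed). */
    for (int n = 0; n < num_nets; n++) {
        int target = net_pins[n][0];              /* the target stays the same         */
        for (int p = 1; p < num_pins[n]; p++) {   /* Number of Pins - 1 iterations     */
            int source = net_pins[n][p];          /* the next pin becomes the source   */
            wave(P, B, n, source, target);        /* find a path from source to target */
        }
    }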

4.3.2.2 Wave Propagation

Once the source and target are assigned, the Wave() function is called. This function takes the two arrays, the net ID, the source and the target, and searches the entire grid to find the path between the given pins. The function starts by propagating a wave from the source (S), sequentially labeling all free cells adjacent to the source with 1. Next, 2s are entered in all free cells adjacent to those containing 1s, and so on (refer to Section 2.3.1 for more details). The search continues until the target is reached or the value of the wave (temp) reaches 100. Figure 4.5 shows the pseudo code of the Wave() function. The computation time of this function is O(N^2 x D), where N is the X or Y dimension of the chip and D is the length of the path. If the path is found, the Retrace and clean up function is called to retrace the path from the target to the source. Otherwise, the net ID is stored in the unconnected net array.

4.3.2.3 Retrace and Clean up

The retrace and clean up function gets the data from the Wave() function and, starting from the target, it retraces the path to the source and then cleans up the unblocked cells (refer to Section 2.3.1 for more details). Figure 4.6 shows the pseudo code of the function. This function retraces the path in O(D), where D is the length of the path, and cleans up the other cells in O(N^2), where N is the X or Y dimension of the chip.

WAVE(P, B, NetID, Source(S), Target(T))
    temp = 1;
    WHILE path-exists = False DO
        FOR i = 0 to XChipSize - 1 DO
            FOR j = 0 to YChipSize - 1 DO
                CHECK THE FOUR NEIGHBOURS OF B[i][j], START WITH NORTH CELL
                IF B[i][j+1] = Unblocked THEN
                    B[i][j+1] = temp;
                    IF B[i][j+1] = Target THEN
                        path-exists = True; EXIT WHILE;
                    END IF
                ELSE IF P[i][j+1] = NetID THEN
                    path-exists = True; EXIT WHILE;
                END IF
                DO THE SAME FOR THE OTHER THREE NEIGHBOURS
                temp = temp + 1;
                IF temp > 100 & path-exists = False THEN
                    EXIT WHILE;
                END IF
            END FOR
        END FOR
    END WHILE
    IF path-exists = True THEN
        RETRACE(P, B, NetID, S, T, temp);
        RETURN 1;
    ELSE
        unconnectednet[R] = NetID; R++;
        RETURN 0;
    END IF
END WAVE

Figure 4.5: The Wave Propagation Function

RETRACE(P, B, NetID, Source, Target, temp)
    i = Target(x); j = Target(y);
    WHILE i != Source(x) & j != Source(y) DO
        Check the neighbours of B[i][j]
        IF B[i][j] = temp THEN
            P[i][j] = NetID;
            B[i][j] = Blocked;
            temp = temp - 1;
            change the values of i or j;
        END IF
    END WHILE
    FOR i = 0 to XChipSize DO
        FOR j = 0 to YChipSize DO
            IF B[i][j] != Blocked THEN
                B[i][j] = Unblocked;
            END IF
        END FOR
    END FOR
END RETRACE

Figure 4.6: Retrace and Clean up Function

4.3.2.4 Rip up Phase

The pseudo code of the rip-up phase is shown in Figure 4.7. The function is called if the Wave() function cannot find the path for a net. The time complexity of this function is similar to that of the wave propagation function.

The function reads the unconnected nets from the unconnected net array and rips up the first net that blocks the path to the target. If the path is not found, the function rips up another net and the search is continued. The ripped-up nets are stored in the ripped-up nets array so that they can be connected again later. The search is continued until the target is reached.

For example, in Figure 4.8(a) S2 and T2 are the source and target of a net that was not connected by the Wave() function. The Rip up function starts the search from the source (S2) toward the target (T2). Net 1 (N1 in the figure) blocks the path, so it is ripped up as shown in Figure 4.8(b) and stored in the ripped-up nets array. The search is continued until T2 is reached. Finally, the Retrace and clean up function is called to finalize the route, as shown in Figure 4.8(d).

4.3.2.5 Reroute Phase

This function reads the ripped-up nets from the ripped-up nets array and calls the Wave() function to try to reconnect them.

RIP UP(P, B, NetID, S, T)
    temp = 1;
    WHILE path-exists = False DO
        FOR i = 0 to XChipSize DO
            FOR j = 0 to YChipSize DO
                IF B[i][j] = Unblocked THEN
                    B[i][j] = temp;
                    IF i = Tx & j = Ty THEN
                        path-exists = True; EXIT WHILE;
                    END IF
                ELSE IF P[i][j] = NetID THEN
                    path-exists = True; EXIT WHILE;
                ELSE
                    Ripup-net[g] = P[i][j]; g++;
                    Free all cells that contain the value of P[i][j];
                    B[i][j] = temp;
                END IF
                temp = temp + 1;
                IF temp > 100 & path-exists = False THEN
                    EXIT WHILE;
                END IF
            END FOR
        END FOR
    END WHILE
    IF path-exists = True THEN
        RETRACE(P, B, NetID, S, T, temp);
        RETURN 1;
    ELSE
        unconnectednet[R] = NetID; R++;
        RETURN 0;
    END IF
END RIP UP

Figure 4.7: The Rip up Function


Figure 4.8: The Rip up Function Steps

4.3.3 Multi-Terminal Nets Routing

The code attempts to connect two terminals, and if the net has more than two terminals, the path between the connected terminals and the other terminals is found. To demonstrate this, consider the three terminals in Figure 4.9(a) belonging to the same net. Pins A and B are first connected together. Pin C is now considered the source, while either pin A or B is the target. The search is started from pin C toward the target. As shown in Figure 4.9(b), the propagation wave reaches the connected path between pins A and B before the target. In this case, the reached cell (cell D in Figure 4.9(b)) is assigned as the new target. Finally, the Retrace and clean up function is called to finalize the route as shown in Figure 4.9(c).
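
In code, this amounts to one extra check inside the wave expansion; the fragment below is a hypothetical sketch (variable names assumed, not the thesis code) of how a cell already carrying the current net's ID is promoted to the new target.

    /* ni, nj: the neighbouring cell being expanded; P holds the IDs of routed nets. */
    if (P[ni][nj] == net_id) {       /* the wave reached an existing segment of this net */
        target_x = ni;               /* cell D in Figure 4.9(b) becomes the new target   */
        target_y = nj;
        path_exists = 1;             /* stop the expansion and start the retrace         */
    }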


Figure 4.9: Multi-Terminal Nets Routing

4.3.4 Profiling

In order to identify performance bottlenecks, the OPB timer was used to profile the C code by running all six benchmarks. The OPB timer obtained the number of clock cycles consumed by each part of the code. The profiling results determine the functions that can benefit from acceleration. The profiling results for benchmarks four and five are shown in Table 4.1. The table only shows the main functions and sub-functions, while ignoring minor and less computationally intensive functions.
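
A minimal sketch of how such a measurement can be taken is shown below, assuming the standard Xilinx XTmrCtr driver for the OPB timer; the device-ID macro and the helper name are assumptions, not the exact thesis code.

    #include "xtmrctr.h"
    #include "xparameters.h"

    /* Hypothetical helper: returns the clock cycles consumed by an arbitrary
     * routine, measured with counter 0 of the OPB timer. */
    u32 count_cycles(void (*routine)(void))
    {
        XTmrCtr timer;
        XTmrCtr_Initialize(&timer, XPAR_OPB_TIMER_1_DEVICE_ID); /* device ID assumed */
        XTmrCtr_Reset(&timer, 0);       /* clear counter 0             */
        XTmrCtr_Start(&timer, 0);       /* start counting clock cycles */
        routine();                      /* the code being profiled     */
        u32 cycles = XTmrCtr_GetValue(&timer, 0);
        XTmrCtr_Stop(&timer, 0);
        return cycles;
    }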

4.3.5 Framing Technique

It can be seen from Table 4.1 that the wave propagation function consumes most of the computation time. One technique that can be used to reduce the computation time of the search function is the framing technique [12]. In this technique, a window is constructed around the given terminals so that only the area inside the window is searched. The time complexity of this technique is

                        Benchmark 4                       Benchmark 5
Functions           # of    Total      % Total      # of    Total      % Total
                    Calls   Cycles     Cycles       Calls   Cycles     Cycles
                            (x10^3)                         (x10^3)
Wave Propagation     136     6,632      80 %         144    12,916      84 %
Retrace & clean up   136     1,613      19.5 %       148     1,764      11.5 %
Reroute                0         0       0             4       424       2.76 %
Rip up                 0         0       0             4       244       1.6 %
Initialize arrays      1        24.5     0.3 %         1        24.6     0.16 %
Assign terminals     136        12       0.26 %      140        22.6     0.15 %

Table 4.1: The Profiling Results

O(X x Y x D), where X is the number of searched columns, Y is the number of searched rows and D is the length of the path. Figure 4.10 shows the pseudo code of the search function with the framing technique, where the value of X determines the effective size of the frame. In this technique, four different frame sizes were implemented. The smallest one, with X = 0, is called framing 0, while the largest one, with X = 3, is called framing 3. If the path is not found, the value of X is increased until it equals 5.

Figure 4.11 shows two terminals A and B with the framing 1 technique. In this example, the function searches inside the shaded area only.
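
The window computation in Figure 4.10 can be sketched in C as follows; the variable and macro names are assumed, and the clamping to the chip boundary is an added safeguard that is not spelled out in the pseudo code.

    /* Hypothetical helper mirroring the start/end computation in Figure 4.10,
     * where X is the framing margin (0..5). */
    void frame_bounds(int sx, int sy, int tx, int ty, int X,
                      int *startX, int *startY, int *endX, int *endY)
    {
        *startX = (sx < tx ? sx : tx) - X;
        *startY = (sy < ty ? sy : ty) - X;
        *endX   = (sx > tx ? sx : tx) + X;
        *endY   = (sy > ty ? sy : ty) + X;

        if (*startX < 0) *startX = 0;                       /* keep the window on the grid */
        if (*startY < 0) *startY = 0;
        if (*endX > XCHIPSIZE - 1) *endX = XCHIPSIZE - 1;   /* XCHIPSIZE/YCHIPSIZE assumed */
        if (*endY > YCHIPSIZE - 1) *endY = YCHIPSIZE - 1;
    }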

4.4 A Hardware/Software Co-Design Implementation

The profiling results presented in Table 4.1 show that the wave propagation phase consumes more time than the other functions. Moreover, this function searches the neighbors of each cell sequentially. Therefore, the wave propagation function was chosen to be implemented using a hardware accelerator. Figure 4.12 presents the MicroBlaze system architecture of the hardware/software co-design implementation.

LEEFRAMING(P, B, NetID, Source(S), Target(T))
    temp = 1;
    startX = the smallest of Sx and Tx - X;
    startY = the smallest of Sy and Ty - X;
    endX = the largest of Sx and Tx + X;
    endY = the largest of Sy and Ty + X;
    WHILE path-exists = False DO
        Loop:
        FOR i = startX to endX DO
            FOR j = startY to endY DO
                IF B[i][j] = Unblocked THEN
                    B[i][j] = temp;
                    IF i = Tx & j = Ty THEN
                        path-exists = True; EXIT WHILE;
                    END IF
                ELSE IF P[i][j] = NetID THEN
                    path-exists = True; EXIT WHILE;
                END IF
                temp = temp + 1;
                IF temp > 100 & path-exists = False THEN
                    IF X = 5 THEN
                        EXIT WHILE;
                    ELSE
                        X++;
                        GOTO Loop;
                    END IF
                ELSE
                    EXIT WHILE;
                END IF
            END FOR
        END FOR
    END WHILE
    IF path-exists = True THEN
        RETRACE(P, B, NetID, S, T, temp);
        RETURN 1;
    ELSE
        unconnectednet[R] = NetID; R++;
        RETURN 0;
    END IF
END LEEFRAMING

Figure 4.10: The Wave Propagation Function with Framing Technique

Figure 4.11: Framing 1 Technique

4.4.1 Fast Simplex Link (FSL) Bus

An FSL bus is a uni-directional point-to-point communication channel between a MicroBlaze core and a hardware accelerator module. A MicroBlaze system offers up to eight master and slave FSL interfaces. The FSL interfaces can transfer data in two clock cycles between the register file on the MicroBlaze core and the hardware accelerator implemented on the FPGA. Figure 4.13 shows the FSL block diagram.

An FSL bus has two peripherals: a master and a slave. A master peripheral connected to the master port of the FSL bus sends data and control signals into the FIFO buffer; a slave peripheral connected to the slave port of the FSL bus reads and pops data and control signals from the buffer [7].
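
As a rough illustration of the software side of such a transfer, the MicroBlaze C code can use the blocking putfsl()/getfsl() macros that Xilinx provides in mb_interface.h; the helper name and the word ordering below simply mirror the description in Section 4.4.2 and are assumptions, not the thesis code.

    #include "mb_interface.h"

    /* Hypothetical helper: stream one 8 x 8 framing window to the accelerator on
     * FSL channel 0 (source, target and net ID first, then the eight columns). */
    void send_window(int src, int tgt, int net_id, const int grid[8][8])
    {
        putfsl(src, 0);                       /* blocking write to FSL channel 0 */
        putfsl(tgt, 0);
        putfsl(net_id, 0);
        for (int col = 0; col < 8; col++)     /* column 0 (RAM 0) first          */
            for (int row = 0; row < 8; row++)
                putfsl(grid[col][row], 0);
    }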


Figure 4.12: The MicroBlaze System for the Hardware/Software Co-design

[Figure content: a FIFO links the master-side signals FSL_M_Clk, FSL_M_Data, FSL_M_Control, FSL_M_Write and FSL_M_Full to the slave-side signals FSL_S_Clk, FSL_S_Data, FSL_S_Control, FSL_S_Read and FSL_S_Exists.]

Figure 4.13: FSL Block Diagram [7]

4.4.2 The Hardware Accelerator Module

Recall that Lee's algorithm searches the entire grid to find the path between two terminals. However, as described in Section 2.3.3, a framing technique can be used to reduce the computation time of the algorithm. In this dissertation, the framing window size was limited to an 8 x 8 grid. The decision was made for two reasons.

First, when the results produced by the software version of Lee's algorithm were analyzed, it was observed that the longest path for any benchmark was at most 8 steps in the vertical or horizontal direction. The second reason for limiting the framing window to 8 x 8 was to reduce the amount of logic required to implement the framing window in hardware. However, although we have limited the window to 8 x 8, in general any window size can be used.

Further details of implementing the framing window in hardware are explained next. Figure 4.14(a) shows an 8 x 8 grid. S is the source terminal, T is the target terminal, and a shortest path between S and T is being sought.


Figure 4.14: The Grid Partitioning

It can be seen from the figure that the grid can be represented as eight RAM blocks, as shown in Figure 4.14(b). The eight RAMs can be searched in parallel for the target terminal, but starting with different values (i.e., the first value in the RAMs). For example, the search in RAM 1 is started with 1, while in RAM 2 it is started with 2, etc.

The overall hardware accelerator module is shown in Figure 4.15. The accelerator module is composed of a Finite State Machine (FSM) and an 8 x 8 grid.


Figure 4.15: The Architecture of the Hardware Accelerator

The 8 x 8 grid consists of eight columns; each column comprises a block RAM, a 2-input 8-bit adder, a 2-input 8-bit multiplexer, and an 8-input 8-bit multiplexer.

Figure 4.16 shows the architecture of the first column in the grid. The architecture of the remaining columns is similar to that of column 0, except that the input of the 8-input 8-bit multiplexer is different in order to produce different starting values. For example, the input of the 8-bit multiplexer in column 1 is the sequence (1, 0, 1, 2, 3, 4, 5, 6) from S0 to S7.


Figure 4.16: The Architecture of the First Column in the Grid

The hardware accelerator module operates as follows:

1. First, the hardware module reads the data from the Fast Simplex Link (FSL) bus. It starts by getting the corresponding (x,y) coordinate of the source, the corresponding (x,y) coordinate of the target, and the net ID.

2. A data array, which contains the blocked cells, the unblocked cells and the previously routed nets, is then stored in the block RAMs. The FSM reads the first column, which consists of eight values of the data array, from the FSL bus and stores it in RAM 0 as shown in Figure 4.17(a); the second column is then read and stored in RAM 1. This process continues until the eight columns are read from the FSL bus and stored in the eight block RAMs (i.e., Read-Done equals one) as shown in Figure 4.17(b). In Figure 4.17, F designates an unblocked cell ("free cell"), B is a blocked cell, I is the index of the block RAMs, S is the source terminal and T is the target terminal.

3. Once the data array is stored in the RAMs, the index of the RAMs is assigned to the corresponding Y of the source and the search is started in parallel in the eight RAMs as shown in Figure 4.17(c). For example, in Figure 4.17 the corresponding (x,y) coordinate of the source is (0,0), while the corresponding (x,y) coordinate of the target is (3,2). Consequently, the index of the block RAMs is assigned to zero, and the search is started.

4. The corresponding X of the source is used as the selector of the 8-input multiplexer to produce the values of the first cells in the block RAMs (i.e., the different starting values). At the same time, the first cells in the eight block RAMs are checked by the FSM, and if they are free the output of the 8-input multiplexer of each column is written into the first cell of the eight block RAMs as shown in Figure 4.17(c).

5. In each iteration, the FSM increases the value of the index and of counter 0, which is used to increment the starting value, by one and checks the outputs of the eight block RAMs.

   - If the cell is free, the sum of counter 0 and the output of the 8-bit multiplexer is written into the cell as shown in Figure 4.17(d).

6. Finally, if the index's value equals the corresponding Y of the target as shown in Figure 4.17(e), the search is stopped and the source, the target and the data array are sent back to the MicroBlaze processor to retrace the path. The data array is sent back in the same way in which it was received (i.e., starting with RAM 0, then RAM 1 and so on) as shown in Figure 4.17(f).


Figure 4.17: FSM Operation Procedure
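
The labelling behaviour of steps 3 to 6 can be mimicked by a small software model. The sketch below is only an illustration under several assumptions: the FREE/BLOCKED encodings and names are invented, blocked-cell handling, the FSL hand-shake and the retrace are omitted, and the target row is assumed to lie above the source row as in the Figure 4.17 example.

    #define FREE    0xFF   /* assumed encoding of an unblocked cell */
    #define BLOCKED 0xFE   /* assumed encoding of a blocked cell    */

    /* grid[col][row] mirrors the eight block RAMs (one column per RAM). */
    void fsm_model(unsigned char grid[8][8], int sx, int sy, int ty)
    {
        int counter0 = 0;                               /* incremented once per iteration  */
        for (int index = sy; index <= ty; index++) {    /* index walks toward the target Y */
            for (int col = 0; col < 8; col++) {         /* all eight columns per cycle     */
                /* per-column starting value selected by the source X (the 8-input mux) */
                int start = (col > sx) ? (col - sx) : (sx - col);
                if (grid[col][index] == FREE)
                    grid[col][index] = (unsigned char)(start + counter0); /* adder output */
            }
            counter0++;
        }
        /* once index reaches the target's Y, the array is streamed back over the
         * FSL bus and the MicroBlaze retraces the path (step 6). */
    }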

4.5 Results

This section presents the results of the two implementations in terms of the FPGA usage and the speedup obtained.

4.5.1 FPGA Usage

Table 4.2 shows the resource utilization of both the pure software and hardware/software co-design approaches. The hardware/software co-design approach tends to occupy more FPGA resources than the pure-software based design. The size of the software design is the same for the original code and for the code with the framing technique.

Resources                Available    Pure-software Based      Hardware/Software Co-design
                                      Usage         %          Usage          %
Look-up Tables              88,192        1,371     1.6 %          8,187      9.3 %
Slice Flip-Flops            88,192          595     0.7 %          2,032      2.3 %
Occupied Slices             44,096          800     1.8 %          4,440      10 %
Block RAMs                     444           32     7.2 %             32      7.2 %
MULT18X18s                     444            3     0.7 %              3      0.7 %
Total Equivalent Gates           -    2,167,304                 2,262,000

Table 4.2: The FPGA Usage

4.5.2 Speedup

Table 4.3 shows the clock cycles consumed for the different benchmarks and the maximum frequency of operation for the three proposed implementations. The targeted FPGA board can be operated at 40 MHz and 80 MHz. Therefore, the frequency of operation was set at 40 MHz. The pure-software design can run at a maximum frequency of 130.1 MHz, while the maximum frequency of operation for the hardware/software co-design is 125 MHz. The pure-software based system with framing 0 executes faster than the hardware/software co-design and the pure-software based systems for benchmarks 2, 4, and 6. However, the hardware/software co-design based system achieves the best speedup for the remaining three benchmarks. This is because the path for some nets cannot be found; as a result, the size of the frame has to be increased and the search has to be started again. On average, the hardware/software co-design achieves a speedup of 4.3x over the pure-software based design and is 1.75 times faster than the pure-software based design with the framing technique.

Benchmarks          Pure          Pure-software Framing Technique            H/S
                    Software      Framing 0     Framing 1     Framing 2      Co-design
Benchmark 1            51,313        27,922        35,007        41,065         26,018
Benchmark 2           576,796       172,946       209,353       245,566        190,802
Benchmark 3         1,046,573       593,349       785,986       900,898        196,467
Benchmark 4         8,291,700     1,863,431     2,026,086     2,201,425      2,146,587
Benchmark 5        15,371,644     3,886,077     4,788,992     5,971,557      2,206,128
Benchmark 6        79,482,115    14,438,405    14,969,921    15,566,640     16,645,123
Max. Freq. (MHz)        130.1         130.1         130.1         130.1            125

Table 4.3: The Consumed Clock Cycles and the Maximum Operating Frequency

Table 4.4 and Figure 4.18 show the speedup obtained by the hardware/software co-design and by the pure-software implementation with the framing technique over the pure-software implementation.

Benchmarks      Pure-software Framing Technique                            H/S
                Framing 0     Framing 1     Framing 2     Framing 3        Co-design
Benchmark 1        1.8           1.5           1.2           1.1              2
Benchmark 2        3.3           2.8           2.3           2                3
Benchmark 3        1.8           1.3           1.2           0.9              5.3
Benchmark 4        4.5           4.1           3.8           3.5              3.9
Benchmark 5        4             3.2           2.6           2.1              7
Benchmark 6        5.5           5.3           5.1           4.9              4.8
Average            2.9           2.57          2.31          2.1              4.3

Table 4.4: The Obtained Speed up over pure-software


Figure 4.18: The Obtained Speed up

4.6 Summary

In this chapter, three proposed architectures designed for Lee's routing algorithm were implemented, and mapped onto an FPGA chip.

• A pure-software based implementation of Lee's routing algorithm,

• A pure-software based implementation of Lee's routing algorithm with the

framing technique, and

• A hardware/software co-design implementation of Lee's routing algorithm.

The results of the three designs were presented and compared. On average, the hardware/software co-design achieves results 4.3 times faster than the pure-software based design and 1.75 times faster than the pure-software framing based design. Moreover, it achieves the same quality of solution as the pure-software based design.

Therefore, we can conclude that the hardware/software co-design approach can obtain a good balance between flexibility and performance.

Chapter 5

A Handel-C Custom RTL Implementation

In this chapter, a custom hardware implementation of Lee's routing algorithm using

Handel-C [1] is presented. Handel-C can be described as a superset of ANSI C, which uses the same syntax with added hardware parallelism. Handel-C is not a hardware description language; it is a programming language targeted to compile high-level algorithms directly into gate level hardware [1]. The development process has been performed on an IBM workstation with an Intel Xeon processor and

2.75 GB of RAM. The computer runs Windows XP Professional, and the Celoxica DK Development Environment version 4.0 was used to target the Celoxica RC10

Spartan-3L development FPGA board.


5.1 DK Design Flow

A complete design flow for implementing high-level language algorithms in RTL or as EDIF netlists is provided by the Celoxica DK Design Suite. Algorithms can be written directly in Handel-C, or converted to Handel-C from C or C++. Handel-C code can be compiled to VHDL or Verilog for RTL synthesis, or directly to EDIF for mapping onto FPGAs and PLDs [1]. Figure 5.1 shows the DK design flow.

[DK design flow: ANSI C feeds the DK Design Suite; hardware components are created in Handel-C or ported from C/C++ to Handel-C; the design is compiled for simulation and simulated/debugged; hardware components are optimized for area or depth and sources edited; the design is then compiled for hardware with technology mapping and logic estimation, emitting VHDL/Verilog for place and route.]

Figure 5.1: The DK Design Flow

5.2 Design Constraints

The following constraints were taken into account when the architecture was under development:

• Fit the entire architecture on one FPGA board.

• Obtain results with quality very close to that of the pure-software based implementation.

• Obtain a speed up over the pure-software based implementations.

5.3 Design Details

The main objective of this section is to present the steps that were performed to implement a custom hardware version of Lee's routing algorithm using Handel-C. This approach relies on converting the pure-software version of Lee's routing algorithm that was developed on the PC into a hardware implementation using Handel-C [1]. Initially, the pure-software version of Lee's routing algorithm was converted to Handel-C without adding any parallelism (PSH). Subsequently, parallelism was added to the PSH to improve the performance (PPH). These two steps will be explained in more detail in the next section.

5.3.1 Parallelizing Lee's Algorithm

The first strategy in achieving speedup was based on parallelizing all variable assignments wherever possible. In Handel-C, one clock cycle is required to assign values to variables (that do not involve de-referencing a pointer or accessing an array). Since the wave propagation function (refer to Section 4.3.2.2 for more details) is called when two pins need to be connected, and there are 20 - 30 variable assignments per call, assigning all variables at the same time should reduce the number of clock cycles to one per call. For example, the code in Figure 5.2 requires four clock cycles. On the other hand, the code in Figure 5.3 requires a single clock cycle, thus saving 3 clock cycles.

M = 2;
A = 4;
E = 5;
N = 3;

Figure 5.2: Variable Assignment (a)

par {
    M = 2;
    A = 4;
    E = 5;
    N = 3;
}

Figure 5.3: Variable Assignment (b)

If this piece of code is called several times during the process, or worse, sits inside a for-loop that iterates hundreds of times within the function, the speedup can be significant.

The second strategy used was to parallelize the wave propagation phase in which the four neighbors of a cell are searched sequentially using four if statements. By having the four if statements run concurrently (i.e., search the four neighbors of a cell concurrently), the total number of clock cycles is reduced.

The targeted FPGAs have modest capacities and therefore there is a limit to the size of the benchmarks that can be tested. Moreover, in an attempt to map all benchmarks to the board, modifications (described below) were made to the original Handel-C implementation to reduce the amount of logic required. In the original Handel-C implementation, two arrays are used: one to obtain the path between the given terminals (the B array) and one to display the final solution of the system (the P array). So, to reduce the size of the design, the B array was removed and the P array was used for both display and search purposes. When using PPH, two text files (the netlist text file and the placement text file) are downloaded to the architecture and stored in two different arrays, which requires more space. In the modified implementation, the benchmarks were altered so that one text file contains the same data as the netlist file, except that it contains the position of the pins on the grid instead of the pin IDs. In this case, only one array is needed to store the benchmark. Although these two techniques reduced the amount of logic required, the largest benchmark still could not be mapped to the targeted FPGA.

An attempt was made to add parallelism to the Retrace and clean up phase.

However, the results obtained were not superior to those obtained by the pure-software implementation. This is because the Retrace and clean up phase must retrace the labeled cells sequentially. Therefore, only the parallel variable assignment was added to the Retrace and clean up phase.

5.3.2 Input/Output Data

A host process that runs on the PC was used to transfer the data to and from the Celoxica RC10 development FPGA board through an RS-232 serial link. The host process runs a C program that reads the benchmarks (text files) and sends them to the board. It then reads the output, which contains a grid showing the connection of the nets and the total number of clock cycles, and displays it on the monitor.

The maximum data width that can be transferred to the board through the RS-232 link is eight bits. As a consequence, code was added to transfer any data value higher than 256 as a sequence of eight-bit packages.
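
A sketch of this packaging is shown below; send_byte() and recv_byte() stand for the single-byte RS-232 transfer routines and are assumed names, and the low-byte-first ordering is likewise an assumption.

    /* Split a 16-bit value into two eight-bit packages for the RS-232 link and
     * reassemble it on the receiving side. */
    void send_u16(unsigned short value)
    {
        send_byte((unsigned char)(value & 0xFF));           /* low 8 bits  */
        send_byte((unsigned char)((value >> 8) & 0xFF));    /* high 8 bits */
    }

    unsigned short recv_u16(void)
    {
        unsigned short lo = recv_byte();
        unsigned short hi = recv_byte();
        return (unsigned short)(lo | (hi << 8));
    }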

5.4 The Custom Hardware vs. The MicroBlaze Based Implementations

This section presents the results of the custom hardware approach in terms of the

FPGA usage and speed up. Furthermore, a comparison between this approach and the MicroBlaze approach will be presented as well.

5.4.1 Speedup

Table 5.1 shows the clock cycles consumed by the different benchmarks and the maximum frequency of operation. The frequency of operation of the MicroBlaze based implementations is 40 MHz, while the operation frequency of the custom hardware is 10 MHz.

Benchmark           Pure            H/S             Handel-C
                    software        Co-design
Benchmark 1            51,313          26,018           3,803
Benchmark 2           576,796         190,802          42,716
Benchmark 3         1,046,573         196,467          57,283
Benchmark 4         8,291,700       2,146,587         604,674
Benchmark 5        15,371,644       2,206,128         811,414
Benchmark 6        79,482,115      16,645,123             -
Max. Freq. (MHz)        130.1             125              11

Table 5.1: The Consumed Clock Cycles

Table 5.1 shows that the custom hardware approach consumes significantly fewer clock cycles than the pure-software and the hardware/software co-design implementations. However, when the actual clock speed of the implementations is taken into account, the execution times of the three implementations appear as shown in Table 5.2. These were found by taking the clock cycles consumed by the three designs and dividing them by the frequency of operation of each design.

Benchmark           Pure          H/S            Handel-C
                    software      Co-design
Benchmark 1            1.28          0.65           0.38
Benchmark 2           14.4           4.8            4.3
Benchmark 3           26             4.9            5.7
Benchmark 4          207            53.7           60
Benchmark 5          385            55             81.1
Benchmark 6        1,987           416               -
Average Speed up       -             4.3            3.9

Table 5.2: The Actual Execution Time of the Three Implementations in Milliseconds

It is clear from Table 5.2 that the custom hardware implementation is faster than the hardware/software co-design approach for the first two benchmarks; however, the hardware/software co-design approach is faster as the benchmarks increase in size. On average, the hardware/software co-design achieves a speedup of 4.3x over the pure-software based design and a 1.1x speedup over the custom hardware based design, while the custom hardware achieves a 3.9x speedup over the pure-software design.
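
As a check of this arithmetic on benchmark 1, the pure-software design takes 51,313 cycles / 40 MHz ≈ 1.28 ms, the co-design 26,018 / 40 MHz ≈ 0.65 ms, and the Handel-C design 3,803 / 10 MHz ≈ 0.38 ms, which reproduces the first row of Table 5.2.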

An important factor that needs to be taken into account is the fabrication technology of the FPGAs. The hardware/software co-design approach was implemented on a 130 nm FPGA, while the Handel-C design was implemented on a 90 nm

FPGA. Therefore, one could conclude that the run times for the hardware/software co-design approach would be even lower if implemented on a 90 nm FPGA. (Note: it was not possible to implement both approaches on the same FPGA/board since the synthesis tools available to us did not allow for this.)

5.4.2 FPGA Usage

In comparing the two implementation approaches, it was found that the total equivalent gate count of the Handel-C implementation is significantly less than that of the HW/SW co-design MicroBlaze implementation. This was expected, because in the latter a complete system with the MicroBlaze processor, block RAMs, buses and timer was implemented. Table 5.3 summarizes the FPGA usage of the three implementations.

Resource             Pure-software    H/S Co-design    Pure-Hardware
LUTs                     1,371            8,187           20,737
Flip-Flops                 595            2,032            1,298
Slices                     800            4,440           10,787
BRAMs                       32               32                1
Equivalent Gates     2,167,304        2,262,000          394,567

Table 5.3: The FPGA Usage

The other comparison metric is the quality of the obtained solution; the experimental results show that the hardware/software co-design approach and the custom hardware approach produce the same quality solutions as the pure-software approach.

5.5 Summary

In this chapter, a custom hardware implementation of Lee's routing algorithm using Handel-C was presented. The experimental results show that the co-design approach and the custom hardware approach produce the same quality solutions as the pure-software approach. The custom hardware approach achieves an average speedup of 3.9x over the pure-software based approach, while the hardware/software co-design approach achieves a 4.3x speedup.

Chapter 6

Configurable Processors Implementation

In the previous two chapters, the capability of improving the performance of Lee's routing algorithm using both the hardware/software co-design and the custom hardware approaches was investigated and presented. In this chapter, Application Specific Instruction-set Processors (ASIPs), referred to here as configurable processors, will be used to speed up the performance of Lee's routing algorithm. The main reason for investigating ASIPs is that they combine the advantages of both GPPs and reconfigurable processors. This work targets the Tensilica Xtensa family of processors [28]. The Xtensa processors are configurable cores that can be configured and described at the micro-architectural level. Furthermore, they can be extended at the instruction-set level.


6.1 Tensilica Configurable Processors

Traditional general-purpose processors, such as RISC processors, are used in many applications, but they may be inefficient for complex System-on-Chip (SoC) designs. Even though these processors might work for complex applications, they are often not fast enough for the demanding tasks in present embedded SoCs. Furthermore, the requirements for lower power consumption, small area and good performance are increasing as the algorithms become more complicated. ASIPs, on the other hand, are seen as a middle ground between general-purpose processors and ASICs, where modification at both the instruction-set and datapath levels provides both the required application-specific performance and ease of design.

6.1.1 Xtensa Processors

Tensilica Inc., through its Xtensa configurable and extensible processors, led the way in creating Application Specific Instruction-set Processors. The processors can be configured using the Tensilica processor configuration-generation design flow known as the Xtensa Processor Generator (XPG). The XPG permits designers to describe the processor at the micro-architectural block level. This configurability allows the sizing of the processor to meet the constraints of the targeted application. The Xtensa Xplorer Integrated Development Environment (IDE) can be used by designers to evaluate results in terms of power, area, and performance based on different configurations and new instructions.

Extensibility in the Tensilica Xtensa processors is carried out by identifying the main bottlenecks of the developed C/C++ code and rewriting them in the Tensilica Instruction Extension (TIE) language, a Verilog-based language. The advantage of the TIE language is that it describes new instructions, registers and register files of any size, in addition to describing new input/output ports. The TIE compiler is used to compile the TIE extensions. The TIE compiler then generates the necessary files required to modify the software tools and extend the instruction-set simulator. In addition, it provides an estimate of the additional gates generated.

An alternative flow for utilizing the Tensilica approach for SoC design is running the XPRES (Xtensa Processor Extension Synthesis) compiler. The XPRES compiler automatically produces one or more TIE files to enhance the performance of the targeted application, providing a trade-off between the performance obtained and the additional area used. The resulting TIE files can be modified further by writing other TIE extensions. The XPRES compiler explores the various configurations in a reasonable amount of time, depending on the algorithm complexity. This fast exploration permits the designer to choose among a variety of both automatically and manually generated TIE instructions.

6.1.2 Design Flow

Tensilica provides SoC developers with an integrated environment that allows them to perform the different tasks of the design flow. The first step in the design flow involves writing the C/C++ code for the desired application. The processor generator interface is then used to generate the processor. In this step, the tools provide the developer with the capability to specify the instruction-set options, memory, peripherals and interfaces. Following the generation of the desired processor, the main bottlenecks of the application are identified by profiling the C/C++ code. The next step is to extend the processor ISA through the TIE language. The written TIE code is compiled, and the needed libraries are created and attached automatically to provide support inside the IDE environment. The created TIE instructions can then be easily included inside the C code. The last step in the flow is profiling the code again to document the speedup obtained over the pure processor ISA. Figure 6.1 illustrates the different steps of the design flow.


Figure 6.1: Xtensa Design Flow [8]

6.2 Design Details

The main objective of this section is to present the capability of ASIPs to implement an efficient version of Lee's routing algorithm. The results obtained are then compared to the results of the hardware/software co-design and the custom hardware approaches described earlier.

The ASIP approach relies on running the software version of Lee's algorithm on the Xtensa processor. The software is then tested and profiled to define the main bottlenecks in the design. The Tensilica Instruction Extension (TIE) language is finally used to convert the bottlenecks into a specific instruction to enhance the performance.

6.2.1 Design Environment and Overall Architecture

The IDE (The Xtensa Xplorer CE) environment was used to implement the design.

Table 6.1 presents the specification of the targeted Xtensa Tensilica configured processor.

Parameter                                    Setting       Note(s)
Xtensa ISA Version                           X6.0
MAC/MUL units                                No
Floating-Point Unit                          No
Zero-overhead Loop Instructions              Yes
Pipeline Length                              5
Instruction and Data Cache Size              1024 Bytes
Instruction and Data Cache Line Size         16 Bytes
Xtensa Exception Architecture                XEA2
System RAM Size                              4M
System ROM Size                              128K
Process                                      130 nm LV
Core Speed                                   358 MHz
Number of Gates                              48K gates     Estimated
Functional Unit and Global Clock Gating      Yes

Table 6.1: Xtensa Processor Configuration Detail

6.2.2 Profiling

In order to identify performance bottlenecks, the C code was profiled using all six benchmarks presented in Chapter 2. The profiling results determine the functions that can benefit from extending the ISA by writing a specific instruction using the TIE language. The profiling results for benchmarks four and five are shown in Table 6.2. The table only shows the main functions and sub-functions, while neglecting minor and less computationally intensive functions.

                        Benchmark 4                       Benchmark 5
Functions           # of    Total      % Total      # of    Total      % Total
                    Calls   Cycles     Cycles       Calls   Cycles     Cycles
                            (x10^3)                         (x10^3)
Wave Propagation     136     7,189      78.9 %       144    13,171      83.6 %
Retrace & clean up   136     1,078      11.8 %       148     1,173       7.45 %
Reroute                0         0       0             4       355.9     2.26 %
Rip up                 0         0       0             4       171.4     1.08 %
Assign terminals     136         5.1     0.06 %      140         5.9     0.04 %
Initialize arrays      1         3.9     0.04 %        1         3.9     0.025 %

Table 6.2: The Profiling Results

It can be seen from Table 6.2 that the wave propagation function is still the major bottleneck. Using the available tools from Tensilica, the XPRES compiler failed to generate any default TIE configurations. The TIE language was then used to write a specific instruction for the wave propagation phase to further enhance the performance.

The profiling results for the other benchmarks were similar.

6.3 Results

The design was verified and simulated using the Xtensa Xplorer CE version 2.0.0.

The results obtained by both the pure ISA processor and the extended ISA processor are similar to the results obtained by the pure software implementation of Lee's algorithm. The design was run on the Xtensa 6 processor targeting a 130 nm lv process.

6.3.1 Speed and Area

Table 6.3 lists the clock cycles consumed using the different benchmarks and speedup obtained by both the pure ISA processor and the extended ISA processor. Both processors can run at a maximum frequency of 358 MHz.

Benchmarks       The Pure ISA     The Extended      Speed Up
                 Processor        ISA Processor
Benchmark 1         147,745          106,861          1.38
Benchmark 2         823,431          320,468          2.45
Benchmark 3       1,292,338          376,872          3.43
Benchmark 4       9,108,940        2,056,841          4.42
Benchmark 5      11,480,018        2,446,622          4.7
Benchmark 6      79,232,855       10,940,783          7.24

Table 6.3: The Consumed Clock Cycles and the Speed up Obtained over the Pure ISA Processor

On average, the extended ISA processor achieves a speedup of 3.9x over the pure ISA processor.

The estimated total equivalent gates of the pure ISA processor was 48K gates.

On the other hand, the added TIE instruction increases the gate count by 8,699 gates, which corresponds to only an 18% increase in the original core area.

6.4 Overall Comparison

In this section, a comparison between the results obtained by the three proposed approaches will be presented in terms of area and performance.

6.4.1 Speedup

Table 6.4 shows the clock cycles consumed by the three approaches for the different benchmarks and the maximum frequency of operation. The frequency of operation of the MicroBlaze based implementations is 40 MHz, while the operation frequency of the custom hardware is 10 MHz. On the other hand, the frequency of operation of the Tensilica processor implementation is 358 MHz. The Tensilica approach targets an ASIC implementation, which accounts for its much higher operating frequency.

Results obtained in Table 6.4 indicate that the custom hardware approach consumes significantly fewer clock cycles than the MicroBlaze and the Tensilica implementations. However, when the actual clock speed of the implementations is considered, the execution times of the three approaches appear as shown in Table 6.5. These were found by taking the clock cycles consumed by the three approaches and dividing them by the frequency of operation of each approach. It is clear from Table 6.5 that the Tensilica extended approach is the fastest among the three approaches, which can easily be justified by the high clock frequency used. Table

Benchmark               Pure          H/S            Handel-C     The Pure ISA    The Extended
                        software      Co-design                   Processor       ISA Processor
Benchmark 1                51,313        26,018          3,803        147,745         106,861
Benchmark 2               576,796       190,802         42,716        823,431         320,468
Benchmark 3             1,046,573       196,467         57,283      1,292,338         376,872
Benchmark 4             8,291,700     2,146,587        604,674      9,108,940       2,056,841
Benchmark 5            15,371,644     2,206,128        811,414     11,480,018       2,446,622
Benchmark 6            79,482,115    16,645,123            -       79,232,855      10,940,783
Max. Freq. (MHz)            130.1           125             11            358             358
Operation Freq. (MHz)          40            40             10            358             358

Table 6.4: The Consumed Clock Cycles

Benchmark          Pure         H/S           Handel-C    The Pure ISA    The Extended
                   software     Co-design                 Processor       ISA Processor
Benchmark 1           1.28         0.65          0.38          0.41           0.298
Benchmark 2          14.4          4.8           4.3           2.3            0.937
Benchmark 3          26            4.9           5.7           3.6            1.05
Benchmark 4         207           53.7          60            25.4            5.74
Benchmark 5         385           55            81.1          32              6.83
Benchmark 6       1,987          416              -          221.3           30.6
Average Speed up      -            4.3           3.9           7.6           33.6

Table 6.5: The Actual Execution Time of the Three Approaches in Milliseconds

6.6 presents the average speedup obtained by both Tensilica processors over the hardware/software co-design and the custom hardware approaches.

Approach            The Pure ISA    The Extended
                    Processor       ISA Processor
H/S co-design           1.79            7.81
Handel-C                1.86            8.61

Table 6.6: The Speedup Obtained by the Tensilica Approach over the H/S and Handel-C Approaches

6.4.2 Area

In comparing the three implementation approaches, it was found that the total equivalent gate count of the configurable processor (Tensilica), which is 56,699 gates, is significantly less than that of the other two implementations. The hardware/software co-design and the custom hardware approaches required 2,262,000 and 394,567 gates respectively.

The other comparison metric is the quality of the obtained solution; the experimental results show that the three approaches produce the same quality solutions as the pure-software approach.

6.5 Summary

In this chapter, the configurable processor implementation of Lee's routing algorithm using the Tensilica tools was investigated. The experimental results show that the three approaches produce the same quality solutions as the pure-software approach. The configurable approach obtains an average speedup of 33.6x over the pure software, while achieving speedups of 7.81x and 8.61x over the hardware/software co-design and the custom hardware respectively.

It is important to note that the proposed architectures in this thesis are based on either a pure RTL implementation (using Handel-C) or a hardware/software co-design approach that is either tightly coupled (the Tensilica ASIP) or loosely coupled (the MicroBlaze).

Therefore, it is not easy to compare this work in terms of performance and area with previously published work [6]. However, the proposed work offers better flexibility as the nature of Lee's algorithm changes.

Chapter 7

Conclusions

Since current VLSI circuits are very dense, the physical design process is becoming increasingly complex and slow. This has created a demand for faster and more efficient physical design automation techniques. Lee's routing algorithm [13] has proved to be an effective technique for VLSI circuit routing and is widely used in physical design automation applications. However, as the complexity and size of digital circuits increase, algorithms such as Lee's become too slow. The main goal of this thesis was to accelerate the performance of Lee's routing algorithm; to this end, three approaches were proposed and evaluated.

The first approach was based on a hardware/software co-design strategy, while the second was a custom hardware implementation using Handel-C [1]. In the third approach, enhancing the performance of Lee's algorithm by utilizing a Tensilica application specific instruction-set processor (ASIP) [28] was investigated. The experimental results indicate that the three proposed approaches produce the same quality solutions as the pure-software approach. The co-design approach achieved an average speedup of 4.3x over the pure-software based approach, while the custom hardware approach achieved a 3.9x speedup.

The configurable approach obtained an average speedup of 33.6x over the pure-software implementation, while achieving speedups of 7.81x and 8.61x over the hardware/software co-design and the custom hardware, respectively.

It is important to note that the proposed architectures in this thesis are based on either a pure RTL implementation (using Handel-C) or a hardware/software co-design approach that is either tightly coupled (the Tensilica ASIP) or loosely coupled (the MicroBlaze).

Therefore, it is not easy to compare the proposed work in terms of performance and area with previously published work [6]. However, it can be seen that the proposed work achieves better flexibility as the nature of Lee's algorithm changes.

7.1 Future Work

The main limitation of the proposed work is the small benchmarks used to test the proposed architectures, since the targeted FPGA boards had limited resources. A high-capacity FPGA chip such as the Virtex-4 XC4VFX140 could be used in the future. In addition, according to Moore's law the transistor density doubles approximately every eighteen months, so FPGA capacity will continue to increase significantly. Therefore, the proposed hardware/software co-design could be applied to more realistic benchmarks in the future.

The parallel structure used in the hardware/software co-design can be further expanded and enhanced, which should accelerate the performance even further. Furthermore, the proposed work should be extended to support multi-layer routing problems.

Bibliography

[1] Celoxica, "http://www.celoxica.com".
[2] M. Gokhale and P. Graham, Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays, Springer, 2005.
[3] M. Handa and R. Vemuri, "Hardware Assisted Two Dimensional Ultra Fast Placement", in IPDPS, pp. 140-147, Apr. 2004.
[4] A. DeHon, R. Huang and J. Wawrzynek, "Hardware-Assisted Fast Routing", in FCCM, pp. 205-215, Apr. 2002.
[5] J. Nestor, "L3: An FPGA-Based Multilayer Maze Routing Accelerator", Microprocessors and Microsystems, vol. 29, n. 2-3, pp. 87-97, Apr. 2005.
[6] J. Nestor and J. Lavine, "L4: An FPGA-Based Accelerator for Detailed Maze Routing", in FPL, pp. 357-362, Sep. 2007.
[7] Xilinx, "http://www.xilinx.com".
[8] A. Sghaier, "WiMAX Implementation on RCS", Master's thesis, University of Guelph, 2008.
[9] Amirix, "http://www.amirix.com".
[10] RC10 Manual, "http://www28.cs.kobe-u.ac.jp/kawapy/class/proj/proj07/RC10Manual.pdf".
[11] G. Moore, "Cramming More Components onto Integrated Circuits", Electronics, vol. 38, n. 8, pp. 82-85, Apr. 1965.
[12] N. Sherwani, Algorithms for VLSI Physical Design Automation, Kluwer Academic Publishers, 1993.
[13] C. Lee, "An Algorithm for Path Connection and its Application", IRE Transactions on Electronic Computers, vol. 10, pp. 346-365, 1961.
[14] Semiconductor Industry Association, "National Technology Roadmap for Semiconductors", 1997.
[15] The Sixteenth Reconfigurable Architectures Workshop (RAW09), "http://www.ece.lsu.edu/vaidy/raw/".
[16] The Twenty-Second Canadian Conference on Electrical and Computer Engineering (CCECE09), "http://www.ieee.ca/ccece09/".
[17] S. Sait and H. Youssef, VLSI Physical Design Automation: Theory and Practice, World Scientific Publishing Co. Pte. Ltd., 1999, reprinted 2001.
[18] T. C. Hu and E. Kuh, "Theory and Concepts of Circuit Layout", in VLSI Circuit Layout: Theory and Design, pp. 3-18, New York, 1985.
[19] A. Vannelli, A. Yang and S. Areibi, "An ILP Based Hierarchical Global Routing Approach for VLSI ASIC Design", Journal of Optimization Letters (Springer), vol. 1, n. 3, pp. 281-297, June 2007.
[20] L. Abel, "On the Ordering of Connections for Automatic Routing", IEEE Transactions on Computers, vol. C-21, pp. 1227-1233, Nov. 1972.
[21] W. Dees and P. Karger, "Automated Rip-up and Reroute Techniques", in Proceedings of the 19th Design Automation Conference, pp. 432-439, 1982.
[22] F. Hwang, "An O(n log n) Algorithm for Rectilinear Steiner Trees", Journal of the Association for Computing Machinery, vol. 26, n. 1, pp. 177-182, Apr. 1976.
[23] K. Compton and S. Hauck, "Reconfigurable Computing: A Survey of Systems and Software", ACM Computing Surveys, vol. 34, n. 2, pp. 171-210, June 2002.
[24] Y. Li, "Hardware-Software Co-design of Embedded Reconfigurable Architectures", in Design Automation Conference, pp. 507-512, 2000.
[25] G. Amdahl, "Validity of the Single-processor Approach to Achieving Large Scale Computing Capabilities", in AFIPS Conference Proceedings, pp. 483-485, 1967.
[26] P. Wilson, Design Recipes for FPGAs, MPG Books Ltd., Bodmin, Cornwall, 2007.
[27] L. McMurchie and C. Ebeling, "PathFinder: A Negotiation-Based Performance-Driven Router for FPGAs", in FPGA, pp. 111-117, 1995.
[28] Tensilica, "http://www.tensilica.com".
[29] ISPD98/IBM, "http://www.ece.ucsb.edu/kastner/labyrinth/benchmarks/".
[30] P. Chan and M. Schlag, "Acceleration of an FPGA Router", in IEEE Symposium on FPGAs for Custom Computing Machines, pp. 175-181, Apr. 1997.
[31] W. Tsu, K. Macy, A. Joshi, R. Huang, N. Walker, T. Tung, O. Rowhani, G. Varghese, J. Wawrzynek and A. DeHon, "HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array", in FPGA, pp. 125-134, Feb. 1999.
[32] R. Huang, J. Wawrzynek and A. DeHon, "Stochastic, Spatial Routing for Hypergraphs, Trees, and Meshes", in FPGA, pp. 78-87, Feb. 2003.
[33] T. Watanabe, H. Kitazawa and Y. Sugiyama, "A Parallel Adaptable Routing Algorithm and its Implementation on a Two-Dimensional Array Processor", IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 6, n. 2, pp. 241-250, Mar. 1987.
[34] Y. Won, S. Sahni and Y. El-ziq, "A Hardware Accelerator for Maze Routing", in DAC '87: Proceedings of the 24th ACM/IEEE Conference on Design Automation, pp. 800-806, 1987.
[35] R. Rutenbar, T. Mudge and D. Atkins, "A Class of Cellular Architectures to Support Physical Design Automation", IEEE Transactions on CAD, vol. 3, pp. 264-278, Oct. 1984.
[36] K. Suzuki, Y. Matsunaga, M. Tachibana and T. Ohtsuki, "A Hardware Maze Router with Application to Interactive Rip-Up and Reroute", IEEE Transactions on Computer-Aided Design, vol. 5, pp. 466-476, Oct. 1986.
[37] A. Iosupovici, "A Class of Array Architectures for Hardware Grid Routers", IEEE Transactions on Computer-Aided Design, vol. 5, pp. 245-255, Apr. 1986.
[38] T. Ryan and E. Rogers, "An ISMA Lee Router Accelerator", IEEE Design & Test, vol. 4, n. 5, pp. 38-45, Oct. 1987.
[39] R. Venkateswaran and P. Mazumder, "A Hexagonal Array Machine for Multi-Layer Wire Routing", IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 9, n. 10, pp. 1096-1112, Oct. 1990.
[40] J. Nestor, "A New Look at Hardware Maze Routing", in GLSVLSI '02: Proceedings of the 12th ACM Great Lakes Symposium on VLSI, pp. 142-147, Apr. 2002.
[41] J. Nestor, "FPGA Implementation of a Maze Routing Accelerator", in FPL, pp. 992-995, Sep. 2003.

Appendix A

Glossary

ASIC : Application Specific Integrated Circuit

ASIP : Application Specific Instruction-set Processor

BRAM : Block RAM

DSP : Digital Signal Processing

EDK : Embedded Development Kit

FPGA : Field Programmable Gate Array

FSL : Fast Simplex Link

GPP : General-purpose Processor

H/S : Hardware/Software

IDE : Integrated Development Environment

ISA : Instruction Set Architecture

LMB : Local Memory Bus

LUT : Look-Up Table

OPB : On-Chip Peripheral Bus

RC : Reconfigurable Computing


RTL : Register Transfer Language

SoC : System on Chip

TIE : Tensilica Instruction Extension Language

UART : Universal Asynchronous Receiver/Transmitter

VHDL : Very High Speed Integrated Circuit Hardware Description Language

Appendix B

AMIRIX AP1000 FPGA PCI Development Board

The Amirix AP1000 FPGA development board shown in Fig. B.1 was used as a prototyping board for the MicroBlaze implementations.

Figure B.1: Amirix AP1000 FPGA PCI Development Board [9]

The Amirix AP1000 FPGA development board is a PCI card. The board features:

• A Xilinx Virtex-II Pro XC2VP100 FPGA packaged in a 1704-pin ball grid array (BGA).


• Two banks of 64 MB DDR SDRAM.

• Two banks of 2 MB SRAM.

• 16 MB program flash.

• 16 MB configuration flash.

• 10/100 Ethernet interface.

• RS-232 port.

Appendix C

RC10

The RC10 board shown in Fig. C.1 was used as a prototyping board for the custom hardware approach using Celoxica Handel-C.


Figure C.1: The RC10 Board [10]


The RC10 board has a Xilinx Spartan-3 XC3S1500L-4-FG320 FPGA, which is directly connected to the following:

• USB.

• Video output.

• Audio output.

• RS-232 port.

• PS/2 connector.

• Expansion header.

• CAN bus connector.

• Servo motor connector.

• Analogue-to-digital converters.

• 8 green LEDs.

• 2 seven segment LED displays.

• 5-way micro joystick.

• TFT flat screen (if fitted).

Appendix D

The Netlist and the Placement Files

This appendix presents the structure of the netlist file and the placement file.

D.1 The Netlist File

This file contains the grid size, the number of nets, the number of pins, and the pin IDs of each net. The file structure is shown in Figure D.1.

D.2 The Placement File

This file contains the corresponding X and Y position of each pin on the grid.

Figure D.2 shows the structure of the placement file.

106 APPENDIX D. THE NETLIST AND THE PLACEMENT FILES 107

X Chip Size   Y Chip Size   Number of Pins (n)   Number of Nets (m)
Net 1: Number of Pins for this Net   Pin-ID   Pin-ID   ...
Net 2: Number of Pins for this Net   Pin-ID   Pin-ID   ...
...
Net m: Number of Pins for this Net   Pin-ID   Pin-ID   ...

Figure D.1: The Netlist File

Pin 1   X   Y
Pin 2   X   Y
...
Pin n   X   Y

Figure D.2: The Placement File
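
To make the two formats concrete, the following C sketch reads a netlist file and a placement file laid out as in Figures D.1 and D.2. It is a minimal illustration only: it assumes the "Net m:" and "Pin" labels in the figures are annotations rather than literal file contents, that pin IDs are numbered from 1 to n, and that the file names and struct layout are illustrative, not part of the benchmark specification.

#include <stdio.h>
#include <stdlib.h>

typedef struct { int x, y; } Pin;

int main(void)
{
    int grid_x, grid_y, num_pins, num_nets;

    /* Netlist file (Figure D.1): grid size and pin/net counts,
       then one record per net listing its pin IDs. */
    FILE *nl = fopen("benchmark.netlist", "r");   /* illustrative name */
    if (!nl) { perror("netlist"); return 1; }
    fscanf(nl, "%d %d %d %d", &grid_x, &grid_y, &num_pins, &num_nets);

    for (int n = 0; n < num_nets; n++) {
        int pins_in_net, pin_id;
        fscanf(nl, "%d", &pins_in_net);
        for (int p = 0; p < pins_in_net; p++) {
            fscanf(nl, "%d", &pin_id);
            /* here the router would record pin_id as a member of net n */
        }
    }
    fclose(nl);

    /* Placement file (Figure D.2): each line gives a pin number
       followed by its X and Y grid coordinates. */
    Pin *pins = malloc((num_pins + 1) * sizeof(Pin));  /* 1-based IDs assumed */
    if (!pins) { perror("malloc"); return 1; }
    FILE *pl = fopen("benchmark.placement", "r");      /* illustrative name */
    if (!pl) { perror("placement"); return 1; }
    for (int p = 0; p < num_pins; p++) {
        int id;
        fscanf(pl, "%d %d %d", &id, &pins[id].x, &pins[id].y);
    }
    fclose(pl);

    printf("%d x %d grid, %d pins, %d nets\n",
           grid_x, grid_y, num_pins, num_nets);
    free(pins);
    return 0;
}

In the routing implementations, the parsed pin coordinates would then be used to mark the source and target cells of each net on the routing grid before wave expansion begins.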