ABSTRACT

WIDIALAKSONO, RANDY HARI. Three-Dimensional Integration of Heterogeneous Multi-Core Processors. (Under the direction of Dr. Paul Franzon and Dr. W. Rhett Davis.)

This dissertation explores the advantages of and design methodology for 3D integration in the context of building heterogeneous multi-core processors. The processor features a fast thread migration and cache core decoupling scheme. First, we present empirical results in a commercial 130 nm process. We demonstrate that the 3D implementation of a heterogeneous multi-core processor consumes 31% less power and achieves 22% shorter average wirelength than the 2D implementation. Second, this work presents the physical design methodology used for the tape-out of a die-stacked 3D-IC processor. Finally, we propose a new algorithm and methodology for timing-driven 3D-IC via assignment. Experimental results show up to 30% improvement in total negative slack compared to a via assignment algorithm with a total wirelength objective function.

© Copyright 2016 by Randy Hari Widialaksono

All Rights Reserved

Three-Dimensional Integration of Heterogeneous Multi-Core Processors

by Randy Hari Widialaksono

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Computer Engineering

Raleigh, North Carolina

2016

APPROVED BY:

Dr. Eric Rotenberg
Dr. Agnes Szanto

Dr. Paul Franzon, Co-chair of Advisory Committee
Dr. W. Rhett Davis, Co-chair of Advisory Committee

DEDICATION

Dedicated to my wife and my parents who instilled the importance of pursuing and applying knowledge.

BIOGRAPHY

Randy Widialaksono was born in Jakarta, Indonesia. He completed his Bachelors in Electrical Engineering at Institut Teknologi Bandung, Indonesia, in 2009. He started his Ph.D. in Computer Engineering at North Carolina State University in 2010. His research focus is on design implementation methodologies for realizing 3D integrated circuits. He also maintains an active interest in computer architecture, digital VLSI design, and machine learning. He has been an IEEE member since 2008.

ACKNOWLEDGEMENTS

First of all, I would like to thank both of my advisors, Dr. Paul Franzon and Dr. W. Rhett Davis, for being supportive and providing the opportunity for a rewarding research project. I would also like to thank the following faculty: Dr. Eric Rotenberg for teaching advanced computer micro-architecture concepts; Dr. Krishnendu Chakrabarty at Duke University for collaborating on our research project and welcoming me to his DFT course and research; and Dr. Agnes Szanto for feedback on the assignment problem and for teaching computer algebra.

I would like to thank the following people for their contributions that made this dissertation possible: Dr. Steve Lipa for his tremendous contributions in deploying the design kit infrastructure, signing off our tapeouts, and developing numerous EDA utilities. Zhenqian Zhang for being a great colleague throughout the research project and helping in the final days of the tapeout performing timing ECO fixes. Bagus Wibowo for collaboration on the timing-driven via assignment experiments. Wenxu Zhao for collaboration on papers and helpful technical discussions. Josh Ledford for developing customized I/O pads for the 3D-IC tapeout process. Jongbeum Park for sharing insights on device and interconnect scaling. Kirti Bhanushali and Dr. T. Robert Harris for proofreading and presentation feedback. Thor Thorollfsson for mentoring and sharing tape-out/Ph.D. experience. Elliott Forbes for conducting chip bringup of the 2D prototype. Rangeen for taking part in physical design for tapeouts and collaboration on papers. Brandon Dwiel for establishing the processor implementation. Vinesh Srinivasan for providing and verifying the 3D processor netlist. Qouniitah Fadhilah for full support throughout the doctoral program, proofreading, and assisting in the graphics and typesetting department.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1 Introduction
  1.1 Overview of the Following Chapters
  1.2 Abbreviations

Chapter 2 3D Integration
  2.1 3D Multi-core Processor
  2.2 Challenges for 3D
  2.3 Design for Test
    2.3.1 3D DFT Overhead
  2.4 On-chip Timing Measurements
  2.5 Routability Improvement
  2.6 3D Via Assignment

Chapter 3 3D-IC Physical Design Methodology
  3.1 Fabrication Process Technology
  3.2 Processor Architecture
    3.2.1 2D Prototype
    3.2.2 3D Architecture
  3.3 Design Flow
  3.4 Floorplan
  3.5 Via Assignment
    3.5.1 Visualization Tool
  3.6 Power Delivery
  3.7 Timing
    3.7.1 Timing Constraints and Analysis
    3.7.2 Inter-tier Clock Skew Balancing
  3.8 Physical Verification
    3.8.1 Design Rule Checks
    3.8.2 Connectivity Checks
  3.9 Physical Design Metrics

Chapter 4 3D-IC Benefits Case Study
  4.1 2D vs 3D Register File
    4.1.1 Experimental Framework
    4.1.2 Floorplan
    4.1.3 Area Comparison
    4.1.4 Power Analysis
    4.1.5 Face-to-face Via Pitch Analysis
    4.1.6 Routing Congestion
    4.1.7 Wirelength Analysis
  4.2 2D vs 3D Processor
    4.2.1 Floorplan
    4.2.2 Wirelength Analysis
    4.2.3 Power Analysis
    4.2.4 Path Delay Analysis
  4.3 Conclusion

Chapter 5 Timing Driven Via Assignment in 3D-IC
  5.1 Timing Metrics
  5.2 Optimal Assignment
  5.3 Nearest-Neighbor Assignment
    5.3.1 Timing-Ordered
    5.3.2 Contention Based
  5.4 Resolving Multiple Sinks
    5.4.1 Fan-In
    5.4.2 Fan-Out
  5.5 Timing Aware Cost Function
  5.6 Congestion Avoidance
  5.7 Experiment Results
    5.7.1 Framework
    5.7.2 Runtime
    5.7.3 Parameter Search
    5.7.4 Wirelength Comparison
    5.7.5 Quality of Result Comparison
  5.8 Conclusion

Chapter 6 Conclusion and Future Work
  6.1 Summary of Contributions
  6.2 Future Work

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1 Process technology metrics
Table 3.2 H3 core types
Table 3.3 FabScalar processor metrics [33]
Table 3.4 Estimated maximum currents per metal width for vias and metals (mA per µm) [12]
Table 3.5 Physical design metrics of the fabricated 3D-IC processor
Table 4.1 Face-to-face via experiment parameters
Table 5.1 Assignment runtime with 2500 x 2500 problem size (seconds)
Table 5.2 Comparison of total wirelength between via assignment schemes (µm)
Table 5.3 Comparison of WNS between via assignment schemes (ns)

LIST OF FIGURES

Figure 2.1 Vernier TDC architecture for 3D on-chip timing measurements
Figure 2.2 3D on-chip timing measurement scheme
Figure 3.1 Cross-section of 3D-IC stack
Figure 3.2 Prototype fabricated in IBM-8RF 130 nm
Figure 3.3 Inter-core state transfer scheme: fast thread migration (FTM) [10]
Figure 3.4 Inter-core state transfer scheme: cache core decoupling (CCD) [10]
Figure 3.5 The 3D-IC physical design flow
Figure 3.6 Detailed 3D-IC EDA tool flow
Figure 3.7 3D-IC heterogeneous processor floorplan
Figure 3.8 Inter-tier signal to via assignment flow
Figure 3.9 F2F via visualization and analysis tool
Figure 3.10 Physical design constraint model at the face-to-face interface
Figure 4.1 Illustration of processor state transfer
Figure 4.2 2D-FTM vs 3D-FTM area comparison
Figure 4.3 Buffer insertion requirement for low-latency inter-core data transfer in 2D
Figure 4.4 Layout of the 2D-TRF
Figure 4.5 Layout of the 3D-TRF
Figure 4.6 Power consumption comparison of PRF implementations
Figure 4.7 Impact of F2F pitch to wirelength
Figure 4.8 Average wirelength of PRF output datapath signals
Figure 4.9 Overall average wirelength of design
Figure 4.10 Routing detour caused by congestion from high inter-tier signal pin density
Figure 4.11 Comparison of TRF track consumption
Figure 4.12 Comparison of TRF track overflow
Figure 4.13 Processor floorplans for comparative analysis
Figure 4.14 Comparison of average wirelength between 2D and 3D floorplans
Figure 4.15 Processor power consumption comparison
Figure 4.16 Comparison of TRF datapath path delay between 2D and 3D floorplans
Figure 4.17 Datapath of remote instruction cache read access with cache-core decoupling
Figure 4.18 Single cycle datapath of remote data cache read access with cache-core decoupling
Figure 4.19 Comparison of cache datapath path delay between 2D and 3D floorplans
Figure 5.1 Minimum weight matching in bipartite graph problem
Figure 5.2 Flow for optimal via assignment
Figure 5.3 Inter-tier signal sinks competing for nearest available F2F via
Figure 5.4 Priority queue based assignment
Figure 5.5 Iterative assignment based on contended bondpoints
Figure 5.6 Resolving inter-tier signal nets with multiple fan-in
Figure 5.7 Inter-tier signal nets with multiple fan-out
Figure 5.8 Static weighting concept used in the cost function of the optimal assignment
Figure 5.9 Modelling net weights as a dimension in neighborhood search assignment
Figure 5.10 Normalized TNS comparison between assignment schemes

CHAPTER 1

INTRODUCTION

Process technology scaling has provided performance and power improvements in digital integrated circuits. Interconnect performance, however, does not scale proportionally with device scaling. In modern process geometries, interconnect delay and routing congestion heavily impact the achieved system performance.

3D integration offers potential performance and power improvements by reducing planar routing requirements. This is achieved during design by replacing long interconnects with routing through the vertical inter-tier interface. This wirelength benefit translates to reduced power consumption and area.

In the context of a heterogeneous multi-core processor, a stack of two cores is integrated using a thread migration bus which enables fast state transfer. The thread migration bus requires low latency, a wide interface, and inter-core connection points, which would be very challenging to implement in 2D. Since a CPU core generally has very tight timing requirements, intra-core floorplan requirements may conflict with those of the inter-core bus. 3D integration resolves this conflict with the addition of a vertical channel.

In this dissertation, we first introduce the physical design methodology used for the tapeout of a 3D die-stacked heterogeneous multi-core processor. We then present a comparative analysis of the processor and thread migration unit implemented in 2D versus 3D. Finally, we present a new 3D via assignment method for timing optimization.

1.1 Overview of the Following Chapters

As 3D integration is the focus of this dissertation, Chapter 2 provides an overview of 3D-IC with several relevant sub-topics. Chapter 3 presents the physical design methodology used for the tape-out of a heterogeneous multi-core processor in 3D-IC. Next, Chapter 4 presents a comparative analysis between the 3D-IC and 2D implementations of a heterogeneous multi-core processor. Finally, Chapter 5 proposes a methodology for timing-driven via assignment in 3D-IC physical design. The dissertation then concludes with a summary of contributions and a discussion of future work.


1.2 Abbreviations

3D-IC Three-dimensional integrated circuit

ECO Engineering change order

CAD Computer-aided design

CCD Cache Core Decoupling

DRC Design Rule Check

EDA Electronic design automation

F2B Face to back

F2F Face to face

FTM Fast Thread Migration

HMP Heterogeneous multi-core processor

IF Instruction Fetch

IP Intellectual Property

LVS Layout versus schematic

P&R Place and Route

RF Register File

TRF Teleport Register File

TSV Through-silicon via

SRAM Static random access memory

CHAPTER 2

3D INTEGRATION

The scaling of silicon metal-oxide-semiconductor field-effect transistors (MOSFETs) has driven performance and power improvements in integrated circuits. In 1965 Gordon Moore predicted the increasing trend of the number of transistors on a single chip [1]. Dennard later proposed the scaling law which coupled geometry scaling with device characteristics, supply voltage, on-chip interconnect performance, and circuit performance [2]. However, recent developments in process technology show that Moore's Law scaling trend is slowing down, caused by multiple physical limitations. First, to account for short-channel effects and maintain device characteristics (e.g. threshold voltage, leakage current, drain saturation current), process engineers have proposed new nonclassical complementary MOS (CMOS) materials and structures [3]. Second, lithography technology is also reaching physical limits and is becoming prohibitively expensive. Finally, since power supply voltage has not continued to scale down, power density increases, which introduces thermal and reliability concerns. Improving performance through faster clock frequency is no longer a viable option due to thermal hotspots.

Multi-core processor architectures have become a driver for improving system-level performance. Multi-core processors gain performance through parallel computation, with theoretical speedup given by Amdahl's law. Since performance enhancements can be achieved by adding cores, the performance improvement requirement of individual cores is relaxed, which allows for reduced clock frequency and supply voltage. However, multi-core architectures require increased memory bandwidth and can run into the memory wall problem, whereby memory bandwidth becomes the bottleneck of system performance.
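Amdahl's law can be made concrete with a short calculation; the parallel fraction and core count below are arbitrary example values, not figures from this work.

```python
# Amdahl's law: theoretical speedup when a fraction p of the work is
# parallelizable and runs on n cores. Example values are illustrative.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# Even a 90%-parallel workload falls well short of 8x speedup on 8 cores,
# which is why per-core efficiency still matters.
print(round(amdahl_speedup(0.9, 8), 2))  # 4.71
```

The diminishing return as n grows is what motivates trading raw core count for heterogeneity.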

3D integrated circuit (3D-IC) technology is an emerging approach that enables memory bandwidth improvements, facilitated by dense interconnect between devices that are stacked and bonded together (die-stacking). 3D-IC also enables larger on-chip memory size by stacking additional memory on the chip. Since accessing memory through the 3D interconnect requires less power, this improves the performance-to-power ratio (energy efficiency) of the system.

3D-IC technology has also demonstrated performance and power benefits through wirelength reduction. Wire delay has become an increasingly dominant factor in the timing budget compared to gate delay: geometry scaling continues to improve transistor performance, but wire delay does not scale accordingly. The vertical interconnect provided by 3D-IC enables reduction of both average and critical path length. Other 3D-IC applications include realizing heterogeneous process integration [4] and enabling new architectural features [5, 6, 7].
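The wirelength argument can be sketched numerically. The coordinates and via length below are hypothetical, chosen only to show how folding a long planar route onto a second tier shortens the Manhattan distance.

```python
# Hypothetical sketch: Manhattan wirelength of a net between two blocks,
# comparing a planar 2D route with a two-tier 3D stack where the second
# block is placed almost directly above the first.
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

src = (0, 0)                 # source pin (um), illustrative
dst_2d = (4000, 3000)        # sink pin across the 2D die
wl_2d = manhattan(src, dst_2d)

dst_3d = (200, 150)          # sink's projected location after folding to tier 2
f2f_hop = 6                  # assumed face-to-face via traversal (um)
wl_3d = manhattan(src, dst_3d) + f2f_hop

print(wl_2d, wl_3d)  # 7000 356
```

The planar route pays the full die-spanning distance, while the folded route pays only the residual planar offset plus a short vertical hop.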

3D integration has gained commercial traction for memory and FPGA applications. The Hybrid Memory Cube is a stacked RAM technology proposed by Micron [8]. The memory system consists of one logic die as the base, stacked with up to four memory dies using through-silicon vias (TSVs). The 3D-IC implementation demonstrates increased performance, lower energy consumption, and higher bandwidth at a smaller form factor compared to 2D technology.


Another commercially released 3D-IC product is the Virtex 7 FPGA by Xilinx [9], which utilizes the stacked silicon interconnect (SSI) technology, also known as the 2.5D silicon interposer.

2.1 3D Multi-core Processor

As the performance advantage from technology scaling slows down, computer architects are looking at architectural techniques to maintain the trend of performance improvement within the power consumption budget. One promising technique is the 3D-stacked single-ISA heterogeneous multi-core processor with provision for fast thread migration [10]. The vertical interconnect enables fast inter-core data transfer, thus enabling improvements in overall system performance and energy efficiency. We have designed and fabricated a 3D-stacked heterogeneous multi-core processor, based on a fabricated and tested 2D version of this processor design.

For logic-on-logic designs such as the proposed 3D heterogeneous multi-core processor (HMP), the design process begins with a partitioning strategy. Partitioning can be conducted at multiple levels of granularity. In block-level partitioning, existing blocks are split between multiple dies with the floorplanning strategy of converting long critical global interconnect into short connections through the vertical interface. This partitioning strategy has been demonstrated by Intel [11], showing 15% improved system performance by reducing interconnect between critical inter-block datapaths. One of the critical datapaths was the wiring between the data cache and the functional unit. By stacking them in a 3D implementation, the wire latency was reduced by one clock cycle. This optimization reduced the load-use latency by one cycle and improved energy per instruction [11].

To enable a modular stackable die concept, we partitioned the processor system such that each die can operate independently. Hence a multi-core die could be stacked with another multi-core die or DRAM memory. This feature allows for a product customization strategy, whereby a portfolio of multi-core chips could compose many different HMPs.


A heterogeneous multi-core processor benefits from frequent thread migration when migration provides a performance or energy efficiency improvement [4, 10]. Frequent thread migration is enabled by low-overhead inter-core data migration mechanisms, which are challenging to implement during physical design. The thread transfer scheme requires a dedicated bus interconnect which can span a great distance across the chip, and is hence costly in terms of area, power, and routing resources. 3D die stacking provides a vertical interconnect which can be leveraged for fast inter-core data transfer.

2.2 Challenges for 3D

There are five main design challenges for 3D-IC implementation: power delivery, heat density, clock tree distribution, design for test, and floorplanning [12]. First, the power delivery network needs to be designed with higher current carrying capacity and requires additional analysis for the inter-tier interface. Second, the heat density in 3D integration is higher due to the stacking of heat dissipating devices. One possible solution is to incorporate thermal vias into the design, thereby lowering the effective thermal resistance of the chip [13]. The third challenge is clock tree distribution. If the clock tree consists of clock buffers located in multiple tiers, process variations can cause large inter-tier clock skew. Design for test is also a challenge in 3D systems because TSVs are an additional point of failure; it can also be more costly in 3D than in 2D [14]. Finally, floorplanning in 3D must be done with care to avoid congestion due to inter-tier via connections. Floorplanning must also consider potential thermal hot-spots due to the stacking of modules with large heat dissipation. Solutions to these challenges are active fields of research as 3D-IC gains commercial adoption.


2.3 Design for Test

The physical interface between the dies in a 3D-stacked chip adds potential sources of manufacturing defects, both solder micro-bumps and especially TSVs. The TSVs are fabricated as cylindrical copper nails providing electrical connection from the active side through the silicon substrate to the back side of a silicon die [15].

Prior to bonding, TSVs are not fully accessible because one of their ends is not yet connected to logic on other dies. These one-ended TSVs lead to pre-bond testability problems of reduced controllability and observability. The combinational die logic between the last level of scan cells and outbound TSVs cannot be observed, while the logic between inbound TSVs and the first level of scan cells cannot be controlled. While this problem is more serious for logic-on-logic stacks, it is also a concern for the logic die in memory-on-logic designs.

Pre-bond testing is necessary for achieving high stack yield through identification of known-good-dies (KGDs) before stacking. Without this capability, there is a risk of stacking a good die with a faulty die, reducing the overall chip stack yield. Without new design-for-test (DFT) techniques, the dies cannot be adequately tested before bonding. One solution is to add die wrapper cells to enable both pre-bond and post-bond testing. However, with a large number of TSVs, this would be prohibitively expensive to implement due to the large silicon area overhead.

2.3.1 3D DFT Overhead

Wrapper cell insertion is a technique to enable pre-bond testing, overcoming controllability and observability bottlenecks. Wrapper cells (WC) are inserted at the two ends of a TSV to facilitate post-bond testing of dies and the interconnects between dies [15]. Since the number of TSVs on a die can be on the order of tens of thousands, the use of a WC for each TSV can lead to significant area overhead. Moreover, WC on functional paths can lead to higher latency and performance degradation. To reduce the overhead of WC, the reuse of existing primary inputs (PIs), primary outputs (POs), and scan cells for increasing controllability and observability was proposed in [16]. The WC are required only for TSVs that cannot be controlled or observed using existing scan cells in the design. The work in [17] proposed a method to minimize the number of WC required for full testability, and showed that the general problem of minimizing the WC count is NP-hard.

2.4 On-chip Timing Measurements

Modern high-performance designs require accurate on-chip timing uncertainty measurements for post-silicon validation of high-speed interfaces and clock distribution networks. With increasing design complexity and process variations, post-silicon validation and debug capabilities should be improved to meet competitive product time-to-market. However, the act of stacking dies in 3D integration prohibits conventional probing techniques that require physical contact. On the other hand, the vertical interface can be leveraged to enhance on-chip measurements by improving accessibility to measurement points of interest.

Chip debug engineers probe clock sinks to observe the effect of multiple variation sources on the clock distribution network [18]. In addition, power and thermal events can be detected through the measured timing variation. By adding more on-chip timing sensors, chip debug engineers gain a better understanding of on-chip timing uncertainty.

Measurements through the inter-tier interface require modifications to existing measurement methodologies and circuit architectures. An on-chip timing measurement methodology for 3D-IC has been proposed in [19]. The proposed methodology consists of four major components: time-to-digital converters (TDCs), the clock sinks of interest, inter-tier vias, and a reference edge signal.

The proposed circuit architecture shown in Figure 2.1 is a Vernier delay line based TDC. The Vernier delay line is known for its sub-gate delay time resolution and robustness against process, voltage, and temperature (PVT) variations [20]. The components of this circuit architecture are as follows. The sampling flip-flops function as an early-late detector between the two inverter chains. The edge location detection logic is used to enable multi-cycle jitter measurements. The final step is to capture and store the measurement results in the accumulation flip-flop chain until scanned out by the user.

Figure 2.1 Vernier TDC architecture for 3D on-chip timing measurements

As illustrated in Figure 2.2, clock skew is measured by comparing different design-under-test (DUT) clocks to a reference signal. The DUT clock and the reference signal are sent down the leading and lagging inverter chains respectively. The sampling flip-flop chain indicates how far the leading edge has propagated before the lagging edge arrives. Afterward, the sampled signals are processed by the edge detection logic and latched into the accumulation flip-flops when triggered by the "Sample" signal. This "Sample" signal can be generated by delaying the reference signal by the edge detection logic propagation delay, or independently. The time between the reference edge and the DUT clock edge is obtained by multiplying the number of consecutive zeros by the Vernier delay line resolution. Finally, the skew between two clock sinks is calculated by comparing the outputs of their TDCs.


Figure 2.2 3D on-chip timing measurement scheme

2.5 Routability Improvement

3D integration can improve the routability of a design by reducing total wirelength and enabling new system architectures. Consider a multi-core processor system which gains a performance-per-energy benefit from fast state migration between its cores. Due to intra-core floorplan constraints, it is not always feasible to place communicating structures next to each other. An alternative would be to use the existing inter-core bus, but its bandwidth and routing resources are commonly used to support a shared last-level cache (LLC).

Moreover, if the performance constraint requires a low-latency transfer, timing and crosstalk problems must be mitigated through buffer insertion, which increases the area cost and power consumption of the system. While the wirelength benefits of 3D have a theoretical upper bound [21], 3D integration can enable new architectural features that would otherwise be prohibitively expensive to implement in 2D.

2.6 3D Via Assignment

3D integration requires new CAD tools and methodologies for design, analysis, and optimization [22]. In 3D logic-on-logic implementations, each inter-tier signal needs to be assigned to a 3D via, similar to assigning off-chip signals to area I/O pads in the flip-chip methodology, but different in context and problem scale. The via assignment problem is a special case of the pin assignment problem where the pin locations are pre-determined at specific sites. The problem is a known chicken-and-egg problem: the placement step needs pin locations as a constraint, while the pin locations should be optimized based on cell placement [23]. In 2D-IC design flows, the I/O signals can be manually assigned to I/O pins based on design specifications and the board interface.

There can be tens of thousands of inter-tier signals in a 3D-IC design, which makes manual 3D via assignment impractical. Hence a number of automated techniques have been proposed for the inter-tier signal to via assignment step. These techniques can be classified by their approach: finding an optimal solution to a cost function, or using a greedy assignment approach.

The via assignment problem can be optimally solved using the Hungarian method [24], as demonstrated in [25, 26]. Alternatively, the problem can also be optimally solved using a max-flow formulation as proposed in [27]. The optimal techniques can be impractical to use for large numbers of assignments due to their cubic time complexity.
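A toy instance makes the optimal formulation concrete. The pin and via coordinates below are hypothetical; brute force over permutations is used only because the instance is tiny, whereas the Hungarian method reaches the same optimum in cubic time.

```python
# Toy optimal via assignment: minimize total Manhattan wirelength over all
# signal-to-via pairings. Coordinates are hypothetical. Brute force works
# at this size; the Hungarian method gives the same optimum in O(n^3).
from itertools import permutations

pins = [(0, 0), (10, 0), (0, 10)]   # inter-tier signal pins
vias = [(1, 1), (9, 2), (2, 9)]     # available F2F via sites

def dist(p, v):
    return abs(p[0] - v[0]) + abs(p[1] - v[1])

best = min(permutations(range(len(vias))),
           key=lambda perm: sum(dist(p, vias[i]) for p, i in zip(pins, perm)))
total = sum(dist(p, vias[i]) for p, i in zip(pins, best))
print(best, total)  # (0, 1, 2) 8
```

The cost matrix of pairwise Manhattan distances is exactly the input the Hungarian solver would take.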

The nearest-neighbor technique proposed in [25, 28] assigns an inter-tier signal to the available through-tier via nearest to its source/sink cell. A greedy technique based on Lee's routing algorithm was proposed in [29]. In the nearest-neighbor approach, the wirelength between the signal pin and the via is commonly used as the cost function, and is approximated by computing the Manhattan distance. The nearest-neighbor query can be accelerated using a k-d tree [30], as demonstrated in [19].
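A minimal sketch of the greedy scheme, with hypothetical coordinates; a linear scan stands in for the k-d tree query, which changes only the lookup speed, not the result.

```python
# Greedy nearest-neighbor via assignment: each signal takes the closest
# still-available via site. A linear scan is used here; a k-d tree would
# accelerate the nearest query at realistic problem sizes.
pins = [(0, 0), (10, 0), (0, 10)]   # inter-tier signal pins, illustrative
vias = [(1, 1), (9, 2), (2, 9)]     # available F2F via sites

def dist(p, v):
    return abs(p[0] - v[0]) + abs(p[1] - v[1])

available = set(range(len(vias)))
assignment = {}
for s, p in enumerate(pins):        # note: the visiting order matters
    best = min(available, key=lambda i: dist(p, vias[i]))
    assignment[s] = best
    available.remove(best)

print(assignment)  # {0: 0, 1: 1, 2: 2}
```

Because each choice removes a via from the pool, signals assigned late can be pushed to distant sites, which is the contention problem revisited in Chapter 5.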

The cost function used during via assignment could be enhanced to improve design quality of results. For instance, routability can be improved by incorporating weights based on local routing congestion [31]. In Chapter 5, we will describe a new cost function for enhancing timing quality of results.

CHAPTER 3

3D-IC PHYSICAL DESIGN METHODOLOGY

3.1 Fabrication Process Technology

The process technology used to fabricate our die-stacked processor was the GlobalFoundries 130 nm CMOS process followed by Ziptronix wafer-to-wafer bonding [32]. A cross-section of logic-on-logic die stacking is illustrated in Figure 3.1. The process consisted of 8 metal layers: the first 6 metal layers were used for routing, the sixth and seventh metal layers were mainly used for power delivery, and the final metal layer was used to create inter-tier bond-points/face-to-face vias. Since the top metal layer is used for face-to-face bonding, existing I/O pad designs need to be modified with TSV openings to allow connections to the board through wire bonding.

Figure 3.1 Cross-section of 3D-IC stack

Table 3.1 Process technology metrics

Process technology: GlobalFoundries CMOS 130 nm
3D integration technology: Ziptronix face-to-face bonding, TSV
Standard cell library: ARM SAGE
I/O pad library: ARM GPIO Inline, custom TSV opening
Memory macro generator: ARM Artisan Physical IP
Metal layers: 6 thin + 2 thick
Face-to-face via pitch: 8 µm
Face-to-face via diameter: 5 µm

In the Ziptronix bonding process technology, a grid of inter-tier vias is created to maintain planarity and structural integrity of the chip [32]. This grid of vias provides a dense interface between the dies. The intellectual property (IP) designs used for the design are specified in Table 3.1.

Each core of the two-core stack is fabricated on an independent die. The fabricated design is a multi-project chip with four designs on it. In addition to the heterogeneous processor, the chip also has an experimental SIMD core design, a stacked DRAM cache controller, and a prototype cross-tier asynchronous communication bus.


Table 3.2 H3 core types

                Core Type 1    Core Type 2
Frontend Width  2              1
Issue Width     3              3
Depth           9              9
IQ Size         32             16
PRF Size        96             64
LQ/SQ Size      16/16          16/16
ROB Size        64             32
L1 I-Cache      private, 4 KB, 1-way, 8 B block, 1 cycle, prefetch: no
L1 D-Cache      private, 8 KB, 4-way, 16 B block, 2 cycle, prefetch: no

3.2 Processor Architecture

Heterogeneous multi-core processors (HMPs) consist of multiple core types that are functionally equivalent but have diverse microarchitectural features. This micro-architectural diversity provides new performance and power levers [10]. Different phases in a program's instruction stream differ in the distribution of performance degrading events (e.g. branch mispredictions, cache misses) and instruction-level parallelism (ILP). As the behavior of the executing program varies, the program is migrated to the most suitable core type based on a figure of merit.

The two cores in the fabricated 3D HMP chip were generated using FabScalar [33] and have different microarchitectures, as shown in Table 3.2. The cores in each tier of the stack are connected by high-bandwidth cross-tier buses that allow fast thread migration with a latency of less than a hundred clock cycles. The two microarchitectural mechanisms for achieving fast state transfer in this processor design are Fast Thread Migration (FTM) and Cache-Core Decoupling (CCD), shown in Figure 3.3 and Figure 3.4 respectively. This HMP system is planned to be integrated with a die-stacked DRAM, forming a memory-logic 3D-IC system.


Table 3.3 FabScalar processor metrics [33]

Instruction Cache Size    2 KB
Data Cache Size           2 KB
Physical Register File    96 entries x 32 bits
SRAM Memory Macros        34
Inter-tier Signal Nets    6,077

Figure 3.2 Prototype fabricated in IBM-8RF 130 nm

3.2.1 2D Prototype

The annotated die photo of the 2D prototype fabricated with IBM–8RF 130 nm process is shown in Figure 3.2. The debug core in the center of the die is an in-order FabScalar core with full scan chain insertion. The branch target buffer (BTB), L1 instruction cache, and L1 data cache were implemented with SRAM macros. The heterogeneous core pairs communicated through the thread transfer bus at the center and demonstrated successful thread migration. This prototype has gone through chip bring-up and has been thoroughly tested for functionality.



Figure 3.3 Inter-core state transfer scheme: fast thread migration (FTM) [10]


Figure 3.4 Inter-core state transfer scheme: cache core decoupling (CCD) [10]

3.2.2 3D Architecture

Another feature of the fabricated 3D-IC FabScalar processor is the cache-core decoupling scheme shown in Figure 3.4. In this scheme, either core can switch its access to the other core’s L1 caches. CCD was implemented by multiplexing the clock, address, and data signals of both the instruction and data L1 caches. The state elements used to implement the L1 data cache and physical register file were flip-flops. The branch target buffer (BTB) and L1 instruction cache were implemented with compiled static random access memory (SRAM).


3.3 Design Flow

The design flow shown in Figure 3.5 begins with register-transfer level (RTL) code for each design. Multiple designers followed a port naming convention to facilitate parsing in the subsequent design automation scripts. Each design was synthesized using Synopsys Design Compiler with a target clock period of 15 ns.

Since the tape-out consisted of multiple experiments, a merging process is required prior to physical design, as shown in Figure 3.6. This merging process was automated to ensure correctness, since the design netlists may be updated throughout the project. The merging process consists of two steps: top module netlist generation and I/O pad allocation. First, to aid top netlist generation, F2F ports are distinguished from I/O pad ports in Synopsys Design Compiler. Second, the I/O pad ordering specified by the designer is merged and exported into the floorplan format of Cadence Encounter. At this point, all inputs are ready for placement and routing.
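The port naming convention mentioned above makes the F2F/I/O classification step scriptable. The sketch below is illustrative only: the `f2f_` prefix and the port names are assumptions, not the project's actual convention.

```python
# Hypothetical convention: inter-tier (face-to-face) ports carry an "f2f_"
# prefix; every other port is allocated to an I/O pad.
def split_ports(port_names):
    """Partition a netlist's port list into F2F ports and I/O pad ports."""
    f2f = [p for p in port_names if p.startswith("f2f_")]
    io = [p for p in port_names if not p.startswith("f2f_")]
    return f2f, io

ports = ["clk", "f2f_trf_data[0]", "f2f_trf_data[1]", "rst_n"]
f2f_ports, io_ports = split_ports(ports)
print(f2f_ports)  # ['f2f_trf_data[0]', 'f2f_trf_data[1]']
```

A parser in the actual flow would first extract the port list from the gate-level Verilog before applying such a filter.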

The netlist of the first tier is modified by removing the inter-tier signal ports. These ports are initially removed because the placement step requires pin locations to be defined; otherwise, the place and route tool treats the F2F pins as regular I/O pins and places them at arbitrary locations along the chip edge. This would cause many gates connected to inter-tier signals to be placed near the chip boundary, causing heavy congestion. The inter-tier signal ports are added back to the design prior to the via assignment step. A common floorplan and powerplan are then created for both tiers, followed by initial placement and clock tree synthesis (CTS) on the first tier.

Next, F2F ports are added back using engineering change order (ECO) commands, followed by loading the timing constraints for these ports. At this stage, the F2F via assignment can be performed based on the placement of the first tier. The via assignment result becomes a physical constraint on the placement and routing of the second tier. After the via assignment step, the design can finally be routed.

Figure 3.5 The 3D-IC physical design flow

After the design was placed and routed, parasitics were extracted from both layouts to perform static timing analysis (STA) on the entire system using Synopsys PrimeTime. Based on the timing reports, we re-configured the timing optimizations performed during place and route to minimize the number of manual fixes. Finally, physical verification (i.e. DRC and LVS) was performed using Mentor Graphics Calibre.

3.4 Floorplan

The floorplan of the 3D-IC design is shown in Figure 3.7. The top die contains three experiments: a high-performance FabScalar core, a vector core, and an isolated F2F/F2B bus experiment; the bottom die adds an L2 cache controller experiment with its own I/O pads. To conserve area, the physical design used a flat full-chip flow instead of a hierarchical methodology.

Figure 3.6 Detailed 3D-IC EDA tool flow

3.5 Via Assignment

The first step in the via assignment flow was to assign F2F vias for power delivery. The via stacks which connect the F2F vias to the power ring and power stripes were inserted using a custom fill script after place and route.

Second, we removed F2F vias located above memory macros from the list of vias available for inter-tier signal assignment. Routing access to these F2F vias is limited by the hard routing blockages of the macros. By the same reasoning, F2F vias in congested routing regions would also be problematic.

Finally, we obtain a list of F2F vias available for assignment to inter-tier signals. The assignment was conducted using a greedy nearest-neighbor technique implemented in Python. Chapter 4 elaborates on the assignment algorithm used.
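As a rough sketch of the greedy nearest-neighbor idea — not the actual tape-out script, and with illustrative data structures — each inter-tier signal is assigned the closest still-unused F2F via site:

```python
from math import hypot

def assign_vias(pins, vias):
    """Greedily assign each inter-tier signal to the nearest free F2F via.

    pins: dict mapping signal name -> (x, y) driver pin location on tier 1
    vias: list of (x, y) coordinates of available F2F via sites
    Returns a dict mapping signal name -> assigned via coordinate.
    """
    free = list(vias)
    mapping = {}
    # Signals are processed in a fixed order here; a timing-driven variant
    # would instead order them by criticality (see Chapter 4).
    for net, (px, py) in sorted(pins.items()):
        best = min(free, key=lambda v: hypot(v[0] - px, v[1] - py))
        mapping[net] = best
        free.remove(best)  # each via site can serve only one signal
    return mapping

pins = {"data[0]": (10.0, 12.0), "data[1]": (40.0, 12.0)}
vias = [(8.0, 8.0), (16.0, 8.0), (48.0, 8.0)]
print(assign_vias(pins, vias))
```

The greedy choice is locally optimal per signal; the processing order determines which signal wins when two signals contend for the same via.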

Once the inter-tier signal assignment has been generated, the F2F pin locations were exported in the floorplan format of Cadence Encounter. During floorplan import, both the via shapes and labels were created by the tool. An alternative method is to export the assignments as a list of native “editPin” commands, but the former approach is preferred for its much faster runtime.

Figure 3.7 3D-IC heterogeneous processor floorplan: (a) top die floorplan, (b) bottom die floorplan

The pin legalization step instructs the P&R tool to assign an actual legal pin located within the specified F2F via (top metal) shape. A legal pin is one that is aligned to the routing tracks and hence routable by the tool.

3.5.1 Visualization Tool

To analyze and render the via assignment results, we created a visualization tool using web technologies. The flow starts by exporting the inter-tier signal pins and the via assignment results into JSON format. Next, we wrote JavaScript code to render this information as an interactive scatter plot in scalable vector graphics (SVG) format using the open-source d3.js library [34]. The user interface for interacting with the plot was created using HTML5 and CSS.
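The JSON export step can be sketched as below; the record fields (`net`, `pin`, `via`) are assumptions for illustration, not the tool's actual schema:

```python
import json

# One record per inter-tier signal: the tier-1 pin it serves and the
# F2F via location assigned to it (coordinates in microns, hypothetical).
assignments = [
    {"net": "trf_data[0]", "pin": [120.4, 88.2], "via": [118.0, 90.0]},
    {"net": "trf_data[1]", "pin": [131.0, 88.2], "via": [126.0, 90.0]},
]

# d3.js can load this file directly (e.g. with d3.json) and bind each
# record to one point of the scatter plot.
with open("via_assignment.json", "w") as f:
    json.dump(assignments, f, indent=2)
```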

This tool was helpful for providing feedback during floorplanning. For instance, we identified that modules with a large number of inter-tier signal pins should be placed closer to the center to improve routing access to the inter-tier vias. Another useful piece of feedback was that these modules should be placed far enough from each other to prevent routing congestion due to inter-tier via contention.

Figure 3.8 Inter-tier signal to via assignment flow

3.6 Power Delivery

On-chip power delivery adds design challenges in three-dimensional integrated circuits (3D-ICs) compared to 2D-IC designs. The power delivery network of the fabricated 2D design serves as a baseline for designing the 3D power delivery network. A grid approach was used for power distribution, with the top two metal layers dedicated to power and ground. Additional challenges for 3D designs include: 1) larger supply currents flowing through the I/O pads and package power pins, 2) additional resistance contributed to the power delivery network by through-tier vias (micro-bumps/TSVs), and 3) longer power delivery paths. Failing to account for these requirements can cause issues with functionality, simultaneous switching noise, jitter, signal quality, and thermal reliability.

The two main metrics measured during power delivery analysis are static and dynamic IR drop (voltage droop). Static IR drop analysis evaluates the voltage drop caused by average currents flowing through the resistive power delivery network, whereas dynamic IR drop analysis evaluates the drop during peak current demand, when a large number of circuits switch simultaneously. In this design, dynamic IR drop was addressed by inserting decoupling capacitor cells throughout the chip as filler cells, covering 20% and 30% of the core area for the top and bottom die respectively. Power rail analysis was conducted using Cadence Encounter Power System on the fabricated 2D design; the reported maximum static/dynamic IR drop was 0.12 V, which is 10% of the nominal supply voltage (1.2 V).

Figure 3.9 F2F via visualization and analysis tool

Three conditions must be met to ensure adequate power delivery to the cores. First, the capacity of power I/O pads should be sufficient to deliver enough current to both tiers. Second, the power rings on the tier with the pads must have sufficient capacity to carry the required amount of current from the power I/O pads to the logic on both dies. Third, sufficient power must be delivered through the F2F vias to the other tier without significant IR drop.

To satisfy the first condition, we calculated the required number of power I/O pads according to the guidelines in the IP provider’s datasheet. The I/O pad arrangement has been validated in previous chip bring-ups, with 35 of the 92 pads allocated for power and ground and the remainder allocated for signals.

Table 3.4 Estimated maximum currents per metal width for vias and metals (mA per µm) [12]

Temperature    Via 5    Via 1-4    Metal 1    Metal 2-6
85°C           3.13     0.87       6.28       8.47
110°C          0.713    0.199      1.43       1.93
125°C          0.451    0.126      0.91       1.22

Next, the current carrying capacity of the power ring and stripes was increased to account for the additional tier. An additional global metal layer was allocated to the power ring, approximately doubling its current carrying capacity. This did not impact design routability because we switched from staggered to inline I/O pads, reducing the routing requirements between the core area and the pad ring – where the power ring is located. Additionally, the number of power stripes was doubled to improve robustness against IR drop and to increase the number of inter-tier power vias. The tradeoff is a reduced number of available routing tracks, potentially increasing area requirements due to congestion. We mitigated this risk by adjusting cell density during floorplanning to relieve congested areas, and found that no area increase was needed for the additional power stripes.

Finally, to evaluate the third condition, we calculated the voltage drop across and the current carrying capacity of the inter-tier vias. The resistance of an F2F via was specified at 0.2 Ω for via diameters smaller than 5 µm; we use this value as the worst case. With 30,796 F2F vias in parallel, the effective resistance of the inter-tier interface is 6.494 × 10⁻⁶ Ω. The upper-bound power consumption of a FabScalar core was estimated at 185 mW; with a supply voltage of 1.2 V, the expected maximum current draw per core is 154.17 mA. Hence the voltage drop across the inter-tier vias is approximately 1 µV – a negligible drop. The vertical current carrying capacity is limited by the via limit of 0.126 mA, so the total current carrying capacity through the 30,796 vias is 3880.296 mA, greatly exceeding the expected current draw.
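The power-delivery arithmetic can be verified in a few lines. The inputs below are taken directly from the text; the computed drop across the F2F interface works out to roughly 1 µV.

```python
# Worst-case per-via resistance and design parameters from the text.
r_via = 0.2        # ohms, worst-case F2F via resistance
n_vias = 30_796    # inter-tier power vias
p_core = 0.185     # W, upper-bound FabScalar core power
vdd = 1.2          # V, nominal supply
i_via_max = 0.126e-3  # A, current limit per via (Table 3.4, 125 C)

r_eff = r_via / n_vias      # vias in parallel -> effective resistance
i_max = p_core / vdd        # expected maximum core current draw
v_drop = i_max * r_eff      # IR drop across the F2F interface
i_cap = i_via_max * n_vias  # total vertical current carrying capacity

print(f"R_eff = {r_eff:.3e} ohm")
print(f"I_max = {i_max * 1e3:.2f} mA")
print(f"V_drop = {v_drop * 1e6:.2f} uV")
print(f"I_cap = {i_cap * 1e3:.1f} mA")
```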

To maximize cross-tier power delivery through the F2F interface, we employed the following strategies: (1) we used exactly the same power grid structure on both dies; (2) the spacing between power rings and stripes was a multiple of the F2F via pitch, ensuring that columns of F2F vias align perfectly with the power stripes; (3) finally, we created a cell containing via stacks from the power stripes to the F2F vias using a custom Calibre script. This cell was instantiated at the same coordinates in both tiers.

3.7 Timing

3.7.1 Timing Constraints and Analysis

The inter-tier signal ports were initially absent from the netlist of the first tier. After clock tree synthesis, the inter-tier signal ports were added in an ECO step and the corresponding timing constraints were loaded, as modeled in Figure 3.10. A certain amount of input/output delay is assumed at each port based on synthesis results, along with a load capacitance that accounts for the F2F via and gate capacitance.

For sign-off timing verification, we performed parasitic extraction, followed by static timing analysis on both dies as one system using Synopsys PrimeTime. To achieve this, we first created a top module netlist that instantiates both dies. Next, we specified the timing corner for each die separately, to model the effect of inter-die process variations from the wafer-to-wafer bonding process.

After analyzing the STA results reported by PrimeTime, we concluded that satisfying the timing requirements required inserting hold buffers on each inter-tier signal datapath. For the remaining hold violations, we performed manual buffer insertion using ECO commands in Cadence Encounter.



Figure 3.10 Physical design constraint model at the face-to-face interface

3.7.2 Inter-tier Clock Skew Balancing

We conducted hold violation fixing as an engineering change order (ECO) step after routing completed on all tiers. We used the sign-off timing tool Synopsys PrimeTime to find violating paths based on parasitics extracted from Cadence Encounter. Based on this report, we performed manual fixes by inserting buffer cells or substituting cell types in Cadence Encounter. The timing checks were then re-run in PrimeTime, and this process iterated until all timing checks were satisfied.

The advantage of this implementation is that it incurs less area and routing overhead than coarser-grained clock forwarding. The iterative process, however, can be prohibitively expensive in terms of time and design effort; this risk can be minimized by constraining the place and route tool with a more aggressive hold margin.

The clock networks on the two dies are not automatically balanced by the CAD tools. To achieve a better balance between the clock trees of the two dies, we used tight constraints in clock tree synthesis. We also performed extensive post-layout static timing analysis to guarantee that constraints were met for cross-tier nets. For sign-off timing verification, we used the post-layout netlist and extracted parasitic information (SPEF format) for static timing analysis. First, we created a wrapper netlist that instantiates both dies. We then specified different process corners for each die to model the effect of inter-die process variations. This is a realistic assumption since the two dies being stacked come from different wafers. Based on the static timing analysis results, we inserted buffers on inter-tier paths to satisfy worst case hold requirements.


3.8 Physical Verification

3.8.1 Design Rule Checks

The majority of DRC violations after cell placement and routing were related to substrate/well taps and antenna rules. A few standard cell rows, consisting of arrays of tap-less decoupling capacitor cells, did not satisfy the substrate and well tap insertion rule.

The antenna rules in this process technology require antenna diode insertion when using certain top metal layers; consequently, connecting to the F2F vias triggers this antenna rule. The placement tool, however, could not always allocate space for antenna diodes in high cell density regions. This is also a consequence of performing via assignment after cell placement. The strategy used was to find a nearby filler cell, manually insert an antenna diode cell while maintaining DRC spacing rules, and then route it manually. Furthermore, antenna diode insertion incurs area overhead that could erode the benefits of a smaller F2F pitch; this analysis is discussed in Chapter 4.

3.8.2 Connectivity Checks

Inter-tier signal connectivity was verified using our custom 3D-LVS tool. The tool first extracts the locations of the 3D via shapes placed for inter-tier signals during place and route, then verifies that the extracted results from both dies match. The 3D-LVS tool was also used to verify connectivity after manual ECO changes and DRC fixes.
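The core of such a connectivity check can be sketched as follows. This is a simplified stand-in for the custom 3D-LVS tool, with an assumed data format: each tier's extraction is a mapping from net name to via coordinate.

```python
def check_3d_connectivity(tier1_vias, tier2_vias):
    """Compare per-net F2F via coordinates extracted from the two layouts.

    Each argument maps a net name to its via (x, y) coordinate. Returns a
    list of mismatches: nets missing on one tier, or nets whose via shapes
    do not land at the same coordinates on both dies.
    """
    errors = []
    for net in sorted(set(tier1_vias) | set(tier2_vias)):
        a, b = tier1_vias.get(net), tier2_vias.get(net)
        if a is None or b is None:
            errors.append((net, "missing on one tier"))
        elif a != b:
            errors.append((net, f"coordinate mismatch {a} vs {b}"))
    return errors

t1 = {"clk_xfer": (10.0, 20.0), "data0": (18.0, 20.0)}
t2 = {"clk_xfer": (10.0, 20.0), "data0": (26.0, 20.0)}
print(check_3d_connectivity(t1, t2))
```

An empty result means every inter-tier signal has a matching via shape on both dies; in a real flow, coordinate comparison would use a small alignment tolerance rather than exact equality.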

3.9 Physical Design Metrics

Table 3.5 presents physical design metrics of the fabricated 3D-IC processor.


Table 3.5 Physical design metrics of the fabricated 3D-IC processor

Die Dimensions                        3.92 mm x 3.92 mm
Core Area per die                     9.57 mm²
Standard Cells (top die)              886,361
Standard Cells (bottom die)           678,854
Memory macros                         34
Nets (top die)                        482,479
Nets (bottom die)                     328,535
Average net length (top die)          64.6 µm
Average net length (bottom die)       66.9 µm
Inter-tier F2F signal nets            6,077
Inter-tier power vias                 30,796
Average F2F net length (top die)      86 µm
Average F2F net length (bottom die)   140.3 µm

CHAPTER 4

3D-IC BENEFITS CASE STUDY

In the context of a heterogeneous multi-core processor system, exchanging state between cores requires a wide, low-latency inter-core bus, introducing physical design challenges and tradeoffs in a 2D layout that are diminished with 3D integration. One trade-off that 3D integration eliminates is the need to place the state elements involved in inter-core exchange near the edge of the core to minimize distance, and hence routing consumption – a requirement that may conflict with tight intra-core timing constraints. In this chapter, we quantitatively analyze the advantages of 3D integration. The first section focuses on the fast thread migration unit, followed by an analysis of the overall processor system.


4.1 2D vs 3D Register File

4.1.1 Experimental Framework

The case study used for analysis is the design partition containing the centerpiece of the fast thread migration scheme: the physical register file (PRF) and the teleport register file (TRF), shown in Figure 3.3. The TRF consists of controller logic and 35 registers, each 32 bits wide. The partition contains a register file with additional state elements, synchronizers for clock domain crossing, and a wide bus connection for state transfer.

The fabricated 3D-IC processor used the classic out-of-order (OOO) architecture, with the golden architectural state stored in the architectural register file (ARF). In contemporary OOO architectures, the golden state is stored in a unified physical register file (PRF). In both architecture styles, the TRF would be strategically placed close to the PRF or ARF, as demonstrated in the fabricated 3D-IC processor design.

4.1.2 Floorplan

In this experiment, both the 2D and 3D designs share the same floorplan with an aspect ratio of 1.0, except for the inter-tier signal pin locations. The 2D design required three horizontal metal layers (M3, M5, M7) for all 2,242 inter-core signal pins to fit on one edge of the partition. Additional space between the standard cell area and the I/O pins was needed to achieve fully legal routes with no DRC violations (e.g. shorts, opens, spacing, antenna). This area is consumed by I/O signal routing to the partition edge, shown on the right-hand side of the partition in Figure 4.4.


Figure 4.1 Illustration of processor state transfer: (a) 2D-IC, (b) 3D-IC

4.1.3 Area Comparison

Area measurements shown in Figure 4.2 indicate that the 2D implementation consumes more area at cell densities above 50%. The additional area overhead in 2D is caused by routing congestion at the partition boundary and is required for successful routing of the inter-core signal pins. The 2D design was routable without extra area overhead at 50% cell density, but as cell density increases, the perimeter available for pins decreases accordingly. As the perimeter decreases, the 2D design has two options for maintaining pin routability: change the aspect ratio or allocate extra area for routing. An aspect ratio change impacts the floorplan of the full chip and is therefore undesirable, as it pushes the physical design problem back to the full-chip floorplanning stage. Keeping the aspect ratio at 1.0 also maintains a fair comparison, and it is widely known that an aspect ratio of 1.0 is the easiest to pack.

4.1.4 Power Analysis

Figure 4.6 shows the power consumption results of one PRF/TRF unit at 80% cell density, sweeping across multiple input switching activity assumptions. The 2D 65 nm implementation consumes approximately 50% less power than the 2D 130 nm implementation. The 3D 130 nm implementation reduces power consumption by 50% relative to the 2D implementation in a 20% input switching activity scenario.

The power savings come from the reduction of interconnect parasitics and of buffer insertion caused by routing congestion detours. The 2D power consumption result assumes an ideal 2D floorplan in which the register files are placed right next to each other. If the two TRFs are separated by some distance, additional power may be required to maintain signal latency due to the increased parasitic load capacitance; maintaining signal latency can be achieved through gate sizing and buffer/repeater insertion. Hence the result shown assumes a best-case scenario for the 2D design.

Figure 4.2 2D-FTM vs 3D-FTM area comparison

Figure 4.3 Buffer insertion requirement for low-latency inter-core data transfer in 2D


Figure 4.4 Layout of the 2D-TRF: (a) 50% cell density, (b) 85% cell density

4.1.5 Face-to-face Via Pitch Analysis

Not every circuit design benefits from a finer F2F via pitch; the benefit depends on whether the increased pin density would cause routing congestion. The ideal scenario is one where each inter-tier signal has a fan-out of one and the inter-tier signal sinks can be placed at the same pitch as the F2F vias. The only routing resource used would then be the via ladder connecting each signal sink to its via. This structure is commonly found in the bit-cells of SRAMs or register files, whereas in synthesized logic implementations single fan-out inter-tier connections are less common.

In the TRF implementation, each register data signal fans out to five cell pins and one antenna diode pin. With such fan-out, the point of diminishing returns on F2F via pitch is determined by the area consumed by the connected cells. This assumes the connected cells are clustered together beneath the via, which in practice is not the case, since timing constraints dictate cell placement.

Since the number of inter-tier vias increases quadratically as the via pitch shrinks, we expect wirelength to improve until a point of diminishing returns, where the design is unable to fully utilize the inter-tier vias. The experiment results in Figure 4.7 show that the point of diminishing returns for the 3D-PRF design is at a 5 µm F2F via pitch.

Figure 4.5 Layout of the 3D-TRF: (a) 50% cell density, (b) 85% cell density

Table 4.1 Face-to-face via experiment parameters

F2F Pitch (µm)    Via Diameter (µm)
1.5               0.75
3                 1.5
5                 2.5
8                 3
10                3
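The quadratic relationship between via pitch and via count can be illustrated with a quick calculation over the pitches in Table 4.1; the 100 µm square region used here is an arbitrary example, not a value from the design.

```python
def via_sites(region_um, pitch_um):
    """Number of F2F via sites that fit in a square region at a given pitch."""
    per_side = int(region_um // pitch_um)
    return per_side * per_side

# Halving the pitch roughly quadruples the available via sites.
for pitch in [10, 8, 5, 3, 1.5]:
    print(f"{pitch:>4} um pitch -> {via_sites(100.0, pitch)} sites")
```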

4.1.6 Routing Congestion

Routing congestion can be viewed as a problem of supply and demand for routing resources. Put simply, routing congestion occurs when more routes need to pass through an area than the available routing tracks there can accommodate. The supply of routing resources is determined by technology parameters, which include die size, the position of preplaced macros, and the number of metal layers. Demand for routing resources, on the other hand, is determined by the global and detailed routing solutions.

Figure 4.6 Power consumption comparison of PRF implementations

Routing congestion in a synthesized logic design occurs when the demand for routing resources through an area exceeds the track supply. High-density cells enabled by process geometries below 65 nm make the congestion problem more acute by decreasing the available routing tracks per cell [35].

As chip complexity grows, especially in 3D integrated systems, routing congestion needs to be estimated during the early planning stages (i.e. floorplanning and placement) rather than waiting until the global routing stage. A highly congested region in placement usually leads to routing detours around the region, which result in longer routed wirelength and worse timing. Hence, a congested area can negatively impact global interconnect latency.

4.1.6.1 Routing Congestion Metrics

Figure 4.7 Impact of F2F pitch on wirelength

This section defines the metrics related to routing congestion. Equation 4.1 presents the track overflow for a given bin $v$, denoted $T_v$: the difference between the number of tracks required to route through the bin and the number of tracks available, when this difference is positive, and zero otherwise. The congestion metric $C_v$ is the ratio between demand and supply, as given in Equation 4.2. The total track overflow $OF$ is the sum of the individual track overflows over all bins in the block (Equation 4.3).

$$T_v = \begin{cases} \mathrm{demand}(v) - \mathrm{supply}(v), & \mathrm{demand}(v) > \mathrm{supply}(v) \\ 0, & \text{otherwise} \end{cases} \tag{4.1}$$

$$C_v = \frac{\mathrm{demand}(v)}{\mathrm{supply}(v)} \tag{4.2}$$

$$OF = \sum_{\forall v \in V} T_v \tag{4.3}$$
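Equations 4.1–4.3 translate directly into code. A minimal sketch, with each bin represented as a (demand, supply) pair:

```python
def track_overflow(demand, supply):
    """Per-bin track overflow T_v (Eq. 4.1): excess demand, floored at zero."""
    return max(demand - supply, 0)

def congestion(demand, supply):
    """Per-bin congestion ratio C_v (Eq. 4.2): demand over supply."""
    return demand / supply

def total_overflow(bins):
    """Total overflow OF (Eq. 4.3): sum of T_v over all bins."""
    return sum(track_overflow(d, s) for d, s in bins)

bins = [(12, 10), (7, 10), (15, 10)]
print([congestion(d, s) for d, s in bins])  # [1.2, 0.7, 1.5]
print(total_overflow(bins))                 # 7
```

A bin with $C_v > 1$ is over-subscribed even if a neighboring bin has slack, which is why the overflow is summed per bin rather than computed globally.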

4.1.6.2 Routing Congestion Reduction

In a 2D implementation, bus signals incur routing congestion both within a partition and across partitions. Transferring processor or program state across cores requires a high-bandwidth routing channel, which competes with the global on-chip network for the shared last-level cache (LLC) commonly found in modern multi-core processors. Routing a dedicated global interconnect between the cores would incur additional area overhead.

Moreover, the requirement of low performance overhead poses additional physical design challenges, as shown in Figure 4.3. As the distance between the cores increases, the number or strength of the inserted buffers must also increase to achieve similar latency. These buffers incur both area and routing overhead, since the global interconnect must connect to the buffers from the global metal layers through vias, which consume routing track resources.

By providing vertical interconnect, 3D-IC enables fast data migration with fewer routing challenges than 2D. While 3D-IC design poses new challenges, the wirelength benefit suggests that it could enable features that would otherwise be prohibitively expensive to implement in 2D.

4.1.7 Wirelength Analysis

Wirelength is a metric that can be used to estimate the physical design complexity and routability of a design. Figure 4.8 shows the wirelength measurements of the inter-tier signals on one tier.

At 85% cell density, the average wirelength of the TRF output datapath signals is reduced by 73%. In a 2D implementation, additional routing overhead is incurred by routing to the boundary of the partition. Due to internal logic timing constraints and congestion avoidance, not every cell that accesses the output pins can be placed near the boundary. The routing overhead is exacerbated by congestion at the boundary, as the height of the partition decreases with increasing cell density.

This additional routing overhead is not present in a 3D implementation, since the output pins for the TRF datapath are placed within the partition. This eliminates the competing interest between placing output-driving cells near the boundary or closer to the internal logic gates.

For the 3D-FTM at 85% cell density, we see a 62% increase in average wirelength compared to 80%. This is caused by congestion, which forces routing detours to the F2F vias as depicted in Figure 4.10.

Routing track overflow, shown in Figure 4.12, is another common metric for measuring routing congestion. The first observation is that the 3D-FTM always has less congestion than the 2D-FTM at a given cell density. Second, the 2D-FTM is more sensitive to higher cell density than the 3D-FTM. Since the aspect ratio is kept at 1.0, the perimeter available for the output pins of the 2D-FTM decreases; the only other viable option is to use more routing layers. However, this approach has its limits, and a floorplan change would be required if the output datapath needs to be wider. In a 3D implementation, by contrast, the routability of the inter-core datapath depends only on cell density and F2F via pitch – independent of the aspect ratio and pin density.

Figure 4.11 shows routing track consumption for the entire partition design. Routing track consumption decreases as density increases – a tradeoff with routing track overflow. A large track overflow count indicates congestion, whereas a small overflow count indicates that designers could push for higher cell density to save silicon area. The 3D-FTM at 80% and 85% cell density consumes more routing tracks than at 70% cell density due to routing congestion around the F2F vias, as shown in Figure 4.10. To summarize, at 70% cell density the 3D-FTM incurs 9% less routing overhead and 9% less area overhead than the 2D-FTM.

4.2 2D vs 3D Processor

4.2.1 Floorplan

We created two types of floorplans for the analysis of 2D designs: 2D-Inter and 2D-Intra. 2D-Inter assumes that the core floorplan is customizable, whereas 2D-Intra assumes that the core floorplan is fixed. The CPU core is highly sensitive to floorplan changes due to its high-performance requirements. All floorplans have a die dimension of 2.7 mm x 2.7 mm, yielding 81% and 66% cell density for the top and bottom die, respectively.

Figure 4.8 Average wirelength of PRF output datapath signals (2D-130, 2D-65, and 3D-130 at 50% to 85% cell density)

The 2D-Inter floorplan was optimized for inter-core communication, with the inter-core modules placed near the edge. The aspect ratio of these modules was tailored to accommodate the wide bus signal pins. Three modules facilitate inter-core data transfer: the Transport Register File for register file migration, the Instruction Cache Buffer, and the Load Store Unit, which contains the data cache and supporting multiplexing logic.

The 2D-Intra floorplan was duplicated from the 3D implementation, which was designed to optimize intra-core timing. However, the final cell placement of this floorplan differs from the 3D implementation in order to meet timing constraints. For instance, in the 2D design, the data cache multiplexing logic cells are placed closer to the edge for better access to the I/O pins.


Figure 4.9 Overall average wirelength of design (2D-130, 2D-65, and 3D-130 at 50% to 85% cell density)

4.2.2 Wirelength Analysis

The experimental results for the wirelength measurements are shown in Figure 4.14. The measurement shown is the total inter-tier signal wirelength on both dies.

In the first column, we observe that the average wirelength of the TRF datapath signals for the 2D-Inter floorplan is 1,216 um. In the previous TRF analysis section, the 2D design reported an average wirelength of 678 um. The difference is caused by the difference in aspect ratio and floorplan shape between the two TRF designs. In 2D-Inter, the inter-core signal pins were placed on a single metal layer, as opposed to being interleaved on multiple metal layers. Placing the pins on a single metal layer reduces the risk of routing congestion near the partition edge, at the cost of an aspect ratio change and a potential wirelength increase. Changing the aspect ratio of a partition requires additional floorplan design effort during full chip integration, except when a flat implementation flow is used. In 2D-Intra, the average wirelength is almost double that of 2D-Inter, since the TRF module was embedded in the core 540 um from the partition edge.

Figure 4.10 Routing detour caused by congestion from high inter-tier signal pin density

The second and third columns show the average wirelength for the instruction cache datapath, while the fourth and fifth columns show that of the data cache datapath. We can observe that the instruction cache datapath took a more significant hit in wirelength than the data cache datapath. This is because the tool favors placing instruction cache related circuits closer to the caches themselves rather than the partition edge, in order to satisfy intra-core timing constraints. This observation demonstrates the competing interest between intra-core and inter-core constraints in the 2D designs.

Another reason for the difference in impact between the instruction and data cache datapath is the implementation of the memory macros: the instruction cache was implemented with compiled register file macros, whereas the data cache was implemented with standard flip-flops.

The flat implementation flow allows the data cache registers to be moved closer to the edge, hence the shorter wirelength compared to the instruction cache datapath.

Although the data cache (load-store unit) module is placed near the edge in both 2D-Inter and 2D-Intra, the difference in aspect ratio impacts the cell distances and available routing resources. The final column reports the average wirelength for the overall design, which increased by 10% and 22% for 2D-Inter and 2D-Intra respectively, compared to the 3D design. The parasitics of these interconnects impact the power consumption of the system, as described in the following section.


Figure 4.11 Comparison of TRF track consumption (routing track consumption vs. area for 3D-130 and 2D-130 at 50% to 85% cell density)

4.2.3 Power Analysis

The average power measurement results shown in Figure 4.15 were obtained from Cadence Encounter on the parasitics-extracted layout, assuming 20% input switching activity. The power consumption shown is the average power of each module on both cores. The 3D design consistently consumes less power than the 2D designs for modules relevant to inter-core data transfer.

The wirelength reduction in the 3D design translates to 20% and 31% core power consumption savings compared to 2D-Inter and 2D-Intra respectively.

4.2.4 Path Delay Analysis

The path delays shown in Figure 4.16 and Figure 4.19 were obtained by running static timing analysis on the final layout with parasitics extraction using Synopsys PrimeTime. The critical path delay is defined as the path with the least timing slack through a specific pin. The average path delay is defined as the average of the critical path delay through each inter-tier signal pin. For instance, for the 32-bit wide instruction PC signal (instPC), the average path delay is the average of the critical path delays through signals instPC[0], instPC[1], up to instPC[31]. To observe the impact of crosstalk, the path delay measurement results for 2D-Intra were reported both with and without signal integrity analysis.

Figure 4.12 Comparison of TRF track overflow (overflows vs. area for 3D-130 and 2D-130 at 50% to 85% cell density)

4.2.4.1 Teleport Register File

The comparison of TRF datapath path delays is shown in Figure 4.16. We observe that the 3D design has slightly better critical and average path delays than 2D-Inter. The distance between TRFs in the 2D-Intra floorplan, coupled with the 1120-bit signal width, makes it susceptible to crosstalk. This shows that the 3D design achieves similar performance while consuming less power and fewer routing resources.

4.2.4.2 Cache Core Decoupling

The static timing analysis tool reported different timing paths for remote instruction and data cache read accesses, as illustrated in Figure 4.17 and Figure 4.18, respectively. The timing path for a remote instruction cache access crosses the inter-core interface once, whereas a remote data cache read access traverses a round-trip path through the interface within a single clock cycle.

Figure 4.13 Processor floorplans for comparative analysis: (a) optimized for inter-core (2D-Inter); (b) optimized for intra-core (2D-Intra)

The instruction cache datapath consists of two signals, instruction data (inst) and instruction program counter (instPC). The instruction data is obtained by reading the instruction cache, which was implemented as a synchronous compiled memory macro and therefore serves as a timing endpoint; these two signals are thus considered separate timing paths. The instruction program counter involves more computation and multiplexing logic, hence it has higher critical and average path delays than the instruction data bus. In Figure 4.19, the instruction cache column is represented by the more critical instPC signal.

The data cache datapath consists of four bus signals: an address bus and a data bus each for read and write access. The data cache memory array was implemented with standard cell flip-flops, and its address decoder was synthesized into standard cell gates. A remote data cache read access crosses the inter-core interface twice within a single cycle, as illustrated in Figure 4.18.

First the signal is launched from the local cache address generation logic, followed by address decoding at the remote data cache. Since there is no synchronous element serving as a timing endpoint, the signal continues as the decoded data back into the local core. The place and route tool had to insert buffers in these long timing paths in order to meet timing constraints. In contrast to a read access, a write cache access only needs to cross the inter-core interface once.

Figure 4.14 Comparison of average wirelength between 2D and 3D floorplans (3D, 2D-Inter, and 2D-Intra; TRF channel, I-cache data, I-cache PC, D-cache write, D-cache read, and overall system)

Figure 4.15 Processor power consumption comparison (3D, 2D-Inter, and 2D-Intra; I-cache buffer, TRF, D-cache/load-store unit, and core)

From Figure 4.19 we observe that the path delay for remote data cache read access is longer than a write access. The critical path delay for a data-cache read access is 22.562 ns, which violates the target clock period of 15 ns. Turning the signal integrity analysis off showed that the critical path met the timing constraint. There are two possible explanations for this.

First, this could imply that the place and route was not performed with signal integrity awareness. Second, this could indicate a lack of timing constraint specification. Specifying timing constraints for pins requires additional design considerations, and could be challenging for paths that cross the interface more than once. Another potential solution is to specify the path as a multi-cycle path and ensure that the core waits for the appropriate latency.

Figure 4.16 Comparison of TRF datapath path delay between 2D and 3D floorplans (3D, 2D-Inter, 2D-Intra, and 2D-Intra with SI off; critical and average)

4.3 Conclusion

In this chapter, we have presented a comparative analysis between 2D and 3D implementations of a teleport register file and a heterogeneous multi-core processor system. We demonstrated that the 3D TRF implementation costs 10% less area, consumes 50% less power, and has 90% shorter bus signal wirelength compared to the 2D implementation in the same process node. Furthermore, compared to a 2D 65 nm implementation, the 3D 130 nm implementation has 30% shorter total wirelength and 75% shorter interface signal wirelength. We also demonstrated that the 3D processor implementation consumes 31% less power and has 22% shorter average wirelength compared to the 2D implementation.

The wirelength benefits of 3D integration for wide transfer bus structures apply not only to global interconnect, but also internally in the source partition. With 2D integration, there may be a competing interest in the placement of driver cells, between being placed closer to internal logic or closer to the output pins. Moreover, the fan-out may have to route in two opposite directions, one towards internal logic and the other towards the partition edge. Furthermore, buffers may need to be inserted to meet timing constraints, consuming silicon area and potentially causing routing congestion. In 3D integration, this competing interest is eliminated by the vertical interface.

Figure 4.17 Datapath of remote instruction cache read access with cache-core decoupling

We have also shown that a logic-on-logic design may not always benefit from a finer F2F pitch. The wirelength benefit depends on the congestion and area consumed by the fanout of the inter-tier signals. For the TRF design in this case study, the point of diminishing returns is at a 5 µm pitch.

3D integration also alleviates design challenges for wide bus interconnect, which is susceptible to signal quality deterioration due to crosstalk. Layout techniques can be deployed to mitigate crosstalk, such as interleaving bus signals with others that switch at a different time. Other existing techniques include switching the bit order of bussed signals at every turn, interleaving narrow power/ground lines through the bus, and staggering inverting buffers [36]. The additional overhead and design effort of mitigating crosstalk in a 2D implementation makes 3D integration a more attractive solution for implementing fast state transfer between multi-core processors.

Figure 4.18 Single cycle datapath of remote data cache read access with cache-core decoupling

Figure 4.19 Comparison of cache datapath path delay between 2D and 3D floorplans (3D, 2D-Inter, 2D-Intra, and 2D-Intra with SI off; I-cache and D-cache read/write, critical and average)

CHAPTER 5

TIMING DRIVEN VIA ASSIGNMENT IN 3D-IC

The 3D via assignment is a pin-assignment step that determines the mapping between inter-tier signals and 3D-IC vias (bond-points). This assignment step can be conducted after cell placement of either one or both tiers. Pin assignment is an important step of the design flow since it provides constraints to the placement stage, and thus heavily influences the final design quality of result (QoR) [37]. Compared to manual signal-to-via assignment, automated assignment yields consistently better wirelength, faster turnaround time, and less susceptibility to design errors [31].

The 3D via assignment problem can be viewed as a linear sum assignment, or weighted bipartite graph matching, problem. Existing techniques focus on minimizing total wirelength as the objective function [25, 28]. The motivation for minimizing the total wirelength of inter-tier signals is to minimize the routing between the signal and the via, optimizing latency and avoiding routing congestion. Since the inter-tier vias in a F2F bonding process are at the top-most layer, the corresponding via stacks consume valuable routing resources. Thus it is beneficial to keep the inter-tier signal sinks as close as possible to their F2F vias, since congestion caused by the inter-tier signal interconnect could diminish the wirelength benefits of 3D integration.

Minimizing total wirelength, however, assumes that all signals have equal timing budget requirements, which is not always the case. Optimizing for total wirelength may not yield assignments that best meet the timing constraints of the design. While wirelength remains an important metric for timing optimization and congestion avoidance, meeting timing requirements is one of the main objectives of physical design. Timing information can be included in the cost model to achieve better timing QoR with less manual effort. Timing-driven concepts have been applied to other stages of physical design, such as timing-driven placement. Existing timing-driven placement techniques use timing information, such as slack and sensitivity to wirelength, to guide the placement algorithm for better timing QoR [38].

Optimum assignment solvers can be prohibitively time-consuming to use during physical design iterations due to their high time complexity. Alternatively, the neighborhood search method, a greedy approach, can be used to approximate the optimum solution. Although it does not guarantee finding the optimum solution, it trades off optimality for faster execution time. The nearest-neighbor technique proposed in [1] presented a brute-force implementation, which may not scale to large inputs. The implementation proposed in this work leverages an efficient data structure for nearest neighbor queries, the k-d tree [30].

The remainder of this chapter proposes a timing-driven via assignment designed to improve timing QoR by prioritizing nets based on their timing criticality. The problem scope is the assignment of inter-tier signal pins to F2F vias in a 2-tier 3D-IC stack.


5.1 Timing Metrics

Timing-driven via assignment is guided by timing metrics, which require delay modeling and timing analysis. Timing-driven algorithms may use different levels of timing models to trade off runtime and accuracy. Since routing has not been performed prior to the via assignment step, the switch-level RC model for gates and the Elmore delay model for interconnects suffice.
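The Elmore model estimates the delay of a routed segment by charging each resistance with all of its downstream capacitance. A minimal sketch in Python, with illustrative RC values that are not taken from the process kit:

```python
# Minimal sketch of the Elmore delay model for an RC interconnect ladder:
# each segment's resistance is multiplied by the total capacitance
# downstream of it. Segment values are illustrative, not design-specific.

def elmore_delay(segments):
    """segments: list of (R_ohms, C_farads) pairs, ordered driver to sink."""
    delay = 0.0
    for i, (r, _c) in enumerate(segments):
        downstream_c = sum(c for _r, c in segments[i:])
        delay += r * downstream_c
    return delay

# Three identical 100-ohm / 10-fF segments:
# 100*(30 fF) + 100*(20 fF) + 100*(10 fF) ≈ 6 ps
print(elmore_delay([(100.0, 10e-15)] * 3))
```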

Static timing analysis (STA) computes path delays using the critical path method [39].

Equation 5.1 defines the timing slack (Slack) at timing point t as the difference between the required arrival time (Required) and the actual arrival time (Arrival). When the slack value of a timing path is positive, the timing requirements are met with spare timing margin; when the slack is negative, the timing requirements are violated. Positive slack in a path indicates that power can be saved by using weaker drive strength cells. Violating paths (with negative slack) require repairs through stronger drive strength cells, or reduced wire delay through better placement or pin assignment. This chapter demonstrates that timing slack can be improved through timing-driven inter-tier pin assignment.

Slack(t) = Required(t) − Arrival(t) (5.1)

The most commonly used timing convergence metric is the worst negative slack (WNS), which indicates the difficulty of manually fixing the remaining timing problems. With Po denoting the set of timing end-points, i.e., data inputs of state elements and primary outputs (POs):

WNS = min_{t∈Po} Slack(t) (5.2)

Another important timing closure metric is the figure of merit (FOM) [40], which is defined in Equation 5.3:


Figure 5.1 Minimum weight matching in bipartite graph problem (signals S1–S3 matched to vias V1–V4)

FOM = Σ_{t∈Po, Slack(t)<Slack_target} (Slack(t) − Slack_target) (5.3)

where Slack_target is the target slack of the design. Total negative slack (TNS) is defined as the FOM when Slack_target is zero [41].
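The three metrics above reduce to simple arithmetic over the list of per-endpoint slacks. A minimal sketch, with illustrative slack values:

```python
# Sketch of the timing metrics defined above: WNS (Eq. 5.2), FOM (Eq. 5.3),
# and TNS as the FOM with a zero slack target. Slack values are illustrative.

def wns(slacks):
    """Worst negative slack over all timing endpoints."""
    return min(slacks)

def fom(slacks, slack_target=0.0):
    """Sum of slack deficits for endpoints below the target slack."""
    return sum(s - slack_target for s in slacks if s < slack_target)

endpoint_slacks = [1.2, -0.5, 0.0, -1.0]  # ns, one per timing endpoint
print(wns(endpoint_slacks))  # → -1.0
print(fom(endpoint_slacks))  # → -1.5 (this is the TNS)
```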

5.2 Optimal Assignment

The via assignment problem can be treated as the weighted bipartite matching problem shown in Figure 5.1, which can be solved in polynomial time using the Hungarian algorithm [24], with a proven running time complexity of O(max(|V|^3, |U|^3)) [42].

The formal definition of the assignment problem is as follows.

Given:

• Two sets A and B, containing N and M elements (inter-tier signals and 3D-via locations), respectively.

• C(i, j) - Cost of assigning i to j, i ∈ A and j ∈ B


Figure 5.2 Flow for optimal via assignment: F2F sink locations and F2F signal slacks feed the assignment (LSA) solver, whose F2F sink-via mapping is then routed and evaluated with static timing analysis

Optimization:

Minimize Σ_{i∈A} Σ_{j∈B} x_ij · C(i, j) (5.4)

Such that:

Σ_{j∈B} x_ij = 1, ∀i ∈ A (5.5)

Σ_{i∈A} x_ij = 0 or 1, ∀j ∈ B (5.6)

x_ij = 0 or 1, ∀i ∈ A, j ∈ B (5.7)

The variable x_ij is 1 if i is assigned to j, and 0 otherwise.

Due to its time complexity, the optimal method is only preferred for local optimizations within a small partition instead of at full chip scale. We ran multiple experiments with the optimal assignment using Matlab and found that the run-time is acceptable for small partitions but prohibitively long for full-chip scale assignment.
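As a small-scale illustration of the formulation in Equations 5.4–5.7, the problem can be handed to SciPy's Hungarian-style solver (the same SciPy solver referenced in Section 5.7.2). The sink/via coordinates and the Manhattan-distance cost model below are illustrative:

```python
# Small-scale sketch of the linear sum assignment formulation, solved with
# scipy.optimize.linear_sum_assignment. Coordinates and the Manhattan
# distance cost C(i, j) are illustrative, not from the actual design.
import numpy as np
from scipy.optimize import linear_sum_assignment

sinks = np.array([[0, 0], [10, 0], [5, 5]])          # inter-tier signal sinks
vias = np.array([[1, 1], [9, 1], [5, 6], [20, 20]])  # available F2F vias

# cost[i, j] = Manhattan distance from sink i to via j
cost = np.abs(sinks[:, None, :] - vias[None, :, :]).sum(axis=2)

rows, cols = linear_sum_assignment(cost)  # optimal (minimum-cost) matching
for i, j in zip(rows, cols):
    print(f"sink {i} -> via {j} (cost {cost[i, j]})")  # total cost 5
```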


Figure 5.3 Inter-tier signal sinks competing for nearest available F2F via

5.3 Nearest-Neighbor Assignment

The nearest-neighbor approach can be used as an alternative to the optimal assignment for faster runtime. This approach approximates the optimal method, based on the observation that an inter-tier sink only needs to consider vias within a certain proximity. The neighborhood search is illustrated in Figure 5.3; standard cell pins that are connected to inter-tier signals are referred to as sinks. In a scenario with no contended vias, the nearest-neighbor method yields the same result as the optimal assignment.

The nearest neighbor query is based on the k-d tree data structure [30], a generalization of the binary search tree that stores points in k-dimensional space. The k-d tree is a computational geometry structure that provides fast nearest neighbor queries.

The algorithm starts by building a k-d tree of the available F2F vias (bumps). Then, for each inter-tier signal, it finds the nearest available bump using the k-d tree. After each query, the k-d tree is updated by deleting the assigned bump, since it is no longer available. This technique has been proposed for the implementation of a high-volume timing measurement scheme [19].

To apply the weighting scheme in the neighborhood approach, we propose modeling the slack information as an additional dimension to the k-d tree as shown in Figure 5.9.


Input: s_l: List of S inter-tier signal source/sink cell locations
Input: v_l: List of V available via locations
Output: A: List of cell-to-via assignments
Require: S ≤ V

1: procedure AssignmentByOrder
2:   SL ← Sort s_l by timing slack
3:   kd ← Build k-d tree from v_l
4:   for all s ∈ SL do
5:     q ← SearchAndUpdateKD(s, kd, v_l)
6:     Append (s, q) assignment to A
7:     Append q to the list of assigned nodes
8:   end for
9:   return A
10: end procedure

Figure 5.4 Priority queue based assignment

5.3.1 Timing-Ordered

In the timing-ordered approach, the sinks are assigned sequentially based on their slack. The worst timing paths are given priority for assignment before paths with higher slack. The algorithm, shown in Figure 5.4 as AssignmentByOrder, first sorts the list of signals by timing slack. A k-d tree is then built from the list of vias. For every signal in the sorted list, we query for the nearest neighbor and remove the chosen via from the k-d tree. The procedure ends with every signal assigned to a 3D via.
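A minimal Python sketch of the timing-ordered heuristic follows. SciPy's k-d tree offers no delete operation, so instead of removing assigned vias from the tree this sketch queries progressively more neighbors and skips vias already taken; the signal names, coordinates, and slack values are illustrative:

```python
# Sketch of the timing-ordered heuristic (AssignmentByOrder in Figure 5.4).
# Instead of deleting assigned vias from the tree, we widen the query and
# skip vias that are already taken. All names and values are illustrative.
from scipy.spatial import cKDTree

def assign_by_order(signals, via_xy):
    """signals: list of (name, (x, y), slack); returns {name: via index}."""
    tree = cKDTree(via_xy)
    taken, result = set(), {}
    # Most critical (least slack) signals pick their via first.
    for name, xy, _slack in sorted(signals, key=lambda s: s[2]):
        k = 1
        while True:
            _dist, idx = tree.query(xy, k=k)
            candidates = [int(idx)] if k == 1 else [int(j) for j in idx]
            free = [j for j in candidates if j not in taken]
            if free:
                result[name] = free[0]
                taken.add(free[0])
                break
            k = min(k * 2, len(via_xy))
    return result

vias = [(0, 0), (0, 2), (2, 0)]
sigs = [("a", (0, 0.6), -0.5),  # critical: claims via 0 first
        ("b", (0, 0.4), 1.0)]   # closer to via 0, but assigned later
print(assign_by_order(sigs, vias))  # → {'a': 0, 'b': 1}
```

Note how signal "b" is geometrically closer to via 0 but loses it to the more timing-critical "a", illustrating the priority ordering.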

5.3.2 Contention Based

A via is contended when it represents the minimum assignment cost for multiple sinks. The contention-based algorithm is shown as AssignmentContended in Figure 5.5.

The first step is to build two k-d trees, one from the list of available F2F vias (the via k-d tree) and the other from the list of inter-tier signal sinks (the sink k-d tree). In FindContended, for each sink we query its nearest neighbor on the via k-d tree. From this query we construct a list of contended vias, sorted by the number of contending sinks.

Input: V: Set of m available via locations
Input: C: Set of n inter-tier source/sink cell locations
Output: A: Map of cell-to-bondpoint assignments
Require: n ≤ m

1: procedure AssignmentContended
2:   A ← {}
3:   kd ← Build k-d tree from C
4:   cb ← FindContended(C, V)    ▷ sorted by contention
5:   while length of cb > 0 do
6:     for all b ∈ cb do
7:       s ← SearchAndUpdateKD(b, kd, C)
8:       Append (s, b) assignment to A
9:       Remove b from V
10:      cb ← FindContended(C, V)
11:    end for
12:  end while
13:  sigs ← AssignmentByOrder(V, C)
14:  Add (sigs, V) to A
15:  return A
16: end procedure

Figure 5.5 Iterative assignment based on contended bondpoints

While the list of contended vias is not empty, we assign vias by querying the sink k-d tree. For each contended via, we resolve the conflict by finding which sink is the best match, i.e., the least costly to assign. Since this is a greedy approach, the assignment is finalized by removing the via from the via k-d tree and the sink from the sink k-d tree. After each contended via is assigned, we run FindContended again to search for contended vias; the process repeats until no contended vias remain.

Once no vias are contended, the remaining sinks are assigned with the AssignmentByOrder procedure. At this point, the sequential assignment order no longer impacts the final result.
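The FindContended helper can be sketched as one nearest-neighbor query over all sinks followed by a count of contending sinks per via. The function name matches the pseudocode in Figure 5.5, while the coordinates below are illustrative:

```python
# Sketch of the FindContended helper: for each sink, find its nearest via,
# then report vias claimed by two or more sinks, sorted by the number of
# contending sinks. Coordinates are illustrative.
from collections import Counter
from scipy.spatial import cKDTree

def find_contended(sink_xy, via_xy):
    tree = cKDTree(via_xy)
    _dists, nearest = tree.query(sink_xy, k=1)  # nearest via index per sink
    counts = Counter(int(j) for j in nearest)
    contended = [(via, n) for via, n in counts.items() if n > 1]
    return sorted(contended, key=lambda t: -t[1])  # most contended first

sinks = [(0, 0), (1, 0), (0, 1), (9, 9)]
vias = [(0.4, 0.4), (10, 10), (5, 5)]
print(find_contended(sinks, vias))  # → [(0, 3)]: via 0 has three contenders
```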


5.4 Resolving Multiple Sinks

A through-tier via may be connected to more than one sink (cell pin). Multiple fan-in occurs when an inter-tier signal has multiple gates reading the signal value from the through-tier via. Multiple fan-out occurs when the gate that drives the inter-tier signal additionally drives other gates. The assignment problem requires each inter-tier signal to be represented by a single location coordinate, hence the following strategies were deployed to specify the representative coordinate of multiple sinks.

5.4.1 Fan-In

For multiple fan-in, existing techniques suggest computing the mid-point/centroid as a representative coordinate for the assignment problem [31]. To embed timing awareness into this approach, we propose weighting each node by its timing criticality and calculating its weighted mean center (weighted average). In Figure 5.6, suppose that the timing path from point C has less timing slack than the paths through A and B. The weighted mean center suggests that the ideal F2F via location should be closer to C than the non-weighted centroid. The worst slack of all fan-in paths is used as the representative for the cost function. The objective of this approach is to proportionally prioritize the wire delay to pin C over that to pins A and B.
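The weighted mean center itself is a plain weighted average of the sink coordinates. A minimal sketch with illustrative weights, where a larger weight stands in for higher timing criticality:

```python
# Sketch of the weighted mean center for a multiple fan-in net: each sink
# coordinate is weighted by a timing-criticality factor (e.g. derived from
# Eq. 5.8 or 5.9). The pin coordinates and weights are illustrative.

def weighted_mean_center(points, weights):
    total = sum(weights)
    x = sum(w * px for (px, _py), w in zip(points, weights)) / total
    y = sum(w * py for (_px, py), w in zip(points, weights)) / total
    return (x, y)

pins = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]  # pins A, B, C
weights = [1.0, 1.0, 2.0]                    # C is the most timing-critical
print(weighted_mean_center(pins, weights))   # → (1.0, 2.0), pulled toward C
```

With equal weights the result would be the plain centroid (about (1.33, 1.33)); doubling C's weight pulls the suggested via location toward the critical pin.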

5.4.2 Fan-Out

For cells with multiple fan-outs, the timing-driven heuristic is to select the driving cell pin coordinate to represent the inter-tier signal in the assignment problem. Both the location and the remaining slack of the driving cell are specified for the cost function. This approach aims to minimize the distance, and hence the wire delay, between the driver cell and the F2F via.


Figure 5.6 Resolving inter-tier signal nets with multiple fan-in: (a) multiple fan-in from a F2F via; (b) weighted multiple fan-in (weighted mean center vs. mid-point)

Figure 5.7 Inter-tier signal nets with multiple fan-out

5.5 Timing Aware Cost Function

One important heuristic in via assignment algorithms is to optimize the total wirelength. This approach can be extended with timing-driven concepts using weighting schemes, such that timing-critical paths are given priority for contended vias over less critical paths. Thus the new objective function of the algorithm is to minimize the total weighted wirelength.

While weighting is simple to implement, it is not always straightforward to find net weights that correlate well with timing metrics. Slack-based weighting uses timing slack as a measure of timing criticality. Two different weighting schemes for the objective cost function are presented below.


Figure 5.8 Static weighting concept used in the cost function of the optimal assignment

The first scheme applies the parameter in a linear model, where ε is a very small number used to prevent generating a weight of zero:

w = ((slack + ε − min(slack)) / max(slack)) · α (5.8)

The second scheme applies the parameter as an exponent to emphasize critical nets:

w = (1 − slack / Tclk)^α (5.9)
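Both schemes reduce to one-line functions over a net's slack. A sketch with illustrative ε, α, Tclk, and slack values, which are not the values used in the experiments:

```python
# Sketch of the two slack-based weighting schemes, Eq. 5.8 (linear) and
# Eq. 5.9 (exponential). eps, alpha, t_clk, and the slack values are
# illustrative choices.

def linear_weight(slack, all_slacks, alpha=1.0, eps=1e-6):
    # Eq. 5.8: normalized slack in a linear model, scaled by alpha
    return (slack + eps - min(all_slacks)) / max(all_slacks) * alpha

def exponential_weight(slack, t_clk, alpha=2.0):
    # Eq. 5.9: alpha as an exponent emphasizes nets with little slack
    return (1.0 - slack / t_clk) ** alpha

slacks = [-2.0, 0.0, 4.0]  # ns
for s in slacks:
    print(s, round(linear_weight(s, slacks), 3),
          round(exponential_weight(s, 12.0), 3))
```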

5.6 Congestion Avoidance

Routing congestion should be considered during assignment for three main reasons. First, not every inter-tier via is reachable through routing. Second, via stacks to the inter-tier vias consume routing resources up to the top metal layer, potentially causing routing congestion. Third, inter-tier vias could add fan-out overhead depending on the process design kit. Antenna design rules could require that antenna diodes be inserted for every net that uses the thick routing layers. In addition to the area overhead of the antenna diodes, this also adds fan-out to the inter-tier signal net, consuming routing resources.

Figure 5.9 Modelling net weights as a dimension in neighborhood search assignment

A congestion-aware via assignment technique has been proposed in [31]. In this work, we mitigate the risk of via-assignment-induced routing congestion with the following approach. Inter-tier vias located in congested regions are omitted from the list of available bumps. Congested regions can be identified by obtaining routing overflow data from the global/trial route step. With this congestion information, we filter out inter-tier vias in regions with a track consumption ratio above a certain threshold. Thus we lower the risk of creating heavily congested regions induced by dense vertical connections.
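The filtering step can be sketched as a lookup of each via's routing region against an overflow-derived utilization map. The region granularity and the threshold value here are assumptions, not taken from the actual flow scripts:

```python
# Sketch of the congestion filter: drop vias whose surrounding routing
# region exceeds a track-consumption threshold. The region size and the
# threshold are illustrative assumptions.

def filter_vias(vias, region_utilization, region_size=10.0, threshold=0.9):
    """vias: list of (x, y); region_utilization: {(col, row): ratio}."""
    kept = []
    for x, y in vias:
        region = (int(x // region_size), int(y // region_size))
        # Keep the via only if its region is at or below the threshold
        if region_utilization.get(region, 0.0) <= threshold:
            kept.append((x, y))
    return kept

# Utilization ratios would come from global/trial route overflow data
utilization = {(0, 0): 0.95, (1, 0): 0.40}
print(filter_vias([(5, 5), (15, 5), (25, 5)], utilization))
# → [(15, 5), (25, 5)]: the via in the 95%-utilized region is dropped
```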

5.7 Experiment Results

5.7.1 Framework

The following experiments were conducted on the 3D-PRF 130 nm design discussed in Chapter 4, with 85% cell density and an 8 micron F2F via pitch. The TRF unit was further constrained to a region in the partition, with dimensions that yield 85% density. The midpoint method for resolving multiple fan-ins was applied in all experiments. The slack timing reports were based on a target clock period of 12 ns, which is 3 ns more aggressive than the achieved clock period in the fabricated 3D-IC design. Timing constraints were set for a maximum path delay of 6 ns.

Table 5.1 Assignment runtime with 2500 x 2500 problem size (seconds)

Optimal-Hungarian          3554
Nearest-Neighbor Sort        16
Nearest-Neighbor Contend     19

Setting a more aggressive timing constraint than the actual target imposes higher optimization effort on the tool and compensates for model inaccuracies.

Four assignment schemes are presented in the following sections: OPT-WL (optimal assignment with total wirelength objective function), OPT-TD (timing-driven optimal assignment), NN-WL (nearest-neighbor, contention-based assignment with total wirelength objective function), and NN-TD (timing-driven nearest-neighbor assignment).

5.7.2 Runtime

The optimal assignment solver used was an open-source implementation of the Hungarian algorithm in Matlab [43]. The linear sum assignment problem solver in the Scientific Python (SciPy) package was also used to verify the results. Both optimal assignment solvers had a prohibitive turnaround time for the problem size of the experiment (2242 x 2500). To accelerate the solution, the neighborhood search approach using the k-d tree structure has been proposed in [19]. Table 5.1 shows the runtimes of the optimal and neighborhood search assignment methods on a server workstation with a 16-core 2.40 GHz processor and 192 GB RAM.

5.7.3 Parameter Search

We parallelized the parameter search process for both optimal and nearest-neighbor techniques.

The overall runtime for the nearest-neighbor technique on a server-grade machine was 5–10 minutes, which includes assignment, routing, and static timing analysis. The parameter search list used in the following experiment was [0, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32], and the value that generated the best total negative slack (TNS) result was chosen for each assignment scheme and reported in the following section.

Table 5.2 Comparison of total wirelength between via assignment schemes (µm)

    OPT-WL    227,346
    OPT-TD    229,722
    NN-WL     227,549
    NN-TD     229,466

Table 5.3 Comparison of WNS between via assignment schemes (ns)

    OPT-WL    -3.508
    OPT-TD    -3.046
    NN-WL     -3.646
    NN-TD     -3.143
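The parameter search amounts to sweeping the timing weight, running the full flow (assignment, routing, and STA) for each value, and keeping the weight with the best TNS. A sketch, with the flow stubbed out behind a callback (`sweep` and `run_flow` are hypothetical names; the real search was parallelized across machines):

```python
# Sketch of the parameter search: try each candidate timing weight and keep
# the one whose full flow run yields the best (least negative) TNS.
def sweep(weights, run_flow):
    """run_flow(w) -> TNS in ns (negative; closer to zero is better)."""
    best_w, best_tns = None, float("-inf")
    for w in weights:
        tns = run_flow(w)
        if tns > best_tns:
            best_w, best_tns = w, tns
    return best_w, best_tns

weights = [0, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32]
```

With a toy stand-in for the flow, e.g. `sweep(weights, lambda w: -abs(w - 2) - 14)`, the sweep returns the weight 2 with a TNS of -14 ns.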

5.7.4 Wirelength Comparison

Table 5.2 shows the impact of timing-driven assignment on the total wirelength of F2F signal nets. For both timing-driven assignments, the increase in total wirelength was approximately 1%, an average increase of about 1 µm per signal (2,242 signals).

5.7.5 Quality of Result Comparison

Compared to the optimal wirelength assignment method (OPT-WL), the timing-driven method reduced the critical path delay by 0.46 ns and 0.365 ns for the optimal and nearest-neighbor techniques, respectively, as shown in Table 5.3.

The total negative slack (TNS) comparisons are shown in Figure 5.10. As a baseline, OPT-WL shows the normalized slack of an optimal assignment with the total-wirelength objective cost function. The timing-driven method with the optimal assignment (OPT-TD) achieved a 30% TNS improvement compared to the baseline OPT-WL. The contention-based nearest-neighbor method with total-wirelength objective (NN-WL) performed 12% worse than the optimal assignment (OPT-WL). Finally, the timing-driven nearest-neighbor technique (NN-TD) achieved a similar performance gain to its optimal counterpart (OPT-TD), with a 27% TNS improvement compared to the baseline OPT-WL method, at the reduced runtimes shown in Table 5.1.

Figure 5.10 Normalized TNS comparison between assignment schemes (TNS in ns: OPT-WL -20.74, OPT-TD -14.46, NN-WL -23.16, NN-TD -15.21)
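Pairing the Figure 5.10 bar values with their scheme labels (the pairing follows from the quoted percentages), the relative numbers in the text can be reproduced directly, assuming improvement is measured as the reduction in TNS magnitude relative to the baseline:

```python
# Quick check of the quoted TNS deltas against the Figure 5.10 values (ns).
tns = {"OPT-WL": -20.74, "OPT-TD": -14.46, "NN-WL": -23.16, "NN-TD": -15.21}
base = abs(tns["OPT-WL"])  # baseline TNS magnitude

def improvement_pct(scheme):
    """Positive = better (smaller TNS magnitude) than the OPT-WL baseline."""
    return (base - abs(tns[scheme])) / base * 100

# OPT-TD ~ +30%, NN-WL ~ -12%, NN-TD ~ +27%
```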

5.8 Conclusion

We have presented timing-driven techniques for assigning inter-tier signals to 3D vias. These techniques use a cost function based on timing criticality and distance, and are optimized to prioritize critical timing paths while minimizing overall total wirelength. A 3D-IC design layout of a register file with a wide data-migration bus was implemented using the proposed optimal and approximation techniques. Experiment results show that the timing-driven method achieved 30% better TNS and 13% better WNS for the inter-tier bus datapath when compared to the wirelength-driven method. 3D-IC designs with high 3D-via contention and strict timing requirements can benefit from the proposed techniques.

CHAPTER 6

CONCLUSION AND FUTURE WORK

6.1 Summary of Contributions

In this dissertation we have demonstrated how 3D integration can be used to enable and efficiently implement fast state transfer for a heterogeneous multi-core processor system. In a case study of two register files with thread migration capabilities, we have shown how 3D integration could reduce total wirelength by 30%, congestion by 45%, and power consumption by 50%. We have also demonstrated that the 3D implementation of a heterogeneous multi-core processor consumes 31% less power and has 22% shorter average wirelength compared to a 2D implementation. Finally, we presented a new algorithm and methodology for timing-driven via assignment. The technique improved the total negative slack of inter-core data migration signals by up to 30% compared to a via assignment algorithm with a total-wirelength objective function.

6.2 Future Work

In terms of future work based on this dissertation, we believe it would be interesting to implement a 3D-stacked DRAM with the heterogeneous multi-core processor. A comparative analysis of this stacked memory-logic system against existing systems would further examine 3D integration as a technology to continue the trend of integrating more functionality on a single chip. For the via assignment methodology, alternative net-weighting schemes and multi-tier via assignment could be explored. During design we noted the absence of mature 3D-IC EDA tools and methodologies for power delivery network analysis, 3D clock tree synthesis, placement, via assignment, floorplanning, and 3D physical verification. Further development of these EDA tools would be necessary for 3D-IC to gain commercial adoption.

BIBLIOGRAPHY

[1] Moore, G. E. "Cramming more components onto integrated circuits". Electronics 38.8 (1965).

[2] Dennard, R. H., Gaensslen, F. H., Rideout, V. L., Bassous, E, and LeBlanc, A. R. “Design of ion-implanted MOSFET’s with very small physical dimensions”. IEEE Journal of Solid-State Circuits 9.5 (1974), pp. 256–268.

[3] Skotnicki, T, Hutchby, J. A., King, T.-J., Wong, H. S. P., and Boeuf, F. “The end of CMOS scaling: toward the introduction of new materials and structural changes to improve MOSFET performance”. IEEE Circuits and Devices Magazine 21.1 (2005), pp. 16–26.

[4] Franzon, P., Rotenberg, E., Tuck, J., Davis, W. R., Zhou, H., Schabel, J., Zhang, Z., Dwiel, J. B., Forbes, E., Huh, J., and Lipa, S. “Computing in 3D”. 2015 IEEE Custom Integrated Circuits Conference (CICC). IEEE, 2015, pp. 1–6.

[5] Xie, J., Zhao, J., Dong, X., and Xie, Y. Architectural benefits and design challenges for three-dimensional integrated circuits. IEEE, 2010.

[6] Mysore, S., Agrawal, B., Srivastava, N., Lin, S.-C., Banerjee, K., and Sherwood, T. “Introspective 3D chips”. SIGPLAN Not. 41.11 (2006), pp. 264–273.

[7] Constantinides, K and Austin, T. “Using introspective software-based testing for post- silicon debug and repair”. Design Automation Conference (DAC), 2010 47th ACM/IEEE. 2010, pp. 537–542.

[8] Jeddeloh, J. and Keeth, B. “Hybrid memory cube new DRAM architecture increases density and performance”. 2012 IEEE Symposium on VLSI Technology. IEEE, 2012, pp. 87–88.

[9] Erdmann, C., Lowney, D., Lynam, A., Keady, A., McGrath, J., Cullen, E., Breathnach, D., Keane, D., Lynch, P., De La Torre, M., De La Torre, R., Lim, P., Collins, A., Farley, B., and Madden, L. “A Heterogeneous 3D-IC consisting of two 28nm FPGA die and 32 reconfigurable high-performance data converters”. 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). 2014.

[10] Rotenberg, E., Dwiel, B. H., Forbes, E., Zhang, Z., Widialaksono, R., Chowdhury, R. B. R., Tshibangu, N., Lipa, S., Davis, W. R., and Franzon, P. D. “Rationale for a 3D heterogeneous multi-core processor”. 2013 IEEE 31st International Conference on Computer Design (ICCD). IEEE, 2013, pp. 154–168.

[11] Black, B., Annavaram, M., Brekelbaum, N., DeVale, J., Jiang, L., Loh, G. H., McCaule, D., Morrow, P., Nelson, D. W., Pantuso, D., Reed, P., Rupley, J., Shankar, S., Shen, J., and Webb, C. "Die Stacking (3D) Microarchitecture". 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06) (2006), pp. 469–479.

[12] Thorolfsson, T. Three-Dimensional Integration of Synthetic Aperture Radar Processors. North Carolina State University, 2011.

[13] Goplen, B and Sapatnekar, S. S. “Placement of thermal vias in 3-D ICs using various thermal objectives”. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 25.4 (2006), pp. 692–709.

[14] Lee, H. H. S. and Chakrabarty, K. “Test Challenges for 3D Integrated Circuits”. IEEE Design Test of Computers 26.5 (2009), pp. 26–35.

[15] Marinissen, E. J. and Zorian, Y. "Testing 3D chips containing through-silicon vias". Test Conference (ITC), 2009 IEEE International. IEEE, 2009, pp. 1–11.

[16] Li, J and Xiang, D. “DfT optimization for pre-bond testing of 3D-SICs containing TSVs”. Computer Design (ICCD), 2010 IEEE International Conference on. IEEE, 2010, pp. 474– 479.

[17] Agrawal, M., Chakrabarty, K., and Widialaksono, R. “Reuse-Based Optimization for Prebond and Post-Bond Testing of 3-D-Stacked ICs”. IEEE Transactions on Computer- Aided Design of Integrated Circuits and Systems 34.1 (2015), pp. 122–135.

[18] Franch, R, Restle, P., James, N, Huott, W, Friedrich, J., Dixon, R, Weitzel, S., Van Goor, K, and Salem, G. “On-chip timing uncertainty measurements on IBM microprocessors”. Test Conference, 2007. ITC 2007. IEEE International. IEEE. 2007, pp. 1–7.

[19] Widialaksono, R., Zhao, W., Davis, W. R., and Franzon, P. “Leveraging 3D-IC for on-chip timing uncertainty measurements”. 2014 International 3D Systems Integration Conference (3DIC). IEEE, 2014, pp. 1–4.

[20] Dudek, P, Szczepanski, S, and Hatfield, J. V. “A high-resolution CMOS time-to-digital converter utilizing a Vernier delay line”. IEEE Journal of Solid-State Circuits 35.2 (2000), pp. 240–247.

[21] Mak, W.-K. and Chu, C. “Rethinking the Wirelength Benefit of 3-D Integration”. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 20.12 (2012), pp. 2346–2351.

[22] Kim, D. H. and Lim, S. K. “Physical Design and CAD Tools for 3-D Integrated Circuits: Challenges and Opportunities”. IEEE Design Test 32.4 (2015), pp. 8–22.


[23] Westra, J and Groeneveld, P. “Post-placement pin optimization”. IEEE Computer Society Annual Symposium on VLSI: New Frontiers in VLSI Design (ISVLSI’05) (2005), pp. 238– 243.

[24] Kuhn, H. W. “The Hungarian method for the assignment problem”. Naval Research Logistics Quarterly 2.1-2 (1955), pp. 83–97.

[25] Yan, H., Li, Z., Zhou, Q., and Hong, X. “Via assignment algorithm for hierarchical 3-D placement”. Proceedings. 2005 International Conference on Communications, Circuits and Systems, 2005. IEEE, pp. 1225–1229.

[26] Neela, G. and Draper, J. “Optimal techniques for assigning inter-tier signals to 3D-vias with path control in a 3DIC”. 2014 IEEE International Symposium on Circuits and Systems (ISCAS) (2014), pp. 802–805.

[27] Zhang, T., Zhan, Y., and Sapatnekar, S. S. “Temperature-aware routing in 3D ICs”. Asia and South Pacific Conference on Design Automation, 2006. IEEE, pp. 309–314.

[28] Neela, G. and Draper, J. “Techniques for assigning inter-tier signals to bondpoints in a face-to-face bonded 3DIC”. 3D Systems Integration Conference (3DIC), 2013 IEEE International. IEEE. 2013, pp. 1–6.

[29] Thorolfsson, T., Luo, G., Cong, J., and Franzon, P. D. “Logic-on-logic 3D integration and placement”. 2010 IEEE International 3D Systems Integration Conference (3DIC). IEEE, pp. 1–4.

[30] Bentley, J. L. “Multidimensional Binary Search Trees Used for Associative Searching”. Commun. ACM 18.9 (1975), pp. 509–517.

[31] Neela, G. and Draper, J. “Congestion-aware optimal techniques for assigning inter-tier signals to 3D-vias in a 3DIC”. 2015 International 3D Systems Integration Conference (3DIC). IEEE, 2015, TS8.23.1–TS8.23.6.

[32] Enquist, P. "Scalable direct bond technology and applications driving adoption". 2011 IEEE International 3D Systems Integration Conference (3DIC). IEEE, 2012, pp. 1–5.

[33] Choudhary, N. K., Wadhavkar, S. V., Shah, T. A., Mayukh, H., Gandhi, J., Dwiel, B. H., Navada, S., Najaf-abadi, H. H., and Rotenberg, E. "FabScalar: Composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template". Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA) (2011), pp. 11–22.


[34] Bostock, M, Ogievetsky, V, and Heer, J. “D3 Data-Driven Documents”. IEEE Transactions on Visualization and Computer Graphics 17.12 (2011), pp. 2301–2309.

[35] Clarke, M, Hammerschlag, D, and Rardon, M. “Eliminating Routing Congestion Issues with Logic Synthesis”. Cadence Whitepaper (2014).

[36] Rusu, S, Tam, S, Muljono, H, Ayers, D, and Chang, J. “A Dual-Core Multi-Threaded Xeon Processor with 16MB L3 Cache”. 2006 IEEE International Solid-State Circuits Conference. Digest of Technical Papers. IEEE, pp. 315–324.

[37] Her, T. W. "Pin assignment with timing consideration [VLSI]". 1996 IEEE International Symposium on Circuits and Systems (ISCAS 96). IEEE, 1996, vol. 4, pp. 695–698.

[38] Alpert, C. J., Mehta, D. P., and Sapatnekar, S. S. Handbook of Algorithms for Physical Design Automation. CRC Press, 2008.

[39] Hitchcock, R. B. “Timing Verification and the Timing Analysis Program”. 19th Design Automation Conference (1982), pp. 594–604.

[40] Ren, H, Pan, D. Z., and Kung, D. S. “Sensitivity guided net weighting for placement-driven synthesis”. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 24.5 (2005), pp. 711–721.

[41] Rajagopal, K., Shaked, T., Parasuram, Y., Cao, T., Chowdhary, A., and Halpin, B. "Timing driven force directed placement with physical net constraints". Proceedings of the 2003 International Symposium on Physical Design (ISPD). New York, NY, USA: ACM Press, 2003, pp. 60–66.

[42] Burkard, R., Dell’Amico, M., and Martello, S. Assignment Problems, Revised Reprint. SIAM, 2012.

[43] Cao, Y. "Hungarian Algorithm for Linear Assignment Problems (V2.3)". url: http://www.mathworks.com/matlabcentral/fileexchange/20652-hungarian-algorithm-for-linear-assignment-problems--v2-3-/content/munkres.m.
