
Hybrid Monolithic 3D IC Floorplanner
Abdullah Guler, Student Member, IEEE, and Niraj K. Jha, Fellow, IEEE

Abstract—With continued technology scaling, interconnects have become the bottleneck to further performance and power consumption improvements in modern microprocessors. Three-dimensional integrated circuits (3D ICs) provide a promising approach for alleviating this bottleneck and enabling higher performance while reducing the footprint area, wirelength, and overall power consumption. Among various 3D IC solutions, monolithic 3D ICs stand out because they can utilize the third dimension most efficiently owing to high-density monolithic inter-tier vias. Monolithic integration is possible at different levels of granularity: block-level, gate-level, and transistor-level. A hybrid monolithic design has modules implemented in different monolithic styles to further optimize design objectives such as area, wirelength, and power consumption. However, a lack of electronic design automation (EDA) tools makes hybrid monolithic 3D IC design quite challenging. In this paper, we introduce the first hybrid monolithic 3D IC floorplanner (3D-HMFP). We characterize the OpenSPARC T2 processor core using different monolithic implementations and compare their footprint area, wirelength, power consumption, and temperature. We show, via simulations, that under the same timing constraint, a hybrid monolithic design offers a 48.1% reduction in footprint area and a 14.6% reduction in power consumption relative to the 2D design, at the cost of higher power density and slightly higher temperature.

This work was supported by the National Science Foundation under Grants CCF-1318603 and CCF-1714161. A. Guler and N. K. Jha are with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: {aguler,jha}@princeton.edu).

I. INTRODUCTION

TRANSISTORS have scaled for decades and become faster in each technology generation. Interconnects, however, have become a growing concern due to increased resistance with scaling [1], [2]. They have become delay and power consumption bottlenecks in modern microprocessors [3], [4]. When repeaters and flip-flops are added to the interconnects to reduce their delay, the power consumption of the microprocessor increases even more [5].

3D ICs provide a promising approach for addressing the interconnect bottleneck because they enable a reduction in overall interconnect length and in the total number of repeaters on long interconnects [6]. Among existing 3D IC designs, monolithic 3D integration offers better performance at less power consumption due to its high monolithic inter-tier via (MIV) density [7]. The monolithic approach employs a sequential integration technique to fabricate layers on top of each other. Sequential integration and dense MIV connections reduce interconnect length, introduce less parasitics due to small MIV sizes, and enable better layer alignment. There are three types of monolithic implementations: block-level monolithic (BLM) [8], gate-level monolithic (GLM) [6], and transistor-level monolithic (TLM) [7], as shown in Fig. 1.

Hybrid monolithic designs consist of modules implemented in different monolithic implementation styles. They offer better trade-offs among chip performance, area, wirelength, and power consumption. In this paper, we introduce the first hybrid monolithic 3D IC floorplanner (3D-HMFP) to enable efficient exploration of the hybrid monolithic design space. We use 3D-HMFP to compare different monolithic implementations of the OpenSPARC T2 core in terms of footprint area, wirelength, power consumption, and temperature. We implement these designs in 14nm FinFET technology.

We first create FinFET libraries based on technology computer-aided design (TCAD) device simulations and generate cell layouts. We separate the modules of the OpenSPARC T2 processor core into logic and memory modules. In order to characterize the logic and memory modules, we develop tools called FinPrin-monolithic and CACTI-monolithic, respectively, atop our prior tools: FinPrin [9] and CACTI-PVT [10]. They feed area, power, and wirelength values to 3D-HMFP for floorplanning experiments. Compared to a conventional 2D design, a hybrid monolithic design reduces the footprint area by 48.1% and power consumption by 14.6% under the same timing constraint of 1 ns (hence, under a clock frequency of 1 GHz). We use the thermal analysis tool HotSpot 6.0 [11] for 3D monolithic thermal analysis, and incorporate it into 3D-HMFP. Although a hybrid monolithic design consumes less power, its power density is higher due to the reduction in footprint area. This leads to slightly higher temperatures on chip.

The rest of the paper is organized as follows. Section II gives a brief description of prior work on 3D OpenSPARC T2 case studies, 3D IC floorplanners, and hybrid monolithic designs. Section III provides background information on monolithic 3D integration and FinFETs. Section IV illustrates the potential benefits of hybrid monolithic designs through an example. Section V describes the simulation setup. Section VI introduces FinPrin-monolithic, which we use to characterize the area, power, and delay of logic modules. Section VII explains how 3D gate-level placement can be done for GLM designs. Section VIII describes CACTI-monolithic, which we use to characterize memory modules. Section IX describes the proposed 3D-HMFP in detail through problem formulation, T*-tree representation, simulated annealing engine, cost function, and global wire power consumption. Section X describes how HotSpot 6.0 is used for thermal analysis of the chip. Section XI discusses floorplanning simulation results. Section XII presents the concluding remarks and discusses future directions.

II. RELATED WORK

3D ICs have previously been demonstrated to be effective at addressing the interconnect bottleneck. A physical design flow for 3D monolithic circuits was presented in [12] and shown



[Fig. 1. Monolithic 3D integration: (a) block-level, (b) gate-level, and (c) transistor-level. The block-level panel shows 2D blocks (L2 cache, Cores 1-4) stacked on two tiers.]

to decrease area, reduce wirelength, and improve performance compared to the 2D implementation. In [6], the OpenSPARC T2 core was used as a case study and it was shown that the GLM design has 50.0% smaller footprint area and 15.6% less power consumption compared to its 2D counterpart. Ref. [13] demonstrates that low-power design techniques that include folding functional unit blocks can help reduce the power consumption of 3D ICs based on through-silicon vias (TSVs). The OpenSPARC T2 core, implemented in a two-tier 3D design, was shown to offer up to 52.3% reduced footprint area, 27.9% shorter wirelength, and 27.8% less power consumption compared to its 2D counterpart. In [14], the OpenSPARC T2 system-on-chip (SoC) was used as a case study to compare 3D monolithic implementations with a 2D implementation. It was shown that a hybrid monolithic design with folded 3D logic-dominated modules and 2D memory modules can reduce the SoC power consumption by 8.3% when compared to its 2D counterpart. The benefits of hybrid designs at the core level were not explored in these works.

To explore the hybrid monolithic design space, we need a floorplanner that can handle both 2D and 3D modules. 3D modules can be viewed as vertically-aligned 2D modules. Several studies have been reported on floorplanning of modules under vertical constraints. A 3D floorplan representation, namely the layered transitive closure graph, was used in [15] to handle inter-layer alignment. In [16], mixed integer linear programming formulations were used to handle block alignment constraints in 3D floorplanning. A 3D floorplanner based on a sequence pair representation and a 3D-graph-based packing algorithm to control vertical module alignments was proposed in [17]. A T*-tree based 3D floorplanner, which can handle vertically-aligned 2D modules, was reported in [18]. A 3D floorplanning methodology to address interconnect structures imposing module alignment constraints on TSV-based systems was proposed in [19]. In [20], a fixed-outline 3D floorplanner that can handle folding and alignment of different types of modules, such as soft, hard, folded, and stacked, was proposed. However, these floorplanners do not include global interconnect power and do not explore the hybrid monolithic space. 3D-HMFP not only can handle vertical constraints but also replaces modules with their alternative implementations to find optimal hybrid solutions. In [21], high-level synthesis was integrated into a 3D floorplanner. This floorplanner replaces modules with their alternatives to find better designs. However, it cannot handle hybrid floorplanning under vertical constraints. 3D-HMFP, on the other hand, can handle hybrid floorplanning and choose modules from different monolithic implementations to obtain an optimal floorplan. It also takes the global interconnect power consumption into account to obtain better floorplans and accurate temperature estimation. We use 3D-HMFP to characterize the OpenSPARC T2 processor core using different monolithic implementations and compare their footprint area, power, wirelength, and peak temperature values.

III. BACKGROUND

This section provides background material on monolithic 3D integration and FinFETs.

A. Monolithic 3D Integration

3D ICs can be fabricated using either parallel or sequential integration. In parallel 3D integration, layers are processed independently and connected by TSVs [22]. TSV-based 3D technologies have been extensively studied and shown to be effective at reducing interconnect length [23]. However, TSV-based 3D ICs cannot fully utilize the benefits of the third dimension due to their large TSV diameter and layer alignment issues [7]. In addition, parallel integration often uses 2D block-level modules and hence does not benefit from the third dimension at the gate or transistor level.

On the other hand, in sequential 3D integration, also known as monolithic 3D integration, the layers are processed sequentially and connected using MIVs, which have a much smaller diameter (around 50nm) compared to that of TSVs (around 1µm). Therefore, monolithic integration offers higher density and less parasitics, delay, and power consumption.

Monolithic 3D integration can be implemented at different levels of granularity: block-level, gate-level, and transistor-level. BLM uses standard 2D modules to construct the 3D design. It has the advantage of being able to use existing electronic design automation (EDA) tools to create 2D modules. However, it does not truly take advantage of the high MIV density since 2D modules do not benefit from the third dimension. GLM uses standard 2D cells and places them on multiple layers to construct 3D modules. It offers reduction in intra-module wirelength and power consumption by utilizing MIVs inside the modules. It, however, requires 3D gate-level placement to generate the modules. In a TLM design, n-type and p-type transistors are fabricated on different layers and can be optimized separately. It can use 2D EDA tools, such as place-and-route and floorplanning tools. However, a TLM design requires a new cell library.

A hybrid monolithic design combines modules from all monolithic styles to obtain better trade-offs among various design objectives. For example, it may use 3D logic modules implemented in GLM and 2D caches implemented in BLM, since caches are already optimized for 2D design. To explore the hybrid monolithic design space, a hybrid floorplanner is needed, which is the focus of this paper. Although, in this work, we target two-tier monolithic 3D designs, the same methodology can be applied to monolithic designs with more tiers and to TSV-based designs under vertical constraints.

Monolithic integration has a thermal budget issue. The upper-layer transistors need to be processed at a lower temperature, or else the lower layers need to use metals such as tungsten, which are thermally more resistant. However, it has been shown that the performance of the top-layer transistors, which are processed at a lower temperature, can match the performance of the transistors on the bottom layer [24]. Thus, in this work, we use the same transistor model for all monolithic implementations. We assume that the power and timing are the same for BLM/GLM and TLM cells while the area of the TLM cells is different.

B. FinFETs

We implement monolithic designs using FinFETs, as they have replaced planar MOSFETs due to their superior short-channel behavior, increased performance, and better scalability. However, the proposed floorplanner is applicable to planar MOSFET designs as well.

Fig. 2 shows a 2D cross-section of a 3D FinFET. The FinFET parameter values we use are shown in Table I. They are gate length LG, fin thickness TSI, oxide thickness TOX, fin height HFIN, spacer thickness LSP, gate underlap LUN, fin pitch FP, gate pitch GP, channel doping concentration NBODY, source/drain doping concentration NSD, and gate workfunction ΦG. We assume 14nm FinFET technology. We use FinFET and technology parameter values from the literature and previously reported data by foundries [25], [26], [27].

[Fig. 2. 2D cross-section of a 3D FinFET.]

TABLE I. 14NM FINFET DEVICE PARAMETER VALUES
Parameter (unit) | Value
LG (nm) | 16
TSI (nm) | 8
TOX (nm) | 0.9
HFIN (nm) | 30
LSP (nm) | 8
LUN (nm) | 8
FP (nm) | 42
GP (nm) | 70
NBODY (cm^-3) | 10^15
NSD (cm^-3) | 10^20
ΦG (eV) | nFinFET: 4.4, pFinFET: 4.8

IV. MOTIVATION

In this section, we demonstrate how the hybrid monolithic design style can be used to find an optimal design. Table II shows the footprint area and power values of the OpenSPARC T2 floating-point and graphics unit (FGU) implemented in different monolithic styles using FinPrin-monolithic. The GLM implementation of the FGU has the smallest footprint area and the lowest power consumption compared to its BLM and TLM implementations. However, using GLM modules ubiquitously may not guarantee the best chip design, given the complicated architecture of modern microprocessors that contain tens of modules.

TABLE II. FGU FOOTPRINT AREA AND POWER VALUES FOR DIFFERENT MONOLITHIC IMPLEMENTATIONS
Monolithic type | Area (µm²) | Power (µW)
BLM | 21696 | 22973
TLM | 13821 | 21396
GLM | 10633 | 19866

The following example shows how a hybrid monolithic design can offer trade-offs between desired objectives, such as area and power consumption. Suppose we have five logic and memory modules with the relative footprint area values shown in Table III. A, B, and C refer to logic modules that can be implemented in both BLM and GLM, whereas D and E refer to memory modules that are only implemented in BLM.

TABLE III. FOOTPRINT AREA VALUES ASSUMED FOR THE MODULES TO BE FLOORPLANNED
Monolithic type | Logic A | Logic B | Logic C | Memory D | Memory E
BLM | 12 | 6 | 8 | 15 | 9
GLM | 6 | 3 | 4 | - | -

Fig. 3 shows different floorplanning scenarios. The first design consists of only BLM modules. It achieves a footprint area of 26 (12+6+8) in the best case. The second design uses all GLM logic modules to reduce power consumption. However, its minimum achievable footprint area is 28 (6+3+4+15). On the other hand, the third design utilizes both BLM and GLM logic modules and achieves a footprint area of 25 (6+4+15).
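The best-case footprint values quoted above follow from a simple two-tier model: a GLM (folded) module occupies the same footprint on both tiers, while each BLM module sits on one tier, so the chip footprint is the total GLM area plus the larger of the two per-tier BLM area sums. The brute-force sketch below is our illustration, not part of 3D-HMFP; the module names and area values come from Table III, and it reproduces the 26/28/25 figures.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Best-case footprint under the two-tier model described above:
// GLM modules occupy both tiers; BLM modules are partitioned
// between the tiers to minimize the larger tier's area.
static int footprint(int glmArea, const std::vector<int>& blm) {
    int best = 1 << 30;
    for (unsigned mask = 0; mask < (1u << blm.size()); ++mask) {
        int tier0 = 0, tier1 = 0;
        for (size_t i = 0; i < blm.size(); ++i)
            (mask >> i & 1 ? tier0 : tier1) += blm[i];
        best = std::min(best, glmArea + std::max(tier0, tier1));
    }
    return best;
}

int main() {
    const int blmLogic[3] = {12, 6, 8};  // A, B, C in BLM (Table III)
    const int glmLogic[3] = {6, 3, 4};   // A, B, C in GLM
    const int mem[2] = {15, 9};          // D, E (BLM only)
    // Try all 2^3 GLM/BLM choices for the three logic modules.
    for (unsigned choice = 0; choice < 8; ++choice) {
        int glmArea = 0;
        std::vector<int> blm = {mem[0], mem[1]};
        for (int i = 0; i < 3; ++i)
            if (choice >> i & 1) glmArea += glmLogic[i];
            else blm.push_back(blmLogic[i]);
        std::printf("GLM set %u: footprint %d\n", choice, footprint(glmArea, blm));
    }
    // choice 0 (all BLM) -> 26, choice 7 (all GLM) -> 28,
    // and, e.g., choice 4 (only C in GLM) -> 25, matching Fig. 3.
}
```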

[Fig. 3. Floorplanning results of different monolithic implementations showing the benefit of hybrid design: (a) BLM, (b) GLM logic + BLM memory, and (c) GLM/BLM logic + BLM memory.]

[Fig. 4. The hybrid monolithic design flow: Sentaurus TCAD device simulations (gate power/timing) and Magic VLSI layouts (gate area) feed the FinFET logic and memory libraries; the RTL netlist goes through Design Compiler to produce gate-level netlists for FinPrin-monolithic, while memory/cache structures feed CACTI-monolithic; the resulting module area/power/timing values, together with the interconnects, drive 3D-HMFP, whose area/power output is passed to HotSpot 6.0 for temperature analysis.]

Although a BLM implementation of module B has higher power consumption than that of its GLM implementation, the footprint area reduction in the third design may reduce silicon cost, increase yield, and decrease global wire power consumption.

Despite its simplicity, this example shows the need for a hybrid monolithic floorplanner. A processor core has tens of modules, some of which can be implemented in 3D, whereas others are better implemented in 2D.

V. SIMULATION SETUP

Floorplanning experiments require area and power consumption values of the modules. We characterize the modules of the OpenSPARC T2 processor core using different monolithic styles and feed the area and power values to 3D-HMFP for floorplanning. Fig. 4 shows the hybrid monolithic design flow. First, we use the Sentaurus Device simulator [28] to perform 2D hydrodynamic mixed-mode device simulations to characterize FinFET logic and memory cells. We generate BLM and TLM cell layouts using the Magic VLSI layout tool [29] to obtain the area values of the cells. We characterize the FinFET logic and memory libraries based on timing, power consumption, and area values obtained from the device simulations and cell layouts. For the FinFET logic library, we characterize INV, NAND, and NOR cells (sizes 1×, 2×, 4×, 8×, and 16×) and a D flip-flop (DFF). The FinFET memory library has a 6T SRAM cell in addition to the logic cells.

We separate the OpenSPARC T2 processor core components into two groups: logic and memory modules. Logic modules include components such as the decoder and execution unit, which are synthesizable by Design Compiler [30]. We use FinPrin-monolithic, which is built on top of FinPrin [9], to characterize the logic modules. Memory modules include caches, register files, etc., which are not synthesizable. We use CACTI-monolithic, which is built on top of CACTI-PVT [10], to characterize the memory modules. We obtain area and power consumption values of the modules under a 1 ns timing constraint from FinPrin-monolithic and CACTI-monolithic and feed these values to 3D-HMFP. 3D-HMFP also calculates the global wire power consumption. It produces overall chip area, wirelength, and power consumption values for the 2D design and different 3D monolithic designs, such as BLM, TLM, and hybrid monolithic. Lastly, we use HotSpot 6.0 [11] to calculate the temperature of the chip. In the following sections, we describe the tools we use in the hybrid monolithic design flow.

VI. FINPRIN-MONOLITHIC

We use FinPrin-monolithic to calculate the delay, area, and power consumption of logic modules. We build FinPrin-monolithic on top of FinPrin [9], a FinFET logic circuit analysis and optimization tool. The FinPrin-monolithic simulation flow is shown in Fig. 5. The steps of this simulation flow are as follows (a sketch of the timing-closure loop in steps 3-5 is given after the list):

1) FinPrin-monolithic takes the gate-level netlist generated by Design Compiler and converts it into the GSRC bookshelf format for place-and-route.
2) Capo, a 2D floorplacer [31], performs row-based placement. For GLM modules, an intermediate step generates a 3D gate-level placement; details of the gate-level placement are given in the next section. FinPrin-monolithic applies global routing to the placement.
3) FinPrin-monolithic calculates the area, power, and delay values of the circuit and determines the critical path based on the FinFET logic library and temperature. We have incorporated FLUTE, a fast lookup table based rectilinear Steiner minimal tree algorithm [32], into FinPrin-monolithic for more accurate wirelength and, consequently, delay and power calculations.
4) If the timing constraint is violated, FinPrin-monolithic adds repeaters to only the interconnects on the critical path.
5) FinPrin-monolithic repeats steps 3 and 4 until the timing constraint is met or adding repeaters on the critical path does not decrease the delay anymore.
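A minimal, self-contained sketch of the loop in steps 3-5. The "engine" here is a stand-in that replays a canned sequence of critical-path delays, as if each repeater-insertion pass re-ran the analysis; only the control flow is taken from the paper, and all names are our own, not FinPrin-monolithic's API.

```cpp
#include <cstdio>
#include <vector>

// Stand-in for the analysis engine: each repeater-insertion pass
// (step 4) advances to the next pre-canned critical-path delay,
// as if timing analysis (step 3) were re-run.
struct FakeEngine {
    std::vector<double> delaysNs{1.42, 1.18, 1.04, 0.98};  // hypothetical values
    size_t pass = 0;
    double analyzeTiming() { return delaysNs[pass]; }      // step 3
    bool insertRepeaters() {                               // step 4
        if (pass + 1 >= delaysNs.size()) return false;     // nothing left to add
        ++pass;
        return true;
    }
};

int main() {
    FakeEngine eng;
    const double constraintNs = 1.0;  // 1 ns constraint (1 GHz clock)
    double prev = 1e30;
    for (;;) {                        // step 5: iterate steps 3 and 4
        double d = eng.analyzeTiming();
        std::printf("critical path: %.2f ns\n", d);
        if (d <= constraintNs) { std::puts("timing met"); break; }
        if (d >= prev) { std::puts("repeaters no longer help"); break; }
        prev = d;
        if (!eng.insertRepeaters()) { std::puts("cannot add repeaters"); break; }
    }
}
```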

TABLE IV. FINPRIN-MONOLITHIC RESULTS
Monolithic type | Area (µm²) | Area red. (%) | Wirelength (µm) | Wirelength red. (%) | Dynamic power (µW) | Leakage power (µW) | Wire power (µW) | Total power (µW) | Total power red. (%)
BLM | 94659 | 0.0 | 4490727 | 0.0 | 49956 | 3545 | 41099 | 94600 | 0.0
TLM | 61072 | 35.5 | 3736031 | 16.8 | 49018 | 3468 | 34192 | 86678 | 8.4
GLM | 47294 | 50.0 | 3141085 | 30.1 | 48970 | 3462 | 28747 | 81179 | 14.2

[Fig. 5. The FinPrin-monolithic simulation flow: the gate-level netlist is converted to GSRC format and placed by Capo (with a gate-level placement step for GLM modules); the power, area, delay, and wire models, FLUTE, the FinFET logic library, and the temperature/frequency inputs then drive repeater insertion until the timing constraint is met or no more repeaters can be added.]

[Fig. 6. 8× NAND cell layout: (a) BLM/GLM, (b) TLM n-tier, and (c) TLM p-tier.]

We characterized 13 logic modules of the OpenSPARC T2 processor core using FinPrin-monolithic. They are the instruction fetch unit consisting of three sub-units, decode unit, two execution units, load-store unit, floating-point unit, trap logic unit, memory management unit, gasket unit, performance monitor unit, and pick unit. We assume a 1 GHz clock frequency and a 330 °K temperature. The FinPrin-monolithic results are shown in Table IV. It shows the total footprint area, wirelength, and power consumption of the 13 logic modules. TLM modules have 35.5%, 16.8%, and 8.4% reduction in footprint area, wirelength, and power consumption, respectively, compared to those of BLM modules.

Fig. 6 shows the layout of an 8× NAND cell in BLM/GLM and TLM. TLM cells have higher total silicon area compared to BLM/GLM cells due to their use of intra-cell monolithic vias. The height of a BLM cell is 84λ (0.588 µm) while the height of a TLM cell is 54λ (0.378 µm). Depending on how the TLM cells are implemented, the footprint area can be smaller than that of the BLM cells by 46% [33], 44.4%, or 38.8% [34]. Our TLM cells have 35.7% less footprint area compared to the 2D cells, which is reasonable relative to prior work. Although TLM modules have a smaller footprint area due to their smaller cell layouts, their total silicon area (on both layers) is 29.0% larger than that of BLM. Their power reduction is mostly due to a decrease in wirelength.

GLM modules offer the smallest footprint area, the shortest wirelength, and the lowest power consumption. Compared to BLM modules, their footprint area, wirelength, and power consumption are reduced by 50.0%, 30.1%, and 14.2%, respectively. The main contributor to the power reduction in GLM modules is the decrease in wirelength: a 30.1% decrease in wirelength is responsible for a 13.1% decrease in the total power consumption. The wirelength in GLM modules, however, is slightly optimistic because FinPrin-monolithic does not include the wirelength in the z-dimension (which we estimate would only contribute 1-2% to the total wirelength). Note that the FinPrin-monolithic simulation results are not based on an RTL-to-GDSII design flow, which we plan to adopt in our future work.

VII. GATE-LEVEL MONOLITHIC PLACEMENT

2D placement tools can be used for BLM and TLM designs. We modify a 2D placement process to perform GLM placement, using an approach similar to the ones reported in [12] and [6]. The placement steps, shown in Fig. 7, are as follows:
(a) Cell deflation: The cell widths are halved.
(b) Deflated 2D placement: Capo [31] is used for row-based placement of the deflated cells.
(c) Cell inflation: The cell widths are returned to their original values, leading to cell overlaps.
(d) Cell layer assignment: Cells are assigned to one of the layers of the 3D design. A greedy algorithm is used to minimize the placement area and remove the overlaps.

[Fig. 7. Gate-level monolithic placement steps: (a) cell deflation, (b) deflated 2D placement, (c) cell inflation, and (d) cell layer assignment.]

The algorithmic steps for greedy layer assignment are illustrated with the example shown in Fig. 7. The algorithm starts with the first row, assigns its first cell to the bottom layer, and assigns its second cell to the top layer. Then, it considers each remaining cell in the row in order and assigns it to the layer with less total cell width. Its aim is to minimize the footprint area. It repeats this process for all rows.
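A compact sketch of this row-by-row greedy rule, assuming cells arrive in placement order within each row; this is our illustration of the rule described above, not code from the tool.

```cpp
#include <cstdio>
#include <vector>

struct Cell { double width; int layer = -1; };  // layer: 0 = bottom, 1 = top

// Greedy layer assignment (step (d)): walk each placement row in order;
// the first cell goes to the bottom layer, the second to the top layer,
// and every subsequent cell to the layer with less accumulated width,
// which keeps the two layers' row widths (and thus the footprint) balanced.
void assignLayers(std::vector<std::vector<Cell>>& rows) {
    for (auto& row : rows) {
        double width[2] = {0.0, 0.0};
        for (size_t i = 0; i < row.size(); ++i) {
            int layer = (i < 2) ? static_cast<int>(i)
                                : (width[0] <= width[1] ? 0 : 1);
            row[i].layer = layer;
            width[layer] += row[i].width;
        }
    }
}

int main() {
    std::vector<std::vector<Cell>> rows = {{{3}, {2}, {2}, {4}, {1}}};  // one row
    assignLayers(rows);
    for (const auto& c : rows[0])
        std::printf("w=%.0f -> layer %d\n", c.width, c.layer);
}
```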
We compared our cell layer assignment method with the zero-one linear programming (ZOLP) method presented in [12]. ZOLP formulates cell layer assignment as a linear programming problem. It performs cell assignment to minimize the total overlap between cells while assuming that the cells have fixed positions. A legalization step follows ZOLP layer assignment to remove the remaining overlaps, which may increase the placement area and wirelength. We compared the two methods using 14 different logic circuits, such as s713, s1196, s1488, and s9234 from the ISCAS'89 benchmark suite, and the arithmetic-logic unit and multiplier unit from the OpenSPARC T1 benchmark. These circuits are only used to compare gate-level placement methods and are different from the OpenSPARC T2 modules that are characterized in this paper. Table V shows the total area and wirelength values of the circuits for the different placements. Compared to 2D placement, ZOLP reduces the footprint area and wirelength by 47.5% and 25.8%, respectively. Our greedy method, on the other hand, reduces the footprint area and wirelength by 49.2% and 28.1%, respectively.

TABLE V. PLACEMENT RESULTS OF 14 TEST CIRCUITS
Design | Area (µm²) | Area red. (%) | Wirelength (µm) | Wirelength red. (%)
2D | 2680 | 0.0 | 88192 | 0.0
Deflated 2D | 1356 | 49.4 | 62667 | 28.9
ZOLP 3D | 1408 | 47.5 | 65456 | 25.8
Greedy 3D | 1361 | 49.2 | 63443 | 28.1

Fig. 8 shows a comparison of the two methods on a small example to demonstrate how the greedy method can perform better than ZOLP. In this example, seven cells are assigned to two layers using the two methods. ZOLP leads to a larger area compared to the greedy method because of the overlaps that remain between cells after ZOLP layer assignment and must be removed by legalization. The greedy method, on the other hand, removes all the overlap during layer assignment. It also performs better in wirelength because wirelength is often proportional to the area of the design. The position of a cell after greedy layer assignment is relatively close to its original position in the deflated 2D placement. Therefore, the wirelength is close to that of the deflated cell placement, which has already been optimized during deflated placement, as shown in Table V. The greedy method is also faster because it is simpler and does not need a legalization step. Therefore, we choose the greedy method since it performs slightly better in area and wirelength and is significantly faster than ZOLP. Neither method minimizes the number of vias; to minimize the number of vias, a min-cut partitioner can be used to split the circuit [6].

[Fig. 8. Greedy layer assignment vs. ZOLP layer assignment [12] showing the area benefit of the greedy method: (a) deflated 2D placement, (b) cell inflation, (c) cell layer assignment, and (d) legalization (ZOLP only).]

VIII. CACTI-MONOLITHIC

We used CACTI-monolithic to characterize 22 memory modules of the OpenSPARC T2 processor core. CACTI-monolithic is built on top of CACTI-PVT [10], a FinFET cache/memory modeling tool. The CACTI-monolithic simulation flow is shown in Fig. 9. It explores the design space of possible memory configurations for a given cache structure by investigating different organizational parameter values, such as the number of partitions in the wordline and bitline. It calculates the timing, area, and power consumption of the memory module based on the FinFET logic and memory libraries and technology parameter values. It finds the best configuration based on the cost function defined by the user.

[Fig. 9. The CACTI-monolithic simulation flow: for a given cache structure, possible cache configurations are explored using the power, area, delay, and wire models and the FinFET logic and memory libraries, and the best configuration is selected by the cost function.]

We use CACTI-monolithic to model BLM and TLM modules. We modify the area values of the cells in the FinFET logic and memory libraries in order to characterize TLM memory modules. To the best of our knowledge, no tool is available for characterizing GLM memory modules. Therefore, we do not evaluate such modules. Table VI shows the CACTI-monolithic input parameter values for important OpenSPARC T2 memory modules.

They are the instruction cache data array (ICD), instruction cache tag array (ICT), data cache array (DCA), data cache tag array (DTA), integer register file (IRF), floating-point register file (FRF), data valid bit array (DVA), data translation lookaside buffer (DTLB), instruction translation lookaside buffer (ITLB), and store buffer CAM (SCM). The rest of the memory modules are smaller RAM arrays. We assume a 1 GHz clock frequency and a 330 °K operating temperature. In Table VI, FA stands for fully-associative.

The CACTI-monolithic results are shown in Table VII. It shows the total footprint area, dynamic power, leakage power, H-tree power, and total power of the 22 memory modules. In all, TLM memory modules have 2.3% less power consumption and 25.3% smaller footprint area compared to those of BLM memory modules. TLM memory modules do not benefit from the smaller TLM cell layouts as much as TLM logic modules do, both in terms of area and power consumption. Although the TLM cell layout is 35.7% smaller than the BLM cell layout, TLM memory modules have only 25.3% smaller footprint area. The footprint area benefit of the TLM cell array is reduced because both BLM and TLM have similar routing area for the same structure, as shown in Fig. 10.

[Fig. 10. BLM and TLM memory module area comparison: (a) BLM memory module and (b) TLM memory module.]

In addition, the 25.3% smaller footprint area of TLM memory modules does not translate into a significant reduction in total power consumption, due to the organizational structure and dimensions of the memory modules. CACTI-monolithic uses horizontal and vertical H-tree structures to route data and address signals, as shown in Fig. 11. These H-trees dominate the power of the memory modules. Their length is significantly impacted by the width, height, and organizational structure of the memory module. Table VIII shows results for an instruction cache data array implemented in BLM and TLM. The module width of the two implementations is similar. However, the height of the TLM implementation is 24.4% smaller. Because the memory module width is larger than its height, the horizontal H-trees are longer and, consequently, affect the power consumption the most. TLM and BLM memory modules have similar horizontal H-tree lengths because their widths are similar. This leads to only a small reduction in power consumption for TLM memory modules, mostly due to shorter vertical H-trees and fewer repeaters.

[Fig. 11. Layout of the horizontal (H0-H2) and vertical (V0-V2) H-trees of a memory module.]

IX. HYBRID MONOLITHIC 3D IC FLOORPLANNER

3D-HMFP is built atop 3DFP [36], a thermal-aware floorplanner developed for TSV-based 3D ICs. 3D-HMFP can handle hybrid floorplanning while taking the global interconnect power consumption into account. It is implemented in C++. The main differences between 3DFP and 3D-HMFP are as follows:
• 3DFP floorplans only 2D modules on multiple layers, assuming no vertical constraints. 3D-HMFP handles both 2D and 3D modules and aligns the parts of GLM and TLM modules on two layers.
• 3DFP assumes wire power to be fixed at 30.0% of the total module power. 3D-HMFP does not make such an assumption since wire power is significantly impacted by the technology node being employed. Instead, 3D-HMFP calculates wire power based on the FinFET logic library and technology-dependent wire resistance and capacitance values.
• 3DFP, at a time, can only explore the design space of modules implemented in the same style. 3D-HMFP explores a larger design space because it can replace modules with their alternative implementations.
• 3DFP uses a B*-tree representation. 3D-HMFP uses a T*-tree representation for handling vertical constraints.

A. Problem Formulation

We assume that the two parts of a GLM or TLM module on different layers are aligned with each other to reduce the footprint area and wirelength; not aligning them would complicate the design and increase routing. The main challenge in hybrid monolithic floorplanning is to make sure that both parts of every GLM and TLM module are aligned on the two layers, as shown in Fig. 12.
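The alignment requirement can be captured by a simple per-module invariant. The sketch below is our illustration; the Module record and checker are hypothetical, not 3D-HMFP data structures. A BLM module is a rectangle on one layer, while a GLM/TLM module is, by construction, the same rectangle stamped onto both layers, so its two parts are perfectly aligned and only same-layer overlaps need to be checked.

```cpp
#include <vector>

enum class Style { BLM, GLM, TLM };

// A placed module: BLM occupies [x, x+w) x [y, y+h) on a single layer;
// GLM and TLM occupy the same rectangle on both layers of a two-tier design.
struct Module {
    Style style;
    double x, y, w, h;
    int layer;  // used only when style == Style::BLM
};

static bool overlaps(const Module& a, const Module& b) {
    return a.x < b.x + b.w && b.x < a.x + a.w &&
           a.y < b.y + b.h && b.y < a.y + a.h;
}

static bool onLayer(const Module& m, int layer) {
    return m.style != Style::BLM || m.layer == layer;  // 3D modules sit on both
}

// Legality: no two occupied rectangles overlap on the same layer.
bool legalPlacement(const std::vector<Module>& mods) {
    for (int layer = 0; layer < 2; ++layer)
        for (size_t i = 0; i < mods.size(); ++i)
            for (size_t j = i + 1; j < mods.size(); ++j)
                if (onLayer(mods[i], layer) && onLayer(mods[j], layer) &&
                    overlaps(mods[i], mods[j]))
                    return false;
    return true;
}

int main() {
    return legalPlacement({{Style::BLM, 0, 0, 2, 2, 0},
                           {Style::GLM, 2, 0, 1, 1, 0}}) ? 0 : 1;
}
```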

TABLE VI. CACTI-MONOLITHIC INPUT PARAMETER VALUES FOR MEMORY MODULES
Input | ICD & ICT | DCA & DTA | IRF | FRF | DVA | DTLB | ITLB | SCM
Memory type | Cache | Cache | RAM | RAM | RAM | CAM | CAM | CAM
Cache size (bytes) | 16896 | 9216 | 288 | 2048 | 128 | 608 | 304 | 64
Block size (bytes) | 32 | 16 | 9 | 8 | 32 | 16 | 8 | 1
Associativity | 8 | 4 | 1 | 1 | 1 | FA | FA | FA
Number of banks | 4 | 2 | 1 | 1 | 1 | 1 | 1 | 1
Tag size (bits) | 30 | 30 | - | - | - | 66 | 66 | 37

TABLE VII. CACTI-MONOLITHIC RESULTS
Monolithic type | Area (µm²) | Area red. (%) | Dynamic power (µW) | Leakage power (µW) | H-tree power (µW) | Total power (µW) | Total power red. (%)
BLM | 25444 | 0.0 | 3075 | 3124 | 4349 | 10548 | 0.0
TLM | 18998 | 25.3 | 2966 | 3115 | 4222 | 10303 | 2.3

TABLE VIII. INSTRUCTION CACHE DATA ARRAY DIMENSIONS OF BLM AND TLM IMPLEMENTATIONS (16.5 KB (16 KB DATA AND 0.5 KB PARITY), 8-WAY SET ASSOCIATIVE, 32 B LINE SIZE, 64 ENTRIES [35])
Monolithic type | Width (µm) | Height (µm) | Power (µW)
BLM | 151 | 78 | 5757
TLM | 150 | 59 | 5613

[Fig. 12. Monolithic 3D floorplanning showing the vertical constraints for GLM/TLM modules: (a) BLM, (b) GLM & TLM, and (c) hybrid.]

[Fig. 13. A T*-tree representation (modules A-H) and the corresponding placement in 3D.]

B. T*-tree Representation

3D-HMFP uses a T*-tree representation that is inspired by the B*-tree representation, an efficient and flexible data structure for non-slicing floorplans [37]. The T*-tree has been used for 3D floorplanning, considering vertically-aligned rectilinear modules, in [18]. Fig. 13 shows a T*-tree representation and the corresponding placement of modules in 3D. Modules D and H are assumed to be implemented in GLM or TLM; hence, they have the same footprint on the two layers. The other modules are assumed to be implemented in BLM and thus occupy space only on one layer. The T*-tree represents each module with a node. A node can have up to three children nodes.

3D-HMFP uses a depth-first search algorithm to pack the modules. It starts from the root node and places its module on the bottom layer. At each node, it visits the middle, left, and right subtrees in order. The middle child module is placed on top of the parent module on the upper layer. The left child module is placed to the right of the parent module. The right child module is placed above the parent module on the same layer. We use two linked-list data structures (one for each layer) to keep track of the boundaries of the placement and determine the locations of the placed modules. More details of the T*-tree representation and packing algorithm can be found in [18] and [37].

Not every T*-tree representation corresponds to a valid placement. For example, a GLM module cannot be the middle child of another module. 3D-HMFP checks the legality of the solution at each step, and any operation that leads to an invalid solution is dismissed. For example, in Fig. 13, swapping module H with module C is not legal since the resulting state has three layers. Hence, such an operation is rejected by the algorithm.

C. Simulated Annealing Engine

3D-HMFP uses a simulated annealing engine to perturb the floorplanning solutions. Five different operations can be performed on the T*-tree nodes in the floorplanning algorithm for this purpose (see the sketch after this list):
1) Rotate: It rotates a module by 90°. In other words, it swaps the width and height values of the module.
2) Resize: It modifies the width and height values of a soft module while keeping the module area the same.
3) Move: It moves a module to another position in the T*-tree.
4) Swap: It swaps the positions of two modules in the T*-tree.
5) Replace: It replaces a module with one of its alternative implementations. For example, it can replace a GLM module with its BLM implementation.
After each operation on the T*-tree, the simulated annealing engine decides whether to move to the new state based on a weighted cost function specified by the user.
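A structural sketch of the ideas in Sections IX-B and IX-C: a T*-tree node with three children and an annealing step that applies one of the five operations, rejects illegal trees, and accepts the result with the standard Metropolis criterion. The node layout and the five operations follow the paper; everything else (function names, the toy stand-in bodies, the random-number plumbing) is our scaffolding, not 3D-HMFP source code.

```cpp
#include <cmath>
#include <random>

struct Node {                 // one module per T*-tree node
    double w, h;              // current width/height of the module
    Node* left = nullptr;     // placed to the right of the parent (same layer)
    Node* middle = nullptr;   // placed on top of the parent (upper layer)
    Node* right = nullptr;    // placed above the parent (same layer)
};

enum class Op { Rotate, Resize, Move, Swap, Replace };  // the five perturbations

// Hypothetical hooks: apply/undo a random operation, check T*-tree
// legality (e.g., a GLM module must not become a middle child), and
// pack the tree to evaluate the weighted cost of Eq. (2).
bool applyRandomOp(Node* root, Op op, std::mt19937& rng);
void undoLastOp(Node* root);
bool isLegal(const Node* root);
double packAndCost(const Node* root);

// One annealing step at temperature T: perturb, dismiss illegal trees,
// accept downhill moves always and uphill moves with prob. exp(-d/T).
double annealStep(Node* root, double curCost, double T, std::mt19937& rng) {
    std::uniform_int_distribution<int> pick(0, 4);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    Op op = static_cast<Op>(pick(rng));
    if (!applyRandomOp(root, op, rng) || !isLegal(root)) {
        undoLastOp(root);
        return curCost;                       // operation dismissed
    }
    double newCost = packAndCost(root);
    double d = newCost - curCost;
    if (d <= 0.0 || coin(rng) < std::exp(-d / T)) return newCost;  // accept
    undoLastOp(root);                         // reject and restore
    return curCost;
}

// Toy stand-ins so the sketch runs; a real engine would mutate the tree.
bool applyRandomOp(Node*, Op, std::mt19937&) { return true; }
void undoLastOp(Node*) {}
bool isLegal(const Node*) { return true; }
double packAndCost(const Node* root) { return root ? root->w * root->h : 0.0; }

int main() {
    std::mt19937 rng(42);
    Node root{10.0, 8.0};
    double cost = packAndCost(&root);
    for (int i = 0; i < 100; ++i)
        cost = annealStep(&root, cost, 1.0 * std::pow(0.95, i), rng);
}
```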

D. Cost Function

The goal of the floorplanning algorithm is to find a solution with the smallest weighted cost, which is as follows for a 2D design:

cost = α·A + β·WL,  (1)

where A and WL are the area and wirelength of the design. Thus, 3D-HMFP tries to minimize the area and wirelength of a 2D design.

For 3D hybrid monolithic floorplanning, we use Eq. (2):

cost = α·A + β·WL + γ·D + θ·P,  (2)

where D and P are the deviation and power consumption of the design. In 3D floorplanning, the different layers need to have similar dimensions in order to fully utilize the silicon area and minimize the dead space. We calculate the deviation as

D = |W1 − W2| · |H1 − H2| / A,  (3)

where W and H denote the width and height of the layers, respectively, and the subscripts denote the layer number. D is smaller if the two layers have similar dimensions. We also need to add power consumption to the cost function since hybrid floorplanning can replace a module with an alternative that has a different power value. 3D-HMFP favors a GLM implementation of a module over its BLM implementation: although they have similar total silicon area, the GLM implementation has lower power consumption due to reduced wirelength, which reduces the cost function we are trying to minimize.
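A direct transcription of Eqs. (2) and (3), with illustrative (not paper-specified) weight and input values; the per-layer bounding boxes are assumed to come from the packing step.

```cpp
#include <algorithm>
#include <cmath>

struct LayerBox { double W, H; };  // bounding box of each layer after packing

struct Weights { double alpha, beta, gamma, theta; };

// Eq. (3): deviation between the two layers' dimensions, normalized by area.
double deviation(const LayerBox& l1, const LayerBox& l2, double A) {
    return std::fabs(l1.W - l2.W) * std::fabs(l1.H - l2.H) / A;
}

// Eq. (2): weighted cost of a 3D hybrid monolithic floorplan.
// A = footprint area, WL = total wirelength, P = total power.
double cost3D(const Weights& w, double A, double WL,
              const LayerBox& l1, const LayerBox& l2, double P) {
    return w.alpha * A + w.beta * WL + w.gamma * deviation(l1, l2, A) + w.theta * P;
}

int main() {
    Weights w{1.0, 0.5, 1.0, 0.5};                  // illustrative weights only
    LayerBox l1{250.0, 248.0}, l2{244.0, 252.0};    // illustrative dimensions
    double A = std::max(l1.W, l2.W) * std::max(l1.H, l2.H);  // assumed footprint
    return cost3D(w, A, 7.5e5, l1, l2, 1.07e5) > 0.0 ? 0 : 1;
}
```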
E. Global Wire Power Consumption

Interconnects have started to dominate the power consumption of modern microprocessors. Thus, excluding interconnect power during floorplanning undermines floorplanning quality and underestimates the peak temperature. At each stage, 3D-HMFP calculates the length of the global wires and determines the number of repeaters that need to be added. It calculates the power consumption of the global wires and repeaters based on wire resistance and capacitance values obtained from ITRS 2013 [38] and the FinFET logic library. 3D-HMFP only calculates the global wire power consumption; intermediate wires are already taken into account by FinPrin-monolithic and CACTI-monolithic when they characterize the logic and memory modules, respectively.

X. HOTSPOT

Reduced footprint area, vertically-stacked multiple active layers, and the low thermal conductivity of the inter-layer dielectric increase the power density of monolithic 3D ICs and lead to higher on-chip temperatures [3]. Thus, the peak temperature can become an issue for monolithic 3D ICs due to their higher power density, and a thermal model is needed in order to accurately estimate the peak temperature and avoid hotspots on chip. We incorporate HotSpot 6.0 [11] into 3D-HMFP for thermal analysis of the chip. 3D-HMFP provides the area and power values of the different modules to HotSpot 6.0. Since HotSpot 6.0 cannot handle interconnect power separately, we distribute the global wire power among the modules in proportion to their areas. We use HotSpot's grid model to obtain more accurate temperature values. HotSpot 6.0 outputs not only the temperature of each block but also temperatures at a finer grid level specified by the user. The user can effect a trade-off between speed and accuracy of thermal simulation by changing the grid resolution.

Fig. 14 shows the thermal model organization along with the thicknesses we assume for each layer. In addition to the floorplan, power consumption, and layer dimensions, we also specify the layer heat capacity and thermal resistivity values. We use the thermal properties and layer thicknesses of the default thermal package in the HotSpot 6.0 distribution [39], which is reasonable for a typical high-performance processor. We assume a 1 µm thickness for the silicon layers consisting of active silicon and metal layers for the 14nm technology node [40]. We assume a 100 nm inter-layer dielectric thickness between the silicon layers, which is enough to eliminate the inter-layer coupling that may alter transistor behavior [41]. We report the peak temperature of the chip in our results.

[Fig. 14. Thermal model organization (top to bottom): top layer, 1 µm [40]; SiO2, 0.1 µm [41]; bottom layer, 1 µm [40]; bulk silicon, 150 µm [39]; thermal interface material, 20 µm [39]; heat spreader, 1 mm [39]; heat sink, 6.9 mm [39].]
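The area-proportional redistribution of global wire power described above is a small preprocessing step before per-module power values are handed to HotSpot; a sketch, with a hypothetical ModulePower record of our own (not a HotSpot or 3D-HMFP type):

```cpp
#include <vector>

struct ModulePower {
    double area;   // module footprint area (um^2)
    double power;  // module power from FinPrin-/CACTI-monolithic (uW)
};

// HotSpot 6.0 takes one power value per block, so the global wire power
// computed by 3D-HMFP is spread across the modules in proportion to area.
void distributeWirePower(std::vector<ModulePower>& mods, double globalWirePower) {
    double totalArea = 0.0;
    for (const auto& m : mods) totalArea += m.area;
    if (totalArea <= 0.0) return;
    for (auto& m : mods) m.power += globalWirePower * (m.area / totalArea);
}

int main() {
    std::vector<ModulePower> mods = {{10633.0, 19866.0}, {25444.0, 10548.0}};
    distributeWirePower(mods, 30000.0);  // illustrative wire power value
}
```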

XI. RESULTS

We characterize the OpenSPARC T2 [35] processor core using six different implementations: 2D, BLM, TLM, and three hybrid monolithic designs. We compute the footprint area, global wirelength, total power consumption, dead space, peak temperature, footprint area reduction, global wirelength reduction, total power reduction, and simulation run-time for each design. The results are obtained through the same methodology, using the same tools, and under the same 1 ns timing constraint. We treat logic modules as soft modules and memory modules as hard modules. Because simulated annealing is a stochastic technique, it only approximates the globally optimal solution. Therefore, comparing the results after a single run for each design may not be fair. Hence, we run 100 simulations for each design and use the best case for comparison. We run the floorplanning experiments on a 3.10 GHz machine with a 64-bit quad-core i5 processor, 8 GB DRAM, and the Ubuntu 12.04 LTS operating system.

A. Floorplanning Results

We show the floorplanning results in Table IX. We choose the floorplan with the minimum area-power product for each design because both parameters are important. All designs exhibit very small total dead space. The results show that a hybrid monolithic design (HM2) offers a 48.1% reduction in footprint area and a 14.6% reduction in power consumption at the cost of a small increase in temperature.

TABLE IX. COMPARISON OF DIFFERENT MONOLITHIC DESIGNS BASED ON MINIMUM AREA-POWER PRODUCT SHOWING THE BENEFIT OF HYBRID DESIGNS IN TERMS OF FOOTPRINT AREA AND POWER CONSUMPTION
Design | Area (µm²) | Wirelength (µm) | Power (µW) | Dead space (%) | Temperature (°K) | Area red. (%) | Wirelength red. (%) | Power red. (%) | Run time (s)
2D | 120516 | 971365 | 125442 | 0.3 | 325.2 | 0.0 | 0.0 | 0.0 | 6.2
BLM | 61984 | 584606 | 116271 | 3.1 | 330.8 | 48.6 | 39.8 | 7.3 | 6.8
TLM | 81171 | 732676 | 112298 | 1.4 | 327.5 | 32.6 | 24.6 | 10.5 | 6.1
HM1 (GLM logic + TLM memory) | 66420 | 678064 | 105867 | 0.2 | 329.0 | 44.9 | 30.2 | 15.6 | 6.0
HM2 (GLM logic + BLM memory) | 62600 | 755597 | 107134 | 4.1 | 329.8 | 48.1 | 22.2 | 14.6 | 6.3
HM3 (GLM/BLM logic + BLM memory) | 61410 | 727442 | 110498 | 2.8 | 330.3 | 49.0 | 25.1 | 11.9 | 6.7

The 2D design has the largest footprint area and power consumption among all the designs. Its peak temperature is the lowest since temperature depends on power density. We use this design as the baseline. The BLM design has 48.6% lower footprint area and 7.3% lower power consumption relative to the 2D design. Its power reduction is due to a reduction in global wire power consumption. The TLM design has 32.6% smaller footprint area relative to the 2D design since TLM cell layouts have 35.7% smaller footprint area compared to those of BLM. Its power reduction is 10.5%, due to shorter local and global interconnects.

The HM1 design implements all logic modules in GLM and all memory modules in TLM. This design can be floorplanned using a 2D floorplanner since all of its modules span two layers. Its footprint area and power consumption are, respectively, 44.9% and 15.6% lower compared to those of the 2D design. HM1 offers the best power value among all designs because it saves power in both logic and memory modules. It has 8.9% less power consumption compared to the BLM design, mostly due to the intra-module wirelength power reduction of GLM logic modules.

The HM2 design implements all logic modules in GLM and all memory modules in BLM. It uses GLM logic modules to save footprint area and power, and BLM memory modules since they occupy smaller total silicon area compared to TLM memory modules. Hence, the footprint area is reduced by 48.1%, and the power consumption by 14.6%. These results are similar to those of [6], in which the OpenSPARC T2 core implemented in a monolithic design has 50.0% smaller footprint area and 15.6% less power consumption compared to its 2D counterpart. Similarly, the monolithic OpenSPARC T2 core design was reported to have 14.0% smaller power consumption compared to the 2D design in [14].

The HM3 design uses logic modules from both the BLM and GLM implementations. Its footprint area is reduced by 49.0% and power by 11.9% compared to those of the 2D design. HM3 offers the best footprint area since it can use both BLM and GLM logic modules to minimize the dead space. It has slightly worse power consumption than that of HM2 because it sometimes uses BLM logic modules instead of GLM logic modules to reduce the footprint area and global wirelength. HM3 offers more flexibility than HM2 because the HM3 design space is a superset of the HM2 design space. However, exploring a larger design space increases the algorithm run-time slightly: in our case, HM3 takes 5.3% more time, on an average, to reach the solution compared to HM2.

3D-HMFP has the same thermal-aware floorplanning ability as its predecessor 3DFP [36]. However, the temperature reduction from thermal-aware floorplanning was not significant because, in our designs, the peak temperature and the temperature range of the different floorplans were already low for non-thermal-aware floorplans.

Fig. 15 shows the floorplans of the 2D, BLM, TLM, and HM3 designs reported in Table IX. The 2D design is implemented on a single layer and the 3D designs on two (top and bottom) layers. The TLM design has the same floorplan on both layers since all of its modules have identical dimensions on both layers. The HM3 design has both GLM and BLM modules; the GLM modules have two parts and occupy the same space on both layers.

[Fig. 15. Floorplanning results showing that the vertical constraints are met for TLM/GLM modules: (a) 2D, (b) BLM, (c) TLM, and (d) HM3 (GLM/BLM logic + BLM memory). Colors indicate the implementation type: blue: BLM, brown: TLM, and green: GLM.]

B. Floorplanning Results at Minimum Area, Wirelength, and Power Values

One may have different objectives for floorplanning, such as decreasing the footprint area, reducing the power consumption, or reducing the wirelength for easier routing. Therefore, we also compare the minimum-area, minimum-wirelength, and minimum-power configurations to better understand the trade-offs.

Table X shows the minimum-area floorplanning results. The corresponding designs from Table IX are used as the baselines from here on. Minimum footprint area is achieved at the cost of higher power consumption and an increase in total wirelength. The minimum-area 2D design has 23.5% larger wirelength and 3.9% higher power consumption. For the minimum-area TLM design, wirelength and power increase by 36.3% and 6.3%, respectively, with only a 1.0% footprint area reduction. The HM1 and HM2 designs have less than 1.0% smaller area at the cost of around a 1.0% power consumption increase. The HM2 design, interestingly, has 0.8% higher power despite a 2.4% decrease in wirelength. This can be explained based on the number of repeaters on long global interconnects, since both HM2 designs have the same modules but different global interconnects. A higher total global wirelength does not necessarily lead to higher power consumption: a design with higher overall wirelength but fewer long interconnects can have less power, because short interconnects do not require repeaters. The HM3 minimum-area design has a 2.1% area reduction with a 7.8% increase in power. These results show that the minimum-area design might not be the best design overall. The floorplanner may place modules that are connected to each other with global wires far from each other to minimize area, which can increase the wirelength and power consumption.

Table XI shows the results for minimum wirelength. Reducing the global wirelength can save power and make routing easier. However, the area may increase when trying to minimize the wirelength. The BLM design has 26.0% less wirelength at the cost of a 9.9% footprint area increase. The HM3 design can also reduce the wirelength significantly owing to its exploration of a larger design space. It has 28.1% less wirelength, but a 4.0% footprint area increase and a 3.6% power increase despite its shorter wirelength. The increase in power is due to the replacement of GLM modules with BLM modules.

Table XII shows the designs with the smallest power consumption. Power reduction is generally obtained from wirelength reduction at the cost of a higher footprint area. For the 2D design, power consumption decreases by 1.0% owing to a 3.2% reduction in wirelength, but the area increases by 2.1%. The BLM design has 2.5% less power consumption at the cost of a 9.9% increase in footprint area. The HM2 design has a 2.1% power reduction, but a 15.4% footprint area increase. The HM3 design has 3.8% less power consumption with a 4.2% increase in footprint area. Unlike in the other cases, HM3 can save power by replacing BLM modules with GLM ones at the cost of an increase in footprint area.

Overall, highly optimizing a single design objective may come at the cost of degrading the other design objectives significantly. Therefore, the designer needs to select the cost function and the coefficients of the specific design objectives carefully in order to avoid exaggerating the effect of only one objective.

C. Discussion

The floorplanning results show that hybrid monolithic designs offer trade-offs among chip footprint area, wirelength, and power consumption. The quality of the hybrid monolithic solution depends significantly on the number and implementation styles of the modules. For a 3D design with two layers, only 2D modules can be placed on top of other 2D modules. Therefore, for quality hybrid solutions, the 2D modules should be divisible into two groups with similar overall area. This was possible with our benchmark, as evidenced by dead space as small as 3.4% in the HM2 design, in which only BLM memory modules are placed on top of other BLM memory modules because the logic modules are implemented in the GLM style. However, this may not be the case for all benchmarks. If the 2D modules cannot be grouped into two groups with similar area values, then HM2 would not be able to find good solutions in terms of footprint area. In that case, fortunately, HM3 can still find quality solutions to reduce footprint area and wirelength, since it can replace GLM logic modules with BLM logic modules. As expected, replacing GLM modules with their alternatives in the HM3 design increases the power consumption of the modules, since GLM modules have the smallest power consumption. On the other hand, HM3 can reduce the global wire power consumption by reducing the footprint area. Therefore, replacing GLM modules with their alternatives may increase or decrease the total power consumption, depending on the benchmark.

Monolithic designs can also be realized with more layers, although, in this work, we assume 3D designs with only two layers. More layers may facilitate a further reduction in footprint area, wirelength, and power consumption. However, the peak temperature may increase further, since the power density will increase and the upper layers will be farther from the heat sink. Hybrid monolithic designs may become even more attractive as more layers are added because of the flexibility they offer. For example, a hybrid monolithic floorplanner can mitigate hotspots by replacing GLM modules that have high power density with their BLM alternatives. Moreover, the 3D-HMFP cost function can be modified to place modules with high power density on the lower layers. 3D-HMFP can easily be modified to handle more layers with a few modifications, such as adding a contour for each layer to keep track of the placement on that layer during packing and modifying the legalization constraint on the number of layers.

In this work, we used simulated annealing, a commonly used stochastic technique, to perturb the floorplanning solutions. Simulated annealing worked well because we have only 35 modules in our design. However, stochastic techniques can suffer from scalability issues if the number of modules is significantly higher. In such cases, algorithms based on non-stochastic approaches, such as deferred decision making [42], can be used.

Fabrication cost is an issue with monolithic designs. In this work, we have not taken fabrication cost and challenges into account. BLM and GLM designs require additional metal layers between the active layers to connect intra-module cells. On the other hand, a TLM design requires far fewer metal layers between active layers. A recent study [43] estimates that a TLM design is 23.0% less costly compared to a GLM design, thanks to fewer metal layers and fabrication steps in the TLM fabrication process. Although, based on the objectives pursued in this work, a TLM design does not seem to be promising, it may be advantageous if fabrication cost is taken into account. Thus, more work needs to be done on incorporating fabrication cost in the floorplanning cost function of monolithic designs.
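To make the selection criteria used in the following tables concrete: out of the 100 runs per design, Table IX reports the run with the minimum area-power product, and Tables X-XII report the runs with minimum area, wirelength, and power, respectively. A sketch of that post-processing (our illustration; the Run record is hypothetical):

```cpp
#include <algorithm>
#include <vector>

struct Run { double area, wirelength, power; };  // one of the 100 SA runs

// Select the representative run for a given table's criterion.
const Run& pick(const std::vector<Run>& runs, double (*key)(const Run&)) {
    return *std::min_element(runs.begin(), runs.end(),
        [key](const Run& a, const Run& b) { return key(a) < key(b); });
}

int main() {
    std::vector<Run> runs = {{62600, 755597, 107134}, {62150, 737539, 108043}};
    const Run& tableIX  = pick(runs, [](const Run& r) { return r.area * r.power; });
    const Run& tableX   = pick(runs, [](const Run& r) { return r.area; });
    const Run& tableXI  = pick(runs, [](const Run& r) { return r.wirelength; });
    const Run& tableXII = pick(runs, [](const Run& r) { return r.power; });
    (void)tableIX; (void)tableX; (void)tableXI; (void)tableXII;
}
```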

TABLE X. MINIMUM-AREA CONFIGURATIONS OF DIFFERENT MONOLITHIC DESIGNS
Design | Area (µm²) | Wirelength (µm) | Power (µW) | Dead space (%) | Temperature (°K) | Area red. (%) | Wirelength red. (%) | Power red. (%)
2D | 120132 | 1199780 | 130388 | 0.0 | 325.4 | 0.3 | -23.5 | -3.9
BLM | 60888 | 729625 | 119870 | 1.4 | 331.5 | 1.8 | -24.8 | -3.1
TLM | 80358 | 998886 | 119333 | 0.4 | 328.3 | 1.0 | -36.3 | -6.3
HM1 (GLM logic + TLM memory) | 66305 | 754492 | 107121 | 0.0 | 329.3 | 0.2 | -11.3 | -1.2
HM2 (GLM logic + BLM memory) | 62150 | 737539 | 108043 | 3.4 | 329.9 | 0.7 | 2.4 | -0.8
HM3 (GLM/BLM logic + BLM memory) | 60114 | 785678 | 119116 | 0.2 | 330.2 | 2.1 | -8.0 | -7.8

TABLE XI. MINIMUM-WIRELENGTH CONFIGURATIONS OF DIFFERENT MONOLITHIC DESIGNS
Design | Area (µm²) | Wirelength (µm) | Power (µW) | Dead space (%) | Temperature (°K) | Area red. (%) | Wirelength red. (%) | Power red. (%)
2D | 122000 | 869321 | 124983 | 1.6 | 325.0 | -1.2 | 10.5 | 0.4
BLM | 68115 | 432366 | 113402 | 11.8 | 329.4 | -9.9 | 26.0 | 2.5
TLM | 84671 | 672989 | 111393 | 5.4 | 327.1 | -4.3 | 8.1 | 0.8
HM1 (GLM logic + TLM memory) | 67262 | 628482 | 104832 | 1.4 | 328.9 | -1.3 | 7.3 | 1.0
HM2 (GLM logic + BLM memory) | 64064 | 613726 | 105918 | 6.3 | 329.5 | -2.3 | 18.8 | 1.1
HM3 (GLM/BLM logic + BLM memory) | 63856 | 523247 | 114516 | 5.9 | 330.3 | -4.0 | 28.1 | -3.6

TABLE XII. MINIMUM-POWER CONFIGURATIONS OF DIFFERENT MONOLITHIC DESIGNS
Design | Area (µm²) | Wirelength (µm) | Power (µW) | Dead space (%) | Temperature (°K) | Area red. (%) | Wirelength red. (%) | Power red. (%)
2D | 123057 | 940181 | 124159 | 2.4 | 324.9 | -2.1 | 3.2 | 1.0
BLM | 68115 | 432366 | 113402 | 11.8 | 329.4 | -9.9 | 26.0 | 2.5
TLM | 84903 | 688873 | 110570 | 5.7 | 327.2 | -4.6 | 6.0 | 1.5
HM1 (GLM logic + TLM memory) | 69296 | 631438 | 104236 | 4.3 | 328.4 | -4.3 | 6.9 | 1.5
HM2 (GLM logic + BLM memory) | 72240 | 657147 | 104907 | 16.9 | 328.1 | -15.4 | 13.0 | 2.1
HM3 (GLM/BLM logic + BLM memory) | 63960 | 666829 | 106284 | 6.7 | 329.4 | -4.2 | 8.3 | 3.8

TLM design is 23.0% less costly compared to a GLM design, [2] J. S. Clarke, C. George, C. Jezewski, A. M. Caro, D. Michalak, and thanks to fewer metal layers and fabrication steps in the TLM J. Torres, “Process technology scaling in an increasingly interconnect fabrication process. Although, based on objectives pursued in dominated world,” in Proc. Int. Symp. VLSI Technology, June 2014, pp. 1–2. this work, a TLM design does not seem to be promising, it [3] J. Cong, J. Wei, and Y. Zhang, “A thermal-driven floorplanning algo- may be advantageous if fabrication cost is taken into account. rithm for 3D ICs,” in Proc. Int. Conf. Comput.-Aided Des., Nov. 2004, Thus, more work needs to be done on incorporating fabrication pp. 306–313. cost in the floorplanning cost function of monolithic designs. [4] K. Vaidyanathan, D. H. Morris, U. E. Avci, I. S. Bhati, L. Subramanian, J. Gaur, H. Liu, S. Subramoney, T. Karnik, H. Wang, and I. A. Young, “Overcoming interconnect scaling challenges using novel process and XII.CONCLUSION design solutions to improve both high-speed and low-power computing We introduced a 3D hybrid monolithic floorplanner in this modes,” in Proc. IEEE Int. Electron Devices Meeting, Dec. 2017. paper. We characterized the OpenSPARC T2 processor core [5] P. Kapur, G. Chandra, and K. C. Saraswat, “Power estimation in global in different monolithic designs and compared their footprint interconnects and its reduction using a novel repeater optimization Proc. Des. Automat. Conf. area, wirelength, power, and temperature values. We showed, methodology,” in , June 2002, pp. 461–466. via simulations, that hybrid monolithic designs offer interesting [6] S. Panth, K. Samadi, Y. Du, and S. K. Lim, “Design and CAD methodologies for low power gate-level monolithic 3D ICs,” in Proc. trade-offs among different design objectives, such as footprint Int. Symp. Low Power Electron. & Des., Aug. 2014, pp. 171–176. area, wirelength, and power consumption. We showed that, [7] C. Liu and S. K. Lim, “A design tradeoff study with monolithic 3D relative to the 2D design, a 3D hybrid monolithic design can integration,” in Proc. Int. Symp. Qual. Electron. Des., Mar. 2012, pp. reduce footprint area by 48.1% and power consumption by 529–536. 14.6%. [8] S. Panth, K. Samadi, Y. Du, and S. K. Lim, “Power-performance study In our future work, we plan to explore the benefits of of block-level monolithic 3D-ICs considering inter-tier performance hybrid designs in more detail using commercial tools and an variations,” in Proc. Des. Automat. Conf., June 2014, pp. 1–6. RTL-to-GDSII physical design flow. We also plan to build [9] Y. Yang and N. K. Jha, “FinPrin: FinFET logic circuit analysis and an architectural framework to analyze different monolithic optimization under PVT variations,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 12, pp. 2462–2475, Dec. 2014. designs at the system level with multi-core processors and network-on-chip. We will develop the necessary EDA tools [10] C.-Y. Lee and N. K. Jha, “FinCANON: A PVT-aware integrated delay and power modeling framework for FinFET-based caches and on-chip that will allow us to explore the monolithic design space at networks,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, the multi-core level. no. 5, pp. 1150–1163, May 2014. [11] R. Zhang, M. R. Stan, and K. Skadron, “HotSpot 6.0: Validation, acceleration and extension,” University of Virginia, Tech. Report, CS- EFERENCES R 2015-04, Aug. 2015. [1] R. Ho, K. W. Mai, and M. A. 
XII. CONCLUSION

We introduced a hybrid monolithic 3D IC floorplanner in this paper. We characterized the OpenSPARC T2 processor core in different monolithic designs and compared their footprint area, wirelength, power consumption, and temperature. We showed, via simulations, that hybrid monolithic designs offer interesting trade-offs among design objectives such as footprint area, wirelength, and power consumption. We also showed that, relative to the 2D design, a 3D hybrid monolithic design can reduce footprint area by 48.1% and power consumption by 14.6%.

In future work, we plan to explore the benefits of hybrid designs in more detail using commercial tools and an RTL-to-GDSII physical design flow. We also plan to build an architectural framework for analyzing different monolithic designs at the system level with multi-core processors and networks-on-chip, and to develop the EDA tools needed to explore the monolithic design space at the multi-core level.

REFERENCES

[1] R. Ho, K. W. Mai, and M. A. Horowitz, “The future of wires,” Proc. IEEE, vol. 89, no. 4, pp. 490–504, Apr. 2001.
[2] J. S. Clarke, C. George, C. Jezewski, A. M. Caro, D. Michalak, and J. Torres, “Process technology scaling in an increasingly interconnect dominated world,” in Proc. Int. Symp. VLSI Technology, June 2014, pp. 1–2.
[3] J. Cong, J. Wei, and Y. Zhang, “A thermal-driven floorplanning algorithm for 3D ICs,” in Proc. Int. Conf. Comput.-Aided Des., Nov. 2004, pp. 306–313.
[4] K. Vaidyanathan, D. H. Morris, U. E. Avci, I. S. Bhati, L. Subramanian, J. Gaur, H. Liu, S. Subramoney, T. Karnik, H. Wang, and I. A. Young, “Overcoming interconnect scaling challenges using novel process and design solutions to improve both high-speed and low-power computing modes,” in Proc. IEEE Int. Electron Devices Meeting, Dec. 2017.
[5] P. Kapur, G. Chandra, and K. C. Saraswat, “Power estimation in global interconnects and its reduction using a novel repeater optimization methodology,” in Proc. Des. Automat. Conf., June 2002, pp. 461–466.
[6] S. Panth, K. Samadi, Y. Du, and S. K. Lim, “Design and CAD methodologies for low power gate-level monolithic 3D ICs,” in Proc. Int. Symp. Low Power Electron. & Des., Aug. 2014, pp. 171–176.
[7] C. Liu and S. K. Lim, “A design tradeoff study with monolithic 3D integration,” in Proc. Int. Symp. Qual. Electron. Des., Mar. 2012, pp. 529–536.
[8] S. Panth, K. Samadi, Y. Du, and S. K. Lim, “Power-performance study of block-level monolithic 3D-ICs considering inter-tier performance variations,” in Proc. Des. Automat. Conf., June 2014, pp. 1–6.
[9] Y. Yang and N. K. Jha, “FinPrin: FinFET logic circuit analysis and optimization under PVT variations,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 12, pp. 2462–2475, Dec. 2014.
[10] C.-Y. Lee and N. K. Jha, “FinCANON: A PVT-aware integrated delay and power modeling framework for FinFET-based caches and on-chip networks,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 5, pp. 1150–1163, May 2014.
[11] R. Zhang, M. R. Stan, and K. Skadron, “HotSpot 6.0: Validation, acceleration and extension,” University of Virginia, Tech. Report CS-2015-04, Aug. 2015.
[12] S. Bobba, A. Chakraborty, O. Thomas, P. Batude, and G. de Micheli, “Cell transformations and physical design techniques for 3D monolithic integrated circuits,” ACM J. Emerg. Technol. Comput. Syst., vol. 9, no. 3, pp. 19:1–19:28, Oct. 2013.
[13] M. Jung, T. Song, Y. Peng, and S. K. Lim, “Design methodologies for low-power 3-D ICs with advanced tier partitioning,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 7, pp. 2109–2117, July 2017.
[14] S. Panth, K. Samadi, Y. Du, and S. K. Lim, “Shrunk-2D: A physical design methodology to build commercial-quality monolithic 3-D ICs,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 36, no. 10, pp. 1716–1724, Oct. 2017.
[15] J. H. Law, E. F. Young, and R. L. Ching, “Block alignment in 3D floorplan using layered TCG,” in Proc. Great Lakes Symp. VLSI, Apr.–May 2006, pp. 376–380.
[16] X. Li, Y. Ma, and X. Hong, “A novel thermal optimization flow using incremental floorplanning for 3D ICs,” in Proc. Asia South Pacific Des. Automat. Conf., Jan. 2009, pp. 347–352.
[17] R. K. Nain and M. Chrzanowska-Jeske, “Fast placement-aware 3-D floorplanning using vertical constraints on sequence pairs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 9, pp. 1667–1680, Sept. 2011.
[18] A. Quiring, M. Lindenberg, M. Olbrich, and E. Barke, “3D floorplanning considering vertically aligned rectilinear modules using T*-tree,” in Proc. IEEE Int. 3D Syst. Integration Conf., Jan. 2012, pp. 1–5.
[19] J. Knechtel, E. F. Young, and J. Lienig, “Planning massive interconnects in 3-D chips,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 34, no. 11, pp. 1808–1821, Nov. 2015.
[20] J.-M. Lin, P.-Y. Chiu, and Y.-F. Chang, “SAINT: Handling module folding and alignment in fixed-outline floorplans for 3D ICs,” in Proc. Int. Conf. Comput.-Aided Des., Nov. 2016, pp. 1–7.
[21] Y. Chen, G. Sun, Q. Zou, and Y. Xie, “3DHLS: Incorporating high-level synthesis in physical planning of three-dimensional (3D) ICs,” in Proc. Des. Automat. & Test Europe Conf., Mar. 2012, pp. 1185–1190.
[22] R. S. Patti, “Three-dimensional integrated circuits and the future of system-on-chip designs,” Proc. IEEE, vol. 94, no. 6, pp. 1214–1224, June 2006.
[23] D. H. Kim, K. Athikulwongse, and S. K. Lim, “Study of through-silicon-via impact on the 3-D stacked IC layout,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 5, pp. 862–874, May 2013.
[24] P. Batude, M. Vinet, B. Previtali, C. Tabone, C. Xu, J. Mazurier, O. Weber, F. Andrieu, L. Tosti, L. Brevard, B. Sklenard, P. Coudrain, S. Bobba, H. B. Jamaa, P. E. Gaillardon, A. Pouydebasque, O. Thomas, C. L. Royer, J. M. Hartmann, L. Sanchez, L. Baud, V. Carron, L. Clavelier, G. de Micheli, S. Deleonibus, O. Faynot, and T. Poiroux, “Advances, challenges and opportunities in 3D CMOS sequential integration,” in Proc. IEEE Int. Electron Devices Meeting, Dec. 2011, pp. 7.3.1–7.3.4.
[25] X. Dai and N. K. Jha, “Improving convergence and simulation time of quantum hydrodynamic simulation: Application to extraction of best 10-nm FinFET parameter values,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 1, pp. 319–329, Jan. 2017.
[26] Intel Corp., “Intel 14nm technology,” 2014. [Online]. Available: https://www.intel.com/content/www/us/en/silicon-innovations/intel-14nm-technology.html
[27] Samsung Foundry, “14nm FinFET technology,” 2015. [Online]. Available: http://www.samsung.com/semiconductor/foundry/process-technology/14nm/
[28] Synopsys Inc., “Sentaurus TCAD tool suite, version I-2013.12,” 2013. [Online]. Available: http://www.synopsys.com
[29] J. K. Ousterhout, G. T. Hamachi, R. N. Mayo, W. S. Scott, and G. S. Taylor, “The Magic VLSI layout system,” IEEE Des. Test Comput., vol. 2, no. 1, pp. 19–30, Feb. 1985.
[30] Synopsys Inc., “Synopsys Design Compiler,” 2013. [Online]. Available: http://www.synopsys.com
[31] J. A. Roy, S. N. Adya, D. A. Papa, and I. L. Markov, “Min-cut floorplacement,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 25, no. 7, pp. 1313–1326, July 2006.
[32] C. Chu and Y.-C. Wong, “FLUTE: Fast lookup table based rectilinear Steiner minimal tree algorithm for VLSI design,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 27, no. 1, pp. 70–83, Jan. 2008.
[33] C. Yan and E. Salman, “Mono3D: Open source cell library for monolithic 3-D integrated circuits,” IEEE Trans. Circuits & Syst. I, vol. 65, no. 3, pp. 1075–1085, Mar. 2018.
[34] B. W. Ku, T. Song, A. Nieuwoudt, and S. K. Lim, “Transistor-level monolithic 3D standard cell layout optimization for full-chip static power integrity,” in Proc. Int. Symp. Low Power Electron. & Des., July 2017, pp. 1–6.
[35] Oracle, “OpenSPARC T2,” 2007. [Online]. Available: http://www.oracle.com
[36] W.-L. Hung, G. M. Link, Y. Xie, N. Vijaykrishnan, and M. J. Irwin, “Interconnect and thermal-aware floorplanning for 3D microprocessors,” in Proc. Int. Symp. Qual. Electron. Des., Mar. 2006, pp. 98–104.
[37] Y.-C. Chang, Y.-W. Chang, G.-M. Wu, and S.-W. Wu, “B*-Trees: A new representation for non-slicing floorplans,” in Proc. Des. Automat. Conf., June 2000, pp. 458–463.
[38] ITRS, “International Technology Roadmap for Semiconductors,” 2013. [Online]. Available: http://www.itrs2.net/2013-itrs.html
[39] HotSpot, “HotSpot 6.0 temperature modeling tool,” 2015. [Online]. Available: http://lava.cs.virginia.edu/HotSpot/
[40] S. K. Samal, S. Panth, K. Samadi, M. Saeidi, Y. Du, and S. K. Lim, “Adaptive regression-based thermal modeling and optimization for monolithic 3-D ICs,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 35, no. 10, pp. 1707–1720, Oct. 2016.
[41] P. Batude, M. A. Jaud, O. Thomas, L. Clavelier, A. Pouydebasque, M. Vinet, S. Deleonibus, and A. Amara, “3D CMOS integration: Introduction of dynamic coupling and application to compact and robust 4T SRAM,” in Proc. Int. Conf. Integr. Circuit Des. and Technology and Tutorial, June 2008, pp. 281–284.
[42] J. Z. Yan and C. Chu, “DeFer: Deferred decision making enabled fixed-outline floorplanning algorithm,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 29, no. 3, pp. 367–381, Mar. 2010.
[43] J. Shi, D. Nayak, S. Banna, R. Fox, S. Samavedam, S. K. Samal, and S. K. Lim, “A 14nm FinFET transistor-level 3D partitioning design to enable high-performance and low-cost monolithic 3D IC,” in Proc. IEEE Int. Electron Devices Meeting, Dec. 2016, pp. 2–5.

Abdullah Guler (S’15) received his B.S. degree in Electrical and Electronics Engineering from Bilkent University, Ankara, Turkey, in 2013, and his M.A. degree in Electrical Engineering from Princeton University, Princeton, NJ, in 2015, where he is currently pursuing his Ph.D. degree in Electrical Engineering. His current research interests include monolithic 3D IC design and FinFET-based SRAM design.

Niraj K. Jha (S’85-M’85-SM’93-F’98) received his B.Tech. degree in Electronics and Electrical Communication Engineering from Indian Institute of Technology, Kharagpur, India in 1981, M.S. degree in Electrical Engineering from S.U.N.Y. at Stony Brook, NY in 1982, and Ph.D. degree in Electrical Engineering from University of Illinois at Urbana-Champaign, IL in 1985. He is a Professor of Electrical Engineering at Princeton University.

Prof. Jha served as the Editor-in-Chief of IEEE Transactions on VLSI Systems and an Associate Editor of IEEE Transactions on Circuits and Systems I and II, IEEE Transactions on VLSI Systems, IEEE Transactions on Computer-Aided Design, IEEE Transactions on Computers, Journal of Electronic Testing: Theory and Applications, and Journal of . He is currently serving as an Associate Editor of IEEE Transactions on Multi-Scale Computing Systems and Journal of Low Power Electronics. He has also served as the Program Chairman of the 1992 Workshop on Fault-Tolerant Parallel and Distributed Systems, the 2004 International Conference on Embedded and Ubiquitous Computing, and the 2010 International Conference on VLSI Design. He has served as the Director of the Center for Embedded System-on-a-chip Design, funded by the New Jersey Commission on Science and Technology, and as the Associate Director of the Andlinger Center for Energy and the Environment. He is a Fellow of IEEE and ACM. He received the Distinguished Alumnus Award from I.I.T., Kharagpur in 2014. He is the recipient of the AT&T Foundation Award and NEC Preceptorship Award for research excellence, the NCR Award for teaching excellence, six Outstanding Teaching Commendations, and the Princeton University Graduate Mentoring Award.

He has co-authored or co-edited five books titled Testing and Reliable Design of CMOS Circuits (Kluwer, 1990), High-Level Power Analysis and Optimization (Kluwer, 1998), Testing of Digital Systems (Cambridge University Press, 2003), Switching and Finite Automata Theory, 3rd edition (Cambridge University Press, 2009), and Nanoelectronic Circuit Design (Springer, 2010). He has also authored 15 book chapters. He has authored or co-authored more than 430 technical papers. He has coauthored 14 papers that have won various awards, including the Best Paper Award at ICCD’93, FTCS’97, ICVLSID’98, DAC’99, PDCS’02, ICVLSID’03, CODES’06, ICCD’09, and CLOUD’10. A paper of his was selected for “The Best of ICCAD: A collection of the best IEEE International Conference on Computer-Aided Design papers of the past 20 years,” two papers were selected by IEEE Micro Magazine as top picks from the 2005 and 2007 Computer Architecture conferences, and two others were recognized as being among the most influential papers of the last 10 years at the IEEE Design Automation and Test in Europe Conference. He has co-authored another six papers that have been nominated for best paper awards. He has received 17 U.S. patents. He has served on the program committees of more than 170 conferences and workshops.

His research interests include monolithic 3D IC design, low power hardware/software design, computer-aided design of integrated circuits and systems, embedded computing, secure computing, machine learning, and smart healthcare. He has given several keynote speeches in the area of nanoelectronic design/test and smart healthcare.