
Article Multi-Tier 3D IC Physical Design with Analytical Quadratic Partitioning Algorithm Using 2D P&R Tool

Azwad Tamir 1 , Milad Salem 1 , Jie Lin 1, Qutaiba Alasad 2 and Jiann-shiun Yuan 1,*

1 Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL 32816, USA; [email protected] (A.T.); [email protected] (M.S.); [email protected] (J.L.)
2 Department of Petroleum Systems and Control Engineering, Tikrit University, Al-Qadissiya, Tikrit P.O. Box 42, Iraq; [email protected]
* Correspondence: [email protected]

Abstract: In this study, we developed a complete flow for the design of monolithic 3D ICs. We have taken the register-transfer level netlist of a circuit as the input and synthesized it to construct the gate-level netlist. Next, we partitioned the circuit using custom-made partitioning algorithms and implemented the place and route flow of the entire 3D IC by repurposing 2D electronic design automation tools. We implemented two different partitioning algorithms, namely the min-cut and the analytical quadratic (AQ) algorithms, to assign the cells in different tiers. We applied our flow on three different benchmark circuits and compared the total power dissipation of the 3D designs with their 2D counterparts. We also compared our results with those of similar works and obtained significantly better performance. Our two-tier 3D flow with AQ partitioner obtained 37.69%, 35.06%, and 12.15% power reduction compared to its 2D counterparts on the advanced encryption standard, floating-point unit, and fast Fourier transform benchmark circuits, respectively. Finally, we analyzed the type of circuits that are more applicable for a 3D layout and the impact of increasing the number of tiers of the 3D design on total power dissipation.

Keywords: 3D; analytical quadratic; minimum cut; MIV; monolithic 3D IC; physical design; placement; tier partitioning

Citation: Tamir, A.; Salem, M.; Lin, J.; Alasad, Q.; Yuan, J.-s. Multi-Tier 3D IC Physical Design with Analytical Quadratic Partitioning Algorithm Using 2D P&R Tool. Electronics 2021, 10, 1930. https://doi.org/10.3390/electronics10161930

Academic Editor: Esteban Tlelo-Cuautle

Received: 31 March 2021
Accepted: 8 August 2021
Published: 11 August 2021

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Electronics 2021, 10, 1930. https://doi.org/10.3390/electronics10161930 https://www.mdpi.com/journal/electronics

1. Introduction

The continuing trend of rapid reduction in the transistor’s size is causing a large increase in the density of the digital IC, resulting in severe problems in placing low parasitic interconnects within the standard cells. This reduction of transistor size is essential as it would allow more transistors to be fit on a chip of the same dimensionality and help manufacture smaller electronics. Moreover, we are nearing the limit of transistor dimensionality reduction, making it increasingly difficult to keep up with the golden rule of the industry, i.e., Moore’s law [1]. One potential solution to these problems is the three-dimensional IC layout structure, which can contribute to ultra-high-density ICs, leading to higher performance and lower overall power dissipation.

The evolution of the IC physical design first produced the 2.5D structure, which consists of an interposer layer, with individual dies placed on top of it [2]. This served as a steppingstone for the design of the 3D IC, which consists of silicon dies placed vertically on top of each other. The individual dies are referred to as tiers and they are made up of a silicon substrate, followed by a device layer with an interconnect layer on the top [2–4]. The early 3D designs involved the tiers being fabricated separately, placed on top of each other, and connected with the help of vertical interconnects called through-silicon vias (TSVs) [5–11]. However, there were several disadvantages to the usage of TSVs compared to the utilization of the more recent technology of monolithic inter-tier vias (MIVs). The large pitch and size of the TSV result in high parasitic capacitance and resistance, and limit the density with which they could be placed. Conversely, MIVs are orders of magnitude smaller in comparison and comparable in dimensions to a regular inter-tier via, which is less than 100 nm in diameter [12]. The monolithic IC is made by fabricating each tier sequentially instead of putting them together after fabrication, as done with TSV-based designs.

Significant work has been done on the monolithic 3D IC design in recent times, most of which contributed to developing various design concepts of 3D ICs. The studies in this field include developing a face-to-face stacked heterogeneous 3D IC structure [13], exploring various microfluidic cooling mechanisms for 3D ICs [14], studying recent developments in monolithic 3D ICs and the high density and performance benefits [15], designing a logic-on-memory processor using monolithic 3D (M3D) IC techniques [16], developing a design and testing system for M3D ICs [17], comparing the TSV-based 3D structure with M3Ds [18,19], studying the effect of process variation on the performance of M3D ICs [20,21], and developing effective gate-sizing methods to boost circuit speed while considering intra-die process variation [22]. Other related studies include repurposing various components of commercial 2D P&R tools to implement 3D ICs [23–26]. However, there are not many works related to the development of a complete RTL-to-GDS physical design flow for the M3D structure due to the unavailability of commercial EDA tools for M3D P&R implementation. The limited study in this area includes the design of an overall flow developed by H. Park et al. and the development of M3D logic by Y. Lee et al. [27–29]. Our work involves an easy-to-use RTL-to-GDS design flow with custom-made tier partitioning algorithms. We analyzed the results obtained and compared them with those of related works and found a significant performance increase. The major contributions of our work are provided below:

1. We developed two different tier partitioning systems named the minimum cut (min-cut) and analytical quadratic (AQ) algorithms and compared their results with related works. We also compared the results obtained on the same circuit with the AQ and min-cut partitioning algorithms.
2. We designed a complete RTL-to-GDS design flow that is easily implementable using existing EDA tools. We used the Design Compiler (DC) and the IC Compiler II (ICC2) tool from Synopsys® for performing the logic synthesis and P&R implementations, respectively.
3. We evaluated power dissipation results for two-to-five-tier designs for three benchmark circuits using two different tier partitioning algorithms and compared the results with a 2D layout. Finally, we examined the power dissipation reductions of increasing the number of tiers for different circuit types.

We presented the details of our overall design flow, our tier partitioning algorithms, and the P&R flow system in Section 2. We reported and analyzed our results in Section 3. Finally, we discussed our findings in Section 4, and followed up with concluding remarks in Section 5.

2. Methodology

2.1. Overall Design Flow

The pipeline for the overall design is shown in Figure 1. The input of the system is a register-transfer level (RTL) netlist of the circuit. The Design Compiler tool from Synopsys® was used to synthesize the circuit into a gate-level netlist, which was then passed into the tier-partitioning algorithm, which provides the individual tier netlists. The details of the tier-partitioning algorithm are discussed in Section 2.2. The individual tier netlists are then passed to the Synopsys® ICC2 tool for the placement and routing (P&R) flow. The power dissipation results of each tier are generated from the P&R design and added together to get the overall result of the 3D IC. Each tier in the design is of the same size and is reduced depending on the number of tiers in the 3D design. For example, if it is a two-tier design, then the total core area of each tier is half the size of the original 2D design.
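The bookkeeping described above (equal per-tier core area, per-tier power summed into a 3D total, and the comparison with the 2D design) can be illustrated with a minimal sketch. The function names and all numeric values below are hypothetical and chosen only for illustration; they are not results from this work.

```python
def tier_core_area(area_2d, n_tiers):
    """Core area of each tier: the 2D core area divided equally among the tiers."""
    return area_2d / n_tiers


def total_power(tier_powers):
    """Overall 3D power: the sum of the per-tier power results from P&R."""
    return sum(tier_powers)


def power_reduction_percent(p_2d, p_3d):
    """Relative power reduction of the 3D design with respect to the 2D design."""
    return 100.0 * (p_2d - p_3d) / p_2d


# Hypothetical numbers (arbitrary units), not measurements from the paper.
area_per_tier = tier_core_area(1000.0, 2)      # two-tier design: half the 2D area
p_3d = total_power([3.0, 3.5])                 # per-tier power results added together
reduction = power_reduction_percent(10.0, p_3d)
print(area_per_tier, p_3d, reduction)
```

With these made-up inputs, each tier receives half of the 2D core area, which matches the two-tier example in the text.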


Figure 1. Overall design flow chart.

2.2. Tier Partitioning

One of the main challenges of 3D IC design is partitioning the cells between the tiers. Intuitively, this step has the most impact on the final performance of the physical design since it explicitly affects the contents of each tier. A subpar tier partitioning, such as a random tier partitioning, would spread cells between arbitrary tiers and would disregard four main considerations: (1) the cells’ connections to different nets, (2) their location after placement, (3) the number of MIVs, and (4) the total area of the tiers after partitioning. In order to add these considerations to the tier partitioning step, this work implemented two approaches for tier partitioning: first, a minimum-cut algorithm, which has traditionally been used for partitioning the graph of the netlist, and second, an analytical quadratic algorithm, a novel algorithm that uses placement results in order to decide which cell belongs to which tier.

Figure 2 compares the two methods for tier partitioning and shows that the main differences between the two algorithms are two-fold: the type of the algorithm and its optimization goal. The min-cut algorithm operates directly on the netlist graph and minimizes the number of connections that would be severed due to partitioning. On the other hand, the placement algorithm works with the netlist placement and minimizes the wire length between the placed cells. The implications of these two optimization strategies are discussed in the following subsections.

Figure 2. Comparison of the two implemented tier partitioning algorithms.

2.2.1. Partitioning Using the Minimum-Cut Algorithm

The min-cut algorithm was derived from the multilevel k-way partitioning graph (MKWPG) algorithm, which comprises three phases: coarsening, initial partitioning, and uncoarsening and refinement [30,31]. The process begins by coarsening the graph recursively until a graph with a small enough size is derived. More precisely, the input graph is replaced with a smaller graph that is similar to the original but has fewer edges and nodes. The process is repeated until a sufficiently small version of the graph is obtained that could be partitioned easily and quickly. After the coarsening phase is complete, the initial partitioning divides this small version of the graph into k parts. Because the coarsest graph resembles its parent graph, the parent graph could be partitioned in the same manner. Lastly, in the uncoarsening step, the original graph is partitioned by repeating this process in reverse. Figure 3 illustrates the three main steps of the MKWPG algorithm.
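To make the coarsening phase concrete, the following is a minimal sketch of a single coarsening level. It assumes heavy-edge matching, a common heuristic in multilevel partitioners; the text does not state which matching rule was used, so this choice, the function name `coarsen_once`, and the example graph are assumptions for illustration only.

```python
def coarsen_once(n_nodes, edges):
    """One coarsening level by greedy heavy-edge matching (assumed heuristic).

    edges: dict {(u, v): weight} with u < v.
    Returns (mapping, coarse_edges), where mapping[node] is the coarse node id.
    """
    mapping = {}
    next_id = 0
    # Visit heavier edges first and merge endpoints that are both still unmatched.
    for (u, v), _w in sorted(edges.items(), key=lambda kv: -kv[1]):
        if u not in mapping and v not in mapping:
            mapping[u] = mapping[v] = next_id
            next_id += 1
    # Nodes left unmatched become coarse nodes of their own.
    for node in range(n_nodes):
        if node not in mapping:
            mapping[node] = next_id
            next_id += 1
    # Rebuild the edge set on the coarse graph, summing parallel edges.
    coarse_edges = {}
    for (u, v), w in edges.items():
        cu, cv = mapping[u], mapping[v]
        if cu != cv:
            key = (min(cu, cv), max(cu, cv))
            coarse_edges[key] = coarse_edges.get(key, 0) + w
    return mapping, coarse_edges


# Hypothetical 4-node path graph 0-1-2-3 with edge weights 3, 1, 3.
mapping, coarse = coarsen_once(4, {(0, 1): 3, (1, 2): 1, (2, 3): 3})
print(mapping, coarse)
```

In this example the two heavy edges are contracted, leaving a two-node graph joined by the single light edge, which is the only edge an initial partitioner would then need to cut. Repeating the function on its own output gives the recursive coarsening described above; the initial partitioning and refinement phases are not shown.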

Figure 3. Multilevel k-way partitioning graph algorithm steps.

Inherently, a netlist is similar to a graph with nodes representing the gates, and edges representing the connections within the nets. Therefore, an algorithm such as min-cut can be leveraged to partition a netlist. Since the netlist is represented as a graph, the algorithm would directly work with the connections between the cells, which are represented as the edges of the graph. Therefore, using the min-cut algorithm would allow the tier partitioner to explicitly take into account the connections between the cells and the number of MIVs, which would replace the cut connections, heeding two of the main four considerations.

To implement the algorithm using a high-level language, the netlist graph needs to be represented by an adjacency matrix. The rows and columns represent vertices and the nonzero values in the adjacency matrix are edges where there are connections within the netlist. The breadth-first search (BFS) algorithm was used to compute the graph in the adjacency matrix. The matrix was then divided into k parts in a row-wise manner. These row blocks (k parts) were then processed, with each row block representing a partition of the graph. The nodes were exchanged among the partitions repeatedly to balance the number of nodes between the parts and minimize the number of cut edges.

The min-cut partitioner starts by using Python to parse the gate-level Verilog circuit and convert it into a graph. The wires and the gates are represented as edges and nodes, respectively. Note that, since the sizes of the gate types are different, we weighted the converted gates based on the number of required transistors as follows: inverter, buffer/NAND/NOR, AND/OR, XOR, and XNOR as 2, 4, 6, 12, and 14, respectively. Currently, we are focused on static CMOS logic, so the partitioner only works on static logic. This weighting is vital as the circuit size depends on the total number of transistors and not the total number of gates. During the conversion of the Verilog circuit (gates and wires) into a graph (nodes and edges), we created a dictionary that has information about the Verilog circuit, e.g., gate type, gate name, gate location, wire direction, and number of transistors. We get the exact number of transistors for each gate type from the library since the library has all the gate information, including the number of transistors, which is very important to partition the circuit in an efficient way. Once the partitioning is done, we convert the partitioned parts back to partitioned circuits.

To clearly explain this, we use the small circuit shown in Figure 4, which has six gates. If we partition the circuit into two parts without weighting the gates based on the gate type, then the algorithm will partition the circuit, as shown with the red dotted line, in which each part will have three gates with two cut edges (wires) as a minimum. This makes the size of the right part (part 2) larger than the size of the left part (part 1) by about 2.7 times since the number of transistors in part 1 is 14, while it is 38 in part 2. However, if we weight each gate type before converting to the graph, then part 1 will have 4 gates (26 transistors) and part 2 will have 2 gates (26 transistors), as shown by the dotted green line. Hence, the process allows us to partition the circuit into k parts with equal size and a minimum number of cut edges.
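The arithmetic of this example can be reproduced with a short sketch. The transistor weights are the ones listed in the text. The six gate types below are an assumption: they are one hypothetical combination that yields the 14/38 and 26/26 transistor counts quoted above, and they are not read from Figure 4.

```python
# Transistor-count weights per gate type, as listed in the text.
GATE_WEIGHTS = {
    "INV": 2,
    "BUF": 4, "NAND": 4, "NOR": 4,
    "AND": 6, "OR": 6,
    "XOR": 12,
    "XNOR": 14,
}


def part_weight(gate_types, part):
    """Total number of transistors of the gates assigned to one part."""
    return sum(GATE_WEIGHTS[gate_types[g]] for g in part)


# Hypothetical six-gate circuit (gate types assumed for illustration).
gate_types = {"g1": "NAND", "g2": "NOR", "g3": "AND",
              "g4": "XOR", "g5": "XOR", "g6": "XNOR"}

# Balancing by gate count only: three gates per part.
by_count = (["g1", "g2", "g3"], ["g4", "g5", "g6"])
# Balancing by transistor weight: four gates versus two gates.
by_weight = (["g1", "g2", "g3", "g4"], ["g5", "g6"])

unweighted_sizes = [part_weight(gate_types, p) for p in by_count]
weighted_sizes = [part_weight(gate_types, p) for p in by_weight]
print(unweighted_sizes, weighted_sizes)
```

Under these assumed gate types, the count-balanced split gives parts of 14 and 38 transistors (a ratio of about 2.7), while the weight-balanced split gives 26 transistors in each part.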

Figure 4. Demonstration of the min-cut partitioning algorithm using a simple example circuit.
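The adjacency-matrix representation and the row-wise split into k blocks described in Section 2.2.1 can be sketched as follows. This is a simplified illustration under stated assumptions: BFS is used here only to order the rows so that connected nodes end up close together, which is one plausible reading of the text; the repeated node-exchange refinement is omitted; and all function names and the example graph are hypothetical.

```python
from collections import deque


def adjacency_matrix(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph with n nodes."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1
    return A


def bfs_order(A):
    """Visit order of all nodes by breadth-first search (covers every component)."""
    n = len(A)
    seen = [False] * n
    order = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in range(n):
                if A[u][v] and not seen[v]:
                    seen[v] = True
                    queue.append(v)
    return order


def row_block_partition(order, k):
    """Split the ordered rows into k contiguous blocks of near-equal size."""
    n = len(order)
    return {node: idx * k // n for idx, node in enumerate(order)}


def cut_edges(A, part):
    """Number of edges whose endpoints fall in different parts."""
    n = len(A)
    return sum(1 for u in range(n) for v in range(u + 1, n)
               if A[u][v] and part[u] != part[v])


# Hypothetical graph: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
A = adjacency_matrix(6, edges)
part = row_block_partition(bfs_order(A), 2)
n_cut = cut_edges(A, part)
print(part, n_cut)
```

For this small graph the row-wise split already separates the two triangles, so only the joining edge is cut. On real netlists the initial blocks would still need the node exchanges described in the text to reduce the cut and balance the parts.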

2.2.2. Partitioning Using Initial Analytical Quadratic Placement

While the min-cut algorithm considers the connections between the cells, it fails to include any information regarding how the cells would be placed in the physical design. This information can be derived from the output of the placement algorithms that usually use wire length as a constraint to optimize the placement of the cells. To this end, this work proposes to use the output of an analytical quadratic placer as a discriminating feature for partitioning. Using the AQ algorithm would allow the tier partitioning to take into account the location of the cells after placement, the total area of the tiers after partitioning, and, implicitly, the connections between the cells, heeding three of the main four considerations.

This section describes the details of the AQ partitioner, which divides the netlist into N equal parts. The AQ algorithm works by assigning each standard cell in the gate-level netlist into one specific tier out of the overall N tiers. The principal steps of the AQ partitioning system are outlined in Figure 5. With the help of a placement algorithm, placed cells’ physical locations could be used as a discriminative metric for tier assignment.

Figure 5. AQ partitioning pipeline.

As shown in Figure 5, the tier assignment starts with performing an initial AQ placement on the given netlist. This 2D placement finds the location (x, y) of each cell to minimize the total wire length of the placed netlist. In its standard implementation, the initial step of AQ is followed by iterative partitioning, placement, and legalization [32]. However, in this work, this initial step is taken as a representation of the placed netlist since it can be performed fast and is efficient in terms of spreading the cells. In order to implement the initial AQ placement, a new temporary net is inserted between each pair of connected cells, creating the fully connected clique model of the netlist. The connection matrix of all cells and pins is then extracted using the weights of each connection within the clique model. The optimum location for each cell placement is then calculated by setting the derivative of the total quadratic wire length to zero.

As seen in Figure 6, having performed the initial AQ, the placed cells are ranked based on their location on the x-axis, with the cells on the left side of the placed netlist having higher ranks than the cells on the right side. This partitioning approach is inspired by the recursive partitioning step of the AQ placement algorithm. To partition the netlist, cells in the ranking queue are assigned a specific tier (starting at the lowest tier) until the tier’s area limitation is reached. At this point, the tier is changed to the next higher one.
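The initial quadratic placement step can be sketched as follows. The sketch assumes the usual quadratic wire-length objective, the weighted sum of (x_i - x_j)^2 + (y_i - y_j)^2 over all connections, whose derivative set to zero gives one linear system per axis. The clique weight 1/(p - 1) for a p-cell net is a common convention assumed here; the text does not state the weights used. All function names and the example are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def clique_edges(net_cells):
    """Expand one net into pairwise connections (clique model).

    The weight 1/(p - 1) is an assumed convention, not taken from the paper.
    """
    p = len(net_cells)
    if p < 2:
        return []
    w = 1.0 / (p - 1)
    return [(net_cells[i], net_cells[j], w)
            for i in range(p) for j in range(i + 1, p)]


def quadratic_place(n_cells, cell_edges, pin_edges):
    """Minimize the quadratic wire length of movable cells connected to fixed pins.

    cell_edges: (i, j, w) between movable cells i and j.
    pin_edges:  (i, (px, py), w) between movable cell i and a fixed pin.
    Each connected group of cells needs at least one pin, else the system is singular.
    """
    A = [[0.0] * n_cells for _ in range(n_cells)]
    bx = [0.0] * n_cells
    by = [0.0] * n_cells
    for i, j, w in cell_edges:
        A[i][i] += w
        A[j][j] += w
        A[i][j] -= w
        A[j][i] -= w
    for i, (px, py), w in pin_edges:
        A[i][i] += w
        bx[i] += w * px
        by[i] += w * py
    return solve_linear(A, bx), solve_linear(A, by)


# Hypothetical example: two cells in a chain between pins at x = 0 and x = 3.
cell_edges = clique_edges([0, 1])
pin_edges = [(0, (0.0, 0.0), 1.0), (1, (3.0, 0.0), 1.0)]
xs, ys = quadratic_place(2, cell_edges, pin_edges)
print(xs, ys)
```

In this toy case the two cells settle at evenly spaced positions between the pins, which is the spreading behaviour that makes the x-coordinate usable as a ranking key for the tier assignment.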

Figure 6. Example of tier partitioning. The red and yellow rectangles in the third window represent cells assigned to different tiers.
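The ranking-based tier assignment illustrated in Figure 6 can be sketched as below. The sketch assumes that each tier's area limit is the total cell area divided by the number of tiers, and that a cell which would exceed the limit is moved to the next tier; these details, the function name, and the example cells are assumptions for illustration.

```python
def assign_tiers(cells, n_tiers):
    """Assign cells to tiers by their x-coordinate rank.

    cells: list of (name, x, area) tuples, with x taken from the initial placement.
    Returns a dict {name: tier}, with tier 0 being the lowest tier.
    """
    area_limit = sum(area for _, _, area in cells) / n_tiers
    tiers = {}
    tier, used = 0, 0.0
    # Leftmost cells are ranked first and fill the lowest tier.
    for name, _x, area in sorted(cells, key=lambda c: c[1]):
        # Move to the next tier once the current one would overflow.
        if used > 0 and used + area > area_limit and tier < n_tiers - 1:
            tier += 1
            used = 0.0
        tiers[name] = tier
        used += area
    return tiers


# Hypothetical placed cells: (name, x position, area).
cells = [("c", 0.6, 2.0), ("a", 0.1, 2.0), ("d", 0.9, 2.0), ("b", 0.4, 2.0)]
tiers = assign_tiers(cells, 2)
print(tiers)
```

Here the two leftmost cells fill the lower tier and the two rightmost cells go to the upper tier, so every tier ends up with the same cell area.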

After the partitioning is performed and each cell is assigned a tier, there exist nets that have been spread across multiple tiers. In order to keep these connections intact, MIVs are inserted in the severed nets as an extra cell. The MIV insertion process is described in detail in the next section.

2.3. MIV Insertion Vertical interconnects such as MIVs are used to carry the signal from one tier to the next. The general representation of an MIV is shown in Figure 7, along with its equivalent circuit model. It consists of two contact resistances and a wire resistance in series with components with the substrate. Therefore, the MIV could be modelled as a re- sistor in series with a capacitance in parallel. The size of an MIV is negligible compared to the size of typical standard cells, so there is no density cap placed on the MIV placement. The overall resistance and capacitance of an MIV could be assumed to be 2 Ω and 0.1 fF, respectively, with an approximate diameter of 100 nm [11]. The approximate size and par- asitic capacitance of the MIVs are comparable to the antennae standard cell, which

Electronics 2021, 10, x FOR PEER REVIEW 6 of 16

Figure 5. AQ partitioning pipeline.

As shown in Figure 5, the tier assignment starts with performing an initial AQ place- ment on the given netlist. This 2D placement finds the location (x, y) of each cell to mini- mize the total wire length of the placed netlist. In its standard implementation, the initial step of AQ is followed by iterative partitioning, placement, and legalization [32]. How- ever, in this work, this initial step is taken as a representation of the placed netlist since it can be performed fast and is efficient in terms of spreading the cells. In order to implement Electronics 2021, 10, 1930 the initial AQ placement, a new temporary net is inserted between each pair of connected6 of 15 cells, creating the fully connected clique model of the netlist. The connection matrix of all cells and pins are then extracted using the weights of each connection within the clique model.The optimum The optimum location location for each for cell each placement cell plac isement then calculated is then calculated by setting by the setting derivative the de- of rivativethe total of quadratic the total wirequadratic length wire to zero. length to zero. AsAs seen inin FigureFigure6 ,6, having having performed performed the th initiale initial AQ, AQ, the placedthe placed cells arecells ranked are ranked based basedon their on locationtheir location on the onx-axis, the x with-axis, the with cells the on cells the on left the side left of side the placed of the netlistplaced havingnetlist havinghigher rankshigher than ranks the than cells the on cells the righton the side. right This side. partitioning This partitioning approach approach is inspired is inspired by the byrecursive the recursive partitioning partitioning step of step the of AQ the placement AQ placement algorithm. algorithm. 
To partition To partition the netlist, the netlist, cells cellsin the in ranking the ranking queue queue are assigned are assigned a specific a specific tier (starting tier (starting at the at lowest the lowest tier) untiltier) theuntil tier’s the tier’sarea limitationarea limitation is reached. is reached. At this At point, this poin thet, tier the is tier changed is changed to the to next the highernext higher one. one.

Figure 6. ExampleExample of of tier partitioning. The The red red and and yellow yellow rectan rectanglesgles in in the the third window represent cells assigned to different tiers.

AfterAfter the partitioningpartitioning isis performedperformed and and each each cell cell is assignedis assigned a tier, a tier, there there exist exist nets nets that thathave have been been spread spread across across multiple multiple tiers. tiers. In order In order to keep to keep these these connections connections intact, intact, MIVs MIVs are areinserted inserted in thein the severed severed nets nets as anas extraan extra cell. cell. The The details details of the of the MIV MIV insertion insertion process process are aredescribed described in detail in detail in the in the next next section. section.

2.3. MIV Insertion

Vertical interconnects such as MIVs are used to carry the signal from one tier to the next. The general representation of an MIV is shown in Figure 7, along with its equivalent circuit model. It consists of two contact resistances and a wire resistance in series, together with capacitive coupling to the substrate. Therefore, the MIV could be modelled as a series resistor with a parallel capacitance. The size of an MIV is negligible compared to the size of typical standard cells, so there is no density cap placed on the MIV placement. The overall resistance and capacitance of an MIV could be assumed to be 2 Ω and 0.1 fF, respectively, with an approximate diameter of 100 nm [11]. The approximate size and parasitic capacitance of the MIVs are comparable to those of the antenna diode standard cell, which is already present in the technology library. Therefore, the antenna diode standard cell is used to simulate MIV cells. These cells are placed at the end of each of the nets that are cut by the tier-partitioning algorithm, as shown in Figure 8. This insertion is also performed iteratively if a net spreads across more than two tiers, connecting the lowest tier to the highest tier that the net occupies.
Once the netlist is partitioned and MIVs are connected to the severed nets, all tiers are complete and can be saved as a netlist. Therefore, the outputs of the partitioning algorithms are N separate netlists, which are passed to physical design software for the P&R flow.
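The MIV insertion and netlist splitting described in this section can be sketched as below. This is one possible reading of the text, not the authors' code: the function name, the `MIV_<net>_<lower>_<upper>` naming scheme, and the choice to represent each via by one placeholder cell at each end of a severed net are assumptions for illustration.

```python
from collections import defaultdict


def split_with_mivs(nets, tier_of):
    """Split a netlist into per-tier netlists, adding MIV placeholder cells.

    nets: {net name: [cell names]}; tier_of: {cell name: tier index}.
    Returns {tier: {net name: [cells in that tier plus MIV placeholders]}}.
    A net spanning tiers lo..hi gets one MIV per adjacent tier pair, so it
    is chained from its lowest tier up to its highest tier.
    """
    per_tier = defaultdict(dict)
    for net, cells in nets.items():
        tiers = sorted({tier_of[c] for c in cells})
        lo, hi = tiers[0], tiers[-1]
        for t in range(lo, hi + 1):
            members = [c for c in cells if tier_of[c] == t]
            if t > lo:
                members.append(f"MIV_{net}_{t - 1}_{t}")  # end of the via from the tier below
            if t < hi:
                members.append(f"MIV_{net}_{t}_{t + 1}")  # end of the via to the tier above
            per_tier[t][net] = members
    return dict(per_tier)


# Hypothetical three-tier example: n1 stays in tier 0, n2 spans two tiers,
# n3 spans tiers 0 and 2 and therefore passes through tier 1.
nets = {"n1": ["a", "b"], "n2": ["a", "c"], "n3": ["b", "d"]}
tier_of = {"a": 0, "b": 0, "c": 1, "d": 2}
netlists = split_with_mivs(nets, tier_of)
for tier in sorted(netlists):
    print(tier, netlists[tier])
```

Each per-tier dictionary corresponds to one of the N separate netlists handed to the P&R tool; in the flow described above, the placeholder cells would be instances of the antenna diode standard cell.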

2.4. P&R Flow Using ICC2

The MIV insertion step is followed by the placement and routing implementation carried out with the ICC2 EDA tool. The detailed steps of this process are shown in Figure 9. The first step of the flow is the design initialization, which sets up the design environment and loads the technology library into the system. A 32 nm standard cell library obtained from Synopsys® was used to implement the flow. The cell density of the different tiers of the 3D design is matched with that of the 2D design to aid in consistency and to achieve a fair result comparison. The design initialization is followed by the floor planning, which builds the PG grid and the core area. Each tier of the design consists of a total of nine metal layers to accommodate the interconnects.

Figure 7. MIV model with equivalent resistance and capacitor.


Figure 8. MIV insertion.


Figure 9. P&R design flow.

This is followed by the placement of the cells and the clock tree synthesis to remove timing violations. Finally, the cells are connected in the routing step, followed by some postrouting optimizations. These optimizations are handled automatically by the P&R tool and include fixing timing violations within tiers, making small changes in routing to optimize wire lengths, power optimizations, and postroute buffer removal. The power, performance, and area results are obtained for the individual tiers and added together to give the overall power dissipation of the entire 3D design.

3. Results

The entire design flow was executed on three different benchmark circuits: advanced encryption standard (AES), floating-point unit (FPU), and fast Fourier transform (FFT) circuits obtained from the opencores.org website (accessed on 5 June 2020). These circuits differ in size and density and are used to demonstrate the applicability of 3D designs on different circuit types. The properties of the benchmark circuits are given in Table 1.

Table 1. Benchmark circuits.

| Benchmark Circuit | Total Number of Cells | Core Area at 70% Utilization |
|---|---|---|
| Advanced encryption standard (AES) | 13,921 | 48,241.360 |
| Floating-point unit (FPU) | 37,242 | 112,944.389 |
| Fast Fourier transform (FFT) | 105,790 | 581,766.113 |


Power dissipation results have been studied for the 2D and 3D layouts for the benchmark circuits. Two-tier and three-tier models using the minimum-cut and analytical quadratic partitioning algorithms were developed and the reductions in power dissipation were compared with those from other similar studies [28,29]. Moreover, the four-tier and five-tier results were generated using the AQ partitioning for the benchmark circuits to show the trend in power dissipation as the total number of tiers in the 3D design was increased. The AQ algorithm was chosen for this analysis as it produced the best results overall.

Table 2 shows the power dissipation results and the percentage decrease in total power normalized to the 2D design for different 3D layouts and tier partitioning algorithms. The results are visualized in Figure 10 and show that the AQ algorithm’s overall performance was better than that of the min-cut partitioning. The 3D results were first compared to their 2D counterparts. They show that the 3D implementation with the min-cut partitioning algorithm produced lower power dissipation for the AES and the FFT circuits while showing an increase in power dissipation for the FPU circuit. On the other hand, the results obtained with the AQ algorithm show a significant reduction in power for all three benchmark circuits, with the AES circuit showing the largest reduction in power dissipation for the two-tier models, while the FPU circuit showed the largest power reduction for the higher-tier models.

We also compared our results with those from other similar works [28,29], as shown in Figure 10. The FFT 3D comparison results were obtained from [28], where the 3TM metal layer structure was chosen as it produces the best results with no reduction in the metal dimensions, while the AES and the FPU comparison results were drawn from [29], using the T-MI 3D design with the 45 nm node. Only two-tier 3D models were available in these studies.
Our two-tier 3D layout developed using the min-cut partitioner produced better results than the related works [28,29] for the AES and FFT circuits, while our model comprising the AQ partitioner did significantly better in all three benchmark circuits. The most likely cause for the better power results of our implementation compared to [28,29] was better partitioning algorithms. However, the exact technology libraries and layout configurations of [28,29] differ from our implementation. As a result, the comparison between the different methods applied in this work alone is more valid than that between our implementation and the related studies [28,29].

Table 2. Power dissipation results of benchmark circuits. The negative values in the “% Reduction in Power normalized to 2D” column represent an increase in the power dissipation.

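| Circuit | Structure | Partitioning Algorithm | Total Power Dissipation (mW) | % Reduction in Power Normalized to 2D |
|---|---|---|---|---|
| AES | 2D | - | 1.95 | - |
| AES | 3D (two tiers) | Min-cut | 1.615 | 17.18% |
| AES | 3D (two tiers) | AQ | 1.215 | 37.69% |
| AES | 3D (three tiers) | Min-cut | 1.552 | 20.41% |
| AES | 3D (three tiers) | AQ | 1.162 | 40.41% |
| AES | 3D (four tiers) | AQ | 1.142 | 41.44% |
| AES | 3D (five tiers) | AQ | 1.073 | 44.9% |
| FPU | 2D | - | 5.79 | - |
| FPU | 3D (two tiers) | Min-cut | 7.11 | −22.80% |
| FPU | 3D (two tiers) | AQ | 3.76 | 35.06% |
| FPU | 3D (three tiers) | Min-cut | 7.31 | −26.25% |
| FPU | 3D (three tiers) | AQ | 3.04 | 47.50% |
| FPU | 3D (four tiers) | AQ | 2.76 | 52.33% |
| FPU | 3D (five tiers) | AQ | 2.485 | 57.08% |
| FFT | 2D | - | 49.4 | - |
| FFT | 3D (two tiers) | Min-cut | 39.3 | 20.45% |
| FFT | 3D (two tiers) | AQ | 43.4 | 12.15% |
| FFT | 3D (three tiers) | Min-cut | 39 | 21.05% |
| FFT | 3D (three tiers) | AQ | 42.7 | 13.56% |
| FFT | 3D (four tiers) | AQ | 41.59 | 15.81% |
| FFT | 3D (five tiers) | AQ | 43.6 | 11.71% |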
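The "% Reduction in Power Normalized to 2D" column of Table 2 follows from the power values as (P_2D − P_3D) / P_2D × 100. The short check below recomputes the two-tier entries; the function and variable names are chosen here for illustration only.

```python
def percent_reduction(p_2d, p_3d):
    """Power reduction of a 3D design relative to its 2D baseline, in percent.

    A negative value means the 3D design dissipates more power than the 2D one.
    """
    return (p_2d - p_3d) / p_2d * 100.0


# Two-tier total power values (mW) copied from Table 2: (2D, 3D).
two_tier = {
    ("AES", "Min-cut"): (1.95, 1.615),
    ("AES", "AQ"): (1.95, 1.215),
    ("FPU", "Min-cut"): (5.79, 7.11),
    ("FPU", "AQ"): (5.79, 3.76),
    ("FFT", "Min-cut"): (49.4, 39.3),
    ("FFT", "AQ"): (49.4, 43.4),
}

reductions = {key: round(percent_reduction(*vals), 2) for key, vals in two_tier.items()}
for (circuit, algo), value in reductions.items():
    print(f"{circuit:4s} {algo:8s} {value:7.2f}%")
```

The recomputed values match the table, including the negative entry for the FPU circuit with min-cut partitioning.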

Figure 10. Reduction in power dissipation for different 3D layouts normalized to the 2D values.

We next analyzed the reduction in power dissipation obtained by increasing the number of tiers of the 3D design using the AQ partitioner, as shown in Figure 11. The results show a sharp decrease in power dissipation as we moved from the 2D layout to the two-tier 3D model, with smaller incremental decreases as we moved to higher tiers. The advantage of the 3D design in terms of reducing power dissipation is more prominent for circuits that are more wire dominant in nature. This makes the shift from 2D to 3D layouts more applicable for wire-intensive circuits than those that are more cell dominant. To analyze this fact further, the cell density and wire utilization plots for the three benchmark circuits were generated and are shown in Figure 12. The cell density plots show the concentration of cells to be the highest in the FFT circuit, with those of AES and FPU being considerably lower. On the other hand, the wire utilization plots show the AES and the FPU circuits to have a more concentrated wire density than the FFT. This shows the FFT circuit to be much more cell dominant than the AES and the FPU, which are more wire dominant. Hence, the AES and FPU circuits are more suited for a 3D layout compared to the FFT circuit.

Figure 11. Power dissipation advantage for different tiers.
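The diminishing return per added tier can be checked directly from the AQ rows of Table 2. The sketch below expresses each step (2D to two tiers, two to three tiers, and so on) as a percentage of the 2D power; the names used are illustrative and not part of the original flow.

```python
# Total power (mW) with the AQ partitioner, copied from Table 2:
# [2D, two tiers, three tiers, four tiers, five tiers].
power_aq = {
    "AES": [1.95, 1.215, 1.162, 1.142, 1.073],
    "FPU": [5.79, 3.76, 3.04, 2.76, 2.485],
    "FFT": [49.4, 43.4, 42.7, 41.59, 43.6],
}


def incremental_gain(powers):
    """Extra power saved by each added tier, as a percentage of the 2D power."""
    base = powers[0]
    return [(prev - cur) / base * 100.0 for prev, cur in zip(powers, powers[1:])]


gains = {name: incremental_gain(p) for name, p in power_aq.items()}
for name, steps in gains.items():
    print(name, [round(s, 2) for s in steps])
```

For all three circuits the first step (2D to two tiers) is by far the largest, and the last FFT step is negative, which corresponds to the power increase of the five-tier FFT design.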


Figure 12. Cell density and wire utilization plots for the different benchmark circuits.

Moreover, the FFT circuit has a large clump of wires localized in the lower-left corner, making it less suitable for 3D implementation. These trends are confirmed by the 3D results, which show the overall 3D performance of the AES and the FPU to be much better than that of the FFT circuit. The FFT circuit also saturated faster and even experienced an increase in power dissipation as we expanded it to a five-tier design. In contrast, the other two circuits showed a steady increase in power reduction up to this point.

The cell density and wire utilization plots for the two-tier 3D design of the FFT and the AES benchmark circuits using the AQ partitioner are given in Figures 13 and 14, respectively. The FPU benchmark circuit showed a similar trend, so the FFT and AES circuits are shown as examples. The plots show even splitting of the cells between the two tiers. They also show greater routing congestion for the FFT circuit compared to the AES circuit. Of note, the first tier of the FFT circuit showed very large wire utilization compared to the 2D design. This is confirmed by the power dissipation results, which show better 3D results for the AES circuit compared to the FFT circuit. The AES circuit, on the other hand, shows better wire utilization plots, with very low wire utilization for the first tier and only two small hot spots in the second tier. This bolsters our earlier conclusion that more wire-dominant circuits like the AES have better 3D performance compared to more cell-dominant circuits like the FFT. However, in cell-dominant circuits like the FFT, there is a significant difference between the wire utilization of the two tiers. This suggests that a partitioner that can better predict routing congestion would perform better for cell-dominant circuits than our proposed partitioning algorithms, which are placement-based.


Figure 13. Cell density and wire utilization plots for two-tier FFT 3D designs using the AQ partitioner.

Figure 14. Cell density and wire utilization plots for two-tier AES 3D designs using the AQ partitioner.

The timing analysis data for the 2D and two-tier 3D designs of the AES benchmark circuit are shown in Table 3. Four different corner configurations were considered in the design. They are fast-fast at 125 °C, fast-fast at −40 °C, slow-slow at 125 °C, and slow-slow at −40 °C. The critical path length, slack, and clock period for the different corners are given in Table 3 along with the number of violating paths and the total negative slack for the 2D and two-tier 3D AES circuits. The timing violation statistics for the AES 2D and two-tier 3D designs are shown in Table 4. Both the 2D and 3D designs showed negligible violating nets compared to the total number of nets. The data show similar timing performance for the 2D and the 3D circuits, so both the 2D and the two-tier 3D designs could be realized at a similar frequency.


Table 3. Timing analysis results for 2D and two-tier 3D AES benchmark circuit.

| Design | Corner | Critical Path Length | Critical Path Slack | Critical Path Clock Period | Number of Violating Paths | Total Negative Slack |
|---|---|---|---|---|---|---|
| AES_2D | ff_125c | 0.74 | 3.19 | 4 | 0 | 0 |
| AES_2D | ff_m40c | 0.49 | 3.39 | 4 | 0 | 0 |
| AES_2D | ss_125c | 3.75 | 0 | 4 | 0 | 0 |
| AES_2D | ss_m40c | 3.77 | −0.02 | 4 | 4 | −0.03 |
| AES_3D_Tier1 | ff_125c | 0.66 | 3.31 | 4 | 0 | 0 |
| AES_3D_Tier1 | ff_m40c | 0.48 | 3.49 | 4 | 0 | 0 |
| AES_3D_Tier1 | ss_125c | 3.88 | 0.01 | 4 | 0 | 0 |
| AES_3D_Tier1 | ss_m40c | 3.68 | 0 | 4 | 1 | 0 |
| AES_3D_Tier0 | ff_125c | 0.71 | 3.26 | 4 | 0 | 0 |
| AES_3D_Tier0 | ff_m40c | 0.5 | 3.48 | 4 | 0 | 0 |
| AES_3D_Tier0 | ss_125c | 3.75 | 0 | 4 | 0 | 0 |
| AES_3D_Tier0 | ss_m40c | 3.81 | −0.01 | 4 | 1 | −0.01 |

Table 4. Timing violation information for the 2D and two-tier 3D benchmark circuits.

| Circuit | Total Nets | Nets with Violations | Max Trans Violations | Max Cap Violations |
|---|---|---|---|---|
| AES_2D | 14,299 | 33 | 13 | 32 |
| AES_Tier1 | 7790 | 7 | 1 | 7 |
| AES_Tier0 | 8194 | 10 | 2 | 10 |
| FPU_2D | 39,256 | 20 | 8 | 20 |
| FPU_Tier1 | 20,944 | 62 | 27 | 57 |
| FPU_Tier0 | 20,422 | 68 | 33 | 65 |
| FFT_2D | 114,197 | 242 | 121 | 240 |
| FFT_Tier1 | 65,568 | 202 | 134 | 202 |
| FFT_Tier0 | 52,658 | 61 | 9 | 61 |
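To put the violation counts of Table 4 in proportion, the share of violating nets in each design can be computed as below. The dictionary layout and function name are illustrative; the numbers are copied from the table.

```python
# (total nets, nets with violations) per design, copied from Table 4.
net_stats = {
    "AES_2D": (14299, 33),
    "AES_Tier1": (7790, 7),
    "AES_Tier0": (8194, 10),
    "FPU_2D": (39256, 20),
    "FPU_Tier1": (20944, 62),
    "FPU_Tier0": (20422, 68),
    "FFT_2D": (114197, 242),
    "FFT_Tier1": (65568, 202),
    "FFT_Tier0": (52658, 61),
}


def violation_percent(total, violating):
    """Share of nets with at least one violation, in percent."""
    return violating / total * 100.0


shares = {name: violation_percent(*counts) for name, counts in net_stats.items()}
for name, share in shares.items():
    print(f"{name:10s} {share:.3f}%")
```

Every design stays below 0.5% violating nets, which is consistent with the statement that the violating nets are negligible compared to the total number of nets.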

However, our flow could only perform timing analysis on the nets that were confined to a single tier in the case of the 3D design and did not consider nets that travel between tiers. Because only a small portion of the nets travel between tiers (less than 20% for the benchmark circuits), the timing analysis of the 3D design can be considered an approximation of the real results.

4. Discussion

Several further modifications could be made to increase the accuracy of the 3D model. These include implementing P&R of all the tiers together in the EDA tool instead of doing them separately. This would result in more accurate timing analysis and clock tree synthesis of the circuits. Another improvement to our current method could be achieved by implementing more detailed MIV modeling. In addition, different MIV models could be applied and the results compared to investigate the effect of different types of MIVs on the performance of 3D ICs. Furthermore, the P&R flow could be modified to make sure that the connecting MIV cells in different tiers line up in the exact same physical locations in each tier to achieve more accurate results. The alignment of MIVs would ensure that cells on the partitioning border that were previously adjacent remain in close vicinity after tier partitioning, alleviating the problem of long routes between such cells. Moreover, the relationship between the MIV parasitics and the fan-in and fan-out of the logic gates should be analyzed in detail. These improvements and modifications are left as further work.

The clock period of the 2D design and that of the individual tiers of the 3D design are set to be equal, and only a small fraction of the nets go through the MIVs (about 18% for the two-tier AES design), so the 2D and the 3D designs are approximately iso-performance. The implemented flow ensures that there are no timing violations among the nets that travel within tiers, but it has no mechanism to ensure that there are no timing violations in the fraction of the nets that travel between tiers. This is a limitation of the study.
However, the partitioner puts nets that are closely connected together in the same tier, so there are limited chances of timing violations being present in the inter-tier nets, and we leave the modification of the flow to include timing analysis for the inter-tier nets as future work. AQ partitioning took less than 5 min for each of the three benchmark circuits on a computer with a standard Intel Core i7 processor and 128 GB of RAM. This suggests that the AQ partitioning algorithm scales well and could be applied to large industrial circuits as well.

5. Conclusions

The continuing trend in the reduction of the size of ICs is causing significant problems in the interconnect parasitic capacitance and resistance, which could be alleviated by moving to the 3D layout architecture. A complete design flow for the physical design of 3D ICs has been constructed in this study, which involves synthesizing the gate-level netlist from the RTL netlist, followed by tier partitioning and finally implementing the P&R flow. The complete flow has been implemented on three benchmark circuits, which show significant reduction in the power dissipation of the 3D design compared to the 2D counterparts. Two different partitioning algorithms were implemented to assign the standard cells into different tiers. The AQ partitioning algorithm performed better on average compared to the min-cut algorithm on the three benchmark circuits that were used in the study. The two-tier 3D flow power reductions compared to the 2D designs for the AES, FPU, and FFT benchmark circuits were 37.69%, 35.06%, and 12.15%, respectively, using the AQ partitioning algorithm. The power reduction generally increased as the number of tiers of the 3D design was increased, with the exception of the five-tier FFT design. The decrease in power dissipation was the most prominent for the FPU circuit and the least pronounced in the FFT circuit. Detailed analysis of the power reduction for the 3D architecture revealed that wire-dominant circuits like the AES are more applicable for 3D implementation than cell-dominant circuits like the FFT. The cell density and wire utilization heatmaps for the two-tier 3D designs for the FFT and the AES circuits using the AQ partitioner show that the cells are evenly distributed among the tiers for both the AES and the FFT circuits, whereas the wire utilization plot for the FFT circuit shows more hot spots compared to the AES circuit. This observation is supported by the power dissipation results, which show better 3D performance for the AES circuit compared to the FFT circuit. Finally, the timing analysis and violation data for the 2D and two-tier 3D designs show comparable timing performance between the 2D and 3D designs, which indicates that the designs are approximately iso-performance. Of note, the 3D flow implemented in this work could only perform timing analysis for the nets that were restricted within a single tier of the 3D design, but only a small fraction of the nets travel between the tiers, so the reported timing data are expected to be close to the real values.

Author Contributions: A.T. and J.L. designed the overall P&R flow and wrote the manuscript; M.S. implemented the AQ partitioning algorithm; Q.A. implemented the min-cut algorithm; J.-s.Y. edited and reviewed the manuscript and supervised the work. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported in part by .

Acknowledgments: Special thanks to Tokyo Electron for providing constructive insight and discussions.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Sun, Y.; Agostini, N.B.; Dong, S.; Kaeli, D. Summarizing cpu and gpu design trends with product data. arXiv 2019, arXiv:1911.11313.
2. Knickerbocker, J.U.; Andry, P.S.; Colgan, E.; Dang, B.; Dickson, T.; Gu, X.; Haymes, C.; Jahnes, C.; Liu, Y.; Maria, J.; et al. 2.5D and 3D technology challenges and test vehicle demonstrations. In Proceedings of the 2012 IEEE 62nd Electronic Components and Technology Conference, San Diego, CA, USA, 29 May–1 June 2012; pp. 1068–1076.
3. Yoon, S.W.; Hoon, K.J.; Carson, F. Challenges and opportunity in 3D integration packaging. In Proceedings of the SEMICON Korea 2010 SEMI Technology Symposium (STS), Seoul, Korea, 3–5 February 2010.
4. Lancaster, A.; Keswani, M. Integrated circuit packaging review with an emphasis on 3D packaging. Integration 2018, 60, 204–212.
5. Weerasekera, R.; Grange, M.; Pamunuwa, D.; Tenhunen, H.; Zheng, L.-R. Compact modelling of Through-Silicon Vias (TSVs) in three-dimensional (3-D) integrated circuits. In Proceedings of the 2009 IEEE International Conference on 3D System Integration, San Francisco, CA, USA, 28–30 September 2009; pp. 1–8.
6. Kim, D.H.; Topaloglu, R.O.; Lim, S.K. TSV density-driven global placement for 3D stacked ICs. In Proceedings of the 2011 International SoC Design Conference, Jeju, Korea, 17–18 November 2011; pp. 135–138.
7. Ye, T.; Hou, L.; Zhang, S.; Wang, J.; Peng, X. TSV modelling in 3D IC thermoelectric simulation. In Proceedings of the 2017 IEEE 12th International Conference on ASIC (ASICON), Guiyang, China, 25–28 October 2017; pp. 678–681.
8. Athikulwongse, K.; Kim, D.H.; Jung, M.; Lim, S.K. Block-level designs of die-to- bonded 3D ICs and their design quality tradeoffs. In Proceedings of the 2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC), Yokohama, Japan, 22–25 January 2013; pp. 687–692.
9. Kim, D.H.; Athikulwongse, K.; Healy, M.B.; Hossain, M.M.; Jung, M.; Khorosh, I.; Kumar, G.; Lee, Y.-J.; Lewis, D.L.; Lin, T.-W.; et al. Design and Analysis of 3D-MAPS (3D Massively Parallel Processor with Stacked Memory). IEEE Trans. Comput. 2015, 64, 112–125.
10. Jung, M.; Song, T.; Peng, Y.; Lim, S.K. Design Methodologies for Low-Power 3-D ICs With Advanced Tier Partitioning. IEEE Trans. Very Large Scale Integr. Syst. 2017, 25, 2109–2117.
11. Song, T.; Panth, S.; Chae, Y.; Lim, S.K. More Power Reduction With 3-Tier Logic-on-Logic 3-D ICs. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2016, 35, 2056–2067.
12. Panth, S.; Samadi, K.; Du, Y.; Lim, S.K. Placement-Driven Partitioning for Congestion Mitigation in Monolithic 3D IC Designs. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2015, 34, 540–553.
13. Bamberg, L.; García-Ortiz, A.; Zhu, L.; Pentapati, S.; Shim, D.E.; Lim, S.K. Macro-3D: A Physical Design Methodology for Face-to-Face-Stacked Heterogeneous 3D ICs. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2020; pp. 37–42.
14. Wang, S.; Yin, Y.; Hu, C.; Rezai, P. 3D Integrated Circuit Cooling with Microfluidics. Micromachines 2018, 9, 287.
15. Yang, C.-C.; Hsieh, T.Y.; Huang, W.H.; Shen, C.H.; Shieh, J.M.; Yeh, W.K.; Wu, M.C. Recent progress in low-temperature-process monolithic three dimension technology. Jpn. J. Appl. Phys. 2018, 57, 04FA06.
16. Pentapati, S.; Zhu, L.; Bamberg, L.; Shim, D.E.; García-Ortiz, A.; Lim, S.K. A Logic-on-Memory Processor-System Design With Monolithic 3-D Technology. IEEE Micro 2019, 39, 38–45.
17. Chaudhuri, A.; Banerjee, S.; Park, H.; Kim, J.; Murali, G.; Lee, E.; Kim, D.; Lim, S.K.; Mukhopadhyay, S.; Chakrabarty, K. Advances in Design and Test of Monolithic 3-D ICs. IEEE Des. Test 2020, 37, 92–100.
18. Samal, S.K.; Nayak, D.; Ichihashi, M.; Banna, S.; Lim, S.K. Monolithic 3D IC vs. TSV-based 3D IC in 14nm FinFET technology. In Proceedings of the 2016 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S), Burlingame, CA, USA, 10–13 October 2016; pp. 1–2.
19. Nayak, D.K.; Banna, S.; Samal, S.K.; Lim, S.K. Power, performance, and cost comparisons of monolithic 3D ICs and TSV-based 3D ICs. In Proceedings of the 2015 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S), Rohnert Park, CA, USA, 5–8 October 2015; pp. 1–2.
20. Panth, S.; Samadi, K.; Du, Y.; Lim, S.K. Shrunk-2-D: A Physical Design Methodology to Build Commercial-Quality Monolithic 3-D ICs. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2017, 36, 1716–1724.
21. Panth, S.; Samadi, K.; Du, Y.; Lim, S.K. Power-performance study of block-level monolithic 3D-ICs considering inter-tier performance variations. In Proceedings of the 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 1–5 June 2014; pp. 1–6.
22. Perez-Rivera, Z.; Tlelo-Cuautle, E.; Champac, V. Gate Sizing Methodology with a Novel Accurate Metric to Improve Circuit Timing Performance under Process Variations. Available online: https://www.mdpi.com/2227-7080/8/2/25 (accessed on 8 April 2021).
23. Koneru, A.; Chakrabarty, K. Test and Design-for-Testability Solutions for Monolithic 3D Integrated Circuits. In Proceedings of the 2019 on Great Lakes Symposium on VLSI, Tysons Corner, VA, USA, 9–11 May 2019.
[CrossRef] 24. Zanelli, J.C.; Metzler, C.; Reis, R. Gate Sizing for Power-Delay Optimization at Transistor-level Monolithic 3D-Integrated Circuits. In Proceedings of the 2020 IEEE 11th Latin American Symposium on Circuits & Systems (LASCAS), San Jose, Costa Rica, 25–28 February 2020; pp. 1–4. [CrossRef] 25. Chang, K.; Sinha, S.; Cline, B.; Southerland, R.; Doherty, M.; Yeric, G.; Lim, S.K. Cascade2D: A design-aware partitioning approach to monolithic 3D IC with 2D commercial tools. In Proceedings of the 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Austin, TX, USA, 7–10 November 2016; pp. 1–8. [CrossRef] 26. Jiang, J.; Parto, K.; Cao, W.; Banerjee, K. Ultimate Monolithic-3D Integration With 2D Materials: Rationale, Prospects, and Challenges. IEEE J. Electron Devices Soc. 2019, 7, 878–887. [CrossRef] 27. Park, H.; Chang, K.; Ku, B.W.; Kim, J.; Lee, E.; Kim, D.; Chaudhuri, A.; Banerjee, S.; Mukhopadhyay, S.; Chakrabarty, K.; et al. INVITED: RTL-to-GDS Tool Flow and Design-for-Test Solutions for Monolithic 3D ICs. In Proceedings of the 2019 56th ACM/IEEE Design Automation Conference (DAC), Las Vegas, NV, USA, 2–6 June 2019; pp. 1–4. 28. Lee, Y.; Lim, S.K. Ultrahigh Density Logic Designs Using Monolithic 3-D Integration. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2013, 32, 1892–1905. [CrossRef] 29. Lee, Y.; Limbrick, D.; Lim, S.K. Power benefit study for ultra-high density transistor-level monolithic 3D ICs. In Proceedings of the 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), Austin, TX, USA, 29 May–7 June 2013; pp. 1–10. 30. Gilbert, J.R.; Zmijewski, E. A Parallel Graph Partitioning Algorithm for a Message-Passing Multiprocessor. Int. J. Parallel Program. 1987, 16, 427–449. [CrossRef] 31. Karypis, G.; Kumar, V. Parallel multilevel k-way partitioning scheme for irregular graphs. Siam Rev. 1999, 41, 278–300. [CrossRef] 32. Sigl, G.; Doll, K.; Johannes, F.M. 
Analytical placement: A linear or a quadratic objective function? In Proceedings of the 28th ACM/IEEE Design Automation Conference, San Francisco, CA, USA, 17–22 June 1991. [CrossRef]