ELEC-H-541 Advanced manufacturing technologies and packaging

Part 01 Dragomir Milojevic [email protected] Today’s menu:

1. Course organization & overview

2. IC Structure

3. IC Manufacturing

4. Circuit design

5. Layout generation

6. Scaling principles and limitations

7. Overview of solutions to scaling limitations

ULB/BEAMS/MILOJEVIC Dragomir/ELEC-H-541 2

1. Course organization & overview

Lectures
• According to the schedule we fixed: Thursdays from 15.30 to 18.00
• 3 sessions fixed (should be enough; 1 extra lecture will be added if needed)
• Lectures split in three parts: two of these are slides based, the third one is about the practical aspects (I will show some design enablement stuff)
• We have time just to get a global picture, but this is a HUGE topic and a lot of research is done … (we can talk about this ...)
• Material available on the BEAMS website under ELEC-H-541: http://beams.ulb.ac.be/courses/ma2/advanced-manufacturing-technologies-packaging

Practical work
• It is a personal work (we do not have the infrastructure to do the real thing ... yet)
• Based on reading (and understanding!) of articles on a specific topic related to advanced IC design & manufacturing
• Not only related to the IC manufacturing process: you can, for example, do your work on design enablement (how to design circuits with a specific technology)
• Nice thing is that we can decide the topic together – it is always better to do things that you like :)
• I will impose the topic only if we cannot reach an agreement by a certain date!
• We can talk this over after this first introductory course, once you get the picture of what all this is about ...

Exam
• Oral presentation of the practical work
✦ Accounts for 70% of your final mark – so it is worth doing a good job here!
✦ A few slides to illustrate; the purpose is to convince me that you are able to master the topic
• Plus 1 to 2 questions on the theory we have discussed here (also oral); you will typically have some time to prepare your answer (no access to lecture notes, smart phones, tablets, etc.)
• You are going to be judged according to the level of your technical competence, but also:
✦ For the quality of your presentation
✦ Capability to reason with the info you have (connecting things)
✦ Being able to draw important conclusions

Context of ELEC-H-541 course

[Figure: Y-chart: circuit view axes, abstraction levels, and lectures (ELEC-H-505, ELEC-H-409, ELEC-H-305, ELEC-H-541)]

Here based on the ULB program, but you should have a similar background ...

Painting history

[Figure: two paintings. Left: flat surface, no perspective. Right: perspective added.]

But we live in a 3D world: extension to the dimensions not used before ...

A similar thing is happening today, but in the IC industry !!!

This part of the course:
• Explains the way business has been run for the past 40 years (important to understand where we are coming from) → basically why Moore’s law is there
• Shows the limitations of the existing methods & procedures
✦ This is not going to be exhaustive, it will point out some key issues
• Demonstrates potential alternatives to overcome the limitations
• This will open up a whole new world of possibilities
✦ Not only technology solutions, but also
✦ Enablers for eventual paradigm changes (architecture of future systems)
✦ How to design such systems

2. IC structure

CMOS transistor
• Three-terminal device (gate, source, and drain): when “turned on”, the current flows from source to drain; when “turned off”, no current flows
• NMOS – N-type dopant for the source and drain, separated by P-type: it turns on when the gate voltage is high (high voltage draws the negative carriers from the source and drain under the gate)
• PMOS – P-type dopant for the source and drain, separated by N-type: it turns on when the gate voltage is low (low voltage draws the positive carriers from the source and drain under the gate)

[Figure 7-1: NMOS and PMOS transistors. Transistor symbols and cross sections (poly gate, spacers, oxide, source, drain, P-well / N-well). NMOS: gate voltage low → transistor OFF, gate voltage high → transistor ON. PMOS: gate voltage low → transistor ON, gate voltage high → transistor OFF.]

CMOS gates
• All gates are built using NMOS and PMOS transistors
• We have to provide wires to connect things together
• Inverter example: one NMOS and one PMOS in series
✦ When input is low: NMOS is off and PMOS is on → high output (Vdd sees Vss through the load on the gate)
✦ When input is high: PMOS is off and Vdd is cut off from Vss

[Figure 7-7: Water circuit. Supply, output and drain; top valve on / bottom valve off → output at high pressure; top valve off / bottom valve on → output at low pressure.]

[Figure 7-8: Inverter operation. PMOS between supply (1) and output, NMOS between output and ground (0); low input → high output, high input → low output.]

[Figure 7-9: Inverter symbol and truth table. Inverter circuit schematic, schematic symbol and truth table: In = 0 → Out = 1, In = 1 → Out = 0.]

Textbook excerpt (Chapter Seven, Circuit Design, pp. 208–210):

... flows only where and when needed. These gates are used together to produce more and more complicated behaviors until any logical function can be implemented. If we imagine electricity flowing like water, then transistors would be the valves that control its flow, as in the imaginary water “circuit” shown in Fig. 7-7. In this circuit, the output starts at low pressure with no water in it. When the top valve is turned on and the bottom is turned off, current flows from the supply to the output bringing it to high pressure. If the top valve is then turned off and the bottom valve turned on, the water drains from the output bringing it back to low pressure. Electrical current flows through transistors in the same way, changing the voltage or electrical “pressure” of the wires.

The water circuit in Fig. 7-7 is analogous to an electrical inverter shown in Fig. 7-8. The supply is a high voltage and the electric ground acts to drain away current. A single input wire is connected to the gates of a PMOS and an NMOS transistor. When the voltage on the input is low (a logical 0), the PMOS transistor turns on and the NMOS turns off. Current flows through the PMOS to bring the output to a high voltage. When the voltage on the input is high (a logical 1), the PMOS transistor turns off and the NMOS turns on. The NMOS drains away the charge on the output returning it to a low voltage. Because the voltage on the output always changes to the opposite of the voltage on the input, this circuit is called an inverter. It inverts the binary signal on its input.

For commonly used logic gates, circuit designers use symbols as shorthand for describing how the transistors are to be connected. The symbol for an inverter is a triangle with a circle at the tip. The truth table for a logic gate shows what the output will be for every possible combination of inputs. See Fig. 7-9. Because an inverter has only one input, its truth table need only show two possible cases. More complicated logic gates and behaviors can be created with more transistors.

The second most commonly used logic gate is a NAND gate. The simplest NAND has two inputs and provides a 0 output only if both inputs are 1. A two-input NAND circuit and its operation are shown in Fig. 7-10. The NAND uses two P-transistors and two N-transistors. The P-transistors are connected in parallel from the output (labeled X in the schematic) to the supply line. If either turns on, the output will be drawn to a 1. The N-transistors are connected in series from the output to the ground line. Only if both turn on, the output will be drawn to a 0. The two inputs, A and B, are each connected to the gate of one P-transistor and one N-transistor.

Starting at the bottom left of Fig. 7-10, the NAND circuit is shown with both inputs at 0. Both N-devices will be off and both P-devices will be on, pulling the output to a 1. The middle two cases show that if only one of the inputs is 0, one of the N-devices will be on, but it will be in series with the other N-device that is off. A 0 signal is blocked from the output by the off N-device, and the single on P-device still pulls the output to a 1. At the bottom right, the final case shows that only if both inputs are high and both N-devices are on, the output is pulled to a 0.

Similar to the NAND is the NOR logic gate. The two-input NOR and its operation are shown in Fig. 7-11. The NOR provides a 0 output only if either input is a 1. The NOR also uses two P-transistors and two N-transistors. The P-transistors are connected in series from the output to the supply line. If both turn on, the output will be drawn to a 1. The N-transistors are connected in parallel from the output to the ground line. If either turns on, the output will be drawn to a 0. The two inputs are each connected to the gate of one P-transistor and one N-transistor.

Shown at the bottom left of Fig. 7-11, if both inputs are 0, both N-devices will be off and both P-devices will be on, pulling the output to a 1. In the two center cases, if only one of the inputs is 0, one of the P-devices will be on, but it will be in series with the other P-device, which is off. A 1 signal ...

NAND and NOR

NAND truth table:
A B | X
0 0 | 1
0 1 | 1
1 0 | 1
1 1 | 0
• The NMOS transistors are in series, both must be ON to pull the output low
• The PMOS transistors are in parallel, either can be ON to pull the output high

[Figure 7-10: NAND gate. The four input cases: A = 0, B = 0 → X = 1; A = 0, B = 1 → X = 1; A = 1, B = 0 → X = 1; A = 1, B = 1 → X = 0.]

NOR truth table:
A B | X
0 0 | 1
0 1 | 0
1 0 | 0
1 1 | 0
• The NMOS transistors are in parallel, either can be ON to pull the output low
• The PMOS transistors are in series, both must be ON to pull the output high

[Figure 7-11: NOR gate. The PMOS transistors are in series, both must be ON to pull the output high. The four input cases: A = 0, B = 0 → X = 1; A = 0, B = 1 → X = 0; A = 1, B = 0 → X = 0; A = 1, B = 1 → X = 0.]

Integrated Circuit – IC
• Set of logic gates implemented using NMOS and PMOS on a plane of semiconductor material ()
• Hence the name: Complementary Metal–Oxide–Semiconductor → CMOS
• Plane and CMOS → planar (CMOS) IC technologies
• Feature size (e.g. line/channel width)
✦ 2008 ~ 100nm
✦ Current: 28, 22nm
✦ 2015 ~ 10nm
• This is SMALL …
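Since an IC is a set of such gates, the series/parallel rules from the CMOS gates slides can be checked with a small switch-level model. This is only a sketch: transistors are treated as ideal switches, and all function names (`nmos_on`, `pmos_on`, `cmos_output`, `inverter`, `nand2`, `nor2`) are invented for this illustration.

```python
from itertools import product


def nmos_on(gate: int) -> bool:
    """An NMOS conducts when its gate voltage is high (logic 1)."""
    return gate == 1


def pmos_on(gate: int) -> bool:
    """A PMOS conducts when its gate voltage is low (logic 0)."""
    return gate == 0


def cmos_output(pull_up: bool, pull_down: bool) -> int:
    """Resolve the output from the pull-up (to Vdd) and pull-down (to Vss) networks."""
    # In a static CMOS gate exactly one of the two networks conducts.
    assert pull_up != pull_down, "pull-up and pull-down must be complementary"
    return 1 if pull_up else 0


def inverter(a: int) -> int:
    # One PMOS to Vdd and one NMOS to Vss, both driven by the same input.
    return cmos_output(pmos_on(a), nmos_on(a))


def nand2(a: int, b: int) -> int:
    pull_up = pmos_on(a) or pmos_on(b)      # PMOS in parallel
    pull_down = nmos_on(a) and nmos_on(b)   # NMOS in series
    return cmos_output(pull_up, pull_down)


def nor2(a: int, b: int) -> int:
    pull_up = pmos_on(a) and pmos_on(b)     # PMOS in series
    pull_down = nmos_on(a) or nmos_on(b)    # NMOS in parallel
    return cmos_output(pull_up, pull_down)


if __name__ == "__main__":
    print("A B | NAND NOR")
    for a, b in product((0, 1), repeat=2):
        print(f"{a} {b} |  {nand2(a, b)}    {nor2(a, b)}")
```

The assertion in `cmos_output` encodes the "complementary" part of CMOS: for every input combination one network conducts and the other is cut off, so Vdd is never shorted to Vss in steady state.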

Cross-section of an IC

Layered structure, each layer is processed in a sequential manner (one after another) in a manufacturing line (hence the names below)
• Substrate – support for mechanical handling of the IC (thick)
• Front End Of Line (FEOL) – because processed first in line; active layer, contains the transistors used to build gates, this is thin
• Back End Of Line (BEOL) – because processed last in line
✦ Metal layers
✦ Via layers

Metal layers
• Close correlation between the number of connections of a block (number of inputs/outputs) and the block complexity (the number of gates inside the block): Rent’s rule
• Empirical rule based on observation of real circuits
• Plotting on a log-log scale:
✦ the number of pins (terminals) T
✦ vs. the number of components (gates) K
gives a straight line!
• Can be easily modeled with the expression $T = A K^p$, where p is the Rent exponent
• Meaning: the more gates you have, the more interconnect will be needed ...

Metal layers
• A typical IC will therefore use a metal stack, an assembly of multiple metal layers that will implement all required connections (both for signal and power supply)
• Each metal layer has a preferred routing direction (can be horizontal or vertical, but the router could use both) and the routing can be Manhattan or 45º (Can you explain the difference?)
• Different metal layers have different wire properties (geometrical → electrical), typically clustered in:
✦ Local
✦ Semi-global
✦ Global
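Rent’s rule from the first Metal layers slide can be tried out numerically. The sketch below evaluates $T = A K^p$ and recovers A and p from (K, T) pairs by fitting a straight line on the log-log scale. The function names and the values A = 2.5 and p = 0.6 are assumptions chosen for illustration; they are not taken from the lecture.

```python
import math


def rent_terminals(k_gates: float, a: float, p: float) -> float:
    """Rent's rule: number of terminals T = A * K**p for a block of K gates."""
    return a * k_gates ** p


def fit_rent(samples):
    """Least-squares line fit of log T = log A + p * log K.

    `samples` is an iterable of (K, T) pairs; returns the tuple (A, p).
    """
    samples = list(samples)
    xs = [math.log(k) for k, _ in samples]
    ys = [math.log(t) for _, t in samples]
    n = len(samples)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p = sxy / sxx                      # slope of the straight line
    a = math.exp(mean_y - p * mean_x)  # intercept, converted back from log A
    return a, p


if __name__ == "__main__":
    A, P = 2.5, 0.6  # illustrative values only
    data = [(k, rent_terminals(k, A, P)) for k in (10, 100, 1_000, 10_000)]
    a_fit, p_fit = fit_rent(data)
    for k, t in data:
        print(f"K = {k:>6} gates -> T = {t:8.1f} terminals")
    print(f"fitted A = {a_fit:.3f}, p = {p_fit:.3f}")
```

With real circuit data the points scatter around the line, and the fitted slope is the Rent exponent p.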

Metal layers: properties
• Example 180nm: trade-off R and C to compensate for length

[Table 2: Dimensions for Our Example 0.18-µm Technology (table contents not recoverable)]

Textbook excerpt (THE WIRE, Chapter 4, Section 4.3 Interconnect Parameters: Capacitance, Resistance, and Inductance, pp. 107–112):

... tor model (also called area capacitance). Under those circumstances, the total capacitance of the wire can be approximated as [1]

$c_{int} = \frac{\varepsilon_{di}}{t_{di}} W L$ (4.1)

where W and L are respectively the width and length of the wire, and $t_{di}$ and $\varepsilon_{di}$ represent the thickness of the dielectric layer and its permittivity. SiO2 is the dielectric material of choice in integrated circuits, although some materials with lower permittivity, and hence lower capacitance, are coming in use. Examples of the latter are organic polyimides and aerogels. ε is typically expressed as the product of two terms, or $\varepsilon = \varepsilon_r \varepsilon_0$. $\varepsilon_0 = 8.854 \times 10^{-12}$ F/m is the permittivity of free space, and $\varepsilon_r$ the relative permittivity of the insulating material. Table 4.1 presents the relative permittivity of several dielectrics used in integrated circuits. In summary, the important message from Eq. (4.1) is that the capacitance is proportional to the overlap between the conductors and inversely proportional to their separation.

[1] To differentiate between distributed (per unit length) wire parameters versus total lumped values, we will use lowercase to denote the former and uppercase for the latter.

[Figure 4.3: Parallel-plate capacitance model of interconnect wire. Labels: current flow, L, W, H, t_di, dielectric, substrate, electrical-field lines.]

Table 4.1 Relative permittivity of some typical dielectric materials.
Material                    εr
Free space                  1
Aerogels                    ~1.5
Polyimides (organic)        3-4
Silicon dioxide             3.9
Glass-epoxy (PC board)      5
Silicon Nitride (Si3N4)     7.5
Alumina (package)           9.5
Silicon                     11.7

In actuality, this model is too simplistic. To minimize the resistance of the wires while scaling technology, it is desirable to keep the cross-section of the wire (W×H) as large as possible, as will become apparent in a later section. On the other hand, small values of W lead to denser wiring and less area overhead. As a result, we have over the years witnessed a steady reduction in the W/H-ratio, such that it has even dropped below unity in advanced processes. This is clearly visible on the process cross-section of FIGURE. Under those circumstances, the parallel-plate model assumed above becomes inaccurate. The capacitance between the side-walls of the wires and the substrate, called the fringing capacitance, can no longer be ignored and contributes to the overall capacitance. This effect is illustrated in Figure 4.4a. Presenting an exact model for this difficult geometry is hard. So, as good engineering practice dictates, we will use a simplified model that approximates the capacitance as the sum of two components (Figure 4.4b): a parallel-plate capacitance determined by the orthogonal field between a wire of width w and the ground plane, in parallel with the fringing capacitance modeled by a cylindrical wire with a dimension equal to the interconnect thickness H. The resulting approximation is simple and works fairly well in practice.

$c_{wire} = c_{pp} + c_{fringe} = \frac{w \varepsilon_{di}}{t_{di}} + \frac{2 \pi \varepsilon_{di}}{\log(t_{di}/H)}$ (4.2)

with w = W - H/2 a good approximation for the width of the parallel-plate capacitor. Numerous more accurate models (e.g. [Vdmeijs84]) have been developed over time, but these tend to be substantially more complex, and defeat our goal of developing a conceptual understanding.

[Figure 4.4: The fringing-field capacitance. The model decomposes the capacitance into two contributions: a parallel-plate capacitance, and a fringing capacitance, modeled by a cylindrical wire with a diameter equal to the thickness of the wire. (a) Fringing fields. (b) Model of fringing-field capacitance.]

To illustrate the importance of the fringing-field component, Figure 4.5 plots the value of the wiring capacitance as a function of (W/H). For larger values of (W/H) the total capacitance approaches the parallel-plate model. For (W/H) smaller than 1.5, the fringing ...

Example 4.1 Capacitance of Metal Wire

Some global signals, such as clocks, are distributed all over the chip. The length of those wires can be substantial. For die sizes between 1 and 2 cm, wires can reach a length of 10 cm and have associated wire capacitances of substantial value. Consider an aluminum wire of 10 cm long and 1 µm wide, routed on the first Aluminum layer. We can compute the value of the total capacitance using the data presented in Table 4.2.

Area (parallel-plate) capacitance: $(0.1 \times 10^6\ \mu m^2) \times 30\ aF/\mu m^2 = 3\ pF$
Fringing capacitance: $2 \times (0.1 \times 10^6\ \mu m) \times 40\ aF/\mu m = 8\ pF$
Total capacitance: 11 pF

Notice the factor 2 in the computation of the fringing capacitance, which takes the two sides of the wire into account.

Suppose now that a second wire is routed alongside the first one, separated by only the minimum allowed distance. From Table 4.3, we can determine that this wire will couple to the first with a capacitance equal to

$C_{inter} = (0.1 \times 10^6\ \mu m) \times 95\ aF/\mu m = 9.5\ pF$

which is almost as large as the total capacitance to ground!

A similar exercise shows that moving the wire to Al4 would reduce the capacitance to ground to 3.45 pF (0.65 pF area and 2.8 pF fringe), while the interwire capacitance would remain approximately the same at 8.5 pF.

4.3.2 Resistance

The resistance of a wire is proportional to its length L and inversely proportional to its cross-section A. The resistance of a rectangular conductor in the style of Figure 4.3 can be expressed as

$R = \frac{\rho L}{A} = \frac{\rho L}{H W}$ (4.3)

where the constant ρ is the resistivity of the material (in Ω-m). The resistivities of some commonly-used conductive materials are tabulated in Table 4.4. Aluminum is the interconnect material most often used in integrated circuits because of its low cost and its compatibility with the standard integrated-circuit fabrication process. Unfortunately, it has a large resistivity compared to materials such as Copper. With ever-increasing performance targets, this is rapidly becoming a liability and top-of-the-line processes are now increasingly using Copper as the conductor of choice.

Table 4.4 Resistivity of commonly-used conductors (at 20 C).
Material        ρ (Ω-m)
Silver (Ag)     1.6 × 10^-8
...
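The arithmetic of Example 4.1 and Eq. (4.3) can be reproduced with a few lines of code. This is a sketch under stated assumptions: the function names are invented for this illustration, the per-unit capacitances (30 aF/µm², 40 aF/µm, 95 aF/µm) are the ones quoted in the example, and the 1 µm wire thickness used for the resistance is a hypothetical value, since the excerpt does not give H.

```python
import math

AF = 1e-18  # one attofarad, in farads


def wire_cap_to_ground(length_um, width_um, area_af_per_um2, fringe_af_per_um):
    """Return (area, fringe, total) capacitance to ground in farads."""
    area = length_um * width_um * area_af_per_um2 * AF
    # Factor 2: the fringing field exists on both side-walls of the wire.
    fringe = 2 * length_um * fringe_af_per_um * AF
    return area, fringe, area + fringe


def coupling_cap(length_um, inter_af_per_um):
    """Capacitance in farads to a neighbouring wire running alongside."""
    return length_um * inter_af_per_um * AF


def wire_resistance(rho_ohm_m, length_m, width_m, height_m):
    """Eq. (4.3): R = rho * L / (H * W), in ohms."""
    return rho_ohm_m * length_m / (height_m * width_m)


if __name__ == "__main__":
    length_um = 0.1e6  # 10 cm expressed in micrometres
    area, fringe, total = wire_cap_to_ground(length_um, 1.0, 30.0, 40.0)
    print(f"area   = {area * 1e12:.2f} pF")
    print(f"fringe = {fringe * 1e12:.2f} pF")
    print(f"total  = {total * 1e12:.2f} pF")
    print(f"inter  = {coupling_cap(length_um, 95.0) * 1e12:.2f} pF")

    # Resistance of a wire of the same length and width, using the silver
    # resistivity from Table 4.4 and an ASSUMED thickness H = 1 um.
    r = wire_resistance(1.6e-8, 0.1, 1e-6, 1e-6)
    print(f"R = {r:.0f} ohm")
    assert math.isclose(total, 11e-12)
```

Keeping the per-unit values as parameters makes it easy to repeat the exercise for another metal layer once its area and fringe values are known.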
Such a strategy often requires miracles and, indeed, the 1997 roadmap forecast clock frequencies that will be difficult, if not impossible, to achieve. In our approach, we hedge our bets by making not a single prediction of technological scalings for wire characteristics but rather a range of predictions. We will use both aggressive and conservative scaling projections to bound future parameters, and hope that by doing so, we will encompass a broad enough range that actual wire performance will fall within these bounds. In the discussions below, we will show results for both aggressive and conservative scaling; not only does this give us a better chance of predicting future performance, it also helps us determine the sensitivity of these trends.

A. Gate Delay Scaling

Historically, gates have scaled linearly with technology, and an accurate model of recent FO4 delays has been 360 pS at typical and 500 pS under worst case environmental conditions (typical devices, low […], high temperature). Fig. 12 shows FO4 delays for a number of different process technologies running at the worst case environment corner. This trend may continue for future generations of transistors, since devices seem scalable down to drawn dimensions of 0.018 µm [28]. Whether or not such devices obey the above delay model is uncertain, because of issues in scaling gate oxide, […] and […]. These concerns mean 500 pS is a lower limit for future FO4 delays. Since we are considering wire delays relative to gate delays, faster gates provide the worst case for wire issues, and thus we will use this model as our gate delay projection.

Other device parameters, such as gate and diffusion capacitance, are assumed to scale nicely. We assume gate capacitance, now around 1.5–2 fF/µm, will stay constant; although this would seem to demand too-thin gate oxides, high-κ dielectrics may allow more aggressive scaling of the effective […] [29]–[31]. We project diffusion capacitance to stay at about half gate capacitance for legged devices, although trench technologies and/or SOI can reduce this dramatically [32].

To predict chip clock cycle times under process scaling, we can examine the number of FO4s per cycle. Fig. 13 shows some historical data from Intel microprocessors for various microarchitectures ranging from the nonpipelined 80386 to the out-of-order execution PentiumPro [33].

Current machines cycle between 20 and 30 FO4s per clock, and the upcoming Pentium4 microarchitecture and the aggressive Compaq/DEC Alpha chips sit at around 14 to 16 FO4s per clock [34], [35]. This may look misleading, since the Pentium4 processor, at 1.4 GHz in an 0.18 µm process, would appear to have a cycle time of 714 pS and an FO4 of 90 pS, or 8 FO4s per clock. However, some technologies have an […] that is significantly smaller than the base technology feature size. For example, the 0.18-µm process, because of poly profile engineering, ends up with an […] of 100 nm [36]. (This is not the same as saying that “electrical gate length is smaller than physical gate length,” since the narrowing of […] is due not to diffusion undercut, but rather to poly notches. In fact, the electrical gate length for this process will be smaller still, although […] is irrelevant to our FO4 model, which uses physical gate length.) Hence, our model would more properly estimate FO4 delays for the Intel 0.18-µm process ...

[Fig. 12. FO4 scaling (typical, 90% […], 125 C).]

[Fig. 13. Historical FO4s per clock, 86 machines.]
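The FO4 numbers quoted in this excerpt can be cross-checked with a short calculation. The sketch below assumes that the worst-case FO4 delay is 500 pS multiplied by the gate length in µm. That scaling factor is not legible in the extracted text; it is inferred here from the quoted pair "0.18 µm process, FO4 of 90 pS", so treat it as an assumption. The function names are invented for this illustration.

```python
def fo4_delay_ps(l_gate_um: float, ps_per_um: float = 500.0) -> float:
    """FO4 delay in ps, ASSUMING delay = ps_per_um * gate length (in um)."""
    return ps_per_um * l_gate_um


def fo4_per_clock(clock_ghz: float, l_gate_um: float, ps_per_um: float = 500.0) -> float:
    """Number of FO4 delays that fit in one clock cycle."""
    cycle_ps = 1000.0 / clock_ghz  # 1 GHz corresponds to a 1000 ps cycle
    return cycle_ps / fo4_delay_ps(l_gate_um, ps_per_um)


if __name__ == "__main__":
    # Pentium4 example from the excerpt: 1.4 GHz in a 0.18 um process.
    print(f"cycle time      : {1000.0 / 1.4:.0f} ps")
    print(f"FO4 at 0.18 um  : {fo4_delay_ps(0.18):.0f} ps")
    print(f"FO4s per clock  : {fo4_per_clock(1.4, 0.18):.1f}")
    # Same clock, but with the 100 nm gate length mentioned in the excerpt.
    print(f"FO4 at 0.10 um  : {fo4_delay_ps(0.10):.0f} ps")
    print(f"FO4s per clock  : {fo4_per_clock(1.4, 0.10):.1f}")
```

Under this assumption the 0.18 µm figure gives about 8 FO4s per clock, while the 100 nm gate length gives about 14, which falls in the 14 to 16 range the excerpt quotes for the Pentium4.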

HO et al.: THE FUTURE OF WIRES 497

Interconnect-Power Definition (14/2/2004)
• Interconnect-Power is dynamic power consumption due to interconnect capacitance switching
– How much power is consumed by interconnections?
– Future generations trends?
– How to reduce the interconnect power?

[Figure: 0.13 µm cross-section, source: Intel]

Via layers
• Contain vias – vertical connections between different levels of wires
• Just like contacts in transistors they form vertical connections, but between wires instead of between wires and transistors
• Via1 layer creates connections between Metal1 wires and the Metal2 wires above
• A layer of insulation separates each wiring level so that connections are made between levels only where vias are drawn (metal 2 wires can cross over metal 1 without contact)
• Later, another via1 could take the electrical signal back down to metal 1 where a contact could then connect to the appropriate transistor

Typical metal stack construction
• Common: up to 9 metal layers + 9 via layers
• Advanced ICs: between 9 and 14
• Each extra metal layer will add extra processing cost but also extra mask design cost (this can be expensive)
• BEOL can be optimized for a particular application, here density vs. performance
• As the devices shrink, it becomes more and more complicated to make connections! (BEOL bottleneck)

From concept to physical circuit

[Figure 8-2: Transistor and interconnect layout. Transistor layout and transistor cross section; labels: poly gate, source, drain, N+, P-well, contact, Metal 1, Via 1, Metal 2. Layout key: Contact, Metal 1, Via1, Poly, Metal 2, P-well.]

CAD drawing of the active layer, showing contacts and local wiring using M1 and M2 (note preferred routing directions)

Textbook excerpt (Chapter Eight, p. 242):

Figure 8-2 shows the cross section and layout of a transistor with the first two metal interconnect layers added. The first layer created is the vertical contacts that will connect the diffusion regions or poly gate to the first layer of interconnect. These are columns of metal that form all the connections between wires and transistors. To aid in processing, all contacts are made of the same size. To contact wide transistors, multiple contacts are used.

Above the contact layer is the first layer of wires, typically called metal 1 (M1). The metal 1 layer rests on top of a layer of insulation added on top of the transistors. This prevents it from touching the silicon surface or the poly layer. Metal 1 wires can create connections by crossing the horizontal distance between two contacts. The drain of one transistor can be connected to the gate of another by using a contact to go up from the drain to a metal 1 wire and then another contact to go down to the poly gate. However, metal 1 wires will always contact other metal 1 wires they cross. If the needed path is blocked by another metal 1 wire, the connection must be made using the next level of wiring.

3. IC manufacturing

Lithography process
• Fabrication of an integrated circuit (IC): a variety of physical and chemical processes performed on a semiconductor (e.g., silicon) substrate
• Various processes: film deposition, patterning, and semiconductor doping
• Films of both:
✦ conductors – such as polysilicon, aluminum, and more recently copper
✦ and insulators – various forms of silicon dioxide, silicon nitride, ...
are used to connect and isolate transistors and their components
• Selective doping of various regions of silicon allows the conductivity of the silicon to be changed with the application of voltage
• By creating structures of these various components, millions of transistors can be built and wired together to form the complex IC devices
• Fundamental to all of these processes is lithography, i.e., the formation of three-dimensional relief images on the substrate for subsequent transfer of the pattern to the substrate on wafers

Wafer processing
• Preparation – improves the adhesion of the photoresist material to the substrate (cleaning – to remove contamination; dehydration – to remove water; addition of an adhesion promoter)
• Wafer coating with photoresist – with a thin, very uniform coating
• Pre-bake – by removing the excess solvent, the photoresist film becomes stable (will have better properties)
• Align & expose – first the image must be aligned with previously defined patterns on the wafer; the resulting overlay of the two or more lithographic patterns is critical, since tighter overlay control means circuit features can be packed closer together (better circuit density)
• Development – once exposed, the photoresist must be developed
• Pattern transfer – after patterns are printed in photoresist, they are transferred into the substrate using: subtractive transfer (etching), additive transfer (selective deposition), and impurity doping (ion implantation). Etching is the most common pattern transfer approach.
• Strip resist – after pattern transfer the remaining photoresist must be removed

[Figure 1-1: Example of a typical sequence of lithographic processing steps (with no post-exposure bake in this case), illustrated for a positive resist.]

Web page excerpt (Chris Mack, Gentleman Scientist, 19/11/13 14:12):

Adhesion promoters are used to react chemically with surface silanol and replace the -OH group with an organic functional group that, unlike the hydroxyl group, offers good adhesion to photoresist. Silanes are often used for this purpose, the most common being hexamethyl disilazane (HMDS) [1.2]. (As a note, HMDS adhesion promotion was first developed for fiberglass applications, where adhesion of the resin matrix to the glass fibers is important.) The HMDS can be applied by spinning a diluted solution (10-20% HMDS in cellosolve acetate, xylene, or a fluorocarbon) directly on to the wafer and allowing the HMDS to spin dry (HMDS is quite volatile at room temperature). If the HMDS is not allowed to dry properly dramatic loss of adhesion will result. Although direct spinning is easy, it is only effective at displacing a small percentage of the silanol groups. By far the preferred method of applying the adhesion promoter is by subjecting the substrate to HMDS vapor, usually at elevated temperatures and reduced pressure. This allows good coating of the substrate without excess HMDS deposition, and the higher temperatures cause more complete reaction with the silanol groups. Once properly treated with HMDS the substrate can be left for up to several days without significant re-adsorption of water. Performing the dehydration bake and vapor prime in the same oven gives optimum performance.

2. Photoresist Coating

A thin, uniform coating of photoresist at a specific, well controlled thickness is accomplished by the seemingly simple process of spin coating. The photoresist, rendered into a liquid form by dissolving the solid components in a solvent, is poured onto the wafer, which is then spun on a turntable at a high speed producing the desired film. Stringent requirements for thickness control and uniformity and low defect density call for particular attention to be paid to this process, where a large number of parameters can have significant impact on photoresist thickness uniformity and control. There is the choice between static dispense (wafer stationary while resist is dispensed) or dynamic dispense (wafer spinning while resist is dispensed), spin speeds and times, and accelerations to each of the spin speeds. Also, the volume of the resist dispensed and properties of the resist (such as viscosity, percent solids, and solvent composition) and the substrate (substrate material and topography) play an important role in the resist thickness uniformity. Further, practical aspects of the spin operation, such as exhaust, temperature and humidity control, and spinner cleanliness often have significant effects on

http://www.lithoguru.com/scientist/lithobasics.html

Printing system: resolution

• Lithography exposure tools
✦ contact – simultaneous patterning of the complete wafer
✦ scanning – a slit of light is scanned across the mask and the wafer
✦ step-and-repeat systems – expose one field of the wafer at a time through the reticle, then step to the next field
• Resolution, i.e. the smallest feature that can be printed with adequate control, is limited by:
✦ the smallest image that can be projected onto the wafer,
✦ and the resolving capability of the photoresist to make use of that image
• Projected image resolution R is determined by:
✦ k1 – process related factor
✦ the wavelength of the imaging light (λ)
✦ and the numerical aperture (NA)

$$R = k_1 \frac{\lambda}{\mathrm{NA}}$$

Figure 1-5. Scanners and steppers use different techniques for exposing a large wafer with a small image field.

Resolution, the smallest feature that can be printed with adequate control, has two basic limits: the smallest image that can be projected onto the wafer, and the resolving capability of the photoresist to make use of that image. From the projection imaging side, resolution is determined by the wavelength of the imaging light (λ) and the numerical aperture (NA) of the projection lens according to the Rayleigh criterion given above. Lithography systems have progressed from blue wavelengths (436nm) to UV (365nm) to deep-UV (248nm) to today’s mainstream high resolution wavelength of 193nm. In the meantime, projection tool numerical apertures have risen from 0.16 for the first scanners to amazingly high 0.93 NA systems today producing features well under 100nm in size.
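The Rayleigh criterion is easy to evaluate numerically. The sketch below is only an illustration: the wavelengths and NA values are the ones quoted in the text, while k1 = 0.5 is an assumed value and the function name is hypothetical.

```python
def rayleigh_resolution_nm(k1: float, wavelength_nm: float, na: float) -> float:
    """Smallest printable feature R = k1 * lambda / NA, in nanometres."""
    if na <= 0:
        raise ValueError("numerical aperture must be positive")
    return k1 * wavelength_nm / na


# Wavelengths and NA values are quoted in the text; k1 = 0.5 is an assumption.
K1_ASSUMED = 0.5

for wavelength_nm in (436, 365, 248, 193):
    for na in (0.16, 0.93):
        r = rayleigh_resolution_nm(K1_ASSUMED, wavelength_nm, na)
        print(f"lambda = {wavelength_nm} nm, NA = {na}: R = {r:.0f} nm")
```

With these assumptions, 193 nm light and NA = 0.93 give R of roughly 104 nm, so printing features "well under 100nm" implies a process factor k1 below 0.5.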

Before the exposure of the photoresist with an image of the mask can begin, this image must be aligned with the previously defined patterns on the wafer. This alignment, and the resulting overlay of the two or more lithographic patterns, is critical since tighter overlay control means circuit features can be packed closer together. Closer packing of devices through better alignment and overlay is nearly as critical as smaller devices through higher resolution in the drive towards more functionality per chip.

Another important aspect of photoresist exposure is the standing wave effect. Monochromatic light, when projected onto a wafer, strikes the photoresist surface over a range of angles, approximating plane waves. This light travels down through the photoresist and, if the substrate is reflective, is reflected back up through the resist. The incoming and reflected light interfere to form a standing wave pattern of high and low light intensity at different depths in the photoresist. This pattern is replicated in the photoresist, causing ridges in the sidewalls of the resist feature as seen in Figure 1-6. As pattern dimensions become smaller, these ridges can significantly affect the quality of the feature. The interference that causes standing waves also results in a phenomenon called swing curves, the sinusoidal variation in linewidth with changing resist thickness. These detrimental effects are best cured by coating the substrate with a thin absorbing layer called a bottom antireflective coating (BARC) that can reduce the reflectivity seen by the photoresist to less than 1 percent.

Lithography vs. Moore
• To increase the resolution (reduce the feature size)
✦ Decrease wavelength
✦ Increase NA
• Light sources: 435 → 248 → 193 and finally 157nm (~2010)
• Still, plotted against Moore there is a breakdown ...
• Wavelength reduction only is not enough to follow scaling requirements
• Supplementary tricks to print features smaller than “light”
✦ Phase modulation or phase shifting can improve the effective resolution of the system through constructive and/or destructive interference in addition to the amplitude modulation

Fig. 2. Comparison of lithography wavelength trends with IC feature size trend. Courtesy of Dr. S. Okazaki, Hitachi Ltd.

Fig. 3. Example of using OPC serif features in contact hole printing. (top) Square feature on the mask prints as a circle at the wafer due to diffraction effects. (bottom) Serifs are added to make the corners of the printed image more square.

resist pattern is then used for subsequent process steps such as etching or implantation doping. The optical projection systems used today have very complex multielement lenses that correct for virtually all of the common aberrations and operate at the diffraction limit. The resolution of a lithography system is usually expressed in terms of its wavelength and numerical aperture (NA) as

$$\text{Resolution} = k_1 \frac{\lambda}{\mathrm{NA}} \qquad (1)$$

where the constant k1 is dependent on the process being used. In IC manufacturing, typical values of k1 range from 0.5 to 0.8, with a higher number reflecting a less stringent process. The NA of optical lithography tools ranges from about 0.5 to 0.6 today. Thus, the typical rule of thumb is that the smallest features that can be printed are about equal to the wavelength of the light used. Historically, the improvements in IC lithography resolution have been driven by decreases in the printing wavelength, as shown in Fig. 2. The illumination sources were initially based on mercury arc lamps filtered for different spectral lines. The figure shows the progression from G-line at 435 nm to I-line at 365 nm. This was followed by a switch to excimer laser sources with KrF at 248 nm and, more recently, ArF at 193 nm. The most advanced IC manufacturing currently uses KrF technology with introduction of ArF tools beginning sometime in 2001.

It can also be seen from the figure that the progress in IC minimum feature size is on a much steeper slope than that of lithography wavelength. Prior to the introduction of KrF lithography, the minimum feature sizes printed in practice have been larger than the wavelength with the crossover at the 250-nm generation and KrF. With the introduction of 180-nm technology in 1999, the most advanced IC manufacturing was done with feature sizes significantly below the wavelength (248 nm).

The ability to print features significantly less than the wavelength of the exposure radiation can largely be attributed to improvements in the imaging resist materials. Modern resists exhibit very high imaging contrast and act as a thresholding function on the aerial image produced by the optical system. In other words, even though the light intensity image has less than full modulation for the small features, the combination of high-contrast imaging material and good process (exposure dose) control can reliably produce subwavelength features. In this way, the improvements in imaging resists have lowered the value for k1. It is interesting to note that while this is true for KrF lithography at 248 nm, it is not yet true for ArF exposures at 193 nm. That is, the resist materials are not yet developed to the point of producing superior images even though the wavelength is smaller. Currently, the best lithographic performance is seen at 248 nm. This also implies that unless resist materials for 193 nm or shorter wavelengths such as 157 nm (F2 excimer) can be developed to a performance point equal to or better than that for 248-nm materials, continued feature size shrinkage through wavelength reduction is not feasible.

Some compensation for the image degradation from diffraction is possible by predistorting the mask features. A simple example is a correction for corner rounding by using serifs. The addition of subresolution features does enhance the quality of the image on the wafer somewhat, but requires the addition of these correction features on the mask, increasing its complexity and cost. An example of using serifs to improve the printing fidelity of contact hole patterns is shown in Fig. 3. This approach is referred to as optical proximity effect correction (OPC) [2].

Image size reduction is an important factor in lithography. As stated above, the image of the mask is generally reduced by a factor of four or five when it is printed on the wafer. The main reason for this is due to the mask-making process. Masks are patterned by a scanned electron or laser beam primary pattern generator. The resolution and placement accuracy of the pattern generator are the basis for that of the optical printing system. Reduction imaging relaxes the requirements on the pattern generators. Thus, the specifications for wafer lithography are generally four or five times better than those of the pattern generators. In the regime where feature sizes are printed that are less than the exposure wavelength, the process is highly nonlinear. In terms of the mask, this introduces a complication referred to as mask

HARRIOTT: LIMITS OF LITHOGRAPHY 367

Practical considerations
• Lithography requires high resolution, high sensitivity, precise alignment and low defect density
• An advanced IC chip usually needs more than 30 patterning steps, each of which must align precisely with the previous one to successfully transfer the pattern of the chip design → lengthy process
• Photolithography takes about 40–50% of the total wafer-processing time
• In practice it takes six to eight weeks (!) to turn bare wafers into finished wafers (as of 2001)
• Solution: optimize & parallelize processing to increase the wafer throughput, but this increases the cost significantly!
✦ 100,000 wafers/month facility → 10 billion US$ (2012)

Fabs
• Wafers are processed in a manufacturing line in a sequential manner: one step after another
• Includes many machines and processing steps that require time (time means cost, so process optimization is a very important key word)
• Clean room – place where all the machines are installed (meaning a space free of vibration and particle contamination)
• Wafers have standardized sizes
✦ 200mm – old
✦ 300mm – common
✦ 450mm – new
• The bigger the wafer, the more ICs you produce in the same time (so less $/IC)
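To get a feeling for why bigger wafers reduce the cost per IC, the sketch below estimates the number of whole dies per wafer. It is a rough illustration only: it uses a commonly quoted approximation (wafer area divided by die area, minus an edge-loss term), and the 100 mm2 die area is a hypothetical value, not a figure from the course.

```python
import math


def gross_dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """Approximate count of whole dies on a round wafer.

    Uses the common estimate  pi*(d/2)^2 / A  -  pi*d / sqrt(2*A),
    where the second term accounts for partial dies lost at the wafer edge.
    Yield (defects) is ignored.
    """
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return max(0, int(wafer_area / die_area_mm2 - edge_loss))


DIE_AREA_MM2 = 100.0  # hypothetical die size, for illustration only

for diameter_mm in (200, 300, 450):
    n = gross_dies_per_wafer(diameter_mm, DIE_AREA_MM2)
    print(f"{diameter_mm} mm wafer: about {n} dies of {DIE_AREA_MM2:.0f} mm2")
```

Under this approximation the die count grows slightly faster than the wafer area, because the relative share of dies lost at the edge shrinks as the wafer gets bigger.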

Actors
• Integrated device manufacturer (IDM) – does everything: design + manufacturing of the die + packaging + sales
✦ Intel, , ST etc…
• Pure-play semiconductor foundry – operates semiconductor fabrication plants to produce ICs for other companies (numbers indicate the number of employees in 2011)
✦ Big 3: TSMC (30,000), UMC (12,000), Globalfoundries (13,000)
• IDMs will sometimes open the fab to other companies, but there is a lot of strategy involved here (conflict of interest etc.)
✦ E.g. Samsung opens up a fab for Apple (and have filed lawsuit …)

Lawsuit war!
• When put in a graph
• Lot of $ around ...

Example of a product ecosystem
• Final product developers design and assemble products from 3rd party components (e.g. Apple iPhone 5s)
• Fabless companies design ICs (RTL) by assembling proprietary IP blocks together with 3rd party blocks (e.g. Qualcomm Snapdragon IC uses ARM core, same goes for A7, see below) → They outsource IC manufacturing
• Fabs (pure play) manufacture ICs
• Packaging companies assemble dies into PCB-interfaceable components
• Integration

4. Circuit design

Design flow

• Is about model transformation from higher to lower abstraction views of the IC
✦ Synthesis
✦ Floorplanning
✦ Standard cell placement and route
• Inputs – high-level models:
✦ Design: RTL, constraints (SDC), soft/hard macros (accelerators, memories etc.)
✦ Technology: Standard Cell Library
• Outputs – design properties and refined models:
✦ Gate-level netlist
✦ Placed&Routed design
✦ Layout → associated performance metrics

[Figure: design flow. Inputs: Design (RTL, macro, SDC) and Tech (.LIB/.LEF). Steps: 1. RTL Synthesis, 2. Clustering, 3. Floorplanning, 4. StdCell Place&Route. Outputs: area, timing, power.]

Steps of the design flow

1. Synthesis: From Register Transfer Level (RTL) to gate-level netlist

✦ SystemC – multiple levels of abstraction, some synthesizable, not all

✦ Typical RTL languages : Verilog, VHDL

2. Clustering: gate-level netlist of a complex design= millions of gates, we do not want to solve Place&Route problem at that level → gates are “grouped” in clusters that can follow logical hierarchy, or we create a new physical hierarchy

3. Floorplanning: is about block placement (or placement of standard cell clusters) – this is how “big blocks” are organized between them

4. Place & Route: “optimal” cell position relative to the block placement is found

→ Placement is an NP-hard problem: it is difficult to find the best solution, so instead we look for good-enough solutions that satisfy the system constraints
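The idea of settling for a "good enough" placement can be sketched with a toy example. The code below is not what a real Place&Route tool does; it is a minimal hill-climbing illustration under stated assumptions: cells sit on a hypothetical 3×3 grid of slots, nets are lists of cell names, and the cost is the half-perimeter wirelength (a common wirelength estimate). All names are invented for the example.

```python
import random


def hpwl(placement, nets):
    """Half-perimeter wirelength: per net, width + height of its bounding box."""
    total = 0
    for net in nets:
        xs = [placement[cell][0] for cell in net]
        ys = [placement[cell][1] for cell in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def improve_by_swaps(placement, nets, iterations=2000, seed=0):
    """Randomly swap two cells; keep the swap only if wirelength does not grow."""
    rng = random.Random(seed)
    placement = dict(placement)
    cells = sorted(placement)
    cost = hpwl(placement, nets)
    for _ in range(iterations):
        a, b = rng.sample(cells, 2)
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = hpwl(placement, nets)
        if new_cost <= cost:
            cost = new_cost
        else:
            # Undo the swap: it made the placement worse.
            placement[a], placement[b] = placement[b], placement[a]
    return placement, cost


# Toy design: 9 gates connected in a chain, randomly dropped on a 3x3 grid.
cells = [f"g{i}" for i in range(9)]
slots = [(x, y) for y in range(3) for x in range(3)]
shuffled = slots[:]
random.Random(1).shuffle(shuffled)
initial = dict(zip(cells, shuffled))
nets = [[f"g{i}", f"g{i + 1}"] for i in range(8)]

initial_cost = hpwl(initial, nets)
final, final_cost = improve_by_swaps(initial, nets)
print(f"wirelength: {initial_cost} -> {final_cost}")
```

The result is a legal placement that is never worse than the starting point, but there is no guarantee that it is the optimum, which is exactly the trade-off described above.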

Constraints
• At synthesis level: typically timing/power (possibly area)
✦ No interconnect information yet, but there are predictive models for the routes that can be used to perform interconnect aware synthesis
• Timing constraint – trade-off area for speed: faster design = bigger area (Can you explain why?)
• Figure – area increase due to the timing objective increase for two different technologies: older and newer (Can you say which is which?)

[Figure: Area [um2] (0 to 20000) versus F [MHz] (0 to 3000) for the two technologies.]

Technology input
• Standard Cell Library – text file provided by technology vendor
• It is the list of cells (gates, pads, buffers etc.) used to build the design
• Different information is needed at different steps of the design
✦ Synthesis – needs
๏ Area, pin description, functional description, timing and power information (some electrical properties), typically no wiring information
๏ Synthesis can be area/power aware
๏ Characterized for different voltage, temperature and process variation
✦ Place&Route – needs geometry information
๏ Cells – ports position, layer structure
๏ Routing resources (BEOL) – metal stack definition, geometry
๏ Electrical

.LIB

    cell (AND2_X1) {
      drive_strength : 1;
      area : 1.064000;

      /* Pin definition */
      pg_pin(VDD) {
        voltage_name : VDD;
        pg_type : primary_power;
      }
      pg_pin(VSS) {
        voltage_name : VSS;
        pg_type : primary_ground;
      }

      /* Leakage */
      cell_leakage_power : 89.639611;
      leakage_power () {
        when : "!A1 & !A2";
        value : 72.833875;
      }
      leakage_power () {
        when : "!A1 & A2";
        value : 106.528250;
      }

      pin (A1) {
        direction : input;
        related_power_pin : "VDD";
        related_ground_pin : "VSS";
        capacitance : 0.920426;
        fall_capacitance : 0.919670;
        rise_capacitance : 0.920426;
      }

      pin (ZN) {
        direction : output;
        related_power_pin : "VDD";
        related_ground_pin : "VSS";
        max_capacitance : 60.577400;

        /* Function */
        function : "(A1 & A2)";
        ...

        /* Timing */
        timing () {
          related_pin : "A1";
          timing_sense : positive_unate;
          cell_fall(Timing_7_7) {
            index_1 ("0.000932129,0.00354597,0.0127211,0.0302424,0.0575396,0.0958408,0.146240");
            index_2 ("0.365616,1.893040,3.786090,7.572170,15.144300,30.288700,60.577400");
            values ("0.0125200,0.0148532,0.0172855,0.0215407,0.0292629,0.0441369,0.0737366", \
                    "0.0399680,0.0433655,0.0468571,0.0526654,0.0622006,0.0783897,0.108440");
          }
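The timing tables in the .LIB excerpt are lookup tables that a tool interpolates between the characterized points. The sketch below illustrates this on the first row of the cell_fall table only. It rests on assumptions that the excerpt does not confirm: the table template Timing_7_7 is not shown, so treating index_2 as the output load axis (and the first row of values as belonging to the first index_1 entry) is an assumption, and the units are not given. The function name is hypothetical.

```python
from bisect import bisect_right

# index_2 and the first row of "values", copied from the cell_fall table above.
# ASSUMPTION: index_2 is the output load axis; units are not stated in the excerpt.
INDEX_2 = [0.365616, 1.893040, 3.786090, 7.572170, 15.144300, 30.288700, 60.577400]
CELL_FALL_ROW_0 = [0.0125200, 0.0148532, 0.0172855, 0.0215407,
                   0.0292629, 0.0441369, 0.0737366]


def interpolate_row(xs, ys, x):
    """Piecewise-linear interpolation of ys over the sorted axis xs."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("value outside the characterized table range")
    hi = min(bisect_right(xs, x), len(xs) - 1)  # index of the upper neighbour
    lo = hi - 1
    t = (x - xs[lo]) / (xs[hi] - xs[lo])
    return ys[lo] + t * (ys[hi] - ys[lo])


for load in (0.365616, 2.839565, 10.0, 60.577400):
    delay = interpolate_row(INDEX_2, CELL_FALL_ROW_0, load)
    print(f"load = {load}: cell_fall = {delay:.7f}")
```

A full lookup would interpolate along both axes (index_1 and index_2); only one axis is used here because the excerpt shows only part of the table.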

.LEF

CELLs – geometry info

    MACRO AND2_X1
      CLASS core ;
      FOREIGN AND2_X1 0.0 0.0 ;
      ORIGIN 0 0 ;
      SYMMETRY X Y ;
      SITE FreePDK45_38x28_10R_NP_162NW_34O ;
      SIZE 0.76 BY 1.4 ;
      PIN A1
        DIRECTION INPUT ;
        ANTENNAPARTIALMETALAREA 0.021875 LAYER metal1 ;
        PORT
          LAYER metal1 ;
            POLYGON 0.06 0.525 0.185 0.525 0.185 0.7 0.06 0.7 ;
        END
      END A1
      PIN ZN
        DIRECTION OUTPUT ;
        ANTENNADIFFAREA 0.109725 ;
        PORT
          LAYER metal1 ;
            POLYGON 0.61 0.19 0.7 0.19 0.7 1.25 0.61 1.25 ;
        END
      END ZN
      PIN VDD
        DIRECTION INOUT ;
        USE power ;
        SHAPE ABUTMENT ;
        PORT
          LAYER metal1 ;
            POLYGON 0 1.3 ;
        END
      END VDD
      OBS
        LAYER metal1 ;
          POLYGON 0.235 0.84 0.47 0.84 0.47 0.46 0.045 0.46 0.045 0.19 ;
      END
    END AND2_X1

BEOL – metal layers + via

    LAYER metal1
      TYPE ROUTING ;
      SPACING 0.065 ;
      WIDTH 0.07 ;
      PITCH 0.14 ;
      DIRECTION HORIZONTAL ;
      OFFSET 0.095 0.07 ;
      RESISTANCE RPERSQ 0.38 ;
      THICKNESS 0.13 ;
      HEIGHT 0.37 ;
      CAPACITANCE CPERSQDIST 7.7161e-05 ;
      EDGECAPACITANCE 2.7365e-05 ;
    END metal1

    LAYER via1
      TYPE CUT ;
      SPACING 0.08 ;
      WIDTH 0.07 ;
      RESISTANCE 5 ;
    END via1

Floorplanning problem
• Each red/yellow block is a hierarchical block of the design (here OpenSPARC core)
• The big rectangle indicates the area required for block-placement (floorplanning)
• This area is the sum of std cell area obtained after synthesis divided by a ratio called utilization (typically 70-90%)
• The floorplanning should find the exact position of the blocks within the allocated area
• Multiple and conflicting objectives, but in general you want speed for the least possible area & power (influenced by constraints) → Complex problem !!!
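The relation between standard cell area, utilization and floorplan area can be sketched as follows. This assumes the usual reading of utilization as the fraction of the floorplan area actually occupied by standard cells, so the floorplan area is the cell area divided by the utilization. The 18000 um2 cell area is a hypothetical value and the function name is invented for the example.

```python
import math


def floorplan_area_um2(std_cell_area_um2: float, utilization: float) -> float:
    """Floorplan area needed so that cells occupy `utilization` of it."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return std_cell_area_um2 / utilization


STD_CELL_AREA_UM2 = 18000.0  # hypothetical post-synthesis cell area

for utilization in (0.7, 0.8, 0.9):
    area = floorplan_area_um2(STD_CELL_AREA_UM2, utilization)
    side = math.sqrt(area)  # side length if the floorplan is square
    print(f"utilization {utilization:.0%}: {area:.0f} um2, square side {side:.1f} um")
```

A lower utilization leaves more free space for routing and buffers at the price of a bigger die.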

Floorplanned, cell placed & routed

5. Layout generation

Overview
• Layout – final design step before IC manufacturing
• Layout is used to produce photolithography masks
• Once the masks are ready they can be used in the production line to manufacture ICs (sent out to pure-play)
• Layout completion due date (and hour!) is called tapeout (history: in the past the IC layout was copied to magnetic tape to be sent to the fabrication facility)
• Tapeout is one of the most important milestones in any chip design, and generally generates loads of stress for engineers
• After tapeout, the IC die is manufactured, followed by packaging, testing and debugging

From layout to masks
• Layout
✦ is a drawing with accurate physical dimensions and positions of all the layers of material that will need to be manufactured to produce an IC
✦ during layout generation we convert the circuit schematic to a file used to create the photolithography masks (GDSII)
• The area of the entire layout will determine the die area, which makes layout a critical step in determining the cost of manufacturing a given product
• In principle it is generated by an EDA tool, but it does require manual intervention to steer the tools in the right direction
• Masks – required to pattern each layer during the manufacturing process
• Layout specialists and mask designers

Design rules
• A wire drawn too thin may be manufactured with breaks in the wire, preventing a good electrical connection
• A pair of wires drawn too close together may touch where they are not supposed to, creating an electrical short
• To prevent these problems, process engineers create layout design rules that restrict the minimum widths and spaces of all the layers
• Additional rules set minimum overlaps or spaces between different layers
• The goal is to create a set of rules such that any layout that meets these guidelines can be manufactured with high yield

Design rules
• Design rules are written in terms of a generic feature size λ
• A 90-nm generation process might use a λ value of 45nm, which would make the minimum poly width and therefore minimum transistor length 2λ = 90nm
• Two different manufacturers can both call their process “90-nm”, but use different values for λ and perhaps very different design rules ...
• Some manufacturers focus more on reducing transistor size and others more on wire size
• One process might use a short list of relatively simple rules but a large value of λ. Another process might allow a smaller λ but at the cost of a longer, more complicated set of design rules


in estimating the die area required by each portion of the design, which provides an estimate of what the total die area will ultimately be. Failure to meet these density targets can lead to products that are more costly or have less performance than expected. The number and size of transistors needed as well as the limits of the manufacturing process determine the layout area required by a circuit.

A wire drawn too thin may be manufactured with breaks in the wire, preventing a good electrical connection. A pair of wires drawn too close together may touch where they are not supposed to, creating an electrical short. To prevent these problems, process engineers create layout design rules that restrict the minimum widths and spaces of all the layers. Additional rules set minimum overlaps or spaces between different layers. The goal is to create a set of rules such that any layout that meets these guidelines can be manufactured with high yield.

Table 8-1 shows an example set of design rules. These example rules are written in terms of a generic feature size λ. A 90-nm generation process might use a λ value of 45 nm, which would make the minimum poly width and therefore minimum transistor length 2λ = 90 nm. Of course, what two different manufacturers call a “90-nm” process might use different values for λ and perhaps very different design rules. Some manufacturers focus more on reducing transistor size and others more on wire size. One process might use a short list of relatively simple rules but a large value of λ. Another process might allow a smaller λ but at the cost of a longer more complicated set of design rules. The example rules from Table 8-1 are applied to the layout of an inverter in Fig. 8-5.

Design rules – example

TABLE 8-1 Example Layout Design Rules

Rule label | Dimension | Description
W1 | 4λ | Minimum well width
W2 | 2λ | Minimum well spacing
D1 | 4λ | Minimum diffusion width
D2 | 2λ | N+ to P+ spacing (same voltage)
D3 | 6λ | N+ to P+ spacing (different voltage)
D4 | 3λ | Diffusion to well boundary
P1 | 2λ | Minimum poly width
P2 | 2λ | Minimum poly space
P3 | 1λ | Minimum poly diffusion overlap
C1 | 2λ × 2λ | Minimum contact area
C2 | 2λ | Minimum contact spacing
C3 | 2λ | Minimum contact to gate poly space
C4 | 1λ | Poly or diffusion overlap of contact
M1 | 3λ | Minimum M1 width
M2 | 3λ | Minimum M1 space
M3 | 1λ | M1 and contact minimum overlap

• Simple example
• In practice design rule sets are much, much bigger
• Once generated, the layout must be checked to make sure that all the rules are effectively respected = Design Rule Checking or DRC (model checking → establishing the correctness of the mask)
• This is automated, but needs manual intervention to correct things …
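Because the rules of Table 8-1 are expressed in multiples of λ, converting them to physical dimensions for a given process is a simple multiplication. The sketch below does this for the one-dimensional rules of the table (C1 is an area rule, 2λ × 2λ, and is left out); λ = 45 nm is the example value quoted for a 90-nm process, and the dictionary and function names are invented for the illustration.

```python
# One-dimensional rules of Table 8-1, in multiples of lambda.
RULES_IN_LAMBDA = {
    "W1": 4, "W2": 2,
    "D1": 4, "D2": 2, "D3": 6, "D4": 3,
    "P1": 2, "P2": 2, "P3": 1,
    "C2": 2, "C3": 2, "C4": 1,
    "M1": 3, "M2": 3, "M3": 1,
}


def rules_in_nm(lambda_nm: float) -> dict:
    """Scale every lambda-based rule to nanometres for a given lambda."""
    return {label: factor * lambda_nm for label, factor in RULES_IN_LAMBDA.items()}


LAMBDA_NM = 45  # example value for a "90-nm" process, as quoted in the text

for label, value_nm in rules_in_nm(LAMBDA_NM).items():
    print(f"{label}: {value_nm} nm")
```

With λ = 45 nm the minimum poly width P1 comes out at 90 nm, matching the 2λ = 90 nm statement above.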

Design Rule Checking
• Basic rules
✦ Width rule – minimum width of any shape in the design
✦ Spacing rule – minimum distance between two adjacent objects

๏ Defined for each layer – lowest layers having the smallest rules (typically 100 nm as of 2007) and the highest metal layers having larger rules (perhaps 400 nm as of 2007) ✦ A two layer rule specifies a relationship that must exist between two layers : e.g. enclosure rule might specify that an object of one type, such as a contact or via, must be covered, with some additional margin, by a metal layer

๏ A typical value as of 2007 might be about 10 nm
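The width and spacing rules can be illustrated with a very small checker. This is a toy sketch, not how a production DRC tool works: shapes are axis-aligned rectangles on a single layer, touching or overlapping rectangles are assumed to belong to the same shape and are not reported, and spacing is measured as the Euclidean gap between rectangles. The 100 nm limits reuse the "typical as of 2007" figure quoted above; all names are hypothetical.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Rect:
    name: str
    x1: float
    y1: float
    x2: float
    y2: float


def width_violations(rects, min_width):
    """Names of rectangles whose smaller side is below the minimum width."""
    return [r.name for r in rects if min(r.x2 - r.x1, r.y2 - r.y1) < min_width]


def gap(a, b):
    """Euclidean distance between two rectangles (0 if they touch or overlap)."""
    dx = max(0.0, max(a.x1, b.x1) - min(a.x2, b.x2))
    dy = max(0.0, max(a.y1, b.y1) - min(a.y2, b.y2))
    return math.hypot(dx, dy)


def spacing_violations(rects, min_space):
    """Pairs of separate rectangles that are closer than the minimum spacing."""
    return [(a.name, b.name) for a, b in combinations(rects, 2)
            if 0 < gap(a, b) < min_space]


MIN_WIDTH_NM = 100.0   # value quoted above for the lowest layers (2007)
MIN_SPACE_NM = 100.0   # assumed equal to the width rule for this example

layout = [
    Rect("a", 0, 0, 1000, 100),
    Rect("b", 0, 160, 1000, 260),   # only 60 nm away from "a"
    Rect("c", 0, 400, 1000, 480),   # only 80 nm wide
]

print("width violations:", width_violations(layout, MIN_WIDTH_NM))
print("spacing violations:", spacing_violations(layout, MIN_SPACE_NM))
```

Comparing every pair of shapes is fine for three rectangles; real layouts contain far too many shapes for that and need spatial data structures.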

Design rules

• Picture shows widths and spaces limited by design rules
• Contains all rules and dimensions required to derive height and width of this piece of layout
• The height of 44λ is determined by the needed transistor widths and the width & space of a Metal 1 connecting wire
• The width of 24λ is determined by the size of the well taps, the poly gate, and contacts to the source and drain
• Even small changes in design rules can have a very large impact on what layout density is achieved and how layout is most efficiently drawn

[Figure 8-5 Design rules. Inverter layout annotated with rule labels. Height 44λ: D4 3λ, Wp 16λ, M2 3λ, C4 1λ, C1 2λ, C4 1λ, M2 3λ, Wn 12λ, D4 3λ. Width 24λ: D4 3λ, D1 4λ, D2 2λ, C4 1λ, C1 2λ, C3 2λ, P1 2λ, C3 2λ, C1 2λ, C4 1λ, D4 3λ.]

The left picture in Fig. 8-5 shows some of the widths and spaces that are limited by the design rules. The right picture shows the design rules and dimensions that ultimately determine the required height and width of this piece of layout. The height of 44λ is determined by the needed transistor widths and the width and space of a metal 1 connecting wire. The width of 24λ is determined by the size of the well taps, the poly gate, and contacts to the source and drain. Even small changes in design rules can have a very large impact on what layout density is achieved and how layout is most efficiently drawn.

In the example shown in Fig. 8-5, the height of the layout is determined in part by the width of the transistors. To create an inverter with wider transistors, the layout could simply be drawn taller and still fit in the same width. However, there is a limit to how long a poly gate should be drawn. Polysilicon wires have much higher resistance than metal wires. A wide transistor made from a single very long poly gate will switch slowly because of the resistance of the poly line. To avoid this, mask designers commonly limit the length of poly gates by drawing wide transistors as multiple smaller transistors connected in parallel. This technique is called legging a transistor. An example is shown in Fig. 8-6.

6. Scaling principles and limitations

design point are listed in Table 1.

tech node | # of cores | frequency | TDP | die size | average power density | LLC per core
45nm | 8 | 2.26GHz | 130W | 684mm2 | 0.19W/mm2 | 3MB

Table 1. Current-generation reference design point

Our scaling methodology projects the following from 45nm to 10nm technology nodes for this design point.

1. Power and power density. This includes active switching power (density) and leakage power (density) of cores, cache power with focus on last-level cache, and power for on-chip network. We assume power for other components such as clock spine and I/O to be relatively small, fixed percentage of total power (approximately 10% each [19, 20]) and to scale in the same fashion as core power.

2. Area. This includes area projection for cores, last-level cache as well as hot spot size within a core. Total estimated chip size includes the area for all the cores in the configuration and area for the last-level cache. Another factor that can potentially affect the chip size is the difficulty for fabrication and packaging technologies of scaling down I/O and power-delivery bump (C4) sizes and their pitches. It is not clear whether the need for sufficient C4 bumps will dictate the die area for future big chips. The advent of 3D integration (especially stacking of memory) and on-chip voltage regulators could potentially reduce the number of I/O and power/ground bumps. Qualitatively, the area constraint from C4 does not change the power scaling trends. But in a C4-constrained scenario, surplus die area could be used to space out hot units, reducing chip power density, which relaxes the constraint on hot spot temperature. We will leave a detailed study on this issue as a future work.

3.1 Technology and frequency scaling

For technology scaling, we adopt two representative sets of scaling parameters that are publicly available. One is from ITRS [21], with a feature size scaling factor of 0.7X, which leads to an area scaling factor of 0.5X. Combined with this area scaling assumption, we assume that chip frequency is constant to match the trend observed in the last couple of generations of high-end server processors from different vendors. Key scaling parameters from one technology node to the next are listed in Table 2. As can be seen, the ITRS scaling has almost perfect power and power density scaling (half power per generation and constant power density scaling), representing an ideal scaling trend. The same scaling factors are assumed from each technology node to the next.

Node to node scaling: gains
• From node to node λ goes down → impact on cell size and performance (better area, delay, power)
• This can be predicted, since all these parameters are f(λ)
• International organization (ITRS) predicts scaling factors from node to node for key perf. parameters, here applied to CPUs:

feature size | area | capacitance (C) | frequency (f) | Vdd | power (C·Vdd²·f) | power density
0.7X | 0.5X | 0.616X | 1.0X | 0.925X | 0.527X | 1.054X

Table 2. Cross-generation scaling factors for our ITRS scaling model. Constant frequency is assumed. Same set of factors used for every generation.

• λ goes down by 0.7 (linear dimension), meaning that area goes down by a factor of 0.7 × 0.7 ≈ 0.5
• Note that f is constant! Can you explain?

The other set of scaling parameters based on a recent industry projection from Intel [5] is closer to practical high-end server chip trends for area and voltage scaling, as is verified by recent Intel processor die photos and power and frequency specifications. We also observe qualitatively similar scaling trends in IBM CMOS technologies. Key scaling factors are listed in Table 3 for this Industry scaling model. Distinct scaling factors are used for each generation in line with the published expectation. This includes a gradually diminishing frequency increase (instead of no increase as with our ITRS model). Because of the more conservative area and voltage scaling assumptions and higher frequency target assumptions for our Industry model versus our ITRS model, it would have a higher power and power density for same performance/area every generation. The geometric means of power and power density scaling factors for our Industry model are 0.652X and 1.141X in contrast to the 0.527X and 1.054X for our ITRS model.

Current industry scaling models

tech node | feature size | area | capacitance (C) | freq (f) | Vdd | power (C·Vdd²·f) | power density
45→32nm | 0.755X | 0.57X | 0.665X | 1.10X | 0.925X | 0.626X | 1.096X
32→22nm | 0.755X | 0.57X | 0.665X | 1.08X | 0.95X | 0.648X | 1.135X
22→14nm | 0.755X | 0.57X | 0.665X | 1.05X | 0.975X | 0.664X | 1.162X
14→10nm | 0.755X | 0.57X | 0.665X | 1.04X | 0.985X | 0.671X | 1.175X

Table 3. Cross-generation scaling factors for our Industry scaling model, adapted from [5]

• Area continues to scale, as well as capacitance
• Frequency gains are lower
• Power supply voltage scaling slows down
• Less power gains
• Increased power density (impact on cooling)

Scaling is hitting the wall!

3.2 Cores

We assume a homogeneous design with identical cores for each generation. We also assume no change to the core architecture for this work - our analysis will show the need for architecture changes based on the gap between desired and estimated power, performance and area. For the growth in number of cores across technology generations, we consider three cases: 1) double cores every generation, i.e. 8, 16, 32, 64, 128 cores from 45nm to 10nm. 2) less aggressive core scaling, i.e. 8, 12, 16, 24, 32 cores from 45nm to 10nm. In the second case, as we try to make the number of cores an even number, the scaling ratio between every two generations is 1.5X or 1.33X, with a geometric mean of 1.4. 3) For some scenarios, it is possible to scale number of cores in between the first two cases, e.g., to meet a particular TDP, a 2X scaling factor results in too high a power and a 1.4X factor leaves TDP under-utilized. We label the three core scaling cases as 2X, 1.4X, 1.4+X, respectively.
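The numbers in Tables 2 and 3 can be cross-checked with a few lines of arithmetic, using power ∝ C·Vdd²·f and power density = power / area. The sketch below recomputes the power factors from the tabulated C, Vdd and f, and then the geometric means quoted in the text. Variable and function names are invented for the example; small differences in the last digit against the published power density figures are presumably due to rounding in the tables.

```python
import math


def power_factor(c: float, vdd: float, f: float) -> float:
    """Cross-generation power scaling factor, power ~ C * Vdd^2 * f."""
    return c * vdd ** 2 * f


def geomean(values) -> float:
    values = list(values)
    return math.prod(values) ** (1.0 / len(values))


# Table 2 (ITRS model): area 0.5X, C 0.616X, f 1.0X, Vdd 0.925X.
itrs_power = power_factor(c=0.616, vdd=0.925, f=1.0)
itrs_density = itrs_power / 0.5
print(f"ITRS: power {itrs_power:.3f}X, power density {itrs_density:.3f}X")

# Table 3 (Industry model): (transition, area, C, f, Vdd).
INDUSTRY = [
    ("45->32nm", 0.57, 0.665, 1.10, 0.925),
    ("32->22nm", 0.57, 0.665, 1.08, 0.95),
    ("22->14nm", 0.57, 0.665, 1.05, 0.975),
    ("14->10nm", 0.57, 0.665, 1.04, 0.985),
]

industry_power = [power_factor(c, vdd, f) for _, _, c, f, vdd in INDUSTRY]
industry_density = [p / row[1] for p, row in zip(industry_power, INDUSTRY)]

for (name, *_), p, d in zip(INDUSTRY, industry_power, industry_density):
    print(f"{name}: power {p:.3f}X, power density {d:.3f}X")

print(f"Industry geometric means: power {geomean(industry_power):.3f}X, "
      f"density {geomean(industry_density):.3f}X")

# Core-count case 2 from Section 3.2: 8, 12, 16, 24, 32 cores.
core_ratios = [12 / 8, 16 / 12, 24 / 16, 32 / 24]
print(f"core scaling geometric mean: {geomean(core_ratios):.2f}")
```

The recomputed values reproduce the "half power, constant power density" behaviour of the ITRS model and the slowly worsening power density of the Industry model.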

3.3 Last-level cache scaling

We assume a SRAM-based LLC as is used by most processor vendors, with the exception of high-density embedded DRAM in IBM processors [2]. SRAM cell area scaling is close to that of logic scaling. On the other hand, its supply voltage scaling is usually much slower than that of logic circuits [22]. This is required for reliable storage of bits in the presence of variations and cosmic radiation. In this work, we consider three LLC supply voltage scaling options: 1) SV1: aggressive scaling similar to logic (0.925X each generation), and 2) SV2: a slower more representative case, specifically, 0.95X, and 3) CV: constant SRAM supply voltage, pessimistic for now, but likely the norm after a couple of more generations.

3.4 On-chip interconnect power scaling

A recent effort on power modeling for on-chip networks, ORION [23], shows that power per hop remains relatively constant across technology generations. For a scalable network topology, the total on-chip network power is proportional to the number of cores. We use the Intel 80-core processor [24] as the reference point for per-core on-chip network power.

3.5 Leakage power scaling

In recent years, there have been significant efforts, e.g. multiple threshold voltages and body bias adjustment, to keep leakage power from dominating active power. As a result, leakage power has managed to be confined as a relatively constant portion of the total chip power. Constant leakage current per transistor width has been projected by ITRS. Intel also projects 1X to 1.43X scaling factor for chip leakage [5] power density, giving a

4 Deeper look into scaling issues

To demonstrate scaling limitations of the current manufacturing approach we will focus only on limitations due to:
A. Interconnect
๏ Delay
๏ Power
B. Cost and the link to the manufacturing efficiency
C. Lithography issues

1 INTRODUCTION

Implementing nanometer-scale ICs begins and ends with wires. Wires are so dominant that little is known about a design’s performance or manufacturability without them. In fact, nanometer design strategies that are not clearly focused on rapid wire creation, optimization, and analysis are destined to fail. This paper describes the requirements for an effective, reliable IC implementation platform for the 90 nm process node and beyond. It begins with a description of the central role wires play in nanometer design and why traditional linear design flows are insufficient. It then describes a new continuous convergence methodology, which has proven highly valuable at 0.13 micron and will be absolutely necessary at 90 nm. Next, the paper describes the key implementation, analysis, and database technologies needed to enable this methodology. Implementing nanometer designs requires nanometer routers that optimize wire creation for both performance and manufacturability. Verifying nanometer designs requires nanometer analysis tools that accurately model physical effects as they would occur in the target silicon. Efficiently representing these designs (most of which will be large digital designs with critical analog circuitry) requires unified nanometer databases with massive capacity and efficient extensibility. Wires must be the centerpiece of any nanometer methodology. Without such a methodology, design teams will not be able to create massively complex nanometer ICs in a timeframe of relevance.

2 WIRING DOMINATES NANOMETER DESIGN

In nanometer design, wiring delay accounts for the vast majority of overall delay. It is well known that delay has been shifting from gates to wires for quite some time. As shown in Figure 1, wiring delay exceeds gate delay at 0.18 micron and below in aluminum processes, and at 0.13 micron and below in copper. By 90 nm, wiring delay will account for some 75% of the overall delay. As a result, design teams need to shift their focus from logic optimization to wire optimization.

Scaling: gate/interconnect delay gap

[Figure 1: Wire and gate delay in Al and Cu. Delay (ps, 0 to 35) versus feature size generation (0.65, 0.5, 0.35, 0.25, 0.18, 0.13, 0.1 micron); curves: gate delay, interconnect delay and total delay for Al/SiO2 and for Cu/low k]

• After 180nm (1999) inversion of the tendency: interconnect becomes dominant → globally, delay is increasing !

2.1 THE CHANGING NATURE OF DELAY

In addition to dominating overall delay, nanometer design exacerbates physical effects that introduce substantial delay, notably signal integrity (SI) and IR (voltage) drop. These effects can be considerable even at 0.18 micron. By 0.13 micron, “sign-off” timing analysis tools miss numerous SI- and IR drop-based degradations that are comparable in magnitude to the nominal timing and much more difficult to predict. Yet, many design teams continue to use delay calculations based on over-simplified models (e.g., lumped capacitance) down to 0.13 micron. Doing so results in both reduced performance (due to high margins) and excessive, time-consuming design iterations. At 90 nm, timing analysis that does not include SI and IR drop effects is essentially meaningless.

2.1.1 Cross coupling Delay is a function of wire loading and wire drive. At 0.25 micron and above, the primary wire capacitance is due to coupling to electrical ground and is largely proportionate to wire length; doubling the wire length doubles the capacitance. Steiner, or global, routing estimates predict the wire length based on placement.

Interconnect delay

Technology:
r – wire resistance / unit length
c – wire capacitance / unit length

Delay of a wire of length L:

$d_w = \frac{1}{2} r c L^2$

For big L, this is a (design) problem !

Solution: repeater insertion (L split into L/2 + L/2):

$d_w = d_{INV} + 2 \times \frac{1}{2} r c \left(\frac{L}{2}\right)^2$

For very long wires, we keep inserting repeaters.

This is done during design timing optimization, during synthesis and Place&Route
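The effect of repeater insertion can be sketched numerically. The code below assumes the distributed-RC form d_w = 1/2 r c L^2 for a wire segment and generalizes the two-segment case to k equal segments separated by k-1 repeaters. The values of r, c and d_INV are hypothetical and chosen only for illustration; they are not taken from any technology file.

```python
def unrepeated_delay(r, c, length):
    """Distributed RC wire delay: d_w = 1/2 * r * c * L^2."""
    return 0.5 * r * c * length ** 2

def repeated_delay(r, c, length, d_inv, segments):
    """Wire cut into equal segments, with one repeater between two segments."""
    per_segment = unrepeated_delay(r, c, length / segments)
    return (segments - 1) * d_inv + segments * per_segment

def best_segment_count(r, c, length, d_inv, max_segments=100):
    """Number of segments that minimizes the total delay (brute force)."""
    return min(range(1, max_segments + 1),
               key=lambda k: repeated_delay(r, c, length, d_inv, k))

# Hypothetical technology values (assumptions, for illustration only).
R_PER_UM = 1.0        # ohm / um
C_PER_UM = 0.2e-15    # F / um
D_INV = 20e-12        # s, delay of one repeater

for length_um in (1000, 2000, 4000):
    k = best_segment_count(R_PER_UM, C_PER_UM, length_um, D_INV)
    print(length_um,
          unrepeated_delay(R_PER_UM, C_PER_UM, length_um),
          repeated_delay(R_PER_UM, C_PER_UM, length_um, D_INV, k),
          k)
```

With these numbers the unrepeated delay grows with L squared, while the best repeated delay grows roughly linearly with L, at the price of more and more repeaters.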

Repeater insertion pitfalls

• First :
✦ Extra area (read cost)
✦ More power
• But also (and even worse) :
✦ Loads of via cuts from upper metal layers (since it is semi-global and global wiring that we are optimizing) down to substrate
✦ Use of many routing resources, scarce for advanced technologies → increased routing congestion
• To avoid congestion further area increase (& cost)

[Figure: cross-section of the metal stack: Bulk, Active (T1), M1, Via12, M2, ..., Mtop]

How many repeaters will be inserted is design/target perf. dependent, and is directly linked to wirelength distribution

Wirelength distribution of a CPU
• Lots of wires are very long … (Can you explain the presence of 10cm wire?)

[Figure: Interconnect Length Distribution. Number of nets (0.001 to 10000) versus net length [um] (1 to 100000), log-log scale; series: Pentium® 0.5 [um], Pentium® MMX 0.35 [um], Pentium® Pro 0.5 [um], Pentium® II 0.35 [um], Pentium® II 0.25 [um], Pentium® III 0.18 [um], Low Power Processor 0.13 [um]. Source: Shekhar Y. Borkar, CRL - Intel]

Wirelength and BEOL
• Local vs. global interconnect: significant number of global wires
• Log – Log scale
• Exponential decrease with length
• Global clock – not included

[Figure: Interconnect Length Distribution, Nets vs. Net Length. Number of nets (0.001 to 1000) versus length [um] (1 to 100000); curves: Local, Global, Total]

Interconnect delay – RC model
• Unit delay: from repeater to repeater
• Two inverters connected with a wire of given length L
• We can build a simple RC delay model for a repeated segment
• Total delay the sum of all delay segments
• Will depend on the gate properties Rout and Cin

Fig. 11. 2D wire RC model
Fig. 12. 3D wire RC model
Fig. 13. 9 core heterogeneous MPSoC used for experiments
Fig. 14. Wirelength distribution

F2B. In case of F2F, $D_{Cu-Cu}$ is the delay due to the inter-die Cu-Cu bonding.

Figure 11 shows the RC model of a 2D interconnection used to derive $D_{L0}$ and $D_{L1}$ in Eqn. 1 and Eqn. 2. The total delay in a wire of length L is calculated as shown in Eqn. 3.

$D_L = \frac{L}{l} \cdot (D_l + D_{rep})$ (3)

where, $D_L$ is the total delay of a wire of length L, $D_l$ is the delay of wire of length l between two repeaters and $D_{rep}$ is delay of the repeater. $D_l$ is calculated in Eqn. 4.

$D_l = R_{out} \frac{C_w}{2} + (R_{out} + R_w) \left( \frac{C_w}{2} + C_{in} \right)$ (4)

where, $R_{out}$ is the output resistance of a repeater and $R_w$ is the resistance of the wire of length l driven by this repeater. $C_w$ is the lumped capacitance of the wire and $C_{in}$ is the input capacitance of the next repeater.

Figure 12 shows the RC model of the 3D interconnection in case of F2B, and the delay is calculated as shown in Eqn. 5.

$D_{3dF2B} = R_o C_1 + (R_o + R_t) C_2 + (R_o + R_t + R_w) C_3 + (R_o + R_t + R_w + R_u) C_4$ (5)

where, $D_{3dF2B} = D_{TSV} + D_{RDL} + D_{\mu bump}$.

The delay of Cu-Cu pad ($D_{Cu-Cu}$) is calculated in a manner similar to $D_l$, as shown in Eqn. 6

$D_{Cu-Cu} = R_{out} \frac{C_{Cu}}{2} + (R_{out} + R_{Cu}) \left( \frac{C_{Cu}}{2} + C_{in} \right)$ (6)

where, $R_{out}$ is the output resistance of the driving gate. $R_{Cu}$ and $C_{Cu}$ are the resistance and the lumped capacitance, respectively, of the Cu pads and $C_{in}$ is the input capacitance of the driven gate.

Energy Model: to be written how total interconnect energy can be modeled based on application characteristics

VII. RESULTS AND EXPERIMENT

A. Experimental Setup

Figure 13 shows one of the design points, obtained using the flow shown in Fig. 7, which we have used to carry out the 3D partitioning exploration and the interconnection analysis. It consists of 9 cores with three different micro-architecture, as indicated by the colors. Each core has 3 memory instances (instruction memory, vector and scalar data memory). The width of memory interface varies across the cores. We have adjusted the total memory size per core such that the total area of the datapath and the memories are similar.

The architecture is synthesized using commercial 28nm technology for 3 different cases - (a) 2D, (b) 3D face-to-back, and (c) 3D face-to-face. We are carrying out 2 layer memory-on-logic 3D partitioning. This design results in about 5k inter-layer interconnections. We have considered pitch dimensions of 10 µm and 6 µm for µbump and TSV, respectively in case of F2F. In case of F2B, we consider Cu-pad pitch of 5 µm.

B. Results

In our experiments, the datapath and the memory configurations remain same across the 2D and 3D designs. Hence, we will discuss here the impact of 3D partitioning on interconnections only. In our future work, we will use this impact as a feedback to optimize the datapath and memory configurations of the 3D designs.

Wirelength Distribution: Figure 14 shows distribution of different wirelengths obtained in the 3 designs. In case of 2D, almost 95% of the wires are shorter than 100 µm. However, the remaining 5% of the wires contribute to more than 80%

Delay as f(L) for 45nm
• Unrepeated vs. repeated delay (which is which?)

[Figure: Delay [ps] (0 to 5000) versus wire length [um] (100 to 3900); two curves, Series1 and Series2]
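A plot of this kind can be reproduced from the repeated-segment RC model of Eqn. 3 and Eqn. 4. The sketch below is not the data behind the slide: all electrical values (wire r and c per µm, repeater output resistance, input capacitance, repeater delay, segment length) are hypothetical placeholders, and the unrepeated wire is treated as a single segment without repeater delay.

```python
def segment_delay(r_out, r_w, c_w, c_in):
    """Eqn. 4: D_l = R_out*C_w/2 + (R_out + R_w)*(C_w/2 + C_in)."""
    return r_out * c_w / 2 + (r_out + r_w) * (c_w / 2 + c_in)

def total_delay(length, seg_len, r, c, r_out, c_in, d_rep):
    """Eqn. 3: D_L = (L / l) * (D_l + D_rep), with R_w = r*l and C_w = c*l."""
    d_l = segment_delay(r_out, r * seg_len, c * seg_len, c_in)
    return (length / seg_len) * (d_l + d_rep)

# Hypothetical values (assumptions, not 45nm data).
R_PER_UM = 1.0       # ohm / um
C_PER_UM = 0.2e-15   # F / um
R_OUT = 1000.0       # ohm, repeater output resistance
C_IN = 1e-15         # F, repeater input capacitance
D_REP = 10e-12       # s, intrinsic repeater delay
SEG_LEN = 100.0      # um between two repeaters

def unrepeated(length):
    """One driver, one wire of full length, no repeater delay."""
    return total_delay(length, length, R_PER_UM, C_PER_UM, R_OUT, C_IN, 0.0)

def repeated(length):
    """Repeaters every SEG_LEN micrometres."""
    return total_delay(length, SEG_LEN, R_PER_UM, C_PER_UM, R_OUT, C_IN, D_REP)

# Same wire lengths as the x-axis of the plot: 100, 300, ..., 3900 um.
for length in range(100, 4000, 200):
    print(length,
          round(unrepeated(length) * 1e12, 1),
          round(repeated(length) * 1e12, 1))
```

In this model the repeated delay is proportional to L, while the unrepeated delay contains a term in L squared, which is the difference to look for when comparing the two curves.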

Interconnect power – CPU example
• Interconnect consumes 50% of total dynamic power of the IC
• This power dissipation is due to
✦ parasitic capacitance of wires
✦ and repeaters (they can not be gated)!
• 90% of power consumed by 10% of nets
• Clock power ~40% of the interconnect power
• Interconnect design is NOT power-aware (at this level it is difficult to do anything)

[Figure: Total Dynamic Power Breakdown, global clock included: Interconnect 51%, Gate 34%, Diffusion 15%]

[Figure: Power Breakdown by Net Types, global clock included. Two pie charts, interconnect power (interconnect only) and total power (gate, diffusion and interconnect), each split into global signals, local signals, global clock and local clock]

determine the cost of an individual chip. The cost of processing a wafer does not vary much with the number of die, so the smaller the die, the lesser the cost per chip. The total number of die per wafer are estimated as:6

$\text{Die per wafer} = \frac{\pi (\text{wafer diameter}/2)^2}{\text{die area}} - \frac{\pi \times \text{wafer diameter}}{\sqrt{2 \times \text{die area}}}$

The first term just divides the area of the wafer by the area of a single die. The second term approximates the loss of rectangular die that do not entirely fit on the edge of the round wafer. The 2003 International Technology Roadmap for Semiconductors (ITRS) suggests a target die size of 140 mm2 for a mainstream microprocessor and 310 mm2 for a server product. On 200-mm wafers, the equation above predicts the mainstream die would give 186 die per wafer whereas the server die size would allow for only 76 die per wafer. The 310-mm2 die on 200-mm wafer is shown in Fig. 3-7.

[Figure 3-7: 310-mm2 die on 200-mm wafer, die numbered 1 to 76]

6 Hennessy et al., Computer Architecture, 19.

Manufacturing issues & cost
• Wafers are cylindrical, and IC’s rectangular, there is an area loss that is function of the wafer & die size (die per wafer formula above):
✦ 1st term – wafer/single die area
✦ 2nd term – area loss of rectangular die that do not entirely fit the wafer
• In 2003 ITRS suggests a target die size of 140 mm2 for a mainstream microprocessor and 310 mm2 for a server
• On 200-mm wafers: 186/76 dies per wafer depending on applications (desktop/server)
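The die-per-wafer estimate is easy to check numerically. The sketch below uses the square-root form of the edge-loss term, pi × wafer diameter / sqrt(2 × die area), as given by Hennessy et al.; this is the form that reproduces the 186 and 76 die quoted in the text. The function name and the rounding down to whole die are illustrative choices.

```python
import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Wafer area divided by die area, minus the die lost on the round edge."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    # Only whole die count, so round down.
    return math.floor(wafer_area / die_area_mm2 - edge_loss)

# ITRS 2003 target die sizes on a 200-mm wafer (values from the text).
print(dies_per_wafer(200, 140))   # mainstream microprocessor
print(dies_per_wafer(200, 310))   # server product
```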

Unfortunately not all the die produced will function properly. In fact, although it is something each factory strives for, in the long run 100 percent yield will not give the highest profits. Reducing the on-die dimensions allows more die per wafer and higher frequencies that can be sold at higher prices. As a result, the best profits are achieved when the process is always pushed to the point where at least some of the die fail. The density of defects and complexity of the manufacturing process determine the die yield, the percentage of functional die. Assuming defects are uniformly distributed across the wafer, the die yield is estimated as

$\text{Die yield} = \text{wafer yield} \times \left( 1 + \frac{\text{defects per area} \times \text{die area}}{\alpha} \right)^{-\alpha}$

The wafer yield is the percentage of successfully processed wafers. Inevitably the process flow fails altogether on some wafers preventing any of the die from functioning, but wafer yields are often close to 100 percent. On good wafers the failure rate becomes a function of the frequency of defects and the size of the die. In 2001, typical values for defects per area were between 0.4 and 0.8 defects per square centimeter.7 The value α is a measure of the complexity of the fabrication process with more processing steps leading to a higher value. A reasonable estimate for modern CMOS processes is α = 4.8 Assuming this value for α and a 200-mm wafer, the calculation of the relative die cost for different defect densities and die sizes. Figure 3-8 shows how at very low defect densities, it is possible to produce very large die with only a linear increase in cost, but these die quickly become extremely costly if defect densities are not well controlled. At 0.5 defects per square centimeter and α = 4, the target mainstream die size gives a yield of 50 percent while the server die yields only 25 percent. Die are tested while still on the wafer to help identify failures as early as possible. Only the die that pass this sort of test will be packaged. The assembly of die into package and the materials of the package itself add significantly to the cost of the product. Assembly and package costs can be modeled as some base cost plus some incremental cost added per package pin.

Package cost = base package cost + cost per pin × number of pins

The base package cost is determined primarily by the maximum power density the package can dissipate. Low cost plastic packages might have

7 Ibid. 8 Ibid.

All dies are not OK
• Density of defects and complexity of the manufacturing process determine the die yield – the percentage of functional dies
• Assuming defects are uniformly distributed across the wafer, the die yield is estimated with the die yield formula above, where α is a measure of the complexity of the fabrication process (more processing steps lead to a higher value; for modern CMOS processes α = 4)
• Wafer yield = percentage of successfully processed wafers (often close to 100 percent)
• Yield = function of the frequency of defects and the size of the die
• In 2001: defects per area >0.4 and <0.8 defects/cm2

Test & packaging
• Dies are tested before wafer slicing
• Only good dies will be packaged, because packaging adds extra and non-negligible cost
• Die assembly and package cost: base cost plus cost depending on the number of pins (~1000 pins would be typical high pin count)
• Low-cost plastic packages:
✦ Base cost – depends on thermal : few dollars +
✦ Per pin cost – typically = 0.5 cents/pin
๏ But limit the total power to less than 3 W
• High-cost, high-performance packages might allow power densities up to 100 W/cm2, but have base costs of $10 to $20 plus 1 to 2 cents per pin
• Since for high performance processors power density is increasing, packaging will represent a significant part of the total cost
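The yield and package cost models can be combined in a short numerical sketch. The die yield formula and the values 0.5 defects/cm2 and α = 4 come from the text; the helper `cost_per_good_die`, the wafer cost and the pin count used below are hypothetical and only illustrate how the pieces fit together.

```python
def die_yield(defects_per_cm2, die_area_cm2, alpha=4.0, wafer_yield=1.0):
    """Die yield = wafer yield * (1 + defects per area * die area / alpha)^(-alpha)."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

def package_cost(base_cost, cost_per_pin, pins):
    """Package cost = base package cost + cost per pin * number of pins."""
    return base_cost + cost_per_pin * pins

def cost_per_good_die(wafer_cost, dies_on_wafer, yield_fraction):
    """Hypothetical helper: wafer cost shared among the functional die only."""
    return wafer_cost / (dies_on_wafer * yield_fraction)

# Mainstream (140 mm2 = 1.4 cm2) and server (310 mm2 = 3.1 cm2) die.
y_mainstream = die_yield(0.5, 1.4)
y_server = die_yield(0.5, 3.1)
print(round(y_mainstream, 2), round(y_server, 2))

# Assumed numbers: $3000 wafer, high-performance package with 1000 pins.
WAFER_COST = 3000.0
print(round(cost_per_good_die(WAFER_COST, 186, y_mainstream), 2))
print(round(cost_per_good_die(WAFER_COST, 76, y_server), 2))
print(package_cost(15.0, 0.015, 1000))
```

The computed yields land close to the rounded 50 and 25 percent quoted in the text; the cost figures depend entirely on the assumed wafer cost.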

Yield improvements
• Improving yield is an important objective to bring the cost down
• Usually the process for the given node improves in time: yield goes up
• Yield model becomes function of time
• Example of yield calculation
• How to reconcile small die size for better yield with increased functionality?

[Figure: Die Yield vs. Die Size. Die yield (0.0 to 1.0) versus die size (0 to 600 mm^2), one curve per quarter from 4Q05 to 4Q10]

Lithography issues
• Lithography scaling is the key enabler of the Moore’s Law
✦ resolution (tech)
✦ critical dimension (smallest feature R) control (tech)
✦ overlay accuracy (tech)
✦ throughput ($$$)
• We have solutions for high-resolution printing methods that can go well beyond 30 nm, but the ultimate limit to lithography scaling will be set by:
✦ Critical Dimension control requirements &
✦ Economics
rather than purely resolution performance
• So Moore will stop not because we can’t do it, but because we will not be able to afford it ! (after all it makes sense)

2D integration: conclusion

Planar CMOS integration is hitting the wall !
Summary of issues :
• Lithography for advanced nodes becomes more and more tricky, causing yield decrease → increased cost
• Scaling doesn’t work that well: gains from node to node are hard to get (operating frequency is not going up that much)
• Power density of the circuit is increasing, requiring to adapt design techniques prior to cooling (something can be done but at HUGE cost)
• Interconnect wall: we can compute fast, but communicate too slowly
• Cost that is gradually increasing

7. Overview of solutions to scaling limitations

• Better channel control;
• Better subthreshold slope;
• Reduced short-channel effects;
• Compatible fabrication;
• Less risk of gate-to-source/drain short circuits;
• No gate misalignment problem;
• Easier to manufacture.

(Source: Ch. LALLEMENT, 9 December 2009)

Two approaches (not mutually exclusive)

At microscopic scale
• Multi-gate transistors (fin-FETs)
• Intel, 2011: 1/2 power dissipation, ~35% more speed

[Figure: multi-gate transistor cross-section: gate, Si fin of width WSi and height HSi, SiO2, oxide thickness tox, corner effects, N_A = 10^18 cm^-3; very low channel doping; rounded corners. Source: Ch. LALLEMENT, 9 December 2009]

At macroscopic scale
• Multi-die integration within the same package
• System-in-Package (SiP), System-in-a-Package or Multi-Chip Modules = multiple ICs in the same package horizontally or vertically

… and using the unused dimension !

3D Integration
• CMOS is planar technology (sounds like 2D)
• In 2D 3rd dimension (z) is used for Metal and Vias only, not for active devices
• How to exploit the 3rd dimension?
• Three-dimensional integrated circuit (3D IC) is a chip in which two or more layers of active electronic components are integrated both vertically and horizontally into a single circuit
• The semiconductor industry is pursuing this technology in many different forms, but it is not yet widely used

[Figure: stack of three tiers, each an active layer on its substrate (bulk silicon)]

MCM why ?
• Obviously multiple dies previously mounted at PCB level can now be integrated in the same package
✦ Better performance of inter-die connections
✦ Lower power of the interconnect …
✦ Overall lower power
✦ Higher integration density
✦ Heterogeneous
• But this is too coarse: e.g. integrate memory with logic in the same package (look at the previous example)
• This is not enough … we want to look into best possible ways to interconnect different dies in the same package → this will depend on the tech. used for die interconnect !

How to enable 3D MCM ?
Wire Bonding

Fig. 2.4 A 3D package by ChipPac consists of four chips that are stacked and bonded. The chips are electrically connected to each other and to the chip carrier by wire bonds

Peripheral routing, huge pitch, → Limited N° of connections with bad performance
THIS IS NOT VIABLE ! (although one of the iPhones used this ...)

Fig. 2.5 Irvine Sensor’s Neo-StackTM technology accommodates a variety of different sized chips that are stacked and edge connected to make a module of 4–50 layers that is less than 13 mm thick

the vertical connection density and image fill factor of the 3D chip. Similar 3D chips have been made using bump bonding techniques, but the density is limited by the size of the bond pads which are a function of the chip–wafer alignment budget and the bond pad size. Note that any of the preceding approaches to 3D construction can be embedded into a multichip module to further increase the packing density of the

Realistic 3D → key enablers
1. Through Silicon Vias

[Figure: classic via(s) versus through silicon via(s), crossing the active layer (FEOL) and the substrate (bulk silicon)]

Direct routing, small pitch (<10µm), → Huge number of fast connections
VIABLE !!!

TSVs Processing
Via first processing with wafer thinning

1. TSV Manufacturing
Deep silicon etching → Via oxide deposition → Cu seed deposition → Cu Plating → Chemical Mechanical Polishing

2. Wafer Thinning and Bonding
Temporary carrier bonding → Back side thinning → Exposed Cu nails → Permanent bonding → Temporary carrier de-bonding

Realistic 3D → key enablers
2. Micro(µ)-bumps

They will establish the actual inter-die connection, pitch ~ 30µm (aggressive 10µm) VIABLE !!!

Realistic 3D → key enablers
3. ReDistribution Layer (RDL)

It is a metal layer that can be placed on the backside of the existing die → it allows routing between TSVs and µbumps

TSVs and µbumps do not necessarily have to be aligned → more freedom for the placement & route of the TSVs on the top die.

Realistic 3D → key enablers

4. Cu Cu bonding 1/2

[Figure: two cross-sections, small Cu pads and bigger Cu pads; labels: Cu/dielectric damascene, BEOL, FEOL, Si, TSV]

Small Cu pads:
• No TSV & regular FEOL
• Contact pads below 1x1µm2
• Full or limited back-end interconnect stack, depending on application

Bigger Cu pads:
• Via-middle TSV (after FEOL)
• Contact pads below 4x4µm2
• Full or limited back-end interconnect stack, depending on application

Realistic 3D → key enablers

4. Cu Cu bonding 2/2

N+1: Advanced process

W2W bonding: BEOL-to-BEOL interconnect
TSV exposure and backside passivation + CMP thinning
Aligned and bonded Cu pads (eg. 5µm pitch)
Common Back-end

Flavors of 3D Integration 1/2
a) Face-to-Face (F2F)
b) Face-to-Back (F2B) no RDL

b’) Face-to-Back with RDL

Flavors of 3D Integration 2/2
c) Silicon Interposer

[Figure: IC1 and IC2 side by side on a silicon interposer]

Only bulk silicon with back-end of line:

→ No active devices; semi-active interposers are in vogue → Basically a reticle size limited routing resource → Looks like PCB, but at much smaller scale

Key enablers: bottom line
• Inter die connection density increases !!!
• Allow functional block with high IO count (big number of pins) to be moved in another die
• Blocks can be either from the existing design or from the outside of the package (PCB) → Example off-chip DRAM

[Figure: printed circuit board (PCB) with Circuit 1 (Logic) and Circuit 2 (DRAM: DDR2,3,4,5) connected by PCB wires]

PCB wires: huge capacitive load to logic circuitry → big drivers that are area and power HUNGRY

New perspectives

• If IO is cheap, why not increase the datapath width?
• Until today cost is main blocker for this approach
• With 3D integration, this is not true any more
• Birth of new approaches → Wide IO DRAMs: instead of 64 → 1200 bit wide data bus !!!!!
• Less load capacitance means smaller drivers, less area and power, so better access to DRAM (bottleneck anyhow from system perspective)

[Figure: SAMSUNG WideIO DRAM]

Example of power savings
• The N° of TSVs does not influence the cost of manufacturing (10 or 10000 TSVs): the impact is only on the array, and at 5μm diameter and 10μm pitch this is not an issue any more. The impact is on the DESIGN:

Increasing the data path width results in: 16x lower frequency (F) for the same bandwidth (BW), 23x lower power (P)

K. Kumagai, C. Yang, et al., “System-in-Silicon Architecture and its Application to H.264/AVC Motion Estimation for 1080HDTV”, ISSCC 2006.
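The frequency side of this trade-off is simple arithmetic: bandwidth is bus width times transfer frequency, so a bus that is 16 times wider can run 16 times slower for the same bandwidth. The sketch below illustrates only that relation. The bus widths and the 1.6 GHz reference frequency are hypothetical, and the 23x power figure is a result reported in the cited work that this simple model does not reproduce.

```python
def bandwidth_bits_per_s(bus_width_bits, frequency_hz):
    """Peak bandwidth of a parallel bus: one word per cycle."""
    return bus_width_bits * frequency_hz

def frequency_for_bandwidth(target_bw_bits_per_s, bus_width_bits):
    """Frequency a bus of the given width needs to reach the target bandwidth."""
    return target_bw_bits_per_s / bus_width_bits

# Hypothetical reference: 64-bit off-chip bus at 1.6 GHz.
narrow_width, narrow_freq = 64, 1.6e9
target_bw = bandwidth_bits_per_s(narrow_width, narrow_freq)

# Hypothetical wide bus enabled by TSVs: 16 times wider.
wide_width = 16 * narrow_width
wide_freq = frequency_for_bandwidth(target_bw, wide_width)

print(target_bw / 1e9, "Gbit/s")
print(wide_freq / 1e6, "MHz")
print(narrow_freq / wide_freq, "x lower frequency")
```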

3D integration: some advantages
• Increased density for the same footprint and little bit bigger volume → More functionality
• Closer, tightly coupled blocks → Small delays
• We can combine circuits manufactured in different technologies: memory-on-logic, logic-on-logic, devices that don’t scale with those that can scale → Integration of heterogeneous circuits
• Huge inter-die interconnect density → big number of connections, thousands rather than dozens of inter-die connections → Much more bandwidth
• Enables design of new systems (e.g. WideIO DRAM) → New product opportunities
• Better yield – since smaller dies → Lower cost

3D is real...

The Accelerometer

To get such a compact device, the ASIC is stacked above the MEMS structure. The MEMS structure is carefully protected in a bonded silicon lid. Cracking off the silicon lid (requiring considerable skill), we can expose the MEMS device. The top structure is the Z-axis sensor, and the bottom structure contains the X and Y sensors. This is the ASIC die used to process the tiny capacitive signals, and create a standard SPI/I2C digital interface, and several smart features such as click and double-click recognition, wake-up, and motion detection.

… but still not mainstream tech yet

• Manufacturing, Test & Yield
• Power density, peak temp.
• Standardization, supply chain
• Design flow