
ESSCIRC 2002

Future Trend of Microprocessor Design (Invited Paper)

Robert Yung, Stefan Rusu, Ken Shoemaker
Intel Corporation, Santa Clara, California, USA

1. Introduction

In the past thirty years, advances [...] have fundamentally changed the practice of business and personal computing. During these three decades, the wide acceptance of personal computers and the explosive growth in the performance, capability, and reliability of computers have fostered a new era of computing. The driving forces behind this new computing revolution are due primarily to rapid advances in [...] and [...]. In this paper, we examine key architectural and technology trends that will affect microprocessor designs in the next decade.

2. Microprocessor Evolution

In 1965 Gordon Moore observed that the total number of devices on a chip doubled every 12 months at no additional cost. He predicted that the trend would continue in the [...] but would slow after 1975 [1]. Known widely as Moore’s Law, these observations made the case for continued wafer and die size growth, defect density reduction, and increased transistor density as technology scaled and manufacturing matured. Figure 1 shows that the transistor count in leading microprocessors has doubled in each technology node, approximately every 18 to 24 months. Factors that drove up transistor count are increasingly complex processing cores, integration of multiple levels of caches, and inclusion of system functions. Figure 2 shows that microprocessor frequency has doubled in each generation, the result of a 25% reduction in gates per clock, faster transistors, and advanced circuit design. Figure 3 shows that die size has increased at 7% per year while feature size has shrunk by 30% every 2 to 3 years. Together, these fuel the transistor density growth predicted by Moore’s Law. Die size is limited by the reticle size, power dissipation, and yield. Leading microprocessors typically have large die sizes that are reduced with more advanced process technology to improve frequency and yield. As feature size gets smaller, Figure 4 shows that longer pipelines enable frequency scaling, which has been a key driver for performance.

3. Transistor Scaling

Device physics poses several challenges to future scaling of the bulk MOSFET structure. Leakage through the gate oxide by direct band-to-band tunnelling limits physical oxide thickness scaling and will drive the adoption of high-k gate materials. Sub-threshold leakage current will continue to increase. Researchers have demonstrated experimental devices with a gate length of only 15nm, which will enable chips with more than one billion transistors by the second half of this decade.

While bulk CMOS scaling is expected to continue, novel transistor structures are being explored. Figure 5 shows a cross-section of the Depleted Substrate Transistor (DST) [2], which has higher active drive and lower leakage current than bulk CMOS technology.

4. Interconnect Scaling

As advances in lithography decrease feature size and transistor delay, on-chip interconnect increasingly becomes the bottleneck in microprocessor designs. Narrower metal lines and spacing resulting from process scaling increase interconnect delay. Figure 6 shows that local interconnects scale proportionally to feature size. Global interconnects, primarily dominated by RC delay, not only fail to keep up but are rapidly worsening. Repeaters can be added to mitigate the delay but consume power and die area. Low-resistivity copper metallization and low-k materials such as fluorine-doped SiO2 (FSG) are employed to counter the worsening interconnect scalability. In the long term, a radically different on-chip interconnect topology is needed to sustain the transistor density and performance growth rates of the last three decades.

5. Packaging

The microprocessor package is changing from its traditional role of protective mechanical enclosure to a sophisticated thermal and electrical management platform. Recent advances in microprocessor packaging include the migration from wirebond to flip-chip and from ceramic to organic package substrates. Looking forward, emerging package technologies include bumpless build-up layer (BBUL) packages, which are built around the die [3]. The BBUL package provides the advantages of small electrical loop inductance and reduced thermo-mechanical stresses on the die interconnect system using low dielectric constant (low-k) materials. This packaging technology allows for high pin count and easy integration of multiple electronic and optical components.
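Section 4 states that global interconnects are dominated by RC delay and that repeaters mitigate the delay at a cost in power and area. A minimal sketch of why, assuming a first-order Elmore model of a distributed RC wire and an idealized repeater with a fixed delay. All parameter values and function names below are illustrative assumptions, not figures or methods from this paper, and the model ignores repeater drive resistance and input capacitance.

```python
import math

# Illustrative values only; none of these numbers come from the paper.
R_PER_MM = 100.0      # wire resistance, ohms per mm (assumed)
C_PER_MM = 0.2e-12    # wire capacitance, farads per mm (assumed)
T_REPEATER = 20e-12   # fixed delay of one idealized repeater, seconds (assumed)


def wire_delay(length_mm, r=R_PER_MM, c=C_PER_MM):
    """Elmore delay of an unrepeated distributed RC wire: 0.5 * R_total * C_total."""
    return 0.5 * (r * length_mm) * (c * length_mm)


def repeated_delay(length_mm, n_segments, t_rep=T_REPEATER):
    """Delay when the wire is cut into n equal segments, each followed by a repeater."""
    return n_segments * (wire_delay(length_mm / n_segments) + t_rep)


def best_segment_count(length_mm, t_rep=T_REPEATER):
    """Integer segment count that minimizes repeated_delay in this simple model."""
    # The continuous optimum of 0.5*r*c*L^2/n + n*t_rep is n = L*sqrt(0.5*r*c/t_rep).
    n_real = length_mm * math.sqrt(0.5 * R_PER_MM * C_PER_MM / t_rep)
    candidates = {max(1, math.floor(n_real)), max(1, math.ceil(n_real))}
    return min(candidates, key=lambda n: repeated_delay(length_mm, n, t_rep))


for length in (5.0, 10.0, 20.0):
    n = best_segment_count(length)
    print(f"{length:5.1f} mm: unrepeated {wire_delay(length) * 1e12:7.1f} ps, "
          f"{n} segments {repeated_delay(length, n) * 1e12:6.1f} ps")
```

In this model the unrepeated delay grows with the square of the wire length, while the repeated delay grows roughly linearly, at the price of a repeater count that rises with length. That is consistent with the qualitative trade-off described in the text, though real repeater sizing involves more terms than this sketch includes.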

6. Power Dissipation

Power dissipation increasingly limits microprocessor performance. The power budget for a microprocessor is becoming a design constraint, similar to the die area and target frequency. Supply voltage continues to scale down with every new process generation, but at a lower rate that does not keep up with the increase in clock frequency and transistor count. Figure 7 shows that power increases with frequency for two architectures and the last two process generations. Architectural techniques like on-die power management, and circuit methods such as clock gating and domino to static conversion, are employed to control the power increase of future microprocessors.

7. Clock Speed

Microprocessor clock speed increases with faster transistors and longer pipelines. Figure 4 shows that frequency scales with process improvements for several generations of Intel microprocessors with different micro-architectures. Holding process technology constant, as the number of pipeline stages increases from 5 to 10 to 20, from the original Intel Pentium® through the Pentium® 4, clock speeds increase significantly. Frequency increases have translated into higher application performance. Additional transistors are used to reduce the negative performance impact of long pipelines; an example is increasingly sophisticated branch predictors. Process improvements also increase clock speed in each processor family with a similar number of pipe stages. Later designs in a processor family usually gain a smaller frequency advantage from process improvements because many micro-architectural and circuit tunings have been realized in earlier designs. Some of the later microprocessors are also targeted to a power-constrained environment that limits their frequency gain.

8. Cache

Microprocessor clock speeds and performance demands have increased over the years. Unfortunately, external [...] and latency have not kept pace. This widening processor-to-memory gap has led to increased cache sizes and an increased number of cache levels between the processing core(s) and main memory. Figure 8 shows the size of the first and second level caches in the last 7 generations of Intel microprocessors. As frequency increases, first level cache size has begun to decrease to maintain low access latency, typically 1 to 2 clocks, as shown in Figure 9.

As aggregate cache sizes increase in symmetric multiprocessor (SMP) systems, the ratio of conflict, capacity, and coherency misses, or cache-to-cache transfers, will change. Set associative caches will see a reduction in conflict and capacity misses as cache size increases. However, these increases will have a smaller impact on coherency misses in large SMP systems. This motivates system designers to optimize for cache-to-cache transfers over memory-to-cache transfers. Two approaches to achieve this are hyper-threading, also known as multi-threading, and chip multiprocessing (CMP).

9. Input/Output

Performance increases lead to higher demand for sustainable bandwidth between a microprocessor and external main memory and I/Os. This has led to faster and wider external buses, as shown in Figure 10. In the future, high-speed point-to-point interconnects will replace shared buses to satisfy increasing bandwidth requirements. Distributed interconnects will provide a more scalable path to increase external bandwidth when the practical limit of a pin is reached.

10. Conclusion

No fundamental barrier exists to extending Moore’s Law into the next decade. As feature size continues to decrease by 30% in each process generation, the number of transistors in a high performance microprocessor doubles. This vast increase in the number of on-chip transistors allows integration of critical functions and greatly enhances microprocessor performance.

The processor-to-memory gap continues to widen as microprocessor speed increases faster than main memory speed. Integration of multiple levels of memory reduces the impact of slow memory. Larger cache sizes reduce conflict and capacity misses more than coherency misses. Integrating memory and I/O controllers in a microprocessor will reduce memory access latency and reduce bandwidth requirements.

Worsening global interconnects in a microprocessor pose an important challenge to frequency scaling and further integration. Improved metallization and low-k materials are medium-term solutions. The long-term solution may lie in restructuring the microprocessor and system to minimize communication costs between components on a die as well as in a large system.

High power dissipation is becoming a critical barrier to frequency and performance scaling of a microprocessor. The Depleted Substrate Transistor and advanced power management are promising ways to curtail the rapid growth of microprocessor power dissipation.

11. References

[1] G. Moore, “Cramming more components onto integrated circuits”, Electronics, Vol. 38, No. 8, April 19, 1965.
[2] R. Chau et al., “A 50nm Depleted-Substrate CMOS Transistor (DST)”, IEDM Tech. Digest, 2001.
[3] S. Towle et al., “Bumpless Build-Up Layer Packaging”, ASME Intl. Mech. Eng. Digest, 2001.
[4] International Technology Roadmap for Semiconductors, 2001 edition.
[5] Intel Microprocessor Reference Guide, April 2002. (http://www.intel.com/pressroom/kits/quickref.htm)
[6] Standard Performance Evaluation Corporation, April 2002. (http://www.spec.org)
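The scaling rates quoted in Sections 2 and 10 can be cross-checked with a little arithmetic. The sketch below uses only the 30% feature-size reduction per generation and the 7% annual die-size growth stated in the text. The assumption that transistor count equals areal density times die area, with ideal area scaling, is a simplification introduced here for illustration, and the function names are hypothetical.

```python
import math

FEATURE_SHRINK = 0.30        # linear feature-size reduction per generation (from the text)
DIE_GROWTH_PER_YEAR = 0.07   # die-size (area) growth per year (from the text)


def density_gain(shrink=FEATURE_SHRINK):
    """Transistors per unit area scale as 1 / (linear scale factor)^2."""
    return 1.0 / (1.0 - shrink) ** 2


def die_area_gain(years, rate=DIE_GROWTH_PER_YEAR):
    """Compound die-area growth over the given number of years."""
    return (1.0 + rate) ** years


def doubling_time_years(years_per_generation):
    """Years for transistor count to double, assuming count = density * die area."""
    per_generation = density_gain() * die_area_gain(years_per_generation)
    return years_per_generation * math.log(2.0) / math.log(per_generation)


print(f"density gain per generation: {density_gain():.2f}x")
for years in (2.0, 3.0):
    months = 12.0 * doubling_time_years(years)
    print(f"{years:.0f}-year generation: count doubles about every {months:.0f} months")
```

A 30% linear shrink leaves each dimension at 0.7 of its previous value, so areal density rises by 1/0.49, or roughly 2x, which is the doubling per generation cited in the Conclusion. Under these assumptions a two-year process cadence gives a doubling time inside the 18 to 24 month range of Figure 1, while a three-year cadence gives a somewhat longer one.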

Figure 1 – Transistor count doubles every 18-24 months [5, 6]
[Plot: transistor count (log scale) versus year of introduction, 1971 to 2006]

Figure 2 – Frequency doubles and number of gates per clock reduced by 25% each generation [5, 6]
[Plot: frequency [MHz] and gate delays per clock versus year of introduction, 1987 to 2003; data points from 386 through Pentium® 4]

Figure 3 – Feature size reduces by 30% every 2 to 3 years. Die sizes grow at 7% per year [5, 6]
[Plot: die size (mm2) and feature size (um) versus year of introduction, 1971 to 2006]

Figure 4 – As feature size gets smaller, longer pipelines enable frequency scaling, which is a key driver for performance [5, 6]
[Plot: frequency (MHz) and relative integer performance versus feature size (nm); series for 5, 10, and 20 pipeline stages]

Figure 5 – Cross-section of a raised-source/drain depleted substrate transistor (DST) on thin silicon body [2]

Figure 6 – On-chip interconnect trend [4]
[Plot: relative delay versus feature size (nm), 250 to 32; series: gate delay (fanout 4), local interconnect (M1,2), global interconnect with repeaters, global interconnect without repeaters]

Figure 7 – Processor power as a function of frequency for two process generations [5]
[Plot: power [W] versus frequency [MHz]; series: Pentium® III and Pentium® 4 at 0.18um and 0.13um]

Figure 8 – Increasing on-chip cache sizes reduce the impact of widening processor-memory gap
[Plot: L1 cache size (K) and L2 cache size (K) by processor, 386 through Pentium® 4 (.13u)]

Figure 9 – Short L1 cache latency dictates small L1 cache size. L2 cache latency is less critical and allows larger L2 cache sizes.
[Plot: L1 and L2 cache latency (clocks) by processor, 486 through Pentium® 4 (.13u)]

Figure 10 – Memory and I/O bandwidth are crucial to sustain high processor performance
[Plot: bus bandwidth (MB/sec) by processor, 386 through Pentium® 4 (.18u)]