
On ambient intelligence, needful things and process technologies

C.J. van der Poel*, F. Pessolano, R. Roovers, F. Widdershoven, G. van de Walk, E. Aarts and P. Christie
Philips Research

*) carel.van.der.poel@philips.com
0-7803-8480-6/04/$20.00 ©2004 IEEE

The ongoing miniaturization of electronic circuits and the corresponding exponential increase in embedded computational power are reaching the point where it becomes viable to integrate electronics into people's environments. Ambient Intelligence [Basten] refers to an electronic environment that is sensitive and responsive to the presence of people. Such an environment should be
- Ubiquitous: surrounding the user by a multitude of interconnected systems
- Transparent: integrated and “hidden” into the background
- Intelligent: adapting to the people that live in it

The potential to distribute functionality over a network of devices is determined by the power resources of each device. Considering these demands, it appears helpful to further classify in-home Ambient Intelligence “devices” into three distinct classes, illustrated in figures 1 and 2.
- The “watt-node”, taking care of the major information-processing, computationally intensive tasks, e.g. 3D TV at high data rates, networked games, etc.: a mains-connected ‘static’ device; the power is limited by package cost (a few Watts for traditional consumer products to 100 Watts for PC-like devices).
- The “milli-watt-node”, representing low-power processing, computationally efficient [de Man], mobile, personal and connected (audio/video) devices in the home. These are portable devices with a rechargeable battery; the power is limited by battery technology.
- The “micro-watt-node”, representing a multitude of “electronic dust” devices that perform environmental tasks like sensing, identification, positioning, etc. These are autonomous devices powered by energy scavenging or lifetime battery technologies.

Figure 1: In-home AI nodes

Figure 2: AI power nodes (axes: communication rate in bps and computing in ops, versus power from 1 mW to 1 W)

It is envisaged that each of the three classes represents about equivalent “in-home” Si-area. In this paper we try to map the system needs associated with these nodes, which differ by orders of magnitude with respect to the amount of information to be processed as well as the available power, onto requirements for Si process technology choices.

Watt Node: digital technology

As the electronics industry matures, and technology development and deployment become increasingly expensive and complex, the necessity of following Moore’s law is starting to be doubted. Furthermore, the expected next deep sub-micron technologies (0.065um and 0.045um) seem unable to provide further leaps in performance, while costs are steadily growing (figure 3). A simple question thus arises: do we need advanced technology and, if so, in what form?

Figure 3: Si area price trend vs. ITRS node

We try to answer this apparently trivial question not by restating Moore’s law, but by looking at it from another angle: the future applications themselves. Even if only approximately, we can already peek into the future by analyzing industry and academic roadmaps for the next 15 years. Based on these roadmaps, we have made a first assessment of the required performance, system architecture and, with some educated guessing, markets and production costs.

The result of such an analysis is depicted in figure 4, where the number of future applications is given per process node (from 0.12um till 0.045um). Consider first the standard industrial case, where the process node is chosen solely to minimize cost for a given application requiring a given performance. This is illustrated in figure 4a. Obviously, future applications have been identified that require advanced technologies like 0.045um. However, their number is decreasing. In a different scenario, where more advanced technologies also provide a reduction in power by means of combined technology scaling and circuit design, power-limited applications (e.g. for mobile devices) would find their optimal node in a more advanced CMOS technology (see figure 4b).

Figure 4: Application distribution per process node (0.12, 0.09, 0.065 and 0.045 um), minimizing (a) mainly costs, (b) costs and power

This is, of course, valid only if the technology is indeed capable of providing a power reduction, an assumption that has so far held down to 0.090um. At the same time, it shows that the next process nodes should be developed with more focus on power: power, and not speed, should drive technology development. We can take this analysis a step further so as to assess Moore’s law timing.

The case for 0.065um technology is described in figure 5. The ITRS roadmap expects this technology to be available in 2007. If we look at the introduction time for applications that would use this process node if developed in the traditional way (referred to as ITRS-based, figure 5a), we observe a gap of 2 years before the technology is used. Repeating the analysis for 0.045um results in a 4-year gap. If the 0.065um node were developed with power as the main optimization factor, the technology would be required by industry already in 2004 (figure 5b).
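The two selection scenarios of figure 4 can be illustrated with a minimal sketch. It assumes a made-up table of relative speed, cost and power per process node and a simple weighted sum of cost and power as the objective; none of these numbers, nor the names `NODES`, `choose_node` and `distribution`, come from the paper.

```python
from collections import Counter

# Hypothetical per-node figures (illustrative only, NOT taken from the paper):
# relative speed, relative cost and relative power of an implementation.
NODES = {
    0.12:  {"speed": 1.0, "cost": 1.00, "power": 1.00},
    0.09:  {"speed": 1.4, "cost": 1.10, "power": 0.70},
    0.065: {"speed": 1.8, "cost": 1.35, "power": 0.50},
    0.045: {"speed": 2.2, "cost": 1.80, "power": 0.35},
}


def choose_node(required_speed, power_weight=0.0):
    """Return the node (in um) that meets the speed requirement at the lowest
    weighted sum of cost and power. power_weight=0 means 'cost only'."""
    feasible = {n: p for n, p in NODES.items() if p["speed"] >= required_speed}
    if not feasible:
        return None  # no node in the table is fast enough
    return min(
        feasible,
        key=lambda n: feasible[n]["cost"] + power_weight * feasible[n]["power"],
    )


def distribution(required_speeds, power_weight=0.0):
    """Count how many applications end up on each node (cf. figure 4)."""
    return Counter(choose_node(s, power_weight) for s in required_speeds)


if __name__ == "__main__":
    apps = [1.0, 1.0, 1.5, 2.0]  # hypothetical speed requirements
    print("cost only      :", dict(distribution(apps)))
    print("cost and power :", dict(distribution(apps, power_weight=2.0)))
```

With these assumed numbers, weighting power moves the low-performance applications from the oldest node to a more advanced one, which is the qualitative shift between figures 4a and 4b.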

Figure 5: Application introduction years (2004 to 2012) for 0.065um with (a) standard ITRS-based and (b) power-based technology development.

From these observations, we can conclude that the technology pace is too high if development overlooks power, while it would be too slow when optimized for power. If this is true, power would also be the main reason that keeps Moore’s law valid, as the technology development pace would then be just enough to satisfy the demand. Most applications of the future will require some mix of technologies, ranging from pure digital CMOS to sensors, which will make digital CMOS the less blocking part of the technologies involved in the system of the future.

Virtual Technology Chain

The trend in front-end technology development is unmistakably in the direction of an increasing number of process options, with multiple threshold voltages and low-power, general-purpose, and high-performance application domains being offered at the 65nm node. The situation at the 45nm node is likely to be even more diverse, since metal gate, high-k gate dielectric, and multi-gate device architectures appear to be needed for viable solutions to many of the leakage current and drive strength limitations of classically scaled poly-Si/SiO2 devices. A similar argument holds for the rapid development of additional embedded functionality options like Flash, RF/Analog and Power. These are not just incremental improvements over existing approaches, and our challenge is to explore in an early phase the impact of these new technology choices on system-level performance at the level of standard cell arrays (~100k cells) and embedded memory blocks (~1 Mb).

Due to the uncertainties associated with the validity of different performance metrics in different system-level scenarios, technology assessment is performed using a virtual design flow called PSYCHIC (Parametric System-level Characterization of Integrated Circuits), modeled on the Philips SoC Design Environment, see Figure 6.

Figure 6: Closing the loop on system-level technology exploration.

We begin by defining target I_off specifications and use front-end TCAD process simulators to design candidate device architectures within each of the technology options which meet the specifications, where possible (see Figure 7). Generally, we use the “classical” low-power poly-silicon/nitrided oxide device with a supply voltage of 1.2 V as the reference device and use this device to determine the I_on current specifications. Different device technologies are then mapped on to these reference I_off and I_on specifications of the classical reference device by allowing the supply voltage to vary. In the case of double-gate devices, for example, the greater channel control allows the specifications to be met by a dramatically reduced supply voltage of just 0.76 V.

Figure 7: Front-end technology mapping with Lgate (physical) = 45 nm, CET (nitrided oxide) = 2.5 nm

To allow the device technologies to be assessed within complex system-level environments, the TCAD process simulator is used to extract timing and power matrices suitable for use in the standard design flows of ASIC cell libraries and memory cells. This requires that we embed the devices within an appropriate cell layout. Both logic and memory cells are designed using lithographically driven design methods in order to ensure manufacturability under a variety of optical tool scenarios. By regularizing the layout style to reduce the spatial bandwidth, it is possible to achieve a much more manufacturable cell library and reduce the necessity of expensive optical assist features, such as phase-shifting masks. Our analysis has shown that these techniques have minimal impact on overall cell array area, timing, or power.

Once an appropriate cell layout style has been determined, the TCAD process simulators can be used to extract timing and power matrices to be used within system-level benchmark performance simulations. A similar process has been developed for dynamic power dissipation matrices.

This matrix-based device characterization may be used to couple front-end and back-end technology options if representative wire lengths in different wiring layers can be used to determine appropriate wire load capacitances. The essential problem is illustrated in Figure 8a, where a given critical path within a benchmark circuit is connected by wires which, in general, cannot be assumed to implement nearest-neighbor connectivity within the cell array. The situation is more likely to be represented by Figure 8b, where the global optimum cell placement results in a distribution of shorter and longer wire lengths.

Figure 8: Stochastic critical path analysis: (a) nearest neighbor; (b) cell distribution

Within the virtual design flow, we model the optimum placement using a model of the place and route process based on the so-called Rentian wire length distribution. Using a range of wiring signature parameters extracted from a wide range of Philips circuit layouts, we have developed a scalable model of the wire length distribution produced by commercial placement tools. This wire length distribution is allocated to individual wiring layers using a pseudo-routing method, where the shortest wires of the distribution are allocated to the lowest wiring layers, and successively longer wires are allocated to higher layers, taking into account inefficiencies in the routing algorithm and the routing resources required by clock and power distribution. Benchmark critical paths can be connected by sampling the required connecting wires from the pseudo-placed and pseudo-routed wire length distribution in each wiring layer. These wire lengths may be converted to capacitive loads using capacitance per unit length data extracted from a variety of back-end architectures.

Figure 9a shows the results of an analysis of delay for low-power technology options all the way from the 0.18um node to the 0.045um node. For this analysis, the effect of adopting “classical” Bulk Scaled (BS), Double-Gate (DG), and Fully Depleted Silicon-On-Insulator (FDSOI) devices within a “classical” back-end reference architecture at the 0.045um node has been studied. The error bars indicate the minimum, modal, and median delay determined by 10,000 trials of the stochastic timing analysis method. Figure 9b shows the results of the dynamic power analysis for the same technology choices.

Figure 9: Technology comparison: (a) delay, (b) dynamic power
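The pseudo-routing and sampling steps described above can be sketched as follows. This is only an illustration under stated assumptions: the wire lengths are passed in as a plain list rather than generated from a Rentian model, each layer has a single hypothetical capacity, and the stage delay is a simple lumped RC term instead of the timing matrices used in the PSYCHIC flow. The names `pseudo_route` and `path_delay_trials` and all parameters are hypothetical.

```python
import random
import statistics


def pseudo_route(wire_lengths, layer_capacities):
    """Assign the shortest wires to the lowest layer and successively longer
    wires to higher layers. layer_capacities gives the total routable wire
    length per layer, lowest layer first; the top layer takes any overflow."""
    layers = [[] for _ in layer_capacities]
    layer, used = 0, 0.0
    for length in sorted(wire_lengths):
        # Move up one layer whenever the current one would be over-filled.
        while layer < len(layers) - 1 and used + length > layer_capacities[layer]:
            layer, used = layer + 1, 0.0
        layers[layer].append(length)
        used += length
    return layers


def path_delay_trials(layers, cap_per_length, gate_delays, drive_resistance,
                      trials=1000, seed=0):
    """Monte Carlo estimate of a critical-path delay. For every gate on the
    path, one wire is sampled from the routed distribution and converted to
    a capacitive load via the capacitance per unit length of its layer.
    Assumed delay model per stage: t_gate + R_drive * C_wire."""
    rng = random.Random(seed)
    wires = [(i, length) for i, layer in enumerate(layers) for length in layer]
    delays = []
    for _ in range(trials):
        total = 0.0
        for t_gate in gate_delays:
            i, length = rng.choice(wires)
            total += t_gate + drive_resistance * cap_per_length[i] * length
        delays.append(total)
    return min(delays), statistics.median(delays)


if __name__ == "__main__":
    routed = pseudo_route([5, 1, 3, 1, 2], layer_capacities=[2, 5, 100])
    print(routed)
    print(path_delay_trials(routed, [0.2, 0.15, 0.1], [1.0, 1.0, 1.0], 1.0))
```

All quantities are in arbitrary but consistent units. Repeating the trials for each device option, with that option's gate delays, would give the kind of spread that the error bars in Figure 9a summarize.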

mW node: Embedded Memory and RF-SiP

Embedded memory

In the following we discuss memory demands in relation to 2 typical features of AI systems.

1. Identification and personalization
An AI system typically would consist of a group of interacting devices, some of them autonomous. Most of these devices need to be uniquely identified. Furthermore, personalization of devices (e.g., in user interface terminals, autonomous sensors and actuators) is desired to enhance the AI system overall. Therefore, some level of programmability is essential, even for the simplest components of an AI system. Especially if they need to be able to survive unpredictable power-down events, on-chip nonvolatile storage in relatively small memories is the preferred solution. For such applications, small memory module size and low read voltage and write power are more important than high bit-density in the memory array. As an example, embedded 2T floating gate nonvolatile memory [Dormans], which combines medium-density Flash (code storage) and EEPROM (data storage) by design in the same process, is ideally suited for this purpose. Currently it is available in 0.18, 0.16 and 0.14 um CMOS (figure 10), and is under development for the 90-nm and 65-nm nodes. For the longer term, because of their limited scalability, replacement of floating gate Flash cells by SONOS [Duuren] or nanocrystal [Muralidhar] cells is proposed. With high-K top dielectrics these cells can be written at much lower voltages than floating gate flash.

Figure 10. Left: Embedded 2T Flash cell in 0.14-µm CMOS (0.50 µm²). Right: compact SONOS cell (0.17 µm²). AG: access gate, CG: control gate, FG: floating gate.

2. Recognition engines
Recognition tasks are essential to make AI systems behave intelligently. Optimal partitioning into high-performance power-efficient hardware and flexible software is necessary to reconcile computation power demands with low power consumption. The typical layered structure of recognition tasks (figure 11) can be used to define optimal recognition engine architectures.

Figure 11. Layered structure of typical recognition tasks

Sensory signals acquired by capture devices (e.g., cameras, microphones, tactile sensors) are transformed into suitable representations (e.g., frequency spectra of audio fragments) and stored in a memory map. For vision tasks, for example, a sequence of consecutive image frames, stored in the memory map, is updated with one new frame at a time. Between successive updates a feature extraction engine, consisting of a set of parallel processors, extracts features (e.g., edges, motion vectors) from the stored data. This requires multiple random read accesses. Writing back to the memory map is avoided to prevent data coherency problems in SIMD (single instruction multiple data) parallel code execution. Instead the extracted feature parameters are stored into a shared memory. This first part of the recognition engine is characterized by a streaming data flow, consisting of single page-mode writes and multiple random-access reads of the memory map. For low power and area consumption this has to be a high-density on-chip memory (i.e., DRAM-like). However, it doesn’t necessarily need the random-access write and high write endurance of real DRAM.

The last part of the recognition engine consists of an object-processing engine that associates detected features with objects. In this part the data flow is more of a non-streaming nature, featuring multiple fast random-access read and/or write cycles in the shared memory. Often probabilistic object processing is needed to be able to draw reliable decisions from uncertain input data (e.g., corrupted by environmental noise), i.e., calculating the probabilities of all possible values of variables instead of a single deterministic value. Therefore high-density SRAM would be a good choice for the shared memory. Furthermore, a nonvolatile memory is needed to store a database of known features and objects.

Although the relentless progress in memory scaling as driven by Moore’s law has been impressive (figure 12), embedding 2 completely different, demanding technologies such as Flash/EEPROM and DRAM in the same chip is not a real option. The Holy Grail would be a unified memory that combines all the positive features of the above-mentioned memories. Furthermore, to fight the rapid price erosion of ICs, easy portability to future process nodes is essential to avoid expensive redesign when switching from one embedded memory type to another. Therefore, a well-balanced combination of high-density SRAM and a scalable unified memory (SUM) concept would be ideal. In the short term, compromises have to be made to be able to introduce the first generations of SUM. From the above it can be concluded that a compromise on write speed and endurance would be a good solution. For example, storing a frame of 1024x768 pixels with a resolution of 16 bits/pixel requires 12 Mbits. At a capture rate of 50 frames/second, a maximum number of 1.6x10^10 frames will be captured and stored into the memory map in 10 years. The streaming data storage enables parallel writing of many bits to achieve a high data throughput even at rather slow write-access times. However, fast random read access is necessary.

Figure 12: Memory cell size trends

MRAM [Durlam], currently the main contender for unified memory, features many of the requirements. However, it also requires high write currents that don't scale well to future generations. SONOS is another candidate for the first SUM generations. With a high-K (top) blocking oxide and a medium-thick (bottom) tunnel oxide (~2.5 nm) it is a nonvolatile memory. With a thinner bottom oxide the write endurance can be increased significantly (at the expense of the data retention), rendering it possible to use it for memory map storage in recognition engines. Contrary to floating gate flash, SONOS doesn't develop so-called extrinsic bits degradation.

Phase-change RAM (PCRAM) seems to have a good perspective to become the first SUM for AI. Its write endurance (up to 10^12 cycles have been reported) is enough for AI applications. Furthermore, it has reasonable write speed (~50 ns [Lai]). Using novel phase-change materials (e.g. doped SbTe, [Lankhorst]) write times can be reduced to the 10-ns range, although write endurance is not yet on spec.

We conclude that, w.r.t. memory, partitioning into fast SRAM and high-density embedded SUM is the best approach for AI. PCRAM seems to be the best candidate to unify non-volatile and DRAM-like functionality, even at the expense of a limited write endurance. Scalability is a must for economic reasons, making MRAM less obvious.

RF and System-in-package

With the increasing demand for bandwidth that started with the first digital standards like GSM and DECT, the frequency bands used for wireless connectivity are gradually increasing from 0.9 GHz upwards. This trend has been enabled by rapid progress in RF process technologies [Slotboom], see figure 13.

A multitude of applications is becoming available, like Bluetooth, Zigbee, WLAN (802.11x) and UWB. Besides connectivity, digital broadcast and satellite navigation will also find their way to the mobile consumer.

Figure 13: Bipolar technology milestones (cut-off frequency and base thickness versus year of invention, 1940 to 2000)

The driving forces for RF technology are multi-band, configurability, low power, miniaturization and integration. In contrast to the digital domain, the complete functionality of analog RF circuitry, and especially the passive devices like filters, oscillators, inductors, capacitors and antennas, cannot be integrated in a single Si-based technology. The continuous scaling of CMOS also creates an increasing mismatch w.r.t. the supply voltages that are required for high-performance active RF devices. Unavoidably, RF solutions are composed of a technology mix ranging from discrete passives to advanced RF CMOS, BiCMOS and III-V transceiver devices and circuits.

To reduce the overall component count and come to a flexible modular solution, a system-in-package (SiP) approach is the logical route to integrate passive functions. Dependent on the specific cost and performance requirements given by an application, the optimum technology mix and circuit partitioning is taken. IC-compatible thin-film technologies are used with limited mask steps and minimum feature sizes of 1 µm (fig. 14).

Figure 14: Cross-section of a thin-film passive integration technology on Si. SiN in between the Al metallization forms the MIM capacitor. The 5 µm top metal is used for low-loss inductors.

Although ideally isolating substrates like ceramics or glass should be used [Beyne], silicon is favored because of the abundance of manufacturing capacity. To improve RF performance, high-resistivity silicon, in combination with damage implantation, like H [Chin] or Al [Beek], is applied. Thus, excellent RF performance is achieved: Q values above 50 in the frequency range of 2-4 GHz.

Planar, low-loss capacitors have, however, a limited capacitance density when only SiO2 or SiNx is applied as dielectric. A higher density is achieved either by integrating materials with a higher dielectric constant, like Ta2O5 [Beyne] or HfO2 [Hu], or by increasing the surface area. Fig. 15 shows a SEM cross-section where micro pores are etched in the Si substrate [Roozeboom]. These structures give an increased surface area for MOS capacitors, resulting in a 30x capacitance density increase when compared to normal planar structures. Due to the distributed nature of the pore capacitor grid, good RF performance is achieved, which makes the technology well-suited for RF decoupling and low-pass filtering.

Figure 15: SEM cross-section of dry-etched pores in high-ohmic Si

A further functional extension of the Si-based RF SiP platform is the integration of switches and tunable capacitors using micro-machining (MEMS) (see fig. 16). Because of the low-loss characteristics of these metal-based devices, they are ideally suited for adaptive impedance matching in the RF transmit path [Rijks] or phase shifting for phased-array antennas [Natale]. RF MEMS will enable hardware reconfigurability and adaptivity for multi-band transceiver architectures.

Figure 16: SEM photograph of a capacitive switch, processed in the metal-based passive integration technology

Figure 17: Silicon as platform for integration of RF passive devices and circuits
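The two routes to higher capacitance density mentioned above (a higher dielectric constant, or a larger effective surface area) follow from the parallel-plate relation C/A = ε0·εr/t. The sketch below is illustrative only: the relative permittivities are approximate textbook values and the 30 nm dielectric thickness is an arbitrary assumption, neither taken from the paper; only the 30x area factor for the pore structures is quoted from the text.

```python
EPS0 = 8.854e-12  # vacuum permittivity in F/m

# Approximate textbook relative permittivities (assumed, not from the paper).
REL_PERMITTIVITY = {"SiO2": 3.9, "Si3N4": 7.5, "HfO2": 20.0, "Ta2O5": 25.0}


def density_nf_per_mm2(eps_r, thickness_m, area_factor=1.0):
    """Parallel-plate capacitance density in nF/mm^2.

    area_factor scales the effective electrode area per unit footprint,
    e.g. 30 for the pore structures described in the text."""
    farad_per_m2 = EPS0 * eps_r / thickness_m
    return farad_per_m2 * 1e3 * area_factor  # 1 F/m^2 equals 1e3 nF/mm^2


if __name__ == "__main__":
    thickness = 30e-9  # hypothetical dielectric thickness of 30 nm
    for name, eps_r in REL_PERMITTIVITY.items():
        planar = density_nf_per_mm2(eps_r, thickness)
        print(f"{name:6s} planar: {planar:6.2f} nF/mm^2")
    pores = density_nf_per_mm2(REL_PERMITTIVITY["SiO2"], thickness, area_factor=30)
    print(f"SiO2   pores : {pores:6.2f} nF/mm^2")
```

Under these assumptions the area route and the material route are multiplicative, so in principle they could be combined, although the paper does not discuss such a combination.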

The analogue nature of the RF front end and the required flexibility to respond to the market dynamics, which are essential driving forces for RF integration, imply the need for a flexible mix-and-match strategy with a limited set of technologies, each forming an integration platform in its own right for active circuits, passive devices or interconnect (fig. 17). In our opinion, this inevitably leads to a SiP concept as the only realistic choice for serving mass-volume consumer applications within the short innovation cycles that are typical in this industry.

The µWatt node

Whereas for the milli-watt node rechargeable batteries are still an option, Electronic Dust “µWatt” devices are in principle expected to be self-supplying, i.e. during the lifetime of the device, no new power supply can be expected to be available.

Figure 18: Architecture of a Smart Dust Device (blocks include network, and energy collection and storage)

- Advanced CMOS technologies do not play a role in such applications
- Very low off-state leakage, duty cycle management and embedded dense capacitors for power handling are requested aspects
- The main power usage is in the communication link
- To supply the sensor’s power budget, the energy scavenging device component area is of the order of a cm2 and will determine the device volume

From these estimates of the power needs of these devices, in comparison with the available power from small-volume rechargeable batteries and/or energy scavenging methodologies, it is concluded that major obstacles still remain to be solved before such devices can be implemented. In addition, sensing, MEMS and energy scavenging will need to be integrated in a single and very small package.

Conclusions

In conclusion, it is clear that the field is abundant in challenges in each of the three “power” nodes.

For the “Watt-node”, raw computational power, from few to GOPS, is required and potentially supplied by the rapid Moore’s law CMOS scaling trend. For this class of devices, the major challenges in Si process technology can be traced back to the red-brick-wall scaling limits, the timely introduction of added options like memories and RF, and, more severely, the strong upward trend in the cost of silicon as well as the rising cost of “re-use” and software implementations.

For the “milliwatt” node, power efficiency governs most of the challenges. The trade-off and optimal partitioning of the available process technology choices like CMOS, RF and Passive Integration is expected to lead to “systems in a package”.

From the analysis of a simple case, i.e. a stand-alone temperature sensor in an architecture choice as displayed in figure 18, it becomes apparent that breakthrough solutions for the power problem are still needed to fully realize such fully functional electronic dust.

It is expected that the introduction of Ambient Intelligence devices will be “gradual” and that its rate will be strongly dependent on the necessary technology advances that remain to be made.

References:
[Basten]: T. Basten et al., Ambient Intelligence, Kluwer 2003, ISBN 1-4020-7668-1 and references therein
[de Man]: H. de Man, Keynote ESSCIRC 2002
[Dormans]: G. Dormans et al., NVSMW 2003, p. 21.
[Duuren]: M. v. Duuren et al., NVSMW 2003
[Muralidhar]: Muralidhar et al., IEDM 2003, p. 601.
[Durlam]: M. Durlam et al., IEDM 2003, session 34.
[Lai]: S. Lai, IEDM 2003, session 10
[Lankhorst]: M. Lankhorst et al., CREMSI 2004 Workshop, Fuveau (France), March 26, 2004
[Slotboom]: J. Slotboom, BCTM 2003, Keynote Presentation.
[Beyne]: E. Beyne, Tech. Dig. ISSCC, 138, 2004-04-23
[Chin]: A. Chin et al., Proc. IEDM, 375, 2003
[Beek]: J. van Beek et al., Mat. Res. Symp. Proc. B. (2003)
[Hu]: H. Hu et al., Proc. IEDM, 379, 2003
[Roozeboom]: F. Roozeboom et al., Mat. Res. Symp. Proc. B, 783 (2003)
[Rijks]: T. Rijks et al., MIEL 2004
[Natale]: J. De Natale, Tech. Dig. ISSCC, 310, 2004
