COVER FEATURE REBOOTING COMPUTING

Energy-Efficient Abundant-Data Computing: The N3XT 1,000×

Mohamed M. Sabry Aly, Mingyu Gao, Gage Hills, Chi-Shuen Lee, Greg Pitner, Max M. Shulaker, Tony F. Wu, and Mehdi Asheghi, Stanford University; Jeff Bokor, University of California, Berkeley; Franz Franchetti, Carnegie Mellon University; Kenneth E. Goodson and Christos Kozyrakis, Stanford University; Igor Markov, University of Michigan, Ann Arbor; Kunle Olukotun, Stanford University; Larry Pileggi, Carnegie Mellon University; Eric Pop, Stanford University; Jan Rabaey, University of California, Berkeley; Christopher Ré, H.-S. Philip Wong, and Subhasish Mitra, Stanford University

Next-generation information technologies will process unprecedented amounts of loosely structured data that overwhelm existing computing systems. N3XT improves the energy efficiency of abundant-data applications 1,000-fold by using new logic and memory technologies, 3D integration with fine-grained connectivity, and new architectures for computation immersed in memory.

COMPUTER, PUBLISHED BY THE IEEE COMPUTER SOCIETY. 0018-9162/15/$31.00 © 2015 IEEE

The rising demand for high-performance IT services with human-like interfaces is driving the quest for the next generation of energy-efficient computers. These computers will operate on abundant data that can be highly unstructured and often streamed in terabytes. Abundant-data workloads arise from social networks, e-commerce transactions, genome sequences, and multimedia analytics. Within 10 years, trillions of sensors will be connected to the Internet, creating a massive data deluge that could overwhelm communication bandwidths. Computers must be able to process, understand, classify, and organize relevant data in real time and in an energy- and cost-efficient manner.

The slowdown of silicon CMOS (Dennard) scaling has prompted comprehensive research on faster, more energy-efficient switches. However, better switches alone will not deliver the necessary leaps in performance. In particular, abundant-data applications expose gross inefficiencies in traditional architectures, where poor locality leads to excessive cache misses, causing massive and slow off-chip traffic to pin-limited DRAMs that face their own scaling challenges. Thus, only small fractions of the system's time and energy are spent on computation itself, presenting an opportunity for major improvements.

N3XT: AN END-TO-END APPROACH

Our Nano-Engineered Computing Systems Technology (N3XT) approach capitalizes on several recent nanotechnology breakthroughs (see Figure 1). Instead of focusing solely on improving transistors or memory cells, N3XT adopts an integrated approach for a new system technology that promises to breathe new life into computing. Key N3XT components include the following:

›› High-performance and energy-efficient field-effect transistors (FETs) based on atomic-scale nanomaterials, such as 1D carbon nanotubes (CNTs) and 2D layered semiconductors.
›› Massive amounts of nonvolatile storage such as low-voltage resistive RAM (RRAM) and magnetoresistive memories such as spin-transfer torque magnetic RAM (STT-MRAM). These diverse technologies offer complementary tradeoffs among high density, quick access, long data retention, and read/write endurance. Their advantages can be successfully utilized and their drawbacks avoided through a carefully designed memory hierarchy and tight integration with computation units.
›› Fine-grained (monolithic, for example) 3D integration of computing and memory elements with ultradense connectivity between layers. Such fine-grained monolithic 3D integration is natural to the N3XT transistor and memory technologies, enabled by low-temperature layer transfer techniques. This unique approach decouples high-temperature nanomaterial synthesis (to achieve high-quality materials) from low-temperature monolithic 3D integration.
›› Embedded cooling technologies targeting a range of application domains (for example, handheld versus servers) to overcome power density challenges. Examples include conduction using 2D materials, management of thermal transients based on phase change, and convective copper nanomesh structures connected to chip periphery microfluidics.
›› New microarchitectures and system runtimes for scalable computation immersed in memory that lead to massive amounts of active data, enabled by the above technology components and their fine-grained integration. Cross-layer resilience techniques overcome yield and reliability challenges.

We demonstrate the effectiveness of N3XT by using the system-level energy-delay product (EDP) metric, the product of a software program's total energy consumption and total

DECEMBER 2015 25 REBOOTING COMPUTING

[Figure 1 panels. Experimental demonstrations: (a) 3D RRAM cells; (b) efficient heat removal solutions (copper nanomesh and phase-change thermal storage); (c) monolithic 3D "high-rise chip" with stacked logic and memory layers. Components: (1) energy-efficient FETs (1D CNTs, 2D layered nanomaterials); (2) high-density nonvolatile memories (3D RRAM for massive storage, STT-MRAM for quick access); (3) fine-grained monolithic 3D integration (compute and memory elements, ultradense connectivity using nanoscale vias); (4) efficient heat removal (on-chip nanoconvection/conduction solutions); (5) computation immersed in memory.]

FIGURE 1. Monolithically integrated 3D system enabled by Nano-Engineered Computing Systems Technology (N3XT). On the right are the five key N3XT components. On the left are images of experimental technology demonstrations: (a) transmission electron microscopy (TEM) of a 3D resistive RAM (RRAM) for massive storage, (b) scanning electron microscopy (SEM) of nanostructured materials for efficient heat removal (left: microscale capillary advection; right: copper nanomesh with phase-change thermal storage), and (c) SEM of a monolithic 3D chip for high-performance and energy-efficient computation. CNTs: carbon nanotubes, FETs: field-effect transistors, and STT-MRAM: spin-transfer torque magnetic RAM.

execution time, subject to power density constraints. Given that speed can be traded for energy and vice versa, the EDP metric is important in quantifying computing system performance [1]. To enable new frontiers of abundant-data applications for both mobile devices and the cloud, we target EDP improvements by 1,000×. For traditional multiprocessor workloads, N3XT targets 10×–100× EDP benefits. As we show, N3XT experimental prototypes can be built today.

Such significant benefits are generally rare and cannot be achieved with evolutionary improvements in architectures, transistors, or memory cells alone; an end-to-end approach such as N3XT is essential. Take, for example, the total delay of a processor pipeline or the total energy of processor cores and memories, where each component must show comparable improvement. N3XT improves each component and finds symbiotic relations to enhance key performance metrics among components. Additional synergies arise; for example, faster memory accesses cut core idle times, reducing energy consumption and overall execution time. Additional improvements arise from ultradense monolithic 3D integration with fine-grained connectivity and increased memory bandwidth, enabling many concurrent memory accesses and significantly reducing memory access contention; and from nonvolatile memories, which dramatically reduce idle energy consumption and simplify memory access mechanisms.

N3XT TECHNOLOGY FOUNDATIONS

Table 1 summarizes the primary nanotechnologies that form the foundations of N3XT. They work synergistically to overcome the limitations of existing approaches while meeting application-level thermal constraints.

Atomically thin logic devices

N3XT logic devices capitalize on the unique properties of atomic-scale nanomaterials including 1D CNTs and 2D layered semiconductors (for example, black phosphorus, WSe2). These nanomaterials are ideal for building highly scaled FETs that can deliver large drive currents at low supply voltages. Such FETs exhibit excellent electrostatic control (resulting from atomically thin 1D CNTs with approximately 1-nm diameter and 2D layered semiconductors) while simultaneously achieving excellent carrier transport.

CNTs are hollow cylindrical nanostructures of carbon atoms with exceptional electrical, thermal, and mechanical properties. A carbon nanotube FET (CNFET) consists of multiple CNTs connected in parallel to form the transistor channel (see Figure 2a). CNFETs promise an order-of-magnitude better EDP versus silicon CMOS at the digital

COMPUTER WWW.COMPUTER.ORG/COMPUTER

TABLE 1. Nano-Engineered Computing Systems Technology (N3XT) technology foundations.

Technology | Impact on computation | Impact on storage | Impact on memory access
Field-effect transistors: 1D carbon nanotubes and 2D layered semiconductors | Highly energy-efficient digital systems (including logic and interconnects) | NA | Energy-efficient memory controllers and peripheral circuits
Emerging nonvolatile memory: spin-transfer torque magnetic RAM | NA | Quick access, high endurance | No refresh; simple control; energy-efficient management by turning off unused banks
Emerging nonvolatile memory: 3D resistive RAM | NA | Very high density, long retention | Same as above (the entry appears to span both nonvolatile memory rows)
Fine-grained (monolithic) 3D integration | Computation immersed in memory; high computation density for a given footprint | Massive on-chip storage; integration of heterogeneous memory technologies | High bandwidth and low latency
Thermal solutions | High-performance computing on all tiers | Minimized temperature-induced degradation | NA

system level, including interconnect parasitics [2]. Nevertheless, until recently, imperfections and variations inherent to CNTs (for example, mispositioned CNTs and semiconducting versus metallic CNTs) posed major obstacles and prevented demonstrations of large-scale digital systems.

Considerable progress has been made recently toward full wafer-scale CNFET-based digital systems. A combination of CNFET circuit design and CNT processing techniques, the imperfection-immune paradigm, overcomes the challenges of CNT imperfections and variations in a VLSI-compatible manner [3]. This enabled the first experimental demonstration of the CNT computer [4] (see Figure 2c) and, more generally, arbitrary CNFET digital systems. These are the first nanosystem demonstrations among various promising emerging nanotechnologies for high-performance and energy-efficient digital systems. Recent work has also demonstrated exceptionally scalable CNFETs with sub-10-nm channel lengths [2], complementary n-type and p-type CNFETs [5], approaches to overcome contact resistance challenges [6], and high-performance CNFETs with CNT densities of >100 CNTs/µm (see Figure 2b).

FETs based on 2D layered semiconductors are presently less advanced than CNFETs, but the potential and challenges are evident [7]. Sub-nanometer-thin 2D layered semiconductors could enable similar gate scaling as CNFETs, and offer more degrees of freedom to optimize edge versus surface carrier injection at contacts. Two-dimensional layered materials have also been synthesized over large substrates but encounter the challenge of coexisting monolayer and few-layer domains. For both 1D CNTs and 2D semiconductors, the high-temperature synthesis process (to achieve high quality) can be decoupled from low-temperature layer transfer, thus enabling dense monolithic 3D integration.

N3XT applies to other logic switch candidates, such as tunneling FETs or negative-capacitance FETs, as long as they provide high drive currents and low leakage, are scalable to device pitches like CNFETs, and can be integrated in a fine-grained fashion akin to monolithic 3D.

Emerging nonvolatile memories

STT-MRAM (see Figure 2d) can be programmed at low voltages (<0.5 V) with tens of microamperes, can attain read/write access times of a few tens of nanoseconds (with potential for another 10× speedup), and can offer almost infinite endurance [8]. Moreover, STT-MRAM cells (approximately 6–20 F²) are substantially smaller than static RAM (SRAM) cells, resulting in increased memory capacity for the same footprint. A novel device concept for magnetic memory and logic, the m-Cell (an access transistorless spintronic memory cell; see Figure 2e), has demonstrated potential for sub-100-mV operation [9]. An even lower energy of operation might be enabled using spin-Hall effect switching. These characteristics make emerging magnetic memories promising candidates for ultra-low-power embedded memory layers very close to computing layers (see Figure 1).

For high-capacity storage, metal-oxide RRAM is a leading candidate [8]. It can be programmed at 1–2 V with currents from nanoamperes to tens of microamperes and 10-year retention. Researchers have achieved endurance through 10^12 cycles and demonstrated a sub-10-nm RRAM device (see Figure 2g). Recently, researchers also demonstrated bit-cost scalable 3D RRAM architectures that are fabricated akin to 3D NAND flash (see Figure 2f). For a future half-pitch of 5 nm, a 128-tier 3D RRAM is projected to yield 64 Tbits, programmed at 1 V and 1 µA current, with 5-ns access time and 10^9 write cycles, thus enabling ultra-high capacity on-chip storage. Various research groups have achieved these specifications at the


[Figure 2 panel groups: CNFET logic (a–c); emerging memory technologies (d–g); monolithic 3D integration (h–k), with logic (CNFETs), two memory (RRAM) layers, and logic (silicon FETs) connected by high-density nanoscale ILVs; complementary monolithic 3D (l–n). Panel (n) plots VOUT (V) against VIN (V), with gain ≈ 19.]

FIGURE 2. N3XT technology foundations. (a) Gate-all-around CNT field-effect transistor (CNFET). (b) Atomic force microscopy (AFM) image of a high-performance CNFET with >100 CNTs/µm. (c) Turing-complete microprocessor built entirely using CNFETs. (d) Spin-transfer torque magnetic RAM. (e) TEM cross-section image of a fabricated m-Cell (an access transistorless spintronic memory cell). (f) High-density 3D RRAM. (g) TEM cross-section image of an RRAM cell with sub-10-nm feature size. (h) Monolithic 3D integrated circuit (IC) with four vertical layers (logic, memory, memory, and logic). (i) TEM cross-section image of a CNFET on the fourth layer of a monolithic 3D IC described in (h). (j) TEM cross-section image of the middle two layers of RRAM from (h). (k) TEM cross-section image of the bottom layer of silicon FETs from (h). (l) Microscopy image of a wafer with three stacked layers of CNFET-based logic. (m) TEM cross-section image of (l), showing the three vertically stacked layers of CNFET logic. (n) Schematic with an SEM image, and measured waveform of one such fully complementary CNFET 3D logic circuit (p-channel metal-oxide semiconductor CNFET on layer 2 above the n-channel metal-oxide semiconductor CNFET on layer 1).

single-device level. Future challenges include developing appropriate selectors, reducing device variations, and productizing integration technologies for 100-plus 3D RRAM layers.

Fine-grained 3D integration

To achieve the massive EDP benefits offered by N3XT, we must densely interweave computation elements and memory. Such integration is realized by monolithically stacking tiers of logic and memory. Consecutive tiers are connected using nanoscale interlayer vias (ILVs; used for wire routing), which contrasts sharply with traditional 3D integration using through-silicon vias (TSVs). ILVs enable 1,000-fold denser vertical connectivity than TSVs, which is key to greater energy efficiency. To maximize the benefits of monolithic 3D integration, logic and memory layers must be vertically interleaved to build computing and memory-access circuits adjacent to memory arrays. Thus, memory access latency and energy are reduced. Moreover, device density per unit footprint increases with additional layers despite 2D scaling difficulties.

Monolithic 3D integration requires low-temperature fabrication for the upper tiers (<400°C). This is generally difficult for silicon technologies but comes naturally with N3XT technologies. Monolithic 3D integration with vertically interleaved layers of logic and memory in arbitrary order has been experimentally demonstrated (see Figures 2h–2k), leveraging CNFETs for logic layers and RRAM for memory layers [10]. Importantly, these hardware prototypes have been fabricated directly over a starting silicon substrate, demonstrating that the N3XT approach is compatible with today's silicon technologies.

Thermal solutions

Effective thermal solutions are essential for reasons ranging from prevention of thermal runaway to maintenance of low skin temperature for mobile and wearable systems. System-level temperature management requires careful electrothermal codesign. Thermal solutions for high-performance computing platforms will require unique micro/nano heat convection and conduction solutions. Embedded cooling technologies might combine solid-state energy storage [11] and conduction media, including novel 2D materials [12]. Thermal solutions can also leverage novel micro/nanofluidic cooling, both chip-internal and chip-external, depending on the heat flux densities handled (for example, mobile versus server applications). Advanced convective structures such as copper nanomeshes and tree-like structures (see Figure 1b) can handle

[Figure 3 panel data]

(a) Baseline system

Component | Configuration | Access time | Access energy
Processing cores | 64 in-order 22 nm silicon cores | |
L1 instruction cache | 32 Kbytes silicon SRAM, 4-way set associative | 3 cycles read/write | 0.31 pJ/bit read/write
L1 data cache | 32 Kbytes silicon SRAM, 8-way set associative | 4 cycles read/write | 1.25 pJ/bit read/write
L2 cache | 32 Mbyte silicon SRAM, 8-way set associative | 23 cycles read/write | 2.06 pJ/bit read/write
Memory controllers | 8 DDR3 memory controllers | |
Main memory | 64 Gbytes off-chip DRAM | 120 cycles read/write | 52 pJ/bit read/write

(b) N3XT system

Component | Configuration | Access time | Access energy
Processing cores | 64 in-order 22 nm CNFET cores | |
L1 instruction cache | 32 Kbytes CNFET SRAM, 4-way set associative | 3 cycles read/write | 0.11 pJ/bit read/write
L1 data cache | 32 Kbytes CNFET SRAM, 8-way set associative | 4 cycles read/write | 0.41 pJ/bit read/write
L2 cache | 256 Mbyte STT-MRAM, 8-way set associative; STT-MRAM cells with CNFET access transistors and peripheral CNFET circuits | 1 ns read + 5 ns write | 1.17 pJ/bit read/write
Interconnect | ILVs + ring interconnect | |
Memory controllers | 64 memory controllers, simple custom interface | |
Main memory | 64 Gbytes on-chip 3D RRAM, 64 tiers (1 Gbyte/tier); CNFET access transistors and peripheral CNFET circuits, connected by ILVs | 5 ns read + 11 ns write | 2 pJ/bit read + 6 pJ/bit write

FIGURE 3. Baseline (a) and N3XT system (b) configurations. ILVs: interlayer vias.

heat flux densities from 10 W/cm² to 5 kW/cm². The copper matrix could also encapsulate thermal phase-change materials like paraffin to suppress thermal transients and maintain system temperature constraints [11].

N3XT BENEFITS

Our N3XT approach enables massive EDP benefits for a wide range of applications. To demonstrate this, we simulated baseline and N3XT system configurations (see Figure 3). The baseline system is similar to the many-core Intel Xeon Phi. The physical parameters in Figure 3 are derived from industrial data sheets, CNFET SPICE models (https://nano.stanford.edu/stanford-cnfet2-model) calibrated using experimental CNFET measurements, as well as energy and delay estimation tools. Ongoing investigations of the key N3XT technologies (for example, 1D and 2D FETs, STT-MRAM, RRAM, monolithic 3D, and cooling) might produce more accurate values of physical parameters used for our simulations. In the meantime, we used the most accurate values available, validated by hardware experiments where possible.

We performed detailed physical design using place-and-route tools and carefully checked the routability, timing, and power for both implementations. We used Zsim (https://github.com/s5z/zsim) and 3D-ICE (http://esl.epfl.ch/3D-ICE) for architectural and thermal simulations, respectively. Although the specific technology and architecture selections in Figure 3 allow us to validate the N3XT principles through comprehensive simulations, these selections are not exclusive and the N3XT principles remain general.

Abundant-data multicore workloads

We examined a range of multicore workloads (for example, PARSEC, Powergraph, and IBM Graph analytics benchmarks; http://systemg.research.ibm.com/analytics.html) to thoroughly assess N3XT's EDP benefits. The observed EDP gains ranged from 10× for computation-bound applications in traditional multicore benchmarks to more than 1,000× for abundant-data applications, our main target. Our analysis relied on uncustomized software implementations and compilers. Further gains might be achieved through careful software and compiler optimizations but at the cost of increased software-development effort.

Consider the traditional PageRank application, a key representative workload for abundant-data applications that is used extensively in Web search


TABLE 2. N3XT energy-delay product (EDP) benefits for simple single-core workloads.

Workload | C-style code | System | Execution time*: CPU | Execution time*: memory access | Energy consumption*: CPU | Energy consumption*: memory access | N3XT EDP benefit
Computation-dominated | for (t = 0; t <= 1; t += 0.05) { x1 = pow(t, x-1); y1 = pow(1-t, y-1); z += x1*y1; } | Baseline | 0.999 | 0.001 | 0.960 | 0.040 | 9.96×
 | | N3XT | 0.333 | 0.0006 | 0.297 | 0.0031 |
Sequential read/writes | for (i = 0; i < MAX_ITER; i++) y[i] = x[i]; | Baseline | 0.400 | 0.600 | 0.280 | 0.720 | 66×
 | | N3XT | 0.130 | 0.060 | 0.070 | 0.007 |

*Execution time and energy consumption values are normalized to the corresponding total values in the baseline case.
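The EDP benefit column of Table 2 can be approximately reproduced from the normalized entries. The following is a minimal sketch, assuming EDP is computed as (CPU energy + memory energy) × (CPU time + memory time); the `Profile` type and the function names are hypothetical and are not taken from the article's tool flow.

```c
#include <assert.h>

/* Normalized time and energy split of one workload on one system (one Table 2 row). */
typedef struct {
    double time_cpu, time_mem;
    double energy_cpu, energy_mem;
} Profile;

/* EDP = total energy consumption x total execution time. */
double edp(Profile p) {
    return (p.energy_cpu + p.energy_mem) * (p.time_cpu + p.time_mem);
}

/* EDP benefit of `improved` over `baseline`; values above 1 mean `improved` is better. */
double edp_benefit(Profile baseline, Profile improved) {
    double e = edp(improved);
    assert(e > 0.0); /* guard against division by zero */
    return edp(baseline) / e;
}
```

With the Table 2 values this gives roughly 10× for the computation-dominated kernel and roughly 68× for sequential read/writes; the small differences from the reported 9.96× and 66× are presumably due to rounding of the normalized entries.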

and social networks (benchmarked using Stanford Network Analysis Project's 16-Gbyte input dataset; http://snap.stanford.edu). We used a gather-apply-scatter graph-parallel model with edge-centric streaming implementation, which arranges the edges consecutively in memory and optimizes for sequential memory accesses. For PageRank, N3XT improves EDP by 850×: a simultaneous 23× application speedup and 37× application energy reduction (see Figure 4). The average power density for the N3XT system is 67 W/cm², and the peak temperature is 63°C (versus 65 W/cm² and 61°C for baseline).

The N3XT EDP benefits for PageRank can be further improved to 1,105× (simultaneous 65× application speedup and 17× application energy reduction), but at the cost of increased power density and peak temperature. Hence, N3XT thermal solutions are essential in this context.

To put the N3XT benefits into perspective, a TSV-based stacked 3D processor-in-memory (PIM) architecture with 22-nm silicon CMOS and eight 3D wide I/O interface (www.jedec.org/standards-documents/docs/jesd229) memory channels provides only 16× EDP benefits for PageRank. We confirmed similar N3XT benefits across other graph-processing workloads from IBM Graph analytics benchmarks.

The observed N3XT EDP benefits arise from the close proximity of computation and memory via monolithic 3D integration, in addition to energy-efficient logic implemented using CNFETs. Multicore workloads create even more opportunities for additional benefits. For multicore workloads, the processor cores must compete for memory accesses. These access contentions cause major energy-efficiency and performance bottlenecks. N3XT overcomes this by providing much greater memory bandwidth through fine-grained 3D integration; we utilize this opportunity by using 64 memory controllers (see Figure 3) that enable many concurrent memory accesses in the N3XT system. Such concurrency is essential to efficient execution of abundant-data applications with nonlocal data accesses.

We also estimated the impact of I/Os (we focus on off-chip data transfer from a socket to a different processing socket) on N3XT EDP benefits for abundant-data applications. We distributed the input (graph) data evenly across memories in different sockets, connected by the Intel QuickPath Interconnect interface. For N3XT configurations without sufficient on-chip memory, we observed EDP benefits of 613× for PageRank. With growing data volumes and processing rates, proportionately scaling conventional systems' resources and expanding them in two dimensions can be costly and risky. N3XT offers a strikingly different path to large-scale computation tightly integrated with high-capacity memory: vertically stacking computing cores and storage units to improve resource density and communication bandwidths. Further research opportunities include effective integration of multiple N3XT chips, thermal management, and corresponding system-architecture optimizations.

Simple single-core workloads

We illustrated N3XT benefits using two simple workloads executed on a single processor core of the baseline and N3XT systems: a computation-dominated kernel (beta function evaluation) and sequential memory accesses (see Table 2). These simple workloads provide insights into the sources of the previously mentioned benefits in the N3XT system. Even for simple workloads executing on a single core, the N3XT system shows significant benefits. The computation-dominated workload spends very little energy or time on memory accesses and improves EDP 10×, mainly owing to CNFETs. For the memory-access workload, memory accesses dominate execution time and energy consumption for the baseline system. However, for the N3XT system, the processor core dominates execution time and energy consumption. This is due to the improved memory system resulting from monolithic integration of 3D RRAM.

SYSTEM-LEVEL CONSIDERATIONS

We now discuss three important N3XT aspects: ensuring acceptable yield and

[Figure 4 panel data. PageRank: 850× EDP benefit]

(a) Energy consumption (37× energy reduction)

Component | Baseline | N3XT
Core active | 10.8% | 1.74%
Core idle | 46.3% | 0.03%
Caches | 8.8% | 0.46%
Memory | 34.1% | 0.24%

(b) Execution time (23× speedup)

Component | Baseline | N3XT
Active | 3% | 3.57%
Idle | 97% | 0.8%

FIGURE 4. PageRank benchmark workload on the baseline and N3XT systems: (a) energy consumption and (b) execution time. N3XT improved energy-delay product (EDP) by 850×: a simultaneous 23× application speedup and 37× application energy reduction.

reliability, positioning N3XT in the context of key energy-efficient computing concepts, and the programmability and hardware-software co-optimization of the N3XT architecture.

Yield and reliability

Fine-grained 3D integration requires a deep understanding of variability, yield, and reliability, as well as techniques to manage them at the device, circuit, and architecture levels. For example, the imperfection-immune paradigm overcomes substantial imperfections inherent in CNTs [3]. Additionally, effectively exploring the interplay between CNT variations and circuit-level energy, delay, noise margin, and functional yield enables co-optimization of CNT processing and CNFET circuit design techniques that overcome CNT variations with <10 percent circuit-level EDP impact [13]. Massive integration of nonvolatile memories requires similar strategies and new error-correction techniques that are aware of failure modes. Various integration technologies for dense 3D can offer promising yield-improvement opportunities, for example, through intermediate testing of various substrates before integrating them. At the system level, the error-tolerant nature of abundant-data applications and algorithms and the distributed nature of large-scale architectures create powerful opportunities for tolerating hardware failures using techniques at the application, architecture, and circuit levels.

Energy-efficient computing perspective

N3XT is compatible with various techniques for energy-efficient computing. For example, runtime adaptive techniques such as dynamic voltage and frequency scaling (DVFS) or power gating can also be applied to N3XT systems. Application-specific integrated circuits (ASICs) and accelerator-rich heterogeneous computing architectures can utilize energy-efficient device concepts in N3XT to achieve further benefits. Although hardware specialization through accelerators enhances computing systems' energy efficiency (compared with programmable processors), inadequate data accessibility (for example, few memory access ports or small memory capacity) limits their systemwide effect [14]. Fine-grained accesses to many memory arrays in N3XT overcome such limitations, which in turn boosts the performance benefits of hardware specialization.

Hardware–software co-optimization

Whereas the results reported earlier are for uncustomized implementation of abundant-data applications, careful codesign and co-optimization of N3XT software and hardware systems can enable even higher energy efficiency and performance. The key is achieving this objective at reasonable development costs and time. For example, significant gains are possible through algorithm–architecture codesign that explores a very large space of candidates. Domain-specific languages (DSLs) with proper compiler support provide effective approaches for such co-optimization [15], as well as efficient mapping of abundant-data applications onto N3XT hardware architecture. Key elements of such co-optimization include

›› DSLs that provide high-level software abstractions for data transformation (data wrangling), data querying, data feature generation, machine learning, graph analysis, and visualization; and
›› compilers that translate the high-level DSL abstractions into optimized code.

DSL compilers optimize computation and improve memory locality, and the optimized code can then be managed by software and hardware techniques. Such runtime support can manage task distribution, communication, synchronization, power consumption, and fault tolerance in N3XT nanosystems. Along with extensive user-level programmability, software optimizations, and reuse of standard software and hardware modules, DSL compilers also offer automatic microarchitecture selection, as well as comprehensive word-level and some bit-level optimizations.
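To make the graph-analysis workloads discussed above more concrete, the following is a minimal sketch of one PageRank iteration in an edge-centric streaming style, where the edge list is stored consecutively and read sequentially. It is an illustration only, not the implementation evaluated in the article: the `Edge` type, the function name, and the damping parameter are assumptions, and vertices without outgoing edges are not redistributed.

```c
#include <assert.h>
#include <stddef.h>

/* One directed edge; edges are stored consecutively and streamed in order. */
typedef struct { size_t src, dst; } Edge;

/*
 * One PageRank iteration over a streamed edge list.
 * rank and next hold n_vertices doubles; out_degree holds per-vertex out-degrees.
 * damping (for example 0.85) is an illustrative parameter, not from the article.
 */
void pagerank_iteration(const Edge *edges, size_t n_edges,
                        const size_t *out_degree, size_t n_vertices,
                        const double *rank, double *next, double damping) {
    assert(n_vertices > 0);
    /* Initialize every vertex with the teleport term. */
    for (size_t v = 0; v < n_vertices; v++)
        next[v] = (1.0 - damping) / (double)n_vertices;
    /* Stream the edges sequentially; each edge pushes a share of its source's rank. */
    for (size_t e = 0; e < n_edges; e++) {
        size_t s = edges[e].src, d = edges[e].dst;
        assert(s < n_vertices && d < n_vertices && out_degree[s] > 0);
        next[d] += damping * rank[s] / (double)out_degree[s];
    }
}
```

The edge reads are sequential, but the updates to `next[d]` touch memory in a data-dependent order, which is the kind of nonlocal access pattern that benefits from many concurrent memory accesses.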


ABOUT THE AUTHORS

MOHAMED M. SABRY ALY is a postdoctoral scholar in electrical engineering at Stanford University. Contact him at [email protected].

MINGYU GAO is a PhD candidate in [...] at Stanford University. Contact him at [email protected].

GAGE HILLS is a PhD candidate in electrical engineering at Stanford University. Contact him at [email protected].

CHI-SHUEN LEE is a PhD candidate in electrical engineering at Stanford University. Contact him at [email protected].

GREG PITNER is a PhD candidate in electrical engineering at Stanford University. Contact him at [email protected].

MAX M. SHULAKER is a PhD candidate in electrical engineering at Stanford University. Contact him at maxms@stanford.edu.

TONY F. WU is a PhD candidate in electrical engineering at Stanford University. Contact him at [email protected].

MEHDI ASHEGHI is a consulting associate professor of mechanical engineering at Stanford University. Contact him at [email protected].

JEFF BOKOR is the National Semiconductor Distinguished Professor of Engineering in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. Contact him at [email protected].

FRANZ FRANCHETTI is an associate research professor of electrical and computer engineering at Carnegie Mellon University. Contact him at [email protected].

KENNETH E. GOODSON is the Bosch Chairman and the Davies Family Provostial Professor of Mechanical Engineering at Stanford University. Contact him at [email protected].

CHRISTOS KOZYRAKIS is an associate professor of electrical engineering and [...] at Stanford University. Contact him at [email protected].

IGOR MARKOV is a professor of electrical engineering and computer science at the University of Michigan, Ann Arbor. Contact him at [email protected].

KUNLE OLUKOTUN is the Cadence Design Systems Professor of electrical engineering and computer science at Stanford University. Contact him at [email protected].

LARRY PILEGGI is the Tanoto Professor of Electrical and Computer Engineering at Carnegie Mellon University. Contact him at [email protected].

ERIC POP is an associate professor of electrical engineering at Stanford University. Contact him at [email protected].

JAN RABAEY is the Donald O. Pederson Distinguished Professor of Engineering in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. Contact him at [email protected].

CHRISTOPHER RÉ is an assistant professor of computer science at Stanford University. Contact him at chrismre@cs.stanford.edu.

H.-S. PHILIP WONG is the Willard R. and Inez Kerr Bell Professor in the School of Engineering and a professor of electrical engineering at Stanford University. Contact him at hspwong@stanford.edu.

SUBHASISH MITRA is an associate professor of electrical engineering and computer science at Stanford University. Contact him at [email protected].

N3XT promises major EDP benefits for wide-ranging applications, especially abundant-data workloads presented by big-data processing and the myriad sensors that produce a massive data deluge. N3XT enables unprecedented computing capabilities. It is a major IT leap and is crucial for addressing several of the National Academy of Engineering's 21st century grand challenges. N3XT is urgent because existing system technologies and architectures have hit major obstacles; as a result, the future of computing faces formidable challenges, especially with the slowdown of traditional integrated circuit scaling.

N3XT is an integrated approach spanning emerging logic switches and memories, fine-grained 3D integration, cooling solutions, computer architecture, and software. Although significant research is required to realize the massive N3XT benefits that we outlined, experimental hardware prototypes and detailed simulations using physical layouts and hardware-calibrated models clearly indicate that N3XT's technical challenges are tractable. The specific N3XT implementation details presented here are not exclusive, but they can guide the design of more compelling systems.

We primarily focused on von Neumann–style computing platforms to capitalize on the large body of software technologies and their upcoming enhancements. However, we expect that several N3XT features, including energy-efficient logic, massive memory capacity, and densely integrated computation and memory, will significantly influence architectures targeting other computation models, such as brain-inspired architectures.

ACKNOWLEDGMENTS

This work was supported in part by the National Science Foundation, the STARnet Systems on Nanoscale Information fabriCs Center (one of six SRC STARnet Centers sponsored by Microelectronics Advanced Research Corp. and DARPA), the Swiss National Science Foundation Early Postdoc.Mobility Fellowship to Mohamed Sabry Aly, the Hertz/SGF to Max Shulaker, and the Stanford University SystemX Alliance. We also acknowledge IBM Research investigators for their collaboration on CNFET modeling.
