To Appear in Proc. of The 28th Annual International Symposium on Computer Architecture, June 2001.

NanoFabrics: Spatial Computing Using Molecular Electronics

Seth Copen Goldstein and Mihai Budiu

Carnegie Mellon University g fseth,mihaib @cs.cmu.edu

Abstract compile time (which is inexpensive) for manufacturing pre- cision (which is expensive). We achieve this through a com- The continuation of the remarkable exponential in- bination of reconfigurable computing, defect tolerance, ar- creases in processing power over the recent past faces immi- chitectural abstractions and compiler technology. The result nent challenges due in part to the of deep-submicron will be a high-density low-power substrate which will have CMOS devices and the costs of both chip masks and future inherently lower fabrication costs than CMOS counterparts. fabrication plants. A promising to these problems Using EN to build computer systems requires new ways is offered by an alternative to CMOS-based computing, of thinking about computer architecture and compilation. chemically assembled electronic (CAEN). CAEN differs from CMOS: CAEN is extremely unlikely In this paper we outline how CAEN-based computing to be used to construct complex aperiodic structures. We can become a reality. We briefly describe recent work introduce an architecture based on fabricating dense regu- in CAEN and how CAEN will affect computer architec- lar structures, which we call nanoBlocks, that can be pro- ture. We show how the inherently reconfigurable nature grammed after fabrication to implement complex functions. of CAEN devices can be exploited to provide high-density We call an array of connected nanoBlocks a nanoFabric. chips with defect tolerance at significantly reduced manu- Compared to CMOS, CAEN-based devices have a facturing costs. We develop a layered abstract architecture higher defect density. Such circuits will thus require built-in for CAEN-based computing devices and we present prelim- defect tolerance. A natural method of handling defects is to inary results which indicate that such devices will be com- first configure the nanoFabric for self-diagnosis and then to petitive with CMOS circuits. implement the desired functionality by configuring around 1 Introduction the defects. Reconfigurabilty is thus integral to the oper- ation of the nanoFabric. Their nature makes nanoFabrics We are approaching the end of a remarkably success- particularly well suited for reconfigurable computing. ful era in computing: the era where Moore’s Law reigns, Reconfigurable computing changes as needed the func- where processing power per dollar doubles every year. This tion of programmable logic elements and their connections success is based in large part on advances in complemen- to storage, building efficient, highly parallel processing ker- tary metal-oxide semiconductor (CMOS)-based integrated nels, tailored for the application under execution. The net- circuits. Although we have come to expect, and plan for, work of processing elements is called a reconfigurable fab- the exponential increase in processing power in our every- ric. The data used to program the interconnect and process- day lives, today Moore’s Law faces imminent challenges ing elements is called a configuration. Examples of current both from the physics of deep-submicron CMOS devices reconfigurable fabrics are commercial Field Programmable and from the costs of both chip masks and next-generation Gate Arrays (FPGAs) such as [39, 2], and research proto- fabrication plants. types, e.g. Chimaera [40] and PipeRench [18]. As we show A promising alternative to CMOS-based computing un- later, one advantage of nanoFabrics over CMOS-based re- der intense investigation is chemically assembled electronic configurable fabrics is that the area overhead for supporting nanotechnology (CAEN), a form of electronic nanotechnol- reconfiguration is virtually eliminated. This will magnify ogy (EN) which uses self-alignment to construct electronic the benefits of reconfigurable computing, yielding comput- circuits out of nanometer-scale devices that take advantage ing devices that may outperform traditional ones by orders of quantum-mechanical effects [10, 30]. In this paper we of magnitude in many metrics, such as computing elements

show how CAEN can be harnessed to create useful compu- per cm ¾ and operations per watt. ½¼ tational devices with more than ½¼ gate-equivalents per In the next section we present some recent research re- cm¾ . The fundamental strategy we will use is to substitute sults, which indicate CAEN will be a successful technol- ogy for implementing computing devices. We next analyze out CMOS transistor2. A simple logic gate or an static

how the unique features of CAEN devices can be exploited, memory cell requires several transistors, separate p- and n- 5 and how their limitations can be circumvented. In Sec- wells, etc., resulting in a factor of ½¼ difference in density tion 3 we present an architecture that utilizes the capabil- between CAEN and CMOS. CAEN devices use much less ities of CAEN devices without requiring Herculean fabri- power, since very few electrons are required for switching. cation technology. In Section 4 we describe our top-level CAEN devices are particularly suited for reconfigurable architectural abstraction, the Split-Phase Abstract Machine computing since the configuration information for a switch (SAM), which enables fast compilation of large programs. does not need to be stored in a device separately from the The simulation results in Section 5 indicate that the SAM switch itself [10]. On the other hand, A CMOS-based re- abstraction does not hide the efficiency of the nanoFabric. configurable device requires a static RAM cell to controls each pass transistor. Also, two sets of wires are needed in 2 Electronic Nanotechnology CMOS: one for addressing the configuration bit and one for the actual signals. Perhaps a more realistic comparison of While CMOS fabrication will soon hit a wall due CAEN is to floating-gate technology which also stores the to a combination of economic and technical factors, 1 configuration information at the transistor itself. Like tradi- we are nowhere near the theoretical limits of physical tional transistors, floating-gate transistors also require two computation[17]. In order to achieve these limits at room sets of wires. Furthermore, A CAEN switch behaves like temperature we must work in the nanoscale regime, which a diode, but a floating gate transistor is bi-directional, and currently involves various technologies that exploit the thus less useful for building programmable logic. quantum-mechanical effects of small devices. Among these Electronic nanotechnology is quickly progressing and are single-electron transistors [9], nanowire transistors [11], promises incredibly small, dense, and low-power devices. quantum dots [37], quantum cellular automata [23], res- Harnessing this power will require new ways of thinking onant tunneling devices [6], negative differential resistors about the manufacturing process. We will no longer be (NDR) [7], and reconfigurable switches [8, 10]. In all cases, able to manufacture devices deterministically; instead, post- the fabricated devices are on the order of a few nanome- fabrication reconfiguration will be used to determine the ters. Given the small sizes involved, these devices must properties of the device and to avoid defects. be created and connected through self-assembly and self- alignment instead of lithography. 2.1 Fabrication and Architectural Implications In this paper we limit ourselves to molecular devices which have I-V characteristics similar to those of their Here we briefly outline a plausible fabrication process. bulk counterparts. For example, the basis of rectifica- The process is hierarchical, proceeding from basic compo- tion is different between a silicon-based p-n junction diode nents (e.g. wires and switches), through self-assembled ar- and a molecular diode, yet they both have similar I-V rays of components, to complete systems. In the first step, curves [4]. We choose to look at systems that can be built wires of different types are constructed through chemical from nanoscale devices with bulk-semiconductor analogs so self-assembly.3 The next step aligns groups of wires. Also that (1) we can apply our experience with standard circuits through self-assembly, two planes of aligned wires will be to the system and (2) we can model the system with stan- combined to form a two-dimensional grid with configurable dard tools such as SPICE. molecular switches at the crosspoints. The resulting grids While there are still many challenges left in creating will be on the order of a few microns. A separate process fully functional EN computing devices, recent advances in- will create a silicon-based die using standard lithography. dicate that EN could be a very successful post-CMOS tech- The circuits on this die will provide power, clock lines, an nology. Several groups have recently demonstrated CAEN I/O interface, and support logic for the grids of switches. devices that are self-assembled or self-aligned (or both) [10, The die will contain “holes” in which the grids are placed, 28, 16, 32]. Advances have also been made in creating aligned, and connected with the wires on the die. wires out of single-wall carbon nanotubes and aligning Using only self-assembly and self-alignment restricts them on a silicon substrate [36, 29]. Even more practical us to manufacturing simple, regular structures, e.g., rafts is the fabrication of metal nanowires, which scale down of parallel wires or grids composed of orthogonal rafts. to 5nm and can include embedded devices [25, 27]. Com- 2For the CAEN device we assume that the nanowires are on 10nm cen- bined,these advances compel us to investigate further how ters. A CMOS transistor with a 4:1 ratio in a 70nm process, (even using to harness CAEN for computing in the post-CMOS age. Silicon on Insulator, which does not need wells) with no wires attached

CAEN devices are very small: A single RAM cell will measures 210nm x 280nm. Attaching minimally-sized wires to the termi- ¾

¾ nals increases the size to 350nm x 350nm.

½¼¼; ¼¼¼ require ½¼¼nm as opposed to nm for a single laid 3By chemical self-assembly we mean a process by which the com- 1Among the technical issues are ultrathin gate oxides, short channel ponents (e.g., wires or devices) are synthesized and connected together effects, and doping fluctuations [21]. through chemical processes. See [24] for an overview.

2 (a) (b) (c) Single nanoBlock showing I/O Half of a switchblock Switchblock with four surround- lines ing nanoBlocks Figure 1. NanoBlock Connectivity.

A post-fabrication configuration is used to create use- 3 NanoFabric ful circuits (See Section 3.3). The small size and non- deterministic nature of the self-assembly will also give rise to high defect densities, which can be bypassed through re- The architecture of the nanoFabric is designed to over- configuration (See Section 3.2). come the constraints associated with directed assembly of nanometer-scale components and exploit the advantages While researchers have constructed three-terminal EN of molecular electronics. Because the fabrication process devices, the precise alignment required to colocate three is fundamentally non-deterministic and prone to introduc- wires at the device makes them unsuitable for producing ing defects, the nanoFabric must be reconfigurable and real circuits with inexpensive chemical assembly. We thus amenable to self-testing. This allows us to discover the assume that CAEN devices will be limited to performing characteristics of each nanoFabric and then create circuits logic using two terminal devices; i.e. diode-resistor logic that avoid the defects and use the available resources. For (see Section 3.1). As the active components will be diodes the foreseeable future, the fabrication processes will only and configurable switches, there will be no inverters. Be- produce simple, regular geometries. Therefore, the pro- cause we cannot build inverters, all logic functions will gen- posed nanoFabric is built out of simple, two-dimensional, erally compute both the desired output and its complement. homogeneous structures. Rather than fabricating complex circuits, we use the reconfigurability of the fabric to imple- Even more important, the lack of a transistor means that ment arbitrary functions post-fabrication. The construction special mechanisms will be required for signal restoration process is also parallel; heterogeneity is introduced only and for building registers. Using CMOS to buffer the sig- at a lithographic scale. The nanoFabric can be configured nals is unattractive for two reasons: first, CMOS transis- (and reconfigured) to implement any circuit, like today’s tors are significantly larger and would decrease the density FPGAs; the nanoFabric though has several orders of mag- of the fabric. Second, the large size of CMOS transistors nitude more resources. would slow down the nanoFabric. We have succesfully de- The nanoFabric is a 2-D mesh of interconnected signed and simulated a molecular latch motivated by work nanoBlocks. The nanoBlocks are logic blocks that can be in tunnel diodes [1, 26]. The latch is composed of a wire programmed to implement a three-bit input to three-bit out- with two inline NDR at either end. The latch put Boolean function and its complement (see Figure 1a). combined with a clocking methodology, provides signal NanoBlocks can also be used as switches to route signals. restoration, latching, and I/O isolation [31]. The require- The nanoBlocks are organized into clusters (See Figure 2). ment to condition signals will result in circuits which will Within a cluster the nanoBlocks are connected to their near- be either slow (if transistors are used) or deeply pipelined est four neighbors. Long wires, which may span many clus- (if latches are used). ters (long-lines), are used to route signals between clus- ters. The nanoBlocks on the perimeter of the cluster are The fabrication process also disallows the precise align- connected to the long-lines. This arrangement is similar to ment required to make end-to-end connections between commercial FPGAs (allowing us to leverage current FPGA nanoscale wires. Our architecture ensures that all connec- tools) and has been shown to be flexible enough to imple- tions between nanoscale wires occur by crossing the wires. ment any circuit on the underlying fabric.

3 u ihgahcdmnin(.,getrta n micron), one mini- than the greater (e.g, than dimension larger lithographic mum are themselves nanoBlocks adjacent nanoFabric. the the the onto in Since netlist run circuit any NW map the can we and diagonal diagonal SE one the north all in that or run such (SE) blocks blocks the east arranging and By south (NW). west facing and either are and blocks input the the of where area a overlap the wires call to output We connect nanoBlock another. one of of inputs outputs the the (b) how 1 show Figures (c) alignment. and end-to-end precise wires; need is orthogonal not two do 1a we between Figure wire-to- made nanoscale in are all connections arrangement that wire so I/O it the arranged have outputs, We Thus, or required. inputs have both. can not block the but of side Each straints. h aolc eini ittdb arcto con- fabrication by dictated is design nanoBlock The 2. Figure lwpo igecutradsm fteadjacent the of some and cluster single a of blowup long-lines. n west. and south on Outputs east and north on Inputs Control, configuration decompression, & defect mapping seed iue3. Figure Ground Inline NDRs h aoto h aoarcwt partial a with nanoFabric the of layout The C Φ Φ Φ Φ ceai fananoBlock. a of schematic A wthblock switch MLA +Vdd oieta h outputs the that Notice . rmteCO layer. CMOS the from connections indicate regions Stripped oe rmclock. from power with signals restore also that NDRs inline by latched are Outputs C Gnd Φ Φ Φ Φ Inline NDRs switch-block nanoBlock long-lines cluster 4 fceia assembly. nature chemical parallel of the as- exploit can the all assemble we hierarchically for we nanoFabric because devices Finally, operation. molecular circuit requirements use of power pects we The because fabrics. low large remain very reason- for remain even to able times configuration allowing configured be parallel, to designed in is cluster each routability Second, supports netlists. This of of clusters. long- number the between of the run number that as the lines increase First, can we increases ways. components several in scalability mote requirements. power nanoFabric its the lowers of and density the compo- increases again CMOS back and and nanoscale nents between that go fact to The need never communication nanoscale. we the all at that occurs to nanoBlocks Notice amenable between and gen- [5]. efficient This be tools to [33]. place-and-route shown FPGA been style has island layout layout an eral more This of or that switches. one essentially any 8 traverse is through and to going 1,2,4, signal without (e.g., a clusters lengths allowing varying long), in of clusters nanowires be will The low-latency tracks provide these distances. to longer clusters over the communication between run that lines in time manufacturing patterns. at desired precisely the positioned be can they o odcig id ilb owr isd n will and biased, forward be and will biased reverse 8 be their Diode will then 6 conducting. respec- low, and not high 5 is 2, and B 1, low diodes and diode-resistor Thus, are high tively. of B-bar) is typical and A is (A-bar if B complements circuit, example, and This For A of output. logic. of XOR sum portion the the the is of generate schematic which to a used is 5 circuit Figure the half-adder. of a part for top implementation the On the shows 5 Figure while OR, AND, ADDER. e.g. and nanoBlocks, XOR, using for designs library preliminary cell” have “standard We a plane. AND requires an which and array, OR logic an programmable dif- a as significantly for act is than “on”, MLA ferent the be for to circuits configured Designing when diodes. switches, molecular The configurable a switch. lies At wires two wires. of of intersection sets each orthogonal two of composed is nanoBlock neighbors its block. to switch nanoBlock the the through connect (3) to and used area, implementation, I/O circuit sig- the and sequential restoration for signal latching for nal used latches, the the (1) is (2) block 3): located, the Figure of functionality (see the sections where array, three logic of molecular composed is It ric. . NanoBlock 3.1 h ragmn ftecutr n h oglnspro- long-lines the and clusters the of arrangement The long- are there routing intra-cluster the to addition In iue4sosteipeetto fa N gate, AND an of implementation the shows 4 Figure a of portion (MLA) array logic molecular The nanoFab- the of unit fundamental the is nanoBlock The V AND rows can be moved anywhere within the block without af- A B fecting the circuit, which makes defect tolerance signifi- 4 A cantly easier than with CMOS . The number of possible B ways to arrange the columns/rows in the MLA combined A^B with the configurable crossbar implemented by the switch

A^B block makes the entire design robust to defects in either the switch block or the MLA. Figure 4. An AND gate implemented in the MLA of a nanoBlock. 3.2 Defect Tolerance

The nanoFabric is defect tolerant because it is regular, highly configurable, fine-grained, and has a rich intercon- nect.5 The regularity allows us to choose where a particular function is implemented. The configurability allows us to pick which nanowires, nanoBlocks, or parts of a nanoBlock will implement a particular circuit. The fine-grained nature of the device combined with the local nature of the intercon- nect reduces the impact of a defect to only a small portion of the fabric (or even a small portion of a single nanoBlock). Finally, the rich interconnect allows us to choose among many paths in implementing a circuit. Thus, with a defect map we can create working circuits on a defective fabric. Defect discovery relies on the fact that we can configure the nanoFabric to implement any circuit, which implies that we can configure the nanoFabric to test its own resources. Figure 5. A half-adder implemented in the MLA of a The key difficulty in testing the nanoFabric (or any nanoBlock and the equivalent circuit diagram for the FPGA) is that it is not possible to test the individual compo- computation of A XOR B = S. nents in isolation. Researchers on the Teramac project [3] faced similar issues. They devised an algorithm that al- lowed the Teramac, in conjunction with an outside host, pull the line labeled “red” in the figure down close to a logic to test itself [12, 13]. Despite the fact that over 75% of low. This makes diode 4 forward biased. By manufacturing the chips in the Teramac have defects, the Teramac is still the resistors appropriately (i.e., resistors attached to Vdd used today. The basis of the defect mapping algorithm is to have smaller impedances than those attached to Gnd) most configure a set of devices to act as tester circuits. These cir- of the voltage drop occurs across R1, resulting in S being cuits, e.g. linear-feedback shift-registers, will report a result high. If A and B are both low, then diodes 2, 4, 5 and 7 are which if correct indicates with high probability that devices back-biased. This isolates S from Vdd and makes it low. they are made from are fault free. The MLA computes logic functions and routes signals The defect mapping process is complicated by the fact using diode-resistor logic. The benefit of this scheme is that there are no known fault-free regions in the nanoFabric at we can construct it by directed assembly, but the drawback the start of testing and as the defect density increases the is that the signal is degraded every time it goes through a chances of finding a good circuit decreases. To address configurable switch. In order to restore signals to proper this problem, we will implement in CMOS the components logic values without using CMOS gates, we will use the of a basic tester, which will provide the initial fault-free molecular latch described in Section 2.1. region. A host computer configures testers out of the Notice that all the connections between the CMOS layer CMOS testor and portions of the nanoFabric. As the and the nanoBlock occur either between groups of wires or host gains knowledge of the fault free regions of the with a wire that is removed from all the other components. nanoFabric it replaces more and more of the CMOS tester This improves the device density of the fabric. To achieve with configured sections of the nanoFabric. a specific functionality the cross-points are configured to be 4This is because nanowires have several orders of magnitude less re- either open-connections or to be diodes. sistance than switches. Thus, the timing of a block is unaffected by small The layout of the MLA and of the switch block makes permutations in the design. 5 rerouting easy in the presence of faults. By examining Fig- Defect tolerance through configuration also depends on shorts being significantly less likely to occur than stuck-open faults. We arrange for this ure 5, one can see that a bad switch is easily avoided by by biasing the synthesis techniques to increase the likelihood of a stuck- swapping wires that only carry internal values. In fact, the open fault at the expense of potentially introducing more total faults.

5 After a sufficient number of functioning resources have each nanoscale wire separately in space we address them been discovered by the host, phase two of the testing begins. separately in the time dimension. This slows down the con- For this phase, the host downloads a configuration, which figuration time, but increases the device density.

performs the testing from within the fabric. In other words, Our preliminary calculations indicate that we can load 9

the already tested area of the fabric acts as a host for testing the full nanoFabric, which is comprised of ½¼ configura-

¾ ½¼ the remainder of the fabric. Because the number of tests re- tion bits at a density of ½¼ configuration bits/cm ,in quired to isolate any specific defect does not grow as the to- less than one second. This calculation is based on realis- tal size of the device grows, the computational work needed tic assumptions that, on average, fewer than 10% of the bits to test a device is at worst linear in the size of the device. are set ON and that the configurations are highly compress- For very large devices, such defect discovery can be accom- ible [19]. It also significant to note that it is not necessary to plished using many parallel independent test machines. configure the full fabric for defect testing. Instead, we will Once a defect map has been generated the fabric can be configure only the portions under test. used to implement arbitrary circuits. The architecture of As the configuration is loaded onto the fabric it will be

the nanoBlock supports full utilization of the device even in used to configure the nanoBlocks. Using the configuration ¿¼¼ the presence of a significant number of defects. Due to the decoder this will require  cycles per nanoBlock, or way we map logic to wires and switches, only about 20% of less than 38K cycles per cluster. Thus, the worst-case time the switches will be in use at any one time. Since the inter- to configure all the clusters at a very conservative 10 MHz nal lines in a nanoBlock are completely interchangeable, we requires three seconds. generally should be able to arrange the switches that need to be configured in the ON state to be on wires which avoid 3.4 Putting It All Together the defects. The nanoFabric is a reconfigurable architecture built out While the molecules are expected to be robust over time, inevitably new defects will occur over time. Finding these of CMOS and CAEN technology. The support system, i.e., power, ground, clock, and configuration wires, I/O, and ba- defects, however, will be significantly easier than doing the original defect mapping because the unknown defect den- sic control, is implemented in CMOS. On top of the CMOS sity will be very low. we construct the nanoBlocks and long-lines constructed out of chemically self-assembled nanoscale components. 3.3 Configuration Assuming a 100nm CMOS process and 40nm cen- ters with 128 blocks to a cluster and 30 long-lines per

The nanoFabric uses runtime reconfiguration for defect channel, our design should yield 750K clusters/cm ¾ (or

¾ ½¼ testing and to perform its intended function. Thus, it is es- 1M blocks/cm ), requiring ¿ configuration bits. If the sential that the time to configure the fabric scale with the nanoscale wires are on 10nm centers this design yields 1M size of the device. There are two factors that contribute to clusters/cm¾ . (If, instead of molecular latches, transistors the configuration time. The first factor is the time that it were used for signal restoration then with 40nm centers for takes to download a configuration to the nanoFabric. The the nanoscale wires we obtain 180K cluster/cm ¾ .) second factor is the time that it takes to distribute the config- SPICE simulations show that a nanoBlock configured to uration bits to the different regions of the nanoFabric. Con- act as a half-adder can operate at between 100MHz and

figuration decoders are required to serialize the configura- 1GHz. Preliminary calculations show that the fabric as a

½:¾

tion process in each nanoBlock. To reduce the CMOS over- whole will have a static power dissipation of  watts

:4 head, initially we intend to only configure one nanoBlock and dynamic power consumption of  watts at 100Mhz. per cluster at a time. However, the fabric has been designed so that the clusters can be programmed in parallel. A very 4 The NanoFabric Model conservative estimate is that we can simultaneously config- ure one nanoBlock in each of 1000 clusters in parallel. There are two scenarios in which nanoFabrics can be A molecular switch is configured when the voltage used: as factory-programmable devices configured by the across the device is increased outside the normal operat- manufacturer to emulate a processor or other computing de- ing range. Devices in the switch blocks can be configured vice, and as reconfigurable computing devices. directly by applying a voltage difference between the long In a manufacturer-configured device, user applications intercluster lines. In order to achieve the densities pre- treat the device as a fixed processor (or potentially as a small sented above, it will also be necessary to develop a con- number of different processors). Processor designers will figuration approach for the switches in the MLA that is use traditional CAD tools to create designs using standard implemented with nanoscale components. In particular, a cell libraries. These designs will then be mapped to a partic- nanoscale decoder is required to address each intersection ular chip, taking into account the chip’s defects. A finished of the MLA independently. Instead of addressing accessing product is thus a nanoFabric chip and a ROM containing the

6 configuration for that chip. In this mode, the configurability (TAM) thread [15], communicates with other threads asyn- of the nanoFabric is used only to accommodate a defect- chronously using split-phase operations. This partitioning prone manufacturing process. While this provides the sig- allows the CAD tools to concentrate on mapping small iso- nificant benefits of reduced cost and increased densities, it lated netlists and it has all the mechanisms required to sup- ignores much of the potential in a nanoFabric. Since de- port thread-based parallelism. fect tolerance requires that a nanoFabric be reconfigurable Unlike a traditional thread model, where a thread is asso- why not exploit the reconfigurability to build application- ciated with a processor when executing, each SAM thread specific processors? will be a custom “processor.” While it is possible for a Reconfigurable fabrics offer high performance and effi- thread to be complex and load “instructions” from its local ciency because they can implement hardware matched to store, the intention is that it remains fairly simple, imple- each application. Further, the configurations are created at menting only a small piece of a procedure. This allows the compile time, eliminating the need for complex control cir- threads to act either in parallel or as a series of sequential cuitry. Research has already shown that the ability of the processes. It also reduces the number of timing constraints compiler to examine the entire application gives a reconfig- on the system, which is vital for increasing defect tolerance urable device efficiency advantages because it can: and decreasing compiler complexity.

¯ exploit all of an application’s parallelism: MIMD, The SAM model is a simplification of TAM. A SAM SIMD, instruction-level, pipeline, and bit-level. thread/processor is similar to a single-threaded codeblock in

¯ create customized function units. TAM, and memory operations in SAM are similar to mem-

¯ eliminate a significant amount of control circuitry. ory operations in Split-C [14]. In a sense, SAM will im-

¯ reduce memory bandwidth requirements. plement (in reconfigurable hardware) Active Messages [38]

¯ size function units to the application’s natural word size. for all interprocessor communications. While this model is

¯ use partial evaluation and constant propagation to reduce powerful enough to support multi-threading, in this paper the complexity of operations. we use it only as a way of partitioning large sequential pro- However, this extra performance comes at the cost of sig- grams in space on a reconfigurable nanoFabric.

nificant work by the compiler. A conservative estimate for While SAM can support parallel computation, a paral- ¾ the number of configurable switches in a ½cm nanoFabric, lelizing compiler is not necessary. The performance of this

including all the overhead for buffers, clock, power, etc. is model rests on the ability to create custom processors. A ½½ on the order of ½¼ . Even assuming that a compiler manip- compiler could (and in its first incarnations will) construct ulates only standard cells, the complexity of mapping a cir- machines in which only one processor is active at a time. cuit design to a nanoFabric will be huge, and this creates a Later, as the compiler technology becomes more mature, compilation scalability problem. Traditional approaches to the inherently parallel nature of the model can be exploited. place-and-route in particular will not scale to devices with The SAM model explicitly hides many important details. billions of wires and devices. For example, it neither addresses dynamic routing of mes- In order to exploit the advantages listed above, we sages nor allocation of stacks to the threads. Once an ap- propose a hierarchy of abstract machines that will hide plication has been turned into a set of cooperating SAM complexity and provide an intellectual lever for compiler threads it is mapped to a more concrete architectural model designers while preserving the advantages of reconfigurable which takes these issues into account. The mapping pro- fabrics. At the highest level is a split-phase abstract ma- cess will, when required, assign local stacks to threads, in- chines (SAM), which allows a program to be broken up into sert circuits to handle stack overflow, and create a network autonomous units. Each unit can be individually placed and for routing messages with runtime computed addresses. For routed and then the resulting netlist of pre-placed and routed messages with addresses known at compile time it will route units, can be placed. This hierarchical approach will allow signals directly. the CAD tools to scale. In this paper we present simulations of programs that have been decomposed into SAM threads. 5 SAM Simulations 4.1 The Split-Phase Abstract Machine (SAM) In this section we describe a limit study designed to determine the performance of applications mapped to the The compilation process starts by partitioning the appli- SAM model. Our goal was to determine how aggressive cation into a collection of threads. Each thread is a sequence compiler technology needs to be for the nanoFabric to com- of instructions ending in a split-phase operation. An opera- pete with a CMOS processor. We estimate the area and tion is deemed to be a split-phase operation if it has an un- simulate the execution time of applications using the sim- predictable latency. For example, memory references and plest SAM model, i.e. each thread executes sequentially, procedure calls are all split-phase operations. Thus, each and we do not perform any reconfigurable computing opti- thread, similar in spirit to a Threaded Abstract Machine mizations, e.g., loops are not turned into pipelines.

7 In this study we analyze the behavior of programs from in the second one)7. Transfers of control resulting from pro- the SpecInt95 [35] and Mediabench [22] 6 benchmark suites. cedure calls are weighted with the total size of the proce- We assume no parallel execution or pipelining between in- dure arguments. An edge between a basic block node and dependent SAM threads. a memory node is weighted with the number of accesses made from the block to that address. 5.1 Area requirements This graph is next placed on a two-dimensional grid. We strive to place nodes connected by heavy edges together. We define the unit of area as the area of the implemen- Assuming that the signal propagation time between two tation of one memory word (4 bytes). We assume that each nodes is proportional to their distance, such a placement integer operation can be also implemented in 1 unit of area. will minimize signal latency. We do the placement in two Integer multiplication and division and floating-point oper- stages: clustering and placement. ations have a substantially larger area. These assumptions

¯ In the first stage nodes are clustered together into super- are overly pessimistic for memory, which can probably be nodes of approximately the same weight (the weight of a packed more densely. We make the very conservative as- super-node being the sum the weights of all the component sumption that 1 unit of area is a single cluster. In fact, a nodes). Each supernode is 100 units of area. The cluster- cluster will probably be able to map between 1 and 64 units. ing is done so to minimize the total edge weight between For our benchmarks the total area was between 2,000 the super-nodes. For this purpose we use the METIS [20] and 250,000 units (see Figure 7), which fits liberally in the graph partitioning tool. available hardware budget. This area includes only the ex-

¯ In the second stage we place the super-nodes on a two- ecuted instructions and touched memory words. Dead code dimensional grid. We assume that each super-node is and unused memory is excluded from this count, but their a 10x10 square. We next repeatedly “glue” pairs of total size is small. super-nodes (obtaining progressively larger rectangles and 5.2 Simulation Methodology squares: 10x10, 10x20, 20x20, 20x40, 40x40, etc). The ends of the heavier edges are coalesced first. For each pair We compare the execution time of the program running of nodes we choose the relative orientation that minimizes native on the CPU (Alpha), and running on a nanoFabric. the total edge cost (i.e. the nodes can be flipped or rotated). We include application and library running time. We ignore 5.4 Trace-based simulation time spent in the operating system. The simulation consists of two phases: trace collection and analysis, and trace-based The second phase of the simulation runs the - simulation. instrumented binary, generating a trace of all important 5.3 Trace Collection and Analysis events: control transfer between blocks and memory read operations. We use the same input data that we used for the Each application to be simulated is compiled using gcc first phase. and optimized with -O4 on the Alpha. The ATOM tool [34] We use this trace to evaluate the running time of the pro- is used to instrument the program for trace collection and gram when implemented as a circuit whose layout is the one summarization. We instrument each basic block, procedure computed by our placement algorithm. The running time of call and memory access. a node contains the following components:

From the trace data we create a weighted undirected ¯ The time to pass control between the source to the des- graph (with weights both on nodes and edges). Each node tination. This is proportional to the Manhattan distance represents either a SAM thread or a memory location and between the two super-nodes where these nodes belong. its weight is the estimation of the area it requires on the ¯ The time to transfer data from the source node to the des- nanoFabric. The weight of an edge represents the total data tination. The registers which are live at the start of the traffic carried along that edge. Each basic block and each node and used within are assumed to come from the pre- memory address becomes a node. We eliminate some of vious node. We assume that the data is pipelined after the aliasing of the memory locations by assuming that the stack message which transfers control from the source, and each frame of each function is a different memory region. additional register value takes one clock cycle.

An edge between two basic blocks is weighted with the ¯ The time to execute the node itself. We assume full hard- number of data values transferred across that edge (i.e. the ware ILP (i.e. all independent instructions are executed in number of register values defined in the first block and used parallel). The time to execute the node is thus the time to execute the critical path. The critical path may depend on 6The other programs from these suites did not work properly with our simulation infrastructure; we present all the applications we could success- 7For a faster calculation, we approximate this value with the number of fully instrument and analyze. registers used in the second basic block.

8 ½ Figure 6. Simulated slowdown (simulated/base ). Figure 7. Total area for instructions and memory.

Figure 8. Average clock cycles per control transfer. Figure 9. Average clock cycles per memory read.

Figure 10. Breakdown of execution time for 1 Figure 11. Breakdown of execution time for 5 clock/square. Idle time is the time spent waiting for clocks/square. memory read operations to complete. dynamic information, like the memory addresses that are with the read address, and the opposite direction with the accessed. read data).

¯ Each write instruction takes one cycle and is completed asynchronously. We assume ordered message delivery. All We use the following model for the execution of an in- reads to the same address should be executed after the write struction: completes, even if they occur in different nodes.

¯ Each elementary instruction takes one clock cycle. ¯ For floating-point operations we use the double of the ¯ Each read operation takes time proportional to the Man- latency of the same instruction as executed on Alpha. hattan distance to the memory address which is accessed (the signal needs to propagate both ways: one direction

9 available on the nanoFabric. Spilled registers become un- necessary costly memory operations.

¯ We allocate a lot of area for each memory cell. 5.6 Results

Figure 7 shows that the hardware area used for each pro- gram, including memory. The unit of memory is the area taken by a simple integer instruction. We notice that all the

programs will fit within the hardware resources of a 1cm ¾ Figure 12. The placed graph for g721 e. nanoFabric. Before discussing other aspects of the performance of 5.5 Comments our proposed implementation, we should notice some char- acteristics of the program graph: although it is sparse (the In some respects our simulation methodology is overly average node degree is less than 10), the node degree distri- optimistic, in others it is pessimistic. Our goal here is to val- bution is very skewed (each graph has a few large “stars”). idate the assumption that we can map an application com- Figure 12 displays one of the smallest graphs, where each pletely to hardware. In practice, more aggressive compila- square is a super-node, obtained after clustering (for read- tion and resource sharing would change the resulting con- ability). The shading of the squares indicates how many of figuration substantially. Here, we point out some of the as- the objects inside are instructions. A white square contains sumptions which will need to be revisited in a more realistic only code. The edges show communication patterns. The study. edge width is the logarithm of the number of messages sent Optimistic assumptions: across the edge. The edge color indicates the mix of types

¯ The placed graph depends on the input data. In general of messages: dark edge indicates memory reads only, while we cannot know what locations and nodes will be used. lighter edges indicates control transfers, with intermediate

¯ The “inputs” (registers live on entry) to a node do not shadings for edges which carry mixed traffic. Despite the necessarily come from its immediate predecessor. graph being very small, it exhibits some typical features for

¯ We assume that control and data transfer can be done all our programs, like the big “stars”: code regions which directly between the two nodes involved. touch most of the memory of the program. “Stars” are bad,

¯ We assume that each procedure has a statically allocated because there is no way to place all adjacent nodes close to stack frame (This cannot be true for recursive procedures). the star’s center node; some have to be remote. “Hot” mem-

¯ We underestimate the area by not disambiguating the ory locations, which are touched by a lot of basic blocks, are aliasing caused by the re-allocation of de-allocated mem- less common. ory regions. For example, the memcpy standard library function was

¯ We completely ignore routability issues between nodes. the bottleneck in several programs, because it would access We assume sufficient wire bandwidth is available. most of the memory used by the program. To reduce the

¯ We ignore the propagation delay introduced by the size of the large “stars” we have inlined such functions at all molecular latches. their call sites, in effect duplicating the code. Each inlined Pessimistic assumptions: copy of memcpy might service a different set of memory

¯ We “issue” a memory read operation only when the read locations, to which it can be placed closer. This significantly instruction is executed. Reads could be initiated as soon as improves in performance (our figures show the performance the address is known. after inlining has been done). The area cost of inlining is

¯ No effort is made to customize the circuit for the applica- negligible (less than 1%). tion. For example, many of these applications have kernels Inlining was ineffective or inapplicable for functions which, when compiled properly, show speedup on the or- which were not leaves of the call graph, or which were der of 100x over a conventional processor [18]. called from a single place.

¯ We do not do any speculative or parallel execution. Figures 6, 8, and 9 explore the impact of the signal prop-

¯ Nodes are strict, having to wait for their inputs before agation speed on the application performance: the light bar starting execution. is drawn assuming the signals take one clock cycle between

¯ By not disambiguating between the uses of a memory two adjacent super-nodes, while the dark bar is for a five word which is freed and re-allocated we unnecessarily con- clock cycles/super-node latency. strain the placement of the graph. Figure 6 shows the slowdown of our applications. Nega-

¯ The code we execute had its registers allocated for the tive values indicate speed-ups. The media applications fare Alpha, which has many fewer registers than would be better than the SpecInt ones as they tend to have a smaller

10 memory footprint. Figures 8 and 9 show the impact of the some of the most onerous limitations introduced by chem- propagation delay on the cost of the control flow transfer ical assembly. It eliminates the need for transistors, the and of a memory read respectively. Because memory ac- neccesity of precise alignment and placement of wires, and cesses can be executed in parallel to each other or to execu- provides for defect tolerance. In conjunction with CMOS

tion of the code, some of the memory latency is hidden. The support circuitry it will create a reconfigurable fabric with

½¼ ¾ cost of control-flow transfers scales better (slower) with in- more than ½¼ gate equivalents/cm . creased propagation delay because the locality at the level The main computing element in the nanoFabric is a of the code is much better. There are few “hot” nodes, and molecular-based reconfigurable switch. We exploit the re- they can be placed very close to each other, often within the configurable nature of the nanoFabric to provide defect tol- same super-node. erance and to support reconfigurable computing. Reconfig- Figures 10 and 11 show how the running time of each ap- urable computing offers the promise of not only increased plication was spent for the two signal-propagation speeds. performance, but ammortizes the cost of chip manufactur- We notice that most often the dominant cost, especially for ing across many users by allowing circuits to be configured large signal propagation latencies, is in reading memory. post-fabrication. The molecular-based switch eliminates The lack of caches makes remote memory operations costly. much of the overhead needed to support reconfiguration— the switch holds its own state and can be programmed 6 Research directions without extra wires, making nanoFabrics ideal for recon- figurable computing. As the nanoFabric is a combination of Some form of memory data caching is crucial for reduc- two technologies, CMOS with CAEN, it suggests a hybrid ing the dominant runtime cost, memory access. Hardware architecture that may combine silicon-based custom circuits and software will have to be devised for imple- with CAEN-based reconfigurable ones. menting distributed caches. To support defect tolerance, ease of placement and rout- Besides the simple inlining strategy which we used to ing constraints, and enable faster compilation, we propose reduce the size of the “stars” in the graph, we can imag- a new architectural model, the split-phase abstract machine ine several other solutions, like making many copies of the (SAM). SAM ensures that all operations of potentially un- function code and randomly calling one of them each time, known latency use split-phase operations. The compilation or inlining non-leaf functions. strategy engendered by SAM is to partition an application Further code restructuring may be necessary in order to into independent threads at all split-phase boundaries. This better localize memory accesses. We could improve data partitioning will allow compilers and CAD tools to handle placement by using separate memory pools for each type of the large designs than can fit on a nanoFabric. While the object. model is powerful enough to express parallel execution, we The memcpy function can be optimized by a special im- perform a study which limits the model to sequential exe- plementation, as a three-party transaction. Instead of copy- cution, and uses only simple scalar compiler optimizations. ing the data from source to the code block and then to the Our simple compilation strategy produces results whose av- destination, it could directly ship the data from the sources erage performance is within a factor of 2.5 of the perfor- to the destination memory nodes. mance of an Alpha microprocessor, under the most pes- Predicated execution, speculative execution and code du- simistic delay assumptions for our fabric. We uncover sev- plication (for better placement locality) would reduce the eral performance bottlenecks which hint at future research cost of control flow transfers. avenues which will make the nanoFabric a viable substrate. Some of the techniques we propose to better handle hot spots will introduce further complications: replicat- Acknowledgments ing memory and using speculative execution would invali- date our assumption that consecutive operations to memory This work was sponsored in part by DARPA, under the reach the memory location in the order they were issued Moletronics Program, by NSF, under a CAREER award, (the triangle inequality can no longer be assumed). Special and Hewlett-Packard Corporation. The authors wish to distributed synchronization mechanisms will have to be de- thank the many reviewers for their helpful comments. vised to preserve the program semantics. References 7 Conclusions [1] I. Aleksander and R. Scarr. Tunnel devices as switching ele- In this paper we propose a new architecture, the ments. Journal British IRE, 23(43):177Ð192, March 1962. [2] Altera Corporation. Apex device family. http://- nanoFabric, based on chemically assembled electronic www.altera.com/html/products/apex.html. nanotechnology. The nanoFabric is designed to harness the [3] R. Amerson, R. Carter, B. Culbertson, P. Kuekes, and smallness of electronic nanotechnology and to overcome G. Snider. Teramac–configurable custom computing. In

11 D. A. Buell and K. L. Pocek, editors, Proceedings of IEEE [21] R. W. Keyes. Miniaturization of electronics and its limits. Workshop on FPGAs for Custom Computing Machines, IBM Journal of Research and Development, 32(1):24Ð48, pages 32Ð38, Napa, CA, Apr. 1995. Jan 1988. [4] A. Aviram and M. Ratner. Molecular rectifiers. Chemical [22] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. Me- Physics Letters, 29(2):277Ð283, Nov. 1974. diabench: a tool for evaluating and synthesizing multime- [5] V. Betz and J. Rose. Vpr: A new packing, placement and dia and communications systems. In Micro-30, 30th annual routing tool for fpga research. In Proceedings of the Interna- ACM/IEEE international symposium on Microarchitecture, tional Workshop on Field Programmable Logic and Applica- pages 330Ð335, 1997. tions, Aug. 1997. [23] C. Lent. A device architecture for computing with quantum [6] F. Buot. Mesoscopic phyics and nanoelectronics: dots. Proceedings of the IEEE, 85, April 1997. Nanoscience and nanotechnology. Physics Reports, pages [24] T. Mallouk. Nanomaterials: Synthesis and assembly. 173Ð74, 1993. http://research.chem.psu.edu/mallouk/- [7] J. Chen, M. A. Reed, A. M. Rawlett, and J. M. Tour. Obser- nano.pdf, Nov. 2000. Foresight Conference Tutorial. [25] B. Martin, D. Dermody, B. Reiss, M. Fang, L. Lyon, vation of a large on-off ratio and negative differential resis- M. Natan, and T. Mallouk. Orthogonal self assembly on tance in an electronic molecular switch. Science, 286:1550Ð colloidal gold-platinum nanorods. Advanced Materials, 2, 1999. [8] J. Chen, W. Wang, M. A. Reed, M. Rawlett, D. W. Price, 11:1021Ð25, 1999. [26] R. Mathews, J. Sage, T. Sollner, S. Calawa, C. Chen, L. Ma- and J. M. Tour. Room-temperature negative differential re- honey, P. Maki, and K. Molvar. A new rtd-fet logic family. sistance in nanoscale molecular junctions. Appl. Phys. Lett., Proceedings of the IEEE, 87(4):596, 1999. 77:1224, 2000. [27] J. Mbindyo, B. Reiss, B. Martin, B. Reiss, M. Keating, [9] R. H. Chen, A. N. Korotov, and K. K. Likharev. Single- M. Natan, and T. Mallouk. Dna-directed assembly of gold electron transistor logic. Appl. Phys. Lett., 68:1954, 1996. nanowires on complementary surfaces. Advanced Materials, [10] C. Collier, E. W. Wong, M. Belohradsky, F. M. Raymo, J. F. 2000. Stoddart, P. J. Kuekes, R. S. Williams, and J. R. Heath. Elec- [28] N. S. M.V. Martinez-Diaz and J. Stoddart. The self-assembly tronically configurable molecular-based logic gates. Science, of a switchable [2]rotaxane. Angewandte Chemie Interna- 285:391Ð3, July 1999. tional Edition English, 36:1904, 1997. [11] Y. Cui and C. Lieber. Functional nanoscale electronic de- [29] H. Park, A. Lim, J. Park, A. Alivisatos, and P. McEuen. vices assembled using silicon nanowire building blocks. Sci- Fabrication of metallic electrodes with nanometer sep- ence, 291:851Ð853, 2001. aration by electromigration. www.physics.berkeley.edu/- [12] W. Culbertson, R. Amerson, R. Carter, P. Kuekes, and research/mceuen/topics/nanocrystal/EMPaper.pdf, 1999. G. Snider. The teramac custom computer: Extending the [30] M. A. Reed. Molecular-scale electronics. Proceedings of the limits with defect. In Proc. IEEE International Symposium IEEE, 87(4), April 1999. on Defect and Fault Tolerance in VLSI Systems, 1996. [31] D. Rosewater and S. Goldstein. What makes a good molec- [13] W. Culbertson, R. Amerson, R. Carter, P. Kuekes, and ular computing device? Technical Report CMU-CS-01-114, G. Snider. Defect tolerance on the teramac custom computer. Carnegie Mellon University, April 2001. In Proceedings of the 1997 IEEE Symposium on FPGAs for [32] T. Rueckes, K. Kim., E. Joselevich., G. Tseng, C.-L. Cheung, Custom Computing Machines, pages 116Ð124, April 1997. and C. Lieber. Carbon nanotube-based nonvolatile random [14] D. Culler, A. Dusseau, S. Goldstein, A. Krishnamurthy, access memory for molecular computing. Science, 289:94Ð S. Lumetta, T. von Eicken, and K. Yelick. Parallel program- 97, 2000. [33] J. R. S. Brown, R. Francis and Z. Vranesic. Field- ming in split-c. In Proceedings of the Supercomputing ’93 Programmable Gate Arrays. Kluwer, 1992. Conference, Nov. 1993. [34] A. Srivastava and A. Eustace. Atom: A system for building [15] D. E. Culler, S. C. Goldstein, K. E. Schauser, and T. von customized program analysis tools. Technical report, Digital Eicken. TAM — a compiler controlled threaded abstract Equipment Corporation Western Research Laboratory, 1994. machine. Journal of Parallel and Distributed Computing, [35] Standard Performance Evaluation Corp. SPEC CPU95 18:347Ð370, July 1993. Benchmark Suite, 1995. [16] A. N. et al. Room temperature operation of si single-electron [36] S. Tans and et al. Individual single-wall carbon nanotubes as memory with self-aligned floating dot gate. Appl. Phys. Lett, quantum wires. Nature, 386(6624):474Ð7, April 1997. 70:1742, 1997. [37] R. Turton. The Quantum Dot: A journey into the Future of [17] R. Feynman. Lectures in Computation. Addison-Wesley, Microelectronics. Oxford University Press, U.K., 1995. 1996. [38] T. von Eicken, D. E. Culler, S. Goldstein, and K. E. Schauser. [18] S. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, Active messages: a mechanism for integrated communica- R. Taylor, and R. Laufer. Piperench: A coprocessor for tion and computation. In Proceedings of the 19th Interna- streaming multimedia acceleration. In Proceedings of the tional Symposium on Computer Architecture, May 1992. 26th Annual International Symposium on Computer Archi- [39] Xilinx Corporation. Virtex series fpgas. http://www.xilinx.com/products/- tecture, pages 28Ð39, May 1999. [19] S. Hauck, Z. Li, and E. Schwabe. Configuration compression virtex.htm. [40] A. Ye, A. Moshovos, S. Hauck, and P. Banerjee. Chimaera: for the Xilinx XC6200 FPGA. In IEEE Trans. on CAD of IC A high-performance architecture with a tightly-coupled re- and Systems, volume 18,8, pages 1107Ð13, August 1999. [20] G. Karypis and V. Kumar. Multilevel graph partitioning and configurable functional unit. In Proceedings of the 27th sparse matrix ordering. In Proceedings of the 1995 Intl. Con- Annual International Symposium on Computer Architecture, ference on Parallel Processing, 1995. June 2000.

12