
300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA

Order Number 8812232

A standard-cell placement tool for the translation of behavioral descriptions into efficient layouts

Buchenrieder, Klaus Juergen, Ph.D.

The Ohio State University, 1988

Copyright ©1988 by Buchenrieder, Klaus Juergen. All rights reserved.

UMI 300 N. Zeeb Rd. Ann Arbor, MI 48106


A STANDARD-CELL PLACEMENT TOOL FOR THE TRANSLATION OF BEHAVIORAL DESCRIPTIONS INTO EFFICIENT LAYOUTS

DISSERTATION

Presented in Partial Fulfillment of the Requirements for

the Degree Doctor of Philosophy in the Graduate

School of the Ohio State University

by

Klaus Buchenrieder, Dipl. Ing.(FH), M.S.

The Ohio State University

1988

Dissertation Committee:

N. Soundararajan

P. Sadayappan

D. Orin

Approved by

Adviser, Department of Computer and Information Science

Copyright © 1988

KLAUS JUERGEN BUCHENRIEDER

All Rights Reserved

To My Parents

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor, Professor P. Sadayappan, for his constructive suggestions, encouragement and help during the development of this research. His guidance and support have been invaluable in the completion of this work.

I would like to express my appreciation to my formal advisor, Professor N. Soundararajan, for his careful review of the dissertation and for his comments regarding the programming language aspects.

I would like to thank Professor D. Orin for his numerous suggestions and for providing access to the MOSIS chip-fabrication facilities. I am also grateful for the opportunity provided by him to participate in his robot-manipulator project as a VLSI designer.

Thanks are extended to all those very special friends who encouraged and supported me during the years and made it all the more rewarding. I am especially indebted to E. Stavely and Professor K. Schwan for help and encouragement at a critical point in my graduate studies. I am also thankful to M. Kaelbling, for without his assistance this dissertation would still be in the formative stage.

I would like to acknowledge the support of the German Fulbright Commission, which granted me a one-year full stipend to continue my graduate work here at the Ohio State University.

Finally, I wish to express my appreciation to my parents Irene and Fritz, my aunt Maria, and my girlfriend Daniela for their support and patience.

VITA

June 25, 1958 ...... Born - Munich, West Germany
1978 ...... Fachhochschulreife, Staatliche Fachoberschule Muenchen, Fachrichtung Technik
1982 ...... Diplom Ingenieur (FH), Fachhochschule Muenchen, Electrical Engineering, Munich, West Germany
1984 ...... M.S., The Ohio State University, Columbus, Ohio

FIELDS OF STUDY

Major Field: Computer and Information Sciences

Studies in:
Computer Architecture and VLSI-Design, with Prof. P. Sadayappan
Computer Aided Design, with Prof. V. Ashok
Operating Systems, with Prof. K. Schwan

Table of Contents

ACKNOWLEDGMENTS
VITA
LIST OF TABLES
LIST OF FIGURES

CHAPTER
I. INTRODUCTION
II. SURVEY OF RELATED WORK
   1. Introduction
   2. Layout Design Methods
   3. Design with Standard Cells
      1. Constructive Placement Techniques
      2. Iterative Improvement Techniques
   4. Circuit Transformations
   5. Summary
III. THE CONSTRUCTIVE PLACEMENT STRATEGY
   1. Introduction
   2. The Basic Design Approach
   3. The Basic Realization Language
      1. Syntax
      2. Modifications to the BNF
   4. Circuit Synthesis
      1. Parsing
      2. Tree Flattening
      3. Netlist Generation
      4. Layout Construction
   5. Summary
IV. LAYOUT OPTIMIZATION
   1. Introduction
   2. Redundancy Elimination
      1. Elimination of Right Recursion
      2. Cascaded Inverter Reduction
      3. Identifier Merging
      4. Restructuring
      5. Sinistral Tree Adjustment
      6. Common Subexpression Elimination
   3. Peephole Optimization
   4. Routing Considerations
   5. Summary
V. DESIGN OF STANDARD CELLS
   1. Introduction
   2. Standard Cell Design Frame
   3. Complementary CMOS P-Well Technology
   4. Latch-up Prevention
   5. Physical Layout of Standard Cell Logic
      1. Placement Algorithm for Fully Complementary Circuits
      2. Placement Considerations for C-Switches and Single Pass Transistors
   6. Cell Simulation and Logical Verification
   7. Summary
VI. EVALUATION OF THE APPROACH
   1. Introduction
   2. Basic Measurements of Simulated Annealing
   3. Comparison of Functional Placement and Simulated Annealing
   4. Low Temperature Cost Improvement
   5. Summary
VII. SUMMARY AND CONCLUSIONS
   1. Research Contributions
   2. Research Extensions
Appendix A. EXAMPLE OF A 4-BIT COUNTER DESIGN
Appendix B. LARGE DESIGN EXAMPLES
   1. 24-hour Clock
   2. Towers of Hanoi
Appendix C. CELL CATALOG
BIBLIOGRAPHY

List of Tables

TABLE
1. BNF of the basic implementation language
2. External function extension for the BNF
3. Grammar modified for left-recursion
4. Data for around-the-cell routing of a clock chip
5. Data for through-the-cell routing of a clock chip
6. Data collected for low temperature annealing processes performed on functionally placed layouts

List of Figures

FIGURE
1. Semi-custom design techniques
2. Dual phase clocked fsa
3. PLA based robot control circuit
4. Six transistor site [13]
5. Standard cells with single sided connection points
6. Combined macro-, and standard cell layout
7. MacPitts fixed "floor-plan" model
8. Digital clock circuit based on standard cells
9. Different representations of a standard cell
10. Encoder circuit with three inputs
11. Cross-coupled nand storage circuit
12. Intermediate and feedback wire
13. Hexadecimal counter layout
14. Parse tree for a 4 bit counter
15. Threaded tree with preorder numbering
16. Linear cell arrangement for a counter
17. Back-, and Cross-edges in a tree
18. Layout folding
19. Derivation trees for a simple expression
20. Operator exchange to remove right-recursion
21. Cascaded inverter chains
22. Inverter chain elimination
23. Identifier merging of set S
24. Lexicographical tree transformations
25. Sinistral tree adjustment of two subtrees S1 and S2
26. Subtree exchange of equivalent size subtrees with different synthesized strings
27. Left adjusted tree structure
28. Synthesizing leaf-identifier attributes in a tree fragment
29. Common subexpression elimination
30. Code depth improvement of an AND-NAND structure
31. Extended algebraic transformation
32. Functions available within an EXOR-cell
33. Edges for short range wiring
34. Backedges to closest identifier
35. Over-the-cell track assignment
36. Primary rows reserved for transistors
37. Design grid with inverter structure
38. Basic design frame
39. Polysilicon intercell connection stub
40. Vertical cell connections of a DFF
41. Stem extension and grid alignment
42. Gridsnap spacing to terminal points
43. Minimal dimensions for vertical connections
44. Inverter in Si-Gate Technology
45. Parasitic pnpn-device
46. Inverter with additional substrate contacts
47. Substrate contact and well protection
48. 12 transistor EXOR implementation
49. Equipotential points and colored graph G
50. NIF conditional implementation
51. Stage ratios for two EXOR implementations
52. Pn-inverter pair estimation of C1, C2
53. Device parameter estimation
54. Parameter estimation for T2
55. Cost distribution of the annealing tool
56. Wire-length distribution of the annealing tool
57. Estimation of the lowest cost
58. Line graphs for around-the-cell routing of a clock chip
59. Line graphs for through-the-cell routing of a clock chip
60. Comparison with force directed pairwise interchange
61. Wire length results for low-temperature annealing
62. Total cost results for low-temperature annealing
63. Four-bit counter with adders as incrementer
64. Counter with non-optimized horizontal wiring
65. Counter using left-branched result propagation
66. Counter with maximized result propagation
67. Counter using over-the-cell routing
68. Over-the-cell wire track
69. Photomicrograph of the 24-hour clock
70. Photomicrograph of the Towers of Hanoi chip

Chapter I

INTRODUCTION

About 20 years ago the evolution of very large scale integration (VLSI) began with integrated circuits containing some two hundred transistors on a single piece of silicon. The comparatively low density was mainly due to the bipolar technology in which early integrated circuits were fabricated. MOS (metal oxide semiconductor) technology helped to overcome the technological limitations of small scale integration by 1965. Medium scale integration (MSI) allowed for 200 transistors on a 10 mm² piece of silicon, called a "die". At this time many circuit design engineers recognized the convenience and importance of MSI and turned away from implementations using discrete components mounted on printed circuit boards. The density of the logic, its speed and its low power consumption made MSI logic very popular, and soon design engineers asked for even larger integration densities. Large scale integration stepped in to meet the demand in 1971, when about 2000 transistors became available on a single chip (counters, arithmetic logic units and small memory products). Technological advances that caused an annual feature size reduction rate of 13% for active components on a chip made this possible. N-channel MOS (NMOS) technology, allowing for several thousand transistors per die, then became the dominating semiconductor technology. Metal oxide semiconductor field effect transistor (MOSFET) devices quickly superseded bipolar transistors, and today virtually all integrated circuits are based on MOSFETs. MOS transistors are next to perfect for digital design, since FETs resemble ideal low power switching devices with an intrinsic switching time that decreases linearly with the dimensions of the component ($\text{switching time} \approx \frac{\text{channel length}}{\text{carrier velocity}}$).

Clearly device miniaturization has been the major technological issue for the reduction of unit cost per implemented function, improvement of the fabrication yield and the increase of chip performance. Technological process and design advances in the time period since then have had a significant influence on device miniaturization and cleared the way for VLSI, allowing more than 500,000 devices on a 10 mm² silicon die. However, the rate of growth has slowed down in recent years because of difficulties in defining, designing and testing complicated chips. The problem is especially pronounced when complex control sections that contain large sequential units are to be designed and must be laid out under rigid time constraints.

Numerous design methods were developed to resolve this problem, and modular strategies which encapsulate circuit components in building blocks are generally preferred. Building blocks, or cells, comprising a circuit are functional blocks such as arithmetic logic units, registers, bonding-pad logic or implementations of basic functions. Cells are treated as autonomous units with constituent devices. Their layouts are hidden from the user and called from a library only in the final stage of the layout production. In the user-guided design phase, cells are arranged on the two-dimensional layout surface and connected by wires. The process of arranging standardized building blocks, called placement, is the topic of this dissertation.

The placement problem is defined as a mapping $M$ of circuit elements $e$ in a layout specification onto locations $l$ on the layout surface, so that some objectives $o$ are satisfied. With $S$, the set of disjoint wire nets $s_1, \ldots, s_m$ as specified for the optimization with $o$, one can formalize the problem as $M(e, o, S) \rightarrow l$. Some typical objectives are: completely automatic routing capability, minimal total area, minimal total wire (net) length and maximized circuit performance. Specialized applications may contain additional objectives, e.g., minimized signal crosstalk among interconnection wires or equalization of the heat distribution on the die (such specialized objectives are of no concern here). A set of circuit elements $E = \{e_1, \ldots, e_n\}$, $n > 0$ (each $e$ corresponds to a component or cell definition), and a netlist $S = \{s_1, \ldots, s_m\}$, $m \ge 0$, are given in addition to $o$. Each wire net contains a vector with routing terminal denominators as components. Terminals are located at the borders of circuit elements, cells or the outside world (bonding pad interface logic).

After the placement for all elements $e_1, \ldots, e_n$ is completed, $L = (l_1, \ldots, l_n)$ holds for each component $i$ a unique location $l_i = (x_i, y_i)$ that satisfies $o$ and guarantees no overlap.
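The formal definition can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions: the cell names, dimensions and nets are hypothetical, terminals are simplified to cell origins, and the half-perimeter bounding box is used as a stand-in for the wire length objective; none of this is taken from the tool described in this dissertation.

```python
from itertools import combinations

# Hypothetical cells: name -> (width, height). Illustrative values only.
cells = {"e1": (4, 2), "e2": (3, 2), "e3": (5, 2)}

# Netlist S: each net lists the cells it connects (terminals simplified to cell origins).
nets = [("e1", "e2"), ("e2", "e3"), ("e1", "e3")]

# Placement L: one location l_i = (x_i, y_i) per element.
placement = {"e1": (0, 0), "e2": (4, 0), "e3": (7, 0)}


def overlaps(a, b, placement, cells):
    """Return True if the rectangles of cells a and b share interior area."""
    (ax, ay), (aw, ah) = placement[a], cells[a]
    (bx, by), (bw, bh) = placement[b], cells[b]
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def is_legal(placement, cells):
    """A placement is legal when no two cells overlap."""
    return not any(overlaps(a, b, placement, cells) for a, b in combinations(cells, 2))


def wire_length(placement, nets):
    """Half-perimeter bounding-box estimate summed over all nets (an assumed objective)."""
    total = 0
    for net in nets:
        xs = [placement[c][0] for c in net]
        ys = [placement[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total
```

Here `is_legal` checks the no-overlap condition on $L$, and `wire_length` plays the role of one possible objective $o$; abutting cells are treated as legal because only shared interior area counts as overlap.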

While the interconnection or routing problem itself is a complicated task, it is not addressed here and the assumption is made that an efficient automatic routing tool is available [1]. Hence, the physical interconnection problem, that is the actual wire guiding on the layout surface is ignored. Only routing issues that affect neighborhood placement decisions are considered in order to overcome router deficiencies.

The goal of standard cell placement is to determine cell positions that satisfy technology as well as cell specific restrictions, and permit area efficient automatic routing. Standard cell placement is a difficult task since the minimum wire length is directly affected by any permutation of cells. For this reason, it can be viewed as an optimization problem which has been shown to be nondeterministic polynomial (NP)-hard [2]. This means that the placement problem is at least as hard as any one of a large number of other optimization problems for which no exact and efficient deterministic algorithm is known. This fact is particularly unpleasant and is the root of numerous heuristic attempts. Knowing that effective "solutions" for some very restricted instances of NP-problems exist, one looks for simplifications with the hope that the placement of standard cells is such a desired special case. Unfortunately, it is not and remains in the domain of NP because no known restriction to the problem leads to optimal results in less than exponential time. This means that not even an efficient algorithm for a "slightly modified", less restrictive standard cell placement problem can be given.
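To illustrate why exhaustive search is not an option, the sketch below enumerates every ordering of cells in a single row and picks the one with the shortest total net span. It is a toy model under assumed conditions (unit-width cells, two-terminal nets, hypothetical names) and not a method used in this work; the point is that the number of candidate orderings grows as $n!$.

```python
from itertools import permutations
from math import factorial


def row_wire_length(order, nets):
    """Sum of net spans when cells of unit width sit side by side in the given order."""
    pos = {cell: i for i, cell in enumerate(order)}
    return sum(max(pos[c] for c in net) - min(pos[c] for c in net) for net in nets)


def best_row_order(cells, nets):
    """Exhaustive search over all n! orderings; feasible only for very small n."""
    return min(permutations(cells), key=lambda order: row_wire_length(order, nets))


cells = ["a", "b", "c", "d"]
nets = [("a", "c"), ("c", "d"), ("b", "d")]
best = best_row_order(cells, nets)
print(best, row_wire_length(best, nets))
print("orderings for 10 cells:", factorial(10))
print("orderings for 20 cells:", factorial(20))
```

Four cells require only 24 evaluations, but a row of 20 cells already has more than $10^{18}$ orderings, which is why practical tools rely on constructive or iterative heuristics.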

The objective of this dissertation is to develop a "standard cell placement tool for the translation of behavioral descriptions into efficient standard cell layouts". The research includes three major areas: the development of a method for fast and accurate translation of behavioral logic descriptions into a layout topology, placement and routing optimization, and standard cell design.

The placement strategy proposed in this dissertation is different from existing methods, because it is based on syntax directed parsing of a high level expression instead of a heuristic method. It does not process low-level circuit information obtained with schematic capture but uses applicative specifications instead. A number of key ideas on standard cell placement have emerged from this research and are my contribution to the field of VLSI research. Some of the key concepts, such as initial placement construction from a threaded expression tree, local optimizations of tree fragments for more efficient layouts and faster logic and a construction method for symmetrical standard cell configurations, are original here. Other principles, such as compiler construction techniques, local circuit transformations and the Mead-Conway approach to layout design, to name a few, had previously been introduced. This dissertation applies these principles and techniques and shows the usefulness and appropriateness for VLSI design.

The dissertation is organized as follows:

Chapter two provides a survey of previous work in the areas of basic layout construction, standard cell placement and circuit transformation. This chapter provides background material.

The third chapter introduces the basic standard cell placement approach which is one of the key ideas of this research effort. It begins with the presentation of syntax and semantics of a functional realization language for the description of standard cell layouts. The main discussion in this chapter concentrates on the translational process from applicative expressions into standard cell topologies. The chapter focuses on syntax directed parsing, parse tree-flattening, pseudo code generation and layout construction. A simple digital counter circuit serves as an example, and illustrates automatic layout construction steps, that is placement and routing.
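The translation from an applicative expression to a linear cell arrangement can be sketched roughly as follows. This is a minimal illustration, assuming expressions are given as nested tuples and that the preorder number of an operator node is used directly as its row position; the tuple encoding and the cell naming scheme are hypothetical and are not the realization language or data structures defined in chapter three.

```python
def flatten(expr, cells=None, nets=None):
    """Preorder walk: emit one cell per operator and one net per operand edge."""
    if cells is None:
        cells, nets = [], []
    if isinstance(expr, str):          # leaf: an input identifier, no cell needed
        return expr, cells, nets
    op, *operands = expr
    index = len(cells)                 # preorder number doubles as row position
    name = f"{op}{index}"
    cells.append(name)
    for operand in operands:
        source, _, _ = flatten(operand, cells, nets)
        nets.append((source, name))    # wire from operand output to this cell
    return name, cells, nets


# Hypothetical expression: nand(and(a, b), inv(c))
expr = ("nand", ("and", "a", "b"), ("inv", "c"))
root, cells, nets = flatten(expr)
print(cells)   # cell order along the row
print(nets)    # two-terminal nets between operands and operators
```

In this sketch each operator ends up next to its leftmost operand subtree, so that result wires between adjacent cells stay short without any search.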

The fourth chapter centers around transformational optimization steps with the goal of obtaining a faster, more efficient layout. Optimizations are necessary to overcome placement deficiencies caused by the source language structure or the syntax directed parsing scheme introduced in chapter three. The improvements are targeted towards cell-count reduction, propagation delay improvement and routing channel enhancement. Common subexpression elimination as well as constant folding resolve some problems caused by ill-constructed expressions, whereas redundancy elimination, DAG restructuring and peephole optimization effectively reduce the size of the parse graph. Such a reduction is desirable because it results in a lower cell count and a shorter propagation delay. The chapter concludes with wiring improvements which drastically reduce the excessive cost of switchbox routing.
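Two of the optimizations named above can be sketched on a toy expression tree. The fragment below is an assumption-laden illustration, not the algorithms of chapter four: expressions are nested tuples, cascaded inverter reduction is limited to removing `inv(inv(x))` pairs, and common subexpression elimination is modeled simply by counting each structurally identical subexpression once.

```python
def reduce_inverters(expr):
    """Remove pairs of cascaded inverters: inv(inv(x)) -> x."""
    if isinstance(expr, str):
        return expr
    op, *operands = expr
    operands = [reduce_inverters(o) for o in operands]
    if op == "inv" and not isinstance(operands[0], str) and operands[0][0] == "inv":
        return operands[0][1]
    return (op, *operands)


def count_cells(expr, seen=None):
    """Count cells when identical subexpressions are shared (a DAG, not a tree)."""
    if seen is None:
        seen = set()
    if isinstance(expr, str) or expr in seen:
        return 0
    seen.add(expr)
    return 1 + sum(count_cells(o, seen) for o in expr[1:])


# Hypothetical expression: or(inv(inv(and(a, b))), and(a, b))
expr = ("or", ("inv", ("inv", ("and", "a", "b"))), ("and", "a", "b"))
reduced = reduce_inverters(expr)
print(reduced)
print(count_cells(expr), "cells before,", count_cells(reduced), "cells after")
```

Removing the inverter pair exposes the second `and(a, b)` as a common subexpression, so both the cell count and the depth of the logic shrink.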

Chapter five discusses the synthesis of custom standard cells from traditional transistor diagrams. The design method shown here is one of the major contributions of this research effort. It is based on transformations of a circuit graph, in which nodes correspond to pn-pairs, and the result is a mapping of dual row blocks to transistor positions in the design frame. Much of the detail given in the remaining part of this chapter concerns design specific issues like latch-up prevention and cell interconnections. The chapter concludes with a short section on cell simulation and logic verification.

In chapter six, the evaluation of the functional approach for standard cell placement is presented. The chapter contains individual test results and comparisons with simulated annealing [3,4] for different starting temperatures and around-the-cell as well as through-the-cell routing.

The last chapter summarizes the research contributions of this dissertation. Several possible extensions to the research are also suggested.

Chapter II

SURVEY OF RELATED WORK

II.1. Introduction

This chapter reviews design methods for layout construction, standard cell placement, silicon compilation and circuit transformation. The first section surveys basic semi-custom VLSI design techniques and the second section points out advantages of the standard cell approach. In the third section various placement schemes for cell-based VLSI design are discussed. The last section briefly explores transformation algorithms for technology adaptation and logic synthesis, known as local transformations.

II.2. Layout Design Methods

While the fabrication technology of integrated circuits improved, chip designers realized the necessity for more powerful layout creation tools. Several methods for chip layout construction were devised in response to increasing device density requirements. The range of the developed techniques spans from direct mask layer construction to silicon compilation.

A layout of an integrated circuit is the link between the circuit description and the fabrication process. It is composed of several two dimensional patterns, each representing a mask for the multi-stage lithographic fabrication process. In each step of this process, a mask is used for the selective patterning of donor or acceptor impurities on the silicon. Also, connection wires, glass cuts and other important structures are specified in the so-called mask topology. Each individual mask is represented by a different color or pattern, so that a complete topology can be displayed on a graphics screen. Mask topology construction techniques are commonly regarded as traditional methods because, in the early days of chip design, all masks were individually drawn. Today, layout construction is carried out with computer aided design tools and one distinguishes several construction techniques. These include: memory & analog design, the full custom approach, the Mead-Conway ruleset and language-based stick compaction.

Memory & analog design strives for maximum performance and density. Excellent topologies are achieved through hand-crafting and explicit layout work, guided and verified by physical device simulation and electrical transient analysis. Here, the designer considers specifics of the implementation medium and uses physical process characteristics for design optimization. The disadvantage of this method is that it requires detailed knowledge about semiconductor properties and the fabrication technology. For this reason, other design methods, in which the implementation medium is transparent to the designer, gained more popularity. In these methods, a high level of device abstraction is attained through switch level simulation or electrical behavior approximation.

Full custom techniques employ a more restrictive set of design rules. Chips are composed from individually sized devices. This requires a less sophisticated interactive layout editor for the construction, placement and interconnection of layout components. The main emphasis during the construction process is high device density, circuit performance and a reasonably short design time. During the construction of a topology, a set of stylized layout fragments or tiles is developed, expanded and applied wherever possible. Contrasting this, Mead and Conway [5] suggest a simple yet powerful design method that encourages regular structures for devices and routing material. The Mead-Conway approach aims to take full advantage of predefined shapes that are tiled together following simple, conservative design rules. Area efficiency is not considered critical, and the major advantage of this method is that full custom design becomes more affordable for large projects.

Even more economical with respect to design time but not quite as area efficient are stick-based designs which utilize abstract representations for devices and interconnects, for example LAVA [6] p.273. Layouts are specified in terms of standardized circuit components like transistors, interconnects and wires. Devices are either located at grid positions or follow the grid lines. An intelligent layout system translates stick-expressions into actual mask information. Wire thickness and exact device locations are supplied by the system which places, stretches and compacts sheet material. The idea of standardized components can be taken further and several devices can be combined into a larger unit. This leads up to a higher level of device abstraction, called semi-custom design.

Semi-custom layouts are patterned together from pre-designed building blocks which are equivalent to small circuits. Such circuits are used as modules and a designer is not directly confronted with difficult low-level layout development. Semi-custom building blocks can be visualized as modules at the lowest level of a layout hierarchy. Currently four semi-custom design techniques are popular. Simplified examples of the four approaches, PLA, gate array, standard cell and macrocell, are illustrated in Figure 1.

PLA based design (a) is often applied when combinational or sequential logic must be implemented. PLA structures are highly regular and can be automatically generated from equations [7]. A PLA is a simple structure which contains two major, directly connected parts, the AND and OR plane. The AND plane generates the terms of a canonical form of boolean functions and the OR plane implements the sum of products. Hence, the outputs of a PLA appear directly as the sum of products in canonical form of the inputs.
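The two-plane structure can be mimicked in a few lines. The sketch below is only an illustration with a hypothetical personality: each AND-plane row encodes one product term, each OR-plane row selects the product terms summed for one output. The encoding (1, 0, `None`) is an assumption chosen for readability and is not a format used by the PLA generators cited above.

```python
# AND plane: one row per product term; each entry is 1 (true literal),
# 0 (complemented literal) or None (input not used in the term).
and_plane = [
    (1, 0, None),   # a AND (NOT b)
    (None, 1, 1),   # b AND c
]

# OR plane: one row per output, marking which product terms feed it.
or_plane = [
    (1, 1),         # f0 = term0 OR term1
    (0, 1),         # f1 = term1
]


def pla_eval(inputs, and_plane, or_plane):
    """Evaluate a PLA: product terms first, then the sum of selected products."""
    terms = [
        all(lit is None or bool(x) == bool(lit) for x, lit in zip(inputs, row))
        for row in and_plane
    ]
    return [int(any(t and use for t, use in zip(terms, row))) for row in or_plane]


for a, b, c in [(1, 0, 0), (0, 1, 1), (0, 0, 0)]:
    print((a, b, c), pla_eval((a, b, c), and_plane, or_plane))
```

Because both planes are plain tables, a generator only has to fill in their entries from the equations, which is what makes the layout regular.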

Figure 1: Semi-custom design techniques. (Panels: (a) PLA-based, (b) gate array, (c) standard cell, (d) macrocell.)

The major advantage of PLAs is that the regular layout is directly produced from textual descriptions. No layout editing is required. The regular topology emerges from fixed building blocks, usually referred to as tiles [8] p.370, which are quilted together by an algorithm. PLA sizes can be considerably reduced either by classical minimization techniques [9], PLA-slicing or PLA-folding [6] p.318. Often sequential circuits are required that assert control signals to output lines which depend on the present state and input signals. Sequencers are realized by finite state machines constructed from a PLA as array logic and a feedback path from output to input. A model of a combinational finite state machine is shown in Figure 2.

Figure 2: Dual phase clocked fsa. (Diagram labels: AND-plane, OR-plane, Phi1, Phi2, Register-1 (R1), Register-2 (R2), feedback, inputs, control outputs.)
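The feedback structure of such a sequencer can be sketched in software. The fragment below is a simplified model under assumed conditions: the combinational logic is a hypothetical modulo-4 counter standing in for the PLA, and the two clock phases are modeled as two sequential assignments per cycle. It is meant to show the data flow only, not the timing behavior of a real circuit.

```python
def combinational_logic(state, enable):
    """Stand-in for the PLA: next state and Moore output of a modulo-4 counter."""
    next_state = (state + 1) % 4 if enable else state
    output = int(state == 3)           # Moore output depends on the state only
    return next_state, output


def run(inputs, state=0):
    """Simulate one two-phase cycle per input value; return outputs and final state."""
    outputs = []
    r2 = state                         # R2 holds the fed-back state
    for enable in inputs:
        r1 = (r2, enable)              # phase 1: R1 samples feedback and inputs
        next_state, out = combinational_logic(*r1)
        outputs.append(out)
        r2 = next_state                # phase 2: R2 captures the PLA result
    return outputs, r2


print(run([1, 1, 1, 1, 0]))
```

Separating the two register updates means the logic never sees its own new output within the same phase, which is the software analogue of avoiding a race through the feedback path.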

Two dual phase clocked registers R1 and R2 prevent racing and allow synchronization with other clock dependent components in the circuit. Most VLSI tools produce Moore machines to implement sequential circuits. This is sufficient because Mealy and Moore automata are functionally equivalent [10] p.44. Figure 3 shows an example of a PLA based controller chip in which sequential and combinational structures are implemented as PLAs.

In contrast to PLAs, which are quilted from small tiles, gate arrays consist of a large number of identical sites which are fixed in position. Each CMOS site embodies a fixed number of p- and n-transistors. A typical six-transistor site arrangement is shown in Figure 4.

Figure 3: PLA based robot control circuit.

In this particular arrangement the first transistor pair is disjoint. Storage elements and C-switches are simple to implement this way. Gate array designs enjoy great popularity because wafers can be prediffused with the basic site structure and stockpiled. Two or three layers are left open for customer specific interconnections which are performed during array personalization. Because of the reduced number of fabrication steps, the fabrication turnaround time is very short. Small circuits fit the gate array approach best, since the number of possible site configurations is limited.

Area inefficiency, fixed site placement and resulting routing problems (channel congestion between sites) often limit the application of gate arrays [8] p.241, [11], [12] p.384.

Figure 4: Six transistor gate array site [13].

Gate array sites are fixed to predetermined grid positions on the layout surface and all connection points for site interconnections are fixed as well. Hence, gate arrays (gate array sites) have very little degree of freedom. One degree of freedom can be introduced when cells are allowed to vary in width. Cells which have a fixed height but vary in width are more versatile.

Polycell design relies on predefined logic or circuit cells from which a complete layout is constructed. Polycells are compact layout pieces that contain many transistors and represent basic gates or other logical units. In contrast to PLAs, polycells are generated once and are made available in design libraries. Designers treat polycells as abstract building blocks and include blocks in their work wherever convenient.

The polycell design approach is based on either standard cells, gate cells or larger building blocks. The basic difference between polycell and other traditional design styles is that polycells are not bound to a fixed placement grid and may be arranged in any convenient way. An arrangement in cell-rows with routing channels between adjacent rows is most attractive when blocks of polycells are of comparable height within a close range. In addition, the routing area is often effectively reduced when narrow routing passages between cells are introduced. Vertical through-the-cell routing, which requires special connections on top and bottom of the cells, reduces long jogs and overcrowded narrow passages. Even though the construction of a cell library is very time consuming, the overall layout time for an entire chip is considerably reduced. This is a great advantage for time efficient layout of random logic networks. Compared with the gate array strategy, final chip sizes are much smaller because polycells are not limited by the number of transistors per cell. In general, polycell designs are not as area efficient as user guided methods because more than sixty percent of the chip area is typically devoted to routing. In comparison, handcrafted full-custom layouts need only about half the routing area.

The standard-polycell concept is based on small logical design units similar to standard logical design components like gates, flip-flops, etc. (Figure 1c). All building blocks are predesigned according to a design scheme which hides transistor, layout and technology details from the user. The underlying rules guarantee physical and electrical compatibility of all cells within a cell family. Since all building blocks (cells) follow the same set of rules, power and other inter-cell connections can be aligned to allow linear cell abutment, see Figure 5. Some cell families specify connection points for signal input/output on top and bottom of the cells [14] p.52, [15] p.125 for denser layouts.

Among cell based design schemes, macrocells are most versatile. Macrocells have, compared to standard cells, an additional degree of freedom since the height restriction is removed. In the literature, one can find a number of definitions [15] p.127, [16,17,18]. Early macrocells were actually only an extension of gate arrays with series gating at the transistor level. Here macrocells served as a group of components. Logical building blocks were formed from one or more interconnected sites. In CMOS design, macrocells are connected by cell abutment to form larger functional units in a hierarchical fashion. Abutment is done automatically and cells which require context cells such as input/output are commonly used subcells.

Figure 5: Standard cells with single sided connection points. (Diagram labels: routing area, terminal points, cell-row 1, cell-row 2, ground.)

When design tools like PLA-generators became available, attempts were made to amalgamate standard and macrocell designs. Several placement algorithms for arbitrarily shaped blocks and appropriate routing techniques have been developed [19]. Large, compact layouts for complex circuits can also be achieved when standard and macrocell blocks are directly combined into a single topology [20,21]. In such schemes, the smaller macrocells serve as "putty" to fill gaps between the larger arrangements. Standard cells are grouped into blocks which conveniently fold between the larger cells. High density layouts result from this and advantages of both design methodologies are combined. Combinational subcircuits may be realized with PLAs and sequential portions as standard cell blocks. Even though one may suggest many improvements in current systems, experimental results have shown the usefulness of this approach for medium size circuits.

Figure 6: Combined macro-, and standard cell layout.

Silicon compilers are automatic layout synthesis systems that produce complete mask topologies from structural or architectural descriptions. Layouts are constructed from abstract topologies which can be transformed into specific instances (tiles) by transistor sizing, compaction, input/output relocation and parameterization.

Structural approaches such as ALI [22] (Princeton) or SDL [23] specify layout topologies in terms of gates and connections. In SDL for example, gates or cells are basic hierarchical units that contain simple components like transistors, terminals, wires, polygons etc. The SDL-L generator allows for translation of a structural description containing component definitions, wiring information and repetition statements into a layout. The disadvantage of structural layout specification is that the designer must know what tiles look like and where to connect wires to them. Even though a textual description replaces low level painting efforts, too much layout dependent information is necessary. Architectural compilers, like MACPITTS [24,25] (Meta-Logic) or SILC [26] (GTE), produce fixed target architectures similar to the example shown in Figure 7.

Figure 7: MacPitts fixed "floor-plan" model. (Diagram labels: Proc. 1 to Proc. n, Sin, Sout, Finite State Control, Flags, Out.)

The floor-plan model consists of a data-path with inputs and outputs orchestrated by one or more finite state control sections. Logical slices, so called "organelles", are the basic logical modules of the data path, and programmed logic arrays (PLAs) constitute the finite control. A PLA is a compact layout structure for tool supported implementation of two-level logic from Boolean expressions. Compared to structural attempts, a higher level of abstraction is attained and concurrent data-path descriptions fit the architectural approach best. However, two major drawbacks are linked to architectural design strategies. First, basic modules are not accessible to the user and thus the basic set of technology dependent datapath modules cannot be altered, updated or expanded. Even more important is that a source description can only be compiled into one specific target architecture and hence, designs are restricted to this architecture or their implementation becomes very inefficient.

II.3. Design with Standard Cells

In contrast to the scalable tiles used by silicon compilers, standard cells are of fixed physical height and a width that varies with a cell's function. The standard cell design method is very flexible because any desired architecture can be realized. Furthermore, it is economical in the sense that repetitive constructs may be efficiently accomplished through customized cells. One major advantage of standard cells is the regularity of the resulting layout as illustrated in the die micrograph shown in Figure 8.

Automatic layout generation using standard cells consists of two primary steps: placement [27] and routing [28]. Placement involves the positioning of cells on the layout surface and routing is concerned with cell interconnection. These two steps are often considered as independent problems, but good placement considerably simplifies the routing task. The placement problem is generally defined as the procedure of mapping circuit elements onto a layout surface according to some set of constraints.

As mentioned in the previous chapter, the goal is to determine positions for all circuit elements so that completely automatic routing becomes feasible. In addition, all possibly conflicting constraints must be satisfied. The major constraints are minimized routing area and net length. Both criteria are evaluated by estimating the maximum channel width, which determines the spacing between adjacent cell-rows. Other constraints, like heat dissipation, net reduction, signal crosstalk, signal delay etc., are usually neglected to keep placement algorithms simple. In fact, optimum solutions for placement problems can only be found for a small number of components through exhaustive permutation.

Figure 8: Digital clock circuit based on standard cells.

Good results may also be found with optimization techniques based on objective functions. The general case however requires a heuristic approach because the placement problem is NP-hard [2,29,30]. Hence, optimal solutions cannot be guaranteed for a large problem size. Basic placement techniques for polycells are presented in the next section. It is assumed that a router which utilizes Steiner tree [31] optimization is available.

Placement strategies may be divided into two groups: constructive techniques and methods performing iterative improvements of an initial configuration. Constructive procedures complete an initial placement by including unplaced components in the design. Partial placements contain at least one placed component to start. The intent is to keep strongly connected components together, so that strongly connected clusters result from the procedure. Iterative improvement methods start with an initial configuration and reduce the overall cost of the placement in successive steps. In addition, some hybrid procedures have been developed using the benefits of both concepts. Hybrid methods concentrate on overcoming some specific disadvantage of either method. Two dimensional placement algorithms divide the problem into manageable steps to avoid extensive computations. Rectangular arrangements can be obtained by several different methods. Most popular is folding of linear cell lineups [13,21,32,33] and sequential partitioning of cells into rows and columns [34].

Folding or partitioning first divides a linear lineup of cells, i.e. a long chain of abutted cells, into groups with a pre-specified length. Then, the groups are separated and stacked in parallel with a sufficiently wide routing channel between adjacent rows, resulting in a two-dimensional arrangement. The difference between folding and partitioning lies in the division method. Folding selects contiguous cells from the chain and reverses the direction of adjacent cell-rows, whereas partitioning makes some adjustment (cell-permutations) to achieve equal width rows.

The input to placement programs is generally obtained through schematic capture or from specifications prepared by the user. Circuit specifications come in many flavors. The range spans from explicit mask topology descriptions to high-level hardware descriptions [6] p.246, p.273, [35,36,37], [38] p.182, [39]. Recently, functional or applicative languages which can deal with combinational and sequential circuits have been introduced. A major advantage of applicative forms over sequential language descriptions is that circuit synthesis, implementation and evaluation are directly tied to the high level specification. The concept of abstract specifications for integrated circuits was introduced by Ayres [40]. Since functional logic formulations describe combinational circuits well [41,42], additional efforts were made to extend the existing language model for sequential constructs such as Mealy and Moore machines [43,44,45,46].
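The folding step described earlier in this section can be sketched in a few lines. This is a minimal illustration, not the procedure of any cited system: the `(name, width)` cell format, the function name `fold_lineup` and the greedy row-filling rule are assumptions made for the example.

```python
def fold_lineup(cells, max_row_width):
    """Fold a linear cell lineup into rows, reversing every other row.

    cells: list of (name, width) pairs in lineup order (hypothetical format).
    max_row_width: pre-specified row length, in the same units as the widths.
    """
    rows, current, used = [], [], 0
    for name, width in cells:
        # Start a new row when the next cell would exceed the row length.
        if current and used + width > max_row_width:
            rows.append(current)
            current, used = [], 0
        current.append(name)
        used += width
    if current:
        rows.append(current)
    # Reverse odd-numbered rows so the chain "snakes" through the layout
    # and cells adjacent in the lineup stay physically close.
    return [row[::-1] if i % 2 else row for i, row in enumerate(rows)]


lineup = [("A", 3), ("B", 2), ("C", 4), ("D", 3), ("E", 2), ("F", 5)]
rows = fold_lineup(lineup, max_row_width=9)
for row in rows:
    print(row)
```

Partitioning would differ only in the division rule: instead of cutting the chain at fixed positions, it would permute cells near the cut points to equalize row widths.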

Layouts from functional descriptions are constructed in hierarchical or stepwise fashion. Predicates in the program source are mapped to symbolic objects that contain abstract topologies. Since objects are treated independently of their physical properties, a compaction step must follow to resolve geometrical incompatibilities of the tiles. Routing problems and wasted area are consequently avoided through carefully orchestrated assembly, compaction and routing steps.

Until now little conclusive research was done in this area, even though grammatical transformations of parse trees intuitively lead to improved layouts. Tree optimization may include insertion or movement of information in the tree [47], [48] p.121. For example, semantically equivalent, but syntactically different subtrees may be mapped into a more appropriate structure and then substituted. Replicated or useless information (unreachable code) can be removed and the tree balanced thereafter. Later sections in this thesis investigate these concepts in greater detail.

II.3.1. Constructive Placement Techniques

Constructive placement methods develop a cell arrangement by incrementally adding components to an existing partial placement. Partial placements must contain at least one placed component, the seed component, to start. Usually several seeds are initially placed and closely related cells are added to the initial configuration. It is the goal of this technique to create strongly connected cell-clusters. Generally, constructive algorithms rely on iterative methods that are combined with a heuristic evaluation mechanism.

Small standard cell layouts are frequently assembled by hand, because well trained designers are often able to intuitively solve the NP-hard placement problem better than any algorithm can. Intuition or knowledge inherent in the circuit functionality leads to better results in many cases. For obvious reasons, placement by hand becomes infeasible as the size of the circuit grows. Similar to manual cell placement is random placement, in which no circuit specific information is utilized and very poor results may be expected. Frequently, random placement prepares the input for an iterative improvement scheme.

Cluster growth and cluster development are popular bottom-up techniques. Here an initial placement is constructed by scoring and selecting unplaced components. All algorithms in this class partition the components under consideration into a placed and an unplaced set and assign a seed component to the placed set. In successive steps all unplaced components are scored with regard to elements in the placed set and circuit specific information. The component with the highest score is moved to the placed set [49,50]. It is obvious that a "bad" seed selection may have disastrous effects on the final placement.
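The scoring loop of cluster growth can be illustrated as follows. This is a sketch under stated assumptions: nets are given as sets of component names, the score is simply the number of nets an unplaced component shares with the placed set, and ties are broken alphabetically. None of these choices is taken from [49,50]; the function name `cluster_growth` is hypothetical.

```python
def cluster_growth(nets, seed):
    """Order components by repeatedly placing the highest-scoring one.

    nets: list of sets of component names, one set per net (assumed format).
    seed: name of the component that starts the placed set.
    """
    components = set().union(*nets)
    placed = [seed]
    unplaced = components - {seed}
    while unplaced:
        def score(c):
            # Number of nets that connect c to at least one placed component.
            return sum(1 for net in nets
                       if c in net and any(p in net for p in placed))
        # Iterating in sorted order makes max() break ties alphabetically.
        best = max(sorted(unplaced), key=score)
        placed.append(best)
        unplaced.remove(best)
    return placed


nets = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}, {"d", "e"}]
order = cluster_growth(nets, seed="a")
print(order)
```

Running the same example with a different seed changes the whole order, which is one way to see why seed selection matters so much for this class of algorithms.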

Cluster development algorithms differ from the above in that all selections are based on the connectivity patterns of circuit components. A generic cluster development algorithm [51] picks the least connected component from the unplaced set as seed and repeatedly selects the most strongly connected component from the unplaced set and moves it. In case more than one component can be considered for the move, a tie breaking scheme is employed. Cluster development schemes either prepare an initial placement for iterative improvement, or directly establish a final placement. The method is computationally inexpensive and adequate for medium size circuits because it considers only local constraints.

In contrast, partitioning based methods, like MIN-cut [52,53], defer local considerations as long as possible and tend to a global solution. Methods in this class recursively partition components into groups with the goal of avoiding wiring congestion. Bipartitioning algorithms work best with standard cells or gate array sites and can be applied in linear and two-dimensional problem space [15,32,52].

Rectangular assignment problems are approached in a top down fashion. First, unpermuted cell blocks are partitioned into rows and then MIN-cut resolves local congestions within rows [32]. Several studies have shown that cluster development is preferable at times and an additional improvement phase may be required when the global wiring density is not uniform [54]. This phase improves the placement on a neighbor row basis by determining the actual paths in each routing channel. A major problem with MIN-cut algorithms becomes apparent when partitioning sequences contradict natural circuit partitions and clusters of logical units become separated.
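The objective behind MIN-cut partitioning, namely to cut as few nets as possible, can be made concrete with a small example. The sketch below is an assumption-laden illustration: it finds a balanced bipartition by exhaustive enumeration, which is only feasible for a handful of components and is not the heuristic used in [52,53]. The names `cut_size` and `min_cut_bipartition` are hypothetical.

```python
from itertools import combinations


def cut_size(nets, left):
    """Count nets with at least one component on each side of the cut."""
    return sum(1 for net in nets if net & left and net - left)


def min_cut_bipartition(components, nets):
    """Exhaustively find a balanced bipartition that cuts the fewest nets."""
    components = sorted(components)
    half = len(components) // 2
    best = None
    for combo in combinations(components, half):
        left = set(combo)
        cost = cut_size(nets, left)
        # Keep the first partition found with strictly lower cost.
        if best is None or cost < best[0]:
            best = (cost, left, set(components) - left)
    return best


# Two triangles joined by the single net {c, d}.
nets = [{"a", "b"}, {"b", "c"}, {"a", "c"},
        {"c", "d"},
        {"d", "e"}, {"e", "f"}, {"d", "f"}]
cost, left, right = min_cut_bipartition("abcdef", nets)
print(cost, sorted(left), sorted(right))
```

A recursive placement procedure would apply the same split again to each half until the groups are small enough to be assigned to rows or sites.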

Most available systems favor partitioning based placement methods because an implementation is not very time consuming and the computational complexity is reasonable. Unfortunately none of the techniques described so far can guarantee even a close to optimal solution (not considering exhaustion).

Branch and bound algorithms [30] p.370, [55] are valuable but they have the highest time and space complexity. Because of this, branch and bound may not be considered for circuits with a large number of components. The strategy is to recursively improve the upper bound of a given initial placement by construction of subsets from the initial set (set partitioning). An upper bound is calculated for each set and those which yield a lower bound greater than the current bounding value are eliminated. Subtrees which will not lead to an acceptable solution are avoided through backtracking and evaluation of tree nodes with bounding functions. The recursive division of subsets terminates when a "best" solution is found.
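The pruning idea can be shown on a deliberately simple problem: ordering cells in a single row so that the total net span is minimal. The sketch below is an illustration under assumptions, not the formulation of [30] or [55]. The cost model (sum of slot spans), the lower bound (span of nets whose cells are all placed) and the function name `branch_and_bound` are chosen for the example only.

```python
def branch_and_bound(cells, nets):
    """Find a linear order of cells with minimal total net span."""
    best = {"cost": float("inf"), "order": None}

    def partial_cost(order):
        # Span of nets whose cells are all placed. This is a valid lower
        # bound: cells added later go to the right and cannot shrink it.
        pos = {c: i for i, c in enumerate(order)}
        return sum(max(pos[c] for c in n) - min(pos[c] for c in n)
                   for n in nets if all(c in pos for c in n))

    def extend(order, remaining):
        bound = partial_cost(order)
        if bound >= best["cost"]:
            return  # prune: this subtree cannot beat the best known solution
        if not remaining:
            best["cost"], best["order"] = bound, list(order)
            return
        for c in sorted(remaining):
            extend(order + [c], remaining - {c})

    extend([], frozenset(cells))
    return best["order"], best["cost"]


nets = [("a", "c"), ("c", "b"), ("b", "d")]
order, cost = branch_and_bound("abcd", nets)
print(order, cost)
```

Even with pruning, the worst-case search space remains factorial in the number of cells, which matches the remark above that the method is impractical for large circuits.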

II.3.2. Iterative Improvement Techniques

Iterative improvement algorithms manipulate an existing placement with the goal of producing an even better placement at each step of the iteration. Within the improvement process a result is accepted only if the corresponding placement is better than the one of the previous iteration. Designers often establish an initial placement with a simple constructive scheme and then improve it with an iterative scheme. After a placement is established, routing congestions are relaxed by hand or through a user guided method.

Unconnected-set algorithms divide a given circuit into disjoint subsets and iteratively solve the assignment problem for the components within each set. It is obvious that this technique works well for some special cases, but has limitations and is not applicable to circuits without disjoint sets of components. Hence, the method is only applied to decompose large networks into smaller, tractable subnets and to determine least connected subsets in a hierarchical fashion.

More suitable for standard cells are pairwise-interchange procedures which evaluate a current placement and attempt a global or local (neighborhood) improvement. Pairwise interchange algorithms permute cells of close proximity based on random (heuristic) selection. A range-limiter restricts permutations to a rectangular window and thus prevents interchanges outside of the window. In global schemes, project and window size are equivalent such that interblock permutation is possible.
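A pairwise-interchange pass with a range limiter might look like the sketch below. Several simplifications are assumed: the placement is a single row, the cost is the sum of net spans in slot units, and candidate pairs are enumerated systematically rather than selected at random. The names `wire_length` and `pairwise_interchange` are hypothetical.

```python
def wire_length(order, nets):
    """Sum over nets of the distance between leftmost and rightmost slot."""
    pos = {cell: i for i, cell in enumerate(order)}
    return sum(max(pos[c] for c in net) - min(pos[c] for c in net)
               for net in nets)


def pairwise_interchange(order, nets, window):
    """Swap cells at most `window` slots apart; keep only improving swaps."""
    order = list(order)
    best = wire_length(order, nets)
    improved = True
    while improved:
        improved = False
        for i in range(len(order)):
            # Range limiter: partner j must lie inside the window after i.
            for j in range(i + 1, min(i + window, len(order) - 1) + 1):
                order[i], order[j] = order[j], order[i]
                cost = wire_length(order, nets)
                if cost < best:
                    best, improved = cost, True
                else:
                    order[i], order[j] = order[j], order[i]  # undo the swap
    return order, best


nets = [("a", "b"), ("b", "c"), ("c", "d")]
final_order, final_cost = pairwise_interchange(["a", "c", "b", "d"], nets, window=1)
print(final_order, final_cost)
```

Setting `window` to the number of cells removes the limit and corresponds to the global scheme in which any two cells may be interchanged.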

Among iterative schemes simulated annealing [3] seems most effective [4,56]. Similar to other schemes discussed so far, a possibly random initial placement is improved by generating alternate placements and evaluating their score difference $\Delta S$. The score function $S$ typically combines two measures: estimated wire length and total sum of the overlap penalties. The penalty assessed is proportional to the square of the linear cell overlap and is necessary because the interchange of two cells with different width results in a physical overlap. The acceptance of the placement generated by permutation is governed by the acceptance function $f(\Delta S, T) = \min[1, e^{-\Delta S/T}]$. $f$ depends on the temperature $T$, which is lowered in each iterative step with $T_{new} = \alpha T_{old}$. Finally, a placement is heuristically accepted when the acceptance function yields a value larger than a random number in the interval from zero to one, that is: acceptance iff $f(\Delta S, T) \geq \mathrm{random}(0,1)$.
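The acceptance rule and the cooling schedule can be sketched as follows. This is a minimal illustration under assumptions: cells have equal width, so the overlap penalty is zero and the score reduces to an estimated wire length; the starting temperature, the cooling factor `alpha`, the step count and all function names are hypothetical choices for the example.

```python
import math
import random


def acceptance(delta_s, temperature):
    """Acceptance function f(dS, T) = min(1, exp(-dS / T))."""
    if delta_s <= 0:
        return 1.0  # exp(-dS/T) >= 1 here, so the minimum is 1
    return math.exp(-delta_s / temperature)


def span_cost(order, nets):
    """Estimated wire length: sum of the slot spans of all nets."""
    pos = {cell: i for i, cell in enumerate(order)}
    return sum(max(pos[c] for c in n) - min(pos[c] for c in n) for n in nets)


def anneal(order, nets, t_start=10.0, alpha=0.9, steps=200, seed=0):
    """Improve a row placement by random swaps with annealed acceptance."""
    rng = random.Random(seed)
    order = list(order)
    current = span_cost(order, nets)
    best_order, best = list(order), current
    t = t_start
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        delta = span_cost(order, nets) - current
        if acceptance(delta, t) >= rng.random():
            current += delta
            if current < best:
                best_order, best = list(order), current
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo the swap
        t *= alpha  # T_new = alpha * T_old
    return best_order, best


nets = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
start = ["c", "a", "e", "b", "d"]
result, cost = anneal(start, nets)
print(span_cost(start, nets), "->", cost, result)
```

With a high starting temperature nearly every swap is accepted at first; as the temperature falls, the procedure behaves more and more like plain pairwise interchange.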

Among all schemes described in this section, only pairwise interchange and simulated annealing with a high starting temperature (such that virtually all new states are initially accepted) are employed in conjunction with polycells.

II.4. Circuit Transformations

Originally, transformations were introduced for the construction of layouts from high-level specifications through a sequence of local transformations [57,58]. Logic synthesis through transformations is a special design technique in which transformational procedures are successively applied to a description that is initially technology-independent. The final objective of this method is to generate a technology-dependent physical layout in a hierarchical fashion.

Current transformational systems, as discussed in [59,60,61], minimize the desired logic first and then adjust fan-in/fan-out conditions. The first transformational step operates on higher level functions like decoders, adders etc. and simplifies the logic by module decomposition. Generated macros are then transformed into flip-flops, gates and other physical blocks. Fan-in/fan-out constraints are resolved by input factorization or output expansion [62]. This is followed by a step in which all flip-flops are transformed into gate arrangements or blocks through library lookup procedures. Delay, set/reset, preset and preclear requirements are also fulfilled in this step. Finally, all superfluous inverter bubbles (two bubbles on the same line) are removed to eliminate area consuming cells from the circuit.
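The last step, removal of paired inverter bubbles, is a typical local transformation and is easy to sketch on an expression tree. The nested-tuple representation and the function name `remove_double_inverters` are assumptions made for this illustration; the cited systems use their own internal forms.

```python
def remove_double_inverters(expr):
    """Rewrite NOT(NOT(x)) to x throughout a nested-tuple expression tree.

    A signal is a string; a gate is a tuple (operator, argument, ...).
    """
    if isinstance(expr, str):
        return expr
    op, *args = expr
    # Simplify the arguments first, so chains of inverters collapse bottom-up.
    args = [remove_double_inverters(a) for a in args]
    if op == "NOT" and not isinstance(args[0], str) and args[0][0] == "NOT":
        return args[0][1]
    return (op, *args)


before = ("AND", ("NOT", ("NOT", "A")), ("NOT", ("NOT", ("NOT", "B"))))
after = remove_double_inverters(before)
print(after)
```

Each removed pair corresponds to two inverter cells that no longer need to be placed or routed.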

It is clear that the order in which local transformations are applied determines the effectiveness of the minimization process. Data driven expert systems like SOKRATES [63] build and traverse a search tree to determine the best sequence of transformations. The root node of the search tree represents the current implementation; the child nodes denote configurations that result from applying a single rule. After the tree is completely constructed, it is traversed. Subsequently the configuration with the lowest cost is selected. The cost is determined by evaluating all subcircuit trees according to their number of logic levels and their particular area requirements.

Even though local transformations have great potential for cell-based layout improvement, current systems have only explored basic logic transformations. Further tree based transformations for standard cell layout improvement will be discussed in a later chapter.

II.5. Summary

This chapter has presented a brief overview of the previous work in the areas of layout design methods, standard cell placement and circuit transformations. In the course of the discussion, it became clear that fixed size standard cells are attractive for language based layout construction for several reasons. First, standard cell designs are not as restrictive as automatic architectural methods, because any desired architecture can be efficiently realized. Second, standard cells suit the realization of intermediate compiler code, since translators produce pseudocode that maps well onto basic functions implemented by standard cell building blocks. This is especially convenient because new functions can be defined by simply adding cells, so that any series of (sequential) function invocations can be translated into a linear arrangement of cells. Third, functions implemented by standard cells are closed over their domain (inputs and outputs are Boolean). Hence, signal wires may be directly propagated from one cell to another. In fact, wire propagation properties are detectable by modified code optimization techniques like register-store transfer minimization.

The discussion also made clear that design specifications for automatic cell placement must make use of information embedded in the logic specification, i.e. design specifications must be prepared so that a high level logic description which leads to an efficient layout can be easily formulated. To be useful, a hardware description must meet several requirements. First, it must allow for an accurate behavioral specification of any desired digital logic using a reasonable formulation. Since a high level language specification gives the functionality of a design rather than an explicit layout description, a behavioral expression is best. It is most appropriate because applicative descriptions are intuitive, easy to use, and most importantly, not bound to any fixed architecture or fabrication technology. Moreover, a description must efficiently map onto the desired implementation scheme, here standard cell logic, and support or guide its layout topology construction. Finally, it must be possible to give a deterministic algorithm that sufficiently describes a translator. The translation program or compiler must be capable of producing a complete layout as required for fabrication from any syntactically correct source. In addition, the cell catalog and its corresponding functions in the language must allow for easy expansion without the need to rewrite compiler code. Applicative descriptions were shown to be most attractive for this task, since predicates in functional descriptions correspond to either a basic function or a combination of several such basic functions. The next chapter formulates the basic idea of a new placement strategy based on the ideas presented above.

Chapter III

THE CONSTRUCTIVE PLACEMENT STRATEGY

III.1. Introduction

A digital circuit description specifies two fundamental pieces of information: the components that constitute the circuit, and the interconnections of the components involved. In the context of this dissertation all components are standard or support cells that are available in a design library. The library consists of a set of functionally well-defined building blocks that follow the specifications in Chapter V. The standard cell approach is favored here, because design-specific details are properly abstracted in a cell specification. Each cell has distinct inputs and exactly one output terminal, and is expressed with a simple function. Consider the basic logical expression NAND(A,B) which describes a standard cell implementing the function $\overline{AB}$. The data-flow is clearly unidirectional from the inputs A and B to the output of the cell denoted by NAND(A,B), see Figure 9. Since all cells enforce a unidirectional data-flow, from high-impedance inputs to complementary-driven output, cells are characterized by distinct expressions. Larger expressions, those composed of more than one operation, are evaluated starting at the innermost basic expression, applying left-to-right precedence. Figure 10 shows an encoder circuit with three inputs, one for each information element, $E_0$, $E_1$, $E_2$, to be encoded. The circuit generates a single chip-select signal, CS, which is asserted for either $\overline{E_0}E_1\overline{E_2}$ or $\overline{E_0}\,\overline{E_1}E_2$, and corresponds to the logical function $CS=[(E_1 \oplus E_2) \cdot \overline{E_0}]$. The equivalent applicative expression is CS=AND(EXOR(E1,E2),NOT(E0)). The expression consists of three basic operators which can be mapped to a linear cell arrangement with a dominating left-to-right input to result propagation. A left-to-right pattern cannot be ensured in all cases, because sequential circuits contain at least one feedback path from an internal output to some point in the logic. The circuit in Figure 11 is referred to as a cross-coupled nand, and its applicative description requires a "recursion" on the output signal Q, as in Q=NAND(NAND(SET,Q),RESET).
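Innermost-first evaluation of an applicative expression can be demonstrated on the encoder expression CS=AND(EXOR(E1,E2),NOT(E0)). The sketch below assumes a nested-tuple encoding of the expression and a small gate table; both are illustrative choices and not the translator's internal representation.

```python
from itertools import product

# Boolean behavior of a few basic cells (illustrative subset).
GATES = {
    "NOT": lambda a: not a,
    "AND": lambda a, b: a and b,
    "EXOR": lambda a, b: a != b,
    "NAND": lambda a, b: not (a and b),
}


def evaluate(expr, signals):
    """Evaluate a nested-tuple expression, innermost arguments first."""
    if isinstance(expr, str):
        return signals[expr]
    op, *args = expr
    return GATES[op](*(evaluate(a, signals) for a in args))


CS = ("AND", ("EXOR", "E1", "E2"), ("NOT", "E0"))

# Collect every (E0, E1, E2) combination for which CS is asserted.
asserted = [bits for bits in product([False, True], repeat=3)
            if evaluate(CS, dict(zip(("E0", "E1", "E2"), bits)))]
print(asserted)
```

The two combinations printed are exactly those with E0 low and E1, E2 different. A feedback expression such as Q=NAND(NAND(SET,Q),RESET) cannot be evaluated this way in one pass, since Q appears on both sides.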

Figure 9: Different representations of a standard cell. (The diagram shows the cell NAND(A,B) with inputs A and B.)

As shown in this example, the digital logic is expressed in terms of a function which portrays the logical behavior as well as the physical components and the corresponding connectivity. Other physical requirements such as power consumption, timing constraints and signal delays are not considered in the format.

Figure 10: Encoder circuit with three inputs. (Diagram labels: EXOR(E1,E2), NOT(E0), CS=AND(EXOR(E1,E2),NOT(E0));)

The remaining sections of this chapter give an introduction to the basic design approach and present the syntax of the realization language. A simple counter arrangement serves as an example to illustrate how composite structures are assembled from basic expressions. In the third section, circuit synthesis and parsing, as well as tree-flattening and netlist generation, are highlighted. Circuit construction from a linearized tree and its corresponding netlist is briefly discussed in the fourth section. Concluding remarks follow in the last portion of this chapter.

Figure 11: Cross-coupled nand storage circuit. (Diagram labels: SET, RESET, Q, Q=NAND(NAND(SET,Q),RESET);)

III.2. The Basic Design Approach

The goal of the proposed placement scheme is to provide a flexible standard cell layout tool. The scheme is constructive and furnishes the user with a netlist and a standard cell configuration. Combined with other design tools, this information makes an almost fully automated layout production possible. Even though a logic abstraction is translated into a layout, one should not confuse the approach with a "silicon compiler". Silicon compilers produce entire mask topologies ready for fabrication. In contrast to this, the proposed scheme applies recognized compiler techniques [64,65] and produces partial layouts from functional source descriptions. The translator employs syntactical and semantical analysis. In the lexical analysis, or scanning, phase the source code is read from left to right and grouped into tokens. Tokens are sequences of characters, like words, that have a common meaning. During this phase, errors are detected when a malformed source string does not lead to a token. Errors are flagged during the syntax analysis phase when the token stream violates the syntactic rules. In syntax analysis, or parsing, tokens are hierarchically structured according to their relationship to other tokens. Parsing transforms the source into a tree structure that is needed to synthesize output. The parsing operations are based on the grammar that specifies what syntactically correct programs must look like. Semantic analysis scans the source for semantical errors. Since the circuit description language defined in Table 1 (page 36) offers only one data type (all operators and operands are Boolean), a semantic analysis for type-checking is superfluous. Similarly, flow-of-control checking is not required with a pure functional language, because explicit flow-control structures do not exist. Yet uniqueness and name-related checks must be performed. Identifiers, such as those necessary in connection with feedback wires, need to be unique. It is therefore necessary to ensure that there is exactly one definition for each name in a description. This is very similar to label definitions in conventional programming languages, such as Pascal. Sometimes it is important that the same name appears more than once and thus name-related checks are required. Uniqueness and name-related checks can be implemented by maintaining simple counters combined with some syntactic guidance in the name-related case.
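The scanning phase and the counter-based uniqueness check can be sketched together. The sketch makes several assumptions that are not taken from the thesis: identifiers are upper-case letters followed by letters or digits, only function statements are scanned (the `*MAIN:` header and `END.` trailer are left out), and the names `tokenize` and `definition_counts` are hypothetical.

```python
import re
from collections import Counter

# One token is either an identifier or a single punctuation character.
TOKEN = re.compile(r"\s*(?:(?P<ident>[A-Z][A-Z0-9]*)|(?P<punct>[=(),;]))")


def tokenize(source):
    """Group the source into tokens, reading from left to right."""
    tokens, pos = [], 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            # A malformed string that does not lead to a token.
            raise ValueError(f"malformed input at position {pos}")
        tokens.append(match.group("ident") or match.group("punct"))
        pos = match.end()
    return tokens


def definition_counts(tokens):
    """Count how often each identifier appears directly left of '='."""
    return Counter(t for t, nxt in zip(tokens, tokens[1:]) if nxt == "=")


tokens = tokenize("Q=NAND(NAND(SET,Q),RESET);")
print(tokens)

counts = definition_counts(tokenize("A=NOT(Z); A=NOT(Y);"))
duplicates = [name for name, n in counts.items() if n > 1]
print(duplicates)
```

A name with a count other than one on the left-hand side violates the uniqueness requirement, while uses on the right-hand side, such as the feedback signal Q, are unrestricted.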

After completing the parse tree, that is the target representation (here an intermediate code), tree-transformations are performed. Tree-transformations restructure the appearance of a tree while retaining the semantics of the tree. The transformation process is very similar to customary code optimization. As will be shown, some transformations are trivial but considerably improve the intermediate form. In contrast to programming-language compilers, a significant fraction of the compile time may be spent in this phase to achieve layout area reductions. Typical assembly, link and load steps that usually conclude a compilation are not necessary, because a layout editor system can consume and process the intermediate code. Linking and loading corresponds to surface mounting and framing, a process necessary to prepare a design for fabrication.

III.3. The Basic Realization Language

Applicative programming languages have been shown to be most appropriate for digital hardware design [43,46]. A complete program represents a linear form of the schematic as components and corresponding connectivity formulated by equations. Functional terms are concretely descriptive and unambiguous, and hence, most appropriate to capture a standard-cell based design specification.

III.3.1. Syntax

This section specifies the syntax of the basic realization language. A digital circuit is represented as a character sequence stored in a standard ASCII file. Files start with a complete list of input and output signals followed by one or more "functions" used to define the circuit. Each function yields exactly one output and additional identifiers are necessary when more outputs are required. Input/output identifiers must appear exactly once as a left-hand-side of an expression. Feedback wires within the circuit are treated similarly and call for their own "recursive" function. As in traditional CMOS-design, feedback loops in combinational logic and uninitializable storage elements must be avoided. Generally, expressions should be written as clearly and simply as possible and targeted towards modular simulation. A formal definition of the syntax is specified in Table 1. The grammar closely follows the specification in [66] p.8.

Figure 12: Intermediate and feedback wire

Intermediate wire:

    *MAIN:
    INPUTS: X,Y,Z;
    OUTPUTS: MAIN,A;
    MAIN=NOR3(X,Y,A);
    A=NOT(Z);
    END.

Toggle cell with feedback:

    *MAIN:
    INPUTS: CYIN,CLK;
    OUTPUTS: Q,CYOUT,MAIN;
    MAIN=CYOUT;
    CYOUT=AND(Q,CYIN);
    Q=DFF(EXOR(Q,CYIN),CLK);
    END.

Table 1: BNF of the basic implementation language

    <...> ::= *MAIN: [\n] <...> END.
    <...> ::= INPUTS: <...> ; [\n]
    <...> ::= OUTPUTS: <...> ; [\n]
    <...> ::= ε | <...> { , <...> }
    <...> ::= <...> { <...> | <...> }
    <...> ::= A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z
    <...> ::= 1|2|3|4|5|6|7|8|9|0
    <...> ::= MAIN= <...> ; [\n]
    <...> ::= { <...> = <...> ; [\n] }
    <...> ::= NOT( <...> ) | AND( <...> , <...> ) | EXOR( <...> , <...> ) | NAND( <...> , <...> ) | DFF( <...> , <...> ) | DENA( <...> , <...> , <...> )
    <...> ::= <...> | <...>

In order to reduce the size and improve the clarity of the grammatical specification some additional symbols have been included in the metalanguage. These metasymbols (ε, {, }, [, ], _) do not lead to a mathematically more powerful BNF, and the grammar is still context free. In fact, the string-to-string translation into a less elaborate BNF, without additional metasymbols, can be automated by a transducer. The result, a Backus-Naur form with expanded shorthand notation metasymbols, is called pure BNF.

Square brackets [ ] indicate that the enclosed symbols are optional and the sequence may be omitted. Curly braces { } enclose sequences that may occur any number of times, including zero (Kleene star). ε denotes the null, or empty, string, and underlined symbols are terminal characters. These BNF extensions effectively compress the production rules. The program given below employs the grammar in Table 1 and describes the hex-counter shown in Figure 13.

Hexadecimal Counter:

*MAIN:
INPUTS:RES,CIN,CLK;
OUTPUTS:A,B,C,D;
MAIN=T4;
T4=AND(D,T3);
D=DFF(AND(RES,EXOR(D,T3)),CLK);
T3=AND(C,T2);
C=DFF(AND(RES,EXOR(C,T2)),CLK);
T2=AND(B,T1);
B=DFF(AND(RES,EXOR(B,T1)),CLK);
T1=AND(A,CIN);
A=DFF(AND(RES,EXOR(A,CIN)),CLK);
END.
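To make the behavior of the counter description concrete, the following Python fragment is a minimal cycle-level sketch of the listing above. It rests on assumptions that the text does not state: DFF(d,CLK) is taken to store d on every clock tick, RES is taken to be active-low, and A is taken to be the least significant bit. The function names step, value and run are illustrative only.

```python
# Sketch only: cycle-level model of the hexadecimal counter listing.
# Assumptions (not from the text): DFF stores its data input on every
# clock tick, RES=0 clears the flip-flops, and A is the lowest bit.

def step(state, res, cin):
    """Advance by one clock tick; return (new_state, value of MAIN)."""
    a, b, c, d = state["A"], state["B"], state["C"], state["D"]
    t1 = a & cin            # T1=AND(A,CIN)
    t2 = b & t1             # T2=AND(B,T1)
    t3 = c & t2             # T3=AND(C,T2)
    t4 = d & t3             # T4=AND(D,T3), MAIN=T4
    new_state = {
        "A": res & (a ^ cin),
        "B": res & (b ^ t1),
        "C": res & (c ^ t2),
        "D": res & (d ^ t3),
    }
    return new_state, t4


def value(state):
    """Interpret the four flip-flops as a binary number."""
    return state["A"] + 2 * state["B"] + 4 * state["C"] + 8 * state["D"]


def run(ticks):
    """Return (count, MAIN) pairs observed before each of `ticks` ticks."""
    state = {"A": 0, "B": 0, "C": 0, "D": 0}
    trace = []
    for _ in range(ticks):
        before = value(state)
        state, main = step(state, res=1, cin=1)
        trace.append((before, main))
    return trace
```

Under these assumptions, run(17) visits the counts 0 through 15 and then wraps to 0, with MAIN high only at count 15.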

Figure 13: Hexadecimal counter layout.

III.3.2. Modifications to the BNF

The BNF in its current configuration allows for the construction of programs of unlimited size as well as unbounded identifier strings, and contains no mechanism for adding functions to the grammar. Storage restrictions of the machine on which a translator implementation resides must be reflected in the grammar. The size of source code and identifiers is machine and implementation dependent, and is bounded to 100,000 characters for the source and eight characters for identifiers. This corresponds to the current implementation of a complete design toolset in the C language on a SUN 3/160 color workstation. The following variation on the BNF reflects the bound on the source code size. The notation ".."n denotes the maximum number n, with n≥0, of valid symbols enclosed in a pair of double quote delimiters. A maximum source program size of 100,000 characters is then specified as:

::= " "100000
::= *MAIN: [\n] END.
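The two bounds named above, 100,000 characters for the source and eight characters for identifiers, can also be checked before parsing. The following Python fragment is a hedged sketch of such a check. The function name check_limits and the regular expression are assumptions made for illustration; in particular, identifiers are assumed to be an upper-case letter followed by upper-case letters or digits, as the letter and digit productions of Table 1 suggest.

```python
import re

MAX_SOURCE_CHARS = 100_000   # source bound stated in the text
MAX_IDENT_CHARS = 8          # identifier bound stated in the text

# Assumption: identifiers are an upper-case letter followed by
# upper-case letters or digits.
IDENT = re.compile(r"[A-Z][A-Z0-9]*")


def check_limits(source):
    """Return a list of violations of the two size bounds."""
    problems = []
    if len(source) > MAX_SOURCE_CHARS:
        problems.append(f"source has {len(source)} characters")
    for name in sorted(set(IDENT.findall(source))):
        if len(name) > MAX_IDENT_CHARS:
            problems.append(f"identifier {name} is too long")
    return problems
```

A translator could call check_limits on the raw file contents and refuse to continue when the returned list is not empty.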

Similarly, the restriction on the length of identifiers, at most eight characters, is specified with: ::= "{ | }"7. A second, important addition to the grammar grants access to externally defined functions. This is profitable because the original grammar offers only a limited set of inherent functions.

New productions are introduced whenever a new cell for a frequently used subcircuit or component is made available in the standard cell library. Since new cells are designed and added throughout the evolution of a chip, a convenient extension mechanism employing a list of definitions was chosen. The list contains the function identifier and the number m of its input arguments. Repetitions are enforced with a new metasymbol, the exclamation mark (!). Any sequence of symbols enclosed in exclamation marks is repeated exactly m times for m>0, and zero times for m≤0, which yields the empty string ε. Input of strings that denote nonterminals from files is expressed with the metaoperator fread(parts_file,data_item). The metaoperation returns the appropriate data item from the parts file. The file is of type text, and each line contains a function denominator string followed by the number of arguments. Consider the file entry 'NOR 2', which defines a new function NOR that has two inputs and corresponds to the production NOR(<expression>,<expression>). To achieve expansions like this, one must enhance the corresponding BNF production as shown in Table 2.

Table 2: External function extension for the BNF.

::= NOT( ) | AND( , ) | EXOR( , ) | NAND( , ) | DFF( , ) | DENA( , , ) | ( ! , !^(repeat-1) )
::= "fread('parts',functionname)"8
repeat <- fread('parts',number_of_inputs)
::= |

With these extensions, the derivation for a composite expression involving NOR, like NOR(NOT(T),L), proceeds as follows:

→ ( ! , !^(repeat-1) )
→ fread('parts',functionname)( ! , !^(repeat-1) )
→ NOR( ! , !^(repeat-1) )
→ NOR( ! , !^(fread('parts',number_of_inputs)-1) )
→ NOR( ! , <expression> !^1 )
→ NOR( , )
→ NOR( , <expression> )
→ NOR( NOT( ) , <expression> )
→ NOR( NOT( ) , <identifier> )
→ NOR( NOT( ) , )
→ NOR( NOT( ) , L )
→ NOR(NOT(T),L).
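The parts-file mechanism can be illustrated with a short Python sketch. It is not the toolset's implementation: the function names load_parts and expand are hypothetical, the parts file is modeled as a list of text lines, and the arities of the built-in functions are an assumption read off Table 1.

```python
# Sketch only. Built-in arities are an assumption taken from Table 1.
BUILTIN_ARITY = {"NOT": 1, "AND": 2, "EXOR": 2, "NAND": 2, "DFF": 2, "DENA": 3}


def load_parts(lines):
    """Parse 'NAME COUNT' entries into a {name: number_of_inputs} table."""
    table = dict(BUILTIN_ARITY)
    for line in lines:
        if not line.strip():
            continue                      # ignore blank lines
        name, count = line.split()
        table[name] = int(count)
    return table


def expand(name, table):
    """Return the production body for a function, one operand per input."""
    operands = ",".join(["<expression>"] * table[name])
    return f"{name}({operands})"


parts = load_parts(["NOR 2"])
print(expand("NOR", parts))   # NOR(<expression>,<expression>)
```

Keeping the table in a dictionary means that a new cell only requires a new line in the parts file, which mirrors the motivation given above for the extension.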

III.4. Circuit Synthesis

This section describes the translation formalism from applicative expressions to pseudo-layout representations. The process is based on expression parsing, parse tree construction, and tree-flattening followed by a netlist generation phase.

III.4.1. Parsing

The major point of parsing, or syntax analysis, is the construction of a derivation, or parse tree, for a given program string ω according to the rules R of the grammar G governing the language L. All terminal strings of the functional hardware description language used here are specified by the grammar in Table 1 (page 36). Parse trees for context-free strings, ω ∈ L, are constructed with either bottom-up or top-down analysis.

Bottom-up analysis starts with ω and applies successive reduction steps to strings composed of terminals from ω and nonterminals from G. Reductions of the form βαγ → βAγ repeat until S is reached.

Top-down analysis starts with nonterminal symbols and expands a nonterminal symbol in each production step to finally produce ω. A top-down parsing method is chosen here for several reasons. First, most boolean operators associate to the left, so it is natural to employ a left-recursive grammar and parser. Second, LL-parsing conditions can be easily verified by hand, and a transparent method for parser construction from production rules exists for LL but not for LR (Goos and Waite [67] give a rule-based construction procedure for a top-down pushdown automaton that leads to an LL-parser). Last, an efficient LL-parser can be constructed by hand, even though there may be ambiguities due to there being more than one valid parse tree for a given program string. The ambiguities stem from the right- and left-recursiveness of the grammar with respect to the same nonterminal. Generally, such ambiguities can be removed, but this is not necessary at this point of the discussion, because they do not interfere with the meaning of the construct. However, extensions to the grammar concerning the introduction of an extended function set may cause problems. This is the case when functions are bound to a certain evaluation sequence of the input expressions (synchronization, delay). Ambiguous syntax trees must then be avoided with additional nonterminals that enforce single-sided recursiveness with respect to certain nonterminals.

The parse tree construction scheme is easily explained. It starts at the root, which is labeled MAIN, and proceeds towards the leaves as formulated in the algorithm below. The parsing actions are controlled by a parse table. It contains the rules of G and describes the tree production actions that must be carried out for each phrase reduction.

Preorder Parse Algorithm:

Preorder_Parse_Tree():

i) If the function is composed of a single non-expandable token, then the preorder parse tree consists of just that single node.

ii) If a function is composed of more than one expandable token, then the preorder parse tree consists of the root and a preorder parse tree of each argument of the function in left-to-right order.
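The two cases of the preorder parse algorithm map directly onto a recursive-descent routine. The following Python fragment is a minimal sketch for a single prefix expression; it is not the table-driven parser of the toolset. The class Node, the function names and the token pattern are assumptions made for illustration.

```python
import itertools
import re


class Node:
    """One parse-tree node: a function name or an identifier."""

    def __init__(self, name, number):
        self.name = name
        self.number = number      # preorder number, assigned on creation
        self.children = []


def parse(text):
    """Build a preorder-numbered parse tree for one prefix expression."""
    tokens = re.findall(r"[A-Z][A-Z0-9]*|[(),]", text)
    numbers = itertools.count()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expression():
        nonlocal pos
        name = peek()
        if name is None or name in ("(", ")", ","):
            raise SyntaxError("expected a function name or identifier")
        pos += 1
        node = Node(name, next(numbers))   # case i): a single token
        if peek() == "(":                  # case ii): root plus arguments
            pos += 1
            node.children.append(expression())
            while peek() == ",":
                pos += 1
                node.children.append(expression())
            if peek() != ")":
                raise SyntaxError("expected ')'")
            pos += 1
        return node

    root = expression()
    if pos != len(tokens):
        raise SyntaxError("unexpected input after expression")
    return root


def preorder(node):
    """Yield (number, name) pairs in preorder."""
    yield node.number, node.name
    for child in node.children:
        yield from preorder(child)
```

Because a node is numbered before its arguments are parsed, the numbering equals the preorder position, and the left-to-right order of the recursive calls gives the left-descendant order described above.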

Left-descendant parsing order is assured through the proper sequence of recursive calls in ii. Hence, the recursion stack reflects the proper sequence for subtree expansions at any time. Note that the algorithm is guided solely by the syntax of ω, and duplicate subexpressions are expanded as subtrees. Some undesirable node generation can be eliminated by identifier tagging employing a symbol table. The symbol table holds all identifier names encountered during the parse and marks terminals, as well as repeatedly encountered identifiers, as non-expandable. When a non-expandable token is parsed, it is replaced with a nonterminal that refers to the root of the already expanded subexpression. Duplicate subtrees are thus avoided, which results in a directed acyclic graph1 [68]. When a parse begins, a table is initialized with leaf node identifiers from the input list. New identifiers encountered during the parse that do not represent function calls are treated as "local" identifiers. Local identifiers connote feedback wires, and the corresponding subtree is expanded exactly once. After the expansion is completed an identifier is entered into the table.

The construction of a directed acyclic graph and its conversion into a tree2 is presented in the next chapter.
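The role of the symbol table can be sketched in a few lines of Python. The fragment below is an illustration under stated assumptions, not the data structure of the toolset: definitions are given as nested tuples, nodes are plain dictionaries, and an identifier is entered into the table before its defining expression is expanded so that a feedback reference finds the existing node. The name build_graph is hypothetical.

```python
def build_graph(definitions, inputs, root="MAIN"):
    """Expand every identifier once; later uses share the existing node.

    definitions maps an identifier to its defining expression, which is
    either another identifier (a string) or a tuple (function, operand, ...).
    """
    nodes = {}                                    # the symbol table

    def visit(expr):
        if isinstance(expr, tuple):               # function application
            return {"op": expr[0], "args": [visit(e) for e in expr[1:]]}
        if expr not in nodes:                     # first use of identifier
            node = {"op": expr, "args": []}
            nodes[expr] = node                    # enter before expanding
            if expr not in inputs:
                node["args"].append(visit(definitions[expr]))
        return nodes[expr]                        # repeated use: share

    return visit(root), nodes


# Toggle cell with feedback from Figure 12.
toggle = {
    "MAIN": "CYOUT",
    "CYOUT": ("AND", "Q", "CYIN"),
    "Q": ("DFF", ("EXOR", "Q", "CYIN"), "CLK"),
}
root, table = build_graph(toggle, inputs={"CYIN", "CLK"})
```

In this example the reference to Q inside EXOR points back to the node of Q itself, which corresponds to a backedge, and the two uses of CYIN share one node, which corresponds to a cross-edge.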

1A directed acyclic graph (dag) is a directed graph that contains no cycles. Cycles are avoided by means of local identifiers that correspond to wire jogs.

2A dag is called a tree with root k0 iff for every node k≠k0 there exists exactly one path from k0 to k. A sequence (k0,...,kn) of nodes in a directed graph G is called a path if (ki-1,ki) are edges in G, with i=1,...,n.

Figure 14: Parse tree for a 4 bit counter. (Legend: backedges, crossedges.)

Figure 14 depicts the parse tree for a counter circuit with the following description:

Example of a Counter Circuit:

*MAIN:
INPUTS:CYIN,CLK;
OUTPUTS:A,B,C,D,TOG1,TOG3,TOG5,TOG7;
MAIN=NOT(AND(AND(AND(A,NOT(B)),C),NOT(D)));
TOG7=AND(TOG5,D);
D=DFF(EXOR(TOG5,D),CLK);
TOG5=AND(TOG3,C);
C=DFF(EXOR(TOG3,C),CLK);
TOG3=AND(TOG1,B);
B=DFF(EXOR(TOG1,B),CLK);
TOG1=AND(CYIN,A);
A=DFF(EXOR(CYIN,A),CLK);
END.

The program describes a hexadecimal counter that returns a pulse whenever the value ten is reached. During the construction of the parse tree, a thread is woven into the data structure. It reflects the ordering in which the nodes were created, and hence simplifies explicit preorder tree traversals in successive processing steps. The pointer treelist refers to the rightmost node on the lowest level. It is the head of the linked list for fast traversals. Even though a tree walk starting at treelist is a reverse preorder traversal, it is sufficient for the operations performed later. Figure 15 shows the threaded expression tree with preorder numbering for the previously defined counter circuit.

III.4.2. Tree Flattening

Tree flattening is the first step in the code production phase of a compiler. It changes the parse tree representation into a mostly linear structure for which a convenient translation into a sequential code segment exists. Generally, an automaton traverses the tree from top to bottom and generates pseudo code ready for further translation into machine code. Pseudo code is a prefix or postfix notation with control structures and procedures embedded in linearized expressions.

Complete tree flattening is accomplished in a single step, because the functional language lacks procedures and explicit control structures. The pseudo code for a parse tree consists of a linear lineup of cells and a netlist. A linear lineup of cells results when the tree-thread is stretched out and all nodes, other than cells, are disregarded in the process.

Figure 15: Threaded tree with preorder numbering

Linearization Algorithm:

linearization(tree):
begin
1. for each node in the treelist do {
2.   if (nodeid == function_denominator) then {
3.     include nodeid and nodenumber in the linear placement; }
   } /*end-for*/
end; /*linearization*/

Statement 1 steps through the (linked) list and statement 2 determines whether the node represents a function or not. Since functions directly correspond to standard cells, statement 3 produces an instance of a cell. In a successive step the translation into layout format is performed. The layout format contains calls to the cell library with appropriate cell displacement and rotation information. An initial placement for the tree in Figure 14, in descending order, is listed below and is shown in Figure 16.

Linear Placement of the Counter:

The initial placement is: AND_35 EXOR_33 DFF_32 NOT_30 AND_25 EXOR_23 DFF_22 AND_16 EXOR_14 DFF_13 NOT_11 EXOR_7 DFF_6 AND_4 AND_3 AND_2 NOT_1 MAIN_0
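The linearization step amounts to a filter over the tree thread. The Python sketch below reproduces the last seven entries of the placement listed above from the first eight preorder nodes of the counter tree. The representation of the thread as a list of (name, number) pairs and the name linearize are assumptions for illustration; MAIN is treated as a cell because it appears in the listed placement.

```python
# Sketch only: function names that correspond to cells (an assumption).
CELLS = {"MAIN", "NOT", "AND", "EXOR", "NAND", "DFF", "DENA"}


def linearize(treelist, cells=CELLS):
    """Keep function nodes only, in thread order, as NAME_NUMBER labels."""
    return [f"{name}_{number}" for name, number in treelist if name in cells]


# First eight preorder nodes of the counter tree (node 5 is identifier A).
preorder_nodes = [("MAIN", 0), ("NOT", 1), ("AND", 2), ("AND", 3),
                  ("AND", 4), ("A", 5), ("DFF", 6), ("EXOR", 7)]
thread = list(reversed(preorder_nodes))   # treelist walks in reverse preorder
print(" ".join(linearize(thread)))
# EXOR_7 DFF_6 AND_4 AND_3 AND_2 NOT_1 MAIN_0
```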

Figure 16: Linear cell arrangement for a counter.

III.4.3. Netlist Generation

The interconnection of standard cells requires a netlist that contains terminal-to-terminal wiring information for each signal net. Extraction of a netlist from a parse tree is performed with an algorithm similar to the one shown below.

Netlist Generation Algorithm:

netlist(tree):
begin
1. for each node in treelist do {
1.1   if (nodeid == identifier) then {
1.1.1   include parentnode in the net of this identifier;
1.1.2   if (leftchild != nil) then include leftchildnode in the net of this identifier;
1.1.3   if (rightchild != nil) then include rightchildnode in the net of this identifier; }
1.2   if (parent(node) == function_denominator) then create dogleg_net between current node and its parent node; }
2. clean_up_nets;
end; /*netlist*/

As in the last program example, statement 1 again traverses the treelist in descending order following the tree thread. The body of the loop generates identifier-related nets in 1.1. Intercell connections which propagate function-to-function results between adjacent cells are created in 1.2. Identifier nets can simply be determined by finding all nodes connected to an identifier node. Because back- and cross-edges may exist, all tree-edges to and from identifier nodes must be explored. Consider the subtree shown in Figure 17.

Figure 17: Back-, and Cross-edges in a tree

The tree fragment contains cross-edges sprouting from terminal identifier CYIN, and a backedge attached at A. Cross-edges and backedges are treated in a similar fashion by exploring all edges attached to the identifier node and including them in the corresponding net. Step 2 in the algorithm cleans up the nets and removes duplicate (wire) identifiers. At the same time wire nets with common nodes are merged. A complete netlist for the counter is given below.

Netlist for the Counter

Net DOG49: MAIN 0 0   NOT 1 -1
Net DOG48: NOT 1 0   AND 2 -1
Net DOG47: AND 2 0   AND 3 -1
Net DOG46: AND 3 0   AND 4 -1
Net DOG45: DFF 6 0   EXOR 7 -1
Net DOG44: DFF 13 0   EXOR 14 -1
Net DOG43: DFF 22 0   EXOR 23 -1
Net DOG42: DFF 32 0   EXOR 33 -1
Net RCO41: AND 2 1   NOT 30 -1
Net RCO40: AND 4 1   NOT 11 -1
Net TOG5: EXOR 33 0   AND 35 -1
Net D: EXOR 33 1   NOT 30 0   DFF 32 -1
Net TOG3: AND 35 1   EXOR 23 0   AND 25 -1
Net C: AND 35 0   EXOR 23 1   AND 3 1   DFF 22 -1
Net TOG1: AND 25 1   EXOR 14 0   AND 16 -1
Net B: AND 25 0   EXOR 14 1   NOT 11 0   DFF 13 -1
Net A: AND 16 0   EXOR 7 0   AND 4 0   DFF 6 -1
Net CLK: DFF 32 1   DFF 22 1   DFF 13 1   DFF 6 1
Net CYIN: AND 16 1   EXOR 7 1
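The net extraction can be sketched in Python as follows. The fragment is a simplified illustration, not the toolset's routine: terminal numbers are omitted, each node is a dictionary with the hypothetical keys name, number, kind, parent and children, and the clean-up of duplicates in step 2 is obtained implicitly by collecting net members in sets.

```python
from collections import defaultdict


def label(node):
    """Cell label in the NAME_NUMBER form of the linear placement."""
    return f"{node['name']}_{node['number']}"


def netlist(nodes):
    """Return (identifier nets, dogleg nets) for a list of tree nodes."""
    by_number = {n["number"]: n for n in nodes}
    nets = defaultdict(set)      # sets drop duplicate members (step 2)
    doglegs = []
    # Step 1: walk the thread in descending preorder number.
    for node in sorted(nodes, key=lambda n: n["number"], reverse=True):
        parent = by_number.get(node["parent"])
        if node["kind"] == "identifier":                  # step 1.1
            if parent is not None:
                nets[node["name"]].add(label(parent))
            for child in node["children"]:
                nets[node["name"]].add(label(by_number[child]))
        elif parent is not None and parent["kind"] == "function":
            doglegs.append((label(parent), label(node)))  # step 1.2
    return dict(nets), doglegs


# Tree for AND(A,NOT(A)): identifier A occurs twice.
example = [
    {"name": "AND", "number": 0, "kind": "function", "parent": None, "children": [1, 2]},
    {"name": "A", "number": 1, "kind": "identifier", "parent": 0, "children": []},
    {"name": "NOT", "number": 2, "kind": "function", "parent": 0, "children": [3]},
    {"name": "A", "number": 3, "kind": "identifier", "parent": 2, "children": []},
]
nets, doglegs = netlist(example)
```

For the example, the two occurrences of A end up in a single net that connects AND_0 and NOT_2, and one dogleg net links NOT_2 to its parent AND_0.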

III.4.4. Layout Construction

Conventional compilers perform code generation in this step; here the corresponding action is cell assembly. Cells are mounted into rectangular blocks with sufficiently wide routing channels.

Physical cell placement constitutes the last step when generating an abstract layout. As in traditional polycell layouts, rows of cells are assembled from linear placement information. Each row is constrained by additional floor-plan-dependent parameters like maximal routing channel and row width. As in other implementation schemes, cells are grouped into maximum-width rows. The rows are then snake folded and adjacent rows placed with a predefined spacing parameter. As an example, consider a linearization of n cells with a width constraint wmax and a routing height s.

Snake-Folding Algorithm:

snakefolding(linearization, wmax, s):

array cellrow[]; int arraypointer = 1, currentwidth = 0, y_rowlocation = 0; Boolean left = true;

begin
1 while (cells must be processed) do {
2   while (currentwidth < wmax) do {
2.1   considered_cell = get_next_cell_from(linearization);
2.2   currentwidth = currentwidth + width(considered_cell);
2.3   cellrow[arraypointer] = considered_cell;
2.4   arraypointer = arraypointer + 1; }
3   if (left)
    then { /* Cells grow from left-to-right */
3.1   for (u = 1 to (arraypointer - 2)) do produce_cell_instance(cellrow[u]);
3.11  left = false; }
    else { /* Cells grow from right-to-left */
3.2   for (u = (arraypointer - 2) downto 1) do produce_cell_instance(cellrow[u]);
3.21  left = true; }
4.1 cellrow[1] = cellrow[arraypointer - 1];
4.2 arraypointer = 2;
4.3 currentwidth = width(cellrow[1]);
5   y_rowlocation = y_rowlocation + cellheight + s;
  }
end; /* snakefolding */

The basic snake-folding algorithm alternates left-to-right and right-to-left cell abutment to establish a continuous snake, see Figure 18 (page 53). Cell groups with a limited width for either direction are formed in the body of while-statement 2 and temporarily stored in the array cellrow. It is assumed that 2.1 terminates the loop and continues with 3 when the input is exhausted. Opposite ordering in neighboring rows is ensured with the condition left in 3, together with the updates 3.11 and 3.21. The counter values in 3.1 and 3.2 prevent an instantiation of the last cell, so that the width limitation is guaranteed. Statement 4 copies this last element to the first position of the array, and step 5 updates the row location in the y-direction for the subsequent cell instantiations.
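The folding idea can also be expressed compactly in Python. The sketch below is a simplified variant and not a transcription of the algorithm above: instead of carrying the overflowing cell over through the cellrow array, it closes a row before a cell would exceed wmax. The function name snake_fold, the (name, width) cell representation and the choice to right-align the right-to-left rows at x = wmax are assumptions.

```python
def snake_fold(cells, wmax, cellheight, s):
    """Fold a linear lineup into rows; return (name, x, y) placements.

    cells is a list of (name, width) pairs in linear placement order.
    A cell wider than wmax is placed in a row of its own.
    """
    rows, row, width = [], [], 0
    for name, w in cells:
        if row and width + w > wmax:     # cell would exceed the row width
            rows.append(row)
            row, width = [], 0
        row.append((name, w))
        width += w
    if row:
        rows.append(row)

    placed, y = [], 0
    for index, row in enumerate(rows):
        if index % 2 == 0:               # cells grow from left to right
            x = 0
            for name, w in row:
                placed.append((name, x, y))
                x += w
        else:                            # cells grow from right to left
            x = wmax
            for name, w in row:
                x -= w
                placed.append((name, x, y))
        y += cellheight + s              # next row, one channel height up
    return placed
```

With five cells of width 4, wmax = 10, cellheight = 5 and s = 2, the sketch yields three rows; the second row starts at the right edge so that consecutive cells of the lineup stay close to each other across the fold.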

Figure 18: Layout folding. (Panels: linear lineup, cell groups of width wmax, folded rows.)

This procedure generates an input file for the graphics tool containing transposed cell instantiations. The outermost while-loop terminates when both the input file and the cellrow array are processed. After completing the automatic cell placement, it is necessary to establish all interconnections in a routing phase. Routing is performed by Magic's obstacle-avoiding router, which is an integral part of the interactive layout editor. It is invoked as a command from within Magic and automatically interconnects subcells according to a netlist. The router avoids obstacles such as hand-painted power and ground rails. Generally, obstacles are defined as layout topologies that match or interact with one of the routing layers or diffusion. In addition, Magic does not route over subcells, and this feature makes the router particularly useful in the standard cell context. A detailed description of the router is given in Hamachi's dissertation [1].

The tools are specifically designed for interactive use and do not attempt an optimal solution. The router produces a rather conservative interconnection pattern with few tight spots. One problem must be mentioned concerning the completeness of a given routing task. Often the router is not able to complete all required nets (~2% failure) and it is up to the user to resolve this problem. For this reason, an interactive netlist tool for routing verification is provided to verify a layout against a given netlist. Appropriate measures can be taken after textual advice is output. Completed layouts can then be extracted, simulated and mounted in the target frame.

III.5. Summary

The constructive placement scheme explained in this chapter has shown that standard cells with a unidirectional data flow are most appropriate for the implementation of circuit descriptions. For this reason, an applicative language employing prefix notation has been defined and a parsing method was developed for it. A well-formed expression of this language describes a complete circuit. It is evaluated in stepwise fashion starting with the innermost bracket. Results of each evaluation are the immediate inputs to the embracing function. Embracing functions are translated into directly adjacent cells of which one consumes the output of the other. Frequently, results are sent on in this fashion and hence, wire-stubs can be inserted between physically adjacent standard cells for result propagation. This is desirable because the routing channels are relieved from numerous short-range wire connections. Such a reduction of the wire count in the channels is important, because routing is a difficult problem (NP-complete) for which no polynomial-time algorithm is known. The following chapter shows a method for maximizing the number of intercell connections through structural parse-tree transformations, and discusses a technique to select wires for over-the-cell routing.

Parse-graph optimization is the main focus of the following chapter, because the previously introduced parsing method produces a non-optimal graph. Elaborate graphs result from expressions that contain malformed or common subexpressions. The LL(k)-parser shown merely transforms a given expression into a threaded tree without acting upon such constructs. The parsing step, which is followed by tree-flattening, pseudo-code generation and layout construction, cannot detect or eliminate the redundant parts. Even though the elaborate expression is correctly translated into a layout, it contains numerous superfluous cells from duplicate subtree expansions. The following chapter focuses on such optimizations with the goal of obtaining a more area-efficient and possibly faster layout.

Chapter IV

LAYOUT OPTIMIZATION

IV.1. Introduction

Chapter III defined a functional realization language and introduced the principles for language-guided placement of standard cells. The discussion in that chapter focused on the compilation process for applicative expressions. A syntax-directed parsing method that yields sufficient layouts for small or medium-size designs was introduced. However, more ambitious designs require optimization measures for higher-density and faster logic layouts. This chapter presents graph transformation steps targeted at an efficient, electrically equivalent, possibly improved layout.

Ideally, automatically constructed standard-cell layouts should be as good as those done by an experienced designer, but in reality this is hard to achieve, because any general approach to source expression or corresponding graph optimization is limited by undecidability results [69]3. The problem persists even when slightly restricted optimization criteria are introduced. Even though restrictions may simplify the optimization problem, solutions are still hard to find, because programs can neither practically explore all possible permutations in the problem space, nor use intuition or experience in the same way a designer can. Hence, a pseudo-optimal solution must be attempted, and improvements need to be conducted in fractional steps, referred to as semantic-preserving transformations [70].

3The equivalence questions "is L1 = L2?" or "is L1 ⊆ L2?" are decidable for regular sets but not for context-free languages [10] p.281.

The idea behind a transformational step is the local replacement of a small subgraph in the parse tree with another subgraph that is semantically equivalent, but simpler. Transformations are performed sequentially and each modifies only a small portion of a parse structure. True optimizations are extremely hard to achieve, because individual transformations interact. These interactions make it impossible to tell whether a sequence of transformations terminates or not, and how many steps are required. Consider the case in which a sequence of transformations is repeated because one step reverts the achievements of another. Depending on the sequence chosen, an infinite number of iterations may be performed unless a repetition is detected. In this case, the loop must be exited and the problem instance modified.

Such a scheme yields a suboptimal result, but the running time, i.e., the number of required iterations, cannot be determined beforehand. This is the major drawback of this method, and hence a fixed sequence of transformational steps should be used instead. Each transformation must perform a certain modification that, in the context of the other transformations, leads to a better pseudo-optimal result.

It is the primary goal of transformational optimization to unriddle effects caused by the language structure rather than to resolve problems caused by ill-constructed expressions. Common subexpression elimination and constant folding are examples of such transformations. The goal of each individual transformation is to convert basic blocks within the directed acyclic graph (dag), independent of their context, to achieve a reduction of critical path and area. Transformations are based on rules that transform some region of a directed acyclic graph4.

4Even though a parse graph contains cycles through local identifiers, it can be treated as a directed acyclic graph, because back- or cross-edges correspond to wire connections in the final layout.

This chapter presents optimizations that seek an improvement of the cell count, the propagation delay, and the number of routing tracks in the channels. Redundancy elimination, dag restructuring and peephole optimization are discussed in sections IV.2 and IV.3. Routing considerations that simplify the wiring task are elaborated in the fourth section, and the chapter concludes with a short summary.

IV.2. Redundancy Elimination

Poorly written functional expressions frequently contain redundant subexpressions characterized by "computed" values already available elsewhere in the circuit. Common subexpressions come from duplicate portions of source code that are either explicitly written, or introduced through preprocessor expansions. Explicit redundancy should not be introduced in the code unless it considerably improves the readability of the source expression. Most duplicate subexpressions arise from preprocessor macro expansions. Layouts produced from a program graph should not contain multiple repetitions of the same cell block, simply because area is consumed by cells and wiring. Thus, identical subexpressions must be removed from the program graph, and a directed acyclic graph should be constructed. A directed acyclic graph is advantageous because nodes within the dag having more than one parent denote common subexpressions. It must be noted that the syntax-directed definition of a dag, obtained with any LL(k) parsing scheme, is incomplete. It may contain subgraphs that compute the same function without explicit backedges or shared identifier nodes. Only explicitly specified subexpressions, marked as such with "=", are shared among lower level nodes. Common subtree structures that do not expose themselves through multiple parent nodes are more difficult to find. Explicit search algorithms are computationally expensive, and graph restructuring is a more appropriate measure. A graph restructuring technique transforms a given incomplete dag into an almost complete dag, called an improved dag. This section discusses the refinement steps required for the construction of an improved dag from a syntax-directed parse tree, as devised in the last chapter.
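For comparison with the restructuring approach developed in this section, the following Python sketch shows a different, commonly used way to share structurally identical subexpressions: hash-consing, in which every subexpression is looked up in a table before a node is created. It is not the method of this chapter and only detects subtrees that are written identically. Expressions are assumed to be nested tuples, and the name build_shared is hypothetical.

```python
def build_shared(expr, table):
    """Return the node for expr, creating it only on first encounter.

    expr is an identifier (string) or a tuple (function, operand, ...).
    table maps each expression seen so far to its unique node.
    """
    if expr in table:
        return table[expr]               # identical subtree: reuse node
    if isinstance(expr, str):
        node = {"op": expr, "args": []}
    else:
        node = {"op": expr[0],
                "args": [build_shared(e, table) for e in expr[1:]]}
    table[expr] = node
    return node


# EXOR(A,B) is written twice but only one node is created for it.
source = ("AND", ("EXOR", "A", "B"), ("NOT", ("EXOR", "A", "B")))
table = {}
root = build_shared(source, table)
print(len(table))   # 5 nodes: A, B, EXOR, NOT, AND
```

Because the lookup key is the written form, AND(A,B) and AND(B,A) are not recognized as equal here, which is one reason why a normalizing step such as the elimination of right recursion is useful before sharing.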

IV.2.1. Elimination of Right Recursion

Elimination of right recursion improves a given parse tree by restructuring right-derived subexpressions. Subtrees with an operator at the root are moved to the left; this ultimately results in a mostly left-derived tree. The newly generated tree has a high structural integrity among internal subtrees.

Left- and right-recursive constructs are permitted in the grammar given in Table 1. Recursive rules are purposely included in the grammar because an infinite number of programs can be constructed with only a few recursive derivations. Essentially, only one left- or right-recursive rule would suffice to express any product, but for convenience both exist.

Left- and right-recursive derivations grant a set of very powerful productions. This power is especially pronounced for rules that govern the construction of operations. All directives for a derivation with multiple operands are fully recursive, because every operand may itself be expanded into a further operation or an identifier. For this reason, one can formulate a circuit by several syntactically different expressions, allowing for the construction of various structurally different derivation trees. As an example, consider the logical expression y=abc, which is semantically equivalent to y=cba according to the commutative laws. More elaborate examples involve parentheses used to specify operand precedence, like y=(a(bc));. Figure 19 shows two different, yet semantically equivalent derivation trees for a simple boolean expression. Even though both trees are semantically equivalent, a structural difference exists and subtree matching is difficult. The number of possible source expressions can be reduced by eliminating right-recursive productions from the original Backus-Naur description of the language.

y = (a∧(b∧c));   y = (c∧(b∧a));
Y = AND(A,AND(B,NOT(C)));   Y = AND(NOT(C),AND(B,A));

Figure 19: Derivation trees for a simple expression.

Table 3 shows the grammar with a new rule that clarifies and enforces left-recursion. Left-recursion is imposed on derivations of the logical operators AND, EXOR and NAND. These functions are commutative and the operands are exchangeable. The commutative behavior is evident from the symmetry found in the operator tables of the functions, specified in Appendix section C. Flip-flops have non-commutative inputs, and right derivations are permitted for the functions DFF and DENA. Integrity among subtrees is ensured because any operator node expansion for a flip-flop allows only an unambiguous expansion.

Table 3: Grammar modified for left-recursion.

::= *MAIN: [\n] END.
::= INPUTS: ; [\n]
::= OUTPUTS: ; [\n]
::= ε | { , }
::= { | }
::= A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z
::= 1|2|3|4|5|6|7|8|9|0
::= MAIN= ; [\n]
::= { = ; [\n] }
::= NOT( ) | AND( ) | EXOR( ) | NAND( ) | DFF( , ) | DENA( , , ) |
::= |
::= , | , | ,

Even though right-recursive rules can be prevented by a specially prepared grammar5, a parser for the language should accept left- and right-recursive derivations. This is advantageous when a preprocessor is used to expand and include predefined macros. Preprocessors supply macro definition and expansion facilities necessary for the introduction of new functionality into the language. Macros can be written independently and included in any source if desired, using a scheme similar to C's preprocessor. Since a grammar describing a language should not reflect a specific preprocessor expansion scheme, a parser should be able to accept left- and right-derivations. After a parse tree is successfully constructed, one must remove all right-recursive constructs from the tree. Even though right-recursion elimination can be performed during top-down parsing, it should be carried out as a sequence of transformations. This is more convenient because difficult backtracking steps within an LL(k)-parse are avoided. Backtracking becomes necessary when the size of a subexpression exceeds k and the decision for an operator exchange is pending. The transformation scheme is based on an operator exchange performed on commutative functions, illustrated in Figure 20.

5Greibach [71] has shown that the problem of eliminating left-recursion from a grammar is completely solvable, and Foster [72] has provided a practical method. An algorithm which can be easily modified to suit right-recursion elimination from a grammar G is given in [64] p. 177/A4.1.

CF(T,F(...));   CF(F(...),T);

Figure 20: Operator exchange to remove right-recursion.

Transformations are exercised with respect to a subtree root which is a commutative function CF. T denotes a terminal node and F is a non-terminal expanded as subtree ST. The derivations S → CF(E,E) → CF(T,E) → CF(T,F(...)) → ... and S → CF(E,E) → CF(E,T) → CF(F(...),T) → ..., obtained with the simplified generative grammar fragment S → CF(E,E), E → T | F(E,E), show the validity of both derivation trees. When replacing F(E,E) with ST, denoting a complete subtree with root F, the transformation (RRE): CF(T,ST) → CF(ST,T) can be specified for the operator CF with an algorithm like the one below.

Right-Recursion Elimination Algorithm:

remove_recursion(tree):
begin
1. for each node in treelist do {
1.1   if (node == commutative function && leftchild.node == identifier && rightchild.node == function denominator)
1.2   then {
1.2.1   exchange(leftchild, rightchild);
1.2.2   reroute_tree_thread(leftchild, rightchild); } /*endif*/
   } /*endfor*/
end; /*remove_recursion*/

The right-recursion elimination algorithm can be described as follows. Just as in previous algorithms, statement one traverses the treelist in descending order, and the body of the loop conducts the node transformation. The conditions in 1.1 refer to the left side of the transformational rule and cause the structural modification of tree fragments in 1.2 only if an exchange is legal and required. 1.2.1 performs the operand exchange by simply swapping the child pointers, and 1.2.2 reroutes the tree thread so that further tree traversals are conducted in proper order. An example of a completed elimination step is illustrated in Figure 20.
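The rule RRE can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the tree is modelled as nested tuples (op, child, ...) with plain strings as terminals, the set COMMUTATIVE is an assumed operator list, and no tree thread exists, so step 1.2.2 has no counterpart here.

```python
# Assumed set of commutative operators; the source does not enumerate them.
COMMUTATIVE = {"AND", "OR", "NAND", "NOR", "EXOR"}

def remove_recursion(tree):
    """Apply RRE: CF(T, ST) -> CF(ST, T) throughout an expression tree.

    A terminal T is a str; a function node is a tuple (op, child, ...).
    """
    if isinstance(tree, str):
        return tree
    op, *children = tree
    children = [remove_recursion(c) for c in children]
    if (op in COMMUTATIVE and len(children) == 2
            and isinstance(children[0], str)      # left child is an identifier
            and isinstance(children[1], tuple)):  # right child is a function
        children.reverse()                        # step 1.2.1: exchange operands
    return (op, *children)
```

For example, remove_recursion(("AND", "a", ("OR", "b", "c"))) moves the OR subtree to the left of the AND node, while a node whose operator is not in COMMUTATIVE is left untouched.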

IV.2.2. Cascaded Inverter Reduction

The simplest approach to cell-count improvement is cascaded inverter reduction. Among other available techniques, peephole optimization [64] p.554, [67] p.338 is most appropriate for this type of problem. It is a local optimization method in which subgraphs of the parse tree are treated as separate units. Each unit is composed of instruction sequences that may be replaced by a shorter sequence with equivalent meaning. Sequences are expressed as patterns, and a single instruction is associated with each. Whenever an optimization pattern is detected, the corresponding substitution is carried out. Conveniently, inverter chain structures stand out in the tree as degenerate subtree fragments that contain multiple chained "NOT" denominators. Figure 21 illustrates two typical parse structures containing cascaded inverters.

(a) ..NOT(NOT(<E>));
(b) ..NOT(OT); OT = NOT(<E>);

Figure 21: Cascaded inverter chains.

The tree fragment in Figure 21a depicts a directly coupled inverter pair, whereas Figure 21b illustrates an intermediate identifier. Elimination of tightly coupled inverter pairs is a trivial process, formulated with the rule CIRa: NOT(NOT(ST)) -> ST or the matching algorithm below. ST denotes the subtree produced by an expansion of expression <E>, similar to that in the previous subsection.

Inverter Pair Elimination Algorithm:

inverter_pair_elimination(tree):

begin
1.  for each node in treelist do {
1.1     if (nodeid == "NOT" && nodeid.parent == "NOT")
1.2     then { replace both NOT-nodes with a tree edge; } /*endif*/
    } /*for-end*/
end; /*inverter_pair_elimination*/
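As a hedged illustration of rule CIRa, the following Python sketch removes directly coupled inverter pairs from a tree of nested tuples. The tuple representation and the function name are assumptions made for this example; the dissertation operates on a threaded tree with parent pointers instead.

```python
def eliminate_inverter_pairs(tree):
    """Apply CIRa: NOT(NOT(ST)) -> ST, working bottom-up.

    A terminal is a str; an operator node is a tuple (op, child, ...).
    """
    if isinstance(tree, str):
        return tree
    op, *children = tree
    children = [eliminate_inverter_pairs(c) for c in children]
    if (op == "NOT" and isinstance(children[0], tuple)
            and children[0][0] == "NOT"):
        return children[0][1]      # both NOT nodes collapse into a tree edge
    return (op, *children)
```

Because the children are simplified before their parent is inspected, a chain of three inverters reduces to one and a chain of four disappears entirely.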

Inverter chains, as in Figure 21b, are treated similarly. Chains differ from directly coupled inverter arrangements in that identifier nodes are sandwiched between the inverters. Since the inverters carry a negated signal, one inverter must be retained to generate the desired signal. However, the retained inverters "branch off" to the side and do not constitute the main "critical path".

Before: ..NOT(OT); OT = NOT(<E>);
After:  ..EXP(id,OT).. id = <E>; OT = NOT(id);

Figure 22: Inverter chain elimination.

The above transformation preserves the signal for output terminal OT with an identifier in the wire path from expression <E> to identifier OT. A local identifier id must be introduced for the root of subexpression tree <E>, so that OT is derived from id as OT=NOT(id). Both subtrees are linked together with the expander node EXP, which is not an operator or terminal node, but merely imposes a parse order for the subtrees.

Inverter Chain Elimination Algorithm:

inverter_chain_elimination(tree):

local:: nodeptr, topptr, expptr, idptr : pointers to nodes; id : pointer to node n;
begin
1.  for each node in treelist do {
2.      if (node n is an inverter node) then {
            nodeptr = pointer to node n; topptr = nodeptr;
3.          repeat
3.1             topptr = topptr->parent;
3.2         until (topptr points to an operator node);
4.          if (topptr points to an inverter node) then {
5               create a new expander node pointed at with expptr;
5.1.0           replace topptr inverter node with newly created expander;
5.1.1           create a unique identifier id;
5.1.2           create a new identifier node with id-name, pointed at with idptr;
5.1.3           expptr->leftchild = idptr;
5.2             idptr->lchild = nodeptr->lchild;
5.3.1           expptr->rchild = topptr->rchild;
5.3.2           create new identifier node with id-name pointed at with idptr;
5.3.3           nodeptr->lchild = idptr;
5.4             cfree(topptr);
            } /*endif-4*/
        } /*endif-2*/
    } /*endfor-1*/
end; /*inverter_chain_reduction*/

The all-embracing loop (1) of the elimination algorithm tests all nodes along the tree thread sequentially for inverter node matches, performed with the conditional statement in step two. If node n is an inverter, then the repeat loop (3) advances a top-pointer with 3.1 until topptr points to an operator node in 3.2. In case topptr points to an inverter (4), the l-condition of the rule CIRb is satisfied, and the r-transformation is applied with statement sequence five. A new expander node, which cannot be translated into a cell or a wire, is created and implanted into the tree with 5.0.1. Redirection of the appropriate parent pointer replaces the top inverter by the newly introduced expander. As soon as the child pointer is assigned, parse tree and inverter chain become separated. Now, the r-part of the transformation CIRb is carried out. First, a new identifier node with a unique id is created (5.1.1 - 5.1.2) and attached as left child to the expander (5.1.3). In the second step (5.2), the expression subtree, denoted <E> in Figure 22, is repositioned and becomes the left child of the id-node. The link conflict is resolved with statement 5.3.3. In the third block, the inverter chain, excluding the topmost inverter, is linked as right child to the expander (5.3.1).

The crossedge in the graph is established with a second identifier node bearing id (5.3.2), which is affixed to the graph in 5.3.3, simultaneously eliminating the subtree. The last step eradicates the dangling topmost inverter from the graph by freeing the node's memory in line 5.4, which effectively eliminates the inverter chain.
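The effect of rule CIRb can be approximated at the level of signal definitions rather than tree pointers. The Python sketch below is an interpretation under stated assumptions, not the algorithm above: the circuit is a dict that maps each identifier to its defining expression, the expander node EXP is not modelled, and the fresh names "_id0", "_id1", ... are a hypothetical naming scheme.

```python
import itertools

def eliminate_inverter_chains(defs):
    """Rewrite NOT(OT), where OT = NOT(E), into a reference to a fresh id.

    defs maps an identifier to its expression: a str (another identifier)
    or a tuple (op, child, ...). A new dict is returned; defs is not changed.
    """
    out = dict(defs)
    fresh = (f"_id{i}" for i in itertools.count())   # hypothetical names
    alias = {}                                       # OT -> local id carrying E

    def local_id(ot):
        if ot not in alias:
            alias[ot] = next(fresh)
            out[alias[ot]] = rewrite(defs[ot][1])    # id = E
            out[ot] = ("NOT", alias[ot])             # OT = NOT(id), kept on the side
        return alias[ot]

    def rewrite(expr):
        if isinstance(expr, str):
            return expr
        op, *kids = expr
        if op == "NOT" and isinstance(kids[0], str):
            target = defs.get(kids[0])
            if isinstance(target, tuple) and target[0] == "NOT":
                return local_id(kids[0])             # the two inverters cancel
        return (op, *(rewrite(k) for k in kids))

    for name, expr in defs.items():
        if name not in alias:
            out[name] = rewrite(expr)
    return out
```

In this model the consumer of NOT(OT) is wired directly to the new identifier, while OT keeps a single inverter that branches off to the side, as described for Figure 22.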

IV.2.3. Identifier Merging

Parse trees contain three types of nodes: operator, expander and identifier nodes. Operator nodes represent logical functions in the parse tree, and expanders increase the number of inputs available to operator nodes, because certain functions, like DENA, require more than two operands. Identifier nodes denote electrical connection points within wire nets. These points are either directly accessible through input/output pins, or shared within the circuit. Often, electrically equivalent connection points are associated with different identifier names, even though a single identifier would be sufficient. In a situation like this, one should replace all electrically equivalent points with one identifier. Since operations on subtrees require structural and node-content equivalence, a mapping of such identifiers to a single identifier is required. Figure 23 illustrates a tree fragment which contains the equivalent identifier set S = {B,D,E,K,L}. The right graph shows the resulting mapping $M = \{\, y \mid y \in S \wedge y \to x \,\}$, and merging⁶.

Figure 23: Identifier merging of set S = {B,D,E,K,L}.

Merging is based on disjoint set union and tree collapsing. The algorithm compress starts with the construction of n unary sets S, necessary in further union-find operations. After all identifier sets are available, the tree is traversed node-by-node. Simple identifier chains are detected through tree-edges that link parent and child identifier nodes (2.1). Since linked identifiers are electrically equivalent (expressed by the edge between the identifier nodes), both sets Sj and Sk containing the nodes must be determined (2.2). Find(i) returns u, such that i ∈ Su. The set union Sj = Sj ∪ Sk replaces the set Sj in 2.3, so that Sj or Sk no longer exist independently. After all sets Sk are disjoint, a unique identifier for each individual set is created with statement 3.1.

Now, all node names within the tree can be replaced with the newly introduced identifiers id associated with each set S. This eventually leads to chains of duplicate identifier nodes M* which need to be collapsed⁷ with the statements embedded in the loop construct 4. On termination of compress, all identifiers inherent to the graph are merged.

⁶M* denotes a chain of identifier nodes with at least one node that contains element x as identifier. M* -> |x| replaces any chain M* with exactly one node containing only identifier x.

⁷Collapsing rule: If j is an identifier node containing the same information as its parent k, then make j a child of the parent of k and remove node k.

Tree Compression Algorithm:

compress(tree):

begin
1.  for each identifier create a set Si that contains as single element the identifier itself;
2.  for each node n in the tree do {
2.1     if (n is an identifier node && n->parent is an identifier node) then {
2.2.1       j = find(identifier of n);
2.2.2       k = find(identifier of n->parent);
2.3         Sj = union(Sj,Sk);
        } /*end-if*/
    } /*end-for*/
3.  for each set S do {
3.1     create unique id;
3.2     traverse tree and replace each identifier in S with id;
    } /*end-for*/
4.  for each node n in the tree do {
        if (n == n->parent) then { collapse parent node with child; } /*endif*/
    } /*end-for*/
end; /*compress*/
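The union-find and collapsing steps can be sketched in Python as follows. This is a simplified model under assumptions: the Node class with a "kind" field is invented for the example, and the representative name of each set is reused as its identifier, whereas step 3.1 of the algorithm creates a new unique id.

```python
class Node:
    def __init__(self, kind, name, children=()):
        self.kind = kind                  # "op" or "id" (assumed encoding)
        self.name = name
        self.children = list(children)

def walk(n):
    """Yield every node of the tree in preorder."""
    yield n
    for c in n.children:
        yield from walk(c)

def compress(root):
    parent = {}                           # union-find forest over identifier names

    def find(x):
        parent.setdefault(x, x)           # step 1: unary set on first sight
        while parent[x] != x:
            parent[x] = parent[parent[x]] # path halving
            x = parent[x]
        return x

    # Step 2: linked parent/child identifier nodes are electrically equivalent.
    for n in walk(root):
        if n.kind == "id":
            for c in n.children:
                if c.kind == "id":
                    parent[find(c.name)] = find(n.name)

    # Step 3: rename every identifier to the representative of its set.
    for n in walk(root):
        if n.kind == "id":
            n.name = find(n.name)

    # Step 4: collapse a parent identifier node onto an identical child.
    def collapse(n):
        n.children = [collapse(c) for c in n.children]
        if (n.kind == "id" and len(n.children) == 1
                and n.children[0].kind == "id" and n.children[0].name == n.name):
            return n.children[0]
        return n

    return collapse(root)
```

A chain of identifier nodes B, D, E above an operator therefore ends up as a single identifier node named B that keeps the operator as its child.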

IV.2.4. Restructuring

Profitable optimizations, such as common subexpression elimination and subgraph matching, require a high integrity within the parse structure. Structural integrity is achieved through reorganization steps, like lexicographical leaf sort and sinistral tree adjustment. Lexicographical leaf sorting arranges terminal identifiers among lowest level nodes. The permutations ensure detection of large isomorphic subtrees⁸. Structural isomorphism within the tree is established by left-handed (sinistral) tree adjustment. Sinistral tree adjustment swaps physically larger subtrees attached to a commutative operator node to the left. This operation yields trees with large preorder depth. Trees with great preorder depth are advantageous because many operator pairs are generated along the left-handed trail. As the discussion of routing considerations at a later point in this chapter will show, numerous short-range or intercell connections can be removed from this chain.

Lexicographical Identifier Sort:

Lexicographical identifier arrangement establishes matching node structures at the lowest level of the expression tree. This is important, because isomorphism of semantically equivalent trees of equivalent structure may remain unnoticed when some leaf terminals are permuted with regard to some commutative operator.

Isomorphic structures are obtained from the graph with either operator exchanges or transformations.

Leaves in an expression tree should be exchanged iff their immediate parent operator n represents a commutative function with the rightchild(n) ranking lower than the leftchild(n). The ranking order is that of a lexicographically ordered set. As an example, consider the leaf transformations for some basic node configurations, illustrated in Figure 24. The set of transformations allows permutations for subtrees with up to three identifier nodes, and is sufficient for most expressions.

Similar to production grammars, a reconfiguration sequence is complete when no rules can be applied anymore. The achieved structure is then the semantically equivalent result. Note that an explicit rule for the transformation (LEX3): ..F1(F2(a,b),c).. -> ..F1(F2(c,b),a).. to satisfy the condition ord(c) < ord(b) < ord(a) is not necessary because of the rule (LEX2): ..F2(a,b).. [with ord(a) > ord(b)] -> ..F2(b,a)..; and hence (LEX3.3): ..F1(F2(b,a),c).. -> ..F1(F2(c,b),a)...

⁸Two graphs $G_1(V_1,E_1)$ and $G_2(V_2,E_2)$ are said to be isomorphic if there exists a one-to-one and onto function from the nodes of one graph to the nodes of the other, such that the corresponding nodes are equivalent. That is, $f: V_1 \to V_2$ such that $(u,v) \in E_1$ iff $(f(u),f(v)) \in E_2$. Two nodes are equivalent if they have identical node contents, meaning identical operation codes and equivalent inputs.

An algorithmic description for a lexicographical identifier sort is merely a tedious repetition of the transformational rules, and hence omitted. Simple graph adjustment for terminal identifiers was presented in the previous paragraph. In what follows, adjustments for nodes inherent to the tree are offered. Equivalent subgraphs with structural differences still exist, and sinistral tree adjustment is required to establish isomorphic structures.
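Although the text omits an algorithmic description, the leaf rules can be condensed into a short Python sketch. It rests on assumptions not stated in the source: trees are nested tuples with strings as identifiers, ord() is taken to be ordinary string comparison, and the equivalence required by the LEX3 rules is assumed to hold only when F1 and F2 are the same operator from an assumed ASSOCIATIVE set.

```python
ASSOCIATIVE = {"AND", "OR", "EXOR"}            # assumed: commutative and associative
COMMUTATIVE = ASSOCIATIVE | {"NAND", "NOR"}    # assumed: commutative only

def lex_sort(tree):
    """Order leaf identifiers lexicographically under commutative operators."""
    if isinstance(tree, str):
        return tree
    op, *kids = tree
    kids = [lex_sort(k) for k in kids]
    if len(kids) == 2:
        left, right = kids
        # LEX3.x: F1(F2(id1,id2),id3) with F1 == F2: sort all three leaves.
        if (op in ASSOCIATIVE and isinstance(left, tuple) and left[0] == op
                and len(left) == 3
                and all(isinstance(x, str) for x in (left[1], left[2], right))):
            a, b, c = sorted((left[1], left[2], right))
            return (op, (op, a, b), c)
        # LEX2: F(id1,id2) with ord(id1) > ord(id2) -> F(id2,id1).
        if (op in COMMUTATIVE and isinstance(left, str)
                and isinstance(right, str) and left > right):
            return (op, right, left)
    return (op, *kids)
```

Sorting the three leaves reproduces the outcome of each LEX3 case, since every rule places the lowest-ranking identifiers in the inner function and the highest-ranking one outside.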

LEX1:   ..NOT(id)..  ->  ..NOT(id)..
LEX2:   ..F(id1,id2).. ∧ ord(id1) > ord(id2) ∧ {F(id1,id2) ≡ F(id2,id1)}  ->  ..F(id2,id1)..
LEX3.0: ..F1(F2(id1,id2),id3).. ∧ ord(id1) ≤ ord(id2) ≤ ord(id3)  ->  ..F1(F2(id1,id2),id3)..
LEX3.1: ..F1(F2(id1,id2),id3).. ∧ ord(id1) ≤ ord(id3) < ord(id2) ∧ {F1(F2(id1,id2),id3) ≡ F1(F2(id1,id3),id2)}  ->  ..F1(F2(id1,id3),id2)..
LEX3.2: ..F1(F2(id1,id2),id3).. ∧ ord(id2) < ord(id3) ≤ ord(id1) ∧ {F1(F2(id1,id2),id3) ≡ F1(F2(id2,id3),id1)}  ->  ..F1(F2(id2,id3),id1)..
LEX3.3: ..F1(F2(id1,id2),id3).. ∧ ord(id3) < ord(id1) ≤ ord(id2) ∧ {F1(F2(id1,id2),id3) ≡ F1(F2(id3,id1),id2)}  ->  ..F1(F2(id3,id1),id2)..
LEX3.4: ..F1(F2(id1,id2),id3).. ∧ ord(id3) < ord(id2) ≤ ord(id1) ∧ {F1(F2(id1,id2),id3) ≡ F1(F2(id3,id2),id1)}  ->  ..F1(F2(id3,id2),id1)..

Figure 24: Lexicographical tree transformations.

IV.2.5. Sinistral Tree Adjustment

Sinistral tree adjustment strengthens left-derivations in the expression tree. The derivations are strengthened by moving the larger subtree attached to a commutative operator node to the left. Figure 25 shows two subtrees, S1 and S2, affixed to a commutative operator CON.

Figure 25: Sinistral tree adjustment of two subtrees S1 and S2.

Both subtrees are of different size, and the permutation in 25a requires only a size comparison. The statement sequence 4.2 in algorithm sinistrorse is responsible for the swap, which is carried out if |S1| < |S2| and CON is commutative. Note that the structure is preserved for the case |S1| > |S2|, as illustrated in Figure 25b. Naturally, the tree thread must be adjusted after a subtree exchange was carried out (4.2.2).

Suppose the size of S1 is equal to that of S2; then the size criterion is no longer sufficient, and the synthesized strings for the childnodes of CON must be taken into account.

|S1| = |S2|, ord("ABL") > ord("ABA")

Figure 26: Subtree exchange of equivalent size subtrees with different synthesized strings.

A permutation of subtrees is conducted iff CON is a commutative node and the synthesized string of the left childnode ranks lexicographically higher than the string attached to the right childnode, see Figure 26. Step 4.3 in the algorithm tests the necessary conditions, and 4.3.1 performs the subtree swap. As in the previous step, the tree-thread is rerouted with the procedure invocation in step 4.3.2. At the end of sinistrorse, the entire tree is orderly arranged with regard to the internal identifier and operator node structure.

Sinistral Tree Adjustment Algorithm:

sinistrorse(tree):

begin
1.  compress(tree);
2.  treesize(tree);
3.  nominate(tree);
4.  for each node n in the tree do {
4.1     if (n is a commutative operator with two operator nodes as children) then {
4.2         if ((n->leftchild).count < (n->rightchild).count) then {
4.2.1           swap(n.leftchild, n.rightchild);
4.2.2           adjust_tree_thread(n);
            } /*end-if*/
4.3         if (((n->leftchild).count == (n->rightchild).count) && (ord((n->leftchild).leaf_attribute) > ord((n->rightchild).leaf_attribute))) then {
4.3.1           swap(n.leftchild, n.rightchild);
4.3.2           adjust_tree_thread(n);
            } /*end-if*/
        } /*end-if*/
    } /*end-for*/
end; /*sinistrorse*/
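A compact Python sketch of the size and string criteria is given here for orientation. It makes several assumptions: the Node class and the COMMUTATIVE set are invented for the example, subtree sizes and leaf strings are computed by one helper (annotate) instead of separate treesize and nominate passes, identifiers are single characters, the swap is applied to any pair of children rather than only to operator children as condition 4.1 demands, and no tree thread is maintained.

```python
COMMUTATIVE = {"AND", "OR", "NAND", "NOR", "EXOR"}   # assumed operator set

class Node:
    def __init__(self, name, left=None, right=None):
        self.name, self.left, self.right = name, left, right
        self.count, self.leaves = 1, name            # filled in by annotate()

def annotate(n):
    """Attach subtree size and the sorted, duplicate-free leaf string."""
    if n is None:
        return 0, set()
    if n.left is None and n.right is None:
        n.count, n.leaves = 1, n.name
        return 1, {n.name}
    lc, ls = annotate(n.left)
    rc, rs = annotate(n.right)
    n.count, n.leaves = lc + rc + 1, "".join(sorted(ls | rs))
    return n.count, ls | rs

def sinistrorse(n):
    """Move the larger (on a tie, the lexicographically lower) subtree left."""
    if n is None or n.left is None:
        return
    sinistrorse(n.left)
    sinistrorse(n.right)
    l, r = n.left, n.right
    if n.name in COMMUTATIVE and r is not None:
        # Conditions 4.2 and 4.3 combined into one test.
        if l.count < r.count or (l.count == r.count and l.leaves > r.leaves):
            n.left, n.right = r, l
```

Usage: call annotate(root) first and sinistrorse(root) afterwards. A swap changes neither the size nor the leaf string of the parent, so the annotations stay valid during the traversal.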

The algorithm sinistrorse starts with the construction of an orderly numbered and concise tree in steps one through three. Within the body of loop 4, each node in the tree is processed. Step 4.2 indicates a childnode exchange in case the right subtree is larger, and 4.3 treats subtrees of equal size. If required, root nodes are exchanged based on lexicographical leaf-node properties. Note that conditions 4.2 and 4.3 may be combined, resulting in a shorter code segment. Both swap and adjust sequences (4.2.1, 4.2.2; 4.3.1, 4.3.2) are separated for clarity.

Figure 27 gives an example of a complete sinistral tree transformation. Each node in the illustration contains the subtree size, which is required as a permutation criterion in step 4.2. Located beside each node is either a terminal identifier enclosed in single quotes, or a synthesized identifier string enclosed in double quotes. "ACKU"(10), for example, indicates a subtree root with a total of nine descendant nodes. Figure 27a shows the expression tree before adjustment, and Figure 27b shows it after sinistral adjustment.

Note that the insertion of "AC" into "AKU", or vice versa, performed in nominate 4.1 considers 'A' only once to form "ACKU". This is important because 'A' can be viewed as a crossedge, and "AACKU" would falsely overvalue the identifier 'A'. "AK"(4) is a non-commutative node marked by hatching, and hence 'A'(1) and "K"(2) are not swapped. In contrast, parent "AKU"(6) permutes its children "AK"(4) and 'U'(1) by applying 4.2 in sinistrorse. Similarly, "ACKU"(10) exchanges "AC"(3) and "AKU"(6). Since the subtrees of "AC"(3) are of equal size, condition 4.3 becomes true, resulting in the swap of 'C'(1) and 'A'(1). The children "BM"(4) and "AT"(4) are treated alike. As the illustration shows, a semantically equivalent subtree for an expression containing "ABCKMTU"(20) is the result. The steps 4.2.2 and 4.3.2 indicate that each swap must be followed by an adjustment of the tree thread.

Figure 27: Left adjusted tree structure.

Subtree Size Determination:

The algorithm sinistrorse invokes the function treesize in step two. It returns the tree passed to it with the subtree size attached to every node. Each node is considered as being the root of a subtree, having two possibly empty subtrees attached to it. The size of the tree for expression ..F(i,k).., for example, is three for the operator node F, and one for either terminal node. Since both subtrees can be empty (leaf nodes), the root node is included in the subtree size, i.e., the size of a fragment consisting of a single terminal T is one.

The size of a tree structure with a root node n is computed with the function treesize. Subtree size attributes, referred to as tree.count, are recursively synthesized and implanted at each node. Left-subtrees are completely explored in preorder (2.1) to obtain the precise count. Line 2.2 updates node counts for the left-path discovered in 2.1. At this point, count values are short by the number of nodes found in right-branched subtrees. An appropriate adjustment is made through exploration of the right subtree structure with call 3.1.1 and count update in 3.1.2. The recursion terminates the instant a leaf node (terminal) is reached. The non-recursive stopping case 1 simultaneously initializes the leaf-count value to one.

Subtree Size Determination Algorithm:

treesize(tree):

begin
1.  if (tree == leaf) then { tree.count = 1; }
2.  else {
2.1     treesize(tree->leftchild);
2.2     tree.count = (tree->leftchild).count + 1;
3.1.0   if (tree->rightchild <> nil) then {
3.1.1       treesize(tree->rightchild);
3.1.2       tree.count = tree.count + (tree->rightchild).count;
        } /*end-if*/
    } /*end-if*/
end; /*treesize*/
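The recursion translates almost directly into Python. The sketch below assumes a hypothetical Node class with left, right and count fields; unlike the pseudocode, it also tolerates a node that has only a right child and returns the computed size for convenience.

```python
class Node:
    def __init__(self, name, left=None, right=None):
        self.name, self.left, self.right = name, left, right
        self.count = 0

def treesize(tree):
    """Attach the subtree size (root included) to every node and return it."""
    tree.count = 1                               # step 1: a node counts itself
    for child in (tree.left, tree.right):        # steps 2.1 and 3.1.1
        if child is not None:
            tree.count += treesize(child)        # steps 2.2 and 3.1.2
    return tree.count
```

For the expression ..F(i,k).. from the text, the operator node F receives count three and each terminal receives count one.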

Synthesized node names in the tree are important for decisions involving equal size subtrees and their exchanges. Step 4.3 of sinistrorse utilizes this information, and subtree swaps are based on the lexicographical evaluation of synthesized strings. The tree fragment depicted in Figure 28 shows such lists synthesized from terminal identifiers.

..NAND(AND(NOT(D),A),K); K=NOT(AND(T,U)); (root attribute: "ADKTU")

Figure 28: Synthesizing leaf-identifier attributes in a tree fragment.

In the synthesis process, leaf nodes are initially tagged with their immediate identifier literals. As the synthesis process proceeds, strings in parent nodes are created and expanded with leaf node information carried towards the root. Parent nodes are tagged with ordered lists of literals enclosed in double quotes. When the algorithm nominate terminates, all attribute fields are tagged with an appropriate list.

Double literals are purposely avoided so that all identifier nodes receive the same attention. As an example, consider the nodes "AC"(3) and "AKU"(6) in the left-subtree of the original graph, shown in Figure 27. Line 4.1 of nominate merges both lists, which results in "ACKU"(10). The double evaluation of literal 'A' is prevented by a set function. This is sufficient, because a sole 'A' signals the identifier presence and avoids complicated list comparison schemes.

Name Synthesis Algorithm:

nominate(tree):

begin
1.  for each identifier id do {
2.      ptr = pointer to id-node;
3.0     repeat {
4.1         insert identifier id into the attribute list of ptr.leaf_attribute;
4.2         ptr = ptr.parent;
3.1     } until (root node is finished);
    } /*for-end*/
end; /*nominate*/
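The leaf-to-root walk of nominate can be sketched in Python as below. The Node class with a parent pointer and the helper leaves are assumptions of this example; the ordered insertion without duplicates uses the standard bisect module in place of the set function mentioned in the text.

```python
import bisect

class Node:
    def __init__(self, name, *children):
        self.name, self.children, self.parent = name, children, None
        self.leaf_attribute = []          # ordered, duplicate-free identifier list
        for c in children:
            c.parent = self

def leaves(n):
    """Yield the terminal nodes of the tree from left to right."""
    if not n.children:
        yield n
    for c in n.children:
        yield from leaves(c)

def nominate(root):
    for leaf in leaves(root):                            # step 1
        ptr = leaf                                       # step 2
        while ptr is not None:                           # step 3: walk to the root
            attr = ptr.leaf_attribute
            i = bisect.bisect_left(attr, leaf.name)
            if i == len(attr) or attr[i] != leaf.name:   # no double literals
                attr.insert(i, leaf.name)                # step 4.1
            ptr = ptr.parent                             # step 4.2
```

Applied to a root whose children carry the leaves A, C and A, K, U, the walk tags the children with "AC" and "AKU" and the root with "ACKU", counting 'A' only once.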

The code of nominate sequentially processes every leaf-identifier with the loop statement 1. Note that the order in which identifiers are processed is not important. For each iteration, the local pointer ptr is assigned to the corresponding terminal node (2). Loop statement three conducts the tree walk to the root. Nodes on the path from leaf to root are sequentially visited by ptr. Whenever a new node is referenced, the leaf attribute field of this node is modified. Proper insertion of the identifier into the list is captured by statement 4.1. It assumes that all identifiers are of type character. Lexicographically ordered strings are formed using an insert operation. Movements of ptr are conducted with a pointer-parent assignment in step 4.2. Each tree walk terminates in 3.1 as soon as ptr encounters the root. When the algorithm terminates, all nodes are tagged with ordered identifier strings, as required in sinistrorse. Figure 27a shows a completely tagged expression tree.

IV.2.6. Common Subexpression Elimination

Common subexpression elimination is defined as the process of finding and removing redundant code segments from a program expression, and sharing of the produced value by the defining occurrence. Since a syntax, or parse, tree is a graphical representation of an expression in preorder form, common subexpression elimination is equal to subtree matching. The elimination step corresponds to the substitution of the matching tree fragment with a shared node and the disposal of the tree fragment.

Subtree matching is carried out with the tree identity algorithm isomorphic, which determines whether two trees specified by their roots are identical, [73] p.244. Even though the graph isomorphism problem is solvable in polynomial time (because the graph is planar and directed [29] p.285), the algorithm below requires exponential time.

Tree Identity Algorithm:

isomorphic(N1,N2):

begin
1.  if (tree_is_empty(N1) && tree_is_empty(N2)) then return(true); else
2.  if (tree_is_empty(N1) || tree_is_empty(N2)) then return(false); else
3.  if (item(N1) <> item(N2)) then return(false); else
4.  return( isomorphic(N1.leftchild, N2.leftchild) && isomorphic(N1.rightchild, N2.rightchild) );
end; /*isomorphic*/

The code for isomorphic recursively explores both trees N1 and N2 from the roots down, in parallel fashion. Structural conformity is determined with the conditional statements one and two. Node equivalence, constituting the second necessary condition for graph isomorphism, is tested with the third comparison. Step four invokes the exploration of both child subtrees and returns a combined value of success or failure to the previously calling level. In the event that both subtrees N1 and N2 are isomorphic, the value true is returned by all calls, including the first invocation. Note that the code is independent of node positions and, hence, all combinations along the thread can be investigated. The identity algorithm is essential to find all pairs of isomorphic subtree roots within a partial dag. Similar to index movements in a selection sort, two node pointers *m and *k progress along the tree thread. Each combination, referred to by *m and *k, is tested for tree equivalence, assuming that m and k are roots of isomorphic trees. If both trees are indeed equivalent, a shared node is introduced for one of the trees.

The common subexpression elimination algorithm utilizes the function isomorphic as condition in step four. Function allpairs finds all matching subtree pairs in the tree, and eliminates one in each case (4.3). On termination of the code an improved dag is returned.

Common Subexpression Elimination Algorithm:

allpairs(tree):

local:: m,n,k : node numbers; *m, *k : pointer to numbered node; u : identifier node;
begin
1.  n = renumberation(tree); /*number all nodes in the tree thread*/
    m = 1;
2.  while (m < n) do {
        k = m+1;
3.      while (k <= n) do {
4.          if (isomorphic(*m, *k)) then {
4.1             create node u for subtree k that shares m;
4.2             include u in the tree thread;
4.3             dispose(*k);
4.4             n = renumberation(tree); k = k-1;
            } /*end-if*/
            k = k+1;
        } /*wend*/
        m = m+1;
    } /*wend*/
end; /*allpairs*/
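The pairwise scan can be sketched in Python as follows. The sketch works under assumptions of its own: the Node class is invented, the tree thread is modelled as a preorder list that is rebuilt after each elimination (the counterpart of renumberation), and sharing is expressed by letting two parents point at the same Python object instead of introducing an identifier node U.

```python
class Node:
    def __init__(self, item, left=None, right=None):
        self.item, self.left, self.right = item, left, right

def isomorphic(a, b):
    """Tree identity test following the four steps of the pseudocode."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    if a.item != b.item:
        return False
    return isomorphic(a.left, b.left) and isomorphic(a.right, b.right)

def thread(root):
    """Preorder list of distinct nodes; a shared subtree is listed once."""
    seen, order, stack = set(), [], [root]
    while stack:
        n = stack.pop()
        if n is None or id(n) in seen:
            continue
        seen.add(id(n))
        order.append(n)
        stack.extend((n.right, n.left))
    return order

def redirect(root, old, new):
    """Let every parent of `old` point at `new` instead."""
    for n in thread(root):
        if n.left is old:
            n.left = new
        if n.right is old:
            n.right = new

def allpairs(root):
    nodes = thread(root)
    m = 0
    while m < len(nodes):
        k = m + 1
        while k < len(nodes):
            if isomorphic(nodes[m], nodes[k]):
                redirect(root, nodes[k], nodes[m])   # steps 4.1 to 4.3
                nodes = thread(root)                 # step 4.4; k stays in place
            else:
                k += 1
        m += 1
    return root
```

After allpairs returns, duplicate subtrees such as the two AND(A,B) fragments in OR(AND(A,B), NOT(AND(A,B))) are represented by a single shared object, so the structure is a dag rather than a tree.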

Statement one of allpairs calls the function renumberation, which sequentially numbers all nodes along the tree thread. It starts with zero at the root node and returns the highest node-count value for loop control. The enumeration variable m is controlled by the outer loop (2), and *m is advanced after each completed pass over all nodes with k. For each new state achieved by the *m and *k positions, the question "are both trees pointed at isomorphic?" is asked, and answered by statement four. If both trees are equivalent, then a shared node is created to replace the left-subtree root in 4.1. After the tree-thread is adjusted with 4.2, one disposes the k-subtree with 4.3. Because this action relieves the tree of several nodes, n is recalculated and the tree-thread renumbered (4.4). The progress of the algorithm is visualized in Figure 29. It shows an expression tree with common subexpressions rooted at E1, E2 and E3. Consider illustration 29a, in which *m and *k refer to E1 and E2 respectively. For this configuration, isomorphic(*m,*k) reports matching subtrees, and 4.1 creates a node U with a pointer to E1⁹. Figure 29b shows the situation just before subtree E2 is disposed, and 29c depicts the same tree fragment after thread adjustment (4.2-4.3) and renumberation (4.4). Note that node H gets node number eight after the disposal of E2 and the freeing of F2. Illustration 29d shows the completely renumbered tree with pointers adjusted for subsequent tree elimination steps. Clearly, subexpression E3 is a candidate for elimination in similar fashion as soon as k reaches node 12. Figure 29e shows the completed tree after elimination of the subtrees E2 and E3.

After pointer *m encounters the end of the thread, the desired improved dag is generated. It may still contain undetected common subexpressions that appear as structurally different, but semantically equivalent graphs, like ..NOR(NOT(A,B)..) and ..NAND(NAND(A,B),A)... For this reason, local transformations are presented in the following sections.

⁹U will make reference to a new and unique identifier introduced as parent of E1 and child of C.

Figure 29: Common subexpression elimination.

IV.3. Peephole Optimization

Peephole optimization is the simplest optimization method, and is used in many compilers. It attempts to overcome inefficiencies introduced by syntax directed parsing. The strategy is to combine instruction sequences into single or more efficient sequences having the same effect. Operator combinations that match a certain sequence are searched for within a small window (peephole) covering only a few nodes. Whenever a sequence is found it is either restructured or replaced, resulting in a smaller dag, a faster circuit or both. Two basic strategies for peephole optimization are available. Davidson [74] introduced instruction optimizations based on register transfers, and Tannenbaum's [75] strategy is recognition of predefined patterns. Since the "code" generated here is placement information, register transfer optimization is not meaningful. However, Tannenbaum's approach based on the recognition of predefined patterns and associated transformations proves applicable. Several transformations that prove useful for dag improvement are listed below.

Algebraic simplification is a standard way to improve target code. Traditionally, one applies algebraic identity transformations to programs solving numerical problems. But the method can be adapted for placement improvement. The transformation is best explained with an algebraic example. Consider a multiplication by two, i.e., A = 2*A; one could do a multiplication, or replace the operation by an addition or a binary shift. Similarly, a set of logic transformations can be specified for the replacement of small subgraphs in the dag by other functionally equivalent, but simpler, subgraphs. Darringer and his coauthors [57,58] give a representative example for a set of simple digital local transformations¹⁰, much like Tannenbaum's set of numerical algebraic transformations. Consider the transformations: ..NOT(AND(A,B)).. -> ..NAND(A,B).. and ..OR(AND(NOT(A),B),A).. -> ..OR(A,B).., under the assumption that an operation OR(X,Y) exists. Obviously the resulting operations are more efficient.

¹⁰Darringer and coauthors suggest "LSS: A System for Production Logic Synthesis". Here, synthesis is understood as a production sequence to obtain feasible, but not necessarily optimal, networks of technology-specific boxes that satisfy a large number of constraints from a register-transfer level language description.

Reduction in strength replaces expensive operations with equivalent cheaper ones. For example, x^2 should be implemented as x*x rather than as a call to expensive math-library functions, such as exp(2*ln(x)). Cheaper in this context refers to the number of standard cells involved in the implementation. A simple reduction in strength, the 'cascaded inverter reduction' ..NOT(NOT(A)).. -> ..A.., was already discussed in the previous section.

Constant assertion treats constructs that compute a logical value, e.g., ..AND(NOT(A),A).. -> ..LOW!..; or ..EXOR(NOT(A),A).. -> ..HIGH!... The logical constants LOW! and HIGH! correspond to tie-off cells which contain a circuit that produces logical levels without noise disturbances. Direct connections to power and ground rails must be avoided in CMOS-design at all cost, because transients on the power lines may damage the cells.

Constant folding is similar to assertion in that a logical constant is employed in the result generation. It differs in that the output is either the input itself, ..AND(A,1)..->..A.., or its negation, ..EXOR(A,1)..->..NOT(A)... Folding may be required for two reasons. First, the designer seeks to introduce a 'delay', which is completely meaningless in a retimed synchronous system [76]. Second, the constant was created by a previous constant assertion step. In the second case, the following peephole optimization sequence should occur: ..EXOR(EXOR(NOT(A),A),B)..->..EXOR(HIGH!,B)..->..EXOR(B,HIGH!)..->..NOT(B)... This shows that each peephole improvement may spawn opportunities for additional improvements. Thus, repeated passes of the window over the dag are necessary to get the maximum benefit.

Code depth improvement is concerned with the number of transistors in the actual implementation or the number of logic stages in a design. Since this transformation directly affects the lowest level of the design, it should be performed towards the end of an optimization sequence. Suppose that ..NOT(AND(A,B))..->..NAND(A,B)..; then two inverter levels, or four transistors, are removed from the layout. This is a significant improvement, because two inverter delays are eliminated from the signal path and a considerable amount of area is saved, as shown in Figure 30.
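The peephole rules named so far (cascaded inverter reduction, constant assertion, constant folding and code depth improvement) can be illustrated with a small rewriting sketch. This is only an illustration under stated assumptions: expressions are modeled as nested Python tuples rather than as the threaded dag used by the tool, the names rewrite and optimize are hypothetical, and the constant 1 in AND(A,1) is written as HIGH!.

```python
HIGH, LOW = "HIGH!", "LOW!"

def rewrite(e):
    """Apply at most one peephole rule at the root of e (assumed tuple encoding)."""
    if not isinstance(e, tuple):
        return e
    op, args = e[0], e[1:]
    if op == "NOT":
        (a,) = args
        if isinstance(a, tuple) and a[0] == "NOT":
            return a[1]                      # cascaded inverter reduction
        if isinstance(a, tuple) and a[0] == "AND":
            return ("NAND",) + a[1:]         # code depth improvement
    if op in ("AND", "EXOR"):
        a, b = args
        if a == ("NOT", b) or b == ("NOT", a):
            return LOW if op == "AND" else HIGH   # constant assertion
        if a == HIGH and b != HIGH:
            return (op, b, a)                # move the constant to the right
        if b == HIGH:
            return a if op == "AND" else ("NOT", a)   # constant folding
    return e

def optimize(e):
    """Rewrite bottom-up and repeat until no rule applies any more."""
    while True:
        if isinstance(e, tuple):
            e = (e[0],) + tuple(optimize(x) for x in e[1:])
        new = rewrite(e)
        if new == e:
            return e
        e = new

# The sequence from the text: EXOR(EXOR(NOT(A),A),B) ends up as NOT(B).
result = optimize(("EXOR", ("EXOR", ("NOT", "A"), "A"), "B"))
```

The loop in optimize plays the role of the repeated window passes: a rule that fires may expose a new match one level up, so rewriting continues until a pass changes nothing.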

Complement utilization takes advantage of cell layouts which provide the output signal and its complement. This allows the removal of inverters at the lowest level. The introduction of a wire connection to the inverted output makes the inverter superfluous. The most effective reductions are observed with sequential circuits employing flip-flops, such as DFF and DENA. Figure 30 shows an example of a NAND(A,B) implementation. The transformation eliminates two transistors from the signal path. This is possible because the desired function, NAND(A,B), is available at terminal -2 as an intermediate result.

General transformations are based on cell specific properties for the optimization of cell count or circuit speed. Two methods fall in this category: extended algebraic transformation of functions that seem unrelated (Figure 31), and function mapping to intermediate results in subcells (Figure 32). Algebraic transformations introduce an additional level into the logic. However, the total cell width is smaller and two neighborhood connections (doglegs) become possible. Furthermore, complement utilization may apply to the embedded inverter, depending on the particular NOR-function implementation.

[Figure 30 panels: AND followed by NOT(..), 8 transistors in the signal path; NAND, 4 transistors in the signal path.]

Figure 30: Code depth improvement of an AND-NAND structure.

Figure 31: Extended algebraic transformation.

Another opportunity for peephole optimization is the mapping of explicitly stated functions to substructures of existing cells that contain optimized subfunctions. Refer to the EXOR implementation shown in Figure 32. An EXOR cell may be utilized to replace other logical constructs. Control points within cells are made available, and internal signals can be taken off a cell with conductive layers masked on top of the cell, similar to a dogleg. It must be noted that over-the-cell wiring is affected by this action, and up until now, few attempts were made to guide signals using hand routing.

Figure 32: Functions available within an EXOR-cell.

Other peephole optimizations, such as unreachable code elimination, multiple jump replacement, redundant load and store removal, and the use of machine idioms, are not meaningful in this context. Unreachable code is removed during construction, and hence no explicit removal action is required. Similarly, multiple jump replacement or the removal of unlabeled instructions immediately following unconditional jumps is irrelevant in conjunction with this (restricted) functional language. Clearly, transformations that improve register utilization or rearrange instruction sequences for a certain machine or hardware configuration are meaningless. Compilers that produce machine code for a statement like A = A+1; try to use autoincrement on a register instead of a load-store sequence and explicit constant addition through register operations. Because our target code is not executed and merely constitutes placement information (inline code without register involvement, similar to a serial register-only machine), register utilization techniques are useless.

Many experiments have shown that peephole optimizations improve a dag, and with it the corresponding layout. However, one problem concerning the window size and the completeness of the transformations remains. A transformation is said to be complete or closed iff it is bidirectional. This means a full boolean equivalence must exist for all transformational rules. Peephole optimization rules must be devised intuitively because the minimum equivalent expression problem¹¹ is NP-hard. This, together with problems in tree-template matching, restricts the number of nodes a peephole transformation can consider at each step. This set of nodes is called the window, and it is of limited size.

11 Minimum Equivalent Expression Problem: A well-formed Boolean expression E involving literals of the set V of variables, the constants true (HIGH!) and false (LOW!), and the logical connectives ∧, ∨, ¬, →. Question: Is there a well-formed Boolean expression F that contains K or fewer occurrences of literals, such that F is equivalent to E, that is, such that for all truth assignments to V the truth values of F and E agree? (Problem definition taken from [29] p. 161.)

IV.4. Routing Considerations

In the floorplanning phase of integrated circuits, large rectangular routing areas are designated for interconnections among standard cells and context circuitry. Normally, connections between standard cells are carried out in channels located amid or beside cell rows. Since routing channels between cells are limited in size and wires may enter from all four sides of the rectangle, switchbox routing is required.

Switchbox wiring is a more restrictive version of channel routing and is sometimes referred to as four-sided channel routing. It is more difficult than general channel routing and falls into the category of NP-hard problems. Burstein [28] gives several references to heuristics developed for the problem.

One possibility for speeding up the routing process is avoiding wires in channels. This is desirable because the complexity of all suggested algorithms depends on the number of nets that must be routed. For this reason, one should remove as many wires (dual terminated nets) as possible to achieve a maximal improvement. For this purpose, intercell connection stubs, as shown in Figure 39, and horizontal routing tracks on top of the cells are available. Figure 38 illustrates the six over-the-cell routing tracks reserved for short range connections. Result propagation between adjacent cells is carried out with intercell stubs, called doglegs. Intercell connections are easily spotted in the dag because physically neighboring cells in the layout correspond to adjacent operator nodes along the tree-thread. The edge between nodes 16 and 17 in Figure 33 represents a dogleg connection. It connects the output of EXOR_17 with the left input of DFF_16. A very similar situation exists for NAND_1 and AND_3. Generally, adjacent operator nodes whose order numbers are equivalent or differ by one should be directly connected with a dogleg stub. The advantage of this measure is that one track in the routing channel and numerous vias become superfluous. Note that the dags in Figures 33 and 34 are trimmed for illustrative purposes and do not reflect the results of a complete optimization series. Nodes four and five, for example, should appear swapped with regard to AND_3. This permutation would result in a larger right derivation tree rooted at node three, and in an additional dogleg between DFF_6 and AND_3.

Candidates for over-the-cell routing are wires connecting cells in close proximity, allowing for a larger number of leads to be moved onto the very limited number of tracks. Even though only short wire jogs are eliminated from a switchbox, over-the-cell routing is advantageous. It divides the horizontal routing task into independently solvable conventional switchbox routing and collinear routing¹². Raghavan and Sahni give an algorithm [77] that finds an optimal solution for the general collinear routing problem. Their algorithm guarantees an optimal routing pattern through an enumerative process in polynomial¹³ time and space. Hence, each wire that can be moved out of the channel reduces the exponent of the switchbox problem by one, while causing only a linear increase in the over-the-cell routing complexity.

The algorithm select_edges returns all candidates for over-the-cell routing in the set Sover. Elements in the set are either right-derivations of operator nodes or shared identifier connections in the dag. Right-derivation edges are similar to dogleg connections in that an operator input directly connects to an operator node at a lower level. As an example, consider the configuration of nodes one and eleven in Figure 33 (page 102). The edge between the output of node NOT_11 and the right-derivation input of NAND_1 cannot be translated into a dogleg. The left-derivation NAND_1 to AND_3 already claims the intercell connection, and thus the connection NOT_11 to NAND_1 becomes a candidate. It only qualifies as a candidate because the left-derived subtree of NAND_1 may require all six routing tracks. An algorithm for over-the-cell wire selection for a directed graph is given below. select_edges has two parameters: the improved dag (tree) and the set of selected wires, returned in the set Sover.

12 The over-the-cell routing problem is actually duo-collinear. As Figure 40 visualizes, there exist two rows of equivalent terminal points above and below the routing tracks.

13 The time complexity is O((2k)! * k * n * log(k)) and the space requirement is O((2k)! * k), with k being the count of available tracks and n the number of terminals in the single row instance. Note two important observations: the complexity is O(n) for a fixed number of tracks k, and the duo-collinear problem is not as difficult as the one solved here. In the duo-collinear case two rows of terminals exist.

Over-the-cell Wire Selection Algorithm:

select_edges(tree, Sover):

local :: ptr, mptr: pointers to nodes; cnt: integer;
begin
      Sover = clearset();   /* deletes all elements in the set */
      cnt = 1;

1     /* right derivation edges */
1.1   for each node n in tree list do {
1.2     if (node n is an operator node) then {
          ptr = pointer to node n;
          mptr = ptr->parent;
1.3       while ((mptr->parent <> operator) && (mptr->parent <> root)) do {
            mptr = mptr->parent;
          } /*wend*/
1.4       if (mptr->node is a right child of an operator node) then {
1.5         Sover = union(Sover, [cnt, ptr.input, (mptr->parent).rightchild]);
            cnt = cnt+1;
          }; /*endif*/
        }; /*endif*/
      }; /*end-for*/

2     /* back and cross-edges */
      ptr = pointer to rootnode;
2.1.1 while (ptr <> nil) do {
2.2     if (ptr points to an identifier node) then {
          mptr = ptr.next_thread;
2.3       while ((ptr->name <> mptr->name) && (mptr.next_thread <> nil)) do {
            mptr = mptr.next_thread;
          } /*wend*/
2.4       if (ptr->name == mptr->name) then {
2.5         Sover = union(Sover, [cnt, mptr->parent, ptr->parent]);
            cnt = cnt+1;
          }; /*endif*/
        }; /*endif*/
2.1.2   ptr = ptr.next_thread;
      } /*wend*/
      return(Sover);

end; /* select_edges */
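The two steps of the pseudocode above can be sketched in Python as follows. This is a simplified reading, not the tool's implementation: the Node class, its field names and the helper demo_thread are assumptions made for the illustration, the tree-thread is modeled as a plain list of nodes in preorder, and the upward search in step 1 starts at the operator node itself.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(eq=False)
class Node:
    number: int                       # preorder node number
    name: str                         # operator or identifier name
    is_operator: bool
    parent: Optional["Node"] = None
    is_right_child: bool = False

def select_edges(thread: List[Node]) -> List[Tuple[int, int, int]]:
    """Return triplets (wire id, node number, node number) for over-the-cell routing."""
    s_over: List[Tuple[int, int, int]] = []
    cnt = 1
    # Step 1: right-derivation edges of operator nodes.
    for n in thread:
        if not n.is_operator:
            continue
        m = n
        # Climb until the parent is an operator or the root is reached.
        while m.parent is not None and not m.parent.is_operator:
            m = m.parent
        if m.parent is not None and m.is_right_child:
            s_over.append((cnt, n.number, m.parent.number))
            cnt += 1
    # Step 2: back- and cross-edges to the closest identifier of the same name.
    for i, p in enumerate(thread):
        if p.is_operator or p.parent is None:
            continue
        for q in thread[i + 1:]:
            if not q.is_operator and q.name == p.name and q.parent is not None:
                s_over.append((cnt, q.parent.number, p.parent.number))
                cnt += 1
                break
    return s_over

def demo_thread() -> List[Node]:
    """Hypothetical dag for AND(NAND(A,B), NOT(A)), listed in preorder."""
    and1 = Node(1, "AND", True)
    nand2 = Node(2, "NAND", True, parent=and1)
    a3 = Node(3, "A", False, parent=nand2)
    b4 = Node(4, "B", False, parent=nand2, is_right_child=True)
    not5 = Node(5, "NOT", True, parent=and1, is_right_child=True)
    a6 = Node(6, "A", False, parent=not5)
    return [and1, nand2, a3, b4, not5, a6]
```

For the example dag, select_edges(demo_thread()) yields one right-derivation edge from NOT (node 5) to AND (node 1) and one shared-identifier edge between the parents of the two A nodes (nodes 5 and 2). A list is used instead of a set because every triplet already carries a unique wire identifier.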

The algorithm select_edges returns potential candidates for over-the-cell routing. In the first step, it determines all right-derivation candidates in the dag. Statement 1.1 conducts a tree walk along the thread, and for each operator node (1.2) a search for its closest operator parent node is initiated (1.3). If node n is a first order right-derivation subtree node (1.4), its connection is included in Sover. Each element in the set is a triplet composed of a wire identifier number cnt followed by two node numbers. The net identifier is important because more than one connection may exist between two nodes. Multiple copies of identical elements are prevented through simple set union, as in steps 1.5 and 2.5.

Figure 34 illustrates the necessity for parallel edges. Node 17 refers back to node 13 for the child-identifiers A and Q4. Both are candidates for over-the-cell wiring, as determined in step two. Identifier sharing refers to the detection of back- and cross-edges. Since wiring information is gathered from the graph, name references between nodes must be considered. The second program statement guarantees correct back- and cross-references to the closest matching identifier with regard to the preorder node numbering. It traverses the dag along the thread starting at the root (2.1.1, 2.1.2) and examines each identifier correspondence in the body of 2.2. A second subtree walk with pointer mptr is conducted in 2.3. In case there exists a matching node (2.4), an edge between the parent nodes of the two matching identifier nodes is included.

[Figure 33 edge list: 1-11, 1-3, 3-6, 6-7, 13-16, 11-13, 16-17.]

Figure 33: Edges for short range wiring.

Figure 34: Backedges to closest identifier.

Wires for over-the-cell routing are chosen from the set Sover based on their individual length. Short wires are selected first and placed in the outermost tracks. Consecutive steps consider wires of increasing length. Track assignments are made immediately for each selection, from the outside tracks inward, until no further assignment is possible. The track configuration shown in Figure 35 was obtained with the algorithm assign_tracks.

[Figure 35 content: cell row EXOR_17, DFF_16, AND_13, NOT_11, EXOR_7, DFF_6, AND_3, NAND_1 with over-the-cell tracks, and a table listing edge, distance, and track assignment for each wire.]

Figure 35: Over-the-cell track assignment.

Track Assignment Algorithm:

assign_tracks(Sover):

begin
    for all edges e do {
        distance[e] = abs(u.nodenumber - v.nodenumber);
    }
    repeat {
        select edge e with the smallest distance value;
        if (more than one edge possible) then {
            select either edge for e;
        }
        if (there exists a track from cell u to v for wire e) then {
            place e on a track closest to the terminals;
        }
    }; /*repend*/
    until (no edge e can be placed anymore);

end; /*assign_tracks*/

The complexity of the above greedy strategy is O(n). Even though the method seems quite primitive, numerous experiments have shown that it yields exceptional results.
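A possible rendering of the greedy strategy in Python is sketched below. Several details are assumptions made for the illustration: node numbers double as horizontal positions, two wires may share a track when their spans at most touch at an endpoint, tracks are tried from the outside inward, and the edges are sorted by distance up front (which costs O(n log n) rather than the O(n) stated for the original). The names outside_in and assign_tracks are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, int]   # (wire id, node number u, node number v)

def outside_in(k: int) -> List[int]:
    """Track indices ordered from the outermost tracks towards the center."""
    order: List[int] = []
    lo, hi = 0, k - 1
    while lo <= hi:
        order.append(lo)
        if hi != lo:
            order.append(hi)
        lo, hi = lo + 1, hi - 1
    return order

def assign_tracks(s_over: List[Edge], k: int = 6) -> Dict[int, int]:
    """Greedily place the shortest wires first; return {wire id: track index}."""
    tracks: List[List[Tuple[int, int]]] = [[] for _ in range(k)]
    placed: Dict[int, int] = {}
    for wire, u, v in sorted(s_over, key=lambda e: abs(e[1] - e[2])):
        lo, hi = min(u, v), max(u, v)
        for t in outside_in(k):
            # The wire fits if it overlaps no span already on this track.
            if all(hi <= a or b <= lo for a, b in tracks[t]):
                tracks[t].append((lo, hi))
                placed[wire] = t
                break
    return placed

# Hypothetical wire set on two tracks; wire 5 finds no free track.
demo = assign_tracks([(1, 1, 11), (2, 3, 6), (3, 6, 7), (4, 13, 17), (5, 2, 12)], k=2)
```

Wires that do not fit on any track are simply left out of the result and remain in the routing channel.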

IV.5. Summary

This chapter discussed optimization steps for an expression tree generated with the syntax-directed method presented in Chapter III. The goal of this chapter was to give an efficient method for parse-tree improvement that results in a faster and smaller layout. All optimizations were performed on the cell level, either reducing the total number of building blocks involved or replacing cell groups with more efficient combinations. After a brief introduction, the construction of an almost complete directed acyclic graph was presented. Several essential construction steps, such as right recursion elimination, cascaded-inverter reduction, identifier merging, tree restructuring and common subexpression elimination, were shown. Peephole optimizations on the dag, which are conceptually the simplest but most effective improvement measures, were also introduced. Furthermore, two basic principles of grammatical and local transformations were explained and applied for structural dag enhancement. Wiring improvements, inspired by the excessive cost of switchbox routing, were discussed in the preceding section.

The optimization scheme introduced in this chapter reaches its limits when no replacement of a cell group or subfunction leads to any further improvement. Even though impressive reductions of the layout cost can be obtained with the transformational optimizations described, cell boundaries are the natural limitation of this approach. For this reason, the design hierarchy must be extended to a lower level, so that the cell boundary limitation is at the designer's discretion. A custom standard-cell design method is most appropriate because area and critical path considerations can be integrated in the process. As an example, consider macro definitions that are normally expanded by the preprocessor into area consuming cell arrangements. An efficient standard cell can be designed for each macro to obtain a smaller project area and a higher maximum frequency. The following chapter expands on this idea and presents a construction scheme for standard cells.

Chapter V

DESIGN OF STANDARD CELLS

V.1. Introduction

This chapter is concerned with the design of standard cells and gives a brief introduction to layout configuration and simulation issues. Much of the detail in this chapter is intended to provide a study of how transistor level diagrams must be cast into standard cells under rigid constraints imposed by cell context and by fabrication technology. Context considerations concerning block alignment and routing, which lead to a design frame, are discussed in the first section. An introduction to complementary p-well CMOS technology, the latch-up effect, and measures to prevent it follow in the subsequent sections of this chapter. The remaining sections are intended to provide a layout scheme for fully complementary dual row polycells based on a graph transformation method. The chapter concludes with a short section on cell simulation and logic verification.

V.2. Standard Cell Design Frame

Standard cells are modular building blocks with characteristic features that provide compatibility among modules in the same family, like the 74 series building blocks. For this reason, a standard cell design frame or template was developed from the macrocell arrangements described in Ling's dissertation [17]. The main issues involved in the design of a standard-cell template include: size, abutment, global and local signal routing, power distribution, p-well placement, and latch-up prevention, as well as time-delay reduction. From these considerations the need for a design frame becomes obvious. All cells developed for this dissertation research were cast into the frame discussed here.

The frame is tailored towards fully complementary logic and contains four strips of p- and n-diffusion, as illustrated in Figure 36. In fact, most complementary logic (gate equivalents) can be designed with a single row of n-transistors above a row of p-transistors [78]. When common gate connections of complementary transistor pairs are aligned, simple cells can be constructed with a series of abutting source-drain connections of transistors. This "line of diffusion" scheme becomes even more attractive when two pn-stripes are run in parallel, so that common gate connections (polysilicon) can be placed vertically and thereby intersect several diffusion areas at a time. One major advantage of this scheme is that all input terminals of a cell become directly available on top (north) and bottom (south) for intercell connections.

[Figure 36 labels: rows of p-transistors and n-transistors with their wells; three metal-2 over-the-cell routing tracks.]

Figure 36: Primary rows reserved for transistors.

The suggested frame has a vertical pitch (H) of 92λ and a variable width (W). Unlike several commercially available cell libraries [15, 79], the width is not bound to a fixed unit raster that imposes an incremental growth in kλ steps (k>1). For a number of reasons, such as inclusion or adaptation of other libraries, the cell concept presented here is flexible in horizontal extension; the width depends only on circuit complexity and implementation artwork. For several reasons, such as internal connection problems or cell row break points, a maximum width of 150λ for a single cell should never be exceeded.

Cell designs are initially based on a grid size that can house a contact or a 4λ² transistor in each unit square. The grid forces contacts to occupy alternate squares and simultaneously prevents design rule violations; see PC1 and PC2 in Figure 37. The method has proven useful since minimal transistor and contact sizes are responsible for the dimensions of a complete layout. A standard transistor of any polarity (2λ by 4λ) is composed of a unit square filled with diffusion, superimposed by a centered 2λ polysilicon sheet. An example of a unit size transistor structure Ts is given in Figure 37. Transistor adaptation to stage-ratio requirements is performed in unit steps. Active devices, such as the transistors of the inverter in the lower half of the cell, are primarily placed along the Vdd! and GND! power tracks and grow towards the opposite diffusion layer. Arrangements of vertically aligned gates with short polysilicon connections are desirable to avoid a large wire resistance.
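The effect of wire length and width on resistance can be made concrete with the usual sheet-resistance estimate, in which a wire counts as length/width "squares". The sketch below is only an illustration: the function name wire_resistance is hypothetical, the polysilicon value of 25 Ω per square is the figure quoted later in this section, and the 2λ and 3λ widths are taken as example inputs.

```python
def wire_resistance(length: float, width: float, sheet_resistance: float) -> float:
    """Estimate wire resistance in ohms from its geometry (length and width in lambda)."""
    squares = length / width          # number of squares along the wire
    return sheet_resistance * squares

POLY_SHEET = 25.0                     # ohms per square, value quoted for polysilicon

# A 200-lambda polysilicon jog at two different widths.
r_narrow = wire_resistance(200, 2, POLY_SHEET)   # 2 lambda wide
r_wide = wire_resistance(200, 3, POLY_SHEET)     # widened to 3 lambda
```

Under these assumptions the 2λ wide jog comes to 2500 Ω, and widening it to 3λ reduces the estimate by one third, which is why short and wide polysilicon connections are preferred.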

Figure 37: Design grid with inverter structure

Track assignment and dimensions for a basic cell layout are given in Figure 38. All connections within a cell are carried out in metal-one (M1), metal-two (M2) or polysilicon (P), which is favorable for short cell internal connections. This does not apply to connections made outside of standard cells or blocks, where no sheet material other than metal must be used. For this reason, primary data connection terminals (input/output) are made available on top and bottom of the cells in M1. Routing channels are placed between cell rows with a preferred inter-cell data flow from left to right, unless a chain of cells is folded.

[Figure 38: dimensioned drawing of the cell frame with a total height of 92λ and three 12λ power rails.]

Figure 38: Basic design frame

Within routing channels any aluminum layer may be used to establish cell-to-cell terminal connections. As mentioned earlier, polysilicon connections are limited to short cell jogs with a maximum of 200λ, because of the relatively high sheet resistance of 25 Ω/λ². Wherever polysilicon wires are used, the track should be adjusted to the underlying grid and widened to at least 3λ to reduce the resistive component of the jog. Even with the high sheet resistance, poly is frequently preferred over metal connections. Polysilicon wires can be laid out "thinner" if space requirements are stringent, and the wire is used to form gate regions along its path. In contrast to poly, n+ and p+ diffusion should never be considered as interconnection material because diffusion regions have a considerable voltage dependent capacitance to the substrate and a comparatively high sheet resistance. For these reasons, only diffusion sharing for adjacently placed transistors is recommended, to reduce the edge capacitances and simplify latch-up prevention measures.

Power distribution is conducted on metal-two since it has the lowest sheet resistance of 0.027 Ω/λ² and the smallest parasitic capacitance. Three power lines of 12λ each traverse the cells in west-east direction. As shown in Figure 37, a single GND! track travels through the center and two Vdd! tracks are located on top and bottom. Three horizontal M2 tracks of 3λ may be placed in the 28λ spacing between the power distribution rails to route logical signals horizontally across a cell. Over-the-cell routing tracks are distributed according to the complexity of the individual cell. More sophisticated circuit implementations may require internal M2 connections that demand one such track. In such a situation, additional M2 tracks cannot be allotted, and as a result fewer tracks across the cell are available. Metal-two wires are most appropriate for over-the-cell routing because M2 is allocated to the topmost conductive layer in the standard scalable CMOS process. Generally, M2 is mainly employed for longer interconnections since no vias other than M1-M2 are provided. For this reason, metal-one is the most precious and most demanded layer, because it is the only one which can directly connect to any other sheet.

The frame layout is symmetrical with respect to the central ground strip that supports a wide p-well along the center; see Figure 47. P-well contacts along the center support the tub along the GND! strip in order to prevent latch-up, as discussed in a later section. The frame symmetry is disturbed only by two polysilicon connections located at the sides. Poly stubs and M2 tracks are deliberately not fully extended (-3λ), to allow for adjoining cells to be connected through insertion of small conductive strips, called doglegs. A dogleg overlay (DOGLEG-P) of 6λ in width is shown in Figure 39 (page 115). Overlays extend 3λ into neighboring cells and connect output (label -1 = dogout) and primary input (label 0 = dogin) of continuous cell arrangements. Doglegs and other connection blocks are treated as cells with a width of zero lambda, so that wiring denominators can be generated and independently included in the linearized placement. The scheme is quite simple and involves an additive displacement value in east-west direction. When a row of standard cells is assembled, successive cells are displaced by the width of the previously placed cell. Thus doglegs become automatically superimposed because their width contribution is zero. Proper horizontal connections (P1, T1..T6) are ensured without design rule violations.

Vertical wires attach to interconnection terminals, which are located on top and bottom of active cells, as mentioned before. Terminal connection points carry all input and output signals of a standard cell to the routing channels or other designated wiring areas. Through-the-cell routing considerably reduces the number of contacts in the routing areas by using electrically equivalent terminals on both faces of the cells. Terminal connection points are specified by a linear label located at the cell edge and attached to M1. Label names are character strings composed of the function name, which is the same as the cell name, an instantiation number (#) that is identical to the parse tree's nodeid, and an integer wire denominator. The wire number corresponds to the parameter position in the implemented function's argument list. As an example, consider the D flip-flop structure illustrated in Figure 40.
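The additive displacement scheme and the terminal naming convention can be illustrated with a short sketch. It assumes that a row is given as a list of (name, width) pairs with widths in lambda, and that labels follow the pattern visible in Figure 40 (for example DFF_#/1); the function names place_row and terminal_label and the cell widths in the example are hypothetical.

```python
from typing import List, Tuple

def place_row(cells: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Return (name, x origin) pairs; each cell is displaced by the widths placed before it."""
    x = 0
    placement = []
    for name, width in cells:
        placement.append((name, x))
        x += width                    # zero-width doglegs do not move the next cell
    return placement

def terminal_label(function: str, nodeid: int, wire: int) -> str:
    """Build a terminal label from function name, node id and wire denominator."""
    return f"{function}_{nodeid}/{wire}"

# A dogleg overlay between NAND_1 and AND_3 lands exactly on their common boundary.
row = place_row([("NAND_1", 20), ("DOGLEG-P", 0), ("AND_3", 24), ("DFF_6", 60)])
```

Because the dogleg contributes no width, it receives the same origin as the cell that follows it, so the overlay is superimposed on the boundary between the two neighbors.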

[Figure 39 annotations: origin left-justified for logical cells; origin centered for doglegs; dogleg overlay between adjacent cells.]

Figure 39: Polysilicon intercell connection stub

[Figure 40 labels: function DFF(E0,E1), cell DFF_#, terminals DFF_#/1, DFF_#/-1 and DFF_#/0.]

Figure 40: Vertical cell connections of a DFF

The routing tool establishes wire nets based on a netlist specifying groups of terminals to be connected. Netlists are either extracted directly from the parse tree or created by hand. Since routing terminals are not restricted to fixed locations on the cells, and cells vary in width, terminal alignment to a virtual routing grid is necessary. The routing grid spacing is based on the contact size, which is 4λ. Figure 41 shows the extension and grid alignment for all terminal points to the routing frame that encloses standard cells. Note the design rule violation (M1-M1 ≥ 3λ) at the left-adjusted jog of terminal 1.

[Figure 41 labels: design rule violation, frame, standard cell, routing grid.]

Figure 41: Stem extension and grid alignment

The design rule violation can be resolved when terminals are spaced by a distance that is equal to the routing grid size. Figure 42 shows the same cell with a 4λ spacing between adjacent terminals.

Figure 42: Gridsnap spacing to terminal points

Note that larger terminals give the router more freedom to choose appropriate connection points. A terminal (linear label) with a size of at least 4λ is most appropriate to avoid design rule violations and results in shorter wires. Generally, cells should be designed with a terminal spacing of at least one horizontal routing grid (gs) and a linear label width not less than a unit contact size (cs). In addition, the distance between the vertical cell edge and the adjacent terminal must not fall short of the n+ to poly spacing (n+-p), so that pc-contacts (poly-M1) do not cause parasitic structures at cell boundaries. Suggested dimensions are illustrated in Figure 43.

Figure 43: Minimal dimensions for vertical connections
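The grid alignment and the spacing rule for terminals can be checked with a few lines of code. The sketch assumes integer terminal positions in lambda along one cell edge and a grid size gs of 4λ as stated in the text; the function names snap_to_grid and spacing_violations are hypothetical, and rounding to the nearest grid line is one possible alignment policy.

```python
from typing import List, Tuple

def snap_to_grid(x: int, gs: int = 4) -> int:
    """Move a terminal position to the nearest routing grid line (ties go upward)."""
    return ((x + gs // 2) // gs) * gs

def spacing_violations(terminals: List[int], gs: int = 4) -> List[Tuple[int, int]]:
    """Return pairs of neighboring terminals that are closer than one grid unit."""
    xs = sorted(terminals)
    return [(a, b) for a, b in zip(xs, xs[1:]) if b - a < gs]

# Terminals at 3 and 5 lambda are too close and would snap onto the same grid line.
too_close = spacing_violations([12, 3, 5])
snapped = [snap_to_grid(x) for x in (3, 5, 12)]
```

Terminals reported by spacing_violations would collide after snapping, which corresponds to the situation resolved in Figure 42 by spacing adjacent terminals one grid unit apart.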

V.3. Complementary CMOS P-Well Technology

Complementary metal oxide semiconductor (CMOS) technology provides both p- and n-type field effect transistors on the same chip. CMOS circuits were first introduced in 1963 [80] and evolved from p-well CMOS with n-substrate. N-channel transistors (enhancement type) were placed in a strongly doped p-well to compensate for substrate deviations [81]. Such early CMOS processes were fairly complex, but new advances have made CMOS the most used technology in VLSI. SCMOS (Scalable CMOS), as used for the fabricated test structures for this dissertation, allows for additional scaling to adapt a mask level design to a desired feature size. Conflict free mask reductions require more conservative design rules, so that scaled designs can be fabricated with a standard p-well process.

[Figure 44 labels: gate oxide, n- substrate.]

Figure 44: Inverter in Si-Gate Technology

In the SCMOS process, p-transistors are created in the native n-substrate constituted by a moderately n-diffused wafer. N-devices are created in strongly diffused wells that extend deep into the wafer. Figure 44 illustrates (a) circuit, (b) topology and (c) device cross section of a simple inverter in silicon gate technology. Polysilicon gates for p- and n-transistors are connected and serve as input (A) to the inverter structure. Y is the common drain output, and its voltage depends on whether the p- or n-transistor is conductive. For VA = 0, the p-device is conductive and hence VY = 1. Since the current through a non-conducting transistor is negligible, it is clear that nodes within a circuit are connected to either Vdd! or GND!, and are respectively driven high (1) or driven low (0).

The topology in Figure 44b depicts the complementary (static) CMOS inverter discussed above. It represents the photo-mask information as required by various fabrication steps. Rectangles represent lithographic stencil information for certain mask layers as imposed by design rules. The given topology reflects one possible realization pattern for the physical placement of the pn-pair. A complete cross section, through the center contacts, for the inverter in Si-gate technology is shown in Figure 44c.

V.4. Latch-up Prevention

Latch-up is a parasitic circuit effect that results in circuit failure and requires that the entire chip be powered down to resolve the problem. Sometimes latch-up causes chip self-destruction and must be prevented for this reason as well. Latch-up is defined as the circuit condition where a high current is conducted between power and ground through one or more parasitic thyristors in the conducting mode. Parasitic p-n-p-n devices are formed by a lateral p-n-p (T1) and a vertical n-p-n (T2) bipolar transistor, as shown in Figure 45. Rs denotes the n-substrate resistance, and Rw the p-well resistance.

When enough electrons are injected into the n-substrate (by a spurious noise spike from electrostatic discharge, for example), a voltage drop U_Rs results from p-well electron collection. As soon as U_Rs drops by 0.7 volts, T1 becomes conductive and the voltage across Rw increases. T2 is switched into low-resistance mode when U_Rw exceeds 0.7 volts.

[Figure 45 panels: parasitic thyristor in the device cross section (gate, n- substrate, GND!); equivalent circuit.]

Figure 45: Parasitic pnpn-device

Since both transistors are regeneratively coupled each transistor drives its

counterpart into saturation. The undesired full conducting mode persists until the

circuit is powered down. Irreversible damage to the chip results when the maximum junction temperature of the parasitic device is exceeded. The possibility of latch-up

may be reduced by layout or guardring techniques and process design or technology

measures. Process design and technology related procedures, like beta reduction.

trench insulation, etc. are not considered in the scope of this dissertation research.

Despite this fact, parasitic resistor elimination, also known as resistive shunts, must be

mentioned since it is the simplest and hence most often applied technique. The method

suggests making parasitic resistors Rs and Rw small enough to yield a large bypass base current. This ensures that the base voltage on at least one of the two transistors

never exceeds 0.7 volts. Resistive values of Rs and are effectively reduced by substrate contacts that connect the well directly to the appropriate power source, see

Figure 46. Additional precautions for latch-up suppression can be taken at the layout or circuit design level. Most critical is the lateral spacing between p+ and n+ regions.

A very small spacing is desirable but may not be achievable since "punch through" to the well or oxide cracking may be the result.
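The resistive-shunt condition described above amounts to Ohm's law: the voltage across a parasitic resistor must stay below the 0.7 V turn-on voltage. The following is a minimal sketch under that reading; the function names and the sample currents are illustrative assumptions and are not values from this work.

```python
V_BE_ON = 0.7  # volts; base-emitter turn-on voltage quoted in the text


def max_shunt_resistance(bypass_current_a: float) -> float:
    """Largest resistance (ohms) that keeps I * R at or below 0.7 V.

    bypass_current_a is a hypothetical injected current in amperes.
    """
    if bypass_current_a <= 0:
        raise ValueError("bypass current must be positive")
    return V_BE_ON / bypass_current_a


def stays_off(resistance_ohm: float, bypass_current_a: float) -> bool:
    """True if the voltage drop across the parasitic resistor is below 0.7 V."""
    return resistance_ohm * bypass_current_a < V_BE_ON
```

For an assumed bypass current of 1 mA, the bound is about 700 ohms; additional substrate contacts lower Rs and Rw and therefore widen the margin.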

Figure 46: Inverter with additional substrate contacts

Several successful fabrications have shown that a packing of n-devices towards

GND! and p-devices towards Vdd! with one substrate contact for five to ten transistors effectively prevents latch-up within cells. Even though smaller cells, like the inverter

(NOT(expression);), are too narrow to allow for substrate contacts, no latch-up or related problems have surfaced up to this point. This is partially due to the three rail power distribution scheme that ties both rows of n-transistors to the centered p-well ground strip. The well is properly grounded by wide p+-islands and at least one centered large p+ power contact. Latch-up propagation over large portions of a chip is avoided through vertical p+ strips at p-well abutment faces and n+ bands that partially overlay both power rails. Vertical p+ strips, as shown in Figure 47, ensure that a single latched pnpn-path disturbance does not cause a failure of other nearby pnpn-paths which share the same well. Minimal (Nmin) or exact dimensions (N) are specified in the drawing to ensure cell abutment without design rule violations.

Dimensions stated in the illustration are concerned with p-well and latch-up prevention only.

Figure 47: Substrate contact and well protection.

Experiments have shown that a single unit n+ substrate contact (4λ²) is sufficient for a set of four transistors with an equivalent gate area of 24λ² in a proximity of 50λ.

Bigger contacts and large n+ islands or guard structures are recommended. Similarly, unit size p+ well plugs are required for five transistors each, which corresponds to about 32 squares of active area.

V.5. Physical Layout of Standard Cell Logic

This section introduces a transistor placement scheme for standard cells that

implement an arbitrary function. The scheme provides an efficient initial placement

for all transistors of a fully complementary circuit. The initial assignment of transistor sites prepares a standard cell layout and minimizes the number of long metal one connections.

To illustrate the design strategy, consider the exclusive-or gate level equivalent illustrated in Figure 48a. A non-trivial mapping of transistors onto parallel p- and n-diffusion strips in the design frame must be employed. After a complete mapping, all transistors are assigned to frame locations on diffusion strips for subsequent wiring.

Common gate connections (pn-pairs), shown in Figure 48d (e.g., signal A: p1, p5, n1, n5), result in vertically aligned transistors and a corresponding polysilicon wire, Figure 48c. The horizontal ordering results from common source-drain nets. Source-drain connection points are implemented as continuous diffused regions.

Figure 48: 12 transistor EXOR implementation.

V.5.1. Placement Algorithm for Fully Complementary Circuits

Fully complementary logic is composed of p-channel and n-channel transistors whose gates are members of the same net (pn-pair). According to the frame layout, two pn-pairs can be vertically aligned and simultaneously driven by a vertical polysilicon track. The basic transistor placement algorithm assigns a maximum of four transistors (two pn-pairs) to a vertical polysilicon track. Larger gate nets are formed by physically adjacent pn-pairs which are also driven from vertical tracks. Additional polysilicon tracks are electrically connected (shunted) to the first track so that undesirable effects caused by high sheet resistance are eliminated.

The algorithm placetransistors, given in this subsection, traverses an unlabeled circuit and selects pn-pairs in the first step. Figure 48b depicts the transistor diagram with pn-pairs labeled in ascending order. A graph G is generated in the second step of the algorithm. Statement 2.1 generates a node for each pn-pair, and statement 2.2 creates undirected edges in G. Undirected edges correspond to horizontally shared diffusion areas identified by circles in Figure 49a.

Figure 49: Equipotential points and colored graph G

Edges between pn-pairs represent an equipotential source or drain terminal point.
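The first two steps (pn-pair selection and construction of G) can be sketched as follows. This is an illustrative sketch only, not the tool's implementation: the `Transistor` record, the rule that a complementary n-transistor is the first unlabeled one on the same gate net, and the exclusion of the Vdd! and GND! rails from the equipotential test are all assumptions made for the example. The netlist `NAND_INV` is likewise hypothetical.

```python
from itertools import combinations
from typing import NamedTuple

POWER_NETS = {"Vdd!", "GND!"}  # assumption: rail nets do not create edges


class Transistor(NamedTuple):
    name: str
    kind: str    # "p" or "n"
    gate: str    # gate net
    source: str  # source net
    drain: str   # drain net


def select_pn_pairs(transistors):
    """Step 1: label each p-transistor together with an n-transistor on its gate net."""
    free_n = [t for t in transistors if t.kind == "n"]
    pairs = {}
    k = 0
    for p in (t for t in transistors if t.kind == "p"):
        partner = next((n for n in free_n if n.gate == p.gate), None)
        if partner is None:
            continue  # unpaired device, outside the scope of this sketch
        free_n.remove(partner)
        k += 1
        pairs[k] = (p, partner)
    return pairs


def build_graph(pairs):
    """Step 2: one node per pn-pair, one undirected edge per shared source/drain net."""
    def terminals(pair):
        return {net for t in pair for net in (t.source, t.drain)} - POWER_NETS

    edges = set()
    for j, k in combinations(sorted(pairs), 2):
        if terminals(pairs[j]) & terminals(pairs[k]):
            edges.add((j, k))
    return set(pairs), edges


# Hypothetical example netlist: a two-input NAND followed by an inverter.
NAND_INV = [
    Transistor("p1", "p", "A", "Vdd!", "out"),
    Transistor("p2", "p", "B", "Vdd!", "out"),
    Transistor("n1", "n", "A", "mid", "out"),
    Transistor("n2", "n", "B", "GND!", "mid"),
    Transistor("p3", "p", "out", "Vdd!", "y"),
    Transistor("n3", "n", "out", "GND!", "y"),
]
```

In this example the two NAND pairs share the nets `out` and `mid`, so they are joined by an edge, while the inverter pair is connected to them only through its gate and therefore receives no edge.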

Graph coloring is conducted in the third step, with complete coloring in statement 3.1 and color marking in statement 3.2. Completely colored nodes are driven by the same gate net and should be vertically aligned or closely grouped if more than two pn-nodes are driven. The fourth step of the algorithm constructs a bilinear graph from colored nodes in G. Node positions correspond to pn-pair locations in the design frame, with position 1 in the upper left corner location. Since the layout is constructed in top-to-bottom, left-to-right fashion, two initial nodes are selected in statement 4.1.

The first node is a primary input, as denoted by the first argument position in the

parameter list.

A primary input must occupy the first position because it may receive its input

from a directly adjacent cell through the dogleg stub. As in the previous selection step,

an unplaced node with the largest number of connections to other pn-nodes is chosen

so that a maximum number of gate terminals of both selected pn-nodes can be driven

simultaneously. This principle is maintained throughout the entire placement process

to minimize the number of vias and to allow for input terminals on the top and the

bottom of the cells. Tie-breaking ensures that a maximally connected node which is

most distant to the output is chosen. This is important because a long wire that

eventually blocks necessary vertical connections of the same layer is effectively

avoided. After the two leftmost pn-nodes are selected, all remaining nodes are placed in the fifth step. The while-loop contains four groups of case selections to choose a pn-pair among all unplaced pn-nodes. Step 5.1 seeks a pair N that is driven by the same (gate) signal as its immediate already placed left neighbor, or its vertical partner.

As mentioned before, gate nets are most critical and require short connections of low

resistance. Since polysilicon gate wires should not be interrupted, completely colored nodes are preferred over edge connected nodes in G. An interruption of a gate wire is

undesirable because at least two vias are necessary. For this reason, edge connected pn-nodes are considered as the second choice, in step 5.2. In practice, horizontal

connections of pn-pairs driven by the same signal seldom occur because such pairs are either redundant or have a common point tied to either rail of the power source. In the latter case, the horizontal connection can be eliminated and replaced by contacts to Vdd! or GND!. The selection in statement 5.3 establishes connections between signal producing and passive driven nodes. Color marks, as attached to nodes in step 3.2, correspond to source terminals in electrical nets, and fully colored sink nodes should be assigned to locations close by. Step 5.4 resumes the case selection and attempts to complete horizontal source-sink nets in statement 5.4.1 and upper and lower row distribution in statement 5.4.2. In case none of the pn-pairs was selected for placement, the conditional 5.4.3 evaluates to true and, as in step 4.1, an unplaced node is chosen. Step 6 establishes the actual placement of transistor pairs within the design frame, starting in the upper left corner, in top-down order from left to right. P- or n-transistors of each pair are assigned to the appropriate diffusion strip.

Basic Transistor Placement Algorithm

placetransistors(circuit):
begin
1.    /* PN-Pair selection */
      k = 0;
      for each unlabeled p-transistor do {
          k = k+1;
          find its complementary n-transistor;
          label p-transistor = k;
          label n-transistor = k;
      }

2.    /* Construction of graph G */

2.1   /* Node generation in G */
      for each pn-pair 1..k do {
          generate a node k in graph G;
      }

2.2   /* Edge generation in G */
      for each node 1..k do {
          for j = 1 to k do {
              if (pn-pair k has a common equipotential point in pn-pair j)
              then create an undirected edge kj in G;
          }
      }

3.    /* Graph coloring */

3.1   for each input signal S do {
          select an unique color Sc;
          for i = 1 to k do {
              if (node i is driven by S)
              then completely color node i with Sc;
          }
      }

3.2.1 for each pn-pair produced driver signal T do {
          select an unique color Tc;
          for i = 1 to k do {
              if (node i is driven by T)
              then mark node i with Tc;
          }
      }

4.    /* Construct bilinear graph */
      pos = 1;

4.1   /* Select initial nodes */
      {
          select primary input node N that has the largest number of edges attached to it;

4.1.1     if (tie between nodes) then {
              select the unplaced node N with
                  a. the largest number of edges to secondary input nodes,
                  b. the highest number of levels to the output terminal;
              if (tie again) then select either node N involved in the tie;
          }

          N.position = pos; pos = pos + 1;
      };

      if (there exists an unplaced node W which is completely colored like N)
      then { select node W }
      else {
          select an unplaced input node W that has the largest number of edges attached to it;

          if (tie between nodes) then {
              select the unplaced node W with
                  a. the largest number of edges to secondary input nodes,
                  b. the highest number of levels to the output terminal;
              if (tie again) then select either node W involved in the tie;
          }
      }

      W.position = pos; pos = pos + 1;

5.    /* Place remaining nodes */
      while (unplaced nodes exist) do {
          selected = false;
          switch unplaced node N {

5.1       case complete_color(node[pos-2]) = complete_color(N): select N; break;
          case complete_color(node[pos-1]) = complete_color(N): select N; break;

5.2       case N is connected to node[pos-2]: select N; break;

5.3       case N is completely colored with mark_color(node[pos-2]): select N; break;
          case N is completely colored with mark_color(node[pos-1]): select N; break;

5.4       otherwise: {
              if (N exists that is completely colored with a mark color Mc of a placed node) then
                  if (odd(pos))
5.4.1             then { select an N that is completely colored with Mc of an odd node position;
                         selected = true; }
5.4.2             else { select an N that is completely colored with Mc of an even placed node position;
                         selected = true; }

5.4.3         if (not selected) then {
                  select an unplaced input node N that has the largest number of edges attached to it;

                  if (tie) then {
                      select the unplaced node N with
                          a. the largest number of edges to secondary input nodes,
                          b. the highest number of levels to the output terminal;
                      if (tie again) then select either node N involved in the tie;
                  }
              }
          } /* otherwise */
          } /* switch */

          N.position = pos; pos = pos + 1;
      } /* wend */

6.    /* Establish Placement */

6.1   /* PN-Pairs */
      i = 1;
      while ((i <= k) && (node(i) is pn-pair)) do {
          if (odd(i)) then { place pn-pair[i] above GND! rail; }
          else { place pn-pair[i] below GND! rail; }
          move one spacing to the right;
          i = i+1;
      };

end; /* placetransistors */

V.5.2. Placement Considerations for C-Switches and Single Pass Transistors

Passive multiplexor and selection structures are usually implemented by

complementary C-switches or single pass transistors. Since passive transmission

components are merely inserted in signal paths with the effect of controlled signal

attenuation, entire circuits may be considered for initial placement without

complementary transistors. One strategy is to remove single pass transistors and to

perform placement with the algorithm given in the last section. In the wiring phase,

single pass transistors are aligned with vertical polysilicon wires and inserted in free

horizontal tracks.

Figure 50 presents a simple conditional function NIF(c, fktn_t, fktn_f) that takes three arguments: condition (c), a true part (fktn_t), and a false part (fktn_f). According to condition c, either fktn_t or fktn_f is selected and NIF(c, fktn_t, fktn_f) evaluates to the negation of the selected part. Fully complementary circuits composed from C-switches and pn-pairs can be cast into a design frame with the algorithm presented. It is advantageous, however, to extend step 3.2 because transmission gates

cannot produce driver signals. For this reason, one treats outputs of C-switches like

signal sources and this naturally leads to a modified version of the coloring scheme.

The modification allows several C-switch output nodes to form a combined source,

see Figure 50 net v. In order to physically tie pass structures to sink nodes,

corresponding nodes are marked or completely colored, and the extension step 3.2.2 is

as given below.

Extension for C-Switches:

3.2.2 for each C-switch driven net V do {
          select an unique color Tv;
          completely color all driven pn-nodes in V with Tv;
          mark all C-switch transistors connected to V with Tv;
      } /* End-extension */

Figure 50: NIF conditional implementation

Transmission structures which are not connected to Vdd! or GND! can be compacted by a simple and intuitive method that suggests the relocation of transmission or pass transistors to free tracks. Even though unpaired transistors should be avoided within standard cells, occasionally single precharge-, evaluate-, and loop-transistors are employed in pseudo NMOS and dynamic CMOS circuits. Several small extensions to the placement algorithm are necessary to capture pass transistors in the graph. For each pass device, a node s must be added to G, as in step 2.1.2. At the same time, a list is attached to the newly generated node s, which contains all equipotential points electrically connected to the gate of s. This list stipulates vertical gate alignment similar to pn-pair lineup based on node coloring in step 3.3. Vicinity of channel terminals of the pass device to source and sink within the circuit is assured by edges in G. Since pass transistors are assigned to locations between pn-pair tracks, and conducting pass switches are equivalent to wires, single switching devices are replaced by wires in the circuit after corresponding nodes are created in G. Thus, graph edges can be generated in step 2.2, which immediately follows the switch replacement, with the desired result of short horizontal connections.

Extension for Unpaired Pass Transistors:

2.1.2 /* Node generation for unpaired nodes */
      for each unpaired transistor do {
          k = k+1;
          if (p-type transistor)
          then { label p-transistor = k; }
          else { label n-transistor = k; }
          generate a node k in G;
          attach to node k a list KN that corresponds to k's gate net;
          replace the source-drain channel in the circuit with a wire connection;
      }

3.3   /* Complete coloring of unpaired nodes */
      while (uncolored nodes exist) do {
          consider node NA;
          find pn-source of NA;
          select color of source node for TS;
          for (all nodes NN in the list KN of NA) do {
              completely color NN with TS;
          }
      } /* End-Extension */

Pass transistor allocation is a two-stage process conducted in the statement sequence 6.2. The first part, 6.2.1, assigns hitherto unconsidered nodes i to vertical polysilicon tracks, and the second part, 6.2.2, performs a permutation within each track. Index i in step 6.2.1 denotes the first pass transistor to be considered for placement if unplaced p- or n-nodes exist. Remaining nodes i to k are placed for gate-net sharing according to their individual color. After all remaining nodes are allocated to vertical polysilicon tracks, switches are permuted so that horizontal dependencies result in short connections. Pass devices are arranged in each well area in such a way that transistors which connect to pn-pairs in close proximity are placed closest to GND! or Vdd!. Thus longer jogs are propagated to the middle, away from the power distribution rails. This relinquishes space for cell internal routing, vias, and n+, p+ areas.

Pass Transistor Allocation:

6.2 /* Single pass-transistor switches */

6.2.1 while (i

6.2.2 for (all vertical poly tracks m) do ( permute pass transistors in m according to routing cost; )

V.6. Cell Simulation and Logical Verification

Previous sections of this chapter stressed physical design considerations at the transistor level. Important functional design and timing issues are considered now.

Speed optimization is of major importance because the overall system performance is based on individual cell parameters. Among several speed enhancement techniques, stage ratio adaptation is most favorable because pn-pairs on equivalent stages in the logic are driving a similar load, denoted with CL. This load corresponds either to a number of cell inputs, a long aluminum wire with contacts, a pad driver, or a combination thereof. Hence, it is necessary to buffer input and output terminals of cells so that signals are not attenuated to a level close to or below threshold voltage. It is also important not to connect storage nodes to terminals, because detrimental effects on the state can be expected. A minimum delay can be achieved when transistor sizes (L/W) in successive stages are adjusted in width. Mead and Conway [5] suggest an optimal stage ratio factor r of 2.7, but experiments have shown that r = 2.0 is sufficient for good results in combination with MOSIS's SCMOS technology. Transistors grow very fast even with a small stage ratio factor of 2.0; hence, functions must be implemented with a minimum transistor stage logic equivalent. Figure 51 shows two possible implementations of an EXOR function with either one (51a) or two (51b) stages. The two-stage design is preferable because a ratio of r in the second stage requires only one grid space in the design frame. Circuits with more than three stages (n > 3) must be either redesigned or carefully SPICEd in order to avoid extremely large gate capacitances (Σratio → r^n).
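The growth of transistor widths with the stage ratio can be made concrete with a short calculation. The sketch below is illustrative: it assumes that each successive stage is simply r times wider than the previous one, and the function names and the unit base width are hypothetical.

```python
def stage_widths(base_width: float, ratio: float, stages: int) -> list:
    """Widths of successive stages when each stage is `ratio` times the previous one."""
    if stages < 1:
        raise ValueError("at least one stage is required")
    return [base_width * ratio ** i for i in range(stages)]


def total_relative_width(ratio: float, stages: int) -> float:
    """Sum of all stage widths relative to the first stage (1 + r + r^2 + ...)."""
    return sum(stage_widths(1.0, ratio, stages))
```

With r = 2.0, four stages already give relative widths of 1, 2, 4 and 8, which illustrates why designs with more than three stages call for redesign or careful simulation.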

Fully complementary logic with few stages and a minimal number of embedded pass devices or dynamic storage nodes yields best results. In fact, transfer-gate-driven storage nodes often lead to non-functional silicon, because over-the-cell routing causes "noise" pulses that may well exceed the threshold voltage.

Figure 51: Stage ratios for two EXOR implementations

"Trickle" or weak inverters should not be considered. Storage nodes implemented as controlled feedback inverter pairs should be utilized as an alternative.

It has proven useful to verify circuits on the switch level with a simulator such as ESIM [6, p. 268], [82]. ESIM is an event driven simulation tool that provides quick logic verification by employing a simplified transistor switch model. In the simulation process a user prepares test vectors and communicates interactively with the tool to detect erroneous connection points or incorrect logic constructs. Input files to the simulation tool contain a textual circuit description, either entirely user specified or obtained through an automatic layout extraction process. The simulator flags floating nodes connected to pass devices that may be potential sources of problems in an actual implementation. From this information, uninitializable nodes can be inferred and eventually removed. ESIM is an essential tool that detects almost all errors in the logic (except those caused by timing and delays) and drastically reduces cell design time.

Critical timing path and worst-case delay analysis and optimization for the standard cell building block follow, using the tools CRYSTAL [83] and SPICE [84]. CRYSTAL's timing analyzer estimates all possible paths from inputs to output and

indicates the critical path which must be tuned. By employing an RC-model,

CRYSTAL provides quick and reasonably accurate feedback about the approximate timing performance. The data obtained are used merely as guidance to improve the cell by resizing transistors along the critical path. Worst case delay and rise-time analysis is performed with SPICE after CRYSTAL's results satisfy the self-imposed performance goal. SPICE is a network analysis program for nonlinear circuits. It allows for DC, AC and transient analysis of variables within a cell as a function of time according to some user specified input stimuli. Realistic simulations for standard cells are obtained when input signals are defined with rise times of V/t = 4 V / 4 nsec and an output load CL = 300 fF. This value corresponds to the sum of a standard input load (2*120 fF inverter input) plus a 4λ aluminum wire of 600 µm in length.

The switching speed or dynamic response of CMOS circuits depends mainly on parasitic capacitances associated with switching devices. Interconnection capacitances, with the exception of diffusion sheets, are negligible, and only the sum of gate capacitance Cg and diffusion capacitances Cd is considered. This is sufficient as a first-order approximation, because exact parameters are known only after the fact - when the silicon is returned from fabrication. Adequate estimations of the capacitances C1 and C2 for an inverter pn-pair, as in Figure 52, are obtained by specifying geometry parameters (W, L, AS, AD, PS, PD) for each transistor, and model parameters (TOX, CJ, CJSW) for the desired run.

Figure 52: Pn-inverter pair estimation of C1, C2

The input capacitance C1 corresponds to the sum of poly and all attached sheet capacitances, and both gate capacitances Cgp, Cgn. Gate capacitances are computed from the thin-oxide thickness (TOX) specification, and the extracted values for channel width (W) and length (L). C2 is calculated likewise and requires that CJ (zero-bias bulk junction bottom capacitance per square meter of junction area) and CJSW (zero-bias bulk junction sidewall capacitance per meter of junction perimeter) are specified as model frame parameters. Area and perimeter values must be hand extracted from the layout, and a simple estimation for a single transistor device is given in Figure 53.

Area: AS = w*ls; AD = w*ld;
Perimeter: PS = (w+ls)*2; PD = (w+ld)*2;

Figure 53: Device parameter estimation
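The estimates of Figure 53 and the capacitance terms described above can be written out as a short calculation. This is a first-order sketch under stated assumptions: the gate capacitance is taken as the parallel-plate value eps_ox * W * L / TOX with the standard SiO2 relative permittivity of 3.9, junction capacitances are evaluated at zero bias, and all function names and sample values are hypothetical rather than taken from the text.

```python
EPS_OX = 3.9 * 8.854e-12  # F/m; permittivity of SiO2 (standard physical constants)


def diffusion_geometry(w: float, ls: float, ld: float) -> dict:
    """Source/drain areas and perimeters for one device, following Figure 53."""
    return {
        "AS": w * ls,
        "PS": (w + ls) * 2,
        "AD": w * ld,
        "PD": (w + ld) * 2,
    }


def gate_capacitance(w: float, l: float, tox: float) -> float:
    """Parallel-plate gate capacitance in farads (all lengths in meters)."""
    return EPS_OX * w * l / tox


def junction_capacitance(area: float, perimeter: float, cj: float, cjsw: float) -> float:
    """Zero-bias junction capacitance: bottom term (CJ) plus sidewall term (CJSW)."""
    return cj * area + cjsw * perimeter
```

For a hypothetical 6 µm by 3 µm gate over 50 nm of oxide, `gate_capacitance(6e-6, 3e-6, 50e-9)` evaluates to roughly 12 fF.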

In case two transistors, T1 and T2, share a diffusion area, only the device included in, or contributing to, the critical path is examined. Consider Figure 54, in which T2 is assumed to contribute to the critical path.

Figure 54: Parameter estimation for T2

For proper dynamic analysis, polysilicon wires with a length greater than 50λ are to be considered and the corresponding resistance must be included in a transient analysis. For this reason, it is necessary to extract such wires by hand and to include the resistive value in the SPICE deck. Cell-internal aluminum connections in M1, M2 or vias have negligible capacitive or resistive values (unless very large sheets are considered), and are not taken into account for the calculations. All simulations were based on the Shichman-Hodges transistor model (NMOS1, PMOS1), which leads to simulation results close to tests performed on first silicon.

V.7. Summary

The standard-cell design method presented in this chapter enables the circuit

designer to devise standard-cells for frequently recurring subfunctions or special

functions. With the extension of the design approach down to the transistor level, a

circuit designer is now capable of synthesizing standard-cells from transistor

diagrams. Functions that are normally composed of many area consuming standard-

cells can now be implemented as a single cell. As the chapter has shown, specialized

standard-cells have several advantages of which critical path shortening, minimum

area consumption and wiring reduction are most prevailing.

The chapter began with the introduction of a standard-cell design frame, to

ensure the physical compatibility among cells within the standard building block

family and the corresponding routing environment. The core of the chapter was

concerned with the construction of the entrails of a cell using an algorithmic approach

for transistor placement. Transistors were treated as pn-pairs and assigned as such to

locations within the frame. The proposed dual-row placement method, which is the

major contribution of this chapter, has proven effective during the development

of several basic cells. Together with the functional placement scheme it forms a powerful design environment that allows the development of custom standard-cells,

and subsequent automatic placement. Many test structures have been successfully

designed with the combination of these tools, and an evaluation of the approach is presented in the following chapter. The evaluation compares functional placement for different routing strategies with simulated annealing, which is acknowledged as the most area efficient polycell placement technique.

Chapter VI

EVALUATION OF THE APPROACH

VI.1. Introduction

Previous chapters have shown that a functional approach for standard cell placement is both advantageous and practical. These benefits were demonstrated with numerous examples of medium complexity. Several small test structures and two complete chip designs have been fabricated and successfully tested. Only a few design flaws were uncovered in the testing phase, and thus the functional approach has proven to be useful and effective. The silicon has provided a source for numerous measurements about effectiveness and performance of the approach. During the implementation and testing phase of the functional placement tool, several evaluations were conducted of physical chip characteristics such as delay, rise and fall time, and latchup sensitivity. Functionally placed standard cells were found to be much faster than equivalent designs using PLA's. A four-bit counter implemented as a finite state machine with a PLA core had a maximum operable frequency of about 4 MHz, and the functionally equivalent design based on standard cells was operable at over 10

MHz.

This chapter presents individual test results and evaluations of the functional placement scheme versus high-temperature simulated annealing. The annealing method was chosen as a comparative method because it is a widely acknowledged

approach for standard-cell placement in industrial and academic environments.

Numerous designs of small- and medium-size circuits with characteristics different

enough to generalize results served as testbeds. Since simulated annealing is

computationally very expensive, a comparative evaluation of large designs was not

done. For this reason, the problem size was limited to 500 standard cells for all

examples. The choice of this maximum size was influenced by measurements taken

from commercial photomicrographs. It was found that most designs contained

standard cell blocks of similar or smaller complexity. Since different manufacturers

use different feature sizes, the complexity was determined from the area of a standard cell block, the number of cells contained in this block, and the total number of pins

within all wire nets.

Comparisons between the functional placement approach and simulated annealing are based on cost parameters obtained with TimberWolf3.2's [4] cost procedure. The basis for each evaluation is a placement and netlist description from which wire length and estimated wiring area are determined. The wire length for an

individual net is computed using the half-perimeter bounding box calculation. A bounding box is defined as the smallest rectangle that tightly encloses all pins of the net. If a wire traverses a cell in either direction, the area estimate contains a constant value for the area consumed by the cells. Thus, an adjustment of the wire length and wiring area is required for each cell. Total area estimates are approximated by the sum over all channel contributions after cell placement. All studies assume two-layer

Manhattan style routing with equivalent horizontal and vertical cost factors. A special penalty for vias is not considered in any analysis, even though a minimum number of contacts is desirable. Since all calculations are based on exact pin locations, it is necessary to conduct two independent evaluation steps. In the first evaluation, intercell connections between adjacent cells (doglegs) are maximized. In the second evaluation, internal connections are replaced by conventional wire jogs. Two functionally equivalent arrangements, one with doglegs and one without, were created for each example and then individually judged.
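The half-perimeter bounding box estimate used for each net can be sketched as follows. The sketch covers only the bounding-box term with equal horizontal and vertical cost; the adjustment for wires that traverse cells is not modeled, and the data layout (a net as a list of (x, y) pin coordinates) and function names are assumptions made for illustration.

```python
def half_perimeter(pins) -> float:
    """Half the perimeter of the smallest rectangle enclosing all (x, y) pins of a net."""
    if not pins:
        raise ValueError("a net needs at least one pin")
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_wire_length(nets) -> float:
    """Sum of the half-perimeter estimates over all nets (dict: net name -> pin list)."""
    return sum(half_perimeter(pins) for pins in nets.values())
```

For example, a net with pins at (0, 0), (3, 4) and (1, 2) is enclosed by a 3 by 4 rectangle, so its estimated length is 7.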

Despite all efforts, an actual routed wire length analysis was not possible, because Magic's obstacle avoiding router [1] cannot fully utilize existing wires inside cells for vertical routing. For this reason, the router was not considered for practical evaluations, even though it was employed to prepare several cell blocks for fabrication.

The following sections present a number of measurements performed on functional placement and a comparison with simulated annealing. In the next section, the standard normal random variable z is introduced for the continuous probability distribution of the simulated annealing process. Values in the range from and -kz represent 'good' results which can be achieved in n runs. The number n is important, because simulated annealing is computationally very expensive and results of the process are spread over a wide range. The third section presents measures concerning run-time and routing-cost for a through-the-cell wiring strategy. In addition, some data collected with a force directed interchange technique are illustrated as comparative margin in the line graphs. The fourth section suggests a time-efficient low-temperature annealing process that reduces the cost of some functionally placed arrangements even further. The last section summarizes the measurements and results discussed in this chapter.

VI.2. Basic Measurements of Simulated Annealing

Simulated annealing is recognized as a powerful optimization technique for

standard-cell placements. It is based on an analogy to the annealing phenomenon found

in crystallization processes. During crystallization, elementary building blocks

move freely at a high temperature and become more and more bound to a fixed

location as the temperature T approaches the freezing point. T determines the relative

degree of freedom for block movement. The system configuration is measured by cost

parameters that evaluate block locations (placement) and block relations

(interconnections, or wire length).

During an annealing process, a random initial configuration is started at a high

temperature so that no restrictions are imposed on the movement of building blocks.

Each iteration in the annealing process considers a set of random moves. The size of

this set, i.e., the number of new states attempted per building block per iteration, is

called attention per cell. New cell locations are incrementally generated from the previous placement by weighted random selection. Two selected cells are either exchanged, or a single cell is relocated, depending on acceptability of a newly generated state. Moves that improve the configuration are called downhill moves. A minimum cost configuration is sought by the unconditional acceptance of all downhill moves and an acceptance of uphill moves with a probability of $e^{-\Delta c/T}$, where $\Delta c$ is the estimated increase in the cost for a particular cell swap. Since the annealing method is a greedy search involving (pseudo-) random decisions, it is natural that "wrong" selections are accepted at times. "Wrong" selections are necessary to avoid getting stuck at a local optimum and to find more global minima.

At high temperatures, as in the early stages of the algorithm, virtually all moves are accepted. This results in basically a total randomization of the initial configuration. In addition, the final placement depends substantially on the temperature scheduling during the entire process. The TimberWolf3.2 implementation lowers the temperature T exponentially at the end of each iteration ($T_{new} = \alpha(T_{old}) \cdot T_{old}$, with $0 < \alpha \le 1$); thus, many global exchanges are performed in the initial stages of the algorithm. The annealing process terminates when several (here three) consecutive iterations yield the same cost value.
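The acceptance rule and the cooling schedule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the TimberWolf3.2 code: the cost function `wire_length` (sum of net spans over a linear cell order), the constant cooling factor `alpha`, the parameter names, and the `max_iterations` safety bound are all hypothetical. TimberWolf3.2 makes the factor depend on the current temperature; a constant is used here only to keep the sketch short.

```python
import math
import random


def accept(delta_cost, temperature, rng=random):
    """Accept downhill moves always, uphill moves with probability exp(-dc/T)."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)


def wire_length(order, nets):
    """Hypothetical cost: sum of net spans over a linear cell order."""
    position = {cell: index for index, cell in enumerate(order)}
    return sum(max(position[c] for c in net) - min(position[c] for c in net)
               for net in nets)


def anneal(order, nets, t_start, alpha=0.9, attention_per_cell=10,
           seed=0, max_iterations=1000):
    """Toy annealer that exchanges pairs of cells in a linear order."""
    rng = random.Random(seed)
    order = list(order)
    current = wire_length(order, nets)
    best_order, best = list(order), current
    temperature = t_start
    history = []
    for _iteration in range(max_iterations):
        for _move in range(attention_per_cell * len(order)):
            i, j = rng.sample(range(len(order)), 2)
            order[i], order[j] = order[j], order[i]      # tentative exchange
            candidate = wire_length(order, nets)
            if accept(candidate - current, temperature, rng):
                current = candidate
                if current < best:
                    best_order, best = list(order), current
            else:
                order[i], order[j] = order[j], order[i]  # rejected: undo
        history.append(current)
        # Stop when three consecutive iterations end at the same cost.
        if len(history) >= 3 and history[-1] == history[-2] == history[-3]:
            break
        # T_new = alpha * T_old, with a small floor to avoid division by zero.
        temperature = max(temperature * alpha, 1e-9)
    return best_order, best
```

A call such as `anneal(list("abcdefgh"), nets, t_start=100.0)` starts hot enough that nearly every exchange is accepted, which is the randomization effect noted in the text. A small `t_start` keeps the search close to the initial order.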

Since placement problems are NP-hard problems [2,29], it is clear that any non-exhaustive method attempts to find a local optimum. Simulated annealing searches for a feasible solution through Monte Carlo optimization. Figures 55 and 56 visualize the frequency distributions for one hundred optimization attempts for a fixed initial configuration. The clock circuit examined here consists of 213 standard cells and contains 155 wire nets with a total of 842 net pins.

Figure 55: Cost distribution of the annealing tool. [Frequency chart; horizontal axis: final wiring cost.]

Figure 56: Wire-length distribution of the annealing tool. [Frequency chart; horizontal axis: wire length.]

Both frequency charts resemble normal distributions, $N_{area}(41586, 16852)$ and $N_{length}(42949, 17412)$, for a sample size of 100 annealing runs. Even though the envelope is not perfectly bell shaped, the chosen sample size compares reasonably well to the population of the results attained by TimberWolf3.2.¹⁴ The data in both charts suggest that a large number of tests is needed to obtain a "good" result, but this is generally impractical because simulated annealing is computationally very intensive. This is especially pronounced when a very slow cooling process, combined with a high attention per cell, is applied to a problem. One test of the clock circuit, with parameters specifying slow cooling, required 258 seconds of cpu-time on a Pyramid 98X dual-processor machine, running Berkeley Unix 4.3, with algorithms coded in standard 'C' [86]. For this reason, optimal results are estimated with a statistical inference method from the mean and standard deviation s of small samples; refer to Figure 57.

¹⁴ The $\chi^2$ analysis (goodness-of-fit) for the continuous distributions according to [85] p.414 comes out satisfactorily.

Knowing the mean and s, one can determine the standard normal variable z, which corresponds to the probability of one probing (test) being found at or below a certain cost value. From the graphs in Figures 55 and 56, a z is determined so that one result comes out below the mark for each graph.

Figure 57: Estimation of the lowest cost.

The arrows in each chart give the exact z-values for the smallest cost that was achieved within 100 runs. The probability $P_c$ (probability correct) that at least one probing out of n runs falls below a specific z-value is given by $P_c = 1 - \left(1 - \int_{-\infty}^{z} N(0,1)\,dz\right)^n$. An example with four probings below $z = -k \cdot s$ is shown in Figure 57. The probability of a result being found below z is represented by the integral over the normal distribution N(0,1). Solutions of the integral, that is, the area of the section under the bell curve, are determined with tables found in [85] p.627 or [87] p.464. The term one minus the integral represents the probability P of a single test yielding a result above the z border. $P_n = \left(1 - \int_{-\infty}^{z} N(0,1)\,dz\right)^n$ is the probability that all tests in n optimization attempts are found above z, assuming value replacement, and $P_c = 1 - P_n$ is therefore the probability that at least one probing falls within the interesting range.

As an example, consider the final wire length distribution in Figure 56. One

result can be expected with a probability of 95% below the one percent marker when

P(z < 2.326) = 0.1587, and $P_c = 0.95$. For this reason, it is necessary to perform eighteen¹⁵ runs of high-temperature simulated annealing to get a result below the one percent mark. Such a large number of tests is usually not practical, especially when the number of cells is high.
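The run-count estimate above can be reproduced with a short calculation. The sketch below assumes independent runs and a normally distributed result, as the text does; the function names are illustrative and not part of the thesis tool set. It solves $P_c = 1 - (1 - p)^n$ for the smallest integer n, where p is the single-run probability of landing below the chosen z.

```python
import math


def normal_cdf(z):
    """Area under the standard normal curve N(0,1) from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_at_least_one(p_single, runs):
    """P_c = 1 - (1 - p)^n: at least one of n independent runs succeeds."""
    return 1.0 - (1.0 - p_single) ** runs


def min_runs(p_single, p_correct):
    """Smallest n with 1 - (1 - p)^n >= p_correct."""
    return math.ceil(math.log(1.0 - p_correct) / math.log(1.0 - p_single))


print(min_runs(0.1587, 0.95))  # 18, the run count quoted in the text
```

With the single-run probability 0.1587 used in the text, eighteen runs are the minimum for 95% confidence. Note that 0.1587 is the standard normal tail below z = -1 (`normal_cdf(-1.0)`). A tail of exactly one percent (z = -2.326) would call for roughly 300 runs under the same formula.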

VI.3. Comparison of Functional Placement and Simulated Annealing

Standard cell placement and other NP-complete problems are generally approached with time-consuming approximation or heuristically based methods. Among several techniques, simulated annealing is preferred, because a placement of high quality is achieved within pseudo-exponential time. A pseudo-exponential time behavior describes a computing time that depends on the input and the seed of the pseudo-random generator. The worst time bound is an exponential function, and the actual time depends on the selections made during each iteration and the acceptance criteria. In fact, small examples may be computed quickly even with an exponentially bounded algorithm, because the time may initially grow more slowly than the time required by a comparable polynomially bounded algorithm, that is, $n^5 \gg 2^n$ for $1 < n < 10$. Thus, for a small problem size, iterative simulated annealing [88] is preferable over Min-Cut [32,53] and functional placement. However, Min-Cut and functional placement methods finish large problems (over 1000 cells) up to 100 times faster than high-temperature simulated annealing. In addition, Hartoog [89] has shown that Min-Cut with Steiner tree improvement can at least match the results produced by simulated annealing. The Min-Cut placement technique is based on graph partitioning. Since the graph partitioning steps are based on a heuristic, it is clear that different starting configurations give rise to different results corresponding to local minima.

¹⁵ The minimum number of required tests calculates as: $n \ge \frac{\ln(1 - 0.95)}{\ln(1 - 0.1587)} = 17.34 \Rightarrow n = 18$.
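The remark that an exponential bound can undercut a polynomial one for small inputs is easy to check numerically. The following sketch, with a hypothetical helper name, finds the first problem size at which $2^n$ overtakes $n^5$.

```python
def first_n_where_exponential_wins(degree=5, start=2):
    """Return the smallest n >= start with 2**n > n**degree."""
    n = start
    while n ** degree >= 2 ** n:
        n += 1
    return n


print(first_n_where_exponential_wins())  # 23
```

For the fifth-degree polynomial, $n^5$ stays above $2^n$ for every n from 2 through 22. This covers the range $1 < n < 10$ quoted in the text with a wide margin.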

An applicative expression can be parsed quickly into a tree, because LL(k)-parsing with $O(n^2)$ complexity [90] is the basis for this method. The transformation from the tree into a layout is equally simple because a functional expression contains the circuit designers' intuitive solution to the NP-complete placement problem cast into an LL-language. Restrictions imposed by the grammar guarantee modularity and avoid duplicate subexpressions. Thus, reasonably good solutions are obtained with the functional method. Furthermore, functional placement is attractive for large projects since shared subcircuits naturally lead to hierarchical expressions.
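To illustrate how an applicative expression maps onto a tree, the sketch below parses a hypothetical prefix syntax of the form `name(arg, arg, ...)` with a recursive-descent parser using one token of lookahead. The syntax, the token set, and the tuple representation of tree nodes are assumptions made for this example only. They are not the circuit description language of this thesis, which is defined in Chapter III.

```python
import re

# One token: an identifier, or one of the punctuation marks ( ) ,
TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*|[(),])")
PUNCTUATION = ("(", ")", ",")


def tokenize(text):
    """Split the input into identifiers and punctuation tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_expr(tokens, i):
    """expr := IDENT [ '(' expr { ',' expr } ')' ]; returns (node, next index)."""
    if i >= len(tokens) or tokens[i] in PUNCTUATION:
        raise SyntaxError("identifier expected")
    name, i = tokens[i], i + 1
    children = []
    if i < len(tokens) and tokens[i] == "(":
        child, i = parse_expr(tokens, i + 1)
        children.append(child)
        while i < len(tokens) and tokens[i] == ",":
            child, i = parse_expr(tokens, i + 1)
            children.append(child)
        if i >= len(tokens) or tokens[i] != ")":
            raise SyntaxError("')' expected")
        i += 1
    return (name, children), i


def parse(text):
    """Parse a whole expression into a tree of (name, children) tuples."""
    tokens = tokenize(text)
    tree, i = parse_expr(tokens, 0)
    if i != len(tokens):
        raise SyntaxError("trailing input")
    return tree
```

For example, `parse("and(or(a, b), not(c))")` yields a root node `and` with the subtrees `or` and `not` as children. Each token is examined once, so this simplified LL(1) case runs in time linear in the number of tokens.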

For the evaluation of both methods, several sequential circuits (a digital clock, a frequency divider/counter and a pulse width modulator controller) and one purely combinational circuit (a four bit ALU - SN74181 clone) were investigated. The measurements on the clock circuit are presented here and are representative of the results obtained with the other circuits. The main results are displayed in Figures 58 and 59, and tabulated in the spreadsheets of Tables 4 and 5.

The first pair of graphs (Figure 58) illustrates the results for a strict around-the-cell wiring pattern. In both charts, the vertical cost axis is scaled to fit, and the horizontal axis shows the number of rows obtained by snake folding. For each row configuration, each diagram contains five data points, two of which correspond to functional placement. The other points give the results for simulated annealing based on production-run parameters. The settings specify an annealing process with a very high starting temperature. This initial temperature decreases rapidly until the cost evaluation shows a significant change. Smaller temperature reductions are applied thereafter. Throughout the cooling process, T is lowered "exponentially" with a parameter $\alpha$. One section of the TimberWolf3.2 user manual [56] suggests an $\alpha$ as well as a set of production-run parameters which yield the most compact layouts. Whenever results of simulated annealing are given here, these production-run parameters are employed. The initial cell configuration for an annealing process is based on a snake folding for a certain number of rows and the cost for either a layout with all wires routed in the channels or with result propagation doglegs. Since intercell connections effectively remove one or two tracks in each wiring channel, all charts indicate the expected improvement. The estimated minimum wire-length and total-cost are determined from data collected from several optimizations. For this purpose, ten independent annealing processes were initiated for each snake-folded layout.

Estimates are calculated with the standard deviation of each data set and the standard normal variable z, as described in the preceding section.
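The "estimated lowest" rows of the tables follow the rule average minus a z-factor times the standard deviation. The sketch below shows that arithmetic. The function names are illustrative, and the use of the sample standard deviation (divisor n - 1) is an assumption, since the text does not state which estimator the spreadsheet applied. The figures in the usage line (average 30072.3, standard deviation 1712.2, factor 2.624) are read from the 7-row wiring-cost column of Table 5 as far as the scanned copy allows.

```python
import statistics


def estimated_lowest(mean, stdev, z_factor):
    """Estimated lowest cost: AVER - z * STDEV."""
    return mean - z_factor * stdev


def estimated_lowest_from_probings(costs, z_factor):
    """Same estimate computed directly from a list of probing results."""
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return estimated_lowest(statistics.mean(costs),
                            statistics.stdev(costs), z_factor)


print(round(estimated_lowest(30072.3, 1712.2, 2.624), 1))  # 25579.5
```

The result agrees with the tabulated estimate of 25579.6 to within rounding of the inputs.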

Table 4: Data for around-the-cell routing of a clock chip.

A B CD EF Q H 1 Ckx* Chto: 14* Celt: 155 Nan: 042 Pmi 2 Numtarof Row* 2-now* 3>Rowt 4-Rowi S-Rowt e*Rowi 7-Rowa B-Rowi 3 Matlmal WWh o( a Row IUmM»l 2000 2000 1500 1200 1000 65C 750 4 5 Functional (DOflUWrtMJGMl 81641 47207 30516 37080 35004 3475! 35638 • Functional (DOOFWira Unaih n a M .lo coin _51641 47207 30516 37980 35084 3475! 7 FuneMNOOOQMnWaJ WWno Co»l 54210 48026 42661 41057 41202 40073 39136 a FuncMNOOOOHnWal Wirt Lana* (watodl 54210 48028 42661 41057 41202 40073 30136 • Channal Conlribuliota rteaiadl •465 •402 •423 •433 •442 •434 •467 i a 1 1 Skn. Annaalno on FuncUHOOOOt WM(u Cott ProMno 1 4102B 40205 45616 47838 80203 46831 50018 i a ProMno 2 43000 44886 46530 48643 40450 48729 48807 "-»■-- • i a rrvNV 3 47701 47423 45631 47423 43173 47159 46433 1 » ProMna 4 1 * PreWno S 17 ProMno a ------1 • 1 a ProMno a Bawhkbwi A ProMno 10 Minimum oroMnolHO Avaraoa aroblnolrlO 481601 45486 45057 46431I 48616 46012 Standard davMon oroMnalrlO 2181.3 1577.1 1746,4 1006.3 1042.! i a i a .2 Eadmmd hwratl - AVER-2.624‘STDEV 30784 42666 41374 41103 41516 42141

Obn. AnnaaHna on FunctlNOOOO) Wkt Lanoth ProMno 1_ . _ 48887 47282 40782 . 52152 40107 52506 ProMno 2 46184 47646 40413 40580 47906 49107 46773 40124 47276 46076 44666 40343 47700 [ a i l ProMno 0 46878 46370 47661 47121 46404 46526 40400 ProMno 0 5170S 40130 5151^ 48357 46604 51206 47923 ProMno 7 46221 43701 40001 48896 40160 4569! 50551 ProMno a 47061 50317 80317 44751 40066 44439 50312 ProMno 9 50336 48039 48036 50256 46685 40106 46235 [ a r l ProMno 10 48516 46554 46554 46421 46800 40570 49800 MWmum ornblnolTlO 44216 43701 46554 44761 44666 44436 46235 Avaraoa oroMnolrlO 48033 47325 48634 47709 47706 47034 48369 Standard dmrMon proMnalHO 2308.3 2017.6 1605.0 1680.0 1743.8 1076.6 1657.6 Eadmttad kMraot . AVER-2.7WSTDEV 41650 41745 44103 43127 42074 4246! 43765 154 Clock Around-Wire Length 55000 T

[Figure 58 consists of two line graphs, "Clock Around-Wire Length" and "Clock Around-Wire Cost", plotted over the number of rows. Each graph shows the functional (DOG) result, the initial functional (NODOG) result, the minimum and the average of annealing probings 1:10, and the estimated lowest value.]

Figure 58: Line graphs for around-the-cell routing of a clock chip.

The functional placement strategy concentrated on localized interconnections.

Hence, an advantage over simulated annealing was not surprising, and the introduction

of wiring channels between adjacent cell rows resulted in a considerable improvement of routing channel size. Distant block connections and cell group exchanges became possible.

The digital clock circuit is somewhat different from the other projects, because more than three rows were necessary for a significant cost improvement. This phenomenon was found also in other circuits but is most pronounced in the clock

configuration. It stems from highly connected cell groups that contribute to many

global, far-reaching nets which can be relocated to the end of cell rows. For this

reason, the functional approach is preferable over simulated annealing when around-

the-cell routing is applied to more than three standard cell rows.

The second set of charts (Figure 59) presents through-the-cell wiring results.

Here, simulated annealing gives results superior to functional placement, because cells that contribute a pin to a multi-pin net can extend the connection point to the switch

box on the opposite side of the cell. Thus, a wire contribution with an average length of a half perimeter cell-row bounding box is saved for each through-the-cell connection. For comparisons of routing channel areas, it is necessary to include a scaled adjustment for through-the-cell contributions. Adjustment values are listed in

Table 4, in the ninth row.

Table 5: Data for through-the-cell routing of a clock chip.

A BF a H 1 Ckx* Chto: 149 Cells: 158 Net*: 942 Pins 2 Number of Rows 2 * R o w s 6-Rows 7-Rows 9 - R o w s 3 Maximal Width ot a Row (Lambda) 2 9 0 0 1 0 0 0 6 5 0 7 5 0 4 8 Functional (DOG)-Wlrina Cost 52100 3 9 4 2 6 3 5 1 9 6 3 5 9 3 6 • Functional (DOG)-Wire Lanalh (scaled k> cost] 62109 3 6 4 2 6 3 5 1 9 0 3 5 9 3 9 7 Funct.fNODOOHntttal Wlrina Cost 53745 40790 39939 3 6 6 0 9 • Fund.fNODOGFInlllal Wire Lonath (scaled) 53745 40790 39936 39669 • 1 0 Sim. Annealna on Funct.(NODOO) WWno Cost 1 1 P r o b ln a 1 2 7 5 1 3 3 4 1 2 5 2 9 4 7 2 3 9 9 5 2 1 2 P r o b ln a 2 2 5 9 0 4 2604^ 29295 33941 1 3 P ro fa in o 3 26777 20909 32430 31224 1 4 Probino 4 3 0 4 3 0 3 1 0 7 3 3 1 0 5 9 2 0 9 1 5 1 9 P r o M n o 5 2 5 4 0 1 3 0 3 1 2 2 9 3 9 4 3 2 9 9 4 1 1 P r o b ln a 9 2 5 0 2 1 3 0 1 1 3 3 2 0 1 7 2 9 2 9 9 1 7 P r o b in g 7 3 2 2 4 1 3 1 0 4 4 1 2 9 1 5 2 3 1 0 2 0 1 a P r o b ln a 9 . . ... 2 9 2 2 9 2 9 9 1 1 2 9 0 9 6 2 7 4 3 5 i • Problna 0 30792 26443 29706 33411 20 P r o b ln a 1 0 2 9 4 2 4 3 0 7 9 3 3 1 4 9 9 2 7 4 0 4 21 Minimum oroblnaltlO 2 5 4 0 1 2 0 4 4 0 2 9 0 0 6 2 7 4 3 5 22 Average orobinalrtO 2 7 7 9 2 29973 30072.3 3 1 7 2 2 . 5 23 Standard deviation Droblno1:10 2 4 4 7 2 1 9 9 . 1 4 1 7 1 2 . 2 3 0 7 1 24 Estimated lowest - AVER-2.024*8TDEV 2 1 3 4 1 2 4 2 3 1 . 3 25579.6 22060.9 2 9 29 Sim. Annealna on Funct.(NOOOQ) Wire Lenah 27 P r o b ln a 1 33075 37243 20909 3 9 4 2 2 29 P r o b ln a 2 29441 29506 32019 3 0 0 0 1 29 P r o b ln a 3 30197 33393 34932 3 3 2 9 0 30 P r o b in o 4 3 7 4 0 4 3 4 0 0 3 3 3 7 3 9 3 2 3 3 4 31Probing 5 30729 33193 3 2 5 4 2 3 5 4 2 3 32 P r o b in g 0 30591 32095 34075 3 2 3 9 7 33 Problna 7 36319 33627 31193 3 5 0 0 0 34 P r o b ln a 9 31437 30455 31230 3 0 9 0 5 39 Probino B 34000 29320 32921 36131 39 P r o b ln a 1 0 32002 32911 33997 31444 37 Minimum Dioblna1:10 2 9 4 4 1 2 9 3 2 0 29990 3 0 9 0 5 3 9 Avetaoe om blnalilO 32524 32942.9 32644.5 3 4 2 2 4 . 
7 39 Standard deviation Dfobina1:10 2690.4 2357.41 1 6 6 9 . 7 2 2 6 4 0 . 9 9 4 0 Estimated lowest ■ AVER-2.705"STDEV 2 5 1 1 3 2 6 1 2 4 . 4 2 0 3 0 4 . 2 2 6 9 0 6 . 1 157

[Figure 59 consists of two line graphs, "Clock Through-Wire Length" and "Clock Through-Wire Cost", plotted over the number of rows. Each graph shows the functional (DOG) result, the initial functional (NODOG) result, the minimum and the average of annealing probings 1:10, and the estimated lowest value.]

Figure 59: Line graphs for through-the-cell routing of a clock chip.

Several other evaluations, conducted with a technique known as force directed

pairwise interchange (FDPI) [91], have shown that functional placement is better than

schematic capture post-processed by FDPI; see Figure 60. For this reason, functional placement is useful to obtain good results in a reasonable amount of time for layouts employing around-the-cell routing.

[Figure 60 is a line graph of cost over the number of rows (0 to 7) with four series: functional (DOG) routing cost, force directed pairwise interchange, initial functional (NODOG) wiring cost, and annealing.]

Figure 60: Comparison with force directed pairwise interchange.

VI.4. Low Temperature Cost Improvement

The preceding section has shown that simulated annealing can reduce the wire-length and the total cost of configurations employing through-the-cell routing. This section presents some optimization measures based on a 'good' starting configuration obtained by functional placement. This differs from the annealing strategy used so far, in that the iterative search algorithm is not forced to explore broad areas in the initial stage. Hence, one can eliminate the computationally expensive starting phase and still obtain almost optimal results.

The annealing method implemented in TimberWolf3.2 models a crystallization process with a very high initial starting temperature T which is lowered in exponential steps. Since T determines the freedom (movability) of the cells, almost all moves are accepted in the beginning. This causes a complete randomization, and hence destruction of any good initial configuration in the search for low-cost regions. Much of the time spent in this computationally expensive step can be eliminated, because a good initial placement may be viewed as a local optimum. For this reason, one must select a starting temperature which is low enough to prevent extensive randomization, but sufficiently high to allow for several permutation mistakes. The mistakes are necessary to escape local cost minima in the search for a global optimum. Low

temperature experiments were conducted with a modified version of TimberWolf that

allows for the specification of an initial temperature value T and parameters to the

cooling function. Since the optimization depends on T and the course of the

temperature degradation, it is necessary to carefully select a starting value. A value

which corresponds to the average temperature that yields an equivalent configuration

(starting cost) in a high-temperature run would be best. However, there is no method

known to determine that value. Grover [88,92] suggests a temperature value close to

the freezing point obtained on the basis of prior experience. Numerous experiments

have shown that a temperature at which the number of accepted swaps is in the range

six to ten percent of the total cells in the circuit gives best results. A temperature in

this range, combined with a starting configuration produced by functional placement,

produces very good solutions. The bar-charts in Figures 61 and 62 show the results for

four circuit configurations treated with a sequence of neighborhood moves followed

by a limited global search. Numerical results for wire-length and total cost are

summarized in Table 6. Each of the charts contains the findings for a digital-clock, a

frequency counter, an implementation of a Towers of Hanoi algorithm, and a four bit

ALU. The leftmost bar in each group corresponds to the cost for functionally-placed

standard cells employing through-the-cell routing. The other four bars, adjacent to the

right, are based on ten simulated annealing probings. One bar demonstrates the lowest

probing achieved within ten runs, and the other block gives the predicted lowest cost

achievable within a one-sided area (z = -2.326). Both high-temperature bars, given for each circuit, result from optimizations with an initial temperature of 4000000. The remaining blocks correspond to the cost achieved when starting with a drastically reduced temperature, e.g., 150.

Clock, Towers and ALU show a significant reduction in cost for either annealing

attempt, whereas the counter presents only a cost increase. This is because the counter

arrangement can be visualized as a bit-serial, systolic chain of toggle-cells with no global interconnects besides clock and stage outputs. Since the structure has a high locality of interconnects, it is natural that any exchange or randomization of cells results in additional wire jogs and consequently a cost increase. The graphs demonstrate clearly the effect of irreversible initial permutation efforts, and the effects are more pronounced for a high starting temperature. The 4-bit ALU circuit behaves similarly because it is purely combinational and has no feedback wires. For this reason, clusters of cells can form subfunctions in a hierarchical fashion so that sharing of common subexpressions by through-the-cell routing becomes possible. Since any displacement of cells belonging to a subfunction results in a high cost penalty, highly connected subfunctional components become inseparably clustered. The effect of low-temperature annealing is that entire subfunctional clusters are reordered and control signals become vertically available. In contrast to this, significant cost reductions are possible for the Towers of Hanoi and the digital clock implementation, because both circuits contain a large number of feedback wires and many shared signals. Here, low-temperature annealing establishes vertical cell alignment and signal sharing and eliminates long interconnects beside the rows.

Optimization results from a number of circuits demonstrate that functional placement combined with low-temperature annealing is both cpu-time and layout-cost efficient. A computation using a high starting temperature takes up to thirteen times as long as a low temperature run, e.g., Towers: 353 cpu-seconds versus 27 cpu-seconds.

The elimination of iterative steps in the beginning of the annealing process is responsible for the time improvement in finding a reasonable global configuration. All results show the expected drop between high- and low-temperature annealing. This clear drop is due to the selection of the wire-length as an optimization criterion.

Results with equally low cost values can also be produced by traditional high-

temperature annealing runs. However, many time consuming tests are required to

compensate for the spread of the results. An estimate can be determined with the

formula given at the end of the second section in this chapter. It requires the substitution of z with $z = \frac{\Delta cost}{s}$, where $\Delta cost$ is the cost difference of runs performed at high and low starting temperature T, and s is the standard deviation of ten such high-temperature tests. It was found that at least fifteen high-temperature runs were necessary to match the average cost attained by low-T annealing with 90% certainty for all circuits.

Past experience and measurements indicate a significant reduction of wire-length and computing time for low-temperature simulated annealing performed on functionally constructed layouts. It is not surprising that low-temperature annealing achieves lower-cost layout configurations and requires noticeably less cpu time.


[Figure 61 is a bar chart with one group of five bars for each circuit (clock, counter, Towers, 4-bit ALU): NODOG cost (min), minimum at high temperature, <= 1.0% estimate at high temperature, minimum at low temperature, and <= 1.0% estimate at low temperature.]

Figure 61: Wire length results for low-temperature annealing.

[Figure 62 is a bar chart with the same five bars for each circuit (clock, counter, Towers, 4-bit ALU), showing total cost.]

Figure 62: Total cost results for low-temperature annealing.

VI.5. Summary

The measurements and results presented in this chapter show clearly that functional placement is a cost effective and a computationally attractive method.

Simulated annealing, which is regarded as being the most area-efficient method for standard-cell placement, was chosen for the comparisons. All comparisons were based on four complete designs, of which two have been submitted and successfully

Table 6: Data collected for low temperature annealing processes performed on functionally placed layouts.

I b T T TT c____ CLOCK C O J N T ffl T O W g S ALU

Wlrtna Cotit Wiring Ccwte InH latT in itia tT 1 5 0 1 3 0 1 5 0 5 0 0 PiM na 1 2 8 4 7 2 2 4 7 8 3 3 6 2 1 0 4 5 2 9 4 P r o b in g 1 2 9 1 2 5 2 1 7 2 3 3 6 3 8 2 4 2 9 7 7 P r o b in g 2 2 9 2 9 5 2 4 7 3 6 3 7 9 5 9 45294 Probing 2 2 7 5 9 3 2 1 5 3 9 3 6 3 8 2 4 3 4 6 6 P r o b in g 3 32430 24736 3 7 9 4 9 4 5 9 0 3 P r o b in g 3 2 7 9 5 6 2 1 5 3 9 3 6 3 8 2 4 4 5 7 3 P j i - l K W j i A rfOWW < 3 1 0 5 6 2 4 8 3 9 37622 46487 Probing 4 2 7 5 1 6 2 1 5 3 9 3 8 3 3 0 4 5 8 5 0 rluSSS— 2 9 3 9 4 2 4 8 3 9 3 8 5 7 1 45710 Probing S 28620 21684 38439 4 5 0 8 9 10 nopwg o 3 2 6 1 7 2 4 3 4 4 3 8 4 3 3 44577 Probing 6 2 8 1 9 7 2 1 8 1 8 4 0 0 1 0 4 7 2 9 5 PirJ ilnn 7 11 rlD O lig 9 2 8 1 5 2 2 3 6 5 9 3 6 6 2 0 4 6 7 6 8 P r c b in o 7 2 7 6 2 9 2 1 6 8 4 3 7 7 3 0 4 7 2 9 5 Pnilihin A 12 rlOOnH B 26096 23659 37555 4 5 6 6 2 P r o b in g a 2 8 6 6 0 2 1 8 1 6 3 8 1 3 1 4 3 6 6 4 13 riwiQDmMm Ay 2 9 7 0 6 2 4 5 1 5 3 7 4 4 3 4 3 6 7 8 P r o b in g 9 2 8 2 4 4 2 1 6 6 4 3 8 3 6 3 4 3 2 2 5 14 P r o b in g 1 0 3 1 4 9 9 2 5 3 3 4 37449 45594 Probing 10 2 8 2 4 4 2 1 0 0 6 3 9 8 5 8 4 2 4 4 8 15 Min Hah Temp. 28096 23659 36210 4 3 6 7 6 MtvLor Tfitt- 27516 21006 36382 4 2 4 4 8 15 >wrM Hah Tmp. 30072 2 4 5 4 5 3 7 5 8 1 4 5 4 9 7 AwnntLowTwiCi 2 8 1 7 8 2 1 6 0 3 3 8 0 0 1 4 4 5 8 2 17 SktevHW iTamp. 1 7 1 2 5 3 1 7 3 2 8 8 7 SHWv Low Temp. 5 2 3 2 3 5 1 3 2 8 1 7 5 3 18 . 1.0% Htah Tem p. 2 6 0 9 0 2 3 3 1 0 3 5 8 7 6 4 3 4 3 4 1 1 -0 % L cm T w o . 
2 6 0 6 1 2 1 0 5 8 3 4 9 1 2 4 0 5 0 5 15 W ire L»na1h» Wire lenalha 20 rroOlfW 1 2 9 8 0 6 2 5 0 8 6 3 7 6 6 5 4 6 8 1 2 3 2 3 4 9 2 2 4 7 7 3 6 2 8 3 4 4 6 6 3 21 rfuOlny < 32016 25215 3 0 3 1 9 4 6 5 1 3 P r o b in g 2 3 0 0 5 1 2 2 1 4 6 3 8 2 4 4 4 4 7 8 2 22 P r o b in g 3 3 4 0 3 2 2 5 6 2 4 40106 47131 Probing 3 *« — j-i A 3 1 5 5 4 2 2 3 8 5 3 7 9 6 4 4 6 4 4 2 2 3 rro&nQ 4 3 3 7 3 9 25274 39612 47478 Prcbina 4 * —■ * r 28976 22149 39983 4 7 1 1 6 2 4 rim m g a 3 2 5 4 2 2 5 4 1 1 4 0 6 3 3 4 7 1 6 6 P rnblnQ 5 3 1 6 8 7 2 2 2 8 5 4 0 5 7 4 4 6 5 4 0 2 5 P r o b in g 6 3 4 0 7 5 2 5 1 6 7 40707 46001 Probing 6 3 0 0 5 4 2 2 4 4 4 4 2 3 1 7 4 8 5 3 7 26 rroomo / 31103 2 4 7 6 6 38281 4 8 1 1 8 P r o b in g 7 2 9 9 0 8 2 2 4 0 4 3 9 2 1 8 4 8 5 3 4 *» ■ «-- m - »- •- A 27 n D P iig P 3 1 2 3 0 24339 39526 4 6 7 3 3 r i m i g p 3136C 22055 40137 4 5 1 2 6 2 6 Probing 9 3 2 9 2 1 2 4 6 1 8 30267 44600 Probing 9 3 0 1 0 6 2 2 0 6 5 4 0 0 8 4 4 4 6 0 7 2 6 P r o b in g 1 0 3 3 8 9 7 2 5 9 2 6 3 0 3 1 2 4 6 8 2 3 P r o b in g 1 0 3 0 5 0 9 2 1 8 3 1 4 2 1 0 5 4 3 5 8 8 3 0 LBn-Hioh Tamp. 2 9 8 9 6 2 4 3 3 9 3 7 6 6 5 4 4 6 0 0 Uln-Low Temp. 2 8 9 7 6 2 1 8 3 1 3 7 9 6 4 4 3 5 8 8 31 Average High Temp. 3 2 6 4 5 2 5 1 6 3 3 9 4 5 2 4 6 7 3 8 Average Low Tamo. 3 0 6 5 6 2 2 2 2 4 3 9 8 9 1 4 5 9 9 4 32 S*dev H ob Temp. 1 5 7 0 4 5 1 9 5 4 0 4 1 Slriev Low Tamp. 1 0 3 6 2 0 9 1 5 2 1 1 7 1 2 33 r 1-0* Hoh Temp. 28093 24113 3 7 2 3 3 4 4 5 4 9

SUMMARY AND CONCLUSIONS

VII.I. Research Contributions

This thesis has presented a methodology for the automatic placement of standard-cell layouts. It is significantly different from traditional polycell placement schemes in that a layout geometry is specified and constructed from behavioral descriptions. The construction relies on a translation scheme that combines the simplicity of standard cells with the elegance of functional programming. This approach is more suitable for logic description since it hides all implementation details from the designer and naturally leads to hierarchically structured compositions. Hence, complete ICs can be specified under extreme time constraints. As an example, consider the digital clock implementation shown in the Appendix section B. The entire design effort from idea to successful pin-to-pin simulation took less than two weeks. This extremely short design span was mainly possible because the schematic capture step, normally required with conventional placement procedures, can be omitted. Instead of this time-consuming process, circuit analysis becomes the main focus. Circuit analysis consists of logic simulation, timing verification and expression debugging with traditional tools available in every design and programming environment. The beauty of this approach is its simplicity that lies in the unique combination of unambiguous functional programming and the generality of cell-based custom design.

The research focused on three major areas: (1) development of a method for

efficient and accurate translation of behavioral logic descriptions into a layout

topology, (2) placement and routing optimization, and (3) standard-cell design. In the

course of the research, a concerted implementation of the proposed translation-,

optimization-, and physical layout tool was completed. The practical value of the tool

was shown with several 3 µm test structures fabricated under the MOSIS project, and

the positive feedback of other students using this tool to complete their layouts.

As explained in Chapters I and II, traditional placement approaches suffer from at

least one of the following problems: (1) high computational cost; (2) poor layout

quality; (3) inability to construct large layouts from textual high-level language

expressions; or (4) lack of comprehension of local and global design constraints.

Manual placement can resolve some of these drawbacks since layout editors

grant direct control over all placement aspects. However, the large number of low-

level design constraints limits the project size and increases the design time radically.

Constructive or transformational methods allow for larger projects, but the placements

achieved are of low quality. Often the choice of the seed components and the lack of a

global planning strategy is responsible for the poor results. However, such a strategy is

useful to resolve future cell-to-cell connection problems. In contrast to this, iterative

improvement methods consider global issues. This often results in layouts of

comparable or better quality than manually constructed ones. Currently, simulated

annealing is preferred in the industry despite its exponential time requirement. It produces high quality geometries. For this reason, it was chosen for the evaluations in

Chapter VI. The comparison shows that functional placement is a cost-effective and computationally attractive method. For layout constructions of similar quality, the time complexity grows exponentially for simulated annealing but only quadratically for the proposed approach. A complete description of the constructive steps involved was presented in Chapters III and IV. The first part in Chapter Three defines

the language for circuit description followed by the compilation and tiling steps for a

complete, optimized layout. Chapter Three focused on expression parsing, tree

flattening, pseudo-code generation and basic circuit construction, and Chapter Four described optimization measures. The goal of the chapter was to give an efficient method for the production of an improved program graph that corresponds to a faster, more efficient standard-cell layout. Several essential construction steps for a directed acyclic graph, such as right recursion elimination, cascaded-inverter reduction, identifier merging, tree restructuring and common subexpression elimination were shown. Chapter V has presented a logic design method for synthesizing standard cells within the building block family and the routing environment. Algorithmically based cell synthesis is convenient for the construction of large and complex standard cells.

The algorithm was applied by students in the OSU Computer and Information Science

VLSI-design course. Several cells were constructed recently with good results.

The main contributions of this research include: (1) the development of an area-efficient, technology-independent standard-cell placement method that requires at most $O(n^2)$ time and $O(\log n)$ space. This is the most important achievement of this research, because layouts obtained from functional high-level descriptions are of excellent quality. Responsible for this achievement is the combination of LL(k)-expression parsing, (2) parse tree optimization, as well as (3) direct placement and netlist generation from a threaded directed acyclic graph. These are significant contributions because the combination of both techniques results in faster and denser layouts with fewer logic stages. Another contribution is (4) the standard-cell construction method for dual-row, complementary logic cells. The method has several advantages, of which short cell design time, complete latch-up prevention and sophisticated cell interconnections are most important. Among less significant contributions are: (5) the specification of a high-level language formalism allowing for the accurate and unambiguous behavioral description of any digital logic construct, and (6) the implementation of an efficiently coded translator for circuit expressions which complements our existing design environment.

VII.2. Research Extensions

This dissertation has described a complete standard-cell layout methodology for the construction of complete integrated circuits. The research discussed has prompted many more questions to be answered and directions to pursue. The following is a brief description of problems for future study, and extensions to this research which should be investigated:

1. High-level language support(HLL). The architectural design of several

test structures has shown the necessity of iterative language constructs

for the repetition of parallel subcircuits. A parallel n-bit logic for

example, could then be specified only once and repeated n-times as the

body of a counter loop.

2. Simulation. Currently, expression debugging involves the invocation of:

expression compiler (FCOMPILE, LIN2SNAKE), graphics editor,

router, extractor (MAGIC), and simulator (ESIM) in this exact sequence,

to detect logical errors in the code. An extension of the current tool set

with a LISP-like interpreter and debugger seems useful to shorten the

debug cycle.

3. Expression verification. Logic simulators can only test the correctness of

the logic for a limited set of vectors but cannot prove that a design is

error free. For this reason, a reliable method for expression verification

that eventually leads to a correctness proof is desired.

4. Synchronization. The approach assumes that all expressions are properly

retimed [76] so that there is no combinational rippling. For the

connection between peripheral logic and standard cell blocks a

transparent LATCH-cell discussed in Appendix C, is currently available.

Other synchronization measures such as synchronizer cells or automatic

expression retiming can be developed.

5. Testability. Chip design based on functional descriptions is most

convenient when large structures are to be designed. However, chip

testing is difficult, in contrast to the simulation prior to fabrication,

because internals of the circuit are not accessible anymore. Furthermore,

pin-limitations constrain the number of test pins on the package. Thus,

the efforts for testability should be directed towards an extension of the

translator scheme for automatic test structure and test-vector generation.

An approach like latch-scanning ([93] p. 102) "along" the folded chain

of cells, seems attractive.

6. Fault tolerance. A major consideration that follows from testability

issues is the ability of a chip to function under certain fault conditions.

This issue is important since devices on a fabricated chip cannot be

replaced or repaired when found faulty. For this reason, further research

should concentrate on the detection of critical subexpressions and the automatic

duplication of critical sections to introduce redundancy. The compiler should be smart enough to generate all necessary control and

communication links between fault tolerant extensions. It should

produce a testable fault-tolerant chain of cells similar to the processor

chains introduced by Arnold Rosenberg [33].

7. Cluster detection. Similar to purely systolic designs, snake-folded cell

arrangements have a strong connectivity to neighboring cells and hence,

may be visualized as clusters. Such clusters should be treated as entities

in any simulated annealing process to avoid long wire runs. A further

extension to the compiler and the tool preparing the design for simulated

annealing (LIN2TIMBER) is required such that clusters are treated as

inseparable units.

8. Temperature distribution. The importance of temperature distribution on

the substrate of integrated circuits comes from its influence on the

performance and aging process. For this reason, an optimization of the

cell arrangement, namely the folding pattern for a lowered die

temperature, becomes necessary. An analysis method for temperature

estimation of different area effective folding patterns should be

developed in further research efforts.

Appendix A

EXAMPLE OF A 4-BIT COUNTER DESIGN

This Appendix presents the design steps for a four-bit synchronous counter and shows the results for several horizontal wiring optimizations. The example was created with an implementation of the proposed functional placement method on a SUN-3/160 in standard C. Figure 63 illustrates the counter circuit and gives the functional description of the logic. The circuit resembles an accumulator in the increment mode and is composed of four identical toggle-stages which are defined in the beginning of the description. Each stage contains an adder and a register driven by a global clock (CLK!) and a reset (RES!) signal. The function (MAIN) computes a signal that toggles every fifteen transitions of the clock. Note that the actually desired count value is acquired through the output identifiers (A,B,C,D). Even though E is included in the output list, it is eliminated during compilation because no signal is assigned to it in the description; therefore, no wire or label is created for E in the layout. Identifiers T1, T2, T3 and T4 are merely introduced to improve the readability of the functional description, and their inclusion in the output list is not required.

Obviously, one could rewrite the expression for the counter as:

MAIN=TOG(TOG(TOG(TOG(CIN,A),B),C),D); but a description similar to the one given seems more readable.
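The behavior of the chained toggle stages can be checked with a small software model. The following is a minimal sketch, not the translator itself: the function names `tog`, `clock_edge` and `value` are hypothetical, the DFF is modeled as an ideal register that latches on every call, and A is assumed to be the least significant bit.

```python
def tog(carry_in, q, res):
    """One TOG stage: returns (carry_out, next_q) for current register value q."""
    carry_out = q & carry_in        # AND(name,tempval), the value passed to the next stage
    next_q = res & (q ^ carry_in)   # AND(RES,EXOR(name,tempval)), latched by the DFF
    return carry_out, next_q


def clock_edge(state, cin=1, res=1):
    """Advance the four stages (order A, B, C, D) by one clock edge.

    Returns the new state and MAIN (= T4), evaluated before the edge.
    """
    carry = cin
    nxt = []
    for q in state:
        carry, q_next = tog(carry, q, res)
        nxt.append(q_next)
    return nxt, carry


def value(state):
    """Interpret the state as an unsigned number with A as bit 0 (assumption)."""
    return sum(bit << i for i, bit in enumerate(state))


state = [0, 0, 0, 0]
trace = []
for _ in range(17):
    trace.append(value(state))
    state, main = clock_edge(state)
print(trace)
```

Because each stage forwards its AND output as the carry of the next stage, the loop in `clock_edge` corresponds to the nested form TOG(TOG(TOG(TOG(CIN,A),B),C),D). With RES held low, every register is cleared on the next edge.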

/*
 * Example of a 4-bit counter
 */

#define TOG(tempval,name) \
    AND(name,tempval);\
    name=DFF(AND(RES,EXOR(name,tempval)),CLK)

*MAIN:
INPUTS:RES,CIN,CLK;
OUTPUTS:A,B,C,D,E,T1,T2,T3,T4;
MAIN=T4;
T4=TOG(T3,D);
T3=TOG(T2,C);
T2=TOG(T1,B);
T1=TOG(CIN,A);
END.

Figure 63: Four-bit counter with adders as incrementers

In the parse-tree description below, each node is numbered with its order number in the tree-thread, starting with node zero at the root. Child-pointers to zero are equivalent to nil-pointers and all other numbers refer to actual nodes.
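The zero-as-nil convention can be illustrated by rebuilding an expression from a few table entries. This is a sketch under stated assumptions: the dictionary `NODES` and the function `expr` are hypothetical names, and the data are nodes 28 through 34 of the listing, transcribed by hand.

```python
# node number -> (nodename, lchild, rchild); a child pointer of 0 means nil
NODES = {
    28: ("DFF", 29, 34),
    29: ("AND", 30, 31),
    30: ("RES", 0, 0),
    31: ("EXOR", 32, 33),
    32: ("A", 0, 0),
    33: ("CIN", 0, 0),
    34: ("CLK", 0, 0),
}


def expr(n):
    """Return the functional expression rooted at node n."""
    name, lchild, rchild = NODES[n]
    # Node 0 is the root (MAIN) and can never be a child, so 0 is free to mean nil.
    args = [expr(c) for c in (lchild, rchild) if c != 0]
    return f"{name}({','.join(args)})" if args else name


print(expr(28))
```

The recovered string is the register expression of the first toggle stage, that is, the macro body with `name` replaced by A and `tempval` replaced by CIN.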

Tree: The parsetree consists of the following information:

current node# = 41   nodename T3*     lchild# = 0    rchild# = 0
current node# = 40   nodename CLK*    lchild# = 0    rchild# = 0
current node# = 39   nodename T2*     lchild# = 0    rchild# = 0
current node# = 38   nodename CLK*    lchild# = 0    rchild# = 0
current node# = 37   nodename T1*     lchild# = 0    rchild# = 0
current node# = 36   nodename CLK*    lchild# = 0    rchild# = 0
current node# = 35   nodename CIN*    lchild# = 0    rchild# = 0
current node# = 34   nodename CLK*    lchild# = 0    rchild# = 0
current node# = 33   nodename CIN*    lchild# = 0    rchild# = 0
current node# = 32   nodename A*      lchild# = 0    rchild# = 0
current node# = 31   nodename EXOR*   lchild# = 32   rchild# = 33
current node# = 30   nodename RES*    lchild# = 0    rchild# = 0
current node# = 29   nodename AND*    lchild# = 30   rchild# = 31
current node# = 28   nodename DFF*    lchild# = 29   rchild# = 34
current node# = 27   nodename A=*     lchild# = 28   rchild# = 0
current node# = 26   nodename AND*    lchild# = 27   rchild# = 35
current node# = 25   nodename T1=*    lchild# = 26   rchild# = 0
current node# = 24   nodename B*      lchild# = 0    rchild# = 0
current node# = 23   nodename EXOR*   lchild# = 24   rchild# = 25
current node# = 22   nodename RES*    lchild# = 0    rchild# = 0
current node# = 21   nodename AND*    lchild# = 22   rchild# = 23
current node# = 20   nodename DFF*    lchild# = 21   rchild# = 36
current node# = 19   nodename B=*     lchild# = 20   rchild# = 0
current node# = 18   nodename AND*    lchild# = 19   rchild# = 37
current node# = 17   nodename T2=*    lchild# = 18   rchild# = 0
current node# = 16   nodename C*      lchild# = 0    rchild# = 0
current node# = 15   nodename EXOR*   lchild# = 16   rchild# = 17
current node# = 14   nodename RES*    lchild# = 0    rchild# = 0
current node# = 13   nodename AND*    lchild# = 14   rchild# = 15
current node# = 12   nodename DFF*    lchild# = 13   rchild# = 38
current node# = 11   nodename C=*     lchild# = 12   rchild# = 0
current node# = 10   nodename AND*    lchild# = 11   rchild# = 39
current node# = 9    nodename T3=*    lchild# = 10   rchild# = 0
current node# = 8    nodename D*      lchild# = 0    rchild# = 0
current node# = 7    nodename EXOR*   lchild# = 8    rchild# = 9
current node# = 6    nodename RES*    lchild# = 0    rchild# = 0
current node# = 5    nodename AND*    lchild# = 6    rchild# = 7
current node# = 4    nodename DFF*    lchild# = 5    rchild# = 40
current node# = 3    nodename D=*     lchild# = 4    rchild# = 0
current node# = 2    nodename AND*    lchild# = 3    rchild# = 41
current node# = 1    nodename T4=*    lchild# = 2    rchild# = 0
current node# = 0    nodename MAIN*   lchild# = 1    rchild# = 0

The treepointer 'ptr' points to node: 0

The lists below give the initial placement from "tree-thread stretching", and the wiring information in the form of a complete netlist. Neither of the lists is optimized.

The initial placement is:

EXOR_31 DOG_42 AND_29 DOG_44 DFF_28 AND_26
EXOR_23 DOG_46 AND_21 DOG_48 DFF_20 AND_18
EXOR_15 DOG_50 AND_13 DOG_52 DFF_12 AND_10
EXOR_7 DOG_54 AND_5 DOG_56 DFF_4 AND_2
MAIN_0

Raw Nets:

Net A:    EXOR 31 1    AND 26 0    DFF 28 -1
Net T1:   AND 18 1     EXOR 23 0   AND 26 -1
Net B:    EXOR 23 1    AND 18 0    DFF 20 -1
Net T2:   AND 10 1     EXOR 15 0   AND 18 -1
Net C:    EXOR 15 1    AND 10 0    DFF 12 -1
Net T3:   AND 2 1      EXOR 7 0    AND 10 -1
Net D:    EXOR 7 1     AND 2 0     DFF 4 -1
Net T4:   MAIN 0 0
Net CLK:  DFF 4 1      DFF 12 1    DFF 20 1    DFF 28 1
Net CIN:  AND 26 1     EXOR 31 0
Net RES:  AND 29 1     AND 21 1    AND 13 1    AND 5 1

Figures 64 through 67 illustrate several horizontal-wiring optimization steps as described in Chapter IV. Figure 64 shows the non-optimized layout for the counter. It reflects the placement and corresponding netlist given before, and visualizes expensive wire jogs and numerous contacts from metal-to-metal. The next Figure (65) shows the slightly improved counter-circuit layout with doglegs inserted for left-branched result propagation. Note the reduction of four vias for each dogleg inserted.

Figure 64: Counter with non-optimized horizontal wiring.

Figure 65: Counter using left-branched result propagation.

Figure 67: Counter using over-the-cell routing.

Figure 68: Over-the-cell wire track.

Appendix B

LARGE DESIGN EXAMPLES

This appendix contains two complete designs which were laid out with the design methodology described in previous chapters.

B.1. 24-hour Clock

This chip is a complete implementation of a 24-hour clock that is driven by a single-phase 32kHz master-clock. It consists of two separately compiled building blocks: the master-clock divider and the 24-hour logic that drives the BCD-display.

The functional descriptions for both building blocks are given on the following pages.

The clock-logic requires a 1Hz signal so that either the output of the 32kHz divider chain or a master-clock derived from the mains can be used. In the first stage of the circuit, the 1Hz signal is divided by sixty. This accounts for one cycle per minute. The second stage is the minute-counter programmed to reset at a value of fifty-nine. Its value is output as BCD-signals for the display of minutes. The hour-counter is driven with 1/3600 Hz. It is structurally similar to the minute stage, but self-resetting at twenty-three. Three additional signals are available for counter manipulations: SETMIN, SETHOR and RESET. SETMIN and SETHOR allow for selective modification of either the hour or the minute counter, and RESET initializes the clock to 00:00. The clock-divider chain is a synchronous counter that divides an oscillator frequency of 32.768kHz by 2^15 and 2^14 to obtain 1Hz and 2Hz. Functional descriptions for both circuits are given below, followed by the photomicrograph of the fabricated test structure.
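The divider arithmetic and the roll-over behavior of the counter stages can be sketched in a few lines. The names `divided` and `tick` are hypothetical, and the model works on plain integers rather than BCD digits; it only illustrates that 32768 = 2^15, so fifteen binary stages yield 1Hz and fourteen stages yield 2Hz, and that the clock wraps at 59 seconds, 59 minutes and 23 hours.

```python
MASTER_HZ = 32768  # 32.768 kHz oscillator


def divided(freq_hz, stages):
    """Frequency at the output of a chain of `stages` divide-by-two stages."""
    return freq_hz / (2 ** stages)


def tick(h, m, s):
    """Advance a 24-hour clock by one 1 Hz pulse."""
    s += 1
    if s == 60:      # first stage: divide the 1 Hz signal by sixty
        s = 0
        m += 1
    if m == 60:      # minute counter resets after fifty-nine
        m = 0
        h += 1
    if h == 24:      # hour counter resets after twenty-three
        h = 0
    return h, m, s


print(divided(MASTER_HZ, 15), divided(MASTER_HZ, 14))
print(tick(23, 59, 59))
```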

/*
 * This file contains the description for a simple
 * digital clock 24h mode.
 *
 * RESET function to initialise clock to 00:00.
 * PHI is the systems 1 Hz one-phase-clock for the circuit
 * SETMIN allows to set the minutes by applying a low signal
 * SETHOR same for hours
 */

/* Toggle cell definition: */
#define TOGGLE(CARRY,RESET,CLOCK,OUTPUT) \
    AND(OUTPUT,CARRY); \
    OUTPUT=DFF(AND(XOR(OUTPUT,CARRY),RESET),CLOCK)

/* Gate definitions: */
#define NAND2(a,b) NOT(AND(a,b))
#define NAND3(a,b,c) NOT(AND(AND(a,b),c))
#define NAND4(a,b,c,d) NOT(AND(AND(AND(a,b),c),d))
#define NAND6(a,b,c,d,e,f) NOT(AND(AND(AND(AND(AND(a,b),c),d),e),f))
#define OR(a,b) NOT(AND(NOT(a),NOT(b)))

/* Hours-up display is ML -- counts to 1 */
/* Hours display is KJIH -- counts to 0 */
/* Minutes display is GFE -- counts to 6 */
/* Second display is DCBA -- counts to 0 */
/* T's are for debug to trace carry. */
/* INT line is to initialize the circuit to avoid X's */
*MAIN:
INPUTS:CIN,PHI,RESET,SETMIN;
OUTPUTS:A,B,C,D,E,F,G,H,I,J,K,L,M,CLK;
MAIN=T13;
/* Do the hours ... */
T13=TOGGLE(T12,RON,RHO,M);
T12=TOGGLE(CIN,RON,RHO,L);
RON=NAND2(NOT(L),M);
RHO=OR(AND(NOT(RESET),PHI),RHOUR);
RUP=NAND6(NOT(L),NOT(K),NOT(J),M,I,H);
T11=TOGGLE(T10,RHOUR,RMI,K);
T10=TOGGLE(T9,RHOUR,RMI,J);
T9=TOGGLE(T8,RHOUR,RMI,I);
T8=TOGGLE(CIN,RHOUR,RMI,H);
/* Introduce Button for setting hours */
RMI=OR(AND(SETHOR,RMII),AND(NOT(SETHOR),YF));
/* Do the Minutes ... */
RMII=OR(AND(NOT(RESET),PHI),RMIN);
RHOUR=AND(RH,RESET);
RH=AND(NAND4(NOT(J),NOT(I),H,K),RUP);
T7=TOGGLE(T6,RMIN,RES,G);
T6=TOGGLE

YRSEC=AND(YRS,RESET);
YRS=NAND4(NOT(YC),NOT(YB),YA,YD); /* Limit counter to 9 */

Figure 69: Photomicrograph of the 24-hour clock.

B.2. Towers of Hanoi

Towers of Hanoi is a fairly simple game involving three pegs and several disks of different size with a center hole. Initially the disks are sorted according to their size and reside on the leftmost peg. The largest disk is found on the bottom. It is the objective of this game to move all disks to the rightmost peg. In the process individual disks are moved from peg to peg with the restriction that no larger disk is placed on a smaller one.

This chip implements a functional expression that solves the Towers of Hanoi problem for four disks. The design allows for cascading several chips to increase the number of disks in multiples of four. Michael Kaelbling wrote the descriptions for logic and frame and created a complete layout with the translator- and frame generator-tool in little more than a week.
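For reference, the game itself can be solved without recursion, which is the kind of formulation that maps naturally onto clocked logic. The sketch below uses one well-known iterative scheme (the smallest disk moves cyclically on every other step); it is an illustration only and is not derived from the chip's functional description. The function name `hanoi_moves` and the peg numbering 0 (leftmost) to 2 (rightmost) are assumptions.

```python
def hanoi_moves(n):
    """Iteratively solve Towers of Hanoi for n disks.

    Returns (moves, pegs), where moves is a list of (disk, src, dst)
    and pegs is the final peg contents, bottom disk first.
    """
    pegs = [list(range(n, 0, -1)), [], []]
    # Direction of the smallest disk, chosen so the tower ends on peg 2.
    step = 1 if n % 2 == 0 else -1
    small = 0                      # peg currently holding disk 1
    moves = []
    for k in range(2 ** n - 1):
        if k % 2 == 0:
            # Even steps: move the smallest disk one peg in its fixed direction.
            dst = (small + step) % 3
            pegs[dst].append(pegs[small].pop())
            moves.append((1, small, dst))
            small = dst
        else:
            # Odd steps: make the only legal move that leaves disk 1 alone.
            a, b = [p for p in range(3) if p != small]
            if not pegs[a] or (pegs[b] and pegs[b][-1] < pegs[a][-1]):
                a, b = b, a
            pegs[b].append(pegs[a].pop())
            moves.append((pegs[b][-1], a, b))
    return moves, pegs


moves, pegs = hanoi_moves(4)
print(len(moves), pegs)
```

For four disks the sketch produces 2^4 - 1 = 15 moves, and every move places a disk either on an empty peg or on a larger disk.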

Thu D*c 10 14:31:94 1987 1

*MAIN:
INPUTS:PR,CD,CK,N0,N1,N2,N3,NB,OB,RS,TB,ZB;
OUTPUTS:A0,B0,C0,A1,B1,C1,A2,B2,C2,A3,B3,C3,OD,T0,Z,D0;
MAIN=NOT(Z);
RN=NOT(RS);
Z=AND(ZB,AND(NOT(XOR ,NAND(RN,NAND(NAND(I0,NOT(A0)),NAND(NOT(I0),A0)))),CK);
J0=AND(CB,R0);
B0=DFF(AND(RN,NAND(NAND(J0,NOT(B0)),NAND(NOT(J0),B0))),CK);
K0=AND(CC,R0);
C0=DFF(AND(RN,NAND(NAND(K0,NOT(C0)),NAND(NOT(K0),C0))),CK);
T1=NAND(NAND(A1,CA),AND(NAND(B1,CB),NAND(C1,CC)));
R1=AND(NOT(T2),T1);
I1=AND(CA,R1);
D1=NAND(NOT(D2),NOT(N1));
A1=DFF(NAND(NAND(RS,D1),NAND(RN,NAND(NAND(I1,NOT(A1)),NAND(NOT(I1),A1)))),CK);
J1=AND(CB,R1);
B1=DFF(AND(RN,NAND(NAND(J1,NOT(B1)),NAND(NOT(J1),B1))),CK);
K1=AND(CC,R1);
C1=DFF(AND(RN,NAND(NAND(K1,NOT(C1)),NAND(NOT(K1),C1))),CK);
T2=NAND(NAND )),CK);
T3=NAND(NAND(A3,CA),AND(NAND(B3,CB),NAND(C3,CC)));
R3=AND(NOT(TB),T3);
I3=AND(CA,R3);
D3=NAND(NOT(NB),NOT(N3));
A3=DFF(NAND(NAND ,NAND(NOT(N3),AND(NOT(N1),N0)))));
EV=NOT(OD);
CA=DFF(NAND(RN,AND(NAND(EV,CC),NAND(OD,CB))),CK);
CB=DFF(NAND(NAND(RS,OD),NAND(RN,NAND(NAND(EV,CA),NAND(OD,CC)))),CK);
CC=DFF(NAND(NAND(RS,EV),NAND(RN,NAND(NAND(EV,CB),NAND(OD,CA)))),CK);
END.

Figure 70: Photomicrograph of the Towers of Hanoi chip.

Appendix C

CELL CATALOG

This appendix describes all sCMOS standard-cells currently available and gives a definition of the color patterns used in the layouts. The section also contains sample parametric test results of a recent three micron run for the "Towers of Hanoi" chip by a MOSIS affiliate. SPICE-decks for critical paths are given instead of transient charts to allow for accurate SPICE analysis for any desired scaling factor or manufacturer.
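Rescaling the transistor geometry of a deck is a purely textual operation on the MOSFET lines. The following is a minimal sketch, assuming dimensions are written as `L=3.0U W=6.0U` (the `=` form and the `U` suffix for microns are assumptions about the deck syntax); the helper name `scale_mos_line` is hypothetical. It scales geometry only; extracted resistances and capacitances would have to be re-derived for the target process.

```python
import re

# Matches L=<number>U or W=<number>U (case-insensitive), e.g. "L=3.0U"
_DIM = re.compile(r"\b([LW])=([0-9.]+)U\b", re.IGNORECASE)


def scale_mos_line(line, factor):
    """Return `line` with every L= and W= dimension multiplied by `factor`."""
    def repl(match):
        scaled = float(match.group(2)) * factor
        return f"{match.group(1)}={scaled:.2f}U"
    return _DIM.sub(repl, line)


print(scale_mos_line("M1 6 5 1 4 CMOSP L=3.0U W=6.0U", 0.5))
```

Lines without L= or W= fields, such as resistor and capacitor entries, pass through unchanged.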


3 Micron sCMOS NOT-Cell Date: 2/5/88 not.mag Cell Family NOT(A) Page 1 of 1

Inverter cell.

Description: This sCMOS-cell implements a NOT-function, with a 0.05nsec worst-case delay from the low-driven input to the high output signal (CRYSTAL).

Truth Table:

In  Out
0   1
1   0

[Figures: Logic Symbol, Cell Layout, Circuit Schematic]

SPICE-Deck created from not.sim:

M1 6 5 1 4 CMOSP L=3.0U W=6.0U
M2 6 5 0 7 CMOSN L=3.0U W=4.5U

Critical path analysis:

R1 4 5 1437
C1 5 0 .00001pf
M1 4 7 1 1 p l=3.0u w=6.0u
M2 4 7 0 0 n l=3.0u w=4.5u

3 Micron sCMOS AND-Cell Date: 2/5/88 and.mag Cell Family AND(A,B) Page 1 of 1

Two-input positive AND cell.

Description: This sCMOS-cell implements a two-input AND-function, with a 0.63nsec worst-case delay from either low-driven input to the low output signal (CRYSTAL).

Truth Table:

A  B  Y
0  0  0
0  1  0
1  0  0
1  1  1

[Figures: Logic Symbol, Cell Layout, Circuit Schematic]

SPICE-Deck created from and.sim:

M1 6 5 1 4 CMOSP L=3.0U W=4.5U
M2 6 7 1 4 CMOSP L=3.0U W=4.5U
M3 9 7 6 8 CMOSN L=3.0U W=4.5U
M4 0 5 9 8 CMOSN L=3.0U W=4.5U
M5 0 6 10 8 CMOSN L=3.0U W=4.5U
M6 1 6 10 4 CMOSP L=3.0U W=6.0U

Critical path analysis:

R1 4 5 1718
C1 5 0 .00001pf
R2 6 7 2337
C2 7 0 0.001pf
M1 0 7 4 0 n l=3.0u w=4.5u
M2 1 7 4 1 p l=3.0u w=6.0u
M3 6 11 1 1 p l=3.0u w=4.5u
M4 0 1 9 0 n l=3.0u w=4.5u
M5 9 11 6 0 n l=3.0u w=4.5u

3 Micron sCMOS NAND-Cell Date: 2/5/88 nand.mag Cell Family NAND(A,B) Page 1 of 1

Two-input positive NAND cell.

Description: This sCMOS-cell implements a two-input NAND-function, with a 0.09nsec worst-case delay from either high-driven input to the low output signal (CRYSTAL).

Truth Table:

A  B  Y
0  0  1
0  1  1
1  0  1
1  1  0

[Figures: Logic Symbol, Cell Layout, Circuit Schematic]

SPICE-Deck created from nand.sim:

M1 6 5 1 4 CMOSP L=3.0U W=4.5U
M2 6 7 1 4 CMOSP L=3.0U W=4.5U
M3 9 7 6 8 CMOSN L=3.0U W=4.5U
M4 0 5 9 8 CMOSN L=3.0U W=4.5U
C5 10 0 111.0F

Critical path analysis:

R1 4 5 1960
C1 5 0 0.001pf
M1 4 9 1 1 p l=3.0u w=4.5u
M2 0 1 7 0 n l=3.0u w=4.5u
M3 7 9 4 0 n l=3.0u w=4.5u

3 Micron sCMOS EXOR-Cell Date: 2/5/88 exor.mag Cell Family EXOR(A,B) Page 1 of 3

Two-input positive EXOR cell.

Description: This sCMOS-cell is a multifunction cell that primarily implements a two-input EXOR function with the logic symbol given below. The logic diagram shows the other functions that can be obtained from the cell with slight modifications (additional or moved contacts).

Truth Table:

A  B  Y
0  0  0
0  1  1
1  0  1
1  1  0

[Figures: Logic Symbol, Cell Layout, Circuit Schematic, Logic Diagram]

SPICE-Deck created from exor.sim:

M1 6 5 1 4 CMOSP L=3.0U W=6.0U
M2 8 7 6 4 CMOSP L=3.0U W=6.0U
M3 9 8 1 4 CMOSP L=3.0U W=4.5U
M4 11 10 9 4 CMOSP L=3.0U W=4.5U
M5 8 5 0 12 CMOSN L=3.0U W=6.0U
M6 0 7 8 12 CMOSN L=3.0U W=6.0U
M7 11 8 0 12 CMOSN L=3.0U W=6.0U
M8 0 10 11 12 CMOSN L=3.0U W=6.0U
M9 10 13 0 12 CMOSN L=3.0U W=4.5U
M10 14 5 0 12 CMOSN L=3.0U W=4.5U
M11 13 7 14 12 CMOSN L=3.0U W=4.5U
M12 13 5 1 4 CMOSP L=3.0U W=4.5U
M13 1 7 13 4 CMOSP L=3.0U W=4.5U
M14 10 13 1 4 CMOSP L=3.0U W=6.0U
C15 1 0 319.0F

Critical path analysis:

R1 4 5 2268
C1 5 0 0.002pf
R2 10 11 160
R3 8 9 1982
C2 9 0 0.002pf
R4 6 7 120
M1 4 9 0 0 n l=3.0u w=6.0u
M2 6 9 1 1 p l=3.0u w=4.5u
M3 4 0 7 1 p l=3.0u w=4.5u
M4 8 0 11 1 p l=3.0u w=6.0u
M5 10 13 1 1 p l=3.0u w=6.0u
M6 8 13 0 0 n l=3.0u w=6.0u
* Initial conditions:
.ic v(4)=5.000000
.ic v(5)=5.000000
.ic v(10)=0.000000
.ic v(11)=0.000000
.ic v(8)=0.000000
.ic v(9)=0.000000
.ic v(6)=5.000000
.ic v(7)=5.000000
vdd 1 0 5.0
vin 13 0 pulse(5 0 0ns 0ns 0ns)
.tran 0.02ns 4ns
.plot tran V(5) (0,5)
.end

3 Micron sCMOS DFF-Cell Date: 2/5/88 dff.mag Cell Family DFF(Din,Clk) Page 1 of 3

Edge-triggered D flip-flop structure

Description: This sCMOS-cell implements a rising edge triggered master-slave D flip-flop. The delay through the structure is maximally SMnsec for a low driven input at 1.00nsec, and a positive clock edge at 2.00nsec.

[Figures: Cell Layout, Logic Symbol, Transition Diagram, Logic Diagram]

SPICE-Deck created from dff.sim:

M1 1 6 5 4 CMOSP L=3.0U W=4.5U
M2 1 8 7 4 CMOSP L=3.0U W=6.0U
M3 10 9 1 4 CMOSP L=3.0U W=6.0U
M4 5 6 0 11 CMOSN L=3.0U W=6.0U
M5 8 12 5 11 CMOSN L=3.0U W=4.5U
M6 9 13 7 11 CMOSN L=3.0U W=6.0U
M7 7 8 0 11 CMOSN L=3.0U W=6.0U
M8 10 9 0 11 CMOSN L=3.0U W=4.5U
M9 14 12 9 11 CMOSN L=3.0U W=4.5U
M10 12 13 0 11 CMOSN L=3.0U W=4.5U
M11 15 13 8 11 CMOSN L=3.0U W=6.0U
M12 0 7 15 11 CMOSN L=3.0U W=4.5U
M13 1 13 12 4 CMOSP L=3.0U W=4.5U
M14 0 10 14 11 CMOSN L=3.0U W=4.5U
M15 1 7 15 4 CMOSP L=3.0U W=4.5U
M16 14 10 1 4 CMOSP L=3.0U W=4.5U
C17 1 0 414.0F

Critical path analysis:

* Nodes 4-5 correspond to 3_54_120 (see gate at 25,39)
* Nodes 6-7 correspond to 3^0~56 (see drain at 25,21)
* Nodes 8-9 correspond to 3_70_104 (see gate at 31,8)
* Nodes 10-11 correspond to q (see gate at 40,39)
* Nodes 12-13 correspond to 3_102_18 (see drain at 41,28)
* Nodes 14-15 correspond to 3238_56 (see gate at 41,28)
* Nodes 16-17 correspond to 1 (see gate at 19,9)

R1 14 15 2189
C1 15 0 0.008pf
R2 6 7 1894
R3 10 11 2292
C2 11 0 0.018pf
R4 4 5 2004
C3 5 0 0.022pf
R5 12 13 1341
C4 13 0 0.001pf
R6 8 9 2733
C5 9 0 0.003pf
M1 7 17 4 0 n l=3.0u w=6.0u
M2 0 9 6 0 n l=3.0u w=4.5u
M3 1 9 6 1 p l=3.0u w=4.5u
M4 11 17 8 0 n l=3.0u w=6.0u
M5 12 0 1 1 p l=3.0u w=4.5u
M6 13 15 10 0 n l=3.0u w=4.5u
M7 1 17 14 1 p l=3.0u w=4.5u
M8 14 17 0 0 n l=3.0u w=4.5u

* Initial conditions:
.ic v(14)=0.000000
.ic v(15)=0.000000
.ic v(6)=5.000000
.ic v(7)=5.000000
.ic v(10)=0.000000
.ic v(11)=0.000000
.ic v(4)=5.000000
.ic v(5)=5.000000
.ic v(12)=5.000000
.ic v(13)=5.000000
.ic v(8)=0.000000
.ic v(9)=0.000000

vdd 1 0 5.0
vin 17 0 pulse(5 0 0ns 0ns 0ns)
.tran 0.05ns 10ns
.plot tran V(5) (0,5)
.end

3 Micron sCMOS DENA-Cell Date: 2/5/88 dena.mag Cell Family DENA(Din,Ena,Clk) Page 1 of 3

Edge-triggered D flip-flop structure with enable.

Description: This sCMOS-cell implements a rising edge triggered master-slave D flip-flop with enable. The cell is basically identical to the DFF(Din,Clk)-cell. It is enhanced with a multiplexer to select either the current output or the input to the cell as input to the embedded master-slave flip-flop.

[Figures: Cell Layout, Logic Symbol, Transition Diagram, Logic Diagram]

SPICE-Deck created from dena.sim:

M1 7 6 5 4 CMOSP L=3.0U W=6.0U
M2 1 7 8 4 CMOSP L=3.0U W=4.5U
M3 10 9 1 4 CMOSP L=3.0U W=6.0U
M4 12 11 1 4 CMOSP L=3.0U W=6.0U
M5 9 14 8 13 CMOSN L=3.0U W=4.5U
M6 15 7 8 13 CMOSN L=3.0U W=4.5U
M7 11 6 7 13 CMOSN L=3.0U W=4.5U
M8 11 16 10 13 CMOSN L=3.0U W=6.0U
M9 10 9 15 13 CMOSN L=3.0U W=6.0U
M10 12 11 15 13 CMOSN L=3.0U W=6.0U
M11 17 14 11 13 CMOSN L=3.0U W=4.5U
M12 14 16 15 13 CMOSN L=3.0U W=4.5U
M13 18 16 9 13 CMOSN L=3.0U W=4.5U
M14 15 10 18 13 CMOSN L=3.0U W=4.5U
M15 1 16 14 4 CMOSP L=3.0U W=4.5U
M16 15 12 17 13 CMOSN L=3.0U W=4.5U
M17 1 10 18 4 CMOSP L=3.0U W=4.5U
M18 17 12 1 4 CMOSP L=3.0U W=4.5U
C19 1 0 315.0F
C20 15 0 355.0F

Critical path analysis:

* Nodes 4-5 correspond to 3_24_110 (see gate at 15,43)
* Nodes 6-7 correspond to q (see gate at 41,39)
* Nodes 8-9 correspond to 3_74_104 (see gate at 32,8)
* Nodes 10-11 correspond to 3j? 50 (see drain at 15,43)
* Nodes 12-13 correspond to 3_74_58 (see drain at 26,22)
* Nodes 14-15 correspond to 3_40_118 (see source at 20,57)
* Nodes 16-17 correspond to 3_58_120 (see gate at 26,39)
* Nodes 18-19 correspond to 2 (see gate at 20,9)

R1 8 9 2504
C1 9 0 0.009pf
R2 10 11 6066
C2 11 0 0.360pf
R3 12 13 1349
C3 13 0 0.001pf
R4 14 15 1325
C4 15 0 0.001pf
R5 6 7 2718
C5 7 0 0.022pf
R6 4 5 1933
C6 5 0 0.009pf
R7 16 17 1871
C7 17 0 0.010pf
M1 7 1 4 0 n l=3.0u w=4.5u
M2 6 19 9 0 n l=3.0u w=6.0u
M3 8 17 11 0 n l=3.0u w=6.0u
M4 10 9 13 0 n l=3.0u w=4.5u
M5 1 5 14 1 p l=3.0u w=4.5u
M6 16 1 15 0 n l=3.0u w=4.5u
M7 12 19 17 0 n l=3.0u w=4.5u

* Initial conditions:
.ic v(8)=0.000000
.ic v(9)=0.000000
.ic v(10)=0.000000
.ic v(11)=0.000000
.ic v(12)=0.000000
.ic v(13)=0.000000
.ic v(14)=5.000000
.ic v(15)=5.000000
.ic v(6)=0.000000
.ic v(7)=0.000000
.ic v(4)=0.000000
.ic v(5)=0.000000
.ic v(16)=5.000000
.ic v(17)=5.000000

vdd 1 0 5.0
vin 19 0 pulse(0 5 0ns 0ns 0ns)
.tran 0.50ns 100ns
.plot tran V(5) (0,5)
.end

3 Micron sCMOS PDFF-Cell Date: 2/5/88 pdff.mag Cell Family PDFF(Din,Phi1,Phi2) Page 1 of 3

Two-phase clocked D flip-flop structure.

Description: This sCMOS-cell implements a two-phase clocked master-slave D flip-flop. The circuit is identical to DFF(Din,CLK) with the exception of the inverter separating the clock phases.

[Figures: Cell Layout, Logic Symbol, Transition Diagram, Logic Diagram]

SPICE-Deck created from pdff.sim:

M1 6 5 1 4 CMOSP L=3.0U W=4.5U
M2 6 5 8 7 CMOSN L=3.0U W=6.0U
M3 1 10 9 4 CMOSP L=3.0U W=6.0U
M4 12 11 1 4 CMOSP L=3.0U W=6.0U
M5 10 13 6 7 CMOSN L=3.0U W=4.5U
M6 11 14 9 7 CMOSN L=3.0U W=6.0U
M7 9 10 8 7 CMOSN L=3.0U W=6.0U
M8 12 11 0 7 CMOSN L=3.0U W=4.5U
M9 15 13 11 7 CMOSN L=3.0U W=4.5U
M10 16 14 10 7 CMOSN L=3.0U W=6.0U
M11 0 9 16 7 CMOSN L=3.0U W=4.5U
M12 0 12 15 7 CMOSN L=3.0U W=4.5U
M13 1 9 16 4 CMOSP L=3.0U W=4.5U
M14 15 12 1 4 CMOSP L=3.0U W=4.5U
C15 1 0 284.0F
C16 8 0 187.0F

Critical path analysis:

* Nodes 4-5 correspond to 3_102_18 (see drain at 41,29)
* Nodes 6-7 correspond to 3_98_7 (see gate at 37,7)
* Nodes 8-9 correspond to q (see gate at 40,39)
* Nodes 10-11 correspond to 3_70_104 (see gate at 31,9)
* Nodes 12-13 correspond to 3~0_

R1 18 19 1268
C1 19 0 0.001pf
R2 12 13 1745
C2 13 0 0.187pf
R3 14 15 1894
C3 15 0 0.001pf
R4 16 17 2004
C4 17 0 0.010pf
R5 8 9 2292
C5 9 0 0.001pf
R6 4 5 1341
C6 5 0 0.001pf
R7 10 11 2733
C7 11 0 0.009pf
R8 6 7 2455
M1 4 7 1 1 p l=3.0u w=4.5u
M2 0 7 4 0 n l=3.0u w=4.5u
M3 6 9 0 0 n l=3.0u w=4.5u
M4 6 9 1 1 p l=3.0u w=6.0u
M5 8 1 11 0 n l=3.0u w=6.0u
M6 10 17 13 0 n l=3.0u w=6.0u
M7 1 11 14 1 p l=3.0u w=4.5u
M8 15 1 16 0 n l=3.0u w=6.0u
M9 17 1 18 0 n l=3.0u w=4.5u
M10 19 21 12 0 n l=3.0u w=6.0u

* Initial conditions:
.ic v(18)=5.000000
.ic v(19)=5.000000
.ic v(12)=0.000000
.ic v(13)=0.000000
.ic v(14)=5.000000
.ic v(15)=5.000000
.ic v(16)=5.000000
.ic v(17)=5.000000
.ic v(8)=0.000000
.ic v(9)=0.000000
.ic v(4)=0.000000
.ic v(5)=0.000000
.ic v(10)=0.000000
.ic v(11)=0.000000
.ic v(6)=5.000000
.ic v(7)=5.000000

vdd 1 0 5.0
vin 21 0 pulse(0 5 0ns 0ns 0ns)
.tran 0.50ns 100ns
.plot tran V(5) (0,5)
.end

3 Micron sCMOS LATCH-Cell Date: 2/5/88 latch.mag Cell Family LATCH(Din,Phi1,Phi2) Page 1 of 3

Two-phase clocked static latch.

Description: This sCMOS-cell implements a two-phase clocked static latch for signal synchronization. The structure can be cleared with equivalent signals on the clock lines. Phi1 and Phi2 are normally not identical to the system clock. An input signal must be present at least 0.33nsec prior to a phi-transition. The input must be stable for at least 0.80nsec.

[Figures: Cell Layout, Truth Table, Logic Diagram]

SPICE-Deck created from latch.sim:

M1 6 5 1 4 CMOSP L=3.0U W=6.0U
M2 8 7 6 4 CMOSP L=3.0U W=6.0U
M3 9 8 1 4 CMOSP L=3.0U W=6.0U
M4 0 8 9 10 CMOSN L=3.0U W=4.5U
M5 11 9 0 10 CMOSN L=3.0U W=4.5U
M6 8 7 11 10 CMOSN L=3.0U W=4.5U
M7 12 5 0 10 CMOSN L=3.0U W=6.0U
M8 8 13 12 10 CMOSN L=3.0U W=6.0U
M9 14 9 1 4 CMOSP L=3.0U W=6.0U
M10 8 13 14 4 CMOSP L=3.0U W=6.0U
C11 1 0 158.0F
C12 8 0 103.0F

Critical path analysis:

* Nodes 4-5 correspond to q (see gate at 13,6)
* Nodes 6-7 correspond to 3_34_90 (see gate at 13,35)
* Nodes 8-9 correspond to 3_38_16 (see drain at 13,6)
* Nodes 10-11 correspond to 3_22_46 (see drain at 7,17)
* Nodes 12-13 correspond to 1 (see gate at 25,6)

R1 8 9 2016
R2 10 11 1043
R3 6 7 3623
C1 7 0 0.106pf
R4 4 5 3479
C2 5 0 0.008pf
M1 0 7 4 0 n l=3.0u w=4.5u
M2 4 7 1 1 p l=3.0u w=6.0u
M3 8 5 1 1 p l=3.0u w=6.0u
M4 6 13 9 1 p l=3.0u w=6.0u
M5 10 1 0 0 n l=3.0u w=6.0u
M6 6 13 11 0 n l=3.0u w=6.0u

* Initial conditions:
.ic v(8)=5.000000
.ic v(9)=5.000000
.ic v(10)=0.000000
.ic v(11)=0.000000
.ic v(6)=0.000000
.ic v(7)=0.000000
.ic v(4)=5.000000
.ic v(5)=5.000000

vdd 1 0 5.0
vin 13 0 pulse(5 0 0ns 0ns 0ns)
.tran 0.05ns 10ns
.plot tran V(5) (0,5)
.end

3 Micron sCMOS DOG-Cell Date: 2/5/88 dog.mag Cell Family Page 1 of 1

Intercell-connection support cell

Description: This sCMOS-cell contains no active logic, and merely serves as connection between adjacent cells for result propagation.

[Figure: Cell Layout]

3 Micron sCMOS TIEOFF-Cell Date: 2/5/88 ticoff.mag Cell Family TIEOFF0 Page 1 of 1

Reference low-voltagc cell for ESD-frcc input termination.

Description; ThcTIEOFFO sCMOS standard cell provides an BSD (electro static discharge) free low- level signal for unused input tcrmlation or constant voltage source. The cell is required for the connection of internal ceil inputs to logic level LOW1. A direct connection of an input directly to GND! is not recommended to avoid exposure of input gates to electrostatic discharge and unbuffered noise on the power lines. TTEOFFO can drive up to 75 static input loads. Cell Layout:

Logic Diagram: [figure not recoverable from the scan]

Circuit Schematic: [figure not recoverable from the scan; transistor labels M1 to M6 are visible]

SPICE-Deck created from tieoff.sim:
M1 1 6 5 4 CMOSP L=3.0U W=9.0U
M2 5 5 1 4 CMOSP L=3.0U W=9.0U
M3 8 6 0 7 CMOSN L=3.0U W=9.0U
M4 5 5 8 7 CMOSN L=3.0U W=10.5U
M5 0 5 6 7 CMOSN L=3.0U W=10.5U
M6 1 5 6 4 CMOSP L=3.0U W=13.5U
C7 1 0 311.0F
C8 5 0 144.0F
C9 6 0 101.0F

3 Micron sCMOS Cell Family | Sheet Patterns | Date: 2/5/88 | sheets.mag | Page 1 of 2

Pattern definitions for different layers.

Description: This page contains all patterns used by the design system for sCMOS-layer visualization.

[Pattern swatches not reproduced. Layers shown: polysilicon, polycontact, ndiffusion, ndcontact, metal-1, pdiffusion, m2contact, metal-2, nfet, pwell, pfet.]

3 Micron sCMOS Cell Family | Sheet Patterns | Date: 2/5/88 | sheets.mag | Page 2 of 2

Pattern definitions for different layers.

Description: This page contains all patterns used by the design system for sCMOS-layer visualization.

[Pattern swatches not reproduced. Layers shown: psubstratepc, nsubstratenc, pohmic, nohmic.]

3 Micron sCMOS Cell Family | MOSIS Parameters | Date: 2/5/88 | Page 1 of 2

MOSIS parametric test results. Copy from run M78W - "Towers of Hanoi Chip"

MOSIS PARAMETRIC TEST RESULTS

RUN: M78W / WEI-CKENC        VENDOR: UTMC
TECHNOLOGY: SCP              FEATURE SIZE: 3.0um

I. INTRODUCTION. This report contains the lot average results obtained by MOSIS from measurements of the MOSIS test structures on the selected wafers of this fabrication lot. The SPICE and/or BSIM parameters obtained from similar measurements on these wafers are also attached.

COMMENTS: This looks like a typical UTMC (United Technologies) 3um run.

PARAMETERS:         W/L      N-CHANNEL   P-CHANNEL   UNITS
Vth (Vds=.05V)      4.5/3    .855        -.907       V
Vth

3 Micron sCMOS Cell Family | MOSIS Parameters | Date: 2/5/88 | Page 2 of 2

MOSIS parametric test results. Copy from run M78W - "Towers of Hanoi Chip"

V. CAPACITANCE
PARAMETERS:                    POLY    N DIFF   P DIFF   METAL 1   METAL 2   UNITS
Area Cap (Layer to subs)       .059    .384     .141     .028      .020      fF/um**2
Area Cap (Layer to Poly)       -----   .629     .460     .029      .038      fF/um**2
Area Cap (Layer to Metal1)     -----   -----    -----    -----     .028      fF/um**2
Fringe Cap (Layer to subs)     -----   .577     .458     -----     -----     fF/um
Edge Cap                       .013    -----    -----    .008      .117      fF/um
COMMENTS: These parameters look normal.
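The capacitance figures above can be turned into a rough node-capacitance estimate. The sketch below is an illustration only, not part of the MOSIS report: it assumes the N-diffusion values read from the table (.384 fF/um**2 area, .577 fF/um fringe, both to substrate) and models a rectangular diffusion region as area capacitance plus fringe capacitance along the full perimeter. The function name and the 6 um x 6 um example region are hypothetical.

```python
# Rough diffusion-to-substrate capacitance estimate.
# Assumption: values taken from the N DIFF column of the capacitance table;
# the column assignment is itself read from a damaged scan.
AREA_CAP_NDIFF = 0.384    # fF/um**2, layer to substrate
FRINGE_CAP_NDIFF = 0.577  # fF/um, layer to substrate


def diffusion_cap_ff(width_um, length_um,
                     area_cap=AREA_CAP_NDIFF, fringe_cap=FRINGE_CAP_NDIFF):
    """Return area plus perimeter capacitance of a rectangle, in fF."""
    area = width_um * length_um                   # um**2
    perimeter = 2.0 * (width_um + length_um)      # um
    return area_cap * area + fringe_cap * perimeter


if __name__ == "__main__":
    # Hypothetical 6 um x 6 um drain region.
    print(f"{diffusion_cap_ff(6.0, 6.0):.3f} fF")
```

Counting the full perimeter overestimates the sidewall term for a drain that abuts a gate; subtract the gate-side edge if a tighter estimate is needed.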

VI. CIRCUIT PARAMETERS:
Vinv, K = 1                  2.18 V
Vinv, K = 1.5                2.41 V
Vlow, K = 2.0                0.00 V
Vhigh, K = 2.0               5.00 V
Vinv, K = 2.0                2.54 V
Gain, K = 2.0                -11.53
Ring Oscillator Frequency    17.70 MHz
COMMENTS: These parameters look normal.

M78W SPICE PARAMETERS
.MODEL CMOSN NMOS LEVEL=2 LD=0.522076U TOX=420.00E-10
+ NSUB=1.16095E+16 VTO=0.805802 KP=4.27978E-05 GAMMA=0.75506
+ PHI=0.6 UO=520.543 UEXP=0.204175 UCRIT=160717
+ DELTA=1.99933 VMAX=69007.7 XJ=0.400000U LAMBDA=0.0134721
+ NFS=9.99333E+11 NEFF=1.001 NSS=1E+12 TPG=1.000000
+ RSH=15.58 CGDO=4.29221E-10 CGSO=4.29221E-10 CGBO=6.537E-10
+ CJ=0.0003982 MJ=0.473000 CJSW=6.301E-10 MJSW=0.291100 PB=0.700000
* Weff = Wdrawn - Delta_W
* The suggested Delta_W = 0.7951 um
.MODEL CMOSP PMOS LEVEL=2 LD=0.810200U TOX=420.00E-10
+ NSUB=2.3139E+15 VTO=-0.864799 KP=1.805E-05 GAMMA=0.3361
+ PHI=0.6 UO=216 UEXP=0.292069 UCRIT=29714.8
+ DELTA=0.410038 VMAX=16340.9 XJ=0.400000U LAMBDA=0.0830097
+ NFS=1E+11 NEFF=0.01001 NSS=1E+12 TPG=-1.000000
+ RSH=33.36 CGDO=6.661E-10 CGSO=6.661E-10 CGBO=4.704E-10
+ CJ=0.0001313 MJ=0.447100 CJSW=4.729E-10 MJSW=0.256000 PB=0.750000
* Weff = Wdrawn - Delta_W
* The suggested Delta_W = 0.5722 um
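The width correction quoted with the SPICE parameters, Weff = Wdrawn - Delta_W, can be applied to the drawn transistor widths used in the cell decks. The following is a minimal sketch under stated assumptions: the Delta_W values are the suggested ones from run M78W, and the function and dictionary names are hypothetical, not taken from the design system.

```python
# Effective channel width from drawn width: Weff = Wdrawn - Delta_W.
# Delta_W values are the ones suggested in the M78W parameter listing (in um).
DELTA_W_UM = {
    "CMOSN": 0.7951,
    "CMOSP": 0.5722,
}


def effective_width_um(model, drawn_width_um):
    """Return the effective channel width in um for the given model name."""
    weff = drawn_width_um - DELTA_W_UM[model]
    if weff <= 0.0:
        raise ValueError("drawn width is not larger than Delta_W")
    return weff


if __name__ == "__main__":
    # Drawn widths of 4.5 um (n-channel) and 6.0 um (p-channel) appear in the cell decks.
    print(f"CMOSN 4.5 um drawn: {effective_width_um('CMOSN', 4.5):.4f} um effective")
    print(f"CMOSP 6.0 um drawn: {effective_width_um('CMOSP', 6.0):.4f} um effective")
```

For the narrow 4.5 um n-channel devices the correction removes roughly 18 percent of the drawn width, so it matters when comparing hand estimates with simulation.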
