<<

2009 19th IEEE International Symposium on Computer Arithmetic

Datapath Synthesis for Standard-Cell Design

Reto Zimmermann DesignWare, Solutions Group Synopsys Switzerland LLC, 8050 Zurich, Switzerland [email protected]

blocks consist of a collection of SOPs, shifters and se- Abstract lectors. An SOP furthermore is composed  just like a simple multiplier  of partial-product generation Datapath synthesis for standard-cell design goes (PPG), carry-save addition (CSA, Wallace tree) and through extraction of arithmetic operations from RTL carry-propagate addition (CPA). All extracted opera- code, high-level arithmetic optimizations and netlist tions can handle inputs and outputs in carry-save num- generation. Numerous architectures and optimization ber representation, which allows to keep internal results strategies exist that result in circuit implementations in carry-save and therefore to improve performance by with very different performance characteristics. This eliminating internal carry propagations [1]. work summarizes the circuit architectures and tech- niques used in a commercial synthesis tool to optimize 3. Datapath Optimization cell-based datapath netlists for timing, area and power. After datapath extraction high-level arithmetic op- 1. Introduction timizations are carried out, such as common sub- expression sharing or unsharing, sharing of mutually Design Compiler is the Synopsys platform to syn- exclusive operations, comparator sharing, constant thesize RTL code into a standard-cell netlist. Arithmet- folding and other arithmetic simplifications. Sum-of- ic operations are first extracted into datapath blocks, products are converted to product-of-sums where bene- undergo high-level arithmetic optimizations and are ficial (A*B+A*C  A*(B+C)). Number representa- finally mapped to gates through dedicated datapath tions for intermediate results are selected among bi- generators. This paper highlights the circuit architec- nary, carry-save and partial-product based on the de- tures and optimization techniques used to generate op- sign constraints and optimization goals. Generally bi- timized netlists for a given design context. Special nary representation is used for area optimization and emphasis is given to the relatively new area of power carry-save for delay optimization. optimization. The paper does not go into any technical details, but rather gives an overview of what mostly 4. Datapath Generation known and published architectures and techniques are successfully applied in our cell-based synthesis flow. Flexible context-driven generators are used to im- plement the gate-level netlist for a datapath under a 2. Datapath Extraction given design context. This context includes input ar- rival times, output timing constraints, area and power Arithmetic operations  such as addition, multipli- constraints, operating conditions and cell library cha- cation, comparison, shift, as well as selection  are first racteristics. extracted from the RTL code into datapath blocks. Thereby largest possible clusters are formed by includ- 4.1. Partial-Product Generation ing all connected operations into one single datapath block. This allows the subsequent high-level optimiza- Partial products have to be generated for multiplica- tions and number representation selection to be most tion operations. The following generators are included: effective. 1. Constant multiplication: CSD (canonic signed- The most generic arithmetic operation that is ex- digit) encoding of constant input to minimize the tracted is the sum-of-products (SOP), which also covers number of partial products that need to be summed simple addition, subtraction, increment, multiplier and up. even comparison. Therefore the extracted datapath 2. Binary multiplication:

1063-6889/09 $25.00 © 2009 IEEE 207 DOI 10.1109/ARITH.2009.28

a. Brown (unsigned), Baugh-Wooley (signed): imum fanout (= net loads). The following schemes are Best architectures for small word lengths (8- used to optimize the for a given context: 16 bits). 1. Parallel-prefix structure: A prefix graph optimiza- b. Radix-4 Booth: Good area, best delay for tion algorithm is used to generate area-optimized large word lengths. flexible prefix structures (unbounded fanout) for c. Radix-8 Booth: best area for large word given timing constraints [1]. The algorithm is able lengths, but longer delay. to generate a serial-prefix structure for very loose constraints, a fast parallel-prefix structure for tight 3. Special multiplication: constraints, and a mix of both for constraints in a. Square: Implemented as A*A, bit-level gate between. It takes bit-level input arrival times and simplification and equivalent bit optimization output required times into account and therefore before carry-save addition automatically re- adapts the prefix structure to any arbitrary input sults in an optimized squarer structure. and output timing profile, including late arriving b. Blend: Partial-product bits of the blend opera- single bits or U-shaped input profiles of the final tion A*B+~A*C are generated using multip- adder in multipliers. This helps reduce circuit area lexers to reduce the number of product bits. but also delay for the most critical input bits. 4. Carry-Save multiplication: One input is carry- Bounded fanout prefix structures (like Kogge- save, which allows to implement product-of-sums, Stone) can also be generated, but are not opti- like (A+B)*C, without internal carry propagation. mized for non-uniform timing profiles. Delay is reduced at the expense of larger area. 2. Sum bit generation: In the carry-lookahead a. Non-Booth: Carry-save input is converted to scheme the prefix structure calculates (look- delayed-carry representation (through a row of ahead) all the carries, which then compute the sum half-adders), then pairs of mutually exclusive bits. In the carry-select scheme the two possible partial-product bits can be added with a sim- sum bits for a carry-in of 0 and 1 are calculated in ple OR-gate. advance and then selected through a series of mul- tiplexers controlled by the carries from all levels b. Booth: Special Booth encoder for carry-save of the prefix structure. This latter scheme is used input. in carry-select and conditional-sum adders. All of the above architectures prove beneficial un- 3. Prefix signal encoding: The prefix structure can der certain conditions and constraints. More informa- either compute generate/propagate signal pairs tion and references on the individual architectures are using AND-OR gates (optimized into AOI/OAI available in [1]. structures) or carry-in-0/carry-in-1 signal pairs 4.2. Carry-Save Addition using . 4. Prefix radix: The most common parallel-prefix All partial-product bits are added up by reducing adders use a radix-2 prefix structure (i.e., each every column from N inputs bits down to 2 output bits prefix node processes two input signal pairs to (carry-save representation). Column compression is produce one output pair). Radix-3 and radix-4 done using compressor cells, such as half-adders (2:2 prefix adders on the other hand 3 or 4 in- compressor), full-adders (3:2 compressor) and 4:2 put pairs in each prefix node. The resulting prefix compressors. The timing-driven adder tree construc- node logic is more complex but the prefix struc- tion considers input arrival times as well as cell pin-to- ture is shallower and thus potentially faster. The pin delays, resulting in delay-optimal adder trees that high-radix parallel-prefix adders only give better are better than Wallace trees [2]. performance if special multilevel AND-OR-gates (like AOAOI) are available in the library. 4.3. Carry-Propagate Addition 5. Special architectures: Ling adder and spanning- The carry-save output from carry-save addition is tree adder architectures are also supported. converted to binary through a carry-propagate adder. All of the above schemes can be combined almost Since carry propagation is a prefix problem, most adder arbitrarily. This allows the implementation of well- architectures can be categorized as different kinds of known architectures like the Brent-Kung, Ladner- parallel-prefix adders. These adders share a prefix Fisher and Kogge-Stone parallel-prefix adders, carry- structure to propagate carries from lower to higher bits. select adders, conditional-sum adders, Ling adder, but A wide range of prefix structures exists that fulfill dif- also many more mixed alternatives. ferent performance requirements. These differ in terms Most of the described more special adder architec- of depth (= circuit speed), size (= circuit area) and max- tures only give superior performance under rare condi-

208

tions, if at all. The regular Ladner-Fisher parallel- carry-save representation. This optimization, however, prefix adder (unbounded fanout, carry-lookahead currently poses problems in formal verification. scheme, generate/propagate signal encoding, radix-2 prefix) with flexible prefix structure gives best and 5. Low Power Datapath most reliable performance in most cases and therefore is selected most of the time. Datapaths often consume large amounts of dynamic power due to their large circuit size and high switching 4.4. Shifters activity during operation. Therefore our recent focus was to lower dynamic power dissipation in synthesized Shifters are implemented with multiplexers or datapath circuits. AND-OR structures depending on the context. To im- prove timing the levels inside the shifter can be reor- 5.1. Operand Isolation dered based on the delay of the shift control inputs. The special case where a shifter actually implements a One of the most effective way to save dynamic decoder (X=1<

Selectors are as well implemented with multiplexers 5.2. Transition-Probability-Based or AND-OR structures based on context. They can Optimizations have arbitrary number of inputs. Selectors also support carry-save inputs, which allows to move them around In many applications transition probabilities inside a datapath (e.g., from the binary output to an (switching activities) are not evenly distributed among internal carry-save operand) in order to enable addi- all inputs. Lower activities on some input operands or tional sharing. some input bits can be exploited to reduce overall ac- tivity and power. 4.6. Architecture Selection 5.2.1. Transition-Probability-Based Adder Tree During synthesis all applicable architectures de- In applications like signal processing input samples scribed in the previous sections are evaluated and can be correlated and higher bits can have lower transi- costed for timing, area and power. The architecture tion probabilities. The following techniques from the that fulfills the constraints best for a given cell library literature were investigated but found not feasible for is selected. The use of specific architectures can also cell-based synthesis: be manually controlled through dedicated .  Left-to-right multiplier: Adds the MSB partial 4.7. Pipelining products first, which results in lower activity in the entire adder array [3]. However, array multip- Pipelining of datapaths is supported indirectly. All liers are generally much worse in terms of - the pipeline registers must be placed at the inputs or ing activity and dynamic power compared to tree outputs of the block in the RTL code. The datapath is multipliers because of the much longer signal then synthesized as a combinational block with a re- paths and the ripple structure. Since synthesis laxed timing constraint that is proportional to the num- usually generates adder trees, synthesized multip- ber of pipeline stages. Finally the pipeline registers are liers are already more power efficient than any moved to the optimal locations inside the datapath kind of array multiplier except for very extreme through register retiming. activity distributions. 4.8. Carry-Save Registers  Disable MSB logic: If upper bits do not change, the upper portion of a multiplier can be shut off Where a datapath is split by a register (i.e., arith- [4]. Efficient transistor-level circuits exist for dy- metic operations on both sides of a register) synthesis namically detecting signal inactivity, but with can extract the first datapath block with carry-save out- standard-cells the logic is very expensive. Also, put and the second with carry-save input, so that the latches would be needed to store previous values, register between them stores an intermediate result in which is problematic in synthesis.

209

The optimization that was eventually implemented adder tree does not produce much additional glitching, is to balance the adder tree based on input transition but it can propagate incoming glitches all the way probabilities instead of input timing (transition- through due to the non-monotonic nature of the com- probability-driven adder tree construction). There- pressor cells (XOR-based). Thus the adder tree can by low activity inputs enter the tree early and high ac- actually be the largest contributor to glitching power. tivity inputs late. In the extreme case of monotonically Glitching power in non-Booth multipliers can range increasing activities from LSB to MSB a structure simi- from 10% to 30%, depending on multiplier size and lar to the left-to-right multiplier is generated. This ap- delay constraints, while for radix-4 Booth multipliers proach is very flexible and it guarantees an adder tree the range is 15% to 60% and for radix-8 Booth multip- with lowest possible overall activity for arbitrary input liers 40% to 70%. Therefore glitching must be taken activity profiles. into account when selecting an optimal architecture for lowest possible power. 5.2.2. Transition-Probability-Based Operand Selection 5.3.3. Special Cells In Booth multipliers the input that is Booth encoded Complex special cells, such as 4:2 compressors drives extra logic. Power can be saved by Booth- and Booth encoder/selector cells, can reduce glitching encoding the input with lower average activity. This in two ways: a) well balanced path delays can reduce operand selection is done statically during synthesis. glitch generation and b) long inertial delays can filter Approaches to dynamically switch operands based on out incoming glitches. Synthesis makes use of such actual activity have also been proposed but are again cells if available in the library. rejected here due to the too large circuit overhead for A more aggressive way to filter out glitches is to activity detection in cell-based design. use pass-gate or pass-transistor compressor cells. Multiple cells in series act like a low-pass filter, but 5.3. Glitching Reduction their long delay makes then suitable only for low- frequency applications. However, pass-gate logic is Glitching power, caused by spurious transitions, can not compatible with current standard-cell methodolo- be as high as 60% of total dynamic power in datapaths, gies, so this strategy is not considered in synthesis. especially when multipliers are involved. Glitching reduction is therefore considered as the most promising 5.3.4. NAND-Based Multiplier way to reduce dynamic power in datapaths. Brown and Baugh-Wooley multipliers use nm AND-gates to generate the partial-product bits, which 5.3.1. Delay Balancing are implemented with INVERT-NOR structures for Glitches are produced when inputs into gates are better results (nm inverters, nm NOR-gates). An skewed, i.e., an output transition caused by a transition alternative implementation uses nm NAND-gates to on an early input is undone later by a transition on a generate inverted partial-product bits, which can be late input. Glitch generation and propagation is largest added up with a slightly adapted adder tree to generate in non-monotonic gates, like XORs, where each input an inverted carry-save sum. While this architecture transition causes an output transition. Preventing input does not always give better timing or area, depending skews through delay balancing can very effectively on the need for input buffering and output inversion, it reduce the amount of produced glitches. It is often often helps reduce glitching because of the absence (of sufficient to do it in specific glitch-prone locations only possibly unbalanced) input inverters. (like Booth encoding/selection), limiting the negative impact of buffer insertion and gate sizing. However, 5.4. Shifter Optimization effective delay balancing needs to be done in the back- end (after place-and-route) because only then the final Dynamic power in shifters is reduced by replacing delays are known. It is interesting to note that glitching the multiplexers of a conventional design with demul- is usually lower in circuits optimized for speed because tiplexers, as described in [5]. The resulting structure skews are smaller when the structure is as parallel as has lower overall wire load and reduced transition possible and all gates are sized up for optimal speed. probabilities on the long wires, which both help reduce power dissipation. 5.3.2. Architecture Selection Not all architectures are equally prone to glitching. 5.5. Techniques Incompatible with Cell-Based It is observed that in multipliers most glitching origi- Design nates from partial-product generation, especially in Booth multipliers where signal skews are big because Other techniques for reducing dynamic power in of unbalanced signal paths. A well balanced carry-save arithmetic circuits that are found in the literature are

210

not compatible with cell-based design and synthesis for reduce power further. Most effective are strategies that different reasons. These include: help lower glitching power or turn off unused blocks.  Exchange multiplier inputs or form 2’s comple- ment of multiplier inputs if this results in lower 7. Outlook transition probabilities at the multiplier inputs [6]. For future work, we see potential for improvements The special circuits that are required can be im- in the following areas: plemented efficiently at the transistor level, but not with standard-cells.  Add support for more special arithmetic opera- tions, improve datapath extraction and include  Bypass rows/columns in array multipliers where more arithmetic optimizations. partial products are 0 [7]. This technique applies to array multipliers and requires latches and mul-  Have more flexible selection of number represen- tiplexers, which only at the transistor-level can be tations inside datapath blocks, such as partial bi- efficiently integrated into the full-adder circuit. nary reduction of carry-save results (i.e., propa- gate carries only for small sections).  Glitch gating suppresses the propagation of  Optimize CSE sharing/unsharing for given timing. glitches at certain locations in a combinational block by inserting latches or isolation gates that  Investigate more low-power architectures and are driven by a delayed clock [8]. The required techniques for cell-based design, also for glitch latches and delay lines are not compatible with a reduction. Investigate the use of sign-magnitude synthesis flow. arithmetic.  Logic styles, such as pass-transistor logic, can be 8. References useful to reduce power because of reduced number of transistors (capacitive load) and through glitch [1] R. Zimmermann and D. Q. Tran, “Optimized Synthesis filtering [9], but they are again not feasible in of Sum-of-Products”, 37th Asilomar Conf. on Signals, standard-cells. Systems and Computers, Nov 9-12, 2003. [2] V. G. Oklobdzija, D. Villeger, and S. S. Liu, “A Method 6. Summary for Speed Optimized Partial Product Reduction and Gen- eration of Fast Parallel Multipliers Using an Algorithmic The extraction of arithmetic operations from RTL Approach”, IEEE Trans. on Computers, vol. 45, no. 3, into large datapath blocks is key to enabling high-level March 1996. arithmetic optimizations and to efficient implementa- [3] Z. Huang and M. D. Ercegovac, “High-Performance tion at the circuit level through exploiting redundant Low-Power Left-to-Right Array Multiplier Design”, number representations and reducing expensive carry IEEE Trans. on Computers, vol. 54, no. 3, March 2005. propagations to a minimum. [4] K.-H. Chen, Y.-S. Chu, Y.-M. Chen, and J.-I. Guo, “A For the synthesis of sum-of-products the different High-Speed/Low-Power Multiplier Using an Advanced Spurious Power Suppression Technique”, ISCAS 2007: architectures to generate partial products for multipliers IEEE Intl. Symp. on Circuits and Systems, May 2007. all prove to be valuable alternatives to optimize a data- path for a given context. The timing-driven or transi- [5] H. Zhu, Y. Zhu, C.-K. Cheng, and D. M. Harris, “An Interconnect-Centric Approach to Cyclic Shifter Design tion-probability-driven adder tree construction results Using Fanout Splitting and Cell Order Optimization”, in optimal carry-save adders for delay, area and power ASP-DAC’07, Asia and South Pacific Design Automa- under arbitrary timing and activity input profiles. The tion Conference, Jan 23-26, 2007. flexible parallel-prefix adder with timing-driven prefix [6] P.-M. Seidel, “Dynamic Operand Modification for Re- structure optimization is the architecture of choice for duced Power Multiplication”, 36th Asilomar Conf. on the implementation of carry-propagate adders and Signals, Systems and Computers, vol. 1, Nov 2002. comparators under arbitrary timing constraints. [7] K.-C. Kuo and C.-W. Chou, “Low Power Multiplier with For the reduction of dynamic power dissipation in Bypassing and Tree Structure”, IEEE Asia Pacific Conf. datapaths many techniques proposed in the literature on Circuits and Systems, APCCAS 2006, Dec 2006. are tailored towards full-custom and transistor-level [8] K. Chong, B.-H. Gwee, and J. S. Chang, “A Micropower design and therefore are not applicable in cell-based Low-Voltage Multiplier With Reduced Spurious Switch- design. On the other hand, the techniques already in ing”, IEEE Trans. on Very Large Scale Integration place to optimize datapath circuits for area and timing (VLSI) Systems, vol. 13, no. 2, Feb 2005. also help improve power because smaller area means [9] V. G. Moshnyaga and K. Tamaru, “A Comparative less power and the inherent parallelism effectively re- Study of Switching Activity Reduction Techniques for duces switching activity and glitching. Only few addi- Design of Low-Power Multipliers”, IEEE Intl. Symp. on tional techniques were found that can incrementally Circuits and Systems, ISCAS '95, vol. 3, May 1995.

211