Arithmetic Optimization using Carry-Save-Adders

Taewhan Kim (†), William Jao (‡), Steve Tjiang (†)

† Synopsys Inc., 700 E. Middlefield Rd., Mountain View, CA 94043
‡ Aimfast Corp., 846 Stewart Dr., Sunnyvale, CA 94086

Abstract

The carry-save adder (CSA) is the most often used type of operation for implementing fast arithmetic computation in register-transfer level designs in industry. This paper establishes a relationship between the properties of arithmetic computations and several optimizing transformations using CSAs to derive consistently better quality of results than that of manual implementations. In particular, we introduce two important concepts, operation-duplication and operation-split, which are the main driving techniques of our algorithm for achieving an extensive utilization of CSAs. Experimental results from a set of typical arithmetic computations found in industry designs indicate that automating CSA optimization with our algorithm produces designs with significantly faster timing and less area.

1 Introduction

Hardware designers have long applied many arithmetic optimization techniques to the implementation of arithmetic functionality such as addition, subtraction, and multiplication. Among them, the carry-save adder (CSA) [1] has proved a powerful mechanism to improve timing with little, if any, area penalty, or even with reduced area. Figure 1(a) shows the structure of an n-bit CSA.

[Figure 1: An n-bit CSA and an example of use. Panels: (a) n-bit CSA built from full adders, (b) CSA block symbol, (c) A + B + C implemented with one CSA and one adder.]

The n-bit CSA consists of n disjoint full adders (FAs). It consumes three n-bit input vectors and produces two outputs, i.e., an n-bit sum vector S and an n-bit carry vector C. We use the block symbol in Figure 1(b) to represent a CSA operation. Unlike normal adders, e.g., the ripple-carry adder (RCA) and the carry-lookahead adder (CLA), a CSA contains no carry propagation. Consequently, the propagation delay of a CSA is only one FA delay, compared with the n FA delays of an RCA, where n is the bit-width, and the delay is constant for any value of n. For sufficiently large n, the CSA implementation becomes much faster and also relatively smaller in size than the implementation of normal adders.^1

There has been extensive research on arithmetic optimization in several areas [2, 3, 4]. That work, however, focused on the transformation of operations using techniques such as simple algebraic manipulations and constant propagation; it did not address the problem of arithmetic optimization using CSAs. This paper introduces the concept of CSA transformation and presents an algorithm that effectively utilizes CSAs to derive consistently better quality of results than that of manual implementations.

We define a CSA tree to be a tree of CSA operators with one adder at the root of the tree. A CSA tree can be used to transform an arbitrary number of additions so that they produce two operands, and the adder at the root of the CSA tree adds these to produce the final sum. In other words, an expression of N additions can be transformed into a CSA tree of depth $\lceil \log_{1.5} N \rceil$, and the overall delay is $\lceil \log_{1.5} N \rceil$ plus the delay of the final adder, which is about $\log_2 n$ for a CLA. For example, the expression A + B + C can be transformed into one CSA and one adder as shown in Figure 1(c).

Based on several results of our experiments, our transformation problem can be stated as follows: Given a non-cyclic data flow graph of arithmetic operations, we wish to transform the computations using as many CSA operations as possible while preserving the functionality of the design.

^1 Note that a CSA performs the same function as a conventional adder in the sense that each reduces the number of operands to be added by one: a CSA reduces them from 3 to 2, and an adder from 2 to 1.
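
The behaviour of a CSA and of a CSA tree described in this section can be illustrated with a small bit-level model. The following is a minimal sketch, not the authors' implementation: the function names `csa` and `csa_tree_sum` are hypothetical, operands are assumed to be non-negative Python integers of unbounded width, and the level-by-level grouping is only one possible way to arrange the CSAs.

```python
def csa(x: int, y: int, z: int) -> tuple[int, int]:
    """3-to-2 compression: return (s, c) such that x + y + z == s + c."""
    s = x ^ y ^ z                            # per-bit sum output of each full adder
    c = ((x & y) | (x & z) | (y & z)) << 1   # per-bit carry output, one position higher
    return s, c


def csa_tree_sum(operands: list[int]) -> tuple[int, int]:
    """Reduce operands with CSAs until two remain, then add them.

    Returns (total, number_of_csa_levels).
    """
    level = list(operands)
    depth = 0
    while len(level) > 2:
        nxt = []
        full = len(level) - len(level) % 3   # operands that fit into whole triples
        for i in range(0, full, 3):
            nxt.extend(csa(level[i], level[i + 1], level[i + 2]))
        nxt.extend(level[full:])             # leftovers wait for the next level
        level = nxt
        depth += 1
    return sum(level), depth                 # the final sum stands in for the root adder
```

No carry crosses bit positions inside `csa`, which is why its delay does not depend on the bit-width; only the final addition at the root propagates carries.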

35th Design Automation Conference ® Copyright ©1998 ACM 1-58113-049-x-98/0006/$3.50 DAC98 - 06/98 San Francisco, CA USA

2 Applicability of Transformation

CSA transformation is not limited to addition only. We can transform other arithmetic operations such as subtraction and multiplication into additions to produce longer chains of additions. We replace a subtraction by the addition of the negated subtrahend.^2 That is, $(x - y)$ in an expression is changed into $(x + \bar{y} + 1)$. For multiplication, we can use two possible options:

• sum of products: A multiplication is decomposed into a set of shift-and-add operations, which is definitely beneficial when it has a constant as input.^3 With full decomposition we can effectively extend the addition tree by merging the decomposed additions with other additions. However, decomposition can increase the number of CSA operations drastically as the bit-width increases.

• partial multiplication and addition: Two types of implementation for the multiplication operation are typically found in designs [6]: (i) the Wallace tree model for fast timing and (ii) the carry-save array model for small area. The Wallace tree model consists of two blocks called partial-mult^4 and final-add. Partial-mult takes the two inputs of the multiplication as input and produces two outputs whose sum is the final result of the multiplication.^5 Consequently, by decomposing a multiplication into a partial-mult and a final-add, we can merge the final-add with a descendant operation and transform it into CSAs. This option adds only one operation, but is less flexible than full decomposition.

^2 We used two's complement.
^3 We used the signed-digit (SD) encoding scheme [5] for multiplication with a constant input to reduce the number of decomposed operations.
^4 It provides the functionality of a Wallace tree compressor.
^5 We use S_partial_mult(A, B) and C_partial_mult(A, B) to denote the two outputs of the partial-mult of operation A × B.

3 An Algorithm

Our transformation algorithm consists of three major steps: (1) identification of an operation tree to be transformed, (2) translation of the expression of the identified tree into an addition expression, and (3) conversion of the addition expression into a CSA tree. Given a non-cyclic data flow graph of arithmetic computations, our algorithm iteratively performs the three steps until there is no candidate expression tree left to be transformed. The following subsections describe the details of the three steps.

3.1 Candidate Identification

As explained in Sec. 2, the CSA transformation can be applied to any type of operation that can be converted to additions. A candidate cluster is basically a tree that contains operations such as additions, subtractions, and multiplications. Our algorithm is a bottom-up approach, working from the output boundary of the design toward the input boundary. Starting from the outputs of the design, it finds an operation that has not yet been transformed in the previous iterations. We refer to this operation as the root. The root is then expanded toward the input boundary of the design to construct a tree of operations. Note that the type of the root must be one of addition, subtraction, and multiplication.

Suppose that operation A is a leaf of the operation tree expanded so far and B is an operation whose output is used as an input of operation A. We expand the current operation tree by including B if the following four conditions are satisfied:

• Condition 1: A must be an addition or a subtraction operation.

• Condition 2: B must be an addition, subtraction, or multiplication operation.

• Condition 3: The output of B must have a single fanout. That is, B drives only an input of A.

• Condition 4: There should be no "leak" of data values through the connections from B to A, up to the output bit-width of the root of the tree.

Condition 1 ensures that only addition and subtraction are used as non-leaf operations of the tree, and Condition 2 ensures that only addition, subtraction, and multiplication are used as leaf operations. Condition 3 ensures that we do not transform a non-tree structure of operations. We solve the multiple fanout case in Sec. 4.1. Finally, Condition 4 is required to preserve the functionality of the design before and after transformation. The examples shown in Figure 2 clarify the concept of leakage of data values. Suppose operation A is also the root of the current tree. Figure 2(a) shows a case of upper-bit truncation between B and A. Because up to 8 bits of the data values must be preserved, the truncation does not allow merging operation B with A. Similarly, Figure 2(b) shows a case of lower-bit truncation between the two operations. It also prevents merging the two operations. However, Figure 2(c) shows a case in which the data values are preserved up to 8 bits from B through A. Consequently, the two operations can be merged and transformed into CSA operations without changing the functionality of the design.

3.2 Conversion to Additions

Given the cluster of the operation tree identified in Step 1, this step converts the expression of the operation tree, which consists of additions, subtractions, and multiplications, into a tree of additions. What we are interested in at this step is extracting all the operands from the converted addition expression. Table 1 summarizes all possible conversion rules for basic operations. Note that there exist three different rules to convert a multiplication. Which rule to apply depends on constraints such as the number of operations allowed, the word size of the operation, and the optimization goals. For example, rule 6 (full decomposition) decreases timing, but increases the number of CSAs drastically. On the other hand, rule 5 (partial decomposition) is not as effective at decreasing timing as rule 6, but keeps the number of CSAs within a certain amount.
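
The rewriting of a subtraction (Sec. 2) and the full decomposition of a multiplication (rule 6) into plain addition operands can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names `negate_operands` and `multiply_operands` are hypothetical, operands are unsigned integers of a fixed `width`, and results are interpreted modulo 2^width in two's complement.

```python
def negate_operands(y: int, width: int) -> list[int]:
    """Replace -y by two addition operands: the bitwise complement of y and 1."""
    mask = (1 << width) - 1
    return [~y & mask, 1]


def multiply_operands(x: int, y: int, width: int) -> list[int]:
    """Replace x * y by the shifted partial products AND(x, y_i) << i."""
    return [(x if (y >> i) & 1 else 0) << i for i in range(width)]


def subtract(x: int, y: int, width: int) -> int:
    """x - y expressed purely as a sum of operands, truncated to width bits."""
    mask = (1 << width) - 1
    return (x + sum(negate_operands(y, width))) & mask
```

Each list returned above is a set of operands that can be merged with the operands of neighbouring additions before any CSA is built, which is what lengthens the addition chains.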

In addition, rule 7 can be used as a preprocessing step which guarantees that the number of operations is always within half of the bit-size of the constant.

We apply the rules in Table 1 to the original expression of the tree to obtain a set of operands, some of which might need to be negated (e.g., -y in subtraction) or shifted (e.g., $\mathrm{AND}(x, y_i) \ll i$ in multiplication). The operands may also include single-bit operands and constants. We then replace negated operands, e.g., $-x$ by $(\bar{x} + 1)$, and fold all constants.

[Figure 2: Example of possible net-connections between two operations. Panels: (a) upper-bit truncation, (b) lower-bit truncation, (c) preserved bit-flow up to 8 bits.]

Table 1: Conversion rules to additions for primitive expressions

rule  primitive exp.     new exp.
1     x + y              x + y
2     x + y with c_in    x + y + c_in
3     x - y              x + (-y)
4     x - y with c_in    x + (-y) + c_in
5     x * y              C_partial_mult(x, y) + S_partial_mult(x, y)
6     x * y              $\sum_{i=0}^{n-1} \mathrm{AND}(x, y_i) \ll i$
7     x * const          summing from SD encoding [5]

3.3 Conversion to CSAs

All the operands obtained from Step 2 are to be summed up by a CSA tree. In this step, we build up the tree. Our algorithm constructs CSAs iteratively, one by one. Each iteration selects three operands as the input of a new CSA, and two new operands are created from the CSA. At the next iteration, the three operands used in the previous iteration are removed from the operand set and the two new operands are added to the set. Therefore, whenever one CSA is created, the size of the operand set is reduced by one. CSA construction proceeds until only two operands remain in the set. Finally, a normal adder adds the two remaining operands to produce the final output of the expression.

Because the primary objective of our transformation is to reduce timing, each iteration selects, from all the possible triples of operands in the set, the triple with the earliest arrival time. We use an efficient linear-time heuristic: Initially we sort the operands in the set in non-decreasing order of the delay to the operands from the input boundary. If two operands have the same delay, we give higher priority to the one with the smaller bit-width. At each iteration, our algorithm picks the first three operands from the sorted list and creates a CSA. The delays to the two output operands of the CSA are then computed, and the operands are inserted at the correct positions of the sorted list. Our experiments indicate that our operand selection approach is very efficient and yet does not degrade the quality of results when compared with an exhaustive search of selections. In addition, the algorithm can easily be tuned to optimize the area of the resultant CSA tree by sorting the operands according to their bit-width. If two operands have the same bit-width, we resolve the tie by giving higher priority to the one with the shorter delay.

We handle one-bit operands and a constant in different ways to reduce the number of CSAs created. This can be achieved by utilizing as many carry inputs of CSAs as possible. Each CSA created can consume a one-bit operand as its carry input. Therefore, it is desirable to assign as many one-bit operands to the carry inputs of CSAs as possible because otherwise each requires one additional CSA. A constant operand, on the other hand, can be used as a multi-bit operand or be decomposed into logic-1 values to be used as carry inputs of CSAs. Using the constant as a multi-bit operand requires only one additional CSA. However, when there is a sufficiently large number of CSAs for the multi-bit non-constant operands, the constant can be assigned to the carry inputs of CSAs, avoiding the unnecessary creation of an additional CSA. Furthermore, when there is a mixture of non-constant single-bit operands and logic-1 values, because a logic-1 operand takes zero delay, it is desirable to assign the logic-1 operands to CSAs which are far from the root of the final CSA tree, and to assign the non-constant operands to CSAs which are close to the root of the tree, in order to reduce the timing of the critical path of the final CSA tree.

Figure 3 illustrates the effects of different utilizations of the carry inputs of CSAs on the timing and area of the final CSA tree. Suppose that A, B, C, and D are multi-bit operands, E is single-bit, and the constant value is 2. We assume that the bit sizes of the multi-bit operands are the same. Also, we assume we have already computed the delay to the operands as shown in Figure 3(a), where D(X) denotes the delay of operand X. Figure 3 shows a set of possible transformations for the operands. Figure 3(a) shows the case in which the single-bit operand E and the constant 2 are not used as carry inputs of a CSA. Consequently, four CSAs are created and the timing of the final tree is worse than in the other cases. On the other hand, Figures 3(b) and 3(c) show the cases in which either the single-bit operand or the constant, respectively, but not both, is used as a carry input of a CSA. Each of them requires three CSAs and the timing of the CSA trees is better than in the previous case, but there is still room to improve.
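
The operand-selection heuristic of this section (sort by delay, break ties by bit-width, repeatedly combine the first three operands) can be sketched as below. This is a hedged illustration rather than the authors' code: the `Operand` class, the `build_csa_tree` function, the unit CSA delay `csa_delay`, and the rule that both CSA outputs are one bit wider than the widest input are all assumptions made for the example, and the carry-input handling for one-bit operands and constants is not modeled.

```python
import bisect
from dataclasses import dataclass, field
from itertools import count

_ids = count()  # unique ids so that sorting never has to compare names


@dataclass(order=True)
class Operand:
    delay: float                     # primary key: arrival time of the operand
    width: int                       # secondary key: smaller bit-width first
    uid: int = field(default_factory=lambda: next(_ids))
    name: str = field(default="", compare=False)


def build_csa_tree(operands, csa_delay=1.0):
    """Return (list of CSA input triples, the two operands left for the final adder)."""
    pending = sorted(operands)       # non-decreasing delay, then width
    csas = []
    while len(pending) > 2:
        a, b, c = pending[:3]        # the three earliest-arriving operands
        del pending[:3]
        out_delay = max(a.delay, b.delay, c.delay) + csa_delay
        out_width = max(a.width, b.width, c.width) + 1   # assumed width growth
        idx = len(csas)
        csas.append((a.name, b.name, c.name))
        # Re-insert the sum and carry outputs at their sorted positions.
        bisect.insort(pending, Operand(out_delay, out_width, name=f"S{idx}"))
        bisect.insort(pending, Operand(out_delay, out_width, name=f"C{idx}"))
    return csas, pending
```

For four 8-bit operands A to D arriving at time 0 and a one-bit operand E arriving at time 2, treated here as an ordinary operand, the sketch creates three CSAs and consumes E last, so E's late arrival is not on a longer path than necessary.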

Figures 3(d) and 3(e) show the cases in which the carry inputs of the CSA tree are maximally utilized. However, depending on how the single-bit operand and the constant are assigned, there is a big difference in timing. One guideline is that we need to assign the constant operands (i.e., logic-1) to earlier steps than the single-bit ones during the top-down process of CSA tree construction. In fact, our algorithm produced the tree in Figure 3(e), which is the best in terms of timing and area.

[Figure 3: Effects of utilization of carry inputs on CSA tree. Panels (a) to (e); operand delays D(A) = 0, D(B) = 0, D(C) = 0, D(D) = 0, D(E) = 2.]

4 Extension to the Applicability of CSA Transformation

Many signal processing and data-intensive computations frequently contain an operation in which only the upper m bits among the n bits of its output are fed as input to a descendant operation. We call this partial use of an output the lower-bit truncation problem. Moreover, many designs contain an operation whose output feeds several other operations. We call this the multiple fanout problem. In fact, the two problems correspond to violations of Condition 4 (Figure 2(b)) and Condition 3 in Sec. 3.1, respectively. In the following, we provide solutions to the problems.

4.1 A Solution to the Multiple Fanout Problem

We begin with an example to demonstrate how our approach handles operations with multiple fanout. Suppose that Figure 4(a) is a partial structure of a design that we are going to transform into CSAs. Note that operation O_3 has multiple fanout. According to the algorithm described in Sec. 3, the first iteration will identify operation tree tree_1 and transform it into CSAs, and the second iteration will identify tree_2 and transform it into CSAs as shown in Figure 4(b). Consequently, two CSAs and two adders (one for each CSA tree) are used. Suppose that a design contains a chain of operations in which n of them have multiple fanout. Then our algorithm will identify n operation trees, each having an operation with multiple fanout as its root. Consequently, n adders will be allocated on the chain of the transformed CSA trees. Our objective here is to reduce the number of adders from n to 1 and replace the remaining n-1 adders with CSAs. We accomplish this by introducing the concept of CSA transformation without final adder.

We use the algorithm in Sec. 3 for identifying the expression tree. However, when an operation with multiple fanout is encountered (i.e., Condition 3 is violated), we mark the operation as the root of another operation tree and continue to expand the operation tree. Once we have collected all the operation trees by crossing over operations with multiple fanout, we topologically sort the operation trees from the input boundary of the design to the output boundary. We then transform the operation trees on the list one by one. For those trees with a multiple fanout root, we do not allocate final adders in the resultant CSA trees. Instead, we generate the final two outputs of each CSA tree. The two outputs are then used in two ways: (1) both of them are used as input operands of the operation trees which depend on the operation tree corresponding to the transformed CSA tree without final adder, and (2) a new adder is created and they are used as inputs of the adder. We refer to this process as operation-duplication. The output of the adder is then used for the fanout of the root of the operation tree corresponding to the CSA tree. Consequently, only one adder will be created, at the end of the last operation tree of the sorted list of trees.

For example, in Figure 4(a) the first iteration will find two operation trees, tree_1 and tree_2, in which there is a data flow from tree_2 to tree_1. tree_2 is then transformed into a CSA tree without final adder. It generates two outputs T1 and T2 and one new adder O_dup as shown in Figure 4(c). tree_1 is then transformed into a CSA tree with final adder by using T1, T2, D, and E as operands, as shown in Figure 4(d). Note that the resultant transformation contains three CSAs and one adder in the two CSA trees and produces faster timing than the one in Figure 4(b). However, the total area increases because one additional adder is created outside of the CSA trees.

4.2 A Solution to the Lower-bit Truncation Problem

We also begin with an example to demonstrate how our approach can solve the problem of lower-bit truncation. Figure 5(a) shows an example of lower-bit truncation. Note that only the 8 upper bits of the output of operation O_2 are used. Therefore, we cannot merge operations O_1 and O_2 into CSAs because there is a data leakage between the operations (i.e., Condition 4 in Sec. 3.1 is violated). We solve this problem by introducing the concept of operation-split: we split operation O_2 into two operations, O_3 and O_4, as shown in Figure 5(b). This means that, given the split operations, our algorithm can identify an expression tree, as shown in Figure 5(c).

In Figure 5(c), C, the upper 8 bits of A and B, and the carry-out of operation O_4 are the operands to be added. Figure 5(d) shows the CSA tree transformed from the operation tree in Figure 5(c).

[Figure 4: Examples of transformation of operation with multiple fanout. Panels (a) to (d).]

[Figure 5: An example of transformation of operation with lower-bit truncation. Panels (a) to (d).]

Unlike the solution of the multiple fanout problem, operation-split does not increase area. Furthermore, when the operation with lower-bit truncation has multiple fanout, the carry-out of the lower one of the two split operations can be extracted directly from the adder created by the solution of the multiple fanout problem.

5 Experimental Results

We tested our algorithm on a large number of arithmetic computations which are typically used in industrial designs. We used the Design Compiler package from Synopsys Inc. to perform implementation selection, tree-height minimization, and logic optimization both for the designs transformed by our algorithm and for the designs produced without our algorithm,^7 and compared their results.

^7 The tree-height minimization was performed on non-CSA operations. We used the lcbg10pv 0.35 technology [7] for all test cases.

• Designs with additions, subtractions, and multiplications: We tested our algorithm on designs with a mixture of additions, subtractions, and multiplications as shown in Table 2. We used 8-bit operands for non-constant inputs of multiplications and 16-bit operands for the rest. We also assume that the arrival times of all input operands are 0. For each multiplication operation, our algorithm fully decomposed it into additions. The results in Table 2 strongly indicate that our algorithm can be applied extensively to designs with additions, subtractions, and multiplications.

• Designs with multiple fanout: We conducted our experiments on the three designs shown in Figure 6. In all the designs that follow, we assume that the arrival times of all input operands are 0. Each design has operations with multiple fanout. We used two CSA transformation techniques, (i) without operation-duplication and (ii) with operation-duplication, and compared the results with those produced without CSA transformation. The results are summarized in Table 3. Note that operation-duplication can reduce the timing of a design much further, but it increases area. Consequently, the transformation with operation-duplication can be applied locally to those operations on the critical path of the design to reduce timing with a minimal increase in area.

• Designs with lower-bit truncation: We also conducted our experiments on the three designs shown in Figure 7. Note that the designs have operations with lower-bit truncation.
We used two CSA transformation techniques, (i) without operation-split and (ii) with operation-split, and compared the results with those produced without CSA transformation. The results shown in Table 4 reflect that operation-split is a very powerful technique to overcome the limitation of the lower-bit truncation problem and to extend the applicability of CSA transformation to produce much faster timing and less area.

Table 2: Comparison of results for expressions with addition, subtraction and multiplication

Equations                      RTL design              CSA design              diff. (%)
                               timing     area         timing     area         timing  area
A*B+1                          6.18 ns    3410 units   5.88 ns    3430 units   -5      -1
A * 3E3E (0011111000111110)    6.86 ns    2920 units   6.55 ns    2068 units   -5      -29
A*B + C*D + E*F                7.21 ns    11225 units  6.84 ns    10141 units  -37     -10
A*B - C + D                    7.19 ns    5185 units   6.37 ns    4880 units   -10     -6
A+B-C-D-E-F                    7.94 ns    5328 units   6.03 ns    3697 units   -24     -31

Table 3: Comparison of results for designs in Figure 6

Designs     RTL design              CSA design w/o duplicate   CSA design w/ duplicate
            timing     area         timing     area            timing     area
fanout_1    7.11 ns    4010 units   6.83 ns    3600 units      6.24 ns    4659 units
fanout_2    7.96 ns    6328 units   7.13 ns    6072 units      7.04 ns    7037 units
fanout_3    7.86 ns    9344 units   7.07 ns    8992 units      6.87 ns    9772 units

Table 4: Comparison of results for designs in Figure 7

Designs       RTL design              CSA design w/o split     CSA design w/ split
              timing     area         timing     area          timing     area
truncate_1    6.01 ns    2315 units   5.94 ns    2138 units    4.02 ns    1721 units
truncate_2    7.75 ns    4610 units   7.25 ns    4560 units    5.81 ns    4133 units
truncate_3    7.98 ns    8374 units   7.25 ns    8161 units    6.45 ns    7405 units

[Figure 6: Designs with multiple fanout. Panels: fanout_1, fanout_2, fanout_3.]

[Figure 7: Designs with lower-bit truncation. Panels: truncate_1, truncate_2, truncate_3.]

6 Conclusions

We presented a new technique to optimize arithmetic circuits. The technique automatically expands circuits consisting of adders, subtractors, and multipliers into their carry-save-adder (CSA) representation, which is then optimized. The representation obviates the need for implementation selection, i.e., the automatic selection of the best implementation among several adder/subtractor/multiplier implementations, a time-consuming part of arithmetic optimization.

Acknowledgment - We wish to thank Ron Miller, Hazem Almusa, Reiner Genevriere, Sean Huang, and Tai Ly for making this work successful.

References

[1] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design - A Systems Perspective, Addison-Wesley Publishers, 1985.
[2] M. Potkonjak and J. Rabaey, "Optimizing Resource Utilization Using Transformations", Proc. of ICCAD, pp. 88-91, 1991.
[3] D. Lobo and B. Pangrle, "Redundant Operation Creation: A Scheduling Optimization Technique", Proc. of DAC, pp. 775-778, 1991.
[4] A. Aho and J. Ullman, Principles of Compiler Design, Addison-Wesley Publishers, 1977.
[5] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, New York, 1979.
[6] Synopsys Inc., DesignWare Components Databook, 1996.
[7] LSI Logic Inc., G10-p Cell-Based ASIC Products Databook, 1996.