Arithmetic Optimization using Carry-Save-Adders
y z y
Taewhan Kim William Jao Steve Tjiang
y z
Synopsys Inc. Aimfast Corp.
700 E. Middle eld Rd. 846 Stewart Dr.
Mountain View, CA 94043 Sunnyvale, CA 94086
Abstract The n-bit CSA consists of n disjoint full addersFAs.
It consumes three n-bit input vectors and pro duces
Carry-save-adderCSA is the most often used typ e
two outputs, i.e., n-bit sum vector S and n-bit carry
of op eration in implementing a fast computation of
vector C .We use the blo ck symb ol in Figure 1b to
arithmetics of register-transfer level design in indus-
represent a CSA op eration. Unlike the normal adders
try. This pap er establishes a relationship b etween the
e.g., ripple-carry adderRCA and carry-lo okahead
prop erties of arithmetic computations and several op-
adderCLA, a CSA contains no carry propagation.
timizing transformations using CSAs to derive consis-
Consequently, the CSA has the same propagation de-
tently b etter qualities of results than those of manual
lay as only one FA delay compared to RCA's n FA
implementations. In particular, weintro duce two im-
delay where n is the bit-width, and the delay is con-
p ortant concepts, operation-duplication and operation-
stant for anyvalue of n.For suciently large n, the
split, which are the main driving techniques of our al-
CSA implementation b ecomes much faster and also
gorithm for achieving an extensive utilization of CSAs.
relatively smaller in size than the implementation of
Exp erimental results from a set of typical arithmetic
1
normal adders.
computations found in industry designs indicate that
automating CSA optimization with our algorithm pro-
There has b een an extensive researchwork on
duces designs with signi cantly faster timing and less
the arithmetic optimizations in several areas[2, 3, 4].
area.
They,however, fo cused on the transformation of op er-
ations using techniques such as simple algebraic ma-
1 Intro duction
nipulations and constant propagation; they did not
address the problem of arithmetic optimization us-
Hardware designers have long applied many arith-
ing CSAs. This pap er intro duces the concept of CSA
metic optimization techniques for implementation
transformation and presents an algorithm that e ec-
of arithmetic functionality like addition, subtrac-
tively utilizes CSAs to derive consistently b etter qual-
tion, and multiplication. Among them, carry-save-
ity of results than that of manual implementations.
adderCSA[1] has proved a p owerful mechanism to
improve timing with little, if any, area p enalty even
We de ne a CSA tree to b e a tree of CSA op erators
reduced area. Figure 1a shows the structure of an
and one adder at the ro ot of the tree. A CSA tree
n-bit CSA.
can b e used to transform an arbitrary numb er of ad-
ditions to pro duce two adding op erands and the adder
Xn-1 YZn-1 n-1 Xn-2 Yn-2 Z n-2 X00Y0 ZCI
is used at the ro ot of CSA tree to pro duce a nal sum.
In other words, an expression of N additions can b e
transformed to a CSA tree of depth log N and the
:5
FA FA FA 1
overall delayislog N plus the delay of nal adder
1:5
whichisaboutlog n for a CLA. For example, ex- 2
CO Sn-1 Sn-2S 0
A+B +C can b e transformed into one CSA C pression
n-1 (a) C1 C 0 wn in Figure 1c.
ABC and one adder as sho
eral results of our exp erimentations, XYZ Based on sev
CI
our transformation can b e stated as follows: Given
a non-cyclic data ow graph of arithmetic op erations,
we wish to transform the computations using as many
CO CSA op erations as p ossible while preserving the func-
CS y of the design. (b) tionalit F
(c)
1
Figure 1: An n-bit CSA and an example of use
Note that a CSA p erforms the same functionality of the
conventional adders in the sense that each reduces the number
of adding op erands by one, i.e., a CSA reduces from 3 to 2 and an adder from 2 to 1.
35th Design Automation Conference ® Copyright ©1998 ACM 1-58113-049-x-98/0006/$3.50 DAC98 - 06/98 San Francisco, CA USA
2 ApplicabilityofTransformation transformed yet in the previous iterations. We refer
the op eration to root. The ro ot is then expanded to-
CSA transformation is not limited to addition only.
ward the input b oundary of design to construct a tree
We can transform other arithmetic op erations like sub-
of op erations. Note that the typ e of ro ot must b e
traction and multiplicationinto additions to pro duce
one of addition, subtraction, and multiplication op er-
longer chains of additions. We replace a subtraction
2
ations.
by adding the negation of the subtraction. That is,
Supp ose that op eration A is a leaf of the op eration
x-y in expression is changed into x +y+ 1. For
tree expanded so far and B is an op eration in which
multiplication, we can use two p ossible options:
its output is used as an input of op eration A. We
expand the current op eration tree by including B if
sum of pro ducts:
the following four conditions are satis ed:
Amultiplication is decomp osed into a set of shift-
and-add op erations, which is de nitely b ene cial
Condition 1: A must b e one of addition and sub-
3
when it has a constant as input. With full de-
traction op erations.
comp osition we can e ectively extend the addi-
tion tree by merging the decomp osed additions
Condition 2: B must b e one of addition, subtrac-
with another additions. However, decomp osition
tion, and multiplication op erations.
can increase the numb er of CSA op erations dras-
tically as the bit-width increases.
Condition 3: Output of B must b e single fanout.
That is, B drives only an input of A.
partial multiplication and addition:
Condition 4: There should b e no \leak" of data
Twotyp es of implementation for multiplication
values through the connections from B to A up-to
op eration are typically found in designs[6]: i
the output bit-width of ro ot of the tree.
wallace tree mo del for fast timing and ii carry-
save array mo del for small area. The wallace tree
4
Condition 1 ensures that only addition and sub-
mo del consists of two blo cks called partial-mult
traction are used as non-leaf op erations of the tree,
and nal-add.Partial-mult uses the two inputs
and Condition 2 ensures that only addition, subtrac-
of the multiplication as input and pro duces two
tion and multiplication are used as leaf op erations.
outputs, in whichby adding them the nal result
5
Condition 3 ensures that we do not allow transform-
of multiplication is obtained. Consequently,by
ing non-tree structure of op erations. We solve the
decomp osing into a partial-mult and a nal-add
multiple fanout case in Sec. 4.1. Finally, Condition
we can merge the nal-add with a descendent op-
4 is required for preserving the functionality of de-
eration to transform into CSAs. This option in-
sign b efore and after transformation. The examples
creases only one op eration, but is less exible than
shown in Figure 2 clarify the concept of leakage of
the case of full decomp osition.
data values. Supp ose op eration A is also the ro ot of
the current tree. Figure 2a shows a case of upp er-
3 An Algorithm
bit truncation b etween B and A. Because up-to 8
Our transformation algorithm consists of three ma-
bits of data values must b e preserved, the truncation
jor steps: 1 identi cation of op eration tree to b e
do es not allow merging op eration B with A. Simi-
transformed, 2 translation of the expression of the
larly, Figure 2b shows a case of lower-bit truncation
identi ed tree into an addition expression, and 3
between the two op erations. It also prevents merg-
conversion of the addition expression into a CSA tree.
ing the two op erations. However, Figure 2c shows
Given a non-cycle data ow graph of arithmetic com-
a case of preserved data values up-to 8 bits from B
putations, our algorithm iteratively p erforms the three
through A. Consequently, the two op eration can b e
steps until there is no candidate expression of tree to
merged and transformed into CSA op erations without
b e transformed. The following subsections describ e
changing the functionality of design.
the details of the three steps.
3.2 Conversion to Additions
3.1 Candidate Identi cation
With the identi ed cluster of op eration tree ob-
As explained in Sec. 2, the CSA transformation
tained from Step 1, this step converts the expression
can b e applied to anytyp e of op erations that can
of op eration tree consisting of additions, subtractions,
b e converted to additions. A candidate cluster is
and multiplications into a tree of additions. What we
basically a tree which contains op erations like addi-
are interested in this step is to extract all the op erands
tions/subtractions/multiplications. Our algorithm is
from the converted addition expression. Table 1 sum-
a b ottom-up from the output b oundary of design to-
marizes all p ossible conversion rules for basic op er-
ward the input b oundary approach. From the out-
ations. Note that there exist three di erent rules to
puts of design it nds an op eration which has not b een
convertamultiplication. Which rule to apply dep ends
2 on constraints suchasnumb er of op erations to allow,
We used two's complement.
3
word size of op eration, and optimization goals. For
We used the signed-digitSD enco ding scheme[5] for the
example, rule 6 full decomp osition decreases tim-
multiplication with a constant input to reduce the number of
ing, but increases the numb er of CSAs drastically.On
decomp osed op erations.
4
the other hands, rule 5 partial decomp osition is not
It is a functionalityofwallace tree compressor.
5
much e ective to decrease timing than that of rule
We use Spartial multA,B and Cpartial multA,B to
denote the two outputs of the partial-mult of op eration A B . 6, but maintains the numb er of CSAs in a certain
the p ossible triples of op erands in the set, the triple
We use an ecient
77 88 10 10 with the earlist arriving time.
linear-time heuristic: Initially we sort op erands in the yto
B set in a non-decreasing order according to the dela
.Iftwo op erands B B the op erands from input b oundary
8 8 8
have the same delay,we give a higher prioritytothe
8 9 11 8
t each iteration, our algo- 7 1 one of smaller bit-width. A
1 8 3
rithm picks the rst three op erands from the sorted list
ys to two output op erands
A A A and creates a CSA. The dela of CSA are then computed and the op erands are in-
8 8 8 serted to the correct p ositions of the sorted list. Our
exp erimentation indicates that our op erand selection
hisvery ecient and yet, do es not degrade
(a) upper-bit truncation (b) lower-bit trunction (c) preserved bit-flow approac y of results when compared with that of the
upto 8 bits the qualit
exhaustive search of selections. In addition, the algo-
Figure 2: Example of p ossible net-connections b e-
rithm can easily tune to optimize area of the resultant
tween two op erations
CSA tree by sorting the op erands according to the size
of bit-width of op erands. If two op erands have the
same bit-width we resolveitby giving higher priority
primitive exp. new exp.
to the one of shorter delay.
We handle one-bit op erands and a constant in dif-
1 x + y x + y
ferentways to reduce the numb er of CSAs created.
2 x + y with c x + y + c
in in
This can b e achieved by utilizing as many carry in-
3 x - y x +-y
puts of CSAs as p ossible. Each CSA created can con-
4 x- y with c x +-y+c
in in
sume an one-bit op erand as carry input. Therefore, it
5 x * y C par tial mul tx; y +
is desirable to assign as many one-bit op erands to the
S par tial mul tx; y
carry inputs of CSAs as p ossible b ecause otherwise,
P
n 1
each requires one additional CSA. On the other hand,
6 x * y AN D x; y << i
i
i=0
for the constant op erand it can b e used as a multi-
7 x * const summing from SD enco ding[5 ]
bit op erand or b e decomp osed into logic-1 values to
b e used as carry input of CSAs. Using constantasa
multi-bit op erand requires only one additional CSA.
Table 1: Conversion rules to additions for primitive
However, when there are suciently large number of
expressions
CSAs for the multi-bit non-constant op erands, the
constant can b e assigned to the carry inputs of CSAs,
avoiding unnecessary creation of an additional CSA.
amount. In addition, rule 7 can b e used as a pre-
Furthermore, when there is a mixture of op erands of
pro cessing step which guarantees that the number of
non-constant single-bit and logic-1 values, b ecause
op erations is always within half of bit-size of the con-
logic-1 op erand takes zero delay, it is desirable to as-
stant.
sign the logic-1 op erands to CSAs which are far from
We apply the rules in Table 1 to the original ex-
the ro ot of nal CSA tree, and to assign the non-
pression of tree to obtain a set of op erands, some of
constant op erands to CSAs which are close to the ro ot
which might need to negated e.g. -y in subtraction
of tree to reduce the timing of the critical path of the
or shifted e.g. AN D x; y << i in multiplication.
i
nal CSA tree.
The op erands may also include single-bit ones and
Figure 3 illustrates the e ects of di erent utiliza-
constants. We then replace negated op erands, e.g.,
tion of carry inputs of CSAs on the timing and area of
-x byx+1 and all constants are folded.
nal CSA tree. Supp ose that A, B , C , D are mult-bit
3.3 Conversion to CSAs
op erands, E is single-bit, and the constantvalue is 2.
We assume that the bit sizes of the multi-bit op erands
All the op erands obtained from Step 2 are to b e
are the same. Also, we assume wehave already com-
summed up by CSA tree. In this step, we build up
6
the tree. Our algorithm constructs CSAs iteratively
puted delay to the op erands as shown in Figure 3a.
one by one. Each iteration selects three op erands as
Figure 3 shows a set of p ossible transformations for the
input of a new CSA and two new op erands are cre- op erands. Figure 3a shows the case that the single-
ated from the CSA. At the next iteration, the three
bit op erand E and constant 2 are not used as carry
op erands used in the previous iteration are removed input of CSA. Consequently, four CSAs are created
from the op erand set and the new two op erands are
and timing of the nal tree is worse than the other
added to the set. Therefore, whenever one CSA is cre- cases. On the other hand, Figures 3b and 3c show
ated the size of op erand set is reduced by one. CSA
the cases that the single-bit op erand and constant are
construction pro ceeds until only two op erands in the used as carry input of CSA but not b oth, resp ectively.
set remain. Finally, a normal adder adds the two re-
Each of them requires three CSAs and timing of the
maining op erands to pro duce a nal output of the ex- CSA trees is b etter than the previous one, but still
pression.
there is a ro om to improve. Figures 3d and 3e
Because the primary ob jective of our transforma-
6
tion is to reduce timing, each iteration selects, from all We use notation D X to denote the delay of op erand X .
show the cases that the carry inputs of CSA tree are CSAs, and the second iteration will identify tr ee and
2
maximally utilized. However, according to the way transform it into CSAs as show in Figure 4b. Conse-
of assigning single-bit op erand and constant there is quently,two CSAs and two adders one for each CSA
a big di erence in timing. One guideline is that we tree are used. Supp ose that a design contains a chain
need to assign the constant op erands i.e., logic-1 to of op erations in which n of them havemultiple fanout.
earlier steps than the single-bit ones during the top Then, our algorithm will identify n op eration trees
down pro cess of CSA tree construction. In fact, our having each op eration with multiple fanout as ro ot.
algorithm pro duced the one in Figure 3e which is the Consequently, n adders will b e allo cated on the chain
b est in terms of timing and area. of the transformed CSA trees. Here, our ob jectiveis
that wewant to reduce the numb er of adders from n
to 1 and replace the remaining n-1 adders with CSAs.
ABC DE2 ABC ABC
We accomplish this byintro ducing the concept of CSA
transformation without nal adder.
D D
We use the algorithm in Sec. 3 for identifying ex-
pression tree. However, when an op eration with multi-
ple fanout is encountered i.e., Condition 3, we mark
2 E
the op eration as a ro ot of another op eration tree and
continue to expand the op eration tree. Once we collect
D(A) = 0, 1
y crossing over op erations with
D(B) = 0, all the op eration trees b
multiple fanout, we top ologically sort the op eration D(C) = 0, E 1 trees from input b oundary of design to output b ound-
D(D) = 0,
ary.We then transform the op eration trees on the list
y one. For those trees with multiple fanout ro ot,
D(E) = 2. one b
we do not allo cate nal adders in the resultant CSA
e generate nal two outputs of each F F F trees. Instead, w
(a) (b) (c)
CSA tree. The two outputs are then used in twoways: ABC ABC
E 1 1 Both of them are used as input op erand of the op-
erations trees which dep end on the op eration tree cor-
D D resp onding to the transformed CSA tree without nal
1 1 adder and 2 A new adder is created and they are
used as input of the adder. we refer this pro cess to
as operation-duplication. The output of adder is then
1 E used for the fanout of the ro ot of op eration tree cor-
resp onding to the CSA tree. Consequently, only one
adder will b e created at the end of the last op eration tree of the sorted list of trees.
F F or example, in Figure 4a the rst iteration will
(d) (e) F
nd two op eration trees tr ee and tr ee in which there
1 2
is a data ow from tr ee to tr ee . tr ee is then trans-
Figure 3: E ects of utilization of carry inputs on CSA
2 1 2
formed into a CSA tree without nal adder. It gener-
tree
ates two outputs T 1 and T 2 and one new adder O
dup
4 Extension to the Applicability of
as shown in Figure 4c. tr ee is then transformed
1
into a CSA tree with nal adder by using T 1, T 2, D ,
CSA Transformation
and E as op erands as shown in Figure 4d. Note that
Many signal pro cessings and data-intensive compu-
the resultant transformation contains three CSAs and
tations frequently contain an op eration in which only
one adder in the two CSA trees and generates a faster
the upp er m bits among n bits of its output are fed in-
timing than the one in Figure 4b. However, the total
put to a descendent op eration. We call this partial use
area increases b ecause one additional adder is created
of an output the lower-bit truncation problem. More-
outside of the CSA trees.
over, many designs often contain an op eration whose
output feeds several other op erations. We call this
4.2 A Solution to the Lower-bit Trunca-
the multiple fanout problem. In fact, the two prob-
tion Problem
lems corresp ond to the violations of Condition 4 Fig-
We also b egin with an example to demonstrate how
ure 2b and Condition 3 in Sec. 3.1, resp ectively.In
our approach can solve the problem of lower-bit trun-
the following, we provide solutions to the problems.
cation. Figure 5a shows an example of lower-bit
4.1 A Solution to the Multiple Fanout
truncation. Note that only 8 upp er-bits of the out-
Problem
put of op eration O are used. Therefore, we cannot
2
We b egin with an example to demonstrate how merge op eration O and O into CSAs b ecause there
1 2
our approach handles op erations with multiple fanout. is a data leakage b etween the op erationsi.e., Condi-
Supp ose that Figure 4a is a partial structure of a de- tion 4 in Sec. 3.1. We solve this problem byintro-
sign that we are going to transform into CSAs. Note ducing the concept of operation-split:We split op er-
that op eration O has multiple fanout. According to ation O into two op erations O and O as shown in
3 2 3 4
the algorithm describ ed in Sec. 3, the rst iteration Figure 5b. This means that given the split op era-
will identify op eration tree tr ee and transform it into tions our algorithm can identify an expression tree as 1 ABC A B A A7-0 B7-0 B AB15-8 15-8 8 C 16 8 16 88 tree 2 O4 tree 2 O4 O2 1 C O3 16 8 O3 8 8 8 D D 8 C E tree 1 8
E tree 1 O2 O1 O1 F F O 1 (a) (b) F F (a) (b) A7-0 B7-0 AB15-8 15-8 ABC 88 A B C A7-0 B7-0 88 A 15-8 B15-8 8 8 tree 2 O4 C tree 2 1 O3 E 8 O4 O dup T1 T2 1 tree Odup 8 C 8 D 1 D 8
E
tree O1 1 F F
F (c) (d)
(c) F (d)
Figure 5: An example of transformation of op eration
Figure 4: Examples of transformation of op eration
with lower-bit truncation
with multiple fanout
for non-constant inputs of multiplication and used
shown in Figure 5c where C , upp er 8-bits of A and
16-bit op erands for the rest. We also assume that
B , and carry-out of op eration O are the op erands to
4
the arrival times of all input op erands are 0. For
b e added. Figure 5d shows the transformed CSA
multiplication op eration, our algorithm fully de-
tree from the op eration tree in Figure 5c.
comp osed it into additions. The results in Table 2
Unlike the case of solution of multiple fanout prob-
show a strong indication that our algorithm can
lem, the op eration-split do es not increase area. Fur-
extensively apply to the designs with additions,
thermore, when the op eration with lower-bit trunca-
subtractions and multiplications.
tion has multiple fanout, the carry-out of lower one of
two split op erations can b e extracted directly from the Designs with multiple fanout:
adder created from the solution of the multiple fanout
We conducted our exp erimentation on three de-
problem.
signs shown in Figure 6. In all designs in the
following, we assume that the arrival times of all
5 Exp erimental Results
input op erands are 0. Each design has op era-
We tested our algorithm on a large numb er of arith-
tions with multiple fanout. We used two CSA
metic computations which are typically used in indus-
transformation techniques: i without op eration-
trial designs. We used Design Compiler package from
duplication and ii with op eration-duplication,
Synopsys Inc. to p erform the implementation selec-
and compared the results with that pro duced
tion, tree-height minimization and logic optimizations
without CSA transformation. The results are
for the designs transformed by our algorithm and the
summarized in Table 3. Note that op eration-
7
designs without using our algorithm , and compared
duplication can reduce timing of design much fur-
their results.
ther, but it increases area. Consequently, the
transformation with op eration-duplication can b e
Designs with additions, subtractions, and
applied in a lo cal way to those op erations on the
multiplications:
critical path of design to reduce timing with a
We tested our algorithm on designs with a mix-
minimal increase of area.
ture of additions, subtractions, and multiplica-
tions as shown in Table 2. We used 8-bit op erands
Designs with lower-bit truncation:
7
We also conducted our exp erimentation on the
The tree-height minimization was p erformed on non-CSA
three designs shown in Figure 7. Note that the
op erations. We used lcbg10pv 0.35 technology[7] for all test
designs have op erations with lower-bit truncation. cases.
We used two CSA transformation techniques: i Acknow ledgment- We wish to thank Ron Miller,
without op eration-split and ii with op eration- Hazem Almusa, Reiner Genevriere, Sean Huang, and
split, and compared the results with that pro- Tai Ly for making this work successful. duced without CSA transformation. The results
16 16 8 8
shown in Table 4 re ect that the op eration-split is
8 8 8 8
avery p owerful technique to overcome the limita-
tion of lower-bit truncation problem and extend
16 16 16 16 y of CSA transformation to pro- the applicabilit 16 16
16 16 16 16 duce much faster timing and less area.
RTL Design CSA design di . Equations 16 16
16 16 16 16 timing/area
timing/area 16 16
A*B+1 6.18 ns 5.88 ns -5
3410 units -1 3430 units 16 16 16
16 16 16
A * 3E3E 6.86 ns 6.55 ns -5
0011111000111110 2920 units 2068 units -29
16
7.21 ns 6.84 ns -37
A*B + C*D 16 16
11225 units 10141 units -10
+ E*F fanout_1 fanout_2 fanout_3
A*B - C + D 7.19 ns 6.37 ns -10
Figure 6: Designs with multiple fanout
5185 units 4880 units -6
A+B-C- 7.94 ns 6.03 ns -24
D-E-F 5328 units 3697 units -31 16 16 8 8
8 8 8 8
Table 2: Comparison of results for expressions with
addition, subtraction and multiplication
4 12 4
12 12 12 4 12 4
TL Design CSA design CSA Design
R 12
Designs w/o duplicate w/ duplicate
timing/area timing/area timing/area 4 4
8 8 12
1 7.11 ns 6.83 ns 6.24 ns
fanout 8 8 12
4010 units 3600 units 4659 units
fanout 2 7.96 ns 7.13 ns 7.04 ns
8 8 12 12
6072 units 7037 units
6328 units 8 8
fanout 3 7.86 ns 7.07 ns 6.87 ns
9344 units 8992 units 9772 units 12
8 8
Table 3: Comparison of results for designs in Figure 6
truncate_1 truncate_2 truncate_3
RTL Design CSA design CSA Design
Figure 7: Designs with lower-bit truncation
Designs w/o split w/ split
timing/area timing/area timing/area
References
truncate 1 6.01 ns 5.94 ns 4.02 ns
[1] N. Weste and K. Eshraghian, Principles of CMOS
2315 units 2138 units 1721 units
VLSI Design - A Systems Perspective, Addition-
truncate 2 7.75 ns 7.25 ns 5.81 ns
Wesley Publishers, 1985.
4610 units 4560 units 4133 units
[2] M. Potkonjak and J. Rabaey, \Optimizing Resource
truncate 3 7.98 ns 7.25 ns 6.45 ns Utilization Using Transformations", Proc. of ICCAD,
8374 units 8161 units 7405 units
pp. 88-91, 1991.
[3] D. Lob o and B. Pangrle, \Redundant Op eration Cre-
Table 4: Comparison of results for designs in Figure 7
ation: A Scheduling Optimization Technique", Proc.
of DAC, pp. 775-778, 1991.
6 Conclusions
[4] A. Aho and J. Ullman, Principles of Compiler Design,
We presented a new technique to optimize arith-
Addition-Wesle y Publishers, 1977.
metic circuits. The technique automatically expands
[5] K. Hwang, Computer Arithmetic: Principles, archi-
circuits consisting of adders, subtractors, and multipli-
tecture, and Design, New York, 1979.
ers into their carry-save-adder CSA representation
[6] Synopsys Inc., DesignWare Components Databook,
which are then optimized. The representation obvi-
1996.
ates the need for implementation selection, the au-
[7] LSI Logic Inc., G10-p Cel l-Based ASIC Products
tomatic selection of the b est implementation among
several adder/subtractor/multiplier implementations,
Databook, 1996. a time-consuming part of arithmetic optimization.