Fast Division with a Small Lookup Table

Patrick Hung, Hossam Fahmy, Oskar Mencer, Michael J. Flynn

Computer Systems Laboratory

Stanford University, CA 94305  email: {hung, hfahmy, oskar, flynn}@arithmetic.stanford.edu

Abstract

This paper presents a new division algorithm that requires two multiplication operations and a single lookup in a small table. The division takes two steps: the table lookup and the first multiplication are processed concurrently in the first step, and the second multiplication is executed in the next step. This divider uses a single multiplier and a lookup table with $2^m(2m+1)$ bits to produce $2m$-bit results that are guaranteed correct to one ulp. By using a multiplier and a 12.5 KB lookup table, the basic algorithm generates a 24-bit result in two cycles.

1. Introduction

Division is an important operation in many areas of computing, such as networking and numerical and scientific applications. In general, division algorithms may be divided into five categories: digit recurrence, functional iteration, high radix, table lookup, and variable latency. These algorithms differ in overall latency and area requirements. An overview of division algorithms can be found in [4].

This paper introduces a new high radix division algorithm based on the well-known Taylor series expansion. A number of high radix division algorithms based on the Taylor series were also proposed in the past. For example, Farmwald [2] proposed using multiple tables to look up the first few terms in the Taylor series. Later, Wong [5] proposed an elaborate iterative quotient approximation with multiple lookup tables. Wong demonstrated that only the first two terms in the Taylor series are necessary to achieve fast division, because of the time needed to evaluate all the power terms.

The previous algorithms consider each individual term in the Taylor series separately; hence, many lookup tables are needed and the designs are complicated. Our proposed algorithm combines the first two terms of the Taylor series, and only requires a small lookup table to generate accurate results. This algorithm achieves fast division by multiplying the dividend in the first step, which is done in parallel with the table lookup. In the second step, another multiplication operation is executed to generate the quotient.

2. Basic Algorithm

Let $X$ and $Y$ be two $2m$-bit fixed point numbers between one and two, defined by Equations 1 and 2, where $x_i, y_i \in \{0, 1\}$.

$$X = 1 + 2^{-1} x_1 + 2^{-2} x_2 + \cdots + 2^{-(2m-1)} x_{2m-1} \tag{1}$$

$$Y = 1 + 2^{-1} y_1 + 2^{-2} y_2 + \cdots + 2^{-(2m-1)} y_{2m-1} \tag{2}$$

To calculate $X/Y$, $Y$ is first decomposed into two groups: the higher order bits ($Y_h$) and the lower order bits ($Y_l$). $Y_h$ contains the $m+1$ most significant bits and $Y_l$ contains the remaining $m-1$ bits.

$$Y_h = 1 + 2^{-1} y_1 + \cdots + 2^{-(m-1)} y_{m-1} + 2^{-m} y_m \tag{3}$$

$$Y_l = 2^{-(m+1)} y_{m+1} + \cdots + 2^{-(2m-1)} y_{2m-1} \tag{4}$$

The range of $Y_h$ is between $1$ and $2 - 2^{-m}$ ($Y_{hmax}$), and the range of $Y_l$ is between $0$ and $2^{-m} - 2^{-(2m-1)}$ ($Y_{lmax}$).

Dividing $X$ by $Y$, we get Equations 5 and 6. Since $Y_h > 2^m \cdot Y_l$, the maximum fractional error in Equation 6, which is $Y_l^2 / Y_h^2$, is less than $2^{-2m}$ (or 1/2 ulp).

$$\frac{X}{Y} = \frac{X}{Y_h + Y_l} = \frac{X (Y_h - Y_l)}{Y_h^2 - Y_l^2} \tag{5}$$

$$\frac{X}{Y} \approx \frac{X (Y_h - Y_l)}{Y_h^2} \tag{6}$$

Using Taylor series, Equation 5 can be expanded at $Y_l / Y_h$ as in Equation 7. The approximation in Equation 6 is equivalent to combining the first two terms in the Taylor series.

$$\frac{X}{Y} = \frac{X}{Y_h + Y_l} = \frac{X}{Y_h} \left( 1 - \frac{Y_l}{Y_h} + \frac{Y_l^2}{Y_h^2} - \cdots \right) \tag{7}$$

Figure 1 shows the block diagram of the algorithm. In the first step, the algorithm retrieves the value of $1/Y_h^2$ from a lookup table and multiplies $X$ with $(Y_h - Y_l)$ at the same time. In the second step, $1/Y_h^2$ and $X (Y_h - Y_l)$ are multiplied together to generate the result.

Figure 1: Basic Algorithm
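The two steps can be sketched in a few lines of Python. This is an illustrative model only, not the authors' hardware: the function names `split_divisor` and `two_step_divide` are hypothetical, operands are passed as $2m$-bit integers scaled by $2^{2m-1}$, and the table is assumed to hold $1/Y_h^2$ with infinite precision so that only the approximation error of Equation 6 is visible.

```python
from fractions import Fraction


def split_divisor(y_bits, m):
    """Split a 2m-bit divisor (integer scaled by 2**(2m-1)) into (Y_h, Y_l)."""
    scale = 1 << (2 * m - 1)
    low_mask = (1 << (m - 1)) - 1            # the m-1 low-order bits form Y_l
    y_l = Fraction(y_bits & low_mask, scale)
    y_h = Fraction(y_bits & ~low_mask, scale)
    return y_h, y_l


def two_step_divide(x_bits, y_bits, m):
    """Approximate X/Y as X * (Y_h - Y_l) * (1 / Y_h**2), as in Equation 6."""
    x = Fraction(x_bits, 1 << (2 * m - 1))
    y_h, y_l = split_divisor(y_bits, m)
    # Step 1: the table lookup and the first multiplication are independent,
    # so hardware can perform them concurrently.
    table_value = 1 / (y_h * y_h)            # assumption: exact table entry
    product = x * (y_h - y_l)
    # Step 2: the second multiplication produces the quotient estimate.
    return product * table_value


if __name__ == "__main__":
    m = 3
    x_bits, y_bits = 0b101101, 0b110011      # X = 1.01101b, Y = 1.10011b
    estimate = two_step_divide(x_bits, y_bits, m)
    exact = Fraction(x_bits, y_bits)
    print(float(estimate), float(exact), float(1 - estimate / exact))
```

With exact arithmetic the estimate never exceeds the true quotient, and its fractional shortfall is $Y_l^2 / Y_h^2$, which stays below $2^{-2m}$.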

2.1. Lookup Table Construction

To minimize the size of the lookup table, the table entries are normalized such that the most significant bit (MSB) of each entry is one. These MSBs are therefore not stored in the table.

A lookup table with $m = 3$ is shown in Table 1. The second column gives the value of $1/Y_h^2$ truncated to $2m+2$ significant bits. The exponent part of $1/Y_h^2$ may be stored in the same table, but can also be determined by some simple logic gates. In this example, the exponent is 100 when $y_1 = y_2 = y_3 = 0$, the exponent is 010 when $y_1 = 0$ and $y_2 \lor y_3 = 1$, and the exponent is 001 when $y_1 = 1$.

Table 1: A simple lookup table example ($m = 3$)

Y_h      1/Y_h^2 (truncated)    Table entry
1.000    10000000 x 100         0000000
1.001    11001010 x 010         1001010
1.010    10100011 x 010         0100011
1.011    10000111 x 010         0000111
1.100    11100011 x 001         1100011
1.101    11000001 x 001         1000001
1.110    10100111 x 001         0100111
1.111    10010001 x 001         0010001
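The construction of such a table can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_reciprocal_square_table` is hypothetical, and the exponent is reported as a shift count (0, 1, or 2) rather than as the one-hot codes 100, 010, and 001 used in Table 1.

```python
from fractions import Fraction
from math import floor


def build_reciprocal_square_table(m):
    """Return one row per Y_h: (Y_h, mantissa bits, exponent shift, stored bits)."""
    sig_bits = 2 * m + 2                       # significant bits per entry
    rows = []
    for index in range(1 << m):                # index encodes y_1 .. y_m
        y_h = Fraction((1 << m) + index, 1 << m)
        value = 1 / (y_h * y_h)                # lies in (1/4, 1]
        shift = 0
        while value < 1:                       # normalize so that the MSB is one
            value *= 2
            shift += 1
        mantissa = floor(value * (1 << (sig_bits - 1)))    # truncate, do not round
        stored = mantissa & ((1 << (sig_bits - 1)) - 1)    # drop the implicit MSB
        rows.append((
            "1." + format(index, f"0{m}b"),
            format(mantissa, f"0{sig_bits}b"),
            shift,
            format(stored, f"0{sig_bits - 1}b"),
        ))
    return rows


if __name__ == "__main__":
    for row in build_reciprocal_square_table(3):
        print(*row)
```

For $m = 3$ the mantissa and stored columns reproduce the bit patterns of Table 1; a shift of 0, 1, or 2 corresponds to the exponent codes 100, 010, and 001.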

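2.2. Booth Encoding

The Booth encoding algorithm [1] has been widely used to minimize the number of partial product terms in a multiplier. In our division algorithm, special Booth encoders are needed to achieve the $X (Y_h - Y_l)$ multiplication without explicitly calculating the value of $(Y_h - Y_l)$. Lyu and Matula [3] proposed a general redundant binary Booth recoding scheme. In our case, the $Y_h$ and $Y_l$ bits are non-overlapping, and a cheaper and faster encoding scheme is feasible.

We use Booth 2 encoding to illustrate our encoding algorithm, but the same principle can apply to the other Booth encoding schemes. In Booth 2 encoding, the multiplier is partitioned into overlapping strings of 3 bits, and each string is used to select a single partial product.

Unlike conventional Booth 2 encoding, the encoding of $(Y_h - Y_l)$ consists of four types of encoders. Figure 2 shows the locations of these four types of encoders: the $Y_l$ group contains all the 3-bit strings that reside entirely within $Y_l$; the boundary string contains some $Y_l$ bits as well as some $Y_h$ bits; the first $Y_h$ string is located next to the boundary string; the $Y_h$ group contains all the remaining strings within $Y_h$.

Figure 2: Booth Encoding

The $Y_h$ bits represent positive numbers, whereas the $Y_l$ bits represent negative numbers. Hence, conventional Booth 2 encoding is used in the $Y_h$ group but the partial products in the $Y_l$ group are negated. As shown in the diagram, the boundary region between $Y_h$ and $Y_l$ requires two additional special encoders. Depending on whether $m$ is even or odd, the encoding schemes for these two encoders are different. It is possible to use only one such encoder in the boundary region, but this implies that the encoder needs to generate the $3\times$ multiplicand (for even $m$). In order to speed up the multiplication and simplify the encoding logic, two special encoders are used to avoid the "difficult" multiples.

Table 2 summarizes the four different encoding schemes for both even and odd $m$. It is important to note that the first $Y_h$ encoder actually needs to examine both the first $Y_h$ string and the boundary string when $m$ is odd. If the boundary string is 101, the LSB of the first $Y_h$ string is set to 0 instead of 1. If the boundary string is not 101, the LSB of the first $Y_h$ string is set to be the MSB of the boundary string (as usual). This encoding scheme uses normal Booth encoders in all but two positions and is particularly useful if the same multiplier hardware is used for both the first and the second multiplications.

Table 2: Booth Encoding of $(Y_h - Y_l)$

Bits   Y_h Group   Y_l Group   Boundary          First Y_h
                               even m   odd m    even m   odd m
000       0           0           0       0         0       0
001      +1          -1          -1      -1         0      +1
010      +1          -1          -1      +1        +1      +1
011      +2          -2          -2       0        +1      +2
100      -2          +2          +2      -2        -2      -2
101      -1          +1          +1      +1        -2      -1
110      -1          +1          +1      -1        -1      -1
111       0           0           0      -2        -1       0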
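One way to sanity-check the four encoder types is to recode every $2m$-bit divisor and confirm that the radix-4 digits sum to $Y_h - Y_l$. The sketch below is an interpretation, not the authors' circuit: the digit tables follow the reading of Table 2 given in its docstrings, the placement of the 3-bit strings (string $k$ covers integer bit positions $2k+1$, $2k$, $2k-1$ of $Y$, with a zero below the LSB and zeros above the MSB) is an assumption, and the name `recode_yh_minus_yl` is hypothetical.

```python
CONVENTIONAL = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
NEGATED = {bits: -digit for bits, digit in CONVENTIONAL.items()}
BOUNDARY_ODD = {0b000: 0, 0b001: -1, 0b010: 1, 0b011: 0,
                0b100: -2, 0b101: 1, 0b110: -1, 0b111: -2}
FIRST_YH_EVEN = {0b000: 0, 0b001: 0, 0b010: 1, 0b011: 1,
                 0b100: -2, 0b101: -2, 0b110: -1, 0b111: -1}


def recode_yh_minus_yl(y_bits, m):
    """Return radix-4 digits (least significant first) whose weighted sum is
    Y_h - Y_l, where y_bits is the 2m-bit integer form of Y."""
    def bit(i):
        return (y_bits >> i) & 1 if i >= 0 else 0

    odd = m % 2 == 1
    boundary = (m - 1) // 2 if odd else (m - 2) // 2   # index of boundary string
    digits, clear_lsb = [], False
    for k in range(m + 1):
        string = (bit(2 * k + 1) << 2) | (bit(2 * k) << 1) | bit(2 * k - 1)
        if k < boundary:                     # Y_l group: negated Booth digits
            digit = NEGATED[string]
        elif k == boundary:                  # boundary string
            digit = BOUNDARY_ODD[string] if odd else NEGATED[string]
            clear_lsb = odd and string == 0b101
        elif k == boundary + 1:              # first Y_h string
            if odd:
                # After a 101 boundary string the overlap bit is forced to 0.
                digit = CONVENTIONAL[string & 0b110 if clear_lsb else string]
            else:
                digit = FIRST_YH_EVEN[string]
        else:                                # Y_h group: conventional Booth 2
            digit = CONVENTIONAL[string]
        digits.append(digit)
    return digits


if __name__ == "__main__":
    m, y_bits = 3, 0b101010
    digits = recode_yh_minus_yl(y_bits, m)
    print(digits, sum(d * 4 ** k for k, d in enumerate(digits)))
```

Every digit stays in the range -2 to +2, so no $3\times$ multiple of the multiplicand is ever requested.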

2.3. Error Analysis

There are four sources of errors: Taylor series approximation error ($E_1$), lookup table rounding error ($E_T$), the rounding error of the first multiplication ($E_{M1}$), and the rounding error of the second multiplication ($E_{M2}$).

The total error is equal to $E_1 + E_T + E_{M1} + E_{M2}$. To minimize this error, the divider can be designed such that $E_1 \le 0$, $E_T \le 0$, $E_{M1} \le 0$, and $E_{M2} \ge 0$. This means that the table entries are truncated to $2m+2$ bits, the first multiplication is truncated to $2m+2$ bits, and the second multiplication is rounded up to $2m$ bits.

The Taylor series approximation error ($E_1$) is determined by Equation 8. This error is most significant for large $Y_l$ and small $Y_h$. The maximum approximation error is slightly less than 1/2 ulp when $Y_l = Y_{lmax}$ and $Y_h = 1$.

$$E_1 = \frac{X (Y_h - Y_l)}{Y_h^2} - \frac{X}{Y} = -\frac{X \cdot Y_l^2}{Y_h^2 \cdot Y} \tag{8}$$

The lookup table has $2m+2$ significant bits, so the maximum truncation error is $E_T = -1/4$ ulp. Similarly, the maximum rounding error for the first multiplication is $E_{M1} = -1/4$ ulp, and the maximum rounding error for the second multiplication is $E_{M2} = 1$ ulp. Thus, the maximum positive error is less than 1 ulp ($E_{M2}$), and the maximum negative error is also less than 1 ulp ($E_1 + E_T + E_{M1}$).

3. Optimization Techniques

This section describes two optimization techniques for the division algorithm. The first technique uses a slightly different lookup table and allows the two multiplications to use the same rounding mode, whereas the second technique uses an error compensation term to further reduce the Taylor series approximation error.

3.1. Alternative Lookup Table

As described in Section 2.3, the rounding modes of the first and the second multiplications are different. This may be undesirable if the two multiplications need to share the same multiplier. A simple solution is to use round-to-nearest mode in the two multiplications as well as in constructing the lookup table. Since the error terms can then be either positive or negative, the maximum total error becomes the sum of the maximum magnitudes of the individual error terms.

Let $T(Y_h)$ be the table entry at $Y_h$ with infinite precision. The expression for the approximation error $E_1$ is shown in Equation 9 below.

$$E_1 = X (Y_h - Y_l) \, T(Y_h) - \frac{X}{Y_h + Y_l} \tag{9}$$

In order to minimize $E_1$, $T(Y_h)$ is set to be slightly larger than $1/Y_h^2$. For each $Y_h$, the optimum table entry is determined by setting the maximum positive error (at $Y_l = 0$) to be the same as the maximum negative error (at $Y_l = Y_{lmax}$). Equation 10 shows the expression for the optimum table entry $T(Y_h)_{opt}$.

$$T(Y_h)_{opt} = \frac{2 Y_h + Y_{lmax}}{Y_h (Y_h + Y_{lmax})(2 Y_h - Y_{lmax})} \tag{10}$$

The approximation error is at its maximum when $Y_h = 1$ and $Y_l = Y_{lmax}$. Using Equation 9, the maximum approximation error can easily be derived as in Equation 11. In this case, $E_1$ is slightly less than 1/4 ulp.

$$|E_1|_{max} = \frac{X \cdot Y_{lmax}^2}{(1 + Y_{lmax})(2 - Y_{lmax})} \tag{11}$$

Using round-to-nearest rounding mode, $E_{M1} = \pm 1/8$ ulp, $E_{M2} = \pm 1/2$ ulp, and $E_T = \pm 1/8$ ulp. As in Section 2.3, the total error of the alternative lookup table is also less than 1 ulp. Table 3 illustrates the same example shown in Section 2.1 with the alternative lookup table, where $RN(T(Y_h))$ represents the round-to-nearest value of $T(Y_h)$ to $2m+2$ significant bits.

Table 3: Alternative Lookup Table ($m = 3$)

Y_h      RN(T(Y_h))           Table entry
1.000    10000001 x 100       0000001
1.001    11001011 x 010       1001011
1.010    10100101 x 010       0100101
1.011    10001000 x 010       0001000
1.100    11100100 x 001       1100100
1.101    11000010 x 001       1000010
1.110    10101000 x 001       0101000
1.111    10010010 x 001       0010010
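The balancing argument behind the optimum table entry can be checked with exact rational arithmetic. The sketch below is illustrative only: the helper names `optimum_entry` and `approx_error` are hypothetical, $X$ is fixed to 1, and $Y_{lmax} = 2^{-m} - 2^{-(2m-1)}$ is taken from the definition in Section 2.

```python
from fractions import Fraction


def optimum_entry(y_h, y_lmax):
    """Table entry that balances the error at Y_l = 0 against Y_l = Y_lmax."""
    return (2 * y_h + y_lmax) / (y_h * (y_h + y_lmax) * (2 * y_h - y_lmax))


def approx_error(x, y_h, y_l, table_entry):
    """Approximation error X * (Y_h - Y_l) * T(Y_h) - X / (Y_h + Y_l)."""
    return x * (y_h - y_l) * table_entry - x / (y_h + y_l)


if __name__ == "__main__":
    m = 3
    y_lmax = Fraction(1, 2 ** m) - Fraction(1, 2 ** (2 * m - 1))
    for index in range(1 << m):
        y_h = Fraction((1 << m) + index, 1 << m)
        t = optimum_entry(y_h, y_lmax)
        positive = approx_error(1, y_h, 0, t)        # error at Y_l = 0
        negative = approx_error(1, y_h, y_lmax, t)   # error at Y_l = Y_lmax
        print(float(y_h), float(t), float(positive), float(negative))
```

For every $Y_h$ the two printed errors have equal magnitude and opposite sign, and the entry is slightly larger than $1/Y_h^2$.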

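3.2. Error Compensation

The Taylor series approximation error can be further reduced by adding an error compensation term in the first multiplication. Equation 8 shows that the magnitude of the Taylor series approximation error ($E_1$) increases when either $Y_l$ gets larger or $Y_h$ gets smaller. By looking at the first few bits of $Y_h$ and $Y_l$, it is possible to identify large $Y_l$ and small $Y_h$, and then compensate for the approximation error. However, it is important to ensure that the approximation error is not over-compensated so that it becomes positive; otherwise this would increase the total error (Section 2.3).

Figure 3 depicts a simple error compensation scheme. In this diagram, $\delta(Y)$ represents the positive error compensation that is used to correct the negative Taylor series approximation error. The new approximation equation becomes:

$$\frac{X}{Y} \approx \frac{X (Y_h - Y_l + \delta(Y))}{Y_h^2} \tag{12}$$

Figure 3: Error Compensation

The expression for $\delta(Y)$ is shown in Equation 13. Depending on the first few bits of $Y_h$ and $Y_l$, $\delta(Y)$ is set to $0$, $2^{-(2m+2)}$, or $2^{-(2m+1)}$.

$$\delta(Y) = 2^{-(2m+2)} \cdot y_{m+1} \cdot y_{m+2} \, (1 + \bar{y}_1 \cdot \bar{y}_2 \cdot y_{m+3}) \tag{13}$$

The error compensation only requires an additional partial product in the multiplier and a few simple logic gates (shown in Equation 13). The maximum Taylor series approximation error is 9/32 ulp, reached when $Y_h = 1$ and $Y_l$ is just below $\tfrac{3}{4} \cdot 2^{-m}$, the largest value that receives no compensation. Figure 4 shows the approximation error with different $Y_l$.

Figure 4: Approximation Error with Error Compensation

A similar error compensation scheme can also be used for the alternative lookup table shown in Section 3.1. The expressions for the new approximation and the error compensation are determined by Equations 14 and 15, respectively. In this case, $\delta(Y)$ is set to be $-2^{-(2m+2)}$, $0$, or $2^{-(2m+2)}$ depending on the first few bits in $Y_h$ and $Y_l$.

$$\frac{X}{Y} \approx \frac{X (2 Y_h + Y_{lmax})(Y_h - Y_l + \delta(Y))}{Y_h (Y_h + Y_{lmax})(2 Y_h - Y_{lmax})} \tag{14}$$

$$\delta(Y) = 2^{-(2m+2)} \cdot (\,\ldots\,) \tag{15}$$

(The bit-level expression inside the parentheses of Equation 15, a combination of $y_1$, $y_{m+1}$, and $y_{m+2}$, is illegible in the source.)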
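The effect of such a compensation term can be explored numerically. The sketch below rests on an assumption: it takes the compensation rule to be $2^{-(2m+2)} \cdot y_{m+1} \cdot y_{m+2} \, (1 + \bar{y}_1 \cdot \bar{y}_2 \cdot y_{m+3})$, which yields the three values $0$, $2^{-(2m+2)}$, and $2^{-(2m+1)}$ named in the text; the function names are hypothetical, and $m \ge 4$ is required so that bit $y_{m+3}$ exists.

```python
from fractions import Fraction


def compensation(y_bits, m):
    """Assumed compensation rule; returns 0, 2**-(2m+2), or 2**-(2m+1)."""
    def y(i):                                 # bit y_i, which has weight 2**-i
        return (y_bits >> (2 * m - 1 - i)) & 1

    unit = Fraction(1, 1 << (2 * m + 2))
    large_yl = y(m + 1) & y(m + 2)            # Y_l >= 3/4 * 2**-m
    boost = (1 - y(1)) & (1 - y(2)) & y(m + 3)    # small Y_h, even larger Y_l
    return unit * large_yl * (1 + boost)


def relative_error(y_bits, m):
    """Fractional error of X*(Y_h - Y_l + delta)/Y_h**2 against X/Y."""
    scale = 1 << (2 * m - 1)
    low_mask = (1 << (m - 1)) - 1
    y_l = Fraction(y_bits & low_mask, scale)
    y_h = Fraction(y_bits & ~low_mask, scale)
    delta = compensation(y_bits, m)
    return (y_h - y_l + delta) * (y_h + y_l) / (y_h * y_h) - 1


if __name__ == "__main__":
    m = 4
    worst = min(relative_error(y, m) for y in range(1 << (2 * m - 1), 1 << (2 * m)))
    print(float(worst * 2 ** (2 * m - 1)), "ulp")   # ulp taken as 2**-(2m-1)
```

Under this assumed rule the error is never positive, so the approximation is not over-compensated, and its magnitude stays below 9/32 ulp.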

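4. Discussion

This paper presents a simple and fast division algorithm based on Taylor series expansion. Using a multiplier and a lookup table with $2^m(2m+1)$ bits, this algorithm produces a $2m$-bit result in two steps. For example, a 12.5 KB lookup table is required for single precision (24 bits) floating point division.

The same principle can be applied to some elementary functions, such as square root. Using the same definitions of $Y$, $Y_h$, and $Y_l$ as before, we get the following approximation.

$$\frac{1}{\sqrt{Y}} = \frac{1}{\sqrt{Y_h}} \left( 1 - \frac{Y_l}{2 Y_h} + \frac{3 Y_l^2}{8 Y_h^2} - \cdots \right) \approx \frac{Y_h - Y_l / 2}{Y_h^{3/2}} \tag{16}$$

This is very similar to the approximation used in the division algorithm. The differences are that the $Y_l$ term is shifted by one bit in the numerator and the lookup table contains $1/Y_h^{3/2}$ entries instead of $1/Y_h^2$ entries.

We can also combine more than the first two terms in the Taylor series expansion. For example, if we use the first four terms in the expansion, we get the following approximation.

$$\frac{X}{Y} \approx \frac{X (Y_h - Y_l)}{Y_h^2} \left( 1 + \frac{Y_l^2}{Y_h^2} \right) \tag{17}$$

As before, only a single $1/Y_h^2$ lookup table is needed, but this algorithm also needs to calculate $Y_l^2$. Our next step is to generalize the existing algorithm and to investigate the optimum Taylor series approximation for different input precisions.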
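As a short check on the four-term approximation (a derivation sketch, writing $r = Y_l / Y_h$, a symbol introduced here only for brevity), dividing the approximation by the exact quotient shows how much the fractional error shrinks:

```latex
\begin{align*}
\frac{X (Y_h - Y_l)}{Y_h^2}\left(1 + \frac{Y_l^2}{Y_h^2}\right) \Big/ \frac{X}{Y_h + Y_l}
  &= \frac{(Y_h^2 - Y_l^2)(Y_h^2 + Y_l^2)}{Y_h^4} \\
  &= 1 - r^4, \qquad r = \frac{Y_l}{Y_h} < 2^{-m}.
\end{align*}
```

The fractional error is therefore below $2^{-4m}$, compared with $2^{-2m}$ for the two-term form, before any table or rounding errors are counted.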

5. References

[1] A. D. Booth. A signed binary multiplication technique. Quarterly Journal of Mechanics and Applied Mathematics, pages 236–240, June 1951.

[2] P. M. Farmwald. On the design of high performance digital arithmetic units. PhD thesis, Stanford University, 1981.

[3] C. N. Lyu and D. Matula. Redundant binary Booth recoding. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 50–57, July 1995.

[4] Stuart F. Oberman and Michael J. Flynn. Division algorithms and implementations. IEEE Transactions on Computers, 46(8):833–854, August 1997.

[5] Derek Wong and Michael J. Flynn. Fast division using accurate quotient approximations to reduce the number of iterations. IEEE Transactions on Computers, pages 981–995, August 1992.