Fast Division Algorithm with a Small Lookup Table

Patrick Hung, Hossam Fahmy, Oskar Mencer, Michael J. Flynn
Computer Systems Laboratory, Stanford University, CA 94305
email: {hung, hfahmy, oskar, flynn}@arithmetic.stanford.edu

Abstract

This paper presents a new division algorithm, which requires two multiplication operations and a single lookup in a small table. The division algorithm takes two steps. The table lookup and the first multiplication are processed concurrently in the first step, and the second multiplication is executed in the next step. This divider uses a single multiplier and a lookup table with $2^m(2m+1)$ bits to produce $2m$-bit results that are guaranteed correct to one ulp. By using a multiplier and a 12.5 KB lookup table, the basic algorithm generates a 24-bit result in two cycles.

1. Introduction

Division is an important operation in many areas of computing, such as signal processing, computer graphics, networking, and numerical and scientific applications. In general, division algorithms may be divided into five categories: digit recurrence, functional iteration, high radix, table lookup, and variable latency. These algorithms differ in overall latency and area requirements. An overview of division algorithms can be found in [4].

This paper introduces a new high radix division algorithm based on the well-known Taylor series expansion. A number of high radix division algorithms based on the Taylor series were proposed in the past. For example, Farmwald [2] proposed using multiple tables to look up the first few terms in the Taylor series. Later, Wong [5] proposed an elaborate iterative quotient approximation with multiple lookup tables. Wong demonstrated that only the first two terms in the Taylor series are necessary to achieve fast division because of the time to evaluate all the power terms.

The previous algorithms consider each individual term in the Taylor series separately; hence, many lookup tables are needed and the designs are complicated. Our proposed algorithm combines the first two terms of the Taylor series together, and only requires a small lookup table to generate accurate results. This algorithm achieves fast division by multiplying the dividend in the first step, which is done in parallel with the table lookup. In the second step, another multiplication operation is executed to generate the quotient.

2. Basic Algorithm

Let $X$ and $Y$ be two $2m$-bit fixed point numbers between one and two defined by Equations 1 and 2, where $x_i, y_i \in \{0, 1\}$.

$$X = 1 + 2^{-1} x_1 + 2^{-2} x_2 + \cdots + 2^{-(2m-1)} x_{2m-1} \tag{1}$$

$$Y = 1 + 2^{-1} y_1 + 2^{-2} y_2 + \cdots + 2^{-(2m-1)} y_{2m-1} \tag{2}$$

To calculate $X/Y$, $Y$ is first decomposed into two groups: the higher order bits ($Y_h$) and the lower order bits ($Y_l$). $Y_h$ contains the $m+1$ most significant bits and $Y_l$ contains the remaining $m-1$ bits.

$$Y_h = 1 + 2^{-1} y_1 + \cdots + 2^{-(m-1)} y_{m-1} + 2^{-m} y_m \tag{3}$$

$$Y_l = 2^{-(m+1)} y_{m+1} + \cdots + 2^{-(2m-1)} y_{2m-1} \tag{4}$$

The range of $Y_h$ is between $1$ and $Y_{hmax}$ ($= 2 - 2^{-m}$), and the range of $Y_l$ is between $0$ and $Y_{lmax}$ ($= 2^{-m} - 2^{-(2m-1)}$).

Dividing $X$ by $Y$, we get Equations 5 and 6. Since $Y_h > 2^m \cdot Y_l$, the maximum fractional error in Equation 6 is less than $2^{-2m}$ (or $\tfrac{1}{2}$ ulp).

$$\frac{X}{Y} = \frac{X}{Y_h + Y_l} = \frac{X (Y_h - Y_l)}{Y_h^2 - Y_l^2} \tag{5}$$

$$\frac{X}{Y} \approx \frac{X (Y_h - Y_l)}{Y_h^2} \tag{6}$$

Using Taylor series, Equation 5 can be expanded at $Y_l/Y_h$ as in Equation 7. The approximation in Equation 6 is equivalent to combining the first two terms in the Taylor series.

$$\frac{X}{Y} = \frac{X}{Y_h + Y_l} = \frac{X}{Y_h} \left( 1 - \frac{Y_l}{Y_h} + \frac{Y_l^2}{Y_h^2} - \cdots \right) \tag{7}$$

Figure 1 shows the block diagram of the algorithm. In the first step, the algorithm retrieves the value of $1/Y_h^2$ from a lookup table and multiplies $X$ with $(Y_h - Y_l)$ at the same time. In the second step, $1/Y_h^2$ and $X \cdot (Y_h - Y_l)$ are multiplied together to generate the result.

Figure 1: Basic Algorithm

2.1. Lookup Table Construction

To minimize the size of the lookup table, the table entries are normalized such that the most significant bit (MSB) of each entry is one. These MSBs are therefore not stored in the table.

A lookup table with $m = 3$ is shown in Table 1. $\lfloor 1/Y_h^2 \rfloor$ represents the truncated value of $1/Y_h^2$ to $2m+2$ significant bits. The exponent part of $1/Y_h^2$ may be stored in the same table, but can also be determined by some simple logic gates. In this example, the exponent is 100 when $y_1 = y_2 = y_3 = 0$, the exponent is 010 when $y_1 = 0$ and at least one of $y_2$, $y_3$ is 1, and the exponent is 001 when $y_1 = 1$.

Table 1: A simple lookup table example ($m = 3$)

  $Y_h$    $\lfloor 1/Y_h^2 \rfloor$    Table entry
  1.000    10000000 × 100               0000000
  1.001    11001010 × 010               1001010
  1.010    10100011 × 010               0100011
  1.011    10000111 × 010               0000111
  1.100    11100011 × 001               1100011
  1.101    11000001 × 001               1000001
  1.110    10100111 × 001               0100111
  1.111    10010001 × 001               0010001

2.2. Booth Encoding

The Booth encoding algorithm [1] has been widely used to minimize the number of partial product terms in a multiplier. In our division algorithm, special Booth encoders are needed to achieve the $X \times (Y_h - Y_l)$ multiplication without explicitly calculating the value of $(Y_h - Y_l)$. Lyu and Matula [3] proposed a general redundant binary Booth recoding scheme. In our case, the $Y_h$ and $Y_l$ bits are non-overlapping, and a cheaper and faster encoding scheme is feasible.

We use Booth 2 encoding to illustrate our encoding algorithm, but the same principle can apply to the other Booth encoding schemes. In Booth 2 encoding, the multiplier is partitioned into overlapping strings of 3 bits, and each string is used to select a single partial product.

Unlike conventional Booth 2 encoding, the encoding of $(Y_h - Y_l)$ consists of four types of encoders. Figure 2 shows the locations of these four types of encoders: the $Y_l$ group contains all the 3-bit strings that reside entirely within $Y_l$; the boundary string contains some $Y_l$ bits as well as some $Y_h$ bits; the first string is located next to the boundary string; the $Y_h$ group contains all the remaining strings within $Y_h$.

Figure 2: Booth Encoding

The $Y_h$ bits represent positive numbers, whereas the $Y_l$ bits represent negative numbers. Hence, conventional Booth 2 encoding is used in the $Y_h$ group but the partial products in the $Y_l$ group are negated. As shown in the diagram, the boundary region between $Y_h$ and $Y_l$ requires two additional special encoders. Depending on whether $m$ is even or odd, the encoding schemes for these two encoders are different. It is possible to use only one such encoder in the boundary region, but this encoder would then need to generate the $3\times$ multiplicand (for even $m$). In order to speed up the multiplication and simplify the encoding logic, two special encoders are used to avoid the "difficult" multiples.

Table 2 summarizes the four different encoding schemes for both even and odd $m$. It is important to note that the first encoder actually needs to examine both the first string and the boundary string when $m$ is odd. If the boundary string is 101, the LSB of the first string is set to 0 instead of 1. If the boundary string is not 101, the LSB of the first string is set to be the MSB of the boundary string (as usual). In this encoding scheme all but two of the encoders are normal Booth encoders, which is particularly useful if the same multiplier hardware is used for both the first and the second multiplications.

Table 2: Booth Encoding of $(Y_h - Y_l)$

                                  Boundary            First
  Bits   $Y_h$ Group  $Y_l$ Group  even m   odd m    even m   odd m
  000        0            0          0        0        0        0
  001       +1           -1         -1       -1        0       +1
  010       +1           -1         -1       +1       +1       +1
  011       +2           -2         -2        0       +1       +2
  100       -2           +2         +2       -2       -2       -2
  101       -1           +1         +1       +1       -2       -1
  110       -1           +1         +1       -1       -1       -1
  111        0            0          0       -2       -1        0

2.3. Error Analysis

There are four sources of errors: the Taylor series approximation error ($E_1$), the lookup table rounding error ($E_T$), the rounding error of the first multiplication ($E_{M1}$), and the rounding error of the second multiplication ($E_{M2}$).

The total error is equal to $E_1 + E_T + E_{M1} + E_{M2}$. To minimize this error, the divider can be designed such that $E_1 \le 0$, $E_T \le 0$, $E_{M1} \le 0$, and $E_{M2} \ge 0$. This means that the table entries are truncated to $2m+2$ bits, the first multiplication is truncated to $2m+2$ bits, and the second multiplication is rounded up to $2m$ bits.

The Taylor series approximation error ($E_1$) is determined by Equation 8. This error is most significant for large $Y_l$ and small $Y_h$. The maximum approximation error is slightly less than $\tfrac{1}{2}$ ulp when $Y_l = Y_{lmax}$ and $Y_h = 1$.

$$E_1 = \frac{Y_h - Y_l}{Y_h^2} - \frac{1}{Y_h + Y_l} = -\frac{Y_l^2}{Y_h^2 \cdot (Y_h + Y_l)} \tag{8}$$

The lookup table has $2m+2$ significant bits, so the maximum truncation error $E_T$ is $\tfrac{1}{4}$ ulp. Similarly, the maximum rounding error for the first multiplication $E_{M1}$ is $\tfrac{1}{4}$ ulp, and the maximum rounding error for the second multiplication $E_{M2}$ is 1 ulp. Thus, the maximum positive error is less than 1 ulp ($E_{M2}$), and the maximum negative error is also less than 1 ulp ($E_1 + E_T + E_{M1}$).

3. Optimization

In order to minimize $E_1$, the table entry $T(Y_h)$ is set to be slightly larger than $1/Y_h^2$. For each $Y_h$, the optimum table entry is determined by setting the maximum positive error (at $Y_l = 0$) to be the same as the maximum negative error (at $Y_l = Y_{lmax}$). Equation 10 shows the expression for the optimum table entry $T(Y_h)_{opt}$.

$$T(Y_h)_{opt} = \frac{2 Y_h + Y_{lmax}}{Y_h \, (Y_h + Y_{lmax})(2 Y_h - Y_{lmax})} \tag{10}$$

The approximation error is at its maximum when $Y_h = 1$ and $Y_l = Y_{lmax}$. Using Equation 9, the maximum approximation error can easily be derived as in Equation 11. In this case, $E_1$ is slightly less than $\tfrac{1}{4}$ ulp.

$$|E_1|_{max} = \frac{Y_{lmax}^2}{(1 + Y_{lmax})(2 - Y_{lmax})} \tag{11}$$

Using round-to-nearest rounding mode, $E_{M1}$, $E_T$, and $E_{M2}$ are each bounded by a fraction of an ulp. As in Section 2.3, the total error of the alternative lookup table is also less than 1 ulp. Table 3 illustrates the same example shown in Section 2.1 with the alternative lookup table, where $RN(T(Y_h))$ represents the round-to-nearest value of $T(Y_h)$ to $2m+2$ significant bits.

Table 3: Alternative Lookup Table ($m = 3$)

  $Y_h$    $RN(T(Y_h))$        Table entry
  1.000    10000001 × 100      0000001
  1.001    11001011 × 010      1001011
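The decomposition of Section 2 and the table construction of Section 2.1 can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the authors' hardware design: the function names (`split_divisor`, `table_entry`, `approx_quotient`, `worst_fractional_error`) are hypothetical, exact rational arithmetic stands in for the multiplier, and the two multiplications are left unrounded so that only the Taylor approximation of Equation 6 is measured.

```python
from fractions import Fraction


def split_divisor(y_bits, m):
    """Split Y into Y_h (Eq. 3) and Y_l (Eq. 4).

    y_bits holds the 2m-1 fraction bits y_1 .. y_{2m-1} of Y as an
    integer, with y_1 as the most significant bit (an assumed encoding).
    """
    assert 0 <= y_bits < 2 ** (2 * m - 1)
    high = y_bits >> (m - 1)                 # y_1 .. y_m
    low = y_bits & ((1 << (m - 1)) - 1)      # y_{m+1} .. y_{2m-1}
    y_h = 1 + Fraction(high, 2 ** m)
    y_l = Fraction(low, 2 ** (2 * m - 1))
    return y_h, y_l


def table_entry(k, m):
    """Truncate 1/Y_h^2 to 2m+2 significant bits for Y_h = 1 + k/2^m.

    Returns (mantissa bit string, shift), where shift is the number of
    doublings needed to normalize 1/Y_h^2 into [1, 2). Shifts 0, 1, 2
    correspond to the exponent codes 100, 010, 001 of Table 1.
    """
    y_h = 1 + Fraction(k, 2 ** m)
    value = 1 / (y_h * y_h)                  # lies in (1/4, 1]
    shift = 0
    while value < 1:
        value *= 2
        shift += 1
    bits = 2 * m + 2
    mantissa = int(value * 2 ** (bits - 1))  # int() truncates
    return format(mantissa, "0{}b".format(bits)), shift


def approx_quotient(x, y_h, y_l):
    """Equation 6, evaluated exactly (no rounding modeled)."""
    return x * (y_h - y_l) / (y_h * y_h)


def worst_fractional_error(m):
    """Largest (exact - approx) / exact over every divisor Y.

    The fractional error does not depend on X, so X = 1 is used.
    """
    x = Fraction(1)
    worst = Fraction(0)
    for y_bits in range(2 ** (2 * m - 1)):
        y_h, y_l = split_divisor(y_bits, m)
        exact = x / (y_h + y_l)
        err = (exact - approx_quotient(x, y_h, y_l)) / exact
        worst = max(worst, err)
    return worst


if __name__ == "__main__":
    m = 3
    for k in range(2 ** m):
        print(k, table_entry(k, m))
    print(worst_fractional_error(m))
```

For $m = 3$ the mantissas produced by `table_entry` agree with the middle column of Table 1, and the worst fractional error is $(Y_{lmax}/1)^2$, which stays below the $2^{-2m}$ bound quoted after Equation 4. Modeling the truncation of the first multiplication and the round-up of the second would be needed before comparing against the full one-ulp claim.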
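The balancing argument behind the optimum table entry in Section 3 can also be checked numerically. The sketch below assumes the error being balanced is that of the reciprocal approximation, $(Y_h - Y_l) \cdot T(Y_h) - 1/(Y_h + Y_l)$, and uses hypothetical helper names (`optimum_entry`, `reciprocal_error`, `rounded_mantissa`); it is a consistency check under those assumptions rather than a reproduction of the authors' derivation.

```python
from fractions import Fraction


def optimum_entry(y_h, y_lmax):
    """Optimum table entry T(Y_h)_opt as written in Equation 10."""
    return (2 * y_h + y_lmax) / (y_h * (y_h + y_lmax) * (2 * y_h - y_lmax))


def reciprocal_error(t, y_h, y_l):
    """Assumed error model: (Y_h - Y_l) * T - 1 / (Y_h + Y_l)."""
    return (y_h - y_l) * t - Fraction(1) / (y_h + y_l)


def rounded_mantissa(t, m):
    """Round t to 2m+2 significant bits (round to nearest).

    Returns (mantissa bit string, shift) in the same convention as
    Table 3. Overflow of the mantissa after rounding is not handled;
    it does not occur for the entries checked here.
    """
    shift = 0
    while t < 1:
        t *= 2
        shift += 1
    bits = 2 * m + 2
    mantissa = int(t * 2 ** (bits - 1) + Fraction(1, 2))
    return format(mantissa, "0{}b".format(bits)), shift


if __name__ == "__main__":
    m = 3
    y_lmax = Fraction(1, 2 ** m) - Fraction(1, 2 ** (2 * m - 1))
    for k in range(2):
        y_h = 1 + Fraction(k, 2 ** m)
        t = optimum_entry(y_h, y_lmax)
        print(y_h, rounded_mantissa(t, m),
              float(reciprocal_error(t, y_h, Fraction(0))),
              float(reciprocal_error(t, y_h, y_lmax)))
```

With $m = 3$, the errors at $Y_l = 0$ and $Y_l = Y_{lmax}$ come out equal in magnitude and opposite in sign, and the rounded mantissas for $Y_h = 1.000$ and $Y_h = 1.001$ agree with the two rows of Table 3.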
