Design Issues in Floating-Point Division


Stuart F. Oberman and Michael J. Flynn

Technical Report: CSL-TR-94-647
December 1994

Computer Systems Laboratory
Departments of Electrical Engineering and Computer Science
Stanford University
Stanford, California 94305-4055

This work was supported by NSF under contract MIP93-13701.

Abstract

Floating-point division is generally regarded as a low-frequency, high-latency operation in typical floating-point applications. However, the increasing emphasis on high-performance graphics and the industry-wide usage of performance benchmarks, such as SPECmarks, forces processor designers to pay close attention to all aspects of floating-point computation. This paper presents the algorithms often utilized for floating-point division, and it also presents the implementation alternatives available to designers. Using a system-level study as a basis, it is shown how typical floating-point applications can guide the designer in making implementation decisions and trade-offs.

Key Words and Phrases: Floating-point, division, benchmarks, system performance

Copyright 1994 by Stuart F. Oberman and Michael J. Flynn

Contents

1 Introduction
2 Divide Algorithms: General Discussion
  2.1 Subtractive Algorithms
  2.2 Multiplicative Algorithms
  2.3 Comparison
3 System Level Study
  3.1 Instrumentation
  3.2 Method of Analysis
4 Results
  4.1 Instruction Mix
  4.2 Compiler Effects
  4.3 Overall CPI
  4.4 Shared Multiplier Effects
  4.5 On-the-fly Rounding and Conversion
  4.6 Consumers of Divide Results
5 Conclusion
6 Acknowledgement

List of Figures

1 Basic SRT topology
2 Instruction count distribution
3 Functional unit stall time distribution
4 Spice with optimization O0
5 Spice with optimization O3
6 Cumulative average interlock distance
7 Average effective divide latency
8 CPI and area vs. divide latency
9 CPI and area vs. divide latency
10 Excess CPI due to shared multiplier
11 Effects of on-the-fly rounding and conversion
12 Consumers of divide results
13 Consumers of multiply results

List of Tables

1 Effects of compiler optimization

1 Introduction

Modern computer applications have increased in their computation complexity in recent years.
The development of high-speed floating-point (FP) arithmetic is a requirement to meet the computation demands of these applications. The emphasis on high-performance graphics rendering systems has placed further demands on the computation abilities of processors. Furthermore, the industry-wide usage of performance benchmarks, such as SPECmarks, forces processor designers to pay particular attention to floating-point computation.

Applications such as these comprise several floating-point operations, among them addition, multiplication, and division. In recent FPUs, emphasis has been placed on designing ever-faster adders and multipliers, with division receiving less attention. Typically, the range for addition latency is 2 to 4 cycles, and the range for multiplication is 2 to 8 cycles. In contrast, the latency for double precision division in modern FPUs ranges from 7 to 61 cycles [4]. This phenomenon is largely due to the perception that divide is an infrequent operation in modern floating-point applications. Because of this low frequency, it is believed that the overall performance degradation incurred by the use of a slow divider will not be large. More emphasis has been placed on improving the performance of addition and multiplication. As the performance gap widened between these two operations and division, floating-point algorithms and applications have slowly been rewritten to account for this gap by mitigating the use of divide. Thus, current applications and benchmarks are usually written assuming that divide is an inherently slow operation and should therefore be used sparingly.

While the methodology for designing efficient high-performance adders and multipliers is well understood, the design of dividers remains a serious design challenge, often viewed as a "black art" among system designers. Extensive literature exists describing the theory of division. However, the implementation of division has received less attention, and very little emphasis has been placed on studying the effects of FP division on overall system performance.

This study investigates in more detail the relationship between FP divide and system performance. This relationship is studied in the context of a set of floating-point applications. The choice of applications to use when studying the performance of a system is often difficult and controversial. The application suites considered for this study included the NAS Parallel Benchmarks [5], the Perfect Benchmarks [8], and the SPECfp92 [10] benchmark suite. An initial analysis of the instruction distribution showed that the SPEC benchmarks had the highest frequency of floating-point operations, and they were therefore chosen as the target workload of the study to best reflect the behavior of floating-point intensive applications.

These applications are used to investigate several questions regarding the implementation of floating-point division:

- Does a high-latency divide operation cause enough system performance degradation to warrant a separate, lower-latency divide functional unit?
- How well can a compiler schedule code in order to maximize the distance between divide result production and consumption?
- What are the effects of increasing the width of instruction issue on effective divide latency?
- If a hardware divide unit is warranted, should the divider share the FP multiplier hardware, or should it have its own dedicated functional unit?
- Is on-the-fly rounding and conversion necessary?

The remainder of this paper is organized as follows. Section 2 presents common division algorithms and implementations. Section 3 describes the method of obtaining data from the benchmarks. Section 4 presents and analyzes the results of the study. Section 5 is the conclusion.

2 Divide Algorithms: General Discussion

Many classes of algorithms exist for implementing division. These include the subtractive methods, the multiplicative methods, various approximation methods, and special methods such as the CORDIC and continued product methods [1]. The most commonly used algorithms in modern FPUs are the subtractive and multiplicative methods, and the analysis here is limited to these.

2.1 Subtractive Algorithms

Digit recurrence algorithms use subtraction as the iterative operator. The quotient is represented in a radix-r form, and one digit of it is calculated in every iteration. This class can be further divided into restoring and nonrestoring division.

Restoring division is similar to familiar paper-and-pencil division. When dividing two n-bit numbers, the division can require up to 2n + 1 adds. The Winograd bound [13] on restoring division in gate delays is therefore:

    T = (2n + 1) log_2(2n)

Nonrestoring division eliminates the restoration cycles. Accordingly, the bound on nonrestoring division is given by:

    T = n log_2(2n)

SRT is a nonrestoring division algorithm that is basically a trial-and-error process [11]. It utilizes the following recurrence:

    P_{j+1} = r P_j - q_{j+1} D

To calculate the next partial remainder, the divisor is multiplied by the next quotient digit, and the result is subtracted from the product of the last partial remainder (or the dividend, for the first iteration) and the radix r. The next quotient digit is obtained by supplying a fixed number of bits from the last partial remainder, approximately 8 bits for a radix-4 divider, to a look-up table. By choosing the radix to be a power of 2, the product of the radix and the last partial remainder can be formed by shifting. Similarly, the various products of the divisor multiplied by the next quotient digit can be formed by multiplexing different multiples of the divisor. However, the problem with this basic scheme is that it requires a full-width subtractor. Consequently, this scheme can be very slow.

In order to improve upon this basic scheme, some redundancy is often introduced into the algorithm. An extra constraint is added to provide redundancy:

    |P_{j+1}| < k D

where k = n / (r - 1) and n is the number of positive allowed digits for the next quotient digit. A design trade-off can be noted in this relationship. By using a large number of allowed digits for the next quotient digit, and thus a large value for k, a smaller look-up table is required, and thus the complexity and latency of the table look-up can be reduced. However, choosing a smaller number of allowed digits for the quotient simplifies the generation of the multiples of the divisor. Multiples that are powers of two can be formed by simply shifting. If a multiple is required that is not a power of two (e.g., three), an additional operation such as addition may also be required, which can add to the complexity and latency of the divisor-multiple generating process.
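To make the recurrence concrete, the following is a minimal Python sketch of a radix-r digit-recurrence divide. It is illustrative only, not taken from the paper: the quotient-digit selection that a hardware SRT divider performs with a look-up table indexed by a few high-order bits of the shifted partial remainder and divisor is replaced here by idealized exact arithmetic, and the function name and parameters are hypothetical.

```python
# Illustrative sketch of the SRT recurrence P[j+1] = r*P[j] - q[j+1]*D.
# A real divider selects each quotient digit from a small look-up table
# driven by a truncation of r*P and D; here the selection is idealized.

def srt_divide(x, d, r=4, a=2, iterations=16):
    """Approximate x/d via a radix-r digit recurrence.

    r          -- radix (a power of two, so r*P is formed by shifting)
    a          -- maximum quotient-digit magnitude; the digit set is
                  {-a, ..., a}, giving redundancy factor k = a / (r - 1)
    iterations -- each iteration retires log2(r) quotient bits
    """
    k = a / (r - 1)
    assert 0 < x < k * d, "operands assumed prenormalized so |P0| < k*D"
    p = x            # partial remainder; the dividend on the first iteration
    q = 0.0          # quotient, accumulated digit by digit
    weight = 1.0     # positional weight r**(-j) of the current digit
    for _ in range(iterations):
        # Idealized digit selection: rounding keeps |P[j+1]| < k*D, the
        # redundancy constraint discussed above.
        digit = max(-a, min(a, round(r * p / d)))
        p = r * p - digit * d          # the SRT recurrence
        weight /= r
        q += digit * weight
    return q

print(srt_divide(0.3, 0.8))   # ~0.375 (= 0.3/0.8)
```

The assert reflects the constraint |P_0| < kD on the prenormalized operands, and the idealized selection preserves |P_{j+1}| < kD on every iteration; that containment is precisely what lets a real divider inspect only a few high-order bits of the partial remainder and keep the selection table small.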