<<

2016 IEEE 23rd Symposium on Computer Arithmetic

Quad Precision Floating Point on the IBM z13™

Cedric Lichtenau 1), Steven Carlough 2), Silvia Melitta Mueller 1)
[email protected], [email protected], [email protected]

1) IBM Deutschland Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany
2) IBM Systems and Technology Group, 2455 South Road, Poughkeepsie, New York 12601, USA

Abstract—When operating on a rapidly increasing amount of data, business analytics applications become sensitive to rounding errors, and profit from the higher stability and faster convergence of quad precision floating-point (FP-QP) arithmetic. The IBM z13™ supports this emerging trend around Big Data with outstanding FP-QP performance. The paper details the vector and floating-point unit of the IBM z13™, with special focus on binary FP-QP. Except for divide and square root, these instructions are executed in the decimal engine. To operate such an 8-cycle decimal and quad precision pipeline at 5GHz required innovation around exponent handling, normalization, and rounding.

Keywords—Floating Point Unit, Quad Precision, Analytics, Big Data, Decimal, Binary, IBM z13™

I. INTRODUCTION

The IBM z Systems has been the only commercial platform to support quad precision floating point in hardware for the past few decades. Quad precision floating-point has traditionally been used in physics [1][2] or in computational arithmetic applications [3][4][5]. However, the demand for quad precision floating-point operations is expected to grow dramatically as big data and business analytics workloads become more prevalent. For large installations it has recently been observed that on commercial products like ILOG and SPSS, replacing double precision operations with quad precision operations in critical routines yields 18% faster convergence due to reduced rounding error. Convergence of analytics and big data algorithms can be improved with increased accuracy of the intermediate computations [6]. This is driving investment in the development of faster quad precision binary floating-point hardware.
The Vector and Floating Point Unit (VFU) is the main execution engine of the IBM z13™ processor shipped in March 2015. Manufactured in IBM's 22 nm technology, the VFU supports a core frequency of 5 GHz and 2-way simultaneous multi-threading. The design point of the VFU was chosen to maximize the performance of Business Analytics and big data workloads, which include an ever increasing amount of quad precision floating-point operations.

The execution engines in the VFU comprise two symmetrical pipes, where each pipe contains a binary floating point unit (BFU), a Decimal and Quad Precision Engine (DQE), a divide and square root engine, a short-latency fixed-point engine, and a vector fixed-point and string engine which support IBM's new Single Instruction Multiple Data (SIMD) architecture for IBM z Systems™ [7][13]. There are also two load/store pipelines to read/write the Vector Register File (VRF) and two FXU read and write ports to access the VRF for the General Purpose Register File of the scalar fixed-point units.

Each VFU pipeline is a total of 10 cycles deep, two cycles of which are for operand bypass and result forwarding and do not impact the latency at which results are available to dependent operations. The depth of the pipeline was determined by the longest pipeline, which is the Decimal and Quad Precision Engine (DQE). When operations complete in shorter pipelines, the results are sent to a forwarding network where they can be sourced by dependent operations.

Section 2 gives an overview of the Decimal and Quad Precision Engine, followed by details in Section 3 on the Arithmetic Engine of the DQE, which performs the adding, rounding and some post corrections. Section 4 describes the exponent path of the decimal and quad precision engine. Section 5 describes the binary divide and square root engine. Section 6 provides the latency and throughput numbers for the binary quad precision floating-point operations and gives a comparison to prior design implementations.

II. PIPELINED DECIMAL AND QUAD PRECISION ENGINE

In prior z Systems designs, quad precision binary and hexadecimal floating-point operations were executed in a binary floating-point unit [8]. The dataflow of these binary floating-point units (BFU) is optimized to execute double precision operations, so quad precision operations had to loop several times through these binary floating-point units to complete their calculations. This resulted in longer latencies and lower throughput for quad precision operations compared to their double precision counterparts on these systems. Furthermore, the additional hardware necessary for binary floating-point units to support quad precision operations increased the area and pipeline depth of the unit, making it costly if they are replicated to support vector floating-point operations. The novel part of the z13™ processor is that quad precision binary floating-point (BFP) and hexadecimal floating-point (HFP) operations were implemented on a modified decimal floating-point engine.

Since decimal floating-point engines are typically designed to execute quad precision decimal floating-point operations [9], the engine has a 140-bit wide mantissa dataflow, which is wide enough to support the execution of quad precision BFP and HFP operations without the need to loop back through the hardware. The Decimal and Quad Precision Engine (DQE) on the z13™ is a pipelined decimal floating-point engine that has been augmented and enhanced to support quad precision BFP and HFP operations. These changes not only resulted in quad precision BFP and HFP operations executing faster on z13™ than on predecessor machines, but had the added advantage that area and power could be reduced in the BFU, making it more efficient to support vector BFP hardware.

A block diagram of the DQE pipeline is shown in Figure 1. The hardware is designed to run at a system frequency of 5GHz, so some of the dataflow hardware blocks illustrated span two cycles to close timing. The DQE is an 8-stage fully pipelined unit capable of starting a new operation every cycle for all arithmetical operations but multiply, divide and conversions between the binary and decimal radixes.

Figure 1. Dataflow of the DQE Unit

Multi-cycle operations, like quad precision multiply, reuse the unit dataflow and loop in specific stages of the pipeline to reduce the size of the multiplication hardware necessary to satisfy area and power budgets [12].

A. Unpack and Swap

The DQE supports seven floating-point data types with different mantissa widths:

• Decimal: QP 134b (34 digits), DP 64b (16 digits), SP 28b (7 digits)
• Hexadecimal: QP 112b (28 digits), DP 56b (14 digits), SP 28b (7 digits)
• Binary: QP 113b

The mantissas of the decimal floating-point operands are unpacked into binary coded decimal format, and the mantissas of binary and hexadecimal floating-point operands are aligned to the decimal data path in this hardware.

The exponent and shift amount calculations start in parallel with the unpacking of the mantissa and the detection and handling of special cases such as NaN or infinity. Depending on the exponent difference, the input operands are swapped. This puts the larger operand on the left side of the data path and the smaller one on the right side. The shift amount logic computes a left and a right shift amount for the two swapped operands. The shift amounts are then used in the next pipeline stages. Note that the shift amount calculation for decimal floating-point is much more complicated than for binary floating-point. Besides the exponent difference, the decimal shift amount calculation also includes the leading zero count of the input operands, as described in [9]. For binary addition and subtraction, only the mantissa with the smaller exponent is shifted; it is shifted to the right by the exponent difference.
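As a rough, software-level illustration of the binary swap and alignment just described (larger exponent to the left, right shift of the smaller operand by the exponent difference), here is a Python sketch on integer significands. The 113-bit quad precision width comes from the data type list above; the two explicit guard bits and the sticky collection are simplifying assumptions, not the actual z13 datapath.

```python
P = 113  # binary quad precision significand width, from the data type list above

def align_binary_operands(exp_a, sig_a, exp_b, sig_b):
    """Model of the unpack/swap and alignment for binary add/sub: swap so the
    operand with the larger exponent sits on the left, then shift the other
    significand right by the exponent difference, collecting a sticky bit."""
    if exp_b > exp_a:                                   # swap: larger exponent on the left
        exp_a, sig_a, exp_b, sig_b = exp_b, sig_b, exp_a, sig_a
    shift = exp_a - exp_b
    extra = 2                                           # keep guard/round bits explicitly
    if shift >= P + extra:                              # B shifted completely out
        aligned_b, sticky = 0, int(sig_b != 0)
    else:
        aligned_b = (sig_b << extra) >> shift
        sticky = int(((sig_b << extra) & ((1 << shift) - 1)) != 0)
    return exp_a, sig_a << extra, aligned_b, sticky

# two normalized quad precision significands, exponents two apart
big   = (1 << (P - 1)) | 0b1011
small = (1 << (P - 1)) | 0b0001
print(align_binary_operands(5, big, 3, small))
```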

B. Shifter

The shift operation is executed in two stages, producing 141b wide data. The first stage shifts by digits as required for decimal and hexadecimal numbers; it shifts the operands up to 36 digits to the left or right. The second stage performs the final right bit shift (0 to 3 bits) for binary numbers. This leaves enough timing headroom to include the multiplexing of the multiply, divide and convert results, as well as to invert the second operand for effective subtract operations and to add padding to single and double precision numbers. The padding ensures that for all precisions and radices, the overflow of the mantissa after adding and rounding can be detected at the same position. Without this padding, we would have to collect the mantissa overflow information from six bit positions. The padding therefore saves additional delay along the critical path through the adder.

C. Arithmetical Engine

The Arithmetical Engine (AREN) is the heart of the DQE. It computes the sum and absolute difference of the inputs and can be configured to include injection rounding and to perform a post-rounding correction. Furthermore, it evaluates result information used for the setting of the IEEE exceptions. The rounding injection happens at two positions in parallel to account for a potential mantissa overflow. The post-rounding correction performs a one digit (decimal numbers) or one bit (binary numbers) correction shift. Details are provided in Section III.

D. Normalize and Round

This step consists of two parallel circuits. One circuit normalizes the result of the arithmetical engine, while the other circuit selects the appropriately rounded result, which includes the post-rounding correction mentioned above. Decimal numbers always use the rounding path as they have been aligned to their target quantum in the shifter step. For binary numbers, subtraction can lead to the loss of one or more most significant bits and in this case requires normalization of the result instead of rounding. It is shown in a later section that for the pipelined operations supported by the DQE, either normalization or rounding of the results is required, but both functions are never used together by the same instruction.

E. Pack

In this last step, special values like infinity or NaN are forced on the data path if necessary and the result is packed into the target format. Binary coded decimal mantissas are packed back to the densely packed decimal format of decimal floating-point numbers. For binary floating-point numbers, the implied bit is eliminated, and for subnormal results an exponent correction is applied. When the binary result is provided by the normalizer, a potential shift amount correction is applied as well, prior to packing the result. The IEEE exception flags and the DXC exception code are also generated in this pipeline stage.

F. Decimal Multiplier, Divider and Converts

The decimal multiplication, division, and decimal-binary convert functions use the same algorithms as in the prior machines, IBM zEnterprise z196™ and zEC12™. A full-length description can be found in [9]. However, one change was made in the implementation of the decimal to binary conversion, reusing the binary multiplier structure for the reduction and accumulation of the conversion terms.

G. Binary Multiplier

The binary quad precision multiplier is depicted in Figure 2. In each cycle, it processes 18 bits of the multiplier mantissa, generating 9 Booth-recoded partial products. The intermediate result is accumulated in redundant carry-save format, retiring 18 bits of the sum and carry vectors per cycle. The AREN performs the addition of the 226-bit wide carry and sum vectors first on the low part and then on the high part while accounting for the carry from the low part. Finally, the result mantissa is shifted appropriately in the shifter block before the rounding takes place in the AREN. The shifting stage takes care of corrections for subnormal operands and results.

Figure 2. Binary Multiplier Circuit

The binary multiplier is also used for the convert from decimal to binary. As described in [9], the convert is done in an iterative fashion, converting three digits at a time and adding them to the intermediate result multiplied by 1000. The core of that function reuses the compressor of the binary multiplier, which compresses the 3 new decimal digits with the 3 shifted terms of sum and carry used to multiply the intermediate result by 1000, as illustrated in [9].
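The iterative decimal-to-binary conversion described above, three digits per pass with the running result multiplied by 1000, is easy to sketch in software. Only the arithmetic idea is shown; the hardware's reuse of the multiplier's carry-save compressor is reduced to a comment.

```python
def bcd_to_binary(digits):
    """Convert a decimal digit list (most significant digit first) to a binary
    integer, three digits per iteration: acc = acc * 1000 + next_three_digits."""
    while len(digits) % 3:                    # pad to a multiple of three digits
        digits = [0] + digits
    acc = 0
    for i in range(0, len(digits), 3):
        d2, d1, d0 = digits[i:i + 3]
        group = d2 * 100 + d1 * 10 + d0
        # the hardware forms acc*1000 from shifted copies (e.g. acc*1024 - acc*16 - acc*8)
        # kept in carry-save form and compressed together with the new digits
        acc = acc * 1000 + group
    return acc

assert bcd_to_binary([1, 2, 3, 4, 5, 6, 7]) == 1234567
```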

III. THE ARITHMETIC ENGINE

The arithmetical engine is depicted in Figure 3. It consists of a compound adder supporting a 37 decimal digit or 119-bit binary mantissa, including round, guard, and sticky positions, and a leading zero anticipator circuit (LZA). For effective subtraction, the second operand is inverted before entering the adder and the LZA.

Figure 3. Arithmetical Engine Circuit

The compound adder produces a sum (H0) and a sum+1 (H1) as well as an inverted sum (HC) to be used in the rounding select step. The result HN is a binary or hexadecimal mantissa used by the normalizer. The LZA anticipates the number of leading zeros in the result HN to be used as the normalization shift amount. The LZA runs in parallel to the result computation. The count may be one too large, and the normalization shift must be adjusted before delivering the final result.


All three floating-point data types use a sign-magnitude representation for their mantissa. An effective subtraction therefore has to compute the absolute difference of A and B; its result is either A-B or B-A, depending on which number is bigger. The end-around carry (eac) [14] determines which case applies. Without rounding, the result is A+B for an effective addition, and A-B (if A ≥ B) or B-A (if B > A) for an effective subtraction.

For binary, it is well known that in the case of B > A the absolute difference can be expressed by B-A = !(A + !B), where !X denotes the bitwise inversion of X:

B-A = -(A - B - 1) - 1  ⇔  B-A = -(A + !B) - 1  ⇔  B-A = !(A + !B)

Thus, the result HN for the normalizer comes from either the sum H0 or the sum+1 (H1), based on the eac and the effective operation.
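The end-around-carry selection above can be modeled directly on integers: the compound adder produces H0 = A + !B, H1 = H0 + 1 and the inverted sum HC = !H0, and the carry-out of the one's-complement addition decides which of H1 or HC is the magnitude of the difference. The 119-bit width comes from the adder description above; the rest is an illustrative assumption, not the actual AREN logic.

```python
import random

W = 119                      # adder width from the text (binary mantissa + round/guard/sticky)
MASK = (1 << W) - 1

def abs_difference(a, b):
    """|a - b| via one compound addition: H0 = a + !b, H1 = H0 + 1, HC = !H0.
    The end-around carry (eac) of a + !b is set exactly when a > b."""
    h0_full = a + (~b & MASK)            # one's-complement addition
    eac = h0_full >> W                   # carry-out: 1 iff a > b
    h0 = h0_full & MASK
    h1 = (h0 + 1) & MASK                 # equals a - b when the eac is set
    hc = ~h0 & MASK                      # equals !(a + !b) = b - a otherwise
    return h1 if eac else hc

for _ in range(1000):
    a, b = random.getrandbits(W), random.getrandbits(W)
    assert abs_difference(a, b) == abs(a - b)
```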

A. Decimal Binary Compound Adder

Figure 4 shows the internals of the compound adder. The arithmetic is performed on a per-digit basis, using 4-bit adders to compute the digit generate and propagate signals. These signals are fed into two regular binary carry trees, one with carry-in 0, the other with carry-in 1. The carry vectors are used to select the appropriate digits for the vectors H0, H1, and HC. The digit sums require pre- and post-corrections [16][17]. For decimal addition, a 6 is added to each digit of the B operand. The digit sums are then post-corrected by either applying -6 in the case of an effective addition without carry-out, or +6 in the case of an effective subtraction with carry-out. For binary operations, the 6-corrections are suppressed. The binary support is drawn in bold in the figure.

Figure 4. Compound Adder Circuit

B. Injection Rounding

The rounding is implemented by applying injection rounding to the adder. It uses a scheme similar to the one introduced for binary numbers in [15]. The scheme is extended to support rounding for binary and decimal numbers.

Figure 5 depicts the decimal case. After the swap stage, the exponent of operand A is larger than or equal to the exponent of operand B. This implies that only the mantissa of B could have been shifted into the guard and sticky positions. Rounding can happen on the first or the second rounding point, depending on whether the sum or absolute difference of digits 0 to 36 has a mantissa overflow, i.e., whether the most significant digit is non-zero. For rounding at the second rounding point, the injection needs to be applied to digits 33 to 36. Digit 33 is the tricky part, because there are now three terms: a digit from A, a digit from B, and the injection term. As described in [15], a 2-to-2 CSA (carry save adder) compression is applied to the operands A and B to create space for the injection. For binary, a single-bit-wide hole was enough, but for decimal we need a 4-bit-wide hole. That requires a special decimal CSA block. On digits 0 to 33, a regular 3-to-2 CSA, which adds ai + bi + 6, accounts for the 6-correction. For digit 34, a full addition of A and B is performed with a potential 6 or 12 correction.

After the CSA compression, we use a compound adder to compute the result and the result plus one for digits 0 through 33. In parallel, the injection rounding values for the first and second rounding points are added to digits 34-36 to produce two injection carry-outs Cj and Ck.

Figure 5. Injection Rounding for Decimal Numbers
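For the effective-add case, the excess-6 digit correction used by the compound adder of Section III.A can be shown in a few lines: add 6 to every digit of B so that a binary carry out of a 4-bit digit coincides with a decimal carry, and subtract the 6 again wherever no carry was produced. Only decimal addition is modeled here, with a ripple carry instead of the carry trees; the subtraction corrections and the binary suppression described above are omitted.

```python
def bcd_add(a_digits, b_digits):
    """Digit-wise BCD addition, least significant digit first, with the +6
    pre-correction on B and the -6 post-correction on digits without carry-out."""
    carry, out = 0, []
    for a, b in zip(a_digits, b_digits):
        s = a + (b + 6) + carry                           # 4-bit binary add of a, b+6, carry
        carry = 1 if s >= 16 else 0                       # binary carry-out == decimal carry-out
        out.append(s & 0xF if carry else (s - 6) & 0xF)   # undo the +6 when no carry-out
    return out, carry

# 37 + 85 = 122, digits given least significant first
assert bcd_add([7, 3], [5, 8]) == ([2, 2], 1)
```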

By selecting different pre-computed injection values for the decimal and binary radix and mapping the binary guard and sticky bits, the same data path can be reused for both radices without adding any delay along the critical path (see Figure 6 for the binary injection mapping). The round, guard, and sticky bits are spread onto digits 34 to 36 and are padded. The same happens to the binary injection bits. The mapping and padding are designed so that the arithmetic can be performed with decimal adders, and carries still propagate as desired.

Figure 6. Mapping of Binary Injection Rounding to the Decimal Path
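A simplified software model of injection rounding at two rounding points for the binary case: the injection constant for the no-overflow rounding point and the one for the overflow rounding point are both available, and the mantissa-overflow condition selects the result. This sketch uses round-to-nearest with ties away from zero, three guard bits, and an explicit overflow test instead of the carry-out based selection described in the following paragraphs, so it illustrates the principle rather than the actual circuit.

```python
import random

P, G = 113, 3        # quad precision width (from the text) and guard bits (assumption)

def round_by_injection(s):
    """Round an effective-add result s (P-bit significands aligned with G guard bits)
    to P bits by injection, round-to-nearest, ties away from zero."""
    if (s >> (P + G)) & 1:                      # sum already has a mantissa overflow:
        return (s + (1 << G)) >> (G + 1), 1     # inject at the second rounding point
    r = (s + (1 << (G - 1))) >> G               # inject at the first rounding point
    if r >> P:                                  # rounding itself overflowed the mantissa:
        return r >> 1, 1                        # post-rounding 1-bit correction shift
    return r, 0

# check against "keep the full-width sum, then round" on random normalized operands
for _ in range(10000):
    a = random.getrandbits(P) | (1 << (P - 1))
    b = random.getrandbits(P) | (1 << (P - 1))
    d = random.randint(0, G)                    # small exponent difference, nothing lost
    s = (a << G) + ((b << G) >> d)
    res, einc = round_by_injection(s)
    k = s.bit_length() - (P + G)                # 0 or 1 extra leading bit
    ref = (s + (1 << (G + k - 1))) >> (G + k)   # reference rounding of the exact sum
    assert res << einc == ref << k
```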

TABLE I below shows how the rounded result is generated out of the carry-outs of the compound adder (c0, c1) and the carry-outs from the injection at the first (Cj) and second (Ck) rounding points. This applies for all effective add cases and for effective subtract cases with A > B.

TABLE I. ROUNDED RESULT SELECTION

For effective subtract with B > A, we need the eac to select between A-B and B-A. For this case, neither decimal nor binary requires rounding. For decimal, rounding only applies when B is shifted right. This implies that A was fully normalized and B has at least one leading zero after alignment. Thus, for a decimal subtract with B > A there is no rounding, and the select logic selects HC based on the eac. For binary, B > A implies that both operands had the same exponent; nothing was shifted into the guard and sticky positions, and no rounding is required. The result is passed down the normalizer path for shifting out the leading zeros.

It must be shown that the addition and the injection rounding can be combined in one step while still providing the same result as if the intermediate result were first rounded and then normalized. Furthermore, it must be shown that, when rounding occurs, it is not possible to get more than one additional digit (decimal numbers) or bit (binary numbers). For simplicity this discussion will only focus on the binary case (see Figure 7); however, it can easily be extended to the radix-10 case.

Figure 7. Addition with Injection-Rounding

Claim: A second carry-out for RK is not possible on addition in homogeneous precision. There are two cases where we have an overflow and must round at the 2nd rounding point (see Figure 8).

Figure 8. No 2nd Carry-out Proof

In the first case, the addends A and B have the same exponent (eA = eB). Both mantissas align so that the guard bit and sticky bit are zero. This leads to no carry-in being generated. For A and B being all ones, the addition result is N ones followed by a zero. The result is exact and hence no increment occurs in the rounding step.

In the second case, operand A has a larger exponent than operand B (eA > eB). This implies that B got shifted to the right, and after shifting has at least one leading zero. Assuming the largest possible mantissas for A and B (A is all ones, B is a zero followed by all ones), the addition produces a carry into the most significant bit position of A, and the leading bits add up to a zero plus a carry-out. In this case the unrounded result has at least one zero prior to the most significant bit. When rounding up, the carry will only propagate to that position; hence also in this case a second carry-out cannot occur, allowing us to do the addition and the injection rounding in one step and produce the correct rounded result.
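The claim can also be checked exhaustively for a toy precision. The sketch below injects 0.5 ulp at the second rounding point whenever the aligned sum of two normalized significands has a mantissa overflow and confirms that the rounded value never carries out a second time; sticky collection is omitted, and the small widths stand in for the 113-bit hardware case.

```python
# exhaustive toy-precision check of the claim above
p, g = 5, 2                                       # toy significand width and guard bits
for ma in range(1 << (p - 1), 1 << p):            # normalized significands of A
    for mb in range(1 << (p - 1), 1 << p):        # normalized significands of B
        for d in range(0, p + g + 1):             # exponent difference eA - eB
            s = (ma << g) + ((mb << g) >> d)      # aligned sum with g guard bits
            if s >> (p + g):                      # mantissa overflow: 2nd rounding point
                r = (s + (1 << g)) >> (g + 1)     # inject 0.5 ulp there and truncate
                assert r < (1 << p)               # no second carry-out, as claimed
print("claim verified exhaustively for p =", p)
```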

IV. NORMALIZER AND ROUNDER LOGIC

For decimal add and subtract, the operands are aligned in the first pipeline stage according to the preferred quantum. Even in the case of massive cancellation, no normalization occurs. On an effective addition, the addition or the subsequent rounding can create an extra digit to the left. When exceeding the target width of the mantissa (significand overflow), the AREN together with the rounder selection performs a 1-digit shift and the quantum is incremented by one. On an effective subtraction, massive cancellation can occur, but the result does not get fully normalized. To match the preferred quantum, even in the case of an effective subtraction, only a 1-digit right shift is sufficient, provided the operands are properly aligned. Thus, add and subtract operations both flow through the AREN and its rounding circuitry into the packer.
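Python's decimal module implements the same IEEE 754-2008 decimal rules, so it can serve as a reference model for the preferred-quantum behaviour described above: the result of a decimal add keeps the finer of the two operand quanta and is not normalized; only when the coefficient would exceed the 34-digit format does rounding shift the quantum up.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

def dfp_add(a, b, digits=34):
    """Decimal quad precision (34-digit) addition at the preferred quantum."""
    return Context(prec=digits, rounding=ROUND_HALF_EVEN).add(a, b)

# the result is aligned to the smaller quantum (10^-3) and is not normalized
print(dfp_add(Decimal("1.20"), Decimal("3.400")))     # -> 4.600

# with 34 significant digits already in use, the carry forces a 1-digit shift
wide = Decimal("9" * 34)                              # 34 nines, quantum 10^0
print(dfp_add(wide, Decimal("1")))                    # -> 1.000...E+34, quantum shifted up
```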

More complex operations like decimal divide, decimal multiply, and the converts end in a round operation, which again passes through the AREN and the rounder as multi-cycle operations.

With regard to normalization, the rules for binary floating-point arithmetic are different from decimal floating-point. Unless the result of a binary floating-point operation is a subnormal number, the mantissa of the result must be fully normalized and represented in the format 1.f with an implicit leading one before the binary point.

Following the golden rule of computation from the IEEE 754 standard [18], binary floating-point addition and subtraction align the mantissas of the two operands, compute the sum or absolute difference, then normalize the exact result and round it to the target precision. In a fully pipelined fashion, the DQE hardware can either perform the normalization or the rounding including a potential 1-bit shift. However, doing both a full normalization and a subsequent rounding would require an extra pipeline stage. It can be shown that for homogeneous precision arithmetic, the binary floating-point add and subtract operations either require a wide normalization or rounding, but not both. It also holds that an IEEE overflow can only occur when the rounder is used, whereas an IEEE underflow can only occur when the normalizer is used. This allowed for a simpler exponent data flow, and let us implement the normalizer and rounding selection hardware in parallel.

Proof: Let A and B be the binary floating-point numbers with exponents eA and eB, respectively. Let NMIN be the smallest positive normal number and eMIN be its exponent.

Case 1add: eA = eB, A, B ≥ NMIN
In this case the mantissas of both numbers are of the form 1.f and the sum of the mantissas can gain an additional leading bit. If the sum does not gain an extra bit, the result is exact and needs neither rounding nor normalization. If the sum gains an extra bit, it requires rounding, but no normalization since there are no leading zeros. In either case, the result can be processed by the rounding circuit. Note that A+B ≥ A, B ≥ NMIN, and thus no underflow can occur.

Case 2add: eA = eB, A, B subnormal
In this case the mantissas of both numbers are of the form 0.f and the result cannot be greater than 1.g. The intermediate sum does not gain any additional bits beyond the integer bit; the result is exact and does not require rounding. However, depending on the setting of the underflow-enable mask, the result mantissa either stays in subnormal form or gets fully normalized. Either way, this operation can be performed by the normalizer. Note that the result is at most 2*NMIN and thus overflow cannot occur, only underflow.

Case 3add: eA > eB ≥ eMIN
Note that A is a normal number and its mantissa is of the form 1.f. The mantissa of the B operand is aligned relative to that of A and sticks out to the right. The sum can gain a leading bit, and then has one or two bits before the binary point. Thus, the sum will require rounding, but at most a 1-bit normalization. This can be done by the AREN and the rounding circuitry. Like in case 1add, the result can overflow but not underflow.

Case 1sub: eA = eB
Since both numbers have the same exponent, no alignment shift is required. The absolute difference of the two mantissas may lose one or more bits, but it cannot gain any extra leading bits. Thus, the difference is exact and does not require any rounding. However, it might require normalization and can cause an underflow. This case will use the normalizer path.

Case 2sub: eA = eB + 1, result keeps the most significant bit
Note that eA > eMIN, and operand A is a normal number. The mantissa of B is aligned relative to the mantissa of A and sticks out one bit to the right. In case there is no cancellation, i.e., the mantissa of the difference still has a leading one before the binary point, then the mantissa of the difference is p+1 bits wide and requires rounding to the target precision p. However, it does not require any normalization.

Case 3sub: eA = eB + 1, result lost the most significant bit
The mantissas of A and B are aligned as in the previous case, but cancellation occurs. This implies that the mantissa of the difference is of the form 0.f, and the fraction f has p bits. Thus, the difference does not require any rounding, just normalization of the leading zeros.

Case 4sub: eA ≥ eB + 2
Operand A is a normal number and its mantissa has the form 1.f. The B operand can be a normal or a subnormal number. The alignment shifts the mantissa of B at least 2 positions to the right. Thus, the aligned mantissa of B is of the form 0.0g. The exact difference D of A and B is D = 2^eA * (1.f - 0.0g) ≥ 2^eA * (1.0 - 0.5) = 2^eA * 0.5. Thus the mantissa of the difference loses at most one leading bit, and its mantissa has at least p+1 bits. The difference requires rounding and at most a 1-bit shift. Both can be handled by the AREN and the rounding circuitry. In this case, neither overflow nor underflow can occur.

The implementation of the normalizer and rounder path is shown in Figure 9. It consists of a rounding path that selects the correctly rounded result at the proper rounding position, as discussed in the previous section. For subtraction, we preshift the operands in the shifter step by one bit to the left for binary numbers, or by four bits to the left for decimal numbers. During subtraction it is possible to lose one or more bits through cancellation, but additional bits cannot be gained. The preshift allows us to implement a single shift-left-by-one-position shifter for both addition and subtraction. The second path through the normalizer is used for binary results that need to be normalized by more than one bit.
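The case split above can be condensed into a small decision function. It is only a restatement of the proof in code, for binary add/sub in homogeneous precision; the hardware derives the equivalent norm_sel signal from adder and LZA outputs rather than from exponents.

```python
def dqe_path(effective_op, eA, eB, cancel=False, both_subnormal=False):
    """Which result path a binary add/sub takes, per cases 1add-3add and 1sub-4sub:
    'rounder' needs at most a 1-bit shift, 'normalizer' needs no rounding.
    cancel: the effective subtraction lost its leading bit (case 2sub vs 3sub)."""
    if effective_op == "add":
        return "normalizer" if both_subnormal else "rounder"   # case 2add vs 1add/3add
    d = abs(eA - eB)
    if d == 0:                                                  # case 1sub: exact difference
        return "normalizer"
    if d == 1:                                                  # cases 2sub / 3sub
        return "normalizer" if cancel else "rounder"
    return "rounder"                                            # case 4sub: at most 1-bit loss

assert dqe_path("sub", eA=10, eB=10) == "normalizer"
assert dqe_path("sub", eA=12, eB=9) == "rounder"
assert dqe_path("add", eA=7, eB=7, both_subnormal=True) == "normalizer"
```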


Figure 9. Normalizer & Rounder Circuit

The final result exponent and mantissa are computed either along the normalizer or along the rounder path. The normalizer path is selected by an equation derived from the proof above.

A result produced by the normalizer path (i.e., norm_sel = 1) cannot overflow, as shown above. Let LZA be the shift amount anticipated by the LZA circuit and LZA2large the correction derived from the most significant bit of the output of the normalizer circuit. Let e = MAX{eA, eB}. The result exponent must be adjusted by LZA - LZA2large if it is greater than or equal to eMIN. The result mantissa is shifted by that amount as well. If the resulting exponent is less than eMIN and underflow enable is set, then the exponent is rebiased, the result mantissa is fully normalized, and the underflow flag is set. If the resulting exponent is less than eMIN and underflow enable is not set, then the result is a subnormal number. The exponent result e' is zero and the mantissa is shifted by e - eMIN.

For rounding results (i.e., norm_sel = 0), underflow cannot occur. In the first step the correctly rounded result is selected based on whether the operation is an addition, operates on binary numbers, or has an end-around carry for decimal numbers.

The exponent and control logic delivers a sigovf signal that is one if the intermediate result, RRes, is binary with a most significant bit that is greater than zero, or if the intermediate result is decimal with a most significant digit that is greater than zero. If e + sigovf is less than or equal to the maximum exponent eMAX, then the result exponent is e + sigovf and the result mantissa is either RRes if sigovf is zero, or RRes shifted by one bit for a binary number or by one digit for a decimal number. If e equals eMAX and sigovf is set, then the result has overflowed. If overflow enable is set, the exponent is rebiased, the result mantissa is computed as before, and overflow is indicated. If overflow enable is not set, the result overflows to infinity; the exponent and mantissa of the result are set to that special value.

The third leg of the multiplexer is also used to force the maximum number or NaNs for special cases.
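The sigovf handling on the rounder path can be paraphrased as follows. The exponent wrap amount used when the overflow trap is enabled is architected elsewhere and is therefore passed in as a parameter here; returning None for the infinity case stands in for the special value forced in the packer.

```python
def rounder_exponent(e, rres, sigovf, e_max, rebias, is_binary=True, overflow_enable=False):
    """Result exponent/mantissa selection on the rounder path, following the
    sigovf description above. Returns (exponent, significand, overflow_flag)."""
    shift = 1 if is_binary else 4                 # one bit for binary, one digit for decimal
    sig = (rres >> shift) if sigovf else rres     # 1-position shift on mantissa overflow
    if e + sigovf <= e_max:                       # still representable
        return e + sigovf, sig, False
    if overflow_enable:                           # trap enabled: wrap (rebias) the exponent
        return e + sigovf - rebias, sig, True
    return None, None, True                       # overflow to infinity (forced in the packer)

# toy values only: a decimal result whose rounding carried into an extra digit
print(rounder_exponent(e=50, rres=0x10000, sigovf=1, e_max=100,
                       rebias=64, is_binary=False))    # -> (51, 0x1000, False)
```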

V. DIVIDE AND SQUARE ROOT ENGINE

Each of the two pipelines in the VFU has a divide / square root engine. It is a standalone engine which supports all binary and hexadecimal floating-point data types, i.e., single, double, and quad precision. The advantage of a standalone engine is that the z13 can freely start new VFU instructions in the other engines, including the DQE, while these multi-cycle operations are ongoing. The underlying algorithm of the engine is SRT, generating 3 bits per cycle for divide and 2 bits per cycle for square root. The mantissa of the quad precision operations is 113 bits plus some extra bits for rounding. A major challenge was to perform an SRT step on such a wide mantissa and fit it in a single 5GHz cycle. Using a partially redundant number format was key to meeting that challenge.

The engine supports subnormal binary and unnormalized hexadecimal numbers [19] in hardware. Upon receiving the operands, it checks whether the mantissa of one of the operands is unnormalized, counts the number of leading zeros, and fully normalizes the operands. It then initializes the SRT engine and runs for a defined number of iterations, generating two or three new result bits in each cycle. The iteration count is based only on the operation and precision. Subnormal results may require fewer bits to be computed, allowing the operation to complete in fewer iterations, but these cases are rare in commercial applications, and adding hardware to support early-out cases would only increase the interface complexity with virtually no performance benefit.
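For comparison with the 3-bits-per-cycle SRT recurrence, here is a plain radix-8 digit-recurrence division on normalized significands. Each step finds the quotient digit by a full comparison, whereas the hardware's SRT step picks a redundant digit from a few leading bits of a partially redundant remainder; square root and the rounding of the final quotient are not modeled.

```python
def divide_significands(x, d, p, extra=3):
    """Quotient of two normalized p-bit significands, one radix-8 digit
    (3 result bits) per iteration; the remainder feeds the sticky bit."""
    assert (x >> (p - 1)) and (d >> (p - 1)), "operands must be normalized"
    q = 1 if x >= d else 0                    # integer bit: x/d lies in (0.5, 2)
    rem = x - d if x >= d else x
    for _ in range((p + extra + 2) // 3):     # enough digits for p bits plus rounding bits
        rem <<= 3
        digit = rem // d                      # 0..7 here, by full divide; SRT would use
        rem -= digit * d                      # a redundant digit set and estimates instead
        q = (q << 3) | digit
    return q, rem

# toy 8-bit significands: 1.5 / 1.0
q, rem = divide_significands(0b11000000, 0b10000000, p=8)
assert q == 3 << 11 and rem == 0              # 1.5 scaled by 2**12, exact
```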


VI. RESULTS

The DQE in the IBM z13™ executes quad precision floating-point operations faster than prior published works. TABLE II compares the performance of common quad precision arithmetic operations on the z13™ to that of the prior generation zEC12™ mainframe processor [10]. The latency of an operation refers to the number of cycles it takes from receiving its operands until its results are available for the next dependent instruction to use. The throughput refers to the average number of cycles after which independent instructions can be started on the same engine, known as cycles per instruction (CPI). Note that for z13™ the throughput numbers reflect the fact that there are two Decimal and Quad Precision Engines and two Divide Engines in the processor. Unlike prior machines, divide and square root operations can execute in parallel with addition, subtraction, and multiplication operations on the DQE. Furthermore, while the DQE and divide engines are executing multi-cycle operations, the two 64-bit Binary Floating Point Units and two SIMD engines in the z13™ VFU can continue to execute two instructions every cycle [11].

TABLE II. PERFORMANCE OF COMMON QUAD PRECISION OPERATIONS ON Z13™ COMPARED TO PREVIOUS GENERATION ZEC12™
            Latency zEC12™   Latency z13™   CPI zEC12™   CPI z13™
  Add/Sub        35               11            28          1.5
  Multiply      55-97             23           48-90        7.5
  Divide*       ~165              49           ~158         21
  Sqrt*         ~170              66           ~163         24

  * Divide/Square Root executed in the Divide Engine, not in the DQE.

VII. CONCLUSIONS

This paper describes the quad precision floating-point hardware in the Vector and Floating Point Unit of the IBM z13™ processor. The hardware was designed to maximize performance for quad precision floating-point operations, which occur with increasing frequency in Business Analytics workloads, while providing significant performance improvements for existing commercial workloads. The new decimal and quad precision engine (DQE) consists of an 8-stage deep execution pipeline and supports a 5GHz cycle time. An advanced bypass network provides early result forwarding to prevent performance loss on dependent operations. The total area of the VFU hardware, including the vector and floating point register files, is 3.9mm² in a 22nm technology.

REFERENCES

[1] G. Lake, T. Quinn, and D.C. Richardson, "From Sir Isaac to the Sloan survey: Calculating the structure and chaos due to gravity in the universe," Proc. of the 8th ACM-SIAM Symposium on Discrete Algorithms, 1997, pp. 1-10.
[2] P.H. Hauschildt and E. Baron, "The numerical solution of the expanding stellar atmosphere problem," Journal of Computational and Applied Mathematics, vol. 109 (1999), pp. 41-63.
[3] D.H. Bailey, R. Barrio, and J.M. Borwein, "High precision computation: Mathematical physics and dynamics," Applied Mathematics and Computation, vol. 218 (2012), pp. 10106-10121.
[4] Y. He and C. Ding, "Using accurate arithmetic to improve numerical reproducibility and stability in parallel applications," Journal of Supercomputing, vol. 18, no. 3 (Mar 2001), pp. 259-277.
[5] D. Bailey, "High-Precision Computations: Applications and Challenges," Keynote, 21st Symposium on Computer Arithmetic, April 2013.
[6] A. Greenbaum, "Iterative Methods for Solving Linear Systems," SIAM, Philadelphia, 1997.
[7] E. Schwarz, "The IBM z13 SIMD Accelerators for Integer, String, and Floating-Point," 22nd Symposium on Computer Arithmetic, 2014.
[8] S. Trong, M. Schmookler, E. Schwarz, and M. Kroener, "P6 Binary Floating-Point Unit," Proceedings of the 18th Symposium on Computer Arithmetic, pp. 77-86, June 2007.
[9] S. Carlough, S. Mueller, A. Collura, and M. Kroener, "The z196 Decimal Floating Point Accelerator," IEEE Symposium on Computer Arithmetic, July 2011, pp. 139-146.
[10] C.K. Shum, F. Busaba, and C. Jacobi, "IBM zEC12: The Third-Generation High-Frequency Mainframe Microprocessor," IEEE Micro (2013), pp. 38-47.
[11] B.W. Curran, C. Jacobi, et al., "The IBM z13 multithreaded microprocessor," IBM Journal of Research and Development, Issue 4/5 (July 2015), pp. 2.1-2.16.
[12] E. Schwarz, et al., "The IBM z13 multithreaded microprocessor," IBM Journal of Research and Development, Issue 4/5 (July 2015), pp. 1.1-1.13.
[13] E. Schwarz, R.B. Krishnamurthy, C.J. Parris, et al., "The SIMD accelerator for business analytics on the IBM z13," IBM Journal of Research and Development, Issue 4/5 (July 2015), pp. 1.1-1.13.
[14] P.M. Seidel and G. Even, "Delay-optimized implementation of IEEE floating-point addition," IEEE Transactions on Computers, 53(2):97-113, 2004.
[15] G. Even and P.M. Seidel, "A comparison of three rounding algorithms for IEEE floating-point multiplication," IEEE Transactions on Computers, 49(7), July 2000.
[16] A. Vazquez and E. Antelo, "A High-Performance BCD Adder with IEEE 754-2008 Decimal Rounding," 19th Symposium on Computer Arithmetic, 2009.
[17] L.K. Wang and M.J. Schulte, "Decimal Floating-Point Adder and Multifunction Unit with Injection-Based Rounding," 18th Symposium on Computer Arithmetic, 2007.
[18] American National Standards Institute and Institute of Electrical and Electronics Engineers, "IEEE Standard for Binary Floating-Point Arithmetic," ANSI/IEEE Standard 754-1985, 1985.
[19] International Business Machines, "z/Architecture Principles of Operation z13," SA22-7832-10, 2015.
