<<

2016 IEEE 23rd Symposium on Computer Arithmetic

Quad Precision Floating Point on the IBM z13™

Cedric Lichtenau 1), Steven Carlough 2), Silvia Melitta Mueller 1)
[email protected], [email protected], [email protected]

1) IBM Deutschland Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany
2) IBM Systems and Technology Group, 2455 South Road, Poughkeepsie, New York 12601, USA

Abstract—When operating on a rapidly increasing amount of data, business analytics applications become sensitive to rounding errors, and profit from the higher stability and faster convergence of quad precision floating-point (FP-QP) arithmetic. The IBM z13™ supports this emerging trend around Big Data with outstanding FP-QP performance. The paper details the vector and floating-point unit of the IBM z13™, with special focus on binary FP-QP. Except for divide and square root, these instructions are executed in the decimal engine. To operate such an 8-cycle decimal and quad precision pipeline at 5GHz required innovation around exponent handling, normalization, and rounding.

Keywords—Floating Point Unit, Quad Precision, Analytics, Big Data, Decimal, Binary, IBM z13™

I. INTRODUCTION

The IBM z Systems has been the only commercial platform to support quad precision floating point in hardware for the past few decades. Quad precision floating-point has traditionally been used in physics [1][2] or in computational arithmetic applications [3][4][5]. However, the demand for quad precision floating-point operations is expected to grow dramatically as big data and business analytics workloads become more prevalent. For large installations it has recently been observed that on commercial products like ILOG and SPSS, replacing double precision operations with quad precision operations in critical routines yields 18% faster convergence due to reduced rounding error. Convergence of analytics and big data algorithms can be improved with increased accuracy of the intermediate computations [6]. This is driving investment in the development of faster quad precision binary floating-point hardware.
The Vector and Floating Point Unit (VFU) is the main execution engine of the IBM z13™ processor shipped in March 2015. Manufactured in IBM's 22 nm technology, the VFU supports a core frequency of 5 GHz and 2-way simultaneous multi-threading. The design point of the VFU was chosen to maximize the performance of Business Analytics and big data workloads, which include an ever increasing amount of quad precision floating-point operations.

The execution engines in the VFU comprise two symmetrical pipes, where each pipe contains a binary floating point unit (BFU), a Decimal and Quad Precision Engine (DQE), a divide and square root engine, a short-latency fixed-point engine, and a vector fixed-point and string engine which support IBM's new Single Instruction Multiple Data (SIMD) architecture for IBM z Systems™ [7][13]. There are also two load/store pipelines to read/write the Vector Register File (VRF) and two FXU read and write ports to access the VRF for the General Purpose Register File of the scalar fixed-point units.

Each VFU pipeline is a total of 10 cycles deep, two cycles of which are for operand bypass and result forwarding and do not impact the latency at which results are available to dependent operations. The depth of the pipeline was determined by the longest pipeline, which is the Decimal and Quad Precision Engine (DQE). When operations complete in shorter pipelines, the results are sent to a forwarding network where they can be sourced by dependent operations.

Section 2 gives an overview of the Decimal and Quad Precision Engine, followed by details in Section 3 on the Arithmetic Engine of the DQE, which performs the adding, rounding and some post corrections. Section 4 describes the exponent path of the decimal and quad precision engine. Section 5 describes the binary divide and square root engine. Section 6 provides the latency and throughput numbers for the binary quad precision floating-point operations and gives a comparison to prior design implementations.

II. PIPELINED DECIMAL AND QUAD PRECISION ENGINE

In prior z Systems designs, quad precision binary and hexadecimal floating-point operations were executed in a binary floating-point unit [8]. The dataflow of these binary floating-point units (BFU) is optimized to execute double precision operations, so quad precision operations had to loop several times through these binary floating-point units to complete their calculations. This resulted in longer latencies and lower throughput for quad precision operations compared to their double precision counterparts on these systems. Furthermore, the additional hardware necessary for binary floating-point units to support quad precision operations increased the area and pipeline depth of the unit, making it costly if they are replicated to support vector floating-point operations. The novel part of the z13™ processor is that quad precision binary floating-point (BFP) and hexadecimal floating-point (HFP) operations were implemented on a modified decimal floating-point engine.

Since decimal floating-point engines are typically designed to execute quad precision decimal floating-point operations [9], the engine has a 140-bit wide mantissa dataflow, which is wide enough to support the execution of quad precision BFP and HFP operations without the need to loop back through the hardware. The Decimal and Quad Precision Engine (DQE) on the z13™ is a pipelined decimal floating-point engine that has been augmented and enhanced to support quad precision BFP and HFP operations. These changes not only resulted in quad precision BFP and HFP operations executing faster on z13™ than on predecessor machines, but had the added advantage that area and power could be reduced in the BFU, making it more efficient to support vector BFP hardware.

A block diagram of the DQE pipeline is shown in Figure 1. The hardware is designed to run at a system frequency of 5GHz, so some of the dataflow hardware blocks illustrated span two cycles to close timing. The DQE is an 8-stage fully pipelined unit capable of starting a new operation every cycle for all arithmetical operations but multiply, divide and conversions between the binary and decimal radixes.

Figure 1. Dataflow of the DQE Unit

Multi-cycle operations, like quad precision multiply, reuse the unit dataflow and loop in specific stages of the pipeline to reduce the size of the multiplication hardware necessary to satisfy area and power budgets [12].

A. Unpack and Swap

The DQE supports seven floating-point data types with different mantissa widths:

• Decimal: QP 134b (34 digits), DP 64b (16 digits), SP 28b (7 digits)
• Hexadecimal: QP 112b (28 digits), DP 56b (14 digits), SP 28b (7 digits)
• Binary: QP 113b

The mantissas of the decimal floating-point operands are unpacked into binary coded decimal format, and the mantissas of binary and hexadecimal floating-point operands are aligned to the decimal data path in this hardware.

The exponent and shift amount calculations start in parallel with the unpacking of the mantissa and the detection and handling of special cases such as NaN or infinity. Depending on the exponent difference, the input operands are swapped. This puts the larger operand on the left side of the data path and the smaller one on the right side. The shift amount logic computes a left and a right shift amount for the two swapped operands. The shift amounts are then used in the next pipeline stages. Note that the shift amount calculation for decimal floating-point is much more complicated than for binary floating-point. Besides the exponent difference, the decimal shift amount calculation also includes the leading zero count of the input operands, as described in [9]. For binary addition and subtraction, only the mantissa with the smaller exponent is shifted; it is shifted to the right by the exponent difference.
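As a rough, software-level illustration of the binary swap and alignment just described (larger exponent to the left, right shift of the smaller operand by the exponent difference), here is a Python sketch on integer significands. The 113-bit quad precision width comes from the data type list above; the two explicit guard bits and the sticky collection are simplifying assumptions, not the actual z13 datapath.

```python
P = 113  # binary quad precision significand width, from the data type list above

def align_binary_operands(exp_a, sig_a, exp_b, sig_b):
    """Model of the unpack/swap and alignment for binary add/sub: swap so the
    operand with the larger exponent sits on the left, then shift the other
    significand right by the exponent difference, collecting a sticky bit."""
    if exp_b > exp_a:                                   # swap: larger exponent on the left
        exp_a, sig_a, exp_b, sig_b = exp_b, sig_b, exp_a, sig_a
    shift = exp_a - exp_b
    extra = 2                                           # keep guard/round bits explicitly
    if shift >= P + extra:                              # B shifted completely out
        aligned_b, sticky = 0, int(sig_b != 0)
    else:
        aligned_b = (sig_b << extra) >> shift
        sticky = int(((sig_b << extra) & ((1 << shift) - 1)) != 0)
    return exp_a, sig_a << extra, aligned_b, sticky

# two normalized quad precision significands, exponents two apart
big   = (1 << (P - 1)) | 0b1011
small = (1 << (P - 1)) | 0b0001
print(align_binary_operands(5, big, 3, small))
```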

B. Shifter

The shift operation is executed in two stages, producing 141b wide data. The first stage shifts by digits as required for decimal and hexadecimal numbers; it shifts the operands up to 36 digits to the left or right. The second stage performs the final right bit shift (0 to 3 bits) for binary numbers. This leaves enough timing headroom to include the multiplexing of the multiply, divide and convert results, as well as to invert the second operand for effective subtract operations and to add padding to single and double precision numbers. The padding ensures that for all precisions and radices, the overflow of the mantissa after adding and rounding can be detected at the same position. Without this padding, we would have to collect the mantissa overflow information from six bit positions. The padding therefore saves additional delay along the critical path through the adder.

C. Arithmetical Engine

The Arithmetical Engine (AREN) is the heart of the DQE. It computes the sum and absolute difference of the inputs and can be configured to include injection rounding and to perform a post-rounding correction. Furthermore, it evaluates result information used for the setting of the IEEE exceptions. The rounding injection happens at two positions in parallel to account for a potential mantissa overflow. The post-rounding correction performs a one digit (decimal numbers) or one bit (binary numbers) correction shift. Details are provided in Section III.

D. Normalize and Round

This step consists of two parallel circuits. One circuit normalizes the result of the arithmetical engine, while the other circuit selects the appropriately rounded result, which includes the post-rounding correction mentioned above. Decimal numbers always use the rounding path as they have been aligned to their target quantum in the shifter step. For binary numbers, subtraction can lead to the loss of one or more most significant bits and in this case requires normalization of the result instead of rounding. It is shown in a later section that for the pipelined operations supported by the DQE, either normalization or rounding of the results is required, but both functions are never used together by the same instruction.

E. Pack

In this last step, special values like infinity or NaN are forced on the data path if necessary and the result is packed into the target format. Binary coded decimal mantissas are packed back to the densely packed decimal format of decimal floating-point numbers. For binary floating-point numbers, the implied bit is eliminated, and for subnormal results an exponent correction is applied. When the binary result is provided by the normalizer, a potential shift amount correction is applied as well, prior to packing the result. The IEEE exception flags and the DXC exception code are also generated in this pipeline stage.

F. Decimal Multiplier, Divider and Converts

The decimal multiplication, division, and decimal-binary convert functions use the same algorithms as in the prior machines, IBM zEnterprise z196™ and zEC12™. A full-length description can be found in [9]. However, one change was made in the implementation of the decimal to binary conversion, reusing the binary multiplier structure for the reduction and accumulation of the conversion terms.

G. Binary Multiplier

The binary quad precision multiplier is depicted in Figure 2. In each cycle, it processes 18 bits of the multiplier mantissa, generating 9 Booth-recoded partial products. The intermediate result is accumulated in redundant carry-save format, retiring 18 bits of the sum and carry vectors per cycle. The AREN performs the addition of the 226-bit wide carry and sum vectors first on the low part and then on the high part while accounting for the carry from the low part. Finally, the result mantissa is shifted appropriately in the shifter block before the rounding takes place in the AREN. The shifting stage takes care of corrections for subnormal operands and results.

Figure 2. Binary Multiplier Circuit

The binary multiplier is also used for the convert from decimal to binary. As described in [9], the convert is done in an iterative fashion, converting three digits at a time and adding them to the intermediate result multiplied by 1000. The core of that function reuses the compressor of the binary multiplier, which compresses the 3 new decimal digits with the 3 shifted terms of sum and carry used to multiply the intermediate result by 1000, as illustrated in [9].
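The iterative decimal-to-binary conversion described above, three digits per pass with the running result multiplied by 1000, is easy to sketch in software. Only the arithmetic idea is shown; the hardware's reuse of the multiplier's carry-save compressor is reduced to a comment.

```python
def bcd_to_binary(digits):
    """Convert a decimal digit list (most significant digit first) to a binary
    integer, three digits per iteration: acc = acc * 1000 + next_three_digits."""
    while len(digits) % 3:                    # pad to a multiple of three digits
        digits = [0] + digits
    acc = 0
    for i in range(0, len(digits), 3):
        d2, d1, d0 = digits[i:i + 3]
        group = d2 * 100 + d1 * 10 + d0
        # the hardware forms acc*1000 from shifted copies (e.g. acc*1024 - acc*16 - acc*8)
        # kept in carry-save form and compressed together with the new digits
        acc = acc * 1000 + group
    return acc

assert bcd_to_binary([1, 2, 3, 4, 5, 6, 7]) == 1234567
```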

III. THE ARITHMETIC ENGINE

The arithmetical engine is depicted in Figure 3. It consists of a compound adder supporting a 37 decimal digit or 119-bit binary mantissa, including round, guard, and sticky positions, and a leading zero anticipator circuit (LZA). For effective subtraction, the second operand is inverted before entering the adder and the LZA.

Figure 3. Arithmetical Engine Circuit

The compound adder produces a sum (H0) and a sum+1 (H1) as well as an inverted sum (HC) to be used in the rounding select step. The result HN is a binary or hexadecimal mantissa used by the normalizer. The LZA anticipates the number of leading zeros in the result HN to be used as the normalization shift amount. The LZA runs in parallel to the result computation. The count may be one too large, and the normalization shift must be adjusted before delivering the final result.


All three floating-point data types use a sign-magnitude representation for their mantissa. An effective subtraction therefore has to compute the absolute difference of A and B; its result is either A-B or B-A, depending on which number is bigger. The end-around carry (eac) [14] determines which case applies. Without rounding, the result is A+B for an effective addition, and A-B (if A ≥ B) or B-A (if B > A) for an effective subtraction.

For binary, it is well known that in the case of B > A the absolute difference can be expressed by B-A = !(A + !B), where !X denotes the bitwise inversion of X:

B-A = -(A - B - 1) - 1  ⇔  B-A = -(A + !B) - 1  ⇔  B-A = !(A + !B)

Thus, the result HN for the normalizer comes from either the sum H0 or the sum+1 (H1), based on the eac and the effective operation.
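The end-around-carry selection above can be modeled directly on integers: the compound adder produces H0 = A + !B, H1 = H0 + 1 and the inverted sum HC = !H0, and the carry-out of the one's-complement addition decides which of H1 or HC is the magnitude of the difference. The 119-bit width comes from the adder description above; the rest is an illustrative assumption, not the actual AREN logic.

```python
import random

W = 119                      # adder width from the text (binary mantissa + round/guard/sticky)
MASK = (1 << W) - 1

def abs_difference(a, b):
    """|a - b| via one compound addition: H0 = a + !b, H1 = H0 + 1, HC = !H0.
    The end-around carry (eac) of a + !b is set exactly when a > b."""
    h0_full = a + (~b & MASK)            # one's-complement addition
    eac = h0_full >> W                   # carry-out: 1 iff a > b
    h0 = h0_full & MASK
    h1 = (h0 + 1) & MASK                 # equals a - b when the eac is set
    hc = ~h0 & MASK                      # equals !(a + !b) = b - a otherwise
    return h1 if eac else hc

for _ in range(1000):
    a, b = random.getrandbits(W), random.getrandbits(W)
    assert abs_difference(a, b) == abs(a - b)
```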

A. Decimal Binary Compound Adder

Figure 4 shows the internals of the compound adder. The arithmetic is performed on a per-digit basis, using 4-bit adders to compute the digit generate and propagate signals. These signals are fed into two regular binary carry trees, one with carry-in 0, the other with carry-in 1. The carry vectors are used to select the appropriate digits for the vectors H0, H1, and HC. The digit sums require pre- and post-corrections [16][17]. For decimal addition, a 6 is added to each digit of the B operand. The digit sums are then post-corrected by either applying -6 in the case of an effective addition without carry-out, or +6 in the case of an effective subtraction with carry-out. For binary operations, the 6-corrections are suppressed. The binary support is drawn in bold in the figure.

Figure 4. Compound Adder Circuit

B. Injection Rounding

The rounding is implemented by applying injection rounding to the adder. It uses a scheme similar to the one introduced for binary numbers in [15]. The scheme is extended to support rounding for binary and decimal numbers.

Figure 5 depicts the decimal case. After the swap stage, the exponent of operand A is larger than or equal to the exponent of operand B. This implies that only the mantissa of B could have been shifted into the guard and sticky positions. Rounding can happen on the first or the second rounding point, depending on whether the sum or absolute difference of digits 0 to 36 has a mantissa overflow, i.e., whether the most significant digit is non-zero. For rounding at the second rounding point, the injection needs to be applied to digits 33 to 36. Digit 33 is the tricky part, because there are now three terms: a digit from A, a digit from B, and the injection term. As described in [15], a 2-to-2 CSA (carry save adder) compression is applied to the operands A and B to create space for the injection. For binary, a single-bit-wide hole was enough, but for decimal we need a 4-bit-wide hole. That requires a special decimal CSA block. On digits 0 to 33, a regular 3-to-2 CSA, which adds ai + bi + 6, accounts for the 6-correction. For digit 34, a full addition of A and B is performed with a potential 6 or 12 correction.

After the CSA compression, we use a compound adder to compute the result and the result plus one for digits 0 through 33. In parallel, the injection rounding values for the first and second rounding points are added to digits 34-36 to produce two injection carry-outs Cj and Ck.

Figure 5. Injection Rounding for Decimal Numbers
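For the effective-add case, the excess-6 digit correction used by the compound adder of Section III.A can be shown in a few lines: add 6 to every digit of B so that a binary carry out of a 4-bit digit coincides with a decimal carry, and subtract the 6 again wherever no carry was produced. Only decimal addition is modeled here, with a ripple carry instead of the carry trees; the subtraction corrections and the binary suppression described above are omitted.

```python
def bcd_add(a_digits, b_digits):
    """Digit-wise BCD addition, least significant digit first, with the +6
    pre-correction on B and the -6 post-correction on digits without carry-out."""
    carry, out = 0, []
    for a, b in zip(a_digits, b_digits):
        s = a + (b + 6) + carry                           # 4-bit binary add of a, b+6, carry
        carry = 1 if s >= 16 else 0                       # binary carry-out == decimal carry-out
        out.append(s & 0xF if carry else (s - 6) & 0xF)   # undo the +6 when no carry-out
    return out, carry

# 37 + 85 = 122, digits given least significant first
assert bcd_add([7, 3], [5, 8]) == ([2, 2], 1)
```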

By selecting different pre-computed injection values for the decimal and binary radix and mapping the binary guard and sticky bits, the same data path can be reused for both radices without adding any delay along the critical path (see Figure 6 for the binary injection mapping). The round, guard, and sticky bits are spread onto digits 34 to 36 and are padded. The same happens to the binary injection bits. The mapping and padding are designed so that the arithmetic can be performed with decimal adders, and carries still propagate as desired.

Figure 6. Mapping of Binary Injection Rounding to the Decimal Path
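A simplified software model of injection rounding at two rounding points for the binary case: the injection constant for the no-overflow rounding point and the one for the overflow rounding point are both available, and the mantissa-overflow condition selects the result. This sketch uses round-to-nearest with ties away from zero, three guard bits, and an explicit overflow test instead of the carry-out based selection described in the following paragraphs, so it illustrates the principle rather than the actual circuit.

```python
import random

P, G = 113, 3        # quad precision width (from the text) and guard bits (assumption)

def round_by_injection(s):
    """Round an effective-add result s (P-bit significands aligned with G guard bits)
    to P bits by injection, round-to-nearest, ties away from zero."""
    if (s >> (P + G)) & 1:                      # sum already has a mantissa overflow:
        return (s + (1 << G)) >> (G + 1), 1     # inject at the second rounding point
    r = (s + (1 << (G - 1))) >> G               # inject at the first rounding point
    if r >> P:                                  # rounding itself overflowed the mantissa:
        return r >> 1, 1                        # post-rounding 1-bit correction shift
    return r, 0

# check against "keep the full-width sum, then round" on random normalized operands
for _ in range(10000):
    a = random.getrandbits(P) | (1 << (P - 1))
    b = random.getrandbits(P) | (1 << (P - 1))
    d = random.randint(0, G)                    # small exponent difference, nothing lost
    s = (a << G) + ((b << G) >> d)
    res, einc = round_by_injection(s)
    k = s.bit_length() - (P + G)                # 0 or 1 extra leading bit
    ref = (s + (1 << (G + k - 1))) >> (G + k)   # reference rounding of the exact sum
    assert res << einc == ref << k
```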

TABLE I below shows how the rounded result is generated out of the carry-outs of the compound adder (c0, c1) and the carry-outs from the injection at the first (Cj) and second (Ck) rounding points. This applies for all effective add cases and for effective subtract cases with A > B.

TABLE I. ROUNDED RESULT SELECTION

For effective subtract with B > A, we need the eac to select between A-B and B-A. For this case, neither decimal nor binary requires rounding. For decimal, rounding only applies when B is shifted right. This implies that A was fully normalized and B has at least one leading zero after alignment. Thus, for a decimal subtract with B > A there is no rounding, and the select logic selects HC based on the eac. For binary, B > A implies that both operands had the same exponent; nothing was shifted into the guard and sticky positions, and no rounding is required. The result is passed down the normalizer path for shifting out the leading zeros.

It must be shown that the addition and the injection rounding can be combined in one step while still providing the same result as if the intermediate result were first rounded and then normalized. Furthermore, it must be shown that, when rounding occurs, it is not possible to get more than one additional digit (decimal numbers) or bit (binary numbers). For simplicity this discussion will only focus on the binary case (see Figure 7); however, it can easily be extended to the radix-10 case.

Figure 7. Addition with Injection-Rounding

Claim: A second carry-out for RK is not possible on addition in homogeneous precision. There are two cases where we have an overflow and must round at the 2nd rounding point (see Figure 8).

Figure 8. No 2nd Carry-out Proof

In the first case, the addends A and B have the same exponent (eA = eB). Both mantissas align so that the guard bit and sticky bit are zero. This leads to no carry-in being generated. For A and B being all ones, the addition result is N ones followed by a zero. The result is exact and hence no increment occurs in the rounding step.

In the second case, operand A has a larger exponent than operand B (eA > eB). This implies that B got shifted to the right, and after shifting has at least one leading zero. Assuming the largest possible mantissas for A and B (A is all ones, B is a zero followed by all ones), the addition produces a carry into the most significant bit position of A, and the leading bits add up to a zero plus a carry-out. In this case the unrounded result has at least one zero prior to the most significant bit. When rounding up, the carry will only propagate to that position; hence also in this case a second carry-out cannot occur, allowing us to do the addition and the injection rounding in one step and produce the correct rounded result.
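The claim can also be checked exhaustively for a toy precision. The sketch below injects 0.5 ulp at the second rounding point whenever the aligned sum of two normalized significands has a mantissa overflow and confirms that the rounded value never carries out a second time; sticky collection is omitted, and the small widths stand in for the 113-bit hardware case.

```python
# exhaustive toy-precision check of the claim above
p, g = 5, 2                                       # toy significand width and guard bits
for ma in range(1 << (p - 1), 1 << p):            # normalized significands of A
    for mb in range(1 << (p - 1), 1 << p):        # normalized significands of B
        for d in range(0, p + g + 1):             # exponent difference eA - eB
            s = (ma << g) + ((mb << g) >> d)      # aligned sum with g guard bits
            if s >> (p + g):                      # mantissa overflow: 2nd rounding point
                r = (s + (1 << g)) >> (g + 1)     # inject 0.5 ulp there and truncate
                assert r < (1 << p)               # no second carry-out, as claimed
print("claim verified exhaustively for p =", p)
```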

IV. NORMALIZER AND ROUNDER LOGIC

For decimal add and subtract, the operands are aligned in the first pipeline stage according to the preferred quantum. Even in the case of massive cancellation, no normalization occurs. On an effective addition, the addition or the subsequent rounding can create an extra digit to the left. When exceeding the target width of the mantissa (significand overflow), the AREN together with the rounder selection performs a 1-digit shift and the quantum is incremented by one. On an effective subtraction, massive cancellation can occur, but the result does not get fully normalized. To match the preferred quantum, even in the case of an effective subtraction, only a 1-digit right shift is sufficient, provided the operands are properly aligned. Thus, add and subtract operations both flow through the AREN and its rounding circuitry into the packer.
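Python's decimal module implements the same IEEE 754-2008 decimal rules, so it can serve as a reference model for the preferred-quantum behaviour described above: the result of a decimal add keeps the finer of the two operand quanta and is not normalized; only when the coefficient would exceed the 34-digit format does rounding shift the quantum up.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

def dfp_add(a, b, digits=34):
    """Decimal quad precision (34-digit) addition at the preferred quantum."""
    return Context(prec=digits, rounding=ROUND_HALF_EVEN).add(a, b)

# the result is aligned to the smaller quantum (10^-3) and is not normalized
print(dfp_add(Decimal("1.20"), Decimal("3.400")))     # -> 4.600

# with 34 significant digits already in use, the carry forces a 1-digit shift
wide = Decimal("9" * 34)                              # 34 nines, quantum 10^0
print(dfp_add(wide, Decimal("1")))                    # -> 1.000...E+34, quantum shifted up
```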

More complex operations like decimal divide, decimal multiply, and the converts end in a round operation, which again passes through the AREN and the rounder as multi-cycle operations.

With regard to normalization, the rules for binary floating-point arithmetic are different from decimal floating-point. Unless the result of a binary floating-point operation is a subnormal number, the mantissa of the result must be fully normalized and represented in the format 1.f with an implicit leading one before the binary point.

Following the golden rule of computation from the IEEE 754 standard [18], binary floating-point addition and subtraction align the mantissas of the two operands, compute the sum or absolute difference, then normalize the exact result and round it to the target precision. In a fully pipelined fashion, the DQE hardware can either perform the normalization or the rounding including a potential 1-bit shift. However, doing both a full normalization and a subsequent rounding would require an extra pipeline stage. It can be shown that for homogeneous precision arithmetic, the binary floating-point add and subtract operations either require a wide normalization or rounding, but not both. It also holds that an IEEE overflow can only occur when the rounder is used, whereas an IEEE underflow can only occur when the normalizer is used. This allowed for a simpler exponent data flow, and let us implement the normalizer and rounding selection hardware in parallel.

Proof: Let A and B be the binary floating-point numbers with exponents eA and eB, respectively. Let NMIN be the smallest positive normal number and eMIN be its exponent.

Case 1add: eA = eB, A, B ≥ NMIN
In this case the mantissas of both numbers are of the form 1.f and the sum of the mantissas can gain an additional leading bit. If the sum does not gain an extra bit, the result is exact and needs neither rounding nor normalization. If the sum gains an extra bit, it requires rounding, but no normalization since there are no leading zeros. In either case, the result can be processed by the rounding circuit. Note that A+B ≥ A, B ≥ NMIN, and thus no underflow can occur.

Case 2add: eA = eB, A, B subnormal
In this case the mantissas of both numbers are of the form 0.f and the result cannot be greater than 1.g. The intermediate sum does not gain any additional bits beyond the integer bit; the result is exact and does not require rounding. However, depending on the setting of the underflow-enable mask, the result mantissa either stays in subnormal form or gets fully normalized. Either way, this operation can be performed by the normalizer. Note that the result is at most 2*NMIN and thus overflow cannot occur, only underflow.

Case 3add: eA > eB ≥ eMIN
Note that A is a normal number and its mantissa is of the form 1.f. The mantissa of the B operand is aligned relative to that of A and sticks out to the right. The sum can gain a leading bit, and then has one or two bits before the binary point. Thus, the sum will require rounding, but at most a 1-bit normalization. This can be done by the AREN and the rounding circuitry. Like in case 1add, the result can overflow but not underflow.

Case 1sub: eA = eB
Since both numbers have the same exponent, no alignment shift is required. The absolute difference of the two mantissas may lose one or more bits, but it cannot gain any extra leading bits. Thus, the difference is exact and does not require any rounding. However, it might require normalization and can cause an underflow. This case will use the normalizer path.

Case 2sub: eA = eB + 1, result keeps the most significant bit
Note that eA > eMIN, and operand A is a normal number. The mantissa of B is aligned relative to the mantissa of A and sticks out one bit to the right. In case there is no cancellation, i.e., the mantissa of the difference still has a leading one before the binary point, then the mantissa of the difference is p+1 bits wide and requires rounding to the target precision p. However, it does not require any normalization.

Case 3sub: eA = eB + 1, result lost the most significant bit
The mantissas of A and B are aligned as in the previous case, but cancellation occurs. This implies that the mantissa of the difference is of the form 0.f, and the fraction f has p bits. Thus, the difference does not require any rounding, just normalization of the leading zeros.

Case 4sub: eA ≥ eB + 2
Operand A is a normal number and its mantissa has the form 1.f. The B operand can be a normal or a subnormal number. The alignment shifts the mantissa of B at least 2 positions to the right. Thus, the aligned mantissa of B is of the form 0.0g. The exact difference D of A and B is D = 2^eA * (1.f - 0.0g) ≥ 2^eA * (1.0 - 0.5) = 2^eA * 0.5. Thus the mantissa of the difference loses at most one leading bit, and its mantissa has at least p+1 bits. The difference requires rounding and at most a 1-bit shift. Both can be handled by the AREN and the rounding circuitry. In this case, neither overflow nor underflow can occur.

The implementation of the normalizer and rounder path is shown in Figure 9. It consists of a rounding path that selects the correctly rounded result at the proper rounding position, as discussed in the previous section. For subtraction, we preshift the operands in the shifter step by one bit to the left for binary numbers, or by four bits to the left for decimal numbers. During subtraction it is possible to lose one or more bits through cancellation, but additional bits cannot be gained. The preshift allows us to implement a single shift-left-by-one-position shifter for both addition and subtraction. The second path through the normalizer is used for binary results that need to be normalized by more than one bit.
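The case split above can be condensed into a small decision function. It is only a restatement of the proof in code, for binary add/sub in homogeneous precision; the hardware derives the equivalent norm_sel signal from adder and LZA outputs rather than from exponents.

```python
def dqe_path(effective_op, eA, eB, cancel=False, both_subnormal=False):
    """Which result path a binary add/sub takes, per cases 1add-3add and 1sub-4sub:
    'rounder' needs at most a 1-bit shift, 'normalizer' needs no rounding.
    cancel: the effective subtraction lost its leading bit (case 2sub vs 3sub)."""
    if effective_op == "add":
        return "normalizer" if both_subnormal else "rounder"   # case 2add vs 1add/3add
    d = abs(eA - eB)
    if d == 0:                                                  # case 1sub: exact difference
        return "normalizer"
    if d == 1:                                                  # cases 2sub / 3sub
        return "normalizer" if cancel else "rounder"
    return "rounder"                                            # case 4sub: at most 1-bit loss

assert dqe_path("sub", eA=10, eB=10) == "normalizer"
assert dqe_path("sub", eA=12, eB=9) == "rounder"
assert dqe_path("add", eA=7, eB=7, both_subnormal=True) == "normalizer"
```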


Figure 9. Normalizer & Rounder Circuit

The final result exponent and mantissa are computed either along the normalizer or along the rounder path. The normalizer path is selected by an equation derived from the proof above.

A result produced by the normalizer path (i.e., norm_sel = 1) cannot overflow, as shown above. Let LZA be the shift amount anticipated by the LZA circuit and LZA2large the correction derived from the most significant bit of the output of the normalizer circuit. Let e = MAX{eA, eB}. The result exponent must be adjusted by LZA - LZA2large if it is greater than or equal to eMIN. The result mantissa is shifted by that amount as well. If the resulting exponent is less than eMIN and underflow enable is set, then the exponent is rebiased, the result mantissa is fully normalized, and the underflow flag is set. If the resulting exponent is less than eMIN and underflow enable is not set, then the result is a subnormal number. The exponent result e' is zero and the mantissa is shifted by e - eMIN.

For rounding results (i.e., norm_sel = 0), underflow cannot occur. In the first step the correctly rounded result is selected based on whether the operation is an addition, operates on binary numbers, or has an end-around carry for decimal numbers.

The exponent and control logic delivers a sigovf signal that is one if the intermediate result, RRes, is binary with a most significant bit that is greater than zero, or if the intermediate result is decimal with a most significant digit that is greater than zero. If e + sigovf is less than or equal to the maximum exponent eMAX, then the result exponent is e + sigovf and the result mantissa is either RRes if sigovf is zero, or RRes shifted by one bit for a binary number or by one digit for a decimal number. If e equals eMAX and sigovf is set, then the result has overflowed. If overflow enable is set, the exponent is rebiased, the result mantissa is computed as before, and overflow is indicated. If overflow enable is not set, the result overflows to infinity; the exponent and mantissa of the result are set to that special value.

The third leg of the multiplexer is also used to force the maximum number or NaNs for special cases.
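The sigovf handling on the rounder path can be paraphrased as follows. The exponent wrap amount used when the overflow trap is enabled is architected elsewhere and is therefore passed in as a parameter here; returning None for the infinity case stands in for the special value forced in the packer.

```python
def rounder_exponent(e, rres, sigovf, e_max, rebias, is_binary=True, overflow_enable=False):
    """Result exponent/mantissa selection on the rounder path, following the
    sigovf description above. Returns (exponent, significand, overflow_flag)."""
    shift = 1 if is_binary else 4                 # one bit for binary, one digit for decimal
    sig = (rres >> shift) if sigovf else rres     # 1-position shift on mantissa overflow
    if e + sigovf <= e_max:                       # still representable
        return e + sigovf, sig, False
    if overflow_enable:                           # trap enabled: wrap (rebias) the exponent
        return e + sigovf - rebias, sig, True
    return None, None, True                       # overflow to infinity (forced in the packer)

# toy values only: a decimal result whose rounding carried into an extra digit
print(rounder_exponent(e=50, rres=0x10000, sigovf=1, e_max=100,
                       rebias=64, is_binary=False))    # -> (51, 0x1000, False)
```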

V. DIVIDE AND SQUARE ROOT ENGINE

Each of the two pipelines in the VFU has a divide / square root engine. It is a standalone engine which supports all binary and hexadecimal floating-point data types, i.e., single, double, and quad precision. The advantage of a standalone engine is that the z13 can freely start new VFU instructions in the other engines, including the DQE, while these multi-cycle operations are ongoing. The underlying algorithm of the engine is SRT, generating 3 bits per cycle for divide and 2 bits per cycle for square root. The mantissa of the quad precision operations is 113 bits plus some extra bits for rounding. A major challenge was to perform an SRT step on such a wide mantissa and fit it in a single 5GHz cycle. Using a partially redundant number format was key to meeting that challenge.

The engine supports subnormal binary and unnormalized hexadecimal numbers [19] in hardware. Upon receiving the operands, it checks whether the mantissa of one of the operands is unnormalized, counts the number of leading zeros, and fully normalizes the operands. It then initializes the SRT engine and runs for a defined number of iterations, generating two or three new result bits in each cycle. The iteration count is based only on the operation and precision. Subnormal results may require fewer bits to be computed, allowing the operation to complete in fewer iterations, but these cases are rare in commercial applications, and adding hardware to support early-out cases would only increase the interface complexity with virtually no performance benefit.
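For comparison with the 3-bits-per-cycle SRT recurrence, here is a plain radix-8 digit-recurrence division on normalized significands. Each step finds the quotient digit by a full comparison, whereas the hardware's SRT step picks a redundant digit from a few leading bits of a partially redundant remainder; square root and the rounding of the final quotient are not modeled.

```python
def divide_significands(x, d, p, extra=3):
    """Quotient of two normalized p-bit significands, one radix-8 digit
    (3 result bits) per iteration; the remainder feeds the sticky bit."""
    assert (x >> (p - 1)) and (d >> (p - 1)), "operands must be normalized"
    q = 1 if x >= d else 0                    # integer bit: x/d lies in (0.5, 2)
    rem = x - d if x >= d else x
    for _ in range((p + extra + 2) // 3):     # enough digits for p bits plus rounding bits
        rem <<= 3
        digit = rem // d                      # 0..7 here, by full divide; SRT would use
        rem -= digit * d                      # a redundant digit set and estimates instead
        q = (q << 3) | digit
    return q, rem

# toy 8-bit significands: 1.5 / 1.0
q, rem = divide_significands(0b11000000, 0b10000000, p=8)
assert q == 3 << 11 and rem == 0              # 1.5 scaled by 2**12, exact
```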


VI. RESULTS

The DQE in the IBM z13™ executes quad precision floating-point operations faster than prior published works. TABLE II compares the performance of common quad precision arithmetic operations on the z13™ to that of the prior generation zEC12™ mainframe processor [10]. The latency of an operation refers to the number of cycles it takes from receiving its operands until its results are available for the next dependent instruction to use. The throughput refers to the average number of cycles after which independent instructions can be started on the same engine, known as cycles per instruction (CPI). Note that for z13™ the throughput numbers reflect the fact that there are two Decimal and Quad Precision Engines and two Divide Engines in the processor. Unlike prior machines, divide and square root operations can execute in parallel with addition, subtraction, and multiplication operations on the DQE. Furthermore, while the DQE and divide engines are executing multi-cycle operations, the two 64-bit Binary Floating Point Units and two SIMD engines in the z13™ VFU can continue to execute two instructions every cycle [11].

TABLE II. PERFORMANCE OF COMMON QUAD PRECISION OPERATIONS ON Z13™ COMPARED TO PREVIOUS GENERATION ZEC12™
            Latency zEC12™   Latency z13™   CPI zEC12™   CPI z13™
  Add/Sub        35               11            28          1.5
  Multiply      55-97             23           48-90        7.5
  Divide*       ~165              49           ~158         21
  Sqrt*         ~170              66           ~163         24

  * Divide/Square Root executed in the Divide Engine, not in the DQE.

VII. CONCLUSIONS

This paper describes the quad precision floating-point hardware in the Vector and Floating Point Unit of the IBM z13™ processor. The hardware was designed to maximize performance for quad precision floating-point operations, which occur with increasing frequency in Business Analytics workloads, while providing significant performance improvements for existing commercial workloads. The new decimal and quad precision engine (DQE) consists of an 8-stage deep execution pipeline and supports a 5GHz cycle time. An advanced bypass network provides early result forwarding to prevent performance loss on dependent operations. The total area of the VFU hardware, including the vector and floating point register files, is 3.9mm² in a 22nm technology.

REFERENCES

[1] G. Lake, T. Quinn, and D.C. Richardson, "From Sir Isaac to the Sloan survey: Calculating the structure and chaos due to gravity in the universe," Proc. of the 8th ACM-SIAM Symposium on Discrete Algorithms, 1997, pp. 1-10.
[2] P.H. Hauschildt and E. Baron, "The numerical solution of the expanding stellar atmosphere problem," Journal of Computational and Applied Mathematics, vol. 109 (1999), pp. 41-63.
[3] D.H. Bailey, R. Barrio, and J.M. Borwein, "High precision computation: Mathematical physics and dynamics," Applied Mathematics and Computation, vol. 218 (2012), pp. 10106-10121.
[4] Y. He and C. Ding, "Using accurate arithmetic to improve numerical reproducibility and stability in parallel applications," Journal of Supercomputing, vol. 18, no. 3 (Mar 2001), pp. 259-277.
[5] D. Bailey, "High-Precision Computations: Applications and Challenges," Keynote, 21st Symposium on Computer Arithmetic, April 2013.
[6] A. Greenbaum, "Iterative Methods for Solving Linear Systems," SIAM, Philadelphia, 1997.
[7] E. Schwarz, "The IBM z13 SIMD Accelerators for Integer, String, and Floating-Point," 22nd Symposium on Computer Arithmetic, 2014.
[8] S. Trong, M. Schmookler, E. Schwarz, and M. Kroener, "P6 Binary Floating-Point Unit," Proceedings of the 18th Symposium on Computer Arithmetic, pp. 77-86, June 2007.
[9] S. Carlough, S. Mueller, A. Collura, and M. Kroener, "The z196 Decimal Floating Point Accelerator," IEEE Symposium on Computer Arithmetic, July 2011, pp. 139-146.
[10] C.K. Shum, F. Busaba, and C. Jacobi, "IBM zEC12: The Third-Generation High-Frequency Mainframe Microprocessor," IEEE Micro (2013), pp. 38-47.
[11] B.W. Curran, C. Jacobi, et al., "The IBM z13 multithreaded microprocessor," IBM Journal of Research and Development, Issue 4/5 (July 2015), pp. 2.1-2.16.
[12] E. Schwarz, et al., "The IBM z13 multithreaded microprocessor," IBM Journal of Research and Development, Issue 4/5 (July 2015), pp. 1.1-1.13.
[13] E. Schwarz, R.B. Krishnamurthy, C.J. Parris, et al., "The SIMD accelerator for business analytics on the IBM z13," IBM Journal of Research and Development, Issue 4/5 (July 2015), pp. 1.1-1.13.
[14] P.M. Seidel and G. Even, "Delay-optimized implementation of IEEE floating-point addition," IEEE Transactions on Computers, 53(2):97-113, 2004.
[15] G. Even and P.M. Seidel, "A comparison of three rounding algorithms for IEEE floating-point multiplication," IEEE Transactions on Computers, 49(7), July 2000.
[16] A. Vazquez and E. Antelo, "A High-Performance BCD Adder with IEEE 754-2008 Decimal Rounding," 19th Symposium on Computer Arithmetic, 2009.
[17] L.K. Wang and M.J. Schulte, "Decimal Floating-Point Adder and Multifunction Unit with Injection-Based Rounding," 18th Symposium on Computer Arithmetic, 2007.
[18] American National Standards Institute and Institute of Electrical and Electronics Engineers, "IEEE Standard for Binary Floating-Point Arithmetic," ANSI/IEEE Standard 754-1985, 1985.
[19] International Business Machines, "z/Architecture Principles of Operation z13," SA22-7832-10, 2015.
