High-Performance Decimal Floating-Point Units
UNIVERSIDADE DE SANTIAGO DE COMPOSTELA
DEPARTAMENTO DE ELECTRÓNICA E COMPUTACIÓN

PhD Dissertation

High-Performance Decimal Floating-Point Units

Álvaro Vázquez Álvarez

Santiago de Compostela, January 2009

To my family
Á miña familia

Acknowledgements

It has been a long way to see this thesis successfully concluded, longer at least than I had imagined. Perhaps the moment to thank and acknowledge everyone’s contributions is the most eagerly awaited. This thesis would not have been possible without the support of several people and organizations, for whose contributions I am very grateful.

First of all, I want to express my sincere gratitude to my thesis advisor, Elisardo Antelo. In particular, I would like to emphasize the invaluable support he has offered me all these years. His ideas and contributions have had a major influence on this thesis.

I would like to thank everyone in the Departamento de Electrónica e Computación for the material and personal help they gave me to carry out this thesis, and for providing a friendly place to work. In particular, I would like to mention Prof. Javier D. Bruguera and the other staff of the Computer Architecture Group. Many thanks to Paula, David, Pichel, Marcos, Juanjo, Óscar, Roberto and my other workmates for their friendship and help.

I am very grateful to IBM Germany for their financial support through a one-year research contract. I would like to thank Ralf Fischer, lead of hardware development, and Peter Roth and Stefan Wald, team managers at IBM Deutschland Entwicklung in Böblingen. I would like to extend my gratitude to the FPU design team, especially to Silvia Müller and Michael Kröner, for their help and the warm welcome I received during my stay in Böblingen. I would also like to thank Eric Schwarz from IBM for his support.

Many thanks to Paolo Montuschi from Politecnico di Torino for his collaboration in several parts of this research.
Finally, I want to thank the institutions that have financially supported this research through travel grants to attend conferences: Universidade de Santiago de Compostela, Ministerio de Ciencia y Tecnología (Ministry of Science and Technology) of Spain under contracts TIN2004-07797-C02 and TIN2007-67537-C03, and Xunta de Galicia under contract PGIDT03-TIC10502PR.

“Before computers, I used my 10 fingers, but now ... what am I supposed to do with the other 9?”
Anonymous

Contents

1 Decimal Computer Arithmetic: An Overview
  1.1 The evolution of ALUs: decimal vs. binary
  1.2 The new financial and business demands
  1.3 Decimal floating-point: Specifications, standard and implementations
  1.4 Current and future trends
2 Decimal Floating-Point Arithmetic Units
  2.1 IEEE 754-2008 standard for floating-point
  2.2 Decimal floating-point unit design
  2.3 Decimal arithmetic operations for hardware acceleration
3 10’s Complement BCD Addition
  3.1 Previous work on BCD addition/subtraction
    3.1.1 Basic 10’s complement algorithm
    3.1.2 Direct decimal addition
    3.1.3 Speculative decimal addition
  3.2 Proposed method: conditional speculative decimal addition
  3.3 Proposed architectures
    3.3.1 Binary prefix tree architectures
    3.3.2 Hybrid prefix tree/carry-select architectures
    3.3.3 Ling prefix tree architectures
  3.4 Sum error detection
    3.4.1 2’s complement binary addition
    3.4.2 10’s complement BCD addition
    3.4.3 Mixed binary/BCD addition
  3.5 Evaluation results and comparison
    3.5.1 Evaluation results
    3.5.2 Comparison
  3.6 Conclusions
4 Sign-Magnitude BCD Addition
  4.1 Basic principles
  4.2 Sign-magnitude BCD speculative adder
  4.3 Proposed method for sign-magnitude addition
  4.4 Architectures for the sign-magnitude adder
    4.4.1 Binary prefix tree architecture
    4.4.2 Hybrid prefix/carry-select architecture
    4.4.3 Ling prefix tree architectures
  4.5 Sum error detection
  4.6 Evaluation results and comparison
    4.6.1 Evaluation results
    4.6.2 Comparison
  4.7 Conclusions
5 Decimal Floating-Point Addition
  5.1 Previous work on DFP addition
    5.1.1 IEEE 754-2008 compliant DFP adders
    5.1.2 Significand BCD addition and rounding
  5.2 Proposed method for combined BCD significand addition and rounding
  5.3 Architecture of the significand BCD adder with rounding
    5.3.1 Direct implementation of decimal rounding
    5.3.2 Decimal rounding by injection
  5.4 Evaluation results and comparison
    5.4.1 Evaluation results
    5.4.2 Comparison
  5.5 Conclusions
6 Multioperand Carry-Free Decimal Addition
  6.1 Previous Work
  6.2 Proposed method for fast carry-save decimal addition
    6.2.1 Alternative Decimal Digit Encodings
    6.2.2 Algorithm
  6.3 Decimal 3:2 and 4:2 CSAs
    6.3.1 Gate level implementation
    6.3.2 Implementation of digit recoders
  6.4 Decimal carry-free adders based on reduction of bit columns
    6.4.1 Bit counters
    6.4.2 Architecture
  6.5 Decimal and combined binary/decimal CSA trees
    6.5.1 Basic implementations
    6.5.2 Area-optimized implementations
    6.5.3 Delay-optimized implementations
    6.5.4 Combined binary/decimal implementations
  6.6 Evaluation results and comparison
    6.6.1 Evaluation results
    6.6.2 Comparison
  6.7 Conclusions
7 Decimal Multiplication
  7.1 Overview of DFX multiplication and previous work
  7.2 Proposed techniques for parallel DFX multiplication
  7.3 Generation of partial products
    7.3.1 Generation of multiplicand’s multiples
    7.3.2 Signed-digit multiplier recodings
  7.4 Reduction of partial products
  7.5 Decimal fixed-point architectures
    7.5.1 Decimal SD radix-10 multiplier
    7.5.2 Decimal SD radix-5 multiplier
    7.5.3 Combined binary/decimal SD radix-4/radix-5 multipliers
  7.6 Decimal floating-point architectures
    7.6.1 DFP multipliers
    7.6.2 Decimal FMA: Fused-Multiply-Add
  7.7 Evaluation results and comparison
    7.7.1 Evaluation results
    7.7.2 Comparison
  7.8 Conclusions
8 Decimal Digit-by-Digit Division
  8.1 Previous work
  8.2 Decimal floating-point division
  8.3 SRT radix-10 digit-recurrence division
    8.3.1 SRT non restoring division
    8.3.2 Decimal representations for the operands
    8.3.3 Proposed algorithm
    8.3.4 Selection function
  8.4 Decimal fixed-point architecture
    8.4.1 Implementation of the datapath
    8.4.2 Operation sequence
    8.4.3 Implementation of the selection function
    8.4.4 Implementation of the decimal (5421) coded adder
  8.5 Evaluation and comparison
    8.5.1 Evaluation results
    8.5.2 Comparison
  8.6 Conclusions
Conclusions and Future Work
A Area and Delay Evaluation Model for CMOS circuits
  A.1 Parametrization of the Static CMOS Gate Library
  A.2 Path delay evaluation and optimization
  A.3 Optimization of buffer trees and forks
  A.4 Area and delay estimations of some basic components
Bibliography

List of Figures

1.1 Example of a decimal tax calculation using binary floating-point.
2.1 DFP interchange format encodings.
2.2 Configurations for the architecture of the DFU.
3.1 10’s complement Addition/Subtraction Algorithm.
3.2 Direct Decimal Algorithm.
3.3 Direct decimal adder with direct BCD sum.
3.4 Mixed binary/direct decimal adder using a hybrid configuration.
3.5 Decimal Speculative Algorithm.