OoO FP Execution Engine

Design and implementation of an out-of-order execution engine of floating point arithmetic operations

Ing. Cristóbal Ramírez Lazo
Director: PhD. Osman Sabri Unsal
Codirector: PhD. Luis Alfonso Villa Vargas
Ponent: PhD. Adrián Cristal Kestelman

Barcelona School of Informatics - Universitat Politècnica de Catalunya
Computing Research Center - Instituto Politécnico Nacional

This thesis is submitted for the degree of Master of Science
Feb 2015

Abstract

A floating point unit (FPU), also known as a math coprocessor, is the part of a processor that performs operations on floating point numbers. Nowadays almost all processors include a floating point unit on the chip. This unit is comparatively complex and consumes a comparatively large chip area, and for this reason many processors share it between a pair of cores. When a CPU executes a program that calls for floating point operations and the hardware does not support them, the CPU emulates them with a series of simpler fixed point arithmetic operations that run on the integer arithmetic logic unit, which results in low performance for this kind of application.

The Centro de Investigación en Computación of the Instituto Politécnico Nacional is working on a project currently in development, called Lagarto, to create intellectual property in embedded high performance processor architectures and operating systems for research and teaching. Lagarto II is a superscalar processor that fetches, decodes and dispatches up to two instructions per clock cycle and will support a complete set of 32-bit instructions that operate on 64-bit data; the architecture is synthesizable on FPGA devices.

In this thesis, work is undertaken towards the design in hardware description languages, and the implementation on FPGA, of an out-of-order execution engine for floating point arithmetic operations.
A first proposal covers the design of a low power consumption issue queue for out-of-order processors, a register bank, a bypass network and the functional units for addition/subtraction, multiplication, division/reciprocal and Fused Multiply Accumulate (FMAC), conforming with the IEEE-754 standard. The design supports the double precision format and denormalized numbers. A second proposal is based on a pair of FMAC units as functional units, which can perform almost all floating point operations; this design is more beneficial in area, performance and energy efficiency than the first version.

Acknowledgments

I would like to express my gratitude to my advisors Luis Alfonso Villa and Osman Unsal and to professor Marco Antonio Ramírez, who shared much of their time and knowledge so that I could finish my thesis work. I am also grateful to the professors of the Computer Research Center of IPN and the Barcelona School of Informatics of UPC, who walked me through this interesting research line. In addition, I express my gratitude to CONACYT, to the Computer Research Center of IPN and to the project SIP: 20150957 "Desarrollo de Procesadores de Alto Desempeño para Sistemas en Chips Programables", which financed part of my master's degree. And of course to my family: my parents Cristóbal Ramírez Salinas and Florina Lazo Osorio and my sister Itzel Ramírez Lazo, who have always encouraged me to keep going.

Table of contents

List of Figures
List of Tables
Glossary
1. Introduction
    1.1. Motivation
    1.2. Objectives
    1.3. Justification
    1.4. Organization
2. Background
    2.1. Superscalar Architectures
        2.1.1. Issue Stage
        2.1.2. Read Register Stage
        2.1.3. Execution Stage
        2.1.4. Commit Stage
    2.2. Floating Point Numbers
        2.2.1. IEEE 754 standard
3. State of the Art
    3.1. Issue Queue
    3.2. Register File
    3.3. Execution Stage
        3.3.1. Floating Point Adder/Subtractor
        3.3.2. Floating Point Multiplier
        3.3.3. Floating Point Divider
    3.4. Intel Itanium Floating Point Architecture
    3.5. AMD Bulldozer Architecture
4. Design and implementation
    4.1. First Proposal
        4.1.1. Issue Queue
        4.1.2. Register Bank
        4.1.3. Execution Stage
        4.1.4. Bypass design
        4.1.5. Complete design
    4.2. Second Proposal
5. Implementation
    5.1. First Version
        5.1.1. Issue Queue
        5.1.2. Register Bank
        5.1.3. Execution Stage
        5.1.4. Recovery
        5.1.5. Complete design
    5.2. Second Version
        5.2.1. Issue Queue
        5.2.2. Register Bank
        5.2.3. Fused Multiply Accumulate Unit (FMAC)
        5.2.4. Recovery
        5.2.5. Complete design
6. Testing
7. Conclusions, Results, Future works and Research’s Products
APPENDIX A MIPS 64 Revision 6 and the IEEE standard 754
APPENDIX B FPU Instruction Set (Release 6)
References

List of Figures

Fig. 2.1 Lagarto II Microarchitecture
Fig. 2.2 Backend and Frontend in a superscalar
