Computer Architecture and System Programming Laboratory

TA Session 7: x87 FPU

The x87 Floating-Point Unit (FPU) provides high-performance floating-point processing capabilities:
• floating-point, integer, and packed BCD integer data types
• floating-point processing algorithms
• exception handling
• IEEE Standard 754

Reference: http://home.agh.edu.pl/~amrozek/x87.pdf

The x87 FPU is a separate execution environment consisting of 8 data registers and a set of special-purpose registers (the status word, control word, and tag word, among others). A value loaded from memory into an x87 FPU data register is automatically converted into double extended-precision floating-point format.

x87 FPU instructions treat the eight x87 FPU data registers as a register stack. The register number of the current top-of-stack register is stored in the TOP (stack TOP) field of the x87 FPU status word. Load operations decrement TOP by one and load a value into the new top-of-stack register; store operations store the value from the current TOP register in memory and then increment TOP by one.

The 16-bit x87 FPU status register indicates the current state of the x87 FPU. The 16-bit tag word indicates the contents of each of the 8 registers in the x87 FPU data-register stack (one 2-bit tag per register). Each tag in the tag word corresponds to a physical register; the TOP pointer is used to associate tags with registers relative to ST(0).

```nasm
var1: dt 5.6
var2: dt 2.4
var3: dt 3.8
var4: dt 10.3

fld  tword [var1]   ; st0 = 5.6, TOP=4
fmul tword [var2]   ; st0 = st0*2.4 = 13.44, TOP=4
fld  tword [var3]   ; st0 = 3.8, st1 = 13.44, TOP=3
fmul tword [var4]   ; st0 = st0*10.3 = 39.14, st1 = 13.44, TOP=3
fadd st1            ; st0 = st0+st1, st1 = 13.44, TOP=3
```

gdb command to see the stack data registers: `tui reg float`

The x87 FPU recognizes and operates on the following seven data types: single-precision floating point, double-precision floating point, double extended-precision floating point, signed word integer, signed doubleword integer, signed quadword integer, and packed BCD decimal integers.

Example: loading the integer 9 from RAM into the x87 data-register stack converts it to a double extended-precision floating-point number:

```nasm
mov qword [n], 9    ; integer 9 in memory
fild qword [n]      ; convert it and push it onto the register stack
```

9d = 1001b = 1001.0b = (−1)^0 ∙ 1.001b ∙ 2^11b

sign bit = 0, exponent = 11b (i.e., 3), significand = 1.001

(Figure in the slides: the sign, exponent, and significand fields of this number as stored in an 80-bit x87 data register.)

FPU INSTRUCTION SET

The x87 FPU instructions are known as ESC (escape) instructions. They share a common opcode format, in which the first byte of the opcode is one of the numbers D8H through DFH. A dedicated group of load-constant instructions (FLD1, FLDZ, FLDPI, FLDL2T, FLDL2E, FLDLG2, FLDLN2) pushes commonly used constants onto st0.

Basic Arithmetic Instructions

Operands in memory can be in single-precision floating-point, double-precision floating-point, word-integer, or doubleword-integer format. They are converted to double extended-precision floating-point format automatically. Reverse instructions (e.g., FSUBR, FDIVR) perform the operation with the source and destination operands swapped: FSUBR computes ST(0) = source − ST(0) instead of ST(0) = ST(0) − source. The pop versions of the instructions offer the option of popping the x87 FPU register stack after the arithmetic operation: they operate on values in the ST(i) and ST(0) registers, store the result in the ST(i) register, and pop the ST(0) register.

Control Instructions

The FINIT/FNINIT instructions initialize the x87 FPU and its internal registers to default values.
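As a minimal sketch (not from the original slides) of how the status word and TOP behave around FINIT, the following program pushes one value and returns the TOP field as its exit status. It assumes NASM, the x86-64 System V ABI, and uses the fact that TOP occupies bits 11–13 of the status word:

```nasm
; Minimal sketch: observe the TOP field of the x87 status word.
global main
section .text
main:
    enter 0, 0
    finit           ; tag word := FFFFH (all registers empty), TOP := 0
    fld1            ; push +1.0; the load decrements TOP from 0 to 7
    fnstsw ax       ; store the x87 status word into AX
    shr eax, 11
    and eax, 7      ; isolate TOP (bits 11..13 of the status word)
    fstp st0        ; pop the stack so no register is left in use
    leave
    ret             ; exit status = TOP
```

Running it with `./a.out; echo $?` should print 7, confirming that a load decrements TOP before writing the new top-of-stack register.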
Stack overflow and underflow exceptions

Stack overflow — an instruction attempts to load a value from memory into a non-empty x87 FPU register. A non-empty register is defined as a register containing a zero (tag value of 01), a valid value (tag value of 00), or a special value (tag value of 10).

Stack underflow — an instruction references an empty x87 FPU register as a source operand, including attempting to write the contents of an empty register to memory. An empty register has a tag value of 11.

Magic square

http://www.1728.org/magicsq1.htm

For an n × n magic square, each row, each column, and both diagonals sum to n ∙ (n² + 1) ÷ 2; for the 3 × 3 square this is 3 ∙ (3² + 1) ÷ 2 = 15.

1) '1' goes in the middle of the top row.
2) All subsequent numbers are placed one column to the right and one row up from the previous number.
3) Whenever the next number placement is above the top row, stay in that column and place the number in the bottom row.
4) Whenever the next number placement is outside of the rightmost column, stay in that row and place the number in the leftmost column.
5) When encountering a filled-in square, place the next number directly below the previous number.
6) When the next number position is outside both the top row and the rightmost column, place the number directly beneath the previous number.

The program below implements these rules for an n × n square; a worked 3 × 3 trace follows the listing.

```nasm
section .data
fs_usage:         db "Call with single, positive, odd number", 10, 0
fs_malloc_failed: db "A call to malloc() failed", 10, 0
fs_long:          db "%*ld", 0
fs_newline:       db 10, 0

section .bss
argv:  resq 1
n:     resq 1
n2:    resq 1       ; n2, a, and b are reserved but not used in the code shown
a:     resq 1
b:     resq 1
table: resq 1
width: resq 1

extern printf, atoi, calloc
global main

section .text
main:
    enter 0, 0
    finit                       ; initialize the x87 FPU and its internal registers to
                                ; default values; the tag word is set to FFFFH, which
                                ; marks all the x87 FPU data registers as empty
    mov qword [argv], rsi
    cmp rdi, 2                  ; argc
    jne .error
    mov rdi, qword [argv]
    mov rdi, qword [rdi + 8*1]  ; argv[1]
    call atoi
    cmp rax, 2
    jle .error
    test rax, 1                 ; tests whether the number is odd; the equivalent
                                ; would be 'and rax, 1', but that would change rax
    jz .error
    mov qword [n], rax
    mov rdi, rax
    mov rsi, 8
    call calloc                 ; allocate the array of row pointers
    cmp rax, 0
    je .malloc_failed
    mov qword [table], rax
    mov rdx, rax
    mov rax, 0
    mov rbx, qword [n]
.allocate_table:
    cmp rax, rbx                ; check if we reached the end of the table
    je .fill_table              ; if yes, finish allocation and start filling the table
    mov rdi, rbx
    mov rsi, 8                  ; gdb's disassembly shows this line as "mov esi, 8"
    push rax
    push rbx
    push rdx                    ; note: an odd number of pushes leaves rsp 16-byte
                                ; misaligned at the call below
    call calloc                 ; allocate a single row of the table
    pop rdx
    pop rbx
    pop rax
    mov qword [rdx], rax
    add rdx, 8
    add rax, 1
    jmp .allocate_table

.fill_table:
    mov rbx, 0                  ; a = 0 (row)
    mov r9, qword [n]           ; n
    mov rcx, r9
    shr rcx, 1                  ; b = n / 2 (column)
    mov r8, 1                   ; i
    mov rax, r9
    cdq                         ; cdq clears edx (and thus rdx) since eax is
                                ; non-negative here
    mul rax
    mov r10, rax                ; n^2
.fill_table_loop:
    cmp r8, r10                 ; i == n^2?
    jg .fill_table_done
    mov rdi, qword [table]      ; rdi = pointer to table
    mov rdi, qword [rdi + 8 * rbx] ; rdi = pointer to row[rbx] of the table
                                ; (row 0, then row 1, and then row 2)
    mov qword [rdi + 8 * rcx], r8
    inc r8                      ; r8 = 1, 2, 3, ...
    lea rax, [rbx + r9 - 1]     ; move one row up (mod n) ...
    cdq
    div r9
    mov rbx, rdx
    lea rax, [rcx + 1]          ; ... and one column to the right (mod n)
    cdq
    div r9
    mov rcx, rdx
    mov rdi, qword [table]
    mov rdi, qword [rdi + 8 * rbx]
    cmp qword [rdi + 8 * rcx], 0 ; is the target square still empty?
    je .fill_table_loop
    lea rax, [rbx + 2]          ; occupied: place the number directly below the
    cdq                         ; previous one, i.e. one row down and one column
    div r9                      ; left of the (already updated) position, mod n
    mov rbx, rdx
    lea rax, [rcx + r9 - 1]
    cdq
    div r9
    mov rcx, rdx
    jmp .fill_table_loop
```
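As a check on the index arithmetic, here is the order in which the loop above fills a 3 × 3 table, worked out from the code (rbx is the row, rcx the column):

```
1 at (0,1)                 start: row 0, column n/2
2 at (2,2), 3 at (1,0)     up-right moves, wrapping around both edges
4 at (2,0)                 target (0,1) occupied: directly below 3
5 at (1,1), 6 at (0,2)     up-right moves
7 at (1,2)                 wrapped target (2,0) occupied: directly below 6
8 at (0,0), 9 at (2,1)     up-right moves

resulting square:   8 1 6
                    3 5 7
                    4 9 2
```

Each row, column, and diagonal of the result sums to 3 ∙ (3² + 1) ÷ 2 = 15, as expected.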
After the table is filled, execution continues at .fill_table_done, where the print width is computed with the x87 FPU. Dividing by log2(10) converts the base-2 logarithm into base 10, since log10(x) = log2(x) / log2(10):

```nasm
.fill_table_done:
    fild qword [n]      ; FILD (load integer) converts an integer operand in memory
                        ; into double extended-precision floating-point format and
                        ; pushes the value onto the top of the register stack
    fld st0             ; FLD (load floating point) pushes a floating-point operand
                        ; onto the top of the x87 FPU data-register stack
                        ; (here it duplicates st0, so st0 = st1 = n)
    fmulp               ; multiply floating point and pop ST(0): st0 = n^2
    fxtract             ; extract exponent and significand: significand in ST(0),
                        ; exponent in ST(1) (in base 2)
    fld1                ; load +1.0 into ST(0)
    fxch                ; with no source operand, the contents of ST(0) and ST(1)
                        ; are exchanged
    fyl2x               ; FYL2X computes y * log2(x): replace ST(1) with
                        ; ST(1) * log2(ST(0)) and pop the register stack
    faddp               ; add ST(0) to ST(1), store the result in ST(1), and pop:
                        ; st0 = exponent + log2(significand) = log2(n^2)
    fldl2t              ; push log2(10) onto the FPU register stack
    fdivp               ; divide ST(1) by ST(0), store the result in ST(1), and pop:
                        ; st0 = log10(n^2) -- indeed we want log10(x), not log2(x)
    jmp .continue_voodoo
.voodoo: dq 1.5         ; constant: add 1.5 to st0 so that storing the result as an
                        ; integer at the label 'width' yields the rounded field width
.continue_voodoo:
    fld qword [.voodoo]
    faddp               ; add ST(0) to ST(1), store the result in ST(1), and pop:
                        ; st0 = log10(n^2) + 1.5
    fistp qword [width] ; store ST(0) in m64int and pop the register stack; indeed,
                        ; this rounds 'width' because it converts st0 to an integer

;;; PRINT THE MAGIC SQUARE
    mov rbx, 0                      ; row index
.outer_loop:
    cmp rbx, qword [n]
    je .end
    mov rcx, 0                      ; column index
.inner_loop:
    cmp rcx, qword [n]
    je .end_inner_loop
    mov rdi, fs_long                ; "%*ld": the field width is read from the
    mov rsi, qword [width]          ; argument list
    mov rdx, qword [table]
    mov rdx, qword [rdx + 8 * rbx]  ; pointer to row[rbx]
    mov rdx, qword [rdx + 8 * rcx]  ; table[rbx][rcx]
    mov rax, 0                      ; no vector registers used in the varargs call
    push rbx
    push rcx
    call printf
    pop rcx
    pop rbx
    inc rcx
    jmp .inner_loop
.end_inner_loop:
    mov rdi, fs_newline
    mov rax, 0
    push rbx
    call printf
    pop rbx
    inc rbx
    jmp .outer_loop

.error:
    mov rdi, fs_usage
    mov rax, 0
    call printf
    jmp .end
.malloc_failed:
    mov rdi, fs_malloc_failed
    mov rax, 0
    call printf
.end:
    leave
    ret
```
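A possible build-and-run sequence (assuming the listing is saved as magic.asm; the -no-pie flag is needed with newer gcc because the code takes absolute addresses such as fs_usage):

```
$ nasm -f elf64 -o magic.o magic.asm
$ gcc -no-pie -o magic magic.o
$ ./magic 3
 8 1 6
 3 5 7
 4 9 2
```

For n = 3 the computed field width is round(log10(9) + 1.5) = 2, so each entry is printed right-justified in two columns.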