NASM – the Netwide Assembler

Version 2.14rc7

© 1996-2017 The NASM Development Team. All Rights Reserved.

This document is redistributable under the license given in the file "LICENSE" distributed in the NASM archive.

Contents

Chapter 1: Introduction
  1.1 What Is NASM?
    1.1.1 License Conditions

Chapter 2: Running NASM
  2.1 NASM Command-Line Syntax
    2.1.1 The -o Option: Specifying the Output File Name
    2.1.2 The -f Option: Specifying the Output File Format
    2.1.3 The -l Option: Generating a Listing File
    2.1.4 The -M Option: Generate Makefile Dependencies
    2.1.5 The -MG Option: Generate Makefile Dependencies
    2.1.6 The -MF Option: Set Makefile Dependency File
    2.1.7 The -MD Option: Assemble and Generate Dependencies
    2.1.8 The -MT Option: Dependency Target Name
    2.1.9 The -MQ Option: Dependency Target Name (Quoted)
    2.1.10 The -MP Option: Emit phony targets
    2.1.11 The -MW Option: Watcom Make quoting style
    2.1.12 The -F Option: Selecting a Debug Information Format
    2.1.13 The -g Option: Enabling Debug Information
    2.1.14 The -X Option: Selecting an Error Reporting Format
    2.1.15 The -Z Option: Send Errors to a File
    2.1.16 The -s Option: Send Errors to stdout
    2.1.17 The -i Option: Include File Search Directories
    2.1.18 The -p Option: Pre-Include a File
    2.1.19 The -d Option: Pre-Define a Macro
    2.1.20 The -u Option: Undefine a Macro
    2.1.21 The -E Option: Preprocess Only
    2.1.22 The -a Option: Don’t Preprocess At All
    2.1.23 The -O Option: Specifying Multipass Optimization
    2.1.24 The -t Option: Enable TASM Compatibility Mode
    2.1.25 The -w and -W Options: Enable or Disable Assembly Warnings
    2.1.26 The -v Option: Display Version Info
    2.1.27 The -y Option: Display Available Debug Info Formats
    2.1.28 The --prefix and --postfix Options
    2.1.29 The NASMENV Environment Variable
  2.2 Quick Start for MASM Users
    2.2.1 NASM Is Case-Sensitive
    2.2.2 NASM Requires Square Brackets For Memory References
    2.2.3 NASM Doesn’t Store Variable Types
    2.2.4 NASM Doesn’t ASSUME
    2.2.5 NASM Doesn’t Support Memory Models
    2.2.6 Floating-Point Differences
    2.2.7 Other Differences

Chapter 3: The NASM Language
  3.1 Layout of a NASM Source Line
  3.2 Pseudo-Instructions
    3.2.1 DB and Friends: Declaring Initialized Data
    3.2.2 RESB and Friends: Declaring Uninitialized Data
    3.2.3 INCBIN: Including External Binary Files
    3.2.4 EQU: Defining Constants
    3.2.5 TIMES: Repeating Instructions or Data
  3.3 Effective Addresses
  3.4 Constants
    3.4.1 Numeric Constants
    3.4.2 Character Strings
    3.4.3 Character Constants
    3.4.4 String Constants
    3.4.5 Unicode Strings
    3.4.6 Floating-Point Constants
    3.4.7 Packed BCD Constants
  3.5 Expressions
    3.5.1 |: Bitwise OR Operator
    3.5.2 ^: Bitwise XOR Operator
    3.5.3 &: Bitwise AND Operator
    3.5.4 << and >>: Bit Shift Operators
    3.5.5 + and -: Addition and Subtraction Operators
    3.5.6 *, /, //, % and %%: Multiplication and Division
    3.5.7 Unary Operators
  3.6 SEG and WRT
  3.7 STRICT: Inhibiting Optimization
  3.8 Critical Expressions
  3.9 Local Labels

Chapter 4: The NASM Preprocessor
  4.1 Single-Line Macros
    4.1.1 The Normal Way: %define
    4.1.2 Resolving %define: %xdefine
    4.1.3 Macro Indirection: %[...]
    4.1.4 Concatenating Single Line Macro Tokens: %+
    4.1.5 The Macro Name Itself: %? and %??
    4.1.6 Undefining Single-Line Macros: %undef
    4.1.7 Preprocessor Variables: %assign
    4.1.8 Defining Strings: %defstr
    4.1.9 Defining Tokens: %deftok
  4.2 String Manipulation in Macros
    4.2.1 Concatenating Strings: %strcat
    4.2.2 String Length: %strlen
    4.2.3 Extracting Substrings: %substr
  4.3 Multi-Line Macros: %macro
    4.3.1 Overloading Multi-Line Macros
    4.3.2 Macro-Local Labels
    4.3.3 Greedy Macro Parameters
    4.3.4 Macro Parameters Range
    4.3.5 Default Macro Parameters
    4.3.6 %0: Macro Parameter Counter
    4.3.7 %00: Label Preceding Macro
    4.3.8 %rotate: Rotating Macro Parameters
    4.3.9 Concatenating Macro Parameters
    4.3.10 Condition Codes as Macro Parameters
    4.3.11 Disabling Listing Expansion
    4.3.12 Undefining Multi-Line Macros: %unmacro
  4.4 Conditional Assembly
    4.4.1 %ifdef: Testing Single-Line Macro Existence
    4.4.2 %ifmacro: Testing Multi-Line Macro Existence
    4.4.3 %ifctx: Testing the Context Stack
    4.4.4 %if: Testing Arbitrary Numeric Expressions
    4.4.5 %ifidn and %ifidni: Testing Exact Text Identity
    4.4.6 %ifid, %ifnum, %ifstr: Testing Token Types
    4.4.7 %iftoken: Test for a Single Token
    4.4.8 %ifempty: Test for Empty Expansion
    4.4.9 %ifenv: Test If Environment Variable Exists
  4.5 Preprocessor Loops: %rep
  4.6 Source Files and Dependencies
    4.6.1 %include: Including Other Files
    4.6.2 %pathsearch: Search the Include Path
    4.6.3 %depend: Add Dependent Files
    4.6.4 %use: Include Standard Macro Package