Optimizing Subroutines in Assembly Language an Optimization Guide for X86 Platforms

Total Page:16

File Type:pdf, Size:1020Kb

Optimizing Subroutines in Assembly Language an Optimization Guide for X86 Platforms 2. Optimizing subroutines in assembly language An optimization guide for x86 platforms By Agner Fog Copyright © 1996 - 2006. Last updated 2006-07-05. Contents 1 Introduction ....................................................................................................................... 4 1.1 Reasons for using assembly code .............................................................................. 4 1.2 Reasons for not using assembly code ........................................................................ 5 1.3 Microprocessors covered by this manual .................................................................... 6 1.4 Operating systems covered by this manual................................................................. 6 2 Before you start................................................................................................................. 7 2.1 Things to decide before you start programming .......................................................... 7 2.2 Make a test strategy.................................................................................................... 8 3 The basics of assembly coding.......................................................................................... 9 3.1 Assembly language syntaxes...................................................................................... 9 3.2 Register set and basic instructions............................................................................ 10 3.3 Addressing modes .................................................................................................... 14 3.4 Instruction code format ............................................................................................. 18 3.5 Instruction prefixes.................................................................................................... 19 4 ABI standards.................................................................................................................. 20 4.1 Register usage.......................................................................................................... 20 4.2 Data storage ............................................................................................................. 21 4.3 Function calling conventions..................................................................................... 21 4.4 Name mangling and name decoration ...................................................................... 23 4.5 Function examples.................................................................................................... 23 5 Using intrinsic functions in C++ ....................................................................................... 26 5.1 Using intrinsic functions for system code .................................................................. 27 5.2 Using intrinsic functions for instructions not available in standard C++ ..................... 28 5.3 Using intrinsic functions for vector operations ........................................................... 28 5.4 Availability of intrinsic functions................................................................................. 28 6 Using inline assembly in C++ .......................................................................................... 28 6.1 MASM style inline assembly ..................................................................................... 29 6.2 Gnu style inline assembly ......................................................................................... 34 7 Using an assembler......................................................................................................... 37 7.1 Static link libraries..................................................................................................... 38 7.2 Dynamic link libraries................................................................................................ 39 7.3 Libraries in source code form.................................................................................... 40 7.4 Making classes in assembly...................................................................................... 40 7.5 Thread-safe functions ............................................................................................... 42 7.6 Makefiles .................................................................................................................. 42 8 Making function libraries compatible with multiple compilers and platforms..................... 43 8.1 Supporting multiple name mangling schemes........................................................... 44 8.2 Supporting multiple calling conventions in 32 bit mode ............................................. 45 8.3 Supporting multiple calling conventions in 64 bit mode ............................................. 48 8.4 Supporting different object file formats...................................................................... 49 8.5 Supporting other high level languages ...................................................................... 50 9 Optimizing for speed ....................................................................................................... 50 9.1 Identify the most critical parts of your code ............................................................... 50 9.2 Out of order execution .............................................................................................. 51 9.3 Instruction fetch, decoding and retirement ................................................................ 53 9.4 Instruction latency and throughput ............................................................................ 54 9.5 Break dependence chains......................................................................................... 55 9.6 Jumps and calls........................................................................................................ 56 10 Optimizing for size......................................................................................................... 62 10.1 Choosing shorter instructions.................................................................................. 62 10.2 Using shorter constants and addresses .................................................................. 63 10.3 Reusing constants .................................................................................................. 64 10.4 Constants in 64-bit mode ........................................................................................ 64 10.5 Addresses and pointers in 64-bit mode................................................................... 64 10.6 Making instructions longer for the sake of alignment............................................... 66 11 Optimizing memory access............................................................................................ 69 11.1 How caching works................................................................................................. 69 11.2 Trace cache............................................................................................................ 70 11.3 Alignment of data.................................................................................................... 70 11.4 Alignment of code ................................................................................................... 73 11.5 Organizing data for improved caching..................................................................... 74 11.6 Organizing code for improved caching.................................................................... 75 11.7 Cache control instructions....................................................................................... 75 12 Loops ............................................................................................................................ 76 12.1 Minimize loop overhead .......................................................................................... 76 12.2 Induction variables.................................................................................................. 79 12.3 Move loop-invariant code........................................................................................ 80 12.4 Find the bottlenecks................................................................................................ 80 12.5 Instruction fetch, decoding and retirement in a loop ................................................ 80 12.6 Distribute uops evenly between execution units...................................................... 81 12.7 An example of analysis for bottlenecks ................................................................... 82 12.8 Loop unrolling ......................................................................................................... 85 12.9 Optimize caching .................................................................................................... 87 12.10 Parallelization ....................................................................................................... 88 12.11 Analyzing dependences........................................................................................ 90 12.12 Loops on processors without out-of-order execution ............................................. 92 12.13 Macro loops .......................................................................................................... 94 13 Vector programming...................................................................................................... 96 13.1
Recommended publications
  • AMD Athlon™ Processor X86 Code Optimization Guide
    AMD AthlonTM Processor x86 Code Optimization Guide © 2000 Advanced Micro Devices, Inc. All rights reserved. The contents of this document are provided in connection with Advanced Micro Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD’s Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right. AMD’s products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other applica- tion in which the failure of AMD’s product could create a situation where per- sonal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice. Trademarks AMD, the AMD logo, AMD Athlon, K6, 3DNow!, and combinations thereof, AMD-751, K86, and Super7 are trademarks, and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation.
    [Show full text]
  • E0C88 CORE CPU MANUAL NOTICE No Part of This Material May Be Reproduced Or Duplicated in Any Form Or by Any Means Without the Written Permission of Seiko Epson
    MF658-04 CMOS 8-BIT SINGLE CHIP MICROCOMPUTER E0C88 Family E0C88 CORE CPU MANUAL NOTICE No part of this material may be reproduced or duplicated in any form or by any means without the written permission of Seiko Epson. Seiko Epson reserves the right to make changes to this material without notice. Seiko Epson does not assume any liability of any kind arising out of any inaccuracies contained in this material or due to its application or use in any product or circuit and, further, there is no representation that this material is applicable to products requiring high level reliability, such as medical products. Moreover, no license to any intellectual property rights is granted by implication or otherwise, and there is no representation or warranty that anything made in accordance with this material will be free from any patent or copyright infringement of a third party. This material or portions thereof may contain technology or the subject relating to strategic products under the control of the Foreign Exchange and Foreign Trade Control Law of Japan and may require an export license from the Ministry of International Trade and Industry or other approval from another government agency. Please note that "E0C" is the new name for the old product "SMC". If "SMC" appears in other manuals understand that it now reads "E0C". © SEIKO EPSON CORPORATION 1999 All rights reserved. CONTENTS E0C88 Core CPU Manual PREFACE This manual explains the architecture, operation and instruction of the core CPU E0C88 of the CMOS 8-bit single chip microcomputer E0C88 Family. Also, since the memory configuration and the peripheral circuit configuration is different for each device of the E0C88 Family, you should refer to the respective manuals for specific details other than the basic functions.
    [Show full text]
  • Cmp Instruction in Assembly Language
    Cmp Instruction In Assembly Language booby-trapsDamien pedaling sound. conjointly? Untidiest SupposableGraham precools, and uncleansed his ilexes geologisesNate shrouds globed her chokos contemptuously. espied or It works because when a short jump if the stack to fill out over an implied subtraction of assembly language is the arithmetic operation of using our point Jump if actual assembler or of assembly is correct results in sequence will stay organized. If you want to prove that will you there is cmp instruction in assembly language, sp can also accept one, it takes quite useful sample programs in game reports to. Acts as the instruction pointer should be checked for? Please switch your skills, cmp instruction cmp in assembly language. To double word register onto the next step process in assembly language equivalent to collect great quiz. The comparisons and the ip value becomes a row! Subsequent generations became new game code starts in main program plays a few different memory locations declared with. Want to switch, many commands to thumb instructions will not examine the difference between filtration and in assembly language are used as conditional jumps are calls to the call. Note that was no difference in memory operands and its corresponding comparison yielded an explicit, xor is running but also uses a perennial study guide can exit? Each question before you want to see the control flow and play this instruction cmp in assembly language? This value off this for assembly language, cmp instruction in assembly language? But shl can explicitly directed as efficient ways to compare values stored anywhere in terms indicated by cmp instruction in assembly language calling convention rules.
    [Show full text]
  • Lecture Notes in Assembly Language
    Lecture Notes in Assembly Language Short introduction to low-level programming Piotr Fulmański Łódź, 12 czerwca 2015 Spis treści Spis treści iii 1 Before we begin1 1.1 Simple assembler.................................... 1 1.1.1 Excercise 1 ................................... 2 1.1.2 Excercise 2 ................................... 3 1.1.3 Excercise 3 ................................... 3 1.1.4 Excercise 4 ................................... 5 1.1.5 Excercise 5 ................................... 6 1.2 Improvements, part I: addressing........................... 8 1.2.1 Excercise 6 ................................... 11 1.3 Improvements, part II: indirect addressing...................... 11 1.4 Improvements, part III: labels............................. 18 1.4.1 Excercise 7: find substring in a string .................... 19 1.4.2 Excercise 8: improved polynomial....................... 21 1.5 Improvements, part IV: flag register ......................... 23 1.6 Improvements, part V: the stack ........................... 24 1.6.1 Excercise 12................................... 26 1.7 Improvements, part VI – function stack frame.................... 29 1.8 Finall excercises..................................... 34 1.8.1 Excercise 13................................... 34 1.8.2 Excercise 14................................... 34 1.8.3 Excercise 15................................... 34 1.8.4 Excercise 16................................... 34 iii iv SPIS TREŚCI 1.8.5 Excercise 17................................... 34 2 First program 37 2.1 Compiling,
    [Show full text]
  • Targeting Embedded Powerpc
    Freescale Semiconductor, Inc. EPPC.book Page 1 Monday, March 28, 2005 9:22 AM CodeWarrior™ Development Studio PowerPC™ ISA Communications Processors Edition Targeting Manual Revised: 28 March 2005 For More Information: www.freescale.com Freescale Semiconductor, Inc. EPPC.book Page 2 Monday, March 28, 2005 9:22 AM Metrowerks, the Metrowerks logo, and CodeWarrior are trademarks or registered trademarks of Metrowerks Corpora- tion in the United States and/or other countries. All other trade names and trademarks are the property of their respective owners. Copyright © 2005 by Metrowerks, a Freescale Semiconductor company. All rights reserved. No portion of this document may be reproduced or transmitted in any form or by any means, electronic or me- chanical, without prior written permission from Metrowerks. Use of this document and related materials are governed by the license agreement that accompanied the product to which this manual pertains. This document may be printed for non-commercial personal use only in accordance with the aforementioned license agreement. If you do not have a copy of the license agreement, contact your Metrowerks representative or call 1-800-377- 5416 (if outside the U.S., call +1-512-996-5300). Metrowerks reserves the right to make changes to any product described or referred to in this document without further notice. Metrowerks makes no warranty, representation or guarantee regarding the merchantability or fitness of its prod- ucts for any particular purpose, nor does Metrowerks assume any liability arising
    [Show full text]
  • Codewarrior® Targeting Embedded Powerpc
    CodeWarrior® Targeting Embedded PowerPC Because of last-minute changes to CodeWarrior, some of the information in this manual may be inaccurate. Please read the Release Notes on the CodeWarrior CD for the most recent information. Revised: 991129-CIB Metrowerks CodeWarrior copyright ©1993–1999 by Metrowerks Inc. and its licensors. All rights reserved. Documentation stored on the compact disk(s) may be printed by licensee for personal use. Except for the foregoing, no part of this documentation may be reproduced or trans- mitted in any form by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from Metrowerks Inc. Metrowerks, the Metrowerks logo, CodeWarrior, and Software at Work are registered trademarks of Metrowerks Inc. PowerPlant and PowerPlant Constructor are trademarks of Metrowerks Inc. All other trademarks and registered trademarks are the property of their respective owners. ALL SOFTWARE AND DOCUMENTATION ON THE COMPACT DISK(S) ARE SUBJECT TO THE LICENSE AGREEMENT IN THE CD BOOKLET. How to Contact Metrowerks: U.S.A. and international Metrowerks Corporation 9801 Metric Blvd., Suite 100 Austin, TX 78758 U.S.A. Canada Metrowerks Inc. 1500 du College, Suite 300 Ville St-Laurent, QC Canada H4L 5G6 Ordering Voice: (800) 377–5416 Fax: (512) 873–4901 World Wide Web http://www.metrowerks.com Registration information [email protected] Technical support [email protected] Sales, marketing, & licensing [email protected] CompuServe Goto: Metrowerks Table of Contents 1 Introduction 11 Read the Release Notes! . 11 Solaris: Host-Specific Information. 12 About This Book . 12 Where to Go from Here .
    [Show full text]
  • (PSW). Seven Bits Remain Unused While the Rest Nine Are Used
    8086/8088MP INSTRUCTOR: ABDULMUTTALIB A. H. ALDOURI The Flags Register It is a 16-bit register, also called Program Status Word (PSW). Seven bits remain unused while the rest nine are used. Six are status flags and three are control flags. The control flags can be set/reset by the programmer. 1. DF (Direction Flag) : controls the direction of operation of string instructions. (DF=0 Ascending order DF=1 Descending order) 2. IF (Interrupt Flag): controls the interrupt operation in 8086µP. (IF=0 Disable interrupt IF=1 Enable interrupt) 3. TF (Trap Flag): controls the operation of the microprocessor. (TF=0 Normal operation TF=1 Single Step operation) The status flags are set/reset depending on the results of some arithmetic or logical operations during program execution. 1. CF (Carry Flag) is set (CF=1) if there is a carry out of the MSB position resulting from an addition operation or subtraction. 2. AF (Auxiliary Carry Flag) AF is set if there is a carry out of bit 3 resulting from an addition operation. 3. SF (Sign Flag) set to 1 when result is negative. When result is positive it is set to 0. 4. ZF (Zero Flag) is set (ZF=1) when result of an arithmetic or logical operation is zero. For non-zero result this flag is reset (ZF=0). 5. PF (Parity Flag) this flag is set to 1 when there is even number of one bits in result, and to 0 when there is odd number of one bits. 6. OF (Overflow Flag) set to 1 when there is a signed overflow.
    [Show full text]
  • AVR Control Transfer -AVR Branching
    1 | P a g e AVR Control Transfer -AVR Branching Reading The AVR Microcontroller and Embedded Systems using Assembly and C) by Muhammad Ali Mazidi, Sarmad Naimi, and Sepehr Naimi Chapter 3: Branch, Call, and Time Delay Loop Section 3.1: Branching and Looping (Branch Only) Additional Reading Introduction to AVR assembler programming for beginners, controlling sequential execution of the program http://www.avr-asm-tutorial.net/avr_en/beginner/JUMP.html AVR Assembler User Guide http://www.atmel.com/dyn/resources/prod documents/doc1022.pdf 2 | P a g e TABLE OF CONTENTS Instruction Set Architecture (Review) ...................................................................................................................... 4 Instruction Set (Review) ........................................................................................................................................... 5 Jump Instructions..................................................................................................................................................... 6 How the Direct Unconditional Control Transfer Instructions jmp and call Work ...................................................... 7 How the Relative Unconditional Control Transfer Instructions rjmp and rcall Work ................................................ 8 Branch Instructions .................................................................................................................................................. 9 How the Relative Conditional Control Transfer Instruction
    [Show full text]
  • Overview of IA-32 Assembly Programming
    Overview of IA-32 assembly programming Lars Ailo Bongo University of Tromsø Contents 1 Introduction ...................................................................................................................... 2 2 IA-32 assembly programming.......................................................................................... 3 2.1 Assembly Language Statements................................................................................ 3 2.1 Modes........................................................................................................................4 2.2 Registers....................................................................................................................4 2.2.3 Data Registers .................................................................................................... 4 2.2.4 Pointer and Index Registers................................................................................ 4 2.2.5 Control Registers................................................................................................ 5 2.2.6 Segment registers ............................................................................................... 7 2.3 Addressing................................................................................................................. 7 2.3.1 Bit and Byte Order ............................................................................................. 7 2.3.2 Data Types.........................................................................................................
    [Show full text]
  • Optimizing Subroutines in Assembly Language an Optimization Guide for X86 Platforms
    2. Optimizing subroutines in assembly language An optimization guide for x86 platforms By Agner Fog. Copenhagen University College of Engineering. Copyright © 1996 - 2012. Last updated 2012-02-29. Contents 1 Introduction ....................................................................................................................... 4 1.1 Reasons for using assembly code .............................................................................. 5 1.2 Reasons for not using assembly code ........................................................................ 5 1.3 Microprocessors covered by this manual .................................................................... 6 1.4 Operating systems covered by this manual................................................................. 7 2 Before you start................................................................................................................. 7 2.1 Things to decide before you start programming .......................................................... 7 2.2 Make a test strategy.................................................................................................... 9 2.3 Common coding pitfalls............................................................................................. 10 3 The basics of assembly coding........................................................................................ 12 3.1 Assemblers available ................................................................................................ 12 3.2 Register set
    [Show full text]
  • Assembly Language: IA-X86
    Assembly Language for x86 Processors X86 Processor Architecture CS 271 Computer Architecture Purdue University Fort Wayne 1 Outline Basic IA Computer Organization IA-32 Registers Instruction Execution Cycle Basic IA Computer Organization Since the 1940's, the Von Neumann computers contains three key components: Processor, called also the CPU (Central Processing Unit) Memory and Storage Devices I/O Devices Interconnected with one or more buses Data Bus Address Bus data bus Control Bus registers Processor I/O I/O IA: Intel Architecture Memory Device Device (CPU) #1 #2 32-bit (or i386) ALU CU clock control bus address bus Processor The processor consists of Datapath ALU Registers Control unit ALU (Arithmetic logic unit) Performs arithmetic and logic operations Control unit (CU) Generates the control signals required to execute instructions Memory Address Space Address Space is the set of memory locations (bytes) that are addressable Next ... Basic Computer Organization IA-32 Registers Instruction Execution Cycle Registers Registers are high speed memory inside the CPU Eight 32-bit general-purpose registers Six 16-bit segment registers Processor Status Flags (EFLAGS) and Instruction Pointer (EIP) 32-bit General-Purpose Registers EAX EBP EBX ESP ECX ESI EDX EDI 16-bit Segment Registers EFLAGS CS ES SS FS EIP DS GS General-Purpose Registers Used primarily for arithmetic and data movement mov eax 10 ;move constant integer 10 into register eax Specialized uses of Registers eax – Accumulator register Automatically
    [Show full text]
  • CS221 Booleans, Comparison, Jump Instructions Chapter 6 Assembly Language Is a Great Choice When It Comes to Working on Individu
    CS221 Booleans, Comparison, Jump Instructions Chapter 6 Assembly language is a great choice when it comes to working on individual bits of data. While some languages like C and C++ include bitwise operators, several high-level languages are missing these operations entirely (e.g. Visual Basic 6.0). At times these operations can be quite useful. First we will describe some common bit operations, and then discuss conditional jumps. AND Instruction The AND instruction performs a Boolean AND between all bits of the two operands. For example: mov al, 00111011b and al, 00001111b The result is that AL contains 00001011. We have “multiplied” each corresponding bit. We have used AND for a common operation in this case, to clear out the high nibble. Sometimes the operand we are AND’ing something with is called a bit mask because wherever there is a zero the result is zero (masking out the original data), and wherever we have a one, we copy the original data through the mask. For example, consider an ASCII character. “Low” ASCII ignores the high bit; we only need the low 7 bits and we might use the high bit for either special characters or perhaps as a parity bit. Given an arbitrary 8 bits for a character, we could apply a mask that removes the high bit and only retains the low bits: and al, 01111111b OR Instruction The OR instruction performs a Boolean OR between all bits of the two operands. For example: mov al, 00111011b or al, 00001111b As a result AL contains 00111111. We have “Added” each corresponding bit, throwing away the carry.
    [Show full text]