Optimizing Subroutines in Assembly Language an Optimization Guide for X86 Platforms

2. Optimizing subroutines in assembly language An optimization guide for x86 platforms By Agner Fog Copyright © 1996 - 2006. Last updated 2006-07-05. Contents 1 Introduction ....................................................................................................................... 4 1.1 Reasons for using assembly code .............................................................................. 4 1.2 Reasons for not using assembly code ........................................................................ 5 1.3 Microprocessors covered by this manual .................................................................... 6 1.4 Operating systems covered by this manual................................................................. 6 2 Before you start................................................................................................................. 7 2.1 Things to decide before you start programming .......................................................... 7 2.2 Make a test strategy.................................................................................................... 8 3 The basics of assembly coding.......................................................................................... 9 3.1 Assembly language syntaxes...................................................................................... 9 3.2 Register set and basic instructions............................................................................ 10 3.3 Addressing modes .................................................................................................... 14 3.4 Instruction code format ............................................................................................. 18 3.5 Instruction prefixes.................................................................................................... 19 4 ABI standards.................................................................................................................. 20 4.1 Register usage.......................................................................................................... 20 4.2 Data storage ............................................................................................................. 21 4.3 Function calling conventions..................................................................................... 21 4.4 Name mangling and name decoration ...................................................................... 23 4.5 Function examples.................................................................................................... 23 5 Using intrinsic functions in C++ ....................................................................................... 26 5.1 Using intrinsic functions for system code .................................................................. 27 5.2 Using intrinsic functions for instructions not available in standard C++ ..................... 28 5.3 Using intrinsic functions for vector operations ........................................................... 28 5.4 Availability of intrinsic functions................................................................................. 28 6 Using inline assembly in C++ .......................................................................................... 28 6.1 MASM style inline assembly ..................................................................................... 29 6.2 Gnu style inline assembly ......................................................................................... 34 7 Using an assembler......................................................................................................... 37 7.1 Static link libraries..................................................................................................... 38 7.2 Dynamic link libraries................................................................................................ 39 7.3 Libraries in source code form.................................................................................... 40 7.4 Making classes in assembly...................................................................................... 40 7.5 Thread-safe functions ............................................................................................... 42 7.6 Makefiles .................................................................................................................. 42 8 Making function libraries compatible with multiple compilers and platforms..................... 43 8.1 Supporting multiple name mangling schemes........................................................... 44 8.2 Supporting multiple calling conventions in 32 bit mode ............................................. 45 8.3 Supporting multiple calling conventions in 64 bit mode ............................................. 48 8.4 Supporting different object file formats...................................................................... 49 8.5 Supporting other high level languages ...................................................................... 50 9 Optimizing for speed ....................................................................................................... 50 9.1 Identify the most critical parts of your code ............................................................... 50 9.2 Out of order execution .............................................................................................. 51 9.3 Instruction fetch, decoding and retirement ................................................................ 53 9.4 Instruction latency and throughput ............................................................................ 54 9.5 Break dependence chains......................................................................................... 55 9.6 Jumps and calls........................................................................................................ 56 10 Optimizing for size......................................................................................................... 62 10.1 Choosing shorter instructions.................................................................................. 62 10.2 Using shorter constants and addresses .................................................................. 63 10.3 Reusing constants .................................................................................................. 64 10.4 Constants in 64-bit mode ........................................................................................ 64 10.5 Addresses and pointers in 64-bit mode................................................................... 64 10.6 Making instructions longer for the sake of alignment............................................... 66 11 Optimizing memory access............................................................................................ 69 11.1 How caching works................................................................................................. 69 11.2 Trace cache............................................................................................................ 70 11.3 Alignment of data.................................................................................................... 70 11.4 Alignment of code ................................................................................................... 73 11.5 Organizing data for improved caching..................................................................... 74 11.6 Organizing code for improved caching.................................................................... 75 11.7 Cache control instructions....................................................................................... 75 12 Loops ............................................................................................................................ 76 12.1 Minimize loop overhead .......................................................................................... 76 12.2 Induction variables.................................................................................................. 79 12.3 Move loop-invariant code........................................................................................ 80 12.4 Find the bottlenecks................................................................................................ 80 12.5 Instruction fetch, decoding and retirement in a loop ................................................ 80 12.6 Distribute uops evenly between execution units...................................................... 81 12.7 An example of analysis for bottlenecks ................................................................... 82 12.8 Loop unrolling ......................................................................................................... 85 12.9 Optimize caching .................................................................................................... 87 12.10 Parallelization ....................................................................................................... 88 12.11 Analyzing dependences........................................................................................ 90 12.12 Loops on processors without out-of-order execution ............................................. 92 12.13 Macro loops .......................................................................................................... 94 13 Vector programming...................................................................................................... 96 13.1

Optimizing Subroutines in Assembly Language an Optimization Guide for X86 Platforms

AMD Athlon™ Processor X86 Code Optimization Guide

E0C88 CORE CPU MANUAL NOTICE No Part of This Material May Be Reproduced Or Duplicated in Any Form Or by Any Means Without the Written Permission of Seiko Epson

Cmp Instruction in Assembly Language

Lecture Notes in Assembly Language

Targeting Embedded Powerpc

Codewarrior® Targeting Embedded Powerpc

(PSW). Seven Bits Remain Unused While the Rest Nine Are Used

AVR Control Transfer -AVR Branching

Overview of IA-32 Assembly Programming

Optimizing Subroutines in Assembly Language an Optimization Guide for X86 Platforms

Assembly Language: IA-X86

CS221 Booleans, Comparison, Jump Instructions Chapter 6 Assembly Language Is a Great Choice When It Comes to Working on Individu