Software Optimization Guide for the AMD Athlon 64 and AMD Opteron

Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors Publication # 25112 Revision: 3.04 Issue Date: March 2004 © 2001 – 2004 Advanced Micro Devices, Inc. All rights reserved. The contents of this document are provided in connection with Advanced Micro Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD’s Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right. AMD’s products are not designed, intended, authorized or warranted for use as compo- nents in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD’s product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice. Trademarks AMD, the AMD Arrow logo, AMD Athlon, AMD Opteron, and combinations thereof, 3DNow! and AMD-8151 are trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the HyperTransport Technology Consortium. Microsoft is a registered trademark of Microsoft Corporation. MMX is a trademark of Intel Corporation. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. 25112 Rev. 3.04 March 2004 Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors Contents Revision History . .xv Chapter 1 Introduction . .1 1.1 Intended Audience . .1 1.2 Getting Started Quickly . .1 1.3 Using This Guide . .2 1.4 Important New Terms . .4 1.5 Key Optimizations . .6 Chapter 2 C and C++ Source-Level Optimizations . .7 2.1 Declarations of Floating-Point Values . .9 2.2 Using Arrays and Pointers . .10 2.3 Unrolling Small Loops . .13 2.4 Expression Order in Compound Branch Conditions . .14 2.5 Long Logical Expressions in If Statements . .16 2.6 Arrange Boolean Operands for Quick Expression Evaluation . .17 2.7 Dynamic Memory Allocation Consideration . .19 2.8 Unnecessary Store-to-Load Dependencies . .20 2.9 Matching Store and Load Size . .22 2.10 SWITCH and Noncontiguous Case Expressions . .25 2.11 Arranging Cases by Probability of Occurrence . .28 2.12 Use of Function Prototypes . .29 2.13 Use of const Type Qualifier . .30 2.14 Generic Loop Hoisting . .31 2.15 Local Static Functions . .34 2.16 Explicit Parallelism in Code . .35 2.17 Extracting Common Subexpressions . .37 2.18 Sorting and Padding C and C++ Structures . .39 2.19 Sorting Local Variables . .41 Contents iii Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ 25112 Rev. 3.04 March 2004 Processors 2.20 Replacing Integer Division with Multiplication . .43 2.21 Frequently Dereferenced Pointer Arguments . .44 2.22 Array Indices . .46 2.23 32-Bit Integral Data Types . .47 2.24 Sign of Integer Operands . .48 2.25 Accelerating Floating-Point Division and Square Root . .50 2.26 Fast Floating-Point-to-Integer Conversion . .52 2.27 Speeding Up Branches Based on Comparisons Between Floats . .54 2.28 Improving Performance in Linux Libraries . .57 Chapter 3 General 64-Bit Optimizations . .59 3.1 64-Bit Registers and Integer Arithmetic . .60 3.2 64-Bit Arithmetic and Large-Integer Multiplication . .62 3.3 128-Bit Media Instructions and Floating-Point Operations . .67 3.4 32-Bit Legacy GPRs and Small Unsigned Integers . .68 Chapter 4 Instruction-Decoding Optimizations . .71 4.1 DirectPath Instructions . .72 4.2 Load-Execute Instructions . .73 4.2.1 Load-Execute Integer Instructions . .73 4.2.2 Load-Execute Floating-Point Instructions with Floating-Point Operands . .74 4.2.3 Load-Execute Floating-Point Instructions with Integer Operands . .74 4.3 Branch Targets in Program Hot Spots . .76 4.4 32/64-Bit vs. 16-Bit Forms of the LEA Instruction . .77 4.5 Short Instruction Encodings . .78 4.6 Partial-Register Reads and Writes . .79 4.7 Using LEAVE for Function Epilogues . .81 4.8 Alternatives to SHLD Instruction . .83 4.9 8-Bit Sign-Extended Immediate Values . .85 4.10 8-Bit Sign-Extended Displacements . .86 4.11 Code Padding with Operand-Size Override and NOP . .87 Chapter 5 Cache and Memory Optimizations . .89 iv Contents 25112 Rev. 3.04 March 2004 Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors 5.1 Memory-Size Mismatches . .90 5.2 Natural Alignment of Data Objects . .93 5.3 Cache-Coherent Nonuniform Memory Access (ccNUMA) . .94 5.4 Multiprocessor Considerations . .97 5.5 Store-to-Load Forwarding Restrictions . .98 5.6 Prefetch Instructions . .102 5.7 Write-combining . .110 5.8 L1 Data Cache Bank Conflicts . .111 5.9 Placing Code and Data in the Same 64-Byte Cache Line . .113 5.10 Sorting and Padding C and C++ Structures . .114 5.11 Sorting Local Variables . .116 5.12 Appropriate Memory Copying Routines . .117 5.13 Stack Considerations . .128 5.14 Cache Issues when Writing Instruction Bytes to Memory . .129 5.15 Interleave Loads and Stores . .130 Chapter 6 Branch Optimizations . .131 6.1 Density of Branches . .132 6.2 Two-Byte Near-Return RET Instruction . .134 6.3 Branches That Depend on Random Data . .136 6.4 Pairing CALL and RETURN . ..

Software Optimization Guide for the AMD Athlon 64 and AMD Opteron

Owner's Manual

Evaluation of AMD EPYC

Multiprocessing Contents

The Wysiwyg and Vivien Hardware Guide

Solaris 10 OS on AMD Opteron Processor-Based Systems

AMD's Early Processor Lines, up to the Hammer Family (Families K8

AMD Opteron™ 4000 Series Platform

Cache Hierarchy and Memory Subsystem of the Amd Opteron Processor

The AMD Opteron™ Processor Is the Ideal Solution for High Performance Computing

Modern Computer Architecture and Organization

CPU History [Tualatin] [Banias] [Dothan] [Yonah (Jonah)] [Conroe] [Allendale] [Yorkfield XE] Intel Created Pentium (From Quad-Core CPU

AMD Opteron™ and Intel® Xeon® X86 Processors in Industry-Standard Servers