Origin2000™ and Onyx2™ Performance Tuning and Optimization Guide

Document Number 007-3430-002

CONTRIBUTORS

Written by David Cortesi, based on the first edition by Jeff Fier
Illustrated by Dan Young
Edited by Christina Cary
Production by Kirsten Pekarek
Engineering contributions by David Cortesi, Leo Dagum, Wesley Jones, Eric Salo, Igor Zacharov, Marco Zagha
St. Peter’s Basilica image courtesy of ENEL SpA and InfoByte SpA. Disk Thrower image courtesy of Xavier Berenguer, Animatica.

© 1998 Silicon Graphics, Inc. All Rights Reserved. The contents of this document may not be copied or duplicated in any form, in whole or in part, without the prior written permission of Silicon Graphics, Inc.

RESTRICTED RIGHTS LEGEND Use, duplication, or disclosure of the technical data contained in this document by the Government is subject to restrictions as set forth in subdivision (c) (1) (ii) of the Rights in Technical Data and Computer Software clause at DFARS 52.227-7013 and/or in similar or successor clauses in the FAR, or in the DOD or NASA FAR Supplement. Unpublished rights reserved under the Copyright Laws of the United States. Contractor/manufacturer is Silicon Graphics, Inc., 2011 N. Shoreline Blvd., Mountain View, CA 94043-1389.

Silicon Graphics, CHALLENGE, Indy, IRIX, and Onyx are registered trademarks and the Silicon Graphics logo, InfiniteReality, O2, Onyx2, Origin200, Origin2000, POWER CHALLENGE, POWER CHALLENGE 10000, and XFS are trademarks of Silicon Graphics, Inc. CRAY is a registered trademark, and CrayLink is a trademark of Cray Research, Inc. POSIX is a trademark of IEEE. MIPS, R4400, and R10000 are registered trademarks and MIPSpro is a trademark of MIPS Technologies, Inc. NFS is a registered trademark of Sun Microsystems, Inc. X Window System is a trademark of X Consortium, Inc. UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company, Ltd.

Origin2000™ and Onyx2™ Performance Tuning and Optimization Guide
Document Number 007-3430-002

Contents

List of Examples xvii List of Figures xxiii

List of Tables xxv About This Guide xxvii Who Can Benefit from This Guide xxvii What the Guide Contains xxviii Related Documents xxix Related Manuals xxix Hardware Manuals xxix Compiler Manuals xxix Software Tool Manuals xxx Third-Party Resources xxx Related Reference Pages xxxi Text Conventions xxxii 1. Understanding SN0 Architecture 1 Understanding Scalable Multiprocessor Memory 1 Memory for Multiprocessors 1 Shared Memory Multiprocessing 1 Distributed Memory Multiprocessing 3 Scalability in Multiprocessors 4 Scalability and Shared, Distributed Memory 5


Understanding Scalable Shared Memory 6 SN0 Organization 6 SN0 Memory Distribution 8 SN0 Node Board 10 CPUs and Memory 11 Memory Overhead Bits 11 Hub and CrayLink 11 XIO Connection 12 Understanding Coherency 12 Coherency Methods 13 Understanding Directory-Based Coherency 13 Modifying Shared Data 15 Reading Modified Data 15 Other Protocols 16 Memory Contention 16 SN0 Input/Output 16 I/O Connections and Bandwidth 17 I/O Access to Memory 18 SN0 Latencies and Bandwidths 18 Understanding MIPS R10000 Architecture 20 Superscalar CPU Features 20 MIPS IV Instruction Set Architecture 21 Cache Architecture 22 Level-1 Cache 22 Level-Two Cache 23 Out-of-Order and Speculative Execution 24 Executing Out of Order 24 Queued and Active Instructions 24 Speculative Execution 25 Summary 26


2. SN0 Memory Management 27 Dealing With Nonuniform Access Time 27 IRIX Memory Locality Management 29 Strategies for Memory Locality 29 Topology-aware Memory Allocation 29 Dynamic Page Migration 30 Replication of Read-Only Pages 30 Placing Processes Near Memory 30 Memory Affinity Scheduling 31 Support for Tuning Options 31 Memory Locality Management 31 Memory Locality Domain Use 31 Policy Modules 38 Memory Placement for Single-Threaded Programs 39 Data Placement Policies 40 Using First-Touch Placement 40 Using Round-Robin Placement 41 Using Fixed Placement 41 Achieving Good Performance in a NUMA System 42 Single-Threaded Programs under NUMA 42 Parallel Programs under NUMA 42 Summary 44 3. Tuning for a Single Process 45 Getting the Right Answers 46 Selecting an ABI and ISA 46 Old 32-Bit ABI 46 New 32-Bit ABI 47 64-Bit ABI 47 Specifying the ABI 47 Dealing with Porting Issues 48 Uninitialized Variables 48 Computational Differences 48


Exploiting Existing Tuned Code 49 Standard Math Library 49 libfastm Library 50 CHALLENGEcomplib Library 50 SCSL Library 51 Summary 51 4. Profiling and Analyzing Program Behavior 53 Profiling Tools 53 Analyzing Performance with Perfex 54 Taking Absolute Counts of One or Two Events 54 Taking Statistical Counts of All Events 55 Getting Analytic Output with the -y Option 56 Interpreting Maximum and Typical Estimates 58 Interpreting Statistical Metrics 59 Processing perfex Output 61 Collecting Data over Part of a Run 61 Using perfex with MPI 62 Using SpeedShop 62 Taking Sampled Profiles 63 Understanding Sample Time Bases 63 Sampling through Hardware Event Counters 65 Performing ssrun Experiments 65 Sampling Through Other Hardware Counters 66 Displaying Profile Reports from Sampling 67 Using Ideal Time Profiling 68 Capturing an Ideal Time Trace 69 Default Ideal Time Profile 69 Interpreting the Ideal Time Report 71 Removing Clutter from the Report 72 Including Line-Level Detail 73 Creating a Compiler Feedback File 75 Displaying Operation Counts 75


Profiling the Call Hierarchy 76 Displaying Ideal Time Call Hierarchy 77 Displaying Usertime Call Hierarchy 79 Using Exception Profiling 81 Profiling Exception Frequency 81 Understanding Treatment of Underflow Exceptions 81 Using Address Space Profiling 82 Applying dprof 84 Interpreting dprof Output 85 Applying dlook 86 Summary 87 5. Using Basic Compiler Optimizations 89 Understanding Compiler Options 90 Recommended Starting Options 90 Compiler Option Groups 91 Compiler Defaults 92 Using a Makefile 92 Setting Optimization Level with -On 93 Start with -O2 for All Modules 93 Compile -O3 or -Ofast for Critical Modules 94 Use -O0 for Debugging 94 Setting Target System with -TARG 95 Understanding Arithmetic Standards 95 IEEE Conformance 96 Roundoff Control 97 Exploiting Software Pipelining 99 Understanding Software Pipelining 99 Pipelining the DAXPY Loop 101 Reading Software Pipelining Messages 105 Enabling Software Pipelining with -O3 108 Dealing with Software Pipelining Failures 108


Informing the Compiler 109 Understanding Aliasing Models 109 Use Alias=Restrict When Possible 110 Use Alias=Disjoint When Necessary 112 Breaking Other Dependencies 115 Improving C Loops 118 Permitting Speculative Execution 121 Software Speculative Execution 121 Hardware Speculative Execution 122 Controlling the Level of Speculation 123 Passing a Feedback File 124 Exploiting Interprocedural Analysis 125 Requesting IPA 126 Compiling and Linking with IPA 126 Compile Time with IPA 127 Understanding Inlining 128 Using Manual Inlining 129 Using Automatic Inlining 132 IPA Programming Hints 134 Summary 134 6. Optimizing Cache Utilization 135 Understanding the Levels of the Memory Hierarchy 135 Understanding Level-One and Level-Two Cache Use 135 Understanding TLB and Virtual Memory Use 136 Degrees of Latency 137 Understanding Prefetching 137 Principles of Good Cache Use 138 Using Stride-One Access 138 Grouping Data Used at the Same Time 139 Understanding Cache Thrashing 140 Using Array Padding to Prevent Thrashing 142


Identifying Cache Problems with Perfex and SpeedShop 142 Diagnosing and Eliminating Cache Thrashing 144 Diagnosing and Eliminating TLB Thrashing 145 Using Copying to Circumvent TLB Thrashing 146 Using Larger Page Sizes to Reduce TLB Misses 147 Using Other Cache Techniques 148 Understanding Loop Fusion 148 Understanding Cache Blocking 149 Understanding Transpositions 153 Summary 156 7. Using Loop Nest Optimization 157 Understanding Loop Nest Optimizations 157 Requesting LNO 158 Reading the Transformation File 158 Using Outer Loop Unrolling 159 Controlling Loop Unrolling 164 Using Loop Interchange 165 Combining Loop Interchange and Loop Unrolling 166 Controlling Cache Blocking 167 Adjusting Cache Blocking Block Sizes 168 Adjusting the Optimizer’s Cache Model 170 Using Loop Fusion and Fission 170 Using Loop Fusion 170 Using Loop Fission 171 Controlling Fission and Fusion 173 Using Prefetching 174 Prefetch Overhead and Unrolling 175 Using Pseudo-Prefetching 176 Controlling Prefetching 177 Using Manual Prefetching 178


Using Array Padding 180 Using Gather-Scatter and Vector Intrinsics 182 Understanding Gather-Scatter 182 Vector Intrinsics 183 Summary 185 8. Tuning for Parallel Processing 187 Understanding Parallel Speedup and Amdahl’s Law 188 Adding CPUs to Shorten Execution Time 188 Understanding Parallel Speedup 189 Understanding Superlinear Speedup 190 Understanding Amdahl’s Law 190 Calculating the Parallel Fraction of a Program 191 Predicting Execution Time with n CPUs 192 Compiling Serial Code for Parallel Execution 193 Compiling a Parallel Version of a Program 193 Controlling a Parallelized Program at Run Time 193 Explicit Models of Parallel Computation 194 Fortran Source with Directives 194 C and C++ Source with Pragmas 195 Message-Passing Models MPI and PVM 195 C Source Using POSIX Threads 196 C and C++ Source Using UNIX Processes 196 Tuning Parallel Code for SN0 196 Prescription for Performance 197 Ensuring That the Program Is Properly Parallelized 197 Finding and Removing Memory Access Problems 198 Diagnosing Cache Problems 199 Identifying False Sharing 200 Correcting Cache Contention in General 203


Scalability and Data Placement 205 Tuning Data Placement for MP Library Programs 206 Trying Round-Robin Placement 207 Trying Dynamic Page Migration 209 Combining Migration and Round-Robin Placement 210 Experimenting with Migration Levels 213 Tuning Data Placement without Code Modification 215 Modifying the Code to Tune Data Placement 215 Programming For First-Touch Placement 216 First-Touch Placement with Multiple Data Distributions 219 Using Data Distribution Directives 222 Understanding Directive Syntax 222 Using Distribute for Loop Parallelization 223 Using the Distribute Directive 224 Using Parallel Do with Distributed Data 224 Understanding Distribution Mapping Options 225 Understanding the ONTO Clause 227 Understanding the AFFINITY Clause for Data 228 Understanding the AFFINITY Clause for Threads 228 Understanding the NEST Clause 229 Understanding the Redistribution Directives 230 Using the Page_Place Directive for Custom Mappings 231 Using Reshaped Distribution Directives 232 Creating Your Own Reshaped Distribution 237 Restrictions of Reshaped Distribution 237 Investigating Data Distributions 239 Using _DSM_VERBOSE 239 Using Dynamic Placement Information 240


Non-MP Library Programs and Dplace 243 Changing the Page Size 244 Enabling Page Migration 245 Specifying the Topology 246 Placement File Syntax 248 Using Environment Variables in Placement Files 248 Using the memories Statement 249 Using the threads Statement 249 Assigning Threads to Memories 250 Indicating Resource Affinity 251 Assigning Memory Ranges 252 Using the dplace Library for Dynamic Placement 252 Using dplace with MPI 3.1 254 Advanced Options 255 Summary 256 A. Bentley’s Rules Updated 257 Space-for-Time Rules 257 Data Structure Augmentation 257 Store Precomputed Results 258 Caching 258 Lazy Evaluation 259 Time-for-Space Rules 259 Packing 259 Interpreters 260 Loop Rules 260 Code Motion Out of Loops 260 Combining Tests 261 Loop Unrolling 261 Transfer-Driven Loop Unrolling 262 Unconditional Branch Removal 262 Loop Fusion 263


Logic Rules 264 Exploit Algebraic Identities 264 Short-Circuit Monotone Functions 264 Reorder Tests 265 Precompute Logical Functions 266 Replace Control Variables with Control Statements 266 Procedure Design Rules 267 Collapse Procedure Hierarchies 267 Exploit Common Cases 267 Use Coroutines 268 Transform Recursive Procedures 268 Use Parallelism 269 Expression Rules 269 Initialize Data Before Runtime 269 Initializing to Zero 270 Initializing from Files 270 Exploit Algebraic Identities 271 Eliminate Common Subexpressions 272 Combine Paired Computation 272 Exploit Word Parallelism 272 B. R10000 Counter Event Types 273 Counter Events In Detail 275 Clock Cycles 275 Instructions Issued and Done 275 Issued Instructions (Event 1) 275 Issued Loads (Event 2) 276 Issued Stores (Event 3) 276 Instructions Done (Event 14) 276


Graduated Instructions 277 Issued Versus Graduate Loads and Stores 277 Graduated Instructions (Event 15) 278 Graduated Instructions (Event 17) 278 Graduated Loads (Event 18) 278 Graduated Stores (Event 19) 278 Graduated Floating Point Instructions (Event 21) 278 Branching Instructions 278 Decoded Branches (Event 6) 279 Mispredicted Branches (Event 24) 279 Primary Cache Use 279 Primary Instruction Cache Misses (Event 9) 280 Primary Data Cache Misses (Event 25) 280 Quadwords Written Back from Primary Data Cache (Event 22) 280 Secondary Cache Use 280 Quadwords Written Back from Scache (Event 7) 280 Correctable Scache Data Array ECC Errors (Event 8) 281 Secondary Instruction Cache Misses (Event 10) 281 Instruction Misprediction from Scache Way Prediction Table (Event 11) 281 Secondary Data Cache Misses (Event 26) 281 Data Misprediction from Scache Way Prediction Table (Event 27) 281 Store or Prefetch-Exclusive to Clean Block in Scache (Event 30) 282 Store or Prefetch-Exclusive to Shared Block in Scache (Event 31) 282 Cache Coherency Events 282 External Interventions (Event 12) 283 External Invalidations (Event 13) 283 Virtual Coherency Conditions (Old Event 14) 283 External Intervention Hits in Scache (Event 28) 283 External Invalidation hits in Scache (Event 29) 283 Virtual Memory Use 284 TLB Misses (Event 23) 284


Lock-Handling Instructions 284 Issued Store Conditionals (Event 4) 284 Failed Store Conditionals (Event 5) 285 20 Graduated store conditionals 285 C. Useful Scripts and Code 287 Program adi2 288 Basic Makefile 292 Software Pipeline Script swplist 294 Shell Script ssruno 297 Awk Script for Perfex Output 298 Awk Script for Amdahl’s Law Estimation 300 Page Address Routine va2pa() 302 CPU Clock Rate Routine cpuclock() 303

Glossary 305 Index 317


List of Examples

Example 1-1 Parallel Code Using Directives for Simple Scheduling 2 Example 1-2 Parallel Code Using Directives for Interleaved Scheduling 2 Example 4-1 Experimenting with perfex 54 Example 4-2 Output of perfex -a 55 Example 4-3 Output of perfex -a -y 56 Example 4-4 Performing an ssrun Experiment 66 Example 4-5 Example Run of ssruno 66 Example 4-6 Default prof Report from ssrun Experiment 67 Example 4-7 Profile at the Source Line Level Using prof -heavy 68 Example 4-8 Ideal Time Profile Run 69 Example 4-9 Default Report of Ideal Time Profile 69 Example 4-10 Ideal Time Report Truncated with -quit 72 Example 4-11 Ideal Time Report by Lines 73 Example 4-12 Ideal Time Profile Using -lines and -only Options 74 Example 4-13 Ideal Time Architecture Information Report 75 Example 4-14 Extract from a Butterfly Report 77 Example 4-15 Usertime Call Hierarchy 79 Example 4-16 Application of dprof 84 Example 4-17 Example of Default dlook Output 87 Example 5-1 Simple Summation Loop 97 Example 5-2 Unrolled Summation Loop 98 Example 5-3 Basic DAXPY Loop 99 Example 5-4 Unrolled DAXPY Loop 102 Example 5-5 Compiler-Generated DAXPY Schedule 103 Example 5-6 Basic DAXPY Loop Code 105 Example 5-7 Sample Software Pipeline Report Card 106 Example 5-8 C Implementation of DAXPY Loop 111


Example 5-9 SWP Report Card for C Loop with Default Alias Model 111 Example 5-10 SWP Report Card for C Loop with Alias=Restrict 112 Example 5-11 C Loop Nest on Multidimensional Array 112 Example 5-12 SWP Report Card for Stencil Loop with Alias=Restrict 114 Example 5-13 SWP Report Card for Stencil Loop with Alias=Disjoint 114 Example 5-14 Indirect DAXPY Loop 115 Example 5-15 SWP Report Card on Indirect DAXPY 115 Example 5-16 Indirect DAXPY in Fortran with ivdep 116 Example 5-17 Indirect DAXPY in C with ivdep 116 Example 5-18 SWP Report Card for Indirect DAXPY with ivdep 116 Example 5-19 Loop with Two Types of Dependency 117 Example 5-20 C Loop with Obvious Loop-Carried Dependence 117 Example 5-21 C Loop with Lexically-Forward Dependency 118 Example 5-22 C Loop Test Using Dereferenced Pointer 118 Example 5-23 C Loop Test Using Local Copy of Dereferenced Pointer 119 Example 5-24 C Loop with Disguised Invariants 119 Example 5-25 SWP Report Card for Loop with Disguised Invariance 120 Example 5-26 C Loop with Invariants Exposed 120 Example 5-27 SWP Report Card for Modified Loop 121 Example 5-28 Conventional Code to Avoid an Exception 122 Example 5-29 Speculative Equivalent Permitting an Exception 122 Example 5-30 Code Suitable for Inlining 129 Example 5-31 Subroutine Candidates for Inlining 130 Example 5-32 Inlined Code from w2f File 131 Example 6-1 Simple Loop Nest with Poor Cache Use 138 Example 6-2 Reversing Loop Nest to Achieve Stride-One Access 139 Example 6-3 Loop Using Three Vectors 139 Example 6-4 Three Vectors Combined in an Array 140 Example 6-5 Fortran Code Likely to Cause Thrashing 140 Example 6-6 Perfex Data for adi2.f 144 Example 6-7 Perfex Data for adi5.f 145 Example 6-8 Perfex Data for adi53.f 147


Example 6-9 Sequence of DAXPY and Dot-Product on a Single Vector 148 Example 6-10 DAXPY and Dot-Product Loops Fused 149 Example 6-11 Matrix Multiplication Loop 149 Example 7-1 Matrix Multiplication Subroutine 159 Example 7-2 SWP Report Card for Matrix Multiplication 160 Example 7-3 Matrix Multiplication Unrolled on Outer Loop 160 Example 7-4 Matrix Multiplication Unrolled on Middle Loop 161 Example 7-5 Matrix Multiplication Unrolled on Outer and Middle Loops 162 Example 7-6 Simple Loop Nest with Poor Cache Use 165 Example 7-7 Simple Loop Nest Interchanged for Stride-1 Access 165 Example 7-8 Loop Nest with Data Recursion 165 Example 7-9 Recursive Loop Nest Interchanged and Unrolled 167 Example 7-10 Matrix Multiplication in C 167 Example 7-11 Cache-Blocked Matrix Multiplication 167 Example 7-12 Fortran Nest with Explicit Cache Block Sizes for Middle and Inner Loops 168 Example 7-13 Fortran Loop with Explicit Cache Block Sizes and Interchange 169 Example 7-14 Transformed Fortran Loop 169 Example 7-15 Adjacent Loops that Cannot be Fused 170 Example 7-16 Adjacent Loops Fused After Peeling 171 Example 7-17 Sketch of a Loop with a Long Body 171 Example 7-18 Sketch of a Loop After Fission 172 Example 7-19 Loop Nest that Cannot Be Interchanged 172 Example 7-20 Loop Nest After Fission and Interchange 172 Example 7-21 Simple Reduction Loop Needing Prefetch 174 Example 7-22 Simple Reduction Loop with Prefetch 174 Example 7-23 Reduction with Conditional Prefetch 175 Example 7-24 Reduction with Prefetch Unrolled Once 175 Example 7-25 Reduction Loop Unrolled with Two-Ahead Prefetch 176 Example 7-26 Reduction Loop Unrolled Four Times 176 Example 7-27 Reduction Loop Reordered for Pseudo-Prefetching 177 Example 7-28 Fortran Use of Manual Prefetch 179 Example 7-29 Typical Fortran Declaration of Local Arrays 180


Example 7-30 Common, Improper Fortran Practice 181 Example 7-31 Fortran Loop to which Gather-Scatter Is Applicable 182 Example 7-32 Fortran Loop with Gather-Scatter Applied 182 Example 7-33 Fortran Loop That Processes a Vector 183 Example 7-34 Fortran Loop Transformed to Vector Intrinsic Call 184 Example 8-1 Typical C Loop 189 Example 8-2 Amdahl’s law: Speedup(n) Given p 190 Example 8-3 Amdahl’s law: p Given Speedup(2) 191 Example 8-4 Amdahl’s Law: p Given Speedup(n) and Speedup(m) 192 Example 8-5 Fortran Loop with False Sharing 200 Example 8-6 Fortran Loop with False Sharing Removed 201 Example 8-7 Easily Parallelized Fortran Vector Routine 216 Example 8-8 Fortran Vector Operation, Parallelized 217 Example 8-9 Fortran Vector Operation with Distribution Directives 223 Example 8-10 Parallel Loop with Affinity in Data 228 Example 8-11 Parallel Loop with Affinity in Threads 228 Example 8-12 Loop Parallelized with the NEST Clause 229 Example 8-13 Loop Parallelized with NEST Clause with Data Affinity 229 Example 8-14 Loop Parallelized with NEST, AFFINITY, and ONTO 230 Example 8-15 Fortran Code for Explicit Page Placement 231 Example 8-16 Declarations Using the Distribute_Reshape Directive 232 Example 8-17 Valid and Invalid Use of Reshaped Array 234 Example 8-18 Corrected Use of Reshaped Array 235 Example 8-19 Gathering Reshaped Data with Copying 235 Example 8-20 Gathering Reshaped Data with Cache-Friendly Copying 236 Example 8-21 Reshaped Array as Actual Parameter—Valid 238 Example 8-22 Reshaped Array as Actual Parameter—Invalid 238 Example 8-23 Differently Reshaped Arrays as Actual Parameters 238 Example 8-24 Typical Output of _DSM_VERBOSE 239 Example 8-25 Test Placement Display from First-Touch Allocation 241 Example 8-26 Test Placement Display from Round-Robin Placement 242 Example 8-27 Scalable Placement File 248


Example 8-28 Scalable Placement File for Two Threads Per Memory 248 Example 8-29 Various Ways of Distributing Threads to Memories 250 Example 8-30 Calling dplace Dynamically from Fortran 253 Example 8-31 Using a Script to Capture Redirected Output from an MPI Job 254 Example A-1 Naive Function to Find Nearest Point 264 Example A-2 Nearest-Point Function with Short-Circuit Test 265 Example C-1 Program adi2.f 288 Example C-2 Program adi5.f 291 Example C-3 Program adi53.f 291 Example C-4 Basic Makefile 292 Example C-5 Shell Script swplist 294 Example C-6 SpeedShop Experiment Script ssruno 297 Example C-7 Awk Script to Analyze Output of perfex -a 298 Example C-8 Awk Script to Extrapolate Amdahl’s Law from Measured Times 300 Example C-9 Routine va2pa() Returns the Physical Page of a Virtual Address 302 Example C-10 Routine cpuclock() Gets the Clock Speed from the Hardware Inventory 303


List of Figures

Figure 1-1 Block Diagrams of 4-, 8-, 16-, 32-, and 64-CPU SN0 Systems 7 Figure 1-2 Block Diagram and Approximate Appearance of a Node Board 10 Figure 1-3 Block Diagram of Memory, Hub, and Cache Directory 14 Figure 1-4 XIO and XBOW Provide I/O Attachment to a Node 17 Figure 2-1 Program Address Space Versus SN0 Architecture 32 Figure 2-2 Parallel Process Memory Access Pattern Versus SN0 Architecture 33 Figure 2-3 Parallel Processes at Opposite Corners of SN0 System 34 Figure 2-4 Parallel Processes and Memory at Diabolically Bad Locations 35 Figure 2-5 Parallel Program Ideally Placed in SN0 System 36 Figure 2-6 Parallel Program Mapped to a Pair of MLDs 37 Figure 2-7 Parallel Program Mapped to an MLD Set with Hypercube Topology and Affinity to a Graphics Device 38 Figure 2-8 Parallel Program Mapped through MLDs to Hardware 39 Figure 4-1 Code Residence versus Data reference 83 Figure 4-2 On-screen plot of dprof output 86 Figure 5-1 DAXPY Software Pipeline Schedule 104 Figure 5-2 Inlining and the Call Hierarchy 133 Figure 6-1 Processing Directions in adi2.f 143 Figure 6-2 Memory Use in Matrix Multiply 150 Figure 6-3 Cache Blocking of Matrix Multiplication 151 Figure 6-4 Schematic of Data Motion in Radix-2 Fast Fourier Transform 154 Figure 7-1 Table of Loop-Unrolling Parameters for Matrix Multiply 163 Figure 7-2 Performance of Vector Intrinsic Functions on an Origin2000 185 Figure 8-1 Possible Speedup for Different Values of p 191 Figure 8-2 Performance of Weather Model Before and After Tuning 202 Figure 8-3 Calculated Bandwidth for Different Placement Policies 208 Figure 8-4 Calculated Iteration Times for Different Placement Policies 211


Figure 8-5 Cumulative Run Time for Different Placement Policies 212 Figure 8-6 Effect of Migration Level on Iteration Time 214 Figure 8-7 Effect of Page Granularity in First-Touch Allocation 218 Figure 8-8 Data Partition for NAS FT Kernel 220 Figure 8-9 NAS FT Kernel Data Redistributed 221 Figure 8-10 Some Possible Regular Distributions 226 Figure 8-11 Possible Outcomes of Distribute ONTO Clause 227 Figure 8-12 Reshaped Distribution of Three-Dimensional Array 233 Figure 8-13 Copying By Cache Lines for Summation 236 Figure 8-14 Placement File and its Results 247

List of Tables

Table 1-1 Scaling of NAS Parallel Benchmark on a Bus Architecture 4 Table 1-2 Scaling of NAS Parallel Benchmark on the Origin2000 5 Table 1-3 Bisection Bandwidth and Memory Latency 19 Table 4-1 Derived Statistics Reported by perfex -y 59 Table 4-2 SpeedShop Sampling Time Bases 64 Table 5-1 Compiler Option Groups 91 Table 5-2 Instruction Schedule for Basic DAXPY 101 Table 5-3 Instruction Schedule for Unrolled DAXPY 102 Table 6-1 Cache Effect on Performance of Matrix Multiply 150 Table 6-2 Cache Blocking Performance of Large Matrix Multiply 152 Table 7-1 LNO Options and Directives for Cache Blocking 168 Table 7-2 LNO Options and Directives for Loop Transformation 173 Table 7-3 LNO Options and Directives for Prefetch 178 Table 8-1 Forms of the Data Distribution Directives 222 Table 8-2 Distance in Router Hops Between Any Nodes in an 8-Node System 246 Table B-1 R10000 Countable Events 274


About This Guide

This guide tells how to tune programs that run on the Silicon Graphics Origin2000, Onyx2, and Origin200 multiprocessor systems for best performance. The material is meant for two different uses:
• As a self-paced study course, to be used by any software developer who is writing or maintaining an application to run on an Origin system.
• As an outline and supplementary material for a course delivered in person by a Silicon Graphics System Support Engineer.

The guide also contains a glossary of terms related to performance tuning and to the hardware concepts of SN0 (Scalable Node 0, the name for Silicon Graphics server architecture).

Who Can Benefit from This Guide

The guide is written for experienced programmers, familiar with IRIX commands and with either the C or Fortran programming languages. The focus is on achieving the highest possible performance by exploiting the features of IRIX, the MIPS R10000 CPU, and the SN0 architecture.

The material assumes that you know the basics of software engineering and that you are familiar with standard methods and data structures. If you are new to software design, to UNIX, to IRIX, or to Silicon Graphics hardware, this guide will not help you learn these things.


What the Guide Contains

Chapter 1, “Understanding SN0 Architecture,” describes the features of the SN0 architecture that affect performance, in particular the cache-coherent nonuniform memory architecture (CC-NUMA).

Chapter 2, “SN0 Memory Management,” reviews general programming issues for these systems and the programming practices that lead to good (and bad) performance.

Chapter 3, “Tuning for a Single Process,” covers tuning for single-process performance in detail, showing how to take best advantage of the R10000 CPU and cache memory, how to use the profiling tools, and how to select among the many compiler options.

Chapter 4, “Profiling and Analyzing Program Behavior”

Chapter 5, “Using Basic Compiler Optimizations”

Chapter 6, “Optimizing Cache Utilization”

Chapter 7, “Using Loop Nest Optimization”

Chapter 8, “Tuning for Parallel Processing,” discusses tuning issues for parallel programs, including points on how to avoid cache contention and how to distribute virtual memory segments to different nodes.

Appendix A, “Bentley’s Rules Updated,” is a summary of the performance-tuning guidelines first published by Jon Bentley in the out-of-print classic Writing Efficient Programs, updated for the modern world of superscalar CPUs and multiprocessors.

Appendix B, “R10000 Counter Event Types,” describes the meanings of the event counter registers in the R10000 CPU and their use for tuning.

Appendix C, “Useful Scripts and Code,” contains several longer examples and scripts mentioned in the text.


Related Documents

The material covered in this book is related to other works in the Silicon Graphics library.

Related Manuals

All of the following books can be read online on the Internet from the Tech Pubs Library at http://techpubs.sgi.com/library.

Hardware Manuals
• MIPS R10000 Microprocessor User Guide, Version 2.0, 007-2490-001, is the authoritative guide to the internal operations of the CPU chip used in SN0 systems.
• Origin and Onyx2 Theory of Operations Manual, 007-3439-nnn, covers the basic design of the SN0 architecture.
• Origin and Onyx2 Programmer’s Reference Manual, 007-3410-nnn, has additional details of SN0 physical and virtual addressing and other topics.

Compiler Manuals
• MIPSpro Compiling and Performance Tuning Guide, 007-2360-nnn, covers compiler and linker use that is common to all the compilers, including the many optimization directives and command-line options.
• MIPSpro 64-Bit Porting and Transition Guide, 007-2391-nnn, discusses the problems that arise when porting from a 32-bit to a 64-bit computing environment, and has some discussion of optimization features.
• The Fortran compilers are documented in MIPSpro Fortran 77 Programmer’s Guide, 007-2361-nnn, and MIPSpro Fortran 90 Commands and Directives Reference Manual, 007-3696-nnn. These books address general run-time issues, have some discussion of performance tuning, and document compiler directives, including the OpenMP directives for parallel processing.
• MIPSpro C and C++ Pragmas, 007-3587-nnn, covers parallelization and other directives for C programming.
• MIPSpro Auto-Parallelizing Option Programmer’s Guide, 007-3572-nnn, documents how the C, C++, and Fortran compilers automatically parallelize serial programs.


Software Tool Manuals
• SpeedShop User’s Guide, 007-3311-nnn, documents the tuning and profiling tools mentioned in this book.
• Topics in IRIX Programming, 007-2478-nnn, details the available models for parallel programming and documents a number of advanced programming topics.
• Message Passing Toolkit: MPI Programmer’s Manual, 007-3687-nnn, and Message Passing Toolkit: PVM Programmer’s Manual, 007-3686-nnn, document the use of these popular libraries for parallel programming.
• IRIX Admin: System Configuration and Operation, 007-2859-nnn, documents the commands the system administrator uses, including the system tuning variables.

Third-Party Resources

The foundation of the SN0 multiprocessor design is explained in Scalable Shared-Memory Multiprocessing by Daniel Lenoski and Wolf-Dietrich Weber (San Francisco: Morgan Kauffman, 1995).

A particularly good book on parallel programming is Practical Parallel Programming by Barr E. Bauer (Academic Press, 1992; ISBN 0120828103). Although it is not current for Silicon Graphics compilers and SN0 hardware, it has good conceptual material.

Courses and information about parallel and distributed programming are available on the Internet. The following are some useful links:
• The Boston University Scientific Computing and Visualization Group offers a number of useful tutorials on such topics as parallel programming in Fortran 90 and the use of MPI. The URL is http://scv.bu.edu/SCV/Tutorials/.
• The web page for the Computational Science and Engineering Graduate Option Program at the University of Illinois at Urbana-Champaign links to the lecture notes for courses on parallel computation and parallel numerical algorithms. The URL is http://www.cse.uiuc.edu/.
• The entire text of Designing and Building Parallel Programs by Ian Foster (Addison-Wesley 1995; ISBN 0-201-57594-9) is available online, with a wealth of supplementary material and links related to parallel programming. The URL is http://www.mcs.anl.gov/dbpp/.
• The NAS Parallel Benchmarks are a suite of programs that are used to compare the performance of parallel systems and parallelizing compilers. The benchmarks are described in a report found at the following URL: http://science.nas.nasa.gov/Pubs/TechReports/RNRreports/dbailey/RNR-94-007/html/npbspec.html.

Related Reference Pages

The reference pages for the compilers and tools are detailed and informative. Look up the following reference pages using the InfoSearch facility (under IRIX 6.5, found in the desktop Toolchest menu under Help > Man Pages). From the InfoSearch window you can print copies of these pages for study and for reference.
• cc(1), CC(1), f77(1), and f90(1) each document the operation and main option groups for one compiler. These pages are very similar because most options are handled by the common back end and linker.
• ipa(5) documents the -IPA option subgroup, controlling the interprocedural analysis phase of all compilers.
• lno(5) documents the -LNO option group, controlling loop-nest optimization for all compilers.
• opt(5) documents the -OPT option group, controlling general optimizations for all compilers.
• math(3) details the standard math library used by all programs. Specially tuned libraries are described in libfastm(3) and sgimath(3) (this very long page lists all BLAS, LAPACK, and other functions).
• ld(1) documents the linker; rld(1) documents the runtime linker; and dso(5) documents the format of dynamic shared objects (runtime-linkable libraries).
• The auto-parallelizing feature is documented in auto_p(5). It generates code that uses the multiprocessing library; its runtime control is documented in mp(3F) for Fortran 77, pe_environ(5) for Fortran 90, and mp(3C) for C and C++. The newer OpenMP runtime features are in omp_threads(3), omp_lock(3), and omp_nested(3).
• The reference pages perfex(1), speedshop(1), ssrun(1), and prof(1) document the software tools used to profile and analyze your programs.


Text Conventions

Different text fonts are used in this book to indicate different kinds of information, as shown in the following list of conventions and examples:
• Terms that are defined in the Glossary (page 305). Example: This performance problem is generically referred to as cache contention.
• Names of IRIX commands and command-line options. Example: Compile with cc -LNO:off. Check the CPU clock rate with hinv.
• Names of filesystems, paths, linkable libraries, and files. Example: Examine /etc/config. Devices appear in the /hw filesystem.
• Names of routines, functions, and procedures when used as names. Example: Two common library functions are printf() and wait().
• User input and program statements or expressions, when they must be typed exactly as shown. Example: Use c$doacross mp_schedtype=simple to parallelize with the basic scheduling. Enter y when prompted.
• Program and mathematical variables used as names, and variable elements of program expressions. Example: Applying p CPUs to a program does not result in a speedup of p times. The feedback file is written as program.n.fb.

1. Understanding SN0 Architecture

To extract top performance from a computer system, it is important to understand the architecture of the system. This section describes the architecture of the Silicon Graphics SN0 systems: the Origin2000, Onyx2, and Origin200, and the components they contain, including the MIPS R10000 CPU.

Understanding Scalable Multiprocessor Memory

The architecture of SN0 (Scalable Node 0) systems is the basis for a scalable multiprocessor with shared memory. The Origin2000 and Onyx2, as well as the Origin200, are based on the SN0 architecture. This section describes what a scalable, shared memory multiprocessor is, and explains the rationale for the SN0 architecture.

Memory for Multiprocessors

Because there is always a limit to the performance of a single CPU, computer manufacturers have increased the performance of their systems by incorporating multiple CPUs. Two approaches for utilizing multiple CPUs have emerged: distributed memory, in which each CPU has a private memory; and shared memory, in which all CPUs access common memory.

Previous shared memory computer designs—including the CHALLENGE and POWER CHALLENGE systems from Silicon Graphics—were implemented using a bus architecture: processor boards and memory boards plugged into a single common bus, with all communication between the hardware components occurring over the bus.

Shared Memory Multiprocessing

A shared-memory architecture has many desirable properties. Each CPU has direct and equal access to all the memory in the system. It is relatively easy to build a single operating system that uses all CPUs in a symmetrical way. Parallel programs are easy to implement on such a system. Parallelism can be achieved by inserting directives into the code to distribute the iterations of a loop among the CPUs. A Fortran loop of this type is shown in Example 1-1.


Example 1-1 Parallel Code Using Directives for Simple Scheduling

c$doacross local(j,k,i), shared(n,a,r,s), mp_schedtype=simple
      do j = 1, n
        do k = 1, n
          do i = 1, n
            a(i,j) = a(i,j) + r(i,k)*s(k,j)
          enddo
        enddo
      enddo

In Example 1-1, mp_schedtype=simple means that blocks of consecutive iterations are assigned to each CPU. That is, if there are p CPUs, the first CPU is responsible for executing the first ceil(n/p) iterations, the second CPU executes the next ceil(n/p) iterations, and so on.
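
The arithmetic behind this block assignment is simple enough to state as code. The following C sketch is for illustration only (the function name and the choice of n and p are arbitrary, and this is not how the MP runtime library is implemented); it computes the range of iterations a given CPU executes under simple scheduling:

#include <stdio.h>

/* Illustrative only: compute the block of loop iterations (1-based,
 * inclusive) that CPU number "cpu" (0..p-1) executes under simple
 * (block) scheduling of n iterations over p CPUs. */
void simple_block(int n, int p, int cpu, int *first, int *last)
{
    int chunk = (n + p - 1) / p;        /* ceil(n/p) iterations per CPU */
    *first = cpu * chunk + 1;
    *last  = (cpu + 1) * chunk;
    if (*last > n)
        *last = n;                      /* the final CPU may get fewer */
}

int main(void)
{
    int first, last, cpu;
    for (cpu = 0; cpu < 4; cpu++) {     /* e.g., n = 10 iterations, p = 4 CPUs */
        simple_block(10, 4, cpu, &first, &last);
        printf("CPU %d: iterations %d..%d\n", cpu, first, last);
    }
    return 0;
}

For n = 10 and p = 4 this prints iterations 1..3, 4..6, 7..9, and 10..10, showing that the last CPU can receive a short block.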

One advantage of this parallelization style is that it can be done incrementally. That is, the user can identify the most time-consuming loop in the program and parallelize just that. (Loops without directives run sequentially, as usual.) The program can then be tested to validate correctness and measure performance. If a sufficient speedup has been achieved, the parallelization is complete. If not, additional loops can be parallelized one at a time.

Another advantage of the shared memory parallelization is that loops can be parallelized without being concerned about which CPUs have accessed data elements in other loops. In the parallelized loop above, the first CPU is responsible for updating the first ceil(n/p) columns of the array a.

In the loop in Example 1-2, an interleaved schedule is used, which means that the first CPU will access every pth column of array a.

Example 1-2 Parallel Code Using Directives for Interleaved Scheduling

c$doacross local(kb,k,t), shared(n,a,b), mp_schedtype=interleave
      do kb = 1, n
        k = n + 1 - kb
        b(k) = b(k)/a(k,k)
        t = -b(k)
        call daxpy(k-1,t,a(1,k),1,b(1),1)
      enddo
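
Under interleaved scheduling the mapping is round-robin rather than blocked: iteration i of the kb loop is executed by CPU mod(i-1, p). A minimal C sketch of that mapping (the names are illustrative only, not the MP library’s internals):

#include <stdio.h>

/* Illustrative only: which CPU (0..p-1) executes iteration i (1-based)
 * under interleaved scheduling with p CPUs. */
int interleave_owner(int i, int p)
{
    return (i - 1) % p;
}

int main(void)
{
    int i, p = 4;
    for (i = 1; i <= 8; i++)
        printf("iteration %d -> CPU %d\n", i, interleave_owner(i, p));
    return 0;
}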


Because all memory is equally accessible to all CPUs, nothing special needs to be done for a CPU to access data that were last touched by another CPU; each CPU simply references the data it requires. This makes parallelization easier; the programmer needs only make sure the CPUs have an equal amount of work to do; it doesn’t matter which subset of the data is assigned to a particular CPU.

Distributed Memory Multiprocessing

A different approach to multiprocessing is to build a system from many units each containing a CPU and memory. In a distributed memory architecture, there is no common bus to which memory attaches. Each CPU has its own local memory that can only be accessed directly by that CPU. For a CPU to have access to data in the local memory of another CPU, a copy of the desired data elements must be sent from one CPU to the other.

When such a distributed architecture is exposed through the software, the programmer has to specify the exchange of data. Programmed copying of data between CPUs is called message-passing. It is accomplished using a software library. MPI and PVM are two libraries that are often used to write message-passing programs.
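
To make the contrast with the directive-based examples concrete, the following C sketch shows the shape of a message-passing program using MPI. The array size and the choice to send half the vector from rank 0 to rank 1 are arbitrary illustrations; the point is that the programmer must code the data transfer explicitly:

#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char **argv)
{
    double a[N];
    int rank, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 0; i < N; i++)         /* rank 0 holds the data...          */
            a[i] = (double)i;
        /* ...and must explicitly ship half of it to rank 1 */
        MPI_Send(&a[N/2], N/2, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&a[N/2], N/2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d elements\n", N/2);
    }

    MPI_Finalize();
    return 0;
}

Run with two MPI processes, rank 1 sees none of rank 0’s data except what is explicitly sent and received.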

To run a program on a message-passing machine, the programmer must decide how the data should be distributed among the local memories. Data structures that are to be operated on in parallel must be distributed. In this situation, incremental parallelism is not possible. The key loops in a program will typically reference many data structures; each structure must be distributed in order to run that loop in parallel. Once a data structure has been distributed, every section of code that references it must be run in parallel. The result is an all-or-nothing approach to parallelism on distributed memory machines. In addition, often a data structure is bigger than a single local memory, and it may not be possible to run the program on the machine without first parallelizing it.

In the preceding section you saw that different loops of a program may be parallelized in different ways to ensure an even allocation of work among the CPUs. To make a comparable change on a distributed memory machine, data structures have to be distributed differently in different sections of the program. Redistributing the data structures is the responsibility of the programmer, who must find efficient ways to accomplish the needed data shuffling. Such issues never arise in the shared memory implementation.

The bottom line is that programming, and especially parallel programming, is much easier in a shared memory architecture than in a distributed memory architecture.


Scalability in Multiprocessors

Why would anyone build anything but a shared memory computer? The reason is scalability. Table 1-1 shows the performance of the NAS parallel benchmark kernel running on a POWER CHALLENGE 10000 system. Results are presented for a range of CPU counts and for two problem sizes: Class A represents a small problem size, and the Class B problem is four times larger. The columns headed “Amdahl” show the theoretical speedup predicted by Amdahl’s law (see “Understanding Parallel Speedup and Amdahl’s Law” on page 188).

Table 1-1 Scaling of NAS Parallel Benchmark on a Bus Architecture

#CPUs   Class A   Speedup   Amdahl   Class B   Speedup   Amdahl
    1     62.73      1.00     1.00    877.74      1.00     1.00
    2     32.25      1.95     1.95    442.66      1.98     1.98
    4     16.79      3.74     3.69    225.10      3.90     3.90
    8      9.58      6.55     6.68    125.71      6.98     7.54
   12      7.96      7.88     9.16    103.44      8.49    10.96
   16      7.55      8.31    11.24     98.92      8.87    14.16

Table 1-1 shows that the performance scales well with number of CPUs, through eight CPUs. Then it levels off; the 12-CPU performance is well short of 150% of the 8-CPU performance, and the 16-CPU times are only slightly improved. The cause is the common bus in the shared memory system. It has a finite bandwidth and hence a finite number of transactions that it can transport per second. For this particular program, the limit is reached when 12 CPUs are accessing memory simultaneously. The bus is said to be “saturated” with memory accesses; and when 16 CPUs run the program, the execution time remains almost the same. For other programs, the number of CPUs needed to saturate the bus varies. But once this limit has been reached, adding more CPUs will not significantly increase the program’s performance—the system lacks scalability past that limit.
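
The “Amdahl” columns come from Amdahl’s law, covered under “Understanding Parallel Speedup and Amdahl’s Law” in Chapter 8: with parallel fraction p, Speedup(n) = 1 / ((1 - p) + p/n). The following C sketch applies that formula, estimating p from the measured two-CPU speedup; this is only one way to derive p, so the numbers it prints approximate rather than reproduce the table’s Amdahl column:

#include <stdio.h>

/* Amdahl's law: speedup on n CPUs for a program whose
 * parallelizable fraction is p (0 <= p <= 1). */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Estimate p from a measured speedup on 2 CPUs:
 * Speedup(2) = 1/((1-p) + p/2)  =>  p = 2*(1 - 1/Speedup(2)). */
double amdahl_fraction(double speedup2)
{
    return 2.0 * (1.0 - 1.0 / speedup2);
}

int main(void)
{
    double p = amdahl_fraction(1.95);   /* measured Class A speedup on 2 CPUs */
    int n;
    printf("estimated parallel fraction p = %.4f\n", p);
    for (n = 1; n <= 16; n *= 2)
        printf("predicted Speedup(%2d) = %.2f\n", n, amdahl_speedup(p, n));
    return 0;
}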


Scalability and Shared, Distributed Memory

Although a shared memory architecture offers programming simplicity, the finite bandwidth of a common bus can limit scalability. A distributed memory, message-passing architecture cures the scalability problem by eliminating the bus, but it also eliminates the programming simplicity of the shared memory model. The SN0 scalable shared memory architecture aims to deliver the best of both approaches.

In the SN0 design, memory is physically distributed; there is no common bus that could become a bottleneck. Memory is added with each CPU and memory bandwidth grows as CPUs and memory are added to the system. Nevertheless, SN0 hardware presents all memory to software within a unified, global address space. As far as software is concerned, all memory is shared, just as in a bus-based system. The programming ease is preserved, but the performance is scalable.

The improvement in scalability is easily observed in practice. Table 1-2 shows the performance of the NAS parallel benchmark (FT kernel) on an Origin2000 system.

Table 1-2 Scaling of NAS Parallel Benchmark on the Origin2000

#CPUs   Class A   Speedup   Class B   Speedup
    1     55.43      1.00    810.41      1.00
    2     29.53      1.88    419.72      1.93
    4     15.56      3.56    215.95      3.75
    8      8.05      6.89    108.42      7.47
   16      4.55     12.18     54.88     14.77
   32      3.16     17.54     28.66     28.28

You can see that programs with sufficient parallelism now scale to as many CPUs as are available. What’s more, this is achieved with the same code that was used on the bus-based shared memory system.

In addition, the per-CPU performance on Origin2000 systems is better than on POWER CHALLENGE 10000 systems (compare the first four rows of Table 1-2 with Table 1-1) because of the improved memory access times of the SN0 system.


Understanding Scalable Shared Memory

To understand scalability in the SN0 architecture, first look at how the building blocks of an SN0 system are connected.

SN0 Organization

Figure 1-1 shows high-level block diagrams of SN0 systems with various numbers of CPUs. Each “N” square represents a node board, containing CPUs and some amount of memory. Each “R” circle represents a router, a board that routes data between nodes.


Figure 1-1 Block Diagrams of 4-, 8-, 16-, 32-, and 64-CPU SN0 Systems

Each SN0 node contains one or two CPUs, some memory, and a custom circuit called the hub. Additionally, I/O can be connected to a node through one or more XIO boards. The hub in a node directs the flow of data between CPUs, memory, and I/O. Through the hub, memory in a node can be used concurrently by one or both CPUs, by I/O attached to the node, and—via a router—by the hubs in other nodes.


The smallest SN0 systems consist of a single node. For example, an Origin200 system is a deskside unit containing a single node, with one or two CPUs, memory, and I/O.

Larger SN0 systems are built by connecting multiple nodes. The first diagram in Figure 1-1 shows a two-node system, such as an Onyx2 Deskside system. Connecting two nodes means connecting their hub chips. In a two-node system this requires only wiring the two hubs together. The hub on either node can access its local memory, and can also access memory in the other node, through the other hub.

A hub determines whether a memory request is local or remote based on the physical address of the data. (The node number is encoded in the high-order bits of the 64-bit physical address.) Access to memory on a different node takes more time than access to memory on the local node, because the request must be processed through two hubs.
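
As a purely conceptual illustration of that decision, suppose the node number occupied the bits at and above bit 32 of the physical address (the actual SN0 bit positions are defined by the hardware and are not given here). The local/remote test then reduces to a shift, mask, and compare:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical encoding for illustration: assume the node number
 * occupies the bits at and above NODE_SHIFT in the 64-bit physical
 * address.  The real SN0 bit layout is defined by the hardware. */
#define NODE_SHIFT 32
#define NODE_MASK  0xFFULL

static unsigned node_of(uint64_t paddr)
{
    return (unsigned)((paddr >> NODE_SHIFT) & NODE_MASK);
}

static int is_local(uint64_t paddr, unsigned my_node)
{
    return node_of(paddr) == my_node;
}

int main(void)
{
    uint64_t paddr = ((uint64_t)3 << NODE_SHIFT) | 0x1000; /* node 3, offset 0x1000 */
    printf("address 0x%llx: node %u, %s to node 0\n",
           (unsigned long long)paddr, node_of(paddr),
           is_local(paddr, 0) ? "local" : "remote");
    return 0;
}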

When there are more than two nodes in a system, their hubs cannot simply be wired together. The hardware in an SN0 system that connects nodes is called a router. A router can be connected to up to six hubs or other routers. In the 8-CPU system shown in Figure 1-1, a single router connects four hubs. The router, like a hub, introduces a tiny delay in data transfer, so a CPU in one node incurs an extra delay in accessing memory on a different node.

Now compare the 16-, 32- and 64-CPU system diagrams. From these diagrams you can begin to see how the router configurations scale: each router is connected to two hubs, routers are then connected to each other forming a binary n-cube, or hypercube, in which n, the dimensionality of the router configuration, is the base-2 logarithm of the number of routers.
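
The relationship between router count and path length can be written out as a small calculation: the dimension n is the base-2 logarithm of the number of routers, and, as discussed under “SN0 Memory Distribution” below, a request passes through at most n+1 routers. A C sketch, assuming a fully configured hypercube (the helper name is illustrative):

#include <stdio.h>

/* Dimension of a hypercube with "routers" vertices (routers is assumed
 * to be a power of two, as in a fully configured system). */
static int hypercube_dim(int routers)
{
    int n = 0;
    while ((1 << n) < routers)
        n++;
    return n;
}

int main(void)
{
    int routers;
    for (routers = 1; routers <= 32; routers *= 2) {
        int n = hypercube_dim(routers);
        printf("%2d routers: dimension %d, at most %d routers on any path\n",
               routers, n, n + 1);
    }
    return 0;
}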

SN0 Memory Distribution

The key point is that the SN0 hardware allows the memory to be physically distributed and yet shared by all software, just as in a bus-based system. Each node contains some local memory (usually many megabytes) that is accessed in the least time. The hardware makes all memory equally accessible from a software standpoint, by routing memory requests through routers to the other nodes. However, the hypercube topology ensures that these data paths do not grow linearly with the number of CPUs: as more nodes and routers are added, more parallel data paths are created. Thus a system can be scaled up without fear that the router connections will become a bottleneck, the way the bus becomes a bottleneck in a bus architecture.


One nice characteristic of the bus-based shared memory systems has been sacrificed: the access time to memory is no longer uniform, but varies depending on how far away the memory lies from the CPU. The two CPUs in each node have quick access through their hub to their local memory. Accessing remote memory through an additional hub adds an extra increment of time, as does each router the data must travel through. This is called nonuniform memory access (NUMA). However, several factors combine to smooth out access times:
1. The hardware has been designed so that the incremental costs to access remote memory are not large. The choice of a hypercube router configuration means that the number of routers information must pass through is at most n+1, where n is the dimension of the hypercube; this grows only as the logarithm of the number of routers. As a result, the average memory access time on even the largest SN0 system is no greater than the uniform memory access time on a POWER CHALLENGE 10000 system.
2. The R10000 CPUs operate on data that are resident in their caches. When programs use the caches effectively, the access time to memory (local or remote) is unimportant because the great majority of accesses are satisfied from the caches.
3. The R10000 CPUs can prefetch data that are not in cache. Other work can be carried out while these data move from local or remote memory into the cache, thus hiding the access time.
4. Through operating system support or explicit programming, the data of most programs can be made to reside primarily in memory that is local to the CPUs that execute the code.

The SN0 architecture provides shared memory hardware without the limitations of traditional bus-based designs. The following topics look in detail at the components that make up the system.


SN0 Node Board

An Origin2000 or Onyx2 system starts with a node board. Figure 1-2 displays a block diagram of a node board and the actual appearance of one.

[Figure: the node board diagram labels the main memory DIMM slots (16), the directory memory DIMM slots used in systems larger than 32 CPUs, the hub chip with heat sink, the CrayLink and XIO connections, and the two R10000 processors with their secondary caches (HIMMs) and heat sinks.]

Figure 1-2 Block Diagram and Approximate Appearance of a Node Board


CPUs and Memory

Each node contains two R10000 CPUs. (An overview of the R10000 CPU appears later; see “Understanding MIPS R10000 Architecture” on page 20.) Node boards are available with CPUs that run at different clock rates, from 180 MHz up to 250 MHz, and using different sizes of secondary cache. (The hinv command lists these details for a given system; see the hinv(1) reference page.) All the nodes in a system must run at the same clock rate.

In addition to CPUs, a node board has sockets to hold memory DIMMs. A node board can be configured with 64 MB to 4 GB of memory.

Memory Overhead Bits

In addition to data memory, each main memory DIMM contains extra memory bits. Some of these are used to provide single-bit error correction and dual-bit error detection. The remaining bits store what is known as the cache directory. (The operation of this directory-based cache is described in more detail in a following section; see “Understanding Cache Coherency” on page 12.) The use of the cache directory means that the main memory DIMMs must store additional information proportional to the number of nodes in the system. In systems with 32 or fewer CPUs, the directory overhead amounts to less than 6 percent of the storage on a DIMM; for larger systems, it is less than 15 percent. This is comparable to the overhead required for the error correction bits.

SN0 systems can scale to a large number of nodes, and a significant amount of directory information could be required to accommodate the largest configuration. In order not to burden smaller systems with unneeded directory memory, the main memory DIMMs contain only enough directory storage for systems with up to 16 nodes (i.e., 32 CPUs). For larger systems, additional directory memory is installed in separate directory DIMMs. Sockets for these DIMMs are provided below the main memory DIMMs, as can be seen in Figure 1-2.

Hub and CrayLink

The final component of the node board is the hub. As described in “Understanding Scalable Shared Memory” on page 6, the hub controls data traffic between the CPUs, memory and I/O. The hub has a direct connection to the main memory on its node board. This connection provides a raw memory bandwidth of 780 MBps to be shared by the two CPUs on the node board. In practice, data cannot be transferred on every bus cycle, so the achievable bandwidth is more like 600 MBps.


Access to memory on other nodes is through a separate connection called the CrayLink interconnect, which attaches to either a router or another hub. The CrayLink network is bidirectional, and has a raw bandwidth of 780 MBps in each direction. The effective bandwidth achieved is about 600 MBps in each direction, because not all information sent between nodes is user data. To request a copy of a cache line from a remote memory, a hub must first send a 16-byte address packet to the hub that manages the remote memory; this specifies which cache line is desired and where it needs to be sent. Then, when the data are returned, along with the 128-byte cache line of user data, another 16-byte address header is sent (so the receiving hub can distinguish this cache line from others it may have requested). Thus, 16 + 128 + 16 = 160 bytes of data are passed through the CrayLink interconnect to transfer the 128-byte cache line—so the effective bandwidth is 780 × (128 ÷ 160) = 624 MBps in each direction.
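
The bandwidth arithmetic above generalizes to any link on which fixed-size packets accompany each cache line. A small C sketch reproducing the numbers quoted in the text (raw 780 MBps, a 128-byte line, and two 16-byte packets per transfer):

#include <stdio.h>

/* Effective bandwidth when each "payload" bytes of user data require
 * "overhead" extra bytes on the link. */
static double effective_bw(double raw_mbps, int payload, int overhead)
{
    return raw_mbps * (double)payload / (double)(payload + overhead);
}

int main(void)
{
    /* CrayLink: 16-byte request packet + 128-byte cache line + 16-byte header */
    printf("effective CrayLink bandwidth: %.0f MBps per direction\n",
           effective_bw(780.0, 128, 16 + 16));
    return 0;
}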

XIO Connection

The hub also controls access to I/O devices through a channel called XIO. This connection has the same bandwidth as the CrayLink connection, but different protocols are used in communicating data. The connection to multiple I/O devices using XIO is described later, under “SN0 Input/Output” on page 16.

Understanding Cache Coherency

Each CPU in an SN0 system has a secondary cache memory of 1 MB or 4 MB. The CPU only fetches and stores data in its cache. When the CPU must refer to memory that is not present in the cache, there is a delay while a copy of the data is fetched from memory into the cache. More details on CPU use of cache memory, and its importance to good performance, is covered in “Cache Architecture” on page 22.

The point here is that there could be as many independent copies of one memory location as there are CPUs in the system. If every CPU refers to the same memory address, every CPU’s cache will receive a copy of that address. (This can occur when running a parallelized loop that refers to a global variable, for example. It is often the case with some kernel data structures.)
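
For example, in the following C sketch each CPU executing its share of the loop reads the same global scale factor, so every CPU’s cache acquires a copy of the cache line that holds it (the parallelization directives are omitted and the names are illustrative):

#define N 1000000

double scale = 2.5;         /* one shared location, read by every CPU */
double a[N], b[N];

/* Each CPU runs this over its own range of i, for example under
 * c$doacross or an equivalent pragma.  Every CPU's cache ends up
 * holding a copy of the line containing "scale"; that is harmless
 * while the value is only read, but a store to "scale" would
 * invalidate all the copies. */
void scale_chunk(int first, int last)
{
    int i;
    for (i = first; i < last; i++)
        a[i] = scale * b[i];
}

int main(void)
{
    scale_chunk(0, N/2);    /* in a real program, each CPU gets a chunk */
    scale_chunk(N/2, N);
    return 0;
}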

So long as all CPUs only examine the data, all is well; each can use its cached copy. But what if one CPU then modifies that data by storing a new value in it? All the other cached copies of the location instantly become invalid. The other CPUs must be prevented from using what is now “stale data.” This is the issue of cache coherency: how to ensure that all caches reflect only the true state of memory.


Coherency Methods

Cache coherence is not the responsibility of software (except for kernel device drivers, which must take explicit steps to keep I/O buffers coherent). For performance to be acceptable, cache coherence must be managed in hardware. Furthermore, the cache coherence hardware is external to the R10000 CPU. (The CPU specifies only what the cache must do, not how it does it. The R10000 is used with several different kinds of secondary cache in different systems.)

In the SN0 architecture, cache coherence is the responsibility of the hub chip.

The cache coherence solution in SN0 is fundamentally different from that used in the earlier systems. Because they use a central bus, CHALLENGE and Onyx systems can use a snoopy cache—one in which each CPU board observes every memory access that moves on the bus. When a CPU observes another CPU modifying memory, the first CPU can automatically invalidate and discard its now-stale copy of the changed data.

The SN0 architecture has no central bus, and there is no efficient way for a CPU in one node to know when a CPU in a different node modifies memory on its node or memory in yet a third node. Any scheme that would broadcast memory accesses to every node would cause coherency information to grow as the square of the number of nodes, defeating scalability.

Instead, SN0 uses a different scheme based on a cache directory, which permits cache coherence overhead to scale slowly. (For a theoretical discussion of directory-based caches, see the book by Lenoski and Weber listed under “Related Documents” on page xxix.)

Understanding Directory-Based Coherency

Briefly, here’s how directory-based coherency works. As suggested by the diagram in Figure 1-3, the memory in any node is organized as an array of cache lines of 128 bytes. Along with the data bits for each cache line is an extra set of bits, one bit per node. (In large systems, those with more than 128 CPUs, a directory bit represents more than one node, but the principle is the same.) In addition, each directory line contains one small integer, the number of a node that owns that cache line of data exclusively.


Figure 1-3 Block Diagram of Memory, Hub, and Cache Directory

Whenever the hub receives a request for a cache line of data—the request can come from a CPU on the same node, from an attached I/O device, or from some other node via the CrayLink—the hub checks that line's directory to make sure it is not owned exclusively. Normally it is not, and the hub fetches the whole 128-byte line and sends it to the requestor.

At the same time, the hub sets the directory bit that corresponds to the requesting node. The requesting node is the current node if the request came from a local CPU or I/O device; when the request comes over the CrayLink, the requesting node is some other node in the system.

For any line of memory, as many bits can be set as there are nodes in the system. Directory bits are set in parallel with the act of fetching memory; there is no time penalty for maintaining the directory. As long as all nodes that use a cache line only read it, all nodes operate in parallel with no delays.


Modifying Shared Data

When a CPU wants to modify a cache line, it must first gain exclusive ownership. The CPU sends an ownership request to the hub that manages that memory. Whenever the hub receives an ownership request—whether the request comes from a CPU on the same node, or from some other node via the CrayLink—the hub examines the directory bits of that cache line. It sends an invalidation message to each node for which a bit is set—that is, to each node that has made a copy of the line. Only when this has been done does the hub set the number of the updating node in the directory for that line as the exclusive owner. Then it responds to the requesting node, and that CPU can complete its store operation.

When a hub receives an invalidation message, it removes that cache line from the cache of either (or both) CPUs on its node. In the event those CPUs attempt to refer to that cache line again, they will have to request a new copy, rather than working on the old data.

Typically, invalidations need to be sent to only a small number of other nodes; thus the coherency traffic only grows proportionally to the number of nodes, and there is sufficient bandwidth to allow the system to scale.

When a CPU no longer needs a cache line (for example, when it wants to reuse the cache space for other data), it notifies its hub. The hub sends a release message to the hub that owns the line (if the line is in a different node). If the releasing CPU held exclusive ownership, its hub also sends back an updated copy of the cache line. The hub that manages the line removes the exclusive access (if it was set), stores the modified data (if it was modified), and clears the directory bit for the releasing node.
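To make this bookkeeping concrete, the following C sketch models the directory state described above: a bit vector of sharing nodes plus an exclusive-owner field for one cache line. It is only an illustration of the protocol logic, under the simplifying assumption of one directory bit per node; the names (directory_entry, read_request, write_request) are invented for this example and do not correspond to any hub hardware register or IRIX interface.

    #include <stdio.h>

    #define MAX_NODES 64            /* one directory bit per node (small systems) */
    #define NO_OWNER  (-1)

    /* Hypothetical model of the directory state for one cache line. */
    typedef struct {
        unsigned long long sharers; /* bit n set: node n holds a cached copy */
        int exclusive_owner;        /* node number, or NO_OWNER              */
    } directory_entry;

    /* A node requests a shared (read-only) copy of the line. */
    static void read_request(directory_entry *d, int node)
    {
        d->sharers |= 1ULL << node; /* record the new sharer; no other work  */
    }

    /* A node requests exclusive ownership before storing into the line. */
    static void write_request(directory_entry *d, int node)
    {
        int n;

        /* Invalidate the copy held by every other sharing node. */
        for (n = 0; n < MAX_NODES; n++)
            if (n != node && (d->sharers & (1ULL << n)))
                printf("  send invalidate to node %d\n", n);

        d->sharers = 1ULL << node;  /* only the writer retains a copy */
        d->exclusive_owner = node;
    }

    int main(void)
    {
        directory_entry line = { 0, NO_OWNER };

        read_request(&line, 0);     /* nodes 0, 3, and 7 read the line...    */
        read_request(&line, 3);
        read_request(&line, 7);

        printf("node 3 requests ownership:\n");
        write_request(&line, 3);    /* ...so nodes 0 and 7 get invalidations */
        return 0;
    }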

Reading Modified Data

When a CPU wants to read a cache line that is exclusively owned, a more complicated sequence occurs. The request for the line comes to the hub that manages that memory, as usual. That hub observes that some other node has exclusive ownership of the line. Presumably, that owner has modified the data in its cache, so the true, current value of the data is there, not in memory.

The managing hub requests a copy of the line from the owning node. That (owner) hub sends the data to the requesting hub. This retrieves the latest data without waiting for it to be written back to memory.


Other Protocols

The SN0 cache coherency design requires other protocols. There is a protocol by which CPUs can exchange exclusive control of a line when both are trying to update it. There are protocols that allow kernel software to invalidate all cached copies of a range of addresses in every node. (The IRIX kernel uses this to invalidate entire pages of virtual memory, so that it can transfer ownership of virtual pages from one node to another.)

Memory Contention

The directory-based cache coherency mechanism uses a lot of dedicated circuitry in the hub to ensure that many CPUs can use the same memory, without race conditions, at high bandwidth. As long as memory reads far exceed memory writes (the normal situation), there is no performance cost for maintaining coherence.

However, when two or more CPUs alternately and repeatedly update the same cache line, performance suffers, because every time either CPU refers to that memory, a copy of the cache line must be obtained from the other CPU. This performance problem is generically referred to as cache contention. Two variations of it are:
• memory contention, in which two (or more) CPUs try to update the same variables
• false sharing, in which the CPUs update distinct variables that only coincidentally occupy the same cache line

Memory contention occurs because of the design of the algorithm; correcting it usually involves an algorithmic change. False sharing is contention that arises by coincidence, not design; it can usually be corrected by modifying data structures.
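As a generic illustration (not an example from this guide), the C fragment below shows the problem in miniature: the four counters in hot[] are only 8 bytes apart, so they share one 128-byte cache line, and parallel updates to them from different CPUs force the line back and forth between caches. Padding each counter out to a full cache line, as in the cool[] version, removes the false sharing.

    #include <stdio.h>

    #define CACHE_LINE 128                  /* SN0 secondary cache line size */
    #define NTHREADS     4

    /* Contended layout: all four counters fit in one 128-byte cache line,
     * so a store by one CPU invalidates the copies held by the others.     */
    static double hot[NTHREADS];

    /* Padded layout: each counter occupies a cache line of its own.        */
    struct padded_counter {
        double value;
        char   pad[CACHE_LINE - sizeof(double)];
    };
    static struct padded_counter cool[NTHREADS];

    /* In a parallel program, each thread would call this with its own id.  */
    static void accumulate(int tid, const double *data, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            hot[tid]        += data[i];    /* false sharing between threads  */
            cool[tid].value += data[i];    /* each thread has a private line */
        }
    }

    int main(void)
    {
        double sample[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        int t;

        for (t = 0; t < NTHREADS; t++)     /* serial stand-in for four threads */
            accumulate(t, sample, 8);
        printf("hot[0] = %g, cool[0] = %g\n", hot[0], cool[0].value);
        return 0;
    }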

Part of performance tuning of parallel programs is recognizing cache coherency contention and eliminating it.

SN0 Input/Output

As mentioned, the hub controls access to I/O devices as well as memory. Each hub has one XIO port. To allow connections to multiple I/O devices, each hub is connected to a Crossbow (XBOW) I/O controller chip. Up to two hubs may be connected to the same XBOW.

The XBOW is a dynamic crossbar switch that allows up to two hubs to connect to six I/O bus attachments in any combination, as shown in Figure 1-4.


Figure 1-4 XIO and XBOW Provide I/O Attachment to a Node

I/O Connections and Bandwidth

Each XIO connection, from hub to XBOW and from XBOW to I/O device, is bidirectional with a raw bandwidth of 780 MBps in each direction. As with the CrayLink interconnect, some of the bandwidth is consumed by control information, so achievable bandwidth is approximately 600 MBps in each direction.

XIO connections to I/O devices are 16 bits wide, although an 8-bit subset may be used for devices for which half the XIO bandwidth is sufficient. Currently, only Silicon Graphics graphics devices utilize the full 16-bit connection. Industry-standard I/O interfaces such as Ethernet, FDDI, ATM, Fibre Channel, HiPPI, SCSI, and PCI operate at more modest bandwidths, so XIO cards supporting these devices employ the 8-bit subset. Each standard interface card first converts the 8-bit XIO subset to the PCI bus using a custom chip called the PCI adapter. Standard hardware is then employed to convert the PCI bus to the desired interface: VME, Fibre Channel, and so on.


One particular interface card, the IO6, must be installed in every Origin2000 and Onyx2 system. This I/O card supplies two SCSI buses to provide the minimal I/O requirements of a system; namely, it allows the connection of a CD-ROM drive (for loading system software) and a system disk. It also contains serial and Ethernet ports to allow attachment of a console and external network.

I/O Access to Memory

When two nodes are connected to a XBOW, control of the six I/O devices is statically partitioned between the two nodes at system boot time. If one node is inoperable, the second can be programmed to take control of all of the devices.

From the standpoint of the XBOW chip, the hub is a portal to and from memory. From the standpoint of the hub, the XBOW is another device that accesses memory, just like a CPU. Software programs the I/O devices to send data from memory out to the device or to read data from the device into memory. In both cases, the device, through the XBOW, requests access to cache lines of memory data.

These requests go to the hub to which the XBOW is attached. If the memory requested is on that node, it is returned. If it is not, the hub passes the request over the CrayLink interconnect to get the data from the node where it resides—exactly as for a CPU's request for memory. In this way, an I/O device attached to any node can read and write memory anywhere in the system.

SN0 Latencies and Bandwidths

We conclude the discussion of the SN0 architecture by summarizing the bandwidths and latencies of different-sized systems. These are shown in Table 1-3.


Table 1-3       Bisection Bandwidth and Memory Latency

                Bisection Bandwidth (MBps)   Router Hops          Read Latency (ns)
Nodes (CPUs)    System        Per CPU        Maximum    Average   Max       Average
1 (2)           n.a.          n.a.           0          0         313       313
2 (4)           620           310            0          0         497       405
4 (8)           620           310            1          .75       601       528
8 (16)          620           310            2          1.63      703       641
16 (32)         620           310            3          2.19      805       710
32 (64)         310           160            5          2.97      1010      796
64 (128)        310           160            6          3.98      1112      903

The meanings of the columns of Table 1-3 are as follows:
1. The number of nodes and CPUs in the system. Larger SN0 systems can be configured in a variety of ways; this table assumes that the optimum arrangement of nodes and routers has been used.
2. The bisection bandwidth is the total amount of data that can flow in the system concurrently, in megabytes per second of user data. The term "bisection" comes from imagining that the system is bisected by a plane, and measuring the data that crosses that plane. Bisection bandwidth per CPU is the portion of the total bandwidth that one CPU could use in principle.
3. Router hops is the number of routers a request for memory data must pass through. The maximum column shows the longest path; the average column is the average over all node-to-node paths. For example, in an 8-CPU system (with a star router), each node can access its own memory with zero router hops, but it takes one router hop to get to any of the three remote memories, so the average over all paths is (0 + 1 + 1 + 1) ÷ 4 = 0.75.


4. Read latency is the time to access the first word of a cache line read from memory, in nanoseconds. For local memory, the latency is 313 nanoseconds. For a hub-to-hub direct connection (that is, a two-node configuration), the maximum latency is 497 nsec. For larger configurations, the maximum latency grows by approximately 100 nsec for each router hop. Average latency is calculated by averaging the latencies to all local and remote memories. Remarkably, even for the largest configuration, 128 CPUs, the average latency is no worse than that of a POWER CHALLENGE system, demonstrating that shared memory can be made scalable without sacrificing performance. A rough model based on these figures is sketched below.
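The growth of latency with distance can be reduced to a back-of-the-envelope model. The sketch below is not from this guide; it simply encodes approximate figures taken from Table 1-3 (313 ns for local memory, 497 ns for a directly connected remote node, and roughly 100 ns per additional router hop) so you can estimate the cost of a secondary cache miss at a given distance.

    #include <stdio.h>

    /*
     * Rough read-latency model derived from Table 1-3.  These constants
     * are approximations read off the table, not hardware specifications.
     */
    #define LOCAL_NS        313.0   /* miss satisfied from local memory        */
    #define REMOTE_BASE_NS  497.0   /* remote node, direct hub-to-hub (0 hops) */
    #define PER_HOP_NS      100.0   /* additional cost per router hop          */

    static double read_latency_ns(int remote, int router_hops)
    {
        if (!remote)
            return LOCAL_NS;
        return REMOTE_BASE_NS + PER_HOP_NS * router_hops;
    }

    int main(void)
    {
        int hops;

        printf("local miss:            %6.0f ns\n", read_latency_ns(0, 0));
        for (hops = 0; hops <= 6; hops++)
            printf("remote miss, %d hops:   %6.0f ns\n",
                   hops, read_latency_ns(1, hops));
        return 0;
    }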

Understanding MIPS R10000 Architecture

This section describes the features of the MIPS R10000 CPU that are important for performance tuning.

Superscalar CPU Features

The MIPS R10000, designed to solve many of the performance bottlenecks common to earlier microprocessors, is the CPU used in SN0 systems. SN0 systems have shipped with CPUs running at various clock rates from 195 MHz to 235 MHz.

The R10000 is a four-way superscalar RISC CPU. "Four-way" means that it can fetch and decode four instructions per clock cycle. "Superscalar" means that it has enough independent, pipelined execution units that it can complete more than one instruction per clock cycle. The R10000 contains:
• A nonblocking load-store unit that manages memory access.
• Two 64-bit integer Arithmetic/Logic Units (ALUs) for address computation and for arithmetic and logical operations on integers.
• A pipelined floating-point adder for 32- and 64-bit operands.
• A pipelined floating-point multiplier for 32- and 64-bit operands.


The two integer ALUs are not identical. Although both perform add, subtract, and logical operations, one can handle shifts and conditional branch and conditional move instructions, while the other can execute integer multiplies and divides. Similarly, instructions are partitioned between the floating point units. The floating-point adder is responsible for add, subtract, absolute value, negate, round, truncate, ceiling, floor, conversions, and compare operations. The floating-point multiplier carries out multiplication, division, reciprocal, square root, reciprocal square root, and conditional move instructions.

The two floating-point units can be chained together to perform multiply-then-add and multiply-then-subtract operations, each of which is a single instruction. These combined operations, known by their operation codes madd and msub, are designed to speed the execution of code that evaluates polynomials.
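For example, a polynomial evaluated by Horner's rule is a chain of multiply-then-add steps, and each step can map onto a single madd when the compiler generates MIPS IV code. The routine below is a generic illustration written for this discussion, not code taken from this guide.

    #include <stdio.h>

    /* Evaluate c[0] + c[1]*x + ... + c[n]*x^n by Horner's rule.  Each loop
     * iteration is one multiply-then-add, which a MIPS IV compiler can emit
     * as a single madd instruction.                                         */
    static double poly(const double *c, int n, double x)
    {
        double p = c[n];
        int i;

        for (i = n - 1; i >= 0; i--)
            p = p * x + c[i];
        return p;
    }

    int main(void)
    {
        static const double c[4] = { 1.0, -2.0, 0.5, 3.0 };

        printf("p(2.0) = %g\n", poly(c, 3, 2.0));   /* 1 - 4 + 2 + 24 = 23 */
        return 0;
    }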

The R10000 can complete most single instructions in one or two clock cycles. However, integer multiply and divide, and floating-point divide and square-root instructions, can occupy one of the ALUs for 20 to 35 clock cycles. Because a number of instructions are in the CPU at once, being decoded and executed in parallel, the R10000 routinely averages more than one completed instruction per clock cycle, and can reach as many as two instructions per clock cycle in some highly tuned loops.

MIPS IV Instruction Set Architecture

The R10000 implements the MIPS IV instruction set architecture (ISA). This is the same ISA supported by the MIPS R8000, the CPU used in the Silicon Graphics POWER CHALLENGE series, so programs compiled for the older systems are binary-compatible with newer ones. The MIPS IV ISA is a superset of the previous MIPS I, II, and III ISAs that were used in several preceding generations of Silicon Graphics workstations and servers, so programs compiled for those systems are also binary-compatible with the R10000.

The MIPS IV ISA augmented the previous instruction sets with:
• Floating point multiply-add (madd and msub) instructions, and reciprocal and reciprocal square root instructions
• Indexed loads and stores, which allow coding more efficient loops
• Prefetch instructions, which allow the program to request fetching of a cache line in advance of its use, so memory-fetch can be overlapped with execution
• Conditional move instructions, which can replace branches inside loops, thus allowing superscalar code to be generated for those loops


This book assumes your programs are written in a high-level language, so you will not be coding machine language yourself. However, the C and Fortran compilers, when told to use the MIPS IV ISA, do take advantage of all these instructions to generate faster code. For this reason it can be worthwhile to recompile an older program.

For more details on the MIPS IV ISA, see the mips4(5) reference page, and the MIPS R10000 Microprocessor User Guide listed under “Related Manuals” on page xxix.

Cache Architecture

The R10000 uses a two-level cache hierarchy: a level-1 (L1) cache internal to the CPU chip, and a level-2 (L2) cache external to it. Both L1 and L2 data caches are nonblocking. That is, the CPU does not stall on a cache miss—it suspends the instruction and works on other instructions. As many as four outstanding cache misses from the combined two levels of cache are supported.

Level-1 Cache

Located on the CPU chip are:
• A 32 KB, two-way set associative instruction cache, managed as an array of 512 64-byte cache lines.
• A 32 KB, two-way set associative, two-way interleaved data cache, managed as an array of 1024 32-byte cache lines.

These two on-chip caches are collectively called the L1 cache. Whenever the CPU needs an instruction, it looks first in the L1 instruction cache; whenever it needs a data operand, it looks first in the L1 data cache. When the data is found there, no memory request is made external to the CPU. The ultimate in speed is achieved when the entire working set of a loop fits in the 64 KB L1 cache. Unfortunately, few time-consuming programs fit in so little memory.

The R10000 CPU can keep exact counts of L1 cache hits and misses. You can take a profile of your program to measure its L1 cache behavior.


Level-Two Cache

When the CPU fails to find memory in the L1 cache, it looks next in an off-chip cache consisting of chips on the node board. The L2 cache is a two-way set associative, unified (instructions and data) cache, usually 4 MB in current SN0 systems. The R10000 CPU permits the line size of the L2 cache to be either 64 B or 128 B, but all SN0 systems use 128-byte cache lines. Both the L1 data cache and the L2 unified cache employ a least recently used (LRU) replacement policy for selecting in which set of the cache to place a new cache line.
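To illustrate how these geometries carve up an address, the sketch below computes which set of each cache an address falls in; the number of sets is the cache size divided by the line size times the associativity. This is only arithmetic derived from the sizes just quoted, and it ignores virtual-to-physical translation and the data cache's two-way interleaving, so it is not an exact description of the R10000's indexing hardware.

    #include <stdio.h>

    /* Cache geometries quoted in the text. */
    #define L1_SIZE   (32 * 1024)           /* L1 data cache: 32 KB          */
    #define L1_LINE   32                    /* 32-byte lines                 */
    #define L2_SIZE   (4 * 1024 * 1024)     /* L2 unified cache: 4 MB        */
    #define L2_LINE   128                   /* 128-byte lines                */
    #define WAYS      2                     /* both caches are two-way       */

    #define L1_SETS   (L1_SIZE / (L1_LINE * WAYS))  /* 512 sets              */
    #define L2_SETS   (L2_SIZE / (L2_LINE * WAYS))  /* 16384 sets            */

    int main(void)
    {
        unsigned long addr;

        /* Addresses 32 KB apart land in the same L1 set; with only two ways,
         * three such lines cannot all be resident in the L1 cache at once.  */
        for (addr = 0; addr <= 3UL * L1_SIZE; addr += L1_SIZE)
            printf("address 0x%08lx -> L1 set %3lu, L2 set %5lu\n",
                   addr,
                   (addr / L1_LINE) % L1_SETS,
                   (addr / L2_LINE) % L2_SETS);
        return 0;
    }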

The L2 cache can be run at various clock rates, ranging from the same as the CPU down to one-third of that frequency. In current SN0 systems it operates at two-thirds of the CPU frequency.

The R10000 CPU is rare among CPU designs in supporting a set associative off-chip cache. To provide a cost-effective implementation, however, only enough chip terminals are provided to check for a cache hit in one set of the secondary cache at a time. To allow for two-way functionality, an 8,192-entry way prediction table is used to record which set of a particular cache address was most recently used. This set is checked first to determine whether there is a cache hit. This takes one cycle and is performed concurrently with the transfer of the first half of the set’s cache line. The other set is checked on the next cycle while the second half of the cache line for the first set is being transferred. If there is a hit in the first set, no extra cache access time is incurred. If the hit occurs in the second set, its data must be read from the cache and a minimum four-cycle mispredict penalty is incurred. The net effect is that the time to access the L2 cache is variable, although always much faster than main-memory access.

Cache miss latency to the second-level cache—that is, the time that an instruction is blocked waiting for data from the L2 cache—is from 8 to 10 CPU clock cycles, when the data is found in the L2 cache and the way is correctly predicted.

The R10000 CPU can keep exact counts of L2 cache hits and misses. You can take a profile of your program to measure its L2 cache behavior.


Out-of-Order and Speculative Execution

The chief constraint on CPU speed is that memory is so much slower than the CPU: a superscalar CPU running at 200 MHz can complete more than one instruction every five nanoseconds. Compare this to the memory latencies shown in Table 1-3 on page 19. Potentially the CPU could spend the majority of its time waiting for memory fetches. The R10000 CPU uses out-of-order execution and speculative execution, in conjunction with its nonblocking caches, to try to hide this latency.

Executing Out of Order

From the programmer’s point of view, instructions must be executed in predictable, sequential order; when the output of an instruction is written into its destination register, it is immediately available for use in subsequent instructions. Pipelined superscalar CPUs, however, execute several instructions concurrently. The result of a particular instruction may not be available for several cycles. Often, the next instruction in program sequence must wait for its operands to become available, while instructions that come later in the sequence are ready to be executed.

The R10000 CPU dynamically executes instructions as their operands become available—out of order if necessary. This out-of-order execution is invisible to the programmer. What might make out-of-order execution dangerously visible is the possibility of an exception, such as a page fault or a divide-by-zero. Suppose some instructions have been completed out of order when another instruction, which nominally preceded them in the program code, causes an exception. The program state would be ambiguous.

The R10000 CPU avoids this danger by making sure that any result that is generated out of order is temporary until all previous instructions have been successfully completed. If an exception is generated by a preceding instruction, the temporary results can be undone and the CPU state returned to what it would have been, had the instructions been executed sequentially. When no exceptions occur, an instruction executed out of order is “graduated” once all previous instructions have completed; then its result is added to the visible state of the CPU.

Queued and Active Instructions

Up to 32 instructions may be active in the CPU at a given time, enough to hide at least eight cycles of latency. These instructions are tracked in an active list. Instructions are added to the active list when they are decoded and are removed upon graduation.


Queues are used to select which instructions to issue dynamically to the execution units. Three separate 16-entry queues are maintained: an integer queue for ALU instructions, a floating-point queue for FPU instructions, and an address queue for load and store instructions. Instructions are placed on the queues when they are decoded. Integer and floating-point instructions are removed from their respective queues once the instructions have been issued to the appropriate execution unit. Load and store instructions remain in the address queue until they graduate, because these instructions can fail and may need to be reissued; for example, a data cache miss requires the instruction to be reissued.

Speculative Execution

When the CPU encounters a conditional branch, the input operand of the branch may not be ready, in which case the CPU does not know which instruction should follow the branch. Older pipelined CPUs had no choice but to stall when they encountered a branch, and wait until the branch operand was available before continuing.

The R10000 CPU instead makes a guess: it predicts which way the branch will likely go, and continues executing instructions speculatively along that path. That is, it predicts whether or not the branch will be taken, and fetches and executes subsequent instructions accordingly. If the prediction is correct, execution proceeds without interruption. If the prediction is wrong, however, any instructions executed down the incorrect program path must be canceled, and the CPU must be returned to the state it was in when it encountered the mispredicted branch.

The R10000 may speculate on the direction of branches nested four deep. To predict whether or not to take a branch, it uses a 2-bit prediction algorithm based on a 512-entry branch history table. Simulations have shown the algorithm to be correct 87% of the time on the SPECint92 suite.
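Branches whose outcome depends on the data, as in the generic loop below, are the hardest to predict. When such code is compiled for MIPS IV, the compiler can often replace the branch with a conditional move instruction (see "MIPS IV Instruction Set Architecture" earlier in this chapter), leaving nothing to mispredict. This example was written for this discussion and is not taken from the guide.

    #include <stdio.h>

    /* Track the largest element of v.  Whether the test succeeds depends on
     * the data, so a conditional branch here is hard to predict; a MIPS IV
     * compiler can implement the update with a conditional move instead.    */
    static double vector_max(const double *v, int n)
    {
        double m = v[0];
        int i;

        for (i = 1; i < n; i++)
            if (v[i] > m)
                m = v[i];
        return m;
    }

    int main(void)
    {
        double v[6] = { 3.0, -1.0, 7.5, 2.0, 7.0, 0.5 };

        printf("max = %g\n", vector_max(v, 6));   /* prints 7.5 */
        return 0;
    }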

The R10000 CPU can keep exact counts of branch prediction hits and misses. You can take a profile of your program to measure how many branches it executed and what percentage were mispredicted.


Summary

The SN0 architecture and the MIPS R10000 CPU together make up a very sophisticated computing environment:
• At the micro level of a single instruction stream, the CPU queues as many as 32 instructions, executing them out of order as their operands become available from memory, and speculatively predicting the direction that branches will take. It also manages two levels of cache memory.
• At the memory level, the system caches memory at multiple levels: the on-chip L1 cache; the off-chip L2 cache of 4 MB; the local node's main memory of 64 MB or more (usually much more); and the total memory of every node in the system—up to 63 other nodes, all accessible in varying amounts of time.
• At the node level, the hub chip directs the flow of multiple 780 MBps streams of data passing between memory, two L2 caches, an XBOW chip, and a CrayLink connection that may carry requests to and from 63 other hubs.
• At the system level, multiple I/O buses of different protocols—PCI, VME, SCSI, HIPPI, Fibre Channel, and others—can stream data through their XBOWs to and from the memory resident in any node on the system, all concurrent with program data flowing from node to node as requested by the IRIX processes running on the CPUs in each node.

It is a major triumph of software design that all these facilities can be presented to you, the programmer, under a relatively simple programming model: one program in C or Fortran, with multiple, parallel threads executing the same code. The next chapter examines this programming model.

2. SN0 Memory Management

Programming an SN0 system is little different from programming a conventional shared memory multiprocessor such as the Silicon Graphics POWER CHALLENGE R10000. This is, of course, largely because the hardware makes the physically distributed memory work as a shared memory. It is also, however, a result of the IRIX 6.5 operating system, which contains important new capabilities for managing memory and for running the system as efficiently as possible. Although most of these added capabilities are transparent to the user, some new tools are available for fine-tuning performance, and there is some new terminology to go along with them.

This section describes some of the memory-management features of IRIX 6.5, so you can understand what the operating system is trying to accomplish. It familiarizes you with the terminology of memory management, because these terms are used in documenting the compilers and performance tools.

Dealing With Nonuniform Access Time

An SN0 system is a shared memory multiprocessor. It runs the IRIX operating system, which provides a familiar multiuser time-sharing environment as well as familiar compilers, tools, and programming conventions and methods. Even though I/O devices are physically attached to different nodes, any device can be accessed from any program just as you would expect in a UNIX based system. Your old codes run on SN0!

The hardware, however, is different. As described in Chapter 1, “Understanding SN0 Architecture,” there is no longer a central bus into which CPUs, memory, and I/O devices are installed. Instead, memory and peripherals are distributed among the CPUs, and the CPUs are connected to one another via an interconnecting fabric of hub and router chips. This allows unprecedented scalability for a shared memory system. A side effect, however, is that the time it takes for a CPU to access a memory location varies according to the location of the memory relative to the CPU—according to how many hubs and routers the data must pass through. This is Non-uniform Memory Access (NUMA). (For an indication of the access times, see the Read Latency—Max column in Table 1-3 on page 19.)


Although the hardware has been designed so that overall memory access times are greatly reduced compared to earlier systems, and although the variation in times is small, memory access times are, nevertheless, nonuniform, and this introduces a new set of problems to the task of program optimization. These facts are clear:
• A program will run fastest when its data is in the memory closest to the CPU that is executing the code.
• When it is not possible to have all of a program's data physically adjacent to the CPU—for example, if the program uses an array so large it cannot all be allocated in one node—then it is important to find the data that is most frequently used by a CPU, and bring that into nearby memory.
• When a program is executing in parallel on multiple CPUs, the data used by each parallel thread of the algorithm should be close to the CPU that processes the data—even when the parallel threads are processing parallel slices through the same array!

IRIX has been enhanced to take NUMA into account and to make program optimization not only possible, but automatic. The remaining topics of this chapter summarize the NUMA support in IRIX. As a programmer, you are not required to deal with these features directly, but you have the opportunity to take advantage of them using tuning tools described in subsequent chapters.

It is important to remember that NUMA affects performance, first, only in programs that incur a lot of secondary cache misses. (Programs that take data mostly out of cache make few references to memory in any location.) Second, these effects apply to multithreaded programs, and to single-threaded programs with large memory requirements. (Single-threaded programs using up to a few tens of megabytes usually run entirely in one node anyway.) Third, NUMA effects are minimal except in larger system configurations. You must have more than 8 CPUs before it is possible to have a path with more than one router in it.

The concepts and terminology are presented anyway, so that you can understand what the operating system is trying to accomplish. In addition, it will make you aware of the programming practices that lead to the most scalable code. (And besides, the technology is fascinating!)


IRIX Memory Locality Management

IRIX 6.5 manages an SN0 system to create a single-system image with the many advanced features required by a modern distributed shared memory operating system. It combines the virtues of the data center environment and the distributed workstation environment, including the following:
• Scalability from 2 to 128 CPUs and from 64 MB to 1 TB of virtual memory
• Upward compatibility with previous versions of IRIX
• Availability features such as IRIS Failsafe, RAID, integrated checkpoint-restart
• Support for interactive and batch scheduling and system partitioning
• Support for parallel-execution languages and tools

Strategies for Memory Locality

The NUMA dependency of memory latency on memory location leads to a new resource for the operating system to manage: memory locality. Ideally, each memory access should be satisfied from memory on the same node board as the CPU making the access. This is not always possible; some processes may require more memory than fits on one node board, and different threads in a parallel program, running on separate nodes, may need to access the same memory location. Nevertheless, a high degree of memory locality can be achieved, and the operating system works to maximize memory locality.

For the vast majority of cases, executing an application in the default environment yields a large fraction of the achievable performance. IRIX manages memory locality through a variety of mechanisms, described in the following topics.

Topology-aware Memory Allocation

The operating system always attempts to allocate the memory a process uses from the same node on which the process runs. If there is not sufficient memory on the node to allow this, the remainder of the memory is allocated from nodes close to the process.

The usual effect of the default allocation policies is to put data close to the CPU that uses it. For some parallel programs it does not work out this way, but the programmer can modify the default policies to suit the program's needs.


Dynamic Page Migration

IRIX can move data from one node to another that is making heavier use of it. IRIX manages memory in virtual pages with a size of 16 KB or larger. (The programmer can control the page size, and in fact can specify different page sizes for data and for code.)

IRIX maintains counters that show which nodes are making the heaviest use of each virtual page. When it discovers that most of the references to a page are coming from a different node, IRIX can move the page to be closer to its primary user. This changes the physical address of the page, but the program uses the virtual address. IRIX copies the data and adjusts the page-translation tables. The movement is transparent to programs. This facility is also called page migration.

The programmer can turn page migration on or off for a given program, or turn it on dynamically during specific periods of program execution. These options are discussed in Chapter 8, “Tuning for Parallel Processing.”

Replication of Read-Only Pages

IRIX makes extra copies of data pages that are read by multiple nodes—the most important example is the C runtime library code—and spreads the copies throughout the system. It sets the virtual address translation tables for each process to address the copy that is closest to that process. Page replication has several benefits: it reduces the time to access such data, it ensures that the contents of heavily-used pages are not served entirely from a single hub chip, and it reduces the traffic that the interconnecting fabric must carry.

Placing Processes Near Memory

When selecting a CPU on which to initiate a job, the operating system tries to choose a node that, along with its neighbors, provides an ample supply of memory. Thus, should the process’s memory requirements grow over time, the additional memory will be allocated from nearby nodes, keeping the access latency low.


Memory Affinity Scheduling

The IRIX scheduler, in juggling the demands of a multiuser environment, makes every attempt to keep each process running on CPUs that are close to the memories in which the process’s data resides.

Support for Tuning Options

The IRIX strategies are designed for the “typical” program’s needs, and they can be less than optimal for specific programs. Application programmers can influence IRIX memory placement in several ways: through compiler directives; through use of the dplace tool (see the dplace(1) reference page, and see Chapter 8, “Tuning for Parallel Processing”); and by runtime calls to special library routines.

Memory Locality Management

The operating system supports memory locality management through a set of low-level system calls. These are not of interest to the application programmer because the capabilities needed to fine-tune performance are available in a high-level tool, dplace, and through compiler directives. But a couple of concepts that the system calls rely on are described because terminology derived from them is used by the high-level tools. These concepts are memory locality domains (MLDs) and policy modules (PMs).

Every user of an SN0 system implicitly makes use of MLDs and policy modules because the operating system uses them to maximize memory locality. Their use is completely transparent to the user, and they do not need to be understood to use an SN0 system. But for those of you interested in fine-tuning application performance—particularly of large parallel jobs—it can be useful to know that MLDs and policy modules exist, and which types of policies are supported and what they do.

Memory Locality Domain Use

To understand the issues involved in memory locality management, consider the scenario diagrammed in Figure 2-1.


Figure 2-1 Program Address Space Versus SN0 Architecture

Diagrammed on the left in Figure 2-1 is the programmer’s view of a shared memory application, consisting of a single virtual address space and four parallel processes. On the right is the architecture of a 32-CPU (16-node) SN0 system. The four application processes can run in any four of the CPUs. The pages that compose the program’s address space can be distributed across any combination of one to sixteen nodes. Out of the myriad possible arrangements, how should IRIX locate process execution and data?

Assume that this application exhibits a relatively simple (and typical) pattern of memory use: each process directs 5% of its cache misses to memory shared with one other process, 5% to memory shared with a third process, and the remaining 90% to a section of memory that is almost entirely unshared. The pattern is best seen graphically, as in Figure 2-2.


Figure 2-2 Parallel Process Memory Access Pattern Versus SN0 Architecture

If IRIX paid little attention to memory locality, the program could end up in the situation shown in Figure 2-3: two processes and half the memory in one corner of the machine, the other processes and memory running in an opposite corner.


Figure 2-3 Parallel Processes at Opposite Corners of SN0 System

The result is acceptable. The first and fourth processes, and 95% of the memory use of the second and third processes, run at local speed. Only when the second and third processes access the data they share, and especially when they update these locations, will data fly back and forth through several routers. The SN0 hardware has been designed to keep the variation in memory latencies small, and accesses to the shared section of memory account for only 5% of two of the processes’ cache misses, so this suboptimal placement has a small effect on the performance of the program.

There are situations in which performance could be significantly affected. If absolutely no attention was paid to memory locality, the processes and memory could end up as shown in Figure 2-4.


Figure 2-4 Parallel Processes and Memory at Diabolically Bad Locations

Here, each process runs on a different and distant CPU, and the sections of memory it uses are allocated on a different set of distant nodes. In this case, even the accesses to unshared sections of memory—which account for 90% of each process's cache misses—are nonlocal, increasing the cost of accessing memory. In addition, program performance can vary from run to run, depending on how close each process ends up to its most-accessed memory. (It is worth mentioning that, even in this least-optimal arrangement, the program would run correctly. The poor assignment of nodes would not make the program fail or produce wrong answers; it would only slow it down and create needless contention for the CrayLink fabric.)

However, the memory locality management mechanisms in IRIX are designed to avoid such situations. Ideally, the processes and memory used by this application are placed in the machine shown in Figure 2-5.


Figure 2-5 Parallel Program Ideally Placed in SN0 System

• The first two processes run on the two CPUs in one node.
• The other two processes run on the CPUs in an adjacent node, one hop away.
• The memory for each pair of processes is allocated in the same node.

To allow IRIX to achieve this ideal placement, two abstractions of physical memory nodes are used:
• Memory locality domains (MLDs)
• Memory locality domain sets (MLDSETs)

A memory locality domain is a source of physical memory. It can represent one node, if there is sufficient memory available for the process(es) that run there, or it can stand for several nodes within a given radius of a center node. For the example application, the operating system creates two MLDs, one for each pair of processes, as diagrammed in Figure 2-6.


Figure 2-6 Parallel Program Mapped to a Pair of MLDs

It is up to the operating system to decide where in the machine these two MLDs should be placed. Optimal performance requires that they be placed on adjacent nodes, so the operating system needs some additional information.

Memory locality domain sets describe how a program’s MLDs should be placed within the machine, and whether they need to be located near any particular hardware devices (for example, close to a graphics pipe). The first property is known as the topology, and the second as resource affinity.

Several topology choices are available. The default is to let the operating system place the MLDs of the set on a cluster of physical nodes that is as compact as possible. Other topologies allow MLDs to be placed in hypercube configurations (which are proper subsets of the SN0 interconnection topology), or on specific physical nodes. Figure 2-7 shows the MLDs for the example application placed in a one-dimensional hypercube topology with resource affinity for a graphics device.


Figure 2-7 Parallel Program Mapped to an MLD Set with Hypercube Topology and Affinity to a Graphics Device

Policy Modules

With the MLDs and MLD set defined, the operating system is almost ready to attach the program's processes to the MLDs, but first policy modules need to be created for the MLDs. Policy modules tell the operating system the following:
• How to place pages of memory in the MLDs.
• Which page size(s) to use.
• What fallback policies to use if the resource limitations prevent the preferred placement and page size choices from being carried out.
• Whether page migration is enabled.
• Whether replication of read-only text is enabled.

The operating system uses a set of default policies unless instructed otherwise. You can change the defaults through the utility dplace or via compiler directives. Once the desired policies have been set, the operating system can map processes to MLDs and MLDs to hardware, as shown in Figure 2-8. This ensures that the application threads execute on the nodes from which the memory is allocated.


Figure 2-8 Parallel Program Mapped through MLDs to Hardware

Memory Placement for Single-Threaded Programs

The initial placement of data is important for consistently achieving high performance on parallelized applications. It is not an issue for single-threaded programs because they have only one MLD from which to allocate memory.

There is, however, one difference you may see when running a large, single-threaded program on SN0 compared to a bus-based system. When the program has modest memory requirements, it is likely to succeed in allocating all its memory from the node on which it runs. Then all cache misses incur local memory latencies. However, as the program’s data requirements grow, it may need to draw memory from nearby nodes. As a result, some cache misses have longer latencies.

Thus the effective time for a cache miss can change, either because the program uses more memory (due either to a change in the algorithm or to larger input data), or because other jobs consume more of the memory on the node. Note that if the program is allocating significantly more memory, it is likely to run longer in any case, because it is doing more work. Any NUMA variation may be unnoticeable beside the algorithmic difference.

Data Placement Policies

Although it may be obvious to you, the application programmer, what the program's memory access patterns are, there is no way for the operating system to know this. The placement policies determine how the memory for virtual pages is allocated from the different MLDs in the MLD set. Three memory placement policies are available:
1. First-touch, in which memory is allocated from the MLD associated with the first process to access the page (in other words, the process that first faults that page).
2. Fixed, in which memory is allocated from a specific MLD.
3. Round-robin, in which pages of memory are allocated in round-robin fashion from each of the MLDs in the MLD set in turn.

Using First-Touch Placement

The default policy is first-touch. It works well with single-threaded programs, because it keeps memory close to the program’s one process. It also works well for programs that have been parallelized completely, so that each parallel thread allocates and initializes the memory it uses.

When the example program above (Figure 2-8) is running, each CPU accesses three segments of data: a 90% segment accessed by no other node and two 5% pieces, each of which is also accessed by a neighboring CPU. In the ideal data layout, the 90% piece for each CPU is stored in its local memory. In addition, the two adjacent 5% pieces are stored locally, while the two remaining 5% pieces are stored in the memories of the neighboring nodes. This distributes responsibility for storing the data equally among all CPUs and ensures that they all incur the same cost for accessing nonlocal memory.

If the initialization of memory is done in parallel in each process, each CPU is the first to touch its 90% piece and the two adjacent 5% pieces, causing those segments to be allocated locally. This is exactly where we want these segments to reside. When you program memory initialization in each program thread independently, the first-touch policy guarantees good data placement. The common alternative, to initialize all of memory from a single master process before starting the parallel processes, has just the wrong effect, placing all memory into a single node.
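The following C sketch shows the idea. The helper init_my_slice is invented for this example; each parallel thread calls it with its own thread id so that the pages backing that thread's slice of the array are first touched, and therefore allocated, on the node where the thread runs. The mechanism that creates the threads (MP directives, sproc, pthreads) is whatever your program already uses and is not shown; parallel programming is covered in Chapter 8.

    #include <stdio.h>
    #include <stdlib.h>

    /* Each parallel thread initializes only its own slice of a[], so under
     * the first-touch policy those pages are allocated on the thread's node.
     * Invented example code, not an IRIX or MP-library interface.            */
    static void init_my_slice(double *a, size_t n, int tid, int nthreads)
    {
        size_t chunk = (n + nthreads - 1) / nthreads;
        size_t lo = (size_t)tid * chunk;
        size_t hi = (lo + chunk < n) ? lo + chunk : n;
        size_t i;

        for (i = lo; i < hi; i++)
            a[i] = 0.0;
    }

    int main(void)
    {
        size_t n = 1 << 20;                 /* 1M doubles = 8 MB             */
        double *a = malloc(n * sizeof(double));
        int tid, nthreads = 4;

        if (a == NULL)
            return 1;
        /* Serial stand-in: in the real program, each of these calls would be
         * made by a different parallel thread running on its own CPU.        */
        for (tid = 0; tid < nthreads; tid++)
            init_my_slice(a, n, tid, nthreads);

        printf("initialized %lu elements\n", (unsigned long)n);
        free(a);
        return 0;
    }

Initializing the whole array from the master thread alone, by contrast, first-touches every page from a single node, which is the anti-pattern described above.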


Although this extra consideration is not required in a bus-based system such as POWER CHALLENGE, it is simple to do and does not require you to learn anything new, such as the compiler directives, which also allow data to be placed optimally.

If a program has not been completely parallelized, the first-touch policy may not be the best one to use. For the example application, if all the data are initialized by just one of the four processes, all the data will be allocated from a single MLD, rather than being evenly spread out among the four MLDs. This introduces two problems:
• Accesses that should be local are now remote.
• All the processes' cache misses are satisfied from the same memory. This can cause a performance bottleneck by overloading the hub in that node.

Even in this case, if dynamic page migration is enabled, the data will eventually move to where it is used, so ultimately pages should be placed correctly. This works well for applications in which there is one optimal data placement, and the application runs long enough (minutes) for the data to migrate to their optimal locations.

Using Round-Robin Placement

In some complicated applications, different placements of data are needed in different phases of the program, and the program phases alternate so quickly that there is not time for the operating system to migrate the data to the best location for one program phase before that phase ends and a new one begins.

For such applications, a difficult solution is to perform explicit page migration using programmed compiler directives. A simple solution is to use the round-robin placement policy. Under this policy, memory is evenly distributed, by pages, among the MLDs. Any one page is not likely to be in an optimal location at a particular phase of the program; however, by spreading memory accesses across all the nodes where the program runs, you avoid creating performance bottlenecks.

Avoiding bottlenecks is actually a more important consideration than getting the lowest latency. The variation in memory latency is moderate, but if all the data are stored in one node, the outgoing bandwidth of that one hub chip is divided among all the other CPUs that use the data.

Using Fixed Placement

The final placement policy is fixed placement, which places pages of memory in a specific MLD. You invoke this policy using the compiler placement directives. These are a convenient way to specify the optimal placement of data, when you are sure you know what it is. You can specify different placements in different parts of a program, so these directives are ideal for complicated applications that need different data placements during different processing phases.

Achieving Good Performance in a NUMA System

The memory locality management automatically performed by IRIX means that most SN0 programs can achieve good performance and enhanced scalability without having to program any differently from the way they do on any other system. But some programs that were developed on a uniform memory access architecture might run more slowly than they could. This section summarizes when you can expect to see the effects of the NUMA architecture and what tools you can use to minimize them.

Single-Threaded Programs under NUMA

Even though SN0 is a highly parallel system, the vast majority of programs it runs use only a single CPU. Nothing new or different needs to be done to achieve good performance on such applications. Tune them just as you would for any other system. Chapter 3, “Tuning for a Single Process,” provides a detailed discussion of the tools available and steps you can take to improve the performance of single-threaded programs.

There is one new capability in the SN0 architecture that single-CPU programs can sometimes use to improve performance, namely, support for multiple page sizes. Recall that page size is one of the policies that IRIX 6.5 can apply in allocating memory for processes. Normally, the default page size of 16 KB is used, but for programs that incur a performance penalty from a large number of virtual page faults, it can be beneficial to use a larger page size. For single-CPU programs, page size is controlled via dplace and explained under “Using Larger Page Sizes to Reduce TLB Misses” on page 147.

Parallel Programs under NUMA

Single-CPU tuning accounts for most of the performance work you do. Don't get lazy here: using the proper compiler flags and coding practices can yield big improvements in program performance. Once you have made your best effort at single-CPU tuning, you can turn your attention to issues related to parallel programming. Parallel programs fall into two classes:


• MP library programs. These use the shared memory parallel programming directives that have been available in Silicon Graphics compilers for many years. This class also includes programs using MPI version 3 or later, as distributed with IRIX.
• Non-MP library programs. These include programs using the IRIX facilities for starting parallel processes, programs using the pthreads model, programs using the PVM library, and programs using the Cray shared-memory models.

For programs in the second class, it is a good idea to use dplace to specify a memory placement policy. This is covered in Chapter 8, “Tuning for Parallel Processing.”

Programs in the first class that make good use of the CPU’s caches see little effect from NUMA architecture. The reason for this is that memory performance is affected by memory latencies only if the program spends a noticeable amount of time actually accessing memory, as opposed to cache. Cache-friendly programs have very few cache misses, so the vast majority of their data accesses are satisfied by the caches, not memory. Thus the only NUMA effect these programs see is scalability to a larger number of CPUs.

The perfex and Speedshop tools can be used to determine if a program is cache-friendly and, if not, where the problems are. You can find a detailed discussion in Chapter 3, “Tuning for a Single Process.”

Not all programs can be made cache friendly, however. That only means these memory-intensive programs will spend more time accessing memory than their cache-friendly brethren, and in general will run at lower performance. This is just one of the facts of life for cache-based systems, which is to say, all computer systems except very expensive vector CPUs.

When there is a performance or scalability problem, the data may not have been well placed, and modifying the placement policies can fix this. First, try a round-robin placement policy. For programs that have been parallelized using the Silicon Graphics MP directives, round-robin placement can be enabled by changing only an environment variable (see "Trying Round-Robin Placement" on page 207). If this solves the performance problem, you're done. If not, try enabling dynamic page migration, done with yet another environment variable (see "Trying Dynamic Page Migration" on page 209). Both can be tried in combination.

Often, these policy changes are all that is needed to fix a performance problem. They are convenient because they require no modifications to your program. When they do not fix a performance or scalability problem, you need to modify the program to ensure that the data are optimally placed. This can be done in a couple of ways.


You can use the default first-touch policy, and program so that the data are first accessed by the CPU in whose memory they should reside. This programming style is easy to use, and it does not require learning new compiler directives. (See “Programming For First-Touch Placement” on page 216.)

The second way to ensure proper data layout is to use the data placement directives, which permit you to specify the precise data layout that is optimal for your program. The use of these directives is described in “Using Data Distribution Directives” on page 222.

Summary

While the distributed shared memory architecture of the SN0 systems introduces new complexities, most of them are handled automatically by IRIX, which uses a number of internal strategies to ensure that processes and their data end up reasonably close together. Normally, single-threaded programs operate very well, showing no effects from the NUMA architecture, other than a run-to-run variation in memory latency that depends on the program’s dynamic memory use.

When programs are made parallel using the SGI-supplied compiler and MP library, and when parallel programs are based on the SGI-supplied MPI version 3 library, they use appropriate MLDs automatically and are likely to run well without further tuning for NUMA effects. Parallel programs based on other models may gain from programmer control over memory placement, and the tools for this control are available. All these points are examined in detail in Chapter 8, “Tuning for Parallel Processing.”

3. Tuning for a Single Process

Now that you have been introduced to the SN0 architecture, it is time to address "tuning," that is, how to make your programs run their fastest on this architecture. One answer, in the SN0 environment, is to make the program use multiple CPUs in parallel. However, this should be the last step. First, make your program run as efficiently as possible as a single process.

The process of tuning can be divided into the following steps:
1. Make sure the program produces the right answers. This is especially important when first porting a program to IRIX 6.5, 64-bit computing, and SN0.
2. Use existing, tuned library code. Steps 1 and 2 are covered in this chapter.
3. Profile the program's execution and analyze its dynamic behavior. The many tools for this purpose are covered in Chapter 4, "Profiling and Analyzing Program Behavior."
4. Make the compiler produce the most efficient code. The use of the many interrelated compiler options is covered in Chapter 5, "Using Basic Compiler Optimizations."
5. Modify the program to access memory efficiently. This is covered in Chapter 6, "Optimizing Cache Utilization."
6. Exploit the compiler's ability to optimize loops. Loop optimization is covered in Chapter 7, "Using Loop Nest Optimization."

When you have done all these things, and the program still takes too long, Chapter 8, “Tuning for Parallel Processing,” covers parallelization.


Getting the Right Answers

You can make any program run as fast as you want—if you don’t insist on correct results. However, the very first rule in tuning is to make sure that the program generates the correct results. Although this may seem obvious, it is easy to forget to check the answers along the way as you make performance improvements. A lot of frustration can be avoided by making only one change at a time, verifying that the program is correct after each change.

In addition to problems introduced by performance changes, correctness issues may be raised merely by porting a code from one system to another, or by compiling to a different environment. Most programs that are written in a high-level language and adhere to the language standard will port with no difficulties. This is particularly true for standard-compliant Fortran-77 programs. Some C programs can cause problems.

Selecting an ABI and ISA

IRIX 6.5 is a 64-bit operating system. Although programs can be (and often are) compiled and run in a 32-bit address space, you may need to take advantage of the larger address space and so compile in the 64-bit Application Binary Interface (ABI). (For a survey of the available ABIs and the differences among them, see the ABI(5) reference page.) All modules of a program must use the same ABI.

When a program is written in a portable manner, you can compile it for a different ABI merely by specifying one of the compiler flags -32, -n32, or -64. However, a change of ABI can cause a program to fail because of a variety of subtle problems. All of these issues are discussed in depth in the MIPSpro 64-Bit Porting and Transition Guide listed in “Related Documents” on page xxix.

Old 32-Bit ABI

The default ABI is the “old” 32-bit environment (compiler flag -32). It is the default because it provides the greatest backward compatibility with earlier versions of IRIX: it was the only ABI before IRIX version 6. Programs compiled with -32 are limited to the MIPS I and MIPS II instruction sets, on the assumption that they must be able to execute on an older computer. If you want to use the capabilities of the MIPS III or MIPS IV instruction sets—and you do, if performance is a goal—you need to compile to a different ABI.


New 32-Bit ABI

With the “new” 32-bit ABI (compiler flag -n32), the program uses a 32-bit address space. Pointers and long integers are 32 bits in size, so there are fewer portability issues (for example, a binary file containing 32-bit integers written by an old program can be mapped by the identical header file in an -n32 program).

The advantage of -n32 is that it allows programs to use either the MIPS III or the MIPS IV ISA (selected by the -mips3 and -mips4 compiler flags). These ISAs provide more registers and longer registers, and use a faster protocol for subroutine calls, than the old 32-bit ABI did.

64-Bit ABI

When compiled with the -64 flag, the program runs in a 64-bit address space. This permits definition of immense arrays of data. Pointers and long integers are 64 bits in length. Other than that, -64 and -n32 are essentially the same. That is, there is no performance advantage to the use of -64, and indeed, because pointers take up more memory, a program that has many pointers may run more slowly when compiled -64. Use it only when your program requires memory data totalling more than 2 Gigabytes in size.
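A quick way to see the practical difference between the ABIs is to print the sizes of the basic types. The small C program below is an illustration you can compile yourself, not an example from this manual: built with -n32 it reports 4-byte pointers and longs, and built with -64 it reports 8-byte ones.

/* abisize.c -- show how the ABI changes basic type sizes.
 * Compile two ways and compare the output:
 *   cc -n32 abisize.c -o abisize_n32
 *   cc -64  abisize.c -o abisize_64
 */
#include <stdio.h>

int main(void)
{
    printf("sizeof(void *) = %d\n", (int) sizeof(void *));
    printf("sizeof(long)   = %d\n", (int) sizeof(long));
    printf("sizeof(int)    = %d\n", (int) sizeof(int));
    return 0;
}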

Specifying the ABI

Unless you require your program to run on a MIPS R4000 system (such as an Indy workstation), use one of the combinations -n32 -mips4 or -64 -mips4. If backward compatibility with the R4x00 CPU is needed, use -n32 -mips3.

On all non-R8000 systems, the default execution environment is -32 and -mips2. On CHALLENGE and Onyx R8000-based systems, these defaults are -64 -mips4. Because the defaults vary by system type, and because you may compile on a workstation for execution on a server (or vice versa), it is wise to specify the desired ABI and ISA explicitly in every compile. The best way to do this is with a makefile, as described under “Using a Makefile” on page 92.

You can also set the environment variable SGI_ABI to -32, -n32, or -64 to specify a default ABI.


Dealing with Porting Issues

A program can fail to port to IRIX for many reasons. Different compilers handle standards violations differently, so nonstandard programs often generate different results when rehosted.

Uninitialized Variables

A common example is a Fortran program that assumes that all variables are initialized to zero. While this initialization occurred on most older machines, it is not common on newer, stack-based architectures. When a program making such an assumption is ported to IRIX, it can generate incorrect results.

To detect this error early, you can compile with the -trapuv flag, which initializes all uninitialized variables in a program with the value 0xFFFA5A5A. When this value is used as a floating-point variable, it is seen as a floating-point NaN (“not a number”) and causes an immediate floating point trap. When it is used as a pointer, an address or segmentation violation generally occurs. If you then run the program under a debugger, the use of an uninitialized variable quickly becomes obvious.
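The following toy C program, a contrived illustration rather than an example from this manual, contains the kind of bug -trapuv is meant to expose: the variable scale is read before it is ever assigned. Compiled normally, the result depends on whatever happens to be on the stack; compiled with -trapuv, the 0xFFFA5A5A fill pattern makes the first use of scale stand out under a debugger.

/* uninit.c -- toy example of an uninitialized-variable bug.
 * Compile with the -trapuv flag described in the text so that local
 * variables start out filled with 0xFFFA5A5A; the use of 'scale'
 * below then surfaces immediately instead of silently producing a
 * result that depends on stale stack contents.
 */
#include <stdio.h>

int main(void)
{
    float scale;        /* bug: never assigned */
    float x = 2.0f;
    float y;

    y = scale * x;      /* uses the uninitialized value */
    printf("y = %f\n", y);
    return 0;
}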

Although fixing the program is the “right thing to do,” it does require some effort. An alternative that requires almost no effort is to use the -static flag, which causes all variables to be allocated from a static data area rather than the stack. Static variables are initialized to zero. The easy way out has a price, however: the -static flag hurts program performance, and a slowdown of 20 percent is common.

Computational Differences

Other types of porting failures can occur as well. The results generated on one computer may disagree with those from another because of differences in the floating point format, or because of differences in the generated code that cause the roundoff error to change. You need to understand the particular program to determine if these differences are significant.

In addition, some programs may even fail to compile on a new system if extensions are not supported or if they are provided through a different syntax. Fortran pointers provide a good example of this; there is no pointer type in the Fortran standard, and different systems provide this extension in incompatible ways. In such situations, the code must be modified to be compatible with the new compiler. Similarly, if a program makes calls to a particular vendor’s libraries, these calls must be replaced with equivalent entry points from the new vendor’s libraries.


To verify the existence and syntax of Fortran features in Silicon Graphics compilers, see the programmer’s guides listed in “Related Documents” on page xxix.

Finally, programs sometimes have mistakes in them that by pure luck have not yet caused problems but that, when compiled on a new system, cause errors to appear. Writing beyond the end of an array can show this behavior: whether storing past the end of the array causes a problem depends on how the compiler lays out variables in memory.

For Fortran, use the -check_bounds flag to instruct the compiler to generate run-time subscript range checking. If an out-of-bounds array reference is made, the program aborts and displays an error message indicating the illegal reference and the line of code where it occurred.

Exploiting Existing Tuned Code

The quickest and easiest way to improve a program’s performance is to link it with libraries already tuned for the target hardware. The standard math library is so tuned, and there are optional libraries that can provide performance benefits: libfastm, CHALLENGEcomplib, and SCSL.

Standard Math Library

For the standard libraries such as libc, hardware-specific versions are automatically linked, based on the compiler’s information about the target system. The compiler assumes that the current system is the target for execution, but you can tell it otherwise with compiler options. The -TARG option group (described in the cc(1) or f77(1) reference pages) describes the target system by CPU type or by Silicon Graphics processor board type (“IP” number). Alternatively, you can use the -r10000, -r8000, or -r5000 flags to name a particular MIPS CPU.

The standard math library includes special “vector intrinsics,” that is, vectorized versions of the functions vacos, vasin, vatan, vcos, vexp, vlog, vsin, vsqrt, and vtan (plus single-precision versions whose names are made by appending an f). These functions are designed to take maximum advantage of the pipelining characteristics of the CPU when processing a vector of numbers.


You can write explicit calls to the vector functions, passing input and output vectors. In certain circumstances, the loop-nest optimizer of the Fortran compilers recognizes a vector loop and automatically replaces it with a vector intrinsic call (see the Fortran Programmer’s Guides listed under “Related Documents” on page xxix).
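For instance, a loop of the following shape is the kind of elementwise computation that a vector intrinsic such as vexp performs. It is shown here in C purely as a hand-written illustration; the automatic replacement described above applies to Fortran loops, and the calling sequence for explicit calls to the vector functions is given in the math(3) reference page.

/* vexp_candidate.c -- an elementwise exp() loop, the pattern that the
 * vectorized intrinsic computes by pipelining the evaluations rather
 * than paying the full latency of each call in turn.
 * Compile with:  cc vexp_candidate.c -o vexp_candidate -lm
 */
#include <math.h>
#include <stdio.h>

#define N 1000

int main(void)
{
    double x[N], y[N];
    int i;

    for (i = 0; i < N; i++)
        x[i] = i * 0.001;

    /* Elementwise exponential over a vector of operands. */
    for (i = 0; i < N; i++)
        y[i] = exp(x[i]);

    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}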

The standard math library is described in the math(3) reference page. This library is linked in automatically by Fortran. When you link using the ld or cc commands, use the -lm flag.

libfastm Library

The libfastm library provides highly optimized versions of a subset of the routines found in the standard libm math library. Optimized scalar versions of sin, cos, tan, exp, log, and pow for both single and double precision are included. To link libfastm, append -lfastm to the end of the link line: cc -o modules... -lfastm.

Separate versions of libfastm exist for the R10000, R8000, and R5000. When the target system is not the one used for compiling, explicitly specify the target CPU. For details, see the libfastm(3) reference page.

CHALLENGEcomplib Library

CHALLENGEcomplib is a library of mathematical routines that carries out linear algebra operations, Fast Fourier Transforms (FFTs), and convolutions. It contains implementations of the well-known public domain libraries LINPACK, EISPACK, LAPACK, FFTPACK, and the Level-1, -2, and -3 BLAS, but all tuned specifically for high performance on the MIPS IV ISA. In addition, the library contains proprietary sparse solvers, also tuned to the hardware.

Many CHALLENGEcomplib routines have been parallelized, so as to use multiple CPUs concurrently to shorten solution time. A parallelized algorithm runs in a one-CPU system (or when a program is confined to a single CPU), but it incurs additional overhead. CHALLENGEcomplib comes in both sequential and parallel versions. To link the sequential version, add -lcomplib.sgimath to the link line: f77 -o modules -lcomplib.sgimath -lfastm

To link the parallel version, use -lcomplib.sgimath_mp; in addition, the -mp flag must be used when linking in the parallel code: f77 -mp -o modules -lcomplib.sgimath_mp -lfastm


SCSL Library

SCSL, the Cray Scientific Library, is an optimized library that will eventually replace both CHALLENGEcomplib and Cray’s libsci library. SCSL version 1.0 contains FFT routines, Level-1, -2, and -3 BLAS, and LAPACK routines, many of them parallelized. SCSL is distributed as a separate product from IRIX, but is available at no charge to IRIX users. SCSL ships automatically to all SN0 systems.

Note that the user interface for the FFT routines is different from the interface used in CHALLENGEcomplib. Some extended BLAS routines, which are available in libsci but not in CHALLENGEcomplib, have been included in SCSL.

You link SCSL into your program by using -lscs for the single-threaded version, or -lscs_mp for the parallelized version of the library.

Tip: Both CHALLENGEcomplib and SCSL define names also found in libfastm, so if you want to call libfastm it should be named last in the link command.

Summary

Many working UNIX applications compile and run correctly without modification in IRIX. Nevertheless, dependence on vendor-specific interfaces or on particular compiler behavior, or incorrect assumptions about data representation, can cause errors. The first task is to make sure the program processes its full range of input and produces correct answers. Then, if the program uses external math functions, link it with a tuned library and again make sure that correct answers emerge. At that point the program is ready to be tuned.


4. Profiling and Analyzing Program Behavior

When confronted with a program composed of hundreds of modules and thousands of lines of code, it would require a heroic (and inefficient) effort to tune the entire program. Tuning should be concentrated on those few sections of the code where the work will pay off with the biggest gains in performance. These sections of code are identified with the help of the profiling tools.

Profiling Tools

The hardware counters in the R10000 CPU, together with support from the IRIX kernel, make it possible to record the run-time behavior of a program in many ways without modifying the code. The profiling tools are summarized as follows:
• perfex runs a program and reports exact counts of any two selected events from the R10000 hardware event counters. Alternatively, it time-multiplexes all 32 countable events and reports extrapolated totals of each. perfex is useful for identifying what problem (for example, secondary data cache misses) is hurting the performance of your program the most.
• SpeedShop (actually, the ssrun command) runs your program while sampling the state of the program counter and stack, and writes the sample data to a file for later analysis. You can choose from a wide variety of sampling methods, called “experiments.” SpeedShop is useful for locating where in your program the performance problems occur.
• prof analyzes a SpeedShop data file and displays it in a variety of formats.
• dprof, like SpeedShop, samples a program while it is executing, and records the program’s memory access information as a histogram file. It identifies which data structures in the program are involved in performance problems.
• dlook runs a program and, at the end, displays the placement of its memory pages to the standard error file.

Use these tools to find out what constrains the program and which parts of it consume the most time. Through the use of a combination of these tools, you can identify most performance problems.


Analyzing Performance with Perfex

The simplest profiling tool is perfex, documented in the perfex(1) reference page. It is conceptually similar to the familiar timex program, in that it runs a subject program and records data about the run:
% perfex options command arguments

You supply the subject command and its arguments on the perfex command line. perfex sets up the IRIX kernel interface to the R10000 hardware event counters (for a detailed overview of the counters, see Appendix B, “R10000 Counter Event Types”), and forks the subject program. When the program ends, perfex writes counter data to standard output. Depending on the options you specify, perfex reports an exact count of one or two countable events (see Table B-1 on page 274 for a list), or it reports an approximate count of all 32 countable event types. perfex gathers the information with no modifications to the subject program, and with only a small effect on its execution time.

Taking Absolute Counts of One or Two Events

Use perfex options to specify one or two events to be counted. When you do this, the counts are absolute and repeatable. You can experiment with perfex by applying it to familiar commands, as in Example 4-1.

Example 4-1 Experimenting with perfex
> perfex -e 15 -e 18 date +%w.%V
3.26
Summary for execution of date +%w.%V
15 Graduated instructions...... 41066
18 Graduated loads...... 9021

perfex runs the subject program and reports the exact counts of the requested events. You can use this mode to explore any program’s behavior. For example, you could run a program, counting graduated instructions and graduated floating-point instructions (perfex -e 15 -e 21), under a range of input sizes. From the results, you could draw a graph showing how instruction count grows as an exact function of input size.


Taking Statistical Counts of All Events

When you specify option -a (all events), perfex multiplexes all 32 events over the program run. The IRIX kernel rotates the counters each time the dispatching clock ticks (100 Hz). Each event is counted during 1/16 of the program run, and the count is scaled by 16 in the report.

Because they are based on sampling, the resulting counts have some unavoidable statistical error. When the subject program runs in a stable execution mode for a number of seconds, the error is small and the counts are repeatable. When the program runs only a short time, or when it shifts rapidly between radically different regimes of instruction or data use, the counts are less dependable and less repeatable.

Example 4-2 shows the perfex command and output, applied to a CPU-intensive sample program called adi2.f (see Example C-1 on page 288).

Example 4-2 Output of perfex -a
% perfex -a -x adi2
WARNING: Multiplexing events to project totals--inaccuracy possible.
Time: 7.990 seconds
Checksum: 5.6160428338E+06
 0 Cycles...... 1645481936
 1 Issued instructions...... 677976352
 2 Issued loads...... 111412576
 3 Issued stores...... 45085648
 4 Issued store conditionals...... 0
 5 Failed store conditionals...... 0
 6 Decoded branches...... 52196528
 7 Quadwords written back from scache...... 61794304
 8 Correctable scache data array ECC errors...... 0
 9 Primary instruction cache misses...... 8560
10 Secondary instruction cache misses...... 304
11 Instruction misprediction from scache way prediction table.. 272
12 External interventions...... 6144
13 External invalidations...... 10032
14 Virtual coherency conditions...... 0
15 Graduated instructions...... 371427616
16 Cycles...... 1645481936
17 Graduated instructions...... 400535904
18 Graduated loads...... 90474112
19 Graduated stores...... 34776112
20 Graduated store conditionals...... 0
21 Graduated floating point instructions...... 28292480
22 Quadwords written back from primary data cache...... 32386400
23 TLB misses...... 5687456
24 Mispredicted branches...... 410064
25 Primary data cache misses...... 16330160
26 Secondary data cache misses...... 7708944
27 Data misprediction from scache way prediction table...... 663648
28 External intervention hits in scache...... 6144
29 External invalidation hits in scache...... 6864
30 Store/prefetch exclusive to clean block in scache...... 7582256
31 Store/prefetch exclusive to shared block in scache...... 8144

The -x option requests that perfex also gather counts for kernel code that handles exceptions, so the work done by the OS to handle TLB misses is included in these counts.

Getting Analytic Output with the -y Option

The raw event counts are interesting, but it is more useful to convert them to elapsed time. Some time estimates are simple; for example, dividing the cycle count by the machine clock rate gives the elapsed run time (1645481936 ÷ 195 MHz = 8.44 seconds). Other events are not as simple and can be stated only in terms of a range of times. For example, the time to handle a primary cache miss varies depending on whether the needed data are in the secondary cache, in memory, or in the cache of another CPU. You can request analysis of this kind using the -y option.
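The cycles-to-seconds arithmetic can be captured in a few lines of C. The sketch below uses the cycle count from Example 4-2 and the 195 MHz clock rate quoted above; substitute your own count and clock rate.

/* cycles2sec.c -- convert an R10000 cycle count into elapsed seconds.
 * The values are the ones quoted in the text (Example 4-2, 195 MHz CPU).
 */
#include <stdio.h>

int main(void)
{
    double cycles = 1645481936.0;   /* event 0, Cycles */
    double mhz    = 195.0;          /* CPU clock rate in MHz */
    double seconds = cycles / (mhz * 1.0e6);

    printf("%.2f seconds\n", seconds);   /* prints about 8.44 */
    return 0;
}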

When you use both -a and -y, perfex collects and displays all event counts, but it also displays a report of estimated times based on the counts. Example 4-3 shows, again, the program adi2.f.

Example 4-3 Output of perfex -a -y
% perfex -a -x -y adi2
WARNING: Multiplexing events to project totals--inaccuracy possible.
Time: 7.996 seconds
Checksum: 5.6160428338E+06
Based on 196 MHz IP27
                                                                Typical      Minimum      Maximum
Event Counter Name                              Counter Value   Time (sec)   Time (sec)   Time (sec)
====================================================================================================
 0 Cycles......                                    1639802080   8.366337     8.366337     8.366337
16 Cycles......                                    1639802080   8.366337     8.366337     8.366337
26 Secondary data cache misses......                  7736432   2.920580     1.909429     3.248837
23 TLB misses......                                   5693808   1.978017     1.978017     1.978017
 7 Quadwords written back from scache......          61712384   1.973562     1.305834     1.973562
25 Primary data cache misses......                   16368384   0.752445     0.235504     0.752445
22 Quadwords written back from primary data cache..  32385280   0.636139     0.518825     0.735278
 2 Issued loads......                               109918560   0.560809     0.560809     0.560809
18 Graduated loads......                             88890736   0.453524     0.453524     0.453524
 6 Decoded branches......                            52497360   0.267844     0.267844     0.267844
 3 Issued stores......                               43923616   0.224100     0.224100     0.224100
19 Graduated stores......                            33430240   0.170562     0.170562     0.170562
21 Graduated floating point instructions......       28371152   0.144751     0.072375     7.527040
30 Store/prefetch exclusive to clean block in scache  7545984   0.038500     0.038500     0.038500
24 Mispredicted branches......                          417440   0.003024     0.001363     0.011118
 9 Primary instruction cache misses......                 8272   0.000761     0.000238     0.000761
10 Secondary instruction cache misses......               768   0.000290     0.000190     0.000323
31 Store/prefetch exclusive to shared block in scache   15168   0.000077     0.000077     0.000077
 1 Issued instructions......                        673476960   0.000000     0.000000     3.436107
 4 Issued store conditionals......                          0   0.000000     0.000000     0.000000
 5 Failed store conditionals......                          0   0.000000     0.000000     0.000000
 8 Correctable scache data array ECC errors......           0   0.000000     0.000000     0.000000
11 Instruction misprediction from scache way prediction table..  432  0.000000  0.000000  0.000002
12 External interventions......                          6288   0.000000     0.000000     0.000000
13 External invalidations......                          9360   0.000000     0.000000     0.000000
14 Virtual coherency conditions......                       0   0.000000     0.000000     0.000000
15 Graduated instructions......                     364303776   0.000000     0.000000     1.858693
17 Graduated instructions......                     392675440   0.000000     0.000000     2.003446
20 Graduated store conditionals......                       0   0.000000     0.000000     0.000000
27 Data misprediction from scache way prediction table  679120   0.000000     0.000000     0.003465
28 External intervention hits in scache......            6288   0.000000     0.000000     0.000000
29 External invalidation hits in scache......            5952   0.000000     0.000000     0.000000
Statistics
====================================================================================================
Graduated instructions/cycle......                                0.222163
Graduated floating point instructions/cycle......                 0.017302
Graduated loads & stores/cycle......                              0.074595
Graduated loads & stores/floating point instruction......         5.422486
Mispredicted branches/Decoded branches......                      0.007952
Graduated loads/Issued loads......                                0.808696
Graduated stores/Issued stores......                              0.761099
Data mispredict/Data scache hits......                            0.078675
Instruction mispredict/Instruction scache hits......              0.057569
L1 Cache Line Reuse......                                         6.473003
L2 Cache Line Reuse......                                         1.115754
L1 Data Cache Hit Rate......                                      0.866185
L2 Data Cache Hit Rate......                                      0.527355
Time accessing memory/Total time......                            0.750045
L1--L2 bandwidth used (MB/s, average per process)......         124.541093
Memory bandwidth used (MB/s, average per process)......         236.383187
MFLOPS (average per process)......                                3.391108


Interpreting Maximum and Typical Estimates

For each count, the “maximum,” “minimum,” and “typical” time cost estimates are reported. Each is obtained by consulting an internal table that holds the maximum, minimum, and typical costs for each event, and multiplying this cost by the count for the event. Event costs are usually measured in terms of machine cycles, so the cost of an event depends on the clock speed of the CPU, which is also reported in the output.

The “maximum” value in the table corresponds to the worst-case cost of a single occurrence of the event. Sometimes this can be a pessimistic estimate. For example, the maximum cost for graduated floating-point instructions assumes that every floating-point instruction is a double-precision reciprocal square root, which is the most costly R10000 floating-point instruction.

Because of the latency-hiding capabilities of the R10000 CPU, most events can be overlapped with other operations. As a result, the minimum cost of virtually any event could be zero. To avoid simply reporting minimum costs of zero, which would be of no practical use, the minimum time reported corresponds to the best-case cost of a single occurrence of the event. The best-case cost is obtained by running the maximum number of simultaneous occurrences of that event and averaging the cost. For example, two floating-point instructions can complete per cycle, so the best case cost is 0.5 cycles per floating-point instruction.

The typical cost falls somewhere between minimum and maximum and is meant to correspond to the cost you see in average programs.

The report shows the event counts and cost estimates sorted from most costly to least costly, so that the events that are most significant to program run time appear first. This is not a true profile of the program’s execution, because the counts are approximate and because the event costs are only estimates. Furthermore, because events do overlap one another, the sum of the estimated times normally exceeds the program’s true run time. This report should be used only to identify which events are responsible for significant portions of the program’s run time, and to get an estimate of relative costs.

In Example 4-3, the program spends a significant fraction of its time handling secondary cache and TLB misses (together, as much as 5 seconds of its 8.4-second run time). Tuning measures focussed on these areas could have a significant effect on performance.


Interpreting Statistical Metrics

In addition to the event counts and cost estimates, perfex -y also reports a number of statistics derived from the typical costs. Table 4-1 summarizes these statistics; a short example of applying the formulas follows the table.

Table 4-1 Derived Statistics Reported by perfex -y

Graduated instructions per cycle: When the R10000 is used to best advantage, this exceeds 1.0. When it is below 1.0, the CPU is idling some of the time.

Graduated floating point instructions per cycle: Relative density of floating-point operations in the program.

Graduated loads & stores per cycle: Relative density of memory access in the program.

Graduated loads & stores per floating point instruction: Helps characterize the program as data processing versus mathematical.

Mispredicted branches ÷ Decoded branches: Important measure of the effectiveness of branch prediction, and of code quality.

Graduated loads ÷ Issued loads: When less than 1.0, shows that loads are being reissued because of cache misses.

Graduated stores ÷ Issued stores: When less than 1.0, shows that stores are being reissued because of cache misses or contention between threads or between CPUs.

Data mispredictions ÷ Data scache hits: The count of data mispredictions from scache way prediction, as a fraction of all secondary data cache misses.

Instruction mispredictions ÷ Instruction scache hits: The count of instruction mispredictions from scache way prediction, as a fraction of all secondary instruction cache misses.

L1 Cache Line Reuse: The average number of times that a primary data cache line is used after it has been moved into the cache. Calculated as graduated loads plus graduated stores minus primary data cache misses, divided by primary data cache misses.

L2 Cache Line Reuse: The average number of times that a secondary data cache line is used after it has been moved into the cache. Calculated as primary data cache misses minus secondary data cache misses, divided by secondary data cache misses.

L1 Data Cache Hit Rate: The fraction of data accesses satisfied from the L1 data cache. Calculated as 1.0 - (L1 data cache misses ÷ (graduated loads + graduated stores)).

L2 Data Cache Hit Rate: The fraction of data accesses satisfied from the L2 cache. Calculated as 1.0 - (L2 data cache misses ÷ primary data cache misses).

Time accessing memory ÷ Total time: A key measure of time spent idling, waiting for operands. Calculated as the sum of the typical costs of graduated loads and stores, L1 data cache misses, L2 data cache misses, and TLB misses, all divided by the total run time in cycles.

L1-L2 bandwidth used (MBps, average per process): The amount of data moved between the L1 and L2 data caches, divided by the total run time. The amount of data is taken as L1 data cache misses times the L1 cache line size, plus quadwords written back from the L1 data cache times the size of a quadword (16 bytes). For parallel programs, the counts are aggregates over all threads, divided by the number of threads; multiply by the number of threads for the total program bandwidth.

Memory bandwidth used (MBps, average per process): The amount of data moved between the L2 cache and main memory, divided by the total run time. The amount of data is taken as L2 data cache misses times the L2 cache line size, plus quadwords written back from the L2 cache times the size of a quadword (16 bytes). For parallel programs, the counts are aggregates over all threads, divided by the number of threads; multiply by the number of threads to get the total program bandwidth.

MFLOPS (average per process): The ratio of graduated floating-point instructions to total run time. Note that a multiply-add carries out two operations but counts as only one instruction, so this statistic can be an underestimate. For parallel programs, the counts are aggregates over all threads, divided by the number of threads; multiply by the number of threads to get the total program rate.
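The formulas in Table 4-1 are easy to apply by hand or in a small program of your own. The sketch below computes a few of the derived statistics exactly as the table defines them, using the raw counts from Example 4-2; that is a different run from Example 4-3, so the printed values differ slightly from the statistics shown there.

/* derived_stats.c -- apply the Table 4-1 formulas to raw perfex counts.
 * Counts below are taken from Example 4-2; replace them with the
 * values perfex reports for your own program.
 */
#include <stdio.h>

int main(void)
{
    double grad_loads  = 90474112.0;   /* event 18, graduated loads        */
    double grad_stores = 34776112.0;   /* event 19, graduated stores       */
    double l1_misses   = 16330160.0;   /* event 25, primary dcache misses  */
    double l2_misses   = 7708944.0;    /* event 26, secondary dcache misses*/

    double l1_reuse = (grad_loads + grad_stores - l1_misses) / l1_misses;
    double l2_reuse = (l1_misses - l2_misses) / l2_misses;
    double l1_hit   = 1.0 - l1_misses / (grad_loads + grad_stores);
    double l2_hit   = 1.0 - l2_misses / l1_misses;

    printf("L1 cache line reuse    %f\n", l1_reuse);
    printf("L2 cache line reuse    %f\n", l2_reuse);
    printf("L1 data cache hit rate %f\n", l1_hit);
    printf("L2 data cache hit rate %f\n", l2_hit);
    return 0;
}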


These statistics give you useful hints about performance problems in your program. For example:
• The cache hit-rate statistics tell you how cache friendly your program is. Because a secondary cache miss is much more expensive than a cache hit, the L2 data cache hit rate needs to be close to 1.0 to indicate that the program is not paying a large penalty for cache misses. Values of 0.95 and above indicate good cache performance. (For Example 4-3, the rate is 0.53, confirming cache problems in this program.)
• The memory bandwidth used indicates the load that the program places on the SN0 distributed memory architecture. Memory access passes through the hub chip (see “Understanding Scalable Shared Memory” on page 6), which serves two CPUs and which saturates at an aggregate rate of about 620 MBps. (A worked example of this bandwidth calculation follows this list.)
• A graduated-instructions-per-cycle value below 1.0 indicates that the CPU is often stalling for lack of some resource. Look for causes in the fraction of mispredicted branches, the fraction of time spent accessing memory, and the various cache statistics; also look in the raw data at the proportion of failed store conditionals.
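As a concrete illustration of the bandwidth check mentioned above, the sketch below applies the Table 4-1 definition of memory bandwidth to the counts from Example 4-3. The 128-byte secondary cache line size is assumed here; it is not restated in this chapter.

/* membw.c -- memory bandwidth per the Table 4-1 definition, using the
 * counts and run time from Example 4-3.  The L2 line size is an
 * assumption (128 bytes); check your system's configuration.
 */
#include <stdio.h>

int main(void)
{
    double l2_misses   = 7736432.0;    /* event 26                      */
    double scache_wb   = 61712384.0;   /* event 7, quadwords from scache*/
    double run_seconds = 8.366337;     /* cycles / clock rate           */
    double line_bytes  = 128.0;        /* assumed L2 cache line size    */
    double quad_bytes  = 16.0;

    double bytes = l2_misses * line_bytes + scache_wb * quad_bytes;
    double mbps  = bytes / run_seconds / 1.0e6;

    printf("memory bandwidth used: %.1f MB/s\n", mbps);  /* about 236 */
    /* Compare with the roughly 620 MB/s at which a hub saturates. */
    return 0;
}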

Processing perfex Output

The output of perfex goes to the standard output stream, and you can process it further with any of the normal IRIX utilities. A shell script can capture the output of a perfex run in a variable and display it, or the output can be filtered using any utility. For an example see Example C-7 on page 298.

Collecting Data over Part of a Run

It is easy to apply perfex to an unchanged program, but the data comes from the entire run of the program. Often you want to profile only a particular section of the program: to avoid counting setup work, or to avoid profiling the time spent “warming up” the working sets of cache and virtual memory, or to profile one particular phase of the algorithm.

If you are willing to modify the program, you can use the library interface to perfex, documented in the libperfex(3) reference page. You insert one library call to initiate counting and another to terminate it and retrieve the counts. You can perform specific counts of only one or two events; there is no dynamic equivalent to perfex -a. The program must then be linked with the libperfex library:
% f77 -o modules... -lperfex
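A minimal sketch of this usage in C follows. It assumes the start_counters()/read_counters() interface described in the libperfex(3) reference page; check that page for the exact prototypes, return types, and header file, because the declarations below are written out by hand for this sketch.

/* section_count.c -- count events over one section of a program with
 * the libperfex interface instead of over the whole run.
 * Link with -lperfex.  The prototypes are assumptions; the
 * authoritative declarations are in libperfex(3).
 */
#include <stdio.h>

int start_counters(int e0, int e1);
int read_counters(int e0, long long *c0, int e1, long long *c1);

static double work(int n)
{
    double s = 0.0;
    int i;
    for (i = 0; i < n; i++)
        s += (double)i * 0.5;
    return s;
}

int main(void)
{
    long long c0 = 0, c1 = 0;

    /* Count graduated instructions (15) and graduated loads (18)
       around just the section of interest. */
    start_counters(15, 18);
    work(1000000);
    read_counters(15, &c0, 18, &c1);

    printf("graduated instructions: %lld\n", c0);
    printf("graduated loads:        %lld\n", c1);
    return 0;
}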


Using perfex with MPI

You can apply perfex to a program that uses the MPI message-passing library (MPI is summarized in “Message-Passing Models MPI and PVM” on page 195). The key to applying a tool like perfex with MPI is to apply mpirun to the tool command. In the following example, mpirun is used to start a program under perfex:
mpirun -np 4 perfex -mp -o afile a.out

This example starts copies of perfex on four CPUs. Each of them, in turn, invokes the program a.out, and writes its analysis report into the file afile. It is best to use the perfex option -o to write output to a file, because mpirun redirects the standard output and error streams. For a tip on how to collect standard output from perfex, see Example 8-31 on page 254.

Using SpeedShop

The SpeedShop package supports program profiling. You use profiling to find out exactly where a program spends its time, that is, in precisely which procedures or lines of code. Then you can concentrate your efforts on those areas of code where there is the most to be gained.

The SpeedShop package supports three different methods of profiling:
• Sampling: The unmodified subject program is rhythmically interrupted by some time base, and the program counter is recorded in a trace file on each interruption. SpeedShop can use the system timer or any of the R10000 counter events as its time base. Different time bases produce different kinds of information about the program’s behavior.
• Ideal time: A copy of the program binary is modified to put trap instructions at the end of every basic block. During execution, the exact number of uses of each basic block is counted. An ideal time profile is an exact profile of program behavior. A variety of reports can be printed from the trace of an ideal count run.
• Exception trace: Not really a profiling method, this records only floating-point exceptions and their locations.


Either method of profiling, sampling or ideal time, can be applied to multiprocessor runs just as easily as it is applied to single-CPU runs. Each thread of an application maintains its own trace information, and the histograms can be printed individually or merged in any combination and printed as one profile.

SpeedShop has three parts:
• ssrun performs “experiments” (sampling runs) and collects data.
• The ssapi library interface allows you to insert caliper points into a program to profile specific sections of code or phases of execution.
• prof processes trace data and prepares reports.

These programs are documented in the reference pages listed under “Related Reference Pages” on page xxxi.

Taking Sampled Profiles

Similar to perfex, the ssrun command executes the subject program and collects information about the run. However, where perfex uses the R10000 event counters to collect data, ssrun interrupts the program at regular intervals. This is a statistical sampling method. The time base is the independent variable and the program state is the dependent variable. The output describes the program’s behavior as a function of the time base.

Understanding Sample Time Bases

The quality of sampling depends on the time base that sets the sampling interval. The more frequent the interruptions, the better the data collected, and the greater the effect on the run time of the program. The available time bases are listed in Table 4-2.


Table 4-2 SpeedShop Sampling Time Bases

-usertime (30 ms timer): Coarsest resolution; experiment runs quickly and output file is small; some bugs noted in speedshop(1).

-pcsamp[x] (10 ms timer), -fpcsamp[x] (1 ms timer): Moderate resolution; functions that cause cache misses or page faults are emphasized. Suffix x for 32-bit counts.

-gi_hwc (32771 insts), -fgi_hwc (6553 insts): Fine-grain resolution based on graduated instructions. Emphasizes functions that burn a lot of instructions.

-cy_hwc (16411 clocks), -fcy_hwc (3779 clocks): Fine-grain resolution based on elapsed cycles. Emphasizes functions with cache misses and mispredicted branches.

-ic_hwc (2053 icache misses), -fic_hwc (419 icache misses): Granularity depends on program behavior. Emphasizes code that doesn’t fit in L1 cache.

-isc_hwc (131 scache misses), -fisc_hwc (29 scache misses): Granularity depends on program behavior. Emphasizes code that doesn’t fit in L2 cache.

-dc_hwc (2053 dcache misses), -fdc_hwc (419 dcache misses): Granularity depends on program behavior. Emphasizes code that causes L1 cache data misses.

-dsc_hwc (131 scache misses), -fdsc_hwc (29 scache misses): Granularity depends on program behavior. Emphasizes code that causes L2 cache data misses.

-tlb_hwc (257 TLB misses), -ftlb_hwc (53 TLB misses): Granularity depends on program behavior. Emphasizes code that causes page faults.

-gfp_hwc (32771 fp insts), -fgfp_hwc (6553 fp insts): Granularity depends on program behavior. Emphasizes code that performs heavy FP calculation.

-prof_hwc (user-set): Hardware counter and overflow value are taken from environment variables.


In general, each time base discovers the program PC most often in the code that consumes the most units of that time base, as follows:
• The time bases that reflect actual elapsed time (-usertime, -pcsamp, -cy_hwc) find the PC most often in the code where the program spends the most elapsed time. Time may be spent in that code because the code is executed often, but it might also be spent there because those instructions are processed slowly owing to cache misses, contention for memory or locks, or failed branch prediction. Use these time bases to get an overview of the program and to find major trouble spots.
• The time bases that reflect instruction counts (-gi_hwc, -gfp_hwc) find the PC most often in the code that actually performs the most instructions. Use these to find the code that could benefit most from a more efficient algorithm, without regard to memory access issues.
• The time bases that reflect memory access (-dc_hwc, -dsc_hwc, -tlb_hwc) find the PC most often in the code that has to wait for its data to be brought in from another level of the memory hierarchy. Use these to find memory access problems.
• The time bases that reflect code access (-ic_hwc, -isc_hwc) find the PC most often in the code that has to be fetched from memory when it is called. Use these to pinpoint functions that could be reorganized for better locality, or to see when automatic inlining has gone too far.

Sampling through Hardware Event Counters

Most of the sample time bases listed in Table 4-2 are based on the R10000 event counters (see “Analyzing Performance with Perfex” on page 54 for an overview, and Appendix B, “R10000 Counter Event Types,” for details). Only the -usertime and -[f]pcsamp experiments can be run on other types of CPU.

Event counters can be programmed to generate an interrupt after counting a specific number of events. For example, if you choose the -gi_hwc sample time base, ssrun programs Counter 0 Event 15, graduated instructions, to overflow after 32,771 counts. Each time the counter overflows, the CPU traps to an interrupt routine, which samples the program state, reloads the counter, and restarts the program. The most commonly useful counters can be used directly with the ssrun options listed in Table 4-2.

Performing ssrun Experiments

It is easy to perform a sampling experiment. Example 4-4 shows running an experiment on program adi2.f.


Example 4-4 Performing an ssrun Experiment
% ssrun -fpcsamp adi2
Time: 7.619 seconds
Checksum: 5.6160428338E+06

The output file of samples is left in a file with a default name compounded from the subject program name, the timebase option, and the process ID to ensure uniqueness. (The exact rules for the output filename are spelled out in reference page speedshop(1).) The output of Example 4-4 might leave a file named adi2.fpcsamp.m4885.

You can specify part of the name of the output file. (This is important when, for example, you are automating an experiment with a shell script.) You do this by putting the desired filename and directory in environment variables. The shell script ssruno, shown in Example C-6 on page 297, runs an experiment with the output directory and base filename specified. Example 4-5 shows possible output from the use of ssruno.

Example 4-5 Example Run of ssruno
% ssruno -d /var/tmp -o adi2 -cy_hwc adi2
ssrun -cy_hwc adi2 ...... ssrun ends.
-rw-r--r--    1 guest    guest      18480 Dec 17 16:25 /var/tmp/adi2.m5227

Sampling Through Other Hardware Counters

In addition to the experiment types listed in Table 4-2, a catchall experiment type, -prof_hwc, permits you to design an experiment that uses, as its time base, any R10000 counter and any overflow value. You specify the counter number and overflow value using environment variables. (See the speedshop(1) reference page for details.)

As an example, one countable event is an update to a shared cache line (see “Store or Prefetch-Exclusive to Shared Block in Scache (Event 31)” on page 282). A high number of these events in a perfex run would lead you to suspect the program is being slowed by memory contention for shared cache lines. But where in the program are the conflicting memory accesses occurring? You can perform an ssrun sampling experiment based on -prof_hwc, setting the following environment variables:
• _SPEEDSHOP_HWC_COUNTER_NUMBER to 31
• _SPEEDSHOP_HWC_COUNTER_OVERFLOW to a count that will produce at least 100 samples during the run of the program—in other words, the total count that perfex reports for event 31, divided by 100 or more.

This experiment will tend to find the program in those statements that update shared cache lines; hence it should highlight the code that suffers from memory contention.


Displaying Profile Reports from Sampling

The output of any SpeedShop experiment is a trace file whose contents are binary samples of program state. You always use the prof command to display information about the program run, based on a trace file. Although it can produce a variety of reports, prof by default displays a list of routines (functions and procedures), ordered from the routine with the most samples to ones with the fewest. An example of a default report appears in Example 4-6.

Example 4-6 Default prof Report from ssrun Experiment
% prof adi2.fpcsamp.4885
-------------------------------------------------------------------------------
Profile listing generated Sat Jan 4 10:28:11 1997 with:
   prof adi2.fpcsamp.4885
-------------------------------------------------------------------------------
samples   time     CPU     FPU     Clock     N-cpu   S-interval   Countsize
  8574    8.6s   R10000  R10010  196.0MHz      1        1.0ms      2(bytes)
Each sample covers 4 bytes for every 1.0ms ( 0.01% of 8.5740s)
-------------------------------------------------------------------------------
  -p[rocedures] using pc-sampling.
  Sorted in descending order by the number of samples in each procedure.
  Unexecuted procedures are excluded.
-------------------------------------------------------------------------------
samples   time(%)       cum time(%)     procedure (dso:file)
   6688    6.7s( 78.0)   6.7s( 78.0)    zsweep (adi2:adi2.f)
    671   0.67s(  7.8)   7.4s( 85.8)    xsweep (adi2:adi2.f)
    662   0.66s(  7.7)     8s( 93.6)    ysweep (adi2:adi2.f)
    208   0.21s(  2.4)   8.2s( 96.0)    fake_adi (adi2:adi2.f)
    178   0.18s(  2.1)   8.4s( 98.1)    irand_ (/usr/lib32/libftn.so:../../libF77/rand_.c)
    166   0.17s(  1.9)   8.6s(100.0)    rand_ (/usr/lib32/libftn.so:../../libF77/rand_.c)
      1  0.001s(  0.0)   8.6s(100.0)    __oserror (/usr/lib32/libc.so.1:oserror.c)
   8574    8.6s(100.0)   8.6s(100.0)    TOTAL

Even this simple profile makes it clear that, in this program, you should focus on the routine zsweep, because it consumes almost 80% of the run time of the program.

For finer detail, use the -heavy (or simply -h) option. This supplements the basic report with a list of individual source line numbers, ordered by frequency, as shown in Example 4-7.


Example 4-7 Profile at the Source Line Level Using prof -heavy
-------------------------------------------------------------------------------
  -h[eavy] using pc-sampling.
  Sorted in descending order by the number of samples in each line.
  Lines with no samples are excluded.
-------------------------------------------------------------------------------
samples   time(%)       cum time(%)     procedure (file:line)
   3405    3.4s( 39.7)   3.4s( 39.7)    zsweep (adi2.f:122)
   3226    3.2s( 37.6)   6.6s( 77.3)    zsweep (adi2.f:126)
    425   0.42s(  5.0)   7.1s( 82.3)    xsweep (adi2.f:80)
    387   0.39s(  4.5)   7.4s( 86.8)    ysweep (adi2.f:101)
    273   0.27s(  3.2)   7.7s( 90.0)    ysweep (adi2.f:105)
    246   0.25s(  2.9)     8s( 92.9)    xsweep (adi2.f:84)
    167   0.17s(  1.9)   8.1s( 94.8)    irand_ (../../libF77/rand_.c:62)
    163   0.16s(  1.9)   8.3s( 96.7)    fake_adi (adi2.f:18)
    160   0.16s(  1.9)   8.5s( 98.6)    rand_ (../../libF77/rand_.c:69)
     45  0.045s(  0.5)   8.5s( 99.1)    fake_adi (adi2.f:59)
     32  0.032s(  0.4)   8.5s( 99.5)    zsweep (adi2.f:113)
     21  0.021s(  0.2)   8.5s( 99.7)    zsweep (adi2.f:121)
     11  0.011s(  0.1)   8.6s( 99.8)    irand_ (../../libF77/rand_.c:63)
      6  0.006s(  0.1)   8.6s( 99.9)    rand_ (../../libF77/rand_.c:67)
      4  0.004s(  0.0)   8.6s(100.0)    zsweep (adi2.f:125)
      1  0.001s(  0.0)   8.6s(100.0)    ysweep (adi2.f:104)
      1  0.001s(  0.0)   8.6s(100.0)    ysweep (adi2.f:100)
      1  0.001s(  0.0)   8.6s(100.0)    __oserror (oserror.c:127)
   8574    8.6s(100.0)   8.6s(100.0)    TOTAL

From this listing it is clear that lines 122 and 126 warrant close inspection. Even finer detail can be obtained with the -source option, which lists the source code and disassembled machine code, indicating sample hits on specific instructions.

Using Ideal Time Profiling

The other type of profiling is called ideal time, or basic block, profiling. (A basic block is a compiler term for any section of code that has only one entrance and one exit. Any program can be decomposed into basic blocks.) An ideal time profile is not based on statistical sampling. Instead, it is based on an exact count of the number of times each basic block in the program is entered during a run.


Capturing an Ideal Time Trace

To create an ideal profile, ssrun copies the executable program and modifies the copy to contain code that records the entry to each basic block. Not only the executable itself but all the dynamic shared objects (standard libraries linked at run time) used by the program are also copied and instrumented. The instrumented executable and libraries are linked statically and run. The output trace file contains precise counts of program execution. Example 4-8 shows an ideal time experiment run.

Example 4-8 Ideal Time Profile Run
% ssrun -ideal adi2
Beginning libraries
        /usr/lib32/libssrt.so
        /usr/lib32/libss.so
        /usr/lib32/libfastm.so
        /usr/lib32/libftn.so
        /usr/lib32/libm.so
        /usr/lib32/libc.so.1
Ending libraries, beginning "adi2"
Time: 8.291 seconds
Checksum: 5.6160428338E+06

The number of times each basic block is encountered is recorded in an experiment file, which is named according to the usual rules of ssrun (see “Performing ssrun Experiments” on page 65).

Default Ideal Time Profile

The ideal time trace file is displayed using prof, just as for a sampled run. The default report lists every routine the program calls, including library routines. The report is ordered by the count of instructions executed by each procedure, from most to least. An example, edited to shorten the display, is shown in Example 4-9.

Example 4-9 Default Report of Ideal Time Profile
% prof adi2.ideal.m5064
-------------------------------------------------------------------------------
SpeedShop profile listing generated Thu Jun 25 11:42:33 1998
   prof adi2.ideal.m5064
                adi2 (n32): Target program
                     ideal: Experiment name
                    it:cu: Marching orders
           R10000 / R10010: CPU / FPU
                        16: Number of CPUs
                       250: Clock frequency (MHz.)
Experiment notes--
  From file adi2.ideal.m5064:
    Caliper point 0 at target begin, PID 5064
        /usr/people/cortesi/ADI/adi2.pixie
    Caliper point 1 at exit(0)
-------------------------------------------------------------------------------
Summary of ideal time data (ideal)--
                1091506962: Total number of instructions executed
                1218887512: Total computed cycles
                     4.876: Total computed execution time (secs.)
                     1.117: Average cycles / instruction
-------------------------------------------------------------------------------
Function list, in descending order by exclusive ideal time
-------------------------------------------------------------------------------
[index]  excl.secs  excl.%   cum.%      cycles  instructions    calls  function (dso: file, line)
[1]          1.386   28.4%   28.4%  346619904     304807936    32768  xsweep (adi2: adi2.f, 71)
[2]          1.386   28.4%   56.9%  346619904     304807936    32768  ysweep (adi2: adi2.f, 92)
[3]          1.386   28.4%   85.3%  346619904     304807936    32768  zsweep (adi2: adi2.f, 113)
[4]          0.540   11.1%   96.4%  135022759     111785107        1  fake_adi (adi2: adi2.f, 1)
[5]          0.101    2.1%   98.5%   25165824      35651584  2097152  rand_ (libftn.so...)
[6]          0.067    1.4%   99.8%   16777216      27262976  2097152  irand_ (libftn.so...)
[7]          0.002    0.0%   99.9%     505881        589657     1827  general_find_symbol (...)
[8]          0.001    0.0%   99.9%     364927        378813     3512  resolve_relocations (...)
[9]          0.001    0.0%   99.9%     214140        185056     1836  elfhash (rld: obj.c, 1211)
[10]         0.001    0.0%   99.9%     164195        182151     1846  find_symbol_in_object (...)
[11]         0.001    0.0%   99.9%     132844        177140     6328  obj_dynsym_got (...)
[12]         0.000    0.0%  100.0%      95570        117237     1804  resolving (rld: rld.c, 1893)
[13]         0.000    0.0%  100.0%      90941        112852     1825  resolve_symbol (rld:...)
[14]         0.000    0.0%  100.0%      86891        211175     3990  strcmp (rld: strcmp.s, 34)
[15]         0.000    0.0%  100.0%      75092         77499        1  fix_all_defineds (...)
[16]         0.000    0.0%  100.0%      67523         57684     9054  next_obj (rld: rld.c, 2712)
[17]         0.000    0.0%  100.0%      65813         59722        6  search_for_externals (...)
[18]         0.000    0.0%  100.0%      53431         73229     1780  find_first_object_to_search
[19]         0.000    0.0%  100.0%      42153         35668     3243  obj_set_dynsym_got (...)
[20]  ...


Interpreting the Ideal Time Report

The key information items in a report like the one in Example 4-9 are:
• The exact count of machine instructions executed by the code of each function or procedure
• The percentage of the total execution cycles consumed in each function or procedure

An ideal profile shows precisely which statements are most often executed, giving you an exact view of the algorithmic complexity of the program. However, an ideal profile is based on the ideal, or standard, time that each instruction ought to take. It does not necessarily reflect where a program spends elapsed time. The profile cannot take into account the ability of the CPU to overlap instructions, which can shorten the elapsed time. More important, it cannot take into account instruction delays caused by cache and TLB misses, which can greatly lengthen elapsed time.

The assumption of ideal time explains why the results of the profile in Example 4-9 are so startlingly different from those of the sampling profile of the same program in Example 4-6 on page 67. From Example 4-9 you would expect zsweep to take exactly the same amount of run time as ysweep or xsweep—the instruction counts, and hence ideal times, are identical. Yet when sampling is used, zsweep takes ten times as much time as the other procedures.

The only explanation for such a difference is that some of the instructions in zsweep take longer than the ideal time to execute, so that the sampling run is more likely to find the PC in zsweep. The inference is that the code of zsweep encounters many more cache misses, or possibly TLB misses, than the rest of the program. (On machines without the R10000 CPU’s hardware profiling registers, such a comparison is the only profiling method that can identify cache problems.)


Removing Clutter from the Report

It is quickly apparent that only the first few lines of the ideal time report are useful for tuning. The dozens of library functions the program called are of little interest. You can eliminate the clutter from the report by applying the -quit option to cut the report off after showing functions that used significant time, as shown in edited form in Example 4-10.

Example 4-10 Ideal Time Report Truncated with -quit
% prof -q 2% adi2.ideal.m5064
-------------------------------------------------------------------------------
SpeedShop profile listing generated Thu Jun 25 12:58:57 1998
   prof -q 2% adi2.ideal.m5064
...
Summary of ideal time data (ideal)--
                1091506962: Total number of instructions executed
                1218887512: Total computed cycles
                     4.876: Total computed execution time (secs.)
                     1.117: Average cycles / instruction
-------------------------------------------------------------------------------
Function list, in descending order by exclusive ideal time
-------------------------------------------------------------------------------
[index]  excl.secs  excl.%   cum.%      cycles  instructions    calls  function (dso: file, line)
[1]          1.386   28.4%   28.4%  346619904     304807936    32768  xsweep (adi2: adi2.f, 71)
[2]          1.386   28.4%   56.9%  346619904     304807936    32768  ysweep (adi2: adi2.f, 92)
[3]          1.386   28.4%   85.3%  346619904     304807936    32768  zsweep (adi2: adi2.f, 113)
[4]          0.540   11.1%   96.4%  135022759     111785107        1  fake_adi (adi2: adi2.f, 1)
[5]          0.101    2.1%   98.5%   25165824      35651584  2097152  rand_ ( )


Including Line-Level Detail

The -heavy option appends a list of source lines, sorted by their consumption of ideal instruction cycles. The combination of -q and -h is shown in Example 4-11.

Example 4-11 Ideal Time Report by Lines
-------------------------------------------------------------------------------
SpeedShop profile listing generated Thu Jun 25 13:03:06 1998
   prof -h -q 2% adi2.ideal.m5064
...
-------------------------------------------------------------------------------
Function list, in descending order by exclusive ideal time
-------------------------------------------------------------------------------
...function list same as Example 4-10...
-------------------------------------------------------------------------------
Line list, in descending order by time
-------------------------------------------------------------------------------
excl.secs      %   cum.%      cycles  invocations  function (dso: file, line)
    0.565  11.6%   11.6%  141273180      4161536  xsweep (adi2: adi2.f, 80)
    0.565  11.6%   23.2%  141273180      4161536  ysweep (adi2: adi2.f, 101)
    0.565  11.6%   34.8%  141273180      4161536  zsweep (adi2: adi2.f, 122)
    0.552  11.3%   46.1%  137925189      4161536  xsweep (adi2: adi2.f, 84)
    0.552  11.3%   57.4%  137925189      4161536  ysweep (adi2: adi2.f, 105)
    0.552  11.3%   68.7%  137925189      4161536  zsweep (adi2: adi2.f, 126)
    0.196   4.0%   72.7%   49041090      2097152  fake_adi (adi2: adi2.f, 59)
    0.194   4.0%   76.7%   48416848      2097152  fake_adi (adi2: adi2.f, 18)
    0.152   3.1%   79.8%   37948124      4161536  xsweep (adi2: adi2.f, 79)
    0.152   3.1%   82.9%   37948124      4161536  ysweep (adi2: adi2.f, 100)
    0.152   3.1%   86.1%   37948124      4161536  zsweep (adi2: adi2.f, 121)
    0.116   2.4%   88.4%   28896699      4161536  ysweep (adi2: adi2.f, 104)
    0.116   2.4%   90.8%   28896699      4161536  zsweep (adi2: adi2.f, 125)
    0.116   2.4%   93.2%   28896699      4161536  xsweep (adi2: adi2.f, 83)


You can use the -lines option instead of -heavy. This lists the source lines in their source sequence, grouped by procedure, with the procedures ordered by decreasing time. This option helps you see the expensive statements in their context. However, the option does not combine well with the -q option. In order to use -lines yet avoid voluminous output, choose specific procedures that you want to analyze and restrict the display to only those procedures using -only, as shown in Example 4-12.

Example 4-12 Ideal Time Profile Using -lines and -only Options
-------------------------------------------------------------------------------
SpeedShop profile listing generated Thu Jun 25 13:16:54 1998
   prof -l -only zsweep -o xsweep adi2.ideal.m5064
...
-------------------------------------------------------------------------------
Function list, in descending order by exclusive ideal time
-------------------------------------------------------------------------------
[index]  excl.secs  excl.%   cum.%      cycles  instructions  calls  function (dso: file, line)
[1]          1.386   28.4%   28.4%  346619904     304807936   32768  xsweep (adi2: adi2.f, 71)
[2]          1.386   28.4%   56.9%  346619904     304807936   32768  zsweep (adi2: adi2.f, 113)
-------------------------------------------------------------------------------
Line list, in descending order by function-time and then line number
-------------------------------------------------------------------------------
excl.secs  excl.%   cum.%      cycles  invocations  function (dso: file, line)
    0.002    0.0%    0.0%     412872        32768  xsweep (adi2: adi2.f, 71)
    0.152    3.1%    3.1%   37948124      4161536  xsweep (adi2: adi2.f, 79)
    0.565   11.6%   14.7%  141273180      4161536  xsweep (adi2: adi2.f, 80)
    0.116    2.4%   17.1%   28896699      4161536  xsweep (adi2: adi2.f, 83)
    0.552   11.3%   28.4%  137925189      4161536  xsweep (adi2: adi2.f, 84)
    0.001    0.0%   28.4%     163840        32768  xsweep (adi2: adi2.f, 87)
    0.002    0.0%   28.5%     412872        32768  zsweep (adi2: adi2.f, 113)
    0.152    3.1%   31.6%   37948124      4161536  zsweep (adi2: adi2.f, 121)
    0.565   11.6%   43.2%  141273180      4161536  zsweep (adi2: adi2.f, 122)
    0.116    2.4%   45.5%   28896699      4161536  zsweep (adi2: adi2.f, 125)
    0.552   11.3%   56.9%  137925189      4161536  zsweep (adi2: adi2.f, 126)
    0.001    0.0%   56.9%     163840        32768  zsweep (adi2: adi2.f, 129)

Any simple relationship between source statement numbers and the executable code is destroyed during optimization. The statement numbers listed in a profile correspond very nicely to the source code when the program was compiled without optimization (-O0 compiler option). When the program was compiled with optimization, the source statement numbers in the profile are approximate, and you may have difficulty relating the profile to the code.


Creating a Compiler Feedback File

The information in an ideal-time profile includes exact counts of how often every conditional branch was taken or not taken. The compiler can use this information to generate optimal branch code. You use prof to extract the branch statistics into a compiler feedback file using the -feedback option:
prof -feedback program.ideal.n

The usual display is produced on standard output, but prof also writes a program.cfb file. You provide this as input to the compiler using the -fb compiler option, as described in “Passing a Feedback File” on page 124.
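For example, a complete feedback cycle might look like the following sketch. The program name is illustrative, the numeric suffix of the experiment file is a process ID that differs from run to run, and the exact spelling of the -fb option is the one described in “Passing a Feedback File” on page 124:

% f77 -n32 -mips4 -O2 -o myprog myprog.f
% ssrun -ideal myprog
% prof -feedback myprog.ideal.m12345
% f77 -n32 -mips4 -O2 -fb myprog.cfb -o myprog myprog.f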

Displaying Operation Counts

Because ideal profiling counts the instructions executed by the program, it can provide all sorts of interesting information about the program. Use the -archinfo option to get a detailed census of the program’s execution, down to register usage and the counts of individual opcodes. An edited version of this report is shown in Example 4-13.

Example 4-13 Ideal Time Architecture Information Report ------SpeedShop profile listing generated Thu Jun 25 13:34:40 1998 prof -archinfo adi2.ideal.m5064 ... Integer register usage ------register use count % base count % dest. count % r00 342234300 17.76% 0 0.00% 4334788 0.22% r01 10878870 0.56% 29178987 1.51% 2223380 0.12% r02 59266446 3.08% 41971 0.00% 2308302 0.12% r03 ... Floating-point register usage ------register use count % dest. count % f00 62521432 27.29% 6291491 2.75% f01 37552180 16.39% 14581780 6.36% f02 29163562 12.73% 21 0.00% f03 ... Instruction execution statistics ------1091506962: instructions executed 56230002: floating point operations (11.5331 Mflops @ 250 MHz) 400759124: integer operations (82.1977 M intops @ 250 MHz)


... Instruction counts ------Name executions exc. % cum. % instructions inst. % cum.%

lw 303415993 28.69% 28.69% 1415 12.93% 12.93% addiu 115696249 10.94% 39.64% 992 9.07% 22.00% addu 96419728 9.12% 48.75% 352 3.22% 25.22% ... madd.d 12484612 1.18% 93.82% 5 0.05% 50.54% ...

Among the points to be learned with -archinfo are these:
• The count and average rate of floating-point operations. Note that these counts differ from the ones reported by perfex. R10000 Event 21 counts floating-point instructions, not floating-point operations. The multiply-add instruction carries out two floating-point operations; perfex counts it as 1 and prof as 2 (a worked reconciliation follows this list).
• The count and average rate of integer operations, allowing you to judge whether these (such as index calculations) are important to performance.
• Exact counts of specific opcodes executed, including a tally of those troublesome multiply-add (madd.d) instructions.
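For example, in Example 4-13 prof reports 56,230,002 floating-point operations, of which 12,484,612 are madd.d instructions counted as two operations each. Assuming every other floating-point instruction performs a single operation, perfex event 21 for the same run would report roughly 56,230,002 - 12,484,612 = 43,745,390 floating-point instructions.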

Profiling the Call Hierarchy

The profiles discussed so far do not reflect the call hierarchy of the program. If the routine zsweep were called from two different locations, you could not tell how much time resulted from the calls at each location; you would only know the total time spent in zsweep. If you could learn that, say, most of the calls came from the first location, it would affect how you tuned the program. For example, you might try inlining the call at the first location, but not the second. Or, if you want to parallelize the program, knowing that most of the time is spent in calls from the first location, you might consider parallelizing those calls, rather than trying to parallelize the zsweep routine itself.

Only two types of SpeedShop profiles capture sufficient information to reveal the call hierarchy: ideal counting, and usertime sampling (the one sampling time base that records the full call stack).


Displaying Ideal Time Call Hierarchy

You can request call hierarchy information in an ideal time report using the -butterfly flag (and the -q flag to reduce clutter). This produces the usual report (like Example 4-10), with an additional section which, in the prof(1) reference page, is called the “butterfly report.” An edited extract of a butterfly report is shown in Example 4-14.
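A report like Example 4-14 can be produced from the same ideal experiment data used earlier, with a command along these lines (the 2% cutoff is a typical choice, not a requirement):

% prof -butterfly -q 2% adi2.ideal.m5064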

Example 4-14 Extract from a Butterfly Report ------99.8% 4.867(0000001) 4.867 __start [1] [2] 99.8% 4.867 0.0% 0.000 main [2] 99.8% 4.867(0000001) 4.867 fake_adi [3] 0.0% 0.000(0000005) 0.000 signal [125] ------99.8% 4.867(0000001) 4.867 main [2] [3] 99.8% 4.867 11.1% 0.540 fake_adi [3] 28.4% 1.386(0032768) 1.386 zsweep [4] 28.4% 1.386(0032768) 1.386 ysweep [6] 28.4% 1.386(0032768) 1.386 xsweep [5] 3.4% 0.168(2097152) 0.168 rand_ [7] 0.0% 0.000(0000002) 0.000 e_wsfe [63] 0.0% 0.000(0000002) 0.000 s_wsfe64 [68] 0.0% 0.000(0000001) 0.000 do_fioxr8v [84] 0.0% 0.000(0000001) 0.000 do_fioxr4v [83] 0.0% 0.000(0000001) 0.000 s_stop [86] 0.0% 0.000(0000002) 0.000 dtime_ [183] ------28.4% 1.386(0032768) 4.867 fake_adi [3] [4] 28.4% 1.386 28.4% 1.386 zsweep [4] ------28.4% 1.386(0032768) 4.867 fake_adi [3] [5] 28.4% 1.386 28.4% 1.386 xsweep [5] ------28.4% 1.386(0032768) 4.867 fake_adi [3] [6] 28.4% 1.386 28.4% 1.386 ysweep [6] ------3.4% 0.168(2097152) 4.867 fake_adi [3] [7] 3.4% 0.168 2.1% 0.101 rand_ [7] 1.4% 0.067(2097152) 0.067 irand_ [8]

There is a block of information for each routine in the program. A number, shown in brackets (for example, [3]), is assigned to each routine for reference. Examine the second, and largest, section in Example 4-14, the one for routine fake_adi.


The line that begins and ends with the reference number (in this example, [3]) describes the routine itself. It resembles the following: [3] 99.8% 4.867 11.1% 0.540 fake_adi [3]

From left to right it shows:
• The total run time attributed to this routine and its descendants, as a percentage of program run time and as a count (99.8% and 4.867 seconds).
• The amount of run time that was consumed in the code of the routine itself. In this example, the routine code accounted for 11.1% of the run time, or 0.54 seconds. The remaining 88.7% of the time was spent in subroutines that it called.
• The name of the routine in question, in this case fake_adi.

The lines that precede this line describe the places that called fake_adi. In the example there was only one such place, so only one line appears, resembling the following: 99.8% 4.867(0000001) 4.867 main [2]

From left to right, it shows:
• All calls made to fake_adi from this location consumed 99.8%, or 4.867 seconds, of run time.
• How many calls were made from this location (just 1, in this case).
• How much time was consumed by this calling routine.
• The name of the calling routine, in this case, main.

When a routine is called from more than one place, these lines show you which locations made the most calls, and how much time the calls from each location cost. This lets you zero in on the important callers and neglect the minor ones.

The lines that follow the central line show the calls this routine makes, and how its time is distributed among them. In Example 4-14, routine fake_adi calls 10 other routines, but the bulk of the ideal time is divided among the first three. These lines let you home in on precisely those subroutines that account for the most time, and ignore the rest.


The butterfly report tells you not just which subroutines are consuming the most time, but which execution paths in the program are consuming time. This lets you decide whether you should optimize a subroutine, inline it, eliminate some of the calls to it, or redesign the algorithm. For example, suppose you find that the bulk of calls to a general-purpose subroutine come from one place, and on examining that call, you find that all the calls are for one simple case. You could either replace the call with in-line code, or you could add a special-purpose subroutine to handle that case very quickly.

The limitation of this report is that it is based on ideal time, and so does not reflect costs due to cache misses and other elapsed-time delays.

Displaying Usertime Call Hierarchy

To get a hierarchical profile that accounts for elapsed time, use usertime sampling instead of ideal time. As shown in Table 4-2 on page 64, a -usertime experiment samples the state of the program every 30 ms. Each sample records the location of the program counter and the entire call stack, showing which routines have been called to reach this point in the program. From this trace data a hierarchical profile can be constructed. (Other sampling experiments record only the PC, not the entire call stack.)

You take a -usertime sample exactly like any other ssrun experiment (see “Performing ssrun Experiments” on page 65). The result for sample program adi2 (edited for brevity) is shown in Example 4-15.

Example 4-15 Usertime Call Hierarchy % ssrun -usertime adi2 Time: 8.424 seconds Checksum: 5.6160428338E+06 % prof -butterfly adi2.usertime.m6836 ------SpeedShop profile listing generated Thu Jun 25 16:28:27 1998 prof -butterfly adi2.usertime.m6836 ... ------Summary of statistical callstack sampling data (usertime)-- 305: Total Samples 0: Samples with incomplete traceback 9.150: Accumulated Time (secs.) 30.0: Sample interval (msecs.)


------Function list, in descending order by exclusive time ------[index] excl.secs excl.% cum.% incl.secs incl.% samples procedure (dso: file, line) [4] 5.670 62.0% 62.0% 5.670 62.0% 189 zsweep (adi2: adi2.f, 113) [5] 1.410 15.4% 77.4% 1.410 15.4% 47 xsweep (adi2: adi2.f, 71) [6] 1.380 15.1% 92.5% 1.380 15.1% 46 ysweep (adi2: adi2.f, 92) [1] 0.510 5.6% 98.0% 9.150 100.0% 305 fake_adi (adi2: adi2.f, 1) [8] 0.120 1.3% 99.3% 0.120 1.3% 4 irand_ (libftn.so: rand_.c, 62) [7] 0.060 0.7% 100.0% 0.180 2.0% 6 rand_ (libftn.so: rand_.c, 67) [2] 0.000 0.0% 100.0% 9.150 100.0% 305 __start (adi2: crt1text.s, 103) [3] 0.000 0.0% 100.0% 9.150 100.0% 305 main (libftn.so: main.c, 76) ------Butterfly function list, in descending order by inclusive time ... ------100.0% 9.150 9.150 main [1] 100.0% 9.150 5.6% 0.510 fake_adi [1] 62.0% 5.670 5.670 zsweep 15.4% 1.410 1.410 xsweep 15.1% 1.380 1.380 ysweep 2.0% 0.180 0.180 rand_ ------62.0% 5.670 9.150 fake_adi [4] 62.0% 5.670 62.0% 5.670 zsweep [4] ------15.4% 1.410 9.150 fake_adi [5] 15.4% 1.410 15.4% 1.410 xsweep [5] ------15.1% 1.380 9.150 fake_adi [6] 15.1% 1.380 15.1% 1.380 ysweep [6] ------2.0% 0.180 9.150 fake_adi [7] 2.0% 0.180 0.7% 0.060 rand_ [7] 1.3% 0.120 0.120 irand_ ------1.3% 0.120 0.180 rand_ [8] 1.3% 0.120 1.3% 0.120 irand_ [8]


The information is less detailed than an ideal time profile (only 305 samples over this entire run), but it reflects elapsed time. Compare the butterfly report section for routine fake_adi in Example 4-15 to the same section in Example 4-14. When actual time differs so much from ideal time, something external to the algorithm is delaying execution. We will investigate this problem later (see “Identifying Cache Problems with Perfex and SpeedShop” on page 142.)

Using Exception Profiling

A program can handle floating-point exceptions using the handle_sigfpes library procedure (documented in the sigfpe(3) reference page). Many programs do not handle exceptions. When exceptions occur anyway, they still cause hardware traps, even though the traps are then ignored in software. A high number of ignored exceptions can cause a program to run slowly for no apparent reason.

Profiling Exception Frequency

The ssrun experiment type -fpe creates a trace file that records all floating-point exceptions. Use prof to display the data as usual. The report shows the procedures and lines that caused exceptions. In effect, this is a sampling profile in which the sampling time base is the occurrence of an exception.

Use exception profiling to verify that a program does not take a significant number of exceptions. When you know that, you can safely specify a higher level of the -TENV:X compiler flag (see “Permitting Speculative Execution” on page 121).
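For example, the test program could be checked with a pair of commands like the following sketch (the numeric suffix of the experiment file is the process ID and differs from run to run):

% ssrun -fpe adi2
% prof adi2.fpe.m12345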

Understanding Treatment of Underflow Exceptions

When a program does generate a significant number of undetected exceptions, the exceptions are likely to be floating-point underflows (a program is not likely to get through functional testing if it suffers a large number of divide-by-zero, overflow, or invalid-operation exceptions). A large number of underflows causes performance problems because each underflow traps to the kernel, which finishes the calculation by setting the result to zero. Although the results are correct, excess system time elapses. On R8000-based systems, the default is to flush underflows to zero in the hardware, avoiding the trap. However, R10000, R5000, and R4400 systems default to trapping on underflow.


If underflow detection is not required for numerical accuracy, underflows can be flushed to zero in hardware. This can be done by methods described in detail in the sigfpe(3C) reference page. The simplest method is to link in the floating-point exception library, libfpe, and at run time to set the TRAP_FPE environment variable to the string UNDERFL=FLUSH_ZERO.
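For example, under the C shell a Fortran program might be linked with libfpe and run with underflows flushed to zero as follows; the program name is illustrative, and the library list is only a sketch of what your program may need:

% f77 -n32 -mips4 -O2 -o myprog myprog.f -lfpe -lfastm -lm
% setenv TRAP_FPE "UNDERFL=FLUSH_ZERO"
% myprog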

The hardware bit that controls how underflows are handled is located in a part of the R10000 floating-point unit called the floating-point status register (FSR). In addition to setting the way underflows are treated, this register also selects which IEEE rounding mode is used and which IEEE exceptions are enabled (that is, not ignored). Furthermore, it contains the floating-point condition bits, which are set by floating-point compare instructions, and the cause bits, which are set when any floating-point exception occurs. The details of the FSR are documented in the MIPS R10000 Microprocessor User Guide listed in “Related Manuals” on page xxix. You can gain programmed access to the FSR using functions described in reference page sigfpe(3C). However, to do so makes your program hardware-dependent.

Using Address Space Profiling

SpeedShop and perfex profile the execution path, but there is a second dimension to program behavior: data references.


Figure 4-1 Code Residence versus Data Reference (figure: program behavior has two dimensions, the virtual text addresses the program executes, profiled by prof, and the virtual data addresses it references, profiled by dprof)

Both dimensions are sketched in Figure 4-1. The program executes up and down through the program text. An execution profile depicts the average intensity of the program’s residence at any point, but execution moves constantly. At the same time, the program constantly refers to locations in its data.

You can use dprof to profile the intensity with which the program uses data memory locations, showing the alternate dimension to that shown by prof. You can use dlook to find out where in the SN0 architecture the operating system placed a program’s pages.


Applying dprof

The dprof tool, like ssrun, executes a program and samples the program while it runs. Unlike ssrun, dprof does not sample the PC and stack. It samples the current operand address of the interrupted instruction. It accumulates a histogram of access to data addresses, grouped in units of virtual page. Example 4-16 shows an application of dprof to the test program.

Example 4-16 Application of dprof % dprof -hwpc -out adi2.dprof adi2 Time: 27.289 seconds Checksum: 5.6160428338E+06 % ls -l adi2.dprof -rw-r--r-- 1 guest 40118 Dec 18 18:54 adi2.dprof % cat adi2.dprof ------address thread reads writes ------0x0010012000 0 1482 0 0x007eff4000 0 59075 1462 0x007eff8000 0 57 22 0x007effc000 0 69 15 0x007f000000 0 75 18 0x007f004000 0 58 10 0x007f008000 0 65 13 0x007f00c000 0 64 20 0x007f010000 0 76 22 ... 0x007ffe0000 0 59 16 0x007ffe4000 0 70 9 0x007ffe8000 0 57 11 0x007ffec000 0 53 8 0x007fff0000 0 56 14 0x007fff4000 0 36 1

Each line contains a count of all references to one virtual memory page from one IRIX process (this program has only one process). Note that the addresses at the left increment by 0x04000, or 16 KB, the default size of a virtual page in larger systems under IRIX 6.5.


Interpreting dprof Output

The dprof report is based on statistical sampling; it does not record all references. The time base is either the interval timer (option -itimer) or the R10000 event cycle counter (option -hwpc, available only on systems with an R10000 CPU). Other time base options are supported; see the dprof(1) reference page.

At the default interrupt frequency using the cycle counter, samples are taken of only 1 instruction in 10,000. This produces a coarse sample that is not likely to be repeatable from run to run. However, even this sampling rate slows program execution by almost a factor of three. You can obtain a more detailed sample by specifying a shorter overflow count (option -ovfl), but this will extend program execution time proportionately.
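For example, a denser sample of the test program could be requested with a command along these lines; the overflow value shown is purely illustrative, and the dprof(1) reference page gives the default and the valid range:

% dprof -hwpc -ovfl 10000 -out adi2.dprof adi2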

The coarse histogram is useful for showing which pages are used. For example, you can plot the data as a histogram. Using gnuplot (which is available on the Silicon Graphics Freeware CDROM), a simple plot of total access density, as shown in Figure 4-2, is obtained as follows:
% /usr/freeware/bin/gnuplot
GNUPLOT
unix version 3.5
patchlevel 3.50.1.17, 27 Aug 93
last modified Fri Aug 27 05:21:33 GMT 1993
Copyright(C) 1986 - 1993 Thomas Williams, Colin Kelley
Send comments and requests for help to [email protected]
Send bugs, suggestions and mods to [email protected]
Terminal type set to 'x11'
gnuplot> set logscale y
gnuplot> plot "


Figure 4-2 On-screen plot of dprof output

The information provided by dprof is not of great benefit to a single-threaded application. But for multithreaded applications, it can reveal a lot. Because access to the local memory is faster than access to remote memories on an SN0 system, the best performance on parallel programs that sustain a lot of cache misses is achieved if each thread primarily accesses its local memory. dprof gives you the ability to analyze which regions of virtual memory each thread of a program accesses.

The output of dprof can also be used as input to dplace; see “Assigning Memory Ranges” on page 252.

Applying dlook

The dlook program (see the dlook(1) reference page), like perfex and dprof, takes a second program’s command line as its argument. It executes that program, and monitors where the pages of text and data are located. At the end of the run, dlook writes a report on the memory placement to the standard error file, or to a file you specify. Example 4-17 shows the result of applying dlook with default options.

Example 4-17 Example of Default dlook Output [farewell 4] dlook adi2 Time: 8.498 seconds Checksum: 5.6160428338E+06 ______Exit : "adi2" is process 1773 and at ~9.38 second is running on: cpu 10 or /hw/module/6/slot/n3/node/cpu/a . data/heap: [0x00010010000,0x00010024000] 5 16K pages on /hw/module/6/slot/n3/node stack: [0x0007eff0000,0x0007fd80000] 868 16K pages on /hw/module/6/slot/n3/node [0x0007fd80000,0x0007fff0000] 156 16K pages on /hw/module/6/slot/n1/node [0x0007fff0000,0x0007fff8000] 2 16K pages on /hw/module/5/slot/n1/node ______

The -mapped option produces a lengthy list of all memory-mapped regions (such regions are normally mapped by library code, not the program itself). The -sample option produces a periodic report during the run, from which you could follow page migration.

Sometimes dlook reports a page as being located on device /hw/mem, instead of a particular module. This means that the page has been added to the program’s virtual address space, but because the program has not yet touched the page, it has not been allocated to real memory.

Summary

You can record the dynamic behavior of a program in a number of different ways, simply, quickly, and without modifying the program. You can display the output of any experiment in different ways, revealing the execution “hot spots” in the code and the effect of its subroutine call structure. All the information you need in order to tune the algorithm is readily available.


Chapter 5: Using Basic Compiler Optimizations

Profiling reveals which code to tune. The most important tool for tuning is the compiler. Ideally, the compiler should convert your program to the fastest possible code automatically, but this is not always possible, because the compiler does not have enough information to make optimal decisions.

As an example, if one loop iterates many times, it will run fastest if the compiler unrolls the loop and performs software pipelining on the unrolled code. However, if the loop iterates only a few times, the time spent in the elaborate housekeeping code generated by pipelining is never recovered, and the loop actually runs more slowly. With some Fortran DO loops and most C loops, the compiler cannot tell how many times a loop will iterate. What should it do? The answer, of course, is that you have to direct the compiler to apply the right optimizations. To do that, you need to understand the types of optimizations the compiler can make, and the command-line options (“compiler flags”) that enable or disable them. These are described in this chapter, in the following sections:
• “Understanding Compiler Options” on page 90 lists the compiler options, or flags, that should be used initially, and discusses the use of a Makefile.
• “Exploiting Software Pipelining” on page 99 describes the uses and control of this important optimization.
• “Informing the Compiler” on page 109 describes the use of compiler flags, pragmas, and coding techniques to give the compiler more information so that it can better optimize and pipeline loops.
• “Exploiting Interprocedural Analysis” on page 125 discusses the concepts and use of IPA and inlining.

The discussion of single-process tuning continues in Chapter 6, “Optimizing Cache Utilization,” covering the closely-related topics of loop nest optimization and memory access.


Understanding Compiler Options

The Silicon Graphics compilers are flexible and offer a wide variety of options to control their operation. The options are documented in a number of reference pages, as well as in the compiler manuals listed in “Compiler Manuals” on page xxix.

Recommended Starting Options

Because the optimizations the compilers can perform interact with each other, and because no two programs are alike, it is impossible to specify a fixed set of options that give the best results in all cases. Nevertheless, the following is a good set of compiler flags to start with: -n32 -mips4 -O2 -OPT:IEEE_arithmetic=3 -lfastm -lm

The meaning of these options is as follows:
-n32                    Use the new 32-bit ABI. Use -64 only when the program needs more than 2 GB of virtual address space.
-mips4                  Compile code for the R10000 CPU.
-O2                     Request the best set of conservative optimizations; that is, those that do not reorder statements and expressions.
-OPT:IEEE_arithmetic=3  Permit optimizations that favor speed over preserving precise numerical rounding rules. If the program proves numerically unstable or inaccurate, use -OPT:IEEE_arithmetic=2 instead.
-lfastm -lm             Link math routines from the optimized math library; take any remaining math routines from the standard library.

The -n32 and -mips4 options are discussed in “Selecting an ABI and ISA” on page 46; the library choice under “libfastm Library” on page 50. The others are discussed in the topics in this chapter.
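For example, the test program used in Chapter 4 could be compiled and linked with this starting set as follows:

% f77 -n32 -mips4 -O2 -OPT:IEEE_arithmetic=3 -o adi2 adi2.f -lfastm -lm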

Using this set of flags for all modules of a large program suite may produce acceptable performance. If so, tuning is complete. Otherwise, the performance-critical modules will require compilation with higher levels of optimization than -O2. These levels are discussed under “Setting Optimization Level with -On” on page 93.


Compiler Option Groups

The compilers have many options that are specified in the conventional UNIX fashion, as a hyphen followed by a word. For example, optimization level -O2; ABI level -n32; separate compilation of a module -c; request assembler source file -S. These options are all documented in the compiler reference pages cc(1), f77(1), f90(1). (You will find that it is useful to keep a printed copy of your compiler’s reference page near your keyboard.)

The compilers categorize other options into groups. These groups of options are documented in separate reference pages. The option groups are listed in Table 5-1.

Table 5-1 Compiler Option Groups

Group Heading Reference Page Use

-CLIST, -FLIST cc(1), f77(1) Listing controls.

-DEBUG debug_group(5) Options for debugging and error-trapping.

-INLINE ipa(5) Options related to the standalone inliner.

-IPA ipa(5) Options related to interprocedural analysis.

-LANG cc(1), f77(1) Language compatibility selections.

-LNO lno(5) Options related to the loop-nest optimizer.

-OPT opt(5) Specific controls of the global optimizer.

-TARG, -TENV cc(1), f77(1) Specify target execution environment features.

The syntax for an option within a group is: -group:option[=value][:option[=value]]...

You can specify a group heading more than once; the effects are cumulative. For example, cc -CLIST:linelength=120:show=ON -TARG:platform=ip27 -TARG:isa=mips4

Many of the grouped options are documented in this and the following chapters. All compiler options mentioned are in the index under the heading “compiler option.”


Compiler Defaults

The compiler has defaults for all options, but often they are not what you might expect. For example, the compiler optimizes the generated code for a particular hardware type as specified with the -TARG option group (detailed later in “Setting Target System with -TARG” on page 95). The default for the target hardware is not, as you might expect, the system on which the compiler runs, but the POWER CHALLENGE R10000—as of release 7.2.1. The default might change in a later release.

You can alter the compiler’s choice of defaults for some, but not all, options by changing the file /etc/compiler.defaults, as documented in the MIPSpro Compiling and Performance Tuning Guide and the cc(1) documents listed in “Related Documents” on page xxix. However, this changes the defaults for all compilations by all users on that system.

Because at least some defaults are under the control of the system administrator, you cannot be certain of the compiler’s defaults on any given system. As a result, the only way to be certain of consistent results is to specify all important compiler options on every compilation. The only convenient way of doing this is with a Makefile.

Using a Makefile

You want to use the many compiler options consistently, regardless of which computer you use to run the compilation; also, you want to experiment with alternative options in a controlled way. The best way to avoid confusion and to get consistent results is to use a Makefile in which you define all important compiler options as variables. Example C-4 on page 292 shows a starting point for a Makefile.

For a straightforward program, a Makefile like the one in Example C-4 may be all you need. To use it, create a directory for your program and move the source files into the directory. Create a file named Makefile containing the lines of Example C-4. Edit the file and fill in the lines that define EXEC (the program name) and FOBJS and COBJS, the names of the component object files.
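The following fragment sketches the general shape of such a Makefile. The variable names match those used in this chapter, but the rules are illustrative rather than a copy of Example C-4, and the command lines under each target must begin with a tab:

FC      = f77
ABI     = -n32 -mips4
OLEVEL  = -O2
DEFINES =
FFLAGS  = $(ABI) $(OLEVEL) -OPT:IEEE_arithmetic=3 $(DEFINES)
LIBS    = -lfastm -lm
EXEC    = adi2
FOBJS   = adi2.o

$(EXEC): $(FOBJS)
	$(FC) $(FFLAGS) -o $(EXEC) $(FOBJS) $(LIBS)

.f.o:
	$(FC) $(FFLAGS) -c $<

clean:
	rm -f $(EXEC) $(FOBJS)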

To build the program, enter % make

The smake program compiles the source modules into object modules using the compiler flags specified in the variables defined in the first few lines, then links the executable program.


To remove the executable and objects, so as to force recompilation of all files, use % make clean

You can override the Makefile setting of any of the variables by specifying its value on the command line. For example, if the OLEVEL variable does not permit debugging (-Ofast=ip27 does not), you can compile a debugging version with the following commands: % make clean; make OLEVEL="-O0 -g3"

You can add extra compilation options with the DEFINES variable. For example, % make DEFINES="-DDEBUG -OPT:alias=restrict"

Setting Optimization Level with -On

The basic optimization flag is -On, where n is 0, 1, 2, 3, or fast. This flag controls which optimizations the compiler will attempt: the higher the optimization level, the more aggressive the optimizations that will be tried. In general, the higher the optimization level, the longer compilation takes.

Start with -O2 for All Modules

A good beginning point for program tuning is optimization level -O2 (or, equivalently, -O). This level performs extensive optimizations that are conservative; that is, they will not cause the program to have numeric roundoff characteristics different from an unoptimized program. Sophisticated (and time-consuming) optimizations such as software pipelining and loop nest optimizations are not performed.

In addition, the compiler does not rearrange the sequence of code very much at this level. At higher levels, the compiler can rearrange code enough that the correspondence between a source line and the generated code becomes hazy. If you profile a program compiled at the -O2 level, you can make use of a prof -heavy report (see “Including Line-Level Detail” on page 73). When you compile at higher levels, it can be difficult to interpret a profile because of code motion and inlining.

Use the -O2 version of your program as the baseline for performance. For some programs, this may be the fastest version. The higher optimization levels have their greatest effects on loop-intensive code and math-intensive code. They may bring little or no improvement to programs that are strongly oriented to integer logic and file I/O.


Compile -O3 or -Ofast for Critical Modules

You should identify the program modules that consume a significant fraction of the program’s run time and compile them with the highest level of optimization. This may be specified with either -O3 or -Ofast. Both flags enable software pipelining and loop nest optimizations. The difference between the two is that -Ofast includes additional optimizations:
• Interprocedural analysis (discussed later under “Exploiting Interprocedural Analysis” on page 125).
• Arithmetic rearrangements that can cause numeric roundoff to be different from an unoptimized program (this can be controlled separately; see “Understanding Arithmetic Standards” on page 95).
• Assumption that certain pointer variables are independent, not aliased (also controlled separately; see “Understanding Aliasing Models” on page 109).

With no argument, -Ofast assumes the default target for execution (see “Compiler Defaults” on page 92). You can specify which machine will run the generated code by naming the “ip” number of the CPU board. The complete list of valid board numbers is given in the cc(1) reference page. The ip number of any system is displayed by the command hinv -c processor. For all SN0 systems, use -Ofast=ip27.
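For example, if profiling shows that one source file dominates the run time, you might compile only that file at the highest level and leave the rest at -O2 (the file names here are hypothetical). The -Ofast=ip27 flag is repeated on the link line because the interprocedural analysis it enables completes at link time:

% f77 -n32 -mips4 -Ofast=ip27 -c solver.f
% f77 -n32 -mips4 -O2 -c setup.f report.f
% f77 -n32 -mips4 -Ofast=ip27 -o prog solver.o setup.o report.o -lfastm -lm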

Use -O0 for Debugging

Use the lowest optimization level, -O0, when debugging your program. This flag turns off all optimizations, so there is a direct correspondence between the original source code and the machine instructions the compiler generates.

You can run a program compiled to higher levels of optimization under a symbolic debugger, but the debugger will have trouble showing you the program’s progress statement by statement, because the compiler often merges statements and moves code around. An optimized program, under a debugger, can be made to stop on procedure entry, but cannot reliably be made to stop on a specified statement.

-O0 is the default when you don’t specify an optimization level, so be sure to specify the optimization level you want.


Setting Target System with -TARG

Although the R10000, R8000, and R5000 CPUs all support the MIPS IV ISA (see “Selecting an ABI and ISA” on page 46), their internal hardware implementations are quite different.

The compiler does not simply choose the machine instructions to be executed, which would be the same for any MIPS IV CPU. It carefully schedules the sequence of instructions so that the CPU will make best use of its parallel functional units. The ideal sequence is different for each of the CPUs that support MIPS IV. Code scheduled for one chip runs correctly, but not necessarily in optimal time, on another.

You can instruct the compiler to generate code optimized for a particular chip via the -r10000, -r8000, and -r5000 flags. These flags, when used while linking, cause libraries optimized for the specified chip, if available, to be linked. Specially optimized versions of libfastm exist for some CPUs.

You can also specify the target chip using an alternative flag, -TARG:proc=rN. This flag tells the compiler to generate code optimized for the specified chip. However, the flag is ignored at link time, so it does not affect which libraries are linked. If -r10000 has been used, -TARG:proc=r10000 is unnecessary.

In addition to specifying the target chip, you can also specify the target processor board. This gives the compiler information about hardware parameters such as cache sizes, which can be used in some optimizations. You can specify the target platform with the -TARG:platform=ipxx flag. For SN0, use -TARG:platform=ip27.

The -Ofast=ip27 flag implies the flags -r10000, -TARG:proc=r10000, and -TARG:platform=ip27.
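For example, to target an SN0 system at -O3 without the additional optimizations that -Ofast turns on, the target flags can be given explicitly (the file name is illustrative):

% f77 -n32 -mips4 -O3 -r10000 -TARG:platform=ip27 -c daxpy.f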

Understanding Arithmetic Standards

For some applications, control of numeric error is paramount, and the programmer devotes great care to ordering arithmetic operations so that the finite precision of computer registers will not introduce and propagate errors. The IEEE 754 standard for floating-point arithmetic is conservatively designed to preserve accuracy, at the expense of speed if necessary, and conformance to it is a key element of a numerically-sensitive program—especially a program that must produce identical answers in different environments.


Many other programs and programmers are less concerned about preserving every bit of precision and more interested in getting answers that are in adequate agreement, as fast as possible. Two compiler options control these issues.

The nuances of mathematical accuracy are discussed in depth in the math(3) reference page. The compiler options discussed in this section are detailed in the opt(5) reference page.

IEEE Conformance

The -OPT:IEEE_arithmetic option controls how strictly the generated code adheres to the IEEE 754 standard for floating-point arithmetic. The flexibility exists for a couple of reasons:
• The MIPS IV ISA defines hardware instructions (such as reciprocal and reciprocal square root) that do not meet the accuracy specified by the IEEE 754 standard. Although the inaccuracy is small (no more than two units in the last place, or 2 ulp), a program that uses these instructions is not strictly IEEE-compliant, and might generate a result that is different from a compliant program.
• Some standard optimization techniques also produce different results from those of a strictly IEEE-compliant implementation.

For example, a standard optimization would be to lift the constant expression out of the following loop:
   do i = 1, n
      b(i) = a(i)/c
   enddo

This results in an optimized loop as follows:
   tmp = 1.0/c
   do i = 1, n
      b(i) = a(i)*tmp
   enddo

This is not allowed by the IEEE standard, because multiplying by the reciprocal (even a completely IEEE-accurate representation of the reciprocal) in place of division can produce results that differ in the least significant place.


You use the compiler flag -OPT:IEEE_arithmetic=n to tell the compiler how to handle these cases. The flag is documented (along with some other IEEE-related options) in the opt(5) reference page. Briefly, the three allowed values of n mean:
1  Full compliance with the standard. Use only when full compliance is essential.
2  Permits use of the slightly-inexact MIPS IV hardware instructions.
3  Permits the compiler to use any identity that is algebraically valid, including rearranging expressions.

Unless you know that numerical accuracy is a priority for your program, start with level 3. If numerical instability seems to be a problem, change to level 2, so that the compiler will not rearrange your expressions. Use level 1 only when IEEE compliance is a priority.

A related flag is -OPT:IEEE_NaN_inf. When this flag is specified as OFF, which it is by default, the compiler is allowed to make certain assumptions that the IEEE standard forbids, for example, assuming that x/x is 1.0 without performing the division.

Roundoff Control

The order in which arithmetic operations are coded into a program dictates the order in which they are carried out. The Fortran code in Example 5-1 specifies that the variable sum is calculated by first adding a(1) to sum, and to that result is added a(2), and so on until a(n) has been added in.

Example 5-1 Simple Summation Loop
   sum = 0.0
   do i = 1, n
      sum = sum + a(i)
   enddo

In most cases, including Example 5-1, the programmed sequence of operations is not necessary to produce a numerically valid result. Insisting on the programmed sequence can keep the compiler from taking advantage of the hardware resources.

For example, suppose the target CPU has a pipelined floating point add unit with a two-cycle latency. When the compiler can rearrange the code, it can unroll the loop, as shown in Example 5-2.


Example 5-2 Unrolled Summation Loop
   sum1 = 0.0
   sum2 = 0.0
   do i = 1, n-1, 2
      sum1 = sum1 + a(i)
      sum2 = sum2 + a(i+1)
   enddo
   do i = i, n
      sum1 = sum1 + a(i)
   enddo
   sum = sum1 + sum2

The compiler unrolls the loop to put two additions in each iteration. This reduces the index-increment overhead by half, and also supplies floating-point adds in pairs so as to keep the CPU’s two-stage adder busy on every cycle. (The final wind-down loop finishes the summation when n is odd.)

The final result in sum is algebraically identical to the result produced in Example 5-1. However, because of the finite nature of computer arithmetic, these two loops can produce answers that differ in the final bits—bits that should be considered roundoff error places in any case.

The -OPT:roundoff flag allows you to specify how strictly the compiler must adhere to such coding-order semantics. As with the -OPT:IEEE_arithmetic flag above, there are several levels that are detailed in reference page opt(5). Briefly, these levels are:
0  Requires strict compliance with programmed sequence, thereby inhibiting most loop optimizations.
1  Allows simple reordering within expressions.
2  Allows extensive reordering like the unrolled reduction loop in Example 5-2. Default when -O3 is specified.
3  Allows any algebraically-valid transformation. Default when -Ofast is used.

Start with level 3 in a program in which floating-point computations are incidental, otherwise with level 2.

Roundoff level 3 permits the compiler to transform expressions, and it also implies the use of inline machine instructions for float-to-integer conversion (separately controlled with the -OPT:fast_nint=on and -OPT:fast_trunc=on flags). These transformations can cause the roundoff characteristics of a program to change. For most programs, small differences in roundoff are not a problem, but in some instances the program’s results can be different enough from what it produced previously to be considered wrong. That event might call into question the numeric stability of the algorithm, but the easiest course of action may be to use a lower roundoff level in order to generate the expected results.

If you are using -Ofast and want to reduce the roundoff level, you will need to set the desired level explicitly: % cc -n32 -mips4 -Ofast=ip27 -OPT:roundoff=2 ...

Exploiting Software Pipelining

Loops imply parallelism. This ranges from, at minimum, instruction-level parallelism up to parallelism sufficient to carry out different iterations of the loop on different CPUs. Multiprocessor parallelism will be dealt with later; this section concentrates on instruction-level parallelism.

Understanding Software Pipelining

Consider the loop in Example 5-3, which implements the vector operation conventionally called DAXPY.

Example 5-3 Basic DAXPY Loop
   do i = 1, n
      y(i) = y(i) + a*x(i)
   enddo

This adds a vector multiplied by a scalar to a second vector, overwriting the second vector. Each iteration of the loop requires the following instructions:
• Two loads (of x(i) and y(i))
• One store (of y(i))
• One multiply-add
• Two address increments
• One loop-end test
• One branch


The superscalar architecture of the CPU can execute up to four instructions per cycle. These time-slots can be filled by any combination of the following:
• One load or store operation
• One ALU 1 instruction
• One ALU 2 instruction
• One floating-point add
• One floating-point multiply

In addition, most instructions are pipelined, and have associated pipeline latencies ranging from one cycle to dozens. Software pipelining is a compiler technique for optimally filling the superscalar time slots. The compiler figures out how best to schedule the instructions to fill up as many superscalar slots as possible, in order to run the loop at top speed.

In software pipelining, each loop iteration is broken up into instructions, as was done in the preceding paragraphs for DAXPY. The iterations are then overlapped in time in an attempt to keep all the functional units busy. Of course, this is not always possible. For example, a loop that does no floating point will not use the floating-point adder or multiplier. And some units are forced to remain idle on some cycles because they have to wait for instructions with long pipeline latencies, or because only four instructions can be selected from the above menu. In addition, the compiler has to generate code to fill the pipeline on the first and second iterations, and code to wind down the operation on the last iteration, as well as special-case code to handle the event that the loop requires only 1 or 2 iterations.

The central part of the generated loop, meant to be executed many times, executes with the pipeline full and tries to keep it that way. The compiler can determine the minimum number of cycles required for this steady-state portion of the loop, and it tries to schedule the instructions so that this minimum-time performance is approached as the number of iterations in a loop increases.

In calculating the minimum number of cycles required for the loop, the compiler assumes that all data are in cache (specifically, in the primary data cache). If the data are not in cache, the loop takes longer to execute. (In this case, the compiler can generate additional instructions to prefetch data from memory into the cache before they are needed. See “Understanding Prefetching” on page 137.) The remainder of this discussion of software pipelining considers only in-cache data. This may be unrealistic, but it is easier to determine if the software pipeline steady-state has achieved the peak rate the loop is capable of.


Pipelining the DAXPY Loop

To see how effective software pipelining is, let us return to the DAXPY example (Example 5-3 on page 99) and consider its instruction schedule. The CPU is capable of, at most, one memory instruction and one multiply-add per cycle. Because three memory operations are done for each multiply-add, the optimal performance for this loop is one-third of the peak floating-point rate of the CPU.

In creating an instruction schedule, the compiler has to consider many hardware restrictions. The following is a sample of the restrictions that affect the DAXPY loop:
• Only one load or store may be performed on each CPU cycle.
• A load of a floating-point datum from the primary cache has a latency of three cycles.
• However, a madd instruction does a multiply followed by an add, and the load of the addend to the instruction can begin just one cycle before the madd, because two cycles will be expended on the multiply before the addend is needed.
• It takes four cycles to complete a madd when the result is passed to another floating point operation, but five cycles when the output must be written to the register file, as is required to store the result.
• The source operand of a store instruction is accessed two cycles after the address operand, so the store of the result of a madd instruction can be started 5 - 2 = 3 cycles after the madd begins.

Based on these considerations, the best schedule for an iteration of the basic DAXPY loop (Example 5-3 on page 99) is shown in Table 5-2:

Table 5-2 Instruction Schedule for Basic DAXPY

Cycle Inst 1 Inst 2 Inst 3 Inst 4

0 ld *x ++x

1 ld *y

2

3 madd

4

5

6 st *y br ++y


The simple schedule for this loop achieves two floating-point operations (multiply and add, in one instruction) every seven cycles, or one-seventh of the potential peak.

This schedule can be improved only by unrolling the loop so that it calculates two vector elements per iteration, as shown in Example 5-4.

Example 5-4 Unrolled DAXPY Loop
   do i = 1, n-1, 2
      y(i+0) = y(i+0) + a*x(i+0)
      y(i+1) = y(i+1) + a*x(i+1)
   enddo
   do i = i, n
      y(i+0) = y(i+0) + a*x(i+0)
   enddo

The best schedule for the steady-state part of the unrolled loop is shown in Table 5-3.

Table 5-3 Instruction Schedule for Unrolled DAXPY

Cycle Inst 1 Inst 2 Inst 3 Inst 4

0 ld *(x+0)

1 ld *(x+1)

2 ld *(y+0) x+=2

3 ld *(y+1) madd0

4 madd1

5

6 st *(y+0)

7 st *(y+1) y+=2 br

This loop achieves four floating point operations in eight cycles, or one-fourth of peak; it is closer to the nominal rate for this loop, one-third of peak, but still not perfect.

A compiler-generated software pipeline schedule for the DAXPY loop is shown in Example 5-5.


Example 5-5 Compiler-Generated DAXPY Schedule
[ windup code to load registers omitted ]
loop:
Cycle 0:  ld y4
Cycle 1:  st y0
Cycle 2:  st y1
Cycle 3:  st y2
Cycle 4:  st y3
Cycle 5:  ld x5
Cycle 6:  ld y5
Cycle 7:  ld x6
Cycle 8:  ld x7            madd4
Cycle 9:  ld y6            madd5
Cycle 10: ld y7   y+=4     madd6
Cycle 11: ld x0   bne out  madd7
Cycle 0:  ld y0
Cycle 1:  st y4
Cycle 2:  st y5
Cycle 3:  st y6
Cycle 4:  st y7
Cycle 5:  ld x1
Cycle 6:  ld y1
Cycle 7:  ld x2
Cycle 8:  ld x3            madd0
Cycle 9:  ld y2            madd1
Cycle 10: ld y3   y+=4     madd2
Cycle 11: ld x4   beq loop madd3
out:
[ winddown code to store final results omitted ]

This schedule is based on unrolling the loop four times. It achieves eight floating point operations in twelve cycles, or one-third of peak, so its performance is optimal. This schedule consists of two blocks of 12 cycles. Each 12-cycle block is called a replication since the same code pattern repeats for each block. Each replication accomplishes four iterations of the loop, containing four madds and 12 memory operations, as well as a pointer increment and a branch, so each replication achieves the nominal peak calculation rate. Iterations of the loop are spread out over the replications in a pipeline so that the superscalar slots are filled up.

Figure 5-1 shows the memory operations and madds for 12 iterations of the DAXPY loop. Time, in cycles, runs horizontally across the diagram.


Figure 5-1 DAXPY Software Pipeline Schedule (figure: windup, two replications, and winddown phases spread over cycles 0 through 37, showing the load x, load y, madd, store y, pointer-increment, and branch operations of each iteration)

The first five iterations fill the software pipeline. This is the windup stage; it was omitted from Example 5-5 for brevity. Cycles 9 through 20 make up the first replication; cycles 21 through 32, the second replication; and so on. The compiler has unrolled the loop four times, so each replication completes four iterations. Each iteration is spread out over several cycles: 15 for the first of the four unrolled iterations, 10 for the second, and 9 for each of the third and fourth iterations. Because only one memory operation can be executed per cycle, the four iterations must be offset from one another, and it takes 18 cycles to complete all four of them. Two replications are required to contain the full 18 cycles of work for the four iterations.

A test at the end of each iteration determines if all possible groups of four iterations have been started. If not, execution drops through to the next replication or branches back to the first replication at the beginning of the loop. If all groups of four iterations have begun, execution jumps out to code that finishes the in-progress iterations. This is the wind-down code; for brevity it was omitted from Example 5-5.


Reading Software Pipelining Messages

To help you determine how effective the compiler has been at software pipelining, the compiler can generate a report card for each loop. You can request that these report cards be written to a listing file, or assembler output can be generated to see them.

As an example, once again consider the DAXPY loop, as in Example 5-6.

Example 5-6 Basic DAXPY Loop Code
      subroutine daxpy(n,a,x,y)
      integer n
      real*8 a, x(n), y(n)
      do i = 1, n
         y(i) = y(i) + a*x(i)
      enddo
      return
      end

The software pipelining report card is produced as part of the assembler file that is an optional output of a compilation. Use the -S compiler flag to generate an assembler file; the report card will be mixed in with the assembly code, as follows (the flag -LNO:opt=0 prevents certain cache optimizations that would obscure the current study of software pipelining): % f77 -n32 -mips4 -O3 -OPT:IEEE_arithmetic=3 -LNO:opt=0 -S daxpy.f

An assembly listing is voluminous and confusing. You can use the shell script swplist (shown in Example C-5 on page 294), which generates the assembler listing, extracts the report cards, and inserts each into a source listing file just above the loop it pertains to. The listing file is given the same name as the file compiled, but with the extension .swp. swplist is used with the same options one uses to compile: % swplist -n32 -mips4 -O3 -OPT:IEEE_arithmetic=3 -LNO:opt=0 -c daxpy.f

The report card generated by either of these methods resembles Example 5-7.


Example 5-7 Sample Software Pipeline Report Card
#
# Pipelined loop line 6 steady state
#
#    50 estimated iterations before pipelining
#     2 unrollings before pipelining
#     6 cycles per 2 iterations
#     4 flops        ( 33% of peak) (madds count as 2)
#     2 flops        ( 16% of peak) (madds count as 1)
#     2 madds        ( 33% of peak)
#     6 mem refs     (100% of peak)
#     3 integer ops  ( 25% of peak)
#    11 instructions ( 45% of peak)
#     2 short trip threshold
#     7 ireg registers used.
#     6 fgr registers used.
#

The report card tells you how effectively the CPU’s hardware resources are being used in the schedule generated for the loop. There are lines showing how many instructions of each type there are per replication and what percentage of each resource’s peak utilization has been achieved.

The report card in Example 5-7 is for the loop starting at line 6 of the source file (this line number may be approximate). The loop was unrolled two times and then software pipelined. The schedule generated by the compiler requires six cycles per replication, and each replication completes two iterations because of the unrolling.

The steady-state performance of the loop is 33% of the floating-point peak. The “madds count as 2” line counts floating-point operations. This indicates how fast the CPU can run this loop on an absolute scale. The peak rate of the CPU is two floating-point operations per cycle, or twice the CPU clock rate. Because this loop reaches 33% of peak, it has an absolute rate of two-thirds the CPU clock, or 133 MFLOPS in a 200 MHz CPU.

Alternatively, the “madds count as 1” line counts floating-point instructions. Since the floating point work in most loops does not consist entirely of madd instructions, this count indicates how much of the floating point capability of the chip is being utilized in the loop given its particular mix of floating point instructions. A 100% utilization here, even if the floating point operation rate is less than 100% of peak, means that the chip is running as fast as it can in this loop.


For the R10000 CPU, which has an adder unit and a multiplier, this line does not provide much additional information beyond the “madds count as 2” line. In fact, the utilization percentage is misleading in this example, because both of the floating-point instructions in the replication are actually madds, and these use both the adder and the multiplier. Here, the instruction is counted as if it simply uses one of the units. On an R8000 CPU— for which there are two complete floating point units, each capable of add, multiply, and madd instructions—this line provides better information. You can easily see cases in which the chip’s floating point units are 100% utilized even though the CPU is not sustaining two floating-point operations per unit per cycle.

Each software pipeline replication described in Example 5-7 contains six memory references, the most possible in this many cycles (shown as “100% of peak”). Thus, this loop is memory-constrained. Sometimes that’s unavoidable, and no additional (non-memory) resources can make the loop run faster. But in other situations, loops can be rewritten to reduce the number of memory operations, and in these cases performance can be improved. An example of such a loop is presented later.

Example 5-7 shows that only three integer operations are used (presumably a branch and two pointer increments). The total number of instructions for the replication is 11, so there are plenty of free superscalar slots—the theoretical limit would be 24, or 4 per cycle.

The report card indicates that there is a “short trip threshold” of two. This means that, for loops with fewer than two groups of two iterations, a separate, non-pipelined piece of code is used. Building up and draining the pipeline imposes some overhead, so pipelining is avoided for very short loops.

Finally, Example 5-7 reports that seven integer registers (ireg) and six floating-point registers (fgr) were required by the compiler to schedule the code. Availability of registers is not an issue in this case, but bigger and more complicated loops require more registers, sometimes more than the CPU provides. In these cases, the compiler must either introduce some scratch memory locations to hold intermediate results, increasing the number of load and store operations for the loop, or stretch out the loop to allow greater reuse of the existing registers. When these problems occur, splitting the loop into smaller pieces can often eliminate them.

This report card provides you with valuable information. If a loop is running well, that is validated by the information it contains. If the speed of the loop is disappointing, it provides clues as to what should be done—either via the use of additional compiler flags or by modifying the code—to improve the performance.


Enabling Software Pipelining with -O3

The compiler does not software pipeline loops by default. To enable software pipelining, you must use the highest level of optimization, -O3. This can be specified with either -Ofast, as above, or with -O3; the difference is that -Ofast is a superset of -O3 and also turns on other optimizations such as interprocedural analysis (discussed later under “Exploiting Interprocedural Analysis” on page 125). (You can use the switch -SWP:=ON with lower levels of optimization, but -O3 also enables loop nest and other optimizations that complement software pipelining, so this alternative is not recommended.)

Software pipelining is not done by default, because it increases compilation time. Although finding the number of cycles in an optimal schedule is easy, producing a schedule given a finite number of registers is difficult. Heuristic algorithms are employed to try to find a schedule. In some instances they fail, and only a schedule with more cycles than optimal can be generated. The schedule that is found can vary depending on subtle code differences or the choice of compiler flags.

Dealing with Software Pipelining Failures

Some loops are well suited to software pipelining. In general, vectorizable loops, such as the DAXPY loop (see Example 5-5), compile well with no special effort from you. But getting some loops to software pipeline can be difficult. Fortunately, other optimizations—interprocedural analysis, loop nest optimizations—and compiler directives can help in these situations.

For an in-order superscalar CPU such as the R8000 CPU, the key loops in a program must be software pipelined to achieve good performance; the functional units sit idle if concurrent work is not scheduled for them. With its out-of-order execution, the R10000 CPU can be much more forgiving of the code generated by the compiler; even code scheduled for another CPU can still run well. Nevertheless, there is a limit to the instruction reordering the CPU can accomplish, so getting the most time-consuming loops in a program to software pipeline is still key to achieving top performance.

Not all loops software pipeline. One common cause of this is a subroutine or function call inside the loop. When a loop contains a call, the software pipelining report card is:

# Loop line 6 wasn’t pipelined because:
#       Function call in the DoBody
#


If the called subroutine contains loops of its own, then this software pipelining failure is inconsequential because it is the loops in the subroutine that need to be pipelined. If there are no loops in the subroutine, however, finding a way to get the loop containing it to software pipeline may be important for achieving good performance.

One way to remedy this is to replace the subroutine call with a copy of the subroutine’s code. This is inlining, and although it can be done manually, the compiler can do it automatically as part of interprocedural analysis (“Exploiting Interprocedural Analysis” on page 125).

While-loops, and loops that contain goto statements that jump out of the loop, do not software pipeline. Non-pipelined loops can still run well on SN0 systems. For in-order CPUs such as the R8000, however, such loops could incur a significant performance penalty, so if an alternate implementation exists that converts the loop to a standard do- or for-loop, use it.

Finally, sometimes the compiler notes a “recurrence” in the SWP report card. This indicates that the compiler believes that the loop depends, or might depend, on a value produced in a preceding iteration of the loop. This kind of dependency between iterations inhibits pipelining. Sometimes it is real; more often it is only apparent, and can be removed by specifying a different aliasing model.

Informing the Compiler

Optimization level -O3 enables a variety of optimizations including software pipelining and loop nest optimization. But a handful of other options and directives can improve the software pipelined schedules that the compiler generates and speed up non-pipelined code as well. These are described next, before delving into fine-tuning interprocedural analysis and loop nest optimizations.

Understanding Aliasing Models

The problem with a pointer is that it might point to anything, including a named variable or the value also addressed by a different pointer. The possibility of “aliasing”—the possibility that two unrelated names might refer to the identical memory—creates uncertainty that requires the compiler to make the most conservative assumptions about dependencies caused by dereferencing pointer variables.


Without assurance from the programmer, the compiler needs to be cautious and assume that any two pointers can point to the same location in memory. This caution removes many optimization opportunities, particularly for loops. When this is not the case, you can tell the compiler using the -OPT:alias= flag. The many possible aliasing models are detailed in the opt(5) reference page. The following list summarizes their meanings:

any                Any pointer can point to anything. C default.
common_scalar      A scalar in a common block cannot be aliased as part of an array in the block.
cray_pointer       Pointers never refer to named variables; other restrictions apply.
no_cray_pointer    Pointer storage can overlay named variables. Fortran default.
typed              Pointers of distinct types point to distinct objects. Default when -Ofast is used.
no_typed           Pointers of different types can point to the same memory.
unnamed            Pointers never refer to named variables.
no_unnamed         Pointers can point to named variables.
restrict           Distinct pointers point to distinct, non-overlapping memory.
no_restrict        Distinct pointers can point to overlapping memory areas.
parm               Parameters do not alias variables. Fortran default.
no_parm            Parameters can alias variables.

Use Alias=Restrict When Possible

When you are sure that different pointer variables never point to overlapping regions of memory, use the -OPT:alias=restrict flag to allow the compiler to generate the best-performing code. If you use this flag to compile code in which different pointers can alias the same data, incorrect code can be generated. That would not be a compiler problem; you provided the compiler with false information.

As an example, consider the C implementation of DAXPY in Example 5-8.


Example 5-8 C Implementation of DAXPY Loop

void daxpy(int n, double a, double *x, double *y)
{
    int i;
    for (i=0; i<n; i++) {
        y[i] += a * x[i];
    }
}

When Example 5-8 is compiled with the following command, it produces the software pipelining message shown in Example 5-9:

% cc -n32 -mips4 -Ofast=ip27 -OPT:IEEE_arithmetic=3 -LNO:opt=0 -c daxpy.c

(The flag -LNO:opt=0 is used to prevent certain cache optimizations that would obscure this study of software pipelining performance.)

Example 5-9 SWP Report Card for C Loop with Default Alias Model

# Pipelined loop line 8 steady state
#
# 100 estimated iterations before pipelining
# Not unrolled before pipelining
# 7 cycles per iteration
# 2 flops ( 14% of peak) (madds count as 2)
# 1 flop ( 7% of peak) (madds count as 1)
# 1 madd ( 14% of peak)
# 3 mem refs ( 42% of peak)
# 3 integer ops ( 21% of peak)
# 7 instructions ( 25% of peak)
# 1 short trip threshold
# 4 ireg registers used.
# 3 fgr registers used.
# 3 min cycles required for resources
# 6 cycles required for recurrence(s)
# 3 operations in largest recurrence
# 7 cycles in minimum schedule found

The compiler cannot determine the data dependencies between x and y, and has to assume that storing into y[i] could modify some element of x. The compiler plays it safe and produces a schedule that is more than twice as slow as is ideal.


When -OPT:alias=restrict is used, the report resembles Example 5-10.

Example 5-10 SWP Report Card for C Loop with Alias=Restrict

# Pipelined loop line 8 steady state
#
# 50 estimated iterations before pipelining
# 2 unrollings before pipelining
# 6 cycles per 2 iterations
# 4 flops ( 33% of peak) (madds count as 2)
# 2 flops ( 16% of peak) (madds count as 1)
# 2 madds ( 33% of peak)
# 6 mem refs (100% of peak)
# 3 integer ops ( 25% of peak)
# 11 instructions ( 45% of peak)
# 2 short trip threshold
# 7 ireg registers used.
# 6 fgr registers used.

Because it knows that the x and y arrays are nonoverlapping, the compiler can generate an ideal schedule. As long as the subroutine is called only with nonoverlapping arrays, so that this assumption holds, the code is valid and runs significantly faster.
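For reference, this schedule can be produced with the same compile line used before, adding only the aliasing model:

% cc -n32 -mips4 -Ofast=ip27 -OPT:IEEE_arithmetic=3:alias=restrict -LNO:opt=0 -c daxpy.c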

Use Alias=Disjoint When Necessary

When the code uses pointers to pointers, -OPT:alias=restrict is insufficient to tell the compiler that data dependencies do not exist. If this is the case, using -OPT:alias=disjoint allows the compiler to ignore potential data dependencies (disjoint is a superset of restrict).

C code using multidimensional arrays is a good candidate for the -OPT:alias=disjoint flag. Consider the nest of loops shown in Example 5-11.

Example 5-11 C Loop Nest on Multidimensional Array

for ( i=1; i< size_x1-1 ;i++)
  for ( j=1; j< size_x2-1 ;j++)
    for ( k=1; k< size_x3-1 ;k++) {
      out3 = weight3*( field_in[i-1][j-1][k-1] + field_in[i+1][j-1][k-1] +
                       field_in[i-1][j+1][k-1] + field_in[i+1][j+1][k-1] +
                       field_in[i-1][j-1][k+1] + field_in[i+1][j-1][k+1] +
                       field_in[i-1][j+1][k+1] + field_in[i+1][j+1][k+1] );
      out2 = weight2*( field_in[i  ][j-1][k-1] + field_in[i-1][j  ][k-1] +
                       field_in[i+1][j  ][k-1] + field_in[i  ][j+1][k-1] +
                       field_in[i-1][j-1][k  ] + field_in[i+1][j-1][k  ] +
                       field_in[i-1][j+1][k  ] + field_in[i+1][j+1][k  ] +
                       field_in[i  ][j-1][k+1] + field_in[i-1][j  ][k+1] +
                       field_in[i+1][j  ][k+1] + field_in[i  ][j+1][k+1] );
      out1 = weight1*( field_in[i  ][j  ][k-1] + field_in[i  ][j  ][k+1] +
                       field_in[i  ][j+1][k  ] + field_in[i+1][j  ][k  ] +
                       field_in[i-1][j  ][k  ] + field_in[i  ][j-1][k  ] );
      out0 = weight0*  field_in[i  ][j  ][k  ];
      field_out[i][j][k] = out0+out1+out2+out3;
    }

Here, the three-dimensional arrays are declared as

double*** field_out;
double*** field_in;

Without input from the user, the compiler cannot tell if there are data dependencies between different elements of one array, let alone between the two arrays. Because of the multiple levels of indirection, -OPT:alias=restrict does not provide sufficient information. As a result, compiling Example 5-11 with the following options produces the software pipelining report card shown in Example 5-12:

% cc -n32 -mips4 -Ofast=ip27 -OPT:IEEE_arithmetic=3:alias=restrict -c stencil.c


Example 5-12 SWP Report Card for Stencil Loop with Alias=Restrict

# Pipelined loop line 77 steady state
#
# 100 estimated iterations before pipelining
# Not unrolled before pipelining
# 42 cycles per iteration
# 30 flops ( 35% of peak) (madds count as 2)
# 27 flops ( 32% of peak) (madds count as 1)
# 3 madds ( 7% of peak)
# 42 mem refs (100% of peak)
# 16 integer ops ( 19% of peak)
# 85 instructions ( 50% of peak)
# 1 short trip threshold
# 20 ireg registers used.
# 12 fgr registers used.

However, you can instruct the compiler that even multiple indirections are independent with -OPT:alias=disjoint. Compiling with this aliasing model produces the results shown in Example 5-13.

Example 5-13 SWP Report Card for Stencil Loop with Alias=Disjoint

#
# Pipelined loop line 77 steady state
#
# 100 estimated iterations before pipelining
# Not unrolled before pipelining
# 28 cycles per iteration
# 30 flops ( 53% of peak) (madds count as 2)
# 27 flops ( 48% of peak) (madds count as 1)
# 3 madds ( 10% of peak)
# 28 mem refs (100% of peak)
# 12 integer ops ( 21% of peak)
# 67 instructions ( 59% of peak)
# 2 short trip threshold
# 14 ireg registers used.
# 15 fgr registers used.
#

This is a 50% improvement in performance (28 cycles per iteration, down from 42), and all it took was a flag to tell the compiler that the pointers can’t alias each other.


Breaking Other Dependencies

In some situations, code ambiguity goes beyond pointer aliasing. Consider the loop in Example 5-14.

Example 5-14 Indirect DAXPY Loop

void idaxpy( int n, double a, double *x, double *y, int *index)
{
    int i;
    for (i=0; i<n; i++) {
        y[index[i]] += a * x[index[i]];
    }
}

You have seen this DAXPY operation before, except here the vector elements are chosen indirectly through an index array. (The same technique could as easily be used in Fortran.) Assume that the vectors x and y do not overlap, and compile as before (again with -LNO:opt=0 only to clarify the results):

% cc -n32 -mips4 -Ofast=ip27 -OPT:IEEE_arithmetic=3:alias=restrict -LNO:opt=0 -c idaxpy.c

The results, as shown in Example 5-15, are less than optimal.

Example 5-15 SWP Report Card on Indirect DAXPY

# 50 estimated iterations before pipelining
# 2 unrollings before pipelining
# 10 cycles per 2 iterations
# 4 flops ( 20% of peak) (madds count as 2)
# 2 flops ( 10% of peak) (madds count as 1)
# 2 madds ( 20% of peak)
# 8 mem refs ( 80% of peak)
# 8 integer ops ( 40% of peak)
# 18 instructions ( 45% of peak)
# 1 short trip threshold
# 10 ireg registers used.
# 4 fgr registers used.
#
# 8 min cycles required for resources
# 10 cycles in minimum schedule found


When run at optimal speed, this loop should be constrained by loads and stores. In that case, the memory references would be at 100% of peak. The problem is that, since the values of the index array are unknown, there could be data dependencies between loop iterations for the vector y. For example, if index[i] = 1 for all i, all iterations would update the same vector element. In general, if the values in index are not a permutation of unique integers from 0 to n-1 (1 to n, in Fortran), it could happen that a value stored on one iteration could be referenced on the next. In that case, the best pipelined code could produce a wrong answer.
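If you are not sure whether the index values repeat, a one-time check such as the following can confirm the assumption before you assert it to the compiler. This is only a sketch; the helper function is hypothetical and not part of the example above.

#include <stdlib.h>

/* Return 1 if index[0..n-1] holds n distinct values in the range 0..n-1,
   that is, a permutation; return 0 otherwise (or if memory is short). */
int index_is_permutation(const int *index, int n)
{
    char *seen = calloc((size_t)n, 1);
    int i, ok = 1;
    if (seen == NULL)
        return 0;                      /* be conservative on failure */
    for (i = 0; i < n && ok; i++) {
        if (index[i] < 0 || index[i] >= n || seen[index[i]])
            ok = 0;                    /* out of range or duplicated */
        else
            seen[index[i]] = 1;
    }
    free(seen);
    return ok;
}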

If you know that the values of the array index are such that no dependencies will occur, you can inform the compiler of this via the ivdep compiler directive. For a Fortran program, the directive looks like Example 5-16.

Example 5-16 Indirect DAXPY in Fortran with ivdep

c$dir ivdep
      do i = 1, n
         y(ind(i)) = y(ind(i)) + a * x(ind(i))
      enddo

In C, the directive is a pragma and is used as shown in Example 5-17.

Example 5-17 Indirect DAXPY in C with ivdep

#pragma ivdep
for (i=0; i<n; i++) {
    y[index[i]] += a * x[index[i]];
}

With this change, the schedule is improved by 25%, as shown in Example 5-18.

Example 5-18 SWP Report Card for Indirect DAXPY with ivdep

# 50 estimated iterations before pipelining
# 2 unrollings before pipelining
# 8 cycles per 2 iterations
# 4 flops ( 25% of peak) (madds count as 2)
# 2 flops ( 12% of peak) (madds count as 1)
# 2 madds ( 25% of peak)
# 8 mem refs (100% of peak)
# 8 integer ops ( 50% of peak)
# 18 instructions ( 56% of peak)
# 2 short trip threshold
# 14 ireg registers used.
# 4 fgr registers used.


The ivdep pragma gives the compiler permission to ignore the dependencies that are possible from one iteration to the next. As long as those dependencies do not actually occur, the code produces correct answers. But if they do in fact occur, the output can be wrong. Just as when misusing the alias models, this is a case of user error.

There are a few subtleties in how the ivdep directive works. First, note that the directive applies only to inner loops. Second, it applies only to dependencies that are carried by the loop. Consider the code segment in Example 5-19.

Example 5-19 Loop with Two Types of Dependency

      do i = 1, n
         a(index(1,i)) = b(i)
         a(index(2,i)) = c(i)
      enddo

Two types of dependencies are possible in this loop. If index(1,i) is the same as index(1,i+k), or if index(2,i) is the same as index(2,i+k), for some value of k such that i+k is in the range 1 to n, there is a loop-carried dependency. Because the code is in a loop, some element of a will be stored into twice, inappropriately. (If the loop is parallelized, you cannot predict which assignment will take effect.)

If index(1,i) is the same as index(2,i) for some value of i, there is a non-loop-carried dependency. Some element of a will be assigned from both b and c, and this would happen whether the code was in a loop or not.

In Example 5-17 there is a potential loop-carried dependency. The ivdep directive informs the compiler that the dependency is not actually present—there are no duplicate values in the index array. But if the compiler detects a potential non-loop-carried dependency (a dependency between statements in one iteration), the ivdep directive does not instruct the compiler to ignore these potential dependencies. To force even these dependencies to be overlooked, extend the behavior of the ivdep directive via the -OPT:liberal_ivdep=TRUE flag. When this flag is used, an ivdep directive means to ignore all dependencies in a loop.

With either variant of ivdep, the compiler issues a warning when the directive has instructed it to break an obvious dependence.

Example 5-20 C Loop with Obvious Loop-Carried Dependence

for (i=0; i<n; i++) {
    a[i+1] = a[i] + 3;
}


In Example 5-20 the dependence is real. Forcing the compiler to ignore it almost certainly produces incorrect code. Nevertheless, an ivdep pragma breaks the dependence, and the compiler issues a warning and then schedules the code so that the iterations are carried out independently.

One more variant of ivdep is enabled with the flag -OPT:cray_ivdep=TRUE. This instructs the compiler to use the semantics that Cray Research specified when it originated this directive for its vectorizing compilers. That is, only lexically backward loop-carried dependencies are broken. Example 5-20 is an instance of a lexically backward dependence; Example 5-21 is not.

Example 5-21 C Loop with Lexically-Forward Dependency

for (i=0; i<=n; i++) {
    a[i] = a[i+1] + 3;
}

Here, the dependence is from the load (a[i+1]) to the store (a[i]), and the load comes lexically before the store. In this case, a Cray semantics ivdep directive would not break the dependence.

Improving C Loops

The freedom of C looping syntax, as compared to Fortran loops, presents the compiler with some challenges. You provide extra information by specifying an alias model and the ivdep pragma, and these allow you to achieve good performance from C despite the language semantics. But a few other things can be done to improve the speed of programs written in C.

Global variables can sometimes cause performance problems, because it is difficult for the compiler to determine whether they alias other variables. If loops involving global variables do not software pipeline as efficiently as they should, you can often cure the problem by making local copies of the global variables.
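For example (a sketch with hypothetical names, not code from this guide), a loop that multiplies by a global scale factor can pipeline better when the global is first copied into a local variable, because the local copy clearly cannot be changed by the stores in the loop:

double scale_factor;                 /* global: a store through y might alias it */

void scale(int n, double *y)
{
    int i;
    double s = scale_factor;         /* local copy is clearly loop-invariant */
    for (i = 0; i < n; i++)
        y[i] *= s;
}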

Another problem arises from using dereferenced pointers in loop tests.

Example 5-22 C Loop Test Using Dereferenced Pointer

for (i=0; i<(*n); i++) {
    y[i] += a * x[i];
}


The loop in Example 5-22 will not software pipeline, because the compiler cannot tell whether the value addressed by n will change during the loop. If you know that this value is loop-invariant, you can inform the compiler by creating a temporary scalar variable to use in the loop test, as shown in Example 5-23.

Example 5-23 C Loop Test Using Local Copy of Dereferenced Pointer

int lim = *n;
for (i=0; i<lim; i++) {
    y[i] += a * x[i];
}

The loop in Example 5-23 should now pipeline comparably to Example 5-10.

Example 5-24 shows a different case of loop-invariant pointers leading to aliasing problems.

Example 5-24 C Loop with Disguised Invariants

void sub1( double **dInputToHidden, double *HiddenDelta,
           double *Input, double **InputToHidden,
           double alpha, double eta,
           int NumberofInputs, int NumberofHidden)
{
    int i, j;
    for (i=0; i<NumberofInputs; i++)
        for (j=0; j<NumberofHidden; j++) {
            dInputToHidden[i][j] = eta * HiddenDelta[j] * Input[i] +
                                   alpha * dInputToHidden[i][j];
            InputToHidden[i][j] += dInputToHidden[i][j];
        }
}

This loop is similar to a DAXPY; after pipelining it should be memory bound with about a 50% floating-point utilization. But compiling with -OPT:alias=disjoint does not quite achieve this level of performance.


Example 5-25 SWP Report Card for Loop with Disguised Invariance

# Pipelined loop line 19 steady state
# 100 estimated iterations before pipelining
# Not unrolled before pipelining
# 7 cycles per iteration
# 5 flops ( 35% of peak) (madds count as 2)
# 4 flops ( 28% of peak) (madds count as 1)
# 1 madd ( 14% of peak)
# 7 mem refs (100% of peak)
# 6 integer ops ( 42% of peak)
# 17 instructions ( 60% of peak)
# 3 short trip threshold
# 16 ireg registers used.
# 8 fgr registers used.

The problem is the pointers indexed by i; they cause the compiler to see dependencies that aren’t there. This problem can be fixed by hoisting the invariants out of the inner loop, as shown by the assignments at the top of the outer loop in Example 5-26.

Example 5-26 C Loop with Invariants Exposed

void sub2( double **dInputToHidden, double *HiddenDelta,
           double *Input, double **InputToHidden,
           double alpha, double eta,
           int NumberofInputs, int NumberofHidden)
{
    int i, j;
    double *dInputToHiddeni, *InputToHiddeni, Inputi;
    for (i=0; i<NumberofInputs; i++) {
        dInputToHiddeni = dInputToHidden[i];
        InputToHiddeni  = InputToHidden[i];
        Inputi          = Input[i];
        for (j=0; j<NumberofHidden; j++) {
            dInputToHiddeni[j] = eta * HiddenDelta[j] * Inputi +
                                 alpha * dInputToHiddeni[j];
            InputToHiddeni[j] += dInputToHiddeni[j];
        }
    }
}


When the loop is written this way, the compiler can see that dInputToHidden[i], InputToHidden[i], and Input[i] will not change during execution of the inner loop, so it can schedule that loop more aggressively, resulting in 5 cycles per iteration rather than 7.

Example 5-27 SWP Report Card for Modified Loop

# Pipelined loop line 52 steady state
# 100 estimated iterations before pipelining
# Not unrolled before pipelining
# 5 cycles per iteration
# 5 flops ( 50% of peak) (madds count as 2)
# 4 flops ( 40% of peak) (madds count as 1)
# 1 madd ( 20% of peak)
# 5 mem refs (100% of peak)
# 4 integer ops ( 40% of peak)
# 13 instructions ( 65% of peak)
# 3 short trip threshold
# 8 ireg registers used.
# 10 fgr registers used.

Permitting Speculative Execution

When extra information is conveyed to the compiler, it can generate faster code. One useful bit of information that the compiler can sometimes exploit is knowledge of which hardware exceptions can be ignored. An exception is an event that requires intervention by the operating system. Examples include floating-point exceptions such as divide by zero, and memory access exceptions such as a segmentation fault. Any program can choose which exceptions it will pay attention to and which it will ignore. For example, floating-point underflows generate exceptions, but their effect on the numeric results is usually negligible, so they are typically ignored by flushing them to zero (via hardware). The divide-by-zero exception, in contrast, is generally left enabled, because it usually indicates a fatal error in the program.

Software Speculative Execution

How the compiler takes advantage of disabled exceptions is best seen through an example. Consider a program that is known not to make invalid memory references. It runs fine whether memory exceptions are enabled or disabled. But if they are disabled and the compiler knows this, it can generate code that may speculatively make references to potentially invalid memory locations; for example, a reference to an array element that might lie beyond the end of an array. Such code can run faster than nonspeculative code on a superscalar architecture, if it takes advantage of functional units that would otherwise be idle. In addition, the compiler can often reduce the amount of software pipeline wind-down code when it can make such speculative memory operations; this allows more of the loop iterations to be spent in the steady-state code, leading to greater performance.

When the compiler generates speculative code, it moves instructions in front of a branch that previously had protected them from causing exceptions. For example, accesses to memory elements beyond the end of an array are normally protected by the test at the end of the loop. Speculative code can be even riskier than this. For instance, a source line such as that in Example 5-28 might be compiled into the equivalent of Example 5-29.

Example 5-28 Conventional Code to Avoid an Exception

if (x .ne. 0.0) b = 1.0/x

Example 5-29 Speculative Equivalent Permitting an Exception

tmp = 1.0/x
if (x .ne. 0.0) b = tmp

In Example 5-29, the division operation is started first so that it can run in the floating point unit concurrently with the test of x. When x is nonzero, the time to perform the test essentially disappears. When x is zero, the invalid result of dividing by zero is discarded, and again no time is lost. However, division by zero is no longer protected by the branch; an exception can occur but should be ignored.

Allowing such speculation is the only way to get some loops to software pipeline. On an in-order superscalar CPU such as the R8000, speculation is critical to achieving top performance on loops containing branches. To allow this speculation to be done more freely by the compiler, most exceptions are disabled by default on R8000 based systems.

Hardware Speculative Execution

For the R10000 CPU, however, the hardware itself performs speculative execution. Instructions can be executed out of order. The CPU guesses which way a branch will go, and speculatively executes the instructions that follow on the predicted path. Executing code equivalent to Example 5-28, the R10000 CPU is likely to predict that the branch will find x nonzero, and to speculatively execute the division before the test is complete. When x is zero, an exception results that would not normally occur in the program. The CPU dismisses exceptions caused by a speculative instruction without operating system intervention. Thus, most exceptions can be enabled by default in R10000-based systems.


Owing to the hardware speculation in the R10000 CPU, software speculation by the compiler is less critical from a performance point of view. Moreover, software speculation can cause problems, because exceptions encountered on a code path generated by the compiler are not simply dismissed—unlike speculative exceptions created by the hardware. In the best case, the program wastes time handling software traps.

Controlling the Level of Speculation

For these reasons, when generating code for the R10000 CPU, the compiler by default performs only the safest speculative operations. The operations it speculates on are controlled by the -TENV:X=n flag, where n has the following meanings, as described in the cc(1) or f77(1) reference page:

0    No speculative code movement.
1    Speculation that might cause IEEE-754 underflow and inexact exceptions.
2    Speculation that might cause any IEEE-754 exceptions except zero-divide.
3    Speculation that might cause any IEEE-754 exceptions.
4    Speculation that might cause any exception including memory access.

The compiler generates code to disable software handling of any exception that is permitted by the -TENV:X flag.

The default is -TENV:X=1 for optimization levels of -O2 or below, and -TENV:X=2 for -O3 or -Ofast. Normally you should use the default values. Only when you have verified that the program never generates exceptions (see “Using Exception Profiling” on page 81), should you use higher values, which sometimes provide a modest increase in performance.
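For example, a program that has been verified never to raise floating-point exceptions might be compiled with a command of this form (the source file name is hypothetical):

% f77 -n32 -mips4 -Ofast=ip27 -TENV:X=3 -c solver.f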

A program that divides by zero in the course of execution is not correct and should be aborted with a divide-by-zero exception. When such a program is compiled with -TENV:X=3 or -TENV:X=4, execution continues after the division by zero, with the effect that NaN and Inf values propagate through the calculations and show up in the results.

A division by zero that the compiler introduces through speculation is safely discarded under these circumstances and does not propagate. However, compiler-generated exceptions cannot be distinguished from genuine program errors, which makes it difficult to trap on the program errors (see “Understanding Treatment of Underflow Exceptions” on page 81) when -TENV:X=3 is used.


The runtime library assumes the most liberal exception setting, so programs linked with libraries compiled with X=3 have the divide-by-zero exception disabled. This is currently the case with the scientific libraries SCSL and complib.sgimath. The fpe mechanism can still be used to trap errors. If any exceptions are generated by the libraries (as seen in the traceback), they are just an artifact of the compiler optimization and do not need to be analyzed.

Passing a Feedback File

The compiler tries to schedule looping and branching code for the CPU so that the CPU’s speculative execution will be most effective. When scheduling code, the question arises: how often will a branch be taken? The compiler has only the source code, so it can only schedule branches based on heuristics (that is, on educated guesses). An example of a heuristic is the guess that the conditional branch that closes a loop is taken much more often than not. Another is that the then-part of an if is used more often than it is skipped.

As described under “Creating a Compiler Feedback File” on page 75, you can create a file containing exact counts of the use of branches during a run of a program. Then you can feed this file to the compiler when you next compile the program. The compiler uses these counts to control how it schedules code around conditional branches.

After you create a feedback file, you pass it to the compiler using the -fb filename option. You can pass the feedback file for the entire program when compiling any module of that program, or pass it only when compiling the critical modules. The feedback information is optional, and the compiler ignores it when the statement numbers in the feedback file do not match the source code as it presently exists.
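For example, if the branch counts were saved in a feedback file named myprog.fb (a hypothetical name), a critical module could be recompiled with a command of this form:

% f77 -n32 -mips4 -Ofast=ip27 -fb myprog.fb -c critical.f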

One difficulty with feedback files is that they must be created from a non-optimized executable. When the executable program is compiled with optimization, the simple relationship between executable code and source statements is lost. If you compile with -fb and the compiler issues a warning about mismatched statement numbers, the cause is that the feedback file came from a trace of an optimized program.


The proper sequence to make use of a feedback file is as follows:
• Compile the entire program with -O0 optimization.
• Use ssrun -ideal to take an ideal profile of the unoptimized program.
• Use prof -feedback to create the feedback file from this profile.
• Recompile the entire program with full optimization and -fb.

This rather elaborate process is best reserved for the final compilation before releasing a completed program.

Exploiting Interprocedural Analysis

Most compiler optimizations work within a single routine (function or procedure) at a time. This helps keep the problems manageable and is a key aspect of supporting separate compilation because it allows the compiler to restrict attention to the current source file.

However, this narrow focus also presents serious restrictions. Knowing nothing about other routines, the optimizer is forced to make worst-case assumptions about the possible effects of calls to them. For instance, at any function call it must usually generate code to save and then restore the state of all variables that are in registers.

Interprocedural analysis (IPA) analyzes more than a single routine—preferably the entire program—at once. It can transfer information across routine boundaries, and these results are available to subsequent optimization phases, such as loop nest optimization and software pipelining. The optimizations performed by the MIPSpro compilers’ IPA facility include the following:
• Inlining: replacing a call to a routine with a modified copy of the routine’s code, when the called routine is in a different source file.
• Constant propagation: When a global variable is initialized to a constant value and never modified, each use of its name is replaced by the constant value. When a parameter of a routine is passed as a particular constant value in every call, the parameter is replaced in the routine’s code by the constant.
• Common block array padding: Increasing the dimensions of global arrays in Fortran to reduce cache conflicts.
• Dead function elimination: Functions that are never called can be removed from the program image, improving memory utilization.


• Dead variable elimination: Variables that are never used can be eliminated, along with any code that initializes them.
• PIC optimization: finding names that are declared external, but which can safely be converted to internal or hidden names. This allows simpler, faster function-call protocols to be used, and reduces the number of slots needed in the global table.
• Global name optimizations: Global names in shared code must normally be referenced via addresses in a global table, in case they are defined or preempted by another DSO. If the compiler knows it is compiling a main program and that the name is defined in another of the source files making up the main program, an absolute reference can be substituted, eliminating a memory reference.
• Automatic global table sizing: IPA can optimize the choice of data items that are referenced by simple offsets from the GP register, instead of depending on the programmer to provide an ideal value through the -G compiler option.

The first two optimizations, inlining and constant propagation, have the effect of enabling further optimizations. Inlining removes function calls from loops (helping to eliminate dependencies) while giving the software pipeliner more statements to work with in scheduling. After constant propagation, the optimizer finds more common subexpressions to combine, or finds more invariant expressions that it can lift out of loops.
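As an illustration of the second point (a sketch with hypothetical names, not code from this guide), consider a routine whose stride argument is the constant 1 at every call site. Once IPA propagates the constant into the routine, the loop becomes a stride-one loop that the later optimization phases can pipeline cleanly:

/* Every caller is assumed to pass incx == 1, so IPA can substitute the constant. */
void scalevec(int n, double a, double *x, int incx)
{
    int i;
    for (i = 0; i < n; i++)
        x[i*incx] *= a;    /* after constant propagation: x[i] *= a */
}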

Requesting IPA

IPA is easy to use. Use an optimization level of -O2 or -O3 and add -IPA to both the compile and link steps when building your program. Alternatively, use -Ofast for both compile and link steps; this optimization flag enables IPA.
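For a program built from two source files, the steps might look like this (the file names are hypothetical):

% f77 -n32 -mips4 -O3 -IPA -c main.f
% f77 -n32 -mips4 -O3 -IPA -c solver.f
% f77 -n32 -mips4 -O3 -IPA -o prog main.o solver.o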

Compiling and Linking with IPA

Unlike other IPA implementations, the MIPSpro compilers do not require that all the components of a program be present and compiled at once for IPA analysis. IPA works by postponing much of the compilation process until the link step, when all the program components can be analyzed together.

The compile step, applied to one or more source modules, does syntax analysis and places an intermediate representation of the code into the .o output file, instead of normal relocatable code. This intermediate code, called WHIRL code in some compiler documentation, contains preliminary IPA and optimization information (this is why all compiler flags must be given at the compile step).


When all source files have been compiled into .o files, the program is linked as usual with a call to the ld command (or to cc or f77, which in turn call ld). The linker recognizes that at least some of its input object files are incompletely-compiled WHIRL files. The link step can include object files (.o) and archives (.a) that were compiled without IPA analysis, as well as references to DSOs; however a DSO was compiled, it cannot contribute detailed information, because it may be replaced with a different version after the program is built.

The linker invokes IPA, which analyzes and transforms all the input WHIRL objects, writing modified versions of the objects. The linker then invokes the compiler back end on each of the modified objects, so that it can complete the optimization phases and generate code to produce a normal relocatable object. Finally, when all input files have been reduced to object code, the linker can proceed to link the object files.

If there are modules that you would prefer to compile using an optimization level of -O1 or -O0, that’s fine. Without IPA, such a module is compiled to an ordinary relocatable object file. These modules link into the program with no problems, but no interprocedural analysis is performed on them. The linker accepts a combination of WHIRL and object files, and IPA is applied as necessary.

However, all modules that are compiled with -IPA should also be compiled at the same optimization level—normally -O3, or -Ofast, which implies both -O3 and -IPA. You can also compile modules with -O2 and -IPA, but if the compiler then inlines code into a module compiled at -O2, the inlined code is not software pipelined or otherwise optimized, even though it comes from a module compiled at -O3.

Compile Time with IPA

When IPA is used, the proportion of time spent in compiling and linking changes. The compile step is usually faster with IPA, because the majority of the optimizations are postponed until the link step. However, the link step takes longer, because it is now responsible for most of the compilation.

The total build time is likely to increase for two reasons: First, although the IPA step itself may not be expensive, inlining usually increases the size of the code, and therefore expands the time required by the rest of the compiler. Second, because IPA analysis can propagate information from one module into optimization decisions in many other modules, even minor changes to a single component module cause most of the program compilation to be redone. As a result, IPA is best suited for use on programs with a stable source base.


Understanding Inlining

One key feature of IPA is inlining, the replacement of a call to a routine with the code of that routine. Inlining can improve performance in several ways:
• By eliminating the subroutine call, the overhead in making the call, such as saving and restoring registers, is also eliminated.
• Possible aliasing effects between routine parameters and global variables are eliminated.
• A larger amount of code is exposed to later optimization phases of the compiler. So, for example, loops that previously could not be software pipelined now can be.

But there are some disadvantages to inlining:
• Inlining expands the size of the source code and so increases compilation time.
• The resulting object text is larger, taking more space on disk and more time to load.
• The additional object code takes more room in the cache, sometimes causing more primary and secondary cache misses.

Thus, inlining does not always improve program execution time. As a result, there are flags you can use to control which routines will be inlined.

Inlining is available in Fortran 77, C, and C++ with the 7.2 compiler release; it will be part of Fortran 90 in a subsequent release of the compilers. There are a few restrictions on inlining. The following types of routines cannot be inlined:
• Recursive routines.
• Routines using static (SAVE in Fortran) local variables.
• Routines with variable argument lists (varargs).
• Calls in which the caller and callee have mismatched argument lists.
• Routines written in one language cannot be inlined into a module written in a different language.

IPA supports two types of inlining: manual and automatic. The manual (also known as standalone) inliner is invoked with the -INLINE option group. It has been provided to support the C++ inline keyword but can also be used for C and Fortran programs using the command-line option. It inlines only the routines that you specify and they must be located in the same source file as the routine into which they are inlined. The automatic inliner is invoked whenever -IPA (or -Ofast) is used. It uses heuristics to decide which routines will be beneficial to inline, so you don’t need to specify these explicitly. It can inline routines from any files that have been compiled with -IPA.


Using Manual Inlining

The manual inliner is controlled with the -INLINE option group. Use a combination of the following four options to specify which routines to inline:

-INLINE:all              Inline all routines except those specified to -INLINE:never.
-INLINE:none             Inline no routines except those specified to -INLINE:must.
-INLINE:must=name,...    Specify a list of routines to inline at every call.
-INLINE:never=name,...   Specify a list of routines never to inline.

Consider the code in Example 5-30.

Example 5-30 Code Suitable for Inlining

      subroutine zfftp ( z, n, w, s )
      real*8 z(0:2*n-1), w(2*(n-1)), s(0:2*n-1)
      integer n
      integer iw, i, m, l, j, k
      real*8 w1r, w1i, w2r, w2i, w3r, w3i
      iw = 1
c First iteration of outermost loop over l&m: n = 4*l*m.
      i = 0
      m = 1
      l = n/4
      if (l .gt. 1) then
         do j = 0, l-1
            call load4(w(iw),iw,w1r,w1i,w2r,w2i,w3r,w3i)
            call radix4pa(z(2*(j+0*l)),z(2*(j+1*l)),
     &                    z(2*(j+2*l)),z(2*(j+3*l)),
     &                    s(2*(4*j+0)),s(2*(4*j+1)),
     &                    s(2*(4*j+2)),s(2*(4*j+3)),
     &                    w1r,w1i,w2r,w2i,w3r,w3i)
         enddo
         i = i+1
         m = m*4
         l = l/4
      endif
c ...
      return
      end


The loop in Example 5-30 contains two subroutine calls whose code is shown in Example 5-31.

Example 5-31 Subroutine Candidates for Inlining

      subroutine load4 ( w, iw, w1r, w1i, w2r, w2i, w3r, w3i )
      real*8 w(0:5), w1r, w1i, w2r, w2i, w3r, w3i
      integer iw
c
      w1r = w(0)
      w1i = w(1)
      w2r = w(2)
      w2i = w(3)
      w3r = w(4)
      w3i = w(5)
      iw = iw + 6
      return
      end
c-----------------------------------------------------------------
      subroutine radix4pa ( c0, c1, c2, c3, y0, y1, y2, y3,
     &                      w1r, w1i, w2r, w2i, w3r, w3i )
      real*8 c0(0:1), c1(0:1), c2(0:1), c3(0:1)
      real*8 y0(0:1), y1(0:1), y2(0:1), y3(0:1)
      real*8 w1r, w1i, w2r, w2i, w3r, w3i
c
      real*8 d0r, d0i, d1r, d1i, d2r, d2i, d3r, d3i
      real*8 t1r, t1i, t2r, t2i, t3r, t3i
c
      d0r = c0(0) + c2(0)
      d0i = c0(1) + c2(1)
      d1r = c0(0) - c2(0)
      d1i = c0(1) - c2(1)
      d2r = c1(0) + c3(0)
      d2i = c1(1) + c3(1)
      d3r = c3(1) - c1(1)
      d3i = c1(0) - c3(0)
      t1r = d1r + d3r
      t1i = d1i + d3i
      t2r = d0r - d2r
      t2i = d0i - d2i
      t3r = d1r - d3r
      t3i = d1i - d3i
      y0(0) = d0r + d2r
      y0(1) = d0i + d2i
      y1(0) = w1r*t1r - w1i*t1i
      y1(1) = w1r*t1i + w1i*t1r
      y2(0) = w2r*t2r - w2i*t2i
      y2(1) = w2r*t2i + w2i*t2r
      y3(0) = w3r*t3r - w3i*t3i
      y3(1) = w3r*t3i + w3i*t3r
      return
      end

Both subroutines are straight-line code. Having this code isolated in subroutines undoubtedly makes the source code easier to read and to maintain. However, inlining this code reduces function call overhead; more important, inlining allows the calling loop to software pipeline.

If all the code is contained in the file fft.f, you can inline the calls with the following compilation command:

% f77 -n32 -mips4 -O3 -r10000 -OPT:IEEE_arithmetic=3:roundoff=3 -INLINE:must=load4,radix4pa -flist -c fft.f

(Here, -O3 is used instead of -Ofast because the latter would invoke -IPA to inline the subroutines automatically.) The -flist flag tells the compiler to write out a Fortran file showing what transformations the compiler has done. This file is named fft.w2f.f (“w2f” stands for WHIRL-to-Fortran). You can confirm that the subroutines were inlined by checking the w2f file. It contains the code shown in Example 5-32.

Example 5-32 Inlined Code from w2f File

  l = (n/4)
  IF(l .GT. 1) THEN
    DO j = 0, l + -1, 1
      w1i = w((j * 6) + 2)
      w2r = w((j * 6) + 3)
      w2i = w((j * 6) + 4)
      w3r = w((j * 6) + 5)
      w3i = w((j * 6) + 6)
      d0r = (z((j * 2) + 1) + z((((l * 2) + j) * 2) + 1))
      d0i = (z((j * 2) + 2) + z((((l * 2) + j) * 2) + 2))
      d1r = (z((j * 2) + 1) - z((((l * 2) + j) * 2) + 1))
      d1i = (z((j * 2) + 2) - z((((l * 2) + j) * 2) + 2))
      d2r = (z(((l + j) * 2) + 1) + z((((l * 3) + j) * 2) + 1))
      d2i = (z(((l + j) * 2) + 2) + z((((l * 3) + j) * 2) + 2))
      d3r = (z((((l * 3) + j) * 2) + 2) - z(((l + j) * 2) + 2))
      d3i = (z(((l + j) * 2) + 1) - z((((l * 3) + j) * 2) + 1))
      t1r = (d1r + d3r)
      t1i = (d1i + d3i)
      t2r = (d0r - d2r)
      t2i = (d0i - d2i)
      t3r = (d1r - d3r)
      t3i = (d1i - d3i)
      s((j * 8) + 1) = (d0r + d2r)
      s((j * 8) + 2) = (d0i + d2i)
      s((j * 8) + 3) = ((w((j * 6) + 1) * t1r) -(w1i * t1i))
      s((j * 8) + 4) = ((w((j * 6) + 1) * t1i) +(w1i * t1r))
      s((j * 8) + 5) = ((w2r * t2r) -(w2i * t2i))
      s((j * 8) + 6) = ((w2r * t2i) +(w2i * t2r))
      s((j * 8) + 7) = ((w3r * t3r) -(w3i * t3i))
      s((j * 8) + 8) = ((w3r * t3i) +(w3i * t3r))
    END DO
  ENDIF
  RETURN

There are two other ways to confirm that the routines have been inlined:
1. Use the option -INLINE:list to instruct the compiler to print out the name of each routine as it is being inlined.
2. Check the performance to verify that the loop software pipelines (you cannot check the assembler file because the -S option disables IPA).

Using Automatic Inlining

Automatic inlining is a feature of IPA. The compiler performs inlining on most subroutine calls, but no longer considers inlining to be beneficial when it increases the size of the program by 100% or more, or if the extra code causes the -Olimit option to be exceeded. This option specifies the maximum size, in basic blocks, of a routine that the compiler will optimize. Generally, you do not need to worry about this limit because, if it is exceeded, the compiler prints a warning message instructing you to recompile using the flag -Olimit=nnnn.

If one of these limits is exceeded, IPA will restrict the inlining rather than not do any at all. It decides which routines will remain inlined on the basis of their location in the call hierarchy. Routines at the lowest depth in the call graph are inlined first, followed by routines higher up in the hierarchy. This is illustrated in Figure 5-2.


(Figure diagram: the call tree of routines A, B, C, D, and E at successive inlining levels n=0, n=1, and n=2.)

Figure 5-2 Inlining and the Call Hierarchy

The call tree for a small program is shown on the left. At the lowest depth, leaf routines (those that do not call any other routine) are inlined. In Figure 5-2, the routine E is inlined into routines D and C. At the next level, D is inlined into B and C is inlined into A. At the last level, which for this program is the top level, B is inlined into A. IPA stops automatic inlining at the level that causes the program size to double, or the level that exceeds the -Olimit option.

You can increase or decrease the amount of automatic inlining with these three options:

-IPA:space=n        The largest acceptable percentage increase in code size. The default is 100%.
-IPA:forcedepth=n   All routines at depth n or less in the call graph must be inlined, regardless of the space limit.
-IPA:inline=off     Turns off automatic inlining altogether.

Automatic inlining inlines the calls to load4 and radix4pa in Example 5-30. But because most of IPA’s actions are not performed until the link stage, you don’t get a .w2f file with which to verify this until the link has been completed.

The -INLINE:must and -INLINE:never flags work in conjunction with automatic inlining. To block automatic inlining of specific routines, name them in an -INLINE:never list. To require inlining of specific routines regardless of code size or their position in the call graph, name them in an -INLINE:must list.
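For example, to allow more code growth during automatic inlining while keeping one routine out of line, you could combine these options on the compile line (the space value and routine name here are only illustrative):

% f77 -n32 -mips4 -Ofast=ip27 -IPA:space=200 -INLINE:never=radix4pa -c fft.f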


IPA Programming Hints

The -IPA option, whether given explicitly on compile and link lines or implicitly via -Ofast, is designed to work fairly automatically. Nevertheless, you can do a few things to enhance interprocedural analysis.

IPA needs to know which routines are being called in order to do its analysis, so it is limited by programming constructs that obscure the call graph. The use of function pointers, virtual functions, and C++ exception handling are all examples of things that limit interprocedural analysis.

The optimizations that IPA performs other than inlining should always benefit performance. Although inlining can often significantly improve performance, it can also hurt it, so it pays to experiment with the flags available for controlling automatic inlining.

Summary

This chapter has toured the concepts of SWP and IPA, and shown how these optimizations are constrained by the compiler’s need to always generate code that is correct under the worst-case assumptions, with the result that opportunities for faster code are often missed. By giving the compiler more information, and sometimes by making minor adjustments in the source code, you can get the tightest possible machine code for the innermost loops.

The code schedule produced by pipelining does not take memory access delays into account. Cache misses and virtual page faults can destroy the timing of any code schedule. Optimizing access to memory takes place at a higher level of optimization, discussed in the next chapter.

6. Optimizing Cache Utilization

Because the SN0 architecture has multiple levels of memory access, optimizations that improve a program’s utilization of memory can have major effects. This chapter introduces the concepts of cache and virtual memory use and their effects on performance in these topics:
• “Understanding the Levels of the Memory Hierarchy” on page 135 describes the relation between the programmer’s view of memory as arrays and variables, and the system’s view of memory as a hierarchy of caches and nodes.
• “Identifying Cache Problems with Perfex and SpeedShop” on page 142 shows how you can tell when a program is being slowed by memory access problems.
• “Using Other Cache Techniques” on page 148 describes some basic strategies you can use to overcome memory-access problems.

The discussion continues in Chapter 7, “Using Loop Nest Optimization,” with details of the compiler optimizations that automatically exploit the memory hierarchy.

Understanding the Levels of the Memory Hierarchy

To understand why cache utilization problems occur, it is important to understand how the data caches in the system function. As mentioned under “Cache Architecture” on page 22, the CPUs in the SN0 architecture utilize a two-level cache hierarchy: there is a 32 KB L1 cache on the CPU chip, and a much larger L2 cache external to the CPU on the node board. The virtual memory system provides the third and fourth levels of caching.

Understanding Level-One and Level-Two Cache Use

The L1 and L2 caches are two-way set associative; that is, any data element may reside in only two possible locations in either cache, as determined by the low-order bits of its address. If the program refers to two locations that have the same low-order address bits, copies of both can be maintained in the cache simultaneously. When a third datum with the same low-order address bits is accessed, it replaces one of the two previous data—the one that was least recently accessed.
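To make the mapping concrete, the following sketch computes which set of the L1 data cache an address falls in, using the parameters given in this chapter (32 KB capacity, two-way set associative, 32-byte lines, hence 512 sets). The function is illustrative only and is not part of any system interface:

/* Illustrative only: which set of the L1 data cache does an address map to?
   Assumes a 32 KB, two-way set-associative cache with 32-byte lines. */
unsigned l1_set(unsigned long addr)
{
    const unsigned line_size = 32;
    const unsigned sets = (32 * 1024) / (line_size * 2);   /* 512 sets */
    return (unsigned)((addr / line_size) % sets);
}

Two data items whose addresses yield the same set number compete for the same two cache locations, which is the root of the thrashing behavior described later in this chapter.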


To maintain a high bandwidth between the caches and memory, each location in the cache holds a contiguous block, or cache line, of data. For the primary cache, the line size is 32 bytes; it is 128 bytes for the secondary cache. Thus, when a particular word is copied into a cache, several nearby words are copied with it.

The CPU can make use only of data in registers, and it can load data into registers only from the primary cache. So data must be brought into the primary cache before it can be used in calculations. The primary cache can obtain data only from the secondary cache, so all data in the primary cache is simultaneously resident in the secondary cache. The secondary cache obtains its data from main memory. The operating system uses main memory to create a larger virtual address space for each process, keeping unneeded pages of data on disk and loading them as required—so, in effect, main memory is a cache also, with a cache line size of 16KB, the size of a virtual page.

A cache miss occurs when the program refers to a datum that is not present in the cache. That instruction is suspended while the datum (and the rest of the cache line that contains it) is copied into the cache. When an instruction refers to a datum that is already in the cache, this is a cache hit.

Understanding TLB and Virtual Memory Use

A page is the smallest contiguous block of memory that the operating system allocates to your program. Memory is divided into pages to allow the operating system to freely partition the physical memory of the system among the running processes while presenting a uniform virtual address space to all jobs. Your programs see a contiguous range of virtual memory addresses, but the physical memory pages can be scattered all over the system.

The operating system translates between the virtual addresses your programs use and the physical addresses that the hardware requires. These translations are done for every memory access, so, to keep their overhead low, the 64 most recently used page addresses are cached in the translation lookaside buffer (TLB). This allows virtual-to-physical translation to be carried out with zero latency, in hardware—for those 64 pages.

When the program refers to a virtual address that is not cached in the TLB, a TLB miss has occurred. Only then must the operating system assist in address translation. The hardware traps to a routine in the IRIX kernel. It locates the needed physical address in a memory table and loads it into one of the TLB registers, before resuming execution.


Thus the TLB acts as a cache for frequently-used page addresses. The virtual memory of a program is usually much larger than 64 pages. The most recently used pages of memory (hundreds or thousands of them) are retained in physical memory, but depending on program activity and total system use, some virtual pages can be copied to disk and removed from memory. Sometimes, when a program suffers a TLB miss, the operating system discovers that the referenced page is not presently in memory—a “page fault” has occurred. Then the program is delayed while the page is read from disk.

Degrees of Latency

The latency of data access becomes greater with each cache level. Latency of memory access is best measured in CPU clock cycles. One cycle occupies from 4 to 6 nanoseconds, depending on the CPU clock speed. The latencies to the different levels of the memory hierarchy are as follows:
• CPU Register: 0 cycles.
• L1 cache hit: 2 or 3 cycles.
• L1 cache miss satisfied by L2 cache hit: 8 to 10 cycles.
• L2 cache miss satisfied from main memory, no TLB miss: 75 to 250 cycles; that is, 300 to 1100 nanoseconds, depending on the node where the memory resides (see Table 1-3 on page 19).
• TLB miss requiring only reload of the TLB to refer to a virtual page already in memory: approximately 2000 cycles.
• TLB miss requiring virtual page to load from backing store: hundreds of millions of cycles; that is, tens to hundreds of milliseconds.

A miss at each level of the memory hierarchy multiplies the latency by an order of magnitude or more. Clearly a program can sustain high performance only by achieving a very high ratio of cache hits at every level. Fortunately, hit ratios of 95% and higher are commonly achieved.

Understanding Prefetching

One way to disguise the latency of memory access is to overlap this time with other work. This is the reason for the out-of-order execution capability of the R10000 CPU, which can suspend as many as four instructions that are waiting for data, and continue to execute following instructions that have their data. This is sufficient to hide most of the latency of the L1 cache and often that of the L2 cache as well. Something more is needed to overlap main memory latency with work: data prefetch.


Prefetch instructions are part of the MIPS IV instruction set architecture (see “MIPS IV Instruction Set Architecture” on page 21). They allow data to be moved into the cache before their use. If the program requests the data far enough in advance, main-memory latency can be completely hidden. A prefetch instruction is similar to a load instruction: an address is specified and the datum at that address is accessed. The difference is that no register is specified to receive the datum. The CPU checks availability of the datum and, if there is a cache miss, initiates a fetch, so the data ends up in the cache.

There is no syntax in standard languages such as C and Fortran to request a prefetch, so you normally rely on the compiler to automatically insert prefetch instructions. Compiler directives are also provided to allow you to specify prefetching manually.

Principles of Good Cache Use

The operation of the caches immediately leads to some straightforward principles. Two of the most obvious are:
• A program ought to make use of every word of every cache line that it touches. Clearly, if a program does not make use of every word in a cache line, the time spent loading the unused parts of the line is wasted.
• A program should use a cache line intensively and then not return to it later. When a program visits a cache line a second time after a delay, the cache line may have been displaced by other data, so the program is delayed twice waiting for the same data.

Using Stride-One Access

These two principles together imply that every loop should use a memory stride of 1; that is, a loop over an array should access array elements from adjacent memory addresses. When the loop “walks” through memory by consecutive words, it uses every word of every cache line in sequence, and does not return to any cache line after finishing it.

Consider the loop nest in Example 6-1.

Example 6-1 Simple Loop Nest with Poor Cache Use

      do i = 1, n
         do j = 1, n
            a(i,j) = b(i,j)
         enddo
      enddo


In Example 6-1, the array b is copied to the array a one row at a time. Because Fortran stores matrices in column-major order—that is, consecutive elements of a column are in consecutive memory locations—this loop accesses memory with a stride of n. Each element loaded or stored is n elements away from the last one. As a result, a new cache line must be brought in for each element of a and of b. If the matrix is small enough, all of it will shortly be brought into cache, displacing other data that was there. However, if the array is too large (the L1 cache is only 32 KB in size), cache lines holding elements at the beginning of a row may be displaced from the cache before the next row is begun. This causes a cache line to be loaded multiple times, once for each row it spans.

Simply reversing the order of the loops, as in Example 6-2, changes the loop to stride-one access.

Example 6-2 Reversing Loop Nest to Achieve Stride-One Access

      do j = 1, n
         do i = 1, n
            a(i,j) = b(i,j)
         enddo
      enddo

Using this ordering, the data are copied one column at a time. In Fortran, this means copying adjacent memory locations in sequence. Thus, all data in the same cache line are used, and each line needs to be loaded only once.

Grouping Data Used at the Same Time

Consider the Fortran code segment in Example 6-3, which performs a calculation involving three vectors, x, y, and z.

Example 6-3 Loop Using Three Vectors

      d = 0.0
      do i = 1, n
         j = ind(i)
         d = d + sqrt(x(j)*x(j) + y(j)*y(j) + z(j)*z(j))
      enddo

Access to the elements of the vectors is not sequential, but indirect through an index array, ind. It is not likely that the accesses are stride-one in any vector. Thus, each iteration of the loop may well cause three new cache lines to be loaded, one from each of the vectors.


It appears that x(j), y(j), and z(j), for one value of j, are a triple that is always processed together. Consider Example 6-4, in which the three vectors have been grouped together as the three rows of a two-dimensional array, r. Each column of the array contains one of the triples x, y, z.

Example 6-4 Three Vectors Combined in an Array
d = 0.0
do i = 1, n
   j = ind(i)
   d = d + sqrt(r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j))
enddo

Because r(1,j), r(2,j), and r(3,j)—the three rows of column j—are contiguous in memory, it is likely that all three values used in one loop iteration will fall in the same cache line. This reduces traffic to the cache by as much as a factor of three. There could be a cache line fetched for each iteration of the loop, but only one, not three.
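The same idea can be expressed in C. The following is a minimal sketch (the struct, function, and variable names here are illustrative, not taken from the program above): storing each x, y, z triple in one structure keeps the three values used in an iteration adjacent, so they typically share a cache line.

#include <math.h>

struct point { float x, y, z; };

float norm_sum(const struct point *r, const int *ind, int n)
{
    float d = 0.0f;
    int i, j;

    for (i = 0; i < n; i++) {
        j = ind[i];
        /* x, y, and z for point j are contiguous in memory */
        d += sqrtf(r[j].x*r[j].x + r[j].y*r[j].y + r[j].z*r[j].z);
    }
    return d;
}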

Suppose the loop in Example 6-3 had not used indirect indexing, but instead progressed through the vectors sequentially, as in this example:

   d = d + sqrt(x(i)*x(i) + y(i)*y(i) + z(i)*z(i))

Then the accesses would have been stride-1, and there would be no advantage to grouping the vectors—unless they were aligned in memory so that x(i), y(i), and z(i) all mapped to the same location in the cache. Such unfortunate alignments can happen, and cause what is called cache thrashing.

Understanding Cache Thrashing

Consider the code fragment in Example 6-5, where three large vectors are combined into a fourth.

Example 6-5 Fortran Code Likely to Cause Thrashing
parameter (max = 1024*1024)
dimension a(max), b(max), c(max), d(max)
do i = 1, max
   a(i) = b(i) + c(i)*d(i)
enddo


The four vectors are declared one after the other, so they are allocated contiguously in memory. Each vector is 4 MB in size (1024 × 1024 times the size of a default real, 4 bytes). Thus the low 22 bits of the addresses of elements a(i), b(i), c(i), and d(i) are the same. The vectors map to the same locations in the associative caches.

To perform the calculation in iteration i of this loop, a(i), b(i), c(i), and d(i) must be resident in the primary cache. c(i) and d(i) are needed first to carry out the multiplication. They map to the same location in the cache, but both can be resident simultaneously because the cache is two-way set associative—it can tolerate two lines using the same address bits. To carry out the addition, b(i) needs to be in the cache. It maps to the same cache location, so the line containing c(i) is displaced to make room for it (here we assume that c(i) was accessed least recently). To store the result in a(i), the cache line holding d(i) must be displaced.

Now the loop proceeds to iteration i+1. The line containing c(i+1) must be loaded anew into the cache, because one of the following things is true:
• c(i+1) is in a different line than c(i), and so its line must be loaded for the first time.
• c(i+1) is in the same line as c(i) but was displaced from the cache during the previous iteration.

Similarly, the cache line holding d(i+1) must also be reloaded. In fact, every reference to a vector element results in a cache miss, because only two of the four values needed during each iteration can reside in the cache simultaneously. Even though the accesses are stride-one, there is no cache line reuse.

This behavior is known as cache thrashing, and it results in very poor performance, essentially reducing the program to uncached use of memory. The cause of the thrashing is the unfortunate alignment of the vectors: they all map to the same cache location. In Example 6-5 the alignment is especially bad because max is large enough to cause thrashing in both primary and secondary caches.


Using Array Padding to Prevent Thrashing

There are several ways to repair cache thrashing:
1. Redimension the vectors so that their size is not a power of two. A new size that spaces the vectors out in memory so that a(1), b(1), c(1), and d(1) all map to different locations in the cache is ideal. For example, max = 1024*1024 + 32 would offset the beginning of each vector 32 elements, or 128 bytes. This is the size of an L2 cache line, so each vector begins at a different cache address. All four values may now reside in the cache simultaneously, and complete cache line reuse is possible.
2. Introduce padding variables between the vectors in order to space out their beginning addresses. Ideally, each padding variable should be at least the size of a full cache line. Thus, if max is kept the same, the following declaration eliminates cache thrashing:
      dimension a(max), pad1(32), b(max), pad2(32),
     &          c(max), pad3(32), d(max)
3. For multidimensional arrays, it is sufficient to make the leading dimension an odd number, as in the following:
      dimension a(1024+1,1024)
   For arrays with smaller dimensions, it is necessary to change two or more dimensions, as in the following:
      dimension a(64+1,64+1,64)

Eliminating cache thrashing makes the loop at least 100 times faster.
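A corresponding fix can be sketched in C. This sketch assumes, as the Fortran storage in Example 6-5 guarantees, that the compiler lays out these file-scope arrays one after the other (the C language itself does not promise this); the pad arrays of 32 floats, 128 bytes each, then stagger the starting addresses by one L2 cache line so the four working arrays no longer map to the same cache location.

#define MAX (1024*1024)

/* padding arrays shift each vector's start by one 128-byte L2 line,
   assuming the arrays are allocated contiguously in this order */
static float a[MAX], pad1[32];
static float b[MAX], pad2[32];
static float c[MAX], pad3[32];
static float d[MAX];

void combine(void)
{
    int i;
    for (i = 0; i < MAX; i++)
        a[i] = b[i] + c[i]*d[i];
}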

Identifying Cache Problems with Perfex and SpeedShop

The profiling tools discussed in Chapter 4, “Profiling and Analyzing Program Behavior,” are used to determine if there is a cache problem. To see how they work, reexamine the profiles taken from adi2.f (Example 4-2 on page 55 and Example 4-6 on page 67).

You can review the code of this program in Example C-1 on page 288. It processes a three-dimensional cubical array. The three procedures xsweep, ysweep, and zsweep perform operations along pencils in each of the three dimensions, as indicated in Figure 6-1.


Figure 6-1 Processing Directions in adi2.f

The profiles using PC sampling (Example 4-6 on page 67) presented times for these procedures indicating that the vast majority of time is spent in the routine zsweep:

samples   time(%)        cum time(%)    procedure (dso:file)
   6688   6.7s( 78.0)    6.7s( 78.0)    zsweep (adi2:adi2.f)
    671   0.67s(  7.8)   7.4s( 85.8)    xsweep (adi2:adi2.f)
    662   0.66s(  7.7)     8s( 93.6)    ysweep (adi2:adi2.f)

However, the ideal time profile (Example 4-9 on page 69) presented quite a different picture, showing identical instruction counts in each routine.

PC sampling, which samples the program counter position as the program runs, obtains an accurate measure of elapsed time, including the time spent waiting on memory accesses. Ideal time counts the instructions and cannot tell the difference between a cache hit and a miss.

The difference between the -pcsamp and -ideal profiles indicates that the instructions in zsweep occupied much more elapsed time than similar instructions in the other routines. The explanation is that zsweep suffers from some serious cache problems, but the comparison provides no indication of the level of memory—L1 cache, L2 cache, or TLB—that is the principal source of the trouble. This can be determined using the R10000 event counters.


The perfex -a -y output for this program appears in Example 4-3 on page 56 including the lines shown in Example 6-6 (only the “Typical” time column is retained).

Example 6-6 Perfex Data for adi2.f
16 Cycles...... 1639802080 8.366337
26 Secondary data cache misses...... 7736432 2.920580
23 TLB misses...... 5693808 1.978017
 7 Quadwords written back from scache...... 61712384 1.973562
25 Primary data cache misses...... 16368384 0.752445
22 Quadwords written back from primary data cache...... 32385280 0.636139

The program spent about 3 seconds of its 8-second run in secondary cache misses, and nearly 2 seconds in TLB misses. These results make it likely that both secondary cache and TLB misses are hindering the performance of zsweep. To understand why, it is necessary to look in more detail at what happens when this routine is called (read the program in Example C-1 on page 288).

If you know that a program is being slowed by memory access problems, but you are not sure which parts of the program are causing the problem, you could perform ssrun experiments of types -dc_hwc, -sc_hwc, and -tlb_hwc (see Table 4-2 on page 64). These profiles should highlight exactly the parts of the program that are most delayed. In the current example, we know from the profile in Example 4-6 that the problem is in routine zsweep.

Diagnosing and Eliminating Cache Thrashing

zsweep performs calculations along a pencil in the z-dimension. This is the index of the data array that varies most slowly in memory. The declaration for data is as follows:

parameter (ldx = 128, ldy = 128, ldz = 128)
real*4 data(ldx,ldy,ldz)

There are 128 × 128 = 16384 array elements, or 64KB, between successive elements in the z-dimension. Thus, each element along the pencil maps to the same location in the primary cache. In addition, every 32nd element maps to the same place in a 4 MB secondary cache (every eighth element, for a 1 MB cache). The pencil is 128 elements long, so the data in the pencil map to four different secondary cache lines (16, for a 1 MB L2 cache). Since the caches are only two-way set associative, there will be cache thrashing.


As we saw (“Understanding Cache Thrashing” on page 140), cache thrashing is fixed by introducing padding. All the data are in a single array, so you can introduce padding by increasing the array bounds. Changing the array declaration to the following provides sufficient padding:

parameter (ldx = 129, ldy = 129, ldz = 129)
real*4 data(ldx,ldy,ldz)

With these dimensions, array elements along a z-pencil all map to different secondary cache lines. (For a system with a 4MB cache, two elements, a(i,j,k1) and a(i,j,k2), map to the same secondary cache line when ((k2 - k1) × 129 × 129 × 4 mod (4MB ÷ (2 associative sets))) ÷ 128 = 0.)
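As a quick check of this formula for adjacent elements (k2 - k1 = 1): 129 × 129 × 4 = 66564 bytes; 66564 mod 2 MB is 66564, and 66564 ÷ 128 is 520, not 0, so neighboring z-elements now fall on different secondary cache lines.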

Running this modification of the program produces the counts shown in Example 6-7.

Example 6-7 Perfex Data for adi5.f
16 Cycles...... 453456768 1.813827
26 Secondary data cache misses...... 1623296 0.490235
23 TLB misses...... 6262112 1.705549
 7 Quadwords written back from scache...... 12049152 0.308458
25 Primary data cache misses...... 11676544 0.420823
22 Quadwords written back from primary data cache...... 16002160 0.246433

The secondary cache misses have been reduced by more than a factor of six. The primary cache misses have also been greatly reduced. Now, TLB misses represent the dominant memory cost, at 1.7 seconds of the program’s 1.8-second run time.

Diagnosing and Eliminating TLB Thrashing

The reason TLB misses consume such a large percentage of the time is thrashing of that cache. Each TLB entry stores the real-memory address of an even-odd pair of virtual memory pages. Each page is 16 KB in size (by default), and adjacent elements in a z-pencil are separated by 129 × 129 × 4 = 66564 bytes, so every z-element falls in a separate TLB entry. However, there are 128 elements per pencil and only 64 TLB entries, so the TLB must be completely reloaded (twice) for each sweep in the z-direction; there is no TLB cache reuse. (Note that this does not mean the data is not present in memory. There are no page faults happening. However, the CPU can address only a limited number of pages at one time. What consumes the time is the frequent traps to the kernel to reload the TLB registers for different virtual pages.)


There are a couple of ways to fix this problem: cache blocking and copying. Cache blocking is an operation that the compiler loop-nest optimizer can perform automatically (see “Controlling Cache Blocking” on page 167). Copying is a trick that the optimizer does not perform, but that is not difficult to do manually. Cache blocking can be used in cases where the work along z-pencils can be carried out independently on subsets of the pencil. But if the calculations along a pencil require a more complicated intermixing of data, or are performed in a library routine, copying is the only general way to solve this problem.

Using Copying to Circumvent TLB Thrashing

The TLB thrashing results from having to work on data that are spread over too many virtual pages. This is not a problem for the x- and y-dimensions because these data are close enough together to allow many sweeps to reuse the same TLB entries. What is needed is a way to achieve the same kind of reuse for the z-dimension. This can be accomplished by a three-step algorithm (a sketch of the pattern in C follows the discussion of scratch-array sizing below):
• Copy each z-pencil to a scratch array that is small enough to avoid TLB thrashing.
• Carry out the z-sweep on the scratch array.
• Copy the results back to the data array.

How big should this scratch array be? To avoid thrashing, it needs to be smaller than the number of pages that the TLB can map simultaneously; this is approximately 58 × 2 pages (IRIX reserves a few of the 64 TLB entries for its own use), or 1,856 KB. But it shouldn’t be too small, because the goal is to get a lot of reuse from the page mappings while they are resident in the TLB.

The calculations in both the x- and y-dimensions achieve good performance. This indicates that an array the size of an xz-plane should work well. An xz-plane is used rather than a yz-plane to maximize cache reuse in the x-dimension. If an entire plane requires more scratch space than is practical, a smaller array can be used, but to get good secondary cache reuse, a subset of an xz-plane whose x-size is a multiple of a cache line should be used.
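The following is a minimal sketch of this pattern in C, not the actual adi code (the array layout, sizes, and the zsweep_plane routine are assumed for illustration): the xz-plane for each j is gathered into a small contiguous scratch array, swept, and scattered back.

#define NX 128
#define NY 128
#define NZ 128

void zsweep_plane(float *plane, int nx, int nz);   /* assumed sweep routine */

void z_sweeps(float data[NZ][NY][NX])
{
    static float scratch[NZ][NX];   /* one xz-plane: contiguous, few TLB entries */
    int i, j, k;

    for (j = 0; j < NY; j++) {
        for (k = 0; k < NZ; k++)          /* gather the xz-plane for this j */
            for (i = 0; i < NX; i++)
                scratch[k][i] = data[k][j][i];

        zsweep_plane(&scratch[0][0], NX, NZ);

        for (k = 0; k < NZ; k++)          /* scatter the results back */
            for (i = 0; i < NX; i++)
                data[k][j][i] = scratch[k][i];
    }
}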

A version of the program modified to copy entire xz-planes to a scratch array before z-sweeps are performed (see Example C-3 on page 291) produces the counts shown in Example 6-8.


Example 6-8 Perfex Data for adi53.f
16 Cycles...... 354393968 1.417576
25 Primary data cache misses...... 10999808 0.396433
22 Quadwords written back from primary data cache...... 16251344 0.250271
26 Secondary data cache misses...... 1126256 0.340129
 7 Quadwords written back from scache...... 10343456 0.264792
23 TLB misses...... 79568 0.021671

Although the frequent copying of z-data increases the primary and secondary cache misses somewhat (compare Example 6-8 to Example 6-7), the TLB misses all but disappear, and overall performance is significantly improved. Thus the overhead of copying is a good trade-off.

In cases where it is possible to use cache blocking instead of copying, the TLB miss problem can be solved without increasing the primary and secondary cache misses and without expending time to do the copying, so it will provide even better performance. Use copying when cache blocking is not an option.

Using Larger Page Sizes to Reduce TLB Misses

An alternate technique, one that does not require code modification, is to tell the operating system to use larger page sizes. Each TLB entry represents two adjacent pages; when pages are larger, each entry represents more data. Of course, if you then attempt to work on larger problems, the TLB misses can return, so this technique is more of an expedient than a general solution. Nevertheless, it’s worth looking into.

The page size can be changed in two ways. First, the utility dplace can reset the page size for any program. Its use is very simple (see the dplace(1) reference page):

% dplace -data_pagesize 64k -stack_pagesize 64k program arguments...

With this command, the data and stack page sizes are set to 64 KB during the execution of the specified program. TLB misses typically come as a result of calculations on the data in the program, so setting the page size for the program text (the instructions)—which would be done with the option -text_pagesize 64k—is generally not needed.

For parallel programs that use the MP library, a second method exists: set some environment variables. For example, the following three lines set the page size for data, text, and stack to 64 KB:

% setenv PAGESIZE_DATA 64
% setenv PAGESIZE_STACK 64
% setenv PAGESIZE_TEXT 64


These environment variables are acted on by the runtime multiprogramming library (libmp) and are effective only when the program is linked -lmp.

There are some restrictions on what page sizes can be used. First of all, the only valid page sizes are 16 KB, 64 KB, 256 KB, 1 MB, 4 MB, and 16 MB. Second, the system administrator must specify what percentage of memory is allocated to the various possible page sizes. Thus, if the system has not been set up to allow 1 MB pages, requesting this page size will be of no use.

The administrator can change the page size allocation percentages at run-time, for the entire system, using the systune command to change the system variables percent_totalmem_*_pages (see the System Configuration and Operation manual mentioned under “Related Manuals” on page xxix, Chapter 10).

Using Other Cache Techniques

Through the use of perfex, PC sampling, and ideal time profiling, cache problems can easily be identified. Two techniques for dealing with them have already been covered (“Understanding Cache Thrashing” on page 140 and “Using Copying to Circumvent TLB Thrashing” on page 146). Other useful methods for improving cache performance are loop fusion, cache blocking, and data transposition.

Understanding Loop Fusion

Consider the pair of loops in Example 6-9.

Example 6-9 Sequence of DAXPY and Dot-Product on a Single Vector
dimension a(N), b(N)
do i = 1, N
   a(i) = a(i) + alpha*b(i)
enddo
dot = 0.0
do i = 1, N
   dot = dot + a(i)*a(i)
enddo

These two loops carry out two vector operations. The first is the DAXPY operation we have seen before (Example 5-3 on page 99); the second is a dot-product. If n is large enough that the vectors don’t fit in cache, the code will stream the vector b from memory


through the cache once and the vector a twice: once to perform the DAXPY operation, and a second time to calculate the product. This violates the second general principle of cache management (“Principles of Good Cache Use” on page 138), that a program should use each cache line once and not revisit it.

The code in Example 6-10 performs the same work as Example 6-9 but in one loop.

Example 6-10 DAXPY and Dot-Product Loops Fused
dimension a(N), b(N)
dot = 0.0
do i = 1, N
   a(i) = a(i) + alpha*b(i)
   dot = dot + a(i)*a(i)
enddo

Now the vector a only needs to be streamed through the cache once because its contribution to the dot-product can be calculated while the cache line holding a(i) is still resident. This version takes approximately one-third less time than the original version.

Combining two loops into one is called loop fusion. One potential drawback to fusing loops, however, is that it makes the body of the fused loop larger. If the loop body gets too large, it can impede software pipelining. The Loop-Nest Optimizer tries to balance the conflicting goals of better cache behavior with efficient software pipelining.

Understanding Cache Blocking

Another standard technique used to improve cache performance is cache blocking, or cache tiling. Here, data structures that are too big to fit in the cache are broken up into smaller pieces that will fit in the cache. As an example, consider the matrix multiplication code in Example 6-11.

Example 6-11 Matrix Multiplication Loop
do j = 1, n
   do i = 1, m
      do k = 1, l
         c(i,j) = c(i,j) + a(i,k)*b(k,j)
      enddo
   enddo
enddo


If arrays a, b, and c are small enough that all fit in cache, performance is great. But if they are too big, performance drops substantially, as can be seen from the results in Table 6-1.

Table 6-1 Cache Effect on Performance of Matrix Multiply

m n p Seconds MFLOPS

30 30 30 0.000162 333.9

200 200 200 0.056613 282.6

1000 1000 1000 25.43118 78.6

(The results in Table 6-1 were obtained with LNO turned off because the LNO is designed to fix these problems!) The reason for the drop in performance is clear if you use perfex on the 1000 × 1000 case:

 0 Cycles...... 5460189286 27.858109
23 TLB misses...... 35017210 12.164907
25 Primary data cache misses...... 200511309 9.217382
26 Secondary data cache misses...... 24215965 9.141769

What causes this heavy memory traffic? To answer this question, we need to count the number of times each matrix element is touched. From reading the original code in Example 6-11, we can see the memory access pattern shown schematically in Figure 6-2.

Figure 6-2 Memory Use in Matrix Multiply

Calculating one element of c requires reading an entire row of a and an entire column of b. Calculating an entire column of c requires reading all the rows of a, so a is swept n times, once for each column of c. Similarly, every column of b must be reread for each row of c, so b is swept m times.


If a and b don’t fit in the cache, the earlier rows and columns are likely to be displaced by the later ones, so there is likely to be little reuse. As a result, these arrays will require streaming data in from main memory, n times for a and m times for b.

Because the operations on array cells in Example 6-11 are not order-dependent, this problem with cache misses can be fixed by treating the matrices in sub-blocks as shown in Figure 6-3. In blocking, a block of c is calculated by taking the dot-product of a block-row of a with a block-column of b. The dot-product consists of a series of sub-matrix multiplies. When three blocks, one from each matrix, all fit in cache simultaneously, the elements of those blocks need to be read in from memory only once for each sub-matrix-multiply.

Figure 6-3 Cache Blocking of Matrix Multiplication

The bigger the blocks are, the greater the reduction in memory traffic—but only up to a point. Three blocks must be able to fit into cache at the same time, so there is an upper bound on how big the blocks can be. For example, three 200 × 200 double-precision blocks take up almost 1 MB, so this is about as big as the blocks can be on a 1 MB cache system. (Most SN0 systems now have 4 MB caches or larger. The hinv command reports the cache size.) Furthermore, the addresses of the blocks must not conflict with themselves or one another; that is, elements of different blocks must not map to the same cache lines (see “Understanding Cache Thrashing” on page 140). Conflicts can be reduced or eliminated by proper padding of the leading dimension of the matrices, by careful alignment of the matrices in memory with respect to one another, and, if necessary, by further limiting of the block size.
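(As a check of the arithmetic, three 200 × 200 blocks of 8-byte elements occupy 3 × 200 × 200 × 8 = 960,000 bytes, just under the 1,048,576 bytes of a 1 MB cache.)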

One other factor affects what size blocks can be used: the TLB cache. The larger we make the blocks, the greater the number of TLB entries that will be required to map their data. If the blocks are too big, the TLB cache will thrash (see “Diagnosing and Eliminating TLB Thrashing” on page 145). For small matrices, in which a pair of pages spans multiple columns of a submatrix, this is not much of a restriction. But for very large matrices, this means that the width of a block will be limited by the number of available TLB entries.


(The submatrix multiplies can be carried out so that data accesses go down the columns of two of the three submatrixes and across the rows of only one submatrix. Thus, the number of TLB entries only limits the width of one of the submatrixes.) This is just one more detail that must be factored into how the block sizes are chosen.

Return now to the 1000 × 1000 matrix multiply example. Each column is 8,000 bytes in size. Thus (1048576 ÷ 2) ÷ 8000 = 65 columns will fit in one associative set of a 1 MB secondary cache, and 130 will fill both sets. In addition, each TLB entry maps four columns, so the block size is limited to about 225 before TLB thrashing starts to become a problem. The performance results for three different block sizes are shown in Table 6-2.

Table 6-2 Cache Blocking Performance of Large Matrix Multiply

m n p block order mbs nbs pbs Seconds MFLOPS

1000 1000 1000 ijk 65 65 65 6.837563 292.5

1000 1000 1000 ijk 130 130 130 7.144015 280.0

1000 1000 1000 ijk 195 195 195 7.323006 273.1

For block sizes that limit cache conflicts, the performance of the blocked algorithm on this larger problem matches the performance of the unblocked code on a size that fits entirely in the secondary cache (the 200 × 200 case in Table 6-1). The perfex counts for the block size of 65 are as follows:

 0 Cycles...... 1406325962 7.175132
21 Graduated floating point instructions...... 1017594402 5.191808
25 Primary data cache misses...... 82420215 3.788806
26 Secondary data cache misses...... 1082906 0.408808
23 TLB misses...... 208588 0.072463

The secondary cache misses have been greatly reduced and the TLB misses are now almost nonexistent. From the times perfex estimates, the primary cache misses look to be a potential problem (as much as 3.8 seconds of the run time). However, the perfex estimate assumes no overlap of primary cache misses with other work. The R10000 CPU manages to achieve a lot of overlap on these misses, so this “typical” time is an overestimate. This can be confirmed by looking at the times of the other events and comparing them with the total time. The inner loop of matrix multiply generates one madd instruction per cycle, and perfex estimates the time based on this same ratio, so the floating point time should be pretty accurate. On the basis of this number, the primary cache misses can account for, at most, two seconds of time. (This points out that, when estimating primary cache misses, you have to make some attempt to account for overlap, either by making an educated guess or by letting the other counts lead you to reasonable bounds.)


Square blocks are used for convenience in the example above. But there is some overhead to starting up and winding down software pipelines, so rectangular blocks, in which the longest dimension corresponds to the inner loop, generally perform the best.

Cache blocking is one of the optimizations the Loop-Nest Optimizer performs. It chooses block sizes that are as large as will fit into the cache without conflicts. In addition, it uses rectangular blocks to reduce software pipeline overhead and maximize data use from each cache line. If we simply compile the matrix multiply kernel with -O3 or -Ofast, which enable LNO, the unmodified code of Example 6-11 with 1000 × 1000 arrays achieves the same speed as the best run shown in Table 6-2.

But the LNO’s automatic cache blocking will not solve all possible cache problems automatically. Some require manual intervention.

Understanding Transpositions

Fast Fourier transform (FFT) algorithms present challenges on a cache-based machine because of their data access patterns, and because they usually involve arrays that are a power of two in size.

In a radix-2 FFT implementation, there are log(n) computational stages. In each stage, pairs of values are combined. In the first stage, these data are n/2 elements apart; in the second stage, data that are n/4 elements apart are combined; and so on. Figure 6-4 is a diagram showing the three stages of an 8-point transform.


Figure 6-4 Schematic of Data Motion in Radix-2 Fast Fourier Transform
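The access pattern, though not the arithmetic, of the radix-2 algorithm can be sketched as follows. The code is illustrative only; combine() stands for the butterfly computation and is not defined here.

void combine(float *u, float *v);   /* hypothetical butterfly routine */

void radix2_stages(float *x, int n)   /* n is assumed to be a power of two */
{
    int half, k, j;

    /* log2(n) stages; at each stage the elements combined are "half"
       positions apart: n/2 in the first stage, then n/4, ..., down to 1 */
    for (half = n/2; half >= 1; half /= 2) {
        for (k = 0; k < n; k += 2*half) {
            for (j = k; j < k + half; j++) {
                combine(&x[j], &x[j + half]);
            }
        }
    }
}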

Both the size of the cache and the separation between combined data elements is a power of 2, so this data access pattern is likely to cause cache conflicts when n is large enough that the data do not fit in the primary cache. The R10000 caches are two-way set associative, and the two-way conflicts of the radix-2 algorithm are resolved without thrashing. However, a radix-2 implementation is not particularly efficient. A radix-4 implementation, in which four elements are combined at each stage, is much better: it uses half the number of stages—which means it makes half the memory accesses—and it requires fewer arithmetic operations. But cache address conflicts among four data elements are likely to cause thrashing.

The solution is to view the data as a two-dimensional, rather than a one-dimensional, array. Viewed this way, half the stages of the algorithm become FFTs across the rows of the array, while the remainder are FFTs down the columns. The column FFTs are cache friendly (assuming Fortran storage order), so nothing special needs to be done for them.

The row FFTs can suffer from cache thrashing. The easiest way to eliminate the thrashing is through the use of copying (see “Using Copying to Circumvent TLB Thrashing” on page 146).


An alternative that results in slightly better performance is to transpose the two-dimensional array, so that cache-thrashing row FFTs become cache-friendly column FFTs. This method is described at a high level here. However, library algorithms for FFT are already tuned to implement this method, so there is rarely a need to reimplement it.

To carry out the transpose, partition the two-dimensional array into square blocks. If the array itself cannot be a square, it can have twice as many rows as columns, or vice versa. In this case, it can be divided into two squares; these two squares can be transposed independently.

The number of columns in a block is chosen to be the number that will fit in the cache without conflict (two-way conflicts are all right because of the two-way set associativity). The square array is then transposed as follows:
• Off-diagonal blocks in the first block-column and block-row are transposed and exchanged with one another. At this point, the entire first block-column will be in the cache.
• Before proceeding with the rest of the transpose, perform the FFTs down the first block-column on the in-cache data.
• Once these have been completed, exchange the next block-column and block-row. These are one block shorter than in the previous exchange; as a result, once the exchange is completed, not all of the second block-column is guaranteed to be in the cache. But most of it is, so FFTs are carried out on this second block-column.
• Continue exchanges and FFTs until all the row FFTs have been completed.
(A plain blocked transpose, without the interleaved FFTs, is sketched in C after this list.)
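For reference, a plain blocked transpose of a square array can be sketched as follows; the block size nb is assumed to divide n, and the function name and arguments are illustrative rather than taken from any library.

void blocked_transpose(float *a, int n, int nb)
{
    int ib, jb, i, j;
    float t;

    for (ib = 0; ib < n; ib += nb) {
        for (jb = ib; jb < n; jb += nb) {      /* visit each block pair once */
            for (i = ib; i < ib + nb; i++) {
                /* on the diagonal block, swap only the upper triangle */
                for (j = (ib == jb ? i + 1 : jb); j < jb + nb; j++) {
                    t = a[i*n + j];
                    a[i*n + j] = a[j*n + i];
                    a[j*n + i] = t;
                }
            }
        }
    }
}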

Now we are ready for the column FFTs, but there is a problem. The columns are now rows, and so the FFTs will cause conflicts. This is cured just as above: transpose the array, but to take greatest advantage of the data currently in the cache, proceed from right to left rather than from left to right.

When this transpose and the column FFTs have been completed, the work is almost done. A side effect of treating the one-dimensional FFT as a two-dimensional FFT is that the data end up transposed. Once again this is no problem; they can simply be transposed a third time.

Despite the three transpositions, this algorithm is significantly faster than allowing the row FFTs to thrash the cache, for data that don’t fit in L1. But what about data that do not even fit in the secondary cache? Treat these larger arrays in blocks just as above, but if the FFTs down the columns of the transposed array don’t fit in the primary cache, use another level of blocking and transposing for them.


Any well-tuned library routine for calculating FFT requires three separate algorithms: a textbook radix-4 algorithm for the small data sets that fit in L1; a single-blocked algorithm for the intermediate sizes; and a double-blocked algorithm for large data sets. The FFTs in CHALLENGE complib and SCSL are designed in just this way (see “Exploiting Existing Tuned Code” on page 49).

Summary

Each level of the memory hierarchy is at least an order of magnitude faster than the level below it. The basic guidelines of good cache management follow from this: fetch data from memory only once, and, when data has been fetched, use every word of it before moving on. Perfectly innocent loops can violate these guidelines when they are applied to large arrays, with the result that the program slows down drastically.

In order to diagnose memory access problems, look for cases where a profile based on real-time sampling differs from an ideal-time profile, or use ssrun experiments to isolate the delayed routines. Then use perfex counts to find out which levels of the memory hierarchy are the problem. Once the problem is clear, it can usually be fixed by inverting the order of loops so as to proceed through memory with stride one, or by copying scattered data into a compact buffer, or by transposing to make rows into columns. Or, as the next chapter details, the compiler can often take care of the problem automatically.

7. Using Loop Nest Optimization

When you compile with the highest optimization levels (-O3 or -Ofast), the compiler applies the Loop-Nest Optimizer (LNO). The LNO is capable of solving many cache-use problems automatically. When you know what it can do, you can use it to implement more sophisticated cache optimizations. This chapter surveys the features of the LNO in the following topics:
• “Understanding Loop Nest Optimizations” on page 157
• “Using Outer Loop Unrolling” on page 159
• “Using Loop Interchange” on page 165
• “Controlling Cache Blocking” on page 167
• “Using Loop Fusion and Fission” on page 170
• “Using Prefetching” on page 174
• “Using Array Padding” on page 180
• “Using Gather-Scatter and Vector Intrinsics” on page 182

Understanding Loop Nest Optimizations

The LNO attempts to improve cache and instruction scheduling by performing high-level transformations on the source code of loops. Some of the changes the LNO can make have been presented as manual techniques (for example, “Diagnosing and Eliminating Cache Thrashing” on page 144 and “Understanding Loop Fusion” on page 148). Often the LNO applies such techniques automatically. Many LNO optimizations have effects beyond improved cache behavior. For example, software pipelining is more effective following some of the LNO transformations.

Much of the time you can accept the default actions of the LNO and get good results. However, not every loop nest optimization improves every program, so there are instances in which you can improve performance by turning off some or all of the LNO options.


Requesting LNO

The LNO is turned on automatically when you use the highest level of optimization, -O3 or -Ofast. You send specific options to the LNO phase using the -LNO: compiler option group. The numerous sub-options are documented in the LNO(5) reference page. To restrict the LNO from transforming loops while performing other optimizations, use -O3 -LNO:opt=0.

You can invoke specific LNO features directly in the source code. In C, you use pragma statements; in Fortran, directives. The pragmas and directives are covered in the LNO(5) reference page. Source directives take precedence over the command-line options, but they can be disabled with the flag -LNO:ignore_pragmas.

When the compiler cannot determine the run-time size of data structures, it assumes they will not fit in the cache, and it carries out all the cache optimizations it is capable of. As a result, unnecessary optimizations can be performed on programs that are already cache-friendly. This typically occurs when the program uses an array whose size is not declared in the source. When this seems to be a problem, disable LNO transformations either for the entire module, or for a specific loop using a pragma or directive.

Reading the Transformation File

You can find out what transformations are performed by LNO, IPA, and other components of the compiler in a transformation file. This is a C or Fortran source file emitted by the compiler that shows what the program looks like after the optimizations have been applied.

You cause a Fortran transformation file to be emitted by using the -flist or -FLIST:=ON flags; a C file will be emitted if the -clist or -CLIST:=ON flags are used. The transformation file is given the same name as the source file being compiled, but .w2f or .w2c is inserted before the .f or .c suffix; for example the transformation file for test.f is test.w2f.f. The compiler issues a diagnostic message indicating that the transformation file has been written and what its name is.

When IPA is enabled (“Requesting IPA” on page 126), the transformation file is not written until the link step. Without IPA, the transformation file is written when the module is compiled.


To be as readable as possible, the transformation file uses as many of the variable names from the original source as it can. However, Fortran files are written using a free format in which lines begin with a tab character and do not adhere to the 72-character limit. The MIPSpro compilers compile this form correctly, but if you wish the file to be written using the standard 72-character Fortran layout, use the -FLIST:ansi_format flag.

The LNO’s transformations complicate the code of loops. It is educational to see what the LNO has done, and you can judge whether the changes made to a particular loop are justified.

In addition to the transformation file, recall that the compiler writes software pipelining report cards into the assembler file it emits when the -S option is used (see “Reading Software Pipelining Messages” on page 105).

Using Outer Loop Unrolling

One of the noncache optimizations that the LNO performs is outer loop unrolling (sometimes called register blocking, because it tends to get a small block of array values into registers for multiple operations). Consider the subroutine in Example 7-1, which implements a matrix multiplication.

Example 7-1 Matrix Multiplication Subroutine
subroutine mm(a,lda,b,ldb,c,ldc,m,l,n)
integer lda, ldb, ldc, m, l, n
real*8 a(lda,l), b(ldb,n), c(ldc,n)
do j = 1, n
   do k = 1, l
      do i = 1, m
         c(i,j) = c(i,j) + a(i,k)*b(k,j)
      enddo
   enddo
enddo
return
end

The SWP report card for this loop (assuming we compile with loop nest transformations turned off) is as shown in Example 7-2.


Example 7-2 SWP Report Card for Matrix Multiplication
# Pipelined loop line 7 steady state
#
#    2 unrollings before pipelining
#    6 cycles per 2 iterations
#    4 flops        ( 33% of peak) (madds count as 2)
#    2 flops        ( 16% of peak) (madds count as 1)
#    2 madds        ( 33% of peak)
#    6 mem refs     (100% of peak)
#    2 integer ops  ( 16% of peak)
#   10 instructions ( 41% of peak)
#    1 short trip threshold
#    6 ireg registers used.
#    8 fgr registers used.

Example 7-2 is just like the report card for the DAXPY operation (Example 5-7 on page 106). It says that, in the steady state, this loop will achieve 33% of the peak floating-point rate, and the reason for this less-than-peak performance is that the loop is limited by loads and stores to memory. But is it really?

Software pipelining is applied to only a single loop. In the case of a loop nest, as in Example 7-1, the innermost loop is software pipelined. The inner loop is processed as if it were independent of the loops surrounding it. But this is not the case in general, and it is certainly not the case for matrix multiply.

Consider the modified loop nest in Example 7-3.

Example 7-3 Matrix Multiplication Unrolled on Outer Loop
subroutine mm1(a,lda,b,ldb,c,ldc,m,l,n)
integer lda, ldb, ldc, m, l, n
real*8 a(lda,l), b(ldb,n), c(ldc,n)
do j = 1, n, nb
   do k = 1, l
      do i = 1, m
         c(i,j+0) = c(i,j+0) + a(i,k)*b(k,j+0)
         c(i,j+1) = c(i,j+1) + a(i,k)*b(k,j+1)
         . . .
         c(i,j+nb-1) = c(i,j+nb-1) + a(i,k)*b(k,j+nb-1)
      enddo
   enddo
enddo
return
end


Example 7-3 is the same as Example 7-1 except that the outermost loop has been unrolled nb times. (To be truly equivalent, the upper limit of the j-loop should be changed to n-nb+1 and an additional loop nest added to take care of leftover iterations when nb does not divide n evenly. These details have been omitted for ease of presentation.)

The thing to note about Example 7-3 in comparison with Example 7-1 is that it processes nb different j-values in the inner loop, and a(i,k) needs to be loaded into a register only once for all these calculations. Because of the order of operations in Example 7-1, the same a(i,k) value needs to be loaded once for each of the c(i,j+0), ... c(i,j+nb-1) calculations, that is, a total of nb times. Thus, the outer loop unrolling has reduced by a factor of nb the number of times that elements of matrix a need to be loaded.

Now consider the modified loop nest in Example 7-4.

Example 7-4 Matrix Multiplication Unrolled on Middle Loop
subroutine mm2(a,lda,b,ldb,c,ldc,m,l,n)
integer lda, ldb, ldc, m, l, n
real*8 a(lda,l), b(ldb,n), c(ldc,n)
do j = 1, n
   do k = 1, l, lb
      do i = 1, m
         c(i,j) = c(i,j) + a(i,k+0)*b(k+0,j)
         c(i,j) = c(i,j) + a(i,k+1)*b(k+1,j)
         . . .
         c(i,j) = c(i,j) + a(i,k+lb-1)*b(k+lb-1,j)
      enddo
   enddo
enddo
return
end

Here, the middle loop has been unrolled lb times. Because of the unrolling, c(i,j) needs to be loaded and stored only once for all lb calculations in the inner loop. Thus, unrolling the middle loop reduces the loads and stores of the c matrix by a factor of lb.

If both outer and middle loops are unrolled simultaneously, then both the load and store reductions occur. This code is shown in Example 7-5.


Example 7-5 Matrix Multiplication Unrolled on Outer and Middle Loops
do j = 1, n, nb
   do k = 1, l, lb
      do i = 1, m
         c(i,j+0) = c(i,j+0) + a(i,k+0)*b(k+0,j+0)
         c(i,j+0) = c(i,j+0) + a(i,k+1)*b(k+1,j+0)
         . . .
         c(i,j+0) = c(i,j+0) + a(i,k+lb-1)*b(k+lb-1,j+0)
         c(i,j+1) = c(i,j+1) + a(i,k+0)*b(k+0,j+1)
         c(i,j+1) = c(i,j+1) + a(i,k+1)*b(k+1,j+1)
         . . .
         c(i,j+1) = c(i,j+1) + a(i,k+lb-1)*b(k+lb-1,j+1)
         . . .
         c(i,j+nb-1) = c(i,j+nb-1) + a(i,k+0)*b(k+0,j+nb-1)
         c(i,j+nb-1) = c(i,j+nb-1) + a(i,k+1)*b(k+1,j+nb-1)
         . . .
         c(i,j+nb-1) = c(i,j+nb-1) + a(i,k+lb-1)*b(k+lb-1,j+nb-1)
      enddo
   enddo
enddo

The total operation count in the inner loop of Example 7-5 is as follows:

madds      lb * nb
c loads    nb
c stores   nb
a loads    lb
b loads    none—the b subscripts, j and k, do not vary in the inner loop; if there are enough machine registers, all needed b values are loaded in the middle loop and remain in registers throughout the inner loop.


With no unrolling (nb = lb = 1), the inner loop is memory-bound. However, the number of madds grows quadratically with the unrolling, while the number of memory operations grows only linearly. By increasing the unrolling, it may be possible to convert the loop from memory-bound to floating point-bound.

Accomplishing this requires that the number of madds be at least as large as the number of memory operations. Although a branch and a pointer increment must also be executed by the CPU, the integer ALUs are responsible for these operations. For the R10000 CPU there are plenty of superscalar slots for them to fit in, so they don’t figure in the selection of the unrolling factors. (This is not the case for R8000, because memory operations and madds together can fill up all the superscalar slots.)
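For example, with nb = 2 and lb = 4, the unrolled inner loop contains 2 × 4 = 8 madds against 2 c loads + 2 c stores + 4 a loads = 8 memory operations, so the floating-point and memory units are kept equally busy; as noted below, these are in fact the unrolling factors the compiler chooses for this kernel.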

The table in Figure 7-1 shows how changing the loop-unrolling counts lb and nb affect the software pipelining of Example 7-5.

Figure 7-1 Table of Loop-Unrolling Parameters for Matrix Multiply (lb and nb each range from 1 to 6; small unrolling factors produce a memory-bound schedule, intermediate factors an optimal schedule, and the largest factors a suboptimal schedule because of register shortage)

Several blocking choices produce schedules achieving 100% of floating-point peak. Keep in mind, though, that this reflects the steady-state performance of the inner loop for in-cache data. Some overhead is incurred in filling and draining the software pipeline, and this overhead varies depending on the unrolling values used. In addition, if the arrays don’t fit in the cache, the performance will be substantially lower (unless the array blocking optimization is also applied).


Outer loop unrolling is one optimization that the LNO performs; it chooses the proper amount of unrolling for loop nests such as this matrix multiply kernel. For this particular case, if you compile with blocking turned off, the compiler chooses to unroll the j-loop by two and the k-loop by four, achieving a perfect schedule. (With blocking turned on, the compiler blocks the matrix multiply in a more complicated transformation, so it is more difficult to rate the effectiveness of just the outer loop unrolling.)

Controlling Loop Unrolling

Although the LNO often chooses the optimal amount of unrolling, several flags and a directive are provided for fine-tuning. The three principal flags that control outer loop unrolling are these:

-LNO:outer_unroll=n
   Tells the compiler to unroll every outer loop for which unrolling is possible by exactly n. It either unrolls by n or not at all. If you use this flag, you cannot use the next two.

-LNO:outer_unroll_max=n
   Tells the compiler it may unroll any outer loop by as many as n times, but no more.

-LNO:outer_unroll_prod_max=n
   Says that the product of the unrolling factors applied to a given loop nest shall not exceed n. The default is 16.

These flags apply to all the loop nests in the file being compiled. To control the unrolling of an individual loop, you may use a directive or pragma. For Fortran, the directive is:

c*$* unroll (n)

For C, the corresponding pragma is:

#pragma unroll (n)

These directives apply to the loop immediately following the directive; when it is not innermost in its nest, outer loop unrolling is performed. The value of n must be at least one. If n = 1, no unrolling is performed. If n = 0, the default unrolling is applied.
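For instance (an illustrative nest, not taken from this chapter’s examples), the following pragma asks for the outer j loop to be unrolled by two, while the inner loop is left to the software pipeliner:

#define N 512
#define M 512

void scale_add(double c[N][M], double b[N][M], double alpha)
{
    int i, j;

/* request outer loop unrolling by 2 for the j loop */
#pragma unroll (2)
    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            c[j][i] = c[j][i] + alpha * b[j][i];
        }
    }
}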

When this directive immediately precedes an innermost loop, standard loop unrolling is done. This latter use is not recommended because the software pipeliner unrolls inner loops if it finds that beneficial; generally, the software pipeliner is the best judge of how much inner loop unrolling should be done.


Using Loop Interchange

The importance of stride-1 array accesses was explained under “Using Stride-One Access” on page 138. In C, for example, the initialization loop in Example 7-6 touches memory with a stride of m.

Example 7-6 Simple Loop Nest with Poor Cache Use
/* Reconstructed listing: a is assumed to be declared with rows of m
   doubles (n <= m), so consecutive values of j are m elements apart. */
for (i=0; i<n; i++) {
    for (j=0; j<m; j++) {
        a[j][i] = 0.0;
    }
}

Memory access is more efficient when the loops are interchanged, as in Example 7-7.

Example 7-7 Simple Loop Nest Interchanged for Stride-1 Access
for (j=0; j<m; j++) {
    for (i=0; i<n; i++) {
        a[j][i] = 0.0;
    }
}

This transformation is carried out automatically by the loop nest optimizer. However, the LNO has to consider more than just the cache behavior. What if, in Example 7-6, n=2 and m=100? (Or 1000, or 10000?) Then the array fits in cache, and the original loop order achieves full cache reuse. Worse, interchanging the loops puts the shorter i loop inside. Software pipeline code for any loop incurs substantial overhead; pipeline code for a loop with few iterations results, essentially, in all setup and no pipeline. It would be wrong of the compiler to transform the loops in this case.

Example 7-8 shows a different problem case.

Example 7-8 Loop Nest with Data Recursion
/* Reconstructed sketch: what matters is the recursion in i and the
   strided access when j is the inner loop. */
for (i=1; i<n; i++) {
    for (j=0; j<m; j++) {
        a[j][i] = a[j][i-1] + b[j][i];
    }
}


Reversing the order of this loop nest produces nice cache behavior, but the recursion in i limits the performance that software pipelining can achieve for the inner loop. The LNO needs to consider the cache effects, the instruction scheduling, and loop overhead when deciding whether it should reorder such a loop nest. (In this case, the LNO assumes that m and n are large enough that optimizing for cache behavior is the best choice.)

If you know that the LNO has made the wrong loop interchange choice, you can instruct it to make a correction. Similar to the need to tell the compiler about the aliasing model (see “Understanding Aliasing Models” on page 109), you can provide the LNO with the information it needs to make the right optimization. There are two ways to supply this extra information:
• Turn off loop interchange.
• Tell the compiler which order you want the loops in.

You can turn off loop interchange for the entire module with the -LNO:interchange=off flag. To turn it off for a single loop nest, precede the nest with a directive or pragma:

#pragma no interchange
c*$* no interchange

To specify a particular order for a loop nest, precede the loop nest with the following pragma or directive:

#pragma interchange (i, j [,k ...])
c*$* interchange (i, j [,k ...])

This directive instructs the compiler to try to order the loops so that the loop using index variable i is outermost, the one using j is inside it, and so on. The loops should be perfectly nested (that is, there should be no code between the do or for statements). The compiler will try to order the loops as requested, but it is not guaranteed.
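For example (again an illustrative nest, not from this chapter), the following requests that the loop over i become the outer loop, leaving the j loop innermost so that a is accessed with stride one:

#define N 512
#define M 512

void clear(double a[M][N])
{
    int i, j;

/* ask the compiler to make the i loop outermost */
#pragma interchange (i, j)
    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            a[i][j] = 0.0;
        }
    }
}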

Combining Loop Interchange and Loop Unrolling

Loop interchange and outer loop unrolling can be combined to solve some performance problems that neither technique can solve on its own. For example, consider the data-recursive loop nest in Example 7-8.

Interchange of this loop improves cache performance, but the recurrence on i limits software pipeline performance. If the j-loop is unrolled after interchange, the recurrence will be mitigated because there will be several independent streams of work that can be used to fill up the CPU’s functional units. A simplified version of the interchanged, unrolled loop is shown in Example 7-9.


Example 7-9 Recursive Loop Nest Interchanged and Unrolled
/* Reconstructed sketch: the j loop is unrolled by 2 after interchange;
   the cleanup code for odd m is omitted. */
for (j=0; j<m; j+=2) {
    for (i=1; i<n; i++) {
        a[j][i]   = a[j][i-1]   + b[j][i];
        a[j+1][i] = a[j+1][i-1] + b[j+1][i];
    }
}

The LNO considers interchange, unrolling, and blocking together to achieve the overall best performance.

Controlling Cache Blocking

The idea of cache blocking is explained under “Understanding Cache Blocking” on page 149. The LNO performs cache blocking automatically. Thus a matrix multiplication like the one in Example 7-10 is automatically transformed into a blocked loop nest.

Example 7-10 Matrix Multiplication in C
/* Reconstructed: a direct C transcription of the Fortran loop in
   Example 6-11, with c[m][n], a[m][l], and b[l][n]. */
for (j=0; j<n; j++) {
    for (i=0; i<m; i++) {
        for (k=0; k<l; k++) {
            c[i][j] = c[i][j] + a[i][k]*b[k][j];
        }
    }
}

The transformation of Example 7-10, as reported in the w2c file (see “Reading the Transformation File” on page 158), resembles Example 7-11.

Example 7-11 Cache-Blocked Matrix Multiplication
/* Reconstructed sketch: BS stands for the block size the LNO chooses,
   and MIN() for the obvious minimum macro. */
for (ii=0; ii<m; ii+=BS) {
    for (jj=0; jj<n; jj+=BS) {
        for (kk=0; kk<l; kk+=BS) {
            for (i=ii; i<MIN(m,ii+BS); i++) {
                for (j=jj; j<MIN(n,jj+BS); j++) {
                    for (k=kk; k<MIN(l,kk+BS); k++) {
                        c[i][j] = c[i][j] + a[i][k]*b[k][j];
                    }
                }
            }
        }
    }
}


(The transformed code is actually more complicated than this: register blocking is used on the i- and j-loops; separate cleanup loops are generated, and the compiler takes multiple cache levels into consideration.)

Adjusting Cache Blocking Block Sizes

For the majority of programs, these transformations are the proper optimizations. But in some cases, they can hurt performance. For example, if the data already fit in the cache, the blocking merely introduces overhead, and should be avoided. Several flags and directives are available to fine-tune the blocking that the LNO performs, as summarized in Table 7-1.

Table 7-1 LNO Options and Directives for Cache Blocking

Compiler flag: -LNO:blocking=off
Equivalent directive or pragma: C*$* NO BLOCKING / #pragma no blocking
Effect: Prevent cache blocking for the whole module or one loop nest.

Compiler flag: -LNO:blocking_size=[l1][,l2]
Equivalent directive or pragma: C*$* BLOCKING SIZE [l1][,l2] / #pragma blocking size ([l1][,l2])
Effect: Specify the block size for the L1 cache, the L2 cache, or both.

Compiler flag: (none)
Equivalent directive or pragma: C*$* BLOCKABLE (dovar,dovar...)
Effect: Specify that a loop nest is blockable.

If a loop operates on in-cache data, or if you have already done your own blocking, the LNO’s blocking can be turned off completely or for a specific loop nest.

Occasionally you might find a reason to dictate a block size different from the default calculated by the LNO (examine the transformation file to find out the block size it used). You can specify a specific block size for all blocked loop nests in the module with the -LNO:blocking_size flag, or you can specify the blocking sizes for a specific loop nest using a directive. Example 7-12 shows a Fortran loop with a specified blocking size.

Example 7-12 Fortran Nest with Explicit Cache Block Sizes for Middle and Inner Loops
subroutine amat(x,y,z,n,m,mm)
real*8 x(1000,1000), y(1000,1000), z(1000,1000)
do k = 1, n
C*$* BLOCKING SIZE (0,200)
   do j = 1, m
C*$* BLOCKING SIZE (0,200)
      do i = 1, mm
         z(i,k) = z(i,k) + x(i,j)*y(j,k)
      enddo
   enddo
enddo
return
end

Compiling Example 7-12, the LNO tries to make 200 x 200 blocks for the j- and i-loops. But there is no requirement that it interchange loops so that the k-loop is inside the other two. You can instruct the compiler that you want this done by inserting a directive with a block size of zero above the k-loop, as shown in Example 7-13.

Example 7-13 Fortran Loop with Explicit Cache Block Sizes and Interchange
subroutine amat(x,y,z,n,m,mm)
real*8 x(1000,1000), y(1000,1000), z(1000,1000)
C*$* BLOCKING SIZE (0,0)
do k = 1, n
C*$* BLOCKING SIZE (0,200)
   do j = 1, m
C*$* BLOCKING SIZE (0,200)
      do i = 1, mm
         z(i,k) = z(i,k) + x(i,j)*y(j,k)
      enddo
   enddo
enddo
return
end

This produces a blocked loop nest with a form like that shown in Example 7-14.

Example 7-14 Transformed Fortran Loop
subroutine amat(x,y,z,n,m,mm)
real*8 x(1000,1000), y(1000,1000), z(1000,1000)
do jj = 1, m, 200
   do ii = 1, mm, 200
      do k = 1, n
         do j = jj, MIN(m, jj+199)
            do i = ii, MIN(mm, ii+199)
               z(i,k) = z(i,k) + x(i,j)*y(j,k)
            enddo
         enddo
      enddo
   enddo
enddo
return
end


Adjusting the Optimizer’s Cache Model

The second way to affect block sizes is to modify the compiler’s cache model. The LNO makes blocking decisions at compile time using a fixed description of the cache system of the target processor board. Although the -Ofast=ip27 flag tells the compiler to optimize for current SN0 systems, it can’t know what size the secondary cache will be in the system on which the program runs—1 MB, 4 MB, and larger secondary caches are available. As a result, it chooses to optimize for the lowest common denominator, the 1 MB cache.

A variety of flags tell the compiler to optimize for specific sizes and structures of L1, L2, and TLB caches. See the LNO(5) reference page for a discussion.

Using Loop Fusion and Fission

The Loop Nest Optimizer can transform loops by loop fusion, loop fission, and loop peeling.

Using Loop Fusion

Loop fusion, described under “Understanding Loop Fusion” on page 148, is a standard technique that can improve cache performance. The LNO performs this optimization automatically. It will even perform loop peeling, if necessary, to fuse loops. Example 7-15 shows a pair of adjacent loops that could be fused if they used the same range of indexes.

Example 7-15 Adjacent Loops that Cannot be Fused
do i = 1, n
   a(i) = 0.0
enddo
do i = 2, n-1
   c(i) = 0.5 * (b(i+1) + b(i-1))
enddo

The LNO applies loop peeling and then loop fusion to produce code like Example 7-16.


Example 7-16 Adjacent Loops Fused After Peeling
a(1) = 0.0
do i = 2, n-1
   a(i) = 0.0
   c(i) = 0.5 * (b(i+1) + b(i-1))
enddo
a(n) = 0.0

The first and last iterations of the first loop are peeled off so that you end up with two loops that run over the same iterations; these two loops may then be fused.

Using Loop Fission

Although loop fusion improves cache performance, it has one potential drawback: It makes the body of the loop larger. Large loop bodies place greater demands on the compiler because more registers are required to schedule the loop efficiently. The compilation takes longer since the algorithms for register allocation and software pipelining grow faster than linearly with the size of the loop body. To balance these negative aspects, the LNO also performs loop fission, whereby large loops are broken into smaller, more manageable ones. Loop fission can require the invention of new variables. Consider Example 7-17, in which an intermediate value must be saved in a temporary location s over several statements.

Example 7-17 Sketch of a Loop with a Long Body
/* Reconstructed schematic: an intermediate value s is computed, then
   used several statements later in the same long loop body. */
for (j=0; j<n; j++) {
    s = b[j] + c[j];
    /* ... many other statements ... */
    a[j] = s * d[j];
}

The compiler can introduce an array of temporaries and then break the loop into two pieces, as shown in Example 7-18.


Example 7-18 Sketch of a Loop After Fission
/* Reconstructed schematic: the compiler introduces an array of
   temporaries, s_tmp, so the loop can be split in two. */
for (j=0; j<n; j++) {
    s_tmp[j] = b[j] + c[j];
    /* ... first part of the original body ... */
}
for (j=0; j<n; j++) {
    a[j] = s_tmp[j] * d[j];
    /* ... remainder of the original body ... */
}

Example 7-19 shows an outer loop that controls two inner loops, neither of which uses stride-1 access. Cache behavior would be better if the loops could be reversed, but this cannot be done as the code stands.

Example 7-19 Loop Nest that Cannot Be Interchanged
/* Reconstructed sketch: one outer loop controls two separate inner
   loops, so the nest is not perfectly nested; neither inner loop is
   stride-1 in C. The loop bodies are illustrative. */
for (j=0; j<n; j++) {
    for (i=0; i<m; i++) {
        a[i][j] = 0.0;
    }
    for (i=0; i<m; i++) {
        b[i][j] = b[i][j] * s;
    }
}

The LNO can apply loop fission to separate the two inner loops. Then it can interchange both, as shown in Example 7-20, and improve the cache behavior.

Example 7-20 Loop Nest After Fission and Interchange
for (i=0; i<m; i++) {
    for (j=0; j<n; j++) {
        a[i][j] = 0.0;
    }
}
for (i=0; i<m; i++) {
    for (j=0; j<n; j++) {
        b[i][j] = b[i][j] * s;
    }
}


Controlling Fission and Fusion

Like most optimizations, loop fission presents some trade-offs. Some common subexpression elimination that occurs when all the code is in one loop can be lost when the loops are split. In addition, more loops mean more code, so instruction cache performance can suffer. Loop fission also needs to be balanced with loop fusion, which has its own benefits and liabilities.

The compiler attempts to mix both optimizations, as well as the others it performs, to produce the best code possible. Different mixtures of optimizations are required in different situations, so it is impossible for the compiler always to achieve the best performance. As a result, several flags and directives are provided to control loop fusion and fission, as listed in the LNO(5) reference page and summarized in Table 7-2.

Table 7-2 LNO Options and Directives for Loop Transformation

Compiler Flag                  Equivalent Directive or Pragma             Effect

-LNO:fission={0|1|2}           (none)                                     Disable fission, use it, or apply it before fusion.

-LNO:fusion={0|1|2}            (none)                                     Disable fusion, use it, or apply it before fission.

-LNO:fusion_peeling_limit=n    (none)                                     Limit loop peeling before fusion.

(none)                         C*$* FISSION / #pragma fission             Request fissioning of a loop.

(none)                         C*$* FISSIONABLE / #pragma fissionable     Request fission without checks.

(none)                         C*$* NO FISSION / #pragma no fission       Prevent fissioning of a loop.

(none)                         C*$* FUSE / #pragma fuse                   Request fusing of following loops.

(none)                         C*$* FUSABLE / #pragma fusable             Request fusing without checks.

(none)                         C*$* NO FUSION / #pragma no fusion         Prevent fusing loops.


Using Prefetching

As described under “Understanding Prefetching” on page 137, the MIPS IV ISA contains prefetch instructions—instructions that move data from main memory into cache in advance of their use. This allows some or all of the time required to access the data to be hidden behind other work. One optimization the LNO performs is to insert prefetch instructions into your program. To see how this works, consider the simple reduction loop in Example 7-21.

Example 7-21 Simple Reduction Loop Needing Prefetch

for (i=0; i<n; i++) {
   sum += b[i];
}

If the loop in Example 7-21 is executed exactly as written, every secondary cache miss of the vector b stalls the CPU, because there is no other work to do. Now consider the modification in Example 7-22. In this example, “prefetch” is not a C statement; it indicates only that a prefetch instruction for the data at the specified address is issued at that point in the program.

Example 7-22 Simple Reduction Loop with Prefetch

for (i=0; i<n; i++) {
   prefetch b[i+16];
   sum += b[i];
}

If b is an array of doubles, the address b[i+16] is one L2 cache line (128 bytes) ahead of the value that is being read in the current iteration. With this prefetch instruction inserted in the loop, each cache line is requested 16 iterations before it needs to be used. Prefetch instructions are ignored when they address a cache line that is already in-cache or on the way to cache.

Each iteration of the loop in Example 7-22 takes two cycles: the prefetch instruction and the load of b[i] each take one cycle and the addition is overlapped. Therefore, cache lines are prefetched 32 cycles (16, 2-cycle iterations) in advance of their use. The latency of a cache miss to local memory is approximately 60 cycles, and roughly half of this time will be overlapped with work on the previous cache line.


Prefetch Overhead and Unrolling

Example 7-22 issues a prefetch instruction for every load. For out-of-cache data, this is not bad because the time spent executing prefetch instructions would otherwise be spent waiting for the cache miss. However, if the loop processed only in-cache data, the performance would be half what it could be—half the instruction cycles would be needless prefetch instructions.

Only one prefetch instruction is needed per cache line, so it would be better to make prefetching conditional, as suggested in Example 7-23.

Example 7-23 Reduction with Conditional Prefetch

for (i=0; i<n; i++) {
   if ((i % 16) == 0) prefetch b[i+16];
   sum += b[i];
}

This implementation issues just one prefetch instruction per cache line, but the instructions that implement the if-test cost more than issuing a single redundant prefetch, so it is no improvement. A better way to reduce prefetch overhead is to unroll the loop as shown in Example 7-24.

Example 7-24 Reduction with Prefetch Unrolled Once

for (i=0; i<n; i+=2) {
   prefetch b[i+16];
   sum += b[i];
   sum += b[i+1];
}

In Example 7-24, prefetch instructions are issued only eight times per cache line, so the in-cache case pays for eight fewer redundant prefetch instructions. Further unrolling reduces them even more; in fact, unrolling by 16 eliminates all redundant prefetches (and drastically reduces the loop-management overhead).

The LNO automatically inserts prefetch instructions and uses unrolling to limit the number of redundant prefetch instructions it generates. It does not unroll the loop enough to eliminate the redundancies completely because the extra unrolling, like too much loop fusion, can have a negative effect on software pipelining. So, in prefetching, as in all its other optimizations, the LNO tries to achieve a delicate balance.


The LNO does one further thing to improve prefetch performance. Instead of prefetching just one cache line ahead, it prefetches two or more cache lines ahead so that most, if not all, of the memory latency is overlapped. Consider Example 7-25, a reduction loop unrolled with a two-ahead prefetch.

Example 7-25 Reduction Loop Unrolled with Two-Ahead Prefetch

for (i=0; i<n; i+=8) {
   prefetch b[i+32];
   sum += b[i];
   sum += b[i+1];
   sum += b[i+2];
   sum += b[i+3];
   sum += b[i+4];
   sum += b[i+5];
   sum += b[i+6];
   sum += b[i+7];
}

This code prefetches two cache lines ahead, and makes only one redundant prefetch per cache line.

Using Pseudo-Prefetching

Prefetch instructions can be used to move data into the primary cache from either main memory or the secondary cache. However, the latency for fetching from L2 to L1 cache is small, eight to ten cycles, and it can usually be hidden without resorting to prefetch instructions. Consider Example 7-26, a reduction loop unrolled four times.

Example 7-26 Reduction Loop Unrolled Four Times

for (i=0; i<n; i+=4) {
   sum += b[i];
   sum += b[i+1];
   sum += b[i+2];
   sum += b[i+3];
}


Introduce a temporary register and reorder the instructions to produce Example 7-27.

Example 7-27 Reduction Loop Reordered for Pseudo-Prefetching

t = b[0];
for (i=0; i<n; i+=4) {
   next = b[i+4];      /* touches the next 32-byte L1 line early */
   sum += t;
   sum += b[i+1];
   sum += b[i+2];
   sum += b[i+3];
   t = next;
}

Written this way, data in the next 32-byte L1 cache line, b[i+4], is referenced several cycles before it is needed. The fetch of that line is overlapped with operations on the current cache line.

This technique, called pseudo-prefetch, is the default way that the LNO hides the latency of access to the secondary cache. No prefetch instructions are used, so it introduces no overhead for in-cache data. It does require the use of additional registers, however.

Controlling Prefetching

To provide the best performance for the majority of programs, the LNO generally assumes that any variable whose size it cannot determine is not cache-resident. It generates prefetch instructions in loops involving these data. Sometimes—for example, as a result of cache blocking—it can determine that certain cache misses will be satisfied from the secondary cache rather than main memory. For these misses the LNO uses pseudo-prefetching.

Although this default behavior works well most of the time, it may impose overhead in cases where the data fit in the cache, or where you have already put your own cache blocking into the program. Several flags and directives are provided to fine-tune such situations. They are summarized in Table 7-3.


Table 7-3 LNO Options and Directives for Prefetch

Compiler Flag            Related Directive or Pragma                                Effect

-LNO:pfn=                C*$* PREFETCH / #pragma prefetch                           Enable or disable prefetching for cache level n.

-LNO:prefetch=n          (none)                                                     Set no, normal, or aggressive prefetch.

-LNO:prefetch_ahead=n    (none)                                                     How many cache lines ahead to prefetch.

-LNO:prefetch_manual     C*$* PREFETCH_MANUAL / #pragma prefetch_manual             Enable or disable manual prefetch.

(none)                   C*$* PREFETCH_REF / #pragma prefetch_ref                   Generate a prefetch instruction (manual prefetch).

(none)                   C*$* PREFETCH_REF_DISABLE / #pragma prefetch_ref_disable   Disable prefetching for a specified array.

Refer to the LNO(5) reference page for the compiler flags and Fortran directives, and to the MIPSpro C and C++ Pragmas manual listed under “Related Manuals” on page xxix for the pragmas.

Using Manual Prefetching

For standard vector code, automatic prefetching does a good job; but it can’t do a perfect job for all code. One limitation is that it applies only to sequential access over dense arrays. If the program uses sparse or nonrectangular data structures, it is not always possible for the compiler to analyze where prefetch instructions should be placed. If your code does not access data in a regular fashion using loops, prefetches are also not used. Furthermore, data accessed in outer loops is not prefetched.

You can insert an explicit prefetch using the prefetch_ref pragma or directive listed in Table 7-3. This directive generates a single prefetch instruction to the memory location specified. This memory location does not have to be a compile-time constant expression; it can involve any program variables, for example, an indexed array element or a pointer expression. Automatic prefetching, if it is enabled, ignores all references to this array in the loop nest containing the directive; this allows you to use manual prefetching on some data in conjunction with automatic prefetching on other data.


Ideally, the program should execute precisely one prefetch instruction per cache line. The stride option allows you to specify how often the prefetch instruction is to be issued. For example, Example 7-28 shows a Fortran fragment that uses two-ahead manual prefetch.

Example 7-28 Fortran Use of Manual Prefetch

c*$* prefetch_ref=a(1)
c*$* prefetch_ref=a(1+16)
      do i = 1, n
c*$* prefetch_ref=a(i+32), stride=16, kind=rd
         sum = sum + a(i)
      enddo

In Example 7-28, the stride=16 clause in the directive tells the compiler that it should insert a prefetch instruction only once every 16 iterations. As with automatic prefetching, the compiler unrolls the loop and places just one prefetch instruction in the loop body. Because there is a limit to the amount of unrolling that is beneficial, the compiler may choose to unroll less than specified in the stride value.

The MIPS IV prefetch instructions allow you to specify whether the prefetched line is intended to be read from or written into. For the R10000 CPU, this information can be used to stream some data through one associative set of the cache while leaving other data in the second set; this can result in fewer cache conflicts. Thus, it is a good idea to specify whether the prefetched data are to be read or written. If the data will be both read and written, specify that they are written.

You can specify the level in the memory hierarchy to prefetch with the level clause. Most often you need only prefetch from main memory into L2, so the default level of 2 is correct.

You can use the size clause to tell the compiler how much space in the cache will be used by the array being manually prefetched. This value helps the compiler decide which other arrays should be prefetched automatically. If the compiler believes there is a lot of available cache space, other arrays are less likely to be prefetched because there is room for them—or, at least, blocks of them—to reside in the cache. On the other hand, if the compiler believes the other arrays will not fit in the cache, it will generate code that streams them through the cache and uses prefetch instructions to improve the out-of-cache performance. So, when using the prefetch_ref directive, if you find that other arrays are not automatically being prefetched, use the size clause with a large value to encourage the compiler to generate prefetch instructions for them. Conversely, if other arrays are being prefetched and should not be, a small size value may encourage the compiler not to prefetch the other arrays.


Using Array Padding

As described under “Understanding Cache Thrashing” on page 140, an unlucky alignment of arrays can cause significant performance penalties from cache thrashing. This is particularly likely when array dimensions are a power of 2. Fortunately, the LNO automatically pads arrays to eliminate or reduce such cache conflicts. Local variables, such as those in the typical code in Example 7-29, are spaced out so that cache thrashing does not occur.

Example 7-29 Typical Fortran Declaration of Local Arrays

      dimension a(max), b(max), c(max), d(max)
      do i = 1, max
         a(i) = b(i) + c(i)*d(i)
      enddo

Note that if you compile Example 7-29 with -flist, you won’t see padding explicitly in the listing file.

The MIPSpro compiler also pads C global variables and common blocks such as the following when you use the highest optimization level, -O3 or -Ofast:

      subroutine sub
      common a(512,512), b(512*512)

The compiler does this by putting each variable in its own common block. This optimization only occurs when both of the following are true:
• All routines that use the common block are compiled at -O3.
• The compiler detects no inconsistent uses of the common block across routines.

If any routine that uses the common block is compiled at an optimization level lower than -O3, or if there are inconsistent uses, the splitting, and hence the padding, will not be performed. Instead, the linker pastes the common blocks back together, and the original unsplit version is used. Thus, this implementation of common block padding is perfectly safe, even if you make a mistake in usage.


One programming practice causes problems. If out-of-bounds array accesses are made, wrong answers can result. For example, one improper but not uncommon usage is to operate on all data in a common block as if they were contiguous, as in Example 7-30.

Example 7-30 Common, Improper Fortran Practice

      subroutine sub
      common a(512,512), b(512*512)
      do i = 1, 2*512*512
         a(i) = 0.0
      enddo
      end

In this example, both a and b are zeroed out by indexing only through a, under the assumption that array b immediately follows a in memory. (Note that this is not legal Fortran.) This code will not work as expected if common block padding is used. There are three ways to deal with problems of this type:
• Compile routines using this common block at -O2 instead of -O3. This is not recommended because it is likely to hurt performance.
• Compile using the flag -OPT:reorg_common=off. This disables the common block padding optimization. Although it compensates for the improper coding style, it means other common blocks are also not padded and may result in less than optimal performance.
• Fix the code so that it is legal Fortran.

Naturally, the last alternative is recommended. To detect such programming mistakes, use the -check_bounds flag. Be sure to test your Fortran programs with this flag if you encounter incorrect answers only after compiling with -O3.
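For example, a bounds-checked debugging compile might look like the following (prog.f is a placeholder for your source file; the only flag of interest here is -check_bounds, named above):

   % f77 -O3 -check_bounds -o prog prog.f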

If you use interprocedural analysis, the compiler can also perform padding inside common block arrays. For example, in a common block declared as follows, the compiler pads the first dimension as part of IPA optimization:

      common a(1024,1024)


Using Gather-Scatter and Vector Intrinsics

The LNO performs two other optimizations on loops: gather-scatter and converting scalar math intrinsics to vector calls.

Understanding Gather-Scatter

The gather-scatter optimization is best explained via an example. In the code fragment in Example 7-31, the loop can be software pipelined even though there is a branch inside it; the resulting code, however, is not fast.

Example 7-31 Fortran Loop to which Gather-Scatter Is Applicable

      subroutine fred(a,b,c,n)
      real*8 a(n), b(n), c(n)
      do i = 1, n
         if (c(i) .gt. 0.0) then
            a(i) = c(i)/b(i)
            c(i) = c(i)*b(i)
            b(i) = 2*b(i)
         endif
      enddo
      return
      end

Code of this type has two sources of inefficiency. First, the fact that some array elements are skipped makes it impossible for the software pipeline to work efficiently. Second, each time the if condition fails, the CPU performs a few instructions speculatively before the failure is evident. CPU cycles are wasted on these instructions and on internal resynchronization.

A faster implementation separates the scan of the array and the processing of selected elements into two loops, as shown in Example 7-32.

Example 7-32 Fortran Loop with Gather-Scatter Applied

      do i = 1, n
         deref_gs(inc_0 + 1) = i
         if (c(i) .GT. 0.0) then
            inc_0 = (inc_0 + 1)
         endif
      enddo
      do ind_0 = 0, inc_0 + -1
         i_gs = deref_gs(ind_0 + 1)
         a(i_gs) = (c(i_gs)/b(i_gs))
         c(i_gs) = (b(i_gs)*c(i_gs))
         b(i_gs) = (b(i_gs)*2.0)
      enddo

Here, deref_gs is a scratch array allocated on the stack by the compiler.

This transformed code first scans through the data, gathering the indices for which the conditional test is true. Then the second loop does the work on the gathered data. The condition is true for all these iterations, so the if statement can be removed from the second loop. The loop can now be efficiently pipelined and unused speculative instructions are avoided.

By default, the compiler performs this optimization for loops that have non-nested if statements. For loops that have nested if statements, the optimization can be tried by using the flag -LNO:gather_scatter=2. For these loops, the optimization is less likely to be advantageous, so this is not the default. In some instances, this optimization could slow down the code (for example, if the loop iterates over a small range). In such cases, disable this optimization completely with the flag -LNO:gather_scatter=0.

Vector Intrinsics

In addition to the gather-scatter optimization, the LNO also converts scalar math intrinsic calls into vector calls so they can take advantage of the vector math routines in the math library (see “Standard Math Library” on page 49). Consider the Fortran loop in Example 7-33.

Example 7-33 Fortran Loop That Processes a Vector

      subroutine vfred(a)
      real*8 a(200)
      do i = 1, 200
         a(i) = a(i) + cos(a(i))
      enddo
      return
      end


The loop in Example 7-33 applies a scalar math function, cos(), to a stride-1 vector. The LNO can convert it into a call on the vector routine vcos() similar to the code in Example 7-34.

Example 7-34 Fortran Loop Transformed to Vector Intrinsic Call

      CALL vcos$(a(1), deref_se1_F8(1), %val((201-1)), %val(1), %val(1))
      DO i = 1, 200, 1
         a(i) = (a(i) + deref_se1_F8(i))
      END DO

As documented in the math(3) reference page, the input and output arrays of the vector routines must not overlap. The code in Example 7-33 stores its result back into array a. In order to call the vector routine, the compiler creates deref_se1_F8, a scratch array allocated on the stack, to hold the vector output. The vector routine is sufficiently faster than the scalar loop to make up for the overhead of copying.

The results of using a vector routine may not agree, bit for bit, with the result of the scalar routines. If for some reason it is critical to have numeric agreement precise to the last bit, disable this optimization with the flag -LNO:vintr=off.

The actual performance of the vector intrinsic functions is graphed in Figure 7-2. As it shows, the vector functions are always faster than the scalar functions when the vector has at least ten elements.


[Figure 7-2 plots scalar versus vector intrinsics on a 195 MHz Origin2000: nanoseconds per element as a function of vector length (real*8) for scalar and vector versions of log, exp, and sin.]

Figure 7-2 Performance of Vector Intrinsic Functions on an Origin2000

Summary

The Loop Nest Optimizer (LNO) transforms loops and nests of loops for better cache access and for more effective software pipelining. It performs many cache access modifications automatically. You can help the LNO generate the fastest code in the following ways:
• In C code, use the most aggressive aliasing model you can (see “Understanding Aliasing Models” on page 109).
• Use the ivdep directive with those loops for which it is valid (see “Breaking Other Dependencies” on page 115).
• Avoid equivalence statements in Fortran code; they introduce aliasing and prevent padding and other optimizations.


• Avoid goto statements; they obscure control flow and prevent the compiler from analyzing your code effectively.
• Declare and use common blocks consistently in every module so that the LNO can properly pad them.
• Don’t unroll loops by hand (unless you don’t want the compiler to unroll them).
• Don’t violate the Fortran standard. Out-of-bounds array references can cause the common block padding optimizations to generate wrong results.

8. Tuning for Parallel Processing

The preceding chapters have covered the process of making a program run faster in a single CPU. This chapter focuses on running the optimized program concurrently on multiple CPUs, expecting the program to complete in less time when more CPUs are applied to it. Parallel processing introduces new bottlenecks, but there are additional tools to deal with them. This chapter covers the following major topics:
• “Understanding Parallel Speedup and Amdahl’s Law” on page 188 introduces the mathematical rule that limits the ultimate gain from parallel processing.
• “Compiling Serial Code for Parallel Execution” on page 193 discusses taking a one-CPU program and running it on multiple CPUs, without changing the source code.
• “Explicit Models of Parallel Computation” on page 194 contains a survey of the different types of parallel programming you can use to gain direct control of parallel execution.
• “Tuning Parallel Code for SN0” on page 196 discusses general bottlenecks that can affect any parallel program.
• “Scalability and Data Placement” on page 205 discusses the effect of data placement on performance in the SN0 distributed shared memory system, and how to tune data placement without changing program source.
• “Using Data Distribution Directives” on page 222 reviews the compiler directives you can use to specify data placement in the source of programs using the MP library.
• “Non-MP Library Programs and Dplace” on page 243 discusses data placement for programs that do not use the MP library for parallel execution.


Understanding Parallel Speedup and Amdahl’s Law

There are two ways to obtain the use of multiple CPUs. You can take a conventional program in C, C++, or Fortran, and have the compiler find the parallelism that is implicit in the code. This method is surveyed under “Compiling Serial Code for Parallel Execution” on page 193.

You can write your source code to use explicit parallelism, stating in the source code which parts of the program are to execute asynchronously, and how the parts are to coordinate with each other. “Explicit Models of Parallel Computation” on page 194 is a survey of the programming models you can use for this, with pointers to the online manuals.

When your program runs on more than one CPU, its total run time should be less. But how much less? What are the limits on the speedup? That is, if you apply 16 CPUs to the program, should it finish in 1/16th the elapsed time?

Adding CPUs to Shorten Execution Time

You can distribute the work your program does over multiple CPUs. However, there is always some part of the program’s logic that has to be executed serially, by a single CPU. This sets the lower limit on program run time.

Suppose there is one loop in which the program spends 50% of the execution time. If you can divide the iterations of this loop so that half of them are done in one CPU while the other half are done at the same time in a different CPU, the whole loop can be finished in half the time. The result: a 25% reduction in program execution time.

The mathematical treatment of these ideas is called Amdahl’s law, for computer pioneer Gene Amdahl, who formalized it. There are two basic limits to the speedup you can achieve by parallel execution:
• The fraction of the program that can be run in parallel, p, is never 100%.
• Because of hardware constraints, after a certain point, there is less and less benefit from each added CPU.

Tuning for parallel execution comes down to battling these two limits. You strive to increase the parallel fraction, p, because in some cases even a small change in p (from 0.8 to 0.85, for example) makes a dramatic change in the effectiveness of added CPUs.


Then you work to ensure that each added CPU does a full CPU’s work, and does not interfere with the work of other CPUs. In the SN0 architectures this means:
• Spreading the workload equally among the CPUs.
• Eliminating false sharing and other types of memory contention between CPUs.
• Making sure that the data used by each CPU are located in a memory near that CPU’s node.

Understanding Parallel Speedup

If half the iterations of a DO-loop are performed on one CPU, and the other half run at the same time on a second CPU, the whole DO-loop should complete in half the time. For example, consider the typical C loop in Example 8-1.

Example 8-1 Typical C Loop

for (j=0; j<MAX; j++) {
   z[j] = a[j] * b[j];
}

The MIPSpro C compiler can automatically distribute such a loop over n CPUs (with n decided at run time based on the available hardware), so that each CPU performs MAX/n iterations.
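Conceptually, the result is as if each CPU ran a contiguous block of roughly MAX/n iterations. The following is a minimal C sketch of that idea only; it is not actual compiler output, and the function name, array names, and chunking scheme are invented for illustration:

#include <stdio.h>

#define MAX 1000
double z[MAX], a[MAX], b[MAX];

/* CPU k of n_cpus handles its own contiguous block of iterations. */
void do_chunk(int k, int n_cpus)
{
    int chunk = (MAX + n_cpus - 1) / n_cpus;   /* ceiling of MAX/n_cpus */
    int lo = k * chunk;
    int hi = (lo + chunk < MAX) ? (lo + chunk) : MAX;
    int j;

    for (j = lo; j < hi; j++)
        z[j] = a[j] * b[j];                    /* body of Example 8-1 */
}

int main(void)
{
    int n_cpus = 4, k;

    for (k = 0; k < n_cpus; k++)   /* executed serially here; the MP library */
        do_chunk(k, n_cpus);       /* runs the chunks concurrently, one per CPU */
    printf("z[0] = %g\n", z[0]);
    return 0;
}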

The speedup gained from applying n CPUs, Speedup(n), is the ratio of the one-CPU execution time to the n-CPU execution time: Speedup(n) = T(1) ÷ T(n). If you measure the one-CPU execution time of a program at 100 seconds, and the program runs in 60 seconds with 2 CPUs, Speedup(2) = 100 ÷ 60 = 1.67.

This number captures the improvement from adding hardware. T(n) ought to be less than T(1); if it is not, adding CPUs has made the program slower, and something is wrong! So Speedup(n) should be a number greater than 1.0, and the greater it is, the more pleased you are. Intuitively you might hope that the speedup would be equal to the number of CPUs—twice as many CPUs, half the time—but this ideal can never be achieved (well, almost never).


Understanding Superlinear Speedup

You expect Speedup(n) to be less than n, reflecting the fact that not all parts of a program benefit from parallel execution. However, it is possible, in rare situations, for Speedup(n) to be larger than n. This is called a superlinear speedup—the program has been sped up by more than the increase of CPUs.

A superlinear speedup does not really result from parallel execution. It comes about because each CPU is now working on a smaller set of memory. The problem data handled by any one CPU fits better in cache, so each CPU executes faster than the single CPU could do. A superlinear speedup is welcome, but it indicates that the sequential program was being held back by cache effects.

Understanding Amdahl’s Law

There are always parts of a program that you cannot make parallel—code that must run serially. For example, consider the DO-loop. Some amount of code is devoted to setting up the loop, allocating the work between CPUs. This housekeeping must be done serially. Then comes parallel execution of the loop body, with all CPUs running concurrently. At the end of the loop comes more housekeeping that must be done serially; for example, if n does not divide MAX evenly, one CPU must execute the few iterations that are left over.

The serial parts of the program cannot be speeded up by concurrency. Let p be the fraction of the program’s code that can be made parallel (p is always a fraction less than 1.0.) The remaining fraction (1-p) of the code must run serially. In practical cases, p ranges from 0.2 to 0.99.

The run time with n CPUs is proportional to the parallel part divided by the number of CPUs, p/n, plus the remaining serial part, 1-p; the potential speedup is the reciprocal of that sum. As an equation, this appears as Example 8-2.

Example 8-2 Amdahl’s Law: Speedup(n) Given p

   Speedup(n) = 1 / ((p/n) + (1-p))

Suppose p = 0.8; then Speedup(2) = 1 / (0.4 + 0.2) = 1.67, and Speedup(4) = 1 / (0.2 + 0.2) = 2.5. The maximum possible speedup—if you could apply an infinite number of CPUs—would be 1 / (1-p). The fraction p has a strong effect on the possible speedup, as shown in the graph in Figure 8-1.


[Figure 8-1 plots Speedup(n) as a function of the parallel fraction p (0.1 to 0.95) for n = 4, 8, and 16 CPUs.]

Figure 8-1 Possible Speedup for Different Values of p

Two points are clear from Figure 8-1: First, the reward for parallelization is small unless p is substantial (at least 0.8); or to put the point another way, the reward for increasing p is great no matter how many CPUs you have. Second, the more CPUs you have, the more benefit you get from increasing p. Using only four CPUs, you need only p = 0.7 to get half the ideal speedup. With eight CPUs, you need p = 0.85 to get half the ideal speedup.

Calculating the Parallel Fraction of a Program

You do not have to guess at the value of p for a given program. Measure the execution times T(1) and T(2) to calculate a measured Speedup(2) = T(1) / T(2). The Amdahl’s law equation can be rearranged to yield p when Speedup(2) is known, as in Example 8-3.

Example 8-3 Amdahl’s Law: p Given Speedup(2)

   p = 2 * (SpeedUp(2) - 1) / SpeedUp(2)

Suppose you measure T(1) = 188 seconds and T(2) = 104 seconds. Then:

   SpeedUp(2) = 188/104 = 1.81
   p = 2 * ((1.81-1)/1.81) = 2*(0.81/1.81) = 0.895


In some cases, the Speedup(2) = T(1)/T(2) is a value greater than 2; in other words, a superlinear speedup (“Understanding Superlinear Speedup” on page 190). When this occurs, the formula in Example 8-3 returns a value of p greater than 1.0, clearly not useful. In this case you need to calculate p from two other more realistic timings, for example T(2) and T(3). The general formula for p is shown in Example 8-4, where n and m are the two CPU counts whose speedups are known, n>m.

Example 8-4 Amdahl’s Law: p Given Speedup(n) and Speedup(m)

   p = (Speedup(n) - Speedup(m)) / ((1 - 1/n)*Speedup(n) - (1 - 1/m)*Speedup(m))

Predicting Execution Time with n CPUs

You can use the calculated value of p to extrapolate the potential speedup with higher numbers of CPUs. For example, if p = 0.895 and T(1) = 188 seconds, what is the expected time with four CPUs?

   Speedup(4) = 1/((0.895/4)+(1-0.895)) = 3.04
   T(4) = T(1)/Speedup(4) = 188/3.04 = 61.8 seconds

The calculation can be made routine using the computer. Example C-8 on page 300 shows an awk script that automates the calculation of p and extrapolation of run times.
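The same calculation can also be written in a few lines of C. The following is a minimal sketch (it is not the awk script of Example C-8): it applies the formula of Example 8-3 to the measured timings used in the text, then extrapolates run times by inverting Example 8-2.

#include <stdio.h>

/* Speedup(k) from Amdahl's law (Example 8-2), given the parallel fraction p. */
static double amdahl_speedup(double p, int k)
{
    return 1.0 / (p / k + (1.0 - p));
}

int main(void)
{
    double t1 = 188.0;                  /* measured T(1), in seconds */
    double t2 = 104.0;                  /* measured T(2), in seconds */
    double sp2 = t1 / t2;               /* measured Speedup(2)       */
    double p = 2.0 * (sp2 - 1.0) / sp2; /* Example 8-3               */
    int k;

    printf("parallel fraction p = %.3f\n", p);
    for (k = 2; k <= 16; k *= 2)        /* Example 8-2 used in reverse */
        printf("predicted T(%2d) = %.1f seconds\n",
               k, t1 / amdahl_speedup(p, k));
    return 0;
}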

These calculations are independent of most programming issues such as language, library, or programming model. They are not independent of hardware issues, because Amdahl’s law assumes that all CPUs are equal. At some level of parallelism, adding a CPU no longer affects run time in a linear way. For example, in the Silicon Graphics POWER CHALLENGE architecture, cache friendly codes scale closely with Amdahl’s law up to the maximum number of CPUs, but scaling of memory intensive applications slows as the system bus approaches saturation. When the bus bandwidth limit is reached, the actual speedup is less than predicted.

In the SN0 architecture, the situation is different and better. Some benchmarks on SN0 scale very closely to Amdahl’s law up to the maximum number of CPUs, n = 128. However, remember that there are two CPUs per node, so some applications (in particular, applications with high requirements for local memory bandwidth) follow Amdahl’s law on a per-node basis rather than a per-CPU basis. Furthermore, not all added CPUs are equal because some are farther removed from shared data and thus may have a greater latency to access that data. In general, when you can place the data used by a CPU in the same node or a neighboring node, the difference in latencies is slight and the program speeds up in line with the prediction of Amdahl’s law.


Compiling Serial Code for Parallel Execution

Fortran and C programs that are written in a straightforward, portable way, without explicit source directives for parallel execution, can be parallelized automatically by the compiler. Automatic parallelization is a separately priced feature of the MIPSpro compilers beginning with version 7.2. Its use is covered in detail in the MIPSpro Auto-Parallelizing Option Programmer’s Guide listed under “Related Manuals” on page xxix.

Compiling a Parallel Version of a Program

When the compiler option is installed, you produce a parallelized program by simply including the -apo compiler flag on a compile. The compiler analyzes the code for loops that can be executed in parallel, and inserts code to run those loops on multiple CPUs.
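For example, a Fortran compile might look like the following; prog.f stands for your source file, and the only flag added for parallelization is -apo:

   % f77 -Ofast=ip27 -apo -o prog prog.f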

You can insert high-level compiler directives that assist the analyzer in modifying the source code. These directives are covered in the MIPSpro Auto-Parallelizing Option Programmer’s Guide.

The compiler can produce a report showing which loops it could parallelize and which it could not parallelize, and why. The compiler can also produce a listing of the modified source code, after parallelizing and before loop-nest optimization.

Controlling a Parallelized Program at Run Time

The parallel version of the program can be run using from one CPU to as many CPUs as are available. The number of CPUs, and some other choices, are controlled externally to the program by setting environment variables. The number of CPUs the program will use is established by the value of the variable MP_SET_NUMTHREADS.
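For example, under csh you might request eight CPUs for a hypothetical program a.out as follows:

   % setenv MP_SET_NUMTHREADS 8
   % a.out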

Run-time control of a parallelized program is the responsibility of libmp, the multiprocessing library that is automatically linked with a parallelized program. The features of libmp are covered in the mp(3) reference page. There are two versions of the mp(3) reference page, one for Fortran and one for C; they are very similar except for function call syntax, because a single library serves programs in both languages.


Explicit Models of Parallel Computation

You can use a variety of programming models to express parallel execution in your source program. This topic summarizes the models in order to provide background for tuning. For details, see the Topics in IRIX Programming manual, listed in “Related Manuals” on page xxix, which contains an entire section on the topic of “models of parallel computation.”

Fortran Source with Directives

Your Fortran program can contain directives that request parallel execution. When these directives are present in your program, the compiler works in conjunction with the MP runtime library, libmp, to run the program in parallel. There are three families of directives:
• The OpenMP (OMP) directives permit you to specify general parallel execution. Using OMP directives, you can write any block of code as a parallel region to be executed by multiple CPUs concurrently. You can specify parallel execution of the body of a DO-loop. Other directives coordinate parallel threads, for example, to define critical sections. The OMP directives are documented in the manuals listed under “Compiler Manuals” on page xxix.
• The C$DOACROSS directive and related directives, still supported for compatibility, permit you to specify parallel execution of the bodies of specified DO-loops. Using C$DOACROSS you can distribute the iterations of a single DO-loop across multiple CPUs. You can control how the work is divided. For example, the CPUs can do alternate iterations, or each CPU can do a block of iterations.
• The data distribution directives such as C$DISTRIBUTE and the affinity clauses added to C$DOACROSS permit you to explicitly control data placement and affinity. These have an effect only when executing on SN0 systems. These directives complement C$DOACROSS, making it easy for you to distribute the contents of an array in different nodes so that each CPU is close to the data it uses in its iterations of a loop body. The data distribution directives and the affinity clauses are discussed under “Using Data Distribution Directives” on page 222.


C and C++ Source with Pragmas

Your C or C++ program can contain pragma statements that specify parallel execution. These pragmas are documented in detail in the MIPSpro C and C++ Pragmas manual listed in “Compiler Manuals” on page xxix.

The C pragmas are similar in concept to the OMP directives for Fortran. (OpenMP directives for C and C++ are under development and will be supported in a later version of the compiler.) You use the pragmas to mark off a block of code as a parallel region. You can specify parallel execution of the body of a for-loop. Within a parallel region, you can mark off statements that must be executed by only one CPU at a time; this provides the equivalent of a critical section.

The data distribution directives and affinity clauses, which are available for Fortran, are also implemented for C in version 7.2 of the MIPSpro compilers. They are discussed under “Using Data Distribution Directives” on page 222.

Message-Passing Models MPI and PVM

There are two standard libraries, each designed to solve the problem of distributing a computation across not simply many CPUs but across many systems, possibly of different kinds. Both are supported on SN0 servers.

The MPI (Message-Passing Interface) library implements a standard developed by the MPI Forum; it is documented on the MPI home page, hosted at Argonne National Laboratory, at http://www.mcs.anl.gov/mpi/index.html. The PVM (Parallel Virtual Machine) library was developed at Oak Ridge National Laboratory, and is documented on the PVM home page at http://www.epm.ornl.gov/pvm/.

The Silicon Graphics implementation of the MPI library generally offers better performance than the Silicon Graphics implementation of PVM, and MPI is the recommended library. The use of these libraries is documented in the Message Passing Toolkit manuals listed in the section “Software Tool Manuals” on page xxx.


C Source Using POSIX Threads

You can write a multithreaded program using the POSIX threads model and POSIX synchronization primitives (POSIX standards 1003.1b, threads, and 1003.1c, realtime facilities). The use of these libraries is documented in Topics in IRIX Programming, listed in the section “Software Tool Manuals” on page xxx.

Through IRIX 6.4, the implementation of POSIX threads creates a certain number of IRIX processes and uses them to execute the pthreads. Typically the library creates fewer processes than the program creates pthreads (called an “m-on-n” implementation). You cannot control or predict which process will execute the code of any pthread at any time. When a pthread blocks, the process running it looks for another pthread to run.

Starting with IRIX 6.5, the pthreads library allocates a varying number of execution resources (basically, CPUs) and dispatches them to the runnable threads. These execution resources are allocated and dispatched entirely in the user process space, and do not require the creation of UNIX processes. As a result, pthread dispatching is more efficient.
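As a concrete illustration of the model, the following is a minimal C sketch using the standard POSIX threads calls; the worker function, thread count, and output are invented for illustration and are not part of any SGI library. Link it with the pthreads library (typically -lpthread).

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Each pthread runs this function with its own id. */
static void *worker(void *arg)
{
    int id = *(int *)arg;
    printf("pthread %d running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    }
    for (i = 0; i < NTHREADS; i++)   /* wait for all threads to finish */
        pthread_join(tid[i], NULL);
    return 0;
}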

C and C++ Source Using UNIX Processes

You can write a multiprocess program using the IRIX sproc() system function to create a share group of processes that execute in a single address space. Alternatively, you can create a program that uses the UNIX model of independent processes that share portions of address space using the mmap() system function. In either case, IRIX offers a wide variety of mechanisms for interprocess communication and synchronization.

The use of the process model and shared memory arenas is covered in Topics in IRIX Programming (see “Software Tool Manuals” on page xxx) and in the sproc(2), mmap(2), and usinit(3) reference pages.

Tuning Parallel Code for SN0

Parallelizing a program is a big topic, worthy of a book all on its own. In fact, there are several good books and online courses on the subject (see “Third-Party Resources” on page xxx for a list of some). Parallel programming is a very large topic, and this guide does not attempt to teach it. It assumes that you are already familiar with the basics and concentrates on what is different about the SN0 architecture.


Prescription for Performance

Of course, what’s new about SN0 is its novel shared memory architecture (see “Understanding Scalable Shared Memory” on page 6). The change in architecture has the following implications:
1. You don’t have to program differently for SN0 than for any other shared memory computer. In particular, binaries for earlier Silicon Graphics systems run on SN0, in most cases, with very good performance.
2. When programs do not scale as well as they should, simply taking account of the physically distributed memory usually restores optimum performance. This can often be done outside the program, or with only simple source changes.

The basic prescription for tuning a program for parallel execution is as follows:
1. Tune for single-CPU performance, as covered at length in Chapters 4 through 7.
2. Make sure the program is fully and properly parallelized; the parallel fraction p controls the effectiveness of added hardware (see “Calculating the Parallel Fraction of a Program” on page 191).
3. Examine cache use, and eliminate cache contention and false sharing.
4. Examine memory access, and apply the right page-placement method.

These steps are covered in the rest of this chapter.

Ensuring That the Program Is Properly Parallelized

The first step in tuning a parallel program is making sure that it has been properly parallelized. This means, first, that enough of the program can be parallelized to allow the program to attain the desired speedup. Use the Amdahl’s law extrapolation to determine the parallel fraction of the code (“Calculating the Parallel Fraction of a Program” on page 191). If the fraction is not high enough for effective scaling, there is no point in further tuning until it has been increased. For a program that is automatically parallelized, work with the compiler to remove dependencies and increase the parallelized portions (see “Compiling a Parallel Version of a Program” on page 193). When the program uses explicit parallelization, you must work on the program’s design, a subject beyond the scope of this book (see “Third-Party Resources” on page xxx).


Proper parallelization means, second, that the workload is distributed evenly among the CPUs. You can use SpeedShop to profile a parallel program; it provides information on each of the parallel threads. You can use this information to verify that each thread takes about the same amount of time to carry out its pieces of the work. If this is not so, some CPUs are idling with no work at some times. In a program that is parallelized using Fortran or C directives, it may be possible to get better balance by changing the loop scheduling (such as cyclic, dynamic, or gss). In other programs it may again require algorithmic redesign.

If the program has run successfully on another parallel system, both these issues have been addressed. But if you are now running the program on more CPUs than were previously available, it is still possible to encounter problems in the parallelization that simply never showed up with lesser parallelism. Be sure to revalidate that any limits in scalability are not due to Amdahl’s law, and watch for bugs in code that has not previously been stressed.

Finding and Removing Memory Access Problems

The location of data in the SN0 distributed memory is not important when the parallel threads of a program access memory primarily through the L1 and L2 caches. When there is a high ratio of cache (and TLB) hits, the relatively infrequent references to main memory are simply not a significant factor in performance. As a result, you should remove any cache-contention problems before you think about data placement in the SN0 distributed memory. (For an overview of cache issues, see “Understanding the Levels of the Memory Hierarchy” on page 135.)

You can determine how cache-friendly a program is using perfex; it tells you just how many primary, secondary, and TLB cache misses the program generates and what the cache hit ratios are, and it will estimate how much the cache misses cost (as discussed under “Getting Analytic Output with the -y Option” on page 56). There are several possible sources of poor scaling and they can generally be distinguished by examining the event counts returned by perfex:
• Load imbalance: Is each thread doing the same amount of work? One way to check this is to see if all threads issue a similar number of floating point operations (event 21).
• Excessive synchronization cost: Are counts of store conditionals high (event 4)?
• Cache contention: Are counts of store exclusive to shared block high (event 31)?
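For example, you might collect these counts per thread with a command like the following; check the perfex(1) reference page for the exact option spellings, since this is only a sketch:

   % perfex -mp -e 21 -e 31 a.out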


When cache misses account for only a small percentage of the run time, the program makes good use of the memory and, more important, its performance will not be affected by data placement. On the other hand, if the time spent in cache misses at some level is high, the program is not cache-friendly. Possibly, data placement in the SN0 distributed memory could affect performance, but there are other issues, common to all cache-based shared memory systems, that are likely to be the source of performance problems, and these should be addressed first.

Diagnosing Cache Problems

If a properly parallelized program does not scale as well as expected, there are several potential causes. The first thing to check is whether some form of cache contention is slowing the program. You have presumably done this as part of single-CPU tuning (“Identifying Cache Problems with Perfex and SpeedShop” on page 142), but new cache contention problems can appear when a program starts executing in parallel.

New problems can be an issue, however, only for data that are frequently updated or written. Data that are mostly read and rarely written do not cause cache coherency contention for parallel programs.

The mechanism used in SN0 to maintain cache coherence is described under “Understanding Directory-Based Coherency” on page 13. When one CPU modifies a cache line, any other CPU that has a copy of the same cache line is notified, and discards its copy. If that CPU needs that cache line again, it fetches a new copy. This arrangement can cause performance problems in two cases:
• When one CPU repeatedly updates a cache line that other CPUs use for input, all the reading CPUs are forced to frequently retrieve new copies of the cache line from memory. This slows all the reading CPUs.
• When two or more CPUs repeatedly update the same cache line, they contend for the exclusive ownership of the cache line. Each CPU has to get ownership and fetch a new copy of the cache line before it can perform its update. This forces the updating CPUs to execute serially, as well as making all other CPUs fetch a new copy of the cache line on every use.

In some cases, these problems arise because all parallel threads in the program are genuinely contending for use of the same key global variables. In that case, the only cure is an algorithmic change.


More often, cache contention arises because the CPUs are using and updating unrelated variables that happen to fall in the same cache line. This is false sharing. Fortunately, IRIX tools can help you identify these problems. Cache contention is revealed when perfex shows a high number of cache invalidation events: counter events 31, 12, 13, 28, and 29. A CPU that repeatedly updates the same cache line shows a high count of stores to shared cache lines: counter 31. (See “Cache Coherency Events” on page 282.)

Identifying False Sharing

False sharing can be demonstrated with code like that in Example 8-5.

Example 8-5 Fortran Loop with False Sharing

      integer m, n, i, j
      real a(m,n), s(m)
c$doacross local(i,j), shared(s,a)
      do i = 1, m
         s(i) = 0.0
         do j = 1, n
            s(i) = s(i) + a(i,j)
         enddo
      enddo

This code calculates the sums of the rows of a matrix. For simplicity, assume m=4 and that the code is run on up to four CPUs. What you observe is that the time for the parallel runs is longer than when just one CPU is used. To understand what causes this, look at what happens when this loop is run in parallel.

The following is a time line of the operations that are carried out (more or less) simultaneously:

            t=0           t=1                    t=2
   CPU 0    s(1) = 0.0    s(1) = s(1) + a(1,1)   s(1) = s(1) + a(1,2)   ...
   CPU 1    s(2) = 0.0    s(2) = s(2) + a(2,1)   s(2) = s(2) + a(2,2)   ...
   CPU 2    s(3) = 0.0    s(3) = s(3) + a(3,1)   s(3) = s(3) + a(3,2)   ...
   CPU 3    s(4) = 0.0    s(4) = s(4) + a(4,1)   s(4) = s(4) + a(4,2)   ...


At each stage of the calculation, all four CPUs attempt to update one element of the sum array, s(i). For a CPU to update one element of s, it needs to gain exclusive access to the cache line holding that element, but the four words of s are probably contained in a single cache line, so only one CPU at a time can update an element of s. Instead of operating in parallel, the calculation is serialized.

Actually, it’s a bit worse than merely serialized. For a CPU to gain exclusive access to a cache line, it first needs to invalidate cached copies that may reside in the other caches. Then it needs to read a fresh copy of the cache line from memory, because the invalidations will have caused data in some other CPU’s cache to be written back to main memory. In a sequential version of the program, the element being updated can be kept in a register, but in the parallel version, false sharing forces the value to be continually reread from memory, in addition to serializing the updates.

This serialization is purely a result of the unfortunate accident that the different elements of s ended up in the same cache line. If each element were in a separate cache line, each CPU could keep a copy of the appropriate line in its cache, and the calculations could be done perfectly in parallel. This shows how to fix the problem: spread the elements of s out so that each one resides in its own cache line. One way to do that is shown in Example 8-6.

Example 8-6 Fortran Loop with False Sharing Removed

      integer m, n, i, j
      real a(m,n), s(32,m)
c$doacross local(i,j), shared(s,a)
      do i = 1, m
         s(1,i) = 0.0
         do j = 1, n
            s(1,i) = s(1,i) + a(i,j)
         enddo
      enddo

The elements s(1,1), s(1,2), s(1,3) and s(1,4) are separated by at least 32 × 4 = 128 bytes, and so are guaranteed to fall in separate cache lines. Implemented this way, the code achieves perfect parallel speedup.

Note that the problem of false sharing is not specific to the SN0 architecture. It occurs in any cache-coherent shared memory system.


To see how this works for a real code, see an example from a paper presented at the Supercomputing ’96 conference (read it online at http://www.supercomp.org/sc96/proceedings/SC96PROC/ZAGHA/INDEX.HTM). One example in this paper is a weather modeling program that shows poor parallel scaling (the red curve in Figure 8-2).

[Figure 8-2 plots speedup versus number of processors (1 to 16), comparing the run with false sharing, the run after tuning, and ideal speedup.]

Figure 8-2 Performance of Weather Model Before and After Tuning

Running the program under perfex revealed that the number of secondary data cache misses (event 26) increased as the number of CPUs increased, as did the number of stores exclusive to a shared block (event 31). The large event 31 counts, increasing with the secondary cache misses, indicated a likely problem with false sharing. (The large number of secondary cache misses were a problem as well.)

The source of the problem was found using ssrun. There are several events that can be profiled to determine where false sharing occurs. The natural one to use is event 31, “store/prefetch exclusive to shared block in scache.” There is no explicitly named experiment type for this event, so profiling it requires setting the following environment variables:

   % setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 31
   % setenv _SPEEDSHOP_HWC_COUNTER_OVERFLOW 99
   % ssrun -exp prof_hwc a.out


(See “Sampling Through Other Hardware Counters” on page 66.) You could also profile interventions (event 12) or invalidations (event 13), but using event 31 shows where the source of the problem is in the program (some thread asking for exclusive access to a cache line) rather than the place a thread happens to be when an invalidation or intervention occurs. Another event that can be profiled is secondary cache misses (event 26). For this event, use the -dsc_hwc experiment type and don’t set any environment variables.
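For example, the secondary cache miss experiment might be run like this (a.out stands for your program; see the ssrun(1) reference page for the exact experiment names):

   % ssrun -dsc_hwc a.out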

For this program, profiling secondary cache misses was sufficient to locate the source of the false sharing. The profile showed that the majority of secondary cache misses occurred in an accumulation step, similar to the row summation in Example 8-5. Padding was used to move each CPU’s accumulation variable into its own cache line. After doing this, the performance improved dramatically (the blue curve in Figure 8-2).

Correcting Cache Contention in General

You can often deal with cache contention by changing the layout of data in the source program, but sometimes you may have to make algorithmic changes as well. If profiling indicates that there is cache contention, examine the parallel regions identified; any assignment to memory in the parallel regions is a possible source of the contention. You need to determine if the assigned variable, or the data adjacent to it, is used by other CPUs at the same time. If so, the assignment forces the other CPUs to read a fresh copy of the cache line, and this is a source of contention.

To deal with cache contention, you have the following general strategies:
1. Minimize the number of variables that are accessed by more than one CPU.
2. Segregate non-volatile (rarely updated) data items into cache lines different from volatile (frequently updated) items.
3. Isolate unrelated volatile items into separate cache lines to eliminate false sharing.
4. When volatile items are updated together, group them into single cache lines.

An update of one word (that is, a 4-byte quantity) invalidates all the 31 other words in the same L2 cache line. When those other words are not related to the new data, false sharing results. Use strategy 3 to eliminate the false sharing.


Be careful when your program defines a group of global status variables that is visible to all parallel threads. In the normal course of running the program, every CPU will cache a copy of most or all of this common area. Shared, read-only access does no harm. But if items in the block are volatile (frequently updated), those cache lines are invalidated often. For example, a global status area might contain the anchor for a LIFO queue. Each time a thread puts or takes an item from the queue, it updates the queue head, invalidating that cache line.

It is inevitable that a queue anchor field will be frequently invalidated. The time cost, however, can be isolated to the code that accesses the queue by applying strategy 2. Allocate the queue anchor in separate memory from the global status area. Put only a pointer to the queue anchor (a non-volatile item) in the global status block. Now the cost of fetching the queue anchor is born only by CPUs that access the queue. If there are other items that are updated along with the queue anchor—such as a lock that controls exclusive access to the queue—place them adjacent to the queue, aligned so that all are in the same cache line (strategy 4). However, if there are two queues that are updated at unrelated times, place the anchor of each in its own cache line (strategy 3).

Synchronization objects such as locks, semaphores, and message queues are global variables that must be updated by each CPU that uses them. You may as well assume that synchronization objects are always accessed at memory speeds, not cache speeds. You can do two things to reduce contention:
• Minimize contention for locks and semaphores through algorithmic design. In particular, use more, rather than fewer, semaphores, and make each one stand for the smallest resource possible so as to minimize the contention for any one resource. (Of course, this makes it more difficult to avoid deadlocks.)
• Never place unrelated synchronization objects in the same cache line. A lock or semaphore may as well be in the same cache line as the data that it controls, because an update of one usually follows an update of the other. But unrelated locks or semaphores should be in different cache lines. One way to lay out such objects in C is sketched after this list.
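The following is a minimal C sketch of strategy 3 applied to two unrelated locks; the structure and variable names are invented for illustration, and it assumes the 128-byte SN0 secondary cache line size:

#define CACHE_LINE 128   /* SN0 secondary cache line size in bytes */

/* Each lock lives in its own cache line, so an update to one lock
 * never invalidates the line holding the other (strategy 3).  Data
 * that are always updated together with a lock could be placed in
 * the padding of the same structure instead (strategy 4).
 */
typedef struct {
    volatile int lock;                    /* the word that is updated      */
    char pad[CACHE_LINE - sizeof(int)];   /* fill out the rest of the line */
} padded_lock_t;

static padded_lock_t queue_a_lock;   /* unrelated locks: separate lines */
static padded_lock_t queue_b_lock;

This wastes a little memory, but that is a good trade for eliminating false sharing between unrelated synchronization objects.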

When you make a loop run in parallel, try to ensure that each CPU operates on its own distinct sections of the input and output arrays. Sometimes this falls out naturally, but there are also compiler directives for just this purpose. (These are described in “Using Data Distribution Directives” on page 222.)


Carefully review the design of any data collections that are used by parallel code. For example, the root and the first few branches of a binary tree are likely to be visited by every CPU that searches that tree, and they will be cached by every CPU. However, elements deeper in the tree may be visited by only a few CPUs. One option is to pre-build the top levels of the tree so that these levels never have to be updated once the program starts. Also, before you implement a balanced-tree algorithm, consider that tree-balancing can propagate modifications all the way to the root. It might be better to cut off balancing at a certain level and never disturb the levels nearer the root. (Similar arguments apply to B-trees and other branching structures: the “thick” parts of the tree are widely cached and should be updated least often, while the twigs are less frequently used.)

Other classic data structures can cause memory contention, and algorithmic changes are needed to cure it:
• The two basic operations on a heap (also called a priority queue) are “get the top item” and “insert a new item.” Each operation ripples a change from end to end of the heap-array. However, the same operations on a linked list are read-only at all nodes except the one node that is directly affected. Therefore, a priority list used by parallel threads might run faster implemented as a linked list than as a heap (the opposite of the usual result).
• A hash table can be implemented compactly, with only a word or two in each entry. But that creates false sharing by putting several table entries (which by definition are logically unrelated) into the same cache line. To avoid false sharing, make each hash table entry a full 128 bytes, cache-aligned. (You can take advantage of the extra space to store a list of overflow hits in each entry. Such a list can be scanned quickly because the entire cache line is fetched as one operation.)
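A minimal Fortran sketch of the padded hash table follows; the sizes and names are hypothetical, and actual cache alignment of the array depends on how it is allocated.

c     Each bucket occupies 32 4-byte words = 128 bytes, one L2 line.
c     The first word holds the primary entry; the remaining words can
c     hold a short overflow list that is scanned from the same,
c     already-fetched cache line.
      integer nbuckets
      parameter (nbuckets = 4096)
      integer table(32, nbuckets)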

Scalability and Data Placement

Data placement issues do not arise for all parallel programs. Those that are cache friendly do not incur performance penalties even when data placement is not optimal, because such programs satisfy their memory requests primarily from cache, rather than main memory. Data placement can be an issue only for parallel programs that are memory intensive and are not cache friendly.

If a memory-intensive parallel program exhibits scaling that is less than expected, and you are sure that false sharing and other forms of cache contention are not a problem, consider data placement. Optimizing data placement is a new performance issue unique to SN0 and other distributed-memory architectures.


Ideally, the IRIX operating system would automatically ensure that all programs achieve good data placement. There are two things IRIX wants to optimize:
• The program’s topology; that is, the processes making up the parallel program should run on nodes that minimize the access costs for the data they share.
• The page placement; that is, the memory a process accesses most often should be allocated from its own node, or at the minimum distance from that node.

Accomplishing these two tasks automatically for all programs is virtually impossible. The operating system simply doesn’t have enough information to do a perfect job, but it does the best it can to approximate a good solution. The policies and topology choices the operating system uses to try to optimize data placement were described under “IRIX Memory Locality Management” on page 29. IRIX uses:
• Memory locality domains (MLDs) and policy modules (PMs) to achieve a good topology (“Memory Locality Management” on page 31)
• A variety of page placement policies to try to get the initial page placements correct (“Data Placement Policies” on page 40)
• Dynamic page migration to correct placement errors that arise as a program runs (“Dynamic Page Migration” on page 30)

This technology produces good results for many, but not all, programs. It’s time to discuss how you can tune data placement, if the operating system’s efforts are insufficient.

Data placement is tuned by specifying the appropriate policies and topologies, and if need be, programming with them in mind. They are specified using the utility dplace, environment variables understood by the MP library, and compiler directives.

Tuning Data Placement for MP Library Programs

Unlike false sharing, there is no profiling technique that will tell you definitively that poor data placement is hurting the performance of your program. Poor placement is a conclusion you have to reach after eliminating the other possibilities. Fortunately, once you suspect a data placement problem, many of the tuning options are very easy to perform, and you can often allay your suspicions or solve the problem with a few simple experiments.


The techniques for tuning data placement can be separated into two classes:
• Use MP library environment variables to adjust the operating system’s default data placement policies. This is simple to do, involves no modifications to your program, and is all you will have to do to solve many data placement problems.
• Modify the program to ensure an optimal data placement. This approach requires more effort, so try this only if the first approach does not work, or if you are developing a new application. The amount of effort in this approach ranges from simple things such as making sure that the program’s data initializations—as well as its calculations—are parallelized, to adding some data distribution compiler directives, to modifying algorithms as would be done for purely distributed memory architectures.

Trying Round-Robin Placement

Data placement policies used by the operating system are introduced under “Data Placement Policies” on page 40. The default policy is called first-touch. Under this policy, the process that first touches (that is, writes to, or reads from) a page of memory causes that page to be allocated in the node on which the process is running. This policy works well for sequential programs and for many parallel programs as well. For example, this is just what you want for message-passing programs that run on SN0. In such programs each process has its own separate data space. Except for messages sent between the processes, all processes use memory that should be local. Each process initializes its own data, so memory is allocated from the node the process is running in, thus making the accesses local.

But for some parallel programs, the first-touch policy can have unintended side effects. As an example, consider a program parallelized using the MP library. In parallelizing such a program, the programmer worries only about parallelizing the parts of the program that take the most time. Often, the data initialization code takes little time, so it is not parallelized, and so is executed by the main thread of the program. Under the first-touch policy, all the program’s memory ends up allocated in the node running the main thread. Having all the data concentrated in one node, or within a small radius of it, creates a bottleneck: all data accesses are satisfied by one hub, and this limits the memory bandwidth. If the program is run on only a few CPUs, you may not notice a bottleneck. But as you add more CPUs, the one hub that serves the memory becomes saturated, and the speed does not scale as predicted by Amdahl’s law.


One easy way to test for this problem is to try other memory allocation policies. The first one you should try is round-robin allocation. Under this policy, data are allocated in a round-robin fashion from all the nodes the program runs on. Thus, even if the data are initialized sequentially, the memory holding them will not be allocated from a single node; it will be evenly spread out among all the nodes running the program. This may not place the data in their optimal locations—that is, the access times are not likely to be minimized—but there will be no bottlenecks, so scalability will not be limited.

You can enable round-robin data placement for an MP library program with the following environment variable setting:
% setenv _DSM_ROUND_ROBIN

The performance improvement this can make is demonstrated with an example. The plot in Figure 8-3 shows the performance achieved for three parallelized runs of the double-precision vector operation a(i) = b(i) + q*c(i)

Figure 8-3 Calculated Bandwidth for Different Placement Policies
(Plot of bandwidth in MB/s versus number of CPUs, 0 to 30, for three cases: placement in one node, round-robin placement, and optimal placement.)


The bandwidth calculated assumes that 24 bytes are moved for each vector element. The red curve (“placement in one node”) shows the performance you attain using a first-touch placement policy and a sequential initialization so that all the data end up in the memory of one node. The memory bottleneck this creates severely limits the scaling. The green curve (“round-robin placement”) shows what happens when the data placement policy is changed to round-robin. The bottleneck disappears and the performance now scales with the number of CPUs. The performance isn’t ideal—the blue line (“optimal placement”) shows what you measure if the data are placed optimally— but by spreading the data around, the round-robin policy makes the performance respectable. (Ideal data placement cannot be accomplished simply by initializing the data in parallel.)

Trying Dynamic Page Migration

If a round-robin data placement solves the scaling problem that led you to tune the data placement, then you’re done! But if the performance is still not up to your expectations, the next experiment to perform is enabling dynamic page migration.

The IRIX automatic page migration facility is disabled by default because page migration is an expensive operation that impacts all CPUs, not just the ones used by the program whose data are being moved. You can enable dynamic page migration for a specific MP library program by setting the environment variable _DSM_MIGRATION. You can set it to either ON or ALL_ON:
% setenv _DSM_MIGRATION ON
% setenv _DSM_MIGRATION ALL_ON

When set to ALL_ON, all program data is subject to migration. When set to ON, only those data structures that have not been explicitly placed via the compiler’s data distribution directives (“Using Data Distribution Directives” on page 222) will be migrated.

Enabling migration is beneficial when a poor initial placement of the data is responsible for limited performance. However, it cannot be effective unless the program runs for at least half a minute, first because the operating system needs some time to move the data to the best layout, and second because the program needs to execute for some time with the pages in the best locations in order to recover the time cost of migrating.


In addition to the MP library environment variables, migration can also be enabled as follows:
• For non-MP library programs, run the program using dplace with the -migration option. This is described under “Enabling Page Migration” on page 245.
• The system administrator can temporarily enable page migration for all programs using the sn command (see the sn(1M) reference page), or enable it permanently by using systune to set the numa_migr_base_enabled system parameter. (See the systune(1M) reference page, the System Configuration manual listed in “Software Tool Manuals” on page xxx, and the comments in the file /var/sysgen/mtune/numa.)

Combining Migration and Round-Robin Placement

You can try to reduce the cost of migrating from a poor initial location to the optimal one by combining a round-robin initial placement with migration. You won’t know whether round-robin, migration, or both together will produce the best results unless you try the different combinations, so tuning the data placement requires experimentation. Fortunately, the environment variables make these experiments easy to perform.

Figure 8-4 shows the results of several experiments on the vector operation a(i) = b(i) + q*c(i)

In addition to combinations of the round-robin policy and migration, the effect of serial and parallel initializations are shown. For the results presented in the diagram, the vector operation was iterated 100 times and the time per iteration was plotted to see the performance improvement as the data moved toward optimal layout.


Figure 8-4 Calculated Iteration Times for Different Placement Policies
(Plot of time per iteration versus iteration number, 0 to 100, for several cases: fixed placement in one node, initial placement in one node with migration, round-robin initial placement with migration, fixed round-robin placement, and optimal placement or migration enabled.)

The red curve at the top (“fixed placement in one node”) shows the performance for the poor initial placement caused by a sequential data initialization with a first-touch policy. No migration is used, so this performance is constant over all iterations. The green line near the bottom (“fixed round-robin placement”) shows that a round-robin initial placement fixes most of the problems with the serial initialization: the time per iteration is five times faster than with the first-touch policy. Ultimately, though, the performance is still a factor of two slower than if an optimal data placement had been used. Once again, migration is not enabled, so the performance is constant over all iterations.

The flat curve just below the green curve shows the performance achieved when a parallel initialization is used in conjunction with a round-robin policy. Its per-iteration time is constant and nearly the same as when round-robin is used with a sequential initialization. This is to be expected: the round-robin policy spreads the data evenly among the CPUs, so it matters little how the data are initialized.


The remaining five curves all asymptote to the optimal time per iteration. Four of these eventually achieve the optimal time due to the use of migration. The turquoise curve uses a parallel initialization with a first-touch policy to achieve the optimal data placement from the outset. The orange curve just above it starts out the same, but for it, migration is enabled. The result of using migration on perfectly placed data is only to scramble the pages around before finally settling down on the ideal time. Above this curve are a magenta and another orange curve (“round-robin initial placement”). They show the effect of combining migration and a round-robin policy. The only difference is that the magenta curve used a serial initialization while the orange curve used a parallel initialization. For these runs, the serial initialization took longer to reach the steady-state performance, but you should not conclude that this is a general property; when a round-robin policy is used, it doesn’t matter how the data are initialized.

Finally, the blue curve (“initial placement in one node”) shows how migration improves a poor initial placement in which all the data began in one node. It takes longer to reach the optimal placement than in the other cases, indicating that by combining a round-robin policy with migration, one can do better than by using just one of the remedies alone. Dramatic evidence of this is seen when the cumulative program run time is plotted, as shown in Figure 8-5.

Figure 8-5 Cumulative Run Time for Different Placement Policies
(Plot of accumulated time versus iteration number, 0 to 100, for migration from one node, fixed round robin, migration from round robin, migration from optimal placement, and fixed optimal placement.)


These results show clearly that, for this program, migration eventually gets the data to their optimal location and that migrating from an initial round-robin placement is faster than migrating from an initial placement in which all the data are on one node. Note, though, that migration does take time: if only a small number of iterations are going to be performed, it is better to use a round-robin policy without migration.

Experimenting with Migration Levels

The results for data migration were generated using this environment variable setting:
% setenv _DSM_MIGRATION ON

(This program has no data distribution directives in it, so in this case, there is no difference between a setting of ON and one of ALL_ON.) But you can also adjust the migration level, that is, control how aggressively the operating system migrates data from one memory to another. The migration level is controlled with the following environment variable setting:
% setenv _DSM_MIGRATION_LEVEL level

A high level means aggressive migration, and a low level means nonaggressive migration. A level of 0 means no migration. The migration level can be used to experiment with how aggressively migration is performed. The plot in Figure 8-6 shows the improvement in time per iteration for various migration levels as memory is migrated from an initial placement in one node. As you would expect, with migration enabled the data eventually end up in an optimal location; the more aggressive the migration, the faster the pages get where they should be.


Figure 8-6 Effect of Migration Level on Iteration Time
(Plot of time per iteration versus iteration number for migration off and for migration levels 10, 30, 50, 70, and 90.)

The best level for a particular program will depend on how much time is available to move the data to their optimal location. A low migration level is fine for a long-running program with a poor initial data placement. But if a program can only afford a small amount of time for the data to redistribute, a more aggressive migration level is needed. Keep in mind that migration has an impact on system-wide performance, and the more aggressive the migration, the greater the impact. For most programs the default migration level should be sufficient.


Tuning Data Placement without Code Modification

In general, you should try the following experiments when you need to tune data placement:
1. See if using a round-robin policy fixes the scaling problem. If so, you need not try other experiments.
2. Try migration. If migration achieves about the same performance as round-robin, round-robin is to be preferred because it has less of an impact on other users of the system.
3. If migration alone achieves better performance than round-robin alone, try the combination of both. The combination might not improve the steady-state performance, but it might reach the steady-state performance faster.

Note that these data placement tuning experiments apply only to MP library programs:
• Sequential programs aren’t affected by data placement.
• First-touch is the right placement policy for message-passing programs.
• These environment variables are understood only by the MP library (for other features of libmp, see the mp(3) reference page).

Modifying the Code to Tune Data Placement

In some cases you will need to tune the data placement via code modifications. There are three levels of code modification you can use:
• Rely on the default first-touch policy, and program with it in mind. This programming style is easy to use: often, fixing data placement problems is as simple as making sure all the data are initialized in a parallelized loop, so that the data start off in the memory of the CPU that will use them.
• Insert regular data distribution directives (for example, C$DISTRIBUTE). These directives allow you to specify how individual data structures should be distributed among the memory of the nodes running the program, subject to the constraint that only whole pages can be distributed.
• Insert reshaping directives (for example, C$DISTRIBUTE_RESHAPE). These allow you to specify how individual data structures should be distributed among the memory of the nodes running the program, but they are not restricted to whole-page units. As a result, distributed data structures are not guaranteed to be contiguous; that is, they can have holes in them, and if you do not program properly, this can break your program.


Each of these three approaches requires the programmer to think about data placement explicitly. This is a new concept that doesn’t apply to conventional shared-memory architectures. However, the concept has been used for years in programming distributed memory computers, and if you have written a message-passing program, you understand it well. These approaches are discussed in the three topics that follow.

Programming For First-Touch Placement

If there is a single placement of data that is optimal for your program, you can use the default first-touch policy to cause each CPU’s share of the data to be allocated from memory local to its node. As a simple example, consider parallelizing the vector operation in Example 8-7.

Example 8-7 Easily Parallelized Fortran Vector Routine

      integer i, j, n, niters
      parameter (n = 8*1024*1024, niters = 100)
      real a(n), b(n), q
c initialization
      do i = 1, n
         a(i) = 1.0 - 0.5*i
         b(i) = -10.0 + 0.01*(i*i)
      enddo
c real work
      do it = 1, niters
         q = 0.01*it
         do i = 1, n
            a(i) = a(i) + q*b(i)
         enddo
      enddo

This vector operation is easy to parallelize; the work can be divided among the CPUs of a shared memory computer any way you would like. For example, if p is the number of CPUs, the first CPU can carry out the first n/p iterations, the second CPU the next n/p iterations, and so on. (This is called a simple schedule type.) Alternatively, each thread can perform one of the first p iterations, then one of the next p iterations and so on. (This is called an interleaved schedule type.)
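For reference, the schedule type can be requested explicitly on the parallel directive. The fragment below is a minimal sketch, not taken from the examples in this chapter, of how an interleaved distribution of the “real work” iterations might be requested in the old Fortran directive syntax; it reuses the names a, b, q, and n from Example 8-7.

c$doacross local(i), shared(a,b,q), mp_schedtype=interleave, chunk=1
      do i = 1, n
         a(i) = a(i) + q*b(i)
      enddo

Omitting the MP_SCHEDTYPE clause, as the examples in this chapter do, leaves the simple schedule type in effect.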


In a cache-based machine, not all divisions of work produce the same performance. The reason is that if a CPU accesses the element a(i), the entire cache line containing a(i) is moved into its cache. If the same CPU works on the following elements, they will be in cache. But if different CPUs work on the following elements, the cache line will have to be loaded into each one’s cache. Even worse, false sharing is likely to occur. Thus performance is best for work allocations in which each CPU is responsible for blocks of consecutive elements.

Example 8-8 shows the above vector operation parallelized for SN0.

Example 8-8 Fortran Vector Operation, Parallelized

      integer i, j, n, niters
      parameter (n = 8*1024*1024, niters = 1000)
      real a(n), b(n), q
c initialization
c$doacross local(i), shared(a,b)
      do i = 1, n
         a(i) = 1.0 - 0.5*i
         b(i) = -10.0 + 0.01*(i*i)
      enddo
c real work
      do it = 1, niters
         q = 0.01*it
c$doacross local(i), shared(a,b,q)
         do i = 1, n
            a(i) = a(i) + q*b(i)
         enddo
      enddo

Because the schedule type is not specified, it defaults to simple: that is, process 0 performs iterations 1 to n/p, process 1 performs iterations 1 + (n/p) to 2 × n/p, and so on. Each process accesses blocks of memory with stride 1.

Because the initialization takes a small amount of time compared with the “real work,” parallelizing it doesn’t reduce the sequential time of this code by much. Some programmers wouldn’t bother to parallelize the first loop for a traditional shared memory computer. However, if you are relying on the first-touch policy to ensure a good data placement, it is critical to parallelize the initialization code in exactly the same way as the processing loop.


Due to the correspondence of iteration number with data element, the parallelization of the “real work” loop means that elements 1 to n/p of the vectors a and b are accessed by process 0. To minimize memory access times, you would like these data elements to be allocated from the memory of the node running process 0. Similarly, elements 1 + (n/p) to 2 × n/p are accessed by process 1, and you would like them allocated from the memory of the node running it, and so on. This is accomplished automatically by the first-touch policy. The CPU that first touches a data element causes the page holding that data element to be allocated from its local memory. Thus, if the data are to be allocated so that each CPU can make local accesses during the “real work” section of the code, each CPU must be the one to initialize its share of the data. This means that the initialization loop must be parallelized the same way as the “real work” loop.

Now consider why the simple schedule type is important to Example 8-8. Data are placed in units of one page, so they will end up in their optimal location only if the same CPU processes all the data elements in a page. The default page size is 16 KB, or 4096 REAL*4 data elements; this is a fairly large number. Since the simple schedule type blocks together as many elements as possible for a single CPU to work on (n/p), it will create more optimally-allocated pages than any other work distribution.

Figure 8-7 Effect of Page Granularity in First-Touch Allocation
(Diagram contrasting the data elements initialized by processor p with the whole pages that end up stored in processor p’s memory.)


For Example 8-8, n = 8 × 1024 × 1024 = 8388608. If the program is run on 128 CPUs, n/p = 65536, which means that each CPU’s share of each array fills sixteen 16 KB pages ((65536 elements × 4 bytes) ÷ 16 KB). However, it is unlikely that any array begins exactly on a page boundary, so you would expect 15 of a CPU’s 16 pages to contain only elements it processes, and one page to contain some of its elements and some of another CPU’s elements. This effect is diagrammed in Figure 8-7.

In a perfect data layout, no pages would share elements from multiple CPUs, but this small imperfection has a negligible effect on performance.

On the other hand, if you use an interleaved schedule type, all 128 CPUs repeatedly attempt to touch 128 adjacent data elements concurrently. Because 128 adjacent data elements are almost always on the same page, the resulting data placement could be anything from a random distribution of the pages to one in which all pages end up in the memory of a single CPU. This initial data placement will certainly affect the performance.

In general, you should try to arrange it so that each CPU’s share of a data structure exceeds the size of a page. For finer granularity, you need the directives discussed under “Using Reshaped Distribution Directives” on page 232.

First-Touch Placement with Multiple Data Distributions

Programming for the default first-touch policy is effective if the application requires only one distribution of data. When different data placements are required at different points in the program, more sophisticated techniques are required. The data directives (see “Using Data Distribution Directives” on page 222) allow data structures to be redistributed at execution time, so they provide the most general approach to handling multiple data placements.

In many cases, you don’t need to go to this extra effort. For example, the initial data placement can be optimized with first-touch, and migration can redistribute data during the run of the program. In other cases, copying can be used to handle multiple data placements. As an example, consider the FT kernel of the NAS FT benchmark (see the link under “Third-Party Resources” on page xxx). This is a three-dimensional Fast Fourier Transform (FFT) carried out multiple times. It is implemented by performing one-dimensional FFTs in each of the three dimensions. As long as you don’t need more than 128-way parallelism, the x- and y-calculations can be parallelized by partitioning data along the z-axis, as sketched in Figure 8-8.


Figure 8-8 Data Partition for NAS FT Kernel
(Sketch of the nx × ny × nz grid partitioned along the z-axis; each CPU performs the x-FFTs and y-FFTs within its own slab of xy-planes.)

For SN0, first-touch is used to place nz/p xy-planes in the local memory of each CPU. The CPUs are then responsible for performing the FFTs in their shares of planes.

Once the FFTs are complete, z-transforms need to be done. On a conventional distributed memory computer, this presents a problem: the xy-planes have been distributed, so no CPU has access to all the data along the lines in the z-dimension, and the z-FFTs cannot be performed. The conventional solution is to redistribute the data by transposing the array, so that each CPU holds a set of zy-planes. One-dimensional FFTs can then be carried out along the first dimension in each plane. A transpose is convenient because it is a well-defined operation that can be optimized and packaged into a library routine. In addition, it moves a lot of data together, so it is more efficient than moving individual elements to perform z-FFTs one at a time. (This technique was discussed earlier at “Understanding Transpositions” on page 153.)

SN0, however, is a shared memory computer, so explicit data redistribution is not needed. Instead, split up the z-FFTs among the CPUs by partitioning work along the y-axis, as sketched in Figure 8-9.


Figure 8-9 NAS FT Kernel Data Redistributed
(Sketch of the same nx × ny × nz grid with the z-FFT work partitioned along the y-axis.)

Each CPU then copies several z-lines at a time into a scratch array, performs the z-FFTs on the copied data, and completes the operation by copying back the results.

Copying has several advantages over explicitly redistributing the data:
• It brings several z-lines directly into each CPU’s cache, and the z-FFTs are performed on these in-cache data. A transpose moves nonlocal memory to local memory, which must then be moved into the cache in a separate step.
• Copying reduces TLB misses. The same technique was recommended earlier under “Using Copying to Circumvent TLB Thrashing” on page 146.
• Copying scales well because it is perfectly parallel. Parallel transpose algorithms require frequent synchronization points.
• Copying is easy to implement. An optimal transpose is not simple code to write.

Combining first-touch placement with copying solves a problem that might otherwise require two data distributions.
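A minimal Fortran sketch of the copy, transform, and copy-back idea follows. It is not the benchmark code: the FFT routine zfft1d, the scratch array tmp, and the loop bounds nx, ny, and nz are hypothetical stand-ins.

c     Each CPU copies one z-line into a private scratch array, performs
c     the 1-D FFT on the in-cache copy, and copies the results back.
c$doacross local(i,j,k,tmp), shared(a)
      do j = 1, ny
         do i = 1, nx
            do k = 1, nz
               tmp(k) = a(i,j,k)
            enddo
            call zfft1d(tmp, nz)
            do k = 1, nz
               a(i,j,k) = tmp(k)
            enddo
         enddo
      enddo

The actual kernel copies several z-lines at a time, much as Example 8-20 later in this chapter does, to make better use of cache lines.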


Using Data Distribution Directives

You can create almost any data placement using the first-touch policy. However, it may not be easy for others to see what you are trying to accomplish. Furthermore, changing the data placement could require modifying a lot of code. An alternate way to distribute data is through compiler directives. Directives state explicitly how the data are distributed. Modifying the data distribution is easy; it only requires updating the directives, rather than changing the program logic or the way the data structures are defined.

Understanding Directive Syntax

The MIPSpro 7.2 compilers support two types of data distribution directives: regular and reshaping. Both allow you to place blocks or stripes of arrays in the memory of the CPUs that operate on those data. The regular data distribution directives are limited to units of whole virtual pages, whereas the reshaping directives permit arbitrary granularity.

Data distribution directives are supported for both Fortran and C. For Fortran, two forms of the directives are accepted: an old form starting with C$ and a new form that extends the OpenMP directive set. For C and C++, the directives are written as pragmas.

Despite the differences in syntax, there are only five directives, plus an extended clause to the directive for parallel processing. These are supported in all languages beginning with compiler release 7.2. They are summarized in Table 8-1.

Table 8-1 Forms of the Data Distribution Directives

Define regular distribution of an array:
    C and C++: #pragma distribute
    Old Fortran: C$DISTRIBUTE
    OpenMP Fortran: !$SGI DISTRIBUTE

Define reshaped distribution of an array:
    C and C++: #pragma distribute_reshape
    Old Fortran: C$DISTRIBUTE_RESHAPE
    OpenMP Fortran: !$SGI DISTRIBUTE_RESHAPE

Permit dynamic redistribution of an array:
    C and C++: #pragma dynamic
    Old Fortran: C$DYNAMIC
    OpenMP Fortran: !$SGI DYNAMIC

Force redistribution of an array:
    C and C++: #pragma redistribute
    Old Fortran: C$REDISTRIBUTE
    OpenMP Fortran: !$SGI REDISTRIBUTE

Specify distribution by pages:
    C and C++: #pragma page_place
    Old Fortran: C$PAGE_PLACE
    OpenMP Fortran: !$SGI PAGE_PLACE

Associate parallel threads with distributed data:
    C and C++: #pragma pfor ... affinity(idx) = data(array(expr)) ...
    Old Fortran: C$DOACROSS ... AFFINITY(idx) = data(array(expr)) ...
    OpenMP Fortran: !$OMP PARALLEL DO ... !$SGI AFFINITY(idx) = data(array(expr)) ...


All directives but the last one have the same verb and arguments in each language; only the fixed text that precedes the verb varies (#pragma, C$, or !$SGI). This book refers to these directives by their verbs: Distribute, Distribute_Reshape, Dynamic, Redistribute, and Page_Place. The final directive is named by its OpenMP verb, Parallel Do.

The examples in the following sections use old Fortran syntax. When writing new Fortran code, use the OpenMP directive syntax.

For detailed documentation of these directives, see the manuals listed in “Compiler Manuals” on page xxix. These manuals in particular are helpful:
• The Fortran 90 Commands and Directives manual displays OpenMP directives and has a detailed discussion of their use.
• The MIPSpro Fortran 77 Programmer’s Guide displays old Fortran syntax, with some discussion of use.

Using Distribute for Loop Parallelization

An example is the easiest way to show the use of the Distribute directive. Example 8-9 shows the easily-parallelized vector operation of Example 8-7 modified with directives to ensure proper placement of the arrays.

Example 8-9 Fortran Vector Operation with Distribution Directives

      integer i, j, n, niters
      parameter (n = 8*1024*1024, niters = 1000)
c---Note that the distribute directive FOLLOWS the array declarations.
      real a(n), b(n), q
c$distribute a(block), b(block)
c initialization
      do i = 1, n
         a(i) = 1.0 - 0.5*i
         b(i) = -10.0 + 0.01*(i*i)
      enddo
c real work
      do it = 1, niters
         q = 0.01*it
c$doacross local(i), shared(a,b,q), affinity (i) = data(a(i))
         do i = 1, n
            a(i) = a(i) + q*b(i)
         enddo
      enddo


Two directives are used. The first, Distribute, instructs the compiler to allocate the memory for the arrays a and b from all nodes on which the program runs. The second directive, Parallel Do, uses the clause AFFINITY (I) = DATA (A(I)) to tell the compiler to distribute work to the CPUs based on how the contents of array a are distributed.

Note that because the data is explicitly distributed, it is no longer necessary to parallelize the initialization loop to get correct first-touch allocation (although it is still good programming practice to parallelize data initializations).

Using the Distribute Directive

In Example 8-9, Distribute specifies a BLOCK mapping for both arrays. BLOCK means that, when running with p CPUs, each array is to be divided into p blocks of size ceiling(n/p), with the first block assigned to the first CPU, the second block assigned to the second CPU, and so on. The intent of assigning blocks to CPUs is to allow each block to be stored in one CPU’s local memory.

Only whole pages are distributed, so when a page straddles blocks belonging to different CPUs, it is stored entirely in the memory of a single node. As a result, some CPUs will use nonlocal accesses to some of the data on one or two pages per array. (The situation is diagrammed in Figure 8-7 on page 218.) An imperfect data distribution has a negligible effect on performance because, as long as a “block” comprises at least a few pages, the ratio of nonlocal to local accesses is small. When the block size is less than a page, you must live with a larger fraction of nonlocal accesses, or use the reshaped directives. (Data placement is rarely important for arrays smaller than a page, because if they are used heavily, they fit entirely in cache.)

Using Parallel Do with Distributed Data

Now look at the Parallel Do directive (C$DOACROSS) in Example 8-9. It instructs the compiler to run this loop in parallel. But instead of distributing iterations to CPUs by specifying a schedule type such as “simple” or “interleave,” the AFFINITY clause is used. This new clause tells the compiler to execute iteration i on the CPU that is responsible for data element a(i) under data distribution. Thus work is divided among the CPUs according to their access to local data. This is the situation achieved using first-touch allocation and default scheduling, but now it is specified explicitly.


Page granularity is not considered in assigning work; the first CPU carries out iterations 1 to ceil(n/p), the second CPU performs iterations ceil(n/p) + 1 to 2 × ceil(n/p), and so on. This ensures a proper balance of work among CPUs, at the possible expense of a few nonlocal memory accesses.

For the default BLOCK distribution used in this example, an Affinity clause assigns work to CPUs identically to the default simple schedule type. The Affinity clause, however, has advantages. First, for some complicated data distributions, there are no defined schedule types that can achieve optimal work distribution. Second, the Affinity clause makes the code easier to maintain, because you can change the data distribution without having to modify the Parallel Do directive to realign the work with the data.

Understanding Distribution Mapping Options

The Distribute directive allows you to specify a different mapping for each dimension of each distributed array. The possible mappings are these:
• An asterisk, to indicate you don’t specify a distribution for that dimension.
• BLOCK, meaning one block of adjacent elements is assigned to each CPU.
• CYCLIC, meaning sequential elements on that dimension are dealt like cards to CPUs in rotation. (In poker and bridge, cards are dealt to players in CYCLIC order.)
• CYCLIC(x), meaning blocks of x elements on that dimension are dealt to CPUs in rotation. (In the game of pinochle, cards are customarily dealt CYCLIC(3).)

For example, suppose a Fortran program has a two-dimensional array A and uses p CPUs. Then C$DISTRIBUTE A(*,CYCLIC) assigns columns 1, p+1, 2p+1, ... of A to the first CPU; columns 2, p+2, 2p+2, ... to the second CPU; and so on. CYCLIC(x) specifies a block-cyclic mapping, in which blocks, rather than individual elements, are cyclically assigned to CPUs, with the block size given by the compile-time expression x. For example, C$DISTRIBUTE A(*,CYCLIC(2)) assigns columns 1, 2, 2p+1, 2p+2, ... to the first CPU; 3, 4, 2p+3, 2p+4, ... to the second CPU; and so on.

Combinations of the mappings on different dimensions can produce a variety of distributions, as shown in Figure 8-10, in which each color represents the intended assignment of data elements to one CPU.


Figure 8-10 Some Possible Regular Distributions
(Mappings illustrated: (a) X(BLOCK); (b) X(CYCLIC) or X(CYCLIC(1)); (c) X(CYCLIC(3)); (d) Y(*, BLOCK); (e) Y(BLOCK, *); (f) Y(BLOCK, BLOCK); (g) Y(CYCLIC(8), CYCLIC(16)).)

The distributions illustrated in Figure 8-10 are ideals, and do not take into consideration the restriction to whole pages. The intended distribution of data is achieved only when at least a page of data is assigned to a CPU’s local memory. In the figure, mappings (a) and (d) produce the desired data distributions for arrays of moderate size, while mappings (e) and (f) require large arrays for the data placements to be near-optimal. For the cyclic mappings (b, c, and g), you need the reshaped data distribution directives to achieve the intended results (“Using Reshaped Distribution Directives” on page 232).


Understanding the ONTO Clause

When an array is distributed in multiple dimensions, data is apportioned to CPUs as equally as possible across each dimension. For example, if an array has two distributed dimensions, execution on six CPUs assigns the first dimension to three CPUs and the second to two (3 x 2 = 6). The optional ONTO clause allows you to override this default and explicitly control the number of CPUs in each dimension. The clause ONTO(2,3) assigns two CPUs to the first dimension and three to the second. Some possible arrangements are sketched in Figure 8-11.

Figure 8-11 Possible Outcomes of Distribute ONTO Clause
(Sketches, for six CPUs, of c$distribute A(block,block); c$distribute A(block,block) onto (2,3); c$distribute A(block,block) onto (1,2); and c$distribute A(block,block) onto (1,*).)

The arguments of ONTO specify the aspect ratio desired. If the available CPUs cannot exactly match the specified aspect ratio, the best approximation is used. In a six-CPU execution, ONTO(1,2) assigns two CPUs to the first dimension and three to the second. An argument of asterisk means that all remaining CPUs are to fill in that dimension: ONTO(2,*) assigns two CPUs to the first dimension and p/2 to the second.


Understanding the AFFINITY Clause for Data

The Affinity clause extends the Parallel Do directive to data distribution. Affinity has two forms. The data form, in which iterations are assigned to CPUs to match a data distribution, is used in Example 8-9 on page 223. In that example, the correspondence between iterations and data distribution is quite simple. The clause, however, allows more complicated associations. One is shown in Example 8-10.

Example 8-10 Parallel Loop with Affinity in Data

c$distribute a(block,cyclic(1))
c$doacross local(i,j), shared(a)
c$&, affinity(i) = data(a(2*i+3,j))
      do i = 1, n
         do j = 1, n
            a(2*i+3,j) = a(2*i+3,j-1)
         enddo
      enddo

Here the compiler and runtime try to execute the iterations on i in those CPUs for which element a(2*i+3,j) is in local memory. The loop-index variable (i in the example) cannot appear in more than one dimension of the clause, and the expressions involving it are limited to the form a × i + b, where a and b are literal constants with a greater than zero.

Understanding the AFFINITY Clause for Threads

A different kind of affinity relates iterations to threads of execution, without regard to data location. This form of the clause is shown in Example 8-11.

Example 8-11 Parallel Loop with Affinity in Threads

      integer n, p, i
      parameter (n = 8*1024*1024)
      real a(n)
      p = 1
c$    p = mp_numthreads()
c$doacross local (i), shared (a,p)
c$&, affinity(i) = thread(i/((n+p-1)/p))
      do i = 1, n
         a(i) = 0.0
      enddo


This form of affinity has no direct relationship to data distribution. Rather, it replaces the schedule type (that is, #pragma pfor schedtype, or !$OMP DO SCHEDULE) with a schedule based on an expression in an index variable. The code in Example 8-11 executes iteration i on the thread given by an expression, modulo the number of threads that exist. The expression may need to be evaluated in each iteration of the loop, so variables (other than the loop index) must be declared shared and not changed during the execution of the loop.

Understanding the NEST Clause

Multi-dimensional mappings (for example, mappings f and g in Figure 8-10 on page 226) often benefit from parallelization over more than just the one loop to which a standard directive applies. The NEST clause permits such parallelizations. It specifies that the full set of iterations in the loop nest may be executed concurrently. Typical syntax is shown in Example 8-12.

Example 8-12 Loop Parallelized with the NEST Clause

      real a(n,n)
c$doacross nest(i,j), local(i,j), shared(a)
      do j = 1, n
         do i = 1, n
            a(i,j) = 0.0
         enddo
      enddo

The normal and default parallelization of the loop in Example 8-12 would schedule the iterations of each column of the array—iterations in i for each value of j—on parallel CPUs. The work of the outer loop, which is only index incrementing, is done serially. In the example, the NEST clause specifies that all n × n iterations can be executed in parallel. To use this directive, the loops to be parallelized must be perfectly nested, that is, no code is allowed except within the innermost loop. Using NEST, when there are enough CPUs, every combination of i and j executes in parallel. In any event, when there are more than n CPUs, all will have work to do.

Example 8-13 Loop Parallelized with NEST Clause with Data Affinity

      real a(n,n)
c$distribute a(block,block)
c$doacross nest(i,j), local(i,j), shared (a,n)
c$&, affinity(i,j) = data(a(i,j))
      do j = 1, n
         do i = 1, n
            a(i,j) = 0.0
         enddo
      enddo


Example 8-13 shows the combination of the NEST and AFFINITY (to data) clauses. All iterations of the loop are treated independently, but each CPU operates on its share of distributed data.

Example 8-14 Loop Parallelized with NEST, AFFINITY, and ONTO

      real a(m,n)
c$distribute a(block,block)
c$doacross nest(i,j), local(i,j), shared (a,m,n)
c$&, affinity(i,j) = data(a(i,j)) onto (2,1)
      do j = 1, n
         do i = 1, m
            a(i,j) = 0.0
         enddo
      enddo

Example 8-14 shows the combination of all three clauses: NEST to parallelize all iterations; AFFINITY to cause CPUs to work on local data; and an ONTO clause to specify the aspect ratio of the parallelization. In this example, twice as many CPUs are used to parallelize the i-dimension as the j-dimension.

Understanding the Redistribution Directives

The Distribute directive is commonly used to specify a single arrangement of data that is constant throughout the program. In some programs you need to distribute an array differently during different phases of the program. Two directives allow you to accomplish this easily:
• The Dynamic directive tells the compiler that an array may need to be redistributed, so its distribution must be calculated at runtime.
• The Redistribute directive causes an array to be distributed differently.

Redistribute allows you to change the distribution defined by a Distribute directive to another mapping. However, redistribution should not be seen as a general performance tool. For example, the FFT algorithm is better with data copying, as discussed under “First-Touch Placement with Multiple Data Distributions” on page 219.
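As an illustration only, a minimal sketch of the two directives might look like the following; the array name, sizes, and phase structure are hypothetical, and the exact syntax accepted by your compiler release should be checked against the Fortran manuals.

      real a(1024, 1024)
c$distribute a(block,*)
c$dynamic a
c     ... phase 1 works on a with the first dimension distributed ...
c$redistribute a(*,block)
c     ... phase 2 works on a with the second dimension distributed ...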

(Despite its name, the Distribute_Reshape directive does not perform redistribution. It tells the compiler that it can reshape the distributed data into nonstandard layouts.)


Using the Page_Place Directive for Custom Mappings

The block and cyclic mappings are well-suited for regular data structures such as numeric arrays. The Page_Place directive can be used to distribute nonrectangular data. This directive allows you to assign ranges of whole pages to precisely the CPUs you want. The penalty is that you have to understand the memory layout of your data in terms of virtual pages.

The Page_Place directive, unlike most other directives, is an executable statement (not an instruction to the compiler on how to treat the program). It takes three arguments:
• The name of a variable, whose first address is the start of the area to be placed
• The size of the area to be placed
• The number of the CPU, where the first CPU being used for this program is 0

Because the directive is executable, you can use it to place dynamically allocated data, and you can use it to move data from one memory to another. (The Distribute directive is not an executable statement, so it can reference only the data that is statically declared in the code.) In Example 8-15, the Page_Place directive is used to create a block mapping of the array a onto p CPUs.

Example 8-15 Fortran Code for Explicit Page Placement

      integer n, p, npp, i
      parameter (n = 8*1024*1024)
      real a(n)
      p = 1
c$    p = mp_numthreads()
c-----distribute a using a block mapping
      npp = (n + p-1)/p   ! number of elements per CPU
c$doacross local(i), shared(a,npp)
      do i = 0, p-1
c$page_place (a(1 + i*npp), npp*4, i)
      enddo

If the array has not previously been distributed via first-touch or a Distribute directive, the Page_Place directive establishes the initial placement and incurs no time cost. If the data have already been placed, execution of the directive redistributes the data, causing the pages to migrate from their initial locations to those specified by the directive. This data movement requires some time to complete. (It may or may not be faster than data copying.) If your intention is only to set the initial placement, rather than redistributing the pages, make the directive the first executable statement that acts on the specified data, and don’t use other distribution directives on the same variables.


When your intention is to redistribute the data, Page_Place will accomplish this. But for regularly distributed data, the Redistribute directive also works and is more convenient.

Using Reshaped Distribution Directives

The regular data distribution directives provide an easy way to specify the desired data distribution, but they have a limitation: the distributions operate only on whole virtual pages. This limits their use with small arrays and cyclic distributions. For these situations another directive is provided, Distribute_Reshape.

The arguments to Distribute_Reshape are the same as those of the Distribute directive, and it has the same basic purpose. However, it also promises the compiler that the program makes no assumptions about the layout of data in memory; for example, that it does not assume that array elements with consecutive indexes are located in consecutive virtual addresses. With this assurance, the compiler can distribute data in ways that are not possible when it has to preserve the standard assumptions about the layout of data in memory. For example, when the size of a distributed block is less than a page, the compiler can pad each block of elements to a full page and distribute that. The implications of this are best seen through an example.

Example 8-16 shows the declaration of a three-dimensional array, a, and distributes its xy-planes to CPU local memory using a reshaped cyclic mapping.

Example 8-16 Declarations Using the Distribute_Reshape Directive

      integer i, j, k
      integer n1, n2, n3
      parameter (n1=200, n2=150, n3=150)
      real a(n1, n2, n3)
c$distribute_reshape a(*,*,cyclic)
      real sum

The reshaped data distribution means that each CPU is assigned ceil(n3/p) xy-planes in a cyclic order. The x- and y-dimensions are not distributed, so each plane consumes n1 × n2 consecutive data locations in normal Fortran storage order (that is, the first n1 elements are down the first column, the next n1 elements are down the second column, and so on), and these are in the expected, consecutive, virtual storage locations.


However, the Distribute_Reshape directive promises the compiler that the program makes no assumptions about the location of one z-plane compared with another. That is, even though standard Fortran storage order specifies that a(1,1,2) is stored n1 × n2 elements after a(1,1,1), this is no longer guaranteed when the reshape directive is used. The compiler is allowed to insert “holes,” or padding space, between these planes as needed in order to perform the exact, cyclic distribution requested. The situation is reflected in Figure 8-12.

Figure 8-12 Reshaped Distribution of Three-Dimensional Array
(Sketch of the nx × ny × nz array with its xy-planes dealt to four CPUs in cyclic order along the z-dimension: planes 1, 5, 9, ... to the first CPU; 2, 6, 10, ... to the second; 3, 7, 11, ... to the third; and 4, 8, 12, ... to the fourth.)

In Example 8-17, the BLAS-1 routines sasum and sscal from the CHALLENGEcomplib are applied in parallel to pencils of data in each of the three dimensions. The library routine sasum knows nothing about reshaped arrays. On the contrary, it assumes that it is working on a vector of consecutive reals. Its first argument tells it how many elements to sum; the second is the address at which to start summing; and the third argument is the stride, that is, the number of storage elements by which to increment the address to get to the next value included in the sum.


Example 8-17 Valid and Invalid Use of Reshaped Array

c$doacross local(k,j,sum), shared(a)
      do k = 1, n3
         do j = 1, n2
            sum = sasum(n1, a(1,j,k), 1)
            call sscal(n1, sum, a(1,j,k), 1)
         enddo
      enddo
c$doacross local(k,i,sum), shared(a)
      do k = 1, n3
         do i = 1, n1
            sum = sasum(n2, a(i,1,k), n1)
            call sscal(n2, sum, a(i,1,k), n1)
         enddo
      enddo
c$doacross local(j,i,sum), shared(a)
      do j = 1, n2
         do i = 1, n1
            sum = sasum(n3, a(i,j,1), n1*n2)
            call sscal(n3, sum, a(i,j,1), n1*n2)
         enddo
      enddo

In the first loop in Example 8-17, values are summed down the x-dimension. These are adjacent to one another in memory, so the stride is 1. In the second loop, values are summed in the y-dimension. Each y-value is one column of length n1 away from its predecessor, so these have a stride of n1. The x- and y-dimensions are not distributed, so these values are stored at the indicated strides, and the subroutine calls generate the expected results.

This is not the case for the z-dimension. You can make no assumptions about the storage layout in a distributed dimension. The stride argument of n1 × n2 used in the third loop, while logically correct, cannot be assumed to describe the storage addressing. Depending on the size and alignment of the array, the compiler may have inserted padding between elements to achieve perfect distribution. This third loop generates incorrect results when run in parallel: the sasum routine adds the value of some padding in place of some of the data elements. (If the program is compiled without the -mp flag, so that the directives are ignored, it does generate correct results.)

The problem is the result of using a library routine that makes assumptions about the storage layout. To correct this you have to remove those assumptions. One solution is to use standard array indexing instead of the library routine. Example 8-18 replaces the library call with code that does not assume a storage layout.


Example 8-18 Corrected Use of Reshaped Array

c$doacross local(j,i,k,sum), shared(a)
      do j = 1, n2
         do i = 1, n1
            sum = 0.0
            do k = 1, n3
               sum = sum + abs(a(i,j,k))
            enddo
            do k = 1, n3
               a(i,j,k) = a(i,j,k)/sum
            enddo
         enddo
      enddo

A solution that allows you to employ library routines is to use copying. Example 8-19 shows this approach.

Example 8-19 Gathering Reshaped Data with Copying

c$doacross local(j,i,k,sum,tmp), shared(a)
      do j = 1, n2
         do i = 1, n1
            do k = 1, n3
               tmp(k) = a(i,j,k)
            enddo
            sum = sasum(n3, tmp, 1)
            call sscal(n3, sum, tmp, 1)
         enddo
      enddo

The copy loop accesses array elements by their indexes, so it does not make assumptions about the storage layout—the compiler generates code that allows for any reshaping it has done. When a local scratch array is used, the stride is known, so sasum may be used safely. Note, though, that the copy step in Example 8-19 accesses memory in long strides. The version in Example 8-20 involves more code and scratch space, but it makes much more effective use of cache lines.


Example 8-20 Gathering Reshaped Data with Cache-Friendly Copying

c$doacross local(j,i,k,sum,tmp), shared(a)
      do j = 1, n2
         do k = 1, n3
            do i = 1, n1
               tmp(i,k) = a(i,j,k)
            enddo
         enddo
         do i = 1, n1
            sum = sasum(n3, tmp(i,1), n1)
            call sscal(n3, sum, tmp(i,1), n1)
         enddo
      enddo

Figure 8-13 Copying By Cache Lines for Summation
(Sketch showing the data copied by whole cache lines into the scratch array, with the sums then calculated across the rows of the copy.)

Distribute_Reshape, although implemented for shared memory programs, is essentially a distributed memory construct. You use it to partition data structures the way you would on a distributed-memory computer. In a true distributed-memory system, once the data structures are partitioned, accessing them in the distributed dimension is restricted. In considering whether you can safely perform an operation on a reshaped array, ask if it is something that wouldn’t be allowed on a distributed-memory computer. However, if the operation makes no assumptions about data layout, SN0 shared memory allows you to conveniently perform many operations that would be much more difficult to implement on a distributed memory computer.


Creating Your Own Reshaped Distribution

The purpose of the Distribute_Reshape directive is to achieve the desired data distribution without page-granularity limitations. You can accomplish the same thing through the use of dynamic memory allocation and pointers. The method follows this outline:
1. Allocate enough space for each CPU so that it doesn’t need to share a page with another CPU.
2. Pack the pieces each CPU is responsible for into its pages of memory.
3. Use first-touch allocation, or Page_Place, to make sure each CPU’s pages are stored locally.
4. Set up an array of pointers so the individual pieces can be accessed conveniently.

However, this approach can easily add great complexity to what ought to be a readable algorithm. As with most optimizations, it is usually better to leave the source code simple and let the compiler do the hard work behind the scenes.

Restrictions of Reshaped Distribution

When you use the Distribute_Reshape directive, there are some restrictions you should remember:
• The distribution of a reshaped array cannot be changed dynamically (that is, there is no redistribute_reshape directive).
• Statically initialized data cannot be reshaped.
• Arrays that are explicitly allocated through alloca() or malloc() and accessed through pointers cannot be reshaped.
• An array that is equivalenced to another array cannot be reshaped.
• I/O for a reshaped array cannot be mixed with namelist I/O or a function call in the same I/O statement.
• A COMMON block containing a reshaped array cannot be linked -Xlocal.

In addition, some care must be used in passing reshaped arrays to subroutines. As long as you pass the entire array, and the declared size of the array in the subroutine matches the size of array that was passed to it, problems are unlikely. Example 8-21 illustrates this.


Example 8-21 Reshaped Array as Actual Parameter—Valid

      real a(n1,n2)
c$distribute_reshape a(block,block)
      call sub(a,n1,n2)
      ...
      subroutine sub(a,n1,n2)
      real a(n1,n2)
      ...

In Example 8-22, the formal parameter is declared differently from the actual reshaped array. Treating the two-dimensional array as a vector amounts to an assumption that consecutive elements are adjacent in memory, a false assumption.

Example 8-22 Reshaped Array as Actual Parameter—Invalid

      real a(n1,n2)
c$distribute_reshape a(block,block)
      call sub(a,n1*n2)
      ...
      subroutine sub(a,n)
      real a(n)
      ...

Inside a subroutine, you don’t need to declare how an array has been distributed. In fact, the subroutine is more general if you do not declare the distribution. The compiler will generate versions of the subroutine necessary to handle all distributions that are passed to it. Example 8-23 shows a subroutine receiving two arrays that are reshaped differently.

Example 8-23 Differently Reshaped Arrays as Actual Parameters

      real a(n1,n2)
c$distribute_reshape a(block,block)
      real b(n1,n2)
c$distribute_reshape b(cyclic,cyclic)
      call sub(a,n1,n2)
      call sub(b,n1,n2)
      ...
      subroutine sub(a,n1,n2)
      real a(n1,n2)
      ...


The compiler actually generates code to handle each distribution passed to the subroutine. As long as the data calculations are efficient for both distributions, this will achieve good performance. If a particular algorithm works only for a specific data distribution, you can declare the required distribution inside the subroutine by using a Distribute_Reshape directive there. Then, calls to the subroutine that pass a mismatched distribution cause compile- or link-time errors.

Most errors in accessing reshaped arrays are caught at compile time or link time. However, some errors, such as passing a portion of a reshaped array that spans a distributed dimension, can be caught only at run time. You can instruct the compiler to do runtime checks for these with the -MP:check_reshape=on flag. You should use this flag during the development and debugging of programs which use reshaped arrays. In addition, the Fortran manuals listed in “Compiler Manuals” on page xxix have more information about reshaped arrays.

Investigating Data Distributions

Sometimes it is useful to check how data have been distributed. You can do this on a program basis with an environment variable and dynamically with a function call.

Using _DSM_VERBOSE

If you are using the data distribution directives, you can set the environment variable _DSM_VERBOSE. When it is set, the program will print out messages at run time that tell you whether the distribution directives are being used, which physical CPUs the threads are running on, and what the page sizes are. A report might look like Example 8-24.

Example 8-24 Typical Output of _DSM_VERBOSE

% setenv _DSM_VERBOSE
% a.out
[ 0]: 16 CPUs, using 4 threads. 1 CPUs per memory
Migration: OFF
MACHINE: IP27 --- is NUMA ---
MACHINE: IP27 --- is NUMA ---
Created 4 MLDs
Created and placed MLD-set. Topology-free, Advisory.


MLD attachments are:
MLD 0. Node /hw/module/3/slot/n4/node
MLD 1. Node /hw/module/3/slot/n3/node
MLD 2. Node /hw/module/3/slot/n1/node
MLD 3. Node /hw/module/3/slot/n2/node
[ 0]: process_mldlink: thread 0 (pid 3832) to memory 0. -advisory-.
Pagesize: stack 16384 data 16384 text 16384 (bytes)
--- finished MLD initialization ---
...
[ 0]: process_mldlink: thread 1 (pid 3797) to memory 1. -advisory-.
[ 0]: process_mldlink: thread 2 (pid 3828) to memory 2. -advisory-.
[ 0]: process_mldlink: thread 3 (pid 3843) to memory 3. -advisory-.
[ 0]: process_mldlink: thread 0 (pid 3832) to memory 0. -advisory-.

These messages are useful in verifying that the requested data placements are actually being performed.

The utility dplace (see “Non-MP Library Programs and Dplace” on page 243) disables the data distribution directives. It should not normally be used to run MP library programs. If you do use it, however, it displays messages reminding you that “DSM” is disabled.

Using Dynamic Placement Information

You can obtain information about data placement within the program with the dsm_home_threadnum() intrinsic. It takes an address as an argument, and returns the number of the CPU in whose local memory the page containing that address is stored. It is used as follows:

      integer dsm_home_threadnum
      numthread = dsm_home_threadnum(array(i))

Two CPUs are connected to each node and share the same memory, so the function returns the lower of the two CPU numbers on the node that holds the data.

The numbering of CPUs is relative to the program. You can also determine the absolute physical node in which a page of memory is stored. (Note, though, that this number typically changes from run to run.) The information is obtained via the syssgi() system call using the SGI_PHYSP command (see the syssgi(2) reference page). The routine va2pa() translates a virtual address to a physical address using this system call (see Example C-9 on page 302). You can translate a physical page address to a node number with the following macro:

#define ADDR2NODE(A) ((int) (va2pa(A) >> 32))


Note that the node number is generally half the lowest numbered CPU on the node, so if you would prefer to use CPU numbers, multiply the return value by two.
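For example, the following small routine is a sketch only: it assumes the va2pa() routine of Example C-9 has been compiled into the program, and the return type declared for it here is an assumption.

#include <stdio.h>

extern unsigned long long va2pa(void *);   /* from Example C-9 (assumed) */
#define ADDR2NODE(A) ((int) (va2pa(A) >> 32))

/* Report the node, and the pair of CPUs on that node, that hold the
   page containing the address p. */
void where_is(void *p)
{
    int node = ADDR2NODE(p);
    printf("page at %p is on node %d (CPUs %d and %d)\n",
           p, node, 2*node, 2*node + 1);
}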

To explore this information, a test program was written that allocated two vectors, a and b, of size 256 KB, initialized them sequentially, and then printed out the locations of their pages. The default first-touch policy for a run using five CPUs resulted in the page placements shown in Example 8-25.

Example 8-25 Test Placement Display from First-Touch Allocation

Distribution for array "a"
   address      byte index   thread   proc
------------    ----------   ------   ----
0xfffffb4760    0x     0        0      12
0xfffffb8000    0x  38a0        0      12
0xfffffbc000    0x  78a0        0      12
0xfffffc0000    0x  b8a0        0      12
0xfffffc4000    0x  f8a0        0      12
0xfffffc8000    0x 138a0        0      12
0xfffffcc000    0x 178a0        0      12
0xfffffd0000    0x 1b8a0        0      12
0xfffffd4000    0x 1f8a0        0      12
0xfffffd8000    0x 238a0        0      12
0xfffffdc000    0x 278a0        0      12
0xfffffe0000    0x 2b8a0        0      12
0xfffffe4000    0x 2f8a0        0      12
0xfffffe8000    0x 338a0        0      12
0xfffffec000    0x 378a0        0      12
0xffffff0000    0x 3b8a0        0      12
0xffffff4000    0x 3f8a0        0      12
Distribution for array "b"
   address      byte index   thread   proc
------------    ----------   ------   ----
0xfffff6e0f8    0x     0        0      12
0xfffff70000    0x  1f08        0      12
0xfffff74000    0x  5f08        0      12
0xfffff78000    0x  9f08        0      12
0xfffff7c000    0x  df08        0      12
0xfffff80000    0x 11f08        0      12
0xfffff84000    0x 15f08        0      12
0xfffff88000    0x 19f08        0      12
0xfffff8c000    0x 1df08        0      12
0xfffff90000    0x 21f08        0      12
0xfffff94000    0x 25f08        0      12
0xfffff98000    0x 29f08        0      12


0xfffff9c000    0x 2df08        0      12
0xfffffa0000    0x 31f08        0      12
0xfffffa4000    0x 35f08        0      12
0xfffffa8000    0x 39f08        0      12
0xfffffac000    0x 3df08        0      12

The sequential initialization combined with the first-touch policy caused all the pages to be placed in the memory of the first thread, which in this run was CPU 12, or node 6.

Example 8-26 shows the results from the same code after setting the environment variable _DSM_ROUND_ROBIN to use a round-robin page placement.

Example 8-26 Test Placement Display from Round-Robin Placement

Distribution for array "a"
   address      byte index   thread   proc
------------    ----------   ------   ----
0xfffffb4750    0x     0        2       2
0xfffffb8000    0x  38b0        0       0
0xfffffbc000    0x  78b0        4       4
0xfffffc0000    0x  b8b0        2       2
0xfffffc4000    0x  f8b0        0       0
0xfffffc8000    0x 138b0        4       4
0xfffffcc000    0x 178b0        2       2
0xfffffd0000    0x 1b8b0        0       0
0xfffffd4000    0x 1f8b0        4       4
0xfffffd8000    0x 238b0        2       2
0xfffffdc000    0x 278b0        0       0
0xfffffe0000    0x 2b8b0        4       4
0xfffffe4000    0x 2f8b0        2       2
0xfffffe8000    0x 338b0        0       0
0xfffffec000    0x 378b0        4       4
0xffffff0000    0x 3b8b0        2       2
0xffffff4000    0x 3f8b0        0       0
Distribution for array "b"
   address      byte index   thread   proc
------------    ----------   ------   ----
0xfffff6e0e8    0x     0        0       0
0xfffff70000    0x  1f18        4       4
0xfffff74000    0x  5f18        2       2
0xfffff78000    0x  9f18        0       0
0xfffff7c000    0x  df18        4       4
0xfffff80000    0x 11f18        2       2
0xfffff84000    0x 15f18        0       0
0xfffff88000    0x 19f18        4       4


0xfffff8c000    0x 1df18        2       2
0xfffff90000    0x 21f18        0       0
0xfffff94000    0x 25f18        4       4
0xfffff98000    0x 29f18        2       2
0xfffff9c000    0x 2df18        0       0
0xfffffa0000    0x 31f18        4       4
0xfffffa4000    0x 35f18        2       2
0xfffffa8000    0x 39f18        0       0
0xfffffac000    0x 3df18        4       4

Round-robin placement really did spread the data over all three nodes used by this 5-CPU run.

Non-MP Library Programs and Dplace

Programs that do not use the MP library include message passing programs employing MPI, PVM and other libraries, as well as “roll your own parallelism” implemented via fork() or sproc() (see “Explicit Models of Parallel Computation” on page 194). If such programs have been properly parallelized, but do not scale as expected on SN0, they may require data placement tuning.

If you are developing a new parallel program, or are willing to modify the source of your existing programs, you can ensure good data placement by adhering to the following programming practices, which take advantage of the default first-touch data placement policy:
• In a program that starts multiple processes using fork(), each process should allocate and initialize its own memory areas. Then the memory used by a process resides in the node where the process runs.
• In a program that starts multiple processes using sproc(), do not allocate all memory prior to creating child processes. In the parent process, allocate only memory used by the parent process and memory that is global to all processes. This memory will be located in the node where the parent process runs. Each child process should allocate any memory areas that it uses exclusively. Those areas will be located in the node where that process runs.
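The fork() case can be sketched as follows. This is a minimal illustration only: the worker count, array size, and work() routine are placeholders, and error checking is omitted.

#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define NWORKERS 4
#define NELEMS   (1L << 20)

static void work(double *v, long n) { (void)v; (void)n; /* placeholder */ }

int main(void)
{
    int w;
    for (w = 0; w < NWORKERS; w++) {
        if (fork() == 0) {
            /* Each child allocates and initializes its own data, so under
               the first-touch policy the pages reside on the child's node. */
            double *v = malloc(NELEMS * sizeof(double));
            long i;
            for (i = 0; i < NELEMS; i++)
                v[i] = 0.0;
            work(v, NELEMS);
            _exit(0);
        }
    }
    for (w = 0; w < NWORKERS; w++)
        wait(NULL);
    return 0;
}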

In general, in any program composed of interacting, independent threads, each thread should be the first to touch any memory used exclusively by that thread. This applies to programs in which a thread is a process, or a POSIX thread, as well as to programs based on MPI and PVM.


For all these programs, data placement can be measured or tuned with the utility dplace, which allows you to:
• Change the page size used for text, stack, or data.
• Enable dynamic page migration.
• Specify the topology used by the threads of a parallel program.
• Indicate resource affinities.
• Assign memory ranges to particular nodes.

For the syntax of dplace, refer to the dplace(1) reference page. Some dplace commands use a placement file, whose syntax is documented in dplace(5). If you are not familiar with dplace, it would be a good idea to skim these two reference pages now. In addition, there is a subroutine library interface, documented in the dplace(3) reference page, which not only allows you to accomplish all the same things from within a program, but also allows dynamic control over some of the data placement choices.

Although dplace can be used to adjust policies and topology for any program, in practice it should not be used with MP library programs because it disables the data placement specifications made via the MP library compiler directives. For these programs, use the MP library environment variables instead of dplace.

Changing the Page Size

In “Using Larger Page Sizes to Reduce TLB Misses” on page 147 we discussed using dplace to change the page size:

% dplace -data_pagesize 64k -stack_pagesize 64k program arguments...

This command runs the specified program after increasing the size of two of the three types of pages—data and stack—from the default 16 KB to 64 KB. The text page size is unchanged. This technique allows the TLB to cover a wider range of virtual data addresses, thereby reducing TLB thrashing.

In general, TLB thrashing only occurs for pages that store data structures. Dynamically allocated memory space, and global variables such as Fortran common blocks, are stored in data segments; their page size is controlled by -data_pagesize. Variables that are local to subroutines are allocated on the stack, so it can also be useful to change -stack_pagesize when trying to fix TLB thrashing problems. Increasing the -text_pagesize is not generally of benefit for scientific programs, in which time is usually spent in the execution of tight loops over arrays.


There are some restrictions on page sizes. First, the only valid page sizes are 16 KB, 64 KB, 256 KB, 1 MB, 4 MB, and 16 MB. Second, the percentage of system memory allocated to the various possible page sizes is a system configuration parameter that the system administrator must specify. If the system has not been configured to allocate 1 MB pages, requesting that page size has no effect. The page size percentages are set with the systune command (see the systune(1) reference page, and the System Configuration manual listed under “Software Tool Manuals” on page xxx).

Enabling Page Migration

The concept of dynamic page migration is covered under “Dynamic Page Migration” on page 30. Enabling migration tells the operating system to move pages of memory to the nodes that access them most frequently, so a poor initial placement can be corrected. Recall that page placement is not an issue for single-CPU jobs; migration is only a consideration for parallel programs. Furthermore, migration is only a potential benefit to programs employing a shared memory model; MPI, PVM, and other message-passing programs control data layout explicitly, so automated control of data placement would only get in their way.

The IRIX automatic page migration facility is disabled by default, because page migration is an expensive operation that affects overall system performance. The system administrator can temporarily enable page migration for all programs using the sn command, or can enable it permanently by using systune. (For more information, see the sn(1) and systune(1) reference pages, the comments in /var/sysgen/mtune/numa, and the System Configuration manual listed under “Software Tool Manuals” on page xxx.)

You can decide to use dynamic page migration for a specific program in one of two ways. If the program uses the MP library, use its environment variables as discussed under “Trying Dynamic Page Migration” on page 209. For other programs, you enable migration using dplace:

% dplace -migration threshold program arguments...

The threshold argument is an integer between 0 and 100. A threshold value 0 is special; it turns migration off. Nonzero values adjust the sensitivity of migration, which is based on hardware-maintained access counters that are associated with every 4 KB block of memory. (See the refcnt(5) reference page for a description of these counters and of how the IRIX kernel extends them to virtual pages.)


When migration is enabled, the IRIX kernel surveys the counters periodically. Migration is triggered when accesses to a page from any one other node exceed the local accesses to that page by more than threshold% of the maximum counter value.

When you enable migration, use a conservative threshold, say 90 (which specifies that the count of remote accesses exceeds the count of local accesses by 90% of the counter value). The point is to permit enough migration to correct poor initial data placement, but not to cause so much migration that the expense of moving pages is more than the cost of accessing them remotely. You can experiment with smaller values, but keep in mind that smaller values could keep a page in constant motion. Suppose that a page is being used from three different nodes equally. Moving the page from one of the three nodes to another will not change the distribution of counts.

Specifying the Topology

The topology of a program is the shape of the connections between the nodes where the parallel threads of the program are executed. When executing a parallel program, the operating system tries to arrange it so that the processes making up the parallel program run on nearby nodes to minimize the access time to shared data. “Nearby” means minimizing a simple distance metric, the number of router hops needed to get from one node to another. For an 8-node (16-CPU) system, the distance between any two nodes using this metric is as shown in Table 8-2.

Table 8-2 Distance in Router Hops Between Any Nodes in an 8-Node System

Node    0   1   2   3   4   5   6   7
  0     0   1   2   2   2   2   3   3
  1     1   0   2   2   2   2   3   3
  2     2   2   0   1   3   3   2   2
  3     2   2   1   0   3   3   2   2
  4     2   2   3   3   0   1   2   2
  5     2   2   3   3   1   0   2   2
  6     3   3   2   2   2   2   0   1
  7     3   3   2   2   2   2   1   0


Thus, if the first two threads of a 6-way parallel job are placed on node 3, the next two threads will be placed on node 2 because, at a distance of only 1, it is the closest node. The final two threads may be placed on any of nodes 0, 1, 6, or 7, because these are all a distance of 2 away.

For most applications, this “cluster” topology is perfect, and you need not worry further about topology issues. For those cases in which it may be of benefit, however, dplace does give you a way to override the default and specify other topologies. This is done through a placement file, a text file indicating how many nodes should be used to run a program, what topology should be used, and where the threads should run.

Note: In a dplace placement file, a node is called a “memory.” You specify a topology of memories, and you specify on which memory a thread should run.

Figure 8-14 shows a simple placement file and its effect on the program:

The placement file in the figure contains:

# placement file
memories 2 in topology cube
threads 2
run thread 0 on memory 1
run thread 1 on memory 0

Figure 8-14 Placement File and its Results

With this placement file, two memories are used to run this two-thread program. Without a placement file, IRIX would try to place a two-thread program on a single memory, because two CPUs are associated with each memory, and only one CPU is needed per thread. In addition, the memories are laid out in a cube topology rather than the default cluster topology. For this simple case of two memories, there is no difference; but if more memories had been requested, the cube topology would ensure that memories which should be hypercube neighbors end up adjacent to each other in the SN0 system. Finally, threads are placed onto the memories in the reverse of the typical order, that is, thread 0 is run on memory 1 and thread 1 is run on memory 0.


Placement File Syntax

The syntax of placement files is detailed in the dplace(5) reference page. The key statements in it are the memories, threads, run, and distribute statements.

Using Environment Variables in Placement Files

The placement file shown in Figure 8-14 has one limitation: it is not scalable. That is, if you want to change the number of threads, you need to change the placement file. A placement file can refer to environment variables, making it scalable. A file of this type is shown in Example 8-27.

Example 8-27 Scalable Placement File

# scalable placement_file
memories $NP in topology cluster   # set up memories which are close
threads $NP                        # number of threads
# run the last thread on the first memory etc.
distribute threads $NP-1:0:-1 across memories

Instead of writing the number of threads into the file, the environment variable $NP is used. (There is no significance to the name; any variable could be used.) When the placement file is read, the reference to the environment variable is replaced by its value taken from the shell. Thus, if NP is set to 2, the first two statements have the same effect as the version in Figure 8-14. To make the last two statements scalable, they are replaced with a distribute statement. This one statement specifies a mapping of all threads to all memories. Its use is covered under “Assigning Threads to Memories” on page 250.

You can use simple arithmetic on environment variables in placement files. A scalable placement file that uses a more conventional placement of two threads per memory is shown in Example 8-28. This example uses the MP_SET_NUMTHREADS environment variable used by the MP library (however, it is not recommended to apply dplace to MP library programs). A comparable variable is PT_ITC, the number of processes used by the POSIX Threads library to run a pthreaded program.

Example 8-28 Scalable Placement File for Two Threads Per Memory

# Standard 2 thread per memory scalable placement_file
memories ($MP_SET_NUMTHREADS + 1)/2 in topology cluster
threads $MP_SET_NUMTHREADS
distribute threads across memories


Using the memories Statement

The memories statement specifies the number of memories (SN0 nodes) to be allocated to the program and, optionally, their topology. The topologies supported are these:
• The default of cube, specifying a hypercube that is a proper subset of the SN0 topology.
• none, which is the same as a cluster topology: any arrangement that minimizes router hops between nodes.
• physical, used when placing nodes near to hardware, discussed in “Indicating Resource Affinity” on page 251.

The number of nodes in a hypercube is a power of two. As a result, the cube topology can have strange side effects when used on machines that do not have a power-of-two number of nodes. In general, use the cluster topology.

The memories are identified by number, starting from 0, in the order that they are allocated. Using cube or cluster topology, there is no way to relate a memory number to a specific SN0 node; memory 0 could be identified with any node in the system on any run of the program.

Using the threads Statement

The threads statement specifies the number of IRIX processes the program will use. The MP library uses IRIX processes, as do MPI, PVM, and standard UNIX software. POSIX threads programs are implemented using some number of processes; the number is available within the program using the pthread_get_concurrency() function.

The threads statement does not create the processes; creation happens as a result of the program’s own actions. The threads statement tells dplace in advance how many threads the program will create. Threads are identified by number, starting from 0, in the sequence in which the program creates them. Thus the first process, the parent process of the program, is thread 0 to dplace.


Assigning Threads to Memories

There are two statements used to assign threads to memories, run and distribute. The run statement tells dplace to assign a specific thread, by number, to a specific memory by number. Optionally, it can assign the thread to a specific CPU in that node, although there is no particular advantage to doing so. The run statement deals with only one thread, so a placement file that uses it cannot be scalable.

The distribute statement specifies a mapping of all threads to all memories. In Example 8-27, a distribute statement is used to map threads in reverse order to memories, as follows:

distribute threads $NP-1:0:-1 across memories

The three-part expression, reminiscent of a Fortran DO loop, specifies the threads in the program as first:last:stride. In this example, the last thread ($NP-1) is mapped to memory 0, the next-to-last thread is mapped to memory 1, and so on.

This example perversely maps threads to memories in reverse order. If the normal order were desired, the statement could be written in any of the ways shown in Example 8-29.

Example 8-29 Various Ways of Distributing Threads to Memories

# Explicitly indicate order of threads
distribute threads 0:$NP-1:1 across memories
# No first:last:stride means 0:last:1
distribute threads across memories
# The stride can be omitted when it is 1
distribute threads 0:$NP-1 across memories
# first:last:stride notation applies to memories too
distribute threads across memories 0:$NP-1:1
distribute threads $NP-1:0:-1 across memories $NP-1:0:-1

The distribute statement also supports a clause telling how threads are assigned to memories, when there are more threads than memories. The clause can be either block or cyclic. In a block mapping with a block size of two, the first two threads are assigned to the first memory, the second two to the second, and so on. In a cyclic mapping, threads are dealt to memories as you deal a deck of cards. The default is to use a block mapping, and the default block size is ceil(number of threads ÷ number of memories); that is, 2 if there are twice as many threads as memories, and 1 if the number of threads and memories are equal.


Indicating Resource Affinity

The hardware characteristics of an SN0 system are maintained in the hardware graph, a simulated filesystem mounted at /hw. All real hardware devices and many symbolic pseudo-devices appear in the /hw filesystem. For example, all the nodes in a system can be listed with the following command:

[farewell 1] find /hw -name node -print
/hw/module/5/slot/n1/node
/hw/module/5/slot/n2/node
/hw/module/5/slot/n3/node
/hw/module/5/slot/n4/node
/hw/module/6/slot/n1/node
/hw/module/6/slot/n2/node
/hw/module/6/slot/n3/node
/hw/module/6/slot/n4/node
/hw/module/7/slot/n1/node
/hw/module/7/slot/n2/node
/hw/module/7/slot/n3/node
/hw/module/7/slot/n4/node
/hw/module/8/slot/n1/node
/hw/module/8/slot/n2/node
/hw/module/8/slot/n3/node
/hw/module/8/slot/n4/node

This is an Origin2000 system with four modules, 16 nodes, and 32 CPUs.

You can refer to entries in the hardware graph in order to name resources that you want your program to run near. This is done by adding a near clause to the memories statement in the placement file. For example, if InfiniteReality hardware is installed, it shows up in the hardware graph as follows:

% find /hw -name kona -print
/hw/module/1/slot/io5/xwidget/kona

In this case, it is in the fifth I/O slot of module 1. You may then indicate that you want your program run near this hardware device by using a line such as the following:

memories 3 in topology cluster near /hw/module/1/slot/io5/xwidget/kona

You can also use the hardware graph entries and the physical topology to run the program on specific nodes:

memories 3 in topology physical near /hw/module/2/slot/n1/node /hw/module/1/slot/n2/node /hw/module/3/slot/n3/node


This line causes the program to be run on node 1 of module 2, node 2 of module 1, and node 3 of module 3. In general, though, it is best to let the operating system pick which nodes to run on, because users’ jobs are likely to collide with each other if the physical topology is used.

Assigning Memory Ranges

You can use a placement file to place specific ranges of virtual memory on a particular node. This is done with a place statement. For example, a placement file can contain the lines:

place range 0xfffbffc000 to 0xfffc000000 on memory 0
place range 0xfffd2e4000 to 0xfffda68000 on memory 1

Generally, placing address ranges is practical only when used in conjunction with dprof (see “Applying dprof” on page 84). This utility profiles the memory accesses in a program. Given the -pout option, it writes place statements like those shown above to an output file. You can insert these address placement lines into a placement file for the program. Address ranges are likely to change after recompilation, so this use of a placement file is effective only for very fine tuning of an existing binary. Generally it is better to rely on memory placement policies, or source directives, to place memory.

Using the dplace Library for Dynamic Placement

In addition to the capabilities described above, dplace can also dynamically move data and threads during program execution. These tasks are accomplished with migrate and move statements (refer again to the dplace(5) reference page). These statements may be included in a placement file, but they make little sense there. They are meant to be issued at strategic points during the execution of a program. This is done via the subroutine interface described in the dplace(3) reference page.

The subroutine interface is simple; it includes only two functions:
• dplace_file() takes as its argument the name of a placement file. It opens, reads, and executes the file, performing multiple statements.
• dplace_line() takes a string containing a single dplace statement.

Example 8-30 contains some sample code showing how the dynamic specifications can be issued via the library calls.


Example 8-30 Calling dplace Dynamically from Fortran

      CHARACTER*128 s
      np = mp_numthreads()
      WRITE(s,*) 'memories ',np,' in topology cluster'
      CALL dplace_line(s)
      WRITE(s,*) 'threads ',np
      CALL dplace_line(s)
      DO i=0, np-1
         WRITE(s,*) 'run thread',i,' on memory',i
         CALL dplace_line(s)
         head = %loc( a( 1+i*(n/np) ) )
         tail = %loc( a( (i+1)*(n/np) ) )
         WRITE(s,*) 'place range',head,' to',tail,' on memory',i
         CALL dplace_line(s)
      END DO
      DO i=0, np-1
         WRITE(s,*) 'move thread',i,' to memory',np-1-i
         CALL dplace_line(s)
      END DO
      DO i=0, np-1
         head = %loc( a( 1+i*(n/np) ) )
         tail = %loc( a( (i+1)*(n/np) ) )
         WRITE(s,*) 'migrate range',head,' to',tail,' to memory',np-1-i
         CALL dplace_line(s)
      END DO

The library is linked in with the flag -ldplace. The example code calls dplace_line() multiple times to execute the following statements:
• memories and threads statements whose counts are taken from the mp_numthreads function of the MP library (see the mp(3F) reference page). Note that this code requests a memory (node) per process, which is usually unnecessary.
• A run statement for each thread, placing it on the memory of the same number.
• A series of move statements to reverse the order of threads on memories.
• A series of migrate statements to migrate blocks of array a to the memories of different threads.


Using dplace with MPI 3.1

The current version of the MPI library (3.1 at this writing) is aware of, and compatible with, the MP library. As a result, you rarely need to use dplace with MPI programs. However, in the rare cases when you do want this combination, the key to success is to apply the mpirun command to dplace, not the reverse. In addition, you must set an environment variable, MPI_DSM_OFF, as in the following two-line example:

setenv MPI_DSM_OFF
mpirun -np 4 dplace -place pfile a.out

This example starts copies of dplace on four CPUs. Each of them, in turn, invokes the program a.out, applying the placement file pfile.

When using a tool such as dplace that produces output on the standard output or standard error streams, it can be difficult to capture the tool’s output. The correct approach, as shown in Example 8-31, is to write a small script to do the redirection, and to use mpirun to launch the script.

Example 8-31 Using a Script to Capture Redirected Output from an MPI Job

> cat myscript
#!/bin/csh
setenv MPI_DSM_OFF
dplace -verbose a.out 2> outfile
> mpirun -np 4 myscript
hello world from process 0
hello world from process 1
hello world from process 2
hello world from process 3
> cat outfile
there are now 1 threads
Setting up policies and initial thread.
Migration is off.
Data placement policy is PlacementDefault.
Creating data PM.
Data pagesize is 16k.
Setting data PM.
Creating stack PM.
Stack pagesize is 16k.
Stack placement policy is PlacementDefault.
Setting stack PM.
there are now 2 threads
there are now 3 threads
there are now 4 threads
there are now 5 threads


Advanced Options

The MP library and dplace together provide two advanced features that can be of use for realtime programs:
• Running just one process per node
• Locking processes to CPUs

Neither of these features is useful in the typical multi-user environment in which you must share the computer with other users, but they can be crucial for getting minimum response time from a dedicated system.

The first feature can be of use to programs which have high memory bandwidth requirements. Since one CPU can use more than half of a node’s bandwidth, running two threads on a node can overload the bandwidth of the hub. Spreading the threads of a job across more nodes can result in higher sustained bandwidth. Of course, you are then limited to half the available CPUs. However, you could run a memory-intensive job with one thread per memory, and concurrently run a cache-friendly (and hence, low-bandwidth) job on the remaining CPUs.

Using the MP library, you can run just one thread per node by setting the environment variable _DSM_PPM to 1. Using dplace, write a placement file specifying the same number of memories as threads. In the distribute statement, you can explicitly specify a block size of 1, but this is the default when the number of threads is the same as the number of memories:

memories $NP in topology cluster
threads $NP
distribute threads across memories block 1

The second feature, locking threads to CPUs, has long been available in IRIX via the sysmp(MP_MUSTRUN,...) system call. The MP library and dplace offer convenient access to this capability. For the MP library, you only need to set the _DSM_MUSTRUN environment variable to lock all threads onto the CPUs where they start execution.

Using dplace, use the -mustrun flag, as follows:

% dplace -mustrun -place placement_file program

Both the MP library and dplace lock the processes onto the CPUs on which they start. This arrangement will change from run to run. If you want to lock threads onto specific physical CPUs, you must use a placement file specifying a physical topology, or call sysmp() explicitly.


Summary

Making a C or Fortran program run in parallel on multiple CPUs can be as simple as recompiling with the -apo option, and executing on a system with two or more CPUs. Alternatively, you can write the program for parallel execution using a variety of software models. In every case, the benefit of adding one more CPU depends on the fraction of the code that can be executed in parallel, and on the hardware constraints that can keep the added CPU from pulling its share of the load. You use the profiling tools to locate these bottlenecks, and generally you can relieve them in simple ways, so that the program’s performance scales smoothly from one CPU to as many CPUs as you have.

Appendix A: Bentley’s Rules Updated

Jon Bentley, a distinguished computer scientist and the author of many well-known publications, wrote Writing Efficient Programs. In this classic (now out of print), Bentley provided a unified, pragmatic treatment of program efficiency, independent of language and host platform.

For ease of presentation, Bentley codified his methods as a set of terse rules—a set of rules that are still useful reminders. This appendix is the list of Bentley’s rules, paraphrased from Appendix C of Writing Efficient Programs and revised to reflect the needs and opportunities of the SN0 architecture, Silicon Graphics compilers, and the MIPS instruction set. In some cases the rules are only stated as a reminder that the compiler now performs these techniques automatically.

Space-for-Time Rules

Data Structure Augmentation

The time required for common operations on data can often be reduced by augmenting the structure with additional information, or by changing the information within the structure so it can be accessed more easily.

Examples:
• Add a reference counter to facilitate garbage collection.
• Incorporate a “hint” containing a likely, but possibly incorrect, answer.

When the added data can help you avoid a disk access or a memory reference to a different data structure, it is worthwhile. However, when the added hint can at best avoid some inline processing, keep in mind the 100:1 ratio between instruction time and memory-fetch time. It is far more important to minimize secondary cache misses than to minimize instruction counts. If the simple data structure fits in a cache line and the augmented one does not, the time to access the augmented data can be greater than the instructions it saves.


Special applications of this rule include:
• Reorganize data structures to make sure that the most-used fields fit in a single, 128-byte cache line, and less-used fields fall into separate cache lines. (This of course presumes that each structure is 128-byte aligned in memory.) In one example cited in a paper given at Supercomputing ’96, most of the cache misses of one program were caused by “a search loop that accesses a four-byte integer field in each element of an array of structures...By storing the integer items in a separate, shadow array, the secondary cache miss rate of the entire application is reduced.”
• Reorganize structures so that variables that are updated by different threads are located in different cache lines. The alternative—when variables updated by different threads fall in the same cache line—creates false sharing, in which unrelated memory updates force many CPUs to refresh their cache lines.
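Both applications can be sketched in C as follows. This is an invented illustration, not the code from the paper cited above; the structure and field names are placeholders, and 128 bytes is the SN0 secondary cache line size.

#define CACHE_LINE 128                 /* SN0 secondary cache line size */

/* The hot search key is kept in a separate, densely packed shadow array,
   so a search touches only one four-byte field per element. */
typedef struct record_s {
    double payload[12];                /* bulky, less-used fields */
} record_t;

typedef struct table_s {
    int      *key;                     /* shadow array of the hot field */
    record_t *rec;                     /* corresponding bulky records */
} table_t;

/* Per-thread counters are padded so that no two threads share a cache
   line, which avoids false sharing. */
typedef struct counter_s {
    long count;
    char pad[CACHE_LINE - sizeof(long)];
} counter_t;

counter_t per_thread[16];              /* one padded counter per thread */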

Store Precomputed Results

The cost of recomputing an expensive function can be reduced by computing the function only once and storing the results. Subsequent requests for the function are handled by table lookup.

Precomputation is almost certainly a win when the replaced computation involves disk access, or an IRIX system call, or even a substantial libc function. Precomputation is also a win when it reduces the amount of data the program must reference.

However, keep in mind that if retrieval of the precomputed value leads to an extra cache miss, the replaced computation must have cost at least 100 instructions to break even.

Caching

Data that is accessed most often should be the cheapest to access. However, caching can backfire when locality is not in fact present. Examples:
• The move-to-front heuristic for sequential list management is the classic example.
• In implementing a dictionary, keep the most common words in memory.
• Database systems cache the most-recently used tuples as a matter of course.


The MIPS R10000, SN0 node, and IRIX kernel already cache at a number of levels:
• Primary (on-chip) caches for instructions and data.
• Secondary (on-board) cache for recently used memory lines of 128 bytes.
• Virtual memory system for most-recently used pages of 16 KB and up.
• Filesystems (XFS and NFS independently) for recently used disk sectors.

(For more detail, see “Understanding the Levels of the Memory Hierarchy” on page 135.) However, all the standard caches are shared by multiple processes. From the standpoint of your program, the other processes are “diluting” the cache with their data, displacing your data. Also, system cache management knows nothing about your algorithms. Thus, application-dependent caching can yield huge payoffs. For a practical example, see “Grouping Data Used at the Same Time” on page 139.

Lazy Evaluation

Never evaluate an item until it is needed. Examples:
• In building a table of Fibonacci numbers (or any other function), populate the table with only the numbers that are actually requested, as they are requested.
• Brian Kernighan reduced the run time of [the original troff] by twenty percent by calculating the width of the current line when needed, rather than after each character.
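A lazily filled table of Fibonacci numbers might be sketched as follows (an illustration only; the table bound is an arbitrary assumption):

#include <stdint.h>

#define MAXFIB 90                       /* assumed upper bound on the index */

static uint64_t fib_table[MAXFIB + 1];  /* 0 means "not yet computed" */

uint64_t fib(int n)
{
    if (n <= 1)
        return (uint64_t) n;
    if (fib_table[n] == 0)              /* lazy evaluation: fill on first use */
        fib_table[n] = fib(n - 1) + fib(n - 2);
    return fib_table[n];
}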

Time-for-Space Rules

Packing

Dense storage representations can decrease storage costs by increasing the time to store and retrieve data. Typical examples include packing multiple binary integers into 32- or 64-bit words.

The memory available in a typical SN0 server is large enough that you might think such bit-squeezing techniques ought to be relics of the past. However, is file access time a significant factor in your program’s speed? (Use par to find out.) If the file format can be squeezed 50%, sequential access time is cut by that much, and direct access time is reduced due to shorter seek distances. Also, if an important array does not fit into cache, and by packing a data structure you can get all of it into cache, the extra instruction cycles needed to unpack each item are repaid many times over.
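As a simple illustration of packing (not taken from Bentley’s text), four 16-bit values can be packed into, and extracted from, one 64-bit word with shifts and masks:

#include <stdint.h>

/* Pack four 16-bit values into one 64-bit word. */
static uint64_t pack4(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
    return  (uint64_t) a
         | ((uint64_t) b << 16)
         | ((uint64_t) c << 32)
         | ((uint64_t) d << 48);
}

/* Extract the value in slot 0..3 of a packed word. */
static uint16_t unpack(uint64_t w, int slot)
{
    return (uint16_t) (w >> (16 * slot));
}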


Interpreters

The space required to represent a program can often be decreased by the use of interpreters in which common sequences of operations are represented compactly. A typical example is the use of a finite-state machine to encode a complex protocol or lexical format into a small table.

Finite-state machines, decision tables, and token interpreters can all yield elegant, maintainable algorithm designs. However, keep in mind that the coding techniques often used to implement an interpreter, in particular computed gotos and vector tables, tend to defeat the compiler’s code manipulations. Don’t expect the compiler to optimize or parallelize the interpreter’s central loop.

Loop Rules

Many of these loop-transformation rules have been incorporated into the current compilers. This makes it easier to apply them in controlled ways using compiler options.

Code Motion Out of Loops

An invariant expression (one whose value does not depend on the loop variable) should be calculated once, outside the loop, rather than iteratively. But keep in mind:
1. The compiler is good at recognizing invariant expressions that are evident in the code. Place expressions where it is most natural to read or write them, and let the compiler move them for you. For example, in a loop such as

   for (i=0; i<n; i++) a[i] = b[i] + x*y;

   the invariant product x*y is computed once, before the loop, by the compiler.


Combining Tests

An efficient inner loop should contain as few tests as possible, and preferably only one. Try to combine exit conditions. Examples:
• Add a sentinel value to the last item of an unsorted vector, so as to avoid a separate test of the index variable.
• As a special case, use a sentinel byte value to signal the end of a string of bytes, avoiding the test for last-byte-index.
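The sentinel technique for an unsorted vector can be sketched as follows (an invented illustration; the caller must reserve one extra element at the end of the vector):

/* Search an unsorted vector v[0..n-1] for key, using v[n] as a sentinel
   so the inner loop needs only one test. Returns the index, or n if the
   key is not present. */
int find_with_sentinel(int v[], int n, int key)
{
    int i = 0;
    v[n] = key;                 /* sentinel guarantees the loop terminates */
    while (v[i] != key)         /* single combined test */
        i++;
    return i;
}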

These optimizations can be especially helpful with the R10000 CPU, which can pipeline a single conditional branch, executing speculatively along the most likely arm. However, two conditional branches in succession can interfere with the pipeline and cause the CPU to skip cycles until the branch condition values are available. Thus if you condition an inner loop on a single test, you are likely to get faster execution.

Loop Unrolling

A large cost of some short loops is in modifying loop indexes. That cost can often be reduced by unrolling the loop.

This optimization can bring large benefits, especially in a superscalar CPU like the R10000 CPU. However, loop unrolling done manually is an error-prone operation, and the resulting code can be difficult to maintain. Fortunately, current compilers have extensive support for loop unrolling. You can control the amount of unrolling either with a compiler option or loop by loop with directives. The compiler takes care of the details and the bookkeeping involved in handling end-of-loop, while the source code remains readable.

A benefit of compiler-controlled loop unrolling is that the compiler applies optimizations like common subexpression elimination, constant propagation, code motion out of loops, and function call inlining both before and after it unrolls the loop. The optimizations work together for a synergistic effect. The software-pipelining optimization (rearranging the order of the generated machine instructions in order to maximize use of the R10000 execution units) also becomes more effective when it has a longer loop body to work on.


There are still cases in which the compiler does not recognize the opportunity to unroll loops. One is the case where the code could take advantage of the ability of the SN0 node board to maintain two or more cache miss requests simultaneously. Consider the following loop:

      m = 1
      DO 24 k = 2, n
         IF( X(k).LT.X(m) ) m = k
 24   CONTINUE

This is a normal, stride-1 scan of an array. The compiler inserts prefetch instructions to overlap the fetch of one cache line with the processing of another. However, a substantial speed increase can be had by making the hardware keep two cache fetches going concurrently at all times. This is done by unrolling the loop one time, and accessing the halves of the array in two concurrent scans, as follows:

      do k=2,n/2
         if(x(k).lt.x(m)) m=k
         if(x(k+n/2).lt.x(m2)) m2=k
      enddo
      if(x(m2).lt.x(m)) m=m2

Transfer-Driven Loop Unrolling

When a large cost of an inner loop is devoted to trivial assignments (other than maintenance of loop indices), the assignments can be reduced by duplicating the loop body and renaming the variables in the second copy.

As with simple loop unrolling, this important optimization is well-handled by the compiler. Leave your code simple and readable and let the compiler do this.

Unconditional Branch Removal

A fast loop should have no unconditional branches. An unconditional branch at the end of a loop can be handled by “rotating” the loop to put a conditional branch at the bottom.

This optimization might be needed in assembly-language code, but no modern compiler should generate code for a loop with the test at the top and an unconditional branch at the bottom.


Loop Fusion

When two nearby loops operate on the same set of elements, combine their operational parts and use only one set of loop-control operations. Example: a pair of similar loops such as the following:

   for (i=0; i<n; i++) a[i] = a[i] + b[i];
   for (i=0; i<n; i++) c[i] = c[i] * b[i];

Such a pair can be fused into a single loop such as this:

   for (i=0; i<n; i++) {
      a[i] = a[i] + b[i];
      c[i] = c[i] * b[i];
   }

This technique can produce important benefits in the SN0 architecture because, first, the contents of array b are only passed through the cache once, and second, the instructions to load each element of b (from the second loop) are eliminated. However, it is not necessary for you to perform this optimization manually. The compilers perform loop fusion automatically at certain optimization levels. The compilers can also perform the following optimizations:
• Loop interchange—exchanging the induction variables of inner and outer nested loops so as to pass over memory consecutively.
• Cache blocking—breaking a single loop nest into a sequence of smaller nests that operate on blocks of data that each fit in secondary cache.
• Loop fission—when two inner loops are controlled by a single outer loop, duplicating the outer loop and creating two simple loop nests that can then be interchanged, blocked, or unrolled.

Each of these changes, if done manually, complicates the source code with many temporary variables and insanely complicated loop control. In general, write loop code in the simplest, most portable and maintainable fashion, and then let the compiler tangle it behind the scenes.


Logic Rules

Exploit Algebraic Identities

In a conditional expression, replace a costly expression with an algebraically equivalent expression that is cheaper to evaluate. Examples:
• Not (sqr(X)>0) but (X!=0). If this also allows you to avoid calculating sqr(X) unless it is needed, it is an instance of Lazy Evaluation.
• Not (not(A) .and. not(B)) but not(A .or. B) (DeMorgan’s Law).

When such replacements require insight, they are worth doing. Do not spend time replacing simple manifest expressions with constants; the compiler is good at that, and good at recognizing common subexpressions.

Short-Circuit Monotone Functions

When a monotone function of several variables is to be tested for a threshold value, break off the test as soon as the result is known. Short-circuit evaluation of a Boolean expression is a well-known case (handled automatically by the compiler). Bentley gave a more interesting example. Example A-1 shows a function to find the point in a list that is nearest to a given point.

Example A-1 Naive Function to Find Nearest Point

typedef struct point_s { double X, Y; } point_t;
#include <math.h>
int nearest_XY(int n, point_t given, point_t points[])
{
   int j, close_n = -1;
   double thisDist, closeDist = MAXFLOAT;
   for (j=0; j<n; ++j)
   {
      thisDist = (points[j].X - given.X)*(points[j].X - given.X)
               + (points[j].Y - given.Y)*(points[j].Y - given.Y);
      if (thisDist < closeDist)
      {
         closeDist = thisDist;
         close_n = j;
      }
   }
   return close_n;
}


Example A-1 as it stands short-circuits the distance computation (for purposes of comparing two distances, it is sufficient to compare the sum of squares and skip doing the square root). However, in many cases the X-distance alone is enough to eliminate a point, which avoids the need for the second multiply and an addition. Example A-2 applies this insight.

Example A-2 Nearest-Point Function with Short-Circuit Test

int nearest_XZ(int n, point_t given, point_t points[])
{
   int j, close_n = -1;
   double thisDist, closeDist = MAXFLOAT;
   for (j=0; j<n; ++j)
   {
      thisDist = (points[j].X - given.X)*(points[j].X - given.X);
      if (thisDist < closeDist)   /* the X-distance alone can eliminate the point */
      {
         thisDist += (points[j].Y - given.Y)*(points[j].Y - given.Y);
         if (thisDist < closeDist)
         {
            closeDist = thisDist;
            close_n = j;
         }
      }
   }
   return close_n;
}

Reorder Tests

Logical tests should be arranged such that cheap tests precede expensive ones, and so that ones most likely to succeed precede ones more likely to fail.

In the absence of other information, the Silicon Graphics compilers assume that the then-leg of an if is more likely than the else, so it is good practice to put the most likely code in the then-leg. You can pass exact instruction counts in a compiler feedback file (see “Passing a Feedback File” on page 124). The compiler takes branch execution counts from the feedback file, and creates an instruction schedule that reflects the exact probabilities of each branch.

This is a convenient feature that can have good results on a poorly written program. However, the compiler does not take into account the cost of a test that involves a function call, or a test that can touch an unused cache line or page. Your insight is still needed to order tests so as to avoid expensive operations when possible.


Precompute Logical Functions

A logical function over a small, finite domain can be replaced by a table lookup. Examples include using a table to classify characters into lexical classes, and implementing an interpreter using a branch table indexed by instruction code.
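A character classifier of this kind might be sketched as follows (an invented illustration; the class names are placeholders):

/* Classify characters by a single table lookup rather than a chain of tests. */
enum char_class { CC_OTHER, CC_DIGIT, CC_LETTER, CC_SPACE };

static unsigned char class_of[256];     /* static data is zero (CC_OTHER) */

void init_classes(void)
{
    int c;
    for (c = '0'; c <= '9'; c++) class_of[c] = CC_DIGIT;
    for (c = 'a'; c <= 'z'; c++) class_of[c] = CC_LETTER;
    for (c = 'A'; c <= 'Z'; c++) class_of[c] = CC_LETTER;
    class_of[' '] = class_of['\t'] = class_of['\n'] = CC_SPACE;
}

/* In the lexer's inner loop: switch (class_of[(unsigned char) ch]) { ... } */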

When the table is small and frequently used, this technique is almost always helpful. However, when the table is sparse, or much larger than the executable code that it replaces, the effect of the table on the cache could swamp the benefits of its use.

Replace Control Variables with Control Statements

Assignment to a Boolean variable and its later test can be replaced by an if statement on the Boolean expression. Generalizing, assignment to any control variable over a small, finite domain can be replaced by a case statement on the value that would have been assigned.

Application of this rule saves three things: declaration of a variable to hold a control value; assignment into the variable; and fetching from the variable when the value is tested. The rule says that instead, you should modify the code so the control expression is used to direct execution as soon as its value is known. In some cases you have to duplicate code in order to apply the rule.

The compilers are good at reusing the stack space of local variables after their last use, and also good at saving calculated values in machine registers between their assignment and a nearby use. When the sole use of a local variable is to hold a value between its expression and a following test, the compiler (at -O2 and higher) is quite capable of eliminating the variable and using only a register temp.

Hence the only time you should distort your program’s logic to apply this rule is when the assignment is separated from its use by many statements, or by a non-inlined procedure call. (A procedure call usually makes the compiler spill register values to memory.)


Procedure Design Rules

Collapse Procedure Hierarchies

The run time of a set of procedures that (nonrecursively) call themselves can often be reduced by rewriting the procedures inline and binding what were passed arguments.

This technique is extremely useful, but difficult to apply to an existing program and yet leave the program readable and maintainable. Fortunately, current compilers contain extensive support for interprocedural analysis (IPA) and automatic inlining. Given that, it is best to write the program in a simple, well-structured style, not fretting about frequent calls to small procedures, and to let the compiler inline the small procedures.

Although inlining eliminates the instruction overhead of a function call, it inflates the size of the object code. At a certain level of inlining, the effect of the inflated, low-density code on cache and virtual memory wastes as much time as it saves.

Keep in mind the maintenance implications of inlining procedures in code that you distribute to other users. A library procedure that has been inlined even once can never be field-replaced. You can send out a revised version of the library, but programs that inlined the function will never call the library version.

Exploit Common Cases

A procedure should be organized to handle the most common cases with the greatest efficiency, and all remaining cases correctly (if more slowly). When designing the application, attempt to arrange the input conditions so that the efficient cases are in fact the most common.

The Caching rule (“Caching” on page 258) can be seen as an extension of this rule, in which “the commonest case” is “whatever case was most recently requested.” Bentley cites a Julian-date conversion function in which it was observed that 90% of all calls presented the same date as the previous call. Saving the last result, and returning it without recomputing when appropriate, saved time.


One way to apply the rule is to code a function so it tests for the common cases and returns their results quickly, falling through to the more involved code of the hard cases. However, a function that contains the code to handle all cases is likely to be a bulky one to inline. In order to exploit common cases and also support function inlining, write two functions. One, a public function, is short, and efficiently handles only the most common cases. Whenever it is called with a case it does not handle, it calls a longer, private function to deal with all remaining cases. The shorter function can be inlined (at multiple places, if necessary) without greatly inflating the code. The longer should not be inlined.
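The two-function pattern, combined with the caching of the previous result mentioned above, might be sketched as follows (an invented illustration; the function names and the choice of common case are placeholders):

/* Private, out-of-line routine that handles all remaining cases correctly. */
static double convert_rare(long code)
{
    return (double) code * 0.5;        /* placeholder for the bulky general case */
}

static long   last_code;               /* cache of the previous request */
static double last_result;
static int    have_last = 0;

/* Public, short, inlinable fast path: handles the common case (a repeat
   of the previous request) directly, and delegates everything else. */
double convert(long code)
{
    if (have_last && code == last_code)
        return last_result;
    last_code   = code;
    last_result = convert_rare(code);
    have_last   = 1;
    return last_result;
}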

Use Coroutines

A multiphase algorithm in which the phases are linked by temporary files (or arrays) can be reduced to a single-pass algorithm using coroutines.

Coroutines are defined and described in detail in Knuth (Volume I) and most other modern books on algorithmics. Under IRIX, you can write coroutines in a natural way using any one of three models:
• The UNIX model of forked processes that communicate using pipes.
• The POSIX threads model using POSIX message queues to communicate.
• The MPI (or PVM) model of cooperating processes exchanging messages.

Transform Recursive Procedures

A simple, readable solution expressed as a recursion can be transformed into a more efficient iterative solution. You can:
• Code an explicit recursion using a stack variable.
• When the final action of a procedure is to call itself (“tail recursion”), replace the call with a goto to the top of the procedure. (Then rewrite the iteration as a loop.)
• Cut off the recursion early for small arguments (for example, using a table lookup), instead of recurring down to the case of size 0.

The compilers recognize simple cases of tail recursion and compile a goto to the top of the function, avoiding stack use. More complicated recursion defeats most of the compiler’s ability to modify the source: recursive procedures are not inlined; a loop containing a recursion is not unrolled; and a loop with a recursion is not executed in parallel. Hence there is good reason to remove recursion, after it has served its purpose as a design tool.
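For example, a tail-recursive list search and its iterative equivalent might be sketched as follows (an invented illustration):

typedef struct node_s {
    int key;
    struct node_s *next;
} node_t;

/* Tail-recursive form: the recursive call is the final action. */
node_t *find_rec(node_t *p, int key)
{
    if (p == 0 || p->key == key)
        return p;
    return find_rec(p->next, key);     /* tail call */
}

/* Equivalent iterative form, after removing the recursion. */
node_t *find_iter(node_t *p, int key)
{
    while (p != 0 && p->key != key)
        p = p->next;
    return p;
}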


Use Parallelism

Structure the program to exploit the parallelism available in the underlying hardware.

This rule is certainly a no-brainer for the SN0 systems; parallel execution is what they do best. However, you can apply parallelism using a variety of programming models and at a wide range of scales from small to large. Two examples among many:
• Using the POSIX-compliant asynchronous I/O library, you can schedule one or an entire list of file accesses to be performed in a parallel thread, with all the file-synchronous waits being taken by a different process.
• The native integer in the MIPS IV ISA has 64 bits (under either the -n32 or -64 compiler option). You can pack many smaller units into long long (in Fortran, INTEGER*8) variables and operate on them in groups.

Parallel algorithm design is a very active area in computer science research. Execution of the query title:parallel & algorithm* & library on the Alta Vista search engine (as a Boolean expression, in the Advanced Search) produced numerous hits.

Expression Rules

Initialize Data Before Runtime

As many variables as possible (including arrays and tables) should be initialized before program execution.

This rule encompasses the use of compile-time initialization of variables (in C, by use of initializing expressions in declarations; in Fortran, by use of the PARAMETER statement) and also the preparation of large arrays and tables.

Compile-time initialization with manifest constants allows the compiler greater scope. The compiler precalculates subexpressions that have only constant arguments. When IPA is used, the compiler recognizes when a procedure argument is always a given constant, and propagates the constant into the procedure body. When functions are inlined, constants are propagated into the inlined code, and this produces still more constant expressions for the compiler to eliminate.
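For example, a table that a program would otherwise fill at startup can be written as a compile-time initializer, so it is laid down in the object file and costs nothing at run time (a trivial sketch; the contents are arbitrary):

/* Built by the compiler; no startup code runs to fill it. */
static const double power_of_ten[8] = {
    1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7
};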


Some programs spend significant time initializing large arrays, tables, and other data structures. There are two common cases: clearing to zero, and processing initial data from one or more files.

Initializing to Zero

A for-loop that does nothing but store zero into an array is a waste of CPU cycles. Worse, in IRIX on SN0 systems, all memory is by default allocated to the node from which it is first touched, which means that the zero array will be allocated in the node where the for-loop runs. So at the least, make sure that each parallel thread of the program zeros the array elements that it uses, and no others, so that memory is allocated where it is used.

In a C program, it is much more efficient to use calloc (see the calloc(3) reference page) to allocate memory filled with zero. The page frames are defined, but the actual pages are not created and not filled at this time. Instead, zero-filled pages are created as the program faults them in. This ensures that no time is spent clearing pages that are never used, and that pages are allocated to the node that actually uses them.
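A minimal sketch of the two approaches (error checking omitted):

#include <stdlib.h>

#define N (1024*1024)

/* Wasteful: stores every element, so every page is created and
 * placed in the node that runs this loop. */
double *zero_array_by_loop(void)
{
    double *a = (double *) malloc(N * sizeof(double));
    size_t i;
    for (i = 0; i < N; i++)
        a[i] = 0.0;
    return a;
}

/* Better: pages are created zero-filled only when first faulted,
 * in the node of the CPU that touches them. */
double *zero_array_by_calloc(void)
{
    return (double *) calloc(N, sizeof(double));
}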

A similar alternative is to use mmap (see the mmap(2) reference page) to map an extent of the pseudo-device /dev/zero. The mapped extent is added to the program’s address space, but again, zero-filled pages are created and allocated as they are faulted.
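A sketch of the /dev/zero mapping (the function name is illustrative and error checking is abbreviated):

#include <sys/types.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

/* Map len bytes of zero-filled memory; pages are created and zeroed
 * only as the program faults them in. */
void *map_zero(size_t len)
{
    int fd = open("/dev/zero", O_RDWR);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping remains valid after close */
    return (p == MAP_FAILED) ? NULL : p;
}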

In a Fortran program, it is somewhat more efficient to call the C library function bzero() than to code a DO loop storing zero throughout an array. However, this does touch the cleared pages.

Initializing from Files

The time to read and process file data is essentially wasted because it does not contribute to the solution. Bentley recommended writing a separate program that reads the input data and processes it into some intermediate form that can be read and stored more quickly. For example, if the initializing data is in ASCII form, a side program can preprocess it to binary. When the side program uses the same data structures and array dimensions as the main program, the main program can read the processed file directly into the array space.


IRIX permits you to extend this idea. You can use mmap (see the mmap(2) reference page) to map any file into memory. You can take all the code that reads files and initializes memory arrays and structures, and isolate it in a separate program. This program maps a large block of memory to a new disk file. It reads the input file data, processes it in any way required, and builds a complete set of data structures and arrays of binary data in the mapped memory. Then the program unmaps the initialized memory (or simply terminates). This writes an image of the prepared data into the file.

When the main program starts up, it maps the initialized file into memory, and its initialization is instantly complete. Pages of memory are faulted in from the disk file as the program refers to them, complete with the binary content as it was prepared by the initializing program. This approach is especially handy for programs that need large in-memory databases, like the scenery files used in a visual simulator.
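A sketch of the main program's side of this scheme (the file layout, structure, and error handling shown here are simplified and hypothetical):

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Layout agreed on with the separate preparation program. */
struct database {
    long   nrecords;
    double values[1];           /* really nrecords entries */
};

struct database *load_prepared(const char *path)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0)
        return NULL;
    /* Initialization is complete at once; pages fault in from the
     * file only as the program refers to them. */
    p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    return (p == MAP_FAILED) ? NULL : (struct database *) p;
}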

Exploit Algebraic Identities

Replace the evaluation of a costly expression by one that is algebraically equivalent but less costly. For example: not ln(A)+ln(B) but ln(A*B) (consider that A*B might overflow).

Bentley mentioned that the frequent need to test whether an integer J is contained in the inclusive range (I,K), naively coded ((I<=J)&&(J<=K)), can be implemented as a single unsigned comparison, ((unsigned)(J-I) <= (unsigned)(K-I)). The value (K-I) can be precomputed. (When I and K are constants or are propagated from constant expressions, the compiler precomputes it automatically.)
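The comparison must be done on unsigned values so that a J below the range wraps around to a large number; a minimal sketch:

/* Inclusive range test: is j in [i, k]?  The width (k - i) can be
 * precomputed when i and k do not change. */
int in_range(int j, int i, int k)
{
    return (unsigned)(j - i) <= (unsigned)(k - i);
}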

When you compile with the IEEE_arithmetic=3 and roundoff=3 options, the compilers automatically perform a number of mathematically valid transformations to make arithmetic faster.

Algebraic identities are at the heart of strength reduction in loops. The compiler can recognize most opportunities for strength reduction in an expression that includes the loop induction variable as an operand. However, strength reduction is a general principle, and the compiler cannot recognize every possibility. Examine every expression in a loop to see whether it could be calculated more cheaply by an incremental update of the value computed in the previous iteration of the loop.
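For example, a multiplication by the induction variable can usually be replaced by an addition carried from iteration to iteration (a minimal sketch; the floating-point results can differ slightly because the rounding differs):

void fill_axis(double *x, int n, double dx)
{
    int i;
    double xi = 0.0;
    for (i = 0; i < n; i++) {
        x[i] = xi;          /* instead of x[i] = i * dx */
        xi += dx;           /* one add per iteration */
    }
}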


Eliminate Common Subexpressions

When the same expression is evaluated twice with no assignments to its variable operands, save the first evaluation and reuse it.

This rule, application of which is almost second nature to experienced programmers, is in fact applied well by the compilers. The compilers look for common subexpressions both before and after constant propagation and inlining. Often they find subexpressions that are not superficially apparent in the source code. The expression value is held for its second use in a machine register if possible, or the compiler may create a temporary stack variable to hold it.

Combine Paired Computation

When similar expressions are evaluated together, consider making a new function that combines the evaluation of both.

Bentley cites Knuth as reporting that both sine and cosine of an angle can be computed as a pair for about 1.5 times the cost of computing one of them alone. The maximum and minimum values in a vector can be found in one pass for about 1.5 times the cost of finding either. (The current compilers can perform this particular optimization automatically. See the opt(5) reference page, the cis= parameter.)

When it is not practical for the function to return both values, it can at least cache the paired answer (see “Caching” on page 258) to be returned on request. In the case of sine and cosine, each of the functions can test to see if its argument is the same as the previous angle passed to either function. When it is not, call the function that calculates and caches both sine and cosine. Return the requested result.
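A sketch of such a caching wrapper for the sine/cosine pair (the wrapper names are hypothetical, and the static cache is not thread-safe):

#include <math.h>

static double last_a = 0.0, last_s = 0.0, last_c = 1.0;

static void update_pair(double a)
{
    if (a != last_a) {
        last_s = sin(a);    /* a true pair routine would compute  */
        last_c = cos(a);    /* both together; see opt(5), cis=    */
        last_a = a;
    }
}

double my_sin(double a) { update_pair(a); return last_s; }
double my_cos(double a) { update_pair(a); return last_c; }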

In this regard, note that the compilers do recognize the dual nature of integer division and remainder, so that the sequence {div=a/b;rem=a%b;} compiles to a single integer divide.

Exploit Word Parallelism

Use the full word width of the underlying computer architecture to evaluate binary values in parallel.

This is a special case of exploiting parallelism. The native integer in all current MIPS CPUs is 64 bits. The standard Fortran Boolean functions (IOR, IAND, and so on) handle INTEGER*8 (that is, 64-bit) values.
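For example, 64 single-bit flags can be packed into one integer and tested or combined in a single operation (a minimal C sketch):

/* Set flag n (0..63) in a packed word of 64 flags. */
unsigned long long set_flag(unsigned long long flags, int n)
{
    return flags | (1ULL << n);
}

/* True if any of the 64 flags is set: one compare, not a 64-pass loop. */
int any_flag(unsigned long long flags)
{
    return flags != 0;
}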

Appendix B: R10000 Counter Event Types

The MIPS R10000 CPU contains two 32-bit hardware counters, each of which can be assigned to count any one of 16 events. The counters can be used in two ways. First, they can be used to tabulate the frequency of events in a particular program, for example, counting all instructions, or counting floating-point instructions.

Second, the CPU can be conditioned to take a trap when one of the counters overflows. If the counter is preloaded to be just N short of an overflow, the CPU will trap when exactly N events have occurred. The ssrun command uses this feature to sample a program’s state at regular intervals; for example, every 32K graduated instructions (see “Sampling through Hardware Event Counters” on page 65).

The IRIX kernel extends the utility of the counters by giving each process its own set of virtual counter registers. When the kernel preempts a process, it saves its current counter registers, just as it saves other machine registers. When the kernel resumes execution of a process, it restores the counter values along with the rest of the machine registers. In this way, every process can accumulate its own counts accurately, even though it shares the CPU with the kernel and many processes. (See reference page r10k_counters(5).)

Table B-1 summarizes the types of events that can be counted by the R10000 CPU in either Counter 0 or Counter 1. The table is ordered by the event type number as used in the R10000 special register that controls event counting. The same event numbers are used by the perfex and ssrun commands (see “Profiling Tools” on page 53). A detailed discussion of the events follows the table.


Table B-1   R10000 Countable Events

Event   Counter 0 Event                              Event   Counter 1 Event
Number                                               Number

0       Cycles                                       16      Cycles
1       Instructions issued to functional units      17      Instructions graduated
2       Memory data access (load, prefetch,          18      Memory data loads graduated
        sync, cacheop) issued
3       Memory stores issued                         19      Memory data stores graduated
4       Store-conditionals issued                    20      Store-conditionals graduated
5       Store-conditionals failed                    21      Floating-point instructions graduated
6       Branches decoded                             22      Quadwords written back from L1 cache
7       Quadwords written back from L2 cache         23      TLB refill exceptions
8       Correctable ECC errors on L2 cache           24      Branches mispredicted
9       L1 cache misses (instruction)                25      L1 cache misses (data)
10      L2 cache misses (instruction)                26      L2 cache misses (data)
11      L2 cache way mispredicted (instruction)      27      L2 cache way mispredicted (data)
12      External intervention requests               28      External intervention request hits
                                                             in L2 cache
13      External invalidate requests                 29      External invalidate request hits
                                                             in L2 cache
14      Instructions done (in chip rev 2.x,          30      Stores, or prefetches with store hint,
        virtual coherence)                                   to CleanExclusive L2 cache blocks
15      Instructions graduated                       31      Stores, or prefetches with store hint,
                                                             to Shared L2 cache blocks


Counter Events In Detail

This section lists the events in related groups for clarity. There are a few bugs in the counting algorithms in early runs of the R10000 chip (through revisions 2.x), so counts can differ between systems even when all other factors are the same. Roughly speaking, R10000 revision 2.x chips are used in machines manufactured before 1997, and revision 3.2 in 1997 and later.

Clock Cycles

Either counter can be incremented on each CPU clock cycle (event 0 or event 16). This permits counting cycles along with any other event. Note that a 32-bit counter overflows after only 21 seconds at 200 MHz.

Instructions Issued and Done

Several counters record when instructions of different kinds are “issued,” that is, taken from the input queue and assigned to a functional unit for processing. Some instructions can be issued more than once before they are complete, and some can be issued and then discarded (speculative execution). As a result, issued instructions reflect the amount of the work the CPU does, but only graduated instructions (see “Graduated Instructions” following) reflect the effective work toward completion of the algorithm.

Issued Instructions (Event 1)

This counter is incremented on each cycle by the sum of these factors:
• Integer operations marked as “done” in the active list. Zero, 1, or 2 operations can be so marked on each cycle.
• Floating-point operations issued to an FPU. Zero, 1, or 2 can be issued per cycle. In early revisions of the chip, a conditionally or tentatively issued FP instruction can be counted as issued multiple times.
• Load and store instructions issued to the address-calculation unit on the previous cycle: 0 or 1 per cycle. In early chips, prefetch instructions are not counted as issued, and loads and stores are counted each time they are issued to the address-calculation unit, which can be multiple times per instruction.


A load/store instruction can be reissued if it does not complete its tag check cycle. One case occurs when the data cache tag array is busy for the required bank because of an external operation (a refill or an invalidate cycle), or if the CPU initiated a refill on the previous cycle to the same bank. This case usually generates just one extra “issue.”

Another case occurs when the Miss Handling Table is already busy with four operations and cannot accept another. In this case, the instruction is re-issued up to once every four cycles as long as the Miss Handling Table is full. If several instructions are waiting, a spurious issue could be counted every cycle.

Issued Loads (Event 2)

This counter is incremented when a load instruction was issued to the address-calc unit on the previous cycle. Unlike the combined count in Event 1, this counts each load instruction only once. Prefetches are counted with issued loads since revision 3.x. See the discussion of “Issued Versus Graduated Loads and Stores” on page 277.

Issued Stores (Event 3)

This counter is incremented when a store instruction was issued to the address-calc unit on the previous cycle. Store-conditional instructions are included in this count. Unlike the combined count in Event 1, this counts stores as issued only once. See also “Issued Store Conditionals (Event 4)” on page 284, and the discussion of “Issued Versus Graduated Loads and Stores” on page 277.

Instructions Done (Event 14)

Beginning with chip revision 3.x, this counter’s meaning is changed. It is incremented on the cycle after either ALU1, ALU2, FPU1, or FPU2 marks an instruction as “done.” Done is not the same as graduated, because an instruction that is done, while complete, can still be discarded if it is on a speculative branch line that was mispredicted.

This counter had a different meaning in early revisions of the chip, see “Virtual Coherency Conditions (Old Event 14)” on page 283.


Graduated Instructions

An instruction “graduates” when it is complete and can no longer be discarded. The R10000 CPU issues instructions whenever it can, and it executes instructions in any sequence it can (and sometimes executes instructions on speculation only to discard them). However, it graduates instructions in the sequence they were written, so that graduation means that an instruction has had its final effect on the visible state of registers and memory. This predictability makes graduated instructions a more reliable, repeatable count of the instructions executed by a program than the issued-instructions counts. An ideal profile run, which counts executed instructions by software, should agree with the count of graduated instructions for the same program and input. (However, revision 2.x chips can slightly undercount graduated instructions, as compared to an ideal profile run.)

Issued Versus Graduated Loads and Stores

Load and store instructions (and prefetch, load conditional, and store conditional) all require address translation and possibly cache operations. These instructions can be issued, delayed, retracted, and reissued before they are finished and graduate. For this reason, issued loads and stores always outnumber graduated loads and stores. However, the difference between a load or store issued and one graduated can provide valuable insight into performance problems.

Specifically, after a load or store is issued it can often be killed due to contention for some on-chip or off-chip resource, such as a tag-bank read-port. In that case the instruction is reissued, and so counted twice by the issued counter but only once by the graduated counter.

Normally reissues are rare, so issued and graduated should correspond reasonably well. But sometimes resource contention can become a performance bottleneck, and in that case the number of issued load/stores can soar. If you see the number of issued loads or stores greatly exceeding the number graduated, look at the count of mispredicted branches. If it remains low, you can guess that the load/store pipeline is having some kind of resource contention, causing those instructions to be issued repeatedly. The R10000 CPU can issue at most one load or store per cycle, so excessive reissues can seriously degrade performance.


Graduated Instructions (Event 15)

This counter is incremented by the number of instructions that were graduated on the previous cycle. Integer multiply and divide instructions are counted as two instructions.

Graduated Instructions (Event 17)

Same as Event 15. Supporting this count in either counter register permits counting graduated instructions along with any other value.

Graduated Loads (Event 18)

This counter is incremented by the number of loads that graduated on the previous cycle. Prefetch instructions are included in this count. Up to four loads can graduate in one cycle. (Through revision 2.x, when a store graduates on a given cycle, loads that graduate on that cycle do not increment this counter.)

Graduated Stores (Event 19)

This counter is incremented on the cycle after a store graduates. At most one store can graduate per cycle. Prefetch-exclusive instructions (which indicate an intent to modify memory) are counted here, as are store conditional instructions (see “Graduated Store Conditionals (Event 20)” on page 285).

Graduated Floating Point Instructions (Event 21)

This counter is incremented by the number of floating point instructions that graduated on the previous cycle. There can be 0 to 4 such instructions. A multiply-add counts as just one floating-point instruction.

Branching Instructions

The R10000 CPU predicts whether any branch will be taken before the branch is issued. Execution continues along the predicted line. The instructions executed speculatively can be done (incrementing event 14; see “Instructions Done (Event 14)” on page 276), but if the prediction proves to be false, they are discarded and do not increment event 15 (“Graduated Instructions (Event 15)” on page 278).


Decoded Branches (Event 6)

In old chips (revision 2.x), this counter is incremented when any branch instruction is decoded. This includes both conditional and unconditional branches. Although conditional branches have been predicted at this point, they may still be discarded due to an exception or a prior mispredicted branch.

Starting with revision 3.x, this counter is incremented when a conditional branch is determined to have been correctly or incorrectly predicted. This occurs once for every branch that the program actually executes. The count in current chips does not include unconditional branches, and it does not include branches that are decoded on speculation and then discarded. The current definition of the counter enables meaningful comparison with event 24 (“Mispredicted Branches (Event 24)” on page 279).

The determination of prediction correctness is known as the branch being “resolved.” Some branches depend on conditions in the floating-point unit. Multiple floating-point branches can be resolved in a single cycle, but this counter can be incremented by only 1 in any cycle. As a result, when multiple FP conditional branches are resolved in a single cycle, this count will be incorrect (low). This is a relatively rare event.

Mispredicted Branches (Event 24)

This counter is incremented on the cycle after a branch is restored because it was mispredicted.

The R10000 CPU should predict a high percentage of branches correctly. The compilers order branches based on heuristic evaluation of their likelihood of being taken, or on the basis of a feedback file created from a trace. If the count of event 24 is more than a few percent of event 6 (“Decoded Branches (Event 6)” on page 279), something is wrong with the compiler options or something is unusual about the algorithm. You can use prof to generate a feedback file containing actual branch frequency, and the compiler can use such a feedback file to order the machine instructions to minimize mispredictions. (See “Passing a Feedback File” on page 124.)

Primary Cache Use

The R10000 primary cache consists of separate instruction cache (i-cache) and data cache (d-cache) contained on the CPU chip. The primary caches together are called the level-one (L1) cache. This small area (2 × 32KB) is organized as 16-byte units called quadwords. Instructions and data are fetched and stored between the primary cache and the secondary cache in quadwords. Several counters document L1 cache use.


Primary Instruction Cache Misses (Event 9)

This counter is incremented on the cycle after a request to refill a line of the primary instruction cache is entered into the Miss Handling Table. The count indicates that a needed instruction word was not in the primary cache. The relationship between this counter and event 1 (“Issued Instructions (Event 1)” on page 275) indicates the effectiveness of the L1 i-cache.

Primary Data Cache Misses (Event 25)

This counter is incremented one cycle after a request to refill a line of the primary data cache is entered into the Miss Handling Table. The count indicates that a needed operand was not in the primary cache. The affected instruction is suspended and reissued when the needed quadword arrives.

Quadwords Written Back from Primary Data Cache (Event 22)

This counter is incremented on each cycle that a quadword of data is valid and being written from primary data-cache to secondary cache. Quadwords are written back only when they have been modified by a store.

Secondary Cache Use

The secondary cache, also called the level-two (L2) cache, is external to the R10000 chip. MIPS documentation for the CPU does not describe the L2 cache in detail because its design is not dictated by the CPU chip. The design of the L2 cache is part of the total system design. For example, the L2 cache of the POWER CHALLENGE 10000 system is very different from the L2 cache of the SN0 systems. The counters for L2 cache events are defined in CPU terms, not in terms of the cache design.

Quadwords Written Back from Scache (Event 7)

This counter is incremented on each cycle that the data for a quadword is written back from the secondary cache to the system-interface unit. In SGI systems, the L2 cache is organized as 128-byte lines. One cache line is 8 quadwords, so this counter is updated by 8 on each cache line writeback.


Correctable Scache Data Array ECC Errors (Event 8)

The interface between the CPU chip and the L2 cache includes lines for an ECC code. This counter is incremented on the cycle following the correction of a single-bit error in a quadword read from the secondary cache data array. A small number of single-bit errors is inevitable in high-density logic systems. However, any significant count in this counter is cause for concern.

Secondary Instruction Cache Misses (Event 10)

This counter is incremented the cycle after the fourth quadword of a cache line is written from memory into the secondary cache, when the cache refill was initially triggered by a primary instruction cache miss. Data misses are counted in “Secondary Data Cache Misses (Event 26).”

Instruction Misprediction from Scache Way Prediction Table (Event 11)

This counter is incremented when the secondary cache control begins to retry an access because it hit in the “way,” or bank, that was not predicted, and the event that initiated the access was an instruction fetch. Data mispredictions are counted in “Data Misprediction from Scache Way Prediction Table (Event 27).”

Secondary Data Cache Misses (Event 26)

This counter is incremented the cycle after the second quadword of a cache line is written from memory into the secondary cache, when the cache refill was initially triggered by a primary data cache miss. Instruction misses are counted in “Secondary Instruction Cache Misses (Event 10).”

Data Misprediction from Scache Way Prediction Table (Event 27)

This counter is incremented when the secondary cache control begins to retry an access because it hit in the “way,” or bank, that was not predicted, and the event that initiated the access was not an instruction fetch. Instruction mispredictions are counted in “Instruction Misprediction from Scache Way Prediction Table (Event 11).”


Store or Prefetch-Exclusive to Clean Block in Scache (Event 30)

This counter is incremented on the cycle after an update request is issued for a clean line in the secondary cache. An update request is the result of processing a store instruction or a prefetch instruction with the exclusive option. In the SN0, the update request from this CPU will cause the hub chip to update memory and possibly to initiate other cache coherency operations (see “Cache Coherency Events” on page 282).

Store or Prefetch-Exclusive to Shared Block in Scache (Event 31)

This counter is incremented on the cycle after an update request is issued for a shared line in the secondary cache. An update request is the result of processing a store instruction or a prefetch instruction with the exclusive option. In the SN0, the update request from this CPU will cause the hub chip to update memory and possibly to initiate other cache coherency operations (see “Cache Coherency Events” on page 282). For example, if other CPUs have a copy of the modified cache line, they will be sent invalidations (“External Invalidations (Event 13)” on page 283).

This event is the best indication of cache contention between CPUs. The CPU that accumulates a high count of event 31 is repeatedly modifying shared data.

Cache Coherency Events

In a cache-coherent multiprocessor, the system signals to the CPU when the CPU has to take action to maintain the coherency of the primary and secondary cache data. In the SN0 architecture, it is the hub chip that signals the CPU when some other CPU has invalidated memory (see “SN0 Node Board” on page 10 and “Understanding Directory-Based Coherency” on page 13).

From the standpoint of the CPU, an “intervention” is a signal stating that some other CPU in the system wants to use data from a cache line that this CPU has. (Other CPUs can tell from the directory bits stored in memory that this CPU has a copy of a cache line; see “Understanding Directory-Based Coherency” on page 13). The other CPU requests the status of the cache line, and requests a copy of the line if it is not the same as memory.

An “invalidation” is a notice that another CPU has modified memory data that this CPU has in its cache. This CPU needs to invalidate (that is, discard) its copy of the data.


External Interventions (Event 12)

This counter is incremented on the cycle after an intervention is received and the intervention is not an invalidate type.

External Invalidations (Event 13)

This counter is incremented on the cycle after an external invalidate is entered.

Virtual Coherency Conditions (Old Event 14)

This is an obsolete counter definition valid only through chip revision 2.x. Beginning with revision 3.x, counter 14 has a different meaning (see “Instructions Done (Event 14)” on page 276).

External Intervention Hits in Scache (Event 28)

This counter is incremented on the cycle after an external intervention is determined to have hit in the L2 cache, necessitating a response of status and possibly a copy of the cache line.

External Invalidation Hits in Scache (Event 29)

This counter is incremented on the cycle after an external invalidate request is determined to have hit in the L2 cache, necessitating an action.

This event is a good indicator of cache contention. The CPU that produces a high count of event 29 is being slowed because it is using shared data that is being updated by a different CPU. The CPU doing the updating generates event 31 (“Store or Prefetch-Exclusive to Shared Block in Scache (Event 31)” on page 282).


Virtual Memory Use

The CPU uses a table, the translation lookaside buffer (TLB), to map virtual addresses to physical addresses. The TLB points to a limited number of pages. When a virtual address cannot be found in the TLB, the CPU traps to a fixed location. Operating system code in real memory analyzes the miss against the page tables in memory that describe the process’s full address space. If the virtual address is valid, the trap code eventually loads one or more new entries into the TLB and resumes execution.

A TLB miss involves at least some in-memory processing. It can precipitate extensive processing to write pages out, allocate a page frame, and read a page from backing store. For this reason, the number of TLB misses, averaged per second of the program’s run (after initial startup transients), is an important performance metric.

TLB Misses (Event 23)

This counter is incremented on the cycle after the TLB miss handler is invoked.

Lock-Handling Instructions

The Load Linked instruction and Store Conditional instruction (LL/SC) are used to implement mutual exclusion objects such as locks and semaphores. A Load Linked instruction loads a word from memory and simultaneously tags its cache line in memory. The matching Store Conditional tests the target cache line: if it has not been updated since the Load Linked was executed, the Store Conditional succeeds and modifies memory, removing the tag from the cache line. If the target line has been modified, the Store Conditional fails. The pair of instructions can be used to implement various kinds of mutual exclusion.

LL/SC should never be significant in program execution time—when they are, it indicates some kind of contention or false-sharing problem involving mutual exclusion between asynchronous threads.

Issued Store Conditionals (Event 4)

This counter is incremented when a store-conditional instruction was issued to the address-calc unit on the previous cycle. By subtracting this count from event 3 (“Issued Stores (Event 3)” on page 276), you can isolate nonconditional (normal) store instructions. This counter cannot count a given instruction more than once.


Failed Store Conditionals (Event 5)

This counter is incremented when a store-conditional instruction fails; that is, when the target cache line had been modified between the completion of the Load Linked and the execution of the Store Conditional. A failed instruction also graduates and is counted in event 20 (“Graduated Store Conditionals (Event 20)” on page 285).

A small proportion of failed SC instructions is to be expected when asynchronous threads use mutual exclusion. However, anything more than a few percent of failures indicates a performance problem, either because the shared resource is overused (bad design) or because the target cache line is occupied by too many modifiable variables (false sharing).

Graduated Store Conditionals (Event 20)

This counter is incremented on the cycle following the graduation of any store-conditional instruction, including one counted in event 5 (“Failed Store Conditionals (Event 5)” on page 285).


Appendix C: Useful Scripts and Code

This appendix contains the following example programs, shell scripts, and awk scripts that are used to create some of the examples in this book:
• “Program adi2” on page 288 is an example Fortran program used to demonstrate problems in cache and TLB use.
• “Basic Makefile” on page 292 is the skeleton of a makefile that handles compiler options in different categories.
• “Software Pipeline Script swplist” on page 294 is a shell script that compiles a module to create an assembly listing and extracts the software pipeline report cards.
• “Shell Script ssruno” on page 297 is a simple script to make SpeedShop experiments more convenient to run.
• “Awk Script for Perfex Output” on page 298 demonstrates how the output of perfex can be post-processed for analysis.
• “Awk Script for Amdahl’s Law Estimation” on page 300 reads execution times, derives the parallel fraction of the program, and extrapolates the execution time for larger numbers of CPUs based on Amdahl’s law.
• “Page Address Routine va2pa()” on page 302 is a C function to return the physical address of a virtual memory variable.
• “CPU Clock Rate Routine cpuclock()” on page 303 is a C function to return the CPU clock rate from the hardware inventory.


Program adi2

The program adi2 in Example C-1 is used as an example in several chapters.

Example C-1   Program adi2.f

      program fake_adi
      implicit none
      integer ldx, ldy, ldz, nx, ny, nz, maxsteps
      parameter (ldx = 128, ldy = 128, ldz = 128)
      parameter (nx = 128, ny = 128, nz = 128)
      parameter (maxsteps = 2)
      real*8 data(ldx,ldy,ldz)
      integer i, j, k, istep
      external rand, dtime
      real*4 dtime, t, t2(2)
      real*8 rand, checksum
c
      do k = 1, nz
         do j = 1, ny
            do i = 1, nx
               data(i,j,k) = rand()
            enddo
         enddo
      enddo
c
      t = dtime(t2)
c
      do istep = 1, maxsteps
c
c*$* assert concurrent call
         do k = 1, nz
            do j = 1, ny
               call xsweep(data(1,j,k),1,nx)
            enddo
         enddo
c
c*$* assert concurrent call
         do k = 1, nz
            do i = 1, nx
               call ysweep(data(i,1,k),ldx,ny)
            enddo
         enddo
c
c*$* assert concurrent call
         do j = 1, ny
            do i = 1, nx
               call zsweep(data(i,j,1),ldx*ldy,nz)
            enddo
         enddo
c
      enddo
c
      t = dtime(t2)
      write(6,1) t
    1 format(1x,'Time: ',f6.3,' seconds')
      checksum = 0.0d0
      do k = 1, nz
         do j = 1, ny
            do i = 1, nx
               checksum = checksum + data(i,j,k)
            enddo
         enddo
      enddo
c
      write(6,2) checksum
    2 format(1x,'Checksum: ',1pe17.10)
c
      end
c-----------------------------------------------------------------
      subroutine xsweep(v,is,n)
      implicit none
      real*8 v(1+is*(n-1))
      integer is, n
      integer i
      real*8 half
      parameter (half = 0.5d0)
c
      do i = 2, n
         v(1+is*(i-1)) = v(1+is*(i-1)) + half*v(1+is*(i-2))
      enddo
c
      do i = n-1, 1, -1
         v(1+is*(i-1)) = v(1+is*(i-1)) - half*v(1+is*i)
      enddo
c
      return
      end
c-----------------------------------------------------------------
      subroutine ysweep(v,is,n)
      implicit none
      real*8 v(1+is*(n-1))
      integer is, n
      integer i
      real*8 half
      parameter (half = 0.5d0)
c
      do i = 2, n
         v(1+is*(i-1)) = v(1+is*(i-1)) + half*v(1+is*(i-2))
      enddo
c
      do i = n-1, 1, -1
         v(1+is*(i-1)) = v(1+is*(i-1)) - half*v(1+is*i)
      enddo
c
      return
      end
c-----------------------------------------------------------------
      subroutine zsweep(v,is,n)
      implicit none
      real*8 v(1+is*(n-1))
      integer is, n
      integer i
      real*8 half
      parameter (half = 0.5d0)
c
      do i = 2, n
         v(1+is*(i-1)) = v(1+is*(i-1)) + half*v(1+is*(i-2))
      enddo
c
      do i = n-1, 1, -1
         v(1+is*(i-1)) = v(1+is*(i-1)) - half*v(1+is*i)
      enddo
c
      return
      end

Program adi5.f is identical to Example C-1 except for the parameter line shown in Example C-2, which pads the array leading dimensions ldx and ldy to 129.

Example C-2   Program adi5.f

      program fake_adi
      implicit none
      integer ldx, ldy, ldz, nx, ny, nz, maxsteps
      parameter (ldx = 129, ldy = 129, ldz = 128)
      parameter (nx = 128, ny = 128, nz = 128)
      parameter (maxsteps = 2)
      real*8 data(ldx,ldy,ldz)

Program adi53.f is identical to Example C-1 except for the lines shown in Example C-3.

Example C-3   Program adi53.f

      program fake_adi
c
      implicit none
c
      integer ldx, ldy, ldz, nx, ny, nz, maxsteps
      parameter (ldx = 129, ldy = 129, ldz = 128)
      parameter (nx = 128, ny = 128, nz = 128)
      parameter (maxsteps = 2)
      ...
      do j = 1, ny
         call copy(data(1,j,1),ldx*ldy,temp,nx,nx,nz)
         do i = 1, nx
            call zsweep(temp(i,1),nx,nz)
         enddo
         call copy(temp,nx,data(1,j,1),ldx*ldy,nx,nz)
      enddo
      ...
      subroutine copy(from,lf,to,lt,nr,nc)
      implicit none
      real*8 from(lf,nc), to(lt,nc)
      integer lf, lt, nr, nc
      integer i, j
      do j = 1, nc
         do i = 1, nr
            to(i,j) = from(i,j)
         enddo
      enddo
      return
      end


Basic Makefile

This is a template for a Makefile suitable for any moderately complex program composed of Fortran and C source files. It isolates compiler options into groups for easy editing and experimentation.

Example C-4   Basic Makefile

#! /usr/sbin/smake
# ----------------------------------------------------------------------
# Basic Makefile for a program composed of Fortran and C modules
# ----------------------------------------------------------------------
# The following variables specify the compiler and linker options,
# assembling them by groups for use in later commands.  You may
# need to edit these lines several times while tuning.
#
# -- flags related to ISA, ABI, and model (ipxx) go to $ARCH
# -- set -n32 or -64.  -TARG, -TENV could go here too.
ABI = -n32
# -- probably -mips4
ISA = -mips4 -r10000
# -- ip27 for Origin2000/Onyx2
PROC = ip27
ARCH = $(ABI) $(ISA)
# -- flags related to optimization level go to $OPT
# -- set level, e.g. -O0 g3, -O3, -Ofast=$(PROC)
OLEVEL = -O2
# -- set -OPT: option group
OOPT = -OPT:alias=restrict
# -- set -IPA: option group
OIPA =
# -- set -LNO: option group
OLNO =
OPT = $(OLEVEL) $(OOPT) $(OIPA) $(OLNO)
# -- flags related to numeric precision, by compiler
FOPTS = -OPT:IEEE_arithmetic=3:roundoff=2
COPTS = -OPT:IEEE_arithmetic=3:roundoff=2
# Assemble the f77 and cc flags into single variables
FFLAGS = $(ARCH) $(OPT) $(FOPTS)
CFLAGS = $(ARCH) $(OPT) $(COPTS)
# Link-time flags must include ABI, ISA, and opt flags
LDFLAGS = $(ARCH) $(OPT)
# ----------------------------------------------------------------------
# The following variables specify the program components.
# You typically edit these lines only once, to specify the modules.
#
# -- Specify the name of the executable program:
EXEC = execname
# -- list all Fortran object files, e.g. FOBJS = f1.o f2.o
FOBJS =
# -- list all C object files, e.g. COBJS = c1.o c2.o c3.o
COBJS =
# -- List all linked libs
LIBS = -lfastm -lm
# The program comprises the following object files:
OBJS = $(FOBJS) $(COBJS)
# ----------------------------------------------------------------------
# The following variables locate tools based on an environment
# variable (or command-line argument) $TOOLROOT.
FC = $(TOOLROOT)/usr/bin/f77
CC = $(TOOLROOT)/usr/bin/cc
LD = $(FC)
F77 = $(FC)
# Locate a script that processes the .S output files
SWP = swplist
# Shorthand for "rm" for use in "make clean"
RM = /bin/rm -f
# ----------------------------------------------------------------------
# Nothing below this point should need editing.
# ----------------------------------------------------------------------
# The following target implements "make clean"
clean:
	$(RM) $(EXEC) $(OBJS)
# ----------------------------------------------------------------------
# The following target implements "make execname" by linking all
# object files:
$(EXEC): $(OBJS)
	$(LD) -o $@ $(LDFLAGS) $(OBJS) $(LIBS)
# ----------------------------------------------------------------------
# The following targets tell how to compile objects from sources.
# Variable $DEFINES is set on the make command line, if at all.
.SUFFIXES: .o .F .c .f .swp
.F.o:
	$(FC) -c $(FFLAGS) $(DEFINES) $<
.f.o:
	$(FC) -c $(FFLAGS) $(DEFINES) $<
.c.o:
	$(CC) -c $(CFLAGS) $(DEFINES) $<
# ----------------------------------------------------------------------
# The following targets implement "make sourcename.swp" to inspect
# the SWP code generation (requires swplist script)
.F.swp:
	$(SWP) -c $(FFLAGS) $(DEFINES) -WK,-cmp=$*.m $<
.f.swp:
	$(SWP) -c $(FFLAGS) $(DEFINES) -WK,-cmp=$*.m $<
.c.swp:
	$(SWP) -c $(CFLAGS) $(DEFINES) $<

Software Pipeline Script swplist

This complex csh script compiles one or more C or Fortran source files with the -S option, which produces only an assembler listing, not an object file. Then it processes each of the listing files, extracting just the software pipeline “report cards,” and merges these back into the original source files. The merged files, showing pipeline statistics above the loops to which they apply, are written with .swp extensions.

Note that the source line number the compiler assigns to a generated loop is only approximate because the higher levels of optimization transform the code. As a result, a report card in the .swp file sometimes precedes the loop to which it applies, although the report card sections appear in the correct sequence.

Example C-5   Shell Script swplist

#!/bin/csh -f
if ( $#argv == 0 ) then
    echo ""
    echo "Usage: $0 [compiler flags] files..."
    echo "  This version of the script uses the Environment variable"
    echo "  TOOLROOT if set."
    echo "  All tools are called as "\$"TOOLROOT/usr/bin/."
    exit
endif
set t = /usr/tmp
if (${?TMPDIR}) then
    if (-e ${TMPDIR}) then
        set t = ${TMPDIR}
    endif
endif
if ( ! $?TOOLROOT ) then
    setenv TOOLROOT /
endif
echo 'TOOLROOT is "'$TOOLROOT'"'
set nawk_file1 = $t/$$.SWP.NAWK_1
set nawk_file2 = $t/$$.SWP.NAWK_2
# First awk program extracts SWP descriptive lines and saves
# in temp files, one per loop.  Output is a list of loop-files.
cat << NAWK_FILE1_END > $nawk_file1
BEGIN {
    Loop = 0; GotLine = 0; LoopID = 0;
    TmpFileRoot = sprintf("$t/%s_SWP",FILENAME)
}
/#/ || /#/ {
    if (Loop == 0) {
        Loop = 1;
        LoopID++;
        TmpFile = TmpFileRoot"."LoopID;
    }
    print > TmpFile;
}
/oop line/ {
    if (Loop == 1) {
        if (GotLine == 0) {
            GotLine = 1;
            split(\$0, Line);
            i=0;
            while (Line[i] != "line") {i++}
            LoopLine = Line[++i];
            print LoopLine " " TmpFile
        }
    }
}
!/#/ && !/#/ {
    if (Loop == 1) {
        Loop = 0; GotLine = 0;
        close(TmpFile)
    }
}
END {
    if (Loop == 1) close(TmpFile)
}
NAWK_FILE1_END
# Second awk program
cat << NAWK_FILE2_END > $nawk_file2
BEGIN {
    CurrentLine = 1
    TmpFileRoot = sprintf("$t/%s_SWP",FILENAME)
    Name = substr(FILENAME, 1, length(FILENAME)-3)
    SortInp = Name".sort"
    OutFile = Name".swp";
    system("rm -f "OutFile);
    while ( (getline pair < SortInp) !=0){
        split(pair,rec);
        NextLine = rec[1];
        NextInpFile = rec[2];
        while ( CurrentLine < NextLine ) {
            getline; print >> OutFile; ++CurrentLine;
        }
        system("cat " NextInpFile " >> " OutFile);
        system("rm " NextInpFile);
    }
}
{ print >> OutFile; }
END { system("rm " SortInp); }
NAWK_FILE2_END
# compile all modules with -S given flags and modules specified
${TOOLROOT}/usr/bin/f77 -S $*
# for each module named on command line, process the output
set narg = $#argv
@ i = 1
while ($i <= $narg)
    if (($argv[$i]:e == f) || ($argv[$i]:e == F) || ($argv[$i]:e == c)) then
        # This guards against interpreting flags such as -WK,-inff=file.f
        # as files to compile.
        if (-e $argv[$i]) then
            set s = $argv[$i]:r
            pr -t -n10 $argv[$i] > $s.pr
            nawk -f $nawk_file1 $s.s | sort -n > $s.sort
            nawk -f $nawk_file2 $s.pr
            /bin/rm $s.pr
        endif
    endif
    @ i = $i + 1
end
/bin/rm $nawk_file1
/bin/rm $nawk_file2

Shell Script ssruno

This script simplifies the run of a SpeedShop experiment.

Example C-6   SpeedShop Experiment Script ssruno

#!/bin/csh
# script to ssrun a program with designated output dir/filename.
# if no arguments, document usage
if (0 == $#argv) then
    echo "$0 [-d output_dir] [-o output_file] [-ssrun_opts] prog_and_args"
    exit -1
endif
# initialize operands
set ssopts = ""
set otdir = "."
set otfile = ""
set proggy = ""
# collect -d, -o, and -ssrun options.  Upon encountering name
# of program, break out of the loop, leaving $argv == prog_and_args
while ($#argv > 0)
    switch ($1)
    case "-o":
        setenv _SPEEDSHOP_OUTPUT_FILENAME $2
        set otfile = $2
        shift
        breaksw
    case "-d":
        setenv _SPEEDSHOP_OUTPUT_DIRECTORY $2
        set otdir = $2
        shift
        breaksw
    case "-*":
        set ssopts = ($ssopts $1)
        breaksw
    default:
        #
        # get only tail, allowing ssrun /foo/bar/a.out
        set proggy = ${1}:t
        break
    endsw
    shift
end
# have to have seen a program
if ("X$proggy" == "X") then
    echo you must name a subject program
    exit -2
endif
# default the experiment type
if ("X$ssopts" == "X") then
    set ssopts = -usertime
endif
# run the experiment
echo ssrun $ssopts $argv....
ssrun $ssopts $argv
echo ...... ssrun ends.
# display all the output files with names starting $proggy
if ("X$otfile" == "X") then
    #
    # outfile not given, file is name.exptype.xpid
    ls -l $otdir/$proggy.*.?[0-9][0-9][0-9]*
else
    #
    # outfile given, file is name.xpid
    ls -l $otdir/$otfile.*
endif

Awk Script for Perfex Output

This script demonstrates one way to reduce and analyze the output of a perfex profile.

Example C-7   Awk Script to Analyze Output of perfex -a

# Reads output of perfex -a [-y].  Prints selected, reordered counters
# interpolating calculated ratios and percents.  Perfex runs of short
# programs often have zero values for some counts - allow for these.
BEGIN {
    maxline = 0     # track highest counter number seen
    mhz = 200       # assumed MHZ, adjust as needed
}
$0 ~ /^[ 123][0-9] / {               # perfex data line
    lines[$1] = $0                   # save the whole line
    counter[$1] = $NF                # save reported value
    if (maxline < $1) maxline = $1   # note high line# seen
}
END { # at end, print report
    if (maxline >= 31) {
        print lines[0]
        seconds = counter[0]/(mhz*1000000)
        print "    " seconds " seconds elapsed at " mhz "MHZ"
        print lines[17]
        if (counter[17]) {
            print "    " counter[0]/counter[17] " cycles/graduated instruction"
            print "    " (counter[17]/seconds)/1000000 " MIPS at 200MHZ"
            print lines[18]
            print lines[19]
            if (counter[18]*counter[19]) {
                print "    " (counter[17]-counter[18])/counter[18] " instructions/load"
                print "    " (counter[17]-counter[19])/counter[19] " instructions/store"
                print "    " counter[18]/counter[19] " loads/store"
            }
            print lines[21]
            if (counter[21]) {
                print "    " int((counter[21]/counter[17])*100) "% fp instructions"
            }
        }
        print lines[6]
        print lines[24]
        if (counter[6]*counter[24]) {
            print "    " int((counter[24]/counter[6])*100) "% branches mispredicted"
        }
        print lines[23]
        if (counter[17]*counter[23]) {
            print "    " counter[17]/counter[23] " instructions/TLB miss"
        }
        print lines[9]
        if (counter[17]*counter[9]) {
            print "    " counter[17]/counter[9] " instructions/i-L1 miss"
        }
        print lines[10]
        print lines[11]
        if (counter[17]*counter[10]) {
            print "    " counter[17]/counter[10] " instructions/i-L2 miss"
        }
        print lines[25]
        if (counter[17]*counter[25]) {
            print "    " counter[17]/counter[25] " instructions/d-L1 miss"
        }
        print lines[26]
        print lines[27]
        if (counter[17]*counter[26]) {
            print "    " counter[17]/counter[26] " instructions/d-L2 miss"
        }
        smiss = counter[10]+counter[26]
        print "    " smiss " total L2 misses, " 128*smiss " bytes from memory"
        print "    " int(((128*smiss)/seconds)/1024) \
              " KB/sec memory bandwidth use at " mhz "MHZ"
        print lines[22]
        print lines[7]
        print lines[30]
        print lines[31]
    }
    else print "incomplete input"
}

Awk Script for Amdahl’s Law Estimation

The script in Example C-8 can be run with the command awk -f amdahl.awk. Each line of input must be a list of numbers that represent execution times for one program using different numbers of CPUs. The nth number must be the execution time using n CPUs, T(n). Use 0 for an unknown time; however, at least the first and last numbers must be nonzero.

The script displays the calculated parallel fraction of the code, p, and the speedup and expected run time for various numbers of CPUs. Enter another line of times, or terminate the program with Ctrl+C.

Example C-8   Awk Script to Extrapolate Amdahl’s Law from Measured Times

# amdahl.awk: an input line is a series of execution times
# T(1), T(2),...T(N) for a program run with 1, 2, ... N CPUs.
# Use 0 for an unknown time. T(1) and T(N) must be nonzero.
# For example, after test with 1, 2, and 4 CPUs, you could enter
#    240 190 0 75
# to show those times, with 0 for the unknown time T(3).
{
    # save times T(n) in array t[]
    for (j=1;j<=NF;++j) t[j] = $j
    # calculate p, parallel fraction of code
    if (2==NF) {
        # use simple formula for p given only T1, T2
        s2 = t[1]/t[2]
        p = 2*(s2-1)/s2
    } else {
        # use general formula on the last 2 nonzero inputs
        for (m=NF-1; t[m]==0; --m) ;
        sm = t[1]/t[m]
        sn = t[1]/t[NF]
        invm = 1/m
        invn = 1/NF
        p = (sm - sn)/( sm*(1-invm) - sn*(1-invn) )
    }
    if (p<1) {
        printf("#CPUs SpeedUp(n)     T(n)  p=%6.3g\n",p)
        npat = "%5d %6.3g %8.3g\n"
        # print the actual times as given and their speedups
        printf(npat,1,1.0,t[1])
        for (j=2;j<=NF;++j) {
            if (t[j]) printf(npat,j,t[1]/t[j],t[j])
        }
        # extrapolate using amdahl's law based on calculated p
        # first, for CPUs one at a time to 7
        for (j=NF+1;j<8;++j) {
            sj = 1/((p/j)+(1-p))
            printf(npat,j,sj,t[1]/sj)
        }
        # then 8, 16, 32, 64 and 128
        for (j=8;j<=128;j=j+j) {
            sj = 1/((p/j)+(1-p))
            if (j>NF) printf(npat,j,sj,t[1]/sj)
        }
    } else {
        printf("p=%6.3g, hyperlinear speedup\n",p)
        printf("Enter a list of times for more than %d CPUs\n\n",NF)
    }
}


Page Address Routine va2pa()

This routine allows a program to pass the address of any variable, and recover the physical memory address of the page containing the variable. It can be used to investigate memory distribution effectiveness.

You can translate a virtual address to a node number with the following macro, which calls va2pa(). #define VADR2NODE(A) ((int) (va2pa(A) >> 32))

You can retrieve the CPU number instead of a node number using this macro: #define VADR2CPU(A) ((int) (va2pa(A) >> 16))

Example C-9   Routine va2pa() Returns the Physical Page of a Virtual Address

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syssgi.h>

__uint64_t va2pa( void *va)
{
    __uint64_t pa;
    __uint32_t pfn;
    int status;
    static int lpgsz, pgsz = -1;

    if (pgsz < 0) {     /* first time: log2(pagesize) */
        int itmp;
        pgsz = itmp = getpagesize();
        for (lpgsz=0; itmp>1; itmp>>=1, lpgsz++);
    }
    if ((status = syssgi(SGI_PHYSP,va,&pfn)) != 0) {
        perror("Virtual to physical mapping failed");
        exit(1);
    }
    pa = (((__uint64_t) pfn << lpgsz) | ((__uint64_t) va & (pgsz-1)));
    return (pa);
}
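A small usage sketch, assuming va2pa() and the VADR2NODE macro above are in scope, that reports the node holding each megabyte of an array:

#include <stdio.h>
#include <sys/types.h>

extern __uint64_t va2pa(void *va);
#define VADR2NODE(A) ((int) (va2pa(A) >> 32))

void show_placement(char *base, size_t nbytes)
{
    size_t off;
    for (off = 0; off < nbytes; off += 1024*1024)
        printf("offset %lu MB -> node %d\n",
               (unsigned long)(off >> 20), VADR2NODE(base + off));
}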

CPU Clock Rate Routine cpuclock()

This routine gets the clock rate in megahertz of the first CPU listed in the hardware inventory, and returns it as an integer. You can use this number to convert an elapsed time into a count of CPU cycles.

Example C-10   Routine cpuclock() Gets the Clock Speed from the Hardware Inventory

/* ======================================================================
|| Return CPU clock rate in megahertz, by the rather
|| byzantine method of scanning the hardware inventory
*/
#include <invent.h>
#define DFLT_MHZ 195    /* return if any error */
int cpuclock(void)
{
    inventory_t *p_inv;
    if (setinvent())
        return DFLT_MHZ;
    for(p_inv = getinvent(); (p_inv); p_inv = getinvent()) {
        if ( (p_inv->inv_class == INV_PROCESSOR)
          && (p_inv->inv_type == INV_CPUBOARD) )
            break;
    }
    endinvent();
    if (p_inv)
        return p_inv->inv_controller;
    else
        return DFLT_MHZ;
}
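For example, an elapsed time measured in seconds can be converted to an approximate cycle count (a sketch assuming cpuclock() above):

extern int cpuclock(void);

/* Approximate CPU cycles in t seconds of elapsed time. */
double seconds_to_cycles(double t)
{
    return t * (double) cpuclock() * 1.0e6;    /* MHz to cycles per second */
}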


Glossary

This short glossary contains only terms related to performance tuning, as used in this guide. For a full glossary of Silicon Graphics technical terms, see the Silicon Graphics Glossary of Terms, 007-1859-xxx.

Amdahl’s Law
A mathematical approximation (not really a law) stating the ideal speedup that should result when a program is executed in parallel: S(n) = 1/((p/n) + (1 - p)), where n is the number of CPUs applied and p is the fraction of the program code that can be executed by parallel threads. When p is small (much of the program executes serially), S(n) is near 1.0; in other words, there is no speedup. When p is near 1.0 (most of the program can be executed in parallel), S(n) is near n; in other words, linear speedup.

bandwidth
The amount of data transferred per second over a given path. Common types are:
• Theoretical bandwidth, the maximum units per second under perfect conditions
• Available bandwidth, units per second of useful data after subtracting protocol overhead and turnaround delays in normal traffic situations
• Bisection bandwidth, the maximum available bandwidth between two arbitrarily-chosen points in the system allowing for bottlenecks.
See also latency.

basic block
A sequence of program statements that contains no labels and no branches. A basic block can only be executed completely (because it contains no labels, it cannot be entered other than at the top; because it contains no branches, it cannot be exited before the end), and in sequence (no internal labels or branches). Any program can be decomposed into a sequence of basic blocks. The compiler optimizes units of basic blocks. An ideal profile counts the use of each basic block.

branch prediction
See speculative execution.


cache
In common language, a hiding-place for valuables, provisions, and so on. In computing, a separate memory where the most recently used items are kept in anticipation of their being needed again. The MIPS R10000 CPU contains an on-chip cache for instructions and another for data (the primary cache or L1 cache). The SN0 node provides a secondary cache (or L2 cache) of either 1 MB or 4 MB, organized as 128-byte units called cache lines.

cache blocking
An optimization technique that modifies a program so that it works on blocks of data that fit in cache, and deals completely with each block before going on to another.

cache coherency
In a multiprocessor, each CPU has its own cache; hence any memory item can appear in multiple copies among the various caches. Cache coherency hardware ensures that all cached copies are identical to memory. The basic method is to recognize when one CPU modifies memory, and to automatically invalidate any other copies of the changed data. See cache, directory, snoopy cache.

cache contention
When multiple CPUs update the same cache line, each has to become the exclusive owner of that line in turn. This can drastically slow execution. Not the same phenomenon as cache thrashing, but caused by memory contention or by false sharing.

cache directory
A design for cache coherency used in the SN0 architecture. Each secondary-cache line of 128 bytes is augmented with a set of directory bits, one bit for each CPU that could have a cached copy of that line. When a CPU accesses a cache line, its bit is set in the directory of the node that owns that line. When a CPU modifies memory, all CPUs whose directory bits are set in the directory of the associated line are notified to invalidate their copies of the line.

cache line
The unit of memory fetched and stored in a cache. Cache line size is a function of the hardware. In the MIPS R10000 CPU, the primary cache line is 32 bytes. In the SN0 (and most other Silicon Graphics systems) the secondary cache line is 128 bytes. See cache.


cache miss
When the CPU looks for a particular memory word in a cache and does not find it. Execution of the current instruction has to be delayed until the cache line has been loaded from memory. A superscalar CPU can continue to execute other instructions out of order while waiting.

cache thrashing
Cache lines are assigned locations in cache based on the bits of their physical address. It is possible for cache lines from unrelated parts of memory to contend for the same location in cache. When a program cycles through multiple arrays that happen to have similar base addresses, it is possible for every fetched cache line to displace another recently fetched line. The SN0 uses a “two-way set-associative” cache that can deal with alternating references to two arrays with similar address bits, but it can be defeated by a loop that refers to three or more arrays. This “thrashing” effectively negates the cache, reducing access to memory speed. It is prevented by padding arrays to force their starting addresses to differ in the least-significant bits. See also false sharing, which is quite different.

cache tiling
See cache blocking.

counter
See hardware counter.

dependency
When one instruction or program statement requires the result of another as its input, and so depends on the completion of the other. A loop contains a dependency when one iteration of the loop requires the result of a prior iteration; such loops cannot be parallelized. In a superscalar CPU, an instruction cannot complete until its dependencies are satisfied by the completion of the instructions that prepare its operand values.

directory
See cache directory.

dynamic page migration
Page migration that is initiated by the operating system, based on hardware counters in the hub that show which nodes are using any given page of memory. Dynamic page migration can be turned on for the whole system (not recommended) or for a specific program by running it under dplace.


false sharing When two or more variables that are not logically related happen to fall into the same cache line, and one variable is updated, a reference to any of the other variables cause the cache line to be fetched anew from memory even though the data in those variables has not changed. In effect, false sharing is unintentional memory contention. False sharing is eliminated by putting volatile variables in cache lines that are shared only by variables that are logically associated with them and that are updated at the same time.

first touch allocation The default rule for IRIX memory allocation in the SN0: new memory pages are allocated in the node where they are requested, or in the nearest node to it. See also round robin allocation.

gather-scatter A compiler optimization technique for improving the speed of a loop that iterates over an array, testing each element and operating on only selected elements. Because such a loop contains a conditional branch, software pipelining is less effective in it. The loop is converted into two loops: one that tests all elements, collecting the indexes of the target elements in a scratch array; and a second loop that iterates over the scratch array and operates on the selected elements. The second loop, because it contains no conditional branch, can be more effectively optimized.
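
For illustration, a minimal C sketch of the transformation (the compiler performs it automatically; the function and array names here are arbitrary, and the caller supplies a scratch index array of at least n elements):

    #include <math.h>

    void sqrt_positive(double *a, int *idx, int n)
    {
        int i, j, m = 0;

        /* Original form -- the conditional branch makes software pipelining
           less effective:
           for (i = 0; i < n; i++)
               if (a[i] > 0.0) a[i] = sqrt(a[i]);                          */

        /* Gather: collect the indexes of the selected elements. */
        for (i = 0; i < n; i++)
            if (a[i] > 0.0)
                idx[m++] = i;

        /* Scatter: operate on them in a branch-free loop that pipelines well. */
        for (j = 0; j < m; j++)
            a[idx[j]] = sqrt(a[idx[j]]);
    }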

graduated instruction In a superscalar CPU, instructions are executed out of the order in which they were written. In order to ensure that the program state changes only in ways that the programmer expects, any instruction that completes early is delayed until the instructions that logically preceded it in the program text have also completed. Then the instruction “graduates” (as a student graduates from school) and its effect on the program state is made permanent. Under speculative execution, some instructions are executed and then discarded, never to graduate. An R10000 CPU hardware counter can count instructions issued and graduated.

hardware counter In the MIPS R10000 CPU, one of two registers that can be programmed to count events such as cache miss, graduated instruction, and so on. Used in generating a sampled profile.


hub In the SN0 architecture, the custom circuit that manages memory access in a node, directing the data flowing between attached I/O devices and CPU chips to memory in the node and through a router to other nodes.

hypercube In topology, the four-dimensional figure whose faces are cubes of identical size (as the cube is the 3D figure whose faces are squares of identical size). In an SN0 system with 32 or more CPUs, the CPU nodes are connected by data paths that follow the edges of a hypercube.

inlining A compiler optimization technique in which the call to a procedure is replaced by a copy of the body of the procedure, with the actual parameters substituted for the parameter variables. Inlining has two immediate effects: to eliminate the overhead of calling the procedure, and to increase the size of the code. It also has the side-effect of giving the compiler more statements to practice optimization upon. For one example among many, when one of the actual parameter values is a constant, it may be possible to eliminate many of the conditional statements from the inserted code because, with that actual data, they are unreachable.

interprocedural analysis (IPA) A compiler optimization technique. Normally a compiler analyzes one procedure at a time in isolation, making the most conservative assumptions about the possible range of argument values it might receive and about the actions of the procedures that it calls. Using IPA, the compiler examines all calls to each procedure to determine the actual ranges of arguments, and to determine whether or not a called procedure can ever modify shared data. The information gained from IPA permits the compiler to do more aggressive optimization, but to get it, the entire program text must be available at once. The current Silicon Graphics compilers implement IPA by performing only syntax analysis at “compile” time, and performing the actual compilation at “link” time, when all modules are brought together.

kthread The IRIX kernel, since release 6.4, is structured as a set of independent, interruptible, lightweight threads. These “kthreads” (kernel threads) are used to implement system functions and to handle interrupts and such background tasks as disk buffering. Some kthreads are used to execute the code of threads in user-level programs. However, a kthread and a thread are quite different in their abilities and execution environment.


latency Regarding memory or a communications path, the time elapsed from the instant a request is issued until the response begins to arrive. When communication consists of many short exchanges, latency dominates the data rate. When messages are long enough that the transmission time of a message is larger than the latency time, bandwidth dominates. Regarding a CPU, latency is the time from loading an instruction to completing it.

libfastm A highly optimized version of the standard Fortran math library.

loop fission A compiler optimization technique, the opposite of loop fusion, in which a loop whose body is too large for effective optimization or software pipelining is automatically split into two loops. May also be done when the loop body contains inner loops such that, after the fission, loop interchange can be performed.
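
For illustration, a minimal C sketch (not from the text; the names are arbitrary):

    /* Before fission: one loop body with two independent computations and
       four memory streams. */
    void combined(double *a, double *b, double *c, double *d, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            a[i] = a[i] + b[i];
            c[i] = c[i] * d[i];
        }
    }

    /* After fission: two smaller loops, each with fewer memory streams and a
       body that is easier to software-pipeline. */
    void split(double *a, double *b, double *c, double *d, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = a[i] + b[i];
        for (i = 0; i < n; i++)
            c[i] = c[i] * d[i];
    }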

loop fusion A compiler optimization technique in which two adjacent loops that have the same induction variable range are merged. The statements in the loop bodies are put together within a single loop structure. Benefits include halving the time spent on loop overhead, putting more statements in the basic block to give the compiler more material for software pipelining, and, most important, using cache data more effectively. See also loop peeling.
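
For illustration, a minimal C sketch (not from the text; the names are arbitrary):

    /* Before fusion: two adjacent loops over the same range each sweep the
       array b through the cache. */
    void separate(double *a, double *b, double *c, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = b[i] + 1.0;
        for (i = 0; i < n; i++)
            c[i] = b[i] * 2.0;
    }

    /* After fusion: one pass over b, half the loop overhead, and a larger
       basic block for the compiler to schedule. */
    void fused(double *a, double *b, double *c, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            a[i] = b[i] + 1.0;
            c[i] = b[i] * 2.0;
        }
    }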

loop interchange A compiler optimization technique in which the loop variables of inner and outer loops are interchanged. This is done to reduce the stride of the loop and so improve its cache use, or to make cache blocking or loop unrolling possible.
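
For illustration, a minimal C sketch (not from the text; it assumes arrays declared with a fixed second dimension of 1000 and n no larger than 1000):

    /* Before interchange: in row-major C storage, the inner loop walks down a
       column, touching a new cache line on nearly every iteration. */
    void column_order(double a[][1000], int n)
    {
        int i, j;
        for (j = 0; j < n; j++)
            for (i = 0; i < n; i++)
                a[i][j] = 0.0;
    }

    /* After interchange: the inner loop walks along a row at stride one, so
       every element of each fetched cache line is used. */
    void row_order(double a[][1000], int n)
    {
        int i, j;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                a[i][j] = 0.0;
    }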

loop nest optimization A compiler optimization technique in which the statements in nested loops are transformed to give better cache use or shorter instruction sequences.

loop peeling A compiler optimization technique in which some of the first (or last) iterations of a loop are made into separate statements preceding (following) the loop proper. This is done in order to permit loop fusion with an adjacent loop that uses a smaller range of indexes.
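
For illustration, a minimal C sketch (not from the text; it assumes n is at least 1, and the names are arbitrary):

    /* Two adjacent loops whose index ranges differ by one iteration cannot
       be fused directly. */
    void unpeeled(double *a, double *b, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = a[i] * 2.0;
        for (i = 1; i < n; i++)
            b[i] = a[i] + a[i-1];
    }

    /* Peeling the first iteration of the first loop makes the ranges match,
       so the remaining iterations can be fused into one loop. */
    void peeled(double *a, double *b, int n)
    {
        int i;
        a[0] = a[0] * 2.0;              /* peeled iteration */
        for (i = 1; i < n; i++) {
            a[i] = a[i] * 2.0;
            b[i] = a[i] + a[i-1];
        }
    }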


loop unrolling A compiler optimization technique in which the body of a loop is replicated several times, and the increment of the loop variable is adjusted to match (for example, instead of using array element a(i) in one statement, the loop has four statements using elements a(i), a(i+1), a(i+2), and a(i+3), and then increments i by 4). This divides the loop overhead by the amount of unrolling, but more importantly gives the compiler more statements in the basic block to optimize, for example making software pipelining easier (see the C sketch following the memory locality domain entry below). See also loop nest optimization.

memory contention When the code in two or more CPUs repeatedly updates the same memory locations, execution is slowed due to cache contention.

memory locality The degree to which data is located close to the CPU (or CPUs) that access it. In the SN0 architecture, there is a (small) time penalty for access to memory on a different node, so the operating system applies many strategies to keep data close to the CPUs that use it.

memory locality domain (MLD) An abstract representation of physical memory, used within the IRIX operating system. An MLD is a source for memory allocation having a prescribed type of memory locality; for example, the two nodes in one corner of a hypercube can yield memory that is close to either node.
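
The loop unrolling entry above can be illustrated with a minimal C sketch (not from the text; it assumes n is a multiple of 4, where a real compiler would add cleanup code for the leftover iterations):

    /* Before unrolling. */
    void rolled(double *a, double s, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = a[i] * s;
    }

    /* Unrolled by four: one quarter of the loop overhead, and four
       independent statements in the basic block for the scheduler to
       overlap. */
    void unrolled(double *a, double s, int n)
    {
        int i;
        for (i = 0; i < n; i += 4) {
            a[i]   = a[i]   * s;
            a[i+1] = a[i+1] * s;
            a[i+2] = a[i+2] * s;
            a[i+3] = a[i+3] * s;
        }
    }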

Million Floating Point Operations Per Second (MFLOPS) A measure of the speed of execution of a program. A program contains many non-floating-point instructions, so the MFLOPS achieved on the same hardware, varying the compiler options, is an indication of how effectively the compiler optimized the code. Often, MFLOPS is reported through an approximate calculation, based on counting the number of arithmetic operations in the source code for a crucial loop body and multiplying by the number of iterations. However, the hardware counter of the R10000 CPU can count graduated floating-point instructions directly, giving an absolute measure of MFLOPS.
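
For illustration, a minimal C sketch of the approximate calculation (not from the text; it counts two floating-point operations per iteration of a DAXPY-style loop, and assumes n is large enough for the clock() timer resolution):

    #include <time.h>

    double daxpy_mflops(double *x, double *y, double a, int n)
    {
        int i;
        clock_t t0 = clock(), t1;
        double seconds;

        for (i = 0; i < n; i++)
            y[i] = y[i] + a * x[i];     /* one multiply and one add */

        t1 = clock();
        seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
        return (2.0 * n) / (seconds * 1.0e6);   /* approximate MFLOPS */
    }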

Million Instructions Per Second (MIPS) A measure of the speed of execution of a CPU. A superscalar CPU such as the MIPS R10000 CPU can normally finish 1.5 instructions per clock cycle, or about 300 MIPS for a CPU with a 200 MHz clock.


node The basic building-block of the SN0 architecture, a single board carrying two R10000 CPUs, the secondary cache for each, some memory, and the hub chip that controls data flow to and from memory.

Non-uniform Memory Access (NUMA) A computer memory design that has a single physical address space, but in which parts of memory take longer to access than other parts. In the SN0, memory in the local node is accessed fastest; memory in another node takes longer by about 100 nanoseconds per router.

optimization Finding the solution that is the best fit to the available resources. See tuning.

page migration Moving a page of memory data from one node to another one that is closer to the CPU that is using the data in the page. Page migration can be explicitly programmed using Fortran directives or through run-time calls to the dplace library interface. See also dynamic page migration.

parallelization Causing a program to execute as more than one thread operating in parallel on the same data. Often specified by using C pragmas or Fortran directives to make loops run in parallel, or by writing explicitly parallel code that starts multiple processes or pthreads.

policy module (PM) The IRIX kernel data structure type that specifies one memory locality domain (MLD) and other rules regarding memory allocation from that domain.

prefetch An instruction that notifies the CPU that a particular cache line will be needed by a following instruction, permitting the memory fetch to overlap execution. The prefetch instruction behaves like a load whose result is not used. Current Silicon Graphics compilers can generate prefetch instructions automatically.

primary cache The on-chip cache memory, closest to the execution units and with smallest latency. In the MIPS R10000 CPU, primary cache is 32KB organized in 32-byte units. Also called L1 (level-one) cache. See cache, secondary cache.


profile To analyze the execution of a program by counting the number of times it executes any given statement. An ideal profile is done by modifying the executable so that it contains instructions to count the entry to each basic block. A sampled profile is done by executing the unmodified executable under a monitor that interrupts it at regular intervals and records the point of execution at each interruption. A sampled profile emphasizes different kinds of behavior depending on the chosen sampling time base.

pseudo-prefetch To order data references in a loop so that a harmless reference to the next cache line is executed well in advance of the use of the data from that cache line. A manual version of prefetch often used to hide the latency of loading the primary cache.

pthread A thread of a user program, created and controlled using the POSIX 1003.1c standard interface. An alternative to parallelization using multiple IRIX processes. Not related in any way to kthread.

round robin allocation An alternative to first touch allocation in which new pages are distributed in rotation to each node. This prevents the hub chip in one node from having to serve all requests for memory from one program.

router In the SN0, the custom board that routes memory access between nodes.

scalability The quality of scaling, or increasing in proportion, in a regular fashion based on the availability of resources. In hardware, increasing in performance as a smooth function of increasingly complex circuitry; or conversely, increasing in cost as a smooth function of capacity. In software, improving execution time as a smooth function of available CPUs and memory (see Amdahl’s Law).

secondary cache Cache external to the CPU chip; also called L2 (level-two) cache. See cache, primary cache.


snoopy cache A design for cache coherency used in the CHALLENGE and Onyx systems. Each CPU board monitors all bus traffic. When it observes a memory write, it invalidates any cached copy it might have of the same memory. This is the simplest way to implement cache coherency, but it depends on having all CPUs on a single memory bus, which is not the case in the SN0.

software pipelining A compiler technique in which the compiler generates sequences of instructions that are carefully tailored to take maximal advantage of the multiple execution units of a superscalar CPU.

speculative execution A superscalar CPU can decode and begin processing multiple instructions in parallel. When it decodes a branch instruction, the condition tested by the branch may not yet be known, because the instruction that prepares that condition is still executing or perhaps is suspended pending yet another unknown result. Rather than suspending execution pending completion of the branch input operand, the CPU predicts the direction the branch will go, and continues to decode and execute instructions along that path. However, it does so in such a way that any resulting changes in program state can be retracted. If the prediction turns out to be wrong, the results of the speculatively executed instructions are discarded. Experience with the branch prediction logic of the MIPS CPUs shows that branch prediction is typically correct well over 90% of the time. Branch mispredictions can be counted in a hardware counter.

stencil Name for a family of similar algorithms that combine the value from a cell in an input array with the values in its neighbor cells to make a new value for the same cell of an output array. A 2D stencil operates on the 9-cell neighborhood comprising a(i,j) and its eight neighbors (a(i-1,j), a(i+1,j), and so on). A 3D stencil operates on the 27-cell neighborhood around a(i,j,k). Conway’s Game of Life and other cellular automata are usually implemented using a stencil algorithm, as are many graphics transformations such as smoothing and edge-finding. Stencil algorithms are both CPU-intensive and memory-intensive, and are challenging subjects for compiler optimizations.
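
For illustration, a minimal C sketch of a 2D nine-point stencil (not from the text; it assumes arrays declared with a fixed second dimension of 1000, n no larger than 1000, and a simple averaging weight):

    /* Each interior output cell is the average of the corresponding input
       cell and its eight neighbors (a smoothing stencil). */
    void smooth9(double out[][1000], double in[][1000], int n)
    {
        int i, j;
        for (i = 1; i < n - 1; i++)
            for (j = 1; j < n - 1; j++)
                out[i][j] = (in[i-1][j-1] + in[i-1][j] + in[i-1][j+1] +
                             in[i  ][j-1] + in[i  ][j] + in[i  ][j+1] +
                             in[i+1][j-1] + in[i+1][j] + in[i+1][j+1]) / 9.0;
    }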


stride The distance between the memory addresses of array elements that are touched in a loop. A stride-one loop touches successive array elements, and hence scans memory consecutively. This uses cache memory most efficiently because all of a cache line is used before the next cache line is fetched, and the loop never returns to a cache line after using it. Strides greater than one are less efficient in memory use, but are easy to create accidentally given Fortran array semantics. The compiler can sometimes use loop nest optimization or loop interchange to shorten the stride.

superlinear speedup When adding more CPUs to parallel execution of a program, the program normally speeds up in conformity with Amdahl’s Law. However, when adding CPUs accidentally relieves some other bottleneck, the speedup can exceed the number of CPUs added. This is superlinear speedup: improvement out of proportion to the hardware added.

superscalar In a RISC CPU, the ability to execute more than one instruction per clock cycle, achieved by having multiple parallel execution units. The MIPS R10000 CPU routinely achieves instruction rates of 1.5 to 2 times its clock rate. It uses five independent execution units, and is able to execute instructions out of order, so that it can complete some instructions while waiting for the memory operands of others to arrive. See also speculative execution and software pipelining.

thread Loosely, any independent execution of program code. A thread is most commonly represented by an IRIX process or by a POSIX thread using the pthread library. When discussing parallelization of a program, it is usually assumed that there are as many CPUs as the program has threads, so all threads can execute concurrently. This is not assumed in the general case; the threads of a multithreaded program are dispatched by IRIX as CPUs are available, and some may run while others wait. See also kthread.

topology The ways in which the SN0 nodes are connected in general (see also hypercube), and in particular the relationship between the nodes in which the various threads of a parallel program are executed. Typical program topologies are cluster (which minimizes the distance between nodes), cube, and hypercube.


translation lookaside buffer (TLB) A table within the CPU that associates virtual addresses with physical page addresses. The TLB has limited size, but all entries are searched in parallel, simultaneously with the decoding of the instruction (hence “lookaside”). When the operand address of an instruction is found in the TLB, the CPU can present the needed physical address to the memory without delay. (The TLB is in essence a cache for page table entries.) When the TLB lookup fails, the CPU traps to an OS routine that locates the desired virtual page in a large, in-memory page table that defines the virtual address space of the process. When that lookup succeeds, the OS loads one entry of the TLB to address the page, and resumes execution. When that lookup fails, the process has suffered a page fault.

tuning Finding the executable version of a program that uses the least of a critical resource on a given hardware platform. The critical resource is usually time; the tuned program runs fastest. However, tuning for least memory or I/O use is also possible. Tuning is done by altering the program’s source code to use a different algorithm; by compiling the program with a different compiler or with different compiler options to get a more highly optimized executable; or by changing the number of CPUs or the placement of memory at execution time.

vector intrinsic A Fortran (usually) math function that is implemented using hardware vector operations. The compiler can substitute vector intrinsic calls automatically.

Index

Numbers C

64-bit address space, 46 cache and hardware event counter, 279 blocking, 149-153, 167-170 A cache miss, 136 coherent, 12, 282 adi2 example program, 288 compiler’s model of, 170 aliasing models, 109-114 contention in, 199 Amdahl’s law, 188-192 correcting, 203-205 awk script for, 300 event 31 reveals, 200, 202 execution time given n and p, 192 diagnosing problems in, 142-148, 199-205 parallel fraction p, 190 directory-based, 11, 13 parallel fraction p given speedup(n), 191 false sharing of, 200-203 speedup(n) given p, 190 L1, 22, 135, 279 superlinear speedup, 190 L2, 23, 135, 280 line size, 136 application binary interface (ABI), 46-47 data structure blocking for, 258 64-bit, 47 on-chip, 22 new 32-bit, 47 operation of, 12, 13-16, 135 old 32-bit, 46 principles of use, 138 arithmetic error, 95-99 proper use of, 138-142, 148-156 array padding, 142, 145, 180 array padding, 180 auto-parallelizing, 193 blocking data for, 149-153, 167 grouping related data for, 139 loop fusion for, 148 B parallel execution issues, 199-205 stride-one access for, 138 Bentley, Jon, 257 transposition for, 153-156 set-associative, 135 thrashing in, 141 snoopy, 13 thrashing, 140-142, 144-145


cache coherence -mips3, 47 and hardware event counter, 282 -mips4, 47, 90 cache coherency, 12-16 -n32, 47, 90 cache line, 136 -O2, 90 -O3 call hierarchy profile, 76 for SWP, 108 compiler directive -Ofast See directive versus -O3, 94 compiler feedback file, 75 -Olimit, 132 compiler flag -On, 93-94 See compiler option -OPT compiler option alias, 110-114 recommended, 90-99 cray_ivdep, 118 -32, 46 IEEE_arithmetic, 90, 96 -64, 47 IEEE_NaN_inf, 97 -apo, 193 liberal_ivdep, 117 -check_bounds, 49, 181 reorg_common, 181 -clist, 158 roundoff, 98 default, 90 -r10000, 49, 95 -fb, 75, 125 -r5000, 49, 95 -flist, 158 -r8000, 49, 95 for cache model, 170 roundoffWhen, 271 IEEE_arithmetic, 271 -S, 105 -INLINE, 129, 133 -static, 48 -IPA, 126 -TARG, 95 forcedepth, 133 -TENV, 81 inline, 133 X, 123 space, 133 -trapuv, 48 -LNO, 157-186 copying blocking, 168 to reduce TLB thrashing, 146 fission, 173 correctness, 46 fusion, 173 CPU gather_scatter, 183 See MIPS CPU ignore_pragmas, 158 CrayLink, 12 interchange=off, 166 outer_unroll, 164 prefetch, 177 vintr, 184


D dlook, 86-87 dplace, 243-255 data distribution, 222-239 disables data distributiondirectives, 240 and dplace, 240 enable migration with, 245 directives for, 222 library interface to, 252 Distribute directive, 223-227 not for use with libmp, 244 mapping types, 225 placement file, 248-252 ONTO clause, 227 distribute statement, 250 page placement, 231 memories statement, 249 redistribution, 230 threads statement, 249 reshaped, 232-239 set page size with, 147, 244 restrictions, 237 specify topology with, 246 data placement, 205-221 with MPI, 254 for libmp programs, 206 dprof, 84-86 modifying code for, 215 dynamic page migration, 30, 209-215, 245 DAXPY, 99-104 administration, 210 and alias model, 110 enabling, 209-215 loop fusion of, 148 with indirection, 115 debugging E possible with -O2, 93 use -O0 for, 94 environment variable dependency, 115-118 _DSM_MIGRATION, 209, 213 directive _DSM_PPM, 255 blocking size, 168 _DSM_ROUND_ROBIN, 208 for data distribution, 194, 222-239 _DSM_VERBOSE, 239 Distribute, 223 for SpeedShop, 202 page place, 231 in dplace placement file, 248 syntax, 222 MP_SET_NUMTHREADS, 193 for loop fission, 173 MPI_DSM_OFF, 254 for loop fusion, 173 PAGESIZE_*, 147 for loop interchange, 166 SGI_ABI, 47 for loop nest optimizer, 158 SpeedShop use of, 66 for loop unrolling, 164 TRAP_FPE, 82 for parallel execution, 194 event counter affinity clause, 224, 228 See hardware event counter nest clause, 229 for prefetching, 177 ivdep, 116 OpenMP, 194


exception instruction counts, 275 event counter overflow, 273 lock instructions, 284 from speculative execution, 121 profiling from, 65, 66 handling, 81 TLB miss, 284 profiling occurrence of, 81 hardware graph, 251 TLB miss, 136 hardware trap underflow, 81 See exception, page fault, TLB exception profile, 81 hub, 7, 11 cache coherency support, 14 hypercube, 8, 9 F false sharing, 16, 200-203 I fast fourier transform (FFT), 153-156 data placement for, 219 ideal time profile, 68 feedback file, 75 IEEE 754, 95 use of, 124 versus optimization, 96 FFT IEEE arithmetic, 95-99 See fast fourier transform (FFT) inlining, 128-133 first-touch placement, 40, 216-221 automatic versus manual, 128 floating-point exception manual with -INLINE, 129 See exception instruction scheduling, 95, 100-104 floating-point status register (FSR), 82 instruction set architecture (ISA) MIPS I, 46 MIPS II, 46 G MIPS III, 46, 47 MIPS IV, 21, 47 graduated instruction, 277 interprocedural analysis (IPA), 125-134 applied during link step, 126 features of, 125 H requesting, 126 hardware event counter, 273-285 -IPA branch instructions, 278 See compiler option, -IPA cache coherency, 282 IRIX cache use, 279 memory management in, 27-44 clock cycles, 275 porting to, 48 event 21, 76, 198 event 31, 66, 198, 200, 202 event 4, 198


L loop interchange, 165-167 disabling, 166 lazy evaluation, 259 loop nest optimizer (LNO), 157-186 ld cache blocking by, 167-170 performs IPA, 126 controlling, 168 library disable loop transformation, 158 BLAS, 50, 51 gather-scatter by, 182 CHALLENGEcomplib, 49, 50 loop fission by, 171-173 EISPACK, 50 loop fusion by, 170-173 LAPACK, 50, 51 loop interchange, 165-167 libc, 49 loop unrolling, 159-164 libfastm, 49, 50, 90 prefetching by, 175 libfpe, 81, 82 requesting, 158 libmp, 193 transformed source file, 158 conflicts with dplace, 244 vector intrinsic transformation, 183 data placement with, 206 loop peeling, 170 page migration with, 209, 213 loop unrolling page size control, 147 and roundoff, 97 round-robin placement with, 208 and SWP, 159 LINPACK, 50 by loop nest optimizer (LNO), 159-164 SCSL, 49, 51 with loop interchange, 166 library routine bzero, 270 calloc, 270 M dplace_file, 252 dplace_line, 252 makefile dsm_home_threadnum, 240 example, 292 handle_sigfpes, 81 use of, 92 sasum, 233 math libraries, 49 sscal, 233 vector intrinsics, 49 -LNO matrix multiply See loop nest optimizer (LNO) and compiler option loop unrolling of, 159 -LNO memory use in, 150 loop fission, 171 performance of, 149 loop fusion matrix multipy by LNO, 170 cache blocking of, 167 manual, 148


memory R4000, 47 64-bit addressing, 46 R8000, 47, 108, 122 administrator setup, 148, 210 underflow ignored on, 81 bus-based, 1, 4 specify to compiler, 49 cache directory bits, 11 speculative execution, 25 contention for, 16 superscalar features, 20 distributed versus shared, 1-3 See also hardware event counter error correction bits, 11 MIPS IV ISA, 21 hierarchy, 135-142 and IEEE 754, 96 latency of, 18, 137 prefetch in, 138 locality management, 31-42 MPI management by IRIX, 27-44 See Message-Passing Interface (MPI) paged virtual, 136 mpirun page fault, 137 with perfex, 62 parallel execution tuning, 198 physical address display, 302 MP library placement See library,libmp first-touch, 40, 216-221 round-robin, 41, 207-212 N prefetching, 137, 174 stride, 138 virtual, 136 node, 6, 10-12 See also page CPU in, 11 memory locality domain (MLD), 31, 36 nonuniform memory access (NUMA), 9, 27-28 and parallel program, 42 memory locality domain set (MLDS), 37 and single-threaded program, 42 Message-Passing Interface (MPI), 195 numeric error, 95-99 dplace with, 254 perfex with, 62 MIPS CPU O architecture of, 20-25, 138 event counters in, 273 OpenMP directives, 194 issued versus graduated instruction, 277 C pragmas for, 195 off-chip cache, 23 -OPT on-chip cache, 22 See compiler option, -OPT out-of-order execution, 24 R10000 optimization level, 93-94 speculative execution, 122 out of order execution, 24 underflow control, 82


P performance techniques algebraic identities, 264, 271 packing, 259 array padding, 142, 180 page, 136 avoiding tests, 261 migration of, 30, 209-215, 245 cache blocking, 149, 167 replication of, 30 caching, 138 size of, 30, 38, 42, 147 code motion, 260 set with dplace, 244 combining related functions, 272 valid sizes, 148 common block padding, 125 page fault, 137 common subexpressions, 272 constant propagation, 125 parallel execution copying, 146 affinity clause, 224, 228 coroutines, 268 Amdahl’s law, 188 data structure augmentation, 257 auto-parallizing, 193 dead function elimination, 125 data placement for, 205 dead variable elimination, 126 memory access tuning for, 198-205 gather-scatter, 182 nest clause, 229 inlining, 125, 267 parallel fraction p, 190, 197 interpreters, 260 programming models for, 194-196 lazy evaluation, 259 scalability of, 4, 205 loop fission, 171 topology, 246 loop fusion, 148, 170, 263 tuning SN0 for, 196 loop interchange, 165 perfex, 54-62 loop unrolling, 159, 261 absolute event counts, 54 packing, 259 analytic output, 56-61 precomputation, 258, 266 awk script to parse, 298 prefetching, 174-179 cache use analysis, 144 recursion elimination, 268 library interface, 61 short-circuiting, 264 statistical counts, 55 software pipelining, 99 performance speculative execution, 121 aphorisms about, 257-272 transposition, 153 of matrix multiply, 149 PIC, 126 of parallel program, 42 policy module (PM), 31, 38 of single-threaded program, 42 Portable Virtual Machine (PVM), 195 POSIX threads, 196 pragma See directive


precomputation, 258 S prefetching, 137, 174-179 controlling, 177 scalability, 4 manual, 178 and bus architecture, 4 overhead of, 175 and data placement, 205 pseudo, 176 and shared memory, 5 prof smake, 92 default report, 67 SN0 feedback file, 75 CrayLink, 12 ideal time report, 69 hub, 7, 11 line numbers off with opt, 74 Input/Output, 16-18 option -archinfo, 75 latencies, 18 option -butterfly, 77 node, 6, 10-12 option -feedback, 75, 125 router, 6 option -heavy, 67, 73 XIO, 7, 12 option -lines, 74 SN0 architecture, 1-26 simplifying report, 72 building blocks of, 6 profiling hypercube, 8, 9 address space usage, 82 nonuniform memory access (NUMA), 9 cache usage, 142-148 snoopy cache, 13 call hierarchy, 76-81 software pipelining (SWP), 99-109 ideal time for, 68-76, 143 compiler report in opcode counts, 75 script to extract, 294 sampling for, 63-68, 143 compiler report in .s, 105, 159 tools for, 53 dereferenced pointer defeats, 119 program correctness, 46 effect of alias model, 110 enable with -O3, 108 failure cause, 108 R global variables defeat, 118 loop unrolling with, 159 R4000 of DAXPY loop, 101 See MIPS CPU speculative execution, 25, 121-124 R8000 hardware driven, 122 See MIPS CPU software-driven, 121 R10000 speedshop, 62-82 See MIPS CPU sample time bases, 63 roundoff, 97 See also prof, ssrun round-robin placement, 41, 207-212


ssrun U exception trace, 81 experiment types, 63 uninitialized variable, avoiding, 48 ideal time trace, 69, 125 output filename format, 66 shell script to run, 297 V usertime experiment, 79 using, 65 vector intrinsic function, 49 stride, 138 and LNO, 183 superlinear speedup, 190 virtual memory, 136 superscalar, 20 -SWP X See compiler option, -SWP swplist shell script, 105 XIO, 7, 12 system routine mmap, 196, 270 sproc, 196 Z sysmp, 255 syssgi, 240 zero-fill, 270

T thread, 196 TLB See translate lookaside buffer (TLB) translate lookaside buffer (TLB), 136-137 miss, 136 hardware counter, 284 thrashing elimination, 145-148 copying, 146 larger page size, 147 transposition, 153-156 trap See exception


Tell Us About This Manual

As a user of Silicon Graphics products, you can help us to better understand your needs and to improve the quality of our documentation.

Any information that you provide will be useful. Here is a list of suggested topics:
• General impression of the document
• Omission of material that you expected to find
• Technical errors
• Relevance of the material to the job you had to do
• Quality of the printing and binding

Please send the title and part number of the document with your comments. The part number for this document is 007-3430-002.

Thank you!

Three Ways to Reach Us
• To send your comments by electronic mail, use either of these addresses:
  – On the Internet: [email protected]
  – For UUCP mail (through any backbone site): [your_site]!sgi!techpubs
• To fax your comments (or annotated copies of manual pages), use this fax number: 650-932-0801
• To send your comments by traditional mail, use this address:
  Technical Publications
  Silicon Graphics, Inc.
  2011 North Shoreline Boulevard, M/S 535
  Mountain View, California 94043-1389