Front cover The POWER4 Introduction and Tuning Guide

Comprehensive explanation of POWER4 performance

Includes code examples and performance measurements

How to get the most from the compiler

Steve Behling, Ron Bell, Peter Farrell, Holger Holthoff, Frank O’Connell, Will Weir

ibm.com/redbooks

International Technical Support Organization

The POWER4 Processor Introduction and Tuning Guide

November 2001

SG24-7041-00

Take Note! Before using this information and the product it supports, be sure to read the general information in “Special notices” on page 175.

First Edition (November 2001)

This edition applies to AIX 5L for POWER Version 5.1 (program number 5765-E61), XL Fortran Version 7.1.1 (5765-C10 and 5765-C11) and subsequent releases running on an IBM eServer pSeries POWER4-based server. Unless otherwise noted, all performance values mentioned in this document were measured on a 1.1 GHz machine, then normalized to 1.3 GHz.

Note: This book is based on a pre-GA version of a product and may not apply when the product becomes generally available. We recommend that you consult the product documentation or follow-on versions of this redbook for more current information.

Comments may be addressed to: IBM Corporation, International Technical Support Organization Dept. JN9B Building 003 Internal Zip 2834 11400 Burnet Road Austin, Texas 78758-3493

When you send information to IBM, you grant IBM a non-exclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.

© Copyright International Business Machines Corporation 2001. All rights reserved. Note to U.S. Government Users – Documentation related to restricted rights – Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.

Contents

Figures ...... vii

Tables ...... ix

Preface ...... xi The team that wrote this redbook...... xii Notice ...... xiii IBM trademarks ...... xiv Comments welcome...... xiv

Chapter 1. Processor evolution ...... 1 1.1 POWER1 ...... 1 1.2 POWER2 ...... 2 1.3 PowerPC ...... 2 1.4 RS64 ...... 3 1.5 POWER3 ...... 4 1.6 POWER4 ...... 4

Chapter 2. The POWER4 system ...... 5 2.1 POWER4 system overview...... 5 2.2 The POWER4 chip ...... 6 2.3 Processor overview ...... 8 2.3.1 The POWER4 processor execution ...... 9 2.3.2 Instruction fetch, group formation, and dispatch ...... 9 2.3.3 Instruction execution, speculation, rename resources ...... 11 2.3.4 Branch prediction ...... 12 2.3.5 Translation buffers (TLB, SLB, I- and D-ERAT) ...... 13 2.3.6 Load instruction processing ...... 13 2.3.7 Store instruction processing ...... 14 2.3.8 Fixed-point execution pipeline...... 15 2.3.9 Floating-point execution pipeline...... 15 2.3.10 Group completion ...... 16 2.4 Storage hierarchy...... 16 2.4.1 L1 instruction ...... 17 2.4.2 L1 data cache ...... 17 2.4.3 L2 cache ...... 17 2.4.4 L3 cache ...... 18 2.4.5 Interconnecting chips to form larger SMPs ...... 18 2.4.6 Multiple module interconnect ...... 19

2.4.7 Memory subsystem ...... 20 2.4.8 Hardware data prefetch...... 21 2.4.9 Memory/L3 cache command queue structure...... 22 2.5 I/O structure ...... 23 2.6 The POWER4 Performance Monitor ...... 23

Chapter 3. POWER4 system performance and tuning ...... 25 3.1 Tuning for numerically intensive applications ...... 25 3.1.1 The tuning process for numerically intensive applications ...... 26 3.1.2 Hand tuning overview for numerically intensive programs ...... 26 3.1.3 Key aspects of the POWER4 design ...... 27 3.1.4 Tuning for the memory subsystem ...... 34 3.1.5 Tuning for the FPUs ...... 40 3.1.6 Cache and memory latency measurement ...... 47 3.1.7 Selected fundamental kernel performance within on-chip cache . . . 49 3.1.8 Other tuning considerations ...... 51 3.2 Tuning non-floating point applications ...... 52 3.2.1 The load/store and integer units ...... 52 3.2.2 Memory configurations ...... 53 3.3 System tuning ...... 54 3.3.1 POWER4 virtual memory architecture overview ...... 54 3.3.2 Small and large page sizes ...... 58 3.3.3 AIX system parameters...... 61 3.3.4 Minimizing variation in job performance ...... 67

Chapter 4. Optimizing with the compilers...... 69 4.1 POWER4-specific compiler options ...... 69 4.1.1 General performance options ...... 70 4.1.2 Options for POWER4 ...... 75 4.1.3 Using XL Fortran vector-intrinsic functions ...... 76 4.1.4 Recommended options ...... 79 4.1.5 Comparing C and Fortran compiler code generation ...... 79 4.2 XL Fortran compiler directives for tuning ...... 80 4.2.1 Prefetch directives...... 81 4.2.2 Loop-related directives ...... 82 4.2.3 Cache and other directives ...... 83 4.3 The object code listing ...... 84 4.4 Basic coding practices for performance ...... 88 4.4.1 Language-independent tips...... 88 4.4.2 Fortran tips ...... 89 4.4.3 C and C++ tips ...... 89 4.4.4 Inlining procedure references ...... 90 4.4.5 Structuring code for optimal grouping ...... 91

4.5 Tuning for 64-bit integer performance...... 91

Chapter 5. General tuning guidelines ...... 93 5.1 Hand tuning code ...... 93 5.1.1 Local or global variables? ...... 93 5.1.2 Pointers ...... 94 5.1.3 Expressions...... 94 5.1.4 Data type conversions...... 95 5.1.5 Tuning loops ...... 95 5.2 Using pre-tuned code ...... 101 5.3 The performance monitor ...... 101 5.4 Tuning for I/O ...... 107 5.5 Locating hot spots (profiling) ...... 110

Chapter 6. Performance libraries ...... 113 6.1 The ESSL and Parallel ESSL libraries ...... 114 6.1.1 Capabilities of ESSL and Parallel ESSL ...... 115 6.1.2 Performance examples using ESSL ...... 115 6.2 The MASS libraries ...... 117 6.2.1 Installing and using the MASS libraries...... 117 6.2.2 Description and performance of MASS libraries ...... 119 6.3 Modular I/O (MIO) library ...... 120 6.4 Sparse Matrix Package (WSMP) ...... 122

Chapter 7. Parallel programming techniques and performance ...... 125 7.1 Shared memory parallelization ...... 126 7.1.1 SMP runtime behavior...... 126 7.1.2 Shared memory parallel examples ...... 129 7.1.3 Automatic shared memory parallelization ...... 130 7.1.4 Directive-based shared memory parallelization ...... 131 7.1.5 Measured SMP performance ...... 132 7.2 MPI in an SMP environment ...... 133 7.3 Programming with threads ...... 137 7.3.1 Basic concepts ...... 137 7.3.2 Coding and performance considerations ...... 143 7.3.3 The best approach for shared memory parallelization ...... 147 7.4 Parallel programming with shared caches ...... 148

Chapter 8. Application performance and throughput ...... 153 8.1 Memory to memory copy ...... 155 8.2 Memory bandwidth limited throughput ...... 157 8.3 MPI parallel on pSeries 690 and SP ...... 159 8.4 Multiple job throughput ...... 160 8.4.1 ESSL DGEMM throughput performance...... 161

8.4.2 Multiple ABAQUS/Explicit job streams ...... 161 8.4.3 Memory stress effects on throughput ...... 162 8.4.4 Shared L2 cache and logical partitioning (LPAR) ...... 165 8.5 Genetic sequencing program ...... 168 8.6 FASTA genetic sequencing program ...... 168 8.7 BLAST genetic sequencing program ...... 169

Related publications ...... 171 IBM Redbooks ...... 171 Other resources ...... 171 Referenced Web sites ...... 172 How to get IBM Redbooks ...... 173 IBM Redbooks collections...... 173

Special notices ...... 175

Abbreviations and acronyms ...... 177

Index ...... 183

Figures

2-1 The POWER4 chip...... 7 2-2 The POWER4 processor ...... 8 2-3 The execution pipeline ...... 9 2-4 A logical view of the interconnection buses within an MCM ...... 18 2-5 Logical view of MCM-to-MCM interconnections...... 19 2-6 Multiple MCM interconnection ...... 20 2-7 Hardware data prefetch operations ...... 21 2-8 I/O structure ...... 23 3-1 The POWER4 L1 data cache...... 29 3-2 POWER4 data transfer rates for multiple prefetch streams...... 37 3-3 Outer loop unrolling effects on matrix-vector multiply (1.1GHz system) 44 3-4 Latency in machine cycles to access N of random data ...... 48 3-5 32-bit environment segment register usage...... 55 3-6 POWER address translation ...... 56 3-7 Translation of 64-bit effective address to 80-bit virtual address...... 57 4-1 Integer computation: B(I)=A(I)+C...... 92 6-1 ESSL DGEMM single processor GFLOPS ...... 116 7-1 Shared memory parallel job flow ...... 127 8-1 Memory copy performance ...... 155 8-2 C library memcpy performance ...... 156 8-3 System memory throughput for pSeries 690 HPC...... 157 8-4 System memory throughput on pSeries 690 Turbo ...... 158 8-5 Job throughput effects on a 375 MHz POWER3 SMP High Node. . . . 163 8-6 Job throughput effects on an eight-way pSeries 690 HPC ...... 163 8-7 Job throughput effects on a 32-way pSeries 690 Turbo ...... 164

Tables

1-1 Comparative POWER3-II, RS64-III, and POWER4 processor metrics . . 4 2-1 Issue queues ...... 10 2-2 Rename resources...... 11 2-3 Storage hierarchy organization and size ...... 16 3-1 Performance of various fundamental loops ...... 49 4-1 Vector-intrinsic function speedups ...... 77 6-1 DGEMM throughput summary ...... 116 6-2 Mass library functions and performance ...... 119 7-1 Loop A parallel performance elapsed time ...... 132 7-2 Loop B parallel performance elapsed time ...... 132 7-3 Loop C parallel performance elapsed time ...... 133 7-4 Advantages and disadvantages of message passing techniques . . . . 136 7-5 Shared memory cache results, pSeries 690 Turbo ...... 149 7-6 Counter and semaphore sharing cache line ...... 150 7-7 Counter and semaphore in separate cache line ...... 150 7-8 Heavily used shared cache line performance ...... 151 8-1 Memory copy performance relative to one CPU ...... 156 8-2 MPI performance results for AWE Hydra code ...... 159 8-3 Effects of running multiple copies of DGEMM ...... 161 8-4 Multiple ABAQUS/Explicit job stream times...... 162 8-5 FIRE benchmark: Impact of shared versus non-shared L2 cache. . . . 166 8-6 FIRE benchmark: Uniprocessor, single job versus partitioning ...... 167 8-7 FIRE benchmark: Throughput performance versus partitioning . . . . . 167 8-8 Performance on different systems ...... 168 8-9 Relative performance of FASTA utilities ...... 168 8-10 Blastn results ...... 169 8-11 Tblastn results ...... 169

Preface

This redbook is designed to familiarize you with the IBM eServer pSeries POWER4 and to provide you with the information necessary to exploit the new high-end servers based on this architecture.

The eight to 32-way symmetric multiprocessor (SMP) pSeries 690 Model 681 will be the first POWER4 system to be available. Thus, most analysis presented in this publication refers to this system.

Specifically, this publication will address the following issues:  POWER4 features and capabilities  Processor and memory optimization techniques, especially for Fortran programming  AIX XL Fortran Version 7.1.1 compiler capabilities and which options to use  Parallel processing techniques and performance  Available libraries and programming interfaces  Performance examples of commonly used kernels

The anticipated audience for this redbook is as follows:  Application developers  Technical managers responsible for equipment purchase decisions  Managers responsible for project planning  Researchers involved in numerical algorithm development  End users with an interest in understanding the performance of their applications

While this publication is decidedly technical in nature, the fundamental concepts are presented from a user point of view and numerous examples are provided to reinforce these concepts.

The team that wrote this redbook

This redbook was produced by a team of specialists from around the world working at the International Technical Support Organization, Austin Center.

Stephen Behling is an Application Specialist based in Minneapolis, Minnesota, USA. He has 27 years of experience in the computer aided engineering field. He has specialized in high performance computing for the past 15 years with Cray Research Inc., Silicon Graphics, Inc., and since 1999, IBM. His areas of expertise include computational fluid dynamics, parallel programming, benchmarking and performance tuning.

Ron Bell is an IBM IT Consultant in the UK. He has an MA in Physics and a DPhil in Nuclear Physics from the University of Oxford. He has 30 years of experience with IBM High Performance Computing. His areas of expertise include the Fortran language, performance tuning for POWER architecture, and MPI parallel coding and tuning for the RS/6000 SP. He has for many years collaborated with HKS Inc. to optimize their ABAQUS product for IBM platforms.

Peter Farrell is a Consulting IT Specialist in Australia. He has worked with computer systems for over 30 years and has extensive experience in operating systems and networking development. He has over 15 years experience in UNIX technical support with a special interest in system performance. He has worked at IBM for two years and before that was with Sequent Computer Systems. His current areas of responsibility include benchmarking and performance tuning.

Holger Holthoff is an IBM IT Consultant in Germany. He has been involved in parallel computing on RS/6000 SP since he joined the IBM Scientific Center, Heidelberg in 1994. Currently, he is a member of the pSeries Technical Support group focusing on high-performance computing projects and CAE applications in manufacturing industries. His areas of expertise include performance tuning for the POWER architecture and message passing programming for the RS/6000 SP.

Frank O’Connell is a Senior Technical Staff Member in the Future Processor Performance Department, where he has been a member of IBM’s high-performance processor development effort since 1992. For the past 15 years, he has focused on scientific and technical computing performance within IBM, including systems design, operating system and compiler performance, algorithm development, and application tuning, in the capacity of both product development and customer support. Mr. O'Connell received a B.S.M.E. degree from the University of Connecticut and an M.S. degree in engineering–economic systems from Stanford University.

Will Weir is an IBM IT Specialist in the UK. He has worked on scientific applications and systems on IBM RS/6000 and RS/6000 SP for 11 years. Currently he is a member of the High Performance Computing team based in IBM Bedfont. His areas of expertise include application porting and benchmarking, and RS/6000 SP systems.

The project that produced this Redbook was managed by: Scott Vetter from IBM Austin

A special thanks to Steve White from IBM Austin, for his determination and pursuit of excellence.

Thanks to the following people for their contributions to this project:

Arthur Ban, Bill Hay, Harry Mathis, John McCalpin, Alex Mericas, William Starke, Steve Stevens, Joel Tendler from IBM Austin
Joan McComb and Frank Johnston from IBM Poughkeepsie
Richard Eickemeyer from IBM Rochester
Bob Blainey, Ian McIntosh, and Kelvin Li from IBM Toronto

Notice

This publication is intended for developers of numerically intensive code for the IBM eServer pSeries POWER4, for business partners and sales specialists wanting supporting metrics for pSeries 690 Model 681, and for technical specialists who require detailed product information to help demonstrate IBM’s industry-leading technology. See the PUBLICATIONS section of the IBM Programming Announcement for Fortran Version 7.1.1 for more information about what publications are considered to be product documentation.

IBM trademarks

The following terms are trademarks of the International Business Machines Corporation in the United States and/or other countries:

AIX®, AIX 5L™, e (logo)®, IBM®, iSeries™, LoadLeveler®, PowerPC®, PowerPC Architecture™, PowerPC 601®, PowerPC 603™, PowerPC 604™, pSeries™, Redbooks™, Redbooks Logo, RISC System/6000®, RS/6000®, Sequent®, SP™

Comments welcome

Your comments are important to us!

We want our IBM Redbooks to be as helpful as possible. Send us your comments about this or other Redbooks in one of the following ways:  Use the online Contact us review redbook form found at: ibm.com/redbooks  Send your comments in an Internet note to: [email protected]  Mail your comments to the address on page ii.


Chapter 1. Processor evolution

In this section, the stages of RS/6000 and pSeries processor development are discussed, starting with the POWER1 architecture through to the latest POWER4.

1.1 POWER1

The first RS/6000 products were announced by IBM in February of 1990, and were based on a multiple chip implementation of the POWER architecture, described in IBM RISC System/6000 Technology, SA23-2619. This technology is now commonly referred to as POWER1, in the light of more recent developments. The models introduced included an 8 KB instruction cache (I-cache) and either a 32 KB or 64 KB data cache (D-cache). They had a single floating-point unit capable of issuing one compound floating-point multiply/add (FMA) operation each cycle, with a latency of only two cycles. Therefore, the peak MFLOPS rate was equal to twice the MHz rate. For example, the Model 530 was a desk-side system operating at 25 MHz, with a peak performance of 50 MFLOPS. Commonly occurring numerical kernels were able to achieve performance levels very close to this theoretical peak.

In January of 1992, the Model 220 was announced, based on a single chip implementation of the POWER architecture, usually referred to as RISC Single Chip (RSC). It was designed as a low-cost, entry-level desktop workstation, and contained a single 8 KB combined instruction and data cache.

The last POWER1 machine, announced in September of 1993, was the Model 580. It ran at 62.5 MHz and had a 32 KB I-cache and a 64 KB D-cache.

1.2 POWER2

Announced in September 1993, the first POWER2 machines included the 55 MHz Model 58H, the 66.5 MHz Model 590, and the 71.5 MHz Model 990. The most significant improvement introduced with the POWER2 architecture for scientific and technical applications was the floating-point unit (FPU) that was enhanced to contain two 64-bit execution units. Thus, two floating-point multiply/add instructions could be executed each cycle. A second fixed-point unit was also provided. In addition, several new hardware instructions were introduced with POWER2:  Quad-word storage instructions. The quad-word load instruction moves two adjacent double-precision values into two adjacent floating-point registers.  Hardware square root instruction.  Floating-point to integer conversion instructions.

Although the Model 590 ran with only a marginally faster clock than the POWER1-based Model 580, the architectural improvements listed above, combined with a larger 256 KB D-cache size, enabled it to achieve far greater levels of performance.

In October 1996, IBM announced the RS/6000 Model 595. This was the first machine to be based on the P2SC (POWER2 Super Chip) processor. As its name suggests, this was a single chip implementation of the POWER2 architecture, enabling the clock speed to be increased further. The Model 595 ran at 135 MHz, and the fastest P2SC processors, found in the Model 397 workstation and RS/6000 SP Thin4 nodes, ran at 160 MHz, with a theoretical peak speed of 640 MFLOPS.

1.3 PowerPC

The RS/6000 Model 250 workstation, the first to be based on the PowerPC 601 processor running at 66 MHz, was introduced in September, 1993. The 601 was the first processor arising out of the partnership between IBM, Motorola, and Apple. The PowerPC Architecture includes most of the POWER instructions. However, some instructions that were executed infrequently in practice were excluded from the architecture, and some new instructions and features were added, such as support for symmetric multiprocessor (SMP) systems. In fact, the 601 did not implement the full PowerPC instruction set, and was a bridge from

POWER to the full PowerPC Architecture implemented in more recent processors, such as the 603, 604, and 604e. Currently, the fastest PowerPC-based machines from IBM for technical purposes, the four-way SMP system RS/6000 7025 Model F50 and the uniprocessor system RS/6000 43P 7043 Model 150, use the 604e processor running at 332 MHz and 375 MHz, respectively. The POWER3 and POWER4 processors are also based on the PowerPC Architecture, and are discussed in the following sections.

1.4 RS64

The first RS64 processor was introduced in September of 1997 and was the first step into 64-bit computing for RS/6000. While the POWER2 product had strong floating-point performance, this series of products emphasized strong commercial server performance. It ran at 125 MHz with a 2-way associative, 4 MB L2 cache and had a 64 KB L1 instruction cache, a 64 KB L1 data cache, one floating-point unit, one load-store unit, and one integer unit. Systems were designed to use up to 12 processors. pSeries products using the RS64 were the first pSeries products to have the same processor and memory system as iSeries products.

In September 1998, the RS64-II was introduced. It was a different design from the RS64 and increased the clock frequency to 262 MHz. The L2 cache became 4-way set associative with an increase in size to 8 MB. It had a 64 KB L1 instruction cache, a 64 KB L1 data cache, one floating-point unit, one load-store unit, two integer units, and a short in-order pipeline optimized for conditional branches.

With the introduction of the RS64-III in the fall of 1999, this design was modified to use copper technology, achieving a clock frequency of 450 MHz, with a L1 instruction and data cache increased to 128 KB each. This product also introduced hardware multithreading for use by AIX. Systems were designed to use up to 24 processors.

In the fall of 2000, this design was enhanced to use silicon on insulator (SOI) technology, enabling the clock frequency to be increased to 600 MHz. The L2 cache size was increased to 16 MB on some models. Continued development of this design provided processors running at 750 MHz. The most recent version of this microprocessor was called the RS64-IV.

During the history of this family of products, top performance publications have been made for a large variety of benchmarks, including TPC-C (online transaction processing), SAP (enterprise resource planning - ERP), Baan (ERP), PeopleSoft (ERP), SPECweb (web serving), and SPECjbb (Java).

1.5 POWER3

The POWER3 processor brought together the fundamental design of the POWER2 microarchitecture, as currently implemented in the P2SC processor, with the PowerPC Architecture. It combined the excellent floating-point performance delivered by P2SC’s two floating-point execution units, while being a 64-bit, SMP-enabled processor ultimately capable of running at much higher clock speeds than current P2SC processors. Initially introduced in the fall of 1998 at a processor clock frequency of 200 MHz, most recent versions of this microprocessor incorporate copper technology and operate at 450 MHz.

1.6 POWER4

The new POWER4 processor, described in detail in Chapter 2, “The POWER4 system” on page 5, continues the evolution. The POWER4 processor chip contains two microprocessor cores, chip and system pervasive functions, core interface logic, a 1.41 MB level-2 (L2) cache and controls, the level-3 (L3) cache directory and controls, and the fabric controller that controls the flow of information and control data between the L2 and L3 and between chips.

Each microprocessor contains a 64 KB level-1 instruction cache, a 32 KB level-1 data cache, two fixed-point execution units, two floating-point execution units, two load/store execution units, one branch execution unit, and one execution unit to perform logical operations on the condition register. Instructions dispatched in program order in groups are issued out of program order to the execution units, with a bias towards oldest operations first. Groups can consist of up to five instructions, and are always terminated by a branch instruction. The processors on the first IBM POWER4-equipped servers, the IBM eServer pSeries 690 Model 681 servers, operate at either 1100 MHz or 1300 MHz.

A quick look at comparative metrics may help you put the capacity of the latest POWER-based processors in perspective, as provided in Table 1-1.

Table 1-1 Comparative POWER3-II, RS64-III, and POWER4 processor metrics

Metric        POWER3-II 450 MHz   RS64-III 450 MHz   POWER4 1300 MHz
SPECint2000   335.0               234.0              814.0
SPECfp2000    433.0               210.0              1169.0


Chapter 2. The POWER4 system

The POWER4 system is a new generation of high-performance 64-bit microprocessors and associated subsystems especially designed for server and supercomputing applications. POWER4 systems power the next generation of servers that will be the replacements for the POWER3 and RS64-series high-end RS/6000 and pSeries technical servers. This chapter provides details of the POWER4 system that are significant to application programmers concerned with understanding or improving application performance.

2.1 POWER4 system overview

The POWER4 system is a high-performance microprocessor and storage subsystem utilizing IBM’s most advanced semiconductor and packaging technology. It is the building block for the next-generation pSeries and iSeries SMP servers. The POWER4 system implements the PowerPC AS Processor Architecture, which specifies the instruction set, register set, and storage model, among other things: in other words, all functions that are visible to the programmer.

A POWER4 system logically consists of multiple POWER4 microprocessors and a POWER4 storage subsystem, interconnected together to form an SMP system. Physically, there are three key components: the POWER4 processor chip, the L3 Merged Logic DRAM (MLD) chip, and the memory controller chip.  The POWER4 processor chip contains two 64-bit microprocessors, a microprocessor interface controller unit, a 1.41 MB (1440 KB) level-2 (L2)

cache, a level-3 (L3) cache directory, a fabric controller responsible for controlling the flow of data and controls on and off the chip, and chip/system pervasive functions.  The L3 merged logic DRAM (MLD) chip, which contains 32 MB of L3 cache. An eight-way POWER4 SMP module will have 128 MB of L3 cache consisting of four modules each of which contains two 16 MB merged logic DRAM chips.  The memory controller chip features one or two memory data ports, each 16 bytes wide, and connects to the L3 MLD chip on one side and to the Synchronous Memory Interface (SMI) chips on the other.

The pSeries 690 Model 681 is built around the POWER4 Multi-chip Module (MCM) which contains four POWER4 chips. A 32-way SMP system contains four MCMs. POWER4 MCMs are mounted on system boards along with the L3, memory cards including the memory controllers, and support chips to form the heart of the pSeries 690 Model 681.

2.2 The POWER4 chip

The main components of the POWER4 chip are shown in Figure 2-1 on page 7. The POWER4 chip has a maximum of two microprocessors, each of which is a fully functional 64-bit implementation of the PowerPC AS Architecture specification. Also on the chip is a unified second-level cache, shared by both microprocessors through a core interface unit (CIU). The L2 cache is physically divided into three equal-sized parts, each having an L2 cache controller. The CIU connects each of the three L2 controllers to each processor through separate 32-byte wide data reload and instruction reload ports. Each microprocessor also has an 8-byte wide store port to the CIU that in turn is used to store data through the appropriate L2 controller.

Each processor also has an associated non-cacheable unit (NCU), shown in Figure 2-1 on page 7, responsible for handling instruction-serializing functions and performing any non-cacheable operations in the storage hierarchy. Logically, these are part of the L2 cache.

To improve performance by reducing the latency to memory, the directory for the level 3 cache (L3 cache) and its controller are also located on the POWER4 chip (while the actual L3 arrays are located on the L3 MLD module). Additionally, for I/O device communication, the GX controller and the two associated four-byte wide GX buses, one on-chip and one off-chip, are on the chip as well.

Figure 2-1 The POWER4 chip

Each POWER4 chip contains a fabric controller that provides master control of the network of buses. These buses connect together the on-chip L2 controllers, L3, other POWER4 chips, and other POWER4 modules, and also perform snooping and coherency duties. The fabric controller directs a point-to-point network between the four chips on the MCM, made up of unidirectional 16-byte wide buses running at half the processor frequency. It also manages the 8-byte buses, likewise operating at half the processor speed, that connect each chip to the corresponding chip on a neighboring MCM, and it controls the unidirectional 16-byte wide buses (running at 3:1 in the pSeries 690 Model 681) between the POWER4 chip and the L3 cache, as well as the buses to the NCU and GX controller.
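As a rough upper bound implied by these widths and ratios on a 1.3 GHz system: each 16-byte intra-MCM bus can carry 16 bytes x 650 MHz = 10.4 GB/s, each 8-byte module-to-module bus 8 bytes x 650 MHz = 5.2 GB/s, and each 16-byte chip-to-L3 bus 16 bytes x (1300/3) MHz, or roughly 6.9 GB/s. These are raw bus rates, not sustained application bandwidths.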

Although not related to performance, it is worth mentioning that the chip also includes an important set of pervasive functions. These include trace and debug facilities used for First Failure Data Capture, built-in self-test (BIST) facilities, performance monitoring unit (PMU), an interface to the service processor (SP) used to control the overall system, power-on reset (POR) sequencing logic, and error detection and logging circuitry.

2.3 Processor overview

Figure 2-2 shows a high-level block diagram of a POWER4 microprocessor. The POWER4 microprocessor is a high-frequency, speculative superscalar machine with out-of-order instruction execution capabilities. Eight independent execution units are capable of executing instructions in parallel providing a significant performance attribute known as superscalar execution. These include two identical floating-point execution units, each capable of completing a multiply/add instruction each cycle (for a total of four floating-point operations per cycle), two load-store execution units, two fixed-point execution units, a branch execution unit, and a conditional register unit used to perform logical operations on the condition register.
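With two floating-point units each able to complete a fused multiply/add every cycle, the theoretical peak rate is four floating-point operations per cycle, or 5.2 GFLOPS for a 1.3 GHz processor. As a minimal sketch (not taken from this guide; the routine name, the unroll depth of four, and the assumption that the compiler keeps the partial sums in registers are illustrative), a loop written with independent partial sums exposes enough parallel multiply/adds to keep both units busy:

    double dot(const double *x, const double *y, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += x[i]   * y[i];     /* each statement maps naturally to one FMA */
            s1 += x[i+1] * y[i+1];
            s2 += x[i+2] * y[i+2];
            s3 += x[i+3] * y[i+3];
        }
        for (; i < n; i++)           /* remainder iterations */
            s0 += x[i] * y[i];
        return (s0 + s1) + (s2 + s3);
    }

Loop unrolling and floating-point unit tuning are covered in detail in Chapter 3.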

Figure 2-2 The POWER4 processor

To keep these execution units supplied with work, each processor can fetch up to eight instructions per cycle and can dispatch and complete instructions at a rate of up to five per cycle. A processor is capable of tracking over 200 instructions in-flight at any point in time. Instructions may issue and execute out-of-order with respect to the initial instruction stream, but are carefully tracked so as to complete in program order. In addition, instructions may execute speculatively to improve performance when accurate predictions can be made about conditional scenarios.

2.3.1 The POWER4 processor execution pipeline

Figure 2-3 depicts the POWER4 processor execution pipeline. The deeply pipelined structure of the machine’s design is shown. Each small box represents a stage of the pipeline (a stage is the logic which is performed in a single processor cycle). Note that there is a common pipeline which first handles instruction fetching and group formation, and this then divides into four different pipelines corresponding to four of the five types of execution units in the machine (the CR execution unit, which is similar to the fixed-point execution unit, is not shown). All pipelines have a common termination stage, which is the group completion (CP) stage.

Figure 2-3 The execution pipeline

2.3.2 Instruction fetch, group formation, and dispatch The instructions that make up a program are read in from storage and are executed by the processor. During each cycle, up to eight instructions may be fetched from cache according to the address in the instruction fetch address register (IFAR) and the fetched instructions are scanned for branches (corresponding to the IF, IC, and BP stages in Figure 2-3).

Since instructions may be executed out of order, it is necessary to keep track of the program order of all instructions in-flight. In the POWER4 microprocessor, instructions are tracked in groups of one to five instructions rather than as individual instructions. Groups are formed in the pipeline stages D0, D1, D2, and D3. This requires breaking some of the more complex PowerPC instructions down into two or more simpler instructions.

Instructions that are broken into two internal instructions are named cracked instructions. Instructions that are broken into three or more internal instructions are named millicoded instructions. Common instructions that are cracked are:  All load/store-update forms (cracked into load/store + addi for gpr update)

 X-form fixed-point stores (cracked into an add + store)  Load algebraic (cracked into load + sign extend)  CR-logicals, except for destructive forms (RT=RB)

Common instructions that are millicoded are:  lmw, lswi (all multiples and string load instructions)  mtcrf (move to condition register fields, more than one target field)  mtxer and mfxer

Groups are formed that contain up to five internal instructions, each occupying an internal instruction slot (numbered 0 through 4) of a dispatch group. After a group is assembled, it is readied for dispatch, which is the process of sending the instructions as a group to the issue queues. As part of the dispatching operation, internal group instruction dependencies are determined and internal resources such as issue queue slots, rename registers, reorder queues, and mappers are assigned (GD and MP stages). Groups are dispatched and tracked using the 20-entry global completion table (GCT) in program order at a rate of up to one per cycle. (See discussion on completion in Section 2.3.10, “Group completion” on page 16).

Each internal instruction slot in a group feeds separate issue queues for the floating-point units, the branch execution unit, the CR execution unit, the logical CR execution unit, the fixed-point execution units and the load/store execution units. The fixed point and load/store execution units share common issue queues. Table 2-1 summarizes the depth of each issue queue and the number of queues available for each type of queue. For the floating-point issue queues and the common issue queues for the fixed point and load/store units, the issue queues fed from slots 0 and 3 of the instruction group hold instructions to be executed in one of the execution units, while the issue queues fed from slots 1 and 2 of the group feed the other execution unit. The CR execution unit draws its instructions from the CR logical issue queue fed from instruction slots 0 and 1.

Table 2-1 Issue queues

Queue type                          Entries per queue   Number of queues
Fixed point and load-store units    9                   4
Floating point                      5                   4
Branch execution                    12                  1
CR logical                          5                   2

During the issue stage (ISS), instructions that are ready to execute are pulled out of the issue queues and enter the access stage, where they access their source operands from registers. If more than two instructions from a particular queue are ready to execute, the issue logic attempts to issue the oldest instruction.

2.3.3 Instruction execution, speculation, rename resources

The speculative-execution design of the POWER4 microprocessor can execute instructions before it is certain that those instructions will be required to be executed. Speculative execution can significantly enhance performance by potentially eliminating stalls associated with waiting for a condition associated with a branch to be resolved.

It is important to understand the distinction between instructions that finish execution and instructions that complete. The term completion carries an important architectural meaning: completion makes results available to a program. An instruction or group of instructions may have been speculatively executed by the hardware, but unless they complete, their results are not visible to the program. (This, however, does not hold up speculative execution of instructions dependent on these results.)

A superscalar speculative-execution design requires an orderly way to manage machine resources and to flush instructions, along with affected registers, when predictions are found to be incorrect. The POWER4 microprocessor uses physical resources called rename registers throughout its design that are critical to this capability. Rename registers are assigned to instructions during the mapping (MP) stage and are typically released when the next instruction writing to the same logical resource (for example, the same architected general purpose register) is completed. At any point in time, a rename register may represent an architected register or a target buffer register. In the latter case, the register will be reclassified as an architected register upon successful completion of the instruction or released for reuse if the instruction is flushed. For each type of PowerPC register group, Table 2-2 lists the number of architected registers in the PowerPC specification and the corresponding number of physical (rename) registers in the POWER4 microprocessor.

Table 2-2 Rename resources

Resource type                                        Architected (PowerPC)   Physical
General-Purpose Register (GPR)                       32                      80
Floating-Point Register (FPR)                        32                      72
Condition Register (CR)                              eight 4-bit fields      32
Link/Count Register (LCR)                            2                       16
Floating-Point Status and Control Register (FPSCR)   1                       20
Fixed-Point Exception Register (XER)                 four fields             24

2.3.4 Branch prediction If an instruction sequence contains a conditional branch instruction, the conditional test associated with that branch directs the flow of execution, either to take the branch or to continue execution at the next sequential instruction. In the POWER4 microprocessor, all such conditional branches are predicted, and instructions are fetched and executed speculatively based upon that prediction. Instruction streams are scanned for branch instructions, and upon encountering a conditional branch, a prediction is made as to the outcome of its conditional test. This prediction is used to direct the fetching of instructions beyond the branch. If the prediction is correct, processing simply continues and the branch instruction completes normally. If, however, the prediction is incorrect, the instructions corresponding to the incorrect prediction are flushed and instruction fetching is redirected down the correct path, incurring a performance penalty of at least 12 cycles.

To make accurate predictions about the outcome of a conditional branch instruction, the POWER4 microprocessor tracks two different prediction methodologies simultaneously, and also tracks which method is predicting a particular branch more effectively, so that it may use the more successful prediction method for a given branch. The first employs a traditional branch history table, each entry of which corresponds to whether a given branch was taken or not taken. The second method attempts to predict the direction of a branch by using information about the path of execution that was taken to get to that branch. Both methods use a 16 K-entry table to hold their 1-bit prediction per branch, and a third 16 K-entry table holds the 1-bit selector indicating the preferred predictor for that branch. This combination of branch prediction methods produces very accurate predictions across a wide range of workload types. As branch instructions are executed and resolved, the branch history tables and the other predictors are updated to reflect the latest and most accurate information.

If the first branch encountered in a particular cycle is predicted as not taken and a second branch is found in the same cycle, the POWER4 processor predicts and acts on the second branch in the same cycle. In this case, the machine will register both branches as predicted, for subsequent resolution at branch execution, and will redirect the instruction fetching based on the second branch.

12 POWER4 Processor Introduction and Tuning Guide Dynamic branch prediction can be overridden by hint bits in the branch instructions. This is useful in cases where knowledge at the application level exists that can result in better predictions than the execution-time hardware prediction methods. It is accomplished by setting two previously reserved bits in conditional branch instructions, one to indicate a software override and the other to predict the direction. When these two bits are zero, the hardware branch prediction previously described is used. Since only reserved bits are used for this purpose, 100 percent binary compatibility with earlier software is maintained.

The POWER4 processor also has target address prediction logic for predicting the target of branch to link and branch to count instructions, which often have repeating and therefore predictable targets.

2.3.5 Translation buffers (TLB, SLB, I- and D-ERAT)

The PowerPC Architecture specifies a virtual storage model for applications, in which each program’s effective address (EA) space is a subset of a larger virtual address (VA) space that is managed by the operating system (see Section 3.3.1, “POWER4 virtual memory architecture overview” on page 54). Virtual addresses are, in turn, translated into real (physical) storage locations. Each POWER4 processor has three types of buffer caches to speed this process of translation: a translation look-aside buffer (TLB), a segment look-aside buffer (SLB), and an effective-to-real address table (ERAT). The SLB is a 64-entry, fully associative buffer for caching the most recent segment table entries (STEs). The TLB is a 1024-entry, four-way set-associative buffer for caching the most recent page table entries (PTEs). These page table entries may represent either the standard 4 KB page or a 16 MB large page. The POWER4 microprocessor also has separate ERATs for instructions (I-ERAT) and for data (D-ERAT), both of which are 128-entry, two-way set-associative arrays. The ERATs hold the most recent {EA,RA} pairs to facilitate the high-frequency, high-bandwidth design of the POWER4 microprocessor. Both ERATs are indexed using the effective address and require 10 cycles to reload from the TLB, assuming that the page’s EA-to-RA translation exists in the TLB. ERAT entries are always maintained on a 4 KB page basis.

2.3.6 Load instruction processing Load instructions execute in the LD/ST pipeline shown in Figure 2-3 on page 9. After a load instruction issues, it must generate the effective address of the operand being loaded or stored using the contents of the general-purpose registers specified along with the instruction. The RA stage is the cycle in which the registers are accessed and the EA cycle is the address generation stage, also called AGEN.

Chapter 2. The POWER4 system 13 To keep track of hazards associated with loads and stores executing out of order with respect to each other, two 32-entry queues exist: the load reorder queue (LRQ) and the store reorder queue (SRQ). All loads and stores are allocated an entry in these queues at dispatch, respectively. Loads and stores are checked against the entries in these tables to ensure that program correctness is maintained.

The cycle following the AGEN cycle is the DC cycle, in which the real address from the D-ERAT is obtained and the data cache is accessed for the appropriate cache line. If the DC cycle is successful, the data is formatted and written into a register, and it is ready for use by a dependent instruction. If a D-ERAT miss occurs, the instruction is rejected, but it is kept in the issue queue. Meanwhile a request is made to the TLB to reload the D-ERAT with the address translation information. The rejected instruction is then re-issued a minimum of 7 cycles after it was first issued. If the D-ERAT still does not contain the translation information, the instruction is again rejected. This process continues until the D-ERAT is reloaded.

In the case of loads, hits in the L1 data cache result in the requested bytes being formatted and written into the appropriate register. In the event of a cache miss, a request is initiated to the L2 cache to retrieve the line. Requests to the L2 cache are stored in the load miss queue (LMQ), which acts as a repository for all outstanding L1 cache line misses. The LMQ can hold up to eight requests to the L2 cache; hence each POWER4 microprocessor is capable of managing up to eight data cache line requests to the L2 cache (and beyond) at any given time, providing an effective mechanism for reducing the average latency of cache line reloads. If the LMQ is full, the load instruction that missed in the data cache is rejected and is re-issued again in a minimum of 7 cycles. If there is already a request to the L2 cache for the same line from another load instruction, the second request is merged into the same LMQ entry. If a third request to the same line occurs, the load instruction is rejected and processing continues as above. All reloads from the L2 cache check the LMQ to see if there is an outstanding request yet to be honored against a just-returned line. If there is, the requested bytes are forwarded to the register to complete the execution of the load instruction. After the line has been reloaded, the LMQ entry is released for reuse.

2.3.7 Store instruction processing Store instructions are assigned an entry in the SRQ during the issue stage for tracking by the real address of the stored data. The store data queue (SDQ) has 32 double-word entries and receives the data being stored in an entry corresponding to the address entry in the SRQ. Stores are removed from the SRQ and SDQ and the data is written to the L2 once it has been completed and all older stores have been successfully sent to the L2.

14 POWER4 Processor Introduction and Tuning Guide Loads that are to the same address as a previous store may be forwarded directly from the SDQ to the target register of the load, provided the data for the load is completely contained within the store operand and the data has not yet been written to the cache.

The L1 data cache is a store-through design: all data stored to cache lines that exist in the L1 data cache are also sent to the L2 cache to ensure that modifications to lines in the L1 cache are always reflected in corresponding lines in the L2 cache. If data is stored to a cache line that is not found in the L1 data cache, the data is simply transferred straight through to the L2 cache without establishing the cache line in the L1 data cache. All data contained in the L1 data cache is guaranteed to be in the L2 cache. If the L2 needs to cast out data that is contained in the L1 data cache, that line is invalidated in the L1 data cache.

Stores can be sent to the L2 cache at a maximum rate of one store per cycle. Store data is directed to the proper L2 controller (through a hashing function) by way of the storage slice queue (SSQ) and the L2 store queue (STQ). Steady-state store performance is described in detail in Section 3.1.7, “Selected fundamental kernel performance within on-chip cache” on page 49.

2.3.8 Fixed-point execution pipeline The pipeline for the two fixed-point execution units (FXUs) is shown as the FX pipe in Figure 2-3 on page 9. Both units are capable of basic arithmetic, logical, and shifting operations, and both units are capable of fixed-point multiplies (non-pipelined). One of the FXUs is capable of fixed-point divides, and the other can handle special-purpose register (SPR) operations.

2.3.9 Floating-point execution pipeline

The POWER4 microprocessor contains two symmetrical floating-point execution units, each of which implements a fused multiply/add pipeline with single-cycle throughput, conforming to the PowerPC microarchitecture. All floating-point instructions pass through both the multiply stage and the add stage. For floating-point multiplies, 0 is used as the add operand, and for floating-point adds, 1 is used as the multiplicand. Each floating-point execution unit supports single-cycle throughput and six-cycle data forwarding for dependent instructions.

The floating-point operations square root (fsqrt and fsqrts) and divide (fdiv and fdivs) are not pipelined. Each pipeline can execute the operations with the assistance of additional logic to handle their numerical algorithms. The performance of these and other floating-point operations is highlighted in Section 3.1.7, “Selected fundamental kernel performance within on-chip cache” on page 49.

Chapter 2. The POWER4 system 15 The POWER4 microprocessor also implements the optional PowerPC instructions fres (floating-point reciprocal estimate) and frsqrte (floating-point reciprocal square-root estimate), as well as fsel (floating-point select). The last instruction provides for a conditional floating-point assignment operation without branching, which eliminates the chance of incurring a performance penalty for a mispredicted branch.

2.3.10 Group completion Results are written to registers or cache/memory when the group completes. Completion carries an important architectural meaning: completion makes results available to a program through architected resources (such as floating-point registers). An instruction or group of instructions may have been executed speculatively by the hardware, but do not complete unless all conditions associated with their execution have been successfully resolved. A group can complete when all older groups have completed and when all instructions in the group have finished execution free of exceptions. One group can complete in a cycle, which matches the rate at which groups can be dispatched.

2.4 Storage hierarchy

The POWER4 system storage hierarchy consists of three levels of cache and the memory subsystem. The L1 and L2 caches are physically on the POWER4 chip. The directory for the L3 cache is also on the chip, but the actual cache itself is on a separate module. Table 2-3 summarizes the capacities and organization of the various levels of cache.

Table 2-3 Storage hierarchy organization and size

Component             Organization                            Capacity
L1 instruction cache  Direct map, 128-byte line               128 KB per chip (64 KB per processor)
L1 data cache         Two-way, 128-byte line                  64 KB per chip (32 KB per processor)
L2 cache              Four-way to eight-way, 128-byte line    1440 KB per chip (1.41 MB)
L3 cache              Eight-way, 512-byte lines, managed as   128 MB per MCM
                      four 128-byte sectors
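These sizes are the numbers to keep in mind when estimating whether a working set is cache resident. As a hedged illustration (the routine, tile size, and layout are assumptions, not taken from this guide), a blocked array transpose sized so that two tiles of double-precision data fit in the 32 KB L1 data cache can use a tile edge of about 32, since 2 x 32 x 32 x 8 bytes = 16 KB; the best value for a real code should be confirmed by measurement:

    #define TILE 32   /* 2 * TILE * TILE * 8 bytes = 16 KB, within the 32 KB L1 data cache */

    void blocked_transpose(int n, double a[n][n], const double b[n][n])
    {
        int ii, jj, i, j;
        for (ii = 0; ii < n; ii += TILE)
            for (jj = 0; jj < n; jj += TILE)
                for (i = ii; i < ii + TILE && i < n; i++)
                    for (j = jj; j < jj + TILE && j < n; j++)
                        a[j][i] = b[i][j];   /* both tiles stay cache resident */
    }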

16 POWER4 Processor Introduction and Tuning Guide 2.4.1 L1 instruction cache Each POWER4 microprocessor has an L1 instruction cache that is a 64 KB direct mapped cache and capable of either one 32-byte read or write each cycle. It is indexed by the effective address of the instruction cache line.

2.4.2 L1 data cache Each POWER4 microprocessor contains an L1 data cache that is 32 KB in size, two-way set associative, and has a replacement policy of first-in-first-out (FIFO). It is capable of two eight-byte reads and one eight-byte write per cycle (it is effectively triple ported).

When the cache line containing the operand of a load instruction is not in the L1 data cache, the processor requests a cache line reload, which retrieves the line from the memory subsystem and places it in the L1 data cache across a reload interface to the CIU that is 32 bytes wide. Since the maximum the processor can demand per cycle into the register file is two doubleword loads (that is, 16 bytes per cycle), this reload rate is twice the rate at which the processor itself can demand data.

The L1 data cache implements a store-through design, which means that any updates to data in the L1 data cache are immediately stored through to the L2 cache to keep it synchronized with the L1 data cache. If the operand of a store instruction is not found in any of the cache lines currently resident in the L1 data cache (such as when there is an L1 store miss), the data that is in the source register of the store instruction is stored through to the L2 cache, and the cache line is not established or reloaded into the L1. The data to be stored passes through various queues in the processor (the Store Data Queue), the CIU (the Slice Store Queue), and the L2 cache (the L2 Store Queue) before it actually gets stored into the L2 cache. These queues act as buffers for stored data, which allows the store instruction itself to complete and facilitates the optimization of store performance through a technique named store gathering.

2.4.3 L2 cache Each POWER4 chip has an L2 cache that is supervised by three L2 controllers, each of which manages 480 KB, for a total L2 size of 1440 KB. Cache lines are hashed across the three controllers. Cache line replacement is implemented as a binary-tree pseudo-LRU algorithm. The L2 cache is a unified cache: it caches instructions, data, and page table entries. The L2 cache is also shared by the processors on the chip. For HPC features of the pSeries Model 690, there is only one processor per chip, and thus the L2 cache is entirely owned by that processor.

Chapter 2. The POWER4 system 17 Memory coherency in the system is enforced primarily at the L2 cache level by L2 cache controllers. Each L2 has associated command queues, known as coherency processors. Snoop processors within each controller observe all transactions in the system and respond accordingly, providing responses or delivering cache lines if the situation merits.

2.4.4 L3 cache The L3 cache is eight-way set-associative organized in 512-byte blocks, but with coherence still maintained in the system cache line size of 128 bytes. POWER4 chips are connected to memory through an L3 cache (see Figure 2-4). Generally, it caches data that comes from the memory port to which it is attached. An exception to this is when the cache line has been sent from a remote MCM, in which case an attempt is made to cache the line in an L3 cache on the requesting module.

The L3 cache is designed to be combined with other L3 caches on the same processor module in pairs or quadruplets to create a larger, address-interleaved L3 cache of 64 MB or 128 MB. Combining L3 caches into groups not only increases the L3 cache size, but also increases the L3 bandwidth available to any processor. When combined into groups, L3 caches and the memory behind them are interleaved on 512-byte granularity.

2.4.5 Interconnecting chips to form larger SMPs The basic building block for a pSeries is a multi-chip module (MCM) with four POWER4 chips forming an 8-way SMP, as shown in Figure 2-4. Multiple MCMs can then be interconnected to form 16-, 24-, and 32-way SMPs.

Figure 2-4 A logical view of the interconnection buses within an MCM

The logical interconnection of four POWER4 chips is point-to-point, with uni-directional buses connecting each pair of chips to form an 8-way SMP with an all-to-all interconnection topology. The fabric controller on each chip monitors (for example snoops) all buses and writes to its own bus, arbitrating between the L2 cache, I/O controller, and the L3 controller for the bus. Requests for data from an L3 cache are snooped by each fabric controller to determine if it has the data being requested in its L2 cache (in a suitable state), or in its L3 cache, or in the memory attached to its L3 cache. If any one of these is true, then that controller returns the requested data to the requesting chip on its bus. The fabric controller that generated the request then sees the response on that bus and accepts the data.

2.4.6 Multiple module interconnect
Figure 2-5 shows the interconnection of four MCMs to form a 32-way SMP. Up to four MCMs can be interconnected by extending each bus from each module to its neighboring module in one direction. Inter-module buses run at half the processor frequency and are 8 bytes wide. The inter-MCM topology is that of a ring in which requests and data move from one module to another module in one direction. As with the single MCM configuration, each chip always sends requests, commands, and data on its own bus but snoops all buses for requests or commands from other MCMs.

Figure 2-5 Logical view of MCM-to-MCM interconnections

2.4.7 Memory subsystem
Each POWER4 chip can optionally have a memory controller attached behind the L3 cache. Memory controllers are packaged two to a memory card and support two of the four POWER4 chips on a module (as shown by the placement of memory slots in Figure 2-6). A pSeries 690 Model 681 has two memory slots associated with each module. Zero, one, or two memory cards can be installed per module. Memory controllers can each have either one or two ports to memory.

The memory controller is attached to the L3 MLD chips, with each memory controller having two 16-byte buses to the L3, one in each direction. These buses operate at one-third of the processor speed.

Each port to memory has four 4-byte bidirectional buses operating effectively at 400 MHz connecting load/store buffers in the memory controller to four System Memory Interface (SMI) chips used to read and write data from memory. When two memory ports are available, they each work on 512-byte boundaries. The memory controller has a 64-entry read command queue, a 64-entry write command queue, and a 16-entry write cache queue.
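Taken at face value, these figures imply a raw transfer capability per memory port of roughly 4 buses x 4 bytes x 400 MHz = 6.4 GB/s. This is simple arithmetic on the numbers quoted above, not a quoted machine specification; the bandwidth actually sustained by an application will be lower because of command overheads and sharing of the port with other traffic.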

Figure 2-6 Multiple MCM interconnection

If one memory card, or two memory cards of unequal size, are attached to a module, then the L3 caches attached to the module function as two 64 MB L3 caches.

The two L3 caches that act in concert are the L3 caches that would be in front of the memory card. (Note that one memory card is attached to two chips.)

2.4.8 Hardware data prefetch
In addition to out-of-order execution and the ability to sustain multiple outstanding cache misses, POWER4 systems provide additional hardware to hide memory latency by prefetching data cache lines from memory, L3 cache, and L2 cache transparently into the L1 data cache. The POWER4 processor can prefetch streams, defined as sequences of loads from storage that reference two or more contiguous data cache lines, in order, in either an ascending or a descending pattern (the loads themselves need not be monotonically increasing or decreasing). Eight such streams per processor are supported. Hardware prefetching is triggered by data cache line misses and then paced by loads to the stream. Pacing prefetches by monitoring loads provides a consumption-driven method of providing timely and effective prefetching.

The prefetch engine typically initiates a prefetch stream after detecting misses to two consecutive cache lines. Figure 2-7 shows the sequence of prefetch operations in the steady state after a ramp-up phase. L1 prefetches are one cache line ahead of the cache line currently being loaded from in the program. L2 prefetches, which prefetch cache lines from the L3 cache (or memory) into the L2 cache, are five cache lines ahead, which is sufficient to hide the latency between the L3 cache and the L2 cache. Finally, L3 prefetches, which prefetch data from memory into the L3 cache, are 17 to 20 lines ahead of the cache line currently being loaded from in the program. L3 prefetches are usually done as logical 512-byte lines, that is, four 128-byte lines at a time. This increases the efficiency of the transactions and need only be performed every fourth line referenced.

Figure 2-7 Hardware data prefetch operations

To begin a stream, the prefetch engine either increments or decrements the real address of a cache line miss (so that it is the address of the next or the previous cache line) and places that address in the prefetch filter queue. The decision whether to increment or decrement is based upon the offset within the line corresponding to the load operand. As new cache misses occur, if the real address of the new cache miss matches one of the guessed addresses in the filter queue, a stream has been detected. If the prefetch engine has fewer than eight streams active, the new stream is installed in the prefetch request queue and the prefetching ramp-up sequence is begun. Once placed in the prefetch request queue, a stream remains active until it is aged out. Normally a stream is aged out when the stream reaches its end and other cache misses displace its entry in the filter queue.

The hardware prefetch engine issues prefetches only within a real page, since it does not carry information about the effective-to-real address mapping. Hence, page boundaries curtail prefetching and end streams. If a prefetchable storage reference pattern crosses a page boundary, a new stream is started at the beginning of the new real page according to the startup logic described above. Since this results in a performance penalty that can be significant, POWER4 systems support an additional 16 MB page size that can be used concurrently with the standard 4 KB pages. Applications that place data into 16 MB pages can significantly improve prefetching performance by essentially eliminating the penalty associated with stream re-initialization at page boundaries.

2.4.9 Memory/L3 cache command queue structure
Each L3 cache controller has eight all-purpose coherency processors and eight special-purpose coherency processors. In the desired mode, in which four L3 cache arrays operate in shared mode and therefore appear as one logical, interleaved L3 cache, there are a total of 32 all-purpose coherency processors and 32 special-purpose coherency processors. A coherency processor is busy processing a request until the operation is complete. Special-purpose coherency processors handle primarily cache line writes to memory. For many workloads, the majority of requests to the L3 cache will be read requests or data prefetch requests, and hence the performance of the all-purpose coherency processors essentially determines the overall performance of the L3 cache and memory subsystem.

2.5 I/O structure

Figure 2-8 on page 23 shows the I/O structure in POWER4 systems. The POWER4 GX bus is attached to a Remote I/O (RIO) bridge chip. This chip transmits the data across two one-byte wide RIO buses to PCI Host Bridge (PHB) chips.

Figure 2-8 I/O structure

Two separate PCI buses attach to PCI-PCI bridge chips that further fan the data out across multiple PCI buses. When multiple nodes are interconnected to form clusters of systems, the RIO Bridge chip is replaced with a chip that connects to the switch. This provides increased bandwidth and reduced latency compared to switches attached using the PCI interface.

2.6 The POWER4 Performance Monitor

The POWER4 design includes powerful performance monitoring facilities that can collect data on various system events and provide valuable performance data. The performance monitor facilities enable the counting of up to eight concurrent events, and counting can be started and stopped and the results retrieved by software. Counters can be frozen until a user-selected trigger event occurs and then incremented, or they can be incremented until a trigger event occurs and then be frozen. The facilities also enable the monitoring of classes of instructions selected by the instruction matching facility, the random selection of instructions for detailed monitoring, and the counting of start/stop event pairs that exceed a selected time-out threshold.

AIX 5L contains application program interface (API) code for customer use in enabling and using the performance monitor facilities from their applications. Use of the POWER4 Performance Monitor API is discussed in Section 5.3, “The performance monitor” on page 101.


Chapter 3. POWER4 system performance and tuning

This chapter provides a guide for Fortran or C programmers who have a general understanding of tuning techniques to tune their programs for POWER4. The following major topics are discussed within:
 Tuning for scientific and technical numerically intensive applications
 Tuning for non-numerically intensive or commercial applications
 General system level aspects of tuning

For more information on the general aspects of tuning, see Optimization and Tuning Guide for Fortran, C, and C++, SC09-1705.

3.1 Tuning for numerically intensive applications

Before describing specific tuning techniques, this section first reviews the tuning process and discusses those aspects of POWER4 microarchitecture that particularly influence the performance of numerically intensive scientific and technical programs.

3.1.1 The tuning process for numerically intensive applications
For an existing program, the following steps summarize the tuning process in approximate order of importance. Taking these guidelines into account when writing a new program should significantly reduce the need for tuning at a later stage.
1. If I/O is a significant part of the program, tuning for this is an important but separate activity from computational tuning. Some guidelines for efficient I/O coding are given in Chapter 4, “Optimizing with the compilers” on page 69.
2. Use the best set of compiler optimization flags. See Section 4.1, “POWER4-specific compiler options” on page 69.
3. Locate the hot spots in the program (profiling). This step is very important. Do not waste time tuning code that is infrequently executed.
4. Use the MASS library and ESSL (and maybe other performance-optimized libraries) when possible. These libraries are discussed in Chapter 6, “Performance libraries” on page 113.
5. Make sure that the generic common sense tuning guidance given in Chapter 4, “Optimizing with the compilers” on page 69 has been followed.
6. Hand tune the code to the POWER4 design. This will be discussed in the rest of this chapter.

3.1.2 Hand tuning overview for numerically intensive programs
Hand tuning for cache-based RISC architecture computers such as a pSeries 690 Model 681 is divided into two parts:
1. Avoid the negative. Tune to avoid or minimize the impact of a cache and memory subsystem that is necessarily slower than the computational units. Basic techniques for doing this include:
– Stride minimization
– Encouragement of data prefetch streaming
– Avoidance of cache set associativity constraints
– Data cache blocking
2. Exploit the positive. Tune to maximize the utilization efficiency of the computational units, in particular the floating-point units. Techniques for CPU tuning include:
– Unrolling inner loops to increase the number of independent computations in each iteration to keep the pipelines full.

– Unrolling outer loops to increase the ratio of computation to load and store instructions so that loop performance is limited by computation rather than data movement.

It will be assumed that the reader has a basic understanding of the concepts of:
 Loading and storing (into and from floating-point registers)
 Stride and what determines it in Fortran and C loops
 Loop unrolling

3.1.3 Key aspects of the POWER4 design
This section covers those parts of the POWER4 design that are relevant to tuning the performance of floating-point intensive applications. Additional details are provided in Chapter 2, “The POWER4 system” on page 5.

The components described here are:
 The L1, L2, and L3 caches
 The ERAT and TLB
 Data prefetch streaming
 Floating point and load/store units

The level 1, 2, and 3 caches
A brief description of the caches follows.

The L1 instruction cache
The L1 instruction cache (I-cache) is 64 KB and is direct mapped. It can be of considerable importance for commercial applications such as transaction processing.

For computationally intensive applications, it does not usually have a significant impact on performance because such applications usually consist of highly active loops (DO-loops in Fortran or for-loops in C) that contain relatively few instructions. The amount of data handled is usually much larger than the space taken by the instruction stream.

Tuning for the I-cache consists mainly of ensuring that active loops do not contain a very large number of instructions.

The L1 data cache
Each processor has a dedicated 32 KB L1 data cache. It is two-way set associative with a first-in-first-out (FIFO) replacement algorithm. These concepts and the implications for tuning are explained fully in “Structure of the L1 data cache” on page 28.

The L2 cache
Each POWER4 chip has an L2 (data and instruction combined) cache 1440 KB in size. The pSeries 690 Model 681 and pSeries 690 Turbo have two processors per chip that share the L2 cache. The pSeries 690 HPC feature has one processor per chip that, therefore, has the L2 cache dedicated. Cache coherence is maintained across the entire pSeries 690 Model 681 system at the L2 level.

The L3 cache
Four POWER4 chips are combined into a multi-chip module (MCM), and each MCM has a 128 MB Level 3 cache. For pSeries 690 Model 681 systems with more than one MCM, the L3 caches on remote MCMs are accessible with a modest performance penalty. This applies even if the system is partitioned using LPAR.

The L3 cache is eight-way set associative.

General cache considerations
The high bandwidth from L2 to L1 is more than enough to feed the floating-point units. Thus, the primary difference from a performance point of view between L1 and L2 is latency. A load/store between floating-point register and L1 has a latency of about 4 cycles; between registers and L2 it is approximately 14 cycles.

The tuning recommendation for dense (as opposed to sparse) computation is therefore to block data for the L2 cache and to structure the data access (array leading dimension, for example) for the L1. This tuning advice will be explained in subsequent sections.

An application whose performance is dominated by latency (such as the pointer-chasing code described in Section 3.1.6, “Cache and memory latency measurement” on page 47) may need to be blocked for L1 for best performance.

Structure of the L1 data cache
There are two concepts, cache lines and set associativity, that are key to understanding the structure of the pSeries 690 Model 681 data cache, discussed in the following sections.

Cache lines
Conceptually, memory is sectioned into contiguous 128-byte lines, each one starting on a cache-line boundary whose hardware address is a multiple of 128. The cache is similarly sectioned and all data transfer between cache and memory is in units of these lines.

If, for example, a particular floating-point number is required to be copied (loaded) into a floating-point register to be used in a computation, then the whole cache line containing that number is transferred from memory to cache.

Set associativity
The L1 data cache is mapped onto memory, as shown in Figure 3-1. Each column in the diagram is called a congruence class, and any particular line from memory may only be loaded into a cache line in a particular congruence class, that is, into one of only two locations.

The POWER4 L1 data cache is two-way set associative with 128 congruence classes. Each cache line is 128 bytes. In total, the L1 data cache can contain 32,768 bytes of data.

Figure 3-1 The POWER4 L1 data cache

When a new line is loaded into L1, it displaces the older of the two lines in the congruence class (FIFO replacement).

The set associative structure of the cache can lead to a reduction in its effective size. Suppose successive data elements are being processed that are regularly spaced in memory (that is with a constant stride). With the POWER4 cache, the worst case is when the stride is exactly 16 KB or a multiple of 16 KB. In this case, all elements will lie in the same congruence class and the effective cache size will be only two lines. This effect happens, to a lesser extent, with any stride that is a multiple of a power of 2 less than 16 KB.

Characteristics of the L2 cache
The size of the L2 cache is 1440 KB per POWER4 chip, and this is shared between the two processors in the chip. As with the L1 data cache, the cache line size is 128 bytes. The replacement policy is pseudo-LRU (least recently used) so frequently accessed cache lines should be readily maintained in the cache. The L2 cache is a combined data and instruction cache. Instruction caching aspects of the L2 cache are not considered here.

The L2 cache is divided into three equal parts, each under control of a separate L2 cache controller. The portion in which a particular line is stored is determined from the real memory address using a hashing algorithm. Sixteen consecutive double-precision Fortran array elements (128 bytes) are held in the same cache line, and therefore under control of the same cache controller. The 17th element will be in a different cache line, and the hashing algorithm guarantees that it will be stored under control of a different cache controller. This has implications for the optimization of store processing when accessing arrays sequentially.

Loads are processed by loading a cache line from L2 into the L1 data cache 32 bytes at a time. This means that the L2 cache can load the equivalent of four double-precision floating-point data elements per cycle, which is double the capability of the processor to issue load instructions. Prefetched data will be loaded into the L1 at the same rate.

The L2 cache is a store-in cache, which means that stores are always written to the L2 cache whether there is a hit in the L2 cache or not. This is in contrast to the store-through L1 data cache where a store miss will not result in the data being written into this level. Stores are passed to the L2 cache interface 8 bytes at a time. The rate at which stores can be accepted by the interface depends on whether the stores are to the same L2 section or not. See 3.1.7, “Selected fundamental kernel performance within on-chip cache” on page 49 and 3.1.8, “Other tuning considerations” on page 51 for discussions concerning store performance.

Once the store has been accepted by the interface unit, the store instruction is released by the processor, freeing up resources. Note that store data is never written into the caches until the stores have been completed, that is, made visible to the program, by the processor. Completion in this sense is separate from, and later than, execution.

In the case where the cache line is not already present in the L2 cache (an L2 cache miss), then it must be loaded from either memory, another chip’s L2 cache, or the L3 cache to ensure that the L2 cache contains the latest copy of this cache line. Depending on whether the line already exists in another L2 cache on another chip, some coherency processing may be required to ensure that the local chip has permission to modify the line. Once the line is updated in the L2 cache, then it is marked as dirty and will eventually be written out to memory and potentially to L3 cache.

The ERAT and TLB
The instruction stream addresses data using 64-bit effective addresses (EAs). To access the data in memory, the EA is first converted to an 80-bit virtual address (VA) and then to a 64-bit real address (RA). The translation lookaside buffer (TLB) holds 1024 entries organized in a 4-way set-associative structure. It contains previously translated EA-to-RA mappings and other information on a page basis, for either 4 KB or 16 MB page sizes. For 4 KB pages, the TLB addresses a total of 4 MB of (not necessarily contiguous) memory. Data that is within a page addressed by the TLB will not incur the overhead of a TLB miss when the EA is accessed. The ERAT is effectively a cache for the TLB. It is a 256-entry, 2-way set-associative array. All ERAT entries are based on 4 KB pages, even if 16 MB pages are used.

The TLB addresses a greater amount of memory (at least 4 MB) than the L2 cache (1.41 MB). Therefore, any program that is tuned to take any advantage of the L2 cache is unlikely to experience serious overheads due to TLB misses (this is different from POWER3 where the TLB addressed 1 MB but L2 was 4 MB or 8 MB). It is still possible on POWER4 to construct situations involving high strides that will create a TLB miss and not a cache miss, but tuning for the TLB is beyond the scope of this document. For codes in which a blocking strategy is used, empirically determining the blocking factors will also include ERAT and TLB effects.

Prefetch data streaming
The POWER4 design provides a prefetch mechanism that can identify streams as defined in Section 2.4.8, “Hardware data prefetch” on page 21. Each POWER4 microprocessor can support up to eight independent prefetch streams. In contrast, the POWER3 processor supported four independent prefetch streams. Note that there is no prefetch on store operations.

The prefetch mechanism is based on real addresses. Therefore, whenever a real address reference crosses a page boundary, the prefetch mechanism is stopped. Two consecutive cache line misses on the subsequent page are required to restart the mechanism. Large pages are supported in AIX 5L only through shared memory segments. More general large page support will be available in a future release of AIX 5L. There can be performance benefits because the prefetch mechanism can operate over much larger arrays before crossing page boundaries.

The floating-point units and maximum GFLOPS
To achieve the maximum floating-point rate possible on a single pSeries 690 Model 681 processor, the delays due to the memory subsystem have to be eliminated and the program must reside in the L1 cache.

The following key facts summarize the way the FPUs perform:
 A single pSeries 690 Model 681 processor has two FPUs (sharing a single L1 cache) that can operate independently. The two FPUs see only floating-point registers. There are a total of 72 physical registers, each 64 bits wide, and they serve both FPUs. An assembler program can address 32 architected registers, and these are mapped onto the physical registers through a hardware process known as renaming. Floating-point computation is carried out only with data in these registers. All floating-point arithmetic instructions are register-to-register operations, logically using only floating-point registers as sources and targets.
 Data is copied into the registers from the L1 cache (loaded) and copied back to the L2/L1 cache (stored) by two load/store units.
 For data in the L1 or L2 cache, loads or stores of floating-point double-precision (REAL*8) variables can be done by each load/store unit at the rate of one per cycle, but for loads, there is a latency before the FPU can use the data for computation. This latency is approximately four cycles if the data is in L1, or 14 cycles if it is in L2 but not L1. For maximum performance, it is important that loaded data is in L1, because the compiler will assume the L1 latency.
 Single-precision (REAL*4) variables use the same register set as REAL*8. Each variable occupies an entire 64-bit register (there is no ability to pack two REAL*4s into a single register).
 The basic computational floating-point instruction is a double-precision multiply/add, with variants multiply/subtract, negative multiply/add, and negative multiply/subtract. There are also single-precision variants.

A single add, subtract, or multiply (not divide) is done using the same hardware as a multiply/add and takes the same amount of time. A multiply/add counts as two floating-point operations. For example, a program doing only additions might run at half the MFLOPS rate of one doing alternate multiplies and adds. The assembler acronym for the double-precision floating-point multiply/add is FMA. This term will be used extensively as shorthand for any of the variants of this basic floating-point instruction. The computational part of an FMA takes six cycles. The worst case would be a sequence of wholly dependent 6-cycle FMAs (where a result of one FMA is needed by the next), where only one of the FPUs would be active. This would run at the rate of one FMA per six cycles. A sequence of independent FMAs, however, can be pipelined and the throughput can then approach the peak rate of two FMAs per cycle (one per FPU).
 Divides are very costly and are not pipelined.
 A fundamental aspect of RISC architecture is that the functional units can run independently. Therefore, FMAs can run in parallel with load/stores and other functions.

Conditions for approaching peak GFLOPS
When considering a numerically intensive loop, the following applies to the instruction stream within the loop:
 Operate efficiently within the L1 and L2 caches.
 No divides (or square roots or function calls and so on).
 To achieve peak megaflops, loops must contain only FMAs, that is, floating-point adds or subtracts combined with multiplies.
 FMAs must be independent and at least 12 in number to keep two pipes of depth six busy.
 The loop should be FMA-bound. That is, cycles needed for instructions other than FMAs (mainly load/stores) should be less than those needed for FMAs so that they can be overlapped with FMAs and effectively hidden. In principle, they could be equal to the FMA cycles, but, in practice, peak performance is approached most easily if there are fewer.

 The performance of floating-point intensive applications on a 1.3 GHz POWER4 is typically between two and three times faster than on a 375 MHz POWER3, but usually somewhat less than a comparison of clock frequencies alone would suggest. This is because it is more difficult to approach peak performance on POWER4 than on POWER3, because of factors such as:
– The increased FPU pipeline depth
– The reduced L1 cache size
– The higher latency (in terms of processor cycles) of the higher level caches

3.1.4 Tuning for the memory subsystem
There are four basic tuning techniques (some of these techniques may be done by the compilers) that will be discussed in this section, namely:
 Stride minimization
 Encouragement of data prefetch streaming
 Structuring for L1 set associativity
 Data cache blocking

Stride minimization
Sequential accessing of data is beneficial for two reasons:
 It ensures that, once a line is loaded into cache, all other operands in the same cache line will also be referenced. If the data is accessed with a large stride, less data from the cache line will be referenced. If the stride is 16 or more double-precision words (32 or more single-precision words), only one number in each line will be referenced. There will then be a high probability that, when the other numbers in the line are accessed at a later stage, the line will no longer be in cache, leading to the overhead of a cache miss.
 Sequentially accessed (forwards or backwards, stride 1 or -1) data can start one of the eight hardware prefetching streams. Other low-value strides may also start a prefetching stream provided that they generate contiguous cache line references.

Fortran arrays are stored in memory in column major order, C arrays in row-major order. Coding nested loops to access data the right way so that multi-dimensioned arrays are accessed sequentially (stride 1) is the most basic tuning technique of all.

The following examples illustrate this:

Correctly tuned stride 1 sequential access

Fortran:
do i=1,n
  do j=1,n
    a(j,i)=a(j,i)+b(j,i)*c(j,i)   ! Left subscript same as inner loop var.
  enddo
enddo

C:
for(i=0;i<n;i++){
  for(j=0;j<n;j++){
    a[i][j]=a[i][j]+b[i][j]*c[i][j];   /* Right subscript same as inner loop var. */
  }
}

If the nesting order of the loops is changed, the arrays are then accessed with a large stride.

In this simple case, the compilers will reverse the order of the loops for you. However, it is sound coding practice not to rely on the compiler and always to code loops in the correct order.

It is, of course, not always possible to code so that all arrays are accessed stride 1. For example, the following is a typical matrix multiply code fragment:
do i=1,n
  do j=1,n
    do k=1,n
      d(i,j)=d(i,j)+a(j,k)*b(k,i)
    enddo
  enddo
enddo

No matter how the loops are coded, one or more arrays will have non-unit stride. In this case, data cache blocking may be necessary as described in “Data cache blocking” on page 38.

Encouragement of data prefetch streaming
Data prefetching is implemented in the POWER4 processor hardware so that prefetching is transparent to the application: it does not require any software assistance to be effective. There are, however, situations where the performance of an application can be improved with code tuning to more fully exploit the capabilities of the hardware prefetch engine. These situations arise when:
 There are too few or too many streams in a performance-critical loop
 The length of the streams in a performance-critical loop is too short

The POWER4 data prefetch design was optimized for loops with four to eight concurrent hardware streams. Figure 3-2 on page 37 shows the performance for a series of loops with one to eight streams per loop. Note that increasing the number of streams from one to eight can improve data bandwidth out of the L3 cache and memory by up to 70 percent, and that most of the improvement comes from increasing the number of streams from one to four.

The number of streams in a loop can be increased by fusing adjacent loops (a capability which the XL compilers possess with the -qhot optimization) or by midpoint bisection of the loop. Fusing simply means combining two or more compatible loops into a single loop. For example:
DO I=1,N
  S = S + B(I) * A(I)
ENDDO
DO I=1,N
  R(I) = C(I) + D(I)
ENDDO

may be combined into:
DO I=1,N
  S = S + B(I) * A(I)
  R(I) = C(I) + D(I)
ENDDO

Midpoint bisection of a loop doubles the number of streams but halves its vector length. Consider the standard dot-product loop:
DO I=1,N
  S = S + B(I) * A(I)
ENDDO

This loop contains two streams corresponding to the two arrays on the right hand side of the expression. Midpoint bisection doubles the number of streams by starting two more streams at the halfway point of each of the arrays, as shown in the following:
NHALF = N/2
S0=0.D0
S1=0.D0
DO I=1,NHALF
  S0 = S0 + A(I)*B(I)
  S1 = S1 + A(I+NHALF)*B(I+NHALF)
ENDDO
IF(2*NHALF.NE.N) S0 = S0 + A(N)*B(N)
S = S0+S1

For this example, in situations where the data is being reloaded from beyond the L2 cache, the break-even vector length is approximately 220. Loops with vector lengths beyond 220 which have been midpoint bisected as shown have superior performance by up to 20 percent.

When a loop has more than eight streams, reducing the number of streams per loop may also boost overall performance. Since only eight of the streams can be prefetched (as there are only eight prefetch request queues), streams beyond eight will be reloaded on a demand basis. It may be possible to split the loop into two or more loops, each with eight or fewer streams. This may or may not involve introducing extra temporary vectors to allow the loop to be split. In any event, profiling or loop timing should always be done within the application to check whether the tuning, either to increase or decrease the number of streams per loop, had a positive overall effect on performance.
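As an illustration of this kind of loop splitting (a hypothetical sketch with arbitrary array names, not an example taken from a real application), a loop that reads ten arrays, and therefore requires ten load streams, can be divided into two loops that each need no more than the eight streams the hardware supports, at the cost of an extra temporary vector t:

! Original loop: ten load streams (a, b, c, d, e, f, g, h, p, q),
! more than the eight hardware prefetch streams can cover.
do i=1,n
  x(i) = a(i)*b(i) + c(i)*d(i) + e(i)*f(i) + g(i)*h(i) + p(i)*q(i)
enddo

! Split version: six load streams in the first loop, five in the second,
! at the cost of the temporary vector t and the extra store/load of t.
do i=1,n
  t(i) = a(i)*b(i) + c(i)*d(i) + e(i)*f(i)
enddo
do i=1,n
  x(i) = t(i) + g(i)*h(i) + p(i)*q(i)
enddo

Whether the split pays off depends on the extra memory traffic for t; as noted above, the effect should always be checked by timing the loop within the application.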

Increasing vector length can significantly improve performance as well, simply due to the fact that there is a fixed overhead resulting from loop unrolling and prefetch stream acquisition. In some cases, increasing the vector length of an application is under direct control of the programmer, such as in those in which explicit integration of a variable permits operations on groups of entities of arbitrary size. In these situations, there is often a trade-off between cache reuse and vector length, so it is again advisable to determine the optimal vector length empirically.

Figure 3-2 POWER4 data transfer rates for multiple prefetch streams

Structuring for L1 set associativity
In cases where it is not possible to access arrays sequentially, the stride is typically determined by the leading dimension of the array. For example, consider the following loop:
real*8 a(2048,75)
. . .
do i=1,75
  a(100,i)=a(100,i)*1.15
enddo

This updates the 100th row of a Fortran array. The row is 75 elements long, so 75 cache lines will be accessed (if this were a column, only 5 cache lines would be accessed). The L1 cache has a total of 256 lines. So, if, for example, this section of the array has been recently accessed, you might hope to find these lines in the cache.

However, the leading dimension of the array determines the stride for array A to be 2048 REAL*8 numbers, or 16384 bytes. These accesses map to a single congruence class in the L1 cache, so only two of the 75 lines can be held in L1 at any time. At best, only two of the 75 lines accessed could possibly still be in the L1 cache.

Changing the leading dimension to 2064 (that is, 2048 plus a single cache line of 16 REAL*8 numbers) would cause the 75 lines to map to different congruence classes and all 75 lines would fit. With a two-way set associative cache, a leading dimension of 2056 (2048 plus half a cache line) would also work. But 2064 would work for any level of set associativity, including direct mapping.

The general rule is: Avoid leading dimensions that are a multiple of a high power of two.

Any odd number of cache lines is ideal, that is for 128-byte cache lines, any odd multiple of 16 for REAL*8 arrays or any odd multiple of 32 for REAL*4 arrays.
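As a minimal sketch of this rule applied to the example above (only one of the two declarations would appear in a real program; the padded rows are simply never referenced):

real*8 a(2048,75)   ! stride between row elements is 16 KB: maps to one congruence class
real*8 a(2064,75)   ! 2064 = 129*16, an odd multiple of one cache line of REAL*8
                    ! elements: row accesses now spread across congruence classes

The padding costs a little extra storage in exchange for much better L1 behavior when the array is accessed along a row.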

Data cache blocking
The data cache blocking idea is basic: if your arrays are too big to fit into cache, then process them in blocks that do fit into cache. Generally with POWER4, it is the 1440 KB L2 cache that needs to be large enough to contain the block.

There are two factors that determine whether blocking will be effective:
 Whether all arrays are accessed stride 1
 Whether each data item is used in more than one arithmetic operation

The combination of these factors produces four scenarios:
 All arrays are stride 1 and there is no data reuse. There is no benefit from blocking.
! Summed dot products. Note each element of A and B is used just once.
do j=1,n
  do i=1,n
    s = s + a(i,j)*b(i,j)
  enddo
enddo
 Some arrays are not stride 1 and there is no data reuse. Blocking will be moderately beneficial.
! Summed dot products with transposed array.
do j=1,n
  do i=1,n
    s = s + a(j,i)*b(i,j)
  enddo
enddo
 All arrays are stride 1 and there is much data reuse. Blocking will be moderately beneficial.
! Matrix multiply transpose.
do i=1,n
  do j=1,n
    do k=1,n
      d(i,j)=d(i,j)+a(k,j)*b(k,i)
    enddo
  enddo
enddo
 Some arrays are not stride 1 and there is much data reuse. Blocking will be essential.
! Matrix multiply.
do i=1,n
  do j=1,n
    do k=1,n
      d(i,j)=d(i,j)+a(j,k)*b(k,i)
    enddo
  enddo
enddo

The following example shows how matrix multiply should be blocked:
!3 blocking loops
do ii=1,n,nb
  do jj=1,n,nb
    do kk=1,n,nb
!
!In-cache loops
      do i=ii,min(n,ii+nb-1)
        do j=jj,min(n,jj+nb-1)
          do k=kk,min(n,kk+nb-1)
            d(i,j)=d(i,j)+a(j,k)*b(k,i)
          enddo
        enddo
      enddo
!
    enddo
  enddo
enddo

In this example, the size of the blocks of each matrix is NB x NB elements. For blocking to be effective, it must be possible for the L2 cache to hold three such blocks. On a pSeries 690 HPC, the process will have the whole L2 cache available. On a non-HPC model, it may be sharing L2 with another process or thread so that only half the cache is available. The relatively complicated structure of the cache may also require NB to be smaller than a simple size calculation would suggest. In practice, the right way to fix NB is to vary it and measure the performance to achieve the optimum value. However, if a non-HPC machine is being used, these measurements should not be run stand-alone if, in practice, the application will be run when another application or thread is competing for L2.
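A simple upper bound on NB, as a starting point for that empirical search: three REAL*8 blocks must fit in the 1440 KB L2 cache, so 3 x NB x NB x 8 bytes <= 1440 x 1024 bytes, which gives NB of at most about 247 (or roughly 175 if only half the cache is available). For the reasons given above, the best measured value of NB is usually noticeably smaller than this bound.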

Note that, although this code leads to in-cache performance, it does not lead to maximum GFLOPS. The reason for this is explained in the next section.

Blocking usually needs to be done by hand rather than leaving it to the compiler.

3.1.5 Tuning for the FPUs
In contrast to tuning for the memory subsystem, the compiler is generally very successful at tuning for the FPUs and often there is little extra that can be achieved by hand tuning. Some exceptions to this are highlighted in this section.

Inner loop unrolling and instruction scheduling
To keep the FPU pipelines busy, the following conditions must apply in the inner instruction loop:
 There must be enough (at least 12) independent FMAs in the compiled loop.
 Loads must precede FMAs in the instruction stream by at least four cycles to overcome the L1 cache latency.
 The total number of architected registers used must not exceed 32. If this happens, the compiler must generate spill coding that stores the register values and reloads them later.
 The number of rename registers needed must not exhaust the hardware pool available.
 The number of loads and stores must be less than or equal to the number of FMAs, otherwise the load/store time dominates.

Techniques for dealing with the last item - load/store bound loops - are discussed in “Outer loop unrolling to increase the FMA to load/store ratio” on page 41.

The basic technique for achieving multiple independent FMAs is inner loop unrolling. While this can be done by hand, it produces convoluted coding and usually there is no point since the compiler will do it for you efficiently and reliably. If you unroll manually, there is a danger that the compiler will unroll again. This may cause register spilling or other overheads and it may be beneficial to use the -qnounroll compiler flag.

To help the compiler to avoid register spilling, you should avoid coding too many unnecessary temporary scalar variables in the loop.

Apart from the items noted, you must rely on the compiler to produce the optimum instruction stream unless assembler language is used. This is easier than might be imagined, since advantage can be taken of the -S compiler option. This will produce a file from the Fortran with a .s suffix that may be assembled with the as command and linked into the program. Identifying the inner loop of the routine and editing it to improve the instruction stream is then quite possible for the experienced programmer without the necessity to fully learn assembler language. Nevertheless, most programmers will not choose to do this and further advice is beyond the scope of this publication.

Outer loop unrolling to increase the FMA to load/store ratio
In cases where the inner loop is load/store bound (loads + stores greater than FMAs) it may be possible to significantly improve performance by increasing the ratio of FMAs to loads and stores in the loop. This is only possible in data re-use cases and the basic technique is usually outer loop unrolling.

This section considers two cases: one simple loop that the compiler does not handle successfully, and then blocked matrix multiply coding.

Simple loop benefitting from hand unrolling
Consider the following loop:
do i = 1,n
  do j = 1,n
    y(i) = y(i) + x(j)*a(j,i)
  end do
end do

This loop is already well structured in that the inner loop both has stride 1 and is a sum-reduction (y(i) is a scalar). This means that the number of loads and stores needed in the inner loop is minimized because the scalar value y(i) can be held in a single register and stored just once after the inner loop is complete. Iteration of the inner loop needs just two loads (for x(j) and a(j,i)) and zero stores. If the loop order were reversed (with the inner loop on I), there would be two loads needed (for y(i) and a(j,i)) plus one store (for y(i)). In addition, there would be poor stride on a(j,i).

However, the loop is load/store bound because there are more load and store instructions than FMAs. Therefore, as it stands, the performance of this loop will be limited by the effective rate at which the load/store unit can operate.

The compiler will successfully unroll the inner loop on J. This is necessary in order to populate the inner loop with independent FMAs rather than dependent ones. However, this does nothing to alter the FMA to load/store ratio.

The solution, in this case, is to unroll the outer loop on I. With this simple loop, the compiler may optimize the code for you with the -qhot option, but, generally, it is more reliable to do outer-loop unrolling by hand.

The following code shows the loops unrolled to depth 4 (tidy-up coding omitted for cases where n is not a multiple of 4):
do i = 1,n,4
  s0 = y(i)
  s1 = y(i+1)
  s2 = y(i+2)
  s3 = y(i+3)
  do j = 1,n
    s0 = s0 + x(j)*a(j,i)
    s1 = s1 + x(j)*a(j,i+1)
    s2 = s2 + x(j)*a(j,i+2)
    s3 = s3 + x(j)*a(j,i+3)
  enddo
  y(i) = s0
  y(i+1) = s1
  y(i+2) = s2
  y(i+3) = s3
enddo

Note the introduction of the temporary scalar variables S0, S1, S2, and S3. This is very important because, whenever the inner loop contains anything more complicated than a simply subscripted scalar, the compiler may not recognize that the quantities are scalars within the loop and may generate unnecessary loads and stores. Generally speaking, introducing temporary scalars to make the scalar nature of array elements clear to the compiler is good coding practice. This does not contradict the previous advice to avoid introducing unnecessary scalar variables. In this case, it is necessary for the compiler to recognize that y(i), y(i+1), y(i+2), and y(i+3) are scalars in the inner loop.

The load/store to FMA ratio is reduced because the element x(j) is now re-used three times in the inner loop. So, now, for four of the original iterations, there are five loads rather than eight. Clearly, as the unrolling depth increases, the load/store to FMA ratio reduces asymptotically from two to one.

The actual performance depends on the compiler optimization flags and the depth of hand-unrolling. Selected results for a 1.1 GHz machine are shown in Figure 3-3 on page 44. The label depth refers to the unrolling depth of the outer loop of the hand-tuned code. The x-axis refers to dimension n. At n=64 the data just exceeds the size of the L1 cache. Without hand-unrolling, the compiler does not take advantage of the L1 cache. The top two lines are with different compilers but the main reason for the difference in performance is that the top line is compiled for -qarch=pwr4 rather than pwr3.

Note the “L1 cache peak” for the (top three) hand unrolled lines as the array size is increased. The untuned code (the bottom line) does not show this peak.

Figure 3-3 Outer loop unrolling effects on matrix-vector multiply (1.1 GHz system)

M x N unrolling for matrix multiply
The following is the heart of the blocked matrix multiply code. The blocking loops have been omitted for clarity.
do i=ii,min(n,ii+nb-1)
  do j=jj,min(n,jj+nb-1)
    do k=kk,min(n,kk+nb-1)
      d(i,j)=d(i,j)+a(j,k)*b(k,i)
    enddo
  enddo
enddo

As with the previous example, having the inner loop on k (rather than i or j) minimizes the number of loads and stores. The array element d(i,j) is a scalar in the inner loop, since it does not depend on the inner loop variable, k, so the inner loop is a sum reduction. The scalar may be held in a register during iteration and only stored after the inner loop is complete. The inner loop requires just two loads (for a(j,k) and b(k,i)) whereas if i or j were the inner loop variable, there would be two loads plus one store.

Let us recast the loop so as to make the scalar nature of d(i,j) explicit:
do i=ii,min(n,ii+nb-1)
  do j=jj,min(n,jj+nb-1)
    s = d(i,j)
    do k=kk,min(n,kk+nb-1)
      s = s + a(j,k)*b(k,i)
    enddo
    d(i,j) = s
  enddo
enddo

As with the previous example, introduction of the variable S is sound coding practice.

Although the number of load/stores in the inner loop has been minimized, the loop is nevertheless clearly load/store bound. There are two loads and only one FMA. This can be effectively transformed into an FMA-bound loop by unrolling the outer two loops. If the outer loop is unrolled to depth M and the middle loop to depth N, then the number of loads is m+n and the number of FMAs is m*n. Unrolling 2x2 makes the loop balanced (load/stores = FMAs). Anything more makes it FMA-bound. The following code shows 5x4 unrolling. This requires 29 architected registers (20 for holding the 20 partial sums in the temporary scalar variables and 9 for holding the elements of A and B). Anything higher would exceed the number of architected registers.
do i=ii,min(n,ii+nb-1),5
  do j=jj,min(n,jj+nb-1),4
    s00 = d(i+0,j+0)
    s10 = d(i+1,j+0)
    s20 = d(i+2,j+0)
    s30 = d(i+3,j+0)
    s40 = d(i+4,j+0)
    s01 = d(i+0,j+1)
    s11 = d(i+1,j+1)
    s21 = d(i+2,j+1)
    s31 = d(i+3,j+1)
    s41 = d(i+4,j+1)
    s02 = d(i+0,j+2)
    s12 = d(i+1,j+2)
    s22 = d(i+2,j+2)
    s32 = d(i+3,j+2)
    s42 = d(i+4,j+2)
    s03 = d(i+0,j+3)
    s13 = d(i+1,j+3)
    s23 = d(i+2,j+3)
    s33 = d(i+3,j+3)
    s43 = d(i+4,j+3)
    do k=kk,min(n,kk+nb-1)
      s00 = s00 + a(j+0,k)*b(k,i+0)
      s10 = s10 + a(j+0,k)*b(k,i+1)
      s20 = s20 + a(j+0,k)*b(k,i+2)
      s30 = s30 + a(j+0,k)*b(k,i+3)
      s40 = s40 + a(j+0,k)*b(k,i+4)
      s01 = s01 + a(j+1,k)*b(k,i+0)
      s11 = s11 + a(j+1,k)*b(k,i+1)
      s21 = s21 + a(j+1,k)*b(k,i+2)
      s31 = s31 + a(j+1,k)*b(k,i+3)
      s41 = s41 + a(j+1,k)*b(k,i+4)
      s02 = s02 + a(j+2,k)*b(k,i+0)
      s12 = s12 + a(j+2,k)*b(k,i+1)
      s22 = s22 + a(j+2,k)*b(k,i+2)
      s32 = s32 + a(j+2,k)*b(k,i+3)
      s42 = s42 + a(j+2,k)*b(k,i+4)
      s03 = s03 + a(j+3,k)*b(k,i+0)
      s13 = s13 + a(j+3,k)*b(k,i+1)
      s23 = s23 + a(j+3,k)*b(k,i+2)
      s33 = s33 + a(j+3,k)*b(k,i+3)
      s43 = s43 + a(j+3,k)*b(k,i+4)
    enddo
    d(i+0,j+0) = s00
    d(i+1,j+0) = s10
    d(i+2,j+0) = s20
    d(i+3,j+0) = s30
    d(i+4,j+0) = s40
    d(i+0,j+1) = s01
    d(i+1,j+1) = s11
    d(i+2,j+1) = s21
    d(i+3,j+1) = s31
    d(i+4,j+1) = s41
    d(i+0,j+2) = s02
    d(i+1,j+2) = s12
    d(i+2,j+2) = s22
    d(i+3,j+2) = s32
    d(i+4,j+2) = s42
    d(i+0,j+3) = s03
    d(i+1,j+3) = s13
    d(i+2,j+3) = s23
    d(i+3,j+3) = s33
    d(i+4,j+3) = s43
  enddo
enddo

As with all hand unrolling operations, extra “tidy-up” coding is necessary where the array dimensions are not multiples of (in this case) 5 and 4. The tidy-up coding is omitted for clarity.

Together with blocking, this technique provides the best performance for the matrix multiply kernel. Matrix factorization structured to use the rank-n update, which is an operation identical to matrix multiply except that it updates the target matrix, is also optimized using this unrolling technique. See Section 6.1.2, “Performance examples using ESSL” on page 115 for the performance of ESSL DGEMM, which uses similar optimization techniques.

3.1.6 Cache and memory latency measurement
Most of the examples so far in this chapter have been in connection with structured data that can usually be accessed sequentially and for which data prefetch streaming gives excellent performance even for very large amounts of data that do not fit into the cache. Some applications, however, access data in a much more random way and, for these applications, data streaming cannot be used.

The key performance factor for such an application is the latency, that is, the time before the computational units can make use of a data item. The latency is very different depending on which cache holds the data or whether it is in memory. To study this, the following loop was used:
ip1=ia(1)
do i=2,n
  ip2=ia(ip1)
  ip1=ip2
enddo

The data in the INTEGER*8 array ia was a random sequencing of the integers from 1 to N, subject to the constraint that following the pointers as shown would traverse the whole array. This ensured that each iteration was dependent on the previous one and that data streaming could not operate. As usual, the loop was iterated many times so that, if the whole of the ia array fitted into a particular cache, it would be the latency of that cache that was being measured.
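One way to construct such an array (an illustrative sketch only; the measurements above do not depend on this particular construction) is to shuffle the identity permutation and then link each element to the next one in shuffled order, so that the pointer chain forms a single cycle visiting every element:

! Build a random cyclic pointer chain in ia(1:n).
! Assumes n has been set; p is an integer work array of length n,
! and r is a REAL*8 scalar.
do i = 1, n
  p(i) = i
enddo
do i = n, 2, -1                  ! Fisher-Yates shuffle of p
  call random_number(r)          ! r is in [0,1)
  j = 1 + int(r*dble(i))
  itmp = p(i)
  p(i) = p(j)
  p(j) = itmp
enddo
do i = 1, n-1                    ! link shuffled elements into one cycle
  ia(p(i)) = p(i+1)
enddo
ia(p(n)) = p(1)

Starting from any element (the measurement loop starts from ia(1)) and following the pointers visits all n elements before returning to the start, which defeats both the hardware prefetcher and any cache reuse for large n.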

The results in Figure 3-4 have been normalized to present the latency in terms of numbers of cycles. Since a 1.3 GHz pSeries 690 HPC was used, the numbers should be divided by 1.3 to get latency in nanoseconds. In the graph, the numbers of bytes increase uniformly on a logarithmic scale.

Figure 3-4 Latency in machine cycles to access N bytes of random data

The following conclusions can be drawn from this graph:
 Latency for the L1 cache is around 4-5 cycles. The figures increase sharply when bytes exceed about 32000, the size of the L1 cache.
 Latency for the L2 cache is around 11-14 cycles but seems to increase to over 20 cycles as the cache becomes full at around 1500000 bytes.
 When data spills out of L2 cache, the combined L3 cache and memory subsystem cause a fairly graceful increase in latency to a value of at least 340 cycles corresponding to memory latency. It is difficult to discern the L3 cache latency separately from these figures. With a large volume of random data, some will be in L3 and some will be in memory and this blurs the effect. If the data had been structured non-randomly to ensure that data would not be in cache unless it would all fit, the L3 cache effect might have been clearer. However, the random distribution used is probably more realistic.

3.1.7 Selected fundamental kernel performance within on-chip cache
Table 3-1 shows the measured performance of a set of fundamental loops on a pSeries 690. These measurements serve as a reference for achievable performance levels on the machine; both absolute performance in cycles per iteration for the loop, and performance relative to a 375 MHz POWER3-II processor, are given. Since the POWER4 processor has two levels of on-chip cache, results are shown for loops contained within each level: the L1 data cache and the L2 cache. The vector length, which is also the inner-loop limit, is shown for each set of data. An outer repetition loop has been used to obtain accurate timings. The inner loop is often unrolled by the compiler to minimize branch instructions, break floating-point instruction dependencies, and to allow for more flexibility in scheduling instructions for maximum performance. All of the loops were compiled using the -qarch=pwr4 and -O3 flags with the development version of XL Fortran Version 7.1.1 available at the time of publication.

Table 3-1 Performance of various fundamental loops

                           L1 data cache contained results     L2 cache contained results
ID  Kernel                 Vector   Cycles per   Performance   Vector   Cycles per   Performance
                           length   iteration    vs. POWER3    length   iteration    vs. POWER3
 1  x(i)=s                   2000      1.7          2.0         40000      1.8          5.2
 2  x(i)=y(i)                1000      1.7          2.8         20000      2.1          4.6
 3  x(i)=x(i)+s*y(i)         1000      1.7          3.0         20000      2.2          2.9
 4  x(i)=x(i)+y(i)           1000      1.7          3.0         20000      2.2          2.9
 5  s=s+x(i)                 2000      0.9          2.2         40000      1.7          3.1
 6  s=s+x(i)*y(i)            1000      1.3          3.0         20000      1.9          2.8
 7  x(i)=sqrt(y(i))          1000     18.1          2.1         20000     18.1          2.1
 8  x(i)=1.0/y(i)            1000     15.1          2.2         20000     15.1          2.2
 9  x(i)=a(i)+x(i-1)         1000      6.5          1.7         20000      6.5          1.7
10  s=s+y(i)*a(ix(i))         800      2.0          3.1         16000      2.5          3.2

(Performance figures are relative to a 375 MHz POWER3 Model 270.)

1. Loop 1 has only stfd (double-precision floating-point store) instructions in the inner loop. As discussed in Section 2.3.7, “Store instruction processing” on page 14, store data is placed in the SDQ and the data then proceeds to the proper SSQ and STQ until it is finally written into the L2 array. Store performance is determined by the rate at which the STQ can be drained, and since there is an STQ per L2 cache controller, it depends on how stores are distributed across the three L2 controllers. The loop measured is a straightforward stride 1 store pattern in which the compiler has unrolled the inner loop by eight and has roughly scheduled the stores within the loop from highest address to lowest (that is, in reverse order). The performance of stores is relatively flat for vector lengths up through the size of the L2 cache because of the store-through design, which always sends updates through to the L2 cache.

2. Loop 2 is the copy loop, consisting of an lfd and an stfd per iteration. The performance of this loop is still determined by the store performance. Cache lines corresponding to load instructions are always reloaded into the L1 data cache on an L1 data cache miss; cache lines to which stores are directed are not.

3. Loop 3, commonly known as DAXPY, is load/store bound like the first two loops, but adds an fmadd. Since the vector being stored has been updated with a multiple of the other vector, the line being stored into must first be reloaded, and will then reside in the L1 data cache. Still, the modified data is stored through to the L2 cache.

4. Loop 4 is identical to DAXPY, but without the multiply-add; therefore it has the same execution performance. Since the arithmetic instruction is an fadd rather than an fmadd, the work done is half that of DAXPY.

5. Loop 5 is the sum reduction of a vector. The compiler unrolls the loop by eight, producing eight partial sums (which are accumulated in registers), and totals the partial sums at the conclusion of the loop. This breaks the interdependence among the fadd operations, which would otherwise determine the performance of the loop, and the resulting performance is determined by the rate at which a single stream of floating-point loads can execute.

6. Loop 6 is commonly known as DDOT, or dot product. Just as in the case of loop 5, the sum reduction is split into eight partial sums to remove the floating-point arithmetic interdependence. The resulting performance is determined by the rate at which two streams of floating-point loads can be completed.

7. Loop 7 shows the average performance of floating-point double-precision square root. Floating-point square-root instructions may execute on either floating-point unit, but are not pipelined. Independent work can execute in the other floating-point unit concurrently, including another floating-point square-root instruction. Since both execution units can work in parallel, and a floating-point double-precision square root normally takes 36 cycles, the average time is approximately 18 cycles.

8. Loop 8 shows the average performance of floating-point double-precision divide. Floating-point divide instructions may execute on either floating-point unit but are not pipelined. Again, two divides execute in parallel, reducing the average time to 15 cycles.

9. Loop 9 exposes the six-cycle dependent operation latency in the floating-point execution unit. The loop represents a true mathematical recurrence: each operation requires the result from the previous operation. Thus, the execution time is limited by the effective pipeline depth of six. The performance ratio relative to POWER3 is simply half the ratio of the processor frequencies, since the dependent operation latency in the POWER3 is three.

10. Loop 10 is an indirect DDOT in which one of the vectors is independently addressed using a vector of integer indices. This is the crux of the sparse-matrix-vector multiply. Each iteration of the loop requires the index to be loaded (as an integer), and that value to be shifted so as to become a byte-oriented offset rather than a doubleword index; the shifted result is used to load the double-precision element of vector a. This is multiplied by the stride 1 vector y and accumulated into the scalar s. The compiler breaks dependencies on the arithmetic by using eight partial sums, just as in DDOT. The dependent chain of load-shift-(indirect) load is carefully scheduled to avoid stalls.
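The eight-way partial-sum transformation described for loops 5 and 6 can also be coded by hand. The following Fortran sketch is our own illustration of the idea for the sum reduction (not code taken from the measured kernels); it assumes n is a multiple of eight, and the compiler performs an equivalent transformation automatically at suitable optimization levels. Note that reassociating the additions in this way can change rounding slightly, which is why the compiler suppresses it when -qstrict is in effect.

      s1 = 0.0d0
      s2 = 0.0d0
      s3 = 0.0d0
      s4 = 0.0d0
      s5 = 0.0d0
      s6 = 0.0d0
      s7 = 0.0d0
      s8 = 0.0d0
!     Eight independent accumulators break the fadd dependence chain.
      do i = 1, n, 8
         s1 = s1 + x(i)
         s2 = s2 + x(i+1)
         s3 = s3 + x(i+2)
         s4 = s4 + x(i+3)
         s5 = s5 + x(i+4)
         s6 = s6 + x(i+5)
         s7 = s7 + x(i+6)
         s8 = s8 + x(i+7)
      end do
!     Combine the partial sums at the conclusion of the loop.
      s = s1 + s2 + s3 + s4 + s5 + s6 + s7 + s8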

3.1.8 Other tuning considerations

This section covers tuning for L2 cache access and discusses the branch prediction mechanism.

Tuning for L2 cache access
Blocking for an L2 cache is discussed in “Data cache blocking” on page 38.

Improving store performance to L2 cache
Store performance can be improved with some extra effort to distribute the stores across the three L2 controllers. The following loop is a simple way to accomplish this, and will perform as much as 40 percent faster than the code given in the table for vector lengths greater than around 90.

      nlim=(n/48)*48
      do ii=1,nlim,48
         do i=ii,ii+15
            x(i)=c0
            x(i+16)=c0
            x(i+32)=c0
         end do
      end do
      do i=nlim+1,n
         x(i)=c0
      end do

3.2 Tuning non-floating point applications

In the following sections we discuss aspects of tuning that are relevant to non-numeric applications. However, you should bear in mind that many of the aspects discussed in Section 3.1, “Tuning for numerically intensive applications” on page 25 are also relevant.

When tuning applications, you should determine whether to tune for throughput or for response time, depending on the type of application. Different approaches may be required in either case and, rather than studying the subject in detail here, we suggest referring to some of the books written on the subject.

Once the approach has been determined, we recommend the following steps (some typical monitoring commands are shown after this list):
- If the application is CPU bound, identify the critical parts of the application code using profiling (Section 5.5, “Locating hot spots (profiling)” on page 110) and determine whether the critical code can be improved.
- If the application is paging, identify how much memory is being used and what it is being used for. Consider using tools such as vmstat and svmon (refer to the AIX commands documentation). If the memory is allocated by the application, it may be possible to adjust this using configuration files. If there is not enough system memory, you could use vmtune (see Section 3.3, “System tuning” on page 54).
- If the application is disk or I/O bound, identify the hot disks or volumes (iostat, svmon, filemon). You may need to change the way I/O is performed, for example, use asynchronous I/O instead of synchronous I/O, or you may simply be able to move files from hot disks to disks that are less busy.
- If the application is network bound, investigate this with tools such as netstat, netpmon, and nfsstat. Tune network parameters with the no command.
- If your application is still not performing satisfactorily, start again at the top.
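As a starting point for the CPU, memory, and disk checks above, the following are typical invocations of the standard AIX monitoring commands (shown here for illustration only; see the AIX commands documentation for the full option sets):

vmstat 5        (CPU utilization, paging activity, and free-list size, every 5 seconds)
iostat 5        (per-disk transfer rates and utilization, every 5 seconds)
svmon -G        (global summary of real memory and paging space usage)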

Chapter 4, “Optimizing with the compilers” on page 69 provides a number of suggestions for tuning code.

3.2.1 The load/store and integer units
Loads, stores, and integer operations form the majority of non-floating point instructions executed.

The load/store performance is documented in Section 8.1, “Memory to memory copy” on page 155. Ultimately, the load/store performance depends on the size of the units transferred, that is, bytes, 32-bit words, or 64-bit words, and using larger units may positively affect performance.

There is a small penalty in load/store performance when data items cross 32-byte and 64-byte boundaries. Where possible, data structures should be organized so that they start on double word or word boundaries.

Note that integer divide instructions are relatively slow compared to other arithmetic instructions.

3.2.2 Memory configurations
pSeries 690 Model 681 systems support four memory controllers per MCM. Physically, the memory subsystem is implemented using memory books where each book contains two memory controllers, synchronous memory interfaces (SMIs) and DIMMs. Each controller can support up to 16 DIMMs. For a detailed description, see Chapter 2, “The POWER4 system” on page 5.

Memory is interleaved across controllers. Interleaving of addresses is a function of the L3 cache controllers and the L3 cache to which the memory controllers are attached, and is implemented by the L3 cache controller on the POWER4 chip.

Assuming an MCM has two equal-sized memory books attached to it, real memory is interleaved across the four memory controllers. Then physical memory on the next MCM is allocated and so on. The operating system is responsible for mapping real memory to virtual memory.

If you have only one memory book attached to an MCM, the L3 cache configures itself as a 64 MB shared L3 connected to one memory book, plus a 64 MB shared L3 with no backing storage. Memory is interleaved across the two controllers on the book. The bus between the L3 and the book must then process twice the traffic compared to the same amount of memory spread over two books, thus reducing memory bandwidth.

If an MCM has two books of different sizes installed, they operate independently with each book being two-way interleaved.

In the current release of AIX 5L Version 5.1, pages are allocated to a process from any memory book. In a future update to AIX 5L Version 5.1, pages will be allocated from memory attached to the MCM where the process is running. This memory affinity, combined with process affinity, will provide an improvement in application performance for most classes of applications.

3.3 System tuning

In this section we discuss the aspects of the system that are relevant to application performance running on the pSeries 690 Model 681. We begin with a review of the virtual memory architecture before examining large-page support. We then examine some of the system tuning parameters that can have large effects on application performance. Many of these are beyond the ability of the application programmer or user to modify directly because they require root authority to change, but it is useful to understand their possible effects.

3.3.1 POWER4 virtual memory architecture overview
This section provides an overview of the POWER4 virtual memory architecture for those readers who may be unfamiliar with it.

The architecture is an extension of the POWER3 architecture. Unlike POWER3, it provides two page sizes. The default page size is 4 KB but the hardware and operating system provide a large page (16 MB), which can be advantageous in certain circumstances. See Section 3.3.2, “Small and large page sizes” on page 58.

Program structure
POWER programs access memory through segment-based addresses. A segment-based address is calculated using a segment register (pointing to some storage) and a segment offset. In the 32-bit environment there are 16 segment registers and each can reference a segment of up to 256 MB. Some registers are reserved to address kernel memory. Other registers can be used for several purposes.

By default, segment 2 holds process data (Figure 3-5 on page 55). This includes any constants and non-stack variables and they are allocated from the bottom of the segment upwards. Stack space is allocated from the top of the segment downwards.

Programmers can prevent the stack overwriting non-stack data by limiting the size of the stack. This can be done by calling the linker (ld) with a -S option. The programmer can also use the shell ulimit command (ksh: ulimit, csh: limit) to limit the size of the stack and/or data area at run time.
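For example, under ksh the following illustrative settings (values in kilobytes) would cap the stack and data areas for processes started from that shell:

ulimit -s 32768        (limit the stack to 32 MB)
ulimit -d 131072       (limit the data area to 128 MB)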

In the 32-bit environment, the 16 segments are used as follows:

Segment 0        Kernel
Segment 1        Program code
Segment 2        Process data
Segments 3-12    Memory mapping and file mapping
Segment 13       Shared libraries
Segment 14       Kernel
Segment 15       Shared library data

Figure 3-5   32-bit environment segment register usage

Alternatively, you can call the linker specifying the -bmaxdata option. This has two effects: it specifies a maximum size for the user data area and it causes the user data to be placed in segment 3 (and subsequent segments as required up to a maximum of 8 segments for 32-bit programs) while the stack is placed in segment 2.

32-bit programs that need to access more than 256 MB of data can do so by using a contiguous set of segment registers. These programs need to be compiled with the -bmaxdata option. For example, using -bmaxdata:0x80000000 enables the maximum possible data space of 2 GB.
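For instance, a 32-bit program requiring the full 2 GB data space might be built with a command like the following (an illustrative compile/link line; the option can equally be given at a separate link step):

xlf -O3 -qarch=pwr4 -qtune=pwr4 -bmaxdata:0x80000000 -o big big.f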

In the 64-bit environment, there are effectively an unlimited number of segment registers. Note that in the 64-bit environment, -bmaxdata should not be used because it will limit the addressable data space.

Each segment consists of a number of pages (of the same size). By default, pages are 4 KB.

Program virtual memory is mapped onto physical memory in units of pages. The operating system maintains a map (page table) of virtual to physical memory for each process. Entries in the map are called page table entries (PTEs).

A PTE provides information about a corresponding page frame (which can be 4 KB or 16 MB in size). Pages of both sizes can co-exist on a system though a segment can only have pages of one size. Each PTE contains a number of status and protection bits as well as address information.

AIX Version 4.3 and AIX 5L Version 5.1 executables
Note that 32-bit executables compiled under AIX Version 4.3 will run unchanged under AIX 5L Version 5.1. Any code compiled in 64-bit mode under AIX Version 4.3 must be re-compiled before it can be used on AIX 5L Version 5.1. This means that:
- AIX Version 4.3 64-bit executables must be re-compiled from the source code to run under AIX 5L Version 5.1, not just relinked.
- AIX Version 4.3 64-bit object modules or library files cannot be linked with AIX 5L Version 5.1 object modules or library files. The AIX Version 4.3 64-bit modules must be re-compiled.

Note that you cannot link 32-bit together with 64-bit object modules under either operating system release.
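As an illustration, rebuilding a 64-bit module and executable on AIX 5L Version 5.1 might look like the following (file names are hypothetical; -q64 selects 64-bit mode and must be used consistently for compiling and linking, since 32-bit and 64-bit objects cannot be mixed):

xlf -q64 -O3 -qarch=pwr4 -qtune=pwr4 -c solver.f
xlf -q64 -o solver solver.o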

Address translation
Figure 3-6 gives an overview of the steps in the address translation process.

Effective Address  -->  (lookup in SLB)  -->  Virtual Address  -->  (lookup in Page Table)  -->  Real Address

Figure 3-6   POWER address translation

An effective address (EA) is the address of data or an instruction generated by the processor during the decode of an instruction. The EA specifies a segment register and offset information within the segment.

Address translation occurs in two steps: EA to virtual address (VA) and VA to real address (RA). If the EA cannot be translated, a storage exception occurs. While there are a number of different reasons for exceptions, programmers need only concern themselves with those caused by invalid data addresses. In these cases, the operating system will send a signal to the offending process and typically terminate it.

Conversion of a 64-bit effective address to a corresponding virtual address is performed by looking up the segment identifier (ESID) in the Segment Lookaside Buffer (SLB). The SLB is a cache of ESIDs and corresponding virtual segment identifiers (VSIDs) maintained by the operating system and referenced by the hardware. Each SLB entry also contains a valid bit and various flags. The 80-bit virtual address is formed by concatenating the VSID with the page and byte address from the EA as shown in Figure 3-7.
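As a worked example of the field widths: with the base 4 KB page size, the byte offset p is 12 bits, so the page field is 28 - 12 = 16 bits and a 256 MB segment contains 2^16 = 65,536 pages. The 80-bit virtual address is then the 52-bit VSID concatenated with the 16-bit page number and the 12-bit byte offset (52 + 16 + 12 = 80). With 16 MB large pages, p is 24 bits and a segment holds only 2^4 = 16 pages.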

The 64-bit effective address is divided into a 36-bit ESID, a (28-p)-bit page number, and a p-bit byte offset. The ESID is looked up in the SLB (which holds ESID, VSID, and flag fields) to obtain the 52-bit VSID; the VSID and page number together form the virtual page number (VPN), and the VPN plus the byte offset make up the 80-bit virtual address.

Figure 3-7   Translation of 64-bit effective address to 80-bit virtual address

Conversion of the 80-bit virtual address to its corresponding real address is done by hardware lookup in the page table. The page table is maintained by the operating system and its base address is held in a hardware register. The virtual page number (VSID + page number) is used to construct an index into the page table. The real address for the base of the page is extracted from the page table entry.

3.3.2 Small and large page sizes
Historically, the PowerPC Architecture supported the mapping between virtual and physical memory at a granularity of 4 KB pages. POWER4 systems introduce a new PowerPC Architecture feature that provides an alternate large-page size that can be used in addition to the 4 KB base-page size. The pSeries 690 Model 681 system supports a 16 MB large-page size. The implementation involves the selective use of large virtual/physical memory pages to back the process private data segment(s). A process can contain a mixture of small (4 KB) and large (16 MB) pages at a 256 MB virtual segment granularity. All pages within a 256 MB segment have the same size.

The primary benefit from large-page support is improved performance for applications. This refers to applications that access a large amount of memory in a sequential manner or have significant gather/scatter components (such as large, randomly accessed user data spaces). Large pages can improve performance for these applications by reducing the translation lookaside buffer (TLB) miss rate. POWER4 systems use memory data prefetching (and other techniques) to minimize memory latencies. Data prefetching starts when a new page is accessed and grows more aggressive as the page continues to be sequentially accessed. However, data prefetching must be restarted at page boundaries. The use of large pages can improve performance by reducing the number of prefetch startups.
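To put the TLB effect in perspective, a single 16 MB page maps as much address space as 4,096 individual 4 KB pages (16 MB / 4 KB = 4,096), so a given number of TLB entries covers correspondingly more data when large pages are used, and sequential prefetching can run much longer before reaching a page boundary.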

An update of AIX 5L Version 5.1, targeted for mid-2002, introduces a usage model that allows existing applications to use large pages without requiring source code changes and/or recompilations. The need for investment protection also dictates that the large-page data support must not impact source or binary compatibility for existing applications and the kernel extensions they depend upon if large-page support is not used by these applications. With the initial release of AIX 5L Version 5.1, which supports the pSeries 690 Model 681, there is already a low-level shmat/shmget interface, which will also be enhanced with future releases.

Note: At the time of writing this document the AIX implementation of large-page support was still under development. The following description is subject to change.

Large-page data areas
Large pages will be used for the data areas of the user address space. For technical applications, these areas consist of the user heap and main program BSS and data storage areas. These are the critical data areas for C programs, since the user heap supports malloc storage, BSS holds uninitialized program data, and data storage holds both initialized and (small) uninitialized data.

These are also the critical data areas for Fortran programs because the Fortran storage classes that require large pages reside within these areas, as follows (the short example after this list shows where typical declarations fall):

Static       Static variables reside in the data storage area. Large, uninitialized static variables reside in BSS.
Common       If a common block variable is initialized, the whole block resides in the data storage area; otherwise, the whole block resides in BSS.
Controlled   This storage class is used for allocatable arrays. Controlled variables reside in the user heap.
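The following small Fortran sketch (our own illustration, not code from the implementation described here) indicates where typical declarations fall according to these rules:

      program whereitlives
      implicit none
      real*8, save :: table(1000000)   ! large, uninitialized static data: BSS
      real*8 coef(100)
      real*8 x(500000)
      common /grid/ x                  ! uninitialized common block: BSS
      real*8, allocatable :: work(:)   ! controlled (allocatable): user heap
      data coef /100*1.0d0/            ! initialized data: data storage area
      allocate(work(2000000))
      table(1) = coef(1)
      x(1) = table(1)
      work(1) = x(1)
      print *, work(1)
      end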

Large pages are not required for other areas of the user address space, so 4 KB pages are used. These consist of the process stack, library data storage area, mmap regions, and user text. At the cost of 16 MB of physical memory resource per page, large pages would provide little or no benefit to applications if used for the process stack or library data because both typically represent small quantities of data. For a single-threaded process, the automatic and controlled automatic Fortran storage classes reside within the process stack and therefore do not use large pages. Typically, the amount of data is small.

Use of large pages for thread stacks within a multi-thread process is of value and can provide benefit through larger TLB coverage. The use of large pages for thread stacks is supported through the AIX pthreads library, which places thread stacks within the user heap.

It is not planned to support large pages for mmap regions. The support of large pages for user text is not relevant for technical applications.

Some technical and commercial applications do map and use shared memory segments within their user address spaces and can benefit from large pages. This is supported by the implementation of a large-page shmat/shmget interface. An application must be modified in order to use large pages for shared memory segments. This is because a special option must be specified at the time a shared memory segment is created if large pages are to be used for the segment.

In a future release of AIX 5L large pages will be allocated preferentially from physical memory that is close to the processor (or MCM) that initiated the request. This memory affinity is intended to hide the non-uniformity in latency and bandwidth (primarily the latter) of the memory subsystem.

Large pages are pinned (cannot be paged out or stolen) to memory the entire time an application executes. The large-page memory pool is a limited system resource. A failure will occur when a large-page application tries to allocate a large page and none are available.

Large page application support

Although the 64-bit kernel is the strategic AIX kernel, large-page data support is also provided in the 32-bit kernel.

A new bit flag will be maintained within the XCOFF and XCOFF64 executable file headers to record the large-page data attribute of a program. If the flag is set, this indicates that the program uses large pages; otherwise, it only uses small pages. It is deemed better to fail a technical application that requests an additional large page when none is available than have it silently execute with 4 KB pages.

The ldedit command will provide the ability to set and unset the large page flag of an executable file without the need for source code changes, recompiling, or relinking. It will also support setting maxdata and maxstack. The dump command will be modified to display the status of a program’s large page data flag.

The large page data usage is inherited over fork(). Check the manual pages for details on the memory duplication scheme. Large page data usage is not inherited over exec().

Because of different page protection requirements, the data model for 32-bit large page data applications is slightly different from the existing 32-bit process models (default and large memory) shown in Figure 3-5 on page 55. You can inspect a program’s memory layout with the help of the svmon command.

Large-page command support
An extension of the vmtune command will be provided to select at boot time the number of memory segments (256 MB) that will hold large pages.

A system administrator has the ability to control usage of the large-page memory pool by user ID (such as with the chuser and mkuser commands). This prevents unprivileged users from causing privileged large-page data applications to fail due to running out of large pages.

At this time, there is no Workload Manager (WLM) support provided to manage large-page physical memory or large-page applications. Large pages are neither pageable nor swappable. They are essentially pinned pages that are treated as unmanaged resources by the WLM.

The commands ps, vmstat, and svmon have been extended to report on large-page usage.

Large-page performance observations
General large-page support for Fortran and C applications is not available in AIX 5L Version 5.1. It is expected that when large-page support is available, uniprocessor performance for memory-bound kernels such as DAXPY will increase significantly. This is primarily due to the increased efficiency of data prefetching long vectors in large pages (see Section 2.4.8, “Hardware data prefetch” on page 21).

3.3.3 AIX system parameters
System factors that can influence application performance include:

- Hardware configuration
  – CPU configuration
    The speed, number of CPUs and the particular type of pSeries 690 Model 681 processor module installed are, of course, extremely important to application performance. In the POWER4 or POWER4 Turbo modules, the L2 caches are each shared between two CPUs. In the POWER4 HPC modules, the L2 caches are not shared. However, these are factors that cannot be adjusted or tuned for a given machine configuration, but are hardware characteristics of which the programmer should be aware.

  – Memory configuration
    Similarly, the memory configuration is also important. As described in Section 3.2.2, “Memory configurations” on page 53, the memory configuration of a particular machine determines the overall memory bandwidth available to the CPUs. This is not a factor that can be adjusted for a given machine configuration, but the programmer should be aware of the possible effects.
  – Storage configuration
    A detailed discussion of possible storage configurations that can be attached to the pSeries 690 Model 681 is outside the scope of this book. However, for I/O dependent applications, the underlying storage configuration can have a very significant effect on application performance. These can include General Parallel Filesystem (GPFS) or AIX Journaled Filesystem (JFS) configurations spread across multiple physical disks, which could include SSA, various types of SCSI disk, or fiber-attached disks. For this reason, awareness of the target storage configuration and the available memory configuration may favor programming choices that trade memory use for I/O.
- Software configuration
  – Paging space configuration
    In the scientific and technical computing domain, it is common for the application mix on a machine to be selected and controlled so as to avoid paging. With the advent of AIX Version 4.3.2, the paging allocation algorithm only allocates space in paging space when it is necessary to free up a page in memory. This means that for a system that is under no pressure for real memory pages, the paging space utilization will be very small. The lsps -a command might show one percent utilization. For this reason, and in order to save disk space, it is becoming common to configure paging space that is somewhat smaller than real memory. Large memory systems may be running applications that consume large amounts of memory. If so, it is important to consider the effect if multiples of these jobs are ever started such that memory becomes overcommitted, possibly exhausting paging space. A control mechanism, for example a job scheduling system such as LoadLeveler or AIX Workload Manager, should be considered. Alternatively, the paging space should be made large enough to accommodate such an event.

  – 32- or 64-bit kernel
    In general, if the main application or applications are 64-bit applications, then it is slightly better to use the 64-bit AIX kernel. For 32-bit applications, it is slightly better to use the 32-bit kernel. However, the overhead of running 64-bit applications on the 32-bit kernel (remapping system calls to 32-bit calls, and reshaping the data structures for these calls) is handled in the kernel and is small. The overhead of running 32-bit applications on the 64-bit kernel (reshaping data structures in system calls) is likewise small. Note that 64-bit applications from AIX Version 4.3.3 must be recompiled to run under AIX 5L Version 5.1, whether they will be run on the 64-bit kernel or not.
  – Kernel parameters
    Certain kernel tuning parameters can have a large effect on the performance of certain applications, depending on the application’s use of memory and files. The vmtune command (provided in the AIX fileset bos.adt.samples) provides a number of parameters that can be adjusted to suit particular applications:
    • Page replacement selection of file or application pages
      As demand for memory increases, the AIX Virtual Memory Manager (VMM) must occasionally reassign pages in use by programs to maintain a minimum number of free pages. The vmtune parameters minperm and maxperm set thresholds that determine the pool from which the AIX VMM page replacement algorithm will select pages to be reassigned. For the purposes of this discussion, allocated memory pages can be considered to be one of two types. File pages are pages containing data from files mapped into memory by AIX. Computational pages are pages allocated to running programs. When the percentage of real memory occupied by file pages falls below the minperm value, the page replacement algorithm steals both computational and file pages. When the percentage of real memory occupied by file pages is greater than the maxperm value, the page replacement algorithm steals only file pages.

      When the percentage of real memory occupied by file pages is between the minperm and maxperm values, the page replacement algorithm can steal pages from both computational and file pages. It will normally steal only file pages, unless the repage rate for file pages is higher than that for computational pages. If so, it will steal both types of pages. The default settings for these parameters are approximately minperm=20, maxperm=80, that is, 20 percent and 80 percent of real memory.
      Consider, for example, a program that uses a lot of memory for computation, but that also writes out large files sequentially. As the program writes out to the file, more and more file pages will be created in memory. Once the number of file pages reaches 80 percent of real memory, an application's computational pages will be largely protected from being stolen by the page replacement algorithm. Below this level, an application may find that its computational pages are being stolen to make way for file pages. If the working set size of the program is larger than 20 percent of real memory (100 - maxperm), then its performance may well suffer as its computational pages are stolen to make way for file pages. The larger the program, the greater this effect. Therefore, such an application could well benefit from setting minperm and maxperm lower than their default values, and in the case of maxperm possibly much lower.
      There is a third parameter, related to minperm and maxperm: strict_maxperm. This makes the maxperm setting a hard limit rather than a threshold.
      For example, to set the threshold below which VMM page replacement will steal computational pages to 5 percent and the threshold above which it will steal only file pages to 20 percent, the following command would be used:
      /usr/samples/kernel/vmtune -p5 -P20
      To set the maxperm threshold as a hard limit using strict_maxperm:
      /usr/samples/kernel/vmtune -h 1

    • Memory page replacement parameters
      The minfree, maxfree, mempools, and lrubuckets parameters may need to be adjusted together to reduce memory scanning overhead in a busy system and to maintain a large enough free list to readily satisfy demands from programs allocating memory. Recommendations for tuning these parameters can be found in the AIX Performance Management Guide (product manual, available on the Web) and the AIX 5L Performance Tools Handbook, SG24-6039. The maxfree parameter should be at least maxpgahead (see the maxpgahead kernel parameter description that follows in this section) greater than minfree. It is worth experimenting with larger values for minfree and maxfree than the defaults when trying to smooth out peaks and troughs of mixed workloads.
      Page replacement in AIX is performed by the lrud daemon. From AIX 4.3.3 onwards, this daemon is multi-threaded, and the system memory is divided into a number of pools. The number of pools is specified by the mempools parameter. On a large-memory SMP system, this allows memory scanning to be performed more efficiently than with one large memory pool. Each memory pool can be further subdivided into a number of sections called buckets. The size of these buckets is specified by the lrubuckets parameter. These buckets are scanned individually by the lrud daemon using the VMM page replacement algorithm. This involves a two-pass process in which unreferenced pages are marked in the first pass and, if a free page is not found, a second pass is made and pages still marked as unreferenced will be replaced. On a large, busy system, with a single bucket across all of memory, this two-pass memory scan would be too great an overhead. The subdivision of memory into buckets reduces this overhead.
    • I/O pacing with min_pout and max_pout
      These parameters are of importance in improving the performance of both single, large applications performing sequential I/O, and multiple jobs that perform I/O. min_pout and max_pout are system attributes that control I/O pacing. That is, max_pout sets a maximum threshold for pending I/O requests per file. Above this level, an application generating large numbers of I/O requests will be put into a sleep state until the number of pending I/O requests falls to or below the min_pout value. The default settings are zero for both values, which means no checking, but this can allow a high I/O volume application to saturate the system’s capabilities and seriously affect the performance of other applications. However, enabling checking with too low values for these parameters can reduce the performance of such a high I/O volume application. Therefore, where multiple applications must share a system, one approach to setting I/O pacing would be to set these values high (several thousand), and measure the effect on all types of workload on the system. The aim should be to get these values as high as possible for maximum throughput of the high I/O volume application while not reducing the performance of other workloads. These parameters can be set with the chdev -l sys0 command, using the appropriate attribute.
    • Read ahead with minpgahead and maxpgahead
      These values control the amount by which the VMM will schedule pages in advance of the current page when reading files sequentially. When sequential access is detected, the read-ahead mechanism brings in two pages, and at each confirmation the number of pages read ahead is doubled up to maxpgahead. For applications that perform large amounts of serial I/O, it may be advantageous to set a relatively large maxpgahead value (the default value is 8). However, the underlying I/O subsystem should be taken into account when selecting this value. For example, if the file is stored on file systems striped across multiple devices then a higher value may be appropriate than if it is stored on a single disk device.
    • max_coalesce
      This parameter is an attribute of logical disk drives that sets the maximum number of bytes to be transferred to the disk by the device driver in a single operation. When using SSA RAID arrays for sequential I/O, this value should be set to the number of disks across which the data is striped multiplied by 64 KB.
    • Sequential and random write-behind
      Files mapped into memory are partitioned into 16 KB clusters (four pages when using small pages). When writing sequentially, all four pages in a cluster will be modified one after another. The parameter numclust specifies the number of such clusters, behind the current cluster, that the VMM will allow to accumulate before scheduling the writing of their modified pages. By default this is set to 1, which means that modified pages from sequential files should not accumulate in memory. For randomly written files, this mechanism does not apply. There is another parameter, maxrandwrt, which sets a maximum number of modified (also known as dirty) pages for a given file. Once this number is exceeded, the VMM will schedule these pages for writing. It should be noted that when the syncd daemon runs, these modified pages will be written to disk anyway, but these parameters can prevent so many modified pages from building up between runs of syncd that the syncd run itself affects system performance. These parameters can be modified using the vmtune command.
    • lgpg_regions, lgpg_size
      As discussed in Section 3.3.2, “Small and large page sizes” on page 58, the use of large pages for virtual memory has the potential to significantly improve the performance of certain applications. In order for an application to use large pages, there must have been large pages defined to the system at system IPL. The lgpg_regions parameter specifies the number of large pages to be made available at the next reboot. The lgpg_size parameter specifies the size of these pages, and for the IBM eServer pSeries 690 Model 681 POWER4 machines this would be 16 MB, specified in bytes. An example of a sequence of commands and actions to define 8 GB of large pages and make them available might be as follows:
      vmtune -g 16777216 -L 512
      bosboot -a
      shutdown -Fr
      The exact usage of the bosboot command would depend on the particular system being configured for large pages.

3.3.4 Minimizing variation in job performance
Various factors can affect the consistency of the performance of a job from run to run. These include:

- System factors
  – Competing jobs
    Multiple jobs running in the system simultaneously can compete for resources such as CPU, memory, and I/O bandwidth. One approach to reducing the variability introduced by running multiple jobs on the system is to carefully select jobs that require different types of resources to be run together. Once jobs have been characterized in this way, the running of the job mix can be controlled using a job scheduling system such as LoadLeveler. In practice, many large applications have requirements for all the above types of resource. Another approach is to use the AIX Workload Manager to guarantee resources to a particular job, and perhaps share the remaining resources among other jobs in the system. For examples of the effects of multiple jobs running on the system, see Chapter 8, “Application performance and throughput” on page 153.

    For more information on AIX Workload Manager, see AIX 5L System Management Concepts: Operating System and Devices, and AIX 5L System Management Guide: Operating System and Devices. For more information on LoadLeveler, see Using and Administering IBM LoadLeveler for AIX, SA22-7311.
  – External I/O performance
    If a job is dependent on the performance of shared storage facilities that are heavily utilized at certain times, then it may experience variation in runtime. In this situation, it may be possible to trade memory use for I/O to reduce the dependency on the external I/O performance.
  – System software levels
    Occasionally, different system software levels will implement different default values for certain tuning parameters. This can cause unexpected variation in job performance. Software updates and fixes should therefore be checked and tested carefully for such changes.
- Application factors
  – Processor binding
    In order to guarantee the sharing of L2 cache between certain threads, or to guarantee that threads are using dedicated L2 cache, you may have bound threads or processes to specific processors. Depending on the thread scheduling scope (see Section 7.1.1, “SMP runtime behavior” on page 126), and the numbers of threads and processors, this could introduce variation in runtime behavior.
  – Variation in data
    The previous factors apply to variations in runtimes for the same job running with the same data. It is also the case that variations in the data input to the job can cause variability, even though the problem to be solved is the same size, and the program may take longer to converge to a solution. With parallel jobs, variations in input data may lead to hotspots where certain processors have more work to do than others, leading to an overall increase in the runtime of the job. An approach to resolving this, which is outside the scope of this book, is to implement a dynamic load balancing design in the parallelization of the program.


Chapter 4. Optimizing with the compilers

In this chapter we describe the features of the XL Fortran and C and C++ compilers that relate to optimization for the POWER4 processors. We begin with the optimization options available with particular emphasis on those that benefit applications running on POWER4 processors. In subsequent sections the particular techniques and considerations for improving performance using the compiler are discussed.

4.1 POWER4-specific compiler options

In this section some useful XL Fortran compiler options that can be used to improve performance are presented. We then focus on options with specific benefits on POWER4 microarchitecture machines. Finally, we make some recommendations for initial attempts at optimization.

It should be noted that, when specifying conflicting compilation options on the command line, the last option wins. For example, consider the following command:

xlf -O3 -qsource -qlist -o monte -O2 monte.f

The optimization flag -O is specified twice with two different levels. In this case, optimization level two would be used because this is specified last. This also applies to those options that are implied by another option. See the description of -O4 and -O5 in the following section.

4.1.1 General performance options
In the following sections, useful Fortran, C and C++ compiler general performance options are discussed.

XL Fortran options
The following options are provided by the XL Fortran compiler. For more details see the XL Fortran for AIX User’s Guide, SC09-2866.
- -O, -O2, -O3, -O4, -O5
  The -O flag is the main compiler optimization flag, and can be specified with several levels of optimization. -O and -O2 are currently equivalent. At -O2, the XL Fortran compiler’s optimization is highly reliable and usually improves performance, often quite dramatically. -O2 avoids optimization techniques that could alter program semantics.
  -O3 provides an increased level of optimization. It can result in the reordering of associative floating-point operations or of operations that may cause runtime exceptions, which could slightly alter program semantics. This can be prevented through the use of the -qstrict option together with -O3. At this optimization level, the compiler can also replace divides with reciprocal multiplies. -O3 is often used together with -qhot, -qarch, and -qtune.
  -O4 provides more aggressive optimization and implies the following options:
  – -qhot
  – -qipa
  – -O3
  – -qarch=auto
  – -qtune=auto
  – -qcache=auto
  -O5 implies the same optimizations as -O4 with the addition of -qipa=level=2.
  In general, increasing levels of optimization require more time (sometimes considerably more time) and more memory during compilation. In addition, -O4 and -O5 sometimes need additional space in /tmp (or the location specified by the TMPDIR environment variable). The recommendation is to have at least 200 MB available, and potentially up to 400 MB.

- -qarch, -qtune, -qcache
  These options allow the compiler to take advantage of particular hardware configurations for the purposes of optimization.
  – -qarch specifies the instruction set architecture of the machine, that is, which instructions the compiler will generate. Specifying certain values for this option can generate code that will not run on other machine types. For example, -qarch=pwr2 would generate code that might not run on a POWER4 machine. The -qarch=com option generates executable code that will run on any POWER or PowerPC hardware platform. However, this option also prevents the compiler from generating any of the optional PowerPC architecture instructions. In the case of POWER4, these instructions include the two floating-point square root instructions, fsqrt and fsqrts, which are likely to be important in numerically intensive applications.
  – -qtune instructs the compiler to perform optimizations for the specified processor. These can include taking into account instruction scheduling for the specified architecture. This option only has an effect when used with an optimization level of -O (or -O2) or greater.
  – The -qcache option is only effective if the -qhot option is also specified, explicitly or implicitly with, for example, -O4. This option can be used to specify the exact cache hierarchy of the machine. This can be useful if the target machine has a different cache hierarchy from the default. -qcache is designed to describe the complete cache hierarchy of the system including the TLB. If specifying cache configurations with -qcache, then the specifications should be ordered by capacity and, to be very precise, should include the ERAT, TLB, L1, L2, and L3. At present, the compiler uses the line size of the cache for optimization, but a future level of the compiler may use capacity and miss cost more aggressively. At that time, if compiling for machines with different cache hierarchies, then the most conservative specification would be the larger line size, the smaller capacity, the smaller associativity level, and the larger cost.
  If the program will be compiled and run on the same machine, then -qarch=auto should be used or -qarch should be set to the specific processor. The default setting is -qarch=com in 32-bit mode, and -qarch=ppc in 64-bit mode. The compiler will then automatically select default settings for -qtune and -qcache appropriate to the processor architecture selected. If the compilation machine is different from the target machine, then it can be useful to specify the target architecture for -qarch and -qtune.

  For example, if compiling and running on a POWER4 pSeries 690 Model 681, then use -qarch=auto -qtune=auto, or -qarch=pwr4 -qtune=pwr4. If compiling on this machine but executing on an RS/6000 SP 375 MHz POWER3 High Node, then -qarch=pwr3 -qtune=pwr3 should be used. Different combinations of these two options can be used to specify the machines on which the executable will run, but produce code that is optimized for one of the target machine types.
- -qhot
  The -qhot option performs high-order transformations to maximize the efficiency of loops and array language. It can optionally pad arrays for more efficient cache usage and can generate calls to vector intrinsic functions such as square root and reciprocal. As with -O3, some of the transformations can slightly alter program semantics, but this can be avoided by also using -qstrict. The -qhot option is made less effective by the -C array bounds checking option, but remains active. Note that -qhot is selected by default when the -O4, -O5, or -qsmp=auto options are specified.
- -qalias
  The -qalias option can be used to tell the compiler about the types of aliasing that may be found in the program, that is, where an area of storage may be referred to by more than one name. The compiler may be able to perform additional optimization with this information, for example for programs that violate parameter aliasing rules (see the discussion of -qalias in the XL Fortran User’s Guide, SC09-2866). Compiling with -O2 -qalias=nostd may give better performance than using no optimization at all.
- -qalign
  The -qalign option specifies the alignment of data objects in storage. There are two suboptions:
  – -qalign=4k, which causes certain objects over 4 KB to be aligned on 4 KB boundaries and can be useful for optimizing I/O when using data striping.
  – -qalign=struct, which can specify the alignment of derived type objects such as structures.
- -qassert
  The -qassert option can be used to give the compiler information about loop dependencies and iteration counts, which may allow additional optimizations.

- -qcompact
  The -qcompact option reduces optimizations that increase the size of the executable. For current systems with large memories this is not commonly used. However, it can be useful in the rare cases where -O3 generates code that performs worse than code generated with -O2. In these cases, -O3 -qcompact is often better.
- -qpdf, -qfdpr
  The -qpdf option enables profile-directed feedback. This is a two-step process where profile information from a typical run or set of runs is used for further optimization. -qfdpr generates object files containing the necessary information for use with the AIX Feedback Directed Program Restructuring (FDPR) command.
- -qipa
  The -qipa option can improve basic optimization by doing analysis across procedures. This must be specified at both compile and link stages, and there are various suboptions to give the compiler more information about the characteristics of procedures within the program, and how to handle references to procedures that have not been compiled with -qipa.
- -qsmp
  The -qsmp option is used for shared memory parallelization of certain loops within a program. It is possible to make the compiler use the minimum optimization necessary to achieve parallelization by using -qsmp=noopt. Shared memory parallelization is covered in more detail in Section 7.1, “Shared memory parallelization” on page 126.
- -qstrict, -qstrict_induction
  These options prevent the compiler (with the -O3, -qhot, and -qipa options) from performing optimizations that could alter the semantics of the program and potentially produce results that differ from unoptimized code. -qstrict_induction applies to such optimizations on loop counter variables. Both of these options can result in reduced performance.
- -qunroll
  The -qunroll option allows the compiler to unroll loops within a compilation unit. By default, with optimization level 2 (-O, or -O2) the compiler performs loop unrolling if analysis indicates that it will be beneficial. If such unrolling actually reduces performance for a procedure, then -qnounroll could be used to turn it off for a particular procedure. Loops where it is beneficial to unroll within this procedure could then be marked with the UNROLL compiler directive. See Section 4.2, “XL Fortran compiler directives for tuning” on page 80.

- -Q
  The -Q option allows the compiler to inline functions and procedures. That is, the compiler can move the code from the inlined program unit into the code of the calling unit and potentially achieve further optimizations by doing so. This option can also take the names of functions or procedures to be inlined, or those to be excluded from inlining.
- -qlibansi, -qlibessl, -qlibposix
  These options specify that any references to functions that have the same name as a library function are references to that function.
  -qlibansi    ANSI C library.
  -qlibessl    ESSL library. See Section 6.1, “The ESSL and Parallel ESSL libraries” on page 114.
  -qlibposix   POSIX 1003.1 library.
- -qnozerosize
  The -qnozerosize option tells the compiler that there are no zero-sized objects in the program, which can improve performance in some programs by removing the need to check for them.
- -g
  The -g option is not a performance flag. It generates symbol and line number information in the object files that can be used for debugging. However, it is important to note that compiling with -g has almost no effect on performance. It does not prevent optimizations performed by the compiler.
- -p, -pg
  These options are used to generate monitoring information when producing runtime profiles of a program. See Section 5.5, “Locating hot spots (profiling)” on page 110 for more details.

VisualAge C and C++ options
With the following exceptions, all the options mentioned previously are also valid when used with the C and C++ compilers:
- -qhot
- -qnozerosize
- -qlibessl, -qlibposix
- -qsmp is supported by the C compiler, but is not supported by the C++ compiler. However, C++ can declare (as extern “C”) and call C functions that are coded with shared memory parallelism through OpenMP pragmas.

In addition, the following performance-related options exist for these compilers:
- -qalias=ansi
  The -qalias=ansi option specifies the use of type-based aliasing during optimization. This is synonymous with the obsolete -qansialias, and allows the compiler to make assumptions about the types of objects accessed via pointers.
- -qfold
  The -qfold option evaluates constant floating-point expressions at compile time.
- -qinline
  The -qinline option is equivalent to the -Q option described in the Fortran options.
- -qunroll=n
  The -qunroll option accepts a value n, where n is the depth to which the compiler should unroll inner loops. The default value of n is four, and the maximum value is eight. This option takes effect when an optimization level of -O2 or higher is specified.

4.1.2 Options for POWER4
This section describes specific optimization actions performed by the compiler for POWER4 microarchitecture machines.

Compiler options that perform specific optimizations for POWER4 microarchitecture machines are as follows:
- -qarch=pwr4
- -qtune=pwr4
- -qcache=auto or
  -qcache=level=1:type=i:size=64:line=128:assoc=0:cost=13 \
  -qcache=level=1:type=d:size=32:line=128:assoc=2:cost=11 \
  -qcache=level=2:type=c:size=1440:line=128:assoc=8:cost=125

Note that the cost value for the L2 cache miss above is derived from an average for data misses across the various L3 caches.

4.1.3 Using XL Fortran vector-intrinsic functions
The compiler is capable of generating calls to specially optimized vector versions of intrinsic functions. These are provided in the libxlopt.a library shipped with XL Fortran, and the standard linkage sequences for the various invocations of the Fortran compiler (for example xlf, xlf90, xlf90_r) include this library. Calls to intrinsic functions can often make up a significant percentage of the CPU usage profile. For example, one weather modelling program spends 22 percent of its time in the intrinsic functions.

These calls may be generated using the -qhot compiler option and will be satisfied from the libxlopt.a library. Certain other options will prevent the generation of these calls: -qhot=novector, or -qstrict. For example, the following code outline could generate vector-intrinsic function calls when compiled with -qhot:

      do i=1,n
         c(i)=cos(a(i))
         .
         .
      end do

Vector versions of the following functions exist, with examples of the calls provided in the following list:
- Cosine: cos(a(i))
- Division (not strictly speaking a function, but a vector division function exists in libxlopt.a): a(i)/b(i)
  At present, although this function exists, the compiler does not generate this function call, but uses a combination of a vrec or a vsrec function call and multiply instructions.
- Exponential: exp(a(i))
- Natural logarithm: log(a(i))
- Reciprocal: 1.0/a(i)
- Reciprocal square root: 1.0/sqrt(a(i))

- Sine: sin(a(i))
- Square root: sqrt(a(i))
- Tangent: tan(a(i))

There are two versions of each function, a double-precision version and a single-precision version. The compiler will generate the appropriate call.

These functions are derived from the MASS library functions (see Section 6.2, “The MASS libraries” on page 117 for more information on the MASS library). Since the XL Fortran release schedule is separate from the freely available MASS library, the libxlopt.a versions may lag behind the MASS library versions. This means that any improvements in the performance of these routines, for example by algorithm changes that take advantage of the POWER4 architecture, are likely to be available in the MASS library first.

Also note that the use of these functions is subject to the same considerations as the use of the MASS library functions.

Examples of the speedups that can be seen with the vector-intrinsic functions are shown in Table 4-1.

Table 4-1   Vector-intrinsic function speedups

Function      Speedup (double precision)    Speedup (single precision)
cos                    3.94                          3.90
div                    not generated                 not generated
exp                    4.60                          4.55
log                    5.80                          5.74
reciprocal             1.10                          2.17
rsqrt                  2.26                          6.23
sin                    4.03                          3.85
sqrt                   1.09                          2.17
tan                    4.27                          3.79

The double-precision numbers were generated with the following program:

      program cos_test

      integer m,n
      parameter ( n = 1000 )
      parameter ( m = 10000 )
      real*8 a(n)
      real*8 b(n)
      real*8 fns
      real*8 time1,time2,rtc
      real*8 ctime
      integer i,j

      ctime=0.0d0

      call random_seed
      call random_number(a)
      call random_number(b)

      do i=1,m
         time1=rtc()
         do j=1,n
            b(j)=cos(a(j))
         end do
         time2=rtc()
         ctime=ctime+(time2-time1)
         call dummy(b,a,n)
      end do

      fns=float(m*n)

      write(6,998)ctime,fns/(ctime*1.0e6)
  998 FORMAT('Cosine: intrinsic time (s) = ',F6.2, &
             ' cos/s = ',F8.2)

      stop
      end

For the other vector-intrinsic functions, the call to cos in the preceding example was replaced with the appropriate function or operation.

The dummy subroutine does nothing, but calling it prevents the compiler from optimizing away the loop.
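A minimal version of the dummy routine, compiled in a separate file so that the compiler cannot see that it performs no work, might look like this (our own sketch):

      subroutine dummy(b,a,n)
      integer n
      real*8 a(n),b(n)
      return
      end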

78 POWER4 Processor Introduction and Tuning Guide 4.1.4 Recommended options The recommended starting compiler options for the POWER4 microarchitecture are: -O3 -qarch=pwr4 -qtune=pwr4

If the executable must execute on POWER3 and POWER4 machines, but the performance on the POWER4 machine is most important, then -qarch=pwr3 and -qtune=pwr4 should be specified instead.

The -qhot option may give significant performance benefits, at the cost of additional compile time. It is also essential for certain other optimization flags to take effect, including -qcache.

If the program makes extensive use of the intrinsic functions listed in Section 4.1.3, “Using XL Fortran vector-intrinsic functions” on page 76 and the programmer does not wish to modify the code to use the MASS library functions, then there may be considerable benefit from using the vector versions from libxlopt. The -qhot option should then be used. In this case, the benefit from the vector-intrinsic functions can be determined by comparing the effect of compiling with -qhot and compiling with -qhot=novector.

Note: As described in Section 4.1.1, “General performance options” on page 70, -O4 implies -qhot.

4.1.5 Comparing C and Fortran compiler code generation
This section compares C and Fortran compiler code generation.

Numerically intensive code
The IBM Fortran and C compilers share common technology and, in particular, a common back-end optimizer. To investigate potential variations, we examined the code generated by the C and Fortran compilers for simple loops (DDOT and DAXPY). The C code was written using arrays instead of pointers for similarity with Fortran.

Using only the -O2 compiler option, the Fortran compiler generates unrolled loops while the C compiler does not. The Fortran examples run faster than the C examples.

Using the recommended compiler options (-O3 -qarch=pwr4 -qtune=pwr4), the code generated by the compilers was essentially the same. The execution times for the C and Fortran loops were within 1 percent variation. This is as expected since the back-end optimizer is common to both compilers.

Chapter 4. Optimizing with the compilers 79 The essential code is shown below: ddot.f ddot.c do j=1,iterations for (i=0;i

The example for DAXPY is similar to the code sample above:

      x(i) = x(i) + c1*y(i)          (Fortran)
      x[i] += c1*y[i];               (C)

Non-numeric intensive code
We made a brief investigation of the impact of compiler options on non-numeric C code by compiling applications with -O3 -qarch=com and with -O3 -qarch=pwr4 -qtune=pwr4 and comparing execution times. In one case, we compared the performance of the UNIX utility nroff and in another we compared a string manipulation script written in Perl (where the Perl interpreter was compiled with the different options). In both the nroff and Perl cases, there was no difference in execution time.

We compiled the FASTA program (see Section 8.6, “FASTA genetic sequencing program” on page 168) with both -qarch=com and -qarch=pwr3 -qtune=pwr3 and re-ran the arp_arath test. Execution times of the com, pwr3, and pwr4 versions were within 1 percent variation. Since this program performs large amounts of I/O, we consider these execution times equivalent.

Note that -qarch=com will ensure that the compiler uses only standard PowerPC instructions. Optional instructions such as the hardware floating-point square root and non-PowerPC instructions such as the POWER2 loadquad instruction will not be used. This can have a significant impact on performance.

4.2 XL Fortran compiler directives for tuning

A number of XL Fortran compiler directives exist that the programmer can use to improve performance without extensive modification of source code. These directives are activated at compile time by specifying the -qdirective option with the trigger constant that is used in the source code. These directives are discussed in the following sections.

4.2.1 Prefetch directives

Prefetch directives are directives that generate specific machine instructions for accessing memory locations. They can be used to influence the hardware prefetch mechanism so that, for example, data that will be needed later in the execution begins to be prefetched before it is actually needed. Not all of these prefetch directives have an effect on all machine architectures.

- PREFETCH_BY_LOAD

  This generates a load byte and zero (lbz) instruction for a memory location. It can be used to trigger prefetching for data that may be loaded or stored later. As discussed in Section 2.3.2, “Instruction fetch, group formation, and dispatch” on page 9, load misses are entered into the prefetch filter queues, and on confirmation will automatically initiate prefetching. This is not the case for store misses. By using the PREFETCH_BY_LOAD directive and specifying a data element to be stored, it is therefore possible to precede the store miss with a load miss that will be entered into the prefetch filter queue. A second prefetch directive to the next cache line in the desired direction will initiate prefetching of the cache lines where the data will be stored.

  This directive (and the related technique of multiplying a data element to be stored by 0.0) was quite useful on the POWER3 architecture machines. With POWER4, it is less useful with the exception of store-only or initialization operations. An example of its usage follows:

         do i=1,n
   !P4_bl* PREFETCH_BY_LOAD(x(i+17))
           x(i) = s
           .
           .
         end do

  In this example, where x is a double precision floating-point array, the prefetch directive generates a load instruction for a data element in the next cache line beyond x(i). Subsequent iterations will issue loads for consecutive elements of x in this cache line, with a new cache line being referenced every 16 iterations. The exact offset from i used in the prefetch directive depends on the size of the loop.

  This directive is not always beneficial and can be detrimental to performance. The use of this directive inserts extra load instructions into the executable code that must be scheduled and completed among the other instructions, and for which the data will be loaded into L1 cache. This will not benefit the store operation, and may replace data that would otherwise be reused from L1 cache.

- PREFETCH_FOR_LOAD, PREFETCH_FOR_STORE

  Each of these directives generates a cache line touch instruction (dcbt and dcbtst respectively). They will cause a cache line to be loaded into L1, but will not by themselves initiate hardware prefetching. However, they are treated like load misses and generate entries in the prefetch filter queue. Subsequent directives targeting consecutive cache lines will therefore initiate prefetching. These directives have an advantage over the PREFETCH_BY_LOAD directive in that the instructions generated do not have to wait for the cache line to be loaded for completion.

4.2.2 Loop-related directives

These directives are used to specify the characteristics of do loops in Fortran and instruct the compiler to perform certain optimizations in relation to do loops. They can also be used in association with automatic parallelization using the -qsmp option.

- ASSERT

  This directive can be used to specify likely iteration counts and dependency information between iterations (not within an iteration) for a specific do loop.

- INDEPENDENT

  This directive indicates that the iterations of a do loop can be performed in any order.

- UNROLL(n)

  This directive indicates that the compiler may unroll the following loop to depth n. If the compiler can unroll the specified loop, then it should do so. This is most useful for unrolling a particular loop in a compilation unit while preventing other loops from being unrolled with the -qnounroll compiler flag. Another use of this directive is to specify a different unrolling depth from the one the compiler would select automatically at optimization level -O2 and above.

- CNCALL

  This directive indicates to the compiler that no dependencies between iterations exist for procedures called by the following loop.

- PERMUTATION

  This directive indicates to the compiler that one or more integer arrays have no repeated values. This would be used where the integer array is being used to index another array.
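The following sketch shows how some of these directives might be placed in source code. The routine, loop bounds, and array names are invented for illustration, and the trigger constant IBM* is assumed to be the compiler default; other trigger strings can be enabled with the -qdirective option as noted above.

      subroutine scale_add(a,b,ind,n)
      integer n,i
      integer ind(n)
      real*8 a(n),b(n)

!IBM* ASSERT (ITERCNT(1000))
!IBM* UNROLL(4)
      do i=1,n
         a(i)=a(i)+b(i)
      end do

!IBM* PERMUTATION (ind)
      do i=1,n
         a(ind(i))=a(ind(i))+b(i)
      end do

      return
      end

The first loop carries an expected iteration count and an explicit unrolling depth; the second tells the compiler that ind contains no repeated values, so the indirect updates are independent of each other.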

4.2.3 Cache and other directives

In this section, the CACHE_ZERO and LIGHT_SYNC directives are discussed.

- CACHE_ZERO

  This directive generates a dcbz instruction that zeros a cache line without needing to first load the cache line from memory. It could be used for efficient initialization of storage, or as a mechanism for establishing cache lines that will be overwritten without the need for them to be loaded into cache first (either by a load or store miss). However, this instruction should be used with care. It modifies the whole cache line, so the programmer should make sure that only data elements that are intended to be set to zero are in this cache line, and that no other processor requires access to the cache line until this operation is complete. For example, consider:

      CACHE_ZERO(x(1))

  This will cause the cache line containing x(1) to be set to zeros. There is no certainty that x(1) will be at the beginning of a cache line; it could be anywhere in the cache line. It is, therefore, essential to check the location of this element with, for example:

      MOD( LOC( x(1) ), 128 )

  On POWER4 microarchitecture machines, this directive is likely to be of benefit only when the 128-byte line to be zeroed is in memory and not anywhere in the cache hierarchy. If the data is in the L1 or L2 caches, then using this directive is likely to result in significant degradation in performance. If the data is in L3 cache, then there is likely to be a slight degradation in performance. However, when the programmer is sure that the data is not in cache, for example in an initialization near the beginning of a program, then this directive does give a performance benefit.

- LIGHT_SYNC

  This directive allows synchronization between multiple processors without waiting for a confirmation from each processor. This can reduce the performance impact of synchronization between processors. It generates a lightweight sync instruction, which is a special case of the sync instruction. It can be used to guarantee the ordering of loads and stores relative to a specific processor. This has some use in the pthread programming model where, for example, one thread updates a value that is then used by a second thread. The lightweight sync can be used to ensure that the second thread does not access the value until the first thread has updated it.

      Thread 1                      Thread 2

      flag=0                        .
      .                             .
      x=newvalue                    do while ( flag .ne. 1 )
      lightweight sync              .
      flag=1                        end do
      .                             y=x

The lightweight sync between the two store operations (shown in the preceding example) in Thread 1 means that these operations must be completed in order. This means that when Thread 2 polls the flag and finds that it has been set to one, the update of x must be complete and so it is safe to use this value.

4.3 The object code listing

This section reviews the process of obtaining an object code listing from the compiler. The object code listing shows the instruction sequences generated by the compiler and can assist the programmer in a number of ways:

- Understanding the impact of the compiler options used, such as the optimization and unroll flags.
- Identifying instructions that may be platform specific and therefore cause the code to fail on other platforms.
- Identifying potential problems with the compiler.

Obtaining an object code listing from a Fortran, C, or C++ compilation is simple. Invoke the compiler with the -qlist option. This will generate a file with the same prefix as the module being compiled but with an .lst extension.

For example, consider the following C program:

   1 #include <stdio.h>
   2 main()
   3 {
   4    printf("Hello world.\n");
   5 }

Compiling with the -qlist option generates hello.lst. This file contains several sections depending on the language and compiler:

- The options section lists the compiler options (both default and those specified on the command line) that were used for the compilation.
- The file table section lists any included source files.
- The source section (only present if -qsource is specified) lists line-numbered source code and warnings and errors from the compiler.
- The compilation epilogue section provides a summary of the number of source lines processed, errors, warnings, and other messages.

- The object section lists the pseudo-assembler code generated.

Note: Compiling with the -S flag will generate assembler code that can be read by the assembler, as(1), and used to generate a .o file.

The pseudo-assembler in the object section is somewhat easier to read (than that generated with -S) and the listing includes line numbers that are normally invaluable in understanding the listing. In addition to the assembler listing, the object section also contains a map showing register usage.

The following is part of the object section for the hello world program:

  | 000000 PDEF main
 2| PROC
 0| 000000 mfspr 7C0802A6 1 LFLR gr0=lr
 0| 000004 stw   93E1FFFC 0 ST4A #stack(gr1,-4)=gr31
 0| 000008 stw   90010008 2 ST4A #stack(gr1,8)=gr0
 0| 00000C stwu  9421FFC0 0 ST4U gr1,#stack(gr1,-64)=gr1
 0| 000010 lwz   83E20004 1 L4A  gr31=.+CONSTANT_AREA(gr2,0)
 4| 000014 ori   63E30000 2 LR   gr3=gr31
 4| 000018 bl    4BFFFFE9 0 CALL gr3=printf,1,gr3,printf",gr1
            ,cr[01567]",gr0",gr4"-gr12",fp0"-fp13"
 4| 00001C ori   60000000 1
 5| 000020 addi  38600000 0 LI   gr3=0
 5| CL.1:
 5| 000024 lwz   80010048 1 L4A  gr0=#stack(gr1,72)
 5| 000028 mtspr 7C0803A6 2 LLR  lr=gr0
 5| 00002C addi  38210040 1 AI   gr1=gr1,64
 5| 000030 lwz   83E1FFFC 0 L4A  gr31=#stack(gr1,-4)
 5| 000034 bclr  4E800020 2 BA   lr

The left-hand column in the example shows the corresponding source line number. Column two contains the relative instruction address and column three contains the instruction. The right-hand column contains the instruction operands.

Column five is a number indicative of the number of cycles to execute the instruction. A zero means the instruction can be overlapped with previous instructions. Note, these numbers should not be used to estimate execution time from cycle times, because they do not accurately reflect the POWER4 microarchitecture.

In the hello world example, locate the five instructions on line zero (which doesn’t exist in a program). These instructions set up the stack and return address for the program. Instructions for line four set up for and then call the printf function. Instructions for line five restore the link register (program return address), collapse the stack, and exit.

Optimized code frequently shows more complex behavior. The optimizer will move code sequences, unroll loops, and use a number of other techniques that make it more difficult to interpret the object code listing. However, the line numbers associated with each instruction are preserved and you can identify code for given source lines without completely understanding why the compiler has generated the specific sequences. Consider the following example:

   1 #define NX 1000000
   2
   3 main()
   4 {
   5    int j;
   6    double f1;
   7
   8    double e[NX], q[NX];
   9
  10    f1 = 1.5;
  11
  12    for (j=0;j<NX;j++)
  13    {
  14       q[j] = q[j] + f1*e[j];
  15    }
  16 }

This example was compiled with optimization for POWER4 and no loop unrolling. In this case we specified no loop unrolling to keep the object code small. The command used to compile is as follows:

   xlc -O3 -qarch=pwr4 -qtune=pwr4 -qnounroll -qlist loop.c

The corresponding segment of the object list produced from this command is as follows:

  | 000000 PDEF main
 3| PROC
 0| 000000 addis 3CE0000F 1 LIU  gr7=15
12| 000004 addi  38600000 1 LI   gr3=0
 0| 000008 addis 3D80FF0C 1 LIU  gr12=-244
10| 00000C lwz   80A20004 1 L4A  gr5=.+CONSTANT_AREA(gr2,0)
 0| 000010 addi  398CDBC0 1 AI   gr12=gr12,-9280
 0| 000014 addi  38074240 1 AI   gr0=gr7,16960,ca"

10| 000018 lfs   C0650000 1 LFS  fp3=+CONSTANT_AREA(gr5,0)
 0| 00001C mtspr 7C0903A6 1 LCTR ctr=gr0
 0| 000020 stwux 7C21616E 1 ST4U gr1,#stack(gr1,gr12)=gr1
 0| 000024 addis 3C810001 1 CFAA gr4=32760,gr1,1
 0| 000028 addi  38848038 1
 0| 00002C addi  38848000 1 AI   gr4=gr4,-32768,ca"
 0| 000030 addis 3CC1007A 1 CFAA gr6=8026200,gr1,1
 0| 000034 addi  38C67898 1
 0| 000038 addi  38C699A0 1 AI   gr6=gr6,-26208,ca"
14| 00003C lfd   C8260008 1 LFL  fp1=q[]0(gr6,8)
14| 000040 lfdu  CC040008 1 LFDU fp0,gr4=e[]0(gr4,8)
14| 000044 fmadd FC03007A 1 FMA  fp0=fp0,fp3,fp1,fcr
 0| 000048 bc    4340001C 0 BCF  ctr=CL.26,taken=0%(0,100)
 0| 00004C ori   60000000 2
12| CL.3:
14| 000050 lfdu  CC240008 1 LFDU fp1,gr4=e[]0(gr4,8)
14| 000054 lfd   C8460010 1 LFL  fp2=q[]0(gr6,16)
14| 000058 stfdu DC060008 1 STFDU gr6,q[]0(gr6,8)=fp0
14| 00005C fmadd FC0308BA 1 FMA  fp0=fp1,fp3,fp2,fcr
 0| 000060 bc    4320FFF0 0 BCT  ctr=CL.3,taken=100%(100,0)
 0| CL.26:
14| 000064 stfdu DC060008 1 STFDU gr6,q[]0(gr6,8)=fp0
16| 000068 lwz   80210000 1 L4A  gr1=#stack(gr1,0)
16| 00006C bclr  4E800020 0 BA   lr

To locate a loop, we can look for a BCT instruction that branches back to a label and confirm this by checking line numbers. In our example, there are two BCT instructions. The relevant one is the second one with the additional hint taken=100%. Note that some instructions associated with the loop appear to be outside the loop code. This is caused by the instruction scheduling knowledge built in to the optimizer.

Before entering the loop, the loop counter is loaded using a mtspr (move to special register) instruction at address 01C and the single precision constant f1 is loaded into fp3 and converted to double (at 018). We also set up registers pointing to arrays e and q (020 through 038).

Starting at address 03C, q[j] is loaded into fp1. The lfd instruction loads a double (8 bytes) into a floating-point register. Then the lfdu instruction loads e[j] into fp0 and also updates the register pointer to e[j]. Note that the address of e[j] is not computed from j but rather by incrementing by the size of a double. A floating-point multiply/add is initiated to generate the new value for q[j] in fp0.

The following points are related to the inner loop processing:

- The lfdu loads fp1 with the next value of e[j], updating the pointer.
- The lfd loads fp2 with the next element in the array q[j]. Note carefully the offset of 16 bytes from the pointer (gr6,16) and compare this with the previous lfd, where the offset was 8 bytes (gr6,8).
- By this time, the previous FMA has completed and put its result in fp0. The stfdu stores the new value of q[j] and then updates the pointer. Note the stfdu also uses (gr6,8).
- We then initiate an FMA for the e[j] and q[j] just loaded.
- The bc conditional branch tests the counter and branches back to CL.3 if appropriate.
- If we did not branch, we still need to store the result of the last FMA, hence the stfdu following CL.26.

From this basic example, we can see how the compiler can optimize code by use of registers as pointers and by appropriate scheduling of possibly overlapping instructions. In production code, optimized code can appear very complex.

A complete description of the instruction set can be found on the AIX Extended Documentation CD. It can also be found on the Web site: http://www.ibm.com/servers/aix/library/techpubs.html

4.4 Basic coding practices for performance

In this section we list coding practices that can help the compiler to generate more efficient code.

4.4.1 Language-independent tips

- Do not excessively hand-optimize your code (for example, unrolling or inlining). This often confuses the compiler (and other programmers) and makes it difficult to optimize for new machines.
- Avoid unnecessary use of globals and pointers. When using them in a loop, load them into a local before the loop and store them back after.
- Avoid breaking your program into too many small functions. If you must use small functions, seriously consider using the -qipa option.
- Use register-sized integers (long in C/C++ and INTEGER*4 or INTEGER*8 in Fortran) for scalars.

- For large arrays or aggregates of integers, consider using 1- or 2-byte integers or bit fields in C or C++.
- Use the smallest floating-point precision appropriate to your program. Use long double, REAL*16, or COMPLEX*32 only when extremely high precision is required.
- Obey all language aliasing rules (try to avoid -qassert=nostd in Fortran and -qalias=noansi in C/C++).
- Use locals wherever possible for loop index variables and bounds. In C/C++, avoid taking the address of loop indices and bounds.
- Keep array index expressions as simple as possible. Where indexing needs to be indirect, consider using the PERMUTATION directive.

4.4.2 Fortran tips

- Use the [mp]xlf90[_r] or [mp]xlf95[_r] driver invocations where possible to ensure portability. If this is not possible, consider using the -qnosave option.
- When writing new code, use module variables rather than common blocks for global storage.
- Use modules to group related subroutines and functions.
- Use INTENT to describe usage of parameters.
- Limit the use of ALLOCATABLE arrays and POINTER variables to situations that demand dynamic allocation.
- Use CONTAINS in subprograms only to share thread local storage.
- Avoid the use of -qalias=nostd by obeying Fortran alias rules.
- When using array assignment or WHERE statements, pay close attention to the generated code with -qlist or -qreport. If performance is inadequate, consider using -qhot or rewriting array language in loop form (a short sketch follows this list).
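As a minimal sketch of the last tip (the routine and array names are invented), a masked array assignment and an equivalent explicit loop might look as follows; the loop form is usually easier to relate to the -qlist output:

      subroutine clip(a,b,n)
      integer n,i
      real*8 a(n),b(n)

!     Array-language form; the compiler may introduce a temporary.
      where (b > 0.0d0)
         a = a + b
      end where

!     Equivalent loop form of the same operation (both versions are
!     shown here only for comparison; a real routine would use one).
      do i=1,n
         if (b(i) > 0.0d0) a(i) = a(i) + b(i)
      end do

      return
      end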

4.4.3 C and C++ tips

- Use the xlc[_r] invocation rather than cc[_r] when possible.
- Always include string.h when doing string operations and math.h when using the math library.
- Pass large class/struct parameters by address or reference and pass everything else by value where possible.
- Use unions and pointer type-casting only when necessary and try to follow ANSI type rules.

- If a class or struct contains a double, consider putting it first in the declaration. If this is not possible, consider using -qalign=natural.
- Avoid virtual functions and virtual inheritance unless required for class extensibility. These are costly in object space and function invocation performance.
- Use volatile only for truly shared variables.
- Use const for globals, parameters, and functions whenever possible.
- Do limited hand-tuning of small functions by defining them as inline in a header file.

4.4.4 Inlining procedure references Inlining involves replacing a procedure reference with a copy of the procedure’s code, so that the overhead of referencing the procedure, and of returning from it, is eliminated. In certain situations inlining can enable the compiler to perform more optimization than without inlining.

The general advice is to avoid inlining in areas of a program that are infrequently executed and to ensure that small functions are inlined in frequently executed areas. Do not inline large functions. Inlining does not always improve performance; therefore you should test the effects of this option on your code.

A program with inlining might slow down because of larger code size resulting in more cache misses and page faults, or because there are not enough registers to hold all the local variables in some combined routines (check the compiler output for register spills).

Inlining by the compiler is controlled through the -Q and -O options and the suboptions of the -qipa option (not available for C++). You must specify at least optimization level -O (equivalent to -O2) for -Q inlining to take effect. In Fortran, by default, -Q only affects a procedure if both the caller and callee are in the same source file or set of files that are connected by INCLUDE directives. To turn on inline expansion for calls to procedures in different source files, you must also use the -qipa option.

The compiler decides whether to inline procedures based on their size. Other criteria might help to improve performance. For procedures that are unlikely to be referenced in a standard execution (for example, error-handling or debugging procedures), you might selectively disable inlining by using the -Q-names option. For procedures that are referenced within hot spots, specify the -Q+names option to ensure that those procedures are always inlined.

Getting the right amount of inlining for a particular program may require some trials. As a good starting point, consider identifying a few subprograms that are called most often, and inline only those subprograms.

To verify whether the compiler has inlined the call of a certain procedure or not, you can check whether the call has disappeared in the object listing (-qlist). The following example shows how a call to a subroutine (named foo) may appear:

 9| 000038 bl 4BFFFFC9 0 CALL gr3=foo,3,a",gr3,#1",gr4,i",gr5,
    fcr",foo",gr1,cr[01567]",gr0",gr4"-gr12",fp0"-fp13",mq",lr",fcr",xer",fsr",ca",
    ctr"

In C++ there are two methods to define a function as inline: by using the inline keyword or by defining (not just declaring) member functions within the class definition. Inline functions are not necessarily inlined by the compiler, and functions that are not defined as inline may still be inlined, depending on the optimization level and the -Q compiler flag.

4.4.5 Structuring code for optimal grouping

The grouping of the internal microprocessor instructions is important in order to exploit the potential performance of the different hardware execution units for a specific calculation. As a general rule, it is desirable to fill out all four slots (five in case of a branch) of an instruction group. Instructions that have to be first or last in a group may prevent optimal grouping. Flushing instruction groups and refetching instructions for reordering is a worst-case situation to be avoided if possible.

There are no means to influence instruction grouping from the C or Fortran language level directly. The compiler has to cope with the requirements of grouping. Only when writing assembler code is it possible to arrange the order of instructions in order to optimize grouping.

Writing suitably structured high-level code might help the compiler to generate an instruction stream that can be grouped efficiently. The key issues to consider carefully are proper alignment and data dependencies, and these tuning techniques are beneficial for overall performance anyway.

4.5 Tuning for 64-bit integer performance

Given a program that uses 64-bit integer data types, you need to compile with the -q64 option in order to exploit the 64-bit integer hardware support of POWER3 and POWER4. Note that specifying the -q64 compiler option does not affect the default setting for -qintsize.

In 64-bit mode, the use of INTEGER(8) induction (loop counter) variables can improve performance. The XLF 7.1 compiler automatically converts induction variables declared as INTEGER, INTEGER(1), or INTEGER(4) to INTEGER(8) unless -qstrict_induction is specified. It is therefore no longer necessary to set the size of default INTEGER and LOGICAL data entities to 8 bytes (-qintsize=8) to achieve this without source code changes; doing so could increase memory consumption and bandwidth requirements unnecessarily.
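A small sketch of the kind of loop involved follows; the routine and variable names are invented for illustration.

      subroutine add64(a,b,c,n)
      integer(8) n,i
      integer(8) a(n),b(n),c

!     In 64-bit mode (-q64), a 64-bit induction variable lets the
!     compiler use the index directly in addressing, and the 64-bit
!     adds map onto the hardware 64-bit integer support.
      do i=1,n
         b(i)=a(i)+c
      end do

      return
      end

Built with, for example, xlf -q64 -O3 -qarch=pwr4 -qtune=pwr4, this is the same kind of B(I)=A(I)+C operation as the one measured below.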

Figure 4-1 shows some performance implications. A simple add operation, B(I) = A(I) + C, is selected. In the context of this example, 32-bit denotes 32-bit array elements and a 32-bit address space; 64-bit indicates 64-bit integer elements. When fetching data from the L1 and L2 caches, the 64-bit version with hardware support (-q64) is not slower than the 32-bit version. The 32-bit version is faster when going out to L3 cache and to memory: because it keeps twice the number of elements in L2 and L3 cache, its performance degradation is delayed.

Without the 64-bit integer hardware support, the performance of the operation with 64-bit operands is significantly worse. One should expect twice the number of load, store, and add instructions. But the object code listing reveals that 64-bit emulation is more complex. In addition, addic (add with carry) instructions are generated, which probably lead to inefficient instruction grouping.

Figure 4-1 Integer computation: B(I)=A(I)+C


Chapter 5. General tuning guidelines

This chapter covers general code tuning and application optimization techniques that are not specific to POWER4 microarchitecture. It is intended to be a repository of recommended coding practices.

5.1 Hand tuning code

Many of the following tips and advice can be found in the IBM Fortran and C compiler manuals.

5.1.1 Local or global variables?

Use local variables, preferably automatic variables, as much as possible. The compiler can accurately analyze the use of local variables, but it has to make several worst-case assumptions about global variables. These assumptions tend to hinder optimization. For example, if you write a function that uses global variables heavily, and that function also calls several external functions, the compiler assumes that every call to an external function could change the value of every global variable. If you know that none of the function calls affects the global variables that you are using, and you have to read them frequently with function calls interspersed, copy the global variables to local variables and then use these local variables. The compiler can then perform optimization that it could not otherwise perform.
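A minimal Fortran sketch of this idea follows. The module, variable, and routine names are invented, and work is assumed to be an external routine that does not modify the global.

      module globals
      real*8 scale_factor
      end module globals

      subroutine apply(a,n)
      use globals
      integer n,i
      real*8 a(n),local_scale
      external work

!     Copy the global into a local before the loop; the compiler then
!     does not have to assume that every call to work() changes it.
      local_scale = scale_factor
      do i=1,n
         call work(a(i))
         a(i) = a(i)*local_scale
      end do

      return
      end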

In C, if you must use global variables, use static variables with file scope rather than external variables wherever possible. In a file with several related functions and static variables, the optimizer can gather and use more information about how the variables are affected.

To access an external variable, the compiler has to make an extra memory access to obtain the address of the variable. When the compiler removes extraneous address loads, it has to use a register to keep the address. Using many external variables simultaneously takes up many registers. Those that cannot fit into the available registers during optimization are spilled into memory. The C compiler organizes all elements of an external structure so that they use the same base address and hence base address register. Therefore, you should group external data into structures or arrays wherever it makes sense to do so. Do not group together data whose address is taken (either explicitly using an ampersand (&) or implicitly, including arrays passed as parameters and C++ class objects passed as this parameters) in the same structure with other data whose address is not taken.

In C, because the compiler treats register variables the same as it does automatic variables, you do not gain performance by declaring register variables. Note that this differs from other vendors’ implementations, where using the register attribute can greatly affect program performance. However, declaring a variable as register is a good hint to the compiler and means that the address of the variable cannot be taken.

5.1.2 Pointers Keeping track of pointers during optimization is difficult and in some cases impossible. Using pointers inhibits most memory optimization (such as dead store elimination and store motion).

Using the C #pragma disjoint preprocessor directive to list identifiers that do not share the same physical storage can improve the runtime performance of optimized code.

5.1.3 Expressions

The Fortran compiler is good at recognizing identical expressions but not permutations of them. For example, in the code:

      x = a + b + c + d
      y = a + c + b + d

the compiler will only load the variables a, b, c, and d into registers once. However, it evaluates the expressions separately and stores the results from separate registers. Wherever possible, write identical expressions specifying variables in the same order.

5.1.4 Data type conversions

Avoid forcing the compiler to convert numbers between integer and floating-point internal representations.

Do not use floats (datatype real) for loop variables.

While this is not a performance issue, when comparing floating-point numbers for equality, bear in mind that values may vary in the least significant bit, depending on how the value is calculated. Where appropriate, test that the absolute difference between the values is less than an acceptable threshold of accuracy.
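For example (the tolerance value is arbitrary and should be chosen to suit the data):

      logical function approx_equal(x,y)
      real*8 x,y,tol
      parameter ( tol = 1.0d-12 )

!     Treat x and y as equal if they differ by less than tol.
      approx_equal = abs(x-y) .lt. tol
      end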

5.1.5 Tuning loops

There are a variety of techniques, basically good coding practice, that can be applied to tuning loops. These techniques are not POWER4 specific. They are described in the Fortran context but also apply to C.

- Keep the size of do loops manageable

  Loops should not spread over many lines of code. If they do, there probably exists a better algorithm. Large loops also make program maintenance more difficult.

- Access data sequentially (unit stride)

  Wherever possible, organize data arrays so that elements are accessed with unit stride to improve cache utilization. Note, in Fortran arrays, elements a(1,n), a(2,n), a(3,n) and so on are in sequential memory locations. In C arrays, the order is reversed, that is a[n][1], a[n][2] and a[n][3] are in sequential locations.

- Minimize loop invariant IF statements in loops

  Reduce the loop instruction path length by moving IF statements outside the loop and coding two separate loops. This is more important for small loops where the IF test may be a significant contributor to the loop execution time.

- Avoid subroutine and function calls in loops (give the routine its own loop)

  Avoid subroutine calls within loops (where possible) to save the cost of the branch and link instructions. Use code inlining instead.

  Alternatively, replace a loop that makes a subroutine call for each element with a single subroutine call, passing an array parameter, where the subroutine contains a loop operating on each element.

- Simplify array subscripts

  Avoid writing complex expressions for array subscripts if possible, particularly expressions involving loop variables. Complex expressions may cause the compiler to compute the array index even if the increment is fixed. For fixed increments, the compiler can generate array element addresses with register add instructions instead of a more complex calculation.

- Use INTEGER loop variables

  Integer loop variables simplify loop counter optimization, since the counter can be kept in a register and also used as an address index. Do not use INTEGER*8 loop variables unless in 64-bit mode. Use of REAL loop variables is strongly discouraged because both of the following affect performance:

  – Calculation of the loop variable requires one or more floating-point operations
  – Use of the loop variable as an index requires conversion

  In addition, the test for loop termination may be inexact and the loop may be executed one fewer or one more time than expected.

  In C, declare loop variables as type long. Long variables are the natural or register size in 32-bit and 64-bit environments, that is, they are 32 or 64 bits accordingly. Loop variable arithmetic using the register size has significantly better performance than using, for example, integer (32-bit) loop variables in a 64-bit environment. Note that in the 64-bit environment, the C compiler will optimize integer loop variables to long if you specify -O3 (or greater) and do not specify -qstrict_induction.

- Avoid the following constructs within loops:

  – Flow control statements such as GOTO, STOP, PAUSE, RETURN, computed GOTO, ASSIGN, or ASSIGNED GOTO
  – EQUIVALENCE data items

  These constructs impair the ability of the compiler to optimize the loop.

- Avoid non-optimizable data types such as LOGICAL*1, BYTE, INTEGER*1, INTEGER*2, REAL*16, COMPLEX*32, CHARACTER, and INTEGER*8 in 32-bit mode. These data types do not correspond to the native hardware types and require additional instructions for each operation, impacting performance.

For performance-critical do loops, avoid the following:

- Access data with large stride

  Large strides reduce the effectiveness of the cache.

- Do few iterations of the loop

  For very small numbers of loop iterations, it may be preferable to unroll the loop by hand (a short sketch follows this list).

- Include I/O statements

  I/O can introduce indeterminate delays in processing. I/O function calls will also prevent automatic parallelization of loops.
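As a sketch of the hand-unrolling point above (the fixed length of four and the names are invented):

      real*8 function dot4(a,b)
      real*8 a(4),b(4)

!     Loop form, always exactly four iterations:
!        dot4 = 0.0d0
!        do i=1,4
!           dot4 = dot4 + a(i)*b(i)
!        end do
!     Hand-unrolled form: no branch or counter overhead at all.
      dot4 = a(1)*b(1) + a(2)*b(2) + a(3)*b(3) + a(4)*b(4)
      end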

Examples

These are some examples of how to correct some inefficient coding practices that have been found in real codes:

Removal of invariant IF

   Untuned                       Tuned
   -------                       -----
   DO I=1,N                      IF(D(J).LE.0.0)THEN
     IF(D(J).LE.0.0)X(I)=0.0       DO I=1,N
     A(I)=B(I)+C(I)*D(I)             A(I)=B(I)+C(I)*D(I)
     E(I)=X(I)+F*G(I)                X(I)=0.0
   ENDDO                             E(I)=F*G(I)
                                   ENDDO
                                 ELSE
                                   DO I=1,N
                                     A(I)=B(I)+C(I)*D(I)
                                     E(I)=X(I)+F*G(I)
                                   ENDDO
                                 ENDIF

The compiler will recognize that the IF test is invariant within the loop but will not generate two versions of the loop as in the tuned example.

Boundary condition IF testing

A frequent requirement is to perform a different calculation for the first and/or last iteration of a loop. If the loop is performance-critical, then it is important to treat these special cases separately and remove the IF code from the main loop:

   Untuned                       Tuned
   -------                       -----
   DO I=1,N                      A(1)=B(1)+C(1)*D(1)
     IF(I.EQ.1)THEN              X(1)=0.0
       X(I)=0.0                  E(1)=F*G(1)
     ELSEIF(I.EQ.N)THEN          DO I=2,N-1
       X(I)=1.0                    A(I)=B(I)+C(I)*D(I)
     ENDIF                         E(I)=X(I)+F*G(I)
     A(I)=B(I)+C(I)*D(I)         ENDDO
     E(I)=X(I)+F*G(I)            X(N)=1.0
   ENDDO                         A(N)=B(N)+C(N)*D(N)
                                 E(N)=1.0+F*G(N)

Repeated intrinsic function calculation

In this example, the untuned code calls SIN() N*N times, whereas in the tuned code, it is called N times and saved in a separate array. In the inner loop, the call is replaced by a significantly cheaper load.

   Untuned                       Tuned
   -------                       -----
   DO I=1,N                      DIMENSION SINX(N)
     DO J=1,N                    .
       A(J,I)=B(J,I)*SIN(X(J))   DO J=1,N
     ENDDO                         SINX(J)=SIN(X(J))
   ENDDO                         ENDDO
                                 DO I=1,N
                                   DO J=1,N
                                     A(J,I)=B(J,I)*SINX(J)
                                   ENDDO
                                 ENDDO

Replacing divides by reciprocal multiply

This optimization can sometimes be done automatically by the compiler by specifying at least the -O3 optimization level.

Since divides are costly, any loop that divides by the same value more than once can be easily optimized by taking the reciprocal of the value and then multiplying by the reciprocal, as in this example:

   Untuned                       Tuned
   -------                       -----
   DO I=1,N                      DO I=1,N
     A(I)=B(I)/C(I)                OC=1.0/C(I)
     P(I)=Q(I)/C(I)                A(I)=B(I)*OC
   ENDDO                           P(I)=Q(I)*OC
                                 ENDDO

In practice, any improvement will depend on the ratio of divides to loads and stores. For trivial loops, there is no benefit for reals but there is a benefit for integers.

The following example shows a similar method that has been used when there are two (or more) different divisors:

   Untuned                       Tuned
   -------                       -----
   DO I=1,N                      DO I=1,N
     A(I)=B(I)/C(I)                OCD=1.0/( C(I)*D(I) )
     P(I)=Q(I)/D(I)                A(I)=B(I)*D(I)*OCD
   ENDDO                           P(I)=Q(I)*C(I)*OCD
                                 ENDDO

Here, two divides have been replaced by one divide and five multiplies. In the untuned case, the compiler can take advantage of the multiple FPU pipelines, whereas in the tuned case, the code is dependent on a single floating-point divide.

Array dimensions that are high powers of two

The following discussion is an extension of what was described in “Set associativity” on page 29.

The following code elements illustrate a problem that can arise with array dimensions:

      integer nx,nz
      parameter (nx=2048,nz=2048)
      real p(2,nx,nz)
      ......
      do 25 ix=2,nx-1
      do 20 iz=2,nz-1
      p(it1,ix,iz)= -p(it1,ix, iz)
     &             +s*p(it2,ix-1,iz)
     &             +s*p(it2,ix+1,iz)
     &             +s*p(it2,ix, iz-1)
     &             +s*p(it2,ix, iz+1)
   20 continue
   25 continue

The second dimension of p multiplied by the first dimension (in this case two) is precisely one half of the size of the L1 cache. Array elements p(it2,ix-1,iz) and p(it2,ix+1,iz) will (normally) be found in the same cache line as p(it2,ix,iz). However, accessing p(it2,ix,iz-1) and p(it2,ix,iz+1) will displace this cache line because each of these elements maps to the same congruence class. In this example, there are five loads to the same congruence class but only two cache lines available, because L1 is two-way associative.

A simple solution is to increase the dimension of the array, as in:

      integer nx,nz
      parameter (nx=2048,nz=2048)
      real p(2,2080,nz)

Note that we are simply changing the dimension of the array, not the number of elements accessed. In this example, the application performance increased by a factor of two.

You can also get multiple loads to the same congruence class in loops that access a large number of arrays because there are only 128 classes. In this case, it is possible to improve performance by splitting the loop into multiple loops and relocating array accesses into these separate loops.

5.2 Using pre-tuned code

Do not spend time duplicating tuning work that has already been done. If your program performs standard functions, such as matrix multiply, equation solving, other BLAS functions, FFTs, convolution, and so on, then modify your code to call the equivalent ESSL function. ESSL is described in 6.1, “The ESSL and Parallel ESSL libraries” on page 114, and contains probably the most highly tuned code available for RS/6000 and pSeries numerically intensive functions. Other commercially and publicly available libraries, such as NAG, IMSL, LAPACK, and so on, have also been tuned for cache-based superscalar architectures.
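As an illustration, a triple-nested matrix multiply loop can be replaced by a single call to DGEMM. The sketch below uses the standard BLAS calling sequence (the array sizes are arbitrary) and, on pSeries, would normally be linked against ESSL with -lessl.

      program gemm_demo
      integer n
      parameter ( n = 500 )
      real*8 a(n,n),b(n,n),c(n,n)

      call random_number(a)
      call random_number(b)
      c = 0.0d0

!     C = 1.0*A*B + 0.0*C, computed by the tuned library routine
      call dgemm('N','N',n,n,n,1.0d0,a,n,b,n,0.0d0,c,n)

      write(6,*) c(1,1),c(n,n)
      end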

5.3 The performance monitor

The POWER3 and POWER4 processor designs (as well as RS64) include hardware performance monitoring facilities. These facilities provide access to counters that record highly detailed information about processor behavior and instruction execution. At the lowest level, the interface consists of special-purpose registers that control the state of counters and multiplexors within the processor. These registers are only accessible at the operating system level; therefore a programming interface is provided that accesses these registers using a kernel extension.

The hardware provides eight counters each of which can count the number of occurrences of one event. Events are things that happen inside the processor such as the completion of an instruction or a load from a cache line. Events are platform specific, therefore, certain events may exist on one processor type but not another.

The programming interface provides a set of C routines to specify which events should be counted and whether they should be counted for the kernel, the user, or at the process level. Counting can be turned on or off within a program, thereby providing a very accurate mechanism for determining processor usage in specific parts of an application.

The API and documentation are provided on the AIX 5L installation media.

There is also a command, pmcount, which will execute a command or script. You can specify countable events as options to pmcount. Using another set of options, pmcount will display event numbers and their definitions for the current hardware platform.

The POWER3 and RS64 implementations allow the counting of events in any counter for which the event is defined (although not all combinations may be meaningful in the sense that the set of multiplexors used to accumulate into a specified counter may not produce a meaningful result).

Event counting has been refined to provide a number of groups of events for each processor type. The definition of a group is simply a set of eight events and the particular counters on which they are counted. Combinations of events in a group are meaningful.

A group may be specified as an option to pmcount in place of a set of events.

The increased level of complexity of the POWER4 design means that it is more difficult to guarantee meaningful results from counting events. Therefore, only counting by groups is supported.

The following example illustrates some of the techniques that may be useful in programming the API:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include "pmapi.h"

#define STRIDE_MAX 4096
#define NUM_LOOPS 100

void timevalsub(struct timeval *, struct timeval *);
void timevaladd(struct timeval *, struct timeval *);
void invalidate_tlb();

main(int argc, char **argv)
{

  int i,j, testcount,        /* various loop variables */
      rc,                    /* return code */
      stride, group_no;      /* parameters */
  char *progname;
  float x, array[512][513];

  /* timestamps for loop start, end */
  struct timeval loop_start, loop_end, total;

  /* process monitor data structures */
  pm_info_t myinfo;
  pm_groups_info_t mygroupinfo;
  pm_prog_t myprog;
  pm_data_t mydata;

  stride=1;
  total.tv_sec=0;
  total.tv_usec=0;

  /* make sure all pages in array exist to minimize later timing issues */
  for (i=0;i<512;i++) {
    for (j=0;j<513;j++) {
      array[i][j]=0.0;
    }
  }

  progname = *argv;
  if (argc == 3 ) {
    argv++;
    group_no=atoi(*argv);
    argv++;
    stride=atoi(*argv);
  } else {
    printf("usage: %s group stride\n",progname);
    exit(1);
  }

  /* initialize API. Allow all possible events. */
  if ((rc = pm_init(PM_VERIFIED|PM_UNVERIFIED|PM_CAVEAT,
                    &myinfo,&mygroupinfo)) > 0) {
    pm_error("pm_init", rc);
    exit(-1);
  }

  /* set up counting modes for call to pm_set_program_mythread() */
  myprog.mode.w = 0;

  /* count in user mode, not kernel mode */
  myprog.mode.b.user = 1;
  myprog.mode.b.kernel = 0;

  /* defer starting counting until we call pm_start_mythread */
  myprog.mode.b.count = 0;

  /* set is_group to say we're counting groups rather than events */
  myprog.mode.b.is_group = 1;

  /* since we're counting groups, put the group number into events[0].
     The API won't look at other events[] structures. */
  myprog.events[0]=group_no;

  if ((rc=pm_set_program_mythread(&myprog)) != 0 ) {

    pm_error("Calling pm_set_program_mythread",rc);
    exit(1);
  }

  testcount=NUM_LOOPS;

  while (testcount-- > 0 ) {
    invalidate_tlb();
    gettimeofday(&loop_start,NULL);

    /* Start counting. We don't want to include the overhead of the
       invalidate_tlb() and gettimeofday() calls so we start and stop
       counting accordingly */
    if ((rc=pm_start_mythread()) != 0 ) {
      pm_error("Calling pm_start_mythread",rc);
      exit(1);
    }

    for (i=0;i<512;i++) {
      for (j=0;j<512;j+=stride) {
        array[i][j]=array[i][j] * 1.5;
      }
    }

    /* Stop counting but don't reset the counters. Therefore, counting
       will simply continue on the next call to pm_start_mythread() */
    if ((rc=pm_stop_mythread()) != 0 ) {
      pm_error("Calling pm_stop_mythread",rc);
      exit(1);
    }
    gettimeofday(&loop_end,NULL);
    timevalsub(&loop_end,&loop_start);
    timevaladd(&total,&loop_end);
  }

  /* retrieve counter data */
  if ((rc=pm_get_data_mythread(&mydata)) != 0 ) {
    pm_error("pm_get_data_mythread",rc);
    exit(1);
  }

  x=(total.tv_sec*1000000)+total.tv_usec;
  printf("Time (usecs) = %8.2f\n",x/NUM_LOOPS);
  for (i=0;i<8;i++)
    printf("Counter %d = %-8lld\n", i+1,mydata.accu[i]/NUM_LOOPS);

  return(0);
}

Running this program produces the output below:

   bu30b$ ./pm_example 5 8
   Inside pm_set_program_mythread: prog->events[0] is 5.
    prog->mode.b.is_group is 1.
   Time (usecs) =   375.19
   Counter 1 = 249
   Counter 2 = 215
   Counter 3 = 0
   Counter 4 = 7966
   Counter 5 = 0
   Counter 6 = 0
   Counter 7 = 0
   Counter 8 = 0
   bu30b$

The first two lines of output are generated by the pm_set_program_mythread() call, apparently as diagnostic information. The program prints the elapsed time and counter values. We previously used pmcount to identify the groups and counters. Group 5 counts information on sources of data. The definitions for the individual counters used in this example are as follows:

   Counter 1   The number of times data was loaded from L3
   Counter 2   The number of times data was loaded from memory
   Counter 3   The number of times data was loaded from L3.5
   Counter 4   The number of times data was loaded from L2
   Counter 5   The number of times data was loaded from L2 partition 1 in shared mode
   Counter 6   The number of times data was loaded from L2 partition 2 in shared mode
   Counter 7   The number of times data was loaded from L2 partition 1
   Counter 8   The number of times data was loaded from L2 partition 2

Thus, in this example, we can see the relative sources of data for the calculation. Other groups can be used to identify the efficiency of the prefetch mechanism, floating-point unit and so on.

The API described above is provided in C. There is no Fortran API. However, it is a reasonable task to write a suitable, simplified API.

Subroutines to initialize the performance monitor, start and stop counting, and print results are required. Here is some pseudo-code that implements these subroutines:

#include <stdio.h>
#include "pmapi.h"

int pminit(int group)
{
  pm_info_t pm_myinfo;
  pm_groups_info_t pm_mygroupinfo;
  pm_prog_t myprog;

  pm_init(PM_VERIFIED|PM_UNVERIFIED,&pm_myinfo,&pm_mygroupinfo);
  myprog.mode.b.user=1;
  myprog.mode.b.kernel=0;
  myprog.mode.b.count=0;

  myprog.mode.b.is_group=1;
  myprog.events[0]=group;
  pm_set_program_mythread(&myprog);
  return(0);
}

int pmstart()
{
  pm_start_mythread();
  return (0);
}

int pmstop()
{
  pm_stop_mythread();
  return(0);
}

int pmprint()
{
  int i;
  pm_data_t my_data;

  pm_get_data_mythread(&my_data);
  for (i=0;i<8;i++) {
    printf("Counter %d = %-8lld\n",i+1,my_data.accu[i]);
  }
  return(0);
}

You will need to compile these functions and save the object file for later use. The -c option tells the compiler that the source file is not a complete program and it should stop after the compilation stage and not attempt to link. For example:

   xlc -O3 -c -o pm_subroutines.o mysourcefilename.c

Here is how you might use these subroutines to monitor a Fortran program. Note that we have passed the group to be monitored by value explicitly because Fortran, by default, passes parameters by reference:

      program pm_test

      integer pminit,pmstart,pmstop,pmprint,i,j
      integer result,group
      real*8 x,a(512,512)

      group=5
      result = pminit(%VAL(group))

      result = pmstart()

      do i=1,512
         do j=1,512
            a(i,j) = a(i,j) *1.5
         end do
      end do

      result = pmstop()
      result = pmprint()

end program

To compile this program, we need to include the C subroutines and the performance monitor libraries:

   xlf -O3 -o pm_test pm_subroutines.o -lpmapi -L/usr/pmapi/lib pm_test.f

Note that the performance monitor has not been tested in LPAR environments.

5.4 Tuning for I/O

If I/O is a significant part of the program, it may well dominate the overall run time and render CPU tuning unproductive. Some guidelines for improving I/O efficiency in Fortran and C are discussed in the following sections. However, the best advice is simply to eliminate or minimize I/O as much as possible. If I/O is your performance bottleneck, then using the best hardware and software options (high-speed storage arrays, striping over multiple devices and adaptors, and asynchronous I/O, for example) may be the best tuning options. A detailed discussion of these subjects is outside the scope of this publication. Large-memory SMP systems are capable of generating large amounts of I/O, but different I/O subsystems have different performance characteristics, so it is difficult to make specific recommendations.

Asynchronous I/O

Programs normally perform I/O synchronously. That is, execution continues after the operating system has completed the I/O. AIX also supports asynchronous I/O. In this case, a program executes an I/O call that returns immediately. The program can then perform other useful work. The operating system will perform the I/O and inform the program when it is complete. There are a variety of techniques for the program to detect that the I/O has finished.

Taking advantage of asynchronous I/O can result in reduced run time because you can overlap computation and I/O. The degree of improvement will depend on the amount of I/O the program performs.

Implementing asynchronous I/O will require program changes and the degree of difficulty will vary from program to program.

In Fortran (introduced in XL Fortran Version 5), a program can open a file with the ASYNC qualifier. Reads and writes will be performed asynchronously. The program needs to be changed to issue a wait for each asynchronous read or write. A description of asynchronous I/O, including a discussion of error handling, can be found in the XL Fortran Language documentation.
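A minimal sketch of the pattern follows. The unit number, file name, and handle variable are invented, and the exact spelling of the open specifier and of the WAIT statement should be checked against the XL Fortran Language Reference.

      program async_demo
      integer idw
      real*8 a(1000000)

      a = 1.0d0

!     Open the file for asynchronous unformatted I/O (specifier name
!     assumed here; verify against the compiler documentation).
      open(10, file='scratch.dat', form='unformatted', asynch='yes')

!     Start the write; control returns before the transfer completes.
      write(10, id=idw) a

!     ... other computation can proceed here ...

!     Block until this particular write has finished.
      wait(10, id=idw)

      close(10)
      end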

In C, asynchronous I/O is only supported for unbuffered I/O. The program changes required for asynchronous I/O are typically more complex than those required in Fortran. Refer to “Asynchronous I/O Overview”, in the AIX Version 4.3 Kernel Extensions and Device Support Programming Concepts documentation.

Direct I/O

Direct I/O is a form of synchronous I/O. By default, the operating system transfers data between the application program and a file using intermediate buffers. For example, for file system files, the operating system caches file data, and this typically improves I/O performance. With direct I/O, data is transferred between the device and the application’s data buffers without intermediate buffering. This can sometimes lead to degraded performance, typically with file system files.

Paging I/O

Paging is a special case of I/O. You can measure paging rates using vmstat. This command displays, among other statistics, the paging rates for a specified time interval. It displays these statistics for the whole system, which must be taken into account when evaluating the effect of a particular application. A certain amount of paging during startup or when the program changes from one phase to another is to be expected. However, any measurable paging rate over a sustained period during program execution is an indication that you are over-committing memory or are on the edge of doing so. This is likely to cause

serious performance problems. The only solution is to reduce the level of memory over-commitment. Either tune the program to use less memory, or run on a computer with more memory (or fewer users). It should be noted that from AIX Version 4.3.2 on, the paging space allocation algorithm only allocates a page of paging space when it actually needs to write to that page. This means that it is common for the amount of paging space configured on a large memory system to be considerably less than the size of memory. In this situation any significant amount of paging can have more serious effects than poor performance of an application, as the system can quickly reach a state where virtual memory is exhausted and thrashing ensues.

The topas command is also a useful real time monitor of system I/O activity.

C unbuffered and buffered I/O

C programs can make use of two techniques for I/O. These are buffered I/O, also referred to as streams, and unbuffered I/O. Unbuffered I/O is implemented by calls to operating system functions and offers the greatest opportunity for performance at a cost in coding complexity. Buffered or stream I/O is implemented by standard library functions that provide a higher level interface. Refer to the "Input and Output Handling Programmer's Overview" in AIX Version 4.3 General Programming Concepts: Writing and Debugging Programs.

Fortran I/O

Some guidelines for efficient I/O in Fortran follow:

- Reduce the number of calls to the I/O subsystem. For example, the following three ways of writing the whole of a 2-D array to a sequential file differ very considerably in performance. As well as performing very slowly, Case 3 will create a file almost twice as large as Case 1 (if A is REAL*8) because of the extra record length indicators.

      DIMENSION A(N,N)
      .
      .

  Case 1. Best. 1 record of N*N values.

      WRITE(1)A

  Case 2. N records, each of N values.

      DO I=1,N
        WRITE(1)(A(J,I),J=1,N)
      ENDDO

  Case 3. Worst. N*N records, each of one value.

      DO I=1,N
        DO J=1,N
          WRITE(1)A(J,I)
        ENDDO
      ENDDO

- Use long record lengths when reading or writing files in a sequential fashion. Use at least 100 KB if possible, preferably 2 MB or more. This allows the I/O to access the underlying devices more effectively.

- Prefer Fortran unformatted I/O to formatted. This reduces binary to decimal conversion overhead.

- Prefer Fortran direct files to sequential. This avoids Fortran record length and overflow checking. A Fortran direct file in AIX is a simple sequential series of data bytes. A Fortran sequential file has record length indicators at both ends of each record.

- Use asynchronous I/O to overlap computation with I/O activity.

- If you write a large temporary file sequentially and need to read through it again at a later stage in processing, make it a direct access file and then try to read the end records of the file first. Ideally, read it sequentially backwards. This is because AIX will automatically use memory to buffer the file. Assuming the file is larger than memory, after the write is completed, memory is likely to contain a large number of buffers corresponding to the last part of the file. If you then read these records, AIX will supply them to the program from memory without physically reading the disk. If you read the file forwards, the incoming records from the front of the file will flush out the in-memory buffers before you reach them.

5.5 Locating hot spots (profiling)

Profiling tells you how the CPU time used by a program during execution is distributed over the code. It identifies the active subroutines and loops so that tuning effort can be applied most effectively.

It is important to understand that a profile relates just to the particular run of the program for which the profile was obtained. The same program run with different data may produce a different profile. Some numerically intensive programs produce very consistent profiles with widely varying sets of input data. Others produce quite different profiles when the data is changed.

From the point of view of the person tuning the code, the ideal situation is a consistent profile with very pronounced concentrations of time spent in a few routines. Tuning effort can then be concentrated on those routines.

The AIX tools available for profiling the programs include:

- The AIX prof and gprof commands
- The AIX tprof command

The prof and gprof commands provide profiling at the procedure (subroutine and function) level. The tprof command uses the AIX trace facility to sample your program at each tick (10 milliseconds) of the AIX CPU clock and constructs a trace table that contains the hardware instruction address register. At the end of your program execution, tprof creates a report (using the trace table) showing the number of ticks that relate to each line of your source code.

To use prof and gprof, do the following:

1. Compile your program with the -p or -pg option in addition to the normal compiler options.
2. Run the program (this produces the gmon.out file).
3. Run prof or gprof by entering:

   prof > filename
   or
   gprof > filename

The standard output, filename, of prof will contain the following information:

- The percentage of the program’s CPU time used by the procedure.
- The time in seconds required for all references to the procedure.
- The cumulative total of seconds required for all procedures in the list.
- The number of times the procedure was called and the time required to perform each call.

The output of gprof contains all the information provided by prof, and in addition the timing information of the calling tree for the procedures in the program.

To use tprof on a program myprog.f, do the following:
1. Compile your program with the -g option.
2. Run tprof on the program:
   tprof -p myprog -x "myprog params"

This procedure creates two output files, namely __myprog.all and __t.myprog.f. The first file shows all the processes involved in running your program and provides a count of the timer ticks associated with each process. It also lists the percentage of ticks that are associated with the complete program. The second file is only produced if you compile your program with the -g option. It is an annotated version of your source file that indicates the CPU ticks associated with each line of the source code being executed.

For more details on how to use prof, gprof, and tprof, see Optimization and Tuning Guide for Fortran, C, and C++, SC09-1705.

By far the most user-friendly and powerful tool, providing graphically assisted profiling down to the Fortran or assembler statement level, is xprofiler. xprofiler is a supported IBM tool distributed as part of the IBM Parallel Environment for AIX licensed program product (5765-D93). The specific fileset component that supplies this tool is ppe.xprofiler. If you are running on a workstation where PE is not installed, your profiling options are prof, gprof, or tprof.

To use xprofiler, compile and link as for gprof with -g -pg options together with -O3 or whatever other optimization you are using. It is important to use the same optimization options as you will use for production, since changing the optimization is highly likely to also change the profile.

Then simply run the executable against the chosen test data. This will produce the standard gmon.out file containing the profiling data. Then run xprofiler. Graphics will appear showing the subroutine tree of the program, with each subroutine represented by a rectangle. The area of each rectangle is roughly proportional to the CPU time spent in that routine, giving an immediate visual indication of hot-spot locations. Clicking on a rectangle will produce a set of options, one of which creates a source code listing with each statement annotated with the amount of CPU time (in units of 1/100 of a second) used. This enables the active loops to be easily identified.


Chapter 6. Performance libraries

In this chapter we discuss performance-enhancing techniques that take advantage of highly tuned variants of commonly needed operations.

Scientific and technical computational problems often contain common mathematical constructs, such as matrix-vector multiply or matrix-matrix multiply, which use a large portion of an application’s computational time. Many of these common constructs have been extensively researched and tuned for efficient computation. Basic linear algebra subprograms (BLAS) that compute many of these common constructs are available from various sources. For example, you can download the source or precompiled BLAS from: http://www.netlib.org/blas

Subprograms for solving systems of linear equations, eigenvalue problems, singular value problems, and so forth can be found in the public domain LAPACK libraries: http://www.netlib.org/lapack

LAPACK uses BLAS calls whenever possible to simplify its use and to be able to take advantage of any available optimized BLAS libraries.
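As a small illustration of calling LAPACK from Fortran (a sketch only; the problem size and random data are invented for this example), the standard routine DGESV factors and solves a dense linear system in one call:

      program solve
      implicit none
      integer, parameter :: n = 100
      real(8) :: a(n,n), b(n)
      integer :: ipiv(n), info

      call random_number(a)
      call random_number(b)

!     solve A*x = b; the solution overwrites b
      call dgesv(n, 1, a, n, ipiv, b, n, info)
      if (info .ne. 0) print *, 'dgesv failed, info = ', info
      end

Whether this call is satisfied by the reference LAPACK, ATLAS, or another optimized library is purely a link-time decision.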

The public domain BLAS or LAPACK are not highly tuned for a particular architecture and locally compiled versions seldom approach peak GFLOPS rates. An alternative is to download the package from the automatically tuned linear algebra software (ATLAS) site: http://math-atlas.sourceforge.net

With ATLAS, you must first generate a tuned library for a system by running an extensive testing suite on a quiet system. The adjustable parameters in the tuned library are determined by extensively testing processor speed, cache sizes and speed, and memory size and speed. Some impressive performance results can be obtained in this manner. However, the resulting library may not be well tuned if it is subsequently used on a somewhat different machine configuration.

Another alternative is to use the IBM Engineering and Scientific Subroutine Library (ESSL) and the parallel version, Parallel ESSL. These libraries are highly tuned for IBM hardware, having been tested on many different PowerPC processor configurations. ESSL and Parallel ESSL are discussed in Section 6.1, “The ESSL and Parallel ESSL libraries,” which follows.

A different type of specialized performance tuning is applying faster, but slightly less accurate versions of Fortran intrinsic functions such as SIN, LOG, and EXP. IBM has produced tuned versions of functions like these, which can be found in the MASS library. MASS can be downloaded from: http://www.rs6000.ibm.com/resource/technology/MASS

MASS is discussed in detail in Section 6.2, “The MASS libraries” on page 117.

6.1 The ESSL and Parallel ESSL libraries

The Engineering and Scientific Subroutine Library (ESSL) family of products is a state-of-the-art collection of mathematical subroutines. Running on IBM pSeries servers and IBM RS/6000 servers and SP systems, the ESSL family provides a wide range of high-performance mathematical functions for a variety of scientific and engineering applications.

The ESSL family includes:
- ESSL for AIX, which contains over 400 high-performance mathematical subroutines tuned for IBM UNIX hardware.
- Parallel ESSL for AIX, which contains over 100 high-performance mathematical subroutines specifically designed to exploit the full power of RS/6000 SP hardware with scalability of up to 512 nodes.

Complete information on the ESSL and Parallel ESSL libraries, including information on obtaining them, can be found at: http://www-1.ibm.com/servers/eserver/pseries/software/sp/essl.html

6.1.1 Capabilities of ESSL and Parallel ESSL

ESSL provides a variety of mathematical functions, such as:
- Basic Linear Algebra Subprograms (BLAS)
- Linear Algebraic Equations
- Eigensystem Analysis
- Fourier Transforms

ESSL products are compatible with public domain subroutine libraries such as Basic Linear Algebra Subprograms (BLAS), Scalable Linear Algebra Package (ScaLAPACK), and Parallel Basic Linear Algebra Subprograms (PBLAS). Thus, migrating applications to ESSL or Parallel ESSL is straightforward.

Both ESSL and Parallel ESSL have SMP-parallel capabilities. The term parallel in the Parallel ESSL product name refers specifically to the use of MPI message passing, usually across the SP switch. For SMP parallel use within a single pSeries 690 Model 681, Parallel ESSL is not required. An SMP-parallel example (DGEMM) for the pSeries 690 Model 681 is provided in Figure 6-1 on page 116.

6.1.2 Performance examples using ESSL

ESSL V3.3 and Parallel ESSL V2.3 are available for the IBM eServer pSeries 690 Model 681 and contain highly optimized routines tuned for the POWER4 processor. Significant optimizations have been done in ESSL to effectively use the L1 and L2 cache, maximize data reuse in the caches, and minimize memory bandwidth requirements. Any application that can be formulated with BLAS calls, especially BLAS3 calls such as SGEMM or DGEMM, will benefit greatly from the ESSL library.

Attention: Some performance numbers reported here used a pre-release version of ESSL that was the latest available at the time this document was written. Readers should perform their own studies to establish firm performance metrics.

Single processor DGEMM

The ESSL version of DGEMM was used to perform matrix-matrix multiplies on a single POWER4 processor on a pSeries 690 Turbo system. The measured GFLOPS as a function of matrix size are shown in Figure 6-1. The best performance is greater than 3.6 GFLOPS and the performance is excellent for a wide range of matrix sizes.

Figure 6-1 ESSL DGEMM single processor GFLOPS
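For reference, ESSL's DGEMM is called through the standard BLAS interface. The following minimal Fortran sketch (the matrix size, random data, and link step are illustrative assumptions, not values used for the measurements above) performs a C = A*B multiplication:

      program mmult
      implicit none
      integer, parameter :: n = 1000
      real(8) :: a(n,n), b(n,n), c(n,n)

      call random_number(a)
      call random_number(b)
      c = 0.0d0

!     C = alpha*A*B + beta*C using the standard BLAS calling sequence
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      end

Linked against ESSL (for example, xlf90 -O3 -qarch=pwr4 -qtune=pwr4 mmult.f -lessl; the exact library options depend on the installation), this is the kind of call whose performance is shown in Figure 6-1.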

SMP-parallel DGEMM - optimal POWER4 GFLOPS from ESSL

Table 6-1 lists the performance (in GFLOPS) of SMP parallel DGEMM on a 32-CPU pSeries 690 Turbo for square REAL*8 matrices.

Table 6-1 DGEMM throughput summary (GFLOPS, 32-CPU pSeries 690 Turbo)

Parallelism   2000x2000 REAL*8   10000x10000 REAL*8
8-way         25.80              not measured
16-way        47.86              not measured
24-way        69.00              not measured
32-way        84.30              96.13

A sustained rate of over 96 GFLOPS was measured on a 1.3 GHz pSeries 690 Turbo for 32-way parallel DGEMM on 10000x10000 matrices, as provided in Table 6-1 on page 116. This is the highest performance seen for a single job on a single pSeries 690 Turbo during the preparation of this publication. Also see Section 8.4.1, “ESSL DGEMM throughput performance” on page 161 for the performance observed for multiple copies of a single processor DGEMM application.

6.2 The MASS libraries

The mathematical acceleration subsystem (MASS) library provides high-performance versions of a subset of Fortran intrinsic functions. These versions sacrifice a small amount of accuracy to allow for faster execution. Compared to the standard mathematical library, libm.a, the MASS library results differ, at most, only in the last bit. Thus, MASS results are sufficiently accurate in all but the most stringent conditions.

There are two basic types of functions available for each operation:
- A single instance function
- A vector function

The single instance function simply replaces the libm.a call with a MASS library call. The vector function is used to produce a vector of results given a vector operand. The vector MASS functions may require coding changes while the single instance functions do not.
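To illustrate the difference, consider computing exponentials of an array (the loop and arrays here are invented for illustration, and the vector routine's argument order of output array, input array, element count is assumed from the MASS vector convention):

! single-instance form: the loop is unchanged; linking with -lmass
! simply substitutes a faster scalar exp for the libm.a version
      do i = 1, n
         y(i) = exp(x(i))
      enddo

! vector form: one call to the MASS vector library (-lmassv)
! produces all n results
      call vexp(y, x, n)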

6.2.1 Installing and using the MASS libraries

The MASS libraries can be downloaded from: http://www.rs6000.ibm.com/resource/technology/MASS

This site also has extensive documentation and should be referred to for more detailed explanations.

The download file is a compressed tar file that can be unpacked into /usr/lpp and the resulting library files linked to /usr/lib, or the tar file may be unpacked into any other location for inclusion at link time. There are separate libraries for the single instance functions and the vector functions.

The following is an example using MASS. If libmass.a and the other libraries are installed in /home/somebody/mass, it is used as:
xlf90 -c -O3 -qarch=pwr4 -qtune=pwr4 myprogram.f
xlf90 -o myjob -L/home/somebody/mass -lmass myprogram.o

All references to SIN, LOG, EXP, and other functions in myprogram.f will have been satisfied from the single instance functions in libmass.a rather than the normally chosen functions in libm.a.

Some of the functions available in the MASS library have now been included in the XL Fortran runtime environment. This means that at higher specified levels of compiler optimization, a Fortran intrinsic function or operation may be replaced with a faster version found in /usr/lib/libxlopt.a even if the MASS libraries have not been installed. If you wish to track exactly which version of an intrinsic has been used you can produce a detailed, sorted cross reference map using -bsxref:myxref when creating the executable.

The following is an example using the MASS library with Fortran code:

real(8) a(*)
...
do i=1,n
   a(i)=1.0d0/a(i)
enddo

This code would be rather expensive using the hardware divide function and may be replaced using the vector MASS reciprocal approximation function vrec as:
call vrec(a,a,n)

Using the vector form, the speedup for this example is approximately 2.25 for n>~50. See Table 6-2 on page 119 for more information.

The executable is linked as: xlf90 -o myjob -bsxref:myxref -L/home/somebody/mass -lmassv myprogram.o

Examination of the file myxref shows that vrec has been loaded from libmassv.a.

However, if you are using XL Fortran Version 7.1 or later, compiling and linking as:
xlf90 -c -O3 -qhot -qarch=pwr4 -qtune=pwr4 myprogram.f
xlf90 -o myjob -bsxref:myxref myprogram.o

you will find that a version of vrec has been loaded from libxlopt.a. Several other functions are recognized and may be substituted by the compiler such as exp, sin, cos, sqrt, and reciprocal square root.

6.2.2 Description and performance of MASS libraries

Table 6-2 lists the functions available in the MASS libraries and an approximate measure of performance. The performance numbers are based on POWER3 measurements. Similar speedups are expected on POWER4. While MASS functions are somewhat less accurate than the standard functions, errors are mostly less than 1 bit.

Table 6-2 MASS library functions and performance

Function                                  mass call   speedup   massv call    speedup(a)
64-bit exponential                        exp         2.37      vexp          6.7
32-bit exponential                        exp         2.37      vsexp         9.7
64-bit natural log                        log         1.57      vlog          10.4
32-bit natural log                        log         1.57      vslog         12.3
64-bit sine or cosine                     sin,cos     2.25(b)   vsin,vcos     7.2(b)
32-bit sine or cosine                     sin,cos     2.17(b)   vssin,vscos   9.75(b)
64-bit sine and cosine                    sin,cos     2.42(b)   vsincos(c)    10.0(b)
32-bit sine and cosine                    sin,cos     2.08(b)   vssincos      13.2(b)
64-bit tangent                            tan         2.13      vtan          5.84
32-bit tangent                            tan         2.02      vstan         5.95
64-bit inverse tangent of complex number  atan2       4.75      vatan2        16.5
32-bit inverse tangent of complex number  atan2       4.70      vsatan2       16.7
Truncate to whole number                  dint        1.0       vdint         7.86
Convert to nearest whole number           dnint       2.0       vdnint        7.06
64-bit reciprocal                         n/a                   vrec          2.6
32-bit reciprocal                         n/a                   vsrec         3.8
64-bit square root                        sqrt                  vsqrt         1.2
32-bit square root                        sqrt                  vssqrt        2.3
64-bit reciprocal square root             rsqrt       1.34      vrsqrt        6.2
32-bit reciprocal square root             rsqrt       1.34      vsrsqrt       13.2
Real raised to real power                 x**y        2.35      N/A           N/A

a. Per result for vector length 1000
b. Speedup for data range [-1,1]
c. See libmassv.f in installation directory for usage

6.3 Modular I/O (MIO) library

The Modular I/O (MIO) library was developed by the Advanced Computing Technology Center (ACTC) of the Watson Research Center at IBM to address the need for an application-level method for optimizing I/O. Applications frequently have very little logic built into them to provide users the opportunity to optimize the I/O performance of the application. The absence of application level I/O tuning leaves the end user at the mercy of the operating system to provide the tuning mechanisms for I/O performance. Typically, multiple applications are run on a given system that have conflicting needs for high-performance I/O resulting, at best, in a set of tuning parameters that provide moderate performance for the application mix.

The MIO library allows users to analyze the I/O of their application and then tune it at the application level so that it performs better under the current operating system configuration.

Sequential access, predominantly reads, of very large files (tens of gigabytes) is a common pattern of I/O, for example, in implicit finite element analysis codes. Applications that are characterized by this I/O pattern tend to benefit minimally from operating system buffer pools. Large operating system buffer pools are ineffective since there is very little, if any, data reuse and system buffer pools typically do not provide prefetching of user data. However, the MIO library can be used to address this issue by invoking a prefetching (pf) module that will detect the sequential access pattern and asynchronously preload the needed data into a smaller cache. The pf cache need only be large enough to contain enough pages to maintain sufficient read ahead. The pf module can optionally use direct I/O, which will avoid an extra memory copy to the system buffer pool and also frees the system buffers from the one-time access of the I/O traffic, allowing the system buffers to be used more productively. Our early experiences with the aix module have consistently demonstrated that the use of direct I/O with the pf module is highly beneficial to system throughput.

The MIO library consists of four I/O modules that may be invoked at run time on a per-file basis. The modules currently available are:

mio     The interface to the user program
pf      A data prefetching module
trace   A statistics gathering module
aix     The MIO interface to the operating system

For each file that is opened with MIO there are a minimum of two modules invoked: the mio module, which converts the user MIO calls (MIO_open, MIO_read, MIO_write, to name a few) into the internal calling sequence of MIO, and the aix module, which converts the internal calling sequence of MIO into the appropriate system calls (open, read, write, for example). Between the mio and aix module invocations the user may specify the invocation of the other modules, pf and trace.

For applications that use the POSIX standard open, read, write, lseek, and close I/O calls the application programmer should only need to introduce #define's to direct the I/O calls to use the MIO library. MIO is controlled through four environment variables. Among other things, these variables determine which modules are to be invoked for a given file when MIO_open is called.

As an example, the output of a MIO trace invocation is shown for a simple program. It opens a file, truncating it back to zero bytes in length, and then writes 100 records of 16 KB. The file is then read forwards with 100 reads of 16 KB, and then read backwards with 100 reads of 16 KB.

MIO statistics file : Wed Feb 9 16:03:17 2000
hostname=v01n01.vendor.pok.ibm.com
program=a.out
MIO library built Feb 1 2000 12:53:59 : with aio calls
MIO_STATS   = example.mio
MIO_DEBUG   = OPEN
MIO_FILES   = *.dat [ trace/stats ]
MIO_DEFAULTS= trace/kbytes

Opening file file.dat
modules=trace/stats
======
Trace close : mio <-> aix : file.dat : (4800/1.80)=2659.71 kbytes/s
demand rate=2611.47 kbytes/s=4800/(1.85-0.02)
current size=1600 max_size=1600
mode =0640 sector size=4096
oflags =0x302=RDWR CREAT TRUNC
open    1     0.03
write   100   0.03   1600   1600   16384   16384
read    200   1.65   3200   3200   16384   16384
seek    101   0.00
fcntl   1     0.00
close   1     0.12
size    100
======

For more information about MIO, refer to the following Web site: http://www.research.ibm.com/actc/Opt_Lib/mio/mio_doc.htm

The MIO library was shipped first with the AIX Version 4.3 and 5L Bonus Pack in July 2001. More information on this is found at the following Web site: http://www.ibm.com/servers/aix/products/bonuspack

6.4 Watson Sparse Matrix Package (WSMP)

The Watson Sparse Matrix Package (WSMP) is a high-performance, robust, and easy-to-use software package for solving large sparse systems of linear equations using a direct method on pSeries servers, RS/6000 workstations, and the RS/6000 SP. It can be used as a serial package, in a shared-memory multiprocessor environment, or as a scalable parallel solver in a message-passing environment, where each node can either be a uniprocessor or a shared-memory multiprocessor.

WSMP consists of two parts, both of which are bundled in the same library. Part I of WSMP replaces the older software called WSSMP for the solution of symmetric sparse systems of linear equations. Part II of the WSMP library deals with the solution of general sparse systems of linear equations. Currently, WSMP does not support the solution of general/unsymmetrical sparse systems in a message-passing parallel environment. WSMP does not have out-of-core capabilities. The problems must fit in the main memory for reasonable performance.

Technical papers related to the software, some example programs, and information about the latest updates can be obtained from the following Web site: http://www.cs.umn.edu/~agupta/wsmp.html

IBM Research intends to provide a version of WSMP compiled for POWER4 when the hardware and compiler become available.

For solving symmetric systems, WSMP uses a modified version of the multifrontal algorithm for sparse Cholesky factorization and a highly scalable parallel sparse Cholesky factorization algorithm. The package also uses scalable parallel sparse triangular solvers and an improved and parallelized version of the previously released package WGPP for computing fill-reducing orderings. Sparse symmetric factorization in WSMP has been clocked at up to 3.6 GFLOPS on an RS/6000 workstation with four 375 MHz POWER3 CPUs and 90 GFLOPS on a 128-node SP with two-way SMP 200 MHz POWER3 nodes.

For solving general sparse systems, WSMP uses a modified version of the multifrontal algorithm for matrices with an unsymmetrical pattern of nonzeros. WSMP supports threshold partial pivoting for general matrices with a user-defined threshold. WSMP automatically exploits SMP parallelism on an RS/6000 workstation or SP node with multiple CPUs and this parallelism is transparent to the user. On an RS/6000 with four 375 MHz POWER3 CPUs, WSMP has been clocked at up to 2.4 GFLOPS for factoring general sparse matrices with partial pivoting.


Chapter 7. Parallel programming techniques and performance

There are several methods available to the application programmer to achieve parallel execution of a program and more rapid job completion compared to running on a single processor. These methods include:
- Directive-based shared memory parallelization (SMP)
- Compiler automatically generated shared memory parallelization
- Message passing based shared or distributed memory parallelization
- POSIX threads (pthreads) parallelization
- Low-level UNIX parallelization using fork() and exec()

Each of these techniques has been used to produce efficient parallel codes. The best technique to use is highly dependent on the application, the programmer’s skills and preferences, portability requirements for the application, and the target machine’s characteristics.

In this chapter we discuss shared memory parallelization, both directive-based and automatic, message passing based parallelization using the MPI standard, and pthread parallelization.

7.1 Shared memory parallelization

Shared memory parallelization describes parallelization that can take place in a computer in which all memory used by a program is locally addressable from within the program. For current IBM computers and the current AIX 5L operating system, this means running on a single node. In the future, non-uniform memory access (NUMA) may be available in which memory on separate, remote nodes may be addressable locally from within a single program.

A detailed description of shared memory parallelization, or SMP programming, can be found in Scientific Applications in RS/6000 SP Environments, SG24-5611. A brief overview is given in this section. All discussions refer to the OpenMP standard implementation of SMP parallelism.

7.1.1 SMP runtime behavior

Shared memory parallelization is implemented by creating user threads that are scheduled to run on kernel threads by the operating system. This parallel job flow is illustrated in Figure 7-1 on page 127.

A single thread is created when a program starts. Additional threads are created when the first parallel region is entered. After all parallel work for a thread is completed, it spin waits for the next parallel section for a period, but it consumes processor time while waiting. After the spin wait time has expired and if a yield wait time has been specified, the thread can yield its place on the kernel thread to another runnable thread. If the yield wait time has expired and no new parallel region has been entered, the thread goes to sleep. Reactivating a thread from a sleep state is more costly than if the thread is in a yielded state.

Figure 7-1 Shared memory parallel job flow (timeline showing threads being created, spinning, yielding, and sleeping around sequential code and parallel loops)

There are some important environment variables that can affect parallel performance at run time. Different settings would be appropriate on a busy machine compared to a quiet machine. Some of the more important environment variables are:

- AIXTHREAD_SCOPE = S or P (default = P)
  The thread contention scope can be system (S) or process (P). When system contention scope is used, each user thread is directly mapped to one kernel thread. This is appropriate for typical scientific and technical applications in which there is a one-to-one ratio between threads wanted and processors wanted. Process contention scope is best when there are many more threads than processors. When process contention scope is used, user threads share a kernel thread with other (process contention scope) user threads in the process.

- OMP_DYNAMIC = FALSE or TRUE (default = TRUE)
  The OMP_DYNAMIC environment variable disables or enables dynamic adjustment of the number of threads available for the execution of parallel regions. If this variable is TRUE, the runtime environment can adjust the number of threads it uses for executing parallel regions so it makes the most efficient use of system resources. The dynamic checking can add a small amount of overhead, so for benchmarking, scaling tests, or if an application depends on a specific number of threads, this variable should be set to FALSE.

- SPINLOOPTIME = n (default = 40)
  If a user thread cannot acquire a lock (which is necessary to begin a parallel loop, for example), it will attempt to spin for up to SPINLOOPTIME times. Once the spin count has been exhausted, the thread will go to sleep waiting for a lock to become available unless YIELDLOOPTIME is set to a number greater than zero. You want to spin rather than sleep if you are waiting for a previous parallel loop to complete, provided there is not too much sequential work between the loops. If YIELDLOOPTIME is set, upon exhausting the spin count, the thread issues the yield() system call and gives up the processor, but stays in a runnable state rather than going to sleep. On a quiet system, yielding is preferable to sleeping since reactivating the thread after sleeping costs more time. For benchmarking or scaling tests, SPINLOOPTIME can be very large, for example 100000 or more. On a busy system, it should not be too large or much processor time that could otherwise be shared with other jobs is consumed spinning. The best value to use depends on various system characteristics such as processor frequency, and several values should be tested to achieve optimal tuning.

- YIELDLOOPTIME = n (default = 0)
  YIELDLOOPTIME controls the number of times that the system yields the processor when trying to acquire a busy spin lock before going to sleep. The processor is yielded to another kernel thread, assuming there is another runnable one with sufficient priority. YIELDLOOPTIME is only used if SPINLOOPTIME is also set.

- MALLOCMULTIHEAP (default = not set)
  Multiple heaps are useful so that a threaded application can have more than one thread issuing memory allocation subroutine calls. With a single heap, all threads trying to do a malloc(), free(), or realloc() call would be serialized (that is, only one thread can do malloc/free/realloc at a time), which could have a serious impact on multi-processor machines. With multiple heaps, each thread gets its own heap, up to 32 separate heaps.

- SMP stack size (default = 4 MB/thread)
  For 32-bit OpenMP applications, the default limit on stack size per thread is rather small and if it is exceeded it will result in a runtime error. Should this occur, the stack size may be increased using the XLSMPOPTS environment variable with:
  export XLSMPOPTS=stack=n
  where n is the stack size in bytes. However, the total stack size for all threads cannot exceed 256 MB (one memory segment). This limitation of one segment does not apply to 64-bit applications.

7.1.2 Shared memory parallel examples

Shared memory parallelization (SMP) programming can be done at a very high level such as:

      SUBROUTINE EXAMPLE(M,N,A,B)
      REAL(8) A(N),B(N)
!$OMP PARALLEL DO PRIVATE(J), DEFAULT(SHARED)
      DO J=1,M
         CALL DOWORK(J,N,A,B)
      ENDDO
      ...

The subroutine DOWORK and all subsequent subroutine calls must be carefully checked to ensure they are, in fact, thread safe. This high level of parallelization is usually the most efficient, and is recommended when possible.
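For example, a version of DOWORK like the following contrived sketch (invented here; it is not taken from the application above) would not be thread safe, because the SAVEd scratch variable is shared by all threads:

      SUBROUTINE DOWORK(J,N,A,B)
      REAL(8) A(N),B(N)
      REAL(8) TEMP
      SAVE TEMP
C     TEMP is shared by every thread calling DOWORK: another thread
C     may overwrite it between the next two statements (a race)
      TEMP = 2.0D0*B(J)
      A(J) = TEMP
      RETURN
      END

Making such scratch variables automatic (no SAVE, and compiling without -qsave) or passing them as arguments is the usual remedy.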

It is also common to use shared memory parallelization at a low level, although scaling efficiencies are often quite limited when little work is done in a parallel region. The ease of implementation is an attractive feature of low-level parallelization. The discussion and examples that follow demonstrate parallelism at the loop level.

We have tested three loops from the solver of a computational fluid dynamics code and use them as examples. The loops are:

LOOP A

      DO J=1,NX
         Q(J)=E(J)+F2*Q(J)
      ENDDO

LOOP B

      DO J=1,NX
         I1=IL(1,J)
         I2=IL(2,J)
         I3=IL(3,J)
         I4=IL(4,J)
         I5=IL(5,J)
         I6=IL(6,J)
         E(J)=Y3(J)*Q(J)-(
     *     Q2(1,J)*Q(I1)+Q2(2,J)*Q(I2)+
     *     Q2(3,J)*Q(I3)+Q2(4,J)*Q(I4)+
     *     Q2(5,J)*Q(I5)+Q2(6,J)*Q(I6))
         F3=F3+Q(J)*E(J)
      ENDDO

LOOP C

      DO J=1,NX
         Z0(J)=Z0(J)+X2*Q(J)
         B1(J)=B1(J)-X2*E(J)
         T1=B1(J)
         E(J)=T1*DBLE(C1(J))
         F1=F1+T1*E(J)
         F4=F4+ABS(T1)
      ENDDO

The declarations are:

      REAL(8) Z0(NX),B1(NX),E(NX),Q(0:NX)
      REAL(4) Y3(NX),Q2(6,NX),C1(NX)
      INTEGER(4) IL(6,NX)

In this example, NX is typically 100000 to 10000000.

Loop A is a simple multiply/add loop. Loop B is a complicated loop with 20 memory loads, a single store, and a reduction sum. Six of the memory references, such as Q(I1) are indirect address references. Loop C is a moderately complicated loop with five memory loads, three stores, and two reduction sums.

7.1.3 Automatic shared memory parallelization

Automatic shared memory parallelization is successful when the compiler can recognize parallel code constructs and safely produce efficient parallel code. The IBM XL Fortran Version 7.1 compiler has state-of-the-art capabilities for automatically parallelizing Fortran programs. A major concern with automatic parallelization is the potential that a loop with little work or few iterations is parallelized and runs more slowly than it would had it remained sequential. However, when a large Fortran code is well written and it is compiled for automatic parallelization, good speedups can be realized with very little effort.

The three example loops are automatically parallelized with:
xlf90_r -c -qsmp=auto -qnohot -qreport=smplist -O3 -qarch=pwr4 -qtune=pwr4 -qfixed sub.f

The option -qsmp=auto initiates automatic parallelization and it also implies -qhot. The option -qnohot was used to be consistent with the directive-based SMP runs. The option -qreport=smplist reports the line number of each successfully parallelized loop. The resulting file, sub.lst, has additional information including reasons why parallelization may have been unsuccessful for a loop. The xlf90_r compiler invocation should be used rather than xlf90 to ensure the resulting object code is thread safe. Performance results are shown in Section 7.1.5, “Measured SMP performance” on page 132.

7.1.4 Directive-based shared memory parallelization

Directive-based shared memory parallelization is more labor intensive than automatic parallelization, but it does allow for more control over which loops get parallelized and more options for scheduling individual loops.

For the example loops, the following directives were used:

LOOP A

!$OMP PARALLEL DO PRIVATE(J),DEFAULT(SHARED),SCHEDULE(GUIDED)

LOOP B

!$OMP PARALLEL DO PRIVATE(J,I1,I2,I3,I4,I5,I6)
!$OMP* REDUCTION(+:F3)
!$OMP* DEFAULT(SHARED),SCHEDULE(GUIDED)

LOOP C

!$OMP PARALLEL DO PRIVATE(J,T1)
!$OMP* REDUCTION(+:F1,F4)
!$OMP* DEFAULT(SHARED),SCHEDULE(GUIDED)

The loops use guided scheduling, which initially divides the iteration space into one chunk equal to NX divided by the number of threads and then exponentially decreases the chunk size to a minimum size of 1. This scheduling algorithm, which allows a processor that found more data in L1 or L2 cache to get another chunk of data quickly while a processor requiring many L3 or memory references is still working, is often the most efficient. A brief worked illustration follows.
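As a rough worked example (assuming the chunk rule just described, remaining iterations divided by the number of threads; the numbers are approximate and purely illustrative), with NX = 1,000,000 and eight threads the successive chunk sizes would be roughly:

chunk 1: 1,000,000 / 8 = 125,000 iterations
chunk 2:   875,000 / 8 ≈ 109,375 iterations
chunk 3:   765,625 / 8 ≈  95,703 iterations
...
shrinking geometrically toward the minimum chunk size of 1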

Compiling for directive-based parallelization uses the following command options:
xlf90_r -c -qsmp=omp -O3 -qarch=pwr4 -qtune=pwr4 -qfixed sub.f

The option noauto is implied when -qsmp=omp is used. However, with -qsmp=omp, -qhot is not implied.

7.1.5 Measured SMP performance

The three example loops were run from within an application and timed using realistic data for 200 repetitions with NX set to 1000000. The environment settings used were:
export AIXTHREAD_SCOPE=S
export SPINLOOPTIME=100000
export YIELDLOOPTIME=40000
export OMP_DYNAMIC=false
export MALLOCMULTIHEAP=1

All results were run on a two-MCM, eight-processor pSeries 690 HPC. The results for each of the three loops are shown separately in Table 7-1, Table 7-2, and Table 7-3 on page 133.

Table 7-1 Loop A parallel performance, elapsed time

Processors   -O3     -qsmp=auto   speedup   -qsmp=omp   speedup
1            1.211   1.378        0.88      1.238       0.98
2                    0.815        1.49      0.706       1.72
4                    0.481        2.52      0.431       2.81
6                    0.364        3.33      0.335       3.61
8                    0.307        3.94      0.262       4.62

Table 7-2 Loop B parallel performance, elapsed time

Processors   -O3     -qsmp=auto   speedup   -qsmp=omp   speedup
1            4.190   4.758        0.88      4.268       0.98
2                    2.000        2.10      2.068       2.03
4                    1.143        3.67      1.146       3.66
6                    0.885        4.73      0.883       4.75
8                    0.787        5.32      0.720       5.82

Table 7-3 Loop C parallel performance, elapsed time

Processors   -O3     -qsmp=auto   speedup   -qsmp=omp   speedup
1            2.795   3.056        0.91      2.830       0.99
2                    1.430        1.95      1.541       1.81
4                    0.906        3.08      0.933       3.00
6                    0.719        3.89      0.736       3.80
8                    0.633        4.42      0.588       4.75

The data shows that for these test loops there is little difference between automatic and manual parallelization. Some overhead due to parallelization can be seen comparing the single processor results. Note that the compiler may be using different optimization strategies when creating parallel code as well.

The reduction sums in loops B and C require the creation of critical sections in which only one processor can update the reduced variable at a time. These critical sections can significantly reduce parallel efficiency if the amount of work in the loop is too small or too many processors are used.

The conclusions from this analysis are:
- SMP parallelization does result in improved run times.
- SMP parallelization is easy to implement.
- Overall speedups are limited for small loops, especially when there are reduction sums.

7.2 MPI in an SMP environment

This section examines how existing MPI programs, written for distributed memory systems, can make the best use of both SMP and distributed memory systems.

We do not attempt to provide a detailed discussion of distributed memory parallelization or the use of MPI and refer the reader to the IBM Parallel Environment for AIX product documentation, and the IBM Redbook Scientific Applications in RS/6000 SP Environments, SG24-5611.

In the following discussion, processes executing in parallel and communicating using MPI calls are referred to as tasks. A number of different scenarios are considered:

- MPI only
  The MPI implementation in IBM Parallel Environment for AIX (PE) can use several protocols for communication between tasks. Internet Protocol (IP) can be used between tasks on the same node and between tasks on different nodes. This incurs relatively high latencies and IP overheads. In the IBM RS/6000 SP environment with nodes attached to one of the types of SP switch, another protocol known as user space can be used for communication between tasks. Depending on the type of switch involved and the version and release of PE, there may be restrictions on the number of user space tasks allowed per node. At the time of writing, the SP Switch2 with PE 3.1 can support up to 16 user space tasks per POWER3 node. User space significantly reduces the overhead and latency when compared to IP, but it may still be higher between processes on the same node than using shared memory. MPI communication calls can also use shared memory for message passing between MPI tasks on the same node. The PE MPI library is capable of using shared memory automatically. In a cluster or SP configuration of POWER3 nodes, IP or user space would be used between tasks on other nodes. In this case, overall performance can still be limited by communication between the nodes. This could be reduced for group operations (such as broadcast) by having one processor per node handle all the internode communication. This process would use shared memory to collect and distribute data to other processes on the same node. Since the different tasks on the same node are different processes, they have different address spaces and the shared memory MPI library will communicate through a shared memory segment. This means a double copy of the data (into and out of the shared memory segment). It would be possible for each task to keep its data in the shared memory segment and not use MPI for this communication, but this would require some degree of reprogramming. The advantage of using the PE shared memory MPI library is that no reprogramming is required.

  In order to use shared memory for communication calls within a shared memory machine, one of the following two procedures should be followed:
  – Use the following PE environment variable settings:
    export MP_SHARED_MEMORY=yes
    export MP_WAIT_MODE=poll
    MP_WAIT_MODE is not essential in order to use shared memory, but setting it to poll is recommended for performance in most scenarios where MPI tasks will use shared memory.
  – Use the following command line arguments either with the parallel program or with the poe command, depending on the way the parallel program is started:
    -shared_memory yes -wait_mode poll

- MPI and SMP Fortran
  In this scenario, also known as the hybrid or mixed-mode programming model, there are fewer MPI tasks than processors per node. Shared memory parallelization techniques such as OpenMP directives can be used to execute sections of the code between MPI calls in parallel. This means that each MPI task has multiple threads executing in parallel, and the aim would be to keep all of the processors busy all of the time. In practice, it will be difficult to achieve this during the MPI communication phases of the program. However, the benefit of this programming model is that it can be used to reduce the amount of communication traffic between nodes, especially during global communications, by reducing the total number of MPI tasks. This could be especially important for large multi-processor systems such as 32-way POWER4 systems clustered together. The overhead of shared memory parallelization is similar to that of MPI data transfers, so it is desirable to parallelize at a sufficiently coarse granularity to keep the effect of this overhead small. Some recoding may be required to achieve this hybrid parallelization. (A minimal sketch of this model is shown after this list.)

- MPI and explicit large chunk threads
  In this scenario, there is only one MPI process per node. The initial process (or master thread) creates threads which, instead of issuing MPI calls, use pthread techniques to transfer data between themselves and the master thread. The master thread uses MPI to transfer all data between the nodes. Data does not have to be copied between threads since they all use the same address space. Synchronization can be achieved either with standard pthread calls, or, with even less overhead, by using spin loops and the atomic fetch_and_add function (which guarantees that only one thread at a time can update a variable).

  The total number of messages between nodes is reduced and hence delays due to latency are reduced. Since the master thread handles all messages, it should perhaps be coded to do less work than the other threads. However, all of this may imply considerable reprogramming. The program may have used the MPI task ID to create its arrays and organize its data. The threads will have to arrange this differently, because they share the same task ID, and are using the same address space.
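To make the hybrid (mixed-mode) model concrete, the following minimal Fortran sketch combines MPI between tasks with an OpenMP loop inside each task. The array, its size, and the trivial computation are invented for illustration; the MPI and OpenMP constructs themselves are standard:

      program hybrid
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 1000000
      real(8) :: a(n)
      integer :: ierr, rank, ntasks, i

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      call mpi_comm_size(mpi_comm_world, ntasks, ierr)

!     each MPI task works on its own data; OpenMP threads share the loop
!$omp parallel do private(i), default(shared)
      do i = 1, n
         a(i) = dble(rank) + 1.0d0/dble(i)
      enddo

!     MPI is used only for the (less frequent) exchanges between tasks
      call mpi_barrier(mpi_comm_world, ierr)
      call mpi_finalize(ierr)
      end

With PE and XL Fortran this might be compiled with something like mpxlf90_r -qsmp=omp -O3; the exact invocation depends on the installed environment.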

The advantages and disadvantages of these scenarios are summarized in Table 7-4.

Table 7-4 Advantages and disadvantages of message passing techniques

MPI only
  Advantages: No program changes. Same coding for calls between all tasks; uses shared memory on the same node.
  Disadvantages: Double copy between processes on the same node.

Hybrid mode
  Advantages: MPI exchanges reduced. Can reduce off-node communication.
  Disadvantages: May not be possible to fully use the CPUs. Some reprogramming required.

MPI and large chunk threads
  Advantages: MPI exchanges reduced. Exchanges and overhead between threads reduced.
  Disadvantages: Considerable reprogramming may be required.

To summarize, all of the scenarios can be useful depending on the particular application requirements and the target environment. Descending the table, the efficiency of the solution increases, but the amount of reprogramming required also increases.

To gain addressability to 8 GB with a 32-bit MPI, the sPPM ASCI benchmark code used the Hybrid mode. More information about this can be obtained from: http://www.llnl.gov/asci_benchmarks/asci/limited/ppm/sppm_readme.html

7.3 Programming with threads

The thread programming paradigm is a flexible, low-level model of distributing the work of a given application into multiple streams of execution that share a single memory address space. Each thread can execute its own function and can be controlled independently. In the context of high-performance computing, threads are used to distribute a workload onto multiple processors of an SMP system, rather than to dispatch many threads onto a single processor, as is common for graphical user interfaces.

There is a standardized application interface for threads called Pthreads (POSIX threads) that is part of the UNIX specification. The redbook Scientific Applications in RS/6000 SP Environments, SG24-5611, provides a compact introduction to Pthreads for multi-processor applications on the AIX platform. The corresponding AIX reference manual General Programming Concepts: Writing and Debugging Programs (part of the AIX Programming Guides) can be found at: http://www.ibm.com/servers/aix/library/techpubs.html

Programming explicitly with threads is not recommended for the casual user. In many cases the benefits of multiple threads can be more easily obtained by using the automatic parallelization capabilities of the compiler or OpenMP directives.

7.3.1 Basic concepts

Threads can be described as light-weight processes. Each thread has its own private stack and registers. The memory state and file descriptors are shared. For a brief overview of the usage of Pthreads, a simple hello world program is shown in Example 7-1 on page 138. Although this program does no complicated work, it provides a useful template for thread creation.

A Pthread program begins to execute as a single thread. Additional threads are created and terminated as necessary to concurrently schedule work onto the available processors. In this example, the initial thread creates three worker threads, which will print hello messages and terminate. As will be familiar to message passing programmers, it is a good practice for the master thread (or MPI task) to take part in the computation. This yields good load-balancing when N threads are dispatched on N processors.

A threaded application should be compiled and linked with the _r-suffixed invocation of the C compiler, for example xlc_r, which defines the symbol _THREAD_SAFE and links with the Pthreads library.

Example 7-1 Pthread version of a hello world program

#include <pthread.h>
#include <stdio.h>

void * thfunc(void * arg)
{
    int id;
    id = *((int *) arg);
    printf("hello from thread %d \n", id);
    return NULL;
}

int main(void)
{
    pthread_t thread[4];
    pthread_attr_t attr;
    int arg[4] = {0,1,2,3};
    int i;

    /* setup joinable threads with system scope */
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

    /* create N-1 worker threads */
    for (i=1; i<4; i++) {
        pthread_create(&thread[i], &attr, thfunc, (void *) &arg[i]);
    }

    /* let the master thread also take part in the computation */
    thfunc((void *) &arg[0]);

    /* wait for all other threads to finish */
    for (i=1; i<4; i++) {
        pthread_join(thread[i], NULL);
    }
    return 0;
}

Threads are created using the pthread_create function. This function has four arguments: a thread identifier, which is returned upon successful completion; a pointer to a thread-attributes object; the function that the thread will execute; and the argument of the thread function. The thread function takes a single pointer argument (of type void *) and returns a pointer (of type void *). In practice, the

argument to the thread function is often a pointer to a structure, and the structure may contain many data items that are accessible to the thread function. In this example, the argument is a pointer to an integer, and the integer is used to identify the thread.

The previous simple example creates a fixed number of threads. In many applications, it is useful to have the program decide how many threads to create at run time while providing the ability to override the default behavior by setting an environment variable. For example, for OpenMP programs the default is to create as many threads as there are processors available. In AIX, you can get the number of online processors by calling the sysconf routine from libc, as shown in Example 7-2.

Example 7-2 Sample code for setting the number of threads at run time

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
...
char * penv;
int ncpus, numthreads;
...
/* get the number of online processors */
ncpus = sysconf(_SC_NPROCESSORS_ONLN);
if (ncpus < 1) ncpus = 1;

/* check the NUMTHREADS environment variable */
penv = getenv("NUMTHREADS");
if (penv == NULL) numthreads = ncpus;
else numthreads = atoi(penv);
...

A thread terminates implicitly when the execution of the thread function is completed. A thread can terminate itself explicitly by calling pthread_exit. It is also possible for one thread to terminate other threads by calling the pthread_cancel function. The initial thread has a special property. If the initial thread reaches the end of its execution stream and returns, the exit routine is invoked, and, at that time, all threads that belong to the process will be terminated. However, the initial thread can create detached threads, and then safely call pthread_exit. In this case, the remaining threads will continue execution of their thread functions and the process will remain active until the last thread exits. In many applications, it is useful for the initial thread to create a group of threads and then wait for them to terminate before continuing or exiting. This can be achieved with threads that are joinable (see

pthread_attr_setdetachstate). The AIX default is detached. The function pthread_join suspends the calling thread until the referenced thread has terminated. The system scope attribute is appropriate when N threads are supposed to run on N processors concurrently.

Synchronization

As with OpenMP directive-based parallelization, the distinction between threadprivate and shared variables is essential for the correctness and performance of a program. The access to shared variables has to be synchronized to avoid conflicts and to assure correct results. The use of synchronization should be balanced against its degradation of performance and scalability.

A major difficulty of parallel programming for shared memory is to find the right balance of local and global variables, since the scoping defines which variables are private or shared. Contention for global variables, as in a reduction sum, is a major source of performance problems. The introduction of temporary local variables often helps to resolve such problems.

In multi-threaded applications the update of shared memory locations is usually protected with mutex (mutual exclusion) locks. The operating system ensures that access to the shared data is serialized. At a given time only one thread can enter the region between lock and unlock to modify the data. The usage of mutex locks is shown in Example 7-3. This example demonstrates how to construct a basic barrier synchronization function. It is left as an exercise for the reader to study a Pthread programming reference in order to understand this complex construct.

Example 7-3 Usage of mutex locks to modify shared data structures

#include <pthread.h>

int barrier_instance = 0;
int blocked_threads = 0;

pthread_mutex_t sync_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t sync_cond = PTHREAD_COND_INITIALIZER;

int syncthreads(int nth)
{
    int instance;

    /* the calling thread acquires the lock, other threads block */
    pthread_mutex_lock(&sync_lock);

    /* the thread with the lock proceeds */
    instance = barrier_instance;
    blocked_threads++;

    if (blocked_threads == nth) {
        /* notify all threads that the sync condition is met */
        blocked_threads = 0;
        barrier_instance++;
        pthread_cond_broadcast(&sync_cond);
    }

    while (instance == barrier_instance) {
        /* release the lock and wait here */
        pthread_cond_wait(&sync_cond, &sync_lock);
    }

    /* all threads call the unlock function and return */
    pthread_mutex_unlock(&sync_lock);

    return(0);
}

Frequent mistakes

The most common mistakes of thread programming shown in this section occur more frequently than we would like:

- Process exits before all threads have finished
  The following example of code is not correct because when a process exits or returns from main(), all of the process memory is deallocated and all threads belonging to the process are terminated.

  #include <pthread.h>
  int main(void)
  {
      pthread_t tid[NUMBER_OF_THREADS];
      ...
      /* create threads */
      for (i = 0; i < NUMBER_OF_THREADS; i++)
          pthread_create(&tid[i], NULL, thfunc, NULL);

      /* wrong: main() returns without joining; the worker
         threads are terminated even if they are still working */
      return 0;
  }

- Dangling pointer
  The following code fragment is incorrect because errorcode resides on the local stack of the thread and will be freed when the thread is destroyed, leading to a dangling pointer.

  void * thfunc(void * arg)
  {
      int errorcode;

      /* do something */
      /* if error condition detected, errorcode = something; */
      pthread_exit(&errorcode);
      ...
  }

Using Pthreads in Fortran

On IBM systems, a Fortran version of the Pthreads interface is available in addition to the standard C Pthreads interface. This makes it relatively simple to introduce threads into numerically intensive Fortran applications. The reader should recognize that the Fortran interface is not backed by an industry-wide standard. Pthread constructs can be used within OpenMP programs in rare instances when some direct control of thread management or data access synchronization is necessary. In such a mixed mode the OpenMP runtime environment will create and manage all threads used for the execution of OpenMP parallel constructs. Explicit pthread creation is the responsibility of the programmer.

The IBM Fortran version of the Pthreads API is similar to the C version, where the function names and data types from C are preceded with f_. Fortran programs that use explicit Pthread routines must have a statement to include the f_pthread module. In general the -qnosave option is essential for correct behavior of a program. A number of Fortran routines, including f_pthread_create, have call sequences that differ from the standard C version. For example, the f_pthread_create function takes an additional parameter to specify properties of the argument to the thread function. The IBM Fortran implementation of Pthreads is described in the XL Fortran for AIX Language Reference, SC09-2867.

7.3.2 Coding and performance considerations

The following performance considerations apply to both Pthread hand-coded programs and OpenMP based programs.

Thread creation

The time required to create a thread is on the order of 100 microseconds. You should only create, wake, or terminate threads that execute for a significantly longer time.

Lock contention

Shared data structures that are modified within an innermost loop of a thread and need to be protected against concurrent access can cause severe performance degradation. The following example shows a loop that counts the number of elements in a vector with a value equal to one. The counter is a shared variable that has to be protected by a mutex lock when incremented. Assuming that every second element is equal to one, this example takes more than 30 times longer to execute on four processors than an equivalent sequential loop, which does not need to call the Pthread lock routines.

int shared_count=0;
...
void * thfunc(void *id)
{
    ...
    for (i=start; i<end; i++) {
        if (vector[i] == 1) {
            /* every increment of the shared counter takes the lock */
            pthread_mutex_lock(&lock);
            shared_count++;
            pthread_mutex_unlock(&lock);
        }
    }
    ...
}

Without increasing the default values of the SMP runtime variables SPINLOOPTIME or YIELDLOOPTIME the performance is even slower. For details on AIX environment variables that determine the runtime behavior of a thread when waiting for a lock (spin, yield, sleep), see Section 7.1.1, “SMP runtime behavior” on page 126.

In this simple case, the performance problem can be resolved with the help of a local counter variable, which turns the tremendous slowdown into the expected parallel speedup.

int shared_count=0;
...
void * thfunc(void *id)
{
    int private_count=0;
    ...
    for (i=start; i<end; i++) {
        if (vector[i] == 1) private_count++;
    }
    /* the lock is now taken only once per thread */
    pthread_mutex_lock(&lock);
    shared_count += private_count;
    pthread_mutex_unlock(&lock);
    ...
}

Avoiding locks and OpenMP critical sections

In many multi-threaded programs, a barrier synchronization routine can help to reduce extensive use of locks or OpenMP critical sections. For example, suppose that multiple threads are working to fill out different entries of a table, and, once that is done, each thread needs read access to the table for the next step. A barrier synchronization point would ensure that no thread could proceed to the next step until all threads have finished filling out the table. Instead of working directly with the low-level pthread_mutex functions, a higher level thread synchronization function is very useful.

If you cannot avoid a lock or critical section:
- Reduce the amount of time a lock is held. Move all unnecessary code outside a critical section.
- Combine access to shared data in order to reduce the number of single lock/unlock calls.

False sharing

False sharing of a cache line occurs when multiple threads on different processors with private caches modify independent data structures that happen to belong to the same cache line. In this situation, the cache line of a particular CPU is flushed out due to another processor's store operations and has to be transferred repeatedly from remote caches. This is likely to happen when, for example, a thread local variable is stored in a global array indexed by the logical thread number. This causes the data to be located close together in memory.

By using appropriate padding, false cache-line sharing can be avoided. Referring to the preceding example, the following implementation cures the lock contention problem, but suffers from false sharing. On our machine, this leads to a moderate slowdown (of about 1.5) on four processors compared to the sequential program.

int shared_count=0;
int private_count[4]={0,0,0,0};
...
void * thfunc(void *id)
{
    ...
    myid = *((int *)id);
    for (i=start; i<end; i++) {
        /* per-thread counters are adjacent in memory: false sharing */
        if (vector[i] == 1) private_count[myid]++;
    }
    ...
}

By introducing appropriate padding space to fill up a cache line of 128 bytes, false sharing can be eliminated.

struct count{
    int private_counter;
    char pad[124];
} counter[4];
int shared_count=0;
...
void * count_ones(void *id)
{
    ...
    myid = *((int *)id);
    for (i=start; i<end; i++) {
        /* each thread's counter now occupies its own cache line */
        if (vector[i] == 1) counter[myid].private_counter++;
    }
    ...
}

For example, a similar false sharing problem can occur when storing a set of mutex lock objects in a global array. The following Fortran code avoids false sharing. The type pthread_mutex_t is declared in /usr/include/sys/types.h. In 64-bit mode it has a different size.

      use f_pthread
      integer, parameter :: maxthreads=8
      type plock
        sequence
        type(f_pthread_mutex_t) :: lock
        integer :: pad(19)
      end type plock
      type(plock) :: locks(maxthreads)
      common /global/ locks

As a rule of thumb, in an SMP program, global data whose access is not serialized should not be placed close together in memory.

Reducing OpenMP overhead
In general, an SMP parallel program generates some computational overhead. Even when executed by just a single thread, it shows a certain amount of overhead compared to the equivalent sequential (non-threaded) version of the program. For discussion purposes, call this the sequential overhead. Thread-safe system libraries, such as those for I/O, that are referenced by an _r-suffixed compiler invocation may also contribute to this overhead.

For a fine-grain OpenMP program, the sequential overhead can be significant. If the overhead exceeds, for example, 30 percent of the elapsed time, this can be an indication of inefficient use of OpenMP directives. Global variables that have to be scoped threadprivate are a common cause of such overhead.

The following example is taken from a real application. To support the threadprivate pragma (or directive in Fortran) the compiler generates calls to the internal function _xlGetThreadValue. These calls are relatively expensive. In general it is a good idea to reduce the number of calls by packing several threadprivate variables into a single structure. This way we will encounter only one call per dynamic path through each function where the threadprivate variables are referenced. Otherwise, we will encounter one call per threadprivate variable. As an example, consider the following lines of code:

static int n_nodes, num_visits;
static Node *node_array;
static int *val, *stack;
static Align_info *align_array;

#pragma omp threadprivate( \
   n_nodes, num_visits,    \
   node_array,             \
   val, stack,             \
   align_array             \
)

This code could be substituted as follows to improve performance. Further changes in the code that references these variables are necessary to make the substitution complete.

struct nn{
   int n_nodes, num_visits;
   Node *node_array;
   int *val, *stack;
   Align_info *align_array;
} glob;
#pragma omp threadprivate(glob)

7.3.3 The best approach for shared memory parallelization
As discussed, there are many different ways to parallelize a program for shared memory architectures. The appropriate approach depends on several considerations, for example: Does a sequential code already exist, or will the parallel program be written from scratch? What are the parallel programming skills of the project team? The following are some pros and cons of the different paradigms:
- Auto-parallelization by the compiler:
  – Easy to implement (just a few directives)
  – Enables teamwork easily
  – Limited scalability because data scoping is neglected
  – Compiler dependent (even on the release of a particular compiler)
  – Not necessarily portable
- OpenMP directives:
  – Portable
  – Potentially better scalability than auto-parallelization
  – Uniform memory access is assumed
- RYO (subset or mixture of OpenMP and Pthreads, UNIX fork() and exec() parallelization, or platform-specific constructs):
  – Might enable teamwork
  – Needs a well-tested concept to assure performance and portability
  – Not necessarily portable
- Pthreads:
  – Portable
  – Potentially best scalability
  – Needs experienced programmers

- SMP-enabled libraries:
  – Least effort
  – Limited flexibility

7.4 Parallel programming with shared caches

On all models of POWER4 microarchitecture machines currently available, each processor has a dedicated L1 cache. As discussed in Section 2.4.3, “L2 cache” on page 17, each L2 cache unit is shared between the two processors on a chip in the pSeries 690 Turbo and pSeries 690 Model 681, whereas in the pSeries 690 HPC each processor has a dedicated L2 cache. In all models, the L2 cache remains the level of coherency.

The sharing of the L2 cache raises a number of considerations, such as:
- Processors that share an L2 cache will compete for the bandwidth from the L2 cache to L3 and memory. However, the effect of this is unlikely to differ very much from the effect seen by independent processes sharing the bandwidth from L2.
- In the configurations where two processors share an L2 cache and each is accessing different memory addresses, there may be conflict for each cache line loaded into L2.
- Interference when accessing the same cache lines. In the extreme case, two threads of a shared memory application may access the same cache line. In this case, there may be a benefit to the shared L2 cache configuration, since there will be a higher percentage of hits in the L2 cache and there will be fewer cache snooping events. An example of this would be the following, rather artificial loop:

!$OMP PARALLEL DO PRIVATE(s,j,time1,time2), SHARED(a,b,s1,ttime), &
!$OMP& SCHEDULE(STATIC,1)
      do i=1,m
        do j=1,n
          b(i,j)=a(i,j)+a(i,j)*c1
        end do
      end do
!$OMP END PARALLEL DO

In this example, when parallelized across two threads, each thread will access alternate elements of the arrays.

Measurements were performed on a 32-way pSeries 690 Turbo, with the arrays dimensioned to fit in L2 but not in L1, and the results provided in Table 7-5 were obtained.

Table 7-5   Shared memory cache results, pSeries 690 Turbo

  Threads   Unshared cache time [s]   Shared cache time [s]   Shared / unshared
     1              14.76                    14.40                  0.98
     2              14.11                     8.63                  0.61
     4               7.57                     8.02                  1.06
     8               6.78                     5.69                  0.84
    16               6.60                     5.75                  0.87

The unshared cache times were obtained by binding the threads of the program to alternate processors. Thread one was bound to processor one, thread two was bound to processor three, and so on. The shared cache times were obtained by binding the threads to adjacent processors.
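Binding of this kind can be done with the AIX bindprocessor command or with the bindprocessor() system call. The following minimal sketch binds the calling process to a given logical processor; the wrapper function and its error handling are our own illustration, not part of the test programs used for these measurements:

#include <stdio.h>
#include <unistd.h>
#include <sys/processor.h>     /* bindprocessor(), BINDPROCESS */

/* Bind the calling process to logical CPU number 'cpu' (0, 1, 2, ...). */
int bind_self(int cpu)
{
   if (bindprocessor(BINDPROCESS, getpid(), cpu) != 0) {
      perror("bindprocessor");
      return -1;
   }
   return 0;
}

Individual kernel threads can be bound in a similar way by specifying BINDTHREAD and a kernel thread ID instead of BINDPROCESS and a process ID.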

The times are the average of three separate runs. Apart from the four-thread case, there is a clear benefit from the shared cache for this example. The benefit is most noticeable with two threads. With a larger number of threads, the amount of work done by each thread is reduced, so the overhead of running in parallel starts to dominate the time.

Further examples where the loop appeared similar to the following were also run:

!$OMP PARALLEL DO PRIVATE(s,j), SHARED(a,b,s1), &
!$OMP& SCHEDULE(STATIC,1)
      do i=1,m
        do j=1,n
          s=s+a(i,j)*b(i,j)
        end do
        s1(i)=s
      end do
!$OMP END PARALLEL DO

These did not show any difference in speed between shared and dedicated L2 cache configurations.

We also tested two examples where we compared performance of two processes accessing a shared cache line where the cache line was in a single L2 cache or moved between L2 caches.

The test programs we used forked child processes, which then bound themselves to specific processors. Each process acquired a semaphore, updated a counter, and released the semaphore. The code kernel has the following form (the loop count is illustrative):

for (i=0;i<loop_count;i++) {
   msem_lock(p_shared_sem,0);      /* acquire the semaphore */
   (p_shared->counter)++;          /* update the shared counter */
   msem_unlock(p_shared_sem,0);    /* release the semaphore */
}

The msem_ routines are part of the AIX libsys.a library. They implement an atomic lock using the lwarx instruction.

In the first example (Table 7-6), the counter and semaphore were in the same cache line.

Table 7-6   Counter and semaphore sharing a cache line

  Case                                         Time [s]
  Single process (no sharing)                    3.36
  Two processes, L2 cache shared                 8.95
  Two processes, L2 caches on same MCM          13.31
  Two processes, L2 caches on separate MCMs     13.40

In the second example (Table 7-7), the counter and semaphore were in separate cache lines.

Table 7-7   Counter and semaphore in separate cache lines

  Case                                         Time [s]   Ratio to shared counter/semaphore
  Single process (no sharing)                    3.34       0.99
  Two processes, L2 cache shared                 8.85       0.98
  Two processes, L2 caches on same MCM          12.75       0.96
  Two processes, L2 caches on separate MCMs     12.82       0.95

As expected, there is a significant performance benefit when two processes share data in the L2 cache. There is also a benefit in separating data structures and the semaphores that control them.

The previous code example makes use of a sleeping semaphore. In addition, the amount of work done on the cache line is relatively small. We created a second example using spin/wait instead of sleeping semaphores and increased the amount of work on the cache line. The semaphore and the shared data structure were in separate cache lines.

The relevant code segments are:

struct shared_data {
   long long n;
   int counter;
   int i;
} *p_shared;

msemaphore *p_shared_sem;
....
for (i=0;i<loop_count;i++) {
   /* spin/wait acquisition (reconstructed): with MSEM_IF_NOWAIT,
      msem_lock returns instead of sleeping, so we retry in a loop */
   while (msem_lock(p_shared_sem,MSEM_IF_NOWAIT) != 0)
      ;
   (p_shared->counter)++;
   p_shared->n=1;
   /* now do some real work */
   for (j=0;j<200;j++) {
      p_shared->n = p_shared->n * (p_shared->n +1);
   }
   msem_unlock(p_shared_sem,0);
}

We observed the results provided in Table 7-8.

Table 7-8   Heavily used shared cache line performance

  Case                                              Time [s]
  Single process (no sharing)                         38.72
  Two processes, L2 cache shared                      75.91
  Two processes, L2 caches on same MCM                84.57
  Two processes, L2 caches on different MCMs          84.51
  Four processes on two chips on the same MCM        143.27
  Four processes, one on each chip on an MCM         156.03
  Four processes, each on a different MCM            156.12

The shared L2 cache enables two processes to share the workload very efficiently. Two processes run in twice the time of one process. When the cache is not shared there is an 18 percent penalty. This is independent of whether or not the processes run on the same MCM.

We also examined the case where four processes ran on two chips, that is, four processors sharing two L2 caches, and compared this with unshared L2 caches. We see a small benefit in sharing the L2 cache, but once the L2 cache is not shared, the impact of running on or off the same MCM is the same as for two processes. When testing four processes, we observed that the run times of the individual processes varied (results above are averages for two or four processes). We saw that three processes would run in approximately the same time and one would run in approximately 60 to 75 percent of the time of the others. We were not able to investigate this effect for this document. We assume it is either an artifact of the operating system scheduler or an error in the test program. Note that this effect was not observed in the two-process test runs.


Chapter 8. Application performance and throughput

This chapter examines the system performance achievable from running multiple copies of a program or programs compared to a single copy of a program. On a multiple processor machine (or node), throughput issues include:

- Processor utilization
  Oversubscribing processors (for example, 12 concurrent jobs on an 8-processor machine) when running processor-bound jobs usually does not increase total processor time significantly, as measured by user processor time and system processor time, since the operating system efficiently schedules the jobs to run. Processor-bound jobs are programs that are not bottlenecked by any other major system resource.

- Memory bandwidth utilization
  A 32-processor pSeries 690 Turbo has a very high aggregate memory bandwidth of approximately 200 GB/s. For many workloads, this is sufficient to sustain 32 concurrent processes with a performance per process close to that obtained if the processes were to run standalone. However, a standalone process is capable of driving the memory bandwidth at a far greater rate than 1/32nd of the total. As detailed in 8.4.3, “Memory stress effects on throughput” on page 162, on a 32-way pSeries 690 Turbo or a 16-way pSeries 690 HPC, a standalone application can use approximately 1/8th of the total bandwidth. This is a significant benefit of the pSeries 690 design in cases where such memory-stressing applications can be run in a mixed workload together with low memory stress applications. However, if 32 copies of a job that does use maximum bandwidth are run concurrently on a pSeries 690 Turbo, each job will necessarily take at least four times as long as a standalone job. For 16 processes on a 16-way pSeries 690 HPC, it would be at least twice as long. Most applications, when run standalone, use far less than the maximum bandwidth, and there are many techniques, such as blocking, available for reducing the extent of memory stress. Some of these techniques are described in 3.1.4, “Tuning for the memory subsystem” on page 34. Nevertheless, applications that, when run standalone, use more than their proportionate share of the total bandwidth will necessarily run more slowly when every processor is loaded with them. For such workloads, the pSeries 690 HPC is likely to be a more appropriate configuration than a pSeries 690 Turbo.

- Shared L2 cache
  On a pSeries 690 Model 681 or pSeries 690 Turbo, two processors on a single chip share the 1440 KB L2 cache. When two similar jobs are running on the same chip, they can effectively utilize only half of the L2 cache and half of the L2 to L3 cache bandwidth, so some performance degradation can be anticipated. In a pSeries 690 HPC, only one processor accesses each L2 cache, which implies more predictable behavior.

- I/O channels
  When a program has high I/O requirements, the I/O channels and subsystems often prove to be the performance bottleneck. When multiple copies of high I/O jobs are run, performance can seriously degrade unless attention is given to separating or hiding I/O transfers. See Section 6.3, “Modular I/O (MIO) library” on page 120, which describes one useful tool that can be used to hide I/O transfers.

The rest of this chapter shows some examples of throughput testing done on POWER4 pSeries 690 Model 681 systems.

8.1 Memory to memory copy

Figure 8-1 shows performance for a simple copy (a[i] = b[i]) for a range of array sizes and numbers of copies. Maximum throughput is achieved when the array fits into the L2 cache. The system was configured with two MCMs and four memory books.

Tuning by using methods such as

   a[i] = b[i] + zero*a[i];

does not make any significant difference to copy performance, whereas this technique was often beneficial on the POWER3.
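For reference, the two copy variants compared here have the following form in C; the array length and the zero variable are illustrative, and the timing harness is omitted:

#define N (1<<20)                       /* illustrative array length */
static double a[N], b[N];

/* plain copy loop */
void copy_plain(void)
{
   int i;
   for (i = 0; i < N; i++)
      a[i] = b[i];
}

/* copy with an artificial read of the target line: the zero term forces
   a[i] to be loaded before it is stored.  On POWER4 this makes no
   significant difference; on POWER3 it was often beneficial.            */
void copy_touch(void)
{
   volatile double zero = 0.0;
   int i;
   for (i = 0; i < N; i++)
      a[i] = b[i] + zero*a[i];
}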

Figure 8-1 Memory copy performance

Figure 8-2 on page 156 shows the corresponding performance when using the C library memcpy() function. Performance is lower than in the case above because the copy loop uses 8-byte loads and stores (Fortran REAL*8), whereas the memcpy() function loads and stores bytes up to a word boundary, then loads and stores words (4 bytes), and completes the copy with bytes if required.

Figure 8-2 C library memcpy performance

The memory subsystem provides a reasonably linear response to increasing processor load, as shown in Table 8-1.

Table 8-1   Memory copy performance relative to one CPU

  Array size   2 CPUs   4 CPUs   6 CPUs   8 CPUs
  16 KB          2.0      3.9      6.0      8.0
  32 KB          2.0      4.0      6.0      8.0
  128 KB         2.0      4.0      6.1      8.1
  1 MB           1.8      3.2      4.4      5.3
  2 MB           1.9      3.2      4.5      5.4
  4 MB           1.9      3.3      4.7      5.6
  8 MB           1.9      3.2      4.7      5.6

8.2 Memory bandwidth limited throughput

In contrast to the performance described in the previous section, this section describes a throughput test that deliberately stresses the total system memory bandwidth. The program computes the dot product of two REAL*8 arrays of length N. For this throughput test, N was chosen to be 110000000 to ensure that most of the data would not be resident in L3 cache.
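A minimal sketch of such a dot-product kernel is shown below; the test program uses REAL*8 (Fortran) arrays, so this C version, its initialization, and the bandwidth calculation are illustrative only:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 110000000L                    /* array length used in the test */

int main(void)
{
   double *a, *b, sum = 0.0, secs;
   clock_t t0;
   long i;

   a = malloc(N * sizeof(double));
   b = malloc(N * sizeof(double));
   if (a == NULL || b == NULL) return 1;

   for (i = 0; i < N; i++) {            /* initialize; also touches every page */
      a[i] = 1.0;
      b[i] = 2.0;
   }

   t0 = clock();
   for (i = 0; i < N; i++)
      sum += a[i] * b[i];
   secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

   /* two 8-byte operands are loaded per iteration */
   printf("sum=%f  rate=%.2f GB/s\n", sum, 16.0 * (double)N / secs / 1.0e9);
   return 0;
}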

A single copy of this program achieved a 2.3 GB/s transfer rate on a 1.3 GHz processor. When eight copies of this job were run on a two-MCM, eight-processor pSeries 690 HPC, the aggregate data transfer rate was 11.3 GB/s, a speedup of 4.9. The aggregate transfer rates for job counts of 1 through 8 are shown in Figure 8-3.

Figure 8-3 System memory throughput for pSeries 690 HPC

Chapter 8. Application performance and throughput 157 A second throughput test using the same program was run on a 32-way four MCM,1.3 GHz pSeries 690 Turbo with 96 GB of real memory. This system had 512 MB of shared L3 cache so the array sizes were increased to 310 million double-precision elements up through 16 processors to ensure most of the data was not in L3 cache. For 32 copies of the program, the array sizes were reduced to 150 million double-precision elements, to prevent the programs from exceeding the real memory on the system. The aggregate system throughput rates are shown in Figure 8-4.

Figure 8-4 System memory throughput on pSeries 690 Turbo

The shared L2 cache and non-shared L2 cache throughput rates for 2 through 16 jobs were obtained using the AIX bindprocessor command and related system calls. For the shared L2 cache runs, pairs of jobs were bound to processors on the same POWER4 chip. For the non-shared L2 cache runs, at most one job was bound to any POWER4 chip. The non-shared L2 cache performance would be nearly identical to the performance of a pSeries 690 HPC system. As expected, when two jobs share the L2 cache, the system throughput decreased.

It should be noted that this program relies on hardware prefetch streams. The performance of the prefetch streams is highly dependent on the size of memory pages. At the time this document was written, only 4 KB pages were available in AIX 5L. Large page sizes, which will be available in early 2002, are expected to significantly increase the single job and multiple job throughput for these examples.

8.3 MPI parallel on pSeries 690 and SP

This section describes a hydrodynamics benchmark application called Hydra, from AWE, Aldermaston in the UK. It is included as an example to illustrate the comparative performance of the pSeries 690 Model 681 and the RS/6000 SP 375 MHz POWER3 SMP High Node, both with respect to absolute performance and parallel scalability. The RS/6000 SP 375 MHz POWER3 SMP High Node is a shared memory unit that contains 16 POWER3-II processors running at a frequency of 375 MHz.

Hydra is written in Fortran with MPI message passing and scales well to at least 512 processors for large problems. It does not use OpenMP or similar paradigms.

The results for two test cases, a medium one called 2mm and a large one called 1mm are shown in Table 8-2. All MPI communication is shared memory with the single exception of the 32-way RS/6000 SP 375 MHz POWER3 SMP High Node case where user space MPI (EUILIB=us) was used over the IBM Switch2 connecting two SP nodes.

For this application, conclusions that can be drawn include:
- Up to 32-way parallel, a 1.3 GHz pSeries 690 Model 681 system is between 2.1 and 3.1 times faster than the same number of RS/6000 SP 375 MHz POWER3 SMP High Node processors.
- Scalability characteristics of a single pSeries 690 Model 681 system are similar to those of an RS/6000 SP 375 MHz POWER3 SMP High Node.

Table 8-2   MPI performance results for AWE Hydra code

                      2mm test case                       1mm test case
  Parallelism   Elapsed    Parallel   Ratio      Elapsed        Parallel      Ratio
                seconds    speedup    over NH2   seconds        speedup       over NH2
                                                                (over 4-way run)

  16-processor RS/6000 SP 375 MHz POWER3 SMP High Nodes
  Serial         8776.4      1          1        not measured    N/A           1
  2-way          4598.9      1.91       1        not measured    N/A           1
  4-way          2331.1      3.76       1        28645            1            1
  8-way          1286.2      6.82       1        14065            2.04         1
  16-way          754.6     11.63       1         8033            3.57         1
  32-way          388.5     22.59       1         3913            7.32         1
  64-way          229.0     38.32       1         2080           13.77         1
  128-way         141.2     62.16       1         1101           26.02         1

  8-processor pSeries 690 HPC, results normalized to 1.3 GHz
  Serial         3297.5      1          2.66     not measured    not measured  not measured
  2-way          1701.6      1.93       2.70     not measured    not measured  not measured
  4-way           926.9      3.56       2.51     11349            1            2.52
  8-way           537.9      6.13       2.39      6307            1.80         2.23

  32-processor pSeries 690 Model 681, 1.3 GHz
  4-way           752.3      3.56*      3.10     not measured    not measured  not measured
  8-way           468.9      5.71*      2.74     not measured    not measured  not measured
  16-way          272.1      9.84*      2.77     not measured    not measured  not measured
  32-way          160.5     16.68*      2.42      1821            N/A          2.15

  * Assuming the 4-way speedup is the same as for the pSeries 690 HPC, that is, 3.56.

8.4 Multiple job throughput

This section discusses the extent to which the total execution times for different types of jobs increase when multiple jobs are run concurrently. Examples are given of two jobs that only lightly stress the I/O and memory subsystems and hence give excellent throughput scaling. Then results are shown from an artificial job that was adjusted to provide varying degrees of memory subsystem stress.

We have not been able to investigate throughput effects for I/O intensive jobs. This is a very important subject area, but only a very limited I/O configuration was available to us on the pSeries 690 Model 681 systems we tested. Results from any I/O intensive applications would, therefore, have been unrealistic.

8.4.1 ESSL DGEMM throughput performance
Multiple copies of DGEMM from ESSL (see Section 6.1, “The ESSL and Parallel ESSL libraries” on page 114) were run together on a 32-way pSeries 690 Turbo. Each job multiplied matrices of 5000x5000 REAL*8 numbers, which require 600 MB of memory for the three arrays involved. However, because ESSL blocks the code to achieve good memory locality, and because matrix multiply involves a high ratio of computation to memory access, almost no slowdown was seen when multiple copies were run.

Table 8-3 lists the performance of the jobs in GFLOPS. Any slowdown due to running multiple copies would be evidenced by decreasing values for GFLOPS as the number of jobs increases. However, the multiple job slowdown is very small in all cases, being only 11 percent when 32 concurrent jobs were run on a 32-way pSeries 690 Turbo.

Table 8-3   Effects of running multiple copies of DGEMM

  Number of jobs   GFLOPS per job (5000x5000 REAL*8 matrices)   Slowdown ratio to single job
  1.3 GHz 32-way pSeries 690 Turbo
        1                       3.417                                      1
        8                       3.338                                      1.02
       16                       3.253                                      1.05
       24                       3.167                                      1.08
       32                       3.079                                      1.11

As can be calculated from Table 8-3, 32 copies of the same program achieve a total performance rate of 98.5 GFLOPS.

8.4.2 Multiple ABAQUS/Explicit job streams
ABAQUS/Explicit is a commercially available structural analysis code from HKS Inc. of Pawtucket, Rhode Island. It uses an explicit (rather than implicit) solution technique and, therefore, does not perform heavy I/O or memory access operations. The jobs run were HKS’s seven standard timing tests, t1-exp through t7-exp, and the time reported is the total elapsed time to run all seven.

In addition to running a single stream of jobs, four and then eight sets of the seven timing jobs were run concurrently on an eight-processor 1.3 GHz pSeries 690 HPC (different from the 1.1 GHz machine used for most of the other measurements in this publication). The times for the three runs are shown in Table 8-4 and are the total elapsed seconds to complete all jobs.

These results show excellent throughput scaling from the pSeries 690 HPC for this application. The 8-stream run, using all processors, takes only 2 percent longer than a single-stream run.

Table 8-4   Multiple ABAQUS/Explicit job stream times

  Number of job streams   Elapsed seconds   Slowdown ratio to single stream
           1                   2439                      1
           4                   2459                      1.01
           8                   2488                      1.02

8.4.3 Memory stress effects on throughput
Compared with the previous sections that showed jobs with excellent total throughput, this section describes a worst case example of a job that is designed to stress memory as much as possible. Most production applications will stress memory significantly less than this. As will be explained, this study demonstrates the benefits of the pSeries 690 HPC models for high-memory stress applications.

A simple program was used consisting of repeated calls to a subroutine that executed the statement A(I)=B(I)+C(I)*D(I) in a loop. This code stresses memory in much the same way as the dot-product test reported in Section 8.2, “Memory bandwidth limited throughput” on page 157. The results presented here are consistent with those in that section but are presented in a way that focuses on the total throughput obtained by running multiple copies of the job.

Figure 8-5, Figure 8-6, and Figure 8-7 show the interactions between a number of jobs plotted as a function of the total amount of memory accessed by each program. Results are shown for a 16-way RS/6000 SP 375 MHz POWER3 SMP High Node, an 8-way pSeries 690 HPC, and a 32-processor pSeries 690 Model 681. First, these figures are discussed individually, and then some overall conclusions are drawn.

Figure 8-5 Job throughput effects on a 375 MHz POWER3 SMP High Node

Figure 8-6 Job throughput effects on an eight-way pSeries 690 HPC

Figure 8-7 Job throughput effects on a 32-way pSeries 690 Turbo

- 16-way RS/6000 SP 375 MHz POWER3 SMP High Node
  Each processor has a local 8 MB L2 cache. The times for multiple jobs start to exceed the single job time when the memory accessed by each job approaches this value.
- pSeries 690 HPC and pSeries 690 Model 681
  On the pSeries 690 HPC, each processor has its own local L2 cache, whereas on the pSeries 690 Turbo, even/odd pairs of processors share a local L2 cache. To illustrate the effect of this, the 8 and 16 job runs on the pSeries 690 Turbo were done in two ways. The shared L2 cache runs were done with the jobs bound sequentially to processors. The non-shared L2 cache runs were done with the jobs bound only to even-numbered processors, so that no two jobs ever shared a cache. The non-shared runs are expected to behave in the same way as a pSeries 690 HPC, and it can be seen that the 8 jobs, non-shared L2 graph in Figure 8-7 (pSeries 690 Turbo) is very similar to the 8 jobs graph in Figure 8-6 (pSeries 690 HPC).

Conclusions from the graphs
The following conclusions may be drawn from the graphs:
- The graphs for the pSeries 690 Model 681 are more complicated than those for the RS/6000 SP 375 MHz POWER3 SMP High Node because of the presence of the Level 3 cache.
- As a consequence of the pSeries 690 design, in which each processor has a powerful data prefetch engine plus full access to the L3 cache and memory bandwidth on its MCM, it takes relatively few jobs as stressful as this to consume the system's resources. This design feature provides the maximum opportunity for a mixture of jobs with arbitrary resource demands to achieve the best possible system throughput. Once a system resource such as L3 cache or memory bandwidth is fully consumed, however, adding more jobs to the system will not improve overall system throughput. This effect is seen on the pSeries 690 Model 681 around the point where each program accesses around 100 MB of data. This happens because of the shared L3 cache (256 MB on the two-MCM pSeries 690 HPC and 512 MB on the four-MCM pSeries 690 Turbo). Multiple jobs can be slowed down by spilling out of L3 cache, necessitating additional accesses to the main memory subsystem.
- A similar shared-L2 cache effect can be seen on the three shared L2 lines for the pSeries 690 Turbo (Figure 8-7) around the point where the jobs access around 750 KB of memory and spill out of L2. The pSeries 690 HPC-like lines do not show any such effect.
- Throughput in the worst case region, where the programs are working wholly outside any cache, shows job times of approximately four times single job times for 32 jobs on a 32-way pSeries 690 Turbo, approximately two times for 16 jobs on a 16-way pSeries 690 HPC, and approximately 1.5 times for 8 jobs on an 8-way pSeries 690 HPC.
- In general, throughput is expected to improve for this example with a future update to AIX 5L in which pages are allocated from memory attached to the MCM where the process is running, thus minimizing MCM-to-MCM traffic (see 3.2.2, “Memory configurations” on page 53).
- The benefit of the pSeries 690 HPC design over the pSeries 690 Turbo for a memory-stressing job mix is clear.

8.4.4 Shared L2 cache and logical partitioning (LPAR)
FIRE is a commercially available computational fluid dynamics (CFD) analysis code from AVL List GmbH, Austria (http://www.avl.com). There are optimized versions of FIRE available for scalar, vector, and parallel (shared and distributed memory) architectures. For the following study, the SMP Version V7.3, compiled with XLF 6.1 for the POWER3 platform, was used.

AVL provides several standard test cases for benchmark purposes. In the following, the test cases water (water cooling jacket; 284,000 cells) and ext3d (external flow; 711,000 cells) are investigated. A sequential job needs about 300 MB (water) or 620 MB (ext3d) of memory, respectively. FIRE is a memory bandwidth demanding application. The time for I/O is negligible.

The following machines were used, which differ not only in clock frequency but also in memory speed and microcode level:

  hpc     8-CPU pSeries 690 Model 681 HPC at 1.1 GHz (memory at 328 MHz)
  turbo   32-CPU pSeries 690 Model 681 Turbo at 1.3 GHz (memory at 400 MHz)
  lpar    8-CPU pSeries 690 Model 681 HPC at 1.0 GHz (memory at 400 MHz; a development system with two logical partitions of four processors each)

Performance impact of shared versus non-shared L2 cache
The performance impact of a shared L2 cache can be studied by binding the two threads of a two-CPU parallel job to two processors that either belong to the same POWER4 chip or to different chips. It turns out that the shared cache has very little impact on performance with respect to FIRE. Table 8-5 contains the job execution times on a pSeries 690 Model 681 Turbo.

Table 8-5   FIRE benchmark: Impact of shared versus non-shared L2 cache

  Elapsed time [s]                            water    ext3d
  Sequential (pSeries 690 Turbo)              228.8    525.6
  Two-CPU -- shared (pSeries 690 Turbo)       127.1    303.7
  Two-CPU -- not shared (pSeries 690 Turbo)   125.9    297.8

Impact of partitioning on single job performance
Logical partitioning is expected to introduce a little overhead on memory access. However, it is possible to distribute a throughput workload across several LPARs in order to isolate single jobs or groups of jobs. Whether throughput performance can benefit from partitioning depends on how the physical resources are mapped onto the different LPARs.

For the following benchmark, a pSeries 690 Model 681 HPC system was divided into two LPARs. Each LPAR consists of one MCM (four CPUs). Only the MCM’s local memory was assigned to its LPAR (this configuration is not supported through the standard Hardware Management Console functions; a system reset after an LPAR reconfiguration was required, and the setup was achieved by trial and error). Results for a single sequential job running in an LPAR are presented in Table 8-6. Timings normalized to 1.3 GHz are given in parentheses. Note that partitioning does not degrade single job performance. A slight benefit was even observed, which might be within the bounds of experimental error. The benefit is likely due to the chosen memory affinity. The ratio between clock frequency and memory frequency also biases the measurement.

Table 8-6   FIRE benchmark: Uniprocessor, single job versus partitioning

  Elapsed time [s]                  water            ext3d
  No LPAR (pSeries 690 HPC)         260.1 (220.1)    627.0 (530.5)
  LPAR 1 (64-bit kernel)            274.9 (211.5)    648.8 (499.1)
  LPAR 2 (32-bit kernel)            283.0 (217.7)    659.2 (507.1)

Impact of partitioning on throughput performance
Running eight sequential FIRE jobs on an eight-way machine is the throughput scenario that puts the most stress on the memory subsystem. For the particular setup of this benchmark, it is observed that partitioning can reduce the interference between different processes of a throughput workload and therefore improve the throughput performance. The results are presented in Table 8-7. Timings normalized to 1.3 GHz are given in parentheses.

Table 8-7   FIRE benchmark: Throughput performance versus partitioning

  Elapsed time [s]                              water            ext3d
  Single job, no LPAR (pSeries 690 Turbo)       228.8             525.6
  Single job, no LPAR (pSeries 690 HPC)         260.1 (220.1)     627.0 (530.5)
  Eight jobs, no LPAR (pSeries 690 HPC)         452.8 (383.1)    1004.3 (849.8)
  Four jobs using LPAR 2, LPAR 1 idle           415.9 (319.9)     905.6 (696.6)
  Four jobs using LPAR 1                        409.5 (315.0)     919.7 (707.5)
  Four jobs using LPAR 2                        421.8 (324.5)     926.4 (712.6)

8.5 Genetic sequencing program

A genetic sequencing program was run on a number of systems, including POWER4, to determine relative performance. The program is written in C and comprises a mixture of floating-point arithmetic, character manipulation, and file I/O. Table 8-8 lists performance results on the different systems. Use of POWER4-specific optimization provides a noticeable benefit compared to -qarch=com.

Table 8-8   Performance on different systems

  System and compiler flags                            Elapsed time
  POWER3 -O3 (375 MHz)                                 22m 21s
  S80 -O3 (450 MHz)                                    45m 22s
  POWER4 -O3 -qarch=com (1.3 GHz HPC)                  11m 42s
  POWER4 -O3 -qarch=pwr4 -qtune=pwr4 (1.3 GHz HPC)     10m 42s

8.6 FASTA genetic sequencing program

The FASTA program suite provides a number of utilities for local sequence alignment of DNA or protein sequences against corresponding sequence databases. The fasta utility uses a fast, heuristic algorithm. The ssearch utility uses a Smith-Waterman algorithm. Comparison tests for two well-known sequences, arp_arath (536AA) and metr_salty (276AA), were run against the Swiss-Prot Release 39 database using both algorithms. Note that these utilities do significant amounts of I/O. The sequence database is approximately 250 MB. Table 8-9 provides a single-processor performance comparison against POWER3.

Table 8-9   Relative performance of FASTA utilities

  Sequence (utility)        POWER3    POWER4    Speedup
  arp_arath (fasta)          26.47     15.54      1.7
  arp_arath (ssearch)       453.72    300.30      1.5
  metr_salty (fasta)         20.84     12.09      1.7
  metr_salty (ssearch)      230.45    153.03      1.5

8.7 BLAST genetic sequencing program

BLAST (Basic Local Alignment Search Tool) is a suite of applications for searching DNA sequence databases. The BLAST algorithm makes pairwise comparisons of sequences, seeking regions of local similarity rather than optimal global alignment. BLAST 2.2.1 can perform gapped or ungapped alignments.

  blastn    DNA sequence queries are performed against DNA sequence databases.
  tblastn   Protein sequence queries are performed against a DNA sequence database dynamically translated in all six reading frames.

As with the FASTA tests, the BLAST programs perform varying and typically significant amounts of I/O. Relative performance on POWER3 and POWER4 for blastn and tblastn is provided in Table 8-10 and Table 8-11, respectively.

Table 8-10   Blastn results

  Query         POWER3    POWER4    Ratio
  nt.2655203      180        81      2.2
  nt.3283410       70        31      2.3
  nt.5764416       10         4      2.5

Table 8-11   Tblastn results

  Query         POWER3    POWER4    Ratio
  nt.1177466      170        68      2.5
  nt.129295        63        23      2.7
  nt.231729        95        35      2.7

Related publications

The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this redbook.

IBM Redbooks

For information on ordering these publications, see “How to get IBM Redbooks” on page 173.
- RS/6000 Scientific and Technical Computing: POWER3 Introduction and Tuning Guide, SG24-5155
- Scientific Applications in RS/6000 SP Environments, SG24-5611
- AIX 5L Performance Tools Handbook, SG24-6039

Other resources
These publications are also relevant as further information sources:
- IBM RISC System/6000 Technology, SA23-2619
- XL Fortran for AIX User’s Guide, SC09-2866
- XL Fortran for AIX Language Reference, SC09-2867
- Optimization and Tuning Guide for Fortran, C, and C++, SC09-1705
- Accelerating AIX by Rudy Chukran, Addison-Wesley, 1998
- AIX Performance Tuning by Frank Waters, Prentice-Hall, 1996
- You can access all of the AIX documentation through the Internet at the following URL: http://www.ibm.com/servers/aix/library

  The following types of documentation are located on the documentation CD that ships with the AIX operating system:
  – User guides
  – System management guides
  – Application programmer guides
  – All commands reference volumes
  – Files reference
  – Technical reference volumes used by application programmers

Referenced Web sites

These Web sites are also relevant as further information sources:
- AIX and RS/6000 SP manuals: http://www.ibm.com/servers/aix/library/techpubs.html
- MIO library: http://www.research.ibm.com/actc/Opt_Lib/mio/mio_doc.htm
- Watson Sparse Matrix Package (WSMP): http://www.cs.umn.edu/~agupta/wsmp.html
- AIX Bonus Pack: http://www.ibm.com/servers/aix/products/bonuspack
- CFD (computational fluid dynamics) application FIRE: http://www.avl.com
- sPPM: http://www.llnl.gov/asci_benchmarks/asci/limited/ppm/sppm_readme.html
- BLAS: http://www.netlib.org/blas
- LAPACK: http://www.netlib.org/lapack
- ATLAS: http://math-atlas.sourceforge.net
- MASS: http://www.rs6000.ibm.com/resource/technology/MASS
- ESSL: http://www-1.ibm.com/servers/eserver/pseries/software/sp/essl.html
- HKS Abaqus: http://www.abaqus.com
- SPEC: http://www.specbench.org
- TPC: http://www.tpc.org
- STREAM: http://www.cs.virginia.edu/stream

- NAS: http://www.nas.nasa.gov//NAS/NPB

How to get IBM Redbooks

Search for additional Redbooks or Redpieces, view, download, or order hardcopy from the Redbooks Web site: ibm.com/redbooks

Also download additional materials (code samples or diskette/CD-ROM images) from this Redbooks site.

Redpieces are Redbooks in progress; not all Redbooks become Redpieces and sometimes just a few chapters will be published this way. The intent is to get the information out much quicker than the formal publishing process allows.

IBM Redbooks collections Redbooks are also available on CD-ROMs. Click the CD-ROMs button on the Redbooks Web site for information about all the CD-ROMs offered, as well as updates and formats.

Special notices

References in this publication to IBM products, programs or services do not imply that IBM intends to make these available in all countries in which IBM operates. Any reference to an IBM product, program, or service is not intended to state or imply that only IBM's product, program, or service may be used. Any functionally equivalent program that does not infringe any of IBM's intellectual property rights may be used instead of the IBM product, program or service.

Information in this book was developed in conjunction with use of the equipment specified, and is limited in application to those specific hardware and software products and levels.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to the IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact IBM Corporation, Dept. 600A, Mail Drop 1329, Somers, NY 10589 USA.

Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS. The information about non-IBM ("vendor") products in this manual has been supplied by the vendor and IBM assumes no responsibility for its accuracy or completeness. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

Any pointers in this publication to external Web sites are provided for convenience only and do not in any manner serve as an endorsement of these Web sites.

Any performance data contained in this document was determined in a controlled environment, and therefore, the results that may be obtained in other operating environments may vary significantly. Users of this document should verify the applicable data for their specific environment.

This document contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples contain the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

Reference to PTF numbers that have not been released through the normal distribution process does not imply general availability. The purpose of including these reference numbers is to alert IBM customers to specific information relative to the implementation of the PTF when it becomes available to each customer according to the normal IBM PTF distribution process.

The following terms are trademarks of other companies:

C-bus is a trademark of Corollary, Inc. in the United States and/or other countries.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and/or other countries.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States and/or other countries.

PC Direct is a trademark of Ziff Communications Company in the United States and/or other countries and is used by IBM Corporation under license.

ActionMedia, LANDesk, MMX, Pentium and ProShare are trademarks of Intel Corporation in the United States and/or other countries.

UNIX is a registered trademark in the United States and other countries licensed exclusively through The Open Group.

SET, SET Secure Electronic Transaction, and the SET Logo are trademarks owned by SET Secure Electronic Transaction LLC.

Other company, product, and service names may be trademarks or service marks of others.

Abbreviations and acronyms

GAMGPRMASS ABI Application Binary Interface CFD Computational Fluid AFPA Adaptive Fast Path Dynamics Architecture CGE Common Graphics AIX Advanced Interactive Environment Executive CHRP Common Hardware ANSI American National Standards Reference Platform Institute CIU Core Interface Unit APAR Authorized Program Analysis CLIO/S Client Input/Output Sockets Report CMOS Complimentary Metal-Oxide API Application Programming Semiconductor Interface CPU ASCI Accelerated Strategic CWS Control Workstation Computing Initiative D-Cache Data Cache ASCII American National Standards Code for Information DAD Duplicate Address Detection Interchange DAS Dual Attach Station ASR Address Space Register DASD Direct Access Storage Device AUI Attached Unit Interface DFL Divide Float BCT Branch on Count DIMM Dual In-Line Memory Module BIST Built-In Self-Test DIP Direct Insertion Probe BLAS Basic Linear Algebra DMA Subprograms DOE Department of Energy BOS Base Operating System DOI Domain of Interpretation CAD Computer-Aided Design DPCL Dynamic Probe Class Library CAE Computer-Aided Engineering DRAM Dynamic Random Access CAM Computer-Aided Memory Manufacturing DSA Dynamic Segment Allocation CATIA Computer-Graphics Aided DSE Diagnostic System Exerciser Three-Dimensional Interactive Application DSU Data Service Unit CDLI Common Data Link Interface DTE Data Terminating Equipment CD-R CD Recordable DW Data Warehouse CE EA Effective Address CEC Central Electronics Complex EC Engineering Change

© Copyright IBM Corp. 2001 177 ECC Error Checking and GAMESS General Atomic and Correcting Molecular Electronic Structure EEPROM Electrically Erasable System Programmable Read Only GCT Global Completion Table Memory GFLOPS Billion FLoating-point EFI Extensible Firmware Interface Operations Per Second EHD Extended Hardware Drivers GPFS General Parallel File System EIA Electronic Industries GPR General-Purpose Register Association HACWS High Availability Control EISA Extended Industry Standard Workstation Architecture HiPS High Performance Switch ELF Executable and Linking HiPS LC-8 Low-Cost Eight-Port High Format Performance Switch EPOW Environmental and Power HPF High Performance Fortran Warning HPSSDL High Performance ERAT Effective-to-Real Address Systems Ta ble Development Laboratory ERRM Event Response resource Hz Hertz manager I-CACHE Instruction Cache ESID Effective Segment ID I/O Input/Output ESSL Engineering and Scientific Subroutine Library I2C Inter Integrated-Circuit Communications ETML Extract, Transformation, Movement and Loading IA Intel Architecture F/C Feature Code IAR Instruction Address Register F/W Fast and Wide IBM International Business Machines FBC Fabric Bus Controller ID Identification FDPR Feedback Directed Program Restructuring IDE Integrated Device Electronics FIFO First In/First Out IDS Intelligent Decision Server FLIH First Level Interrupt Handler IEEE Institute of Electrical and Electronics Engineers FMA Floating-point Multiply/Add operation IETF Internet Engineering Task Force FPR Floating-Point Register IFAR Instruction Fetch Address FPU Floating-Point Unit Register FRCA Fast Response Cache IHV Independent Hardware Architecture Vendor FRU Field Replaceable Unit IM Input Method

178 POWER4 Processor Introduction and Tuning Guide INRIA Institut National de Recherche MDI Media Dependent Interface en Informatique et en MES Miscellaneous Equipment Automatique Specification IPL Initial Program Load MFLOPS Million FLoating-point IRQ Interrupt Request Operations Per Second ISA Industry Standard MII Media Independent Interface Architecture, Instruction Set MIP Mixed-Integer Programming Architecture MLD Merged Logic DRAM ISO International Organization for Standardization MLR1 Multi-Channel Linear Recording 1 ISV Independent Software Vendor MODS Memory Overlay Detection ITSO International Technical Subsystem Support Organization MP Multiprocessor L1 Level 1 MPI Message Passing Interface L2 Level 2 MPP Massively Parallel Processing L3 Level 3 MPS Mathematical Programming LAPI Low-Level Application System Programming Interface MST Machine State LED Light Emitting Diode NTF No Trouble Found LFD Load Float Double NUMA Non-Uniform Memory Access LID Load ID NUS Numerical Aerodynamic LLNL Lawrence Livermore National Simulation Laboratory NVRAM Non-Volatile Random Access LMQ Load Miss Queue Memory LP Linear Programming NWP Numerical Weather Prediction LP64 Long-Pointer 64 OACK Option Acknowledgment LPP Licensed Program Product OCS Online Customer Support LRQ Load Reorder Queue ODBC Open DataBase Connectivity LRU Least Recently Used OEM Original Equipment MASS Mathematical Acceleration Manufacturer Subsystem OLAP Online Analytical Processing MAU Multiple Access Unit OLTP Online Transaction Mbps Megabits Per Second Processing MBps Megabytes Per Second OSL Optimization Subroutine MCAD Mechanical Computer-Aided Library Design OSLP Parallel Optimization MCM Multi-chip Module Subroutine Library NCU Non-Cacheable Unit P2SC POWER2 Single/Super Chip

Abbreviations and acronyms 179 PBLAS Parallel Basic Linear Algebra RAS Reliability, Availability, and Subprograms Serviceability PCI Peripheral Component RFC Request for Comments Interconnect RIO Remote I/O PDT Paging Device Table RIPL Remote Initial Program Load PDU Power Distribution Unit RISC Reduced Instruction-Set PE Parallel Environment Computer PEDB Parallel Environment ROLTP Relative Online Transaction Debugging Processing PHB Processor Host Bridge RPA RS/6000 Platform PHY Physical Layer Architecture PID Process ID RPL Remote Program Loader PII Program Integrated RPM Package Manager Information RSC RISC Single Chip PIOFS Parallel Input Output File RSCT Reliable Scalable Cluster System Technology PMU Performance Monitoring Unit RSE Register Stack Engine POE Parallel Operating RSVP Resource Reservation Environment Protocol POSIX Portable Operating Interface RTC Real-Time Clock for Computing Environments SAR Solutions Assurance Review POST Power-On Self-test SAS Single Attach Station POWER Performance Optimization ScaLAPACK Scalable Linear Algebra with Enhanced Risc Package (Architecture) SCB Segment Control Block PPC PowerPC SCO Santa Cruz Operations PPM Piecewise Parabolic Method SDQ Store Data Queue PSE Portable Streams Environment SDRAM Synchronous Dynamic Random Access Memory PSSP Parallel System Support Program SEPBU Scalable Electrical Power Base Unit PTF Program Temporary Fix SGI Silicon Graphics Incorporated PTPE Performance Toolbox Parallel Extensions SHLAP Shared Library Assistant Process PVC Permanent Virtual Circuit SID Segment ID QP Quadratic Programming SIT Simple Internet Transition RAM Random Access Memory SLB Segment Look-aside Buffer, RAN Remote Asynchronous Node Server Load Balancing

180 POWER4 Processor Introduction and Tuning Guide SLIH Second Level Interrupt SYNC Synchronization Handler TCE Translate Control Entry SLR1 Single-Channel Linear Tcl Tool Command Language Recording 1 TCQ Tagged Command Queuing SM Session Management TGT Ticket Granting Ticket SMB Server Message Block TLB Translation Lookaside Buffer SMI System Memory Interface TOS Type Of Service SMP Symmetric Multiprocessor TPC Transaction Processing SOI Silicon-on-Insulator Council SP Service Processor TPP Toward Peak Performance SP IBM RS/6000 Scalable TTL Time To Live POWER Parallel Systems UDI Uniform Device Interface SP Service Processor UIL User Interface Language SPCN System Power Control Network ULS Universal Language Support SPEC System Performance UP Uniprocessor Evaluation Cooperative USLA User-Space Loader Assistant SPM System Performance UTF UCS Transformation Format Measurement UTM Uniform Transfer Model SPR Special Purpose Register UTP Unshielded Twisted Pair SPS SP Switch VA Virtual Address SPS-8 Eight-Port SP Switch VESA Video Electronics Standards SRQ Store Reorder Queue Association SRN Service Request Number VFB Virtual Frame Buffer SSA Serial Storage Architecture VHDCI Very High Density Cable SSC System Support Controller Interconnect SSQ Store Slice Queue VLAN Virtual Local Area Network SSL Secure Socket Layer VMM Virtual Memory Manager STE Segment Table Entry VP Virtual Processor STFDU Store Float Double with VPD Vital Product Data Update VPN Virtual Private Network STP Shielded Twisted Pair VSD Virtual Shared Disk STQ Store Queue VT Visualization Tool SUID Set User ID XCOFF Extended Common Object SUP Software Update Protocol File Format SVC Switch Virtual Circuit XLF XL Fortran SVC Supervisor or System Call

Index

Fortran Symbols element order 95 70 subscripts 96 ASCI Numerics benchmark 136 32-bit assembler 41, 85 large page support 60 documentation 88 64-bit instructions 85 performance, integer 91 standard instructions 80 ASSERT 82 A asynchronous I/O 108, 120 ATLAS 114 ABAQUS/Explicit 161 automatic parallelization 130 addi instruction 9 AVL addic instruction 92 FIRE 165 address AWE 159 effective 57 real 57 translation 56 B virtual 57 bandwidth 148 affinity 64-bit 92 memory 53, 60 barrier 144 AGEN cycle 14 PThreads 140 AIX binding 5.1 58 process, to a processor 149 AIXTHREAD_SCOPE 127 bindprocessor command 158 ALLOCATABLE BLAS 113, 115 Fortran 89 BLAST 169 application blocking 38 FIRE 165 bosboot command 67 sPPM 136 branch prediction 12 application tuning buffered I/O 109, 120 memory 34 built-in self test 7 numerically intensive 26 applications large page 60 C C argument array order in memory 34 by reference, by value 89 arrays array element order 95 order in memory 34 compiler options 69 arrays directives C #pragma disjoint 94 element order 95 C/C++ dimension 100 virtual functions, performance 90



Back cover

The POWER4 Processor Introduction and Tuning Guide

This redbook is designed to familiarize you with the IBM eServer pSeries POWER4 microarchitecture and to provide you with the information necessary to exploit the new high-end servers based on this architecture.

The eight to 32-way symmetric multiprocessing (SMP) pSeries 690 Model 681 will be the first POWER4 system to be available. Thus, most analysis presented in this publication refers to this system.

Specifically, this publication will address the following issues:
- POWER4 features and capabilities
- Processor and memory optimization techniques, especially for Fortran programming
- AIX XL Fortran Version 7.1.1 compiler capabilities and which options to use
- Parallel processing techniques and performance
- Available libraries and programming interfaces
- Performance examples of commonly used kernels

While this publication is decidedly technical in nature, the fundamental concepts are presented from a user point of view and numerous examples are provided to reinforce these concepts.

INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION

BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE

IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers and Partners from around the world create timely technical information based on realistic scenarios. Specific recommendations are provided to help you implement IT solutions more effectively in your environment.

For more information:
ibm.com/redbooks

SG24-7041-00 ISBN 0738423556