Performance Optimization and Tuning Techniques for IBM Power Systems Processors Including IBM POWER8
Authors: Peter Bergner, Bernard King Smith, Brian Hall, Julian Wang, Alon Shalev Housfater, Suresh Warrier, Madhusudanan Kandasamy, David Wendt, Tulio Magno, Alex Mericas, Steve Munroe, Mauricio Oliveira, Bill Schmidt, Will Schmidt

IBM Redbooks
International Technical Support Organization
August 2015
SG24-8171-01

Note: Before using this information and the product it supports, read the information in “Notices” on page ix.

Second Edition (August 2015)

This edition pertains to IBM Power Systems servers based on IBM Power Systems processor-based technology, including but not limited to IBM POWER8 processor-based systems. Specific software levels and firmware levels that are used are noted throughout the text.

© Copyright International Business Machines Corporation 2014, 2015. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Notices
Trademarks
IBM Redbooks promotions
Preface
Authors
Now you can become a published author, too!
Comments welcome
Stay connected to IBM Redbooks
Summary of changes
August 2015, Second Edition

Chapter 1. Optimization and tuning on IBM POWER8 processor-based systems
1.1 Introduction
1.2 Outline of this guide
1.3 Conventions that are used in this guide
1.4 Background
1.5 Optimizing performance on POWER8 processor-based systems
1.5.1 Lightweight tuning and optimization guidelines
1.5.2 Deployment guidelines
1.5.3 Deep performance optimization guidelines

Chapter 2. The IBM POWER8 processor
2.1 Introduction to the POWER8 processor
2.2 Using POWER8 features
2.2.1 Multi-core and multi-thread
2.2.2 Multipage size support (page sizes (4 KB, 64 KB, 16 MB, and 16 GB))
2.2.3 Efficient use of cache and memory
2.2.4 Transactional memory
2.2.5 Vector Scalar eXtension
2.2.6 Decimal floating point
2.2.7 In-core cryptography and integrity enhancements
2.2.8 On-chip accelerators
2.2.9 Storage synchronization (sync, lwsync, lwarx, stwcx., and eieio)
2.2.10 Fixed-point load and store quadword instructions
2.2.11 Instruction fusion
2.2.12 Event-based branches (or user-level fast interrupts)
2.2.13 Power management and system performance
2.2.14 Coherent Accelerator Processor Interface
2.3 I/O adapter affinity
2.4 Related publications

Chapter 3. The IBM POWER Hypervisor
3.1 Introduction to PowerVM
3.2 Power Systems virtualization with PowerVM
3.2.1 Virtual processors
3.2.2 Page table sizes for LPARs
3.2.3 Placing LPAR resources to attain higher memory affinity
3.2.4 Active memory expansion
3.2.5 Optimizing resource placement: Dynamic Platform Optimizer
3.2.6 Partition compatibility mode
3.3 Introduction to KVM Virtualization
3.4 Related publications

Chapter 4. IBM AIX
4.1 Introduction
4.2 Using Power Architecture features with AIX
4.2.1 Multi-core and multi-thread
4.2.2 Multipage size support on AIX
4.2.3 Efficient use of cache
4.2.4 Transactional memory
4.2.5 Vector Scalar eXtension
4.2.6 Decimal floating point
4.2.7 On-chip encryption accelerator
4.3 AIX operating system-specific optimizations
4.3.1 Malloc
4.3.2 Pthread tunables
4.3.3 pollset
4.3.4 File system performance benefits
4.3.5 Direct I/O
4.3.6 Concurrent I/O
4.3.7 Asynchronous I/O
4.3.8 I/O completion ports
4.3.9 shmat versus mmap
4.3.10 Large segment tunable aliasing (LSA)
4.3.11 64-bit versus 32-bit ABIs
4.3.12 Sleep and wake-up primitives (thread_wait and thread_post)
4.3.13 Shared versus private loads
4.3.14 Workload partition shared licensed program installations