Analysis and Improvement of Performance

Analysis and Improvement of Performance and Power Consumption of Chip Multi- Threading SMP Architectures By: Ryan Eric Grant A thesis submitted to the Department of Electrical and Computer Engineering in conformity with the requirements for the degree of Master of Science (Engineering) Queen’s University Kingston, Ontario, Canada August, 2007 Copyright Ryan E. Grant, 2007 Abstract Emerging processor technologies are becoming commercially available that make multi-processor capabilities affordable for use in a large number of computer systems. Increasing power consumption by this next generation of processors is a growing concern as the cost of operating such systems continues to increase. It is important to understand the characteristics of these emerging technologies in order to enhance their performance. By understanding the characteristics of high performance computing workloads on real systems, the overall efficiency with which such workloads are executed can be increased. In addition, it is important to determine the best trade-off between system performance and power consumption using the variety of system configurations that are possible with these new technologies. This thesis seeks to provide a comprehensive presentation of the performance characteristics of several real commercially available simultaneous-multithreading multi- processor architectures and provide recommendations to improve overall system performance. As well, it will provide solutions to reduce the power consumption of such systems while minimizing the performance impact of these techniques on the system. The results of the research conducted show that the new scheduler proposed in this thesis is capable of providing significant increases in efficiency for traditional and emerging multi-processor technologies. These findings are confirmed using real system performance and power measurements. -i- Acknowledgements I would like to first thank my supervisor Dr. Ahmad Afsahi for his valuable feedback and support in conducting the research for this thesis and his assistance in assembling this document. Without his help this work would not have been possible. Many thanks to the office staff of the ECE Department here at Queen’s, particularly Ms. Debie Frasier and Ms. Bernice Ison for their invaluable help in administrative matters during the tenure of my Master’s degree. To my co-workers in the Parallel Processing Research Laboratory, Reza Zamani, Mohammad Rashti, and Ying Qian, thank you for your continued support over the last two years. Finally, my deepest appreciation goes to my immediate and extended family for their support throughout the years. A special and heartfelt thanks goes to my loving and understanding wife Meagan without whom this would not be possible. -ii- Table of Contents Abstract ......................................................................................................................................i Acknowledgements ...................................................................................................................ii Table of Contents .................................................................................................................... iii List of Tables.............................................................................................................................v List of Figures...........................................................................................................................vi Glossary..................................................................................................................................viii Chapter 1: Introduction..............................................................................................................1 1.1 Motivation................................................................................................................. 3 1.2 Contributions ............................................................................................................ 4 1.3 Outline ...................................................................................................................... 6 Chapter 2: Background..............................................................................................................7 2.1 Parallel Computing Architecture .............................................................................. 7 2.1.1 Prior Art ............................................................................................................. 9 2.2 Programming Shared Memory Parallel Applications............................................. 10 2.2.1 Prior Art ........................................................................................................... 12 2.3 System and Application Characterization............................................................... 13 2.3.1 Prior Art ........................................................................................................... 14 2.4 Scheduling and Operating System Noise................................................................ 15 2.4.1 Prior Art ........................................................................................................... 16 2.5 System Power Consumption................................................................................... 17 2.5.1 Power Metrics .................................................................................................. 18 2.5.2 Prior Art ........................................................................................................... 19 2.6 Summary................................................................................................................. 20 Chapter 3: Application Specifications.....................................................................................22 3.1 NAS Parallel Benchmarks ...................................................................................... 22 3.1.1 NAS Simulated CFD Applications .................................................................. 22 3.1.1.1 BT ............................................................................................................. 22 3.1.1.2 LU ............................................................................................................. 23 3.1.1.3 SP .............................................................................................................. 23 3.1.2 NAS Kernel Applications ................................................................................ 23 3.1.2.1 MG ............................................................................................................ 23 3.1.2.2 CG............................................................................................................. 24 3.1.3 SPEC OpenMP Benchmark Suite.................................................................... 24 Chapter 4: Workload Characteristics of SMT-Capable SMPs ................................................26 4.1 Experimental Setup................................................................................................. 26 4.1.1 Terminology..................................................................................................... 27 4.1.2 Application Characterization ........................................................................... 28 4.2 Experimental Results and Analysis ........................................................................ 29 4.2.1 EPCC................................................................................................................ 29 4.2.1.1 OpenMP Synchronization......................................................................... 30 4.2.1.2 OpenMP Scheduling ................................................................................. 31 4.2.2 Application Performance and Analysis ........................................................... 32 4.2.2.1 Trace Cache Analysis ............................................................................... 34 4.3 Summary................................................................................................................. 36 Chapter 5: Characterization and Analysis of Multi-Core SMPs .............................................38 -iii- 5.1 Experimental Methodology .................................................................................... 39 5.1.1 Terminology..................................................................................................... 40 5.2 Single Application Results...................................................................................... 42 5.2.1 Cache Performance .......................................................................................... 42 5.2.1.1 TLB Performance...................................................................................... 45 5.2.2 Stalled Operation ............................................................................................. 45 5.2.3 Branch Prediction............................................................................................. 46 5.2.4 Bus Transactions.............................................................................................. 47 5.2.5 Cycles Per Instruction...................................................................................... 47 5.2.6 Wall Clock Performance.................................................................................. 48 5.3 Multi-Application Results......................................................................................

Analysis and Improvement of Performance

SQL Server 2005 on the IBM Eserver Xseries 460 Enterprise Server

Support for NUMA Hardware in Helenos

A Novel Hard-Soft Processor Affinity Scheduling for Multicore Architecture Using Multiagents

Agent Based Load Balancing Scheme Using Affinity Processor Scheduling for Multicore Architectures

Optimizing Linux* for Dual-Core AMD Opteron™ Processors Optimizing Linux for Dual-Core AMD Opteron Processors

Analysis of Application Sensitivity to System Performance Variability in a Dynamic Task Based Runtime

Introduction to Openmp

White Paper Processor Affinity Multiple CPU Scheduling

Affinity Mask Recommended Value: 0 to Carry out Multitasking, Microsoft Windows 2000 and Windows Server 2003 Sometimes Move Process Threads Among Different Processors

Red Hat Enterprise Linux 7 Performance Tuning Guide

Nighttune User's Guide

Multi-Core Object-Oriented Real-Time Pipeline