<<

UCAM-CL-TR-619 Technical Report ISSN 1476-2986

Number 619

Computer Laboratory

Operating system support for simultaneous multithreaded processors

James R. Bulpin

February 2005

15 JJ Thomson Avenue
Cambridge CB3 0FD
United Kingdom
phone +44 1223 763500

http://www.cl.cam.ac.uk/

© 2005 James R. Bulpin

This technical report is based on a dissertation submitted September 2004 by the author for the degree of Doctor of Philosophy to the University of Cambridge, King’s College.

Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet:

http://www.cl.cam.ac.uk/TechReports/

ISSN 1476-2986

Summary

Simultaneous multithreaded (SMT) processors are able to execute multiple application threads in parallel in order to improve the utilisation of the processor's resources. The improved utilisation provides a higher processor-wide throughput at the expense of the performance of each individual thread.

Simultaneous multithreading has recently been incorporated into the Intel Pentium 4 processor family as "Hyper-Threading". While there is already support for it in popular operating systems, that support does not take advantage of any knowledge about the characteristics of SMT, and therefore does not fully exploit the processor.

SMT presents a number of challenges to operating system designers. The threads' dynamic sharing of processor resources means that there are complex performance interactions between threads. These interactions are often unknown, poorly understood, or hard to avoid. As a result such interactions tend to be ignored, leading to a lower processor throughput.

In this dissertation I start by describing simultaneous multithreading and the hardware implementations of it. I discuss areas of operating system support that are either necessary or desirable.

I present a detailed study of a real SMT processor, the Intel Hyper-Threaded Pentium 4, and describe the performance interactions between threads. I analyse the results using information from the processor’s performance monitoring hardware.

Building on the understanding of the processor’s operation gained from the analysis, I present a design for an operating system scheduler that takes into account the characteristics of the processor and the workloads in order to improve the system-wide throughput. I evaluate designs exploiting various levels of processor-specific knowledge.

I finish by discussing alternative ways to exploit SMT processors. These include the partitioning of applications and hardware interrupt handling onto separate simultaneous threads. I present preliminary experiments to evaluate the effectiveness of this technique.

Acknowledgements

I would like to thank my supervisor, Ian Pratt, for his advice and guidance. I am also grateful to my colleagues in the Systems Research Group and local research labs for their friendship, support and many interesting discussions. In particular thanks are due to Jon Crowcroft, Tim Deegan, Keir Fraser, Steve Hand, James Hall, Tim Harris, Richard Mortier, Rolf Neugebauer, Dave Stewart and Andrew Warfield.

I would like to thank Keir Fraser, Ian Pratt and Tim Harris for proof reading earlier copies of this dissertation. Any errors that remain are my own.

I spent a summer as an intern at Microsoft Research Cambridge. I would like to thank my mentor, Rebecca Isaacs, and the others involved with the "Magpie" project for an interesting and enjoyable break from my own research.

Thanks are due to Intel Research Cambridge for their kind donation of a "Prescott" based computer I used for some of the experiments in Chapter 3.

My funding was from a CASE award from Marconi Corporation plc. and the Engineering and Physical Sciences Research Council. I would also like to thank my current employers for their patience and support while I finished writing this dissertation.

Table of contents

Glossary

1 Introduction
  1.1 Motivation
  1.2 Terminology
  1.3 Contribution
  1.4 Outline

2 Background
  2.1 Simultaneous Multithreading
    2.1.1 Multithreading
    2.1.2 Research Background of SMT
  2.2 Commercial SMT Processors
    2.2.1 Alpha 21464
    2.2.2 Intel Hyper-Threading
    2.2.3 IBM Power5
  2.3 Operating System Support
    2.3.1 An Extra Level of Hierarchy
    2.3.2 Simultaneous Execution
    2.3.3 Considerations
    2.3.4 Energy Considerations
  2.4 Summary

3 Measuring SMT
  3.1 Related Work
    3.1.1 Simulated Systems
    3.1.2 Real Hardware
  3.2 Experimental Configurations
    3.2.1 Performance Counters
    3.2.2 Platforms
    3.2.3 Workloads
    3.2.4 Performance Metrics
  3.3 Thread Interactions
    3.3.1 Experimental Method
    3.3.2 Results
    3.3.3 Desktop Applications
    3.3.4 Summary
  3.4 Phases of Execution
    3.4.1 Performance Counter Correlation
  3.5 Asymmetry
  3.6 Summary

4 A Process Scheduler for SMT Processors
  4.1 The
    4.1.1 Problems with Traditional Schedulers
    4.1.2 for Multithreading and Systems
  4.2 Related Work
    4.2.1 Hardware Support
    4.2.2 SMT-Aware Scheduling
  4.3 Practical SMT-Aware Scheduling
    4.3.1 Design Space
  4.4 Implementation
    4.4.1 The Scheduler
    4.4.2 Extensible Scheduler Modifications
    4.4.3 Performance Estimation
    4.4.4 SMT-Aware Schedulers
  4.5 Evaluation
    4.5.1 Method
    4.5.2 Throughput
    4.5.3 Fairness
  4.6 Applicability to Other Operating Systems
  4.7 Applicability to Other Processors
  4.8 Summary

5 Alternatives to Multiprogramming
  5.1 A Multithreaded Processor as a Single Resource
    5.1.1 Program Parallelisation
  5.2 Threads for Speculation and Prefetching
    5.2.1 Speculation
    5.2.2 Pre-execution
    5.2.3 Multiple Execution
  5.3 Threads for Management and Monitoring
    5.3.1 Mini-Threads
    5.3.2 Subordinate Threads
  5.4 Fault Tolerance
  5.5 Operating System Functions
    5.5.1
    5.5.2 Privilege Level Partitioning
    5.5.3 Interrupt Handling Partitioning on Linux
  5.6 Summary

6 Conclusions
  6.1 Summary
  6.2 Further Research

References

A Monochrome Figures

Glossary

ALU    Arithmetic-Logic Unit
CMP    Chip Multiprocessor/Multiprocessing
DEC    Digital Equipment Corporation
D-TLB  Data Translation Lookaside Buffer
FP     Floating Point
HMT    Hardware Multithreading
IA32   Intel Architecture 32 bit
IBM    International Business Machines
ILP    Instruction Level Parallelism
I/O    Input/Output
IPC    Instructions Per Cycle
IQR    Inter-Quartile Range (of a distribution)
IRQ    Interrupt Request
ISA    Instruction Set Architecture
ISR    Interrupt Service Routine
I-TLB  Instruction Translation Lookaside Buffer
HP     Hewlett-Packard (bought Compaq, who bought DEC)
HT     Hyper-Threading
L1/L2  Level 1/Level 2 cache
LKM    Linux Kernel Module
MP     Multiprocessor/Multiprocessing
MSR    Model Specific Register (Intel)
NUMA   Non-uniform Memory Architecture
OoO    Out of Order (execution)
OS     Operating System
P4     Intel Pentium 4
PID    Process Identifier
RISC   Reduced Instruction Set Computer
ROM    Read-only Memory
SMP    Symmetric Multiprocessing
SMT    Simultaneous Multithreaded/Multithreading
TC     Trace-cache
TLB    Translation Lookaside Buffer
TLP    Thread Level Parallelism
TLS    Thread Level Speculation
UP     Uniprocessor/Uniprocessing
VM     Virtual Memory

Chapter 1

Introduction

This dissertation is concerned with the interaction of , particularly operating systems, and simultaneous multithreaded (SMT) processors. I measure how a selection of workloads perform on SMT processors and show how the scheduling of processes onto SMT threads can cause the performance to change. I demonstrate how improved knowledge of the characteristics of the processor can improve the way the operating system schedules tasks to run.

In this chapter I outline the ideas behind SMT processors and the difficulties they create. I summarise the contributions that are described in this dissertation. The chapter finishes with a summary of later chapters.

1.1 Motivation

The difference in the rate of growth between processor speed and memory speed has led to an increasing delay (relative to processor cycles) to access memory. This is a well known problem and there are a number of techniques in common use to work around it. While caches are very effective, the cost of a level 1 cache miss causing an access to a lower level cache can be many processor cycles. To try to minimise the effect of a cache miss on program execution, dynamic issue superscalar (or out-of-order, OoO) processors attempt to execute other instructions not dependent on the cache access while waiting for it to complete. Moreover, these processors have multiple execution units enabling multiple independent instructions to be executed in parallel. The efficacy of this approach depends on the amount of instruction level parallelism (ILP) available in the code being executed. If code exhibits low ILP then there are few opportunities for parallel execution and for finding sufficient work to perform while waiting for a cache miss.

Simultaneous multithreading (SMT) is an extension of dynamic issue superscalar processing. The idea is to increase the pool of instructions from which the processor can choose to execute. An SMT processor will fetch instructions from more than one thread at a time. Since there will be no data dependency between instructions from different threads (ignoring higher-level, infrequent occurrences with shared data), the number of independent instructions the processor can choose between will generally be greater than in the non-SMT processor. An added benefit is the greater diversity of instructions, making it more likely that there will be instructions available for an idle functional unit. Completed instructions are extracted separately for each thread so that architectural state can be updated in original program order, a process called retirement.

The execution core of the processor need not be aware that instructions come from different threads as it is only concerned with executing instructions whose operands are ready.

SMT aims to provide a higher system-wide throughput at the expense of per-thread performance. A thread running on an SMT processor will generally run slower than it would have done on a non-SMT processor because it has to share processor resources (including the caches). However, the combined throughput of simultaneously running threads should be higher than that of a single thread on a non-SMT processor due to the higher core utilisation. Therefore the time to execute a batch of programs using SMT should be less than running them sequentially on a non-SMT processor.

SMT can cause a number of problems for software:

• SMT threads have to share processor resources so will proceed at a speed that depends on the level and mix of demands placed on those resources by all the simultaneously running threads. This makes predicting and measuring processor allocation (such as in a real-time system) hard.

• The processor-wide performance depends on how well the threads share the processor resources. Typically it would be expected that a homogeneous mixture of threads would perform poorly and a heterogeneous mixture would do better. In order to get the best from the processor it is necessary to be careful when selecting threads to run simultaneously.

• The processor cache(s) are shared by all threads. Sets of threads with large data footprints risk thrashing the cache and reducing their simultaneous throughput to less than their non-SMT sequential throughput.

SMT processors are becoming common. The most widespread implementation is Intel's "Hyper-Threading", available on all variants of the Pentium 4 processor. IBM's latest addition to the "Power" family, the Power5, is an SMT processor. The increasing availability of SMT hardware has not been matched by changes in software. Most operating systems exploit the backwards compatible interfaces provided by Hyper-Threading, creating a "virtual multiprocessor" system. Current operating system support is limited to the minimum needed to be able to use SMT processors without suffering major performance problems.

Intel's implementation of SMT, "Hyper-Threading", presents the two logical processors in a physical processor package as separate, independent processors which, although easing the job of the operating system developer, can have unwelcome effects on performance.

1.2 Terminology

Current SMT implementations all present the simultaneous threads as logical processors. A logical processor appears to the applications and, to a certain degree, to the operating system much like a physical processor in a multiprocessor system. In this dissertation I use the terms logical processor, thread and Hyper-Thread (when specifically discussing Intel Hyper-Threading) interchangeably.

I use the terms physical processor or package to describe the physical processor package which presents itself as a set of logical processors.

Each logical processor can execute a thread in its own address space in the same way as multiple processors would (unlike some specialist multithreaded hardware where all threads must share an address space). The running threads are therefore heavyweight threads, or processes. Note that the term thread can be used to describe both an operating system thread and a hardware simultaneous thread; where the intended meaning is not clear from the context a more descriptive term is used.

1.3 Contribution

The thesis of this dissertation is that simultaneous multithreaded processors can be used more effectively with an operating system that is aware of their characteristics.

As there has only recently been an implementation of an SMT processor available, there have been very few studies of the practical issues in the use of SMT. I present a series of measurements of applications running on Intel’s Hyper-Threaded processor, looking at the mutual effects of concurrent execution. The main contributions from this work are:

• A detailed measurement study of a real simultaneous multithreaded processor.

• An examination of the fairness and mutual performance degradation of simultaneously executing threads, as well as the total speedup obtained.

• The use of processor hardware performance counters to examine the causes of the performance interactions observed and to estimate performance.

I describe how operating system functions, particularly the process scheduler, can be made aware of the characteristics of SMT processors in order to improve system performance. I propose, implement and evaluate a number of SMT-aware scheduling algorithms and measure their performance. This work contributes the following:

• A technique to estimate thread performance on an SMT processor using data from processor hardware performance counters.

• Process scheduling algorithms that are able to improve throughput and/or fairness for sets of threads executing on an SMT processor.

• Techniques to utilise SMT-specific knowledge while still respecting existing scheduler functions such as priorities and starvation avoidance.

1.4 Outline

The remainder of this dissertation is structured as follows.

In Chapter 2 I describe simultaneous multithreading in more detail and describe current and proposed implementations, including Intel Hyper-Threading, the first commercially available implementation of SMT and the platform used for the experiments described in this dissertation. I introduce and discuss areas of operating system support important for SMT processors.


Chapter 3 describes experimental work performed to assess the realised performance of the Intel Hyper-Threaded processor. The results are presented and discussed.

In Chapter 4 I describe a new scheduler suitable for use with SMT processors that is able to monitor the performance of running threads and improve the system throughput by making intelligent decisions about what to run on the same processor. I measure the performance of the scheduler and compare it to traditional schedulers.

In Chapter 5 I discuss other ways in which SMT processors may be used besides the quasi-SMP model discussed in the earlier chapters. I discuss the advantages and limitations of allocating a multithreaded processor as a single resource. I describe ways in which threads could be used non-symmetrically.

Finally Chapter 6 summarises and concludes the dissertation and highlights areas of further research.

Related work is described in the relevant chapters.

Chapter 2

Background

In this chapter I elaborate on the description of simultaneous multithreading in the previous chapter and describe research work on SMT hardware design. I describe commercial implementations of, and proposals for, SMT processors including Intel's Pentium 4 with Hyper-Threading and IBM's Power5. I go on to introduce and discuss those areas of operating system support that are important for SMT.

2.1 Simultaneous Multithreading

In this section I describe the basic architecture of a simultaneous multithreaded processor based on a dynamic issue superscalar processor.

Modern dynamic issue superscalar (or out-of-order, OoO) processors such as the Intel Pentium 4, AMD K7 and modern Sun SPARC and IBM Power processors are able to execute instructions in a data-flow manner, where an instruction is a candidate for execution once all of its operand register values are available. This technique allows for parallel execution of non-dependent instructions. Superscalar processors have multiple execution units such as arithmetic logic units (ALUs) and memory access units. The out-of-order nature means that non-dependent instructions beyond a memory load can be executed while that memory load completes. It is this mechanism that makes OoO processors more tolerant to level 1 cache misses, and to structural and data hazards, than in-order processors. Figure 2.1(a) shows a simplified example of the operation of a 4-way OoO processor. Each column represents a successive clock cycle; a blue square denotes an instruction occupying the execution unit for that cycle. There are completely empty columns because an earlier instruction has not completed (e.g. a memory load). Gaps elsewhere tend to be caused by insufficient parallelism in the code.

The processor keeps a pool, or window, of instructions from which it finds those instructions with satisfied dependencies for execution. The pool is filled by the fetch and decode stages of the pipeline and instructions are removed (retired) from the pool in the original program order once they have been executed. To be able to execute out-of-order and look ahead into the instruction stream requires that the processor can execute beyond branches. A branch predictor is used to decide on which direction to follow and execution proceeds speculatively from there with the predictions being checked once the outcome is known. Instructions are only retired when it is known they were on a correctly predicted path. Incorrectly speculated instructions are discarded and their results are not committed to the architectural state. The window can extend deep into the instruction stream of the program so will contain multiple, independent uses of the same numbered register. In order to disambiguate these different uses, the registers specified in the fetched instructions are renamed to be (temporally locally) unique [Tomasulo67, Hennessy03].

[Figure 2.1 (not reproduced): An illustration of the operation of out-of-order superscalar and SMT processors. Squares of each colour represent instructions from different threads occupying execution units. Panel (a) OoO Superscalar; panel (b) SMT; each panel plots occupancy of the execution units against time (cycles).]

The modifications required to make this architecture simultaneously multithreaded are fairly minor: because the processor execution core is only concerned with instructions and the data dependencies between them, it is possible to feed instructions from more than one thread into it. A dynamic-issue SMT processor will fetch instructions from multiple threads and rename the registers in each such that at any point in time the threads are using non-overlapped sets of physical processor registers. By doing this the core sees more non-dependent instructions which can increase the amount of parallelism that can be exploited. The fetching, decoding and register renaming of instructions can be implemented in per-thread hardware or can use the same hardware in an interleaved manner. The processor needs to keep architectural state, including the program counter, for each thread. This state is commonly referred to as a context. Figure 2.2 shows a simplified architecture for a dynamic-issue SMT processor.

[Figure 2.2 (not reproduced): A simplified 2-thread dynamic SMT architecture, showing the instruction cache, fetch and decode stages, per-thread program counters (PC1, PC2) for threads 1 and 2, instruction queues, registers, the schedule and execution units, memory access and retirement.]

Figure 2.1(b) shows how the modified OoO architecture can execute instructions from two threads simultaneously. Note that the instructions from the second thread are filling gaps in the first thread’s execution caused by both pipeline stalls (due to long latency instructions) and lack of instruction parallelism.

The execution resources of the processor are shared dynamically with a variable number of instructions from each thread (possibly including none) being issued in each cycle.

The processor caches, including the branch prediction cache and memory management unit translation lookaside buffers (TLBs), are shared by all threads. The method of sharing could be a purely dynamic scheme where the identity of the related thread is irrelevant to the cache location or eviction policy. The sharing of the data caches would be a typical example of this scheme. Alternatively a cache could be statically partitioned between threads. This scheme limits the amount of statistical multiplexing of demands onto the available resource but stops any one thread hogging the resource. A branch prediction cache or TLB could use such a scheme.

2.1.1 Multithreading

Simultaneous multithreading is just one of a number of different kinds of hardware multithreading. Traditionally, hardware multithreaded (HMT) processors have been able to execute multiple threads "concurrently" from the operating system and application point of view. HMT processors generally alternate between threads every cycle (interleaved multithreading - IMT) or switch between threads on some event, such as a cache miss (blocked multithreading - BMT). SMT processors are able to execute instructions from multiple threads within each cycle and are able to share the execution resources dynamically. There has also been a design for a multithreaded dataflow processor [Moore96].

There are a number of common characteristics between the different forms of multithreading, the sharing of the cache hierarchy being the most notable. SMT and IMT will place similar demands on the caches as they both share execution resource in a fine-grained manner so the stream of memory references will contain references from all threads. BMT architectures that context switch on (L2) cache misses will present the caches with a reference stream with a more coarse-grained thread alternation. The different techniques vary in their predictability: IMT's alternating execution is fairly predictable but BMT's event-induced context switching is less so. The dynamic nature of SMT's sharing of execution resource makes it harder to predict each thread's throughput, making it more like BMT in this respect. A full survey of the issues of non-SMT multithreaded processors is beyond the scope of this dissertation; however, ideas from this field are discussed where they are relevant to SMT. A detailed survey of hardware multithreaded processors is presented by Ungerer et al. [Ungerer03].

Existing forms of non-SMT hardware multithreading tend to be confined to the supercomputer/mainframe field (such as the Cray/Tera MTA [Alverson90]) or to specialised processors such as network processors. SMT is the first form of multithreading to become common on commodity desktop and server computers. Since desktop computers usually run different workloads to supercomputers and network processors, they present their own set of issues. Network processing is usually easy to parallelise with threads working on different packets or on different parts of a processing pipeline. Additionally, code is generally written specifically for a particular processor, with knowledge of its characteristics. Programmers targeting supercomputers put a lot of effort into making their programs highly parallel. On the desktop, the diversity of hardware and uses means that the workloads are also diverse with a mix of single-threaded and multi-threaded applications, the latter not generally being written with the constraints of SMT in mind. Multithreaded desktop workloads often possess a single dominant thread [Lee98].

2.1.2 Research Background of SMT

The history of SMT research can be broken into two distinct areas: architectures based on dynamic superscalars and those based on other architectures such as VLIW. The earliest work was in the non-superscalar field; the "MARS-M" system [Dorozhevets92] and the "Matsushita Media Research Laboratory processor" [Hirata92] are examples. The move towards superscalar-based designs started with the "Multistreamed superscalar processor" [Serrano94, Yamamoto95].

The model of SMT widely used in recent years [Tullsen98] is based on work carried out at the University of Washington. The Washington design was originally based on a static superscalar architecture [Tullsen95]. The design was evaluated against a single-threaded dynamic superscalar with the same issue width (number of instructions that can be executed per cycle) and found to outperform the dynamic superscalar (which itself outperformed an IMT-style design).

The Washington work moved towards a dynamic, out-of-order, superscalar design when it was found that such an architecture could be made to be multithreaded with only a small cost [Tullsen96b, Eggers97, Lo97b]. The design started with a high-performance out-of-order superscalar design similar in spirit to the MIPS R10000. To support multiple threads, multiple program counters were added; instructions were fetched from these on an interleaved basis. Structures that needed to be per-thread, such as the retirement and trap mechanisms, were duplicated. Thread tagging support was added to shared data structures, such as the branch target buffer, where the ownership of an entry was not made clear by the renaming of registers. The design supported eight threads, which placed a large demand on physical registers, therefore they opted to increase the size of the register file and pipeline its access.

20 At around the same time Loikkanen and Bagherzadeh presented results of simulation studies of a fine-grain multithreading processor [Loikkanen96]. Whilst not described as SMT their design had many of the characteristics of subsequent SMT processors including a shared (but partitioned) register file and dynamically shared reorder buffer and execution units.

2.2 Commercial SMT Processors

In this section I describe current and proposed commercial realisations of simultaneous multi- threaded processors. I concentrate on Intel Hyper-Threading as this is the most widely available and forms the basis for the work in the subsequent chapters of this dissertation. I describe the IBM Power5 with a focus on its microarchitectural differences from Hyper-Threading where these affect the interaction with software and the operating system. For completeness I briefly describe the proposed, but never built, Alpha 21464.

2.2.1 Alpha 21464

The University of Washington's SMT work was developed commercially in collaboration with the Alpha processor group at Digital Equipment Corporation. This union ultimately led Compaq (who had bought DEC by that time) to include SMT in their forthcoming Alpha 21464 (EV8) processor [Diefendorff99], which was cancelled before production.

The processor would have provided a 4-thread core with shared caches. Each thread was to be abstracted as a logically independent processor called a "thread processing unit". Almost all processor resources would have been dynamically shared with only the register allocation tables being duplicated per-thread. Compaq had planned to provide a multiple channel direct interface to RAMBUS memory in order to satisfy a 4-way SMT's significant memory bandwidth requirement. In his Microprocessor Report article, Diefendorff speculates that the processor could have had more than 3MB of on-chip L2 cache.

The extra hardware needed for SMT would have added 6% to the processor’s die size [Preston02].

2.2.2 Intel Hyper-Threading

Intel introduced SMT into the Pentium 4 [Hinton01] processor as "Hyper-Threading" [Intel01c, Marr02, Koufaty03]. Intel state that adding Hyper-Threading to the processor increased the chip size and peak power usage by approximately 5%. In common with the proposed Alpha, Hyper-Threading presents the threads as logically independent processors. In traditional Intel style, backwards compatibility was maintained by allowing operating systems to detect and use the logical processors in the same way as physical processors in a multiprocessor system, thereby allowing immediate use of the hardware by multiprocessor-capable operating systems. The first processor family to implement Hyper-Threading was the Pentium 4 Xeon, aimed at the workstation and server market. Later the technology was also incorporated into the desktop Pentium 4 [Krewell02] and subsequently the mobile version of the processor. The core processor in each of these variants is essentially the same; the main differences are with the non-core components such as caches.

The following description uses the term "Pentium 4" in the general sense to cover all variants.

The core of the Hyper-Threaded Pentium 4 is based on the Intel "NetBurst" microarchitecture of the original Pentium 4. In common with recent generations of Intel IA32 processors, the Pentium 4 dynamically translates program instructions into RISC-like micro-operations ("uops") which are natively executed by the processor core. A problem with the IA32 instruction set is that it contains variable length instructions. This means that successive instruction addresses are dependent: it is impossible to tell where the next instruction starts until some way into the decoding of the current instruction. Decoding is a stateful operation and sharing of the decoding resource is difficult without expensive duplication of state. For this reason Intel chose to multiplex the use of the decoding stages of the pipeline between threads at a several-instruction granularity. The decode bandwidth need not be shared evenly; if one thread is idle or stalled, the other thread can utilise the full bandwidth.

Decoded uops are cached in the trace cache (TC) which is used instead of a level 1 instruction cache. The set-associative TC is dynamically shared by the threads with each entry tagged with its thread. Uops are fetched from the TC (or the microcode ROM for particularly complex instructions) one thread at a time, alternating each cycle as necessary. Once again, an idle or stalled thread will allow the other thread to use the full uop fetch bandwidth available. The fetched uops are fed into a queue partitioned to give each thread half of the entries.

The out-of-order superscalar core takes uops from the queue. An allocator chooses instructions from the queue and allocates buffers to them. Fairness is provided by limiting the number of each buffer (re-order, load and store buffers) that a thread can have to half of the number of buffers available. Additionally if both threads have instructions in the queue then the allocator will alternate between threads’ uops each cycle. If a thread is stalled then the other thread will be given full allocate bandwidth but will still be limited to its share of buffers.

The architectural registers are then renamed to internal physical registers. Since each thread has a separate architectural register set, the register table tracking the architectural to physical register mappings needs to be duplicated for each thread.

The dataflow execution core need not be aware of which thread each instruction belongs to. However to ensure fairness the processor limits the number of entries in each scheduler queue that each thread may have. The Pentium 4 can issue up to six uops per cycle to the seven execution units (two integer ALUs, a floating-point move unit, an integer shift/rotate unit, a floating point unit, a load unit and a store unit). Completed uops are placed in the re-order buffer and are retired in program order. Retirement, like fetching, will alternate between threads but give all of its bandwidth to a thread if the other thread has no instructions to retire at that time.

The cache hierarchy is essentially physically addressed so can be used by both threads with no explicit tagging or partitioning. The data TLB is dynamically shared and each entry is tagged by logical processor. The instruction TLB is replicated for each thread but is small.

The way resources are shared (dynamically shared, statically partitioned or duplicated) is summarised in Table 2.1.


Area               Duplicated               Dynamically shared       Tagged or partitioned
Fetch              ITLB,                    Microcode ROM            Trace cache
                   streaming buffers
Branch prediction  Return stack buffer,                              Global history array
                   branch history buffer
Decode             State                    Logic                    Uop queue (partitioned)
Execute            Register rename          Instruction schedulers
Retirement                                                           Reorder buffer (≤ 50% per thread)
Memory                                      Caches                   DTLB

Table 2.1: Resource division on Hyper-Threaded P4 processors.

Feature          Northwood Xeon   Prescott
Pipeline stages  20               31
Store buffers    24               32
L1 data cache    8kB 4-way        16kB 8-way
L2 cache         512kB 8-way      1MB 8-way

Table 2.2: Northwood and Prescott Pentium 4 differences [Glaskowsky04].

The IA32 HLT instruction is used on single-threaded processors to put the processor into a low-power dormant mode which is exited on reception of an interrupt. This instruction is typically used in operating system idle tasks. With Hyper-Threading both logical processors need to be HLTed in order for the physical processor to go into a low power mode, but a single HLTed logical processor will go into the logical dormant state. In this scenario the processor will recombine partitioned resources so that the non-HLTed processor can use all available fetch, decode and execution bandwidth. The performance of the active thread should be the same as if Hyper-Threading was disabled or even non-existent.

The logical processor abstraction extends to the handling of interrupts. Each logical processor has its own interrupt controller and interrupts can be sent to each logical processor in exactly the same manner as in traditional multiprocessor systems.

2.2.2.1 Prescott

The "Prescott" Pentium 4 processor from Intel is the second implementation of Hyper-Threading. Table 2.2 summarises the key architectural differences from "Northwood", the first instantiation of a Hyper-Threaded processor described above. A number of the differences between the two cores suggest that the Prescott may be better suited to simultaneous multithreading than its predecessor. In particular the Prescott's larger, more associative L1 cache is likely to perform better than the smaller Northwood cache when the two threads are making many memory accesses. Intel also claim to have improved the control logic associated with Hyper-Threading. In Chapter 3 I measure both versions of the processor to evaluate the effectiveness of the Prescott changes.

Several microarchitectural enhancements have been made to the Prescott core [Boggs04]. Several entries have been added to the floating-point schedulers to improve floating-point performance under Hyper-Threading. The number of outstanding L1 data cache accesses has been doubled to 8. This has negligible effect on single-threaded performance but is useful for Hyper-Threading. Extra write-combining buffers were also added to complement the extra store buffers. These changes reduce the chance of a thread stalling when both threads are making a significant number of memory writes.

Intel have increased the available parallelism in the memory access circuits. The Northwood core serialised page table walks initiated by the two threads. This is problematic if one thread's walk causes a cache miss while the other thread requires a walk. The Prescott core can perform both walks in parallel.

The problem of one thread stalling in the core causing performance loss to the other thread, or resource wastage, is tackled by reducing the time for the trace cache to react and dedicate all of its resources to the non-stalled thread.

A number of new instructions have been added to the Prescott processor [Intel03]. Of particular interest is the MONITOR/MWAIT pair. In summary, MONITOR allows a region of memory of a size fixed by the implementation (believed to be a cache line) to be marked for use by subsequent issues of the MWAIT instruction. MWAIT causes the processor to block until the specified region of memory has been written to. The actual operation is a little more detailed. For example, MWAIT enters an "implementation-dependent optimized state" until one of a list of events occurs, including writes to the region of memory and interrupts. MONITOR and a loop containing MWAIT can be used as a reasonably lightweight inter-thread synchronisation mechanism.
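
As an illustration, the sketch below shows how MONITOR and a loop containing MWAIT might be combined into a simple wait primitive. It assumes kernel mode (as noted, the instructions are currently restricted to it), GCC-style x86 inline assembly and an illustrative wakeup_flag variable; it is not code from this dissertation or from Intel.

    /* A hedged sketch of a MONITOR/MWAIT wait loop (kernel mode, GCC x86
     * inline assembly). wakeup_flag is an illustrative assumption: it stands
     * for whatever location the waking thread writes to. */
    #include <stdint.h>

    static volatile uint32_t wakeup_flag;     /* written by the other thread */

    static inline void cpu_monitor(const volatile void *addr)
    {
        /* EAX = linear address to monitor, ECX = extensions, EDX = hints */
        asm volatile("monitor" :: "a"(addr), "c"(0), "d"(0));
    }

    static inline void cpu_mwait(void)
    {
        /* EAX = hints, ECX = extensions; both zero here */
        asm volatile("mwait" :: "a"(0), "c"(0));
    }

    /* Block until another thread writes to wakeup_flag. MWAIT may also
     * return early (e.g. on an interrupt), hence the enclosing loop. */
    static void wait_for_flag(void)
    {
        while (wakeup_flag == 0) {
            cpu_monitor(&wakeup_flag);
            if (wakeup_flag == 0)             /* re-check to close the race */
                cpu_mwait();
        }
    }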

2.2.3 IBM Power5

IBM's Power5 processor contains two processor cores each of which is capable of running two SMT threads [Kalla04]. Up to 32 packages can be built into a symmetric multiprocessing system. IBM claim that providing more than two threads per core is not justified due to the extra complexity outweighing the diminishing returns from the additional threads, particularly with cache thrashing.

In common with Intel’s Hyper-Threading the Power5 fetches from only one thread in a clock cycle with the instruction fetch stage alternating between the threads. The threads share the instruction cache and instruction TLB.

As with Hyper-Threading, the Power5 has per-thread call return prediction stacks as a shared stack would be of no value. The branch prediction state is entirely shared in the Power5 whereas Intel choose to share the main global history table (with thread-tagged entries) but give each thread its own branch history buffer.

Both the Power5 and Hyper-Threading have dedicated per-thread instruction fetch queues (called “streaming buffers” with Hyper-Threading). These queues hold instructions awaiting decode.

One of IBM's major advances over Intel's initial Hyper-Threading offering is the ability to assign priorities to threads. This function is implemented in the choice of which thread to decode instructions from in a given cycle.

This particular location in the pipeline is appropriate because it is the last stage where the threads have little interference with each other before instructions enter the shared core. If the throttling of threads to implement thread priority were performed at the fetch stage then very little would be gained, as the subsequent instruction queues are duplicated and building a shared one would save only a small amount of silicon. Allowing both threads to fetch as much as they can fit into the instruction queues allows instruction cache misses to be tolerated. The back pressure from the throttled decoding will have the desired effect and limit the fetch rate for the lower priority thread. Having instructions from a low priority thread ready to be decoded reduces the amount of time the execution units are idle if the high priority thread stalls.

Once a group of instructions from one of the threads has been chosen and routed through the decoding and register renaming stages, the instructions feed into issue queues for the eight execution units. The instructions are now independent of their threads, therefore the entire execution core is dynamically shared by the threads.

The main difference between Power5 and Intel Hyper-Threading in the decode and execute areas of the pipeline is that the Power5 uses decode for thread resource usage balancing while Intel places limits on access to shared execution resources such as the re-order buffer and the number of instructions from each thread that may be scheduled per cycle. The Intel approach is much more of a static partitioning of resource while the Power5 is more faithful to the dynamic model of the University of Washington.

The global completion table (GCT) records the completion of instructions in groups of up to 5 instructions from each thread. The GCT has 20 slots for such groups, shared between the threads. The Power5 can commit one group from each thread per cycle.

In common with Intel, IBM made some improvements to the caches when introducing multi- threading. Compared to the non-SMT Power4 the Power5 has doubled associativity on both level 1 caches (although they remain the same size at 64kB I and 32kB D) and introduces a fully associative D-TLB (still 128 entries).

To provide a good level of fairness between threads, something Hyper-Threading can have difficulty doing, the Power5 implements dynamic resource balancing. The logic monitors the GCT and load miss queues to determine if a thread is "hogging resources". A particular problem is a blocked dependency chain due to an L2 miss; this can cause instructions to back up in the issue queues preventing further dispatch, thereby slowing the other thread. The processor has a threshold of L2 misses which if reached causes the thread to be throttled. The mechanisms used for throttling depend upon the situation:

• reduce the thread's priority (used if there are too many GCT entries),

• inhibit the thread's decoding until congestion clears (used if more than the threshold number of L2 misses), and

• flush the thread's instructions that are waiting for dispatch, and suspend decoding for that thread until congestion clears (used if the thread is executing a long latency instruction such as a SYNCH).

Idle occupation of buffer slots by instructions from a thread stalled on L2 misses has been shown to be a problem in research work [Limousin01].

The adjustable thread priority implemented in the decode stage is used to vary the share of execution resources given to each thread. Eight levels are defined: 0 stops the thread completely and 1 to 7 are used for low to high priorities. The difference between the priorities of the two threads determines the degree of bias given to the decode rates and therefore the use of the execution resources. If both threads have a priority of 1 then the processor throttles back the decode rate in order to conserve power. Some priorities can be set by privileged instructions only, some from all privilege levels.

The processor can be run in a single-threaded mode in which all physical resources are given to the one thread. This mode can be entered either by disabling the second thread entirely (using the system BIOS) or by the operating system putting one thread into a dormant state. Intel Hyper-Threading supports equivalents of both scenarios.

2.3 Operating System Support

In this section I describe the various possible interactions between simultaneous multithreaded (SMT) processors and operating systems. I begin by describing the main differences between a uniprocessor (UP) or symmetric multiprocessor (SMP) system and an SMT system. I explain how features are abstracted to avoid the need to provide SMT-specific OS support but how the OS could improve performance if it is aware of the differences.

2.3.1 An Extra Level of Hierarchy

Abstracting threads in an SMT processor as logical processors is a convenient way to provide instant backwards compatibility; however, it introduces complications into the otherwise simple action of counting processors.

Many commercial operating systems are licensed for a particular number of processors. An interesting problem with building logical processors on top of physical processors is which level of the hierarchy should be counted for the license. If it is decided to count logical processors then a further complication is caused by the ability to disable threads. Should the licensing count be based on the number of threads enabled, or the total number possible? An argument for the latter case is that a check could be carried out at boot time and the threads re-enabled later. Hyper-Threading is now a standard feature on all new high-end Pentium 4 processors so if a system without SMT is required then the second logical processor on each package must be disabled. A license using the logical processor count is likely to be unfair in such a situation.

The two obvious choices for numbering the logical processors are:

1. number first by package and then by logical processor within each package,

2. number through the first logical processor in each package then through the second and so on.

The second method is useful in situations where the operating system is licensed for a particular number of processors and runs on the lowest numbered processors, ignoring the remainder. In the first numbering system this would cause entire physical packages to be ignored while logical processors compete on the active packages. The enumeration of processors can be performed by the OS or the BIOS. In the latter case it is important that the OS knows how the BIOS performed the enumeration.
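
As a concrete illustration of the two numbering schemes, the following C sketch (with hypothetical function names) maps a (package, logical processor) pair to a processor number under each method. For example, with two packages of two logical processors each, the first logical processor of the second package is processor 2 under the first method but processor 1 under the second.

    /* Method 1: number first by package, then by logical processor
     * within each package. */
    static int cpu_id_by_package(int pkg, int thr, int threads_per_pkg)
    {
        return pkg * threads_per_pkg + thr;
    }

    /* Method 2: number through the first logical processor in each
     * package, then through the second, and so on. */
    static int cpu_id_by_thread(int pkg, int thr, int num_pkgs)
    {
        return thr * num_pkgs + pkg;
    }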

When the Intel Pentium 4 Xeon with Hyper-Threading first came out, Microsoft Windows 2000 Server was unaware of the logical/physical processor distinction so used the BIOS enumeration for licensing purposes [Borozan02]. If the BIOS enumerated processors according to Intel specifications, with the first logical processor in each package counted first, then a 4 physical processor machine (each of two logical processors) running a version of Windows 2000 licensed for 4 processors would use the first logical processor on each package and leave the other ones idle. This is effectively the same as running on a 4-way non-Hyper-Threaded machine so no harm is done, although the extra benefit of the Hyper-Threading is not gained. A BIOS enumerating using the alternative method would cause Windows 2000 to use both logical processors on the first two packages, leaving two physical packages completely idle in the machine described above. The successor Windows version, Windows Server 2003, was able to distinguish between logical and physical processors and was licensed based on the physical package count.

The Linux kernel was made aware of Hyper-Threading from version 2.4.17. It enumerates the processors itself and uses the first method outlined above, numbering through all the logical processors in the first package then all in the second and so on. Later versions extended this support further by recording the hierarchy as a specific case of the OS's generic NUMA support.

Knowledge of the difference between logical and physical processors is useful for load-balancing and scheduling. In a scenario where a system has two physical packages each of two logical processors, and has two runnable processes, the scheduler has to decide which processors to use and which to leave idle. A scheduler unaware of the processor hierarchy may assign the tasks to the lowest numbered processors; using the Linux enumeration method these would be the two logical processors of the first package. The entire second package would be idle which would not give the highest system throughput. Although Linux 2.4.17 was “Hyper-Threaded aware” it exhibited this problem; later versions were able to support the logical/physical processor distinction.
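
The sketch below illustrates the kind of hierarchy-aware placement decision described above: an idle logical processor whose sibling is also idle is preferred over an idle logical processor sharing a package with a busy one. The data structures and function are hypothetical and do not reflect the Linux scheduler's actual implementation.

    /* Sketch of choosing a logical processor for a newly runnable task,
     * preferring a completely idle physical package. */
    #define MAX_CPUS 8

    struct logical_cpu {
        int online;
        int busy;       /* 1 if a task is currently running here */
        int sibling;    /* index of the other logical processor in the package */
    };

    static struct logical_cpu cpus[MAX_CPUS];

    int pick_cpu_for_task(void)
    {
        int fallback = -1;

        for (int i = 0; i < MAX_CPUS; i++) {
            if (!cpus[i].online || cpus[i].busy)
                continue;
            if (!cpus[cpus[i].sibling].busy)
                return i;              /* whole package idle: best choice */
            if (fallback < 0)
                fallback = i;          /* idle logical CPU on a busy package */
        }
        return fallback;               /* -1 means no idle logical processor */
    }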

A further problem of the logical processor abstraction arises when accessing per-package state such as when updating microcode and setting the memory type range registers (MTRRs) which are used to control caching for physical memory ranges. An operating system treating two logical processors as SMP may run into concurrency problems1. Linux now serialises these accesses.

1Current usage should not be a problem as Intel require all processors in an SMP system to have the MTRRs set identically [Intel01b], their initialisation is write-only and their writing is idempotent.

2.3.2 Simultaneous Execution

The threads, or logical processors, of an SMT processor dynamically share and compete for processor execution resources. As I show in Chapter 3, one thread always slows down the other (compared to it running alone on the physical processor) and the mutual effect of the threads on each other can be quite detrimental. Therefore care has to be taken to avoid situations where threads execute instructions which perform no useful computation, such as busy-waiting. The variable effect of threads on each other is important to scheduler design, particularly where fairness, quality of service or priorities are being used. I address the problem of fair and efficient scheduling for SMT in Chapter 4; in this section I describe the issues surrounding thread priorities.



Redstone considered a number of aspects of operating system-SMT interaction [Redstone02]. He used a full-system simulator to evaluate the performance of operating system functions on an SMT processor. Unlike earlier work, the experiments were designed to take into account those periods of an application’s execution which are executing inside the kernel. Redstone found that the greater level of data and code sharing within the kernel (compared to the user-space code) is beneficial to IO-bound applications such as the Apache server. He noted that such workloads suffer on non-SMT systems because of the constant crossing of privilege boundaries and changes of control-flow. Whilst these negative effects are still present on SMT processors the positive sharing effects go some way to countering them. Additionally the latency-hiding ability of SMT makes up for the remaining performance shortfall giving such workloads a good net speedup over a non-SMT processor.

2.3.2.1 Unproductive Execution

There are a number of circumstances where instructions are executed but no useful computation is performed. These include locks, idle loops and timing loops.

A spin lock is a simple mechanism for a thread to wait for a resource held by another thread to be released. Spin locks are typically implemented as a loop around code that tests the lock's status. Spin locks are commonly used due to their simplicity but have undesirable performance effects in many scenarios as they cause processing resource to be used with no useful work being done. Spin locks are an attractive implementation choice when the chance of contention for a lock is low. In this situation the common case of the lock being available entails a single test and a not-taken branch; a lock found to be held will entail costly spinning, but if this is rare then overall performance will be good.

On an SMT processor where one thread is holding a lock while another thread is spinning on it the shared execution resources of the processor mean that the spinning thread will have a performance impact on the lock-holding thread. Spinning is particularly resource hungry as the loop closing branch is easy for the processor to predict allowing the loop to be unrolled multiple times creating a large number of instructions to compete with the other thread(s) for execution resources.

There are a number of ways to avoid this problem.

• Avoid using spin locks; there exist a number of lock-free techniques [Greenwald96], most of which would be very beneficial in an SMT environment.

• Yield the processor while waiting on a lock; this would involve a costly entry to the scheduler but would prevent wasted processor time. Since most locks are only held for a short time, a compromise is to spin for a limited time and then yield.

• Implement locks as a processor primitive. Tullsen et al. describe hardware locks with explicit acquire and release instructions [Tullsen99]. When a thread tries to acquire a locked lock then the processor will cause all resources being used by that thread to be freed for use by the other threads. The thread is allowed to continue when the lock is released.

• Implement hardware support to reduce the performance cost of spinning. The Intel PAUSE instruction [Intel01a] can be inserted into the body of the spin lock loop. PAUSE is logically a no-operation instruction but causes an architecturally dependent delay in the issue of instructions (believed to correspond to the pipeline latency). Whilst this does not eliminate the cost of spinning it does reduce it (a sketch of such a loop is shown after this list). Intel's second generation Hyper-Threaded processor core, Prescott, introduced a new pair of instructions, MONITOR and MWAIT [Intel03]. MWAIT causes the thread to enter an "implementation-dependent optimized state" (which for a Hyper-Threaded processor should involve releasing held resources) until a region of memory specified with MONITOR has been written to by another thread. This mechanism can be used to build locks2.

• Use unmodified spin locks but provide hardware/OS support to limit their impact by lowering the (hardware) priority of the spinning thread. This is the method suggested by IBM for use on the Power5 processor [Kalla04].
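
The sketch below shows one way the PAUSE hint from the list above might be used in a test-and-test-and-set spin lock. It uses C11 atomics and GCC-style x86 inline assembly; the type and function names are illustrative and this is not the implementation of any particular operating system.

    /* Sketch of a spin lock that spins on a read-only test with PAUSE,
     * only attempting the atomic exchange when the lock looks free. */
    #include <stdatomic.h>

    typedef struct { atomic_int locked; } spinlock_t;

    static inline void cpu_pause(void)
    {
        asm volatile("pause");  /* logically a no-op; delays issue while spinning */
    }

    static void spin_lock(spinlock_t *l)
    {
        for (;;) {
            /* try to take the lock: 0 -> 1 */
            if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
                return;
            /* lock held: spin reading only, with PAUSE, to limit the impact
             * on the sibling hardware thread */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                cpu_pause();
        }
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }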

A further application of spinning is the idle loop for processors that currently have no processes scheduled on them. The idle loop repeatedly checks for any new or migrated processes which will cause control to be handed to the scheduler. Clearly such behaviour on a logical processor of an SMT physical processor will consume execution resource, thereby slowing the other logical processors. "Busy-waiting" idle loops are also problematic for non-SMT processors because the unnecessary use of resources consumes power and therefore generates heat and reduces battery life. An alternative to busy-waiting is to force the processor to enter an idle state until such time as there is something to do. An example is Intel's HLT instruction which causes the processor to run in a reduced power (on post-386 processors) dormant state until an interrupt is received. On a Hyper-Threaded processor HLT causes the logical processor to enter the same dormant state and all partitioned and shared resources used by that logical processor are given to the other logical processor [Marr02]. If the second logical processor also executes HLT then the processor enters a lower power state.
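
A minimal sketch of such a HLT-based idle loop is shown below, assuming kernel mode and GCC-style inline assembly; the need_resched flag and schedule() function are illustrative stand-ins for however the operating system flags pending work, not the actual Linux interfaces.

    /* Sketch of an idle loop that halts instead of busy-waiting. */
    extern volatile int need_resched;   /* set when there is work for this CPU */
    extern void schedule(void);         /* hands control back to the scheduler */

    void cpu_idle(void)
    {
        for (;;) {
            while (!need_resched) {
                /* Enable interrupts then halt until one arrives. On a
                 * Hyper-Threaded processor a halted logical processor also
                 * hands its partitioned resources to its sibling. */
                asm volatile("sti; hlt" ::: "memory");
            }
            schedule();
        }
    }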

Linux 2.6 makes use of the Prescott MWAIT instruction for the idle loop. The idle task waits on a part of the operating system data structure for that logical processor that is written to by scheduler functions executing on other logical processors. If the scheduler wishes to execute a process on a currently idle logical processor then it writes to this data structure, which the idling thread will notice. This mechanism suffers less overhead than a traditional inter-processor interrupt but still allows the yielding of resources that using HLT would provide.

In early operating systems and applications it was common to use "timing loops" to cause a short delay. Timing loops are still found in modern operating systems but tend to be restricted to initialisation functions. A timing loop is a simple loop, or set of nested loops, that performs no useful work. The number of iterations is chosen to cause the loop to take a particular length of time to execute.

2At present the use of these instructions is limited to kernel mode, however it was intended to make them available to user level threads and it is likely that future revisions of the processor will allow this.

of time to execute. A typical operating system use of timing loops would be to insert a delay between sending a command to, and reading a response from, a hardware device. Applications may use them to provide a delay between frames in an animation. Timing loops are undesirable in modern systems for a number of reasons.

• Knowledge of the clock speed is required to calculate the number of iterations. Different clock speeds on different and newer systems will change the loop timing.

• Modern superscalar systems execute multiple instructions per cycle. This must be taken into account when deciding the number of iterations.

• Timing loops waste processor time. This is particularly important in multitasking systems where that time could be more usefully given to another task.

• Timing loops in device drivers where interrupts are disabled are problematic since other interrupts may be arriving and awaiting servicing.

The first two problems can be avoided by calibrating the loop against a known time period at runtime (or boot time) on the actual processor. This technique is used by Linux for processors such as the Intel 386 and 486 that do not support more efficient delays.
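
A minimal sketch of such a calibration is given below; the function and variable names are illustrative and no claim is made that this matches the Linux implementation.

    #include <stdint.h>
    #include <time.h>

    static volatile uint64_t sink;    /* volatile keeps the loop from being optimised away */

    static void burn(uint64_t iterations)
    {
        for (uint64_t i = 0; i < iterations; i++)
            sink = i;
    }

    static uint64_t loops_per_ms;

    /* Calibrate the loop once, at start-up, against a known clock. */
    static void calibrate_delay(void)
    {
        const uint64_t trial = 10 * 1000 * 1000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        burn(trial);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        uint64_t ns = (uint64_t)(t1.tv_sec - t0.tv_sec) * 1000000000ull
                    + (uint64_t)(t1.tv_nsec - t0.tv_nsec);
        loops_per_ms = trial * 1000000ull / ns;
    }

    /* Delay for approximately "ms" milliseconds. */
    static void delay_ms(unsigned ms)
    {
        burn((uint64_t)ms * loops_per_ms);
    }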

Timing loops on SMT processors suffer the same problem as spin locks: they degrade other threads' performance while doing no useful computation. Additionally, the delay for a given number of iterations of the loop will vary depending on the resources being used by the other thread(s). In a similar manner to spin-locks, processor facilities such as the Intel PAUSE instruction can be used to reduce the impact of a timing loop.

Redstone performs a detailed analysis of the effect of spinning on a simulated SMT system [Redstone02]. He finds that one thread spinning uses resources that the other thread(s) could have used. In the particular microarchitecture simulated, the spinning thread uses more than its fair share of resources; in particular, the architecture favours threads that make better progress, which is what the spinning thread appears to do. He compares spinning with Tullsen's SMT locks and observes slowdowns of up to three times when using spinning compared to SMT locks.

2.3.2.2 Priorities

Operating systems generally support process priorities to influence scheduling. A common scheme is to assign a process some form of static base priority which is varied dynamically during execution to avoid starvation.

Take the scenario of a high priority single-threaded process H and a low priority process L, which are both compute-bound. The desired behaviour is that H gets the largest share of the processor time with L being scheduled around it. L will always end up with some fraction of the processor time due to the starvation-avoiding dynamic priority change. A uniprocessor system being preemptively timeshared between the processes will exhibit the desired behaviour because the scheduler will bias the ratio of processor time given to the processes. A multiprocessor system will provide an entire processor for each of H and L; this does not differentiate the priorities but mostly fulfils the requirements of allowing H to continue unhindered while L uses the remaining resources (the processes do compete for access to memory so L could still hinder

H if the processes have bandwidth requirements or are causing cache invalidations due to sharing memory). A single package, two thread SMT processor being used as two logical processors will have H and L executing on its two logical processors. The problem here is that both processes are dynamically sharing and competing for the processor resources. In an idealised SMT processor each would be eligible to get 50% of the processor resources. In practice it is possible that either process, regardless of its priority, would make better progress than the other due to the imperfect nature of resource sharing (see Chapter 3). The process priority is not taken into account since the multiprocessor scheduler, given two runnable processes and two logical processors, will always choose to run each process on an available processor.

The problem of extending the process priority into the SMT processor can be addressed in hardware or software. If the processor were to support a form of priority for the SMT threads (logical processors) then the above scenario could be easily accommodated. Hardware thread priorities can be implemented by imposing a ratio between threads of the number of instructions fetched (or decoded in the case of IBM's Power5 [Kalla04]) in some small time window. The current implementations of Intel Hyper-Threading (the Northwood and Prescott cores) do not have any support for explicit thread priorities.

In the absence of hardware support the operating system scheduler can try to provide the desired effect while still utilising the simultaneous nature of the processor. Lower priority tasks could be limited to only a fraction of the processor time, forcing one of the logical processors to be idle even if there is a runnable low priority process. Assuming that one logical processor being idle allows all processor resources to be used by the other logical processor, this method would allow the high priority process to run unimpeded for much of the time, only suffering a slowdown for the fraction of the time the low priority process was running. In general this scheme provides a worse system-wide performance than constantly using both logical processors but does give the high priority task a better individual performance. An extension to this technique would have all low priority processes being scheduled on one logical processor while the fewer number of high priority processes share the other. A modification to these schemes would be to use processor performance counters to monitor the consumption of resources by low priority tasks, perhaps biased by their actual dynamic priority, and use this to limit their scheduling. The use of performance counters to inform and improve scheduling for SMT processors is discussed further in Chapter 3.
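
The fraction-limiting idea could look something like the sketch below. The scheduling period, the 25% share and all names are assumptions made purely for illustration; they are not values or code proposed in the dissertation.

    /* Limit low-priority work on the second logical processor to a fixed
     * share of each scheduling period, leaving that logical processor
     * halted (HLT) the rest of the time so the high-priority process has
     * the whole physical core to itself. */
    #define LOW_PRIO_SHARE_PERCENT 25   /* illustrative figure */

    struct logical_cpu_state {
        unsigned long period_ticks;     /* ticks elapsed in the current period */
        unsigned long low_prio_ticks;   /* ticks granted to low-priority work  */
    };

    /* Called on each scheduler tick for the second logical processor;
     * returns non-zero if a runnable low-priority task may be dispatched. */
    int may_run_low_priority(struct logical_cpu_state *s, unsigned long period)
    {
        if (s->period_ticks >= period) {            /* start a new period */
            s->period_ticks = 0;
            s->low_prio_ticks = 0;
        }
        s->period_ticks++;

        if (s->low_prio_ticks * 100 < period * LOW_PRIO_SHARE_PERCENT) {
            s->low_prio_ticks++;
            return 1;                               /* run the low-priority task */
        }
        return 0;                                   /* leave this logical CPU idle */
    }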

2.3.3 Cache Considerations

Threads executing on SMT processors, as with other multithreaded processors, share the cache hierarchy. One of the original motivations for SMT was to provide the processor with more work to do while waiting on a cache miss. However, the extra thread(s) with their own memory access requirements cause a greater demand on the caches often leading to more contention and capacity misses. The performance of the processor will be heavily influenced by the opposing effects of increased cache misses and greater miss tolerance. I show this trade-off occurring in practice in Chapter 3.

A particular problem occurred with Linux when Intel Hyper-Threaded processors first became available. The user stacks in Linux were allocated on 64kB boundaries which meant that they

started from the same location within the processor's 8kB (2kB x 4 way) level 1 data cache. Most applications only use a small amount of the stack which leads to cache conflict due to the limited associativity of the cache. For single-processor context switching this is not a problem as the overhead of refilling the cache after a context switch is very small compared to the scheduling quantum. With SMT the continual contention greatly increases the level 1 cache miss rate. Stack aliasing was avoided in later versions of Linux by offsetting the start address of each process' stack to increase the likelihood of the simultaneous threads accessing disjoint areas of the cache.
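
A minimal sketch of the offsetting idea, assuming the Northwood 8kB 4-way level 1 data cache geometry mentioned above; the choice of hashing the process ID and the offset granularity are illustrative, not the values used by Linux.

    #include <sys/types.h>
    #include <unistd.h>

    #define L1_WAY_SIZE  (8 * 1024 / 4)   /* bytes covered by one way: 2kB */
    #define OFFSET_GRAIN 128              /* stagger stacks in 128-byte steps */

    /* Map the process ID onto a stack offset smaller than one cache way so
     * that simultaneously running processes index different line sets. */
    static unsigned long stack_offset_for(pid_t pid)
    {
        return ((unsigned long)pid * OFFSET_GRAIN) % L1_WAY_SIZE;
    }

    /* e.g. stack_top -= stack_offset_for(getpid()); when laying out the stack */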

The Intel Pentium 4 processor uses a partial virtual tagging scheme for its level 1 cache. Accesses from different threads that alias to the same partial tag will cause contention. To reduce this Intel introduced into the Prescott processor a context identifier bit for each logical processor, kept with the partial tag [Boggs04]. These bits are set or cleared depending on whether the logical processors share a common address space; if they do the bits are the same, if not they differ. Threads executing in different address spaces are guaranteed to have different partial tags and therefore do not contend in this way.

In their work on the performance of database systems on SMT processors, Lo et al describe some of the cache issues that must be considered in order to best use an SMT processor [Lo98]. The paper describes cache interference, data sharing effects and page placement schemes.

Constructive interference is where the threads can benefit from shared cache contents. This is most likely to be seen when a true multithreaded workload is running where code and data are likely to be shared. The effect may exist, to a smaller degree, where multiprogrammed task sets share common code.

SMT has an additional benefit over multiple physical processors; if one thread is writing to a cache line, the other thread(s) can read or write the same line without the invalidation penalty that would happen in a conventional multiprocessor system.

When threads compete for space in the cache then the system experiences destructive interference. The situation is worst when the threads have a high degree of homogeneity in their memory layout and access patterns. Lo et al studied a multithreaded database system and found that each thread had a number of hot spots in its per-thread local data. Each of these hot spots was located at the same virtual address in each thread’s address space. The simulated processor in this study used a virtually-indexed, physically-tagged level 1 (L1) cache and a physically-indexed, physically-tagged level 2 (L2) cache. There are two main effects on the cache performance:

• Virtual address layout; having hot spots at the same virtual addresses means they have the same L1 indexes and therefore will lead to conflicts within the relevant cache line sets.

• Virtual page placement in physical memory; an unfortunate mapping of virtual to physical pages could cause hot spots to map to the same indexes, and hence the same set, in the L2 cache.

Lo et al describe two software mechanisms to deal with these two effects: page placement policy to reduce L2 conflicts and application-level offsetting to reduce L1 conflicts. Most operating systems already use a page placement policy to try to reduce cache contention between context-switching processes. This policy is obviously more important for simultaneously executing threads. Kessler and Hill describe a scheme based on giving each physical frame a colour (or

“bin”) [Kessler92]. Two pages with the same colour index to the same L2 cache line set. The operating system can use the colours to decide which physical frames to map logical pages to, using one of the following schemes (a sketch of both follows the list):

• Page colouring; consecutive virtual pages of a process are mapped to consecutive colours. This mainly prevents contention within a process but can still lead to inter-process contention. Some operating systems randomise the start colour by hashing the process ID, for example.

• Bin (colour) hopping; when a new virtual page is allocated, the mapped physical page is chosen from the next colour in a round robin fashion. This means that pages allocated at the same time will have different colours.
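
The two schemes can be sketched as follows, assuming a physically-indexed 512kB 8-way L2 cache with 4kB pages (16 colours, matching the Northwood machine used later); a real allocator would additionally have to find a free frame of the chosen colour.

    #include <stdint.h>

    #define PAGE_SIZE   4096u
    #define L2_SIZE     (512u * 1024u)                       /* 512kB, 8 way   */
    #define L2_ASSOC    8u
    #define L2_COLOURS  (L2_SIZE / L2_ASSOC / PAGE_SIZE)     /* pages per way  */

    /* Colour of a physical frame: which group of L2 sets it indexes. */
    static inline unsigned colour_of_frame(uint32_t frame_number)
    {
        return frame_number % L2_COLOURS;
    }

    /* Page colouring: colour follows the virtual page number, optionally
     * shifted by a per-process random start to avoid inter-process aliasing. */
    static unsigned colour_for_vpage(uint32_t vpage, unsigned process_start)
    {
        return (vpage + process_start) % L2_COLOURS;
    }

    /* Bin (colour) hopping: hand out colours round-robin in allocation order. */
    static unsigned next_colour;
    static unsigned colour_bin_hopping(void)
    {
        unsigned c = next_colour;
        next_colour = (next_colour + 1) % L2_COLOURS;
        return c;
    }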

Lo et al find that non-randomised page colouring is less effective in the case of their database on SMT because the layout will be the same for each thread and therefore reintroduces aliasing. They found that adding randomisation helps but bin hopping was best because it is most likely to assign the same hot spots in different threads to pages mapping to different areas in the cache.

Page placement only affects the L2 cache since it is only the physical addresses which are changed. A virtually-indexed L1 cache will still see hot spots with the same virtual addresses indexing into the same line set. The technique of application-level offsetting can be used to cause the hot spots to appear at different virtual addresses in each thread's address space. The starting virtual address can be offset (by the operating system or the application) by some number of pages determined by the process ID. This technique also helps with page colouring's effect on the L2 cache since it introduces more randomisation and heterogeneity.

False sharing is a cache consistency problem experienced by multiprocessor systems. Two threads executing on different processors both access disjoint data that happen to be adjacent in memory and appear on the same cache line. Because the processors provide cache consistency on a per-line granularity the disjoint nature of this access pattern is ignored and it appears that both threads are trying to use the same data. The result is a constant "ping-pong" of that cache line between the processors. The same scenario applied to SMT should be less of a problem as the cache is shared. It has been suggested that false sharing on SMT systems is beneficial because less space is used in the cache compared to forcing the threads' data to be on different lines; additionally one thread will effectively prefetch the shared line for the other [Lo97a, McDowell03]. In practice this may not be the case due to the per-line granularity of metadata being extended further into the processor. The Intel Pentium 4 "NetBurst" microarchitecture allows the reordering of load and store instructions. If a store to a location followed by a load from that location are reordered then the processor will detect the memory ordering violation and replay the load using the correct value - a fairly expensive operation that disrupts execution of both Hyper-Threads. The load and store buffers that support this functionality are of cache line size. If a load speculatively reads a line and the line is subsequently written by a store that was earlier in program order than the load, a violation is assumed even if the two accesses were to different locations within the line. This implementation is safe and allows for the complex variable length and non-aligned references common in the Intel architecture. The effect on a Hyper-Threaded processor in the presence of false sharing is that a disjoint write by one thread to a line that has been speculatively read by the other thread will cause a memory ordering violation to be assumed. The impact of this behaviour will depend on

the frequency of occurrence but in the worst case could cause a six times slowdown over a more careful memory allocation3. It is therefore wise to use the same techniques for memory alignment on Hyper-Threaded processors as on multiprocessors.
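
The usual multiprocessor remedy is to pad and align per-thread data so that it falls on separate cache lines. A minimal sketch follows, assuming a 64-byte line purely for illustration (the Pentium 4's level 1 and level 2 line sizes differ, so the constant is an assumption rather than a recommendation).

    #include <stdint.h>

    #define CACHE_LINE 64   /* illustrative line size */

    /* Each thread's frequently written counter occupies its own line, so a
     * write by one thread cannot trigger a (real or assumed) ordering
     * conflict with the other thread's counter. */
    struct per_thread_counter {
        volatile uint64_t count;
        char pad[CACHE_LINE - sizeof(uint64_t)];   /* keep neighbours off this line */
    } __attribute__((aligned(CACHE_LINE)));

    struct per_thread_counter counters[2];   /* one per logical processor/thread */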

2.3.4 Energy Considerations

In Chapter 3 I describe experiments measuring the difference in performance between SMT and SMP systems. I show that a two processor SMP system always outperforms a two-thread SMT system, as would be expected. However, in some situations the difference between the two is not great. The processing performance per unit cost (financial, or energy consumption and dissipation) is a useful metric. SMT will generally have a higher performance per unit cost than SMP. This suggests that an SMT system may be a suitable alternative to an SMP system in situations where financial cost or energy consumption are more important than raw throughput. Typical examples include mobile computers, where energy consumption matters, and high density data centres, where heat extraction is often a limiting factor.

Chip multiprocessors (CMP) would come somewhere between SMP and SMT; the degree of resource sharing of a CMP system is less than SMT so the statistical multiplexing of demands onto resources is limited and would generally yield a lower processing performance per unit cost. CMP systems typically share a level 2 cache so will generally benefit from better cache utilisation (so long as pathological aliasing problems are avoided) and therefore will be more cost (energy) efficient than a multichip SMP system. Both CMP and SMT, compared to SMP, contain multiple logical processors (SMT threads or cores) in a single package; this will reduce the amount of ancillary support required, such as external interface circuits, and therefore reduce the energy cost of a system.

In reality the situation is somewhat more complex with the “cost” including other components of the system which would be similar for SMT, SMP and CMP systems.

In Section 2.3.1 I described how a naïve scheduler could load-balance incorrectly by running two processes on the two logical processors of a single physical processor while leaving the other physical processor idle. While this would give a lower system throughput than could be achieved, it has benefits (perhaps unintentionally) for energy consumption. Two physical processors each with only one active logical processor will consume two processors' worth of energy - Intel and IBM's SMT processors dedicate all resources to the single running thread, leaving no scope for powering down areas of the chip. If the two processes were to be running on the two logical processors of one package then the other package would have two idle threads and could therefore put itself into a low power state. The result is a substantial energy reduction at the expense of throughput. The correct action depends on the relative importance of throughput and energy consumption for a particular system at a particular time.

A further consideration is that of processor clock reduction to reduce energy consumption. Many processors, particularly those aimed at the mobile market, can be instructed to reduce their clock frequency by the operating system. The clock of an SMT processor is common to the physical package rather than the logical processors. Any clock reduction will affect all logical processors

3“Avoiding False Sharing on Hyper-Threading Technology-Enabled Processors” by Phil Kerly. Available at http://www.intel.com/cd/ids/developer/asmo-na/eng/downloads/19980.htm

in the same package. To avoid unexpected performance loss, Intel suggest that the BIOS or OS set the package clock frequency to the highest requested by either of the logical processors. This behaviour further supports the use of the "incorrect" process allocation described above because an entire physical package can be idled and have its clock rate slowed. CMP systems will not suffer this problem if the clocks on each core can be independently controlled.

SMT processors were designed to increase processor throughput. However, if a fixed throughput is required then an SMT implementation may be more energy efficient than a single-threaded processor. Seng et al suggest that for a given level of instruction throughput, an SMT processor uses less energy per instruction than a single-threaded superscalar processor [Seng00]. This is because it is easier to extract parallelism from multiple threads than from a single thread using aggressive speculation. Furthermore, it has been shown that SMT is preferable to multicore architectures for typical fixed-throughput workloads [Kaxiras01, Chen02].

2.4 Summary

In this chapter I have described simultaneous multithreading (SMT) and have introduced the hardware realisations of the technology. These include Intel's Hyper-Threading technology introduced into the Pentium 4 processor. I use this processor in the analysis studies in Chapter 3. I have introduced and discussed areas of operating system support important for SMT processors and highlighted areas where current support is lacking. In Chapter 4 I develop an SMT-aware process scheduler that is sensitive to the performance characteristics of SMT.

Chapter 3

Measuring SMT

Intel's Hyper-Threading technology is the first commercial implementation of simultaneous multithreading (SMT). The availability of systems based on this technology presents an opportunity to measure their realised performance and compare the results to the original research designs for SMT architectures. This chapter describes a series of experiments performed on a Pentium 4 Hyper-Threaded processor in order to assess the impact of threads upon each other. The results from these measurements are used in Chapter 4 to develop a scheduler sensitive to the characteristics of SMT processors.

3.1 Related Work

3.1.1 Simulated Systems

Early work in the development of SMT relied on simulation [Tullsen95, Tullsen96b, Eggers97]. The results were generally very promising for SMT and motivated its further development. As the design progressed and interest widened, a simulator, "SMTSIM", based on the out-of-order model was released [Tullsen96a]. SMTSIM has been used in much of the incremental development work, mainly in evaluating microarchitectural enhancements. The SMTSIM architecture utilises the Alpha instruction set. SMTSIM differs from Intel Hyper-Threading in both architecture (Intel CISC versus Alpha RISC) and microarchitecture. SMTSIM is able to support up to eight contexts (simultaneous threads) compared to Hyper-Threading's two, and the microarchitecture of the former shares resources in a more dynamic way than Hyper-Threading. Measurement studies based on SMTSIM are useful in that they give a general indication of the behaviour that may be seen when running applications on SMT processors but cannot provide the specific details that will affect execution on real systems based on Hyper-Threaded processors.

Snavely et al. [Snavely99] defined symbiosis as the throughput rate of a batch of programs running together versus the throughput rate of the longest running program from the batch running alone. This definition provides an intuitive quantitative measure for batches of processes but does not fit well with more general loads. The authors investigated how symbiosis changed as different benchmarks were co-scheduled, noting the bottlenecks in some interesting cases. In later work on symbiotic scheduling [Snavely00, Snavely02], jobs were characterised by instruction mix and

performance data gathered during a sampling phase of execution. The jobs were put together in "jobmixes" according to various policies in order to achieve good symbiosis.

Redstone et al. investigated the performance of workloads running on a simulated SMT system with a full operating system [Redstone00]. They concluded that the time spent executing in the kernel can have a large impact on the speedup measurements compared to a user-mode-only study. They report that the inclusion of OS effects in a SPECInt95 study has less impact on SMT performance measurements than it does on non-SMT superscalar results, due to the better latency hiding of SMT being able to mask the poorer IPC of the kernel parts of the execution. This result is important as it means that a comparison of SMT to superscalar without taking the OS into account would not do the SMT architecture justice.

3.1.2 Real Hardware

Since its release, there have been some performance studies of the Hyper-Threaded Intel Pentium 4. Studies of real hardware are useful as the implementations of SMT processors differ from the simulated architectures used in the studies described above. When measuring a real system many more factors are taken into account than can be incorporated into a simulation.

Grunwald and Ghiasi investigated thread interaction effects on the Hyper-Threaded Pentium 4 [Grunwald02]. Using synthetic workloads they demonstrated that the execution rate of a thread running on one logical processor can drop by as much as 95% (where 50% would be the worst expected if everything was fair) with a carefully crafted (possibly malicious) process running on the second logical processor. In particular they find that a process making extensive use of self-modifying code will cause trace-cache flushes which affect the performance of both logical processors considerably. They argue that, although possible in software, fairness should be ensured by the hardware and propose some microarchitectural solutions which generally involve the processor monitoring the frequency of certain events occurring and throttling threads based on some policy.

The IBM Linux Technology Centre performed a series of microbenchmark and application level benchmark tests to compare the performance of a processor with Hyper-Threading enabled and disabled [Vianney03]. They noticed very little difference on most of the microbenchmarks and single applications but speedups of 20 to 30% on multi-threaded applications. They then executed the multi-threaded application benchmarks on a Linux kernel modified with many of the features described in Section 2.3 of this dissertation. Speedups of up to 60% were observed. The main contributor to the improvement was the modified scheduler's knowledge that the two logical processors on a physical processor therefore share a cache. This leads to extra flexibility in scheduling, particularly when waking up a blocked process, as there is no significant penalty in "migrating" a process to the other logical processor in the same package.

Intel investigated the performance of data-parallel workloads threaded with OpenMP [Magro02]. They observed speedups of between 5% and 28% over a range of compute-intensive applications when using Hyper-Threading compared to running the threads serially. For comparison the workloads were run on a dual processor SMP system (without Hyper-Threading) giving speedups of between 54% and 100% over serial execution. The investigators noted that data-parallel workloads typically present very similar instruction streams which could compete for processor

resources but still gain a benefit from Hyper-Threading. Intel have also measured media applications running on Hyper-Threaded processors [Chen02]. On average the processor utilisation was found to increase by around 20% when Hyper-Threading was being used.

Since my work was performed, Tuck and Tullsen have performed a similar study [Tuck03]; their measurements confirm my own. Where relevant I compare my results to those of Tuck and Tullsen. They used Intel’s VTune performance analyser to track performance counters in order to explain a few of the interesting interactions. The results I present in this chapter take the performance counter measurement further. My experiments look at the bias in performance of two concurrently running processes and I make more extensive use of performance counter data to investigate the reasons for the observed effects.

3.2 Experimental Configurations

In this section I describe the common tools and infrastructure used in the experiments described in this chapter. The particular method used in each experiment is described in the relevant section.

3.2.1 Performance Counters

The Intel Pentium 4 provides a rich set of hardware performance counters. These count events ranging from instruction issue, execution and retirement through cache misses to memory bus activity. The number of event classes that can be counted greatly outnumbers the 18 available counters and there are restrictions regarding which combinations of events can be simultaneously counted.

The performance counters are per physical package. Most events can be filtered for either or both logical processors. This imposes a limitation on what events may be counted simultaneously. For example, counting instructions retired independently on each logical processor prevents mispredicted branches being counted.

The interface to the performance counter configuration is through the processor "model specific registers" (MSRs). Access to MSRs can only be performed from the most privileged mode, necessitating support within the kernel. There exists software to allow access to the counters, including per-process virtualisation (achieved through context switching the counter configuration and values)1, however this was too heavyweight for my needs. I developed a simple, low overhead interface to configuring and reading the counters using a Linux /proc file. The /proc file system allows quasi-files whose reads and writes are handled by kernel functions. My /proc/perfcntr file handled writes as commands to configure the performance counters. Reading the file caused all the performance counters and timestamps for each processor to be returned.
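
As an illustration of the kind of access involved, the following user-space sketch reads a counter MSR through the standard Linux msr device rather than through the custom /proc/perfcntr interface described above. The MSR address is a placeholder, not a value taken from the dissertation; the real Pentium 4 counter, CCCR and ESCR addresses are given in the Intel manuals.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define PERFCTR_MSR 0x300   /* placeholder MSR address */

    int main(void)
    {
        uint64_t value;
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* The msr device is addressed by MSR number via the file offset. */
        if (pread(fd, &value, sizeof(value), PERFCTR_MSR) != sizeof(value)) {
            perror("pread");
            return 1;
        }
        /* The Pentium 4 counters are 40 bits wide; mask off the rest. */
        printf("counter = %llu\n",
               (unsigned long long)(value & ((1ull << 40) - 1)));
        close(fd);
        return 0;
    }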

To ease the use of the counters I added a user-space tool to form the counter configuration bit patterns. The tool, cpuperf, takes a number of command lines describing the counters, event

1The package most popular for the Linux kernel is Mikael Pettersson's perfctr: http://user.it.uu.se/~mikpe/linux/perfctr/

    Metric                  Counter [Intel01b]               Event mask
    x87 FP uOps             x87 FP uop                       ALL
    Instructions Retired    instr ret                        NBOGUSNTAG, NBOGUSTAG
    Branch Mispredictions   mispred branch retired           NBOGUS
    Loads Executed          front end event                  NBOGUS, BOGUS
                            uop type                         TAGLOADS
    Trace Cache Misses      BPU fetch request                TCMISS
    L1 D Misses             replay event                     NBOGUS
                            (1stL cache load miss retired)
    L2 Misses               BSQ cache reference              RD 2ndL MISS, WR 2nd MISS
    I-TLB Misses            ITLB reference                   MISS
    D-TLB Misses            page walk type                   DTMISS
    Bus Activity            FSB data activity                DRDY OWN, DRDY DRV (Edge triggered)

Table 3.1: Performance counter configurations used in the experiments.

classes and other configuration parameters and creates the appropriate configuration bit patterns for the various MSRs. These are then communicated to the /proc file.

The counters are 40 bits wide and therefore may overflow during the course of experiments. It is possible to have the processor generate interrupts when an overflow occurs however this was unnecessary and, due to the cost and complexity of dealing with the interrupt, undesirable. Instead the post-experiment analysis detected overflows; the frequency of performance counter sampling was set such that it would be clear when an overflow had occurred.
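
The post-processing overflow correction can be sketched as follows, assuming (as the text states) that samples were taken often enough that a 40-bit counter can wrap at most once between successive samples; the function name is illustrative.

    #include <stdint.h>

    #define COUNTER_WIDTH 40
    #define COUNTER_WRAP  (1ull << COUNTER_WIDTH)

    /* Return the number of events between two samples of a 40-bit counter,
     * correcting for at most one wrap-around. */
    uint64_t counter_delta(uint64_t prev, uint64_t curr)
    {
        if (curr >= prev)
            return curr - prev;
        return (COUNTER_WRAP - prev) + curr;   /* wrapped once between samples */
    }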

Table 3.1 details the performance counter configurations for the counters used in the experiments described below. These event counters were carefully chosen from the large number available in order to provide a useful and informative picture of the processor’s behaviour and to allow an insight into the reasons for the observed performance. Not all of these counters could be used concurrently.

3.2.2 Test Platforms

Most of the experiments were conducted on a system based on two Intel Pentium 4 Xeon Hyper-Threaded processors. These processors are based on the "Northwood" core (see Section 2.2.2). Some comparison experiments were performed on a system based on a Pentium 4 processor with the "Prescott" core; this processor has only recently become available and time limitations have only allowed a subset of the experiments performed on Northwood to be performed on Prescott. In all cases the systems used a RedHat Linux distribution with a locally built and modified version 2.4.19 kernel. This version of the kernel contains support for Hyper-Threading at a low level, including the detection of the logical processors and the avoidance of timing loops during normal

                 Northwood                Prescott               Tuck and Tullsen
    Model        Intel SE7501 based       Dell Precision 360
    CPU          2 x P4 Xeon 2.4GHz HT    1 x P4 2.8E GHz HT     1 x P4 2.5GHz HT
    L1 cache     8kB 4 way D,             16kB 8 way D,          8kB 4 way D,
                 12k-uops trace I         12k-uops trace I       12k-uops trace I
    L2 cache     512kB 8 way              1MB 8 way              256kB 8 way
    L3 cache     none                     none                   none
    FSB          400MHz                   800MHz
    Memory       1GB DDR DRAM             1GB DDR DRAM           512MB DRDRAM
    OS           RedHat 7.3               RedHat 9.0             RedHat 7.3
    Kernel       Linux 2.4.19             Linux 2.4.19           Linux 2.4.18smp

Table 3.2: Experimental machine details.

operation. The kernel was modified with a variation of the cpus_allowed patch2. This provides an interface to the Linux cpus_allowed task attribute and allows the specification of which processor(s) a process can be executed on. This is particularly important as the scheduler in Linux 2.4.19 is not aware of Hyper-Threading. Use of this patch prevented threads being migrated to another processor (logical or physical). The /proc/perfcntr file described above was used to allow lightweight access to the processor performance counters.
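
For illustration, the sketch below pins the calling process to one logical processor using sched_setaffinity(), the interface provided by later kernels; the experiments themselves used the cpus_allowed patch on Linux 2.4.19 rather than this call.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Restrict the calling process to the given logical processor number. */
    int pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }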

Details of the experimental machines are given in Table 3.2. The Northwood machine contained two physical processors each having two logical processors (Hyper-Threads); the Prescott machine had a single, two-logical-processor package. Also shown in the table are details given by Tuck and Tullsen for their experimental machine [Tuck03], to which Intel gave them early access, which would explain the non-standard clock speed and L2 cache combination.

3.2.3 Workloads

3.2.3.1 SPEC CPU2000

Benchmarks from the industry standard SPEC CPU2000 suite [SPEC] were used in most of the experiments. This suite contains integer and floating-point benchmarks compiled from C, C++ and Fortran. The benchmarks are generally processor- (and sometimes memory-) bound and fit within the system memory. They perform relatively little I/O or other operating system activity. Table 3.3 shows the benchmarks used along with a brief description of each.

Experimental runs were run to completion (including successive runs with different input data where relevant) and used the reference data sets. Much of the simulation-based related work uses reduced data sets or executes only a fraction of each benchmark. As I will show, the changing phases of the benchmarks' execution limit the effectiveness of such an approach. The benchmarks were compiled with GCC 2.96 using a common set of optimisation flags (mainly -O2). The

2The cpus_allowed/launch_policy patch was posted to the linux-kernel mailing list by Matthew Dobson in December 2001

Fortran-90 benchmarks, 178.galgel, 187.facerec, 189.lucas and 191.fma3d, were not used due to GCC not supporting this language.

3.2.3.2 Desktop Applications

Although the SPEC CPU2000 suite provides a varied set of benchmarks and allows controlled, repeatable experiments, it is not intended to be representative of workloads commonly used on desktop computers. To address this, the experiments based on the SPEC suite were supplemented with some using a selection of benchmarks based on desktop applications.

mp3 The command-line player decoded approximately 1 hour of music from 17

    Benchmark        Description
    Integer
    164.gzip         LZ77 compression
    175.vpr          FPGA place-and-route based on Dijkstra's algorithm
    176.gcc          GNU C compiler
    181.mcf          Combinatorial optimisation using much pointer arithmetic
    186.crafty       Chess playing using 64 bit arithmetic
    197.parser       A grammar parser for the English language
    252.eon          Probabilistic ray tracer
    253.perlbmk      Perl script interpreter, mainly text processing
    254.gap          Group theory processing
    255.vortex       In-memory object-oriented database
    256.bzip2        Burrows-Wheeler transform compression
    300.twolf        Chip layout place-and-route using simulated annealing
    Floating-Point
    168.wupwise      Physics: quantum chromodynamics
    171.swim         Weather prediction/shallow water modelling
    172.mgrid        Multi-grid solver: 3D potential field
    173.applu        Computational fluid dynamics using partial differential equations
    177.mesa         3D graphics library similar to OpenGL
    179.art          Image recognition using neural networks
    183.equake       Finite element simulation of seismic wave propagation in large basins
    188.ammp         Computational chemistry using ordinary differential equations
    200.sixtrack     High energy nuclear physics accelerator design
    301.apsi         Weather prediction: pollutant distribution

Table 3.3: SPEC CPU2000 benchmarks used in the experiments.

42 MP3 files encoded at 192kbps. The decoded audio was sent to /dev/null and the decoding was allowed to progress as fast as possible (i.e. not in real time).

compile A build of the Linux kernel was performed. A version 2.6 kernel was used with as many modules and features as possible included. Both the kernel image and modules were compiled.

glx The glxgears application shipped with the XFree86 tools package on Linux systems. This application animates a set of 3 three-dimensional gears using the OpenGL graphics library. It records the frames-per-second (FPS) it achieves. Most of the processor time is spent within the X windows process so both the glxgears and X processes were given the appropriate logical processor affinity. To bring the performance result of glx into line with the other benchmarks which record an elapsed time, the frame-count values were interpolated to estimate a time to render 100,000 frames (approximately 5 minutes when running alone).

image A pipeline of three document processing commands: pdf2ps, pstopnm and ppmtojpeg. The sequence turns a large, complex PDF format file into a JPEG format image using PostScript and the PPM raster format as the intermediates. Command line options were used to avoid the image being shrunk to a paper size.

3.2.4 Performance Metrics

The performance ratio (PR) of an individual benchmark running as one of a simultaneously executing pair is defined as its execution time when running alone divided by its execution time when running in the pair:

PR = standalone runtime / actual runtime

If a non-SMT processor is being timeshared in a theoretical perfect (no context switch penalty) round-robin fashion with no cache pollution then a performance ratio of 0.5 would be expected as the benchmark is getting half of the CPU time. A perfect SMP system with each processor running one of the pair of benchmarks with no performance interactions would give a PR of 1 for each benchmark. It would be expected that benchmarks running under two-thread SMT would fall somewhere between 0.5 and 1 because they should get at least half of the processor resources and are unlikely to do better than they would having the entire processor to themselves. A PR of anything less than 0.5 is an unfortunate loss.

The total system speedup for the pair of benchmarks is the sum of the PR values for the processes on all logical processors being considered:

system speedup = Σ_i PR_i

This speedup is compared to zero-cost context switching with a single processor, so a perfect SMP system should have a system speedup of 2 while a single Intel Hyper-Threaded processor should come in somewhere between 1 and 2. Intel suggest that Hyper-Threading provides a 30% speedup which would correspond to a system speedup of 1.3 in my analysis.
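
For example, if two benchmarks each take 100 seconds when run alone and 150 seconds when run as a pair, each has a PR of 100/150 ≈ 0.67 and the system speedup is approximately 1.33, just above the 1.3 that Intel's suggested 30% figure corresponds to.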

43 A number of other performance metrics have been used in related work. These include weighted speedup and SMT-speedup from research work and various metrics from Intel.

Snavely et al define weighted speedup (WS) for a group of n simultaneously running threads for a period t in terms of their realised and stand-alone IPC [Snavely99]:

WS(t) = Σ_{i=1..n} (realised IPC_i / standalone IPC_i)

This metric is described as "weighted" because it allows for threads with natively low IPC which would otherwise be penalised in simple IPC-summation based metrics. WS is directly equivalent to my system-speedup metric when the IPC is averaged over the complete run of the program (the number of instructions cancels out).

Sazeides and Juan define a metric for comparing different SMT microarchitectures [Sazeides01]. They argue that simply measuring the total processor IPC is not sufficient because the effect on different threads may not be balanced and the actual work done may not have increased as

much as the IPC suggests. They define a weight w_j for each thread j based on its single-threaded performance and the total number of instructions executed in the experiment, I:

w_j = I_j / I

where I_j is the number of instructions executed by thread j in the experiment. The metric for SMT-speedup is given by the processor IPC for the experiment multiplied by the sum of the per-thread weights, each divided by the per-thread IPC from the single-threaded results. This is just a rearrangement of a more understandable metric in which the sum of the times it would have taken to execute the threads in single-threaded mode, up to the point that they reached in the SMT experiment, is divided by the time it took running under SMT.

Intel define HT Scaling in the same way as my performance ratio metric: a task’s standalone execution time divided by its execution time running under Hyper-Threading. The difference is that Intel’s metric is designed for use with parallel tasks so will yield results comparable to my system speedup metric. Intel also define a metric HT effectiveness based on Amdahl’s Law which measures how well a parallel task performs on a Hyper-Threaded processor based on its observed performance on a dual-processor system and an assumed 30% Hyper-Threading speedup3.

3.3 Thread Interactions

Intel Hyper-Threading is aimed both at multithreaded and multiprogrammed workloads. Intel initially marketed Hyper-Threading as a way to get two processors for the price of one; this pitch has since been toned down. The current claim, qualified by the usual disclaimers regarding performance depending on many variables, is:

Built-in Hyper-Threading Technology (HT Technology) provides immediate value in today's computing environment by enabling the processor to simultaneously

3“How to Determine the Effectiveness of Hyper-Threading Technology with an Application” by Shawn D. Casey. Available at http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/ia32/pentium4/hyperthreading/20470.htm

execute two software program threads. This lets you run two software applications in parallel without sacrificing performance4.

In this section I measure the performance of running "two software applications in parallel". I use pairs of benchmarks from the industry standard SPEC suite [SPEC]. I perform the same experiments on both the Northwood and Prescott based systems in order to observe the different behaviours caused by the different implementations. The measurements and analysis of the Northwood core have previously been published [Bulpin04].

3.3.1 Experimental Method

The experiments were performed on the two systems described in Section 3.2.2. At the time of writing, Prescott systems are only beginning to become available and the range of processor models that are available with this core is limited. Only uniprocessor (in the physical processor sense) Prescott-based machines have currently been released. Therefore the two machines are not matched so results from the two machines should not be compared. The trends, however, are of interest.

The aim of the experiments was to find the performance ratio for each benchmark application running simultaneously with every other benchmark application. This cross-product of results provides two data points for benchmarks X and Y running simultaneously: (X, Y) to describe the performance ratio of X while running in that pair, and (Y, X) to describe the performance ratio of Y while running in that pair. Note that the two performance ratios will generally be different because the sharing of resources and the mutual slowing effect is not necessarily equal for both processes. The number of experiments required was reduced by recording data for both benchmarks X and Y in a simultaneously executing pair; this provided data points (X, Y) and (Y, X).

For each pair of benchmarks studied the following procedure was used. Each benchmark was executed in a continuous loop on one of the logical processors. A random delay was inserted before entering the loop to give a staggered start. Execution time was ignored until both benchmark processes had completed at least one run. The experiment continued until both benchmarks had accumulated three timed runs each. Note that the benchmark with the shorter runtime will have completed more than three runs; only the first three timed runs were recorded. This method guaranteed that there were always two active processes and allowed the caches, including the operating system buffer cache, to be warmed. Note that successive runs of the one benchmark would start at different points within the other benchmark's execution due to the differing run times for both.

The experiments were run on Hyper-Threading, SMP (in the Northwood case) and single-processor context switching configurations. On the dual package Northwood the Hyper-Threading experiments were conducted using the two logical processors on the second physical processor and the SMP experiments used the first logical processor on each physical processor with the other processor idle (equivalent to disabling Hyper-Threading). The context switching experiments were run on the second physical processor and used the round-robin feature (“real

4From http://www.intel.com

time” scheduler option) of the Linux scheduler with a modification to allow the quantum to be specified. In all cases the machine was configured to minimise background system activity since these experiments were concerned solely with the interaction of the two threads; this was of particular importance on the single-package Prescott system.

A set of base run times and performance counter values were measured by running benchmarks alone on a single physical processor. A dummy run of each benchmark was completed before the timed runs in order to warm the caches. A total of 9 timed runs were made and the median run time was recorded. This procedure was performed twice; once using a single logical processor with the second logical processor idle (but still with Hyper-Threading enabled), and once with Hyper-Threading disabled. The run times for both configurations were almost identical. This behaviour is expected because the processor recombines partitioned resources when one of the logical processors is idle through using the HLT instruction [Marr02].

3.3.2 Results

For the purposes of the analysis, one process was considered to be the subject process and the other the background. The experiments were symmetric, therefore only one experiment was required for each pair, but the data from each experiment was analysed twice with the two processes taking the roles of subject and background in turn (except where a benchmark competed against a copy of itself).

In the following sections I present a summary of results from the Hyper-Threading and SMP experiments. The single-processor context switching experiments using a quantum of 10ms generally resulted in a performance ratio (PR) of no worse than 0.48 for each thread, a 4% drop from the theoretic zero-cost case. As well as the explicit cost of performing the context switch, the cache pollution contributes to the slowdown. The relatively long quantum means that the processes have time to build up and benefit from cached information. I do not present detailed results from these experiments; however, the impact of the quantum length can be seen in Figure 3.1.

3.3.2.1 Northwood Hyper-Threading

Figure 3.2 shows the results for benchmark pairs on the Hyper-Threaded Pentium 4 using the same format as Tuck and Tullsen [Tuck03] to allow a direct comparison. For each subject benchmark a box and whisker plot shows the range of system speedups obtained when running the benchmark simultaneously with each other benchmark. The box shows the interquartile range (IQR) of these speedups with the median speedup shown by a line within the box. The whiskers extend to the most extreme speedup within 1.5 IQR of the 25th and 75th percentiles (i.e. the edges of the box) respectively. Individual speedups outside of this range are shown as crosses. The gaps on the horizontal axis are where the Fortran-90 benchmarks would fit.

Tuck and Tullsen’s experimental conditions differ from mine in a few ways, mainly the size of the L2 cache (my 512kB vs. 256kB), the speed of the memory (my 266MHz DDR vs. 800MHz DRDRAM) and the compiler (my GCC 2.96 vs. the Intel Reference Compiler). The similarities in the results given these differences show that the effect of the processor microarchitecture is important and that the lessons that can be learned can be applied to more than just the particular configuration under test.

[Figure 3.1 appears here: two panels, (a) subject benchmark 164.gzip and (b) subject benchmark 175.vpr, plotting relative performance against scheduler quantum (ms, 1 to 1000) for background benchmarks 164.gzip, 175.vpr, 181.mcf, 254.gap and 255.vortex.]

Figure 3.1: The effect of quantum length on round-robin context switching of pairs of benchmarks. The data is for the subject benchmark running against 5 different background benchmarks.

[Figure 3.2 appears here: a box and whisker plot of multiprogrammed speedup (0.9 to 1.6) for each subject benchmark listed along the horizontal axis.]

Figure 3.2: Multiprogrammed speedup of pairs of SPEC CPU2000 benchmarks running on a Hyper-Threaded processor.

I measure an average system speedup across all the benchmarks of 1.20 with best and worst case speedups of 1.50 (mcf vs. mesa) and 0.86 (swim vs. mgrid).

Figure 3.3(a) shows the individual performance ratio of each benchmark in a multiprogrammed pair. The figure is organised such that a square describes the PR of the row benchmark when sharing the processor with the column benchmark. The PR is considered bad when it is less than 0.5, i.e. worse than perfect context switching, and good when above 0.5. The colour of the square ranges from red for bad to green for good with a range of shades in-between. The first point to note is the lack of reflective symmetry about the top-left to bottom-right diagonal. In other words, when two benchmarks are simultaneously executing, the performance ratio of each individual benchmark (compared to it running alone) is different. This shows that the performance of pairs of simultaneously executing SPEC2000 benchmarks is not fairly shared. Inspection of the rows shows that benchmarks such as mesa and apsi always seem to do well regardless of what they simultaneously execute with. Benchmarks such as mgrid and vortex suffer when running against almost anything else. Looking at the columns suggests that benchmarks such as sixtrack and mesa rarely harm the benchmark they share the processor with while swim, art and mcf usually hurt the performance of the other benchmark.

The results show that a benchmark executing with another copy of itself (using a staggered start) usually has a lower than average performance ratio demonstrating the processor’s preference for heterogeneous workloads which is not overcome by benefits in shared code and data. A subset of the homogeneous experiments was repeated without a staggered start; there was no change in the results.

The performance counter values recorded from the base runs of each benchmark allow an insight into the observed behaviour:

[Figure 3.3 appears here: two colour matrices, (a) Northwood and (b) Prescott, with subject benchmarks on the rows and background benchmarks on the columns; square colours range from "Slowdown >20%" through "Approx same" to "Speedup >30%".]

Figure 3.3: Effect on the performance ratio of the subject benchmark for each SPEC CPU2000 benchmark running in a multiprogrammed pair on a Hyper-Threaded processor. A red square denotes a bad performance ratio; a green square represents a good performance ratio. (A monochrome version of this figure is shown in Appendix A.)

mcf has a notably low IPC which can be attributed, at least in part, to its high L2 and L1-D miss rates. An explanation for why this benchmark rarely suffers when simultaneously executing with other benchmarks is that it is already performing so poorly that it is difficult to do much further damage (except with art and swim which have very high L2 miss rates). It might be expected that a benchmark simultaneously executing with mcf would itself perform well so long as it made relatively few cache accesses. eon and mesa fall into this category and the latter does perform well (28% speedup compared to sequential execution) but the former has only a moderate performance ratio (12% speedup), probably due to its very high trace cache miss rate causing many accesses to the (already busy) L2 cache.

gzip is one of the benchmarks that generally does not detriment the performance of other benchmarks. It makes a large number of cache accesses and has a moderately high L1 D-cache miss rate of approximately 10%. It does however have small L2 cache and D-TLB miss rates due to its small memory footprint.

vortex suffers a reduced performance when running with most other benchmarks. It exhibits a moderately high number of I-TLB misses and a reasonable number of trace-cache misses although both figures are well below the highest of each metric. A more detailed examination of this benchmark's trace-cache miss rate shows that, although the mean miss rate is only moderate, there are regular spikes where the miss rate is very high. When running with other benchmarks the trace-cache miss rate of vortex (ignoring misses from the other thread) is about 30% higher than when it runs alone. This suggests that vortex's instruction footprint is on the edge of the trace-cache capacity and is performance-limited by the hit rate in the trace-cache.

mcf, swim and art have high L1-D and L2 miss rates when running alone and have a low average IPC. They tend to cause a detriment to the performance of other benchmarks when simultaneously executing. art and mcf generally only suffer a performance loss themselves when the other benchmark also has a high L2 miss rate; swim suffers most when sharing with these benchmarks but is also more vulnerable to those with moderate miss rates.

mgrid is the benchmark that suffers the most when running under SMT whilst the simultaneously executing benchmark generally takes only a small performance hit. mgrid is notable in that it executes more loads per unit time than any other SPEC CPU2000 benchmark and has the highest L1 D-cache miss rate (per unit time). It has only a moderately high L2 miss rate and a low D-TLB miss rate. The only benchmarks that do not cause a performance loss to mgrid are those with low L2 miss rates (per unit time). mgrid's baseline performance is good (an IPC of 1.44) given its high L1-D miss rate.
The benchmark relies on a good L2 hit rate which makes it vulnerable to any simultaneously executing thread that pollutes the L2 cache.

sixtrack has a high baseline IPC (with a large part of that being floating point operations) and a low L1-D miss rate but a fairly high rate of issue of loads. The only benchmark it causes any significant performance degradation to is another copy of itself; this is most likely due to competition for floating point execution units. It suffers a moderate performance degradation when simultaneously running with benchmarks with moderate to high cache miss rates such as art and swim. The competitor benchmark will increase contention in the caches and harm sixtrack's good cache hit rate. Tuck and Tullsen report that sixtrack suffers only minimal interference from swim and art. I believe the reason for this difference is that the larger L2 cache on my system gives sixtrack a better baseline performance, which makes it more vulnerable to performance degradation from benchmarks with high cache miss rates, giving it lower relative speedups.

    Best HT system throughput (1.50)      181.mcf       177.mesa
    Int/FP                                Int           FP
    L1-D/L2 miss rates                    high          low
    D-TLB miss rate                       high          low
    Trace cache miss rate                 low           high
    IPC                                   very low      moderate

    Worst HT system throughput (0.86)     171.swim      172.mgrid
    Int/FP                                FP            FP
    L1-D miss rate                        moderate      moderate
    L2 miss rate                          high          low
    D-TLB miss rate                       high          low
    Trace cache miss rate                 low           low
    IPC                                   fairly low    fairly high

    Stereotypical SMP vs HT performance   183.equake    177.mesa
    Int/FP                                FP            FP (less FP than equake)
    L1-D/L2 miss rate                     moderate      high
    Trace cache miss rate                 low           high
    IPC                                   moderate      moderate

Table 3.4: Performance counter metrics for some interesting benchmark pairs. Metrics are for the benchmark running alone.

The best pairing observed in terms of system throughput was mcf vs. mesa (50% system speedup). Although mcf gets the better share of the performance gains, mesa does fairly well too. The performance counter metrics shown qualitatively in Table 3.4 for this pair show that heterogeneity is good.

Tuck and Tullsen note that swim appears in both the best and worst pairs. The reason for this is mainly down to the competitor: mgrid, with its high L1-D miss rate, is a bad partner; sixtrack and mesa are good partners as they have low L1-D miss rates and so do little harm to the other thread.

The performance hit caused by L2 cache misses is not limited to interference with other threads which place non-trivial demands on the L2 cache. The above results show that L2 cache misses can hurt almost any other competitor. A particular inter-thread performance problem can occur on out-of-order (OoO) superscalar based SMT processors when long latency L2 misses occur. Tullsen and Brown [Tullsen01] and Cazorla et al [Cazorla03] note that this problem can be caused by the stalled thread consuming resources such as reorder buffer entries or physical registers, while it is performing no useful work.

A common aggressive OoO technique, used by the Pentium 4, is that the instruction schedulers assume all loads will hit in the L1 cache. If a load misses, dependent instructions may have already dispatched and will use incorrect data. Once the load has actually completed, the processor re-executes the dependent instructions (this is called "replay" for the Pentium 4). For a single-thread superscalar this data-speculation mechanism works well but for SMT the extra resources used by the replayed instructions will reduce the resources available to the other thread(s).


Figure 3.4: Distribution of system speedups for pairs of SPEC CPU2000 benchmarks running on a Hyper- Threaded processor. A point (x, y) shows that the fraction y of all of the pairs of benchmarks had system speedups of at most x.


3.3.2.2 Prescott Hyper-Threading

Figure 3.3(b) shows the performance ratio matrix for the same set of experiments performed on a machine using the Prescott Pentium 4 processor core. Note that this machine has a faster clock, larger L2 cache and newer operating system distribution than the Northwood-based machine used for the above experiments. Therefore direct, absolute comparisons should not be made; it is the trends that are important. The immediately clear difference from the Northwood results is the reduction in red and brown squares which denote a suffering benchmark. However, the same general pattern is evident, albeit less pronounced. The mean system speedup is 1.25 compared to 1.20 for Northwood. The standard deviation of the performance ratios of the individual benchmarks is 0.089 for Prescott compared to 0.097 for Northwood. These results suggest that the microarchitecture of the Prescott core provides a fairer sharing of processor resources than Northwood.

Figures 3.4 and 3.5 show respectively the distributions of system speedups and individual bench- mark performance ratios for both Prescott and Northwood test platforms. As demonstrated above the Prescott generally performs better than the Northwood. The interquartile range for system speedups is 0.167 (1.11 to 1.28) for Northwood and 0.145 (1.18 to 1.33) for Prescott and for the performance ratios 0.125 (0.535 to 0.660) and 0.111 (0.569 to 0.681) respectively.


Figure 3.5: Distribution of performances for individual SPEC CPU2000 benchmarks running simultane- ously under Hyper-Threading. A point (x, y) shows that the fraction y of all of the subject benchmarks in the cross-product had performance ratios of at most x.

The problem of pathological cases has not been eliminated in the Prescott core. Some notable observations from the data are:

• The experiments where two copies of the same benchmark compete show a greater range of behaviour with Prescott than with Northwood; with Prescott there are fewer homogeneous pairs with small speedups but more with larger speedups or slowdowns. The mean system speedup for these pairs was 1.08 compared to 1.09 with Northwood. Homogeneous demands for resources are an inherent problem with dynamically shared systems like SMT and so are hard to eliminate.

• 255.vortex still exhibits no, or a small negative, speedup in most cases. Since the Prescott's trace-cache is the same size and configuration as that of the Northwood, vortex suffers the same capacity problem as on the older core.

• 179.art still detriments the performance of many other benchmarks, some considerably, but its effect is less pronounced on Prescott than Northwood. 181.mcf shows a similar change but in two cases it slows the subject benchmark more with Prescott than with Northwood (186.crafty and 183.equake).

3.3.2.3 Northwood Hyper-Threading vs. SMP

Figure 3.6 shows the speedups for the benchmark pairs running in a traditional SMP configuration. Also shown for comparison is the Hyper-Threading data as shown above. An interesting observation is that benchmarks that have a large variation in performance under Hyper-Threading also have a large variation under SMP. It might be imagined that the performance of a given benchmark would be more stable under SMP than under Hyper-Threading since there is much less interaction between the two processes. The correspondence in variation suggests that competition for off-chip resources such as the memory bus is as important as on-chip interaction.


Figure 3.6: Multiprogrammed speedup of pairs of SPEC2000 benchmarks running on a Northwood Hyper-Threaded processor and non-Hyper-Threaded SMP. The right (red) of each pair of box and whiskers is Hyper-Threading and the left (blue) is SMP.

The mean speedup for all pairs was 1.20 under Hyper-Threading and 1.77 under SMP. This means that the performance of an SMP system is 48% better than a corresponding Hyper-Threading system for SPEC CPU2000.

Figure 3.7 shows the individual performance ratios of the benchmarks in the same style as Figure 3.3 described in Section 3.3.2.1 above. Unlike Hyper-Threading, SMP does not show any notable unfairness between the concurrently executing threads. This is clearly due to the vast reduction in resource sharing with the main remaining resource being the memory and its bus and controller. This means that the benchmarks that reduce the performance of the other running benchmarks are also the ones that suffer themselves. The benchmarks in this category include mcf, swim, mgrid, art and equake: all ones that exhibit a high L2 miss rate which further identifies the memory bus as the point of contention.

An example of expected behaviour is equake vs. mesa. This pair exhibits a system speedup of just under 1 for context switching on a single processor, just under 2 for traditional SMP and a figure in the middle, 1.42, for Hyper-Threading. As mesa has a low cache miss rate it does not make much use of the memory bus so it is not slowed by equake's high L2 miss rate when running under SMP. Similarly for round robin context switching the small data footprint of mesa does not cause any significant eviction of data belonging to equake. Under Hyper-Threading there is little contention for the caches and the smaller fraction of floating-point instructions in mesa means there is less competition for the floating-point execution units.

[Figure 3.7: Performance ratios of the SPEC CPU2000 benchmarks running in multiprogrammed pairs under the SMP configuration, in the same style as Figure 3.3: each square represents the performance ratio of the subject benchmark of the pair relative to running alone, shaded from "Slowdown >20%" to "Speedup >30%". (A monochrome version of this figure is shown in Appendix A.)]

Config   Physical Processor 0              Physical Processor 1
         Hyper-Thread 0   Hyper-Thread 1   Hyper-Thread 0   Hyper-Thread 1
UP       Active           Disabled         Disabled         Disabled
SMP      Active           Disabled         Active           Disabled
1x2      Active           Active           Disabled         Disabled
2x2      Active           Active           Active           Active

Table 3.5: Configurations used for Linux kernel compilation experiments.

mcf has a low IPC due to its high cache miss rates and so benefits from the latency hiding; vortex has a fairly low L1-D miss rate which is harmed by the competing thread.

gzip with its very low L2 and trace cache miss rates, moderate L1-D miss rate and large number of memory accesses always does well under SMP due to the lack of bus contention but has a moderate and mixed performance under Hyper-Threading. gzip is vulnerable under Hyper-Threading due to its high IPC and low L2 miss rate meaning it is already making very good use of the processor's resources. Any other thread will take away resource and slow gzip down.

3.3.3 Desktop Applications

The desktop application benchmarks described in Section 3.2.3.2 were used in a series of experiments on the Northwood processor.

These benchmarks, particularly compile, involve a lot more disk activity than the SPEC CPU2000 benchmarks. This means that there will be periods when IO requests are being waited on and the logical processor is idle. During these idle periods the other logical processor will be able to make use of all of the processor resources. This is an important realistic system effect that was not present in the above SPEC measurements.

The compile benchmark (Linux kernel build) was first tested to measure the effect of using parallel build processes. The make system managing the compilation can be instructed to aim for a specified number of parallel tasks (the -j option). Parallel builds of up to 8-way were performed on the configurations of logical and physical processors shown in Table 3.5. The results are shown in Figure 3.8. The results show that there is little to be gained from parallel builds on the uniprocessor (UP) configuration. This contradicts a common belief that a 3-way build of a Linux kernel on a uniprocessor is usually quicker than a 1-way build. With large memory sizes (and therefore large buffer-caches) and good disk prefetching, the disk latency hiding offered by the parallel build is not required.

The two physical processor SMP configuration shows a much reduced build time when a 2-way build is performed. This is the expected behaviour as the two build processes can make use of the doubled processing resource available. The single package, two Hyper-Threaded logical processor configuration (1x2) shows a similar but much less pronounced behaviour, with the reduction in build time of a 2-way over a 1-way build being about 10%. The 1-way build on the Hyper-Threaded configuration is itself about 7% quicker than a 1-way build on the same processor with Hyper-Threading disabled. This is due to the limited amount of parallelism available in the 1-way build (pipelined compilation) as well as the Hyper-Threaded configuration's ability to simultaneously execute support and background tasks such as interrupt handlers and kernel threads. The two package, each of two logical processors, configuration (2x2) shows the build time decreasing further as the parallelism is increased beyond 2. The reduction from 3-way to 4-way is only minor but the increase in CPU time shows that all four logical processors are being used. The system-wide IPC was 0.733 for the 3-way build and 0.740 for the 4-way.

[Figure 3.8 plot; curves show Real and CPU time for the UP, SMP, 1x2 and 2x2 configurations. Annotation: note the increase in CPU time with no benefit in real (wall-clock) time.]

Figure 3.8: Build times (“Real” wall-clock time and CPU time) for two Linux kernels: comparing different parallel make options. CPU time is the integral of the number of logical processors active over time; in cases other than UP it can therefore be greater than the real time.

Metric                      Number of make processes (-j)
                            1        2        3        4        5        6
Bus activity (events/sec)   14.2M    25.2M    33.2M    36.7M    37.1M    37.1M
L2 misses/sec               3.18M    5.72M    7.59M    8.41M    8.49M    8.50M
IPC                         0.438    0.623    0.733    0.740    0.742    0.740

Table 3.6: Performance counter data for 2x2 builds of the Linux 2.6.7 kernel. Values are summed across all logical and physical processors.

Processor performance counters were used to investigate the reason for this lack of gain. Bus activity, level 2 cache misses, and retired instructions were counted and the aggregated results are shown in Table 3.6. The bus activity appears to saturate at 37.1 million events per second, a utilisation of approximately 9%. Tests with hdparm show that the maximum buffer-cache bandwidth that this machine-kernel combination is able to support is around 310MB/s which gives a bus activity of 37.7 million events per second. This suggests that buffer-cache bandwidth is the limiting factor for the kernel builds on 3 or 4 logical processors.

Both Hyper-Threaded configurations show a large increase in CPU time while the wall-clock time decreases.

                 mp3      compile   glx      image
HT    mp3        0.605    0.681     0.471    0.829
      compile    0.721    0.579     0.436    0.667
      glx        0.845    0.743     N/A      0.773
      image      0.628    0.565     0.386    0.554
SMP   mp3        0.995    0.987     0.981    0.990
      compile    0.995    0.884     0.832    0.971
      glx        0.995    0.980     N/A      0.982
      image      0.994    0.980     0.946    0.986

Table 3.7: Performance ratios of the "desktop" applications running in multiprogrammed pairs on a Northwood Hyper-Threaded processor. A figure denotes the performance of the row benchmark while running with the column benchmark. The performance ratio metric is the same as that used for the SPEC experiments above where a ratio of 0.5 represents zero-cost context switching.

This is due to the mutual slowing effect the threads have on each other. CPU time is a figure that has to be treated carefully in SMT systems because the amount of processing that can be done per unit time will depend upon the degree of sharing and contention between the threads.

For the purpose of the thread interaction experiments a single logical processor was used for each workload. Therefore the UP results are of interest. There is a negligible difference in performance between a 1- and 2-way build so a 1-way build was used for the further experiments.

A cross-product set of experiments was performed in a similar manner to the SPEC CPU2000 experiments above. The only exception was that glx could not be run with a second instance of the same benchmark because there is only one screen with one X server process. The results for pairs executed under Hyper-Threading and SMP configurations are shown in Table 3.7 using the performance ratio described in Section 3.2.4. The mean system speedup for the pairs was 1.33; this is considerably more than the 1.20 seen above which is due to the less compute-bound nature of these desktop applications.

The pairs involving glx run under Hyper-Threading show that the graphics benchmark takes more of the processor resource than the competing application. mp3 coexists well with applications other than glx. It places a low demand on the cache hierarchy and its compute demand can easily take advantage of processor resources while its more IO-bound competitor is stalled. compile performs a large amount of file IO; its processor was idle for 3% of its execution time when running alone or with mp3, and 7% when running against another instance of itself. The benchmark's reduced performance when running against another instance of itself under SMP shows that the contention is partly system-wide, such as bus activity or disk IO/buffer cache contention.

3.3.4 Summary

I have measured the mutual effect of processes simultaneously executing on the Intel Pentium 4 processor with Hyper-Threading. The results show speedups for individual benchmarks of up

to 30 to 40% (with a high variance) compared to sequential execution. I have expanded on these results to consider the bias between the simultaneously executing processes and shown that some pairings can exhibit a performance bias of up to 70:30. Using performance counters I have shown that many results can be explained by considering cache miss rates and resource requirement heterogeneity in general. Examination of performance counter data has shown that threads with high cache miss rates can have a detrimental effect on simultaneously executing threads. Those with high L1 miss rates tend to benefit from the latency hiding provided by Hyper-Threading. In the next section I investigate how the behaviour changes during the course of the programs' execution and whether a more formal correlation of performance counter data to observed performance exists.

3.4 Phases of Execution

The experiments described above illustrate a wide range of behaviour when independent processes are running on the two logical processors of a Hyper-Threaded processor. The results describe the behaviour of complete runs of each program. Most programs exhibit changing behaviour during their execution, often in a number of distinct phases [Sherwood99]. In order to see how the mutual effect of concurrently running processes changes as the programs move through different phases the following analysis was performed.

A subset of the SPEC CPU2000 benchmarks was used. The benchmarks were rank ordered based on their mean system speedup observed in the earlier experiments. Eight benchmarks were reasonably uniformly selected with care to maintain a balance of integer and floating point benchmarks. The benchmarks used were: 164.gzip, 181.mcf, 186.crafty and 255.vortex (integer); and 171.swim, 177.mesa, 179.art and 200.sixtrack (floating point).

Each benchmark was run alone on the first logical processor of the second physical package. The other logical processor of that package was idle. The Linux interrupt processor affinity facility (/proc/irq/*/smp_affinity) was used to force interrupts (except timer and inter-processor interrupts) to be handled by one physical processor while experiments were performed on the other.
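The interrupt affinity facility is driven by writing a hexadecimal bitmask of allowed logical processors to the per-IRQ smp_affinity file. The fragment below is a minimal sketch of the idea; the IRQ number and the mask value are illustrative only, not the exact values used in these experiments.

    /*
     * Hedged sketch: steer an interrupt onto one physical package by
     * writing a CPU bitmask to /proc/irq/<n>/smp_affinity.  The IRQ
     * number (19) and mask (0x3, logical CPUs 0 and 1) are illustrative.
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *f = fopen("/proc/irq/19/smp_affinity", "w");  /* illustrative IRQ */
        if (!f) {
            perror("fopen");
            return EXIT_FAILURE;
        }
        fprintf(f, "3\n");   /* hexadecimal bitmask of allowed logical CPUs */
        fclose(f);
        return EXIT_SUCCESS;
    }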

During each run a set of performance counter values, including instructions retired on each logical processor, was collected at intervals of approximately 100ms. The performance counters were configured to count events from both user and kernel execution modes to allow activity of the benchmarks to be included. This configuration also causes the remaining interrupt activity to be counted. The activity is mainly the scheduler responding to timer interrupts and inter-processor interrupts for reading the performance counters; the effects of both are small and constant.

Interpolation of the performance counter samples was used to find the start and end retired instruction counts for each benchmark run (taking care to deal with counter overflow). The performance counter data was then divided into 100 windows per benchmark run based on an equal number of retired instructions per window. The cycle counter and the remaining performance counters were interpolated from the samples to obtain the increment for each counter in each window.
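As a rough illustration of this windowing step, the sketch below linearly interpolates a cumulative counter between the samples that bracket each window boundary, the boundaries being equally spaced in retired instructions. The data layout and names are hypothetical, and the counter overflow handling mentioned above is omitted for brevity.

    /*
     * Hedged sketch: split a run into nwin windows of equal retired-instruction
     * counts and obtain the per-window increment of a second counter (e.g. the
     * cycle counter) by linear interpolation between periodic samples.
     * The sample layout and function names are hypothetical.
     */
    #include <stddef.h>

    struct sample {
        double insts;    /* cumulative retired instructions at this sample */
        double counter;  /* cumulative value of the counter of interest */
    };

    /* Interpolated counter value at a given cumulative instruction count. */
    static double counter_at(const struct sample *s, size_t n, double insts)
    {
        for (size_t i = 1; i < n; i++) {
            if (insts <= s[i].insts) {
                double frac = (insts - s[i - 1].insts) /
                              (s[i].insts - s[i - 1].insts);
                return s[i - 1].counter + frac * (s[i].counter - s[i - 1].counter);
            }
        }
        return s[n - 1].counter;
    }

    /* Per-window counter increments for nwin equal-instruction windows. */
    static void window_increments(const struct sample *s, size_t n,
                                  int nwin, double *out)
    {
        double total = s[n - 1].insts - s[0].insts;
        double prev  = s[0].counter;

        for (int w = 1; w <= nwin; w++) {
            double boundary = s[0].insts + total * w / nwin;
            double cur = counter_at(s, n, boundary);
            out[w - 1] = cur - prev;
            prev = cur;
        }
    }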


Figure 3.9: Performance ratio (relative to running alone on the processor) of 3 runs of 181.mcf while competing with 164.gzip.

The same pairwise simultaneous execution of benchmarks as performed in Section 3.3 was repeated, the only difference here being the use of the high frequency performance counter sampling. The 100 window data processing described above was performed on each benchmark of each pair. Because instructions rather than cycles were used to split the execution, the windows from a simultaneous run of a given benchmark correspond to the windows from its base run. There will be differences in the number of instructions retired because of nondeterministic effects and differing interrupt activity; however, the splitting into a fixed number of windows rather than a fixed window size minimises any error. It is possible to see the differing slowdowns for each window. For example, window n in a hypothetical benchmark's base run may have taken 355,000,000 cycles to execute; the same window from that benchmark while it ran simultaneously with a second benchmark may have taken 580,000,000 cycles - the result is a slowdown of 39% for this benchmark for this window (or a performance ratio of 0.61 using the metrics of Section 3.2.4). Figure 3.9 shows an example of how the performance ratio of a benchmark can change over time. The example is 181.mcf running simultaneously with 164.gzip; each plot is a different run with a different offset from the phase of the competitor's execution. Note that this says nothing about the competitor's own performance.

The analysis was originally performed using 1000 windows but a number of co-located peaks and troughs were seen, with peaks showing a performance under Hyper-Threading of more than twice that when running alone. The co-located troughs suggest that the small localised changes in performance, such as interrupt handling and differing delays in I/O operations, were not being "averaged out". The analysis was repeated with a fresh base run of the subject benchmark. There was some correlation of the peak-trough pairs, however it was clear that transient behaviour was being observed. The number of windows was reduced to 100 to try to smooth these variations. The result was much better, with the traces based on the two different base runs showing a high correlation. The smaller number of windows is still sufficient to show the changing behaviour of the benchmark.

Figures 3.10 to 3.17 show the performance ratio (compared to running alone on the processor) of each of the subject benchmarks. Runs against different background benchmarks are shown in different colours. There are three runs with each background benchmark, each with a different offset in the phases of the two benchmarks. These graphs tell two stories: the effect of the subject benchmark on its own SMT performance, and the effect of the background benchmark on the SMT performance of the subject. The shape of the line is largely influenced by the subject benchmark while the amplitude (or vertical shift) is largely influenced by the background. It can be seen that when the performance ratio is generally good the shapes of the plots are largely determined by the subject benchmarks but as the effect of the background becomes greater and the subject performance ratio drops the definition of the shape is reduced.

For any given benchmark pair the graphs show three lines for successive runs of the subject benchmark. The background benchmark was running in a continuous loop in the background so will be out of phase with the subject. It is therefore notable that the three lines in most cases match so closely. In such cases the phase of the background benchmark is having little effect on the subject benchmark performance ratio. The choice of background does however appear to determine the average performance of the subject, shown as the vertical placement of the plots. The exceptions to this behaviour are generally when both benchmarks in a pair exhibit a high level of variation in their phases. Taking 164.gzip and 255.vortex as examples: both have definite phases and show less correlation between the subject lines when they compete against each other.

255.vortex exhibits a periodic spiky pattern. Note that this pattern is evident in all the traces but is inverted for the traces with a lower over-all performance (when competing with mcf, art and swim). The reason for this is that vortex has a moderate L2 miss rate which follows the periodic pattern; while competing against benchmarks with low L2 miss rates the periods where vortex itself suffers a high L2 miss rate result in it having a low native IPC. During periods of low L2 miss rate its IPC is better. The result is that a benchmark with a low L2 miss rate, and therefore with a generally good IPC, can do more damage to vortex during the latter's periods of low L2 miss rate. When the competitor places a high demand on the L2 cache the miss rate of vortex is magnified and results in a slowdown. Since these competitors will have fairly low IPCs they do little harm to vortex during its low L2 miss rate periods.

The abrupt drops in performance ratio seen independently in the plots for 179.art and 186.crafty when competing against each other are coincident and occurred during the first of the three timed runs of art. After this run of art completed, the performance ratio of crafty and of subsequent runs of art returned to the higher level (0.4 to 0.5 for crafty and 0.8 to 0.9 for art); this appears to be a rare occurrence. Inspection of Figure 3.16 shows that the position of the abrupt drop within art's execution is coincident with lesser, but repeatable, drops in the benchmark's performance ratio when executing with 255.vortex.


Figure 3.10: Performance ratio of 164.gzip while running simultaneously with other benchmarks.

[Figure 3.11 plot; annotations: the start-up phase generally performs better than the average of the entire run; a periodic pattern is visible in most pairings.]

Figure 3.11: Performance ratio of 181.mcf while running simultaneously with other benchmarks.

[Figure 3.12 plot; annotation: a large phase change in the background benchmark is visible.]

Figure 3.12: Performance ratio of 186.crafty while running simultaneously with other benchmarks.

[Figure 3.13 plot; annotations: three phases are visible and become less pronounced as performance drops due to the competing benchmark, although smaller-scale patterns are still visible but inverted (see text).]

Figure 3.13: Performance ratio of 255.vortex while running simultaneously with other benchmarks.

[Figure 3.14 plot; annotation: troughs due to the competitor, each of the three occurring on a subsequent run.]

Figure 3.14: Performance ratio of 171.swim while running simultaneously with other benchmarks.

[Figure 3.15 plot; annotations: very little variation; troughs due to the competitor.]

Figure 3.15: Performance ratio of 177.mesa while running simultaneously with other benchmarks.


Figure 3.16: Performance ratio of 179.art while running simultaneously with other benchmarks.


Figure 3.17: Performance ratio of 200.sixtrack while running simultaneously with other benchmarks.

3.4.1 Performance Counter Correlation

The data collected in these experiments included performance counter increments for each window for level 1 (L1), level 2 (L2) and trace-cache (TC) misses, floating point (FP) operations and the total number of retired instructions (Insts). These metrics were counted independently for each logical processor. For each subject benchmark there were 3 runs with each of 7 background benchmarks giving a total of 16,800 samples. In order to assess whether there was any correlation between the performance counter increments and the subject benchmark performance ratio a multiple linear regression analysis was performed with the performance counter increments per cycle as the explanatory variables and the subject performance ratio as the dependent variable. The coefficient of determination (the R2 value), an indication of how much of the variability of the dependent variable can be explained by the explanatory variables, was 66.5%. The coefficients for the explanatory variables are shown in Table 3.8 along with the mean number of counted events per cycle for each variable (shown to put the magnitudes of the variables into context). The fourth column of the table indicates the importance of each counter in the model by multiplying the standard deviation of that metric by the coefficient.

Counter       Coefficient    Mean Events/1000 Cycles   Importance
              (to 3 S.F.)    (to 3 S.F.)               (coeff x st.dev.)
(Constant)    0.4010
TC-subj       29.7000        0.554                     26.2
L2-subj       55.7000        1.440                     87.2
FP-subj       0.3520         52.0                      29.8
Insts-subj    -0.0220        258                       -4.3
L1-subj       2.1900         10.7                      15.4
TC-back       32.7000        0.561                     29.0
L2-back       1.5200         1.43                      2.3
FP-back       -0.4180        52.6                      -35.3
Insts-back    0.5060         256                       99.7
L1-back       -3.5400        10.6                      -25.3

Table 3.8: Multiple linear regression coefficients for estimating the performance ratio of the subject bench- mark. The performance counters for the logical processor executing the subject benchmark are suffixed “subj” and those for the background benchmark’s logical processor, “back”.

The most significant predictor is the IPC of the background task followed by the L2 miss rate of the subject benchmark. The negative coefficient for the L1 misses of the background task shows that a high miss rate will reduce the performance ratio of the subject benchmark. A higher L1 miss rate for the subject benchmark itself will generally mean it has a better performance ratio. This apparent discrepancy arises because subject benchmarks that already have high native L1 miss rates lose little when running under Hyper-Threading. The statistical significance of each variable can be tested by calculating the p-value for that variable. For a 95% confidence, any p-value greater than 0.05 will mean that the corresponding variable is insignificant. The L2 miss rate of the background process had a p-value of 0.100 so is insignificant (in practical terms this metric is covered largely by the IPC of the background process). The IPC of the subject process had a p-value of 0.0369 and has a small weighting on the final result. All other variables had negligible p-values.


Figure 3.18: Testing the MLR estimation model on pairs from 175.vpr, 252.eon, 168.wupwise and 183.equake.


The performance ratio of a benchmark can only be calculated directly when its standalone execution time is known; in these experiments each benchmark was run both standalone and in simultaneous execution competition. In a live system it is not possible to know what the execution time would have been for a given chunk of an application's execution. It would be useful to be able to read performance counters and estimate the current performance ratio. The model from the regression analysis above was tested on a different set of benchmarks to see how accurately it could estimate performance. The same experiments were performed using 175.vpr, 252.eon, 168.wupwise and 183.equake. The coefficient of correlation between the estimated and measured values was 0.853. Figure 3.18 shows a scatter plot of estimated and measured performance ratio figures for each of the 100 windows for each pair.
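To make the estimation step concrete, the sketch below applies the coefficients of Table 3.8 to a set of per-cycle counter rates to obtain an estimated performance ratio for the subject thread; the structure and function names are illustrative rather than the code actually used for these experiments.

    /*
     * Hedged sketch: estimate the subject thread's performance ratio from
     * per-cycle performance counter rates using the multiple linear
     * regression coefficients of Table 3.8.  All rates are events per cycle
     * for the current window; the struct layout and names are illustrative.
     */
    struct hp_rates {
        double tc_subj, l2_subj, fp_subj, insts_subj, l1_subj;   /* subject */
        double tc_back, l2_back, fp_back, insts_back, l1_back;   /* background */
    };

    static double estimate_perf_ratio(const struct hp_rates *r)
    {
        return 0.4010
             + 29.70  * r->tc_subj
             + 55.70  * r->l2_subj
             + 0.352  * r->fp_subj
             - 0.022  * r->insts_subj
             + 2.19   * r->l1_subj
             + 32.70  * r->tc_back
             + 1.52   * r->l2_back
             - 0.418  * r->fp_back
             + 0.506  * r->insts_back
             - 3.54   * r->l1_back;
    }

Feeding the mean per-cycle rates from Table 3.8 into this expression gives an estimate of roughly 0.62, which is in line with the typical performance ratios reported earlier in the chapter.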

3.5 Asymmetry

In this section I describe experiments performed to assess the importance of the difference between physical and logical processors when making scheduling decisions. I show that groups of four benchmarks scheduled to four logical processors belonging to two physical processors can exhibit a variety of system-wide performances depending on the particular allocation. The range of relative performance differences seen was up to 30%.

A subset of the SPEC CPU2000 benchmarks was used. The full set was not used because of the large number of possible combinations of groups of 4. The benchmarks were rank ordered based on their mean system speedup observed in the earlier experiments. Eight benchmarks were reasonably uniformly selected with care to maintain a balance of integer and floating point benchmarks. The benchmarks used were: 164.gzip, 181.mcf, 186.crafty and 255.vortex (integer); and 171.swim, 177.mesa, 179.art and 200.sixtrack (floating point). All possible combinations of 4 benchmarks from this set were used in the experiments.

Run   Physical Processor 0              Physical Processor 1
      Hyper-Thread 0   Hyper-Thread 1   Hyper-Thread 0   Hyper-Thread 1
1     A                B                C                D
2     A                C                B                D
3     A                D                B                C

Table 3.9: Allocations of benchmarks A, B, C and D to processors.

Each experiment consisted of a run for each of the three possible allocations of processes to processors (ignoring equivalent cases). Table 3.9 shows these allocations. In a similar manner to the earlier experiments each benchmark was run in a loop on its respective processor and timings were started after all benchmarks had completed at least one full run. Three timed runs were used and the per-benchmark performance ratio was calculated as its base run time divided by its simultaneous runtime. The system speedup in this experiment is the sum of all four benchmark performance ratios divided by the number of physical processors (2). This provides a speedup normalised to sequential processing on both physical processors (with no Hyper-Threading).
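The metrics are simple to state in code; the small sketch below (names are illustrative) computes the per-benchmark performance ratio and the normalised system speedup for a group of benchmarks spread over a given number of physical processors, so a pair on one package and a group of four on two packages are handled uniformly.

    /*
     * Hedged sketch of the throughput metrics used in this chapter.
     * base_time[i] is benchmark i's run time when running alone and
     * simul_time[i] its run time when running simultaneously with the
     * others.  Names and layout are illustrative.
     */
    static double perf_ratio(double base_time, double simul_time)
    {
        /* A ratio of 0.5 corresponds to zero-cost context switching for a pair. */
        return base_time / simul_time;
    }

    static double system_speedup(const double *base_time, const double *simul_time,
                                 int nbenchmarks, int nphysical)
    {
        double sum = 0.0;
        for (int i = 0; i < nbenchmarks; i++)
            sum += perf_ratio(base_time[i], simul_time[i]);
        /* Normalise to sequential execution on the physical processors used. */
        return sum / nphysical;
    }

For the pairs of Section 3.3 nphysical is 1, so the two ratios of the mcf/mesa pair sum to the 1.50 system speedup quoted earlier; for the groups of four used here nphysical is 2.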

The Northwood machine used in the earlier experiments and described in Table 3.2 was used for this set of experiments. The machine had two physical processors each with two Hyper-Threaded logical processors. As in previous experiments the Linux 2.4.19 kernel was modified with a patch to allow explicit allocation of processes to processors.
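As a point of reference, the sketch below shows the equivalent pinning of a process to a single logical processor using the sched_setaffinity(2) interface that later became the standard way to express this; the experiments themselves used the cpus_allowed patch against Linux 2.4.19, and the CPU number in the sketch is illustrative only.

    /*
     * Hedged sketch: pin the calling process to one logical processor using
     * the later standard sched_setaffinity(2) interface (the experiments in
     * this chapter used a cpus_allowed patch on Linux 2.4.19 instead).
     * Logical CPU 2 is illustrative only.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(2, &mask);                       /* illustrative logical CPU */

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... run or exec the benchmark workload here ... */
        return 0;
    }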

It might be imagined that these experiments could be conducted on paper using the results from the earlier experiments by simply summing the performance ratio figures from the two pairs of benchmarks. This however does not take into account the extra system-wide interaction, for example on the memory bus.

For each group of four benchmarks the relative performance difference is recorded. The range of relative performances is shown in Figure 3.19. The largest difference is 1.30 and the mean value is 1.10. In 40% of cases the best schedule resulted in a greater than 10% speedup over the worst schedule.

These experiments were performed in order to assess the performance gains possible by careful allocation of processes to processors. The processes remain on the same processor for the entire experiment. As the benchmarks go through different phases of execution it is likely that, at any given time, there will be a better allocation of processes to processors than the single static allocation that yields the best over-all performance. Furthermore, gains may be had by resorting to time-sharing (and idling one logical processor) when simultaneous processes interact so badly that their combined performance is worse than serialising them.


Figure 3.19: Cumulative distribution of relative performance differences between the alternative processor allocations for groups of four benchmarks.

3.6 Summary

In this chapter I have presented measurements of workloads running on Intel Hyper-Threaded processors in real, production systems. I have shown that the system-wide performance is highly variable and depends on the resource demands and characteristics of the simultaneously executing processes. I have shown that the share of performance between simultaneous processes is not always fair. I then showed how the performance of the workloads varied over the course of their execution and how the different phases of the simultaneously running processes affect the workloads. Using data from these experiments I have shown that the performance can be estimated on-line using data from performance counters with a reasonable confidence; this result is used in the following chapter to aid scheduling decisions.

I finished the chapter with a study to demonstrate the performance gains possible with SMT-aware scheduling. I showed that groups of four applications scheduled onto the four logical processors in a two physical package Hyper-Threaded system could exhibit performance differences of up to 30% depending on how they were allocated to the processors. The next chapter expands on this topic by adding SMT awareness to the operating system scheduler.

Chapter 4

A Process Scheduler for SMT Processors

In this chapter I describe the problems with using a traditional process scheduler with simultaneous multithreaded processors. I describe the characteristics of SMT that cause these problems and discuss techniques that could be used in a scheduler to avoid the problems. I present implementations of schedulers for the Linux 2.4 kernel that are sensitive to the characteristics of SMT processors and able to exploit them. I go on to evaluate my scheduler against a traditional scheduler and comment on how the techniques can be applied to other operating systems.

4.1 The Context

Intel originally marketed the Pentium 4 with Hyper-Threading as a way to get two processors in a single package. As described in Chapter 2, the two logical processors are abstracted to make them appear as two separate processors in an SMP arrangement. The operating system need not therefore be aware that a Hyper-Threaded processor is being used. Whilst this is sufficient for correctness, it is not ideal for performance. In Chapter 3 I showed that processes running on different logical processors could affect each other's performance with a large degree of variability.

There are two ways in which an SMT processor could be allocated by a scheduler:

• The processor could be treated as a set of individual, independent logical processors. This requires that the hardware threads are heavyweight, i.e. processes. This interface is provided by Intel's Hyper-Threaded processors.

• An SMT processor could be used to run true multithreaded workloads. In this scenario the physical processor is allocated as a single resource rather than separating out the logical processors. A multithreaded workload written or compiled for this particular SMT processor would be able to exploit its characteristics.

In this Chapter I choose to focus on the first scenario for the following reasons:

• Many common workloads are not written to be multithreaded; those that are tend to have a dominant main thread [Lee98] with other threads providing occasional support functions.

• A multithreaded workload compiled for one implementation of SMT may not work very well on another due to differing characteristics. This reduces the value of the second allocation strategy.

• With Hyper-Threading now available on standard desktop Pentium 4 processors, they will be used for a greater range of non-scientific workloads which tend not to be multithreaded or are not compiled for specific processor generations.

4.1.1 Problems with Traditional Schedulers

A scheduler that is completely unaware of the hierarchical nature of SMT processors using the logical processor abstraction will not know that resources are shared between "sibling" processors. As described in Chapter 2 this can lead to runnable tasks being scheduled on two logical processors of one package while another package remains idle. In addition, such a scheduler misses out on some scheduling flexibility: since logical processors on the same package share much state, particularly caches, a process can be migrated between logical processors with little loss compared to migrating between physical processors. This flexibility is useful when balancing load across processors.

Even if a scheduler is aware of the hierarchy and can avoid these problems then it is still not able to take the thread interactions into account in order to maximise the throughput of the system. Schedulers generally view all tasks equally in terms of their execution characteristics (with the exception of priority boosts for IO-bound tasks).

4.1.2 Scheduling for Multithreading and Multiprocessing Systems

In this section I present a brief overview of scheduling schemes and terminology used in multithreaded and multiprocessor systems.

The term processor affinity is used to describe a given process having a preference to be scheduled to execute on one or more specified processors. Processor affinity can be influenced by factors such as soft-state built up on one processor, the suitability of a processor for the task or the available resources in a heterogeneous system. Most multiprocessor schedulers use processor affinity in some form.

Cache affinity is a form of processor affinity where the process has an affinity for the soft-state it has built up in the processor caches. Whilst this may seem to be the same as basic processor affinity it is actually a dynamic scheme. If a given process is preempted by a second process that causes data belonging to the first process to be evicted from the cache then the first process' affinity for that cache (and therefore processor) reduces. The extreme is that all data belonging to the first process is evicted. In this case the process has no affinity for any cache/processor so the scheduler can assign it to any processor. Measuring the cache affinity of a process accurately would need potentially costly hardware support but an approximation can be made by counting cache line evictions to estimate process cache footprints [Squillante93].

The concept of cache affinity is relevant to SMT architectures because the caches are shared by the threads. The scheduler need not worry which logical processor within a given physical package it assigns a process to; the view of the cache will be the same.

When an affinity-based scheme is in use it is possible that the load will be unevenly shared between processors. There are a number of schemes that are used to re-balance load. A work-sharing scheme is where newly created threads are migrated to a different processor from the one executing the thread that created them. This differs from the standard behaviour where child threads inherit the affinity data from the parent thread. Work-stealing is where an under-utilised processor "steals" threads from busier processors.

4.1.2.1 NUMA Architectures

The hierarchical nature of SMT systems places similar demands on a scheduler to non-uniform memory access (NUMA) systems. A NUMA system is a distributed shared-memory multiprocessor system where the latency of access to memory depends upon the location of the data. This is in contrast to the more familiar symmetric multiprocessor (SMP) systems where the basic memory access latency is the same regardless of which processor and memory locations are involved. NUMA systems are structured as a number of processing nodes each containing one or more processors and a section of the memory. Any processor can access any word of memory but accesses from nodes other than the processor's own will take longer. To get the best performance from a NUMA system the scheduler and OS memory management system must arrange that processes run on the node hosting the bulk of the data they require. The processes have an affinity for the node.

Although SMT systems are symmetric in terms of memory access latency they draw a number of parallels with NUMA systems when cache contents are considered. An SMT physical processor package can be viewed in a similar manner to a processing node in a NUMA system: it has multiple (logical) processors and some shared storage (caches in the SMT case) equally accessible to all processors in the node. Processes are package (or node, or cache) affine.

In both SMT and NUMA systems the affinity of processes is more flexible than simple processor affinity. Schedulers for both types of system can exploit this to give greater scope for more efficient schedules.

4.2 Related Work

In this section I describe related work on hardware support for thread scheduling issues and scheduler support for SMT processors. The relatively recent availability of SMT systems means that most of the related work is simulation-based.

4.2.1 Hardware Support

Cazorla et al suggested that it is impossible to predict the performance of a given thread running with other threads on an SMT processor [Cazorla04]. They proposed an operating system interface to the processor to allow a number of high priority threads to be run with a guaranteed IPC. Lower priority threads could then use any remaining processor resources. The operating system established each thread's requirements by using a sampling phase with each thread running alone. The OS would then change processor parameters controlling the sharing of resources in order to achieve the promised IPC.

It should be possible to use this design to enable more general operating system control of per-thread resources. Intel's Hyper-Threaded Pentium 4 could benefit from the technique if it was to allow the various statically partitioned or duplicated resources to have their partitioning dynamically changed by the OS. This would allow the OS to improve the fairness of simultaneously executing threads by adjusting the amount of resource they each use.

4.2.2 SMT-Aware Scheduling

Parekh et al introduced the idea of thread-sensitive scheduling [Parekh00]. They evaluated scheduling algorithms based on maximising a particular metric for the set of jobs chosen in each quantum. The metrics included various cache miss rates, data memory access time and instructions per cycle (IPC). The algorithm greedily selected jobs with the highest metric; there was no mechanism to prevent starvation. The work was based on a simulation of up to 100 million retired instructions from each job. What is not clear from their work is how costly the scheduling algorithm is to run and how they collected and recorded the performance metrics used by it. The simulator did not model the operating system in detail but instead performed the required functions directly. The relatively short simulation window would have masked most of the phase change effects. Their average speedup over a simple round-robin scheme was 11%. They found that maximising the IPC was the best performing algorithm over all their tests.

Snavely's work on "symbiotic jobscheduling" [Snavely00, Snavely02] described the "SOS" (sample, optimise, symbios) scheduler which operates in phases. The sample phase involved the scheduler evaluating different combinations of jobs and recording a selection of performance metrics for each jobmix. With this data in hand, the scheduler chose permutations of jobs giving the best composite performance which were then used in the "symbios" phase. Snavely noted that as jobs are started and terminated and as the behaviour of individual jobs changes during execution the optimal jobmixes will change. To allow for this the scheduler returned back to the sample phase to collect fresh data. The variables in the algorithm included the length of time to stay in each phase. Snavely suggested a ratio of 10 to 1 for the time spent in the symbios and sample phases respectively. The length of the sample phase will depend on the number of possible job combinations and the size of the subset of these desired. The work used simulation and it is not clear how much, if any, of the operating system was modelled. The scheduler provided an average speedup relative to naïve scheduling of about 7%.

Both Parekh et al and Snavely focused on scenarios where there were a number of largely compute-bound batch jobs to be completed. Although both schedulers could deal with jobs coming and going, the effective churn rate associated with a system running interactive, IO-bound jobs would stretch this ability.

Kumar et al proposed a multi-core processor architecture with a heterogeneous mix of core implementations (but all using the same ISA) [Kumar04]. The example they used was the Alpha architecture: the newer EV6 (21264) core is 5 times as large as the EV5 (21164) but not 5 times as fast. They proposed a multi-core design with 3 EV6 and 5 EV5 cores and compared it to a 4 core EV6 design. In addition they considered an SMT version of the EV6. They found that careful allocation of workloads to processes and SMT threads could give them better system-wide throughput than the more simple CMP arrangements. They proposed a scheduler in a similar spirit to Snavely et al (Dean Tullsen is an author on both Snavely and Kumar's papers). They

built on this by adding a trigger mechanism to cause the sampling phase to be re-entered. They suggested three triggering schemes. Individual-event: if a thread's IPC changed by more than 50%; global-event: if the sum of all IPCs changed by more than 100%; and bounded-global-event: as global-event but does not allow a re-sample until at least 50 million cycles have elapsed and forces a re-sample after 300 million cycles. They found the three mechanisms performed in the order given above with bounded-global-event being the best.
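Purely as an illustration of the triggering rule described above (this is not code from [Kumar04]), the bounded-global-event check could be expressed as follows.

    /*
     * Illustrative sketch of the bounded-global-event trigger: re-sample if
     * the sum of all threads' IPCs has changed by more than 100%, but never
     * sooner than 50 million cycles after the last sample, and always after
     * 300 million cycles.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define MIN_RESAMPLE_CYCLES  50000000ULL
    #define MAX_RESAMPLE_CYCLES 300000000ULL

    static bool should_resample(double ipc_sum_at_last_sample, double ipc_sum_now,
                                uint64_t cycles_since_sample)
    {
        if (cycles_since_sample < MIN_RESAMPLE_CYCLES)
            return false;
        if (cycles_since_sample >= MAX_RESAMPLE_CYCLES)
            return true;

        double change = ipc_sum_now - ipc_sum_at_last_sample;
        if (change < 0)
            change = -change;

        /* "Changed by more than 100%" relative to the previously sampled sum. */
        return change > ipc_sum_at_last_sample;
    }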

Nakajima and Pallipadi used data from processor performance counters to change the package affinity of processes in a two package, each of two Hyper-Threads, system [Nakajima02]. They aimed to balance load, mainly in terms of floating point and level 2 cache requirements, between the two packages. Their "scheduler" was implemented as a user-mode application that periodically queried performance counters and swapped pairs of processes between packages by changing the processor affinity masks of the processes. They measured speedups over a standard scheduler of approximately 3% and 6% on two test sets chosen to exhibit uneven demands for resources. They only investigated workloads with four active processes, the same number as the system had logical processors. The technique can extend to scenarios with more processes than processors, however the infrequent performance counter sampling can hide a particularly high or low load of one of the processes sharing time on a logical processor.

Fedorova et al argued that "chip multithreading" systems (those where the physical processor has multiple cores each running multiple hardware threads) would benefit from an intelligent scheduler [Fedorova04b]. In later work they implemented schedulers based on cache modelling and working-set techniques to reduce level 2 (L2) cache miss rates [Fedorova04a]. This work assumed an interleaved style of multithreading. While L2 miss rates are important to SMT scheduling the dynamic nature of the resource sharing means that other factors need to be taken into account.

4.3 Practical SMT-Aware Scheduling

I now describe the design, implementation and evaluation of a number of techniques to incorporate knowledge of an SMT processor into the scheduler. The common theme is the provision of heuristics, based on performance counter observations, to a general purpose scheduler.

For a scheduler design or modification to be practical and useful it must fit into the existing model of scheduler design. Since real schedulers are designed to work on many different systems, with and without SMT, a scheduler that is too SMT specific will be of little interest. Existing schedulers are generally of a dynamic priority nature: tasks with the highest priority at any given time are scheduled to run on the available processors; the priorities change over time to avoid starving a task of processor time. A sensible place to influence the scheduling decision in an SMT-sensitive way is in the calculation of dynamic priority. By doing this the existing facilities of static base priority, priority boosts (temporary increases in priority often given to tasks that have recently had I/O requests satisfied) and ageing (to avoid starvation) can be retained.

An alternative is to use SMT-specific knowledge to change the processor affinity and allow the scheduler to operate in the traditional manner. This method has a much longer-term effect than

priority modification and makes responding to the different phases of a task's execution more difficult.

The SMT specific knowledge useful to scheduling will be of the interactions of processes with certain characteristics. The scheduler must therefore know about the characteristics of currently running, or candidate processes. Such knowledge could be acquired by static inspection of program text segments (the executable code) or dynamic measurement of the running processes. Static analysis of the programs is beneficial in its cost (a one time activity) but only provides a limited amount of information; effects such as mis-speculation and cache hit rates are important. These effects could only be obtained off-line through simulation/emulation; it would be just as well to run the code and measure it. Dynamic measurement of the running processes provides more information, not only on the process itself but on how it is interacting with the processes on the other logical processor(s). Dynamic measurement will incur an overhead for both sampling and evaluating the sampled data.

4.3.1 Design Space

The level of support for SMT-sensitive scheduling can vary from basic hierarchy awareness (as now provided in current operating systems) through to an attempt to create an optimal schedule ensuring the best processor throughput at all times. The cost and complexity of the scheme will generally increase as the number of features is increased. The increasing awareness of the SMT characteristics also makes a more complex scheduler less portable to other SMT and non-SMT systems. I implement a selection of points between SMT-unaware and a scheme fully aware of the particular characteristics of the SMT processor in use:

"naïve" A traditional multiprocessor scheduler that treats all logical processors across the system as equals.

"basic" A traditional multiprocessor scheduler modified to understand the logical/physical hierarchy.

“tryhard” A scheduler that uses measurement of the processes’ execution to avoid the bad cases and try to move to better combinations of processes to co-schedule.

“plan” A scheduler that uses measurement of the processes’ execution to try to obtain an optimal schedule in terms of processor throughput by periodically creating a scheduling plan.

“fair” This scheduler has the primary goal of ensuring fairness between processes. Processes that are not getting their fair share of resources when running under SMT are compensated with an extra share of processor time.

4.4 Implementation

I chose to implement the SMT-aware scheduler modifications on Linux because of the accessibility and ease-of-modification of that kernel. The modifications were made to a version 2.4.19 kernel, the latest stable version at the time the work was started and the same version as used for the measurements in Chapter 3. In Section 4.6 I discuss the applicability of these techniques to other operating systems and to the subsequent Linux 2.6 kernel.

4.4.1 The Linux Scheduler

The Linux scheduler in version 2.4 kernels is based around a single queue of runnable tasks [Bovet00]. The scheduling algorithm uses dynamic priorities based on static base priorities and a form of ageing to avoid starvation. Processes are preemptable, meaning that a new runnable process with a higher priority than the currently running process will force a reschedule. Scheduling decisions are made when a running process exhausts its current time-slice allocation or a higher priority process is woken up.

Scheduling works in a sequence of epochs. At the start of an epoch each process is given a time-slice of a number of ticks (a tick is usually 10ms) based on its static priority and the number of ticks it had left from the previous epoch. An epoch ends when there are no runnable processes, or all the runnable processes have exhausted their time-slices. The current process' tick counter is decremented during each timer interrupt. Each time a scheduling decision is made the dynamic priority is calculated for each runnable task and the one with the highest is selected to run. The dynamic priority, or goodness, of a task is zero if it has exhausted its time-slice, otherwise it is the number of ticks remaining of its time-slice added to its static priority. The goodness is further modified upwards if the candidate task shares an address space with the previously running task.

Multiprocessor scheduling uses the same single queue and per-task time-slices. Each processor runs the scheduling function and goodness calculations itself. Processor affinity is supported by increasing the goodness of tasks when calculated by the processor on which they last ran. Enforced processor affinity (such as the cpus_allowed feature used in Chapter 3) is implemented by only considering the goodness of tasks that are allowed to execute on the processor running the scheduling function.
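
As a rough user-space sketch of this calculation (field names and constants here are simplified stand-ins rather than the exact 2.4 source, and realtime tasks are ignored):

    struct task {
        int  counter;    /* time-slice ticks remaining in the current epoch */
        int  nice_prio;  /* static priority contribution */
        int  last_cpu;   /* logical processor the task last ran on */
        void *mm;        /* address space identifier */
    };

    #define PROC_CHANGE_PENALTY 15  /* affinity bonus (illustrative value) */

    static int goodness(const struct task *p, int this_cpu, const void *prev_mm)
    {
        if (p->counter == 0)
            return 0;                            /* time-slice exhausted */

        int weight = p->counter + p->nice_prio;  /* remaining ticks + static priority */

        if (p->last_cpu == this_cpu)
            weight += PROC_CHANGE_PENALTY;       /* soft processor affinity */
        if (p->mm == prev_mm)
            weight += 1;                         /* same address space as previous task */

        return weight;
    }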

4.4.2 Extensible Scheduler Modifications

In order to be able to efficiently test scheduler modifications without time-consuming machine reboots, the following modifications were made to the Linux scheduler. Hooks were put into the kernel to allow important functions to be overridden or callbacks to be received on certain events. A loadable kernel module (LKM) containing override functions was used to attach to these hooks. An LKM can be loaded and unloaded at run time, facilitating a fast development and experimental cycle. With the LKM unloaded, the default Linux scheduler functions were used.

The kernel evaluates the “goodness” of each runnable process whenever its schedule() function is called. The goodness() function is called with the task metadata of the runnable process being considered and returns a weight value based on the current situation and task properties. The task with the largest weight value will be scheduled on that processor. The standard Linux goodness() function returns a weight that is higher for realtime tasks than non-realtime tasks. Higher priority realtime tasks result in a larger weight. Non-realtime tasks have a weight that depends on the following factors:

• the number of time-slice ticks it has left in the current epoch,
• whether it was last run on the CPU that is currently performing the scheduling algorithm,
• whether it has the same address space as the task previously running on this CPU, and
• the “nice” value (static priority) the task has.

The standard goodness() function was modified to add a conditional call to an alternative function within the LKM.

The time-slice ticks for the currently running process are decremented during the timer interrupt handler's execution. A similar replacement function hook to the above was added to the update_process_times() function. This hook was used for the fractional tick technique described below.

Since my schedulers are designed to use data collected from performance counters they require notification of changes to running tasks so that the counters can be sampled and data attributed to the correct tasks. To this end, the context switching function, switch_to(), was hooked to provide a callback to the LKM. switch_to() is called just before a context switch; the module function is called with the task metadata of the task being switched out.
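
The hook arrangement amounts to a pair of function pointers that the kernel consults before falling back to its built-in behaviour. A minimal user-space sketch of the pattern (the names are illustrative, not those of the actual patch):

    #include <stddef.h>

    struct task;   /* stands in for the kernel's task_struct */

    /* Set by the loadable module; NULL means "use the default behaviour". */
    static int  (*goodness_override)(const struct task *p, int this_cpu) = NULL;
    static void (*switch_callback)(const struct task *prev) = NULL;

    static int default_goodness(const struct task *p, int this_cpu)
    {
        (void)p; (void)this_cpu;
        return 0;   /* placeholder for the stock Linux calculation */
    }

    /* The hooked call sites check for an override, then fall back. */
    static int goodness_hooked(const struct task *p, int this_cpu)
    {
        return goodness_override ? goodness_override(p, this_cpu)
                                 : default_goodness(p, this_cpu);
    }

    static void switch_to_hooked(const struct task *prev)
    {
        if (switch_callback)
            switch_callback(prev);   /* module samples counters at switch time */
    }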

4.4.2.1 Module Features

The scheduler module is able to read performance counters but their configuration is handled externally for simplicity. The mechanism described in Section 3.2.1 was used for this.

The LKM provides a monitoring function via the /proc file-system which allows online querying of certain scheduler parameters. A context switch recording facility was built into the module to allow the time and details of all context switches to be recorded for debugging purposes. This was implemented in the switch_to() callback and wrote the details into a circular buffer. This buffer was made available to a user-space graphical visualisation tool through a second /proc file. On each context switch the cycle-counter, logical processor number and the outgoing PID were logged along with up to two parameters according to the particular scheduler modification being used. This facility was only enabled during development of the schedulers and not during timed experimental runs.

4.4.3 Performance Estimation

The proposed scheduling designs require knowledge of each process' current performance ratio. The performance ratio of a process is defined in this dissertation as the rate of execution of the process when running in a given situation (simultaneously with another process in this case) compared to its rate of execution when running alone on the system. I showed in Chapter 3 how the performance ratio of a process running under SMT can change dramatically during the process' execution. In a live system it is not possible to know what the standalone execution rate of a process should be without actually running it alone for a period. The changing phases of a process' execution mean that this technique is of limited value. Instead the performance can be estimated from parameters measured by the hardware performance counters. In Section 3.4.1 I developed a linear model for this prediction. This model is implemented here.

At a sample point (defined by the particular scheduler design) the performance counters were read. The elapsed cycle count since the last sample can be combined with the differences between samples of the counter values to provide a performance estimate. The performance counters are per-package but can be accessed from either logical processor. The code must know which logical processor it is running on so that it can know which counters refer to the current logical processor and which to the other logical processor. A record was kept of the previous sample along with the cycle counter “timestamp”. Counter overflow was detected by observing a counter value as being less than its previous sample. The frequency of sampling was sufficiently high to allow this method. For the purpose of calculating the counter differences an overflowed value had its bit 40 set to account for the overflow (the counters are 40 bits wide).
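
A minimal sketch of this overflow handling, assuming (as described above) that sampling is frequent enough for at most one wrap between samples; the names are mine, not the module's:

    #include <stdint.h>

    #define PMC_WRAP (1ULL << 40)   /* the counters are 40 bits wide */

    /* Difference between two successive counter samples, allowing for one wrap. */
    static uint64_t pmc_delta(uint64_t prev, uint64_t curr)
    {
        if (curr >= prev)
            return curr - prev;
        return (curr + PMC_WRAP) - prev;   /* overflowed: set bit 40 before subtracting */
    }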

The model from Section 3.4.1 is based on events per cycle, using performance counter events for trace-cache (TC), level 1 (L1) and level 2 (L2) cache misses and floating-point (FP) and all (Insts) instructions retired. The events are counted independently for each logical processor; in the calculation of the estimated performance ratio for one of the logical processors the counters corresponding to that processor are referred to here as “subj” and those of the other logical processor as “back”. The model is (note that L2_back is missing because it was shown to be statistically insignificant):

performance = 0.401 + 29.7 * TC_subj + 55.7 * L2_subj + 0.352 * FP_subj
            - 0.022 * Insts_subj + 2.19 * L1_subj + 32.7 * TC_back
            - 0.418 * FP_back + 0.506 * Insts_back - 3.54 * L1_back

It is desirable to perform the calculations using integer arithmetic because floating point registers are not saved when entering the kernel. The most effective technique is to use a fixed-point scheme where the low bits of an integer value are interpreted as the fractional value; the calculation was performed with 15 bits of fractional value which gives a resolution of approximately 0.00003; the final result was right-shifted by 5 bits giving a resolution of approximately 0.001. The higher resolution during the calculation was required to prevent the smaller factors from being discarded. The more compact representation for the result was used to allow 32 bit arithmetic to be used in subsequent calculations.
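
A sketch of the fixed-point evaluation in user-space form; the helper names and the integer approximations of the coefficients are mine. Event rates are events per cycle encoded with 15 fractional bits and the result is right-shifted by 5 bits as described above:

    #include <stdint.h>

    #define FRAC_BITS    15   /* working resolution ~0.00003 */
    #define RESULT_SHIFT 5    /* final resolution ~0.001 */

    /* Events-per-cycle rate in Q15 fixed point. */
    static int64_t rate_q15(uint64_t events, uint64_t cycles)
    {
        return (int64_t)((events << FRAC_BITS) / cycles);
    }

    /* Estimated performance ratio for one logical processor, following the
     * model quoted above: "s" rates belong to the subject logical processor,
     * "b" to the background (other) one.  Coefficients are applied as
     * integer ratios, e.g. 29.7 as 297/10. */
    static int32_t estimate_performance(int64_t tc_s, int64_t l2_s, int64_t fp_s,
                                        int64_t insts_s, int64_t l1_s,
                                        int64_t tc_b, int64_t fp_b,
                                        int64_t insts_b, int64_t l1_b)
    {
        int64_t acc = 13140;               /* 0.401 in Q15 */
        acc += 297 * tc_s / 10;            /* +29.7  * TC_subj    */
        acc += 557 * l2_s / 10;            /* +55.7  * L2_subj    */
        acc += 352 * fp_s / 1000;          /* +0.352 * FP_subj    */
        acc -= 22  * insts_s / 1000;       /* -0.022 * Insts_subj */
        acc += 219 * l1_s / 100;           /* +2.19  * L1_subj    */
        acc += 327 * tc_b / 10;            /* +32.7  * TC_back    */
        acc -= 418 * fp_b / 1000;          /* -0.418 * FP_back    */
        acc += 506 * insts_b / 1000;       /* +0.506 * Insts_back */
        acc -= 354 * l1_b / 100;           /* -3.54  * L1_back    */
        return (int32_t)(acc >> RESULT_SHIFT);   /* compact form for 32-bit arithmetic */
    }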

The overhead of performing frequent performance estimates will depend upon how often the particular scheduler implementation calls the function. The evaluation of the schedulers measures this overhead as well as the gain since the entire system is being measured. An indication of the overhead was obtained by causing estimates to be performed without any other scheduler modifications in place; system speedups suffered approximately a 2% reduction when estimates for each logical processor were made during each timer interrupt. Determining the sensitivity to the frequency of calculating performance estimates is left to future work.

4.4.4 SMT-Aware Schedulers

In this section I describe the implementation of the different SMT aware scheduler designs intro- duced in Section 4.3.1.

4.4.4.1 Scheduler: “basic”

The only change required for the basic scheduler is the affinity of processes for physical packages rather than logical processors. This was implemented by changing the own-processor test when increasing the goodness. The logical processor numbering scheme and the number of logical processors per package mean that simply ignoring the least significant bit in the comparison of the processor running the goodness function and the process' previous processor allocation is sufficient. Newer Linux kernels incorporate this modification as a more general (and elaborate) “sibling map”.

    if (PHYSICAL_PACKAGE(p->processor) == PHYSICAL_PACKAGE(this_cpu))
        weight += PROC_CHANGE_PENALTY;

4.4.4.2 Scheduler: “tryhard”

The tryhard scheduler aims to schedule a task on a logical processor that will provide the best total package performance when running with the task on the other logical processor in that package. To do this a matrix of task pair performances needs to be kept.

Performance estimates were calculated using the technique described in Section 4.4.3 each time one of the logical processors on a package performed a context switch. The sampling was initiated from the switch_to() hook. Having obtained performance estimates for the two simultaneously running processes a rolling average for that pair's combined performance was kept. It is not feasible to store averages for every possible process pairing so a hash-table style scheme was used. This table needs to be fast to access for both updates and queries therefore an array indexed by a hash of the two PIDs was used. The array contained 64 entries and had a hash function designed to minimise the number of collisions in the common case of closely spaced PIDs:

    ((pid2 & 7) << 3 | (pid2 & 0x38) >> 3) ⊕ (pid1 & 0x3f)

where the PIDs were numerically ordered as pid1 and pid2. Collisions were ignored since the data is used only as a heuristic.
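
In C the scheme might be expressed as follows (function and array names are illustrative):

    #define PAIR_TABLE_SIZE 64   /* 6-bit hash */

    static unsigned int pair_perf[PAIR_TABLE_SIZE];   /* rolling-average pair performance */

    /* Hash two PIDs into a 6-bit index: the low 6 bits of the larger PID with
     * its two 3-bit halves swapped, XORed with the low 6 bits of the smaller. */
    static unsigned int pair_hash(int pid_a, int pid_b)
    {
        int pid1 = pid_a < pid_b ? pid_a : pid_b;
        int pid2 = pid_a < pid_b ? pid_b : pid_a;

        return (((pid2 & 7) << 3) | ((pid2 & 0x38) >> 3)) ^ (pid1 & 0x3f);
    }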

The scheduler used the recorded performance data to influence the choice of processes to schedule: the goodness of a candidate process was adjusted according to the recorded performance of that process when simultaneously executing with the process currently on the other logical processor of the package. The result was that processes that would give a higher system speedup were favoured but the time-slice and other components of the standard dynamic priority were still respected.

Initially the time-slice (the dynamic part of the priority) was factored by the performance value. However, the limited range of possible values restricted the effect of the modification. Instead a value based on the performance sample was added to the dynamic priority. This value was the performance sample divided by 64 to give a value normalised to 16; this conforms to the order of magnitude of the other factors in the goodness calculation and achieved the desired effect.
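
As a sketch, assuming the approximately 0.001-resolution performance representation of Section 4.4.3 (so that a combined pair performance of around 1.0 is stored as roughly 1000):

    /* Add the recorded pair performance, scaled down to the same order of
     * magnitude as the time-slice and nice terms, to the standard goodness. */
    static int smt_adjusted_goodness(int base_goodness, int pair_perf)
    {
        return base_goodness + pair_perf / 64;
    }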

4.4.4.3 Scheduler: “plan”

The plan scheduler attempts to produce an optimal schedule by choosing pairs of processes to simultaneously execute. It differs from tryhard in that it plans ahead what to execute rather than reacting to what the other logical processor is doing.

This approach of planning the schedule in advance is common in related work which, being simulation based, has the luxury of tasks that are always runnable and the ability to easily perform the scheduling function for all processors at once. In a real system it will often be the case that one or more tasks are blocked, even if they are nominally compute-bound. Additionally the scheduling decisions are not made simultaneously on all logical processors because of unsynchronised timer interrupts and reschedules performed on the return from system calls.

There are three parts to the implementation: data recording, schedule planning and schedule realisation. These are described in turn.

Each time a context switch was performed the summed performance ratios for the two processes on that physical processor (called the “package performance”) were used to update a record of “best partners”. These recorded, for each process p, the PID of the process q that gave the best physical package performance when simultaneously running with p. The idle task was excluded because its use of the HLT instruction allowed the simultaneously executing process to utilise all the processor resources; this would give a false indication of the performance. The package performance was stored so that successive updates could check whether an improved pairing has been found. To allow for changes in phase and the termination of processes the recorded values were decayed over time. The best partner record was stored as an extra field in the task descriptor structure (task_struct) for the process p. This allowed quick and direct access when an update was performed. To avoid including inactive processes in scheduling plans a per-process activity counter was incremented during the update_process_times() callback; only processes with a non-zero counter were included in the plan, and the counter was reset at each planning cycle.

A new scheduling plan was created periodically by a user-mode tool that read the recorded performance data from the LKM /proc interface. This data contained a record for each active process consisting of the best partner PID and the estimated package performance. The schedule planning program sorted the records by the package performance values and selected pairs of processes greedily, bypassing records containing processes already selected. This operation was performed at a relatively low frequency so the cost of the sort and search was acceptable. The plan was written back to the /proc interface for the scheduler to use.
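
The greedy selection can be sketched as follows (user-space form with illustrative record and array names; used_pid is assumed to be indexed by PID and large enough for all PIDs seen):

    #include <stdbool.h>
    #include <stdlib.h>

    struct partner_record {
        int pid;            /* process p */
        int best_partner;   /* process q giving the best package performance with p */
        int package_perf;   /* estimated combined performance of the pair */
    };

    static int by_perf_desc(const void *a, const void *b)
    {
        const struct partner_record *ra = a, *rb = b;
        return rb->package_perf - ra->package_perf;
    }

    /* Sort records by package performance and greedily select pairs whose
     * members have not already been planned.  Returns the number of pairs. */
    static int build_plan(struct partner_record *recs, int n,
                          int plan[][2], bool *used_pid)
    {
        int pairs = 0;
        qsort(recs, n, sizeof recs[0], by_perf_desc);
        for (int i = 0; i < n; i++) {
            int p = recs[i].pid, q = recs[i].best_partner;
            if (p == q || used_pid[p] || used_pid[q])
                continue;              /* bypass records touching planned processes */
            plan[pairs][0] = p;
            plan[pairs][1] = q;
            used_pid[p] = used_pid[q] = true;
            pairs++;
        }
        return pairs;
    }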

The plan was realised by an approximation to round-robin gang scheduling of the pairs. The goodness function heavily biased the goodness of a process if it was a member of the current gang. It was not desirable to completely suppress other processes since one of the gang may block or terminate. Additionally the choice of pairs to schedule was based on samples from different pairings. By allowing other pairs to occasionally execute, these samples could be updated. This technique also allows for processes that were blocked during the last round of sample gathering and hence not in the plan.

The correct implementation of the plan was verified using the context switch recording function of the LKM.

4.4.4.4 Scheduler: “fair”

This scheduler attempts to bias processes' CPU time allocations by their performance under SMT. A thread that suffers a poor performance when simultaneously executing is given a greater share of CPU time. The Linux scheduler normally provides even sharing of processor time by allocating each process a time-slice (as a number of ticks) at the start of each scheduling epoch. Ticks are consumed as the process uses CPU time. This scheme was modified for the fair scheduler by scaling the magnitude of the time-slice decrement by a factor proportional to the ratio of the estimated performance to an assumed average performance. The average performance was declared to be 0.6 based on the measurements in Chapter 3. For an estimated performance of p the tick decrement was p/0.6. The magnitude was bounded in both directions to avoid problems with wildly inaccurate performance estimates.

Tick counters are integers with a scale of 1 tick per quantum (usually 10ms). In order to allow fractional decrements (without resorting to using floating point values which is not advisable in the kernel) the counter values were scaled by 256 (essentially a fixed-point representation).
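
A sketch of the scaled decrement using the 256x fixed-point counter; the clamp bounds are illustrative, and the performance value uses the approximately 0.001-resolution representation, so the assumed average of 0.6 appears as roughly 614:

    #define TICK_FP  256    /* one whole tick in the scaled counter */
    #define AVG_PERF 614    /* 0.6 in the ~0.001-resolution representation */

    /* Remove perf/0.6 of a tick from the scaled counter, bounded against
     * wildly inaccurate performance estimates. */
    static int fair_tick_decrement(int counter_fp, int perf)
    {
        int dec = (perf * TICK_FP) / AVG_PERF;

        if (dec < TICK_FP / 4)
            dec = TICK_FP / 4;
        if (dec > TICK_FP * 4)
            dec = TICK_FP * 4;

        counter_fp -= dec;
        return counter_fp > 0 ? counter_fp : 0;
    }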

The hook in the timer interrupt driven update_process_times() function was used to provide this alternative tick decrementing. The performance counter samples were used to estimate the current performance using the method described in Section 4.4.3.

This implementation allows static priorities to be used in the normal way because the per-process time-slice is still initialised by the static priority.

4.5 Evaluation

The effectiveness of the scheduler can be determined according to a number of metrics:

• the system throughput of a set of tasks,
• fairness across a set of tasks, and
• respect of static priorities.

In addition to evaluating my scheduling algorithms against a traditional scheduler, an algorithm from related work was implemented and measured.

The algorithm is based on Parekh et al's “G_IPC” scheduling algorithm which they found to have the best all-round performance in their experiments [Parekh00]. At the start of each quantum G_IPC greedily schedules the set of threads with the highest individual IPCs. Parekh et al used a simulated system with no provision for starvation avoidance. Processes in their system did not block. My implementation, “ipc”, approximates G_IPC by heavily biasing the goodness value of a process based on its rolling-average IPC: a higher IPC translates to a higher dynamic priority. For the bulk of the Linux scheduling epoch this implementation provides the same outcome as G_IPC. However, my method avoids starvation of processes with natively low IPC.
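
A sketch of the biasing (the weight is illustrative): the rolling-average IPC is added, heavily scaled, to the standard goodness so that high-IPC tasks win most decisions within an epoch while time-slice exhaustion and the usual starvation avoidance still apply.

    #define IPC_WEIGHT 10   /* illustrative scaling so the IPC term dominates */

    /* avg_ipc_x100: rolling-average instructions per cycle, scaled by 100. */
    static int ipc_goodness(int base_goodness, int avg_ipc_x100)
    {
        if (base_goodness <= 0)
            return base_goodness;               /* time-slice exhausted */
        return base_goodness + avg_ipc_x100 * IPC_WEIGHT;
    }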

4.5.1 Method

The method used to evaluate the schedulers was similar to the pairwise and asymmetry experiments in Chapter 3 except that the scheduler was given the freedom to migrate tasks between logical processors. Sets of tasks were defined and each task in the set was executed in an infinite loop with the duration of each run recorded. The SMT-aware schedulers reference processes by their PIDs so there is no memory of performance data between successive runs of the same benchmark or even of different instantiations of the same program with different data sets (such as 164.gzip).

The tasks were started with a random staggered start. After all the tasks were started a nominal start time was declared. After a specified time the experiment ended. Only task runs which were entirely within the start-end window were included in the results although all tasks were running throughout the window. The execution time of each included task run was compared to its stand-alone run time to get a performance ratio for that task (as described in Section 3.2.4). The performance ratios were summed to give a system speedup. As with the experiments in Chapter 3 the system speedup is relative to zero-cost context switching on a single processor. This scheme is preferable to a purely IPC based scheme because it takes account of the varying baseline characteristics of the tasks. It is similar in spirit to Snavely et al’s weighted speedup. Each experiment was run three times and the mean speedups reported.
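
Stated explicitly (my notation): if T_i^alone is the standalone run time of task i from Section 3.2.4 and T_i^SMT its run time measured inside the window, then

    \[
      \text{system speedup} \;=\; \sum_{i=1}^{n} \frac{T_i^{\mathrm{alone}}}{T_i^{\mathrm{SMT}}}
    \]

so a value of 1.0 corresponds to the zero-cost context switching uniprocessor baseline mentioned above.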

One problem with running many benchmarks simultaneously is the high memory usage. Pages having to be swapped to disk complicate the system behaviour by causing tasks to block on page faults, increasing the kernel activity by reading and writing pages and potentially reducing the performance of intended disk accesses by the benchmark tasks. While these effects are interesting in a full-system study it is important to be able to understand different effects in isolation before considering their combined effect. The experiments were performed with swap space disabled. In the event of the system running out of memory the Linux kernel kills a task according to an algorithm that considers the task's virtual memory size, CPU time and system importance (i.e. system or user processes). The test harness monitored the system for this behaviour to prevent it causing invalid test runs.

The sets of benchmarks were chosen such that their total maximum memory requirement could be accommodated by the system taking into account memory used by the operating system and necessary support processes. Memory footprint sizes were obtained from Henning’s descriptions of the SPEC 2000 benchmark suite [Henning00] and verified on the experimental machine. A budget of 750MB was allowed for each set.

The sets of benchmarks used were taken from those used in Chapter 3 and are shown in Table 4.1. These were chosen to provide a diverse range of workloads including integer, floating-point and memory intensive tasks and different numbers of processes.

The duration of the start-end window was chosen to allow all benchmarks to complete a sufficient number of times but without being so long as to be impractical. Times ranging from 2 to 6 hours were used. All tasks were given the same static priority (the default “nice” value of 0).

4.5.2 Throughput

The system throughput can be measured by the rate at which a given set of tasks progresses. A true measure of throughput can be obtained only while all of the tasks are running. The results presented here use the system speedup measure described above.

Figure 4.1 shows the system throughput for each test set running under each scheduler. Improvements over naïve scheduling of up to 3.2% are seen. To put this speedup into context, Nakajima and Pallipadi measured speedups of 3% and 6% on two task sets chosen to exhibit resource requirements that their scheduler could exploit.

Set   Benchmarks
A     164.gzip, 186.crafty, 171.swim, 179.art
B     164.gzip, 181.mcf, 252.eon, 179.art, 177.mesa
C     164.gzip, 186.crafty, 197.parser, 255.vortex, 300.twolf, 172.mgrid, 173.applu, 179.art, 183.equake, 200.sixtrack
D     mp3, compile, image, glx

Table 4.1: Benchmark sets for evaluating scheduler designs.

Figure 4.1: System speedups for the schedulers on the single package (1x2) configuration. Note the non-zero start for the y-axis.

The tryhard scheduler does reasonably well on benchmark sets B and C but results in a small slowdown (compared to naïve) on A and D. plan provides a speedup on sets A, B and C; its lack of an apparent speedup on set D is partly due to a false indication of high throughput when using the naïve scheduler (see below). The performance loss experienced by the fair scheduler is almost all due to the overhead of calculating the estimated performance ratio each time a tick counter is decremented. This algorithm is aimed solely at being fair; it makes no attempt to improve throughput.

Set A

This set consists of 4 processes which run on 2 logical processors. The system gets itself into a steady state where two processes on each logical processor get scheduled in a round-robin fashion, being preempted when they exhaust their time-slice. For naïve this steady state behaviour means that there is little chance of the simultaneous pairings changing so the system speedups obtained depend largely on the initial allocation of processes to logical processors. The high variance in the system speedups seen for these algorithms is because of the relatively large difference between performance ratios for the different possible pairings. The steady state reached in the experiments that gave the better throughput had 164.gzip and 186.crafty executing on one logical processor and 171.swim and 179.art on the other. In both cases the interaction effects of the two processes on a logical processor were similar (they have similar rows and columns in Figure 3.3(a)) so changing their relative phase has little effect. tryhard provides little gain because it is hard for it to break the steady state cycle. The overhead of estimating performance causes a small net slowdown. plan does provide a speedup (compared to naïve) because it can cause migrations to occur and find better pairings. Set A under ipc causes the higher IPC benchmarks 164.gzip and 186.crafty to run together at the start of a scheduling epoch because they were greedily selected. When they exhaust their time-slices 171.swim and 179.art run together which results in a poor performance, particularly for 171.swim. The result is a low average throughput.

Set B

Set B contains benchmarks with a generally good combined performance although 181.mcf and 179.art have both been shown to cause harm to 164.gzip but not to 252.eon or 177.mesa. The odd number of processes in the set will cause frequent migrations of processes between the logical processors because there will always be a 60:40 split in load using the current affinity. This, and the changing number of processes affine to a logical processor, mean that there is no steady state as was observed with set A. With naïve all pairings will occur at some point so it would be expected that the bad pairings will occur reasonably often and hence reduce the average throughput. tryhard does well because it is able to reduce the number of times a bad pair is simultaneously executed. This algorithm reduces the goodness of a process that has been observed to yield a low performance when running with the process currently active on the other logical processor. The availability of many good pairings means that tryhard can usually find a better process to run.

Set C

The large number of benchmarks in set C gives a better choice for tryhard leading to an increase in performance compared to naïve. ipc performs poorly for benchmark set C because, as with set A, the pairs that have the highest IPC do not necessarily yield the best performance ratios. plan shows a smaller speedup; this is due to a limitation of the planning algorithm: take a benchmark P that gives a good performance with other benchmarks Q, R and S; P will appear as “best partner” for those three benchmarks. When the planner greedily selects pairs it disregards pairs containing processes already selected, therefore it will select perhaps (Q, P) but will then have to disregard (R, P) and (S, P) because P has already been included in the plan. R and S will only be included in the plan if they appear as a “best partner” to some other process (a process not in the plan will still be run, usually towards the end of the epoch once the processes in the plan have exhausted their time-slices). An alternative behaviour would be to select pairs regardless of whether their members have already been selected. This method is not as unfair as it may seem because the time-slice of each process will still be respected; the disadvantage is that pairs containing the common process will lose the ability to be gang scheduled when the common process exhausts its time-slice; the remaining process in each pair will gain an arbitrary partner for the remainder of the epoch. A slightly more costly option would be to record a sequence of preferences1 rather than just “best partners” (e.g. “second best”, “third best” etc.). These two modifications were tried on this benchmark set: allowing pairs to contain processes already selected resulted in a system speedup of 1.183 (only marginally better than the original plan algorithm at 1.181); maintaining data on the “best three partners” gave 1.161 (slower than naïve at 1.173).

Set D

The compile benchmark consists of a series of many short-lived compiler and related processes. These are hard for most of the SMT-aware schedulers to deal with; plan periodically samples data on “best partners” and calculates a sequence of pairs to gang schedule. The churn rate of processes means that this plan is almost always out-of-date and results in very little gain. tryhard and ipc work on a finer granularity but the very short lifetimes of the processes limit the amount of useful characterisation that can be performed. A possible modification to these algorithms would be to maintain performance data across successive instances of the same program, perhaps by considering the program name (argv[0] in C).

The naïve scheduler appears to perform better than any of the other algorithms. However, this is at the expense of fairness — compile gets a small fraction of the total processor time with glx and mp3 taking the bulk of the former's share (see Section 4.5.3 below). The latter two benchmarks generally experience a better performance under Hyper-Threading than compile so cause the system speedup to be inflated. The other algorithms provide a much fairer share of processor time to compile which causes the system speedup to lose the artificial inflation it had from glx and mp3.

Summary

The tryhard scheduler yielded the best results over-all. It failed to produce an improvement on set A because it was unable to break a steady state pattern of execution. A combination of tryhard and some form of SMT-aware migration would be likely to perform well. plan performed well in cases where processes were long-lived enough for their characteristics to be learned and a scheduling plan created and realised. Modification of this algorithm to store performance data between instances of the same program would be likely to provide a better performance. plan outperformed tryhard when the ability of the former to break a steady state schedule allowed optimal pairings to be found. ipc generally performed poorly. This algorithm is based on related work. It favours processes with high IPC; this does not necessarily equate with high performance.

The naïve baseline scheduler is sometimes able to perform well but often the increased performance is down to a fortunate (but accidental) allocation of processes to logical processors. My SMT-aware scheduling algorithms generally provide a more reliable system throughput.

1Three “best partners” were recorded but not held in order; the periodic planning run performed the sorting.

Figure 4.2: Individual performances of the benchmarks of set A running on the single package configuration (1x2)

4.5.3 Fairness

The throughput measurements described above give a speedup for the entire system. However, the goal of the scheduler is to provide a good speedup alongside fairness. All tasks were given the same static priority (“nice” value) so should exhibit similar individual performances. However, as seen in Chapter 3, having an equal allocation of processor time on a Hyper-Threaded processor does not guarantee an equal share of processor resources. The variation in the individual performances that were summed to give the system speedup can be used to gauge fairness.

Figures 4.2 through 4.5 show the individual performances of each benchmark running in the sets under each scheduler. A smaller spread of individual performances indicates a fairer schedule.

Set A

Figure 3.3(a) in Chapter 3 showed that 179.art generally degrades the performance of the task it runs simultaneously with but rarely suffers a reduced performance itself. It is for this reason that naïve and some of the SMT-aware schedulers show this benchmark achieving such a good individual performance.

171.swim usually exhibits a periodic change approximating a sine function in its performance ratio when running simultaneously. Its estimated performance follows the same basic shape and predicts the peaks very well but over-estimates the troughs. The net result is an over-estimation of the benchmark's performance which causes the fair scheduler to over-penalise its tick counter. This effect is observable in the results as a lower than average performance ratio for 171.swim.

ipc, which greedily schedules tasks with high IPCs, causes a reduction in performance for the low-IPC benchmarks 181.mcf and 179.art and an increase for 164.gzip and 186.crafty, which both have high IPCs.

Figure 4.3: Individual performances of the benchmarks of set B running on the single package configuration (1x2)

Set B

As with set A, 179.art achieves a relatively high performance. 181.mcf has a similar, although slightly less pronounced, thread interaction behaviour so shows a similar bias here. ipc's preference for high-IPC tasks is shown clearly with 181.mcf getting little processor time and therefore a poor over-all performance.

Set C

The previously seen effect of 179.art is clearly visible again in set C. This set also contains 172.mgrid which Figure 3.3(a) shows as having opposite properties to 179.art: it gets hurt by many competing workloads but rarely causes a performance detriment to others. This effect is seen here as a generally low individual performance for this benchmark.

Set D

In most cases glx yields a better performance than the other benchmarks. In a similar manner to 179.art in the other benchmark sets, Table 3.7 in Chapter 3 shows how the benchmark does harm to others but is not harmed itself. naïve shows a particularly unfair spread of performance with glx having an individual performance ten times higher than compile. This is due to the combination of compile's frequent forking and glx's (and mp3's to a limited extent) compute-bound nature. When a process forks the time-slice of the parent process is divided equally between the parent and child (although each gets a full allocation at the start of the next epoch); the current processor number is inherited by the child so it maintains the same affinity as the parent.

When a family of processes, such as compile, is time-sharing a logical processor with a compute-bound task such as the X-server associated with glx, the recently forked processes, which now have low goodness, will often be waiting for the compute-bound process to exhaust its time-slice. The large amount of I/O performed by compile means that it often yields the processor back to the compute-bound process. The other logical processor is also busy so there will be very little migration of tasks between the logical processors. The effect of changing affinity to physical package rather than logical processor is clearly seen with basic; the processes making up compile are now able to freely migrate to the other logical processor and so can take advantage of reschedules occurring on that processor.

Figure 4.4: Individual performances of the benchmarks of set C running on the single package configuration (1x2)

Figure 4.5: Individual performances of the benchmarks of set D running on the single package configuration (1x2)

Benchmark   “Nice” value   Time-slice (milliseconds)
252.eon     0              60
179.art     3              50
181.mcf     7              40
177.mesa    11             30
164.gzip    15             20

Table 4.2: Static priorities given to the benchmarks of set B.

Summary

My SMT-aware schedulers generally provide fairness between threads that is as good as, or better than, that provided by a traditional scheduler. The algorithms that attempt to optimise for throughput respect the processes' time-slices so should provide an approximately equal allocation of processor time to the processes. Programs that generally perform well under Hyper-Threading will still show a higher individual performance in many cases.

The scheduler based on related work, ipc, usually provides an unfair share of performance. This is because it favours processes with high IPC. The original algorithm would have completely starved low-IPC programs such as 181.mcf; my implementation prevents this by utilising the Linux scheduler’s own starvation avoidance functionality.

The fair scheduler, which allows a poorly performing process more processor time, can improve the fairness in some cases but is limited by the accuracy of the performance estimation model.

4.5.3.1 Static Priorities

The scheduling algorithms were designed to adjust the dynamic priority of processes in order to improve throughput or fairness. By doing this the static priority of the processes should be respected. To check this behaviour a benchmark set was run with the benchmarks each given a different priority. The benchmarks of set B were given static priorities as shown in Table 4.2. The ordering of the benchmarks was chosen to give the longest running benchmarks the higher priorities in order to keep the experiment duration reasonably short.

Figure 4.6 shows the individual performance of each benchmark in the set running under each scheduler. The bars are ordered such that the highest priority is on the left and lowest on the right. It can be seen that naïve, basic, tryhard and plan all allow 179.art to achieve a performance greater than its priority would suggest. This is due to the benchmark generally performing well while always harming the simultaneously executing process. The apparent incorrect behaviour of these scheduling algorithms is therefore an artifact of the workloads involved. The fair algorithm attempts to correct this type of behaviour and the results show clearly how the ratios of achieved performances match the ratios of time-slices well. 181.mcf suffers under the ipc scheduler because of the benchmark's low native IPC.

Figure 4.6: Individual performances of the benchmarks of set B running on the single package configuration (1x2). The benchmarks are given static priorities resulting in time-slices of 60, 50, 40, 30 and 20ms in order left to right.

A second experiment representing a more common use of static priorities was performed. Benchmark set D contains an application that makes use of the X server. The default behaviour for the system under test is to start the X server with a higher static priority than most other applications. This was altered for the throughput experiments above but retained here. Figure 4.7 shows the individual performance data for the benchmarks of set D running under each scheduler with the X process (represented by the result for glx) having a higher static priority (by 5 “nice” units) than the other tasks. This can be compared to Figure 4.5 above where all processes were given equal static priorities.

As described above, glx generally gets a better individual performance in the equal-priority experiments. However, the impact of the static priority increase on the X server is clearly visible. The time-slice given to each task at the start of each scheduling epoch is based on the task's static priority added to half of any remaining time-slice from the last epoch. For the system under test the base time-slice for the X server was 6 ticks (60ms) and 4 ticks (40ms) for the other processes; this should result in a ratio of 3:2 of the individual performances of glx to each of the other benchmarks. The ratio is higher than this for all the schedulers, which is due to the better Hyper-Threaded performance of glx.

Figure 4.7: Individual performances of the benchmarks of set D running on the single package configuration (1x2). The X server process, the main part of the glx benchmark, has an increased static priority.

These results show that the proposed scheduling algorithms are respecting static priorities.

4.6 Applicability to Other Operating Systems

The scheduling algorithms presented in this chapter were based around modifications to the calculation of dynamic priority. Many general-purpose operating system schedulers use some form of dynamic priority, therefore it should be possible to incorporate SMT-aware heuristics. The monitoring functions were based on callbacks from the context switch and timer interrupt functions; these are features found in general-purpose operating systems.

The use of processor hardware performance counters is a potential barrier to common acceptance of a scheduler of the type described here. The processor has only one set of performance counters and there may be many potential users of the counters. For application use there are counter virtualisation packages available that can “context switch” the counter configurations and values. For kernel use this is not sufficient because my schedulers require constant measurements of realised performance.

Linux 2.6 has a new implementation of the process scheduler. It differs from the scheduler in the 2.4 (used for my experiments) and 2.2 kernels mainly in its allocation of the time-slice for a process being per-processor rather than system-wide. In addition the run-queues are maintained independently for each processor. The result is that processor affinity has more of an impact and migration becomes a more explicit operation. Since there is no common run-queue, there is no need for a system-wide lock; processors can perform context switches in parallel. The goodness function has been removed, its functionality being replaced by the ordering of the run-queue and the implied affinity. It would still be possible to incorporate SMT-aware scheduling into this model. For example, careful modification of the run-queues would permit more reliable gang scheduling than was possible in the 2.4 kernel.

The scheduler in recent versions of the Windows operating system uses a single-queue preemptive scheme [Solomon00]2. Static priorities are used with temporary priority boosts for starvation avoidance. Boosts are also given to threads associated with the foreground window and threads that have recently had a wait operation complete. Multiprocessor scheduling utilises soft-affinity data for each thread; when picking a task for a processor the scheduler searches the run-queue and executes the first (highest priority) thread that either ran last on that processor, has a static affinity for that processor, is experiencing starvation or is of a very high priority. There is ample scope for adjusting priorities and affinities to implement the various SMT-aware schedulers.

4.7 Applicability to Other Processors

The scheduling algorithms presented here relied on processor hardware performance counters. Only a small set of the available event types were counted: cache miss rates and instruction throughput. These event types are often found on other processors with less sophisticated counters than the Pentium 4.

The performance estimation function in Section 4.4.3 is specific to the particular system. The coefficients will differ with different processor cores (such as Prescott instead of Northwood) and cache arrangements. It is likely, however, that the relative magnitudes of the coefficients would be similar across similar processors. The coefficients were produced by performing a multiple linear regression analysis on data from test runs of a set of benchmarks. This is a mechanical (albeit time consuming) process which could be used to tune the model for different systems or even different workloads. Further investigation of this model is left for future work.

My work was performed using a “Northwood” version of the Pentium 4 processor. As shown in Chapter 3 the recently released “Prescott” version experiences fewer cases where the thread interactions cause a significant loss of performance. This suggests that the software techniques presented here are not as important on the newer processor. However, there are still a number of poorly performing simultaneous pairings which it would be advantageous to avoid. There is also scope for simplifying the microarchitecture of future SMT processors by dealing with resource allocation fairness in software, including scheduling techniques similar to those presented here, rather than hardware. This strategy could allow the processor to run faster, consume less energy or use less silicon.

It is likely that future SMT processors will support more threads, with four-thread units rumoured to be coming in the near future. The techniques presented here are likely to become more useful in these processors as they may benefit from the possibility of disabling one or more threads during periods of poor performance due to unfortunate thread interactions.

2See also Microsoft Developer Network: Platform SDK: DLLs, Processes, and Threads.

4.8 Summary

I have presented a number of different scheduler implementations that are aware, to different degrees, of the characteristics of the SMT processor and try to improve throughput and/or fairness. Unlike much previous work in the area these schedulers have to deal with the complexity of a real scheduler, such as processes blocking, and preserve existing facilities such as starvation avoidance. The speedups obtained, up to 3.2%, are similar to speedups reported in related work which uses a less flexible scheme with workloads chosen to benefit from that scheme. My scheduling algorithms generally outperform an IPC-based scheme (found to be the best all-round performing scheme studied) from the simulation-based literature modified to work within the constraints of a practical general-purpose operating system scheduler.

Chapter 5

Alternatives to Multiprogramming

In the previous chapters I have focused on the scenario where a simultaneous multithreaded (SMT) processor is presented as a set of distinct logical processors which can be allocated independently. I have shown how a naïve use of this abstraction can cause a loss of performance. I have presented a design and evaluation of a range of process scheduling algorithms that try to avoid this performance loss. In this chapter I discuss alternative ways of using SMT processors which expose and exploit their multithreaded nature rather than trying to hide it. I present some preliminary experiments to assess the value of dedicating a hardware thread to system functions such as interrupt handling — a technique possible using current SMT hardware.

The techniques described here fall into two broad categories. Firstly those that present the processor as a multithreaded processor and secondly those that use the threads to improve the performance of the main thread, therefore presenting the processor as a traditional uniprocessor. Some work possesses elements of both categories by using multiple processor threads to work on a single software thread but having more than one of these groups existing on the same multithreaded processor. The second category is motivated by the desire to improve the performance of an individual thread, something that SMT in a multiprogramming context is very unlikely to do. The speedups must be considered alongside the system-wide improvements offered by multiprogramming on SMT.

5.1 A Multithreaded Processor as a Single Resource

To avoid the operating system having to be concerned with the thread interactions experienced on an SMT processor the problem could be given to the applications by allocating the entire processor as a single resource rather than trying to allocate threads independently. It would be up to the application to make best use of the available hardware threads. Applications would be better placed to know their own demands and, given enough information about the processor, should be able to make a more informed decision about what to simultaneously execute.

Modern compilers are able to optimise code for different models of processors implementing the same instruction set. This is done by building knowledge of the operation of the processor into the compiler. This scheme could be extended to parallelising compilers; the compiler would be designed with knowledge of the thread interactions. As with existing machine-specific optimisations, the benefits are limited or lost when the compiled code is executed on a different processor (of the same instruction set).

A possible middle-ground is to place the processor-specific knowledge in a thread package. This software typically sits between the operating system and the application code and provides a common abstraction for threads. For example, the POSIX threads standard specifies an interface for applications to create multiple threads [IEEE04]. The threads package implements this standard by mapping the application p-threads to kernel or user (or a combination of the two) threads. During a scheduling quantum the operating system could allocate the entire SMT processor to the thread package and allow the package to decide on the optimal allocation of p-threads to simultaneous threads.

Capriccio is a user-level thread package aimed at highly concurrent network servers [von Behren03]. The application threads are scheduled onto operating system threads using a resource-aware scheduler; the package monitors resources such as memory and file descriptor usage and attempts to balance load over time. Adding processor resources and SMT-awareness to this package has the potential to increase throughput; the large choice of threads to run increases the chance of an optimal schedule.

5.1.1 Program Parallelisation

Any multiprocessor or multithreading system on which it is desired to run a single workload will benefit from that workload being multithreaded. Single-threaded workloads can sometimes be parallelised by a compiler. On SMT processors extra threads do not always mean better performance so the nature and extent of the parallelisation need to be chosen carefully. Puppin and Tullsen investigate the parallelisation of Livermore loops on a simulated SMT processor [Puppin01]. They note that traditional parallelisation techniques work well for SMT processors but the optimal number of hardware threads to use may not be the processor's full complement. They develop a model to predict the best number of threads to use based on the contention for functional unit types and the length of dependent chains of instructions.

5.2 Threads for Speculation and Prefetching

In this section I describe proposed mechanisms that use simultaneous threads for speculation of data and control-flow. The closely related topic of pre-execution is also described, where a thread executes ahead of the main thread in order to prefetch data or precompute branch directions.

5.2.1 Data Speculation

Programs written in a single-threaded manner can sometimes be successfully parallelised by a compiler. Numeric applications tend to be the easiest to auto-parallelise because it is straightforward to track the data dependencies. Programs that are more reliant on memory accesses are harder to parallelise due to the potentially much larger number of dependencies through the memory locations. Superscalar processors can extract instruction level parallelism from such programs by hoisting and speculatively executing loads and checking for subsequent writes to the locations referenced. This mechanism is instruction-level data speculation. A natural extension to this is thread-level data speculation where various parts of a program, particularly loop iterations, can be executed in parallel based on speculative use of loaded values. As with instruction-level speculation, speculative threads have to be replayed if the data values used turn out to be incorrect. The main difference from instruction-level speculation is the greater parallelism that can be extracted but at the expense of a higher cost in replaying incorrect executions. Thread-level speculation has also been proposed as a tool to assist manual parallelisation of programs [Prabhu03].

Multiscalar processors use a combination of hardware and software to split a single-threaded process into multiple “tasks” which are farmed out to a set of processing units within the processor [Sohi95]. Each unit fetches and executes independently and keeps a copy of the global register file. Register values are routed between units to keep the copies consistent. Implicit-Multithreading (IMT) [Park03] applies the Multiscalar concept to SMT. The authors found that a simple replacement of the Multiscalar execution units with SMT threads gained little advantage over single-threaded superscalar. They proposed some microarchitectural optimisations to improve the performance:

• a fetch policy biased towards threads processing earlier tasks,
• task threads multiplexed onto SMT contexts, and
• a faster thread start-up mechanism to support the large number of thread start-ups.

The result was a 20 to 30% speedup over superscalar.

Thread-level data speculation (TLDS) as proposed by Steffan and Mowry speculatively breaks a single-threaded process into “epochs” which are similar to Multiscalar's tasks [Steffan98, Steffan00]. TLDS relies on the compiler to identify regions of the code that are suitable for speculation and to perform the parallelisation. The multithreaded processor (traditional or simultaneous) is responsible for forwarding data values between threads (which can cause epochs to execute in an overlapped fashion) and for checking for data violations when an epoch is completed and attempts to commit its work. Any data dependency violating epoch has to be executed again.

Marcuello and González proposed to use a modified SMT processor to provide a similar facility to that provided by implicit multithreading described above but without the need to modify application binaries [Marcuello98, Marcuello99]. They focus particularly on having different threads executing different iterations of loops. Loop closing branches are highly predictable which allows the hardware to perform the decomposition into threads. They speculate on register values by looking at the difference in live register values over successive iterations of the loop. They report a 25% speedup for integer codes.

Akkary and Driscoll describe dynamic multithreading (DMT) in which the processor speculatively spawns threads on procedure and loop boundaries [Akkary98]. The goals of this work and some of the microarchitectural mechanisms are similar to Marcuello et al's speculative multithreading but Akkary and Driscoll focus on spawning at procedure boundaries.

Thread-level speculation (TLS) has been used as the basis for other work. Zhou et al's iWatcher is an extension to traditional “watchpoints” (memory locations that when referenced cause the processor to trap into an exception handler for debugging purposes) [Zhou04]. The authors noted that watchpoints are generally used to check constraints on a set of variables. In normal operation with no bugs existing, the check will always succeed and execution will resume following an expensive exception. iWatcher's main contribution is the use of TLS to allow the main thread to continue speculatively while the watchpoint handler runs in another thread context. The original thread is only allowed to become non-speculative when and if the handler declares the check successful. As well as the reduced impact on the throughput of the main thread, this design avoids expensive privilege level changes in a similar manner to Zilles' work described below.

5.2.2 Pre-execution

Pre-execution and precomputation are similar techniques used to reduce the latency of load instructions or find the outcome of a branch in advance of where it would normally be known. This is done by a second computation proceeding alongside the main thread speculatively executing some subset of instructions. The effect is either that the value to be loaded is available in the cache for when the main thread executes the load, or the branch outcome is known when the main thread comes to predict that branch. Mechanisms have been proposed to use dedicated hardware, such as a “precomputation engine”, for the secondary computation. Combining pre-execution and SMT should yield good results; spare thread contexts can be used to run the speculative secondary computation(s) which can share the contents of the cache and enjoy fast register sharing or copying. Pre-execution differs from data speculation described above in that the actual architecturally important execution is carried out by a thread in the traditional way with that thread being itself speculative at the thread level.

Collins et al use precomputation to prefetch “delinquent” loads (the small number of static loads that generate the majority of cache misses) [Collins01b]. The compiler identifies the delinquent loads and, for each, generates a sequence of instructions called a “p-slice” which is able to compute the address and prefetch the data. The processor spawns the p-slice when it encounters a designated trigger instruction in the main thread. They base their simulation on an SMT implementation of the Intel IA64 instruction set. In later work Collins et al remove the reliance on the compiler by moving the identification of delinquent loads and the construction of p-slices to hardware to allow unmodified binaries to benefit from the technique [Collins01a]. They also move the simulated architecture to the Alpha-based SMTSIM. The authors report a wide range of speedups in both papers, with about 30% being the most representative figure in both and up to 169% in the compiler-assisted version when speculative threads are allowed to spawn further speculative threads.

Intel implemented a version of precomputation on a pre-production Pentium 4 Xeon with Hyper-Threading. Since this processor has no hardware support for speculative precomputation, a separate heavyweight operating system thread was forked at initialisation of the main thread and the Windows XP event mechanism was used for the main thread to communicate trigger events to the speculative thread [Wang02]. The program binary was statically modified based on performance data obtained using Intel’s VTune Performance Analyser. Speedups of up to 45% were observed on pointer-chasing synthetic benchmarks and between 7% and 40% on a selection of application benchmarks expected to benefit from precomputation. Perhaps the most interesting aspect of this work is the demonstration that the techniques of precomputation can be employed without any hardware modifications.
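The same software-only flavour of helper-thread prefetching can be sketched with ordinary POSIX threads. The fragment below is a minimal, hypothetical example (the list type, look-ahead distance and trigger variable are my own, not details of the cited work): a helper thread walks a linked list ahead of the main thread and issues prefetches, relying only on the cache shared by the two Hyper-Threads.

    #include <pthread.h>
    #include <stddef.h>

    struct node { struct node *next; long payload; };

    static struct node *volatile trigger;  /* main thread publishes its position here */

    /* Helper thread: prefetch a window of nodes ahead of the point most recently
     * published by the main thread. Started once, e.g.:
     *   pthread_t tid; pthread_create(&tid, NULL, prefetch_helper, NULL);          */
    static void *prefetch_helper(void *arg)
    {
        (void)arg;
        for (;;) {
            struct node *n = trigger;
            for (int i = 0; n != NULL && i < 128; i++, n = n->next)
                __builtin_prefetch(n, 0, 1);  /* read prefetch, low temporal locality */
        }
        return NULL;
    }

    /* Main thread: the real traversal, periodically republishing its position so
     * that the helper stays a little ahead of it. */
    long sum_list(struct node *head)
    {
        long total = 0;
        int  steps = 0;
        for (struct node *n = head; n != NULL; n = n->next) {
            if ((steps++ & 63) == 0)
                trigger = n;
            total += n->payload;
        }
        return total;
    }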

Speculative slices as proposed by Zilles and Sohi [Zilles01] are similar to p-slices. They are created manually in order to prefetch loads or precompute hard-to-predict branches. Speculative slices differ from p-slices in that each execution of a slice is tied to a particular dynamic branch/load (i.e. a particular instance of the instruction). The authors argue that this is necessary, particularly for branches, as the behaviour will change between dynamic instances.

In follow-up work to their Hyper-Threaded study described above, Intel use some of the techniques of Zilles and Sohi’s work in order to provide a prefetching helper thread [Kim04]. They discuss the problem of dynamic runtime tuning of the helper threads including how to remove unnecessary prefetches causing wasted processing. They note that the Intel Hyper-Threaded processors do not currently have sufficiently fine-grained and lightweight synchronisation to allow such tuning.

Software-controlled pre-execution differs from Collins et al’s precomputation in that it executes a copy of the original code, spawned at points determined by the compiler/profiler or inserted by hand [Luk01]. The pre-execution thread proceeds without changing any global state, other than causing data to be prefetched into the cache, by not committing stores or generating exceptions. This thread continues for a predetermined number of instructions or until a predetermined point in the code. The author reports an average speedup of 24% across a set of “irregular” applications (those that would most benefit from this technique).

Speculative Data-Driven Multithreading uses spawned speculative threads to work ahead in the program code to execute “critical computations” [Roth01]. The data-driven thread executes only those instructions from the main program that are required to perform the particular computation, therefore allowing it to proceed more quickly than the main thread. The result and the input used are recorded in an “integration table” to allow the main thread to use the generated result if the speculation was correct, thereby saving on duplicate work. Simulation on an 8-way Alpha-based SMT architecture yielded execution time savings of up to 45%.

Slipstream processors run a shortened copy of the application speculatively. This shortened copy contains only the instructions that the processor believes are required for correct forward progress [Sundaramoorthy00]. The design is based on the observation that there is redundancy in program code, including unreferenced writes, non-modifying writes and correctly predicted branches. A full execution of the program (the “R-stream”) chases the shortened version (the “A-stream”), which passes control-flow and dataflow outcomes back to it for checking. The A-stream functions as a prefetch thread for the R-stream, with the latter executing all instructions and producing the definitive results. The design was originally evaluated on a simulated chip multiprocessor (CMP) and later work considered the use of SMT [Purser00]. A 2-core CMP design was shown to give an average improvement of 12% over a single core and SMT provided a speedup of 10 to 20% compared to an equivalent non-SMT architecture.

A good example of a use of pre-computation that would be applicable to the SMT-based designs described above is Roth et al’s virtual function call (“v-call”) target prediction [Roth99]. V-calls are typically used in object oriented programs to dynamically select the function to call at runtime based on type information. Unlike static calls, the jump target is hard to predict using conventional mechanisms but easy to predict using the pre-computation methods described above.

5.2.3 Multiple Path Execution

The technique of multiple path execution involves the processor proceeding along both possible execution paths from a branch until the outcome of that branch is known. This reduces the cost of a mispredicted branch at the expense of wasted execution effort. The technique was used, to a limited extent, in the IBM System/360 Model 91 where the first few instructions from both branch paths were fetched (but not decoded) [Anderson67]. The hardware cost of multiple path execution can be high; the use of spare processing capacity in a multithreaded processor would allow some of the benefit of multiple path execution to be obtained without the hardware overhead.

Wallace et al propose Threaded Multiple Path Execution, a technique for making use of spare thread contexts within an SMT processor [Wallace98]. When the execution of a thread reaches a hard-to-predict branch the processor performs an internal “fork” operation, with the two threads following the different paths from the branch. Once the outcome of the branch is known, the wrong-path thread can be squashed and the thread context reused for further speculation. Clearly the processor must be careful about when it chooses to fork, to prevent an exponential increase in the number of required contexts from very quickly consuming the fixed number of available contexts. Wallace et al denote one path from each fork as the primary path and only allow further forks from that path. This not only removes the problem of exponential growth but also simplifies the implementation. They also limit forks to branches with low prediction confidence by using a branch confidence predictor which tracks how well prediction is working.
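A minimal sketch of the fork decision, assuming a simple table of resetting confidence counters indexed by branch address (the table size, threshold and update policy are illustrative assumptions, not details of Wallace et al’s design):

    #include <stdint.h>
    #include <stdbool.h>

    #define CONF_ENTRIES 1024
    #define CONF_MAX     15   /* 4-bit saturating counter */
    #define CONF_THRESH  8    /* below this the branch is treated as low confidence */

    static uint8_t conf[CONF_ENTRIES];

    static unsigned conf_index(uint64_t pc) { return (unsigned)((pc >> 2) % CONF_ENTRIES); }

    /* Increment the counter on a correct prediction, reset it on a misprediction. */
    void confidence_update(uint64_t pc, bool predicted_correctly)
    {
        uint8_t *c = &conf[conf_index(pc)];
        *c = predicted_correctly ? (uint8_t)(*c < CONF_MAX ? *c + 1 : CONF_MAX) : 0;
    }

    /* Fork a spare context only from the primary path, only when a free context
     * exists, and only for branches whose prediction confidence is low. */
    bool should_fork(uint64_t pc, bool on_primary_path, int free_contexts)
    {
        return on_primary_path && free_contexts > 0 && conf[conf_index(pc)] < CONF_THRESH;
    }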

The architectural-to-physical register mapping used by SMT lends itself to lightweight forking: rolling back the register state of a wrong-path thread requires only discarding mappings and updating the physical register free list. As with instruction level speculation, it is memory accesses that complicate the rollback. Wallace et al propose two elaborate schemes to track load-store dependencies using a combination of store buffers and tagging of speculative stores.

Threaded multiple path execution could be applied to other forms of hardware multithreading, but SMT is particularly suitable because the speculative paths use spare capacity rather than dedicated resources.

5.3 Threads for Management and Monitoring

In this section I describe proposals to use simultaneous threads, executed at a lower hardware priority, for purposes other than application threads. Uses include monitoring for optimisation and error-checking.

5.3.1 Mini-Threads

Redstone notes that a multi-context SMT processor would require a large number of registers [Redstone02]. With a large instruction window and a large register set (such as that of the Alpha), the size of the register file would cause either a long cycle time or an increase in the pipeline depth. Redstone proposes to extend SMT to support mini-threads [Redstone03]. In this architecture the thread context is split into the architectural registers and the remaining (mainly control-flow) state such as the program counter, return stack, store buffers and so on. A mini-thread has its own control-flow state but it shares the architectural registers with the other mini-threads within the context. This approach reduces the number of physical registers required and provides a cheap inter-mini-thread communication mechanism. The disadvantage of the reduced number of architectural registers per mini-thread is generally outweighed by the increase in throughput due to the extra thread-level parallelism. The allocation of architectural registers to mini-threads is left to the compiler, which could choose to partition the threads statically or dynamically or to share registers where appropriate.

A problem with this design is that calls to the kernel could be made by any mini-thread and in the general case the kernel would not know which registers were being used by that mini-thread at that time, so would not know which registers could be saved for its own use. A reasonably straightforward fix described by Redstone is to force static partitioning of registers and have the instruction decode stage rewrite the register numbers for the particular mini-context being fetched (i.e. for a two mini-thread scenario the second mini-thread has the top bit of each register number set by the decode stage). This rewriting also applies to kernel code so an unmodified kernel can be used. Redstone suggests an alternative for when static partitioning is not in use: when a mini-thread enters the kernel, all other mini-threads are blocked and all registers are saved, then restored when control returns to the user-mode code.
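The decode-stage renumbering for the statically partitioned case can be expressed very simply. The sketch below assumes a 32-register ISA split between two mini-threads, with the second mini-thread steered into the upper half of the register numbers (the constants are illustrative, not taken from Redstone’s design):

    /* With static partitioning each mini-thread names only the lower half of the
     * architectural registers (0..15 for an assumed 32-register ISA); the decode
     * stage sets the top bit for the second mini-thread so the two halves never
     * collide. Kernel code fetched on behalf of a mini-thread is renumbered in
     * the same way, which is why an unmodified kernel can be used. */
    #define REG_BITS 5   /* assumed: 32 architectural registers, e.g. Alpha */

    static inline unsigned rename_arch_reg(unsigned reg, unsigned mini_thread)
    {
        return reg | (mini_thread << (REG_BITS - 1));  /* mini_thread is 0 or 1 */
    }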

Redstone evaluated the architecture using parallel benchmarks and applications and showed a mean speedup of 38% on a 2-context SMT processor. Processors with higher numbers of contexts yielded less speedup but are the ones in most need of a smaller register file. A use not considered was executing subordinate threads such as profiling or monitoring functions. This would be a suitable application for mini-threads as they can be written to use only a small number of the registers and would benefit from being able to read (and maybe write, in some circumstances) the registers of the main thread.

5.3.2 Subordinate Threads

Subordinate (helper) threads utilise simultaneous execution in a biased manner to provide some form of support to the main thread. The main thread is generally allowed to use all the processor resources it needs while the subordinate threads use whatever remains. Subordinate threads can perform functions such as profiling, optimisation and prefetching. Subordinate threads differ from pre-computation in that they are explicitly provided by either the application or the operating system rather than being automatically extracted from a program.

Dubois and Song propose nanothreads as a mechanism to assist the execution of the main thread [Dubois98]. Nanothreads share all the resources, including most of the registers, of the main thread. Additionally, each nanothread has its own small number of architectural registers. Nanothreads can be spawned explicitly by the main thread to provide facilities such as prefetching. Alternatively a nanotrap, a lightweight trap based on selectable hardware events such as a cache miss or invalidation, can spawn a nanothread. The authors suggest that this mechanism may be useful in circumstances where an exception would traditionally be the only suitable mechanism but is too expensive for high frequency events. A nanotrap can be asynchronous (the main thread continues) or synchronous (the main thread blocks until the nanothread servicing the nanotrap completes).

Chappell et al note that SMT is of little use to single-threaded programs [Chappell99]. To provide a benefit to an individual thread they propose Simultaneous Subordinate Microthreading (SSMT) in which an SMT-like processor hosts a single program thread and a number of lower priority subordinate microthreads used to perform hardware optimisations on behalf of the program thread. The microthreads are spawned either explicitly by the program thread using a new processor instruction or automatically by certain events taking place within the processor. The microthreads’ instructions are stored in a “MicroRAM” within the processor which reduces interference with the main thread’s instruction fetching. Chappell et al suggest possible uses for these microthreads which include branch prediction optimisation, prefetching and software cache management. They suggest that such software based schemes can outperform purely hardware based equivalents because more sophisticated algorithms can be implemented in software than in hardware and there is greater flexibility with the ability to tune the microthread implementations for different applications. Microthreads are preferable to normal system threads for the optimisation purpose because they do not contribute to cache utilisation and can be given greater access to processor internals without being constrained to the processor’s instruction set architecture.

Chappell et al focus on increasing the performance of a single program. An SSMT system wishing to execute more than one program would have to context switch between them (which would also involve context switching the microthreads) whereas an SMT system with sufficient thread contexts would be able to execute the programs simultaneously. The performance tradeoff between SSMT and SMT is therefore individual programs running faster but only having a share of the processor time each, versus individual programs running continuously but slower due to sharing the processor. The SSMT designers do not address this issue. It is likely that some implementations and workloads would benefit more from SSMT and some more from SMT.

Chappell et al’s work requires specific hardware. It is worth exploring the feasibility of using some kind of subordinate threads for optimisation and system functions on current hardware. The main disadvantage of Intel’s Hyper-Threading in this context is the lack of any form of prioritisation. This means that a “subordinate” helper thread could very easily cause a performance loss to the main program thread while trying to optimise it or perform some other service. The design of such a thread would have to be very careful and would have to avoid actions that are known to affect the performance of the other thread. A major problem would be instruction and data cache interference, including cache flushing due to self-modifying code (something an optimisation thread may well want to do). In order to provide any optimisation service the subordinate thread would have to have suitable access to appropriate processor internal information. The current processor performance counters could be useful but a richer, lower level of access would be required to be able to replicate the work of Chappell et al. A further complication is forking and joining. Hyper-Threading originally had no support for any form of inter-thread communication other than the use of shared memory as would be used by multiple processors. The Intel “Prescott” core, the second implementation of Hyper-Threading, introduces the MONITOR and MWAIT instructions as described in Chapter 2. These instructions provide a mechanism to synchronise threads.
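A minimal sketch of how the two instructions might be used (kernel mode only on these processors; the flag variable and wrapper names are illustrative): the waiting thread arms a monitor on a shared cache line and then sleeps in MWAIT until the other logical processor writes to it.

    /* GCC-style inline assembly wrappers. MONITOR takes the address to watch in
     * EAX with hints in ECX and EDX; MWAIT takes hints in EAX and ECX. Both are
     * privileged instructions on the “Prescott” implementation. */
    static inline void cpu_monitor(const volatile void *addr)
    {
        asm volatile("monitor" :: "a"(addr), "c"(0), "d"(0));
    }

    static inline void cpu_mwait(void)
    {
        asm volatile("mwait" :: "a"(0), "c"(0));
    }

    /* Wait until the other logical processor sets *flag. The flag is re-checked
     * after arming the monitor so that a write arriving in between is not lost. */
    static void wait_for_flag(volatile int *flag)
    {
        while (!*flag) {
            cpu_monitor(flag);
            if (!*flag)
                cpu_mwait();
        }
    }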

Dorai et al’s Transparent Threads are a general purpose variation of subordinate threading. Various priority mechanisms are added to an SMT processor to allow background (“transparent”) threads to run without affecting the performance of the foreground thread [Dorai02]. The priority is implemented by a collection of mechanisms. Instruction fetch and issue slots are prioritised; in each cycle, a slot is only allocated to a background thread once the foreground thread has taken all the slots it can. The instruction buffers (reorder buffer and pre-issue instruction queues) are difficult to allocate according to priority because the effect on performance is delayed. The paper describes a technique called “background thread instruction window partitioning” which imposes an ICOUNT (hardware rate-limiting of each thread’s fetching) limit on the background threads’ fetching in order to limit their population of the instruction buffers. This technique does not completely prevent the foreground thread from suffering resource contention but does limit the damage that a background thread can do. The authors also describe “background thread flushing” where the most recently fetched instructions from the background threads can be flushed from the reorder buffer and the program counters wound back. This action is carried out when the foreground thread is unable to obtain a slot in the reorder buffer. The work focuses on pipeline resources but suggests that access to caches could be biased by limiting the amount of each cache that a background thread can use by modifying the address hashing function in use.
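The flavour of the slot prioritisation and window partitioning can be captured in a few lines. The sketch below is a hypothetical single-cycle fetch arbitration, not Dorai et al’s implementation: the foreground thread takes every slot it can use, background threads share what remains, and a background thread whose in-flight instruction count exceeds a fixed cap is skipped.

    #define FETCH_SLOTS     8   /* assumed fetch slots available per cycle */
    #define BG_ICOUNT_LIMIT 16  /* assumed cap on a background thread's buffer occupancy */

    struct hw_thread {
        int runnable;  /* has instructions ready to fetch this cycle */
        int icount;    /* instructions currently occupying the pre-issue buffers */
        int wants;     /* slots the thread could usefully take this cycle */
    };

    /* Returns the number of slots given to the foreground thread; bg_slots[i]
     * receives the allocation for each background thread. */
    int allocate_fetch_slots(struct hw_thread *fg,
                             struct hw_thread *bg, int nbg, int *bg_slots)
    {
        int remaining = FETCH_SLOTS;
        int fg_slots  = 0;

        for (int i = 0; i < nbg; i++)
            bg_slots[i] = 0;

        if (fg->runnable) {
            fg_slots   = fg->wants < remaining ? fg->wants : remaining;
            remaining -= fg_slots;
        }
        for (int i = 0; i < nbg && remaining > 0; i++) {
            if (!bg[i].runnable || bg[i].icount >= BG_ICOUNT_LIMIT)
                continue;  /* window-partitioning limit: skip a greedy background thread */
            bg_slots[i] = bg[i].wants < remaining ? bg[i].wants : remaining;
            remaining  -= bg_slots[i];
        }
        return fg_slots;
    }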

Dorai et al suggest the following uses for transparent threads.

• Downgrading the priority of non-interactive threads when an interactive thread on that processor is receiving an event. The interactive thread gets all of the resources it needs while the other threads can continue in the background using any available resources.

• Subordinate multithreading, in a manner similar to SSMT described above, or for prefetching and pre-execution.

• Performance monitoring for profiling the foreground thread, without the need to introduce invasive instrumentation code.

In their simulated prototype the authors found that the average performance degradation to foreground tasks while running some transparent prefetching threads was only 3%. They suggest that two-thirds of this value is due to cache effects which their design does not deal with. The background prefetching thread ran at 77% of the speed it would have done with an equal priority scheme and its prefetching functionality gave the main thread a speedup in the region of 10%, a net gain.

Oplinger and Lam propose using an SMT-based architecture to allow programmers to write code monitoring and recovery functions that can be speculatively executed in parallel with the main thread [Oplinger02]. Examples of these functions include profiling and run-time error checking such as checking the stack for buffer overruns. The design supports efficient transactions. A transaction can be implemented by the processor making a local copy of the register state before speculatively proceeding with the thread. If the thread decides to commit then the buffered stores are performed and execution continues, otherwise the registers are restored and execution continues from the specified address.
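A software analogue of that commit/abort behaviour can make the semantics concrete (the real proposal checkpoints registers and buffers stores in hardware; the data type and check below are purely illustrative):

    /* A monitoring check runs against a private shadow copy of the state it may
     * modify; the copy is written back only if the check succeeds, otherwise it
     * is simply discarded, mirroring the restore-registers/flush-stores path. */
    struct account { long balance; long limit; };

    int checked_withdraw(struct account *a, long amount)
    {
        struct account shadow = *a;       /* "checkpoint" of the state */
        shadow.balance -= amount;         /* speculative update, not yet visible */
        if (shadow.balance < -shadow.limit)
            return -1;                    /* abort: discard the shadow copy */
        *a = shadow;                      /* commit: the buffered update becomes visible */
        return 0;
    }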

5.4 Fault Tolerance

In recent years a body of research has focused on techniques to support hardware that may contain or experience faults. Traditionally digital hardware is assumed to perform perfectly if operated within specifications. With modern processors having very tight margins on voltage levels, feature sizes and noise, the risk of a transient fault, caused by random effects such as cosmic rays, is becoming greater (albeit very small at the moment). Research on how to address this potential future problem generally uses redundancy in some form to check the results of executed code. Space redundancy (the use of replicated hardware) has obvious financial and energy costs, time redundancy (re-executing on the same hardware) takes longer, and data redundancy (the use of check bits, CRCs, etc.) introduces extra complexity.

Rotenberg [Rotenberg99] proposes a modification of SMT called AR-SMT which allows two copies of a program to be run at approximately the same time. There is a notion of an active thread, the “A-stream”, followed shortly by the redundant thread, the “R-stream”. The R-stream lags the A-stream in time, providing time redundancy. A delay buffer is used to feed results from the A-stream to the R-stream for comparison — this technique is similar to slipstreaming described above. The R-stream can enjoy perfect control and dataflow prediction because it uses these details as passed back by the A-stream (this does not reduce any of the redundancy as the instructions are still executed). This design is aimed at general purpose processors where traditional fault tolerant systems are too costly or inefficient. The technique would suffer somewhat on a processor using the Intel Hyper-Threading style of multithreading where homogeneity can cause a notable reduction in the combined throughput of the threads (see Chapter 3).

The technique is further refined by Reinhardt and Mukherjee’s simultaneously and redundantly threaded (SRT) processor, by considering in more detail where redundancy needs to be provided and by allowing more slack in the fetch rates of the two threads [Reinhardt00]. SRT is only a detection mechanism; software checkpointing must be used alongside it in order to be able to backtrack to a point before the fault. Vijaykumar et al modify SRT to build in the recovery functionality [Vijaykumar02]. SRTR (SRT with recovery) prevents a faulty operation from changing global state by ensuring that instructions from the leading thread can only commit when the trailing redundant thread agrees on the correct result. This method avoids the need for software checkpointing, which improves performance when the rate of faults is low and allows unmodified software to be used.

5.5 Operating System Functions

In this section I describe related work and present some preliminary investigations into using simultaneous threads to improve the performance of operating system functions. The particular theme is the avoidance of changing privilege levels, a mechanism known to be expensive on modern processors.

5.5.1 Exception Handling

Zilles et al propose the use of multithreading, using SMT as the particular example, for exception handling [Zillies99]. When executing on a speculative superscalar processor a thread experiencing a recoverable exception, such as a software translation look-aside buffer (TLB) miss, will undergo a sequence of operations similar to the following.

• Instructions up to and beyond the faulting instruction will be fetched.

• Some non-dependent instructions beyond the faulting instruction may have been speculatively executed.

• The faulting instruction is executed and the exception is noted, to be raised when the instruction is retired.

• On retirement of the faulting instruction the exception is raised by changing the control flow to the exception handler and flushing the fetched, and possibly speculatively executed, instructions that followed the faulting instruction.

• The exception handler executes and its return-from-exception instruction again causes a control flow change back to the originally faulting instruction.

• Fetching continues from the originally faulting instruction and execution continues.

Note that there is duplicated work: instructions following the faulting instruction are fetched, and some speculatively executed, twice. During the two control flow changes there will be periods of idleness for the core execution units while the new instruction sequences propagate through the fetch, decode and rename stages.

Zilles et al suggest that, for software TLB miss exceptions at least, the non-dependent instructions beyond the faulting memory access could proceed and instructions dependent upon the memory access could wait in the normal dataflow manner. If the exception were to be handled in an out-of-band manner then on the handler’s completion the faulting instruction would be allowed to proceed. The execution would then carry on as normal, with the originally faulting instruction’s completion allowing the dependent instructions to be scheduled. Zilles et al propose the use of a new hardware thread context for the exception handler with a small amount of extra hardware to communicate completion of the handler back to the faulting instruction. This mechanism removes the duplicate work described above and allows the main thread to continue with any non-dependent work.

Should the exception be non-recoverable then the handler communicates the fact back to the faulting instruction, which then causes the thread to terminate when the faulting instruction is retired. As with control flow misprediction, any speculatively executed instructions beyond the faulting instruction are not retired so no architectural state is incorrectly affected.

5.5.2 Privilege Level Partitioning

Interrupts can be quite disruptive to applications running on aggressive out-of-order superscalar processors. This is due in part to the high cost of (re)filling the instruction windows for both the interrupt handler and the user mode application. The time spent in the interrupt handler is of course taken away from the application’s CPU time allocation. The net effect is that providing any sort of quality-of-service guarantee to an application is difficult or impossible.

Muir describes an alternative arrangement as part of his Piglet architecture [Muir00]. One processor in a two way SMP system is dedicated to operating system functions including interrupt handling. Applications execute on the other processor. Whilst this mechanism provides a speedup to individual applications it comes with a high hardware and energy consumption cost. The same technique applied to the two logical processors of a Hyper-Threaded processor would be less costly to implement as no extra physical processor or other hardware would be needed. A Hyper-Threaded implementation would not provide as much isolation as Muir’s SMP implementation because of the dynamic sharing of resources by the interrupt handler and the user level application. When not servicing an interrupt the system Hyper-Thread must idle using the HLT instruction in order to give all shared resources to the application Hyper-Thread.

Piglet also provided for asynchronous system calls. The user level process running on one processor would use shared memory to cause the system call to be executed on the OS processor.

In his design for a multithreaded processor Moore proposes the use of an interface for user threads to communicate with the operating system [Moore96].

The Xen virtualisation infrastructure [Barham03] divides a computer between multiple “guest” operating systems (OSs). Xen abstracts the hardware interfaces by providing virtual interfaces to guest OSs and hosting device drivers within the hypervisor or a per-machine domain. Interrupts taken by the hypervisor or device driver domain while a guest OS is scheduled will consume CPU time allocated to the guest OS. In an effort to reduce this effect Fraser et al propose that interrupts are handled by one logical Hyper-Threaded processor while the guest runs on the other logical processor [Fraser04].

5.5.3 Interrupt Handling Partitioning on Linux

In order to assess the impact of the interrupt handling partitioning technique on a commodity operating system I performed some preliminary experiments using the Linux kernel.

Every process metadata structure contains two useful bitmasks: cpus_allowed, denoting processors it is allowed to be executed on, and launch_policy, which initialises the cpus_allowed field for child processes. The “init” process was given a launch_policy specifying only the first logical processor of each physical package. As all other processes are descendants of init, all were created with this restriction.

            Physical Processor 0              Physical Processor 1
Config      Hyper-Thread 0   Hyper-Thread 1   Hyper-Thread 0   Hyper-Thread 1
UP          Both             Disabled         Disabled         Disabled
SMP         Both             Disabled         Both             Disabled
1x2         Both             Both             Disabled         Disabled
2x2         Both             Both             Both             Both
PGLT        User app.        Disabled         IRQ              Disabled
1xHTP       User app.        IRQ              Disabled         Disabled
2xHTP       User app.        IRQ              User app.        IRQ

Table 5.1: Configurations for interrupt/application separation experiments. “Both” denotes the traditional arrangement of a processor serving both user level processes and interrupt handlers; “user app.” and “IRQ” denote processors dedicated to the task of running user-level processes and handling interrupts respectively.

The scheduler is able to migrate processes between allowed processors (i.e. the first logical processor on each package in a multi-package system). Linux also supports IRQ affinity (subject to a cooperative IRQ controller) using a similar bitmask arrangement for each IRQ number. IRQ handling was limited to the second logical processor on each package by setting the smp_affinity bitmask for each interrupt through the /proc/irq/ interface. The round-robin allocation of interrupts to allowed processors is still able to take place, but only to the second logical processor in each package. Since no user-level processes were allowed to execute on the second logical processors, these processors executed the idle task when not handling interrupts. Linux uses the HLT instruction for idling, which causes the processor to dedicate all previously shared and partitioned resources to the active thread.
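A similar partitioning can be approximated from user space on later kernels. The sketch below is illustrative only (it assumes a single Hyper-Threaded package whose logical processors are numbered 0 and 1, an arbitrary example IRQ number, and the sched_setaffinity call rather than the launch_policy mechanism used in these experiments): the calling process is pinned to logical processor 0 and an interrupt is steered to logical processor 1 by writing a mask to its smp_affinity file.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling process to a single logical processor. */
    static int pin_self_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling process */
    }

    /* Steer an IRQ to a single logical processor by writing a hexadecimal CPU
     * mask to /proc/irq/<n>/smp_affinity (requires root and a cooperative IRQ
     * controller). */
    static int route_irq_to_cpu(int irq, int cpu)
    {
        char path[64];
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        FILE *f = fopen(path, "w");
        if (f == NULL)
            return -1;
        fprintf(f, "%x\n", 1u << cpu);
        return fclose(f);
    }

    int main(void)
    {
        pin_self_to_cpu(0);       /* application Hyper-Thread */
        route_irq_to_cpu(19, 1);  /* hypothetical NIC IRQ on the interrupt Hyper-Thread */
        return 0;
    }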

A series of workloads were executed on seven machine configurations as shown in Table 5.1. UP is a baseline traditional uniprocessor configuration. SMP, 1x2 and 2x2 represent the useful combinations of SMP and Hyper-Threading. PGLT, 1xHTP and 2xHTP apply application/interrupt separation to SMP, 1x2 and 2x2 respectively. PGLT represents the original Piglet configuration.

5.5.3.1 Linux Kernel Compilation

A commonly used application benchmark is a build of the Linux (or BSD) kernel. A large compilation like this contains compute, memory and disk-bound elements so is a good all-round test. Unlike standardised benchmarks such as the SPEC suite [SPEC], the variation in build and kernel configurations means that the results are only meaningful as trend indicators on the actual system and configuration tested.

Two different kernels were built. The first was the Linux 2.4.19 kernel and its modules as used on the test machine. This kernel was configured with only the facilities and drivers needed for that particular machine and was therefore fairly small. Each experimental run of this build was performed with warm caches. The second kernel was Linux 2.6.7, the newest Linux kernel at the time of the experiment. This kernel was configured with almost all of the available facilities and drivers and the experiments were performed with cold caches. The two different configurations were used to provide different example workloads, one more processor and memory bound and the other more disk bound. It was not intended to compare the performance of the two different configurations against each other.

For each kernel a build of the kernel itself and the modules was performed on each processor configuration. Each test was performed with single-process and parallel builds (-j option to make). For the 2.4.19 build the caches were initially warmed with a build and each subsequent experiment was separated by a make clean. The build of the 2.6.7 kernel was large enough that it flushed the processor and OS block caches by itself. Each experiment was separated by a make mrproper (and a reinstatement of the kernel configuration file) to provide a clean build tree.

The results of the experiments can be interpreted from three points of view, relating to the invariant parameter: the effect of using parallel builds, the effect of removing interrupt processing from the application processor(s), and the tradeoff between using a given number of processors for multiprogramming or for partitioned application and interrupt processing. The first of these was described earlier, in Section 3.3.3.

The second interpretation of the data fixes the number of processors that are used to execute applications and adds a logical or physical processor to handle interrupts. Figure 5.1(a) shows the build times for the larger Linux 2.6.7 kernel for the basic uniprocessor UP and for two configurations where there is a single processor for applications with interrupt processing partitioned to another physical processor (PGLT) and to another logical processor on the same package as the application processor (1xHTP). These results show that there is little difference between the configurations, although the Hyper-Threaded partitioned configuration does slightly better than the uniprocessor configuration and the multiprocessor partitioned configuration does slightly better than the Hyper-Threaded partitioned configuration. The differences are only in the order of 1%. Figure 5.1(b) shows the comparison of a dual processor system (SMP) with a dual processor system, each processor having two logical processors, with interrupts and applications partitioned on each physical processor (2xHTP). This graph shows almost no difference between the build times of each configuration.

These results do not take into account the cost of providing the partitioning (which would be considerable in the multiprocessor PGLT case) or the loss of any multiprocessing benefit by not using the other logical/physical processors for application processing. This second aspect is considered next.

The final view of the data uses the invariant of the number of logical and physical processors. This view allows a comparison of the performance of partitioning compared to using the processors for multiprogramming. Figure 5.2 shows comparisons of the build times for Linux 2.6.7 for three different processor counts: two physical processors, two logical processors on a single physical processor, and two logical processors on each of two physical processors. All three scenarios show multiprogramming winning over partitioning. Therefore, for this particular workload, the gain from multiprogramming is greater than that from partitioning. The two physical processor case unsurprisingly shows the largest difference. The two Hyper-Threaded scenarios, particularly the two package 2xHTP, show less of a difference. This is due to the fairly similar speedups seen for each technique independently.


[Figure 5.1 plots build time in seconds against the number of make processes (-j). Panel (a), single application processor: real and CPU times for UP, PGLT and 1xHTP. Panel (b), two application processors: real and CPU times for SMP and 2xHTP.]

Figure 5.1: Build times (“Real” wall-clock time and CPU time) for the Linux 2.6.7 kernel: the effect of partitioning interrupt processing to other logical or physical processors.

           SPEC run time   TCP rate
Config     seconds         MB/second
UP         10165           99.0
PGLT       8507            92.2
1xHTP      8842            90.4

Table 5.2: Results for the SPEC CPU2000 runs while sinking a high bandwidth TCP stream.

These results suggest that the form of interrupt/application partitioning being investigated is not worthwhile for this workload, but would not cause much of a performance loss if partitioning was found to be useful for other workloads. Although a reasonable amount of time is spent executing in the kernel, most of this is “top-down” rather than in the interrupt context. The latter is mainly restricted to notifications of the completion of direct-memory-access (DMA) transfers — a relatively low cost.

5.5.3.2 Compute and Network Workloads

The partitioning of interrupt and application processing should be most effective when the system is running mixed workloads of compute bound processes and interrupt-heavy I/O. To demonstrate this the SPEC CPU2000 benchmark suite was run while receiving data over a TCP stream with ttcp. The SPEC run consisted of running each benchmark once (with cold caches) in sequence (the Fortran-90 programs were not used due to the lack of an available compiler).

ttcp is a utility that can source or sink TCP streams. A receiver was run on the test machine and a transmitter on an identical machine, both connected by gigabit Ethernet to the same switch. Default parameters were used with the exception of the number of buffers to be transmitted which was set to a large value to ensure traffic flowed for the entirety of the SPEC run. The transmitter was forcefully stopped when the SPEC run completed. The ttcp receiver was executed on the same application processor as the SPEC benchmarks. In the partitioned configurations the kernel-mode I/O processing would have been split between the processors with “top-down” system call context processing occurring on the application processor and “bottom-up” interrupt context processing on the interrupt processor. This highlights the main difference between these experiments and Muir’s Piglet — the latter uses asynchronous message passing between the processors to perform system calls. Linux aims to minimise the amount of work performed by a non-preemptable interrupt handler. Instead work is deferred to a bottom-half handler (“tasklet” or “soft IRQ”). Tasklets are scheduled on the processor that executed the (interrupt handler) code that created them. The result is that soft IRQs are executed on the interrupt processor in Piglet-style configurations.

To assess the impact of application/interrupt partitioning on the single-threaded SPEC run, configurations with one application processor were tested. These are UP, the basic uniprocessor configuration; PGLT, the Piglet-style configuration; and 1xHTP, the Hyper-Threaded version of Piglet. The results for the SPEC run time and TCP transfer rate are shown in Table 5.2.

The results show that partitioning improves the performance of the SPEC execution. For PGLT this improvement is primarily due to the significant amount of CPU time used for TCP reception being moved to the other physical processor, therefore giving the SPEC programs a greater share of their processor. This configuration shows a lower rate of TCP transfer than UP.

[Figure 5.2 plots build time in seconds against the number of make processes (-j). Panel (a), two physical processors: real and CPU times for SMP and PGLT. Panel (b), two logical processors on a single physical processor: real and CPU times for 1x2 and 1xHTP. Panel (c), two logical processors on each of two physical processors: real and CPU times for 2x2 and 2xHTP.]

Figure 5.2: Build times (“Real” wall-clock time and CPU time) for the Linux 2.6.7 kernel: comparing the performance of a given set of processors used in multiprogrammed and partitioned configurations.

           httperf                              SPECweb99
           Response time     Transfer time      Throughput
Config     milliseconds      milliseconds       kbits/sec
UP         1.20 ± 0.10       44.8 ± 1.8         348.8
SMP        1.00 ± 0.00       37.0 ± 1.4         396.7
1x2        1.10 ± 0.00       35.1 ± 0.3         378.2
2x2        1.00 ± 0.00       34.4 ± 0.5         395.5
PGLT       0.90 ± 0.00       41.2 ± 0.6         355.2
1xHTP      0.95 ± 0.05       36.5 ± 0.6         352.7
2xHTP      0.90 ± 0.00       35.1 ± 0.5         396.9

Table 5.3: Response and transfer times for fetches of a 1MB file using HTTP.

This is most likely due to ttcp and its associated I/O processing being spread over two processors and therefore suffering cache-line ping-pong. The Hyper-Threaded 1xHTP also shows an improvement over the basic uniprocessor case. The improvement is less than that achieved by the dual package PGLT because the interaction between the interrupt processing and the SPEC execution reduces the effective processor throughput for each Hyper-Thread. 1xHTP shows a lower TCP transfer rate. This is caused by ttcp suffering the same thread interaction performance reduction.

5.5.3.3 HTTP Serving

The httperf-0.8 tool [Mosberger98] was used to measure the performance of the Apache web server version 1.3.27 running on the test machine. A 1 megabyte file was repeatedly requested at a rate found to saturate the gigabit network connection between the identical client and server test machines: 112 requests per second, giving a network throughput of 939 megabits per second. Each test was performed with 2000 HTTP connections made to the server at this fixed rate.

Table 5.3 shows the initial response and total transfer times for the 1 megabyte file with the server running on each machine configuration. The results show that any hardware parallelism improves both times. The response time benefits from all Piglet-style configurations, although the magnitude of this gain is very small compared to the total transfer time. Both PGLT and 1xHTP configurations show a reduction in transfer time over UP: 8.0% and 18.5% respectively. The Hyper-Threaded version performs better than the original Piglet version because of the shared caches. Comparing the performance of configurations using one physical processor shows the multiprogrammed 1x2 slightly outperforming the Hyper-Threaded Piglet-style 1xHTP, both being much better than UP. With two physical processors multiprogramming again performs marginally better than the equivalent Hyper-Threaded Piglet 2xHTP. The traditional SMP arrangement performs better than the wasteful PGLT configuration.

Throughput was tested using the SPECweb99 benchmark [SPEC]. This benchmark tests the performance of the web server hardware and software. It consists of a generated set of files to be served statically and a CGI script to be used dynamically for both GET and POST HTTP methods. The benchmark provides the client software which chooses the type of requests made according to a prescribed distribution. SPECweb99 results are normally presented as the number of simultaneous HTTP connections that the server was able to maintain with at least 95% of the connections having an average bit-rate of at least 320 kbits/sec. Time constraints did not permit a full evaluation, therefore tests were performed using 60 simultaneous connections (a number found to provide good bit-rates) and the achieved bit-rate reported. The full test is left for future work.

The bit-rates achieved are shown in Table 5.3. It can be seen for the single physical processor configurations that a small increase in throughput is possible with both PGLT and 1xHTP over the uniprocessor UP. The multiprogrammed 1x2 configuration provides the highest bit-rate — the dynamic content generation used by SPECweb99 means that there is plenty of work to keep both logical processors busy. The two physical processor configurations all show a similar throughput because they are constrained by system-wide resources such as disk I/O and the buffer-cache.

Both sets of results show interrupt/application partitioning provides an advantage over traditional uniprocessor or SMP configurations but does not outperform the Hyper-Threaded processor being used in a multiprogrammed (or “virtual multiprocessor”) configuration.

5.6 Summary

I have described various ways to use SMT processors other than as a virtual multiprocessor. Treating an SMT processor as a single resource avoids the problem of non-cooperating threads degrading each other’s performance. This approach allows the programmer and/or compiler to choose the threads that will run best together. The difficulty is that a multithreaded application optimised for a particular SMT processor may not be optimal for another SMT processor of the same architecture but with a different number of threads or a different microarchitecture. Techniques such as precomputation and multi-path execution utilise the extra threads to speed up a single thread. Subordinate threads are a promising technique but are limited by the currently available hardware, particularly the lack of thread priority and inter-thread synchronisation.

I have described preliminary experiments to assess the value of partitioning application and interrupt handler processing across different logical processors (threads) of an SMT processor. The results suggest that the technique is useful for some workloads and warrants further investigation.

Chapter 6

Conclusions

In this dissertation I have described experiments performed to analyse the behaviour of a simultaneous multithreaded processor, the Intel Pentium 4 with Hyper-Threading. The dynamic sharing of processor resources causes a wide variety of interactions between simultaneously executing processes, often resulting in poor per-thread or system-wide performance. I presented a series of process scheduling algorithms that are sensitive to the characteristics of the Hyper-Threaded processor and showed that they can improve the system throughput and/or fairness compared to using a traditional scheduler. In this chapter I summarise the work and my contributions and describe some directions for further research.

6.1 Summary

In Chapter 1 I described the motivation for studying the performance of SMT processors, particularly the effect of the mutual interaction of simultaneously executing threads. The current use of SMT processors as “virtual multiprocessors” provides a simple interface for operating systems but hides the performance effect of threads sharing the processor. I presented my thesis, that SMT processors can be used more effectively with an operating system that is aware of their characteristics.

In Chapter 2 I described simultaneous multithreading and the commercial implementations of it. I described areas of operating system support that are necessary or desirable for SMT processors. I highlighted the process scheduler as an area that currently lacks detailed support.

In Chapter 3 I presented a series of experiments measuring the performance of Hyper-Threading. I showed how pairs of processes simultaneously executing can exhibit a wide range of performance. The experiments used processor hardware performance counters to investigate and explain the observed behaviour and develop a model to estimate the current performance based on those counters. I demonstrated how the performance of processes can change dramatically during their lifetime as they move through different phases of execution. I ended the chapter with a study that showed how different allocations of processes to logical processors in a two-package system could result in quite different system throughputs.

In Chapter 4 I presented a number of process scheduling algorithms that are able to take SMT thread performance interactions into account in order to improve the throughput and/or fairness of the system. I demonstrated how these techniques could be incorporated into a traditional scheduler so as to maintain the existing facilities of priority and starvation avoidance. The schedulers yielded improvements of up to 3.2% over a traditional scheduler; this figure is comparable with related work.

Finally, in Chapter 5 I discussed ways to exploit SMT processors other than the “virtual multiprocessor” model studied in the earlier chapters of this dissertation. Many of the alternative uses for simultaneous multithreading try to use the extra thread(s) to improve the throughput of a single thread. I presented preliminary experiments to assess the value of using two threads to process application and interrupt handling code separately. These initial results are encouraging and suggest the technique warrants further research.

In conclusion, my thesis — that SMT processors can be used more effectively with an operating system that is aware of their characteristics — is justified as follows. I showed in Chapter 3 that there is a large variability in thread and system performance when different workloads are executed simultaneously on a real SMT processor. I showed how rearranging the assignment of processes to logical processors could improve performance by up to 30%. An operating system without any knowledge of SMT processors would not be able to perform such a rearrangement and would not be able to tell that a given arrangement was yielding poor performance. Secondly, I presented SMT-aware process scheduling algorithms in Chapter 4 that were able to improve the throughput and/or fairness of a set of processes compared to a scheduler unaware of the characteristics of an SMT processor.

6.2 Further Research

The work presented in this dissertation could be taken further in a number of ways.

The performance estimation model developed in Chapter 3 was useful and accurate enough to provide scheduler heuristics. However, there is scope for further refinement by considering more performance counter metrics. A similar study on other SMT processors would allow models to be compared with the possibility of a parameterisable general model.

The thread interaction study in Chapter 3 showed how some workloads regularly caused the simultaneously executing thread to experience a low performance. There were workloads whose performance was difficult to impact and some that experienced a reduced performance when running with almost any other workload. Analysis of the hardware performance counters allowed many of these effects to be explained. A more formal classification model would be useful, perhaps based on Bayesian Classification. Being able to classify a running process in terms of the harm it can do to other processes and the degree to which it suffers itself would enable a simpler algorithm with the sole aim of avoiding simultaneous execution of processes that have been classified as likely to reduce performance.

The SMT-aware scheduling algorithms presented in Chapter 4 used dynamic measurement of running processes to provide feedback for scheduling. Processes were referenced by their identifiers (PIDs) with subsequent, or forked, instances of the same program having different PIDs and therefore being treated separately from the previous, or parent, process. It is likely that these subsequent instantiations would perform similarly to their predecessors so investigating the incorporation of data from the predecessor would be worthwhile.

In common with many scheduling algorithms, my SMT-aware algorithms have a number of tuneable parameters. Tuning these parameters could yield further speedups. The parameters include:

• the frequency of the performance estimate calculation,

• the scaling of the dynamic priority increases,

• the frequency of the plan offline planner’s execution, and

• the size of the tryhard performance matrix.

The experiments described in Chapter 5 showed how a system configured to partition application and interrupt processing on to different Hyper-Threads could yield speedups over traditional configurations. The performances observed were similar to those obtained when using Hyper-Threading as a “virtual multiprocessor” but without the hard-to-predict interaction between application threads. A scheme such as this would be useful in situations where predictability is important, such as a system providing quality-of-service or resource accounting.

This topic could be taken further with a study of more workloads, such as file or database servers. Additionally the Piglet concept of asynchronous system calls could be implemented using SMT. This would have the advantage of avoiding expensive privilege level changes but without the potentially costly problem of inter-cache traffic of the original Piglet system. The current generation of Intel Hyper-Threaded processors provides a lightweight inter-thread synchronisation mechanism (the MONITOR and MWAIT instructions) which could be used for this. However, current implementations only allow use of this facility in kernel-mode; this permits a user-mode program to asynchronously signal the kernel but the user-mode program is not able to use the same lightweight mechanism to wait on a reply. An evaluation of the technique could be performed by executing programs in kernel mode. It is certain that future processors will extend this synchronisation mechanism to user-mode.

SMT processors are likely to be commonplace for a good while. SMT provides an effective way to extract more throughput from a processor without incurring a high implementation overhead. With Intel now producing SMT processors as standard and IBM starting to produce their SMT Power5 processor, the use of SMT will become more widespread. It is likely that future SMT processors will support more threads than current implementations because the overhead of adding the extra state to the processor is fairly low. However, although early SMT research described processors with up to 8 threads and hinted at even larger numbers, practical considerations such as the size and complexity of the register file limit the scalability of SMT. Multicore processors with a shared level 2 cache, which have now been available for a few years, are likely to have their scalability limited by the connection and bandwidth between the cores and the cache. IBM and other manufacturers have announced processors which combine multithreading and multiple cores. It is likely that this combination will become more common in the future.

One problem with extra threads is the increased demand for bottleneck resources such as the caches. This problem is compounded on multicore systems where the cores share a level 2 cache. A suitable schedule of software threads to hardware threads is likely to become more important.

Further study of scheduling and processor allocation would be useful. Additionally, the use of hardware threads for system purposes, such as optimisation and performance monitoring, would be useful as the threads can be designed to minimise their load on bottleneck resources.

A focus of this dissertation has been the evaluation of performance and investigation of techniques on real commodity systems. At the time of writing, multicore processors (chip multiprocessors – CMP) and multicore-multithreaded processors (often referred to as chip multithreading – CMT) are only beginning to appear in commodity systems. Many of the areas covered in this dissertation are relevant to multicore processors. CMP systems often have a shared level 2 cache and per-core level 1 caches. The sharing of the level 2 cache was found to be a major contributor to the performance interactions between Hyper-Threads; it will be the main cause of performance interactions for CMP systems. CMT systems introduce a three level hierarchy of processors: multiple packages, each of multiple cores, each of multiple logical processors (threads). The varying degrees of sharing at the different levels present an interesting challenge to efficiently utilise the processors without losing performance through resource contention. The techniques of feedback-directed scheduling explored in this dissertation could be extended to CMT systems.

Bibliography

[Akkary98] H. Akkary and M. A. Driscoll. A Dynamic Multithreaded Processor. In Proceedings of the 31st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-31), pages 226–236. IEEE Computer Society, December 1998. (p 97)

[Alverson90] R. Alverson, D. Callahan, D. Cummings, D. Koblenz, A. Porterfield, and B. J. Smith. The Tera Computer System. In Proceedings of the 4th International Conference on Supercomputing, pages 1–6. ACM Press, June 1990. (p 20)

[Anderson67] D. W. Anderson, F. J. Sparacio, and F. M. Tomasulo. The IBM System/360 Model 91: Machine Philosophy and Instruction-Handling. IBM Journal, 11:8–24, January 1967. (p 100)

[Barham03] P. R. Barham, B. Dragovic, K. A. Fraser, S. M. Hand, T. L. Harris, A. C. Ho, R. Neugebauer, I. A. Pratt, and A. K. Warfield. Xen and the Art of Virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP ’03), pages 164–177. ACM Press, October 2003. (p 106)

[Boggs04] D. Boggs, A. Baktha, J. Hawkins, D. T. Marr, J. A. Miller, P. Roussel, R. Singhal, B. Toll, and K. S. Venkatraman. The Microarchitecture of the Intel Pentium 4 Processor on 90nm Technology. Intel Technology Journal, 8(1):1–17, February 2004. (pp 24, 32)

[Borozan02] H. Borozan. Microsoft Windows-Based Servers and Intel Hyper-Threading Technology. Technical Article, Microsoft Corporation, April 2002. (p 27)

[Bovet00] D. P. Bovet and M. Cesati. Understanding the Linux Kernel. O’Reilly and Associates, October 2000. (p 77)

[Bulpin04] J. R. Bulpin and I. A. Pratt. Multiprogramming Performance of the Pentium 4 with Hyper-Threading. In Third Annual Workshop on Duplicating, Deconstructing and Debunking (at ISCA’04), pages 53–62, June 2004. (p 45)

[Cazorla03] F. J. Cazorla, E. Fernandez, A. Ramirez, and M. Valero. Improving Memory Latency Aware Fetch Policies for SMT Processors. In Proceedings of the 5th International Symposium on High Performance Computing (ISHPC), pages 70–85. Springer-Verlag, October 2003. (p 51)

[Cazorla04] F. J. Cazorla, P. M. W. Knijnenburg, R. Sakellariou, E. Fernandez, A. Ramirez, and M. Valero. Predictable Performance in SMT Processors. In Proceedings of the first ACM International Conference on Computing Frontiers (CF04), pages 433–443. ACM Press, April 2004. (p 73)

[Chappell99] R. S. Chappell, J. Stark, S. P. Kim, S. K. Reinhardt, and Y. N. Patt. Simultaneous Subordinate Microthreading (SSMT). In Proceedings of the 26th International Symposium on Computer Architecture (ISCA ’99), pages 186–195. IEEE Computer Society, May 1999. (p 102)

[Chen02] Y-K. Chen, M. Holliman, E. Debes, S. Zheltov, A. Knyazev, S. Bratanov, R. Belenov, and I. Santos. Media Applications on Hyper-Threading Technology. Intel Technology Journal, 6(2):47–57, February 2002. (pp 35, 39)

[Collins01a] J. D. Collins, D. M. Tullsen, H. Wang, and J. P. Shen. Dynamic Speculative Precomputation. In Proceedings of the 34th Annual International Symposium on Microarchitecture (MICRO-34), pages 306–317. IEEE Computer Society, December 2001. (p 98)

[Collins01b] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y-F. Lee, D. Lavery, and J. P. Shen. Speculative Precomputation: Long-range Prefetching of Delinquent Loads. In Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA ’01), pages 14–25. IEEE Computer Society, July 2001. (p 98)

[Diefendorff99] K. Diefendorff. Compaq Chooses SMT for Alpha. Microprocessor Report, 13(16), December 1999. (p 21)

[Dorai02] G. K. Dorai and D. Yeung. Transparent Threads: Resource Sharing in SMT Processors for High Single-Threaded Performance. In Proceedings of the 11th International Conference on Parallel Architectures and Compilation Techniques (PACT ’2002), pages 30–41. IEEE Computer Society, September 2002. (p 103)

[Dorozhevets92] M. Dorozhevets and P. Wolcott. The El’brus-3 and MARS-M: recent advances in Russian high-performance computing. The Journal of Supercomputing, 6(1):5–48, March 1992. (p 20)

[Dubois98] M. Dubois and Y. H. Song. Assisted Execution. Technical Report 98-25, University of Southern California, October 1998. (p 101)

[Eggers97] S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, and D. M. Tullsen. Simultaneous Multithreading: A Platform for Next-Generation Processors. IEEE Micro, 17(5):12–19, October 1997. (pp 20, 37)

[Fedorova04a] A. Fedorova, M. Seltzer, C. Small, and D. Nussbaum. Throughput-Oriented Scheduling on Chip Multithreading Systems. Technical Report TR-17-04, Harvard University, August 2004. (p 75)

[Fedorova04b] A. Fedorova, C. Small, D. Nussbaum, and M. Seltzer. Chip Multithreading Systems Need a New Operating System Scheduler. In 11th ACM SIGOPS European Workshop, September 2004. (p 75)

[Fraser04] K. A. Fraser, S. M. Hand, R. Neugebauer, I. A. Pratt, A. K. Warfield, and M. A. Williamson. Reconstructing I/O. Technical Report UCAM-CL-TR-596, University of Cambridge Computer Laboratory, August 2004. (p 106)

[Glaskowsky04] P. N. Glaskowsky. Prescott Pushes Pipelining Limits. Microprocessor Report, February 2004. (p 23)

[Greenwald96] M. Greenwald and D. Cheriton. The Synergy Between Non-blocking Synchronization and Operating System Structure. In Proceedings of the USENIX 2nd Symposium on Operating Systems Design and Implementation (OSDI ’96), pages 123–136. The USENIX Association, October 1996. (p 28)

[Grunwald02] D. Grunwald and S. Ghiasi. Microarchitectural Denial of Service: Insuring Microarchitectural Fairness. In Proceedings of the 35th Annual International Symposium on Microarchitecture (MICRO-35), pages 409–418. IEEE Computer Society, November 2002. (p 38)

[Hennessy03] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, third edition, 2003. (p 18)

[Henning00] J. L. Henning. SPEC CPU2000: Measuring CPU Performance in the New Millennium. IEEE Computer, 33(7):28–35, July 2000. (p 83)

[Hinton01] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal, 5(1):1–13, February 2001. (p 21)

[Hirata92] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa. An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads. In Proceedings of the 19th International Symposium on Computer Architecture (ISCA ’92), pages 136–145. IEEE Computer Society, May 1992. (p 20)

[IEEE04] IEEE. IEEE 1003.1-2004: Standard for Information Technology — Portable Operating System Interfaces (POSIX). IEEE Computer Society, 2004. (p 96)

[Intel01a] Intel Corporation. Intel Architecture Software Developer’s Manual. Volume 2: Instruction Set Reference, 2001. (p 29)

[Intel01b] Intel Corporation. Intel Architecture Software Developer’s Manual. Volume 3: System Programming Guide, 2001. (pp 27, 40)

[Intel01c] Intel Corporation. Introduction to Hyper-Threading Technology, 2001. (p 21)

[Intel03] Intel Corporation. Prescott New Instructions Software Developer’s Guide, June 2003. (pp 24, 29)

[Kalla04] R. Kalla, B. Sinharoy, and J. M. Tendler. IBM Power5 Chip: a Dual-Core Multithreaded Processor. IEEE Micro, 24(2):40–47, March 2004. (pp 24, 29, 31)

[Kaxiras01] S. Kaxiras, G. Narlikar, A. D. Berenbaum, and Z. Hu. Comparing Power Consumption of an SMT and a CMP DSP for Mobile Phone Workloads. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES ’01), pages 211–220. ACM Press, November 2001. (p 35)

[Kessler92] R. E. Kessler and M. D. Hill. Page Placement Algorithms for Large Real-Indexed Caches. ACM Transactions on Computer Systems, 10(4):338–359, November 1992. (p 33)

[Kim04] D. Kim, S. S. Liao, P. H. Wang, J. Cuvillo, X. Tian, X. Zou, D. Yeung, M. Girkar, and J. P. Shen. Physical Experimentation with Prefetching Helper Threads on Intel’s Hyper-Threaded Processors. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO ’04), pages 27–38. IEEE Computer Society, March 2004. (p 99)

[Koufaty03] D. Koufaty and D. T. Marr. Hyperthreading Technology in the Netburst Microarchitecture. IEEE Micro, 23(2):56–64, 2003. (p 21)

[Krewell02] K. Krewell. Intel’s Hyper-Threading Takes Off. Microprocessor Report, December 2002. (p 21)

[Kumar04] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas. Single-ISA Heterogeneous Multi-Core Architectures for Multi-Threaded Workload Performance. In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA ’04), pages 64–75. IEEE Computer Society, June 2004. (p 74)

[Lee98] D. C. Lee, P. J. Crowley, J-L. Baer, T. E. Anderson, and B. N. Bershad. Execution Characteristics of Desktop Applications on Windows NT. In Proceedings of the 25th International Symposium on Computer Architecture (ISCA ’98), pages 27–38. IEEE Computer Society, June 1998. (pp 20, 71)

[Limousin01] C. Limousin, J. Sebot, A. Vartanian, and N. Drach-Temam. Improving 3D Geometry Transformations on a Simultaneous Multithreaded SIMD Processor. In Proceedings of the 15th International Conference on Supercomputing, pages 236–245, May 2001. (p 26)

[Lo97a] J. L. Lo, S. J. Eggers, H. M. Levy, S. S. Parekh, and D. M. Tullsen. Tuning Compiler Optimizations for Simultaneous Multithreading. In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-30), pages 114–124. IEEE Computer Society, December 1997. (p 33)

[Lo97b] J. L. Lo, J. S. Emer, H. M. Levy, R. L. Stamm, D. M. Tullsen, and S. J. Eggers. Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems, 15(3):322–354, August 1997. (p 20)

[Lo98] J. L. Lo, L. A. Barroso, S. J. Eggers, K. Gharachorloo, H. M. Levy, and S. S. Parekh. An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors. In Proceedings of the 25th International Symposium on Computer Architecture (ISCA ’98), pages 39–50. IEEE Computer Society, June 1998. (p 32)

[Loikkanen96] M. Loikkanen and N. Bagherzadeh. A Fine-Grain Multithreading Superscalar Architecture. In Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques (PACT ’96), pages 163–168. IEEE Computer Society, October 1996. (p 21)

[Luk01] C-K. Luk. Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors. In Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA ’01), pages 40–51. IEEE Computer Society, July 2001. (p 99)

[Magro02] W. Magro, P. Peterson, and S. Shah. Hyper-Threading Technology: Impact on Compute-Intensive Workloads. Intel Technology Journal, 6(2):58–66, February 2002. (p 38)

[Marcuello98] P. Marcuello, A. González, and J. Tubella. Speculative Multithreaded Processors. In Proceedings of the 12th International Conference on Supercomputing, pages 77–84. ACM Press, July 1998. (p 97)

[Marcuello99] P. Marcuello and A. González. Exploiting Speculative Thread-Level Parallelism on a SMT Processor. In Proceedings of the 7th International Conference on High Performance Computing and Networking Europe 1999, pages 141–150. Springer-Verlag, April 1999. (p 97)

[Marr02] D. T. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, J. A. Miller, and M. Upton. Hyper-Threading Technology Architecture and Microarchitecture. Intel Technology Journal, 6(2):1–12, February 2002. (pp 21, 29, 46)

[McDowell03] L. K. McDowell, S. J. Eggers, and S. D. Gribble. Improving Server Software Support for Simultaneous Multithreaded Processors. In Proceedings of the ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’03), pages 37–48. ACM Press, June 2003. (p 33)

[Moore96] S. W. Moore. Multithreaded Processor Design. Kluwer Academic Publishers, 1996. (pp 19, 106)

[Mosberger98] D. Mosberger and T. Jin. httperf: A Tool for Measuring Web Server Performance. In First Workshop on Internet Server Performance, pages 59–67, June 1998. (p 112)

[Muir00] S. J. Muir and J. M. Smith. Piglet: A Low-Intrusion Vertical Operating System. Technical Report MS-CIS-00-04, University of Pennsylvania, January 2000. (p 106)

[Nakajima02] J. Nakajima and V. Pallipadi. Enhancements for Hyper-Threading Technology in the Operating System — Seeking the Optimal Scheduling. In Proceedings of the 2nd Workshop on Industrial Experiences with Systems Software. The USENIX Association, December 2002. (p 75)

[Oplinger02] J. Oplinger and M. S. Lam. Enhancing Software Reliability with Speculative Threads. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’02), pages 184–196. ACM Press, October 2002. (p 103)

[Parekh00] S. S. Parekh, S. J. Eggers, H. M. Levy, and J. L. Lo. Thread-Sensitive Scheduling for SMT Processors. Technical Report 2000-04-02, University of Washington, June 2000. (pp 74, 82)

[Park03] I. Park, B. Falsafi, and T. N. Vijaykumar. Implicitly-Multithreaded Processors. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA ’03), pages 39–50. IEEE Computer Society, June 2003. (p 97)

[Prabhu03] M. K. Prabhu and K. Olukotun. Using Thread-Level Speculation to Simplify Manual Parallelization. In Proceedings of the ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’03), pages 1–12. ACM Press, June 2003. (p 97)

[Preston02] R. P. Preston, R. W. Badeau, D. W. Bailey, S. L. Bell, L. L. Biro, W. J. Bowhill, D. E. Dever, S. Felix, R. Gammack, V. Germini, M. K. Gowan, P. Gronowski, D. B. Jackson, S. Mehta, S. V. Morton, J. D. Pickholt, M. H. Reilly, and M. J. Smith. Design of an 8-wide Superscalar RISC Microprocessor with Simultaneous Multithreading. In Proceedings of the 2002 IEEE International Solid-State Circuits Conference (ISSCC 2002), pages 334–335. IEEE Solid-State Circuits Society, February 2002. (p 21)

[Puppin01] D. Puppin and D. M. Tullsen. Maximizing TLP with loop-parallelization on SMT. In Workshop on Multi-Threaded Execution, Architectures and Compilers (MTEAC-5), December 2001. (p 96)

[Purser00] Z. Purser, K. Sundaramoorthy, and E. Rotenberg. A Study of Slipstream Processors. In Proceedings of the 33rd Annual International Symposium on Microarchitecture (MICRO-33), pages 269–280. IEEE Computer Society, December 2000. (p 99)

[Redstone00] J. A. Redstone, S. J. Eggers, and H. M. Levy. An Analysis of Operating System Behaviour on a Simultaneous Multithreaded Architecture. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’00), pages 245–256. ACM Press, November 2000. (p 38)

[Redstone02] J. A. Redstone. An Analysis of Software Interface Issues for SMT Processors. PhD thesis, University of Washington, December 2002. (pp 28, 30, 101)

[Redstone03] J. A. Redstone, S. J. Eggers, and H. M. Levy. Mini-Threads: Increasing TLP on Small-Scale SMT Processors. In Proceedings of the ninth International Symposium on High Performance Computer Architecture (HPCA-9), pages 19–30. IEEE Computer Society, February 2003. (p 101)

[Reinhardt00] S. K. Reinhardt and S. S. Mukherjee. Transient Fault Detection via Simultaneous Multithreading. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA ’00), pages 25–36. IEEE Computer Society, June 2000. (p 104)

[Rotenberg99] E. Rotenberg. AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors. In Proceedings of the 29th International Symposium on Fault-Tolerant Computing (1999), pages 84–91. IEEE Computer Society, June 1999. (p 104)

[Roth01] A. Roth and G. S. Sohi. Speculative Data-Driven Multithreading. In Proceedings of the seventh International Symposium on High Performance Computer Architecture (HPCA-7), pages 37–48. IEEE Computer Society, January 2001. (p 99)

[Roth99] A. Roth, A. Moshovos, and G. S. Sohi. Improving Virtual Function Call Target Prediction via Dependence-Based Pre-Computation. In Proceedings of the 13th International Conference on Supercomputing, pages 356–364. ACM Press, May 1999. (p 99)

[Sazeides01] Y. Sazeides and T. Juan. How to Compare the Performance of Two SMT Microarchitectures. In Proceedings of the 2001 IEEE International Symposium on Performance Analysis of Systems and Software, pages 180–183, November 2001. (p 44)

[Seng00] J. S. Seng, D. M. Tullsen, and G. Z. N. Cai. Power-Sensitive Multithreaded Architecture. In Proceedings of the 2000 IEEE International Conference on Computer Design, pages 199–206. IEEE Computer Society, September 2000. (p 35)

[Serrano94] M. J. Serrano, W. Yamamoto, R. C. Wood, and M. D. Nemirovsky. A Model for Performance Estimation in a Multistreamed Superscalar Processor. In Proceedings of the 7th International Conference on Computer Performance Evaluation, Modelling Techniques and Tools, volume 794 of Lecture Notes in Computer Science, pages 213–230. Springer-Verlag, May 1994. (p 20)

[Sherwood99] T. Sherwood and B. Calder. Time Varying Behaviour of Programs. Technical Report CS99-630, University of California, San Diego, August 1999. (p 59)

[Snavely00] A. Snavely and D. M. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreading Processor. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’00), pages 234–244. ACM Press, November 2000. (pp 37, 74)

[Snavely02] A. Snavely, D. M. Tullsen, and G. Voelker. Symbiotic Jobscheduling with Priorities for a Simultaneous Multithreading Processor. In Proceedings of the 2002 International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’02), pages 66–76. ACM Press, June 2002. (pp 37, 74)

[Snavely99] A. Snavely, N. Mitchell, L. Carter, J. Ferrante, and D. M. Tullsen. Explorations in Symbiosis on two Multithreaded Architectures. In Workshop on Multi-Threaded Execution, Architectures and Compilers (MTEAC ’99), January 1999. (pp 37, 44)

[Sohi95] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar Processors. In Proceedings of the 22nd International Symposium on Computer Architecture (ISCA ’95), pages 414–425. IEEE Computer Society, June 1995. (p 97)

[Solomon00] D. A. Solomon and M. E. Russinovich. Inside Windows 2000. Microsoft Press, third edition, June 2000. (p 93)

[SPEC] The Standard Performance Evaluation Corporation, http://www.spec.org/. (pp 41, 45, 107, 112)

[Squillante93] M. S. Squillante and E. D. Lazowska. Using Processor-Cache Affinity Information in Shared-Memory Multiprocessor Scheduling. IEEE Transactions on Parallel and Distributed Systems, 4(2):131–143, February 1993. (p 72)

[Steffan00] J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry. A Scalable Approach to Thread-Level Speculation. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA ’00), pages 1–12. IEEE Computer Society, June 2000. (p 97)

[Steffan98] J. G. Steffan and T. C. Mowry. The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization. In Proceedings of the fourth International Symposium on High Performance Computer Architecture (HPCA-4), pages 2–13. IEEE Computer Society, February 1998. (p 97)

[Sundaramoorthy00] K. Sundaramoorthy, Z. Purser, and E. Rotenberg. Slipstream Processors: Improving both Performance and Fault Tolerance. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’00), pages 257–268. ACM Press, November 2000. (p 99)

[Tomasulo67] R. M. Tomasulo. An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development, 11(1):25–33, January 1967. (p 18)

[Tuck03] N. Tuck and D. M. Tullsen. Initial Observations of the Simultaneous Multithreading Pentium 4 Processor. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques (PACT ’2003), pages 26–34. IEEE Computer Society, September 2003. (pp 39, 41, 46)

[Tullsen01] D. M. Tullsen and J. A. Brown. Handling Long-latency Loads in a Simultaneous Multithreading Processor. In Proceedings of the 34th Annual International Symposium on Microarchitecture (MICRO-34), pages 318–327. IEEE Computer Society, December 2001. (p 51)

[Tullsen95] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proceedings of the 22nd International Symposium on Computer Architecture (ISCA ’95), pages 392–403. IEEE Computer Society, June 1995. (pp 20, 37)

[Tullsen96a] D. M. Tullsen. Simulation and Modeling of a Simultaneous Multithreading Processor. In 22nd Annual Computer Measurement Group Conference, pages 819–828. Computer Measurement Group, December 1996. (p 37)

[Tullsen96b] D. M. Tullsen, S. J. Eggers, J. S. Emer, and H. M. Levy. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In Proceedings of the 23rd International Symposium on Computer Architecture (ISCA ’96), pages 191–202. IEEE Computer Society, May 1996. (pp 20, 37)

[Tullsen98] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Retrospective: Simultaneous Multithreading: Maximizing On-Chip Parallelism. In 25 Years of the International Symposia on Computer Architecture (Selected Papers), pages 115–116. ACM Press, August 1998. (p 20)

[Tullsen99] D. M. Tullsen, J. L. Lo, S. J. Eggers, and H. M. Levy. Supporting Fine-Grained Synchronization on a Simultaneous Multithreading Processor. In Proceedings of the fifth International Symposium on High Performance Computer Architecture (HPCA-5), pages 54–58. IEEE Computer Society, January 1999. (p 29)

[Ungerer03] T. Ungerer, B. Robič, and J. Šilc. A Survey of Processors with Explicit Multithreading. ACM Computing Surveys, 35(1):29–63, March 2003. (p 20)

[Vianney03] D. Vianney. Hyper-Threading speeds Linux. IBM developerWorks, January 2003. (p 38)

[Vijaykumar02] T. N. Vijaykumar, I. Pomeranz, and K. Cheng. Transient-Fault Recovery Using Simultaneous Multithreading. In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA ’02), pages 87–98. IEEE Computer Society, May 2002. (p 104)

[von Behren03] R. von Behren, J. Condit, F. Zhou, G. C. Necula, and E. Brewer. Capriccio: Scalable Threads for Internet Services. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP ’03), pages 268–281. ACM Press, October 2003. (p 96)

[Wallace98] S. Wallace, B. Calder, and D. M. Tullsen. Threaded Multiple Path Execution. In Proceedings of the 25th International Symposium on Computer Architecture (ISCA ’98), pages 238–249. IEEE Computer Society, June 1998. (p 100)

[Wang02] H. Wang, P. H. Wang, R. D. Weldon, S. M. Ettinger, H. Saito, M. Girkar, S. S. Liao, and J. P. Shen. Speculative Precomputation: Exploring the Use of Multithreading for Latency Tolerance. Intel Technology Journal, 6(2):22–35, February 2002. (p 98)

[Yamamoto95] W. Yamamoto and M. D. Nemirovsky. Increasing Superscalar Performance Through Multistreaming. In Proceedings of the 1995 Conference on Parallel Architectures and Compilation Techniques (PACT ’95), pages 49–58. IEEE Computer Society, June 1995. (p 20)

[Zhou04] P. Zhou, F. Qin, W. Liu, Y. Zhou, and J. Torrellas. iWatcher: Efficient Architectural Support for Software Debugging. In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA ’04), pages 224–235. IEEE Computer Society, June 2004. (p 98)

[Zilles01] C. B. Zilles and G. S. Sohi. Execution-based Prediction using Speculative Slices. In Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA ’01), pages 2–13. IEEE Computer Society, July 2001. (p 99)

[Zillies99] C. B. Zilles, J. S. Emer, and G. S. Sohi. The Use of Multithreading for Exception Handling. In Proceedings of the 32nd Annual International Symposium on Microarchitecture (MICRO-32), pages 219–229. IEEE Computer Society, November 1999. (p 105)

Appendix A

Monochrome Figures

This appendix contains monochrome versions of colour figures which, when rendered in black and white, are not easily understandable.

[Figure A.1 is a matrix of SPEC CPU2000 benchmark pairs (164.gzip through 301.apsi on each axis); shading encodes the result for each pair, from "Speedup > 30%" down to "Slowdown > 20%".]

Figure A.1: Monochrome version of Figure 3.7. Effect on each SPEC CPU2000 benchmark in a multiprogrammed pair running on an SMP configuration. A black square represents a good performance ratio for the subject benchmark and a white square denotes a bad performance ratio (relative to "perfect" SMP).

[Figure A.2 consists of two panels, (a) Northwood and (b) Prescott, each a matrix of SPEC CPU2000 benchmark pairs with the same speedup/slowdown legend as Figure A.1.]

Figure A.2: Monochrome version of Figure 3.3. Effect on each SPEC CPU2000 benchmark in a multiprogrammed pair running on a Hyper-Threaded processor: (a) Northwood, (b) Prescott. A black square represents a good performance ratio for the subject benchmark and a white square denotes a bad performance ratio.