CSC 256/456: Operating Systems
Multiprocessor Support
John Criswell
University of Rochester
1 Outline
❖ Multiprocessor hardware
❖ Types of multi-processor workloads
❖ Operating system issues
❖ Where to run the kernel
❖ Synchronization
❖ Where to run processes
2 Multiprocessor Hardware
3 Multiprocessor Hardware
❖ System in which two or more CPUs share full access to the main memory
❖ Each CPU might have its own cache and the coherence among multiple caches is maintained
[Figure: several CPUs, each with its own cache, connected to main memory over a shared memory bus]
4 Multi-core Processor
❖ Multiple processors on one chip
❖ Some caches shared
❖ Some caches not shared
[Figure: two on-chip CPUs, each with a private cache, sharing a common cache on the memory bus]
5 Hyper-Threading
❖ Replicate parts of processor; share other parts
❖ Create illusion that one core is two cores
[Figure: one CPU core with two replicated fetch/decode front ends sharing the ALU and memory unit]
6 Cache Coherency
❖ Ensures processors do not operate on stale memory data
❖ A write sends cache-invalidation messages to the other caches
7 Non Uniform Memory Access (NUMA)
❖ Memory clustered around CPUs
❖ For a given CPU
❖ Some memory is nearby (and fast)
❖ Other memory is far away (and slow)
8 Multiprocessor Workloads
9 Multiprogramming
❖ Non-cooperating processes with no communication
❖ Examples
❖ Time-sharing systems
❖ Multi-tasking single-user operating systems
❖ make -j
10 Concurrent Servers
❖ Minimal communication between processes and threads
❖ Throughput usually the goal
❖ Examples
❖ Web servers
❖ Database servers
11 Parallel Programs
❖ Use parallelism to speed up computation
❖ Significant data sharing between processes and threads
❖ Examples
❖ Gaussian Elimination
❖ Matrix multiply
12 Operating System Issues
13 Three Challenges
❖ Where to run the OS
❖ How to do synchronization
❖ Where to schedule processes
14 Where to Run the OS?
15 Multiprocessor OS
❖ Each CPU has its own operating system
❖ Advantage: quick to port from a single-processor OS
❖ Disadvantages
❖ difficult to share things (processing cycles, memory, buffer cache)
16 Multiprocessor OS – Master/Slave
❖ All operating system functionality goes to one CPU
❖ Advantage: no multiprocessor concurrency in the kernel
❖ Disadvantage
❖ the OS workload on the master CPU may be large, so that CPU becomes the bottleneck (especially in a machine with many CPUs)
17 Multiprocessor OS – Shared OS
❖ A single OS instance may run on all CPUs
❖ The OS itself must handle multiprocessor synchronization
❖ the OS code running concurrently on multiple CPUs may access shared data structures
18 Synchronization Issues
19 Synchronization
❖ Traditional atomic operations write memory
❖ Atomic compare-and-swap
❖ Atomic fetch-and-add
❖ This is very bad for multiprocessors: every atomic write invalidates the cache line in the other CPUs' caches
20 Load-Linked and Store-Conditional
❖ Load-linked sets a bit in the cache line
❖ Cache invalidation clears bit
❖ Store-conditional checks that bit is still set in cache line
❖ Cleared bit means a cache invalidation occurred
❖ Which implies that another core wrote the memory
❖ Which implies that atomicity was violated
21 Performance Measures for Synchronization
❖ Latency
❖ Cost of thread management under the best case assumption of no contention for locks
❖ Throughput
❖ Rate at which threads can be created, started, and finished when there is contention
22 Synchronization (Fine/Coarse-Grain Locking)
❖ Fine-grain locking – lock only what the critical section actually needs
❖ Coarse-grain locking – lock a large piece of code, much of which does not need the lock
❖ Advantages: simplicity, robustness
❖ Disadvantage: prevents simultaneous execution
❖ simultaneous execution is not possible on a uniprocessor anyway
23 Synchronization Optimizations
❖ Avoid synchronization
❖ Per-processor data structures
❖ Lock-free data structures
❖ Use fine-grained locking to increase throughput
❖ Reuse (cache) data structures
❖ Allocation/deallocation just a few pointer operations
❖ Locks not held for very long
24 Where to Run Processes?
25 Multiprocessor Scheduling
❖ Affinity-based scheduling
❖ Try to run each process on the processor that it last ran on
❖ Takes advantage of cache locality
[Figure: CPU 0 and CPU 1 timelines; the web server, the client/server game (Civ), and the parallel Gaussian elimination job are each kept on the CPU where they last ran]
26 Multiprocessor Scheduling
❖ Gang/Cohort scheduling
❖ Utilize all CPUs for one parallel/concurrent application at a time
❖ Exploits cache sharing and synchronization among the threads of a parallel/concurrent application
[Figure: CPU 0 and CPU 1 timelines; the threads of one application at a time (web server, client/server game (Civ), or parallel Gaussian elimination) run on all CPUs simultaneously]
27 Resource Management To Date
❖ Capitalistic – generating more requests results in more resource usage
❖ Performance: resource contention can result in significantly reduced overall performance
❖ Fairness: an equal time slice does not necessarily guarantee equal progress
28 Fairness and Security Concerns
❖ Priority inversion
❖ Poor fairness among competing applications
❖ Information leakage at chip level
❖ Denial of service attack at chip level
29 Disclaimer
• Parts of the lecture slides contain original work by Andrew S. Tanenbaum. The slides are intended for the sole purpose of instruction of operating systems at the University of Rochester. All copyrighted materials belong to their original owner(s).