CS152: Computer Systems Architecture Multiprocessing and Parallelism
Sang-Woo Jun Winter 2019
Large amount of material adapted from MIT 6.004, “Computation Structures”, Morgan Kaufmann “Computer Organization and Design: The Hardware/Software Interface: RISC-V Edition”, and CS 152 Slides by Isaac Scherson High performance via parallelism
Given a fixed amount of work, we want to do as much work at once as possible, to reduce the amount of time it takes o Subject to computing power restrictions o Subject to inherently exploitable parallelism in the work
What we looked at: Instruction-Level Parallelism o Pipelining works on multiple instructions, each at different stages • Subject to pipeline depth, subject to dependencies between instructions o Superscalar works on multiple instructions, even some at the same stage • Subject to replicated pipeline count (ways), subject to dependencies between instructions Some more exploitable parallelism
Data parallelism o When operations on multiple data elements do not have much dependency on each other, they can be done in parallel o (Typically same work on multiple data)
Task parallelism Savni, “Data parallelism in matrix multiplication” o Work can be divided into multiple independent processes, which do not have much dependency on each other o (Typically different work) Of course the distinction is often muddled in the real world How much performance improvement?
Amdahl’s law (Gene Amdahl, 1967) o Theoretical speedup for a given job as system resources are improved o Speedup = 1 / ((1 − p) + p/s) • p is the fraction of the job that is parallelizable • s is the improvement of system resources (e.g., processor count)
o If p is zero, there is no speedup no matter how many processors we have! o If p is 1, we will have perfect scaling of performance with more processors Assume no synchronization overhead for an upper-bound estimate How much performance improvement?
Example: Can we get 90× speedup using 100 processors? o Remember: Speedup = 1 / ((1 − p) + p/s) o s = 100 with 100 processors, and we want speedup to be 90 o 90 = 1 / ((1 − Fparallelizable) + Fparallelizable/100)
o Solving: Fparallelizable = 0.999
Need non-parallelizable part to be 0.1% of original time Scaling Example
Workload: sum of 10 scalars, and 10 × 10 matrix sum
Single processor: Time = (10 + 100) × tadd 10 processors o Assuming adding 10 scalars can’t be parallelized
o Time = 10 × tadd + 100/10 × tadd = 20 × tadd o Speedup = 110/20 = 5.5 (55% of potential peak) 100 processors
o Time = 10 × tadd + 100/100 × tadd = 11 × tadd o Speedup = 110/11 = 10 (10% of potential peak) Scaling Example
What if matrix size is 100 × 100?
Single processor: Time = (10 + 10000) × tadd 10 processors
o Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd o Speedup = 10010/1010 = 9.9 (99% of potential peak) 100 processors
o Time = 10 × tadd + 10000/100 × tadd = 110 × tadd o Speedup = 10010/110 = 91 (91% of potential peak) Why focus on parallelism?
Of course, large jobs require large machines with many processors o Exploiting parallelism to make the best use of supercomputers has always been an extremely important topic But now even desktops and phones are multicore! o Why? The end of “Dennard Scaling”
What happened? Dennard Scaling
Robert Dennard, 1974 “Power density stays constant as transistors get smaller” Intuitively: o Smaller transistors → shorter propagation delay → faster frequency o Smaller transistors → smaller capacitance → lower voltage
o Power ∝ Capacitance × Voltage² × Frequency o Moore’s law → Faster performance @ Constant power!
Unfortunately, as transistors became very small, there was a limit to how low the voltage could go Option 1: Continue Scaling Frequency at Increased Power Budget
Option 2: Stop Frequency Scaling
Dennard Scaling Ended (~2006)
Danowitz et al., “CPU DB: Recording Microprocessor History,” Communications of the ACM, 2012 Looking Back: Change of Predictions But Moore’s Law Continues Beyond 2006 State of Things at This Point (2006)
Single-thread performance scaling ended o Frequency scaling ended (Dennard Scaling) o Instruction-level parallelism scaling stalled … also around 2005
Moore’s law continues o Double transistors every two years o What do we do with them?
K. Olukotun, “Intel CPU Trends” Crisis Averted With Manycores?
We’ll get back to this point later. For now, multiprocessing! The hardware for parallelism: Flynn taxonomy (1966) recap
Data Stream: Single | Multi
Instruction Stream, Single: SISD (Single-Core Processors) | SIMD (GPUs, Intel SSE/AVX extensions, …)
Instruction Stream, Multi: MISD (Systolic Arrays, …) | MIMD (VLIW, Parallel Computers)
Flynn taxonomy
Single-Instruction Single-Instruction Single-Data Multi-Data (Single-Core Processors) (GPUs, Intel SIMD Extensions)
Today
Multi-Instruction Multi-Instruction Single-Data Multi-Data (Systolic Arrays,…) (Parallel Processors) Shared memory multiprocessor
SMP: shared memory multiprocessor o Hardware provides single physical address space for all processors o Synchronize shared variables using locks o Memory access time • UMA (uniform) vs. NUMA (nonuniform) Also often SMP: Symmetric multiprocessor o The processors in the system are identical, and are treated equally
Typical chip-multiprocessor (“multicore”) consumer computers
UMA between cores sharing a package, but NUMA across cores in different packages
Memory System Architecture
Overall, this is a NUMA system
[Diagram: two packages, each with cores (private L1 I$/D$ and L2 per core) sharing an L3 cache and local DRAM, linked by QPI/UPI]
Memory System Bandwidth Snapshot (2020)
[Diagram: two processors, each with local DRAM, linked by QPI/UPI]
Cache bandwidth estimate: 64 bytes/cycle ≈ 200 GB/s per core
DDR4 2666 MHz: 128 GB/s
Ultra Path Interconnect: 20.8 GB/s unidirectional
Memory/PCIe controller used to be on a separate “North bridge” chip, now integrated on-die All sorts of things are now on-die! Even network controllers! Memory system issues with multiprocessing
Suppose two CPU cores share a physical address space o Distributed caches (typically L1) o Write-through caches, but same problem for write-back as well
Time step | Event               | CPU A’s cache | CPU B’s cache | Memory
0         |                     |               |               | 0
1         | CPU A reads X       | 0             |               | 0
2         | CPU B reads X       | 0             | 0             | 0
3         | CPU A writes 1 to X | 1             | 0             | 1
Wrong data! Memory system issues with multiprocessing
What are the possible outcomes from the two following codes? o A and B are initially zero
Processor 1: Processor 2:
1: A = 1; 3: B = 1; 2: print B 4: print A
o 1,2,3,4 or 3,4,1,2 etc : “01” o 1,3,2,4 or 1,3,4,2 etc : “11”
o Can it print “10”, or “00”? Should it be able to? Memory problems with multiprocessing
Cache coherency (The two CPU example) o Informally: a read of each address must return the most recently written value o Complex and difficult with many processors o Typically: All writes must be visible at some point, and in proper order
Memory consistency (The two processes example) o How updates to different addresses become visible (to other processors) o Many models define various types of consistency • Sequential consistency, causal consistency, relaxed consistency, … o In our previous example, some models may allow “10” to happen, and we must program such a machine accordingly Keeping multiple caches coherent
A bad idea: Broadcast all writes to all other cores o Lots of unnecessary on-chip network traffic o Each cache needs to do many lookups to see if incoming write notifications need to be applied • Further bad idea: applying all write requests results in all caches having copies of all data; no point in having distributed caches! Keeping multiple caches coherent
Basic idea o If a cache line is only read, many caches can have a copy o If a cache line is written to, only one cache at a time may have a copy
Typically two ways of implementing this o “Snooping-based”: All cores listen to requests made by others on the memory bus o “Directory-based”: All cores consult a separate entity called “directory” for each cache access Snoopy cache coherence
All caches listen (“snoop”) to the traffic on the memory bus o There is enough information just from that to ensure coherence
[Diagram: cores with private caches attached to a shared memory bus, along with a shared cache and DRAM]
A bus is an interconnect that connects multiple entities o Cores, memory, (peripheral devices…) Clock-synchronous, shared resource o Only one entity may be transmitting at any given clock cycle o All data transfers are broadcast, and all entities on the bus can listen to all communication o If multiple entities want to send data (a “multi-master” configuration) a separate entity called the “bus arbiter” must decide which master may transmit in a given cycle In contrast to switched interconnects, like packet-switched networks or crossbars o We’ll talk about these later! MSI: A simple cache coherency protocol
Each cache line can be in one of three states o M (“Modified”): Dirty data, no other cache has copy • Safe to write to cache. DRAM is out-of-date o S (“Shared”): Not dirty data, other caches may have copy o I (“Invalid”): Data not in cache Cache tag augmented with state information
Snoopy implementation of MSI
Originally, each cache could send two requests on the memory bus o Read: For both read and write requests from the core (write-back) o Flush: For cache evictions The cache is modified to use four requests o Read: For read requests from the core o ReadExclusive: For write requests from the core • Data is read from DRAM to the cache, with the intention to update o Upgrade: When a previously read cache line will be written to o Flush: Writeback to DRAM, in the case of cache evictions Snoopy implementation of MSI
Core wants to read: Read from DRAM (via “Read” command), set to “S” Core wants to write: o If cache has data: Send “Upgrade” command, set to “M” o If not: Read from DRAM (via “ReadExclusive” command), set to “M” Snoops “Read” command from another core: o If I have data as “S”, supply data o If I have data as “M”, flush and downgrade myself to “S” Snoops “ReadExclusive” or “Upgrade”: o If I have data as “S”: invalidate o If I have data as “M”: flush, invalidate Impossible to snoop “Upgrade” while I am “M”: I should be the only one with this data while I am “M” Snoopy implementation of MSI
Jos van Eijndhoven et al., “Dynamic and Robust Streaming in and between Connected Consumer-Electronic Devices” Back to coherence example
Writes invalidate other caches, ensuring correct coherence o This time assuming write-back
Time step | Event               | CPU A’s cache | CPU B’s cache | Memory
0         |                     |               |               | 0
1         | CPU A reads X       | 0 (S)         |               | 0
2         | CPU B reads X       | 0 (S)         | 0 (S)         | 0
3         | CPU A writes 1 to X | 1 (M)         | (I)           | 0
4         | CPU B reads X       | 1 (S)         | 1 (S)         | 1
Beyond MSI
Many improvements on the MSI protocol have been proposed and implemented o MESI (adds “exclusive”), MOESI (adds “owner”), MESIF (adds “forward”) Example: MESI o Addresses the situation where data is read, and then updated by one core • Very common situation! (LW, process, SW) • MSI would send Read, Upgrade commands, but if no other processes access this cache line, perhaps the Upgrade is unnecessary o When MESI snoops a Read command, each core checks its cache to see if it has that line • If the line exists in another core’s cache, data is copied between caches (new command) • If not, the line is installed in the “E” state, which can upgrade to “M” without consulting others
Cache-to-cache data copy reduces main memory accesses in practice, performance improvement! Limitations of snoopy caches
Inter-core cache coherence communications are all broadcast o Memory bus becomes a single point of serialization o Not very scalable beyond a handful of cores Perhaps not that big of an issue with desktop-class CPUs o With ~8 cores and a typical ratio of memory instructions, the bus is not yet the primary bottleneck o At least Intel Haswell still seems to use snooping • Rumored to use an adaptive MESIF protocol Directory-based cache coherence
Uses a separate entity called a “directory” which keeps track of the cache line state (MSI?) o Cache accesses first consult the directory, which either interacts with DRAM or remote caches o Directory can be implemented in a distributed way, to disperse the bottleneck Pros of directory-based coherence o Given a scalable interconnect, much more scalable than a bus Cons of directory-based coherence o Much more complex than snoopy, involving circuit implementations of distributed algorithms
We won’t cover this in detail right now…
Performance issue with cache coherence: False sharing
Different memory locations, written to by different cores, mapped to the same cache line
o Core 1 performing “results[0]++;”
o Core 2 performing “results[1]++;”
Remember cache coherence
o Every time a cache line is written to, all other copies need to be invalidated!
o The “results” cache line is ping-ponged across the coherence protocol on every write
o Bad when it happens on-chip, terrible over the processor interconnect (QPI/UPI)
Solution: Store often-written data in local variables
Some performance numbers with false sharing
Memory problems with multiprocessing
Cache coherency (The two CPU example) o Informally: a read of each address must return the most recently written value o Complex and difficult with many processors o Typically: All writes must be visible at some point, and in proper order
Memory consistency (The two processes example) o How updates to different addresses become visible (to other processors) o Many models define various types of consistency • Sequential consistency, causal consistency, relaxed consistency, … o In our previous example, some models may allow “10” to happen, and we must program such a machine accordingly Again: Memory consistency issue example
What are the possible outcomes from the two following codes? o A and B are initially zero
Processor 1: Processor 2:
1: A = 1; 3: B = 1; 2: print B 4: print A
o 1,2,3,4 or 3,4,1,2 etc : “01” o 1,3,2,4 or 1,3,4,2 etc : “11”
o Can it print “10”, or “00”? Should it be able to? • Given cache coherency? Why we need a consistency model
Given an in-order multicore processor, cache coherence should solve memory consistency automatically o But in-order leaves a lot of performance on the table o e.g., Out-of-Order superscalar brings a lot of performance gains From a single processor’s perspective, it looks like there is no dependency!
A non-OoO inconsistency example: write buffers o In order to hide cache miss latency for writes, write requests are often stored in a small “write buffer”, instead of waiting until the cache processing is done o Requests in the write buffer may be applied out-of-order, depending on cache operation o Remote core may see memory writes applied out of order! (e.g., “00”) Memory model is specified as a part of the ISA Why we need a consistency model
Once requests can be haphazardly reordered, algorithmic correctness becomes really hard to enforce!
o e.g., How would we implement a mutex?
o If “locked=true” is buffered in a write buffer while a process is in a critical section…
Memory consistency model: “A specification of the allowed behavior of multithreaded programs executing with shared memory”
Memory model is defined, specified as part of the ISA
o Compilers and programmers write/optimize software to work with the model
o Simpler model: (typically) simpler for the compiler, more complex hardware
o Complex model: (typically) harder for the compiler, simpler hardware
The most stringent model: “Sequential consistency” (SC)
The following is what the model requires, and the processor hardware must abide by it
Within a core, all memory I/O is applied in program order
o Regardless of OoO or write buffers, memory I/O leaving the core should be re-ordered according to program order
Across cores, all cores must see the same global order
o Typically solved already via cache coherency
Parallel program execution behaves “as you would expect”
Too stringent for efficient hardware implementation
SC requires I/O with no dependency to be ordered as well! Weaker memory models for performance
To satisfy SC, 2 and 4 can’t execute until 1 and 3 become visible globally
o 2 and 4 are stalled until the writes are flushed to the L3 cache!
Instead, the two writes can be buffered in write buffers, and execution continues on to the reads
o Reads check the write buffers as well as the cache, ensuring correct single-thread behavior
o Significant performance improvement*
o BUT, this makes “00” or “10” possible
*Exact impact has been debated, especially with speculation-based improvements on SC
A weaker memory model: “Total Store Order” (TSO)
Allows later reads to finish before earlier writes
o Reads among themselves, and writes among themselves, must maintain program order
o But ordering between reads and writes can be changed
o Basically, the model allows write buffers
Now our example (1: A = 1; 2: print B versus 3: B = 1; 4: print A) can emit both “00” and “10”
It is important that weaker models do not affect single-core ordering, or accesses to the same location o Intra-core dependencies are always maintained o They affect only the order in which memory I/O is visible to other cores
Aside: x86 seems to follow a modified version of TSO Aside: Weaker memory models in the wild
A wide variety of weaker memory models exist o Processor Consistency (PC) o Release Consistency (RC) o Eventual Consistency (EC), … Some known memory models o x86 seems to follow a modified form of TSO o ARM model is notoriously underspecified o RISC-V allows a choice between RC and TSO How beneficial these memory models are is debated… o TSO consistently outperforms SC, but everything beyond is muddy Weaker memory models and synchronization How do we implement synchronization in the face of memory reordering? A new instruction, “Fence”, specifies where ordering must be enforced o All memory operations before a fence must finish before the fence is executed • And be visible to other processors/threads! o All memory operations after a fence must wait until the fence is executed
RISC-V base ISA includes two fence operations o FENCE, FENCE.I (for instruction memory) So far…
Why parallelism? Memory issues with parallelism o Cache coherence o Memory consistency Hardware support for synchronization
In parallel software, critical sections implemented via mutexes are essential for algorithmic correctness Can we implement a mutex with the instructions we’ve seen so far? o e.g., while (lock == True); lock = True; // critical section lock = False; o Does this work with parallel threads? Hardware support for synchronization
By chance, both threads can think lock is not taken o e.g., Thread 2 thinks lock is not taken, before thread 1 takes it o Both threads think they have the lock
Thread 1 Thread 2
while (lock == True); while (lock == True);
lock = True; lock = True;
// critical section // critical section
lock = False; lock = False;
Algorithmic solutions exist! Dekker’s algorithm, Lamport’s bakery algorithm… Synchronization algorithm example: Dekker’s algorithm
Does this still work with TSO?
Also, performance overhead…
The solution is attributed to Dutch mathematician Th. J. Dekker by Edsger W. Dijkstra in an unpublished paper Hardware support for synchronization
A high-performance solution is to add an “atomic instruction” o Memory read/write in a single instruction o No other instruction can read/write between the atomic read/write o e.g., “if (lock == False) lock = True” Fences are algorithmically sufficient to implement atomic operations, but this is more efficient
Single instruction read/write is in the grey area of RISC paradigm… RISC-V example
Atomic instructions are provided as part of the “A” (Atomic) extension Two types of atomic instructions o Atomic memory operations (read, operation, write) • operation: swap, add, or, xor, … o Pair of linked read/write instructions, where the write returns fail if memory has been written to after the read • More like RISC! • With bad luck, may cause livelock, where writes always fail Aside: It is known that all synchronization primitives can be implemented with only atomic compare-and-swap (CAS) o RISC-V doesn’t define a CAS instruction though Pipelined implementation of atomic operations In a pipelined implementation, even a single-instruction read-modify-write can be interleaved with other instructions o Multiple cycles through the pipeline
Atomic memory operations o Modify cache coherence so that once an atomic operation starts, no other cache can access that line o Other solutions? Linked read/write operations o Need only to keep track of a small number of linked read addresses