CS152: Computer Systems Architecture and Parallelism

Sang-Woo Jun Winter 2019

A large amount of material adapted from MIT 6.004, “Computation Structures”, Morgan Kaufmann “Computer Organization and Design: The Hardware/Software Interface: RISC-V Edition”, and CS 152 slides by Isaac Scherson

High performance via parallelism

 Given a fixed amount of work, we want to do as much work at once as possible, to reduce the amount of time it takes o Subject to computing power restrictions o Subject to inherently exploitable parallelism in the work

 What we looked at: Instruction-Level Parallelism o Pipelining works on multiple instructions, each at a different stage • Subject to pipeline depth, and to dependencies between instructions o Superscalar works on multiple instructions, even some at the same stage • Subject to the replicated pipeline count (ways), and to dependencies between instructions

Some more exploitable parallelism

 Data parallelism o When operations on multiple data elements do not have much dependency on each other, they can be done in parallel o (Typically the same work on multiple data)

Savni, “Data parallelism in matrix multiplication”

 Task parallelism o Work can be divided into multiple independent processes, which do not have much dependency on each other o (Typically different work)

Of course, the distinction is often muddled in the real world

How much performance improvement?

 Amdahl’s law (Gene Amdahl, 1967) o Theoretical speedup for a given job as system resources are improved o Speedup = 1 / ((1 − p) + p/s) • p is the fraction of the job that is parallelizable • s is the improvement of system resources (e.g., processor count)

o If p is zero, there is no speedup no matter how many processors we have! o If p is 1, we will have perfect scaling of performance with more processors  Assume no synchronization overhead, for an upper-bound estimate

How much performance improvement?

 Example: Can we get 90× speedup using 100 processors? o Remember: Speedup = 1 / ((1 − p) + p/s) o s = 100 with 100 processors, and we want the speedup to be 90

Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90

o Solving: Fparallelizable = 0.999

 Need the non-parallelizable part to be 0.1% of the original time
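A quick numeric check of this result (a minimal C sketch; the function name and the second value of p are only illustrative):

```c
#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - p) + p / s), where p is the
 * parallelizable fraction of the job and s is the improvement of the
 * parallel part (e.g., processor count). */
static double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void) {
    /* The example above: 100 processors, 99.9% parallelizable. */
    printf("p = 0.999, s = 100 -> speedup = %.1f\n",
           amdahl_speedup(0.999, 100.0));   /* about 91 */
    /* Even a slightly larger serial fraction hurts badly. */
    printf("p = 0.990, s = 100 -> speedup = %.1f\n",
           amdahl_speedup(0.990, 100.0));   /* about 50 */
    return 0;
}
```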

Scaling Example

 Workload: sum of 10 scalars, and 10 × 10 matrix sum

 Single processor: Time = (10 + 100) × tadd  10 processors o Assuming adding 10 scalars can’t be parallelized

o Time = 10 × tadd + 100/10 × tadd = 20 × tadd o Speedup = 110/20 = 5.5 (55% of potential peak)  100 processors

o Time = 10 × tadd + 100/100 × tadd = 11 × tadd o Speedup = 110/11 = 10 (10% of potential peak)

Scaling Example

 What if matrix size is 100 × 100?

 Single processor: Time = (10 + 10000) × tadd  10 processors

o Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd o Speedup = 10010/1010 = 9.9 (99% of potential peak)  100 processors

o Time = 10 × tadd + 10000/100 × tadd = 110 × tadd o Speedup = 10010/110 = 91 (91% of potential peak)

Why focus on parallelism?

 Of course, large jobs require large machines with many processors o Exploiting parallelism to make the best use of them has always been an extremely important topic  But now even desktops and phones are multicore! o Why? The end of “Dennard Scaling”

What happened?

Dennard Scaling

 Robert Dennard, 1974  “Power density stays constant as transistors get smaller”  Intuitively: o Smaller transistors → shorter propagation delay → faster frequency o Smaller transistors → smaller capacitance → lower voltage

o Power ∝ Capacitance × Voltage² × Frequency o Moore’s law → Faster performance @ Constant power!
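The constant-power-density claim follows directly from the relation above; a short worked version of the standard argument (the scaling factor k is my notation):

```latex
% Scale all linear dimensions by 1/k (k > 1), as in classic Dennard scaling:
%   C -> C/k,   V -> V/k,   f -> k f
\[
  P_{\text{transistor}} = C V^2 f
  \;\longrightarrow\;
  \frac{C}{k} \cdot \frac{V^2}{k^2} \cdot k f \;=\; \frac{C V^2 f}{k^2}
\]
% Transistor density grows by k^2, so power per unit area is unchanged:
\[
  \text{power density} \;\propto\; k^2 \cdot \frac{P_{\text{transistor}}}{k^2}
  \;=\; P_{\text{transistor}} \quad \text{(constant)}
\]
```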

Unfortunately, as transistors became too small, there was a limit to how low the voltage could go

Option 1: Continue Scaling Frequency at Increased Power Budget

Option 2: Stop Frequency Scaling

Dennard Scaling Ended (~2006)

Danowitz et al., “CPU DB: Recording Microprocessor History,” Communications of the ACM, 2012

Looking Back: Change of Predictions

But Moore’s Law Continues Beyond 2006

State of Things at This Point (2006)

 Single-thread performance scaling ended o Frequency scaling ended (Dennard Scaling) o Instruction-level parallelism scaling stalled … also around 2005

 Moore’s law continues o Transistor count doubles every two years o What do we do with them?

K. Olukotun, “Intel CPU Trends”

Crisis Averted With Manycores?

We’ll get back to this point later. For now, multiprocessing!

The hardware for parallelism: Flynn taxonomy (1966) recap

|                            | Data Stream: Single           | Data Stream: Multi                       |
| Instruction Stream: Single | SISD (Single-Core Processors) | SIMD (GPUs, Intel SSE/AVX extensions, …) |
| Instruction Stream: Multi  | MISD (Systolic Arrays, …)     | MIMD (VLIW, Parallel Computers)          |

Flynn taxonomy

Single-Instruction Single-Data (Single-Core Processors)
Single-Instruction Multi-Data (GPUs, Intel SIMD Extensions)
Multi-Instruction Single-Data (Systolic Arrays, …)
Multi-Instruction Multi-Data (Parallel Processors) ← Today

Shared memory multiprocessor

 SMP: shared memory multiprocessor o Hardware provides single physical address space for all processors o Synchronize shared variables using locks o Memory access time • UMA (uniform) vs. NUMA (nonuniform)  Also often SMP: Symmetric multiprocessor o The processors in the system are identical, and are treated equally

 Typical chip-multiprocessor (“multicore”) consumer computers: UMA between cores sharing a package, but NUMA across cores in different packages.

Memory System Architecture

[Diagram: two packages, each containing several cores with private L1 I$/D$ and L2 $, an L3 $ shared within the package, and locally attached DRAM; the packages are connected via QPI/UPI. Overall, this is a NUMA system.]

Memory System Bandwidth Snapshot (2020)

[Diagram: per-core cache bandwidth estimate of 64 bytes/cycle ≈ 200 GB/s per core (at roughly 3 GHz); DDR4-2666 DRAM at 128 GB/s per processor; Ultra Path Interconnect (QPI/UPI) between processors at 20.8 GB/s unidirectional.]

Memory/PCIe controller used to be on a separate “North bridge” chip, now integrated on-die. All sorts of things are now on-die! Even network controllers!

Memory system issues with multiprocessing

 Suppose two CPU cores share a physical address space o Distributed caches (typically L1) o Write-through caches, but same problem for write-back as well

Time step | Event               | CPU A’s cache | CPU B’s cache | Memory
0         |                     |               |               | 0
1         | CPU A reads X       | 0             |               | 0
2         | CPU B reads X       | 0             | 0             | 0
3         | CPU A writes 1 to X | 1             | 0             | 1

Wrong data! (CPU B’s cache still holds the stale 0)

Memory system issues with multiprocessing

 What are the possible outcomes of the following two code sequences? o A and B are initially zero

Processor 1:        Processor 2:
1: A = 1;           3: B = 1;
2: print B          4: print A

o 1,2,3,4 or 3,4,1,2 etc.: “01” o 1,3,2,4 or 1,3,4,2 etc.: “11”

o Can it print “10”, or “00”? Should it be able to?
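As a concrete way to ask this question of real hardware, here is a minimal C/pthreads sketch of the two-thread experiment (all names are illustrative; volatile only restrains the compiler, not the hardware):

```c
#include <pthread.h>
#include <stdio.h>

/* Plain (non-atomic) shared variables, as in the example above. */
volatile int A = 0, B = 0;
volatile int rA = -1, rB = -1;

void *proc1(void *arg) {           /* 1: A = 1;   2: read B */
    (void)arg;
    A = 1;
    rB = B;
    return NULL;
}

void *proc2(void *arg) {           /* 3: B = 1;   4: read A */
    (void)arg;
    B = 1;
    rA = A;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, proc1, NULL);
    pthread_create(&t2, NULL, proc2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* "01", "10", and "11" can come from interleaving alone; whether "00"
     * can appear depends on the memory model (and the compiler). */
    printf("%d%d\n", rB, rA);
    return 0;
}
```

A single run usually prints “11” or “01”; observing the rarer outcomes requires running it in a loop many times on hardware that actually reorders the accesses.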

Memory problems with multiprocessing

 Cache coherency (The two CPU example) o Informally: Read to each address must return the most recent value o Complex and difficult with many processors o Typically: All writes must be visible at some point, and in proper order

 Memory consistency (The two processes example) o How updates to different addresses become visible (to other processors) o Many models define various types of consistency • Sequential consistency, causal consistency, relaxed consistency, … o In our previous example, some models may allow “10” to happen, and we must program such a machine accordingly

Keeping multiple caches coherent

 A bad idea: Broadcast all writes to all other cores o Lots of unnecessary on-chip network traffic o Each cache needs to do many lookups to see if incoming write notifications need to be applied • Further bad idea: applying all write requests results in all caches having copies of all data, so there is no point in having distributed caches!

Keeping multiple caches coherent

 Basic idea o If a cache line is only read, many caches can have a copy o If a cache line is written to, only one cache at a time may have a copy

 There are typically two ways of implementing this o “Snooping-based”: All cores listen to requests made by others on the memory bus o “Directory-based”: All cores consult a separate entity called a “directory” for each cache access

Snoopy cache coherence

 All caches listen (“snoop”) to the traffic on the memory bus o There is enough information just from that to ensure coherence

[Diagram: each core has its own private cache; the private caches, a shared cache, and DRAM all sit on a common memory bus, so every cache can observe all bus traffic.]

Detour: A bus interconnect

 A bus is an interconnect that connects multiple entities o Cores, memory, (peripheral devices…)  Clock-synchronous, shared resource o Only one entity may be transmitting at any given clock cycle o All data transfers are broadcast, and all entities on the bus can listen to all communication o If multiple entities want to send data (a “multi-master” configuration), a separate entity called the “bus arbiter” decides which master may drive the bus in a given cycle  In contrast to switched interconnects, like packet-switched networks or crossbars o We’ll talk about these later!

MSI: A simple cache coherency protocol

 Each cache line can be in one of three states o M (“Modified”): Dirty data, no other cache has copy • Safe to write to cache. DRAM is out-of-date o S (“Shared”): Not dirty data, other caches may have copy o I (“Invalid”): Data not in cache  Cache tag augmented with state information

Snoopy implementation of MSI

 Originally, each cache could send two kinds of requests on the memory bus o Read: For both read and write requests from the core (write-back) o Flush: For cache evictions  The cache is extended with two more request types, for four in total: o Read: For read requests from the core o ReadExclusive: For write requests from the core • Data is read from DRAM to the cache, with the intention to update it o Upgrade: When a previously read cache line will be written to o Flush: Writeback to DRAM, in the case of cache evictions

Snoopy implementation of MSI

 Core wants to read: Read from DRAM (via “Read” command), set to “S”
 Core wants to write: o If the cache has the data: Send “Upgrade” command, set to “M” o If not: Read from DRAM (via “ReadExclusive” command), set to “M”
 Snoops a “Read” command from another core: o If I have the data as “S”, supply data o If I have the data as “M”, flush and downgrade myself to “S”
 Snoops “ReadExclusive” or “Upgrade”: o If I have the data as “S”: invalidate o If I have the data as “M”: flush, then invalidate

It is impossible to snoop “Upgrade” while I am “M”: I should be the only one with this data while I am “M”

Snoopy implementation of MSI
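The state diagram the cited figure shows can also be written down directly. Below is a compact C sketch of the per-line MSI transitions just described (the enum and function names are mine, and bus actions are merely printed rather than performed):

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { CORE_READ, CORE_WRITE,               /* requests from the own core */
               SNOOP_READ, SNOOP_READX_OR_UPGRADE   /* requests seen on the bus   */
} msi_event_t;

/* Returns the next state and prints the bus transaction (if any) a cache
 * controller would issue, following the protocol rules above. */
static msi_state_t msi_next(msi_state_t s, msi_event_t e) {
    switch (e) {
    case CORE_READ:
        if (s == INVALID) { puts("bus: Read"); return SHARED; }
        return s;                                   /* S or M: hit, no bus traffic */
    case CORE_WRITE:
        if (s == SHARED)  { puts("bus: Upgrade");       return MODIFIED; }
        if (s == INVALID) { puts("bus: ReadExclusive"); return MODIFIED; }
        return MODIFIED;                            /* already M: silent */
    case SNOOP_READ:
        if (s == MODIFIED) { puts("bus: Flush"); return SHARED; }
        return s;                                   /* S: may supply data; I: ignore */
    case SNOOP_READX_OR_UPGRADE:
        if (s == MODIFIED) puts("bus: Flush");
        return INVALID;                             /* any copy is invalidated */
    }
    return s;
}

int main(void) {
    msi_state_t s = INVALID;
    s = msi_next(s, CORE_READ);                 /* I -> S, bus Read            */
    s = msi_next(s, CORE_WRITE);                /* S -> M, bus Upgrade         */
    s = msi_next(s, SNOOP_READ);                /* M -> S, flush dirty data    */
    s = msi_next(s, SNOOP_READX_OR_UPGRADE);    /* S -> I, another core writes */
    printf("final state: %d\n", (int)s);        /* 0 == INVALID                */
    return 0;
}
```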

Jos van Eijndhoven et al., “Dynamic and Robust Streaming in and between Connected Consumer-Electronic Devices”

Back to coherence example

 Writes invalidate other caches, ensuring correct coherence o This time assuming write-back

Time step | Event               | CPU A’s cache | CPU B’s cache | Memory
0         |                     |               |               | 0
1         | CPU A reads X       | 0 (S)         |               | 0
2         | CPU B reads X       | 0 (S)         | 0 (S)         | 0
3         | CPU A writes 1 to X | 1 (M)         | (I)           | 0
4         | CPU B reads X       | 1 (S)         | 1 (S)         | 1

Beyond MSI

 Many improvements on the MSI protocol have been proposed and implemented o MESI (adds “Exclusive”), MOESI (adds “Owner”), MESIF (adds “Forward”)  Example: MESI o Addresses the situation where data is read, and then updated in memory by one core • A very common situation! (LW, modify, SW) • MSI would send Read and Upgrade commands, but if no other processor accesses this cache line, perhaps the Upgrade is unnecessary o When MESI snoops a Read command, each core checks its cache to see if it has that line • If the line exists in another core’s cache, data is copied between caches (a new command) • If not, the line is installed in the “E” state, which can upgrade to “M” without consulting others

Cache-to-cache data copies reduce main-memory accesses; in practice, a performance improvement!

Limitations of snoopy caches

 Inter-core cache coherence communications are all broadcast o The memory bus becomes a single point of serialization o Not very scalable beyond a handful of cores  Perhaps not that big of an issue with desktop-class CPUs o With ~8 cores and a typical ratio of memory instructions, the bus is not yet the primary bottleneck o At least Intel Haswell seems to still use snooping • Rumored to use an adaptive MESIF protocol

Directory-based cache coherence

 Uses a separate entity called a “directory” which keeps track of the cache line state (MSI?) o Cache accesses first consult the directory, which either interacts with DRAM or remote caches o Directory can be implemented in a distributed way, to disperse the bottleneck  Pros of directory-based coherence o Given a scalable interconnect, much more scalable than a bus  Cons of directory-based coherence o Much more complex than snoopy, involving circuit implementations of distributed algorithms

We won’t cover this in detail right now…

Performance issue with cache coherence: False sharing

 Different memory locations, written to by different cores, mapped to the same cache line o Core 1 performing “results[0]++;” o Core 2 performing “results[1]++;”  Remember cache coherence o Every time a cache line is written to, all other copies need to be invalidated! o The “results” cache line is ping-ponged between the caches on every write o Bad when it happens on-chip, terrible over the processor interconnect (QPI/UPI)  Solution: Store often-written data in local variables (see the sketch below)
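A minimal pthreads sketch of the effect (the iteration count and the 64-byte padding are assumptions for a typical 64-byte cache line; names are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

/* Naive: both counters share one cache line -> ping-ponged between cores. */
volatile long results[2];

/* Fix: pad each counter to its own cache line, and accumulate locally. */
struct { volatile long value; char pad[64 - sizeof(long)]; } padded[2];

void *worker_naive(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++) results[id]++;   /* shared write every iteration */
    return NULL;
}

void *worker_padded(void *arg) {
    long id = (long)arg;
    long local = 0;                       /* accumulate in a local variable ... */
    for (long i = 0; i < ITERS; i++) local++;
    padded[id].value = local;             /* ... and write shared data only once */
    return NULL;
}

static void run(void *(*fn)(void *), const char *name) {
    pthread_t t[2];
    for (long id = 0; id < 2; id++) pthread_create(&t[id], NULL, fn, (void *)id);
    for (long id = 0; id < 2; id++) pthread_join(t[id], NULL);
    printf("%s done\n", name);
}

int main(void) {
    run(worker_naive,  "naive (false sharing)");
    run(worker_padded, "padded / local accumulator");
    return 0;
}
```

Timing the two phases (e.g., with clock_gettime or the shell’s time) typically shows the naive version several times slower; the exact gap depends on the machine.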

Some performance numbers with false sharing

Memory problems with multiprocessing

 Cache coherency (The two CPU example) o Informally: Read to each address must return the most recent value o Complex and difficult with many processors o Typically: All writes must be visible at some point, and in proper order

 Memory consistency (The two processes example) o How updates to different addresses become visible (to other processors) o Many models define various types of consistency • Sequential consistency, causal consistency, relaxed consistency, … o In our previous example, some models may allow “10” to happen, and we must program such a machine accordingly

Again: Memory consistency issue example

 What are the possible outcomes of the following two code sequences? o A and B are initially zero

Processor 1:        Processor 2:
1: A = 1;           3: B = 1;
2: print B          4: print A

o 1,2,3,4 or 3,4,1,2 etc.: “01” o 1,3,2,4 or 1,3,4,2 etc.: “11”

o Can it print “10”, or “00”? Should it be able to? • Given cache coherency?

Why we need a consistency model

 Given an in-order multicore processor, cache coherence should solve memory consistency automatically o But in-order leaves a lot of performance on the table o e.g., Out-of-Order superscalar brings a lot of performance gains From a single processor’s perspective, it looks like there is no dependency!

 A non-OoO inconsistency example: write buffers o In order to hide cache miss latency for writes, write requests are often stored in a small “write buffer” instead of waiting until the cache processing is done o Requests in the write buffer may be applied out of order, depending on cache operation o A remote core may see memory writes applied out of order! (e.g., “00”)

The memory model is specified as a part of the ISA

Why we need a consistency model

 Once requests can be haphazardly reordered, algorithmic correctness becomes really hard to enforce! o e.g., How would we implement a mutex? o If “locked = true” is buffered in a write buffer while a process is in a critical section…  Memory consistency model: “A specification of the allowed behavior of multithreaded programs executing with shared memory”  The memory model is defined and specified as part of the ISA o Compilers and programmers write/optimize software to work with the model o A stronger (simpler) model is (typically) simpler for the compiler, but needs more complex hardware o A weaker (more complex) model is (typically) harder for the compiler, but allows simpler hardware

The most stringent model: “Sequential consistency” (SC)

 The model requires the following, and the processor hardware must abide by it  Within a core, all memory I/O is applied in program order o Regardless of OoO or write buffers, memory I/O leaving the core should be re-ordered according to program order  Across cores, all cores must see the same global order o Typically solved already via cache coherency  Parallel program execution behaves “as you would expect”  Too stringent for efficient hardware implementation

SC requires I/O with no dependency to be ordered as well!

Weaker memory models for performance

 To satisfy SC, 2 and 4 can’t execute until 1 and 3 become visible globally o 2 and 4 are stalled until the writes are flushed to the L3 cache!  Instead, the two writes can be buffered in write buffers, and execution continues on to the reads o Reads check the write buffers as well as the cache, ensuring correct single-thread behavior o Significant performance improvement* o BUT, this makes “00” or “10” possible

Processor 1:        Processor 2:
1: A = 1;           3: B = 1;
2: print B          4: print A

[Diagram: two cores in a package, each with L1 I$/D$, a write buffer, and a private L2 $, sharing an L3 $.]

*Exact impact has been debated, especially with speculation-based improvements on SC

A weaker memory model: “Total Store Order” (TSO)

 Allows later reads to finish before earlier writes o Reads among themselves, and writes among themselves, must maintain program order o But the ordering between reads and writes can be changed o Basically, the model allows write buffers  Now our example can emit both “00” and “10”

Processor 1:        Processor 2:
1: A = 1;           3: B = 1;
2: print B          4: print A

 It is important that weaker models do not affect single-core ordering, or accesses to the same location o Intra-core dependencies are always maintained o They affect only the order in which memory I/O becomes visible to other cores

Aside: x86 seems to follow a modified version of TSO

Aside: Weaker memory models in the wild

 A wide variety of weaker memory models exist o Processor Consistency (PC) o Release Consistency (RC) o Eventual Consistency (EC), …  Some known memory models o x86 seems to follow a modified form of TSO o The ARM model is notoriously underspecified o RISC-V allows a choice between RC and TSO  How beneficial these memory models are is debated… o TSO consistently outperforms SC, but everything beyond that is muddy

Weaker memory models and synchronization

 How do we implement synchronization in the face of memory re-ordering?  A new instruction, “fence”, specifies where ordering must be enforced o All memory operations before a fence must finish before the fence is executed • And be visible to other processors/threads! o All memory operations after a fence must wait until the fence is executed

 RISC-V base ISA includes two fence operations o FENCE, FENCE.I (for instruction memory)
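As a sketch of how fences are used in practice, here is a simple message-passing handshake in C11 atomics (names are illustrative; the RISC-V mnemonics in the comments are the usual compiler mappings, given as examples):

```c
#include <stdatomic.h>
#include <stdio.h>

int payload = 0;                 /* ordinary data                */
atomic_int ready = 0;            /* flag protecting the payload  */

/* Producer: make the payload visible to other cores *before* the flag. */
void produce(void) {
    payload = 42;
    atomic_thread_fence(memory_order_release);   /* e.g., RISC-V: fence rw,w */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Consumer: see the flag, then fence, then it is safe to read the payload. */
int consume(void) {
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                        /* spin */
    atomic_thread_fence(memory_order_acquire);   /* e.g., RISC-V: fence r,rw */
    return payload;                              /* guaranteed to be 42 */
}

int main(void) {                 /* single-threaded demo of the two calls */
    produce();
    printf("%d\n", consume());
    return 0;
}
```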

So far…

 Why parallelism?  Memory issues with parallelism o Cache coherence o Memory consistency

Hardware support for synchronization

 In parallel software, critical sections implemented via mutexes are critical for algorithmic correctness  Can we implement a mutex with the instructions we’ve seen so far? o e.g.,
    while (lock == True);   // spin until the lock looks free
    lock = True;
    // critical section
    lock = False;
o Does this work with parallel threads?

Hardware support for synchronization

 By chance, both threads can think the lock is not taken o e.g., Thread 2 sees that the lock is not taken just before thread 1 takes it o Both threads think they have the lock

Thread 1                       Thread 2
while (lock == True);          while (lock == True);
lock = True;                   lock = True;
// critical section            // critical section
lock = False;                  lock = False;

Algorithmic solutions exist! Dekker’s algorithm, Lamport’s bakery algorithm…

Synchronization algorithm example: Dekker’s algorithm
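The original slide shows Dekker’s algorithm as a figure; for reference, a minimal C sketch of the two-thread version is below (names are mine, and it is only correct if loads and stores are not reordered, i.e., under sequential consistency):

```c
/* Dekker's algorithm for two threads (id is 0 or 1); a sketch assuming
 * sequentially consistent, non-reordered loads and stores. */
volatile int wants_to_enter[2] = {0, 0};
volatile int turn = 0;

void lock(int id) {
    int other = 1 - id;
    wants_to_enter[id] = 1;
    while (wants_to_enter[other]) {      /* contention: both want in  */
        if (turn != id) {                /* not my turn: back off     */
            wants_to_enter[id] = 0;
            while (turn != id) ;         /* wait for my turn          */
            wants_to_enter[id] = 1;      /* then try again            */
        }
    }
}

void unlock(int id) {
    turn = 1 - id;                       /* hand the turn to the other side */
    wants_to_enter[id] = 0;
}
```

Under TSO, the store to wants_to_enter[id] can be buffered past the load of wants_to_enter[other], so both threads may enter; that is exactly the question below.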

Does this still work with TSO?

Also, performance overhead…

The solution is attributed to Dutch mathematician Th. J. Dekker by Edsger W. Dijkstra in an unpublished paper

Hardware support for synchronization

 A high-performance solution is to add an “atomic instruction” o Memory read/write in a single instruction o No other instruction can read/write between the atomic read and write o e.g., atomically: “if (lock == False) lock = True”  Fences are algorithmically sufficient to implement atomic operations, but this is more efficient

A single-instruction read/write is in the grey area of the RISC paradigm…

RISC-V example

 Atomic instructions are provided as part of the “A” (Atomic) extension  Two types of atomic instructions o Atomic memory operations (read, operation, write) • operation: swap, add, or, xor, … o A pair of linked read/write instructions, where the write returns failure if the memory location has been written to after the read • More like RISC! • With bad luck, may cause livelock, where the writes always fail  Aside: It is known that all synchronization primitives can be implemented with only atomic compare-and-swap (CAS) o RISC-V doesn’t define a CAS instruction, though
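For example, a test-and-set spinlock built on an atomic swap, using the GCC/Clang __atomic builtins (a sketch; on RISC-V with the A extension the exchange typically compiles to an amoswap.w.aq instruction):

```c
/* Spinlock using an atomic memory operation (swap). */
typedef struct { int locked; } spinlock_t;   /* 0 = free, 1 = held */

void spin_lock(spinlock_t *l) {
    /* Atomically write 1 and get the old value; if the old value was 1,
     * someone else holds the lock, so keep retrying. */
    while (__atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE) == 1)
        ;  /* spin */
}

void spin_unlock(spinlock_t *l) {
    __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}
```

A compare-and-swap retry loop (e.g., via __atomic_compare_exchange_n) works the same way and, since RISC-V has no CAS instruction, is typically built from the linked lr.w/sc.w pair.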

Pipelined implementation of atomic operations

 In a pipelined implementation, even a single-instruction read-modify-write can be interleaved with other instructions o Multiple cycles through the pipeline

 Atomic memory operations o Modify cache coherence so that once an atomic operation starts, no other cache can access that line o Other solutions?  Linked read/write operations o Need only to keep track of a small number of linked read addresses