CIS 631 Parallel Processing

Lecture 3: Parallel Computer Architectures

Logistics
● Due end of this week: Proposal.
● Invite me to your Bitbucket account: jeewhanchoi.
● Have two Bitbucket repos:
  1) Personal repo (1 per student): survey/, homework/01, homework/02, etc.
  2) Group repo (1 per group): proposal/, code/, report/.

Misc.
● All coding must be version controlled using git & Bitbucket.
● All reports must be written using LaTeX.

Pipelining (previous lecture)
● What is it?
● Why do we need it? (Or what are the performance benefits?)
● What are its limitations? (Or when does it not work?)

Superscalar execution - a processor is capable of executing more than one instruction at the same time (ILP - instruction level parallelism).

Superscalar and OOE: what is the maximum sustained IPC this system can deliver?

Superscalar and OOE
Why did I combine superscalar and OOE?
● To execute multiple instructions at the same time, simple in-order execution is typically incapable of providing enough independent instructions to fill the pipeline.
● Therefore, we have a reordering buffer that analyzes the instruction stream to find independent instructions and schedules ready-to-execute instructions even if they come later - i.e., out-of-order execution.
● Reordering buffer approaches:
  ● Scoreboarding
  ● Tomasulo’s algorithm

Hazards
● Structural hazards
  ● Occur when a part of the processor’s hardware is needed by two or more instructions at the same time.
● Control hazards
  ● Branches - the architecture does not know which instruction to fetch next.
● Data hazards
  ● Read after write (RAW) - true dependency: i2 reads a source before i1 writes to it.
    i1. R2 <- R1 + R3
    i2. R4 <- R2 + R3
  ● Write after read (WAR) - anti-dependency: i2 writes to a destination before i1 reads it.
    i1. R4 <- R1 + R5
    i2. R5 <- R1 + R2
  ● Write after write (WAW) - output dependency: i2 writes to a destination before it is written by i1.
    i1. R2 <- R4 + R7
    i2. R2 <- R1 + R3

Assume a single-issue, in-order, 5-stage pipeline.

Hazards: only 1 memory unit, and two instructions need to use it at the same time - one to fetch an instruction, the other to fetch data.

Structural Hazard: insert “bubbles” (i.e., stalls) in the pipeline.

Structural Hazard

Branch Hazard
● Branch prediction: the processor tries to “guess” whether the branch will be taken or not.
● If the prediction was incorrect (i.e., mispredicted), then the pipeline needs to be flushed - very expensive.
● Loops are a big source of branches - branches are taken more often than not.

Branch Prediction

Data Hazards (RAW)

Data Hazards (RAW): does not always work...
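To recap the three dependence types in source form, here is a small illustrative C fragment (a minimal sketch - the variable names simply mirror the register examples from the hazard slides) annotated with the hazards a scheduler would have to respect if it reordered these statements:

/* Illustrative only: r1..r7 stand in for the registers in the slide examples. */
void hazards_example(int r1, int r3, int r5, int r7)
{
    int r2, r4;

    r2 = r1 + r3;   /* i1: writes r2                                  */
    r4 = r2 + r3;   /* i2: reads r2  -> RAW (true dependency) on i1   */

    r4 = r1 + r5;   /* i1: reads r5                                   */
    r5 = r1 + r2;   /* i2: writes r5 -> WAR (anti-dependency) on i1   */

    r2 = r4 + r7;   /* i1: writes r2                                  */
    r2 = r1 + r3;   /* i2: writes r2 -> WAW (output dependency) on i1 */

    (void)r2; (void)r4; (void)r5;   /* silence unused-value warnings */
}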

Data Hazards
OOE helps alleviate these data hazards, so that more independent instructions can be identified and issued.

Scoreboarding and Tomasulo’s Algorithm
● Scoreboarding simply keeps track of instruction dependencies to determine whether an instruction can start executing - it does not eliminate dependencies.
● Tomasulo’s algorithm uses scoreboarding with register renaming/coloring to eliminate WAR and WAW dependencies.
● WAR and WAW can be overcome if we change the source or destination registers for one of the instructions that have dependencies (a minimal renaming sketch follows below):

WAR (before → after renaming)
  i1. R4 <- R1 + R5        i1. R4 <- R1 + R5
  i2. R5 <- R1 + R2        i2. R6 <- R1 + R2

WAW (before → after renaming)
  i1. R2 <- R4 + R7        i1. R2 <- R4 + R7
  i2. R2 <- R1 + R3        i2. R6 <- R1 + R3
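A minimal sketch of the renaming idea (not the full Tomasulo machinery): a rename table maps each architectural register to a fresh physical register on every write, which is what removes the WAR and WAW dependencies above. The register counts and the simple in-order free list are made-up assumptions for illustration.

#include <stdio.h>

#define NUM_ARCH 8          /* architectural registers R0..R7 (assumed) */

static int rename_table[NUM_ARCH];  /* arch reg -> current phys reg */
static int next_free = NUM_ARCH;    /* naive allocator; real hardware recycles freed registers */

/* Rename one instruction "Rd <- Ra op Rb": sources read the CURRENT mapping,
   the destination gets a brand-new physical register. Two writes to the same
   architectural register never share a physical register (no WAW), and a later
   write cannot clobber an earlier reader's source (no WAR). */
static void rename(int rd, int ra, int rb)
{
    int pa = rename_table[ra];
    int pb = rename_table[rb];
    int pd = next_free++;
    rename_table[rd] = pd;
    printf("P%d <- P%d op P%d\n", pd, pa, pb);
}

int main(void)
{
    for (int i = 0; i < NUM_ARCH; i++) rename_table[i] = i;  /* identity map */

    /* WAW example from the slide: both instructions write R2. */
    rename(2, 4, 7);   /* i1. R2 <- R4 + R7  -> P8 <- P4 op P7 */
    rename(2, 1, 3);   /* i2. R2 <- R1 + R3  -> P9 <- P1 op P3 (different phys reg) */
    return 0;
}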

In this case, OOE using scoreboarding does not help that much - they both end at the same time.

Scoreboarding and Tomasulo’s Algorithm (with renaming: F4 → F4’)

Tomasulo’s algorithm can “rename” the registers in the hardware so that instruction 5 no longer depends on instruction 3.

Hardware (Multi)threading
● Simultaneous multithreading (SMT).
  ● Intel - Hyper-Threading (HT).
  ● AMD - clustered multithreading (CMT).
● Each SMT thread maintains its own architectural “state”:
  ● Data registers.
  ● Control registers (e.g., stack pointer, instruction pointer, etc.).
● They share execution resources:
  ● Execution pipeline.
  ● Cache.
● SMT increases ILP by adding another thread of execution - the additional thread may come from a different application.
● However, it may decrease performance as well:
  ● If one thread has enough ILP to fill the pipeline, fewer resources (e.g., registers, cache) are available for that thread.
● Some modern architectures require SMT to fully utilize the pipeline:
  ● Intel Xeon Phi (RIP).
  ● IBM Power8/9:
    ● One thread gets 64 entries in the fetch buffer.
    ● Two threads get 128 entries.
    ● 4-8 threads get 128 entries.

SIMD - Single Instruction, Multiple Data
● First implemented in vector processors.
● Also referred to as SIMD vectorization.
● Present in most recent processors:
  ● Intel - SSE/AVX.
  ● AMD - 3DNow! (then SSE).
  ● Arm - NEON.
● The compiler can sometimes figure this out automatically.
● You can use intrinsics (special C/C++ instructions that mimic SIMD assembly instructions) - they are just “hints,” so the compiler does not guarantee SIMD vectorization (a small intrinsics sketch follows below).
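As a concrete illustration of the intrinsics mentioned above, here is a minimal sketch of a vectorized array addition using AVX intrinsics, assuming an x86 CPU with AVX support (compiled with something like -mavx); the function name and the assumption that n is a multiple of 8 are made up for illustration.

#include <immintrin.h>   /* AVX intrinsics */

/* Add two float arrays, 8 elements per AVX register.
   Assumes n is a multiple of 8; a real version would handle the remainder. */
void vec_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);   /* load 8 floats (unaligned) */
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);    /* 8 additions in one instruction */
        _mm256_storeu_ps(&c[i], vc);          /* store 8 results */
    }
}

With auto-vectorization, a plain loop compiled at -O3 may generate similar code; the intrinsics simply make the intent explicit.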

Memory Hierarchies

● L3 cache is also referred to as the last level cache (LLC).
● L1 cache is divided between instruction (L1i) and data (L1d) caches.
● Inside the core, data is stored in registers.
● Traditionally, in multi-core processors, L3 is (typically) shared among all cores, while L2 and L1 are local to each core.

Modern Cache: Intel Skylake architecture
● L1i/L1d cache
  ● 32 KB per core, 8-way set associative
  ● 64 Byte/cycle load, 32 Byte/cycle store
  ● 4 or 5 cycle access (depending on how the address is calculated)
● L2 cache
  ● 256 KB per core (unified), 4-way set associative
  ● 64 Byte/cycle (to L1 cache)
  ● 12 cycle access
● L3 cache
  ● 2 MB per core (but shared), 16-way associative
  ● 32 Byte/cycle
  ● 42 cycle access
● L4 cache (or embedded DRAM (eDRAM), side cache)
  ● 64/128 MB per package
  ● 32 Byte/cycle read/write (but runs on a separate eDRAM clock)

Locality
● Data locality:
  ● Temporal - data is reused within a short time frame.
  ● Spatial - if data in location i is used (e.g., an array), data in location i+1 is also likely to be used.
● τ = some factor which captures how much faster cache is vs. DRAM (includes both bandwidth and latency).
● κ = cache “reuse” rate.

Temporal Locality
S(τ,κ) = Tmem / Tavg
Tcache = Tmem / τ  →  Tmem = Tcacheτ

Tavg = κTcache + (1-κ)Tmem

S(τ,κ) = Tcacheτ / (κTcache + (1-κ)Tcacheτ ) = τ / (κ + (1-κ)τ)
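To make the formula concrete, here is a minimal sketch in C that evaluates the speedup model above; the values τ = 10 (cache 10x faster than DRAM) and κ = 0.9 (90% reuse) are made-up example inputs.

#include <stdio.h>

/* Speedup model from the slides: S(tau, kappa) = tau / (kappa + (1 - kappa) * tau) */
static double speedup(double tau, double kappa)
{
    return tau / (kappa + (1.0 - kappa) * tau);
}

int main(void)
{
    printf("S = %.2f\n", speedup(10.0, 0.9));  /* prints S = 5.26 */
    return 0;
}

Note that as κ → 1, S approaches τ; the accesses that still go to DRAM limit the achievable speedup.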

What does this remind you of?

Temporal Locality
How do we increase temporal locality?
● Cache “blocking” (or tiling).
  ● Minimizes the chance that data loaded into the cache is evicted before it is reused. (A blocking sketch follows at the end of this section.)
What if data is “streamed”?
● Spatial locality?

Spatial Locality
Cache lines
● Data transfer between DRAM and cache occurs at cache line granularity.
● This is done to reduce latency and to take advantage of spatial locality.
● T = ɑ + βL
  ● ɑ = time between data request and delivery; L = cache line size; β = 1 / bandwidth.
  ● Also known as the alpha-beta model.
● Let’s say latency is 80 ns and bandwidth is 40 GB/s. A typical cache line size is 64 Bytes.
  ● Time to load a cache line = 81.6 ns.
  ● Without cache lines (and reading 8 Bytes at a time)?
  ● 8 x (80 ns + 8 Bytes / 40 GB/s) = 641.6 ns.
● What if we don’t need all 64 Bytes?
  ● Let’s say we need to read 64 Bytes but they are separated by 56 Bytes - i.e., the first 8 Bytes of 8 cache lines.
  ● T = 8 x (80 ns + 64 Bytes / 40 GB/s) = 652.8 ns (vs. 641.6 ns).
● Now, let’s look at the streaming case.
  ● What is the cache hit rate? (This is not the same as the reuse rate described above.)
  ● Assuming the cache line arrives in cache before the 2nd data access: γ = (64 - 8) / 64 = 0.875.
● However, if you have enough parallelism to hide the latency (i.e., you are limited only by the bandwidth), in the strided case your potential performance is ⅛ of what it could be.
(A small calculation of these alpha-beta numbers also follows below.)

What happens when writing data (as opposed to reading)?
● Most LLCs use write-back - on a cache hit, the cache line is modified and written to memory when evicted.
● On a cache miss - the entire cache line is brought in (write allocate), which causes 2x data traffic.
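As a sketch of the cache blocking (tiling) mentioned above: a blocked matrix transpose in C. The matrix size N and the tile size B are made-up values; B would have to be tuned so that a BxB tile of each matrix fits in cache.

#define N 4096
#define B 64    /* tile size (assumed); tune so a BxB tile of each matrix fits in cache */

/* Blocked (tiled) matrix transpose: out = in^T, row-major storage, N divisible by B.
   Working on BxB tiles keeps each tile resident in cache while it is reused,
   instead of streaming whole rows/columns and evicting data before reuse. */
void transpose_blocked(const double *in, double *out)
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    out[j * N + i] = in[i * N + j];
}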

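A minimal sketch that reproduces the alpha-beta numbers above; the latency (80 ns), bandwidth (40 GB/s), and cache line size (64 Bytes) are the values assumed on the slide.

#include <stdio.h>

int main(void)
{
    double alpha = 80e-9;           /* latency: 80 ns */
    double beta  = 1.0 / 40e9;      /* 1 / bandwidth, bandwidth = 40 GB/s */
    double L     = 64.0;            /* cache line size in Bytes */

    double t_line    = alpha + beta * L;            /* one 64-Byte cache line */
    double t_no_line = 8.0 * (alpha + beta * 8.0);  /* eight separate 8-Byte reads */
    double t_strided = 8.0 * (alpha + beta * L);    /* 8 Bytes used out of each of 8 lines */

    printf("cache line:     %.1f ns\n", t_line    * 1e9);   /* 81.6 ns  */
    printf("no cache lines: %.1f ns\n", t_no_line * 1e9);   /* 641.6 ns */
    printf("strided:        %.1f ns\n", t_strided * 1e9);   /* 652.8 ns */
    return 0;
}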
Write Allocate
● Write allocate is required due to hardware design - registers can only communicate with L1.
● If data is not re-used (i.e., read from the cache after writing), it’s an unnecessary penalty - is there something that can be done?
  ● Non-temporal stores - special store instructions that bypass all cache levels and write directly to memory. There is also typically a write-combine buffer that bundles non-temporal stores to better utilize the memory system. (A non-temporal store sketch follows this section.)
  ● Cache line zero - zero out a cache line and mark it as modified without a read; the data is written to memory when evicted.

Associativity
● Direct-mapped - memory locations that are a multiple of the cache size apart are always mapped to the same cache location.
  ● Easy to implement - just mask out the most significant bits.
  ● Prone to cache thrashing (i.e., conflict misses) - applications with strided access map to the same cache location at every iteration.
● Fully-associative - any memory location can be mapped to any location in the cache.
  ● Difficult to build large, fast, fully-associative caches due to book-keeping.
  ● Every entry in the cache must be checked to see if a new memory request is already in cache.
● N-way set-associative - reduces conflict misses without huge book-keeping overhead.
  ● The cache is divided into N direct-mapped caches (equal in size).
  ● Set associativity is typically between 2 and 48 on modern processors. (A set-index calculation sketch also follows this section.)

Even if you improve spatial locality using cache lines, latency still exists on the first miss.
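A minimal sketch of a non-temporal store, assuming an x86 CPU with AVX; the function name, the alignment assumption, and the divisibility assumption are made up for illustration. _mm256_stream_pd writes around the caches via the write-combining buffer, avoiding the write-allocate read of each destination cache line.

#include <immintrin.h>

/* Fill a large output buffer that will not be read again soon.
   Assumes dst is 32-Byte aligned and n is a multiple of 4. */
void stream_fill(double *dst, double value, long n)
{
    __m256d v = _mm256_set1_pd(value);
    for (long i = 0; i < n; i += 4)
        _mm256_stream_pd(&dst[i], v);   /* non-temporal store: bypasses the caches */
    _mm_sfence();                        /* order the streamed stores before later accesses */
}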

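To make the address-to-set mapping concrete, here is a small sketch with made-up cache parameters (32 KB cache, 64-Byte lines) showing how an address is split into line offset, set index, and tag for different associativities.

#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE (32 * 1024)   /* 32 KB (assumed) */
#define LINE_SIZE  64            /* 64-Byte cache lines */

/* Print which set an address maps to for a given associativity.
   num_sets = cache size / (line size * ways); the set index is the
   low-order bits of the address just above the line offset. */
static void map_address(uintptr_t addr, int ways)
{
    int num_sets  = CACHE_SIZE / (LINE_SIZE * ways);
    uintptr_t set = (addr / LINE_SIZE) % num_sets;
    uintptr_t tag = addr / (LINE_SIZE * (uintptr_t)num_sets);
    printf("addr 0x%lx, %d-way: set %lu, tag 0x%lx\n",
           (unsigned long)addr, ways, (unsigned long)set, (unsigned long)tag);
}

int main(void)
{
    /* Two addresses exactly one cache size apart collide in a direct-mapped
       cache (same set, one way), but can coexist in a 2-way set. */
    map_address(0x10000, 1);                /* direct-mapped */
    map_address(0x10000 + CACHE_SIZE, 1);
    map_address(0x10000, 2);                /* 2-way set-associative */
    map_address(0x10000 + CACHE_SIZE, 2);
    return 0;
}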
Prefetching

● Making the cache lines longer helps reduce how often the miss latency is incurred.
  ● Counterproductive when the access pattern is irregular.
  ● The current “sweet spot” seems to be 64- or 128-Byte cache lines.
● Prefetching can help with this: “fetch” the cache line before it is requested.
  ● Prefetching instructions (software) inserted by the programmer or the compiler.
  ● A hardware prefetcher tries to “predict” by studying the access pattern.
● Software prefetching
  ● Instructions are “hints” to the architecture - not guaranteed to work.
  ● Increases instruction count - could degrade performance (instruction cache misses).
  ● “Timing” the prefetch is difficult. (A software-prefetch sketch appears at the end of this section.)
● Hardware prefetching
  ● Specialized hardware.
  ● Trades bandwidth for latency.
  ● Prefetching requires resources that are limited by design.
  ● The memory system must be able to sustain a certain number of outstanding prefetches (i.e., pending prefetch requests).
  ● Otherwise, the memory pipeline will stall and the latency cannot be hidden completely.
● How many outstanding prefetches are required? (Hint: Little’s Law)
  ● The number of cache lines that can be transferred during the time one transfer takes is the number of prefetches that the processor must be able to sustain.
  ● L = cache line size; β = 1 / bandwidth; ɑ = latency.
  ● # of cache lines transferred in time t = t x bandwidth / L = t / (Lβ).
  ● Time for one cache line: t = ɑ + Lβ.
  ● P = t / (Lβ) = (ɑ + Lβ) / (Lβ) = 1 + ɑ/(Lβ).
  ● Equivalently (Little’s Law): concurrency P = latency x bandwidth = ɑ/(Lβ) cache lines.
● If you are also doing some amount of compute, some of that can be used to hide the memory latency. In such a case, fewer outstanding prefetches will suffice to saturate the bandwidth.
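Using the same assumed numbers as before (80 ns latency, 40 GB/s bandwidth, 64-Byte cache lines), a quick sketch of the outstanding-prefetch count:

#include <stdio.h>

int main(void)
{
    double alpha = 80e-9;        /* latency: 80 ns */
    double beta  = 1.0 / 40e9;   /* 1 / bandwidth */
    double L     = 64.0;         /* cache line size in Bytes */

    /* P = 1 + alpha / (L * beta): outstanding cache-line requests needed
       to keep the memory pipeline full (Little's Law). */
    double P = 1.0 + alpha / (L * beta);
    printf("outstanding prefetches needed: %.0f\n", P);   /* prints 51 */
    return 0;
}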

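A minimal sketch of software prefetching using GCC/Clang’s __builtin_prefetch; the prefetch distance PF_DIST is a made-up value that would have to be tuned, and, as noted above, the hint is not guaranteed to help.

#define PF_DIST 16   /* prefetch distance in elements (assumed; must be tuned) */

/* Sum an array while prefetching data PF_DIST elements ahead.
   __builtin_prefetch is only a hint: the compiler/CPU may ignore it,
   and a poorly chosen distance can hurt performance. */
double sum_with_prefetch(const double *a, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 3 /* high locality */);
        s += a[i];
    }
    return s;
}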
RISC vs. CISC
● CISC (Complex Instruction Set Computer)
  ● Complex, more powerful instructions.
  ● Requires larger hardware for decoding the instructions.
  ● Reduces the number of instructions.
● RISC (Reduced Instruction Set Computer)
  ● Simple instructions that can be decoded quickly and executed rapidly (i.e., fewer …, therefore higher … possible).
● Intel is a well-known CISC architecture.
  ● This is only partially true. The machine code is CISC, but internally the instructions are converted to a set of micro-ops, which resemble RISC instructions.