ECE5917 SoC Architecture: MP SoC

Tae Hee Han: [email protected]
Semiconductor Systems Engineering, Sungkyunkwan University

Outline

n Overview

n Parallelism

n Data-Level Parallelism

n Instruction-Level Parallelism

n Thread-Level Parallelism

n Processor-Level Parallelism

n Multi-core

2 Parallelism - Thread Level Parallelism

3 Superscalar (In)Efficiency

[Figure: instruction issue slots (issue width × time) in a 4-wide superscalar]

n Completely idle cycle (vertical waste): introduced when the processor issues no instructions in a cycle

n Partially filled cycle, i.e., IPC < 4 (horizontal waste): occurs when not all issue slots can be filled in a cycle

4 Thread n Definition

n A discrete sequence of related instructions

n Executed independently of other such sequences n Every program has at least one thread

n Initializes

n Executes instructions

n May create other threads n Each thread maintains its current state n OS maps a thread to hardware resources

5 Multithreading

n On a single processor, multithreading generally occurs by time-division multiplexing (as in multitasking) – context switching

n On a multiprocessor or multi-core system, threads can be truly concurrent, with every processor or core executing a separate thread simultaneously

n Many modern OS directly support both time-sliced and multiprocessor threading with a process scheduler

n The kernel of an OS allows programmers to manipulate threads via the system-call interface
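As a concrete illustration of that interface, a minimal POSIX-threads sketch (illustrative only; pthread_create/pthread_join are the usual user-level entry points backed by the kernel's threading support):

#include <pthread.h>
#include <stdio.h>

/* Work executed by each software thread; the OS scheduler maps it
 * onto an available hardware core or SMT context. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    /* Create four threads through the kernel's threading interface. */
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    /* Wait for all of them to finish. */
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}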

6 Thread Level Parallelism (TLP)

n Interaction with OS

n OS perceives each core as a separate processor

n OS scheduler maps threads/processes to different logical (or virtual) cores

n Most major OSes support multithreading today

n TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel

n Goal: use multiple instruction streams to improve

n Throughput of computers that run many programs

n Execution time of multi-threaded programs

n TLP could be more cost-effective than ILP

[Figure: two processes with private virtual memory (ASID 1, ASID 2), each with threads holding their own stack, registers, and PC, scheduled by the OS thread scheduler onto two processor cores (e.g., 2-way SMT) sharing physical memory]

7 Multithreaded Execution

n Multithreading: multiple threads share functional units of 1 processor via overlapping

n Processor must duplicate independent state of each thread

n Separate copy of register file, PC

n Separate page table if different process

n Memory sharing via virtual memory mechanisms

n Already supports multiple processes

n HW for fast thread switch

n Must be much faster than full process switch (which is 100s to 1000s of clocks)

n When to switch?

n Alternate instruction per thread (fine grain)—round robin

n When thread is stalled (coarse grain)

n e.g., cache miss

8 Conceptual Illustration of Multithreaded Architecture

[Figure: a program partitioned into serial code and sub-problems A, B, C running as concurrent threads of computation, mapped onto hardware streams (some unused), feeding an instruction-ready pool and the pipeline of executing instructions]

9 Sources of Wasted Issue Slots

Source: Possible latency-hiding or latency-reducing technique

TLB miss: Increase TLB sizes, HW instruction prefetching, HW or SW data prefetching, faster servicing of TLB misses

I-cache miss: Increase cache size, more associativity, HW instruction prefetching

D-cache miss: Increase cache size, more associativity, HW or SW data prefetching, improved instruction scheduling, more sophisticated dynamic execution

Branch misprediction: Improved branch prediction scheme, lower branch misprediction penalty

Control hazard: Speculative execution, more aggressive if-conversion

Load delays (L1 cache hits): Shorter load latency, improved instruction scheduling, dynamic scheduling

Short integer delay: Improved instruction scheduling

Long integer, short FP, long FP delays: Shorter latencies, improved instruction scheduling

Memory conflict: Improved instruction scheduling

10 Fine-Grained Multithreading

n Switches between threads on each instruction, interleaving execution of multiple threads

n Usually done round-robin, skipping stalled threads (see the sketch below)

n CPU must be able to switch threads every clock

n Advantage: can hide both short and long stalls

n Instructions from other threads always available to execute

n Easy to insert on short stalls

n Disadvantage: slows individual threads

n A thread ready to execute without stalls will be delayed by instructions from other threads

n Used on Sun (now Oracle) Niagara (UltraSPARC T1) – Nov. 2005
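A minimal C sketch of the round-robin thread-select policy described above; the stalled[] array and the function name are illustrative, not from the slides:

#include <stdbool.h>

#define NUM_THREADS 4

/* Per-hardware-thread status: true while the thread is stalled,
 * e.g. waiting on a cache miss. (Hypothetical model, not RTL.) */
static bool stalled[NUM_THREADS];

/* Round-robin selection for fine-grained multithreading:
 * starting after the thread that issued last cycle, pick the next
 * thread that is not stalled. Returns -1 if every thread is stalled
 * (a completely idle, "vertically wasted" cycle). */
int select_next_thread(int last_issued) {
    for (int i = 1; i <= NUM_THREADS; i++) {
        int candidate = (last_issued + i) % NUM_THREADS;
        if (!stalled[candidate])
            return candidate;
    }
    return -1;
}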

11 Coarse-Grained Multithreading

n Switches threads only on costly stalls: e.g., L2 cache misses

n Advantages

n Relieves need to have very fast thread switching

n Doesn’t slow thread

n Other threads only issue instructions when main one would stall (for long time) anyway

n Disadvantage: pipeline startup costs make it hard to hide throughput losses from shorter stalls

n Pipeline must be emptied or frozen on stall, since CPU issues instructions from only one thread

n New thread must fill pipe before instructions can complete

n Thus, better for reducing penalty of high-cost stalls where pipeline refill << stall time

n Used in IBM AS/400

12 Simple Multithreaded Pipeline

[Figure: a simple multithreaded pipeline with per-thread PCs and register files (GPR1, GPR2), round-robin thread-select logic, I$, D$, and the usual fetch/decode/execute stages]

n Additional state: One copy of architected state per thread (e.g., PC, GPR)

n Thread select: Round-robin logic; Propagate Thread-ID down pipeline to access correct state (e.g., GPR1 versus GPR2)

n OS perceives multiple logical CPUs

13 Cycle Interleaved MT (Fine-Grain MT)

[Figure: issue slots over time with a second thread interleaved cycle-by-cycle; partially filled cycles (IPC < 4, horizontal waste) remain]

Cycle interleaved multithreading reduces vertical waste with cycle-by-cycle interleaving. However, horizontal waste remains.

14 Chip Multiprocessing (CMP)

[Figure: issue slots over time for a chip multiprocessor with two narrower cores, each running its own thread]

Chip multiprocessing reduces horizontal waste with simple (narrower) cores. However, (1) vertical waste remains and (2) ILP is bounded.

15 Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]

n Interleave multiple threads to multiple issue slots with no restrictions

[Figure: issue slots (issue width × time) completely filled with instructions from multiple threads]

16 Simultaneous Multithreading (SMT) Motivation n Fine-grain Multithreading

n HEP, Tera, MASA, MIT Alewife

n Fast context switching among multiple independent threads

n Switch threads on cache miss stalls – Alewife

n Switch threads on every cycle – Tera, HEP

n Target vertical wastes only

n At any cycle, issue instructions from only a single thread

n Single-chip MP

n Coarse-grain parallelism among independent threads running on different processors

n Each individual processor pipeline still exhibits both vertical and horizontal waste

17 Simultaneous Multithreading (SMT)

n An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue processors (superscalars)

n SMT has the potential of greatly enhancing superscalar processor computational capabilities by

n Exploiting thread-level parallelism in a single processor core, simultaneously issuing, executing and retiring instructions from different threads during the same cycle

n A single physical SMT processor core acts as a number of logical processors each executing a single thread

n Providing multiple hardware contexts, hardware thread scheduling and context switching capability

n Providing effective long latency hiding

n e.g.) FP, branch misprediction, memory latency

18 Simultaneous Multithreading (SMT)

n Intel’s Hyper-Threading (2-way SMT)

n IBM Power7 (4/6/8 cores, 4-way SMT); IBM Power5/6 (2 cores - each 2-way SMT, 4 chips per package): Power5 has OoO cores, Power6 In-order cores;

n Basic ideas: Conventional MT + Simultaneous issue + Sharing common resources

[Figure: an SMT pipeline – per-thread PCs and renamers feed shared fetch/decode, RS & ROB, and a physical register file; shared execution units include Fdiv (16 cycles, unpipelined), Fmult (4 cycles), Fadd (2 cycles), 2 ALUs, and Load/Store (variable latency), backed by the I-cache and D-cache]

19 Overview of SMT Hardware Changes

n For an N-way (N threads) SMT, we need:

n Ability to fetch from N threads

n N sets of registers (including PCs)

n N rename tables (RATs)

n N virtual memory spaces

n But we don’t need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.)
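A rough C sketch of which front-end state an N-way SMT core replicates per thread versus what stays shared; the struct layout and sizes are illustrative assumptions, not a real design:

#include <stdint.h>

#define ARCH_REGS 32

/* Replicated per hardware thread in an N-way SMT core (illustrative
 * model only): architectural registers, PC, a register-rename table
 * (RAT), and an address-space ID for its virtual memory space. */
typedef struct {
    uint64_t pc;
    uint64_t arch_reg[ARCH_REGS];
    uint16_t rat[ARCH_REGS];   /* architectural -> physical mapping */
    uint16_t asid;             /* separate virtual memory space     */
} smt_thread_context;

/* Shared among all threads: the OOO engine itself (physical register
 * file, schedulers, execution units, bypass networks, ROB, caches). */
typedef struct {
    uint64_t phys_reg[192];            /* shared physical registers  */
    smt_thread_context thread[2];      /* e.g. a 2-way SMT core      */
} smt_core;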

20 Multithreading: Classification

[Figure: functional-unit (FU1–FU4) occupancy over execution time for five organizations – conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), chip multiprocessor (CMP or multi-core), and simultaneous multithreading (SMT)]

21 SMT Performance n When it works, it fills idle “issue slots” with work from other threads; throughput improves

n But sometimes it can cause performance degradation!

[Figure: it can happen that finishing one task and then doing the other takes less time than doing both at the same time using SMT]

22 How? n Cache thrashing

n Thread 0 just fits in the level-1 caches and executes reasonably quickly due to high cache hit rates

n Thread 1 alone also fits nicely in the caches

n But the caches were just big enough to hold one thread’s data, not two threads’ worth

n With both threads running, both now have significantly higher cache miss rates

à Intel Smart Cache!

23 Multithreading: How Many Threads?

n With more HW threads:

n Larger/multiple register files

n Replicated & partitioned resources à Lower utilization, lower single-thread performance

n Shared resources à Utilization vs. interference and thrashing

n Impact of MT/MC on memory hierarchy?

Source: Guz et al. "Many-Core vs. Many-Thread Machines: Stay Away From the Valley," IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 8, NO. 1, 2009

24 SMT: Intel vs. ARM n In 2010, ARM said it might include SMT in its chips in the future; however this was rejected for their 2012 64-bit design

Noel Hurley (VP of marketing and strategy in ARM’s processor division) said ARM rejected SMT as an option. Although it can be used to hide the latency of memory accesses in parallel applications – a technique used heavily in GPUs – multithreading complicates the design of the pipeline itself. The tradeoff did not make sense for the engineering team, he said. ( http://www.techdesignforums.com/blog/2012/10/30/arm-64bit-cortex-a53-a57-launch/ ) n Intel conceded SMT will not be supported on its processor cores in order to save power

25 MT Analysis by ARM for Mobile Applications

n Evaluation tests by ARM have shown that MT is not efficient for mobile devices

n Increasing performance by 50% will cost more than a 50% increase in power

n MT is much less predictable than multi-core solutions

n In MT, the implementation cost of ‘sleep mode’ becomes higher due to more sharing of resources between multiple threads

n In high-end mobile apps which require superscalar and OoO based multi-core, single-threaded multi-core implementations such as big.LITTLE are the most efficient solution

[Figure: relative mW and relative DMIPS for Cortex-A7, dual Cortex-A7, Cortex-A12, and an estimate of Cortex-A12 with MT]

Source: http://www.arm.com/files/pdf/Multi-threading_Technology.pdf

26 Summary

n Limits to ILP (power efficiency, compilers, dependencies …) seem to restrict practical designs to 3- to 6-issue

n Data-level parallelism and/or thread-level parallelism is exploited to improve performance

n Coarse-grain vs. fine-grain multithreading

n Switch only on big stalls vs. switch every clock cycle

n Simultaneous multithreading is fine-grained multithreading built on an OoO superscalar microarchitecture

n Instead of replicating registers, reuse rename registers

27 Parallelism - Processor Level Parallelism

28 Beyond ILP (Instruction Level Parallelism) n Performance is limited by the serial fraction

n Coarse grain parallelism in the post ILP era

n Thread, process and data parallelism

n Learn from the lessons of the parallel processing community

n Revisit the classifications and architectural techniques

[Figure: serial vs. parallelizable fractions of a program as more CPUs (1–4) are added]

29 “Automatic” Parallelism in Modern Machines

n Bit level parallelism

n Within floating point operations, etc.

n Instruction level parallelism (ILP)

n Multiple instructions execute per clock cycle

n Memory system parallelism

n Overlap of memory operations with computation

n OS parallelism

n Multiple jobs run in parallel on commodity SMPs

Limits to all of these -- for very high performance, need user to identify, schedule and coordinate parallel tasks

30 Principles of Parallel Computing

n Finding enough parallelism (Amdahl’s Law)

n Granularity

n Locality

n Load balance

n Coordination and synchronization

n Performance modeling

All of these things make parallel programming even harder than sequential programming

31 Finding Enough Parallelism

n Suppose only part of an application seems parallel

n Amdahl’s law

n Let s be the fraction of work done sequentially, so (1-s) is fraction parallelizable

n P = number of processors

Speedup(P) = Time(1)/Time(P) = 1/(s + (1-s)/P) ≈ 1/s as P grows large

n Even if the parallel part speeds up perfectly performance is limited by the sequential part
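A small C check of Amdahl's law as stated above (the function name and the 10% serial fraction are just an illustrative choice):

#include <stdio.h>

/* Amdahl's law from the slide: Speedup(P) = 1 / (s + (1 - s) / P),
 * where s is the serial fraction and P is the processor count. */
double speedup(double s, int p) {
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void) {
    /* With 10% serial work, even many processors cannot exceed 10x. */
    for (int p = 1; p <= 1024; p *= 4)
        printf("s = 0.10, P = %4d -> speedup = %.2f\n", p, speedup(0.10, p));
    return 0;
}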

32 Overhead of Parallelism n Given enough parallel work, this is the biggest barrier to getting desired speedup n Parallelism overheads include:

n cost of starting a thread or process

n cost of communicating shared data

n cost of synchronizing

n extra (redundant) computation n Each of these can be in the range of milliseconds (=millions of flops) on some systems n Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work

33 Locality and Parallelism

[Figure: conventional storage hierarchy – each processor with its own L1, L2, and L3 caches and memory; potential interconnects join the memories]

n Large memories are slow, fast memories are small

n Storage hierarchies are large and fast on average

n Parallel processors, collectively, have large, fast cache

n the slow accesses to “remote” data we call “communication”

n Algorithm should do most work on local data

34 Load Imbalance n Load imbalance is the time that some processors in the system are idle due to

n Insufficient parallelism (during that phase)

n Unequal size tasks n Examples of the latter

n Adapting to “interesting parts of a domain”

n Tree-structured computations

n Fundamentally unstructured problems n Algorithm needs to balance load

35 Computer Architecture Classifications

Processor Organizations

n Single Instruction Single Data Stream (SISD) à uniprocessor

n Single Instruction Multiple Data Stream (SIMD) à vector processor, array processor

n Multiple Instruction Single Data Stream (MISD)

n Multiple Instruction Multiple Data Stream (MIMD) à Centralized Shared Memory (UMA) architecture, or Distributed Memory architecture (Distributed Shared Memory / NUMA, or Message Passing)

36 Multiprocessors

n Why do we need multiprocessors?

n Uniprocessor speed keeps improving

n But there are things that need even more speed

n Wait for a few years for Moore’s law to catch up?

n Or use multiple processors and do it now?

n Multiprocessor software problem

n Most code is sequential (for uniprocessors)

n MUCH easier to write and debug

n Correct parallel code very, very difficult to write

n Efficient and correct is even harder

n Debugging even more difficult (Heisenbugs)

ø Heisenbug is a computer programming jargon term for a software bug that seems to disappear or alter its behavior when one attempts to study it. The term is a pun on the name of Werner Heisenberg, the physicist who first asserted the observer effect of quantum mechanics, which states that the act of observing a system inevitably alters its state.

37 MIMD Multiprocessors

Centralized Shared Memory Distributed Memory

38 Centralized-Memory Machines n Also “Symmetric Multiprocessors” (SMP) n “Uniform Memory Access” (UMA)

n All memory locations have similar latencies

n Data sharing through memory reads/writes

n P1 can write data to a physical address A, P2 can then read physical address A to get that data

n Problem: Memory Contention

n All processors share the one memory

n Memory bandwidth becomes bottleneck

n Used only for smaller machines

n Most often 2, 4, or 8 processors

39 Distributed-Memory Machines

n Two kinds

n Distributed Shared-Memory (DSM)

n All processors can address all memory locations

n Data sharing like in SMP

n Also called NUMA (non-uniform memory access)

n Latencies of different memory locations can differ (local access faster than remote access)

n Message-Passing

n A processor can directly address only local memory

n To communicate with other processors, must explicitly send/receive messages

n Also called multicomputers or clusters

n Most accesses local, so less memory contention (can scale to well over 1,000 processors)

40 Another Classification

n Two Models for Communication and Memory Architecture

1. Communication occurs by explicitly passing messages among the processors: Message-passing multiprocessors

2. Communication occurs through a shared address space (via loads and stores): Shared-memory multiprocessors, either

n UMA (Uniform Memory Access time) for shared-address, centralized-memory MP

n NUMA (Non-Uniform Memory Access time) for shared-address, distributed-memory MP

n In the past, there was confusion over whether “sharing” means sharing physical memory (Symmetric MP) or sharing the address space

41 Process Coordination: Shared Memory vs. Message Passing n Shared memory

n Efficient, familiar

n Not always available

n Potentially insecure

global int x

process foo          process bar
begin                begin
  ...                  ...
  x := ...             y := x
  ...                  ...
end foo              end bar

n Message passing

n Extensible to communication in distributed systems

Canonical syntax:

send (process : process_id, message : string)
receive (process : process_id, var message : string)

42 Message Passing Protocols

[Figure: two CPUs with private memories connected by send/recv queues]

n Explicitly send data from one thread to another

n Need to track IDs of other CPUs

n Broadcast may need multiple sends

n Each CPU has its own memory space

n Hardware: send/recv queues between CPUs

n Program components can be run on the same or different systems, so can use 1,000s of processors

n “Standard” libraries exist to encapsulate messages:

n Parasoft's Express (commercial)

n PVM (standing for Parallel Virtual Machine, non-commercial)

n MPI (Message Passing Interface, also non-commercial).
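For reference, a minimal MPI point-to-point example in C matching the canonical send/receive syntax shown earlier (rank numbers and the tag value are arbitrary):

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI point-to-point example: rank 0 sends an integer to rank 1.
 * Compile with mpicc and run with at least two processes (mpirun -np 2). */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}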

43 Message Passing Machines

n A cluster of computers

n Each with its own processor and memory

n An interconnect to pass messages between them

n Producer-Consumer Scenario:

n P1 produces data D, uses a SEND to send it to P2

n The network routes the message to P2

n P2 then calls a RECEIVE to get the message

n Two types of send primitives

n Synchronous: P1 stops until P2 confirms receipt of message

n Asynchronous: P1 sends its message and continues

n Standard libraries for message passing: Most common is MPI – Message Passing Interface

44 Communication Performance

n Metrics for Communication Performance

n Communication Bandwidth

n Communication Latency

n Sender overhead + transfer time + receiver overhead

n Communication latency hiding

n Characterizing Applications

n Communication to Computation Ratio

n Work done vs. bytes sent over network

n Example: 146 bytes per 1000 instructions

45 Parallel Performance

n Serial sections

n Very difficult to parallelize the entire application

n Amdahl’s law

Speedup_Overall = 1 / ((1 - F_Parallel) + F_Parallel / Speedup_Parallel)

n With Speedup_Parallel = 1024 and F_Parallel = 0.5: Speedup_Overall = 1.998

n With Speedup_Parallel = 1024 and F_Parallel = 0.99: Speedup_Overall = 91.2

n Large remote access latency (100s of ns)

n Overall IPC goes down

n This cost is reduced with CMP/multi-core

CPI = CPI_Base + RemoteRequestRate × RemoteRequestCost

CPI_Base = 0.4, RemoteRequestRate = 0.002
RemoteRequestCost = 400 ns / 0.33 ns/cycle ≈ 1200 cycles
CPI = 0.4 + 0.002 × 1200 ≈ 2.8

We need at least 7 processors just to break even!
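A quick C re-check of the arithmetic above (the clock period and remote request rate are the slide's numbers):

#include <stdio.h>

/* Reproduces the slide's arithmetic: effective CPI with remote accesses,
 * and the processor count needed to match one processor's throughput. */
int main(void) {
    double cpi_base  = 0.4;
    double remote_ns = 400.0;             /* remote access latency      */
    double cycle_ns  = 0.33;              /* ~3 GHz clock               */
    double rate      = 0.002;             /* remote requests per instr. */

    double remote_cost = remote_ns / cycle_ns;      /* ~1200 cycles */
    double cpi = cpi_base + rate * remote_cost;     /* ~2.8         */

    printf("remote cost = %.0f cycles, CPI = %.1f\n", remote_cost, cpi);
    printf("break-even processors = %.1f\n", cpi / cpi_base);  /* ~7 */
    return 0;
}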

46 Message Passing Pros and Cons

n Pros

n Simpler and cheaper hardware

n Explicit communication makes programmers aware of costly (communication) operations

n Cons

n Explicit communication is painful to program

n Requires manual optimization

n If you want a variable to be local and accessible via LD/ST, you must declare it as such

n If other processes need to read or write this variable, you must explicitly code the needed sends and receives to do this

47 Message Passing: A Program

n Calculating the sum of array elements

#define ASIZE   1024
#define NUMPROC 4

double myArray[ASIZE/NUMPROC];          // must manually split the array
double mySum = 0;
for (int i = 0; i < ASIZE/NUMPROC; i++)
    mySum += myArray[i];

if (myPID == 0) {                       // "master" processor adds up partial
    for (int p = 1; p < NUMPROC; p++) { //   sums and prints the result
        double pSum;
        recv(p, pSum);
        mySum += pSum;
    }
    printf("Sum: %lf\n", mySum);
} else {
    send(0, mySum);                     // "slave" processors send their
}                                       //   partial results to the master

48 Shared Memory Model

[Figure: CPU0 writes X and CPU1 reads X through a globally shared main memory]

n The processors are all connected to a "globally available" memory, via either a SW or HW means

n The operating system usually maintains its memory coherence n That’s basically it…

n Need to fork/join threads, synchronize (typically locks)
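A minimal pthreads sketch of that fork/join-plus-lock pattern (names and iteration counts are illustrative):

#include <pthread.h>
#include <stdio.h>

/* Shared data lives in one address space; a lock serializes updates. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* synchronize the shared write */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, increment, NULL); /* fork */
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);                     /* join */
    printf("counter = %ld\n", counter);  /* always 200000 with the lock held */
    return 0;
}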

49 Shared Memory Multiprocessors: Roughly Two Styles n UMA (Uniform Memory Access)

n The time to access main memory is the same for all processors since they are equally close to all memory locations

n Machines that use UMA are called Symmetric Multiprocessors (SMPs)

n In a typical SMP architecture, all memory accesses are posted to the same shared memory bus

n Contention - as more CPUs are added, competition for access to the bus leads to a decline in performance

n Thus, scalability is limited to about 32 processors

[Figure: conceptual UMA/SMP model – four processors, each with a cache, connected through an interconnection network to a single shared memory]

50 Shared Memory Multiprocessors: Roughly Two Styles

n NUMA (Non-Uniform Memory Access)

n Since memory is physically distributed, it is faster for a processor to access its own local memory than non-local memory (memory local to another processor or shared between processors)

n Unlike SMPs, all processors are not equally close to all memory locations

n A processor’s own internal computations can be done in its local memory leading to reduced memory contention

n Designed to surpass the scalability limits of SMPs

[Figure: NUMA – processors P1 … Pn, each with a cache, local memory, and directory, joined by an interconnect]

n The “interconnect” usually includes

§ a cache directory to reduce snoop traffic

§ a remote cache to reduce access latency (think of it as an L3)

n Cache-Coherent NUMA systems (CC-NUMA) vs. Non-Cache-Coherent NUMA (NCC-NUMA)

51 Modern Multiprocessor System: Mixed NUMA & UMA

[Figure: two multi-core CPU packages (“nodes”), each with processors, caches, a local interconnection network, and local memory, joined by a shared interconnect]

n In this complex hierarchical scheme, processors are grouped by their physical location on one or the other multi-core CPU package or “node”

n Processors within a node share access to memory modules as per the UMA shared memory architecture

n At the same time, they may also access memory from the remote node using a shared interconnect, but with slower performance as per the NUMA shared memory architecture

Source: intel http://software.intel.com/en-us/articles/optimizing-applications-for-numa

52 Shared Memory: A Program

n Calculating the sum of array elements

#define ASIZE   1024
#define NUMPROC 4

shared double array[ASIZE];             // array is shared
shared double allSum = 0;
shared mutex  sumLock;

double mySum = 0;                       // each processor sums up "its"
for (int i = myPID*ASIZE/NUMPROC; i < (myPID+1)*ASIZE/NUMPROC; i++)
    mySum += array[i];                  //   part of the array

lock(sumLock);                          // each processor adds its partial
allSum += mySum;                        //   sum to the final result
unlock(sumLock);

if (myPID == 0)                         // "master" processor prints the result
    printf("Sum: %lf\n", allSum);

53 Shared Memory Pros and Cons n Pros

n Communication happens automatically

n More natural way of programming

n Easier to write correct programs and gradually optimize them

n No need to manually distribute data (but can help if you do)

n Cons

n Needs more hardware support

n Easy to write correct, but inefficient programs (remote accesses look the same as local ones)

54 Communication/Connection Options for MPs n Multiprocessors come in two main configurations:

n a single bus connection, and a network connection n The choice of the communication model and the physical connection depends largely on the number of processors in the organization

n Notice that the scalability of NUMA makes it ideal for a network configuration

n UMA, however, is best suited to a bus connection

Category              Choice                      Typical number of processors
Communication model   Message passing             8 ~ thousands
                      Shared address – UMA        2 ~ 64
                      Shared address – NUMA       8 ~ 256
Physical connection   Network                     8 ~ thousands
                      Bus                         2 ~ 32

55 Focus on Shared Memory Multiprocessors…

n We are more interested in single chip multi-core processor architecture rather than MPP systems in Data Centers

n It implements a memory system with a single global physical address space (usually)

n Goal 1: Minimize memory latency

n Use co-location & caches

n Goal 2: Maximize memory bandwidth

n Use parallelism & caches

56 Focus on Shared Memory Multiprocessors: Let’s See

[Figure: ARM server SoC example – four quad-core Cortex-A57 clusters, each sharing an L2 cache among its 4 cores, connected through the CoreLink CCN-504 Cache Coherent Network with a snoop filter and an 8-16MB L3 cache shared between the 4 clusters; plus GIC-500, MMU-500, NIC-400 network interconnects, dual CoreLink DMC-520 DDR4-3200 memory controllers, and I/O (10-40 GbE, PCIe, DSP, DPI, crypto, USB, SATA, flash, GPIO)]

Source: ARM (2013)

57 Cache Coherence Problem n Cache coherent processors

n Reading processor must get the most current value

n Most current value is the last write

n Cache coherency problem

n Updates from one processor are not known to others

[Figure: P0 and P1 both load A; P0 stores A = 1; when P1 loads A again, its cached copy (A = 0) is stale while P0 holds A = 1]

n Mechanisms for maintaining cache coherency

n Coherency state associated with a block of data

n Bus/interconnect operations on shared data change the state

n For the processor that initiates an operation

n For other processors that have the data of the operation resident in their caches

58 Possible Causes of Incoherence n Sharing of writeable data

n Cause most commonly considered n Process migration

n Can occur even if independent jobs are executing n I/O

n Often fixed via OS cache flushes

59 Defining Coherent Memory System

n A memory system is coherent if

1. A read R from address X on processor P1 returns the value written by the most recent write W to X on P1, if no other processor has written to X between W and R.

2. If P1 writes to X and P2 reads X after a sufficient time, and there are no other writes to X in between, P2’s read returns the value written by P1’s write.

3. Writes to the same location are serialized: two writes to location X are seen in the same order by all processors.

60 Cache Coherence Definition

n Property 1. preserves program order

n It says that in the absence of sharing, each processor behaves as a uniprocessor would

n Property 2. says that any write to an address must eventually be seen by all processors

n If P1 writes to X and P2 keeps reading X, P2 must eventually see the new value

n Property 3. preserves causality

n Suppose X starts at 0. Processor P1 increments X and processor P2 waits until X is 1 and then increments it to 2. Processor P3 must eventually see that X becomes 2.

n If different processors could see writes in different orders, P2 could see P1’s write and do its own write, while P3 first sees the write by P2 and then the write by P1. Now we have two processors that will forever disagree about the value of X.
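The causality scenario above, written as three threads with C11 atomics (an illustrative sketch; on coherent hardware P3 must eventually observe X == 2):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* X starts at 0. P1 writes 1, P2 waits for 1 and then writes 2,
 * P3 waits for 2. Write serialization guarantees all processors
 * agree on the order of the two writes. */
static atomic_int x = 0;

static void *p1(void *a) { (void)a; atomic_store(&x, 1); return NULL; }
static void *p2(void *a) { (void)a; while (atomic_load(&x) != 1) ; atomic_store(&x, 2); return NULL; }
static void *p3(void *a) { (void)a; while (atomic_load(&x) != 2) ; puts("P3 sees X == 2"); return NULL; }

int main(void) {
    pthread_t t1, t2, t3;
    pthread_create(&t3, NULL, p3, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(t1, NULL); pthread_join(t2, NULL); pthread_join(t3, NULL);
    return 0;
}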

61 Maintaining Cache Coherence

n Snooping Solution (Snoopy Bus):

n Send all requests for data to all processors

n Processors snoop to see if they have a copy and respond accordingly

n Requires broadcast, since caching information is at processors

n Works well with bus (natural broadcast medium)

n Dominates for small scale machines (most of the market)

n Directory-Based Schemes

n Keep track of what is being shared in one centralized place

n Distributed memory à distributed directory (avoids bottlenecks)

n Send point-to-point requests to processors

n Scales better than Snoop

n Actually existed BEFORE Snoop-based schemes

62 Snooping vs. Directory-based (1/3)

n Snooping protocols tend to be faster, if enough bandwidth is available, since all transactions are a request/response seen by all processors

n The drawback is that snooping is not scalable

n Every request must be broadcast to all nodes in a system, meaning that as the system gets larger, the size of the (logical or physical) bus and the bandwidth it provides must grow

n In broadcast snoop systems the coherency traffic is proportional to: N×(N-1) where N is the number of coherent masters

n For each master the broadcast goes to all other masters except itself,

n So coherency traffic for 1 master is proportional to N-1

63 Snooping vs. Directory-based (2/3)

n Directories, on the other hand, tend to have longer latencies (with a 3 hop request/forward/respond) but use much less bandwidth since messages are point to point and not broadcast

n In the best case if all shared data is shared only by two masters and we count the directory lookup and the snoop as separate transactions then traffic scales at order 2N

n In the worst case where all traffic is shared by all masters, a directory doesn’t help and the traffic scales at order N×((N-1)+1) = N², where the ‘+1’ is the directory lookup

n In reality, data is probably rarely shared amongst more than 2 masters except in certain special-case scenarios

n For this reason, many of the larger systems (>64 processors) use this type of cache coherence

64 Snooping vs. Directory-based (3/3) n Actually, these two schemes are really two ends of a continuum of approaches n A snoop based system can be enhanced with snoop filters that can filter out unnecessary broadcast snoops by using partial directories

n Thus snoop filters enable larger scaling of snoop-based systems n A directory-based system is akin to a snoop-based system with perfect, fully populated snoop filters

65 Snooping

n Typically used for bus-based (SMP) multiprocessors

n Serialization on the bus used to maintain coherence property 3

n Two flavors

n Write-update (write broadcast)

n A write to shared data is broadcast to update all copies

n All subsequent reads will return the new written value (property 2)

n All see the writes in the order of broadcasts One bus == one order seen by all (property 3)

n Write-invalidate

n Write to shared data forces invalidation of all other cached copies

n Subsequent reads miss and fetch new value (property 2)

n Writes ordered by invalidations on the bus (property 3)

66 Update vs. Invalidate

n A burst of writes by a processor to one address

n Update: each sends an update

n Invalidate: possibly only the first invalidation is sent

n Writes to different words of a block

n Update: update sent for each word

n Invalidate: possibly only the first invalidation is sent

n Producer-consumer communication latency

n Update: producer sends an update, consumer reads new value from its cache

n Invalidate: producer invalidates consumer’s copy, consumer’s read misses and has to request the block

n Which is better depends on application

n But write-invalidate is simpler and implemented in most MP-capable processors today

67 Cache Coherency Protocols n Invalidation based protocol

n Simple 2-state write-through invalidate protocol

n 3-state (MSI) write-back invalidate protocol

n 4-state MESI write-back invalidate protocol

n 5-state MOESI write-back invalidate protocol

n And many variants

n Update based protocol

n Dragon

n …

68 2-State Invalidate Protocols

n Write-through caches, invalidation-based protocol

n The snooping cache monitors the bus for writes

n If it detects that another processor has written to a block it is caching, it invalidates its copy

n This requires each cache controller to perform a tag match operation

n Cache tags can be made dual-ported

[Figure: 2-state diagram. Valid: Load / --, Store / OwnGETX, and OtherGETS / -- stay Valid; OtherGETX / -- goes to Invalid. Invalid: Load / OwnGETS or Store / OwnGETX go to Valid; OtherGETS / -- and OtherGETX / -- stay Invalid.]

69 3-State Write-Back Invalidate Protocol

n 2-State Protocol

n + Simple hardware and protocol

n – Uses lots of bandwidth (every write goes on bus!)

n 3-State Protocol (MSI)

n Modified

n One cache exclusively has the valid (modified) copy – it is the owner

n Memory is stale

n Shared

n >= 1 cache and memory have valid copy (memory = owner)

n Invalid (only memory has valid copy and memory is owner)

n Must invalidate all other copies before entering modified state

n Requires bus transaction (order and invalidate)

70 MSI Processor and Bus Actions

n Processor Actions:

n Load: load data in the cache line

n Store: store data into the cache line

n Eviction: processor wants to replace cache block

n Bus Actions:

n GETS: request to get data in shared state

n GETX: request for data in modified state (i.e., eXclusive access)

n UPGRADE: request for exclusive access to data owned in shared state

n Cache Controller Actions:

n Source : this cache provides the data to the requesting cache (your copy is more recent than the copy in memory)

n Writeback: this cache updates the block in memory

71 MSI Snoopy Protocol

[Figure: MSI state diagram]

n Invalid à Shared on Load / GETS; Invalid à Modified on Store / GETX

n Shared à Modified on Store / UPGRADE; Shared à Invalid on Eviction or on an observed GETX or UPGRADE

n Modified à Shared on an observed GETS / SOURCE, WRITEBACK; Modified à Invalid on Eviction / WRITEBACK (or on an observed GETX / SOURCE)

All edges are labeled with the activity that causes the transition; any value after the / represents an action place on the bus. All edges not shown are self edges that perform no actions (or are actions that are not possible)
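A simplified C rendering of the processor-side MSI transitions in the diagram above (snoop-side transitions such as the GETS / SOURCE downgrade are omitted for brevity; the enum and function names are illustrative):

/* Per-line MSI transition on processor-side events. */
typedef enum { INVALID, SHARED, MODIFIED } msi_state;
typedef enum { LOAD, STORE, EVICT } cpu_event;
typedef enum { BUS_NONE, BUS_GETS, BUS_GETX, BUS_UPGRADE, BUS_WRITEBACK } bus_action;

msi_state msi_cpu_event(msi_state s, cpu_event e, bus_action *out) {
    *out = BUS_NONE;
    switch (s) {
    case INVALID:
        if (e == LOAD)  { *out = BUS_GETS;    return SHARED;   }
        if (e == STORE) { *out = BUS_GETX;    return MODIFIED; }
        return INVALID;
    case SHARED:
        if (e == STORE) { *out = BUS_UPGRADE; return MODIFIED; }
        if (e == EVICT) {                     return INVALID;  }
        return SHARED;                /* loads hit silently */
    case MODIFIED:
        if (e == EVICT) { *out = BUS_WRITEBACK; return INVALID; }
        return MODIFIED;              /* loads and stores hit silently */
    }
    return s;
}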

72 4-State MESI Invalidation Protocol n MSI + New state: “Exclusive”

n data that is clean and unique copy (matches memory) n Benefit: bandwidth reduction

MSI cache line states:  (I)nvalid | Valid – Clean: (S)hared; Dirty: (M)odified
MESI cache line states: (I)nvalid | Valid – Clean: (S)hared, (E)xclusive; Dirty: (M)odified

73 MESI Protocol n Let's consider what happens if we read a block and then subsequently wish to modify it

n This will require two bus transactions using the 3-state MSI protocol

n But if we know that we have the only copy of the block, the transaction required to transition from state S to M is really unnecessary

n We could safely, and silently, transition from S to M

n E ® M transition doesn’t require bus transaction

n Improvement over MSI depends on the number of E®M transitions

74 MOESI Protocol n MESI + New state: “Owned”

n data that is both modified and shared n Benefit: bandwidth reduction

MESI cache line states:  (I)nvalid | Valid – Clean: (S)hared, (E)xclusive; Dirty: (M)odified
MOESI cache line states: (I)nvalid | Valid – Clean: (S)hared, (E)xclusive; Dirty: (O)wned, (M)odified

75 MOESI Protocol n An important assumption:

n Cache-to-cache transfer is possible, so a cache with the data in the modified state can supply that data to another reader without transferring it to memory

n O(wned) state

n Other shared copies of this block exist, but memory is stale

n This cache (the owner) is responsible for supplying the data when it observes the relevant bus transaction

n This avoids the need to write modified data back to memory when another processor wants to read it

n Look at the M to S transition in the MSI protocol

76 Cache-to-Cache Transfers

n Problem

n P1 has block B in M state

n P2 wants to read B, puts a RdReq on bus

n If P1 does nothing, memory will supply the data to P2

n What does P1 do?

n Solution 1: abort/retry

n P1 cancels P2’s request, issues a write back

n P2 later retries RdReq and gets data from memory

n Too slow (two memory latencies to move data from P1 to P2)

n Solution 2: intervention

n P1 indicates it will supply the data (“intervention” bus signal)

n Memory sees that, does not supply the data, and waits for P1’s data

n P1 starts sending the data on the bus, memory is updated

n P2 snoops the transfer during the write-back and gets the block

77 Cache-to-Cache Transfers n Intervention works if some cache has data in M state

n Nobody else has the correct data, clear who supplies the data n What if a cache has requested data in S state

n There might be others who have it, who should supply the data?

n Solution 1: let memory supply the data

n Solution 2: whoever wins arbitration supplies the data

n Solution 3: A separate state similar to S that indicates there are maybe others who have the block in S state, but if anybody asks for the data we should supply it

78 Coherence in Distributed Memory Multiprocessors n Distributed memory systems are typically larger à bus- based snooping may not work well

n Option 1: software-based mechanisms – message-passing systems or software-controlled cache coherence

n Option 2: hardware-based mechanisms – directory-based cache coherence

79 Directory-Based Cache Coherence n Typically in distributed shared memory n For every local memory block, local directory has an entry n Directory entry indicates

n Who has cached copies of the block

n In what state do they have the block

[Figure: distributed shared memory – each node has a processor with caches, local memory, I/O, and a directory, all joined by an interconnection network]

80 Basic Directory Scheme

n K processors

n With each cache block in memory: K presence bits, 1 dirty bit

n With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit

[Figure: processors P1 … Pn with caches, an interconnect, and memory with a per-block directory entry (dirty bit + presence bits)]

n Read from main memory by processor i:

n If dirty-bit OFF then { read from main memory; turn p[i] ON; }

n If dirty-bit ON then { recall line from dirty proc (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to i; }

n Write to main memory by processor i:

n If dirty-bit OFF then { supply data to i; send invalidations to all caches that have the block; turn dirty-bit ON; turn p[i] ON; ... }
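A C sketch of the home node's read handling described above (only the read path; the recall transaction and the 64-byte block size are assumptions, and recall_from_owner is a hypothetical helper):

#include <stdbool.h>
#include <stdint.h>

#define NUM_PROCS 16

/* Directory entry for one memory block, as on the slide:
 * one presence bit per processor plus a dirty bit. */
typedef struct {
    bool presence[NUM_PROCS];
    bool dirty;
} dir_entry;

/* Home-node handling of a read request from processor i. */
void directory_read(dir_entry *d, int i, uint8_t *block, uint8_t *memory) {
    if (!d->dirty) {
        /* Memory is up to date: supply data, record the new sharer. */
        for (int b = 0; b < 64; b++) block[b] = memory[b];
        d->presence[i] = true;
    } else {
        /* Find the dirty owner, recall the line (owner downgrades to
         * shared), update memory, clear the dirty bit, then supply. */
        int owner = 0;
        while (!d->presence[owner]) owner++;
        /* recall_from_owner(owner, memory);  -- abstracted transaction */
        d->dirty = false;
        for (int b = 0; b < 64; b++) block[b] = memory[b];
        d->presence[i] = true;
    }
}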

81 Directory Protocol n Similar to Snoopy Protocol: Three states

n Shared: ≥ 1 processors have data, memory up-to-date

n Uncached (no processor has it; not valid in any cache)

n Exclusive: 1 processor (owner) has data; memory out-of-date

n Terms: typically 3 processors involved

n Local node where a request originates

n Home node where the memory location of an address resides

n Remote node has a copy of a cache block, whether exclusive or shared

82 (Execution) Latency vs. Bandwidth

n Desktop processing

n Typically want an application to execute as quickly as possible (minimize latency)

n Server/Enterprise processing

n Often throughput oriented (maximize bandwidth)

n Latency of individual task less important

n ex. Amazon processing thousands of requests per minute: it’s ok if an individual request takes a few seconds more so long as total number of requests are processed in time

83 Implementing MP Machines n One approach: add sockets to your MOBO

n minimal changes to existing CPUs

n power delivery, heat removal and I/O not too bad since each chip has own set of pins and cooling

[Figure: a motherboard with four CPU sockets (CPU0–CPU3)]

84 Chip-Multiprocessing n Simple SMP on the same chip

Intel “Smithfield” Block Diagram AMD Dual-Core Athlon FX

85 Shared Caches

n Resources can be shared between CPUs

n ex. IBM Power5

L2 cache shared between both CPUs (no need to keep two copies coherent)

L3 cache is also shared (only tags are on-chip; data are off-chip)

86 Benefits? n Cheaper than mobo-based SMP

n all/most interface logic integrated on to main chip (fewer total chips, single CPU socket, single interface to main memory)

n less power than mobo-based SMP as well (communication on- die is more power-efficient than chip-to-chip communication) n Performance

n on-chip communication is faster n Efficiency

n potentially better use of hardware resources than trying to make wider/more OOO single-threaded CPU

87 Performance vs. Power n 2x CPUs not necessarily equal to 2x performance n 2x CPUs à ½ power for each

n maybe a little better than ½ if resources can be shared n Back-of-the-Envelope calculation:

n 3.8 GHz CPU at 100W

n Dual-core: 50W per CPU

n P ∝ V³: V_orig³ / V_CMP³ = 100W / 50W à V_CMP ≈ 0.8 V_orig

n f ∝ V: f_CMP ≈ 0.8 × 3.8 GHz ≈ 3.0 GHz

Benefit of SMP: Full power budget per socket!

88 Summary

n Cache Coherence

n Coordinate accesses to shared, writeable data

n Coherence protocol defines cache line states, state transitions, actions

n Snooping and Directory Protocols similar; bus makes snooping easier because of broadcast (snooping à uniform memory access)

n Directory has extra data structure to keep track of state of all cache blocks

n Synchronization

n Locks and ISA support for atomicity

n Memory Consistency

n Defines programmers’ expected view of memory

n Sequential consistency imposes ordering on loads/stores

89 Multi-core

90 Multi-core Architectures

n SMPs on a single chip

n Chip Multi-Processors (CMP)

n Pros

n Efficient exploitation of available transistor budget

n Improves throughput and speed of parallelized applications

n Allows tight coupling of cores

n better communication between cores than in SMP

n shared caches

n Low power consumption

n low clock rates

n idle cores can be suspended

n Cons

n Only improves speed of parallelized applications

n Increased gap to memory speed

91 Multi-core Architectures n Design decisions

n Homogeneous vs. Heterogeneous

n Specialized accelerator cores

n SIMD

n GPU operations

n cryptography

n DSP functions (e.g. FFT)

n FPGA (programmable circuits)

n Access to memory

n own memory area (distributed memory)

n via cache hierarchy (shared memory)

n Connection of cores

n internal bus / cross bar connection

n Cache architecture

92 Multi-core Architectures: Examples

[Figure: two example multi-core organizations – left: a homogeneous design with cores (2x SMT), private L1/L2 caches, a shared L3, memory modules, I/O, and a crossbar; right: a heterogeneous design with cores, local stores, and a ring bus to memory and I/O]

Homogeneous with shared caches and crossbar (left); heterogeneous with caches, local stores, and ring bus (right)

93 Shared Cache Design

[Figure: traditional design – multiple single-core chips, each with a private L1 and a switch, sharing an off-chip L2 in front of main memory; multi-core architecture – cores with private L1s sharing an on-chip L2 in front of main memory]

94 Is a Multi-core really better off?

DEEP BLUE

480 chess chips Can evaluate 200,000,000 moves per second!!

95 IBM Watson Jeopardy! Competition (2011.2)

n POWER7 chips (2,880 cores) + 16TB memory

n Massively parallel processing

n Combine: Processing power, Natural language processing, AI, Search, Knowledge extraction

96 Major Challenges for Multi-core Designs n Communication

n Memory hierarchy

n Data allocation (you have a large shared L2/L3 now)

n Interconnection network

n AMD HyperTransport

n Intel QPI

n Scalability

n Bus Bandwidth, how to get there? n Power-Performance — Win or lose?

n Borkar’s multi-core arguments

n 15% per core performance drop à 50% power saving

n Giant, single core wastes power when task is small

n How about leakage? n Process variation and yield n Programming Model

97 Intel Core 2 Duo

n Homogeneous cores

n Bus-based on-chip interconnect

n Shared on-die cache memory

n Traditional I/O

Classic OOO: Reservation Stations, Issue ports, Schedulers…etc Source: Intel Corp.

Large, shared set associative, prefetch, etc.

98 Core 2 Duo

99 Why Sharing on-die L2?

n What happens when L2 is too large?

100 CoreTM μArch — Wide Dynamic Execution

101 CoreTM μArch — Wide Dynamic Execution

102 CoreTM μArch — MACRO Fusion

n Common “Intel 32” instruction pairs are combined

n 4-1-1-1 decoder that sustains 7 μop’s per cycle

n 4+1 = 5 “Intel 32” instructions per cycle

103 Micro(-ops) Fusion (from Pentium M)

n A misnomer..

n Instead of breaking up an Intel32 instruction into μop, they decide not to break it up…

n A better naming scheme would call the previous techniques — “IA32 fission”

n To fuse

n Store address and store data μops

n Load-and-op μops (e.g. ADD (%esp), %eax)

n Extend each RS entry to take 3 operands

n To reduce

n micro-ops (10% reduction in the OOO logic)

n Decoder bandwidth (simple decoder can decode fusion type instruction)

n Energy consumption

n Performance improved by 5% for INT and 9% for FP ( data)

104 Smart Memory Access

105 AMD Quad-Core Processor (Barcelona)

On different power plane from the cores

ø Source: AMD

n True 128-bit SSE (as opposed to 64-bit in prior Opterons)

n Sideband stack optimizer

n Parallelizes many POPs and PUSHes (which were dependent on each other)

n Converts them into pure load/store instructions

n No uops in FUs for stack-pointer adjustment

106 Barcelona’s Cache Architecture

ø Source: AMD

107 Intel Dual-Core (First 45nm microprocessor)

• High-k dielectric metal gate • Up to 12MB L2 • 47 new SSE4 instructions • > 3GHz

ø Source: Intel

108 Intel Arrandale Processor

• 32nm • Unified 3MB L3 • Power sharing (Turbo Boost) between cores and GFX via DFS

109 AMD 12-Core “Magny-Cours” Opteron

n 45nm

n 4 memory channels

111 IBM Power8 Technology

§ 22nm SOI, eDRAM, 15ML, 650mm2

Cores
§ 12 cores (SMT8)
§ 8 dispatch, 10 issue, 16 exec pipes
§ 2X internal data flows/queues
§ Enhanced prefetching
§ 64K data cache, 32K instruction cache

Caches
§ 512KB SRAM L2 / core
§ 96MB eDRAM shared L3
§ Up to 128MB eDRAM L4 (off-chip)

Memory
§ Up to 230 GB/s sustained bandwidth
§ Durable open memory attach interface

Bus Interfaces
§ Integrated PCIe Gen3
§ SMP Interconnect
§ CAPI (Coherent Accelerator Processor Interface)

Accelerators
§ Crypto & memory expansion
§ Transactional Memory
§ VMM assist
§ Data Move / VM Mobility

Energy Management
§ On-chip Power Management Micro-controller
§ Integrated Per-core VRM
§ Critical Path Monitors

112 IBM Power8 Core

Execution Improvement vs. POWER7
§ SMT4 à SMT8
§ 8 dispatch
§ 10 issue
§ 16 execution pipes: 2 FXU, 2 LSU, 2 LU, 4 FPU, 2 VMX, 1 Crypto, 1 DFU, 1 CR, 1 BR
§ Larger issue queues (4 x 16-entry)
§ Larger global completion, Load/Store reorder
§ Improved branch prediction
§ Improved unaligned storage access

Larger Caching Structures vs. POWER7
§ 2x L1 data cache (64 KB)
§ 2x outstanding data cache misses
§ 4x translation cache

Wider Load/Store
§ 32B à 64B L2 to L1 data bus
§ 2x data cache to execution dataflow bus

Enhanced Prefetch
§ Instruction speculation awareness
§ Data prefetch depth awareness
§ Adaptive bandwidth awareness
§ Topology awareness

Core Performance vs. POWER7
§ ~1.6x Single Thread
§ ~2x Max SMT

113 POWER8 On Chip Caches

n L2: 512 KB 8 way per core

n L3: 96 MB (12 x 8 MB 8 way Bank)

n “NUCA” Cache policy (Non-Uniform Cache Architecture)

n Scalable bandwidth and latency

n Migrate “hot” lines to local L2, then local L3 (replicate L2 contained footprint)

n Chip interconnect: 150 GB/sec per direction per segment

114 Cache Bandwidths

n GB/sec shown assuming 4 GHz

n Product frequency will vary based on model type

n Across 12 core chip

n 4 TB/sec L2 BW

n 3 TB/sec L3 BW

115 POWER8 Memory Organization

n Up to 8 high speed channels, each running up to 9.6 GB/s, for up to 230 GB/s sustained

n Up to 32 total DDR ports yielding 410 GB/s peak at the DRAM

n Up to 1 TB memory capacity per fully configured processor socket (at initial launch)

116 POWER8 Memory Buffer Chip

n Intelligence Moved into Memory

n Scheduling logic, caching structures

n Energy Mgmt., RAS decision point

n Formerly on processor

n Moved to Memory Buffer n Processor Interface

n 9.6 GB/s high speed interface

n More robust RAS

n “On-the-fly” lane isolation/repair

n Extensible for innovation build-out n Performance Value

n End-to-end fastpath and data retry (latency)

n Cache à latency/bandwidth, partial updates

n Cache à write scheduling, prefetch, energy

n 22nm SOI for optimal performance/energy

n 15 metal levels (latency, bandwidth)

117 POWER8 Integrated PCIe Gen 3

n Native PCI Gen 3 Support

n Direct processor integration

n Replaces proprietary GX/Bridge

n Low latency

n High Gen 3 bandwidth (8 Gb/s) à High utilization realizable

n Transport Layer for CAPI Protocol

n Coherently Attach Devices connected via PCI

n Protocol encapsulated in PCI

118 POWER8 CAPI (Coherence Attach Processor Interface)

n Virtual Addressing

n Accelerator can work with same memory addresses that the processor use

n Pointers de-referenced same as the host application

n Removes OS & device driver overhead n Hardware Managed Cache Coherence

n Enables the accelerator to participate in “locks” as a normal thread

n Lowers latency over the IO communication model

n PCI Gen 3 transport for encapsulated messages

n Processor Service Layer (PSL)

n Presents robust, durable interfaces to applications

n Offloads complexity / content from CAPP

n Customizable Hardware Application Accelerator

n Specific system SW, middleware, or user application

n Written to the durable interface provided by the PSL

119 POWER8 n Significant performance at thread, core and system n Optimization for VM density & efficiency n Strong enablement of autonomic system optimization n Excellent Big Data analytics capability

120 Summary

n High frequency -> high power consumption

n Trend towards multiple cores on chip

n Broad spectrum of designs: homogeneous, heterogeneous, specialized, general purpose, number of cores, cache architectures, local memories, simultaneous multithreading, …

n Problem: memory latency and bandwidth

121 ARM MPCore Intra-Cluster Coherency Technology & ACE

122 ARM MPCore Intra-Cluster Coherency Technology

n ARM introduced MPCore™ multi-core coherency technology in the ARM11 MPCore and subsequently in the Cortex-A5 and Cortex-A9 MPCore, which enables cache coherency within a cluster of 2 to 4 processors

[Figure: ARM MPCore cluster – an interrupt distributor with configurable HW interrupt lines and private FIQ lines feeding per-CPU timer/watchdog and IRQ interfaces; 1 to 4 symmetric CPU/VFP cores with L1 memory; and a Snoop Control Unit (SCU) with a 64-bit I&D bus and a coherency control bus]

123 ARM MPCore Intra-Cluster Coherency Technology

n In the Cortex-A15 MPCore, it is extended with AMBA 4 ACE coherency capability and thus supports

n Multiple CPU clusters enabling systems containing more than 4 cores

n Heterogeneous systems consisting of multiple CPUs and cached accelerators

n An improved MESI protocol:

n Enables direct cache-to-cache copy of clean data and direct cache-to-cache move of dirty data within the cluster, without write back to memory as in normal MESI-based processor

n Further enhanced by the ‘Snoop Control Unit’ (SCU), which maintains a copy of all L1 data cache tag RAMs acting as a local, low-latency directory, enabling it to direct transfers only to the L1 caches as needed

n This increases performance, since unnecessary snoop traffic to the L1 caches would otherwise increase the effective L1 access latency by stealing processor access cycles to the L1 cache

124 ARM MPCore Intra-Cluster Coherency Technology n Also supports an optional Accelerator Coherency Port (ACP), which enables un-cached accelerators access to the processor cache hierarchy, enabling ‘one-way’ coherency where the accelerator can read and write data within the CPU caches without a write-back to RAM n But ACP cannot support cached accelerators since the CPU has no way to snoop accelerator caches, and the accelerator caches may contain stale data if the CPU writes accelerator-cached data n Effectively the ACP acts like an additional master port into the SCU and the ACP interface consists of a regular AXI3 slave interface

125 Different meanings of “protocol”

n Cache coherent protocols

n System communication policies

n ACE protocol

n Interface communication protocol

n Interconnect responsibilities

n ACE protocol does not guarantee coherency => ACE is a support for coherency

126 Different Kinds of Components

n Interconnect: called CCI (Cache Coherent Interconnect)

n ACE Masters: masters with caches

n ACE-Lite Masters: components without caches snooping other caches

n ACE-Lite/AXI Slaves: components not initiating snoop transactions

[Figure: the same ARM SoC example as before – quad Cortex-A57 clusters with L2 caches on the CoreLink CCN-504 Cache Coherent Network with snoop filter and 8-16MB L3, NIC-400 network interconnects, MMU-500, GIC-500, DMC-520 DDR4-3200 memory controllers, and I/O]

127 ACE Cache Coherency States

n ACE states of a cache line: 5-state cache model

n Each cache line is either Valid or Invalid

n The ACE states can be mapped directly onto the MOESI cache coherency model states, however ACE is designed to support components that use a variety of internal cache state models, including MESI, MOESI, MEI and others

ARM ACE state       MOESI           Meaning
UniqueDirty (UD)    M (Modified)    Not shared, dirty, must be written back
SharedDirty (SD)    O (Owned)       Shared, dirty, must be written back to memory
UniqueClean (UC)    E (Exclusive)   Not shared, clean
SharedClean (SC)    S (Shared)      Shared, no need to write back, may be clean or dirty
Invalid (I)         I (Invalid)     Invalid

128 ACE Cache Coherency States

n ACE does not prescribe the cache states a component can use à Some components may not support all ACE transactions

n The ARM Cortex-A15 MPCore internally uses MESI states for the L1 data cache, meaning the cache cannot be in the SharedDirty (Owned) state

n To emphasize that ACE is not restricted to the MOESI cache state model, ACE does not use the familiar MOESI terminology

ARM ACE state       MOESI           Meaning
UniqueDirty (UD)    M (Modified)    Not shared, dirty, must be written back
SharedDirty (SD)    O (Owned)       Shared, dirty, must be written back to memory
UniqueClean (UC)    E (Exclusive)   Not shared, clean
SharedClean (SC)    S (Shared)      Shared, no need to write back, may be clean or dirty
Invalid (I)         I (Invalid)     Invalid

129 ACE Design Principle n Lines held in more than one cache must be held in the Shared state n Only one copy can be in the SharedDirty state, and that is the one that is responsible for updating memory n Devices are not required to support all 5 states in the protocol internally à Flexible n System interconnect is responsible for coordinating the progress of all shared (coherent) transactions and can handle these in various manners, e.g.

n The interconnect may present snoop addresses to all masters in parallel simultaneously, or it may present snoop addresses one at a time serially

130 ACE Design Principle n System interconnect may choose either

n to perform speculative reads to lower latency,

n or to wait until snoop responses have been received to reduce system power consumption by minimizing external memory reads n The interconnect may include a directory or snoop filter, or it may broadcast snoops to all masters n ACE has been designed to enable performance and power optimizations by avoiding wherever possible unnecessary external memory accesses

n ACE facilitates direct master-to-master data transfer wherever possible

131 ACE Additional Signals and Channels

n AMBA 4 ACE is backwards-compatible with AMBA 4 AXI, adding additional signals and channels to the AMBA 4 AXI interface

n The AXI interface consists of 5 channels: read address (ARADDR), read data (RDATA), write address (AWADDR), write data (WDATA), and write response (BRESP)

n In AXI, the read and write channels each have their own dedicated address and control channel

n The BRESP channel is used to indicate the completion of write transactions

132 ACE Additional Signals and Channels

AXI4 Channel    Signal           Source   Description
Read address    ARDOMAIN[1:0]    Master   Indicates the shareability domain of a read transaction
Read address    ARSNOOP[3:0]     Master   Indicates the transaction type for Shareable read transactions
Read address    ARBAR[1:0]       Master   Indicates a read barrier transaction
Write address   AWDOMAIN[1:0]    Master   Indicates the shareability domain of a write transaction
Write address   AWSNOOP[2:0]     Master   Indicates the transaction type for Shareable write transactions
Write address   AWBAR[1:0]       Master   Indicates a write barrier transaction
Write address   AWUNIQUE         Master   Indicates that a line is permitted to be held in a Unique state
Read data       RRESP[3:2]       Slave    Read response: the additional bits provide the information required to complete a Shareable read transaction

133 ACE Additional Signals and Channels

n Three new channels are added: the snoop address channel, the snoop data channel, and the snoop response channel

n The snoop address (AC) channel is an input to a cached master that provides the address and associated control information for snoop transactions

n The snoop response (CR) channel is an output channel from a cached master that provides a response to a snoop transaction

n Every snoop transaction has a single response associated with it

n The snoop response indicates if an associated data transfer on the CD channel is expected

n The snoop data (CD) channel is an optional output channel that passes snoop data out from a master

n Typically, this occurs for a read or clean snoop transaction when the master being snooped has a copy of the data available to return

134 ACE Additional Signals and Channels

n ACE-specific signals:

n Snoop address (AC) channel: ACVALID (Slave): snoop address and control information is valid; ACREADY (Master): snoop address ready; ACADDR[ac-1:0] (Slave): snoop address; ACSNOOP[3:0] (Slave): snoop transaction type; ACPROT[2:0] (Slave): snoop protection type
n Snoop response (CR) channel: CRVALID (Master): snoop response valid; CRREADY (Slave): snoop response ready; CRRESP[4:0] (Master): snoop response
n Snoop data (CD) channel: CDVALID (Master): snoop data valid; CDREADY (Slave): snoop data ready; CDDATA[cd-1:0] (Master): snoop data; CDLAST (Master): indicates the last data transfer of a snoop transaction
n Acknowledge signals: RACK (Master): read acknowledge; WACK (Master): write acknowledge
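Purely as an illustration (this is not RTL or a driver API from the ACE specification), the three snoop channels can be modeled as C structures whose fields mirror the signal names above; widths such as ACADDR and CDDATA are left as plain integers because the spec leaves them implementation-defined.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical models of the three ACE snoop channels. */
struct snoop_addr_ch {          /* AC: interconnect -> cached master */
    bool     acvalid, acready;  /* valid/ready handshake */
    uint64_t acaddr;            /* snoop address (width is implementation-defined) */
    uint8_t  acsnoop;           /* snoop transaction type, ACSNOOP[3:0] */
    uint8_t  acprot;            /* protection type, ACPROT[2:0] */
};

struct snoop_resp_ch {          /* CR: cached master -> interconnect */
    bool    crvalid, crready;
    uint8_t crresp;             /* snoop response, CRRESP[4:0]; it indicates whether data follows on CD */
};

struct snoop_data_ch {          /* CD: optional data channel from the snooped master */
    bool     cdvalid, cdready;
    uint64_t cddata;            /* one snoop data beat (width is implementation-defined) */
    bool     cdlast;            /* last beat of the snoop data transfer */
};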

135 ACE Transactions

n Transaction groups (the ACE-Lite transaction subset covers the Non-shared, Non-cached, and Cache Maintenance groups):

n Non-shared: Read, Write
n Read Shareable: ReadClean, ReadNotSharedDirty, ReadShared
n Write Shareable: ReadUnique, CleanUnique, MakeUnique
n Shareable Cache Maintenance: CleanShared, CleanInvalid, MakeInvalid
n Non-cached: ReadOnce, WriteUnique, WriteLineUnique
n Update memory: WriteBack, WriteClean
n Snoop Filter: Evict

136 Summary: ACE

n ACE states of a cache line: 5-state cache model

n ACE_UD, ACE_SD, ACE_UC, ACE_SC, ACE_I (Unique/Shared crossed with Dirty/Clean, plus Invalid)

n ACE channels

n Read channels (AR, R)

n Write channels (AW, W, B)

n Snoop Channels ( AC, CR, CD )

n ACE supported policies

n 100% snoop

n Directory based

n Anything in between (e.g., a snoop filter)

137 Appendix

138 Non-Uniform Cache Architecture n Proposed by UT-Austin at ASPLOS 2002 n Facts

n Large shared on-die L2

n Wire-delay dominating on-die cache

n Cache size and latency scaling: 1MB @ 180nm (1999): 3 cycles; 4MB @ 90nm (2004): 11 cycles; 16MB @ 50nm (2010): 24 cycles

139 Multi-banked L2 cache

n 2MB L2 @ 130nm built from 128KB banks: 11-cycle access (bank access time = 3 cycles, interconnect delay = 8 cycles)

140 Multi-banked L2 cache

n 16MB L2 @ 50nm built from 64KB banks: 47-cycle access (bank access time = 3 cycles, interconnect delay = 44 cycles)
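As a quick back-of-the-envelope check of the two data points above, the total hit latency is just the fixed bank access time plus the interconnect delay to reach the bank; the C sketch below only reproduces the slide's aggregate numbers and is not a model of the real on-die network.

#include <stdio.h>

/* L2 hit latency = bank access time + interconnect delay (slide's aggregate figures). */
static int l2_hit_latency(int bank_access_cycles, int interconnect_cycles)
{
    return bank_access_cycles + interconnect_cycles;
}

int main(void)
{
    printf("2MB @ 130nm, 128KB banks: %d cycles\n", l2_hit_latency(3, 8));   /* 11 */
    printf("16MB @ 50nm, 64KB banks: %d cycles\n",  l2_hit_latency(3, 44));  /* 47 */
    return 0;
}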

141 Static NUCA-1

(Figure: S-NUCA-1 bank organization: sub-banks with predecoder, tag array, wordline drivers and decoders, and sense amplifiers; each bank connects to dedicated address and data buses)

n Use private per-bank channels

n Each bank has its distinct access latency

n Statically decide data location for its given address

n Average access latency = 34.2 cycles

n Wire overhead = 20.9% → an issue

142 Static NUCA-2

(Figure: S-NUCA-2 bank organization: tag array, predecoder, wordline drivers and decoders; banks connect through switches to the data bus)

n Use a 2D switched network to alleviate wire area overhead

n Average access latency = 24.2 cycles

n Wire overhead = 5.9%

143 Multiprocessors

n Shared-memory Multiprocessors

n Provide a shared-memory abstraction

n Enables familiar and efficient programmer interface

(Figure: processors P1-P4, each with a cache, local memory M1, and a network interface, connected by an interconnection network)

144 Processors and Memory – UMA

n Uniform Memory Access (UMA)

n Access all memory locations with same latency

n Pros: Simplifies software. Data placement does not matter

n Cons: Lowers peak performance. Latency defined by worst case

n Implementation: Bus-based UMA for symmetric multiprocessor (SMP)

(Figure: four CPU($) nodes sharing a bus to the memory modules (Mem))

145 Processors and Memory – NUMA

n Non-Uniform Memory Access (NUMA)

n Access local memory locations faster

n Pros: Increases peak performance.

n Cons: Increases software complexity; data placement matters

n Implementation: Network-based NUMA with various network topologies, which require routers (R).

(Figure: four CPU($) nodes, each with local memory (Mem) and a router (R), connected point-to-point)

146 Networks and Topologies

n Shared Networks

n Every CPU can communicate with every other CPU via a bus or crossbar

n Pros: lower latency

n Cons: lower bandwidth and more difficult to scale with processor count (e.g., 16)

n Point-to-Point Networks

n Every CPU can talk to specific neighbors (depending on topology)

n Pros: higher bandwidth and easier to scale with processor count (e.g., 100s)

n Cons: higher multi-hop latencies

(Figures: a shared network of CPU($)/Mem/router (R) nodes on one interconnect vs. a point-to-point network of CPU($)/Mem/router (R) nodes)

147 Topology 1 – Bus

n Network Topology

n Defines organization of network nodes

n Topologies differ in connectivity, latency, bandwidth, and cost.

n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc…

n Bus

n Direct interconnect style

n Latency: f(1) wire delay

n Bandwidth: f(1/p) and not scalable (p<=4)

n Cost: f(1) wire cost

n Supports ordered broadcast only

148 Topology 2 – Crossbar Switch

n Network Topology

n Defines organization of network nodes

n Topologies differ in connectivity, latency, bandwidth, and cost.

n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc…

n Crossbar Switch

n Indirect interconnect.

n Switches implemented as big multiplexors

n Latency: f(1) constant latency

n Bandwidth: f(1)

n Cost: f(2P) wires, f(P^2) switches

149 Topology 3 – Multistage Network n Network Topology

n Defines organization of network nodes

n Topologies differ in connectivity, latency, bandwidth, and cost.

n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc…

n Multistage Network

n Indirect interconnect.

n Routing done by address decoding

n k: switch arity (#inputs or #outputs)

n d: number of network stages = log_k(P)

n Latency: f(d)

n Bandwidth: f(1)

n Cost: f(d×P/k) switches, f(P×d) wires

n Commonly used in large UMA systems
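As a small illustration of these formulas (a sketch only; "cost" here just counts switches and wires the way the slide does, and the 64-processor, 4-ary example is hypothetical):

#include <stdio.h>

/* Number of stages d: smallest d with k^d >= P, i.e. d = ceil(log_k(P)). */
static int stages(int P, int k)
{
    int d = 0;
    for (int n = 1; n < P; n *= k)
        d++;
    return d;
}

int main(void)
{
    int P = 64, k = 4;                 /* example: 64 processors, 4x4 switches */
    int d = stages(P, k);              /* 3 stages */
    printf("d = %d stages, ~%d switches, ~%d wires\n",
           d, d * P / k, P * d);       /* cost ~ f(d*P/k) switches, f(P*d) wires */
    return 0;
}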

150 Topology 4 – 2D Torus

n Network Topology

n Defines organization of network nodes

n Topologies differ in connectivity, latency, bandwidth, and cost.

n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc…

n 2D Torus

n Direct interconnect

n Latency: f(P^(1/2))

n Bandwidth: f(1)

n Cost: f(2P) wires

n Scalable and widely used.

n Variants: 1D torus, 2D mesh, 3D torus

151 Challenges in Shared Memory n Cache Coherence

n “Common Sense”

P1-Read[X] → P1-Write[X] → P1-Read[X]: the read returns the value P1 wrote to X
P1-Write[X] → P2-Read[X]: the read returns the value written by P1
P1-Write[X] → P2-Write[X]: writes are serialized; all P's see writes in the same order

n Synchronization

n Atomic read/write operations

n Memory Consistency

n What behavior should programmers expect from shared memory?

n Provide a formal definition of memory behavior to programmer

n Example: When will a written value be seen?

n Example: P1-Write[X] <<10ps>> P2-Read[X]. What happens?

152 Example Execution

Both processors (two ATMs) run the same withdrawal code:

0: addi r1, accts, r3    # get address for the account
1: ld 0(r3), r4          # load balance into r4
2: blt r4, r2, 6         # check for sufficient funds
3: sub r4, r2, r4        # withdraw
4: st r4, 0(r3)          # store new balance
5: call give-cash

n Two withdrawals from one account. Two ATMs

n Withdraw value: r2 (e.g., $100)

n Account memory address: accts+r1

n Account balance: r4
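The same race can be reproduced in a few lines of C with POSIX threads; this is a hypothetical illustration of the slide's scenario, not code from the lecture. Without any synchronization, both withdrawals may read the same balance and one update is lost.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int balance = 500;                 /* shared account balance */

static void *withdraw(void *arg)
{
    int amount = *(int *)arg;
    int r4 = balance;                     /* ld: load balance */
    if (r4 >= amount) {                   /* blt: check for sufficient funds */
        usleep(1000);                     /* widen the race window for demonstration */
        balance = r4 - amount;            /* st: store new balance (not atomic!) */
    }
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    int amount = 100;
    pthread_create(&t0, NULL, withdraw, &amount);
    pthread_create(&t1, NULL, withdraw, &amount);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("final balance = %d (expected 300, often 400)\n", balance);
    return 0;
}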

153 Scenario 1 – No Caches

Trace of Mem[accts+r1]: initially 500; P0 loads 500 and stores 400; P1 then loads 400 and stores 300

n Processors have no caches

n Withdrawals update balance without a problem

154 Scenario 2a – Cache Incoherence

Trace (P0 cache / P1 cache / Mem), write-back caches:
P0 loads X:              P0 V:500,             Mem 500
P0 stores 400:           P0 D:400,             Mem 500
P1 loads X from memory:  P0 D:400, P1 V:500,   Mem 500
P1 stores 400:           P0 D:400, P1 D:400,   Mem 500

n Processors have write-back caches

n Processor 0 updates balance in cache, but does not write-back to memory

n Multiple copies of memory location [accts+r1]

n Copies may get inconsistent

155 Scenario 2b – Cache Incoherence

Trace (P0 cache / P1 cache / Mem), write-through caches:
P0 loads X:                        P0 500,            Mem 500
P0 stores 400 (written through):   P0 400,            Mem 400
P1 loads X:                        P0 400, P1 400,    Mem 400
P1 stores 300 (written through):   P0 400 (stale), P1 300, Mem 300

n Processors have write-through caches

n What happens if processor 0 performs another withdrawal?

156 Hardware Coherence Protocols

n Absolute Coherence

n All cached copies have the same data at the same time. Slow and hard to implement

n Relative Coherence

n Temporary incoherence is ok (e.g., write-back caches) as long as no load reads incoherent data.

(Figure: a coherence controller (CC) sits between the CPU's D$ tags/data and the bus)

n Coherence Protocol

n Finite state machine that runs for every cache line

n Define states per cache line

n Define state transitions based on bus activity

n Requires a coherence controller to examine bus traffic (address, data)

n Invalidates or updates cache lines as required

157 Protocol 1 – Write Invalidate

n Mechanics

n Process P performs write, broadcasts address on bus

n All other processors (!P) snoop the bus; if the address is locally cached, !P invalidates the local copy

n Process P performs read, broadcasts address on bus

n !P snoop the bus. If address is locally cached, !P writes back local copy

n Example

Step (Cache-A / Cache-B / Mem[X], with Mem[X] initially 0):
CPU-A reads X:        cache miss for X:    A = 0,            Mem = 0
CPU-B reads X:        cache miss for X:    A = 0, B = 0,     Mem = 0
CPU-A writes 1 to X:  invalidation for X:  A = 1, B invalid, Mem = 0
CPU-B reads X:        cache miss for X:    A = 1, B = 1,     Mem = 1

158 Cache Coherent Systems

n Provide Coherence Protocol

n States

n State transition diagram

n Actions

n Implement Coherence Protocol

n (0) Determine when to invoke coherence protocol

n (1) Find state of cache line to determine action

n (2) Locate other cached copies

n (3) Communicate with other cached copies (invalidate, update)

n Implementation Variants

n (0) is done in the same way for all systems. Maintain additional state per cache line. Invoke protocol based on state

n (1-3) have different approaches

159 Implementation 1 – Snooping n Bus-based Snooping

n All cache/coherence controllers observe/react to all bus events.

n Protocol relies on globally visible events

n i.e., all processors see all events

n Protocol relies on globally ordered events

n i.e., all processors see all events in the same sequence

n Bus Events

n Processor (events initiated by own processor P)

n read (R), write (W), write-back (WB)

n Bus (events initiated by other processors !P)

n bus read (BR), bus write (BW)

160 Three-State Invalidate Protocol

n Implement protocol for every cache line.

n Add state bits to every cache line to indicate (1) invalid, (2) shared, (3) exclusive; a minimal sketch of the resulting state machine follows
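The C sketch below is one possible encoding of such a three-state invalidate protocol (illustrative only, not the lecture's exact transition diagram). Events follow the bus-event naming from the snooping slide: local processor read/write (R, W) and bus read/write (BR, BW) observed from other processors.

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } line_state_t;
typedef enum { PROC_READ, PROC_WRITE, BUS_READ, BUS_WRITE } event_t;

/* Per-cache-line transition function of a three-state invalidate protocol. */
static line_state_t next_state(line_state_t s, event_t e)
{
    switch (s) {
    case INVALID:
        if (e == PROC_READ)  return SHARED;     /* fetch a shared copy (bus read) */
        if (e == PROC_WRITE) return EXCLUSIVE;  /* fetch and invalidate others (bus write) */
        return INVALID;                         /* bus events: nothing cached here */
    case SHARED:
        if (e == PROC_WRITE) return EXCLUSIVE;  /* broadcast an invalidation */
        if (e == BUS_WRITE)  return INVALID;    /* another processor is writing */
        return SHARED;                          /* local reads and bus reads keep it shared */
    case EXCLUSIVE:
        if (e == BUS_READ)   return SHARED;     /* write back / supply data, keep a shared copy */
        if (e == BUS_WRITE)  return INVALID;    /* write back data and invalidate */
        return EXCLUSIVE;                       /* local reads and writes hit */
    }
    return INVALID;
}

int main(void)
{
    line_state_t s = INVALID;
    event_t trace[] = { PROC_READ, BUS_READ, PROC_WRITE, BUS_WRITE };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        s = next_state(s, trace[i]);
        printf("after event %u: state %d\n", i, s);
    }
    return 0;
}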

161 Example

P1 read (A) P2 read (A1) P1 write (B) P2 read (C) P1 write (D) P2 write (E) P2 write (F-Z)

162 Implementation 2 – Directory n Bus-based Snooping – Limitations

n Snooping scalability is limited

n Bus has insufficient data bandwidth for coherence traffic

n Processor has insufficient snooping bandwidth for coherence traffic

n Directory-based Coherence – Scalable Alternative

n Directory contains state for every cache line

n Directory identifies processors with cached copies and their states

n In contrast to snoopy protocols, processors observe/act only on relevant memory events. Directory determines whether a processor is involved

163 Directory Communication

n Processor sends coherence events to the directory

1. Find the directory entry
2. Identify processors with copies
3. Communicate with those processors, if necessary
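A minimal sketch of these three steps, assuming a full-map directory with one entry per memory block and a bit vector of sharers (the field names and the 32-processor limit are illustrative assumptions, not from the lecture):

#include <stdint.h>
#include <stdio.h>

#define NUM_PROCS 32

/* One directory entry per memory block: block state plus a sharer bit vector. */
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;
    uint32_t    sharers;   /* bit i set => processor i holds a copy */
} dir_entry_t;

/* Handle a write request from 'writer': invalidate every other cached copy. */
static void handle_write(dir_entry_t *e, int writer)
{
    for (int p = 0; p < NUM_PROCS; p++)
        if (((e->sharers >> p) & 1u) && p != writer)
            printf("send invalidate to P%d\n", p);   /* stand-in for a network message */
    e->sharers = 1u << writer;
    e->state   = DIR_EXCLUSIVE;
}

int main(void)
{
    dir_entry_t e = { DIR_SHARED, (1u << 0) | (1u << 3) };  /* P0 and P3 share the block */
    handle_write(&e, 0);                                    /* P0 writes: only P3 is invalidated */
    return 0;
}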

164 Challenges in Shared Memory n Cache Coherence

n “Common Sense”

P1-Read[X] → P1-Write[X] → P1-Read[X]: the read returns the value P1 wrote to X
P1-Write[X] → P2-Read[X]: the read returns the value written by P1
P1-Write[X] → P2-Write[X]: writes are serialized; all P's see writes in the same order

n Synchronization

n Atomic read/write operations

n Memory Consistency

n What behavior should programmers expect from shared memory?

n Provide a formal definition of memory behavior to programmer

n Example: When will a written value be seen?

n Example: P1-Write[X] <<10ps>> P2-Read[X]. What happens?

165 Synchronization n Regulate access to data shared by processors

n Synchronization primitive is a lock

n Critical section is a code segment that accesses shared data

n Processor must acquire lock before entering critical section.

n Processor should release lock when exiting critical section

n Spin Locks – Broken Implementation

acquire (lock)     # if lock = 0, then set lock = 1, else spin
<critical section>
release (lock)     # lock = 0

Inst-0: ldw R1, lock       # load lock into R1
Inst-1: bnez R1, Inst-0    # check lock; if lock != 0, go back to Inst-0
Inst-2: stw 1, lock        # acquire lock, set to 1
<<critical section>>       # access shared data
Inst-n: stw 0, lock        # release lock, set to 0

166 Implementing Spin Locks

Processor 0                              Processor 1
Inst-0: ldw R1, lock
Inst-1: bnez R1, Inst-0   # P0 sees lock is free
                                         Inst-0: ldw R1, lock
                                         Inst-1: bnez R1, Inst-0   # P1 sees lock is free
Inst-2: stw 1, lock       # P0 acquires lock
                                         Inst-2: stw 1, lock       # P1 acquires lock
...                                      ...                       # P0/P1 in the critical section at the same time
Inst-n: stw 0, lock

n Problem: Lock acquire is not atomic

n A set of atomic operations either all complete or all fail. During a set of atomic operations, no other processor can interject.

n Spin lock requires atomic load-test-store sequence

167 Implementing Spin Locks

n Solution: Test-and-set instruction

n Add single instruction for load-test-store (t&s R1, lock)

n Test-and-set atomically executes:

ld R1, lock   # load previous lock value
st 1, lock    # store 1 to set/acquire

n If lock initially free (0), t&s acquires lock (sets to 1)

n If lock initially busy (1), t&s does not change it

n Instruction is un-interruptible/atomic by definition

Inst-0: t&s R1, lock       # atomically load, check, and set lock = 1
Inst-1: bnez R1, Inst-0    # if the previous value in R1 was not 0, the acquire was unsuccessful; retry
...                        # critical section
Inst-n: stw 0, lock        # release lock, set to 0
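For reference, a minimal C11 sketch of a test-and-set spin lock; an atomic exchange is one common way to realize the t&s instruction in software, and this is an illustration rather than code from the lecture.

#include <stdatomic.h>

/* Spin lock built on an atomic test-and-set (here: atomic exchange). */
typedef struct { atomic_int lock; } spinlock_t;    /* 0 = free, 1 = held */

static void spin_acquire(spinlock_t *s)
{
    /* atomic_exchange returns the previous value: 0 means we acquired the lock. */
    while (atomic_exchange(&s->lock, 1) != 0)
        ;                                          /* spin until acquired */
}

static void spin_release(spinlock_t *s)
{
    atomic_store(&s->lock, 0);                     /* set lock back to free */
}

int main(void)
{
    spinlock_t l = { 0 };
    spin_acquire(&l);
    /* critical section: access shared data here */
    spin_release(&l);
    return 0;
}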

168 Test-and-Set Inefficiency

n Test-and-set works…

Processor 0                              Processor 1
Inst-0: t&s R1, lock      # P0 sees lock is free and acquires it
Inst-1: bnez R1, Inst-0
                                         Inst-0: t&s R1, lock      # P1 does not acquire
                                         Inst-1: bnez R1, Inst-0   # P1 keeps spinning

n …but performs poorly

n Suppose Processor 2 (not shown) has the lock

n Processors 0/1 must…

n Execute a loop of t&s instructions

n Issue multiple store instructions

n Generate useless interconnection traffic

169 Test-and-Test-and-Set Locks n Solution: Test-and-test-and-set

Inst-0: ld R1, lock        # test with a plain load; see if the lock has changed
Inst-1: bnez R1, Inst-0    # if lock = 1, spin
Inst-2: t&s R1, lock       # lock looks free (0), so try test-and-set
Inst-3: bnez R1, Inst-0    # if the acquire failed, spin again

n Advantages

n Spins locally without stores

n Reduces interconnect traffic

n Not a new instruction, simply new software (lock implementation)
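A corresponding C11 sketch of the test-and-test-and-set idea (again illustrative, not lecture code): spin on a plain atomic load, and only attempt the atomic exchange once the lock looks free, so waiting processors generate no stores or invalidation traffic.

#include <stdatomic.h>

typedef struct { atomic_int lock; } ttas_lock_t;    /* 0 = free, 1 = held */

static void ttas_acquire(ttas_lock_t *s)
{
    for (;;) {
        while (atomic_load(&s->lock) != 0)
            ;                                       /* spin locally on reads only */
        if (atomic_exchange(&s->lock, 1) == 0)      /* lock looked free: try to grab it */
            return;
    }
}

static void ttas_release(ttas_lock_t *s)
{
    atomic_store(&s->lock, 0);
}

/* usage: ttas_lock_t l = { 0 }; ttas_acquire(&l); ...critical section... ttas_release(&l); */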

170 Semaphores n Semaphore (semaphore S, integer N)

n Allows N parallel threads to access shared variable

n If N = 1, equivalent to lock

n Requires atomic fetch-and-add

Function Init (semaphore S, integer N) { S = N; }

Function P (semaphore S) {   # "Proberen": to test
    while (S == 0) { };
    S = S - 1;
}

Function V (semaphore S) {   # "Verhogen": to increment
    S = S + 1;
}
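A C11 sketch of P and V using atomic operations (illustrative only; the decrement uses a compare-and-swap loop so that the test and the decrement happen atomically, which the pseudocode above glosses over, and the type name is hypothetical to avoid clashing with POSIX sem_t):

#include <stdatomic.h>

typedef struct { atomic_int count; } sem_sketch_t;

static void sem_init_sketch(sem_sketch_t *s, int n) { atomic_store(&s->count, n); }

static void sem_P(sem_sketch_t *s)                   /* "Proberen": wait */
{
    for (;;) {
        int c = atomic_load(&s->count);
        if (c > 0 && atomic_compare_exchange_weak(&s->count, &c, c - 1))
            return;                                  /* atomically took one slot */
        /* otherwise retry; a real implementation would block in the OS instead of spinning */
    }
}

static void sem_V(sem_sketch_t *s)                   /* "Verhogen": signal */
{
    atomic_fetch_add(&s->count, 1);
}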

171 Challenges in Shared Memory n Cache Coherence

n “Common Sense”

P1-Read[X] → P1-Write[X] → P1-Read[X]: the read returns the value P1 wrote to X
P1-Write[X] → P2-Read[X]: the read returns the value written by P1
P1-Write[X] → P2-Write[X]: writes are serialized; all P's see writes in the same order

n Synchronization

n Atomic read/write operations

n Memory Consistency

n What behavior should programmers expect from shared memory?

n Provide a formal definition of memory behavior to programmer

n Example: When will a written value be seen?

n Example: P1-Write[X] <<10ps>> P2-Read[X]. What happens?

172 Memory Consistency n Execution Example

A = Flag = 0 initially

Processor 0          Processor 1
A = 1                while (!Flag) { }
Flag = 1             print A

n Intuition: P1 should print A = 1

n Coherence: makes no guarantees!

173 Consistency and Caches n Execution Example

A = Flag = 0 initially

Processor 0          Processor 1
A = 1                while (!Flag) { }
Flag = 1             print A

n Caching Scenario

1. P0 writes A = 1. Misses in the cache. Puts the write into a store buffer.
2. P0 continues execution.
3. P0 writes Flag = 1. Hits in the cache. Completes the write (with coherence).
4. P1 reads Flag = 1.
5. P1 exits the spin loop.
6. P1 prints A = 0.

n Caches, buffering, and other performance mechanisms can cause strange behavior
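The same example written in C11 (a sketch, not lecture code): with relaxed ordering on Flag the outcome "A = 0" is permitted, whereas the release/acquire pair shown below forbids it, which anticipates the consistency models discussed next.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int A = 0, Flag = 0;

static void *p0(void *unused)
{
    (void)unused;
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    /* release: all earlier writes become visible before Flag = 1 is observed.
       With memory_order_relaxed here instead, P1 may legally print 0. */
    atomic_store_explicit(&Flag, 1, memory_order_release);
    return NULL;
}

static void *p1(void *unused)
{
    (void)unused;
    while (atomic_load_explicit(&Flag, memory_order_acquire) == 0)
        ;                                     /* spin until Flag is set */
    printf("A = %d\n", atomic_load_explicit(&A, memory_order_relaxed));
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t0, NULL, p0, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}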

174 Sequential Consistency (SC)

n Definition of Sequential Consistency

n Formal definition of programmers' expected view of memory:

(1) Each processor P sees its own loads/stores in program order
(2) Each processor P sees !P loads/stores in program order
(3) All processors see the same global load/store ordering

n P and !P loads/stores may be interleaved into some order, but all processors see the same interleaving/ordering

n Definition of Multiprocessor Ordering [Lamport]

n Multiprocessor ordering corresponds to some sequential interleaving of the uniprocessor orderings; the multiprocessor ordering should be indistinguishable from a multiprogrammed uniprocessor

175 Enforcing SC

n Consistency and Coherence

n SC Definition: loads/stores globally ordered

n SC Implications: coherence events of all load/stores globally ordered

n Implementing Sequential Consistency

n All loads/stores commit in-order

n Delay completion of memory access until all invalidations that are caused by access are complete

n Delay a memory access until previous memory access is complete

n Delay memory read until previous write completes. Cannot place writes in a buffer and continue with reads.

n Simple for the programmer, but constrains HW/SW performance optimizations

176 Weaker Consistency Models

n Assume programs are synchronized

n SC required only for lock variables

n Other variables are either (1) in critical section and cannot be accessed in parallel or (2) not shared

n Use fences to restrict re-ordering

n Increases opportunity for HW optimization but increases programmer effort

n Memory fences stall execution until write buffers empty

n Allows load/store reordering in critical section.

n Slows lock acquire, release

acquire
memory fence
<critical section>
memory fence    # ensures all writes from the critical section are cleared from the buffer
release
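In C11 this acquire / fence / critical section / fence / release pattern can be written with explicit thread fences. The sketch below assumes the spin-lock word from the earlier examples and a hypothetical shared variable; real code would usually rely on the acquire/release semantics of the lock operations themselves.

#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock = 0;      /* spin-lock word as in the earlier sketches: 0 = free */
static int shared_data;          /* variable protected by the lock */

static void update_shared(int value)
{
    while (atomic_exchange_explicit(&lock, 1, memory_order_relaxed) != 0)
        ;                                            /* acquire (spin) */
    atomic_thread_fence(memory_order_acquire);       /* memory fence after acquire */

    shared_data = value;                             /* critical section */

    atomic_thread_fence(memory_order_release);       /* fence: drain critical-section writes */
    atomic_store_explicit(&lock, 0, memory_order_relaxed);   /* release */
}

int main(void)
{
    update_shared(42);
    printf("%d\n", shared_data);
    return 0;
}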

177