
EEC 581 Architecture

Multicore Architecture

Department of Electrical Engineering and Computer Science Cleveland State University

Multiprocessor Architectures

• Late 1950s: one general-purpose and one or more special-purpose processors for input and output operations
• Early 1960s: multiple complete processors, used for program-level concurrency
• Mid-1960s: multiple partial processors, used for instruction-level concurrency
• Single-Instruction Multiple-Data (SIMD) machines
• Multiple-Instruction Multiple-Data (MIMD) machines
• A primary focus of this chapter is MIMD machines (multiprocessors)

Thread Level Parallelism (TLP)

• Multiple threads of execution
• Exploit ILP in each thread
• Exploit concurrent execution across threads


Instruction and Data Streams

• Taxonomy due to M. Flynn

• SISD (single instruction stream, single data stream): Intel Pentium 4
• SIMD (single instruction stream, multiple data streams): SSE instructions of x86
• MISD (multiple instruction streams, single data stream): no examples today
• MIMD (multiple instruction streams, multiple data streams): Intel Xeon e5345
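To make the SIMD row above concrete, here is a small sketch using x86 SSE intrinsics (assumed available through the standard <xmmintrin.h> header on an SSE-capable compiler): one vector add operates on four data elements at once, which is exactly the single-instruction, multiple-data idea.

```c
/* Build with: gcc -msse simd_add.c -o simd_add */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load four floats             */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* ONE instruction, FOUR sums   */
    _mm_storeu_ps(c, vc);

    printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```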

Example: Multithreading (MT) in a single address space


Recall the Executable Format

Object file ready to be linked and loaded

(Figure: an object file, consisting of header, text, static data, reloc, symbol table, and debug sections, is combined with static libraries by the linker; the loader then produces an executable instance.)

What does a loader do?



Process

• A process is a running program with state
  - Stack, memory, open files
  - PC, registers
• The OS keeps track of the state of all processes
  - E.g., for scheduling processes
• There may be many processes for the same application
  - E.g., a web browser
• See an operating systems class for details

(Figure: process address space, showing stack, heap, static data, code, and DLLs.)



Categories of Concurrency

• Categories of concurrency:
  - Physical concurrency: multiple independent processors (multiple threads of control)
  - Logical concurrency: the appearance of physical concurrency is presented by time-sharing one processor (software can be designed as if there were multiple threads of control)
  - Coroutines (quasi-concurrency) have a single thread of control
• A thread of control in a program is the sequence of program points reached as control flows through the program

Motivations for the Use of Concurrency

• Multiprocessors capable of physical concurrency are now widely used
• Even if a machine has just one processor, a program written to use concurrent execution can be faster than the same program written for nonconcurrent execution
• Concurrency involves a different way of designing software that can be very useful; many real-world situations involve concurrency
• Many program applications are now spread over multiple machines, either locally or over a network

Introduction to Subprogram-Level Concurrency

• A task, process, or thread is a program unit that can be in concurrent execution with other program units
• Tasks differ from ordinary subprograms in that:
  - A task may be implicitly started
  - When a program unit starts the execution of a task, it is not necessarily suspended
  - When a task's execution is completed, control may not return to the caller
• Tasks usually work together

Two General Categories of Tasks

• Heavyweight tasks execute in their own address space
• Lightweight tasks all run in the same address space; more efficient
• A task is disjoint if it does not communicate with or affect the execution of any other task in the program in any way

Task Synchronization

• A mechanism that controls the order in which tasks execute
• Two kinds of synchronization:
  - Cooperation synchronization
  - Competition synchronization
• Task communication is necessary for synchronization, provided by:
  - Shared nonlocal variables
  - Parameters
  - Message passing

Kinds of synchronization

• Cooperation: Task A must wait for task B to complete some specific activity before task A can continue its execution, e.g., the producer-consumer problem
• Competition: Two or more tasks must use some resource that cannot be simultaneously used, e.g., a shared counter
• Competition is usually provided by mutually exclusive access (approaches are discussed later); a minimal sketch follows
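POSIX Pthreads (one of the thread libraries discussed later in these slides) provides mutually exclusive access through a mutex. The sketch below is illustrative only; the counter name, thread count, and iteration count are invented for the example.

```c
/* Build with: gcc competition.c -o competition -pthread */
#include <pthread.h>
#include <stdio.h>

static long shared_counter = 0;   /* the resource the tasks compete for */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* competition synchronization:   */
        shared_counter++;              /* only one task in here at once  */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", shared_counter);   /* 400000 with the mutex */
    return 0;
}
```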

Process Level Parallelism

(Figure: several independent processes running side by side.)

• Parallel processes and throughput computing
• Each process itself does not run any faster


From Processes to Threads

• Switching processes on a core is expensive
  - A lot of state information to be managed
• If I want concurrency, launching a process is expensive
• How about splitting up a single process into parallel computations?
  - Lightweight processes, or threads!



A Thread

• A separate, concurrently executable instruction stream within a process
• Minimum amount of state to execute on a core
  - PC, registers, stack
  - Remaining state shared with the parent process
    - Memory and files
• Support for creating threads
• Support for merging/terminating threads
• Support for synchronization between threads
  - In accesses to shared data
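A brief Pthreads sketch of these bullet points; the names are illustrative. Each created thread runs on its own stack (so the local variable below has a different address in each thread), while static data such as the global is shared, and joining the threads is the merge/terminate step.

```c
/* Build with: gcc thread_state.c -o thread_state -pthread */
#include <pthread.h>
#include <stdio.h>

int shared_global = 42;                 /* static data: one copy, shared by all threads */

static void *show_state(void *arg) {
    int on_my_stack = *(int *)arg;      /* lives on this thread's private stack */
    printf("thread %d: &on_my_stack=%p  &shared_global=%p\n",
           on_my_stack, (void *)&on_my_stack, (void *)&shared_global);
    return NULL;                        /* terminating the thread */
}

int main(void) {
    pthread_t t1, t2;
    int id1 = 1, id2 = 2;
    pthread_create(&t1, NULL, show_state, &id1);   /* creating threads            */
    pthread_create(&t2, NULL, show_state, &id2);
    pthread_join(t1, NULL);                        /* merging (waiting for) them  */
    pthread_join(t2, NULL);
    return 0;
}
```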


TLP

• ILP of a single program is hard
  - Large ILP is far-flung
  - We are human after all; we program with a sequential mind
• Reality: running multiple threads or programs
• Thread Level Parallelism
  - Time multiplexing
  - Throughput computing
    - Multiple program workloads
    - Multiple concurrent threads
  - Helper threads to improve single-program performance



Single and Multithreaded Processes


A Simple Example

Data Parallel Computation
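The slide's figure is not reproduced here, but the idea of data-parallel computation can be sketched as follows (all names and sizes are illustrative): every thread runs the same code on its own disjoint slice of an array, and the partial results are combined at the end.

```c
/* Build with: gcc parsum.c -o parsum -pthread */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double data[N];
static double partial[NTHREADS];        /* one result slot per thread */

static void *sum_slice(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)      /* same computation ...        */
        s += data[i];                   /* ... on this thread's slice  */
    partial[id] = s;
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, sum_slice, (void *)id);

    double total = 0.0;
    for (long id = 0; id < NTHREADS; id++) {
        pthread_join(t[id], NULL);
        total += partial[id];
    }
    printf("sum = %.0f\n", total);      /* expect 1000000 */
    return 0;
}
```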



Thread Execution: Basics

(Figure: a main thread calls create_thread(funcA) and create_thread(funcB), then WaitAllThreads(); each thread gets its own PC, registers, stack pointer, and stack, runs funcA() or funcB(), and calls end_thread(); the heap and static data are shared.)

Examples of Threads

• A web browser
  - One thread displays images
  - One thread retrieves data from the network
• A word processor
  - One thread displays graphics
  - One thread reads keystrokes
  - One thread performs spell checking in the background
• A web server
  - One thread accepts requests
  - When a request comes in, a separate thread is created to service it
  - Many threads to support thousands of client requests
• RPC or RMI (Java)
  - One thread receives the message
  - The message service uses another thread



Thread Execution on a Single Core

• Hardware threads
  - Each thread has its own hardware state
• Switching between threads on each cycle to share the core – why?

(Figure: two MIPS instruction streams interleaved cycle by cycle through the IF/ID/EX/MEM/WB pipeline.
Thread #1: lw $t0, label($0); lw $t1, label1($0); and $t2, $t0, $t1; andi $t3, $t1, 0xffff; srl $t2, $t2, 12; ...
Thread #2: lw $t3, 0($t0); add $t2, $t2, $t3; addi $t0, $t0, 4; addi $t1, $t1, -1; bne $t1, $zero, loop; ...
Interleaved execution improves utilization: no stall on the load-to-use hazard.)


Execution Model: Multithreading

• Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
• Coarse-grain multithreading
  - Only switch on a long stall (e.g., L2-cache miss)
  - Simplifies hardware, but does not hide short stalls (e.g., data hazards)
  - If one thread stalls (e.g., I/O), others are executed

An Example Datapath

(Figure: an example datapath of the UltraSPARC T1 CPU; source: Poonacha Kongetira.)


Simultaneous Multithreading

• In multiple-issue dynamically scheduled processors
  - Instruction-level parallelism across threads
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
• Example: Intel Pentium 4 HT
  - Two threads: duplicated registers, shared function units and caches
  - Known as Hyperthreading in Intel terminology


Threads vs. Processes

Thread:
• A thread has no data segment or heap
• A thread cannot live on its own; it must live within a process
• There can be more than one thread in a process; the first thread calls main and has the process's stack
• Inexpensive creation
• Inexpensive context switching
• If a thread dies, its stack is reclaimed by the process

Process:
• A process has code/data/heap and other segments
• There must be at least one thread in a process
• Threads within a process share code/data/heap and share I/O, but each has its own stack and registers
• Expensive creation
• Expensive context switching
• If a process dies, its resources are reclaimed and all threads die
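One way to see this contrast in running code (a sketch, assuming a POSIX system; the variable name is illustrative): a forked child gets its own copy of the data segment, so its write is invisible to the parent, while a thread shares the creator's data segment.

```c
/* Build with: gcc proc_vs_thread.c -o proc_vs_thread -pthread */
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int value = 0;                            /* data segment */

static void *thread_body(void *arg) {
    value = 1;                            /* threads share data: visible to the creator */
    return NULL;
}

int main(void) {
    pid_t pid = fork();                   /* heavyweight: child gets its own copy of value */
    if (pid == 0) {
        value = 100;                      /* modifies the child's copy only */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("after fork:   value = %d\n", value);   /* still 0 */

    pthread_t t;                          /* lightweight: same address space */
    pthread_create(&t, NULL, thread_body, NULL);
    pthread_join(t, NULL);
    printf("after thread: value = %d\n", value);   /* now 1 */
    return 0;
}
```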


Thread Implementation

• A process defines the address space; threads share that address space
• The Process Control Block (PCB) contains process-specific info
  - PID, owner, heap pointer, active threads, and pointers to thread info
• The Thread Control Block (TCB) contains thread-specific info
  - Stack pointer, PC, thread state, registers, ...

(Figure: the process's address space, with reserved region, DLLs, one stack per thread, heap, initialized data, and code, alongside a TCB per thread holding $pc, $sp, state, and registers.)
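The PCB/TCB split can be pictured with a pair of hypothetical C structs; the field names and sizes below are illustrative, not the layout of any particular OS.

```c
#include <stdint.h>

/* Hypothetical thread control block: per-thread execution state. */
struct tcb {
    uint64_t    pc;              /* program counter                      */
    uint64_t    sp;              /* stack pointer (top of private stack) */
    uint64_t    regs[31];        /* saved general-purpose registers      */
    int         state;           /* READY, RUNNING, BLOCKED, ...         */
    struct tcb *next;            /* link in the process's thread list    */
};

/* Hypothetical process control block: state shared by all its threads. */
struct pcb {
    int         pid;             /* process identifier          */
    int         owner_uid;       /* owning user                 */
    void       *page_table;      /* the shared address space    */
    void       *heap_ptr;        /* heap pointer                */
    int         open_files[16];  /* shared open-file handles    */
    struct tcb *threads;         /* list of this process's TCBs */
};
```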


Benefits

• Responsiveness
  - When one thread is blocked, your browser still responds
  - E.g., download images while allowing your interaction
• Resource sharing
  - Share the same address space
  - Reduce overhead (e.g., memory)
• Economy
  - Creating a new process costs memory and resources
  - E.g., in Solaris, creating a process is 30 times slower than creating a thread
• Utilization of MP architectures
  - Threads can be executed in parallel on multiple processors
  - Increase concurrency and throughput

User-level Threads

 Thread management done by user-level threads library

 Similar to calling a procedure

 Thread management is done by the thread library in user space

 User can control the thread scheduling (No disturbing the underlying OS scheduler)

• No OS kernel support
  - More portable
  - Low overhead when thread switching

• Three primary thread libraries:
  - POSIX Pthreads
  - Java threads
  - Win32 threads

Kernel Threads

 A.k.a. lightweight process in the literature

 Supported by the Kernel

 Thread scheduling is fairer

• Examples
  - Windows XP/2000
  - Solaris
  - Linux
  - Tru64 UNIX
  - Mac OS X

Multithreading Models

 Many-to-One

 One-to-One

 Many-to-Many

Many-to-One

 Many user-level threads mapped to one single kernel thread

 The entire process will block if a thread makes a blocking system call

 Cannot run threads in parallel on multiprocessors

• Examples
  - Solaris Green Threads
  - GNU Portable Threads

Many-to-One Model

One-to-One

• Each user-level thread maps to a kernel thread

 Do not block other threads when one is making a blocking system call

 Enable parallel execution in an MP system

• Downside:
  - Performance/memory overheads of creating kernel threads
  - Restriction on the number of threads that can be supported

• Examples
  - Windows NT/XP/2000
  - Linux
  - Solaris 9 and later

One-to-one Model

Many-to-Many Model

 Allows many user level threads to be mapped to many kernel threads

 Allows the operating system to create a sufficient number of kernel threads

 Threads are multiplexed to a smaller (or equal) number of kernel threads which is specific to a particular application or a particular machine

• Solaris prior to version 9
• Windows NT/2000 with the ThreadFiber package

Many-to-Many Model

Pipeline Hazards


Multithreading



Multi-Tasking Paradigm

• Virtual memory makes it easy
• Context switch could be expensive or requires extra HW
  - VIVT cache
  - VIPT cache
  - TLBs

(Figure: functional units FU1-FU4 over execution time for a conventional single-threaded superscalar; threads 1-5 each run for a time quantum, with many issue slots left unused.)


Multi-threading Paradigm

(Figure: functional-unit occupancy (FU1-FU4) over execution time for five organizations: conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), chip multiprocessor (CMP or multicore), and simultaneous multithreading (SMT).)


Conventional Multithreading

• Zero-overhead context switch
• Duplicated contexts for threads

(Figure: four duplicated register contexts, 0:r0-0:r7 through 3:r0-3:r7, selected by a context pointer (CtxtPtr), with memory shared by the threads.)


Cycle Interleaving MT

• Per-cycle, per-thread instruction fetching
• Examples: HEP, Horizon, Tera MTA, MIT M-machine
• Interesting questions to consider
  - Does it need a sophisticated branch predictor?
  - Or does it need any at all?
  - Get rid of "branch prediction"?
  - Get rid of "predication"?
  - Does it need any out-of-order execution capability?


Block Interleaving MT

• Context switch on a specific event (dynamic pipelining)
  - Explicit switching: implementing a switch instruction
  - Implicit switching: triggered when a specific instruction class is fetched
• Static switching (switch upon fetching)
  - Switch-on-memory-instructions: Rhamma processor
  - Switch-on-branch or switch-on-hard-to-predict-branch
  - Trigger can be an implicit or explicit instruction
• Dynamic switching
  - Switch-on-cache-miss (switch in a later pipeline stage): MIT Sparcle (MIT Alewife's node), Rhamma processor
  - Switch-on-use (lazy strategy of switch-on-cache-miss)
    - Wait until the last minute
    - Valid bit needed for each register: cleared when the load is issued, set when the data returns
  - Switch-on-signal (e.g., interrupt)
  - Predicated switch instruction based on conditions
• No need to support a large number of threads


Simultaneous Multithreading (SMT)

• SMT name first used by UW; earlier versions from UCSB [Nemirovsky, HICSS '91] and [Hirata et al., ISCA-92]
• Intel's HyperThreading (2-way SMT)
• IBM Power7 (4/6/8 cores, 4-way SMT); IBM Power5/6 (2 cores, each 2-way SMT, 4 chips per package): Power5 has OoO cores, Power6 in-order cores
• Basic ideas: conventional MT + simultaneous issue + sharing common resources

(Figure: an SMT pipeline: per-thread PCs and rename tables feed shared RS & ROB and a shared physical register file; the shared function units include FDiv (16 cycles, unpipelined), FMult (4 cycles), FAdd (2 cycles), ALU1, ALU2, and Load/Store (variable latency), backed by shared I-cache and D-cache.)


Instruction Fetching Policy

• FIFO, round robin: simple but may be too naive
• Adaptive fetching policies
  - BRCOUNT (reduce wrong-path issuing)
    - Count # of branch instructions in the decode/rename/IQ stages
    - Give top priority to the thread with the least BRCOUNT
  - MISSCOUNT (reduce IQ clog)
    - Count # of outstanding D-cache misses
    - Give top priority to the thread with the least MISSCOUNT
  - ICOUNT (reduce IQ clog)
    - Count # of instructions in the decode/rename/IQ stages
    - Give top priority to the thread with the least ICOUNT
  - IQPOSN (reduce IQ clog)
    - Give lowest priority to threads with instructions closest to the head of the INT or FP instruction queues, since threads with the oldest instructions are most prone to IQ clog
    - No counter needed
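As an illustration only (not the actual fetch hardware), the ICOUNT policy reduces to choosing, each cycle, the thread with the fewest instructions sitting in the pre-issue stages; the array and thread count below are assumed names for the sketch.

```c
#include <limits.h>

#define NTHREADS 4

/* in_flight[t]: number of thread t's instructions currently in the
 * decode, rename, and issue-queue stages (maintained by the pipeline). */
int in_flight[NTHREADS];

/* ICOUNT: fetch from the thread with the fewest in-flight instructions,
 * which keeps any one thread from clogging the issue queue. */
int icount_pick(void) {
    int best = 0, best_count = INT_MAX;
    for (int t = 0; t < NTHREADS; t++) {
        if (in_flight[t] < best_count) {
            best_count = in_flight[t];
            best = t;
        }
    }
    return best;   /* thread to fetch from this cycle */
}
```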


Resource Sharing

• Could be tricky when threads compete for the resources
• Static
  - Less complexity
  - Could penalize threads (e.g., size)
  - P4's Hyperthreading
• Dynamic
  - Complex
  - What is fair? How to quantify fairness?
• A growing concern in multi-core processors
  - Shared L2, bandwidth, etc.
  - Issues
    - Fairness
    - Mutual thrashing


Hyper-Threading

(Figure: two CPUs without Hyper-Threading, each with one architectural state and its own processor execution resources, versus two CPUs with Hyper-Threading, where each processor holds two architectural states sharing one set of execution resources.)

• Implementation of Hyper-Threading adds less than 5% to the chip area
• Principle: share major logic components (functional units) and improve utilization
• Architecture state: all core pipeline resources needed for executing a thread



Multithreading with ILP: Examples


P4 HyperThreading Resource Partitioning

• TC (or UROM) is alternately accessed per cycle for each logical processor, unless one is stalled due to a TC miss
• µop queue (into ½) after µops are fetched from the TC
• ROB (126/2)
• LB (48/2)
• SB (24/2) (32/2 for Prescott)
• General µop queue and memory µop queue (1/2)
• TLB (½?) as there is no PID
• Retirement: alternating between the 2 logical processors


Alpha 21464 (EV8) Processor

Technology

• Leading-edge process technology – 1.2 ~ 2.0 GHz
  - 0.125 µm CMOS
  - SOI-compatible
  - Cu interconnect
  - Low-k dielectrics

• Chip characteristics
  - ~1.2 V Vdd
  - ~250 million transistors
  - ~1100 signal pins in flip-chip packaging


24 Alpha 21464 (EV8) Processor

Architecture

• Enhanced out-of-order execution (that giant 2Bc-gskew predictor we discussed before is here)
• Large on-chip L2 cache
• Direct RAMBUS interface
• On-chip router for system interconnect
• Glueless, directory-based, ccNUMA for up to 512-way SMP
• 8-wide superscalar
• 4-way simultaneous multithreading (SMT)
• Total die overhead ~6% (allegedly)


SMT Pipeline

(Figure: the SMT pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store, Reg, Retire, with per-thread PCs feeding the Icache, a register map feeding the register files, and the Dcache accessed by loads and stores.)


Source: A company once called Compaq

EV8 SMT

• In SMT mode, it is as if there are 4 processors on a chip that share their caches and TLB
• Replicated hardware contexts
  - Program counter
  - Architected registers (actually just the renaming table, since architected registers and rename registers come from the same physical pool)
• Shared resources
  - Rename register pool (larger than needed by 1 thread)
  - Instruction queue
  - Caches
  - TLB
  - Branch predictors
• Deceased before seeing the daylight


Reality Check, circa 200x

• Conventional processor designs run out of steam
  - Power wall (thermal)
  - Complexity (verification)
  - Physics (CMOS scaling)

(Figure: power density in watts/cm², log scale from 1 to 1000, for the i386, i486, Pentium, Pentium Pro, Pentium II, and Pentium III processors, climbing toward hot-plate, nuclear-reactor, rocket-nozzle, and Sun's-surface levels. "Surpassed hot-plate power density in 0.5 µm; not too long to reach nuclear reactor," former Intel Fellow Fred Pollack.)


Latest Power Density Trend

Yeo and Lee, “Peeling the Power Onion of Data Centers,” In Energy Efficient Thermal Management of Data Centers, Springer. To appear 2011


Reality Check, circa 200x

• Conventional processor designs run out of steam
  - Power wall (thermal)
  - Complexity (verification)
  - Physics (CMOS scaling)
• Unanimous direction
  - Multi-core
  - Simple cores (massive number)
  - Keep
    - Wire communication on a leash
    - Gordon Moore happy (Moore's Law)
  - Architects' menace: kick the ball to the other side of the court?
• What do you (or your customers) want?
  - Performance (and/or availability)
  - Throughput > latency (turnaround time)
  - Total cost of ownership (performance per dollar)
  - Energy
  - Reliability and dependability, SPAM/spy free


Multi-core Processor Gala


Intel’s Multicore Roadmap

(Figure: Intel's 2006-2008 roadmap for mobile, desktop, and enterprise processors, moving from single-core parts with 512 KB-1 MB of cache, through dual-core parts with 2-16 MB and quad-core parts with 4-16 MB of shared cache, to 8-core parts with 12 MB of shared cache at 45 nm.)

Source: Adapted from Tom's Hardware

• To extend Moore's Law
• To delay the ultimate limit of physics
• By 2010, all Intel processors delivered will be multicore
• Intel's 80-core processor (FPU array)


Is a Multi-core Really Better Off?

If you were plowing a field, which would you rather use: Two strong oxen or 1024 cores? --- Seymour Cray

Well, it is hard to say in Computing World


Q1. For a PIPT cache with virtual memory support, three possible events can be triggered during an instruction fetch: (1) a cache lookup, (2) a TLB miss, (3) a page fault. Please order these events in the correct order of their occurrences.

(2) (3) (1): In a PIPT cache, address translation takes place prior to the cache lookup. The hardware first searches for a match in the TLB, so a TLB miss (if any) occurs first. A page table walk is then initiated; if the page has not been allocated, a page fault follows. The OS then allocates the page, fills in the page table entry, and fills the translation into the TLB, after which the cache lookup proceeds.

Q2. Given a 256Meg x4 DRAM chip which consists of 2 banks, with 14-bit row addresses. (256Meg indicates the number of addresses.) What is the row buffer size for each bank?

256M → 28 address bits are needed. One bit is used for the bank index; hence, the column address = 28 − 1 − 14 = 13 bits. As the DRAM is a "x4" configuration, one row buffer of a bank = 2^13 × 4 bits = 32 Kbits = 4 KB.


Q3. Assume an Inverted Page Table (8-entry IPT) is used by a 32-bit OS. The memory page size is 256KB. The complete IPT content is shown below. The Physical Page Number (PPN) starts from 0 to 7 from the top of the table. There are three active processes, P1 (PID=1), P2 (PID=2) and P3 (PID=3), running in the system, and the IPT holds the translation for the entire physical memory. Answer the following questions.

Based on the size of the Inverted Page Table above, what is the size of the physical memory? There are 8 entries in the IPT. As each page is 256KB, the size of the physical memory = 8 × 256KB = 2MB.


IBM Watson Jeopardy! Competition

• POWER7 chips (2,880 cores) + 16TB memory
• Massively parallel processing
• Combine: processing power, natural language processing, AI, search, knowledge extraction


Major Challenges for Multi-Core Designs

• Communication
  - Data allocation (you have a large shared L2/L3 now)
  - Interconnection network
    - AMD HyperTransport
    - Intel QPI
  - Bus bandwidth, how to get there?
• Power-performance: win or lose?
  - Borkar's multicore arguments
    - 15% per-core performance drop
    - 50% power saving
  - A giant, single core wastes power when the task is small
  - How about leakage?
• Process variation and …
• Programming model

