EEC 581 Computer Architecture
Multicore Architecture
Department of Electrical Engineering and Computer Science Cleveland State University
Multiprocessor Architectures
Late 1950s - one general-purpose processor and one or more special-purpose processors for input and output operations
Early 1960s - multiple complete processors, used for program-level concurrency
Mid-1960s - multiple partial processors, used for instruction-level concurrency
Single-Instruction Multiple-Data (SIMD) machines
Multiple-Instruction Multiple-Data (MIMD) machines
A primary focus of this chapter is shared-memory MIMD machines (multiprocessors)
Thread Level Parallelism (TLP)
• Multiple threads of execution
• Exploit ILP in each thread
• Exploit concurrent execution across threads
Instruction and Data Streams
• Taxonomy due to M. Flynn
                          Data Streams
                          Single                     Multiple
Instruction   Single      SISD: Intel Pentium 4      SIMD: SSE instructions of x86
Streams       Multiple    MISD: No examples today    MIMD: Intel Xeon e5345
Example: Multithreading (MT) in a single address space
Recall The Executable Format
Object file ready to be linked and loaded
[Figure: an object file (header, text, static data, reloc, symbol table, debug) plus static libraries go through the linker and loader to become an executable instance, i.e., a process]
What does a loader do?
Process
• A process is a running program with state
  - Stack, memory, open files
  - PC, registers
• The operating system keeps track of the state of all processes
  - E.g., for scheduling processes
• There may be many processes for the same application
  - E.g., a web browser
• See an operating systems class for details
[Figure: process address space with DLLs, stack, heap, static data, and code]
Categories of Concurrency
Physical concurrency - multiple independent processors (multiple threads of control)
Logical concurrency - the appearance of physical concurrency is presented by time-sharing one processor (software can be designed as if there were multiple threads of control)
Coroutines (quasi-concurrency) have a single thread of control
A thread of control in a program is the sequence of program points reached as control flows through the program
Motivations for the Use of Concurrency
Multiprocessor computers capable of physical concurrency are now widely used
Even if a machine has just one processor, a program written to use concurrent execution can be faster than the same program written for nonconcurrent execution
Involves a different way of designing software that can be very useful; many real-world situations involve concurrency
Many program applications are now spread over multiple machines, either locally or over a network
Introduction to Subprogram-Level Concurrency
A task or process or thread is a program unit that can be in concurrent execution with other program units
Tasks differ from ordinary subprograms in that:
  A task may be implicitly started
  When a program unit starts the execution of a task, it is not necessarily suspended
  When a task's execution is completed, control may not return to the caller
Tasks usually work together
Two General Categories of Tasks
Heavyweight tasks execute in their own address space
Lightweight tasks all run in the same address space (more efficient)
A task is disjoint if it does not communicate with or affect the execution of any other task in the program in any way
Task Synchronization
A mechanism that controls the order in which tasks execute
Two kinds of synchronization:
  Cooperation synchronization
  Competition synchronization
Task communication is necessary for synchronization, provided by:
  Shared nonlocal variables
  Parameters
  Message passing
Kinds of synchronization
Cooperation: task A must wait for task B to complete some specific activity before task A can continue its execution, e.g., the producer-consumer problem
Competition: two or more tasks must use some resource that cannot be used simultaneously, e.g., a shared counter
Competition is usually handled by mutually exclusive access (approaches are discussed later; a small sketch follows below)
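To make competition synchronization concrete, here is a minimal POSIX-threads sketch (the thread and iteration counts are arbitrary choices for illustration): four threads increment a shared counter, and a mutex serializes the read-modify-write so no increment is lost.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                              /* the shared resource */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* mutually exclusive access */
        counter++;                    /* read-modify-write is now safe */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* always 400000 with the lock */
    return 0;
}

Without the lock, two threads can read the same old value and write back the same new one, losing an update; that lost update is exactly the race in the shared-counter example above.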
Process Level Parallelism
[Figure: three independent processes running side by side]
• Parallel processes and throughput computing
• Each process itself does not run any faster
From Processes to Threads
• Switching processes on a core is expensive
  - A lot of state information to be managed
• If I want concurrency, launching a process is expensive
• How about splitting up a single process into parallel computations?
  => Lightweight processes or threads!
A Thread
• A separate, concurrently executable instruction stream within a process
• Minimum amount of state to execute on a core
  - Program counter, registers, stack
  - Remaining state shared with the parent process
    - Memory and files
• Support for creating threads
• Support for merging/terminating threads
• Support for synchronization between threads
  - In accesses to shared data
TLP
ILP of a single program is hard
  Large ILP is far-flung
  We are human after all, and program with a sequential mind
Reality: running multiple threads or programs
  Thread-level parallelism
  Time multiplexing
Throughput computing
  Multiple program workloads
  Multiple concurrent threads
  Helper threads to improve single-program performance
Single and Multithreaded Processes
A Simple Example
Data-parallel computation (a sketch follows below)
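What such a data-parallel example typically looks like can be sketched with POSIX threads: each thread computes a partial sum over its own slice of an array, and the parent combines the results. The array size and thread count below are invented for the example.

#include <pthread.h>
#include <stdio.h>

#define N        1024
#define NTHREADS 4

static double a[N];

struct slice { int lo, hi; double sum; };

static void *partial_sum(void *arg)
{
    struct slice *s = arg;
    s->sum = 0.0;
    for (int i = s->lo; i < s->hi; i++)   /* each thread owns one slice */
        s->sum += a[i];
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    struct slice s[NTHREADS];

    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    for (int i = 0; i < NTHREADS; i++) {
        s[i].lo = i * (N / NTHREADS);
        s[i].hi = (i + 1) * (N / NTHREADS);
        pthread_create(&t[i], NULL, partial_sum, &s[i]);
    }

    double total = 0.0;
    for (int i = 0; i < NTHREADS; i++) {  /* join, then combine partial sums */
        pthread_join(t[i], NULL);
        total += s[i].sum;
    }
    printf("total = %f\n", total);        /* prints 1024.000000 */
    return 0;
}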
Thread Execution: Basics
[Figure: a main program calls create_thread(funcA) and create_thread(funcB); each thread gets its own PC, registers, stack pointer, and stack, while the heap and static data are shared; each thread runs its function and finishes with end_thread(), and the parent waits in WaitAllThreads()]
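The figure's create_thread/end_thread/WaitAllThreads API is pseudocode; a minimal mapping onto POSIX threads might look like this (funcA and funcB are placeholders, as in the figure, and pthread_join plays the role of WaitAllThreads):

#include <pthread.h>
#include <stdio.h>

static void *funcA(void *arg) { (void)arg; printf("funcA\n"); return NULL; }
static void *funcB(void *arg) { (void)arg; printf("funcB\n"); return NULL; }

int main(void)
{
    pthread_t ta, tb;
    pthread_create(&ta, NULL, funcA, NULL);  /* create_thread(funcA) */
    pthread_create(&tb, NULL, funcB, NULL);  /* create_thread(funcB) */
    pthread_join(ta, NULL);                  /* WaitAllThreads() */
    pthread_join(tb, NULL);
    return 0;   /* returning from funcA/funcB is the end_thread() step */
}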
Examples of Threads
A web browser
  One thread displays images
  One thread retrieves data from the network
A word processor
  One thread displays graphics
  One thread reads keystrokes
  One thread performs spell checking in the background
A web server
  One thread accepts requests
  When a request comes in, a separate thread is created to service it
  Many threads to support thousands of client requests
RPC or RMI (Java)
  One thread receives a message
  The message service uses another thread
Thread Execution on a Single Core
• Hardware threads
  - Each thread has its own hardware state
• Switching between threads on each cycle to share the core pipeline; why?
Thread #1:
  lw   $t0, label($0)
  lw   $t1, label1($0)
  and  $t2, $t0, $t1
  andi $t3, $t1, 0xffff
  srl  $t2, $t2, 12
  ...

Thread #2:
  lw   $t3, 0($t0)
  add  $t2, $t2, $t3
  addi $t0, $t0, 4
  addi $t1, $t1, -1
  bne  $t1, $zero, loop
  ...

[Figure: instructions from the two threads enter the IF/ID/EX/MEM/WB pipeline in alternating cycles]
Interleaved execution improves utilization
No pipeline stall on the load-to-use hazard!
Execution Model: Multithreading
• Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
• Coarse-grain multithreading
  - Only switch on a long stall (e.g., an L2-cache miss)
  - Simplifies hardware, but does not hide short stalls (e.g., data hazards)
  - If one thread stalls (e.g., on I/O), others are executed
An Example Datapath
[Figure: UltraSPARC T1 datapath; from Poonacha Kongetira, Microarchitecture of the UltraSPARC T1 CPU]
Simultaneous Multithreading
• In multiple-issue dynamically scheduled processors
  - Instruction-level parallelism across threads
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
• Example: Intel Pentium 4 HT
  - Two threads: duplicated registers, shared function units and caches
  - Known as Hyper-Threading in Intel terminology
Threads vs. Processes

Thread:
- A thread has no data segment or heap
- A thread cannot live on its own; it must live within a process
- There can be more than one thread in a process; the first thread calls main and has the process's stack
- Inexpensive creation
- Inexpensive context switching
- If a thread dies, its stack is reclaimed by the process

Process:
- A process has code/data/heap and other segments
- There must be at least one thread in a process
- Threads within a process share code/data/heap and I/O, but each has its own stack and registers
- Expensive creation
- Expensive context switching
- If a process dies, its resources are reclaimed and all threads die
Thread Implementation
The process defines the address space; threads share that address space
The Process Control Block (PCB) contains process-specific info
  - PID, owner, heap pointer, active threads, and pointers to thread info
The Thread Control Block (TCB) contains thread-specific info
  - Stack pointer, PC, thread state, registers, ...
[Figure: the process's address space holds DLLs, one stack per thread, the heap, initialized data, and code; each thread's TCB stores its $pc, $sp, state, and registers]
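As a rough picture of this bookkeeping, here is a hypothetical C sketch; the struct layouts and field names are invented for illustration and do not match any particular OS.

/* Thread Control Block: per-thread state (invented field names) */
typedef struct tcb {
    void       *stack_ptr;   /* saved $sp */
    void       *pc;          /* saved program counter */
    int         state;       /* running, ready, blocked, ... */
    long        regs[32];    /* saved general-purpose registers */
    struct tcb *next;        /* next thread belonging to this process */
} tcb_t;

/* Process Control Block: per-process state (invented field names) */
typedef struct pcb {
    int     pid;             /* process ID */
    int     owner;           /* owning user */
    void   *heap_ptr;        /* heap pointer */
    tcb_t  *threads;         /* list of active threads and their TCBs */
} pcb_t;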
Benefits
Responsiveness
  When one thread is blocked, your browser still responds
  E.g., download images while allowing your interaction
Resource sharing
  Threads share the same address space
  Reduces overhead (e.g., memory)
Economy
  Creating a new process costs memory and resources
  E.g., in Solaris, creating a process is 30 times slower than creating a thread
Utilization of MP architectures
  Threads can be executed in parallel on multiple processors
  Increases concurrency and throughput
User-level Threads
Thread management is done by a thread library in user space
Thread switching is similar to calling a procedure, so it is cheap
The user can control thread scheduling (without disturbing the underlying OS scheduler)
No OS kernel support is needed, so user-level threads are more portable
Low overhead when switching threads
Three primary thread libraries:
  POSIX Pthreads
  Java threads
  Win32 threads
Kernel Threads
A.k.a. lightweight process in the literature
Supported by the Kernel
Thread scheduling is fairer
Examples:
  Windows XP/2000
  Solaris
  Linux
  Tru64 UNIX
  Mac OS X
Multithreading Models
Many-to-One
One-to-One
Many-to-Many
Many-to-One
Many user-level threads mapped to a single kernel thread
The entire process will block if a thread makes a blocking system call
Cannot run threads in parallel on multiprocessors
Examples:
  Solaris Green Threads
  GNU Portable Threads
Many-to-One Model
One-to-One
Each user-level thread maps to a kernel thread
A blocking system call in one thread does not block the others
Enables parallel execution in an MP system
Downsides:
  Performance/memory overheads of creating kernel threads
  Restriction on the number of threads that can be supported
Examples:
  Windows NT/XP/2000
  Linux
  Solaris 9 and later
One-to-one Model
Many-to-Many Model
Allows many user-level threads to be mapped to many kernel threads
Allows the operating system to create a sufficient number of kernel threads
User threads are multiplexed onto a smaller or equal number of kernel threads; the number is specific to a particular application or machine
Examples:
  Solaris prior to version 9
  Windows NT/2000 with the ThreadFiber package
Many-to-Many Model
Pipeline Hazards
Multithreading
Multi-Tasking Paradigm
Virtual memory makes it easy
A context switch could be expensive or require extra HW
  - VIVT cache
  - VIPT cache
  - TLBs
[Figure: conventional superscalar, single-threaded execution: occupancy of function units FU1-FU4 over time, one thread (of Threads 1-5) per time quantum, with many issue slots unused]
Multi-threading Paradigm
[Figure: occupancy of function units FU1-FU4 over time for Threads 1-5 under five designs: conventional superscalar single-threaded, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), chip multiprocessor (CMP or multicore), and simultaneous multithreading (SMT)]
Conventional Multithreading
• Zero-overhead context switch
• Duplicated contexts for threads
[Figure: a register file holding four duplicated thread contexts (0:r0-0:r7, 1:r0-1:r7, 2:r0-2:r7, 3:r0-3:r7) selected by a context pointer (CtxtPtr); memory is shared by the threads]
Cycle Interleaving MT
Per-cycle, per-thread instruction fetching
Examples: HEP, Horizon, Tera MTA, MIT M-machine
Interesting questions to consider:
  Does it need a sophisticated branch predictor?
  Or does it need any speculative execution at all?
    Get rid of "branch prediction"?
    Get rid of "predication"?
  Does it need any out-of-order execution capability?
Block Interleaving MT
Context switch on a specific event (dynamic pipelining)
  Explicit switching: implementing a switch instruction
  Implicit switching: triggered when a specific instruction class is fetched
Static switching (switch upon fetching)
  Switch-on-memory-instructions: Rhamma processor
  Switch-on-branch or switch-on-hard-to-predict-branch
  The trigger can be an implicit or explicit instruction
Dynamic switching
  Switch-on-cache-miss (switch in a later pipeline stage): MIT Sparcle (the MIT Alewife node), Rhamma processor
  Switch-on-use (lazy strategy of switch-on-cache-miss)
    Wait until the last minute
    A valid bit is needed for each register: cleared when the load issues, set when the data returns
  Switch-on-signal (e.g., interrupt)
  Predicated switch instruction based on conditions
No need to support a large number of threads
Simultaneous Multithreading (SMT)
SMT name first used by UW; earlier versions from UCSB [Nemirovsky, HICSS '91] and [Hirata et al., ISCA-92]
Intel's Hyper-Threading (2-way SMT)
IBM Power7 (4/6/8 cores, 4-way SMT); IBM Power5/6 (2 cores, each 2-way SMT, 4 chips per package); Power5 has OoO cores, Power6 has in-order cores
Basic ideas: conventional MT + simultaneous issue + sharing common resources
[Figure: SMT pipeline: per-thread PCs and renamers feed shared fetch/decode, RS & ROB, and physical register files, then shared function units (FAdd, 2 cycles; FMult, 4 cycles; unpipelined Fdiv, 16 cycles; ALU1; ALU2; Load/Store, variable latency) backed by the I-CACHE and D-CACHE]
Instruction Fetching Policy
FIFO, round-robin: simple, but may be too naive
Adaptive fetching policies (a sketch of ICOUNT selection follows below)
  BRCOUNT (reduce wrong-path issuing)
    Count the number of branch instructions in the decode/rename/IQ stages
    Give top priority to the thread with the lowest BRCOUNT
  MISSCOUNT (reduce IQ clog)
    Count the number of outstanding D-cache misses
    Give top priority to the thread with the lowest MISSCOUNT
  ICOUNT (reduce IQ clog)
    Count the number of instructions in the decode/rename/IQ stages
    Give top priority to the thread with the lowest ICOUNT
  IQPOSN (reduce IQ clog)
    Give lowest priority to threads whose instructions are closest to the head of the INT or FP instruction queues, since threads with the oldest instructions are most prone to IQ clog
    No counter needed
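As a toy illustration of ICOUNT, the sketch below picks the thread with the fewest instructions in the front-end stages; the counter array and thread count are invented for the example, and real hardware would compute this with comparators rather than a loop.

#define NTHREADS 4

/* Per-thread count of instructions in the decode/rename/IQ stages. */
int icount[NTHREADS];

/* Return the thread to fetch from next: lowest ICOUNT gets priority,
 * which keeps any single thread from clogging the instruction queue. */
int select_fetch_thread(void)
{
    int best = 0;
    for (int t = 1; t < NTHREADS; t++)
        if (icount[t] < icount[best])
            best = t;
    return best;
}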
Resource Sharing
Could be tricky when threads compete for resources
Static partitioning
  Less complexity
  Could penalize threads (e.g., instruction window size)
  P4's Hyper-Threading
Dynamic sharing
  Complex
  What is fair? How to quantify fairness?
A growing concern in multi-core processors
  Shared L2, bus bandwidth, etc.
  Issues: fairness, mutual thrashing
Hyper-Threading
[Figure: 2 CPUs without Hyper-Threading (one architecture state per set of execution resources) vs. 2 CPUs with Hyper-Threading (two architecture states sharing each CPU's execution resources)]
• The implementation of Hyper-Threading adds less than 5% to the chip area
• Principle: share major logic components (functional units) and improve utilization
• Architecture state: all core pipeline resources needed for executing a thread
Multithreading with ILP: Examples
P4 Hyper-Threading Resource Partitioning
The TC (or UROM) is accessed on alternating cycles by the two logical processors, unless one is stalled on a TC miss
The µop queue (after fetch from the TC) is split in half
ROB (126/2), LB (48/2), SB (24/2) (32/2 for Prescott)
General µop queue and memory µop queue (1/2 each)
TLB (1/2?), as there is no PID
Retirement: alternates between the 2 logical processors
Alpha 21464 (EV8) Processor
Technology
Leading-edge process technology
  1.2 ~ 2.0 GHz
  0.125 µm CMOS
  SOI-compatible Cu interconnect
  Low-k dielectrics
Chip characteristics
  ~1.2 V Vdd
  ~250 million transistors
  ~1100 signal pins in flip-chip packaging
Alpha 21464 (EV8) Processor
Architecture
Enhanced out-of-order execution (that giant 2Bc-gskew predictor we discussed before is here)
Large on-chip L2 cache
Direct RAMBUS interface
On-chip router for system interconnect
Glueless, directory-based ccNUMA for up to 512-way SMP
8-wide superscalar
4-way simultaneous multithreading (SMT)
  Total die overhead ~6% (allegedly)
SMT Pipeline
[Figure: SMT pipeline stages: Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire; per-thread PCs feed the Icache, and per-thread register maps feed the shared register files and Dcache]
Source: A company once called Compaq
EV8 SMT
In SMT mode, it is as if there are 4 processors on a chip that share their caches and TLB
Replicated hardware contexts
  Program counter
  Architected registers (actually just the renaming table, since architected registers and rename registers come from the same physical pool)
Shared resources
  Rename register pool (larger than needed by 1 thread)
  Instruction queue
  Caches
  TLB
  Branch predictors
Deceased before seeing the light of day.
Reality Check, circa 200x
Conventional processor designs run out of steam
  Power wall (thermal)
  Complexity (verification)
  Physics (CMOS scaling)
[Figure: power density (W/cm²) of Intel processors from i386 and i486 through Pentium, Pentium Pro, Pentium II, and Pentium III, rising on a log scale past a hot plate toward nuclear-reactor, rocket-nozzle, and Sun's-surface (~1000 W/cm²) levels]
"Surpassed hot-plate power density in 0.5 µm; not too long to reach nuclear reactor," former Intel Fellow Fred Pollack.
Latest Power Density Trend
Yeo and Lee, “Peeling the Power Onion of Data Centers,” In Energy Efficient Thermal Management of Data Centers, Springer. To appear 2011
Reality Check, circa 200x
Conventional processor designs run out of steam
  Power wall (thermal)
  Complexity (verification)
  Physics (CMOS scaling)
Unanimous direction: multi-core
  Simple cores (in massive numbers)
  Keep wire communication on a leash
  Keep Gordon Moore happy (Moore's Law)
  Architects' menace: kick the ball to the other side of the court?
What do you (or your customers) want?
  Performance (and/or availability)
  Throughput > latency (turnaround time)
  Total cost of ownership (performance per dollar)
  Energy (performance per watt)
  Reliability and dependability, SPAM/spy free
Multi-core Processor Gala
Intel’s Multicore Roadmap
[Figure: Intel multicore roadmap, 2006-2008, for mobile, desktop, and enterprise parts: single-core (512 KB - 1 MB) and dual-core (2-4 MB, some shared) parts giving way to quad-core parts (4-16 MB shared) and 45 nm eight-core parts with 12 MB shared cache]
Source: Adapted from Tom’s Hardware
To extend Moore's Law
To delay the ultimate limit of physics
By 2010 all Intel processors delivered will be multicore
Intel's 80-core processor (an FPU array)
Is a Multi-core really better off?
If you were plowing a field, which would you rather use: Two strong oxen or 1024 cores? --- Seymour Cray
Well, it is hard to say in the computing world
Q1. For a PIPT cache with virtual memory support, three possible events can be triggered during an instruction fetch: (1) a cache lookup, (2) a TLB miss, (3) a page fault. Put these events in the order in which they occur.
(2) (3) (1): In a PIPT cache, address translation takes place prior to the cache lookup. The hardware first searches for a match in the TLB, so a TLB miss (if any) occurs first. A page table walk is then initiated; if the page has not been allocated, a page fault follows. The OS allocates the page, fills in the page table entry, and fills the translation into the TLB, followed by the cache lookup. (A toy sketch of this ordering follows below.)
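The ordering can also be seen in a toy C model of a PIPT fetch; every structure here (single-entry TLB, one page-table entry, the print statements) is invented purely to show the sequence of events.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t tlb_vpn, tlb_ppn; static bool tlb_valid = false;  /* toy 1-entry TLB */
static uint32_t pte_ppn;          static bool pte_valid = false;  /* toy 1-entry page table */

static uint32_t page_fault(uint32_t vpn) {
    printf("(3) page fault: OS allocates a frame for VPN %u\n", vpn);
    pte_ppn = 42; pte_valid = true;      /* pretend frame 42 was allocated */
    return pte_ppn;
}

static uint32_t translate(uint32_t vaddr) {
    uint32_t vpn = vaddr >> 12;          /* assume 4 KB pages */
    if (tlb_valid && tlb_vpn == vpn)
        return tlb_ppn;                  /* TLB hit: no miss, no fault */
    printf("(2) TLB miss: walking the page table\n");
    uint32_t ppn = pte_valid ? pte_ppn : page_fault(vpn);
    tlb_vpn = vpn; tlb_ppn = ppn; tlb_valid = true;   /* fill the TLB */
    return ppn;
}

static void fetch(uint32_t vaddr) {
    uint32_t paddr = (translate(vaddr) << 12) | (vaddr & 0xfff);  /* PIPT: translate first */
    printf("(1) cache lookup with physical address 0x%x\n", paddr);
}

int main(void) { fetch(0x1234); fetch(0x1238); return 0; }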
Q2. Consider a 256Meg x4 DRAM chip that consists of 2 banks, with 14-bit row addresses (256Meg indicates the number of addresses). What is the row buffer size of each bank?
256M = 2^28, so 28 address bits are needed. One bit is used for the bank index; hence the column address = 28 - 1 - 14 = 13 bits. As the DRAM is a "x4" configuration, the row buffer of one bank = 2^13 × 4 bits = 32 Kbits = 4 KB.
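The same arithmetic written out as a quick self-check in C (the variable names are just for the example):

#include <stdio.h>

int main(void)
{
    int addr_bits = 28;     /* 256M addresses = 2^28 */
    int bank_bits = 1;      /* 2 banks */
    int row_bits  = 14;     /* given 14-bit row address */
    int width     = 4;      /* "x4": 4 bits per column */

    int col_bits = addr_bits - bank_bits - row_bits;        /* 13 */
    long buf_bits = (1L << col_bits) * width;               /* 2^13 * 4 = 32768 */
    printf("column bits = %d, row buffer = %ld Kbits = %ld KB\n",
           col_bits, buf_bits / 1024, buf_bits / 8 / 1024); /* 13, 32, 4 */
    return 0;
}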
Q3. Assume an Inverted Page Table (8-entry IPT) is used by a 32-bit OS. The memory page size is 256 KB. The complete IPT content is shown below; the Physical Page Number (PPN) runs from 0 to 7 from the top of the table. Three active processes, P1 (PID=1), P2 (PID=2), and P3 (PID=3), are running in the system, and the IPT holds the translation for the entire physical memory. Answer the following questions.
Based on the size of the Inverted Page Table above, what is the size of the physical memory? There are 8 entries in the IPT. As each page is 256 KB, the size of the physical memory = 8 × 256 KB = 2 MB.
IBM Watson Jeopardy! Competition
POWER7 chips (2,880 cores) + 16 TB memory
Massively parallel processing
Combines processing power, natural language processing, AI, search, and knowledge extraction
Major Challenges for Multi-Core Designs
Communication
  Memory hierarchy
  Data allocation (you have a large shared L2/L3 now)
  Interconnection network
    AMD HyperTransport
    Intel QPI
  Scalability
  Bus bandwidth: how to get there?
Power-performance: win or lose?
  Borkar's multicore arguments
    15% per-core performance drop, 50% power saving
    A giant single core wastes power when the task is small
  How about leakage?
Process variation and yield
Programming model