
EEC 581 Architecture

Multicore Architecture

Department of Electrical Engineering and Computer Science Cleveland State University

Multiprocessor Architectures

• Late 1950s: one general-purpose and one or more special-purpose processors for input and output operations
• Early 1960s: multiple complete processors, used for program-level concurrency
• Mid-1960s: multiple partial processors, used for instruction-level concurrency
• Single-Instruction Multiple-Data (SIMD) machines
• Multiple-Instruction Multiple-Data (MIMD) machines
• A primary focus of this chapter is MIMD machines (multiprocessors)

Thread Level Parallelism (TLP)

• Multiple threads of execution
• Exploit ILP in each thread
• Exploit concurrent execution across threads


Instruction and Data Streams

• Taxonomy due to M. Flynn

• SISD (single instruction stream, single data stream): Intel Pentium 4
• SIMD (single instruction stream, multiple data streams): SSE instructions of x86
• MISD (multiple instruction streams, single data stream): no examples today
• MIMD (multiple instruction streams, multiple data streams): Intel Xeon e5345
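To make the SIMD row above concrete, here is a small sketch using x86 SSE intrinsics (assumed available through the standard <xmmintrin.h> header on an SSE-capable compiler): one vector add operates on four data elements at once, which is exactly the single-instruction, multiple-data idea.

```c
/* Build with: gcc -msse simd_add.c -o simd_add */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load four floats             */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* ONE instruction, FOUR sums   */
    _mm_storeu_ps(c, vc);

    printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```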

Example: Multithreading (MT) in a single address space


Recall the Executable Format

Object file ready to be linked and loaded

(Figure: an object file, consisting of header, text, static data, reloc, symbol table, and debug sections, is combined with static libraries by the linker; the loader then produces an executable instance.)

What does a loader do?



Process

• A process is a running program with state
  - Stack, memory, open files
  - PC, registers
• The OS keeps track of the state of all processes
  - E.g., for scheduling processes
• There may be many processes for the same application
  - E.g., a web browser
• See an operating systems class for details

(Figure: process address space, showing stack, heap, static data, code, and DLLs.)



Categories of Concurrency

• Categories of concurrency:
  - Physical concurrency: multiple independent processors (multiple threads of control)
  - Logical concurrency: the appearance of physical concurrency is presented by time-sharing one processor (software can be designed as if there were multiple threads of control)
  - Coroutines (quasi-concurrency) have a single thread of control
• A thread of control in a program is the sequence of program points reached as control flows through the program

Motivations for the Use of Concurrency

• Multiprocessors capable of physical concurrency are now widely used
• Even if a machine has just one processor, a program written to use concurrent execution can be faster than the same program written for nonconcurrent execution
• Concurrency involves a different way of designing software that can be very useful; many real-world situations involve concurrency
• Many program applications are now spread over multiple machines, either locally or over a network

Introduction to Subprogram-Level Concurrency

• A task, process, or thread is a program unit that can be in concurrent execution with other program units
• Tasks differ from ordinary subprograms in that:
  - A task may be implicitly started
  - When a program unit starts the execution of a task, it is not necessarily suspended
  - When a task's execution is completed, control may not return to the caller
• Tasks usually work together

Two General Categories of Tasks

• Heavyweight tasks execute in their own address space
• Lightweight tasks all run in the same address space; more efficient
• A task is disjoint if it does not communicate with or affect the execution of any other task in the program in any way

Task Synchronization

• A mechanism that controls the order in which tasks execute
• Two kinds of synchronization:
  - Cooperation synchronization
  - Competition synchronization
• Task communication is necessary for synchronization, provided by:
  - Shared nonlocal variables
  - Parameters
  - Message passing

Kinds of synchronization

• Cooperation: Task A must wait for task B to complete some specific activity before task A can continue its execution, e.g., the producer-consumer problem
• Competition: Two or more tasks must use some resource that cannot be simultaneously used, e.g., a shared counter
• Competition is usually provided by mutually exclusive access (approaches are discussed later); a minimal sketch follows
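POSIX Pthreads (one of the thread libraries discussed later in these slides) provides mutually exclusive access through a mutex. The sketch below is illustrative only; the counter name, thread count, and iteration count are invented for the example.

```c
/* Build with: gcc competition.c -o competition -pthread */
#include <pthread.h>
#include <stdio.h>

static long shared_counter = 0;   /* the resource the tasks compete for */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* competition synchronization:   */
        shared_counter++;              /* only one task in here at once  */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", shared_counter);   /* 400000 with the mutex */
    return 0;
}
```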

Process Level Parallelism

(Figure: several independent processes running side by side.)

• Parallel processes and throughput computing
• Each process itself does not run any faster


From Processes to Threads

• Switching processes on a core is expensive
  - A lot of state information to be managed
• If I want concurrency, launching a process is expensive
• How about splitting up a single process into parallel computations?
  - Lightweight processes, or threads!



A Thread

• A separate, concurrently executable instruction stream within a process
• Minimum amount of state to execute on a core
  - PC, registers, stack
  - Remaining state shared with the parent process
    - Memory and files
• Support for creating threads
• Support for merging/terminating threads
• Support for synchronization between threads
  - In accesses to shared data
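A brief Pthreads sketch of these bullet points; the names are illustrative. Each created thread runs on its own stack (so the local variable below has a different address in each thread), while static data such as the global is shared, and joining the threads is the merge/terminate step.

```c
/* Build with: gcc thread_state.c -o thread_state -pthread */
#include <pthread.h>
#include <stdio.h>

int shared_global = 42;                 /* static data: one copy, shared by all threads */

static void *show_state(void *arg) {
    int on_my_stack = *(int *)arg;      /* lives on this thread's private stack */
    printf("thread %d: &on_my_stack=%p  &shared_global=%p\n",
           on_my_stack, (void *)&on_my_stack, (void *)&shared_global);
    return NULL;                        /* terminating the thread */
}

int main(void) {
    pthread_t t1, t2;
    int id1 = 1, id2 = 2;
    pthread_create(&t1, NULL, show_state, &id1);   /* creating threads            */
    pthread_create(&t2, NULL, show_state, &id2);
    pthread_join(t1, NULL);                        /* merging (waiting for) them  */
    pthread_join(t2, NULL);
    return 0;
}
```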


TLP

• ILP of a single program is hard
  - Large ILP is far-flung
  - We are human after all; we program with a sequential mind
• Reality: running multiple threads or programs
• Thread Level Parallelism
  - Time multiplexing
  - Throughput computing
    - Multiple program workloads
    - Multiple concurrent threads
  - Helper threads to improve single-program performance



Single and Multithreaded Processes


A Simple Example

Data Parallel Computation
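The slide's figure is not reproduced here, but the idea of data-parallel computation can be sketched as follows (all names and sizes are illustrative): every thread runs the same code on its own disjoint slice of an array, and the partial results are combined at the end.

```c
/* Build with: gcc parsum.c -o parsum -pthread */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double data[N];
static double partial[NTHREADS];        /* one result slot per thread */

static void *sum_slice(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)      /* same computation ...        */
        s += data[i];                   /* ... on this thread's slice  */
    partial[id] = s;
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, sum_slice, (void *)id);

    double total = 0.0;
    for (long id = 0; id < NTHREADS; id++) {
        pthread_join(t[id], NULL);
        total += partial[id];
    }
    printf("sum = %.0f\n", total);      /* expect 1000000 */
    return 0;
}
```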



Thread Execution: Basics

(Figure: a main thread calls create_thread(funcA) and create_thread(funcB), then WaitAllThreads(); each thread gets its own PC, registers, stack pointer, and stack, runs funcA() or funcB(), and calls end_thread(); the heap and static data are shared.)

Examples of Threads

• A web browser
  - One thread displays images
  - One thread retrieves data from the network
• A word processor
  - One thread displays graphics
  - One thread reads keystrokes
  - One thread performs spell checking in the background
• A web server
  - One thread accepts requests
  - When a request comes in, a separate thread is created to service it
  - Many threads to support thousands of client requests
• RPC or RMI (Java)
  - One thread receives the message
  - The message service uses another thread



Thread Execution on a Single Core

• Hardware threads
  - Each thread has its own hardware state
• Switching between threads on each cycle to share the core – why?

(Figure: two MIPS instruction streams interleaved cycle by cycle through the IF/ID/EX/MEM/WB pipeline.
Thread #1: lw $t0, label($0); lw $t1, label1($0); and $t2, $t0, $t1; andi $t3, $t1, 0xffff; srl $t2, $t2, 12; ...
Thread #2: lw $t3, 0($t0); add $t2, $t2, $t3; addi $t0, $t0, 4; addi $t1, $t1, -1; bne $t1, $zero, loop; ...
Interleaved execution improves utilization: no stall on the load-to-use hazard.)


Execution Model: Multithreading

• Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
• Coarse-grain multithreading
  - Only switch on a long stall (e.g., L2-cache miss)
  - Simplifies hardware, but does not hide short stalls (e.g., data hazards)
  - If one thread stalls (e.g., I/O), others are executed

An Example Datapath

(Figure: an example datapath of the UltraSPARC T1 CPU; source: Poonacha Kongetira.)


Simultaneous Multithreading

• In multiple-issue dynamically scheduled processors
  - Instruction-level parallelism across threads
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
• Example: Intel Pentium 4 HT
  - Two threads: duplicated registers, shared function units and caches
  - Known as Hyperthreading in Intel terminology


Threads vs. Processes

Thread:
• A thread has no data segment or heap
• A thread cannot live on its own; it must live within a process
• There can be more than one thread in a process; the first thread calls main and has the process's stack
• Inexpensive creation
• Inexpensive context switching
• If a thread dies, its stack is reclaimed by the process

Process:
• A process has code/data/heap and other segments
• There must be at least one thread in a process
• Threads within a process share code/data/heap and share I/O, but each has its own stack and registers
• Expensive creation
• Expensive context switching
• If a process dies, its resources are reclaimed and all threads die
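One way to see this contrast in running code (a sketch, assuming a POSIX system; the variable name is illustrative): a forked child gets its own copy of the data segment, so its write is invisible to the parent, while a thread shares the creator's data segment.

```c
/* Build with: gcc proc_vs_thread.c -o proc_vs_thread -pthread */
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int value = 0;                            /* data segment */

static void *thread_body(void *arg) {
    value = 1;                            /* threads share data: visible to the creator */
    return NULL;
}

int main(void) {
    pid_t pid = fork();                   /* heavyweight: child gets its own copy of value */
    if (pid == 0) {
        value = 100;                      /* modifies the child's copy only */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("after fork:   value = %d\n", value);   /* still 0 */

    pthread_t t;                          /* lightweight: same address space */
    pthread_create(&t, NULL, thread_body, NULL);
    pthread_join(t, NULL);
    printf("after thread: value = %d\n", value);   /* now 1 */
    return 0;
}
```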


Thread Implementation

• A process defines the address space; threads share that address space
• The Process Control Block (PCB) contains process-specific info
  - PID, owner, heap pointer, active threads, and pointers to thread info
• The Thread Control Block (TCB) contains thread-specific info
  - Stack pointer, PC, thread state, registers, ...

(Figure: the process's address space, with reserved region, DLLs, one stack per thread, heap, initialized data, and code, alongside a TCB per thread holding $pc, $sp, state, and registers.)
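The PCB/TCB split can be pictured with a pair of hypothetical C structs; the field names and sizes below are illustrative, not the layout of any particular OS.

```c
#include <stdint.h>

/* Hypothetical thread control block: per-thread execution state. */
struct tcb {
    uint64_t    pc;              /* program counter                      */
    uint64_t    sp;              /* stack pointer (top of private stack) */
    uint64_t    regs[31];        /* saved general-purpose registers      */
    int         state;           /* READY, RUNNING, BLOCKED, ...         */
    struct tcb *next;            /* link in the process's thread list    */
};

/* Hypothetical process control block: state shared by all its threads. */
struct pcb {
    int         pid;             /* process identifier          */
    int         owner_uid;       /* owning user                 */
    void       *page_table;      /* the shared address space    */
    void       *heap_ptr;        /* heap pointer                */
    int         open_files[16];  /* shared open-file handles    */
    struct tcb *threads;         /* list of this process's TCBs */
};
```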


Benefits

• Responsiveness
  - When one thread is blocked, your browser still responds
  - E.g., download images while allowing your interaction
• Resource sharing
  - Share the same address space
  - Reduce overhead (e.g., memory)
• Economy
  - Creating a new process costs memory and resources
  - E.g., in Solaris, creating a process is 30 times slower than creating a thread
• Utilization of MP architectures
  - Threads can be executed in parallel on multiple processors
  - Increase concurrency and throughput

User-level Threads

 Thread management done by user-level threads library

 Similar to calling a procedure

 Thread management is done by the thread library in user space

 User can control the thread scheduling (No disturbing the underlying OS scheduler)

• No OS kernel support
  - More portable
  - Low overhead when thread switching

• Three primary thread libraries:
  - POSIX Pthreads
  - Java threads
  - Win32 threads

Kernel Threads

 A.k.a. lightweight process in the literature

 Supported by the Kernel

 Thread scheduling is fairer

• Examples
  - Windows XP/2000
  - Solaris
  - Linux
  - Tru64 UNIX
  - Mac OS X

Multithreading Models

 Many-to-One

 One-to-One

 Many-to-Many

Many-to-One

 Many user-level threads mapped to one single kernel thread

 The entire process will block if a thread makes a blocking system call

 Cannot run threads in parallel on multiprocessors

• Examples
  - Solaris Green Threads
  - GNU Portable Threads

Many-to-One Model

One-to-One

• Each user-level thread maps to a kernel thread

 Do not block other threads when one is making a blocking system call

 Enable parallel execution in an MP system

• Downside:
  - Performance/memory overheads of creating kernel threads
  - Restriction on the number of threads that can be supported

• Examples
  - Windows NT/XP/2000
  - Linux
  - Solaris 9 and later

One-to-one Model

Many-to-Many Model

 Allows many user level threads to be mapped to many kernel threads

 Allows the operating system to create a sufficient number of kernel threads

 Threads are multiplexed to a smaller (or equal) number of kernel threads which is specific to a particular application or a particular machine

• Solaris prior to version 9
• Windows NT/2000 with the ThreadFiber package

Many-to-Many Model

Pipeline Hazards


Multithreading



Multi-Tasking Paradigm

• Virtual memory makes it easy
• Context switch could be expensive or requires extra HW
  - VIVT cache
  - VIPT cache
  - TLBs

(Figure: functional units FU1-FU4 over execution time for a conventional single-threaded superscalar; threads 1-5 each run for a time quantum, with many issue slots left unused.)


Multi-threading Paradigm

(Figure: functional-unit occupancy (FU1-FU4) over execution time for five organizations: conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), chip multiprocessor (CMP or multicore), and simultaneous multithreading (SMT).)


Conventional Multithreading

• Zero-overhead context switch
• Duplicated contexts for threads

(Figure: four duplicated register contexts, 0:r0-0:r7 through 3:r0-3:r7, selected by a context pointer (CtxtPtr), with memory shared by the threads.)


Cycle Interleaving MT

• Per-cycle, per-thread instruction fetching
• Examples: HEP, Horizon, Tera MTA, MIT M-machine
• Interesting questions to consider
  - Does it need a sophisticated branch predictor?
  - Or does it need any at all?
  - Get rid of "branch prediction"?
  - Get rid of "predication"?
  - Does it need any out-of-order execution capability?


Block Interleaving MT

• Context switch on a specific event (dynamic pipelining)
  - Explicit switching: implementing a switch instruction
  - Implicit switching: triggered when a specific instruction class is fetched
• Static switching (switch upon fetching)
  - Switch-on-memory-instructions: Rhamma processor
  - Switch-on-branch or switch-on-hard-to-predict-branch
  - Trigger can be an implicit or explicit instruction
• Dynamic switching
  - Switch-on-cache-miss (switch in a later pipeline stage): MIT Sparcle (MIT Alewife's node), Rhamma processor
  - Switch-on-use (lazy strategy of switch-on-cache-miss)
    - Wait until the last minute
    - Valid bit needed for each register: cleared when the load is issued, set when the data returns
  - Switch-on-signal (e.g., interrupt)
  - Predicated switch instruction based on conditions
• No need to support a large number of threads


Simultaneous Multithreading (SMT)

• SMT name first used by UW; earlier versions from UCSB [Nemirovsky, HICSS '91] and [Hirata et al., ISCA-92]
• Intel's HyperThreading (2-way SMT)
• IBM Power7 (4/6/8 cores, 4-way SMT); IBM Power5/6 (2 cores, each 2-way SMT, 4 chips per package): Power5 has OoO cores, Power6 in-order cores
• Basic ideas: conventional MT + simultaneous issue + sharing common resources

(Figure: an SMT pipeline: per-thread PCs and rename tables feed shared RS & ROB and a shared physical register file; the shared function units include FDiv (16 cycles, unpipelined), FMult (4 cycles), FAdd (2 cycles), ALU1, ALU2, and Load/Store (variable latency), backed by shared I-cache and D-cache.)


Instruction Fetching Policy

• FIFO, round robin: simple but may be too naive
• Adaptive fetching policies
  - BRCOUNT (reduce wrong-path issuing)
    - Count # of branch instructions in the decode/rename/IQ stages
    - Give top priority to the thread with the least BRCOUNT
  - MISSCOUNT (reduce IQ clog)
    - Count # of outstanding D-cache misses
    - Give top priority to the thread with the least MISSCOUNT
  - ICOUNT (reduce IQ clog)
    - Count # of instructions in the decode/rename/IQ stages
    - Give top priority to the thread with the least ICOUNT
  - IQPOSN (reduce IQ clog)
    - Give lowest priority to threads with instructions closest to the head of the INT or FP instruction queues, since threads with the oldest instructions are most prone to IQ clog
    - No counter needed
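As an illustration only (not the actual fetch hardware), the ICOUNT policy reduces to choosing, each cycle, the thread with the fewest instructions sitting in the pre-issue stages; the array and thread count below are assumed names for the sketch.

```c
#include <limits.h>

#define NTHREADS 4

/* in_flight[t]: number of thread t's instructions currently in the
 * decode, rename, and issue-queue stages (maintained by the pipeline). */
int in_flight[NTHREADS];

/* ICOUNT: fetch from the thread with the fewest in-flight instructions,
 * which keeps any one thread from clogging the issue queue. */
int icount_pick(void) {
    int best = 0, best_count = INT_MAX;
    for (int t = 0; t < NTHREADS; t++) {
        if (in_flight[t] < best_count) {
            best_count = in_flight[t];
            best = t;
        }
    }
    return best;   /* thread to fetch from this cycle */
}
```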


Resource Sharing

• Could be tricky when threads compete for the resources
• Static
  - Less complexity
  - Could penalize threads (e.g., size)
  - P4's Hyperthreading
• Dynamic
  - Complex
  - What is fair? How to quantify fairness?
• A growing concern in multi-core processors
  - Shared L2, bandwidth, etc.
  - Issues
    - Fairness
    - Mutual thrashing


Hyper-Threading

(Figure: two CPUs without Hyper-Threading, each with one architectural state and its own processor execution resources, versus two CPUs with Hyper-Threading, where each processor holds two architectural states sharing one set of execution resources.)

• Implementation of Hyper-Threading adds less than 5% to the chip area
• Principle: share major logic components (functional units) and improve utilization
• Architecture state: all core pipeline resources needed for executing a thread



Multithreading with ILP: Examples


P4 HyperThreading Resource Partitioning

• TC (or UROM) is alternately accessed per cycle for each logical processor, unless one is stalled due to a TC miss
• µop queue (into ½) after µops are fetched from the TC
• ROB (126/2)
• LB (48/2)
• SB (24/2) (32/2 for Prescott)
• General µop queue and memory µop queue (1/2)
• TLB (½?) as there is no PID
• Retirement: alternating between the 2 logical processors


Alpha 21464 (EV8) Processor

Technology

• Leading-edge process technology – 1.2 ~ 2.0 GHz
  - 0.125 µm CMOS
  - SOI-compatible
  - Cu interconnect
  - Low-k dielectrics

• Chip characteristics
  - ~1.2 V Vdd
  - ~250 million transistors
  - ~1100 signal pins in flip-chip packaging


24 Alpha 21464 (EV8) Processor

Architecture

• Enhanced out-of-order execution (that giant 2Bc-gskew predictor we discussed before is here)
• Large on-chip L2 cache
• Direct RAMBUS interface
• On-chip router for system interconnect
• Glueless, directory-based, ccNUMA for up to 512-way SMP
• 8-wide superscalar
• 4-way simultaneous multithreading (SMT)
• Total die overhead ~6% (allegedly)


SMT Pipeline

(Figure: the SMT pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store, Reg, Retire, with per-thread PCs feeding the Icache, a register map feeding the register files, and the Dcache accessed by loads and stores.)


Source: A company once called Compaq

EV8 SMT

• In SMT mode, it is as if there are 4 processors on a chip that share their caches and TLB
• Replicated hardware contexts
  - Program counter
  - Architected registers (actually just the renaming table, since architected registers and rename registers come from the same physical pool)
• Shared resources
  - Rename register pool (larger than needed by 1 thread)
  - Instruction queue
  - Caches
  - TLB
  - Branch predictors
• Deceased before seeing the daylight


Reality Check, circa 200x

• Conventional processor designs run out of steam
  - Power wall (thermal)
  - Complexity (verification)
  - Physics (CMOS scaling)

(Figure: power density in watts/cm², log scale from 1 to 1000, for the i386, i486, Pentium, Pentium Pro, Pentium II, and Pentium III processors, climbing toward hot-plate, nuclear-reactor, rocket-nozzle, and Sun's-surface levels. "Surpassed hot-plate power density in 0.5 µm; not too long to reach nuclear reactor," former Intel Fellow Fred Pollack.)


Latest Power Density Trend

Yeo and Lee, “Peeling the Power Onion of Data Centers,” In Energy Efficient Thermal Management of Data Centers, Springer. To appear 2011


Reality Check, circa 200x

• Conventional processor designs run out of steam
  - Power wall (thermal)
  - Complexity (verification)
  - Physics (CMOS scaling)
• Unanimous direction
  - Multi-core
  - Simple cores (massive number)
  - Keep
    - Wire communication on a leash
    - Gordon Moore happy (Moore's Law)
  - Architects' menace: kick the ball to the other side of the court?
• What do you (or your customers) want?
  - Performance (and/or availability)
  - Throughput > latency (turnaround time)
  - Total cost of ownership (performance per dollar)
  - Energy
  - Reliability and dependability, SPAM/spy free


Multi-core Processor Gala


Intel’s Multicore Roadmap

(Figure: Intel's 2006-2008 roadmap for mobile, desktop, and enterprise processors, moving from single-core parts with 512 KB-1 MB of cache, through dual-core parts with 2-16 MB and quad-core parts with 4-16 MB of shared cache, to 8-core parts with 12 MB of shared cache at 45 nm.)

Source: Adapted from Tom's Hardware

• To extend Moore's Law
• To delay the ultimate limit of physics
• By 2010, all Intel processors delivered will be multicore
• Intel's 80-core processor (FPU array)


Is a Multi-core Really Better Off?

If you were plowing a field, which would you rather use: Two strong oxen or 1024 cores? --- Seymour Cray

Well, it is hard to say in Computing World


Q1. For a PIPT cache with virtual memory support, three possible events can be triggered during an instruction fetch: (1) a cache lookup, (2) a TLB miss, (3) a page fault. Please order these events in the correct order of their occurrences.

(2) (3) (1): In a PIPT cache, address translation takes place prior to the cache lookup. The hardware first searches for a match in the TLB, so a TLB miss (if any) occurs first. A page table walk is then initiated; if the page has not been allocated, a page fault follows. The OS then allocates the page, fills in the page table entry, and fills the translation into the TLB, after which the cache lookup proceeds.

Q2. Given a 256Meg x4 DRAM chip which consists of 2 banks, with 14-bit row addresses. (256Meg indicates the number of addresses.) What is the row buffer size for each bank?

256M → 28 address bits are needed. One bit is used for the bank index; hence, the column address = 28 − 1 − 14 = 13 bits. As the DRAM is a "x4" configuration, one row buffer of a bank = 2^13 × 4 bits = 32 Kbits = 4 KB.


Q3. Assume an Inverted Page Table (8-entry IPT) is used by a 32-bit OS. The memory page size is 256KB. The complete IPT content is shown below. The Physical Page Number (PPN) starts from 0 to 7 from the top of the table. There are three active processes, P1 (PID=1), P2 (PID=2) and P3 (PID=3), running in the system, and the IPT holds the translation for the entire physical memory. Answer the following questions.

Based on the size of the Inverted Page Table above, what is the size of the physical memory? There are 8 entries in the IPT. As each page is 256KB, the size of the physical memory = 8 × 256KB = 2MB.


IBM Watson Jeopardy! Competition

• POWER7 chips (2,880 cores) + 16TB memory
• Massively parallel processing
• Combine: processing power, natural language processing, AI, search, knowledge extraction


Major Challenges for Multi-Core Designs

• Communication
  - Data allocation (you have a large shared L2/L3 now)
  - Interconnection network
    - AMD HyperTransport
    - Intel QPI
  - Bus bandwidth, how to get there?
• Power-performance: win or lose?
  - Borkar's multicore arguments
    - 15% per-core performance drop
    - 50% power saving
  - A giant, single core wastes power when the task is small
  - How about leakage?
• Process variation and …
• Programming model

