
The University of Texas at Arlington

Lecture 3

CSE 5343/4342 Embedded Systems II
Based heavily on slides by Dr. Roger Walker

Introduction to Multi-Core Architecture (Textbook - Chapter 1)


Concurrency in Software

• Most naïve users think of computer behavior as simple (e.g., just browsing the network).
• Meanwhile, however, the computer's OS has many processes running, many of them multi-threaded (GUI in the foreground, processing in the background).
• Concurrency allows multiple processes or threads (code representing different subsystems) to be in the execution state at the same time.
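As a minimal sketch of what concurrency looks like at the software level (the thread names and counts are illustrative, not from the slides), the following C program uses POSIX threads to put two instruction streams in the execution state at once; on a single core the OS interleaves them, and on multiple cores they may truly run in parallel:

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread executes this function independently of the other. */
    static void *worker(void *arg) {
        const char *name = arg;
        for (int i = 0; i < 3; i++)
            printf("%s: step %d\n", name, i);  /* output may interleave */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, "thread 1");
        pthread_create(&t2, NULL, worker, "thread 2");
        pthread_join(t1, NULL);  /* wait for both threads to finish */
        pthread_join(t2, NULL);
        return 0;
    }

Compile with gcc -pthread; the interleaving of the two threads' output varies from run to run.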


Parallel Computing

• Allows simultaneous execution of threads or processes – not the same thing as concurrent execution.
• Concurrency, once it is performed simultaneously, is parallelism.
• Computer architectures can be classified along two dimensions: the number of instruction streams that can be processed at any given time, and the number of data streams that can be processed at any given time.

Flynn’s Taxonomy

• Single Instruction Single Data (SISD)

• Multiple Instruction Single Data (MISD) – not practical

• Single Instruction Multiple Data (SIMD)

• Multiple Instruction Multiple Data (MIMD) – most modern architectures
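A rough illustration of the two most common categories (this sketch is mine, not from the slides): a loop that applies the same operation across an array is the kind of code a compiler can map onto SIMD instructions, while independent threads executing different code on different data correspond to MIMD.

    #include <stdio.h>

    #define N 8

    int main(void) {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        /* One instruction (add) applied across multiple data elements;
           an optimizing compiler (e.g., gcc -O3) can emit SIMD
           instructions for this loop. */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        for (int i = 0; i < N; i++)
            printf("%.0f ", c[i]);
        printf("\n");
        return 0;
    }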

Moore’s Law

The number of transistors on integrated circuits doubles approximately every 18 to 24 months.


Parallel Computing in Microprocessors

• Many people mistakenly think Moore’s Law is a predictor of clock speed, since clock speed used to follow it (0.1 MHz to 3 GHz).
• Parallelism can use the additional transistors more easily than increasing clock rates can:
– Instruction-Level Parallelism (ILP) – dynamic, out-of-order processing at the hardware level that avoids stalls
– Superscalar processors can use non-overlapping components of the CPU (multiple pipelines) to exploit ILP
• Multiple processes or threads are running on a computer today (at the software level):
– concurrent processing (preemptive)
– simultaneous thread processing (multiple processors, HT), which can easily make use of hardware parallelism or ILP


Threads

• A thread is a discrete sequence of related instructions that is executed independently of other instruction sequences.
• Hardware-level definition: a thread is an execution path that remains independent of other hardware execution paths.
• The OS maps software threads to hardware execution resources.
• A thread only needs the architecture state – registers, interrupt control, and the like (the execution units can be shared).
• Logical processors can be created by duplicating the architecture state.
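A Linux-specific sketch of the OS mapping software threads onto logical processors (sched_getcpu() is a glibc extension, so this is illustrative rather than portable):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Report which logical processor the OS scheduled this thread on. */
    static void *report(void *arg) {
        printf("software thread %ld -> logical CPU %d\n",
               (long)arg, sched_getcpu());
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, report, (void *)i);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

On a multi-core machine the four software threads typically land on different logical processors; with a single logical processor they all report the same one.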


What Are Threads Good For?

• Improving responsiveness of GUIs
• Making programs easier to understand
• Overlapping computation and I/O
• Improving performance through parallel execution – if the hardware allows
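As a sketch of the third point (the sleep() call stands in for a blocking read; a real program would block on actual I/O), one thread waits on "I/O" while the main thread keeps computing:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *io_thread(void *arg) {
        (void)arg;
        sleep(1);                 /* simulates a blocking I/O call */
        printf("I/O finished\n");
        return NULL;
    }

    int main(void) {
        pthread_t io;
        pthread_create(&io, NULL, io_thread, NULL);

        double sum = 0.0;         /* useful computation overlaps the I/O */
        for (long i = 1; i < 10000000L; i++)
            sum += 1.0 / i;

        pthread_join(io, NULL);
        printf("sum = %f\n", sum);
        return 0;
    }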


Thread Concurrency vs. Parallelism

Concurrency: two or more threads are in progress at the same time.

[Figure: Thread 1 and Thread 2 interleaved on a single core]

Parallelism: two or more threads are executing at the same time – multiple cores needed.

[Figure: Thread 1 and Thread 2 running simultaneously on separate cores]


Thread Level Parallelism

• Time-sliced multi-threading – single processor
• Simultaneous multi-threading – with superscalar support (see next slide)
• Multiple processors – multiple threads or processes run simultaneously on multiple processors
• Physical processor – includes many resources, including the architecture state (registers), caches, execution units, etc.


Simultaneous Multi-Threading

• Simultaneous multi-threading (SMT) – the actual execution units are shared by the different logical processors.
• Intel’s implementation is called Hyper-Threading (HT).
• To the OS (e.g., Linux or Windows), the computing unit appears as multiple physical processors, and threads are scheduled accordingly.


Intel’s HT Technology Implementation

• Two logical processors: architecture state and APIC¹ duplicated
• Shared execution units: caches, branch predictors, control logic, and buses

[Figure: one physical processor with a duplicated architecture state and Advanced Programmable Interrupt Control per logical processor, sharing the on-die cache, processor execution resources, and system bus]

¹ APIC: Advanced Programmable Interrupt Controller; handles interrupts sent to a specified logical processor.

CPU Architectures

[Figure: CPU architectures, from the textbook]


HT, Single Processor, Multiple Processor

[Figure: Thread 1 and Thread 2 scheduled on a single processor (time-sliced, with idle gaps), on an HT processor, and on dual processors]


Understanding Performance

• Parallel processing speedup:
Speedup = Time_best_sequential / Time_parallel_implementation

• Amdahl’s Law on speeding up a fraction of execution:
Total Speedup = 1 / [(1 - Fraction_Enhanced) + (Fraction_Enhanced / Speedup_Enhanced)]


Amdahl Example

• Amdahl’s Law on speeding up a fraction of execution:
Total Speedup = 1 / [(1 - Fraction_Enhanced) + (Fraction_Enhanced / Speedup_Enhanced)]
• Example: speed up half the program by 15% using parallel processing; then
Total Speedup = 1 / [(1 - 0.5) + (0.5 / 1.15)] ≈ 1.07
Thus the whole-program speedup is about 7 percent.
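A small numeric check of this example (the helper function name is mine):

    #include <stdio.h>

    /* Amdahl: total speedup when `fraction` of the run time
       is accelerated by a factor of `speedup`. */
    static double amdahl(double fraction, double speedup) {
        return 1.0 / ((1.0 - fraction) + fraction / speedup);
    }

    int main(void) {
        /* Half the program sped up by 15%: prints 1.07. */
        printf("total speedup = %.2f\n", amdahl(0.5, 1.15));
        return 0;
    }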


Amdahl’s Law Applied to Parallel Processing

• Expressing Amdahl’s Law in terms of the serial and parallel portions:
Speedup = 1 / [S + (1 - S)/n]
where S is the fraction of time spent executing the serial portion of the program and n is the number of execution cores.
• If n = 1, then there is no speedup.
• As n increases, the speedup approaches 1/S (e.g., if S is 10%, the maximum speedup is 10-fold).
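A short sketch of the 1/S limit (the parameter values are illustrative): with S = 0.1, the speedup climbs toward, but never exceeds, 10.

    #include <stdio.h>

    int main(void) {
        double S = 0.1;  /* serial fraction of the program */
        for (int n = 1; n <= 1024; n *= 4)
            printf("n = %4d  speedup = %5.2f\n",
                   n, 1.0 / (S + (1.0 - S) / n));
        return 0;  /* speedup approaches 1/S = 10 as n grows */
    }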


Parallel Code vs. Parallel Processors

• For 2 cores and a 30% parallelized program, speedup = 1/(0.7 + 0.3/2), i.e., execution time falls to ≈ 85% of the original.
• With 4 cores: 1/(0.7 + 0.3/4), i.e., ≈ 77.5% of the original execution time.
• Doubling the parallel portion (to 60%) with 2 cores: 1/(0.4 + 0.6/2), i.e., ≈ 70% of the original execution time.
• Thus only when the program is highly parallelized does adding more processors really help.
• (Calculations assume the parallel portion is arbitrarily parallelizable.)


Amdahl’s Law with Multi-core Threading Overhead

• H(n) – actual OS overhead plus inter-thread activities (synchronization and other forms of communication).
• Speedup = 1 / [S + (1 - S)/n + H(n)] – can be less than 1 if H(n) is high.
• Keep the H(n) due to threading small.


Amdahl’s Law with HT

• In HT, although multiple threads are running simultaneously, each of them runs slower. If each thread runs approximately k percent slower, then:
– Speedup = 1 / [S + (1 - k)(1 - S)/n + H(n)]
• E.g., if threads run 30% slower, then:
– Speedup = 1 / [S + 0.67(1 - S)/n + H(n)]


Amdahl’s Law – Assumptions Made

• The best-performing serial algorithm is limited by the availability of CPU cycles (no I/O). However, a multi-core processor with separate caches will reduce memory latency.
• The serial algorithm is the best possible solution. However, some problems may work better (larger speedup) with a parallel algorithm.
• As the number of cores increases, the problem stays the same size. However, with more resources, the problem grows to meet the available resources.


Gustafson’s Law

• Gustafson achieved near-linear speedup using a 1,024-processor hypercube at Sandia in the late 1980s.
• Gustafson’s Law: Scaled Speedup = N - (N - 1)·s, where s is the non-parallelizable part of the work and N is the number of processors.
• Addresses the last assumption (fixed problem size) in Amdahl’s Law by assuming a fixed-time concept, which leads to scaled speedup.
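A brief sketch contrasting the two laws for the same serial fraction (the values are illustrative, not from the slides):

    #include <stdio.h>

    int main(void) {
        double s = 0.1;  /* serial (non-parallelizable) fraction */
        for (int N = 1; N <= 1024; N *= 4) {
            double scaled = N - (N - 1) * s;            /* Gustafson */
            double fixed  = 1.0 / (s + (1.0 - s) / N);  /* Amdahl    */
            printf("N = %4d  scaled = %7.1f  fixed-size = %.2f\n",
                   N, scaled, fixed);
        }
        return 0;  /* scaled speedup grows with N; fixed-size caps at 1/s */
    }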


Summary

• Concurrency refers to the notion of multiple threads in progress at the same time.
• Parallelism refers to the concept of multiple threads executing simultaneously.
• Modern software applications often consist of multiple processes or threads that can be executed in parallel.
• Most modern computing platforms are multiple instruction, multiple data (MIMD) machines. These machines allow programmers to process multiple instruction and data streams simultaneously.

Summary

• In practice, Amdahl's Law does not accurately reflect the benefit of increasing the number of processor cores on a given platform.
• Linear speedup is achievable by expanding the problem size with the number of processor cores.

