Computer Organization and Course Structure
IS1500, fall 2015

Lecture 12: Parallelism, Concurrency, Speedup, and ILP
David Broman
Associate Professor, KTH Royal Institute of Technology
Assistant Research Engineer, University of California, Berkeley
[email protected]
Slides version 1.0

[Course structure overview: six modules taught through lectures (LE1-LE14), exercises (EX1-EX5), labs (LAB1-LAB6), seminars (S1-S3), and a final project expo. Module 1: C and Assembly Programming, Module 2: I/O Systems, Module 3: Logic Design, Module 6: Parallel Processors and Programs. This lecture is LE12 in Module 6.]


Abstractions in Computer Systems
From networked systems down to devices and physics:
• Networked Systems and Systems of Systems: Computer System
• Software: Application Software, Operating System
• Hardware/Software Interface: Instruction Set Architecture
• Digital Hardware Design: Microarchitecture; Logic and Building Blocks; Digital Circuits
• Analog Design and Physics: Analog Circuits; Devices and Physics

Agenda
• Part I: Multiprocessors, Parallelism, Concurrency, and Speedup
• Part II: Instruction Level Parallelism

Part I: Multiprocessors, Parallelism, Concurrency, and Speedup

Acknowledgement: The structure and several of the good examples are derived from the book "Computer Organization and Design" (2014) by David A. Patterson and John L. Hennessy.

How is this computer revolution possible? (Revisited)
Moore's law:
• Resources (transistors) double every 18-24 months.
• Stated by Gordon E. Moore, Intel's co-founder, in the 1960s.
• Possible because of refined manufacturing. E.g., 4th generation Intel Core i7 processors use 22nm manufacturing.
• Sometimes considered a self-fulfilling prophecy; it has served as a goal for the semiconductor industry.

Have we reached the limit? (Revisited)
During the last decades, the clock rate increased dramatically:
• 1989: 80486, 25 MHz
• 1993: Pentium, 66 MHz
• 1997: Pentium Pro, 200 MHz
• 2001: Pentium 4, 2.0 GHz
• 2004: Pentium 4, 3.6 GHz
• 2013: Core i7, 3.1 GHz - 4 GHz
Why has this increase stopped? The Power Wall: increased clock rate implies increased power, and we cannot cool the system enough to increase the clock rate anymore.
"New" trend since 2006: Multicore
• Moore's law still holds
• More processors on a chip: multicore
• "New" challenge: parallel programming
(Photo by Eric Gaba, CC BY-SA 3.0, no modifications made; http://www.publicdomainpictures.net/view-image.php?image=1281&picture=tegelvagg)

What is a multiprocessor?
A multiprocessor is a computer system with two or more processors. By contrast, a computer with one processor is called a uniprocessor.
Multicore processors are multiprocessors where all processors (cores) are located on a single integrated circuit.
A cluster is a set of computers that are connected over a local area network (LAN). It may be viewed as one large multiprocessor. (Photo by Robert Harker)

Different Kinds of Computer Systems (Revisited)
• Embedded Real-Time Systems
• Personal Computers and Personal Mobile Devices
• Warehouse Scale Computers
(Photo by Kyro; photo by Robert Harker)

Why multiprocessors?
• Performance: Possible to execute many computation tasks in parallel.
• Energy: Replace energy-inefficient processors in data centers with many efficient smaller processors.
• Dependability: If one out of N processors fails, N-1 processors are still functioning.

Parallelism and Concurrency – what is the difference?
• Concurrency is about handling many things at the same time. Concurrency may be viewed from the software viewpoint.
• Parallelism is about doing (executing) many things at the same time. Parallelism may be viewed from the hardware viewpoint.
The four combinations of software and hardware:
• Sequential software on serial hardware: e.g., matrix multiplication on a unicore processor.
• Concurrent software on serial hardware: e.g., an OS running on a unicore processor.
• Sequential software on parallel hardware: e.g., matrix multiplication on a multicore processor.
• Concurrent software on parallel hardware: e.g., a Linux OS running on a multicore processor.
Note: As always, not everybody agrees on the definitions of concurrency and parallelism. The matrix is from H&P 2014, and the informal definitions above are similar to what was said in a talk by Rob Pike.

Speedup
How much can we improve the performance using parallelization?
Speedup = Tbefore / Tafter, where Tbefore is the execution time of one program before the improvement and Tafter is the execution time after the improvement.
In a plot of speedup versus the number of processors:
• Linear speedup (or ideal speedup) grows proportionally to the number of processors.
• Superlinear speedup is either a wrong measurement or due to, e.g., cache effects.
• Sublinear speedup still means increased speedup, but it is less efficient.
Danger: Relative speedup measures only the same program; true speedup also compares with the best known sequential program.

Amdahl's Law (1/4)
Can we achieve linear speedup? Divide the execution time before the improvement into two parts:
Tbefore = Taffected + Tunaffected
where Taffected is the time affected by the improvement (the part that can be parallelized) and Tunaffected is the time unaffected by the improvement (the sequential part).
The execution time after the improvement, with an amount of improvement N (N times improvement), is:
Tafter = Taffected / N + Tunaffected
This gives what is sometimes referred to as Amdahl's law:
Speedup = Tbefore / Tafter = Tbefore / (Taffected / N + Tunaffected)

Amdahl's Law (2/4)
Exercise: Assume a program consists of an image analysis task, sequentially followed by a statistical computation task. Only the image analysis task can be parallelized. How much do we need to improve the image analysis task to be able to achieve 4 times speedup? Assume that the program takes 80 ms in total and that the image analysis task takes 60 ms out of this time.
Solution:
4 = 80 / (60/N + 80 - 60)
60/N + 20 = 20
60/N = 0
It is impossible to achieve this speedup!
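To make the exercise concrete, here is a minimal C sketch (illustrative only, not part of the original slides) that evaluates Amdahl's law for the numbers above (Taffected = 60 ms, Tunaffected = 20 ms) for growing N:

  #include <stdio.h>

  /* Amdahl's law: speedup = (t_aff + t_unaff) / (t_aff / n + t_unaff) */
  static double amdahl(double t_aff, double t_unaff, double n) {
      return (t_aff + t_unaff) / (t_aff / n + t_unaff);
  }

  int main(void) {
      double n_values[] = { 2.0, 10.0, 100.0, 1000000.0 };
      for (int i = 0; i < 4; i++)
          printf("N = %9.0f  speedup = %.3f\n",
                 n_values[i], amdahl(60.0, 20.0, n_values[i]));
      /* The printed speedup approaches 80/20 = 4 but never reaches it,
         confirming that a 4x speedup is impossible here. */
      return 0;
  }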


Amdahl's Law (3/4)
Example continued. Assume that we perform 10 scalar integer additions, followed by one matrix addition, where the matrices are 10x10. Assume all additions take the same amount of time and that we can only parallelize the matrix addition.
Exercise A: What is the speedup with 10 processors?
Solution A: (10 + 10*10) / (10*10/10 + 10) = 5.5
Exercise B: What is the speedup with 40 processors?
Solution B: (10 + 10*10) / (10*10/40 + 10) = 8.8
Exercise C: What is the maximal speedup (the limit when N → infinity)?
Solution C: (10 + 10*10) / ((10*10)/N + 10) → 11 when N → infinity

Amdahl's Law (4/4)
But was not the maximal speedup 11 when N → infinity? What if we change the size of the problem (make the matrices larger)?

Matrix size     10 processors     40 processors
10x10           5.5               8.8
20x20           8.2               20.5

Strong scaling: measuring speedup while keeping the problem size fixed.
Weak scaling: measuring speedup when the problem size grows proportionally to the increased number of processors.
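The same formula also reproduces the strong and weak scaling numbers in the table above. A small C sketch (illustrative only; each addition counts as one time unit, as in the slides):

  #include <stdio.h>

  /* 'scalar' sequential additions plus 'elems' parallelizable matrix-element
     additions, run on n processors (only the matrix addition is parallelized). */
  static double speedup(double scalar, double elems, double n) {
      return (scalar + elems) / (elems / n + scalar);
  }

  int main(void) {
      printf("10x10, 10 processors: %.1f\n", speedup(10, 100, 10));  /*  5.5 */
      printf("10x10, 40 processors: %.1f\n", speedup(10, 100, 40));  /*  8.8 */
      printf("20x20, 10 processors: %.1f\n", speedup(10, 400, 10));  /*  8.2 */
      printf("20x20, 40 processors: %.1f\n", speedup(10, 400, 40));  /* 20.5 */
      return 0;
  }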

Part I Part II Part I Part II David Broman Multiprocessors, Parallelism, Instruction Level David Broman Multiprocessors, Parallelism, Instruction Level [email protected] Concurrency, and Speedup Parallelism [email protected] Concurrency, and Speedup Parallelism 17 18 E Main Classes of Parallelisms SISD, SIMD, and MIMD

An old (from the 1960s) but still very useful classification of Example – Sheep shearing processors uses the notion of instruction and data streams. Assume that sheep are data Data-Level Parallelism (DLP) items and the task for the farmer Data-level parallelism. Examples DLP Many data items can be is to do sheep shearing (remove Data Stream are multimedia extensions (e.g., processed at the same time. the wool). Data-level parallelism SSE, streaming SIMD would be the same as using Single Multiple extension), vector processors.

several farm hands to do the SISD SIMD Graphical Unit Processors shearing. E.g. Intel (GPUs) are both SIMD and Single E.g. SSE Pentium 4 MIMD Example – Many tasks at the farm Instruction in Task-Level Parallelism (TLP) Assume that there are many different Task-level parallelism. Examples are multicore and TLP Different tasks of work that can things that can be done on the farm MISD MIMD cluster computers work in independently and in (fix the barn, sheep shearing, feed the parallel pigs etc.) Task-level parallelism would No examples today E.g. Intel Physical Q/A be to let the farm hands do the Stream Instruction Core i7

Multiple What is a modern Intel CPU, different tasks in parallel. such as Core i7? Stand for MIMD, on the table for SIMD
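As a rough C sketch of the two classes (an illustration only; OpenMP is an assumption here and is not part of the lecture), the first loop is data-level parallelism that a compiler can map to SIMD instructions such as SSE, while the two independent functions show task-level parallelism running on separate cores. Compile with, e.g., gcc -fopenmp.

  #include <stdio.h>

  #define N 1000000
  static float a[N], b[N], c[N];

  /* Data-level parallelism (DLP): the same operation on many data items.
     A vectorizing compiler can map this loop to SIMD instructions. */
  static void vector_add(void) {
      #pragma omp simd
      for (int i = 0; i < N; i++)
          c[i] = a[i] + b[i];
  }

  /* Two hypothetical, independent tasks ("fix the barn", "feed the pigs"). */
  static void task_one(void) { for (int i = 0; i < N; i++) a[i] = 1.0f; }
  static void task_two(void) { for (int i = 0; i < N; i++) b[i] = 2.0f; }

  int main(void) {
      /* Task-level parallelism (TLP): independent tasks on different cores. */
      #pragma omp parallel sections
      {
          #pragma omp section
          task_one();
          #pragma omp section
          task_two();
      }
      vector_add();
      printf("c[0] = %.1f\n", c[0]);   /* prints 3.0 */
      return 0;
  }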



Part II: Instruction Level Parallelism

Acknowledgement: The structure and several of the good examples are derived from the book "Computer Organization and Design" (2014) by David A. Patterson and John L. Hennessy.

What is Instruction Level Parallelism?
Instruction Level Parallelism (ILP) may increase performance without involvement of the programmer. It may be implemented in a SISD, SIMD, or MIMD computer.
Two main approaches:
1. Deep pipelines with more pipeline stages. If the lengths of all pipeline stages are balanced, we may increase performance by increasing the clock speed.
2. Multiple issue. A technique where multiple instructions are issued in each clock cycle. Multiple issue may decrease the CPI to lower than 1 or, using the inverse metric instructions per clock cycle (IPC), increase it above 1.
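As a small worked example (numbers chosen for illustration, not from the slides): a two-issue processor that completes 8 instructions in 5 clock cycles has IPC = 8/5 = 1.6 and, equivalently, CPI = 5/8 = 0.625, i.e., a CPI below 1.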

Multiple Issue Approaches
Problems with multiple issue:
• Determining how many and which instructions to issue in each cycle. Problematic since instructions typically depend on each other.
• How to deal with data and control hazards.
The two main approaches of multiple issue are:
1. Static Multiple Issue: Decisions on when and which instructions to issue at each clock cycle are determined by the compiler.
2. Dynamic Multiple Issue: Many of the decisions on issuing instructions are made by the processor, dynamically, during execution.

Static Multiple Issue (1/3) – VLIW
Very Long Instruction Word (VLIW) processors issue several instructions in each cycle using issue packages. An issue package may be seen as one large instruction with multiple operations. In the example below, each issue package has two issue slots:
  add $s0, $s1, $s2   F D E M W
  add $t0, $t0, $0    F D E M W
  and $t2, $t1, $s0   F D E M W
  lw  $t0, 0($s0)     F D E M W
The compiler may insert no-ops to avoid hazards.
Question: How does VLIW affect the hardware implementation?


Static Multiple Issue (2/3) – Changing the hardware
To issue two instructions in each cycle, we need the following (not shown in the picture):
• Fetch and decode 64 bits (two instructions) per cycle.
• Double the number of ports for the register file.
• Add another ALU.
[Figure: the single-issue MIPS datapath with instruction memory, register file, ALU, and data memory.]

Static Multiple Issue (3/3) – Exercise
Exercise: Assume we have a 5-stage VLIW MIPS processor that can issue two instructions in the same clock cycle. For the following code, schedule the instructions into issue slots and compute the IPC.

      addi $s1, $0, 10
L1:   lw   $t0, 0($s2)
      ori  $t1, $t0, 7
      sw   $t1, 0($s2)
      addi $s2, $s2, 4
      addi $s1, $s1, -1
      bne  $s1, $0, L1

Solution (the instructions are rescheduled to avoid stalls because of the lw; empty slots mean no-ops):

Cycle   Slot 1               Slot 2
1       addi $s1, $0, 10     -
2       L1: lw $t0, 0($s2)   -
3       addi $s2, $s2, 4     addi $s1, $s1, -1
4       ori $t1, $t0, 7      -
5       bne $s1, $0, L1      sw $t1, -4($s2)

The loop body (6 instructions) executes 10 times after the initial addi, so 1 + 6*10 = 61 instructions complete in 1 + 4*10 = 41 cycles (cycles 2-5 repeat for each iteration):
IPC = (1 + 6*10) / (1 + 4*10) ≈ 1.49 (the maximum is 2).

Dynamic Multiple Issue Processors (1/2) – Superscalar Processors
In many modern processors (e.g., Intel Core i7), instruction issuing is performed dynamically by the processor while executing the program. Such a processor is called superscalar.
• The instruction fetch and decode unit fetches and decodes several instructions in order.
• Reservation stations (RS) buffer the operands to the functional units before execution. Data dependencies are handled dynamically.
• Functional units (FU) execute the instructions. Examples are integer units, floating-point units, and load/store units. Results are sent back to the reservation stations (if a unit waits on the operand) and to the commit unit.
• The commit unit commits instructions (stores results in registers) conservatively, respecting the observable order of instructions. This is called in-order commit.

Dynamic Multiple Issue Processors (2/2) – Out-of-Order Execution, RAW, WAR
If the processor can reorder the instruction execution order, it is an out-of-order execution processor.
Example of a Read After Write (RAW) dependency (dependency on $t0). The superscalar pipeline must make sure that the data is available before it is read:
  lw   $t0, 0($s2)
  addi $t1, $t0, 7
Example of a Write After Read (WAR) dependency (dependency on $t1). If the order is flipped due to out-of-order execution, we have a hazard:
  sub  $t0, $t1, $s0
  addi $t1, $s0, 10
WAR dependencies can be resolved using register renaming, where the processor writes to a nonarchitectural renaming register (here called $r1 in the pseudo asm code; not accessible to the programmer):
  addi $r1, $s0, 10
  sub  $t0, $t1, $s0
Note that the out-of-order processor does not rewrite the code, but keeps track of the renamed registers during execution.

Some observations on Multi-Issue Processors
• VLIW processors tend to be more energy efficient than superscalar out-of-order processors (less hardware; the compiler does the job).
• Superscalar processors with dynamic scheduling can hide some latencies that are not statically predictable (e.g., cache misses, dynamic branch predictions).
• Although modern processors issue 4 to 6 instructions per clock cycle, few applications result in an IPC over 2. The reason is dependencies.

Intel Microprocessors, some examples

Processor                    Year   Clock Rate   Pipeline Stages   Issue Width   Cores   Power
Intel 486                    1989   25 MHz       5                 1             1       5 W
Intel Pentium                1993   66 MHz       5                 2             1       10 W
Intel Pentium Pro            1997   200 MHz      10                3             1       29 W
Intel Pentium 4 Willamette   2001   2000 MHz     22                3             1       75 W
Intel Pentium 4 Prescott     2004   3600 MHz     31                3             1       103 W
Intel Core                   2006   2930 MHz     14                4             2       75 W
Intel Core i5 Nehalem        2010   3300 MHz     14                4             1       87 W
Intel Core i5 Ivy Bridge     2012   3400 MHz     14                4             8       77 W

Observations: Pipeline stages first increased and then decreased, while the number of cores increased after 2006. The clock rate increase stopped with Pentium 4 (the power wall), and the power peaked around 2006.
Source: Patterson and Hennessy, 2014, page 344.
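To see why dependencies limit the achievable IPC, consider this small C sketch (an illustration, not from the slides). In the first loop every iteration needs the sum produced by the previous one (a chain of RAW dependencies), so extra issue slots help little; in the second loop the iterations are independent, so a multiple-issue processor can overlap them.

  #include <stdio.h>

  #define N 1000
  static double a[N];

  int main(void) {
      for (int i = 0; i < N; i++) a[i] = 1.0;

      /* Loop-carried RAW dependency: each addition waits for the previous sum. */
      double sum = 0.0;
      for (int i = 0; i < N; i++)
          sum += a[i];

      /* Independent iterations: no dependency between elements, plenty of ILP. */
      for (int i = 0; i < N; i++)
          a[i] = 2.0 * a[i] + 1.0;

      printf("sum = %.1f, a[0] = %.1f\n", sum, a[0]);  /* 1000.0, 3.0 */
      return 0;
  }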

Bonus Part: Time-Aware Systems Design – Research Challenges
David Broman @ KTH
(not part of the examination)

Time-Aware Systems – Examples
• Cyber-Physical Systems (CPS): process industry and industrial automation, automotive, aircraft.
• Time-Aware Simulation Systems: physical simulations (Simulink, Modelica, etc.).
• Time-Aware Distributed Systems: time-stamped distributed systems (e.g., Google Spanner).


What is our goal?
"Everything should be made as simple as possible, but not simpler" – attributed to Albert Einstein.
Execution time should be as short as possible, but not shorter:
• Objective: Minimize the slack – minimize area, memory, and energy. There is no point in making the execution time shorter, as long as the deadline is met.
• Challenge: Still guarantee to meet all timing constraints.
[Figure: a task executes, leaving slack before its deadline.]

Programming Model and Time
Traditional approach:
• Timing is not part of the software semantics. Correct execution of programs (e.g., in C, C++, C#, Java, Scala, Haskell, OCaml) has nothing to do with how long things take to execute.
• Timing is dependent on the hardware platform.
Our objective:
• Make time an abstraction within the programming model.
• Timing is independent of the hardware platform (within certain constraints).

Are you interested in being challenged?
If you are interested in
• programming language design, or
• compilers, or
• …
you are
• ambitious and interested in learning new things,
and you want to
• do a real research project as part of your Bachelor's or Master's thesis project,
please send an email to [email protected], so that we can discuss some ideas.
(You know the drill… keep calm.)


Reading Guidelines
Module 6: Parallel Processors and Programs
Lecture 12: Parallelism, Concurrency, Speedup, and ILP
• H&H Chapter 1.8, 3.6, 7.8.3-7.8.5
• P&H5 Chapters 1.7-1.8, 1.10, 4.10, 6.1-6.2, or P&H4 Chapters 1.5-1.6, 1.8, 4.10, 7.1-7.2
Lecture 13: SIMD, MIMD, and Parallel Programming
• H&H Chapter 7.8.6-7.8.9
• P&H5 Chapters 2.11, 3.6, 5.10, 6.3-6.7, or P&H4 Chapters 2.11, 3.6, 5.8, 7.3-7.7
See the course webpage for more information.

Summary
Some key take-away points:
• Amdahl's law can be used to estimate the maximal speedup when introducing parallelism in parts of a program.
• Instruction Level Parallelism (ILP) has been very important for performance improvements over the years, but improvements have not been as significant lately.
Thanks for listening!
