Computer Organization and Course Structure
IS1500, fall 2015

Lecture 12: Parallelism, Concurrency, Speedup, and ILP
David Broman
Associate Professor, KTH Royal Institute of Technology
Assistant Research Engineer, University of California, Berkeley
[email protected]
Slides version 1.0

[Course structure overview: six modules taught through lectures (LE1-LE14), exercises (EX1-EX5), labs (LAB1-LAB6), seminars (S1-S3), and a final project expo. Module 1: C and Assembly Programming, Module 2: I/O Systems, Module 3: Logic Design, Module 6: Parallel Processors and Programs. This lecture is LE12 in Module 6.]


Abstractions in Computer Systems
From networked systems down to devices and physics:
• Networked Systems and Systems of Systems: Computer System
• Software: Application Software, Operating System
• Hardware/Software Interface: Instruction Set Architecture
• Digital Hardware Design: Microarchitecture; Logic and Building Blocks; Digital Circuits
• Analog Design and Physics: Analog Circuits; Devices and Physics

Agenda
• Part I: Multiprocessors, Parallelism, Concurrency, and Speedup
• Part II: Instruction Level Parallelism

Part I: Multiprocessors, Parallelism, Concurrency, and Speedup

Acknowledgement: The structure and several of the good examples are derived from the book "Computer Organization and Design" (2014) by David A. Patterson and John L. Hennessy.

How is this computer revolution possible? (Revisited)
Moore's law:
• Resources (transistors) double every 18-24 months.
• Stated by Gordon E. Moore, Intel's co-founder, in the 1960s.
• Possible because of refined manufacturing. E.g., 4th generation Intel Core i7 processors use 22nm manufacturing.
• Sometimes considered a self-fulfilling prophecy; it has served as a goal for the semiconductor industry.

Have we reached the limit? (Revisited)
During the last decades, the clock rate increased dramatically:
• 1989: 80486, 25 MHz
• 1993: Pentium, 66 MHz
• 1997: Pentium Pro, 200 MHz
• 2001: Pentium 4, 2.0 GHz
• 2004: Pentium 4, 3.6 GHz
• 2013: Core i7, 3.1 GHz - 4 GHz
Why has this increase stopped? The Power Wall: increased clock rate implies increased power, and we cannot cool the system enough to increase the clock rate anymore.
"New" trend since 2006: Multicore
• Moore's law still holds
• More processors on a chip: multicore
• "New" challenge: parallel programming
(Photo by Eric Gaba, CC BY-SA 3.0, no modifications made; http://www.publicdomainpictures.net/view-image.php?image=1281&picture=tegelvagg)

What is a multiprocessor?
A multiprocessor is a computer system with two or more processors. By contrast, a computer with one processor is called a uniprocessor.
Multicore processors are multiprocessors where all processors (cores) are located on a single integrated circuit.
A cluster is a set of computers that are connected over a local area network (LAN). It may be viewed as one large multiprocessor. (Photo by Robert Harker)

Different Kinds of Computer Systems (Revisited)
• Embedded Real-Time Systems
• Personal Computers and Personal Mobile Devices
• Warehouse Scale Computers
(Photo by Kyro; photo by Robert Harker)

Why multiprocessors?
• Performance: Possible to execute many computation tasks in parallel.
• Energy: Replace energy-inefficient processors in data centers with many efficient smaller processors.
• Dependability: If one out of N processors fails, N-1 processors are still functioning.

Parallelism and Concurrency – what is the difference?
• Concurrency is about handling many things at the same time. Concurrency may be viewed from the software viewpoint.
• Parallelism is about doing (executing) many things at the same time. Parallelism may be viewed from the hardware viewpoint.
The four combinations of software and hardware:
• Sequential software on serial hardware: e.g., matrix multiplication on a unicore processor.
• Concurrent software on serial hardware: e.g., an OS running on a unicore processor.
• Sequential software on parallel hardware: e.g., matrix multiplication on a multicore processor.
• Concurrent software on parallel hardware: e.g., a Linux OS running on a multicore processor.
Note: As always, not everybody agrees on the definitions of concurrency and parallelism. The matrix is from H&P 2014, and the informal definitions above are similar to what was said in a talk by Rob Pike.

Speedup
How much can we improve the performance using parallelization?
Speedup = Tbefore / Tafter, where Tbefore is the execution time of one program before the improvement and Tafter is the execution time after the improvement.
In a plot of speedup versus the number of processors:
• Linear speedup (or ideal speedup) grows proportionally to the number of processors.
• Superlinear speedup is either a wrong measurement or due to, e.g., cache effects.
• Sublinear speedup still means increased speedup, but it is less efficient.
Danger: Relative speedup measures only the same program; true speedup also compares with the best known sequential program.

Amdahl's Law (1/4)
Can we achieve linear speedup? Divide the execution time before the improvement into two parts:
Tbefore = Taffected + Tunaffected
where Taffected is the time affected by the improvement (the part that can be parallelized) and Tunaffected is the time unaffected by the improvement (the sequential part).
The execution time after the improvement, with an amount of improvement N (N times improvement), is:
Tafter = Taffected / N + Tunaffected
This gives what is sometimes referred to as Amdahl's law:
Speedup = Tbefore / Tafter = Tbefore / (Taffected / N + Tunaffected)

Amdahl's Law (2/4)
Exercise: Assume a program consists of an image analysis task, sequentially followed by a statistical computation task. Only the image analysis task can be parallelized. How much do we need to improve the image analysis task to be able to achieve 4 times speedup? Assume that the program takes 80 ms in total and that the image analysis task takes 60 ms out of this time.
Solution:
4 = 80 / (60/N + 80 - 60)
60/N + 20 = 20
60/N = 0
It is impossible to achieve this speedup!
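To make the exercise concrete, here is a minimal C sketch (illustrative only, not part of the original slides) that evaluates Amdahl's law for the numbers above (Taffected = 60 ms, Tunaffected = 20 ms) for growing N:

  #include <stdio.h>

  /* Amdahl's law: speedup = (t_aff + t_unaff) / (t_aff / n + t_unaff) */
  static double amdahl(double t_aff, double t_unaff, double n) {
      return (t_aff + t_unaff) / (t_aff / n + t_unaff);
  }

  int main(void) {
      double n_values[] = { 2.0, 10.0, 100.0, 1000000.0 };
      for (int i = 0; i < 4; i++)
          printf("N = %9.0f  speedup = %.3f\n",
                 n_values[i], amdahl(60.0, 20.0, n_values[i]));
      /* The printed speedup approaches 80/20 = 4 but never reaches it,
         confirming that a 4x speedup is impossible here. */
      return 0;
  }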


Amdahl's Law (3/4)
Example continued. Assume that we perform 10 scalar integer additions, followed by one matrix addition, where the matrices are 10x10. Assume all additions take the same amount of time and that we can only parallelize the matrix addition.
Exercise A: What is the speedup with 10 processors?
Solution A: (10 + 10*10) / (10*10/10 + 10) = 5.5
Exercise B: What is the speedup with 40 processors?
Solution B: (10 + 10*10) / (10*10/40 + 10) = 8.8
Exercise C: What is the maximal speedup (the limit when N → infinity)?
Solution C: (10 + 10*10) / ((10*10)/N + 10) → 11 when N → infinity

Amdahl's Law (4/4)
But was not the maximal speedup 11 when N → infinity? What if we change the size of the problem (make the matrices larger)?

Matrix size     10 processors     40 processors
10x10           5.5               8.8
20x20           8.2               20.5

Strong scaling: measuring speedup while keeping the problem size fixed.
Weak scaling: measuring speedup when the problem size grows proportionally to the increased number of processors.
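The same formula also reproduces the strong and weak scaling numbers in the table above. A small C sketch (illustrative only; each addition counts as one time unit, as in the slides):

  #include <stdio.h>

  /* 'scalar' sequential additions plus 'elems' parallelizable matrix-element
     additions, run on n processors (only the matrix addition is parallelized). */
  static double speedup(double scalar, double elems, double n) {
      return (scalar + elems) / (elems / n + scalar);
  }

  int main(void) {
      printf("10x10, 10 processors: %.1f\n", speedup(10, 100, 10));  /*  5.5 */
      printf("10x10, 40 processors: %.1f\n", speedup(10, 100, 40));  /*  8.8 */
      printf("20x20, 10 processors: %.1f\n", speedup(10, 400, 10));  /*  8.2 */
      printf("20x20, 40 processors: %.1f\n", speedup(10, 400, 40));  /* 20.5 */
      return 0;
  }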

Part I Part II Part I Part II David Broman Multiprocessors, Parallelism, Instruction Level David Broman Multiprocessors, Parallelism, Instruction Level [email protected] Concurrency, and Speedup Parallelism [email protected] Concurrency, and Speedup Parallelism 17 18 E Main Classes of Parallelisms SISD, SIMD, and MIMD

An old (from the 1960s) but still very useful classification of Example – Sheep shearing processors uses the notion of instruction and data streams. Assume that sheep are data Data-Level Parallelism (DLP) items and the task for the farmer Data-level parallelism. Examples DLP Many data items can be is to do sheep shearing (remove Data Stream are multimedia extensions (e.g., processed at the same time. the wool). Data-level parallelism SSE, streaming SIMD would be the same as using Single Multiple extension), vector processors.

several farm hands to do the SISD SIMD Graphical Unit Processors shearing. E.g. Intel (GPUs) are both SIMD and Single E.g. SSE Pentium 4 MIMD Example – Many tasks at the farm Instruction in Task-Level Parallelism (TLP) Assume that there are many different Task-level parallelism. Examples are multicore and TLP Different tasks of work that can things that can be done on the farm MISD MIMD cluster computers work in independently and in (fix the barn, sheep shearing, feed the parallel pigs etc.) Task-level parallelism would No examples today E.g. Intel Physical Q/A be to let the farm hands do the Stream Instruction Core i7

Multiple What is a modern Intel CPU, different tasks in parallel. such as Core i7? Stand for MIMD, on the table for SIMD
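As a rough C sketch of the two classes (an illustration only; OpenMP is an assumption here and is not part of the lecture), the first loop is data-level parallelism that a compiler can map to SIMD instructions such as SSE, while the two independent functions show task-level parallelism running on separate cores. Compile with, e.g., gcc -fopenmp.

  #include <stdio.h>

  #define N 1000000
  static float a[N], b[N], c[N];

  /* Data-level parallelism (DLP): the same operation on many data items.
     A vectorizing compiler can map this loop to SIMD instructions. */
  static void vector_add(void) {
      #pragma omp simd
      for (int i = 0; i < N; i++)
          c[i] = a[i] + b[i];
  }

  /* Two hypothetical, independent tasks ("fix the barn", "feed the pigs"). */
  static void task_one(void) { for (int i = 0; i < N; i++) a[i] = 1.0f; }
  static void task_two(void) { for (int i = 0; i < N; i++) b[i] = 2.0f; }

  int main(void) {
      /* Task-level parallelism (TLP): independent tasks on different cores. */
      #pragma omp parallel sections
      {
          #pragma omp section
          task_one();
          #pragma omp section
          task_two();
      }
      vector_add();
      printf("c[0] = %.1f\n", c[0]);   /* prints 3.0 */
      return 0;
  }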



Part II: Instruction Level Parallelism

Acknowledgement: The structure and several of the good examples are derived from the book "Computer Organization and Design" (2014) by David A. Patterson and John L. Hennessy.

What is Instruction Level Parallelism?
Instruction Level Parallelism (ILP) may increase performance without involvement of the programmer. It may be implemented in a SISD, SIMD, or MIMD computer.
Two main approaches:
1. Deep pipelines with more pipeline stages. If the lengths of all pipeline stages are balanced, we may increase performance by increasing the clock speed.
2. Multiple issue. A technique where multiple instructions are issued in each clock cycle. Multiple issue may decrease the CPI to lower than 1 or, using the inverse metric instructions per clock cycle (IPC), increase it above 1.
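As a small worked example (numbers chosen for illustration, not from the slides): a two-issue processor that completes 8 instructions in 5 clock cycles has IPC = 8/5 = 1.6 and, equivalently, CPI = 5/8 = 0.625, i.e., a CPI below 1.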

Multiple Issue Approaches
Problems with multiple issue:
• Determining how many and which instructions to issue in each cycle. Problematic since instructions typically depend on each other.
• How to deal with data and control hazards.
The two main approaches of multiple issue are:
1. Static Multiple Issue: Decisions on when and which instructions to issue at each clock cycle are determined by the compiler.
2. Dynamic Multiple Issue: Many of the decisions on issuing instructions are made by the processor, dynamically, during execution.

Static Multiple Issue (1/3) – VLIW
Very Long Instruction Word (VLIW) processors issue several instructions in each cycle using issue packages. An issue package may be seen as one large instruction with multiple operations. In the example below, each issue package has two issue slots:
  add $s0, $s1, $s2   F D E M W
  add $t0, $t0, $0    F D E M W
  and $t2, $t1, $s0   F D E M W
  lw  $t0, 0($s0)     F D E M W
The compiler may insert no-ops to avoid hazards.
Question: How does VLIW affect the hardware implementation?


Static Multiple Issue (2/3) – Changing the hardware
To issue two instructions in each cycle, we need the following (not shown in the picture):
• Fetch and decode 64 bits (two instructions) per cycle.
• Double the number of ports for the register file.
• Add another ALU.
[Figure: the single-issue MIPS datapath with instruction memory, register file, ALU, and data memory.]

Static Multiple Issue (3/3) – Exercise
Exercise: Assume we have a 5-stage VLIW MIPS processor that can issue two instructions in the same clock cycle. For the following code, schedule the instructions into issue slots and compute the IPC.

      addi $s1, $0, 10
L1:   lw   $t0, 0($s2)
      ori  $t1, $t0, 7
      sw   $t1, 0($s2)
      addi $s2, $s2, 4
      addi $s1, $s1, -1
      bne  $s1, $0, L1

Solution (the instructions are rescheduled to avoid stalls because of the lw; empty slots mean no-ops):

Cycle   Slot 1               Slot 2
1       addi $s1, $0, 10     -
2       L1: lw $t0, 0($s2)   -
3       addi $s2, $s2, 4     addi $s1, $s1, -1
4       ori $t1, $t0, 7      -
5       bne $s1, $0, L1      sw $t1, -4($s2)

The loop body (6 instructions) executes 10 times after the initial addi, so 1 + 6*10 = 61 instructions complete in 1 + 4*10 = 41 cycles (cycles 2-5 repeat for each iteration):
IPC = (1 + 6*10) / (1 + 4*10) ≈ 1.49 (the maximum is 2).

Dynamic Multiple Issue Processors (1/2) – Superscalar Processors
In many modern processors (e.g., Intel Core i7), instruction issuing is performed dynamically by the processor while executing the program. Such a processor is called superscalar.
• The instruction fetch and decode unit fetches and decodes several instructions in order.
• Reservation stations (RS) buffer the operands to the functional units before execution. Data dependencies are handled dynamically.
• Functional units (FU) execute the instructions. Examples are integer units, floating-point units, and load/store units. Results are sent back to the reservation stations (if a unit waits on the operand) and to the commit unit.
• The commit unit commits instructions (stores results in registers) conservatively, respecting the observable order of instructions. This is called in-order commit.

Dynamic Multiple Issue Processors (2/2) – Out-of-Order Execution, RAW, WAR
If the processor can reorder the instruction execution order, it is an out-of-order execution processor.
Example of a Read After Write (RAW) dependency (dependency on $t0). The superscalar pipeline must make sure that the data is available before it is read:
  lw   $t0, 0($s2)
  addi $t1, $t0, 7
Example of a Write After Read (WAR) dependency (dependency on $t1). If the order is flipped due to out-of-order execution, we have a hazard:
  sub  $t0, $t1, $s0
  addi $t1, $s0, 10
WAR dependencies can be resolved using register renaming, where the processor writes to a nonarchitectural renaming register (here called $r1 in the pseudo asm code; not accessible to the programmer):
  addi $r1, $s0, 10
  sub  $t0, $t1, $s0
Note that the out-of-order processor does not rewrite the code, but keeps track of the renamed registers during execution.

Some observations on Multi-Issue Processors
• VLIW processors tend to be more energy efficient than superscalar out-of-order processors (less hardware; the compiler does the job).
• Superscalar processors with dynamic scheduling can hide some latencies that are not statically predictable (e.g., cache misses, dynamic branch predictions).
• Although modern processors issue 4 to 6 instructions per clock cycle, few applications result in an IPC over 2. The reason is dependencies.

Intel Microprocessors, some examples

Processor                    Year   Clock Rate   Pipeline Stages   Issue Width   Cores   Power
Intel 486                    1989   25 MHz       5                 1             1       5 W
Intel Pentium                1993   66 MHz       5                 2             1       10 W
Intel Pentium Pro            1997   200 MHz      10                3             1       29 W
Intel Pentium 4 Willamette   2001   2000 MHz     22                3             1       75 W
Intel Pentium 4 Prescott     2004   3600 MHz     31                3             1       103 W
Intel Core                   2006   2930 MHz     14                4             2       75 W
Intel Core i5 Nehalem        2010   3300 MHz     14                4             1       87 W
Intel Core i5 Ivy Bridge     2012   3400 MHz     14                4             8       77 W

Observations: Pipeline stages first increased and then decreased, while the number of cores increased after 2006. The clock rate increase stopped with Pentium 4 (the power wall), and the power peaked around 2006.
Source: Patterson and Hennessy, 2014, page 344.
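To see why dependencies limit the achievable IPC, consider this small C sketch (an illustration, not from the slides). In the first loop every iteration needs the sum produced by the previous one (a chain of RAW dependencies), so extra issue slots help little; in the second loop the iterations are independent, so a multiple-issue processor can overlap them.

  #include <stdio.h>

  #define N 1000
  static double a[N];

  int main(void) {
      for (int i = 0; i < N; i++) a[i] = 1.0;

      /* Loop-carried RAW dependency: each addition waits for the previous sum. */
      double sum = 0.0;
      for (int i = 0; i < N; i++)
          sum += a[i];

      /* Independent iterations: no dependency between elements, plenty of ILP. */
      for (int i = 0; i < N; i++)
          a[i] = 2.0 * a[i] + 1.0;

      printf("sum = %.1f, a[0] = %.1f\n", sum, a[0]);  /* 1000.0, 3.0 */
      return 0;
  }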

Bonus Part: Time-Aware Systems Design – Research Challenges
David Broman @ KTH
(not part of the examination)

Time-Aware Systems – Examples
• Cyber-Physical Systems (CPS): process industry and industrial automation, automotive, aircraft.
• Time-Aware Simulation Systems: physical simulations (Simulink, Modelica, etc.).
• Time-Aware Distributed Systems: time-stamped distributed systems (e.g., Google Spanner).


What is our goal?
"Everything should be made as simple as possible, but not simpler" – attributed to Albert Einstein.
Execution time should be as short as possible, but not shorter:
• Objective: Minimize the slack – minimize area, memory, and energy. There is no point in making the execution time shorter, as long as the deadline is met.
• Challenge: Still guarantee to meet all timing constraints.
[Figure: a task executes, leaving slack before its deadline.]

Programming Model and Time
Traditional approach:
• Timing is not part of the software semantics. Correct execution of programs (e.g., in C, C++, C#, Java, Scala, Haskell, OCaml) has nothing to do with how long things take to execute.
• Timing is dependent on the hardware platform.
Our objective:
• Make time an abstraction within the programming model.
• Timing is independent of the hardware platform (within certain constraints).

Are you interested in being challenged?
If you are interested in
• programming language design, or
• compilers, or
• …
you are
• ambitious and interested in learning new things,
and you want to
• do a real research project as part of your Bachelor's or Master's thesis project,
please send an email to [email protected], so that we can discuss some ideas.
(You know the drill… keep calm.)


Reading Guidelines
Module 6: Parallel Processors and Programs
Lecture 12: Parallelism, Concurrency, Speedup, and ILP
• H&H Chapter 1.8, 3.6, 7.8.3-7.8.5
• P&H5 Chapters 1.7-1.8, 1.10, 4.10, 6.1-6.2, or P&H4 Chapters 1.5-1.6, 1.8, 4.10, 7.1-7.2
Lecture 13: SIMD, MIMD, and Parallel Programming
• H&H Chapter 7.8.6-7.8.9
• P&H5 Chapters 2.11, 3.6, 5.10, 6.3-6.7, or P&H4 Chapters 2.11, 3.6, 5.8, 7.3-7.7
See the course webpage for more information.

Summary
Some key take-away points:
• Amdahl's law can be used to estimate the maximal speedup when introducing parallelism in parts of a program.
• Instruction Level Parallelism (ILP) has been very important for performance improvements over the years, but improvements have not been as significant lately.
Thanks for listening!
