
Lecture 2: Pipelining and Instruction-Level Parallelism
15-418 Parallel Computer Architecture and Programming
CMU 15-418/15-618, Spring 2019

Many kinds of processors
▪ CPU
▪ GPU
▪ FPGA
▪ Etc.
Why so many? What differentiates these processors?

Why so many kinds of processors?
Each processor is designed for different kinds of programs
▪ CPUs
  ▪ “Sequential” code – i.e., single / few threads
▪ GPUs
  ▪ Programs with lots of independent work – “embarrassingly parallel”
▪ Many others: deep neural networks, digital signal processing, etc.

Parallelism pervades architecture
▪ Speeding up programs is all about parallelism
  ▪ 1) Find independent work
  ▪ 2) Execute it in parallel
  ▪ 3) Profit
▪ Key questions:
  ▪ Where is the parallelism?
  ▪ Whose job is it to find parallelism?

Where is the parallelism?
Different processors take radically different approaches
▪ CPUs: Instruction-level parallelism
  ▪ Implicit
  ▪ Fine-grain
▪ GPUs: Thread- & data-level parallelism
  ▪ Explicit
  ▪ Coarse-grain

Whose job to find parallelism?
Different processors take radically different approaches
▪ CPUs: Hardware dynamically schedules instructions
  ▪ Expensive, complex hardware – few cores (tens)
  ▪ (Relatively) easy to write fast software
▪ GPUs: Software makes parallelism explicit
  ▪ Simple, cheap hardware – many cores (thousands)
  ▪ (Often) hard to write fast software

Visualizing these differences
▪ Pentium 4 “Northwood” (2002)
  ▪ Highlighted areas actually execute instructions
  ▪ Most area is spent on scheduling, not on executing the program
▪ AMD Fiji (2015)
  ▪ Highlighted areas actually execute instructions
  ▪ Most area is spent executing the program
  ▪ (The rest is mostly I/O & memory, not scheduling)
[Figures: annotated die photos of the Pentium 4 “Northwood” and AMD Fiji]

Today you will learn…
How CPUs exploit ILP to speed up straight-line code
▪ Key ideas:
  ▪ Pipelining & superscalar: Work on multiple instructions at once
  ▪ Out-of-order execution: Dynamically schedule instructions whenever they are “ready”
  ▪ Speculation: Guess what the program will do next to discover more independent work, “rolling back” incorrect guesses
▪ CPUs must do all of this while preserving the illusion that instructions execute in-order, one-at-a-time

In other words… Today is about: [figure]
Buckle up! …But please ask questions!
Example: Polynomial evaluation

    int poly(int *coef, int terms, int x) {
        int power = 1;
        int value = 0;
        for (int j = 0; j < terms; j++) {
            value += coef[j] * power;
            power *= x;
        }
        return value;
    }

▪ Compiling on ARM. Register assignments: r0: value, r1: &coef[terms], r2: x, r3: &coef[j] (initialized to &coef[0]), r4: power, r5: coef[j]

    poly:
        cmp   r1, #0
        ble   .L4
        push  {r4, r5}            // Preamble
        mov   r3, r0
        add   r1, r0, r1, lsl #2
        movs  r4, #1
        movs  r0, #0
    .L3:                          // Iteration
        ldr   r5, [r3], #4
        cmp   r1, r3
        mla   r0, r4, r5, r0
        mul   r4, r2, r4
        bne   .L3
        pop   {r4, r5}            // Fini
        bx    lr
    .L4:
        movs  r0, #0              // Fini (terms <= 0)
        bx    lr

▪ The loop body, annotated:

    .L3:
        ldr r5, [r3], #4    // r5 <- coef[j]; j++ (two operations)
        cmp r1, r3          // compare: j < terms?
        mla r0, r4, r5, r0  // value += r5 * power (mul + add)
        mul r4, r2, r4      // power *= x
        bne .L3             // repeat?

Example: Polynomial evaluation
▪ Executing poly(A, 3, x): the preamble runs once, then the loop body (ldr, cmp, mla, mul, bne) executes for iteration j = 0, and execution continues into the next iteration.
Example: Polynomial evaluation
▪ Executing poly(A, 3, x): after the preamble (cmp through movs r0, #0), the loop body (ldr, cmp, mla, mul, bne) repeats for iterations j = 0, 1, and 2; the branch then falls through to the Fini sequence (pop {r4, r5}; bx lr).

The software-hardware boundary
▪ The instruction set architecture (ISA) is a functional contract between hardware and software
  ▪ It says what each instruction does, but not how
  ▪ Example: an ordered sequence of x86 instructions
▪ A processor’s microarchitecture is how the ISA is implemented
▪ Arch : μArch :: Interface : Implementation

Simple CPU model
▪ Execute instructions in program order
▪ Divide instruction execution into stages, e.g.:
  ▪ 1. Fetch – get the next instruction from memory
  ▪ 2. Decode – figure out what to do & read inputs
  ▪ 3. Execute – perform the necessary operations
  ▪ 4. Commit – write the results back to registers / memory
▪ (Real processors have many more stages)

Evaluating polynomial on the simple CPU model
▪ The loop instructions (ldr, cmp, mla, mul, bne) pass through the CPU’s Fetch, Decode, Execute, and Commit stages one instruction at a time. For ldr r5, [r3], #4:
  ▪ 1. Fetch: read “ldr r5, [r3], #4” from memory
  ▪ 2. Decode: decode “ldr r5, [r3], #4” and read input regs
  ▪ 3. Execute: load memory at r3 and compute r3 + 4
  ▪ 4. Commit: write values into regs r5 and r3
▪ Only then is the next instruction, cmp r1, r3, fetched, and so on.