
Parallel Programming in Intel® Cilk™ Plus with Autotuning

William Leiserson, Chi-Keung (CK) Luk, Peng Tu

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Tutorial Outline

• Introduction (15 mins) // CK
• Cilk (1 hr) // Will
• Array Notation-Part 1 (30 mins) // Peng
• Break (15 mins)
• Array Notation-Part 2 (30 mins) // Peng
• Autotuning (20 mins) // CK
• Put It All Together (40 mins) // CK

A New Era…

THE OLD: unconstrained power, frequency and voltage scaling, IPC and microarchitecture focus.
THE NEW: multi-core, power efficiency, microarchitecture advancements.

Use Transistors, Not Cycle-time for Performance

Sequential performance: P ~ √Area.
Parallel performance: P ~ Area.

As transistor count (area) grows, parallel technologies fill the gap:
• Multi-core
• Threading
• SIMD

Multi-core: Multiple Cores on One Die

• Cores: performance scales with the number of cores
• On-die interconnect: high bandwidth, low latency
• Cache hierarchy: private and shared, inclusive and exclusive; large cache
• High-bandwidth memory
• Power delivery and management
• Scalable fabric for scalability

Intel® Hyper-Threading Technology: Fine-Grained Interleaving of Threads

Two threads compete for instruction issue slots over time.

Intel® HTT in Nehalem: run 2 threads at the same time in a core, exploiting the 4-wide execution engine.

Hyper-Threading is a power-efficient performance feature.

SIMD Instructions Compute Multiple Operations per Instruction

(Diagram: 256-bit registers, bits 255-128 and 127-0, holding lanes X4 X3 X2 X1 and Y4 Y3 Y2 Y1; a single instruction computes X4 op Y4, X3 op Y3, X2 op Y2, X1 op Y1 across all lanes in both 128-bit halves.)

Intel® Advanced Vector Extensions in Sandy Bridge

Intel® Sandy Bridge Processor:

• 4 IA cores + 1 GPU per die
• 2 × 256b SIMD pipelines per IA core
• 2 Intel® Hyper-Threading Technology threads per IA core

1st MIC implementation (KNF) has 32 cores, each with 4 threads and a 512-bit SIMD unit.

Intel Parallel Programming Model

• Tasks → multiple cores / hardware threads: Intel® Cilk, Intel® Threading Building Blocks
• Vectors → SIMD instructions: Array Notation, Intel® Array Building Blocks
• Parallel tasks with SIMD kernels combine the two

Task Parallel Programming

Tasks are pieces of work at the application level – Loop bodies, functions

Scalable. Specify potential parallelism. Scheduler maps tasks to threads

Task graph dynamically created and executed

Supported in Intel Cilk and TBB

Scalable parallelism at the application level

Intel® Cilk™ Plus: One of the Intel® Parallel Building Blocks

Intel® Cilk™ Plus

• Cilk Plus = Cilk + Array Notation

• A new feature in the Intel Compiler version 12

• A simple and effective approach to exploiting both task and vector parallelism

• Works on both Intel mainstream processors and the Many-Integrated-Core (MIC) accelerator

Outline

1. Cilk Syntax and Concepts
2. Scalability Analysis
3. Race Detection and Correction

Cilk Plus

• Small set of linguistic extensions to C/C++ to support fork-join parallelism.
• Based on Cilk, developed at MIT, and Cilk++, developed at Cilk Arts.
• Uses a provably efficient work-stealing scheduler.
• Offers hyperobjects as a lock-free mechanism to resolve race conditions.
• The Cilkscreen race detector and Cilkview scalability analyzer are available as a free download from the Intel website.

Fibonacci Numbers

The Fibonacci numbers are the sequence 〈0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …〉, where each number is the sum of the previous two.

Recurrence:

F0 = 0, F1 = 1, Fn = Fn–1 + Fn–2 for n > 1.

The sequence is named after Leonardo di Pisa (1170–1250 A.D.), also known as Fibonacci, a contraction of filius Bonaccii, "son of Bonaccio." Fibonacci's 1202 book Liber Abaci introduced the sequence to Western mathematics, although it had previously been discovered by Indian mathematicians.

Fibonacci Program

#include <stdio.h>
#include <stdlib.h>

int fib(int n)
{
  if (n < 2) return n;
  int x = fib(n-1);
  int y = fib(n-2);
  return x + y;
}

int main(int argc, char *argv[])
{
  int n = atoi(argv[1]);
  int result = fib(n);
  printf("Fibonacci of %d is %d.\n", n, result);
  return 0;
}

Disclaimer: This recursive program is a poor way to compute the nth Fibonacci number, but it provides a good didactic example.

Fibonacci Execution

Call tree for fib(4):

fib(4)
├── fib(3)
│   ├── fib(2)
│   │   ├── fib(1)
│   │   └── fib(0)
│   └── fib(1)
└── fib(2)
    ├── fib(1)
    └── fib(0)

Key idea for parallelization: the calculations of fib(n-1) and fib(n-2) can be executed simultaneously without mutual interference.

Serial Fibonacci Function in C++

int fib(int n)
{
  if (n < 2) return n;
  int x = fib(n-1);
  int y = fib(n-2);
  return x+y;
}

Nested Parallelism in Cilk Plus

int fib(int n)
{
  if (n < 2) return n;
  int x = cilk_spawn fib(n-1);  // the named child function may execute
                                // in parallel with the parent caller
  int y = fib(n-2);
  cilk_sync;                    // control cannot pass this point until all
                                // spawned children have returned
  return x+y;
}

Cilk keywords grant permission for parallel execution. They do not command parallel execution.

Work-stealing task scheduler

Each processor has a work queue of spawned tasks. (Animation: four workers P repeatedly Spawn!, Return!, and Steal! tasks from the queues.)

• When each processor has work to do, a spawn is roughly the cost of a function call.
• When a processor has no work, it steals from another processor.
• With sufficient parallelism, the steals are rare, and we get linear speedup (ignoring memory effects).

Parallelizing Loops

for (int i = 0; i < 8; ++i) {
  do_work(i);
}

for (int i = 0; i < 8; ++i) {
  cilk_spawn do_work(i);
}
cilk_sync;

• A lot of overhead: spawning is cheap, stealing is expensive.
• Low parallelism: there is only one producer of parallel work at any time.

Serial for with spawn: unbalanced

(Diagram: Worker 0 spawns iterations 0, 1, 2, … one at a time, while Worker 1 repeatedly steals the remaining range 1-7, 2-7, …, 7-7.)

If work per iteration is small, then there is no parallelism.

Alternative: Divide and Conquer

(Diagram: the range 0-7 is recursively split into 0-3 and 4-7, then 0-1, 2-3, 4-5, 6-7, down to single iterations; Worker 1 takes the whole 4-7 half in one steal.)

Divide and conquer results in fewer steals and more parallelism.

Shorthand: cilk_for

for (int i = 0; i < 8; ++i) {
  do_work(i);
}

cilk_for (int i = 0; i < 8; ++i) {
  do_work(i);
}

cilk_for performs divide-and-conquer over the iterations.

Outline

1. Cilk Syntax and Concepts
2. Scalability Analysis
3. Race Detection and Correction

Quicksort Example

template <typename T>
void qsort(T begin, T end)
{
  if (begin != end) {
    T middle = partition(
      begin, end,
      bind2nd(less<typename iterator_traits<T>::value_type>(), *begin)
    );
    cilk_spawn qsort(begin, middle);
    qsort(max(begin + 1, middle), end);
    cilk_sync;
  }
}

Quicksort Performance

(Charts: measured speedup vs. number of processors, 0-16, with several possible extrapolations: like this? Or this? Maybe this? Possibly this?)

How do we extrapolate scalability?

Quicksort Performance

Limiting Factors:
• Insufficient parallelism
• Scheduling overhead
• Memory bandwidth
• Contention (locking and true/false sharing)

Cilkview extrapolates limits on scalability due to insufficient parallelism and scheduling overhead.

Cilkview Output for Quicksort

(Cilkview screenshot: curves for linear speedup, application parallelism, burdened parallelism, and measured speedup; the reported parallelism is 21.31.)

How Cilkview Works

• Cilkview reads metadata embedded by the Cilk Plus compiler to perform its calculations.
• Cilkview generates rough (but repeatable) performance measures by counting instructions rather than reading from a clock.
• Despite the coarseness of measurements, Cilkview accurately estimates scalability.

The DAG Model

A strand is a chain of serially executed instructions with no parallel control.
Strands are partially ordered by control dependencies.
A spawn node has two successors. Control joins at a sync node.

(Diagram edge labels: strand, dependency, continuation, spawn, return, sync.)

Work/Span Analysis

TP = time to execute on P processors.

The work T1 is the time to execute the program on 1 processor of an "ideal" parallel computer.
• T1 = total instructions in all nodes.

The span T∞ is the time to execute the program on an infinite number of processors, discounting scheduling.
• T∞ = longest path.

Example, assuming unit-time strands: T1 = 18, T∞ = 9.

Parallelism

• Speedup = T1 / TP.
• Work Law: TP ≥ T1 / P ⇒ T1 / TP ≤ P (no superlinear speedup).
• Span Law: TP ≥ T∞.
• Parallelism: T1 / T∞ ⇒ T1 / TP ≤ T1 / T∞ (parallelism bounds speedup on any number of cores).

Example: T1 = 18, T∞ = 9, so the parallelism T1 / T∞ = 2.

Parallel Slackness

If T1 / T∞ >> P, then T1 / TP ≈ P (sufficient parallelism implies linear speedup).

We call this property parallel slackness.

Scheduling Overhead

// Snippet A
cilk_for (int i = 0; i < 4; ++i)
  for (int j = 0; j < 1000000; ++j)
    do_work();

// Snippet B
for (int j = 0; j < 1000000; ++j)
  cilk_for (int i = 0; i < 4; ++i)
    do_work();

For both snippets, T1 / T∞ ≈ 4.

Both snippets have the same parallelism, but Snippet A significantly outperforms Snippet B.

Measured Performance of Snippets

(Chart: measured speedup vs. number of processors, 0-8, for Snippet A and Snippet B.)

Scheduling Overhead

Snippet A: schedule 4 units of a million iterations, once.
Snippet B: schedule 4 units of one iteration, a million times!

The Burdened DAG

(Diagram: a DAG whose continuation and return edges carry extra "burden" weights.)

Burden the continuation and return edges to reflect overheads due to scheduling.

Burdened span: the time on an infinite number of processors, taking burdens into account.

Series Composition

A B

Work: T1(A∪B) = T1(A) + T1(B)

Span: T∞(A∪B) = T∞(A) + T∞(B)

Parallel Composition

A

B

Work: T1(A∪B) = T1(A) + T1(B)

Span: T∞(A∪B) = max{T∞(A), T∞(B)}

Cilkview output for Snippets

(Cilkview screenshot.)

Cilkview Output for Quicksort

(Annotated Cilkview chart: the upper bound on speedup is the parallelism T1 / T∞; the work is T1; the burdened lower-bound estimate on P processors is T1 / (T1/P + k·T∞).)

Outline

1. Cilk Syntax and Concepts
2. Scalability Analysis
3. Race Detection and Correction

Race Bugs

Definition: A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.

Example:

int x = 0;
cilk_for (int i = 0; i < 2; ++i)
{
  x++;
}
assert(x == 2);

(DAG: strand A forks into parallel strands B and C, each executing x++, which join at strand D, where the assertion runs.)

Types of Races

Suppose that instruction A and instruction B both access a location x, and suppose that A∥B (A is parallel to B).

A      B      Race Type
read   read   none
read   write  read race
write  read   read race
write  write  write race

Two sections of code are independent if they have no determinacy races between them.

Non-Local Variables

1973, historical perspective. Wulf & Shaw: "We claim that the non-local variable is a major contributing factor in programs which are difficult to understand."

2011, today's reality. Non-local variables are used extensively, in part because they avoid parameter proliferation: long argument lists to functions for passing numerous, frequently used variables.

(Diagram: Procedure 1 through Procedure N all touching a shared int x via x++.)

Global and other nonlocal variables can inhibit parallelism by inducing race bugs.

Cilkscreen Screenshot

void increment(int& i) {
  ++i;
}

int main () {
  int x = 0;
  cilk_spawn increment(x);  // location of "first" access
  int y = x - 1;            // location of "second" access
  return y;
}

(Cilkscreen reports the address of the variable that was accessed in parallel, the locations of the "first" and "second" accesses, and a stack trace of the "second" access.)

Cilkscreen Race Detector

• If a deterministic Cilk Plus program run on a given input could possibly behave any differently than its serialization, the race detector will report and localize at least two accesses participating in the race.
• Employs a regression-test methodology, where the user provides test inputs.
• Identifies filenames, lines, and variables involved in races, including stack traces.
• Runs off the binary executable using dynamic instrumentation.

Cilk Plus and Cilkscreen

• In theory, a race detector could be constructed to find races in any multithreaded program, but…
• Cilk Plus allows race detection with bounded memory and time overhead, independent of the number of threads.

Avoiding Races

• Iterations of a cilk_for should be independent.
• Between a cilk_spawn and the corresponding cilk_sync, the code of the spawned child should be independent of the code of the parent, including code executed by additional spawned or called children. Note: the arguments to a spawned function are evaluated in the parent before the spawn occurs.
• Machine word size matters. Watch out for races in packed data structures:

struct {
  char a;   // updating x.a and x.b in parallel may cause a race!
  char b;   // nasty, because it may depend on the compiler
} x;        // optimization level (safe on x86 and x86_64)

Reducer Hyperobjects

∙ A variable x can be declared as a reducer over an associative operation, such as addition, multiplication, logical AND, list concatenation, etc.
∙ Strands can update x as if it were an ordinary nonlocal variable, but x is, in fact, maintained as a collection of different views.
∙ The Cilk Plus runtime coordinates the views and combines them when appropriate.
∙ When only one view of x remains, the underlying value is stable and can be extracted.

Example: a summing reducer currently maintained as three views, x: 42, x: 14, x: 33.


Using Reducer Hyperobjects

#include <cilk/cilk.h>
#include <cilk/reducer_min_max.h>

template <typename T>
size_t IndexOfMin(T array[], size_t n)
{
  cilk::reducer_min_index<size_t, T> r;

  cilk_for (int i = 0; i < n; ++i)
    r.min_of(i, array[i]);

  return r.get_index();
}

Data Race Take-aways

• A data race occurs when two parallel strands access the same memory location and at least one performs a write.
• Access need not be simultaneous for the race to be harmful.
• Cilkscreen will find data races if they occur in a program execution.
• Non-local variables are a common source of data races.
• Cilk Plus hyperobjects can be used to eliminate many data races.

A Closer Look

int x = 0;
x++;  ∥  x++;
assert(x == 2);

Each x++ compiles into a load, an increment, and a store, which may interleave:

A:  1  x = 0;
B:  2  r1 = x;      C:  4  r2 = x;
    3  r1++;            5  r2++;
    6  x = r1;          7  x = r2;
D:  8  assert(x == 2);

Depending on the interleaving, each of x, r1, and r2 may hold 0 or 1 before the final stores, so the assertion can fail.

Array Notation and Elemental Functions

Peng Tu

Copyright© 2011, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Why Data Parallel Extension to C/C++

• Language must let the programmer get to hardware performance easily.
• Array operations provide a natural mapping between data parallel algorithms and parallel execution hardware.
• C/C++ is the dominant language for performance-sensitive sequential applications because its constructs map easily to a scalar ISA.
• Array extensions to C/C++ aim at achieving predictable parallelization for applications on parallel hardware:
  • Expressiveness that is syntactically and semantically consistent with the familiar language.
  • Simple compiler techniques map array constructs to underlying multi-threaded/SIMD hardware for predictable performance.
  • Work seamlessly with existing C/C++ parallel runtime systems such as Cilk.

Overview of Data Parallel Extensions to C/C++

• Array notations to express parallelism
  – array[lower-bound : length [: stride]]
  – Compiler vectorizer maps to targeted CPU hardware
• Array reductions
  – __sec_reduce_<op> (e.g., __sec_reduce_add)
  – Merge all array elements into a scalar result
• Elemental functions
  – __declspec(vector)
  – Maps a scalar function across multiple array elements
• All this supported on C/C++ and Windows*, Linux* and Mac OS X*

Array Section Notation

• Definition: array[lower-bound : length [: stride]][lower-bound : length [: stride]]…
• We use a ":" in an array subscript to indicate multiple array elements (a section of an array)
• ":" can be used by itself to specify all array elements
• Note: the second operand is the length, not the upper bound

Examples:
A[:]         // All elements of vector A
B[2:6]       // Elements 2 to 7 of vector B
C[:][5]      // Column 5 of matrix C
D[0:3:2]     // Elements 0,2,4 of vector D
E[0:3][0:4]  // 12 elements from E[0][0] to E[2][3]

Operator Maps

• Most C/C++ arithmetic & logic operators can be performed on array sections +, -, *, /, %, <,==,!=,>,|,&,^,&&,||,!,-(unary), +(unary),++,--, +=, -=, *=, /=, *(p)

• Operators are mapped to all elements of the array section operands

a[:] * b[:]                // element-wise multiplication
a[3:2][3:2] + b[5:2][5:2]  // 2x2 matrix addition

• Operations on different elements can be executed in parallel without any ordering constraints

• Array section operands must have the same rank & extent a[0:4][1:2] + b[1:2] // error, different ranks

• Scalar operand automatically expanded to fill the whole section a[:][:] + b[0][1] // adds b[0][1] to all elements of a

Assignment Map

• Assignment operator applies in parallel to every element of the array section on LHS

a[0:n] = b[0:n] + 1;

• RHS must be same rank as LHS, or scalar

a[:] = c; // c fills the array a e[:] = b[:][:]; // error, different rank

• Overlap between RHS and LHS? – RHS is evaluated before any element on LHS is stored – Compiler will insert temporary arrays, if needed a[1:s] = a[0:s] + 1; // use old value of a[1:s-1]

Adding Two Arrays

$ cat test-add.cpp
#include <iostream>
int main() {
    double a[4] = {1.,2.,3.,4.};
    double b[4] = {5.,7.,11.,13.};
    double c[4] = {0.,0.,0.,0.};

    std::cout << "Display a:\n" << a[:] << " ";
    std::cout << std::endl << std::endl;
    std::cout << "Display b:\n" << b[:] << " ";
    std::cout << std::endl << std::endl;
    std::cout << "Display c:\n" << c[:] << " ";
    std::cout << std::endl << std::endl;

    c[:] = a[:] + b[:];

    std::cout << "c = a + b:\n" << c[:] << " ";
    std::cout << std::endl << std::endl;
}

$ icc test-add.cpp
$ ./a.out
Display a:
1 2 3 4

Display b:
5 7 11 13

Display c:
0 0 0 0

c = a + b:
6 9 14 17

Using Conditional Select

$ cat test-sel.cpp
#include <iostream>
int main() {
    bool x[4] = {0, 0, 1, 1};
    bool y[4] = {0, 1, 1, 0};
    double a[4] = {1.,2.,3.,4.};
    double b[4] = {5.,7.,11.,13.};
    double c[4] = {0.,0.,0.,0.};

    std::cout << "Display a:\n" << a[:] << " ";
    std::cout << std::endl << std::endl;
    std::cout << "Display b:\n" << b[:] << " ";
    std::cout << std::endl << std::endl;
    std::cout << "Display c:\n" << c[:] << " ";
    std::cout << std::endl << std::endl;

    c[:] = x[:] && y[:] ? a[:] : b[:];

    std::cout << "Display updated c:\n" << c[:] << " ";
    std::cout << std::endl << std::endl;
}

$ ./a.out
Display a:
1 2 3 4

Display b:
5 7 11 13

Display c:
0 0 0 0

Display updated c:
5 7 3 13

Compare to Implementation in Intrinsics

$ cat test-intrinsic.cpp
#include <immintrin.h>
void foo(float* dest, short* src, long len, float a) {
    __m128 xmmMul = _mm_set1_ps(a);
    for (long i = 0; i < len; i += 8) {
        __m128i xmmSrc1i = _mm_loadl_epi64((__m128i*) &src[i]);
        __m128i xmmSrc2i = _mm_loadl_epi64((__m128i*) &src[i+4]);

        xmmSrc1i = _mm_cvtepi16_epi32(xmmSrc1i);
        xmmSrc2i = _mm_cvtepi16_epi32(xmmSrc2i);

        __m128 xmmSrc1f = _mm_cvtepi32_ps(xmmSrc1i);
        __m128 xmmSrc2f = _mm_cvtepi32_ps(xmmSrc2i);

        xmmSrc1f = _mm_mul_ps(xmmSrc1f, xmmMul);
        xmmSrc2f = _mm_mul_ps(xmmSrc2f, xmmMul);

        _mm_store_ps(&dest[i], xmmSrc1f);
        _mm_store_ps(&dest[i+4], xmmSrc2f);
    }
}

Implementation in Array Notation

$ cat test-cilkplus.cpp
void foo(float *dest, short *restrict src, long len, float a) {
    __assume_aligned(dest, 16);
    __assume_aligned(src, 16);

    dest[0:len] = ((float) src[0:len]) * a;
}

• The restrict qualifier and alignment assertions are added to avoid checks for aliasing and alignment.

Comparable Quality

• Intrinsics:
..B1.3:                          # Preds ..B1.1 ..B1.3
        movq      (%rsi,%rax,2), %xmm1       #9.18
        movq      8(%rsi,%rax,2), %xmm2      #10.18
        pmovsxwd  %xmm1, %xmm1               #9.18
        pmovsxwd  %xmm2, %xmm2               #10.18
        cvtdq2ps  %xmm1, %xmm3               #12.25
        cvtdq2ps  %xmm2, %xmm4               #13.25
        mulps     %xmm0, %xmm3               #15.18
        mulps     %xmm0, %xmm4               #16.18
        movaps    %xmm3, (%rdi,%rax,4)       #18.21
        movaps    %xmm4, 16(%rdi,%rax,4)     #19.21
        addq      $8, %rax                   #5.34
        cmpq      %rdx, %rax                 #5.24
        jl        ..B1.3       # Prob 82%    #5.24

• Array Notation:
..B1.15:                         # Preds ..B1.15 ..B1.14
        movq      (%rsi,%rcx,2), %xmm2       #2.45
        pmovsxwd  %xmm2, %xmm2               #2.45
        cvtdq2ps  %xmm2, %xmm3               #2.45
        mulps     %xmm1, %xmm3               #2.58
        movaps    %xmm3, (%rdi,%rcx,4)       #2.15
        movq      8(%rsi,%rcx,2), %xmm4      #2.45
        pmovsxwd  %xmm4, %xmm4               #2.45
        cvtdq2ps  %xmm4, %xmm5               #2.45
        mulps     %xmm1, %xmm5               #2.58
        movaps    %xmm5, 16(%rdi,%rcx,4)     #2.15
        addq      $8, %rcx                   #2.15
        cmpq      %r8, %rcx                  #2.15
        jb        ..B1.15      # Prob 44%    #2.15

• A portable vector program with performance similar to intrinsics.

Specify Array Shape

• The compiler must know the shape and size of an array to identify multidimensional array sections.

• C/C++ has 3 ways to specify the shape and size of arrays:

• Fixed-length array and array parameter
– float a[100][50];
– void foo(float b[50][100]);

• Variable-length local array or parameter (VLA)
– int bar(int m, int n) { float a[m][n]; … }
– int fred(int m, int n, float b[m][n]);

• Pointer to heap-memory array
– float (*p2d)[100], (*q2d)[m];

C99 Multidimensional Array Parameters

• Users can pass a pointer to an array to a function with a VLA parameter, and section it inside the function body.
• The compiler uses the C99 VLA declaration to find out the shape and extent of array sections.

int add3(int rows, int cols, int B[rows][cols]) {
    B[:][:] += 3; // add 3 to each element of B[0:rows][0:cols]
}

Heap Allocated Arrays

• Specify dynamic multi-dimensional arrays in C/C++:

typedef int (*a2d)[10]; // pointer to int vector of size 10
a2d p;
p = (a2d) malloc(sizeof(int) * rows * 10);
p[4][:] = 42;       // set all of row 4 to 42
p[0:rows][:] = 42;  // set the whole array to 42
p[:][:] = 42;       // error: the number of rows is unknown

• The user needs to explicitly include the row size for heap-allocated arrays.

Reductions

• A reduction combines array section elements to generate a scalar result:

int a[] = {1,2,3,4};
sum = __sec_reduce_add(a[:]); // sum is 10

• Nine built-in reduction functions supporting basic C data types:
– add, mul, max, max_ind, min, min_ind, all_zero, all_non_zero, any_nonzero

• Supports user-defined reduction functions:

type fn(type in1, type in2); // scalar reduction function
out = __sec_reduce(fn, identity_value, in[x:y:z]);

Custom Reduction

$ cat test.cpp
#include <iostream>

unsigned int bitwise_and(unsigned int x, unsigned int y) {
    return (x & y);
}

int main() {
    unsigned int a[4] = {5,7,13,15}; // i.e. 0101, 0111, 1101, 1111
    unsigned int b = 0;

    std::cout << "Display a:\n" << a[:] << " ";
    std::cout << std::endl << std::endl;

    b = __sec_reduce(bitwise_and, 0xffffffff, a[:]);

    std::cout << "b:\n" << b << std::endl;
    return(0);
}

$ ./a.out
Display a:
5 7 13 15

b:
5   // i.e. 0101

Gather/Scatter Support

• When an array section occurs in an array subscript, it represents the set of elements indexed by the values of the array section:

a[b[0:s]] = c[:] // a[b[0]]=c[0], a[b[1]]=c[1], …
c[0:s] = a[b[:]] // c[0]=a[b[0]], c[1]=a[b[1]], …

• The compiler generates scatter and gather instructions on supported hardware.

Gather/Scatter Example

$ cat gather_scatter.cpp
void foo(float *dest, float *src, unsigned int *ind_dest,
         unsigned int *ind_src, int len) {
    dest[ind_dest[0:len]] = src[ind_src[0:len]];
}

$ cat main.cpp
#include <iostream>
void foo(float *dest, float *src, unsigned int *ind_dest,
         unsigned int *ind_src, int len);

int main() {
    float x[5] = {1., 2., 3., 4., 5.};
    float y[5] = {0., 0., 0., 0., 0.};
    unsigned int y_ind[5] = {4, 3, 2, 1, 0};
    unsigned int x_ind[5] = {1, 3, 0, 2, 4}; // i.e. gathers {2,4,1,3,5}

    std::cout << "x: " << x[:] << " ";
    std::cout << std::endl << std::endl;
    std::cout << "y: " << y[:] << " ";
    std::cout << std::endl << std::endl;

    foo(y, x, y_ind, x_ind, 5);

    std::cout << "y: " << y[:] << " ";
    std::cout << std::endl << std::endl;
    return(0);
}

$ ./a.out
x: 1 2 3 4 5

y: 0 0 0 0 0

y: 5 3 1 4 2 // 2,4,1,3,5 backwards

FIR Filter

The filter computes y[i] = x[i]*c[0] + x[i+1]*c[1] + … + x[i+K-1]*c[K-1].

Scalar code:

for (i = 0; i < M-K; i++) {
    s = 0;
    for (j = 0; j < K; j++) {
        s += x[i+j] * c[j];
    }
    y[i] = s;
}

Inner (j) loop vectorized:

for (i = 0; i < M-K; i++) {
    y[i] = __sec_reduce_add(x[i:K] * c[0:K]);
    // perform K multiplications in parallel and add up the products
}

Outer (i) loop vectorized:

y[0:M-K] = 0;
// calculate all the y[i] results in parallel
for (j = 0; j < K; j++) {
    y[0:M-K] += x[j:M-K] * c[j];
}

Parameter Passing

• An array section can be passed as an argument to a fixed or variable length array parameter:
– The programmer directly passes the address of its first element.
– A section argument must be contiguous in order to match the layout of the formal parameter.

void saxpy_vec(int m, float a, float x[restrict m], float y[restrict m]) {
    y[:] += a * x[:];
}

cilk_for (int i = 0; i < n; i += 256)
    saxpy_vec(256, 2.0, &x[i], &y[i]);

Function Maps

A scalar function call is mapped to the elements of array section arguments by default:

a[:] = sin(b[:]);
a[:] = pow(b[:], c); // b[:]**c
a[:] = pow(c, b[:]); // c**b[:]
a[:] = foo(b[:]);    // user-defined function

– Functions are mapped in parallel
– No specific order on side effects
– The compiler generates calls to vectorized functions
– The compiler may generate parallel threads
– The compiler may generate a vectorized function body for a function declared as 'elemental'

User Defined Elemental Vector Function

$ cat main.cpp
void saxpy(float a, float *x, float *y);
void foo(float *x, float *y, float a, int len) {
    for (int i = 0; i < len; i++)
        saxpy(a, &x[i], &y[i]);
}

$ cat saxpy1.cpp
void saxpy(float a, float *x, float *y) {
    *y += a * (*x);
}

$ icc -vec-report3 -c main.cpp
main.cpp(3): (col. 58) remark: routine skipped: no vectorization candidates.

Declare an Elemental Function

$ cat main2.cpp
__declspec(vector(scalar(a), linear(x), linear(y)))
void saxpy(float a, float *x, float *y);
void foo(float *x, float *y, float a, int len) {
    for (int i = 0; i < len; i++)
        saxpy(a, &x[i], &y[i]);
}

$ cat saxpy2.cpp
__declspec(vector(scalar(a), linear(x), linear(y)))
void saxpy(float a, float *x, float *y) {
    *y += a * (*x);
}

$ icc -vec-report3 -restrict -c main2.cpp
main2.cpp(4): (col. 4) remark: LOOP WAS VECTORIZED.

Invoke Elemental Function with Array Sections

__declspec(vector(scalar(a), linear(x), linear(y)))
void saxpy(float a, float *x, float *y);

void foo(float *restrict x, float *restrict y, float a, int len) {
    saxpy(a, x[0:len], y[0:len]);
}

The compiler effectively generates:

void saxpy_4(float a, float x[4], float y[4]) {
    y[:] += a * x[:];
}

for (i = 0; i < len-3; i += 4) {
    saxpy_4(a, &x[i], &y[i]); // vectorized body, 4 elements at a time
}
for (; i < len; i++) {
    saxpy(a, &x[i], &y[i]);   // scalar remainder
}

Current Limitations on Elemental Functions

• Only certain parameter types are allowed:
– signed/unsigned 8/16/32/64-bit integers
– 32- or 64-bit floating point
– 64- or 128-bit complex
– pointer or C++ reference
• No for, while, do, or goto keywords
• No switch statements
• No class or struct operations other than member selection
• No function calls to non-elemental functions
• No threading, OpenMP*, or cilk_spawn / cilk_for
• No expressions with array notation
• No setjmp/longjmp
• No inline assembly
• No virtual functions or function-pointer calls
• Any of these will result in a syntax error

Array Notation Summary

(Figure: element-wise vector operations a[:]+b[:]; function mapping f(b[:]); reductions __sec_reduce(f, a[:]).)

• Parallel array operations
• Parallel elemental function map
• Parallel reduction across array sections
• Makes vectorization obvious and straightforward for the compiler
• Works with existing C/C++ array data types

Intel® Software Autotuning Tool (ISAT)

Chi-Keung (CK) Luk

What is Autotuning?

• Definition in Software Community (from Berkeley ParLab): – Autotuners optimize the performance of a set of library kernels by generating many variants of a given kernel and benchmarking each variant by running on the target platform. The search effectively tries many or all optimizations and hence may take hours to complete on the target platform.

Autotuning Usage Models

(Figure: autotuning is applied both to applications — consumer video/graphics, HPC — and to libraries such as MKL, IPP, ATLAS, and FFTW. Our focus is applications.)

Intel Software Autotuning Tool (ISAT)

• Supports automatic searching for the near-optimal values of important program parameters
• Built-in support for tuning parameters in commonly used threading models:
– OpenMP
– TBB
• Provides supplementary tools for:
– Visualizing tuning results
– Identifying regions that need tuning
• Easy to use:
– No compiler dependence

Freely available at: http://software.intel.com/en-us/articles/intel-software-autotuning-tool/

Example: Tuning General Program Parameters

• Tuning the block size in blocked matrix multiply:

void BlockedMatrixMultiply(float* A, float* B, float* C, int n) {
  #pragma isat tuning variable(blk_i, range(32,256,16)) \
                      variable(blk_j, range(32,256,16)) \
                      variable(blk_k, range(32,256,16))  // range(begin, end, step)
  const int blk_i = 64;
  const int blk_j = 64;
  const int blk_k = 64;
  for (int i = 0; i < n; i += blk_i)
    for (int j = 0; j < n; j += blk_j)
      for (int k = 0; k < n; k += blk_k)
        ... // multiply the current blocks of A and B into the block of C
}

Example: Tuning OpenMP Parameters

ISAT-defined variables (prefixed with @):

#pragma isat tuning variable(@omp_schedule_type, [static, dynamic, guided]) \
                    variable(@omp_schedule_chunk, range(1000,10000,200)) \
                    search(dependent)
// search(dependent) instructs ISAT to try all possible combinations of the two variables
#pragma omp parallel for
for (i = 0; i < N; i++) {
    ...
}

Example: Tuning TBB Parameters

#pragma isat tuning variable(@tbb_grain_size, range(100,1000,50))
// automatically instantiated as the 3rd argument of the blocked_range

parallel_for(blocked_range<int>(0, N), Compute());

Case Study: Tuning Game-of-Life

Game-of-Life:
– A cellular automaton invented by John Conway in 1970
– The game board evolves from the current generation to the next according to a set of rules
– Parallelism can be exploited by computing independent cells on the board in parallel

Game-of-Life in OpenMP

• An OMP implementation of Game-of-Life:

int main() {
    ...
    #pragma isat tuning measure(M1_begin, M1_end) scope(M1_begin, M1_end) \
            variable(x_blksize, range(200, 600, 10)) \
            variable(y_blksize, range(200, 600, 10)) \
            search(dependent)
    #pragma isat marker M1_begin
    const int x_blksize = 100;
    const int y_blksize = 100;
    run_benchmark( new TiledOMPBoard( xsize, ysize, x_blksize, y_blksize,
                                      "tiledomp.log" ), count );
    #pragma isat marker M1_end
}

void TiledOMPBoard::next_gen() {
    int n = m_tiles.size();
    #pragma isat tuning measure(M2_begin, M2_end) \
            variable(@omp_schedule_type, [static,dynamic,guided]) \
            variable(@omp_schedule_chunk, range(1, 1000, 10)) \
            search(dependent)
    #pragma isat marker M2_begin
    // first update the border/halo information in each tile
    #pragma omp parallel for
    for ( int i = 0; i < n; ++i ) {
        int west = getWest( i );
        const uch* border_west = west >= 0 ? m_tiles[west]->getEast() : 0;
        int north = getNorth( i );
        const uch* border_north = north >= 0 ? m_tiles[north]->getSouth() : 0;
        int east = getEast( i );
        const uch* border_east = east >= 0 ? m_tiles[east]->getWest() : 0;
        int south = getSouth( i );
        const uch* border_south = south >= 0 ? m_tiles[south]->getNorth() : 0;
        int nw = getNW( i );
        uch corner_nw = nw >= 0 ? m_tiles[nw]->getSE() : 0;
        int ne = getNE( i );
        uch corner_ne = ne >= 0 ? m_tiles[ne]->getSW() : 0;
        int sw = getSW( i );
        uch corner_sw = sw >= 0 ? m_tiles[sw]->getNE() : 0;
        int se = getSE( i );
        uch corner_se = se >= 0 ? m_tiles[se]->getNW() : 0;
        m_tiles[i]->updateBorders( border_north, border_east, border_south, border_west,
                                   corner_nw, corner_ne, corner_se, corner_sw );
    }
    #pragma isat marker M2_end
}

Visualization of Tuning Results (1)

(Chart: tuning (x_blksize, y_blksize); the y-axis is execution time in seconds — lower is better.)

Visualization of Tuning Results (2)

(Chart: tuning the schedule for parallel_for; the y-axis is execution time in seconds — lower is better.)

A Bit of Implementation Details

• Implemented in Python
– Source-to-source translation
• Works with any C/C++ compiler
– Extendable to other languages (…)
• -safe implementation

Tuning flow: untuned C/C++ program → preprocess isat pragmas → generate code variants (in C) for tunable regions → generate intermediate executable via cc → profile code variants → tuned C/C++ program

Potential Features in Future Releases

• Tune for energy:
– Feasible on Sandy Bridge with its energy counters

• Support for tuning MPI programs:
– Attractive since MPI programs are highly latency sensitive

• Advanced search strategies:
– Machine learning

• Search for the best permutations of compiler optimizations

A Synergetic Approach to Parallel Programming on Multicores

Chi-Keung (CK) Luk

Multicore Architectures

Observations:
1. Multiple levels of hardware parallelism (hardware threads T0/T1 per core, SIMD units)
2. Multiple levels of caches (per-core I$/D$ and L2, shared L3)

=> Software must exploit both parallelism and locality

(Figure: two sockets, each with four cores sharing an L3 cache, connected to main memory.)

Our Approach

Basic idea: divide and conquer.

Steps:
1. Use cache-oblivious techniques to divide a large problem into smaller subproblems
2. Perform compiler-based simdization on the base case
3. Use autotuning to tune the parameters in Steps (1) and (2)

Step 1: Cache-Oblivious Parallelization

Decompose the problem recursively until subproblems are ≤ grainsize.

• Parallelism:
– Independent subproblems are computed in parallel
– Work-stealing achieves good load balancing

• Choice of grainsize:
– Small enough to fit in the smallest cache
– Big enough to amortize the recursion overhead

Example: Strassen's Matrix-Multiplication Algorithm

C = A x B, partitioned into 2x2 blocks:

| C11 C12 |   | A11 A12 |   | B11 B12 |
| C21 C22 | = | A21 A22 | x | B21 B22 |

Strassen's algorithm:

P1 = (A11 + A22) x (B11 + B22)
P2 = (A21 + A22) x B11
P3 = A11 x (B12 - B22)
P4 = A22 x (B21 - B11)
P5 = (A11 + A12) x B22
P6 = (A21 - A11) x (B11 + B12)
P7 = (A12 - A22) x (B21 + B22)

P1 to P7 can be computed in parallel.

C11 = P1 + P4 - P5 + P7
C12 = P3 + P5
C21 = P2 + P4
C22 = P1 - P2 + P3 + P6

C11 to C22 can be computed in parallel once P1 to P7 are available.

Step 2: Compiler-based Simdization

• Simdization is best done by the compiler instead of the programmer
• Simdization support in the Intel Compiler, in decreasing ease of use:
1. Auto-simdization
2. #pragmas
3. Array Notation
4. SIMD intrinsics (we don't encourage these)

Step 3: Autotuning

• To maximize performance, we can autotune the parameters in Steps (1) and (2):
– Base-case size in a cache-oblivious algorithm
– Degree of parallelism: e.g., how many threads to use?
– Level of parallelism: distribution of work over TLP, ILP, and SIMD
– Scheduling policy and granularity

Case Study: Stencil Computations

• Stencil computations are very common:
– Scientific computing, image processing, geometric modeling

• An n-dimensional stencil:
– The computation of an element in an n-dimensional spatial grid at time t as a function of neighboring grid elements at times t-1, …, t-k

Lattice Boltzmann Method (LBM)

• The stencil problem used in this case study:
– Lattice Boltzmann Method (LBM) from SPEC CPU2006
– Performs numerical simulation in computational fluid dynamics in 3D space
• LBM is a 19-point stencil

(Figure: a grid of cells; each cell holds 19 velocity fields — C, N, S, E, W, T, B, NE, NW, SE, SW, NT, NB, ST, SB, ET, EB, WT, WB.)

Original LBM Code

typedef enum {
    C=0, N, S, E, W, T, B, NE, NW, SE, SW, NT, NB,
    ST, SB, ET, EB, WT, WB, FLAGS, N_CELL_ENTRIES
} CELL_ENTRIES;

typedef double LBM_Grid[(PADDING + SIZE_Z * SIZE_Y * SIZE_X) * N_CELL_ENTRIES];

void LBM_performStreamCollide(LBM_Grid* src, LBM_Grid* dst) {
    for (z = 0; z < SIZE_Z; z++)
        ...
}

Performance issues:
– FLOP/byte ratio = 1.8
=> Bandwidth bound on Nehalem

Our LBM Optimization Strategy

• Subdivide the 4D space (x,y,z,t) using Frigo and Strumpen’s cache-oblivious algorithm: – Maximize data reuse in all 4 dimensions

• Simdize the cache-oblivious basecase with #pragma

• Autotune the parameters in their algorithm

Frigo and Strumpen's Stencil Algorithm

(Figure: recursive cuts of the space-time domain. Maximize parallelism & locality in the 4D space.)

LBM: Cache-Oblivious + Autotuned

int NPIECES = 2;
int dx_threshold = 32;
int dy_threshold = 2;
int dz_threshold = 2;
int dt_threshold = 3;

#pragma isat tuning measure(start_timing, end_timing) scope(start_scope, end_scope) \
        variable(NPIECES, range(2, 8, 1)) \
        variable(dx_threshold, range(2, 128, 1)) \
        variable(dy_threshold, range(2, 128, 1)) \
        variable(dz_threshold, range(2, 128, 1)) \
        variable(dt_threshold, range(2, 128, 1))
#pragma isat marker(start_scope)

void CO(int t0, int t1, int x0, int dx0, int x1, int dx1,
        int y0, int dy0, int y1, int dy1,
        int z0, int dz0, int z1, int dz1) {
    int dt = t1 - t0;
    int dx = x1 - x0;
    int dy = y1 - y0;
    int dz = z1 - z0;
    if (dx >= dx_threshold && dx >= dy && dx >= dz &&
        dt >= 1 && dx >= 2 * dt * NPIECES) {
        int chunk = dx / NPIECES;
        int i;
        for (i = 0; i < NPIECES; ++i)
            ... // recurse on the NPIECES chunks
    }
    ...
}

Case Study 2: Binary-Tree Search

• Search for a query, based on its key, in a database organized as a packed binary tree.

(Figure: original breadth-first layout. The number shown in each node is the key; the number in [ ] is the memory location of the node; the root is at [0].)

Corresponding search code:

int Keys[numNodes];      // keys organized as a binary tree
int Queries[numQueries]; // input queries
int Answers[numQueries]; // output: whether each query is found

void ParallelSearchForBreadthFirstLayout() {
    // Search the queries in parallel
    cilk_for (int q = 0; q < numQueries; q++)
        ...
}

Optimization Opportunities

• Two optimization opportunities:
1. Reducing cache misses
2. Simdizing the search

• Our strategy:
– Use a cache-oblivious tree layout: a variant of the van Emde Boas layout that also facilitates simdization
– Use array notation to simdize: 1 SIMD lane processes 1 query
– Use autotuning to distribute work over TLP, ILP, and SIMD

(Figure: cache-oblivious layout. The number in each node is the key and the number in [ ] is its memory location; each subtree is stored contiguously — root 7 at [0]; children 3 and 11 at [1] and [2]; subtree roots 1, 5, 9, 13 at [3], [6], [9], [12]; leaves 0, 2, 4, 6, 8, 10, 12, 14 at [4], [5], [7], [8], [10], [11], [13], [14].)

#pragma isat tuning scope(start_scope, end_scope) measure(start_timing, end_timing) \
        variable(SUBTREE_HEIGHT, [4,6,8,12])
#pragma isat tuning scope(start_scope, end_scope) measure(start_timing, end_timing) \
        variable(BUNDLE_SIZE, range(8,64,1)) variable(VLEN, range(4,64,4)) \
        search(dependent)
// autotune SUBTREE_HEIGHT, BUNDLE_SIZE, and VLEN

void ParallelSearchForCacheOblivious() {
    int numNodesInSubTree = (1 << SUBTREE_HEIGHT) - 1;
    int bundleSize = BUNDLE_WIDTH * VLEN;
    int remainder = numQueries % bundleSize;
    int quotient = numQueries / bundleSize;
    int numBundles = ((remainder == 0) ? quotient : (quotient + 1));

    // each thread processes a bundle of queries
    cilk_for (int b = 0; b < numBundles; b++) {
        int q_begin = b * bundleSize;
        int q_end = MIN(q_begin + bundleSize, numQueries);
        for (int q = q_begin; q < q_end; q += VLEN) {
            // simdized with array notation; each SIMD lane processes a query
            int searchKey[VLEN] = Queries[q:VLEN];
            int* array[VLEN] = Keys;
            int subTreeIndexInLayout[VLEN] = 0;
            int localAnswers[VLEN] = 0;
            for (int hTreeLevel = 0; hTreeLevel < HierTreeHeight; ++hTreeLevel) {
                int i[VLEN] = 0;
                for (int levelWithSubTree = 0; levelWithSubTree < SUBTREE_HEIGHT; ++levelWithSubTree) {
                    int currKey[VLEN];
                    for (int k = 0; k < VLEN; ++k)
                        ...
                }
            }
        }
    }
}

Experimental Framework

Machine configuration:
– Architecture: Intel Nehalem
– Core clock: 2.27 GHz
– Number of cores: 8 (on 2 sockets)
– Memory size: 12 GB
– Memory bandwidth: 22.6 GB/s
– Compiler: ICC v12 -fast
– OS: 64-bit CentOS 4

Benchmarks:
– 3dfd: 3D finite difference computation
– Bilateral: bilateral image filtering
– LBM: Lattice Boltzmann Method
– MatrixMultiply: dense matrix multiplication
– Search: searching a binary tree
– Sort: sorting

Overall Performance Results

[Figure: speedup over serial (log scale) for each benchmark (3dfd, Bilateral, LBM, MatrixMultiply, Search, Sort) and the geometric mean, under four configurations: Simple Parallel; Cache-Oblivious Parallel; Cache-Oblivious Parallel + Simdization; Cache-Oblivious Parallel + Simdization + Autotuning]

Our approach is 19x faster than the best serial code, or 4x faster than loop-based parallelization

LBM Detailed Results

• Other researchers have applied the AOS-to-SOA optimization to LBM
• So, how does our approach perform against the SOA optimization?

[Figure: speedup over serial — LoopParallel+SIMD with AOS (default): 1.15x; LoopParallel+SIMD with SOA: 2.7x; Cache-oblivious Parallel + SIMD: 3.8x]

How Autotuning Improved TreeSearch

Best configuration found: VLEN=48, BUNDLE_SIZE=32

Summary

• Intel® Cilk™ Plus is a simple way to write efficient parallel programs on multicores

• A synergistic approach combining Cilk Plus with autotuning achieves a 19x speedup over the best serial code

• Call to action:
– Try Cilk Plus in the Intel® Compiler:
  http://software.intel.com/en-us/articles/intel-composer-xe/
– Try CilkView and CilkScreen:
  http://software.intel.com/en-us/articles/intel-cilk-plus-software-development-kit/
– Try the Intel® Software Autotuning Tool:
  http://software.intel.com/en-us/articles/intel-software-autotuning-tool/
– Download our IEEE Software paper:
  http://software.intel.com/sites/products/papers/tpt_ieee.pdf

Optimization Notice

Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options." Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel- compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not. Notice revision #20101101

Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.

Copyright © 2010. Intel Corporation.

http://intel.com/software/products
