Introduction to Intel Cilk


Intel Software and Services, 2014/3/5. Copyright © 2014, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Agenda
• Cilk keywords
• Load balancing
• Reducer
• Summary

Cilk keywords
• Cilk adds three keywords to C and C++: _Cilk_spawn, _Cilk_sync, and _Cilk_for.
• If you #include <cilk/cilk.h>, you can write the keywords as cilk_spawn, cilk_sync, and cilk_for.

cilk_spawn and cilk_sync
• cilk_spawn (or _Cilk_spawn) gives the runtime permission to run a child function asynchronously.
  – No second thread is created or required!
  – If no worker is available, the child executes as an ordinary serial function call.
  – The scheduler may steal the parent and run it in parallel with the child function.
  – The parent is not guaranteed to run in parallel with the child.
• cilk_sync (or _Cilk_sync) waits for all children to complete.

Anatomy of a spawn
• In the spawning (parent) function, the statements after a cilk_spawn are the continuation; the spawned (child) function may run in parallel with that continuation until the cilk_sync:

    void f()             // spawning function (parent)
    {
        work;
        cilk_spawn g();  // spawn: g() is the spawned function (child)
        work;            // continuation: may run in parallel with g()
        cilk_sync;       // sync: wait for g() to complete
        work;
    }

    void g()             // spawned function (child)
    {
        work;
    }

Work stealing when another worker is available
• While Worker A executes the spawned child g(), an idle Worker B can steal the parent f() and run its continuation in parallel; if no other worker is available, f() simply resumes after g() returns.

Load balancing
• The work-stealing scheduler automatically load-balances:
  – An idle worker will find work to do.
  – If the program has enough parallelism, then all workers will stay busy.

Quicksort example

    void qsort(int* begin, int* end)
    {
        if (begin != end) {
            int* pivot = end - 1;
            int* middle = std::partition(begin, pivot,
                                         std::bind2nd(std::less<int>(), *pivot));
            using std::swap;
            swap(*pivot, *middle);            // move pivot to middle
            cilk_spawn qsort(begin, middle);  // divide-and-conquer:
            qsort(middle + 1, end);           // asynchronous recursion
            cilk_sync;
        }
    }

Parallelism
• Experiments show that this qsort on 100,000,000 integers gets linear speedup up to about 7 processors.
• Why doesn't the speedup continue to 8 or more processors?
• qsort has only enough parallelism to keep about 7 processors busy.
  – The spawned recursion adds parallelism, but...
  – the serial partition increases the span: the Θ(n) partition at each level runs serially, so the span is Θ(n) while the work is Θ(n log n), leaving only about Θ(log n) parallelism.
• Formally: parallelism = the total work divided by the work along the longest serial path (the span).
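• To make the spawn/sync mechanics concrete before turning to overheads, here is a minimal sketch using the classic Fibonacci example; it is not from the original deck, and assumes a Cilk Plus-capable compiler:

    #include <cilk/cilk.h>

    // Illustrative sketch (not from the deck): the spawned child fib(n - 1)
    // may run in parallel with the continuation fib(n - 2); cilk_sync joins them.
    long fib(long n)
    {
        if (n < 2) return n;
        long a = cilk_spawn fib(n - 1);  // child, may run asynchronously
        long b = fib(n - 2);             // continuation, runs in the parent
        cilk_sync;                       // wait for the spawned child
        return a + b;
    }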
Work-stealing overheads
• Spawning is cheap (3-5 times the cost of a function call).
  – Spawn early and often.
  – Optimal scheduling requires that the parallelism be about an order of magnitude greater than the actual number of cores.
• Stealing is much more expensive (it requires locks and memory barriers).
• Most spawns do not result in steals.
• The more balanced the workload, the less stealing there is, and hence the less overhead.

cilk_for loop
• Looks like a normal for loop:

    cilk_for (int x = 0; x < 1000000; ++x) { ... }

• Any or all iterations may execute in parallel with one another.
• All iterations complete before the program continues.
• Constraints:
  – Limited to a single control variable.
  – The runtime must be able to jump to the start of any iteration at random.
  – Iterations should be independent of one another.

Implementation of cilk_for
• cilk_for (int i = 0; i < 8; ++i) f(i); runs by recursive bisection: the iteration range 0-7 is split into 0-3 and 4-7, one half spawned and the other kept as the continuation, and each half is split again (0-1, 2-3, 4-5, 6-7) until single iterations remain.

cilk_for vs. serial for with spawn
• Compare the following loops:

    for (int x = 0; x < n; ++x) { cilk_spawn f(x); }

    cilk_for (int x = 0; x < n; ++x) { f(x); }

• The two loops have similar semantics, but they have very different performance characteristics.

Serial for with spawn: unbalanced
• The serial loop spawns one iteration at a time, so each steal transfers only the loop's continuation: Worker B steals iterations 1-7, then 2-7, then 3-7, and so on, stealing again and again.
• If the work per iteration is small, the steal overhead can be significant.

cilk_for: divide and conquer
• With recursive bisection, a single steal transfers half of the remaining range (e.g., Worker B steals 4-7 while Worker A keeps 0-3), and each worker then subdivides its half locally.
• Divide and conquer results in few steals and less overhead.

cilk_for examples (see the sketch after this list for a loop that meets every constraint)

    cilk_for (int x = 0; x < 1000000; x += 2) { ... }                          // OK
    cilk_for (vector<int>::iterator x = y.begin(); x != y.end(); ++x) { ... }  // OK
    cilk_for (list<int>::iterator x = y.begin(); x != y.end(); ++x) { ... }    // not allowed

• The list version fails the constraints:
  – The loop count cannot be computed in constant time for a list (y.end() - y.begin() is not defined).
  – There is no random access to the elements of a list (y.begin() + n is not defined).
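• As a hypothetical sketch (not from the deck), here is a loop that satisfies all three constraints: one control variable, a random-access iteration space, and independent iterations:

    #include <cilk/cilk.h>

    // Hypothetical sketch: every iteration writes a distinct y[i], so the
    // iterations are independent and may run in any order, in parallel.
    void saxpy(int n, float a, const float* x, float* y)
    {
        cilk_for (int i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }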
Grain size
• If the work per iteration of a cilk_for is sufficiently small, even the spawn overhead can become noticeable.
• To reduce the overhead, cilk_for chunks the loop into "grains" that execute serially.
• The default grain size will yield good performance in most cases.
  – Default grain size heuristic: N/8p, where N is the number of loop iterations and p is the number of workers.
  – This heuristic produces sufficient parallel slackness for the work-stealing scheduler on loops that are not radically unbalanced.

#pragma cilk grainsize
• The programmer may choose a grain size explicitly:

    #pragma cilk grainsize = expression
    cilk_for (...)

• The pragma is most useful for setting the grain size to 1 for large, unbalanced loops.
• If the grain size is set too small for short loops, spawn overhead reduces performance.
• If the grain size is set too large, parallelism is lost.

Serialization
• Every Cilk program has an equivalent serial program called the serialization.
• The serialization is obtained by removing the cilk_spawn and cilk_sync keywords and replacing cilk_for with for.
  – The compiler will produce the serialization for you if you compile with /Qcilk-serialize (Windows) or -cilk-serialize (Linux).
• Running with only one worker is equivalent to running the serialization.

Serial semantics
• A deterministic Cilk program will have the same semantics as its serialization:
  – Easier regression testing.
  – Easier to debug: run with one core, or run the serialization.
  – Composable.
  – Strong analysis tools (Cilk-specific versions will be posted on WhatIf): a race detector and a parallelism analyzer.

Implicit syncs
• A cilk_sync is implied at each of the points marked below:

    void f()
    {
        cilk_spawn g();
        cilk_for (int x = 0; x < lots; ++x) {
            ...
        }                // at the end of a cilk_for body (does not sync g())
        try {            // before entering a try block containing a sync
            cilk_spawn h();
        } catch (...) {  // at the end of a try block containing a spawn
            ...
        }
    }                    // at the end of a spawning function

Reducer library
• A reducer is a variable that can be safely used by multiple strands running in parallel.
• Cilk's hyperobject library contains many commonly used reducers:
  – reducer_list_append, reducer_list_prepend
  – reducer_max, reducer_max_index
  – reducer_min, reducer_min_index
  – reducer_opadd, reducer_ostream, reducer_basic_string
• The slide's accompanying example begins:

    int main(int argc, char* argv[])
    {
        unsigned int n = 1000000;
        cilk::reducer_opadd<unsigned ...
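• The slide's example breaks off mid-declaration above. A plausible completion, assuming the standard cilk::reducer_opadd interface from <cilk/reducer_opadd.h> and widening the sum to unsigned long long so the roughly 5×10^11 total does not overflow (this reconstruction is mine, not the original slide's):

    #include <iostream>
    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>

    // Reconstruction under stated assumptions: each strand updates its own
    // private view of `total`, and the runtime merges the views with +.
    int main(int argc, char* argv[])
    {
        unsigned int n = 1000000;
        cilk::reducer_opadd<unsigned long long> total(0);
        cilk_for (unsigned int i = 1; i <= n; ++i) {
            total += i;  // race-free: a reducer, not a shared variable
        }
        std::cout << "Sum = " << total.get_value() << std::endl;
        return 0;
    }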