
Introduction to

Software AND Services

Copyright © 2014, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 2014/3/5

Agenda

• Cilk Keywords
• Load Balancing
• Reducer
• Summary

Cilk keywords

• Cilk adds three keywords to C and C++:
    _Cilk_spawn
    _Cilk_sync
    _Cilk_for

• If you #include <cilk/cilk.h>, you can write the keywords as cilk_spawn, cilk_sync, and cilk_for.

cilk_spawn and cilk_sync

• cilk_spawn (or _Cilk_spawn) gives the runtime permission to run a child function asynchronously.
  – No 2nd thread is created or required!
  – If there are no available workers, then the child will execute as a serial function call.
  – The scheduler may steal the parent and run it in parallel with the child function.
  – The parent is not guaranteed to run in parallel with the child.
• cilk_sync (or _Cilk_sync) waits for all children to complete.
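A minimal sketch of the two keywords in use. The function `fib` is a hypothetical example that does not appear in the slides. The prelude assumes that Cilk-enabled compilers define the `__cilk` macro; on any other compiler the keywords are defined away, so the same source runs as an ordinary serial program.

```cpp
#include <cassert>

#if defined(__cilk)            // assumption: defined by Cilk-enabled compilers
#include <cilk/cilk.h>
#else                          // fallback: erase the keywords (serial execution)
#define cilk_spawn
#define cilk_sync
#endif

// Hypothetical example function: naive recursive Fibonacci.
long fib(int n)
{
    assert(n >= 0);
    if (n < 2) return n;
    long a = cilk_spawn fib(n - 1);  // child: the runtime may run it asynchronously
    long b = fib(n - 2);             // continuation: may be stolen by another worker
    cilk_sync;                       // wait for the spawned child before reading a
    return a + b;
}
```

Calling `fib(10)` returns 55 whether or not any steal occurs, because `a` is only read after the `cilk_sync`.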

Anatomy of a spawn

    void f()                  // spawning (parent) function
    {
        cilk_spawn g();       // spawn
        work                  // continuation
        work
        work
        cilk_sync;            // sync
        work
    }

    void g()                  // spawned (child) function
    {
        work
        work
        work
    }

when another worker is available

    void f()
    {
        cilk_spawn g();
        work                  // steal!
        work
        work
        cilk_sync;
        work
    }

    void g()
    {
        work
        work
        work
    }

Worker A: g(). Worker B: the stolen continuation of f(). Worker ?: the code after cilk_sync.

Load Balancing

• The work-stealing scheduler automatically load-balances:
  – An idle worker will find work to do.
  – If the program has enough parallelism, then all workers will stay busy.

Quicksort Example

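    #include <algorithm>
    #include <utility>
    #include <cilk/cilk.h>

    // Divide-and-conquer quicksort with asynchronous recursion.
    void qsort(int* begin, int* end)
    {
        if (begin != end) {
            int* pivot = end - 1;
            const int pivot_value = *pivot;
            // Serial partition: elements smaller than the pivot go first.
            int* middle = std::partition(begin, pivot,
                [pivot_value](int x) { return x < pivot_value; });
            using std::swap;
            swap(*pivot, *middle);              // move pivot to middle
            cilk_spawn qsort(begin, middle);    // left part: spawned child
            qsort(middle + 1, end);             // right part: continuation
            cilk_sync;                          // wait for the spawned child
        }
    }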

Parallelism

• Experiments show that using qsort on 100,000,000 integers will get linear speedup up to about 7 processors.
• Why doesn’t the speedup continue to 8 or more processors?
• qsort has only enough parallelism to keep 7 processors busy.
  – The spawned recursion adds parallelism, but…
  – The serial partition increases the span.
• Formally: parallelism = the total work divided by the work within the longest serial path (span).
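A rough worked estimate of that definition, using standard work/span analysis rather than anything stated in the slides. It assumes evenly balanced partitions and ignores constant factors. The symbols T_1 (work) and T_∞ (span) are introduced here for illustration.

```latex
\begin{aligned}
T_1(n)      &= 2\,T_1(n/2) + \Theta(n) = \Theta(n \log n)
            && \text{(work: every level partitions all } n \text{ elements)} \\
T_\infty(n) &= T_\infty(n/2) + \Theta(n) = \Theta(n)
            && \text{(span: the top-level partition alone is serial over } n\text{)} \\
\text{parallelism} &= \frac{T_1(n)}{T_\infty(n)} = \Theta(\log n)
\end{aligned}
```

For n = 100,000,000, log2 n is about 26.6, so under these assumptions the parallelism is only a small constant multiple of a few tens at best. This is consistent with the speedup flattening out at a handful of processors.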

Work-stealing Overheads

• Spawning is cheap (3-5 times the cost of a function call).
  – Spawn early and often.
  – Optimal performance requires that parallelism be about an order of magnitude greater than the actual number of cores.
• Stealing is much more expensive (requires locks and memory barriers).
• Most spawns do not result in steals.
• The more balanced the work load, the less stealing there is and hence the less overhead.

cilk_for loop

• Looks like a normal for loop.
    cilk_for (int x = 0; x < 1000000; ++x) { … }
• Any or all iterations may execute in parallel with one another.
• All iterations complete before the program continues.
• Constraints:
  – Limited to a single control variable.
  – Must be able to jump to the start of any iteration at random.
  – Iterations should be independent of one another.

Implementation of cilk_for

    cilk_for (int i = 0; i < 8; ++i) f(i);

                          0 - 7
              spawn /               \ continuation
              0 - 3                   4 - 7
      spawn /       \ cont.   spawn /       \ cont.
      0 - 1         2 - 3     4 - 5         6 - 7
      0   1         2   3     4   5         6   7
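The tree above can be sketched as a recursive function that spawns the left half of the range and keeps the right half as the continuation. This is an illustration only, not the actual runtime implementation. `split_range`, its `grain` parameter, and the `Body` callback are hypothetical names. The prelude assumes that Cilk-enabled compilers define `__cilk` and otherwise falls back to serial execution.

```cpp
#include <cassert>

#if defined(__cilk)            // assumption: defined by Cilk-enabled compilers
#include <cilk/cilk.h>
#else                          // fallback: erase the keywords (serial execution)
#define cilk_spawn
#define cilk_sync
#endif

// Hypothetical helper: calls body(i) for every i in [lo, hi) by recursive halving.
template <typename Body>
void split_range(int lo, int hi, int grain, const Body& body)
{
    assert(grain >= 1);
    if (hi - lo <= grain) {
        for (int i = lo; i < hi; ++i) body(i);     // leaf: run iterations serially
        return;
    }
    int mid = lo + (hi - lo) / 2;
    cilk_spawn split_range(lo, mid, grain, body);  // left half: spawned child
    split_range(mid, hi, grain, body);             // right half: continuation
    cilk_sync;                                     // both halves are complete
}
```

With `lo = 0`, `hi = 8`, and `grain = 1`, the calls split 0-7 into 0-3 and 4-7, then into pairs, then into single iterations, matching the diagram.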

cilk_for vs. serial for with spawn

• Compare the following loops:

for (int x = 0; x < n; ++x) { cilk_spawn f(x); }

cilk_for (int x = 0; x < n; ++x) { f(x); }

• The above two loops have similar semantics, but they have very different performance characteristics.

Serial for with spawn: unbalanced

[Diagram: Worker A and Worker B execute the range 0 - 7. Each spawn peels off a single iteration (0, 1, 2, …, 7) and leaves the continuation (1 - 7, 2 - 7, …, 7 - 7) to be stolen, with a "steal!" at nearly every step.]

If the work per iteration is small, then steal overhead can be significant.

cilk_for: Divide and Conquer

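[Diagram: Worker A and Worker B execute the range 0 - 7. The first spawn splits it into 0 - 3 and 4 - 7, and a single "steal!" hands one half to the other worker. Each half is then split into 0 - 1, 2 - 3, 4 - 5, 6 - 7 and finally into the single iterations 0 through 7 before returning.]

Divide and conquer results in few steals and less overhead.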

cilk_for examples

• cilk_for (int x = 0; x < 1000000; x += 2) { … }

• cilk_for (vector<int>::iterator x = y.begin(); x != y.end(); ++x) { … }

• Not allowed: cilk_for (list<int>::iterator x = y.begin(); x != y.end(); ++x) { … }
  – Loop count cannot be computed in constant time for a list. (y.end() - y.begin() is not defined.)
  – Do not have random access to the elements of the list. (y.begin() + n is not defined.)

Grain Size

• If the work per iteration of a cilk_for is sufficiently small, even the spawn overhead can become noticeable.

• To reduce the overhead, cilk_for chunks the loop into “grains.”

• The default grain size will give good performance in most cases.
  – Default grain size heuristic: N/(8p), where N is the number of loop iterations and p is the number of workers.
  – This heuristic produces sufficient parallel slackness for the work-stealing scheduler on loops that are not radically unbalanced.
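The heuristic is simple enough to evaluate by hand. The sketch below computes N/(8p) for illustration. `default_grain` is a hypothetical helper, not a runtime API. Rounding up and clamping the result to at least 1 are assumptions made here, and the real runtime may apply further limits.

```cpp
#include <cassert>

// Hypothetical helper: evaluates the N/(8p) heuristic from the slide.
// Rounding up and the minimum of 1 are assumptions of this sketch.
long default_grain(long n, int workers)
{
    assert(n >= 0 && workers >= 1);
    const long chunks = 8L * workers;          // target: about 8 grains per worker
    const long g = (n + chunks - 1) / chunks;  // ceil(n / (8 * workers))
    return g < 1 ? 1 : g;
}
```

For example, 1,000,000 iterations on 4 workers gives a grain of 31,250 iterations, that is, 32 grains. Having roughly eight grains per worker is the slack that lets an idle worker find something to steal.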

#pragma cilk grainsize

• The programmer may choose a grain size explicitly:
    #pragma cilk grainsize = expression
    cilk_for (…)
• The pragma is most useful for setting the grain size to 1 for large, unbalanced loops.
• If grainsize is set too small for short loops, spawn overhead reduces performance.
• If grainsize is set too large, parallelism will be lost.

Serialization

• Every Cilk program has an equivalent serial program called the serialization.
• The serialization is obtained by removing the cilk_spawn and cilk_sync keywords and replacing cilk_for with for.
  – The compiler will produce the serialization for you if you compile with /Qcilk-serialize (Windows) or -cilk-serialize (Linux).
• Running with only one worker is equivalent to running the serialization.
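The same transformation can be written by hand with three macros, which makes the idea concrete. This is an illustrative stub, not the compiler's mechanism. `sum_of_squares` is a hypothetical function written with Cilk keywords, and the macros turn it into its serialization on any C++ compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hand-written serialization: erase spawn/sync and turn cilk_for into for.
#define cilk_spawn
#define cilk_sync
#define cilk_for for

// Hypothetical example function written with a Cilk keyword.
long sum_of_squares(const int* data, std::size_t n)
{
    assert(data != nullptr || n == 0);
    std::vector<long> sq(n);
    cilk_for (std::size_t i = 0; i < n; ++i) {
        sq[i] = static_cast<long>(data[i]) * data[i];  // iterations are independent
    }
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) total += sq[i];  // serial accumulation
    return total;
}
```

Because every iteration writes a distinct element, the parallel program would be deterministic and must give the same answer as this serial version.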

Serial Semantics

• A deterministic Cilk program will have the same semantics as its serialization.
  – Easier regression testing
  – Easier to debug:
    – Run with one core
    – Run serialized
  – Composable
  – Strong analysis tools (Cilk-specific versions will be posted on WhatIf)
    – race detector
    – parallelism analyzer

Implicit syncs

    void f()
    {
        cilk_spawn g();
        cilk_for (int x = 0; x < lots; ++x) {
            ...
        }                 // At end of a cilk_for body (does not sync g())
        try {             // Before entering a try block containing a sync
            cilk_spawn h();
        } catch (...) {   // At end of a try block containing a spawn
            ...
        }
    }                     // At end of a spawning function

Reducer Library

• A variable that can be safely used by multiple strands running in parallel.

• Cilk’s hyperobject library contains many commonly used reducers:
  • reducer_list_append, reducer_list_prepend
  • reducer_max, reducer_max_index
  • reducer_min, reducer_min_index
  • reducer_opadd, reducer_ostream, reducer_basic_string
  • …

• You can also write your own using cilk::monoid_base and cilk::reducer.

    int main(int argc, char* argv[])
    {
        unsigned int n = 1000000;
        cilk::reducer_opadd<unsigned int> total;
        cilk_for (unsigned int i = 1; i <= n; ++i) {
            total += compute(i);
        }
    }

Reducer Views

1. At a spawn, the child receives the parent’s view.

2. After a spawn, the continuation receives either the view from before the spawn or a new view initialized to the identity, nondeterministically, depending on whether the continuation was stolen.

3. At (or before) a sync, the views of the child and parent are reduced (merged).

Reducer Views when continuation is stolen

    cilk::reducer_opadd<int> sum(3);   // initial view: 3

    void f()
    {
        cilk_spawn g();
        work            // steal! the continuation gets a new identity view: 0
        sum += 2;       // stolen continuation's view: 2
        work
        cilk_sync;      // reduce: 4 + 2 = 6
        work
    }

    void g()
    {
        work
        sum++;          // child uses the initial view: 4
        work
    }

Worker A: g(). Worker B: the stolen continuation of f(). Worker ?: the code after cilk_sync.
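The view rules can be replayed serially to check the arithmetic of this scenario. The sketch below is a plain C++ simulation, not the Cilk reducer library. `AddView` and `replay` are hypothetical names, and the `stolen` flag stands in for the scheduler's nondeterministic choice.

```cpp
#include <cassert>

// Hypothetical stand-in for one view of an addition reducer.
struct AddView {
    int value;
    static AddView identity() { return AddView{0}; }             // identity of +
    void reduce(const AddView& right) { value += right.value; }  // merge two views
};

// Replays the scenario: sum starts at 3, g() does sum++, f() does sum += 2.
int replay(bool stolen)
{
    AddView parent{3};                        // initial view: sum(3)
    AddView fresh = AddView::identity();      // used only if the continuation is stolen
    AddView& child = parent;                  // rule 1: child gets the parent's view
    AddView& cont = stolen ? fresh : parent;  // rule 2: new identity view if stolen
    child.value += 1;                         // g(): sum++
    cont.value += 2;                          // continuation of f(): sum += 2
    if (stolen) parent.reduce(fresh);         // rule 3: views are merged at the sync
    assert(parent.value == 6);                // same total on both paths
    return parent.value;
}
```

Both `replay(true)` and `replay(false)` return 6: the stolen path computes 4 and 2 in separate views and merges them, while the unstolen path updates a single view from 3 to 4 to 6.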

Summary

• Cilk is a simple extension to C and C++ for shared-memory fork-join parallelism.
• Cilk’s serial semantics and simple syntax do not obscure the program logic.
• Cilk’s work-stealing scheduler automatically load-balances if there is sufficient parallelism.
• Cilk is suitable for both loop and recursive (divide and conquer) parallelism.
• Cilk is an excellent choice for parallelizing both new and legacy C and C++ software.

Intel Parallel Composer XE Beta program registration
• http://intelsoftwareproductsurvey.com/survey/149826/100c

Download
• https://registrationcenter.intel.com

Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2009. Intel Corporation.

http://intel.com/software/products
